PostgreSQL Surrogate Tool

Introduction

Spatial surrogates are most commonly used to map state- or county- level emissions data onto the rectangular grid cells used by an Eulerian air quality model such as the CMAQ modeling system. Surrogate Tools are used to create and process spatial surrogates used to allocate county-level emissions to grid cells. The PostgreSQL Surrogate Tool program was designed to take advantage of relational database functionality in computing spatial surrogates.

A spatial surrogate is a value greater than zero and less than or equal to one that specifies the fraction of the emissions in an area (usually a county) that should be allocated to a particular model grid cell. Many types of emission inventories contain data summed by county. Because the area of a given county may fall into several model grid cells, spatial surrogates must be used to indicate the fraction of the emissions for that county that is to be assigned to each grid cell that overlaps the county. Spatial surrogates typically must be computed for each grid to be modeled, unless the grid used is a fully aligned subset of a grid for which spatial surrogates are already available.

When surrogates are created, usually there is some type of geographic attribute that is used to "weight" the county emissions into grid cells in a manner that is more specific than a simple, uniform spread over the county, (e.g., county area) that would not reflect the differences in emissions between cities, rural areas, ports, etc. For example, weights could be information about points that represent ports (e.g., berths); the vehicle-miles traveled on lines representing the locations of roads or railroads; or polygon-based weight attributes such as population, number of households, and land use categories. You can create spatial surrogates from point-, line-, or polygon-based shapefiles that contain weight attribute information. A single surrogate (srg) value for county C and grid cell GC is expressed as:

where srg = surrogate, C = county, and GC = grid cell. Note that a surrogate does not have to be for a county; it could be for some other geographic region such as a state, province, nation, or census tract. These polygons are known in the program as the "data polygons."

Surrogate Components

Spatial surrogates are built from three components:

Output Polygon - polygon coverage to which the weight attributes are allocated. For air quality modeling, the output polygon is a regular grid (e.g. 36x36-km continental U.S. modeling domain). Surrogates may also be computed for irregularly shaped polygons like census tracts. The surrogate fraction in each output polygon is the fraction of the total weight attribute coverage contained in the data polygon that intersects the output polygon.
Data Shapefile - polygon coverage that defines the emissions inventory spatial coverage. As surrogates are data that map an inventory to a modeling grid, the data shapefile defines the spatial coverage of the emissions inventory. As the U.S. National Emissions Inventory (NEI) is a county total inventory, the data shapefile used to create surrogates to map the NEI to a modeling grid contains county boundaries. Spatial surrogates could also be developed to map country (or state) total inventories by using a country (or state) boundary data shapefile.
Weight Shapefile - shapefile coverage with attributes used to distribute (or weight) the emissions data to the output polygons. Weight shapefiles could be polygon, line, or point data.

Types of Surrogates

Three types of surrogates can be created with this program: polygon-based, line-based, and point-based. Polygon-based surrogates use attribute information that is based on area (e.g., population in a census tract). The surrogate value is calculated as the ratio of the attribute value in a grid cell to the total value of the attribute in the overlapping "data polygon" (e.g., county, state, census tract). Examples of polygon-based weight attributes are area, population, number of households, and land use. The numerator (i.e., the value of the weight attribute in the intersection of the county C and grid cell GC) and the denominator (i.e., the total attribute value in the county or other data polygon) are calculated according to the following equations:

where srg=surrogate, C = county (or other data polygon), GC = grid cell, WP = weight polygon, int. = intersection of, i = weight polygon, and n = the number of weight polygons.

For line-based surrogates, the length of the linear weight feature (e.g., railroad, river, road) replaces area in the above equations. For point-based surrogates, instead of using area, the software will allocate a value of 1 if the weight point falls within the region of interest or a value of 0 (zero) if it does not. In some cases, no special weight attribute is desired. Instead, the surrogate is based purely on the area of the polygon, the length of the polyline, or the count of the points. This mode is supported by the Spatial Allocator. In this case, the above equations simplify to the following:

Spatial surrogates are developed using Geographic Information System (GIS) software. GIS software has the functionality required to intersect the three geospatial components of the surrogate and calculate the fraction of each weight in an output polygon. The PostgreSQL package PostGIS was used to support surrogate computations in a PostgreSQL database. This page describes the concept behind the PostgreSQL (PG) surrogate tools, installation instructions, and guidance for using the PG scripts to compute spatial surrogates.

System Design

Database for Generating Spatial Surrogates

This database is produced in PostgreSQL with PostGIS support for geometric and geographic objects. Among the key features of a PostgreSQL-PostGIS database system are the ability to store relational data with associated attributes, spatial locations, and physical characteristics. It also provides functions and data support for calculating lengths of lines, distance between points, and areas of polygons; for projecting shapes onto many flat projections of Earth and reprojecting shapes from one mathematical projection to another; and for performing spatial operations such as intersect, within, and touch. These functions are used extensively in creating spatial surrogates for emissions.

Hierarchical Structure of a Database

Every PostgreSQL database has the following general organizational structure within the PostgreSQL server:

Database
- Schema: public, other schemas
  - Table: data loaded, data generated, basic lookup data
    - Column: other databases often call these fields; each contains one datum with an associated name, type, and other characteristics
    - Index: means of ordering the data in the table by one or more columns sorted in succession; may be primary key (unique and default index), unique key (guarantee the same grouping of values cannot occur more than once), foreign key (value in named column in this table must exist uniquely in a base table)

This structure ignores other aspects of the database and what it supports, such as event triggers, domains, configurations, functions, sequences, rules, and views. Some of these are populated when a database is created, and more are populated when the PostGIS support is added.

The public schema contains tables, functions, etc. that are available regardless of the schema in which someone is working. If you look in the public schema of a PostgreSQL database that has PostGIS, you will see over 1000 functions that are part of PostGIS. There is also a spatial_ref_sys table placed in public. As part of the PostgreSQL Surrogate Tool we provide additional entries in the spatial_ref_sys table and reference those entries throughout the tool's scripts. Therefore, these additional entries must be in that table and must not be changed unless you are provided with an update.

The other table provided in the public schema is named grid_defs. Each record describes the definition of a modeling grid. The grid definitions are referenced in the tool's scripts and some of the values in this file are written to the header of each surrogate file that is generated.

There is also a user and permission structure included in PostgreSQL. It starts with its built-in user postgres. The database administrator logs in as postgres, creates a user for personal use and defines permissions, and then proceeds to set up other users and defines the permissions for each person (can login, create a database, become a superuser, etc.). Groups can also be created and have users assigned to them, then have permissions for tables etc. granted to only certain users or groups. The instructions for setting up the database for the PostgreSQL Surrogate Tool do not specify how to grant permissions, giving the initial user full access to the database. Users will need to decide how to configure the database if they prefer to restrict access to users and/or groups.

Database or Schema?

Throughout this document you will notice that the name of a schema is often included with a table. But, the name of the database is not specified. That is because PostgreSQL allows you to work with 1 database at a time. You can read from a table in one schema and write to a table in another schema, or join tables in different schemas. However, you cannot do those operations with tables in different databases. This concept is important to the overall design of the PG Surrogate Tool.

This tool is designed to re-use as many tables as practical. For example, an air quality modeling study may involve several layers of nested grids that are all in the same projection. This means that each shapefile required for the surrogate calculations (i.e., weight data) must be loaded, cleaned, and projected once into a weight data table. Assuming that the surrogates are based on counties, the county boundary file only needs to be loaded, cleaned, and projected once into a county boundary table in the database. Each modeling grid must be loaded and projected once into a grid table.

Database Tables

Before any calculations are done with the PG surrogate tool several tables must be created.

Table Name	Description
spatial_ref_sys	PostGIS system table with map projection information
{data shapefile}	Administrative boundary (e.g., counties) polygons, name this table to be similar to the name of the input shapefile (e.g., cb_2014_us_counties)
{weight shapefile}	Surrogate weight polygons, lines, or points; name this table to be similar to the name of the input shapefile (e.g. acs_2014_5yr_pophousing)
{modeling grid}	Definition of a modeling grid (e.g., conus12km_444x336), including row and column numbers
cty_{id}	Weight data cut by the county boundaries (e.g., cty_nlcd11)
cty_{id}_{grid}	Weight data but by counties and modeling grid (e.g., wp_cty_conus12km_444x336)
numer_#	Surrogate numerator values as weight values in each row-col-data shape intersection; from the wp_cty_{id} table
denom_#	Surrogate denominator value as the sum of all weight values in a data polygon; from the cty_{id} table
surg_#	Surrogate fractions as the ratio of the numerator and denominator values for a row-col-data shape intersection.

General Surrogate Computation Steps

The following steps are the general workflow for computing surrogates using PostgreSQL:

One time operations

1) Load shapefiles into the database

2) Load projection information into the spatial reference system table

3) Load output modeling grid definitions into the database

Surrogate calculations

1) Overlay the weight data tables with the county boundaries table. The resulting new table formed by each of these overlays is saved as a cty_{id} table. If the cty_{id} data table has multiple fields (e.g., attributes population and households in a census data table), a single cty_{id} table may be used multiple times to generate different surrogates.

2) Overlay the model grid table on the cty_{id} table to create a wp_cty_{id} table. Processing the wp_cty_{id} table with the grid table prepares the data to calculate the surrogate fractions and export model-ready surrogates from the database. It is possible to overlay each modeling grid separately on the same cty_{id} table, avoiding the need to reprocess the cty_{id} overlay in the subsequent step.

3) Calculate numerator, denominator, and surrogate fractions

4) Export the surrogate fractions to a SMOKE-ready surrogate file

While these are the general steps for processing polygon and line-based surrogates, some weight shapefile datasets have very specialized scripts. The HPMS roadway and NLCD land cover data are complex datasets that use specialized scripts. In addition, surrogates that are computed by adding several attributes together require specialized scripts.

Defining a Modeling Grid

Modeling grid definitions are each stored in separate database tables. The name of the data table is a short description of the coverage that it contains. For example, the table containing the CONUS 12x12-km grid cells is named conus12km_444x336. The table contains 3 fields: colnum (integer column number starting at 1), rownum (integer row number starting at 1), and gridcell (geometry containing a polygon on a geometric projection).

The bash script generate_modeling_grid.sh creates the table, defines a primary index (colnum, rownum) and a geometric index (gridcell using gist), and then populates the table based on its definition in the script. The key parameters for configuring this script are shown below.

dbname='surrogates'		# name of database
schemaname='conus12km_082016'	# name of schema in database
tblname='conus12km_444x336'	# name of table in schema

proj=900921			# value of srid in spatial_ref_sys data table
xorig=-2736000			# x-coordinate of lower-left corner (LLC) of grid 
yorig=-2088000			# y-coordinate of LLC of grid
xcellsize=12000			# cell size in x-direction
ycellsize=12000			# cell size in y-direction
cols=444			# number of columns (x)
rows=336			# number of rows (y)

Loading Shapefiles

Shapefiles are loaded and stored in separate database tables. The bash script load_shapefile_multi_noprecision.sh creates the table, defines a primary index on the ogc_fid field, and defines a geometric index on the wkb_geometry field. The script then populates the table from the shapefile. The script uses the GDAL utility OGR2OGR to load shapefiles into PostgreSQL. The script accepts 5 command-line parameters in the following order:

name of the shapefile (Do not include the .shp) name of the database name of the schema within the database name of the table to create and load the data a string to pass to the script (optional)

For example:

load_shapefile_multi_noprecision.sh 2016_pil_16Aug surrogates conus12km_082016 pil2016_16aug ""

In this example the string is empty because all fields in the data table are loaded into the new table.

Notes: Basic loading shapefile to PostgreSQL database, some records contain MULTI geometries: load_shapefile.sh Problems with numeric precision selected by ogr2ogr, some records contain MULTI geometries: load_shapefile_Multi_noPrecision.sh Problems with numeric precision selected by ogr2ogr, no MULTI geometries: load_shapefile_noPrecision.sh

Database maintenance

Vacuum and Analyze

After loading any table containing a geometry field, you need to vacuum and analyze the new table. The VACUUM command reclaims storage occupied by records that are no long being used in the data table. If you do not vacuum the table, then any space that formerly contained data for that table remains unusable; vacuum releases such space and makes it available for PostgreSQL to re-use. The analyze command updates the statistics used by the query planner to determine the most efficient way to execute a query. If you want to release the unused storage to the operating system, use the VACUUM FULL command with or without the ANALYZE command (i.e., VACUUM (FULL ANALYZE) schema_name.table_name).

The bash script vacanalyze.sh is passed values for the dbname, schemaname, and tblname as shown above, to vacuum the selected table.

Validating Geometries

A geometry can be badly formed within a shapefile. Any such geometry must be repaired prior to using it to create spatial surrogates; PostGIS may calculate an incorrect value for the area of a MultiPolygon or for the length of a MultiLineString - or even fail to generate a value - if the geometry is badly formed.

The following query examines the contents of the column geom_columname for each record in the table mytable and returns the number of records with an invalid geometry.

select count(*) from myschema.mytable where not ST_IsValid(geom_columname);

If the number returned is very small, you can replace the count(*) with one or more of the fields in myTable that will allow you to locate, examine, and repair those records with invalid geometries. If you just want to perform the repair on all records with invalid geometries in the geom_columname field, use the following query.

update myschema.mytable set geom_columname = ST_MakeValid(geom_columname);

Note that the ST_MakeValid() function does not alter any record where the geometry in the geom_columname field is valid.

Clustering a Table

If you have a table that you use frequently, your queries may run faster if you cluster the table using the CLUSTER command. When clustering a table, PostgreSQL physically reorders the records in the table according to the index that you name. Either a spatial index (i.e., a GIST index on a geometry field) or a nonspatial index may be used. Because the table is not automatically reordered if you add or delete records in the table, you may need to issue the cluster command again if you add, update, or delete a relatively large number of records.

Performing a cluster is a 2-step operation. First, you need to create the index on which you will reorder the records. If you already have generated the index, you do not need to perform this step.

Then, issue the cluster command:

cluster schema.table_name using index_name

A table may be clustered according to one index only. If you issue the cluster command for a different index on the same table, your records will be reordered according to this specified index and will no longer be ordered according to the prior index used for clustering. But, if you add, change, or update a relatively large number of records in that table, you can cluster it again either with or without specifying the index name (i.e., cluster schema.table_name) because PostgreSQL keeps track of the index used when clustering a table.

Reprojecting Shapefiles

When you first load a shapefile into a table in the database, the data are stored in the same projection as the original shapefile; this projection is specified in the .prj file. If that original projection does not match the projection of the grid that you are using to create your spatial surrogates, you must reproject that shapefile to match the grid. If you have grids with different projections that use the same shapefile, you can have a geometry column in that shapefile's table for each of the grids. Use the transform_projections.sh bash script to modify a shapefile table to include a new geometry field and then to populate and index that field with the transformed geometry.

NOTE: The WRF and CMAQ models use a perfect sphere for the Earth instead of an oblate spheroid. This causes problems for reprojecting data because GIS libraries typically do not perform projections onto a perfect sphere. The definitions of two special projections are included in the PostgreSQL Surrogate Tool: 98523 and 900921, which is the WRF sphere. If you need to project a shapefile to the WRF sphere, first project the file to 98523 and then project that file to 900921. If you keep both of the geometry columns (i.e., geom_98523 and geom_900921) in your data table, be certain to refer to the geom_900921 geometry when generating your surrogates.

The transform_projections.sh script accepts 7 command-line parameters in the following order:

name of the database
name of the schema within the database
name of the table with the geometry to transform
code for the grid's projection
type of geometry (MultiPolygon or MultiLineString)
name of the original geometry field (transform from)
name of the new geometry field (create and transform to)

The following command is actually a single line; it is broken here for formatting.

transform_projections.sh surrogates conus12km_082016 pophu_bg2010 
900921 MultiPolygon wkb_geometry geom_900921

Installation

The following components are needed to create surrogates using PostgreSQL. Versions used by UNC-IE in developing this tool are shown in parentheses.

postgresql (9.5.3)
gdal (2.1.3)
geos (3.5.1)
json-c-json-c (0.12.1-20160607)
libxml (2-2.9.4)
proj (4.9.3)
postgis (2.3.2)

Install these components in the following order, with detailed instructions from a Linux command line for a non-root, local installation. Change the prefix path for your machine.

PostgreSQL Install

Compiler: GCC 6.1.0
wget https://ftp.postgresql.org/pub/source/v9.5.3/postgresql-9.5.3.tar.gz
tar xvzf postgresql-9.5.3.tar.gz
cd postgresql-9.5.3
./configure --prefix=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/postgresql-9.5.3
make | & tee build.log
make check | & tee check.log
make install | & tee install.log

PostGIS Dependencies

Proj4 Install

Compiler: GCC 6.1.0
wget http://download.osgeo.org/proj/proj-4.9.3.tar.gz
tar xvzf proj-4.9.3.tar.gz
cd proj-4.9.3/
mkdir local
./configure --prefix=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/libs/proj-4.9.3/local
make | & tee build.log
make check | & tee check.log
make install | & tee install.log

GEOS Install

Compiler: GCC 6.1.0
wget http://download.osgeo.org/geos/geos-3.5.1.tar.bz2
tar xvjf geos-3.5.1.tar.bz2
cd geos-3.5.1
mkdir local
./configure --prefix=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/libs/geos-3.5.1/local
make | & tee build.log
make check | & tee check.log
make install | & tee install.log

LibXML2 Install

Compiler: GCC 4.8.1 (couldn't get this to compile with GCC 6.1.0)
wget http://download.osgeo.org/geos/geos-3.5.1.tar.bz2
wget ftp://xmlsoft.org/libxml2/libxml2-2.9.4.tar.gz
tar xvzf libxml2-2.9.4.tar.gz
cd libxml2-2.9.4
mkdir local
./configure --prefix=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/libs/libxml2-2.9.4/local
make | & tee build.log
make check | & tee check.log
make install | & tee install.log

JSON-C Install

Compiler: GCC 6.1.0
wget https://github.com/json-c/json-c/archive/json-c-0.12.1-20160607.tar.gz
tar xvzf json-c-0.12.1-20160607.gz
cd json-c-json-c-0.12.1-20160607/
mkdir local
./configure --prefix=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/libs/json-c-json-c-0.12.1-20160607/local
make | & tee build.log
make check | & tee check.log
make install | & tee install.log

GDAL Install

Compiler: GCC 6.1.0
wget http://download.osgeo.org/gdal/2.1.3/gdal-2.1.3.tar.gz
tar xvzf gdal-2.1.3.tar.gz
cd gdal-2.1.3
mkdir local
./configure --prefix=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/libs/gdal-2.1.3/local --without-sqlite3
make | & tee build.log
make install | & tee install.log

PostGIS Install

Compiler: GCC 6.1.0
wget http://download.osgeo.org/postgis/source/postgis-2.3.2.tar.gz
tar xvzf postgis-2.3.2.tar.gz
cd postgis-2.3.2
./configure --prefix=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/postgis-2.3.2/local --with-pgconfig=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/postgresql-9.5.3/bin/pg_config --with-gdalconfig=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/libs/gdal-2.1.3/local/bin/gdal-config --with-geosconfig=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/libs/geos-3.5.1/local/bin/geos-config --with-xml2config=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/libs/libxml2-2.9.4/local/bin/xml2-config --with-projdir=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/libs/proj-4.9.3/local --with-jsondir=/proj/ie/apps/SA/Spatial-Allocator/pg_srgcreate/libs/json-c-json-c-0.12.1-20160607/local
make | & tee build.log
make check | & tee check.log
make install | & tee install.log

Configure/Start the PostgreSQL database

PGBIN=location of the postgres install

Initialize the PostgreSQL installation

This only needs to be done once

$PGBIN/pg_ctl init

Create a new database

DB = NEI204 (for example)

$PGBIN/createdb ${DB}  # create a database named NEI2014
$PGBIN/psql -d ${DB} -c "CREATE EXTENSION postgis;"  # load the PostGIS extension into the database

Start the database

Data = name of a data directory (for example)

When the database is started, the subsequent database queries will read/write from the specified data directory. While multiple queries can be made against a single data directory, it's important that the database using this directory is not restarted or stopped in the middle of a query.

$PGBIN/pg_ctl -D $Data -l ${DB}_surrogate_db.log start