SECTION 2
SPATIAL STATISTICS

 

Section 2 - Spatial statistics: simple measures for exploring geographic information; the value of the spatial perspective on data; intuition and where it fails; applications in crime analysis, emergencies, and incidence of disease:

  • Measures of spatial form - centrality, dispersion, shape.
  • Spatial interpolation - intelligent spatial guesswork - spatial outliers.
  • Exploratory spatial analysis - moving windows, linking spatial and other perspectives.
  • Hypothesis tests - randomness, the null hypothesis, and how intuition can be misleading.
Measures of spatial form:
 

How to sum up a geographical distribution in a simple measure?
 

Two concepts of space are relevant:
 

Continuous:

  • travel can occur anywhere
    • best for small scales, or where a network is too complex or too costly to capture or represent
  • an infinite number of locations exist
    • a means must exist to calculate distances between any pair of locations, e.g. using straight lines
Discrete:
  • travel can occur only on a network
  • only certain locations (on the network) are feasible
    • all distances (between all possible pairs of locations) can be evaluated using any measure (travel time, cost of transportation etc.)
In discrete space places are identified as objects; in continuous space, places are identified by coordinates

A metric is a means of measuring distance between pairs of places (in continuous space)

  • e.g. straight lines (the Pythagorean metric)
  • e.g. by moves in N-S and E-W directions (the Manhattan or city-block metric)
  • simple metrics can be improved using barriers or routes of lower travel cost (freeways)
The most useful single measure of a geographical distribution of objects is its center

Definitions of center:

The centroid

  • the average position
  • computed by taking a weighted average of coordinates
  • the point about which the distribution would balance
  • the basis for the US Center of Population (now in MO and still moving west)
The centroid is not the point for which half of the distribution is to the left, half to the right, half above and half below
  • this is the bivariate median
The centroid is not the point that minimizes aggregate distance (the point such that, if the objects were people and they all traveled to it, the total distance traveled would be a minimum)
  • this is the point of minimum aggregate travel (MAT), sometimes called the median (very confusingly)
  • for many years the US Bureau of the Census calculated the Center of Population as the centroid, but gave the MAT definition
  • there is a long history of confusion over the MAT
  • no closed-form means exists for calculating its location
    • the MAT must be found by an iterative process (see the sketch below)
    • an interesting way of finding the MAT makes use of a valid physical analogy to the resolution of forces - the Varignon frame
  • on a network, the MAT is always at a node (junction or point where there is weight)
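A minimal sketch (Python, with hypothetical coordinates and weights) of the two measures just contrasted: the centroid as a weighted average of coordinates, and the MAT found by an iterative search (Weiszfeld's algorithm), one standard way of implementing the iterative process mentioned above.

    # Centroid vs. point of minimum aggregate travel (MAT).
    # Coordinates and weights are hypothetical; distances are straight-line
    # (Pythagorean metric), i.e. the continuous-space case.
    import numpy as np

    pts = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0], [10.0, 10.0]])  # object locations
    w = np.array([5.0, 3.0, 2.0, 1.0])                                  # object weights

    # Centroid: weighted average of coordinates.
    centroid = (w[:, None] * pts).sum(axis=0) / w.sum()

    # MAT: no closed form, so iterate (Weiszfeld's algorithm), starting at the centroid.
    mat = centroid.copy()
    for _ in range(1000):
        d = np.linalg.norm(pts - mat, axis=1)
        d = np.where(d < 1e-12, 1e-12, d)      # guard against a zero distance
        new = (w / d) @ pts / (w / d).sum()
        if np.linalg.norm(new - mat) < 1e-9:
            break
        mat = new

    print("centroid:", centroid)
    print("MAT     :", mat)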
 The definition of centrality becomes more difficult on the sphere
  • e.g. the centroid is below the surface
    • the centroid of the Canadian population in 1981 was about 90 km below White River, Ontario
    • the bivariate median (defined by latitude and longitude) was at the intersection of the meridian passing through Toronto and the parallel through Montreal, near Burke's Falls, Ontario
    • the MAT point (assuming travel on the surface by great circle paths) was in a school yard in Richmond Hill, Ontario
What use are centers?
  • for tracking change in geographic distributions, e.g. the march of the US Center of Population westward is still worth national news coverage
  • for identifying most efficient locations for activities
    • location at the MAT minimizes travel
    • a central facility should be located to minimize travel to the geographic distribution that it serves
    • should we use continuous or discrete space?
    • this technique was considered so important to central planning in the Soviet Union in the early 20th century that an entire laboratory was founded
      • the Mendeleev Centrographic Laboratory flourished in Leningrad around 1925
  • centers are often used as simplifications of complex objects
    • at the lower levels of census geography in many countries
      • e.g. ED in US, EA in Canada, ED in UK
    • to avoid the expense of digitizing boundaries
      • e.g. land parcel databases
    • or where boundaries are unknown or undefined
      • e.g. ZIPs
  • in the census examples, common practice is to eyeball a centroid
  • some very effective algorithms have been developed for redistributing population from centroids
Measures of dispersion:
  • what you would want to know if you could have two measures of a geographical distribution
  • the spread of the distribution around its center
  • average distance from the center (see the sketch below)
  • measures of dispersion are used to indicate positional accuracy
    • the error ellipse
    • the Tissot indicatrix
    • the CMAS (circular map accuracy standard)
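A minimal sketch of dispersion about the center, using hypothetical points: the mean distance from the centroid, and the root-mean-square ("standard") distance.

    # Dispersion of a point distribution about its centroid (hypothetical points).
    import numpy as np

    pts = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0], [2.0, 2.0]])
    centroid = pts.mean(axis=0)
    d = np.linalg.norm(pts - centroid, axis=1)

    mean_distance = d.mean()                       # average distance from the center
    standard_distance = np.sqrt((d ** 2).mean())   # root-mean-square distance

    print(mean_distance, standard_distance)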
Potential measures:
  • a measure which increases with the weight of geographic objects and with proximity to them
  • calculated as:
    V = Σi ( w(i) / d(i) )
where i indexes the geographic objects, d(i) is the distance from the evaluation location to object i, and w(i) is the object's weight

the summation can be carried out at any location

V can be mapped - a "potential" map
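A minimal sketch of the potential calculation, evaluated over a grid so that V can be mapped; the object locations, weights, and grid are hypothetical.

    # Potential V = sum over objects i of w(i)/d(i), evaluated on a grid.
    import numpy as np

    objs = np.array([[2.0, 3.0], [7.0, 8.0], [5.0, 1.0]])   # discrete objects (hypothetical)
    w = np.array([100.0, 50.0, 75.0])                        # their weights

    xs = np.linspace(0, 10, 101)
    ys = np.linspace(0, 10, 101)
    gx, gy = np.meshgrid(xs, ys)

    V = np.zeros_like(gx)
    for (ox, oy), wt in zip(objs, w):
        d = np.hypot(gx - ox, gy - oy)
        d = np.maximum(d, 0.1)       # clamp distance to avoid division by zero at the objects
        V += wt / d

    # V can now be contoured or shaded as a "potential" map; its maximum is the
    # most accessible location with respect to this set of objects.
    print(V.max(), np.unravel_index(V.argmax(), V.shape))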

Potential is a useful measure of:
  • the market share obtainable by locating at a point
  • the best location is the place of maximum potential
  • population pressure on a recreation facility
  • accessibility to a geographic distribution
    • e.g. a network of facilities
  • potential measures omit the "alternatives" factor
    • imply that market share can potentially increase without limit
  • potential measures have been used as predictors of growth
    • economic growth most likely in areas of highest potential
  • potential calculation exists as a function in SPANS GIS
  • the objects used to calculate potential must be discrete objects in otherwise empty space
    • adding new objects will increase potential without limit
    • it makes no sense to calculate potential for a set of points sampled from a field
    • potential makes sense only in the object view
Potential measures and density estimation
Think of a scatter of points representing people: how can the density of people be mapped?

replace each dot by a pile of sand, superimposing the piles

the amount of sand at any point represents the number and proximity of people

the shape of the pile of sand is called the kernel function
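A minimal sketch of the "pile of sand" idea using a Gaussian kernel; the point locations and the kernel bandwidth are hypothetical.

    # Kernel density estimation: each point is replaced by a kernel (a "pile of
    # sand"), and the kernels are summed to give a density surface.
    import numpy as np

    people = np.random.default_rng(0).uniform(0, 10, size=(200, 2))  # hypothetical point locations
    bandwidth = 0.8                                                   # kernel width (assumed)

    xs = np.linspace(0, 10, 101)
    ys = np.linspace(0, 10, 101)
    gx, gy = np.meshgrid(xs, ys)

    density = np.zeros_like(gx)
    for px, py in people:
        d2 = (gx - px) ** 2 + (gy - py) ** 2
        density += np.exp(-d2 / (2 * bandwidth ** 2))   # one "pile of sand" per point

    density /= 2 * np.pi * bandwidth ** 2               # normalize the Gaussian kernel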

example - density estimation and Chicago crime

 

Measures of shape:
 

  • shape has many dimensions, no single measure can capture them all
  • many measures of shape try to capture the difference between compact and distended
    • many of these are based on a comparison of the shape's perimeter with that of a circle of the same area
    • e.g. shape = perimeter / (3.54 * sqrt(area)), where 3.54 ≈ 2·sqrt(π) (see the sketch after this list)
    • this measure is 1.0 for a circle, larger for a distended shape
  • all of these measures based on perimeter suffer from the same problem
    • within a GIS, lines and boundaries are represented as straight line segments between points
    • this will almost always result in a length that is shorter than the real length of the curve, unless the real shape is polygonal
    • consequently the measure of shape will be too low, by an undetermined amount
  • shape (compactness) is a useful measure to detect gerrymanders in political districting
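A minimal sketch of the perimeter-based compactness measure for a polygon stored, as in a GIS, as straight line segments between vertices; the coordinates are hypothetical.

    # Compactness = perimeter / (2 * sqrt(pi * area)); 1.0 for a circle,
    # larger for a distended shape.
    import math

    def compactness(coords):
        """coords: list of (x, y) vertices in order, first vertex not repeated."""
        n = len(coords)
        perimeter = 0.0
        area2 = 0.0
        for i in range(n):
            x1, y1 = coords[i]
            x2, y2 = coords[(i + 1) % n]
            perimeter += math.hypot(x2 - x1, y2 - y1)
            area2 += x1 * y2 - x2 * y1          # shoelace formula (twice the area)
        area = abs(area2) / 2.0
        return perimeter / (2.0 * math.sqrt(math.pi * area))

    print(compactness([(0, 0), (1, 0), (1, 1), (0, 1)]))   # unit square: about 1.13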





Spatial Interpolation

Spatial interpolation is defined as a process of determining the characteristics of objects from those of nearby objects

  • equivalently, of guessing the value of a field at locations where it has not been measured
The objects are most often points (sample observations) but may be lines or areas

The attributes are most often interval-scaled (elevations) but may be of any type

From a GIS perspective, spatial interpolation is a process of creating one class of objects from another class

Spatial interpolation is often embedded in other processes, and is often used as part of a display process

  • e.g. to contour a surface from a set of sample points, it is necessary to use a method of spatial interpolation to determine where to place the contours among the points
Many methods of spatial interpolation exist:
 
 

Distance-weighted interpolation

Known values exist at n locations i=1,...,n

The value at a location xi is denoted by z(xi)

We need to guess the value at location x, denoted by z(x)

The guessed value is an average over the known values at the sample points

  • the average is weighted by distance so that nearby points have more influence.
Let d(xi,x) denote the distance from location x, where we want to make a guess, to the ith sample point.

Let w[d] denote the weight given to a point at distance d in calculating the average.
 
 

The estimate at x is calculated as:
 

z(x) = ( Σi w[d(xi,x)] z(xi) ) / ( Σi w[d(xi,x)] )
 

in other words, the average weighted by distance.
 

The simplest kind of weight is a switch - a weight of 1 is given to any points within a certain distance of x, and a weight of 0 to all others

  • this means in effect that z(x) is calculated as the average over points within a window of a certain radius.
Better methods include weights which are continuous, decreasing functions of distance such as an inverse square:

w[d] = d⁻² (i.e. 1/d²)
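A minimal sketch of distance-weighted interpolation with inverse-square weights; the sample locations and values are hypothetical.

    # Inverse-distance-weighted estimate at an arbitrary location.
    import numpy as np

    samples = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])  # sample points xi
    z = np.array([100.0, 120.0, 90.0, 110.0])                                  # known values z(xi)

    def idw(x, y, power=2.0):
        d = np.hypot(samples[:, 0] - x, samples[:, 1] - y)
        if np.any(d < 1e-12):                     # exactly on a sample point
            return z[np.argmin(d)]
        w = 1.0 / d ** power                      # w[d] = 1/d^2 by default
        return (w * z).sum() / w.sum()

    print(idw(5.0, 5.0))    # a weighted average of the four known values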

All of the distance-weighted methods (e.g. IDW, inverse distance weighting) share the same positive features and drawbacks. They are:

  • easy to implement and conceptually simple
  • adaptable - the weighting function can be changed to suit the circumstances. It is even possible to optimize the weighting function, in the following sense (see the cross-validation sketch after this list):
  • Suppose the weighting function has a parameter, such as the size of the window

    Set the window size to some test value

    Then select one of the sample points, and use the method to interpolate at that point by averaging over the remaining n-1 sample values

    Compare the interpolated value to the known value at that point. Repeat for all n points and accumulate the errors

    The best window size (parameter value) is then the one which minimizes this accumulated error

    In most cases this will be a non-zero, finite value.
     
     

  • all interpolated values must lie between the minimum and maximum observed values, unless negative weights are used
    • This means that it is impossible to extrapolate trends
    • If there is no data point exactly at the top of a hill or the bottom of a pit, the surface will smooth out the feature
  • the interpolated surface cannot extrapolate a trend outside the area covered by the data points - the value at infinity must be the arithmetic mean of the data points
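A minimal sketch of the leave-one-out cross-validation procedure described in the list above, using the simple 0/1 window weight; the sample data and the candidate window sizes are hypothetical.

    # For each candidate window size, interpolate at each sample point from the
    # other n-1 points and accumulate the squared errors; keep the window size
    # that minimizes the accumulated error.
    import numpy as np

    rng = np.random.default_rng(1)
    xy = rng.uniform(0, 100, size=(50, 2))
    z = np.sin(xy[:, 0] / 20.0) + rng.normal(0, 0.1, 50)   # values with some spatial structure

    def window_estimate(i, radius):
        d = np.hypot(xy[:, 0] - xy[i, 0], xy[:, 1] - xy[i, 1])
        mask = (d <= radius)
        mask[i] = False                       # leave the point itself out
        if not mask.any():
            return None                       # no neighbours inside the window
        return z[mask].mean()

    best = None
    for radius in [5, 10, 20, 40, 80]:
        errors = []
        for i in range(len(z)):
            est = window_estimate(i, radius)
            if est is not None:
                errors.append((est - z[i]) ** 2)
        mse = float(np.mean(errors))
        if best is None or mse < best[1]:
            best = (radius, mse)

    print("best window size:", best[0], "mean squared error:", best[1])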
 Although distance-weighted methods underlie many of the techniques in use, they are far from ideal
 
 
 

Polynomial surfaces

A polynomial function is fitted to the known values - interpolated values are obtained by evaluating the function
 

e.g. planar surface - z(x,y) = a + bx + cy
e.g. cubic surface - z(x,y) = a + bx + cy + dx² + exy + fy² + gx³ + hx²y + ixy² + jy³ (see the sketch after this list)
  • easy to use
  • useful only when there is reason to expect that the surface can be described by a simple polynomial in x and y
  • very sensitive to boundary effects
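A minimal sketch of fitting a planar trend surface by least squares and evaluating it at a new location; the sample data and the evaluation point are hypothetical. A cubic surface would simply add the higher-order columns to the design matrix.

    # Least-squares fit of z = a + b*x + c*y to scattered samples.
    import numpy as np

    xy = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0], [5.0, 5.0]])
    z = np.array([50.0, 60.0, 45.0, 58.0, 53.0])

    # Design matrix for the planar surface; a cubic surface would add columns
    # for x^2, x*y, y^2, x^3, x^2*y, x*y^2 and y^3.
    A = np.column_stack([np.ones(len(z)), xy[:, 0], xy[:, 1]])
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    a, b, c = coef

    x, y = 3.0, 7.0
    print("interpolated value:", a + b * x + c * y)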
Kriging

Most real surfaces are observed to be spatially autocorrelated - that is, nearby points have values which are more similar than distant points.

The amount and form of spatial autocorrelation can be described by a variogram, which shows how differences in values increase with geographical separation

Observed variograms tend to have certain common features - differences increase with distance up to a certain value known as the sill, which is reached at a distance known as the range.
 

To make estimates by Kriging, a variogram is obtained from the observed values or past experience

  • Interpolated best-estimate values are then calculated based on the characteristics of the variogram (see the sketch after this list).
  • perhaps the most satisfactory method of interpolation from a statistical viewpoint
  • difficult to execute with large data sets
  • decisions must be made by the user, requiring either experience or a "cookbook" approach
  • a major advantage of Kriging is its ability to output a measure of uncertainty of each estimate
    • This can be used to guide sampling programs by identifying the location where an additional sample would maximally decrease uncertainty, or its converse, the sample which is most readily dropped.
  • Kriging and other methods of geostatistics are in the geostatistics module of ArcGIS
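A minimal sketch of ordinary Kriging at a single location, assuming (rather than estimating) a spherical variogram; the sample data and the variogram parameters (nugget, sill, range) are hypothetical. In practice the variogram would first be fitted to the observed values, or the calculation done in a package such as the ArcGIS geostatistics module mentioned above.

    # Ordinary Kriging: solve the kriging system built from an assumed
    # spherical semivariogram, then form the weighted estimate and its variance.
    import numpy as np

    xy = np.array([[2.0, 2.0], [8.0, 3.0], [5.0, 9.0], [1.0, 7.0]])
    z = np.array([12.0, 18.0, 15.0, 10.0])
    nugget, sill, range_ = 0.0, 10.0, 12.0          # assumed variogram parameters

    def gamma(h):
        """Spherical semivariogram: rises to the sill at the range."""
        h = np.asarray(h, dtype=float)
        g = np.where(h >= range_, sill,
                     nugget + (sill - nugget) * (1.5 * h / range_ - 0.5 * (h / range_) ** 3))
        return np.where(h == 0.0, 0.0, g)

    def krige(x0, y0):
        n = len(z)
        d = np.hypot(xy[:, None, 0] - xy[None, :, 0], xy[:, None, 1] - xy[None, :, 1])
        A = np.ones((n + 1, n + 1))
        A[:n, :n] = gamma(d)
        A[n, n] = 0.0                              # Lagrange-multiplier row and column
        b = np.ones(n + 1)
        b[:n] = gamma(np.hypot(xy[:, 0] - x0, xy[:, 1] - y0))
        sol = np.linalg.solve(A, b)
        weights, mu = sol[:n], sol[n]
        estimate = weights @ z
        variance = weights @ b[:n] + mu            # kriging variance: the uncertainty measure
        return estimate, variance

    print(krige(5.0, 5.0))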
Locally-defined functions

Some of the most satisfactory methods use a mosaic approach in which the surface is locally defined by a polynomial function, and the functions are arranged to fit together in some way

With a TIN data structure it is possible to describe the surface within each triangle by a plane (see the sketch below)

  • Because adjacent triangles share two vertices, the planes automatically match along the edges of each triangle, but slopes are not continuous across edges
  • This discontinuity can be disguised cosmetically by smoothing the contours drawn to represent the surface
  • Alternatively a higher-order polynomial can be used which is continuous in slopes.
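A minimal sketch of the plane-per-triangle idea, evaluating the plane through three hypothetical vertex values using barycentric coordinates.

    # Planar interpolation within one TIN triangle.
    import numpy as np

    tri = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])   # triangle vertices (hypothetical)
    zv = np.array([100.0, 140.0, 120.0])                      # values at the vertices

    def plane_value(x, y):
        (x1, y1), (x2, y2), (x3, y3) = tri
        det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
        l1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
        l2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
        l3 = 1.0 - l1 - l2
        return l1 * zv[0] + l2 * zv[1] + l3 * zv[2]

    print(plane_value(3.0, 3.0))   # values are continuous across shared edges,
                                   # but the slope generally is not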
Another popular method fits a plane at each data point, then achieves a smooth surface by averaging planes at each interpolation point


Hypothesis tests:
  • compare patterns against the outcomes expected from well-defined processes
  • if the fit is good, one may conclude that the process that formed the observed pattern was like the one tested
    • unfortunately, there will likely be other processes that might have formed the same observed pattern
    • in such cases, it is reasonable to ignore them as long as a) they are no simpler than the hypothesized process, and b) the hypothesized process makes conceptual sense
  • the best known examples concern the processes that can give rise to certain patterns of points
    • attempts to extend these to other types of objects have not been as successful
Point pattern analysis
  • a commonly used standard is the random or Poisson process
    • in this process, points are equally likely to occur anywhere, and are located independently, i.e. one point's location does not affect another's
    • CSR = complete spatial randomness
  • a real pattern of points can be compared to this process
    • most often, the comparison is made using the average distance between a point and its nearest neighbor
    • in a random pattern (a pattern of points generated by the Poisson process) this distance is expected to be 1/(2 * sqrt(density)) where density is the number of points per unit area, and area is measured in units consistent with the measurement of distance
    • with a limited number of points, the observed average distance will only come close to this value, even for a truly random pattern
      • theory gives the limits within which the average distance is expected to lie in 95% of cases
      • if the actual average distance falls outside these limits, we conclude that the pattern was not generated by a random process (see the sketch below)
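A minimal sketch of the nearest-neighbour comparison: a hypothetical pattern is compared with the average distance expected under complete spatial randomness (edge effects are ignored here).

    # Compare observed mean nearest-neighbour distance with 1/(2*sqrt(density)).
    import numpy as np

    rng = np.random.default_rng(42)
    area_side = 100.0
    pts = rng.uniform(0, area_side, size=(200, 2))      # a candidate point pattern (hypothetical)

    # mean distance from each point to its nearest neighbour
    d = np.hypot(pts[:, None, 0] - pts[None, :, 0], pts[:, None, 1] - pts[None, :, 1])
    np.fill_diagonal(d, np.inf)
    observed = d.min(axis=1).mean()

    density = len(pts) / area_side ** 2
    expected = 1.0 / (2.0 * np.sqrt(density))           # expectation under CSR

    print("observed:", observed, "expected under CSR:", expected,
          "ratio:", observed / expected)                # ratio < 1: clustered; > 1: uniform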
There are two major options for non-random patterns:
  • the pattern is clustered
    • points are closer together than they should be
    • the presence of one point has made other points more likely in the immediate vicinity
    • some sort of attractive or contagious process is inferred
  • the pattern is uniformly spaced
    • points are further apart than they should be
    • the presence of one point has made other points less likely in the vicinity
    • some sort of repulsive process is inferred, or some sort of competition for space
Unfortunately it is easy for this process of inference to come unstuck
  • the process that generated the pattern may be non-random, but not sufficiently so to be detectable by this test
    • this false conclusion is more likely reached when there is little data - the more data we have, the more likely we are to detect differences from a simple random process
    • in statistics, this is known as a Type II error - failing to reject the null hypothesis when in fact it is false
  • the process may be non-random, but not in either of the senses identified above - contagious or repulsive
    • points may be located independently, but with non-uniform density, so that points are not equally likely everywhere
  • it is possible to hypothesize more complex processes, but the test becomes progressively weaker at confirming them