Goodchild on Spatial Analysis and GIS

SECTION 4
SPATIAL DEPENDENCE

Section 4 - Spatial dependence - Looking at causes and effects in a geographical context:

Spatial autocorrelation - what is it, how to measure it with a GIS.

The independence assumption and what it means for modeling spatial data.

Applying models that incorporate spatial dependence - tools and applications.

Two concepts:

Spatial dependence

what happens at one place depends on events in nearby places

all things are related but nearby things are more related than distant things (Tobler's first law of geography)

positive spatial dependence:
- nearby things are more alike than things are in general

negative spatial dependence:
- nearby things are less alike than things are in general
- conceptual problems with negative spatial dependence
- - e.g. the chessboard

spatial autocorrelation measures spatial dependence
- an index, rather than a parameter of a process
- dependence between discrete objects, or dependence in a continuous field?

a world without positive spatial dependence would be an impossible world
- impossible to map
- impossible to describe, live in
- hell is a place with no spatial dependence

Geary index:

compares the squared differences in value between neighboring objects with overall variance in values

Moran index:

calculates the product of the deviations from the mean of the values in neighboring objects

related to Geary but not in a simple algebraic sense
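
A minimal sketch of the Moran index, assuming the usual form I = (n / sum of w_ij) x (sum over neighbor pairs of products of deviations from the mean) / (sum of squared deviations), with binary adjacency weights. The function names and the five-cell transect data are hypothetical, chosen only to show one positively and one negatively dependent pattern.

```python
def morans_i(values, neighbors):
    """Moran's I with binary weights: w_ij = 1 if j is in neighbors[i], else 0."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Cross-products of deviations, summed over all ordered neighbor pairs (i, j).
    cross = sum(dev[i] * dev[j] for i in range(n) for j in neighbors[i])
    total_weight = sum(len(neighbors[i]) for i in range(n))  # sum of all w_ij
    sum_sq = sum(d * d for d in dev)
    return (n / total_weight) * (cross / sum_sq)


def chain_neighbors(n):
    """Adjacency for n cells in a row: each cell neighbors the cells beside it."""
    return [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]


# Hypothetical data: a smooth trend versus a chessboard-like alternation.
trend = morans_i([1, 2, 3, 4, 5], chain_neighbors(5))
alternating = morans_i([1, 5, 1, 5, 1], chain_neighbors(5))
print(trend, alternating)
```

The trend gives a positive index (neighbors alike) and the alternating pattern a negative one (neighbors unlike).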

Calculation of the Geary index of spatial autocorrelation
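
The formula is not reproduced in these notes. The standard form of the Geary index, which is consistent with the definitions and the worked numbers that follow (n is the number of objects, 4 in the example), is assumed to be:

```latex
c = \frac{(n - 1) \sum_{i} \sum_{j} w_{ij} (x_i - x_j)^2}
         {2 \left( \sum_{i} \sum_{j} w_{ij} \right) \sum_{i} (x_i - a)^2}
```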

a is the mean of x values

w_ij = 1 if i,j adjacent, else 0

c is 1 if neighbors vary as much as the sample as a whole

c < 1 if neighbors are more similar than the sample as a whole (positive dependence)

c > 1 if neighbors are less similar (negative dependence)

c = 3 x 16 / (2 x 10 x 2) = 48 / 40 = 1.2

i.e. neighboring values are slightly less similar than one would expect if the values were randomly allocated to the four areas (c > 1, weak negative dependence)
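
The arithmetic above can be checked with a short sketch, assuming the standard form of the Geary index with binary adjacency weights. The four values and the adjacency matrix are hypothetical (the original map is not reproduced here). They were chosen only because they give the same sums as the worked calculation: 16 for the squared differences, 10 for the weights, 2 for the squared deviations.

```python
def gearys_c(values, w):
    """Geary's c = (n-1) * sum_ij w_ij (x_i - x_j)^2 / (2 * sum_ij w_ij * sum_i (x_i - a)^2)."""
    n = len(values)
    a = sum(values) / n  # mean of the x values
    sq_diff = sum(
        w[i][j] * (values[i] - values[j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    total_weight = sum(sum(row) for row in w)
    sum_sq = sum((v - a) ** 2 for v in values)
    return (n - 1) * sq_diff / (2 * total_weight * sum_sq)


# Hypothetical four-area example: every pair is adjacent except areas 1 and 2.
x = [2, 3, 3, 4]
w = [
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
c = gearys_c(x, w)
print(c)  # 3 * 16 / (2 * 10 * 2)
```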

Continuous space

see the discussion of variograms and Kriging

the term geostatistics is normally associated with continuous space, spatial statistics more with discrete space

Measures of spatial dependence can be calculated in GIS:

Idrisi calculates autocorrelation over a raster

code has been written to calculate autocorrelation in ARC/INFO (see NCGIA Technical Paper 91-5)

More extensive codes have been written using the statistical packages, e.g. MINITAB, SAS

contact Dan Griffith, Syracuse University; Luc Anselin, University of Illinois

some of these fail to take advantage of GIS capabilities, for generating input data and displaying output

see also Spacestat

Spatial heterogeneity:

suppose there is a relationship between number of AIDS cases and number of people living in an area

the form of this relationship will vary spatially
- in some areas the number of cases per capita will be higher than in others
- we could map the constant of proportionality

spatial heterogeneity describes this geographic variation in the constants or parameters of relationships

when it is present, the outcome of an analysis depends on the area over which the analysis is made
- often this area is arbitrarily determined by a map boundary or political jurisdiction

Geographically weighted regression (GWR)

fits a model such as y = a + bx
but assumes that the values of a and b will vary geographically
determines a and b at any point by weighting observations inversely by distance from that point
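
A minimal sketch of the idea, not a full GWR implementation. It fits y = a + bx at a chosen point by weighted least squares. The Gaussian distance-decay kernel, the `bandwidth` parameter, the function name and the toy data are all assumptions made for illustration; the notes say only that weights fall with distance.

```python
import math


def gwr_at(point, obs, bandwidth):
    """Estimate (a, b) in y = a + b*x at `point` from obs = [(east, north, x, y), ...].

    Weights follow a Gaussian kernel of distance (an assumed choice of kernel).
    """
    px, py = point
    weights = [
        math.exp(-0.5 * (math.hypot(e - px, n - py) / bandwidth) ** 2)
        for (e, n, _, _) in obs
    ]
    sw = sum(weights)
    xbar = sum(w * o[2] for w, o in zip(weights, obs)) / sw
    ybar = sum(w * o[3] for w, o in zip(weights, obs)) / sw
    sxy = sum(w * (o[2] - xbar) * (o[3] - ybar) for w, o in zip(weights, obs))
    sxx = sum(w * (o[2] - xbar) ** 2 for w, o in zip(weights, obs))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b


# Hypothetical data along an east-west line: the slope is 1 in the west, 3 in the east.
obs = []
for east in range(10):
    slope = 1.0 if east < 5 else 3.0
    for x in (0.0, 1.0, 2.0):
        obs.append((float(east), 0.0, x, slope * x))

a_west, b_west = gwr_at((0.0, 0.0), obs, bandwidth=1.5)
a_east, b_east = gwr_at((9.0, 0.0), obs, bandwidth=1.5)
a_all, b_all = gwr_at((0.0, 0.0), obs, bandwidth=1e6)  # huge bandwidth ~ ordinary regression
print(b_west, b_east, b_all)
```

With a small bandwidth the local slopes recover the two regimes. With a very large bandwidth every observation gets nearly equal weight and the fit collapses to a single global slope.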

Geographical brushing:

a technique of ESA

a user-defined window is moved over the map

analysis occurs only within the window
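
A rough non-interactive sketch of brushing, assuming a rectangular window and the mean as the statistic recomputed inside it. In a GIS the user would drag the window; here it is stepped along one axis. The function names and point data are hypothetical.

```python
def brush(points, xmin, ymin, xmax, ymax):
    """Mean value of the (x, y, value) points inside the window, or None if it is empty."""
    inside = [v for (x, y, v) in points if xmin <= x <= xmax and ymin <= y <= ymax]
    return sum(inside) / len(inside) if inside else None


def sweep(points, width, step, x_start, x_end, ymin, ymax):
    """Move a window of the given width from x_start to x_end, analyzing each position."""
    results = []
    x0 = x_start
    while x0 + width <= x_end:
        results.append((x0, brush(points, x0, ymin, x0 + width, ymax)))
        x0 += step
    return results


# Hypothetical points: values rise from west to east.
points = [(0.5, 0.5, 10), (1.5, 0.5, 20), (2.5, 0.5, 30), (3.5, 0.5, 40)]
results = sweep(points, width=2, step=1, x_start=0, x_end=4, ymin=0, ymax=1)
print(results)
```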

Conventional analysis (analysis done aspatially, e.g. using a statistical package) assumes independence (no spatial dependence) and homogeneity (no spatial heterogeneity)

e.g. regression analysis assumes that the observations (cases) are statistically independent

this assumption conflicts with the first law of geography

in general, analysis in space is very different from conventional statistical analysis (although the latter is very often carried out on spatial data)

An example:

the relationship between land devoted to growing corn and rainfall in a Midwestern state like Kansas

rainfall available at 50 weather stations

percent of land growing corn available for 100 counties

use a method of spatial interpolation to estimate rainfall in each county from the weather station data

plot one variable against the other, and perhaps fit a regression equation

how many data points are there?
- the more data points, the more statistically significant a given relationship appears
- 100 (the number of counties)?
- 50 (the real number of weather observations)?
- something in between?

more data points can be invented by intensifying the sample network using spatial interpolation, but no more real data has been created by doing so

both variables are strongly spatially autocorrelated, violating an assumption of regression

the significance of the analysis is now uncertain

methods of spatial regression try to overcome this problem in a systematic way
- see Spacestat

An example

Crime and income in Los Angeles
rate of car thefts (per sq km per year)
median annual income in thousands

per census tract

5,000 observations

b = increase in car thefts per sq km per thousand dollars median income
= -0.22
R² = proportion of variation in car thefts explained by income
= 0.26

is this significant?

is it significant at the 95% level of confidence?
in a population of millions of census tracts, exhibiting the same range of rates of car thefts and median incomes, but no relationship between them (b = 0, R² = 0), could a sample of 5,000 census tracts have exhibited the same degree of apparent relationship, or more, purely by chance?
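
Taken at face value, the conventional test can be sketched as follows. It assumes a simple one-predictor regression, independent observations and a random sample, which are exactly the assumptions under question here. The function name is hypothetical; the relation t² = R²(n - 2) / (1 - R²) is the standard one for a single predictor.

```python
import math


def slope_t_statistic(r_squared, n, slope_sign=1):
    """t statistic for the slope of a simple regression, recovered from R-squared and n."""
    return slope_sign * math.sqrt(r_squared * (n - 2) / (1.0 - r_squared))


# Values from the Los Angeles example: R-squared = 0.26, 5,000 tracts, negative slope.
t = slope_t_statistic(0.26, 5000, slope_sign=-1)
critical = 1.96  # approximate two-sided 95% critical value for large samples
print(t, abs(t) > critical)
```

The statistic is roughly 42 in magnitude, far beyond the critical value, so the conventional answer is "yes". Whether that answer means anything is a separate matter.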

but, but, but...we don't have a random sample of a larger population
there are only 5,000 tracts in LA and we have all there is

A related issue - the modifiable areal unit problem (MAUP)

many statistics are reported by averaging or summing over polygons - e.g. populations of counties, average elevation

it is commonly necessary to interpolate such values to new polygons which do not coincide
- e.g. from census tracts with known populations to school districts
- source zones have known populations
- populations of target zones are unknown

the best method of solving this problem is to create a continuous surface from the source data, then to integrate this surface to the new target areas

Various assumptions can be made about the underlying surface:

density is constant within source zones

density is constant within target zones

density is constant within some third set of control zones

density varies smoothly (Tobler's pycnophylactic interpolation)
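
The first assumption (density constant within source zones) leads to simple area weighting, sketched below. The overlap areas would normally come from a polygon overlay in the GIS; here they are hypothetical numbers, as are the function name and the populations. The sketch also assumes the target zones completely cover each source zone.

```python
def areal_weighting(source_pop, overlap):
    """Allocate source-zone populations to target zones in proportion to overlap area.

    overlap[s][t] is the area of intersection of source zone s and target zone t.
    """
    n_targets = len(overlap[0])
    target_pop = [0.0] * n_targets
    for s, pop in enumerate(source_pop):
        source_area = sum(overlap[s])  # valid only if targets cover the whole source zone
        density = pop / source_area    # constant density within the source zone
        for t in range(n_targets):
            target_pop[t] += density * overlap[s][t]
    return target_pop


# Hypothetical case: two census tracts, two school districts.
source_pop = [1000, 400]
overlap = [
    [6.0, 4.0],  # tract 0 (area 10) is split 6 / 4 between the districts
    [0.0, 8.0],  # tract 1 (area 8) lies entirely in district 1
]
target_pop = areal_weighting(source_pop, overlap)
print(target_pop)
```

The total population is preserved (1,400 in, 1,400 out). The pycnophylactic method requires the same property but lets density vary smoothly instead of in steps.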

Analysis carried out on modifiable units can produce frightening results

e.g. Openshaw and Taylor

99 counties of Iowa

two variables - % over 65, and % Republican
- correlation for the counties was .3466

Results of analysis using some alternative reporting zones:

Reporting zones                                  Correlation
6 Republican-proposed congressional districts    .4823
6 Democrat-proposed congressional districts      .6274
6 existing congressional districts               .2651
6 urban/rural regional types                     .8624
6 functional regions                             .7128

By regrouping the counties into larger regions, Openshaw and Taylor were able to generate a vast range of outcomes of the analysis:

e.g. 48 regions - correlations between -.548 and +.886

e.g. 12 regions - correlations between -.936 and +.996
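
The effect is easy to reproduce on a toy scale. The sketch below is not the Iowa data: it uses eight hypothetical units on a 2 x 4 grid, aggregates them by simple averaging into two alternative sets of four contiguous zones, and correlates the zone means. All names and values are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def aggregate(values, zones):
    """Mean of the unit values within each zone (a zone is a list of unit indices)."""
    return [sum(values[i] for i in zone) / len(zone) for zone in zones]


# Units 0-3 form the top row of the grid, units 4-7 the bottom row.
x = [0, 1, 2, 3, 10, 11, 12, 13]
y = [10, 11, 12, 13, 0, 1, 2, 3]

columns = [[0, 4], [1, 5], [2, 6], [3, 7]]    # zoning A: vertical pairs
row_halves = [[0, 1], [2, 3], [4, 5], [6, 7]]  # zoning B: horizontal pairs

r_units = pearson(x, y)
r_columns = pearson(aggregate(x, columns), aggregate(y, columns))
r_row_halves = pearson(aggregate(x, row_halves), aggregate(y, row_halves))
print(r_units, r_columns, r_row_halves)
```

The same eight units give a strongly negative correlation at the unit level and under zoning B, but a perfect positive correlation under zoning A.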

What to do?

evaluate the range?

are we asking the right question?
- is scale part of the question rather than a mere matter of implementation?