SECTION 4
SPATIAL DEPENDENCE
Looking at causes and effects in a geographical context:
- Spatial autocorrelation - what is it, how to measure it with a GIS.
- The independence assumption and what it means for modeling spatial data.
- Applying models that incorporate spatial dependence - tools and applications.
Two concepts:
Spatial dependence
- what happens at one place depends on events in nearby places
- all things are related but nearby things are more related than distant things (Tobler's first law of geography)
- positive spatial dependence:
  - nearby things are more alike than things are in general
- negative spatial dependence:
  - nearby things are less alike than things are in general
  - conceptual problems with negative spatial dependence
- spatial autocorrelation measures spatial dependence
  - an index, rather than a parameter of a process
  - dependence between discrete objects, or dependence in a continuous field?
- a world without positive spatial dependence would be an impossible world
  - impossible to map
  - impossible to describe, live in
  - hell is a place with no spatial dependence
Geary index:
- compares the squared differences in value between neighboring objects with overall variance in values
Moran index:
- calculates the product of values in neighboring objects
- related to Geary but not in a simple algebraic sense
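As a rough illustration of "the product of values in neighboring objects", the Moran index is usually written I = n Σi Σj wij (xi - a)(xj - a) / (Σi Σj wij × Σi (xi - a)²), where a is the mean of the x values and wij = 1 for adjacent objects. The sketch below assumes that form. The four values and the adjacency matrix are hypothetical, chosen only for illustration.

```python
import numpy as np

def moran_i(x, w):
    """Moran index for values x and a binary adjacency matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()                      # deviations from the mean
    cross = (w * np.outer(z, z)).sum()    # products of neighboring deviations
    return len(x) * cross / (w.sum() * (z ** 2).sum())

# Hypothetical data: four areas, every pair adjacent except the two middle ones.
x = [1.0, 2.0, 2.0, 3.0]
w = np.array([[0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 0]])
i_value = moran_i(x, w)
print(round(i_value, 3))
```

Positive values of I indicate positive spatial dependence and negative values indicate negative dependence.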
Calculation of the Geary index of spatial autocorrelation
$$c = \frac{(n - 1) \sum_i \sum_j w_{ij} (x_i - x_j)^2}{2 \sum_i \sum_j w_{ij} \sum_i (x_i - a)^2}$$
n is the number of objects and x_i is the value for object i
a is the mean of x values
wij = 1 if i,j adjacent, else 0
c is 1 if neighbors vary as much as the sample as a whole
c < 1 if neighbors are more similar than the sample as a whole (positive dependence)
c > 1 if neighbors are less similar (negative dependence)
c = (3 x 16) / (2 x 10 x 2) = 48 / 40 = 1.2
- i.e. since c > 1, neighboring values are slightly less similar than one would expect if the values were randomly allocated to the four areas
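The calculation can be sketched in a few lines of Python. The four values and the adjacency matrix below are hypothetical: the original map is not reproduced here, so they were chosen only so that the three sums match the figures used above (16, 10 and 2).

```python
import numpy as np

def geary_c(x, w):
    """Geary index for values x and a binary adjacency matrix w."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    diff_sq = (x[:, None] - x[None, :]) ** 2          # (xi - xj)^2 for all pairs
    numerator = (len(x) - 1) * (w * diff_sq).sum()
    denominator = 2.0 * w.sum() * ((x - x.mean()) ** 2).sum()
    return numerator / denominator

# Hypothetical four-area example (not the original map).
x = [1.0, 2.0, 2.0, 3.0]
w = np.array([[0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 0]])
c_value = geary_c(x, w)
print(round(c_value, 3))  # 3 x 16 / (2 x 10 x 2) = 1.2
```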
Continuous space
- see the discussion of variograms and Kriging
- the term geostatistics is normally associated with continuous space, spatial statistics more with discrete space
Measures of spatial dependence can be calculated in GIS:
- Idrisi calculates autocorrelation over a raster
- code has been written to calculate autocorrelation in ARC/INFO (see NCGIA Technical Paper 91-5)
More extensive codes have been written using the statistical packages, e.g. MINITAB, SAS
- contact Dan Griffith, Syracuse University; Luc Anselin, University of Illinois
- some of these fail to take advantage of GIS capabilities, for generating input data and displaying output
Spatial heterogeneity:
- suppose there is a relationship between number of AIDS cases and number of people living in an area
- the form of this relationship will vary spatially
  - in some areas the number of cases per capita will be higher than in others
  - we could map the constant of proportionality
- spatial heterogeneity describes this geographic variation in the constants or parameters of relationships
- when it is present, the outcome of an analysis depends on the area over which the analysis is made
  - often this area is arbitrarily determined by a map boundary or political jurisdiction
Geographically weighted regression (GWR):
- fits a model such as y = a + bx
- but assumes that the values of a and b will vary geographically
- determines a and b at any point by weighting observations inversely by distance from that point
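A minimal sketch of the idea: a and b are estimated at one location by weighted least squares, with weights that fall off with distance. The Gaussian kernel, the `bandwidth` parameter and the function name `gwr_at_point` are assumptions made for this illustration, as are the data. The notes specify only that weights decrease with distance.

```python
import numpy as np

def gwr_at_point(coords, x, y, point, bandwidth):
    """Estimate a and b in y = a + b*x at one location by weighted least squares."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dist = np.linalg.norm(coords - np.asarray(point, dtype=float), axis=1)
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)   # Gaussian kernel (assumed)
    design = np.column_stack([np.ones_like(x), x])     # columns: 1, x
    weighted_design = design * weights[:, None]
    # Solve the weighted normal equations (X'WX) beta = X'Wy.
    a, b = np.linalg.solve(design.T @ weighted_design, design.T @ (weights * y))
    return a, b

# Hypothetical data: slope is 2 in the west and 5 in the east.
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0), (12, 0), (13, 0)]
x = [1, 2, 3, 4, 1, 2, 3, 4]
y = [2, 4, 6, 8, 5, 10, 15, 20]
a_west, b_west = gwr_at_point(coords, x, y, point=(1.5, 0), bandwidth=1.0)
a_east, b_east = gwr_at_point(coords, x, y, point=(11.5, 0), bandwidth=1.0)
print(round(b_west, 3), round(b_east, 3))
```

Repeating the estimate over a grid of points would give surfaces of a and b that could be mapped.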
Geographical brushing:
- a user-defined window is moved over the map
- analysis occurs only within the window
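The moving window can be sketched as a simple spatial filter followed by whatever analysis is wanted. Here the analysis is just a mean. The square window, the function name `brush` and the data are hypothetical choices for illustration.

```python
import numpy as np

def brush(coords, values, center, half_width):
    """Return the values of observations that fall inside a square window."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    offset = np.abs(coords - np.asarray(center, dtype=float))
    inside = np.all(offset <= half_width, axis=1)
    return values[inside]

# Hypothetical observations along a line.
coords = [(0, 0), (1, 0), (2, 0), (8, 0), (9, 0), (10, 0)]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Move the window to two positions and analyze only what lies inside it.
window_means = {cx: brush(coords, values, (cx, 0), 1.5).mean() for cx in (1, 9)}
print(window_means)
```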
Conventional analysis (analysis done aspatially, e.g. using a statistical package) assumes independence (no spatial dependence) and homogeneity (no spatial heterogeneity)
- e.g. regression analysis assumes that the observations (cases) are statistically independent
- for spatial data this assumption conflicts with the first law of geography
- in general, analysis in space is very different from conventional statistical analysis (although conventional analysis is very often carried out on spatial data)
An example:
- the relationship between land devoted to growing corn and rainfall in a Midwestern state like Kansas
- rainfall available at 50 weather stations
- percent of land growing corn available for 100 counties
- use a method of spatial interpolation to estimate rainfall in each county from the weather station data
- plot one variable against the other, and perhaps fit a regression equation
- how many data points are there?
  - the more data points, the more statistically significant a given result appears
  - 100 (the number of counties)?
  - 50 (the real number of weather observations)?
  - something in between?
- more data points can be invented by intensifying the sample network using spatial interpolation, but no more real data has been created by doing so
- both variables are strongly spatially autocorrelated, violating the independence assumption of regression
- the significance of the analysis is now uncertain
- methods of spatial regression try to overcome this problem in a systematic way
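Why the count matters can be seen from the usual t statistic for a correlation, t = r × sqrt((n - 2) / (1 - r²)), which assumes n independent observations. The correlation r = 0.25 below is hypothetical, chosen only to show that the same relationship can fall on either side of the conventional threshold depending on which n is believed.

```python
import math

def t_statistic(r, n):
    """t statistic for a correlation r computed from n independent observations."""
    return r * math.sqrt((n - 2) / (1.0 - r ** 2))

r = 0.25                       # hypothetical correlation, not from the notes
t_50 = t_statistic(r, 50)      # if only the 50 stations count
t_100 = t_statistic(r, 100)    # if all 100 counties count
print(round(t_50, 2), round(t_100, 2))
```

With a two-tailed 5% critical value of roughly 2 at these sample sizes, the result would be judged significant with n = 100 but not with n = 50.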
An example: crime and income in Los Angeles
- rate of car thefts (per sq km per year) and median annual income (in thousands of dollars), per census tract
- 5,000 observations
- b = change in car thefts per sq km per year for each additional thousand dollars of median income = -0.22
- R2 = proportion of variation in car thefts explained by income = 0.26
- is this significant?
  - is it significant at the 95% level of confidence?
  - in a population of millions of census tracts, exhibiting the same range of rates of car thefts and median incomes, but no relationship between them (b = 0, R2 = 0), could a sample of 5,000 census tracts have exhibited the same degree of apparent relationship, or more, purely by chance?
- but, but, but... we don't have a random sample of a larger population
  - there are only 5,000 tracts in LA and we have all there is
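For reference, the conventional answer can be worked out from the figures above. The sketch assumes a simple regression with one predictor and 5,000 independent observations, which is exactly the assumption being questioned, and uses the standard F statistic F = (R2 / (1 - R2)) × (n - 2).

```python
n = 5000            # census tracts, from the example above
r_squared = 0.26    # from the example above

# F statistic for a one-predictor regression, valid only if the
# observations are an independent random sample.
f_statistic = (r_squared / (1.0 - r_squared)) * (n - 2)
print(round(f_statistic, 1))
```

The 5% critical value of F with 1 and several thousand degrees of freedom is about 3.84, so a conventional test would report overwhelming significance. Whether that test means anything here is the point at issue.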
A related issue - the modifiable areal unit problem (MAUP)
- many statistics are reported by averaging or summing over polygons - e.g. populations of counties, average elevation
- it is commonly necessary to interpolate such values to new polygons which do not coincide
  - e.g. from census tracts with known populations to school districts
  - source zones have known populations
  - populations of target zones are unknown
- the best method of solving this problem is to create a continuous surface from the source data, then to integrate this surface to the new target areas
Various assumptions can be made about the underlying surface:
- density is constant within source zones
- density is constant within target zones
- density is constant within some third set of control zones
- density varies smoothly (Tobler's Pycnophylactic interpolation)
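The first assumption (density constant within source zones) leads to simple area weighting, sketched below. The populations and overlap areas are hypothetical, and the sketch assumes the target zones together cover each source zone completely, so that row sums of the overlap matrix give the source zone areas.

```python
import numpy as np

def area_weighted(source_pop, overlap_area):
    """Estimate target zone populations assuming constant density in source zones.

    overlap_area[s, t] is the area of the intersection of source zone s
    with target zone t.
    """
    source_pop = np.asarray(source_pop, dtype=float)
    overlap_area = np.asarray(overlap_area, dtype=float)
    source_area = overlap_area.sum(axis=1)     # assumes full coverage by targets
    density = source_pop / source_area         # people per unit area
    return density @ overlap_area              # integrate density over each target

# Hypothetical example: two census tracts, two school districts (areas in sq km).
source_pop = [1000.0, 3000.0]
overlap = [[4.0, 1.0],
           [2.0, 4.0]]
target_pop = area_weighted(source_pop, overlap)
print(target_pop, target_pop.sum())
```

The estimates sum to the original total, so population is preserved. The other assumptions in the list change only how the density surface is constructed before it is integrated.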
Analysis carried out on modifiable units can produce frightening results
- two variables - % over 65, and % Republican
  - correlation for the counties was .3466
Results of analysis using some alternative reporting zones:
  Reporting zones                                  Correlation
  6 Republican-proposed congressional districts    .4823
  6 Democrat-proposed congressional districts      .6274
  6 existing congressional districts               .2651
  6 urban/rural regional types                     .8624
  6 functional regions                             .7128
By regrouping the counties into larger regions, Openshaw and Taylor were able to generate a vast range of outcomes of the analysis:
- e.g. 48 regions - correlations between -.548 and +.886
- e.g. 12 regions - correlations between -.936 and +.996
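The effect is easy to reproduce on a toy scale. The six unit values and the two zonings below are hypothetical, and zone values are taken as unweighted means of the units they contain. This is a sketch of the general phenomenon, not of Openshaw and Taylor's procedure.

```python
import numpy as np

def zone_correlation(x, y, labels):
    """Correlation between zone means of x and y for a given zoning."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)
    x_mean = np.bincount(labels, weights=x) / counts
    y_mean = np.bincount(labels, weights=y) / counts
    return np.corrcoef(x_mean, y_mean)[0, 1]

# Hypothetical values for six basic units.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])

r_units = np.corrcoef(x, y)[0, 1]
r_zoning_a = zone_correlation(x, y, [0, 0, 1, 1, 2, 2])   # one grouping into 3 zones
r_zoning_b = zone_correlation(x, y, [0, 1, 0, 2, 2, 1])   # another grouping into 3 zones
print(round(r_units, 3), round(r_zoning_a, 3), round(r_zoning_b, 3))
```

The same six units give three different correlations depending only on how they are grouped.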
What to do?
- are we asking the right question?
  - is scale part of the question rather than a mere matter of implementation?