2001 ESRI USER CONFERENCE
Pre-Conference Seminar
SPATIAL ANALYSIS and GIS

Michael F. Goodchild
National Center for Geographic Information and Analysis

University of California

Santa Barbara, CA 93106

805 893 8049 (phone)

805 893 3146 (FAX)

805 893 8224 (NCGIA)

good@geog.ucsb.edu

July 8, 2001










Schedule

Four sessions:

Sunday July 8:

8:30am - 10:00am

10:30am - 12:00pm

Lunch

1:30pm - 3:00pm

3:30pm - 5:00pm

 
Instructor profile

Michael F. Goodchild is Professor of Geography at the University of California, Santa Barbara; Chair of the Executive Committee, National Center for Geographic Information and Analysis (NCGIA); Associate Director of the Alexandria Digital Library Project; and Director of NCGIA's Center for Spatially Integrated Social Science. He received his BA degree from Cambridge University in Physics in 1965 and his PhD in Geography from McMaster University in 1969. After 19 years at the University of Western Ontario, including three years as Chair, he moved to Santa Barbara in 1988. He was Director of NCGIA from 1991 to 1997. He has been awarded honorary doctorates by Laval University (1999) and the University of Keele (2001). In 1990 he was given the Canadian Association of Geographers Award for Scholarly Distinction, in 1996 the Association of American Geographers award for Outstanding Scholarship, and in 1999 the Canadian Cartographic Association's Award of Distinction for Exceptional Contributions to Cartography; he has won the American Society of Photogrammetry and Remote Sensing Intergraph Award and twice won the Horwood Critique Prize of the Urban and Regional Information Systems Association. He was Editor of Geographical Analysis between 1987 and 1990, and serves on the editorial boards of ten other journals and book series. In 2000 he was appointed Editor of the Methods, Models, and Geographic Information Sciences section of the Annals of the Association of American Geographers.
His major publications include Geographical Information Systems: Principles and Applications (1991); Environmental Modeling with GIS (1993); Accuracy of Spatial Databases (1989); GIS and Environmental Modeling: Progress and Research Issues (1996); Scale in Remote Sensing and GIS (1997); Interoperating Geographic Information Systems (1999); Geographical Information Systems: Principles, Techniques, Management and Applications (1999); and Geographic Information Systems and Science (2001); in addition he is author of some 300 scientific papers. He was Chair of the National Research Council's Mapping Science Committee from 1997 to 1999. His current research interests center on geographic information science, spatial analysis, the future of the library, and uncertainty in geographic data.

For a complete CV see the NCGIA web site www.ncgia.ucsb.edu under Personnel
Other related web sites: UCSB Geography www.geog.ucsb.edu, Alexandria Digital Library www.alexandria.ucsb.edu


TABLE OF CONTENTS

Outline:

1. What is Spatial Analysis?

        Basic GIS data models

        GIS function descriptions

2. Spatial Statistics

        Spatial interpolation

3. Spatial Interaction Models

4. Spatial Dependence

5. Spatial Decision Support

        Spatial search

        Districting


What is Spatial Analysis?

GIS is designed to support a range of different kinds of analysis of geographic information: techniques to examine and explore data from a geographic perspective, to develop and test models, and to present data in ways that lead to greater insight and understanding. All of these techniques fall under the general umbrella of "spatial analysis". Statistical packages like SAS, SPSS, S, or Systat allow the user to analyze numerical data using statistical techniques; GIS packages like ArcInfo give access to a similarly powerful array of methods of spatial analysis.
 
 

Purpose of the Course

The course will introduce participants with some knowledge of GIS to the capabilities of spatial analysis. Each of the five sections will cover a major application area and review the techniques available, as well as some of the more fundamental issues encountered in doing spatial analysis with a GIS.
 
 

Outline

Section 1 - What is spatial analysis? - Basic GIS concepts for spatial analysis - GIS functionality - Integrating GIS and spatial analysis - Issues of error and uncertainty:

  • Definition of spatial analysis, major types and areas for application.
  • How should an analyst view a spatial database? Fields and discrete objects, attributes, relationships
  • How to organize the functions of a GIS into a coherent scheme.
  • Levels of integration of GIS and spatial analysis - loose and tight coupling, and full integration. Scripts and macros, lineage and analytical toolboxes.
  • The uncertainty problem - why is it such an issue in spatial analysis? What can we do now about data quality?
Section 2 - Spatial statistics - Simple measures for exploring geographic information - The value of the spatial perspective on data - Intuition and where it fails - Applications in crime analysis, emergencies, incidence of disease:
  • Measures of spatial form - centrality, dispersion, shape.
  • Spatial interpolation - intelligent spatial guesswork - spatial outliers.
  • Exploratory spatial analysis - moving windows, linking spatial and other perspectives.
  • Hypothesis tests - randomness, the null hypothesis, and how intuition can be misleading.
 Section 3 - Spatial interaction models - What they are and where they're used - Calibration and "what-if" - Trade area analysis and market penetration:
  • The Huff model and variations.
  • Site modeling for retail applications - regression, analog, spatial interaction.
  • Modeling the impact of changes in a retail system.
  • Calibrating spatial interaction models in a GIS environment.
Section 4 - Spatial dependence - Looking at causes and effects in a geographical context:
  • Spatial autocorrelation - what is it, how to measure it with a GIS.
  • The independence assumption and what it means for modeling spatial data.
  • Applying models that incorporate spatial dependence - tools and applications.
Section 5 - Site selection - Locational analysis and location/allocation - Other forms of operations research in spatial analysis - Spatial decision support systems - Linking spatial analysis with GIS to support spatial decision-making:
  • Shortest path, traveling salesman, traffic assignment.
  • What is location/allocation, and where can it be applied?
  • Modeling the process of retail site selection. Criteria.
  • Electoral districting and sales territories.
  • What is an SDSS? What are its component parts? How does it compare to a GIS or a DSS? Why would you want one? Building SDSS.
  • Examples of SDSS use - site selection, districting.


  SECTION 1
WHAT IS SPATIAL ANALYSIS?



 

 

 







Section 1 - What is spatial analysis? - Basic GIS concepts for spatial analysis - GIS functionality - Integrating GIS and spatial analysis - Issues of error and uncertainty:

  • Definition of spatial analysis, major types and areas for application.
  • How should an analyst view a spatial database? Objects, layers, relationships, attributes, object pairs, data models.
  • How to organize the functions of a GIS into a coherent scheme.
  • Levels of integration of GIS and spatial analysis - loose and tight coupling, and full integration. Scripts and macros, lineage and analytical toolboxes.
  • The uncertainty problem - why is it such an issue in spatial analysis? What can we do now about data quality?
What is spatial analysis?

A set of techniques for analyzing spatial data

  • used to gain insight as well as to test models
  • ranging from inductive to deductive
    • finding new theories as well as testing old ones
  • can be highly technical, mathematical
    • can also be very simple and intuitive
Definitions

"A set of techniques whose results are dependent on the locations of the objects being analyzed"

  • move the objects, and the results change
    • e.g. move the people, and the US Center of Population moves
    • e.g. move the people, and average income does not change
  • most statistical techniques are invariant under changes of location
    • compare the techniques in SAS, SPSS, Systat etc.
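The location-dependence test above can be sketched directly: the mean center (a simple "center of population") changes when the objects move, while the average of an attribute such as income does not. The data below are hypothetical and the function names are illustrative.

```python
# Hypothetical example: three people, each an ((x, y) location, income) pair.
people = [((0.0, 0.0), 30000), ((10.0, 0.0), 50000), ((0.0, 10.0), 40000)]

def mean_center(pts):
    """Mean center of a set of (x, y) points."""
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

def mean_income(data):
    return sum(v for _, v in data) / len(data)

locs = [p for p, _ in people]
print(mean_center(locs))      # depends on the locations
print(mean_income(people))    # does not

# Move everyone 5 units east: the center moves, average income is unchanged.
moved = [((x + 5.0, y), v) for (x, y), v in people]
print(mean_center([p for p, _ in moved]))
print(mean_income(moved))
```

Mean center is a spatial technique by the first definition; mean income is not.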
"A set of techniques requiring access both to the locations of objects and also to their attributes"
  • requires methods for describing locations (i.e. a GIS)
  • some techniques do not look at attributes
  • mapping is a form of spatial analysis?
Is spatial analysis the ultimate objective of GIS?
 

Some books on spatial analysis:

  • Anselin L (1988) Spatial Econometrics: Methods and Models. Kluwer
  • Bailey T C, Gatrell A C (1995) Interactive Spatial Data Analysis. Harlow: Longman Scientific & Technical
  • Berry B J L, Marble D F (1968) Spatial Analysis: A Reader in Statistical Geography. Prentice-Hall
  • Boots B N, Getis A (1988) Point Pattern Analysis. Sage
  • Burrough P A, McDonnell R A (1998) Principles of Geographical Information Systems. New York: Oxford University Press
  • Cliff A D, Ord J K (1973) Spatial Autocorrelation. Pion
  • Cliff A D, Ord J K (1981) Spatial Processes: Models and Applications. Pion
  • Fischer M, Scholten H J, Unwin D J, editors (1996) Spatial Analytical Perspectives on GIS. London: Taylor & Francis
  • Fotheringham A S, O'Kelly M E (1989) Spatial Interaction Models: Formulations and Applications. Kluwer
  • Fotheringham A S, Rogerson P A (1994) Spatial Analysis and GIS. Taylor and Francis
  • Fotheringham A S, Wegener M (2000) Spatial Models and GIS: New Potential and New Models. London: Taylor and Francis
  • Fotheringham A S, Brunsdon C, Charlton M (2000) Quantitative Geography: Perspectives on Spatial Data Analysis. London: SAGE
  • Getis A, Boots B N (1978) Models of Spatial Processes: An Approach to the Study of Point, Line and Area Patterns. Cambridge University Press
  • Ghosh A, Ingene C A (1991) Spatial Analysis in Marketing: Theory, Methods, and Applications. JAI Press
  • Ghosh A, Rushton G (1987) Spatial Analysis and Location-Allocation Models. Van Nostrand Reinhold
  • Goodchild M F (1986) Spatial Autocorrelation. CATMOG 47, GeoBooks
  • Griffith D A (1987) Spatial Autocorrelation: A Primer. Association of American Geographers
  • Griffith D A (1988) Advanced Spatial Statistics. Special Topics in the Exploration of Quantitative Spatial Data Series. Kluwer
  • Haggett P, Chorley R J (1970) Network Analysis in Geography. St Martin's Press
  • Haggett P, Cliff A D, Frey A (1977) Locational Methods. Wiley
  • Haggett P, Cliff A D, Frey A (1978) Locational Models. Wiley
  • Haining R P (1990) Spatial Data Analysis in the Social and Environmental Sciences. Cambridge University Press
  • Harries K (1999) Mapping Crime: Principle and Practice. Washington, DC: Crime Mapping Research Center, Department of Justice
  • Haynes K E, Fotheringham A S (1984) Gravity and Spatial Interaction Models. Sage
  • Hodder I, Orton C (1979) Spatial Analysis in Archaeology. Cambridge: Cambridge University Press
  • Leung Y (1988) Spatial Analysis and Planning under Imprecision. Amsterdam: North Holland
  • Longley P A, Batty M, editors (1996) Spatial Analysis: Modelling in a GIS Environment. Cambridge: GeoInformation International
  • Mitchell, A (1999) The ESRI Guide to GIS Analysis, Volume 1: Geographic Patterns and Relationships. ESRI Press
  • Odland J (1988) Spatial Autocorrelation. Sage
  • Raskin R G (1994) Spatial Analysis on the Sphere: A Review. Santa Barbara, CA: National Center for Geographic Information and Analysis
  • Ripley B D (1981) Spatial Statistics. Wiley
  • Ripley B D (1988) Statistical Inference for Spatial Processes. Cambridge University Press
  • Taylor P J (1977) Quantitative Methods in Geography: An Introduction to Spatial Analysis. Houghton Mifflin
  • Unwin D (1981) Introductory Spatial Analysis. Methuen
  • Upton G J G, Fingleton B (1985) Spatial Data Analysis by Example. Wiley


Geographic Information Systems and Science

Paul Longley, Mike Goodchild, David Maguire, and David Rhind
Wiley, 2001
Some background slides:
Landsat image of New York area

Indianapolis database

Snow map of Soho, 1854

the pump
Openshaw GAM map of NE England

Atlantic Monthly mystery map

Northridge earthquake epicenters

Environmental justice in LA

World map

England and Wales demography

South Wales demography

Vandenberg service station

Service station subsurface

Service station plume
 
 

How does an analyst/modeler/decision-maker work with a GIS?

What tools exist for helping/conceptualizing/problem-solving?

Assumption: these (analysis, modeling, decision-making) are the primary purposes of GIS technology.
 

The cost of input to a GIS is high, and can only be justified by the benefits of analysis/modeling/decision-making performed with the data.

  • 60 polygons per hour = $1 per polygon
  • estimates as high as $40 per polygon
  • 500,000 polygon database costs $500,000 to create using the low estimate
  • $20m using the high estimate
What types of analysis can justify these costs?
  • Query (if it is faster than manual lookup)
    • very repetitive
    • highly trained user
  • Analyses which are simple in nature but difficult to execute manually
    • overlay (topological)
    • map measurement, particularly area
    • buffer zone generation
  • Analyses which can take advantage of GIS capabilities for data integration
    • global science
  • Browsing/plotting independently of map boundaries and with zoom/scale-change
    • seamless database
    • need for automatic generalization
    • editing
  • Complex modeling/analysis (based on the above and extensions)
The list of possibilities is endless
  • List of generic GIS functions has 75 entries
  • ESRI's ARC/INFO has over 1000 commands/functions
How can we organize/conceptualize the possibilities?
  • A taxonomy/classification of GIS functions
  • A customized view of a spatial database designed for the needs of the analyst/modeler
  • A set of tools to support analysis and database manipulation
  • Associated tools for defining needs in the analysis/modeling area, and testing systems against those needs
  • Methods for dealing with problems associated with analysis/modeling of spatial databases, particularly error/inaccuracy



A geographical data model consists of the set of entities and relationships used to create a representation of the geographical world. The choices made when the world is modeled determine how the database is structured, and what kinds of analysis can be done with it. These choices occur when the data are captured in the field, recorded, mapped, digitized, and processed.

There are two distinct ways of conceiving of the geographical world.

In the field view, the world is conceived as a finite set of variables, each having a single value at every point on the Earth's surface (or every point in a three-dimensional space; or a four-dimensional space if time is included).

Examples of fields: elevation, temperature, soil type, vegetation cover type, land ownership

Some field-like phenomena: elevation, spectral response

To be represented digitally, a field must be constructed out of primitive one, two, three, or four-dimensional objects. There are six ways of representing fields in common use in GIS:
raster (a rectangular array of homogeneous cells)

grid (a rectangular array of sample points)

irregular points

digitized contours

polygons

TIN

Other methods can be found in environmental modeling, but not commonly in GIS.
finite element methods
splines
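As a concrete sketch of one common field representation, a grid of sample points can return a value at any point by bilinear interpolation, matching the definition of a field (one value everywhere). The function and variable names below are illustrative, not any particular system's API.

```python
def bilinear(grid, x0, y0, cell, x, y):
    """Interpolate a field stored as a regular grid of samples.
    grid[row][col] holds sample values; (x0, y0) is the lower-left
    sample location; cell is the spacing between samples."""
    cx = (x - x0) / cell          # fractional column
    cy = (y - y0) / cell          # fractional row
    c, r = int(cx), int(cy)
    fx, fy = cx - c, cy - r
    z00, z10 = grid[r][c], grid[r][c + 1]
    z01, z11 = grid[r + 1][c], grid[r + 1][c + 1]
    # interpolate along x on both rows, then along y
    return (z00 * (1 - fx) + z10 * fx) * (1 - fy) + \
           (z01 * (1 - fx) + z11 * fx) * fy

elev = [[10.0, 20.0],
        [30.0, 40.0]]   # 2 x 2 grid of spot elevations, 100 m spacing
print(bilinear(elev, 0.0, 0.0, 100.0, 50.0, 50.0))   # midpoint -> 25.0
```

The interpolated values between samples are artifacts of the representation, a point taken up again below.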


The field view underlies the following ESRI implementation models:

coverage

TIN

grid

but not shapefiles
in the Arc8 Geodatabase the distinction can be implemented in object behaviors


In the discrete object view an otherwise empty space is littered with objects, each of which has a series of attributes. Any point in space (two, three, or four dimensional) can lie in any number of discrete objects, including zero, and objects can therefore overlap, and need not exhaust the space.

objects can be counted
how many mountains are there in Scotland?
what's a mountain?
objects can be manipulated
they maintain integrity as they move
objects are homogeneous
the whole thing is the object
parts can inherit properties from the whole
Field and discrete object views can be implemented in either raster or vector forms
compare manipulation of shapefiles (objects) and coverages (fields)
the distinction concerns how the world is conceived, and the rules governing object behavior

a field can be represented as raster cells, points (e.g., spot heights), triangles (TIN), lines (contours), or areas (land ownership)

in many of these cases the primitive elements are not real (cannot be located on the ground), but are artifacts of the representation
If we ignore the field/discrete object distinction we may easily apply meaningless forms of analysis
buffer makes sense only for discrete objects

interpolation makes sense only for fields
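The discrete-object view described above - any point can lie in zero, one, or several overlapping objects - can be sketched directly. The objects here are hypothetical circles with attributes; real object classes would of course use richer geometry.

```python
import math

# Two overlapping circular objects (names and attributes are illustrative).
lakes = [
    {"name": "A", "center": (0.0, 0.0), "radius": 5.0},
    {"name": "B", "center": (3.0, 0.0), "radius": 5.0},   # overlaps A
]

def containing(objects, x, y):
    """Return the objects that contain point (x, y) - may be none or many."""
    return [o for o in objects
            if math.hypot(x - o["center"][0], y - o["center"][1]) <= o["radius"]]

print(len(containing(lakes, 1.5, 0.0)))    # inside both objects -> 2
print(len(containing(lakes, 100.0, 0.0)))  # in the empty space -> 0
```

A field representation could not return "two values here, none there": that is the behavioral difference between the two views.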


Attributes can be of several types:
 

numeric
alphanumeric

quantitative
qualitative

nominal
ordinal
interval/ratio
cyclic
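Cyclic attributes (e.g. aspect or flow direction in degrees) need special care in analysis: a naive arithmetic mean of 350 and 10 degrees gives 180, the opposite of the sensible answer. A sketch of the usual fix, averaging unit vectors rather than the raw numbers:

```python
import math

def circular_mean(degrees):
    """Mean of a cyclic attribute: average the unit vectors, not the values."""
    sx = sum(math.cos(math.radians(d)) for d in degrees)
    sy = sum(math.sin(math.radians(d)) for d in degrees)
    return math.degrees(math.atan2(sy, sx)) % 360.0

print(circular_mean([350.0, 10.0]))   # ~0 (a naive mean gives 180)
```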
 


Spatial objects are distinguished by their dimensions or topological properties:

        points (0-cells)
        lines (1-cells)
        areas (2-cells)
        volumes (3-cells)
 

A class of objects is a set with the same topological properties (e.g. all points) and with the same set of attributes (e.g. a set of wells or quarter sections or roads). In the Arc8 Geodatabase a class also has the same behaviors, and may inherit behaviors from other classes. A class is associated with an attribute table.

Geodatabase introduces a consistent set of terms for primitive geometric objects


When a class represents a field, certain rules apply to the component objects. The objects belonging to one class of area or volume objects will fill the area and will not overlap (they are space-exhausting, they partition or tessellate the space, they are planar enforced).

         the layer provides one value at every point (recall the definition of a field)

  • e.g. soil type
  • e.g. elevation
  • e.g. zoning
Slide: Planar enforcement

Spatial objects are abstractions of reality. Some objects are well-defined (e.g. road, bridge) but others are not. Objects representing a discrete entity view tend to be well-defined; objects representing a field are not.

  • A TIN or DEM is an approximation to a topographic surface, with an accuracy which is usually undetermined. Even if accuracy is known at the sampled points, it is unknown between them.
  • We assume that all of the points within an area object have the attributes ascribed to the object. In reality the area inside the object is not homogeneous, and the boundaries are zones of transition rather than sharp discontinuities (e.g. soil maps, climatic zones, geological maps).
A topographic surface can be represented as either a TIN or a DEM.
 

Slides: Elevation model options

digital elevation model (raster)

digitized contours

triangular mesh

TIN

Advantages of TIN:
 
  • sampling intensity can adapt to local variability
  • many landforms are approximated well by triangular mosaics
  • triangles can be rendered quickly by graphics processors
Advantages of DEM:
  • uniform sampling intensity is suited to automatic data collection via e.g. analytical stereoplotter
  • many applications require uniform-sized spatial objects.
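Within one TIN facet, the height at any point is interpolated from the three vertex heights. A minimal sketch using barycentric weights (triangle and heights are hypothetical):

```python
def tin_height(tri, z, x, y):
    """Interpolate height at (x, y) inside a TIN triangle.
    tri: three (x, y) vertices; z: the corresponding vertex heights."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (y2 - y3) * (x1 - x3) + (x3 - x2) * (y1 - y3)
    w1 = ((y2 - y3) * (x - x3) + (x3 - x2) * (y - y3)) / det
    w2 = ((y3 - y1) * (x - x3) + (x1 - x3) * (y - y3)) / det
    w3 = 1.0 - w1 - w2
    # weighted average of the three vertex heights
    return w1 * z[0] + w2 * z[1] + w3 * z[2]

tri = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
print(tin_height(tri, (100.0, 200.0, 300.0), 0.0, 0.0))   # at vertex 1 -> 100.0
```

The surface is linear within each facet, which is why triangular mosaics approximate many landforms well and render quickly.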
A spatial database consists of a number of classes of spatial objects with associated attribute tables.
 

The methods used to store the attribute and locational information about the objects are not of immediate concern to the analyst/modeler.

  • In fact this object/attribute view of the database may have little in common with the actual data structures/models used by the system designer.

A database encodes and represents the complex relationships which exist between objects.

  • spatial relationships
  • functional relationships
  • A GIS must be capable of computing these relationships through such geometrical operations as intersection.

 Spatial relationships include:

  • Relationships between objects of different classes
  • Relationships between objects of the same class

The potential set of relationships within a complex spatial database is enormous. No system can afford to compute and store all of them in the database.
 

A cartographic data structure stores no spatial relationships among objects.

  • Since it must compute any relationship as and when needed it is inefficient for complex spatial analyses.
A topological data structure stores certain spatial relationships among objects. Common stored relationships are:
  • ID of incident links stored as attributes of nodes in line networks


UML relationship types

association
a functional linkage between objects in different classes
aggregation and composition
linkage between an object and its component objects
type inheritance
classes inherit properties from more general classes
 
Relations between objects

An object pair is a combination of objects of the same or different types/classes which may have its own attributes.

  • e.g. the hydrologic relationship between a spring and a sink may have attributes (direction, volume of flow, flow through time) but may not exist as a spatial object itself.
The ability to generate object pairs, give them attributes and include them in analysis is an important component of a full GIS.
giving attributes to associations
Examples of object pairs:
  • Matrix of distances between pairs of objects
  • Hydrologic flows
  • Traffic flows between origin/destination pairs
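The distance-matrix example can be sketched as object pairs: each pair joins an object from one class to an object from another and carries its own attribute (here, distance). The classes and coordinates are hypothetical.

```python
import math

# Two object classes (illustrative): hotels and fishing sites, as id -> (x, y).
hotels = {"H1": (0.0, 0.0), "H2": (3.0, 4.0)}
sites = {"S1": (3.0, 0.0), "S2": (0.0, 4.0)}

# Object pairs: (hotel id, site id) -> distance attribute of the pair.
pairs = {(h, s): math.hypot(hx - sx, hy - sy)
         for h, (hx, hy) in hotels.items()
         for s, (sx, sy) in sites.items()}

print(pairs[("H1", "S1")])   # 3.0
print(pairs[("H2", "S1")])   # 4.0
```

Note that the pair is not itself a spatial object: it exists only in the attribute table keyed by the two object IDs.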


Object pairs in ESRI products

turntable (link-link pairs)

distance matrix (first object, second object, distance)

association class in UML

attributed relationship class in Geodatabase

Visio example


Example: Data Model for Traffic Routing

What are the essential components of a data model for route planning in a complex street network?

  • Links - attributes: length, street name, traffic count, terminal nodes. A street can be represented by a single link with attributes which include <one-way?>, or by a pair of links with associated directions - one of the pair being omitted in the case of one-way streets.
  • Nodes or intersections - attributes: incident links, presence of traffic light.
  • Turn prohibitions - attributes not of nodes or links, but of link/link object pairs (the "turntable").
  • Stop signs - attributes of link/node object pairs.

    Visio example
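The turntable idea has a direct consequence for routing: because turn prohibitions attach to link/link pairs, a shortest-path search must carry the incoming link as part of its state, not just the current node. A minimal sketch on a hypothetical four-link network (not ESRI's implementation; the start link's own length is counted in the cost):

```python
import heapq

# Directed links: id -> (from_node, to_node, length). Illustrative network.
links = {
    "a": ("A", "B", 1.0), "b": ("B", "C", 1.0),
    "c": ("B", "D", 1.0), "d": ("D", "C", 1.0),
}
prohibited = {("a", "b")}   # turntable: no turn from link a onto link b at B

def shortest(start_link, goal_node):
    """Dijkstra over link states, honoring the turntable."""
    f0, t0, l0 = links[start_link]
    pq = [(l0, start_link, t0)]
    best = {}
    while pq:
        cost, link, node = heapq.heappop(pq)
        if node == goal_node:
            return cost
        if best.get(link, float("inf")) <= cost:
            continue                      # already reached this link cheaper
        best[link] = cost
        for nxt, (f, t, l) in links.items():
            if f == node and (link, nxt) not in prohibited:
                heapq.heappush(pq, (cost + l, nxt, t))
    return None

print(shortest("a", "C"))   # forced around via D: 1 + 1 + 1 = 3.0
```

Without the prohibition the direct route a-b would cost 2.0; the turntable forces the detour.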

Data modeling examples

1. Design a database to capture and analyze data on recreational fishing in the Scottish Highlands, to support decision-making by the tourist industry and regulatory agencies. The database should be able to represent the following:

  • locations of fishing (rivers, lakes)
  • locations of accommodation (hotels, guest houses)
  • preferences and rights (fishing locations owned by hotels, locations accessible to hotels)
2. Design a database to support analysis and modeling of shoreline erosion on the Great Lakes. It is necessary to represent conditions and processes transverse to the shoreline in much more detail than variation parallel to the shoreline.

3. Design a database to support water resource analysis and planning for complex hydrographic networks that include streams, rivers, lakes and reservoirs.




GEOGRAPHIC INFORMATION SYSTEM FUNCTION DESCRIPTIONS
 

A. BASIC SYSTEM CAPABILITIES

A1 Digitizing (di)

Digitizing is the process of converting point and line data from source documents to a machine-readable format.

A2 Edgematching (ed)

Edgematching is the process of joining lines and polygons across map boundaries in creation of a "seamless" database. The join should be topological as well as graphic, that is, a polygon so joined should become a single polygon in the data base, a line so joined should become a single line segment.

A3 Polygonization (po)

Polygonizing is the process of connecting together arcs ("spaghetti") to form polygons.

A4 Labelling (la)

This process transfers labels describing the contents (attributes) of polygons, and the characteristics of lines and points, to the digital system. This input of labels must not be confused with the process of symbolizing and labelling output described below.

A5 Reformatting digital data for input from other systems (rf)

Data previously digitized are made accessible through an interface or converted by software to the system format, and made to be topologically useful as well as graphically compatible.

A6 Reformatting for output to other systems (ro)

This function is the inverse of the previous one. Internal data is reformatted to meet the requirements of other systems or standards.

A7 Data base creation and management (db)

Data is typically digitized from map-sheets, and may be edgematched. The creation of a true "seamless" database requires the establishment of a map sheet directory, and may include tiling to partition the database.

A8 Raster/vector conversion (rv)

The ability to convert data between vector and raster forms with grid cell size, position and orientation selected by the user.

A9 Edit and display on input (ei)

This function allows continuous display and editing of input data, usually in conjunction with digitizing.

A10 Edit and display on output (eo)

The ability to preview and edit displays before creation of hard copy maps.

A11 Symbolizing (sy)

To create high quality output from a GIS, it is necessary to be able to generate a wide variety of symbols to replace the primitive point, line and area objects stored in the database.

A12 Plotting (pl)

Creation of hard copy map output.

A13 Updating (up)

Updating of the digital data base with new points, lines, polygons and attributes.

A14 Browsing (br)

Browse is used to search the data base to answer simple locational queries, and includes pan and zoom.
 
 

B. DATA MANIPULATION AND ANALYSIS FUNCTIONS

B1 Create lists and reports (cl)

This is the ability to create lists and reports on objects and their attributes in user-defined formats, and to include totals and subtotals.

B2 Reclassify attributes (ra)

Reclassification is the change in value of a set of existing attributes based on a set of user specified rules.

B3 Dissolve lines and merge attributes (dm)

Boundaries between adjacent polygons with identical attributes are dissolved to form larger polygons.

B4 Line thinning and weeding (lt)

This process is used to reduce the number of points defining a line or set of lines to a user defined tolerance.

B5 Line smoothing (ls)

Automatically smooth lines to a user-defined tolerance, creating a new set of points (compare B4).

B6 Complex generalization (cg)

Generalization which may require change in the type of an object, or relocation in response to cartographic rules.

B7 Windowing (wi)

The ability to clip features in the database to some defined polygon.

B8 Centroid calculation and sequential numbering (cn)

Calculate a contained, representative point in a polygon and assign a unique number to the new object.

B9 Spot heights (sh)

Given a digital elevation model, interpolate the height at any point.

B10 Heights along streams (hs)

Given a digital elevation model and a hydrology net, interpolate points along streams at fixed increments of height.

B11 Contours (isolines) (ci)

Given a set of regularly or irregularly spaced point values, interpolate contours at user-specified intervals.

B12 Elevation polygons (ep)

Given a digital elevation model, interpolate contours of height at user-specified intervals.

B13 Watershed boundaries (wb)

Given a digital elevation model and a hydrology net, interpolate the position of the watershed between basins.

B14 Scale change (sc)

Perform the operations associated with change of scale, which may include line thinning and generalization.

B15 Rubber sheet stretching (rs)

The ability to stretch one map image to fit over another, given common points of known locations.
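In its simplest form, rubber sheeting fits an affine transform to control points with known source and target locations. The sketch below assumes exactly three control points, so the fit is exact (real systems use more points with least squares, and often higher-order transforms); names are illustrative.

```python
def affine_from_controls(src, dst):
    """Fit x' = a*x + b*y + c, y' = d*x + e*y + f to three control points."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    def solve(v1, v2, v3):
        # Cramer's rule on the system [[x1,y1,1],[x2,y2,1],[x3,y3,1]]
        p = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        q = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        r = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return p, q, r
    a, b, c = solve(dst[0][0], dst[1][0], dst[2][0])
    d, e, f = solve(dst[0][1], dst[1][1], dst[2][1])
    return a, b, c, d, e, f

def apply(t, x, y):
    a, b, c, d, e, f = t
    return a * x + b * y + c, d * x + e * y + f

# A pure shift by (10, 20), recovered exactly from the three controls.
t = affine_from_controls([(0, 0), (1, 0), (0, 1)],
                         [(10, 20), (11, 20), (10, 21)])
print(apply(t, 5.0, 5.0))   # -> (15.0, 25.0)
```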

B16 Distortion elimination (de)

The ability to remove various types of systematic distortion generated by different input methods.

B17 Projection change (pc)

The ability to transform maps from one map projection to another.

B18 Generate points (gp)

The ability to generate points and insert them in the database.

B19 Generate lines (gl)

The ability to generate lines and insert them in the database.

B20 Generate polygons (ga)

The ability to generate polygons and insert them in the database.

B21 Generate circles (gc)

The ability to generate circles defined by center point and radius.

B22 Generate grid cell nets (gg)

The ability to generate a network of grid cells given a point of origin, grid cell dimension and orientation.

B23 Generate latitude/longitude nets (gn)

The ability to generate graticules for a variety of map projections.

B24 Generate corridors (gb)

This process generates corridors of given width around existing points, lines or areas.
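In a vector GIS the corridor around a point is approximated by a polygon whose vertices lie on a circle of the given width. A sketch (the vertex count n is an arbitrary smoothness parameter):

```python
import math

def point_buffer(x, y, radius, n=32):
    """Approximate a circular buffer around a point with an n-vertex polygon."""
    return [(x + radius * math.cos(2 * math.pi * i / n),
             y + radius * math.sin(2 * math.pi * i / n))
            for i in range(n)]

ring = point_buffer(0.0, 0.0, 100.0)
print(len(ring))   # 32 vertices, all 100 units from the center
```

Corridors around lines and areas are built similarly, by offsetting each segment and capping the ends.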

B25 Generate graphs (gr)

Create a graph illustrating attribute data by symbols, bars or fitted trend line.

B26 Generate viewshed maps (gv)

Given a digital elevation model and the locations of one or more viewpoints, generate polygons enclosing the area visible from at least one viewpoint.

B27 Generate perspective views (ge)

From a digital elevation model, generate a three-dimensional block diagram.

B28 Generate cross sections (cs)

Given a digital elevation model, show the cross-section along a user-specified line.

B29 Search by attribute (sa)

The ability to search the data base for objects with certain attributes.

B30 Search by region (sr)

The ability to search the data base within any region defined to the system.

B31 Suppress (su)

The ability to exclude objects by attribute (the converse of selecting by attribute).

B32 Measure number of items (mi)

The ability to count the number of objects in a class.

B33 Measure distances along straight and convoluted lines (md)

The ability to measure distances along a prescribed line.

B34 Measure length of perimeter of areas (mp)

The ability to measure the length of the perimeter of a polygon.

B35 Measure size of areas (ma)

The ability to measure the area of a polygon.
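Polygon area from vertex coordinates is usually computed with the shoelace formula; a minimal sketch (vertices in order around a simple, non-self-intersecting polygon; the absolute value makes orientation irrelevant):

```python
def polygon_area(pts):
    """Shoelace formula: half the absolute sum of cross products."""
    s = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

print(polygon_area([(0, 0), (4, 0), (4, 3), (0, 3)]))   # 4 x 3 rectangle -> 12.0
```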

B36 Measure volume (mv)

The ability to compute the volume under a digital representation of a surface.

B37 Calculate - arithmetic (ca)

The ability to perform arithmetic, algebraic and Boolean calculations separately and in combination.

B38 Calculate bearings between points (cb)

The ability to calculate the bearing (with respect to True North) from a given point to another point.

B39 Calculate vertical distance or height (ch)

Given a digital elevation model, calculate the vertical distance (height) between two points.

B40 Calculate slopes along lines (gradients) (al)

The ability to measure the slope between two points of known height and location or to calculate the gradient between any two points along a convoluted line which contains two or more points of known elevation.

B41 Calculate slopes of areas (sl)

Given a digital elevation model and the boundary of a specified region (e.g., a part of a watershed), calculate the average slope of the region.

B42 Calculate aspect of areas (aa)

Given a digital elevation model and the boundary of a specified region, calculate the average aspect of the region.

B43 Calculate angles and distances along linear features (ad)

Given a prescribed linear feature, generalize its shape into a set of angles and distances from a start point, at user-set angular increments, and constrained to any known points along the linear feature.

B44 Subdivide area according to a set of rules (sb)

Given the corner points of a rectangular area, topologically subdivide the area into four quarters.

B45 Locations from traverses (lo)

Given a direction (one of eight radial directions) and distance from a given point, calculate the end point of the traverse.

B46 Statistical functions (sf)

The ability to carry out simple statistical analyses and tests on the database.

B47 Graphic overlay (go)

The ability to superimpose graphically one map on another and display the result on a screen or on a plot.

B48 Point in polygon (pp)

The ability to superimpose a set of points on a set of polygons and determine which polygon (if any) contains each point.
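The point-in-polygon test is most often implemented by ray casting; a minimal sketch (the choice of algorithm is mine, the text does not prescribe one, and behavior for points exactly on a boundary is left undefined):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count crossings of a rightward ray from (x, y)
    with the polygon's edges; an odd count means the point is inside.
    polygon: list of (x, y) vertices, closing edge implied."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside
```

Running each point of a point set through this test against each polygon yields the point-in-polygon overlay described above.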

B49 Line on polygon overlay (lp)

The ability to superimpose a set of lines on a set of polygons, breaking the lines at intersections with polygon boundaries.

B50 Polygon overlay (op)

The ability to overlay digitally one set of polygons on another and form a topological intersection of the two, concatenating the attributes.

B51 Sliver polygon removal (sp)

The ability to delete automatically the small sliver polygons which result from a polygon overlay operation when certain polygon lines on the two maps represent different versions of the same physical line.

B52 Line of sight (ln)

The ability to determine the intervisibility of two points, or to determine those parts of pairs of lines or polygons which are intervisible.

B53 Nearest neighbor search (nn)

The ability to identify points, lines or polygons that are nearest to points, lines or polygons specified by location or attribute.

B54 Shortest route (ps)

The ability to determine the shortest or minimum cost route between two points or specified sets of points.
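A sketch of the minimum-cost computation using Dijkstra's algorithm on a node-and-link network (the graph structure is illustrative; link costs could equally be travel time or money):

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm: minimum total link cost from start to goal.
    graph: dict mapping node -> list of (neighbor, link_cost)."""
    best = {start: 0.0}                 # best known cost to each node
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue                    # stale queue entry, skip
        for nbr, c in graph.get(node, []):
            new = cost + c
            if new < best.get(nbr, float("inf")):
                best[nbr] = new
                heapq.heappush(queue, (new, nbr))
    return float("inf")                 # goal unreachable
```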

B55 Contiguity analysis (co)

The ability to identify areas that have a common boundary or node.

B56 Connectivity analysis (cy)

The ability to identify areas or points that are (or are not) connected to other areas or points by linear features.

B57 Complex correlation (cx)

The ability to compare maps representing different time periods, extracting differences or computing indices of change.

B58 Weighted modelling (wm)

The ability to assign weighting factors to individual data sets according to a set of rules and to overlay those data sets and carry out reclassify, dissolve and merge operations on the resulting concatenated data set.

B59 Scene generation (sg)

The ability to simulate an image of the appearance of an area from map data. The image would normally consist of an oblique view, with perspective.

B60 Network analysis (na)

Simple forms of network analysis are covered in Shortest route and Connectivity. More complex analyses are frequently carried out on network data by electrical and gas utilities, communications companies etc. These include the simulation of flows in complex networks, load balancing in electrical distribution, traffic analysis, and computation of pressure loss in gas pipes. In many cases these capabilities can be found in existing packages which can be interfaced to the GIS database.
 
 

Other groupings of GIS functions:

Berry, J.K., 1987, "Fundamental operations in computer-assisted map analysis". International Journal of Geographical Information Systems 1(2): 119-136.

  • Reclassifying maps
  • Overlaying maps
  • Measuring distance and connectivity
  • Characterizing neighborhoods
Goodchild, M.F., 1988, "Towards an enumeration and classification of GIS functions". Proceedings, IGIS '87
 

Tomlin, Dana, 1990. Geographic Information Systems and Cartographic Modeling. Prentice Hall.
based on a standard, semi-formal taxonomy of analytic functions for raster data

  • Local: operations that process a single cell
  • Focal: operations that process a cell and a fixed neighborhood around it
  • Zonal: operations that process an area of homogeneous characteristics
  • Global: operations that process the entire map
Maguire, David, 1991. Chapter 21: The Functionality of GIS. In D.J. Maguire, M.F. Goodchild and D.W. Rhind, editors, Geographical Information Systems: Principles and Applications. Longman, London.
  • Capture
  • Transfer
  • Validate and Edit
  • Store and Structure
  • Restructure
  • Generalize
  • Transform
  • Query
  • Analyze
  • Present


A Six-way Classification of Spatial Analysis

1. Query and reasoning

based on database views
catalog

map

table

histogram

scatterplot

linked views


2. Measurement

simple geometric measurements associated with objects
area, distance, length, perimeter, shape
3. Transformation
buffers

point in polygon

polygon overlay

interpolation

density estimation

4. Descriptive summaries
centers

dispersion

spatial dependence

fragmentation

5. Optimization
best routes
raster version

network version

Paul's ride

best locations
6. Hypothesis testing
inference from sample to population

Integration of GIS and Spatial Analysis

1. Full integration (embedding)

  • spatial analysis as GIS commands
  • requires modification of source code
    • difficult with proprietary packages
    • analysis is not the strongest commercial motivation
  • third party macros, scripting languages
  • COM components and VBA
2. Loose coupling
  • what we have now
  • unsatisfactory
    • hooks too awkward
    • loss of higher structures in data
  • transfer of simple tables
3. Close coupling
  • better hooks
  • common data models
  • discretization problem
    • discretization often not explicit in models
    • e.g. slope, length
  • user interface design
    • models easy to use?
    • the user-friendly grand piano
    • user community is already frustrated





SECTION 2
SPATIAL STATISTICS




 

 

 

 


Section 2 - Spatial statistics - Simple measures for exploring geographic information - The value of the spatial perspective on data - Intuition and where it fails - Applications in crime analysis, emergencies, incidence of disease:

  • Measures of spatial form - centrality, dispersion, shape.
  • Spatial interpolation - intelligent spatial guesswork - spatial outliers.
  • Exploratory spatial analysis - moving windows, linking spatial and other perspectives.
  • Hypothesis tests - randomness, the null hypothesis, and how intuition can be misleading.
Measures of spatial form:
 

How to sum up a geographical distribution in a simple measure?
 

Two concepts of space are relevant:
 

Continuous:

  • travel can occur anywhere
    • best for small scales, or where a network is too complex or too costly to capture or represent
  • an infinite number of locations exist
    • a means must exist to calculate distances between any pair of locations, e.g. using straight lines
Discrete:
  • travel can occur only on a network
  • only certain locations (on the network) are feasible
    • all distances (between all possible pairs of locations) can be evaluated using any measure (travel time, cost of transportation etc.)
In discrete space places are identified as objects; in continuous space, places are identified by coordinates

A metric is a means of measuring distance between pairs of places (in continuous space)

  • e.g. straight lines (the Pythagorean metric)
  • e.g. by moves in N-S and E-W directions (the Manhattan or city-block metric)
  • simple metrics can be improved using barriers or routes of lower travel cost (freeways)
The most useful single measure of a geographical distribution of objects is its center

Definitions of center:

The centroid

  • the average position
  • computed by taking a weighted average of coordinates
  • the point about which the distribution would balance
  • the basis for the US Center of Population (now in MO and still moving west)
The centroid is not the point for which half of the distribution is to the left, half to the right, half above and half below
  • this is the bivariate median
The centroid is not the point that minimizes aggregate distance (the point such that, if the objects were people and they all traveled to it, the total distance traveled would be a minimum)
  • this is the point of minimum aggregate travel (MAT), sometimes called the median (very confusingly)
  • for many years the US Bureau of the Census calculated the Center of Population as the centroid, but gave the MAT definition
  • there is a long history of confusion over the MAT
  • no ready means exist to calculate its location
    • the MAT must be found by an iterative process
    • an interesting way of finding the MAT makes use of a valid physical analogy to the resolution of forces - the Varignon frame
  • on a network, the MAT is always at a node (junction or point where there is weight)
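The iterative process mentioned above can be sketched with Weiszfeld's algorithm, which repeatedly re-averages the coordinates with inverse-distance weights, starting from the centroid (a minimal illustration, not a production implementation):

```python
import math

def mat_weiszfeld(points, weights, iterations=200):
    """Approximate the point of minimum aggregate travel (MAT) by
    Weiszfeld's iterative scheme. points: list of (x, y); weights: list."""
    # start from the centroid (weighted average of the coordinates)
    wsum = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / wsum
    y = sum(w * py for (_, py), w in zip(points, weights)) / wsum
    for _ in range(iterations):
        num_x = num_y = den = 0.0
        for (px, py), w in zip(points, weights):
            d = math.hypot(px - x, py - y)
            if d == 0:
                continue          # skip to avoid division by zero
            num_x += w * px / d   # inverse-distance reweighting
            num_y += w * py / d
            den += w / d
        x, y = num_x / den, num_y / den
    return x, y
```

The centroid minimizes the sum of weighted squared distances; this iteration drives the estimate toward the point minimizing the sum of weighted distances instead.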
 The definition of centrality becomes more difficult on the sphere
  • e.g. the centroid is below the surface
    • the centroid of the Canadian population in 1981 was about 90km below White River, Ontario
    • the bivariate median (defined by latitude and longitude) was at the intersection of the meridian passing through Toronto and the parallel through Montreal, near Burke's Falls, Ontario
    • the MAT point (assuming travel on the surface by great circle paths) was in a school yard in Richmond Hill, Ontario
What use are centers?
  • for tracking change in geographic distributions, e.g. the march of the US Center of Population westward is still worth national news coverage
  • for identifying most efficient locations for activities
    • location at the MAT minimizes travel
    • a central facility should be located to minimize travel to the geographic distribution that it serves
    • should we use continuous or discrete space?
    • this technique was considered so important to central planning in the Soviet Union in the early 20th century that an entire laboratory was founded
      • the Mendeleev Centrographic Laboratory flourished in Leningrad around 1925
  • centers are often used as simplifications of complex objects
    • at the lower levels of census geography in many countries
      • e.g. ED in US, EA in Canada, ED in UK
    • to avoid the expense of digitizing boundaries
      • e.g. land parcel databases
    • or where boundaries are unknown or undefined
      • e.g. ZIPs
  • in the census examples, common practice is to eyeball a centroid
  • some very effective algorithms have been developed for redistributing population from centroids
Measures of dispersion:
  • what you would want to know next if you could have a second measure of a geographical distribution, after its center
  • the spread of the distribution around its center
  • average distance from the center
  • measures of dispersion are used to indicate positional accuracy
    • the error ellipse
    • the Tissot indicatrix
    • the CMAS
Potential measures:
  • a measure which increases with the weight of geographic objects and with proximity to them
  • calculated as:
    V = summation over i of (w(i)/d(i))

where i indexes the objects, d(i) is the distance from the evaluation point to object i, and w(i) is the object's weight

the summation can be carried out at any location

V can be mapped - a "potential" map

Potential is a useful measure of:
  • the market share obtainable by locating at a point
  • the best location is the place of maximum potential
  • population pressure on a recreation facility
  • accessibility to a geographic distribution
    • e.g. a network of facilities
  • potential measures omit the "alternatives" factor
    • imply that market share can potentially increase without limit
  • potential measures have been used as predictors of growth
    • economic growth most likely in areas of highest potential
  • potential calculation exists as a function in SPANS GIS
  • the objects used to calculate potential must be discrete in an empty space
    • adding new objects will increase potential without limit
    • it makes no sense to calculate potential for a set of points sampled from a field
    • potential makes sense only in the object view
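A direct sketch of the potential calculation (illustrative data; note that V is undefined, in effect infinite, at an object's own location, so it should be evaluated at other points):

```python
import math

def potential(x, y, objects):
    """Potential V at (x, y): summation of weight/distance over all
    objects. objects: list of (px, py, weight) tuples."""
    return sum(w / math.hypot(px - x, py - y) for px, py, w in objects)
```

Evaluating this on a grid of locations produces the "potential" map described above; the location of the maximum is the most accessible point.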
Potential measures and density estimation
think of a scatter of points representing people
how to map the density of people?

replace each dot by a pile of sand, superimposing the piles

the amount of sand at any point represents the number and proximity of people

the shape of the pile of sand is called the kernel function

example - density estimation and Chicago crime
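The pile-of-sand description can be sketched with a quartic (biweight) kernel, one common choice of kernel function; the kernel and bandwidth here are illustrative assumptions, not prescribed by the text:

```python
import math

def kernel_density(x, y, points, bandwidth):
    """Density at (x, y): superimposed 'piles of sand' (quartic kernels)
    centred on each point; each kernel integrates to 1."""
    norm = 3.0 / (math.pi * bandwidth ** 2)   # quartic kernel normalizer
    total = 0.0
    for px, py in points:
        u2 = ((px - x) ** 2 + (py - y) ** 2) / bandwidth ** 2
        if u2 < 1.0:                          # kernel has finite radius
            total += norm * (1.0 - u2) ** 2
    return total
```

A broader bandwidth spreads each pile of sand further, giving a smoother density surface.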

 

Measures of shape:
 

  • shape has many dimensions, no single measure can capture them all
  • many measures of shape try to capture the difference between compact and distended
    • many of these are based on a comparison of the shape's perimeter with that of a circle of the same area
    • e.g. shape = perimeter / (3.54 * sqrt(area)), where 3.54 approximates 2*sqrt(pi), the perimeter of a circle of unit area
    • this measure is 1.0 for a circle, larger for a distended shape
  • all of these measures based on perimeter suffer from the same problem
    • within a GIS, lines and boundaries are represented as straight line segments between points
    • this will almost always result in a length that is shorter than the real length of the curve, unless the real shape is polygonal
    • consequently the measure of shape will be too low, by an undetermined amount
    • shape (compactness) is a useful measure to detect gerrymanders in political districting
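A sketch of this compactness measure for a polygon given as a vertex list; the shoelace formula for the area is an implementation choice of mine, not from the text:

```python
import math

def compactness(vertices):
    """Perimeter / (2*sqrt(pi*area)): 1.0 for a circle, larger for a
    distended shape. vertices: (x, y) list, closing edge implied."""
    n = len(vertices)
    area2 = 0.0          # twice the signed area (shoelace formula)
    perim = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        area2 += x1 * y2 - x2 * y1
        perim += math.hypot(x2 - x1, y2 - y1)
    area = abs(area2) / 2.0
    return perim / (2.0 * math.sqrt(math.pi * area))
```

Because the digitized perimeter understates the true perimeter, as noted above, values computed this way are systematically a little too low.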





Spatial Interpolation

Spatial interpolation is defined as a process of determining the characteristics of objects from those of nearby objects

  • of guessing the value of a field at locations where value has not been measured
The objects are most often points (sample observations) but may be lines or areas

The attributes are most often interval-scaled (elevations) but may be of any type

From a GIS perspective, spatial interpolation is a process of creating one class of objects from another class

Spatial interpolation is often embedded in other processes, and is often used as part of a display process

  • e.g. to contour a surface from a set of sample points, it is necessary to use a method of spatial interpolation to determine where to place the contours among the points
Many methods of spatial interpolation exist:
 
 

Distance-weighted interpolation

Known values exist at n locations i=1,...,n

The value at a location xi is denoted by z(xi)

We need to guess the value at location x, denoted by z(x)

The guessed value is an average over the known values at the sample points

  • the average is weighted by distance so that nearby points have more influence.
Let d(xi,x) denote the distance from location x, where we want to make a guess, to the ith sample point.

Let w[d] denote the weight given to a point at distance d in calculating the average.
 
 

The estimate at x is calculated as:
 

z(x) = summation over every point i (w[d(xi,x)] z(xi)) / summation over every point i (w[d(xi,x)])
 

in other words, the average weighted by distance.
 

The simplest kind of weight is a switch - a weight of 1 is given to any points within a certain distance of x, and a weight of 0 to all others

  • this means in effect that z(x) is calculated as the average over points within a window of a certain radius.
Better methods include weights which are continuous, decreasing functions of distance such as an inverse square:

w[d] = d^-2

All of the distance-weighted methods (e.g. IDW) share the same positive features and drawbacks. They are:

  • easy to implement and conceptually simple
  • adaptable - the weighting function can be changed to suit the circumstances. It is even possible to optimize the weighting function in this sense:
  • Suppose the weighting function has a parameter, such as the size of the window

    Set the window size to some test value

    Then select one of the sample points, and use the method to interpolate at that point by averaging over the remaining n-1 sample values

    Compare the interpolated value to the known value at that point. Repeat for all n points and average the errors

    Then the best window size (parameter value) is the one which minimizes total error

    In most cases this will be a non-zero and non-infinite value.
     
     

  • all interpolated values must lie between the minimum and maximum observed values, unless negative weights are used
    • This means that it is impossible to extrapolate trends
    • If there is no data point exactly at the top of a hill or the bottom of a pit, the surface will smooth out the feature
  • the interpolated surface cannot extrapolate a trend outside the area covered by the data points - the value at infinity must be the arithmetic mean of the data points
 Although distance-weighted methods underlie many of the techniques in use, they are far from ideal
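The weighted average defined above, with the inverse-square weight w[d] = d^-2, can be sketched as follows (sample data in the test are invented):

```python
def idw(x, y, samples, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).
    samples: list of (sx, sy, value). A sample exactly at (x, y) is
    returned directly, since its weight would be infinite."""
    num = den = 0.0
    for sx, sy, z in samples:
        d2 = (sx - x) ** 2 + (sy - y) ** 2
        if d2 == 0.0:
            return z
        w = d2 ** (-power / 2.0)   # w[d] = d^-power
        num += w * z
        den += w
    return num / den
```

The exponent (or a window radius) can be tuned by the leave-one-out procedure described above: interpolate at each sample point from the remaining n-1 values and choose the parameter minimizing total error.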
 
 
 

Polynomial surfaces

A polynomial function is fitted to the known values - interpolated values are obtained by evaluating the function
 

e.g. planar surface - z(x,y) = a + bx + cy
e.g. cubic surface - z(x,y) = a + bx + cy + dx^2 + exy + fy^2 + gx^3 + hx^2y + ixy^2 + jy^3
  • easy to use
  • useful only when there is reason to expect that the surface can be described by a simple polynomial in x and y
  • very sensitive to boundary effects
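Fitting the planar surface z(x,y) = a + bx + cy to the known values is an ordinary least-squares problem; a sketch using numpy (the data in the test are invented):

```python
import numpy as np

def fit_plane(xs, ys, zs):
    """Least-squares coefficients (a, b, c) of z = a + b*x + c*y."""
    A = np.column_stack([np.ones(len(xs)), xs, ys])
    coef, *_ = np.linalg.lstsq(A, np.asarray(zs, float), rcond=None)
    return coef                      # array([a, b, c])

def eval_plane(coef, x, y):
    """Interpolate by evaluating the fitted polynomial at (x, y)."""
    a, b, c = coef
    return a + b * x + c * y
```

Higher-order surfaces are fitted the same way by adding columns for x^2, xy, y^2, and so on; the sensitivity to boundary effects noted above grows with the order.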
Kriging

Most real surfaces are observed to be spatially autocorrelated - that is, nearby points have values which are more similar than distant points.

The amount and form of spatial autocorrelation can be described by a variogram, which shows how differences in values increase with geographical separation

Observed variograms tend to have certain common features - differences increase with distance up to a certain value known as the sill, which is reached at a distance known as the range.
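The variogram can be estimated empirically by averaging half the squared value differences of point pairs within distance bins; a minimal sketch (the equal-width binning scheme is an assumption of mine):

```python
import math
from collections import defaultdict

def empirical_variogram(samples, bin_width):
    """Semivariance by distance bin: gamma(h) = mean of (zi - zj)^2 / 2
    over all point pairs whose separation falls in the bin.
    samples: list of (x, y, z). Returns {bin_index: semivariance}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i in range(len(samples)):
        xi, yi, zi = samples[i]
        for j in range(i + 1, len(samples)):
            xj, yj, zj = samples[j]
            h = math.hypot(xi - xj, yi - yj)
            b = int(h // bin_width)
            sums[b] += 0.5 * (zi - zj) ** 2
            counts[b] += 1
    return {b: sums[b] / counts[b] for b in sums}
```

For a spatially autocorrelated surface the resulting semivariances rise with distance, flattening at the sill once the range is reached.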
 

To make estimates by Kriging, a variogram is obtained from the observed values or past experience

  • Interpolated best-estimate values are then calculated based on the characteristics of the variogram.
  • perhaps the most satisfactory method of interpolation from a statistical viewpoint
  • difficult to execute with large data sets
  • decisions must be made by the user, requiring either experience or a "cookbook" approach
  • a major advantage of Kriging is its ability to output a measure of uncertainty of each estimate
    • This can be used to guide sampling programs by identifying the location where an additional sample would maximally decrease uncertainty, or its converse, the sample which is most readily dropped.
  • Kriging and other methods of geostatistics are in the geostatistics module of ArcGIS
Locally-defined functions

Some of the most satisfactory methods use a mosaic approach in which the surface is locally defined by a polynomial function, and the functions are arranged to fit together in some way

With a TIN data structure it is possible to describe the surface within each triangle by a plane

  • Planes automatically fit along the edges of each triangle, but slopes are not continuous across edges
  • This discontinuity can be masked cosmetically by smoothing the contours drawn to represent the surface
  • Alternatively a higher-order polynomial can be used which is continuous in slopes.
Another popular method fits a plane at each data point, then achieves a smooth surface by averaging planes at each interpolation point


Hypothesis tests:
  • compare patterns against the outcomes expected from well-defined processes
  • if the fit is good, one may conclude that the process that formed the observed pattern was like the one tested
    • unfortunately, there will likely be other processes that might have formed the same observed pattern
    • in such cases, it is reasonable to ignore them as long as a) they are no simpler than the hypothesized process, and b) the hypothesized process makes conceptual sense
  • the best known examples concern the processes that can give rise to certain patterns of points
    • attempts to extend these to other types of objects have not been as successful
Point pattern analysis
  • a commonly used standard is the random or Poisson process
    • in this process, points are equally likely to occur anywhere, and are located independently, i.e. one point's location does not affect another's
    • CSR = complete spatial randomness
  • a real pattern of points can be compared to this process
    • most often, the comparison is made using the average distance between a point and its nearest neighbor
    • in a random pattern (a pattern of points generated by the Poisson process) this distance is expected to be 1/(2 * sqrt(density)) where density is the number of points per unit area, and area is measured in units consistent with the measurement of distance
    • when the number of points is limited, we would expect to come close to this estimate in a random pattern
      • theory gives the limits within which average distance is expected to lie in 95% of cases
      • if the actual average distance falls outside these limits, we conclude that the pattern was not generated randomly
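The comparison described above can be sketched by computing the ratio of the observed mean nearest-neighbor distance to the CSR expectation 1/(2*sqrt(density)); the theoretical 95% limits are omitted here:

```python
import math

def nearest_neighbor_ratio(points, area):
    """Observed mean nearest-neighbor distance divided by the CSR
    expectation 1/(2*sqrt(density)). Values well below 1 suggest
    clustering; values well above 1 suggest uniform spacing."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        nearest = min(math.hypot(xi - xj, yi - yj)
                      for j, (xj, yj) in enumerate(points) if j != i)
        total += nearest
    observed = total / len(points)
    expected = 1.0 / (2.0 * math.sqrt(len(points) / area))
    return observed / expected
```

The all-pairs search is O(n^2); for large point sets a spatial index would be used instead.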
There are two major options for non-random patterns:
  • the pattern is clustered
    • points are closer together than they should be
    • the presence of one point has made other points more likely in the immediate vicinity
    • some sort of attractive or contagious process is inferred
  • the pattern is uniformly spaced
    • points are further apart than they should be
    • the presence of one point has made other points less likely in the vicinity
    • some sort of repulsive process is inferred, or some sort of competition for space
Unfortunately it is easy for this process of inference to come unstuck
  • the process that generated the pattern may be non-random, but not sufficiently so to be detectable by this test
    • this false conclusion is more likely reached when there is little data - the more data we have, the more likely we are to detect differences from a simple random process
    • in statistics, this is known as a Type II error - failing to reject the null hypothesis when in fact it is false
  • the process may be non-random, but not in either of the senses identified above - contagious or repulsive
    • points may be located independently, but with non-uniform density, so that points are not equally likely everywhere
  • it is possible to hypothesize more complex processes, but the test becomes progressively weaker at confirming them





SECTION 3
SPATIAL INTERACTION MODELS

 

 

 

 







Section 3 - Spatial interaction models - What they are and where they're used - Calibration and "what-if" - Trade area analysis and market penetration:

  • The Huff model and variations.
  • Site modeling for retail applications - regression, analog, spatial interaction.
  • Modeling the impact of changes in a retail system.
  • Calibrating spatial interaction models in a GIS environment.
What is a spatial interaction model?
  • a model used to explain, understand, predict the level of interaction between different geographic locations
  • examples of interactions:
    • migration (number of migrants between pairs of states)
    • phone traffic (number of calls between pairs of cities)
    • commuting (number of vehicles from home to workplace)
    • shopping (number of trips from home to store)
    • recreation (number of campers from home to campsite)
    • trade (amount of goods between pairs of countries)
  • interaction is always expressed as a number or quantity per unit of time
  • interaction occurs between defined origin and destination
    • these may be the same or different classes of objects
    • e.g. the same class in the case of migration between states
    • e.g. different classes in the case of journeys to shop or work
    • the matrix of interactions can be square or rectangular

Interaction is believed to be dependent on:
  • some measure of the origin (its propensity to generate interaction)
  • some measure of the destination (its propensity to attract interaction)
  • some measure of the trip (its propensity to deter interaction)
  • these measures are assumed to multiply
 Let:
i denote an origin object (often an area)

j denote a destination object (a point or area)

I*ij denote the observed interaction between i and j, measured in appropriate units (e.g. numbers of trips, flow of goods, per defined interval of time)

Iij denote the interaction predicted by the spatial interaction model
 

  • if the model is good (fits well), the predicted interactions per interval of time will be close in value to the observed interactions
  • each Iij will be close to its corresponding I*ij
Ei denote the emissivity of the origin area i

Aj denote the attraction of the destination area j

Cij denote the deterrence of the trip between i and j (probably some measure of the trip length or cost)

a denote a constant to be determined
 

Then the most general form of spatial interaction model is:
 

Iij = a Ei Aj Cij
 

  • that is, interaction can be predicted from the product of a constant, emissivity, attraction and deterrence
The model began life in the mid 19th century as an attempt to apply laws of gravitation to human communities - the gravity model
  • such ideas of social physics have long since gone out of fashion, but the name is still sometimes used
  • even in the form above, the model bears some relationship to Newton's Law of Gravitation
  •  
 In any application of the model, some aspects are assumed to be unknown, and determined by calibration
  • e.g. the value of a might be unknown in a given application
  • its value would be calibrated by finding the value that gives the best fit between the observed interactions and the interactions predicted by the model
  • the conventional measure of fit is the total squared difference between observation and prediction, that is, the summation over i and j of (Iij - I*ij)^2
  • this is known as least squares calibration
  • other unknowns might be the method of calculating deterrence (Cij) from distance, or the attraction value to give to certain retail stores
Measurement of the variables:

Cij

  • deterrence is often strongly related to distance
    • the further the distance, the less interaction and thus the lower Cij
  • a common choice is a decreasing function of distance:
  • Cij = dij^-b            (i.e. Cij = 1 / dij^b)
    or Cij = exp(-b dij)            (exp denotes e = 2.71828..., raised to the given power)
  • generally the fit of the model is not sufficiently good to distinguish between these two, that is, to identify which gives the better fit
  • the negative exponential has a minor technical advantage in not creating problems when dij = 0 (origin and destination are the same place)
  • the b parameter is unknown and must be calibrated
    • its value depends on the type of interaction, and also probably on the region
    • b has units in the negative exponential case (1/distance) but none in the negative power case
  • other measures of deterrence include:
    • some function of transport cost
    • some function of actual travel time
    • in either case the function used is likely to be the negative power or negative exponential above
    • there are examples where distance has a positive effect on interaction
Ei
  • how to measure the propensity of each origin to emit interaction?
  • gross population Pi
  • some more appropriate measure weighting each cohort, e.g. age and sex cohorts
    • some cohorts are more likely to interact than others
  • gross income
  • Ei could be treated as unknown and calibrated
Aj
  • the propensity of each destination to attract interaction
  • could be unknown and calibrated
  • for shopping models, gross floor area of retail space is often used
  • some forms of interaction are symmetrical
    • flow from origin to destination equals reverse flow
    • e.g. phone calls
    • requires Ei and Aj to be the same, e.g. population
The Huff model
  • what happens when a new destination is added?
    • interactions with existing destinations are unaffected
    • assumes outflow from origins can increase without limit
    • in practice, in many applications flow from origin to existing destinations will be diverted
    • we need some form of "production constraint"
Huff proposed this change:
Iij = Ei Aj Cij / summation over all destinations k (Ak Cik)
  • summing interaction to all destinations from a given origin:
  • summation over j (Iij) = Ei
  • that is, total interaction from an origin will always equal Ei regardless of the number and locations of destinations
  • flow will now be partially diverted from existing destinations to new ones
  • Ei is now the total outflow, can be set equal to the total of observed outflows from origin i
  • the Huff model is consistent with the axiom of Independence of Irrelevant Alternatives (IIA)
    • the ratio of flows to two destinations from a given origin is independent of the existence and locations of other destinations
Because of its production constraint, the Huff model is very popular in retail analysis
  • it is often desirable to predict how much business a new store will draw from existing ones
    • e.g. how much will a new mall draw business away from downtown?
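The Huff allocation can be sketched directly; the negative-power deterrence function and the data in the test are illustrative assumptions:

```python
def huff_flows(E_i, attractions, distances, b=2.0):
    """Flows from one origin to each destination j under the Huff model:
    I_ij = E_i * A_j * d_ij^-b / sum_k (A_k * d_ik^-b).
    The production constraint guarantees the flows sum to E_i."""
    utilities = [a * d ** (-b) for a, d in zip(attractions, distances)]
    total = sum(utilities)
    return [E_i * u / total for u in utilities]
```

Adding a destination re-divides the same Ei among more destinations, diverting flow from existing ones, while the ratio of any two flows is unchanged (the IIA property noted above).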
Other "what if" questions:
  • population of a neighborhood increases by x%
  • ethnic mix of a neighborhood changes
  • a new bridge is constructed
  • an earthquake takes a freeway out of operation
  • a mall adds new space
  • an anchor store moves out
  • a store changes its signage
Site modeling for retail applications
  • three major areas:
    • use of the spatial interaction model
    • analog techniques
    • regression models
Analog:
  • the business done by a new store or an old store operating under changed circumstances is best estimated by finding the closest analog in the chain
  • requires a search
  • criteria include:
    • physical characteristics of each store
    • intangibles such as management, signage
    • local market area
    • a GIS can help compare market areas (local densities, street layouts, traffic patterns)
    • a multi-media GIS can help with the intangibles
      • bring up images of site, layout, signage...
Regression:
  • identify all of the factors affecting sales, and construct a model to predict based on these factors
  • an enormous range of factors can affect sales
  • some factors are exogenous
    • determined by external, physical, measurable variables
    • some of these travel with the store if it moves (site factors), others are attributes of place (situation factors)
  • other factors are endogenous
    • determined by crowding, types of customers, trends, advertising
    • unpredictable, determined by the state of the system
Exogenous factors:
  • site layout - on a corner? parking spaces, etc.
  • inside layout
  • trade area - number of households in primary, etc
  • characteristics of neighborhood
Example model:
 
 

Sales per 2-week period for convenience store:
 

$12749
+ 4542 if gas pumps on site
+ 3172 if major shopping center in vicinity
+ 3990 if side street traffic is transient
+ 3188 per curb cut on side street
+ 2974 if store stands alone
- 1722 per lane on main street
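The example model above reads directly as a scoring function (a sketch; the argument names are mine):

```python
def predicted_sales(gas_pumps, shopping_center_nearby, transient_side_traffic,
                    curb_cuts_side, stands_alone, main_street_lanes):
    """Predicted 2-week sales ($) for a convenience store, applying the
    example regression model in the text."""
    sales = 12749.0                       # base level
    if gas_pumps:
        sales += 4542
    if shopping_center_nearby:
        sales += 3172
    if transient_side_traffic:
        sales += 3990
    if stands_alone:
        sales += 2974
    sales += 3188 * curb_cuts_side        # per curb cut on side street
    sales -= 1722 * main_street_lanes     # per lane on main street
    return sales
```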

  • use of surrogate variables
  • problems in use of model for prediction in planning
Calibration of the spatial interaction model
  • many different circumstances
  • major issues involved in calibration
  • specific tools are available
    • SIMODEL
  • possible to use standard tools in e.g. SAS, GLIM
  • calibration possible using aggregate flows or individual choices
Linearization:
  • transformations to make the right hand side of the equation a linear combination of unknowns, the left hand side known
Linearization of the unconstrained model:
  • suppose the Ei are known, the Aj unknown
    • the constant a can be absorbed into the Aj (i.e. find aAj)
  • suppose we use the negative power deterrence function
  • Iij = Ei Aj / dij^b
  • move the Ei to the left:
  • Iij/Ei = Aj / dij^b
  • take the logs of both sides:
  •  log (Iij/Ei) = log Aj - b log dij
  •   now a trick - introduce a set of dummy variables uijk, set to 1 if j=k, otherwise zero:
 log (Iij/Ei) = uij1 log A1 + uij2 log A2 + ... - b log dij
  • now the left hand side is all knowns, the right hand side is a linear combination of unknowns (the logs of the As and b)
  • the model can now be calibrated (the unknowns can be determined) using ordinary multiple regression in a package like SAS
  • it may be easier to avoid linearizing altogether by using the nonlinear regression facilities in many packages
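The linearized form above is now an ordinary multiple-regression problem; a sketch with numpy's least squares standing in for a package like SAS (the flows/E/d data structures are illustrative assumptions; the test generates synthetic flows from known parameters and recovers them):

```python
import numpy as np

def calibrate_linearized(flows, E, d, n_dest):
    """Recover the A_j and b of I_ij = E_i A_j d_ij^-b by OLS on the
    linearized form log(I_ij/E_i) = sum_k u_ijk log A_k - b log d_ij.
    flows, d: dicts keyed by (i, j); E: dict keyed by i."""
    rows, ys = [], []
    for (i, j), I in flows.items():
        row = [0.0] * (n_dest + 1)
        row[j] = 1.0                       # dummy u_ijk selects log A_j
        row[n_dest] = -np.log(d[i, j])     # coefficient of b
        rows.append(row)
        ys.append(np.log(I / E[i]))
    coef, *_ = np.linalg.lstsq(np.array(rows), np.array(ys), rcond=None)
    return np.exp(coef[:n_dest]), coef[n_dest]   # (A estimates, b)
```

Note that this minimizes squared error in the logged quantities, not in the flows themselves, which is exactly the objective-function problem discussed next.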
The objective function:
  • normally, we would try to maximize the fit of the observed and predicted interactions
  • linearization changes this
    • e.g. we minimize the squared differences between observed and predicted values of log (Iij/Ei) if ordinary regression is used on the linearized form above
    • this is easy in practice, but makes little statistical sense for counts of trips
  • intuitively, an error of 30 in a prediction of 1000 trips is much more acceptable than an error of 30 in a prediction of 10 trips
  • these ideas are formalized in the technique of Poisson regression, which assumes that Iij is a count of events, and sets up the objective function accordingly
    • the function minimized to get a good fit is roughly the difference between observed and predicted, squared, divided by the predicted flow
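The point above can be illustrated numerically: a Pearson-type (Poisson-motivated) term divides the squared error by the predicted flow, so an error of 30 against a prediction of 1000 counts far less than the same error against a prediction of 10. The flow values below are invented for illustration.

```python
# Sketch of the Poisson-motivated objective described above: each
# flow contributes (observed - predicted)^2 / predicted, so the same
# absolute error matters less for a large predicted flow.

def pearson_term(observed, predicted):
    """(observed - predicted)^2 / predicted for one flow."""
    return (observed - predicted) ** 2 / predicted

# Same absolute error of 30 trips:
big = pearson_term(1030, 1000)   # 900 / 1000 = 0.9
small = pearson_term(40, 10)     # 900 / 10   = 90.0
print(big, small)
```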





SECTION 4
SPATIAL DEPENDENCE

Section 4 - Spatial dependence - Looking at causes and effects in a geographical context:

  • Spatial autocorrelation - what is it, how to measure it with a GIS.
  • The independence assumption and what it means for modeling spatial data.
  • Applying models that incorporate spatial dependence - tools and applications.
Two concepts:

Spatial dependence

  • what happens at one place depends on events in nearby places
  • all things are related but nearby things are more related than distant things (Tobler's first law of geography)
  • positive spatial dependence:
    • nearby things are more alike than things are in general
  • negative spatial dependence:
    • nearby things are less alike than things are in general
    • conceptual problems with negative spatial dependence
      • e.g. the chessboard
  • spatial autocorrelation measures spatial dependence
    • an index, rather than a parameter of a process
    • dependence between discrete objects, or dependence in a continuous field?
  • a world without positive spatial dependence would be an impossible world
    • impossible to map
    • impossible to describe, live in
    • hell is a place with no spatial dependence
Geary index:
  • compares the squared differences in value between neighboring objects with overall variance in values
Moran index:
  • calculates the product of values in neighboring objects
  • related to Geary but not in a simple algebraic sense

 

Calculation of the Geary index of spatial autocorrelation

c = (n - 1) Σi Σj wij (xi - xj)^2 / [2W Σi (xi - a)^2]

a is the mean of the x values

wij = 1 if i, j adjacent, else 0

W = Σi Σj wij; n is the number of areas

c = 1 if neighbors vary as much as the sample as a whole

c < 1 if neighbors are more similar than the sample as a whole (positive dependence)

c > 1 if neighbors are less similar (negative dependence)

For the four-area example: c = 3 x 16 / (2 x 10 x 2) = 48 / 40 = 1.2
 
  • i.e. neighboring values are slightly more similar than one would expect if the values were randomly allocated to the four areas
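Both indices are straightforward to compute given a 0/1 adjacency matrix. The sketch below uses the standard formulas for Geary's c and Moran's I; the four areas and their values are invented here (they are not the worked example above).

```python
# Sketch of Geary's c and Moran's I for areal values with a symmetric
# 0/1 adjacency matrix. The four-area chain and its values are
# invented for illustration.
import numpy as np

def geary_c(x, w):
    """Geary c: (n-1) sum wij (xi-xj)^2 / [2W sum (xi - mean)^2]."""
    x = np.asarray(x, dtype=float)
    n, W = len(x), w.sum()
    num = (n - 1) * (w * (x[:, None] - x[None, :]) ** 2).sum()
    den = 2 * W * ((x - x.mean()) ** 2).sum()
    return num / den

def moran_i(x, w):
    """Moran I: (n/W) sum wij zi zj / sum zi^2, with z = xi - mean."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    n, W = len(x), w.sum()
    return (n / W) * (w * z[:, None] * z[None, :]).sum() / (z ** 2).sum()

# Four areas in a row: 1-2, 2-3, 3-4 adjacent (symmetric weights)
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
x = [1, 2, 3, 4]       # a smooth trend: strong positive dependence
print(geary_c(x, w))   # well below 1
print(moran_i(x, w))   # positive
```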
Continuous space
  • see the discussion of variograms and Kriging
  • the term geostatistics is normally associated with continuous space, spatial statistics more with discrete space
Measures of spatial dependence can be calculated in GIS:
  • Idrisi calculates autocorrelation over a raster
  • code has been written to calculate autocorrelation in ARC/INFO (see NCGIA Technical Paper 91-5)
 
More extensive codes have been written using the statistical packages, e.g. MINITAB, SAS
  • contact Dan Griffith, Syracuse University; Luc Anselin, University of Illinois
  • some of these fail to take advantage of GIS capabilities, for generating input data and displaying output
  • see also Spacestat
Spatial heterogeneity:
  • suppose there is a relationship between number of AIDS cases and number of people living in an area
  • the form of this relationship will vary spatially
    • in some areas the number of cases per capita will be higher than in others
    • we could map the constant of proportionality
  • spatial heterogeneity describes this geographic variation in the constants or parameters of relationships
  • when it is present, the outcome of an analysis depends on the area over which the analysis is made
    • often this area is arbitrarily determined by a map boundary or political jurisdiction


Geographically weighted regression (GWR)

  • fits a model such as y = a + bx
  • assumes that the values of a and b vary geographically
  • determines a and b at any point by weighting observations inversely by distance from that point

diagram
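The weighting idea can be sketched as weighted least squares with a distance-decay kernel. The Gaussian kernel, the bandwidth, and all data below are assumptions for illustration, not any particular GWR package's implementation.

```python
# Minimal sketch of geographically weighted regression: at a chosen
# point, fit y = a + b x by weighted least squares, with weights that
# decay with distance from that point (a Gaussian kernel is assumed).
import numpy as np

def gwr_at_point(px, py, locs, x, y, bandwidth=1.0):
    """Return local (a, b) for y = a + b x at point (px, py)."""
    d = np.hypot(locs[:, 0] - px, locs[:, 1] - py)
    w = np.exp(-(d / bandwidth) ** 2)          # distance-decay weights
    X = np.column_stack([np.ones_like(x), x])
    XtW = X.T * w                              # weighted least squares:
    a, b = np.linalg.solve(XtW @ X, XtW @ y)   # (X'WX)^-1 X'Wy
    return a, b

# Invented observations at scattered locations
rng = np.random.default_rng(1)
locs = rng.uniform(0, 10, (30, 2))
x = rng.uniform(0, 5, 30)
y = 2.0 + 0.5 * x + rng.normal(0, 0.1, 30)    # near-constant relationship
print(gwr_at_point(5.0, 5.0, locs, x, y, bandwidth=5.0))  # roughly (2.0, 0.5)
```

Repeating the fit at many points and mapping the local a and b values is what reveals the spatial heterogeneity discussed above.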


 

Geographical brushing:

  • a technique of ESA
  • a user-defined window is moved over the map
  • analysis occurs only within the window
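The windowing idea can be sketched as restricting an analysis (here just a mean, for simplicity) to the observations falling inside a user-defined rectangle. The function name and data are invented for illustration.

```python
# Sketch of the idea behind geographical brushing: analysis is
# restricted to observations inside a user-defined window.
import numpy as np

def analyze_in_window(locs, values, xmin, xmax, ymin, ymax):
    """Mean of the values whose locations fall inside the window."""
    inside = ((locs[:, 0] >= xmin) & (locs[:, 0] <= xmax) &
              (locs[:, 1] >= ymin) & (locs[:, 1] <= ymax))
    return values[inside].mean()

locs = np.array([[1, 1], [2, 2], [8, 8], [9, 9]])
values = np.array([10.0, 20.0, 30.0, 40.0])
print(analyze_in_window(locs, values, 0, 5, 0, 5))  # → 15.0
```

In an interactive setting the window is dragged across the map and the analysis is recomputed on the fly for each position.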
Conventional analysis (analysis done aspatially, e.g. using a statistical package) assumes independence (no spatial dependence) and homogeneity (no spatial heterogeneity)
  • e.g. regression analysis assumes that the observations (cases) are statistically independent
  • this assumption violates Tobler's first law of geography
  • in general, analysis in space is very different from conventional statistical analysis (although this is very often carried out on spatial data)
An example:
  • the relationship between land devoted to growing corn and rainfall in a Midwestern state like Kansas
  • rainfall available at 50 weather stations
  • percent of land growing corn available for 100 counties
  • use a method of spatial interpolation to estimate rainfall in each county from the weather station data
  • plot one variable against the other, and perhaps fit a regression equation
  • how many data points are there?
    • the more data points, the more significant the results
    • 100 (the number of counties)?
    • 50 (the real number of weather observations)?
    • something in between?
  • more data points can be invented by intensifying the sample network using spatial interpolation, but no more real data has been created by doing so
  • both variables are strongly spatially autocorrelated