Descriptive Statistics for

Spatial Distributions

Review Standard Descriptive Statistics


Centrographic Statistics for Spatial Data
Mean Center, Centroid, Standard Distance Deviation, Standard Distance Ellipse
Density Kernel Estimation, Mapping

Briggs Henan University 2010


Spatial Analysis: successive levels of sophistication
1. Spatial data description: classic GIS capabilities
– Spatial queries & measurement, buffering, map layer overlay
2. Exploratory Spatial Data Analysis (ESDA):
– searching for patterns and possible explanations
– GeoVisualization through data graphing and mapping
– Descriptive spatial statistics: Centrographic statistics
3. Spatial statistical analysis and hypothesis testing
– Are data “to be expected” or are they “unexpected” relative to some statistical model, usually of a random process?
4. Spatial modeling or prediction
– Constructing models (of processes) to predict spatial outcomes (patterns)


Standard Statistical Analysis
Two parts:
1. Descriptive statistics
Concerned with obtaining summary measures to describe
a set of data
For example, the mean and the standard deviation

2. Inferential statistics
Concerned with making inferences from samples about a population

Similarly, we have Descriptive and Inferential Spatial Statistics


Spatial Statistics
Descriptive Spatial Statistics: Centrographic Statistics (This time)
– Single, summary measures of a spatial distribution
– Spatial equivalents of mean, standard deviation, etc.
Inferential Spatial Statistics: Point Pattern Analysis (Next time)
– Analysis of point location only: no quantity or magnitude (no attribute variable)
-- Quadrat Analysis
-- Nearest Neighbor Analysis, Ripley’s K function
Spatial Autocorrelation (Weeks 5 and 6)
– One attribute variable with different magnitudes at each location
-- The Weights Matrix
-- Global Measures of Spatial Autocorrelation (Moran’s I, Geary’s C, Getis/Ord Global G)
-- Local Measures of Spatial Autocorrelation (LISA and others)
Prediction with Correlation and Regression (Week 7)
– Two or more attribute variables
-- Standard statistical models
-- Spatial statistical models


Standard Statistical Analysis:
A Quick Review
1. Descriptive statistics
– Concerned with obtaining summary measures to
describe a set of data
– Calculate a few numbers to represent all the data
– We begin by looking at one variable (“univariate”)
• Later, we will look at two variables (bivariate)
Three types:
– Measures of Central Tendency
– Measures of Dispersion or Variability
– Frequency distributions
I hope you are already familiar with these. I will quickly review the main ideas.

Standard Descriptive Statistics
Central Tendency

• Central Tendency: single summary measure for one variable:
1. mean (average)
2. median (middle value)
--50% larger and 50% smaller
--rank order data and select middle number
3. mode (most frequently occurring)
Formula for mean:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
These may be obtained in ArcGIS by:
--opening a table, right clicking on column heading, and selecting Statistics
--going to ArcToolbox>Analysis>Statistics>Summary Statistics
Calculation of mean and median (note: data for Taiwan is included)

ADMIN_NAME     Illiteracy-Prcnt   Rank order
Beijing              3.11             1
Liaoning             3.48             2
Tianjin              3.52             3
Taiwan               3.9              4
Shanghai             3.97             5
Guangdong            4.02             6
Heilongjiang         4.16             7
Shanxi               4.42             8
Jilin                4.44             9
Xinjiang             4.64            10
Hebei                4.83            11
Guangxi              5.61            12
Hunan                5.87            13
Jiangxi              6.49            14
Hong Kong            6.5             15
Henan                7.36            16
Hubei                7.69            17
Chongqing            7.8             18
Shandong             7.96            19
Jiangsu              8.05            20
Nei Mongol           8.14            21
Shaanxi              8.19            22
Hainan               8.65            23
Macao                8.7             24
Zhejiang             9.36            25
Ningxia             10.09            26
Sichuan             10.24            27
Fujian              10.38            28
Yunnan              13.29            29
Anhui               14.49            30
Guizhou             14.58            31
Qinghai             16.68            32
Gansu               17.77            33
Xizang              37.77            34
Sum                296.15

Mean: 296.15 / 34 = 8.71
Median: (7.69 + 7.8)/2 = 7.75 (there are 2 “middle values”, ranks 17 and 18)
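The calculation above can be checked with a few lines of Python. This is a minimal sketch that uses only the standard library; the variable names (`illiteracy`, `mean_value`, `median_value`) are illustrative and not part of the original slides.

```python
import statistics

# Illiteracy percentages for the 34 units in the table (Taiwan included),
# listed in rank order.
illiteracy = [
    3.11, 3.48, 3.52, 3.9, 3.97, 4.02, 4.16, 4.42, 4.44, 4.64, 4.83, 5.61,
    5.87, 6.49, 6.5, 7.36, 7.69, 7.8, 7.96, 8.05, 8.14, 8.19, 8.65, 8.7,
    9.36, 10.09, 10.24, 10.38, 13.29, 14.49, 14.58, 16.68, 17.77, 37.77,
]

# Mean: sum of the values divided by the number of values.
mean_value = sum(illiteracy) / len(illiteracy)

# Median: with an even count, statistics.median averages the two middle values.
median_value = statistics.median(illiteracy)

print(f"n = {len(illiteracy)}, mean = {mean_value:.2f}, median = {median_value:.3f}")
```

The mean is noticeably larger than the median because one very large value (Xizang) pulls it upward.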



Standard Descriptive Statistics
Variability or Dispersion
• Dispersion: measures of spread or variability
– Variance
• average squared distance of observations from mean
– Standard Deviation (square root of variance)
• “average” distance of observations from the mean

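Formulae for variance (definition formula on the left, computation formula on the right):

$$\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{N} = \frac{\sum_{i=1}^{n} X_i^2 - \left[\left(\sum_{i=1}^{n} X_i\right)^2 / N\right]}{N}$$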

These may be obtained in ArcGIS by:


--opening a table, right clicking on column heading, and selecting Statistics
--going to ArcToolbox>Analysis>Statistics>Summary Statistics
Calculation of Variance and Standard Deviation (note: data for Taiwan is included)

ADMIN_NAME     Illiteracy-Prcnt   (X - Xmean)   (X - Xmean) squared
Anhui               14.49            5.780        33.40500009
Beijing              3.11           -5.600        31.3632942
Fujian              10.38            1.670         2.787917734
Gansu               17.77            9.060        82.07827067
Guangdong            4.02           -4.690        21.99885891
Guangxi              5.61           -3.100         9.611823616
Guizhou             14.58            5.870        34.45344715
Hainan               8.65           -0.060         0.003635381
Hebei                4.83           -3.880        15.05668244
Heilongjiang         4.16           -4.550        20.70517656
Henan                7.36           -1.350         1.823294204
Hubei                7.69           -1.020         1.041000087
Hunan                5.87           -2.840         8.067270675
Nei Mongol           8.14           -0.570         0.325235381
Jiangsu              8.05           -0.660         0.435988322
Jiangxi              6.49           -2.220         4.929705969
Jilin                4.44           -4.270        18.23541185
Liaoning             3.48           -5.230        27.35597656
Ningxia             10.09            1.380         1.903588322
Qinghai             16.68            7.970        63.51621185
Shaanxi              8.19           -0.520         0.270705969
Shandong             7.96           -0.750         0.562941263
Shanghai             3.97           -4.740        22.47038832
Shanxi               4.42           -4.290        18.40662362
Sichuan             10.24            1.530         2.340000087
Taiwan               3.9            -4.810        23.1389295
Tianjin              3.52           -5.190        26.93915303
Xizang              37.77           29.060       844.466506
Xinjiang             4.64           -4.070        16.5672942
Yunnan              13.29            4.580        20.97370597
Zhejiang             9.36            0.650         0.422117734
Chongqing            7.8            -0.910         0.828635381
Hong Kong            6.5            -2.210         4.885400087
Macao                8.7            -0.010         0.000105969
Sum                296.15            0.000      1361.370297
Mean                 8.710294118
Variance            40.04030285
StanDev              6.3277

Variance from Definition Formula: 1361.370/34 = 40.04
Variance from Computation Formula: [3940.924 – (296.15 * 296.15)/34]/34 = 40.04
Standard Deviation = √40.04 = 6.33
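As a check on the table, the sketch below computes the variance both ways (definition formula and computation formula) and then the standard deviation. It divides by N, as the slides do, so it is the population variance rather than the sample (N - 1) version. The variable names are illustrative only.

```python
import math

# Illiteracy percentages for the 34 units (Taiwan included).
illiteracy = [
    3.11, 3.48, 3.52, 3.9, 3.97, 4.02, 4.16, 4.42, 4.44, 4.64, 4.83, 5.61,
    5.87, 6.49, 6.5, 7.36, 7.69, 7.8, 7.96, 8.05, 8.14, 8.19, 8.65, 8.7,
    9.36, 10.09, 10.24, 10.38, 13.29, 14.49, 14.58, 16.68, 17.77, 37.77,
]
n = len(illiteracy)
mean_value = sum(illiteracy) / n

# Definition formula: average squared deviation from the mean.
var_definition = sum((x - mean_value) ** 2 for x in illiteracy) / n

# Computation formula: uses only the sum of X and the sum of X squared.
sum_x = sum(illiteracy)
sum_x2 = sum(x * x for x in illiteracy)
var_computation = (sum_x2 - sum_x ** 2 / n) / n

# Standard deviation: square root of the variance.
std_dev = math.sqrt(var_definition)

print(f"variance (definition)  = {var_definition:.2f}")
print(f"variance (computation) = {var_computation:.2f}")
print(f"standard deviation     = {std_dev:.2f}")
```

The two formulas are algebraically identical; the computation formula is simply easier to evaluate by hand because it needs only one pass through the data.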
Classic Descriptive Statistics: Univariate
Frequency distributions
A count of the frequency with which values occur on a variable

[Figure: bar chart of US population by age group (under 15 years, 15 to 29 years, 30 to 44 years, 45 to 59 years, 60 to 74 years, 75 and older); vertical axis runs from 0 to 70000. Callout: 50 million people age 45-59 (data for 2000).]
Source: US Bureau of the Census: Statistical Abstract of the US, http://www.census.gov/compendia/statab/

Often represented by the area under a frequency curve

[Figure: the same age distribution drawn as a frequency curve. The area under the curve represents 100% of the data.]

In ArcGIS, you may obtain frequency counts on a categorical variable via:
--ArcToolbox>Analysis>Statistics>Frequency
Frequency Distributions for China Province Data
Symmetric Distribution
[Figure: histogram of percent urban by province; height of bar shows frequency]
There are 16 provinces with percent urban between 38.4% and 50.8% (mode)
Mode = (38.1+50.8)/2 = 44.5
Mean = 48.97
Median = 44.0
Symmetric distribution: mean = median = mode

Skewed Distribution (right skew)
[Figure: histogram of percent illiteracy by province; height of bar shows frequency; “tail” extends to right]
There are 17 provinces with illiteracy between 5.4% and 10.7% (mode)
Mode = (5.4+10.7)/2 = 8.05
Mean = 8.7 (mean is “pulled” to the right)
Median = (7.69 + 7.8)/2 = 7.75
Skewed distribution (right skew): mean > median
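A frequency distribution is just a count of values per class. The sketch below bins the illiteracy values into equal-width classes. The class limits are an assumption: only the modal class (5.4% to 10.7%) is given on the slide, so the starting value 0.1 and width 5.3 are hypothetical choices that reproduce that class. The helper `bin_counts` is illustrative, not an ArcGIS or GeoDA function.

```python
# Illiteracy percentages for the 34 units (Taiwan included).
illiteracy = [
    3.11, 3.48, 3.52, 3.9, 3.97, 4.02, 4.16, 4.42, 4.44, 4.64, 4.83, 5.61,
    5.87, 6.49, 6.5, 7.36, 7.69, 7.8, 7.96, 8.05, 8.14, 8.19, 8.65, 8.7,
    9.36, 10.09, 10.24, 10.38, 13.29, 14.49, 14.58, 16.68, 17.77, 37.77,
]

def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each equal-width class."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)  # index of the class containing v
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Assumed class limits: 0.1, 5.4, 10.7, 16.0, ... (width 5.3, eight classes).
START, WIDTH, N_BINS = 0.1, 5.3, 8
counts = bin_counts(illiteracy, START, WIDTH, N_BINS)

# The modal class is the one with the highest count; report its midpoint.
modal_index = counts.index(max(counts))
modal_midpoint = START + (modal_index + 0.5) * WIDTH

for k, c in enumerate(counts):
    lo = START + k * WIDTH
    print(f"{lo:5.1f} to {lo + WIDTH:5.1f}: {'#' * c} ({c})")
```

The long run of empty classes before the last one is the right “tail” described above.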
Frequency Distributions for China Province Data: Variability
Standard deviation: a measure of “the average” distance of each observation from the mean

Symmetric Distribution
Standard deviation = 14.8

Skewed Distribution (right skew; “tail” extends to right)
Standard deviation = 6.33
On average, illiteracy values are closer to the mean. There is less “spread” in this data.
Caution: these values are incorrect!
• Why?
• Incorrect to calculate a simple (unweighted) mean for percentages
– Each percentage has a different base population
• Should calculate weighted mean

$$\bar{X} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

where $w_i$ = population of each province
• Very common error in GIS because we use aggregated data frequently



Correct Values!
• Unweighted mean = 8.7
• Weighted mean = 7.75
• Weighted mean is smaller. Why?
• The largest provinces have lower illiteracy; the highest rates are in small provinces

The largest provinces have lower illiteracy:
ADMIN_NAME       Illiteracy-Prcnt   Pop2008
Guangdong             4.02          95,440,000
Henan                 7.36          94,290,000
Shandong              7.96          94,172,300

Highest rates in small provinces:
ADMIN_NAME       Illiteracy-Prcnt   Pop2008
Ningxia              10.09           6,176,900
Qinghai              16.68           5,543,000
Xizang (Tibet)       37.77           2,870,000



Calculation of weighted mean

ADMIN_NAME     Illiteracy-Prcnt   Pop2008          x*w
Anhui               14.49        61,350,000      888961500
Beijing              3.11        22,000,000       68420000
Fujian              10.38        36,040,000      374095200
Gansu               17.77        26,281,200      467016924
Guangdong            4.02        95,440,000      383668800
Guangxi              5.61        48,160,000      270177600
Guizhou             14.58        37,927,300      552980034
Hainan               8.65         8,540,000       73871000
Hebei                4.83        69,888,200      337560006
Heilongjiang         4.16        38,253,900      159136224
Henan                7.36        94,290,000      693974400
Hubei                7.69        57,110,000      439175900
Hunan                5.87        63,800,000      374506000
Nei Mongol           8.14        24,137,300      196477622
Jiangsu              8.05        76,773,000      618022650
Jiangxi              6.49        44,000,000      285560000
Jilin                4.44        27,340,000      121389600
Liaoning             3.48        43,147,000      150151560
Ningxia             10.09         6,176,900       62324921
Qinghai             16.68         5,543,000       92457240
Shaanxi              8.19        37,620,000      308107800
Shandong             7.96        94,172,300      749611508
Shanghai             3.97        19,210,000       76263700
Shanxi               4.42        34,106,100      150748962
Sichuan             10.24        81,380,000      833331200
Taiwan               3.9         23,140,000       90246000
Tianjin              3.52        11,760,000       41395200
Xizang              37.77         2,870,000      108399900
Xinjiang             4.64        21,308,000       98869120
Yunnan              13.29        45,430,000      603764700
Zhejiang             9.36        51,200,000      479232000
Chongqing            7.8         31,442,300      245249940
Hong Kong            6.5          7,003,700       45524050
Macao                8.7            542,400        4718880
Sum                296.15     1,347,382,600    10445390141

Unweighted mean: 296.15 / 34 = 8.71
Weighted mean: 10,445,390,141 / 1,347,382,600 = 7.75
Note: we should also calculate a weighted standard deviation
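The weighted mean in the table can be reproduced with the short sketch below, which weights each illiteracy percentage by the 2008 population from the same table. The tuple layout and variable names are illustrative choices, not part of the original slides.

```python
# (name, illiteracy percent, 2008 population) for the 34 units in the table.
provinces = [
    ("Anhui", 14.49, 61_350_000), ("Beijing", 3.11, 22_000_000),
    ("Fujian", 10.38, 36_040_000), ("Gansu", 17.77, 26_281_200),
    ("Guangdong", 4.02, 95_440_000), ("Guangxi", 5.61, 48_160_000),
    ("Guizhou", 14.58, 37_927_300), ("Hainan", 8.65, 8_540_000),
    ("Hebei", 4.83, 69_888_200), ("Heilongjiang", 4.16, 38_253_900),
    ("Henan", 7.36, 94_290_000), ("Hubei", 7.69, 57_110_000),
    ("Hunan", 5.87, 63_800_000), ("Nei Mongol", 8.14, 24_137_300),
    ("Jiangsu", 8.05, 76_773_000), ("Jiangxi", 6.49, 44_000_000),
    ("Jilin", 4.44, 27_340_000), ("Liaoning", 3.48, 43_147_000),
    ("Ningxia", 10.09, 6_176_900), ("Qinghai", 16.68, 5_543_000),
    ("Shaanxi", 8.19, 37_620_000), ("Shandong", 7.96, 94_172_300),
    ("Shanghai", 3.97, 19_210_000), ("Shanxi", 4.42, 34_106_100),
    ("Sichuan", 10.24, 81_380_000), ("Taiwan", 3.9, 23_140_000),
    ("Tianjin", 3.52, 11_760_000), ("Xizang", 37.77, 2_870_000),
    ("Xinjiang", 4.64, 21_308_000), ("Yunnan", 13.29, 45_430_000),
    ("Zhejiang", 9.36, 51_200_000), ("Chongqing", 7.8, 31_442_300),
    ("Hong Kong", 6.5, 7_003_700), ("Macao", 8.7, 542_400),
]

# Unweighted mean treats every province as equally important.
unweighted_mean = sum(pct for _, pct, _ in provinces) / len(provinces)

# Weighted mean: sum of (weight * value) divided by the sum of weights.
total_population = sum(pop for _, _, pop in provinces)
weighted_sum = sum(pct * pop for _, pct, pop in provinces)
weighted_mean = weighted_sum / total_population

print(f"unweighted mean = {unweighted_mean:.2f}")
print(f"weighted mean   = {weighted_mean:.2f}")
```

The weighted value is lower because the most populous provinces carry most of the weight and have comparatively low illiteracy.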



Centrographic Statistics
Descriptive statistics for spatial distributions
Mean Center
Centroid
Standard Distance Deviation
Standard Distance Ellipse
Density Kernel Estimation
(Add Frequency Distributions and mapping—use GeoDA to produce)



Centrographic Statistics
Measures of Centrality
– Mean Center
– Centroid
– Weighted mean center
– Center of Minimum Distance
Measures of Dispersion
– Standard Distance
– Standard Deviational Ellipse
• Two dimensional (spatial) equivalents of standard descriptive statistics for a single variable (univariate).
• Used for point data
– May be used for polygons by first obtaining the centroid of each polygon
• Best used to compare two distributions with each other
– 1990 with 2000
– males with females
(O&U Ch. 4 p. 77-81)



Mean Center
• Simply the mean of the X and the mean of the Y
coordinates for a set of points
• Sum of differences between the mean X and all
other Xs is zero (same for Y)
• Minimizes sum of squared distances between itself and all points:
$$\min \sum_{i=1}^{n} d_{iC}^{2}$$
• Distant points have large effect: values for Xinjiang will have larger effect
• Provides a single point summary measure for the location of a set of points
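The properties listed for the mean center can be checked numerically. The sketch below uses the five example points that appear in the worked examples of these slides; the function names `mean_center` and `sum_squared_distances` are illustrative.

```python
# Five example points (X, Y) from the worked example in these slides.
points = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]

def mean_center(pts):
    """Mean of the X coordinates and mean of the Y coordinates."""
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

def sum_squared_distances(center, pts):
    """Sum of squared straight-line distances from center to every point."""
    cx, cy = center
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in pts)

mc = mean_center(points)

# Deviations from the mean X sum to zero (the same holds for Y).
x_deviation_sum = sum(x - mc[0] for x, _ in points)

# Moving the center in any direction increases the sum of squared distances.
at_center = sum_squared_distances(mc, points)
shifted = sum_squared_distances((mc[0] + 0.5, mc[1]), points)

print(f"mean center = ({mc[0]:.1f}, {mc[1]:.1f})")
print(f"sum of squared distances at center = {at_center:.1f}, shifted = {shifted:.2f}")
```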


Centroid
• The equivalent for polygons of the mean center for a point
distribution
• The center of gravity or balancing point of a polygon
• if polygon is composed of straight line segments between
nodes, centroid given by “average X, average Y” of nodes
(there is an example later)

• Calculation is sometimes approximated as the center of the bounding box
– Not good
• By calculating the centroids for a set of polygons, we can apply Centrographic Statistics to polygons
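The sketch below computes the “average X, average Y” of the nodes as described above, using the five-node example polygon from these slides. For comparison it also computes the area-weighted centroid with the standard shoelace formula; that second function is an addition for illustration, not something the slides describe, and the two results generally differ for irregular polygons. Function names are hypothetical.

```python
# Nodes of the example polygon, in order around the boundary.
polygon = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]

def vertex_average(vertices):
    """Average X and average Y of the nodes (the method described above)."""
    n = len(vertices)
    return (sum(x for x, _ in vertices) / n, sum(y for _, y in vertices) / n)

def area_centroid(vertices):
    """Area-weighted centroid of a simple polygon (shoelace formula).

    Included only for comparison; works for either node ordering
    because the signed area cancels in the final division.
    """
    signed_area = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        signed_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    signed_area *= 0.5
    return (cx / (6.0 * signed_area), cy / (6.0 * signed_area))

va = vertex_average(polygon)
ac = area_centroid(polygon)
print(f"vertex average = ({va[0]:.2f}, {va[1]:.2f})")
print(f"area centroid  = ({ac[0]:.2f}, {ac[1]:.2f})")
```

The difference arises because the vertex average depends on how densely the boundary is digitized, while the area-weighted version depends only on the shape.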


Centroids for Provinces of China
[Figure: maps showing the centroids for the provinces of China]



Warning:
Centroid may not be inside its polygon
• For Gansu Province, China, centroid is within neighboring province of Qinghai
• Problem arises with crescent-shaped polygons



Weighted Mean Center
• Produced by weighting each X and Y
coordinate by another variable (Wi)
• Centroids derived from polygons can be
weighted by any characteristic of the polygon
– For example, the population of a province

$$\bar{X} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i}$$



Calculating the centroid of a polygon or the mean center of a set of points (same example data as for area of polygon).

[Figure: the points (2,3), (4,7), (7,7), (7,3), (6,2) plotted on axes running from 0 to 10]

ID            X     Y
1             2     3
2             4     7
3             7     7
4             7     3
5             6     2
sum          26    22
Centroid/MC   5.2   4.4

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$

Calculating the weighted mean center. Note how it is pulled toward the high weight point.

[Figure: the same five points on axes running from 0 to 10, with the weighted mean center marked]

i       X     Y    weight    wX       wY
1       2     3    3,000     6,000    9,000
2       4     7      500     2,000    3,500
3       7     7      400     2,800    2,800
4       7     3      100       700      300
5       6     2      300     1,800      600
sum    26    22    4,300    13,300   16,200
w MC    3.09  3.77

$$\bar{X} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} w_i Y_i}{\sum_{i=1}^{n} w_i}$$
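The weighted mean center example can be reproduced with the sketch below, which uses the five points and weights from the table. The function name `weighted_mean_center` is illustrative.

```python
# (X, Y, weight) for the five example points in the table.
weighted_points = [
    (2, 3, 3000),
    (4, 7, 500),
    (7, 7, 400),
    (7, 3, 100),
    (6, 2, 300),
]

def weighted_mean_center(pts):
    """Sum of weight * coordinate divided by the sum of weights, for X and Y."""
    total_w = sum(w for _, _, w in pts)
    x_bar = sum(w * x for x, _, w in pts) / total_w
    y_bar = sum(w * y for _, y, w in pts) / total_w
    return (x_bar, y_bar)

wmc = weighted_mean_center(weighted_points)

# With equal weights the same function gives the ordinary mean center.
mc = weighted_mean_center([(x, y, 1) for x, y, _ in weighted_points])

print(f"weighted mean center   = ({wmc[0]:.2f}, {wmc[1]:.2f})")
print(f"unweighted mean center = ({mc[0]:.1f}, {mc[1]:.1f})")
```

Point 1 at (2, 3) holds about 70% of the total weight, which is why the weighted center sits much closer to it than the unweighted center does.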


Center of Minimum Distance or Median Center
• Also called point of minimum aggregate travel
• That point (MD) which minimizes the sum of distances between itself and all other points (i):
$$\min \sum_{i=1}^{n} d_{iMD}$$
• No direct solution. Can only be derived by approximation
• Not a determinate solution. Multiple points may meet this criterion (see next bullet).
• Same as Median center:
– Intersection of two orthogonal lines (at right angles to each other), such that each line has half of the points to its left and half to its right
– Because the orientation of the axes for the lines is arbitrary, multiple points may meet this criterion.
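Because there is no direct solution, the point of minimum aggregate travel has to be found iteratively. The sketch below uses Weiszfeld's algorithm, one common iterative approximation; this is a choice made for illustration, and the slides do not say which method ArcGIS or other packages use. It starts from the mean center and repeatedly re-weights each point by the inverse of its distance from the current estimate. Function names and the example points are illustrative.

```python
import math

def total_distance(center, pts):
    """Sum of straight-line distances from center to every point."""
    cx, cy = center
    return sum(math.hypot(x - cx, y - cy) for x, y in pts)

def center_of_minimum_distance(pts, tol=1e-9, max_iter=1000):
    """Approximate the point minimizing total distance (Weiszfeld iteration)."""
    n = len(pts)
    # Start from the mean center.
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    for _ in range(max_iter):
        num_x = num_y = denom = 0.0
        for x, y in pts:
            d = math.hypot(x - cx, y - cy)
            if d < 1e-12:
                # Simplification: if the estimate lands on a data point, stop there.
                return (x, y)
            num_x += x / d
            num_y += y / d
            denom += 1.0 / d
        new_cx, new_cy = num_x / denom, num_y / denom
        shift = math.hypot(new_cx - cx, new_cy - cy)
        cx, cy = new_cx, new_cy
        if shift < tol:  # estimate has stopped moving
            break
    return (cx, cy)

points = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
mean_c = (5.2, 4.4)
md = center_of_minimum_distance(points)

print(f"total distance from mean center: {total_distance(mean_c, points):.4f}")
print(f"total distance from MD estimate: {total_distance(md, points):.4f}")
```

The mean center minimizes the sum of squared distances, so it is a reasonable starting point, but the total (unsquared) distance from the converged estimate is never larger than from the mean center.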
Source: Neft, 1966
Median and Mean Centers for US Population

Median Center:
Intersection of a north/south and an east/west line drawn so half of population lives above and half below the e/w line, and half lives to the left and half to the right of the n/s line

Mean Center:
Balancing point of a weightless map, if equal weights placed on it at the residence of every person on census day.

Source: US Statistical Abstract 2003

Standard Distance Deviation
• Represents the standard deviation of the distance of each point from the mean center
• Is the two dimensional equivalent of standard deviation for a single variable

Formula for standard deviation of a single variable:
$$\sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{N}}$$

• Given by:
$$sdd = \sqrt{\frac{\sum_{i=1}^{n} (X_i - X_c)^2 + \sum_{i=1}^{n} (Y_i - Y_c)^2}{N}}$$
which by Pythagoras reduces to:
$$sdd = \sqrt{\frac{\sum_{i=1}^{n} d_{iC}^{2}}{N}}$$
---essentially the average distance of points from the center

Or, with weights:
$$sdd_w = \sqrt{\frac{\sum_{i=1}^{n} w_i (X_i - X_c)^2 + \sum_{i=1}^{n} w_i (Y_i - Y_c)^2}{\sum_{i=1}^{n} w_i}}$$

Provides a single unit measure of the spread or dispersion of a distribution.
We can also calculate a weighted standard distance analogous to the weighted mean center.
Standard Distance Deviation Example

[Figure: the points (2,3), (4,7), (7,7), (7,3), (6,2) on axes running from 0 to 10, with a circle of radius = SDD = 2.9 drawn around the mean center]

i            X     Y    (X - Xc)2   (Y - Yc)2
1            2     3      10.2         2.0
2            4     7       1.4         6.8
3            7     7       3.2         6.8
4            7     3       3.2         2.0
5            6     2       0.6         5.8
sum         26    22      18.8        23.2
Centroid     5.2   4.4
sum of sums  42.00
divide N      8.40
sq rt         2.90

$$sdd = \sqrt{\frac{\sum_{i=1}^{n} (X_i - X_c)^2 + \sum_{i=1}^{n} (Y_i - Y_c)^2}{N}}$$
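The worked example can be reproduced in a few lines. This sketch follows the same steps as the table (squared deviations in X and Y, sum of sums, divide by N, square root); the function name `standard_distance` is illustrative.

```python
import math

# Five example points (X, Y) from the table.
points = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]

def standard_distance(pts):
    """Square root of the mean squared distance of points from the mean center."""
    n = len(pts)
    xc = sum(x for x, _ in pts) / n
    yc = sum(y for _, y in pts) / n
    sum_sq_x = sum((x - xc) ** 2 for x, _ in pts)  # column (X - Xc)2
    sum_sq_y = sum((y - yc) ** 2 for _, y in pts)  # column (Y - Yc)2
    return math.sqrt((sum_sq_x + sum_sq_y) / n)

sdd = standard_distance(points)
print(f"standard distance = {sdd:.2f}")
```

The result is the radius of the circle drawn around the mean center in the example figure.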



Standard Deviational Ellipse: concept
• Standard distance deviation is a good single measure of the dispersion of the points around the mean center, but it does not capture any directional bias
– doesn’t capture the shape of the distribution.
• The standard deviational ellipse gives dispersion in two dimensions
• Defined by 3 parameters
– Angle of rotation
– Dispersion (spread) along major axis
– Dispersion (spread) along minor axis
The major axis defines the direction of maximum spread of the distribution.
The minor axis is perpendicular to it and defines the minimum spread.



Standard Deviational Ellipse: calculation
• Formulae for calculation may be found in references
such as
– Lee and Wong pp. 48-49
– Levine, Chapter 4, pp.125-128
• Basic concept is to:
– Find the axis going through maximum dispersion (thus derive
angle of rotation)
– Calculate standard deviation of the points along this axis (thus
derive the length (radii) of major axis)
– Calculate standard deviation of points along the axis
perpendicular to major axis (thus derive the length (radii) of
minor axis)
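The three steps above can be sketched using the variances and covariance of the coordinates. This is a simplified illustration under stated assumptions: it divides by N and applies no scale correction, whereas the formulae in the cited references (and in packages such as CrimeStat or ArcGIS) may use different divisors or scaling, so axis lengths from those tools can differ. The function name is hypothetical.

```python
import math

def standard_deviational_ellipse(pts):
    """Return (center, rotation angle in degrees, major spread, minor spread).

    The angle is measured counterclockwise from the X axis to the major axis.
    """
    n = len(pts)
    xc = sum(x for x, _ in pts) / n
    yc = sum(y for _, y in pts) / n
    # Variances and covariance of the coordinates about the mean center.
    sxx = sum((x - xc) ** 2 for x, _ in pts) / n
    syy = sum((y - yc) ** 2 for _, y in pts) / n
    sxy = sum((x - xc) * (y - yc) for x, y in pts) / n
    # Step 1: direction of maximum dispersion.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # Steps 2 and 3: standard deviation along that axis and perpendicular to it.
    half_trace = (sxx + syy) / 2.0
    spread = math.hypot((sxx - syy) / 2.0, sxy)
    major = math.sqrt(half_trace + spread)
    minor = math.sqrt(max(0.0, half_trace - spread))  # guard against rounding below zero
    return (xc, yc), math.degrees(theta), major, minor

points = [(2, 3), (4, 7), (7, 7), (7, 3), (6, 2)]
center, angle, major, minor = standard_deviational_ellipse(points)
print(f"center = {center}, angle = {angle:.1f} degrees")
print(f"major = {major:.2f}, minor = {minor:.2f}")
```

Under these assumptions the squared major and minor spreads add up to the squared standard distance, so the ellipse splits the same total dispersion into two directions.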


Mean Center & Standard Deviational Ellipse: example

[Figure: map of mean centers and standard deviational ellipses for two industries in North Texas]
There appears to be no major difference between the location of the software and the telecommunications industry in North Texas.



Implementation in ArcGIS
In ArcToolbox:
– Median Center for a set of points
– Standard deviation ellipse
– Centroid for a set of points
– Standard distance
• To calculate centroid for a set of polygons, with ArcGIS:
ArcToolbox>Data Management Tools>Features>Feature to Point (requires ArcInfo)
• To calculate using GeoDA:
– Tools>Shape>Polygons to Centroids

Density Kernel Estimation
• commonly used to “visually enhance” a point pattern
• Is an example of “exploratory spatial data analysis”
(ESDA)

[Figure: two density surfaces for the same point pattern, one with Kernel=10,000 and one with Kernel=5,000; shading runs from low to high]

• SIMPLE Kernel option (see example above)


– A “neighborhood” or kernel is defined around each grid cell consisting of all grid
cells with centers within the specified kernel (search) radius
– The number of points that fall within that neighborhood is totaled
– The point total is divided by the area of the neighborhood to give the grid cell’s value
• Density KERNEL option
– a smoothly curved surface is fitted over each point
– The surface value is highest at the location of the point, and diminishes with increasing distance
from the point, reaching zero at the kernel distance from the point.
– Volume under the surface equals 1 (or the population value if a population variable is used)
– Uses quadratic kernel function described in Silverman (1986, p. 76, equation 4.5).
– The density at each output grid cell is calculated by adding the values of all the kernel surfaces
where they overlay the grid cell center.
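Both options can be sketched for a single output grid cell. The quadratic kernel below assumes the two-dimensional form K(u) = (3/π)(1 − u²)² for u < 1, scaled by the search radius so that the volume under each point's surface is 1; this is my reading of the cited Silverman equation, and the ArcGIS implementation may differ in detail. Function and parameter names are illustrative, not ArcGIS tool names.

```python
import math

def simple_density(cell, points, radius):
    """SIMPLE option: points within the search radius divided by the circle's area."""
    cx, cy = cell
    count = sum(1 for x, y in points if math.hypot(x - cx, y - cy) <= radius)
    return count / (math.pi * radius ** 2)

def kernel_density(cell, points, radius, weights=None):
    """KERNEL option: add up the kernel surfaces that overlie the cell center.

    weights plays the role of a population field (default: 1 per point).
    """
    cx, cy = cell
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for (x, y), w in zip(points, weights):
        u2 = ((x - cx) ** 2 + (y - cy) ** 2) / radius ** 2  # squared scaled distance
        if u2 < 1.0:  # the surface reaches zero at the kernel distance
            total += w * 3.0 / (math.pi * radius ** 2) * (1.0 - u2) ** 2
    return total

# Hypothetical point pattern and search radius, in map units.
pts = [(0.0, 0.0), (1.0, 0.5), (4.0, 4.0)]
cell_center = (0.5, 0.5)
print(f"simple density = {simple_density(cell_center, pts, 2.0):.4f}")
print(f"kernel density = {kernel_density(cell_center, pts, 2.0):.4f}")
```

A larger search radius spreads each point's contribution over a wider area, which gives a smoother, more generalized surface; a smaller radius keeps more local detail.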
Implementation in ArcGIS
• If you specify a “population field”, the software calculates as if there are that number of points at that location.
• The search radius:
• the size of the neighborhood or kernel which is successively defined around every cell (simple kernel) or each point (density kernel)
• Output cell size:
• Size of each raster cell
• Search radius and output cell size are based on measurement units of the data (here it is feet)
• It is good to “round” them (e.g. to 10,000 and 1,000)
What have we learned today?
• We have learned about descriptive spatial statistics, often called Centrographic Statistics
• Next time, we will learn about Inferential Spatial Statistics



Project for you
• The China data on my web site has population data for the provinces of China in 2008
• Obtain population counts for 2000, 1990 and/or any other year
• Calculate the weighted mean center of China’s population for each year
• Be sure to use the same set of geographic units each time
– For example, if you do not have data for Taiwan or Hong Kong for one year, omit these geographic units for all years



Texts
O’Sullivan, David and David Unwin 2010. Geographic Information Analysis. Hoboken, NJ: John Wiley, 2nd ed.

Other Useful Books:
Mitchell, Andy 2005. ESRI Guide to GIS Analysis Volume 2: Spatial Measurement & Statistics. Redlands, CA: ESRI Press.
Allen, David W. 2009. GIS Tutorial II: Spatial Analysis Workbook. Redlands, CA: ESRI Press.
Wong, David W.S. and Jay Lee 2005. Statistical Analysis of Geographic Information. Hoboken, NJ: John Wiley, 2nd ed.
Levine, Ned and Associates 2004 (with later updates). CrimeStat III Manual. Washington, D.C.: National Institute of Justice. http://www.icpsr.umich.edu/CrimeStat/

Density Kernel Estimation
Silverman, B.W. 1986. Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall.
