
A Geocoding Best Practices Guide

North American Association of Central Cancer Registries, Inc.
By Daniel W. Goldberg
November 2008
University of Southern California
GIS Research Laboratory

SPONSORING ORGANIZATIONS:
Canadian Association of Provincial Cancer Agencies
Canadian Partnership Against Cancer
Centers for Disease Control and Prevention
College of American Pathologists
National Cancer Institute
National Cancer Registrars Association
Public Health Agency of Canada

SPONSORS WITH DISTINCTION:
American Cancer Society
American College of Surgeons
American Joint Committee on Cancer













A GEOCODING BEST PRACTICES GUIDE

SUBMITTED TO

THE NORTH AMERICAN ASSOCIATION OF CENTRAL CANCER REGISTRIES
NOVEMBER 10, 2008

BY
DANIEL W. GOLDBERG
UNIVERSITY OF SOUTHERN CALIFORNIA
GIS RESEARCH LABORATORY


















































TABLE OF CONTENTS

List of Tables .......... vii
List of Figures .......... ix
List of Equations .......... x
List of Best Practices .......... xi
List of Acronyms .......... xiii
Foreword .......... xiv
Preface .......... xv
Acknowledgements .......... xviii
Dedication .......... xix
About This Document .......... xx
Executive Summary .......... xxiii
Part 1: The Concept and Context of Geocoding .......... 1
1. Introduction .......... 3
1.1 What is Geocoding .......... 3
2. The Importance of Geocoding .......... 9
2.1 Geocoding's Importance to Hospitals and Central Registries .......... 9
2.2 Typical Research Workflow .......... 9
2.3 When To Geocode .......... 14
2.4 Success Stories .......... 17
3. Geographic Information Science Fundamentals .......... 19
3.1 Geographic Data Types .......... 19
3.2 Geographic Datums and Geographic Coordinates .......... 19
3.3 Map Projections and Regional Reference Systems .......... 20
Part 2: The Components of Geocoding .......... 23
4. Address Geocoding Process Overview .......... 25
4.1 Types of Geocoding Processes .......... 25
4.2 High-Level Geocoding Process Overview .......... 25
4.3 Software-Based Geocoders .......... 26
4.4 Input Data .......... 28
4.5 Reference Datasets .......... 31
4.6 The Geocoding Algorithm .......... 32
4.7 Output Data .......... 33
4.8 Metadata .......... 34
5. Address Data .......... 37
5.1 Types of Address Data .......... 37
5.2 First-Order Estimates .......... 41
5.3 Postal Address Hierarchy .......... 41
6. Address Data Cleaning Processes .......... 45
6.1 Address Cleanliness .......... 45
6.2 Address Normalization .......... 45
6.3 Address Standardization .......... 50
6.4 Address Validation .......... 51
7. Reference Datasets .......... 55
7.1 Reference Dataset Types .......... 55
7.2 Types of Reference Datasets .......... 55
7.3 Reference Dataset Relationships .......... 65
8. Feature Matching .......... 69
8.1 The Algorithm .......... 69
8.2 Classifications of Matching Algorithms .......... 71
8.3 Deterministic Matching .......... 71
8.4 Probabilistic Matching .......... 78
8.5 String Comparison Algorithms .......... 80
9. Feature Interpolation .......... 83
9.1 Feature Interpolation Algorithms .......... 83
9.2 Linear-Based Interpolation .......... 83
9.3 Areal Unit-Based Feature Interpolation .......... 90
10. Output Data .......... 93
10.1 Downstream Compatibility .......... 93
10.2 Data Loss .......... 93
Part 3: The Many Metrics for Measuring Quality .......... 95
11. Quality Metrics .......... 97
11.1 Accuracy .......... 97
12. Spatial Accuracy .......... 99
12.1 Spatial Accuracy Defined .......... 99
12.2 Contributors to Spatial Accuracy .......... 99
12.3 Measuring Positional Accuracy .......... 104
12.4 Geocoding Process Component Error Introduction .......... 104
12.5 Uses of Positional Accuracy .......... 105
13. Reference Data Quality .......... 111
13.1 Spatial Accuracy of Reference Data .......... 111
13.2 Attribute Accuracy .......... 111
13.3 Temporal Accuracy .......... 112
13.4 Cached Data .......... 114
13.5 Completeness .......... 115
14. Feature-Matching Quality Metrics .......... 119
14.1 Match Types .......... 119
14.2 Measuring Geocoding Match Success Rates .......... 121
14.3 Acceptable Match Rates .......... 124
14.4 Match Rate Resolution .......... 125
15. NAACCR GIS Coordinate Quality Codes .......... 127
15.1 NAACCR GIS Coordinate Quality Codes Defined .......... 127
Part 4: Common Geocoding Problems .......... 131
16. Quality Assurance/Quality Control .......... 133
16.1 Failures and Qualities .......... 133
17. Address Data Problems .......... 137
17.1 Address Data Problems Defined .......... 137
17.2 The Gold Standard of Postal Addresses .......... 137
17.3 Attribute Completeness .......... 138
17.4 Attribute Correctness .......... 139
17.5 Address Lifecycle Problems .......... 140
17.6 Address Content Problems .......... 141
17.7 Address Formatting Problems .......... 143
17.8 Residence Type and History Problems .......... 143
18. Feature-Matching Problems .......... 145
18.1 Feature-Matching Failures .......... 145
19. Manual Review Problems .......... 153
19.1 Manual Review .......... 153
19.2 Sources for Deriving Addresses .......... 155
20. Geocoding Software Problems .......... 159
20.1 Common Software Pitfalls .......... 159
Part 5: Choosing a Geocoding Process .......... 161
21. Choosing a Home-Grown or Third-Party Geocoding Solution .......... 163
21.1 Home-Grown and Third-Party Geocoding Options .......... 163
21.2 Setting Process Requirements .......... 163
21.3 In-House vs. External Processing .......... 164
21.4 Home-Grown or COTS .......... 165
21.5 Flexibility .......... 166
21.6 Process Transparency .......... 166
21.7 How To Select a Vendor .......... 167
21.8 Evaluating and Comparing Geocoding Results .......... 168
22. Buying vs. Building Reference Datasets .......... 171
22.1 No Assembly Required .......... 171
22.2 Some Assembly Required .......... 171
22.3 Determining Costs .......... 171
23. Organizational Geocoding Capacity .......... 173
23.1 How To Measure Geocoding Capacity .......... 173
Part 6: Working With Geocoded Data .......... 175
24. Tumor Records With Multiple Addresses .......... 177
24.1 Selecting From Multiple Case Geocodes .......... 177
25. Hybridized Data .......... 179
25.1 Hybridized Data Defined .......... 179
25.2 Geocoding Impacts on Incidence Rates .......... 180
25.3 Implications of Aggregating Up .......... 181
26. Ensuring Privacy and Confidentiality .......... 183
26.1 Privacy and Confidentiality .......... 183
Glossary of Terms .......... 189
References .......... 201
Appendix A: Example Researcher Assurance Documents .......... 223
Appendix B: Annotated Bibliography .......... 231


LIST OF TABLES

Table 1 - Types of readers, concerns, and sections of interest .......... xxii
Table 2 - Alternative definitions of "geocoding" .......... 4
Table 3 - Possible input data types (textual descriptions) .......... 5
Table 4 - Common forms of input data with corresponding NAACCR fields and example values .......... 29
Table 5 - Multiple forms of a single address .......... 30
Table 6 - Existing and proposed address standards .......... 30
Table 7 - Example reference datasets .......... 32
Table 8 - Example geocoding component metadata .......... 34
Table 9 - Example geocoding process metadata .......... 35
Table 10 - Example geocoding record metadata .......... 35
Table 11 - Example postal addresses .......... 37
Table 12 - First order accuracy estimates .......... 41
Table 13 - Resolutions, issues, and ranks of different address types .......... 42
Table 14 - Example postal addresses in different formats .......... 45
Table 15 - Common postal address attribute components .......... 45
Table 16 - Common address verification data sources .......... 53
Table 17 - Common linear-based reference datasets .......... 57
Table 18 - Common postal address linear-based reference dataset attributes .......... 58
Table 19 - Common polygon-based reference datasets .......... 59
Table 20 - Common polygon-based reference dataset attributes .......... 63
Table 21 - Point-based reference datasets .......... 64
Table 22 - Minimum set of point-based reference dataset attributes .......... 65
Table 23 - Attribute relation example, linear-based reference features .......... 71
Table 24 - Attribute relation example, ambiguous linear-based reference features .......... 72
Table 25 - Preferred attribute relaxation order with resulting ambiguity, relative magnitudes of ambiguity and spatial error, and worst-case resolution, passes 1-4 .......... 74
Table 26 - Preferred attribute relaxation order with resulting ambiguity, relative magnitudes of ambiguity and spatial error, and worst-case resolution, pass 5 .......... 75
Table 27 - Preferred attribute relaxation order with resulting ambiguity, relative magnitudes of spatial error, and worst-case resolution, pass 6 .......... 76
Table 28 - String comparison algorithm examples .......... 81
Table 29 - Metrics for deriving confidence in geocoded results .......... 98
Table 30 - Proposed relative positional accuracy metrics .......... 105
Table 31 - TTL assignment and freshness calculation considerations for cached data .......... 115
Table 32 - Simple completeness measures .......... 117
Table 33 - Possible matching outcomes with descriptions and causes .......... 120
Table 34 - NAACCR recommended GIS Coordinate Quality Codes (paraphrased) .......... 127
Table 35 - Classes of geocoding failures with examples for true address 3620 S. Vermont Ave, Los Angeles CA 90089 .......... 134
Table 36 - Quality decisions with examples and rationale .......... 135
Table 37 - Composite feature geocoding options for ambiguous data .......... 151
Table 38 - Trivial data entry errors for 3620 South Vermont Ave, Los Angeles, CA .......... 154
Table 39 - Common sources of supplemental data with typical cost, formal agreement requirements, and usage type .......... 157
Table 40 - Geocoding process component considerations .......... 164
Table 41 - Commercial geocoding package policy considerations .......... 166
Table 42 - Topics and issues relevant to selecting a vendor .......... 167
Table 43 - Categorization of geocode results .......... 168
Table 44 - Comparison of geocoded cases per year to FTE positions .......... 173
Table 45 - Possible factors influencing the choice of dxAddress with decision criteria if they have been proposed .......... 178
Table 46 - Previous geocoding studies classified by topics of input data utilized .......... 232
Table 47 - Previous geocoding studies classified by topics of reference data source .......... 238
Table 48 - Previous geocoding studies classified by topics of feature matching approach .......... 244
Table 49 - Previous geocoding studies classified by topics of feature interpolation method .......... 248
Table 50 - Previous geocoding studies classified by topics of accuracy measured utilized .......... 252
Table 51 - Previous geocoding studies classified by topics of process used .......... 257
Table 52 - Previous geocoding studies classified by topics of privacy concern and/or method .......... 260
Table 53 - Previous geocoding studies classified by topics of organizational cost .......... 261



LIST OF FIGURES

Figure 1 - Typical research workflow .......... 10
Figure 2 - High-level data relationships .......... 26
Figure 3 - Schematic showing basic components of the geocoding process .......... 26
Figure 4 - Generalized workflow .......... 27
Figure 5 - Origin of both the 100 North (longer arrow pointing up and to the left) and 100 South (shorter arrow pointing down and to the right) Sepulveda Boulevard blocks (Google, Inc. 2008b) .......... 38
Figure 6 - Geographic resolutions of different address components (Google, Inc. 2008b) .......... 43
Figure 7 - Example address validation interface (https://webgis.usc.edu) .......... 52
Figure 8 - Vector reference data of different resolutions (Google, Inc. 2008b) .......... 56
Figure 9 - Example 3D building models (Google, Inc. 2008a) .......... 60
Figure 10 - Example building footprints in raster format (University of Southern California 2008) .......... 61
Figure 11 - Example building footprints in digital format (University of California, Los Angeles 2008) .......... 62
Figure 12 - Example parcel boundaries with centroids .......... 63
Figure 13 - Generalized feature-matching algorithm .......... 69
Figure 14 - Example relaxation iterations .......... 77
Figure 15 - Example of parcel existence and homogeneity assumptions .......... 86
Figure 16 - Example of uniform lot assumption .......... 87
Figure 17 - Example of actual lot assumption .......... 87
Figure 18 - Example of street offsets .......... 88
Figure 19 - Example of corner lot problem .......... 89
Figure 20 - Certainties within geographic resolutions (Google, Inc. 2008b) .......... 101
Figure 21 - Example of misclassification due to uncertainty (Google, Inc. 2008b) .......... 106
Figure 22 - Examples of different match types .......... 120
Figure 23 - Match rate diagrams .......... 123
Figure 24 - Example uncertainty areas from MBR or ambiguous streets vs. encompassing city (Google, Inc. 2008b) .......... 150


LIST OF EQUATIONS

Equation 1 - Conditional probability .......... 78
Equation 2 - Agreement and disagreement probabilities and weights .......... 79
Equation 3 - Size of address range and resulting distance from origin .......... 84
Equation 4 - Resulting output interpolated point .......... 84
Equation 5 - Simplistic match rate .......... 122
Equation 6 - Advanced match rate .......... 122
Equation 7 - Generalized match rate .......... 124

LIST OF BEST PRACTICES

Best Practices 1 - Fundamental geocoding concepts .......... 7
Best Practices 2 - Address data gathering .......... 11
Best Practices 3 - Residential history address data .......... 12
Best Practices 4 - Secondary address data gathering .......... 12
Best Practices 5 - Conversion to numeric spatial data .......... 13
Best Practices 6 - Spatial association .......... 14
Best Practices 7 - When to geocode .......... 17
Best Practices 8 - Geographic fundamentals .......... 22
Best Practices 9 - Geocoding requirements .......... 27
Best Practices 10 - Input data (high level) .......... 31
Best Practices 11 - Reference data (high level) .......... 32
Best Practices 12 - Geocoding algorithm (high level) .......... 33
Best Practices 13 - Output data (high level) .......... 33
Best Practices 14 - Input data types .......... 40
Best Practices 15 - Substitution-based normalization .......... 47
Best Practices 16 - Context-based normalization .......... 49
Best Practices 17 - Probability-based normalization .......... 50
Best Practices 18 - Address standardization .......... 51
Best Practices 19 - Address validation .......... 54
Best Practices 20 - Reference dataset types .......... 66
Best Practices 21 - Reference dataset relationships .......... 67
Best Practices 22 - Reference dataset characteristics .......... 68
Best Practices 23 - SQL-like feature matching .......... 70
Best Practices 24 - Deterministic feature matching .......... 78
Best Practices 25 - Probabilistic feature matching .......... 80
Best Practices 26 - String comparison algorithms .......... 82
Best Practices 27 - Linear-based interpolation .......... 85
Best Practices 28 - Linear-based interpolation assumptions .......... 90
Best Practices 29 - Areal unit-based interpolation .......... 92
Best Practices 30 - Output data .......... 93
Best Practices 31 - Output data accuracy .......... 99
Best Practices 32 - Input data implicit accuracies .......... 102
Best Practices 33 - Reference dataset accuracy .......... 103
Best Practices 34 - Positional accuracy .......... 108
Best Practices 35 - Reference dataset spatial accuracy problems .......... 112
Best Practices 36 - Reference dataset temporal accuracy .......... 113
)=IAP@AB 567 8669 KD
! #A=N=?DMC *AEF -BHNFDNAE #JD?A
Best Practices 3 - Geocode caching .......................................................................................... 116
Best Practices 38 - Reerence dataset completeness problems ................................................ 11
Best Practices 39 - leature match types ...................................................................................... 121
Best Practices 40 - Success ,match, rates .................................................................................... 126
Best Practices 41 - GIS Coordinate Quality Codes ................................................................... 129
Best Practices 42 - Common address problem management .................................................. 13
Best Practices 43 - Creating gold standard addresses ............................................................... 138
Best Practices 44 - Input data correctness .................................................................................. 140
Best Practices 45 - Address liecycle problems .......................................................................... 141
Best Practices 46 - Address content problems .......................................................................... 142
Best Practices 4 - Address ormatting problems ..................................................................... 143
Best Practices 48 - Conceptual problems ................................................................................... 144
Best Practices 49 - leature-matching ailures ............................................................................ 146
Best Practices 50 - Unmatched addresses ................................................................................... 153
Best Practices 51 - Unmatched addresses manual reiew ........................................................ 155
Best Practices 52 - Unmatched address manual reiew data sources ..................................... 156
Best Practices 53 - Common geocoding sotware limitations by component o the
geocoding process .................................................................................................................... 160
Best Practices 54 - In-house ersus external geocoding ........................................................... 165
Best Practices 55 - Process transparency .................................................................................... 16
Best Practices 56 - Laluating third-party geocoded results .................................................... 169
Best Practices 5 - Choosing a reerence dataset ...................................................................... 12
Best Practices 58 - Measuring geocoding capacity .................................................................... 14
Best Practices 59 - lybridizing data ............................................................................................ 180
Best Practices 60 - Incidence rate calculation ............................................................................ 181
Best Practices 61 - MAUP ............................................................................................................ 181
Best Practices 62 - Geocoding process priacy auditing when behind a irewall ................. 184
Best Practices 63 - 1hird-party processing ,external processing,............................................ 185
Best Practices 64 - Geocoding process log iles ........................................................................ 186
Best Practices 65 - Geographic masking ..................................................................................... 18
Best Practices 66 - Post-registry security .................................................................................... 18



LIST OF ACRONYMS

0D Zero Dimensional
1D One Dimensional
2D Two Dimensional
3D Three Dimensional
4D Four Dimensional
CI Confidence Interval
CBG U.S. Census Bureau Census Block Group
COTS Commercial Off The Shelf
CT U.S. Census Bureau Census Tract
DHHS U.S. Department of Health and Human Services
DoD U.S. Department of Defense
DMV Department of Motor Vehicles
E-911 Emergency 911
EMS Emergency Medical Services
FCC Feature Classification Code
FGDC Federal Geographic Data Committee
FIPS Federal Information Processing Standards
FTE Full Time Equivalent
GIS Geographic Information System
G-NAF Geocoded National Address File
GPS Global Positioning System
IR Information Retrieval
LA Los Angeles
MBR Minimum Bounding Rectangle
MCD Minor Civil Division
MTFCC MAF/TIGER Feature Class Code
NAACCR North American Association of Central Cancer Registries
NCI National Cancer Institute
NIH United States National Institutes of Health
PO Box USPS Post Office Box
RR Rural Route
SES Socio-Economic Status
SQL Structured Query Language
SVM Support Vector Machine
TIGER Topographically Integrated Geographic Encoding and Referencing
TTL Time to Live
URISA Urban and Regional Information Systems Association
U.S. United States
USC University of Southern California
USPS United States Postal Service
ZCTA ZIP Code Tabulation Area

FOREWORD

The advent of geographic information science and the accompanying technologies (geographic
information systems [GIS], global positioning systems [GPS], remote sensing [RS], and more
recently location-based services [LBS]) has forever changed the ways in which people conceive
of and navigate planet Earth. Geocoding is a key bridge linking the old and the new: a world
in which streets and street addresses served as the primary location identifiers, and the
modern world in which more precise representations are possible and needed to explore,
analyze, and visualize geographic patterns, their drivers, and their consequences. Geocoding,
viewed from this perspective, brings together the knowledge and work of the geographer and
the computer scientist. The author, Daniel Goldberg, has done an excellent job in laying out
the fundamentals of geocoding as a process using the best contributions from both of these
once-disparate fields.
This book will serve as a rich reference manual for those who want to inject more science and
less art (uncertainty) into their geocoding tasks. This is particularly important for medical
geography and epidemiology applications, as recent research findings point to environmental
conditions that may contribute to and/or exacerbate health problems that vary over distances
of hundreds and even tens of meters (e.g., as happens with proximity to freeways). These
findings call for much better and more deliberate geocoding practices than many practitioners
have used to date and bring the contents of this best practices manual to the fore. This book
provides a long overdue summary of the state-of-the-art of geocoding and will be essential
reading for those who wish and/or need to generate detailed and accurate geographic positions
from street addresses and the like.





John Wilson
June 6, 2008

PREFACE

In one sense, writing this manuscript has been a natural continuation of the balancing act
that has been, and continues to be, my graduate student career. I am fortunate to be a
Computer Science (CS) Ph.D. student at the University of Southern California (USC), working in
the Department of Geography, advised by a Professor in the Department of Preventive Medicine,
who at the time of this writing was supported by the Department of Defense. While at times
unbearably frustrating and/or strenuous, learning to tread the fine lines between these
separate yet highly related fields (as well as blur them when necessary) has taught me some
important lessons and given me a unique perspective from which I have written this manuscript
and which I will take with me throughout my career. This combination of factors has led to my
involvement in many extremely interesting and varied projects in diverse capacities, and has
allowed me to interact with academics and professionals whom I would most likely not have
otherwise met or had any contact with.
Case in point is this manuscript. In November of 2006, Dr. John P. Wilson, my always
industrious and (at-the-time) Geography advisor (now jointly appointed in CS), was hard at
work securing funding for his graduate students (as all good faculty members should spend
the majority of their time). He identified an opportunity for a student to develop a
GIS-based traffic pollution exposure assessment tool for his colleague in the USC Department
of Preventive Medicine (and my soon-to-be advisor), Dr. Myles G. Cockburn, which was right in
line with my programming skills. What started off as a simple question regarding the supposed
accuracy of the geocodes being used for the exposure model quickly turned into a day-long
discussion about the geocoder I had built during the previous summer as a Research Assistant
for my CS advisor, Dr. Craig A. Knoblock. This discussion eventually spawned several grant
proposals, including one entitled "Geocoding Best Practices Document Phase I: Consultant for
NAACCR GIS Committee Meeting & Development of Annotated Outline," submitted to the North
American Association of Central Cancer Registries (NAACCR) on April 21, 2006.
To my great surprise, I was awarded the grant and immediately set to work creating the
outline for the meeting and the Annotated Geocoding Reading List I had promised in my
proposal. Ambitiously, I started reading and taking notes on the 150 latest geocoding works,
at which point the NAACCR GIS Committee, chaired at that time by David O'Brien of the Alaska
Cancer Registry, should have run for cover. The first draft I produced after the in-person
meeting during the June NAACCR 2006 Annual Meeting in Regina, Saskatchewan, Canada was far
too detailed, too CS oriented, and too dense for anyone to make sense of. However, guided by
the thoughtful but sometimes ruthless suggestions of Drs. Wilson and Cockburn, I was able to
transform that draft into an earlier version of this document for final submission to the
NAACCR GIS Committee, which then sent it to the NAACCR Executive Board for approval in
October 2006. It was approved, and I was subsequently selected to write the full version of
the current work, A Geocoding Best Practices Guide.
I dare say that this exercise proved longer and more in-depth than anyone could have
anticipated. Looking back 2 years, I do not think I could have imagined what this project
would eventually turn into: 200 plus pages of text, 200 plus references, an annotated reading
list the size of a small phone book, example research assurance documents, and a full
glossary.
At more than one-half million characters and more than 250 pages, this document may at first
seem a daunting task to read and digest in full. However, this fear should be laid to rest.
More than one-third of this length consists of the front matter (e.g., Table of Contents,
indices, Foreword, and Preface) and the back matter (e.g., Glossary, References, and
Appendices). Most of this material is intended as reference, and it is expected that only the
most motivated and inquisitive of readers will explore it all. The main content of the
document, Sections 1-26, is organized such that an interested reader can quickly and easily
turn to their topic(s) of interest, at the desired level of detail, at a moment's notice
through the use of the Table of Contents and lists of figures and tables found in the front
matter.
In addition to this concern, there were three major hurdles that had to be overcome during
the writing of this document. The first was a question as to what the focus and tone should
be. From the earliest conception, it was clear that this document should be a "Best Practices
Guide," which implicitly meant that it should "tell someone what to do when in a particular
situation." The question, however, was "who was the person who was to be informed?" Was it
the technical person performing the geocoding who might run into a sticky situation and need
direction as to which of two options they should choose? Was it the manager who needed to
know the differences between reference datasets so they could make the correct investment for
their registry? Or was it the researcher who would be utilizing the geocoded data and needed
to know what the accuracy measure meant and where it came from? After lengthy discussion, it
was determined that the first two (the person performing the geocoding and the person
deciding on the geocoding strategy) would be the target audience, because they are the
registry personnel for whom this document was being created. Therefore, this document goes
into great detail about the technical aspects of the geocoding process such that the best
practices developed throughout the text can and should actually be applied during the process
of geocoding. Likewise, the theoretical underpinnings are spelled out completely such that
the person deciding on which geocoding process to apply can make the most informed decision
possible.
The second hurdle that had to be cleared was political in nature. During the process of
determining the set of theoretical best practices presented in this document, it came to
light that in some cases, the current NAACCR standards and/or practices were insufficient,
inappropriate, and/or precluded what I would consider the actual true best practice.
Following lengthy discussion, it was decided that the set of best practices developed for
this document should remain true to "what should be done," not simply what the current
standards allow. Therefore, in several places throughout the manuscript, it is explicitly
stated that the best practices recommended are for the ideal case and may not be currently
supported by other existing NAACCR standards. In these cases, I have attempted to provide
justification and support for why these would be the correct best practices, in the hope that
they can be taken into consideration as the existing NAACCR standards are reviewed and
modified over time.
The final challenge to overcome in creating this manuscript was the sheer diversity of the
NAACCR member registries in terms of their geocoding knowledge, resources, practices, and
standards that needed to be addressed. The members of the NAACCR GIS Committee who
contributed to the production of this document came from every corner of the United States
and various levels of government, and represented the full geocoding spectrum from highly
advanced and extremely knowledgeable experts to individuals just starting out with more
questions than answers. Although input from all of these varied user types undoubtedly led to
a more accessible finished product, it was quite a task to produce a document that would be
equally useful to all of them. I feel that their input helped produce a much stronger text
that should be appropriate to readers of all levels, from those just getting started to those
with decades of experience who may be developing their own geocoders.
The content of this manuscript represents countless hours of work by many dedicated people.
The individuals listed in the Acknowledgements section each spent a significant amount of
time reviewing and commenting on every sentence of this document. Most participated in weekly
Editorial Review Committee calls from March 2007 to March 2008, and all contributed to making
this document what it is. In particular, I would like to thank Frank Boscoe for his steady
leadership as NAACCR GIS Committee Chair during the period covering most of the production of
this book. I take full responsibility for all grammatical errors and run-on sentences, and
believe me when I tell you that this book would be in far worse shape had John Wilson not
volunteered to copyedit every single word. I would not be writing this if it were not for
Myles Cockburn, so for better or worse, all blame should be directed toward him. The other
members of the weekly Editorial Review Committee, namely Stephanie Foster, Kevin Henry,
Christian Klaus, Mary Mroszczyk, Recinda Sherman, and David Stinchcomb, all volunteered
substantial time and effort and contributed valuable expert opinions, questions, corrections,
edits, and content, undoubtedly improving the quality of the final manuscript. These detailed
and often heated discussions served to focus the content, tone, and direction of the finished
product in a manner that I would have been incapable of on my own. I would not currently be a
Ph.D. student, much less know what a geocoder was, if it were not for the support of Craig
Knoblock. Last but in no way least, Mona Seymour graciously volunteered her time to review
portions of this manuscript, resulting in a far more readable text.
Sadly, everyone who reads this document will most likely have already been affected by the
dreadful toll that cancer can take on a family member, friend, or other loved one. I
whole-heartedly support the goal of NAACCR to work toward reducing the burden of cancer in
North America, and I am honored to have been granted the opportunity to give in this small
way to the world of cancer-related research. What follows in this document is my attempt to
contribute through the production of A Geocoding Best Practices Guide for use in
standardizing the way that geocoding is discussed, performed, and used in scientific research
and analysis.





Daniel W. Goldberg
June 6, 2008

ACKNOWLEDGEMENTS

Much of the material in this handbook was generated at the North American Association of
Central Cancer Registries (NAACCR) GIS Workgroup meeting held in Regina, Saskatchewan, Canada
on June 16, 2006. The following individuals contributed to the development of this document:

Meeting Participants:
Francis P. Boscoe, Ph.D., New York State Cancer Registry
Myles G. Cockburn, Ph.D., University of Southern California
Stephanie Foster, Centers for Disease Control and Prevention
Daniel W. Goldberg, Ph.D. Candidate, Facilitator and Handbook Author, University of Southern California
Kevin Henry, Ph.D., New Jersey State Department of Health
Christian Klaus, North Carolina State Center for Health Statistics
Mary Mroszczyk, C.T.R., Massachusetts Cancer Registry
David O'Brien, Ph.D., Alaska Cancer Registry
David Stinchcomb, Ph.D., National Cancer Institute

Comments, Reviewers, and Editors:
Robert Borchers, Wisconsin Department of Health and Human Services
Francis P. Boscoe, Ph.D., New York State Cancer Registry
Myles G. Cockburn, Ph.D., University of Southern California
Stephanie Foster, Centers for Disease Control and Prevention
Kevin Henry, Ph.D., New Jersey State Department of Health
Christian Klaus, North Carolina State Center for Health Statistics
Mary Mroszczyk, C.T.R., Massachusetts Cancer Registry
David O'Brien, Ph.D., Alaska Cancer Registry
Mona N. Seymour, Ph.D. Candidate, University of Southern California
Recinda L. Sherman, M.P.H., C.T.R., Florida Cancer Data System
David Stinchcomb, Ph.D., National Cancer Institute
John P. Wilson, Ph.D., University of Southern California

This project has been funded in part with federal funds from the National Cancer Institute
(NCI), National Institutes of Health (NIH), Department of Health and Human Services (DHHS)
under Contract No. HHSN26120044401C and ADB Contract No. N02-PC-44401, and from the Centers
for Disease Control and Prevention (CDC) under Grant/Cooperative Agreement No. U75/CCU523346.
Daniel Goldberg was supported by a U.S. Department of Defense (DoD) Science, Mathematics, and
Research for Transformation (SMART) Defense Scholarship for Service Program fellowship and
National Science Foundation (NSF) Award No. IIS-0324955 during portions of the production of
this document. The views and conclusions contained herein are those of the authors and should
not be interpreted as necessarily representing the official policies or endorsements, either
expressed or implied, of any of the above organizations or any person connected with them.

DEDICATION


This book is dedicated to the life and memory of Michael Owen Wright-Goldberg, beloved
husband, son, and older brother (1975-2006).

ABOUT THIS DOCUMENT
BEST PRACTICES GUIDE
The main purpose of this document is to act as a best practices guide for the cancer registry
community, including hospitals as well as state and provincial registries. Accordingly, it
will advise those who want to know specific best practices that they should follow to ensure
the highest level of confidence, reliability, standardization, and accuracy in their
geocoding endeavors. These best practices will be framed as both policy and technical
decisions that must be made by a registry as a whole and by the individual person performing
the geocoding or using the results. Best practices are listed throughout the text, placed as
close to the section of text that describes them as possible.
STANDARDIZATION
Due to a fundamental lack of standardization in the way that geocoding is defined and
implemented across cancer registries, it is difficult to compare or integrate data created at
different sources. This document will propose numerous definitions germane to the geocoding
process, thus developing a consistent vocabulary for use as a first step toward a larger
standardization process. Throughout the document, specific terms will be written in bold with
definitions closely following. A geocoding best practice is a policy or technical decision
related to geocoding recommended (but not required) by NAACCR for use in a cancer registry's
geocoding process. The geocoding best practices are a set of suggested best practices
developed throughout this document. In addition, the document attempts to detail software
implementation preferences, current limitations, and avenues for improvement that geocoding
vendors should be aware are desired by the cancer research communities.
Note that the best practices developed in this document are not yet official NAACCR data
standards, meaning that they will not be found in the current version of Standards for Cancer
Registries: Data Standards and Data Dictionary (Hofferkamp and Havener 2008), and thus their
use is not officially required by any means. More specifically, although the content of
Hofferkamp and Havener (2008) represents the current mandatory NAACCR data standards that
registries are required to follow, the best practices found herein are recommended for
adoption by researchers, registries, and/or software developers that seek to begin conducting
their geocoding practices in a consistent, standardized, and more accurate manner. It is the
hope of the author that the contents of this document will assist in the eventual official
standardization of the geocoding process, fully accepted and recognized by the NAACCR
Executive Board. As such, software developers are encouraged to adopt and incorporate the
recommendations included in this document to: (1) be ahead of the curve if/when the
recommendations contained herein (or their derivatives and/or replacements) are accepted as
true NAACCR data standards, and (2) improve the quality, transparency, usability, and
legitimacy of their geocoding products.
LEARNING TOOL
To make informed decisions about geocoding choices, an understanding of both the theoretical
and practical aspects of the geocoding process is necessary. Accordingly, this document
provides a high level of detail about each aspect of the geocoding process such that a reader
can obtain a complete understanding of the best practice recommended, other possible options,
and the rationale behind the recommended practice. It serves to centralize much of the
available research and practice scholarship on these topics to provide a single,
comprehensive perspective on all aspects of the geocoding process.
The document has been specifically divided into six parts. Each part attempts to address the
topics contained in it at a consistent level of detail. The decision was made to organize the
document in this format so that it would be easy for a reader interested in certain topics to
find the information he or she is looking for (e.g., to learn about components of geocoding
or find solutions to an exact problem) without being bogged down in either too much or too
little detail.
REFERENCE TOOL
Appendix A includes example research assurance documents that can be tailored to an
individual registry for ensuring that researchers understand the acceptable manner in which
registry data may be obtained and used. Every attempt was made to back up all claims made in
the document using published scientific literature so that it can be used as a reference
tool. The Annotated Bibliography included as Appendix B includes more than 250 of the most
recently published geocoding works classified by the topic(s) they cover, and should prove a
useful resource for those interested in further reading.
TYPES OF READERS
In this document, four distinct types of readers will be identified based on their specific
roles in, or uses of, the geocoding process. These are: (1) the practitioner, (2) general
interest, (3) process designer, and (4) data consumer groups. The roles of these groups are
described in Table 1, as are their main concerns regarding the geocoding process and the
sections in this document that address them.
SUGGESTED CITATION
Goldberg DW: A Geocoding Best Practices Guide. Springfield, IL: North American Association of
Central Cancer Registries, 2008.

Table 1 - Types of readers, concerns, and sections of interest

Group: Practitioner
Role: Registry staff performing the geocoding task using some pre-defined method with existing tools, ultimately responsible for the actual production of the geospatial data from the raw aspatial address data
Concerns: Practical aspects of the geocoding process; handling instances in which data do not geocode
Sections of Interest: 1, 4, 5, 14, 15, 16, 17, 18, 19, 20, 24

Group: General Interest
Role: Registry staff interested in geocoding but not formally involved in the process as part of their duties, akin to the general public
Concerns: Why is geocoding important? How does geocoding fit into the larger operations of the registry?
Sections of Interest: 1, 2.1, 2.4, 3, 4, 11, 12.5, 14, 15, 18, 26

Group: Process Designers
Role: Registry staff overseeing and designing the geocoding process used at a registry, ultimately responsible for the overall outcome of the geocoding performed at a registry
Concerns: All design and policy decisions that affect the outcome of geocoding; data definition, representation, and validation; components and algorithms involved in the geocoding process; forms and formats of reference data sources; and accuracy metrics and reporting
Sections of Interest: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26

Group: Data Consumers
Role: Cancer researchers consuming the geocoded data; others responsible for monitoring annually reported aggregate statistics to discover important trends
Concerns: Accuracy of the geocoded output in terms of its lineage, confidence/reliability, accountability, and any assumptions that were used
Sections of Interest: 1, 3, 5.3, 6.1, 7, 8, 11, 12, 13, 14, 15, 18, 19, 20, 24, 25, 26

EXECUTIVE SUMMARY
PURPOSE
As a rule, health research and practice (and cancer-related research in particular) takes
place across a multitude of administrative units and geographic extents (country-wide,
state-wide, etc.). The data used to develop and test cancer-related research questions are
created, obtained, and processed by disparate organizations at each of these different
levels. Studies requiring the aggregation of data from multiple administrative units
typically must integrate these disparate data, which occur in incompatible formats with
unknown lineage or accuracy. The inconsistencies and unknowns amongst these data can lead to
uncertainty in the results that are generated if the data are not properly integrated. This
problem of data integration represents a fundamental hurdle to cancer-related research.
To overcome the difficulties associated with disparate data, a specific set of actions must
be undertaken. First, key stakeholders must be identified and informed of potential issues
that commonly arise and contribute to the problem. Next, a common vocabulary and
understanding must be defined and developed such that thoughtful communication is possible.
Finally and most importantly, advice must be provided in the form of a set of best practices
so that processes can begin to be standardized across the health research communities.
Together, these will allow health researchers to have a reasonable level of certainty as to
how and where the data they are working with have been derived as well as an awareness of any
overarching data gaps and limitations.
Person, place, event, and time form the four fundamental axes of information around which
epidemiologic research is conducted. The spatial data representing the subject's location are
particularly susceptible to the difficulties that plague multi-source data because much of
the spatial data are derived from textual addresses through the process of geocoding. These
data are vulnerable to inconsistencies and unknown quality because of the wide range of
methods by which they are defined, described, collected, processed, and distributed. To
contextualize the heterogeneity of current geocoding practices among cancer registries, see
the recent work by Abe and Stinchcomb (2008), which highlights the far-ranging approaches
used at several cancer registries throughout the United States. This lack of uniformity
and/or standardization with regard to geocoding processes represents a current and
significant problem that needs to be addressed.
Although there is a substantial amount of available literature on the many topics germane to
geocoding, there is no single source of reference material one can turn to that addresses
many or all of these topics in the level of detail required to make well-informed decisions.
Recent works such as Rushton et al. (2006, 2008a), Goldberg et al. (2007a), and Mechanda and
Puderer (2007) provide a great deal of review and detail on geocoding and related topics, but
available scholarship as to specific recommendations, their rationale, and alternative
considerations is lacking.
To these ends, the North American Association of Central Cancer Registries (NAACCR) has
promoted the development of this work, A Geocoding Best Practices Guide, the purpose of which
is to help inform and standardize the practice of geocoding as performed by the cancer
registries and research communities of the United States and Canada. This work primarily
focuses on the theoretical and practical aspects of the actual production of geocodes, and
will briefly touch upon several important aspects of their subsequent usage in cancer-related
research.
SCOPE
This document will cover the fundamental underpinnings of the geocoding process.
Separate sections describe the components of the geocoding process, ranging from the input
data, to the internal processing performed, to the output data. For each topic, choices that
affect the accuracy of the resulting data will be explored and possible options will be listed.
INTENDED AUDIENCE
The primary purpose of this document is to provide a set of best practices that, if
followed, will enable the standardization of geocoding throughout the cancer research
communities. Thus, the main focus of this document will be to provide enough detailed
information on the geocoding process such that informed decisions can be made on each aspect,
from selecting data sources, algorithms, and software to be used in the process, to defining
the policies with which the geocoding practitioner group perform their task and make
decisions, to determining and defining the metadata that are associated with the output.
For those with varying levels of interest in the geocoding process, this document
presents detailed information about the components and processes involved in geocoding, as
well as sets of best practices designed to guide specific choices that are part of any geocoding
strategy. Benefits and drawbacks of potential options also are discussed. The intent is to
establish a standardized knowledge base that will enable informed discussions and decisions
within local registries and result in the generation of consistent data that can be shared
between organizations.
For researchers attempting to use geocoded data in their analyses, this document outlines
the sources of error in the geocoding process and provides best practices for describing
them. If described properly, accuracy values for each stage of the geocoding process can be
combined to derive informative metrics capable of representing the accuracy of the output in
terms of the whole process. The data consumer can use these to determine the suitability of
the data with respect to the specific needs of their study.
For practitioners, this document presents detailed, specific solutions for common
problems that occur during the actual process of geocoding, with the intent of standardizing the
way in which problem resolution is performed at all registries. Uniform problem resolution
would remove one aspect of uncertainty (arguably the most important level) from the
geocoding process and ultimately from the resulting data and analyses performed on them.
Most of the information contained in this document (e.g., examples, data sources, laws,
and regulations) will primarily focus on U.S. and Canadian registries and researchers, but the
concepts should be easily translated to other countries. Likewise, some of the information
and techniques outlined herein may only be applicable to registries that perform their own
geocoding instead of using a commercial vendor. Although the number of registries
performing their own geocoding is currently small, this number has been increasing and will
continue to increase as access to geocoding software and required data sources continues to
improve. Additionally, the information within this document should assist those registries
currently using a vendor in becoming more understanding of and involved in the geocoding
process, better able to explain what they want a vendor to do under what circumstances, and
more cognizant of the repercussions of choices made during the geocoding process.

Part 1: The Concept and Context of Geocoding



As a starting point, it is important to succinctly develop a concrete notion of exactly what
geocoding is and identify how it relates to health and cancer-related research. In this part of
the document, geocoding will be explicitly defined and its formal place in the cancer research
workflow will be identified.

1. INTRODUCTION
This section provides the motivation for standardized geocoding.
1.1 WHAT IS GEOCODING?
Person, place, event, and time are the four key pieces of information from which
epidemiologic research in general is conducted. This document will focus primarily on issues
arising in the description, definition, and derivation of the place component. In the course of
this research, scientists frequently use a variety of spatial analysis methods to determine
trends, describe patterns, make predictions, and explain various geographic phenomena.
Although there are many ways to denote place, most people rely almost exclusively on
locationally descriptive language to describe a geospatial context. In the world of cancer
registries this information typically includes the address, city, and province or state of a patient at
the diagnosis of their disease (dxAddress, dxCity, dxProvince, dxState), most commonly in
the form of postal street addresses. These vernacular, text-based descriptions are easily
understood by people, but they are not directly suitable for use in a computerized environment.
Performing any type of geospatial mapping or investigation with the aid of a computer
requires discrete, non-ambiguous, geographically valid digital data rather than descriptive
textual strings.
Thus, some form of processing is required to convert these text descriptors into valid
geospatial data. In the parlance of geographic information science (GIS), this general
concept of making implicit spatial information explicit is termed georeferencing, or
transforming non-geographic information (information that has no geographically valid reference
that can be used for spatial analyses) into geographic information (information that has a
valid geographic reference that can be used for spatial analyses) (Hill 2006).
Throughout the years, this general concept has been realized in a multitude of actual
processes to suit the needs of various research communities. For instance, a global
positioning system (GPS) device can produce coordinates for the location on the Earth's
surface based on a system of satellites, calibrated ground stations, and temporally based
calculations. The coordinates produced from these devices are highly accurate, but can be
expensive in terms of time and effort required to obtain the data, as they typically require a
human to go into the field to obtain them.
Geocoding describes another method of georeferencing (Goldberg et al. 2007a). As
seen in scholarship and practice, the term geocoding is used throughout almost every
discipline of scientific research that includes any form of spatial analysis, with each field usually
either redefining it to meet their needs or adopting another field's definition wholesale. As a
result, there is a great deal of confusion as to what geocoding, and its derivatives, most
notably the terms geocode and geocoder, actually refer to. What do these words mean, and
how should they be used in the cancer registry field?
For example, does geocoding refer to a specific computational process of transforming
something into something else, or simply the concept of a transformation? Is a geocode a
real-world object, simply an attribute of something else, or the process itself? Is a geocoder
the computer program that performs calculations, a single component of the process, or the
human who makes the decisions?
An online search performed in April of 2008 found the various definitions of the term
geocoding shown in Table 2. These definitions have been chosen for their geographic
diversity as well as for displaying a mix of research, academic, and industry usages. It is useful
to contrast our proposed definition with these other definitions that are both more
constrained and relaxed in their descriptions of the geocoding process to highlight how the
proposed definition is more representative of the needs of the health/cancer research
communities.

Table 2 Alternative definitions of geocoding
Source | Definition | Possible Problems
Environmental Sciences Research Institute (1999) | The process of matching tabular data that contains location information such as street addresses with real-world coordinates. | Limited to coordinate output only.
Harvard University (2008) | The assignment of a numeric code to a geographical location. | Limited to numeric code output only.
Statistics Canada (2008) | The process of assigning geographic identifiers (codes) to map features and data records. | Limited input range.
U.S. Environmental Protection Agency (2008) | The process of assigning latitude and longitude to a point, based on street addresses, city, state and USPS ZIP Code. | Limited to coordinate output only.

As a further complication, it must be noted that the methods and data sources employed
throughout all registries in the United States and Canada are quite diverse and varied so a
single definition explicitly defining, requiring, or endorsing a particular technology would not
be useful. Each registry may have different restrictions or requirements on what can be
geocoded in terms of types of input data (postal addresses, named places, etc.), which algorithms
can be used, what geographic format the results must be in, what can be produced as output,
or what data sources can be used to produce them. Differing levels of technical skills,
budgetary and legal constraints, and varied access to types of geographic data, along with other
factors, also may dictate the need for a broad definition of geocoding. As such, the definition
offered herein is meant to serve the largest possible audience by specifically not limiting any
of these characteristics of the geocoding process, intentionally leaving the door open for
different flavors of geocoding to be considered as valid. In the future, as the vast body of
knowledge of geocoding constraints, ontologies, and terminologies spreads and is utilized by
registry personnel, it is expected that there will be a common desire in the registry
community to achieve consensus on standardizing geocoding and geocoding-related processes to
achieve economies of scale.
The remainder of this document will explicitly define geocoding as well as its component
parts as follows:

Geocoding (verb) is the act of transforming aspatial locationally descriptive text into a
valid spatial representation using a predefined process.

A geocoder (noun) is a set of inter-related components in the form of operations,
algorithms, and data sources that work together to produce a spatial representation for
descriptive locational references.

A geocode (noun) is a spatial representation of a descriptive locational reference.

To geocode (verb) is to perform the process of geocoding.

In particular, these definitions help to resolve four common points of confusion about
geocoding that often are complicated by disparate understandings of the term: (1) the types
of data that can be geocoded, (2) the methods that can be employed to geocode data, (3) the
forms and formats of the outputs, and (4) the data sources and methods that are germane to
the process. These definitions have been specifically designed to be broad enough to meet
the diverse needs of both the cancer registry and cancer research communities.
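One way to read these definitions in software terms is sketched below in Python. This is a minimal illustration only: the names Geocode, Geocoder, and LookupGeocoder, and the in-memory gazetteer, are hypothetical and do not come from this document or from any particular product. The point of the sketch is that the definitions constrain neither the kind of input text, nor the method, nor the output geometry.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


@dataclass(frozen=True)
class Geocode:
    """A geocode (noun): a spatial representation of a descriptive locational reference."""
    input_text: str
    geometry: Tuple[Point, ...]  # one vertex = point; several = polyline or polygon
    geometry_type: str           # "point", "polyline", or "polygon"


class Geocoder(Protocol):
    """A geocoder (noun): any set of components that turns text into a geocode."""
    def geocode(self, text: str) -> Optional[Geocode]: ...


class LookupGeocoder:
    """Toy geocoder backed by an in-memory gazetteer (hypothetical data)."""

    def __init__(self, gazetteer):
        # Normalize names so that lookups ignore case.
        self._gazetteer = {name.lower(): point for name, point in gazetteer.items()}

    def geocode(self, text):
        point = self._gazetteer.get(text.strip().lower())
        if point is None:
            return None  # non-matchable input
        return Geocode(text, (point,), "point")
```

A GPS-based or interpolation-based implementation could satisfy the same Geocoder interface, which is the sense in which the definitions leave the method open.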
1.1.1 Issue #1: Many data types can be (and are) geocoded
There are many forms of information that registries and researchers need geocoded.
Table 3 illustrates the magnitude of the problem in terms of the many different types of
addresses that may be encountered to describe the same physical place, along with their best
and worst resolutions resulting from geocoding and common usages.

Table 3 Possible input data types (textual descriptions)
Name | Type | Usage | Best/Worst Case Output Resolution
The University of Southern California | Named place | County counts | Parcel-level, Non-matchable
The University of Southern California GIS Research Lab | Named place | Cluster screening | Sub parcel-level, Non-matchable
Kaprielian Hall, Unit 444 | Named place | Cluster screening | Sub parcel-level, Non-matchable
The northeast corner of Vermont Avenue and 36th Place | Relative intersection | Cluster screening | Intersection-level, Non-matchable
Across the street from Togo's, Los Angeles 90089 | Relative direction | Cluster screening | Street-level, Non-matchable
3620 South Vermont Ave, Los Angeles, CA 90089 | Street address | Cluster screening | Building-level, Street-level
USPS ZIP Code 90089-0255 | USPS ZIP Code | County counts | Building-level, USPS ZIP Code-level
34.022351, -118.29114 | Geographic coordinates | Cluster screening | Sub parcel-level, Non-matchable

It should be clear from this list that a location can be described as a named place, a
relative location, a complete postal address (or any portion thereof), or by its actual coordinate
representation. All of these phrases except the last (actual coordinates) are commonly
occurring representations found throughout health data that need to be translated into spatial
coordinates. Obviously, some are more useful than others in that they relay more detailed
information (some may not even geocode). Registry data standards are heading toward the
enforcement of a single data format for input address data, but utilization of a single
representation across all registries is presently not in place due to many factors. In keeping with
the stated purpose of this document, the definition provided should be general enough to
encompass each of the commonly occurring reporting styles (i.e., forms).
1.1.2 Issue #2: Many methods can be (and are) considered geocoding
Turning to the host of methods researchers have used to geocode their data, it becomes
clear that there are still more varieties of geocoding. The process of utilizing a GPS device
and physically going to a location to obtain a true geographic position has been commonly
cited throughout the scientific literature as one method of geocoding. This is usually stated
as the most accurate method, the gold standard. Obtaining a geographic position by
identifying the geographic location of a structure through the use of georeferenced satellite or
aerial imagery also has been defined as geocoding. The direct lookup of named places or
other identifiable geographic regions (e.g., a U.S. Census Bureau ZIP Code Tabulation Area
[ZCTA]) from lists or gazetteers (which are databases with names, types, and footprints of
geographic features) also has been referred to as geocoding. Most commonly, geocoding
refers to the use of interpolation-based computational techniques to derive estimates of
geographic locations from GIS data such as linear street vector files or areal unit parcel vector
files.
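As an illustration of the interpolation-based approach, the Python sketch below estimates a position for a house number along a single street segment by linear address-range interpolation. It is a simplified sketch under stated assumptions, not a description of any particular geocoder: the function name and parameters are hypothetical, the segment is treated as a straight line, addresses are assumed to be evenly spaced along it, and street side (odd/even parity) and setback from the centerline are ignored.

```python
def interpolate_address(number, low, high, start, end):
    """Estimate a (lat, lon) for a house number on one street segment.

    low/high are the ends of the segment's address range; start/end are the
    (lat, lon) endpoints of the segment that correspond to low and high.
    """
    if not (min(low, high) <= number <= max(low, high)):
        raise ValueError("house number outside the segment's address range")
    if high == low:
        fraction = 0.5  # single-address range: fall back to the midpoint
    else:
        # Proportion of the way along the range, assuming even spacing.
        fraction = (number - low) / (high - low)
    lat = start[0] + fraction * (end[0] - start[0])
    lon = start[1] + fraction * (end[1] - start[1])
    return (lat, lon)
```

The result is an estimate rather than a measured position, which is one reason the accuracy of interpolated geocodes depends heavily on the quality of the underlying street reference data.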
1.1.3 Issue #3: Many output types are possible
Geocoding output is typically conceived of as a geographic point, a simple geographic
coordinate represented as latitude and longitude values. However, the base geographic data
used for the derivation of the point geocode (e.g., the polygon boundary of the parcel or the
polyline of the street vector) also could be returned as the output of the geocoding process.
1.1.4 Issue #4: Geocoding can be (and usually is) a multi-component process
Finally, the geocoding process is not achieved by one single instrument, software, or
geographic data source. The process of geocoding can be conceptualized as a single
operation, but there are multiple components such as operations, algorithms, and data sources that
work together to produce the final output. Each one of these components is the result of
significant research in many different scientific disciplines. Each is equally important to the
process. Thus, when one speaks of geocoding, it raises the question: are they speaking of the
overall process, or do they mean one or more of the components? The proposed definition
therefore must take this into account and make these distinctions.
By design, any of the processes stated earlier in Section 1.1.2 that are known as
geocoding are valid (e.g., using a software geocoder, GPS in the field, or imagery would fit into this
definition). By using the terms "locationally descriptive text" and "spatial representation,"
any of the forms of data listed earlier in Sections 1.1.1 and 1.1.3 are valid as input and
output, respectively. Finally, instead of explicitly stating what must be a part of a geocoder, it
may be best to leave it open-ended such that different combinations of algorithms and data
sources can be employed and still adhere to this definition.
Again, the primary purpose of this document is to assist registries in making the
appropriate choices given their particular constraints and to explain the repercussions these
decisions will have. Because the definition presented here is tailored specifically for a certain
community of researchers with unique characteristics, it may not be appropriate for other
research disciplines. It should be noted that although this definition allows for any type of
geographic output, registries must at least report the results in the formats explicitly defined
in Standards for Cancer Registries: Data Standards and Data Dictionary (Hofferkamp and Havener
2008). Best practices relating to the fundamental geocoding concepts developed in this
section are listed in Best Practices 1.

Best Practices 1 Fundamental geocoding concepts
Topic | Policy Decision | Best Practice
Geocoding concept | What does geocoding refer to in their organization? | The meaning of geocoding within an organization should be consistent with that presented in this document.
Geocoding motivation | When should geocoding be performed? | Geocoding should be performed when descriptive location data need to be transformed into numeric spatial data to support spatial analysis.
Geocoding vocabulary | What are the definitions of geocoding and its related terms? | The definitions of geocoding should be based on those within this document.

2. THE IMPORTANCE OF GEOCODING
This section places geocoding in the larger context of spatial
analysis performed as part of cancer-related research.
2.1 GEOCODING'S IMPORTANCE TO HOSPITALS AND CENTRAL REGISTRIES
As the component ultimately responsible for generating the spatial attributes associated
with patient/tumor data, geocoding's primary importance to cancer research becomes clear
(Rushton et al. 2006, 2008a). Spatial analysis would be extremely difficult, if not impossible,
in the absence of geocoding. Note that the patient's address at the time of diagnosis is
tumor-level data (i.e., each tumor will have its own record, and each record will have its own
address). Being time-dependent, this address may vary with each cancer.
The recent work by Abe and Stinchcomb (2008) relating the results of a North American
Association of Central Cancer Registries (NAACCR) GIS Committee survey of 2
NAACCR member registries crystallizes the importance of geocoding in current registry
practice. They found that 82 percent of the 4 responding registries performed some type of
address geocoding, and that the average number of addresses geocoded was 1. times the
annual caseload of the registry. For complete details of this survey, refer to NAACCR
(2008a, 2008b).
2.2 TYPICAL RESEARCH WORKFLOW
The important role of the geocoder in cancer-related research is easily highlighted
through an example of a prototypical research workflow that utilizes geocoding as a
component. Generally, a spatially inspired research investigation will have two basic components:
data gathering and data analysis. Looking deeper, the data gathering portion involves data
collection, consolidation, and processing. The data analysis portion involves hypothesis
testing using statistical analysis to assess the phenomena in question given the relative strength
of the supporting evidence. Figure 1 displays an outline of a research workflow that should
solidify the importance of the geocoder as the link between descriptive locational data and
numeric (digital) geographic information.
2.2.1 Data gathering
Each registry will undoubtedly vary in its exact implementation of the data gathering
protocol. Generally, when an incidence of cancer is diagnosed, a series of best efforts are
made to obtain the most accurate data available about the person and their tumor. Many
types of confidential data will be collected as the patient receives care, including identifiable
information about the person and their life/medical history, as well as information about the
diagnosed cancer. Portions of this identifiable information may be obtained directly from the
patient through interviews with clerical or medical staff at a hospital. Here, the patient will be
asked for his/her current home address, and possibly a current work address and former
addresses.
Time and/or confidentiality constraints may limit the amount of information that can be
collected by a hospital and only the patient's current post-diagnosis address may be available
when consolidation occurs at the registry. The central registry therefore may be expected to
determine an address at diagnosis after the data have been submitted from a diagnosing
facility for consolidation. Searching a patient's medical records, or in some cases linking to
Department of Motor Vehicles (DMV) records, may be options to help achieve this (more
detail on these practices is provided in Section 19).


Figure 1 Typical research workflow

In any case, the address data obtained represent the descriptive text describing the
person's location in at least one point in time. At this stage a record has been created, but it is as
of yet completely aspatial (or non-spatial), meaning that it does not include any spatial
information. Although there are multiple methods to collect data and perform address
consolidation, there are currently no standards in place. Best Practices 2 lists data gathering and
metadata guides for the hospital and registry, in the ideal cases. Best Practices 3 lists the type
of residential history data that would be collected, also in the ideal case, but one that has a
clear backing in the registry (e.g., Abe and Stinchcomb 2008, pp. 123) and research
communities (e.g., Han et al. 2005). Best Practices 4 briefly lists guides for obtaining secondary
information about address data.
Best Practices 2 Address data gathering
Policy Decision: When, where, how and what type of address data should be gathered?
Best Practice:
Residential history information should be collected.

Collect information as early as possible:
- As much data as possible at the diagnosing facility
- As much data as possible at the registry.

Collect information from the most accurate source:
- Patient
- Relative
- Patient/tumor record.

Metadata should describe when it was collected:
- Upon intake
- After treatment
- At registry upon arrival
- At registry upon consolidation.

Metadata should describe where it was collected:
- At diagnosing facility
- At registry.

Metadata should describe how it was collected:
- Interview in-person/telephone
- Patient re-contacting
- From patient/tumor record
- Researched online.

Metadata should describe the source of the data:
- Patient
- Relative
- Patient/tumor record.

Best Practices 3 Residential history address data
Policy Decision: What type of residential history information should be collected?
Best Practice:
Ideally, complete historical address information should be collected:
- Residential history of addresses
- How long at each address
- Type of address (home, work, seasonal).

Best Practices 4 Secondary address data gathering
Policy Decision: Should secondary research methods be attempted at the registry? If so, when, which ones, and how often?
Best Practice:
Secondary research should be attempted at the registry if
address data from the diagnosing facility is inaccurate,
incomplete, or not timely.

All available and applicable sources should be utilized until
they are exhausted or enough information is obtained:
- Online searches
- Agency contacts
- Patient re-contacting.

Metadata should describe the data sources consulted:
- Websites
- Agencies
- Individuals.

Metadata should describe the queries performed.

Metadata should describe the results achieved, even if
unsuccessful.

Metadata will need to be stored in locally defined fields.

2.2.2 Conversion to numeric (i.e., digital) data
How long the patient record remains unaltered depends on several factors, including:

- The frequency with which data are reported to the central registry
- Whether per-record geocoding (geocoding a single record at a time) or batch-record
geocoding (geocoding multiple records at once) is performed
- Whether the registry geocodes the data or they already are geocoded when the data
reach the registry
- Whether geocoding is performed in-house or outsourced.

When the data finally are geocoded, addresses will be converted into numeric spatial
representations that then will be appended to the record. Best practices related to the
conversion from descriptive to numeric spatial data (assuming that geocoding is performed
at the registry) are listed in Best Practices 5. Note that the recommendation to immediately
geocode every batch of data received may not be a feasible option for all registries under all
circumstances because of budgetary, logistical, and/or practical considerations involved with
processing numerous data files.

Best Practices 5 Conversion to numeric spatial data
Policy Decision: How long can a record remain non-matched (e.g., must it be transformed immediately, or can it wait indefinitely)?
Best Practice:
If records are obtained one at a time, they should be
geocoded when a sufficient number have arrived to offset
the cost of geocoding (i.e., to achieve economies of scale).

If records are obtained in batches, they should be geocoded
as soon as possible.

Policy Decision: Should a record ever be re-geocoded? If so, when and under what circumstances?
Best Practice:
The record should retain the same geocode until it is
deemed to be unacceptably inaccurate.

If new reference data or an improved geocoder are obtained
and will provably improve a record's geocode, it should be
re-geocoded.

If new or updated address data are obtained for a record, it
should be re-geocoded.

Metadata should describe the reason for the inaccuracy
determination.

Metadata should retain all historical geocodes.
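
The re-geocoding practices above can be sketched as a small Python record structure. This is an illustrative sketch only; the class names, field names, and reason strings are hypothetical and are not NAACCR data items. It assumes that a registry stores every geocode it has produced for a record, with the reason each one was produced, and treats the most recent entry as the current geocode.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class GeocodeEntry:
    point: Tuple[float, float]  # (latitude, longitude)
    reason: str                 # why this geocode was produced


@dataclass
class Record:
    address: str
    history: List[GeocodeEntry] = field(default_factory=list)

    @property
    def current(self) -> Optional[GeocodeEntry]:
        # The most recent geocode is the one in use; older ones are retained.
        return self.history[-1] if self.history else None


def regeocode_reason(address_changed, better_reference_data, deemed_inaccurate):
    """Return the reason a record should be re-geocoded, or None to keep it."""
    if address_changed:
        return "new or updated address data"
    if better_reference_data:
        return "improved reference data or geocoder"
    if deemed_inaccurate:
        return "existing geocode deemed unacceptably inaccurate"
    return None


def apply_geocode(record, point, reason):
    """Append the new geocode; earlier geocodes stay in the history."""
    record.history.append(GeocodeEntry(point, reason))
```

Appending rather than overwriting keeps the full lineage of a record's geocodes available to later data consumers.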

2.2.3 Spatial association
Once the data are in this numeric form the quality of the geocoding process can be
assessed. If the quality is found to be sufficient, other desired attributes can be associated and
spatial analysis can be performed within a GIS to investigate any number of scientific
outcomes. For example, records can be visually represented on a map, or values from other
datasets can be associated with individual records through the spatial intersection of the
geocode and other spatial data. Common data associations include spatial intersection with U.S.
Census Bureau data to associate socioeconomic status (SES) with a record or intersection
with environmental exposure estimates. See Rushton et al. (2008b) for one example of how
spatially continuous cancer maps can be produced from geocoded data and/or Waller (2008)
for concise introductory material on the type of spatial statistical analysis typically performed
on point and areal unit data in public health research. Best practices related to spatial
association are listed in Best Practices 6.
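To make the idea of spatial intersection concrete, the Python sketch below associates the attributes of an areal unit with a point geocode using a standard ray-casting point-in-polygon test. It is a minimal sketch under stated assumptions: coordinates are treated as planar (x, y) pairs, polygons are simple rings without holes, points exactly on a boundary are not handled specially, and the function names and attribute dictionaries are hypothetical. A production GIS would normally perform this operation with its own spatial join tools.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray cast from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def associate(geocode, areas):
    """Return the attributes of the first area containing the geocode, else None.

    areas is a list of (polygon, attributes) pairs.
    """
    for polygon, attributes in areas:
        if point_in_polygon(geocode, polygon):
            return attributes
    return None
```

For example, if each polygon were a census tract carrying a socioeconomic attribute, associate would return that tract's attributes for any geocode falling inside it.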
Best Practices 6 Spatial association
Policy Decision: Should data be allowed to be spatially associated with geocoded records? If so, when and where?
Best Practice:
Spatial associations should be allowed if the data to be
associated meet acceptable levels of accuracy and integrity.

Spatial association should be examined every time an analysis
is to be run (because the viability of statistical analysis
techniques will vary with the presence or absence of these
associations), and can be performed by data consumers or
registry staff.

Metadata should include the complete metadata of the
spatially associated data.

Policy Decision: What requirements must data meet to be spatially associated?
Best Practice:
Data should be considered valid for association if:
- Its provenance, integrity, temporal footprint, spatial
accuracy, and spatial resolution are known and can be
proven.
- Its temporal footprint is within a decade of the time the
record was created.
- Its spatial resolution is equal to or less than that of the
geocode (i.e., only associate data of equal or lower resolution).
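
The three validity requirements for spatial association can be expressed as a simple check, sketched here in Python. The structure is hypothetical and rests on simplifying assumptions that are not part of the best practice itself: the temporal footprint is reduced to a single year, "known and can be proven" is reduced to boolean flags, and spatial resolution is represented as a typical unit size in meters, so that a larger unit size means a lower (coarser) resolution.

```python
from dataclasses import dataclass


@dataclass
class DatasetMetadata:
    provenance_known: bool
    integrity_known: bool
    spatial_accuracy_known: bool
    year: int            # temporal footprint, simplified to one year
    unit_size_m: float   # spatial resolution as a typical unit size in meters


def valid_for_association(meta, record_year, geocode_unit_size_m):
    """Apply the three association requirements to a candidate dataset."""
    documented = (meta.provenance_known and meta.integrity_known
                  and meta.spatial_accuracy_known)
    within_decade = abs(meta.year - record_year) <= 10
    # Larger unit size means lower (coarser) resolution than the geocode.
    not_finer = meta.unit_size_m >= geocode_unit_size_m
    return documented and within_decade and not_finer
```

How a registry actually measures resolution and temporal footprint is a local policy decision; the sketch only shows that the requirements can be checked mechanically once those measures are recorded in metadata.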
2.3 WHEN TO GEOCODE
When the actual geocoding should be performed during the data gathering process is a
critical component that affects the overall accuracy of the output. There are several options
that need to be considered carefully, as each has particular benefits and drawbacks that may
influence the decision about when to geocode.
2.3.1 Geocoding at the Central Registry
The first option, geocoding at the central registry, is the process of geocoding
patient/tumor records at the registry once they have been received from facilities and
abstractors. Geocoding traditionally has been the role of the central registry. This can be
accomplished when a record arrives or after consolidation. The first approach processes one case
at a time, while the second processes batches of records at a time. One obvious benefit in
the second case results from economies of scale. It will always be cheaper and more efficient
to perform automatic geocoding for a set of addresses, or in batch mode, rather than on a
single address at a time. Although the actual cost per produced geocode may be the same
(e.g., one-tenth of a cent or less), the time required by staff members in charge of the
process will be greatly reduced, resulting in definite cost savings.
Common and practical as this method may be, it also suffers from setbacks. First and
foremost, if incorrect, missing, or ambiguous data that prevent a geocode from successfully
being produced have been reported to a registry, it will be more difficult and time
consuming to correct at this stage. In fact, most geocoders will not even attempt such corrections;
instead, they will simply either output a less accurate geocode (e.g., one representing a
successful geocode at a lower geographic resolution), or not output a geocode at all (Section 18
provides more discussion of these options).
Some of these problems can be rectified by performing interactive geocoding,
whereby the responsible staff member is notified when problematic addresses are encountered and
intervenes to choose between two equally likely options in the case of an ambiguous address
or to correct an obvious and easily solvable error that occurred during data entry. Interactive
geocoding, however, cannot solve the problems that occur when not enough information
has been recorded to make an intelligent decision, and the patient cannot or should not be
contacted to obtain further details. Further, interactive geocoding may be too time
consuming to be practical.
2.3.2 Geocoding at the diagnosing facility
The second, less-likely option, geocoding at the diagnosing facility, is the process of
geocoding a record at the intake facility while the abstractor, clerical, intake, or registration
personnel at the hospital (i.e., whoever on the staff is in contact with the patient) is
performing the data ingest, or conducting the patient interview, and creating the electronic
record or abstract. Geocoding at this point will result in the highest percentage of valid
geocodes because the geocoding system itself can be used as a validation tool. In addition, staff
can ask the patient follow-up questions regarding the address if the system returns it as an
address that is non-matchable, a sentiment clearly echoed by the emerging trends and
attitudes in the cancer registry community (e.g., Abe and Stinchcomb 2008, pp 123). Street-level
geocoding at the hospital is ideal, but has yet to be realized at most facilities.
This is an example of one definition for the term real-time geocoding, the process of
geocoding a record while the patient or the patient's representative is available to provide
more detailed or correct information using an iterative refinement approach. Data entry
errors resulting from staff input error can be reduced if certain aspects of the address can be
filled in automatically as the staff member enters them in a particular order, from lowest
resolution to highest. For instance, the staff can start with the state attribute, followed by the
United States Postal Service (USPS) ZIP Code. Upon entering this, in some cases both the
county and city can be automatically filled in by the geocoding system, which the patient
then can verify. However, if the USPS ZIP Code has other USPS-acceptable postal names or
represents mail delivery to multiple counties, these defaults may not be appropriate. This
process also may be performed as interactive or non-interactive geocoding.
Looking past this and assuming the case of a USPS ZIP Code that can assign city and
county attributes correctly, a further step can be taken. The street, as defined by its attributes
(e.g., name, type, directional), can be validated by the geocoding system as actually existing
within the already-entered county and USPS ZIP Code, and the building number can be
tested as being within a valid range on that street. At any point, if invalid or ambiguous data
are discovered by the geocoding system (or the address validation component or stand-alone
system) as they are being entered, the staff can be instructed to ask follow-up questions to
resolve the conflicts. Depending on the policies of the central registry, all that may be required
of a hospital is to ensure that all of the steps that could have provided the patient with the
opportunity to resolve the conflict were taken and their outcomes documented, even if the
outcome was a refusal to clarify. If a correct address can be determined, entered, and
verified, the geocoding system then can associate any attributes that were not entered (e.g., the
directional prefix or suffix of the street name), which can be approved and accepted by the
staff member if correct, thereby increasing the completeness of the input address data.
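
The lowest-to-highest-resolution entry order described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the lookup tables
ZIP_LOOKUP and STREET_RANGES, the function validate_intake, and the street number range are
hypothetical stand-ins for whatever reference data and validation component a registry or
facility actually uses.

```python
from dataclasses import dataclass, field

# Hypothetical miniature reference data (assumption). A real system would query a
# full reference dataset rather than these hard-coded dictionaries.
ZIP_LOOKUP = {
    "90089": {"state": "CA", "cities": ["Los Angeles"], "counties": ["Los Angeles"]},
}
STREET_RANGES = {
    ("90089", "VERMONT AVE"): (3500, 3799),  # illustrative range only
}


@dataclass
class IntakeResult:
    filled: dict = field(default_factory=dict)      # attributes filled in automatically
    questions: list = field(default_factory=list)   # follow-up questions for the patient


def validate_intake(state, zip_code, street, number):
    """Validate an address from lowest resolution (state) to highest (number)."""
    result = IntakeResult()
    entry = ZIP_LOOKUP.get(zip_code)
    if entry is None or entry["state"] != state:
        result.questions.append("Please confirm the state and ZIP Code.")
        return result
    # Fill in city and county only when the ZIP Code maps to exactly one of each.
    for attribute, key in (("city", "cities"), ("county", "counties")):
        if len(entry[key]) == 1:
            result.filled[attribute] = entry[key][0]
        else:
            result.questions.append(f"Which {attribute} is the address in?")
    street_range = STREET_RANGES.get((zip_code, street.upper()))
    if street_range is None:
        result.questions.append("Street not found in this ZIP Code; please confirm it.")
    elif not street_range[0] <= number <= street_range[1]:
        result.questions.append("Building number is outside the known range; please confirm it.")
    return result
```

Each question recorded in the result corresponds to a follow-up the staff member would ask
while the patient is still present; an empty list means the entered address passed every check.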
In the most accurate and highly advanced scenario possible, the generated point can be
displayed on imagery and shown to the patient, who then can instruct the staff member in
placing the point exactly in the center of their roofline, instead of at its original estimated
location. Of course, this may not be desired or appropriate in all cases, such as when people
are afraid to disclose their address for one reason or another. Further, if this strategy were
employed without properly ensuring the patient's confidentiality (e.g., performing all
geocoding and mapping behind a firewall), explaining the technology that enables it (e.g., any
of the variety of ArcGIS products and data layers), and what the goal of using it was (i.e., to
improve cancer data with the hopes of eventually providing better prevention, diagnosis, and
care), it would be understandable for patients to be uncomfortable and think that the
diagnosing facility was infringing on their right to privacy.
As described, this process may be impossible for several reasons. It may take a
substantially greater amount of time to perform than what is available. It assumes that an
on-demand, case-by-case geocoder is available to the staff member, which may not be a reality
for registries geocoding with vendors. If available, its use may be precluded by privacy or
confidentiality concerns or constraints, or the reference datasets used may not be the correct
type or of sufficient quality to achieve the desired level of accuracy. This scenario assumes a
great deal of technical competence and an in-depth understanding of the geocoding process
that the staff member may not possess. If this approach were to become widely adopted,
further questions would be raised as to if and/or when residential history information also
should be processed.
2.3.3 Considerations
These two scenarios illustrate that there are indeed benefits and drawbacks associated
with performing the geocoding process at different stages of data gathering. When deciding
which option to choose, an organization also should take note of any trends in past
performance that can be leveraged or used to indicate future performance. For example, if less
than 1 percent of address data fails batch-mode processing and requires very little of a staff
member's time to manually correct, it may be worth doing. However, if 50 percent of the
data fail and the staff member is spending the majority of his or her time correcting
erroneously or ambiguously entered data, another option in which input address validation is
performed closer to the level of patient contact might be worth considering, but would
require more individuals trained in geocoding, specifically the address validation portion (see
Section 6.4 for more detail), at more locations (e.g., hospitals and doctors' offices). These
types of tradeoffs will need to be weighed carefully and discussed between both the central
registry and facility before a decision is made.
At a higher level, it also is useful to consider the roles of each of the two organizations
involved: (1) the data submitters, and (2) the central registries. It can be argued that, ideally,
the role of the data submitter is simply to gather the best raw data they can while they are in
contact with the patient (although this may not be what actually occurs in practice for a
variety of reasons), while the central registries are responsible for ensuring a standardized
process for turning the raw data into its spatial counterpart. Even if the data submitters
perform geocoding locally before submitting the data to the central registries, the geocoded
results may be discarded and the geocoding process applied again upon consolidation by the
central registry to maintain consistency (in terms of geocoding quality due to the geocoding
process used) amongst all geocodes kept at the central registry. However, even in currently
existing and used standards, diagnosing facilities (and/or data submitters) are responsible for
some of the spatial fields in a record (e.g., the county), so the lines between responsibilities
have already been blurred for some time.
Best Practices 7 - When to geocode

Policy Decision: Where and when is geocoding used
Best Practice: Geocoding should be performed as early as possible (i.e., as soon as the
address data become available), wherever the data are obtained.

Metadata should describe where the geocoding took place:
- Diagnosing facility
- Central registry.

Metadata should describe when the geocoding took place:
- Upon intake
- Upon transfer from a single registry
- Upon consolidation from multiple registries
- Every time it is used for analysis.
2.4 SUCCESS STORIES
The use of geocoding and geocoded data in health- and cancer-related research has a
long, vivid, and exciting history, stretching back many years (e.g., the early attempts in Howe
[1986]). The use of automated geocoding to facilitate spatial analyses in cancer research has
enabled entirely new modes of inquiry that were not possible or feasible prior to automated
geocoding. Several exemplary applications are noted here to illustrate the potential of the
technique and the success that can be achieved. For a more comprehensive review of
research studies that have utilized geocoding as a fundamental component, see the recent
review article by Rushton et al. (2006).
Epidemiological investigations into the links between environmental exposure and disease
incidence rely heavily on geocoded data and are particularly sensitive to the accuracy that can
be obtained through the different methods and data sources that can be employed. For
example, a whole series of studies investigating ambient pesticide exposure in California's
Central Valley all have used geocoding as the fundamental component for identifying the
locations of individuals living near pesticide application sites (e.g., Bell et al. 2001; Rull and
Ritz 2003; Rull et al. 2001, 2006a, 2006b; Reynolds et al. 2005; Marusek et al. 2006; Goldberg
et al. 2007b; and Nuckols et al. 2007). Due to the rapid distance decay inherent in these
environmental factors, a high level of spatial accuracy was necessary to obtain accurate
exposure estimates.
Likewise, in a currently ongoing study, Cockburn et al. (2008) have uncovered evidence
that the risk of mesothelioma with proximity to the nearest freeway (assessing the possible
impact of asbestos exposure from brake and clutch linings) is two-fold higher for residences
within 100 m of a freeway than those over 500 m away, using linear-interpolation geocoding
based on TIGER/Line files (U.S. Census Bureau 2008d). However, when comparing
distances to freeways obtained from TIGER/Line file geocodes to those obtained from a
parcel-based interpolation approach, it was shown that 24 percent of the data points had
parcel-based geocode freeway distances in excess of 500 m greater than those derived from
TIGER/Line files. This means that up to 24 percent of the data were misclassified in the
original analysis. If the misclassification varied by case/control status (under examination),
then the true relative risk is likely very different from what was observed (biased either to or
away from the null).
In addition to its role in exposure analysis, geocoding forms a fundamental component
of research studies investigating distance and accessibility to care (Armstrong et al. 2008).
These studies typically rely on geocoded data for both a subject's address at diagnosis
(dxAddress) and the facility at which they were treated. With these starting and ending
points, Euclidean, Great Circle, and/or network distance calculations can be applied to
determine both the distance and time that a person must travel to obtain care. Studies using
these measures have investigated such aspects as disparities in screening and treatment
(Stitzenberg et al. 2007), the effects of distance on treatment selection and/or outcomes (e.g.,
Nattinger et al. 2001; Stefoski et al. 2004; Voti et al. 2005; Feudtner et al. 2006; and Lianga et
al. 2007), and the targeting of regions for prevention and control activities (e.g., Rushton et al.
2004).



3. GEOGRAPHIC INFORMATION SCIENCE FUNDAMENTALS
This section introduces the fundamental geographic principles
used throughout the remainder of the document,
as well as common mistakes that often are encountered.
3.1 GEOGRAPHIC DATA TYPES
In general, GIS data are either vector- or raster-based. Vector-based data consist of
vector objects or features and rely on points and discrete line segments to specify the
locations of real-world entities. The latter are simply phenomena or things of interest in the
world around us (i.e., a specific street like Main Street) that cannot be subdivided into
phenomena of the same kind (i.e., more streets with new names). Vector data provide
information relative to where everything occurs (they give a location to every object), but vector
objects do not necessarily fill space, because not all locations need to be referenced by
objects. One or more attributes (like street names in the aforementioned example) can be
assigned to individual objects to describe what is where with vector-based data. Raster-based
data, in contrast, divide the area of interest into a regular grid of cells in some specific
sequence, usually row-by-row from the top left corner. Each cell is assigned a single value
describing the phenomenon of interest. Raster-based data provide information relative to what
occurs everywhere (they are space filling because every location in an area of interest
corresponds to a cell in the raster) and, as a consequence, they are best suited for representing
things that vary continuously across the surface of the Earth.
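
The row-by-row cell ordering of a raster can be made concrete with a short sketch. The
function name cell_index and its parameters are illustrative assumptions; the grid is assumed
to have square cells, planar coordinates, and an origin at its top left corner.

```python
def cell_index(x, y, origin_x, origin_y, cell_size, n_cols, n_rows):
    """Return the row-major index of the raster cell containing (x, y).

    Assumes (origin_x, origin_y) is the top left corner of the grid, x increases
    to the right, and y decreases downward from the origin.
    """
    col = int((x - origin_x) // cell_size)
    row = int((origin_y - y) // cell_size)
    if not (0 <= col < n_cols and 0 <= row < n_rows):
        raise ValueError("location falls outside the raster extent")
    # Cells are numbered row by row, starting from the top left corner.
    return row * n_cols + col
```

Because every location inside the extent maps to exactly one cell, the raster fills space in
the sense described above, whereas a vector dataset only records the objects it contains.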
Most geocoding applications work with vector-based GIS data. The fundamental
primitive is the point, a 0-dimensional (0-D) object that has a position in space but no length.
Geographic objects of increasing complexity can be created by connecting points with
straight or curved lines. A line is a 1-D geographic object having a length and is composed
of two or more 0-D point objects. Lines also may contain other descriptive attributes that
are exploited by geocoding applications such as direction, whereby one end point or node is
designated as the start node and the other is designated as the end node. A polygon is a
geographic object bounded by at least three 1-D line objects or segments with the
requirement that they must start and end at the same location (i.e., node). These objects have a
length and width, and from these properties one can calculate the area. Familiar 2-D shapes
such as squares, triangles, and circles are all polygons in vector-based views of the world
around us.
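
As a minimal sketch of these primitives, the following Python fragment represents points,
lines, and polygons and computes a line's length and a polygon's area. The names Point,
line_length, and polygon_area are illustrative assumptions, and the coordinates are assumed
to be planar (projected), not latitude and longitude.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A 0-D object: a position with no length."""
    x: float
    y: float


def line_length(points):
    """Length of a 1-D line given as an ordered list of two or more points."""
    return sum(math.dist((a.x, a.y), (b.x, b.y)) for a, b in zip(points, points[1:]))


def polygon_area(ring):
    """Area of a polygon whose boundary starts and ends at the same node.

    Uses the shoelace formula; at least three segments (four listed vertices,
    the last repeating the first) are required.
    """
    if len(ring) < 4 or ring[0] != ring[-1]:
        raise ValueError("a polygon boundary must be closed and have at least three segments")
    twice_area = sum(a.x * b.y - b.x * a.y for a, b in zip(ring, ring[1:]))
    return abs(twice_area) / 2
```

The order of the points in a line also carries the direction attribute mentioned above: the
first point acts as the start node and the last as the end node.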
Most GIS software supports both vector- and raster-based views of the world, and any
standard GIS textbook can provide further information on both the underlying principles
and strengths and weaknesses of these complementary data models. The key aspects from a
geocoding perspective relate to the methods used to: (1) determine and record the
locations of these objects on the surface of the Earth, and (2) calculate distance, because many
geocoding algorithms rely on one or more forms of linear interpolation.
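
To show why distance calculation matters, here is a hedged sketch of one common form of
linear interpolation: placing a house number proportionally along a street segment whose
address range is known. The function name interpolate_address and its parameters are
assumptions for illustration, and the segment end points are assumed to be in planar
coordinates; production geocoders add refinements (e.g., side-of-street offsets) that are
omitted here.

```python
def interpolate_address(number, range_start, range_end, start_xy, end_xy):
    """Estimate a location for a house number along a straight street segment.

    range_start and range_end are the house numbers at the segment's start and
    end nodes; start_xy and end_xy are the planar coordinates of those nodes.
    """
    if range_start == range_end:
        fraction = 0.5  # single-number range: fall back to the segment midpoint
    else:
        fraction = (number - range_start) / (range_end - range_start)
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("house number is outside the segment's address range")
    x = start_xy[0] + fraction * (end_xy[0] - start_xy[0])
    y = start_xy[1] + fraction * (end_xy[1] - start_xy[1])
    return (x, y)
```

For example, house number 3650 on a segment numbered 3600 to 3700 would be placed halfway
between the two nodes. Any error in the coordinates or in the distance measure carries
directly into the interpolated location.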

3.2 GEOGRAPHIC DATUMS AND GEOGRAPHIC COORDINATES
The positions or locations of objects on the surface of Earth are represented with one or
more coordinate systems. Specifying accurate x and y coordinates for objects is
fundamental for all GIS software and location-based services. However, many different
coordinate systems are used to record location, and one often needs to transform data in a
GIS from one reference system to another.
There are three basic options: (1) global systems, such as latitude and longitude, that are
used to record position anywhere on Earth's surface; (2) regional or local systems that aim to
provide accurate positioning over smaller areas; and (3) postal codes and cadastral reference
systems that record positions with varying levels of precision and accuracy. The reference
system to be used for a particular geocoding application and accompanying GIS project will
depend on the purpose of the project and how positions were recorded in the source data.
There usually is a geodetic datum that underpins whatever reference system is used or
chosen. Most modern tools (e.g., GPS receivers) and data sources (e.g., U.S. Geological
Survey National Map, U.S. Census Bureau TIGER/Line files) rely on the North American
Datum of 1983 (NAD-83). This and other datums in use in various parts of the world provide a
reference system against which horizontal and/or vertical positions are defined. A datum consists
of an ellipsoid (a model of the size and shape of Earth that accounts for the slight flattening
at the poles and other irregularities) and a set of point locations precisely defined with
reference to that surface.
Geographic coordinates, which specify locations in terms of latitude and longitude,
constitute a very popular reference system. The Prime Meridian (drawn through Greenwich,
England) and Equator serve as reference planes to define latitude and longitude. Latitude is
the angle from the plane at the horizontal center of the ellipsoid, the Equator, to the point
on the surface of the ellipsoid (at sea level). Longitude is the angle between the plane at the
vertical center of the ellipsoid, the meridian, and the point on the surface of the ellipsoid.
Both are recorded in degrees, minutes, and seconds or decimal degrees.
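
Because both notations appear in practice, a conversion between them is often needed. The
following sketch converts degrees, minutes, and seconds to decimal degrees; the function
name dms_to_decimal and the convention of signing southern and western values as negative
are assumptions made for illustration.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, and seconds to signed decimal degrees.

    hemisphere is one of "N", "S", "E", or "W"; south and west are returned
    as negative values.
    """
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value
```

For instance, 34 degrees, 1 minute, 12 seconds north corresponds to 34 + 1/60 + 12/3600
decimal degrees.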
3.3 MAP PROJECTIONS AND REGIONAL REFERENCE SYSTEMS
Several different systems are used regionally to identify geographic positions. Some of
these are true coordinate systems, such as those based on the Universal Transverse Mercator
(UTM) or Universal Polar Stereographic (UPS) map projections. Others, such as the Public
Land Survey System (PLSS) used widely in the Western United States, simply partition space
into blocks. The systems that incorporate some form of map projection are preferred if the
goal is to generate accurate geocoding results. A map projection is a mathematical function
to transfer positions on the surface of Earth to their approximate positions on a flat surface
(i.e., a computer monitor or paper map). Several well-known projections exist; the
differences between them generally are determined by which property of the Earth's surface they seek
to maintain with minimal distortion (e.g., distance, shape, area, and direction). Fortunately, a
great deal of time and effort has been expended to identify the preferred map projections in
many/most parts of the world.
Hence, the State Plane Coordinate System (SPC) was developed by U.S. scientists in the
1930s to provide local reference systems tied to a national geodetic datum. Each state has its
own SPC system with specific parameters and projections. Smaller states such as Rhode
Island use a single SPC zone; larger states such as California and Texas are divided into several
SPC zones. The SPC zone boundaries in the latter cases typically follow county boundaries.
The initial SPC system was based on the North American Datum of 1927 (NAD-27) and the
coordinates were recorded in English units (i.e., feet). Some maps using NAD-27
coordinates are still in use today.
Improvements in the measurements of both the size and shape of Earth and of positions
on the surface of Earth itself led to numerous efforts to refine these systems, such that the
1927 SPC system has been replaced for everyday use by the 1983 SPC system. The latter is
based on the NAD-83 and the coordinates are expressed in metric units (i.e., meters). The
1983 SPC system used Lambert Conformal Conic projections for regions with larger
east-west than north-south extents (e.g., Nebraska, North Carolina, and Texas); the Transverse
Mercator projections were used for regions with larger north-south extents (e.g., Illinois and
New Hampshire). There are exceptions: Florida, for example, uses the Lambert Conformal
Conic projection in its north zone and the Transverse Mercator projection in its west and
east zones. Alaska uses a completely different Oblique Mercator projection for the thin
diagonal zone in the southeast corner of the state.
The choice of map projection and the accompanying coordinate system may have several
consequences and is a key point to keep in mind during any aspect of the geocoding process
because the distance and area calculations required for geocoding rely on them. The most
common mistake made from not understanding or realizing the distinctions between different
coordinate systems occurs during distance calculations. Latitude and longitude record angles,
and the utilization of Euclidean distance functions to measure distances in this coordinate
system is not appropriate. Spherical distance calculations should be used in these instances.
The simpler Euclidean calculations are appropriate at a local scale in a projected coordinate
system because the distortion caused by representing positions on a curved surface on a flat
computer monitor and/or paper map is minimized. Some special care may be needed if/when
the distance calculations extend across two or more SPC zones, given the way positions are
recorded in northings and eastings relative to some local origin. Some additional information
on these types of complications can be gleaned from standard GIS textbooks. Best practices
relating to geographic fundamentals are listed in Best Practices 8.
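
The distinction between spherical and planar distance can be illustrated with a short
sketch. It uses the haversine formula, one common spherical distance calculation, and
treats Earth as a sphere with a mean radius of 6,371,008.8 m; both choices, along with the
function names haversine_m and planar_m, are assumptions made for illustration and are not
prescribed by this guide.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; a spherical approximation (assumption)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def planar_m(x1, y1, x2, y2):
    """Euclidean distance between two points in a projected (planar) coordinate system."""
    return math.hypot(x2 - x1, y2 - y1)
```

One degree of longitude spans roughly half the ground distance at 60 degrees latitude that it
does at the Equator, which is why applying planar_m directly to latitude and longitude values
gives misleading results.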
Best Practices 8 - Geographic fundamentals

Policy Decision: What information should be kept about attributes of a reference dataset
Best Practice: All metadata should be maintained about the type and lineage of the
reference data (e.g., coordinate system, projection).

Policy Decision: What coordinate system should reference data be kept in
Best Practice: Reference data should be kept in a Geographic Coordinate System using the
North American Datum of 1983 (NAD-83) and projected when it needs to be displayed or
have distance-based calculations performed.

If a projected coordinate system is required, an appropriate one for the location/purpose
should be used.

Policy Decision: What projection should be used to project reference data
Best Practice: An appropriate projection should be chosen based on the geographic extent
of the area of interest and/or what the projected data are going to be used for. For further
information, see any basic GIS textbook.

In general:
- For most cancer maps, use an equal area projection.
- For maps with circular buffers, use a conformal projection.
- For calculating distances, use a projection that minimizes distance error for the area of
  interest.

Policy Decision: What distance calculations should be used
Best Practice: In a projected coordinate space, planar distance metrics should be used.

In a non-projected (geographic) coordinate space, spherical distance metrics should be used.








Part 2: The Components of Geocoding



Geocoding is an extremely complicated task involving multiple processes and datasets all
simultaneously working together. Without a fundamental understanding of how these pieces
all fit together, intelligent decisions regarding them are impossible. This part of the
document will first look at the geocoding process from a high level, and subsequently perform a
detailed examination of each component of the process.

4. ADDRESS GEOCODING PROCESS OVERVIEW
This section identifies types of geocoding processes and
outlines the high-level geocoding process, illustrating the major
components and their interactions.
4.1 TYPES OF GEOCODING PROCESSES
Now that geocoding has been defined and placed within the context of the larger
concept of spatial analysis, the technical background that makes the process possible will be
presented. The majority of the remainder of this document will focus on the predominant type
of geocoding performed throughout the cancer research community, software-based
geocoding. Software-based geocoding is a geocoding process in which a significant portion of
the components are software systems. From this point forward, unless otherwise stated, the
term "geocoder" will refer to this particular arrangement.
The software-based geocoding option is presently by far the most economical option
available to registries and is the most commonly used option. This document will seek to
inform specific decisions that must be made with regard to software-based geocoding.
However, information will be relevant to other geocoder processes that utilize other tools (e.g.,
GPS devices or identification and coordinate assignment from aerial imagery). The accuracy
and metadata reporting discussions in particular will be applicable to all types of geocoding
process definitions.
In the following sections, the fundamental components of the geocoding process will be
introduced. The discussion will provide a high-level description of the components in the
geocoding process and their interactions to illustrate the basic steps that a
typical geocoder performs as it produces output from the input provided. Each of these
steps, along with specific issues and best practice recommendations related to them, will be
described in greater detail in the sections that follow. Additional introductory material on the
overall geocoding process, components, and possible sources of error can be found in
Armstrong and Tiwari (2008) and Boscoe (2008). The theoretical background presented in
the following sections can be grounded by reviewing the case study of the detailed specific
geocoding practices and products used in the New Jersey State Cancer Registry (NJSCR) (as
well as several other registries) available in Abe and Stinchcomb (2008).
4.2 HIGH-LEVEL GEOCODING PROCESS OVERVIEW
At the highest level, most generalized geocoding processes involve three separate yet
related components: (1) the descriptive locational input data (e.g., addresses), (2) the geocoder,
and (3) the spatial output data. These high-level relationships are illustrated in Figure 2.
The input data to the geocoding process can be any descriptive locational textual
information such as an address or building name. The output can be any form of valid spatial
data such as latitude and longitude. Geocoding is the process used to convert the input into
the output, which is performed by the geocoder.



Figure 2 - High-level data relationships
4.3 SOFTWARE-BASED GEOCODERS
A software-based geocoder is composed of two fundamental components. These are the
reference dataset and the geocoding algorithm, each of which may be composed of a
series of sub-components and operations. The geocoding process with these new
relationships is depicted in Figure 3.
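
The two-component arrangement can be sketched as follows. This is a deliberately minimal
illustration, not a description of any particular geocoder: the class names
ReferenceFeature, ReferenceDataset, and Geocoder are hypothetical, the matching algorithm is
reduced to an exact, case-insensitive name lookup, and the feature and its coordinates are
made-up placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReferenceFeature:
    name: str
    latitude: float
    longitude: float


class ReferenceDataset:
    """Component 1: the geographic features that input data are matched against."""

    def __init__(self, features):
        self._by_name = {f.name.upper(): f for f in features}

    def find(self, name) -> Optional[ReferenceFeature]:
        return self._by_name.get(name.strip().upper())


class Geocoder:
    """Component 2: the algorithm that turns input text into spatial output."""

    def __init__(self, reference: ReferenceDataset):
        self.reference = reference

    def geocode(self, text) -> Optional[Tuple[float, float]]:
        # Simplest possible algorithm: exact name lookup in the reference dataset.
        feature = self.reference.find(text)
        if feature is None:
            return None  # no match, so no geocode is produced
        return (feature.latitude, feature.longitude)


# Placeholder feature with made-up coordinates (assumption for illustration).
reference = ReferenceDataset([ReferenceFeature("Example Hospital", 34.0, -118.0)])
geocoder = Geocoder(reference)
```

Keeping the reference dataset and the algorithm as separate objects mirrors the point that
each may be replaced or refined independently of the other.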


Figure 3 - Schematic showing basic components of the geocoding process

It is likely that the actual software implementation of a geocoder will vary in the nature
of the components chosen and conceptual representation of the geocoding system. Each
registry will have its own geocoding requirements, or set of limitations, constraints, or
concerns that influence the choice of a particular geocoding option. These may be technical,
budgetary, legal, or policy related and will necessarily guide the choice of a geocoding
process. Best practices related to determining geocoding requirements are listed in Best
Practices 9. Even though the geocoding requirements may vary between registries, the NAACCR
standards for data reporting spatial fields as defined in Standards for Cancer Registries: Data
Standards and Data Dictionary (Hofferkamp and Havener 2008) should be followed by all
registries to ensure uniformity across registries.

Best Practices 9 - Geocoding requirements

Policy Decision: What considerations should affect the choice of geocoding requirements
Best Practice: Technical, budgetary, legal, and policy constraints should influence the
requirements of a geocoding process.

Policy Decision: When should requirements be reviewed and/or changed
Best Practice: Ideally, requirements should be revisited annually, but registries may have
constraints that extend or shorten this time period.

The general process workflow shown in Figure 4 represents a generalized abstraction of
the geocoding process. It illustrates the essential components that should be common to any
geocoder implementation and sufficient for registries with few requirements, while being
detailed enough to illustrate the decisions that must be made at registries with many detailed
requirements. This conceptualization also is sufficient to illustrate the generalized steps and
requirements that geocoder vendors will need to accommodate to work with registries.


Figure 4 - Generalized workflow


4.4 INPUT DATA
Input data are the descriptive locational texts that are to be turned into
computer-useable spatial data by the process of geocoding. As indicated earlier (Table 3), the wide
variety of possible forms and formats of input data is the main descriptor of a geocoder's
flexibility, as well as a contributing factor to the overall difficulty of implementing geocoding.
4.4.1 Classifications of input data
Input data can first be classified into two categories, relative and absolute. Relative
input data are textual location descriptions which, by themselves, are not sufficient to produce
an output geographic location. These produce relative geocodes that are geographic
locations relative to some other reference geographic locations (i.e., based on an interpolated
distance along or within a reference feature in the case of line vectors and areal units,
respectively). Without reference geographic locations (i.e., the line vector or areal unit), the output
locations for the input data would be unobtainable.
Examples of these types of data include "Across the street from Togo's" and "The
northeast corner of Vermont Avenue and 36th Place." These are not typically considered
valid address data for submission to a central registry, but they nonetheless do occur. The
latter location, for instance, cannot be located without identifying both of the streets as well
as the cardinal direction in which one must head away from their exact intersection. Normal
postal street addresses also are relative. The address "3620 South Vermont Avenue" is
meaningless without understanding that "3620" denotes a relative geographic location
somewhere on the geographic location representing "Vermont Avenue." It should be noted that
in many cases, geocoding platforms do not support these types of input and thus they may
not be matchable, but advances in this direction are being made.
Absolute input data are textual location descriptions which, by themselves, are
sufficient to produce an output geographic location. These input data produce an absolute
geocode in the form of an absolute known location or an offset from an absolute known
location. Input data in the form of adequately referenced placenames, USPS ZIP Codes, or
parcel identifiers are examples of the first because each can be directly looked up in a data
source (if available) to determine a resulting geocode.
Locations described in terms of linear addressing systems also are absolute by definition.
For example, the Emergency 911-based (E-911) geocoding systems being mandated in rural
areas of the United States are (in many cases) absolute because they use distances from
known mileposts on streets as coordinates. These mileposts are a linear addressing system
because each represents an absolute known location. It should be noted that in some cases,
this may not be true because the only implementation action taken to adhere to the E-911
system was street renaming or renumbering.
With these distinctions in mind, it is instructive at this point to classify and enumerate
several commonly encountered forms of input data that a geocoder can and must be able to
handle in one capacity or another, because these may be the only information available in the
case in which all other fields in a record are null. This list is presented in Table 4.


Table 4 - Common forms of input data with corresponding NAACCR fields and example values

Type                        NAACCR Field(s)                        Example
Complete postal address     2330: dxAddress - Number and Street    3620 S Vermont Ave, Unit 444, Los Angeles, CA 90089
                            70: dxAddress - City
                            80: dxAddress - State
                            100: dxAddress - Postal Code
Partial postal address      2330: dxAddress - Number and Street    3620 Vermont
USPS PO box                 2330: dxAddress - Number and Street    PO Box 1234, Los Angeles CA 90089-1234
                            70: dxAddress - City
                            80: dxAddress - State
                            100: dxAddress - Postal Code
Rural Route                 2330: dxAddress - Number and Street    RR12, Los Angeles CA
                            70: dxAddress - City
                            80: dxAddress - State
City                        70: dxAddress - City                   Los Angeles
County                      90: County at dx                       Los Angeles County
State                       80: dxAddress - State                  CA
USPS ZIP Code, USPS ZIP+4   100: dxAddress - Postal Code           90089-0255
(United States Postal
Service 2008a)
Intersection                2330: dxAddress - Supplemental         Vermont Avenue and 36th Place
Named place                 2330: dxAddress - Supplemental         University of Southern California
Relative                    2330: dxAddress - Supplemental         Northeast corner of Vermont Ave and 36th Pl
Relative                    2330: dxAddress - Supplemental         Off Main Rd
From this list, it is apparent that most input data are based on postal addressing systems,
administrative units, named places, coordinate systems, or relative descriptions that use one
of the others as a referent. Input data in the form of postal addresses, or portions thereof,
are by far the most commonly encountered, and as such this document will focus almost
exclusively on this input data type. Significant problems may appear when processing postal
address data because they are among the "noisiest" forms of data available. As used here,
"noisy" refers to the high degree of variability in the way they can be represented, and to the
fact that they often include extraneous data and/or are missing required elements. To
overcome these problems, geocoders usually employ two techniques known as address
normalization and address standardization.
4.4.2 Input data processing
!??BAEE M=BPH>DfHFD=M organizes and cleans input data to increase its eiciency or use
and sharing. 1his process attempts to identiy the component pieces o an input address
,e.g., street number, street name, or USPS ZIP Code, within the input string. 1he goal is to
identiy the correct pieces in the input data so that it will hae the highest likelihood o being
successully assigned a geocode by the geocoder. In 1able 5, seeral orms o the same ad-
dress are represented to illustrate the need or address normalization.

Table 5 Multiple forms of a single address
Sample Address
3620 South Vermont Avenue, Unit 444, Los Angeles, CA 90089-0255
3620 S Vermont Ave, 4444, Los Angeles, CA 90089-0255
3620 S Vermont Ave, 444, Los Angeles, 90089-0255
3620 Vermont, Los Angeles, CA 90089

Address standardization converts an address from one normalized format into another.
It is closely linked to normalization and is heavily influenced by the performance of the
normalization process. Standardization converts the normalized data into the correct format
expected by the subsequent components of the geocoding process. Address standards may
be used for different purposes and may vary across organizations because there is no single,
set format; however, variability in formats presents a barrier to data sharing among
organizations. Interoperability assumes an agreement to implement a standardized format. In
Table 6, several existing or proposed address standards are listed. Best practices related to
input data are listed in Best Practices 10.

Table 6 Existing and proposed address standards
Organization | Standard
USPS | Publication 28 (United States Postal Service 2008d)
Urban and Regional Information Systems Association (URISA)/United States Federal Geographic Data Committee (FGDC) | Street Address Data Standard (United States Federal Geographic Data Committee 2008b)

Best Practices 10 Input data (high level)

Policy Decision: What type of input data can and should be geocoded
Best Practice: At a minimum, NAACCR standard address data should be able to be geocoded.
Ideally, any type of descriptive locational data, both relative and absolute, in any address
standard should be an acceptable type of input and geocoding can be attempted:
- Any form of postal address
- Intersections
- Named places
- Relative locations.

Policy Decision: What type of relative input data can and should be geocodable
Best Practice: At a minimum, postal street addresses. Ideally, relative directional
descriptions.

Policy Decision: What type of absolute input data can and should be geocodable
Best Practice: At a minimum, E-911 locations (if they are absolute).

Policy Decision: What type of normalization can and should be performed
Best Practice: Any reproducible technique that produces certifiably valid results should be
considered a valid normalization practice:
- Tokenization
- Abbreviation (introduction/substitution).

Policy Decision: What type of standardization can and should be performed
Best Practice: Any reproducible technique that produces certifiably valid results should be
considered a valid standardization practice.

4.5 REFERENCE DATASETS
The reference dataset is the underlying geographic database containing geographic
features that the geocoder can use to generate a geographic output. This dataset stores all of
the information the geocoder knows about the world and provides the base data from which
the geocoder calculates, derives, or obtains geocodes. Interpolation algorithms (discussed in
the next section) perform computations on the reference features contained in these datasets
to estimate where the output of the geocoding process should be placed (using the attributes
of the input address).
Reference datasets are available in many forms and formats. The sources of these data
also vary greatly from local government agencies (e.g., tax assessors) to national governmental
organizations (e.g., the Federal Geographic Data Committee [FGDC]). Each must ultimately
contain valid spatial geographic representations that either can be returned directly in
response to a geocoder query (as the output) or be used by other components of the
geocoding process to deduce or derive the spatial output. A few examples of the numerous
types of geographic reference data sources that may be incorporated into the geocoder
process are listed in Table 7, with best practices listed in Best Practices 11.
Table 7 Example reference datasets
Type | Example
Vector line file | U.S. Census Bureau's TIGER/Line (United States Census Bureau 2008c)
Vector polygon file | Los Angeles (LA) County Assessor Parcel Data (Los Angeles County Assessor 2008)
Vector point file | Australian Geocoded National Address File (G-NAF) (Paull 2003)

Best Practices 11 Reference data (high level)

Policy Decision: What types of reference datasets can and should be supported by a geocoder
Best Practice: Linear-, point-, and polygon-based vector reference datasets should be
supported by a geocoding system.

4.6 THE GEOCODING ALGORITHM
The geocoding algorithm is the main computational component of the geocoder. This
algorithm can be implemented in a variety of ways, especially if trends about the input data
or reference dataset can be determined a priori.
Generally speaking, any algorithm must perform two basic tasks. The first, feature
matching, is the process of identifying a geographic feature in the reference dataset
corresponding to the input data to be used to derive the final geocode output for an input. A
feature-matching algorithm is an implementation of a particular form of feature matching.
These algorithms are highly dependent on both the type of reference dataset utilized and the
attributes it maintains about its geographic features. The algorithm's chances of selecting the
correct feature vary with the number of attributes per feature. A substantial part of the
overall quality of the output geocodes rests with this component because it is responsible for
identifying and selecting the reference feature used for output derivation.
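
As a minimal sketch of deterministic feature matching, the Python below compares each attribute supplied in the input against every feature in a small in-memory reference list and keeps only exact agreements. The segment records, field names, and the match_feature function are hypothetical and are not drawn from any real reference dataset. The sketch is meant only to illustrate how supplying more attributes narrows the set of candidate features.

```python
# Hypothetical reference features; IDs, ZIP Codes, and ranges are made up for illustration.
REFERENCE_SEGMENTS = [
    {"id": 1, "name": "VERMONT", "type": "AVE", "zip": "90089", "low": 3600, "high": 3698},
    {"id": 2, "name": "VERMONT", "type": "AVE", "zip": "90037", "low": 3600, "high": 3698},
    {"id": 3, "name": "FIGUEROA", "type": "ST", "zip": "90089", "low": 3600, "high": 3698},
]


def match_feature(address, segments):
    """Return the reference segments that agree with every attribute present in the input."""
    candidates = []
    for segment in segments:
        # Compare only the text attributes that the input actually supplies.
        text_ok = all(
            segment[key] == address[key].upper()
            for key in ("name", "type", "zip")
            if address.get(key)
        )
        # The house number, when given, must fall inside the segment's address range.
        number = address.get("number")
        number_ok = number is None or segment["low"] <= number <= segment["high"]
        if text_ok and number_ok:
            candidates.append(segment)
    return candidates


partial = {"number": 3620, "name": "Vermont"}
complete = {"number": 3620, "name": "Vermont", "type": "Ave", "zip": "90089"}

print([s["id"] for s in match_feature(partial, REFERENCE_SEGMENTS)])   # prints [1, 2]
print([s["id"] for s in match_feature(complete, REFERENCE_SEGMENTS)])  # prints [1]
```

With only a number and street name, two segments remain and the match is ambiguous; adding the street type and USPS ZIP Code leaves a single candidate.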
The next task, feature interpolation, is the process of deriving a geographic output
from a reference feature selected by feature matching. A feature interpolation algorithm is
an implementation of a particular form of feature interpolation. These algorithms also are
highly dependent on the reference dataset in terms of the type of data it contains and the
attributes it maintains about these features.
If one were to have a reference dataset containing valid geographic points for every
address in one's study area (e.g., the ADDRESS-POINT [Higgs and Martin 1995a, Ordnance
Survey 2008] and G-NAF [Paull 2003] databases), the feature interpolation algorithm
essentially returns this spatial representation directly from the reference dataset. More often,
however, the interpolation algorithm must estimate where the input data should be located
with reference to a feature in the reference dataset. Typical operations include linear or areal
interpolation (see Section 8) when the reference datasets are street vectors and parcel
polygons, respectively.
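
The linear case can be sketched as follows. This is a simplified illustration under stated assumptions: the street segment is a single straight line in planar coordinates, house numbers are spread evenly along it, and the StreetSegment class and interpolate_along_segment function are hypothetical names introduced here. Real reference datasets add complications (multi-vertex geometry, odd/even sides, offsets from the centerline) that this sketch ignores.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class StreetSegment:
    """Hypothetical reference feature: one straight street segment and its address range."""
    from_number: int            # house number at the start node
    to_number: int              # house number at the end node
    start: Tuple[float, float]  # planar (x, y) of the start node
    end: Tuple[float, float]    # planar (x, y) of the end node


def interpolate_along_segment(segment: StreetSegment, house_number: int) -> Tuple[float, float]:
    """Place a house number proportionally between the segment's two end nodes."""
    low, high = sorted((segment.from_number, segment.to_number))
    if not low <= house_number <= high:
        raise ValueError("house number is outside the segment's address range")
    span = segment.to_number - segment.from_number
    # A one-address range has no length to divide, so fall back to the midpoint.
    fraction = 0.5 if span == 0 else (house_number - segment.from_number) / span
    x = segment.start[0] + fraction * (segment.end[0] - segment.start[0])
    y = segment.start[1] + fraction * (segment.end[1] - segment.start[1])
    return (x, y)


segment = StreetSegment(3600, 3700, (0.0, 0.0), (100.0, 50.0))
print(interpolate_along_segment(segment, 3620))  # one fifth of the way along the segment
```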
Best Practices 12 Geocoding algorithm (high level)

Policy Decision: What type of geocoding can and should be performed
Best Practice: At a minimum, software-based geocoding should be performed.

Policy Decision: What forms of feature matching should the geocoding algorithm include
Best Practice: The geocoding algorithm should consist of feature-matching algorithms
consistent with the forms of reference data the system supports. Both probability-based and
deterministic methods should be supported.

Policy Decision: What forms of feature interpolation should the geocoding algorithm include
Best Practice: The geocoding algorithm should consist of feature interpolation algorithms
consistent with the forms of reference data the system supports (e.g., linear-based
interpolation if linear-based reference datasets are used).

4.7 OUTPUT DATA
The last component of the geocoder is the actual output data, which are the valid spatial
representations derived from features in the reference dataset. As defined in this document,
these data can have many different forms and formats, but each must contain some type of
valid spatial attribute.
The most common format of output is points described with geographic coordinates
(latitude, longitude). However, the accuracy of these spatial representations suffers when they
are interpolated, due to data loss during production. Alternate forms can include multi-point
representations such as polylines or polygons. As noted, registries must at least report the
results in the formats explicitly defined in Standards for Cancer Registries: Data Standards
and Data Dictionary (Hofferkamp and Havener 2008).
When using these output data, one must consider the geographic resolution they
represent in addition to the type of spatial geometry. For example, a point derived by areal
interpolation from a polygon parcel boundary should not be treated as equivalent to a point
derived from the areal interpolation of a polygon USPS ZIP Code boundary (note that a
USPS ZIP Code is not actually an areal unit; more details on this topic can be found in
Section 5.1.4). These geocoder outputs, while in the same format and produced through the
same process, do not represent data at the same geographic resolution and must be
differentiated.

Best Practices 13 Output data (high level)

Policy Decision: What forms and formats can and should be returned as output data
Best Practice: The geocoding process should be able to return any valid geographic object as
output. At a minimum, outputs showing the locations of points should be supported.

Policy Decision: What can and should be returned as output
Best Practice: The geocoding process should be able to return the full feature it matched to
(e.g., parcel polygon if matching to a parcel reference dataset), in addition to an interpolated
version.

4.8 METADATA
Without proper metadata describing each component of the geocoding process and the
choices that were made at each step, it is nearly impossible to have any confidence in the
quality of a geocode. With this in mind, it is recommended that all geocodes contain all
relevant information about all components used in the process as well as all decisions that
each component made. Table 8, Table 9, and Table 10 list example geocoding component,
process, and record metadata items. These lists are not a complete enumeration of every
metadata item for every combination of geocoding component, method, and decision, nor do
they contain complete metadata items for all topics described in this document. These lists
should serve as a baseline starting point from which registries, geocode developers, and/or
vendors can begin discussion as to which geocode component information needs
documentation. Details on each of the concepts listed in the following tables are described
later in the document. Component- and decision-specific metadata items for these and other
portions of the geocoding process are listed in-line through this document where applicable.
The creation and adoption of similar metadata tables describing the complete set of
geocoding topics covered in this document would be a good first step toward the eventual
cross-registry standardization of geocoding processes; work on this task is currently underway.
Table 8 Example geocoding component metadata
Component | Item | Example Value
Input data | Normalizer | Name of normalizer
Input data | Normalizer version | Version of normalizer
Input data | Normalizer strategy | Substitution-based; Context-based; Probability-based
Input data | Standardizer | Name of standardizer
Input data | Standardizer version | Version of standardizer
Input data | Standard | Name of standard
Reference dataset | Dataset type | Lines; Points; Polygons
Reference dataset | Dataset name | TIGER/Line files
Reference dataset | Dataset age | 2008
Reference dataset | Dataset version | 2008b
Feature matching | Feature matcher | Name of feature matcher
Feature matching | Feature matcher version | Version of feature matcher
Feature matching | Feature-matching strategy | Deterministic; Probabilistic
Feature interpolation | Feature interpolator | Name of feature interpolator
Feature interpolation | Feature interpolator version | Version of feature interpolator
Feature interpolation | Feature interpolator strategy | Address-range interpolation; Uniform lot interpolation; Actual lot interpolation; Geometric centroid; Bounding box centroid; Weighted centroid
Table 9 Example geocoding process metadata
Component | Decision | Example Value
Substitution-based normalization | Substitution table | USPS Publication 28 abbreviations
Substitution-based normalization | Equivalence function | Exact string equivalence; Case-insensitive equivalence
Probabilistic feature matching | Confidence threshold | 95
Probabilistic feature matching | Probability function | Match-unmatch probability
Probabilistic feature matching | Attribute weights | Weight values
Uniform lot interpolation | Dropback distance | 6m
Uniform lot interpolation | Dropback direction | Reference feature orthogonal

Table 10 Example geocoding record metadata
Component | Decision | Example Value
Substitution-based normalization | Original data | 3620 So. Vermont Av
Substitution-based normalization | Normalized data | 3620 S Vermont Ave
Probabilistic feature matching | Match probability | 95
Probabilistic feature matching | Unmatch probability | 6
Uniform lot interpolation | Number of lots on street | 6
Uniform lot interpolation | Lot width | Street length proportional
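
One possible way to carry such metadata alongside a geocode is a nested record with one section per table, sketched below in Python. The key names, the "ExampleNormalizer" and similar component names, and the missing_sections helper are all assumptions made for illustration; the values echo the example values in Tables 8 through 10 and do not describe any real system or required schema.

```python
import json

# Every key and component name here is illustrative; values echo the examples in Tables 8-10.
geocode_metadata = {
    "component": {
        "normalizer": {"name": "ExampleNormalizer", "version": "1.0",
                       "strategy": "substitution-based"},
        "reference_dataset": {"type": "lines", "name": "TIGER/Line files",
                              "age": 2008, "version": "2008b"},
        "feature_matcher": {"name": "ExampleMatcher", "version": "1.0",
                            "strategy": "probabilistic"},
        "feature_interpolator": {"name": "ExampleInterpolator", "version": "1.0",
                                 "strategy": "uniform lot interpolation"},
    },
    "process": {
        "substitution_table": "USPS Publication 28 abbreviations",
        "equivalence_function": "case-insensitive equivalence",
        "confidence_threshold": 95,
        "dropback_distance_m": 6,
        "dropback_direction": "reference feature orthogonal",
    },
    "record": {
        "original_data": "3620 So. Vermont Av",
        "normalized_data": "3620 S Vermont Ave",
        "match_probability": 95,
        "unmatch_probability": 6,
        "lots_on_street": 6,
    },
}

REQUIRED_SECTIONS = ("component", "process", "record")


def missing_sections(metadata):
    """List the top-level metadata sections that are absent or empty."""
    return [section for section in REQUIRED_SECTIONS if not metadata.get(section)]


serialized = json.dumps(geocode_metadata, indent=2, sort_keys=True)
print(serialized)
print(missing_sections(geocode_metadata))  # prints []
```

Serializing the record with the geocode keeps the component, process, and per-record decisions together, so a later user can see how each output was produced.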

















































5. ADDRESS DATA
This section presents an in-depth, detailed examination of the
issues specifically related to address data including the various
types that are possible, estimates of their accuracies, and the
relationships between them.
5.1 TYPES OF ADDRESS DATA
Postal address data are the most common form of input data encountered. They can take
many different forms, each with its own inherent strengths and weaknesses. These qualities
are directly related to the amount of information that is encoded. There also are specific
reasons for the existence of each, and in some cases, plans for its eventual obsolescence.
Several of the commonly encountered types will be described through examples and
illustrations. Each example will highlight differences in possible resolutions that can be
represented and first-order estimates of expected levels of accuracy.
5.1.1 City-Style Postal Addresses
A city-style postal address describes a location in terms of a numbered building along
a street. This address format can be described as consisting of a number of attributes that
when taken together uniquely identify a postal delivery site. Several examples of traditional
postal addresses for both the United States and Canada are provided in Table 11.

Table 11 Example postal addresses
Example Address
2121 West White Oaks Drive, Suite C, Springfield, IL, 62704-6495
1600 Pennsylvania Ave NW, Washington, DC 20006
Kaprielian Hall, Unit 444, 3620 S. Vermont Ave, Los Angeles, CA, 90089-0255
490 Sussex Drive, Ottawa, Ontario K1N 1G8, Canada

One of the main benefits of this format is the highly descriptive power it provides (i.e.,
the capability of identifying locations down to sub-parcel levels). In the United States, the
attributes of a city-style postal address usually include a house number and street name,
along with a city, state, and USPS ZIP Code. Each attribute may be broken down into more
descriptive levels if they are not sufficient to uniquely describe a location. For example, unit
numbers, fractional addresses, and/or USPS ZIP+4 Codes (United States Postal Service
2008a) are commonly used to differentiate multiple units sharing the same property (e.g.,
3620 Apt 1, 3620 Apt 6L, 3620 ½, or 90089-0255 [which identifies Kaprielian Hall]).
Likewise, pre- and post-directional attributes are used to differentiate individual street
segments when several in the same city have the same name and are within the same USPS
ZIP Code. This case often occurs when the origin of the address range of a street is in the
center of a city and expands outward in opposite directions (e.g., the 100 North [longer arrow
pointing up and to the left] and 100 South [shorter arrow pointing down and to the right]
Sepulveda Boulevard blocks, as depicted in Figure 5).

Figure 5 Origin of both the 100 North (longer arrow pointing up and to the left) and 100 South
(shorter arrow pointing down and to the right) Sepulveda Boulevard blocks (Google, Inc. 2008b)

Also, because this form of input data is so ubiquitous, suitable reference datasets and
geocoders capable of processing it are widely available at many different levels of accuracy,
resolution, and cost. Finally, the significant body of existing research explaining geocoding
processes based upon this format makes it an enticing option for people starting out.
However, several drawbacks to using data in the city-style postal address format exist.
These drawbacks are due to the multitude of possible attributes that give these addresses
their descriptive power. When attributes are missing, not ordered correctly, or if extraneous
information has been included, significant problems can arise during feature matching.
These attributes also can introduce ambiguity when the same values can be used for multiple
attributes. For instance, directional and street suffix indicators used as street names can cause
confusion as in "123 North South Street" and "123 Street Road." Similar confusion also may
arise in other circumstances when numbers and letters are used as street name values as in
"123 Avenue 2" and "123 N Street." Non-English-based attributes are commonly
encountered in some parts of the United States and Canada (e.g., "123 Paseo del Rey"), which
further complicates the geocoding process.
A final, more conceptual problem arises due to a class of locations that have ordinary
city-style postal addresses but do not receive postal delivery service. An example of this is a
private development or gated community. These data may sometimes be the most difficult
cases to geocode because postal address-based reference data are truly not defined for them
and systems relying heavily on postal address-based normalization or standardization may
fail to process them. This also may occur with minor civil division (MCD) names
(particularly townships) that are not mailing address components.
5.1.2 Post Office Box Addresses
A USPS post office (PO) box address designates a physical storage location at a U.S.
post office or other mail-handling facility. By definition, these types of data do not represent
residences of individuals, and should not be considered as such. Conceptually, a USPS PO
box address removes the street address portion from an address, leaving only a USPS ZIP
Code. Thus, USPS PO box data in most cases can never be geocoded to street-level
accuracy. Exceptions to this include the case of some limited mobility facilities (e.g., nursing
homes), for which a USPS PO box can be substituted with a street address using lookup
tables and aliases. Also, the postal codes used in Canada serve a somewhat similar purpose
but are instead among the most accurate forms of input data because of the organization of
the Canadian postal system.
In the majority of cases though, it is difficult to determine anything about the level of
accuracy that can be associated with USPS PO box data in terms of how well they represent
the residential location of an individual. As one example, consider the situation in which a
person rents a USPS PO box at a post office near their place of employment because it is
more convenient than receiving mail at their residence. If the person works in a completely
different city than where they live, not even the city attribute of the USPS PO box address
can be assumed to correctly represent the person's residential location (or state for that
matter when, for example, commuters go to Manhattan, NY, from New Jersey or Connecticut).
Similarly, personal mail boxes may be reported and have the same lack of correlation with
residence location. Because USPS PO boxes are so frequently encountered, a substantial body
of research is dedicated to them and to their effect on the geocoding process and the studies
that use them (e.g., Hurley et al. 2003, Shi 2007, and the references within).
5.1.3 Rural Route and Highway Contract Addresses
A Rural Route (RR) or Highway Contract (HC) address identifies a stop on a postal
delivery route. This format is most often found in rural areas and is of the form "RR 16 Box
2," which indicates that mail should be delivered to "Box 2" on the rural delivery route
"Number 16." These delivery locations can be composed of several physical cluster boxes at
a single drop-off point where multiple residents pick up their mail, or they can be single
mailboxes at single residences.
Historically, numerous problems have occurred when applying a geocoding process to
these types of addresses. First and foremost, an RR by definition is a route traveled by the
mail carrier denoting a path, not a single street (similar to a USPS ZIP Code, as will be
discussed later). Until recently, it was therefore impossible to derive a single street name from
a numbered RR portion of an RR address. Without a street name, feature matching to a
reference street dataset is impossible (covered in Section 8.1). Further, the box number
attribute of an RR address did not include any data needed for linear-based feature
interpolation. There was no indication if a box was not standalone, nor did it relate to and/or
inform the relative distance along a reference feature. Thus, it was unquantifiable and
unusable in a feature interpolation algorithm.
Recently, however, these difficulties have begun to be resolved due to the continuing
implementation of the E-911 service across the United States. In rural areas where RR
addresses had historically been the predominant addressing system, any production of the
required E-911 geocodes from address geocoding was impossible (for the reasons just
mentioned). To comply with E-911 regulations, local governments therefore assigned geocodes
to the RR addresses (and their associated phone numbers) based on the existing linear-based
referencing system of street mileposts. This led to the creation and availability of a system of
absolute geocodes for RR addresses.
Also, for these areas where E-911 had been implemented, the USPS has taken the
initiative to create the Locatable Address Conversion System (LACS) database. The primary
role of this database is to enable RR to city-style postal street address conversion (United
States Postal Service 2008c). The availability of this conversion tool enables a direct link
between an RR postal address and the reference datasets capable of interpolation-based
geocoding that require city-style postal addresses. The USPS has mandated that all Coding
Accuracy Support System (CASS) certified software providers must support the LACS
database to remain certified (United States Postal Service 2008b), so RR to city-style address
translation is available now for most areas, but at a cost. Note that USPS CASS-certified
systems are only certified to parse and standardize address data into valid USPS data. This
certification is in no way a reflection of any form of certification of a geocode produced by
the system.
5.1.4 USPS ZIP Codes and U.S. Census Bureau ZCTAs
The problems arising from the differences between the USPS ZIP Codes and the U.S.
Census Bureau's ZCTA are widely discussed in the geocoding literature, and so they will
only be discussed briefly in this document. In general, the common misunderstanding is that
the two refer to the same thing and can be used interchangeably, despite their published
differences and the fact that their negative effects on the geocoding process have been widely
publicized and documented in the geocoding literature (e.g., Krieger et al. 2002b, Hurley et
al. 2003, Grubesic and Matisziw 2006). USPS ZIP Codes represent delivery routes rather
than regions, while a ZCTA represents a geographic area. For an excellent review of USPS
ZIP Code usage in the literature and a discussion of the differences, effects, and the
multitude of ways they can be handled, see Beyer et al. (2008).
Best practices relating to the four types of input postal address data just described are
listed in Best Practices 14.

Best Practices 14 Input data types

Policy Decision: What types of input address data can and should be supported
Best Practice: Any type of address data should be considered valid geocoding input (e.g.,
city-style and rural route postal addresses).

Policy Decision: What is the preferred input data specification
Best Practice: If possible, input data should be formatted as city-style postal addresses.

Policy Decision: Should USPS PO box data be accepted for geocoding
Best Practice: If possible, USPS PO box data should be investigated to obtain more detailed
information and formatted as city-style postal addresses.

Policy Decision: Should RR or HC data be accepted for geocoding
Best Practice: If possible, RR and HC data should be converted into city-style postal
addresses.

Policy Decision: Should USPS ZIP Code and/or ZCTA data be accepted for geocoding
Best Practice: If possible, USPS ZIP Code and/or ZCTA data should be investigated for more
detailed information and formatted as a city-style postal address.
If USPS ZIP Code and/or ZCTA data must be used, special care needs to be taken when using
the resulting geocodes in research; see Beyer et al. (2008) for additional guidance.

Policy Decision: Should an input address ever be abandoned and not used for geocoding
Best Practice: If the potential level of resulting accuracy is too low given the input data
specification and the reference features that can be matched, lower level portions of the
input data should be used (e.g., USPS ZIP Code, city).

5.2 FIRST-ORDER ESTIMATES
The various types of input address data are capable of describing different levels of
information, both in their best and worst cases. First-order estimates of these values one can
expect to achieve in terms of geographic resolution are listed in Table 12.

Table 12 First-order accuracy estimates
Data type | Best Case | Worst Case
Standard postal address | Sub-parcel | State
USPS PO box | USPS ZIP Code centroid | State
Rural route | Sub-parcel | State
U.S. National Grid | 1 m² | 1,000 m²

5.3 POSTAL ADDRESS HIERARCHY
As noted in Section 5.1.1, city-style postal addresses are the most common form
encountered in the geocoding process and are extremely valuable given the hierarchical
structure of the information they contain. This implicit hierarchy often is used as the basis for
multi-resolution geocoding processes that allow varying levels of geographic resolution in the
resulting geocodes based on where a match can be made in the hierarchy. This relates directly
to the ways in which people communicate and understand location, and is chiefly responsible
for enabling the geocoding process to capture this same notion.
The following city-style postal address has all possible attributes filled in (excluding
multiple street type suffixes), and will be used to illustrate this progression through the
scales of geographic resolution as different attribute combinations are employed:

3620 ½ South Vermont Avenue East, Unit 444, Los Angeles, CA, 90089-0255

The possible variations of this address in order of decreasing geographic resolution (with
0 ranked as the highest) are listed in Table 13. Also listed are the best possible and most
probable resolutions that could be achieved, along with the ambiguity introduced at each
resolution. Selected resolutions are also displayed visually in Figure 6. The table and figure
underscore two observations: (1) the elimination of attributes from city-style postal
addresses degrades the best possible accuracy quite rapidly, and (2) different combinations of
attributes will have a significant impact on the geographic resolution or granularity of the
resulting geocode. More discussion on the strengths and weaknesses of arbitrarily ranking
geographic resolutions is presented in Section 15.1.

Table 13 Resolutions, issues, and ranks of different address types
Address | Best Resolution | Probable Resolution | Ambiguity | Rank
3620 North Vermont Avenue, Unit 444, Los Angeles, CA, 90089-0255 | 3D Sub-parcel-level | Sub-parcel-level | none | 0
3620 North Vermont Avenue, Los Angeles, CA, 90089-0255 | Parcel-level | Parcel-level | unit, floor | 1
3620 North Vermont Avenue, Los Angeles, CA, 90089 | Parcel-level | Parcel-level | unit, floor, USPS ZIP Code | 2
3620 Vermont Avenue, Los Angeles, CA, 90089 | Parcel-level | Street-level | unit, floor, street, USPS ZIP Code | 3
Vermont Avenue, Los Angeles, CA, 90089 | Street-level | USPS ZIP Code-level | building, unit, floor, street, USPS ZIP Code | 4
90089 | USPS ZIP Code-level | USPS ZIP Code-level | building, unit, floor, street, city | 5
Vermont Avenue, Los Angeles, CA | City-level | City-level, though small streets may fall entirely into a single USPS ZIP Code | building, unit, floor, street, USPS ZIP Code | 6
Los Angeles, CA | City-level | City-level | building, unit, floor, street, USPS ZIP | 7
Vermont Avenue, CA | State-level | State-level | building, unit, floor, street, USPS ZIP, city | 8
CA | State-level | State-level | building, unit, floor, street, USPS ZIP, city | 8
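
The progression in Table 13 can be approximated in code. The Python below is a rough sketch, assuming a hypothetical dictionary representation of an address and a hypothetical best_possible_resolution function; it reproduces only a simplified version of the "best resolution" column and says nothing about the probable resolution or ambiguity columns.

```python
def best_possible_resolution(address):
    """Map the attributes present in an address to a coarse best-case resolution label."""
    def has(key):
        return bool(address.get(key))

    placed = has("zip") or has("city")  # something that ties the street to a locality
    if has("number") and has("street") and placed:
        return "sub-parcel" if has("unit") else "parcel"
    if has("street") and has("zip"):
        return "street"
    if has("zip"):
        return "USPS ZIP Code"
    if has("city"):
        return "city"
    if has("state"):
        return "state"
    return "unresolved"


full = {"number": "3620", "street": "Vermont Avenue", "unit": "444",
        "city": "Los Angeles", "state": "CA", "zip": "90089-0255"}

# Drop attributes one group at a time and watch the best-case resolution coarsen.
for dropped in ([], ["unit"], ["unit", "number"], ["unit", "number", "zip"],
                ["unit", "number", "zip", "street", "city"]):
    reduced = {key: value for key, value in full.items() if key not in dropped}
    print(sorted(reduced), "->", best_possible_resolution(reduced))
```

A multi-resolution geocoder could use a check of this kind to decide which level of the hierarchy to attempt first and which level to fall back to when a match fails.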



Figure 6 Geographic resolutions of different address components (Google, Inc. 2008b)
















































6. ADDRESS DATA CLEANING PROCESSES
This section presents a detailed examination of the different
types of processes used to clean address data and discusses
specific implementations.
6.1 ADDRESS CLEANLINESS
The "cleanliness" of input data is perhaps the greatest contributing factor to the success
or failure of a geocoding attempt. As Zandbergen concludes, "improved quality control
during the original capture of input data is paramount to improving geocoding match rates"
(2008, pp. 18). Address data are notoriously "dirty" for several reasons, including simple data
entry mistakes and the use of non-standard abbreviations and attribute orderings. The
addresses listed in Table 14 all refer to the same address, but are in completely different
formats, exemplifying why various address-cleaning processes are required. The
address-cleaning processes applied to prepare input address data for processing will be
detailed in the next sections.

Table 14 Example postal addresses in different formats
3620 North Vermont Avenue, Unit 444, Los Angeles, CA, 90089-0255
3620 N Vermont Ave, 444, Los Angeles, CA, 90089-0255
3620 N. VERMONT AVE., UNIT 444, LA, CA
N Vermont 3620, Los Angeles, CA, 90089

6.2 ADDRESS NORMALIZATION
Address normalization is the process of identifying the component parts of an address
such that they may be transformed into a desired format. This first step is critical to the
cleaning process. Without identifying which piece of text corresponds to which address
attribute, it is impossible to subsequently transform them between standard formats or use
them for feature matching. The typical component parts of a city-style postal address are
displayed in Table 15.
Table 15 Common postal address attribute components
Attribute | Example
Number | 3620
Prefix Directional | N
Street Name | Vermont
Suffix Directional |
Street Type | Ave
Unit Type | Unit
Unit Number | 444
Postal Name (Post Office name, USPS default or acceptable name for given USPS ZIP Code) |
USPS ZIP Code | 90089-0255
State | CA
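
The attribute components in Table 15 can be held in a simple record type. The Python sketch below is one hypothetical representation (the PostalAddress class, its field names, and the missing method are assumptions for illustration); it stores each component as optional text so that absent attributes stay visibly empty.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class PostalAddress:
    """Hypothetical container for the Table 15 components, in the order listed there."""
    number: Optional[str] = None
    prefix_directional: Optional[str] = None
    street_name: Optional[str] = None
    suffix_directional: Optional[str] = None
    street_type: Optional[str] = None
    unit_type: Optional[str] = None
    unit_number: Optional[str] = None
    postal_name: Optional[str] = None
    zip_code: Optional[str] = None
    state: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of the components that were not supplied."""
        return [name for name, value in asdict(self).items() if value is None]


example = PostalAddress(number="3620", prefix_directional="N", street_name="Vermont",
                        street_type="Ave", unit_type="Unit", unit_number="444",
                        zip_code="90089-0255", state="CA")
print(example.missing())  # prints ['suffix_directional', 'postal_name']
```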
The normalization algorithm must attempt to identify the most likely address attribute to
associate with each component of the input address. Decades of computer science research
have been invested into this difficult parsing problem. Many techniques can be applied to
this problem, some specifically developed to address it and others that were developed for
other purposes but are nonetheless directly applicable. These approaches range in their level
of sophistication; examples from the simplistic to highly advanced will now be described.
6.2.1 Substitution-Based Normalization
+J@EFDFJFD=Ma@HEA? M=BPH>DfHFD=M makes use o lookup tables or identiying com-
monly encountered terms based on their string alues. 1his is the most popular method be-
cause it is the easiest to implement. 1his simplicity also makes it applicable to the ewest
number o cases ,i.e., only capable o substituting correct abbreiations and eliminating
|some| extraneous data,.
In this method, tokenization converts the string representing the whole address into a
series of separate tokens by processing it left to right, with embedded spaces used to
separate tokens. The original order of input attributes is highly critical because of this
linear sequential processing. A typical system will endeavor to populate an internal
representation of the parts of the street address listed in Table 15, in the order presented.
A set of matching rules defines the valid content each attribute can accept and is used in
conjunction with lookup tables that list synonyms for identifying common attribute values.
As each token is encountered, the system tries to match it to the next empty attribute in
its internal representation, in a sequential order. The lookup tables attempt to identify known
token values from common abbreviations such as directionals (e.g., "n" being equal to
"North," with either being valid). The matching rules limit the types of values that can be
assigned to each attribute. To see how it works, the following address will be processed,
matching it to the order of attributes listed in Table 15:

"3620 Vermont Ave, RM444, Los Angeles, CA 90089"

In the first step, a match is attempted between the first token of the address, "3620," and
the internal attribute in the first index, "number." This token satisfies the matching rule for
this internal attribute (i.e., that the data must be a number), and it is therefore accepted and
assigned to this attribute. Next, a match is attempted between the second word, "Vermont,"
and the address attribute that comprises the second index, the pre-directional. This time, the
match will fail because the matching rule for this attribute is that data must be a valid form
of a directional, and this word is not. The current token "Vermont" then is attempted to be
matched to the next attribute (index 3, street name). The matching rule for this has no
restrictions on content, so the token is assigned. The next token, "Ave," has a match
attempted with the valid attributes at index 4 (the post-directional), which fails. Another match
is attempted with the next address attribute at the next index (5, street type), which is
successful, so it is assigned. The remainder of the tokens subsequently are assigned in a similar
manner.
It is easy to see how this simplistic method can become problematic when keywords
valid for one attribute such as "Circle" and "Drive" are used for others as in "123 Circle Drive
West," with neither in the expected position of a street suffix type. Best practices related to
substitution-based normalization are listed in Best Practices 15.
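
The sequential matching described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation of any particular geocoder: the lookup tables are tiny hypothetical samples, only the first five attributes of Table 15 are modeled, and trailing commas and periods are stripped from tokens for convenience.

```python
# Hypothetical lookup tables (a real system would use, e.g., USPS Publication 28).
DIRECTIONALS = {"N": "N", "NORTH": "N", "S": "S", "SOUTH": "S",
                "E": "E", "EAST": "E", "W": "W", "WEST": "W"}
STREET_TYPES = {"AVE": "AVE", "AVENUE": "AVE", "DR": "DR", "DRIVE": "DR",
                "CIR": "CIR", "CIRCLE": "CIR", "ST": "ST", "STREET": "ST"}

# Matching rules: each returns the normalized value, or None if the token is rejected.
def match_number(token):
    return token if token.isdigit() else None

def match_directional(token):
    return DIRECTIONALS.get(token.upper())

def match_any(token):
    return token  # street names have no content restriction

def match_street_type(token):
    return STREET_TYPES.get(token.upper())

# Attributes in the fixed order in which the system tries to fill them.
ATTRIBUTES = [("number", match_number),
              ("pre_directional", match_directional),
              ("street_name", match_any),
              ("post_directional", match_directional),
              ("street_type", match_street_type)]

def normalize(street_address):
    """Assign each token to the next attribute whose matching rule accepts it."""
    tokens = [t.strip(",.") for t in street_address.split()]
    tokens = [t for t in tokens if t]
    result = {name: None for name, _ in ATTRIBUTES}
    leftover = []
    index = 0  # position of the next empty attribute
    for token in tokens:
        for i in range(index, len(ATTRIBUTES)):
            name, rule = ATTRIBUTES[i]
            value = rule(token)
            if value is not None:
                result[name] = value
                index = i + 1
                break
        else:
            leftover.append(token)  # no remaining attribute accepted the token
    return result, leftover

print(normalize("3620 Vermont Ave"))
print(normalize("123 Circle Drive West"))
```

In the second call, "Circle" is consumed as the street name and "Drive" as the street type, leaving "West" unassigned, which illustrates the weakness of strictly sequential matching noted in the text.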

Best Practices 15 Substitution-based normalization
Policy Decision | Best Practice
When should substitution-based normalization be used? | Substitution-based normalization should be used as a first step in the normalization process, especially if no other more advanced methods are available.
Which matching rules should be used in substitution-based normalization? | Any deterministic set of rules that create reproducible results that are certifiably valid should be considered acceptable.
Which lookup tables (substitution synonyms) should be used in substitution-based normalization? | At a minimum, the USPS Publication 28 synonyms should be supported (United States Postal Service 2008d).
Which separators should be used for tokenization? | At a minimum, whitespace should be used as a token separator.
What level of token matching should be used for determining a match or non-match? | At a minimum, an exact character-level match should be considered a match.
6.2.2 Context-Based Normalization
Context-based normalization makes use of syntactic and lexical analysis to identify the
components of the input address. The main benefit of this less commonly applied method is
its support for reordering input attributes. This also makes it more complicated and harder
to implement. It has steps very similar to those taken by a programming language compiler, a
tool used by programmers to produce an executable file from plain text source code written
in a high-level programming language.
The first step, scrubbing, removes illegal characters and white space from the input
datum. The input string is scanned left to right and all invalid characters are removed or
replaced. Punctuation marks (e.g., periods and commas) are all removed and each run of
white-space characters is collapsed into a single space. All characters then are converted into
a single common case, either upper or lower. The next step, lexical analysis, breaks the scrubbed
string into typed tokens. Tokenization is performed to convert the scrubbed string into a
series of tokens using single spaces as the separator. The order of the tokens remains the
same as the input address. These tokens then are assigned a type based on their character
content such as numeric (e.g., "3620"), alphabetic (e.g., "Vermont"), and alphanumeric (e.g.,
"RM444"). The final step, syntactic analysis, places the tokens into a parse tree based on a
grammar. This parse tree is a data structure representing the decomposition of an input
string into its component parts. The grammar is the organized set of rules that describe the
language, in this case possible valid combinations of tokens that can legitimately make up an
address. These are usually written in Backus-Naur form (BNF), a notation for describing
grammars as combinations of valid components. The following is an example of an
address described in BNF, in which a postal address is composed of two components: (1) the
street-address-part, and (2) the locality-part. The street-address-part is composed of a
house-number, a street-name-part, and an optional suite-number and suite-type, which would be
preceded by a comma if they existed. The remaining components are composed in a similar
fashion:


postal-address ::= street-address-part locality-part
street-address-part ::= house-number street-name-part {"," suite-number suite-type}
street-name-part ::= {pre-directional} street-name street-type {post-directional}
locality-part ::= town-name "," state-code USPS-ZIP-Code {"-" ZIP-extension}
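
The first two steps, scrubbing and lexical analysis, can be sketched as follows. This is a minimal illustration that assumes ASCII input, upper case as the common case, and that slashes and hyphens are kept only when they sit between two alphanumeric characters; the function names are hypothetical, and the syntactic analysis step (building a parse tree from the grammar) is not shown.

```python
import re

def scrub(address):
    """Remove invalid characters, collapse whitespace, and fold to upper case."""
    # Drop everything except letters, digits, whitespace, "/" and "-".
    kept = re.sub(r"[^A-Za-z0-9\s/-]", "", address)
    # Drop any "/" or "-" that is not between two alphanumeric characters.
    kept = re.sub(r"(?<![A-Za-z0-9])[/-]|[/-](?![A-Za-z0-9])", "", kept)
    # Collapse each run of whitespace into a single space and unify the case.
    return " ".join(kept.split()).upper()

def token_type(token):
    """Type a token by its character content only, not by address attribute."""
    if token.isdigit():
        return "numeric"
    if token.isalpha():
        return "alphabetic"
    if token.isalnum():
        return "alphanumeric"
    return "other"

def lex(scrubbed):
    """Split a scrubbed string into (token, type) pairs, preserving order."""
    return [(token, token_type(token)) for token in scrubbed.split()]

scrubbed = scrub("3620 Vermont Ave, RM444, Los Angeles, CA 90089")
print(scrubbed)
print(lex(scrubbed))
```

The tokens produced here are typed only by character content; mapping them to address attributes is the harder problem discussed next.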

The difficult part of context-based normalization is that the tokens described thus far
have only been typed to the level of the characters they contain, not to the domain of
address attributes (e.g., street name, post-directional). This level of domain-specific token
typing can be achieved using lookup tables for common substitutions that map tokens to
address components based on both character types and values. It is possible that a single token
can be mapped to more than one address attribute. Thus, these tokens can be rearranged
and placed in multiple orders that all satisfy the grammar. Therefore, constraints must be
imposed on them to limit the erroneous assignments. Possible options include using an
iterative method to enforce the original order of the tokens as a first try, then relaxing the
constraint by allowing only tokens of specific types to be moved in a specific manner, etc. Also,
the suppression of certain keywords can be employed such that their importance or
relevance is minimized.
This represents the difficult part of performing context-based normalization: writing
these relaxation rules properly, in the correct order. One must walk a fine line and carefully
consider what should be done to which components of the address and in what order;
otherwise the tokens in the input address might be moved from their original position and
seemingly produce "valid" addresses that misrepresent the true address. Best practices related to
context-based normalization are listed in Best Practices 16.
6.2.3 Probability-Based Normalization
Probability-based normalization makes use of statistical methods to identify the
components of the input address. It derives mainly from the field of machine learning, a
subfield of computer science dealing with algorithms that induce knowledge from data. In
particular, it is an example of record linkage, the task of finding features in two or more
datasets that essentially refer to the same feature. These methods excel at handling the difficult
cases, those that require combinations of substitutions, reordering, and removal of
extraneous data. Being so powerful, they typically are very difficult to implement, and usually are
seen only in research scenarios.
These algorithms essentially treat the input address as unstructured text that needs to be
semantically annotated with the appropriate attributes from the target domain (i.e., address
attributes). The key to this approach is the development of an optimal reference set, which
is the set of candidate features that may possibly match an input feature. This term should
not be confused with reference datasets containing the reference features, even though
the reference set will most likely be built from them. The reference set defines the search
space of possible matches that a feature-matching algorithm processes to determine an
appropriate match. In most cases, the complexity of performing this search (i.e., processing
time) grows linearly with the size of the reference set. In the worst case, the search space can
be composed of the entire reference dataset, resulting in non-optimal searching. The
intelligent use of blocking schemes, or strategies designed to narrow the set of candidate values
(O'Reagan and Saalfeld 1987, Jaro 1989), can limit the size of the search space.
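
As a rough illustration of a blocking scheme, the sketch below groups reference features by a single blocking key so that only features sharing that key with the input address are considered as candidates. The choice of the ZIP Code as the key, the field names, and the sample records are all assumptions made for this example.

```python
from collections import defaultdict

def build_blocks(reference_features, key="zip"):
    """Group reference features by the value of the blocking key."""
    blocks = defaultdict(list)
    for feature in reference_features:
        blocks[feature[key]].append(feature)
    return blocks

def candidate_set(blocks, input_address, key="zip"):
    """Return only the reference features in the same block as the input."""
    return blocks.get(input_address.get(key), [])

# Hypothetical reference dataset and input address.
reference_dataset = [
    {"street": "VERMONT", "zip": "90089"},
    {"street": "VERMONT", "zip": "90007"},
    {"street": "FIGUEROA", "zip": "90089"},
]
blocks = build_blocks(reference_dataset)
candidates = candidate_set(blocks, {"street": "VERMONT", "zip": "90089"})
print(len(candidates), "of", len(reference_dataset), "features remain to be scored")
```

A key that is too strict can exclude the true match (for example, when the input ZIP Code is wrong), so the key is a trade-off between search cost and recall.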


Best Practices 16 Context-based normalization
Policy Decision | Best Practice
When should context-based normalization be used? | If the correct software can be acquired or developed, context-based normalization should be used.
Which characters should be considered valid and exempt from scrubbing? | All alpha-numeric characters should be considered valid. Forward slashes, dashes, and hyphens should be considered valid when they are between other valid characters (e.g., 1/2 or 123-B).
What action should be taken with scrubbed characters? | Non-valid (scrubbed) characters should be removed and not replaced with any character.
Which grammars should be used to define the components of a valid address? | Any grammar based on existing addressing standards can be used (e.g., OASIS xAL Standard [Organization for the Advancement of Structured Information Standards 2008] or the proposed URISA/FGDC address standard [United States Federal Geographic Data Committee 2008b]). The grammar chosen should be representative of the address data types the geocoding process is likely to see.
What level of token matching should be used for determining a match or non-match? | Only exact case-insensitive character-level matching should be considered a match.
How far from their original position should tokens within branches of a parse tree be allowed to move? | Tokens should be allowed to move no more than two positions from their original location.

After creating a reference set, matches and non-matches between input address elements
and their normalized attribute counterparts can be determined. The input elements are
scored against the reference set individually as well as collectively using several measures.
These scores are combined into vectors and their likelihood as matches or non-matches is
determined using such tools as support vector machines (SVMs), which have been trained
on a representative dataset. For complete details of a practical example using this method,
see Michelson and Knoblock (2005). Best practices related to probability-based
normalization are listed in Best Practices 17.
Best Practices 17 Probability-based normalization
Policy Decision | Best Practice
When should probability-based normalization be used? | If the output certainty of the resulting geocodes meets an acceptable threshold, probability-based normalization should be considered a valid option. Experiments should be run to determine what an appropriate threshold should be for a particular registry. These experiments should contrast the probability of getting a false positive versus the repercussions such an outcome will cause.
What level of composite score should be considered a valid match? | This will depend on the confidence that is required by the consumers of the geocoded data. At a minimum, a composite score of 95% or above should be considered a valid match.
6.3 ADDRESS STANDARDIZATION
More than one address standard may be required or in use at a registry for other
purposes during or outside of the geocoding process. Therefore, after attribute identification and
normalization, transformation between common address standards may be required. The
difficult portion of this process is writing the mapping functions, the algorithms that
translate between a normalized form and a target output standard. These functions
transform attributes into the desired formats by applying such tasks as abbreviation substitution,
reduction, or expansion, and attribute reordering, merging, or splitting. These
transformations are encoded within the mapping functions for each attribute in the normalized form.
Mapping functions must be defined a priori for each of the potential standards that the
geocoder may have to translate an input address into, and there are commonly many. To
better understand this, consider that during feature matching, the input address must be in the
same standard as that used for the reference dataset before a match can be attempted.
Therefore, the address standard used by every reference dataset in a geocoder must be
supported (i.e., a mapping function is required for each). With the mapping functions defined a
priori, the standardization process can simply execute the appropriate transformation on the
normalized input address and a properly standardized address ready for the reference data
source will be produced.
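
One way to picture mapping functions is as a table holding one function per attribute for each supported target standard. The sketch below is illustrative only: the two target standards ("abbreviated" and "expanded"), the lookup tables, and the attribute names are hypothetical and do not correspond to any published standard.

```python
# Hypothetical expansion tables for a target standard that spells terms out.
EXPAND_DIRECTIONAL = {"N": "NORTH", "S": "SOUTH", "E": "EAST", "W": "WEST"}
EXPAND_STREET_TYPE = {"AVE": "AVENUE", "ST": "STREET", "DR": "DRIVE"}

def identity(value):
    return value

# One mapping function per attribute, defined in advance for each target standard.
MAPPING_FUNCTIONS = {
    "abbreviated": {
        "number": identity,
        "pre_directional": identity,
        "street_name": identity,
        "street_type": identity,
    },
    "expanded": {
        "number": identity,
        "pre_directional": lambda v: EXPAND_DIRECTIONAL.get(v, v),
        "street_name": identity,
        "street_type": lambda v: EXPAND_STREET_TYPE.get(v, v),
    },
}

# Output order of attributes; a real standard could also reorder, merge, or split.
ATTRIBUTE_ORDER = ["number", "pre_directional", "street_name", "street_type"]

def standardize(normalized, standard):
    """Apply the target standard's mapping function to each populated attribute."""
    functions = MAPPING_FUNCTIONS[standard]
    parts = []
    for attribute in ATTRIBUTE_ORDER:
        value = normalized.get(attribute)
        if value:
            parts.append(functions[attribute](value))
    return " ".join(parts)

normalized = {"number": "3620", "pre_directional": "N",
              "street_name": "VERMONT", "street_type": "AVE"}
print(standardize(normalized, "abbreviated"))
print(standardize(normalized, "expanded"))
```

Supporting a new reference dataset then amounts to adding one more entry to the table of mapping functions, leaving the normalization step unchanged.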
In addition to these technical requirements for address standard support, registries must
select an address standard for their staff to report and in which to record the data. Several
existing and proposed address standards were listed previously in Table 6. NAACCR
recommends that when choosing an address standard, registries abide by the data standards in
Standards for Cancer Registries: Data Standards and Data Dictionary (Hofferkamp and Havener
2008), which reference United States Postal Service Publication 28 - Postal Addressing Standards
(United States Postal Service 2008d) and the Canadian equivalent, Postal Standards: Lettermail
and Incentive Lettermail (Canada Post Corporation 2008). Best practices related to address
standardization are listed in Best Practices 18.
Best Practices 18 Address standardization
Policy Decision | Best Practice
Which input data standards should be supported for standardization? | At a minimum, all postal address standards for all countries for which geocoding is to be performed should be supported.
Which address standard should be used for record representation? | Only a single address standard should be used for recording standardized addresses along with a patient/tumor record. This should be the standard defined in the NAACCR publication Data Standards for Cancer Registries, Volume II. All input data should be standardized according to these guidelines.
Which mapping functions should be used? | Mapping functions for all supported address standards should be created or obtained.
6.4 ADDRESS VALIDATION
Address validation is another important component of address cleaning that
determines whether an input address represents a location that actually exists. This should always
be attempted because it has a direct effect on the accuracy of the geocode produced for the
input data in question, as well as other addresses that may be related to it (e.g., when
performing linear-interpolation as discussed in Section 9.2). Performing address validation as
close to the data entry as possible is the surest way to improve all aspects of the quality of
the resulting geocode. Note that even though some addresses may validate, they still may not
be geocodable due to problems or shortcomings with the reference dataset (note that the
reverse also is true), which will be covered in more detail in Section 13.
In the ideal case, this validation will take place not at the central registry, but at the
hospital. This practice currently is being implemented in several states (e.g., Kentucky, North
Carolina, and Wisconsin), and is beginning to look like a feasible option, although
regulations in some areas may prohibit it. The most commonly used source is the USPS ZIP+4
database (United States Postal Service 2008a), but others may be available for different areas
and may provide additional help.
The simplest way to attempt address validation is to perform feature matching using a
reference dataset containing discrete features. Discrete features are those in which a single
feature represents only a single, real-world entity (e.g., a point feature) as opposed to a
feature that represents a range or series of real-world entities (e.g., a line feature), as described in
Section 7.2.3. A simple approach would be to use a USPS CASS-certified product to validate
each of the addresses, but because of bulk mailers CASS systems are prohibited from
validating segment-like reference data, and parcel or address points reference data must be used.
In contrast, continuous features can correspond to multiple real-world objects, such as a
street segment, which has an address range that can correspond to several addresses. An
example of this can be seen in the address validation application shown in Figure 7, which can
be found on the USC GIS Research Laboratory Web site (https://webgis.usc.edu). This
image shows the USC Static Address Validator (Goldberg 2008b), a Web-based address
validation tool that uses the USPS ZIP+4 database to search for all valid addresses that match the
address entered by the user. Once the user clicks search, either zero, one, or more than one
potential address will be returned to indicate to the user that the information they entered
did not match any addresses, matched an exact address, or matched multiple addresses. This
information will allow the user to validate the address in question by determining and
correcting any attributes that are wrong or incomplete that could potentially lead to geocoding
errors had the non-validated address been used directly.


Figure 7 Example address validation interface (https://webgis.usc.edu)

If feature matching applied to a reference dataset of discrete features succeeds, the
matched feature returned will be in one of two categories: a true or false positive. A true
positive is the case when an input address is returned as being true, and is in fact true (e.g., it
actually exists in the real world). A false positive is the case when an input address is
returned as being true, and is in fact false (e.g., it does not actually exist in the real world). If
feature matching fails (even after attempting attribute relaxation as described in Section
8.3.1), the input address will fall again into one of two categories, a true or false negative. A
true negative is the case when an input address is returned as being false, and is in fact false.
A false negative is the case when an input address is returned as being false, and is in fact
true (e.g., it does actually exist in the real world).
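
The four outcomes defined above can be restated compactly in code. The function below is a minimal sketch with hypothetical names; in practice, whether an address actually exists is known only for a ground-truthed sample, which is what makes these rates difficult to measure.

```python
def classify_validation(returned_as_valid, actually_exists):
    """Name the outcome of validating one address against a reference dataset."""
    if returned_as_valid:
        return "true positive" if actually_exists else "false positive"
    return "false negative" if actually_exists else "true negative"

# Hypothetical ground-truthed sample: (validator result, real-world existence).
sample = [(True, True), (True, False), (False, True), (False, False), (True, True)]
counts = {}
for returned, exists in sample:
    outcome = classify_validation(returned, exists)
    counts[outcome] = counts.get(outcome, 0) + 1
print(counts)
```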
Both false positives and negatives also can occur due to temporal inaccuracy of reference
datasets. False positives occur when the input address is actually invalid but appears in the
reference dataset (e.g., it has not yet been removed). False negatives occur when the input
address exists, but is not present in the reference dataset (e.g., it has not yet been added). To
address these concerns, the level of confidence for the temporal accuracy of a reference
dataset needs to be determined and utilized. To assess this level of confidence, a registry will
need to consider the frequency of reference dataset update as well as address lifecycle
management in the region and characteristics of the region (e.g., how old is the reference set,
how often is it updated, and how frequently do addresses change in the region). More details
on the roots of temporal accuracy in reference datasets are described in Section 13.3.
Common reference data sources that can be used for address verification are listed in
Table 16. Although parcel data have proven very useful for address data, it should be noted
that in most counties, assessors are under no mandate to include the situs address of a
parcel (the actual physical address associated with the parcel) in their databases. In these cases,
the mailing address of the owner may be all that is available, but may or may not be the
actual address of the actual parcel. As such, E-911 address points may be an alternative and
better option for performing address validation. Best practices related to address validation
are listed in Best Practices 19. Recent work by Zandbergen (2008) provides further
discussion on the effect that discrete (address point- or parcel-based) versus continuous (address
range-based street segments) reference datasets have on achievable match rates.

Table 16 Common address verification data sources
USPS ZIP+4 (United States Postal Service 2008a)
U.S. Census Bureau Census Tracts
County or municipal assessor parcels

There appears to be general consensus among researchers and registries that improving
address data quality at the point of collection should be a task that is investigated, with its
eventual implementation into existing data entry systems a priority. It is as-of-yet unclear
how utilizing address validation tools like the USC Web-based address validator shown in
this section may or may not slow down the data entry process because there have been no
published reports detailing registry and/or staff experiences, time estimates, or overall cost
increases. However, preliminary results presented at the 2008 NAACCR Annual Meeting
(Durbin et al. 2008) on the experience of incorporating a similar system into a data entry
system used by the State of Kentucky seem to indicate that the time increases are manageable,
with proper user interface design having a large impact. More research is needed on this
issue to determine the cost and benefits that can be obtained using these types of systems and
the overall impact that they will have on resulting geocode quality.
Best Practices 19 Address validation
Policy Decision | Best Practice
When should address validation be used? | If a trusted, complete address dataset is available, it should be used for address validation during both address standardization and feature matching and interpolation.
Which data sources should be used for address validation? | The temporal footprint of the address validation source should cover the period for which the address in question was supposed to have existed in the dataset. If an assessor parcel database is available, this should be used as an address validation reference dataset.
What should be done with invalid addresses? | If an address is found to be invalid during address standardization, it should be corrected. If an invalid address is not correctable, it should be associated with the closest valid address.
What metadata should be maintained? | If an address is corrected or assigned to the closest valid address, the action taken should be recorded in the metadata, and the original address should be kept as well.


7. REFERENCE DATASETS
This section identifies and describes the different types of
reference datasets and the relationships between them.
7.1 REFERENCE DATASET TYPES
Vector-based data, such as the U.S. Census Bureau's TIGER/Line files (United States
Census Bureau 2008c), are the most frequently encountered reference datasets because their
per-feature representations allow for easy feature-by-feature manipulation. The pixel-based
format of raster-based data, such as digital orthophotos, can be harder to work with and
generally makes them less applicable to geocoding. However, some emerging geocoding
processes do employ raster-based data for several specific tasks including feature extraction
and correction (as discussed later).
7.2 TYPES OF REFERENCE DATASETS
The following sections offer more detail on the three types of vector-based reference
datasets (linear-, areal unit-, and point-based) frequently used in geocoding processes,
organized by their degree of common usage in the geocoding process. The descriptions of each
will, for the most part, be generalizations applicable to the whole class of reference data.
Also, it should be noted that the true accuracy of a data source can only be determined with
the use of a GPS device, or in some cases imagery, and these discussions again are
generalizations about classes of data sources. An excellent discussion of the benefits and drawbacks
of geocoding algorithms based on each type of reference dataset is available in Zandbergen
(2008).
7.2.1 Linear-Based Reference Datasets
A linear-based (line-based) reference dataset is composed of linear-based data,
which can either be simple-line or polyline vectors. The type of line vector contained
typically can be used as a first-order estimate of the descriptive quality of the reference data source.
Reference datasets containing only simple straight-line vectors usually will be less accurate
than reference datasets containing polyline vectors for the same area (e.g., when considering
the shortest possible distance between two endpoints). Curves typically are represented in
these datasets by breaking single straight-line vectors into multiple segments (i.e., polylines).
This scenario is depicted in Figure 8, which shows a polyline more accurately describing the
shape of a street segment than a straight line.


Figure 8 Vector reference data of different resolutions (Google, Inc. 2008b)
Line-based datasets underpin typical conceptions of the geocoding process and are by far
the most cited in the geocoding literature. Most are usually representations of street
networks (graphs), which are an example of a topologically connected set of nodes and edges.
The nodes (vertices) are the endpoints of the line segments in the graph and the edges
(arcs) are lines connecting the endpoints. The term "network" refers to the topological
connectivity resulting from reference features sharing common endpoints, such that it is
possible to traverse through the network from feature to feature. Most literature commonly
defines a graph as G = (V, E), indicating that the graph G is composed of the set of vertices
V and the set of edges E. The inherent topological connectedness of these graphs enables
searching. Dijkstra's (1959) shortest path algorithm is frequently used for route planning, and
several well known examples of street networks are provided in Table 17. Further details of
street networks, alternative distance estimations, and their application to accessibility within
the realm of cancer prevention and control can be found in Armstrong et al. (2008) and the
references within.
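
To make the graph representation concrete, the sketch below stores a small street network as an adjacency mapping from each node (intersection) to its neighboring nodes with edge lengths, and runs a standard priority-queue form of Dijkstra's algorithm over it. The node names and segment lengths are hypothetical.

```python
import heapq

def shortest_path_lengths(graph, source):
    """Return the shortest network distance from source to every reachable node."""
    distances = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        distance, node = heapq.heappop(heap)
        if distance > distances.get(node, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for neighbor, length in graph.get(node, {}).items():
            candidate = distance + length
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return distances

# Hypothetical network: nodes are intersections, edge weights are lengths in meters.
street_network = {
    "A": {"B": 100.0, "C": 250.0},
    "B": {"A": 100.0, "C": 100.0, "D": 300.0},
    "C": {"A": 250.0, "B": 100.0, "D": 120.0},
    "D": {"B": 300.0, "C": 120.0},
}
print(shortest_path_lengths(street_network, "A"))
```

Here the route from A to D through B and C (320 meters) is shorter than the more direct route through B alone (400 meters), which is the kind of result that straight-line distance estimates cannot capture.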

Table 17 Common linear-based reference datasets
Name | Description | Coverage
U.S. Census Bureau's TIGER/Line files (United States Census Bureau 2008c) | Street centerlines | U.S.
NAVTEQ Streets (NAVTEQ 2008) | Street centerlines | Worldwide
Tele Atlas Dynamap, MultiNet (Tele Atlas 2008a, c) | Street centerlines | Worldwide

The first dataset listed, the U.S. Census Bureau's TIGER/Line files, is the most
commonly used reference dataset in geocoding. The next two are competing products that are
commercial derivatives of the TIGER/Line files for the United States. All three products
essentially provide the same type of data, with the commercial versions containing
improvements over the TIGER/Line files in terms of reference feature spatial accuracy and the
inclusion of more aspatial attributes. The accuracy differences between products can be
stunning, as can the differences in their cost.
Commercial companies employ individuals to drive GPS-enabled trucks to obtain
GPS-level accuracy for their polyline street vector representations. They also often include areal
unit-based geographic features (polygons) (e.g., hospitals, parks, water bodies), along with
data that they have purchased or collected themselves. These data collection tasks are not
inexpensive, and these data therefore are usually very expensive, typically costing on the
order of tens of thousands of dollars. However, part of the purchase price usually includes
yearly or quarterly updates to the entire reference dataset, resulting in very temporally
accurate reference data.
In contrast, new releases of the TIGER/Line files have historically corresponded to the
decennial Census, resulting in temporal accuracy far behind their commercial counterparts.
Also, even though support for polyline representations is built into the TIGER/Line file
data format, most features contained are in fact simple-lines, with very few areal unit-based
features included. However, while the commercial versions are very expensive, TIGER/Line
files are free (or, to avoid time-consuming downloading, are available for reasonable fees on
DVD), making them an attractive option. Also, beginning in 2002, updates to the
TIGER/Line files have been released once and now twice per year, resulting in continually
improving spatial and temporal accuracy. In some areas, states and municipalities have created
much higher quality line files; these eventually will be or already have been incorporated into
the TIGER/Line files. Beginning in 2007, the U.S. Census Bureau has released
MAF-TIGER files to replace annual TIGER/Line files; these merge the U.S. Census
Bureau's Master Address File (MAF) with TIGER databases to create a relational database
management system (RDBMS) (United States Census Bureau 2008b).
Recent studies have begun to show that in some areas, the TIGER/Line files are
essentially as accurate as commercial files (Ward et al. 2005), and are becoming more so over time.
Some of this change is due to the U.S. Census Bureau's MAF-TIGER/Line file integration
and adoption of the new American Community Survey (ACS) system (United States Census
Bureau 2008a), which itself includes a large effort focused on improving the TIGER/Line
files; others are due to pressure from the FGDC. These improvements are enabling greater
public participation and allowing local-scale knowledge with higher accuracy of street
features and associated attributes (e.g., address ranges), to inform and improve the
national-scale products.
All of the products listed in Table 17 share the attributes listed in Table 18. These
represent the attributes typically required for feature matching using linear-based reference
datasets. Note that most of these attributes correspond directly to the components of
city-style postal address-based input data.

Table 18 Common postal address linear-based reference dataset attributes
Attribute | Description
Left side street start address number | Beginning of the address range for left side of the street segment
Right side street start address number | Beginning of the address range for right side of the street segment
Left side street end address number | End of the address range for left side of the street segment
Right side street end address number | End of the address range for right side of the street segment
Street prefix directional | Street directional indicator
Street suffix directional | Street directional indicator
Street name | Name of street
Street type | Type of street
Right side ZCTA | ZCTA for addresses on right side of street
Left side ZCTA | ZCTA for addresses on left side of street
Right side municipality code | A code representing the municipality for the right side
Left side municipality code | A code representing the municipality for the left side
Right side county code | A code representing the county for the right side
Left side county code | A code representing the county for the left side
Feature class code | A code representing the class of the feature

In street networks, it is common for each side of the reference feature to be treated
separately. Each can be associated with different address ranges and ZCTAs, meaning that one
side of the street can be in one ZCTA while the other is in another ZCTA (i.e., the street
forms the boundary between two ZCTAs). The address ranges on each side do not necessarily
need to be related, although they most commonly are. Attributes of lower geographic
resolutions than the ZCTA (city name, etc.) usually are represented in the form of a code
(e.g., a Federal Information Processing Standard [FIPS] code [National Institute of Standards
and Technology 2008]), and also are applied to each side of the street independently.
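
The per-side treatment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the StreetSegment record, its field names, the placeholder ZCTA values, and the odd/even parity check are hypothetical and are not taken from any particular reference dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreetSegment:
    """Hypothetical record holding a subset of the Table 18 attributes."""
    name: str
    left_from: int
    left_to: int
    right_from: int
    right_to: int
    left_zcta: str
    right_zcta: str


def _in_range(number: int, start: int, end: int) -> bool:
    # Assumption: a side's range holds only numbers with the parity of its start.
    low, high = min(start, end), max(start, end)
    return low <= number <= high and number % 2 == start % 2


def side_of_street(segment: StreetSegment, number: int) -> Optional[str]:
    """Return 'left', 'right', or None when neither range contains the number."""
    if _in_range(number, segment.left_from, segment.left_to):
        return "left"
    if _in_range(number, segment.right_from, segment.right_to):
        return "right"
    return None


def zcta_for(segment: StreetSegment, number: int) -> Optional[str]:
    """Pick the ZCTA that belongs to the side on which the number falls."""
    side = side_of_street(segment, number)
    if side is None:
        return None
    return segment.left_zcta if side == "left" else segment.right_zcta


# Made-up segment whose two sides fall in different (placeholder) ZCTAs.
segment = StreetSegment("Vermont", 3601, 3699, 3600, 3698, "00001", "00002")
print(side_of_street(segment, 3620), zcta_for(segment, 3620))
```

Because each side carries its own range and ZCTA, the same street name can resolve to different area codes depending only on the house number.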
All features typically include an attribute identifying the class of feature it is, e.g., a
major highway without a separator, major highway with a separator, minor road, tunnel, freeway
onramp. These classifications serve many functions including allowing for different classes
of roads to be included or excluded during the feature-matching process, and enabling
first-order estimates of road widths to be assumed based on the class of road, typical number
of lanes in that class, and typical lane width. In the TIGER/Line files released before March
2008, these are represented by a Feature Classification Code (FCC), which has subsequently
been changed to the MAF/TIGER Feature Class Code (MTFCC) in the upgrade to
MAF-TIGER/Line files (United States Census Bureau 2008c). In the more advanced commercial
versions, additional information such as one-way roads, toll roads, etc., are indicated by
binary true/false values for each possible attribute.
7.2.2 Polygon-Based Reference Datasets
A polygon-based reference dataset is composed of polygon-based data. These datasets are
interesting because they can represent both the most accurate and the least accurate forms
of reference data. When a dataset represents true building footprints, it can be the most
accurate data source one could hope for if it is based on surveys; it has lower or unknown
accuracy if derived from photographs. Likewise, when the polygons represent cities or
counties, the dataset quickly becomes less appealing. Most polygon-based datasets only
contain single-polygon representations, although some include polygons with multiple
rings. Three-dimensional reference datasets such as building models are founded on these
multi-polygon representations.
Polygon reference features often are difficult and expensive to create initially. But when
available, they typically are on the higher side of the accuracy spectrum. Table 19 lists some
examples of polygon-based vector reference datasets, along with estimates of their coverages
and costs.

Table 19 Common polygon-based reference datasets
Source | Description | Coverage | Cost
Tele Atlas (2008c), NAVTEQ (2008) | Building footprints, parcel footprints | Worldwide, but sparse | Expensive
County or municipal Assessors | Building footprints, parcel footprints | U.S., but sparse | Relatively inexpensive but varies
U.S. Census Bureau | Census Block Groups, Census Tracts, ZCTA, MCD, MSA, Counties, States | U.S. | Free

The highest quality datasets one can usually expect to encounter are building footprints.
These data typically enable the geocoding process to return a result with an extremely high
degree of accuracy, with automated geocoding results of higher quality generally only
obtainable through the use of 3-D models such as that shown in Figure 9. Three-dimensional
models also are built from polygon representations but are even less commonly encountered.

Figure 9 Example 3D building models (Google, Inc. 2008a)

Although both building footprints and 3-D polygons are becoming more commonplace
in commercial mapping applications (e.g., Microsoft Virtual Earth and Google Maps having
both for portions of hundreds of cities worldwide), these datasets often are difficult or costly
to obtain, typically requiring a substantial monetary investment. They are most often found
for famous or public buildings in larger cities or for buildings on campuses where the
owning organization has commissioned their creation. It is quite rare that building footprints
will be available for every building in an entire city, especially for residential structures,
but more and more are becoming available all the time.
A person attempting to gather reference data can become frustrated because although
maps depicting building footprints are widely available, digital copies of the underlying
datasets can be difficult, if not impossible, to obtain. This happens frequently with paper maps
created for insurance purposes (e.g., Sanborn Maps), and static digital images such as the USC
Campus Map (University of Southern California 2008) shown in Figure 10. In many cases, it
is obvious that digital geographic polygon data serve as the basis for online interactive
mapping applications as in the UCLA Campus Map shown in Figure 11 (University of California,
Los Angeles 2008), but often these data are not made available to the general public for use
as a reference dataset within a geocoding process.


Figure 10 Example building footprints in raster format (University of Southern California 2008)

In contrast to building footprints, parcel boundaries are available far more frequently.
These are descriptions of property boundaries, usually produced by local governments for
taxation purposes. In most cases they are legally binding and therefore often are created with
survey-quality accuracy, as shown in Figure 12. However, it should be noted that only a
percentage of the total actually are produced from surveying, with others being either derived
from imagery or legacy data. Therefore, "legally binding" does not equate to "highly
accurate" in every instance.
These data are quickly becoming available for most regions of the United States, with
some states even mandating their creation and dissemination to the general public at low
cost (e.g., California [Lockyer 2005]). Also, the U.S. FGDC has an initiative underway to
create a national parcel file for the entire country within a few years (Stage and von Meyer
2005). As an example of their ubiquitous existence, the online site Zillow (Zillow.com 2008)
appears to have obtained parcel data for most of the urban areas in the United States.
The cost to obtain parcels usually is set by the locality and can vary dramatically from
free (e.g., Sonoma County, CA [County of Sonoma 2008]) to very expensive (e.g., $125,000
for the Grand Rapids, MI Metropolitan Area [Grand Valley Metropolitan Council 2008]).
Also, because they are created for tax purposes, land and buildings that are not subject to
local taxation (e.g., public housing, state-owned residential buildings, or residences on
military bases or college campuses) may be omitted. The attributes that these parcel-based
reference datasets have in common are listed in Table 20.


Figure 11 Example building footprints in digital format (University of California,
Los Angeles 2008)



Figure 12 Example parcel boundaries with centroids

Table 20 Common polygon-based reference dataset attributes
Attribute | Description
Name | The name of feature used for search
Polygon Coordinates | Set of polylines in some coordinate system
Index code/identifier | Code to identify the polygon within the reference data system

Similar to point-based reference features, parcel-based reference features are discrete
(i.e., they typically describe a single real-world geographic feature). Thus, a feature-matching
algorithm usually will either find an exact match or none at all. Unlike point features, these
parcel-based features are complex geographic types, so spatial operations can be performed
on them to create new data such as a centroid (i.e., interpolation). Also, the address
associated with a parcel may be the mailing address of the owner, not the situs address, or
address associated with the physical location of the parcel. The benefits and drawbacks of
various centroid calculations are detailed in Section 9.3.
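
As one illustration of deriving a point from a polygon feature, the sketch below computes the area-weighted centroid of a single ring with the standard shoelace formula. It assumes planar coordinates and a simple (non-self-intersecting) ring; the function name and sample parcel are hypothetical, and this is only one of several possible centroid calculations.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_centroid(ring: List[Point]) -> Point:
    """Area-weighted centroid of a simple polygon ring (shoelace formula)."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon: zero area")
    # The signed area is twice_area / 2, so 6 * area equals 3 * twice_area.
    return cx / (3 * twice_area), cy / (3 * twice_area)


# Made-up rectangular parcel, 4 units wide and 2 units deep.
parcel = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
print(polygon_centroid(parcel))
```

For concave parcels the area-weighted centroid can fall outside the polygon, which is one reason the choice of centroid method matters.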
Again, similar to point-based reference datasets, lower-resolution versions of polygon-based
reference datasets are readily obtainable. For example, in addition to their centroids,
the U.S. Census Bureau also freely offers polygon representations of MCDs, counties, and
states. The low resolution of these polygon features may prohibit their direct use as spatial
output, but they do have valuable uses. In particular, they are extremely valuable as the
spatial boundaries of spatial queries when a feature-matching algorithm is looking for a
line-based reference feature within another reference dataset. They can serve to limit (clip)
the spatial domain that must be searched, thus speeding up the result, and should align well
with U.S. Census Bureau Census Tract (CT), Census Block Group (CBG), etc. files from the
same release.
7.2.3 Point-Based Reference Datasets
A point-based reference dataset is composed of point-based data. These are the least
commonly encountered, partly because of their limited usability and partly because of the
wide ranges in cost and accuracy. The usability of geographic points (in terms of
interpolation potential) is almost non-existent because a point represents the lowest level of
geographic complexity. They contain no attributes that can be used for the interpolation of
other objects, in contrast to datasets composed of more complex objects (e.g., lines) that do
have attributes suitable for deriving new geographic objects (e.g., deriving a point from a
line using the length attribute). Their usability is further reduced because most are composed
of discrete features; however, they are sometimes used in research studies.
Although this discreteness is beneficial for improving the precision of the geocoder (i.e., it
will only return values for input addresses that actually exist), it will lower the match rate
achieved (more details on match rate metrics are described in Section 14.2). This phenomenon
is in contrast to linear-based reference datasets that can handle values within ranges for a
feature to be matched. This scenario produces the exact opposite effect of the point-based
reference set: the match rate rises, but precision falls. See Zandbergen (2008) for a detailed
analysis of this phenomenon.
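
The trade-off described above can be made concrete with a small sketch. The reference data here are entirely made up (two known address points and one street segment with a 3600 to 3700 range on an abstract coordinate line), and the function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]

# Hypothetical point-based reference set: only addresses known to exist.
POINT_REFERENCE: Dict[int, Point] = {3610: (0.1, 0.0), 3620: (0.2, 0.0)}

# Hypothetical linear-based reference feature covering a whole address range.
SEGMENT = {"from": 3600, "to": 3700, "start": (0.0, 0.0), "end": (1.0, 0.0)}


def match_point(number: int) -> Optional[Point]:
    """Exact lookup: succeeds only for addresses present in the point set."""
    return POINT_REFERENCE.get(number)


def match_linear(number: int) -> Optional[Point]:
    """Range lookup: any number inside the range is placed by interpolation."""
    low, high = SEGMENT["from"], SEGMENT["to"]
    if not low <= number <= high:
        return None
    fraction = (number - low) / (high - low)
    (x0, y0), (x1, y1) = SEGMENT["start"], SEGMENT["end"]
    return (x0 + fraction * (x1 - x0), y0 + fraction * (y1 - y0))


# 3650 is not a known address point, yet it falls inside the segment's range.
print(match_point(3650), match_linear(3650))
```

The point lookup declines to answer for 3650 (a lower match rate, but no invented locations), whereas the range lookup returns a location even though no such address may exist.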
The cost of production and accuracy of point-based reference datasets can range from
extremely high costs and high accuracy when using GPS devices, such as the address points
available for some parts of North Carolina, to extremely low-cost and variable accuracy
when building a cache of previously geocoded data (as described in Section 13.4). Several
examples of well-known, national-scale reference datasets are listed in Table 21, and Abe and
Stinchcomb (2008, pp. 123) note that commercial vendors are beginning to produce and
market point-level address data. The attributes listed in Table 22 are common to all products
listed in Table 21. These form the minimum set of attributes required for a feature-matching
algorithm to successfully match a reference in a point-based reference dataset.

Table 21 Point-based reference datasets
Supplier | Product | Description | Coverage
Government | E-911 Address Points | Emergency management points for addresses | Portions of U.S.
Government | Postal Codes | Postal Code centroids | U.S./Canada
Government | Census MCD | Minor Civil Division centroids | U.S.
Government | Geographic Names Information System (United States Board on Geographic Names 2008) | Gazetteer of geographic features | U.S.
Government | GeoNames (United States National Geospatial-Intelligence Agency 2008) | Gazetteer of geographic features | World, excepting U.S.
Academia | Alexandria Digital Library (2008) | Gazetteer of geographic features | World
Table 22 Minimum set of point-based reference dataset attributes
Attribute | Description
Name | The name of the feature used for the search.
Point coordinates | A pair of values for the point in some coordinate system.

The United Kingdom and Australia currently have the highest quality point-based reference
datasets available, containing geocodes for every postal address in the country. Their
creation processes are well documented throughout the geocoding literature (Higgs and
Martin 1995a; Churches et al. 2002; Paull 2003), as are numerous studies performed to validate
and quantify their accuracy (e.g., Gatrell 1989). In contrast, neither the United States nor
Canada can currently claim the existence of a national-scale reference dataset containing
accurate geocodes for all addresses in the country. The national-scale datasets that are
available instead contain lower-resolution geographic features. In the United States, these
datasets are mostly available from the U.S. Census Bureau (e.g., ZCTA centroids and points
representing named places such as MCDs). These two datasets in particular are distributed in
conjunction with the most common linear-based reference data source used, the U.S. Census
Bureau TIGER/Line files (United States Census Bureau 2008c). USPS ZIP Codes are different
from U.S. Census Bureau ZCTAs, and their (approximate) centroids are available from
commercial vendors (covered in more detail in Section 5.1.4). Higher resolution point data
have been created by individual localities across the United States, but these can be difficult
to find in some locations unless one is active or has connections in the locality. Best
practices relating to reference dataset types are listed in Best Practices 20.
7.3 REFERENCE DATASET RELATIONSHIPS
The implicit and explicit relationships that exist between different reference dataset types
are similar to the components of postal address input data. These can be both structured
spatially hierarchical relationships and lineage-based relationships. An example of the first is
the hierarchical relationships between polygon-based features available at different
geographic resolutions of Census delineations in the TIGER/Line files. Census blocks are at the
highest resolution, followed by CBG, CT, ZCTA, county subdivisions, counties, and/or
other state subdivisions, etc. The spatially hierarchical relationships between these data types
are important because data at lower resolutions represent an aggregation of the features at
the higher level. When choosing a reference feature for interpolation, one can safely change
from selecting a higher resolution representation to a lower one (e.g., a block to a block
group) without fear of introducing erroneous data (e.g., the first digit of the block is the
block group code). The inverse is not true because lower-resolution data are composed of
multiple higher-resolution features (e.g., a block group contains multiple blocks). When
attempting to increase the resolution of the feature type matched to, there will be a level of
ambiguity introduced as to which is the correct higher resolution feature that should be
selected.
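
The safe direction of this hierarchy can be sketched as simple string truncation. The sketch assumes the 15-digit Census block identifier layout (2-digit state, 3-digit county, 6-digit tract, 4-digit block, where the first block digit is the block group); the function name and the sample identifier are made up for illustration.

```python
from typing import Dict


def block_parents(block_geoid: str) -> Dict[str, str]:
    """Derive lower-resolution identifiers from a 15-digit block identifier."""
    if len(block_geoid) != 15 or not block_geoid.isdigit():
        raise ValueError("expected a 15-digit block identifier")
    return {
        "state": block_geoid[:2],
        "county": block_geoid[:5],         # state + county
        "tract": block_geoid[:11],         # state + county + tract
        "block_group": block_geoid[:12],   # tract + first digit of the block
    }


# Made-up identifier: block 1003 therefore belongs to block group 1.
parents = block_parents("999990000011003")
print(parents["block_group"])
```

Moving down to a block group loses resolution but never introduces an error. The reverse step has no such shortcut, because a block group identifier alone cannot say which of its blocks was meant.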


Best Practices 20 Reference dataset types
Policy Decision | Best Practice
What reference dataset formats can and should be used | Any reference dataset format should be supported by a geocoding process, both vector- and raster-based. At a minimum, vector-based must be supported.
What vector-based reference dataset types can and should be used | Any vector-based reference dataset type should be supported by a geocoding process (e.g., point-, linear-, and polygon-based). At a minimum, linear-based must be supported.
Which data source should be obtained | A registry should obtain the most accurate reference dataset they can obtain given their budgetary and technical constraints. Cost may be the influencing factor as to which data source to use. There may be per-product limitations, so all choices and associated initiations should be fully investigated before acquisition.
When should a new data source be obtained | A registry should keep their reference dataset up-to-date as best they can within their means. The update frequency will depend on budgetary constraints and the frequency with which vendors provide updates.
Should old data sources be discarded | A registry should retain historical versions of all their reference datasets.
Where can reference data be obtained | Local government agencies and the FGDC should be contacted to determine the types, amounts, and usability of reference datasets available. Commercial firms (e.g., Tele Atlas [2008c] and NAVTEQ [2008]) also can be contacted if needs cannot be met by public domain data.
How should reference data sources be kept | Registries should maintain lists of reference datasets applicable to their area across all resolutions (e.g., TIGER/Lines [United States Census Bureau 2008c] - national, county government roads - regional, parcel databases - local).

Examples of derivational lineage-based relationships include the creation of NAVTEQ
(2008) and Tele Atlas (2008c) as enhanced derivatives of the TIGER/Line files and geocode
caching, in which the output of a feature interpolation method is used to create a
point-based reference dataset (as described in Section 13.4). In either of these cases, the
initial accuracy of the original reference dataset is a main determinant of the accuracy of later
generations. This effect is less evident in the case of TIGER/Line file derivatives because of
the continual updating, but is completely apparent in reference datasets created from cached
results. Best practices related to these spatial and derivational reference dataset relationships
are listed in Best Practices 21.
Best Practices 21 Reference dataset relationships
Policy Decision | Best Practice
Should primary or derivative reference datasets be used (e.g., TIGER/Lines or NAVTEQ) | Primary source reference datasets should be preferred to secondary derivatives unless significant improvements have been made and are fully documented and can be proven.
Should lower-resolution aggregate reference data be used over original individual features (e.g., block groups instead of blocks) | Moving to lower resolutions (e.g., from block to block group) should only be done if feature matching is not possible at the higher resolution due to uncertainty or ambiguity.

In addition to the inter-reference dataset relationships among different datasets,
intra-reference dataset relationships are at play between features within a single dataset. This
can be seen by considering various holistic metrics used to describe datasets, which are
characteristics describing values over an entire dataset as a whole. Atomic metrics, in contrast,
describe characteristics of individual features in a dataset. For example, datasets commonly
purport the holistic metric "average horizontal spatial accuracy" as a single value (e.g., 7 m in
the case of the TIGER/Line files). However, it is impossible to measure the horizontal spatial
accuracy of every feature in the entire set, so where did this number come from? These
holistic measures are calculated by choosing a representative sample and averaging their
values to derive a metric. For this reason, holistic metrics usually are expressed along with a
confidence interval (CI), which is a measurement of the percentage of data values that are
within a given range of values. This is the common and recommended practice for describing
the quality of spatial data, according to the FGDC data standards.
For example, stating that the data are accurate to 7 m with a CI of 95 percent means that
for a particular subset of individual features that were tested out of all the possible features,
roughly 95 percent fall within 7 m. The creator of the dataset usually does not (and usually
cannot) guarantee that each and every feature within the dataset has this same value as its
accuracy (which would make it an atomic metric). Although a data consumer generally can
trust CIs associated with holistic metrics, they must remain aware of the potential for
individual features to vary, sometimes being much different than those reported for the entire
set. This phenomenon is commonly most pronounced in the differences in values for feature
metrics seen in different geographic regions covered by large datasets (e.g., feature accuracy
in rural versus urban areas).
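
A small sketch may help show how a single holistic figure can hide regional differences. The error values, the urban and rural groupings, and the threshold below are invented purely for illustration and do not describe any real dataset; share_within is a hypothetical helper that follows the description above (the share of sampled features within a given range).

```python
from typing import Dict, List


def share_within(errors: List[float], threshold: float) -> float:
    """Fraction of sampled features whose positional error is within threshold."""
    return sum(1 for error in errors if error <= threshold) / len(errors)


# Made-up positional errors (in metres) for a tested sample of features.
sample_errors: Dict[str, List[float]] = {
    "urban": [2.0, 3.0, 4.0, 5.0],
    "rural": [6.0, 9.0, 12.0, 30.0],
}

threshold_m = 7.0
all_errors = [error for errors in sample_errors.values() for error in errors]

overall = share_within(all_errors, threshold_m)            # holistic view
by_region = {region: share_within(errors, threshold_m)     # regional view
             for region, errors in sample_errors.items()}

print(overall, by_region)
```

In this invented sample the dataset-wide share looks moderate, while every urban feature and only a quarter of the rural features fall within the threshold, which is the kind of variation a holistic metric cannot convey.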
Another aspect related to atomic and holistic feature completeness and accuracy is
geographical bias. In one sense, this describes the observation that the accuracy of geographic
features may be a function of the area in which they are located. Researchers are beginning
to realize that geocodes produced with similar reported qualities may not actually have the
same accuracy values when they are produced for different areas. The accuracy of the
geocoding process as a whole has been shown to be highly susceptible to specific properties of
the reference features, such as the length of the street segments (Ratcliffe 2001; Cayo and
Talbot 2003; Bakshi et al. 2004) that are correlated with characteristics such as the rural or
urban character of a region (e.g., smaller/larger postal code/parcel areas and the likelihood
of USPS PO box address areas [Skelly et al. 2002; Bonner et al. 2003; McElroy et al. 2003;
Ward et al. 2005]). Likewise, the preponderance of change associated with the reference
features and input data in rapidly expanding areas will undoubtedly affect the geocoding
process in different ways in different areas depending on the level of temporal dynamism of
the local built environment. This notion is partially captured by the newly coined term
cartographic confounding (Oliver et al. 2005). Best practices relating to reference dataset
characteristics are listed in Best Practices 22.

Best Practices 22 Reference dataset characteristics
Policy Decision | Best Practice
Should holistic or atomic metrics be used to describe the accuracy of a reference dataset | If the geographic variability of a region is low or the size of the region covered is small (e.g., city scale), the holistic metrics for the reference dataset should be used. If the geographic variability of a region is high or the size of the region covered is large (e.g., national scale), the accuracy of individual reference features within the area of the input data should be considered over the holistic measures.
Should geographic bias be considered a problem | If the geographic variability of a region is high or the size of the region covered is large (e.g., national scale), geographic bias should be considered as a possible problem.


8. FEATURE MATCHING
This section investigates the components of a feature-matching algorithm, detailing several
specific implementations.
8.1 THE ALGORITHM
Many implementations of feature-matching algorithms are possible and available, each
with their own benefits and drawbacks. At the highest and most general level, the
feature-matching algorithm performs a single simple role. It selects the correct reference
feature in the reference dataset that represents the input datum. The chosen feature then is
used in the feature interpolation algorithm to produce the spatial output. This generalized
concept is depicted in Figure 13. The matching algorithms presented in this section are
non-interactive matching algorithms (i.e., they are automated and the user is not directly
involved). In contrast, interactive matching algorithms involve the user in making choices
when the algorithm fails to produce an exact match by either having the user correct/refine
the input data or make a subjective, informed decision between two equally likely options.


Figure 13 Generalized feature-matching algorithm

8.1.1 SQL Basis
The form taken by feature-matching algorithms is dictated by the storage mechanism of
the reference dataset. Therefore, because most reference datasets are stored as traditional
relational database structures, most matching algorithms usually operate by producing and
issuing queries defined using the Structured Query Language (SQL). These SQL queries are
defined in the following format:

SELECT selection attributes
FROM data source
WHERE attribute constraints

The selection attributes are the attributes of the reference feature that should be
returned from the reference dataset in response to the query. These typically include the
identifiable attributes of the feature such as postal address components, the spatial geometry
of the reference feature such as an actual polyline, and any other desired descriptive aspatial
qualities such as road width or resolution. The data sources are the relational table (or
tables) within the reference dataset that should be searched. For performance reasons (e.g.,
scalability), the reference dataset may be separated into multiple tables (e.g., one for each
state) within a national-scale database. The attribute constraints form the real power of the
query, and consist of zero, one, or more predicates. A predicate is an attribute/value pair
defining what the value of an attribute must be for a feature to be selected. Multiple
predicates can be linked together with "AND" and "OR" statements to form conjunctions and
disjunctions. Nesting of predicates also is supported through the use of parentheses.
To satisfy a query, the relational database engine used to store the reference dataset will
ensure that Boolean logic is employed to evaluate the attribute constraints against each
feature in the reference dataset, returning only those that evaluate to true statements. The
following example would enforce the condition that only reference features whose "name"
attribute was equal to "Vermont" and had a "type" attribute equal to either "AVE" or "ST"
would be returned.

SELECT attributes
FROM data source
WHERE name='Vermont' and (type='AVE' or type='ST')
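
The query above can be tried against a throwaway in-memory database. The sketch below uses Python's built-in sqlite3 module; the streets table, its two columns, and the sample rows are assumptions made for illustration and do not reflect the schema of any real reference dataset.

```python
import sqlite3

# In-memory database standing in for a reference dataset (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streets (name TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO streets (name, type) VALUES (?, ?)",
    [("Vermont", "AVE"), ("Vermont", "ST"), ("Vermont", "PL"), ("Western", "AVE")],
)

# One predicate on name, joined by AND to a parenthesized OR of two predicates.
query = """
    SELECT name, type
    FROM streets
    WHERE name = ? AND (type = ? OR type = ?)
    ORDER BY type
"""
rows = conn.execute(query, ("Vermont", "AVE", "ST")).fetchall()
print(rows)
```

Only the two Vermont rows whose type is AVE or ST are returned; the PL row fails the nested disjunction and the Western row fails the name predicate.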

Case sensitivity relates to whether or not a database differentiates between the case of
alphabetic characters (i.e., upper-case or lower-case) when evaluating a query against
reference features, and if enforced can lead to many false negatives. This is platform dependent
and may be a user-settable parameter. Best practices related to SQL-type feature matching
are listed in Best Practices 23.
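
The false negatives mentioned above are easy to reproduce. The sketch below uses SQLite through Python's sqlite3 module, where text comparison with "=" is case-sensitive by default; other database platforms may behave differently. The table and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streets (name TEXT, type TEXT)")
# The reference feature is stored in upper case.
conn.execute("INSERT INTO streets (name, type) VALUES ('VERMONT', 'AVE')")

# Case-sensitive comparison: the mixed-case input misses the stored feature.
exact = conn.execute(
    "SELECT COUNT(*) FROM streets WHERE name = ?", ("Vermont",)
).fetchone()[0]

# Converting both sides to upper case removes the false negative.
folded = conn.execute(
    "SELECT COUNT(*) FROM streets WHERE UPPER(name) = UPPER(?)", ("Vermont",)
).fetchone()[0]

print(exact, folded)
```

Converting input data to upper case before matching has the same effect as the second query, without requiring a function call on every comparison.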

Best Practices 23 SQL-like feature matching
Policy Decision | Best Practice
What level of training does staff need to perform feature matching | At a minimum, staff should be trained to understand how to create and work with simple database applications such as Microsoft Access databases.
Should case-sensitivity be enforced | Case-sensitivity should not be enforced in feature matching. All data should be converted to upper case as per NAACCR data standards.

8.2 CLASSIFICATIONS OF MATCHING ALGORITHMS
Feature-matching algorithms generally can be classified into two main categories:
deterministic and probabilistic. A deterministic matching method is based on a series of rules
that are processed in a specific sequence. These can be thought of as binary operations; a
feature is either matched or it is not. In contrast, a probabilistic matching method uses a
computational scheme to determine the likelihood, or probability, that a feature matches and
returns this value for each feature in the reference set.
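
The difference between the two categories can be sketched as follows. The attribute names, the sample values, and especially the weights are invented for illustration, and the weighted agreement score is only a simple stand-in for a likelihood, not a calibrated probability or the scheme of any specific geocoder.

```python
from typing import Dict

# Hypothetical integer weights expressing how informative each attribute is.
WEIGHTS: Dict[str, int] = {"name": 5, "type": 2, "zip": 3}


def deterministic_match(address: Dict[str, str], feature: Dict[str, str]) -> bool:
    """Binary rule: every attribute must agree, otherwise there is no match."""
    return all(address[key] == feature[key] for key in WEIGHTS)


def match_score(address: Dict[str, str], feature: Dict[str, str]) -> float:
    """Weighted share of agreeing attributes, between 0.0 and 1.0."""
    agreeing = sum(weight for key, weight in WEIGHTS.items()
                   if address[key] == feature[key])
    return agreeing / sum(WEIGHTS.values())


address = {"name": "VERMONT", "type": "AVE", "zip": "00001"}
feature = {"name": "VERMONT", "type": "ST", "zip": "00001"}

print(deterministic_match(address, feature), match_score(address, feature))
```

The rule-based test rejects the candidate outright because one attribute disagrees, whereas the scoring approach reports a value for it and leaves the decision to a threshold or a ranking over all candidates.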
It should be noted that each normalization process from Section 6.2 can be grouped into
these two same categories. Substitution-based normalization is deterministic, while
context- and probability-based are probabilistic. Address normalization can be seen as a
higher-resolution version of the feature-matching algorithm. Whereas feature matching maps
the entire set of input attributes from the input data to a reference feature, address
normalization matches each component of the input address to its corresponding address
attribute. These processes are both linking records to a reference set: actual features in the
case of feature matching and address attributes in the case of normalization. Note that Boscoe
(2008) also can be consulted for a discussion of portions of the matching techniques presented
in this section.
8.3 DETERMINISTIC MATCHING
The main benefit of deterministic matching is the ease of implementation. These
algorithms are created by defining a series of rules and a sequential order in which they should
be applied. The simplest possible matching rule is the following:

"Match all attributes of the input address to the corresponding attributes of
the reference feature."

This rule will either find and return a perfect match, or it will not find anything and
subsequently return nothing; it is a binary operation. Because it is so restrictive, it is easy to
imagine cases when this would fail to match a feature even though the feature exists in reality
(i.e., false negatives). As one example, consider a common scenario in which the reference
dataset contains more descriptive attributes than the input address, as is seen in the following
two example items. The first is an example postal address with only the attributes street
number and name defined. The second (Table 23) depicts a reference feature that is more
descriptive (i.e., it includes the pre-directional and suffix attributes):

"3620 Vermont"

Table 23 Attribute relation example, linear-based reference features
From | To | Pre-directional | Name | Suffix
3600 | 3700 | South | Vermont | Ave

In both of these cases, the restrictive rule would fail to match and no features would be
returned when one or two (possibly) should have been.
8.3.1 Attribute Relaxation
In practice, less restrictive rules than the one previously listed tend to be created and
applied. Attribute relaxation, the process of easing the requirement that all street address
attributes must exactly match a feature in the reference data source to obtain a matching
street feature, often is applied to create these less restrictive rules. It generally is only applied
in deterministic feature matching because probabilistic methods can account for attribute
discrepancies through the weighting process. Relaxation is commonly performed by removing
or altering street address attributes in an iterative manner using a predefined order,
thereby increasing the probability of finding a match while also increasing the probability of
error. Employing attribute relaxation, the rule previously defined could become:

"Match all attributes which exist in the input address to the corresponding
attributes of the reference feature."

In this case, missing attributes in the input data will not prohibit a match and the feature
"3600-3700 South Vermont Ave" can be matched and returned. This example illustrates
how to allow attributes present in the reference features to be missing in input data, but
there is nothing stopping a matching algorithm from allowing the disconnect the other way
around, with attributes missing from the reference dataset but present in the input data.
However, this example also shows how ambiguity can be introduced. Take the same relaxed
matching rule and apply it to the features listed in Table 24 and two matches would be
returned. More detail on feature-matching ambiguity is provided in Section 14.

Table 24 Attribute relation example, ambiguous linear-based reference features
From | To | Pre-directional | Name | Suffix
3600 | 3700 | South | Vermont | Ave
3600 | 3700 | | Vermont | Pl
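
The strict and relaxed rules can be compared directly on the two reference features of Table 24. This is a minimal sketch: the dictionary layout, key names, and function names are hypothetical, and an attribute that is absent is represented by None.

```python
from typing import Dict, List, Optional

Feature = Dict[str, Optional[object]]

# Reference features from Table 24 (Table 23 contains only the first one).
FEATURES: List[Feature] = [
    {"from": 3600, "to": 3700, "pre": "South", "name": "Vermont", "suffix": "Ave"},
    {"from": 3600, "to": 3700, "pre": None, "name": "Vermont", "suffix": "Pl"},
]

DESCRIPTIVE_KEYS = ("pre", "name", "suffix")


def in_range(feature: Feature, number: int) -> bool:
    return feature["from"] <= number <= feature["to"]


def strict_match(address: Feature, features: List[Feature]) -> List[Feature]:
    """Every descriptive attribute must be identical, including absent ones."""
    return [f for f in features
            if in_range(f, address["number"])
            and all(address.get(k) == f[k] for k in DESCRIPTIVE_KEYS)]


def relaxed_match(address: Feature, features: List[Feature]) -> List[Feature]:
    """Only attributes that exist in the input address are compared."""
    present = [k for k in DESCRIPTIVE_KEYS if address.get(k) is not None]
    return [f for f in features
            if in_range(f, address["number"])
            and all(address[k] == f[k] for k in present)]


address = {"number": 3620, "name": "Vermont"}  # the input "3620 Vermont"
print(len(strict_match(address, FEATURES)), len(relaxed_match(address, FEATURES)))
```

With the strict rule nothing is returned; with the relaxed rule both features qualify, which is exactly the ambiguity that later disambiguation must resolve.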

It is important to reiterate that relaxation algorithms should be implemented in an
iterative manner, relaxing attributes in a specific order through a pre-defined series of steps
and passes (Levine and Kim 1998). A pass relaxes a single (or multiple) attributes within a step.
These passes start with the least descriptive attributes (those whose removal creates the least
amount of error) and progress upward through more and more descriptive attributes. A step
relaxes a single (or multiple) attributes at once, such that: (1) the resulting certainty of the
relaxed address effectively moves to another level of geographic resolution, (2) the ambiguity
introduced increases exponentially, or (3) the complexity of an interactive exhaustive
disambiguation increases linearly.
Within each step, several passes should be performed. These passes should relax the
different attributes individually and then in conjunction, until no more combinations can be
made without resulting in a step to another level of geographic resolution. The order in
which they are relaxed can be arbitrary and will have minimal consequence because steps are
the real influencing factor. Note that relaxing the house number increases the ambiguity
linearly because n = number of houses on street, while relaxing all other attributes increases
the ambiguity exponentially because n = the number of possible new segments that can be
included.
The preferred order of steps and passes is displayed in Table 25 through Table 27 (the
pass ordering has been arbitrarily selected). The ambiguity column describes the domain of
potential matches that could all equally be considered likely. The relative exponent and
magnitude of ambiguity column is an estimate that shows how the magnitude of ambiguity
should be calculated and the order of the derived exponent of this ambiguity (in
parentheses). The relative magnitude of spatial error column is an estimate of the total area
within which the correct address should be contained and the exponent of this ambiguity (in
parentheses). The worst-case resolution column lists the next level of accuracy that could be
achieved when disambiguation is not possible and assumes that the lower-order attributes
below those that are being relaxed are correct. Note that the last two rows of Table 26 could
belong to either pass 5 or 6 because the ambiguity has increased exponentially and the search
complexity has increased linearly, but the effective level of geographic certainty remains the
same (USPS ZIP Code).

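Table 25 Preferred attribute relaxation order with resulting ambiguity, relative magnitudes of ambiguity and spatial error, and worst-case
resolution, passes 1-4

Step | Pass | Relaxed Attribute | Ambiguity | Relative Exponent and Magnitude of Ambiguity | Relative Magnitude of Spatial Error | Worst-Case Resolution
1 | 1 | none | none | (0) none | certainty of address location | single address location
2 | 1 | number | multiple houses on single street | (0) # houses on street | length of street | single street
3 | 1 | pre | single house on multiple streets | (1) # streets with same name and different pre | bounding area of locations containing same number house on all streets with the same name | USPS ZIP Code
3 | 2 | post | single house on multiple streets | (1) # streets with same name and different post | bounding area of locations containing same number house on all streets with the same name | USPS ZIP Code
3 | 3 | type | single house on multiple streets | (1) # streets with same name and different type | bounding area of locations containing same number house on all streets with the same name | USPS ZIP Code
4 | 1 | number, pre | multiple houses on multiple streets | (2) # houses on street × # streets with same name and different pre | bounding area of all streets with the same name | USPS ZIP Code
4 | 2 | number, type | multiple houses on multiple streets | (2) # houses on street × # streets with same name and different type | bounding area of all streets with the same name | USPS ZIP Code
4 | 3 | number, post | multiple houses on multiple streets | (2) # houses on street × # streets with same name and different post | bounding area of all streets with the same name | USPS ZIP Code
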
Table 26 Preferred attribute relaxation order with resulting ambiguity, relative magnitudes of ambiguity and spatial error, and worst-case
resolution, pass 5

Step | Pass | Relaxed Attribute | Ambiguity | Relative Magnitude of Ambiguity | Relative Magnitude of Spatial Error | Worst-Case Resolution
5 | 1 | pre, type | single house on multiple streets | (2) # streets with same name and different pre × # streets with same name and different type | bounding area of locations containing same number house on all streets with the same name | USPS ZIP Code
5 | 2 | pre, post | single house on multiple streets | (2) # streets with same name and different pre × # streets with same name and different post | bounding area of locations containing same number house on all streets with the same name | USPS ZIP Code
5 | 3 | post, type | single house on multiple streets | (2) # streets with same name and different post × # streets with same name and different type | bounding area of locations containing same number house on all streets with the same name | USPS ZIP Code
5 | 5 | number, pre, type | multiple houses on multiple streets | (2) # houses on street × # streets with same name and different pre × # streets with same name and different type | bounding area of all streets with the same name | USPS ZIP Code
5 | 6 | number, pre, post | multiple houses on multiple streets | (2) # houses on street × # streets with same name and different pre × # streets with same name and different post | bounding area of all streets with the same name | USPS ZIP Code
5 | 7 | number, post, type | multiple houses on multiple streets | (2) # houses on street × # streets with same name and different post × # streets with same name and different type | bounding area of all streets with the same name | USPS ZIP Code
5 | 8 | number, pre, post, type | multiple houses on multiple streets | (2) # houses on street × # streets with same name and different pre × # streets with same name and different post × # streets with same name and different type | bounding area of all streets with the same name | USPS ZIP Code
5,6 | 9 | pre, post, type | single house on multiple streets | (3) # streets with same name and different pre × # streets with same name and different post × # streets with same name and different type | bounding area of locations containing same number house on all streets with the same name | USPS ZIP Code
5,6 | 10 | number, pre, post, type | single house on multiple streets | (3) # houses on street × # streets with same name and different pre × # streets with same name and different post × # streets with same name and different type | bounding area of all streets with the same name | USPS ZIP Code

Table 27 Preferred attribute relaxation order with resulting ambiguity, relative magnitudes of spatial error, and worst-case resolution, pass 6

Step | Pass | Relaxed Attribute | Ambiguity | Relative Magnitude of Ambiguity | Relative Magnitude of Spatial Error | Worst-Case Resolution
6 | 2 | pre, type, USPS ZIP Code | single house on multiple streets in multiple USPS ZIP Codes | (3) # streets with same name and different pre × # streets with same name and different type × # USPS ZIP Codes that have those streets | bounding area of locations containing same number house on all streets with the same name in all USPS ZIP Codes | city
6 | 3 | pre, post, USPS ZIP Code | single house on multiple streets in multiple USPS ZIP Codes | (3) # streets with same name and different pre × # streets with same name and different post × # USPS ZIP Codes that have those streets | bounding area of locations containing same number house on all streets with the same name in all USPS ZIP Codes | city
6 | 4 | post, type, USPS ZIP Code | single house on multiple streets in multiple USPS ZIP Codes | (3) # streets with same name and different post × # streets with same name and different type × # USPS ZIP Codes that have those streets | bounding area of locations containing same number house on all streets with the same name in all USPS ZIP Codes | city
6 | 4 | number, pre, type, USPS ZIP Code | multiple houses on multiple streets in multiple USPS ZIP Codes | (3) # houses on street × # streets with same name and different pre × # streets with same name and different type × # USPS ZIP Codes that have those streets | bounding area of all streets with the same name in all USPS ZIP Codes | city
6 | 5 | number, pre, post, USPS ZIP Code | multiple houses on multiple streets in multiple USPS ZIP Codes | (3) # houses on street × # streets with same name and different pre × # streets with same name and different post × # USPS ZIP Codes that have those streets | bounding area of all streets with the same name in all USPS ZIP Codes | city
6 | 6 | number, post, type, USPS ZIP Code | multiple houses on multiple streets in multiple USPS ZIP Codes | (3) # houses on street × # streets with same name and different post × # streets with same name and different type × # USPS ZIP Codes that have those streets | bounding area of all streets with the same name in all USPS ZIP Codes | city
6 | 4 | number, pre, type, post, USPS ZIP Code | multiple houses on multiple streets in multiple USPS ZIP Codes | (3) # houses on street × # streets with same name and different pre × # streets with same name and different post × # streets with same name and different type × # USPS ZIP Codes that have those streets | bounding area of all streets with the same name in all USPS ZIP Codes | city

An example of the first few iterations of the algorithm is depicted in Figure 14. This
diagram shows how each step moves the certainty of the result to a lower geographic
resolution. It should be noted that the authors who originally developed these attribute
relaxation techniques recommend never relaxing the street name attribute (Levine and Kim
1998). In their case, this action led to the introduction of a great deal of error due to the
similarity in different Hawaiian street names. Best practices relating to deterministic feature
matching are listed in Best Practices 24.
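
The iteration over steps and passes can be sketched in a few lines of Python. This is a minimal
sketch under stated assumptions, not the implementation of Levine and Kim (1998): the record
layout, the REFERENCE list, the RELAXATION_ORDER list, and the candidates and match helpers are
hypothetical names introduced for illustration, and only the first four steps of Table 25 are
covered. The street name is never relaxed, in line with the recommendation above.

```python
# Hypothetical reference features; the two rows mirror Table 24.
REFERENCE = [
    {"range": (3600, 3700), "pre": "South", "name": "Vermont", "type": "Ave", "post": ""},
    {"range": (3600, 3700), "pre": "", "name": "Vermont", "type": "Pl", "post": ""},
]

# (step, pass, attributes relaxed), following the ordering of Table 25.
RELAXATION_ORDER = [
    (1, 1, ()),
    (2, 1, ("number",)),
    (3, 1, ("pre",)), (3, 2, ("post",)), (3, 3, ("type",)),
    (4, 1, ("number", "pre")), (4, 2, ("number", "type")), (4, 3, ("number", "post")),
]


def candidates(address, relaxed):
    """Return reference features that agree with the address on every
    attribute that has not been relaxed (the street name is never relaxed)."""
    hits = []
    for feature in REFERENCE:
        if feature["name"] != address["name"]:
            continue
        low, high = feature["range"]
        if "number" not in relaxed and not low <= address["number"] <= high:
            continue
        if any(a not in relaxed and feature[a] != address[a]
               for a in ("pre", "post", "type")):
            continue
        hits.append(feature)
    return hits


def match(address):
    """Try each step and pass in order; stop at the first one with any hit."""
    for step, pass_no, relaxed in RELAXATION_ORDER:
        hits = candidates(address, relaxed)
        if hits:
            return step, pass_no, hits
    return None
```

With an input such as 3620 Vermont Ave (no pre-directional), nothing matches until the
pre-directional is relaxed in step 3, pass 1. Recording the step and pass that produced the
match gives the caller a direct handle on the worst-case resolution of the result.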


Figure 14 Example relaxation iterations

Best Practices 24 Deterministic feature matching
Policy Decision | Best Practice
When should deterministic matching be used? | Deterministic matching should be the first feature-matching type attempted.
What types of matching rules can and should be used? | Any deterministic set of rules can be used, but they should always be applied in the same order.
What order of matching rules can and should be applied? | Rules should be applied in order of decreasing restrictiveness, starting from the most restrictive such that tightly restrictive rules are applied first, and progressively less restrictive rules are applied subsequently upon a previous rule's failure.
Should attribute relaxation be allowed? | Attribute relaxation should be allowed when using deterministic feature matching.
What order should attributes be relaxed? | Attribute relaxation should occur as the series of steps and passes as listed in this document.

8.4 PROBABILISTIC MATCHING
Probabilistic matching has its roots in the fields of probability and decision theory, and
has been employed in geocoding processes since the outset (e.g., O'Reagan and Saalfeld
1987, Jaro 1989). The exact implementation details can be quite messy and mathematically
complicated, but the concept in general is quite simple.
The unconditional probability (prior probability) is the probability of something
occurring, given that no other information is known. Mathematically, the unconditional
probability, P, of an event, e, occurring is notated P(e), and is equivalent to (1 - the
probability of the event not occurring), that is, P(e) = 1 - P(¬e).
In contrast, the conditional probability is the probability of something occurring, given
that other information is known. Mathematically, having obtained additional information, i,
the conditional probability of event e occurring given that i is true, P(e|i), is defined as the
probability of i and e occurring together divided by the probability that i occurs alone, as
in Equation 1.

$$P(e \mid i) = \frac{P(e \wedge i)}{P(i)}$$

Equation 1 Conditional probability

In probabilistic matching, the match probability is a degree of belief ranging from 0 to
1 that a feature matches. These systems report this degree of belief that a feature matches
(the easy part) based on and derived from some criteria (the hard part). A degree of belief of
0 represents a 0 percent chance that it is correct, while a 1 represents a 100 percent chance.
The confidence threshold is the probability cutoff point determined by the user above
which a feature is accepted and below which it is rejected. To harness the power of these
probabilities and achieve feature results that would not otherwise be obtainable, the use of
this approach requires the acceptance of a certain level of risk that an answer could be
wrong.
There are many forms of probabilistic feature matching, as the entire field of record
linkage is focused on this task. Years of research have been devoted to this problem, with
particular interest paid to health and patient records (e.g., Winkler 1995, Blakely and
Salmond 2002). In this section, to illustrate the basic concepts and present a high-level
overview, one common approach will be presented: attribute weighting.
8.4.1 Attribute weighting
Attribute weighting is a form of probabilistic feature matching in which probability-
based values are associated with each attribute, and either subtract from or add to the
composite score for the feature as a whole. Then, the composite score is used to determine a
match or non-match. In this approach each attribute of the address is assigned two
probabilities, known as weights. These weights represent the level of importance of the
attribute, and are a combination of the matched and unmatched probabilities. The matched
probability is the probability of two attributes matching, m, given that the two records match, M.
Mathematically, this is denoted as the conditional probability P(m|M). This probability can
be calculated with statistics over a small sample of the total dataset in which the input datum
and the reference feature do actually match. The error rate, α, denotes instances in which
the two attributes do not actually match, even though the two records do match. Thus,
P(m|M) = 1 - α. In the existing literature, the full probability notation usually is discarded,
and P(m|M) is simply written as m. It should be noted that m generally is high.
The unmatched probability is the probability that the two attribute values match, u,
given that the two records themselves do not match, ¬M. Mathematically, this is denoted by
the conditional probability P(u|¬M). This second probability represents the likelihood that
the attributes will match at random, and can be calculated with statistics over a small sample
of the total dataset for which the input data and the reference do not match. Again,
P(u|¬M) usually is denoted simply as u. It should be noted that u generally is low for
directionals, but is higher for street names.
From these two probabilities m and u, frequency indices for agreement, f_a, and
disagreement, f_d, can be calculated and used to compute the positive and negative weights for
agreement and disagreement, w_a and w_d, as in Equation 2.

$$f_a = \frac{m}{u}, \quad f_d = \frac{1 - m}{1 - u}, \quad w_a = \log_2(f_a), \quad w_d = \log_2(f_d)$$

Equation 2 Agreement and disagreement probabilities and weights

These weights are calculated for each of the attributes in the reference dataset a priori.
Composite scores for input data are created on-the-fly during feature matching by summing
the attribute weights of the individual input attributes as compared against the reference
feature attributes. Where an agreement is found, w_a, it is added to the score, and where a
disagreement is found, w_d, it is subtracted. This composite score is the probability used to
determine if the feature is a match (i.e., if it is above the confidence threshold). Excellent
descriptions of this and other more advanced record linkage algorithms can be found in Jaro
(1989), Blakely and Salmond (2002), Meyer et al. (2005), and Boscoe (2008) as well as in the
references contained within each. Best practices related to probabilistic feature matching are
listed in Best Practices 25.
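
The weight calculation and the composite score can be sketched as follows. This is a minimal
sketch, assuming the usual base-2 logarithm form of the agreement and disagreement weights; the
attribute names and the matched and unmatched probabilities in PROBABILITIES are hypothetical
values chosen only for illustration. A disagreement weight computed this way is negative whenever
the matched probability exceeds the unmatched probability, so the sketch adds it to the running
total, which lowers the score.

```python
import math


def weights(m, u):
    """Agreement and disagreement weights for one attribute, from its
    matched probability m and unmatched probability u."""
    f_agree = m / u
    f_disagree = (1 - m) / (1 - u)
    return math.log2(f_agree), math.log2(f_disagree)


# Hypothetical (m, u) pairs per attribute, for illustration only.
PROBABILITIES = {
    "number": (0.95, 0.01),
    "name": (0.98, 0.05),
    "type": (0.90, 0.25),
}


def composite_score(input_address, reference_feature):
    """Sum the agreement or disagreement weight of every attribute."""
    score = 0.0
    for attribute, (m, u) in PROBABILITIES.items():
        w_agree, w_disagree = weights(m, u)
        if input_address.get(attribute) == reference_feature.get(attribute):
            score += w_agree
        else:
            # w_disagree is negative when m > u, so this lowers the score.
            score += w_disagree
    return score
```

A candidate that agrees on every attribute receives the highest possible score, and each
disagreement pulls the score down by an amount that reflects how informative that attribute is.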
Best Practices 25 Probabilistic feature matching
Policy Decision | Best Practice
When should probabilistic matching be used? | Probabilistic matching should be used when deterministic feature matching fails, and if the consumers of the data are comfortable with the confidence threshold.
What confidence threshold should be considered acceptable? | At a minimum, a 95% confidence threshold should be acceptable.
What metadata should be maintained? | The metadata should describe the match probability.
How and when should match probabilities for different attributes be calculated? | Match probabilities for different attributes should be calculated a priori for the reference dataset by using a computational approach that randomly selects records and iterates continuously until the rate stabilizes.
How and when should unmatch probabilities for different attributes be calculated? | Unmatch probabilities for different attributes should be calculated a priori for the reference dataset by using a computational approach that randomly selects records and iterates continuously until the rate stabilizes.
How and when should confidence thresholds be re-evaluated? | Confidence thresholds should continuously be re-evaluated based on the frequency with which attribute values are encountered.
How and when should composite weights be re-evaluated? | Composite weights should continuously be re-evaluated based on the frequency with which attribute values are encountered.
8.5 STRING COMPARISON ALGORITHMS
Any feature-matching algorithm requires the comparison of strings of character data to
determine matches and non-matches. There are several ways this can be attempted, some
more restrictive or flexible in what they are capable of matching than others. The first,
character-level equivalence, enforces that each character of two strings must be exactly the
same. In contrast, essence-level equivalence uses metrics capable of determining if two
strings are "essentially" the same. This allows for minor misspellings in the input address to
be handled, returning reference features that "closely match" what the input may have
"intended." These techniques are applicable to both deterministic and probabilistic matching
algorithms because relaxing the spelling of attributes using different string matching
algorithms is a form of attribute relaxation. In all cases, careful attention must be paid to the
accuracy effects when these techniques are employed because they can and do result in
incorrect features being returned.
Word stemming is the simplest version of an essence-level equivalence technique.
These algorithms reduce a word to its root (stem), which then is used for essence-level
equivalence testing. The Porter Stemmer (Porter 1980) is the most famous of these. It starts by
removing common suffixes (e.g., "-ed," "-ing,") and additionally applies more complex rules
for specific substitutions such as "-sses" being replaced with "-ss." The algorithm is fairly
straightforward and run as a series of steps. Each progressive step takes into account what
has been done before, as well as word length and potential problems with a stem if a suffix is
removed.
Phonetic algorithms provide an alternative method for encoding the essence of a
word. These algorithms enable essence-level equivalence testing by representing a word in
terms of how it sounds when it is pronounced (i.e., phonetically). The goal of these types of
algorithms is to produce common representations for words that are spelled differently, yet
sound the same. The Soundex algorithm is the most famous of this class of algorithms. It
has existed since the late 1800s and originally was used by the U.S. Census Bureau. The
algorithm is very simple and consists of the following steps:

1) Keep the first letter of the string
2) Remove all vowels and the letters y, h, and w, unless they are the first letter
3) Replace all letters after the first with numbers based on a known table
4) Remove any numbers which are repeated in a row
5) Return the first four characters, padded on the right with zeros if there are fewer
than four.
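
The steps above can be sketched in Python. This is a minimal sketch of the commonly used
American Soundex variant, which is an assumption on two points: the letter-to-digit table in
SOUNDEX_CODES is the standard one (the text refers only to "a known table"), and letters are
coded and adjacent repeats collapsed before vowels are discarded, so that repeated codes
separated by a vowel are both kept (for example, Running encodes to R552 rather than R520).

```python
# Standard American Soundex letter groups (an assumption; see the lead-in).
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(word):
    """Return the four-character Soundex code of a single word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = SOUNDEX_CODES.get(first, "")
    for letter in letters[1:]:
        if letter in "HW":
            continue  # H and W are dropped and do not separate repeated codes
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        # Vowels and Y have no code; they reset `previous`, so a repeated
        # code on the far side of a vowel is kept.
        previous = code
    # Keep the first letter plus three digits, padding with zeros.
    return (first + "".join(digits) + "000")[:4]
```

Calling soundex on each word of a street name separately, for example on "Runs" and then on
"Ridge", gives one code per word, which can then be compared word by word.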

Producing an encoded form of any information necessarily loses information (unless
they are defined as exact equivalents). Stemming and phonetic algorithms, while efficient and
precise, still suffer from this fact and can produce inaccurate results in the context of
matching street names. In particular, two originally unrelated attribute values can become
related during the process. Table 28 presents examples of words encoded by both algorithms that
result in ambiguities.

Table 28 String comparison algorithm examples
Original | Porter Stemmed | Soundex
Running Ridge | run ridg | R552 R320
Runs Ridge | run ridg | R520 R320
Hawthorne Street | hawthorn street | H650 S363
Heatherann Street | heatherann street | H650 S363

To minimize these negative effects or data loss, feature-matching algorithms can attempt
string comparisons as a two-step process. The first pass can use an essence-level comparison
to generate a set of candidate reference features. The second pass then can generate a
probability-based score for each of the candidates using the original text of the attributes, not
the essence-level derivations. The values from the second pass then can be used to determine
the likelihood of correctness. Best practices related to string comparison algorithms are listed
in Best Practices 26.
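
The two-pass process can be sketched as follows. This is a minimal sketch under stated
assumptions: the first pass uses a compact American Soundex as the essence-level comparison, and
the second pass uses the similarity ratio of Python's difflib.SequenceMatcher as a stand-in for
a probability-based score on the original text. The function name two_pass_match and the sample
street names are hypothetical.

```python
from difflib import SequenceMatcher

# Standard American Soundex letter groups, inverted to a letter -> digit map.
CODES = {letter: digit
         for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                              "4": "L", "5": "MN", "6": "R"}.items()
         for letter in group}


def soundex(word):
    """Compact Soundex used as the essence-level key."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    out, previous = letters[0], CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            continue
        code = CODES.get(letter, "")
        if code and code != previous:
            out += code
        previous = code
    return (out + "000")[:4]


def two_pass_match(input_name, reference_names):
    """Return (score, name) pairs for the candidates, best score first."""
    key = soundex(input_name)
    # Pass 1: essence-level comparison produces the candidate set.
    candidates = [name for name in reference_names if soundex(name) == key]
    # Pass 2: score each candidate against the original, unencoded text.
    scored = [(SequenceMatcher(None, input_name.lower(), name.lower()).ratio(), name)
              for name in candidates]
    return sorted(scored, reverse=True)
```

For the input "Hawthorn", both "Hawthorne" and "Heatherann" survive the first pass, and the
second pass ranks "Hawthorne" first because its original spelling is much closer to the input.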

Best Practices 26 String comparison algorithms
Policy Decision | Best Practice
When and how should alternative string comparison algorithms be used? | Alternative string comparison algorithms should be used when no exact feature matches can be identified. A two-step approach should be used to compare the original input with the essence-level equivalence match to determine the match and unmatched probabilities (as in the probability-based feature-matching approach).
What types of string comparison algorithms can and should be used? | Both character- and essence-level string comparisons should be supported.
When should character-level string equivalence be used? | Character-level equivalence should always be attempted first on every attribute.
When and how should essence-level string equivalence be used? | Essence-level equivalence should only be attempted if character-level equivalence fails. Essence-level equivalence should only be attempted on attributes other than the street name. Only one essence-level equivalence algorithm should be applied at a time. They can be tried in succession but one should not process the output of the other (i.e., they should both start with the raw data). Metadata should describe the calculated essence of the string used for comparison, and strings that it was matched to in the reference dataset.
What types of essence-level algorithms should be used? | Both stemming and phonetic algorithms should be supported by the geocoding process.
Which word-stemming algorithms should be used? | At a minimum, the Porter Stemmer (Porter 1980) should be supported by the geocoding process.
Which phonetic algorithms should be used? | At a minimum, the Soundex algorithm should be supported by the geocoding process.

9. FEATURE INTERPOLATION
This section examines each of the feature interpolation algorithms in depth.
9.1 FEATURE INTERPOLATION ALGORITHMS
Feature interpolation is the process of deriving an output geographic feature from
geographic reference features (e.g., deriving a point for an address along a street center-line
or the centroid of a parcel). A feature interpolation algorithm is an implementation of a
particular form of feature interpolation. One can distinguish between separate classes of
feature interpolation algorithms for linear- and areal unit-based reference feature types. Each
implementation is tailored to exploit the characteristics of the reference feature types upon
which it operates.
It is useful to point out that interpolation is only ever required if the requested output
geographic format is of lower geographic complexity than the features stored in the
reference dataset. If a geocoding process uses a line-based reference dataset and is asked to
produce a line-based output, no interpolation is necessary because the reference feature is
returned in its native form. Likewise, a polygon-based reference dataset should return a native
polygon representation if the output format requests it.
Linear-based interpolation is most commonly encountered, primarily because linear-
based reference datasets currently are the most prevalent. The advantages and disadvantages
of each type of interpolation method will be explored in this section.
9.2 LINEAR-BASED INTERPOLATION
Linear-based feature interpolation operates on line segments (or polylines, which
are a series of connected lines) and produces an estimation of an output feature using a
computational process on the spatial geometry of the line. This approach was one of the first
implemented, and as such, is detailed dozens of times in the scientific literature and the user
manuals of countless geocoding platforms. With this abundance of information on the topic
and data sources readily available (see Table 1), the discussion presented here will outline
only the high-level details, focusing on identifying assumptions used in the process that
affect the results and ways that they can be overcome.
For the purpose of this discussion, it will be assumed that the input data and the
reference feature are correctly matched. In essence, linear-based interpolation attempts to
estimate where along the reference feature the spatial output, in this case a point, should be
placed. This is achieved by using the number attribute of the input address data to identify
the proportion of the distance down the total length of the reference feature where the
spatial output should be placed. The reference feature attribute used for this operation is the
address range, which describes the valid range of addresses on the street (line segment) in
terms of start and end addresses (and also serves to make street vectors continuous
geographic objects).
The address parity (i.e., even or odd) is an indication of which side of the street an
input address falls on. This simplistic case presumes binary address parity for the reference
street segment (i.e., one side of the street is even and the other is odd), which may not be the
case. More accurate reference data sometimes account for different parities on the same side
of the street as necessary (non-binary address parity) and a more advanced geocoding
algorithm can take advantage of these attributes. A common parity error for a reference data
source is for an address to be listed as if it occurs on both sides of the street. An equally
common address range error in a reference data source is for an address range to be
reversed. This can mean that the address is on the wrong side of the street, that the address
range start and end points of the street have been reversed, or a combination of both. These
should be considered reference data source errors, not interpolation errors, although they are
commonly viewed that way.
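
Side selection under binary address parity can be sketched as follows. This is a minimal
sketch, assuming each side of the segment carries its own (start, end) address range and that
the parity of a side is given by the parity of its range start; the function name
side_of_street and the sample ranges are hypothetical. Ranges are compared by their minimum and
maximum, so a range whose start and end have been reversed in the reference data is still
usable for this test.

```python
def side_of_street(number, left_range, right_range):
    """Return "left" or "right" for the side whose address range contains
    the number and shares its parity, or None if neither side qualifies."""
    for side, (start, end) in (("left", left_range), ("right", right_range)):
        low, high = min(start, end), max(start, end)
        same_parity = number % 2 == start % 2
        if low <= number <= high and same_parity:
            return side
    return None
```

A reference dataset with non-binary parity would need one such test per address range on each
side rather than a single range per side.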
In an effort to continue with the simplest possible case, interpolation will be performed
on a simple line-based reference feature made up of only two points (i.e., the start, or origin,
and end, or destination). The distance from the start of the street segment where the spatial
output should be placed, d, is calculated as a proportion of the total street length, l, the
number of the input address, a, and the size of the address range, r, which is equal to one-half
the difference between the start address and end address of the address range, r_s and r_e
respectively, as in Equation 3. For the ratio a/r to be a proportion between 0 and 1, a must be
counted from the start of the address range in the same one-half units as r (i.e., one-half the
difference between the input number and r_s).

$$r = \frac{\lvert r_s - r_e \rvert}{2}, \quad d = l \left( \frac{a}{r} \right)$$

Equation 3 Size of address range and resulting distance from origin

Using the distance that the output point should be located from the origin of the street
vector, it is possible to calculate the actual position where the spatial output should be
placed. This is achieved through the following calculation, with the origin of the street
denoted (x_0, y_0), the destination denoted (x_1, y_1), and the output location denoted
(x_2, y_2), as in Equation 4, where d is that distance and l is the total length of the street
segment. Note that although the Earth is an ellipsoid and spherical distance calculations
would be the most accurate choice, planar calculations such as Equation 4 are most
commonly employed because the error they introduce is negligible for short distances such as
most typical street segments.

$$x_2 = x_0 + \frac{d}{l}\,(x_1 - x_0), \quad y_2 = y_0 + \frac{d}{l}\,(y_1 - y_0)$$

Equation 4 Resulting output interpolated point
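
The address-range interpolation of Equation 3 and Equation 4 can be sketched as follows. This
is a minimal sketch, assuming planar coordinates, a two-point segment, an input number that
lies within the address range, and a position that is counted from the start of the range so
that the ratio of position to range size is a proportion between 0 and 1. The function name
interpolate and the midpoint fallback for a single-address range are hypothetical choices, not
taken from the source.

```python
def interpolate(number, range_start, range_end, origin, destination):
    """Planar address-range interpolation along a two-point segment.

    Returns the (x, y) location on the centerline for the given number.
    """
    x0, y0 = origin
    x1, y1 = destination
    size = abs(range_end - range_start) / 2        # r: size of the range on one side
    if size == 0:
        proportion = 0.5                           # single-address range: midpoint (assumption)
    else:
        position = abs(number - range_start) / 2   # a, counted from the range start
        proportion = position / size               # a / r, between 0 and 1
    # Applying the proportion to the coordinate offsets is equivalent to
    # moving the distance d = l * (a / r) along the segment.
    return (x0 + proportion * (x1 - x0), y0 + proportion * (y1 - y0))
```

For example, interpolate(3650, 3600, 3700, (0.0, 0.0), (100.0, 0.0)) places the output halfway
along the segment, at (50.0, 0.0).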

This calculated position will be along the centerline of the reference feature,
corresponding to the middle of the street. Thus, a dropback usually is applied to move the
output location away from the centerline toward and/or beyond the sides of the street where
the buildings probably are located in city-style addresses.
Experiments have been performed attempting to determine the optimal direction and
length for this dropback but have found that the high variability in street widths and
directions prohibits consistent improvements (Ratcliffe 2001, Cayo and Talbot 2003). Therefore,
in practice, an orthogonal direction usually is chosen along with a standard distance.
However, it is likely that better results could be achieved by inspecting the MTFCC of a road to
determine the number of lanes and multiplying by the average width per lane. Best practices
related to these fundamental components of the linear-based interpolation methods are
listed in Best Practices 27.
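
An orthogonal dropback can be sketched as follows. This is a minimal sketch, assuming planar
coordinates in the same units as the dropback distance; the function names apply_dropback and
dropback_from_lanes, the side argument, and the default lane width of 3.5 units are hypothetical
and serve only to illustrate the lanes-times-width idea described above.

```python
import math


def apply_dropback(point, origin, destination, distance, side="right"):
    """Move a centerline point a fixed distance orthogonally to the segment.

    "right" and "left" are relative to travel from origin to destination.
    """
    dx = destination[0] - origin[0]
    dy = destination[1] - origin[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return point  # degenerate segment: no direction to offset from
    # Unit normal; (dy, -dx) points to the right of the direction of travel.
    nx, ny = dy / length, -dx / length
    if side == "left":
        nx, ny = -nx, -ny
    return (point[0] + distance * nx, point[1] + distance * ny)


def dropback_from_lanes(lanes, lane_width=3.5):
    """One-half of the estimated street width (lanes times width per lane)."""
    return lanes * lane_width / 2
```

The side argument would normally come from the address parity test, so that even and odd
addresses are pushed to opposite sides of the centerline.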
Best Practices 27 Linear-based interpolation
Policy Decision | Best Practice
When should linear-based interpolation be used? | Linear-based interpolation should be used when complete and accurate point- or polygon-based reference datasets are not available. Linear-based interpolation should be used when input data cannot be directly linked with a point-based reference dataset and must be matched to features representing multi-entity features.
When and how should the parameters for linear interpolation be chosen? | The parameters used for linear-based interpolation should be based upon the attributes available in the reference dataset.
What parity information should be used for linear-based feature interpolation? | At a minimum, binary parity should be used. If more information is available in the reference dataset regarding the parity of an address it should be used (e.g., multiple address ranges per side of street).
What linear-interpolation function should be used? | At a minimum, planar interpolation should be used. If a spherical interpolation algorithm is available it should be used.
Should the same dropback value and direction always be used? | The same dropback value and direction should always be used based on the width of the street as determined by: number of lanes; MTFCC codes; average width per lane.
Which dropback value and direction can and should be used? | An a priori dropback value of one-half the reference street's width (based on the street classification code and average classification street widths) should be applied in an orientation orthogonal to the primary direction of the street segment to which the interpolated output falls.

When performing linear-based interpolation in the manner just described, several
assumptions are involved and new geocoding methods are aimed at eliminating each (e.g.,
Christen and Churches [2005] and Bakshi et al. [2004]). The parcel existence assumption
is that all addresses within an address range actually exist. The parcel homogeneity
assumption is that each parcel is of exactly the same dimensions. The parcel extent
assumption is that addresses on the segment start at one endpoint of the segment and completely
fill the space on the street all the way to the other endpoint. These concepts are illustrated in
Figure 15. Additionally, the corner lot assumption/problem is that when using a measure
of the length of the segment for interpolation, it is unknown how much real estate may be
taken up along a street segment by parcels from other intersecting street segments (around
the corner), and the actual street length may be shorter than expected. Address-range
feature interpolation is subject to all of these assumptions (Bakshi et al. 2004).


Figure 15 Example of parcel existence and homogeneity assumptions

Recent research has attempted to address each of these assumptions by incorporating
additional knowledge into the feature interpolation algorithm about the true characteristics
of the reference feature (Bakshi et al. 2004). First, by determining the true number of
buildings along a reference feature, the parcel existence assumption can be alleviated. By doing
this, the distance to the proper feature can be calculated more accurately. However, this
approach still assumes that each of the parcels is of the same size, and is thus termed uniform
lot feature interpolation. This is depicted in Figure 16.


Figure 16 Example of uniform lot assumption

If the actual parcel sizes are available, the parcel homogeneity assumption can be
overcome and the actual distance from the origin of the street segment can be calculated directly
by summing the distances of each parcel until the correct one is reached, and is thus termed
actual lot feature interpolation. This is depicted in Figure 17.

Figure 17 Example of actual lot assumption
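
Uniform lot and actual lot feature interpolation differ only in how the distance along the
segment is obtained, which can be sketched as follows. This is a minimal sketch, assuming lots
are indexed from 0 starting at the origin of the segment, that the target is the center of the
lot's street frontage, and that lots fill the segment from endpoint to endpoint (the parcel
extent assumption still applies). The function names and arguments are hypothetical.

```python
def uniform_lot_distance(lot_index, lot_count, street_length):
    """Distance from the origin to the center of lot `lot_index` when the
    street is divided into `lot_count` lots of equal width."""
    lot_width = street_length / lot_count
    return (lot_index + 0.5) * lot_width


def actual_lot_distance(lot_index, lot_widths):
    """Distance from the origin to the center of lot `lot_index`, found by
    summing the known frontage widths of the lots that precede it."""
    return sum(lot_widths[:lot_index]) + lot_widths[lot_index] / 2
```

With four lots on a 100-unit segment, the uniform version places the second lot at 37.5 units,
while known widths of 10, 30, 20, and 40 units place it at 25.0 units, which shows how much the
homogeneity assumption can shift the result.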

However, the distance is still calculated using the parcel extent assumption that the
addresses on a block start exactly at the endpoint of the street. This obviously is not the case
because the endpoint of the street represents the intersection of centerlines of intersecting
streets. The location of this point is in the center of the street intersection, and therefore the
actual parcels of the street cannot start for at least one-half the width of the street (i.e., where
the curb starts). This is depicted in Figure 18.


Figure 18 Example of street offsets
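
One simple way to account for street offsets is to trim half the width of each cross
street from the matching end of the centerline before interpolating. The sketch below
assumes the cross-street widths are known or estimated; the function names and sample
values are hypothetical.

```python
def usable_range(segment_length, start_cross_width, end_cross_width):
    """Return (start, end) distances along the centerline where parcels can lie.

    Half the width of each cross street is trimmed from the matching end,
    because the centerline endpoint sits in the middle of the intersection.
    """
    start = start_cross_width / 2.0
    end = segment_length - end_cross_width / 2.0
    if end <= start:
        raise ValueError("cross-street widths consume the whole segment")
    return start, end


def offset_interpolate(fraction, segment_length, start_cross_width, end_cross_width):
    """Place a fraction (0.0 to 1.0) of the address range inside the trimmed range."""
    start, end = usable_range(segment_length, start_cross_width, end_cross_width)
    return start + fraction * (end - start)


# Hypothetical 100 m segment between a 12 m wide and an 8 m wide cross street.
print(usable_range(100.0, 12.0, 8.0))             # (6.0, 96.0)
print(offset_interpolate(0.5, 100.0, 12.0, 8.0))  # 51.0
```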

The corner lot problem can be overcome in a two-step manner. First, the segments that
make up the block must be determined. Second, an error-minimizing algorithm can be run
to determine the most likely distribution of the parcels for the whole block based on the
length of the street segments, the sizes of lots, and the combinations of their possible
layouts. This distribution then can be used to derive a better estimate of the distance from
the endpoint to the center of the correct parcel. This is depicted in Figure 19. None of the
approaches discussed thus far can overcome the assumption that the building is located at
the centroid of the parcel, which may not be the case.


Figure 19 Example of corner lot problem

These small performance gains in the accuracy of the linear-based feature interpolation
algorithm may hardly seem worth the effort, but this is not necessarily the case. Micro-scale
spatial analyses, although not currently performed with great frequency or regularity, are
becoming more and more prevalent in cancer- and health-related research in general. For
example, a recent study of exposure to particulate matter emanating from freeways determined
that the effect of this exposure is reduced greatly as one moves small distances away from
the freeway, on the order of several meters (i.e., high-distance decay). Thus, if the accuracy
of the geocoding process can be improved by just a few meters, cases can more accurately be
classified as exposed or not (Zandbergen 2007) and more accurate quantifications of potential
individual exposure levels can be calculated, as has been attempted with pesticides (Rull
and Ritz 2003, Nuckols et al. 2007), for example. Best practices related to linear-based
interpolation assumptions are listed in Best Practices 28.


Best Practices 28 Linear-based interpolation assumptions

Policy Decision: When and how should linear-based interpolation assumptions be overcome?
Best Practice: If data are available and/or obtainable, all assumptions that can be
overcome should be.

Policy Decision: Where can data be obtained to overcome linear-based interpolation
assumptions?
Best Practice: Local government organizations should be contacted to obtain information on
the number, size, and orientation of parcels as well as address points.

Policy Decision: When and how can the parcel existence assumption be overcome?
Best Practice: If an address verifier is available, it should be used to verify the
existence of parcels before interpolation is performed.

Policy Decision: When and how can the parcel homogeneity assumption be overcome?
Best Practice: If the parcel dimensions are available, these should be used to calculate
the interpolated output location.

Policy Decision: When and how can the parcel extent assumption be overcome?
Best Practice: If the street widths are known or can be derived from the attributes of the
data (street classification and average classification widths), these should be used to
buffer the interpolation range geometry before performing interpolation.

Policy Decision: When and how can the corner lot problem be overcome?
Best Practice: If the layout and sizes of parcels for the entire block are available, they
should be used in conjunction with the lengths of the street segments that compose the
blocks to determine an error-minimizing arrangement which should be used for linear-based
interpolation.

9.3 AREAL UNIT-BASED FEATURE INTERPOLATION
Areal unit-based feature interpolation uses a computational process to determine a
suitable output from the spatial geometry of polygon-based reference features. This
technique has a unique characteristic: it can be either very accurate or very inaccurate,
depending on the geographic scale of the reference features used. For instance, areal
unit-based interpolation on parcel-level reference features should produce very accurate
results compared to linear-based interpolation for the same input feature. However, areal
unit-based interpolation at the scale of a typical USPS ZIP Code would be far less accurate
in comparison to a linear-based interpolation for the same input data (noting again that
USPS ZIP Codes are not actually areal units; see Section 5.1.4).

Centroid calculations (or an approximation thereof) are the usual interpolation performed
on areal unit-based reference features. This can be done via several methods, with each
emphasizing different characteristics. The simplest method is to take the centroid of the
bounding box of the feature and often is employed in cases for which complex computations
are too expensive. A somewhat more complicated approach, the center-of-mass or geographic
centroid calculation, borrows from physics and simply uses the shape and area to compute
the centroid. This does not take into account any aspatial information about the contents
of the areal unit that might make it more accurate. At the resolution of an urban parcel,
this has been shown to be fairly accurate because the assumption that a building is in the
center of a parcel is mostly valid, as long as the parcels are small (Ratcliffe 2001).
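
To contrast the two geometric methods just described, the following sketch computes a
bounding box centroid and a center-of-mass centroid for the same polygon. It assumes a
simple (non-self-intersecting) polygon in planar coordinates whose ring is given as a list
of (x, y) vertices without repeating the first vertex; the parcel shape is hypothetical.

```python
def bounding_box_centroid(ring):
    """Center of the axis-aligned bounding box of the polygon."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return ((min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0)


def center_of_mass_centroid(ring):
    """Area-weighted (geographic) centroid of a simple polygon, shoelace formula."""
    area2 = 0.0  # twice the signed area
    cx = 0.0
    cy = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0.0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area) and area = area2 / 2, hence the factor 3.
    return (cx / (3.0 * area2), cy / (3.0 * area2))


# Hypothetical L-shaped parcel, coordinates in arbitrary planar units.
parcel = [(0.0, 0.0), (4.0, 0.0), (4.0, 1.0), (1.0, 1.0), (1.0, 4.0), (0.0, 4.0)]
print(bounding_box_centroid(parcel))    # (2.0, 2.0)
print(center_of_mass_centroid(parcel))  # roughly (1.357, 1.357)
```

The two results differ because the bounding box ignores how the area is distributed inside
it, while the center-of-mass calculation follows the shape itself.
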
However, as parcels increase in size (e.g., as the reference dataset moves from an urban
area characterized by small parcels to a rural area characterized by larger parcels), this
assumption becomes less and less valid and the centroid calculation becomes less accurate.
In particular, on very large parcels such as farms or campuses, the center of mass centroid
becomes very inaccurate (Stevenson et al. 2000, Durr and Froggatt 2002). In contrast,
algorithms that employ a weighted centroid calculation sometimes are more accurate when
applied to these larger parcels. These make use of the descriptive quality of the aspatial
attributes associated with the reference feature (e.g., population density surfaces) to move
the centroid toward a more representative location.

To achieve this, the polygon-based features can be intersected with a surface created
from finer resolution features to associate a series of values for each location throughout
the polygon. This weight surface can be derived from both raster-based and individual
feature reference data. In either case, the weighted centroid algorithm runs on top of this
surface to calculate the position of the centroid from the finer resolution dataset, either
the raster cell values in the first case or the values of the appropriate attribute for
individual features. For example, in a relatively large areal unit such as a ZCTA (granted
that not all ZCTAs are large), a weighted centroid algorithm could use information about a
population distribution to calculate a more representative and probable centroid. This will
produce a centroid closer to where the most people actually live, thereby increasing the
probability that the geocode produced is closer to where the input data really are. This
surface could be computed from a raster dataset with cell values equaling population counts
or from a point dataset with each point having a population count attribute, essentially a
method of looking at a point dataset as a non-uniformly distributed raster dataset. See
Beyer et al. (2008) for a detailed evaluation of multiple weighting schemes. Best practices
related to areal unit-based interpolation are listed in Best Practices 29.
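
A population-weighted centroid can be sketched as a weighted mean of point locations. The
example assumes that the weight surface has already been intersected with the areal unit,
so every point passed in lies inside the polygon; the function name and the population
counts are hypothetical. It is one simple weighting scheme, not the method evaluated by
Beyer et al. (2008).

```python
def weighted_centroid(points):
    """Weighted mean location of (x, y, weight) tuples.

    The points may be raster cell centers with population counts or
    individual features carrying a population attribute.
    """
    total = 0.0
    sum_x = 0.0
    sum_y = 0.0
    for x, y, weight in points:
        if weight < 0:
            raise ValueError("weights must be non-negative")
        total += weight
        sum_x += x * weight
        sum_y += y * weight
    if total == 0:
        raise ValueError("no weight inside the areal unit")
    return (sum_x / total, sum_y / total)


# Hypothetical population counts at four locations inside one areal unit.
cells = [(1.0, 1.0, 900), (9.0, 1.0, 50), (1.0, 9.0, 40), (9.0, 9.0, 10)]
print(weighted_centroid(cells))  # close to (1.48, 1.4), near the populated corner
```

The unweighted mean of the same four locations is (5.0, 5.0), so the weights pull the
centroid strongly toward the corner where most of the population sits.
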

Best Practices 29 Areal unit-based interpolation

Policy Decision: When should areal unit-based interpolation methods be used?
Best Practice: Areal unit-based interpolation should be used over linear-based alternatives
when the spatial resolution of the areal unit-based reference features is higher than that
of the corresponding linear-based counterparts.
Best Practice: Areal unit-based interpolation should be used when more accurate means have
been tried and failed, and it is the only option left.
Best Practice: Areal unit interpolation should not be used if metadata about the accuracy
of the features is not available.

Policy Decision: When and which areal unit-based interpolation methods should be used?
Best Practice: At a minimum, geographic (center-of-mass) centroid calculations should be
used.
Best Practice: If appropriate information is available, weighted centroid approximations
should be used.
Best Practice: Feature-bounding box centroids should not be used.

Policy Decision: Which additional data sources should be used for areal unit-based centroid
approximations?
Best Practice: Population density should be used for weighted centroid calculation for
areal unit-based reference datasets containing reference features of lower resolution than
parcels (e.g., USPS ZIP Codes).

Policy Decision: What metadata should be maintained?
Best Practice: If weighted centroids are calculated, the metadata for the datasets used in
the calculation, identifiers for the grid cells containing the values used for calculation,
and aggregates for the values used in the calculation (e.g., mean, min, max, range) should
be recorded along with the geocoded record.

10. OUTPUT DATA
This section briefly discusses issues related to geocoded output data.

10.1 DOWNSTREAM COMPATIBILITY
The definition of geocoding presented earlier was specifically designed to encompass and
include a wide variety of data types as valid output from the geocoding process. Accordingly,
it is perfectly acceptable for a geocoding process to return a point, line, polyline,
polygon, or some other higher-complexity geographic object. What must be considered, however,
is that the output produced inevitably will need to be transmitted to and consumed and/or
processed by some downstream component (e.g., a spatial statistical package). The
requirements, capabilities, and limitations of the eventual data consumer and transmission
mechanisms need to be considered when assessing an appropriate output format. In most cases,
these constraints will tend to lean towards the production of simple points as output data.

10.2 DATA LOSS
When examining the available output options from a data loss perspective, one may
consider a different option. Take the ambiguity problems inherent in moving from a lower
resolution geographic feature to a higher one described earlier (Section 7.3), for example.
The high-resolution data can always be abstracted to lower resolution later if necessary,
but once converted they cannot be unambiguously converted back to their higher-resolution
roots. For example, a parcel centroid can always be computed from a parcel boundary, but the
other direction is not possible if new data are discovered that could have influenced the
assignment of the centroid. Therefore, it may be advisable to always return and store the
spatial output of the geocoding process at the highest level of geographic resolution
possible. There is a risk associated with this process because of the temporal staleness
problems that can occur with geocode caches (e.g., if the parcel boundaries change over
time).
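
One way to keep the highest-resolution output while still serving point-only consumers is
to store the full geometry alongside the point and flatten it on demand. The record layout
below is a hypothetical sketch: the class name, field names, and identifier format are
assumptions and do not come from any particular geocoder.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class GeocodeOutput:
    """Hypothetical output record for one geocoded address."""
    point: Point                                    # simple point for downstream tools
    reference_feature_id: str                       # ID of the reference feature used
    reference_geometry: Optional[List[Point]] = None  # full geometry, if it can be kept

    def to_simple_row(self) -> Dict[str, object]:
        """Flatten to the point-only form that most downstream packages accept."""
        x, y = self.point
        return {"x": x, "y": y, "reference_feature_id": self.reference_feature_id}


# Hypothetical parcel match: the boundary is stored, the point is derived from it.
record = GeocodeOutput(
    point=(5.0, 10.0),
    reference_feature_id="parcel-0001",
    reference_geometry=[(0.0, 0.0), (10.0, 0.0), (10.0, 20.0), (0.0, 20.0)],
)
print(record.to_simple_row())
```

Dropping the stored geometry later is trivial, but it cannot be recovered from the point
alone, which is the data loss concern raised above.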

Best Practices 30 Output data

Policy Decision: What geographic format should output data take?
Best Practice: At a minimum, output data should be a geographic point with a reference to
the ID of the reference feature used.
Best Practice: If other processes can handle it, the full geometry of the reference feature
also should be returned.

Part 3: The Many Metrics for Measuring Quality



Notions of "quality" vary among the scientific disciplines. This term (concept) has become
particularly convoluted when used to describe the geocoding process. In the information and
computational sciences, the "quality" of a result traditionally refers to the notions of
precision (accuracy) and recall (completeness), while in the geographical sciences these same
terms take on different (yet closely related) meanings. Although the factors that contribute
to the overall notion of geocode quality are too numerous and conceptually diverse to be
combined into a single value, this is how it is generally described. The very nature of the
geocoding process precludes the specification of any single quality metric capable of
sufficiently describing the geocoded output. This part of the document will elaborate on the
many metrics that can affect different aspects of quality for the resulting geocode.

11. QUALITY METRICS
This section explores several contributing factors to spatial accuracy within different
components and at different levels of the geocoding process.

11.1 ACCURACY
Researchers must have a clear understanding of the quality of their data so that they can
decide its fitness-for-use in their particular study. Each study undoubtedly will have its
own unique data quality criteria, but in general the metrics listed in Table 29 could be used
as a guide to develop a level of confidence about the quality of geocodes. Several aspects of
confidence are listed along with their descriptions, factors, and example evaluation
criteria. Further research is required to determine exactly if, how, or when these metrics
could be combined to produce a single "quality" metric for a geocode. The topics in this
table will be covered in more detail in the following sections. An excellent review of
current studies looking at geocoding quality and its effects on subsequent studies is
available in Abe and Stinchcomb (2008). Research on how spatial statistical models can be
used in the presence of geocodes (and geocoded datasets) of varying qualities (e.g.,
Zimmerman [2008]) is emerging as well.

Table 29 Metrics for deriving confidence in geocoded results

Metric: Precision
Description: How close is the location of a geocode to the true location?
Example Factors and Example Criteria (> better than):
  Interpolation algorithm used: Uniform lot > address range
  Interpolation algorithm assumptions: Fewer assumptions > more assumptions
  Reference feature geometry size: Smaller > larger
  Reference feature geometry accuracy: Higher > lower
  Matching algorithm certainty: Higher > lower

Metric: Certainty
Description: How positive can one be that the geocode produced represents the correct
location?
Example Factors and Example Criteria (> better than):
  Matching algorithm used: Deterministic > probabilistic
  Matching algorithm success: Exact match > non-exact match
  Matching algorithm success: High probability > low probability
  Matching algorithm success: Non-ambiguous match > ambiguous match
  Reference feature geometry accuracy: Higher > lower
  Matching algorithm relaxation amount: None > some
  Matching algorithm relaxation type: Attribute transposition > Soundex

Metric: Reliability
Description: How much trust can be placed in the process used to create a geocode?
Example Factors and Example Criteria (> better than):
  Transparency of the process: Higher > lower
  Reputability of software/vendor: Higher > lower
  Reputability of reference data: Higher > lower
  Reference data fitness for area: Higher > lower
  Concordance with other sources: Higher > lower
  Concordance with ground truthing: Higher > lower

12. SPATIAL ACCURACY
This section explores several contributing factors to spatial accuracy within different
components and at different levels of the geocoding process.

12.1 SPATIAL ACCURACY DEFINED
The spatial accuracy of geocoding output is a combination of the accuracy of both the
processes applied and the datasets used. The term "accuracy" can and does mean several
different things when used in the context of geocoding. In general, accuracy typically is a
measure of how close to the true value something is. General best practices related to
geocoding accuracy are listed in Best Practices 31.

Best Practices 31 Output data accuracy

Policy Decision: When and how can and should accuracy information be associated with output
data?
Best Practice: Any information available about the production of the output data should
always be associated with the output data.

12.2 CONTRIBUTORS TO SPATIAL ACCURACY
Refining this definition of accuracy, spatial accuracy can be defined as a measure of
how true a geographic representation is to the actual physical location it represents. This
is a function of several factors (e.g., the resolution it is being modeled at, or the
geographic unit used to bind it). For example, a parcel represented as a point would be less
spatially accurate than the parcel being represented as a polygon. Additionally, a polygon
defined at a local scale with thousands of vertices is more spatially accurate than the same
representation at the national scale, where it has been generalized into a dozen vertices.
With regard to the geocoding process, spatial accuracy is used to describe the resulting
geocode and the limits of the reference dataset. The resulting positional accuracy of a
geocode is dependent on every component of the geocoding process. The decisions made by
geocoding algorithms at each step can have both positive and negative effects, either
potentially increasing or decreasing the resulting accuracy of the spatial output.

12.2.1 Input data specification
Output geocode accuracy can be traced back to the very beginning of the process when
the input data initially are specified. There has been some research into associating
first-order levels of accuracy with different types of locational descriptions (Davis Jr. et
al. 2003, Davis Jr. and Fonseca 2007), but in practice, these distinctions rarely are
quantified and returned as accuracy metrics with the resulting data. These different types of
data specifications inherently encode different levels of information. As an example,
consider the difference between the following two input data examples. The first is a
relative locational description that, when considered as an address at the location, could
refer to the address closest to the corner on either Vermont Ave or 36th Place (the corner
lot problem from Section 9.2), while the second are locational data that describe a specific
street address.

"The northeast corner of Vermont Avenue and 36th Place"

"3620 South Vermont Avenue, Los Angeles, CA 90089"

One description clearly encodes a more precise location than the other. When compared to
the specific street address, the relative description implicitly embeds more uncertainty
into the location it is describing, which is carried directly into the geocoding process and
the resulting geocode. Using the specific street address, a geocoding algorithm can uniquely
identify an unambiguous reference feature to match. With the relative location, a geocoder
can instead only narrow the result to a set of likely candidate buildings, or the area that
encompasses all of them.

Similarly, the amount of information encoded into a description has a fundamental effect
on the level of accuracy that can be achieved by a geocoder. Consider the difference in
implicit accuracy that can be assumed between the following two input data examples. The
first specifies an exact address, while the second specifies a location somewhere on a
street, inherently less accurate than the first. In this case the resulting accuracy is a
function of the assumed geographic resolution defined by the amount of information encoded
in the input data.

"831 Nash Street, El Segundo, CA 90245"

"Nash Street, El Segundo, CA 90245"
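
The idea that the attributes present in an input description bound the achievable
resolution can be sketched as a simple lookup. The level names, their ordering, and the
dictionary keys below are hypothetical choices made for illustration; a production system
would need a richer address model.

```python
# Hypothetical resolution levels, ordered from most to least specific.
# Each level lists the address attributes that must be present to claim it.
LEVELS = [
    ("street address", ("number", "street")),
    ("street", ("street",)),
    ("USPS ZIP Code", ("zip",)),
    ("city", ("city",)),
    ("state", ("state",)),
]


def implicit_resolution(address):
    """Return the most specific level supported by the non-empty attributes."""
    for level, required in LEVELS:
        if all(address.get(field) for field in required):
            return level
    return "unknown"


exact = {"number": "831", "street": "Nash Street", "city": "El Segundo",
         "state": "CA", "zip": "90245"}
street_only = {"street": "Nash Street", "city": "El Segundo",
               "state": "CA", "zip": "90245"}

print(implicit_resolution(exact))        # street address
print(implicit_resolution(street_only))  # street
```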

Relationships among the implicit spatial accuracies of different types of locational
descriptions are shown in Figure 20, with:

a) Depicting accuracy to the building footprint (the outline)
b) Showing how the building footprint (the small dot) is more accurate than the USPS
ZIP+4 (United States Postal Service 2008a) "90089-0255" (the polygon)
c) Showing the implicit resolution of combined street segments (the straight line) within
a USPS ZIP Code (blue) along "Vermont Ave, 90089"
d) Showing both a relative direction (the large polygon) "northeast corner of Vermont
and 36th" and the "3600-3700 block of Vermont Ave, Los Angeles, CA 90089"
e) Showing the relation between the building (the small dot) USPS ZIP+4 (the small
inner polygon) to the USPS ZIP "90089" (the larger polygon)
f-g) Showing the relations among the city, county and state.


Figure 20 Certainties within geographic resolutions (Google, Inc. 2008b)

Best practices related to input data are listed in Best Practices 32.

Best Practices 32 Input data implicit accuracies

Policy Decision: What information quality metrics can and should be associated with input
data?
Best Practice: At a minimum, metrics based on the relative amount of information contained
in an input address should be associated with a record.

Policy Decision: When and how can and should the implicit spatial accuracy of input data be
calculated?
Best Practice: Implicit spatial accuracy should always be calculated and associated with any
input data that are geocoded. Implicit spatial accuracy should be calculated as the area for
which the highest resolution reference feature can be unambiguously matched.

Policy Decision: When and how can and should the implicit level of information of input data
be calculated?
Best Practice: The implicit level of information should always be calculated and associated
with any input data that are geocoded.
Best Practice: The implicit level of information should be calculated based on the types of
features it contains (e.g., point, line, polygon), its overall reported accuracy, and any
estimates of atomic feature accuracy within the region that are available.

12.2.2 Normalization and feature matching
The effects on accuracy arising from the specificity of the input data also can be seen
clearly in both the address normalization and the feature-matching algorithms. First, recall
that substitution-based normalization can be considered an example of deterministic feature
matching. It performs the same task at different resolutions (i.e., per attribute instead of
per feature).

As the normalization algorithm moves through the input string, if the value of the
current input token does not match the rule for the corresponding address attribute, the
attribute is skipped and the value tried as a match for the next attribute. In the case of
feature matching, the relaxation of the attributes essentially removes them from the input
data.

Both of these processes can possibly "throw away" data elements as they process an input
address, thus lowering the accuracy of the result by implicitly lowering the amount of
information encoded by the input description. For example, consider the following three
addresses. The first is the real address, but the input data are presented as the second,
and the feature-matching algorithm cannot match the input but can match the third. Throwing
away the directional element "E" will have prevented the consideration that "E" might have
been correct and the street name "Wall" was wrong.

"14 E Mall St"

"14 E Wall St"

"14 Wall St"
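
The three-address example can be reproduced with a toy deterministic matcher that relaxes
attributes when no exact match exists. The reference features, attribute names, and
relaxation order are hypothetical, and real feature-matching algorithms are considerably
more involved.

```python
# Hypothetical reference features; both the real street and the look-alike exist.
REFERENCE = [
    {"number": "14", "predirectional": "", "name": "WALL", "suffix": "ST"},
    {"number": "14", "predirectional": "E", "name": "MALL", "suffix": "ST"},
]


def match(query, reference, ignore=()):
    """Deterministic match on every query attribute except those in ignore."""
    keys = [key for key in query if key not in ignore]
    return [feat for feat in reference if all(feat[k] == query[k] for k in keys)]


def match_with_relaxation(query, reference, relax_order=("predirectional",)):
    """Try an exact match first, then drop attributes one at a time.

    Returns the candidate features and the list of attributes that were relaxed.
    """
    relaxed = []
    candidates = match(query, reference)
    for attribute in relax_order:
        if candidates:
            break
        relaxed.append(attribute)
        candidates = match(query, reference, ignore=relaxed)
    return candidates, relaxed


# Input as presented: "14 E Wall St" (the real address is "14 E Mall St").
query = {"number": "14", "predirectional": "E", "name": "WALL", "suffix": "ST"}
candidates, relaxed = match_with_relaxation(query, REFERENCE)
print(relaxed)                # ['predirectional']
print(candidates[0]["name"])  # WALL
```

Recording which attributes were relaxed, as the second return value does here, preserves
the fact that information was discarded on the way to the match.
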
12.2.3 Reference datasets
The reference datasets used by a geocoding process contribute to the spatial accuracy of
the output in a similar manner. Different representations of reference features necessarily
encode different levels of information. This phenomenon results in the very same accuracy
effects seen as a consequence of the different levels of input data specification just
described. Also, the spatial accuracy of the reference features may be the most important
contributing factor to the overall accuracy of the spatial output. Interpolation algorithms
operating on the reference features can only work with what they are given, and will never
produce any result more accurate than the original reference feature. Granted, these
interpolation algorithms can and do produce spatial outputs of varying degrees of spatial
accuracy based on their intrinsic characteristics, but the baseline spatial accuracy of the
reference feature is translated directly to the output of the interpolation algorithm. The
actual spatial accuracy of these reference features can vary quite dramatically. Sometimes,
the larger the geographic coverage of a reference dataset, the worse the spatial accuracy of
its features. This has historically been observed when comparing street vectors based on
TIGER/Line files to those produced by local governments. Likewise, the differences in the
spatial accuracies between free reference datasets (e.g., TIGER/Lines) and commercial
counterparts (e.g., NAVTEQ) also can be quite striking as discussed in Sections 4.5 and 7.
Best practices relating to reference dataset accuracy are listed in Best Practices 33.

Best Practices 33 Reference dataset accuracy

Policy Decision: When and how can and should atomic feature accuracy be measured and/or
estimated?
Best Practice: Estimates of the atomic feature accuracy within a reference dataset should be
made periodically by random selection and manual evaluation of the reference features within
the region covered by the dataset.

12.2.4 Feature Interpolation
Any time that feature interpolation is performed, one should ask "how accurate is the
result?" and "how certain can I be of the result?" In the geocoding literature, however,
these questions have largely remained unanswered. Without going to the field and physically
measuring the difference between the predicted and actual geocode values corresponding to a
particular address, it is difficult to obtain a quantitative value for the accuracy of a
geocode due to the issues raised in the sections immediately preceding and following this
one. However, it may be possible to derive a relative predicted certainty, or a relative
quantitative measure of the accuracy of a geocode based on information about how a geocode
is produced (i.e., attributes of the reference feature used for interpolation), so long as
one assumes that the reference feature was selected correctly (e.g., Davis Jr. and Fonseca
2007, Shi 2007). In other words, the relative predicted certainty is the size of the area
within which it can be certain that the actual true value for a geocode falls. For instance,
if address range interpolation was used, relative predicted certainty would correspond
roughly to an oval-shaped area encompassing the street segment. If areal unit interpolation
was used, the relative predicted certainty would correspond to the area of the feature.
Existing research into identifying, calculating, representing, and utilizing these types of
certainty measures for geocodes is in its infancy, but will hopefully provide a much richer
description of the quality of a geocode and its suitability for use in research studies once
it becomes more fully developed.
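
As a rough illustration of relative predicted certainty, the sketch below compares the
area implied by an address-range match with the area of a matched polygon. Modeling the
"oval-shaped area" as a capsule (a fixed-distance buffer around a straight segment) and
the choice of buffer distance are assumptions made here; the source does not prescribe a
formula. Coordinates are assumed to be planar, in meters.

```python
import math


def segment_certainty_area(segment_length, offset):
    """Area of a capsule: a buffer of radius offset around a straight segment."""
    return 2.0 * offset * segment_length + math.pi * offset ** 2


def polygon_area(ring):
    """Area of a simple polygon from its (x, y) vertices (shoelace formula)."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


# Hypothetical comparison: a 100 m street segment with a 15 m buffer
# versus a 20 m by 30 m parcel polygon.
street_area = segment_certainty_area(100.0, 15.0)
parcel_area = polygon_area([(0.0, 0.0), (20.0, 0.0), (20.0, 30.0), (0.0, 30.0)])
print(round(street_area, 1))  # about 3706.9 square meters
print(parcel_area)            # 600.0
```

A smaller area means the true location is confined more tightly, so in this made-up case
the parcel match carries the higher relative predicted certainty.
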
12.3 MEASURING POSITIONAL ACCURACY
There are several ways to directly derive quantitative values for the positional accuracy
of geocoded data, some more costly than others. The most accurate and expensive way is to go
into the field with GPS devices and obtain positional readings for the address data to
compare with the geocodes. This option requires a great deal of manpower, especially if the
size of one's geographic coverage is large, and therefore may not be a feasible option for
most organizations. However, this approach may be feasible with special efforts if only a
small subset of data need to be analyzed for a small-area analysis.

Barring the ability to use GPS device readings for any large-scale accuracy measurements,
other options do exist. The simplest method is to compare the newly produced geocodes
with existing geocodes. New geocoding algorithms typically are evaluated and tested as they
are developed in this manner. With a fairly small set of representative gold standard data,
the spatial accuracy of new geocoding algorithms can be tested quickly to determine their
usefulness. The key is investing resources to acquire appropriate sample data.
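
A comparison against gold standard locations can be sketched as follows. The example
assumes both datasets hold latitude and longitude in decimal degrees keyed by a shared
record ID, and it uses the haversine great-circle distance on a spherical Earth; the
function names and the sample coordinates are hypothetical.

```python
import math
import statistics

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def positional_errors(geocodes, gold_standard):
    """Error in meters for every record present in both datasets."""
    errors = {}
    for record_id, (lat, lon) in geocodes.items():
        if record_id in gold_standard:
            gold_lat, gold_lon = gold_standard[record_id]
            errors[record_id] = haversine_m(lat, lon, gold_lat, gold_lon)
    return errors


def summarize(errors):
    """Simple summary statistics over the per-record errors."""
    values = list(errors.values())
    return {
        "n": len(values),
        "mean_m": statistics.mean(values),
        "median_m": statistics.median(values),
        "max_m": max(values),
    }


# Hypothetical sample: record "c" has no gold standard location and is skipped.
geocodes = {"a": (34.0200, -118.2900), "b": (34.0210, -118.2850), "c": (34.0300, -118.2800)}
gold = {"a": (34.0200, -118.2900), "b": (34.0211, -118.2850)}
errors = positional_errors(geocodes, gold)
print(summarize(errors))
```

Reporting the number of compared records alongside the error statistics makes it clear how
small the gold standard sample was.
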
Another option that is much more common is to use georeferenced aerial imagery to
validate geocodes for addresses. Sophisticated mapping technologies for displaying geocodes
on top of imagery are now available for low cost, or are even free online (e.g., Google Earth
[Google, Inc. 2008a]). These tools allow one to see visually and measure quantitatively how
close a geocode is to the actual feature it is supposed to represent. This approach has
successfully been used in developing countries to create geocodes for all addresses in
entire cities (Davis Jr. 1993). The time required is modest and the scalability appears
feasible (e.g., Zimmerman et al. 2007, Goldberg et al. 2008d, and the references within
each), although input verification requires a fair amount of cross-validation among those
performing the data input. Also, it should be noted that points created using imagery are
still subject to some error because the images themselves may not be perfectly
georeferenced, and should not be considered equivalent to ground truthing in every case.
However, imagery-derived points generally can be considered more accurate than their feature
interpolation-based counterparts. Recent research efforts have begun to provide encouraging
results as to the possibility of quantifying positional accuracy (e.g., Strickland et al.
2007), but studies focusing on larger geographic areas and larger sample sizes are still
needed.

12.4 GEOCODING PROCESS COMPONENT ERROR INTRODUCTION
In cases for which the cost or required infrastructure/staff are too prohibitive to actually
quantitatively assess the positional accuracy of geocoded output, other relative scales can
and should be used. For example, a measurement that describes the total accuracy as a
function of the resulting accuracies of each component of the process is one way to
determine an estimate. There currently are no agreed-upon standards for this practice; each
registry may have or be developing their own requirements. One proposed breakdown of
benchmarking component accuracy is listed in Table 30, which is a good start at collecting
and identifying process errors but may not be at a fine enough granularity to make any real
judgments about data quality. Also, it should be noted that reference feature and attributes
validation is easier for some reference data types (e.g., address points and parcel
centroids) and harder for others (e.g., street centerlines).

Table 30 Proposed relative positional accuracy metrics
Component Description
1 Original address quality, benchmarked with address validation
2 Reference attribute quality, benchmarked with address validation
3 Geocoding match criteria, benchmarked to baseline data
4 Geocoding algorithm, benchmarked against other algorithms

It is unclear at this point how these benchmarks can/should be quantified in such a
manner that they may be combined for a total, overall accuracy measure.

Additionally, emerging research is investigating the links between the geographic
characteristics of an area and the resulting accuracy of geocodes (e.g., Zimmerman 2006,
Zimmerman et al. 2007). Although it has long been known that geocoding error is reduced in
urban areas, where shorter street segments reduce interpolation error, the effects of other
characteristics such as street pattern design, slope, etc. are unknown and are coming under
increasing scrutiny (Goldberg et al. 2008b). The prediction, calculation, and understanding
of the sources of geocoding error presented in this section warrant further investigation.

12.5 USES OF POSITIONAL ACCURACY
First and foremost, these quality metrics associated with the spatial values can be
included in the spatial analyses performed by researchers to determine the impact of this
uncertainty on their results. They also can be used for quality control and data validation.
For example, the geocode and accuracy can be used to validate other parts of the patient
abstract such as the dxCounty code, and conversely, a county can reveal inaccuracies in a
geocoded result. To attempt this, one simply needs to intersect the geocode with a county
layer to determine if the county indication is indeed correct. Also, positional accuracy
measures can help ensure that boundary cases at risk of misclassification are identified
before analyses are performed, allowing first-order estimates of the possible uncertainty
levels resulting from these data being grouped into erroneous classifications.
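
Intersecting a geocode with a county layer amounts to a point-in-polygon test. The sketch
below uses the standard ray-casting test on simple polygons in planar coordinates; the
county codes, the square county shapes, and the function names are hypothetical, and points
lying exactly on a boundary are not treated specially.

```python
def point_in_polygon(point, ring):
    """Ray-casting test: True if the point lies inside the simple polygon."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def county_for(point, counties):
    """Return the code of the first county polygon containing the point."""
    for code, ring in counties.items():
        if point_in_polygon(point, ring):
            return code
    return None


def check_county(point, recorded_code, counties):
    """Compare the recorded county code with the county found from the geocode."""
    found = county_for(point, counties)
    return {"recorded": recorded_code, "from_geocode": found,
            "agrees": found == recorded_code}


# Two hypothetical square counties sharing the border x = 10.
counties = {
    "001": [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)],
    "003": [(10.0, 0.0), (20.0, 0.0), (20.0, 10.0), (10.0, 10.0)],
}
print(check_county((12.0, 5.0), "001", counties))
# {'recorded': '001', 'from_geocode': '003', 'agrees': False}
```

A disagreement does not say which value is wrong; it only flags the record for review.
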
12.5.1 Importance of Positional Accuracy
1he importance o the positional accuracy in data produced by a geocoding process can-
not be understated. Volumes o literature in many disparate research ields are dedicated to
describing the potentially detrimental eects o using inaccurate data. lor a health-ocused
reiew, see Rushton et al. ,2006, and the reerences within. 1here are, at present, no set
standards as to the minimum leels o accuracy that geocodes must hae to be suitable in
eery circumstance, but in many cases common sense can be used to determine their appro-
priateness or a particular study`s needs. \hen not up to a desired leel o accuracy, the re-
searcher may hae no choice other than conducting a case reiew or manually moing cases
to desired locations using some orm o manual correction ,e.g., as shown in Goldberg et al.
|2008d|,.
Here, a few illustrative examples are provided to demonstrate only the simplest of the
problems that can occur, ranging from misclassification of subject data to misassignment of
related data. Even though the majority of studies do not report or know the spatial accuracy
of their geocoded data or how they were derived, some number usually is reported anyway.
This value for spatial accuracy can range from less than 1 meter to several kilometers. The
most common problem that inaccurate data can produce is shown in Figure 21. Here, it can
be seen that for a geocode lying close to the boundary of two geographic features, the
potential spatial error is large enough that the geocode could in reality be in either one of the
larger features.
These boundary cases represent a serious problem. Although the attributes and/or
classifications associated with being inside one polygon might be correct, one cannot be sure if
the positional error is larger than the distance to the boundary. The associated data
could/would be wrong when a parcel resides in two USPS ZIP Codes (on the border) or
when the USPS ZIP Code centroid is in the wrong (inaccurate) location. In either of these
cases, the wrong USPS ZIP Code data would be associated with the parcel.


Figure 21 Example of misclassification due to uncertainty (Google, Inc. 2008b)

This problem was documented frequently in the early automatic geocoding literature
(e.g., Gatrell 1989, Collins et al. 1998), yet there still is no clear rule for indicating the
certainty of a classification via a point-in-polygon association as a direct function of the spatial
accuracy of the geocode as well as its proximity to boundaries. Even if metrics describing this
phenomenon became commonplace, the spatial statistical analysis methods in common use
are not sufficient to handle these conditional associations of attributes. Certain fuzzy-logic
operations are capable of operating under these spatially-based conditional associations of
attributes, and their introduction to spatial analysis in cancer-related research could prove
useful. Zimmerman (2008) provides an excellent review of current spatial statistical methods
that can be used in the presence of incompletely or incorrectly geocoded data.
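
Although no accepted rule exists, boundary cases can at least be flagged for review before analysis. The following is a minimal sketch, assuming planar coordinates in meters, a polygon given as a vertex list, and a hypothetical `positional_error` radius taken from whatever accuracy metadata accompanies the geocode. It is a screening heuristic, not a measure of classification certainty.

```python
import math


def distance_to_segment(px, py, ax, ay, bx, by):
    """Shortest distance from point (px, py) to the segment (ax, ay)-(bx, by)."""
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project the point onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_boundary(px, py, polygon):
    """Shortest distance from a point to any edge of the polygon."""
    n = len(polygon)
    return min(
        distance_to_segment(px, py, *polygon[i], *polygon[(i + 1) % n])
        for i in range(n)
    )


def is_boundary_case(px, py, polygon, positional_error):
    """True if the error radius reaches the polygon boundary, so the
    point-in-polygon classification cannot be taken at face value."""
    return distance_to_boundary(px, py, polygon) <= positional_error
```

For a geocode 5 meters from a tract edge with a reported error of 10 meters, `is_boundary_case` returns True and the record would be set aside for closer inspection.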
However, it must be noted that some parcels and/or buildings can and do legitimately
fall into two or more administrative units (boundary classifications), such as those right along
the boundary of multiple regions. The county assignment for taxation purposes, for example,
traditionally has been handled by agreements between county assessor's offices in such cases,
meaning that spatial analysis alone cannot distinguish the correct attribute classification in all
cases.
Because of these shortcomings, some have argued that an address should be linked
directly with the polygon ID (e.g., CT) using lookup tables instead of through
point-in-polygon calculations (Rushton et al. 2006). When the required lookup tables exist for the
polygon reference data of interest this may prove a better option, but when they do not exist
the only choice may be a point-in-polygon approach. Best practices related to positional
accuracy are listed in Best Practices 34.
It cannot be stressed enough that in all cases, it is important for a researcher utilizing the
geocodes to determine if the reported accuracy suits the needs of the study. An example can
be found in the study presented earlier (in Section 2.4), which investigated whether or not living
near a highway and subsequent exposure to asbestos from brake linings and clutch pads has
an effect on the likelihood of developing mesothelioma. The distance decay of the
particulate matter is on the order of meters, so a dataset of geocodes accurate to the resolution of
the city centroids obviously would not suffice. Essentially, the scale of the phenomenon
being studied needs to be determined and the appropriate scale of geocodes used. When the
data are not at the desired/required level of accuracy, the researchers may have no other
choice but to conduct a case review, or manually move cases to desired (correct) locations
(more detail on this is covered in Sections 16 and 19).



Best Practices 34 Positional accuracy

Policy Decision: When and how can and should GPS be used to measure the positional
accuracy of geocoded data?
Best Practice:
If possible, GPS measurements should be used to obtain the ground truth accuracy of as much
geocoded output as possible. Covering large areas may not be a valid option for policy or budgetary
reasons, but this approach may be feasible for small areas.

Metadata should describe:
- Time and date of measurement
- Type of GPS
- Types of any other devices used
  - Laser distance meters

Policy Decision: When and how can and should imagery be used to measure the positional
accuracy of geocoded data?
Best Practice:
Imagery should be used to ground truth the accuracy of geocoded data if GPS is not an option.

Metadata should describe:
- Time, date, and method of measurement
- Type and source of imagery

Policy Decision: When and how can existing geocodes be used to measure the positional
accuracy of geocoded data?
Best Practice:
If old values for geocodes exist, they should be compared against the newly produced values every
time a new one is created.

If geocodes are updated or replaced, metadata should describe:
- Justification for change
- The old value

Policy Decision: When and how can and should georeferenced imagery be used to measure
the positional accuracy of geocoded data?
Best Practice:
If suitable tools exist, and if time, manpower, and budgetary constraints allow for georeferenced
imagery-based geocode accuracy measurements, they should be performed on as much data as possible.

Metadata should describe the characteristics of the imagery used:
- Source
- Vintage
- Resolution




Policy Decision: When and how can and should the positional accuracy metrics associated
with geocodes be used in research or analysis?
Best Practice:
At a minimum, the FGDC Content Standards for Digital Spatial Metadata (United States Federal
Geographic Data Committee 2008a) should be used to describe the quality of the output geocode.

The metrics describing the positional accuracy of geocodes should be used whenever analysis or
research is performed using any geocodes.

Ideally, confidence metrics should be associated with output geocodes and the entire process used to
create them, including:
- Accuracy
- Certainty
- Reliability

Confidence metrics should be utilized in the analysis of geocoded spatial data.

Policy Decision: When and how can and should limits on acceptable levels of positional
accuracy of data be placed?
Best Practice:
There should be no limits placed on the best possible accuracy that can be produced or kept.

There should be limits placed on the worst possible accuracy that can be produced or supported,
based on the lowest resolution feature that is considered a valid match (e.g., USPS ZIP Code
centroid, county centroid), and anything of lower resolution should be considered as a geocoding
failure.

A USPS ZIP Code centroid should be considered the lowest acceptable match.

Policy Decision: When and how can and should a geocoded data consumer (e.g., researcher)
ensure the accuracy of their geocoded data?
Best Practice:
A data consumer should always ensure the quality of their geocodes by requiring specific levels of
accuracy before they use them (e.g., for spatial analyses).

If the data to be used cannot achieve the required levels of accuracy, they should not be used.
















































13. REFERENCE DATA QUALITY
This section discusses the detailed issues involved in the spatial
and temporal accuracy of reference datasets, while also
introducing the concepts of caching and completeness.
13.1 SPATIAL ACCURACY OF REFERENCE DATA
Geographical bias was introduced earlier to describe how the accuracy of reference
features may be dependent on where they are located. This phenomenon can clearly be seen in
the accuracy reported in rural areas versus those reported in urban areas, due to two factors.
First, the linear-based feature interpolation algorithms used are more accurate when applied
to shorter street segments than they are when applied to longer ones, and rural areas have a
higher percentage of longer streets than do urban areas.
Second, the spatial accuracy of reference features themselves will differ across the entire
reference dataset. Again, in rural areas it has been shown that reference datasets are less
accurate than their urban counterparts. For example, the TIGER/Line files (United States
Census Bureau 2008d) have been shown to have higher spatial accuracy in urban areas with
short street segments. Additionally, as previously discussed, different reference datasets for
the same area will have different levels of spatial accuracy (e.g., NAVTEQ [NAVTEQ 2008]
may be better than TIGER/Lines).
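
The first factor, the dependence of linear interpolation error on segment length, can be illustrated with a minimal sketch. It assumes a straight segment with planar endpoint coordinates and an address range that runs from `low` at the start to `high` at the end; the function names are hypothetical, and real geocoders add refinements (parity, offsets from the centerline) that are omitted here.

```python
def interpolate_address(number, low, high, start, end):
    """Place house `number` along a straight segment by linear interpolation.

    low/high are the address range values at the start/end of the segment;
    start/end are (x, y) coordinates of the segment endpoints.
    """
    if not min(low, high) <= number <= max(low, high):
        raise ValueError("house number lies outside the address range")
    # A degenerate range (low == high) is placed at the segment midpoint.
    fraction = 0.5 if high == low else (number - low) / (high - low)
    x = start[0] + fraction * (end[0] - start[0])
    y = start[1] + fraction * (end[1] - start[1])
    return (x, y)


def interpolation_offset(segment_length, fraction_error):
    """Distance error caused by misjudging the position along the segment
    by `fraction_error` (e.g., 0.1 means 10 percent of the segment)."""
    return segment_length * fraction_error
```

The same 10 percent misjudgment of a house's position within its range yields roughly 10 meters of error on a 100-meter urban block but 100 meters on a 1,000-meter rural segment.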
One aspect of these accuracy differences can be seen in the resolution differences
depicted in Figure 8. A registry will need to make tradeoffs between the money and time they
wish to invest in reference data and the accuracy of the results they require. There currently
is no consensus among registries on this topic. Best practices related to reference dataset
spatial accuracy problems are listed in Best Practices 35. Note that a distinction needs to be
made between those geocoding with a vendor and those geocoding themselves. In the first
case, the registry may not have the authority to apply some of these best practices because it
may be up to the particular vendor, while in the second they will have that authority. Also, in
some instances it may be beneficial for a registry to partner with some other government
organization in performing these tasks (e.g., emergency response organizations or U.S.
Department of Health and Human Services), or to utilize their work directly.
13.2 ATTRIBUTE ACCURACY
The accuracy of the non-spatial attributes is as important as the spatial accuracy of the
reference features. This can clearly be seen in both the feature-matching and feature
interpolation components of the process. If the non-spatial attributes are incorrect in the reference
dataset, such as an incorrect or reversed address range for a street segment, a match may be
impossible or an incorrect feature may be chosen during feature matching. Likewise, if the
attributes are incorrect the interpolation algorithm may place the resulting geocode in the
wrong location, as in the common case of incorrectly defined address ranges. This is covered
in more detail with regard to the input address in Section 1.3.




Best Practices 35 Reference dataset spatial accuracy problems
Policy Decision: When should the feature spatial accuracy, feature completeness,
attribute accuracy, or attribute completeness of a reference dataset be improved or the
dataset abandoned?
Best Practice:
If the spatial accuracy of a reference dataset is sufficiently poor that it is the main
contributor to consistently low accuracy geocoding results, improvements or abandonment of a
reference dataset should be considered.

Simple examples of how to test some of these metrics can be found in Krieger et al. (2001)
and Whitsel et al. (2004).

Policy Decision: When and how can and should characteristics of the reference dataset be
improved?
Best Practice:
If the cost of undertaking a reference dataset improvement is less than the cost of
obtaining reference data of quality equivalent to the resulting improvement, improvement
should be attempted if the time and money for the task are available.

Policy Decision: When and how can and should the feature spatial accuracy of the
reference dataset be improved?
Best Practice:
If the spatial accuracy of the reference data is consistently leading to output with
sufficiently poor spatial accuracy it should be improved, if the time and money for the
task are available.

Improvements can be made by:
- Manual or automated conflation techniques (e.g., Chen C.C. et al. 2004)
- Using imagery (e.g., O'Grady 1999)
- Rubber sheeting (see Ward et al. 2005 for additional details)

Policy Decision: When and how can and should the attribute accuracy of the reference
dataset be improved?
Best Practice:
If the attribute accuracy of the reference data is consistently leading to a high proportion
of false positive or false negative feature-matching results it should be improved, if the
time and money for the task are available.

Improvements can be made by:
- Updating the aspatial attributes through joins with data from other sources (e.g., an
  updated/changed street name list published by a city)
- Appending additional attributes (e.g., actual house numbers along a street instead of a
  simple range)

13.3 TEMPORAL ACCURACY
Temporal accuracy is a measure of how appropriate the time period the reference
dataset represents is to the input data that are to be geocoded. This can have a large effect on
the outcome of the geocoding process and affects both the spatial and non-spatial attributes.
For example, although it is a common conception that the more recently created reference
dataset will be the most accurate, this may not always be the case. The geography of the built
environment is changing all the time as land is repurposed for different uses, cities expand
their borders, parcels are combined or split, street names change, streets are renumbered,
buildings burn and are destroyed or rebuilt, etc. Input address data collected at one point in
time most likely represent where the location existed at that particular instant in time.
Although these changes may only affect a small number of features, the work to correct them
in temporally inaccurate data versions may be time consuming. This results in reference
datasets from different periods of time having different characteristics in terms of the accuracy
of both the spatial and aspatial data they contain. This could be seen as one argument for
maintaining previous versions of reference datasets, although licensing restrictions may
prohibit their retention in some cases.
A temporal extent is an attribute associated with a piece of data describing a time
period for which it existed, or was valid, and is useful for describing reference datasets. Because
most people assume that the most recently produced dataset will be the most accurate, the
appropriateness of using a dataset from one time period over another usually is not
considered during the geocoding process. However, in some cases it may be more appropriate to
use the reference data from the point in time when the data were collected to perform the
geocoding process, instead of the most recent versions. Several recent studies have
attempted to investigate the question of what is the most appropriate reference dataset to use
based on its temporal aspect and time period elapsed since input data collection (e.g., Bonner
et al. 2003, Kennedy et al. 2003, McElroy et al. 2003, Han et al. 2004, 2005, Rose et al. 2004).
Although the aspatial attributes of historical reference datasets may be representative of
the state of the world when the data were collected, the spatial accuracy of newer datasets is
typically higher because the tools and equipment used to produce them have
improved in terms of precision over time. Barring the possibility that a street was actually
physically moved between two time periods, as by natural calamity perhaps, the representation in
the recent version will usually be more accurate than the older one. In these cases, the spatial
attributes of the newer reference datasets can be linked with the aspatial attributes from the
historical data. Most cities as well as the U.S. Census Bureau maintain the lineage of their
data for this exact purpose, but some skill is required to temporally link the datasets together.
The general practice when considering which reference dataset version to use is to progress
hierarchically from the most recent to the least recent. Best practices relating to the temporal
accuracy of reference datasets are listed in Best Practices 36.
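
The most-recent-first progression can be sketched as follows. This is an illustration only: the reference dataset versions are represented as plain dictionaries keyed by a normalized address string, and the names `geocode_hierarchically` and `VERSIONS` and the sample coordinates are hypothetical.

```python
def geocode_hierarchically(address, versions):
    """Try each reference dataset version from newest to oldest.

    versions maps a publication year to a dict of address -> (x, y).
    Returns (year, (x, y)) for the first version containing the address,
    or None if no version matches.
    """
    for year in sorted(versions, reverse=True):
        location = versions[year].get(address)
        if location is not None:
            return year, location
    return None


# Hypothetical example: a street renamed between the two versions.
VERSIONS = {
    2008: {"100 NEW ST": (10.0, 20.0)},
    2000: {"100 OLD ST": (10.2, 20.1)},
}
```

Returning the year alongside the location keeps a record of which version produced the match, so the result's temporal extent can be stored as metadata.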

Best Practices 36 Reference dataset temporal accuracy
Policy Decision: When and how can and should historical reference datasets be used
instead of temporally current versions?
Best Practice:
In most cases, a hierarchical approach should be taken from most recent first to oldest.

If the region of interest has undergone marked transformation in terms of the construction
or demolition of streets, renumbered buildings or renamed streets, or the merging or
division of parcels during the time period between when the address was current and the
time the address is to be geocoded, the use of historical data should be considered.

13.4 CACHED DATA
One low-cost approach for producing point-based reference datasets is to perform
geocode caching, which stores the results of previously derived geocodes produced from an
interpolation method. Also, in situations for which the running time of a geocoder is a
critical issue, this may be an attractive option. The concept of geocode caching also has been
termed empirical geocoding by Boscoe (2008, pp. 100). Using cached results instead of
recomputing them every time may result in substantial performance gains in cases when a
lengthy or complex feature interpolation algorithm is used. There is no consensus about the
economy of this approach. Different registries may have different practices, and only a few
registries currently make use of geocode caching. The most common case and strongest
argument for caching is to store the results of interactive geocoding sessions such that the
improvements made to aspects of the geocoding process while working on a particular address
can be re-leveraged in the future (e.g., creating better street centerline geometry or better
address ranges).
Geocode caching essentially creates a snapshot of the current geocoder configuration
(i.e., the state of the reference dataset and the world as it was at the publication date and the
feature-matching and interpolation algorithms that produce the geocodes). When the
reference data and feature interpolation algorithms do not change frequently, geocode caching
can be used. If, however, there is the possibility that the resulting geocode may be different
every time the geocoder is run (e.g., the case when any of the components of the geocoding
process are dynamic or intentionally changed or updated), using the cached data may
produce outdated data.
There are potential dangers to using geocode caches in terms of temporal staleness, or
the phenomenon whereby previously geocoded results stored in a cache become outdated
and no longer valid (i.e., low temporal accuracy), with validity being determined on a
per-registry basis because there is no consensus. Also, caching data at all may be moot if there is
little chance of ever needing to re-process existing address data that already have been
geocoded. As new geocoding algorithms are created the cached results produced by older
processes may be proven less and less accurate, and at a certain point it may become
apparent that these cached results should be discarded and a new version created and stored. In
the data-caching literature, there are generally two choices: (1) associate a time to live (TTL)
for each cached value upon creation, after which time it is invalidated and removed, or (2)
calculate a value for its freshness each time it is interrogated to determine its suitability and
remove once it has passed a certain threshold (Bouzeghoub 2004). There are presently no set
standards for determining values for either of these, with numerous criteria to be accounted
for in the first and complex decay functions possible for the second resulting from the
nature of the geocoding process as well as the nature of the ever-changing landscape. General
considerations relating to the assignment of TTL and the calculation of freshness are listed
in Table 31.
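
The two invalidation choices can be sketched side by side. This is a minimal illustration, not a recommended policy: the exponential decay with a `half_life` parameter is an assumed freshness function chosen only for simplicity, and the class and field names are hypothetical. Times are plain numbers (e.g., seconds) supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class CachedGeocode:
    x: float
    y: float
    created: float              # time at which the geocode was produced
    ttl: float = float("inf")   # time to live; infinite means "indefinite"


def is_expired(entry, now):
    """Choice (1): the entry is invalid once its TTL has elapsed."""
    return now - entry.created > entry.ttl


def freshness(entry, now, half_life):
    """Choice (2): an assumed decay function, halving every `half_life`."""
    age = max(0.0, now - entry.created)
    return 0.5 ** (age / half_life)


class GeocodeCache:
    """Cache that applies both checks each time an entry is interrogated."""

    def __init__(self, min_freshness, half_life):
        self.min_freshness = min_freshness
        self.half_life = half_life
        self._entries = {}

    def put(self, address, entry):
        self._entries[address] = entry

    def get(self, address, now):
        entry = self._entries.get(address)
        if entry is None:
            return None
        stale = (is_expired(entry, now)
                 or freshness(entry, now, self.half_life) < self.min_freshness)
        if stale:
            # Invalidate and remove; the caller must re-geocode the address.
            del self._entries[address]
            return None
        return entry
```

In practice a registry would likely archive a removed entry rather than delete it, and would choose TTL and decay parameters per geocode source along the lines of the considerations that follow.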

Table 31 TTL assignment and freshness calculation considerations for cached data
Consideration: TTL and freshness should depend on the source of the geocode
Example:
- GPS: indefinite TTL, high freshness
- Manual correction: indefinite TTL, high freshness
- Geocoded: time varying, medium freshness

Consideration: TTL and freshness should be based on the match probability
Example: Higher match score: longer TTL, higher freshness

Consideration: TTL and freshness should be based on the likelihood of geographic change
Example: High-growth area: shorter TTL, lower freshness

Consideration: TTL and freshness should depend on the update frequency of the reference data
Example: High frequency: shorter TTL, lower freshness

Consideration: TTL and freshness should correlate with agreement between sources
Example: High agreement: longer TTL, high freshness

Consideration: Freshness should correlate with time elapsed since geocode creation
Example: Long elapsed time: lower freshness

In all cases where caching is used, a tradeoff exists between the acceptable levels of
accuracy present in the old cached results and the cost of potentially having to recreate them.
Best practices relating to geocode caching are listed in Best Practices 37.
13.5 COMPLETENESS
Although the accuracy of a reference dataset can be considered a measure of its
precision, or how accurate the reference features it contains are, the completeness of a reference
dataset can be considered as a measure of recall. In particular, a more complete reference
dataset will contain more of the real-world geographic objects for its area of coverage than
would a less complete one. Using a more complete reference dataset, one can achieve better
results from the geocoding process. Similar to accuracy, levels of completeness vary both
between different reference datasets and within a single one. More in-depth discussions of
precision and recall are provided in Section 14.2.1.
A distinction should be made between feature and attribute completeness. As recall
measures, both refer to the amount of information maintained out of all possible
information that could be maintained. The former refers to a measurement of the number of
features contained in the reference dataset in comparison to all possible features that exist in
reality. The latter refers to a measurement of the amount of information (number of
attributes) contained per feature out of all information (possible attributes) that could
possibly be used to describe it.
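
Expressed as recall-style ratios, the two measures might be sketched as below. Both functions presuppose a gold standard (the set of features that truly exist, or the list of attributes that could describe a feature), which is rarely available in practice; the function names and inputs are hypothetical.

```python
def feature_completeness(reference_features, true_features):
    """Share of real-world features that appear in the reference dataset."""
    true_set = set(true_features)
    if not true_set:
        raise ValueError("the gold standard must contain at least one feature")
    return len(true_set & set(reference_features)) / len(true_set)


def attribute_completeness(feature, possible_attributes):
    """Share of possible attributes that are populated for one feature.

    feature is a dict of attribute name -> value; None and "" count as missing.
    """
    possible = list(possible_attributes)
    if not possible:
        raise ValueError("at least one possible attribute is required")
    present = [a for a in possible if feature.get(a) not in (None, "")]
    return len(present) / len(possible)
```

A dataset holding two of four known streets has a feature completeness of 0.5; a street record with a name and type but no address range endpoints has an attribute completeness of 0.5 against those four attributes.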







Best Practices 37 Geocode caching
Policy Decision: Should geocode caching be used?
Best Practice:
If addresses are to be geocoded more than once, the use of geocode caching should be
considered.

If geocoding speed is an issue, interpolation methods are too slow, and addresses are
geocoded more than once, geocode caching should be used.

Geocode results from interactive geocoding sessions should be cached.

Metadata should describe all aspects of the geocoding process:
- The feature matched
- The interpolation algorithm
- The reference dataset

Policy Decision: When should a geocode cache be invalidated (e.g., when does temporal
staleness take effect)?
Best Practice:
If reference datasets or interpolating algorithms are changed, the geocode cache should be
cleared.

Temporal staleness (freshness and/or TTL evaluation) should be calculated for both an
entire geocode cache as well as per geocode every time a geocode is to be created.

If a TTL has expired, the cache should be invalidated.

If freshness has fallen below an acceptable threshold, the cache should be invalidated.

If a cache is cleared (replaced), the original data should be archived.

There is no consensus as to how either of these completeness measures should be
calculated or evaluated because any measure would require a gold standard to be compared
against, resulting in very few cases for which either of these measures is reported with a
reference data source. Instead, completeness measurements usually are expressed as
comparisons against other datasets. For instance, it is typical to see one company's product touted as
"having the most building footprints per unit area" or "the greatest number of attributes,"
describing feature completeness and attribute completeness, respectively. Using these
metrics as anything other than informative comparisons among datasets should be avoided,
because if the vendor actually had metrics describing the completeness in quantitative
measures, they would surely be provided. Their absence indicates that these values are not known.
Some simple, albeit useful, quantitative measures that have been proposed are listed in Table
32. Note that a small conceptual problem exists for the third row of the table. TIGER/Line
files (United States Census Bureau 2008d) represent continuous features (address ranges),
whereas USPS ZIP+4 (United States Postal Service 2008a) databases represent discrete
features, but the street features themselves can be checked in terms of existence and address
range by making some modifications to the original structures (e.g., grouping all addresses in
the USPS ZIP+4 per street to determine the ranges and grouping by street name to
determine street existence). Best practices relating to the completeness of the reference datasets
are listed in Best Practices 38.

Table 32 Simple completeness measures
Completeness Measure
- True reference feature exists/non-existent in reference dataset
- True original address exists/non-existent as attribute of feature in reference dataset
- Compare one reference dataset to another (e.g., TIGER/Line files [United States
  Census Bureau 2008d] vs. USPS ZIP+4 [United States Postal Service 2008a])
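
The restructuring needed for the third measure, turning a discrete address list into per-street ranges so it can be compared with a range-based dataset, can be sketched as follows. The record layouts are assumptions made for illustration: discrete addresses arrive as (house number, street name) pairs and the range-based dataset is a dictionary of street name to (low, high). Real products carry many more fields and multiple segments per street.

```python
from collections import defaultdict


def derive_ranges(address_points):
    """Group discrete addresses by street and derive a (low, high) range each."""
    numbers = defaultdict(list)
    for number, street in address_points:
        numbers[street].append(number)
    return {street: (min(ns), max(ns)) for street, ns in numbers.items()}


def compare_to_ranges(derived, range_dataset):
    """Report, per street, whether the range-based dataset contains the street
    and whether its range covers the addresses seen in the discrete dataset."""
    report = {}
    for street, (low, high) in derived.items():
        if street not in range_dataset:
            report[street] = "missing street"
            continue
        ref_low, ref_high = range_dataset[street]
        if ref_low <= low and high <= ref_high:
            report[street] = "covered"
        else:
            report[street] = "range mismatch"
    return report
```

The resulting report is a comparison between two datasets, not an absolute completeness score, consistent with the caution expressed in the text.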

Best Practices 38 Reference dataset completeness problems
Policy Decision: When and how can and should the attribute completeness of the reference
dataset be improved?
Best Practice:
If the attribute completeness of the reference data is consistently leading to a high
proportion of false positive or false negative feature-matching results it should be improved,
if the time and money for the task are available.

Improvements can be made by:
- Filling in the missing aspatial attributes through joins with data from other sources
  (e.g., a street name file from the USPS)
- Appending local scale knowledge of alternate names using alias tables

Policy Decision: When and how can and should the feature completeness of the reference
dataset be improved?
Best Practice:
If the feature completeness of a reference dataset is consistently leading to a high
proportion of false negative feature-matching results it should be improved, if the time and
money for the task are available.

Improvements can be made by intersecting with other reference datasets that contain the
missing features (e.g., a local road layer being incorporated into a highway layer).



















































14. FEATURE-MATCHING QUALITY METRICS
This section describes the different types of possible matches
and their resulting levels of accuracy, and develops alternative
match rates.
14.1 MATCH TYPES
The result of the feature-matching algorithm represents a critical part of the quality of
the resulting geocode. Many factors complicate the feature-matching process and result in
different match types being achievable. In particular, the normalization and standardization
processes are critical for preparing the input data. If these algorithms do a poor job of
converting the input to a form and format consistent with that of the reference dataset, it will be
very difficult, if not impossible, for the feature-matching algorithm to produce a successful
result.
However, even when these processes are applied well and the input data and reference
datasets both share a common format, there still are several potential pitfalls. These
difficulties are exemplified by investigating the domain of possible outputs from the
feature-matching process. These are listed in Table 33, which shows the descriptions and causes of
each. An input address can have no corresponding feature in the reference dataset (i.e., the
"no match" case), or it can have one or more. These matches can be perfect, meaning that
every attribute is exactly the same between the input address and the reference feature, or
non-perfect, meaning that some of the attributes do not match between the two. Examples
that would result in some of these are depicted in Figure 22. Note that in most cases, an
ambiguous perfect match indicates either an error in the reference dataset (E-911 is working
toward getting rid of these), or incompletely defined input data matching multiple reference
features.
Once a single feature (or multiple features) is successfully retrieved from the reference
set by changing the SQL and re-querying if necessary (i.e., attribute relaxation), the
feature-matching algorithm must determine the suitability of each of the features selected through
the use of some measures. The real power of a feature-matching algorithm therefore is
twofold: (1) it first must be able to realize that no match has been returned, and then (2)
subsequently automatically alter and regenerate the SQL to attempt another search for matching
features using a different set of criteria. Thus, one defining characteristic distinguishing
different feature-matching algorithms is how this task of generating alternate SQL
representations to query the reference data is performed. Another is the measures used to determine
the suitability of the selected features.
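
The detect-and-regenerate loop can be sketched with Python's standard `sqlite3` module. Everything specific here is an assumption for illustration: the `streets` table, its columns, and the order in which attributes are relaxed are hypothetical, and a production matcher would typically relax attributes according to scoring rules rather than a fixed list.

```python
import sqlite3

# Hypothetical schema and relaxation order; a real geocoder would choose both
# to suit its own reference data.
COLUMNS = ["predir", "name", "suffix", "zip"]
RELAXATION_ORDER = ["suffix", "predir", "zip"]  # "name" is never relaxed


def build_query(address, ignored):
    """Build a parameterized SELECT over the attributes not yet relaxed."""
    active = [c for c in COLUMNS if c in address and c not in ignored]
    # Column names come from the fixed COLUMNS list, never from user input.
    where = " AND ".join(f"{c} = ?" for c in active)
    return f"SELECT id FROM streets WHERE {where}", [address[c] for c in active]


def match_with_relaxation(conn, address):
    """Query with all attributes, then drop one attribute at a time until at
    least one feature is returned. Returns (feature_ids, relaxed_columns)."""
    if "name" not in address:
        raise ValueError("a street name is required")
    ignored = []
    for column in [None] + RELAXATION_ORDER:
        if column is not None:
            ignored.append(column)
        sql, params = build_query(address, ignored)
        ids = [row[0] for row in conn.execute(sql, params)]
        if ids:
            return ids, list(ignored)
    return [], list(ignored)


def make_demo_db():
    """Create an in-memory reference table holding one street segment."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE streets (id INTEGER, predir TEXT, name TEXT, suffix TEXT, zip TEXT)"
    )
    conn.execute("INSERT INTO streets VALUES (1, 'N', 'MAIN', 'ST', '90089')")
    return conn
```

With the demo table, an input of "N MAIN AVE 90089" fails on the first query and matches feature 1 once the suffix is relaxed. Returning the list of relaxed columns lets the caller record why the match is non-perfect.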

Table 33 Possible matching outcomes with descriptions and causes
Outcome: Perfect match (Code: P)
Description: A single feature in the reference dataset could be matched to the input datum,
and both share every attribute.
Cause: The combination of input attributes exactly matches those of a single reference
feature.

Outcome: Non-perfect match (Code: Np)
Description: A single feature in the reference dataset could be matched to the input datum,
and both share some but not all attributes.
Cause: At least one, but not all, of the combinations of input attributes exactly matches
those of a single reference feature.

Outcome: Ambiguous perfect match (Code: Ap)
Description: Multiple features in the reference dataset could be matched to the input datum,
and each shares every attribute.
Cause: The combination of input attributes exactly matches those of multiple reference
features.

Outcome: Ambiguous non-perfect match (Code: Anp)
Description: Multiple features in the reference dataset could be matched to the input datum,
and each shares some but not all attributes.
Cause: At least one, but not all, of the combinations of input attributes exactly matches
those of multiple reference features.

Outcome: No match (Code: N)
Description: No features in the reference dataset could be matched to the input datum.
Cause: The combination of input attributes is not found in the reference dataset.
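
The outcome codes above can be assigned mechanically once candidate features have been retrieved. The sketch below is one possible reading, with stated assumptions: input and reference attributes are dictionaries, candidates are assumed to share at least some attributes with the input because they were returned by a query, and a single perfect candidate among several imperfect ones is treated as a perfect match. The function names are hypothetical.

```python
def classify_match(input_attrs, candidates):
    """Return the outcome code (P, Np, Ap, Anp, or N) for a set of candidates.

    input_attrs is a dict of attribute -> value; candidates is a list of
    reference feature dicts retrieved for that input.
    """
    if not candidates:
        return "N"
    perfect = [
        c for c in candidates
        if all(c.get(key) == value for key, value in input_attrs.items())
    ]
    if len(perfect) == 1:
        return "P"
    if len(perfect) > 1:
        return "Ap"
    return "Np" if len(candidates) == 1 else "Anp"


def is_acceptable(code):
    """Perfect and non-perfect non-ambiguous matches are treated as acceptable."""
    return code in ("P", "Np")
```

Storing the code with each geocode provides the match-type metadata needed when unacceptable matches are later reviewed, corrected, and re-processed.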


Figure 22 Examples of different match types
Much like the address normalization process, there are both simplistic and complex ways
to achieve this, and each has its particular strengths and weaknesses and is suitable under
certain conditions. This process of feature matching is tightly related to the computer
science field of record linkage. Many fundamental research questions and concepts
developed therein have been applied to this task of feature matching. Best practices related to
feature match types are listed in Best Practices 39.

Best Practices 39 Feature match types
Policy Decision: Which match types should be considered acceptable?
Best Practice: Perfect and non-perfect non-ambiguous matches should be considered
acceptable.

Policy Decision: Which match types should be considered unacceptable?
Best Practice: Ambiguous matches should not be considered acceptable.

Policy Decision: What should be done with unacceptable matches?
Best Practice: Non-acceptable matches should be reviewed, corrected, and re-processed.

Policy Decision: What metadata should be maintained?
Best Practice: Metadata should describe the reason why an action was taken (e.g., the
match type) and what action was taken.

Policy Decision: What actions can and should be taken to correct unacceptable matches?
Best Practice: At a minimum, manual review/correction and attribute relaxation should be
attempted.
14.2 MEASURING GEOCODING MATCH SUCCESS RATES
Match rates can be used to describe the completeness of the reference data with regard
to how much of the input data they contain, assuming that all input data are valid and should
rightfully exist within them. Match rates also can be used to test the quality of the input data
in the reverse case (i.e., when the reference data are assumed to be complete and unmatchable
input data are assumed to be incorrect).

Under no circumstances should a high match rate be understood as
equivalent to a high accuracy rate; the two terms mean fundamentally
different things.

A geocoder resulting in a 100 percent match rate should not be considered accurate if all
of the matches are to the city or county centroid level.
14.2.1 Precision and recall
Precision and recall metrics are often used to determine the quality of an information
retrieval (IR) strategy. This measurement strategy breaks the possible results from a retrieval
algorithm into two sets of data: (1) one containing the set of data that should have correctly
been selected and returned by the algorithm, and (2) another containing a set of data that is
actually selected and returned by an algorithm, with the latter one causing the problem
(Raghavan et al. 1989). The set of data that was actually returned may contain some data that
should not have been returned (i.e., incorrect data), and it may be missing some data that
should have been returned. In typical IR parlance, recall is a measure that indicates how
much of the data that should have been obtained actually was obtained. Precision is a
measure of the retrieved data's correctness.
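
The two measures can be sketched in a few lines, assuming that both the data that should have been returned and the data that actually were returned are available as sets of record identifiers. The function name `precision_recall` is illustrative only.

```python
def precision_recall(relevant, retrieved):
    """Compute (precision, recall) for two sets of record identifiers.

    relevant:  records that should have been selected and returned.
    retrieved: records that actually were selected and returned.
    """
    correct = len(relevant & retrieved)  # returned and should have been
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall
```

Returning 0.0 for an empty set is a convention chosen for this sketch; some analyses would treat that case as undefined instead.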
14.2.2 Simplistic match rates
In the geocoding literature, the related term match rate often is used to indicate the
percentage of input data that were able to be assigned to a reference feature. Although this is
related to the recall metric, the two are not exactly equivalent. The match rate, as typically
defined in Equation 5, does not capture the notion of the number of records that should
have been matched. The match rate usually is defined as the number of matched records
(i.e., records from the input data that were successfully linked to a reference feature) divided
by the total number of input records:

\[ \text{match rate} = \frac{\#\ \text{Matched Records}}{\#\ \text{All Records}} \]

Equation 5 Simplistic match rate

This version of match rate calculation corresponds to Figure 23(a), in which the match
rate would be the difference between the areas of records attempted and records matched.
14.2.3 More representative match rates
It may be of more interest to qualify the denominator of this match rate equation in
some way to make it closer to a true recall measure, eliminating some of the false negatives.
To do this, one needs to determine a more representative number of records with addresses
that should have matched. For example, if a geocoding process is limited to using a
local-scale reference dataset with limited geographic coverage, input data corresponding to areas
that are outside of this coverage will not be matchable. If they are included in the match rate,
they are essentially false negatives; they should have been simply excluded from the
calculation instead. It might therefore be reasonable to define the match rate by subtracting these
records with addresses that are out of the area from the total number of input records:

\[ \text{match rate} = \frac{\#\ \text{Matched Records}}{\#\ \text{All Records} - \#\ \text{Records out of state, county, etc.}} \]

Equation 6 Advanced match rate

This match rate calculation corresponds to Figure 23(b), in which the match rate would
be the difference between the areas of records matched and the records within the coverage
area, not simply just records attempted, resulting in a more representative, higher match rate.
Registries need to use caution when utilizing this because using these other attributes (e.g.,
county, USPS ZIP Code) for determining what should have been rightfully included or
excluded for geocodability within an area also is subject to error if those attributes themselves
are erroneous as well.
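
The simplistic and advanced match rates can be compared side by side with a short sketch. It assumes each input record is a dictionary with a boolean `matched` flag and a `zip` value, and that the coverage area of the reference dataset can be approximated by a set of ZIP Codes; these field names and the function names are assumptions made for illustration.

```python
def simplistic_match_rate(records):
    """Matched records divided by all input records."""
    if not records:
        return 0.0
    matched = sum(1 for r in records if r["matched"])
    return matched / len(records)


def advanced_match_rate(records, coverage_zips):
    """Matched records divided by the records that fall inside coverage.

    coverage_zips: assumed set of ZIP Codes covered by the reference
    dataset; records whose ZIP Code is not in it are removed from the
    denominator only.
    """
    matched = sum(1 for r in records if r["matched"])
    out_of_area = sum(1 for r in records if r["zip"] not in coverage_zips)
    denominator = len(records) - out_of_area
    return matched / denominator if denominator else 0.0
```

Because the coverage test relies on the ZIP Code recorded with each input record, an erroneous ZIP Code shifts a record into or out of the denominator, which is the source of error the text cautions about.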



Figure 23 Match rate diagrams
14.2.4 A generalized match rate
This approach can be generalized even further. There are several categories of data that
cannot be matched by the feature-matching algorithm. For instance, data that
are outside of the area of coverage of the reference dataset, as in the last example, will possess
this property. Input data that are in a format not supported by the geocoding process will as
well.
For example, if the geocoding process does not support input in the form of named
places, intersections, or relative directions, input in any one of these forms will never be able
to be successfully matched to a reference feature and will be considered "invalid" by this
geocoder (i.e., the input data may be fine but the reference data do not contain a match).
Finally, a third category comprises data that are simply garbage, and will never be matched to a
reference feature simply because they do not describe a real location. This type of data is
most typically seen because of data entry errors, when the wrong data have been entered into
the wrong field upon entry (e.g., a person's birth date being entered as his or her address).
One can take this representative class of invalid data, or data that are impossible to match
(or impossible to match without additional research into the patient's usual residence address
at the time of diagnosis), into account in the determination of a match rate as follows:

\[ \text{match rate} = \frac{\#\ \text{Matched Records}}{\#\ \text{All Records} - \#\ \text{Records impossible to match}} \]

Equation 7 Generalized match rate

This match rate calculation corresponds to Figure 23(c), in which the match rate is no
longer based on the total number of records attempted; instead, it only includes records that
should have been matchable, based on a set of criteria applied. In this case, the set of
addresses that should have successfully matched is made up by the area resulting from the
union of each of the areas. The match rate then is the difference between this area and the area
of records that matched, resulting in an even more representative, higher match rate.
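
As a worked illustration, the three match rates can be compared on one set of counts. The numbers are hypothetical and chosen only to make the arithmetic easy to follow: 10,000 input records, 8,500 matched, 600 outside the coverage area, 300 in an unsupported format, and 100 garbage entries.

```latex
\begin{align*}
\text{Simplistic (Equation 5):}\quad
  & \frac{8{,}500}{10{,}000} = 85.0\% \\
\text{Advanced (Equation 6):}\quad
  & \frac{8{,}500}{10{,}000 - 600} = \frac{8{,}500}{9{,}400} \approx 90.4\% \\
\text{Generalized (Equation 7):}\quad
  & \frac{8{,}500}{10{,}000 - (600 + 300 + 100)} = \frac{8{,}500}{9{,}000} \approx 94.4\%
\end{align*}
```

The numerator is unchanged in all three cases; only the denominator shrinks as more records are recognized as impossible to match.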
14.2.5 Non-match classification
The difficult part of obtaining a match rate using either of the two latter equations
(Equation 6 or Equation 7) is classifying the reason why a match was not obtainable for
input data that cannot be matched. If one were processing tens of thousands of records of
input data in batch and 10 percent resulted in no matches, it might be too difficult and
time-consuming to go through each one and assign an explanation.
Classifying input data into general categories such as valid or invalid input format should
be fairly straightforward. This could be accomplished for address input data simply by
modifying the address normalization algorithm to return a binary true/false along with its output
indicating whether or not it was able to normalize into a valid address. One could also use
the lower-resolution attributes (e.g., USPS ZIP Code) to get a general geographic area to
compare with the coverage of the reference dataset for classification as inside or outside the
coverage area of the reference dataset. Although not exactly precise, these two options could
produce first-order estimates for the respective number of non-matches that fall into each
category and could be used to derive more representative values for match rates given the
reference dataset constraints of a particular geocoding process.
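
The two first-order checks described above can be sketched as follows. The sketch assumes a normalizer that reports success as a boolean and a coverage area approximated by a set of ZIP Codes; `toy_can_normalize` is a deliberately crude stand-in and not a real address normalizer, and all names are hypothetical.

```python
def toy_can_normalize(address):
    """Placeholder normalizer check: a house number followed by more text."""
    parts = address.split()
    return len(parts) >= 2 and parts[0].isdigit()


def classify_non_match(record, can_normalize, coverage_zips):
    """Assign a coarse reason to a record that failed to match.

    record: dict with 'address' and 'zip' keys (assumed layout).
    can_normalize: callable returning True if the address normalizes.
    coverage_zips: assumed set of ZIP Codes covered by the reference data.
    """
    if not can_normalize(record["address"]):
        return "invalid_format"
    if record.get("zip") not in coverage_zips:
        return "outside_coverage"
    return "unexplained"
```

Counting the records in each category gives the quantities subtracted in the denominators of Equation 6 and Equation 7; records labeled "unexplained" remain in the denominator as true non-matches.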
14.3 ACCEPTABLE MATCH RATES
An acceptable match rate is a specific match rate value that a geocoding process must
meet such that the geocoded data can be considered valid for use in a research study. What
constitutes an acceptable match rate is a complex subject and includes many factors such as
the type of feature matched to or the particular linkage criteria used at a registry. Further, it
needs to be stated that the overall match rate comes from both the input data and the
reference data, which together constrain the total value. Other than early reports by Ratcliffe
(2004), an exhaustive search at the time of this writing found no other published work
investigating this topic. There is a wealth of literature on both the selection bias resulting from
match rates as well as how these rates may effectively change between geographies (c.f.,
Oliver et al. 2005, and the references within), but a qualitative value for an acceptable match
rate for cancer-related research has not been proposed. It is not possible to recommend
using the percentages Ratcliffe (2004) defined, as they are derived from and meant to be
applied to a different domain (crime instead of health), but it would be interesting to repeat his
experiments in the health domain to determine if health and crime have similar cutoffs.
Further research into using the more advanced match rates just described would be useful. What
can be stated generally is that an acceptable match rate will vary by study, with the primary
factor being the level of geographic aggregation that is taking place. Researchers will need to
think carefully about whether the match rates they have achieved allow their geocoded data to
safely be used for drawing valid conclusions. Each particular study will need to determine if the
qualities of its geocodable versus non-geocodable data may be indicative of bias in demographic
or tumor characteristics, from which researchers should draw conclusions on the suitability and
representativeness of their data (Oliver et al. 2005).
14.4 MATCH RATE RESOLUTION
The discussion thus far has developed a measure of a holistic-level match rate, which
is a match rate for the entire address as a single component. An alternative to this is to use
an atomic-level match rate, which is a match rate associated with each individual attribute
that together composes the address. This type of measure relates far more information about
the overall match rate because it defines it at a higher resolution (i.e., the individual attribute
level as opposed to the whole address level). Essentially, this extends the concept of match
rate beyond an overall percentage for the dataset as a whole to the level of
per-each-geocoded-result.
To achieve this type of match rate resolution, ultimately all that is required is
documentation of the process of geocoding. If each process applied, from normalization and
standardization to attribute relaxation, recorded or reported the decisions that were made as it
processed a particular input datum along with the result it produced, this per-feature match
rate could be obtained and an evaluation of the type of address problems in one's input
records could be conducted.
For instance, if probabilistic feature matching was performed, what was the uncertainty
cutoff used, and what were the weights for each attribute that contributed to the composite
weight? If deterministic feature matching was used, which attributes matched and which
ones were relaxed and to what extent? This type of per-feature match rate is typically not
reported with the output geocode when using commercial geocoding software, as many of
the details of the geocoding process used are hidden "under the hood," although it is part of
the feature-matching process. However, codes pertaining to the general match process are
generally available. Best practices related to success rates (match rates) are listed in Best
Practices 40.
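
If a geocoding process did record, for every input record, which attributes matched and which were relaxed, atomic-level match rates could be derived with a sketch like the following. The per-record layout (a dictionary of attribute name to a matched flag) and the function name are assumptions; commercial geocoders generally do not expose their decisions in this form.

```python
from collections import Counter


def atomic_match_rates(results):
    """Per-attribute match rates from per-record match flags.

    results: list of dicts mapping attribute name -> True if that
    attribute matched exactly, False if it was relaxed or failed.
    """
    attempted = Counter()
    matched = Counter()
    for record in results:
        for attribute, was_matched in record.items():
            attempted[attribute] += 1
            if was_matched:
                matched[attribute] += 1
    return {name: matched[name] / attempted[name] for name in attempted}
```

A low rate for a single attribute (for example, the street suffix) points to a specific class of address problem in the input records, which a single holistic percentage cannot show.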





Best Practices 40 Success (match) rates
| Policy Decision | Best Practice |
|---|---|
| Which metrics should be used to describe the success rate of feature-matching algorithms? | At a minimum, feature-matching success should be described in terms of match rates. |
| How should match rates be calculated? | At a minimum, match rates should be computed using the simplistic match rate formula. If constraints permit, more advanced match rates should be calculated using the other equations (e.g., the advanced and generalized match rate formulas). Metadata should describe the type of match rate calculated and variables used along with how they were calculated. |
| How should an advanced match rate be calculated? | First-order estimates for the number of input addresses outside the coverage area for the current set of reference datasets should be calculated using lower resolution reference datasets (e.g., USPS ZIP Code reference files). This number should be subtracted from the set of possible matches before doing the match rate calculation. The metadata should describe the lower resolution reference dataset used for the calculation. |
| How should a generalized match rate be calculated? | If the normalization algorithm can output an indication of why it failed, this should be used for classification, and the resulting classification used to derive counts. This number should be subtracted from the set of possible matches before doing the match rate calculation. |
| At what resolution and for what components can and should match rates be reported? | Match rates should be reported for all aspects of the geocoding process, at both the holistic and atomic levels. |
| How can atomic-level match rates be calculated? | If the geocoding process is completely transparent, information about the choices made and output of each component of the geocoding process can be measured and combined to calculate atomic-level match rates. |


15. NAACCR GIS COORDINATE QUALITY CODES
This section introduces the NAACCR GIS Coordinate Quality
Codes and discusses their strengths and weaknesses.
15.1 NAACCR GIS COORDINATE QUALITY CODES DEFINED
For geocoding output data to be useful to consumers, metadata describing the quality
associated with them are needed. To this end, NAACCR has developed a set of GIS
Coordinate Quality Codes (Hofferkamp and Havener 2008, p. 162) that indicate at a high level the
type of data represented by a geocode. It is crucial that these quality codes be associated with
every geocode produced at any time by any registry.

Without such baseline codes associated with geocodes, researchers
will have no idea how good the data they run their studies on are,
because that depends on the study size, resolution, etc. Short of
follow-up contact with the data provider or performing the geocoding
themselves, the researchers will therefore have no clue as to how
representative their results are.

Abbreviated versions of these codes are listed in Table 34, and correspond roughly to
the hierarchy presented earlier. For exact codes and definitions, refer to Data Item #366 of
Standards for Cancer Registries: Data Standards and Data Dictionary (Hofferkamp and Havener
2008, p. 162).

Table 34 NAACCR recommended GIS Coordinate Quality Codes (paraphrased)
| Code | Description |
|---|---|
| 1 | GPS |
| 2 | Parcel centroid |
| 3 | Match to a complete street address |
| 4 | Street intersection |
| 5 | Mid-point on street segment |
| 6 | USPS ZIP+4 centroid |
| 7 | USPS ZIP+2 centroid |
| 8 | Assigned manually without data linkage |
| 9 | 5-digit USPS ZIP Code centroid |
| 10 | USPS ZIP Code centroid of Post Office Box or Rural Route |
| 11 | City centroid |
| 12 | County centroid |
| 98 | Coordinate quality is unknown |
| 99 | Geocoding was attempted but unable or unwilling to assign coordinates |

Likewise, researchers should refrain from using any data that do not have accuracy
metrics like the codes in the previous table, and they should insist that these be reported in
geocoded data they obtain. It is up to the researcher to decide whether or not to use
geocodes with varying degrees of reported quality, but it should be clear that incorporating data
without quality metrics can and should lower the confidence that anyone can have in the
results produced. Further, the scientific community at large should require that research
undergoing peer review for possible scientific publication indicate the lineage and accuracy
metrics for the data used as a basis for the studies presented, or at least note its absence as a
limitation of the study.
There are three points to note about the present NAACCR GIS Coordinate Quality
Codes and other similar schemes for ranking geocodes (e.g., SEER census tract certainty
[Goldberg et al. 2008c]). The first is that code 98 (coordinate quality unknown) is
effectively the same as having no coordinate quality at all. Therefore, utilization of this code
should be avoided as much as possible because it essentially endorses producing geocodes
without knowing anything about coordinate quality.
Second, the codes listed in this table are exactly what they indicate that they are:
qualitative codes describing characteristics of the geocodes. No quantitative values can be derived
from them and no calculations can be based upon them to determine such things as
direction or magnitude of the true error associated with a geocode. Thus, they serve little function
other than to group geocodes into classes that are (rightfully or wrongfully) used to
determine their suitability for a particular purpose or research study.
Finally, the current standard states "Codes are hierarchical, with lower numbers having
priority" (Hofferkamp and Havener 2008, p. 162). When taken literally, the standard only
discusses the priority that should be given to one geocode over another, not the actual
accuracy of geocodes; however, it nonetheless has ramifications on the geocoding process
because geocoding developers may use this to guide their work. Without specifically stating it,
this table can be seen in one light to imply a hierarchical accuracy scheme, with lower values
(e.g., 1) indicating a geocode of higher accuracy and higher values (e.g., 12) indicating a
geocode of lower accuracy.
Unfortunately, this may not be correct in all cases and geocoding software developers
and users need to be aware that the choice of which is the "best" geocode to choose/output
should not be determined from the ranks in this table alone. Currently, however, most
commercial geocoders do in fact make use of hierarchies such as this in the rules that determine
the order of geocodes to attempt, which may not be as good as human intervention, and is
definitely incorrect in some cases. For instance, out of approximately 900,000 street
segments in California that have both a ZCTA and place designation in the TIGER/Line files
(where both the left and right side values are the same for the ZCTA and Place) (United
States Census Bureau 2008d), approximately 300,000 street segments have corresponding
ZCTA areas that are larger than the corresponding place areas for the same segment. Recall
that matching to a feature with a smaller area and calculating its centroid is more likely to
result in a geocode with greater accuracy. Taken together, it is clear that in the cases when a
postal address fails to match and a matching algorithm relaxes to try the next feature type in
the implied hierarchy, one-third of the time, choosing the USPS ZIP Code is the wrong
choice (ignoring the fact that ZCTA and USPS ZIP Codes are not the same; see Section
5.1.4 for details). Goldberg et al. (2008c) can be consulted for further discussion of this
topic.
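
One alternative to a fixed fallback hierarchy is to compare the areas of the candidate features for the specific address and pick the smallest. The sketch below assumes the areas of the candidate ZCTA and place polygons are already known for the street segment in question; the function name and the tuple layout are hypothetical.

```python
def smallest_area_fallback(candidates):
    """Choose the fallback feature type with the smallest area.

    candidates: list of (feature_type, area) pairs for one address,
    with all areas expressed in the same unit.
    """
    if not candidates:
        raise ValueError("no candidate features to choose from")
    feature_type, _area = min(candidates, key=lambda pair: pair[1])
    return feature_type
```

For a segment whose place polygon is smaller than its ZCTA, this returns the place rather than following the implied code ordering. Area is only a proxy for positional accuracy, so the choice should still be recorded in the metadata.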
It should be clear that although the GIS coordinate quality codes such as those
advocated by NAACCR are good first steps toward geocoding accountability, there is still much
work to be done before they truly represent qualitative values about the geocodes that they
describe. Abe and Stinchcomb (2008, p. 124) clearly articulate the need for "geocoding
software [to] automatically record a quantitative estimate of the positional accuracy of each
geocode based on the size and spatial resolution of the matched data source, [which] could
be used to provide a positional 'confidence interval' to guide the selection of geocoded
records for individual spatial analysis research projects." Best practices related to GIS
Coordinate Quality Codes are listed in Best Practices 41.

Best Practices 41 GIS Coordinate Quality Codes
| Policy Decision | Best Practice |
|---|---|
| When and which GIS coordinate quality codes should be used? | At a minimum, the NAACCR GIS Coordinate Quality Codes specified in Standards for Cancer Registries: Data Standards and Data Dictionary (Hofferkamp and Havener 2008, p. 162) should always be associated with any geocoded output. Geocode qualities of less than full street address (code 3) should be candidates for manual review. |
| When and how can and should NAACCR GIS Coordinate Quality Codes be assigned? | NAACCR GIS Coordinate Quality Codes should always be assigned in the same manner, based on the type of reference feature matched and the type of feature interpolation performed. |
| What other metadata can and should be reported? | If possible, metadata about every decision made by the geocoding process should be reported along with the results (and stored outside of the present NAACCR record layout). |
| Should any geocodes without NAACCR GIS Coordinate Quality Codes be used for research? | Ideally, any geocodes without NAACCR GIS Coordinate Quality Codes should not be used for research. If geocodes without NAACCR GIS Coordinate Quality Codes must be used, this should be stated as a limitation of the study. |
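
The manual-review rule in Best Practices 41 can be expressed as a small check. The enumeration below mirrors the paraphrased codes of Table 34; the member names are invented for this sketch and are not official NAACCR identifiers, and the function name is hypothetical.

```python
from enum import IntEnum


class GisCoordinateQuality(IntEnum):
    """Paraphrased Table 34 codes; member names are illustrative only."""
    GPS = 1
    PARCEL_CENTROID = 2
    STREET_ADDRESS = 3
    STREET_INTERSECTION = 4
    STREET_SEGMENT_MIDPOINT = 5
    ZIP4_CENTROID = 6
    ZIP2_CENTROID = 7
    MANUAL_WITHOUT_LINKAGE = 8
    ZIP5_CENTROID = 9
    ZIP_OF_PO_BOX_OR_RURAL_ROUTE = 10
    CITY_CENTROID = 11
    COUNTY_CENTROID = 12
    UNKNOWN = 98
    NOT_ASSIGNED = 99


def is_manual_review_candidate(code):
    """True if the geocode quality is below a full street address match."""
    return GisCoordinateQuality(code) > GisCoordinateQuality.STREET_ADDRESS
```

Comparing code values this way follows the priority ordering stated in the standard; as discussed above, that ordering should not be read as a strict ranking of positional accuracy.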























































Part 4: Common Geocoding Problems



Throughout this document, potential problems regarding the geocoding process have been
discussed as each component has been introduced. This part of the document will list
specific problems and pitfalls that are commonly encountered, and provide advice on the best and
recommended ways to overcome them. In all cases, the action(s) taken should be
documented in metadata that accompany the resulting geocode, and the original data should be
maintained for historical lineage.


















































16. QUALITY ASSURANCE/QUALITY CONTROL
This section provides insight into possible methods for
overcoming problems that may be encountered in the geocoding
process.
16.1 FAILURES AND QUALITIES
As discussed throughout the text of this document, an address may fail to geocode to an
acceptable level of accuracy (including not geocoding at all) for any number of reasons
including errors within the address itself, errors in the reference dataset, and/or the uncertainty
of a particular interpolation algorithm. In Table 35, classes of problems from the previous
sections have been listed along with example cases or reasons why they would have occurred
for the input address that should be "3620 S. Vermont Ave, Los Angeles, CA 90089." These
classifications will be used in the following sections to enumerate the possible options and
describe the recommended practice for each type of case. Note that each registry may have
its own regulations that determine the protocol of action regarding how certain classes of
problems are handled, so some of the recommended solutions may not be applicable
universally. In addition to these processing errors, there are also "acceptable quality" levels that
may be required at a registry. The current standard of reporting to which vendors are
currently held responsible is found within the NAACCR GIS Coordinate Quality Codes
(Hofferkamp and Havener 2008, p. 162) as listed in Table 34. Although the shortcomings with
these codes have been listed in Section 15.1, these will be used to guide the recommended
decisions and practices. The items in these tables are by no means exhaustive; registries may
face many more that are not listed. For these cases, the remainder of this section
provides the details as to why a particular option is recommended, with the hope that
similar logic can be used in determining the appropriate action in the appropriate circumstance(s).



Table 35 Classes of geocoding failures with examples for true address 3620 S. Vermont Ave, Los Angeles CA 90089

| Index | Geocoded | Problem | Example |
|---|---|---|---|
| 1 | No | Failed to geocode because the input data are incorrect. | 3620 S Verment St, Los Angeles, CA 90089 |
| 2 | No | Failed to geocode because the input data are incomplete. | 3620 Vermont St, Los Angeles, CA 90089 |
| 3 | No | Failed to geocode because the reference data are incorrect. | Address range for 3600-3700 segment in reference data is listed as 3650-3700 |
| 4 | No | Failed to geocode because the reference data are incomplete. | Street segment does not exist in reference data |
| 5 | No | Failed to geocode because the reference data are temporally incompatible. | Street segment name has not been updated in the reference data |
| 6 | No | Failed to geocode because of combination of one or more of 1-5. | 3620 Vermont St, Los Angeles CA 90089, where the reference data have not been updated to include the 3600-3700 address range for segment |
| 7 | Yes | Geocoded to incorrect location because the input data are incorrect. | 3620 S Verment St, Los Angeles, CA 90089 was (incorrectly) relaxed and matched to 3620 Ferment St, Los Angeles, CA 90089 |
| 8 | Yes | Geocoded to incorrect location because the input data are incomplete. | 3620 Vermont St, Los Angeles, CA 90089 was arbitrarily (incorrectly) assigned to 3620 N Vermont St, Los Angeles, CA 90089 |
| 9 | Yes | Geocoded to incorrect location because the reference data are incorrect. | The address range for 3600-3700 is reversed to 3700-3600 |
| 10 | Yes | Geocoded to incorrect location because the reference data are incomplete. | Street segment geometry is generalized straight line when the real street is extremely curvy |
| 11 | Yes | Geocoded to incorrect location because of interpolation error. | Interpolation (incorrectly) assumes equal distribution of properties along street segment |
| 12 | Yes | Geocoded to incorrect location because of dropback error. | Dropback placement (incorrectly) assumes a constant distance and direction |
| 13 | Yes | Geocoded to incorrect location because of combination of one or more of 7-12. | The address range for 3600-3700 is reversed to 3700-3600, and dropback of length 0 is used |



Table 36 Quality decisions with examples and rationale

| Decision | Practice | Rationale |
|---|---|---|
| When only a USPS PO box is available, yet a USPS ZIP+4 is correct, should the geocoded address be based on the USPS ZIP+4 centroid or the 5-digit USPS ZIP Code centroid? | The address should be geocoded to the USPS ZIP+4. | The 5-digit USPS ZIP Code will be based on the USPS PO box address, which is less accurate than the USPS ZIP+4 based on the address. |
| When only an intersection is available, should the centroid of the intersection or the centroid of one of the properties on the corners be used? | The centroid of one of the corner properties should be used. | This increases the likelihood that the geocode is on the correct location from 0 (the intersection centroid will never be correct) to 1/(number of corners). |
| If the location of the address is known, should the geocode be manually moved to it (e.g., manually dragged using a map interface)? | The geocode should be moved if the location is known. | A known location should be used over a calculated one. |
| If the location of the building for an address is known, should the geocode be manually moved to its centroid? | The geocode should be moved if the building is known. | A known location should be used over a calculated one. |
| If only a named place is available as an address, should research be performed to determine an address or should the next lower resolution attribute be used (e.g., city name)? | Research for the address of a named place should be attempted before moving to the next lower resolution attribute. | The address information may be trivially available, and it will dramatically improve the resulting geocode. |
| If the geocode is less accurate than USPS ZIP Code centroid (GIS Coordinate Quality Code 10), should it be reviewed for manual correction? | Geocodes with accuracy less than GIS Coordinate Quality Code 10 should be reviewed for manual correction. | After USPS ZIP Code level certainty, the appropriateness of using a geocode in all but large area aggregation studies diminishes rapidly. |
| Should manual steps be taken in order to get a geocode for every record? | Manual processing should be attempted to get a geocode for every record. | Patients should not be excluded from research studies because their address was not able to be geocoded. |
| If a street segment can be matched, but the address cannot, should the center point of the segment or the centroid of the minimum bounding rectangle (MBR) encompassing the segment be used? | The centroid of the MBR should be used. | In the case where a street is straight, the centroid of the MBR would be the center point of the street. In the case of a curvy street, using the centroid minimizes the possible error from any other point on the street. |
| If two connected street segments are ambiguously matched, should their intersection point or the centroid of the MBR encompassing them be used? | The centroid of the MBR should be used. | In the case where the two streets are straight, the centroid of their MBR would be the intersection point between them (assuming their lengths are similar and the angle between them is 180 degrees). In the case of two curvy streets, the angle between them being sharp, or the lengths being dramatically different, using the centroid minimizes the possible error from any other point on the two streets. |
| If two disconnected street segments are ambiguously matched, should the centroid of the MBR encompassing them be used? | The centroid of the MBR should be used. | The centroid of their MBR minimizes the possible error from any other point on the two streets. |
| If an address geocodes differently now than it has in the past, should all records with that geocode be updated? | All records should be updated to the new geocode if it is more accurate. | Research studies should use the most accurate geocode available for a record. |
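
The MBR centroid recommended in several rows of Table 36 is simple to compute. The sketch below assumes street segments are given as lists of (x, y) vertices in a projected coordinate system, so that plain averaging of the bounding extents is meaningful; the function name is illustrative.

```python
def mbr_centroid(points):
    """Centroid of the minimum bounding rectangle of a set of vertices.

    points: list of (x, y) tuples drawn from one or more street segments.
    """
    if not points:
        raise ValueError("at least one vertex is required")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
```

For a straight segment from (0, 0) to (10, 0) the result is the segment midpoint, (5.0, 0.0); for a curved segment the centroid may fall off the street itself, which is why the choice should be noted in the geocode's metadata.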
17. ADDRESS DATA PROBLEMS
This section introduces various types of problems at registries
that occur with address data (e.g., dxAddress, dxCity, dxZIP,
dxState), including lifecycle and formatting problems.
17.1 ADDRESS DATA PROBLEMS DEFINED
Details regarding the exact issues related to a selected set of representative postal
addresses are presented next to illustrate the ambiguities that are introduced as one iteratively
removes attributes. The best-possible-case scenario is presented first. Best practices relating
to the management of common addressing problems are listed in Best Practices 42.

Best Practices 42 Common address problem management
| Policy Decision | Best Practice |
|---|---|
| What types of lists of common input address problems and solutions should be maintained? | Lists of problems that are both common (occur more than once) and uncommon with recommended solutions should be maintained and consulted when problems occur. Examples of common problems include: 15 error in dxCounty |
17.2 THE GOLD STANDARD OF POSTAL ADDRESSES
The following address example represents the gold standard in postal address data. It
contains valid information in each of the possible attribute fields and indicates enough
information to produce a geocode down to the sub-parcel unit or the floor level.

"3620 ½ South Vermont Avenue East, Unit 444, Los Angeles, CA, 90089-0255"

In the geographic scale progression used during the feature-matching algorithm, a search
for this address is first confined by a state, then by a city, then by a detailed USPS ZIP Code
to limit the number of possible candidate features to within an area. Next, street name
ambiguity is removed by the prefix and suffix directionals associated with the name, "South" and
"East," respectively, as well as the street type indication, "Avenue." Parcel identification then
becomes attainable through the use of the street number, "3620," assuming that a parcel
reference dataset exists and is accessible to the feature-matching algorithm. Next, a 3-D
geocode can finally be produced from the sub-parcel identification by combining the unit
indicators, "½" and "Unit 444," to determine the floor and unit on the floor, assuming that this
is an apartment building and that a 3-D building model is available to the feature-matching
algorithm. Note that both "½" and "444" can mean different things in different localities
(e.g., they can both refer to subdivided parcels, subdivisions within a parcel, or even lots in a
trailer park).
1his example illustrates the best-possible-case scenario in terms o postal address speci-
ication and reerence dataset aailability, and is or most registries, rarely encountered. 1his
is because reference datasets of this quality do not exist for many large regions, details such
as the floor plan within a building are seldom needed, and input data are hardly ever
specified this completely. It often is assumed that utilization of the USPS ZIP+4 database will
provide the gold standard reference dataset, but it actually is only the most up-to-date source
for address validation alone and must be used in conjunction with other sources to obtain
the spatial aspect of an output geocode, which may be subject to some error. The practice of
transforming an incompletely described address into a gold standard address (completely
described) is performed by most commercial geocoders, as evidenced by the full attributes
of the matched feature that are generally included with the geocode result. Best practices
relating to gold standard addresses are listed in Best Practices 43.
Best Practices 43 Creating gold standard addresses
Policy Decision: Should non-gold standard addresses have information added or removed
to make them "gold standard"?
Best Practice: In the case where legitimate attributes of an address are missing and can be
non-ambiguously identified, they should be added to the address.

Metadata should include:
- Which attributes were added
- Which sources were used
17.3 ATTRIBUTE COMPLETENESS
The following depiction of standard address data is far more commonly encountered
than the gold standard address:

"3620 Vermont Avenue, Los Angeles, CA, 90089"

Here, the street directional, sub-parcel, and additional USPS ZIP Code components of
the address have been removed. A feature-matching algorithm processing this case could
again fairly quickly limit its search for matching reference features to within the USPS ZIP
Code as in the last example, but from that point, problems may arise due to address
ambiguity, the case when a single input address can match to more than one reference feature,
usually indicative of an incompletely described input address. This can occur at multiple
levels of geographic resolution for numerous reasons.
This last address shows the case of street segment ambiguity, where multiple street
segments all could be chosen as the reference feature for interpolation based on the
information available in the input address. First, multiple streets within the same USPS ZIP Code
can, and routinely do, have the same name, differing only in the directional information
associated with them indicating which side of a city they are on. Further, the address range
information commonly associated with street reference features that is used to distinguish
them, which will be covered in more detail later, often is repeated for these streets (e.g.,
3600-3700 South Vermont, 3600-3700 North Vermont, and 3600-3700 Vermont). Thus, the
feature-matching algorithm may be presented with multiple options capable of satisfying the
input address.
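
As a minimal sketch of how street segment ambiguity arises, the following Python fragment filters a toy reference dataset by USPS ZIP Code, street name, and address range. The Segment class, the candidate_segments function, and the three reference records are hypothetical illustrations (the ranges simply mirror the Vermont example), not part of any geocoder described in this guide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    """A hypothetical street reference feature with an address range."""
    pre_dir: Optional[str]  # prefix directional such as "N" or "S", or None
    name: str
    low: int
    high: int
    zip_code: str


def candidate_segments(number: int, name: str, zip_code: str,
                       segments: List[Segment],
                       pre_dir: Optional[str] = None) -> List[Segment]:
    """Return every segment that is consistent with the input attributes."""
    matches = []
    for seg in segments:
        if seg.zip_code != zip_code or seg.name.lower() != name.lower():
            continue
        if not (seg.low <= number <= seg.high):
            continue
        # Only filter on the directional when the input supplies one.
        if pre_dir is not None and seg.pre_dir != pre_dir:
            continue
        matches.append(seg)
    return matches


# Illustrative reference data: identical ranges, differing only by directional.
REFERENCE = [
    Segment("S", "Vermont", 3600, 3700, "90089"),
    Segment("N", "Vermont", 3600, 3700, "90089"),
    Segment(None, "Vermont", 3600, 3700, "90089"),
]

ambiguous = candidate_segments(3620, "Vermont", "90089", REFERENCE)
resolved = candidate_segments(3620, "Vermont", "90089", REFERENCE, pre_dir="S")
print(len(ambiguous), len(resolved))
```

Without the directional, all three segments satisfy the input; supplying it leaves a single candidate.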
Moving to a finer scale, street address ambiguity is the case when a single input
address can match to more than one reference address on a single street segment as in the case
where a correct street segment can unambiguously be determined, but a specific location
along the street cannot because the address number is missing:

"South Vermont Avenue East, Los Angeles, CA, 90089"

At a still finer scale, sub-parcel address ambiguity is the case when a single input
address can match to more than one reference feature that is contained within the same parcel
of land. This problem often arises for large complexes of buildings such as Co-op City in
Bronx, NY, or as in the following example of the Cardinal Gardens residence buildings on
the USC campus, all sharing the same postal street address:

"3131 S. McClintock Avenue, Los Angeles, CA, 90007"

In these ambiguous cases, most feature-matching algorithms alone do not contain
enough knowledge to be able to pick the correct one. A detailed analysis of the different
methods for dealing with these cases is presented in Section 18.
17.4 ATTRIBUTE CORRECTNESS
"831 North Nash Street East, Los Angeles, CA, 90245"

This case exemplifies the beginning of a "slippery slope," the correctness of address
attributes. This example lists the USPS ZIP Code "90245" as being within the city "Los
Angeles." In this particular case, this association is incorrect. The city "Los Angeles" does not
contain the USPS ZIP Code "90245", which may at first be considered to be a typographical
error in the USPS ZIP Code. However, the USPS ZIP Code is in reality correct, but it is part
of an independent city, "El Segundo," which is within Los Angeles County. Therefore, one
of these attributes is indeed wrong and should be ignored and not considered during the
feature selection process, or better yet, corrected and replaced with the appropriate value.
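
A consistency check of this kind could be sketched as follows. The ZIP_TO_CITY table and the check_city_zip function are assumptions made for illustration; a real USPS ZIP Code can be associated with more than one city name, so a production lookup would need to allow several acceptable values per code.

```python
from typing import Dict, Optional

# Hypothetical lookup table; a registry would derive this from its own
# reference data rather than hard-coding it.
ZIP_TO_CITY: Dict[str, str] = {
    "90245": "El Segundo",
    "90089": "Los Angeles",
}


def check_city_zip(city: str, zip_code: str,
                   zip_to_city: Dict[str, str] = ZIP_TO_CITY) -> dict:
    """Compare a reported city with the city expected for the ZIP Code."""
    expected: Optional[str] = zip_to_city.get(zip_code)
    if expected is None:
        status = "zip_not_in_reference"
    elif expected.lower() == city.strip().lower():
        status = "consistent"
    else:
        status = "conflict"
    # The original values are kept so that any later correction can be
    # recorded in the metadata alongside what was reported.
    return {"reported_city": city, "zip_code": zip_code,
            "expected_city": expected, "status": status}


result = check_city_zip("Los Angeles", "90245")
print(result["status"], result["expected_city"])
```

The check only flags the conflict; deciding which attribute to correct remains a separate step that should be documented.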
There are many reasons why these types of errors can and do occur. For instance, people
sometimes refer to the city or locality in which they live by the name of their neighborhood,
instead of the city's official political name or their post office name. As neighborhood names
are often only locally known, they are often not included in national-scale reference datasets,
and therefore are not applicable and can appear to be incorrect. In Los Angeles, one obvious
example is "Korea Town," an area several miles in size slightly southwest of downtown LA
that most residents of the city would recognize by name immediately, but would not be
found as an official name in the TIGER/Line files. Also, the reverse is possible as in the
previous El Segundo address example. People may mistakenly use the name "Los Angeles"
instead of the valid city name "El Segundo," because they lack the local knowledge and
assume that because the location is part of the "Los Angeles Metropolitan Area," "Los
Angeles" is the correct name to use.

This disconnect between local-level knowledge possessed by the people
creating the data (e.g., the patient describing it or the hospital staff
recording it) and the non-local-level knowledge possessed by the sources
creating the reference datasets presents a persistent difficulty in the
geocoding process.

Similarly, USPS ZIP Codes and ZCTAs are maintained by separate organizations that do
not necessarily share all updates with each other, resulting in the possibility that the data may
not be consistent with each other. As a result, it is often the case that the address data
input is referring to the USPS ZIP Code, while the reference data source may be using the
ZCTA (e.g., in TIGER/Line files).
Finally, USPS ZIP Code routes have a dynamic nature, changing over time for the
purpose of maintaining efficient mail delivery; therefore, the temporal accuracy of the reference
data may be an issue. USPS ZIP Codes may be added, discontinued, merged, or split, and
the boundaries for the geographic regions they are assumed to represent may no longer be
valid. Thus, older address data entered as valid in the past may no longer have the correct
(i.e., current) USPS ZIP Code. Although these changes generally can be considered rare, they
may have a large impact on research studies in particular regions. Best practices relating to
input data correctness are listed in Best Practices 44.
Best Practices 44 Input data correctness
Policy Decision: Should incorrect portions of address data be corrected?
Best Practice: If information is available to deduce the correct attributes, they
should be chosen and associated with the input address.

Metadata should include:
- The information used in the selection
- The attributes corrected
- The original values

17.5 ADDRESS LIFECYCLE PROBLEMS
The temporal accuracy of address data further depends on what stage in the address
lifecycle both the input address and the reference data are at. New addresses take time to get
into reference datasets after they are created, resulting in false-negative matches from the
feature-matching algorithm. Likewise, addresses remain in reference datasets for some time
after they have been destroyed, resulting in false positives. For new construction in many
areas, addresses are assigned by county/municipal addressing staff after a developer has
received permission to develop the lots. How and when the phone companies and USPS are
notified of the new address thereafter depends on the developer, staffing issues, and other
circumstances, but this practice does occur. Thus, a new address may not appear in reference
data for some time although it is already being reported at the diagnosing facility. Similarly,
upon destruction, an address may still appear to be valid within a reference dataset for some
time when it is in fact invalid. Also, just because an address is not in the reference dataset
today does not mean that it was invalid in the past (e.g., the time period when the address was
reported). These issues need to be considered when dealing with address data whose lifecycle
status could be in question. Also, the length of time an individual was at an address (i.e.,
tenure of address) should be considered in research projects. Best practices related to address
lifecycle problems are listed in Best Practices 45.
Best Practices 45 Address lifecycle problems
Policy Decision: When and how can address lifecycle problems be accommodated in the
geocoding process?
Best Practice: Address lifecycle problems can be overcome by obtaining the most recent
address reference data for the region as soon as it becomes available, and by maintaining
historical versions once new ones are obtained.

Policy Decision: When and how should historical reference datasets be used?
Best Practice: The use of historical reference data may provide higher quality geocodes
in the cases of:
- Historical addresses where changes have been made to the streets or numbering
- Diagnosis date may approximate the date the diagnosis address was in existence
- If available, tenure of address should be taken into consideration during research
  projects
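
One way a registry might put maintained historical versions to use is to pick the reference dataset vintage closest in time to the diagnosis date. The sketch below assumes, purely for illustration, that the latest vintage released on or before the diagnosis date is preferred; the pick_vintage function and the vintage dates are hypothetical, and a newly created address may in practice only appear in a later vintage.

```python
from bisect import bisect_right
from datetime import date
from typing import List


def pick_vintage(vintages: List[date], diagnosis: date) -> date:
    """Return the latest vintage released on or before the diagnosis date.

    Falls back to the earliest vintage when the diagnosis predates them all.
    """
    ordered = sorted(vintages)
    if not ordered:
        raise ValueError("no reference dataset vintages available")
    index = bisect_right(ordered, diagnosis)
    return ordered[index - 1] if index > 0 else ordered[0]


# Made-up release dates for three archived reference datasets.
VINTAGES = [date(2000, 1, 1), date(2004, 1, 1), date(2008, 1, 1)]
chosen = pick_vintage(VINTAGES, date(2005, 6, 30))
print(chosen.isoformat())
```

If the chosen vintage fails to match, re-attempting against the neighboring vintages is a natural extension of the same idea.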
17.6 ADDRESS CONTENT PROBLEMS
In many cases, the content of the address used for input data will have errors. These can
include addresses with missing, incorrect, or extra information. For all of these cases there
are two options and choosing the correct one will depend upon the certainty obtainable for
the attributes in question that can be determined from inspecting both the other attributes
and the reference dataset. Such errors may be corrected or left incorrect. It should be noted
that in some cases, this extra information may be useful. For example, "101 Main Street Apt
5" might be either "N Main St" or "S Main St," but perhaps only one is an apartment
building. Best practices related to address content problems are listed in Best Practices 46.
Best Practices 46 Address content problems
Policy Decision: What can and should be done with addresses that are missing attribute
information?
Best Practice: If a correct reference feature can be unambiguously identified in a reference
dataset from the amount of information available, the additional missing information from
the reference feature should be appended to the original address, and denoted as such in the
metadata record to distinguish it as assumed data.

If a reference feature cannot be unambiguously identified, the missing data should remain
absent.

Policy Decision: What can and should be done with addresses that have incorrect attribute
information?
Best Practice: If the information that is wrong is obviously the effect of an easily correctable
data entry error (e.g., data placed into the wrong field), it should be corrected and indicated
in the metadata.

This action should only be taken if it can be proven through the identification of an
unambiguous reference feature corresponding to the corrected data that this is the only
possible explanation for the incorrect data.

If it can be proven that there is more than one reference feature that could correspond to the
corrected data, or there are multiple equally likely options for correcting the data, it should
be left incorrect.

Policy Decision: What can and should be done with addresses that have extra attribute
information?
Best Practice: If the extra information is clearly not an address attribute and/or is the result
of data entry error, it can be removed and this must be indicated in the metadata.

It must be proven, through the use of the reference dataset, that this is the only possible
reason why this extraneous data should be declared as such before this removal can be made.

Extraneous information such as unit, floor, building name, etc. should be moved into the
Supplemental Field (NAACCR Item #2335) so that it can be retained for possible utilization
at a later time.

If there are equally probable options as to why this information was included, it should be
retained.

Policy Decision: What is the best way to correct address errors?
Best Practice: In the ideal case, addresses should be validated as they are entered at the
hospital using, at a minimum, the USPS ZIP+4 database.
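
The rule for missing attributes (fill them in only when exactly one reference feature fits, and record what was added) could be sketched as below. The complete_address function, the attribute names, and the two reference records (including the ZIP Code 90027) are illustrative assumptions rather than a prescribed implementation.

```python
from typing import Dict, List, Optional, Tuple

Address = Dict[str, Optional[str]]


def complete_address(address: Address,
                     reference: List[Dict[str, str]]
                     ) -> Tuple[Address, Dict[str, object]]:
    """Fill missing attributes only when exactly one reference feature fits."""
    known = {k: v for k, v in address.items() if v is not None}
    matches = [ref for ref in reference
               if all(ref.get(k) == v for k, v in known.items())]
    if len(matches) != 1:
        # Ambiguous or unmatched: the missing data remain absent.
        return dict(address), {"added": [], "matches": len(matches)}
    completed = dict(address)
    added = []
    for key, value in address.items():
        if value is None and key in matches[0]:
            completed[key] = matches[0][key]
            added.append(key)
    # The metadata marks the added attributes as assumed data.
    return completed, {"added": added, "matches": 1,
                       "note": "assumed from reference feature"}


# Made-up reference features for illustration only.
REFERENCE = [
    {"number": "3620", "pre_dir": "S", "name": "Vermont", "zip": "90089"},
    {"number": "3620", "pre_dir": "N", "name": "Vermont", "zip": "90027"},
]

fixed, meta = complete_address(
    {"number": "3620", "pre_dir": None, "name": "Vermont", "zip": "90089"},
    REFERENCE)
unchanged, meta2 = complete_address(
    {"number": "3620", "pre_dir": None, "name": "Vermont", "zip": None},
    REFERENCE)
print(fixed["pre_dir"], meta["added"], meta2["matches"])
```

In the first call the ZIP Code singles out one feature, so the directional is added and logged; in the second call two features remain possible and nothing is changed.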

17.7 ADDRESS FORMATTING PROBLEMS
Incorrectly formatted addresses and addresses with non-standard abbreviations should
be handled by the address normalization and standardization processes. If not, human
intervention may normalize and standardize them. Best practices related to address formatting are
listed in Best Practices 47.
Best Practices 47 Address formatting problems
Policy Decision: What can and should be done with address data that are incorrectly
formatted?
Best Practice: If the address is formatted in a known format, the address normalization
process could be applied to try to identify the components of the address and subsequently
reformat it into a more standard format, which should be noted in the metadata.

If the format of the original data is unrecognizable or the address normalization fails, it
should be left in its original format.

Policy Decision: What can and should be done with address data that include non-standard
abbreviations?
Best Practice: The address normalization and standardization components of the geocoding
process should be applied to correct the data and the corrections should be noted in the
metadata.

If these processes fail, the data should be left in its original format.

Policy Decision: What should be done with extraneous address data?
Best Practice: Any extra information describing the location or address should be moved
into the Supplemental Field (NAACCR Item #2335) for retention in the case that it becomes
useful in the future.
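
A toy version of the standardization step for non-standard abbreviations might look like the following. The SUFFIXES table is a deliberately tiny, illustrative subset, and standardize_street is a hypothetical helper; the sketch only touches the trailing street-type token and leaves the input untouched when that token is not recognized.

```python
import re
from typing import Dict, Tuple

# Small illustrative table mapping spellings to standard suffix
# abbreviations; a real table would be far larger.
SUFFIXES: Dict[str, str] = {
    "AVENUE": "AVE", "AV": "AVE", "AVE": "AVE",
    "STREET": "ST", "STR": "ST", "ST": "ST",
    "BOULEVARD": "BLVD", "BLV": "BLVD", "BLVD": "BLVD",
}


def standardize_street(street: str) -> Tuple[str, Dict[str, object]]:
    """Standardize the trailing street-type token; keep the original on failure."""
    tokens = re.sub(r"[.,]", " ", street).upper().split()
    if not tokens or tokens[-1] not in SUFFIXES:
        # Standardization failed: leave the data in its original format.
        return street, {"changed": False, "original": street}
    tokens[-1] = SUFFIXES[tokens[-1]]
    normalized = " ".join(tokens)
    # The original value is retained so the correction can be noted in metadata.
    return normalized, {"changed": normalized != street, "original": street}


clean, meta = standardize_street("3620 s. vermont av.")
kept, meta_kept = standardize_street("3620 Vermont Xyz")
print(clean)
print(kept)
```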

17.8 RESIDENCE TYPE AND HISTORY PROBLEMS
Not knowing the type or tenure of address data can introduce uncertainty into the
resulting geocode that is not captured merely with a quality code. This shortcoming usually is
listed as a limitation of a study and is indicative of a larger public health data issue: these
data are not collected during the primary data collection, after which point they generally are
difficult to obtain. The missing information relates to items such as the tenure of residence,
whether it is the patient's home or work address, whether it is a seasonal address, and
whether the address is really representative of the patient's true location if they move
frequently or spend a lot of time traveling or on the road. As such, it is recommended that a
tenure of residence attribute (i.e., length of time at address) also be associated with an address
so that researchers will have a basic understanding of how well this address really represents
the location of a patient. This fits with the current trend of opinions in the registry
community (e.g., Abe and Stinchcomb 2008).
The collection of historical addresses may not be practical for all addresses collected, but
could certainly be attempted in small subsets of the total data to be used in small studies.
Currently, the NAACCR record layout does not include fields for these data items, so
these would need to be stored outside of the current layout. In the future, a hierarchical and
extendable format such as Health Level Seven (HL7) (Health Level Seven 2007) could be
adopted or embedded to capture these additional attributes within the NAACCR layout. Best
practices related to conceptual problems are listed in Best Practices 48.


Best Practices 48 Conceptual problems
Policy Decision: What can and should be done to alleviate address conceptual problems?
Best Practice: As much data as possible should be included about the type of address
reported along with a record including:
- Tenure of residence
- Indication of current or previous address
- Indication of seasonal address or not
- Indication of residence or work address
- Housing type (e.g., single family, apartment building)
- Percent of day/week/month/year spent at this address

18. FEATURE-MATCHING PROBLEMS
This section discusses the various types of problems that
occur during feature matching, as well as possible processing
options that are available for non-matched addresses.
18.1 FEATURE-MATCHING FAILURES
There are two basic reasons why feature matching can fail: (1) ambiguously matching
multiple features, and (2) not matching any features. When this occurs, the address can either
remain non-matched and be excluded from a study or an attempt can be made to reprocess
it in some different form or using another method. Recent research has shown that if a
non-matchable address and the patient data it represents are excluded from a study, significant
bias can be introduced. In particular, residents in certain types of areas are more likely to
report addresses that are non-matchable (e.g., rural areas) and therefore data from such areas
will be underrepresented in the study. It follows that simply excluding non-matchable
addresses from a study is not recommended (Gregorio et al. 1999, Kwok and Yankaskas 2001,
Durr and Froggatt 2002, Bonner et al. 2003, Oliver et al. 2005). For this reason, researchers
and registries are advised to re-attempt feature matching by:

- Hierarchical geocoding, or using iteratively lower resolution portions of the input
address for geocoding
- Feature disambiguation, or trying to disambiguate between the ambiguous
matches
- Attribute imputation, or trying to impute the missing data that caused the
ambiguity
- Pseudocoding, or determining an approximate geocode from other information
- Composite feature geocoding, or deriving and utilizing new reference features
based on the ambiguous matches
- Waiting it out, simply doing nothing and attempting geocoding after a period of
time (e.g., after the reference datasets have been updated).

Best practices relating to feature-matching failures are listed in Best Practices 49. Similar
to the warning that match rates may be indicative of bias in one's geocoded data (Section
14.3), researchers need to be aware that using any of the following procedures to obtain a
geocode for all of their data may also introduce bias into their datasets. A careful evaluation
of the bias introduced by the use of these methods should be undertaken to determine if
this may be an issue for one's particular dataset. This is an ongoing area of research and
more detailed investigations into this topic are required before specific advice can be given
on how to identify and deal with these problems.
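
As a first, very rough screen for the kind of bias described above, match rates can be compared across area types before any records are excluded. The match_rates function and the toy records below are hypothetical; a large gap between groups is only a prompt for closer evaluation, not a test of bias in itself.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def match_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the share of matched addresses for each area type."""
    totals: Dict[str, int] = defaultdict(int)
    matched: Dict[str, int] = defaultdict(int)
    for group, is_matched in records:
        totals[group] += 1
        if is_matched:
            matched[group] += 1
    return {group: matched[group] / totals[group] for group in totals}


# Made-up records: (area type, whether feature matching succeeded).
RECORDS = [
    ("urban", True), ("urban", True), ("urban", True), ("urban", False),
    ("rural", True), ("rural", False), ("rural", False), ("rural", False),
]

rates = match_rates(RECORDS)
print(rates)
```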

Best Practices 49 Feature-matching failures

Policy Decision: When and how can and should non-matchable addresses be handled?
Best Practice: All non-matchable addresses should be re-attempted using:
- Attempt to obtain more information from source
- Hierarchical geocoding
- Feature disambiguation
- Attribute imputation
- Composite feature geocoding

Policy Decision: When and how can and should ambiguous feature matches be handled?
Best Practice: Any time an ambiguous feature match occurs, only a single feature (which may
be a composite feature) should be used for calculating the resulting geocode.

If extra information is available that can be used to determine the correct feature, then it
should be, and the metadata should record what was used and why that feature was chosen.

If extra information is not available and/or the correct feature cannot be identified, a geocode
resulting from the interpolation of a lower resolution feature, composite feature, or bounding
box should be returned.

Policy Decision: When should a lower-resolution feature be returned from a feature-matching
algorithm?
Best Practice: If the relative predicted certainty produced from feature interpolation using an
attribute of lower resolution (e.g., USPS ZIP Code after street address is ambiguous) is less
than that resulting from using a composite feature (if the features are topologically connected)
or a bounding box (if they are not topologically connected), it should be returned.

Policy Decision: When should a derived composite feature be used for feature interpolation?
Best Practice: If the matched features are topologically connected and if the predicted
certainty produced from feature interpolation using a composite feature (e.g., street segments
joined together) is less than that resulting from using an attribute of lower resolution, it
should be used for interpolation.




Policy Decision: When should a derived bounding box be used for feature interpolation?
Best Practice: If the matched features are not topologically connected and if the relative
predicted certainty produced from feature interpolation using a bounding box that
encompasses all matched features is less than that resulting from using a lower resolution, it
should be used for feature interpolation.

Policy Decision: How and when can and should missing attributes be imputed?
Best Practice: Whether or not to impute missing attribute information will depend on the
subjectivity of the registry or researcher.

Metadata should indicate:
- Which attributes are imputed
- The sources used for imputing them
- The original values of any attributes that have been changed

Policy Decision: How and when should pseudocoding be used?
Best Practice: Whether or not to pseudocode will depend on the subjectivity of the registry or
researcher.

Metadata should indicate:
- Which attributes were used to determine the pseudocode
- The calculation used for approximating the pseudocode

Policy Decision: How and when can and should geocoding be re-attempted at a later date after
the reference datasets have been updated?
Best Practice: Geocoding should be re-attempted at a later date after the reference datasets
have been updated when it is obvious that the geocoding failed because the reference datasets
were out-of-date (e.g., geocoding an address in a new development that is not present in
current versions of a dataset).
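
The comparisons in Best Practices 49 can be read as choosing whichever fallback feature yields the smaller area of uncertainty. The sketch below assumes that reading (the table's "relative predicted certainty" is treated as an uncertainty area where smaller is better) and assumes ties go to the lower-resolution feature; choose_fallback and its parameters are hypothetical names.

```python
def choose_fallback(lower_resolution_area: float,
                    derived_area: float,
                    topologically_connected: bool) -> str:
    """Pick the fallback feature with the smaller area of uncertainty.

    The derived feature is a composite feature when the ambiguous matches
    are topologically connected, and a bounding box when they are not.
    """
    derived = "composite feature" if topologically_connected else "bounding box"
    if derived_area < lower_resolution_area:
        return derived
    return "lower-resolution feature"


# Illustrative areas in arbitrary square units.
print(choose_fallback(10.0, 0.5, True))
print(choose_fallback(10.0, 0.5, False))
print(choose_fallback(1.0, 5.0, False))
```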




18.1.1 Hierarchical Geocoding
The first approach, hierarchical geocoding, is the one most commonly attempted. The
lower resolution attribute chosen depends both on the reason why geocoding failed in the
first place as well as the desired level of accuracy and confidence that is required for the
research study, and is subject to the warnings regarding implied accuracies within arbitrary
feature hierarchies as discussed in Section 15.1. To make the choice of lower resolution feature
more accurate, one could use information about the ambiguous features themselves. If the
two or more features returned from the feature-matching algorithm are of the same level of
geographic resolution, the most probable course of action is to return the next level of
geographic resolution to which they both belong. For example, if two streets are returned and
both are in the same USPS ZIP Code, then a geocode for that USPS ZIP Code should be
returned. If the two streets are in separate USPS ZIP Codes, yet the city is the same, the
geocode for the city should be returned. The level of accuracy for each of these would be
the same as that of the corresponding level of geographic resolution presented earlier in
Figure 20.
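
The "next level to which they both belong" rule can be sketched as a walk up a hierarchy until all ambiguous candidates agree. The HIERARCHY ordering, the attribute names, and the sample candidates (including the ZIP Code 90027) are assumptions for illustration; as noted above, any such hierarchy is subject to the warnings in Section 15.1.

```python
from typing import Dict, List, Optional, Tuple

# Assumed hierarchy, ordered from finest to coarsest resolution.
HIERARCHY = ["zip", "city", "county", "state"]


def common_level(candidates: List[Dict[str, str]]) -> Optional[Tuple[str, str]]:
    """Return the finest level (and its value) shared by all candidates."""
    for level in HIERARCHY:
        values = {candidate.get(level) for candidate in candidates}
        if len(values) == 1 and None not in values:
            return level, values.pop()
    return None


# Made-up ambiguous matches for illustration.
same_zip = [
    {"street": "N Vermont", "zip": "90089", "city": "Los Angeles", "state": "CA"},
    {"street": "S Vermont", "zip": "90089", "city": "Los Angeles", "state": "CA"},
]
different_zip = [
    {"street": "N Vermont", "zip": "90027", "city": "Los Angeles", "state": "CA"},
    {"street": "S Vermont", "zip": "90089", "city": "Los Angeles", "state": "CA"},
]

print(common_level(same_zip))
print(common_level(different_zip))
```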
18.1.2 Feature Disambiguation
In the second approach, feature disambiguation, an attempt is made to determine
which is the correct choice of the possible options. How this is done depends on why the
ambiguity occurred as well as any other information that may be available to help in the
choice of the correct option. Ambiguity can arise in the rare case that two separate reference
features are described by the same attributes, but this usually indicates an error in the
reference dataset and will not be discussed here.
Much more likely is the case in which ambiguity results from the input data not being
described with enough detail, such as omitting a directional field or the house number. Here,
disambiguation typically requires the time and subjectivity of a registry staff member, and is
essentially interactive geocoding, but it could be done after the fact. The staff member
selects one of the ambiguous matches as correct based on other information associated with
the input data, or by reasoning what they have in common and returning the result of what
can be deduced from this. The staff member performing the geocoding process can take into
account any type of extra information that could be used to indicate and select the correct
one. Going back to the source of the data (i.e., the hospital) to obtain some of this
information may or may not be an option; if it is, it should be attempted.
For instance, if an input address was simply "Washington Township, NJ" without any
form of a street address, USPS ZIP Code, or county (of which there are multiple), but it was
known that the person was required to visit a hospital in a certain county due to particular
treatment facilities being available, the county of the hospital could be assumed (Fulcomer et
al. 1998). If a second hypothetical address, "1200 Main St.," geocoded in the past, but now
after E-911 implementation the street has been renamed and renumbered such that the new
address is "80 N. Main Street," and the reference data have not yet caught up, the registry
could make the link between the old address and the new one based on lists of E-911
changes for their area. A third and more common example may occur when the directional
attribute is missing from a street address (e.g., "3620 Vermont Ave," where both "3620 N.
Vermont Ave" and "3620 S. Vermont Ave" exist). Solving these cases is the most difficult,
unless some other information is available that can disambiguate between the possible
options.
18.1.3 Attribute Imputation
Another approach that can be taken is to impute the missing input address attributes that
would be required. Unless there is only a single, obvious choice for imputing the missing
attributes that have rendered the original input data non-matchable, assigning values will
introduce some uncertainty into the resulting spatial output. There currently is no consensus as
to why, how, and under what circumstances attribute imputation should be attempted. At
the time of this writing, imputing or not imputing is a judgment call that is left up to the
registry, person, software, and most importantly, the circumstances of the input address.
A researcher will need to be aware of the greatest possible area of uncertainty that should
be associated with the spatial output resulting from imputed data. Also, imputing different
attributes will introduce different levels of uncertainty, from one-half the total length of a
street in the case of a missing building number and a non-ambiguous street reference feature,
to the MBR of possible city boundaries in the case for which ambiguous city names matched
and one was imputed as the correct answer.
In all cases, registry staff and researchers need to be aware of the tradeoffs that result
from imputing attributes. The confidence/validity one has in the imputed attributes
increases if they have been verified from multiple sources. But, as the number of imputed attributes
rises, so does the likelihood of error propagation. Therefore, these imputed values need to
be marked as such in the metadata associated with a geocode so that a researcher can choose
whether or not to utilize a geocode based on them. The recent works by Boscoe (2008) and
Henry and Boscoe (2008) can provide further guidance on many of these issues.
18.1.4 Pseudocoding
Another approach that can be taken is to impute an actual output geocode based on
other available information or a predefined formula, known as pseudocoding. This has
recently been defined by Zimmerman (2008) as the process of determining pseudocodes, which
are approximate geocodes. These pseudocodes can be derived by deterministically
reverting to a lower resolution portion of the input address (i.e., following the hierarchies
presented in Section 15), or by more complex probabilistic/stochastic methods such
as assigning approximate geocodes based on a specific mathematical distribution function
across a region. Like attribute imputation, there currently is no consensus as to why, how,
and under what circumstances pseudocoding should be attempted, but Zimmerman (2008)
provides insight on how one should work with these data as well as different techniques for
creating them.
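
The two families of pseudocoding mentioned above can be contrasted with a small sketch: a deterministic pseudocode at the center of a lower-resolution region, and a stochastic one drawn from a uniform distribution over that region. Both functions, the bounding-box representation of the region, and the coordinates are illustrative assumptions; they are not the methods of Zimmerman (2008), and a real region would rarely be a rectangle or warrant a uniform distribution.

```python
import random
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def deterministic_pseudocode(bbox: BBox) -> Tuple[float, float]:
    """Revert to the center of the lower-resolution region."""
    min_x, min_y, max_x, max_y = bbox
    return ((min_x + max_x) / 2.0, (min_y + max_y) / 2.0)


def stochastic_pseudocode(bbox: BBox, rng: random.Random) -> Tuple[float, float]:
    """Draw a point uniformly at random from within the region's extent."""
    min_x, min_y, max_x, max_y = bbox
    return (rng.uniform(min_x, max_x), rng.uniform(min_y, max_y))


# Made-up extent standing in for a ZIP Code region.
ZIP_BBOX: BBox = (-118.30, 34.01, -118.27, 34.04)

center = deterministic_pseudocode(ZIP_BBOX)
sample = stochastic_pseudocode(ZIP_BBOX, random.Random(42))
print(center)
print(sample)
```

Whichever variant is used, the attributes and calculation behind the pseudocode belong in the metadata.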
18.1.5 Composite Feature Geocoding
If disambiguation through attribute imputation or the subjectivity of a staff member
fails, the only option left other than reverting to the next best level of resolution or simply
holding off for a period of time may be to create a new feature from the ambiguous matches
and use it for interpolation, termed here composite feature geocoding. This approach can
be seen as an application of the task of delimiting boundaries for imprecise regions (e.g.,
Reinbacher et al. 2008).
This approach already is essentially taken every time a geocode with the quality
"midpoint of street segment" is generated, because the geocoder fundamentally does the same
task: derive a centroid for the bounding box of the conjunction of all ambiguous features.
Here, "all ambiguous features" consists of only a single street, and the centroid is derived
using a more advanced calculation than strictly the centroid of the bounding box of the
"ambiguous features." These generated features would be directly applicable to the
quantitatie measures based on reerence data eature resolution and size called or by Abe
and Stinchcomb ,2008, p. 124,.
If geocoding failed because the street address was missing the directional indicator,
resulting in ambiguity between reference features that were topologically connected, one could
geocode to the centroid of the overall feature created by forming an MBR that encompassed the
ambiguously matched features, if relevant to the study paying attention to whether or not the
entire street is within a single boundary of interest. The relative predicted certainty one can
assume from this is, at best, one-half of the total length of the street segment, as depicted in
Figure 20(d). This level of accuracy may be more acceptable than simply reverting to the
next level of geographic resolution.
However, taking the center point of multiple features in the ambiguous case may not be
possible when the input data do not map to ambiguous features that are topologically connected
(e.g., when the streets have the same name but different types and are spatially disjoint).
Estimating a point from these two non-connected features can be achieved by taking
the midpoint between them, but the uncertainty of this estimate essentially increases to the size
of the MBR that encompasses both.
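The composite feature described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular geocoder: it assumes planar (projected) coordinates, and the names `MBR` and `mbr_of_features` as well as the sample coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # planar (x, y), e.g. meters in a projected system


@dataclass(frozen=True)
class MBR:
    """Axis-aligned minimum bounding rectangle."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    @property
    def centroid(self) -> Point:
        return ((self.min_x + self.max_x) / 2.0, (self.min_y + self.max_y) / 2.0)

    @property
    def area(self) -> float:
        # Area of uncertainty implied by using the centroid as the geocode.
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)


def mbr_of_features(features: List[List[Point]]) -> MBR:
    """Build the MBR enclosing every vertex of every ambiguous feature."""
    vertices = [pt for feature in features for pt in feature]
    if not vertices:
        raise ValueError("at least one vertex is required")
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return MBR(min(xs), min(ys), max(xs), max(ys))


# Hypothetical coordinates for two candidate blocks of the same street.
north_block = [(0.0, 0.0), (100.0, 400.0)]
south_block = [(0.0, 0.0), (-50.0, -200.0)]
composite = mbr_of_features([north_block, south_block])
print(composite.centroid, composite.area)
```

The same function works whether or not the candidate features touch; only the size of the resulting rectangle, and therefore the uncertainty, changes.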
This is depicted in Figure 24, in which the left image (a) displays the area of uncertainty
for the ambiguously matched streets for the non-existent address "100 Sepulveda Blvd, Los
Angeles CA 90049," with the "100 North Sepulveda" block represented by the longer line,
the "100 South Sepulveda" block represented by the shorter line, and the MBR of the two
(the area of uncertainty) represented by the box. This is in contrast to the size of the area of
uncertainty for the whole of the City of LA, as shown in red versus the small turquoise dot
representing the same MBR on the image to the right (b).


[Figure 24 panels: (a) 100 North (longer line) and 100 South Sepulveda (shorter line) with MBR
(box); (b) MBR of North and South Sepulveda (small dot) and LA City (outline)]
Figure 24 Example uncertainty areas from MBR of ambiguous streets vs. encompassing city
(Google, Inc. 2008b)
Depending on the ambiguous features matched, the size of the resulting dynamically
created MBR can vary greatly, from the (small) area of two blocks as in Figure 24 where the
street segments are located next to each other, to the (large) area of an entire city where the
streets with the same names and ranges appear on opposite sides of the city with only the
USPS ZIP Code differing. Thus, it is impossible to indicate that taking the MBR always will
be the correct choice in every case because the accuracy of a static feature, such as a single
city polygon, will in contrast always be the same for all features within it, no matter which of
its child street segments are ambiguously matched, and may represent a smaller area in some
cases.
This tradeoff can be both good and bad in that the relationship between the areas of the
feature with the static boundary (e.g., the city polygon) can be tested against the feature with
the dynamic boundary (i.e., the dynamically created MBR of the ambiguous features) to
determine and choose whichever has the smaller area of uncertainty (i.e., the one with the
maximum relative predicted certainty). In addition, whether or not the ultimate consumer of the
output can handle spatial data with variable-level accuracy, as in the MBR approach, or if
they will require the static-level accuracy a uniform areal unit-based approach will produce,
needs to be considered. Variations of all of the practices listed in this section may or may not
be a cost-effective use of registry resources and will vary by registry (e.g., if accurate data are
required for incidence rates). The possible options for dealing with ambiguity through
composite feature geocoding as described in this section are listed in Table 37.
Table 37 Composite feature geocoding options for ambiguous data
Problem | Example | Options
Ambiguity between connected streets | 100 Sepulveda: ambiguous between 100 N and 100 S, which are connected | Intersection of streets; Centroid of MBR of streets
Ambiguity between disconnected streets | 3620 Vermont: ambiguous between 3620 N Vermont and 3620 S Vermont, which are not connected | Centroid of MBR of streets
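The comparison between a dynamic boundary (the MBR of the ambiguous features) and a static boundary (e.g., a city polygon) can be sketched as follows. This is an illustrative sketch under stated assumptions: coordinates are planar, the polygon is a simple ring without holes, and the function names and sample geometry are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(ring: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula (ring need not be closed)."""
    total = 0.0
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def mbr_area(vertices: List[Point]) -> float:
    """Area of the axis-aligned rectangle enclosing all vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


def choose_fallback(ambiguous_vertices: List[Point],
                    static_ring: List[Point]) -> Tuple[str, float]:
    """Pick whichever candidate geography has the smaller area of uncertainty."""
    dynamic = mbr_area(ambiguous_vertices)
    static = polygon_area(static_ring)
    if dynamic <= static:
        return ("composite_mbr", dynamic)
    return ("static_polygon", static)


# Hypothetical triangular "city" and two sets of ambiguous street vertices.
city = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0)]
adjacent_blocks = [(100.0, 100.0), (150.0, 300.0), (120.0, 50.0)]
opposite_sides = [(10.0, 900.0), (900.0, 10.0)]

print(choose_fallback(adjacent_blocks, city))
print(choose_fallback(opposite_sides, city))
```

Whether a consumer of the output can accept the resulting variable-level accuracy remains a policy question that a check like this does not answer.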
18.1.6 Waiting It Out
The final approach is to simply wait for the reference data sources to be updated and try
the geocoding process again. This option is suitable if the staff member thinks the address
data are indeed correct and that the reference files are out-of-date or contain errors and
omissions. This is most often the case in rapidly expanding areas of the country where new
construction is underway, or in old areas where significant reorganization of parcels or
streets has taken place and street names and parcel delineations have changed between the
temporal footprint of the reference data and the current time period. Updating the reference
files may help with input data that are not represented in the old reference files. In some
cases, it may be more suitable to use reference data more representative of the time period
the addresses were collected (i.e., the remainder of the input data might suffer from these
newly updated reference datasets). Also, this keeps the record in a non-matched state, meaning
that it cannot be included in research or analyses, the exact problem pointed out in the
opening of this section.
19. MANUAL REVIEW PROBLEMS
In this section, methods for attempting manual review are
delineated, as are the benefits and drawbacks of each.
19.1 MANUAL REVIEW
It is inevitable that some input will not produce an output geocode when run through
the geocoding process. Also, the level of accuracy obtainable for a geocode may not be
sufficient for it to be used. In some of these cases, manual review may be the only option. Best
practices relating to unmatched addresses are listed in Best Practices 50. Boscoe (2008) and
Abe and Stinchcomb (2008) can also be used as a guide in dealing with these circumstances.
Best Practices 50 Unmatched addresses
Policy Decision | Best Practice
When and how can and should unmatched addresses be handled? | If the geocoding is done per record, an unmatched address should be investigated to determine a corrective action after it is processed, if time and money are available for the task. If the geocoding process is done in-batch, all unmatched addresses should be grouped by failure class (from Table 35) and processed together after the processing has completed, if time and money are available for the task.
When and how should geocoding be re-attempted on the updated input addresses? | The same geocoding process used for the original geocoding attempt should be applied again after the unmatched address has been corrected.
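For batch processing, grouping unmatched addresses by failure class before review could look like the sketch below. The record layout and the failure-class labels are hypothetical placeholders; the actual classes are the ones defined in Table 35.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Unmatched(NamedTuple):
    record_id: str
    address: str
    failure_class: str  # hypothetical label; use the classes from Table 35


def group_by_failure_class(records: Iterable[Unmatched]) -> Dict[str, List[Unmatched]]:
    """Collect unmatched records so each failure class can be reviewed together."""
    groups: Dict[str, List[Unmatched]] = defaultdict(list)
    for record in records:
        groups[record.failure_class].append(record)
    return dict(groups)


# Hypothetical batch output.
batch = [
    Unmatched("r1", "100 Sepulveda Blvd, Los Angeles CA", "ambiguous_match"),
    Unmatched("r2", "3620 Suoth Vermont Ave, Los Angels CA", "data_entry_error"),
    Unmatched("r3", "3620 Vermont, Los Angeles CA", "ambiguous_match"),
]
review_queues = group_by_failure_class(batch)
for failure_class, items in review_queues.items():
    print(failure_class, [item.record_id for item in items])
```

After correction, each record would be sent back through the same geocoding process that produced the original failure, as the second best practice recommends.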

Manual review is both the most accurate and most time-consuming way to handle
non-matchable addresses. Depending on the problem that caused the address to be
non-matchable, the time it takes to perform a manual review can range from a few seconds to a
few hours. An example of the first case would be when one of the components of the
address attributes is obviously wrong because of incorrect data entry such as the simple
examples listed in Table 38. This can be easily corrected by personal review, but might be difficult
for a computer program to recognize and fix, although advances are being made (e.g.,
employing Hidden Markov Models and other artificial intelligence approaches as in Churches et
al. [2002] and Schumacher [2007]). Few studies have quantified the exact effort/time
required, but the NJSCR reports being able to achieve processing levels of 150 addresses per
hour (Abe and Stinchcomb 2008). Goldberg et al. (2008d) also provide an analysis of the
factors involved in these types of processes.
Table 38 Trivial data entry errors for 3620 South Vermont Ave, Los Angeles, CA
Erroneous Version | Error
3620 Suoth Vermont Ave, Los Angels CA | Misspelling
3620 South Vermont Ave, CA, Los Angeles | Transposition
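Simple misspellings like those in Table 38 can sometimes be flagged automatically before a staff member confirms the fix. The sketch below uses plain string similarity from Python's standard `difflib` module; it is a much simpler stand-in for the Hidden Markov Model approaches cited above, and the vocabulary lists and the 0.75 cutoff are assumptions chosen for illustration.

```python
import difflib
from typing import List, Optional


def suggest_token(token: str, known: List[str], cutoff: float = 0.75) -> Optional[str]:
    """Return the closest known value for a possibly misspelled token, or None."""
    candidates = [k.upper() for k in known]
    matches = difflib.get_close_matches(token.upper(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical vocabularies; a registry would draw these from its reference data.
directionals = ["North", "South", "East", "West"]
cities = ["Los Angeles", "Long Beach", "Pasadena"]

print(suggest_token("Suoth", directionals))
print(suggest_token("Los Angels", cities))
print(suggest_token("Xyz", directionals))
```

A suggestion of this kind is only a candidate: in keeping with the best practices in this section, the correction should be applied only when it is obvious, and the change should be noted in the metadata.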

There are solutions that require research on the part of the staff but fall short of
re-contacting the patient, which is usually not an option. Here, a staff member may need to
consult other sources of information to remedy the problem when the input data are not
trivially and/or unambiguously correctable. The staff member may obtain the more detailed
information from other sources if missing data are causing the geocoding process to not
match or match ambiguously. This task boils down to querying the individual address
components, combinations, and aliases against different sources (e.g., USPS ZIP+4 database
[United States Postal Service 2008a]), other reference sets, local datasets, address points,
and/or parcels to identify either an error/alias in the input data or an error/alias in the
address or address range in the reference data, as well as the patient's name to find other
additional address information (covered in the next section).
If a geocode has too low of a resolution to be useful, the staff member can reason what
is the most likely candidate at the most reasonable level of geographic resolution that
they can determine. Examples of when this could be done are listed previously in Section
18.1.2. Further options include re-contacting the hospital registry if the record is one that
requires annual follow-up, in which case a corrected version of the same address may already
have been obtained. As noted earlier, re-contacting the source of the data (e.g., the hospital)
may or may not be a viable option.
At the other end of the spectrum are corrections that would require contacting the
patient to obtain corrected or more detailed information about their address in the case that it
originally was provided incorrectly or with insufficient detail. Typically, this would not be a
task performed by a registry during the course of manual review to gain more information to
successfully geocode a record. Instead, this would normally only be conducted by individual
researchers for research purposes in special studies. Common examples for which this would
be the only option include such things as an address consisting of only a USPS ZIP Code, a
city/town name, or some other descriptive non-matchable term such as "Overseas" or
"Military."
If this approach is a valid option, it typically will result in the highest accuracy because the
staff member can potentially keep attempting to geocode the address with the patient on the
telephone, asking for more information until they obtain a successful geocode of sufficient
accuracy. Best practices related to manually reviewing unmatched addresses are listed in Best
Practices 51.
Best Practices 51 Unmatched addresses manual review
Policy Decision | Best Practice
When and how can and should manual review of unmatched addresses be attempted? | If the time and money are available, manual review should be attempted for any and all addresses that are not capable of being processed using automated means.
When and how can and should incorrect data (data entry errors) be corrected? | If the error is obviously a data entry error and the correction is also obvious, it should be corrected and the change noted in the metadata.
When and how can and should a geocode at a higher resolution be attempted to be reasoned? | If the geographic resolution of the output geocode is too low to be useful (e.g., county centroid), a staff member should attempt to reason what better, higher resolution geocode could be assigned based on other information about the patient/tumor (e.g., use city centroid of the diagnosing facility if it is known they visited a facility in their city).
19.2 SOURCES FOR DERIVING ADDRESSES
Often, even though a record's address may not contain enough or correct content to be
directly useful, other information associated with the record may lend itself to providing a
more accurate address. For example, if a patient-reported address is not useful, the staff
member might be able to link with the state DMV and obtain a valid address for the patient.
This solution assumes that a working relationship exists between a registry and the DMV in
which the former may obtain data from the latter, and in some cases this may not be feasible
at the registry level. In these cases, a registry may be able to work with a government agency
to set up a relationship of this type at a higher level, instead of the registry obtaining the
patient's DMV data directly.

It must be stated that when registries utilize additional sources for
gathering additional information about individuals, the intention is
not to "snoop" on any particular individual. The purpose of gathering
this information is entirely altruistic and central to facilitating
their role in reducing the burden of cancer.

In general, linking with large, administrative databases such as the DMV or Medicare can
be valuable for augmenting demographic information, such as address, on a cancer record.
However, these databases are for administrative purposes and are not intended for
surveillance or research. The limitations of these databases for cancer registry objectives need to be
understood. For example, although DMV requires a street address in addition to a USPS PO
box, the address listed in DMV may have been updated and overwritten since the time of
cancer diagnosis. Cancer registry personnel must fully understand the data collection
methods to make correct assumptions when attempting to supplement cancer registry data.
These situations obviously will be specific for each registry and dependent on local laws.
Other sources of data associated with a patient that have been used in the literature as
well as other possible sources are found in Table 39. Some of these sources (e.g., phone
books) can be accessed for free as online services; others may require agreements to be made
between the registry and private or public institutions. By far, the most common approach is
to look for a patient's name in parcel ownership records and associate the address if the
match seems reasonable (e.g., a one-to-one match is found between name and parcel during
the correct time period when the person was known to be living in that city). Issues related
to querying some of these online sources are covered in Section 26. Best practices related to
data sources for manual review are listed in Best Practices 52.
Best Practices 52 Unmatched address manual review data sources
Policy Decision | Best Practice
When and how can and should alternative sources of information be reviewed to assist in address correction? | If the problem with the input address is not trivially correctable, alternative sources of information should be reviewed to attempt address correction, if time and money are available for the task. If a linkage can be determined with a suitable level of certainty, it should be made as long as privacy and confidentiality concerns in Section 26 are satisfied. Metadata should include: the source of the supplemental data; the staff member who made the linkage; the method of linkage (i.e., automatic/manual); the linkage criteria; the date the linkage was made.
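The five metadata items listed in Best Practices 52 map naturally onto a small record type. The following is a sketch only; the class name, field names, and sample values are hypothetical, and a registry would adapt them to its own data model.

```python
from dataclasses import asdict, dataclass
from datetime import date

ALLOWED_METHODS = ("automatic", "manual")


@dataclass(frozen=True)
class LinkageMetadata:
    """Metadata recorded whenever supplemental data are linked to a record."""
    source: str        # the source of the supplemental data
    staff_member: str  # the staff member who made the linkage
    method: str        # the method of linkage: "automatic" or "manual"
    criteria: str      # the linkage criteria
    linked_on: date    # the date the linkage was made

    def __post_init__(self) -> None:
        if self.method not in ALLOWED_METHODS:
            raise ValueError(f"method must be one of {ALLOWED_METHODS}")


# Hypothetical example of a manual linkage against assessor parcel records.
entry = LinkageMetadata(
    source="County assessor parcel records",
    staff_member="staff-042",
    method="manual",
    criteria="one-to-one name and parcel match during residence period",
    linked_on=date(2008, 11, 10),
)
print(asdict(entry))
```

Making the record immutable (`frozen=True`) is a design choice in this sketch so that a linkage note cannot be silently altered after it is written.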

Table 39 Common sources of supplemental data with typical cost, formal agreement requirements, and usage type

Supplemental Data Source | Cost | Formal Agreement Required | Usage Type
DMV | free | yes | batch
Phone Books | free | no | per-record
Phone Companies | free | yes | batch
Utility Companies | free | yes | batch
State Bureau of Vital Statistics and Registration | free | yes | batch
Social Security Administration | free | yes | per-record
Military | free | yes | per-record
Educational Institutions | free | yes | per-record
USPS ZIP+4 Database | not free | no | batch
County Deeds/Real Estate Transaction Registries | not free | no | batch
Municipal or County Assessor Databases | varies | no | batch
Municipal Resident Lists | free | no | per-record
State or Municipal Voter Registration Databases | varies | no | batch
Google Earth | free | no | per-record
Social Security Death Index | free | no | per-record
People Finding Web Sites | free | no | per-record
Medicare/Medicaid | free | yes | per-record
Vital Statistics | free | yes | batch
"Googling" a Person's Name | free | no | per-record
20. GEOCODING SOFTWARE PROBLEMS
This section provides insight into required or possible
methods of overcoming problems with geocoding software.
20.1 COMMON SOFTWARE PITFALLS
Several common problems and limitations often occur when using commercial geocoding
packages. Perhaps the most frustrating is that commercial geocoding processes by their
very nature do not reveal much about their inner workings. The actual algorithms
implemented in commercial products can be held as trade secrets and not provided in detail (they
are, for the most part, sold to make money). As such, a registry or researcher may not know
the exact choices made by the components, what its other possible options were, or how the
final choice was decided. Some older geocoding platforms simply return the spatial output
without any metadata reporting how or why it was derived or providing information on its
quality. This can and should prevent a user from having confidence in these results and
should be avoided. Registries and consumers of commercial software should push these
vendors to be more open about the inner workings of the geocoding processes that they
implement, and about reporting metadata.
However, even when commercial software packages expose their algorithms and
assumptions, it will take a time commitment from a staff member to read and understand
them. Best Practices 53 contains several common limitations that are encountered in working
with geocoding software and recommended NAACCR actions for overcoming them.
Note that the intent of this section is to remain "software neutral" by refraining from
advocating any particular commercial geocoding platform. Also, as the cost of geocoding
continues to drop, even becoming free (Goldberg 2008a), some of the issues in this section may no
longer apply.



Best Practices 53 Common geocoding software limitations by component of the geocoding process

Aspect | Limitation | Best Practice
Input Data | Not accepting intersections as input. | The input data will need to be prepared such that one street or the other is chosen and used for input. The street that will produce the lower level of uncertainty should be chosen (e.g., the shorter one).
 | Not accepting named places as input. | The input data will need to be prepared such that the next highest level of resolution should be used for input (e.g., move from named building to USPS ZIP Code).
Normalization/Parsing | Cannot change the order of address attributes (tokens). | The input data will need to be prepared such that the address attributes are in the accepted order.
Standardization | An input address standard is not supported. | The input data will need to be prepared such that the input data are in the accepted address standard.
Reference Dataset | Only linear-based reference datasets are supported. | This is how most geocoders operate so, at present, in most circumstances this will have to be acceptable.
 | There is no control over which reference dataset is used. | This is how most geocoders operate so, at present, in most circumstances this will have to be acceptable.
Feature Matching | There is no control over which feature-matching algorithm is used. | This is how most geocoders operate so, at present, in most circumstances this will have to be acceptable.
 | There is no control over the parameters of the feature-matching algorithm used (e.g., confidence interval). | This is how most geocoders operate so, at present, in most circumstances this will have to be acceptable.
 | There is no control over which feature interpolation algorithm is used. | This is how most geocoders operate so, at present, in most circumstances this will have to be acceptable.
 | There is no control over the parameters of the feature interpolation algorithm used. | If particular parameters must be settable (e.g., dropback distance and direction, which assumptions are used: uniform lot, address range, etc.), a different geocoding process should be identified and obtained.
Output Data | Only capable of producing a single point as output (i.e., no higher complexity geometry types). | This is how most geocoders operate so, at present, in most circumstances this will have to be acceptable.
Metadata | No metadata reported along with the results. | Geocodes returned from a geocoder without any metadata should not be included as the spatial component of the record.
 | No coordinate quality reported along with the results. | Geocodes returned from a geocoder without metadata describing, at a minimum, a coordinate quality should not be included as the spatial component of the record.
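The first best practice in the table above, choosing the street of an intersection that yields the lower uncertainty (e.g., the shorter one), can be sketched as below. The sketch assumes each candidate street is available as a polyline in planar coordinates; the function names, street names, and coordinates are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def polyline_length(vertices: List[Point]) -> float:
    """Sum of straight-line distances between consecutive vertices."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(vertices, vertices[1:])
    )


def pick_street_for_intersection(streets: Dict[str, List[Point]]) -> str:
    """Return the name of the shortest street, used as the single-street input."""
    return min(streets, key=lambda name: polyline_length(streets[name]))


# Hypothetical intersection of a long street and a short one.
intersection = {
    "Main St": [(0.0, 0.0), (0.0, 1000.0)],
    "Elm Ave": [(0.0, 0.0), (300.0, 400.0)],
}
print(pick_street_for_intersection(intersection))
```

Length is used here only as a proxy for uncertainty; a registry might prefer another measure, such as the number of address ranges on each street.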

Part 5: Choosing a Geocoding Process



The choice of a geocoding process and/or components used by a registry will necessarily
depend foremost on the restrictions and constraints the registry must adhere to, be they
legal, budgetary, technical, or otherwise. The material in this part of the document is presented
to help a registry determine the correct solution for them, given their particular restrictions.







21. CHOOSING A HOME-GROWN OR THIRD-PARTY
GEOCODING SOLUTION
This section will identify the requirements that must be
defined to determine the appropriate geocoding method to be
used.
21.1 HOME-GROWN AND THIRD-PARTY GEOCODING OPTIONS
In practice, many different geocoding process options are available to registries and the
general public. At the top level, they can either be developed in-house or obtained from a
third party. Of the ones that are not developed by the registry, many different varieties are
available. They can range in price as well as quality, and are available in many different
forms. There are complete software systems that include all of the components of the
geocoding process and can be used right out of the box, or each component can be purchased
separately. There are freely available online services that require no software other than a
Web browser (e.g., Goldberg 2008a, Google, Inc. 2008c, Yahoo!, Inc. 2008), and there are
proprietary commercial versions of the same (e.g., Tele Atlas 2008b). Choosing the right
option will depend on many factors, including budgetary considerations, accuracy required,
level of security provided, technical ability of staff, accountability and metadata reporting
capabilities, and flexibility required.
21.2 SETTING PROCESS REQUIREMENTS
Setting the requirements of one's geocoding process should be the first task attempted,
even before beginning to consider the different available options. The items listed in Table
40 should serve as a starting point for discussions among registry staff designing the
geocoding process when determining the exact requirements of the registry in terms of the
geocoding software, the vendor, and the reference data. It also is worthwhile to have a single person
designated as the dedicated liaison between the vendor and the registry for asking/answering
questions to both build relationships and become an expert on the topics. Keep in mind that
geocoding outcomes, at a minimum, must meet the standards set forth in the NAACCR GIS
Coordinate Quality Code Item (see Section 15).
Table 40 Geocoding process component considerations
Process Component | Topic | Issue
Software | Accuracy | What is the minimum acceptable level of spatial accuracy? Are multiple levels of accuracy accepted or required?
 | Capacity | What throughput must be achievable? How many concurrent users are expected?
 | Reliability | Must the software be failsafe?
 | Transparency | What level of overall transparency is required? What level of transparency is required per component?
 | Reportability | What information must be reported along with a result?
Vendor | Accuracy | What level of accuracy and completeness do they guarantee?
 | Capacity | What capacity can they accommodate?
 | Reliability | How reliable are their results?
 | Transparency | What information is available about the process they use?
 | Reportability | What information do they report along with their results?
Reference datasets | Accuracy | What level of accuracy is associated with the dataset as a whole? What level of accuracy is associated with individual features?
 | Completeness | How complete a representation do the reference features provide for the area of interest?
 | Reliability | How reliable can the reference features be considered?
 | Transparency | How were the reference features created?
 | Lineage | Where and when did this data source originate? Is there a date/source specified for each feature type? What processes have been applied to this data source?

21.3 IN-HOUSE VS. EXTERNAL PROCESSING
Whether to perform the geocoding process in-house or by utilizing a third-party contractor
or service is perhaps the most important decision that a registry must make regarding
their geocoding process. The NAACCR documents (2008a, 2008b) and the work by Abe
and Stinchcomb (2008) that presents a compressed version report that 28 percent of their
registry survey respondents (n=47) utilize in-house geocoding, while 70 percent utilize
external resources (33 percent using commercial vendors, 37 percent using different divisions
within their organization).
Performing the process in-house enables a registry to control every aspect of the
process, but forces them to become experts on the topic. Contracting a third-party entity to
perform the geocoding releases the registry from knowing the intimate details of the
implementation of the geocoding process, but at the same time can keep them in the dark
about what is really happening during the process and the choices that are made. The costs
associated with using a vendor will vary between registries because the requirements the
vendor will have to meet will vary between registries.
Another option is to use a mixture of both in-house and vendor geocoding. The
common case of this is sending out all of the data to a vendor, then working in-house on the
problem cases that are returned. Best practices related to geocoding in-house or externally are
listed in Best Practices 54.
Best Practices 54 In-house versus external geocoding
Policy Decision | Best Practice
When can and should a registry geocode data in-house or use a third party? | If the cost of geocoding using a third-party provider is higher than obtaining or developing all components of the geocoding process, or if no suitable confidentiality and/or privacy requirements can be met by a third party, it should be performed in-house. If the technical requirements or costs for geocoding in-house cannot be met by a registry and suitable confidentiality and/or privacy requirements can be met by a third party, it should be performed by a third party.
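The two rules in Best Practices 54 can be written down as a small decision function. This is only one possible reading of the rules: the type and field names are hypothetical, costs are assumed to be comparable totals, and situations the stated rules do not cover are reported as "undetermined" rather than guessed.

```python
from dataclasses import dataclass


@dataclass
class RegistrySituation:
    third_party_cost: float          # total cost of using a vendor
    in_house_cost: float             # total cost of obtaining/developing all components
    third_party_meets_privacy: bool  # vendor can meet confidentiality/privacy requirements
    in_house_feasible: bool          # registry can meet in-house technical and cost needs


def recommend(s: RegistrySituation) -> str:
    # Rule 1 (privacy part): with no acceptable vendor, geocode in-house.
    if not s.third_party_meets_privacy:
        return "in-house"
    # Rule 2: in-house is not workable and a vendor satisfies privacy.
    if not s.in_house_feasible:
        return "third party"
    # Rule 1 (cost part): the vendor costs more than doing it in-house.
    if s.third_party_cost > s.in_house_cost:
        return "in-house"
    # The stated rules do not decide the remaining cases.
    return "undetermined"


print(recommend(RegistrySituation(5000.0, 3000.0, True, True)))
print(recommend(RegistrySituation(1000.0, 3000.0, True, False)))
```

Note that the first rule applies even when in-house geocoding is difficult, since the guide gives no alternative when privacy requirements cannot be met externally.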

21.4 HOME-GROWN OR COTS
If the choice is made to perform the geocoding in-house, the registry must further decide
if they wish to implement their own geocoder, or if they prefer to use a commercial
off-the-shelf (COTS) software package. Both of these options have their strengths and weaknesses,
and this choice should be considered carefully. On one hand, a COTS package will be, for
the most part, ready to start geocoding right out of the box. However, there will be a
substantial up-front cost and the details of its inner workings may or may not be available. A
home-grown geocoding solution, on the other hand, can take a significant amount of time to
develop and test, costing possibly several times more than a commercial counterpart, but its
inner workings will be known and can be molded to the particular needs of the registry and
modified as needed in the future. As of this writing, only a few registries currently use a
composite/home-grown solution (e.g., the NJSCR [Abe and Stinchcomb 2008]), but as
open-source geocoding platforms become available, such as the one being developed at the
University of Southern California (Goldberg et al. 2008a), this may change.
A comprehensive list of both (1) the commonly encountered costs of using commercial
vendors (high end and low end for per-record and batch processing) and (2) the costs
of performing in-house geocoding, including all associated costs for setup (purchasing
reference data, purchasing geocoding software, developing custom geocoding software, employee
training, etc.), is missing from the registry community. In many cases, the contracts under
which the data or services are obtained from vendors are confidential, which may be the
major hurdle in assembling this list. However, these data items would be extremely useful for
registries just starting to form their geocoding processes and procedures.
21.5 FLEXIBILITY
Commercial software providers tend to create "one-size-fits-all" solutions that appeal to
the largest market possible. They usually are not capable of creating individualized software
releases for each of their customers geared toward fitting their exact needs. This strategy,
while beneficial to the software vendor, almost always results in software that does not
exactly meet the needs of the consumer. Surely, most of their customers' requirements will be met
(otherwise they would have never bought the software), but undoubtedly some specific
requirement will not be fulfilled, causing difficulty at the consumer end.
With regard to geocoding software in particular, the lack of flexibility may be the most
problematic. The general nature of geocoding, with many different types of input data,
reference data sources, and output types, necessitates a certain degree of dynamism in the
process. Measuring a particular strategy in terms of how easily it can be adapted to changing
conditions of reference datasets, new geocoding algorithms, and varying degrees of required
levels of accuracy can be considered the defining metrics of any geocoding process.
Therefore, a critical factor that must be taken into consideration when choosing between in-house
and commercial geocoding is the amount of flexibility that can be accommodated in the
process.
To address this question, a registry will need to determine an anticipated amount of
leeway that they will need and investigate if commercially available packages suit their needs.
Examples of specific issues that may need to be considered are listed in Table 41. If the
needs of a registry are such that no commercial platform can accommodate them, the only
option may be to develop their own in-house version.
Table 41 Commercial geocoding package policy considerations
Policy Considerations
The ability to specify/change reference data sources
The ability to specify/change feature-matching algorithms to suit particular study needs
The ability to specify/change required levels of accuracy
The ability to specify/change feature interpolation algorithms
The ability to control offset/backset parameters
21.6 PROCESS TRANSPARENCY
Process transparency within a geocoding strategy can be critical in determining the
confidence one can associate with the results. In the worst possible case, no metadata are
returned with a geocode to indicate how it was derived or where it came from. Data in this
form should, for the most part, be avoided because they will have no indication of the
reliability of the data from which their results are to be obtained. The amount of transparency
associated with a geocoding process can be measured in terms of the amount and type of
information that is returned with the result. A process that reports every decision made at
every step of the process would be completely transparent, while the worst-case scenario just
presented would be completely non-transparent.
Commercial geocoding packages vary in terms of transparency. Some do not expose the
inner workings of their geocoding process to customers based on "trade secret" concerns,
and as such the consumer must take the vendor at their word regarding the operation of the
geocoding software. Others, however, report more detailed information. A registry will need
to decide what level of transparency it requires and either select a commercial geocoding
package accordingly or develop one in-house if nothing is commercially available that meets
these needs. Best practices related to process transparency are listed in Best Practices 55.
Best Practices 55 Process transparency
Policy Decision: What minimum information needs to be reported by a geocoding process
along with the output (e.g., level of transparency)?
Best Practice:
Full metadata describing the reference data.
The type of feature match.
The type of interpolation performed.
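
The minimum metadata listed in Best Practices 55 can be pictured as fields that travel with
every geocode. The following is a minimal sketch under stated assumptions, not a definitive
implementation: the class GeocodeResult, its field names, and the helper missing_metadata
are hypothetical names chosen for illustration and do not correspond to any particular
geocoding package.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical field names for the three metadata items in Best Practices 55.
REQUIRED_METADATA = ("reference_dataset", "match_type", "interpolation_type")


@dataclass
class GeocodeResult:
    """A geocode plus the metadata describing how it was derived."""
    latitude: float
    longitude: float
    reference_dataset: Optional[str] = None   # metadata describing the reference data
    match_type: Optional[str] = None          # type of feature match
    interpolation_type: Optional[str] = None  # type of interpolation performed


def missing_metadata(result: GeocodeResult) -> List[str]:
    """Return the names of required metadata fields that are absent or empty."""
    return [name for name in REQUIRED_METADATA if not getattr(result, name)]
```

A registry could flag for review any result for which missing_metadata returns a non-empty
list, since such a geocode corresponds to the non-transparent case described above.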

21.7 HOW TO SELECT A VENDOR
If the choice has been made to select a vendor to perform geocoding, a registry should ask
specific questions to determine both the vendor's knowledge of the topic and the
capabilities the vendor is likely to be able to provide. Example questions are provided in
Table 42.
Table 42 Topics and issues relevant to selecting a vendor
Topic: Data sources
Issues:
What types of data sources are used?
What versions?
How often are they updated?
What level of accuracy do they have?
How are those levels guaranteed?
Topic: Feature matching
Issues:
What algorithms are used?
How are exceptional cases handled?
Topic: Normalization/Standardization
Issues:
What algorithms are used?
Topic: Input data
Issues:
What types of input data can be handled?
What happens to input data that are not able to be handled?
Topic: Output data
Issues:
What output formats are supported?
What information is provided along with the output?
What level of spatial accuracy can be achieved?
Can they provide the NAACCR certainty codes from Table 34?
Topic: Confidentiality
Issues:
What safeguards and guarantees are in place?
Topic: Cost
Issues:
What is the per-record cost?
Do they negotiate separately with each registry and require non-disclosure agreements?
Topic: Feedback capability
Issues:
Do corrections submitted by a cancer registry lead to updates in their data sources?

In addition to receiving answers to these questions, it is advisable for a registry to test
the accuracy of their vendor periodically. Published literature exists defining methods for
testing both third-party contractors and commercially purchased software (e.g., Krieger et
al. 2001, Whitsel et al. 2004). Essentially, a registry wishing to validate the results from
a vendor can set up lists of input data that include cases that are particularly hard to
geocode successfully, which then can be submitted to the vendor to determine how well the
vendor handles them.
Also, the registry can maintain a small set of ground truth data obtained with GPS devices
to measure the spatial accuracy of the output data returned from the vendor. It should be
noted that it may be beneficial for registries to coordinate in compiling and maintaining a
large-area ground truth dataset for this purpose (instead of each registry maintaining a
small set just for the area around their geographic location) to leverage the existing work
of other registries.
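
The ground truth comparison described above amounts to measuring the distance between each
vendor geocode and the GPS-derived location for the same record. The sketch below is one
possible way to do this, assuming both sets are held as dictionaries keyed by a record
identifier with (latitude, longitude) values in decimal degrees; the function names and the
spherical-Earth haversine approximation are choices made for this illustration, not methods
prescribed by this guide.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; a spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def spatial_errors(vendor, truth):
    """Map each record id present in both inputs to its positional error in meters.

    vendor and truth are dicts of record id -> (latitude, longitude).
    """
    return {
        record_id: haversine_m(*vendor[record_id], *truth[record_id])
        for record_id in truth
        if record_id in vendor
    }


def summarize(errors):
    """Return (mean, maximum) error in meters, or (None, None) if there are no records."""
    if not errors:
        return None, None
    values = list(errors.values())
    return sum(values) / len(values), max(values)
```

Records present in the ground truth set but missing from the vendor output are skipped here;
a registry would probably want to count those separately as unmatched cases.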
At the time of writing, there are no established rules between vendors and registries as to
who is responsible for ensuring the correctness of the geocoded results. By default, the
registry is ultimately responsible, but it may be worthwhile to initiate discussions on this
topic with vendors before entering into any agreements.
21.8 EVALUATING AND COMPARING GEOCODING RESULTS
Although geocoding outcomes between two geocoders will be identical for most addresses,
there always are subsets of addresses that will generate different outcomes for which there
is no clear consensus on which is more correct. Methods of objectively comparing geocoding
results that can be applied to both the results returned from vendors and those from other
agencies are just emerging (e.g., Lovasi et al. 2007, Mazumdar et al. 2008). The objective
when comparing geocoding outcomes is one of placing addresses into the categorization listed
in Table 43. The third category represents the instances in which one party believes that
all parties involved should be satisfied with the geocodes while another party believes
otherwise.
Table 43 Categorization of geocode results
1) Addresses that can be geocoded to the satisfaction of all parties concerned
2) Addresses that cannot be geocoded to the satisfaction of all parties concerned
3) Addresses for which there is disagreement as to whether they belong in (1) or (2)

The differences are based on assumptions used during the geocoding process, which generally
are universally applied for all data processed using a single geocoder, and thus are
relatively easy to identify. For example, it may be simple to determine if and which
hierarchy of feature matching was used (e.g., matching a USPS ZIP Code centroid every time a
street-level match fails). Explicitly defining the required criteria of the geocoding
process is the way for registries to make their vendor's geocoding outcomes more similar to
those that they may generate in-house.
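
A feature-matching hierarchy of the kind mentioned above can be written down explicitly so
that two parties can compare their assumptions. The sketch below is illustrative only: the
function geocode_with_hierarchy, the level names, and the toy lookup tables (with made-up
coordinates and placeholder ZIP Codes) are all hypothetical and stand in for real reference
datasets and matchers.

```python
from typing import Callable, List, Optional, Tuple

Location = Tuple[float, float]
Matcher = Callable[[dict], Optional[Location]]


def geocode_with_hierarchy(address: dict, matchers: List[Tuple[str, Matcher]]) -> dict:
    """Try each matcher in order and report which level produced the geocode."""
    for level, matcher in matchers:
        location = matcher(address)
        if location is not None:
            return {"location": location, "match_level": level}
    return {"location": None, "match_level": "unmatched"}


# Toy reference data with made-up coordinates, for illustration only.
STREET_POINTS = {("100 MAIN ST", "00001"): (10.0, 20.0)}
ZIP_CENTROIDS = {"00001": (10.5, 20.5)}


def match_street(address: dict) -> Optional[Location]:
    """Street-level match on an exact (street, ZIP) pair."""
    return STREET_POINTS.get((address.get("street"), address.get("zip")))


def match_zip_centroid(address: dict) -> Optional[Location]:
    """Fallback match to a ZIP Code centroid."""
    return ZIP_CENTROIDS.get(address.get("zip"))


# The hierarchy itself: fall back to the ZIP centroid whenever the street match fails.
HIERARCHY = [("street", match_street), ("zip_centroid", match_zip_centroid)]
```

Because the match level is returned with every result, a registry can see at a glance how
often a vendor fell back to a coarser level and whether that agrees with its own criteria.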
When the geocoding output of two individuals is compared (as in the between-agency case),
the assumptions are less easy to identify because they generally are made on a per-record
basis. There will be addresses for which one person thinks an address can geocode based on
number of edits to the address, and another person disagrees and thinks that the record
should not be geocoded because the edits required for such a match are based on assumptions
not supported by the data at hand. Fortunately, these addresses generally comprise less than
5 percent of all registry records. Best practices related to evaluating third-party
geocoding results are listed in Best Practices 56.
Best Practices 56 Evaluating third-party geocoded results
Policy Decision: When can and should results from a third-party provider be verified?
Best Practice:
The results from a vendor should be verified after every submission to ensure some desired
level of quality.
Policy Decision: How and what can and should be used to verify the quality of a vendor?
Best Practice:
A pre-compiled list of problem addresses to check exceptional case handling.
Resubmitting data to check for consistent results.
A small set of GPS ground truthed data can be used to check the spatial accuracy, with its
size based on confidence intervals (although a recommendation as to its calculation needs
more research).

22. BUYING VS. BUILDING REFERENCE DATASETS
This section introduces methods for determining which reference datasets to use.
22.1 NO ASSEMBLY REQUIRED
Another decision that registries will be confronted with when they choose to not use a
vendor relates to obtaining reference datasets used in the geocoding process. These can
either be obtained (possibly purchased) and used directly, created at the registry, or a
combination thereof whereby the registry adds value to an acquired dataset.
As noted earlier, the accuracy and completeness of these reference datasets can vary
dramatically, as can the costs of obtaining them. The price of free reference datasets such
as TIGER/Line files (United States Census Bureau 2008d) makes them attractive. The
relatively lower levels of accuracy and completeness included in these (when compared to
commercial variants), however, may not be sufficient for a registry's needs. Likewise,
although commercial datasets such as Tele Atlas (2008c) will undoubtedly improve the overall
accuracy of the geocoding process, their cost may be prohibitive.
22.2 SOME ASSEMBLY REQUIRED
An alternative approach is for a registry to improve the reference data source, be it a
public data source (e.g., TIGER/Line files) or a commercial counterpart. If the technical
ability is available, this option may prove worthwhile. For example, one simple improvement
that can be made to TIGER/Line files that greatly improves the accuracy of linear-based
interpolation is to associate more accurate address ranges with each street reference
feature, thus enabling uniform lot interpolation (Bakshi et al. 2004). The required actual
number of parcels per street is typically available from local assessors' offices, and they
may be willing to share that information with a registry. Usually, these values can easily
be associated with each of the street reference features in the TIGER/Line files, unless it
is determined they will compromise the synchronicity with census attribute or population
data already in use.
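
To see why the actual parcel count matters, consider a hypothetical street segment whose
nominal even-side address range is 100 to 198 but which really holds only five lots numbered
100 through 108. The sketch below contrasts plain address-range interpolation with a
simplified reading of uniform lot interpolation in which all lots on the segment are assumed
to be the same width. The function names and the example numbers are assumptions made for
illustration and are not taken from Bakshi et al. (2004).

```python
def address_range_fraction(house_number, range_start, range_end):
    """Fraction of the way along the segment under plain address-range interpolation."""
    if range_end == range_start:
        return 0.5  # single-address range: place the point mid-segment
    return (house_number - range_start) / (range_end - range_start)


def uniform_lot_fraction(lot_index, lot_count):
    """Fraction along the segment at the center of one of lot_count equal-width lots.

    lot_index is zero-based, counted from the start of the segment.
    """
    if not 0 <= lot_index < lot_count:
        raise ValueError("lot_index must be between 0 and lot_count - 1")
    return (lot_index + 0.5) / lot_count


def point_along(start, end, fraction):
    """Linearly interpolate a coordinate between the segment end points."""
    (x1, y1), (x2, y2) = start, end
    return (x1 + fraction * (x2 - x1), y1 + fraction * (y2 - y1))
```

For house number 108 in this example, address_range_fraction(108, 100, 198) places the
geocode about 8 percent of the way along the segment, whereas uniform_lot_fraction(4, 5)
places it 90 percent of the way along, at the center of the fifth and last lot.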
Another option is for the registry to directly contact the local government offices that
make use of GIS-related data sources to determine if suitable reference data sources exist
for use in the geocoding process. Instead of just obtaining the number of parcels along
street segments as in the previous example, why not get the actual parcel boundary files
from the local assessor's office? If they exist, the assessor's office may be willing to
share them with the registry, for free or some fee. In some cases they are becoming
available at the state level. Either of these options effectively increases the matching
ability, or the ability of the reference dataset to match addresses while still maintaining
the same linkage criteria.
22.3 DETERMINING COSTS
In general, a registry will need to decide if the increased level of accuracy they can
expect from an improved reference dataset is worth the additional cost (e.g., does the level
of improved geocode accuracy directly attributable to using Tele Atlas [2008c] instead of
TIGER/Line files justify the price of purchasing them?). Further, they need to determine the
potential costs of upgrading an existing dataset themselves versus buying one that is
already enhanced. The calculation of these costs should include both the initial
ascertainment and acquisition costs, as well as the cost of adding any value to the
datasets, if it is to be performed. The cost of maintenance for software and reference data
for in-house operations also should be taken into consideration. Best practices relating to
choosing a reference dataset are listed in Best Practices 57.
Best Practices 57 Choosing a reference dataset
Policy Decision: What type of reference data should be chosen?
Best Practice:
If a registry requires the ability to investigate and fix addresses that might be incorrect
on a street, multi-entity (range) reference data should be used.
If a registry needs all addresses to be validated as existing, single-entity (discrete)
reference features should be used (i.e., point- or areal unit-based).
The scale of the reference features should match the scale of the input data element being
matched.
National scale city names: National gazetteers (included in TIGER/Lines)
National scale USPS ZIP Codes: National USPS ZIP+4 databases
The highest resolution reference dataset should be chosen (given budgetary constraints).
Policy Decision: Which point data sources should be used?
Best Practice:
If no feature interpolation is possible given the types of data to be geocoded, point-based
reference datasets should be considered.
All geocoding processes should, at a minimum, include the national-scale gazetteers included
with TIGER/Line files.
Policy Decision: When and which line data sources should be used?
Best Practice:
All geocoding processes should contain at least one linear-based reference dataset.
All geocoding processes should, at a minimum, include the TIGER/Line reference files.
Policy Decision: When and which polygon-based reference dataset should be used?
Best Practice:
If high-resolution data sources are available (e.g., parcel boundaries, building
footprints), they should be included in the geocoding process.
If 3-D geocoding is required, building models should be used.
Policy Decision: When multiple reference dataset types are available, which should be used?
Best Practice:
The reference feature type that produces the lowest amount of uncertainty should always be
used (e.g., a linear-based reference dataset with high-resolution small features may be more
suitable than large parcels in rural areas).

23. ORGANIZATIONAL GEOCODING CAPACITY
This section explores the possible options for developing geocoding capacity at a registry.
23.1 HOW TO MEASURE GEOCODING CAPACITY
The amount of geocoding performed at a registry will necessarily affect the choice of the
geocoding option selected. Any decisions made for the determination of an appropriate
geocoding process need to first and foremost take into account the research needs and
policies of the registry, with the amount of geocoding likely to be performed also
considered as a factor. The number of cases geocoded can vary dramatically between
registries, and a first order of magnitude estimation of an approximate number of cases will
have an effect on which strategy should be undertaken. To determine an estimate for an
approximate yearly number of cases, a registry should determine how many cases they have had
in previous years. These prior numbers can be good indicators of future needs, but may be
biased depending on the particular circumstances contributing to them. For instance, policy
changes implemented within a registry as of a particular year can increase or decrease these
numbers, and potential future policies will need to be taken into account.
The costs of data and software are effectively "sunk" costs, with the real cost of yearly
geocoding performed at a registry depending on the amount of time spent on the task.
Therefore, in addition to determining an average number of cases per year, a registry should
also determine the average amount of time the interactive geocoding process takes on a
per-case basis (because batch match cost is trivial on a per-record basis). The specific
geocoding policies in place at a registry will have a substantial effect on this estimate,
and likewise the estimated amount of time per case may affect these policies. Time is
particularly dependent on the desired geographic level of output. For instance, if a policy
were in place that every input address needed to be geocoded to street-level accuracy, the
time necessary to achieve this might quickly render the policy infeasible, but requiring a
geocode for every address to county level accuracy may be substantially quicker.
The most reliable cost estimates for geocoding at a registry often are obtained when a
registry charges for cost recovery because, most likely, the client will set the geocoding
criteria. Based on the number of geocoding cases that a registry has determined likely and
costs for these (as determined by the average amount of time per case), a registry should
evaluate the necessity of creating one or more full-time equivalent (FTE) positions
dedicated to the task. Example numbers of cases geocoded per year and resulting FTE
positions from registries are provided in Table 44 (Abe and Stinchcomb 2008). Best practices
relating to measuring geocoding capacity are listed in Best Practices 58.

Table 44 Comparison of geocoded cases per year to FTE positions
Registry          Cases Geocoded Per Year    Number of FTE Positions
North Carolina    10,000+                    1
New Jersey        80,000+                    2
New York          100,000+                   4
Best Practices 58 Measuring geocoding capacity
Policy Decision: When and how should the predicted number of geocoding cases be calculated?
Best Practice:
The number of geocoding cases should be calculated before selecting a geocoding process.
This can be done by estimating from the number of cases geocoded in previous years.
Policy Decision: When and how should average processing time per geocoded case be
calculated?
Best Practice:
The average per-geocode processing time should be calculated as cases are processed.
Policy Decision: How many FTE positions are required for the geocoding process?
Best Practice:
This number will depend on the number of cases that need to be geocoded and the actual work
time associated with processing each geocode, which also will depend on the level to which
the process is automated.
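
The capacity estimate described in this section is simple arithmetic once the inputs are
known. The sketch below shows one way to set it up; the function names, the share of cases
that need interactive review, the minutes per case, and the 2,080 working hours per FTE per
year (40 hours for 52 weeks) are all assumptions introduced for illustration, and a registry
would substitute its own measured values.

```python
from statistics import mean


def expected_cases_per_year(previous_years):
    """Estimate next year's case count as the average of previous yearly counts."""
    return mean(previous_years)


def estimate_fte(cases_per_year, interactive_share, minutes_per_interactive_case,
                 hours_per_fte_year=2080.0):
    """Estimate the FTE positions needed for interactive geocoding.

    Batch-matched cases are treated as costing no staff time, following the text above;
    only the share of cases needing interactive review contributes hours.
    """
    interactive_hours = cases_per_year * interactive_share * minutes_per_interactive_case / 60.0
    return interactive_hours / hours_per_fte_year
```

For example, under these assumptions 80,000 cases per year with 10 percent needing 15
minutes of interactive work each come to 2,000 staff hours, or slightly less than one FTE.
Real productive hours per position are usually lower than 2,080, so this figure is best
treated as a lower bound.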


Part : Working With Geocoded Data



After data have been successfully geocoded, they are subsequently utilized in any number of
ways. Due to the nature of cancer registries and the types of data that they routinely work
with, several peculiarities exist as to how and why particular types of processing can and
sometimes must take place. Some of these safeguard the rights of individuals, while others
are artifacts of the way in which cancer registries do business. This part of the document
will discuss several of the more important issues in play here.
24. TUMOR RECORDS WITH MULTIPLE ADDRESSES
This section discusses the issues that explain why multiple addresses sometimes are
generated for a single tumor record and how these should be interpreted and used in
subsequent analyses.
24.1 SELECTING FROM MULTIPLE CASE GEOCODES
It is common for a record to have several addresses associated with it, with each one
representing the individual's residence. This multi-address situation can occur for several
reasons: a patient may be seen by multiple facilities, each of which records an address for
the patient (which can be the same or different), or the patient may be seen at the same
facility on multiple occasions. Additionally, if multiple abstracts with multiple addresses
were received for a single tumor during registry consolidation (in the case that the subject
moved their residence, reported different addresses at different times, and/or was treated
at multiple facilities), this situation also will arise. Further, one or more facilities may
have two different versions of the same address for a single patient, or two different
addresses may be maintained because the patient actually moved.
In these cases when multiple addresses exist for a record, one address must be selected as
the primary for use in spatial analyses. The others need not be discarded or removed from
the abstract, but the primary address should be the only one used for performing analysis.
In the ideal case, when confronted with multiple addresses, one should strive to use the
address that is based on the more accurate data. The standard in place at registries is to
use the patient's usual address at diagnosis (dxAddress), no matter what its quality.
However, if it is unclear which is the primary dxAddress out of several possible addresses
associated with a single patient (in the case of multiple tumor abstracts being received), a
decision needs to be made as to which should be used. For instance, consider the case when
one geocode was produced the first time the patient saw a doctor several decades ago using
only the USPS ZIP Code, and another geocode was subsequently created using a full street
address obtained from historical documents of unknown accuracy. These both represent the
dxAddress, but have different values. The first geocode is most likely more temporally
accurate because the data were reported by the patient him or herself, while the second is
most likely more spatially accurate, if one assumes that the address obtained is correct.
There is no current consensus on how to proceed in these instances, but there has been a
recent push to standardize the way that registries deal with records that have multiple
addresses. Research is underway to develop best practices for this situation, and the
considerations listed in Table 45 have been identified as possible factors that could/should
influence these types of decisions.

Table 45 Possible factors influencing the choice of dxAddress with decision criteria if they have been proposed

Index | Factor | Decision Criteria
1 | Abstract submission date | Earliest to latest
2 | Address of diagnosing facility |
3 | Age at diagnosis |
4 | Amount of time elapsed between diagnosis and first contact | Least to most
5 | Class of case | Class code, this order: {1,0,2,3,4,5,6,7,8,9}
6 | Current address of patient |
7 | Date of diagnosis on abstract |
8 | Date of last contact |
9 | External address reference | E.g., motor vehicles database with address near diagnosis time
10 | Facility referred from |
11 | Facility referred to |
12 | Fuzziness of match | E.g., level of massaging required to standardize address
13 | Marital status at diagnosis |
14 | Particular reporting facility | Use address from the most trusted one
15 | Place of death |
16 | Specificity (geographical) of address type | Street address > USPS ZIP Code only > county only
17 | Timeliness of reporting | Time elapsed between times of case reportability and submission (less is better)
18 | Type of street address-related data submitted | E.g., prison address < known residential address. E.g., standard street address > USPS PO box or rural route location. E.g., contingent on patient age: old in nursing homes assumed appropriate, young in college assumed appropriate
19 | Type of reporting facility | E.g., American College of Surgeons or NCI hospitals > other in-state hospital > in-state clinic > other sources
20 | Vital status | Alive > dead
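
Because there is no consensus on how these factors should be combined, any automated
selection rule is necessarily a local policy choice. Purely as an illustration, the sketch
below ranks candidate addresses by two of the factors in Table 45, geographic specificity
(factor 16) and abstract submission date (factor 1, earliest first). The class
CandidateAddress, the function choose_primary, and the decision to apply specificity before
date are all assumptions made for this example, not recommendations of this guide.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Lower rank is preferred: street address > USPS ZIP Code only > county only.
SPECIFICITY_RANK = {"street": 0, "zip": 1, "county": 2}


@dataclass
class CandidateAddress:
    """One address reported for a tumor record (hypothetical structure)."""
    text: str
    specificity: str      # one of the keys of SPECIFICITY_RANK
    abstract_date: date   # submission date of the abstract that carried the address


def choose_primary(candidates: List[CandidateAddress]) -> CandidateAddress:
    """Pick the most specific candidate, breaking ties by earliest abstract date."""
    if not candidates:
        raise ValueError("at least one candidate address is required")
    return min(
        candidates,
        key=lambda c: (SPECIFICITY_RANK[c.specificity], c.abstract_date),
    )
```

The addresses that are not chosen stay on the record, consistent with the point above that
the others need not be discarded. Only the ordering rule would change if a registry adopted
different or additional factors.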

25. HYBRIDIZED DATA
This section introduces the concept of hybridized data and the ways in which it is produced
and used.
25.1 HYBRIDIZED DATA DEFINED
Typically, once the geocodes have been created for a set of input data, the next step in the
spatial analysis procedure is to associate other relevant aspatial data with them. This
process has been termed hybrid georeferencing, which describes the association of attributes
from other datasets through a spatial join operation based on common spatial attributes
(Martin and Higgs 1996). Hybrid data are the data created through hybrid georeferencing with
attributes from more than one dataset, joined by common spatial attributes.
The point-in-polygon method is an approach in which a point that is spatially contained
within a polygon has the attributes of the polygon associated with it. This is the most
common method of hybrid georeferencing and is suitable when the secondary data that a
researcher wishes to associate with a geocode are in a vector-based polygon format.
Alternatively, when the data are in a raster-based format, this notion of area within a
geographic feature typically does not exist because of their pixel-based nature. In these
cases spatial buffers (i.e., catchment areas) around the point normally are used to obtain
aggregate values over an area of the raster to associate with the point.
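
The point-in-polygon association can be sketched without a GIS using the standard
ray-casting (even-odd) test. The version below assumes simple polygons given as lists of
(x, y) vertices in a planar coordinate system and ignores holes and multi-part features; the
function names are chosen for this illustration, and production work would normally rely on
a GIS or spatial database instead.

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) lies inside the polygon, using the ray-casting test.

    polygon is a list of (x, y) vertices; the last vertex connects back to the first.
    Points exactly on the boundary may be classified either way.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def attributes_for_point(x, y, polygons):
    """Return the attributes of the first polygon containing the point, or None.

    polygons is a list of (vertex_list, attribute_dict) pairs.
    """
    for ring, attributes in polygons:
        if point_in_polygon(x, y, ring):
            return attributes
    return None
```

The caveat about boundary points in the docstring is the computational face of the problem
discussed later in this section: geocodes close to a boundary can be assigned to the wrong
polygon.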
Without the support of a GIS or spatial database capable of performing spatial operations,
some type of non-spatial association, either probabilistic or deterministic text linkage,
may be the only option. These non-spatially oriented methods of associating supplemental
data usually consist of relational database-like procedures in which an aspatial attribute
associated with the geocode is used as a key into another dataset containing values for
objects with the same key. For example, a city name attribute associated with a geocode can
be used as a key into a relational database containing city names as identifiers for
demographic information. Note that some vendors and organizations hybridize their reference
data (e.g., maintain a layer of municipality names joined with streets).
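
A deterministic, key-based association of this kind can be sketched with an ordinary
dictionary lookup. In the example below the function names, the "city" key, and the
normalization rule (trim whitespace, ignore case) are assumptions for illustration; the
lookup table would in practice come from the supplemental dataset.

```python
def normalize_key(value):
    """Normalize a text key so that trivial differences do not block a match."""
    if not isinstance(value, str):
        return None
    return value.strip().upper()


def join_on_key(geocodes, lookup, key="city"):
    """Attach supplemental attributes to each geocode record by a shared text key.

    geocodes is a list of dicts; lookup maps normalized key values to attribute dicts.
    Each output record carries a "linked" flag recording whether a match was found.
    """
    joined = []
    for record in geocodes:
        match = lookup.get(normalize_key(record.get(key)))
        merged = dict(record)
        merged["linked"] = match is not None
        if match is not None:
            merged.update(match)
        joined.append(merged)
    return joined
```

Keeping the "linked" flag with each record makes unmatched keys visible rather than silently
dropping them, which matters because text keys such as city names are prone to spelling
variants.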
This association of attributes with geocodes from other datasets using spatially based
methods must be undertaken carefully; users need to pay careful attention to the base
layers, source, and data used for hybridizing. A researcher must consider how accurate and
representative these associations are, given the information that they know about both their
geocodes and the supplemental data sources. The literature has shown that point-in-polygon
methods are capable of incorrectly associating data from the wrong polygon to a geocode for
numerous reasons (e.g., as a result of the data distribution the point-in-polygon methods
assume [Sadahiro 2000], simple inaccuracies of the supplemental data, resolution differences
between the polygon boundaries and the street networks from which geocodes were produced
[Chen W. et al. 2004], or inaccuracy in the point's location). For these reasons,
associating higher-level data with geocodes through non-spatial-joins may be an attractive
option (e.g., obtaining a CT by indexing off the block face in the TIGER/Line files) rather
than relying on a geocoded point.
The accuracy and appropriateness of the geocode in terms of both spatial and temporal
characteristics also must be considered. Geocodes close to the boundaries of areal units can
and do get assigned to the wrong polygons, resulting in the incorrect linkage of attributes.
Likewise, the polygons used for the hybridization may not be representative of the spatial
characteristics of the region they represent at the relevant time. For example, if a case
was diagnosed in 1990, it may be more appropriate to associate 1990 Census data with the
residence address, rather than Census data from 2000 or 2010. Characteristics of both the
geocode and the supplemental datasets used to derive hybrid data need to be recorded along
with the hybrid datasets produced so that registries and researchers can weigh these factors
when deciding on the appropriateness, correctness, and usefulness of the results derived
from them. Best practices relating to the hybridization of data are listed in Best Practices
59. Rushton et al. (2006) and Beyer et al. (2008) and the references within can provide
further guidance on the topics in this section.
Best Practices 59 Hybridizing data
Policy Decision: Which hybridization method should be used?
Best Practice:
If spatial operations are supported, the point-in-polygon method can be used to associate an
aggregate value to the geocode calculated from the values within the polygon.
If spatial operations are not supported, relational joins should be used to match the
geocode to values in the secondary data based on shared keys between the two.
Policy Decision: When, how, and which secondary data should be spatially associated with
geocoded data (i.e., creating hybrid data)?
Best Practice:
Hybrid georeferenced data should not be considered certain if the uncertainty in the
geocoded record is larger than the distance to the nearest boundary.
Policy Decision: When and how should the certainty of hybrid data be calculated?
Best Practice:
Hybrid georeferenced data should not be considered certain if the uncertainty in the
geocoded record is larger than the distance to the nearest boundary.
Policy Decision: What metadata should be maintained?
Best Practice:
When geocoded locations are associated with other attributes through hybrid georeferencing,
the distance to the closest boundary should be included with the record.
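
The certainty rule in Best Practices 59 compares two lengths: the positional uncertainty of
the geocode and its distance to the nearest polygon boundary. The sketch below shows one way
to compute the second quantity and apply the rule, assuming planar (projected) coordinates
in the same linear unit as the uncertainty and a simple polygon given as a vertex list; the
function names are hypothetical.

```python
import math


def distance_to_segment(px, py, ax, ay, bx, by):
    """Shortest distance from point (px, py) to the segment from (ax, ay) to (bx, by)."""
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment: a single point
    # Project the point onto the segment and clamp to the segment's extent.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_boundary(px, py, polygon):
    """Shortest distance from the point to any edge of the polygon."""
    n = len(polygon)
    return min(
        distance_to_segment(px, py, *polygon[i], *polygon[(i + 1) % n])
        for i in range(n)
    )


def hybrid_link_is_certain(px, py, polygon, uncertainty):
    """Apply the rule: not certain if uncertainty exceeds the distance to the boundary."""
    return uncertainty <= distance_to_boundary(px, py, polygon)
```

The value returned by distance_to_boundary is also the metadata item that the last row of
Best Practices 59 recommends storing with each hybrid record.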
25.2 GEOCODING IMPACTS ON INCIDENCE RATES
When determining incidence rates, cancer registries need to be able to "spatially match" the
geographic resolution/accuracy of the numerators with the denominators. In this case, the
numerators are the case counts and the denominators are the corresponding population counts.
Mismatches can and frequently do occur between these two data sources, which can have
dramatic effects capable of either erroneously inflating or deflating the observed incidence
rates. For instance, it is possible for a cancer registry to have a very accurate case
location (e.g., based on geocoding with a satellite map), but then come to an incorrect
conclusion in the analysis of incidence rates because the denominator is based on different
(less accurate) geography. Care should be taken to ensure that the geographic
resolutions/accuracies of the data used to create the cases (numerators) and the geographic
resolution/accuracy of the denominator are derived at commensurate scales. For the cases, it
should further be noted that resolution/accuracy needs to be considered both in terms of
that of the underlying geocoding reference dataset (e.g., TIGER/Lines [United States Census
Bureau 2008d] vs. Tele Atlas [2008c]), and that of the resulting geocoded spatial output
(e.g., USPS ZIP Code centroid match vs. satellite imagery manual placement). Best practices
relating to the calculation of incidence rates are listed in Best Practices 60.
Best Practices 60 Incidence rate calculation
Policy Decision: At what resolutions should reference locations and geocodes be used to
calculate incidence rates?
Best Practice:
Incidence rates should only be calculated when the geocode and reference locations are at
the same resolution.
Incidence rates should be calculated only after considering:
Appropriateness of geographic area of analysis
Issues of confidentiality and data suppression
Tenure of residence
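
The requirement that numerator and denominator share a resolution can be enforced at the
point where the rate is computed. The following is a minimal sketch that assumes each count
is tagged with a text label naming its geographic resolution; the function name, the labels,
and the per-100,000 scaling are illustrative assumptions, and the check does not address the
other considerations listed above (appropriateness of area, suppression, tenure of
residence).

```python
def incidence_rate(cases, population, case_resolution, population_resolution, per=100_000):
    """Crude incidence rate per `per` persons, refusing mismatched resolutions.

    case_resolution and population_resolution are labels such as "census tract";
    the rate is only computed when the two labels are identical.
    """
    if case_resolution != population_resolution:
        raise ValueError(
            "numerator resolution %r does not match denominator resolution %r"
            % (case_resolution, population_resolution)
        )
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * per / population
```

With hypothetical counts of 12 cases in a population of 48,000, both at the same resolution,
the function returns a rate of 25 per 100,000; calling it with cases labeled "census tract"
and population labeled "county" raises an error instead of returning a misleading number.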

25.3 IMPLICATIONS OF AGGREGATING UP
When the output spatial location from the geocoding process is derived from a reference
feature composed of a significant amount of area (e.g., a city), its use in spatial analyses
performed with smaller areal units can create problems with the reliability of the research
results. Data gathered at one scale and applied at another may be affected by the modifiable
areal unit problem (MAUP), which is referred to often in the geography and GIS literatures
(e.g., Openshaw 1984, Grubesic and Murray 2004, Gregorio et al. 2005, Zandbergen and
Chakraborty 2006). This should be taken into consideration when considering the validity of
spatial analyses. Best practices relating to the MAUP are listed in Best Practices 61.

Best Practices 61 MAUP
Policy Decision: When does the MAUP need to be accounted for in spatial analysis using
hybridized geocode data?
Best Practice:
The MAUP may need to be accounted for when data gathered at one scale are applied at
another.
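
A small numeric illustration may help show why the choice of areal units matters. The sketch
below aggregates the same four hypothetical blocks (made-up case and population counts) into
two different two-zone schemes and computes a rate per 100,000 for each zone; the function
name and all data are invented for this example.

```python
def zone_rates(blocks, zoning, per=100_000):
    """Aggregate block counts into zones and return each zone's rate per `per` persons.

    blocks maps block name -> (cases, population);
    zoning maps zone name -> list of block names belonging to that zone.
    """
    rates = {}
    for zone, members in zoning.items():
        cases = sum(blocks[name][0] for name in members)
        population = sum(blocks[name][1] for name in members)
        rates[zone] = cases * per / population
    return rates


# Made-up counts for four equally populated blocks.
BLOCKS = {"a": (4, 1000), "b": (0, 1000), "c": (1, 1000), "d": (1, 1000)}

# Two alternative ways of grouping the same blocks into two zones.
ZONING_NORTH_SOUTH = {"north": ["a", "b"], "south": ["c", "d"]}
ZONING_WEST_EAST = {"west": ["a", "c"], "east": ["b", "d"]}
```

Under the first zoning the two rates are 200 and 100 per 100,000; under the second they are
250 and 50, even though the underlying cases and populations are unchanged. The difference
comes only from where the zone boundaries were drawn.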
26. ENSURING PRIVACY AND CONFIDENTIALITY
This section details the reasons for ensuring privacy and confidentiality in the geocoding
process and identifies key approaches that help to facilitate these outcomes.
26.1 PRIVACY AND CONFIDENTIALITY
Patient privacy and confidentiality need to be foremost concerns in any health-based data
collection or research study. At no point during any of the processes undertaken as part of
a geocoding endeavor should a patient ever be identifiable to anyone other than the staff
members who have been trained to deal with such data. However, both the input and output
data used in the geocoding process are necessarily identifiable information specifically
because of what they encode: a location related to a person. The simple act of issuing a
query to a geocoding process reveals both these input and output data, so great care needs
to be taken to assure that this is done in a secure fashion. An excellent and extremely
detailed survey of privacy issues directly related to cancer registries is available in
Gittler (2008a) and the extensive references within, and a state-by-state breakdown of
applicable statutes and administrative regulations can be found in Gittler (2008b).
Cancer registries already have existing policies in place to securely collect, store, and
transmit patient data both within the organization and to outside researchers. These
practices also should be applied to every aspect of the geocoding process. The difficulty in
actually doing this, however, should be evident from the basic nature of the geocoding
process. Geocoding is an extraordinarily complex process, involving many separate components
working in concert to produce a single result. Each of these components and the interactions
between them need to be subject to these same security constraints. The simplest case is
when the geocoding process is performed in a secure environment at the registry, behind a
firewall with all components of the process kept locally, under the control of registry
staff. In this scenario, because everything needed is at one location and under the control
of a single entity, it is possible to ensure that the flow of the data as it moves through
the geocoding process never leaves a secure environment. The cost of this level of security
equates to the cost of gathering data and bringing it behind the firewall. A differentiation
needs to be made between an information leak, which is a breach of privacy that may be
overridden by public health law and unavoidable in the interest of public health, and a
breach of confidentiality, which is an outright misuse of the data.
If certain aspects of the geocoding process involve third parties, these interactions need
to be inspected carefully to ascertain any potential security concerns. For instance, if
address validation is performed by querying a county assessor's database outside of the
secure environment of the registry, sensitive information may be exposed. Web servers
typically keep logs of the incoming requests including the form parameters (which in this
case will be a patient address) as well as the source (the Internet protocol [IP] address of
the machine requesting the service). These IP addresses are routinely reversed (e.g., using
"whois" tools such as those provided by the American Registry of Internet Names and Numbers
[2008]) to determine the issuing organization (the registry) for usage and abuse monitoring
of their services. It would not take a great leap for someone to conclude that when a
registry queries for an address, they are using it to determine information about people
with a disease of some kind. Any time that data are transferred from the registry to another
entity, measures need to be taken to ensure security, including at a minimum encrypting the
data before transmitting it securely (e.g., hypertext transfer protocol over secure socket
layer, or HTTPS). Best practices relating to geocoding process auditing are listed in Best
Practices 62.
Best Practices 62: Geocoding process privacy auditing when behind a firewall
Policy Decision: When and how should geocoding be audited for information leakages and/or
breaches of confidentiality?
Best Practice: The geocoding process and all related components should be audited for
information leakages and/or breaches of confidentiality any time any component is changed.

Information workflow diagrams and organizational charts should be used to determine where,
how, and between which organizations information flows as it is processed by a geocoding
system from start to finish.

At any point where the information leaves the registry, exactly what information is being
transmitted and in what manner should be carefully examined to determine if it can be used
to identify the patient or tumor.

Policy Decision: When should private, confidential, or identifiable information about a
patient or tumor be released during the geocoding process?
Best Practice: As little private, confidential, or identifiable information about a patient or
tumor as possible should be released, and only in a secure fashion, during the geocoding
process.

Upon receiving a query to geocode an address from a cancer registry, the service could
reasonably assume that this specific address has something to do with a disease. In this case,
the verification service would essentially be receiving a list of the addresses of cancer
patients, and this registry would be exposing identifiable health information. Several simple
measures can be taken to avoid this, but the specific practices used may depend on the local
policies as to what is allowed versus what is not allowed. As a first example, cancer case
addresses submitted to address-verification services can be intermixed with other randomly
generated, yet perfectly valid addresses (as well as other invalid addresses such that patient
records would not be the only ones that were invalid). This is a simple example of ensuring
k-anonymity (Sweeney 2002), where in this case k equals the number of real addresses plus
the number of fake addresses (which may or may not be an acceptable level). However, this
may increase the cost of the service if the registry is charged on a per-address basis.
Alternatively, a registry could require the third party to sign a confidentiality agreement to
protect the address data transmitted from the registry.
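The intermixing of real and decoy addresses described above can be sketched as follows. This is a minimal illustration only: the function names, the decoy pool, and the sample addresses are hypothetical, and the ratio of decoys to real addresses would have to follow the registry's own policy.

```python
import random


def build_submission_batch(real_addresses, decoy_pool, decoys_per_real=4, seed=None):
    """Return the real addresses mixed with decoys, in shuffled order.

    decoy_pool is a hypothetical list of valid, non-patient addresses
    (for example, drawn from a public address-point file).
    """
    rng = random.Random(seed)
    real = set(real_addresses)
    # Never pick a decoy that is itself a real case address.
    candidates = [a for a in decoy_pool if a not in real]
    needed = decoys_per_real * len(real)
    if len(candidates) < needed:
        raise ValueError("decoy pool is too small for the requested ratio")
    batch = sorted(real) + rng.sample(candidates, needed)
    rng.shuffle(batch)  # the order must not reveal which entries are real
    return batch


def keep_real_results(results, real_addresses):
    """Discard decoy results once the external service has answered."""
    real = set(real_addresses)
    return {addr: out for addr, out in results.items() if addr in real}


# Made-up example data.
real = ["12 Oak St", "9 Elm Ave"]
pool = [f"{n} Main St" for n in range(100, 120)]
batch = build_submission_batch(real, pool, decoys_per_real=3, seed=1)
```

Here the batch holds 2 real and 6 decoy addresses, so k in the sense used above is 8. The mapping of which entries are real stays inside the registry.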
If a registry uses a third party for any portion of the geocoding process, it needs to ensure
that this outside entity abides by and enforces the same rigorous standards for security
as employed by the registry. These assurances should be guaranteed in legal contracts and
the contractors should be held accountable if breaches in security are discovered (e.g.,
financial penalties for each disclosure). Example assurance documents are provided in Appendix
A. These data also should be transmitted over a secure Internet connection. Free services
such as the geocoding application programmer interfaces, or APIs, now offered by search
engines competing in the location-based service market (e.g., Google, Inc. 2008c; Yahoo!,
Inc. 2008) cannot, and most likely will not, be able to honor any of these assurances. In these
cases, the registry itself must take measures to assure the required level of security such as
those previously mentioned (e.g., submitting randomized bundles of erroneous as well as real
data). Best practices related to third-party processing are listed in Best Practices 63.
Best Practices 63: Third-party processing (external processing)
Policy Decision: When and how can geocoding be performed using external sources without
revealing private, confidential, or identifiable information about a patient or tumor?
Best Practice: If geocoding, or any portion thereof (e.g., address validation), is performed
using external sources, all information except the minimum necessary attributes (e.g.,
address attributes) should be removed from records before they are sent for processing.

Policy Decision: When and how should third-party processing confidentiality requirements
be determined?
Best Practice: Any third party that processes records must agree to ensure the same
confidentiality standards used at the registry before data are transmitted to them.

Policy Decision: How can and should data be transmitted to third-party processors?
Best Practice: Confidential postal mail or secure network connections should be used to
transmit data that have been encrypted from registries to third-party processing services.

Policy Decision: How can and should data be submitted to third-party processing services
that cannot ensure confidentiality?
Best Practice: The data should be submitted as part of a randomized set of data, or some
other method that ensures k-anonymity.
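The "minimum necessary attributes" practice can be illustrated with a short sketch. The field names, the sample record, and the use of a random token in place of the internal record identifier are assumptions made for illustration; they are not part of the guide.

```python
import secrets

# Hypothetical field names; adjust to the registry's own record layout.
ADDRESS_FIELDS = ("street", "city", "state", "zip")


def prepare_for_external(records):
    """Return (outgoing, key_map).

    outgoing holds only address attributes plus a random token;
    key_map (token -> internal record id) never leaves the registry.
    """
    outgoing, key_map = [], {}
    for record in records:
        token = secrets.token_hex(8)
        key_map[token] = record["record_id"]
        stripped = {field: record.get(field, "") for field in ADDRESS_FIELDS}
        stripped["token"] = token
        outgoing.append(stripped)
    return outgoing, key_map


# Made-up example record.
records = [{"record_id": 17, "name": "Jane Doe", "site": "C50.9",
            "street": "12 Oak St", "city": "Springfield", "state": "CA",
            "zip": "90000"}]
outgoing, key_map = prepare_for_external(records)
```

Results returned by the external service can be re-attached to the original records through `key_map`, so no patient or tumor attribute needs to accompany the address.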

Registries also should be cautious of the logs maintained by any geocoding process.
Typically, for performance and tracking of query-based services (e.g., Web sites and databases),
logs of performance-related metrics are generated so that the processes can be optimized.
These logs potentially could be used to recreate the queries or portions thereof that created
them, essentially reverse engineering the original input. Here, the policies of the registry can
and should enforce that after an input query has been submitted and a result returned, no
trace of the query should be retained. Although information and statistics on query
performance are useful for improving the geocoding process, this information represents a
patient's location, and needs to be securely discarded or de-identified in some fashion.
Best practices relating to log files are listed in Best Practices 64.
Best Practices 64: Geocoding process log files
Policy Decision: When should log files containing records of geocoding transactions be
cleared?
Best Practice: Log files containing any private, confidential, or identifiable information
about a patient or tumor should be deleted immediately after the geocode is produced.
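One way to keep performance statistics without retaining any trace of the query is to record only non-identifying metrics. The sketch below assumes a hypothetical `geocode_fn` callable and a stand-in geocoder; it illustrates the idea and is not a prescribed logging design.

```python
import time


def timed_geocode(geocode_fn, address, metrics):
    """Run geocode_fn(address) and record only non-identifying metrics."""
    start = time.perf_counter()
    result = geocode_fn(address)
    metrics.append({
        "elapsed_s": time.perf_counter() - start,
        "matched": result is not None,
        # Deliberately no address, coordinates, or record identifier.
    })
    return result


def fake_geocoder(address):
    """Stand-in for a real geocoder, used only for this example."""
    return (34.02, -118.28) if "Oak" in address else None


metrics = []
timed_geocode(fake_geocoder, "12 Oak St", metrics)
timed_geocode(fake_geocoder, "1 Nowhere Rd", metrics)
```

Any log written by the underlying geocoder itself would still need to be cleared as described in Best Practices 64.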

At the other end of the geocoding process, once spatial locations have been securely
produced, confidentiality still must be ensured regarding these spatial locations once they are
visualized. Recent research has shown that geocodes can be used to determine the addresses
from which they were derived (a process known as reverse geocoding [Brownstein et al.
2006, Curtis et al. 2006]). Researchers need to ensure that the methods they use to display
their data (e.g., their presentation on a dot map) do not lend themselves to this type of
reverse geocoding, from which the input locational data can be derived from the spatial
output.
To accomplish this, several methods of geographic masking have been described in the
literature (for detailed reviews and technical approaches see Armstrong et al. 1999,
Zimmerman et al. 2008, and Chen et al. 2008). One method is to use aggregation to place the
spatial location into a larger class of data from which it cannot be individually recognized.
Alternatively, one could use a random offset to move the true spatial location a random
distance in a random direction. Other methods exist as well, and all serve the same purpose:
protecting the privacy of the individual patient. This, of course, comes at the expense of
actually introducing spatial error or uncertainty into the data from which the study results will
be derived.
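A random offset of the kind described above can be sketched as follows. The function names, the distance limits, and the example coordinates are assumptions for illustration; the conversion from meters to degrees uses a spherical Earth and a local flat approximation that is only reasonable for small offsets.

```python
import math
import random

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def random_offset(lat, lon, max_distance_m, min_distance_m=0.0, rng=random):
    """Move a point a random distance in a random direction."""
    distance = rng.uniform(min_distance_m, max_distance_m)
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    north_m = distance * math.cos(bearing)
    east_m = distance * math.sin(bearing)
    new_lat = lat + math.degrees(north_m / EARTH_RADIUS_M)
    new_lon = lon + math.degrees(
        east_m / (EARTH_RADIUS_M * math.cos(math.radians(lat))))
    return new_lat, new_lon


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters, used here to check the displacement."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Made-up example location, masked by 100 to 500 meters.
masked = random_offset(34.0224, -118.2851, max_distance_m=500,
                       min_distance_m=100, rng=random.Random(42))
displacement = haversine_m(34.0224, -118.2851, *masked)
```

The displacement is exactly the spatial error introduced into the data, so the distance limits trade privacy against analytical accuracy.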
Most researchers limit geographic masking of geocodes to within either
county or CBG boundaries to develop meaningful conclusions. Thus, they want to ensure
that the geographically masked point and the original point are within the same
county/CBG. This presents additional security concerns because in many counties, the universe
of address points is quite commonly available and downloadable as GIS data. It is possible to
develop geographic masking procedures that meet standards for k-anonymity as measured
against the universe of all known address points. See Sweeney (2002) for guidance on how
this can be accomplished.
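One possible way to relate a masking distance to k-anonymity against a known set of address points is to use the distance to the k-th nearest address point as the masking radius. This is an illustrative assumption, not a procedure taken from Sweeney (2002); it assumes planar coordinates (e.g., meters in a projected coordinate system) and does not enforce the county/CBG constraint.

```python
import math


def radius_for_k(case_xy, address_points_xy, k):
    """Distance from the case to its k-th nearest known address point."""
    if k < 1:
        raise ValueError("k must be at least 1")
    distances = sorted(math.dist(case_xy, p) for p in address_points_xy)
    if len(distances) < k:
        raise ValueError("fewer than k address points available")
    return distances[k - 1]


# Made-up planar coordinates; the case itself is one of the address points.
addresses = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
radius = radius_for_k((0.0, 0.0), addresses, k=3)
```

A circle of this radius around the case contains at least k known address points, so sparsely populated areas receive larger masking distances than dense ones.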
Depending on how much of the data actually need to be displayed, a researcher could
first use the accurate data to develop their results, and then use geographically masked data
in the presentation of the results. It also is possible to use geographically masked data for
analyses as well, but this approach requires making the analysis scale larger. Best practices
related to masking geocoded data are listed in Best Practices 65.
Best Practices 65: Geographic masking
Policy Decision: How and when should geocoded data be masked to ensure the required
levels of confidentiality?
Best Practice: Any time that geocoded data are visualized (e.g., displayed on a map, Web
site, or in a presentation), they should be masked to conform to the confidentiality
requirements of the registry.

Policy Decision: How can and should geocoded data be masked when visualized?
Best Practice: Any proven and suitable method that accomplishes the required level of
confidentiality can be used:
- Randomization
- Aggregation
- Randomized offset

It seems obvious, but is worth mentioning, that once a registry has released data to a
researcher, the researcher can/may misuse the data in a fashion that the registry did not intend,
even after Institutional Review Board approval. To prevent these types of abuses (and
protect themselves), registries should follow the best practices listed in Best Practices 66, as
well as make use of research assurance documents such as those in Appendix A.
Best Practices 66: Post-registry security
Policy Decision: How should registries be involved with researchers to ensure proper use
of geocoded data?
Best Practice: Registries should work closely with researchers to be aware of the subsequent
results/publications.

Registries should provide guidance to researchers on how to securely store, transfer, use,
report, and visualize geocoded data.

Policy Decision: How can registries protect themselves from the liability of researchers
misusing geocoded data?
Best Practice: Registries should have an assurance document that details the appropriate
use of the geocoded data and that researchers must initial prior to release of the data.

GLOSSARY OF TERMS

Absolute Geocode: An absolute known geographic (spatial) location or an offset from
an absolute known location.
Absolute Input Data: See Absolute Locational Description.
Absolute Locational Description: A description which, by itself, contains enough
information to produce an output geographic location (e.g., locations described in terms
of linear addressing systems).
Acceptable Match Rate: The match rate value a geocoding process must meet such that
the geocoded data can be considered valid for use in a research study.
Accuracy: A measure of how close to a true value something is.
Actual Lot Feature Interpolation: A linear-based feature interpolation algorithm that is
not subject to the parcel homogeneity assumption or parcel existence assumption.
Address Ambiguity: With regard to feature matching, the case when a single input
address can possibly match to multiple reference features.
Address Matching: A specialized case of feature matching, strictly dealing with matching
postal street addresses to features in the reference data source, usually TIGER-type
street segments or areal unit delineations (Census delineations, USPS ZIP Code
delineations, etc.).
Address Normalization: The process of organizing and cleaning address data to increase
efficiency for data use and sharing.
Address Parity: An indication of which side of a street an address falls, even or odd
(assumes binary address parity for a street segment).
Address Range: Aspatial attributes associated with a linear-based reference data (i.e.,
street vectors), describing the valid range of addresses on the street.
Address Range Feature Interpolation: A linear-based feature interpolation algorithm that
is subject to the parcel existence, parcel homogeneity, and parcel extent assumptions.
Address Standardization: The process of converting an address from a normalized
format into a specified format (e.g., United States Postal Service [2008d], United
States Federal Geographic Data Committee [2008b]).
Address Validation: The process of determining whether or not an address actually
exists.
Administrative Unit: An administratively defined level of geographic aggregation. Typical
units include (from largest to smallest): country, state, county, major county
subdivision, city, neighborhood, census tract, census block, and parcel.
Advanced Match Rate: A match rate measure that normalizes the number of records
attempted by removing those that are outside of the geographic boundaries of the
entity performing the geocoding.
Approximate Geocodes: See Pseudocodes.
Arc: See Edge.
Areal Unit: See Polygon.
Areal Unit-Based Data: See Polygon-Based Data.
Areal Unit-Based Feature Interpolation: A feature interpolation algorithm that uses a
computational process (e.g., geographic centroid) to determine a suitable output
from the spatial geometry of polygon-based reference features (e.g., parcel).
Areal Unit-Based Reference Dataset: See Polygon-Based Reference Dataset.
Aspatial: Data or attributes describing data that do not refer to spatial properties.
Atomic-Level Match Rate: A match rate associated with an individual attribute of an
input address.
Atomic Metrics: Metrics that describe the characteristics of individual members of a set.
Attribute Constraints: With regard to SQL-like feature matching, one or more predicates
that limit the reference features returned from a reference dataset in response to a
query.
Attribute Imputation: With regard to feature matching, the process of imputing missing
attribute values for an input address using other known variables about the
address, region, and/or individual to which it belongs.
Attribute Relaxation: Part of the feature-matching process. The procedure of easing the
requirement that all street address attributes (street number, name, pre-directional,
post-directional, suffix, etc.) must exactly match a feature in the reference data
source to obtain a matching street feature, thereby increasing the probability of
finding a match, while also increasing the probability for error. This is commonly
performed by removing or altering street address attributes in an iterative manner
using a predefined order.
Attribute Weighting: A form of probabilistic feature matching in which probability-based
values are associated with each attribute and either subtract from or add to
the composite score for the feature as a whole.
Backus-Naur Form (BNF): A notation used to construct a grammar describing the valid
components of an object (e.g., an address).
Batch-Geocoding: A geocoding process that operates in an automated fashion and
processes more than a single input datum at a time.
Best Practice: A policy or technical decision that is recommended but not required.
Binary Address Parity: The case when all addresses on one side of a street segment are
even and all addresses on the other side of the street segment are odd.
Blocking Scheme: Method used in record linkage to narrow the set of possible candidate
values that can match an input datum.
Breach of Confidentiality: An outright misuse of confidential data, including its
unauthorized release, or its use for an unintended purpose.
Cancer Registry: A disease registry for cancer.
Case Sensitivity: Whether or not a computational system differentiates between the case
of alphabetic characters (i.e., upper-case and lower-case).
Character-Level Equivalence: A string comparison algorithm that enforces
character-by-character equivalence between two or more strings.
City-Style Postal Address: A postal address that describes a location in terms of a
numbered building along a street with more detailed attributes defining higher
resolution within structures or large area geographic features possible.
Composite Feature Geocoding: A geocoding process capable of creating and utilizing
composite features in response to an ambiguous feature-matching scenario.
Composite Feature: A geographic feature created through the union of two or more
disparate geographic features (e.g., a bounding box encompassing all of the
geographic features).
Composite Score: The overall score of a reference feature being a match to an input
datum resulting from the summation of the individually weighted attributes.
Conditional Probability: The probability of something occurring, given that other
information is known.
Confidence Interval: The percentage of the data that are within a given range of values.
Confidence Threshold: The level of certainty above which results will be accepted and
below which they will be rejected.
Context-Based Normalization: An address normalization method that makes use of
syntactic and lexical analysis.
Continuous Feature: With regard to geocoding reference features, a geographic feature
that corresponds to more than one real-world entity (e.g., a street segment).
Coordinate System: A system for delineating where objects exist in a space.
Corner Lot Assumption: See Corner Lot Problem.
Corner Lot Problem: The issue arising during linear-based feature interpolation that
when using a measure of the length of the segment for interpolation it is unknown
how much real estate may be taken up along a street segment by parcels from
other intersecting street segments (around the corner), and the actual street length may
be shorter than expected.
Data Source: With regard to SQL-like feature matching, the relational table or tables of
the reference dataset that should be searched.
Deterministic Matching Algorithm: A matching algorithm based on a series of
predefined rules that are processed in a specific order.
Discrete Feature: With regard to geocoding reference features, a geographic feature that
corresponds to a single real-world entity (e.g., a single address point).
Disease Registry: Organizations that gather, store, and analyze information about a
disease for their area.
Dropback: The offset used in linear-based interpolation to move from the centerline of
the street vector closer to the face of the parcel.
Edge: The topological connection between nodes in a graph.
Empirical Geocoding: See Geocode Caching.
Error Rate: With regard to probabilistic feature matching, denotes a measure of the
instances in which two address attributes do not match, even though two records do.
Essence-Level Equivalence: A string comparison algorithm that determines if two or
more strings are "essentially" the same (e.g., a phonetic algorithm).
False Negative: With regard to feature matching, the result of a true match being
returned from the feature-matching algorithm as a non-match.
False Positive: With regard to feature matching, the result of a true non-match being
returned from the feature-matching algorithm as a match.
Feature: An abstraction of a real-world phenomenon.
Feature Disambiguation: With regard to feature matching, the process of determining
the correct feature that should be matched out of a set of possible candidates using
additional information and/or human intuition.
Feature Interpolation: The process of deriving an output geographic feature from
geographic reference features.
Feature Interpolation Algorithm: An algorithm that implements a particular form of
feature interpolation.
Feature Matching: The process of identifying a corresponding geographic feature in the
reference data source to be used to derive the final geocode output for an input.
Feature-Matching Algorithm: An algorithm that implements a particular form of feature
matching.
Footprint: See Geographic Footprint.
Gazetteer: A digital data structure that is composed of a set of geographic features and
maintains information about their geographic names, geographic types, and
geographic footprints.
Generalized Match Rate: A match rate measure that normalizes the number of records
attempted by removing those that could never be successfully geocoded.
Geocode (n.): A spatial representation of descriptive locational text.
Geocode (v.): To perform the process of geocoding.
Geocode Caching: Storing and re-using the geocodes produced for input data from
previous geocoding attempts.
Geocoder: A set of inter-related components in the form of operations, algorithms, and
data sources that collaboratively work together to produce a spatial representation
for aspatial locationally descriptive text.
Geocoding: The act of transforming aspatial locationally descriptive text into a valid
spatial representation using a predefined process.
Geocoding Algorithm: The computational component of the geocoder that determines
the correct reference dataset feature based on the input data and derives a spatial
output.
Geocoding Data Consumer-Group: The group in the geocoding process that utilizes
geocoded data (e.g., researchers).
Geocoding General Interest-Group: The group in the geocoding process that has a
general interest in the geocoding process (e.g., the general public).
Geocoding Practitioner-Group: The group in the geocoding process that performs the
task of the actual geocoding of input data.
Geocoding Process Designer-Group: The group in the geocoding process that is in
charge of making policy decisions regarding how geocoding will be performed.
Geocoding Requirements: The set of limitations, constraints, or concerns that influence
the choice of a particular geocoding option. These may be technical, budgetary,
legal, and/or policy.
Geographical Bias: The observation that the accuracy of a geocoding strategy may be a
function of the area in which the geocode resides.
Geographic Coordinate System: A coordinate system that is a representation of the
Earth as an ellipsoid.
Geographic Feature: A feature associated with a location relative to the Earth.
Geographic Footprint: A spatial description representing the location of a geographic
feature on Earth.
Geographic Information System: A digital system that stores, displays, and allows for
the manipulation of digital geographic data.
Geographic Location: See Geographic Feature.
Geographic Name: A name that refers to a geographic feature.
Geographic Type: A classification that describes a geographic feature taken from an
organized hierarchy of terms.
Gold Standard: Data that represent the true state of the world.
Georeference: To transform non-geographic information (information that has no
geographically valid reference that can be used for spatial analyses) into geographic
information (information that has a valid geographic reference that can be used for
spatial analyses).
Georeferenced: Information that was originally non-geographic and has been
transformed into geographic information.
Georeferencing: The process of transforming non-geographic information into
geographic information.
GIS Coordinate Quality Codes: A hierarchical scheme of qualitative codes that indicate
the quality of a geocode in terms of the data it represents.
Global Positioning System: The system of satellites, calibrated ground stations, and
temporally based calculations used to obtain geographic coordinates using digital
receiver devices.
Grammar: An organized set of rules that describe a language.
Graph: A topologically connected set of nodes and edges.
Highway Contract Address: See Rural Route Address.
Hierarchical Geocoding: A feature-matching strategy that uses geographic reference
features of progressively lower and lower accuracy.
Holistic-Level Match Rate: A match rate associated with an overall set of input address
data.
Holistic Metrics: Metrics that describe the overall characteristics of a set.
Hybrid Data: Data that are created by the intersection of two datasets along a shared
key; this key can be spatial, as in a geographic intersection, or aspatial, as in a
relational linkage in a relational database.
Hybrid Georeferencing: The procedure of associating relevant data (spatial or aspatial)
with geocoded data.
Information Leak: The release of information to individuals outside of the owning
institution; may be a breach of privacy that is overridden by public health laws and
unavoidable in the interest of public health.
In-House Geocoding: When geocoding is performed locally, and not sent to a third
party.
Input Data: The non-spatial locationally descriptive text that is to be turned into spatial
data by the process of geocoding.
Interactive Geocoding: When the geocoding process allows for intervention when
problems or issues arise.
Interactive Matching Algorithm: See Interactive Geocoding.
Interactive-Mode Geocoding: See Interactive Geocoding.
K-Anonymity: The case when a single individual cannot be uniquely identified out of at
least k other individuals.
Latitude: The north-south axis describing a location in the Geographic Coordinate
System.
Lexical Analysis: The process by which an input address is broken up into tokens.
Line: A 1-D geographic object that has a length and is composed of two or more 0-D
point objects.
Linear-Based Data: Geographic data that are based upon lines.
Linear-Based Feature Interpolation: A feature interpolation algorithm that operates on
lines (e.g., street vectors) and produces an estimate of an output feature using a
computational process on the geometry of the line (e.g., estimation between the
endpoints).
Line-Based Data: See Linear-Based Reference Dataset.
Linear-Based Reference Dataset: A reference dataset that is composed of linear-based
data.
Location-Based Service: A service that is provided based upon the geography of the
client.
Longitude: The east-west axis describing a location in the Geographic Coordinate
System.
Lookup Tables: With regard to substitution-based address normalization, a set of
known normalized values for common address tokens (e.g., those from the United
States Postal Service [2008d]).
Machine Learning: The subfield of computer science dealing with algorithms that induce
knowledge from data.
Mapping Function: The portion of the address standardization algorithm that translates
between an input normalized form and the target output standard.
Match: With regard to substitution-based address normalization, the process of
identifying an alias for an input address token within a lookup table of normalized values.
Matching Ability: With regard to a reference dataset, a measure of its ability to match an
input address while maintaining the same consistent matching criteria as applied to
other reference datasets.
Match Probability: With regard to probabilistic feature matching, the degree of belief,
ranging from 0 to 1, that a feature matches.
Matched Probability: With regard to probabilistic feature matching, the probability that
the attribute values of an input datum and a reference feature match when the
records themselves match.
Match Rate: A measure of the amount of input data that were able to be successfully
geocoded (i.e., assigned to an output geocode).
Matching Rules: With regard to substitution-based address normalization, the set of
rules that determine the valid associations between an input address token and the
normalized values in a lookup table.
Metadata: Descriptions associated with data that provide insight into attributes about it
(e.g., lineage).
NAACCR Data Standard: A mandatory (required) data formatting and/or content
scheme for a particular data item.
Network: A topologically connected graph of nodes and edges.
Node: The endpoints in a graph that are topologically connected together by edges.
North American Association of Central Cancer Registries (NAACCR): A professional
organization that develops and promotes uniform data standards for cancer
registration, provides education and training, certifies population-based registries,
aggregates and publishes data from central cancer registries, and promotes the use of
cancer surveillance data and systems for cancer control and epidemiologic
research, public health programs, and patient care to reduce the burden of cancer in
North America.
Non-Binary Address Parity: The case when addresses along one side of a street segment
can be even, odd, or both.
Non-Interactive Matching Algorithm: See Batch-Geocoding.
Non-Spatial: See Aspatial.
Output Data: The valid spatial representations returned from the geocoder derived from
features in the reference dataset.
Parcel Existence Assumption: The assumption used in linear-based feature interpolation
that all parcels associated with the address range of a reference feature exist.
Parcel Extent Assumption: The assumption used in linear-based feature interpolation
that parcels associated with the address range of a reference feature start
immediately at the beginning of the segment and fill all space to the other end.
Parcel Homogeneity Assumption: The assumption used in linear-based feature
interpolation that all parcels associated with the address range of a reference feature have
the same dimensions.
Parse Tree: A data structure representing the decomposition of an input string into its
component parts.
Pass: With regard to attribute relaxation, the relaxation of a single address attribute
within a step that does not result in a change of geographic resolution.
Per-Record-Geocoding: A geocoding process that processes a geocode for a single input
datum at a time; it may either be automated or not.
Place Name: See Geographic Name.
Phonetic Algorithm: A string comparison algorithm that is based on the way that a
string is pronounced.
Point: A 0-dimensional (0-D) object that has a position in space but no length.
Point-Based Data: Geographic data that are based upon point features.
Point-Based Reference Dataset: A reference dataset that is composed of point-based
data.
Point-Based Feature Interpolation: A feature interpolation algorithm that operates on
points (e.g., a database of address points) and simply returns the reference feature
point as output.
Point-In-Polygon Association: The process of spatially intersecting a geographic feature
that is a point with another geographic feature that is areal unit-based such that the
attributes of the areal unit can be associated with the point or vice versa.
Polygon: A geographic object bounded by at least three 1-D line objects or segments
with the requirement that they must start and end in the same location (i.e., node).
Polygon-Based Data: Geographic data that are based upon polygon features.
Polygon-Based Reference Dataset: A reference dataset that is composed of polygon
features.
Polyline: A geographic object that is composed of a series of lines.
Porter Stemmer: A word-stemming algorithm that works by removing common suffixes
and applying substitution rules.
Postal Address: A form of input data describing a location in terms of a postal
addressing system.
Postal Code: A portion of a postal address designating a region.
Postal Street Address: A form of input data containing attributes that describe a location
in terms of a postal street addressing system (e.g., USPS street address).
Postal ZIP Code: The USPS address portion denoting the route a delivery address is on.
Post Office Box Address: A postal address designating a storage location at a post office
or other mail-handling facility.
Precision: Regarding information retrieval, a measure of how correct the data retrieved
are.
Predicate: See Query Predicate.
Probability-Based Normalization: An address normalization method that makes use of
probabilistic methods (e.g., support vector machines or Hidden Markov Models).
Prior Probability: See Unconditional Probability.
Projection: A mathematical function to transfer positions on the surface of the Earth to
their approximate positions on a flat surface.
Pseudocoding: The process of deriving pseudocodes using a deterministic or probabilistic
method.
Pseudocodes: Approximations of true geocodes.
Query Predicate: In an SQL query, the list of attribute-value pairs indicating which
attributes of a reference feature must contain what values for it to be returned.
Raster-Based Data: Data that divide the area of interest into a regular grid of cells in
some specific sequence, usually row-by-row from the top left corner; each cell is
assigned a single value describing the phenomenon of interest.
Real-Time Geocoding: With regard to patient/tumor record geocoding upon intake, the
process of geocoding a patient/tumor record while the patient is available to
provide more detailed or correct information using an iterative refinement
approach.
Recall: Regarding information retrieval, a measure of how complete the data retrieved
are.
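
The Precision and Recall entries above can be illustrated with a short sketch. It assumes the standard information-retrieval formulas (precision = TP / (TP + FP), recall = TP / (TP + FN)), which this glossary does not spell out; the function names and the example counts are hypothetical.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of retrieved items that are correct (assumed standard IR formula)."""
    retrieved = true_positives + false_positives
    return true_positives / retrieved if retrieved else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of correct items that were retrieved (assumed standard IR formula)."""
    relevant = true_positives + false_negatives
    return true_positives / relevant if relevant else 0.0


# Hypothetical counts: 80 correct matches returned, 20 incorrect matches
# returned, and 40 correct matches that were never returned.
example_precision = precision(80, 20)
example_recall = recall(80, 40)
```

A geocoder tuned for high precision returns few wrong matches but may miss valid ones; one tuned for high recall does the reverse.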
Record Linkage: A sub-field of computer science relating to finding features in two or
more datasets which are essentially referring to the same feature.
Reference Dataset: The underlying geographic database containing geographic features
the geocoder uses to derive a geographic output.
Reference Data Source: See Reference Dataset.
Reference Set: With regard to record linkage, the set of candidate features that may
possibly be matched to an input feature.
Registry: See Disease Registry.
Relative Geocode: Geographic (spatial) location that is relative to some other reference
geographic feature.
Relative Input Data: See Relative Locational Description.
Relative Locational Description: A description which, by itself, does not contain enough
information to produce an output geographic location (e.g., locations described in
terms of directions from some other features).
Relative Predicted Certainty: A relative quantitative measure of the area of the accuracy
of a geocode based on information about how a geocode is produced.
Reverse Geocoding: The process of determining the address used to create a geocode
from the output geographic location.
Rural Route Address: A postal address identifying a stop on a postal delivery route.
Scrubbing: The component of address normalization that removes illegal characters and
white space from an input datum.
Selection Attributes: With regard to SQL-like feature matching, the attributes of the
reference feature that should be returned from the reference dataset in response to a
query.
Simplistic Match Rate: A match rate measure computed as the number of input data
successfully assigned a geocode divided by the number of input data attempted.
Situs Address: The actual physical address associated with the parcel.
Socioeconomic Status: Descriptive attributes associated with individuals or groups
referring to social and economic variables.
Software-Based Geocoding: A geocoding process in which a significant portion of the
components are software systems.
Soundex: A phonetic algorithm that encodes the sound of a string using a series of
character removal and substitution rules from a known table of values.
Spatial Accuracy: A measure of the correctness of a geographic location based on some
metric; can be qualitative or quantitative.
Spatial Resolution: A measure describing the scale of geographic data; can be qualitative
or quantitative.
Stemming: See Word Stemming.
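
The Soundex entry above can be made concrete with a minimal sketch. It assumes the commonly used American Soundex rules (keep the first letter, map consonants to digits from a fixed table, collapse adjacent equal codes, ignore H and W, pad to four characters); the guide itself does not list the table, and the function name is hypothetical.

```python
# Letter-to-digit table of the commonly used American Soundex variant (an assumption).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or an empty string for empty input."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = []
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")  # vowels map to "" and reset the previous code
        if code and code != previous:
            encoded.append(code)
        previous = code
    return (first + "".join(encoded) + "000")[:4]
```

Under these rules two street names that sound alike, such as "Robert" and "Rupert", receive the same code, which is what lets a geocoder match misspelled input against a reference feature.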
Step: With regard to attribute relaxation, the relaxation of multiple address attributes at
once that results in a change of geographic resolution.
Street Address Ambiguity: With regard to feature matching, the case when a single input
address can possibly match to multiple addresses along a street reference feature.
Street Network: A linear-data graph structure with edges representing streets and nodes
representing their intersections.
Street Segment Ambiguity: With regard to feature matching, the case when a single input
address can possibly match to multiple street reference features.
String Comparison Algorithms: Algorithms that calculate a similarity measure between
two or more input strings.
Sub-Parcel Address Ambiguity: With regard to feature matching, the case when a single
input address can possibly match to multiple addresses (e.g., buildings) within a
single parcel reference feature.
Substitution-Based Normalization: An address normalization method that makes use of
lookup tables for identifying commonly encountered terms based on their string
values.
Syntactic Analysis: The process by which tokens representing an input address are
placed into a parse tree based on the grammar which defines possible valid
combinations.
Temporal Accuracy: A measure of how appropriate the time period the reference dataset
represents is to the input data that are to be geocoded; can be qualitative or
quantitative.
Temporal Extent: An attribute associated with a datum describing a time period for
which it existed, or was valid.
Temporal Staleness: The phenomenon that occurs when data become out-of-date and
less accurate after the passage of time (e.g., a geocode cache becoming outdated
after a newer, more accurate reference dataset becomes available).
Topologically Integrated Geographic Encoding and Referencing (TIGER) Line Files:
Series of vector data products distributed by, and created to support the mission of,
the U.S. Census Bureau (United States Census Bureau 2008d).
Token: An attribute of an input address after it has been split into its component parts.
Tokenization: The process used to convert the single complete string representing the
whole address into a series of separate tokens.
Topological: Describes the connection between nodes and edges in a graph.
Toponym: See Geographic Name.
True Negative: With regard to feature matching, the result of a true non-match being
returned from the feature-matching algorithm as a non-match.
True Positive: With regard to feature matching, the result of a true match being returned
from the feature-matching algorithm as a match.
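
The Scrubbing and Tokenization entries above can be sketched together. This is a minimal illustration only: which characters count as "illegal" is an assumption made here (anything other than letters, digits, "#", "/", "-", and spaces), and the function names are hypothetical.

```python
import re

# Assumption: characters outside this set are treated as illegal.
_ILLEGAL = re.compile(r"[^A-Za-z0-9#/\- ]")


def scrub(raw_address: str) -> str:
    """Replace illegal characters with spaces and collapse surplus white space."""
    cleaned = _ILLEGAL.sub(" ", raw_address)
    return re.sub(r"\s+", " ", cleaned).strip()


def tokenize(raw_address: str) -> list[str]:
    """Split a scrubbed address string into its separate tokens."""
    scrubbed = scrub(raw_address)
    return scrubbed.split(" ") if scrubbed else []


tokens = tokenize("  3620 S. Vermont   Ave., Unit #444 ")
# tokens == ["3620", "S", "Vermont", "Ave", "Unit", "#444"]
```

A later normalization step (substitution-based or probability-based, as defined in this glossary) would then label each token as a number, directional, street name, suffix, and so on.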
Unconditional Probability: The probability of something occurring, given that no other
information is known.
Uniform Lot Feature Interpolation: A linear-based feature interpolation algorithm that is
subject to the parcel homogeneity assumption and parcel existence assumption.
United States Bureau of the Census: United States federal agency responsible for
performing the Census.
United States Census Bureau: See United States Bureau of the Census.
United States Postal Service (USPS): United States federal agency responsible for mail
delivery.
Unmatched Probability: With regard to probabilistic feature matching, the probability
that the attribute values of an input datum and a reference feature match when
the records themselves do not match.
Urban and Regional Information Systems Association (URISA): A non-profit
professional and educational association that promotes the effective and ethical use of
spatial information and information technologies for the understanding and
management of urban and regional systems.
Vector: An object with a direction and magnitude, commonly a line.
Vector-Based Data: Geographic data that consist of vector features.
Vector Feature: Phenomena or things of interest in the world around us (i.e., a specific
street like Main Street) that cannot be subdivided into phenomena of the same
kind (i.e., more streets with new names).
Vector Object: See Vector Feature.
Vertex: See Node.
Waiting It Out: The process of holding off re-attempting geocoding for a period of time
until something happens to increase the probability of a successful attempt (e.g.,
new reference datasets are released).
Weights: With regard to probabilistic feature matching, numeric values calculated as
a combination of matched and unmatched probabilities and assigned to each
attribute of an address to denote its level of importance.
Word Stemming: To reduce a word to its fundamental stem.
ZIP Code Tabulation Area: A geographical areal unit defined by the United States
Bureau of the Census.
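
The Weights and Unmatched Probability entries above describe attribute weights only as "a combination of matched and unmatched probabilities". One common way to combine them, assumed here rather than taken from this glossary, is the log-ratio form used in Fellegi-Sunter record linkage. In the sketch below, m is the matched probability, u is the unmatched probability, and all function names and example values are hypothetical.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Weight added when an attribute agrees: log2(m / u) (assumed log-ratio form)."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Weight added when an attribute disagrees: log2((1 - m) / (1 - u))."""
    return math.log2((1.0 - m) / (1.0 - u))


def match_score(agreements: dict[str, bool],
                probabilities: dict[str, tuple[float, float]]) -> float:
    """Sum per-attribute weights for one input datum / reference feature pair."""
    score = 0.0
    for attribute, agrees in agreements.items():
        m, u = probabilities[attribute]
        score += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return score


# Hypothetical (m, u) pairs for two address attributes.
probabilities = {"street_name": (0.9, 0.1), "zip": (0.95, 0.05)}
score = match_score({"street_name": True, "zip": False}, probabilities)
```

Attributes whose values rarely agree by chance (small u) earn a large positive weight on agreement, which is how the weight reflects an attribute's level of importance.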


REFERENCES

Abe T and Stinchcomb D 2008 Geocoding Best Practices in Cancer Registries. In Rushton
et al. (eds) Geocoding Health Data: The Use of Geographic Codes in Cancer Prevention
and Control, Research, and Practice. Boca Raton, FL, CRC Press: 111-126.
Agarwal P 2004 Contested nature of 'place': Knowledge mapping for resolving
ontological distinctions between geographical concepts. In Egenhofer et al. (eds)
Proceedings of 3rd International Conference on Geographic Information Science (GIScience).
Berlin, Springer Lecture Notes in Computer Science No 3234: 1-21.
Agouris P, Beard K, Mountrakis G, and Stefanidis A 2000 Capturing and Modeling
Geographic Object Change: A SpatioTemporal Gazetteer Framework. Photogrammetric
Engineering and Remote Sensing 66(10): 1224-1250.
Agovino PK, Niu X, Henry KA, Roche LM, Kohler BA, and Van Loon S 2005 Cancer
Incidence Rates in New Jersey's Ten Most Populated Municipalities, 1998-2002.
NJ State Cancer Registry Report. Available online at: http://www.state.nj.us/
health/ces/documents/cancer_municipalities.pdf. Last accessed April 23rd, 2008.
Alani H 2001 Voronoi-based region approximation for geographical information
retrieval with gazetteers. International Journal of Geographical Information Science 15(4):
287-306.
Alani H, Kim S, Millard DE, Weal MJ, Hall W, Lewis PH, and Shadbolt N 2003 Web
based Knowledge Extraction and Consolidation for Automatic Ontology
Instantiation. In Proceedings of Knowledge Capture (KCap'03), Workshop on Knowledge Markup
and Semantic Annotation.
Alexandria Digital Library 2008 Alexandria Digital Library Gazetteer. Available online
at: http://alexandria.ucsb.edu/clients/gazetteer. Last accessed April 23rd, 2008.
American Registry for Internet Names 2008 WHOIS Lookup. Available online at:
http://www.arin.net. Last accessed April 23rd, 2008.
Amitay E, Har'El N, Sivan R, and Soffer A 2004 Web-A-Where: Geotagging Web
Content. In Sanderson et al. (eds) Proceedings of the 27th Annual International ACM SIGIR
Conference (SIGIR '04). New York, ACM Press: 273-280.
Arampatzis A, van Kreveld M, Reinbacher I, Jones CB, Vaid S, Clough P, Joho H, and
Sanderson M 2006 Web-based delineation of imprecise regions. Computers,
Environment and Urban Systems.
Arbia G, Griffith D, and Haining R 1998 Error Propagation Modeling in Raster GIS:
Overlay Operations. International Journal of Geographical Information Science 12(2):
145-167.
Arikawa M and Noaki K 2005 Geocoding Natural Route Descriptions using Sidewalk
Network Databases. In Proceedings of the International Workshop on Challenges in Web
Information Retrieval and Integration, 2005 (WIRI '05): 136-144.
Arikawa M, Sagara T, Noaki K, and Fujita H 2004 Preliminary Workshop on Evaluation
of Geographic Information Retrieval Systems for Web Documents. In Kando and
Ishikawa (eds) Proceedings of the NTCIR Workshop 4 Meeting: Working Notes of the
Fourth NTCIR Workshop Meeting. Available online at: http://research.nii.ac.jp/ntcir-
ws4/NTCIR4-WN/WEB/NTCIR4WN-WEB-ArikawaM.pdf. Last accessed April
23rd, 2008.
Armstrong MP, Rushton G, and Zimmerman DL 1999 Geographically Masking Health
Data to Preserve Confidentiality. Statistics in Medicine 18(5): 497-525.
Armstrong MP, Greene BR, and Rushton G 2008 Using Geocodes to Estimate
Distances and Geographic Accessibility for Cancer Prevention and Control. In Rushton
et al. (eds) Geocoding Health Data: The Use of Geographic Codes in Cancer Prevention
and Control, Research, and Practice. Boca Raton, FL, CRC Press: 181-194.
Armstrong MP and Tiwari C 2008 Geocoding Methods, Materials, and First Steps
toward a Geocoding Error Budget. In Rushton et al. (eds) Geocoding Health Data: The
Use of Geographic Codes in Cancer Prevention and Control, Research, and Practice. Boca
Raton, FL, CRC Press: 11-36.
Axelrod A 2003 On Building a High Performance Gazetteer Database. In Kornai and
Sundheim (eds) Proceedings of Workshop on the Analysis of Geographic References held at
Joint Conference for Human Language Technology and Annual Meeting of the North American
Chapter of the Association for Computational Linguistics (HLT/NAACL '03): 63-68.
Bakshi R, Knoblock CA, and Thakkar S 2004 Exploiting Online Sources to Accurately
Geocode Addresses. In Pfoser et al. (eds) Proceedings of the 12th ACM International
Symposium on Advances in Geographic Information Systems (ACMGIS '04): 194-203.
Beal JR 2003 Contextual Geolocation: A Specialized Application for Improving Indoor
Location Awareness in Wireless Local Area Networks. In Gibbons (ed) The 36th
Annual Midwest Instruction and Computing Symposium (MICS2003), Duluth, Minnesota.
Beaman R, Wieczorek J, and Blum S 2004 Determining Space from Place for Natural
History Collections: In a Distributed Digital Library Environment. D-Lib Magazine
10(5).
Bell EM, Hertz-Picciotto I, and Beaumont JJ 2001 A case-control study of pesticides
and fetal death due to congenital anomalies. Epidemiology 12(2): 148-56.
Berney LR and Blane DB 1997 Collecting Retrospective Data: Accuracy of Recall After
50 Years Judged Against Historical Records. Social Science & Medicine 45(10):
1519-1525.
Beyer KMM, Schultz AF, and Rushton G 2008 Using ZIP Codes as Geocodes in
Cancer Research. In Rushton et al. (eds) Geocoding Health Data: The Use of Geographic
Codes in Cancer Prevention and Control, Research, and Practice. Boca Raton, FL, CRC Press:
37-68.
Bichler G and Balchak S 2007 Address matching bias: ignorance is not bliss. Policing: An
International Journal of Police Strategies & Management 30(1): 32-60.
Bilhaut F, Charnois T, Enjalbert P, and Mathet Y 2003 Geographic reference analysis
for geographic document querying. In Kornai and Sundheim (eds) Proceedings of
Workshop on the Analysis of Geographic References held at Joint Conference for Human
Language Technology and Annual Meeting of the North American Chapter of the Association for
Computational Linguistics (HLT/NAACL '03): 55-62.
Blakely T and Salmond C 2002 Probabilistic record linkage and a method to calculate
the positive predictive value. International Journal of Epidemiology 31(6): 1246-1252.
Block R 1995 Geocoding of Crime Incidents Using the 1990 TIGER File: The Chicago
Example. In Block et al. (eds) Crime Analysis Through Computer Mapping. Washington,
DC, Police Executive Research Forum: 15.
Bonner MR, Han D, Nie J, Rogerson P, Vena JE, and Freudenheim JL 2003 Positional
Accuracy of Geocoded Addresses in Epidemiologic Research. Epidemiology 14(4):
408-411.
Boscoe FP, Kielb CL, Schymura MJ, and Bolani TM 2002 Assessing and Improving
Census Tract Completeness. Journal of Registry Management 29(4): 117-120.
Boscoe FP, Ward MH, and Reynolds P 2004 Current Practices in Spatial Analysis of
Cancer Data: Data Characteristics and Data Sources for Geographic Studies of
Cancer. International Journal of Health Geographics 3(28).
Boscoe FP 2008 The Science and Art of Geocoding: Tips for Improving Match Rates
and Handling Unmatched cases in Analysis. In Rushton et al. (eds) Geocoding Health
Data: The Use of Geographic Codes in Cancer Prevention and Control, Research, and Practice.
Boca Raton, FL, CRC Press: 95-110.
Boulos MNK 2004 Towards Evidence-Based, GIS-Driven National Spatial Health
Information Infrastructure and Surveillance Services in the United Kingdom.
International Journal of Health Geographics 3(1).
Bouzeghoub M 2004 A framework for analysis of data freshness. In Proceedings of the
2004 International Workshop on Information Quality in Information Systems: 59-67.
Bow CJD, Waters NM, Faris PD, Seidel JE, Galbraith PD, Knudtson ML, and Ghali
WA 2004 Accuracy of city postal code coordinates as a proxy for location of
residence. International Journal of Health Geographics 3(5).
Brody JG, Vorhees DJ, Melly SJ, Swedis SR, Drivas PJ, and Rudel RA 2002 Using GIS
and Historical Records to Reconstruct Residential Exposure to Large-Scale
Pesticide Application. Journal of Exposure Analysis and Environmental Epidemiology 12(1):
64-80.
Broome J 2003 Building and Maintaining a Reliable Enterprise Street Database. In
Proceedings of the Fifth Annual URISA Street Smart and Address Savvy Conference.
Brownstein JS, Cassa CA, and Mandl KD 2006 No Place to Hide - Reverse
Identification of Patients from Published Maps. New England Journal of Medicine 355(16):
1741-1742.
Can A 1993 TIGER/Line Files in Teaching GIS. International Journal of Geographical
Information Science 7(8): 561-572.
Canada Post Corporation 2008 Postal Standards: Lettermail and Incentive Lettermail.
Available online at: http://www.canadapost.ca/tools/pg/standards/pslm-e.pdf.
Last accessed April 23rd, 2008.
Casady T 1999 Privacy Issues in the Presentation of Geocoded Data. Crime Mapping
News 1(3).
Cayo MR and Talbot TO 2003 Positional Error in Automated Geocoding of Residential
Addresses. International Journal of Health Geographics 2(10).
Chalasani VS, Engebretsen O, Denstadli JM, and Axhausen KW 2005 Precision of
Geocoded Locations and Network Distance Estimates. Journal of Transportation and
Statistics 8(2).
Chavez RF 2000 Generating and Reintegrating Geospatial Data. In Proceedings of the 5th
ACM Conference on Digital Libraries (DL '00): 250-251.
Chen CC, Knoblock CA, Shahabi C, and Thakkar S 2003 Building Finder: A System to
Automatically Annotate Buildings in Satellite Imagery. In Agouris (ed) Proceedings of
the International Workshop on Next Generation Geospatial Information (NG2I '03),
Cambridge, Mass.
Chen CC, Knoblock CA, Shahabi C, Thakkar S, and Chiang YY 2004 Automatically and
Accurately Conflating Orthoimagery and Street Maps. In Pfoser et al. (eds)
Proceedings of the 12th ACM International Symposium on Advances in Geographic Information
Systems (ACMGIS '04), Washington DC: 47-56.
Chen FM, Breiman RF, Farley M, Plikaytis B, Deaver K, and Cetron MS 1998
Geocoding and Linking Data from Population-Based Surveillance and the US Census to
Evaluate the Impact of Median Household Income on the Epidemiology of
Invasive Streptococcus Pneumoniae Infections. American Journal of Epidemiology 148(12):
1212-1218.
Chen W, Petitti DB, and Enger S 2004 Limitations and potential uses of census-based
data on ethnicity in a diverse community. Annals of Epidemiology 14(5): 339-345.
Chen Z, Rushton G, and Smith G 2008 Preserving Privacy: Deidentifying Data by
Applying a Random Perturbation Spatial Mask. In Rushton et al. (eds) Geocoding Health
Data: The Use of Geographic Codes in Cancer Prevention and Control, Research, and Practice.
Boca Raton, FL, CRC Press: 139-146.
Chiang YY and Knoblock CA 2006 Classification of Line and Character Pixels on
Raster Maps Using Discrete Cosine Transformation Coefficients and Support Vector
Machines. In Proceedings of the 18th International Conference on Pattern Recognition
(ICPR'06).
Chiang YY, Knoblock CA, and Chen CC 2005 Automatic extraction of road
intersections from raster maps. In Proceedings of the 13th annual ACM International Workshop on
Geographic Information Systems: 267-276.
Chou YH 1995 Automatic Bus Routing and Passenger Geocoding with a Geographic
Information System. In Proceedings of the 1995 Vehicle Navigation and Information
Systems Conference: 352-359.
Christen P and Churches T 2005 A Probabilistic Deduplication, Record Linkage and
Geocoding System. In Proceedings of the Australian Research Council Health Data Mining
Workshop (HDM05), Canberra, AU. Available online at: http://acrc.unisa.edu.au/
groups/health/hdw2005/Christen.pdf. Last accessed April 23rd, 2008.
Christen P, Churches T, and Willmore A 2004 A Probabilistic Geocoding System Based
on a National Address File. In Proceedings of the Australasian Data Mining Conference,
Cairns, AU. Available online at: http://datamining.anu.edu.au/publications/2004/
aus-dm2004.pdf. Last accessed April 23rd, 2008.
Chua C 2001 An Approach in Pre-processing Philippine Address for Geocoding. In
Proceedings of the Second Philippine Computing Science Congress (PCSC 2001).
Chung K, Yang DH, and Bell R 2004 Health and GIS: Toward Spatial Statistical
Analyses. Journal of Medical Systems 28(4): 349-360.
Churches T, Christen P, Lim K, and Zhu JX 2002 Preparation of Name and Address
Data for Record Linkage Using Hidden Markov Models. Medical Informatics and
Decision Making 2(9).
Clough P 2005 Extracting Metadata for Spatially-Aware information retrieval on the
internet. In Jones and Purves (eds) Proceedings of the 2005 ACM Workshop of Geographic
Information Retrieval (GIR'05): 1-24.
Cockburn M, Wilson J, Cozen W, Wang F, and Mack T 2008 Residential proximity to
major roadways and the risk of mesothelioma, lung cancer and leukemia. Under
review, personal communication.
Collins SE, Haining RP, Bowns IR, Crofts DJ, Williams TS, Rigby AS, and Hall DM
1998 Errors in Postcode to Enumeration District Mapping and Their Effect on
Small Area Analyses of Health Data. Journal of Public Health Medicine 20(3): 325-330.
County of Sonoma 2008 Vector Data - GIS Data Portal - County of Sonoma. Available
online at: https://gis.sonoma-county.org/catalog.asp. Last accessed April 23rd,
2008.
Cressie N and Kornak J 2003 Spatial Statistics in the Presence of Location Error with an
Application to Remote Sensing of the Environment. Statistical Science 18(4):
436-456.
Croner CM 2003 Public Health GIS and the Internet. Annual Review of Public Health 24:
57-82.
Curtis AJ, Mills JW, and Leitner M 2006 Spatial confidentiality and GIS: re-engineering
mortality locations from published maps about Hurricane Katrina. International
Journal of Health Geographics 5(44).
Dao D, Rizos C, and Wang J 2002 Location-based services: technical and business
issues. GPS Solutions 6(3): 169-178.
Davis Jr. CA 1993 Address Base Creation Using Raster/Vector Integration. In Proceedings
of the URISA 1993 Annual Conference, Atlanta, GA: 45-54.
Davis Jr. CA and Fonseca FT 2007 Assessing the Certainty of Locations Produced by an
Address Geocoding System. Geoinformatica 11(1): 103-129.
Davis Jr. CA, Fonseca FT, and De Vasconcelos Borges KA 2003 A Flexible Addressing
System for Approximate Geocoding. In Proceedings of the Fifth Brazilian Symposium on
Geoinformatics (GeoInfo 2003), Campos do Jordão, São Paulo, Brazil.
Dawes SS, Cook ME, and Helbig N 2006 Challenges of Treating Information as a
Public Resource: The Case of Parcel Data. In Proceedings of the 39th Annual Hawaii
International Conference on System Sciences: 1-10.
Dearwent SM, Jacobs RR, and Halbert JB 2001 Locational Uncertainty in
Georeferencing Public Health Datasets. Journal of Exposure Analysis Environmental Epidemiology
11(4): 329-334.
Densham I and Reid J 2003 A geo-coding service encompassing a geo-parsing tool and
integrated digital gazetteer service. In Kornai and Sundheim (eds) Proceedings of
Workshop on the Analysis of Geographic References held at Joint Conference for Human
Language Technology and Annual Meeting of the North American Chapter of the Association for
Computational Linguistics (HLT/NAACL '03): 80-81.
Diez-Roux AV, Merkin SS, Arnett D, Chambless L, Massing M, Nieto FJ, Sorlie P,
Szklo M, Tyroler HA, and Watson RL 2001 Neighborhood of Residence and
Incidence of Coronary Heart Disease. New England Journal of Medicine 345(2): 99-106.
Dijkstra EW 1959 A note on two problems in connexion with graphs. Numerische
Mathematik 1: 269-271.
Dru MA and Saada S 2001 Location-based mobile services: the essentials. Alcatel
Telecommunications Review 1: 71-76.
Drummond WJ 1995 Address Matching: GIS Technology for Mapping Human Activity
Patterns. Journal of the American Planning Association 61(2): 240-251.
Dueker KJ 1974 Urban Geocoding. Annals of the Association of American Geographers 64(2):
318-325.
Durbin E, Stewart J, and Huang B 2008 Improving the Completeness and Accuracy of
Address at Diagnosis with Real-Time Geocoding. Presentation at The North American
Association of Central Cancer Registries 2008 Annual Meeting, Denver, CO.
Durr PA and Froggatt AEA 2002 How Best to Georeference Farms? A Case Study
From Cornwall, England. Preventive Veterinary Medicine 56: 51-62.
Efstathopoulos P, Mammarella M, and Warth A 2005 The Meta Google Maps 'Hack'.
Unpublished Report. Available online at: http://www.cs.ucla.edu/pestath/
papers/google-meta-paper.pdf. Last accessed April 23rd, 2008.
Eichelberger P 1993 The Importance Of Addresses: The Locus Of GIS. In Proceedings of
the URISA 1993 Annual Conference, Atlanta, GA: 212-213.
El-Yacoubi MA, Gilloux M, and Bertille JM 2002 A Statistical Approach for Phrase
Location and Recognition within a Text Line: An Application to Street Name
Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(2): 172-188.
Environmental Sciences Research Institute 1999 GIS in Health Care Terms. ArcUser
Magazine April-June. Available online at: http://www.esri.com/news/arcuser/
0499/terms.html. Last accessed April 23rd, 2008.
Feudtner C, Silveira MJ, Shabbout M, and Hoskins RE 2006 Distance From Home
When Death Occurs: A Population-Based Study of Washington State, 1989-2002.
Pediatrics 117(5): 932-939.
Fonda-Bonardi P 1994 House Numbering Systems in Los Angeles. In Proceedings of the
GIS/LIS '94 Annual Conference and Exposition, Phoenix, AZ: 322-331.
Foody GM 2003 Uncertainty, Knowledge Discovery and Data Mining in GIS. Progress in
Physical Geography 27(1): 113-121.
Fortney J, Rost K, and Warren J 2000 Comparing Alternative Methods of Measuring
Geographic Access to Health Services. Health Services and Outcomes Research
Methodology 1(2): 173-184.
Frank AU, Grum E, and Vasseur B 2004 Procedure to Select the Best Dataset for a
Task. In Egenhofer et al. (eds) Proceedings of 3rd International Conference on Geographic
Information Science (GIScience). Berlin, Springer Lecture Notes in Computer Science No
3234: 81-93.
Fremont AM, Bierman A, Wickstrom SL, Bird CE, Shah M, Escarce JJ, Horstman T,
and Rector T 2005 Use Of Geocoding In Managed Care Settings To Identify
Quality Disparities. Health Affairs 24(2): 516-526.
Frew J, Freeston M, Freitas N, Hill LL, Janee G, Lovette K, Nideffer R, Smith TR, and
Zheng Q 1998 The Alexandria Digital Library Architecture. In Nikalaou and
Stephanidis (eds) Proceedings of the 2nd European Conference on Research and Advanced
Technology for Digital Libraries (ECDL '98). Berlin, Springer Lecture Notes in Computer
Science No 1513: 61-73.
Fu G, Jones CB, and Abdelmoty AI 2005a Building a Geographical Ontology for
Intelligent Spatial Search on the Web. In Proceedings of the IASTED International Conference
on Databases and Applications (DBA2005).
Fu G, Jones CB, and Abdelmoty AI 2005b Ontology-based Spatial Query Expansion in
Information Retrieval. Berlin, Springer Lecture Notes in Computer Science No
3761: 1466-1482.
Fulcomer MC, Bastardi MM, Raza H, Duffy M, Dufficy E, and Sass MM 1998 Assessing
the Accuracy of Geocoding Using Address Data from Birth Certificates: New
Jersey, 1989 to 1996. In Williams et al. (eds) Proceedings of the 1998 Geographic Information
Systems in Public Health Conference, San Diego, CA: 547-560. Available online at:
http://www.atsdr.cdc.gov/GIS/conference98/proceedings/pdf/gisbook.pdf. Last
accessed April 23rd, 2008.
Gabrosek J and Cressie N 2002 The Effect on Attribute Prediction of Location
Uncertainty in Spatial Data. Geographical Analysis 34: 262-285.
Gaffney SH, Curriero FC, Strickland PT, Glass GE, Helzlsouer KJ, and Breysse PN
2005 Influence of Geographic Location in Modeling Blood Pesticide Levels in a
Community Surrounding a US Environmental Protection Agency Superfund Site.
Environmental Health Perspectives 113(12): 1712-1716.
Gatrell AC 1989 On the Spatial Representation and Accuracy of Address-Based Data in
the United Kingdom. International Journal of Geographical Information Science 3(4):
335-348.
Geronimus AT and Bound J 1998 Use of Census-based Aggregate Variables to Proxy
for Socioeconomic Group: Evidence from National Samples. American Journal of
Epidemiology 148(5): 475-486.
Geronimus AT and Bound J 1999a RE: Use of Census-based Aggregate Variables to
Proxy for Socioeconomic Group: Evidence from National Samples. American
Journal of Epidemiology 150(8): 894-896.
Geronimus AT and Bound J 1999b RE: Use of Census-based Aggregate Variables to
Proxy for Socioeconomic Group: Evidence from National Samples. American
Journal of Epidemiology 150(9): 997-999.
Geronimus AT, Bound J, and Neidert LJ 1995 On the Validity of Using Census
Geocode Characteristics to Proxy Individual Socioeconomic Characteristics. Technical
Working Paper 189. Cambridge, MA, National Bureau of Economic Research.
Gilboa SM, Mendola P, Olshan AF, Harness C, Loomis D, Langlois PH, Savitz DA, and
Herring AH 2006 Comparison of residential geocoding methods in population-
based study of air quality and birth defects. Environmental Research 101(2): 256-262.
Gittler J 2008a Cancer Registry Data and Geocoding: Privacy, Confidentiality, and
Security Issues. In Rushton et al. (eds) Geocoding Health Data: The Use of Geographic Codes
in Cancer Prevention and Control, Research, and Practice. Boca Raton, FL, CRC Press:
195-224.
Gittler J 2008b Cancer Reporting and Registry Statutes and Regulations. In Rushton et
al. (eds) Geocoding Health Data: The Use of Geographic Codes in Cancer Prevention and
Control, Research, and Practice. Boca Raton, FL, CRC Press: 227-234.
Goldberg DW 2008a Free Online Geocoding. Available online at: https://
webgis.usc.edu/Services/Geocode/Default.aspx. Last accessed October 20th, 2008.
Goldberg DW 2008b Free Online Static Address Validation. Available online at:
https://webgis.usc.edu/Services/AddressValidation/StaticValidator.aspx. Last
accessed October 20th, 2008.
Goldberg DW, Wilson JP, and Knoblock CA 2007a From Text to Geographic
Coordinates: The Current State of Geocoding. URISA Journal 19(1): 33-46.
Goldberg DW, Zhang X, Marusek JC, Wilson JP, Ritz B, and Cockburn MG 2007b
Development of an Automated Pesticide Exposure Analyst for the California's
Central Valley. In the Proceedings of the URISA GIS in Public Health Conference, New
Orleans, LA: 136-156.
Goldberg DW, Wilson JP, Knoblock CA, and Cockburn MG 2008a The Development
of an Open-Source, Scalable and Flexible Geocoding Platform. In preparation.
Goldberg DW, Shahabi K, Wilson JP, Knoblock CA, and Cockburn MG 2008b The
Geographic Characteristics of Geocoding Error. In preparation.
Goldberg DW, Wilson JP, Knoblock CA, and Cockburn MG 2008c Geocoding Quality
Metrics: One Size Does Not Fit All. In preparation.
Goldberg DW, Wilson JP, Knoblock CA, and Cockburn MG 2008d An effective and
efficient approach for manually improving geocoded data. International Journal of
Health Geographics, Submitted September 23rd, 2008.
Goodchild MF and Hunter GJ 1997 A simple positional accuracy measure for linear
features. International Journal of Geographical Information Science 11(3): 299-306.
Google, Inc. 2008a Google Larth. Aailable online at: http:,,earth.google.com. Last ac-
cessed April 23
rd
, 2008.
Google, Inc. 2008b Google Maps. Aailable online at: http:,,www.maps.google.com.
Last accessed April 23
rd
, 2008.
Google, Inc. 2008c Google Maps API Documentation. Aailable online at: http:,,
www.google.com,apis,maps,documentation. Last accessed April 23
rd
, 2008.
Grand Valley Metropolitan Council 2008 RLGIS: Purchase Digital Data. Aailable on-
line at: http:,,www.gmc-regis.org,data,ordering.html. Last accessed April 23
rd
,
2008.
Gregorio DI, Cromley L, Mrozinski R, and \alsh SJ 1999 Subject Loss in Spatial Anal-
ysis o Breast Cancer. eattb c Ptace 5,2,: 13-1.
Gregorio DI, DeChello LM, Samociuk l, and Kulldor M 2005 Lumping or Splitting:
Seeking the Preerred Areal Unit or lealth Geography Studies. vtervatiovat ]ovrvat
of eattb Ceograbic. 4,6,.
Griffin DH, Pausche JM, Rivers EB, Tillman AL, and Treat JB 1990 Improving the
Coverage of Addresses in the 1990 Census: Preliminary Results. In Proceedings of the
American Statistical Association Survey Research Methods Section, Anaheim, CA: 541-546.
Available online at: http://www.amstat.org/sections/srms/Proceedings/papers/
1990_091.pdf. Last accessed April 23rd, 2008.
Grubesic TH and Matisziw TC 2006 On the use of ZIP Codes and ZIP Code tabulation
areas (ZCTAs) for the spatial analysis of epidemiological data. International Journal of
Health Geographics 5(58).
Grubesic TH and Murray AT 2004 Assessing the Locational Uncertainties of Geocoded
Data. In Proceedings from the 24th Urban Data Management Symposium. Available online
at: http://www.tonygrubesic.net/geocode.pdf. Last accessed April 23rd, 2008.
Han D, Rogerson PA, Nie J, Bonner MR, Vena JE, Vito D, Muti P, Trevisan M, Edge
SB, and Freudenheim JL 2004 Geographic Clustering of Residence in Early Life
and Subsequent Risk of Breast Cancer (United States). Cancer Causes and Control
15(9): 921-929.
Han D, Rogerson PA, Bonner MR, Nie J, Vena JE, Muti P, Trevisan M, and
Freudenheim JL 2005 Assessing Spatio-Temporal Variability of Risk Surfaces Using
Residential History Data in a Case Control Study of Breast Cancer. International Journal
of Health Geographics 4(9).
Hariharan R and Toyama K 2004 Project Lachesis: Parsing and Modeling Location
Histories. In Egenhofer et al. (eds) Proceedings of 3rd International Conference on
Geographic Information Science (GIScience). Berlin, Springer Lecture Notes in Computer
Science No 3234: 106-124.
Harvard University 2008 The Public Health Disparities Geocoding Project Monograph
Glossary. Available online at: http://www.hsph.harvard.edu/thegeocodingproject/
webpage/monograph/glossary.htm. Last accessed April 23rd, 2008.
Haspel M and Knotts HG 2005 Location, Location, Location: Precinct Placement and
the Costs of Voting. The Journal of Politics 67(2): 560-573.
Health Level Seven, Inc. 2007 Application Protocol for Electronic Data Exchange in
Healthcare Environments, Version 2.6. Available online at: http://www.hl7.org/
Library/standards.cfm. Last accessed April 23rd, 2008.
Henry KA and Boscoe FP 2008 Estimating the accuracy of geographical imputation.
International Journal of Health Geographics 7(3).
Henshaw SL, Curriero FC, Shields TM, Glass GE, Strickland PT, and Breysse PN 2004
Geostatistics and GIS: Tools for Characterizing Environmental Contamination.
Journal of Medical Systems 28(4): 335-348.
Higgs G and Martin DJ 1995a The Address Data Dilemma Part 1: Is the Introduction
of Address-Point the Key to Every Door in Britain? Mapping Awareness 8: 26-28.
Higgs G and Martin DJ 1995b The Address Data Dilemma Part 2: The Local Authority
Experience, and Implications for National Standards. Mapping Awareness 9: 26-39.
Higgs G and Richards W 2002 The Use of Geographical Information Systems in
Examining Variations in Sociodemographic Profiles of Dental Practice Catchments:
A Case Study of a Swansea Practice. Primary Dental Care 9(2): 63-69.
Hild H and Fritsch D 1998 Integration of vector data and satellite imagery for
geocoding. International Archives of Photogrammetry and Remote Sensing 32(4): 246-251.
Hill LL 2000 Core Elements of Digital Gazetteers: Placenames, Categories, and
Footprints. In Borbinha and Baker (eds) Research and Advanced Technology for Digital
Libraries, 4th European Conference (ECDL '00). Berlin, Springer Lecture Notes in
Computer Science No 1923: 280-290.
Hill LL 2006 Georeferencing: The Geographic Associations of Information. Cambridge,
Mass MIT Press.
Hill LL and Zheng Q 1999 Indirect Geospatial Referencing Through Place Names in
the Digital Library: Alexandria Digital Library Experience with Developing and
Implementing Gazetteers. In Proceedings of the 62nd Annual Meeting of the American
Society for Information Science, Washington, DC: 57-69.
Hill LL, Frew J, and Zheng Q 1999 Geographic Names: The Implementation of a
Gazetteer in a Georeferenced Digital Library. D-Lib Magazine 5(1).
Himmelstein M 2005 Local Search: The Internet Is the Yellow Pages. Computer 38(2):
26-34.
Hofferkamp J and Havener L (eds) 2008 Standards for Cancer Registries: Data Standards
and Data Dictionary, Volume II (12th Edition). Springfield, IL North American
Association of Central Cancer Registries.
Howe HL 1986 Geocoding NY State Cancer Registry. American Journal of Public Health
76(12): 1459-1460.
Hurley SE, Saunders TM, Nivas R, Hertz A, and Reynolds P 2003 Post Office Box
Addresses: A Challenge for Geographic Information System-Based Studies.
Epidemiology 14(4): 386-391.
Hutchinson M and Veenendall B 2005a Towards a Framework for Intelligent
Geocoding. In Spatial Intelligence, Innovation and Praxis: The National Biennial
Conference of the Spatial Sciences Institute (SSC 2005), Melbourne, AU.
Hutchinson M and Veenendall B 2005b Towards Using Intelligence To Move From
Geocoding To Geolocating. In Proceedings of the 7th Annual URISA GIS in Addressing
Conference, Austin, TX.
Jaro M 1984 Record Linkage Research and the Calibration of Record Linkage
Algorithms. Statistical Research Division Report Series SRD Report No. Census/
SRD/RR-84/27. Washington, DC United States Census Bureau. Available online
at: http://www.census.gov/srd/papers/pdf/rr84-27.pdf. Last accessed April 23rd,
2008.
Jaro M 1989 Advances in Record-Linkage Methodology as Applied to Matching the
1985 Census of Tampa, Florida. Journal of the American Statistical Association 89:
414-420.
Johnson SD 1998a Address Matching with Stand-Alone Geocoding Engines: Part 1.
Business Geographics: 24-32.
Johnson SD 1998b Address Matching with Stand-Alone Geocoding Engines: Part 2.
Business Geographics: 30-36.
Jones CB, Alani H, and Tudhope D 2001 Geographical Information Retrieval with
Ontologies of Place. In Montello (ed) Proceedings of the International Conference on
Spatial Information Theory: Foundations of Geographic Information Science. Berlin,
Springer Lecture Notes In Computer Science No 2205: 322-335.
Karimi HA, Durcik M, and Rasdorf W 2004 Evaluation of Uncertainties Associated
with Geocoding Techniques. Journal of Computer-Aided Civil and Infrastructure
Engineering 19(3): 170-185.
Kennedy TC, Brody JG, and Gardner JN 2003 Modeling Historical Environmental
Exposures Using GIS: Implications for Disease Surveillance. In Proceedings of the 2003
ESRI Health GIS Conference, Arlington, VA. Available online at: http://
gis.esri.com/library/userconf/health03/papers/pap3020/p3020.htm. Last accessed
April 23rd, 2008.
Kim U 2001 Historical Study on the Parcel Number and Numbering System in Korea.
In Proceedings of the International Federation of Surveyors Working Week 2001.
Kimler M 2004 Geo-Coding: Recognition of geographical references in unstructured
text, and their visualization. PhD Thesis. University of Applied Sciences in Hof,
Germany. Available online at: http://langtech.jrc.it/Documents/0408_Kimler_
Thesis-GeoCoding.pdf. Last accessed April 23rd, 2008.
Krieger N 1992 Overcoming the Absence of Socioeconomic Data in Medical Records:
Validation and Application of a Census-Based Methodology. American Journal of
Public Health 82(5): 703-710.
Krieger N 2003 Place, Space, and Health: GIS and Epidemiology. Epidemiology 14(4):
384-385.
Krieger N and Gordon D 1999 RE: Use of Census-based Aggregate Variables to Proxy
for Socioeconomic Group: Evidence from National Samples. American Journal of
Epidemiology 150(8): 894-896.
Krieger N, Williams DR, and Moss NE 1997 Measuring Social Class in US Public
Health Research: Concepts, Methodologies, and Guidelines. Annual Review of Public
Health 18(1): 341-378.
Krieger N, Waterman P, Lemieux K, Zierler S, and Hogan JW 2001 On the Wrong Side
of the Tracts? Evaluating the Accuracy of Geocoding in Public Health Research.
American Journal of Public Health 91(7): 1114-1116.
Krieger N, Chen JT, Waterman PD, Soobader MJ, Subramanian SV, and Carson R
2002a Geocoding and Monitoring of US Socioeconomic Inequalities in Mortality
and Cancer Incidence: Does the Choice of Area-Based Measure and Geographic
Level Matter? American Journal of Epidemiology 156(5): 471-482.
Krieger N, Waterman PD, Chen JT, Soobader MJ, Subramanian SV, and Carson R
2002b ZIP Code Caveat: Bias Due to Spatiotemporal Mismatches Between ZIP
Codes and US Census-Defined Areas: The Public Health Disparities Geocoding
Project. American Journal of Public Health 92(7): 1100-1102.
Krieger N, Waterman PD, Chen JT, Soobader M, and Subramanian SV 2003
Monitoring Socioeconomic Inequalities in Sexually Transmitted Infections, Tuberculosis,
and Violence: Geocoding and Choice of Area-Based Socioeconomic Measures.
Public Health Reports 118(3): 240-260.
Krieger N, Chen JT, Waterman PD, Rehkopf DH, and Subramanian SV 2005 Painting a
Truer Picture of US Socioeconomic and Racial/Ethnic Health Inequalities: The
Public Health Disparities Geocoding Project. American Journal of Public Health 95(2):
312-323.
Krieger N, Chen JT, Waterman PD, Rehkopf DH, Yin R, and Coull BA 2006
Race/Ethnicity and Changing US Socioeconomic Gradients in Breast Cancer
Incidence: California and Massachusetts, 1978-2002 (United States). Cancer Causes and
Control 17(2): 217-226.
Krishnamurthy S, Sanders WH, and Cukier M 2002 An adaptive framework for tunable
consistency and timeliness using replication. In Proceedings of the International
Conference on Dependable Systems and Networks: 17-26.
Kwok RK and Yankaskas BC 2001 The Use of Census Data for Determining Race and
Education as SES Indicators: A Validation Study. Annals of Epidemiology 11(3):
171-177.
Laender AHF, Borges KAV, Carvalho JCP, Medeiros CB, da Silva AS, and Davis Jr. CA
2005 Integrating Web Data and Geographic Knowledge into Spatial Databases. In
Manalopoulos and Papadapoulos (eds) Spatial Databases: Technologies, Techniques and
Trends. Hershey, PA Idea Group Publishing: 23-47.
Lam CS, Wilson JP, and Holmes-Wong DA 2002 Building a Neighborhood-Specific
Gazetteer for a Digital Archive. In Proceedings of the Twenty-second International ESRI
User Conference: -11.
Lee J 2004 3D GIS for Geo-coding Human Activity in Micro-scale Urban
Environments. In Egenhofer et al. (eds) Geographic Information Science: Third International
Conference (GIScience 2004), College Park, MD: 162-178.
Lee MS and McNally MG 1998 Incorporating Yellow-Page Databases in GIS-Based
Transportation Models. In Easa (ed) Proceedings of the American Society of Civil Engineers
Conference on Transportation, Land Use, and Air Quality: 652-661. Available online at:
repositories.cdlib.org/itsirvine/casa/UCI-ITS-AS-WP-98-3.
Leidner JL 2004 Toponym Resolution in Text: Which Sheffield is it? In Sanderson et
al. (eds) Proceedings of the 27th Annual International ACM SIGIR Conference (SIGIR '04):
602.
Levesque M 2003 West Virginia Statewide Addressing and Mapping Project. In
Proceedings of the Fifth Annual URISA Street Smart and Address Savvy Conference,
Providence, RI.
Levine N and Kim KE 1998 The Spatial Location of Motor Vehicle Accidents: A
Methodology for Geocoding Intersections. Computers, Environment, and Urban Systems
22(6): 557-576.
Li H, Srihari RK, Niu C, and Li W 2002 Location Normalization for Information
Extraction. In Proceedings of the 19th international conference on Computational
linguistics: 1-7.
Liang S, Banerjee S, Bushhouse S, Finley AO, and Carlin BP 2007 Hierarchical
multiresolution approaches for dense point-level breast cancer treatment data.
Computational Statistics & Data Analysis 52(5): 2650-2668.
Lind M 2001 Developing a System of Public Addresses as a Language for Location
Dependent Information. In Proceedings of the 2001 URISA Annual Conference. Available
online at: http://www.adresseprojekt.dk/files/Develop_PublicAddress_
urisa2001e.pdf. Last accessed April 23rd, 2008.
Lind M 2005 Addresses and Address Data Play a Key Role in Spatial Infrastructure. In
Proceedings of GIS Planet 2005 International Conference and Exhibition on Geographic
Information, Workshop on Address Referencing. Available online at: http://
www.adresseprojekt.dk/files/ECGI_Addr.pdf. Last accessed April 23rd, 2008.
Locative Technologies 2006 Geocoder.us: A Free US Geocoder. Available online at:
http://geocoder.us. Last accessed April 23rd, 2008.
Lockyer B 2005 Office of the Attorney General of the State of California Legal Opinion
04-1105. Available online at: http://ag.ca.gov/opinions/pdfs/04-1105.pdf. Last
accessed April 23rd, 2008.
Los Angeles County Assessor 2008 LA Assessor - Parcel Viewer. Available online at:
http://assessormap.co.la.ca.us/mapping/viewer.asp. Last accessed April 23rd,
2008.
Lovasi GS, Weiss JC, Hoskins R, Whitsel EA, Rice K, Erickson CF, and Psaty BM 2007
Comparing a single-stage geocoding method to a multi-stage geocoding method:
how much and where do they disagree? International Journal of Health Geographics
6(12).
MacDorman MF and Gay GA 1999 State Initiatives in Geocoding Vital Data. Journal of
Public Health Management and Practice 5(2): 91-93.
Maizlish NA and Herrera L 2005 A Record Linkage Protocol for a Diabetes Registry at
Ethnically Diverse Community Health Centers. Journal of the American Medical
Informatics Association 12(3): 331-337.
Markowetz A 2004 Geographic Information Retrieval. Diploma Thesis, Philipps
University. Available online at: http://www.cs.ust.hk/alexmar/papers/DA.pdf. Last
accessed April 23rd, 2008.
Markowetz A, Chen YY, Suel T, Long X, and Seeger B 2005 Design and implementation
of a geographic search engine. In Proceedings of the 8th International Workshop on
the Web and Databases (WebDB). Available online at: http://cis.poly.edu/suel/
papers/geo.pdf. Last accessed April 23rd, 2008.
Martin DJ 1998 Optimizing Census Geography: The Separation of Collection and
Output Geographies. International Journal of Geographical Information Science 12(7): 673-685.
Martin DJ and Higgs G 1996 Georeferencing People and Places: A Comparison of
Detailed Datasets. In Parker (ed) Innovations in GIS 3: Selected Papers from the Third
National Conference on GIS Research UK. London, Taylor & Francis: 37-47.
Martins B and Silva MJ 2005 A Graph-Ranking Algorithm for Geo-Referencing
Documents. In Proceedings of the Fifth International Conference on Data Mining: 741-744.
Martins B, Chaves M, and Silva MJ 2005a Assigning Geographical Scopes To Web
Pages. In Losada and Fernández-Luna (eds) Proceedings of the 27th European Conference on
IR Research (ECIR 2005). Berlin, Springer Lecture Notes in Computer Science No
3408: 564-567.
Martins B, Silva MJ, and Chaves MS 2005b Challenges and Resources for Evaluating
Geographical IR. In Jones and Purves (eds) Proceedings of the 2005 ACM Workshop of
Geographic Information Retrieval (GIR'05): 17-24.
Marusek JC, Cockburn MG, Mills PK, and Ritz BR 2006 Control Selection and Pesticide
Exposure Assessment via GIS in Prostate Cancer Studies. American Journal of
Preventive Medicine 30(2S): 109-116.
Mazumdar S, Rushton G, Smith BJ, Zimmerman DL, and Donham KJ 2008 Geocoding
Accuracy and the Recovery of Relationships Between Environmental Exposures
and Health. International Journal of Health Geographics 7(13).
McCurley KS 2001 Geospatial Mapping and Navigation of the Web. In Shen and Saito
(eds) Proceedings of the 10th International World Wide Web Conference: 221-229.
McEathron SR, McGlamery P, and Shin DG 2002 Naming the Landscape: Building the
Connecticut Digital Gazetteer. International Journal of Special Libraries 36(1): 83-93.
McElroy JA, Remington PL, Trentham-Dietz A, Robert SA, and Newcomb PA 2003
Geocoding Addresses from a Large Population-Based Study: Lessons Learned.
Epidemiology 14(4): 399-407.
Mechanda M and Puderer H 2007 How Postal Codes Map to Geographic Areas.
Geography Working Paper Series #92F0138MWE. Ottawa, Statistics Canada.
Meyer M, Radespiel-Tröger M, and Vogel C 2005 Probabilistic Record Linkage of
Anonymous Cancer Registry Records. In Studies in Classification, Data Analysis, and
Knowledge Organization: Innovations in Classification, Data Science, and Information
Systems. Berlin, Springer: 599-604.
Michelson M and Knoblock CA 2005 Semantic Annotation of Unstructured and
Ungrammatical Text. In Proceedings of the 19th International Joint Conference on Artificial
Intelligence (IJCAI05), Edinburgh, Scotland.
Miner JW, White A, Palmer S, and Lubenow AE 2005 Geocoding and Social Marketing
in Alabama's Cancer Prevention Programs. Preventing Chronic Disease 2(A17).
Ming D, Luo J, Li J, and Shen Z 2005 Features Based Parcel Unit Extraction From High
Resolution Image. In Moon (ed) Proceedings of the 2005 International Geoscience
and Remote Sensing Symposium (IGARSS'05): 185-188.
Murphy J and Armitage R 2005 Merging the Modeled and Working Address Database:
A Question of Dynamics and Data Quality. In Proceedings of GIS Ireland 2005,
Dublin.
National Institute of Standards and Technology 2008 Federal Information Processing
Standards Publications. Available online at: http://www.itl.nist.gov/fipspubs. Last
accessed April 23rd, 2008.
Nattinger AB, Kneusel RT, Hoffmann RG, and Gilligan MA 2001 Relationship of
Distance From a Radiotherapy Facility and Initial Breast Cancer Treatment. Journal of
the National Cancer Institute 93(17): 1344-1346.
NAVTEQ 2008 NAVSTREETS. Available online at: http://developer.navteq.com/
site/global/dev_resources/10_navteqproducts/navdataformats/navstreets/
p_navstreets.jsp. Last accessed April 23rd, 2008.
Nicoara G 2005 Exploring the Geocoding Process: A Municipal Case Study using Crime
Data. Masters Thesis, The University of Texas at Dallas, Dallas, TX.
Noaki K and Arikawa M 2005a A Method for Parsing Route Descriptions using
Sidewalk Network Databases. In Proceedings of the 2005 International Cartographic
Conference (ICC 2005).
Noaki K and Arikawa M 2005b A geocoding method for natural route descriptions
using sidewalk network databases. In Kwon et al. (eds) Proceedings of the 4th International
Workshop on Web and Wireless Geographical Information Systems (W2GIS 2004) Revised
Selected Papers. Berlin, Springer Lecture Notes in Computer Science No 3428: 38-50.
Nuckols JR, Ward MH, and Jarup L 2004 Using Geographic Information Systems for
Exposure Assessment in Environmental Epidemiology Studies. Environmental
Health Perspectives 112(9): 1007-1115.
Nuckols JR, Gunier RB, Riggs P, Miller R, Reynolds P, and Ward MH 2007 Linkage of
the California Pesticide Use Reporting Database with Spatial Land Use Data for
Exposure Assessment. Environmental Health Perspectives 115(1): 684-689.
NAACCR 2008a NAACCR GIS Committee Geographic Information Systems Survey.
Available online at: http://www.naaccr.org/filesystem/word/GIS%20survey_
Final.doc. Last accessed April 23rd, 2008.
NAACCR 2008b NAACCR Results of GIS Survey. Spring 2006 NAACCR Newsletter.
Available online at: http://www.naaccr.org/index.asp?Col_SectionKey=6&Col_
ContentID=49. Last accessed April 23rd, 2008.
O'Grady KM 1999 A DOQ Test Project: Collecting Data to Improve TIGER. In
Proceedings of the 1999 ESRI User's Conference, San Diego, CA. Available online at:
http://gis.esri.com/library/userconf/proc99/proceed/papers/pap635/p635.htm.
Last accessed April 23rd, 2008.
O'Reagan RT and Saalfeld A 1987 Geocoding Theory and Practice at the Bureau of the
Census. Statistical Research Report Census/SRD/RR-87/29. Washington, DC
United States Bureau of Census.
Oliver MN, Matthews KA, Siadaty M, Hauck FR, and Pickle LW 2005 Geographic Bias
Related to Geocoding in Epidemiologic Studies. International Journal of Health
Geographics 4(29).
Olligschlaeger AM 1998 Artificial Neural Networks and Crime Mapping. In Weisburd
and McEwen (eds) Crime Mapping and Crime Prevention. Monsey, NY Criminal Justice
Press: 313-347.
Olston C and Widom J 2005 Efficient Monitoring and Querying of Distributed,
Dynamic Data via Approximate Replication. Bulletin of the Computer Society Technical
Committee on Data Engineering: 1-8.
Openshaw S 1984 The modifiable areal unit problem. Concepts and Techniques in Modern
Geography 38. Norwich, GeoBooks.
Openshaw S 1989 Learning to Live with Errors in Spatial Databases. In Goodchild and
Gopal (eds) Accuracy of Spatial Databases. Bristol, PA Taylor & Francis: 263-276.
Oppong JR 1999 Data Problems in GIS and Health. In Proceedings of Health and
Environment Workshop 4: Health Research Methods and Data, Turku, Finland. Available
online at: geog.queensu.ca/h_and_e/healthandenvir/Finland%20Workshop%20Papers/
OPPONG.DOC. Last accessed April 23rd, 2008.
Ordnance Survey 2008 ADDRESS-POINT: Ordnance Survey's Map Dataset of All
Postal Addresses in Great Britain. Available online at: http://
www.ordnancesurvey.co.uk/oswebsite/products/addresspoint. Last accessed April
23rd, 2008.
Organization for the Advancement of Structured Information Standards 2008 OASIS
xAL Standard v2.0. Available online at: http://www.oasis-open.org/committees/
ciq/download.html. Last accessed April 23rd, 2008.
Paull D 2003 A Geocoded National Address File for Australia: The G-NAF What, Why,
Who and When? Available online at: http://www.psma.com.au/file_download/24.
Last accessed April 23rd, 2008.
Porter MF 1980 An algorithm for suffix stripping, Program 14(3): 130-137.
Purves R, Clough P, and Joho H 2005 Identifying imprecise regions for geographic
information retrieval using the web. In Proceedings of GISRUK: 313-318.
Raghavan VV, Bollmann P, and Jung GS 1989 Retrieval system evaluation using recall
and precision: problems and answers. In Proceedings of the 12th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '89):
59-68.
Ratcliffe JH 2001 On the Accuracy of TIGER-Type Geocoded Address Data in
Relation to Cadastral and Census Areal Units. International Journal of Geographical
Information Science 15(5): 473-485.
Ratcliffe JH 2004 Geocoding Crime and a First Estimate of a Minimum Acceptable Hit
Rate. International Journal of Geographical Information Science 18(1): 61-72.
Rauch E, Bukatin M, and Baker K 2003 A confidence-based framework for
disambiguating geographic terms. In Kornai and Sundheim (eds) Proceedings of Workshop on
the Analysis of Geographic References held at Joint Conference for Human Language Technology
and Annual Meeting of the North American Chapter of the Association for Computational
Linguistics (HLT/NAACL '03): 50-54.
Reid J 2003 Geowalk: A Gazetteer Server and Service for UK Academia. In Koch and
Sølvberg (eds) Proceedings of the 7th European Conference on Research and Advanced
Technology for Digital Libraries (ECDL '03). Berlin, Springer Lecture Notes in Computer
Science No 2769: 387-392.
Reinbacher I 2006 Geometric Algorithms for Delineating Geographic Regions. Ph.D.
Thesis, Utrecht University, NL. Available online at: http://igitur-
archive.library.uu.nl/dissertations/2006-0620-2004/full.pdf. Last accessed April
23rd, 2008.
Reinbacher I, Benkert M, van Kreveld M, Mitchell JSB, and Wolff A 2008 Delineating
Boundaries for Imprecise Regions. Algorithmica 50(3): 386-414.
Revie P and Kerfoot H 1997 The Canadian Geographical Names Data Base. Available
online at: http://geonames.nrcan.gc.ca/info/cgndb_e.php. Last accessed April
23rd, 2008.
Reynolds P, Hurley SE, Gunier RB, Yerabati S, Quach T, and Hertz A 2005 Residential
Proximity to Agricultural Pesticide Use and Incidence of Breast Cancer in
California, 1988-1997. Environmental Health Perspectives 113(8): 993-1000.
Riekert WF 2002 Automated Retrieval of Information in the Internet by Using Thesauri
and Gazetteers as Knowledge Sources. Journal of Universal Computer Science 8(6):
581-590.
Rose KM, Wood JL, Knowles S, Pollitt RA, Whitsel EA, Diez-Roux AV, Yoon D, and
Heiss G 2004 Historical Measures of Social Context in Life Course Studies:
Retrospective Linkage of Addresses to Decennial Censuses. International Journal of
Health Geographics 3(27).
Rull RP and Ritz B 2003 Historical pesticide exposure in California using pesticide use
reports and land-use surveys: an assessment of misclassification error and bias.
Environmental Health Perspectives 111(13): 1582-1589.
Rull RP, Ritz B, Krishnadasan A, and Maglinte G 2001 Modeling Historical Exposures
from Residential Proximity to Pesticide Applications. In Proceedings of the Twenty-First
Annual ESRI User Conference, San Diego, CA.
Rull RP, Ritz B, and Shaw GM 2006 Neural Tube Defects and Maternal Residential
Proximity to Agricultural Pesticide Applications. American Journal of Epidemiology
163(8): 743-753.
Rull RP, Ritz B, and Shaw GM 2006 Validation of Self-Reported Proximity to
Agricultural Crops in a Case-Control Study of Neural Tube Defects. Journal of Exposure
Science Environmental Epidemiology 16(2): 147-155.
Rushton G, Peleg I, Banerjee A, Smith G, and West M 2004 Analyzing Geographic
Patterns of Disease Incidence: Rates of Late-Stage Colorectal Cancer in Iowa. Journal
of Medical Systems 28(3): 223-236.
Rushton G, Armstrong M, Gittler J, Greene B, Pavlik C, West M, and Zimmerman D
2006 Geocoding in Cancer Research - A Review. American Journal of Preventive
Medicine 30(2): S16-S24.
Rushton G, Armstrong MP, Gittler J, Greene BR, Pavlik CE, West MW, and
Zimmerman DL (eds) 2008a Geocoding Health Data: The Use of Geographic Codes in Cancer
Prevention and Control, Research, and Practice. Boca Raton, FL CRC Press.
Rushton G, Cai Q, and Chen Z 2008b Producing Spatially Continuous Prostate Cancer
Maps with Different Geocodes and Spatial Filter Methods. In Rushton et al. (eds)
Geocoding Health Data: The Use of Geographic Codes in Cancer Prevention and Control,
Research, and Practice. Boca Raton, FL CRC Press: 69-94.
Sadahiro Y 2000 Accuracy of count data estimated by the point-in-polygon method.
Geographical Analysis 32(1): 64-89.
Schlieder C, Vögele TJ, and Visser U 2001 Qualitative Spatial Representation for
Information Retrieval by Gazetteers. In Montello (ed) Proceedings of the 5th International
Conference on Spatial Information Theory. Berlin, Springer Lecture Notes in Computer
Science No 2205: 336-351.
Schockaert S, De Cock M, and Kerre EE 2005 Automatic Acquisition of Fuzzy
Footprints. In Meersman et al. (eds) On the Move to Meaningful Internet Systems 2005 (OTM
2005). Berlin, Springer Lecture Notes in Computer Science No 3762: 1077-1086.
Schootman M, Jeffe D, Kinman E, Higgs G, and Jackson-Thompson J 2004 Evaluating
the Utility and Accuracy of a Reverse Telephone Directory to Identify the Location
of Survey Respondents. Annals of Epidemiology 15(2): 160-166.
Schumacher S 2007 Probabilistic Versus Deterministic Data Matching: Making an
Accurate Decision. DM Direct Special Report (January 18, 2007 Issue). Available online
at: http://www.dmreview.com/article_sub.cfm?articleId=10112. Last accessed
April 23rd, 2008.
Sheehan TJ, Gershman ST, MacDougal L, Danley RA, Mroszczyk M, Sorensen AM, and
Kulldorff M 2000 Geographic Surveillance of Breast Cancer Screening by Tracts,
Towns and ZIP Codes. Journal of Public Health Management Practices 6: 48-57.
Shi X 2007 Evaluating the Uncertainty Caused by P.O. Box Addresses in Environmental
Health Studies: A restricted Monte Carlo Approach. International Journal of
Geographical Information Science 21(3): 325-340.
Skelly C, Black W, Hearnden M, Eyles R, and Weinstein P 2002 Disease surveillance in
rural communities is compromised by address geocoding uncertainty: A case study
of campylobacteriosis. Australian Journal of Rural Health 10(2): 87-93.
Smith DA and Crane G 2001 Disambiguating Geographic Names in a Historical Digital
Library. In Constantopoulos and Sølvberg (eds) Research and Advanced Technology for
Digital Libraries, 5th European Conference. Berlin, Springer Lecture Notes in Computer
Science No 2163: 127-136.
Smith DA and Mann GS 2003 Bootstrapping toponym classifiers. In Kornai and
Sundheim (eds) Proceedings of Workshop on the Analysis of Geographic References held at Joint
Conference for Human Language Technology and Annual Meeting of the North American
Chapter of the Association for Computational Linguistics (HLT/NAACL '03): 45-49.
Smith GD, Ben-Shlomo Y, and Hart C 1999 RE: Use of Census-based Aggregate
Variables to Proxy for Socioeconomic Group: Evidence from National Samples.
American Journal of Epidemiology 150(9): 996-997.
Soobader M, LeClere FB, Hadden W, and Maury B 2001 Using Aggregate Geographic
Data to Proxy Individual Socioeconomic Status: Does Size Matter? American Journal
of Public Health 91(4): 632-636.
Southall H 2003 Defining and identifying the roles of geographic references within text:
Examples from the Great Britain Historical GIS project. In Kornai and Sundheim
(eds) Proceedings of Workshop on the Analysis of Geographic References held at Joint
Conference for Human Language Technology and Annual Meeting of the North American Chapter of
the Association for Computational Linguistics (HLT/NAACL '03): 69-78.
Stage D and von Meyer N 2005 An Assessment of Parcel Data in the United States 2005
Survey Results. Federal Geographic Data Committee Subcommittee on Cadastral
Data. Available online at: http://www.nationalcad.org/showdocs.asp?docid=10.
Last accessed April 23rd, 2008.
Statistics Canada 2008 Census geography - Illustrated Glossary: Geocoding - plain
definition. Available online at: http://geodepot.statcan.ca/Diss2006/Reference/
COGG/Short_RSL_e.jsp?REFCODE=10&FILENAME=Geocoding&TYPE=L.
Last accessed April 23rd, 2008.
Stefoski Mikeljevic J, Haward R, Johnston C, Crellin A, Dodwell D, Jones A, Pisani P,
and Forman D 2004 Trends in postoperative radiotherapy delay and the effect on
survival in breast cancer patients treated with conservation surgery. British Journal of
Cancer 90: 1343-1348.
Stevenson MA, Wilesmith J, Ryan J, Morris R, Lawson A, Pfeiffer D, and Lin D 2000
Descriptive Spatial Analysis of the Epidemic of Bovine Spongiform
Encephalopathy in Great Britain to June 1997. The Veterinary Record 147(14): 379-384.
Strickland MJ, Siffel C, Gardner BR, Berzen AK, and Correa A 2007 Quantifying
geocode location error using GIS methods. Environmental Health 6(10).
Stitzenberg KB, Thomas NE, Dalton K, Brier SE, Ollila DW, Berwick M, Mattingly D,
and Millikan RC 2007 Distance to Diagnosing Provider as a Measure of Access for
Patients With Melanoma. Archives of Dermatology 143(8): 991-998.
Sweeney L 2002 k-Anonymity: A Model For Protecting Privacy. International Journal on
Uncertainty, Fuzziness and Knowledge-Based Systems 10(5): 557-570.
Tele Atlas Inc. 2008a Dynamap Map Database. Available online at: http://
www.teleatlas.com/OurProducts/MapData/Dynamap/index.htm. Last accessed
April 23rd, 2008.
Tele Atlas Inc. 2008b Geocode.com // Tele Atlas Geocoding Services. Available online
at: http://www.geocode.com. Last accessed April 23rd, 2008.
Tele Atlas Inc. 2008c MultiNet Map Database. Available online at: http://
www.teleatlas.com/OurProducts/MapData/Multinet/index.htm. Last accessed
April 23rd, 2008.
Temple C, Ponas G, Regan R, and Sochats K 2005 The Pittsburgh Street Addressing
Project. In Proceedings of the 25th Annual ESRI International User Conference. Available
online at: http://gis2.esri.com/library/userconf/proc05/papers/pap1525.pdf. Last
accessed April 23rd, 2008.
Tezuka T and Tanaka K 2005 Landmark Extraction: A Web Mining Approach. In Cohn
and Mark (eds) Proceedings of the 7th International Conference on Spatial Information Theory.
Berlin, Springer Lecture Notes in Computer Science No 3693: 379-396.
Thrall GI 2006 Geocoding Made Easy. Geospatial Solutions. Available online at:
http://ba.geospatial-online.com/gssba/article/articleDetail.jsp?id=339221. Last
accessed April 23rd, 2008.
Tobler W 1972 Geocoding Theory. In Proceedings of the National Geocoding Conference,
Washington, DC Department of Transportation: IV.1.
Toral A and Muñoz R 2006 A proposal to automatically build and maintain gazetteers
for Named Entity Recognition by using Wikipedia. In Proceedings of the EACL 2006
Workshop on NEW TEXT: Wikis and blogs and other dynamic text sources: 56-61.
Available online at: acl.ldc.upenn.edu/W/W06/W06-2800.pdf. Last accessed April 23rd,
2008.
United States Board on Geographic Names 2008 Geographic Names Information System.
Reston, VA United States Board on Geographic Names. Available online at:
http://geonames.usgs.gov/pls/gnispublic. Last accessed April 23rd, 2008.
United States Census Bureau 2008a American Community Survey, Washington, DC United
States Census Bureau. Available online at: http://www.census.gov/acs. Last
accessed April 23rd, 2008.
United States Census Bureau 2008b MAF/TIGER Accuracy Improvement Project,
Washington, DC United States Census Bureau. Available online at: http://
www.census.gov/geo/mod/maftiger.html. Last accessed April 23rd, 2008.
United States Census Bureau 2008c Topologically Integrated Geographic Encoding and
Referencing System, Washington, DC United States Census Bureau. Available online at:
http://www.census.gov/geo/www/tiger. Last accessed April 23rd, 2008.
United States Department of Health and Human Services 2000 Healthy People 2010:
Understanding and Improving Health (Second Edition), Washington, DC United States
Government Printing Office. Available online at:
http://www.healthypeople.gov/Document/pdf/uih/2010uih.pdf. Last accessed April 23rd, 2008.
United States Federal Geographic Data Committee 2008a Content Standard for Digital
Geospatial Metadata. Available online at: http://www.fgdc.gov/metadata/csdgm.
Last accessed April 23rd, 2008.
United States Federal Geographic Data Committee 2008b Street Address Data Standard.
Reston, VA United States Federal Geographic Data Committee. Available online
at: http://www.fgdc.gov/standards/projects/FGDC-standards-projects/streetaddress/index_html.
Last accessed April 23rd, 2008.
United States National Geospatial-Intelligence Agency 2008 NGA GNS Search. Bethesda,
MD United States National Geospatial-Intelligence Agency. Available online at:
http://geonames.nga.mil/ggmagaz/geonames4.asp. Last accessed April 23rd, 2008.
United States Postal Service 2008a Address Information System Products Technical Guide.
Washington, DC United States Postal Service. Available online at:
http://ribbs.usps.gov/files/Addressing/PUBS/AIS.pdf. Last accessed April 23rd, 2008.
United States Postal Service 2008b CASS Mailer's Guide. Washington, DC United States
Postal Service. Available online at: http://ribbs.usps.gov/doc/cmg.html. Last accessed
April 23rd, 2008.
United States Postal Service 2008c Locatable Address Conversion System. Washington, DC
United States Postal Service. Available online at:
http://www.usps.com/ncsc/addressservices/addressqualityservices/lacsystem.htm. Last
accessed April 23rd, 2008.
United States Postal Service 2008d Publication 28 - Postal Addressing Standards.
Washington, DC United States Postal Service. Available online at:
http://pe.usps.com/text/pub28/welcome.htm. Last accessed April 23rd, 2008.
University of California, Los Angeles 2008 Interactive UCLA Campus Map. Los Angeles,
CA, University of California, Los Angeles. Available online at:
http://www.fm.ucla.edu/CampusMap/Campus.htm. Last accessed April 23rd, 2008.
University of Southern California 2008 UPC Color Map. Los Angeles, CA University of
Southern California. Available online at:
http://www.usc.edu/private/about/visit_usc/USC_UPC_map_color.pdf. Last accessed
April 23rd, 2008.
Vaid S, Jones CB, Joho H, and Sanderson M 2005 Spatio-textual indexing for geographical
search on the web. In Proceedings of the 9th Symposium on Spatial and Temporal
Databases (SSTD-05).
Van Kreveld M and Reinbacher I 2004 Good NEWS: Partitioning a Simple Polygon by
Compass Directions. International Journal of Computational Geometry and Applications
14(4): 233-259.
Veregin H 1999 Data Quality Parameters. In Longley et al. (eds) Geographical Information
Systems, Volume 1 (Second Edition). New York, Wiley: 177-189.
Vestavik Ø 2004 Geographic Information Retrieval: An Overview. Unpublished, presented
at Internal Doctoral Conference, Department of Computer and Information
Science, Norwegian University of Science and Technology. Available online at:
http://www.idi.ntnu.no/~oyvindve/article.pdf. Last accessed April 23rd, 2008.
Vine MF, Degnan D, and Hanchette C 1998 Geographic Information Systems: Their
Use in Environmental Epidemiologic Research. Journal of Environmental Health 61: 7-16.
Vögele TJ and Schlieder C 2003 Spatially-Aware Information Retrieval with Graph-Based
Qualitative Reference Models. In Russell and Haller (eds) Proceedings of the
Sixteenth International Florida Artificial Intelligence Research Society Conference: 470-474.
Vögele TJ and Stuckenschmidt H 2001 Enhancing Gazetteers with Qualitative Spatial
Concepts. In Tochtermann and Arndt (eds) Proceedings of the Workshop on Hypermedia
in Environmental Protection.
Vögele TJ, Schlieder C, and Visser U 2003 Intuitive modeling of place name regions for
spatial information retrieval. In Kuhn et al. (eds) Proceedings of the 6th International
Conference on Foundations of Geographic Information Science (COSIT 2003). Berlin,
Springer Lecture Notes in Computer Science No 2825: 239-252.
Voti L, Richardson LC, Reis IM, Fleming LE, MacKinnon J, Coebergh JWW 2005
Treatment of local breast carcinoma in Florida. Cancer 106(1): 201-207.
Waldinger R, Jarvis P, and Dungan J 2003 Pointing to places in a deductive geospatial
theory. In Kornai and Sundheim (eds) Proceedings of Workshop on the Analysis of
Geographic References held at Joint Conference for Human Language Technology and
Annual Meeting of the North American Chapter of the Association for Computational
Linguistics (HLT/NAACL '03): 10-17.
Waller LA 2008 Spatial Statistical Analysis of Point- and Area-Referenced Public Health
Data. In Rushton et al. (eds) Geocoding Health Data - The Use of Geographic Codes in
Cancer Prevention and Control, Research, and Practice. Boca Raton, Fl CRC Press: 147-164.
Walls MD 2003 Is Consistency in Address Assignment Still Needed? In Proceedings of the
Fifth Annual URISA Street Smart and Address Savvy Conference, Providence, RI.
Ward MH, Nuckols JR, Giglierano J, Bonner MR, Wolter C, Airola M, Mix W, Colt JS,
and Hartge P 2005 Positional accuracy of two methods of geocoding. Epidemiology
16(4): 542-547.
Werner PA 1974 National Geocoding. Annals of the Association of American Geographers
64(2): 310-317.
Whitsel EA, Rose KM, Wood JL, Henley AC, Liao D, and Heiss G 2004 Accuracy and
Repeatability of Commercial Geocoding. American Journal of Epidemiology 160(10):
1023-1029.
Whitsel EA, Quibrera PM, Smith RL, Catellier DJ, Liao D, Henley AC, and Heiss G
2006 Accuracy of commercial geocoding: assessment and implications. Epidemiologic
Perspectives & Innovations 3(8).
Wieczorek J 2008 MaNIS/HerpNet/ORNIS Georeferencing Guidelines. Available online
at: http://manisnet.org/GeorefGuide.html. Last accessed April 23rd, 2008.
Wieczorek J, Guo Q, and Hijmans RJ 2004 The Point-Radius Method for Georeferencing
Locality Descriptions and Calculating Associated Uncertainty. International Journal
of Geographical Information Science 18(8): 745-767.
Wilson JP, Lam CS, and Holmes-Wong DA 2004 A New Method for the Specification
of Geographic Footprints in Digital Gazetteers. Cartography and Geographic Information
Science 31(4): 195-203.
Winkler WE 1995 Matching and Record Linkage. In Cox et al. (eds) Business Survey
Methods. New York, Wiley: 355-384.
Wong WS and Chuah MC 1994 A Hybrid Approach to Address Normalization.
IEEE Expert: Intelligent Systems and Their Applications 9(6): 38-45.
Woodruff AG and Plaunt C 1994 GIPSY: Georeferenced Information Processing System.
Journal of the American Society for Information Science: 645-655.
Wu J, Funk TH, Lurmann FW, and Winer AM 2005 Improving Spatial Accuracy of
Roadway Networks and Geocoded Addresses. Transactions in GIS 9(4): 585-601.
Yahoo!, Inc. 2008 Yahoo! Maps Web Services - Geocoding API. Available online at:
http://developer.yahoo.com/maps/rest/V1/geocode.html. Last accessed April 23rd, 2008.
Yang DH, Bilaver LM, Hayes O, and Goerge R 2004 Improving Geocoding Practices:
Evaluation of Geocoding Tools. Journal of Medical Systems 28(4): 361-370.
Yildirim V and Yomralioglu T 2004 An Address-based Geospatial Application. In
Proceedings of the International Federation of Surveyors Working Week 2004.
Yu L 1996 Development and Evaluation of a Framework for Assessing the Efficiency
and Accuracy of Street Address Geocoding Strategies. Ph.D. Thesis, University at
Albany, State University of New York - Rockefeller College of Public Affairs and
Policy, New York.
Zandbergen PA 2007 Influence of geocoding quality on environmental exposure assessment
of children living near high traffic roads. BMC Public Health 7(37).
Zandbergen PA 2008 A comparison of address point, parcel and street geocoding
techniques. Computers, Environment and Urban Systems.
Zandbergen PA and Chakraborty J 2006 Improving environmental exposure analysis using
cumulative distribution functions and individual geocoding. International Journal
of Health Geographics 5(23).
Zillow.com 2008 Zillow - Real Estate Search Results. Available online at:
http://www.zillow.com/search/Search.htm?mode=browse. Last accessed April 23rd, 2008.
Zimmerman DL 2006 Estimating spatial intensity and variation in risk from locations
coarsened by incomplete geocoding. Technical report #362, Department of Statistics
and Actuarial Science, University of Iowa: 1-28. Available online at:
http://www.stat.uiowa.edu/techrep/tr362.pdf. Last accessed April 23rd, 2008.
Zimmerman DL 2008 Statistical Methods for Incompletely and Incorrectly Geocoded
Cancer Data. In Rushton et al. (eds) Geocoding Health Data - The Use of Geographic
Codes in Cancer Prevention and Control, Research, and Practice. Boca Raton, Fl CRC Press:
165-180.
Zimmerman DL, Armstrong MP, and Rushton G 2008 Alternative Techniques for
Masking Geographic Detail to Protect Privacy. In Rushton et al. (eds) Geocoding
Health Data - The Use of Geographic Codes in Cancer Prevention and Control, Research, and
Practice. Boca Raton, Fl CRC Press: 127-138.
Zimmerman DL, Fang X, Mazumdar S, and Rushton G 2007 Modeling the probability
distribution of positional errors incurred by residential address geocoding.
International Journal of Health Geographics 6(1).
Zong W, Wu D, Sun A, Lim EP, and Goh DHL 2005 On Assigning Place Names to
Geography Related Web Pages. In Marlino et al. (eds) Proceedings of the 2005
ACM/IEEE Joint Conference on Digital Libraries: 354-362.
APPENDIX A: EXAMPLE RESEARCHER ASSURANCE DOCUMENTS

Cancer registries should already have formalized policies regarding the distribution of
registry data in terms of who can access the data for what purposes. The following pages
contain an example researcher data release request document that registries can use as a
starting point to standardize these procedures, if this is needed. Also included is an
example researcher assurances agreement which can be used to partially protect the registry
by specifying the acceptable usage of registry data and outlining the responsibilities of
the researcher.

Research Review Procedure


Before the release of any data, all research proposals requesting the use of confidential
cancer registry data must be reviewed by the Name_of_Registry for compliance with the
following criteria:

1. the proposed research will be used to determine the sources of cancer among the
residents of Name_of_Locality or to reduce the burden of cancer in Name_of_Locality,

2. the data requested are necessary for the efficient conduct of the study,

3. adequate protections are in place to provide secure conditions to use and store the
data,

4. assurances are given that the data will only be used for the purposes of the study, and
assurances that confidential data will be destroyed at the conclusion of the study (see
Assurances Form),

5. the researcher has adequate resources to carry out the proposed research,

6. the proposal has been reviewed and approved by the
Name_of_Committee_for_the_Protection_of_Human_Subjects or is exempt from such review,

7. any additional safeguards needed to protect the data from inadvertent disclosure due
to unique or special characteristics of the proposed research have been required of
the researcher, and

8. the research methodology has been reviewed for scientific excellence by a nationally
recognized peer group, or if such a review has not taken place, that an ad hoc peer
review subcommittee of the Name_of_Advisory_Committee containing appropriately
qualified scientists has performed a peer review of the research.

Additionally, all relevant research fees must have been paid prior to data release.




Name_of_Health_Division
Name_of_Registry

Research Proposal Review

Please complete each section of this form and return with all attachments to:
Address_of_Registry.

Principal Investigator Date
Organization
Address
City State ZIP
Tel Fax Email
Title of Research Project
List other institutions or agencies that will collaborate in conducting the project:



Note: please attach a copy of the proposed protocol or methods section of your
project.

1. Section of Legislative_Code states the purpose of the registry shall be to
purpose_of_registry. In the section below, please describe how your proposed research
will be used to determine the sources of cancer among the residents of Name_of_Locality
or to reduce the burden of cancer in Name_of_Locality. If additional space is needed,
please attach a separate sheet.



2. Details of data necessary for conduct of the study elements.



3. Describe procedures for identifying patients (patient population).



4. All protocols including a request for confidential data require peer review for scientific
merit. Name_of_Registry accepts review by nationally recognized peer review groups.
Please indicate below whether or not such a review has been performed.

No
Yes, if your proposal has been reviewed for scientific merit, please attach a copy of
that review.

If your proposal has not been reviewed for scientific merit by a nationally recognized
peer review group, the Division shall convene an ad hoc peer review subcommittee of
the Cancer Registry Advisory Committee. The data shall not be released unless and until
the proposed research is judged to be scientifically meritorious by the peer group.
Review for scientific merit must be completed prior to Committee for Protection for
Human Research Subjects Institutional Review Board (IRB) review if one has not already
been performed.

5. All requests for confidential data must be approved by an IRB established in
accordance with Section of Legislative_Code. Please indicate whether or not this proposal
has already been approved by an IRB.

No Please indicate the approximate review date: _________________________
Yes Date: ____________________________
If your proposal has been approved by an IRB, please attach a copy of the approval.
Name_of_Health_Division may require approval by the Name_of_Health_Division IRB.
Please contact Name_of_Contact_Person at Phone_Number_of_Contact_Person for instructions
on obtaining Name_of_Health_Division IRB approval.

6. The data must be protected against inadvertent disclosure of confidential data. In the
section below, please address the following issues: (If additional space is needed, please
attach a separate sheet.)

a) How you will provide secure conditions to use and store the data:



b) Assurances that the data will be used only for the purposes of the study:






c) Assurances that confidential data will be destroyed at the conclusion of the research:



The review committee may require additional safeguards if it is determined that these are
necessary due to unique or special characteristics of your proposed research.

7. Prior to the release of confidential data, assurances must be given that you have
adequate financial resources to carry out the proposed research. Please document adequate
project funding and attach supporting documentation. If additional space is needed,
please attach a separate sheet.



8. Please complete the following Researcher Assurances Form on page 5.

Attachments (please check applicable boxes):

Research protocol attached
IRB approval
Project funding
Peer review approval
Researcher Assurances Form

Date reviewed by Name_of_Health_Division administration: __________________

Approved Denied

Comments:

Researcher Assurances Form

The undersigned agrees to (initial each statement, sign and date):

___ accept responsibility for the ethical conduct of the study and the protection of the
rights, privacy and welfare of the individuals whose private health information is
retained in the Name_of_Registry,

___ conduct this study in compliance with the protocol as reviewed and approved by
Name_of_Registry and/or the Advisory Committee,

___ submit all proposed study changes, including those accepted by an IRB, to
Name_of_Registry to seek approval prior to implementing changes. This includes but
is not limited to change in venue, change in PI or other investigators, change in study
focus, and any change requiring IRB approval,

___ report upon discovery all unanticipated problems, protocol violations, and breaches
of confidentiality to Name_of_Registry,

___ submit copies of literature and formal presentations generated using Name_of_Registry
data,

___ pay all relevant fees prior to receiving Name_of_Registry data (see Schedule of
Research Fees), and

___ destroy the complete dataset received from Name_of_Registry upon conclusion
of the study and inform Name_of_Registry.

I agree to comply with the above requirements. I attest that the information in this
Research Proposal Review Form and attachments is true and complete. I also attest that I
have no conflicts of interest to disclose regarding this study.

Non-compliance with this agreement may result in termination of the study approval. This
means approval for use of Name_of_Registry study data may be revoked. If this occurs,
proof is required that all data obtained from Name_of_Registry for the purposes of this
study are destroyed. If this occurs, no investigator on this study may benefit from the
use of Name_of_Registry data either monetarily, including grant funding, or through
publications, presentations, or any other means.

________________ Date ___________________ (PI signature)

APPENDIX B: ANNOTATED BIBLIOGRAPHY

The tables appearing on the following pages provide an annotated bibliography of the
majority of previous work related to the field of geocoding to date. The manuscripts listed
include those that are explicitly about the geocoding process itself, as well as those that
make use of it as a part of the research presented and address an aspect of the geocoding
process (both explicitly and implicitly). Each table focuses on a specific theme; the works
listed within are further classified into which aspect of the theme they are relevant to.
These tables should be used to guide the reader to the relevant works for further background
reading on a topic and/or to see how other research studies have addressed and/or dealt
with an issue related to the geocoding process.



Table 46 - Previous geocoding studies classified by topics of input data utilized
Input Data column groups: Types | Process | Accuracy
Columns: Named Places | Relative Descriptions | Postal Addresses | USPS PO Boxes | Postal Codes | Rural Routes | Parsing | Normalization | Standardization | Validation | Ambiguity Resolution | Temporality | Correctness
Abe and Stinchcomb 2008
Agarwal 2004
Agouris et al. 2000
Agovino et al. 2005
Alani 2001
Alani et al. 2003
Amitay et al. 2004
Arampatzis et al. 2006
Arbia et al. 1998
Arikawa and Noaki 2005
Arikawa et al. 2004
Armstrong et al. 2008
Armstrong and Tiwari 2008
Axelrod 2003
Bakshi et al. 2004
Beal 2003
Beaman et al. 2004
Berney and Blane 1997
Beyer et al. 2008
Bichler and Balchak 2007
Bilhaut et al. 2003
Blakely and Salmond 2002
Block 1995
Bonner et al. 2003
Boscoe et al. 2002
Boscoe et al. 2004
Boscoe 2008
Bow et al. 2004
Brody et al. 2002
Cayo and Talbot 2003
Chalasani et al. 2005
Chavez 2000
Chen, C.C. et al. 2003
Chen, C.C. et al. 2004
Chen, M.F. et al. 1998
Chen, W. et al. 2004
Chen et al. 2008
Chou 1995
Christen and Churches 2005
Christen et al. 2004
Chua 2001
Chung et al. 2004
Churches et al. 2002
Clough 2005
Collins et al. 1998
Croner 2003
Curtis et al. 2006
Davis Jr. 1993
Davis Jr. and Fonseca 2007
Davis Jr. et al. 2003
Dawes et al. 2006
Dearwent et al. 2001
Densham and Reid 2003
Diez-Roux et al. 2001
Drummond 1995
Dueker 1974
Durr and Froggatt 2002
Efstathopoulos et al. 2005
Eichelberger 1993
El-Yacoubi et al. 2002
Fonda-Bonardi 1994
Foody 2003
Fortney et al. 2000
Fremont et al. 2005
Frew et al. 1998
Fu et al. 2005a
Fu et al. 2005b
Fulcomer et al. 1998
Gaffney et al. 2005
Gatrell 1989
Geronimus and Bound 1998
Geronimus and Bound 1999a
Geronimus and Bound 1999b
Geronimus et al. 1995
Gilboa et al. 2006
Gilboa et al. 2006
Goldberg et al. 2007
Gregorio et al. 1999
Gregorio et al. 2005
Griffin et al. 1990
Grubesic and Matisziw 2006
Grubesic and Murray 2004
Han et al. 2004
Han et al. 2005
Hariharan and Toyama 2004
Haspel and Knotts 2005
Henshaw et al. 2004
Higgs and Martin 1995a
Higgs and Martin 1995b
Higgs and Richards 2002
Hill 2000
Hill and Zheng 1999
Hill et al. 1999
Himmelstein 2005
Hurley et al. 2003
Hutchinson and Veenendall 2005a
Hutchinson and Veenendall 2005b
Jaro 1984
Jaro 1989
Johnson 1998a
Johnson 1998b
Jones et al. 2001
Karimi et al. 2004
Kennedy et al. 2003
Kim 2001
Kimler 2004
Krieger 1992
Krieger 2003
Krieger and Gordon 1999
Krieger et al. 1997
Krieger et al. 2001
Krieger et al. 2002a
Krieger et al. 2002b
Krieger et al. 2003
Krieger et al. 2005
Krieger et al. 2006
Kwok and Yankaskas 2001
Laender et al. 2005
Lam et al. 2002
Lee 2004
Lee and McNally 1998
Leidner 2004
Levesque 2003
Levine and Kim 1998
Li et al. 2002
Lind 2001
Lind 2005
Lovasi et al. 2007
Maizlish and Herrera 2005
Markowetz 2004
Markowetz et al. 2005
Martin 1998
Martin and Higgs 1996
Martins and Silva 2005
Martins et al. 2005a
Martins et al. 2005b
Mazumdar et al. 2008
McCurley 2001
McEathron et al. 2002
McElroy et al. 2003
Mechanda and Puderer 2007
Michelson and Knoblock 2005
Miner et al. 2005
Ming et al. 2005
Murphy and Armitage 2005
Nicoara 2005
Noaki and Arikawa 2005a
Noaki and Arikawa 2005b
Nuckols et al. 2004
O'Reagan and Saalfeld 1987
Oliver et al. 2005
Olligschlaeger 1998
Oppong 1999
Paull 2003
Purves et al. 2005
Ratcliffe 2001
Ratcliffe 2004
Rauch et al. 2003
Reid 2003
Reinbacher 2006
Reinbacher et al. 2008
Revie and Kerfoot 1997
Riekert 2002
Rose et al. 2004
Rull et al. 2006
Rushton et al. 2006
Rushton et al. 2008b
Schlieder et al. 2001
Schockaert et al. 2005
Schootman et al. 2004
Sheehan et al. 2000
Shi 2007
Smith and Crane 2001
Smith and Mann 2003
Smith et al. 1999
Soobader et al. 2001
Southall 2003
Stevenson et al. 2000
Strickland et al. 2007
Temple et al. 2005
Tezuka and Tanaka 2005
Thrall 2006
Tobler 1972
Toral and Muñoz 2006
UN Economic Commission 2005
United States Department of Health and Human Services 2000

Vaid et al. 2005
Van Kreveld and Reinbacher 2004
Vestavik 2004
Vine et al. 1998
Vögele and Schlieder 2003
Vögele and Stuckenschmidt 2001
Vögele et al. 2003
Waldinger et al. 2003
Waller 2008
Walls 2003
Ward et al. 2005
Werner 1974
Whitsel et al. 2004
Whitsel et al. 2006
Wieczorek 2008
Wieczorek et al. 2004
Wilson et al. 2004
Winkler 1995
Wong and Chuah 1994
Woodruff and Plaunt 1994
Wu et al. 2005
Yang et al. 2004
Yildirim and Yomralioglu 2004
Yu 1996
Zandbergen 2007
Zandbergen 2008
Zandbergen and Chakraborty 2006
Zimmerman 2006
Zimmerman 2008
Zimmerman et al. 2007
Zimmerman et al. 2008
Zong et al. 2005
Table 47 - Previous geocoding studies classified by topics of reference data source
Reference Data Source column groups: Type | Process | Accuracy
Columns: Gazetteer | Point-Based | Line-Based | Area Unit-Based | Imagery | Conflation | Adding Value | Resolution | Spatial | Temporality | Non-Spatial Attributes
Abe and Stinchcomb 2008
Agouris et al. 2000
Agovino et al. 2005
Alani 2001
Alani et al. 2003
Amitay et al. 2004
Arampatzis et al. 2006
Arbia et al. 1998
Arikawa and Noaki 2005
Arikawa et al. 2004
Armstrong et al. 1999
Armstrong et al. 2008
Armstrong and Tiwari 2008
Axelrod 2003
Bakshi et al. 2004
Beal 2003
Beaman et al. 2004
Beyer et al. 2008
Bichler and Balchak 2007
Bilhaut et al. 2003
Block 1995
Bonner et al. 2003
Boscoe et al. 2002
Boscoe et al. 2004
Boscoe 2008
Boulos 2004
Bow et al. 2004
Brody et al. 2002
Broome 2003
Can 1993
Cayo and Talbot 2003
Chalasani et al. 2005
Chavez 2000
Chen, C.C. et al. 2003
Chen, C.C. et al. 2004
Chen, M.F. et al. 1998
Chen, W. et al. 2004
Chiang and Knoblock 2006
Chiang et al. 2005
Chou 1995
Christen and Churches 2005
Christen et al. 2004
Chua 2001
Chung et al. 2004
Churches et al. 2002
Clough 2005
Collins et al. 1998
Cressie and Kornak 2003
Croner 2003
Curtis et al. 2006
Davis Jr. 1993
Davis Jr. and Fonseca 2007
Davis Jr. et al. 2003
Dawes et al. 2006
Dearwent et al. 2001
Densham and Reid 2003
Diez-Roux et al. 2001
Drummond 1995
Dueker 1974
Durr and Froggatt 2002
Efstathopoulos et al. 2005
Eichelberger 1993
Fonda-Bonardi 1994
Foody 2003
Fortney et al. 2000
Frank et al. 2004
Fremont et al. 2005
Frew et al. 1998
Fu et al. 2005a
Fu et al. 2005b
Fulcomer et al. 1998
Gabrosek and Cressie 2002
Gatrell 1989
Geronimus and Bound 1998
Geronimus and Bound 1999a
Geronimus and Bound 1999b
Geronimus et al. 1995
Gilboa et al. 2006
Goldberg et al. 2007
Goodchild and Hunter 1997
Gregorio et al. 1999
Gregorio et al. 2005
Griffin et al. 1990
Grubesic and Matisziw 2006
Grubesic and Murray 2004
Han et al. 2004
Han et al. 2005
Hariharan and Toyama 2004
Haspel and Knotts 2005
Henshaw et al. 2004
Higgs and Martin 1995a
Higgs and Martin 1995b
Higgs and Richards 2002
Hild and Fritsch 1998
Hill 2000
Hill and Zheng 1999
Hill et al. 1999
Hurley et al. 2003
Hutchinson and Veenendall 2005a
Hutchinson and Veenendall 2005b
Johnson 1998a
Johnson 1998b
Jones et al. 2001
Karimi et al. 2004
Kennedy et al. 2003
Kim 2001
Kimler 2004
Krieger 1992
Krieger 2003
Krieger and Gordon 1999
Krieger et al. 1997
Krieger et al. 2001
Krieger et al. 2002a
Krieger et al. 2002b
Krieger et al. 2003
Krieger et al. 2005
Krieger et al. 2006
Kwok and Yankaskas 2001
Laender et al. 2005
Lam et al. 2002
Lee 2004
Lee and McNally 1998
Levesque 2003
Levine and Kim 1998
Li et al. 2002
Lind 2001
Lind 2005
Lovasi et al. 2007
Markowetz 2004
Markowetz et al. 2005
Martin 1998
Martin and Higgs 1996
Martins and Silva 2005
Martins et al. 2005a
Martins et al. 2005b
Mazumdar et al. 2008
McCurley 2001
McEathron et al. 2002
McElroy et al. 2003
Mechanda and Puderer 2007
Miner et al. 2005
Ming et al. 2005
Murphy and Armitage 2005
Nicoara 2005
Noaki and Arikawa 2005a
Noaki and Arikawa 2005b
Nuckols et al. 2004
O'Reagan and Saalfeld 1987
Oliver et al. 2005
Olligschlaeger 1998
Openshaw 1989
Oppong 1999
Paull 2003
Purves et al. 2005
Ratcliffe 2001
Ratcliffe 2004
Rauch et al. 2003
Reid 2003
Reinbacher 2006
Reinbacher et al. 2008
Revie and Kerfoot 1997
Riekert 2002
Rose et al. 2004
Rull et al. 2006
Rushton et al. 2006
Rushton et al. 2008b
Schlieder et al. 2001
Schockaert et al. 2005
Schootman et al. 2004
Sheehan et al. 2000
Shi 2007
Smith and Crane 2001
Smith and Mann 2003
Smith et al. 1999
Soobader et al. 2001
Southall 2003
Stevenson et al. 2000
Strickland et al. 2007
Temple et al. 2005
Thrall 2006
Toral and Muñoz 2006
UN Economic Commission 2005
Vaid et al. 2005
Van Kreveld and Reinbacher 2004
Veregin 1999
Vestavik 2004
Vine et al. 1998
Vögele and Schlieder 2003
Vögele and Stuckenschmidt 2001
Vögele et al. 2003
Waldinger et al. 2003
Waller 2008
Walls 2003
Ward et al. 2005
Werner 1974
Whitsel et al. 2004
Whitsel et al. 2006
Wieczorek 2008
Wieczorek et al. 2004
Wilson et al. 2004
Woodruff and Plaunt 1994
Wu et al. 2005
Yang et al. 2004
Yildirim and Yomralioglu 2004
Yu 1996
Zandbergen 2007
Zandbergen 2008
Zandbergen and Chakraborty 2006
Zimmerman 2006
Zimmerman 2008
Zimmerman et al. 2007
Zimmerman et al. 2007
Zong et al. 2005

Table 48 - Previous geocoding studies classified by topics of feature-matching approach
Matching column groups: Type | Process | Accuracy
Columns: Deterministic | Probability-Based | String Comparison | Relaxation | Match Type | Match Rate
Abe and Stinchcomb 2008
Agouris et al. 2000
Alani 2001
Amitay et al. 2004
Arampatzis et al. 2006
Armstrong et al. 2008
Armstrong and Tiwari 2008
Bakshi et al. 2004
Beal 2003
Beaman et al. 2004
Beyer et al. 2008
Bichler and Balchak 2007
Bilhaut et al. 2003
Blakely and Salmond 2002
Block 1995
Bonner et al. 2003
Boscoe et al. 2002
Boscoe 2008
Bow et al. 2004
Cayo and Talbot 2003
Chavez 2000
Chen, M.F. et al. 1998
Chen, W. et al. 2004
Chen et al. 2008
Chiang and Knoblock 2006
Chiang et al. 2005
Christen and Churches 2005
Christen et al. 2004
Chua 2001
Chung et al. 2004
Churches et al. 2002
Clough 2005
Davis Jr. 1993
Davis Jr. and Fonseca 2007
Davis Jr. et al. 2003
Dearwent et al. 2001
Densham and Reid 2003
Drummond 1995
Durr and Froggatt 2002
Efstathopoulos et al. 2005
Eichelberger 1993
El-Yacoubi et al. 2002
Fu et al. 2005a
Fu et al. 2005b
Fulcomer et al. 1998
Gabrosek and Cressie 2002
Gilboa et al. 2006
Goldberg et al. 2007
Goodchild and Hunter 1997
Gregorio et al. 1999
Gregorio et al. 2005
Grubesic and Matisziw 2006
Grubesic and Murray 2004
Han et al. 2004
Hariharan and Toyama 2004
Haspel and Knotts 2005
Higgs and Martin 1995b
Higgs and Richards 2002
Hill 2000
Hill and Zheng 1999
Hill et al. 1999
Hurley et al. 2003
Jaro 1984
Jaro 1989
Johnson 1998a
Johnson 1998b
Jones et al. 2001
Karimi et al. 2004
Kimler 2004
Krieger 1992
Krieger 2003
Krieger et al. 2001
Krieger et al. 2002a
Krieger et al. 2002b
Krieger et al. 2003
Krieger et al. 2005
Krieger et al. 2006
Kwok and Yankaskas 2001
Laender et al. 2005
Lam et al. 2002
Lee 2004
Lee and McNally 1998
Leidner 2004
Levine and Kim 1998
Li et al. 2002
Lind 2001
Lind 2005
Lovasi et al. 2007
MacDorman and Gay 1999
Maizlish and Herrera 2005
Markowetz 2004
Markowetz et al. 2005
Martin and Higgs 1996
Martins and Silva 2005
Martins et al. 2005a
Martins et al. 2005b
Mazumdar et al. 2008
McCurley 2001
McElroy et al. 2003
Mechanda and Puderer 2007
Meyer et al. 2005
Michelson and Knoblock 2005
Ming et al. 2005
Murphy and Armitage 2005
Nicoara 2005
Nuckols et al. 2004
O'Reagan and Saalfeld 1987
Oliver et al. 2005
Olligschlaeger 1998
Paull 2003
Porter 1980
Purves et al. 2005
Raghavan et al. 1989
Ratcliffe 2001
Ratcliffe 2004
Rauch et al. 2003
Reid 2003
Reinbacher 2006
Reinbacher et al. 2008
Revie and Kerfoot 1997
Riekert 2002
Rose et al. 2004
Rull et al. 2006
Rushton et al. 2006
Rushton et al. 2008b
Schootman et al. 2004
Schumacher 2007
Shi 2007
Soobader et al. 2001
Strickland et al. 2007
Tezuka and Tanaka 2005
Thrall 2006
Vine et al. 1998
Waldinger et al. 2003
Waller 2008
Walls 2003
Ward et al. 2005
Whitsel et al. 2004
Whitsel et al. 2006
Wieczorek 2008
Wieczorek et al. 2004
Wilson et al. 2004
Winkler 1995
Wong and Chuah 1994
Woodruff and Plaunt 1994
Wu et al. 2005
Yang et al. 2004
Yu 1996
Zandbergen 2007
Zandbergen 2008
Zandbergen and Chakraborty 2006
Zimmerman 2006
Zimmerman 2008
Zimmerman et al. 2007
Zimmerman et al. 2008
Zong et al. 2005


Table 49 - Previous geocoding studies classified by topics of feature interpolation method
Interpolation
Type: Point-Based | Line-Based | Area Unit-Based
Accuracy: Point-Based | Line-Based | Area Unit-Based

Abe and Stinchcomb 2008
Agouris et al. 2000
Agovino et al. 2005
Alani 2001
Amitay et al. 2004
Armstrong et al. 2008
Armstrong and Tiwari 2008
Bakshi et al. 2004
Beal 2003
Beyer et al. 2008
Bichler and Balchak 2007
Bilhaut et al. 2003
Boscoe et al. 2002
Boscoe 2008
Bow et al. 2004
Brody et al. 2002
Cayo and Talbot 2003
Chalasani et al. 2005
Chen, C.C. et al. 2003
Chen, C.C. et al. 2004
Chen, M.F. et al. 1998
Chen, W. et al. 2004
Chen et al. 2008
Chiang and Knoblock 2006
Chiang et al. 2005
Christen and Churches 2005
Christen et al. 2004
Chua 2001
Chung et al. 2004
Churches et al. 2002
Clough 2005
Collins et al. 1998
Cressie and Kornak 2003
Curtis et al. 2006
Davis Jr. 1993
Davis Jr. and Fonseca 2007
Davis Jr. et al. 2003
Dearwent et al. 2001
Densham and Reid 2003
Drummond 1995
Dueker 1974
Durr and Froggatt 2002
Fortney et al. 2000
Fu et al. 2005a
Fu et al. 2005b
Fulcomer et al. 1998
Gatrell 1989
Gilboa et al. 2006
Goldberg et al. 2007
Goodchild and Hunter 1997
Gregorio et al. 1999
Gregorio et al. 2005
Grubesic and Matisziw 2006
Grubesic and Murray 2004
Han et al. 2004
Haspel and Knotts 2005
Henshaw et al. 2004
Higgs and Martin 1995b
Higgs and Richards 2002
Hill 2000
Hill and Zheng 1999
Hill et al. 1999
Hurley et al. 2003
Hutchinson and Veenendall 2005a
Hutchinson and Veenendall 2005b
Johnson 1998a
Johnson 1998b
Karimi et al. 2004
Kennedy et al. 2003
Kimler 2004
Krieger et al. 2001
Krieger et al. 2002a
Krieger et al. 2002b
Krieger et al. 2003
Krieger et al. 2005
Krieger et al. 2006
Kwok and Yankaskas 2001
Laender et al. 2005
Lam et al. 2002
Lee 2004
Lee and McNally 1998
Levine and Kim 1998
Li et al. 2002
Lind 2001
Lind 2005
Lovasi et al. 2007
Markowetz 2004
Markowetz et al. 2005
Martin 1998
Mazumdar et al. 2008
McCurley 2001
McEathron et al. 2002
McElroy et al. 2003
Miner et al. 2005
Ming et al. 2005
Murphy and Armitage 2005
Nicoara 2005
Noaki and Arikawa 2005a
Noaki and Arikawa 2005b
Oliver et al. 2005
Olligschlaeger 1998
Ratcliffe 2001
Ratcliffe 2004
Rauch et al. 2003
Reid 2003
Reinbacher 2006
Reinbacher et al. 2008
Revie and Kerfoot 1997
Rose et al. 2004
Rull et al. 2006
Rushton et al. 2006
Rushton et al. 2008b
Sadahiro 2000
Schlieder et al. 2001
Schockaert et al. 2005
Schootman et al. 2004
Sheehan et al. 2000
Shi 2007
Southall 2003
Stevenson et al. 2000
Strickland et al. 2007
Thrall 2006
Tobler 1972
Van Kreveld and Reinbacher 2004
Vine et al. 1998
Vögele and Schlieder 2003
Vögele and Stuckenschmidt 2001
Vögele et al. 2003
Waller 2008
Walls 2003
Ward et al. 2005
Whitsel et al. 2004
Whitsel et al. 2006
Wieczorek 2008
Wieczorek et al. 2004
Wilson et al. 2004
Woodruff and Plaunt 1994
Wu et al. 2005
Yang et al. 2004
Yu 1996
Zandbergen 2007
Zandbergen 2008
Zandbergen and Chakraborty 2006
Zimmerman 2006
Zimmerman 2008
Zimmerman et al. 2007
Zong et al. 2005


Table 50 - Previous geocoding studies classified by topics of accuracy measures utilized
Accuracy
Measures: Spatial | Temporality | Resolution | Bias Introduction | Quality Codes
Estimates: Spatial | Temporality | Resolution

Abe and Stinchcomb 2008
Agouris et al. 2000
Agovino et al. 2005
Alani 2001
Alani et al. 2003
Amitay et al. 2004
Arampatzis et al. 2006
Arikawa et al. 2004
Armstrong et al. 2008
Armstrong and Tiwari 2008
Axelrod 2003
Bakshi et al. 2004
Beal 2003
Beaman et al. 2004
Berney and Blane 1997
Beyer et al. 2008
Bichler and Balchak 2007
Bilhaut et al. 2003
Blakely and Salmond 2002
Bonner et al. 2003
Boscoe et al. 2002
Boscoe 2008
Bow et al. 2004
Brody et al. 2002
Casady 1999
Cayo and Talbot 2003
Chalasani et al. 2005
Chavez 2000
Chen, C.C. et al. 2003
Chen, C.C. et al. 2004
Chen, W. et al. 2004
Chen et al. 2008
Christen and Churches 2005
Christen et al. 2004
Chua 2001
Churches et al. 2002
Clough 2005
Collins et al. 1998
Cressie and Kornak 2003
Curtis et al. 2006
Davis Jr. 1993
Davis Jr. and Fonseca 2007
Davis Jr. et al. 2003
Dearwent et al. 2001
Diez-Roux et al. 2001
Dru and Saada 2001
Drummond 1995
Dueker 1974
Durr and Froggatt 2002
Fonda-Bonardi 1994
Foody 2003
Fortney et al. 2000
Fremont et al. 2005
Frew et al. 1998
Fu et al. 2005a
Fu et al. 2005b
Fulcomer et al. 1998
Gabrosek and Cressie 2002
Gaffney et al. 2005
Gatrell 1989
Geronimus and Bound 1998
Geronimus and Bound 1999a
Geronimus and Bound 1999b
Geronimus et al. 1995
Gilboa et al. 2006
Goldberg et al. 2007
Goodchild and Hunter 1997
Gregorio et al. 1999
Gregorio et al. 2005
Grubesic and Matisziw 2006
Grubesic and Murray 2004
Han et al. 2004
Han et al. 2005
Hariharan and Toyama 2004
Haspel and Knotts 2005
Henshaw et al. 2004
Higgs and Martin 1995a
Higgs and Martin 1995b
Higgs and Richards 2002
Hill 2000
Hill and Zheng 1999
Hill et al. 1999
Himmelstein 2005
Hurley et al. 2003
Hutchinson and Veenendall 2005a
Hutchinson and Veenendall 2005b
Jones et al. 2001
Karimi et al. 2004
Kennedy et al. 2003
Kimler 2004
Krieger 1992
Krieger 2003
Krieger and Gordon 1999
Krieger et al. 1997
Krieger et al. 2001
Krieger et al. 2002a
Krieger et al. 2002b
Krieger et al. 2003
Krieger et al. 2005
Krieger et al. 2006
Kwok and Yankaskas 2001
Laender et al. 2005
Lam et al. 2002
Lee 2004
Lee and McNally 1998
Levesque 2003
Levine and Kim 1998
Li et al. 2002
Lind 2001
Lind 2005
Lovasi et al. 2007
Markowetz 2004
Markowetz et al. 2005
Martin 1998
Martin and Higgs 1996
Martins et al. 2005b
Mazumdar et al. 2008
McCurley 2001
McEathron et al. 2002
McElroy et al. 2003
Mechanda and Puderer 2007
Murphy and Armitage 2005
Nicoara 2005
Noaki and Arikawa 2005a
Noaki and Arikawa 2005b
Nuckols et al. 2004
O'Reagan and Saalfeld 1987
Oliver et al. 2005
Olligschlaeger 1998
Oppong 1999
Paull 2003
Purves et al. 2005
Ratcliffe 2001
Ratcliffe 2004
Rauch et al. 2003
Reid 2003
Reinbacher 2006
Reinbacher et al. 2008
Revie and Kerfoot 1997
Rose et al. 2004
Rull et al. 2006
Rushton et al. 2006
Rushton et al. 2008b
Sadahiro 2000
Schlieder et al. 2001
Schockaert et al. 2005
Schootman et al. 2004
Sheehan et al. 2000
Shi 2007
Smith et al. 1999
Soobader et al. 2001
Southall 2003
Stevenson et al. 2000
Strickland et al. 2007
Temple et al. 2005
Thrall 2006
Vaid et al. 2005
Vestavik 2004
Vine et al. 1998
Vögele and Schlieder 2003
Vögele and Stuckenschmidt 2001
Vögele et al. 2003
Waldinger et al. 2003
Waller 2008
Walls 2003
Ward et al. 2005
Werner 1974
Whitsel et al. 2004
Whitsel et al. 2006
Wieczorek 2008
Wieczorek et al. 2004
Wilson et al. 2004
Woodruff and Plaunt 1994
Wu et al. 2005
Yang et al. 2004
Yu 1996
Zandbergen 2007
Zandbergen 2008
Zandbergen and Chakraborty 2006
Zimmerman 2006
Zimmerman 2008
Zimmerman et al. 2007
Zimmerman et al. 2008

Table 51 - Previous geocoding studies classified by topics of process used
Process
Manual: GPS | Raster Imagery, Maps | Supplemental Data
Auto: Batch-Mode | Single-Mode

Abe and Stinchcomb 2008
Agouris et al. 2000
Agovino et al. 2005
Arampatzis et al. 2006
Armstrong and Tiwari 2008
Bakshi et al. 2004
Beal 2003
Beaman et al. 2004
Beyer et al. 2008
Bichler and Balchak 2007
Bilhaut et al. 2003
Bonner et al. 2003
Boscoe et al. 2002
Boscoe 2008
Bow et al. 2004
Brody et al. 2002
Cayo and Talbot 2003
Chalasani et al. 2005
Chavez 2000
Chen, C.C. et al. 2003
Chen, C.C. et al. 2004
Chen, W. et al. 2004
Christen and Churches 2005
Christen et al. 2004
Clough 2005
Curtis et al. 2006
Dao et al. 2002
Davis Jr. 1993
Dearwent et al. 2001
Dru and Saada 2001
Drummond 1995
Dueker 1974
Durr and Froggatt 2002
Fortney et al. 2000
Frew et al. 1998
Fulcomer et al. 1998
Gilboa et al. 2006
Goldberg et al. 2007
Gregorio et al. 1999
Gregorio et al. 2005
Han et al. 2004
Haspel and Knotts 2005
Henshaw et al. 2004
Higgs and Martin 1995b
Hill 2000
Hill and Zheng 1999
Hill et al. 1999
Hurley et al. 2003
Hutchinson and Veenendall 2005a
Hutchinson and Veenendall 2005b
Karimi et al. 2004
Kennedy et al. 2003
Krieger 1992
Krieger 2003
Krieger et al. 2001
Krieger et al. 2002a
Krieger et al. 2002b
Krieger et al. 2003
Krieger et al. 2005
Krieger et al. 2006
Kwok and Yankaskas 2001
Lee 2004
Lee and McNally 1998
Levesque 2003
Levine and Kim 1998
Lovasi et al. 2007
MacDorman and Gay 1999
Mazumdar et al. 2008
McEathron et al. 2002
McElroy et al. 2003
Mechanda and Puderer 2007
Ming et al. 2005
Nicoara 2005
Olligschlaeger 1998
Purves et al. 2005
Ratcliffe 2001
Rauch et al. 2003
Rose et al. 2004
Rushton et al. 2006
Rushton et al. 2008b
Schootman et al. 2004
Strickland et al. 2007
Temple et al. 2005
Thrall 2006
Vine et al. 1998
Ward et al. 2005
Whitsel et al. 2004
Whitsel et al. 2006
Wieczorek 2008
Wieczorek et al. 2004
Wu et al. 2005
Yang et al. 2004
Zandbergen 2007
Zandbergen 2008
Zandbergen and Chakraborty 2006
Zimmerman 2006
Zimmerman et al. 2007


Table 52 - Previous geocoding studies classified by topics of privacy concern and/or method
Privacy
Type: Data Leak | Self-Identifying
Process: Masking | Randomization | Aggregation

Arikawa et al. 2004
Armstrong et al. 1999
Beyer et al. 2008
Boscoe et al. 2004
Boulos 2004
Brownstein et al. 2006
Casady 1999
Chen et al. 2008
Christen and Churches 2005
Churches et al. 2002
Croner 2003
Curtis et al. 2006
Dao et al. 2002
Gittler 2008a
Goldberg et al. 2007
MacDorman and Gay 1999
Mazumdar et al. 2008
Miner et al. 2005
Oppong 1999
Rushton et al. 2006
Rushton et al. 2008b
Stevenson et al. 2000
Sweeney 2002
Vine et al. 1998
Waller 2008
Zimmerman et al. 2008


Table 53 - Previous geocoding studies classified by topics of organizational cost
Organization
Cost: Obtaining Reference Data | Per-Geocode | Manpower

Beal 2003
Boscoe et al. 2002
Boscoe et al. 2004
Davis Jr. 1993
Johnson 1998a
Johnson 1998b
Krieger 1992
Krieger 2003
Krieger et al. 2001
Martin and Higgs 1996
McElroy et al. 2003
Miner et al. 2005
Strickland et al. 2007
Temple et al. 2005
Thrall 2006
Whitsel et al. 2004
Whitsel et al. 2006



North American Association of Central Cancer Registries, Inc.
2121 W. White Oaks Drive, Suite B
Springfield, IL 62704
217.698.0800
217.698.0188 fax
info@naaccr.org
www.naaccr.org
