
CHAPTER 4: DATA INPUT AND OUTPUT

(GIS: A Management Perspective - Stan Aronoff)


Pages 103 - 131

For a GIS to be useful it must be capable of receiving and producing information in an effective manner.
The data input and output functions are the means by which a GIS communicates with the world outside.
The objective in defining GIS input and output requirements is to identify the mix of equipment and methods needed to
meet the required level of performance and quality. No one device or approach is optimum for all situations.
DATA INPUT: The procedure of encoding data into a computer-readable form and writing the data to the GIS database.
Data entry is usually the major bottleneck in implementing a GIS. The initial cost of building the database is commonly 5
to 10 times the cost of the GIS hardware and software.
The creation of an accurate and well-documented database is critical to the operation of the GIS.
Accurate information can only be generated if the data on which it is based were accurate to begin with.
Data quality information includes the date of collection, the positional accuracy, completeness, and the method used to
collect and encode the data. (Discussed in detail in Ch. 5)

There are two types of data to be entered into a GIS: Spatial data and the associated non-spatial attribute data.

The spatial data represents the geographical location of the features


The non-spatial attribute data provide descriptive information like the name of a street, salinity of the lake or the type
of tree stand.
The non-spatial attribute data must be logically attached to the features they describe.
There are five types of data entry systems commonly used in a GIS:
 keyboard entry
 coordinate geometry
 manual digitizing
 scanning
 input of existing digital files

Keyboard entry: involves manually entering the data at a computer terminal. Attribute data are commonly input by
keyboard whereas spatial data are rarely input this way.
Keyboard entry may also be used during manual digitizing to enter the attribute information. However this is usually more
efficiently handled as a separate operation.
Roads files versus the census file -- roads file will use codes for the various road types while the census file uses exact
numbers for things like total population, age range, etc.

Coordinate Geometry (COGO): involves entering survey data using a keyboard. From these data the coordinates of spatial
features are calculated. This produces the very high level of precision and accuracy needed in a cadastral system.
For a city with 100,000 parcels, it would cost approximately $1 to $1.50 per parcel, or $100,000 to $150,000, to digitize
the parcels manually. COGO procedures are commonly 6 times, and can be up to 20 times, more expensive than manual
digitizing.
Surveyors and engineers want the higher accuracy of COGO for their applications. Planners and most others are happy
with the lower accuracy provided by manual digitizing.
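The COGO computation itself is simple trigonometry: each surveyed leg (a bearing and a distance) advances the coordinate from the previous point. The Python sketch below is illustrative only (the function name and sample values are invented, not from the text); it assumes azimuths measured clockwise from north.

```python
import math

def cogo_traverse(start, legs):
    """Compute coordinates along a survey traverse.

    start: (easting, northing) of a known control point.
    legs:  list of (azimuth_degrees, distance) pairs, azimuth
           measured clockwise from north, as in field notes.
    Returns all computed points, starting point included.
    """
    points = [start]
    e, n = start
    for azimuth, distance in legs:
        rad = math.radians(azimuth)
        e += distance * math.sin(rad)  # change in easting
        n += distance * math.cos(rad)  # change in northing
        points.append((round(e, 3), round(n, 3)))
    return points

# A parcel corner 100 m due east, then 50 m due north, of a control point
pts = cogo_traverse((1000.0, 2000.0), [(90.0, 100.0), (0.0, 50.0)])
print(pts)  # [(1000.0, 2000.0), (1100.0, 2000.0), (1100.0, 2050.0)]
```

Because the coordinates are computed from measured distances and angles rather than traced from a paper map, the result is limited only by survey accuracy, which is why cadastral systems favor COGO despite its cost.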

Manual Digitizing: The most widely used method for entering spatial data from maps. The map is mounted on a digitizing
tablet and a hand held device termed a puck or cursor is used to trace each map feature. The position of the puck is
accurately measured by the device to generate the coordinate data.
Digitizing surfaces range from 12 inches x 12 inches (digitizing tablet) to 36 inches x 48 inches (digitizing table) and on up.
The digitizing table electronically encodes the position of the pointing device with a precision of a fraction of a
millimeter.
The most common digitizer uses a fine wire mesh grid embedded in the table. The cursor normally has 16 or more buttons
that are used to operate the data entry and to enter attribute data.

V. Drake 1
SMC
The digitizing operation itself requires little computing power and so can be done without using the full GIS. A smaller,
less expensive computer can be used to control the digitizing process and store the data. The data can later be
transferred to the GIS for processing. The drawback of this approach is the need to provide digitizing software for each of the additional computers.
The efficiency of digitizing depends on the quality of the digitizing software and the skill of the operator. The process
of tracing lines is time-consuming and error prone. The software can provide aids that substantially reduce the effort of
detecting and correcting errors.
Attribute information may be entered during the digitizing process, but usually only as an identification number. The
attribute information referenced to the same ID number is entered separately.
Manual digitizing is a tedious job. Operator fatigue (eye strain, back soreness, etc.) can seriously degrade the data
quality. Managers must limit the number of hours an operator works at one time. A commonly used quality check is to
produce a verification plot of the digitized data that is visually compared with the map from which the data were
originally digitized.

Scanning: Scanning provides a faster means of data entry compared to manual digitizing.
In scanning, a digital image of the map is produced by moving an electronic detector across the surface of the map.
There are two types of scanner designs:
Flat-bed scanner: On a flat-bed scanner the map is placed on a flat scanning stage and the detectors move across the
map in both the X and the Y directions (similar to copy machine).
Drum scanner: On a drum scanner, the map is mounted on a cylindrical drum which rotates while the detector moves
horizontally across the map. The sensor motion provides movement in the X direction while the drum rotation provides
movement in the Y direction.
The output from the scanner is a digital image. Usually the image is black and white but scanners can record color by
scanning the same document three times using red, green and blue filters.

Inputting existing digital files: There are many companies and organizations that provide or sell digital
data files, often in a format that can be read directly into a GIS. These digital data sets are priced at a fraction of the
cost of digitizing existing maps.
Over the next decade, the increased availability of data should reduce the current high cost and lengthy production
times needed to develop digital geographic data bases.

SCANNING VERSUS MANUAL DIGITIZING


Scanning is being used by many organizations, yet the subject is very controversial. One reason for the questions on data
accuracy is that rigorous trials are few and of necessity are specific to the organization and application.
Data entry using scanning is claimed to be 5 to 10 times (or more) faster than digitizing.
However, maps normally must be redrafted before they can be scanned, or the color separations must be scanned.

Redrafting is often considered to be a major disadvantage of the scanning option. Redrafting, although time consuming,
does not necessarily add to the cost of the data conversion process. Redrafting can reduce the total cost of both
scanning and manual digitizing. For example, studies by the US Forest Service have shown that a "map preparation" step
before the manual digitizing is done can reduce the overall digital encoding costs by as much as 50%.
WHY?
1. The redrafting is done manually, not on a computer system and therefore costs are not incurred for the computer time
or the higher salaries of computer operators.
2. The digitizing operation proceeds much more quickly and requires less editing if the map has fewer errors and
inconsistencies. Faster completion of the digitization and editing functions reduces the amount and therefore the costs
of expensive computer system and computer operator time.
3. When inconsistencies on the map must be worked out, resolving them by manual drafting is faster and more efficient
than doing so during digitizing; the two require different skills and are not equivalent tasks.
4. It is very time consuming and therefore very costly to make large numbers of changes to a map once it is in digital
form.

While a scanning system is for the most part automated and requires less highly trained personnel, more complex
equipment must be maintained, more sophisticated software must be written or purchased, and there are more steps in
the process.
Scanners are more expensive than digitizing tables. A 60 x 44 inch digitizing table can cost between $3000 and $8000.
A high quality scanner will cost $100,000. The higher equipment costs can be justified if there is a great deal of
production that needs to be done.
Most GIS software packages include a digitizing software capability, but separate special-purpose software is needed to
operate a scanning system. Scanning works best with maps that are very clean, simple, and do not contain extraneous
information. Scanning is most cost-effective for maps with large numbers of polygons (1000 or more) and maps with a
large number of irregularly shaped features such as lines and odd polygons.
Manual digitizing tends to be more cost-effective when there are relatively few maps, or when the maps are not in a form
that can be scanned. Maps that require a lot of interpretation are poor candidates for scanning.

There is a strong demand for faster, more cost-effective data entry methods. Hundreds of computer operators with
thousands of maps are not the answer. Although scanning is unlikely to replace manual digitizing entirely, the technology
will improve as scanners come into wider use.

DIRECT USE OF RASTER SCANNED IMAGES


Much of the difficulty in using raster scanning to enter map information is the extraction of points, lines and polygons
from the raster data.
In some cases, the raster image is only needed as a background on which to overlay other geographic information.
Air photos, satellite imagery, and scanned map images can be stored and presented in this way.
For example, if a raster satellite image is displayed on a screen, a vector map can be overlaid and then updated or a
totally new map created by digitizing on the screen.
Using a raster image as a background can be an effective solution when a relatively small amount of data needs to be
extracted but a large area must be displayed in order to find the data.

EXISTING DIGITAL DATA


In the US and Canada, low-cost digital geographic information is becoming more readily available.
Data sets are being produced by the national mapping agencies and agencies responsible for the census and other
nationwide statistical data. In the US these agencies include but are not limited to the USGS, US Census Bureau, and
the DMA. Natural resources information is being converted to digital form at both the federal and state or provincial
levels.

Since digital data sets are produced to satisfy a wide range of users, the cost, currency, and accuracy of the data vary.
The accuracy with which boundaries are drawn, the date of the information, and the method of compilation may be
sufficiently different to create errors when different data layers or adjacent map sheets within a data layer are used
together.
Figure 4.4 p. 112 This is a map produced from the USGS 1:250,000 Land Use/Land Cover digital data set. To generate
this map, the data for two adjacent data sets were joined. Notice the abrupt change in land use categories along a
horizontal line in the center of the map. This change coincides with the boundary between two map sheets from which
the data were digitized. The differences may be a result of discrepancies in airphoto interpretation or of the three year
difference in the source dates of the aerial photography used.
Problems such as these may occur in any digital data set and must be identified and taken into account.
Private companies are also beginning to provide off-the-shelf database products. Although there may be difficulties, the
cost of existing data is usually a fraction of the cost of creating a new data set.
The availability of inexpensive data sets will make GIS technology economically more attractive and easier to implement.
In the US the cartographic community has made a considerable effort to coordinate and standardize the production and
distribution of digital geographic data.
At the federal level, the Federal Interagency Coordinating Committee on Digital Cartography (FICCDC) was formed in
1983 for this purpose. Over 14 organizations participate in the Committee, which holds regular meetings and produces a
newsletter and a variety of reports.
Now we are going to discuss examples of data sets available from these federal agencies.

BASE CARTOGRAPHIC DATA


Base cartographic data include the topographic and planimetric information usually portrayed on a map.
Topographic data are those data that portray relief, such as elevation contours and spot heights.
Planimetric data include roads and streams, as well as cultural data such as administrative and political boundaries, cities,
and towns.
Often these data sets are digitized versions of an existing map series, with each type of information, such as the elevation
contours, assigned to a separate data layer. Base cartographic data sets are produced in two formats: graphics and
topologically-structured.
topologically-structured.

Graphics format is essentially the line and point features digitized in vector format. In this form, the map can be easily
updated or modified to produce special-purpose maps.
These data sets are well suited for the CAD systems used in digital mapping. However, they are severely limited by the
lack of topological structuring.
A commonly used interchange format is the SIF (Standard Interchange Format) developed by the digital mapping
industry for transferring lines, points, curves, and symbols.
These data sets can be incorporated into a GIS, but there can be many problems associated with them. For example, the
data files often have not been checked for topological consistency.
They may contain such inconsistencies as lines that do not meet precisely but overshoot or undershoot the correct
connection point. There may be missing lines or gaps that create polygons that are not closed.
For use in a vector GIS these files must be clean and topologically structured.
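One part of the clean-up can be automated: comparing line endpoints and flagging pairs that nearly meet but not exactly, i.e. the undershoots and overshoots described above. The sketch below is a hypothetical minimal version (function name and tolerance value are invented); real GIS clean-up tools also snap the flagged endpoints together.

```python
def find_undershoots(lines, tolerance=0.5):
    """Report pairs of endpoints from different polylines that nearly
    meet (within `tolerance` map units) but do not coincide exactly --
    candidate undershoot/overshoot errors.

    lines: list of polylines, each a list of (x, y) vertices.
    """
    ends = []
    for i, line in enumerate(lines):
        ends.append((i, line[0]))
        ends.append((i, line[-1]))
    gaps = []
    for a in range(len(ends)):
        for b in range(a + 1, len(ends)):
            (i, (x1, y1)), (j, (x2, y2)) = ends[a], ends[b]
            if i == j:
                continue  # same line; skip its own two ends
            d = ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
            if 0 < d <= tolerance:  # close but not touching
                gaps.append(((i, (x1, y1)), (j, (x2, y2)), round(d, 3)))
    return gaps

roads = [
    [(0, 0), (10, 0)],      # road A
    [(10, 0), (10, 10)],    # road B meets A exactly (d = 0, no flag)
    [(3, 5), (9.8, 0.1)],   # road C stops just short of the junction
]
print(find_undershoots(roads))  # two near-misses at road C's endpoint
```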

Topologically-Structured Format is designed to encode geographic information in a form better suited for spatial
analysis and other geographic studies. Most GISs are designed to use topologically structured data.
The USGS Digital Line Graph (DLG) data set is an example of topologically structured data. This cartographic data set
has been developed from previous mapping efforts at the 1:2 million scale and more recently at the 1:100,000 and
1:24,000 scales.
The older 1:2 million data includes transportation, hydrography, and political boundary maps.
The 1:100,000 scale data sets for hydrography and transportation have been completed for the entire US while the
political boundaries and Public Land Survey System are still being developed.
The 1:24,000 series will include the PLSS, political boundaries, transportation, hydrography, and contour data layers. See
Figure 4.5 on page 114
These data sets represent a comprehensive, standardized, inexpensive, and publicly available source of digital information.
The complete coverage (at the 1:100,000 scale) makes it possible to assemble large-area data bases quickly and at a low
cost.

LAND USE / LAND COVER DATA


The USGS has developed a LU/LC data set compiled from 1:58,000 color infrared aerial photography and mapped at the
1:250,000 scale.
The data sets were generated by both manual digitizing and scan digitizing.
The LU/LC classes include urban areas, agricultural land, rangeland, forest, wetlands, barren land and tundra.
Associated maps provide political boundaries, hydrological units (watershed boundaries), federal land ownership, and
census subdivisions.
Data are available for about 75% of the US. A separate file is being developed for Alaska using a different classification
scheme and automated classification of digital satellite imagery.

CENSUS-RELATED DATA SETS


In Canada and the US, the agencies responsible for disseminating census data provide a number of digital data sets that
can be input to a GIS.
Census and other statistical data are provided in the form of attribute data sets coded by geographic location.
Enumeration districts, street addresses, postal codes, census tracts and other similar codes are used.
Spatial data sets are provided that can be linked to the attribute datasets by means of these area codes.
Street networks in metropolitan areas, census tract boundaries, and political boundaries are examples of the spatial data
sets commonly available.
The spatial and attribute data sets are used together to produce special-purpose maps and to retrieve information for
selected geographic areas. They are also used for more specialized analyses including address matching, district
delineation, and network analysis.

Address Matching is the technique of linking data from separate files by means of a common attribute, the street
address. For example, welfare case records may include the name and the address of each recipient but not the census
tract. The census tract information can be retrieved from the spatial data file by using the address as a key to find the
data in the other file.
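In code, address matching is a keyed join. The sketch below is a made-up miniature of the welfare-case example (all names, addresses, and tract codes are invented); the only essential step is normalizing the shared address key before the lookup, since the two files rarely use identical formatting.

```python
# Hypothetical records: welfare cases keyed only by street address,
# and a spatial file mapping normalized addresses to census tracts.
cases = [
    {"name": "Case 101", "address": "12 Elm St"},
    {"name": "Case 102", "address": "47 Oak Ave"},
]
address_to_tract = {
    "12 ELM ST": "Tract 5.02",
    "47 OAK AVE": "Tract 7.01",
}

def match_addresses(cases, lookup):
    """Attach the census tract to each case via the shared address key,
    normalizing the address text (trim, upper-case) before matching."""
    matched = []
    for case in cases:
        key = case["address"].strip().upper()
        matched.append({**case, "tract": lookup.get(key, "UNMATCHED")})
    return matched

for row in match_addresses(cases, address_to_tract):
    print(row["name"], "->", row["tract"])
```

Real address matching is far messier (abbreviations, address ranges, misspellings), but the principle is this common-attribute join.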
District Delineation is a procedure that defines compact areas based on one or more attributes. For example, it can be
used to divide an area into electoral districts that each have about the same population. Conceptually, this involves
starting at one point and enlarging the area until it encompasses the specified number of people, then a new district is
started and the process is repeated.
The population information would be retrieved from the attribute data file and the information needed to define and
enlarge the district boundaries would be retrieved from the spatial data file.
The district delineation procedure is used to define police and fire service districts, school districts, and commercial
market areas.
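The "enlarge until the target is reached, then start a new district" idea can be sketched in a few lines. The version below is deliberately simplified and hypothetical: it walks an ordered list of census units and only accumulates population, whereas a real delineation system would also enforce contiguity and compactness from the spatial data file.

```python
def delineate_districts(units, target_population):
    """Greedy sketch of district delineation: accumulate census units
    until the population target is met, then start a new district.

    units: ordered list of (unit_id, population) tuples.
    Returns a list of (unit_ids, total_population) districts.
    """
    districts, current, total = [], [], 0
    for unit_id, pop in units:
        current.append(unit_id)
        total += pop
        if total >= target_population:
            districts.append((current, total))
            current, total = [], 0
    if current:  # leftover units form a final, undersized district
        districts.append((current, total))
    return districts

units = [("A", 400), ("B", 350), ("C", 500), ("D", 300), ("E", 450)]
print(delineate_districts(units, target_population=700))
# [(['A', 'B'], 750), (['C', 'D'], 800), (['E'], 450)]
```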
Network Analysis is used to optimize transportation routing such as bus routes and emergency vehicle dispatching.
This procedure takes into account the length of each transportation segment and facts that affect the speed of travel
or the quantity of material that can be carried. Sophisticated systems can take into account the effects of rush hour
traffic, road closures, and vehicle availability in order to make the best assignment of delivery vehicles and routing.
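At the core of network analysis is a shortest-path computation over the segment graph. The sketch below uses Dijkstra's algorithm with travel time as the edge weight; the node names and travel times are invented for illustration, and real dispatch systems layer the rush-hour and closure adjustments mentioned above on top of this core.

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra's algorithm over a road network whose edge weights are
    travel times in minutes -- the routing core of network analysis.

    graph: {node: [(neighbor, minutes), ...]} (one-way segments).
    Returns (total_minutes, path) or None if goal is unreachable.
    """
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        minutes, node, path = heapq.heappop(queue)
        if node == goal:
            return minutes, path
        if node in visited:
            continue
        visited.add(node)
        for nbr, t in graph.get(node, []):
            if nbr not in visited:
                heapq.heappush(queue, (minutes + t, nbr, path + [nbr]))
    return None

# Hypothetical street segments; names are illustrative only
roads = {
    "Fire Hall": [("A", 4), ("B", 2)],
    "A": [("Incident", 5)],
    "B": [("A", 1), ("Incident", 8)],
}
print(shortest_route(roads, "Fire Hall", "Incident"))
# (8, ['Fire Hall', 'B', 'A', 'Incident'])
```

Note that the fastest route here detours through B rather than taking the direct but slower segments, which is exactly the behavior dispatchers want.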

GBF/DIME AND TIGER FILES


The US Census Bureau developed a geographic coding system to automate the processing of census questionnaires. This
system, called GBF/DIME, has been used since 1970.
The acronym stands for Geographic Base File/Dual Independent Map Encoding system. The files are topologically
structured and were produced for 350 major cities and suburbs across the US. The spatial data included street
networks, street addresses, political boundaries, and major hydrological features.
One of the benefits of this file was that census data could be easily aggregated by geographic regions for reporting
purposes. Local governments found that the GBF/DIME files were inexpensive data sources for their GIS. Digital street
maps could be produced from the data and after editing could be used as digital base maps for municipal applications.
However, the GBF/DIME files were not designed to be used as a digital map base and have some limitations. First, the
data do not accurately show the shape of the streets because each segment is a straight line connecting two
intersections and therefore curved lines become straight lines.
Secondly, the address range is provided for each street segment but the geographic position of each address location is
not included. In preparation for the 1990 Census, the Bureau of the Census developed the TIGER files (Topologically
Integrated Geographic Encoding and Referencing System) to replace the GBF/DIME system. The TIGER files overcame
many of the limitations of the earlier system. They cover the 50 states, DC, Puerto Rico, the US Virgin Islands, and
the outlying areas of the Pacific over which the US has jurisdiction. See Figure 4.6 on page 118.

Attribute data in the TIGER file include feature names, political and statistical geographic area codes (such as county,
incorporated place, census tract, and block number), potential address ranges, and ZIP codes for that portion of the
file. The Census Bureau no longer supports the DIME files. The TIGER files can be easily integrated into an existing
GIS data base by file matching, using the geographic area codes as match keys.

DIGITAL ELEVATION DATA


Digital elevation data are a set of elevation measurements for locations distributed over the land surface. They are used
to analyze the topography (surface features) of an area.
Various terms have been used to refer to digital elevation data and its derivatives: Digital Terrain Data, Digital Terrain
Models, Digital Elevation Models, and Digital Terrain Elevation Data.
Digital elevation data are used in a wide range of engineering, planning, and military applications. For example, they are
used to:
• Calculate cut-and-fill operations for road construction;
• Calculate the area that would be flooded by a hydroelectric dam;
• Analyze and delineate areas that can be seen from a location in the terrain;
• Intervisibility can also be used to plan route locations for roadways;
• Optimize the location of radar antennas or microwave towers; or
• Define the viewshed of an area.
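The flooded-area calculation above is one of the simplest DEM applications and shows the flavor of grid-based terrain analysis. The sketch below is a deliberately naive version (the grid values and cell size are invented): it counts grid cells below the water level, whereas a real analysis would also require the cells to be hydrologically connected to the reservoir.

```python
def flooded_area(dem, water_level, cell_size):
    """Estimate the area inundated at `water_level` from a grid DEM.

    dem:        2D list of elevations (one value per grid cell).
    cell_size:  ground size of one cell edge, in the same units.
    Counts cells whose elevation is below the water level; ignores
    hydrologic connectivity, so this is an upper-bound sketch.
    """
    below = sum(1 for row in dem for z in row if z < water_level)
    return below * cell_size * cell_size

# Hypothetical 4x4 elevation grid (metres), 30 m cells
dem = [
    [95, 97, 102, 110],
    [94, 96, 101, 108],
    [93, 95,  99, 105],
    [92, 94,  98, 103],
]
print(flooded_area(dem, water_level=100, cell_size=30))  # 9000 (m²)
```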
The methods used to capture and store elevation data can be grouped into four basic approaches:
• a regular grid
• contours
• profiles
• a Triangulated Irregular Network (TIN) SEE FIGURE 4.9 page 122
Digital elevation data are generated from existing contour maps, by photogrammetric analysis of stereo
aerial photographs, or more recently by automated analysis of stereo satellite data.
DTM data are most commonly provided in grid format, in which an elevation value is stored for each of a set of
regularly spaced ground positions. Each data point represents the elevation of the grid cell in which it is located.
One of the limitations of the raster form of representation is that the same density of elevation points is used for the
entire coverage area.
Ideally, the data points would be more closely spaced in complex terrain and sparsely distributed over more level areas.
A number of methods have been developed to provide a variable point density.

One method is to use a variable grid cell spacing to accommodate a variable density of points, with smaller cell sizes
being used to capture the detail in more complex terrain. A second approach has been to use irregularly spaced
elevation points and represent the topography by a network of triangular facets. In this way, elevation data can be
stored and manipulated using a vector representation. The TIN is produced from a set of irregularly spaced elevation
points (SEE FIGURE 4.9). A network of triangular facets is fit to these points. The coordinate positions and elevations
of the three points forming the vertices of each triangular facet are used to calculate such terrain parameters as the
slope and aspect.
The advantage of a TIN compared with a gridded representation is that the TIN can use fewer points, capture the
critical points that define discontinuities like ridge crests, and can be topologically encoded so that adjacency analyses
are more easily done.
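The slope-and-aspect calculation from a TIN facet's three vertices can be sketched directly: the facet's normal vector comes from the cross product of two edge vectors, slope is the angle between that normal and vertical, and aspect is the compass direction of the downslope vector. The function below is an illustrative implementation, not from the text.

```python
import math

def facet_slope_aspect(p1, p2, p3):
    """Slope and aspect (degrees) of one TIN facet from its three
    vertices, each an (x, y, z) tuple with y pointing north."""
    ux, uy, uz = (p2[i] - p1[i] for i in range(3))
    vx, vy, vz = (p3[i] - p1[i] for i in range(3))
    # Facet normal = u x v (cross product of two edge vectors)
    nx = uy * vz - uz * vy
    ny = uz * vx - ux * vz
    nz = ux * vy - uy * vx
    if nz < 0:  # make the normal point upward
        nx, ny, nz = -nx, -ny, -nz
    # Slope: angle between the normal and the vertical axis
    slope = math.degrees(math.atan2(math.hypot(nx, ny), nz))
    # Aspect: downslope direction, clockwise from north
    aspect = math.degrees(math.atan2(nx, ny)) % 360
    return round(slope, 1), round(aspect, 1)

# A facet dropping 10 m over 10 m toward the north (+y): 45° slope
print(facet_slope_aspect((0, 0, 10), (10, 0, 10), (0, 10, 0)))
# (45.0, 0.0)
```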
A third way to digitally represent a topographic surface is by development of a profile showing the elevation of points
along a series of parallel lines. Elevation values should be recorded at all breaks in slope and at scattered points in level
terrain. If the profiles are constructed from a topographic map, the elevation values can only be taken where the
profile crosses a contour line.
The fourth approach is to digitize contour lines. Here the topographic surface is represented by a series of elevation
points taken along the individual contours. Although elevation data can be converted from one format to another, each
time the data are converted some information is lost, reducing the detail of the topographic surface.

Digital elevation data are available in the US; they were first produced by the Defense Mapping Agency by scanning the
contour overlays of 1:250,000 scale topographic maps.
These data have an accuracy of 15 m in level terrain, 30 m in moderate terrain, and 60 m in steep terrain.
The data are sold by the map sheet as 1 degree x 1 degree blocks and are available for the entire US.
The USGS plans to progressively upgrade the accuracy of this data set and is also producing a higher-accuracy DTM file
with a 30 m sampling interval. The data are maintained in two data sets: one with a ±7 m accuracy and the other with a
±7 to ±15 m accuracy. These data are available for about 30% of the US and are sold by 7.5-minute quad sheets.
The unit price for these data decreases with the number of DTMs purchased. Prices for orders of six or more DTMs
consist of a base charge of $90 plus $7 for each additional unit.

DATA OUTPUT
Output is the procedure by which information from the GIS is presented in a form suitable to the user. Data are output
in one of three formats: hardcopy, softcopy, and electronic.
Hardcopy outputs are permanent means of display. The information is printed on paper, mylar, photographic film or other
similar materials.
Softcopy output is the format viewed on a computer monitor. Softcopy outputs are used to allow operator interaction
and to preview data before final output. A softcopy output can be changed interactively, but the view is restricted by the
size of the monitor.
The hardcopy output takes longer to produce and requires more expensive equipment. However, it is a permanent record.
Output in electronic formats consists of computer-compatible files.

CHAPTER 5: DATA QUALITY
(GIS: A Management Perspective - Stan Aronoff)
Pages 133 - 149

People routinely make judgments about data quality. Hikers learn from experience that on topographic maps the positions
of trails are shown less accurately than the positions of roads. Their judgment of the relative quality of the trail and road
information guides their use of the map data. Knowing the quality of data is critical to judging the applications for which
they are appropriate.

When spatial analyses are done manually using map overlays, users quickly learn to shift the map slightly to align
boundaries that should overlap. A map overlay may not be precisely registered but with these manual adjustments it can
be shifted so that any local area can be registered closely enough for the work at hand.

You can't do this in a GIS. Implicit assumptions about data quality must be made explicit so that they can be properly
addressed. In a computer, either roads meet or they don't. The computer must be programmed to read a line ending
short of the road as connected.
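That programmed rule is usually a snap tolerance: a dangling endpoint within some small distance of a road node is moved onto it and treated as connected. The sketch below is a hypothetical minimal version (names and values invented); production GIS software applies the same idea across whole layers during clean-up.

```python
def snap_endpoint(point, nodes, tolerance):
    """Snap a line endpoint to the nearest road node if one lies within
    `tolerance` map units; otherwise leave the point unchanged. This
    makes explicit the rule that a line ending just short of a road
    should be read as connected."""
    best, best_d = None, tolerance
    for node in nodes:
        d = ((point[0] - node[0]) ** 2 + (point[1] - node[1]) ** 2) ** 0.5
        if d <= best_d:
            best, best_d = node, d
    return best if best is not None else point

junctions = [(100.0, 200.0), (150.0, 240.0)]
print(snap_endpoint((99.7, 200.2), junctions, tolerance=0.5))
# (100.0, 200.0) -- within tolerance, treated as connected
print(snap_endpoint((120.0, 220.0), junctions, tolerance=0.5))
# (120.0, 220.0) -- too far from any junction, left alone
```

The choice of tolerance is itself a data-quality decision: too small and real connections are missed, too large and distinct features are fused.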

The cost of assessing data quality varies with the degree of rigor needed. The more rigorous the data quality testing,
the more costly it becomes. This cost is not only a result of the expense of performing the test, but also of the delays
caused in the production process to perform the tests and correct errors. The level of testing should be balanced
against the cost of the consequences of less accurate data or a less rigorously confirmed level of quality. Demanding
higher levels of data quality than are actually needed quickly becomes a significant unnecessary expense when it is
applied to the entire GIS database.

In a similar way, the expense of testing and recording the quality of the data in a GIS should be matched to the
consequences of its inappropriate use. The data in a GIS may be used for a wider range of analyses than when the same
data were in a non-digital form. This is one of the advantages of a GIS, the capability to integrate diverse data sets that
could not be analyzed together.
However, the data may be used in ways not foreseen by their producers, and by users without the knowledge or experience
to judge whether the application is appropriate.

A landowner in Wisconsin successfully sued the state for inappropriately showing the highwater mark around a lake on a
standard topographic map. The user did not realize that this type of topographic map was not sufficiently accurate to
show land parcel boundaries in the context of the elevation data.
As a consequence, it appeared that a portion of the owner's land was below the highwater mark. According to state laws,
land below the highwater mark is the property of the state. The error was corrected, however the owner successfully
sued for damages because in the interim a reasonable interpretation of the map would have caused her title to the land
to be in doubt.

Basically, a USGS topographic map was used to present data of unknown quality. The hand-drawn information was judged
to have the accuracy of the topographic map, which was an incorrect assumption.
In another case, the US federal government was held responsible for inaccurately and negligently showing the location of
a broadcasting tower on an aeronautical chart. This was shown to be a contributing factor in a fatal plane crash.
The quality of geographic data is often examined only after incorrect decisions have been made and financial losses or
personal injury have occurred. More and more, producers of geographic information are being held liable when their
products are found to contain errors, are poorly designed, or are used in ways and for purposes unintended by their
designers. Data quality standards, appropriately designed, tested, and reported, can protect both the producer and user
of the geographic information.

A GIS provides the means for geographic information to be used for a broader range of applications and by users with a
wider variety of skills than ever before. In order for these data to be used in decision-making, their quality must be
predictable and known. Ultimately, the data quality standards must serve the needs of the users, so the user
community must be directly involved in specifying the data quality standards for the GIS data base and in dealing with
practical constraints like budget, technical capabilities, and rate of production.
COMPONENTS OF DATA QUALITY
The characteristics that affect the usefulness of data can be divided into 9 components which are grouped into 3
categories: micro level components, macro level components, and usage components.

Micro Level Components


Micro level components are data quality factors that pertain to the individual data elements. These components are
usually evaluated by statistical testing of the data product against an independent source of higher quality information.

Positional Accuracy
Positional accuracy is the expected deviation of the geographic location of an object in the data set (map) from its true
ground position.
It is usually tested by selecting a specified sample of points in a prescribed manner and comparing the position
coordinates with an independent and more accurate source of information.
There are two components of positional accuracy: bias and precision.
Bias refers to systematic discrepancies between the represented and true position. Bias is measured by the average
positional error of the sample points.
Precision refers to the dispersion of the positional errors of the data elements. Precision is commonly estimated by
calculating the standard deviation of the selected test points. A low standard deviation indicates that the dispersion of
the errors is narrow; the errors tend to be relatively small. The higher the precision, the greater the confidence in using
the data.
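The two measures can be computed directly from check-point data. The sketch below (function name and sample coordinates are invented) takes the bias as the mean error vector and the precision as the standard deviation of the error distances, one reasonable reading of the procedure described above.

```python
import math

def positional_accuracy(test_points):
    """Bias and precision from check-point measurements.

    test_points: list of ((mapped_x, mapped_y), (true_x, true_y)) pairs.
    Bias = mean error vector (systematic shift); precision = population
    standard deviation of the error distances (dispersion).
    """
    errors = [(mx - tx, my - ty) for (mx, my), (tx, ty) in test_points]
    n = len(errors)
    bias = (sum(dx for dx, _ in errors) / n,
            sum(dy for _, dy in errors) / n)
    dists = [math.hypot(dx, dy) for dx, dy in errors]
    mean_d = sum(dists) / n
    precision = math.sqrt(sum((d - mean_d) ** 2 for d in dists) / n)
    return bias, round(precision, 3)

# Four check points, all shifted east by 1-2 m (hypothetical values)
checks = [((101.0, 200.0), (100.0, 200.0)),
          ((52.0, 80.0), (50.0, 80.0)),
          ((31.0, 40.0), (30.0, 40.0)),
          ((12.0, 10.0), (10.0, 10.0))]
print(positional_accuracy(checks))  # ((1.5, 0.0), 0.5)
```

A bias of (1.5, 0.0) with a small standard deviation signals a consistent eastward shift, the kind of systematic error that can often be corrected, unlike random scatter.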

Attribute Accuracy
Attributes may be discrete or continuous. A discrete variable can take on a finite number of values, whereas a continuous
variable can take on any value within a range.
Categories like land use class, vegetation type, or administrative area are discrete.
Variables like temperature or average property value are continuous, the variable can take on any value so intermediate
values are valid.
The method of assessing accuracy for continuous variables is similar to that discussed for positional accuracy.
The assessment of the accuracy of discrete variables is the domain of classification accuracy assessment, which is a
complex procedure. The difficulties in assessing classification accuracy arise because accuracy measurement is
significantly affected by the shape and size of individual areas, the way test points are selected, and the classes that are
confused with each other.

Randomly selected points from the data set are checked against field observations. For example, wetlands along
streams are typically long, narrow areas. Though they are often important for planning purposes, these areas commonly
make up less than 1% of the total map area.
In a randomly selected sample of test points, these areas would probably not be chosen. A classification accuracy could
therefore be calculated from the test points, but it would be no indication of the accuracy of the wetlands class unless
that class was also tested.
One way to alleviate this is to choose a set of test points from every category.
Another problem exists because very few sharp boundaries exist in nature whereas the data set will have a demarcation
line between classes.
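Choosing test points from every category is a stratified sample. The sketch below shows one way this might be done; the point data, class names, and per-class sample size are all hypothetical:

```python
import random

# Hypothetical map: (point_id, mapped_class) pairs; 'wetland' is rare (~1%).
points = []
for i in range(1, 1001):
    cls = "wetland" if i % 100 == 0 else ["forest", "crop", "urban"][i % 3]
    points.append((i, cls))

def stratified_sample(points, per_class, seed=0):
    """Select up to `per_class` test points from every mapped class."""
    rng = random.Random(seed)
    by_class = {}
    for pid, cls in points:
        by_class.setdefault(cls, []).append(pid)
    return {cls: rng.sample(ids, min(per_class, len(ids)))
            for cls, ids in by_class.items()}

sample = stratified_sample(points, per_class=5)
# Even the rare wetland class now contributes test points.
print({cls: len(ids) for cls, ids in sample.items()})
```

A simple random sample of 20 points over this map would, on average, include only about 0.2 wetland points; the stratified sample guarantees each class is represented.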

Logical Consistency
Logical consistency refers to how well logical relations among data elements are maintained. For example, it would not be
consistent to map some forest stand boundaries to the center of adjacent roads and others to the road edge.
Political and administrative boundaries defined by physical features should precisely overlay those features. For
example, the edge of a property that borders a lake should coincide with the lake boundary.
Another problem exists when mapping a reservoir because the water level will fluctuate over the span of a year.
Different GIS data layers may show the reservoir boundary at different locations, depending on the date of the
mapping. This problem is solved by providing a standard outline for a reservoir and placing it in each layer.
It is important to remember that two data sets may be correct to their specified level of positional accuracy and yet not
be logically consistent.

When two such data sets are overlaid, these slight discrepancies cause slivers.

There is no standard measurement of logical consistency, however, it is best addressed before data are entered in the
GIS data base. A map preparation stage is commonly used during which individual maps that are to be digitized are
checked and redrafted to correct errors and inconsistencies.

Resolution
The resolution of a data set is the smallest discernible unit or the smallest unit represented. For satellite images this is
called spatial resolution.
For thematic maps, the resolution is the size of the smallest object that is represented, and is termed minimum mapping
unit. The minimum mapping unit decision is made during the map compilation phase. Factors like the expected use of the
map, legibility, source data accuracy, and drafting expense are all considered. Geographic data in a digital GIS data
base can be displayed at any scale, because the geographic data do not really exist at a specific scale. Therefore the
minimum mapping unit can be very small. However, this ease with which the geographic data in a GIS can be used at any
scale highlights the importance of accurate data quality information.

Although the data do not have a specific scale, they were produced with levels of accuracy and resolution that make it
appropriate to use them at only certain scales. Using a GIS, a 1:50,000 scale map could be produced from data that
were digitized from a 1:500,000 scale map. However the map would not have the quality of a 1:50,000 scale map.

Macro Level Components


Macro level components of data quality pertain to the data set as a whole. These are usually evaluated by judgment or by
reporting information about them, not by true testing.

Completeness
There are several aspects to completeness and these are completeness of coverage, classification, and verification.
The completeness of coverage is the proportion of data available for the area of interest. Ideally, a data set will provide
100% coverage, however many data sets are progressively updated.
When information is needed about the current status of a resource, the most current information may be the most
suitable. In other cases, such as comparative analysis, it may be more important to have consistency in the data set. An
older data set for the entire study area may be more appropriate than a patchwork of more recent data collected in
different years.
Completeness of classification is an assessment of how well the chosen classification is able to represent the data. For a
classification to be complete it should be exhaustive, that is it should be possible to encode all data at the selected level
of detail.

TABLE 5.1. If the livestock category horses occurs, it cannot be encoded at Level 3, only at Level 2.
Under truck crops, potato falls into the OTHER category; however, this does not make the data set complete, because the
total potato crop area could not be calculated.

Another problem occurs if an observation can be placed into more than one category. For instance, in the forest category,
where should the transition areas from coniferous through mixed to deciduous stands be assigned?
Class definitions may also differ among map sheets as a result of the individual or the organizations that produced them.
The maps may be accurate in terms of position and classification, but the boundaries from adjacent maps may not match
if they were produced by different forest districts.

Completeness of verification refers to the amount and distribution of field measurement or other independent sources
of information that were used to develop the data.
Geologists indicate this aspect of data quality by using solid lines to map rock-type boundaries for which they have
direct field evidence (boundaries they can see), and by showing inferred boundaries with dashed or dotted lines.
Reporting of qualitative assessments of completeness has largely been ignored; however, such assessments are critical to
the appropriate use of the data.

Time
Time is a critical factor in using many types of data. Demographic information is usually very time sensitive and can
change significantly over a year. Land cover will change rapidly in an area of rapid urbanization.
Some data are biased depending on what time of the year they are collected. For example, in areas that produce multiple
crops per year, the crop types grown in an area change with the seasons.
The time aspect of data quality is most commonly reported as the date of the source material. Topo maps usually include
the original source date as well as the update date. For geographic information that changes relatively quickly over
time, the date of acquisition may be a very important attribute. Forest inventory maps may be updated on a 5 - 10 year
basis, while crop conditions change rapidly over the growing season and crop information is commonly updated on a weekly basis.
Time is a frequently overlooked consideration when multiple data sets, collected independently, are used together.

Lineage
The lineage of a data set is its history, the source data and processing steps used to produce it. The source data may
include transaction records, field notes, airphotos, and other maps.
A lineage report documents this information.
A lineage report for a topo map might include the date of the aerial photography used, the photogrammetric methods used to
map the contour lines and cultural features from the airphotos, the use of check points for photogrammetric control, and
the methods used to generate the final map.
Ideally, some indication of lineage should be included with the data set since the internal documents are rarely available
and usually require considerable expertise to evaluate.
Unfortunately, lineage information most often exists as the personal experience of a few staff members and is not
readily available to most users.

Usage Components
The usage components of data quality are specific to the resources of the organization. For example, the effect of data
cost depends on the financial resources of the organization.
A given data set may be too expensive for one organization and be considered inexpensive by another.
Satellite imagery may be inexpensive for an oil company to use in exploration but outrageously expensive for a wildlife
agency to use for habitat mapping.
The accessibility of the data depends on imposed usage restrictions and the human and computer resources of the
organization.

Accessibility
Accessibility refers to the ease of obtaining and using data. Data use may be restricted because the data are privately
held, because they are judged a matter of national defense, or to protect the rights of citizens.

Direct and Indirect Costs


The direct cost of a data set purchased from another organization is usually well-known; it is the price paid for the data.
However, when the data are generated in-house, the true cost may not be known. Assessing the true cost of these data
is usually difficult or impossible because the services and equipment used in their production support other activities as
well.
The indirect costs include all the time and materials used to make use of the data. When the data are purchased from a
vendor, the indirect costs may be more significant than the direct costs. It may take longer for staff to handle data
with which they are unfamiliar, or the data may not be compatible with other data sets to be used. For example, the
data may be in non-digital format or in a digital format that cannot be directly input to the GIS on which it is to be used.
Converting the data to a compatible format may simply involve running an existing conversion program. The human and
technical resources of the organization may largely determine whether the data are usable and how expensive it will be
to handle the conversion.

SOURCES OF ERROR (page 141)


There is error associated with all geographic information. Error is introduced at every step in the process of generating
and using geographic information, from collection of the source data to the interpretation of the results of a completed
analysis.
The objective in dealing with error should not be to eliminate it but to manage it.
Achieving the lowest possible level of error may not be the most cost-effective approach. There is a trade-off between
reducing the level of error in the data base and the cost to create and maintain the data base.
The level of error in a GIS needs to be managed so that data errors will not invalidate the information that the system is
used to provide.
Data Collection Errors

Error exists in the original source materials that are entered into the GIS. These errors may be a result of:
Inaccuracies in field measurement,
Inaccurate equipment, or
Incorrect recording procedures.

Much of the data input into a GIS comes from remote sensing techniques. There are inaccuracies in the photogrammetric
methods used to draw maps and measure elevations.
Airphoto and satellite image interpretations introduce a degree of error in the classification and delineation of
boundaries.

Data Input Errors


The data input devices used to enter geographic data all introduce positional errors.
Digitizing tables are commonly accurate to fractions of a millimeter, but the accuracy varies over the digitizing surface.
The center of a digitizing table commonly has a higher positional accuracy than the edges. The operator also introduces
error through:

the way the map is registered on the table,
the way boundaries are traced, and
the accuracy with which attribute and label information is entered.

In digitizing, curved boundaries are approximated by a series of straight-line segments. The smaller the segment used,
the more closely the boundary is approximated. However, the smaller the line segments, the larger the data files are.
No matter how carefully boundaries and points are entered, some residual error will always remain.
Errors in the position of natural boundaries are often introduced because the boundary does not exist as a sharp line. A
forest edge, even though it is drawn as a definite line, usually exists as a zone that may be several meters or tens of
meters wide.

Data Storage Errors


When data are stored in digital form, they must be stored with a finite level of precision. A typical vector GIS uses a
32-bit real number format. This provides, at most, seven significant digits.
UTM geographic coordinates use seven significant digits. A GIS data base that contains information with levels of detail
ranging from fractions of a meter to full UTM coordinates would require greater precision.
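The loss of sub-metre detail can be demonstrated by round-tripping a full UTM northing through 32-bit floating point. The coordinate value below is hypothetical:

```python
import struct

def to_float32(x):
    """Round-trip a value through IEEE 754 single precision (32 bits)."""
    return struct.unpack("f", struct.pack("f", x))[0]

northing = 5_427_836.25           # a UTM northing in metres, with sub-metre detail
stored = to_float32(northing)
print(northing, stored, abs(stored - northing))  # sub-metre detail is lost
```

At magnitudes of several million metres, a 32-bit float can only resolve steps of about half a metre, which is why a data base mixing full UTM coordinates with sub-metre detail needs double precision.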
Each value in a raster file represents a unit of the terrain. If the data are encoded using a pixel size of 10 x 10m, then
even if the geographic position of a point is known exactly to the fraction of a meter, it can only be represented to the
nearest 10m.
In theory, higher levels of spatial resolution could be obtained by simply defining a larger number of pixels. However,
increasing the resolution 10 times, from 10 m to 1 m, increases the file size 100 times.
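Both effects can be sketched with simple arithmetic. The coordinates, the pixel-centre convention, and the grid origin at (0, 0) are assumptions for illustration:

```python
# Snap an exact position to the 10 m pixel it falls in; the position can then
# only be recovered as the pixel centre (origin at (0, 0) assumed).
pixel = 10.0
x, y = 537_284.37, 5_427_836.25
col, row = int(x // pixel), int(y // pixel)
x_repr, y_repr = (col + 0.5) * pixel, (row + 0.5) * pixel
print(x_repr, y_repr)              # recoverable only to the nearest 10 m

# Refining resolution 10x (10 m -> 1 m) multiplies the cell count,
# and hence the file size, by 10 * 10 = 100.
cells_10m = (1000 // 10) ** 2      # a 1 km x 1 km tile at 10 m pixels
cells_1m = (1000 // 1) ** 2        # the same tile at 1 m pixels
print(cells_1m // cells_10m)
```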
The vector data model is better suited to storing high precision coordinates for discrete map elements, while the raster
data model tends to be better suited for representing measurements that vary continuously over an area.

Data Manipulation Errors


Many GIS analysis procedures involve combining multiple overlays. As the number of overlays increases, the number
of possible opportunities for error increases.
The highest accuracy possible will not be better than the least accurate input overlay.
Many errors arise, as discussed earlier, when the same boundary may be drawn slightly differently in two overlays. The
more complex the shape of the boundary, the more of a problem this becomes and, in an overlay operation, this mismatch
will create inaccuracies in the results.

There is also a level of inaccuracy inherent in the way classes are defined. Many continuous phenomena, such as
vegetation and soils, are mapped as homogeneous map units with sharp boundaries, e. g. choropleth and thematic maps.
In reality there is variability within each map unit. A polygon labeled as a Pine stand may actually have other types of
trees in small numbers.
When data are compiled a decision is made that areas below a certain size (minimum mapping unit) will not be recognized
within an otherwise homogeneous map unit.
While this is perfectly acceptable for the uses the data were intended for, it may be unacceptable when the data are
applied to analyses that were not foreseen.
A soil materials map may show an area to be sandy soil. In forestry, even the presence of 15% clay soils in this map unit
does not restrict the use of the data. However, in siting a house, the presence of clay inclusions is important because a
house sited partly on clay and partly on sandy soil will tend to settle unevenly, cracking the walls and the foundation.

Data Output Errors


Error can be introduced in the plotting of maps by the output device and by the shrinkage and swelling of the map
material.
As paper shrinks and swells, measurements taken from the map will change. On a small-scale map, millimeter-sized changes
on the paper can represent tens or even hundreds of meters on the ground.

Erroneous Use of Results


Error is also introduced when the reports generated by a GIS are incorrectly used.
Results may be misinterpreted, accuracy levels ignored, and inappropriate analyses accepted.
The source of these errors would seem to be independent of the GIS. However, the resulting errors in decision-making
represent errors in the process of using geographic information. These errors may be attributed to the GIS facility.

ACCURACY
Accuracy is the likelihood that a prediction will be correct.
In the case of a map, the positional accuracy is the likelihood that the position of a point as determined from the map
will be the "true" position, i. e., the position determined by more accurate information, such as by field survey.
Classification accuracy is the probability that the class assigned to a location on the map is the class that would be
found at that location in the field. No map can be 100% accurate.
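Classification accuracy is commonly summarized with a confusion (error) matrix of mapped versus field-observed classes; overall accuracy is the proportion of test points falling on the diagonal. The class names and counts below are hypothetical:

```python
# Hypothetical confusion matrix: rows = mapped class, columns = field-observed class.
classes = ["forest", "crop", "urban"]
matrix = [
    [42,  3,  1],   # mapped forest
    [ 4, 35,  2],   # mapped crop
    [ 1,  2, 30],   # mapped urban
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(classes)))
overall_accuracy = correct / total   # probability a mapped class matches the field
print(f"overall accuracy = {overall_accuracy:.1%}")
```

The off-diagonal cells show which classes are confused with each other, which the text notes is one of the factors that complicates classification accuracy assessment.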

CONCLUSION
Accuracy assessment can be an expensive procedure, and although it is extremely valuable, its costs must be weighed
against the benefits of the accuracy information.
Less rigorous tests that are less expensive can be used for data sets where the consequences of errors are less critical.
Accuracy assessments usually involve a comparison of values from the data set to be tested with values from an
independent source of higher accuracy, such as field verification, which may be more expensive than the application can
justify.
Less expensive approaches may be used. For example, indirect verification of test points may be done by interpreting
airphotos instead of by field observations.
In the end, the specification of an accuracy level and the rigor with which it is assessed are judgment calls. They must
take into account:
how the information is being used,
the consequences of inaccuracies, and
whether the accuracy measurements are indeed valid.
Requiring different levels of accuracy for different features in the same map or in the same data base is more cost
effective than demanding that all features be represented at the same accuracy level.
For this reason, the expenditure on accuracy assessment and data quality reporting in general must be matched to the
consequences of errors.
The trade-offs in accuracy assessment costs, the mandate and budget of the producer of the data, and the willingness
of the user to pay for data will all influence the assessment methods chosen. A rigorous accuracy assessment may not be
justified for every data set in the GIS.
But an accuracy rating of some form and a description of the method used to generate that rating should always be
provided.