
A Quality-aware Approach towards the Integration of Feature-based Geospatial Data

Jung-Hong Hong Min-Lang Huang


Associate Professor, Ph.D. Candidate
Department of Geomatics, National Cheng Kung University
Tainan City 701, Taiwan

Abstract- Integrating geospatial data acquired from diverse resources has drawn tremendous attention in recent years. Although many barriers have been removed by recently developed OpenGIS technology, merely superimposing the acquired geospatial data in the map interface does not suffice for correct application use. We argue that every distributed geospatial feature must be equipped with quality information that uniquely characterizes its properties and enables future applications to automatically and unambiguously parse the necessary information to aid decision making. Such a quality-aware approach helps users build a knowledge-based working environment in which the quality differences of acquired data can be correctly identified and analyzed. Furthermore, the various GIS-based operations can incorporate built-in GIS processing knowledge based on the acquired quality information to ensure the precision and accuracy of the analyzed results. As future GIS-based applications will largely depend on data dynamically collected from other resources, we demonstrate the necessity and importance of data quality information to decision making and show how it can be incorporated into the design of future data sharing environments, e.g., an SDI.

I. INTRODUCTION

The integration of heterogeneous data from diverse resources has drawn tremendous attention from the information community. This is especially true for the GIS (Geographic Information System) domain, where a large number of GIS software packages were developed with their own distinct architectures, so that dynamically integrating data from independent resources is not an easy task. To improve the interoperability of heterogeneous geospatial data, the OGC (Open Geospatial Consortium) proposed several service standards, e.g., GML, WMS, WFS, WCS, and WPS [1], to provide an open and consensus framework for the distribution and processing of geospatial data. These standards opened a new dimension for sharing heterogeneous geospatial data between different organizations and for the development of Web-based GISs. For example, users can now easily overlay data maintained by different organizations, e.g., addresses, landmarks, traffic information, weather, DTM and satellite imagery, via WMS or WFS services to enrich their applications; some of these data can even be continuously updated as time goes by.

Although the progress of geospatial web services in the last two years has been remarkable, the majority of efforts nevertheless focus on making geospatial data available and accessible in open data formats, not on helping users interpret and integrate heterogeneous data originally produced according to the requirements of independent organizations. Even though data is now directly accessible and easily visualized, it is difficult enough for users to deal with data about whose quality they have little or even no understanding, let alone to ask Web-based application systems to determine its status without human knowledge. The risk of wrong decision making steadily increases when data from heterogeneous resources are integrated; sometimes the results involve many complicated issues and become totally unpredictable. Consequently, the fast improvement in the accessibility of geospatial resources ironically endangers the chance for GIS-based applications to help users make correct decisions.

From the perspective of integrating geospatial data, two typical modeling challenges are: (1) how to model phenomena in reality that continuously change over time and (2) how to model the diverse types of data quality according to the properties of the geospatial data. Users must have a thorough understanding of the spatio-temporal and quality status of the acquired data before a correct decision can be made. The authors [2] proposed a primitive set of frameworks for geospatial features to model the different spatio-temporal statuses of reality phenomena. Based on the well-known GML standard [3][4], this framework enables the necessary components of a geospatial feature, namely, identification, location, attribute and quality, to be encoded in a standardized way. When distributing their spatial data, all data providers need to do is select the most appropriate primitive spatiotemporal data type according to the phenomena being modeled. Since all geospatial data is encoded in GML, users can consistently and unambiguously parse the necessary information and easily distinguish the differences to aid their decision making. This type of feature is defined as a self-describing feature in this paper, meaning a feature that possesses sufficient information to describe itself. The standardized framework of self-describing features is certainly advantageous, but an in-depth knowledge of how data quality is modeled and analyzed is still mandatory for applications to make correct decisions. This paper focuses on how to extend the previous work to strengthen the recording and use of quality information.

The remainder of the paper is organized as follows: Section 2 briefly reviews major concepts related to this research. Section 3 analyzes the applicable data quality elements for self-describing features. Section 4 introduces the standardized self-describing frameworks, and Section 5 discusses the quality-aware GIS application. Finally, Section 6 concludes with our major findings and discusses some future work.

II. RELATED WORKS

Data quality refers to the degree of excellence exhibited by the data in relation to the portrayal of the actual phenomena [4]. Although the importance of data quality was widely recognized by the geospatial community, developments in the 1990s were mainly restricted to individual specifications specially designed by particular domains or organizations (CEN/TC-287, 1994/1995; FGDC, 2000). Although many geospatial data files were generated following rigorous technical specifications, e.g., topographic maps, their quality information is often missing due to the lack of a consensus and comprehensive mechanism for describing the quality of geospatial data. A solid theory and a cross-domain standardized framework for geospatial data quality have been extensively discussed in the literature [6][7]. Building such a framework, however, must consider all the different aspects of data quality, as well as provide a total solution for the evaluation procedure and result analysis. Meanwhile, to be acceptable to the variety of geospatial domains all over the world, such a framework is best proposed by a well-accepted authoritative organization. The standards and specifications from ISO/TC 211 were developed to address exactly this issue and appear to be an ideal choice. ISO 19113 [8] provides an overview of geospatial data quality by first subdividing its representation into qualitative and quantitative methods. Quantitative data quality is further subdivided into 5 major categories of quality elements and 15 subcategories of sub-elements. ISO 19114 [9] and ISO 19138 [10] were developed to provide a fundamental evaluation procedure and to suggest quality measures for geospatial data, respectively. This standardized data quality information was further included as metadata elements in ISO 19115 [11], which has been globally accepted as the foundation for the design of metadata profiles. The above discussion implies that quality information can now be distributed in a standardized way, as long as the corresponding metadata can be correctly established.

The availability of quality information can bring extremely positive impacts to the client's application environment. For example, [12][13] transformed data quality information into symbols to illustrate quality differences in the map interface. On the other hand, the concept of quality-aware GIS [13][14][15] intends to include the consideration of data quality in GIS-based functions. Zargar and Devillers (2009) [16] modified the MEASURE operation in ArcGIS to demonstrate that the inclusion of a quality report

(positional accuracy, completeness and logical consistency) can improve the quality of decision making. A similar approach can also be found in [17]. Hong and Liao (2009) [18] proposed the theory of valid extent to illustrate the data completeness status of multiple datasets in the map interface. As quality information becomes more readily available, its use will no doubt become even more versatile with the development of integrated applications.

III. FEATURE-BASED QUALITY INFORMATION

With the aid of SOA (Service-Oriented Architecture), the future application environment will largely rely on geospatial features dynamically collected from diverse resources. In addition to being open, standardized and interoperable, self-describing geospatial features possess sufficient data quality information to describe themselves. In principle, every geospatial feature must be distributed together with its quality information, such that whenever a feature is used, its quality information can be parsed accordingly. To fulfill this need, the following two factors must be considered:

A. Data Quality Scope and the Hierarchy of Geospatial Data

According to the specification of ISO 19113, all distributed quality information has its own specified scope, meaning the description is only valid for this specified scope of data. Four major types of scope, namely, dataset series, dataset, feature and attribute, are identified according to how geospatial data is organized in practice. A hierarchical relationship exists among these data scopes. Since a dataset is composed of a number of features, the data quality information can be recorded at the dataset scope if the target of the evaluation procedure is the whole dataset or all features within the dataset share the same quality information. This property simplifies the encoding of data quality information and avoids unnecessary duplication of quality information at the feature level. On the other hand, if every feature has its own data quality evaluation result (e.g., a different absolute positional accuracy), then the quality information should be recorded at the feature level. The future application environment will no doubt involve geospatial data referring to different scopes; one way to resolve this issue is to process all the quality information at the feature level, regardless of its original data scope. To facilitate such a feature-based application environment, a processing module that can transform quality information from the higher levels (dataset and dataset series) onto the feature level is necessary; a sketch of such a module follows Fig. 1.

Fig. 1. An example of the hierarchy of data quality information levels.
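What follows is a minimal sketch of such a flattening step, assuming simple in-memory containers (QualityReport, Feature and Dataset are names invented for this example) and a plain "feature-level report overrides inherited report" rule; it illustrates the idea only and is not the processing module of our prototype.

from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical, simplified containers for quality reports keyed by
# data quality element name (e.g. "absolutePositionalAccuracy").
@dataclass
class QualityReport:
    element: str          # ISO 19113 quality element
    value: str            # evaluation result, kept as text for simplicity
    scope: str            # "datasetSeries" | "dataset" | "feature"

@dataclass
class Feature:
    feature_id: str
    quality: Dict[str, QualityReport] = field(default_factory=dict)

@dataclass
class Dataset:
    features: List[Feature]
    quality: Dict[str, QualityReport] = field(default_factory=dict)
    series_quality: Dict[str, QualityReport] = field(default_factory=dict)

def flatten_to_feature_level(dataset: Dataset) -> None:
    """Copy dataset-series and dataset scoped quality onto every feature,
    unless the feature already carries its own report for that element."""
    inherited = {**dataset.series_quality, **dataset.quality}  # dataset overrides series
    for feat in dataset.features:
        for element, report in inherited.items():
            if element not in feat.quality:        # feature-level report wins
                feat.quality[element] = QualityReport(element, report.value, "feature")

After this step, a parser only needs to inspect the quality recorded on each feature, which mirrors the recommendation above that all scopes be resolved before a feature is used.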

B. Level of Data Quality Elements

The data quality information for a particular scope also depends on the type of data quality element chosen. Take absolute positional accuracy for example: if the accuracy of every individual feature is different, then it should be recorded at the feature level. However, if the positional accuracy of all features in a dataset is below the threshold value specified in the product specification, then we have an alternative: record the quality information at the dataset level by stating that the positional accuracy of the dataset complies with the product specification. Under such circumstances, users can no longer differentiate the positional accuracy of individual features. Many evaluation procedures test the whole dataset to determine its quality status. For example, we may check the polygon overlapping status by issuing a topological consistency test. Since all polygons in the dataset were examined (and corrected if something was wrong), the quality information can be recorded at the dataset level instead of the feature level. When one polygon is used in a particular application, we can easily obtain the topological consistency status of this polygon by parsing the quality information at the dataset level. This framework is thus quite flexible. If the quality information for the lower levels of data scope is the same or compliant with the product specification, then the quality information can be recorded at the immediately higher level of data scope (e.g., feature to dataset; dataset to dataset series). When used in a GIS application, the parser must therefore parse all scopes of quality information related to the selected features, i.e., feature, dataset and dataset series, to build a thorough understanding of the data quality status.

IV. STANDARDIZED FRAMEWORK FOR THE SELF-DESCRIBING FEATURE

Because the data is dynamically acquired when needed, how these data are integrated can only be assessed after data acquisition. The major concept we propose in this paper is that the distributed geospatial data itself must be open and self-describing, which not only ensures that all the information users need to know is encoded and distributed with the data, but also enables users to consistently parse the acquired data without any loss. Every self-describing feature is required to include a validTime attribute to denote its temporal status. Based on this specified valid time, the other essential components include its identification, temporal accuracy, spatial description, positional accuracy, attributes and attribute accuracy (Fig. 2). Although the temporal, positional and attribute accuracy components are not mandatory, they each indicate a certain aspect of data quality when present.

Fig. 2. The components of a self-describing feature.

The following first discusses the basic components of the framework design, followed by a discussion of how to use it in the modeling process and examples of encoding strategies.

A. Component analysis

Regardless of the phenomenon being modeled, a feature must include at least one attribute that can uniquely identify it. Although an ID is often used for this purpose, the ID value alone does not suffice to describe the feature unless its reference system is well defined and distributed with the ID value. The most obvious advantage of this design is that users can unambiguously determine whether two features actually refer to the same phenomenon by comparing their ID values and reference systems; a small sketch of this comparison follows Fig. 3. It can also serve as the basis for developing a feature-based search mechanism. An XML-based common identifier framework can be found in [19] (Fig. 3). This framework can be readily implemented in GML to enable the identification of distributed features.

Fig. 3. The components of Identification (Tai, 2007).
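As a small illustration of how the identifier pair can be used, the sketch below compares two features on (reference system, ID value). The FeatureIdentifier type, its field names and the sample ID value are assumptions made for this example; they are not part of the framework in [19].

from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureIdentifier:
    reference_system: str   # e.g. the house-numbering system the ID belongs to
    value: str              # the ID value assigned within that system

def same_phenomenon(a: FeatureIdentifier, b: FeatureIdentifier) -> bool:
    """Two features refer to the same real-world phenomenon only when both
    the identifier reference system and the identifier value match."""
    return a.reference_system == b.reference_system and a.value == b.value

# Example: two resident records distributed by different services
# (the ID value "700-0001" is purely illustrative)
r1 = FeatureIdentifier("T House Sys", "700-0001")
r2 = FeatureIdentifier("T House Sys", "700-0001")
assert same_phenomenon(r1, r2)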

The validTime attribute is designed to denote the temporal status of a feature. When more than one time slice is available for a feature, every slice requires its own temporal description and should be modeled accordingly. Temporal accuracy, on the other hand, is defined as the difference between the recorded time and the time the event actually happened. To be more specific, temporal coordinates are the temporal limits within which the entity is valid [20]. Every feature is also required to include a spatial description to represent its shape, size and location. Positional accuracy denotes how close this recorded location is to its actual location in reality. Although much geospatial data is the end product of surveying procedures and the evaluation of its spatial quality does not seem to be a problem, positional accuracy information is unfortunately often missing. In this paper, positional accuracy is recorded either at the feature level or at the dataset level; the latter case typically denotes that the quality evaluation procedure was performed on the dataset as a whole and every feature in the dataset fulfills the same criteria. To meet the variety of application needs, every type of feature must include a distinct combination of attributes. Depending on their characteristics, different quality measures from ISO 19138 can be used for the chosen attributes.

B. Primitive frameworks

From the perspective of geospatial data modeling, it is necessary to develop a concise framework to encode all the required spatiotemporal information. But it is clear that the often used snapshot modeling approach is not always the best modeling strategy.

For example, the observations at a monitoring station during a specified time period may include a number of observations acquired at different time instants. Every observation must be associated with an unambiguously specified time instant. On the other hand, the location of the monitoring station normally remains unchanged during this time period, so it is impractical to repeatedly record the same location with every observation. The best strategy is to coherently model the distinguishing characteristics of every chosen phenomenon in reality while avoiding unnecessary duplication. By analyzing the possible scenarios, the authors [2] proposed four predefined data types: (1) Single Time Version Feature (STVF), (2) Time-series Stationary Feature (TSF), (3) Simple Moving Feature (SMF), and (4) Complex Moving Feature (CMF), which allow schema designers to choose the one that best fits the spatiotemporal characteristics of the modeled phenomenon. Table I lists the data types chosen to model debris flow related phenomena (a sketch of the selection logic follows Table I). For example, the rain gauge station data includes continuously acquired observations at fixed locations, so it is better modeled as a time-series stationary feature.
TABLE I
COMMON DATA RESOURCES OF DEBRIS FLOW DISASTER

Feature type                 Identification   Spatial type   Basic feature type
debris flow influence area   ID               polyline       Single time version feature
Resident                     address          point          Single time version feature
Rain Gauge station           ID               point          Time-series stationary feature
Typhoon                      Number           polyline       Complex moving feature
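The sketch below illustrates one possible way to encode the selection logic behind Table I: given whether a phenomenon moves and whether it carries multiple time slices, pick one of the four primitive data types. The rule of thumb used here (in particular the shape-changes test separating SMF from CMF) is our own reading of [2] and is an assumption for illustration, not a normative mapping.

from enum import Enum

class PrimitiveType(Enum):
    STVF = "Single Time Version Feature"
    TSF  = "Time-series Stationary Feature"
    SMF  = "Simple Moving Feature"
    CMF  = "Complex Moving Feature"

def choose_primitive_type(location_changes: bool,
                          multiple_time_slices: bool,
                          shape_changes: bool = False) -> PrimitiveType:
    """Heuristic selection of the primitive spatiotemporal data type."""
    if not location_changes and not multiple_time_slices:
        return PrimitiveType.STVF      # e.g. a resident address
    if not location_changes:
        return PrimitiveType.TSF       # e.g. rain gauge observations at a fixed site
    if not shape_changes:
        return PrimitiveType.SMF       # position changes, geometry stays the same
    return PrimitiveType.CMF           # e.g. a typhoon track with an evolving extent

# Examples mirroring Table I
assert choose_primitive_type(False, False) is PrimitiveType.STVF                      # Resident
assert choose_primitive_type(False, True) is PrimitiveType.TSF                        # Rain gauge station
assert choose_primitive_type(True, True, shape_changes=True) is PrimitiveType.CMF     # Typhoon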

C. Encoding strategies

Openly encoded distributed data allows application programs to transparently parse the necessary information (a parsing sketch follows Fig. 4). Fig. 4 shows a GML encoding example of the Resident data. In Taiwan, resident data is administered by the local governments and provided to users upon request. Because the location and attributes remain unchanged, it is modeled with the Single Time Version Feature data type. Because the four primitive data types are all developed upon the basic architecture in Fig. 2, the addition of quality information is rather straightforward. In this example, the absolute positional accuracy is explicitly stored at the feature level and the data completeness information is stored at the dataset level. Whenever a resident feature is used, application programs can correctly identify the differences in absolute positional accuracy on the basis of individual features, even if only a portion of the resident data is acquired.
<igis:STVFCollection>
  <gml:metaDataProperty>
    <gmd:DQ_DataQuality> ...
      <!-- Dataset-level quality information -->
      <gmd:report>
        <gmd:DQ_CompletenessOmission> ...
          <gmd:value><gco:Record>0%</gco:Record></gmd:value> ...
        </gmd:DQ_CompletenessOmission>
      </gmd:report>
    </gmd:DQ_DataQuality>
  </gml:metaDataProperty>
  <gml:featureMember>
    <igis:Resident>
      <!-- Identification of the feature -->
      <igis:Identification>
        <igis:value></igis:value>
        <igis:ReferenceSystem>T House Sys</igis:ReferenceSystem>
      </igis:Identification>
      <!-- Valid time of the resident -->
      <gml:validTime>
        <gml:TimeInstant>
          <gml:beginPosition>2008-09-28T16:15</gml:beginPosition>
          <gml:endPosition>2008-09-28T16:15</gml:endPosition>
        </gml:TimeInstant>
      </gml:validTime>
      <igis:Spatial>...</igis:Spatial>
      <!-- Feature-level quality information -->
      <gmd:DQ_AbsoluteExternalPositionalAccuracy>
        <gmd:value><gco:Record>5</gco:Record></gmd:value>
      </gmd:DQ_AbsoluteExternalPositionalAccuracy>
    </igis:Resident>
  </gml:featureMember>
  <gml:featureMember>
    <igis:Resident>
      <igis:Identification>
        <igis:value></igis:value>
        <igis:ReferenceSystem>T House Sys</igis:ReferenceSystem>
      </igis:Identification>
      <gml:validTime>
        <gml:TimeInstant>
          <gml:beginPosition>2008-09-28T16:15</gml:beginPosition>
          <gml:endPosition>2008-09-28T16:15</gml:endPosition>
        </gml:TimeInstant>
      </gml:validTime>
      <igis:Spatial>...</igis:Spatial>
      <gmd:DQ_AbsoluteExternalPositionalAccuracy>
        <gmd:value><gco:Record>10</gco:Record></gmd:value>
      </gmd:DQ_AbsoluteExternalPositionalAccuracy>
    </igis:Resident>
  </gml:featureMember>
</igis:STVFCollection>

Fig. 4. The example of the Resident encoding result.
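As a hedged sketch of what "transparently parse" can mean in practice, the snippet below reads a document shaped like Fig. 4 with Python's standard xml.etree module and collects the dataset-level omission rate together with each feature's absolute positional accuracy. The namespace URIs and the igis prefix are placeholders invented for this example; a real document would declare the official GML and ISO 19139 namespaces.

import xml.etree.ElementTree as ET

# Placeholder namespace URIs; substitute the ones declared in the real document.
NS = {
    "gml": "http://www.opengis.net/gml",
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
    "igis": "http://example.org/igis",
}

def read_quality(gml_path: str):
    tree = ET.parse(gml_path)
    root = tree.getroot()

    # Dataset-level completeness (omission) report, if present.
    omission = root.find(
        ".//gmd:DQ_CompletenessOmission/gmd:value/gco:Record", NS)
    dataset_quality = {
        "completenessOmission": omission.text if omission is not None else None
    }

    # Feature-level absolute positional accuracy for every Resident feature.
    features = []
    for resident in root.findall(".//igis:Resident", NS):
        acc = resident.find(
            ".//gmd:DQ_AbsoluteExternalPositionalAccuracy/gmd:value/gco:Record", NS)
        features.append({
            "referenceSystem": resident.findtext(
                ".//igis:ReferenceSystem", default="", namespaces=NS),
            "positionalAccuracy": acc.text if acc is not None else None,
        })
    return dataset_quality, features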

V. QUALITY-AWARE GIS APPLICATION

The fundamental ideas of the quality-aware approach are that (1) all acquired features are self-describing and (2) the client application has built-in knowledge to intelligently introduce the consideration of data quality into the decision making process. Fig. 5 shows a simplified view of the major process. During the data modeling process, the data providers must select the most appropriate data type to model every selected theme of data. The data is distributed in GML on the basis of individual features. The quality-aware GIS application includes a number of functions that can intelligently complete their tasks by additionally analyzing the parsed quality information. The logic flow of every function must be reexamined to ensure the final decision takes the quality differences of the various types of data into consideration.

Fig. 5. A simplified view of a quality-aware GIS application incorporating knowledge of the standardized self-describing framework.

The following discussion focuses on the two major operations in the quality-aware approach: information parsing and intelligent functions. The decision on resident evacuation caused by debris flow is chosen as the test case. Making such a decision requires the acquisition of continuously updated rainfall data. Once the rainfall reaches the preselected threshold value, the debris flow disaster authority must issue a red warning message to the local governments, and all residents living in the debris flow influence areas are required to evacuate. Although the spatial extents of the debris flow influence areas are analyzed beforehand, who actually lives in the influence areas is dynamically determined from the resident data acquired from local governments. A simple select-by-location function can easily do the trick by selecting the residents (point-based data) whose locations are within the debris flow influence areas (polygon-based data). While this is a well-known procedure to GIS users, we argue that the considerations become rather different after taking data quality information into account.

A system prototype for debris flow evacuation decision making was developed using Visual Basic for Applications in ESRI ArcGIS 9.3. We modified the original code of the select-by-location function to ensure that a correct decision can be made. The original select-by-location function is specifically designed for selecting features from data already stored in the database, but whether all residents within the influence area are found depends not only on a correct geometric algorithm (e.g., point-in-polygon), but also on whether the acquired data completely models the phenomena in reality. Most users intuitively assume that all resident data is available, such that all the select-by-location function needs to do is find the residents within the influence area. This is unfortunately not always true. Take Fig. 6 for example: the influence area overlaps with only one part of each of townships A, B and C. To ensure that all residents within the influence area are found, two requirements must be met: (1) the data for all three townships is available and (2) the datasets for all three townships fulfill the completeness requirement (no omission or commission errors). These two requirements imply that we must make sure the data being selected is exactly the same as the phenomena in reality; otherwise the returned result only reflects the status of the database, not reality. Therefore, from a data quality perspective, the data completeness requirement should be a prerequisite included in the design of the select-by-location function. Since the original select-by-location function only considers the geometric relationships between the two types of data, the result may not always be correct. For example, the resident data of township B is missing in Fig. 6, so the residents selected for evacuation only include residents from townships A and C. A minimal sketch of such a completeness-checked selection is given after Fig. 6.

Fig. 6. An example of the completeness of resident datasets.
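A minimal sketch of the idea, using shapely for the point-in-polygon test. The ResidentDataset structure, the availability and completeness checks, and the returned warnings are assumptions made for this illustration; this is not the VBA code of the ArcGIS prototype.

from dataclasses import dataclass
from typing import List
from shapely.geometry import Point, Polygon

@dataclass
class ResidentDataset:
    township: str
    points: List[Point]
    omission_rate: float    # from the dataset-level DQ_CompletenessOmission report
    commission_rate: float  # from the dataset-level DQ_CompletenessCommission report

def quality_aware_select(influence_area: Polygon,
                         overlapping_townships: List[str],
                         datasets: List[ResidentDataset]):
    """Select residents inside the influence area, but only report the result as
    trustworthy when every overlapping township is covered by a complete dataset."""
    available = {d.township: d for d in datasets}
    warnings = []
    for name in overlapping_townships:
        d = available.get(name)
        if d is None:
            warnings.append(f"resident data for township {name} is missing")
        elif d.omission_rate > 0 or d.commission_rate > 0:
            warnings.append(f"dataset for township {name} fails the completeness requirement")

    selected = [p for d in datasets for p in d.points if influence_area.contains(p)]
    return selected, warnings   # non-empty warnings => a decision based on 'selected' is risky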

But users may never notice this if they only concentrate on the data available in the database. Data quality, which denotes how close the recorded data is to the actual status of reality, thus makes the whole procedure quite different from the original function. Fig. 7 shows a real case in Pingtung County, Taiwan. The influence area (the debris flow with identifier DF017) overlaps with two different townships: Majia and Neipu. If only the resident data of Neipu is available, the traditional select-by-location operation returns only those residents living in Neipu (Fig. 7a). Without other supporting information (e.g., satellite images), these features appear to be quite reasonable. After introducing the concept of surveyed area [18] and further requiring that the union of the surveyed areas of the resident datasets completely contain the influence area, the system intelligently responds that only a portion of the area has been inspected and that a decision based on the queried results is risky (Fig. 7b). A sketch of this coverage test follows Fig. 7.

Fig. 7. The real case of residents in the influence area. (a) The returned resident data only reflects the status of the database. (b) The returned residents reflect the status of reality after considering quality information.
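The surveyed-area requirement from [18] can be expressed as a simple geometric test: the union of the surveyed extents of all acquired resident datasets must cover the influence area. The sketch below, again using shapely, is an illustrative assumption of how that prerequisite could be checked before the selection result is trusted.

from typing import List
from shapely.geometry import Polygon
from shapely.ops import unary_union

def influence_area_fully_surveyed(influence_area: Polygon,
                                  surveyed_areas: List[Polygon]) -> bool:
    """True only when the union of the surveyed areas covers the whole influence area."""
    if not surveyed_areas:
        return False
    return unary_union(surveyed_areas).covers(influence_area)

# Usage: if this returns False, the select-by-location result only reflects the
# inspected portion (e.g. Neipu) and the evacuation decision should be flagged as risky.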

Since the geospatial data is encoded in GML, the application programs can be developed according to the standardized feature framework, such that the quality information of all the distributed features can be correctly parsed and illustrated. For example, the dialog window in Fig. 8 shows that the commission error rate and omission error rate of the resident dataset of Majia Township are both 0%. This helps users visually inspect the status of data quality in the map interface and avoid wrong decisions.

Fig. 8. An example of the quality report of a dataset.

Because the data being superimposed may be acquired from different resources, it is highly possible that different datasets refer to different valid times. The traditional map overlay operation only requires all datasets to be referenced to the same coordinate reference system. The newly modified map overlay function can additionally parse the valid time information and illustrate temporal differences with different map symbols to attract users' attention. The idea is that if the features illustrated in the map interface do not exist simultaneously, they should be represented differently. Fig. 9 illustrates the temporal information of each selected resident point, displayed with a red symbol if the resident does not exist during the valid time of the influence area; a sketch of the underlying temporal test follows Fig. 9. The absolute positional accuracy of individual features can be illustrated in a similar way to highlight such differences. By introducing the quality-aware concept into the system design, the proposed approach can bring a revolutionary change to the GIS-based functions we have been familiar with for years; without data quality information, however, there is almost no chance of success.

Fig. 9. The demonstration of the temporal information of selected residents.
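Below is a minimal sketch of the temporal test behind this symbolization, assuming each feature's validTime has been parsed into a begin/end pair; the interval-overlap rule is the standard one and the Interval type, along with the sample influence-area interval, is an assumption for this example.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Interval:
    begin: datetime
    end: datetime

def coexists(a: Interval, b: Interval) -> bool:
    """True when the two valid-time intervals overlap (the features existed simultaneously)."""
    return a.begin <= b.end and b.begin <= a.end

# A resident that does not coexist with the influence area's valid time
# would be drawn with the red "temporal mismatch" symbol.
influence = Interval(datetime(2008, 9, 28, 0, 0), datetime(2008, 9, 30, 0, 0))  # assumed interval
resident = Interval(datetime(2008, 9, 28, 16, 15), datetime(2008, 9, 28, 16, 15))
print("same-time features" if coexists(influence, resident) else "flag with red symbol")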

VI. CONCLUSION AND FUTURE OUTLOOK

Especially for a future geospatial application environment where the sharing of dynamically acquired data will be very common, the availability of data quality information should be regarded as a mandatory requirement. By proposing a quality-aware approach based on an openly encoded geospatial feature framework, we demonstrated the feasibility for clients to automatically acquire and consistently parse standardized quality information to aid their decision making. Compared with the traditional approach, where users are often forced to make wild guesses about the quality of the acquired data, the proposed approach not only establishes the necessary links between data providers and users, but also makes the integration of heterogeneous geospatial data much easier than before. Successfully facilitating this quality-aware concept, however, still requires every domain to examine how the quality of their data can be determined, and application developers to examine how to improve their current GIS-based functions by taking data processing knowledge into consideration. In the future SDI environment, these two issues should not be considered separately; a consensus agreement on the technology being used, e.g., OpenGIS, will be necessary.

REFERENCES

[1] http://www.opengeospatial.org/standards
[2] Jung-Hong Hong, Min-Lang Huang, Chiao-Ling Kuo, Yu-Wen Tseng, "A Spatio-Temporal Perspective towards the Intelligent and Interoperable Data Use in SDI Environment," AGIS 2010 International Conference, 2010.
[3] ISO/TC 211, "ISO 19107: Geographic information - Spatial schema," 2003.
[4] ISO/TC 211, "ISO 19136: Geographic information - Geography Markup Language," 2007.
[5] http://en.wikipedia.org/wiki/Data_quality
[6] Guptill, S.C. and Morrison, J.L., Elements of Spatial Data Quality, The International Cartographic Association, Elsevier Science, Oxford and Tarrytown, NY, 202 pages, 1995.
[7] Shi, W.Z., Fisher, P.F. and Goodchild, M.F., Spatial Data Quality, Taylor & Francis Group / CRC Press, London and New York, 313 pages, 2002.
[8] ISO/TC 211, "ISO 19113: Geographic information - Quality principles," 2002.
[9] ISO/TC 211, "ISO 19114: Geographic information - Quality evaluation procedures," 2003.
[10] ISO/TC 211, "ISO 19138: Geographic information - Data quality measures," 2006.
[11] ISO/TC 211, "ISO 19115: Geographic information - Metadata," 2003.
[12] Devillers, R., Bédard, Y., Gervais, M., Jeansoulin, R., Pinet, F., Schneider, M., Zargar, A., "How to improve geospatial data usability: From metadata to Quality-Aware GIS Community," Spatial Data Usability Workshop, AGILE 2007 Conference, Aalborg, Denmark, 2007. Retrieved from http://sirs.scg.ulaval.ca/YvanBedard/article nonprotege/459 soumis.pdf
[13] Yang, T., Visualisation of Spatial Data Quality for Distributed GIS, The University of New South Wales, Australia, 2007.
[14] Devillers, R., Bédard, Y., and Jeansoulin, R., "Multidimensional management of geospatial data quality information for its dynamic use within GIS," Photogrammetric Engineering and Remote Sensing, 71(2), pp. 205-215, 2005.
[15] Devillers, R. and Zargar, A., "Towards quality-aware GIS: Operation-based retrieval of spatial data quality information," Spatial Knowledge and Information (SKI) Conference, Fernie, BC, Canada, February 2009.
[16] Zargar, A. and Devillers, R., "An Operation-Based Communication of Spatial Data Quality," 2009 International Conference on Advanced Geographic Information Systems & Web Services (GEOWS), pp. 140-145, 2009.
[17] Yueh-chuan Teng, Map Interface Content Interoperability in Geospatial SOA Environment with Open Geographic Data, Master's Thesis, Department of Geomatics, National Cheng Kung University, Taiwan, 2007.
[18] Hong, J.H. and Liao, H.P., "Maps or Not: A New Insight of the Map Interface to the Open and Distributed Geospatial Web Service Environment," ASPRS 2009 Annual Conference, 2009.
[19] Ying-chiu Tai, The Design and Application Framework of Geographic Feature Identifier, Master's Thesis, Department of Geomatics, National Cheng Kung University, 2007.
[20] Langran, G., Time in Geographic Information Systems, London: Taylor & Francis, 1992.
