Searchable Long-term Preservation of Scientific Data through Semantic Web Representations

Silvia Stefanova¹, Tore Risch¹

¹ Uppsala University, Department of Information Technology, Uppsala,
{Silvia.Stefanova, Tore.Risch}@it.uu.se

Abstract. Scientific data has traditionally been preserved in regular files or
relational databases. The advantages of relational databases are indisputable
since they provide scalable, searchable, and reusable storage for scientific
results. However, for long-term preservation and reuse of scientific results
relational database technology alone is not sufficient. Semantic Web technology
provides standards for meta-data descriptions of any kind of data, not just
regular databases, and seems very promising for long-term preservation of
scientific databases. This paper presents a PhD project to investigate how
Semantic Web representations can be utilized for long-term preservation,
documentation, evolution, and efficient search of different kinds of scientific
data. We describe the background and existing work in the area and then we
formulate the goals and the research questions. We present the current status
and the results of the project so far by describing the architecture and
functionality of a prototype system, followed by ongoing work.

Keywords: Long-term preservation, scientific database, Semantic Web
representations, query processing, preserved data, data provenance

1 Introduction

Scientific data has so far mostly been preserved in files or relational databases. One
advantage of using relational databases is that they require standardized metadata
definitions in the form of database schemas. File formats do not require metadata, so
it often becomes ad hoc. Once scientific data is structured in terms of metadata it can be
processed by a database management system (DBMS), i.e. data can be searched,
updated, deleted, inserted, unloaded, or loaded. Contemporary DBMSs provide
features for scalable storage and querying of large databases, which is crucial for the
data volumes produced by scientific experiments.
DBMSs evolve continuously, providing additional and improved features for
data manipulation, which unfortunately means that new DBMS versions are released
quite often. This complicates the maintenance of both live database
servers and archived databases, since some functionality of earlier DBMS versions
may not be compatible with newer ones. Moreover, changing the
DBMS is problematic since it is often not possible to take a database directly from
one vendor's DBMS and load it into another vendor's DBMS.
Because there is a substantial cost in keeping all databases on-line and up-to-date,
one would like to keep only actively used databases on-line and preserve the other
databases off-line in some standard format. Commercial DBMSs provide tools to
unload a live database and store it as a disk file for archival. The archived database
can later be loaded and used again, but only by the same vendor's
DBMS. If scientific data has been preserved by a DBMS's archiving tool,
there is no guarantee that, say, 50 years later the DBMS vendor or the
version will still exist so that the data can be used again. This is our motivation to
investigate technology and tools for long-term preservation of scientific data that are
independent of the current DBMS technology.
The basic requirements for new technology and tools for long-term preservation
of scientific data can be briefly summarized as follows: (i) the tools should make use
of evolving state-of-the-art database technology for maintaining and searching
preserved data, (ii) it should be possible to preserve different kinds of scientific data,
not only data stored in relational databases, (iii) the preserved data should include
information sufficient to retrieve, reproduce, and disseminate the experiments, (iv) it
should be easy to query both preserved data and live databases, (v) long-term
database preservation has to be guaranteed.
The Semantic Web is a universal medium for the exchange of data, information, and
knowledge [2]. It uses common formats for integration and combination of
data from different sources, not only relational databases [1]. By means of languages
such as RDF [3], RDF-Schema [4], and OWL [5], any web resource can be annotated with
metadata properties describing its structure and contents; such descriptions are called ontologies.
Ontologies facilitate guided search of web resources using problem-specific
terminologies and structures. The Semantic Web therefore seems a promising technology for
long-term preservation of scientific databases.

2 Existing Work on Long-term Preservation of Databases

Database preservation has attracted substantial interest recently. A
number of initiatives have been established, e.g. the CEDARS project [16], the
National Library of Australia's PANDORA project [17], and the PREMIS Working
Group [18]. All of them have been investigating strategies for the preservation of
digital content.
Long-term preservation of relational databases through XML modeling has been
implemented in [20]. The idea is to preserve both the data and metadata of a relational
database as XML documents. Only unload, load, and limited queries were supported;
database updates were not. Furthermore, preservation of other kinds of databases than
relational databases was not supported. Work on archival and compression of
scientific data as XML has been done in [19], where a system is proposed for
maintaining, populating, and querying archives of hierarchical data. Instead of using a
standard query language, a new declarative query language for archived data has been
designed and implemented.
The area of managing curated scientific databases, where there is additional
information about how the data values have been obtained or computed, has been
studied in particular by Buneman [15], [21]. He identified this as one of the major
issues in scientific databases.
The RDF-based Dublin Core standard is well suited for documenting scientific
publications. NASA [22] uses Semantic Web representation to publish scientific
meta-data. These and other efforts for building ontologies of published results should
be studied in the PhD project.
There has been a lot of research on querying integrated heterogeneous data
sources, e.g. [23], [24], [25]. None of those projects deals with how to query Semantic
Web views of integrated databases, how to use Semantic Web representations for
long-term database preservation, or how to efficiently search preserved scientific data in
terms of such representations.

3 Project's Goal and Research Questions

The goal of the PhD project is to investigate how Semantic Web representations can
be utilized for long-term preservation, documentation, evolution, and efficient search
of different kinds of scientific data. In particular, the project investigates how to build
tools for efficient search of preserved scientific data that is stored using
database-neutral formats based on Semantic Web representations. This would allow exchange
of data between different DBMSs and file formats.
Archived databases should be searchable using some standard query language, e.g.
SPARQL [11], which is the recognized Semantic Web query language. It must be
possible to find data transparently, independent of whether it is stored in a live
relational database or archived on some disk or other media. The system should know
where all necessary data is located and perform the necessary actions to answer a
user's query.
Data provenance information, documenting how data was obtained and relating
different versions of curated data, has to be included [15]. The system should also check
that new data added to pre-existing result databases fulfills the constraints
imposed by the ontology used for the old results.
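The constraint check described above could look roughly like the following sketch. It assumes that the ontology is reduced to a table of properties, each with a required subject class and value type, in the spirit of RDFS domain and range declarations. The names `ONTOLOGY`, `violations`, and all `ex:` identifiers are hypothetical and are not part of the project.

```python
# Minimal sketch: validate new triples against RDFS-like domain/range
# declarations. All property and class names are hypothetical.

# property -> (required class of the subject, required Python type of the value)
ONTOLOGY = {
    "ex:temperature": ("ex:Measurement", float),
    "ex:instrument": ("ex:Measurement", str),
}


def violations(new_triples, subject_classes):
    """Return a list of human-readable problems found in new_triples.

    new_triples     -- iterable of (subject, property, value)
    subject_classes -- dict mapping each subject to its declared class
    """
    problems = []
    for subject, prop, value in new_triples:
        if prop not in ONTOLOGY:
            problems.append(f"{subject}: unknown property {prop}")
            continue
        domain, value_type = ONTOLOGY[prop]
        if subject_classes.get(subject) != domain:
            problems.append(f"{subject}: {prop} requires a subject of class {domain}")
        if not isinstance(value, value_type):
            problems.append(f"{subject}: {prop} requires a {value_type.__name__} value")
    return problems


# Example: one conforming triple and two that break the old ontology.
classes = {"ex:m1": "ex:Measurement", "ex:m2": "ex:Sample"}
new_data = [
    ("ex:m1", "ex:temperature", 21.5),      # conforms
    ("ex:m1", "ex:temperature", "warm"),    # wrong value type
    ("ex:m2", "ex:instrument", "probe-7"),  # wrong subject class
]
for problem in violations(new_data, classes):
    print(problem)
```

A real check would read the declarations from the preserved RDFS metadata instead of a hard-coded dictionary.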

4 Overview of the Prototype System

During the initial stages of the PhD project, investigations were made on how to use
Semantic Web representations for long-term preservation of data stored in relational
databases, in terms of:
• What information to represent
• What kind of Semantic Web representations to use
• DBMS independence and exchangeability
• Efficient search of live and archived data.
Based on these investigations a first version of the prototype system SSARD
(Semantic Web Search and Archival of Relational Databases) has been developed.
The features of SSARD implemented so far are:
1. A scientific database stored in any commercial relational DBMS can be
preserved by unloading it in terms of RDF-Schema (RDFS) [4]. Both data
and metadata are preserved.
2. A preserved database can be made live again by loading it into any commercial
relational DBMS, independently of the original DBMS.
3. The system maintains a catalogue of the preserved databases.
4. The preserved database, both data and metadata, can be queried in terms of
RDFS using the RDF query language SPARQL [11].
5. The catalogue of the preserved databases can also be queried with SPARQL.
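To make feature 1 more concrete, the following sketch unloads a small relational database into two sets of triples, one for the metadata and one for the data. It is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for the commercial DBMS and its JDBC wrapper, the URI layout under `http://example.org/db/` is invented, and the mapping (table to class, column to property, row to resource) is a common convention rather than the mapping SSARD necessarily uses.

```python
import sqlite3

RDF_TYPE = "rdf:type"


def unload(conn, base="http://example.org/db/"):
    """Return (metadata_triples, data_triples) for every table in conn.

    The URI layout under `base` is an assumption made for this sketch.
    """
    metadata, data = [], []
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        class_uri = f"{base}{table}"
        metadata.append((class_uri, RDF_TYPE, "rdfs:Class"))
        # PRAGMA table_info yields (cid, name, type, notnull, default, pk).
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            prop_uri = f"{class_uri}#{column}"
            metadata.append((prop_uri, RDF_TYPE, "rdf:Property"))
            metadata.append((prop_uri, "rdfs:domain", class_uri))
        for rowid, *values in conn.execute(f'SELECT rowid, * FROM "{table}"'):
            row_uri = f"{class_uri}/{rowid}"
            data.append((row_uri, RDF_TYPE, class_uri))
            for column, value in zip(columns, values):
                if value is not None:  # SQL NULL: no triple is emitted
                    data.append((row_uri, f"{class_uri}#{column}", value))
    return metadata, data


# Example source database with one table and two rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT, temp REAL)")
conn.execute("INSERT INTO experiment VALUES (1, 'run-a', 293.15)")
conn.execute("INSERT INTO experiment VALUES (2, 'run-b', NULL)")
metadata, data = unload(conn)
```

Keeping metadata and data separate mirrors the two RDFS views generated by the system, so the schema can be inspected without touching the rows.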
The architecture of the current SSARD system is presented in Fig. 1. The source
database is the database to be preserved (e.g. one stored in MySQL). Both the
data and the metadata of the source database are wrapped by a JDBC wrapper [9].
The RDB viewer (Relational Database Viewer) generates two views in terms of RDFS,
one over the data and one over the metadata of the relational database. The archiver
transforms the data in the RDFS views to RDF/XML [10] and stores it on disk as
two RDF/XML files in the repository of archived databases. During the
preservation process the SSARD catalogue is updated. It keeps information about the
subject, the location, the size, the creation date, etc. of all the preserved databases. To
make this information easy to find, the catalogue is represented as a
Topic Map [6] stored on disk in the XTM [8] file format. Topic Maps is another
Semantic Web standard created for managing indices to documents [6]. The Topic
Map data model provides built-in concepts useful to describe indices of documents or
websites, i.e. it can be seen as an ontology describing how to navigate into document
collections [6].

[Figure: block diagram of SSARD. Components: Source Database (MySQL), JDBC Wrapper,
RDB Viewer, Archiver, RDF/XML files (repository of archived databases), TM Builder,
SSARD Catalogue (Topic Map, XTM file), TM Viewer, SPARQL Parser, Reloader,
Destination Database (PostgreSQL). Legend: data preservation; reload of preserved
data; query of preserved data; query of the catalogue.]

Fig. 1. SSARD Architecture

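The archiving step, in which triples from an RDFS view are written out as RDF/XML, can be sketched with the Python standard library as follows. This is an assumption-laden illustration: the namespace `http://example.org/db/experiment#`, the helper names `to_rdfxml` and `from_rdfxml`, and the flat `rdf:Description` layout are chosen for the example and do not claim to match the files SSARD produces.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/db/experiment#"  # hypothetical vocabulary namespace

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ex", EX_NS)


def to_rdfxml(rows):
    """Serialize {subject_uri: {local_property_name: literal}} as RDF/XML text."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    for subject, properties in rows.items():
        description = ET.SubElement(
            root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": subject})
        for name, value in properties.items():
            ET.SubElement(description, f"{{{EX_NS}}}{name}").text = str(value)
    return ET.tostring(root, encoding="unicode")


def from_rdfxml(text):
    """Parse the output of to_rdfxml back into the same dictionary shape."""
    rows = {}
    for description in ET.fromstring(text):
        subject = description.get(f"{{{RDF_NS}}}about")
        rows[subject] = {
            child.tag.split("}", 1)[1]: child.text for child in description}
    return rows


# Example: archive one row and read it back.
rows = {"http://example.org/db/experiment/1": {"name": "run-a", "temp": "293.15"}}
archived = to_rdfxml(rows)
restored = from_rdfxml(archived)
```

The sketch stores every value as a plain literal, so type information is lost on the way back; a fuller archiver would record datatypes, for example through `rdf:datatype` attributes or the preserved metadata file.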
The TM Builder (Topic Map builder) collects the corresponding information from
the database under archival, represents it in terms of the Topic Maps conceptual
schema [14], transforms it to XTM and finally adds it to the XTM file of the
catalogue. Later, the reloader can restore any preserved
database from the repository. Thus the database can be made live again by loading it
as a destination database in any commercial relational DBMS, e.g. PostgreSQL.
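The reload step can be pictured as the inverse of unloading: the preserved metadata says which table and columns to create, and the preserved data triples are regrouped into rows. The sketch below assumes that the metadata has already been reduced to a list of column names and SQL types and that data triples carry a row identifier; `sqlite3` stands in for the destination DBMS, and the function name `reload` is hypothetical.

```python
import sqlite3


def reload(conn, table, columns, data_triples):
    """Recreate `table` in conn from preserved column metadata and data triples.

    columns      -- list of (column_name, sql_type) from the preserved metadata
    data_triples -- iterable of (row_id, column_name, value)
    """
    column_defs = ", ".join(f'"{name}" {sql_type}' for name, sql_type in columns)
    conn.execute(f'CREATE TABLE "{table}" ({column_defs})')

    # Group the triples by row so that each row becomes one INSERT.
    rows = {}
    for row_id, column, value in data_triples:
        rows.setdefault(row_id, {})[column] = value

    names = [name for name, _ in columns]
    placeholders = ", ".join("?" for _ in names)
    for row_id in sorted(rows):
        # Columns without a triple were NULL in the source database.
        values = [rows[row_id].get(name) for name in names]
        conn.execute(f'INSERT INTO "{table}" VALUES ({placeholders})', values)
    conn.commit()


# Example: restore a two-row table into an empty destination database.
conn = sqlite3.connect(":memory:")  # stands in for the destination DBMS
reload(
    conn,
    "experiment",
    [("id", "INTEGER"), ("name", "TEXT"), ("temp", "REAL")],
    [(1, "id", 1), (1, "name", "run-a"), (1, "temp", 293.15),
     (2, "id", 2), (2, "name", "run-b")],
)
restored_rows = conn.execute("SELECT * FROM experiment ORDER BY id").fetchall()
```

Because only standard SQL statements are issued, the same approach carries over to other relational systems, apart from differences in type names.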
Any preserved database can be queried with the RDF query language
SPARQL [11] using the SPARQL Parser [12]. First the preserved database has to be
reloaded into a commercial relational DBMS by the reloader. Then an RDFS view over
the destination relational database is generated by the RDB viewer. SPARQL queries
can be used to query this view. For scalable processing of the SPARQL queries to
RDFS views over relational databases, rules for the view definitions used in SWARD
[13] have been implemented in the RDB viewer. SWARD [13] is a system for
scalable processing of SPARQL queries over RDFS views of relational databases
using a general partial evaluation algorithm.
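Why querying an RDFS view can scale is easiest to see in a toy translation from triple patterns to SQL: instead of materializing all triples, the query is pushed down to the DBMS. The sketch below is not SWARD's partial evaluation algorithm. It assumes the much simpler case where all triple patterns share one subject and each property corresponds to a column of a single table; the function name `bgp_to_sql` and the pattern encoding are hypothetical.

```python
import sqlite3


def bgp_to_sql(table, patterns):
    """Translate triple patterns that share one subject into a SQL query.

    patterns -- list of (column, obj); obj is either a variable such as
                "?name" (projected) or a constant (turned into a filter).
    Returns (sql_text, parameters).
    """
    select, where, params = [], [], []
    for column, obj in patterns:
        if isinstance(obj, str) and obj.startswith("?"):
            select.append(f'"{column}" AS "{obj[1:]}"')
        else:
            where.append(f'"{column}" = ?')
            params.append(obj)
    sql = f'SELECT {", ".join(select) or "1"} FROM "{table}"'
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


# Roughly: SELECT ?name WHERE { ?e ex:name ?name . ?e ex:temp 293.15 }
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT, temp REAL)")
conn.executemany("INSERT INTO experiment VALUES (?, ?, ?)",
                 [(1, "run-a", 293.15), (2, "run-b", 301.0)])
sql, params = bgp_to_sql("experiment", [("name", "?name"), ("temp", 293.15)])
result = conn.execute(sql, params).fetchall()
```

Constants become parameterized filters, so the DBMS can use its own indexes and optimizer to evaluate them.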
The SSARD catalogue can also be queried with SPARQL. The TM viewer (Topic
Map Viewer) first imports the XTM file of the SSARD catalogue in terms of a Topic
Map conceptual schema [14] and generates an RDFS view over it. After that the
RDFS view can be queried with SPARQL. The TM viewer utilizes SWATM [14], a
system for scalable viewing and querying of Topic Maps in terms of RDFS.
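A catalogue entry and its triple view can be sketched as below. The element names loosely follow XTM 1.0 (`topic`, `baseName`, `occurrence`, `resourceData`), but the output is not validated against the XTM DTD, and the topic identifier, file path, and date in the example are invented. The plain list comprehension at the end stands in for a SPARQL query; this is not how SWATM works internally.

```python
import xml.etree.ElementTree as ET

XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("", XTM_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag):
    """Qualify an XTM element name with its namespace."""
    return f"{{{XTM_NS}}}{tag}"


def add_entry(topic_map, topic_id, name, facts):
    """Append one topic describing a preserved database to the catalogue."""
    topic = ET.SubElement(topic_map, q("topic"), {"id": topic_id})
    base_name = ET.SubElement(topic, q("baseName"))
    ET.SubElement(base_name, q("baseNameString")).text = name
    for kind, value in facts.items():
        occurrence = ET.SubElement(topic, q("occurrence"))
        instance_of = ET.SubElement(occurrence, q("instanceOf"))
        ET.SubElement(instance_of, q("topicRef"), {f"{{{XLINK_NS}}}href": f"#{kind}"})
        ET.SubElement(occurrence, q("resourceData")).text = value


def catalogue_triples(topic_map):
    """View the catalogue as (topic_id, property, value) triples."""
    for topic in topic_map.iter(q("topic")):
        topic_id = topic.get("id")
        yield topic_id, "name", topic.findtext(f"{q('baseName')}/{q('baseNameString')}")
        for occurrence in topic.iter(q("occurrence")):
            ref = occurrence.find(f"{q('instanceOf')}/{q('topicRef')}")
            kind = ref.get(f"{{{XLINK_NS}}}href").lstrip("#")
            yield topic_id, kind, occurrence.findtext(q("resourceData"))


# Example: register one preserved database, then look up its location.
catalogue = ET.Element(q("topicMap"))
add_entry(catalogue, "db-experiment", "Experiment results",
          {"location": "archive/experiment.rdf", "created": "2008-03-13"})
reloaded = ET.fromstring(ET.tostring(catalogue))  # simulate reading the XTM file
locations = [v for _, p, v in catalogue_triples(reloaded) if p == "location"]
```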

5 Ongoing Work

During the coming stages of the PhD project we plan to extend and improve the
functionality of the prototype SSARD. The system should be able to preserve, reload,
and efficiently search scientific data stored not only in relational databases but also in
database-neutral formats. Scalable methods will be implemented for searching the
preserved and live data flexibly, transparently, and efficiently. The performance of
various types of SPARQL queries to RDFS views of preserved data reloaded in a
DBMS has to be analyzed. The performance of the same queries to different kinds of
repositories should be compared. For example, a triple store [7] could be used as
an alternative to a relational database. Furthermore, the query processing for SPARQL
queries to RDFS views over both data and metadata needs to be investigated.
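The kind of comparison planned here can be illustrated on a toy scale. The sketch below stores the same synthetic data once as an ordinary relational table and once in a generic three-column triple table, then runs the same logical query against both. Everything in it is an assumption for illustration: `sqlite3` hosts both layouts, the data is generated, and timings on a thousand rows say little about real scientific workloads.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (id INTEGER PRIMARY KEY, name TEXT, temp REAL)")
conn.execute("CREATE TABLE triples (s INTEGER, p TEXT, o)")
conn.execute("CREATE INDEX triples_po ON triples (p, o)")

# Load identical synthetic data into both layouts.
for i in range(1, 1001):
    name, temp = f"run-{i}", 290.0 + i % 10
    conn.execute("INSERT INTO experiment VALUES (?, ?, ?)", (i, name, temp))
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)",
                     [(i, "name", name), (i, "temp", temp)])

RELATIONAL = "SELECT name FROM experiment WHERE temp = 295.0 ORDER BY id"
# The triple layout needs one self-join per additional property.
TRIPLE = """
    SELECT n.o FROM triples AS t
    JOIN triples AS n ON n.s = t.s AND n.p = 'name'
    WHERE t.p = 'temp' AND t.o = 295.0
    ORDER BY t.s
"""


def timed(sql):
    """Run sql once and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


relational_rows, relational_time = timed(RELATIONAL)
triple_rows, triple_time = timed(TRIPLE)
print(f"relational: {relational_time:.6f} s, triple table: {triple_time:.6f} s")
```

Both queries return the same rows; what differs is the number of joins, which is one of the factors such a performance study would have to measure on realistic data sizes.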
Finally, investigations need to be done on how to include data provenance
information documenting how data was obtained and relating different versions of
curated data.

References

1. W3C Semantic Web Activity, http://www.w3.org/2001/sw/
2. Herman, I.: Semantic Web Activity Statement, W3C,
http://www.w3.org/2001/sw/Activity.html. Retrieved on 2008-03-13.
3. Klyne, G., Carroll, J.: Resource Description Framework (RDF): Concepts and Abstract
Syntax. W3C Recommendation, http://www.w3.org/TR/rdf-concepts/
4. Brickley, D., Guha, R.V.: RDF Vocabulary Description Language 1.0: RDF Schema,
http://www.w3.org/TR/rdf-schema/ (2004)
5. Stuckenschmidt, H., Harmelen, F.: Information Sharing on the Semantic Web, Springer,
ISBN 3-540-20594-2 (2005)
6. Pepper, S.: The TAO of Topic Maps, http://www.ontopia.net/topicmaps/materials/tao.htm
7. Large Triple Stores, http://esw.w3.org/topic/LargeTripleStores
8. Pepper, S., Moore, G.: XML Topic Maps (XTM) 1.0, TopicMaps.Org,
http://www.topicmaps.org/xtm/1.0/ (2001)
9. Fahl, G., Risch, T.: Query Processing over Object Views of Relational Data, The VLDB
Journal, Vol. 6 No. 4, pp. 261-281 (1997)
10. Beckett, D.: RDF/XML Syntax Specification (Revised), W3C Recommendation 10 February
2004, http://www.w3.org/TR/rdf-syntax-grammar/ (2004)
11. Prud'hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF, W3C
Recommendation, 15 January 2008, http://www.w3.org/TR/rdf-sparql-query/ (2008)
12. Cao, Yu.: Processing SparQL Queries in an Object Oriented Mediator, Uppsala Master's
Theses in Computing Science, ISSN 1100-1836 (2007)
13. Petrini, J.: Querying RDF Schema Views of Relational Databases, PhD Thesis, Uppsala
University, Department of IT, ISSN 1104-2516 (2008)
14. Stefanova, S., Risch, T.: Viewing and Querying Topic Maps in terms of RDF, SeMMA2008,
First International Workshop on Semantic Metadata Management and Applications,
ESWC2008 (2008)
15. Buneman, P., Chapman, A., Cheney, J.: Provenance management in curated databases,
pp. 539-550, SIGMOD Conference (2006)
16. CEDARS: CURL Exemplars in Digital Archives, http://www.leeds.ac.uk/cedars/
17. PANDORA: http://pandora.nla.gov.au/
18. PREMIS (PREservation Metadata: Implementation Strategies) Working Group:
http://www.oclc.org/research/pmwg/
19. Müller, H., Buneman, P., Koltsidas, I.: XArch: Archiving Scientific and Reference Data,
Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data,
Vancouver, Canada (2008)
20. Ramalho, J.C., Ferreira, M., Faria, L., Castro, R.: Relational database preservation through
XML modelling, International Workshop on Markup of Overlapping Structures, Montréal,
Canada (2007)
21. Buneman, P.: How to cite curated databases and how to make them citable, Proc. 18th
International Conf. on Scientific and Statistical Database Management (SSDBM'06) (2006)
22. http://sweet.jpl.nasa.gov/ontology/
23. EGEE: Enabling Grids for E-sciencE,
http://egee-intranet.web.cern.ch/egee-intranet/gateway.html
24. Garcia-Molina, H., Papakonstantinou, Y., Quass, D., Rajaraman, A., Sagiv, Y., Ullman, J.,
Vassalos, V., Widom, J.: The TSIMMIS Approach to Mediation: Data Models and
Languages, Journal of Intelligent Information Systems (JIIS), Kluwer, 8(2), 117-132 (1997)
25. Haas, L., Kossmann, D., Wimmers, E. L., Yang, J.: Optimizing Queries across Diverse Data
Sources. 23rd Intl. Conf. on Very Large Databases (VLDB'97), 276-285 (1997)
