1 Introduction
The goal of the PhD project is to investigate how Semantic Web representations can
be utilized for long-term preservation, documentation, evolution, and efficient search
of different kinds of scientific data. In particular, the project investigates how to build
tools for efficient search of preserved scientific data that is stored in
database-neutral formats based on Semantic Web representations. This would allow exchange
of data between different DBMSs and file formats.
Archived databases should be searchable using a standard query language, e.g.
SPARQL [11], which is the recognized Semantic Web query language. It must be
possible to find data transparently, independent of whether it is stored in a live
relational database or archived on disk or other media. The system should know
where all necessary data is located and perform the necessary actions to answer a
user's query.
Data provenance information, documenting how data was obtained and relating
different versions of curated data [15], has to be included. The system should also
check that new data added to pre-existing result databases fulfills the constraints
imposed by the ontology used for the old results.
During the initial stages of the PhD project, investigations were made on how to use
Semantic Web representations for long-term preservation of data stored in relational
databases in terms of:
- What information to represent
- What kind of Semantic Web representations to use
- DBMS independence and exchangeability
- Efficient search of live and archived data.
Based on these investigations, a first version of the prototype system SSARD
(Semantic Web Search and Archival of Relational Databases) has been developed.
The features of SSARD implemented so far are:
1. A scientific database stored in any commercial relational DBMS can be
preserved by unloading it in terms of RDF-Schema (RDFS) [4]. Both data
and metadata are preserved.
2. A preserved database can be made live again by loading it into any commercial
relational DBMS, independently of the original DBMS.
3. The system maintains a catalogue of the preserved databases.
4. The preserved database, both data and metadata, can be queried in terms of
RDFS using the RDF query language SPARQL [11].
5. The catalogue of the preserved databases can also be queried with SPARQL.
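To make feature 1 more concrete, the following is a minimal sketch of how one relational table might be unloaded into metadata and data triples. It is not the SSARD implementation: the mapping (table to RDFS class, column to property, row to resource), the `BASE` namespace, and the function name `unload` are assumptions made for illustration, and SQLite stands in for an arbitrary relational DBMS.

```python
import sqlite3

BASE = "http://example.org/ssard/"  # hypothetical namespace, not from the paper
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
RDFS_DOMAIN = "http://www.w3.org/2000/01/rdf-schema#domain"


def unload(conn, table, key):
    """Return (metadata_triples, data_triples) for one table.

    Assumed mapping: the table becomes an RDFS class, each column a
    property whose domain is that class, and each row a resource
    identified by the value of its key column.
    """
    cur = conn.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    cls = BASE + table

    # Metadata view: one class plus one property per column.
    meta = [(cls, RDF_TYPE, RDFS_CLASS)]
    for col in columns:
        prop = f"{cls}#{col}"
        meta.append((prop, RDF_TYPE, RDF_PROPERTY))
        meta.append((prop, RDFS_DOMAIN, cls))

    # Data view: one typed resource per row, one triple per non-NULL value.
    data = []
    key_idx = columns.index(key)
    for row in cur:
        subject = f"{cls}/{row[key_idx]}"
        data.append((subject, RDF_TYPE, cls))
        for col, value in zip(columns, row):
            if value is not None:
                data.append((subject, f"{cls}#{col}", value))
    return meta, data


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE measurement (id INTEGER PRIMARY KEY, temp REAL)")
    conn.execute("INSERT INTO measurement VALUES (1, 21.5)")
    meta, data = unload(conn, "measurement", "id")
    for triple in meta + data:
        print(triple)
```

Keeping the metadata and data triples in separate lists mirrors the statement that both data and metadata are preserved, and it leaves open how each part is later stored or queried.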
The architecture of the current SSARD system is shown in Fig. 1. The source
database is the database to be preserved long-term (e.g. stored in MySQL). Both the
data and the metadata of the source database are wrapped by a JDBC wrapper [9].
The RDB viewer (Relational Database Viewer) generates two views in terms of RDFS,
one over the data and one over the metadata of the relational database. The archiver
transforms the data in the RDFS views to RDF/XML [10] and stores it on disk as
two RDF/XML files, the repository of the archived databases. During the
preservation process the SSARD catalogue is updated. It keeps information about the
subject, the location, the size, the creation date, etc. of all the preserved databases.
To make this information easy to find, the catalogue is represented as a
Topic Map [6] stored on disk in the XTM [8] file format. Topic Maps is another
Semantic Web standard created for managing indices to documents [6]. The Topic
[Fig. 1. Architecture of the SSARD system. Components shown: Source Database (MySQL), JDBC Wrapper, RDB Viewer, Archiver, RDF/XML files (repository of archived databases), Archived Databases, TM Builder, TM Viewer, SSARD Catalogue, SPARQL Parser, Reloader, Destination Database.]
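The archiver step described above, in which the RDFS views are serialized to RDF/XML, can be illustrated roughly as follows. This is a sketch under stated assumptions rather than the SSARD archiver: the function names `to_rdfxml` and `split_uri` are hypothetical, all object values are written as plain literals, and the example URIs are invented.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ET.register_namespace("rdf", RDF_NS)


def split_uri(uri):
    """Split a property URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


def to_rdfxml(triples):
    """Serialize (subject, property, literal) triples as an RDF/XML string.

    Simplification: every object is emitted as a literal. Triples whose
    object is itself a resource would need an rdf:resource attribute.
    """
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    nodes = {}  # subject URI -> its rdf:Description element
    for subject, prop, value in triples:
        node = nodes.get(subject)
        if node is None:
            node = ET.SubElement(
                root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": subject}
            )
            nodes[subject] = node
        ns, local = split_uri(prop)
        ET.SubElement(node, f"{{{ns}}}{local}").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical data triples for one archived row.
    example = [
        ("http://example.org/ssard/measurement/1",
         "http://example.org/ssard/measurement#temp", 21.5),
    ]
    print(to_rdfxml(example))
```

Grouping all properties of a subject under a single `rdf:Description` element keeps the file compact. Writing the returned string to one file for the data view and another for the metadata view would correspond to the two RDF/XML files mentioned in the text.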
5 Ongoing Work
During the coming stages of the PhD project we plan to extend and improve the
functionality of the SSARD prototype. The system should be able to preserve, reload,
and efficiently search scientific data stored not only in relational databases but also in
database-neutral formats. Scalable methods will be implemented for searching the
preserved and live data flexibly, transparently, and efficiently. The performance of
various types of SPARQL queries to RDFS views of preserved data reloaded in a
DBMS has to be analyzed. The performance of the same queries to different kinds of
repositories should be compared. For example, a triple store [7] could be used as
an alternative to a relational database. Furthermore, the query processing for SPARQL
queries to RDFS views over both data and metadata needs to be investigated.
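To illustrate what a query over both data and metadata involves, the sketch below evaluates a conjunction of triple patterns (the core of a SPARQL basic graph pattern) with a naive nested-loop strategy. It is not a SPARQL engine and not the SSARD query processor. The function names, the prefixed names such as `db:temp` (used here as plain strings for readability), and the sample triples are all assumptions.

```python
def match_pattern(pattern, triples, binding):
    """Yield every extension of `binding` that makes `pattern` match a triple.

    Terms that are strings starting with '?' are variables; any other
    term must equal the corresponding triple component.
    """
    for triple in triples:
        new = dict(binding)
        for term, value in zip(pattern, triple):
            if isinstance(term, str) and term.startswith("?"):
                # Bind the variable, or reject if it is already bound differently.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new


def query(patterns, triples):
    """Join the patterns left to right, returning a list of variable bindings."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for extended in match_pattern(pattern, triples, binding)
        ]
    return bindings


# Hypothetical metadata and data triples held in one collection.
TRIPLES = [
    ("db:measurement", "rdf:type", "rdfs:Class"),
    ("db:temp", "rdfs:domain", "db:measurement"),
    ("db:site", "rdfs:domain", "db:measurement"),
    ("db:m1", "db:temp", 21.5),
    ("db:m1", "db:site", "A"),
    ("db:m2", "db:temp", 19.0),
]

# First pattern touches metadata, second pattern touches data.
RESULTS = query(
    [("?p", "rdfs:domain", "db:measurement"), ("?row", "?p", "?v")],
    TRIPLES,
)
```

The first pattern binds `?p` from the metadata and the second uses that binding to select data triples, so a single query spans both views. Each pattern scans all triples, which is exactly the kind of cost that a real query processor over a relational database or a triple store would need to avoid.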
Finally, investigations are needed on how to include data provenance
information documenting how data was obtained and relating different versions of
curated data.
References