
The Electronic Library

Dataspace and its application in digital libraries


Borjigin Chaolemen, Zhang Yong, Xing Chunxiao, Lan Chao, Zhang Jian
Article information:
To cite this document:
Borjigin Chaolemen, Zhang Yong, Xing Chunxiao, Lan Chao, Zhang Jian (2013), "Dataspace and its application in digital libraries", The Electronic Library, Vol. 31 Iss 6 pp. 688-702
Permanent link to this document:
http://dx.doi.org/10.1108/EL-02-2012-0017
Downloaded on: 16 March 2017, At: 00:50 (PT)
Downloaded by UNIVERSITAS SUMATERA UTARA At 00:50 16 March 2017 (PT)

References: this document contains references to 32 other documents.




The current issue and full text archive of this journal is available at
www.emeraldinsight.com/0264-0473.htm

Dataspace and its application in digital libraries

Chaolemen Borjigin
Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, MOE, Renmin University of China, Beijing, PR China and Tsinghua University, Beijing, PR China, and
Yong Zhang, Chunxiao Xing, Chao Lan and Jian Zhang
Research Institute of Information Technology (RIIT), Tsinghua University, Beijing, PR China

Received 17 February 2012
Accepted 9 May 2012

Abstract
Purpose: This paper aims to introduce dataspace into digital libraries in order to address their emerging data management challenges, which stem from the cross-domain, heterogeneous, and uncertain nature of data resources. It does so by examining the fundamental principles, significant features, and research directions of dataspace.
Design/methodology/approach: This research mainly employs three types of research methodology: a literature study was conducted to reveal the fundamental principles, analyze the significant features, and discuss the new research topics; knowledge engineering methodology was used to design the data model; and software engineering methodology was applied to develop the reference framework and the digital library.
Findings: This paper proposes, for the first time, building a Dataspace Support Digital Library (DSDL), and provides its data model, data management policies and a reference framework. Its implementation is also described, and some implications learned from the case study are discussed.
Practical implications: Introducing dataspace technologies into digital library development frees the developer to focus on solving business challenges rather than on non-business-related, data management level tasks. In addition, the data model and the reference framework presented in this paper lay foundations for constructing Dataspace Support Digital Libraries.
Originality/value: This is the first paper to introduce dataspaces into the design of digital libraries and is also the first to propose a novel data model, data management and reference framework for Dataspace Support Digital Libraries.
Keywords: Databases, Digital libraries, Dataspace, Personal information management
Paper type: Research paper

1. Introduction
One of the most acute challenges faced by current digital libraries is how to handle heterogeneous, uncertain and distributed data resources. However, current digital library applications fail to meet this challenge because they are built on top of a Database Management System (DBMS) that is ill-suited to this new scenario. The design of a digital library not only has to meet its key business requirements, but also has to deal with the cross-domain, heterogeneous, and uncertain nature of data resources. As a result, digital library development is costly and some work at the data management level is repeated. Further, digital libraries built by different designers are difficult to interoperate with one another because their data management solutions differ. Franklin et al. (2005) proposed dataspace as a new abstraction for data management in such scenarios. Since then, dataspace has drawn increasing attention from the database community and from related research fields such as information management, data integration and software engineering. This study introduces this new solution into the development of digital libraries in order to address the challenges discussed above.

This work was funded by Natural Science Foundation of China (No. 71103020), Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Ministry of Education (No. KF2011001), National High Technology Research and Development Program of China (863 Project, Project Number: 2009AA01Z143) and National Natural Science Foundation of China (Key Program No. 71133006).

The Electronic Library, Vol. 31 No. 6, 2013, pp. 688-702. © Emerald Group Publishing Limited, 0264-0473. DOI 10.1108/EL-02-2012-0017


The rest of this paper is organized as follows. Section 2 describes the definition,
underlying principles and significant features of dataspace. Section 3 further discusses
some emerging research directions related to dataspace by conducting a literature
review. Then, section 4 introduces dataspace into digital libraries and its data model,
data management, reference frameworks and IT implementations are described.
Finally, section 5 concludes this paper with a discussion of its contributions and
limitations.

2. Dataspace
Dataspace can be defined as a set of participants and a set of relationships among them (Singh and Jain, 2011). The participants provide the dataspace with their data resources and computing services. A dataspace should not only integrate the resources from participants, notably cross-domain, heterogeneous and uncertain ones, but also manage the relationships between them, including overlapping, conflicting, inheriting, homogeneous, matching or mapping relationships. Notably, its main components, fundamental principles and significant features make dataspace more than just a new kind of data integration solution. A typical dataspace system involves a catalog, local storage and index, search and query, discovery, enhancement, administration, and other components (Franklin et al., 2005). The catalog records the participants and their mutual relationships. The local storage and index are mainly used to support query, access and management of its resources. The search and query component provides a variety of searching services, such as unified, structured and metadata queries. The discovery component is in charge of finding participants and mining relationships between them. The enhancement component is designed to extend the data management capabilities of participants if they fail to provide the required functionality. The administration component is mainly used to manage the other parts.
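
As an illustration of the definition above, a minimal Python sketch of a catalog that records participants and the relationships among them might look as follows. The class and method names (`Participant`, `Catalog`, `relate`, `related_to`) and the example participant names are assumptions made for illustration; they are not taken from Franklin et al. (2005) or from any existing system.

```python
from dataclasses import dataclass, field

# Relationship kinds named in the text above.
RELATIONSHIP_KINDS = {
    "overlapping", "conflicting", "inheriting",
    "homogeneous", "matching", "mapping",
}

@dataclass(frozen=True)
class Participant:
    """A data source that contributes resources and services (hypothetical type)."""
    name: str
    data_format: str  # e.g. "relational", "xml", "files"

@dataclass
class Catalog:
    """Records participants and their mutual relationships."""
    participants: dict = field(default_factory=dict)
    relationships: set = field(default_factory=set)

    def register(self, participant):
        self.participants[participant.name] = participant

    def relate(self, source, target, kind):
        if kind not in RELATIONSHIP_KINDS:
            raise ValueError(f"unknown relationship kind: {kind}")
        if source not in self.participants or target not in self.participants:
            raise KeyError("both participants must be registered first")
        self.relationships.add((source, kind, target))

    def related_to(self, name):
        """Return (kind, other participant) pairs that involve `name`."""
        out = []
        for source, kind, target in sorted(self.relationships):
            if source == name:
                out.append((kind, target))
            elif target == name:
                out.append((kind, source))
        return out

catalog = Catalog()
catalog.register(Participant("library_a", "relational"))
catalog.register(Participant("library_b", "xml"))
catalog.relate("library_a", "library_b", "overlapping")
```

In this sketch the catalog stores only names and relationship kinds, which mirrors the idea that the participants keep their own data and the dataspace merely records how they relate.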

2.1 Underlying principles


Dataspace differs from traditional data management approaches such as database and data integration because it conforms to some fundamentally different principles. These principles give dataspaces some significant features and open up the research directions discussed in section 3. Table I compares dataspace and database at the level of underlying principles:
• Pay-as-you-go principle. Traditional databases employ a conventional construction approach that may be referred to as pay-before-you-go. In other words, the whole design strictly precedes its use. Once the design is accepted, it remains stable for a long period of time. This construction approach results in two acute shortcomings of traditional database applications. One is that they have to amass large amounts of seldom used or useless information merely because it might be needed in the future. The other is a lack of flexibility: when users' needs or application contexts change, it is difficult for a traditional application to alter its design to keep pace with these changes. By contrast, dataspace abides by an alternative construction principle called pay-as-you-go (Halevy et al., 2006), and its design can evolve with its use. Therefore, dataspaces are not only able to integrate all available resources from cross-domain participants, but are also capable of dismissing some of them or decomposing themselves as user needs or running contexts change. The development and use of a dataspace proceed almost simultaneously, and the design tends to be more reasonable and practicable.
• Data first, schema later or never principle. The overall design of a database is called the database schema, and a database has several schemas, such as the physical schema, logical schema, and subschemas. Of these, the logical schema is by far the most important in terms of its effect on application programs, since programmers construct applications by using the logical schema (Silberschatz et al., 2009). Data in a database are therefore captured and handled according to a schema that has been designed in advance. Although this ensures high accuracy of data processing in databases, there are some disadvantages, notably a lack of flexibility and a loss of information. Since data are captured according to a predefined schema, variables or methods that the schema does not allow are rejected. At the same time, some data or services have to be adapted in order to follow the schema of the database. This sometimes results in loss of information or distortion of data processing. Dataspaces, by contrast, work on the opposite principle, namely data first, schema later or never (Blunschi et al., 2007).
• Data networking principle. The RDBMS (Relational Database Management System), which is the most common database technology in use today, uses a relational data model. It employs relational algebra as its theoretical foundation and supports the ACID properties of a transaction. As a result, an RDBMS computes precisely. However, the cross-domain, heterogeneous, massive, and uncertain nature of data in a dataspace makes it impossible to model the data using the relational data model. Compared to the relational model, the network model is more suitable for modeling data in a dataspace (Li et al., 2008).

• Data coexisting principle. The goal of dataspace is to provide general functionality over all data sources, regardless of how integrated they are. For example, a Dataspace Support System (DSSP) can provide keyword search over all of its data sources, similar to that provided by existing desktop search systems. When more sophisticated operations are required, such as relational-style queries, data mining, or monitoring over certain sources, then additional effort can be applied to more closely integrate those sources in an incremental, pay-as-you-go fashion (Franklin et al., 2005). Therefore, the underlying principles of dataspace differ from those of data integration, and dataspace conforms to the data coexisting principle.
• Incomplete control principle. In database technologies, data are fully managed by Database Management Systems (DBMS), which have full control over the data stored in them. Unlike a traditional DBMS, a Dataspace Management System (DSMS), which plays a role similar to that of a DBMS, has only incomplete control over the data in it. Data in a dataspace may be controlled not only by its DSMS, but also by its providers. The functionality of a dataspace may be increased or decreased with its use, in line with the pay-as-you-go principle.

Table I. Comparisons between dataspace and database

                        Dataspace                          Database
Construction approach   Pay-as-you-go                      Pay-before-you-go
Data schema             Data first, schema later or never  Schema first, data later
Data model              Network model                      Relational model
Data storage            Data coexisting                    Data integration
Data control            Incomplete                         Complete
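
The pay-as-you-go and data first, schema later or never principles can be illustrated with a small Python sketch, offered under stated assumptions rather than as the design of any particular DSSP. Records are accepted without a predefined schema, keyword search works over every record immediately, and a schema is derived only when it is needed. The names `SchemaLaterStore`, `keyword_search` and `infer_schema`, and the sample records, are hypothetical.

```python
class SchemaLaterStore:
    """Accepts heterogeneous records first; derives a schema only on demand."""

    def __init__(self):
        self.records = []  # list of (source name, record dict)

    def add(self, source, record):
        # No schema check: any field names and value types are accepted.
        self.records.append((source, dict(record)))

    def keyword_search(self, keyword):
        """Basic service available over all sources from the start."""
        needle = keyword.lower()
        return [
            (source, record)
            for source, record in self.records
            if any(needle in str(value).lower() for value in record.values())
        ]

    def infer_schema(self, source):
        """Schema later: collect the field names and value types seen so far."""
        schema = {}
        for src, record in self.records:
            if src == source:
                for key, value in record.items():
                    schema.setdefault(key, set()).add(type(value).__name__)
        return schema

store = SchemaLaterStore()
store.add("catalog_db", {"title": "Dataspace basics", "year": 2008})
store.add("catalog_db", {"title": "Digital libraries", "year": "2010", "isbn": "n/a"})
store.add("photo_archive", {"caption": "Library reading room", "path": "img/001.jpg"})

hits = store.keyword_search("librar")
schema = store.infer_schema("catalog_db")
```

Note that the second record stores the year as a string and adds a field the first record lacks. A schema-first store would reject or coerce it, whereas here the inferred schema simply reports both observed types.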

2.2 Significant features


Compared to traditional data management technologies such as database and data integration, dataspace shows some significant features because it conforms to the five underlying principles discussed above. These are summarized in Table II:
• Uncertainty. The data source, data schema, and data model in a dataspace bear uncertainty. This feature stems from the pay-as-you-go principle and the data first, schema later or never principle. The pay-as-you-go principle requires dataspaces to embrace or dismiss a data source dynamically. The data first, schema later or never principle enables dataspaces to have an uncertain data schema or no schema at all. The uncertainty in data sources and data schema further results in uncertainty in the design, development and use of a dataspace.
• Incompleteness. Data in a dataspace are not fully controlled by the DSMS, owing to the incomplete control principle and the data coexisting principle. The incomplete control principle makes a dataspace heavily dependent on its participants, so that it has to forward some user requests to them. On the other hand, the data coexisting principle makes it possible for the participants to retain some self-management and to provide their services to the dataspace management system.
• Flexibility. The pay-as-you-go principle and the data networking principle enhance the flexibility in the design and redesign of a dataspace. Unlike traditional database technologies, a dataspace may be composed or disposed of flexibly. When there is a need to access the information resident in more data sources, a dataspace can easily embrace the desired participants. Similarly, when a data source is inaccessible, the dataspace should not only exclude it, but also seek an alternative one. Further, if all the data resources are unavailable, the DSMS should try to find a candidate answer based on its search logs and logical reasoning.
• Low information loss. Information losses are lower than in traditional information integration approaches because of the data first, schema later or never principle and the data coexisting principle. Owing to the data first, schema later or never principle, there is no format transformation or approximate calculation. This avoids the loss of information when data sets are integrated to provide services across data sources.
• Better effort services. This feature has its roots in the pay-as-you-go principle, the data first, schema later or never principle, and the incomplete control principle. The pay-as-you-go principle means that a dataspace has to provide services based only on the resources that have already been embraced. A dataspace may sometimes be unable to reach all the desired resources, and then returns only an approximate answer or a better effort result. The data first, schema later or never principle affects the accuracy of data computing because it lacks mathematical foundations. Incomplete control over its data resources means that the quality of services in a dataspace depends heavily on its participants. Therefore, unlike databases, dataspaces provide better effort services instead of best services.

Table II. Comparisons between dataspace and database

                   Dataspace                          Database
Data source        Uncertain, changed frequently      Certain, changed infrequently
Data management    DSSP (dataspace support system),   DBMS (database management system)
                   self-management of data sources
Information loss   Low                                High
Data link          Dynamic                            Static
                   Diverse                            Unified
Service result     Better effort                      Best effort
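
The flexibility and better effort features described above can be sketched in Python as follows. The sketch queries whichever participants are currently reachable, skips those that fail, and falls back to a local search log when no participant returns anything. It is an illustration under assumptions: the function `better_effort_search`, the participant callables and the log structure are hypothetical and do not describe an existing DSMS.

```python
def better_effort_search(query, participants, search_log):
    """Return (results, complete) using only the reachable participants.

    participants: dict mapping a source name to a callable(query) -> list.
    search_log:   dict mapping past queries to previously returned results.
    """
    results, failed = [], []
    for name, search in participants.items():
        try:
            results.extend(search(query))
        except ConnectionError:
            failed.append(name)  # exclude the inaccessible source

    if results:
        search_log[query] = list(results)  # remember for later fallback
        return results, not failed         # complete only if nothing failed

    # Nothing came back from any source: fall back to the search log.
    return search_log.get(query, []), False


def reachable_source(query):
    return [f"record about {query}"]

def unreachable_source(query):
    raise ConnectionError("source is offline")

log = {}
first = better_effort_search(
    "dataspace", {"a": reachable_source, "b": unreachable_source}, log)
second = better_effort_search("dataspace", {"b": unreachable_source}, log)
```

The boolean flag makes the approximate nature of the answer explicit to the caller, which is one way of exposing a better effort result instead of silently presenting it as complete.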

3. Emerging research directions


In order to examine the state of the art in dataspace and its applications, we conducted a literature survey, taking the ACM Digital Library, Engineering Village 2, IEEE/IET Electronic Library, Web of Knowledge and China Knowledge Resource Integrated Database as data sources. We selected keywords such as "dataspace *", "pay as you go", "pay-as-you-go", "Data schema", or "Data model", and searched against several fields, such as the documents' keywords, title, and abstract. The result set contained 113 valid papers in total. A topic analysis of these papers revealed four main emerging research directions in the study of dataspaces.

3.1 Data model


Since most of the data modeling technologies developed so far conform to the schema first, data later principle, it is necessary to design novel data modeling technologies or to adapt the existing ones to the new features of dataspaces. Consequently, data modeling has become one of the hot topics in the relevant studies. Several solutions, such as the Haystack model (Karger et al., 2005), iDM (iMeMex Data Model) (Dittrich et al., 2006), UDM (Unified Data Model) (Pradhan, 2007), and PSM (Probabilistic Semantic Model) (Sarma et al., 2009), may be applied to model data in dataspaces. However, some of these modeling methods were proposed mainly by altering or improving traditional data modeling technologies. The others do not directly target dataspaces and merely have some implications for them. All of them lack an overall breakthrough and a solid theoretical foundation. Just as the relational data model uses relational algebra as its theoretical foundation, data modeling for dataspaces needs new modeling technologies based on a specific mathematical theory of its own.

3.2 Dataspace management
In database technologies, the DBMS has full control over the creation, maintenance, and use of a database. This frees developers from dealing with the non-business-related data management tasks that ensure integrity, concurrency, recovery, and security. Similarly, dataspace applications also need a set of fundamental tools, which may be called the Dataspace Management System (DSMS). Although a practicable DSMS has not been developed yet, some of its key problems have been discussed. Li et al. (2008) proposed a framework for dataspace integration and management, which consists of four main parts:
(1) data integration engine;
(2) dataspace engine;
(3) evolution engine; and
(4) output engine.
In addition, some research problems in indexing (Dong and Halevy, 2007), integrating (Vaz Salles et al., 2007), mapping (Belhajjame et al., 2010; Hedeler et al., 2011), searching (Song et al., 2010), and navigating (Schindler et al., 2011) dataspaces have also been discussed.

3.3 Dataspace support platform


Franklin et al. (2005) viewed the design and development of Dataspace Support Platforms (DSSP) as a key agenda item for the database field. Hedeler et al. (2009) made a comparative study of existing dataspace or data integration approaches, such as iMeMex (Dittrich et al., 2005), ITrails (Vaz Salles et al., 2007), DB2II (Haas, 2002), Aladin (Leser and Naumann, 2005), SEMEX (Dong and Halevy, 2005), PayGo (Madhavan, 2007), UDI (Das Sarma et al., 2008), Roomba (Jeffery et al., 2008) and Quarry (Howe et al., 2008), from the perspectives of life cycle, data source, integration schema, design/derivation, matching, mapping, resulting data resource, creation, search/query and evaluation. Mirza et al. (2010) examined the practicability of dataspace systems and discussed some of the challenges in designing and developing them. Hedeler et al. (2010) argued that the life cycle of a dataspace includes the stages of initialization, test/evaluation, deployment, usage, maintenance and improvement, and demission.

3.4 Personal information management


Dataspace is starting to be used in personal information management, personal health management, e-science and the Web (Franklin et al., 2008). Of these, personal information management has attracted the most attention, and personal dataspace has become the hottest topic in database applications. Vaz Salles et al. (2006) developed a platform for personal dataspace management that uses iDM as its data model. Li et al. (2009) proposed a method for constructing personal dataspaces. Liu et al. (2010) discussed document clustering in personal dataspaces. Li (2011) proposed a framework for task-based queries in personal dataspaces.

4. Dataspace supported digital library
We developed a digital library called iDLib and introduced dataspace into it. This digital library system is one of the main outcomes of a research project (Project Name: R&D of Cross-domain Sharing and Service Support Platform for Data-driven Applications, Project Number: 2009AA01Z143) funded by the National High Technology Research and Development Program of China (863 Project). The main motivations for introducing dataspaces into the digital library are:
(1) to access cross-domain data sources that reside in Tsinghua University Library, National Library of China, Peking University Library, and China Academic Library and Information System;
(2) to design a flexible data modeling approach to uniformly represent heterogeneous documents such as books, papers, patents, standards, software, photos and videos, or their metadata;
(3) to provide some value-added services, including integrated keyword searching, automatic abstracting, tagging, question answering, log management, and data mining; and
(4) to enhance the flexibility of the system so that it can be easily adapted to meet the dynamic requirements of data sources and user needs.

4.1 Data model


We designed a novel data model that can represent cross-domain, heterogeneous, and uncertain data in a flexible way. This model has two types of basic building blocks: dataspace objects and semantic relations between dataspace objects. A dataspace object encapsulates data as a digital object, regardless of its format, size, storage, and location. The semantic relations are defined to represent relations between dataspace objects and to support semantic computing in a dataspace. Therefore, this model can be used to describe not only all the available formats, such as record, table, text, XML, image, or video file, but also all the accessible participants from different security domains.
Definition 1 Dataspace object. A dataspace object $O_i$ is a 5-tuple $(\alpha_i, \beta_i, \gamma_i, \delta_i, \varepsilon_i)$, where $\alpha_i$ is a URI component, $\beta_i$ is a content component, $\gamma_i$ is a tag component, $\delta_i$ is a log component, and $\varepsilon_i$ is a relation component. Each component of a dataspace object $O_i$ is defined as follows:
• $\alpha_i$ (URI component) is a finite string that denotes the URI (Uniform Resource Identifier) of $O_i$.
• $\beta_i$ (Content component) is the resource encapsulated in a dataspace object, which may be of any format, size, storage, location and content.
• $\gamma_i$ (Tag component) is a set of tags related to $O_i$, which may be created automatically or manually, and is defined as $\gamma_i = M \cup H$ with $M \cap H = \emptyset$, where $M = \{m_1, m_2, \ldots, m_o\}$ represents the set of tags created by machines and $H = \{h_1, h_2, \ldots, h_p\}$ denotes the set of tags created by humans.
• $\delta_i$ (Log component) is a 3-tuple $(r, s, t)$, where $r$ is a string that denotes who conducted an operation on $O_i$, $s$ is a string that records the time of the operation conducted on $O_i$, and $t$ is a string that represents the type of operation.
• $\varepsilon_i$ (Relation component) is a set of strings that denote the relations between $O_i$ and other dataspace objects, $\varepsilon_i = \{R_1, R_2, \ldots, R_j, \ldots, R_n\}$, where $R_j$ is the set of semantic relations between $O_i$ and $O_j$, and $R_j \subseteq Z$. The set $Z$ is defined as follows.

Definition 2 Relations between dataspace objects. The relationship between dataspace objects is a set $Z = \{z_1, z_2, \ldots, z_{13}\}$, where the value of each $z_i$ is listed in Table III.
Definition 3 Dataspace modeling DTD. A DTD for dataspace objects and the relations between them is defined in Figure 1.
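
Definitions 1 and 2 can be made concrete with a short Python sketch. It is one possible encoding, not the iDLib implementation: the class names `DataspaceObject` and `LogEntry`, the field names and the example URIs are assumptions, and only a few of the 13 relation types from Table III are spelled out.

```python
from dataclasses import dataclass, field

# A subset of the relation set Z from Table III (tag names as listed there).
Z = frozenset({"subClassOf", "equivalentClass", "sameAs", "differentFrom"})

@dataclass(frozen=True)
class LogEntry:
    """Log component: who did it, when, and what type of operation."""
    actor: str
    time: str
    operation: str

@dataclass
class DataspaceObject:
    uri: str                                         # URI component
    content: object = None                           # content, or an address to it
    machine_tags: set = field(default_factory=set)   # M
    human_tags: set = field(default_factory=set)     # H
    logs: list = field(default_factory=list)         # log component
    relations: dict = field(default_factory=dict)    # related URI -> subset of Z

    @property
    def tags(self):
        """Tag component: union of machine and human tags."""
        return self.machine_tags | self.human_tags

    def add_tag(self, tag, by_machine, actor, time):
        # Keep M and H disjoint, as the definition requires.
        if tag in self.tags:
            raise ValueError(f"tag already present: {tag}")
        (self.machine_tags if by_machine else self.human_tags).add(tag)
        self.logs.append(LogEntry(actor, time, "tag"))

    def relate(self, other_uri, relation):
        if relation not in Z:
            raise ValueError(f"relation not in Z: {relation}")
        self.relations.setdefault(other_uri, set()).add(relation)

book = DataspaceObject(uri="urn:example:book-1", content="full text or address")
book.add_tag("monograph", by_machine=True, actor="tagger", time="2013-01-01T09:00:00")
book.add_tag("favourite", by_machine=False, actor="alice", time="2013-01-02T10:30:00")
book.relate("urn:example:book-2", "sameAs")
```

Keeping the machine and human tag sets as separate fields makes the disjointness condition easy to check, while the `tags` property exposes their union as the tag component.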
It can be seen from the discussion above that a dataspace object comprises one or more contents, logs, tags and relations, while it must be assigned exactly one unique URI. In addition, the content component should include its unique ID, version number and either the full content or an address to the content. The log component should contain the time, actor and type of a transaction in order to support data recovery or version control in a dataspace. The tag component includes the tags related to a dataspace object and must record their ID, content and logs. The relation component records the object's relations with other dataspace objects and contains the URI, type and address of the related objects. The current version number as well as the prior version number will be used to roll back content, tags or relations in a dataspace.
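
The version-based rollback mentioned above could work along the following lines. This Python sketch is an assumption about one simple way to do it: every version of a content component is kept, and a rollback moves the current-version pointer back to the prior version. The names `VersionedContent`, `update` and `rollback` are hypothetical and are not taken from iDLib.

```python
class VersionedContent:
    """Content component that keeps prior versions so it can be rolled back."""

    def __init__(self, content_id, value):
        self.content_id = content_id
        self.versions = {1: value}   # version number -> content
        self.current = 1             # current version number
        self.prior = None            # prior version number, if any

    def update(self, value):
        new_version = max(self.versions) + 1
        self.versions[new_version] = value
        self.prior, self.current = self.current, new_version
        return new_version

    def rollback(self):
        """Make the prior version current again; return its number."""
        if self.prior is None:
            raise RuntimeError("no prior version to roll back to")
        self.current = self.prior
        # Simplification: treat the immediately preceding number as the new prior.
        self.prior = self.current - 1 if self.current > 1 else None
        return self.current

    @property
    def value(self):
        return self.versions[self.current]

abstract = VersionedContent("c1", "draft abstract")
abstract.update("revised abstract")
abstract.update("final abstract")
abstract.rollback()
```

The same pattern could be applied to tags and relations. A fuller design would also write each update and rollback to the log component so that the actor, time and operation type are recoverable.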

4.2 Data management


We use the RDF Schema syntax to define a dataspace ontology in order to implement data management for the dataspace according to the data model discussed above. The code fragment of the ontology in RDF Schema is shown in Figure 2 (to save space, we present only the key fragment, replacing similar or regular code with an ellipsis).
We can then use this dataspace ontology to build a specific dataspace. Each dataspace is an instance of the dataspace ontology. Therefore, semantic web technologies can serve as management tools for dataspaces. Notably, SPARQL can be used to query a dataspace, while SPARQL-Update may be used to update it.
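
To show what querying such an instance involves, the following Python sketch stores a dataspace as subject-predicate-object triples and matches a single triple pattern, which is the basic building block of a SPARQL query. It is a deliberately simplified stand-in written in plain Python, not a SPARQL engine and not the iDLib code. The `ds:` prefix, the property names and the resource names are made up for illustration.

```python
# Each triple is (subject, predicate, object); all names are illustrative.
triples = {
    ("ds:book1", "rdf:type", "ds:DataspaceObject"),
    ("ds:book1", "ds:hasTag", "dataspace"),
    ("ds:book2", "rdf:type", "ds:DataspaceObject"),
    ("ds:book2", "ds:hasTag", "digital library"),
    ("ds:book1", "owl:sameAs", "ds:book2"),
}

def match(pattern, graph):
    """Return variable bindings for one triple pattern.

    Terms starting with "?" are variables, as in SPARQL; other terms must
    match exactly. Roughly: SELECT ?vars WHERE { pattern }.
    """
    bindings = []
    for triple in sorted(graph):
        row = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if row.get(term, value) != value:
                    break  # same variable bound to two different values
                row[term] = value
            elif term != value:
                break
        else:
            bindings.append(row)
    return bindings

def insert(triple, graph):
    """Rough counterpart of a SPARQL Update INSERT DATA for a single triple."""
    graph.add(triple)

insert(("ds:book2", "ds:hasTag", "dataspace"), triples)
tagged = match(("?obj", "ds:hasTag", "dataspace"), triples)
```

In a real deployment an RDF store with a SPARQL endpoint would take the place of the in-memory set, and the pattern would be written as a SPARQL query string against the ontology's actual vocabulary.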

Table III. Semantic relation

Element  Name                Description                                             Tag
z1       subClass            Oi is a subclass of Oj                                  subClassOf
z2       equivalentClass     Oi is an equivalent class of Oj                         equivalentClass
z3       Property            Oi is a property of concept Oj                          Property
z4       subProperty         Oi is a subproperty of concept Oj                       subPropertyOf
z5       equivalentProperty  Oi is an equivalent property of Oj                      equivalentProperty
z6       domain              Oi is the domain of property Oj                         domain
z7       range               Oi is the range of property Oj                          range
z8       Individual          Oi is an individual of concept Oj                       Individual
z9       sameAs              Oi is the same as Oj                                    sameAs
z10      differentFrom       Oi is different from Oj                                 differentFrom
z11      intersection        Oi is the intersection of n other objects including Oj  intersectionOf
z12      disjointWith        Oi is a disjoint class with class Oj                    disjointWith
z13      inverse             Oi is the inverse property of Oj                        inverseOf

Figure 1. A DTD for dataspace objects and relations

Figure 2. The code fragment of the ontology in RDF Schema

4.3 Reference framework

The reference framework is defined as a four-layer model, shown in Figure 3, which sits on top of the IT infrastructure and consumes the services it provides. In turn, the framework provides dataspace services for the upper-level applications in education, government or enterprise that are built on top of it. The model includes a data source manager, database manager, shared services, local data services, local application services, enterprise service bus and load balancer. In addition, there are two more components that cut across all the layers: data-driven engines and security. Data-driven engines are responsible for monitoring the data sources and informing the system of changes in them. Security management is a crucial component of the framework, and all the layers need a unified security solution.

Figure 3. Reference framework

This framework supports some principles and features of a dataspace:
• Its data model is designed in accordance with the pay-as-you-go principle, the data first, schema later or never principle, and the data networking principle, as discussed in section 2.1.
• Its data management is in line with the data coexisting principle and the incomplete control principle. Data are categorized into two types: data content and its usage information. Data contents commonly reside remotely in their original sources, while their usage records, notably log information and tag information, are stored locally in the system. Similarly, newly added services are deployed in this system while original services continue to reside in the data sources; these are called local services and remote services in the framework, respectively. Local services have incomplete control over the data because they may not only have restricted access to data providers, but may also lack some desired functions and have to forward data processing requests to remote participants.
• Its construction abides by the pay-as-you-go principle. The data in the system are created directly by its users, instead of being loaded or integrated from remote data sources in advance. Users' behaviors, such as searching, tagging, evaluating, updating, or navigating data, are recorded automatically in the system. As a result, the data in the system accumulate in accordance with the dynamic needs of its users, and the value of the system increases with its use.
• It sometimes provides only better effort results, because the quality of services may be affected by the amount of data that has been stored in the system and by the quality of the services provided by remote data sources.
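
The division of labour between local and remote services described above can be sketched as follows. In this Python illustration, usage information (logs and tags) is kept locally, while content requests are forwarded to the remote source that owns the data. The class `LocalService`, the `fetch` callables and the sample URI are hypothetical; the paper does not describe the actual iDLib services at this level of detail.

```python
from collections import defaultdict

class LocalService:
    """Stores usage information locally and forwards content requests."""

    def __init__(self, remote_sources):
        # remote_sources: dict mapping a source name to a fetch(uri) callable.
        self.remote_sources = remote_sources
        self.tags = defaultdict(set)   # uri -> tags, stored locally
        self.log = []                  # (user, action, uri), stored locally

    def tag(self, user, uri, tag):
        self.tags[uri].add(tag)
        self.log.append((user, "tag", uri))

    def read(self, user, source, uri):
        """Content stays remote: forward the request, record only the usage."""
        self.log.append((user, "read", uri))
        fetch = self.remote_sources.get(source)
        if fetch is None:
            return None  # incomplete control: this source is not reachable here
        return fetch(uri)

def university_library(uri):
    # Stand-in for a remote participant's own service.
    return f"content of {uri}"

service = LocalService({"university_library": university_library})
service.tag("alice", "urn:example:paper-7", "dataspace")
text = service.read("alice", "university_library", "urn:example:paper-7")
missing = service.read("alice", "unknown_source", "urn:example:paper-7")
```

Because every user action is appended to the local log, the locally held data grow with use, which is the pay-as-you-go behaviour the third bullet describes.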
We made comprehensive use of a wide range of technologies, such as J2EE, MySQL, ApacheDS, Berkeley DB XML, JBoss JRules, Memcache, OSWorkflow, Mule and ActiveMQ, and developed a digital library using the framework. At the time of writing, we have implemented the member methods that create the URI, content, log, tag, and relations, while their updating and searching methods remain under development. Apart from these methods, the other parts of the framework are almost fully implemented and the system has started to be put into use. The main interface is shown in Figure 4.

5. Conclusions 699
A dataspace differs from a database in that it adopts a set of revolutionary principles:
the pay-as-you-go principle, the data first, schema later or never principle, the data
networking principle, the data coexisting principle, and the incomplete control principle.
As a consequence of these underlying principles, dataspaces exhibit new features,
including uncertainty, incompleteness, flexibility, low information loss,
and better-effort services. The literature survey reveals that data models, data
management, dataspace support platforms, and personal information management are
becoming active topics in this research field. Dataspaces offer more flexible solutions
for data management over heterogeneous, uncertain, and distributed data sources.
Developing digital library platforms on top of a dataspace therefore not only
frees developers to focus on solving business challenges, but can also ensure high
interoperability between different digital libraries. This study attempts, for the first
time, to introduce dataspaces into the design and development of a digital library, and
proposes a novel data model and reference framework. Furthermore, we have
implemented the data model and reference framework and embedded
them in a digital library called iDLib. This study has important implications for
building a Dataspace Support Digital Library as well as for the study and implementation
of dataspaces. However, the following tasks remain to be done:
.
To justify the completeness of the dataspace object relation set. As listed in Table III,
we initialized the set with 13 types of recommended semantic relations. Because
its design is based mainly on an empirical study and digital libraries have their own
unique needs, a justification study is still lacking, and the set may need to
include new elements or exclude some of the existing ones.

Figure 4. iDLib: a dataspace-supported digital library

.
To further define the tag sets in the data model. We believe there are two types of
tags: required tags and optional tags. Required tags are the mandatory ones, such
as the format, address, and size of data resources, which are defined in the dataspace
model and are context independent. Optional tags are not mandated by the dataspace
management system; they are defined by specific application contexts in accordance
with the pay-as-you-go principle and the data first, schema later or never principle.
Therefore, required tags should be identified based on the good practices of
Dataspace Support Digital Libraries, including iDLib.
.
To refine the software system that implements the data model and the
reference framework. So far, we have completed only the programming for creating
the URI, content, log, tag, and relation of a dataspace object; its monitoring,
updating, and mining have not been implemented yet. In addition, the abstracting,
cleaning, and migrating of remote services in the data source managers need to be
adapted to the dataspace model because they currently serve other
purposes. Our ongoing efforts focus mainly on justifying the data model
and refining the software implementation according to usage feedback from the
digital library.
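The distinction between required and optional tags drawn in the task list above can be sketched as a small check. The three required names (format, address, size) come from the text; the function name `split_tags`, the example values, and the optional `subject` tag are hypothetical and serve only as an illustration.

```python
# Required tag names as given in the text; treating them as a fixed tuple
# is an assumption of this sketch.
REQUIRED_TAGS = ("format", "address", "size")


def split_tags(tags):
    """Separate required tags from optional, application-defined ones.

    Raises ValueError when a required tag is missing.
    """
    missing = [name for name in REQUIRED_TAGS if name not in tags]
    if missing:
        raise ValueError("missing required tags: " + ", ".join(missing))
    required = {name: tags[name] for name in REQUIRED_TAGS}
    optional = {k: v for k, v in tags.items() if k not in REQUIRED_TAGS}
    return required, optional


required, optional = split_tags({
    "format": "application/pdf",
    "address": "http://example.org/a.pdf",  # placeholder address
    "size": 1024,
    "subject": "dataspace",                 # hypothetical optional tag
})
```

Only the required set is enforced here; optional tags pass through unchecked, in keeping with the data first, schema later or never principle.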

References
Belhajjame, K., Paton, N.W., Embury, S., Fernandes, A.A.A. and Hedeler, C. (2010),
Feedback-based annotation, selection and refinement of schema mappings for
dataspaces, in EDBT 10: Proceedings of the 13th International Conference on
Extending Database Technology, Lausanne, Switzerland, March 22-26, p. 573-584.
Blunschi, L., Dittrich, J.P., Girard, O.R., Karakashian, S.K. and Vaz Salles, M.A. (2007),
A dataspace odyssey: the iMeMex Personal Dataspace Management System, in CIDR
2007, Asilomar, CA, January 7-10, pp. 1-6.
Das Sarma, A., Dong, X. and Halevy, A. (2008), Bootstrapping pay-as-you-go data integration
systems, in SIGMOD 08: Proceedings of the 2008 ACM SIGMOD International
Conference on Management of Data, ACM, New York, NY, pp. 861-874.
Dittrich, J.P., Antonio, M. and Salles, V. (2006), iDM: a unified and versatile data model for
personal dataspace management, in VLDB 06: Proceedings of the 32nd International
Conference on Very Large Data Bases, September 12-5, 2006, Seoul, Korea, VLDB
Endowment, pp. 367-378.
Dittrich, J.P., Vaz Salles, M.A., Kossmann, D. and Blunschi, L. (2005), iMeMex: escapes from the
personal information jungle, in VLDB 05: Proceedings of the 31st International
Conference on Very Large Data Bases, Trento, Italy, October 4-6, VLDB Endowment,
pp. 1306-309.
Dong, X. and Halevy, A. (2007), Indexing dataspaces, in SIGMOD 07: Proceedings of the 2007
ACM SIGMOD International Conference on Management of Data, Beijing, China, 11-14
June, pp. 43-54.
Dong, X.L. and Halevy, A.Y. (2005), A platform for personal information management and
integration, in Conference in Innovative Database Research (CIDR) 2005, Asilomar, CA,
January 4-7, pp. 119-130.
Franklin, M., Halevy, A. and Maier, D. (2005), From databases to dataspaces: a new abstraction
for information management, ACM SIGMOD Record, Vol. 34 No. 4, pp. 27-33.
Franklin, M., Halevy, A. and Maier, D. (2008), A first tutorial on dataspaces, in PVLDB 08,
Auckland, New Zealand, August 23-28, pp. 1516-1517.
Haas, L., Lin, E. and Roth, M. (2002), Data integration through database federation, Dataspace in
IBM Systems Journal, Vol. 41 No. 4, pp. 578-596.
digital libraries
Halevy, A., Franklin, M. and Maier, D. (2006), Principles of dataspace systems, in PODS 06:
Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles
of Database Systems, Chicago, IL, 2007, ACM, New York, NY, pp. 1-9.
Hedeler, C., Belhajjame, K., Fernandes, A.A., Embury, S.M. and Paton, N.W. (2009), Dimensions
of dataspaces, in Sexton, A.P. (Ed.), Dataspace: The Final Frontier, Springer, Berlin, 701
pp. 55-66.
Hedeler, C., Belhajjame, K., Paton, N.W., Campi, A., Fernandes, A.A.A. and Embury, S.M. (2010),
Dataspaces, in Ceri, S. and Brambilla, M. (Eds), Search Computing Challenges and
Directions, Springer, Berlin, pp. 114-134.
Downloaded by UNIVERSITAS SUMATERA UTARA At 00:50 16 March 2017 (PT)

Hedeler, C., Belhajjame, K., Paton, N.W., Fernandes, A.A.A., Embury, S.M., Mao, L. and Guo, C.J.
(2011), Pay-as-you-go mapping selection in dataspaces, in SIGMOD 11: Proceedings of
the 2011 International Conference on Management of Data, Athens, Greece, June 12-16,
pp. 1279-1282.
Howe, B., Maier, D., Rayner, N. and Rucker, J. (2008), Quarrying dataspaces: schemaless
profiling of unfamiliar information sources, in 2008 IEEE 24th International Conference
on Data Engineering workshop (ICDE Workshop 2008), Cancun, Mexico, 7-12 April,
p. 270-277.
Jeffery, S.R., Franklin, M.J. and Halevy, A.Y. (2008), Pay-as-you-go user feedback for dataspace
systems, in SIGMOD 08: Proceedings of the 2008 ACM SIGMOD International
Conference on Management of Data, ACM, New York, NY, pp. 847-860.
Karger, D.R., Bakshi, K., Huynh, D., Quan, D. and Sinha, V. (2005), Haystack: a customizable
general-purpose information management tool for end users of semistructured data,
in The 2nd Conference on Innovative Data Systems Research (CIDR 2005), ACM, New
York, NY, pp. 13-26.
Leser, U. and Naumann, F. (2005), Hands-off information integration for the life sciences,
in Conference in Innovative Database Research (CIDR) 2005, Asilomar, CA, January 4-7,
pp. 131-143.
Li, Y.K. (2011), A framework towards task-based query in personal dataspace, in 2011 Seventh
International Conference on Semantics Knowledge and Grid (SKG),Beijing, China, 24-26
October, pp. 215-218.
Li, Y.K., Meng, X.F. and Kou, Y.B. (2009), An efficient method for constructing personal
dataspace, in 2009 Web Information Systems and Applications Conference, Xuzhou,
China, 18-20 September, pp. 3-8.
Li, Y.K., Meng, X.F. and Zhang, X.Y. (2008), Research on dataspace, Journal of Software, Vol. 19
No. 8, pp. 2018-2031.
Liu, D.B., Yang, D., Nie, T.Z., Kou, Y. and Shen, D.R. (2010), Document clustering in personal
dataspace, in 2010 Web Information Systems and Applications Conference, Huhehot,
China, 20-22 August, pp. 9-12.
Madhavan, J., Cohen, S., Dong, X.L., Halevy, A.Y., Jeffery, S.R., Ko, D. and Yu, C. (2007),
Web-scale data integration: you can only afford to pay as you go, in Conference in
Innovative Database Research (CIDR) 2007, Asilomar CA, January 7-10, pp. 342-350.
Mirza, H.T., Chen, L. and Chen, G. (2010), Practicability of dataspace systems, International
Journal of Digital Content Technology and its Applications, Vol. 4 No. 3, pp. 233-243.
EL Pradhan, S. (2007), Towards a novel desktop search technique, in Wagner, R., Revell, N. and
Pernul, G. (Eds), D Database and Expert Systems Applications, LNCS, Springer,
31,6 Heidelberg, pp. 192-201.
Sarma, A., Dong, X. and Halevy, A. (2009), Data modeling in dataspace support platform,
in Borgida, A.T., Chaudhri, V.K., Giorgini, P. and Yu, E.S. (Eds), Conceptual Modeling:
Foundations and Applications, LNCS, Springer, Heidelberg, pp. 122-138.
702 Schindler, S., Hauswirth, M. and Koenig-Ries, B. (2011), Navigating in a heterogeneous
dataspace, in Proceedings of the ACM WebSci11, June 14-17, Koblenz, Germany, p. 1-2.
Silberschatz, A., Korth, H.F. and Sudarshan, S. (2009), Database System Concepts, 6th ed.,
McGraw-Hill, New York, NY, pp. 8-9.
Singh, M. and Jain, S.K. (2011), A survey on dataspace, in Wyld, D.C. (Ed.), Advances in
Downloaded by UNIVERSITAS SUMATERA UTARA At 00:50 16 March 2017 (PT)

Network Security and Applications, Springer, Berlin, pp. 608-621.


Song, S.Y., Chen, L. and Yuan, M.X. (2010), Materialization and decomposition of dataspaces for
efficient search, IEEE Transactions on Knowledge and Data Engineering, Vol. 23 No. 12,
pp. 1872-1887.
Vaz Salles, M.A. and Dittrich, J.P. (2006), iMeMex: a platform for personal dataspace
management, in the VLDB2006 PhD Workshop, Seoul, Republic of Korea, September 11,
2006.
Vaz Salles, M.A., Dittrich, J.P., Karakashian, S.K., Girard, O.R. and Blunschi, L. (2007), iTrails:
pay-as-you-go information integration in dataspaces, in VLDB 07: Proceedings of the
33rd International Conference on Very Large Data Bases, University of Vienna, Austria,
September 23-27, pp. 663-674.

Corresponding author
Chaolemen Borjigin can be contacted at: chaolemen@pku.org.cn
