Dataspace and its application in digital libraries
Chaolemen Borjigin
Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, MOE, Renmin University of China, Beijing, PR China and Tsinghua University, Beijing, PR China, and
Yong Zhang, Chunxiao Xing, Chao Lan and Jian Zhang
Research Institute of Information Technology (RIIT), Tsinghua University, Beijing, PR China
Received 17 February 2012
Accepted 9 May 2012
Downloaded by UNIVERSITAS SUMATERA UTARA At 00:50 16 March 2017 (PT)
Abstract
Purpose: This paper aims to introduce dataspace into digital libraries in order to address their emerging data management challenges, which stem from the cross-domain, heterogeneous, and uncertain nature of data resources, by examining the fundamental principles, significant features, and research directions of dataspace.
Design/methodology/approach: This research employs three types of research methodology: a literature study was conducted to reveal the fundamental principles, analyze the significant features, and discuss the new research topics; knowledge engineering methodology is used to design the data model; and software engineering methodology is applied to develop the reference framework and the digital library.
Findings: This paper proposes, for the first time, building a Dataspace Support Digital Library (DSDL), and provides its data model, data management policies and a reference framework. Its implementation is then described, and some implications drawn from the case study are discussed.
Practical implications: Introducing dataspace technologies into the development of a digital library frees the developer to focus on solving business challenges, rather than addressing non-business-related tasks at the data management level. In addition, the data model and the reference framework presented in this paper lay foundations for constructing Dataspace Support Digital Libraries.
Originality/value: This is the first paper to introduce dataspaces into the design of digital libraries and is also the first paper to propose a novel data model, data management policies and reference framework for Dataspace Support Digital Libraries.
Keywords Databases, Digital libraries, Dataspace, Personal information management
Paper type Research paper
The Electronic Library, Vol. 31 No. 6, 2013, pp. 688-702. © Emerald Group Publishing Limited, 0264-0473. DOI 10.1108/EL-02-2012-0017
This work was funded by Natural Science Foundation of China (No. 71103020), Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China), Ministry of Education (No. KF2011001), National High Technology Research and Development Program of China (863 Project, Project Number: 2009AA01Z143) and National Natural Science Foundation of China (Key Program No. 71133006).
1. Introduction
One of the most acute challenges faced by current digital libraries is how to handle heterogeneous, uncertain and distributed data resources. However, current digital library applications fail to meet these challenges because they are built on top of a Database Management System (DBMS) that is ill-suited to this new scenario. The design of a digital library must not only meet its key business requirements, but also deal with the cross-domain nature, heterogeneity, and uncertainty of data resources. As a result, digital library development is costly and some work at the data management level is repeated. Further, digital libraries built by different designers have difficulty interacting with one another because their data management solutions differ. Franklin et al. (2005) proposed dataspace as a new abstraction for data management in such scenarios. Since its introduction, dataspace has drawn increasing attention from the database community and from related research fields such as information management, data integration and software engineering. This study introduces this new solution into developing digital libraries in order to address the
2. Dataspace
Dataspace can be defined as a set of participants and a set of relationships among them (Singh and Jain, 2011). The participants provide it with their data resources and computing services. Dataspace should not only integrate the resources from participants, notably cross-domain, heterogeneous and uncertain ones, but also manage the relationships between them, such as overlapping, conflicting, inheriting, homogeneous, matching or mapping relationships. Notably, dataspace is not merely a new kind of data integration solution, because of its main components, fundamental principles and significant features. A typical dataspace system involves catalog, local storage and index, search and query, discovery, enhancement, administration, and other components (Franklin et al., 2005). The catalog records the participants and their mutual relationships. The local storage and index are mainly used to support query, access and management of its resources. The search and query component provides a variety of searching services, such as unified, structured and metadata queries. The discovery component is responsible for finding participants and mining relationships between them. The enhancement component is designed to extend the data management capabilities of participants if they fail to provide the required functionality. The administration component is mainly used to manage the other parts.
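The catalog component described above can be pictured with a minimal sketch. It assumes that participants and relationships are plain records held in memory; the class names, field names and the sample participants ("opac", "theses") are hypothetical illustrations, not part of any dataspace system cited here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Participant:
    """A data source or service that contributes to the dataspace."""
    name: str
    kind: str  # e.g. "relational database", "file system", "web service"


@dataclass(frozen=True)
class Relationship:
    """A typed relationship between two participants."""
    source: str
    target: str
    kind: str  # e.g. "overlapping", "conflicting", "mapping"


@dataclass
class Catalog:
    """Records the participants and their mutual relationships."""
    participants: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def register(self, participant: Participant) -> None:
        self.participants[participant.name] = participant

    def relate(self, source: str, target: str, kind: str) -> None:
        # Both ends must already be known to the catalog.
        for name in (source, target):
            if name not in self.participants:
                raise KeyError(f"unknown participant: {name}")
        self.relationships.append(Relationship(source, target, kind))

    def related_to(self, name: str) -> list:
        """Return every relationship in which the participant takes part."""
        return [r for r in self.relationships if name in (r.source, r.target)]


# Hypothetical participants of a small digital library dataspace.
catalog = Catalog()
catalog.register(Participant("opac", "relational database"))
catalog.register(Participant("theses", "file system"))
catalog.relate("opac", "theses", "overlapping")
print(catalog.related_to("theses"))
```

A discovery component, in this picture, would be whatever code calls `register` and `relate` as it finds new participants and mines relationships between them.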
Database Management Systems (DBMS) and they have full control over the data stored in them. Unlike a traditional DBMS, a Dataspace Management System (DSMS), which plays a role similar to that of a DBMS, has only incomplete control over the data in it. Data in a dataspace may be controlled not only by its DSMS, but also by its providers. The functionality of a dataspace may increase or decrease with use, in line with the pay-as-you-go principle.
traditional information integration approaches, because of its data first, schema later or never principle and its data coexisting principle. Owing to the data first, schema later or never principle, there is no format transformation or approximate calculation. This avoids the loss of information when data sets are integrated to provide services across its data sources.
• Better effort services. This feature has its roots in the pay-as-you-go principle, the data first, schema later or never principle, and the incomplete control principle. The pay-as-you-go principle means that a dataspace has to provide services based only on the resources it has already embraced. A dataspace may sometimes be unable to reach all the desired resources, and then returns only an approximate answer or a better effort result. The data first, schema later or never principle affects the accuracy of data computing in that it lacks mathematical foundations. Incomplete control over its data resources means that the quality of services in a dataspace depends heavily on its participants. Therefore, unlike databases, dataspaces provide better effort services rather than best services.
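The idea of a better effort result can be sketched as a search that tolerates unreachable participants. This is only an illustration under stated assumptions: each source is modelled as a plain function, an offline participant is simulated with `ConnectionError`, and all names (`better_effort_search`, `local_index`, `remote_archive`) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A source is modelled as a function from a keyword to matching records.
Source = Callable[[str], List[str]]


def better_effort_search(
    keyword: str, sources: Dict[str, Source]
) -> Tuple[List[str], List[str]]:
    """Query every source, tolerating failures.

    Returns the merged hits plus the names of the sources that could not
    be reached, so the caller knows the answer may be incomplete.
    """
    hits: List[str] = []
    unreachable: List[str] = []
    for name, search in sources.items():
        try:
            hits.extend(search(keyword))
        except ConnectionError:
            # Incomplete control: the participant decides whether it answers.
            unreachable.append(name)
    return hits, unreachable


def local_index(keyword: str) -> List[str]:
    """Hypothetical local index holding two records."""
    records = ["dataspace survey", "digital library design"]
    return [r for r in records if keyword in r]


def remote_archive(keyword: str) -> List[str]:
    """Hypothetical remote participant that is currently offline."""
    raise ConnectionError("remote participant is offline")


hits, unreachable = better_effort_search(
    "dataspace", {"local": local_index, "remote": remote_archive}
)
print(hits, unreachable)
```

Returning the list of unreachable sources alongside the hits is a design choice of this sketch: it makes the approximate nature of the answer visible to the caller instead of hiding it.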
dataspace applications also need a set of fundamental tools, which may be called the
Dataspace Management System (DSMS). Although a practicable DSMS has not been
developed yet, some key problems in it have been discussed. Li et al. (2008) proposed a
framework for dataspace integration and management, which consists of four main parts:
(1) data integration engine;
(2) dataspace engine;
(3) evolution engine; and
(4) output engine.
In addition, research problems in indexing (Dong and Halevy, 2007), integrating (Vaz Salles et al., 2007), mapping (Belhajjame et al., 2010; Hedeler et al., 2011), searching (Song et al., 2010), and navigating (Schindler et al., 2011) dataspaces have also been discussed.
URI. In addition, the content component should include its unique ID, version number and either the full content or an address to the content. The log component should contain the time, actor and type of a transaction in order to support data recovery and version control in a dataspace. Tag components hold the tags attached to a dataspace object and must record each tag's ID, content and logs. The relation component records the object's relations with other dataspace objects and contains the URI, type and address of each related object. The current version number and the prior version number are used to roll back content, tags or relations in a dataspace.
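As a rough illustration of the components just described, the following sketch models a dataspace object with Python dataclasses. It is not the DTD of Figure 1: the class layout, the `write` and `rollback` helpers, the `#v` version suffix and the sample URI are all assumptions made for the example, and rollback is shown for content only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class LogEntry:
    time: datetime
    actor: str
    type: str  # kind of transaction, e.g. "create" or "update"


@dataclass
class Content:
    id: str
    version: int
    prior_version: Optional[int]
    body: str  # full content, or an address pointing to it


@dataclass
class Tag:
    id: str
    content: str
    logs: List[LogEntry] = field(default_factory=list)


@dataclass
class Relation:
    uri: str      # URI of the related object
    type: str     # e.g. "overlapping" or "mapping"
    address: str  # where the related object can be reached


@dataclass
class DataspaceObject:
    uri: str
    versions: List[Content] = field(default_factory=list)
    logs: List[LogEntry] = field(default_factory=list)
    tags: List[Tag] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def _log(self, actor: str, type_: str) -> None:
        self.logs.append(LogEntry(datetime.now(timezone.utc), actor, type_))

    def write(self, actor: str, body: str) -> Content:
        """Store a new content version that points back to the previous one."""
        prior = self.versions[-1].version if self.versions else None
        version = 1 if prior is None else prior + 1
        content = Content(f"{self.uri}#v{version}", version, prior, body)
        self.versions.append(content)
        self._log(actor, "create" if prior is None else "update")
        return content

    def rollback(self, actor: str) -> Content:
        """Drop the current version and return to the prior one."""
        if len(self.versions) < 2:
            raise ValueError("no prior version to roll back to")
        self.versions.pop()
        self._log(actor, "rollback")
        return self.versions[-1]


# Hypothetical URI, used only for illustration.
obj = DataspaceObject(uri="ds://idlib/object/42")
obj.write("alice", "first draft")
obj.write("bob", "second draft")
print(obj.rollback("alice").body)
print([entry.type for entry in obj.logs])
```

Keeping the prior version number on each `Content` record mirrors the text: the pair of current and prior version numbers is what makes the rollback step possible.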
Figure 1. A DTD for dataspace objects and relations
Figure 2. The code fragment of the ontology in RDF Schema
Figure 3. Reference framework
and their use records, notably log information and tag information, are stored locally in the system. Similarly, newly added services are deployed in this system while the original services remain in their data sources; these are called local services and remote services, respectively, in the framework. Local services have incomplete control over the data because they may not only have restricted access to data providers, but may also lack some desired functions and have to forward data processing requests to remote participants.
• Its construction abides by the pay-as-you-go principle. The data in the system is created directly by its users, instead of being loaded or integrated from remote data sources in advance. User behaviors, such as searching, tagging, evaluating, updating, or navigating data, are recorded automatically in the system. As a result, the data in the system accumulates in accordance with the dynamic needs of its users, and the value of the system increases with its use.
• It sometimes provides only better effort results, because the quality of its services may be affected by the amount of data stored in the system and by the quality of the services provided by remote data sources.
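The behavior described in these points can be sketched as a local store that starts empty, records every user action, and forwards reads it cannot serve to a remote source. This is a simplified illustration rather than the implementation of the reference framework: the class `PayAsYouGoStore`, its methods and the `ds://` URIs are hypothetical, and the remote source is simulated with a dictionary lookup.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Optional, Tuple


class PayAsYouGoStore:
    """Local store that starts empty and grows only through use."""

    def __init__(self, fetch_remote: Callable[[str], Optional[str]]):
        self._fetch_remote = fetch_remote  # stands in for a remote service
        self.local: Dict[str, str] = {}    # data created directly by users
        self.tags: Dict[str, List[str]] = defaultdict(list)
        self.use_log: List[Tuple[str, str, str]] = []  # (user, action, uri)

    def _record(self, user: str, action: str, uri: str) -> None:
        self.use_log.append((user, action, uri))

    def create(self, user: str, uri: str, body: str) -> None:
        self._record(user, "create", uri)
        self.local[uri] = body

    def read(self, user: str, uri: str) -> Optional[str]:
        """Serve local data, otherwise forward the request to the remote source."""
        self._record(user, "read", uri)
        if uri in self.local:
            return self.local[uri]
        # Remote content stays at its source; None means nothing was available.
        return self._fetch_remote(uri)

    def tag(self, user: str, uri: str, label: str) -> None:
        self._record(user, "tag", uri)
        self.tags[uri].append(label)

    def usage(self) -> Counter:
        """Count the recorded actions per URI."""
        return Counter(uri for _, _, uri in self.use_log)


# A dictionary lookup simulates the remote data source.
remote = {"ds://remote/paper/1": "remote full text"}
store = PayAsYouGoStore(remote.get)
store.create("alice", "ds://local/note/1", "reading notes")
store.read("bob", "ds://remote/paper/1")
store.tag("bob", "ds://remote/paper/1", "dataspace")
print(len(store.local), store.usage())
```

Only the user-created note and the use records live locally here; the remote paper is never copied, which is one way to read the statement that remote contents are not loaded in advance.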
We made comprehensive use of a wide range of technologies, such as J2EE, MySQL, ApacheDS, Berkeley DB XML, JBoss JRules, Memcache, OSWorkflow, Mule and ActiveMQ, and developed a digital library using the framework. At the time of writing, we have implemented the member methods that create URIs, content, logs, tags, and relations; the corresponding updating and searching methods are still under development. Apart from these methods, the other parts of the framework are almost fully implemented and the system has started to be put into use. The main interface is shown in Figure 4.
5. Conclusions
A dataspace differs from a database in that it employs some revolutionary principles, such as the pay-as-you-go principle, the data first, schema later or never principle, the data networking principle, the data coexisting principle, and the incomplete control principle. Thanks to these underlying principles, dataspaces exhibit new features, including uncertainty, incompleteness, flexibility, low information loss, and better effort services. The literature survey reveals that data models, data management, dataspace support platforms, and personal information management are becoming hot topics in this research field. Dataspace offers more flexible solutions for data management over heterogeneous, uncertain and distributed data sources. Developing digital library platforms on top of dataspace, therefore, can not only free the developer to focus on solving business challenges, but also ensure high interoperability between different digital libraries. This study, for the first time, has sought to introduce dataspace into the design and development of a digital library, and has proposed a novel data model and reference framework. Furthermore, we have successfully implemented the data model and reference framework and embedded them in a digital library called iDLib. This study will have great implications for building a Dataspace Support Digital Library as well as for the study and implementation of dataspaces. However, the following tasks remain to be done:
• To justify the completeness of the dataspace object relation set. As listed in Table III, we initialized the set with 13 types of recommended semantic relations. Because its design is mainly based on an empirical study and the digital library has unique needs, a justification study is still lacking, and the set may need to include new elements or exclude some of the existing ones.
Figure 4. iDLib: a dataspace supported digital library
• To further define the tag sets in the data model. We believe there are two types of tags: required tags and optional tags. Required tags are the mandatory ones, such as the format, address, and size of data resources, which are defined in the dataspace model and are context independent. Optional tags are not mandated by the dataspace management system; they are defined by specific application contexts in accordance with the pay-as-you-go principle and the data first, schema later or never principle. Therefore, required tags should be identified based on the good practices of Dataspace Support Digital Libraries, including iDLib.
• To refine the software system which implements the data model and the reference framework. So far, we have only completed the programming for creating the URI, content, log, tag, and relation of a dataspace object and its monitoring,
References
Belhajjame, K., Paton, N.W., Embury, S., Fernandes, A.A.A. and Hedeler, C. (2010),
Feedback-based annotation, selection and refinement of schema mappings for
dataspaces, in EDBT 10: Proceedings of the 13th International Conference on
Extending Database Technology, Lausanne, Switzerland, March 22-26, pp. 573-584.
Blunschi, L., Dittrich, J.P., Girard, O.R., Karakashian, S.K. and Vaz Salles, M.A. (2007),
A dataspace odyssey: the iMeMex Personal Dataspace Management System, in CIDR
2007, Asilomar, CA, January 7-10, pp. 1-6.
Das Sarma, A., Dong, X. and Halevy, A. (2008), Bootstrapping pay-as-you-go data integration
systems, in SIGMOD 08: Proceedings of the 2008 ACM SIGMOD International
Conference on Management of Data, ACM, New York, NY, pp. 861-874.
Dittrich, J.P. and Vaz Salles, M.A. (2006), iDM: a unified and versatile data model for
personal dataspace management, in VLDB 06: Proceedings of the 32nd International
Conference on Very Large Data Bases, September 12-15, 2006, Seoul, Korea, VLDB
Endowment, pp. 367-378.
Dittrich, J.P., Vaz Salles, M.A., Kossmann, D. and Blunschi, L. (2005), iMeMex: escapes from the
personal information jungle, in VLDB 05: Proceedings of the 31st International
Conference on Very Large Data Bases, Trento, Italy, October 4-6, VLDB Endowment,
pp. 1306-1309.
Dong, X. and Halevy, A. (2007), Indexing dataspaces, in SIGMOD 07: Proceedings of the 2007
ACM SIGMOD International Conference on Management of Data, Beijing, China, 11-14
June, pp. 43-54.
Dong, X.L. and Halevy, A.Y. (2005), A platform for personal information management and
integration, in Conference in Innovative Database Research (CIDR) 2005, Asilomar, CA,
January 4-7, pp. 119-130.
Franklin, M., Halevy, A. and Maier, D. (2005), From databases to dataspaces: a new abstraction
for information management, ACM SIGMOD Record, Vol. 34 No. 4, pp. 27-33.
Franklin, M., Halevy, A. and Maier, D. (2008), A first tutorial on dataspaces, in PVLDB 08,
Auckland, New Zealand, August 23-28, pp. 1516-1517.
Haas, L., Lin, E. and Roth, M. (2002), Data integration through database federation,
IBM Systems Journal, Vol. 41 No. 4, pp. 578-596.
Halevy, A., Franklin, M. and Maier, D. (2006), Principles of dataspace systems, in PODS 06:
Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART Symposium on Principles
of Database Systems, Chicago, IL, 2007, ACM, New York, NY, pp. 1-9.
Hedeler, C., Belhajjame, K., Fernandes, A.A., Embury, S.M. and Paton, N.W. (2009), Dimensions
of dataspaces, in Sexton, A.P. (Ed.), Dataspace: The Final Frontier, Springer, Berlin,
pp. 55-66.
Hedeler, C., Belhajjame, K., Paton, N.W., Campi, A., Fernandes, A.A.A. and Embury, S.M. (2010),
Dataspaces, in Ceri, S. and Brambilla, M. (Eds), Search Computing Challenges and
Directions, Springer, Berlin, pp. 114-134.
Hedeler, C., Belhajjame, K., Paton, N.W., Fernandes, A.A.A., Embury, S.M., Mao, L. and Guo, C.J.
(2011), Pay-as-you-go mapping selection in dataspaces, in SIGMOD 11: Proceedings of
the 2011 International Conference on Management of Data, Athens, Greece, June 12-16,
pp. 1279-1282.
Howe, B., Maier, D., Rayner, N. and Rucker, J. (2008), Quarrying dataspaces: schemaless
profiling of unfamiliar information sources, in 2008 IEEE 24th International Conference
on Data Engineering workshop (ICDE Workshop 2008), Cancun, Mexico, 7-12 April,
pp. 270-277.
Jeffery, S.R., Franklin, M.J. and Halevy, A.Y. (2008), Pay-as-you-go user feedback for dataspace
systems, in SIGMOD 08: Proceedings of the 2008 ACM SIGMOD International
Conference on Management of Data, ACM, New York, NY, pp. 847-860.
Karger, D.R., Bakshi, K., Huynh, D., Quan, D. and Sinha, V. (2005), Haystack: a customizable
general-purpose information management tool for end users of semistructured data,
in The 2nd Conference on Innovative Data Systems Research (CIDR 2005), ACM, New
York, NY, pp. 13-26.
Leser, U. and Naumann, F. (2005), Hands-off information integration for the life sciences,
in Conference in Innovative Database Research (CIDR) 2005, Asilomar, CA, January 4-7,
pp. 131-143.
Li, Y.K. (2011), A framework towards task-based query in personal dataspace, in 2011 Seventh
International Conference on Semantics Knowledge and Grid (SKG), Beijing, China, 24-26
October, pp. 215-218.
Li, Y.K., Meng, X.F. and Kou, Y.B. (2009), An efficient method for constructing personal
dataspace, in 2009 Web Information Systems and Applications Conference, Xuzhou,
China, 18-20 September, pp. 3-8.
Li, Y.K., Meng, X.F. and Zhang, X.Y. (2008), Research on dataspace, Journal of Software, Vol. 19
No. 8, pp. 2018-2031.
Liu, D.B., Yang, D., Nie, T.Z., Kou, Y. and Shen, D.R. (2010), Document clustering in personal
dataspace, in 2010 Web Information Systems and Applications Conference, Huhehot,
China, 20-22 August, pp. 9-12.
Madhavan, J., Cohen, S., Dong, X.L., Halevy, A.Y., Jeffery, S.R., Ko, D. and Yu, C. (2007),
Web-scale data integration: you can only afford to pay as you go, in Conference in
Innovative Database Research (CIDR) 2007, Asilomar CA, January 7-10, pp. 342-350.
Mirza, H.T., Chen, L. and Chen, G. (2010), Practicability of dataspace systems, International
Journal of Digital Content Technology and its Applications, Vol. 4 No. 3, pp. 233-243.
Pradhan, S. (2007), Towards a novel desktop search technique, in Wagner, R., Revell, N. and
Pernul, G. (Eds), Database and Expert Systems Applications, LNCS, Springer,
Heidelberg, pp. 192-201.
Sarma, A., Dong, X. and Halevy, A. (2009), Data modeling in dataspace support platform,
in Borgida, A.T., Chaudhri, V.K., Giorgini, P. and Yu, E.S. (Eds), Conceptual Modeling:
Foundations and Applications, LNCS, Springer, Heidelberg, pp. 122-138.
Schindler, S., Hauswirth, M. and Koenig-Ries, B. (2011), Navigating in a heterogeneous
dataspace, in Proceedings of the ACM WebSci11, June 14-17, Koblenz, Germany, pp. 1-2.
Silberschatz, A., Korth, H.F. and Sudarshan, S. (2009), Database System Concepts, 6th ed.,
McGraw-Hill, New York, NY, pp. 8-9.
Singh, M. and Jain, S.K. (2011), A survey on dataspace, in Wyld, D.C. (Ed.), Advances in
Corresponding author
Chaolemen Borjigin can be contacted at: chaolemen@pku.org.cn