
RDF-based Model for Encoding Document Hierarchies

Ma. Auxilio Medina
Universidad de las Américas - Puebla
Ex-hacienda Santa Catarina Mártir S/N, Cholula, Puebla, México
mauxmedina@gmail.com

J. Alfredo Sánchez
Universidad de las Américas - Puebla
Ex-hacienda Santa Catarina Mártir S/N, Cholula, Puebla, México
alfredo.sanchez@udlap.mx

R. Omar Chávez
Universidad Tecnológica de la Mixteca
Km. 2.5 Carretera Acatlima, Huajuapan de León, Oaxaca, México
r.omar.cg@gmail.com

Abstract
Some document clustering tools automatically or semi-automatically produce document hierarchies. Commonly, they are stored as text or graphical files. As a result, applications cannot use them without some kind of processing. A more recent representation of document hierarchies employs the Extensible Markup Language (XML). However, the reusability of this representation is diminished because the semantics of XML elements is not machine accessible. This paper proposes the use of a Resource Description Framework (RDF) model to overcome these drawbacks.

17th International Conference on Electronics, Communications and Computers (CONIELECOMP'07) 0-7695-2799-X/07 $20.00 2007

1 Introduction

Category labels or subject headings have been widely used to support the organization of documents. However, when dealing with large collections of documents, or when categories are unknown, it becomes necessary to look for other alternatives. Document clustering automatically groups documents into clusters such that documents within a cluster have high internal similarity, whereas documents in different clusters are dissimilar [10]. It relies on a similarity function defined on documents. Diverse algorithms support document clustering. We are interested in the hierarchical ones because they organize clusters into a tree that facilitates browsing. Previous work on reference collections has shown that hierarchical document clustering algorithms produce high-quality clusters [10], [12]. Common implementations of these algorithms often store the tree of clusters in a text file or as a graphic. These representations need to be processed before applications can use them to support tasks such as editing, browsing or information retrieval. In order to overcome this disadvantage, the paper proposes to describe document hierarchies by means of a Resource Description Framework model. This model makes semantic information machine accessible.

The remainder of the paper is organized as follows: Section 2 contains related work. It is followed by Section 3, where the research context is described. The model is presented in Section 4. Section 5 describes an example that shows how to instantiate the proposed model. We conclude with a discussion and some ideas for future work.

2 Related Work

Document clustering has been widely investigated as a means of improving information retrieval. However, the reusability of the output of clustering algorithms has not received much attention. This section describes some related work.

DocCluster [4] is a system that supports automatic construction of document hierarchies. It implements the Frequent Itemset-based Hierarchical Clustering algorithm (FIHC) [5]. Documents are stored in text files that form a directory. Text files only contain terms; numbers, labels and punctuation marks are not included. The constructed hierarchies are stored as well-formed XML documents. The main element of a hierarchy produced by DocCluster is called root. It is formed by a documents element, which contains the total number of documents, and one or more cluster elements. The root element has a list of attributes that represent metadata of the hierarchy, such as F-measure, Entropy_h, Entropy_f, number of clusters, number of children and number of documents. A cluster element is formed by zero or more documents enclosed in Cdocuments tags. The cluster attributes are its label, the number of child clusters and the number of documents. The names of text files are contained in document elements.

CORPORUM [3] is another tool that constructs document hierarchies by using natural language techniques. It implements two main tasks: interpretation of natural language texts and extraction of specific information from free text. CORPORUM has a morphological component and a relation-determining engine (sub-class of/instance of) to identify classes, sub-class relationships and instances. This tool can be applied to semi-structured and structured documents. It uses XML to encode document hierarchies and RDF to represent taxonomic relationships. CORPORUM is applied to documents that have full text content.
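To make the reuse problem concrete, the following minimal Python sketch reads a hierarchy laid out roughly like the DocCluster output described above. The sample document, its element nesting and its attribute names (label, numChildren, numDocuments, total) are assumptions made for illustration only; they are not taken from DocCluster itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are illustrative
# assumptions, not the actual DocCluster output format.
SAMPLE = """
<root numClusters="2" numDocuments="3">
  <documents total="3">
    <cluster label="physics" numChildren="1" numDocuments="2">
      <document>doc1.txt</document>
      <document>doc2.txt</document>
      <cluster label="optics" numChildren="0" numDocuments="1">
        <document>doc3.txt</document>
      </cluster>
    </cluster>
  </documents>
</root>
"""


def walk(cluster, depth=0):
    """Yield (depth, label, file names) for a cluster and its descendants."""
    files = [d.text.strip() for d in cluster.findall("document")]
    yield depth, cluster.get("label"), files
    # The code, not the data, decides that a nested <cluster> is a sub-cluster.
    for child in cluster.findall("cluster"):
        yield from walk(child, depth + 1)


def flatten(xml_text):
    """Return the whole hierarchy as a list of (depth, label, files) rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for top in root.find("documents").findall("cluster"):
        rows.extend(walk(top))
    return rows


rows = flatten(SAMPLE)
for depth, label, files in rows:
    print("  " * depth + f"{label}: {files}")
```

Note that the meaning of the nesting (a cluster inside a cluster is a sub-cluster, a document inside a cluster is a member) lives only in the walk function. Another application reading the same file would have to hard-code the same interpretation.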

3 Research Context

The Open Archives Initiative (OAI) is an organization whose goal is to create simple standards to support interoperability in Digital Libraries (DLs). There are two types of participants in OAI: (1) data providers, which expose metadata about their resources in semi-structured documents termed records, and (2) service providers, which use records to offer value-added services [6]. OAI proposes a low-level mechanism called the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to offer application-independent interoperability and external online access to records. The protocol uses HTTP and XML to encode requests and responses, respectively. OAI-PMH does not offer search mechanisms. Thus, it is difficult (if not impossible) to find relevant records in data providers. Search engines and meta search engines have been widely used for this purpose. Furthermore, some members of OAI have developed tools to retrieve information from data providers¹. However, when multiple data providers are involved, users need to access, select and organize records in order to determine their relevance. We have constructed OntoSIR, a multi-collection retrieval service, as a modest contribution in this direction [7]. In OntoSIR, software agents receive user queries and suggest a list of relevant records. The records of a data provider are represented as a document hierarchy that we have called ontology of records. In [8] we proposed an XML-based implementation of an ontology of records where taxonomic relationships are expressed in the Resource Description Framework (RDF). The RDF model proposed in this paper is an extension of that work.

¹ OAI tools list: http://www.openarchives.org/tools/tools.html

4 RDF Model of a Document Hierarchy

The use of XML to represent document hierarchies inherits the advantages of the language itself, such as human and machine readability, extensibility and portability. However, there are two main disadvantages: 1) tag nesting does not have a unique interpretation and 2) the semantics of the elements is not machine accessible. According to the pyramid of languages proposed to reach the Semantic Web vision [11], the Resource Description Framework (RDF) overcomes these drawbacks. It is a framework for metadata description developed by the World Wide Web Consortium (W3C). It has already been published as a W3C Recommendation [9]. RDF is a domain-independent data model that groups statements [1]. Each statement is a triple formed by a subject (or resource), an attribute (or property), and a value (a resource or a literal). Frequently, RDF uses XML to represent its syntax. In this case, RDF data types are predefined by an XML-Schema. However, other representations are also possible. By means of RDF-Schema (a language for declaring basic classes and types to describe the terms used in RDF documents), it is possible to define a vocabulary, specify properties and restrict their range in order to represent assumptions about any particular application domain. RDF-Schema makes semantic information machine accessible [2].

Figure 1 shows a semantic network of a document hierarchy. The ovals represent the basic classes and the edges show relationships between them. By using an RDF interpretation, two ovals connected with an edge represent an RDF statement where the oval at the head of the edge is the subject and the oval at the tail of the edge is the value. The names of the attributes of the statements are not included in the figure, to keep the relationships between classes as clear as possible. A brief interpretation of the semantic network is as follows. A document hierarchy has two main elements: the name of the algorithm that produces it (Calgorithm), and zero or more clusters (Cclusters). A cluster is formed by a set of documents (Cdocuments). A document is described by a label (Clabel) and has a number that represents its level in the document hierarchy (Clevel). A cluster can include one or more input parameters (Cparameters). Documents can be described by different attributes, although in Figure 1 just an identifier (Cidentifier) and a source (Csource) are shown. Note that the bottom ovals represent two different data types (integer, string).

Figure 1. Semantic network that represents a document hierarchy

Table 1 shows an excerpt of the encoding of a document hierarchy using RDF. The complete representation is available at the URL shown in the xml:base attribute. It was validated with a W3C tool². This representation allows users to express taxonomic rules of a generic document hierarchy. It is machine accessible. Table 1 has been divided into sections separated by blank lines in order to identify the main components of this model. The first section specifies the base namespace as well as the namespaces of RDF and RDF-Schema, respectively. The second section includes the data types used for the document hierarchy; they are taken from XML-Schema. The third section contains class definitions, whereas the last sections describe some relationships between classes.

² It is available at: http://www.w3.org/RDF/Validator/

Table 1. RDF-based Model of a Document Hierarchy

    <?xml version="1.0"?>
    <!DOCTYPE rdf:RDF [
      <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#">
    ]>
    <rdf:RDF
      xml:base="http://www.utm.mx/sigepr/recursos/docHierarchy/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">

    <rdfs:Datatype rdf:about="&xsd;string"/>
    <rdfs:Datatype rdf:about="&xsd;integer"/>

    <rdfs:Class rdf:ID="docHierarchy"/>
    <rdfs:Class rdf:ID="Calgorithm"/>
    <rdfs:Class rdf:ID="Ccluster"/>
    <rdfs:Class rdf:ID="Clabel"/>
    <rdfs:Class rdf:ID="Clevel"/>
    <rdfs:Class rdf:ID="Cdocument"/>
    <rdfs:Class rdf:ID="Cidentifier"/>
    <rdfs:Class rdf:ID="Csource"/>

    <rdf:Property rdf:ID="isDescribedByName">
      <rdfs:domain rdf:resource="#Calgorithm"/>
      <rdfs:range rdf:resource="&xsd;string"/>
    </rdf:Property>
    <rdf:Property rdf:ID="hasCluster">
      <rdfs:domain rdf:resource="#docHierarchy"/>
      <rdfs:range rdf:resource="#Ccluster"/>
    </rdf:Property>
    ...
    <rdf:Property rdf:ID="hasDocument">
      <rdfs:domain rdf:resource="#Ccluster"/>
      <rdfs:range rdf:resource="#Cdocument"/>
    </rdf:Property>
    ...
    </rdf:RDF>

Table 2 shows the main features of the RDF-based model. Class attributes, as well as n-ary relations, are not included; however, they could be supported with some additional effort. RDF and RDF-Schema offer basic primitives for ontology modeling. Other ontology languages reuse and extend these primitives to support more complex features such as cardinality constraints, disjoint decomposition, partition or binary functions. Because the proposed RDF-based model uses XML to represent its syntax, it is easily read and managed, since standard libraries for processing XML are freely available. That means that more tools are available for editing, handling and documenting. By way of illustration, Figure 2 shows the use of the Altova SemanticWorks tool³ to edit the structure of the RDF-based model.

³ http://www.altova.com

Table 2. Features of the RDF-based model

    Binary relations
    Instances
    Instance attributes
    Relation hierarchies
    Subclass-of relationships
    Type constraints

Figure 2. Editing the RDF-based model

5 Experimental Implementation

Figure 3 shows a partial definition of the class documentHierarchy and two of its attributes. It contains classes and properties that belong to the RDF Knowledge Representation ontology (rdf:Property and rdf:type), to XML Schema datatypes (xsd:integer) and to the RDF-based model for encoding document hierarchies. The primitive rdf:type determines whether a resource is a class or a property. If it is a property, it is necessary to define its domain and range.

Figure 3. Partial Definition of the class documentHierarchy in RDF and RDF-Schema

Table 3 shows how the RDF-based model can be instantiated. This is an excerpt of an RDF document. The resource referenced by the xml:base attribute contains the definitions of the elements of an RDF document that represents a document hierarchy. This RDF-Schema is also used to avoid misunderstandings in the construction of document hierarchies. XML-Spy from Altova was used to check the validity of this document.
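The requirement that every property declares a domain and a range can be checked mechanically. The sketch below is only an illustration, not the authors' tooling: it parses a shortened schema with Python's standard XML library and reports properties that lack either declaration, or that point to a local class that was never declared. The hasLabel property without a range is a deliberate, hypothetical defect added for the example, and the helper name check_properties is an assumption.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Shortened schema in the spirit of Table 1. The hasLabel property is
# hypothetical and intentionally lacks a range.
SCHEMA = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <rdfs:Class rdf:ID="docHierarchy"/>
  <rdfs:Class rdf:ID="Ccluster"/>
  <rdfs:Class rdf:ID="Cdocument"/>
  <rdf:Property rdf:ID="hasCluster">
    <rdfs:domain rdf:resource="#docHierarchy"/>
    <rdfs:range rdf:resource="#Ccluster"/>
  </rdf:Property>
  <rdf:Property rdf:ID="hasDocument">
    <rdfs:domain rdf:resource="#Ccluster"/>
    <rdfs:range rdf:resource="#Cdocument"/>
  </rdf:Property>
  <rdf:Property rdf:ID="hasLabel">
    <rdfs:domain rdf:resource="#Ccluster"/>
  </rdf:Property>
</rdf:RDF>
"""


def check_properties(schema_xml):
    """Return a list of (property name, problem) pairs found in the schema."""
    root = ET.fromstring(schema_xml)
    # Locally declared classes, written the way rdf:resource refers to them.
    classes = {"#" + c.get(f"{{{RDF}}}ID") for c in root.findall(f"{{{RDFS}}}Class")}
    problems = []
    for prop in root.findall(f"{{{RDF}}}Property"):
        name = prop.get(f"{{{RDF}}}ID")
        for part in ("domain", "range"):
            node = prop.find(f"{{{RDFS}}}{part}")
            if node is None:
                problems.append((name, f"missing {part}"))
                continue
            target = node.get(f"{{{RDF}}}resource")
            # Only local references ("#Name") are checked; external URIs
            # such as XML Schema datatypes are accepted as they are.
            if target.startswith("#") and target not in classes:
                problems.append((name, f"{part} refers to undeclared class {target}"))
    return problems


print(check_properties(SCHEMA))  # [('hasLabel', 'missing range')]
```

A full RDF toolkit would resolve xml:base and entity references as well; this sketch skips both and only inspects the top-level declarations.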

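Because an instance of the model is a set of RDF statements, simple consistency checks reduce to operations on triples. The following sketch, written under stated assumptions, flattens a small instance into (subject, property, value) triples and reports subjects that carry the same property more than once. The namespace URI bound to docH, the subject names and the repeated hasLevel value are hypothetical; only flat descriptions with literal values are handled.

```python
import xml.etree.ElementTree as ET
from collections import Counter

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DOCH = "http://example.org/docHierarchy#"  # hypothetical namespace URI

# Hypothetical flat instance; the second hasLevel is a deliberate duplicate.
INSTANCE = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:docH="{DOCH}">
  <rdf:Description rdf:about="cluster1">
    <docH:hasLabel>Physics</docH:hasLabel>
    <docH:hasLevel>1</docH:hasLevel>
    <docH:hasLevel>2</docH:hasLevel>
  </rdf:Description>
  <rdf:Description rdf:about="document1">
    <docH:hasIdentifier>libroUTM:2211</docH:hasIdentifier>
    <docH:hasSource>UTM</docH:hasSource>
  </rdf:Description>
</rdf:RDF>
"""


def triples(rdf_xml):
    """Return (subject, property, literal value) triples of flat descriptions."""
    root = ET.fromstring(rdf_xml)
    out = []
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for prop in desc:
            name = prop.tag.split("}")[1]  # drop the "{namespace}" prefix
            out.append((subject, name, (prop.text or "").strip()))
    return out


def duplicated(statements):
    """Return sorted (subject, property) pairs that occur more than once."""
    counts = Counter((s, p) for s, p, _ in statements)
    return sorted(pair for pair, n in counts.items() if n > 1)


statements = triples(INSTANCE)
print(duplicated(statements))  # [('cluster1', 'hasLevel')]
```

Whether a repeated property is really an error depends on the cardinality the schema intends; RDF-Schema itself does not express that constraint, so the check is a heuristic.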

Table 3. Instantiating the RDF-based Model

    <?xml version="1.0"?>
    <rdf:RDF
      xml:base="http://www.utm.mx/sigepr/recursos/docHierarchy/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">

    <rdf:Description rdf:ID="docHierarchy">
      <docH:hasAlgorithm
        docH:isDescribedByClusterSupport="30"
        docH:isDescribedByGlobalSupport="25"
        docH:isDescribedByName="FIHC"/>
      <docH:hasCluster>
        <rdf:Description
          rdf:about="http://www.utm.mx/sigepr/recursos/docHierarchy/Ccluster">
          <docH:hasLabel> Physics </docH:hasLabel>
          <docH:hasLevel> 1 </docH:hasLevel>
          <docH:hasDocument>
            <rdf:Description
              rdf:about="http://www.utm.mx/sigepr/recursos/docHierarchy/document">
              <docH:hasIdentifier> libroUTM:2211 </docH:hasIdentifier>
              <docH:hasSource> UTM </docH:hasSource>
            </rdf:Description>
          </docH:hasDocument>
        </rdf:Description>
      </docH:hasCluster>
    </rdf:Description>
    ...
    </rdf:RDF>

The model exploits RDF's ability to uniquely identify documents, schemas and schema attributes. Furthermore, one could implement tasks such as reviewing the domain and range assignments as well as finding documents with duplicated properties. In general, document hierarchies can take advantage of RDF-specialized software. For example, it is possible to apply RDQL (RDF Data Query Language), which is an efficient way to extract information from RDF graphs.

6 Conclusions

In this paper, we have presented a machine-accessible semantic model to represent document hierarchies. These are hierarchical structures that provide users or agents with an alternative way to explore document collections. The model can be used to represent document hierarchies that are constructed manually or automatically. An RDF-Schema was designed to define the vocabulary used by the proposed model. As a consequence, taxonomic relationships have a unique interpretation that is human and machine accessible.

At the moment, we are using the model to represent document hierarchies for OAI-compliant data providers in order to support an agent-based architecture that implements information retrieval tasks.

References

[1] A. Gómez-Pérez, M. Fernández-López, and O. Corcho. Ontological Engineering. Springer, 2004.
[2] G. Antoniou and F. van Harmelen. A Semantic Web Primer. The MIT Press, 2004.
[3] R. Engels and B. Bremdal. Corporum: A workbench for the semantic web. In PKDD/ECML Proceedings. Semantic Web Mining workshop, Freiburg, Germany, 2001.
[4] B. Fung. Doccluster, 2005.
[5] B. Fung, K. Wang, and M. Ester. Hierarchical document clustering using frequent itemsets. In SIAM International Conference on Data Mining (SDM03) Proceedings, pages 59-70, May 2003.
[6] C. Lagoze and H. V. de Sompel. The open archives initiative: Building a low-barrier interoperability framework. In JCDL01 Conference Proceedings, pages 54-62. Joint Conference on Digital Libraries, June 2001.
[7] M. A. Medina, J. Sánchez, Y. Ostróvskaya, and N. R. Brisaboa. Ontosir; an oai service for multi-collection document retrieval based on ontologies of metadata records. In LA-WEB 2005 Conference Proceedings, pages 00-00. Argentine Society on Computer Science and Operations Research, SADIO, November 2005.
[8] M. A. Medina, J. Sánchez, and A. Ramírez. Describing document hierarchies by using markup languages. In Proceedings of the Mexican International Conference on Computer Science (ENC 06, San Luis Potosí, México), pages 00-00. IEEE Computer Society Press, Los Alamitos, Calif., September 2006.
[9] O. Lassila and R. Swick. Resource Description Framework (RDF) model and syntax specification. W3C Recommendation, 1999.
[10] E. Rasmussen. Clustering algorithms. In Frakes and Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, chapter 16, pages 419-442. Prentice Hall, 1992.
[11] T. Berners-Lee. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by its Inventor. HarperCollins Publishers, New York, 1999.
[12] Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, McLean, VA, USA, November 4-9, pages 515-524, 2002.

