
Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

XML clustering methods
Sohn Jong-Soo
mis026@korea.ac.kr
Intelligent Information System Lab. Korea Univ.
2007.11.06


0. Index

Introduction
XML and XML schema
Relational vs. XML
Paper overview
My works


1. Introduction
XML
It has become a standard for information exchange and retrieval
With the continuous growth of XML data, the ability to manage massive collections of XML data and to discover knowledge from them becomes essential for web-based information systems

Clustering methods
Applied to database objects, text data, multimedia data
XML data is different: semi-structured and hierarchical


2. XML and XML schema


XML
XML document
XML schema
Can be obtained separately, without scanning the whole document

Style sheet
XSL, CSS

[Figure: XML separates content (XML file), structure (XML schema, DTD), and style (XSL, CSS). Diagram labels: XML-1, XML-2, XML-3, XML-4, XML-1234, XML-13, XML-24, connected through XSLT (DOM, SAX).]


2. XML and XML schema


XML documents have elements and attributes
Elements (indicated by begin and end tags)
can be nested but cannot interleave with each other
can have an arbitrary number of sub-elements
can have free text as values
Elements with the same name can be nested

Example (<chap ...> is the begin tag of the element, title is an attribute, </chap> is the end tag):

<chap title="Introduction To XML">
some free text
<sect title="What is XML?"> </sect>
<sect title="Elements"> </sect>
<sect title="Why XML?"> </sect>
possibly more free text
</chap>
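
The element and attribute terms above can be illustrated with a minimal sketch using Python's standard xml.etree.ElementTree module. The sample is the chap example with its attribute values quoted, which well-formed XML requires; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The chap example, with attribute values quoted so that it is well-formed XML.
DOC = """<chap title="Introduction To XML">
some free text
<sect title="What is XML?"> </sect>
<sect title="Elements"> </sect>
<sect title="Why XML?"> </sect>
possibly more free text
</chap>"""

root = ET.fromstring(DOC)

# An element has a tag name and a dictionary of attributes.
print(root.tag, root.attrib)

# Sub-elements are iterated in document order.
section_titles = [sect.get("title") for sect in root]
print(section_titles)

# Free text before the first sub-element is stored in .text;
# text that follows a sub-element is stored in that sub-element's .tail.
print(root.text.strip())
print(root[-1].tail.strip())
```

Note that ElementTree splits the free text of chap into two places (.text and .tail), which is one small sign that XML content does not map directly onto flat fields.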


2. XML and XML schema


Database Side: XML is a new way to organize data
Relational databases organize data in tables
XML documents organize data in ordered trees

Document Side: XML is a semantic markup language


HTML focuses on presentation
XML focuses on semantics/structure in the data

[Figure: ordered tree with root chap and nested sect elements]

<html>
<h1> Chapter 1 </h1>
some free text
<h2> Section 1 </h2>
some more free text
<h3> Section 1.1 </h3>
</html>
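
To make the "ordered tree" view concrete, here is a small sketch that walks an XML document depth-first and records each element with its depth. The sample document and the function name outline are hypothetical. In the HTML example above, the h1, h2, and h3 headings are all siblings under html, so the chapter/section nesting is only implied; in XML the nesting is explicit in the tree.

```python
import xml.etree.ElementTree as ET

def outline(node, depth=0):
    """Return one (depth, tag) pair per element, in document order."""
    rows = [(depth, node.tag)]
    for child in node:  # children are iterated in document order
        rows.extend(outline(child, depth + 1))
    return rows

# Hypothetical document: one chapter with nested sections.
xml_doc = "<chap><sect><sect/><sect/></sect><sect/></chap>"

for depth, tag in outline(ET.fromstring(xml_doc)):
    print("  " * depth + tag)
```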


3. Relational vs. XML


Relational data are well organized and fully structured (more strict):
E-R modeling is used to model the data structures in the application;
the E-R diagram is converted to relational tables and integrity constraints (relational schemas)

XML data are semi-structured (more flexible):
Schemas may be unfixed or unknown (flexible: anyone can author a document)
Suitable for data integration (data on the web, data exchange between different enterprises)


3. Relational vs. XML


XML is not meant to replace relational database systems
RDBMSs are well suited to OLTP applications (e.g., electronic banking), which have 1000+ small transactions per minute

XML is suitable for data exchange among heterogeneous data sources (e.g., Web services), allowing them to talk to each other


3. Relational vs. XML


Advantages of using XML
Manage large volumes of XML data
Provide a high-level declarative language
Efficiently evaluate complex queries

XML Data Management Issues:
XML Data Model
XML Query Languages
XML Query Processing, Optimization and Classification
I am interested in this branch!


4. Paper overview
XML schema clustering with semantic and hierarchical similarity measures
This paper presents an XML schema clustering process that organises heterogeneous XML schemas into various groups

Combining the semantic and syntactic relationships
To calculate the linguistic similarity between two elements
Considering the ancestor-child relationship

Generalizing a suitable schema class hierarchy
Using the XMine methodology
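
The idea of combining an element-name measure with the ancestor context can be sketched roughly as follows. This is not the paper's measure: plain string similarity from Python's difflib stands in for a real linguistic similarity, and the function names, the equal weighting, and the sample element paths are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """String similarity of two element names in [0, 1] (stand-in for a linguistic measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def path_similarity(path_a, path_b):
    """Average, over the ancestors in path_a, of the best name match found in path_b."""
    if not path_a or not path_b:
        return 1.0 if path_a == path_b else 0.0
    best = [max(name_similarity(x, y) for y in path_b) for x in path_a]
    return sum(best) / len(best)

def element_similarity(elem_a, elem_b, weight=0.5):
    """Weighted mix of name similarity and ancestor-path similarity.

    Each element is given as a tuple of names from the schema root
    down to the element itself (a hypothetical representation).
    """
    *ancestors_a, name_a = elem_a
    *ancestors_b, name_b = elem_b
    return (weight * name_similarity(name_a, name_b)
            + (1 - weight) * path_similarity(ancestors_a, ancestors_b))

# Hypothetical elements from three schemas.
a = ("book", "author", "name")
b = ("publication", "author", "name")
c = ("order", "item", "price")

print(element_similarity(a, b))  # same name, similar ancestors: higher score
print(element_similarity(a, c))  # different name and ancestors: lower score
```

Pairwise scores of this kind could then feed any standard clustering procedure; how the scores are actually combined and clustered in the paper is not shown on the slide.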


4. Paper overview
Evaluating Structural Similarity in XML Documents
It defines a new method for computing the distance between any two XML documents in terms of their structure
The lower this distance, the more similar the two documents are in terms of structure, and the more likely they are to have been created from the same DTD
It develops a dynamic programming algorithm to find this distance for any pair of documents
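
A structural distance computed by dynamic programming can be sketched as below. This is a simplified top-down tree edit distance in which only whole subtrees are inserted or deleted (each at a cost equal to the subtree's size) and a node can be relabelled at cost 1. It is a stand-in to show the general technique, not the algorithm or the cost model defined in the paper; the function names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def subtree_size(node):
    """Number of elements in the subtree rooted at node."""
    return 1 + sum(subtree_size(child) for child in node)

def structural_distance(a, b):
    """Top-down edit distance between two element trees (tags only)."""
    relabel = 0 if a.tag == b.tag else 1
    kids_a, kids_b = list(a), list(b)
    rows, cols = len(kids_a), len(kids_b)
    # table[i][j] = cost of turning the first i children of a
    # into the first j children of b
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        table[i][0] = table[i - 1][0] + subtree_size(kids_a[i - 1])
    for j in range(1, cols + 1):
        table[0][j] = table[0][j - 1] + subtree_size(kids_b[j - 1])
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            table[i][j] = min(
                table[i - 1][j] + subtree_size(kids_a[i - 1]),   # delete subtree
                table[i][j - 1] + subtree_size(kids_b[j - 1]),   # insert subtree
                table[i - 1][j - 1]
                + structural_distance(kids_a[i - 1], kids_b[j - 1]),  # transform
            )
    return relabel + table[rows][cols]

def xml_distance(text_a, text_b):
    """Parse two XML strings and return their structural distance."""
    return structural_distance(ET.fromstring(text_a), ET.fromstring(text_b))

doc1 = "<chap><sect/><sect/></chap>"
doc2 = "<chap><sect/><sect/><sect/></chap>"
doc3 = "<html><h1/><h2/><h3/></html>"

print(xml_distance(doc1, doc2))  # small: one extra sect
print(xml_distance(doc1, doc3))  # larger: different tags throughout
```

Element text and attributes are ignored here, so the value reflects structure only, in line with the slide's description.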


4. Paper overview
A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications
This paper proposes a matching algorithm for measuring the structural similarity between an XML document and a DTD

By comparing the document structure against the one the DTD requires, the matching algorithm is able to identify commonalities and differences
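
The comparison of a document against the structure a DTD requires can be sketched in a very reduced form. Here a DTD is represented as a hypothetical Python dict that maps each element name to the set of child element names it declares; order, cardinality operators, and attributes are ignored, and every declared child is treated as required. The scoring formula is an assumption for illustration and is not the one from the paper.

```python
import xml.etree.ElementTree as ET

def match(node, dtd):
    """Count (common, extra, missing) child elements below node, checked against dtd.

    The tag of node itself is not checked; only its descendants are.
    """
    allowed = dtd.get(node.tag, set())
    present = {child.tag for child in node}
    common = len(present & allowed)    # in the document and in the DTD
    extra = len(present - allowed)     # in the document only
    missing = len(allowed - present)   # in the DTD only
    for child in node:
        if child.tag in allowed:       # descend only into declared children
            c, e, m = match(child, dtd)
            common, extra, missing = common + c, extra + e, missing + m
    return common, extra, missing

def similarity(xml_text, dtd):
    """Share of common components among all components seen, in [0, 1]."""
    common, extra, missing = match(ET.fromstring(xml_text), dtd)
    total = common + extra + missing
    return common / total if total else 1.0

# Hypothetical simplified DTD and two sample documents.
BOOK_DTD = {"book": {"title", "author"}, "author": {"name"}}
doc_ok = "<book><title/><author><name/></author></book>"
doc_off = "<book><title/><price/></book>"

print(similarity(doc_ok, BOOK_DTD))
print(similarity(doc_off, BOOK_DTD))
```

With a score like this, classifying a document against a set of DTDs could amount to choosing the DTD with the highest similarity.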


4. Paper overview
This paper focused on five applications of the algorithm:
(1) the classification of XML documents against a set of DTDs
(2) the generation of a new schema for a DTD by extracting structural information during the classification of XML documents
(3) the development of an XML-based search engine able to answer approximate structural queries
(4) the selective dissemination of XML documents
(5) the protection of the contents of documents classified against a set of DTDs of a database, by propagating the authorization policies specified at DTD level


4. Paper overview
Schema Matching for Transforming Structured Documents
Understanding the matching problem in the context of structured document transformations
And developing matching methods whose output serves as the basis for the automatic generation of transformation scripts
Four basic matching processes
(1) linguistic matching
(2) datatype compatibility
(3) designer type hierarchy
(4) structural matching


5. My works
XML data classification
Using an XML schema and its XML files
ID3 algorithm
Using a classification tool on XML data

It will contribute to XML data preprocessing for data mining
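
The core step of ID3 is to pick, at each node of the decision tree, the attribute with the highest information gain. A minimal sketch of that step is given below; it assumes the XML data has already been turned into flat rows, and the attribute names and sample rows are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of target obtained by splitting rows on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, target):
    """Attribute that ID3 would split on first."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical rows extracted from XML documents.
rows = [
    {"format": "article", "has_abstract": "yes", "label": "paper"},
    {"format": "article", "has_abstract": "yes", "label": "paper"},
    {"format": "article", "has_abstract": "no", "label": "news"},
    {"format": "book", "has_abstract": "no", "label": "news"},
]

print(best_attribute(rows, ["format", "has_abstract"], "label"))
```

A full ID3 implementation would apply this choice recursively to each subset of rows until the labels in a subset are all the same.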

Problems
XML has a hierarchical data type
It can't be presented like a table

Insufficient sample data
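
One possible way around the "not a table" problem is to flatten each XML record into a single row whose column names are element paths. The sketch below is only an assumption about how such preprocessing might look, not the method used in this work; the function name flatten and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

def flatten(node, prefix=""):
    """Map each leaf element and each attribute to a path-like column name.

    Limitation: repeated sibling elements with the same tag overwrite
    each other, so only the last one is kept.
    """
    path = f"{prefix}/{node.tag}"
    row = {f"{path}/@{name}": value for name, value in node.attrib.items()}
    children = list(node)
    if not children:
        row[path] = (node.text or "").strip()
    for child in children:
        row.update(flatten(child, path))
    return row

# Hypothetical record.
record = '<book id="b1"><title>XML Basics</title><author><name>Kim</name></author></book>'
print(flatten(ET.fromstring(record)))
```

Rows of this form could then be passed to a table-oriented classifier such as ID3, at the cost of losing element order and repeated elements.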


References
E. Bertino, G. Guerrini, M. Mesiti, A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications, Information Systems 29 (1) (2004) 23-46.
A. Boukottaya, C. Vanoirbeek, Schema matching for transforming structured documents. Paper presented at the 2005 ACM Symposium on Document Engineering, Bristol, United Kingdom, November 2-4, 2005.
A. Doan, P. Domingos, A.Y. Halevy, Reconciling schemas of disparate data sources: a machine-learning approach. Paper presented at ACM SIGMOD, Santa Barbara, California, United States, 2001.
S. Flesca, G. Manco, E. Masciari, L. Pontieri, A. Pugliese, Fast detection of XML structural similarities, IEEE Transactions on Knowledge and Data Engineering 17 (2) (2005) 160-175.
R. Nayak, S. Xu, XCLS: a fast and effective clustering algorithm for heterogenous XML documents. Paper presented at the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, 2006.


References
A. Nierman, H.V. Jagadish, Evaluating structural similarity in XML documents. Paper presented at the Fifth International Conference on Computational Science (ICCS05), Wisconsin, USA, December 2002.
R. Nayak, W. Iryadi, XML schema clustering with semantic and hierarchical similarity measures, 2006.
http://www.w3c.org/xml
