X ML Clustering Methods

Dept. Computer Science, Korea Univ.
Intelligent Information System Lab.
XML clustering
methods
Sohn Jong-Soo
mis026@korea.ac.kr
Intelligent Information System Lab. Korea Univ.
2007.11.06
0. Index
Introduction
XML and XML schema
Relational vs. XML
Paper overview
My works
1. Introduction
XML
It has become a standard for information exchange
and retrieval
With the continuous growth in the XML data
The ability to manage massive collections of XML data
and to discover knowledge from them becomes
essential
For web based information system
Clustering method
Database objects, text data, multimedia data
XML data is different
Semi-structured
Hierarchical
2. XML and XML schema

XML
XML document
XML schema
Can be obtained separately without scanning the whole
document
Style sheet
XLS, CSS
XML
Content
XML file
Structure
XML schema, DTD
Style
XLS, CSS
XML-1
XML-13
XML-2
XML-3
XML-4
XSLT
( DOM,SAX)
XML-1234
XSLT
( DOM,SAX)
XML-24

XML documents have elements and attributes
Elements (indicated by begin & end tags)
attribute
begin
element
can be nested but cannot interleave each other

can have arbitrary number of sub-elements
can have free text as values
end
<chap title = Introduction To XML>
elemen
t
some free text
<sect title = What is XML?> </sect>
<sect title = Elements> </sect>
<sect title = Why XML?> </sect>
Elements
possibly more free text
w/ same
</chap>
name can
be nested

Database Side: XML is a new way to organize data
Relational databases organize data in tables
XML documents organize data in ordered trees
Document Side: XML is a semantic markup language

HTML focuses on presentation
XML focuses on semantics/structure in the data
chap
sect
sect sect
sect
sect sect
<html>
<h1> Chapter 1 </h1>
some free text
<h2> Section 1 </h2>
some more free text
<h3> Section 1.1 </h3>
</html>
3. Relational vs. XML

Relational data are well organized fully
structured (more strict):
E-R modeling to model the data structures in the
application;
E-R diagram is converted to relational tables and
integrity constraints (relational schemas)
XML data are semi-structured (more flexible):

Schemas may be unfixed, or unknown
(flexible anyone can author a document)
Suitable for data integration
(data on the web, data exchange between different
enterprises).

XML is not meant to replace relational
database systems
RDBMSs are well suited to OLTP applications
(e.g., electronic banking)
which has 1000+ small transactions per minute.
XML is suitable data exchange over heterogeneous

data sources
(e.g., Web services)
that allow them to talk.

Advantages of using XML
Manage large volume of XML data
Provide high-level declarative language
Efficiently evaluate complex queries
XML Data Management Issues:

XML Data Model
XML Query Languages
XML Query Processing, Optimization and
Classification
I have interest in this branch !
4. Paper overview
XML schema clustering with semantic and
hierarchical similarity measures
This paper presents a XML schema clustering process
By organising the heterogeneous XML schemas
into various groups
Combining the semantic and syntactic relationships

To calculate the linguistic similarity bet. Two elements
Considering the ancestor-child relationship
Generalizing a suitable schema class hierarchy

Using Xmine methodology
4. Paper overview
Evaluating Structural Similarity in XML
Documents
Develop a dynamic programming
algorithm
to find this distance for any pair
of documents
It define a new method for

computing the distance
between any two XML documents in
terms of their structure
The lower this distance
the more similar the two
documents are in terms of structure
the more likely

they are to have been created
from the same DTD
4. Paper overview
A matching algorithm for measuring the
structural similarity between an XML document
and a DTD and its applications
This paper proposes a matching algorithm for
measuring the structural similarity
between an XML document and a DTD
The matching algorithm

by comparing the document structure against the one the
DTD requires
is able to identify commonalities and differences
4. Paper overview
This paper focused on five applications of the
algorithm:
(1) the classification of XML documents against a set of
DTDs
(2) the generation of a new schema
for a DTD by extracting structural information during the
classification of XML documents;
(3) the development of an XML-based search engine

able to answer approximate structural queries
(4) the selective dissemination of XML documents

(5) the protection of the contents of documents classified
against a set of DTDs of a database, by propagating the
authorization policies specified at DTD level
4. Paper overview
Schema Matching for Transforming
Structured Documents
Understanding the matching problem in the context
of structured document transformations
And developing matching methods those output
serves as the basis for the automatic generation of
transformation scripts
Four basic matching process
(1)linguistic matching
(2)datatype compatibility
(3)Designer type hierarchy
(4)structural matching
5. My works
XML data classification
Using a XML schema and its XML files
ID3 Algorithm
By classification tool on XML data
It will contribute to XML data preprocessing for

datamining
Problems
XML has hierarchical data type
It cant present like a table
Insufficient of sample data
References
E. Bertino, G. Guerrini, M. Mesiti, A matching algorithm for measuring the
structural similarity between an XML document and a DTD and its
applications, Information Systems 29 (1) (2004) 2346.
A. Boukottaya, C. Vanoirbeek, 2005, November 0204, Schema matching
for transforming structured documents. Paper presented at the The 2005
ACM Symposium on Document engineering, Bristol, United Kingdom.
A. Doan, R. Domingos, A.Y. Halevy, 2001, Reconciling schemas of

disparate sources: a machine-learning approach. Paper
presented at the ACM SIGMOD, Santa Barbara, California, United
States.
S. Flesca, G. Manco, E. Masciari, L. Pontieri, A. Pugliese, Fast detection of
XML structural similarities, IEEE Transaction on Knowledge and Data
Engineering 7 (2) (2005) 160175.
R. Nayak, S. Xu, XCLS: a fast and effective clustering algorithm

for heterogenous XML documents. Paper presented at the The
10th Pacific-Asia Conference on Knowledge Discovery and Data
Mining (PAKDD), Singapore, 2006.
References
A. Nierman, H.V. Jagadish, 2002, December, Evaluating structural
similarity in XML documents. Paper presented at the fifth International
Conference on Computational Science (ICCS05), Wisconsin, USA.
Richi Nayak, Wina Iryadi 2006, XML schema clustering with semantics
and hierarchical similarity measures.
http://www.w3c.org/xml

X ML Clustering Methods

Загружено:

Сведения о документе

Исходное описание:

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

X ML Clustering Methods

Загружено:

Авторское право:

Доступные форматы

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

2. XML and XML schema

XML schema, DTD

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

2. XML and XML schema

can be nested but cannot interleave each other

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

2. XML and XML schema

Document Side: XML is a semantic markup language

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

3. Relational vs. XML

XML data are semi-structured (more flexible):

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

3. Relational vs. XML

XML is suitable data exchange over heterogeneous

Dept. Computer Science, Korea Univ.

3. Relational vs. XML

XML Data Management Issues:

Intelligent Information System Lab.

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

Combining the semantic and syntactic relationships

Generalizing a suitable schema class hierarchy

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

It define a new method for

the more likely

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

The matching algorithm

is able to identify commonalities and differences

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

(3) the development of an XML-based search engine

(4) the selective dissemination of XML documents

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

It will contribute to XML data preprocessing for

Insufficient of sample data

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

A. Doan, R. Domingos, A.Y. Halevy, 2001, Reconciling schemas of

R. Nayak, S. Xu, XCLS: a fast and effective clustering algorithm

Dept. Computer Science, Korea Univ.

Intelligent Information System Lab.

Вам также может понравиться