
Constructing Hierarchical Information Structures of Sub-Page Level HTML Documents


Seung-Jin Lim
Yiu-Kai Ng
Computer Science Department
Brigham Young University
Provo, Utah 84602, U.S.A.
{ng,sjlim}@cs.byu.edu

The 5th International Conference on Foundations of Data Organizations (FODO'98), Kobe, Japan, November 1998.

Abstract

Most of the existing methods for retrieving sub-page level information in an HTML document either rely on keyword-based searching or assume that the internal structure of the document is known beforehand. These techniques, however, are not suitable for locating hierarchically organized information, especially when the internal structure of the given HTML document D is unknown. We present an approach for inferring information hierarchy at the sub-page level of D. Our approach includes constructing a meta data model of D, called content tree (CT), which captures the hierarchical relationship among the data contents of D while excluding HTML markups in D. Hierarchical information in D can be retrieved via CT using an existing query language which is capable of dealing with data in a semistructured data model.

Keywords: semistructured data, information visualization, HTML document, World-Wide Web

1 Introduction

During the past few years, the growth rate of data posted on the World-Wide Web (Web) has been phenomenal. In order to deal with the vast amount of data on the Web, a number of data retrieval methods have been proposed. Among the proposed methods, keyword-based searching methods have gained popularity and are widely used as the fundamental technique for retrieving sub-page level information of a given HTML document. We categorize as keyword-based searching all the information retrieval methods based on keyword searching at the sub-page level of HTML pages in which the internal structure of the pages is not exploited during the process of information searching. At present, keyword-based searching is mainly used in two different ways: (i) as a front-end tool for a Web user to locate HTML documents which include information of a particular subject via index servers such as Alta Vista, or (ii) as a back-end processing technique for locating sub-page level information employed by Web query languages, such as WebLog [11], WebSQL [13], W3QS [10], and OQL-doc [2].

Consider the retrieval of "the email address of Evan Ivie who is a visiting faculty" from the HTML document D as shown in Figure 1(b), assuming that keywords are set in a typewriter typeface. A keyword-based searching method is incapable of locating the desired email address for a number of reasons: (i) there are two instances of Evan Ivie in D and thus it does not know how to differentiate the instances; (ii) although the second instance of Evan Ivie might be chosen correctly, another problem is how to determine which email address in D is the email address of Evan Ivie, a visiting faculty member; and (iii) the nature of the desired information is structural, i.e., the keywords are hierarchically related to one another through the words 'of' and 'who' in the English sentence above, which constitute the partial ordering relationships of the keywords, and an attempt simply using the keywords with logical connectives AND, OR, or XOR fails to locate the email address.

Our approach for constructing information hierarchy at the sub-page level is to exploit the hierarchy of the data contents in a given HTML document. One might consider using the parse tree of the document constructed according to the HTML grammar to exploit the hierarchical structure of the data contents. This approach, however, is not suitable. Consider an enhanced parse tree (in which all the end-tags are eliminated from the parse tree generated according to the HTML specification) of document dir.htm as shown in Figure 2 and the path expression of each leaf node. Note that there are a few problems with the (enhanced) parse tree. First, 'dir.htm'.HTML.BODY.UL.LI.A.'Evan Ivie' does not specify whether 'Evan Ivie' is a visiting faculty or a regular faculty. Second, it is common sense to comprehend the hierarchy of the second 'Evan Ivie' and 'Visiting Faculty' in Figure 1(a) such that 'Evan Ivie' is subordinate to 'Visiting Faculty.' Furthermore, the words 'Note: This page is a' and 'modified' belong to the same sentence as shown in Figure 1(a) but they appear in different branches in the parse tree in Figure 2. The hierarchy generated by the HTML browser, which we call intended hierarchy, is not preserved in the parse tree due to the various HTML tags. Consequently, using the parse tree to locate the email address of Evan Ivie, a visiting faculty, is inadequate.

Figure 1: An HTML document dir.htm modified from http://www.cs.byu.edu/info/directory.html. (a) dir.htm rendered in Netscape; (b) a portion of the source code of dir.htm.

In this paper, we present an approach for constructing hierarchical information structure at the sub-page level of an HTML document. Our approach takes an HTML document D as an input and returns a meta data model of D, called content tree (CT), denoted $CT_D$, which is the tree representation of the data contents, with the exclusion of HTML tags, in D. $CT_D$ is based on the notion of semistructured data, and because of that the hierarchical information captured in $CT_D$ can be retrieved using an existing semistructured data query language, such as Lorel [3].

Our approach is advantageous over the existing methods for extracting sub-page level information of an HTML document D. First, $CT_D$ provides a facility to identify the hierarchical relationships among data contents in D. This capability provides a solution to the problems of synonym keywords (such as the two email addresses of Evan Ivie). Second, for the construction of $CT_D$, we do not assume that the internal structure of D is known beforehand as [5, 9] do. Third, there exist other methods which exploit the internal structure of HTML tags to obtain sub-page level information of an HTML document, such as WebOQL [4]. However, a WebOQL user must deal with HTML tags and hence the internal structure of the document. A query posted against $CT_D$ deals with no HTML tags and its user is not required to know the internal structure of the referenced HTML document.

We proceed to present our results as follows. In Section 2, we introduce the data model adopted by us for representing data in an HTML document. In Section 3, we present the rules for constructing the content tree, which captures the hierarchical information at the sub-page level, of a given HTML document. In Section 4, we give the concluding remarks.

2 The Data Model

We view the information posted on the Web as a collection of heterogeneous data, consisting largely of HTML documents along with other types of data, including multimedia data. One of the characteristics of HTML documents is that their structures are not rigid according to the HTML specification [7, 15]. Hence, an HTML document D contains data which are semistructured in nature. Semistructured data are data which are neither structured data (i.e., data with a rigid structure or schema), such as data in a relational database, nor unstructured data (i.e., data with no rigid structure or schema), such as data in executable files. Examples of semistructured data are data in text files that contain formatting code, such as LaTeX and HTML files, and files which require strict inner structure but in which some of the structural components can be omitted, such as BibTeX files, Unix environment files, etc. Based on this classification, we consider a finite set of data objects $D = \{o_1, o_2, \ldots, o_n\}$ as semistructured data if (i) the structure of D is irregular or incomplete [14, 3, 6], (ii) the distinction between D and the schema of D is not clear, and the schema may change dynamically [1, 8], and (iii) $o_i$ ($1 \le i \le n$) is not type-sensitive, and two distinct data objects in D with the same content may be of different types.
An HTML document consists of a number of HTML elements. A typical HTML element is delimited by its start-tag and end-tag, and contains content that appears between the start- and end-tags. For some HTML elements, either their end-tags are implicit or their contents do not exist. The content of an HTML element may include other tags (i.e., an HTML element can be nested) or data characters. We call the latter data content.

Given an HTML element e and its content c, we say that e contains c, denoted by $e \sqsupset c$, and e is called the container of c. If c is contained in e with no intermediate container between e and c, i.e., $e \sqsupset c$ holds but $e \sqsupset e_1 \sqsupset \ldots \sqsupset e_k \sqsupset c$ ($k \ge 1$) does not hold, then e immediately contains c, or c is an immediate content of e.

HTML elements cannot partially overlap with one another according to the HTML specification. If an HTML document D conforms to the HTML specification, then D is said to be a valid HTML document. Among the HTML elements defined in the HTML specification, the role of some elements is to define the block structure of the data contents, i.e., the hierarchy of the data contents, whereas other elements either declare the cosmetic style of the data contents or do not have data contents at all. We call the former elements type 1 and the latter type 2. Typical examples of type 1 elements are list elements such as UL, OL, DIR and MENU. B and EM are typical type 2 HTML elements.

Definition 1 Given an HTML document D, an HTML token (or token for short) is either the start-tag, the end-tag, or (part of) the data content of an HTML element in D. If a token denotes a tag (resp. data content), it is of tag type (resp. data type). □

Our semistructured data model is called semistructured data tree (SDT) [12]. The core constructs of an SDT are objects, which are defined over the set of all possible strings, denoted by $\Sigma^*$, and the dependency constraint among different objects.

Definition 2 An object o in an HTML document D is a 3-tuple ⟨name, attribute list, identifier⟩, which is either a start-tag (called a tag object) or data content (called a data object) in D, where
• name$^1$ $\in \Sigma^*$ is the name of the corresponding HTML tag if o is a tag object, or is the text of the corresponding data characters, if o is a data object.
• attribute list $\subseteq \Sigma^*$ is a finite (possibly empty) set of attributes, each of which is of the form attribute name = value.
• identifier $\in \Sigma^+$ is a non-empty string which uniquely identifies o in D. □

$^1$ A name may include spaces and dots. If a name includes blank spaces, it is enclosed by single quotes.

Example 1 Consider the head block of the HTML document dir.htm in Figure 1(b). By Definition 2, HEAD, TITLE, and the two METAs are tag objects, and 'Faculty Directory' is a data object. The first META object includes NAME and CONTENT as its attributes. □

The notion of dependency plays an important role in determining the hierarchical relationships among objects in an HTML document.

Definition 3 Given an HTML document D with a set of objects $\{o_1, o_2, \ldots, o_m\}$, an object $o_i$ directly depends on another object $o_j$ ($1 \le i, j \le m$; $i = j - 1$), denoted $o_i \rightarrow o_j$, if $o_i$ immediately contains $o_j$. Dependency is antisymmetric and transitive. Hence, by antisymmetry, if $o_1 \rightarrow o_2$ and $o_2 \rightarrow o_1$, then $o_1 = o_2$; by transitivity, if $o_1 \rightarrow o_2$ and $o_2 \rightarrow o_3$, then $o_1$ indirectly depends on $o_3$, denoted $o_1 \rightarrow o_2 \rightarrow o_3$ or $o_1 \rightarrow^* o_3$. □

Example 2 Consider the objects identified in Example 1. Since HEAD contains TITLE and the two METAs, according to the HTML specification, HEAD $\rightarrow$ TITLE and HEAD $\rightarrow$ META hold. Furthermore, HEAD $\rightarrow$ TITLE $\rightarrow$ 'Faculty Directory' also holds. □

We now formally define the tree representation of an HTML document D, called the semistructured data tree (SDT) of D, as given in [12].

Definition 4 Given an HTML document D, the semistructured data tree (SDT) $SDT_D = (V, E, g)$ of D is a rooted, directed graph, where
• The root node $V_R \in V$ is labeled by the URL of D. For any other node $v \in V$, v denotes an object in D and is labeled by the name of the object that v represents. Furthermore, given a node n and its child nodes $n_1, n_2, \ldots, n_m$, for any two nodes $n_j$ and $n_k$, $1 \le j < k \le m$, the token denoted by $n_j$ appears ahead of the token denoted by $n_k$ in D.
• E is a finite set of directed edges.
• $g : E \rightarrow V \times V$ is a function such that $g(e) = (v_1, v_2)$ if $t_1 \rightarrow t_2$, where $t_1$ and $t_2$ are the tokens represented by nodes $v_1$ and $v_2$, respectively. g is antisymmetric and transitive. □

By Definition 4, we can perform transformation back and forth between an HTML document D and its $SDT_D$ without losing information. Thus, we use D and $SDT_D$, and a token in D and its corresponding node in $SDT_D$, interchangeably throughout this paper.

Example 3 Consider the document D in Figure 1(b) whose source URL is dir.htm. By Definition 4 and the dependency constraints among the objects in dir.htm, the root node 'dir.htm' is created in $SDT_D$ and two child nodes, COMMENT and <HTML>, which are created by the <!DOCTYPE> and HTML tags, respectively, are attached to it. The entire SDT of dir.htm is shown in Figure 2, where dotted arrows denote subtrees whose detailed contents are not shown in the figure. □

Figure 2: SDT, an enhanced parse tree, of dir.htm

Definition 5 Given an SDT with nodes $n_1, n_2, \ldots, n_m$, the long name $L_{n_k}$ of node $n_k$ ($1 \le j, k \le m$) is the name of $n_k$ if $n_k$ is the root node of SDT; otherwise, it is $L_{n_j}.L$, where $n_j$ immediately contains $n_k$ and L is the name of $n_k$. □

Example 4 Consider the node 'Faculty Directory' in the SDT as shown in Figure 2. Its long name is 'dir.htm'.HTML.HEAD.TITLE.'Faculty Directory' by Definition 5. □

Definition 6 Given the long name of an object $o_N$ in an SDT of the form $L_{o_N} = N_{o_1}.(\ldots).N_{o_N}$, where $N_{o_i}$ is the name of object $o_i$ ($1 \le i \le N$), the path expression, pe, between two objects $o_i$ and $o_j$ ($1 \le j \le N$) is a binary function, $pe : \Sigma^* \times \Sigma^* \rightarrow \Sigma^*$, such that $pe(o_i, o_i) = N_{o_i}$ and $pe(o_i, o_j) = pe(o_i, o_{j-1}).N_{o_j}$, where $o_1$ is denoted by the root node of SDT and $i < j$. Also, $pe(V_R, o) = L_o$, where $V_R$ is the root node and $o \in V$ of SDT. □

An SDT precisely captures the structure and content of the given semistructured data in an HTML document according to the HTML grammar, and the lexical SDT of a semistructured data D is a textual representation of the corresponding SDT.

Definition 7 Given an SDT = (V, E, g), the lexical representation or lexical SDT of $o \in V$, denoted $\mathcal{L}_o$, is a textual representation of the subgraph S of the SDT rooted at o such that $\mathcal{L}_o = \bigcup_{\forall i} pe(o, o_i)$, where $o_i$ is a leaf node in S. □

Note that given an SDT = (V, E, g), the lexical SDT is obtained by $\bigcup_{\forall i} pe(V_R, o_i)$ ($= \bigcup_{\forall i} L_{o_i}$), where $o_i$ is a leaf node of the SDT. We adopt the convention A.{B, C} for {A.B, A.C}, where A.B and A.C are long names or path expressions.

Example 5 Consider the $SDT_{dir.htm}$ in Figure 2.

pe(HEAD, 'Faculty Directory')
  = pe(HEAD, TITLE).'Faculty Directory'
  = pe(HEAD, HEAD).TITLE.'Faculty Directory'
  = HEAD.TITLE.'Faculty Directory'

$L_{\text{Faculty Directory}}$
  = 'dir.htm'.HTML.HEAD.TITLE.'Faculty Directory'

$\mathcal{L}_{\text{HEAD}}$
  = pe(HEAD, 'Faculty Directory') $\cup$ pe(HEAD, META) $\cup$ pe(HEAD, META)
  = HEAD.{TITLE.'Faculty Directory', META, META} □

3 An approach to constructing hierarchical, sub-page level information structures

We now discuss our approach for extracting hierarchical information over the domain of HTML documents at the sub-page level. We consider the hierarchical structure of the data contents embedded in a given HTML document D and represent it as an enhanced $SDT_D$. The hierarchy of the data contents of D is obtained by restructuring the internal structure of the HTML elements in D according to the HTML grammar. This process consists of two steps:

Approximation of D: The approximated SDT of D, denoted $SDT'_D$, is constructed based on the HTML grammar while excluding the start- and end-tags of type 2 elements in D. The effect of this step is threefold. First, elements irrelevant to the hierarchy of the data contents of D are eliminated. Second, the intended hierarchy of the data contents is preserved. To better understand this assertion, consider the leaf nodes of the subtree rooted at I of $SDT_{dir.htm}$ in Figure 2. The three leaf nodes of the subtree are supposed to form a single sentence in dir.htm; however, they are separated into three objects in $SDT_{dir.htm}$ by the B element. This problem is solved by approximation.

Migration to $CT_D$: After approximation, the remaining HTML tags in $SDT'_D$ are removed while preserving the intended hierarchy of the data contents in D. The migrating step from $SDT'_D$ to the content tree of D (i.e., $CT_D$) is accomplished by clustering, merging, and promoting the data contents in D based on the coherence factor among the tokens in $SDT'_D$.
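The long names, path expressions, and lexical representations of Definitions 5 to 7 can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the helper names, and the choice to quote any name containing a space or a dot (the paper quotes 'dir.htm', while its footnote only mentions spaces) are assumptions made for illustration. The tree built at the end mirrors the head block used in Example 5.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One SDT node (hypothetical structure): a tag object or a data object."""
    name: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def quote(name: str) -> str:
    # Assumption: names with spaces or dots are enclosed in single quotes.
    return f"'{name}'" if (" " in name or "." in name) else name


def long_name(node: Node) -> str:
    """Definition 5: the root's name, or the parent's long name plus '.name'."""
    if node.parent is None:
        return quote(node.name)
    return long_name(node.parent) + "." + quote(node.name)


def path_expression(start: Node, end: Node) -> str:
    """Definition 6: the names on the path from start down to end."""
    names = []
    cur: Optional[Node] = end
    while cur is not None:
        names.append(quote(cur.name))
        if cur is start:
            return ".".join(reversed(names))
        cur = cur.parent
    raise ValueError("end is not a descendant of start")


def leaves(node: Node) -> List[Node]:
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def lexical(node: Node) -> List[str]:
    """Definition 7: path expressions from node to every leaf below it."""
    return [path_expression(node, leaf) for leaf in leaves(node)]


# The head block of dir.htm as described in Examples 1 to 5.
root = Node("dir.htm")
html = root.add(Node("HTML"))
head = html.add(Node("HEAD"))
title = head.add(Node("TITLE"))
title_text = title.add(Node("Faculty Directory"))
head.add(Node("META"))
head.add(Node("META"))

print(long_name(title_text))
print(lexical(head))
```

The list returned by `lexical(head)` corresponds to the set written as HEAD.{TITLE.'Faculty Directory', META, META} in Example 5; a list is used here only to keep document order visible.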
Definition 8 Given an SDT (or SDT') S with k nodes $n_1, n_2, \ldots, n_k$, the coherence factor (c.f.) between $n_i$ and $n_j$ ($1 \le i, j \le k$), denoted $\mathrm{c.f.}(n_i, n_j)$, is

$$\mathrm{c.f.}(n_i, n_j) = \begin{cases} \dfrac{1}{n_{n_i,n_j}} & \text{if } i \neq j, \\ \mathrm{MAXINT} & \text{otherwise,} \end{cases}$$

where $n_{n_i,n_j}$ is the smallest number of edges between $n_i$ and $n_j$ in S. □

The coherence factor of two nodes $n_1$ and $n_2$ in an SDT (resp. SDT') measures how closely $n_1$ and $n_2$ are related. If $i = j$, then c.f. = MAXINT (in other words, $n_{n_i,n_j} = 0$). Hence, a node is more strongly coherent to itself than to any other node. $n_{n_i,n_j} = 1$ if either $n_i$ is the parent node of $n_j$, or vice versa, $n_{n_i,n_j} = 2$ if $n_i$ and $n_j$ are siblings or if one node is the grandparent of the other, and so forth.

3.1 Construction of approximated SDT

The first task of this step is to distinguish HTML elements of type 2 from type 1 in order to filter out type 2 elements since they either do not contribute to the hierarchical (information) structure of data contents or are not containers, i.e., they do not have data contents to be marked up in an HTML document.

We group HTML elements into type 1 and type 2 according to the HTML 3.2 Reference Specification [15]. First, we include text-level elements into type 2 with the exception of A. These elements do not cause a paragraph break and hence do not contribute to the hierarchical structure of data contents. Second, we include into type 2 all the non-container elements among the rest of the HTML elements in D with the exception of META. Non-container elements are meaningless with respect to the content of the HTML document. Elements belonging to either type 1 or type 2 are shown in Table 1. We define the approximated SDT constructed from D with the exclusion of elements of type 2 below.

HTML elements               | type 1                                  | type 2
head content                | TITLE, META†                            | ISINDEX, BASE, LINK, SCRIPT, STYLE, META†
body content, headings      | H1, H2, H3, H4, H5, H6                  |
body content, block-level   | P, UL, OL, DIR, MENU, LI, DL, DT, DD,   | ISINDEX, HR
                            | PRE, FORM, DIV, CENTER, TABLE, TH, TR,  |
                            | TD, BLOCKQUOTE                          |
body content, text-level:   |                                         |
  font                      |                                         | TT, I, B, U, STRIKE, BIG, SMALL, SUB, SUP
  phrase                    |                                         | EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE
  special                   | A                                       | IMG, APPLET, FONT, BASEFONT, BR, SCRIPT, MAP
  form                      |                                         | INPUT, SELECT, TEXTAREA
body content, address       | ADDRESS                                 |
† META with the name attribute="keywords" is considered type 1, while other METAs are type 2. <HTML>, <HEAD>, <BODY>, and <!> are also treated as type 2 elements.
Table 1: Two groups of HTML elements, type 1 and type 2

Definition 9 Given an HTML document D, the SDT constructed from D by excluding the start- and end-tags of type 2 is called the approximated semistructured data tree $SDT'_D$ of D, and the process of constructing $SDT'_D$ is called approximation. □

Given three tokens $t_1$, $t_2$, and $t_3$ which appear in a row in D, where $t_1$ and $t_3$ are of data type and $t_2$ is of type 2, $t_1$ and $t_3$ are concatenated to yield a single token of data type by approximation. As a result of approximation, the container c of an HTML element e in D and the element represented by the parent node of e in $SDT'_D$ may be different since c may be eliminated during the approximation of D. Hence, during the process of constructing $SDT'_D$, we need to identify the immediate container of a token. For this purpose, we maintain a stack to assert the container of a token, and subsequently the parent node of the token in $SDT'_D$, while $SDT'_D$ is being constructed. There are two operations defined over the stack: when a start-tag of a tag token in D is identified, the tag name of the start-tag is pushed onto the stack as the corresponding stack symbol, whereas the appearance of an end-tag e, which either appears explicitly or is implied in D, results in popping the top of the stack, which is the corresponding start-tag of e. For a token of data type, no stack operation is performed. A new node is created and attached to $SDT'_D$ if the current token is either a start-tag or of data type.

Based on the stack and its stack operations defined above, we now detail the construction process of $SDT'_D$. D consists of two blocks, head and body, contingent on the existence of HEAD and BODY elements in D. While we construct $SDT'_D$, we apply a few construction rules to D. Recall that a stack operation, either push or pop, is performed whenever the current token is either a start-tag or an end-tag. In addition, tag tokens of type 2 are ignored. For the head block of D, we apply the following rules to construct nodes in $SDT'_D$:

1. Construct the root node $V_R$ of $SDT'_D$ and label it by the URL of D.

2. Attach the node TITLE as a child of the root node, and create a child node of TITLE with the content of TITLE. <TITLE> is the only required element in an HTML document which provides a (short) description of the content of D.

3. If <META NAME="keywords" VALUE=str> exists in the head block, then create the nodes with labels KEYWORDS and str and edges such that $V_R \rightarrow$ KEYWORDS $\rightarrow$ str, where str is the value of the VALUE attribute. <META> is not a container; however, <META> is widely used for specifying a list of keywords relevant to D.

We ignore any other HTML elements in the head block since they are not containers. For the body block of D, we apply the following rules:

4. For tokens $t_s$ $t_c$ $t_e$, where $t_s$ is a start-tag, $t_c$ is the content of $t_s$ such that $t_c$ does not start with an anchor, and $t_e$ is the end-tag of $t_s$, a new node is created for $t_c$ and an edge is attached from node $t_c$ to the existing node $t_s$.

5. Suppose <A> is detected such that the dependencies $p \rightarrow A \rightarrow C_2$ hold in D, and $p \rightarrow C_1$ and $p \rightarrow C_3$ may or may not hold before and after $p \rightarrow A \rightarrow C_2$, respectively, in D, where $C_1$ and $C_3$ are data content of p and $C_2$ is the data content of <A>. Then, we ignore <A> and </A> and create a new node
   Case 1: $C_1$ if both $C_1$ and $C_3$ exist, and label the new node as $C_1 C_2 C_3$.
   Case 2: $C_1$ if only $C_1$ exists, and label the new node as $C_1 C_2$.
   Case 3: $C_2$ if only $C_3$ exists, and label the new node as $C_2 C_3$.
   Case 4: $C_2$ if neither $C_1$ nor $C_3$ exists, and label the new node as $C_2$.
   Attach the new node as a child of the existing node p and in turn attach the HREF attribute of <A> as a child of the new node. An HREF node is kept as a placeholder for the root node of another SDT' constructed from the document specified in the link which appears in the HREF attribute.
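The stack discipline and the type 2 filtering described above can be sketched with Python's standard `html.parser` module. This is a simplified illustration under stated assumptions, not the construction procedure itself: `TYPE2` lists only a subset of Table 1, only explicit end-tags are handled (implied end-tags are not inferred when a new start-tag arrives), and the special handling of TITLE, META, and <A> in Rules 2, 3, and 5 is omitted. The class name `Approximator`, the function `approximate`, and the dictionary layout of a node are hypothetical.

```python
from html.parser import HTMLParser

# Illustrative subset of the type 2 elements in Table 1 (lower case because
# HTMLParser reports tag names in lower case).
TYPE2 = {"html", "head", "body", "i", "b", "u", "tt", "em", "strong",
         "font", "br", "hr", "img"}


class Approximator(HTMLParser):
    """Builds a simplified approximated tree from explicit start- and end-tags."""

    def __init__(self, url):
        super().__init__()
        self.root = {"label": url, "is_data": False, "children": []}
        self.stack = [self.root]  # top of the stack is the current container

    def handle_starttag(self, tag, attrs):
        if tag in TYPE2:
            return  # type 2 tags are ignored: no node, no stack operation
        node = {"label": tag.upper(), "is_data": False, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)  # push the new container

    def handle_endtag(self, tag):
        if tag in TYPE2:
            return
        open_tags = [n["label"] for n in self.stack[1:]]
        if tag.upper() in open_tags:
            # Pop down to and including the matching start-tag.
            while self.stack.pop()["label"] != tag.upper():
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        siblings = self.stack[-1]["children"]
        if siblings and siblings[-1]["is_data"]:
            # Data tokens separated only by type 2 tags are concatenated.
            siblings[-1]["label"] += " " + text
        else:
            siblings.append({"label": text, "is_data": True, "children": []})


def approximate(html, url):
    parser = Approximator(url)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.root


tree = approximate(
    "<body><ul><li>Note: This page is a <i><b>modified</b></i> directory"
    "</li></ul></body>",
    "dir.htm",
)
li = tree["children"][0]["children"][0]
print([child["label"] for child in li["children"]])
```

In this sketch the three data tokens around <I> and <B> end up in a single data node under LI, which is the effect of approximation described for the 'Note: This page is a' sentence.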
Example 6 Consider dir.htm in Figure 1(b) and the SDT' of dir.htm as shown in Figure 3. By Construction Rule 1, we create the URL of 'dir.htm' as the root node of $SDT'_{dir.htm}$. In addition, TITLE and META (with "KEYWORDS" as its name) are created out of the head block tokens of dir.htm by Construction Rules 2 and 3, respectively. Other tokens, including <!DOCTYPE> and the other <META>, are discarded.

The HTML specification on <A> is overridden by Construction Rule 5. Also note that by approximation, tokens of type 2, such as <I>, <B> and <HR>, are discarded and hence the three tokens of data type 'Note: This page is a', 'modified' and 'directory of ...' yield a single node, which differs from $SDT_{dir.htm}$ in Figure 2. For the rest of the tokens in the body block, Construction Rule 4 is applied. □

Figure 3: SDT' of dir.htm

3.2 Migration from SDT' to CT

The final step in our approach for constructing the content tree of the given HTML document D is the migration from $SDT'_D$ to $CT_D$ by (i) sectioning the nodes in $SDT'_D$, (ii) clustering the nodes of data type within a section of $SDT'_D$, (iii) merging the clusters within a section, and (iv) promoting the clusters towards the root node of $SDT'_D$.

The key components involved in the migration process of $SDT'_D$ are the notions of section, cluster, and dependency constraint (which is extended from Definition 3). Sections and clusters are both logical units of nodes in $SDT'_D$, i.e., each one is a subset of nodes in $SDT'_D$ with their associated edges. The difference between a section and a cluster is that a cluster C is created and evolves within a section S and resides in S only. The container-content relationship among the nodes in C might change while C evolves during the migration process. Regarding sections and clusters, note that the coherence factor of any two tokens at the same level is considered as MAXINT if they belong to the same section or cluster.
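The coherence factor of Definition 8, on which the migration step relies, can be sketched as follows. The child-to-parent dictionary, the function names, and the use of `sys.maxsize` for MAXINT are assumptions made for illustration; the sketch covers only the base definition and not the upgrade to MAXINT for tokens in the same section or cluster.

```python
import sys

MAXINT = sys.maxsize  # assumption: any sufficiently large constant would do


def ancestors(parent, node):
    """Return node followed by its ancestors up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def edge_distance(parent, a, b):
    """Smallest number of edges between a and b in the tree."""
    depth_from_a = {n: d for d, n in enumerate(ancestors(parent, a))}
    for depth_from_b, n in enumerate(ancestors(parent, b)):
        if n in depth_from_a:  # first hit is the lowest common ancestor
            return depth_from_a[n] + depth_from_b
    raise ValueError("nodes are not in the same tree")


def coherence(parent, a, b):
    """Definition 8: MAXINT for a node with itself, else 1 / edge distance."""
    if a == b:
        return MAXINT
    return 1 / edge_distance(parent, a, b)


# Hypothetical fragment of a tree, given as a child -> parent mapping.
parent = {"HTML": "dir.htm", "HEAD": "HTML", "BODY": "HTML", "TITLE": "HEAD"}

print(coherence(parent, "HEAD", "TITLE"))    # parent and child
print(coherence(parent, "HEAD", "BODY"))     # siblings
print(coherence(parent, "dir.htm", "HEAD"))  # grandparent and grandchild
```

Parent and child are one edge apart, while siblings and grandparent-grandchild pairs are two edges apart, matching the cases listed after Definition 8.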
Definition 10 Given an SDT', let N be the number of children of the root node $V_R$ of SDT'. A section S consists of one or more sibling nodes $n_1, n_2, \ldots, n_k$ ($1 \le k \le N$) and their descendants in SDT', where
• $n_i$ ($1 \le i \le k$) is a child of $V_R$.
• $n_1$, the leftmost node (l-node) of S, is either (i) the node next to the TITLE node or KEYWORDS node, whichever appears rightmost in $SDT'_D$, or (ii) a heading node (i.e., H1, H2, ... or H6).
• $n_k$, the rightmost node (r-node) of S, is either (i) a node that appears prior to a heading or ADDRESS node, whichever appears leftmost in $SDT'_D$ after the l-node of S, or (ii) the rightmost child node of $V_R$ if there is no other heading or ADDRESS node after the l-node.
A section S has an attribute, called type $\tau_S$, whose value is either H1, H2, H3, H4, H5, H6, data or tag, determined by the l-node of S such that (i) if the l-node is a heading H, $\tau_S$ is H, (ii) if the l-node is a token of tag type other than a heading, $\tau_S$ is tag, or (iii) if the l-node is a token of data type, $\tau_S$ is data. Among different types of sections, H1 is the most important, followed by H2, ..., H6, and data, whereas tag is the least important. □

Example 7 Consider the $SDT'_{dir.htm}$ in Figure 3. According to Definition 10, section 1, which contains the subtree rooted at H1 and any subtrees of the root node dir.htm between the H1 node and the leftmost H2 node, is created. Similarly, section 2 and section 3 are created as shown in Figure 4. □

Figure 4: Sections in $SDT'_{dir.htm}$

Definition 11 Given a section S in an SDT', a cluster contains one or more subtrees in S. A cluster has an attribute, called type, whose value is either H1, H2, H3, H4, H5, H6, data or tag, and H1 is the most important, whereas tag is the least important. Top-level nodes (t-nodes) of a cluster C are the nodes in C which are closer to the root node of SDT' than others in C. □

Figure 5: Unit clusters in section 3

We now extend the "dependency constraint" defined in Definition 3 and apply the concept to other domains. Recall that the dependency constraint in Definition 3 is applied to objects. The extended dependency constraint is applied to sections and clusters as well as to objects. When a logical unit of tokens A, which is either a section, a cluster, or a node in $SDT'_D$, is more important (i.e., conceptually superior) than another logical unit of tokens B, we say that A depends on B or B is relevant to A, denoted by $A \rightarrow B$.

Definition 12 Given two logical units of tokens, A and B, A depends on B if
1. the type of A is more important than that of B, where A and B are sections. Let $L_A$ (resp. $L_B$) be the l-node of A (resp. B), and this dependency constraint is denoted as $L_A \rightarrow L_B$.
2. the type of A is more important than that of $B_i$, $1 \le i$, where A and $B_i$ are clusters in the same section. Let $T_A$ (resp. $T_{B_i}$) be the t-node (i.e., top-level node) in A (resp. $B_i$), and this dependency constraint is denoted as $T_A \rightarrow T_{B_i}$, for all i. □

In association with the definition of cluster, we define two operators: a unary operator for creating a cluster, called the unit clustering operator, whose argument is a token, and +(), a binary operator called the cluster merging operator, each of whose arguments is a cluster. The operators are described below.
1. (unit clustering) If a token is of data type, then applying the unit clustering operator to it yields a cluster of data type which includes the token and its descendants (if they exist).

2. (cluster merging) Clusters in an SDT' can be merged if the following constraints hold: (i) their root nodes are siblings of the same parent p, and (ii) their root nodes are in the same section of SDT'. $+(C_{d_1}, C_{d_2})$, where $C_{d_1}$ is a cluster of type $\tau_1$ and $C_{d_2}$ is a cluster of type $\tau_2$, merges $C_{d_1}$ and $C_{d_2}$, which yields a new cluster C such that
   2.1. the type of C is $\tau_1$ (resp. $\tau_2$), and $C_{d_1}$ (resp. $C_{d_2}$) depends on $C_{d_2}$ (resp. $C_{d_1}$) in C, if $\tau_2$ (resp. $\tau_1$) is less important than $\tau_1$ (resp. $\tau_2$), and
   2.2. the type of C is $\tau_1$, and $C_{d_1}$ appears on the left-hand side of $C_{d_2}$ in C, if $\tau_1$ is as important as $\tau_2$.
   The coherence factor among t-nodes that belong to two different clusters that are merged is upgraded to MAXINT.

3. (cluster promotion) If a parent node p has only one child which is a cluster, or all the clusters whose root nodes are relevant to the same parent p are merged, discard p and mark the resulting cluster as tag type. If p is one of the headings H1, H2, ..., or H6, then the attribute of the resulting cluster is set to either H1, H2, ..., or H6, respectively, after p is discarded. Note that if node $n_2$ is removed from the dependency relationship $n_1 \rightarrow n_2 \rightarrow n_3$, then $n_1 \rightarrow n_3$ maintains the asymmetric property of the relationship between $n_1$ and $n_3$. After being promoted, a cluster moves upward towards the root node of the tree.

By merging and promotion, all the HTML tags in an HTML document D, with the exception of TITLE, HREF attributes, and ADDRESS, are excluded from $SDT'_D$, but the tokens of data type remain in the resulting $CT_D$, and the coherence factor among the remaining tokens is further upgraded.

Example 8 Consider the sections in Figure 4 and the tokens in each section. We demonstrate the creation, merging, and promotion of clusters in section 3. Figure 5 shows three unit clusters, $C_1$, $C_2$, and $C_3$, of data type generated in section 3. Since each of $C_1$, $C_2$, and $C_3$ is the only cluster of its parent node, the cluster promotion rule is applied to each of these clusters, yielding clusters $C_1$, $C_2$, and $C_3$ in Figure 6. Note that the type of $C_1$ is set to H2, whereas the types of $C_2$ and $C_3$ are set to tag.

Figure 6: Promotion of clusters C2 and C3 after merging

Now, consider $C_2$ and $C_3$ in Figure 6 again. $C_2$ and $C_3$ have the same parent UL, and hence they are merged, which yields the cluster $C_4$ of tag type according to the Cluster Merging Rule 2.2. After merging, the resulting cluster $C_4$ can be further promoted and its parent node UL is removed by Rule 3. Figure 7 shows the clusters in section 3 after the merging and promotion operations have been applied. By now, clusters $C_1$ and $C_4$ in section 3 have the same parent, which is the root node, and hence the cluster merging rule can be applied again. Note that Rule 2.1 is applied since the type of $C_1$ is H2, whereas the type of $C_4$ is tag. The resulting cluster is shown in Figure 8.

Figure 7: Promotion of cluster C4 after removing UL

Similarly, the unit cluster 'CS Department Faculty Directory' of data type (called $C_5$) in section 1 is promoted to be of type H1 and its parent node H1 is removed from $SDT'_{dir.htm}$ in Figure 4.
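The unit clustering, merging, and promotion steps applied to section 3 in Example 8 can be traced with a small sketch. It is an illustration under assumptions rather than the authors' procedure: the `Tree` and `Cluster` classes and the function names are hypothetical, the parents of $C_2$ and $C_3$ are assumed to be LI nodes, and for Rule 2.1 the less important cluster is attached under the last t-node of the more important one, a choice the text does not specify for clusters with several t-nodes. The preconditions of merging (same parent, same section) are not checked.

```python
from dataclasses import dataclass, field
from typing import List

HEADINGS = ["H1", "H2", "H3", "H4", "H5", "H6"]
# Lower rank means more important: H1 > ... > H6 > data > tag.
RANK = {t: i for i, t in enumerate(HEADINGS + ["data", "tag"])}


@dataclass
class Tree:
    label: str
    children: List["Tree"] = field(default_factory=list)


@dataclass
class Cluster:
    type: str
    t_nodes: List[Tree]  # top-level nodes of the cluster


def unit(label: str) -> Cluster:
    """Unit clustering: a data token becomes a cluster of data type."""
    return Cluster("data", [Tree(label)])


def promote(cluster: Cluster, parent_tag: str) -> Cluster:
    """Rule 3: discard the parent; the type becomes the heading, else tag."""
    new_type = parent_tag if parent_tag in HEADINGS else "tag"
    return Cluster(new_type, cluster.t_nodes)


def merge(left: Cluster, right: Cluster) -> Cluster:
    """Rules 2.1 and 2.2, with the attachment point as an assumption."""
    if RANK[left.type] == RANK[right.type]:
        # Rule 2.2: equal importance, keep left-to-right order.
        return Cluster(left.type, left.t_nodes + right.t_nodes)
    upper, lower = (left, right) if RANK[left.type] < RANK[right.type] else (right, left)
    # Rule 2.1: the more important cluster depends on the less important one.
    upper.t_nodes[-1].children.extend(lower.t_nodes)
    return Cluster(upper.type, upper.t_nodes)


c1 = promote(unit("Visiting Faculty"), "H2")
c2 = promote(unit("Timour T. Paltashev (timourl@cs.byu.edu)"), "LI")
c3 = promote(unit("Evan Ivie (evani@cs.byu.edu)"), "LI")
c4 = promote(merge(c2, c3), "UL")   # Rule 2.2, then Rule 3 removes UL
section3 = merge(c1, c4)            # Rule 2.1: H2 is more important than tag

print(section3.type, section3.t_nodes[0].label)
print([child.label for child in section3.t_nodes[0].children])
```

After the final merge, the two visiting faculty entries sit directly under 'Visiting Faculty', which is the hierarchy that distinguishes the second 'Evan Ivie' from the first.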
Figure 8: Merging cluster C4 to cluster C1

Figure 9: Final CT of dir.htm

Also, the other unit cluster of data type, `Note: ...' (called C6), is merged with C5 as shown in Figure 9.
We now specify the dependency constraint between section 1 and section 3. Since the type of section 1 is H1 and that of section 3 is H2, section 1 depends on section 3 and hence `CS Department Faculty Directory' `Visiting Faculty'. Similarly, section 2 depends on section 1 and hence the l-node of section 2 (i.e., Professors, after promotion and merging) becomes a relevant node to the l-node of section 1, i.e., `CS Department Faculty Directory'. Figure 9 shows the CT of dir.htm. □

3.3 Use of CTs
We have demonstrated the process of constructing CT_D from a given HTML document D. For practical use, CT_D is presented in its lexical format so that users can develop a query processor or use an existing one on top of the lexical CT_D to retrieve hierarchical, sub-page level information from D.
Consider CT_dir.htm as shown in Figure 9. We can obtain the lexical CT_dir.htm from CT_dir.htm by determining the long names and path expressions of nodes in CT_dir.htm in a similar manner as we do for lexical SDTs. The lexical CT_dir.htm, i.e., L_dir.htm, is

    `dir.htm'.{
      TITLE.`Faculty Directory',
      KEYWORDS.`CS; computer science, faculty, directory',
      `CS Department Faculty Directory'.{
        `Note: This page is a modified directory ...',
        `Professors'.{
          `Robert Preece Burton (pburton@cs.byu.edu)'.`HREF="rpburton.html"',
          `Douglas M. Campbell (campbell@cs.byu.edu)'.`HREF="dmcampbell.html"',
          `Evan Ivie (evan@cs.byu.edu)'.`HREF="eivie.html"',
        },
        `Visiting Faculty'.{
          `Timour T. Paltashev (timourl@cs.byu.edu)'.`HREF="http://students.cs.byu.edu/~timpal"',
          `Evan Ivie (evani@cs.byu.edu)'.`HREF="evivie.html"',
        }
      },
      ADDRESS.`Comments to webmaster'.`HREF="/webmaster.html"'
    }

Consider the long name of each leaf node in CT_dir.htm, which is embedded in L_dir.htm. The long name of `Evan Ivie' who is a professor is `dir.htm'.`CS Department Faculty Directory'.`Professors'.`Evan Ivie (evan@cs.byu.edu)', whereas the long name of `Evan Ivie', who is a visiting faculty member, is `dir.htm'.`CS Department Faculty Directory'.`Visiting Faculty'.`Evan Ivie (evani@cs.byu.edu)'. Thus, the two `Evan Ivie's are distinguishable by their respective long names. The following SQL-like query retrieves `Evan Ivie (evani@cs.byu.edu)' via X using an SQL-like query processor for semistructured data applicable to L_dir.htm:

    select *.`Visiting Faculty'.X from L_dir.htm
    where X contains `Evan Ivie'

where `*' in the select statement denotes an arbitrary path expression and contains is a built-in predicate asserting that X includes `Evan Ivie' as a substring.
Now, consider the long name L1 of `Evan Ivie' who is a visiting faculty member in SDT_dir.htm, an enhanced parse tree of dir.htm in Figure 2. L1 is `dir.htm'.HTML.BODY.UL.LI.A.`Evan Ivie'. Recall that SDT is constructed strictly based on the HTML grammar. The benefits of CT over SDT are evident in this example: users avoid dealing with HTML elements to identify a data token in the tree representation of a given HTML document. More importantly, CT provides "better" hierarchical information of a data token than SDT. As
we have seen earlier, it is ambiguous which `Evan Ivie' is referred to by L1 (generated from the lexical SDT_dir.htm).

Figure 10: A sample query in WebView

The implementation of our hierarchical information approach is about halfway done, using the Java programming language. Figure 10 shows a snapshot of the WebView user interface, which is capable of constructing the SDT' of a given HTML document and processing SQL-like queries. Since WebView was implemented over SDT's, the example query in the figure contains an H2 tag in the select statement. Once the implementation of CTs, which is an extension of WebView, is completed, user queries can be posted without using HTML tags.

4 Conclusions
Dealing with hierarchical, sub-page level information on an HTML document D without knowing the internal structure of D beforehand is challenging because the structure of D is irregular and the HTML tags often break the intended hierarchy of data contents in D. We analyze the HTML specification and its use on the Web, and present an approach which is capable of constructing a tree representation, called content tree CT, of the data contents in an HTML document without the HTML tags while preserving its intended hierarchy. Also, we do not assume that the structure of the given HTML document is known beforehand. CT_D provides a richer hierarchy of data contents in an HTML document D than other approaches based on the parse tree of D or keyword-based searching techniques for retrieving the desired hierarchical information at the sub-page level of D.

References
[1] S. Abiteboul. Querying Semi-structured Data. In Proceedings of the 6th International Conference on Database Theory, 1997.
[2] S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, and J. Simeon. Querying Documents in Object Databases. Journal on Digital Libraries, 1(1):5–19, April 1997.
[3] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel Query Language for Semistructured Data. Journal on Digital Libraries, 1(1):68–88, 1997.
[4] G. O. Arocena and A. O. Mendelzon. WebOQL: Restructuring Documents, Databases and Webs. In Proceedings of the 14th International Conference on Data Engineering, February 1998.
[5] P. Atzeni and G. Mecca. Cut and Paste. In Proceedings of the 16th ACM PODS, pages 144–153, May 1997.
[6] P. Atzeni, G. Mecca, and P. Merialdo. To Weave the Web. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 206–215, August 1997.
[7] T. Berners-Lee and D. Connolly. Hypertext Markup Language - 2.0. Request for Comments: #1866, November 1995.
[8] P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding Structure for Unstructured Data. In Proceedings of the 6th International Conference on Database Theory, 1997.
[9] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting Semistructured Information from the Web. In Proceedings of Workshop on Management of Semistructured Data, May 1997.
[10] D. Konopnicki and O. Shmueli. W3QS: A Query System for the World-Wide Web. In Proceedings of the 21st Very Large Data Bases, pages 54–65, Sept. 1995.
[11] L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. A Declarative Language for Querying and Restructuring the Web. In Post-ICDE IEEE Workshop on Research Issues in Data Engineering, February 1996.
[12] S.-J. Lim and Y.-K. Ng. Extracting Structures of HTML Documents Using a High-level Stack Machine. To appear in ICOIN'98 Book, October 1998.
[13] A. O. Mendelzon, G. A. Mihaila, and T. Milo. Querying the World Wide Web. In Proceedings of International Conference on Parallel and Distributed Information Systems, 1996.
[14] D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. Querying Semistructured Heterogeneous Information. In Proceedings of DOOD'95, pages 319–344, December 1995.
[15] D. Raggett. HTML 3.2 Reference Specification. http://www.w3.org/TR/REC-html32, January 1997.
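
As a supplementary illustration of Section 3.3, the long names of the content tree of dir.htm and the effect of the SQL-like query can be sketched in Python. This is not the WebView implementation: the nested-dictionary encoding and the names `CT`, `long_names`, `show`, and `select_children` are assumptions made for the sketch, the HREF children are omitted, and the `*` path expression is taken to match any path above the named parent.

```python
from typing import Dict, Iterator, List, Tuple

# Hand-built fragment of the content tree of dir.htm (HREF children omitted).
CT: Dict[str, dict] = {
    "dir.htm": {
        "CS Department Faculty Directory": {
            "Professors": {
                "Robert Preece Burton (pburton@cs.byu.edu)": {},
                "Douglas M. Campbell (campbell@cs.byu.edu)": {},
                "Evan Ivie (evan@cs.byu.edu)": {},
            },
            "Visiting Faculty": {
                "Timour T. Paltashev (timourl@cs.byu.edu)": {},
                "Evan Ivie (evani@cs.byu.edu)": {},
            },
        }
    }
}


def long_names(tree: Dict[str, dict], prefix: Tuple[str, ...] = ()) -> Iterator[Tuple[str, ...]]:
    """Yield the root-to-leaf label path (long name) of every leaf node."""
    for label, subtree in tree.items():
        path = prefix + (label,)
        if subtree:
            yield from long_names(subtree, path)
        else:
            yield path


def show(path: Tuple[str, ...]) -> str:
    """Render a path in the dotted, quoted notation used for long names."""
    return ".".join(f"`{label}'" for label in path)


def select_children(tree: Dict[str, dict], parent: str, substring: str) -> List[str]:
    """Rough analogue of: select *.`parent'.X ... where X contains `substring'."""
    hits: List[str] = []
    for label, subtree in tree.items():
        if label == parent:
            hits += [child for child in subtree if substring in child]
        hits += select_children(subtree, parent, substring)
    return hits


for path in long_names(CT):
    print(show(path))
print(select_children(CT, "Visiting Faculty", "Evan Ivie"))
```

The two `Evan Ivie' leaves receive different long names because their paths pass through different parents, and restricting the match to children of `Visiting Faculty' selects only the visiting faculty member.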
