dependency constraint is applied to sections and clusters as well as to objects. When a logical unit of tokens A, which is either a section, a cluster, or a node in SDT'_D, is more important (i.e., conceptually superior) than another logical unit of tokens

node (i.e., top-level node) in A (resp. B_i), and this dependency constraint is denoted as T_A → T_{B_i}, for all i. □

In association with the definition of cluster, we define two operators: a unary operator for creating a cluster, called the unit clustering operator, whose argument is a token, and +(), a binary operator, called the cluster merging operator, each of whose arguments is a cluster. The operators are described below.
1. (unit clustering) If t is a token of data type, then applying the unit clustering operator to t yields a cluster of data type which includes t and its descendants (if they exist).

2. (cluster merging) Clusters in an SDT' can be merged if the following constraints hold: (i) their root nodes are siblings of the same parent p, and (ii) their root nodes are in the same section of SDT'. +(Cd1, Cd2), where Cd1 is a cluster of type τ1 and Cd2 is a cluster of type τ2, merges Cd1 and Cd2, which yields a new cluster C such that

   2.1. the type of C is τ1 (resp. τ2), and Cd1 (resp. Cd2) depends on Cd2 (resp. Cd1) in C, if τ2 (resp. τ1) is less important than τ1 (resp. τ2), and

   2.2. the type of C is τ1, and Cd1 appears on the left-hand side of Cd2 in C, if τ1 is as important as τ2.

   The coherence factor among t-nodes that belong to two different clusters that are merged is upgraded to MAXINT.

3. (cluster promotion) If a parent node p has only one child which is a cluster, or all the clusters whose root nodes are relevant to the same parent p are merged, discard p and mark the resulting cluster as tag type. If p is one of the headings H1, H2, ..., or H6, then the attribute of the resulting cluster is set to H1, H2, ..., or H6, respectively, after p is discarded. Note that if node n2 is removed from the dependency relationship n1 → n2 → n3, then n1 → n3 maintains the asymmetric property of the relationship between n1 and n3. After being promoted, a cluster moves upward towards the root node of the tree.

By merging and promotion, all the HTML tags in an HTML document D, with the exception of TITLE, HREF attributes, and ADDRESS, are excluded from SDT'_D, but the tokens of data type remain in the resulting CT_D, and the coherence factor among the remaining tokens is further upgraded.

Example 8 Consider the sections in Figure 4 and the tokens in each section. We demonstrate the creation, merging, and promotion of clusters in section 3. Figure 5 shows three unit clusters, C1, C2, and C3, of data type generated in section 3. Since each of C1, C2, and C3 is the only cluster of its parent node, the cluster promotion rule is applied to each of these clusters, yielding clusters C1, C2, and C3 in Figure 6. Note that the type of C1 is set to H2, whereas the types of C2 and C3 are set to tag.

Figure 6: Promotion of clusters C2 and C3 after merging

Now, consider C2 and C3 in Figure 6 again. C2 and C3 have the same parent UL, and hence they are merged, which yields the cluster C4 of tag type according to the Cluster Merging Rule 2.2. After merging, the resulting cluster C4 can be further promoted and its parent node UL is removed by Rule 3. Figure 7 shows the clusters in section 3 after the merging and promotion operations have been applied. By now, clusters C1 and C4 in section 3 have the same parent, which is the root node, and hence the cluster merging rule can be applied again. Note that Rule 2.1 is applied since the type of C1 is H2, whereas the type of C4 is tag. The resulting cluster is shown in Figure 8.

Figure 7: Promotion of cluster C4 after removing UL

Similarly, the unit cluster `CS Department Faculty Directory' of data type (called C5) in section 1 is promoted to be of type H1 and its parent node H1 is removed from SDT'_dir.htm in Figure 4.
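The unit clustering, merging, and promotion rules, together with the section 3 walk-through, can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `Node`, `Cluster`, `unit_cluster`, `merge`, and `promote` are hypothetical, the ranking that places tag above data is assumed, the point at which a less important cluster is attached under Rule 2.1 is assumed, LI is assumed to be the parent of each faculty entry, and the coherence factor is omitted.

```python
from dataclasses import dataclass, field
from typing import List

HEADINGS = [f"H{i}" for i in range(1, 7)]
# Assumed importance ranking (smaller = more important). Ranking "tag" above
# "data" is an assumption made for this sketch.
RANK = {name: i for i, name in enumerate(HEADINGS + ["tag", "data"])}


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Cluster:
    type: str
    roots: List[Node]


def unit_cluster(token: Node) -> Cluster:
    """Rule 1: wrap a data token (with its descendants) in a cluster of data type."""
    return Cluster("data", [token])


def merge(c1: Cluster, c2: Cluster) -> Cluster:
    """Rule 2: merge two sibling clusters that lie in the same section."""
    r1, r2 = RANK[c1.type], RANK[c2.type]
    if r1 == r2:
        # Rule 2.2: equal importance, so c1 stays on the left of c2.
        return Cluster(c1.type, c1.roots + c2.roots)
    superior, inferior = (c1, c2) if r1 < r2 else (c2, c1)
    # Rule 2.1: the result takes the more important type. Hanging the less
    # important cluster under the last root is an assumption of this sketch.
    superior.roots[-1].children.extend(inferior.roots)
    return Cluster(superior.type, superior.roots)


def promote(cluster: Cluster, parent_tag: str) -> Cluster:
    """Rule 3: discard the parent; a heading parent passes on its name as the type."""
    new_type = parent_tag if parent_tag in HEADINGS else "tag"
    return Cluster(new_type, cluster.roots)


# Section 3 of Example 8.
c1 = promote(unit_cluster(Node("Visiting Faculty")), "H2")
c2 = promote(unit_cluster(Node("Timour T. Paltashev")), "LI")
c3 = promote(unit_cluster(Node("Evan Ivie")), "LI")
c4 = promote(merge(c2, c3), "UL")  # Rule 2.2, then Rule 3 removes UL
section3 = merge(c1, c4)           # Rule 2.1: H2 is more important than tag
```

Under these assumptions, `section3` is a cluster of type H2 whose single root, `Visiting Faculty`, has the two faculty entries as children, which mirrors the shape described for Figure 8.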
Figure 8: Merging cluster C4 to cluster C1

Also, the other unit cluster of data type, `Note: ...' (called C6), is merged with C5 as shown in Figure 9.

We now specify the dependency constraint between section 1 and section 3. Since the type of section 1 is H1 and that of section 3 is H2, section 1 depends on section 3 and hence `CS Department Faculty Directory' → `Visiting Faculty'. Similarly, section 2 depends on section 1 and hence the l-node of section 2 (i.e., Professors, after promotion and merging) becomes a relevant node to the l-node of section 1, i.e., `CS Department Faculty Directory'. Figure 9 shows the CT of dir.htm. □

Figure 9: Final CT of dir.htm

3.3 Use of CTs

We have demonstrated the process of constructing CT_D from a given HTML document D. For practical use, CT_D is presented in its lexical format so that users can develop a query processor or use an existing one on top of the lexical CT_D to retrieve hierarchical, sub-page level information from D.

Consider CT_dir.htm as shown in Figure 9. We can obtain the lexical CT_dir.htm from CT_dir.htm by determining the long names and path expressions of nodes in CT_dir.htm in a similar manner as we do for lexical SDTs. The lexical CT_dir.htm, i.e., L_dir.htm, is

    `dir.htm'.{
      TITLE.`Faculty Directory',
      KEYWORDS.`CS; computer science,faculty, directory',
      `CS Department Faculty Directory'.{
        `Note: This page is a modified directory ...',
        `Professors'.{
          `Robert Preece Burton (pburton@cs.byu.edu)'.`HREF="rpburton.html"',
          `Douglas M. Campbell (campbell@cs.byu.edu)'.`HREF="dmcampbell.html"',
          `Evan Ivie (evan@cs.byu.edu)'.`HREF="eivie.html"',
        },
        `Visiting Faculty'.{
          `Timour T. Paltashev (timourl@cs.byu.edu)'.`HREF="http://students.cs.byu.edu/~timpal"',
          `Evan Ivie (evani@cs.byu.edu)'.`HREF="evivie.html"',
        }
      },
      ADDRESS.`Comments to webmaster'.`HREF="/webmaster.html"'
    }

Consider the long name of each leaf node in CT_dir.htm, which is embedded in L_dir.htm. The long name of `Evan Ivie' who is a professor is `dir.htm'.`CS Department Faculty Directory'.`Professors'.`Evan Ivie (evan@cs.byu.edu)', whereas the long name of `Evan Ivie', who is a visiting faculty member, is `dir.htm'.`CS Department Faculty Directory'.`Visiting Faculty'.`Evan Ivie (evani@cs.byu.edu)'. Thus, the two `Evan Ivie's are distinguishable by their respective long names. The following SQL-like query retrieves `Evan Ivie (evani@cs.byu.edu)' via X using an SQL-like query processor for semistructured data applicable to L_dir.htm:

    select *.`Visiting Faculty'.X from L_dir.htm
    where X contains `Evan Ivie'

where `*' in the select statement denotes an arbitrary path expression and contains is a built-in predicate asserting that X includes `Evan Ivie' as a substring.

Now, consider the long name L1 of `Evan Ivie' who is a visiting faculty member in SDT_dir.htm, an enhanced parse tree of dir.htm in Figure 2. L1 is `dir.htm'.HTML.BODY.UL.LI.A.`Evan Ivie'. Recall that SDT is constructed strictly based on the HTML grammar. The benefits of CT over SDT are evident in this example: users avoid dealing with HTML elements to identify a data token in the tree representation of a given HTML document. More importantly, CT provides "better" hierarchical information of a data token than SDT. As
we have seen earlier, it is ambiguous which `Evan Ivie' the long name L1 (generated from the lexical SDT_dir.htm) refers to.

Figure 10: A sample query in WebView

The implementation of our hierarchical information approach is about halfway complete, using the Java programming language. Figure 10 shows a snapshot of the WebView user interface, which is capable of constructing the SDT' of a given HTML document and processing SQL-like queries. Since WebView was implemented over SDT's, the example query in the figure contains an H2 tag in the select statement. Once the implementation of CTs, which is being developed as an extension of WebView, is completed, user queries can be posted without using HTML tags.
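To illustrate how a query over the lexical CT can be posed without HTML tags, the following Python sketch computes long names and emulates the `select *.`Visiting Faculty'.X ... where X contains `Evan Ivie'` query from Section 3.3. It is a minimal sketch under assumptions, not WebView's query processor: the nested-dictionary encoding `L_DIR` and the helpers `paths`, `long_names`, and `select` are hypothetical, and the tree is abridged (the TITLE, KEYWORDS, HREF, and ADDRESS entries are left out).

```python
# Abridged lexical CT of dir.htm as nested dictionaries (label -> children).
# This encoding is an assumption of the sketch; leaves map to empty dicts.
L_DIR = {
    "dir.htm": {
        "CS Department Faculty Directory": {
            "Note: This page is a modified directory ...": {},
            "Professors": {
                "Robert Preece Burton (pburton@cs.byu.edu)": {},
                "Douglas M. Campbell (campbell@cs.byu.edu)": {},
                "Evan Ivie (evan@cs.byu.edu)": {},
            },
            "Visiting Faculty": {
                "Timour T. Paltashev (timourl@cs.byu.edu)": {},
                "Evan Ivie (evani@cs.byu.edu)": {},
            },
        }
    }
}


def paths(tree, prefix=()):
    """Yield (path, is_leaf) for every node; path is the label tuple from the root."""
    for label, children in tree.items():
        path = prefix + (label,)
        yield path, not children
        yield from paths(children, path)


def long_names(tree):
    """Return the long name of every leaf, written with the paper's quoting style."""
    return [".".join(f"`{p}'" for p in path) for path, is_leaf in paths(tree) if is_leaf]


def select(tree, parent_label, needle):
    """Emulate: select *.`parent_label'.X from tree where X contains `needle'."""
    return [
        path[-1]
        for path, _ in paths(tree)
        if len(path) >= 2 and path[-2] == parent_label and needle in path[-1]
    ]


evan_names = [name for name in long_names(L_DIR) if "Evan Ivie" in name]
visiting = select(L_DIR, "Visiting Faculty", "Evan Ivie")
```

Here `evan_names` holds two distinct long names, one under `Professors` and one under `Visiting Faculty`, while `visiting` contains only the visiting faculty entry. The query names a content label rather than an HTML element such as H2.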
4 Conclusions

Dealing with hierarchical, sub-page level information on an HTML document D without knowing the internal structure of D beforehand is challenging because the structure of D is irregular and the HTML tags often break the intended hierarchy of data contents in D. We analyze the HTML specification and its use on the Web, and present an approach which is capable of constructing a tree representation, called content tree CT, of the data contents in an HTML document without the HTML tags while preserving its intended hierarchy. Also, we do not assume that the structure of the given HTML document is known beforehand. CT_D provides a richer hierarchy of data contents in an HTML document D than other approaches based on the parse tree of D or keyword-based searching techniques for retrieving the desired hierarchical information at the sub-page level of D.