Extracting Structures of HTML Documents

Seung-Jin Lim
Yiu-Kai Ng
Computer Science Department
Brigham Young University
Provo, Utah 84602, U.S.A.
Email: {ng,sjlim}@cs.byu.edu
Abstract
Information on the Web, which is a conglomeration of heterogeneous data¹ such as
texts, images, and audio clips, is often accessed through documents written according
to the HTML specification [Gro95]. According to the HTML specification, HTML documents
are semistructured in nature. We propose a high-level stack machine (HSM) which accesses
an HTML document through its URL and constructs a semistructured data graph (SDG) of the
document. The SDG of an HTML document H precisely captures the structure of the
semistructured data embedded in H based on the dependency relationship [LN97] among the
data objects in H. HSM is configurable to accommodate a user's interest with respect to
the HTML elements in H to be considered during the construction process of the SDG of H.
1 Introduction
During the early days of the World-Wide Web (WWW or Web), users heavily relied on the
mouse-button-click navigation method through hyperlinks provided by Web browsers to
retrieve information of interest and soon found themselves lost somewhere in the midst of
cyberspace [KS95]. Since then, Web designers, as well as Web users, have been looking for
better alternatives. Two recent alternative approaches are (i) the keyword search method
using index servers, and (ii) the method of using extended query languages, including
SQL-like query languages such as [AQM+96, MMM96, Abi97, HGMC+97, NAM97], and
Datalog-like query languages such as WebLog [LSS96]. To better understand this issue, we
first clarify three types of data with respect to their data structures according to [KS95]:
Unstructured data: Data which is stored in files such as executable files, pure text files
which contain no formatting code, and audio files. It is difficult, if not impossible,
to ascertain the semantics of this type of data. ([BDHS96, BDFS97] use the term
unstructured data for any data of no rigid structure.)
Structured data: A typical example of this type of data is tables in the relational
database model. The semantics of this type of data can be obtained using the grammar
of the formal language in which the data file is written.
¹ Data is literally the plural form of datum. Hence, whenever we say data D, we really mean
that D is a set of datum, or a set of data objects.
Semistructured data: This type of data is anything between the two extreme types of
data mentioned above. Examples of this type of data are text files that contain
formatting codes, such as LaTeX or HTML, and files which require a strict inner structure
but in which some of the structural components can be omitted, such as BibTeX files, Unix
environment files, etc. For a BibTeX file, if an entry of the file is of type article, then
the meaning of article can be obtained using the set of predefined fields for article.
We view information on the Web as a collection of heterogeneous data, consisting by and
large of HTML documents along with other types of data including images, sound, and video
clips. The HTML specification [Gro95], which is the most widely used paradigm for posting
information on the Web, does not require a uniform structure in documents; e.g., an element
which appears in one HTML document may be missing in other documents. Hence, we treat
an HTML document H as a textual representation of semistructured data embedded within H.
One of the benefits of using semistructured data is its great flexibility in data
representation [BDHS96]. A theory of semistructured data, however, is still missing [Abi97],
and hence there is no universally standardized definition of semistructured data. In this
paper, we consider a finite set of data objects $\{o_1, o_2, \ldots, o_n\}$ as semistructured
data D if
• the structure of D is irregular or incomplete [QRS+95, AQM+96, AMM97],
• the distinction between the schema of D and $\{o_1, o_2, \ldots, o_n\}$ is blurred, and the
  schema may change dynamically [Abi97, BDFS97], and
• $o_i$ ($1 \le i \le n$) is not type-sensitive. □
For example, semistructured data may include components A and B. Let the data fields of
A be name and date of birth, and those of B be name, date of birth, and address. Further
assume that date of birth in A is of type string, such as "6-12-1938", whereas date of birth
in B consists of three parts, Month, Day and Year, and Month can be of type integer, i.e.,
"1", "2", ..., or string, i.e., "January", "February", ..., or even encoded in a character
set other than ASCII. The structures of A and B can be changed subsequently as needed.
In this paper, we present an approach for extracting the structures of semistructured
data embedded within an HTML document H on the WWW, assuming that H is written in
compliance² with the HTML specification [Gro95] and referred to by a given URL.
(Additional URLs can be further obtained from the hyperlinks included in H.) We first
propose a graphical data model of semistructured data, called the semistructured data graph
(SDG), and then present a tool, a high-level stack machine (HSM)³, to extract the structures
embedded in H and represent the structures in an SDG.
² Note that erroneous HTML documents are not excluded from our consideration.
³ HSM is a variation of the well-known pushdown automata (PDA).

The main contribution of this paper is three-fold. First, we extend the concept of
dependency relationship among database components in [LN97] to capture the structure of
the semistructured data embedded in an HTML document. Second, we design and implement
a simple automaton, HSM, using the Java Language Environment, to construct the
SDG for an HTML document specified by a URL. Since HSM is built based on pushdown
automata, HSM is fairly easy to implement using a stack, without concern for sophisticated
functions such as rereading or replacing an input [HU79], or assuming unlimited
auxiliary memory as does a Turing machine. Third, our HSM is easily configurable according
to the user's need in terms of what HTML elements are to be included in an SDG. To
configure HSM, a user simply provides a configuration file which contains a list of HTML
elements of an HTML document H chosen by the user to be included in the SDG of H.
Writing a configuration file does not require any knowledge of additional commands and
their syntax, such as get(), split() and citytemp[1 : 0] in [HGMC+97].

We proceed to present our results as follows. In Section 2, we describe the details of
SDGs. In Section 3, we propose HSM and demonstrate its application using a real-world
example. In Section 4, we give concluding remarks.
2 The Data Model
The foundation of our data model for semistructured data, SDG, is the notion of
dependency relationship [LN97] among database objects. We give the definition of objects
below.
Definition 1 An object o in a set D is a triple $\langle label, value, identifier \rangle$, where
• label is the textual description (i.e., a string) of o.
• value is a finite ordered set of strings. If value is an empty set, then o is called a
  free object; otherwise, o is called a bound object.
• identifier is a non-empty string which uniquely defines o among objects in D. □
In an SDG, the identifier of an object is static, whereas its label and value are dynamic.
In other words, the identifier of an object does not change, whereas its label and value may
change. We formalize and illustrate how the value of an object changes from free to bound
in Definition 2 and Example 1.
Also, a type constraint is not strictly enforced on an object in an SDG, and hence the
value of an object can be of any type. For instance, the value of object Month can be a
string "January", an integer "1", or a string encoded using a character set other than
ASCII. This relaxation is desirable in that whenever a query is posted for an object o using
values of types other than o's, the system may gracefully fail to process the query
rather than raise an error [QRS+95]. (For simplicity, in this paper we assume the value
of an object is a set of strings, since all atomic types such as integer, real and boolean,
and even a set of atomic types, can be represented as a string. In addition, we use str
instead of {str} for each singleton set {str}.)
In conjunction with the definition of an object, we define a few utility functions below,
where S is a set of strings:
(1) Function o.label() on object o returns the label of o.
(2) Function o.value() on object o returns the value of o.
(3) con is a binary function $S \times S \rightarrow S$ such that con(arg1, arg2) = arg1.arg2.
(4) CON is a binary function $2^S \times 2^S \rightarrow 2^S$ such that
    $$CON(\{s_1, s_2, \ldots, s_m\}, \{t_1, t_2, \ldots, t_m\}) = \bigcup_{\forall i,j} \{con(s_i, t_j)\}.$$ ⁴
(5) inclusive-union $\uplus$ is a binary function $\uplus : S \times S \rightarrow 2^S$ such that
    $$arg1 \uplus arg2 = \begin{cases} CON(\{A\}, \{B\} \cup \{C\}) & \text{if } arg1 = A.B \text{ and } arg2 = A.C, \\ \{B\} \cup \{C\} & \text{if } arg1 = B \text{ and } arg2 = C. \end{cases}$$
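The utility functions (3) to (5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: labels are plain strings that contain no dot, the first dot of an argument separates the common prefix A from the rest, and the name `inclusive_union` is ours, not the paper's. Read as sets of strings, both cases of (5) yield the two argument strings; the factored form A.{B, C} is a more compact way of writing the same set.

```python
def con(arg1: str, arg2: str) -> str:
    """Utility (3): concatenate two strings with a dot."""
    return arg1 + "." + arg2


def CON(left: set, right: set) -> set:
    """Utility (4): pairwise con over two sets of strings."""
    return {con(s, t) for s in left for t in right}


def inclusive_union(arg1: str, arg2: str) -> set:
    """Utility (5), assuming the first dot splits off the shared prefix A."""
    a1, dot1, b = arg1.partition(".")
    a2, dot2, c = arg2.partition(".")
    if dot1 and dot2 and a1 == a2:
        # Case arg1 = A.B and arg2 = A.C
        return CON({a1}, {b} | {c})
    # Case arg1 = B and arg2 = C
    return {arg1} | {arg2}
```

For instance, `inclusive_union("Address.City", "Address.ZipCode")` returns the set written as Address.{City, ZipCode} in the notation of this paper.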
Definition 2 An object $o_1$ is said to directly depend on another object $o_2$, denoted
$o_1 \rightarrow o_2$, if $o_1.value() := o_2.value()$. Dependency is transitive, which means that
$o_1 \rightarrow o_3$ if $o_1 \rightarrow o_2$ and $o_2 \rightarrow o_3$. In such a case, we say that $o_1$ indirectly
depends on $o_3$, denoted $o_1 \rightarrow o_2 \rightarrow o_3$ or $o_1 \rightarrow o_3$. If an object o directly depends on
multiple objects $o_1, o_2, \ldots, o_n$, i.e., $o \rightarrow o_1$, $o \rightarrow o_2$, ..., $o \rightarrow o_n$, then
$$o.value() := o_1.value() \cup o_2.value() \cup \cdots \cup o_n.value().$$ □
Example 1 Consider the following set of objects, where each object o is associated with
information in the form (o.label(); o.value(); a list of objects on which o directly
depends). It is assumed that a string which contains spaces is enclosed by double quotes,
and `-' denotes an empty list.
Location: (Location; (free); Address and "Time Zone"),
Address: (Address; (free); "Street Address", City and "Zip code"),
"Street Address": (Street; (free); "314 E. Heather Cir."),
City: (City; (free); Orem),
"Zip code": (ZipCode; (free); 84057),
"Time Zone": (TimeZone; (free); MST),
"314 E. Heather Cir.": ("314 E. Heather Cir."; "314 E. Heather Cir."; - ),
Orem: (Orem; Orem; - ),
84057: (84057; 84057; - ), and
MST: (MST; "Mountain Standard Time"; - ).
Let us consider object Address. The dependency constraint of Address is
Address $\rightarrow$ {"Street Address" $\rightarrow$ "314 E. Heather Cir.", City $\rightarrow$ Orem, "Zip code" $\rightarrow$ 84057}. Hence,
Address.value() := {"Street Address".value(), City.value(), "Zip code".value()}
                 = {"314 E. Heather Cir.".value(), Orem.value(), 84057.value()}
                 = {"314 E. Heather Cir.", Orem, 84057}. □
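The way a free object becomes bound in Definition 2 and Example 1 can be sketched as a small recursive procedure. The following Python fragment is a minimal illustration, not the authors' implementation: the class `Obj`, the function `resolve`, and the helper `example_objects` are hypothetical names, values are kept as ordered lists, and the dependency relation is assumed to be acyclic.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Obj:
    """One object per Definition 1 (field names are illustrative)."""
    identifier: str
    label: str
    value: list[str] = field(default_factory=list)       # empty means free
    depends_on: list[str] = field(default_factory=list)  # identifiers, in order


def resolve(objects: dict[str, Obj], identifier: str) -> list[str]:
    """Bind a free object to the ordered union of the values it depends on."""
    obj = objects[identifier]
    if obj.value:                       # bound objects keep their value
        return obj.value
    merged: list[str] = []
    for dep in obj.depends_on:          # union over the direct dependencies
        for item in resolve(objects, dep):
            if item not in merged:
                merged.append(item)
    obj.value = merged                  # the object changes from free to bound
    return merged


def example_objects() -> dict[str, Obj]:
    """The objects of Example 1: (identifier, label, value, dependencies)."""
    rows = [
        ("Location", "Location", [], ["Address", "Time Zone"]),
        ("Address", "Address", [], ["Street Address", "City", "Zip code"]),
        ("Street Address", "Street", [], ["314 E. Heather Cir."]),
        ("City", "City", [], ["Orem"]),
        ("Zip code", "ZipCode", [], ["84057"]),
        ("Time Zone", "TimeZone", [], ["MST"]),
        ("314 E. Heather Cir.", "314 E. Heather Cir.", ["314 E. Heather Cir."], []),
        ("Orem", "Orem", ["Orem"], []),
        ("84057", "84057", ["84057"], []),
        ("MST", "MST", ["Mountain Standard Time"], []),
    ]
    return {i: Obj(i, label, value, deps) for i, label, value, deps in rows}
```

Calling `resolve(example_objects(), "Address")` yields the three values derived for Address in Example 1, in the order in which its dependencies are listed.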
We define the long label of an object o based on the dependency relationship of o with
other objects.
Definition 3 Given a dependency among objects $o_1, o_2, \ldots, o_N$ such that
$o_1 \rightarrow \cdots \rightarrow o_N$, the long label of $o_i$, denoted $o_i.Label()$, $1 \le i \le N$, is defined as
$$o_1.Label() = o_1.label(), \text{ and}$$
$$o_i.Label() = con(o_{i-1}.Label(), o_i.label()), \quad 2 \le i \le N.$$ □
Example 2 Consider the set of objects in Example 1. According to Definition 3, all the
expressions listed below are valid.
Location.Label() = Location.label() = Location
Orem.Label() = (City.Label()).(Orem.label())
             = (Address.Label()).(City.label()).Orem
             = (Location.Label()).(Address.label()).City.Orem
             = (Location.label()).Address.City.Orem
             = Location.Address.City.Orem
"Time Zone".Label() = Location.TimeZone □

Figure 1: The SDG for the semistructured data Info

⁴ In the rest of this paper, we use $\{s_1, s_2, \ldots, s_m\}.\{t_1, t_2, \ldots, t_m\}$ on behalf of
$\bigcup_{\forall i,j} \{con(s_i, t_j)\}$ for simplicity of the notation.
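The recursion of Definition 3 can be traced with a short Python sketch. It assumes, as in Example 1, that every object has at most one object depending on it, so that a `PARENT` map describes the dependency chain; `LABEL`, `PARENT`, and `long_label` are illustrative names only.

```python
# Labels of the objects of Example 1 that appear in Example 2.
LABEL = {
    "Location": "Location",
    "Address": "Address",
    "City": "City",
    "Orem": "Orem",
    "Time Zone": "TimeZone",
}

# PARENT[x] is the object that directly depends on x (assumed unique).
PARENT = {
    "Address": "Location",
    "City": "Address",
    "Orem": "City",
    "Time Zone": "Location",
}


def con(arg1: str, arg2: str) -> str:
    return arg1 + "." + arg2


def long_label(identifier: str) -> str:
    """o_1.Label() = o_1.label(); o_i.Label() = con(o_{i-1}.Label(), o_i.label())."""
    if identifier not in PARENT:        # o_1, the start of the chain
        return LABEL[identifier]
    return con(long_label(PARENT[identifier]), LABEL[identifier])
```

With these maps, `long_label("Orem")` unfolds in the same order as the derivation of Orem.Label() in Example 2.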
We now formally define SDGs, which are based on the notion of dependency constraints
among objects.
Definition 4 Given semistructured data D, the semistructured data graph SDG⁵ of D is
a triple (V, E, g) which is a rooted, directed, labeled graph, where
• V is a finite set of nodes and $V = \{V_R\} \cup V_I \cup V_L$, where $\{V_R\} \cap V_I \cap V_L = \emptyset$. $V_R$
  with label `D' is the root node of SDG which serves as the entry point of SDG and
  represents D. $V_L$ is a finite set of leaf nodes, and $V_I$ is a finite set of nodes other than
  $V_R$ and $V_L$ in SDG. $n \in (V_I \cup V_L)$ with label `o' represents an object o in D.
• E is a finite set of directed edges.
• g is a function from an edge to a pair of endpoints such that $g(e) = (n_1, n_2)$ if and
  only if the object represented by $n_1 \in (\{V_R\} \cup V_I)$ depends on the object represented
  by $n_2 \in (V_I \cup V_L)$. □
Example 3 Consider the object Location in Example 1 and suppose semistructured data
Info is defined by Location. Given the dependency constraints for Location and its relevant
objects, the SDG for Info is obtained as illustrated in Figure 1. A more comprehensive
example of SDGs is presented in Section 3. □
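As a rough illustration of Definition 4 and Example 3, the following Python sketch derives the node sets and the edge function g from the dependency lists of Example 1. The function `build_sdg`, the edge names `e1`, `e2`, ..., and the dictionary layout are assumptions made for this sketch; nodes are named by object identifiers.

```python
def build_sdg(root: str, entry: str, depends_on: dict) -> dict:
    """Return V_R, V_I, V_L, E and g for the objects reachable from `entry`."""
    order = [root]                      # nodes in discovery order, root first
    g = {}                              # edge name -> (n1, n2), n1 depends on n2

    def visit(parent: str, node: str) -> None:
        g["e%d" % (len(g) + 1)] = (parent, node)
        if node in order:
            return
        order.append(node)
        for child in depends_on.get(node, []):
            visit(node, child)

    visit(root, entry)
    leaves = {n for n in order[1:] if not depends_on.get(n)}
    internal = set(order[1:]) - leaves
    return {"V_R": root, "V_I": internal, "V_L": leaves, "E": set(g), "g": g}


# Dependency lists of Example 1; Info is the root and is defined by Location.
DEPENDS_ON = {
    "Location": ["Address", "Time Zone"],
    "Address": ["Street Address", "City", "Zip code"],
    "Street Address": ["314 E. Heather Cir."],
    "City": ["Orem"],
    "Zip code": ["84057"],
    "Time Zone": ["MST"],
}

INFO_SDG = build_sdg("Info", "Location", DEPENDS_ON)
```

In `INFO_SDG`, the four bound objects of Example 1 end up in `V_L`, the six free objects in `V_I`, and each pair in `g` runs from the dependent object to the object it depends on.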
Definition 5 Given a long label of an object $o_N$ in the form of $(o_1.label()).(\ldots).(o_N.label())$
in an SDG, the path expression pe is a binary function, $pe: S \times S \rightarrow S$, where S is a set
of strings, such that
• $pe(o_i, o_i) = o_i.label()$ ($1 \le i \le N$); and
• $pe(o_i, o_j) = con(pe(o_i, o_{j-1}), o_j.label())$ ($1 \le i < j \le N$),
where $o_1$ denotes the root node of the SDG. □

⁵ Although we define SDG on semistructured data, SDG can also be applied to structured data. In fact,
semistructured data subsumes structured data.
It is easy to see that $pe(V_R, o) = o.Label()$, where $V_R$ is the root node and $o \in (V_I \cup V_L)$
of the corresponding SDG.
An SDG precisely captures the structure and values of semistructured data graphically.
Given below is the definition of the lexical representation of an SDG. The lexical
representation of semistructured data D is useful in some situations, e.g., finding an object
in D using a textual path expression.
Definition 6 Given an SDG = $(\{V_R\} \cup V_I \cup V_L, E, g)$, the lexical representation $L_o$, or lexical
SDG, of $o \in (\{V_R\} \cup V_I \cup V_L)$ is a textual representation of the subgraph S of the SDG
rooted at o such that $L_o = \biguplus_{\forall i} pe(o, o_i)$, where $o_i$ is a leaf node in S. □
Note that given an SDG = $(\{V_R\} \cup V_I \cup V_L, E, g)$, the lexical representation of the SDG
is obtained by $\biguplus_{\forall i} pe(V_R, o_i) = \biguplus_{\forall i} o_i.Label()$, where $o_i \in V_L$.
Example 4 Consider the SDG of the semistructured data Info in Example 3. Then
pe(Address, 84057)
  = con(pe(Address, "Zip code"), 84057.label())
  = con(con(pe(Address, Address), "Zip code".label()), 84057)
  = con(con(Address.label(), ZipCode), 84057)
  = con(Address.ZipCode, 84057)
  = Address.ZipCode.84057
pe(Address, Orem) = Address.City.Orem
$L_{Address}$ = Address.Street."314 E. Heather Circle" $\uplus$ Address.City.Orem $\uplus$ Address.ZipCode.84057
  = CON({Address}, {Street."314 E. Heather Circle", City.Orem, ZipCode.84057})
  = Address.{Street."314 E. Heather Circle", City.Orem, ZipCode.84057}
Similarly, $L_{Info}$ = Info.Location.{Address.{Street."314 E. Heather Circle", City.Orem,
ZipCode.84057}, TimeZone.MST}. □
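The computations of Example 4 can be reproduced with a short Python sketch of Definitions 5 and 6. It is written under several assumptions: the SDG is a tree described by a hypothetical `CHILDREN` map, labels that contain spaces are stored with their double quotes, the street value is spelled as in Example 4, and the factored brace notation of footnote 4 is produced directly instead of going through CON.

```python
LABEL = {
    "Info": "Info",
    "Location": "Location",
    "Address": "Address",
    "Street Address": "Street",
    "City": "City",
    "Zip code": "ZipCode",
    "Time Zone": "TimeZone",
    "street value": '"314 E. Heather Circle"',
    "Orem": "Orem",
    "84057": "84057",
    "MST": "MST",
}

CHILDREN = {
    "Info": ["Location"],
    "Location": ["Address", "Time Zone"],
    "Address": ["Street Address", "City", "Zip code"],
    "Street Address": ["street value"],
    "City": ["Orem"],
    "Zip code": ["84057"],
    "Time Zone": ["MST"],
}

# PARENT[x] is the node that x hangs under in the SDG.
PARENT = {child: node for node, kids in CHILDREN.items() for child in kids}


def con(arg1: str, arg2: str) -> str:
    return arg1 + "." + arg2


def pe(oi: str, oj: str) -> str:
    """Path expression of Definition 5; raises KeyError if oi is not an ancestor of oj."""
    if oi == oj:
        return LABEL[oi]
    return con(pe(oi, PARENT[oj]), LABEL[oj])


def lexical(o: str) -> str:
    """Lexical representation L_o in the factored notation label.{...}."""
    kids = CHILDREN.get(o, [])
    if not kids:
        return LABEL[o]
    parts = [lexical(kid) for kid in kids]
    inner = parts[0] if len(parts) == 1 else "{" + ", ".join(parts) + "}"
    return con(LABEL[o], inner)
```

Here `pe("Address", "84057")` follows the same unfolding as the first derivation of Example 4, and `lexical("Info")` produces the string given for L_Info.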
Definition 7 Given semistructured data D and its corresponding semistructured data
graph SDG = $(\{V_R\} \cup V_I \cup V_L, E, g)$, the schema S of D in SDG is $S = \biguplus_{\forall i} pe(o, o_i)$, where
$o_i$ is any object in $V_I$ and o is $o_i$'s ancestor which is a child of $V_R$. □
Example 5 Consider Info in Example 3 again. The schema of Info is
S = pe(Location, Location) $\uplus$ pe(Location, Address) $\uplus$ pe(Location, "Time Zone")
    $\uplus$ pe(Location, "Street Address") $\uplus$ pe(Location, City) $\uplus$ pe(Location, "Zip code")
  = Location $\uplus$ Location.Address $\uplus$ Location.TimeZone $\uplus$ Location.Address.Street
    $\uplus$ Location.Address.City $\uplus$ Location.Address.ZipCode
  = Location.{Address.{Street, City, ZipCode}, TimeZone} □
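The individual path expressions that make up the schema in Example 5 can be enumerated by walking the internal nodes of the SDG. The sketch below is illustrative only: `schema_paths` is a hypothetical helper, the SDG is assumed to be a tree, and the paths are returned as a flat list in depth-first order, without the final factoring step.

```python
def schema_paths(children: dict, label: dict, root: str) -> list:
    """List pe(o, o_i) for every internal node o_i below each child o of the root."""
    paths = []

    def walk(node: str, prefix: str) -> None:
        if not children.get(node):      # leaf nodes carry values, not schema
            return
        path = prefix + "." + label[node] if prefix else label[node]
        paths.append(path)
        for kid in children[node]:
            walk(kid, path)

    for top in children.get(root, []):
        walk(top, "")
    return paths


SCHEMA_LABEL = {
    "Location": "Location", "Address": "Address", "Street Address": "Street",
    "City": "City", "Zip code": "ZipCode", "Time Zone": "TimeZone",
}
SCHEMA_CHILDREN = {
    "Info": ["Location"],
    "Location": ["Address", "Time Zone"],
    "Address": ["Street Address", "City", "Zip code"],
    "Street Address": ["314 E. Heather Cir."],
    "City": ["Orem"],
    "Zip code": ["84057"],
    "Time Zone": ["MST"],
}
INFO_SCHEMA = schema_paths(SCHEMA_CHILDREN, SCHEMA_LABEL, "Info")
```

`INFO_SCHEMA` contains the six path expressions that are combined by inclusive-union in Example 5.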
3 The High-level Stack Machine
In this section, we present a high-level stack machine (HSM) as a tool which extracts data
structures embedded in an HTML document H and demonstrate the construction of the
SDG for H specified by a URL using HSM⁶.
To construct the SDG for a given HTML document, we propose HSM, which processes
HTML elements. As mentioned earlier, HSM is built based on pushdown automata (PDA).
We are particularly interested in employing PDA since PDA is relatively easy to comprehend,
design, and implement compared with Turing machines [HU79] and has sufficient
power to construct the SDGs for HTML documents. (We assume that readers are familiar
with PDA.)
With respect to the design of HSM, we classify the HTML elements into two types
as follows: (1) elements which begin with a start-tag and end with an end-tag, and (2)
elements whose end-tags are either optional or do not exist. Most of the commonly used
HTML elements are of type 1. Such elements include document element HTML, head
element HEAD, body BODY, headings H1 ... H6, title TITLE, anchor A, some of the
block structuring elements such as list elements OL and UL, block quote BLOCKQUOTE,
preformatted text PRE, directory list DIR, menu list MENU, address element ADDRESS,
and phrase markups such as EM, B, I and STRONG. Examples of elements of type 2
are some of the block structuring elements such as list element LI, definition lists DT and
DD, line break BR, and horizontal rule HR. Some HTML elements, such as IMG, are not
accompanied by an end-tag, but the closing angle bracket indicates where the element
ends. Therefore, we categorize IMG as an element of type 1.
Besides the two types of HTML elements mentioned above, we treat some elements
as `unproductive' with respect to SDG, i.e., they do not generate an output on an SDG in
our current version of HSM, and are simply ignored by HSM. UPE, the set of unproductive
elements, is determined by the user on demand by excluding UPE from a configuration file
as discussed earlier. In the current version of HSM, elements such as HR, B, CITE, DIR,
TT, DL, DT, DD, and comments <! ... > are included in UPE. In addition, we treat the
set of elements of type 2 as a proper subset of UPE. However, when we consider a query
language for SDG in our future work, it may be worthwhile to adjust UPE accordingly to give
more weight to styled text than plain text. (The styled text is surrounded by character
style tags such as <B>, <BIG>, <STRONG> and their corresponding end-tags.)
During the construction process of an SDG, the following two rules of HSM are applied
to elements in an HTML document H:
• Skip an element e if e ∈ UPE. No stack operation is necessary, and no changes occur
  in the SDG of H.
• For an element e' of type 1, push the corresponding stack symbol (defined in
  Definition 8) of e' onto the stack whenever the start-tag of e' is encountered, and pop from
  the stack of HSM whenever the corresponding end-tag of e' is detected. In general,
  given the top-of-stack symbol p and the SDG being constructed, which includes a
  node $o_p$ created for p, whenever a new stack symbol is pushed, a new node is attached
  to the SDG as a child c of $o_p$ with the edge from c to $o_p$. (We assume the existence of the
  function append() in HSM, in addition to the ordinary stack operations push() and
  pop().)

Figure 2: An HSM user interface

⁶ SDG is constructed in various formats by using HSM, which are all equivalent with respect to the
structure that SDG represents. The formats include the lexical SDG, a textual definition of the SDG using
long labels of the objects or object identifiers, and a graphical display of the SDG on the screen. Each of
these output formats can be chosen by the user. See Figure 2 for an example of the HSM user interface.
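The two rules above can be illustrated with a small stack-driven builder. This is a simplified sketch and not the HSM of this paper: it relies on Python's standard `html.parser.HTMLParser` for tokenizing instead of the input symbols of Definition 8, it attaches every non-empty text run to the enclosing productive element, and the class name `StackSDGBuilder`, the `productive` parameter (standing in for the configuration file), and the example URL are assumptions. Edges are stored as (node, child) pairs, i.e., from the dependent node to the node it depends on, as in Definition 4.

```python
from html.parser import HTMLParser


class StackSDGBuilder(HTMLParser):
    """Push on start-tags, pop on end-tags, skip everything not listed as productive."""

    def __init__(self, url, productive):
        super().__init__()
        self.productive = {name.lower() for name in productive}
        self.labels = {0: url}          # node id -> label; node 0 is the root
        self.edges = []                 # (node id, child node id)
        self.stack = [0]                # node ids of the currently open elements

    def append(self, label):
        """Attach a new node as a child of the node on top of the stack."""
        node = len(self.labels)
        self.labels[node] = label
        self.edges.append((self.stack[-1], node))
        return node

    def handle_starttag(self, tag, attrs):
        if tag in self.productive:      # elements outside the list are skipped
            self.stack.append(self.append(tag.upper()))

    def handle_endtag(self, tag):
        on_top = self.labels[self.stack[-1]]
        if tag in self.productive and len(self.stack) > 1 and on_top == tag.upper():
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and len(self.stack) > 1:
            self.append(text)           # text becomes a leaf node

    def labeled_edges(self):
        return [(self.labels[a], self.labels[b]) for a, b in self.edges]


def build(url, html, productive):
    builder = StackSDGBuilder(url, productive)
    builder.feed(html)
    builder.close()
    return builder
```

A call such as `build("www.example.org", page_source, ["html", "head", "title", "body", "ul"])` returns a builder whose `labeled_edges()` lists the parent and child labels of the resulting graph; elements such as HR or B that are not in the list leave the stack and the graph untouched.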
Definition 8 The high-level stack machine HSM is a system (Q, Σ, Γ, δ, q_BOF, ε, F), where
1. Q is a finite set of states: q_BOF, q_1, q_A, q_A_ATTR, q_B, q_B_ATTR, and q_EOF.
   q_BOF denotes the beginning-of-the-file state and q_EOF denotes the end-of-the-file state.
   States q_A and q_A_ATTR are used for anchors and APPLETs, and q_B and q_B_ATTR are
   used for processing the IMG and META elements. Also, when the machine is in
   state q_1, HSM is not currently processing the elements IMG, META, A, or APPLET
   (as shown in the production rules of HSM below).
2. Σ is a finite set of input string symbols: EOF, NT, <HTML>, </HTML>,
   <HEAD>, </HEAD>, <TITLE>, </TITLE>, <H1>, ..., <H6>, <META>,
   HTTP-EQUIV, NAME, CONTENT, <BODY>, </BODY>, <UL>, </UL>,
   <OL>, </OL>, <A, </A>, HREF, NAME, <CAPTION>, </CAPTION>,
   <IMG, SRC, ALT, >, <TABLE>, </TABLE>, <TR>, </TR>, <TH>, </TH>,
   <TD>, </TD>, <APPLET>, </APPLET>, ARCHIVE, CODEBASE, CODE,
   <MENU>, </MENU>, <ADDRESS>, and </ADDRESS>.
   NT denotes any input string symbol in an HTML document other than the input
   string symbols defined in (2).
3. Γ is a finite set of stack string symbols: ε, HTML, HEAD, TITLE, META, HTTP-EQUIV,
   NAME, CONTENT, BODY, OL, UL, TABLE, CAPTION, TR, TH,
   TD, ADDRESS, IMG, SRC, ALT, A, HREF, APPLET, ARCHIVE, CODE,
   CODEBASE, MENU, DES, and ", where DES denotes Description and ε denotes
   the empty stack symbol.
4. δ is a mapping from Q × (Σ ∪ {ε}) × Γ to Q × Γ*, where `*' denotes the Kleene star.
   The production rules of δ are
   δ(q_BOF, <HTML>, ε) = (q_1, HTML)
   δ(q_BOF, <HEAD>, ε) = (q_1, HEAD)
   δ(q_BOF, <BODY>, ε) = (q_1, BODY)
   δ(q_1, </HTML>, HTML) = (q_EOF, ε)
   δ(q_1, <HEAD>, HTML) = (q_1, HEAD.HTML)
   δ(q_1, </HEAD>, HEAD) = (q_1, ε)
   δ(q_1, <TITLE>, HEAD) = (q_1, TITLE.HEAD)
   δ(q_1, </TITLE>, TITLE) = (q_1, ε)
   δ(q_1, <BODY>, HTML) = (q_1, BODY.HTML)
   δ(q_1, </BODY>, BODY) = (q_1, ε)
   δ(q_1, <TABLE>, *) = (q_1, TABLE.*)
   δ(q_1, </TABLE>, TABLE) = (q_1, ε)
   δ(q_1, <CAPTION>, TABLE) = (q_1, CAPTION.TABLE)
   δ(q_1, </CAPTION>, CAPTION) = (q_1, ε)
   δ(q_1, <TR>, TABLE) = (q_1, TR.TABLE)
   δ(q_1, </TR>, TR) = (q_1, ε)
   δ(q_1, <TH>, TR) = (q_1, TH.TR)
   δ(q_1, </TH>, TH) = (q_1, ε)
   δ(q_1, <TD>, TR) = (q_1, TD.TR)
   δ(q_1, </TD>, TD) = (q_1, ε)
   δ(q_1, <OL>, *) = (q_1, OL.*)
   δ(q_1, </OL>, OL) = (q_1, ε)
   δ(q_1, <UL>, *) = (q_1, UL.*)
   δ(q_1, </UL>, UL) = (q_1, ε)
   δ(q_1, <A, *) = (q_A, A.*)
   δ(q_A, HREF, A) = (q_A, HREF.A)
   δ(q_A, NAME, A) = (q_A, NAME.A)
   δ(q_A, ", HREF) = (q_A_ATTR, DES.HREF)
   δ(q_A, ", NAME) = (q_A_ATTR, DES.NAME)
   δ(q_A_ATTR, NT, DES) = (q_A_ATTR, DES) †
   δ(q_A_ATTR, ", DES) = (q_A, ε)
   δ(q_A, >, HREF) = (q_A, DES)
   δ(q_A, >, NAME) = (q_A, DES)
   δ(q_A, </A>, DES) = (q_1, ε)
   δ(q_A, NT, DES) = (q_A, DES) †
   δ(q_1, <APPLET, *) = (q_A, APPLET.*)
   δ(q_A, ARCHIVE, APPLET) = (q_A, ARCHIVE.APPLET)
   δ(q_A, CODEBASE, APPLET) = (q_A, CODEBASE.APPLET)
   δ(q_A, CODE, APPLET) = (q_A, CODE.APPLET)
   δ(q_A, ", ARCHIVE) = (q_A_ATTR, DES.ARCHIVE)
   δ(q_A, ", CODEBASE) = (q_A_ATTR, DES.CODEBASE)
   δ(q_A, ", CODE) = (q_A_ATTR, DES.CODE)
   δ(q_A_ATTR, NT, DES) = (q_A_ATTR, DES) †
   δ(q_A_ATTR, ", DES) = (q_A, ε)
   δ(q_A, >, ARCHIVE) = (q_A, DES)
   δ(q_A, >, CODEBASE) = (q_A, DES)
   δ(q_A, >, CODE) = (q_A, DES)
   δ(q_A, </APPLET>, DES) = (q_1, ε)
   δ(q_1, <IMG, *) = (q_B, IMG.*)
   δ(q_B, >, IMG) = (q_1, ε)
   δ(q_B, >, SRC) = (q_1, ε)
   δ(q_B, >, ALT) = (q_1, ε)
   δ(q_B, SRC, IMG) = (q_B, SRC.IMG)
   δ(q_B, SRC, ALT) = (q_B, SRC)
   δ(q_B, ALT, IMG) = (q_B, ALT.IMG)
   δ(q_B, ALT, SRC) = (q_B, ALT)
   δ(q_B, ", SRC) = (q_B_ATTR, DES.SRC)
   δ(q_B, ", ALT) = (q_B_ATTR, DES.ALT)
   δ(q_B_ATTR, NT, DES) = (q_B_ATTR, DES) †
   δ(q_B_ATTR, ", DES) = (q_B, ε)
   δ(q_1, <META, *) = (q_B, META.*)
   δ(q_B, >, META) = (q_1, ε)
   δ(q_B, >, HTTP-EQUIV) = (q_1, ε)
   δ(q_B, >, NAME) = (q_1, ε)
   δ(q_B, >, CONTENT) = (q_1, ε)
   δ(q_B, HTTP-EQUIV, META) = (q_B, HTTP-EQUIV.META)
   δ(q_B, HTTP-EQUIV, NAME) = (q_B, HTTP-EQUIV)
   δ(q_B, HTTP-EQUIV, CONTENT) = (q_B, HTTP-EQUIV)
   δ(q_B, NAME, META) = (q_B, NAME.META)
   δ(q_B, NAME, HTTP-EQUIV) = (q_B, NAME)
   δ(q_B, NAME, CONTENT) = (q_B, NAME)
   δ(q_B, CONTENT, META) = (q_B, CONTENT.META)
   δ(q_B, CONTENT, HTTP-EQUIV) = (q_B, CONTENT)
   δ(q_B, CONTENT, NAME) = (q_B, CONTENT)
   δ(q_B, ", HTTP-EQUIV) = (q_B_ATTR, DES.HTTP-EQUIV)
   δ(q_B, ", NAME) = (q_B_ATTR, DES.NAME)
   δ(q_B, ", CONTENT) = (q_B_ATTR, DES.CONTENT)
   δ(q_B_ATTR, NT, DES) = (q_B_ATTR, DES) †
   δ(q_B_ATTR, ", DES) = (q_B, ε)
   δ(q_1, <ADDRESS>, BODY) = (q_1, ADDRESS.BODY)
   δ(q_1, </ADDRESS>, ADDRESS) = (q_1, ε)
   δ(*, EOF, *) = (q_EOF, ε)
   δ(*, NT, *_1) = (*, *_1)
   † In this case, i.e., the current top-of-stack symbol is DES, nothing is pushed onto the stack, but
   NT is appended to the SDG.
   where `.' is used to separate different stack symbols.
   Each rule of the form δ(state_1, INPUT, TOS_1) = (state_2, TOS_2) is interpreted as
   "the automaton is currently in state state_1 with the top-of-stack symbol TOS_1.
   After reading the input symbol INPUT, the automaton replaces TOS_1 by TOS_2,
   i.e., pops TOS_1 and pushes TOS_2 onto the stack, and enters state state_2." Hence, in
   the case of δ(q_1, <HEAD>, HTML) = (q_1, HEAD.HTML), the stack symbol HEAD
   is pushed on top of HTML on the stack. In the case of δ(q_1, </HEAD>, HEAD) =
   (q_1, ε), HEAD is popped, nothing is pushed onto the stack, and the machine stays
   in q_1.
   `*' is "syntactic sugar" used for simplifying notation. Hence, `*' in δ(q_1, <TABLE>, *)
   denotes any stack symbol, and δ(q_1, <TABLE>, *) = (q_1, TABLE.*) is interpreted
   as "regardless of what the top-of-stack symbol is, when the machine is in q_1 and the
   input string symbol is <TABLE>, push TABLE onto the stack and remain in the
   same state." Furthermore, *_1 denotes any stack symbol except DES.
5. q_BOF ∈ Q is the initial state.
6. ε ∈ Γ is the initial stack symbol.
7. F ⊆ Q is the set of final states, i.e., {q_EOF}. □
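To make the interpretation of the production rules concrete, the following Python sketch runs a handful of them over a short token sequence. It is a toy under stated assumptions, not the Java implementation of HSM: only the rules for HTML, HEAD, TITLE and BODY are included, the names `RULES`, `step`, and `run` are ours, the empty stack stands for the empty stack symbol, and the two catch-all rules for EOF and NT at the end of the rule list are handled as special cases.

```python
EPS = "empty"                            # stands for the empty stack symbol

# (state, input symbol, top of stack) -> (next state, symbols to push, top first)
RULES = {
    ("q_BOF", "<HTML>", EPS): ("q_1", ("HTML",)),
    ("q_1", "<HEAD>", "HTML"): ("q_1", ("HEAD", "HTML")),
    ("q_1", "</HEAD>", "HEAD"): ("q_1", ()),
    ("q_1", "<TITLE>", "HEAD"): ("q_1", ("TITLE", "HEAD")),
    ("q_1", "</TITLE>", "TITLE"): ("q_1", ()),
    ("q_1", "<BODY>", "HTML"): ("q_1", ("BODY", "HTML")),
    ("q_1", "</BODY>", "BODY"): ("q_1", ()),
    ("q_1", "</HTML>", "HTML"): ("q_EOF", ()),
}


def step(state, stack, token):
    """Replace the top-of-stack symbol as prescribed by the matching rule."""
    top = stack[-1] if stack else EPS
    if token == "EOF":                   # catch-all rule: any state, any symbol
        return "q_EOF", stack[:-1]
    if token == "NT" and top != "DES":   # catch-all rule: NT is skipped
        return state, stack
    next_state, push = RULES[(state, token, top)]
    # Pop the old top, then push the replacement so its first symbol ends on top.
    return next_state, stack[:-1] + list(reversed(push))


def run(tokens):
    """Feed tokens left to right; stop once the final state is reached."""
    state, stack, trace = "q_BOF", [], []
    for token in tokens:
        state, stack = step(state, stack, token)
        trace.append(list(stack))
        if state == "q_EOF":
            break
    return state, stack, trace
```

Running it on `["<HTML>", "<HEAD>", "<TITLE>", "NT", "</TITLE>", "</HEAD>", "<BODY>", "</BODY>", "</HTML>"]` ends in q_EOF with an empty stack; the stack grows to HTML, HEAD, TITLE (bottom to top) while the title is open. A token for which no rule is listed raises `KeyError` in this sketch.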
Example 6 Consider the Web page of the Computer Science Department at Brigham
Young University whose URL is WWW.CS.BYU.EDU. WWW.CS.BYU.EDU is created as
the root node of the SDG, and the <HTML> element directly depends on two other
elements, <HEAD> and <BODY>. Furthermore, <HEAD> directly depends on <TITLE>,
which directly depends on the string value "BYU CS Department Homepage", in addition
to the five <META> elements, whereas <BODY> directly depends on two <TABLE>
elements, five <A> (anchor) elements, and one <ADDRESS> element. The complete SDG
of WWW.CS.BYU.EDU is as shown in Figure 3. □
Proposition 1 Given an HTML document H, HSM always halts.
Proof. Note that the final state of HSM is q_EOF, and HSM enters q_EOF when it encounters
</HTML> or EOF. Since the reading head of HSM continues to move forward during
the process of constructing an SDG and eventually will encounter EOF if </HTML> is
missing in H, HSM always terminates. □
Proposition 2 Given an HTML document H with $N_W$ words, the time complexity of
constructing the SDG for H using HSM is proportional to $N_W$.
Proof. Recall that HSM processes H in one direction. While moving forward and detecting
a word in H, HSM determines if the word is an HTML tag. If it is, then the stack and
the SDG are modified accordingly. Otherwise, the word is discarded. Hence, the time
complexity of HSM is proportional to $N_W$. □

Figure 3: SDG of WWW.CS.BYU.EDU

Proposition 3 Given an HTML document H with $N_W$ words, the space complexity of
HSM is proportional to $N_W$.
Proof. In the worst case when all the start-tags are placed before any end-tag in H,
at most $N_W$ stack cells are required to keep all the start-tags on the stack until EOF is
encountered. Hence, the space complexity of HSM is proportional to $N_W$. □
4 Conclusion
In this paper, we view an HTML document H as the textual representation of semistructured
data embedded within H. We present a graphical data model, called the semistructured
data graph (SDG), which is based on the notion of the dependency relationships
among the data objects in semistructured data D to describe the structure of D. The
SDG of an HTML document precisely captures the structure and textual data embedded
in the document. HSM, a high-level stack machine, is introduced as a tool to extract the
structures of HTML documents.
At present, we continue to work on HSM to extend it so that HSM is able to handle
more general HTML documents, including forms. The rapid emergence of Java applets and
the use of forms in HTML documents potentially make accessing information on the Web
beyond the given HTML document more difficult. We plan to investigate how
this new trend affects our current approach. We also plan to design a logic query language
for querying SDGs.
References
[Abi97] S. Abiteboul. Querying Semi-structured Data. In Proceedings of the 6th International
Conference on Database Theory, 1997.
[AMM97] P. Atzeni, G. Mecca, and P. Merialdo. To Weave the Web. In Proceedings of
the International Conference on Very Large Data Bases (VLDB), 1997.
[AQM+96] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel
Query Language for Semistructured Data. Journal on Digital Libraries, 1(1),
1996.
[BDFS97] P. Buneman, S. Davidson, M. Fernandez, and D. Suciu. Adding Structure to
Unstructured Data. In Proceedings of International Conference on Database
Theory, 1997.
[BDHS96] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A Query Language
and Optimization Techniques for Unstructured Data. In SIGMOD, 1996.
[Gro95] Network Working Group. Hypertext Markup Language - 2.0. Request for
Comments: #1866, November 1995.
[HGMC+97] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting
Semistructured Information from the Web. In Proceedings of Workshop on
Management of Semistructured Data, June 1997.
[HU79] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages,
and Computation. Addison-Wesley, 1979.
[KS95] D. Konopnicki and O. Shmueli. W3QS: A Query System for the World-Wide
Web. In Proceedings of the 21st VLDB, pages 54-65, Sept. 1995.
[LN97] S.-J. Lim and Y.-K. Ng. Vertical Fragmentation and Allocation in Distributed
Deductive Database Systems. Information Systems, 22(1):1-24, 1997.
[LSS96] L. Lakshmanan, F. Sadri, and I. Subramanian. A Declarative Language for
Querying and Restructuring the Web. In Proceedings of Post-ICDE IEEE
Workshop on Research Issues in Data Engineering, February 1996.
[MMM96] A. O. Mendelzon, G. A. Mihaila, and T. Milo. Querying the World Wide
Web. In Proceedings of Conference on Parallel and Distributed Information
Systems, 1996.
[NAM97] S. Nestorov, S. Abiteboul, and R. Motwani. Inferring Structure in Semistructured
Data. In Proceedings of the Workshop on Management of Semistructured
Data, May 1997.
[QRS+95] D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. Querying
Semistructured Heterogeneous Information. In Proceedings of the DOOD '95
Conference, pages 319-344, December 1995.
