Seung-Jin Lim
Yiu-Kai Ng
Computer Science Department
Brigham Young University
Provo, Utah 84602, U.S.A.
Email: {ng,sjlim}@cs.byu.edu
Abstract
Information on the Web, which is a conglomeration of heterogeneous data^1 such as texts,
images, and audio clips, is often accessed through documents written according to the
HTML specification [Gro95]. According to that specification, HTML documents are
semistructured in nature. We propose a high-level stack machine (HSM) which accesses an
HTML document through its URL and constructs a semistructured data graph (SDG) of the
document. The SDG of an HTML document H precisely captures the structure of the
semistructured data embedded in H based on the dependency relationship [LN97] among the
data objects in H. HSM is configurable to accommodate a user's interest with respect to
the HTML elements in H to be considered during the construction of the SDG of H.
1 Introduction
During the early days of the World-Wide Web (WWW or Web), users heavily relied on the
mouse-button-click navigation method through hyperlinks provided by Web browsers to
retrieve information of interest and soon found themselves lost somewhere in the midst of
cyberspace [KS95]. Since then, Web designers, as well as Web users, have been looking for
better alternatives. Two recent alternative approaches are (i) the keyword search method
using index servers, and (ii) the method of using extended query languages, including
SQL-like query languages such as [AQM+96, MMM96, Abi97, HGMC+97, NAM97], and
Datalog-like query languages such as WebLog [LSS96]. To better understand this issue, we
first clarify three types of data with respect to their data structures according to [KS95]:
Unstructured data: Data which is stored in files such as executable files, pure text files
which contain no formatting code, and audio files. It is difficult, if not impossible,
to ascertain the semantics of this type of data. ([BDHS96, BDFS97] use the term
unstructured data for any data of no rigid structure.)
Structured data: A typical example of this type of data is tables in the relational
database model. The semantics of this type of data can be obtained using the grammar
of the formal language in which the data file is written.
Semistructured data: This type of data is anything between the two extreme types of
data mentioned above. Examples of this type of data are text files that contain
formatting codes, such as LaTeX or HTML, and files which require strict inner structure
but in which some of the structural components can be omitted, such as BibTeX files, Unix
environment files, etc. For a BibTeX file, if an entry of the file is of type article, then
the meaning of article can be obtained using the set of predefined fields for article.
1 Data is literally the plural form of datum. Hence, whenever we say data D, we really mean
that D is a set of datum, or a set of data objects.
We view information on the Web as a collection of heterogeneous data, consisting largely
of HTML documents together with other types of data such as images, sound, and video
clips. The HTML specification [Gro95], which is the most widely used paradigm for posting
information on the Web, does not require a uniform structure in documents; e.g., an element
which appears in one HTML document may be missing in other documents. Hence, we treat
an HTML document H as a textual representation of the semistructured data embedded
within H.
One of the benefits of using semistructured data is its great flexibility in data
representation [BDHS96]. A theory of semistructured data, however, is still missing [Abi97],
and hence there is no universally standardized definition of semistructured data. In this
paper, we consider a finite set of data objects $\{o_1, o_2, \ldots, o_n\}$ as semistructured
data D if
• the structure of D is irregular or incomplete [QRS+95, AQM+96, AMM97].
• the distinction between the schema of D and $\{o_1, o_2, \ldots, o_n\}$ is blurred, and the schema
For example, semistructured data may include components A and B. Let the data fields of
A be name and date of birth, and those of B be name, date of birth, and address. Further
assume that date of birth in A is of type string, such as "6-12-1938", whereas date of birth
in B consists of three parts, Month, Day, and Year, and Month can be of type integer, i.e.,
"1", "2", ..., or string, i.e., "January", "February", ..., or even encoded in a character
set other than ASCII. The structures of A and B can be changed subsequently as needed.
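The irregularity described above can be made concrete with two plain Python dictionaries. This is only an illustrative sketch: the names "Alice" and "Bob", the address string, and the helper `field_shape` are hypothetical and introduced here for illustration; only the field names and the string "6-12-1938" come from the text.

```python
# Hypothetical records for components A and B; only the field names and
# the date string "6-12-1938" are taken from the text.
A = {"name": "Alice", "date_of_birth": "6-12-1938"}
B = {
    "name": "Bob",
    "date_of_birth": {"Month": "June", "Day": 12, "Year": 1938},
    "address": "Provo, Utah",
}


def field_shape(record):
    """Map each field name to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in record.items()}


# Fields that B has but A lacks (incomplete structure).
missing_in_A = sorted(set(B) - set(A))

# Fields present in both components but stored with different structure.
differing = sorted(
    key for key in set(A) & set(B) if field_shape(A)[key] != field_shape(B)[key]
)
```

Here `missing_in_A` lists the fields absent from A, and `differing` lists the shared fields whose internal structure disagrees, which are the two kinds of irregularity the example describes.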
In this paper, we present an approach for extracting the structures of semistructured
data embedded within an HTML document H on the WWW, assuming that H is written in
compliance^2 with the HTML specification [Gro95] and is referred to by a given URL.
(Additional URLs can be obtained from the hyperlinks included in H.) We first
propose a graphical data model of semistructured data, called the semistructured data graph
(SDG), and then present a tool, a high-level stack machine (HSM)^3, to extract the structures
embedded in H and represent them in an SDG.
The main contribution of this paper is three-fold. First, we extend the concept of
dependency relationship among database components in [LN97] to capture the structure of
the semistructured data embedded in an HTML document. Second, we design and implement
a simple automaton, HSM, using the Java Language Environment, to construct the
SDG for an HTML document specified by a URL. Since HSM is based on pushdown
automata, it is fairly easy to implement using a stack, without concern for sophisticated
functions such as rereading or replacing an input [HU79], and without assuming unlimited
auxiliary memory as a Turing machine does. Third, our HSM is easily configurable according
to the user's needs in terms of which HTML elements are to be included in an SDG. To
configure HSM, a user simply provides a configuration file which contains a list of the HTML
elements of an HTML document H chosen by the user to be included in the SDG of H.
Writing a configuration file does not require any knowledge of additional commands and
their syntax, such as get(), split() and citytemp[1 : 0] in [HGMC+97].
2 Note that erroneous HTML documents are not excluded from our consideration.
3 HSM is a variation of the well-known pushdown automata (PDA).
We proceed to present our results as follows. In Section 2, we describe the details of
SDGs. In Section 3, we propose HSM and demonstrate its application using a real-world
example. In Section 4, we give concluding remarks.
The foundation of our data model for semistructured data, SDG, is the notion of
dependency relationship [LN97] among database objects. We give the definition of objects
below.
Definition 1 An object o in a set D is a triple $\langle label, value, identifier \rangle$, where

$CON(\{s_1, s_2, \ldots, s_m\}, \{t_1, t_2, \ldots, t_m\}) = \bigcup_{\forall i,j} \{con(s_i, t_j)\}$ ^4.
Example 1 Consider the following set of objects, where each object o is associated with
information of the form (o.label(), o.value(), a list of objects on which o directly
depends). It is assumed that a string which contains spaces is enclosed in double quotes,
and `–' denotes an empty list.
Location (Location, (free), Address and "Time Zone"),
Address (Address, (free), "Street Address", City and "Zip code"),
"Street Address" (Street, (free), "314 E. Heather Cir."),
City (City, (free), Orem),
"Zip code" (ZipCode, (free), 84057),
"Time Zone" (TimeZone, (free), MST),
"314 E. Heather Cir." ("314 E. Heather Cir.", "314 E. Heather Cir.", –),
Orem (Orem, Orem, –),
84057 (84057, 84057, –), and
MST (MST, "Mountain Standard Time", –).
Let us consider the object Address. The dependency constraint of Address is
Address $\rightarrow$ {"Street Address" $\rightarrow$ "314 E. Heather Cir.", City $\rightarrow$ Orem, "Zip code" $\rightarrow$ 84057}. Hence,
Address.value() := {"Street Address".value(), City.value(), "Zip code".value()}
= {"314 E. Heather Cir.".value(), Orem.value(), 84057.value()}
= {"314 E. Heather Cir.", Orem, 84057}. □
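The value computation in Example 1 can be traced with a short Python sketch. The table layout, the `FREE` marker, and the function `value` are assumptions made here for illustration; in particular, passing a single dependency's value straight through and collecting several into a list (rather than a set, to keep a stable order) is one reading of the example, not a definition taken from the paper.

```python
FREE = object()  # stands in for the "(free)" value in Example 1

# name -> (label, value, names of the objects it directly depends on)
objects = {
    "Location": ("Location", FREE, ["Address", "Time Zone"]),
    "Address": ("Address", FREE, ["Street Address", "City", "Zip code"]),
    "Street Address": ("Street", FREE, ["314 E. Heather Cir."]),
    "City": ("City", FREE, ["Orem"]),
    "Zip code": ("ZipCode", FREE, ["84057"]),
    "Time Zone": ("TimeZone", FREE, ["MST"]),
    "314 E. Heather Cir.": ("314 E. Heather Cir.", "314 E. Heather Cir.", []),
    "Orem": ("Orem", "Orem", []),
    "84057": ("84057", "84057", []),
    "MST": ("MST", "Mountain Standard Time", []),
}


def value(name):
    """Return the stored value, or resolve a free value via dependencies."""
    _label, stored, depends_on = objects[name]
    if stored is not FREE:
        return stored
    resolved = [value(dep) for dep in depends_on]
    # Assumption: one dependency passes its value through unchanged;
    # several dependencies are collected in order.
    return resolved[0] if len(resolved) == 1 else resolved
```

Calling `value("Address")` follows the same three steps as the derivation above: it first asks for the values of "Street Address", City, and "Zip code", each of which in turn resolves to the value of the leaf object it depends on.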
We define the long label of an object o based on the dependency relationship of o with
other objects.
Definition 3 Given a dependency among objects $o_1, o_2, \ldots, o_N$ such that $o_1 \rightarrow \cdots \rightarrow o_N$,
the long label of $o_i$, denoted $o_i.Label()$, $1 \le i \le N$, is defined as
$o_1.Label() = o_1.label()$, and
$o_i.Label() = con(o_{i-1}.Label(), o_i.label())$, $2 \le i \le N$. □
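The recurrence for long labels can be sketched in a few lines of Python. The function names `con` and `long_labels` are chosen here for illustration, and the use of "." as the separator is taken from the dotted results shown in the paper's examples (e.g., Address.ZipCode.84057).

```python
def con(left, right):
    """Concatenate two labels with '.', as in Address.ZipCode.84057."""
    return f"{left}.{right}"


def long_labels(labels):
    """Given o_1.label(), ..., o_N.label() along a dependency chain,
    return o_1.Label(), ..., o_N.Label()."""
    result = []
    for label in labels:
        # The first long label is the label itself; each later one
        # extends the previous long label.
        result.append(label if not result else con(result[-1], label))
    return result


# Labels along the chain Address -> "Zip code" -> 84057 from Example 1.
chain = ["Address", "ZipCode", "84057"]
chain_long_labels = long_labels(chain)
```

Each long label is built from the previous one, so a chain of N objects needs only N - 1 concatenations.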
Example 2 Consider the set of objects in Example 1. According to Definition 3, all the
expressions listed below are valid.
4 In the rest of this paper, we use $\{s_1, s_2, \ldots, s_m\}.\{t_1, t_2, \ldots, t_m\}$ on behalf of
$\bigcup_{\forall i,j} \{con(s_i, t_j)\}$ for simplicity of the notation.
Figure 1: The SDG for the semistructured data Info
It is easy to see that $pe(V_R, o) = o.Label()$, where $V_R$ is the root node and
$o \in (V_I \cup V_L)$ of the corresponding SDG.
An SDG precisely captures the structure and values of semistructured data graphically.
Given below is the definition of the lexical representation of an SDG. The lexical
representation of semistructured data D is useful in some situations, e.g., finding an
object in D using a textual path expression.
Definition 6 Given an SDG $= (\{V_R\} \cup V_I \cup V_L, E, g)$, the lexical representation $L_o$, or
lexical SDG, of $o \in (\{V_R\} \cup V_I \cup V_L)$ is a textual representation of the subgraph S of
the SDG rooted at o such that $L_o = \biguplus_{\forall i} pe(o, o_i)$, where $o_i$ is a leaf node in S. □
Note that given an SDG $= (\{V_R\} \cup V_I \cup V_L, E, g)$, the lexical representation of the SDG
is obtained by $\biguplus_{\forall i} pe(V_R, o_i) = \biguplus_{\forall i} o_i.Label()$, where $o_i \in V_L$.
Example 4 Consider the SDG of the semistructured data Info in Example 3. Then
pe(Address, 84057)
= con(pe(Address, "Zip code"), 84057.label())
= con(con(pe(Address, Address), "Zip code".label()), 84057)
= con(con(Address.label(), ZipCode), 84057)
= con(Address.ZipCode, 84057)
= Address.ZipCode.84057
pe(Address, Orem) = Address.City.Orem
$L_{Address}$ = Address.Street."314 E. Heather Circle" $\uplus$ Address.City.Orem $\uplus$ Address.ZipCode.84057
= CON({Address}, {Street."314 E. Heather Circle", City.Orem, ZipCode.84057})
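The lexical representation of the Address subgraph can be reproduced with a small recursive walk. This is a simplified sketch: nodes are identified by their labels (the paper uses object identifiers), and the dictionary `children` and the function `leaf_paths` are names introduced here for illustration.

```python
# Child lists for the Address fragment of the SDG; nodes are named by label.
children = {
    "Address": ["Street", "City", "ZipCode"],
    "Street": ['"314 E. Heather Circle"'],
    "City": ["Orem"],
    "ZipCode": ["84057"],
}


def leaf_paths(node):
    """Return the dotted path expression from `node` to every leaf below it."""
    kids = children.get(node, [])
    if not kids:
        return [node]  # a leaf is its own path
    return [f"{node}.{path}" for kid in kids for path in leaf_paths(kid)]


lexical_address = leaf_paths("Address")
```

The resulting list holds one path expression per leaf, which corresponds to the three terms combined in the lexical representation of Address.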
3 The High-level Stack Machine
In this section, we present a high-level stack machine (HSM) as a tool which extracts data
structures embedded in an HTML document H, and demonstrate the construction of the
SDG for H, specified by a URL, using HSM^6.
To construct the SDG for a given HTML document, we propose HSM, which processes
HTML elements. As mentioned earlier, HSM is based on pushdown automata (PDA).
We are particularly interested in employing a PDA since it is relatively easy to
comprehend, design, and implement compared with Turing machines [HU79] and has sufficient
power to construct the SDGs for HTML documents. (We assume that readers are familiar
with PDA.)
With respect to the design of HSM, we classify HTML elements into two types
as follows: (1) elements which begin with a start-tag and end with an end-tag, and (2)
elements whose end-tags are either optional or do not exist. Most of the commonly used
HTML elements are of type 1. Such elements include the document element HTML, head
element HEAD, body BODY, headings H1 ... H6, title TITLE, anchor A, some of the
block structuring elements such as list elements OL and UL, block quote BLOCKQUOTE,
preformatted text PRE, directory list DIR, menu list MENU, address element ADDRESS,
and phrase markups such as EM, B, I and STRONG. Examples of elements of type 2
are some of the block structuring elements such as list element LI, definition list
elements DT and DD, line break BR, and horizontal rule HR. Some HTML elements, such as
IMG, have no end-tag, but the closing angle bracket of the tag indicates where the element
ends. Therefore, we categorize IMG as an element of type 1.
Besides the two types of HTML elements mentioned above, we treat some elements
as `unproductive' with respect to an SDG, i.e., they do not generate any output on an SDG
in our current version of HSM and are simply ignored by HSM. UPE, the set of unproductive
elements, is determined by the user on demand by excluding the elements of UPE from the
configuration file discussed earlier. In the current version of HSM, elements such as HR, B,
CITE, DIR, TT, DL, DT, DD, and comments <! ... > are included in UPE. In addition, we treat
the set of elements of type 2 as a proper subset of UPE. However, when we consider a query
language for SDGs in our future work, it may be worthwhile to adjust UPE to give
more weight to styled text than to plain text. (Styled text is surrounded by character
style tags such as <B>, <BIG>, <STRONG> and their corresponding end-tags.)
6 An SDG is constructed in various formats by using HSM, which are all equivalent with
respect to the structure that the SDG represents. The formats include the lexical SDG, a
textual definition of the SDG using long labels of the objects or object identifiers, and a
graphical display of the SDG on the screen. Each of these output formats can be chosen by
the user. See Figure 2 for an example of the HSM user interface.
Figure 2: An HSM user interface
During the construction process of an SDG, the following two rules of HSM are applied
to elements in an HTML document H:
• Skip an element e if $e \in$ UPE. No stack operation is necessary, and no changes occur
in the SDG of H.
• For an element e' of type 1, push the corresponding stack symbol $\gamma$ (defined in
Definition 8) of e' onto the stack whenever the start-tag of e' is encountered, and pop
$\gamma$ from the stack of HSM whenever the corresponding end-tag of e' is detected. In
general, given the top-of-stack symbol p and the SDG being constructed, which includes a
node $o_p$ created for p, whenever a new stack symbol $\gamma$ is pushed, $\gamma$ is attached
to the SDG as a child c of $o_p$ with the edge from c to $o_p$. (We assume the existence of the
function append() in HSM, in addition to the ordinary stack operations push() and
pop().)
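The two rules can be approximated with Python's standard `html.parser` module. This is a simplified sketch under stated assumptions, not the authors' Java implementation: the classes `Node` and `SdgBuilder`, the helper `to_tuple`, and the document name "example.html" are hypothetical; text is attached directly to the node on top of the stack rather than through a DES symbol; and elements without end-tags (such as IMG) are not handled.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, label):
        self.label = label
        self.children = []


class SdgBuilder(HTMLParser):
    """Builds a tree from start/end tags with an explicit stack."""

    def __init__(self, root_label, productive):
        super().__init__()
        self.root = Node(root_label)
        self.stack = [self.root]      # self.stack[-1] is the top of the stack
        self.productive = productive  # lower-case names of elements NOT in UPE

    def handle_starttag(self, tag, attrs):
        if tag not in self.productive:
            return  # rule 1: skip unproductive elements
        node = Node(tag.upper())
        self.stack[-1].children.append(node)  # attach under top-of-stack node
        self.stack.append(node)               # push

    def handle_endtag(self, tag):
        # rule 2: pop only when the end-tag matches the top of the stack
        if tag in self.productive and self.stack[-1].label == tag.upper():
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node(text))  # append()


def to_tuple(node):
    """Convert a tree to nested (label, children) tuples for inspection."""
    return (node.label, [to_tuple(child) for child in node.children])


builder = SdgBuilder("example.html", {"html", "head", "title", "body"})
builder.feed(
    "<html><head><title>Demo</title></head>"
    "<body><hr>Hi <b>there</b></body></html>"
)
builder.close()
tree = to_tuple(builder.root)
```

In this run HR and B are left out of the productive set, so they are skipped, while the text inside them still reaches the enclosing BODY node. The sketch stores children on the parent; the paper instead directs each edge from the child to its parent.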
Definition 8 The high-level stack machine HSM is a system $(Q, \Sigma, \Gamma, \delta, q_{BOF}, \epsilon, F)$, where
1. $Q$ is a finite set of states: $q_{BOF}$, $q_1$, $q_A$, $q_{A\_ATTR}$, $q_B$, $q_{B\_ATTR}$, and $q_{EOF}$.
$q_{BOF}$ denotes the beginning-of-file state and $q_{EOF}$ denotes the end-of-file state.
States $q_A$ and $q_{A\_ATTR}$ are used for anchors and APPLETs, and $q_B$ and $q_{B\_ATTR}$ are
used for processing the IMG and META elements. Also, when the machine is in
state $q_1$, HSM is not currently processing the elements IMG, META, A, or APPLET
(as shown in the production rules of HSM below).
2. $\Sigma$ is a finite set of input string symbols: EOF, NT, <HTML>, </HTML>,
<HEAD>, </HEAD>, <TITLE>, </TITLE>, <H1>, ..., <H6>, <META,
HTTP-EQUIV, NAME, CONTENT, <BODY>, </BODY>, <UL>, </UL>,
<OL>, </OL>, <A, </A>, HREF, NAME, <CAPTION>, </CAPTION>,
<IMG, SRC, ALT, >, <TABLE>, </TABLE>, <TR>, </TR>, <TH>, </TH>,
<TD>, </TD>, <APPLET, </APPLET>, ARCHIVE, CODEBASE, CODE,
<MENU>, </MENU>, <ADDRESS>, and </ADDRESS>.
NT denotes any input string symbol in an HTML document other than the input
string symbols listed in (2).
3. $\Gamma$ is a finite set of stack string symbols: $\epsilon$, HTML, HEAD, TITLE, META,
HTTP-EQUIV, NAME, CONTENT, BODY, OL, UL, TABLE, CAPTION, TR, TH,
TD, ADDRESS, IMG, SRC, ALT, A, HREF, APPLET, ARCHIVE, CODE,
CODEBASE, MENU, DES, and ", where DES denotes Description and $\epsilon$ denotes
the empty stack symbol.
4. $\delta$ is a mapping from $Q \times (\Sigma \cup \{\epsilon\}) \times \Gamma$ to $Q \times \Gamma^*$, where `$*$' denotes the Kleene star.
The production rules of $\delta$ are
$\delta(q_{BOF}, \texttt{<HTML>}, \epsilon) = (q_1, HTML)$
$\delta(q_{BOF}, \texttt{<HEAD>}, \epsilon) = (q_1, HEAD)$
$\delta(q_{BOF}, \texttt{<BODY>}, \epsilon) = (q_1, BODY)$
$\delta(q_1, \texttt{</HTML>}, HTML) = (q_{EOF}, \epsilon)$
$\delta(q_1, \texttt{<HEAD>}, HTML) = (q_1, HEAD.HTML)$
$\delta(q_1, \texttt{</HEAD>}, HEAD) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<TITLE>}, HEAD) = (q_1, TITLE.HEAD)$
$\delta(q_1, \texttt{</TITLE>}, TITLE) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<BODY>}, HTML) = (q_1, BODY.HTML)$
$\delta(q_1, \texttt{</BODY>}, BODY) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<TABLE>}, \alpha) = (q_1, TABLE.\alpha)$
$\delta(q_1, \texttt{</TABLE>}, TABLE) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<CAPTION>}, TABLE) = (q_1, CAPTION.TABLE)$
$\delta(q_1, \texttt{</CAPTION>}, CAPTION) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<TR>}, TABLE) = (q_1, TR.TABLE)$
$\delta(q_1, \texttt{</TR>}, TR) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<TH>}, TR) = (q_1, TH.TR)$
$\delta(q_1, \texttt{</TH>}, TH) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<TD>}, TR) = (q_1, TD.TR)$
$\delta(q_1, \texttt{</TD>}, TD) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<OL>}, \alpha) = (q_1, OL.\alpha)$
$\delta(q_1, \texttt{</OL>}, OL) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<UL>}, \alpha) = (q_1, UL.\alpha)$
$\delta(q_1, \texttt{</UL>}, UL) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<A}, \alpha) = (q_A, A.\alpha)$
$\delta(q_A, \texttt{HREF}, A) = (q_A, HREF.A)$
$\delta(q_A, \texttt{NAME}, A) = (q_A, NAME.A)$
$\delta(q_A, \texttt{"}, HREF) = (q_{A\_ATTR}, DES.HREF)$
$\delta(q_A, \texttt{"}, NAME) = (q_{A\_ATTR}, DES.NAME)$
$\delta(q_{A\_ATTR}, NT, DES) = (q_{A\_ATTR}, DES)$ †
$\delta(q_{A\_ATTR}, \texttt{"}, DES) = (q_A, \epsilon)$
$\delta(q_A, \texttt{>}, HREF) = (q_A, DES)$
$\delta(q_A, \texttt{>}, NAME) = (q_A, DES)$
$\delta(q_A, \texttt{</A>}, DES) = (q_1, \epsilon)$
$\delta(q_A, NT, DES) = (q_A, DES)$ †
$\delta(q_1, \texttt{<APPLET}, \alpha) = (q_A, APPLET.\alpha)$
$\delta(q_A, \texttt{ARCHIVE}, APPLET) = (q_A, ARCHIVE.APPLET)$
$\delta(q_A, \texttt{CODEBASE}, APPLET) = (q_A, CODEBASE.APPLET)$
$\delta(q_A, \texttt{CODE}, APPLET) = (q_A, CODE.APPLET)$
$\delta(q_A, \texttt{"}, ARCHIVE) = (q_{A\_ATTR}, DES.ARCHIVE)$
$\delta(q_A, \texttt{"}, CODEBASE) = (q_{A\_ATTR}, DES.CODEBASE)$
$\delta(q_A, \texttt{"}, CODE) = (q_{A\_ATTR}, DES.CODE)$
$\delta(q_{A\_ATTR}, NT, DES) = (q_{A\_ATTR}, DES)$ †
$\delta(q_{A\_ATTR}, \texttt{"}, DES) = (q_A, \epsilon)$
$\delta(q_A, \texttt{>}, ARCHIVE) = (q_A, DES)$
$\delta(q_A, \texttt{>}, CODEBASE) = (q_A, DES)$
$\delta(q_A, \texttt{>}, CODE) = (q_A, DES)$
$\delta(q_A, \texttt{</APPLET>}, DES) = (q_1, \epsilon)$
$\delta(q_1, \texttt{<IMG}, \alpha) = (q_B, IMG.\alpha)$
$\delta(q_B, \texttt{>}, IMG) = (q_1, \epsilon)$
$\delta(q_B, \texttt{>}, SRC) = (q_1, \epsilon)$
$\delta(q_B, \texttt{>}, ALT) = (q_1, \epsilon)$
$\delta(q_B, \texttt{SRC}, IMG) = (q_B, SRC.IMG)$
$\delta(q_B, \texttt{SRC}, ALT) = (q_B, SRC)$
$\delta(q_B, \texttt{ALT}, IMG) = (q_B, ALT.IMG)$
$\delta(q_B, \texttt{ALT}, SRC) = (q_B, ALT)$
$\delta(q_B, \texttt{"}, SRC) = (q_{B\_ATTR}, DES.SRC)$
$\delta(q_B, \texttt{"}, ALT) = (q_{B\_ATTR}, DES.ALT)$
$\delta(q_{B\_ATTR}, NT, DES) = (q_{B\_ATTR}, DES)$ †
$\delta(q_{B\_ATTR}, \texttt{"}, DES) = (q_B, \epsilon)$
$\delta(q_1, \texttt{<META}, \alpha) = (q_B, META.\alpha)$
$\delta(q_B, \texttt{>}, META) = (q_1, \epsilon)$
$\delta(q_B, \texttt{>}, \textit{HTTP-EQUIV}) = (q_1, \epsilon)$
$\delta(q_B, \texttt{>}, NAME) = (q_1, \epsilon)$
$\delta(q_B, \texttt{>}, CONTENT) = (q_1, \epsilon)$
$\delta(q_B, \texttt{HTTP-EQUIV}, META) = (q_B, \textit{HTTP-EQUIV}.META)$
$\delta(q_B, \texttt{HTTP-EQUIV}, NAME) = (q_B, \textit{HTTP-EQUIV})$
$\delta(q_B, \texttt{HTTP-EQUIV}, CONTENT) = (q_B, \textit{HTTP-EQUIV})$
$\delta(q_B, \texttt{NAME}, META) = (q_B, NAME.META)$
$\delta(q_B, \texttt{NAME}, \textit{HTTP-EQUIV}) = (q_B, NAME)$
$\delta(q_B, \texttt{NAME}, CONTENT) = (q_B, NAME)$
$\delta(q_B, \texttt{CONTENT}, META) = (q_B, CONTENT.META)$
$\delta(q_B, \texttt{CONTENT}, \textit{HTTP-EQUIV}) = (q_B, CONTENT)$
$\delta(q_B, \texttt{CONTENT}, NAME) = (q_B, CONTENT)$
$\delta(q_B, \texttt{"}, \textit{HTTP-EQUIV}) = (q_{B\_ATTR}, DES.\textit{HTTP-EQUIV})$
$\delta(q_B, \texttt{"}, NAME) = (q_{B\_ATTR}, DES.NAME)$
$\delta(q_B, \texttt{"}, CONTENT) = (q_{B\_ATTR}, DES.CONTENT)$
$\delta(q_{B\_ATTR}, NT, DES) = (q_{B\_ATTR}, DES)$ †
$\delta(q_{B\_ATTR}, \texttt{"}, DES) = (q_B, \epsilon)$
$\delta(q_1, \texttt{<ADDRESS>}, BODY) = (q_1, ADDRESS.BODY)$
$\delta(q_1, \texttt{</ADDRESS>}, ADDRESS) = (q_1, \epsilon)$
$\delta(q, EOF, \alpha) = (q_{EOF}, \epsilon)$
$\delta(q, NT, \alpha_1) = (q, \alpha_1)$
† In this case, i.e., the current top-of-stack symbol is DES, nothing is pushed onto the stack, but
NT is appended to the SDG.
where `.' is used to separate different stack symbols.
Each rule of the form $\delta(state_1, INPUT, TOS_1) = (state_2, TOS_2)$ is interpreted as
"the automaton is currently in state $state_1$ with the top-of-stack symbol $TOS_1$.
After reading the input symbol INPUT, the automaton replaces $TOS_1$ by $TOS_2$,
i.e., pops $TOS_1$ and pushes $TOS_2$ onto the stack, and enters state $state_2$." Hence, in
the case of $\delta(q_1, \texttt{<HEAD>}, HTML) = (q_1, HEAD.HTML)$, the stack symbol HEAD
is pushed on top of HTML on the stack. In the case of $\delta(q_1, \texttt{</HEAD>}, HEAD) = (q_1, \epsilon)$,
HEAD is popped, nothing is pushed onto the stack, and the machine stays in $q_1$.
`$\alpha$' is "syntactic sugar" used for simplifying notation. Hence, `$\alpha$' in
$\delta(q_1, \texttt{<TABLE>}, \alpha)$ denotes any stack symbol, and
$\delta(q_1, \texttt{<TABLE>}, \alpha) = (q_1, TABLE.\alpha)$ is interpreted
as "regardless of what the top-of-stack symbol is, when the machine is in $q_1$ and the
input string symbol is <TABLE>, push TABLE onto the stack and remain in the
same state." Furthermore, $\alpha_1$ denotes any stack symbol except DES. Likewise, $q$
without a subscript, as used in the last two rules, denotes any state.
5. $q_{BOF} \in Q$ is the initial state.
6. $\epsilon \in \Gamma$ is the initial stack symbol.
7. $F \subseteq Q$ is the set of final states, i.e., $\{q_{EOF}\}$. □
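A handful of the production rules can be exercised with a table-driven simulation in Python. This is a minimal sketch under stated assumptions: only eight rules are included, `EPS` stands in for the empty stack symbol, every unmatched token is treated as NT and ignored, the stack is left untouched when EOF is read, and no SDG is built. The names `DELTA` and `run` are hypothetical.

```python
EPS = ""  # stands in for the empty (initial) stack symbol

# (state, input, top of stack) -> (new state, symbols replacing the top, top first)
DELTA = {
    ("qBOF", "<HTML>", EPS): ("q1", ["HTML"]),
    ("q1", "<HEAD>", "HTML"): ("q1", ["HEAD", "HTML"]),
    ("q1", "</HEAD>", "HEAD"): ("q1", []),
    ("q1", "<TITLE>", "HEAD"): ("q1", ["TITLE", "HEAD"]),
    ("q1", "</TITLE>", "TITLE"): ("q1", []),
    ("q1", "<BODY>", "HTML"): ("q1", ["BODY", "HTML"]),
    ("q1", "</BODY>", "BODY"): ("q1", []),
    ("q1", "</HTML>", "HTML"): ("qEOF", []),
}


def run(tokens):
    """Run the rule subset; return (final state, stack, peak stack depth)."""
    state, stack, peak = "qBOF", [EPS], 1  # stack[-1] is the top
    for token in tokens:
        top = stack[-1] if stack else EPS
        rule = DELTA.get((state, token, top))
        if rule is not None:
            state, pushed = rule
            if stack:
                stack.pop()                 # pop TOS1 ...
            stack.extend(reversed(pushed))  # ... and push TOS2, top symbol last
        elif token == "EOF":
            state = "qEOF"  # simplification: the stack is left as it is
        # Any other token is treated as NT and ignored in this sketch.
        peak = max(peak, len(stack))
        if state == "qEOF":
            break
    return state, stack, peak


tokens = [
    "<HTML>", "<HEAD>", "<TITLE>", "Demo", "</TITLE>", "</HEAD>",
    "<BODY>", "text", "</BODY>", "</HTML>",
]
final_state, final_stack, peak_depth = run(tokens)
```

Writing the right-hand side of each rule as a list with the top symbol first mirrors the `HEAD.HTML` notation, so a push rule lists the new symbol followed by the old top, and a pop rule lists nothing.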
Example 6 Consider the Web page of the Computer Science Department at Brigham
Young University whose URL is WWW.CS.BYU.EDU. WWW.CS.BYU.EDU is created as
the root node of the SDG, and the <HTML> element directly depends on two other
elements, <HEAD> and <BODY>. Furthermore, <HEAD> directly depends on <TITLE>,
which directly depends on the string value "BYU CS Department Homepage", in addition
to the five <META> elements, whereas <BODY> directly depends on two <TABLE>
elements, five <A> (anchor) elements, and one <ADDRESS> element. The complete SDG
of WWW.CS.BYU.EDU is as shown in Figure 3. □
Proposition 1 Given an HTML document H, HSM always halts.
Proof. Note that the final state of HSM is $q_{EOF}$, and HSM enters $q_{EOF}$ when it encounters
</HTML> or EOF. Since the reading head of HSM continues to move forward during
the process of constructing an SDG and eventually will encounter EOF if </HTML> is
missing in H, HSM always terminates. □
Proposition 2 Given an HTML document H with $N_W$ words, the time complexity of
constructing the SDG for H using HSM is proportional to $N_W$.
Proof. Recall that HSM processes H in one direction. While moving forward and detecting
a word in H, HSM determines if the word is an HTML tag. If it is, then the stack and
the SDG are modified accordingly. Otherwise, the word is discarded. Hence, the time
complexity of HSM is proportional to $N_W$. □
Figure 3: SDG of WWW.CS.BYU.EDU
Proposition 3 Given an HTML document H with $N_W$ words, the space complexity of
HSM is proportional to $N_W$.
Proof. In the worst case, when all the start-tags are placed before any end-tag in H,
at most $N_W$ stack cells are required to keep all the start-tags on the stack until EOF is
encountered. Hence, the space complexity of HSM is proportional to $N_W$. □
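The worst case in the proof can be illustrated by counting stack depth over a token sequence. This is only a sketch: the function `max_stack_depth` is hypothetical, it treats every start-tag as a push and every end-tag as a pop, and it ignores the distinction between productive and unproductive elements.

```python
def max_stack_depth(tokens):
    """Peak number of stack cells used when each start-tag pushes one
    symbol and each end-tag pops one."""
    depth = peak = 0
    for token in tokens:
        if token.startswith("</"):
            depth = max(depth - 1, 0)  # end-tag: pop
        elif token.startswith("<"):
            depth += 1                 # start-tag: push
            peak = max(peak, depth)
    return peak


n = 5
worst_case = ["<UL>"] * n + ["</UL>"] * n  # all start-tags before any end-tag
interleaved = ["<UL>", "</UL>"] * n        # each list closed immediately
```

With all start-tags first, the peak depth equals the number of start-tags, whereas closing each element immediately keeps the depth at one regardless of document length.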
4 Conclusion
References