Академический Документы
Профессиональный Документы
Культура Документы
Matteo Golfarelli
Stefano Rizzi
Boris Vrdoljak
mgolfarelli@deis.unibo.it
srizzi@deis.unibo.it
boris.vrdoljak@fer.hr
ABSTRACT
A large amount of data needed in decision-making processes is
stored in the XML data format, which is widely used for ecommerce and Internet-based information exchange. Thus, as
more organizations view the web as an integral part of their
communication and business, the importance of integrating XML
data in data warehousing environments is becoming increasingly
high. In this paper we show how the design of a data mart can be
carried out starting directly from an XML source. Two main
issues arise: on the one hand, since XML models semi-structured
data, not all the information needed for design can be safely
derived; on the other, different approaches for representing
relationships in XML DTDs and Schemas are possible, each with
different expressive power. After discussing these issues, we
propose a semi-automatic approach for building the conceptual
schema for a data mart starting from the XML sources.
Keywords
Data warehouse design, Data warehousing and the web, XML
1. INTRODUCTION
A large amount of data needed in decision-making processes is
stored in the XML (Extensible Markup Language) data format.
The structure of XML, composed of nested custom-defined tags
that can describe the meaning of the content itself, makes it usable
as a semantic-preserving data exchange format on the web. As the
Internet has evolved into a global platform for e-commerce and
information exchange, the interest in XML has been growing and
large volumes of XML data already exist.
XML can be considered as a particular standard syntax for the
exchange of semi-structured data [1]. One common feature of
semi-structured data models is the lack of schema, so that the data
is self-describing. However, XML documents can be associated
with and validated against either a Document Type Definition
(DTD) or an XML Schema, both of which allow the structure of
XML documents to be described and their contents to be
constrained. DTDs are defined as a part of the XML 1.0
What is the trend for the most and the least accessed pages?
CLICK
2.2
2.3
day of week
holiday
date
domain
(category /
nation)
URL
file type
no. of clicks
host
(hostname/
IP address)
hour
HOST
CATEGORY
(1,n)
time
date
(0,1)
HOST
(1,1)
(1,N)
CLICK
(1,1)
(1,N)
(1,1)
(0,1)
URL
(1,1)
(1,N)
FILE TYPE
(1,N)
(1,N)
SITE
(1,1)
(1,N)
(1,N)
(1,N)
URL
CATEGORY
NATION
date
host category
URL
nation
host
file type
click
site
nation
<element name=click>
<complexType>
<element name=url type=urlType/>
<element name=fileTypes
type=fileTypesType/>
</complexType>
<key name=fileTypeKey>
<selector xpath=fileTypes/type/>
<field xpath=@typeID/>
</key>
<keyref name=fileTypeRef
refer=fileTypeKey>
<selector xpath=url/>
<field xpath=fileType/>
</keyref>
</element>
Figure 9. An XML Schema where relationships are specified
by key/keyRef
In conclusion, the key/keyRef mechanism may be applied to any
element and attribute content, and the scope of the constraint can
be specified, while an ID is a type of attribute whose scope is
fixed to be the whole document. Furthermore, combinations of
element and attribute content can also serve as keys and
references in XML Schemas.
webTraffic
Starting with the assumption that the XML document has a DTD
and conforms to it, the methodology consists of the following
steps:
click
host
?
category
?
nation
date
hostId
time
site
siteId
url
* urlId
fileType
urlCategory
3. Choosing facts.
4. For each fact:
4.1
4.2
4.3
root=newVertex(F);
// newVertex(<vertex>) returns a new vertex
// of the attribute tree
// corresponding to <vertex>
// of the DTD graph
expand(F,root);
expand(E,V):
// E is the current DTD vertex,
// V is the current attribute tree vertex
{ for each child W of E do
if W is element or attribute do
{ next=newVertex(W);
addChild(V,next);
// adds child W to V
expand(W,next);
}
else
if W="?" do
expand(W,V);
for each parent Z of E such that
Z is not a document element do
if Z="?" or Z="*" do
expand(Z,V);
else
if not toMany(E,Z) do
if askDesignerToOne(E,Z) do
{ next=newVertex(Z);
addChild(V,next);
expand(Z,next);
}
}
Figure 11. Algorithm for building the attribute tree
Defining facts
The designer chooses one or more vertices of the DTD graph as
facts; each of them becomes the root of a fact schema. In our
example, we choose the click vertex as the only interesting fact.
The vertices of the attribute tree are a subset of the element and
attribute vertices of the DTD graph. The algorithm to build the
attribute tree is sketched in Figure 11.
since there is no need for the existence of both host and hostId
vertices, only host should be left; the same logic should be applied
for urlId and siteId. Finally, the time attribute is replaced with the
hour attribute.
date
urlId
time
host category
host
file type
url
click
siteId
nation
site
hostId
nation
webTraffic
click
url
host
?
category
?
nation
hostId
date
hour
urlId
site
+
fileType
urlCategory
siteId
Figure 13. Another possible DTD graph describing the web site traffic
date
hostId
url
urlId
click
nation
[1] Abiteboul, S., Buneman, P., and Suciu, D. Data on the Web:
From Relations to Semistructured Data and XML. Morgan
Kaufman Publishers, 2000.
[2] Cabibbo, L., and Torlone, R. A logical approach to
multidimensional databases. In Proc. EDBT, 1998.
[3] Datta, A., and Thomas, H. A conceptual model and algebra
for on-line analytical processing in data warehouses. In Proc.
WITS, 1997.
[4] Deutsch, A., Fernandez, M., Florescu, D., Levy, A., and
Suciu, D. A Query Language for XML. In Proc. 8th World
Wide Web Conference, 1999.
[5] Florescu, D., and Kossmann, D. Storing and Querying XML
Data using an RDBMS. IEEE Data Engineering Bulletin 22,
3, 1999.
[6] Franconi, E., and Sattler, U. A data warehouse conceptual
model for multidimensional aggregation. In Proc. DMDW,
1999.
file type
[7] Golfarelli, M., Maio, D., and Rizzi, S. The Dimensional Fact
Model: a conceptual model for data warehouses. Int. Jour. of
Cooperative Inf. Systems 7, 2&3, 1998.
siteId
time
host category
host
6. REFERENCES
site
[9] Kimball, R. The data warehouse toolkit. John Wiley & Sons,
1996.
nation
5. CONCLUSIONS
In this paper we described a semi-automatic approach to
conceptual design of a data mart from an XML source. We
showed how the semi-structured nature of the source increases the
level of uncertainty on the structure of data as compared to
structured sources such as database schemas, thus requiring access
to the source documents and, possibly, the designers help in order
to detect -to-one relationships. The approach was described with
reference to the case in which the sources are constrained by a
DTD using sub-elements, but it can be adopted equivalently when
XML Schemas are considered.
Using XML sources for feeding data warehouse systems will
become a standard in the next few years. Adopting a technique to
derive the data mart schema directly from the XML sources is not
the only possible approach: the data mart schema may also be
designed manually, meaning that facts, measures and
hierarchies are determined starting from the user requirements and
the logical connection with the source schemas is established only
a posteriori. On the other hand, the main problem with this
solution is that, very often, the requirements expressed by the
users cannot be fully supported by the existing data; besides, the
process of mapping each requirement back to the source schema
may be very complex.