CSE 3124Y(5): Data Warehousing and OLAP Lecture

CSE 3124Y(5): Business Intelligence & Big Data Analytics
Lecture 4: Data Warehousing and OLAP

Data Warehousing and OLAP
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation
2 CSE 3124Y BI&BDA 2019- 2020

Data Warehousing
Data warehousing is a process, not a product, for assembling
and managing data from various sources for the purpose of
gaining a single detailed view of part or all of a business. The
single view is the data warehouse.
On-line Analytical Processing (OLAP) is a technique used for
providing management decision support using historical and
summarized data that is consolidated in the data warehouse.

Data Warehousing
A data warehouse integrates information from several sources
into a global schema and is stored separately from the
operational data. It does not represent a snapshot of the
operational database.
Moving data from various sources to a data warehouse is a
very difficult process involving data cleansing and data
integration, sometimes called ETL (extract, transform, load).
Most database systems are error-prone. A data warehouse
should have as few errors as possible.

Data Warehouse
Most database systems continue to grow, but a data warehouse
grows at a slower rate.
User updates to a data warehouse are usually forbidden;
updates must come from the underlying databases to maintain
consistency.

Data Warehousing
To speed up OLAP queries, a warehouse contains summarized
and consolidated information representing materialized
aggregate views of the enterprise data from a number of
databases.
Data warehousing and OLAP are complementary: a warehouse
stores data, while OLAP derives strategic information from it.
A data warehouse may be used to provide an enterprise
memory, which operational data does not provide.

Data Warehousing
A warehouse usually contains information over time, which helps
the analysis of trends.
A data warehouse repackages information to support
business decision making.
The aim in data warehousing may be to generate new revenue by
selling the repackaged information.

A definition
According to W. H. Inmon: A data warehouse is a
subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s
decision making process.
It is important to note the subject-oriented, integrated, and
time-variant properties of data warehouses.

Subject-oriented
• Organized around major subjects, such as student, degree, country.
• Focusing on the modeling and analysis of data for decision makers, not on daily operations.
• Providing a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Integrated
• May be constructed by integrating multiple data sources, e.g. multiple databases.
• Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.

Time Variant
• The time horizon for a data warehouse is significantly longer than that of operational systems.
  – Operational database: current value data.
  – Data warehouse data: provides information from a historical perspective (e.g., past 5-10 years).
• Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
  – Operational data may or may not contain a time element.

Non-volatile
• A physically separate store of data transformed from the operational environment.
• No update of data.
• Does not require transaction processing, recovery, and concurrency control mechanisms.
• Requires only two operations in data accessing: initial loading of data and access of data.

Why Separate Data Warehouse?
• High performance for both systems
  – DBMS: tuned for OLTP
  – Warehouse: tuned for OLAP
• Different functions and different data:
  – missing data: decision support requires historical data which operational DBs do not typically maintain
  – data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  – data quality: different sources typically use inconsistent data representations, codes and formats

Database Workloads
• OLTP (online transaction processing)
  – Typical applications: e-commerce, banking, airline reservations
  – User facing: real-time, low latency, highly concurrent
  – Tasks: relatively small set of “standard” transactional queries
  – Data access pattern: random reads, updates, writes (involving relatively small amounts of data)
• OLAP (online analytical processing)
  – Typical applications: business intelligence, data mining
  – Back-end processing: batch workloads, less concurrency
  – Tasks: complex analytical queries, often ad hoc
  – Data access pattern: table scans, large amounts of data involved per query

OLTP
• Most database operations involve On-Line Transaction Processing (OLTP).
• Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
• Examples: answering queries from a Web interface, sales at cash registers, selling airline tickets.

OLAP
• Of increasing importance are On-Line Analytical Processing (OLAP) queries.
• Few, but complex queries that may run for hours.
• Queries do not depend on having an absolutely up-to-date database.

Data Warehouse Process
• Define the architecture, do capacity planning, select hardware and software
• Design the warehouse schema and the views
• Design the physical data structures
• Design data extraction, cleaning, transformation, load and refresh software
• Populate the repository with data and software
• Design and implement end-user applications
(Refer to Chaudhuri and Dayal)

Data Cubes
• A data warehouse is based on a multidimensional data model which views data in the form of a data cube
• A data cube allows data to be modeled and viewed in multiple dimensions
  – Dimension tables, such as item(item_name, brand, type) or time(day, week, month, quarter, year)
  – The fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

Conceptual Modeling
• Modeling data warehouses: dimensions & measures
  – Star schema: a fact table in the middle connected to a set of dimension tables
  – Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake

Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
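The DMQL definition above maps roughly onto a relational star schema. The following is a minimal sketch using Python's built-in sqlite3 module; the fact-table name sales_fact, the column types, and the sample rows are assumptions made for illustration and do not come from the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                       month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- Fact table (hypothetical name): one key per dimension plus the raw measure.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    sales_in_dollars REAL
);
""")

# Made-up sample rows, just enough to run one query.
conn.execute("INSERT INTO time VALUES (1, 1, 'Mon', 1, 1, 2019)")
conn.execute("INSERT INTO item VALUES (1, 'TV', 'BrandX', 'electronics', 'wholesale')")
conn.execute("INSERT INTO branch VALUES (1, 'Main', 'retail')")
conn.execute("INSERT INTO location VALUES (1, '1 Some St', 'Some City', 'Some State', 'Some Country')")
conn.executemany("INSERT INTO sales_fact VALUES (1, 1, 1, 1, ?)", [(100.0,), (50.0,)])

# The three measures of the cube, grouped by two dimension attributes.
row = conn.execute("""
    SELECT t.year, i.item_name,
           SUM(f.sales_in_dollars) AS dollars_sold,
           AVG(f.sales_in_dollars) AS avg_sales,
           COUNT(*)                AS units_sold
    FROM sales_fact f
    JOIN time t ON t.time_key = f.time_key
    JOIN item i ON i.item_key = f.item_key
    GROUP BY t.year, i.item_name
""").fetchone()
print(row)
```

The fact table holds only keys and the raw measure; the descriptive attributes live in the dimension tables and are reached through joins.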

Measures: Three Categories
• distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning.
  e.g., count(), sum(), min(), max().
• algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
  e.g., avg() = sum() / count().

Measures: Three Categories
• holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
  e.g., median(), mode(), rank().
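The three categories can be checked on a small list. This is an illustrative sketch only; the data values and the partitioning are arbitrary assumptions.

```python
from statistics import median

data = [3, 1, 4, 1, 5, 9, 2, 6]
parts = [data[:3], data[3:6], data[6:]]   # an arbitrary partitioning of the data

# Distributive: per-partition sums can be combined with the same function.
total = sum(sum(p) for p in parts)

# Algebraic: avg is derived from two distributive aggregates (sum and count).
sub = [(sum(p), len(p)) for p in parts]
avg = sum(s for s, _ in sub) / sum(n for _, n in sub)

# Holistic: combining per-partition medians does not in general give the true median.
median_of_medians = median(median(p) for p in parts)
true_median = median(data)

print(total, avg, median_of_medians, true_median)
```

This is why distributive and algebraic measures are cheap to pre-aggregate in a cube, while holistic ones generally need the underlying data.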

Data Warehouse Design
The E-R Model approach, which consists of entities
and relationships, is not suitable for designing a
schema for a warehouse.
What is the nature of data in a data warehouse?
Essentially, data warehouses are based on a
multidimensional data model which views data as a
data cube.

An Example
Consider the following database:

Student(sid, name1, dob, country, degree, startsem, address1, telephone, address2, email, scholarship, ..)
Enrolment(sid, subject-id, mark, tutegroup, tutor, ..)
Subject(sub-id, name, school-id, whenstarted, lecturer, ..)
School(name, id, ..)

Not all of this data is needed for decision making.
Let us extract some data from this database.

Data Cube
yob   country    degree    startsem  nsubjects  scholarship
1965  Thailand   MIT       991       5          25%
1970  Canada     BIT       992       4          0
1967  Australia  LLB       993       3          30%
1966  Australia  LLB       983       4          40%
1972  Australia  Bcom      973       5          10%
1972  India      BIT/Bcom  991       5          10%
1982  Sweden     MSc(IT)   991       3          10%

Is this information useful for decision making? Not really!

Example
We could look at the information as
yob X country X degree X startsem X nsubjects X scholarship
In fact it is natural to think of enterprise data as
multidimensional.

Example
The university management may be interested in
retrieving information like:
• How many students are doing BIT? How many
students from Thailand? How many students started
in 1998? (queries involving only one variable)
• How many students doing BIT are from Thailand?
How many MIT students started in 981? How many
students from Thailand started in 993? (queries
involving two variables)

Example
• How many students doing MIT from Thailand
started in 981? (query involving three variables)
A special type of database system, called a data cube
system, is often used for answering such queries.
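Queries of this kind are counts over one, two, or three variables. A minimal sketch, using the seven rows extracted earlier in these slides and Python's collections.Counter (the variable names are illustrative):

```python
from collections import Counter

# Rows from the extracted table:
# (yob, country, degree, startsem, nsubjects, scholarship in %)
students = [
    (1965, "Thailand", "MIT", "991", 5, 25),
    (1970, "Canada", "BIT", "992", 4, 0),
    (1967, "Australia", "LLB", "993", 3, 30),
    (1966, "Australia", "LLB", "983", 4, 40),
    (1972, "Australia", "Bcom", "973", 5, 10),
    (1972, "India", "BIT/Bcom", "991", 5, 10),
    (1982, "Sweden", "MSc(IT)", "991", 3, 10),
]

by_country = Counter(s[1] for s in students)                   # one variable
by_country_degree = Counter((s[1], s[2]) for s in students)    # two variables
by_all_three = Counter((s[1], s[2], s[3]) for s in students)   # three variables

print(by_country["Australia"])                     # students from Australia
print(by_country_degree[("Australia", "LLB")])     # LLB students from Australia
print(by_all_three[("Thailand", "MIT", "991")])    # MIT students from Thailand starting in 991
```

Each Counter corresponds to one group-by of the data; a data cube system keeps such counts available for every combination of dimensions.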

Data Cube
The example queries discussed earlier may be
represented by a three-dimensional data cube with each
edge representing one of the variables, viz. startsem,
country, and degree.
A point inside the cube is an intersection of the
coordinates defined by the edges of the cube. The
coordinates of the point define the meaning of the data
at that point.

Data Cube
Let us look at a simple two-dimensional situation:
country X degree
For decision making this may be useful information.
If we had a 2-dimensional matrix then we could find
out the number of students for any country (x) and
any degree (y).

Data Cube
But in the two-dimensional situation, we don’t just
want to find out the number of students for any
country (x) and any degree (y). We may have many
other queries, e.g.
1. How many students are doing MIT?
2. How many students from Thailand?
3. How many Asian students doing Law degrees?
Thus there is a kind of hierarchy that we wish to use, for
example, the world, the continents, the regions, the
countries etc. In degrees, we may want a hierarchy of
university, Schools, UG and PG, individual degrees.

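A hierarchy lets the same counts be taken at a coarser level. The sketch below rolls country up to continent and degree up to a field; both mapping tables are hypothetical and only cover the seven sample rows from the earlier slide.

```python
from collections import Counter

# (country, degree) pairs taken from the extracted student table
enrolments = [
    ("Thailand", "MIT"), ("Canada", "BIT"), ("Australia", "LLB"),
    ("Australia", "LLB"), ("Australia", "Bcom"), ("India", "BIT/Bcom"),
    ("Sweden", "MSc(IT)"),
]

# Hypothetical hierarchy tables: country -> continent, degree -> field
continent = {"Thailand": "Asia", "India": "Asia", "Canada": "North America",
             "Australia": "Oceania", "Sweden": "Europe"}
field = {"LLB": "Law", "MIT": "IT", "BIT": "IT", "MSc(IT)": "IT",
         "Bcom": "Commerce", "BIT/Bcom": "IT/Commerce"}

by_continent = Counter(continent[c] for c, _ in enrolments)
by_continent_field = Counter((continent[c], field[d]) for c, d in enrolments)

print(by_continent["Asia"])                    # students from Asian countries
print(by_continent_field[("Asia", "Law")])     # Asian students doing Law degrees
print(by_continent_field[("Oceania", "Law")])
```

In this sample there are no Asian Law students, so that cell is simply empty; a Counter reports it as 0.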
Data Cube
Consider a slightly more complex situation in which
we have three dimensions:
country X degree X startsem
for any country (x), any degree (y) and any start
semester (z).
We may now look at this information as a
3-dimensional cube as shown on the following slide.

Data Cube
(based on a slide from the book by J. Han and M. Kamber)
• Number of students as a function of country, degree and semester
• Dimensions: country, degree, semester
• Hierarchical summarization paths:
  – country: continent > region > country
  – degree: school > ug/pg > degree
  – semester: year > semester

A Sample Data Cube
[Figure: a data cube with dimensions semester (991, 992, 993, 001, sum), degree (LLB, BCom, MIT, sum) and country (U.S.A., Norway, Australia, sum). The highlighted cell gives the total LLB enrolments from the U.S.A.]

Data Cube
Each edge of the cube is called a dimension. A user
normally has a number of different dimensions from
which the given data may be analyzed. A user therefore
has a multidimensional conceptual view of the data
which is represented by the cube.
The points inside a cube provide aggregations. For
example, a point may provide the number of students
from Malaysia admitted to BCom in year 1998.

Multidimensional View
A particular user will have one multidimensional view
of the database while another user in the same
enterprise may have another view. Therefore many
different multidimensional views of the same database
are possible and the same data may be consolidated in
many different ways.

Multidimensional data model
The cube is not always three-dimensional since often an
enterprise would have many more, perhaps eight or
even ten, dimensions of interest. Each dimension may
be associated with a table that describes the dimension.
For example, a dimension table for country would
contain the country names and could contain other
information, e.g. category. Other dimensions like time do
not naturally have such a table of information.

Data Cube
A number of operations may be applied to data cubes.
The common ones are:
- roll-up (increasing the level of abstraction)
- drill-down (increasing detail)
- slice and dice (selection and projection)
- pivot (re-orienting the view)

Data Cube Operations
• Roll-up (less detail) - when we wish further
abstraction (i.e. less detail). This operation performs
further aggregation on the data, for example, from
single degree programs to Schools, single countries to
Continents or from three dimensions to two
dimensions.
• Drill-down (increasing detail) - reverse of roll-up,
when we wish to partition more finely or want to focus
on some particular values of certain dimensions.
Drill-down adds more detail to the data; it may involve
adding another dimension.

Data Cube Operations
• Slice and dice (selection and projection) - the slice
operation performs a selection on one dimension of the
cube (e.g. degree = “MIT”). The dice operation performs
a selection on two or more dimensions (e.g. degree =
“BIT” and country = “Australia” or “India”)
• Pivot (re-orienting the view) - an alternate
presentation of the data, e.g. rotating the axes in a 3-D
cube.
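The four operations can be sketched on a tiny fact table. The rows below are the small sales example used elsewhere in these slides, written as (prodId, storeId, day, amt); the helper roll_up and the variable names are assumptions for illustration, not a standard API.

```python
from collections import defaultdict

# Fact rows (prodId, storeId, day, amt) from the lecture's small sales example
facts = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
         ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

def roll_up(rows, keep):
    """Sum amt over every dimension whose index is not listed in keep."""
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in keep)] += row[3]
    return dict(out)

# Slice: select on one dimension (day = 1).
slice_day1 = [r for r in facts if r[2] == 1]

# Dice: select on two or more dimensions.
dice = [r for r in facts if r[0] == "p1" and r[1] in ("c1", "c2")]

# Roll-up: aggregate the day dimension away, leaving product x store.
by_prod_store = roll_up(facts, keep=(0, 1))

# Drill-down: go back to the finer product x store x day level.
by_prod_store_day = roll_up(facts, keep=(0, 1, 2))

# Pivot: same cells, axes swapped to store x product.
pivoted = {(s, p): v for (p, s), v in by_prod_store.items()}

print(by_prod_store[("p1", "c1")], pivoted[("c1", "p1")])
```

Note that drill-down needs the detailed data to still be available; it cannot be recovered from the rolled-up sums alone.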

Cube
Fact table view:

sale  prodId  storeId  amt
      p1      c1       12
      p2      c1       11
      p1      c3       50
      p2      c2        8

Multi-dimensional cube (dimensions = 2):

      c1   c2   c3
p1    12        50
p2    11    8

3-D Cube
Fact table view:

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1      8
      p1      c1       2     44
      p1      c2       2      4

Multi-dimensional cube (dimensions = 3):

day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

Multidimensional Data
• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  – Product: Industry > Category > Product
  – Location: Region > Country > City > Office
  – Time: Year > Quarter > Month > Day, with Week > Day as an alternative path

A Sample Data Cube
[Figure: a data cube with dimensions date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), product (TV, PC, VCR, sum) and country (U.S.A., Canada, Mexico, sum). The highlighted cell gives the total annual sales of TV in the U.S.A.]

Cuboids Corresponding to the Cube
• 0-D (apex) cuboid: all
• 1-D cuboids: (product), (date), (country)
• 2-D cuboids: (product, date), (product, country), (date, country)
• 3-D (base) cuboid: (product, date, country)
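The lattice of cuboids is just the set of all subsets of the dimensions, so n dimensions (without hierarchies) give 2^n cuboids. A short sketch with itertools:

```python
from itertools import combinations

dims = ("product", "date", "country")

# Every subset of the dimensions is one cuboid: 2**3 = 8 in total.
cuboids = [c for k in range(len(dims) + 1) for c in combinations(dims, k)]

for c in cuboids:
    print(f"{len(c)}-D:", c or "all (apex)")
```

The empty tuple is the apex cuboid (a single grand total) and the full tuple is the base cuboid.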

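Cube Aggregation
Example: computing sums over the 3-D sales cube

day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

Summed over days:
        c1   c2   c3
p1      56    4   50
p2      11    8

Summed over days and products:
        c1   c2   c3
sum     67   12   50

Summed over days and stores:
        sum
p1      110
p2       19

Grand total: 129

Moving towards the sums is a roll-up; moving back towards the detailed cube is a drill-down.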

Cube Operators
Cells of the 3-D sales cube can be addressed as sale(store, product, day), with * standing for all values of a dimension:
• sale(c1,*,*) = 67 (all sales at store c1)
• sale(c2,p2,*) = 8 (sales of product p2 at store c2, over all days)
• sale(*,*,*) = 129 (grand total)

Extended Cube
Each table is extended with a * row and a * column holding the sums.

day 1   c1   c2   c3    *
p1      12        50   62
p2      11    8        19
*       23    8   50   81

day 2   c1   c2   c3    *
p1      44    4        48
p2
*       44    4        48

*       c1   c2   c3    *
p1      56    4   50  110
p2      11    8        19
*       67   12   50  129

Example: sale(*,p2,*) = 19
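The extended cube can be built by letting every fact row contribute to each cell that matches it, where any coordinate may be replaced by "*". A minimal sketch, assuming the argument order sale(store, product, day) used on the slides:

```python
from collections import defaultdict
from itertools import product

# Fact rows (storeId, prodId, day, amt) from the lecture's small sales example
facts = [("c1", "p1", 1, 12), ("c1", "p2", 1, 11), ("c3", "p1", 1, 50),
         ("c2", "p2", 1, 8), ("c1", "p1", 2, 44), ("c2", "p1", 2, 4)]

cube = defaultdict(int)
for store, prod, day, amt in facts:
    # Each fact adds to 2**3 cells: every mix of its own value and "*".
    for cell in product((store, "*"), (prod, "*"), (day, "*")):
        cube[cell] += amt

def sale(store, prod, day):
    """Look up one cell of the extended cube; empty cells count as 0."""
    return cube.get((store, prod, day), 0)

print(sale("c1", "*", "*"), sale("c2", "p2", "*"), sale("*", "p2", "*"), sale("*", "*", "*"))
```

This materializes every aggregate up front, which is simple but grows quickly with the number of dimensions.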

Aggregation Using Hierarchies
Hierarchy: customer > region > country

day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

Rolled up from customer to region and summed over days
(customer c1 in Region A; customers c2, c3 in Region B):

        region A   region B
p1      56         54
p2      11          8
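Rolling up along a hierarchy amounts to replacing each customer by its region before summing. A small sketch that reproduces the region table from the slide (the dictionary name region_of is illustrative):

```python
from collections import defaultdict

# Fact rows (prodId, customer, day, amt) from the lecture's small sales example
facts = [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
         ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)]

# Mapping stated on the slide: c1 in Region A; c2 and c3 in Region B
region_of = {"c1": "A", "c2": "B", "c3": "B"}

by_region = defaultdict(int)
for prod, cust, _day, amt in facts:
    by_region[(prod, region_of[cust])] += amt

print(dict(by_region))
```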

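Pivoting
The fact table rows (prodId, storeId, day, amt)

p1  c1  1  12
p2  c1  1  11
p1  c3  1  50
p2  c2  1   8
p1  c1  2  44
p1  c2  2   4

are re-oriented into a cube view:

day 1   c1   c2   c3
p1      12        50
p2      11    8

day 2   c1   c2   c3
p1      44    4
p2

and, summed over days, into a product by store table:

        c1   c2   c3
p1      56    4   50
p2      11    8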

CUBE Operator (SQL-99)
Chevy Sales Cross Tab

Chevy         1990   1991   1992   Total (ALL)
black           50     85    154   289
white           40    115    199   354
Total (ALL)     90    200    353   643

SELECT model, year, color, sum(sales) AS sales
FROM sales
WHERE model IN ('Chevy')
  AND year BETWEEN 1990 AND 1992
GROUP BY CUBE (model, year, color);

CUBE Contd.
SELECT model, year, color, sum(sales) AS sales
FROM sales
WHERE model IN ('Chevy')
  AND year BETWEEN 1990 AND 1992
GROUP BY CUBE (model, year, color);
• Computes the union of 8 different groupings:
  {(model, year, color), (model, year), (model, color), (year, color), (model), (year), (color), ()}
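SQLite, used here through Python's sqlite3 module, does not provide GROUP BY CUBE, so the sketch below emulates it by building the union of the 8 groupings explicitly. The table contents are the six black and white cells of the Chevy cross tab; reporting the rolled-up columns as the string 'ALL' is an assumption that mirrors the slides' notation.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, year INTEGER, color TEXT, sales INTEGER)")
conn.executemany("INSERT INTO sales VALUES ('Chevy', ?, ?, ?)", [
    (1990, "black", 50), (1991, "black", 85), (1992, "black", 154),
    (1990, "white", 40), (1991, "white", 115), (1992, "white", 199),
])

cols = ("model", "year", "color")
parts = []
for k in range(len(cols), -1, -1):
    for group in combinations(cols, k):
        # Columns outside the current grouping are reported as 'ALL'.
        select = ", ".join(c if c in group else f"'ALL' AS {c}" for c in cols)
        group_by = f" GROUP BY {', '.join(group)}" if group else ""
        parts.append(f"SELECT {select}, SUM(sales) AS sales FROM sales{group_by}")

# One SELECT per grouping, glued together with UNION ALL.
rows = conn.execute(" UNION ALL ".join(parts)).fetchall()
for r in rows:
    print(r)
```

On a DBMS that supports the CUBE operator, the single GROUP BY CUBE query on the slide produces the same groupings, typically with NULL rather than 'ALL' in the rolled-up columns.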

Example Contd.
SALES

Model  Year  Color  Sales
Chevy  1990  red      5
Chevy  1990  white   87
Chevy  1990  blue    62
Chevy  1991  red     54
Chevy  1991  white   95
Chevy  1991  blue    49
Chevy  1992  red     31
Chevy  1992  white   54
Chevy  1992  blue    71
Ford   1990  red     64
Ford   1990  white   62
Ford   1990  blue    63
Ford   1991  red     52
Ford   1991  white    9
Ford   1991  blue    55
Ford   1992  red     27
Ford   1992  white   62
Ford   1992  blue    39

Applying CUBE to SALES gives the DATA CUBE:

Model  Year  Color  Sales
ALL    ALL   ALL    942
chevy  ALL   ALL    510
ford   ALL   ALL    432
ALL    1990  ALL    343
ALL    1991  ALL    314
ALL    1992  ALL    285
ALL    ALL   red    165
ALL    ALL   white  273
ALL    ALL   blue   339
chevy  1990  ALL    154
chevy  1991  ALL    199
chevy  1992  ALL    157
ford   1990  ALL    189
ford   1991  ALL    116
ford   1992  ALL    128
chevy  ALL   red     91
chevy  ALL   white  236
chevy  ALL   blue   183
ford   ALL   red    144
ford   ALL   white  133
ford   ALL   blue   156
ALL    1990  red     69
ALL    1990  white  149
ALL    1990  blue   125
ALL    1991  red    107
ALL    1991  white  104
ALL    1991  blue   104
ALL    1992  red     59
ALL    1992  white  116
ALL    1992  blue   110

Benefits of Multidimensional Analysis
• A small high-level database with pre-computed aggregates is created for efficient high-level queries
• Multiple-level views
• Selection by slicing and dicing
However, multidimensional analysis does not provide data mining.

OLAP Server Architectures
• Relational OLAP (ROLAP)
  – Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
  – Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  – Greater scalability

The Data Warehouse
• The most common form of data integration.
  – Copy sources into a single DB (warehouse) and try to keep it up-to-date.
  – Usual method: periodic reconstruction of the warehouse, perhaps overnight.
• Frequently essential for analytic queries.

OLAP Server Architectures
• Multidimensional OLAP (MOLAP)
  – Array-based multidimensional storage engine (sparse matrix techniques)
  – Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP)
  – User flexibility, e.g., low level: relational, high level: array
• Specialized SQL servers

Data Warehouse Design
One approach is to use the star schema to represent the
multidimensional data model. The schema in this model
consists of a single large fact table containing the bulk of
the data, with no redundancy, and a set of smaller tables
called dimension tables, one for each dimension.
Other models have been used. These include the snowflake
model and the fact constellation model.

Warehouse Architecture
[Figure: several sources feed an integration layer, which loads the warehouse and its metadata; clients reach the warehouse through a query & analysis layer.]

Star Schemas
• A star schema is a common organization for data at a warehouse. It consists of:
  1. Fact table: a very large accumulation of facts such as sales.
     – Often “insert-only.”
  2. Dimension tables: smaller, generally static information about the entities involved in the facts.

Example of Star Schema
Sales Fact Table
  keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales
Dimension tables
  time(time_key, day, day_of_the_week, month, quarter, year)
  item(item_key, item_name, brand, type, supplier_type)
  branch(branch_key, branch_name, branch_type)
  location(location_key, street, city, province_or_state, country)

Example: Star Schema
• Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.
• The fact table is a relation:
  Sales(bar, beer, drinker, day, time, price)

Example, Continued
• The dimension tables include information about the bar, beer, and drinker “dimensions”:
  Bars(bar, addr, license)
  Beers(beer, manf)
  Drinkers(drinker, addr, phone)

Visualization – Star Schema
[Figure: the Sales fact table in the centre, holding dimension attributes and dependent attributes, surrounded by the dimension tables Bars, Drinkers, Beers, etc.]

Dimensions and Dependent Attributes
• Two classes of fact-table attributes:
  1. Dimension attributes: the key of a dimension table.
  2. Dependent attributes: a value determined by the dimension attributes of the tuple.

Warehouse Models & Operators
• Data Models
  – relations
  – stars & snowflakes
  – cubes
• Operators
  – slice & dice
  – roll-up, drill down
  – pivoting
  – other

Star
product   prodId  name  price
          p1      bolt  10
          p2      nut    5

store     storeId  city
          c1       nyc
          c2       sfo
          c3       la

sale      orderId  date    custId  prodId  storeId  qty  amt
          o100     1/7/97  53      p1      c1       1    12
          o102     2/7/97  53      p2      c1       2    11
          o105     3/8/97  111     p1      c3       5    50

customer  custId  name   address    city
          53      joe    10 main    sfo
          81      fred   12 main    sfo
          111     sally  80 willow  la

Star Schema
sale(orderId, date, custId, prodId, storeId, qty, amt)
customer(custId, name, address, city)
product(prodId, name, price)
store(storeId, city)
The sale table references customer, product and store through custId, prodId and storeId.


Terms
• Fact table: sale(orderId, date, custId, prodId, storeId, qty, amt)
• Dimension tables: customer(custId, name, address, city), product(prodId, name, price), store(storeId, city)
• Measures: qty, amt

Dimension Hierarchies
Hierarchies: store > sType, and store > city > region

sType   tId  size   location
        t1   small  downtown
        t2   large  suburbs

store   storeId  cityId  tId  mgr
        s5       sfo     t1   joe
        s7       sfo     t2   fred
        s9       la      t1   nancy

city    cityId  pop  regId
        sfo     1M   north
        la      5M   south

region  regId  name
        north  cold region
        south  warm region

• snowflake schema
• constellations
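In a snowflake schema, rolling up along a hierarchy means joining through the normalized dimension tables. A minimal sketch with Python's sqlite3, loading the four small tables from the slide (the column types are assumptions) and counting stores per region:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE city   (cityId TEXT PRIMARY KEY, pop TEXT, regId TEXT REFERENCES region(regId));
CREATE TABLE sType  (tId TEXT PRIMARY KEY, size TEXT, location TEXT);
CREATE TABLE store  (storeId TEXT PRIMARY KEY, cityId TEXT REFERENCES city(cityId),
                     tId TEXT REFERENCES sType(tId), mgr TEXT);
INSERT INTO region VALUES ('north', 'cold region'), ('south', 'warm region');
INSERT INTO city   VALUES ('sfo', '1M', 'north'), ('la', '5M', 'south');
INSERT INTO sType  VALUES ('t1', 'small', 'downtown'), ('t2', 'large', 'suburbs');
INSERT INTO store  VALUES ('s5', 'sfo', 't1', 'joe'), ('s7', 'sfo', 't2', 'fred'),
                          ('s9', 'la', 't1', 'nancy');
""")

# Walk the hierarchy store -> city -> region and aggregate at the region level.
rows = conn.execute("""
    SELECT r.regId, COUNT(*) AS stores
    FROM store s
    JOIN city c   ON c.cityId = s.cityId
    JOIN region r ON r.regId = c.regId
    GROUP BY r.regId
    ORDER BY r.regId
""").fetchall()
print(rows)
```

The extra joins are the price of normalizing the dimension; a star schema would store the region directly in one wider store table.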

Aggregates
• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE
  WHERE date = 1

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1      8
      p1      c1       2     44
      p1      c2       2      4

Result: 81

Aggregates
• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
  GROUP BY date

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1      8
      p1      c1       2     44
      p1      c2       2      4

ans   date  sum
      1     81
      2     48

Another Example
• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE
  GROUP BY date, prodId

sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1      8
      p1      c1       2     44
      p1      c2       2      4

ans   prodId  date  amt
      p1      1     62
      p2      1     19
      p1      2     48

Going from this result to the sums by day is a roll-up; going the other way is a drill-down.
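The three aggregate queries above can be run as written against a small in-memory table. A sketch using Python's sqlite3; the column types and the ORDER BY clauses (added only to make the output order predictable) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?)", [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
])

# Add up amounts for day 1.
day1 = conn.execute("SELECT sum(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date").fetchall()

# Add up amounts by day and product.
by_day_prod = conn.execute(
    "SELECT prodId, date, sum(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId").fetchall()

print(day1, by_day, by_day_prod)
```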

Aggregates
• Operators: sum, count, max, min, median, avg
• “Having” clause
• Cube (& Rollup) operator
• Using dimension hierarchy
  – average by region (within store)
  – maximum by month (within date)

Query & Analysis Tools
• Query Building
• Report Writers (comparisons, growth, graphs, …)
• Spreadsheet Systems
• Web Interfaces
• Data Mining

Other Operations
• Time functions
  – e.g., time average
• Computed Attributes
  – e.g., commission = sales * rate
• Text Queries
  – e.g., find documents with words X AND Y
  – e.g., rank documents by frequency of words X, Y, Z

Implementing a Warehouse
• Monitoring: sending data from sources
• Integrating: loading, cleansing, ...
• Processing: query processing, indexing, ...
• Managing: metadata, design, ...

Multi-Tiered Architecture
[Figure: four tiers, left to right]
• Data sources: operational DBs and other sources
• Data storage: data warehouse and data marts, fed by extract, transform, load and refresh; monitor & integrator; metadata
• OLAP engine: OLAP server, which serves the front end
• Front-end tools: analysis, query, reports, data mining

Monitoring
• Source types: relational, flat file, IMS, VSAM, IDMS, WWW, news-wire, …
• Incremental vs. refresh

customer  id   name   address    city
          53   joe    10 main    sfo
          81   fred   12 main    sfo
          111  sally  80 willow  la    (new)

Data Cleaning
• Migration (e.g., yen → dollars)
• Scrubbing: use domain-specific knowledge (e.g., social security numbers)
• Fusion (e.g., mail list, customer merging)
  billing DB customer1(Joe) + service DB customer2(Joe) → merged_customer(Joe)
• Auditing: discover rules & relationships (like data mining)

Loading Data
• Incremental vs. refresh
• Off-line vs. on-line
• Frequency of loading
  – At night, 1x a week/month, continuously
• Parallel/Partitioned load

ETL
• Extraction - data relevant to the tasks are selected and retrieved from a variety of sources.
• Transformation - data is consolidated by performing summary or aggregation operations.
• Cleansing - since data comes from a number of sources, errors and anomalies are common. There is a need to remove anomalies, remove errors, and handle missing and irrelevant data. Some tools are available for doing this.

Data Cleaning
Data cleaning overcomes problems like the following:
• Inconsistent field lengths of same items
• Inconsistent values for same items
• Inconsistent interpretation of same terms
• Missing entries
• Violation of integrity constraints
Data cleaning can be a very demanding task.
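The problems listed above can be illustrated with a toy cleaning step. Everything in this sketch is hypothetical: the raw records, the field names, the COUNTRY_CODES table, and the choice to drop (rather than repair) records with missing or invalid amounts.

```python
# Hypothetical raw records merged from two sources; field names are illustrative.
raw = [
    {"id": "53",  "name": "joe",   "country": "USA",           "amt": "12"},
    {"id": "053", "name": "Joe ",  "country": "United States", "amt": "12"},
    {"id": "81",  "name": "fred",  "country": "U.S.A.",        "amt": None},
    {"id": "111", "name": "sally", "country": "usa",           "amt": "-5"},
]

# One agreed code per country, whatever spelling the source used.
COUNTRY_CODES = {"usa": "US", "u.s.a.": "US", "united states": "US"}

def clean(record):
    """Return a normalised record, or None if it cannot be used."""
    if record["amt"] is None:            # missing entry
        return None
    amt = int(record["amt"])
    if amt < 0:                          # integrity constraint: amounts are non-negative
        return None
    return {
        "id": int(record["id"]),                                      # consistent key format
        "name": record["name"].strip().lower(),                       # consistent values
        "country": COUNTRY_CODES[record["country"].strip().lower()],  # consistent codes
        "amt": amt,
    }

cleaned = {}
for rec in raw:
    row = clean(rec)
    if row is not None:
        cleaned[row["id"]] = row         # fusion: duplicates collapse onto one key

print(cleaned)
```

Real cleaning tools are far more elaborate; in practice, whether to drop, repair, or flag a bad record is a policy decision for each warehouse.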

• Integration - combining data from many, perhaps
heterogeneous, sources. This is a non-trivial task since
different sources will use different formats, field
lengths, codes, descriptions, for the same data items.
• Loading - before loading, additional processing may be
needed, e.g. checking integrity constraints, building
derived tables, indices, access paths

• Refresh - warehouse data needs to be periodically
updated as the operational data changes. There are
several different ways of updating a data warehouse:
  – The data warehouse could be periodically
  reconstructed from the base sources (perhaps
  overnight, or once a week)
  – The data warehouse could be updated incrementally
  at intervals, for example each week or even each month,
  applying only the changes made since the last update.
The updates need to be logically correct since the warehouse data is
derived data.
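The two refresh strategies can be contrasted in a few lines. This is a sketch under simplifying assumptions: each source row carries a hypothetical "updated" counter, rows are never deleted at the source, and the function names are made up for illustration.

```python
def full_refresh(source_rows):
    """Rebuild the warehouse copy from scratch."""
    return {row["id"]: dict(row) for row in source_rows}

def incremental_update(warehouse, source_rows, since):
    """Apply only rows changed after the last load; return the new high-water mark."""
    latest = since
    for row in source_rows:
        if row["updated"] > since:
            warehouse[row["id"]] = dict(row)
            latest = max(latest, row["updated"])
    return latest

source = [
    {"id": 53, "name": "joe", "updated": 1},
    {"id": 81, "name": "fred", "updated": 1},
]
warehouse = full_refresh(source)          # initial load

source.append({"id": 111, "name": "sally", "updated": 2})    # new row at the source
mark = incremental_update(warehouse, source, since=1)

print(sorted(warehouse), mark)
```

After the incremental step the warehouse matches what a full refresh would produce, but only one row had to be copied. Handling deletions at the source would need extra bookkeeping that this sketch leaves out.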

Why Separate Data Warehouse?
• High performance for both systems
  – DBMS: tuned for OLTP
  – Warehouse: tuned for OLAP
• Different functions and different data:
  – missing data: decision support requires historical data which operational DBs do not typically maintain
  – data consolidation: decision support requires consolidation (aggregation, summarization) of data from many sources
  – data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

OLAP
Codd defines On-line Analytical Processing or OLAP as
the dynamic enterprise analysis required to create,
manipulate, animate, and synthesize information from
exegetical, contemplative, and formulaic data analysis
models.
OLAP generally involves highly complex queries
involving large amounts of data that use one or more
aggregates. OLAP deals only with historical data
accurate at a given point in time.

OLAP Characteristics
We discuss the following OLAP characteristics listed by
Codd:
• Dynamic data analysis - involving historical data of
multiple dimensions manipulated in many different
ways with the aim of studying changes occurring in the
enterprise
• Common enterprise data - OLAP uses the enterprise
data but in a very different way to discover why some
particular situations occurred
• Synergistic implementation - data synthesis, analysis,
and consolidation

OLAP Characteristics
• Four enterprise data models
  – Categorical - comparison of historical values
  – Exegetical - discovering reasons for what the categorical model found
  – Contemplative - “what if” analysis of the data
  – Formulaic - how to reach a desired goal

OLAP Examples
1. Amazon analyzes purchases by its customers to come
up with an individual screen with products of likely
interest to the customer.
2. Analysts at Wal-Mart look for items with increasing
sales in some region.

One Database or Two?
• Downsides of co-existing OLTP and OLAP workloads
– Poor memory management
– Conflicting data access patterns
– Variable latency
• Solution: separate databases
– User-facing OLTP database for high-volume transactions
– Data warehouse for OLAP workloads
– How do we connect the two?

OLTP/OLAP Architecture
OLTP → ETL (Extract, Transform, and Load) → OLAP

OLTP/OLAP Integration
• OLTP database for user-facing transactions
– Retain records of all activity
– Periodic ETL (e.g., nightly)
• Extract-Transform-Load (ETL)
– Extract records from source
– Transform: clean data, check integrity, aggregate, etc.
– Load into OLAP database
• OLAP database for data warehousing
– Business intelligence: reporting, ad hoc queries, data mining, etc.
– Feedback to improve OLTP services
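The extract, transform and load steps above can be sketched end to end with two in-memory SQLite databases standing in for the OLTP system and the warehouse. The table names orders and daily_sales, the columns, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

oltp = sqlite3.connect(":memory:")    # stands in for the user-facing database
olap = sqlite3.connect(":memory:")    # stands in for the data warehouse

oltp.execute("CREATE TABLE orders (order_id INTEGER, day INTEGER, prod TEXT, amt INTEGER)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", [
    (1, 1, "p1", 12), (2, 1, "p2", 11), (3, 1, "p1", 50),
    (4, 2, "p1", 44), (5, 2, "p1", None),      # one record with a missing amount
])
olap.execute("CREATE TABLE daily_sales (day INTEGER, prod TEXT, total INTEGER)")

def etl():
    # Extract: read the raw records from the source.
    rows = oltp.execute("SELECT day, prod, amt FROM orders").fetchall()
    # Transform: drop incomplete rows, then aggregate per (day, product).
    totals = {}
    for day, prod, amt in rows:
        if amt is not None:
            totals[(day, prod)] = totals.get((day, prod), 0) + amt
    # Load: replace the contents of the warehouse table.
    olap.execute("DELETE FROM daily_sales")
    olap.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)",
                     [(d, p, t) for (d, p), t in sorted(totals.items())])
    olap.commit()

etl()    # in practice this would run periodically, e.g. nightly
result = olap.execute("SELECT * FROM daily_sales ORDER BY day, prod").fetchall()
print(result)
```

Here the load step rebuilds the whole table on every run, which corresponds to the periodic-reconstruction style of refresh; an incremental load would only touch new or changed days.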

Codd’s OLAP Evaluation Rules
Codd in his 1993 paper lists the following 12 rules for
evaluating OLAP products:
• Multidimensional conceptual view - to make a variety
of manipulations (e.g. slice and dice) relatively easy
• Transparency - user should know what data is being
used and where from
• Accessibility - able to use data in enterprise database
as well as legacy systems

Codd’s OLAP Rules
• Consistent reporting performance - consistent
reporting performance as the number of dimensions
grows
• Client-server architecture - OLAP often uses
mainframe data but users want access from desktop
• Generic dimensionality - different dimensions should
not be treated differently
• Dynamic sparse matrix handling - always a lot of
missing data, OLAP should be able to adjust to that
• Multi-user support - obviously required

Codd’s OLAP Rules
• Unrestricted cross-dimensional operations - should be
able to infer associated calculations
• Intuitive data manipulation - operations should not
require the use of a menu or a number of iterations
• Flexible reporting - logical presentation of data
• Unlimited dimensions and aggregation levels - some
applications need as many as 15-20 dimensions

ROLAP vs. MOLAP
• ROLAP: Relational On-Line Analytical Processing
• MOLAP: Multi-Dimensional On-Line Analytical Processing

Summary
• Data warehouse
  – A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process
• A multi-dimensional model of a data warehouse
  – Star schema, snowflake schema, fact constellations
  – A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing and pivoting

Summary
• OLAP servers: ROLAP, MOLAP, HOLAP
• Efficient computation of data cubes
  – Partial vs. full vs. no materialization
