
CSE 3124Y(5): Business Intelligence & Big Data Analytics

Lecture 4:
Data Warehousing and OLAP
Data Warehousing and OLAP
• What is a data warehouse?
• A multi-dimensional data model
• Data warehouse architecture
• Data warehouse implementation

2 CSE 3124Y BI&BDA 2019- 2020


Data Warehousing
Data warehousing is a process, not a product, for assembling
and managing data from various sources for the purpose of
gaining a single detailed view of part or all of a business. The
single view is the data warehouse.
On-line Analytical Processing (OLAP) is a technique used for
providing management decision support using historical and
summarized data that is consolidated in the data warehouse.



Data Warehousing

A data warehouse integrates information from several sources
into a global schema and is stored separately from the
operational data. It does not represent a snapshot of the
operational database.
Moving data from various sources into a data warehouse is a
difficult process involving data cleansing and data
integration. This process is sometimes called ETL (extract,
transform, load).
Most database systems are error-prone. A data warehouse
should have as few errors as possible.


Data Warehouse

Most database systems continue to grow, but a data warehouse
grows at a slower rate.
User updates to a data warehouse are usually forbidden;
updates must come from the underlying databases to maintain
consistency.


Data Warehousing

To speed up OLAP queries, a warehouse contains summarized
and consolidated information representing materialized
aggregate views of the enterprise data from a number of
databases.
Data warehousing and OLAP are complementary. A warehouse
stores data while OLAP derives strategic information from it.
A data warehouse may be used to provide an enterprise
memory, which operational data does not provide.


Data Warehousing

A warehouse usually contains information over time, which
helps in the analysis of trends.
A data warehouse repackages information to support
business decision making.
The aim in data warehousing may be to generate new revenue by
selling the repackaged information.


A definition

According to W. H. Inmon: A data warehouse is a
subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of management’s
decision-making process.

Note in particular the subject-oriented, integrated, and
time-variant properties of data warehouses.


Subject-oriented

• Organized around major subjects, such as student, degree,
  country.
• Focusing on the modeling and analysis of data for decision
  makers, not on daily operations.
• Providing a simple and concise view around particular subject
  issues by excluding data that are not useful in the decision
  support process.


Integrated

• May be constructed by integrating multiple data sources, e.g.
  multiple databases.
• Data cleaning and data integration techniques are applied to
  ensure consistency in naming conventions, encoding
  structures, attribute measures, etc. among different data
  sources.


Time Variant
• The time horizon for a data warehouse is significantly longer than
  that of operational systems.
  – Operational database: current value data.
  – Data warehouse data: provide information from a historical
    perspective (e.g., past 5-10 years).
  – Every key structure in the data warehouse contains an element
    of time, explicitly or implicitly.
  – Operational data may or may not contain a time element.


Non-volatile

• A physically separate store of data transformed from the
  operational environment.
• No update of data.
• Does not require transaction processing, recovery, and
  concurrency control mechanisms.
• Requires only two operations in data access: initial loading
  of data and access of data.


Why Separate Data Warehouse?
• High performance for both systems
  – DBMS: tuned for OLTP
  – Warehouse: tuned for OLAP
• Different functions and different data:
  – missing data: decision support (DS) requires historical data
    which operational DBs do not typically maintain
  – data consolidation: DS requires consolidation (aggregation,
    summarization) of data from heterogeneous sources
  – data quality: different sources typically use inconsistent data
    representations, codes and formats


Database Workloads
• OLTP (online transaction processing)
  – Typical applications: e-commerce, banking, airline reservations
  – User facing: real-time, low latency, highly-concurrent
  – Tasks: relatively small set of “standard” transactional queries
  – Data access pattern: random reads, updates, writes (involving
    relatively small amounts of data)
• OLAP (online analytical processing)
  – Typical applications: business intelligence, data mining
  – Back-end processing: batch workloads, less concurrency
  – Tasks: complex analytical queries, often ad hoc
  – Data access pattern: table scans, large amounts of data involved
    per query


OLTP

• Most database operations involve On-Line Transaction
  Processing (OLTP).
  – Short, simple, frequent queries and/or modifications, each
    involving a small number of tuples.
  – Examples: answering queries from a Web interface, sales at
    cash registers, selling airline tickets.


OLAP
• Of increasing importance are On-Line Analytical Processing
  (OLAP) queries.
  – Few, but complex queries, which may run for hours.
  – Queries do not depend on having an absolutely up-to-date
    database.


Data Warehouse Process

• Define the architecture, do capacity planning, select
  hardware and software
• Design the warehouse schema and the views
• Design the physical data structures
• Design data extraction, cleaning, transformation,
  load and refresh software
• Populate the repository with data and software
• Design and implement end-user applications

(Refer to Chaudhuri and Dayal)


Data Cubes
• A data warehouse is based on a multidimensional data model
  which views data in the form of a data cube
• A data cube allows data to be modeled and viewed in
  multiple dimensions
  – Dimension tables, such as item (item_name, brand, type),
    or time (day, week, month, quarter, year)
  – Fact table contains measures (such as dollars_sold) and
    keys to each of the related dimension tables


Conceptual Modeling
• Modeling data warehouses: dimensions & measures
  – Star schema: A fact table in the middle connected to a set of
    dimension tables
  – Snowflake schema: A refinement of star schema where some
    dimensional hierarchy is normalized into a set of smaller
    dimension tables, forming a shape similar to a snowflake


Defining a Star Schema in DMQL
define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month,
    quarter, year)
define dimension item as (item_key, item_name, brand, type,
    supplier_type)
define dimension branch as (branch_key, branch_name,
    branch_type)
define dimension location as (location_key, street, city,
    province_or_state, country)


Measures: Three Categories
• distributive: if the result derived by applying the function to n
  aggregate values is the same as that derived by applying the
  function on all the data without partitioning.
  e.g., count(), sum(), min(), max().
• algebraic: if it can be computed by an algebraic function with M
  arguments (where M is a bounded integer), each of which is
  obtained by applying a distributive aggregate function.
  e.g., avg(), which is computed as sum() / count().


Measures: Three Categories
• holistic: if there is no constant bound on the storage size needed
  to describe a subaggregate.
  e.g., median(), mode(), rank().
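
The three measure categories can be checked on a toy example. The sketch below is illustrative only: the two partitions are hypothetical data, and the variable and function names are assumptions, not part of the lecture.

```python
from statistics import median

# Hypothetical data, split into two partitions (e.g. two source databases).
part_a = [12, 11, 50]
part_b = [8, 44, 4]
everything = part_a + part_b

# Distributive: the sum of the partial sums equals the sum over all data.
total = sum([sum(part_a), sum(part_b)])

# Algebraic: avg is derived from two distributive values (sum and count).
def summarize(part):
    return sum(part), len(part)

s_a, n_a = summarize(part_a)
s_b, n_b = summarize(part_b)
average = (s_a + s_b) / (n_a + n_b)

# Holistic: the median of the partition medians is, in general,
# not the median of the whole data set.
median_of_medians = median([median(part_a), median(part_b)])
overall_median = median(everything)
```

Here the partial medians combine to 10.0 while the true median is 11.5, which is why holistic measures cannot be assembled from small fixed-size sub-aggregates.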


Data Warehouse Design

The E-R model approach, which consists of entities
and relationships, is not suitable for designing a
schema for a warehouse.

What is the nature of data in a data warehouse?
Essentially, data warehouses are based on a
multidimensional data model which views data as a
data cube.


An Example

Consider the following database:
Student(sid, name1, dob, country, degree, startsem,
address1, telephone, address2, email, scholarship, ..)
Enrolment(sid, subject-id, mark, tutegroup, tutor,..)
Subject(sub-id, name, school-id, whenstarted,
lecturer,..)
School(name, id, ..)
Not all of this data is needed for decision making.
Let us extract some data from this database.

Data Cube
yob, country, degree, startsem, nsubjects, scholarship
1965, Thailand, MIT, 991, 5, 25%
1970, Canada, BIT, 992, 4, 0
1967, Australia, LLB, 993, 3, 30%
1966, Australia, LLB, 983, 4, 40%
1972, Australia, BCom, 973, 5, 10%
1972, India, BIT/Bcom, 991, 5, 10%
1982, Sweden, MSc(IT), 991, 3, 10%
Is this information useful for decision making? Not really!



Example

We could look at the information as
yob X country X degree X startsem X numsubjects X
scholarship

In fact it is natural to think of enterprise data as
multidimensional.


Example

The university management may be interested in
retrieving information like:
• How many students are doing BIT? How many
  students from Thailand? How many students started
  in 1998? (queries involving only one variable)
• How many students doing BIT are from Thailand?
  How many MIT students started in 981? How many
  students from Thailand started in 993? (queries
  involving two variables)


Example

• How many students doing MIT from Thailand
  started in 981? (query involving three variables)

Special types of database systems, called data cube
systems, are often used for answering such queries.


Data Cube

The example queries discussed earlier may be
represented by a three-dimensional data cube with each
edge representing one of the variables, viz. startsem,
country, and degree.

A point inside the cube is an intersection of the
coordinates defined by the edges of the cube. The
coordinates of the point define the meaning of the data
at that point.


Data Cube

Let us look at a simple two-dimensional situation:
country X degree
For decision making this may be useful information.
If we had a 2-dimensional matrix then we could find
out the number of students for any country (x) and
any degree (y).


Data Cube

But in the two-dimensional situation, we don’t just
want to find out the number of students for any
country (x) and any degree (y). We may have many
other queries, e.g.
1. How many students are doing MIT?
2. How many students are from Thailand?
3. How many Asian students are doing Law degrees?
Thus there is a kind of hierarchy that we wish to use, for
example, the world, the continents, the regions, the
countries etc. In degrees, we may want a hierarchy of
university, Schools, UG and PG, individual degrees.

Data Cube

Consider a slightly more complex situation in which
we have three dimensions:
country X degree X startsem
for any country (x), any degree (y) and any start
semester (z).

We may now look at this information as a 3-dimensional
cube as shown on the following slide.


Data Cube
(based on a slide from the book by J. Han and M. Kamber)

• Number of students as a function of country, degree and semester
• Dimensions: country, degree, semester
• Hierarchical summarization paths:
  – country: continent > region > country
  – degree: school > ug/pg > degree
  – semester: year > semester

A Sample Data Cube
(based on a slide from the book by J. Han and M. Kamber)

[Figure: a 3-D cube of enrolment counts. Axes: semester (991, 992,
993, 001, sum), degree (LLB, BCom, MIT, sum) and country (U.S.A.,
Norway, Australia, sum). The highlighted cell is the total LLB
enrolments from U.S.A.]


Data Cube

Each edge of the cube is called a dimension. A user
normally has a number of different dimensions from
which the given data may be analyzed. A user therefore
has a multidimensional conceptual view of the data
which is represented by the cube.

The points inside a cube provide aggregations. For
example, a point may provide the number of students
from Malaysia admitted to BCom in year 1998.


Multidimensional View

A particular user will have one multidimensional view
of the database while another user in the same
enterprise may have another view. Therefore many
different multidimensional views of the same database
are possible and the same data may be consolidated in
many different ways.


Multidimensional data model

The cube is not always three-dimensional since often an
enterprise would have many more, perhaps eight or
even ten, dimensions of interest. Each dimension may
be associated with a table that describes the dimension.
For example, a dimension table for country would
contain the country names and could contain other
information, e.g. category. Other dimensions like time do
not naturally have such a table of information.


Data Cube

A number of operations may be applied to data cubes.
The common ones are:
- roll-up (increasing the level of abstraction)
- drill-down (increasing detail)
- slice and dice (selection and projection)
- pivot (re-orienting the view)


Data Cube Operations

• Roll-up (less detail) - when we wish further
  abstraction (i.e. less detail). This operation performs
  further aggregation on the data, for example, from
  single degree programs to Schools, single countries to
  Continents or from three dimensions to two
  dimensions.

• Drill-down (increasing detail) - reverse of roll-up,
  when we wish to partition more finely or want to focus
  on some particular values of certain dimensions. Drill-down
  adds more detail to the data; it may involve
  adding another dimension.

Data Cube Operations

• Slice and dice (selection and projection) - the slice
  operation performs a selection on one dimension of the
  cube (e.g. degree = “MIT”). The dice operation performs
  a selection on two or more dimensions (e.g. degree =
  “BIT” and country = “Australia” or “India”)

• Pivot (re-orienting the view) - an alternate
  presentation of the data, e.g. rotating the axes in a 3-D
  cube.
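
The roll-up, slice and dice operations can be sketched over a cube stored as a dictionary of cells. This is a minimal illustration, not how an OLAP server is implemented; the cell values follow the small sale example used in these slides, and the function names (`roll_up`, `dice`, `slice_`) are assumptions made for the sketch.

```python
from collections import defaultdict

DIMS = ("product", "store", "day")

# Cells of a small cube: (product, store, day) -> amt.
cube = {
    ("p1", "c1", 1): 12, ("p2", "c1", 1): 11,
    ("p1", "c3", 1): 50, ("p2", "c2", 1): 8,
    ("p1", "c1", 2): 44, ("p1", "c2", 2): 4,
}

def roll_up(cells, keep):
    """Sum away every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for coords, amt in cells.items():
        out[tuple(coords[i] for i in idx)] += amt
    return dict(out)

def dice(cells, **allowed):
    """Keep cells whose value on each named dimension is in the given set."""
    return {
        coords: amt for coords, amt in cells.items()
        if all(coords[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

def slice_(cells, dim, value):
    """A slice is a dice with a single value on one dimension."""
    return dice(cells, **{dim: {value}})

by_product = roll_up(cube, ["product"])      # three dimensions down to one
day1 = slice_(cube, "day", 1)                # selection on one dimension
p1_c1_c2 = dice(cube, product={"p1"}, store={"c1", "c2"})
```

Drill-down is the reverse of `roll_up` and needs the more detailed cells to be available; pivot only changes how the same cells are presented, so it is omitted here.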


Cube

Fact table view:

sale   prodId   storeId   amt
       p1       c1        12
       p2       c1        11
       p1       c3        50
       p2       c2         8

Multi-dimensional cube (- marks an empty cell):

       c1   c2   c3
p1     12    -   50
p2     11    8    -

dimensions = 2


3-D Cube

Fact table view:

sale   prodId   storeId   date   amt
       p1       c1        1      12
       p2       c1        1      11
       p1       c3        1      50
       p2       c2        1       8
       p1       c1        2      44
       p1       c2        2       4

Multi-dimensional cube (- marks an empty cell):

day 1  c1   c2   c3
p1     12    -   50
p2     11    8    -

day 2  c1   c2   c3
p1     44    4    -
p2      -    -    -

dimensions = 3


Multidimensional Data

• Sales volume as a function of product, month, and region
• Dimensions: Product, Location, Time
• Hierarchical summarization paths:
  – Product: Industry > Category > Product
  – Location: Region > Country > City > Office
  – Time: Year > Quarter > Month or Week > Day

A Sample Data Cube

[Figure: a 3-D cube of sales. Axes: date (1Qtr, 2Qtr, 3Qtr, 4Qtr,
sum), product (TV, PC, VCR, sum) and country (U.S.A., Canada,
Mexico, sum). The highlighted cell is the total annual sales of TV
in U.S.A.]


Cuboids Corresponding to the Cube

0-D (apex) cuboid:  all
1-D cuboids:        product; date; country
2-D cuboids:        product, date; product, country; date, country
3-D (base) cuboid:  product, date, country
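
The lattice of cuboids is simply the set of all subsets of the dimensions, so n dimensions give 2^n cuboids. A short sketch that enumerates them (the function name `cuboids` is an assumption for illustration):

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every group-by subset, ordered from apex (0-D) to base (n-D)."""
    return [
        combo
        for k in range(len(dimensions) + 1)
        for combo in combinations(dimensions, k)
    ]

lattice = cuboids(("product", "date", "country"))
for combo in lattice:
    print(len(combo), "-D:", combo if combo else "all")
```

With three dimensions this yields eight cuboids: one apex, three 1-D, three 2-D and one base cuboid.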


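Cube Aggregation

Example: computing sums (- marks an empty cell)

Base cube:

day 1  c1   c2   c3
p1     12    -   50
p2     11    8    -

day 2  c1   c2   c3
p1     44    4    -
p2      -    -    -
...

Roll-up over day (drill-down is the reverse):

       c1   c2   c3
p1     56    4   50
p2     11    8    -

Roll-up over day and product:

       c1   c2   c3
sum    67   12   50

Roll-up over day and store:

       sum
p1     110
p2      19

Roll-up over all dimensions: 129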


Cube Operators

Aggregates over all days (- marks an empty cell):

       c1   c2   c3   sum
p1     56    4   50   110
p2     11    8    -    19
sum    67   12   50   129

With * meaning "all values":
• sale(c1,*,*) = 67
• sale(c2,p2,*) = 8
• sale(*,*,*) = 129


Extended Cube

Each dimension is extended with the value * (all); - marks an empty cell.

day 1  c1   c2   c3    *
p1     12    -   50   62
p2     11    8    -   19
*      23    8   50   81

day 2  c1   c2   c3    *
p1     44    4    -   48
p2      -    -    -    -
*      44    4    -   48

*      c1   c2   c3    *
p1     56    4   50  110
p2     11    8    -   19
*      67   12   50  129

Example: sale(*,p2,*) = 19


Aggregation Using Hierarchies

Hierarchy: customer > region > country

Base cube, by customer (- marks an empty cell):

day 1  c1   c2   c3
p1     12    -   50
p2     11    8    -

day 2  c1   c2   c3
p1     44    4    -
p2      -    -    -

Rolled up to region
(customer c1 in Region A; customers c2, c3 in Region B):

       region A   region B
p1        56         54
p2        11          8
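
Rolling up along a hierarchy amounts to mapping each member to its parent and re-aggregating. The sketch below assumes the customer-to-region mapping stated on the slide; the names `REGION_OF` and `roll_up_to_region` are hypothetical.

```python
from collections import defaultdict

# Customer -> region, as given on the slide.
REGION_OF = {"c1": "A", "c2": "B", "c3": "B"}

# Fact rows: (product, customer, day, amt).
sales = [
    ("p1", "c1", 1, 12), ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50), ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
]

def roll_up_to_region(rows):
    """Aggregate amt by (product, region), summing over customers and days."""
    totals = defaultdict(int)
    for prod, cust, _day, amt in rows:
        totals[(prod, REGION_OF[cust])] += amt
    return dict(totals)

by_region = roll_up_to_region(sales)
```

The result reproduces the region table: p1 sells 56 in Region A and 54 in Region B, p2 sells 11 and 8.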


Pivoting

Fact table view:

sale   prodId   storeId   date   amt
       p1       c1        1      12
       p2       c1        1      11
       p1       c3        1      50
       p2       c2        1       8
       p1       c1        2      44
       p1       c2        2       4

Multi-dimensional cube (- marks an empty cell):

day 1  c1   c2   c3
p1     12    -   50
p2     11    8    -

day 2  c1   c2   c3
p1     44    4    -
p2      -    -    -

Pivoted view (product by store, summed over date):

       c1   c2   c3
p1     56    4   50
p2     11    8    -


CUBE Operator (SQL-99)

Chevy Sales Cross Tab

Chevy         1990   1991   1992   Total (ALL)
black           50     85    154     289
white           40    115    199     354
Total (ALL)     90    200    353     643

SELECT model, year, color, sum(sales) AS sales
FROM sales
WHERE model IN ('Chevy')
AND year BETWEEN 1990 AND 1992
GROUP BY CUBE (model, year, color);

CUBE Contd.
SELECT model, year, color, sum(sales) AS sales
FROM sales
WHERE model IN ('Chevy')
AND year BETWEEN 1990 AND 1992
GROUP BY CUBE (model, year, color);

• Computes the union of 8 different groupings:
  – {(model, year, color), (model, year), (model, color), (year,
    color), (model), (year), (color), ()}
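
What GROUP BY CUBE computes can be imitated in a few lines. The sketch below is not how a DBMS evaluates the operator; it simply groups the rows once per subset of the dimensions and writes ALL in the rolled-up columns. The six input rows are the cells of the Chevy cross tab, and the name `group_by_cube` is an assumption.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"

def group_by_cube(rows, dims, measure):
    """Sum `measure` for every subset of `dims`; rolled-up columns show ALL."""
    result = defaultdict(int)
    for k in range(len(dims) + 1):
        for kept in combinations(dims, k):
            for row in rows:
                key = tuple(row[d] if d in kept else ALL for d in dims)
                result[key] += row[measure]
    return dict(result)

# The six cells of the Chevy cross tab.
cells = {
    "black": {1990: 50, 1991: 85, 1992: 154},
    "white": {1990: 40, 1991: 115, 1992: 199},
}
rows = [
    {"model": "Chevy", "year": year, "color": color, "sales": sales}
    for color, per_year in cells.items()
    for year, sales in per_year.items()
]

cube = group_by_cube(rows, ("model", "year", "color"), "sales")
```

For example, `cube[("Chevy", 1990, ALL)]` is the 1990 column total and `cube[(ALL, ALL, ALL)]` is the grand total of the six cells.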


Example Contd.

SALES (input)

Model   Year   Color   Sales
Chevy   1990   red         5
Chevy   1990   white      87
Chevy   1990   blue       62
Chevy   1991   red        54
Chevy   1991   white      95
Chevy   1991   blue       49
Chevy   1992   red        31
Chevy   1992   white      54
Chevy   1992   blue       71
Ford    1990   red        64
Ford    1990   white      62
Ford    1990   blue       63
Ford    1991   red        52
Ford    1991   white       9
Ford    1991   blue       55
Ford    1992   red        27
Ford    1992   white      62
Ford    1992   blue       39

DATA CUBE (rows added by the CUBE operator)

Model   Year   Color   Sales
ALL     ALL    ALL       942
chevy   ALL    ALL       510
ford    ALL    ALL       432
ALL     1990   ALL       343
ALL     1991   ALL       314
ALL     1992   ALL       285
ALL     ALL    red       165
ALL     ALL    white     273
ALL     ALL    blue      339
chevy   1990   ALL       154
chevy   1991   ALL       199
chevy   1992   ALL       157
ford    1990   ALL       189
ford    1991   ALL       116
ford    1992   ALL       128
chevy   ALL    red        91
chevy   ALL    white     236
chevy   ALL    blue      183
ford    ALL    red       144
ford    ALL    white     133
ford    ALL    blue      156
ALL     1990   red        69
ALL     1990   white     149
ALL     1990   blue      125
ALL     1991   red       107
ALL     1991   white     104
ALL     1991   blue      104
ALL     1992   red        59
ALL     1992   white     116
ALL     1992   blue      110

Benefits of Multidimensional Analysis

• Small high-level database with pre-computed
  aggregates is created for efficient high-level queries
• Multiple-level views
• Selection by slicing and dicing

However, multidimensional analysis does not provide data
mining.


OLAP Server Architectures
• Relational OLAP (ROLAP)
  – Use relational or extended-relational DBMS to store and
    manage warehouse data and OLAP middleware to support
    missing pieces
  – Include optimization of DBMS backend, implementation of
    aggregation navigation logic, and additional tools and services
  – Greater scalability


The Data Warehouse
• The most common form of data integration.
  – Copy sources into a single DB (warehouse) and try to keep it
    up-to-date.
  – Usual method: periodic reconstruction of the warehouse,
    perhaps overnight.
  – Frequently essential for analytic queries.


OLAP Server Architectures
• Multidimensional OLAP (MOLAP)
  – Array-based multidimensional storage engine (sparse matrix
    techniques)
  – Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP)
  – User flexibility, e.g., low level: relational, high-level: array
• Specialized SQL servers


Data Warehouse Design

One approach is the star schema to represent the
multidimensional data model. The schema in this model
consists of a large single fact table containing the bulk of
the data, with no redundancy, and a set of smaller tables
called dimension tables, one for each dimension.

Other models have been used. These include the snowflake
model and the fact constellation model.


Warehouse Architecture

[Figure: layered architecture, top to bottom: Clients; Query &
Analysis; Warehouse (with Metadata); Integration; Sources]


Star Schemas

• A star schema is a common organization for data at a
  warehouse. It consists of:
  1. Fact table: a very large accumulation of facts such as sales.
     – Often “insert-only.”
  2. Dimension tables: smaller, generally static information about
     the entities involved in the facts.


Example of Star Schema
(based on a slide from the book by J. Han and M. Kamber)

Sales Fact Table
  keys:     time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time:     time_key, day, day_of_the_week, month, quarter, year
  item:     item_key, item_name, brand, type, supplier_type
  branch:   branch_key, branch_name, branch_type
  location: location_key, street, city, province_or_state, country

Example: Star Schema
• Suppose we want to record in a warehouse information
  about every beer sale: the bar, the brand of beer, the
  drinker who bought the beer, the day, the time, and the
  price charged.
• The fact table is a relation:
  Sales(bar, beer, drinker, day, time, price)


Example, Continued
• The dimension tables include information about the bar,
  beer, and drinker “dimensions”:
  Bars(bar, addr, license)
  Beers(beer, manf)
  Drinkers(drinker, addr, phone)
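
The beer-sales star schema can be tried out with Python's built-in sqlite3 module. This is only a sketch: the column types, the sample rows and the manufacturer names are invented for illustration, and only the table and column names come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables
CREATE TABLE Bars(bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, manf TEXT);
CREATE TABLE Drinkers(drinker TEXT PRIMARY KEY, addr TEXT, phone TEXT);

-- Fact table: one row per sale, keyed by the dimension attributes
CREATE TABLE Sales(
    bar     TEXT REFERENCES Bars(bar),
    beer    TEXT REFERENCES Beers(beer),
    drinker TEXT REFERENCES Drinkers(drinker),
    day     TEXT,
    time    TEXT,
    price   REAL
);

-- Hypothetical sample data
INSERT INTO Bars VALUES ('Bar1', '1 Main St', 'full'), ('Bar2', '2 Side St', 'beer');
INSERT INTO Beers VALUES ('Lager', 'BrewCo A'), ('Stout', 'BrewCo B');
INSERT INTO Drinkers VALUES ('Ann', '10 Oak Rd', '555-0100'), ('Bob', '20 Elm Rd', '555-0101');
INSERT INTO Sales VALUES
    ('Bar1', 'Lager', 'Ann', '2020-01-01', '18:00', 3.0),
    ('Bar1', 'Stout', 'Bob', '2020-01-01', '19:00', 4.5),
    ('Bar2', 'Lager', 'Ann', '2020-01-02', '20:00', 3.5);
""")

# A typical star join: revenue per manufacturer.
revenue_by_manf = conn.execute("""
    SELECT b.manf, SUM(s.price)
    FROM Sales s JOIN Beers b ON s.beer = b.beer
    GROUP BY b.manf
    ORDER BY b.manf
""").fetchall()
```

The query joins the fact table to one dimension table and aggregates the dependent attribute `price`, which is the usual shape of an OLAP query over a star schema.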


Visualization – Star Schema

[Figure: the Sales fact table in the center, holding dimension
attributes and dependent attributes, linked to the dimension tables
Bars, Drinkers, Beers, etc.]

Dimensions and Dependent Attributes

• Two classes of fact-table attributes:
  1. Dimension attributes: the key of a dimension table.
  2. Dependent attributes: a value determined by the
     dimension attributes of the tuple.


Warehouse Models & Operators

• Data Models
  – relations
  – stars & snowflakes
  – cubes
• Operators
  – slice & dice
  – roll-up, drill down
  – pivoting
  – other


Star

product    prodId   name   price
           p1       bolt   10
           p2       nut    5

store      storeId  city
           c1       nyc
           c2       sfo
           c3       la

sale       orderId  date     custId  prodId  storeId  qty  amt
           o100     1/7/97   53      p1      c1       1    12
           o102     2/7/97   53      p2      c1       2    11
           o105     3/8/97   111     p1      c3       5    50

customer   custId   name    address     city
           53       joe     10 main     sfo
           81       fred    12 main     sfo
           111      sally   80 willow   la


Star Schema

sale (fact table):  orderId, date, custId, prodId, storeId, qty, amt
customer:           custId, name, address, city
product:            prodId, name, price
store:              storeId, city

The sale table references customer via custId, product via prodId
and store via storeId.


Star Schema



Snowflake Schema



Terms

• Fact table: sale (orderId, date, custId, prodId, storeId, qty, amt)
• Dimension tables: customer (custId, name, address, city),
  product (prodId, name, price), store (storeId, city)
• Measures: qty, amt


Dimension Hierarchies

Hierarchies: store > sType; store > city > region

store    storeId  cityId  tId  mgr
         s5       sfo     t1   joe
         s7       sfo     t2   fred
         s9       la      t1   nancy

sType    tId  size    location
         t1   small   downtown
         t2   large   suburbs

city     cityId  pop  regId
         sfo     1M   north
         la      5M   south

region   regId   name
         north   cold region
         south   warm region

• snowflake schema
• constellations


Aggregates

• Add up amounts for day 1
• In SQL: SELECT sum(amt) FROM SALE
          WHERE date = 1

sale   prodId   storeId   date   amt
       p1       c1        1      12
       p2       c1        1      11
       p1       c3        1      50
       p2       c2        1       8
       p1       c1        2      44
       p1       c2        2       4

Result: 81


Aggregates

• Add up amounts by day
• In SQL: SELECT date, sum(amt) FROM SALE
          GROUP BY date

sale   prodId   storeId   date   amt
       p1       c1        1      12
       p2       c1        1      11
       p1       c3        1      50
       p2       c2        1       8
       p1       c1        2      44
       p1       c2        2       4

ans    date   sum
       1      81
       2      48


Another Example

• Add up amounts by day, product
• In SQL: SELECT prodId, date, sum(amt) FROM SALE
          GROUP BY date, prodId

sale   prodId   storeId   date   amt
       p1       c1        1      12
       p2       c1        1      11
       p1       c3        1      50
       p2       c2        1       8
       p1       c1        2      44
       p1       c2        2       4

Result:

sale   prodId   date   amt
       p1       1      62
       p2       1      19
       p1       2      48

Going from the fact table to this result is a roll-up; the reverse
direction is a drill-down.
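
The three aggregate queries can be run against the sample sale table with Python's built-in sqlite3 module. A minimal sketch; the column types are assumptions, and the third query selects prodId as well so that each result row can be identified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale(prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [
        ("p1", "c1", 1, 12), ("p2", "c1", 1, 11),
        ("p1", "c3", 1, 50), ("p2", "c2", 1, 8),
        ("p1", "c1", 2, 44), ("p1", "c2", 2, 4),
    ],
)

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT SUM(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product.
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()
```

Each additional GROUP BY column drills down one level; dropping a column rolls the result back up.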


Aggregates
• Operators: sum, count, max, min,
  median, avg
• “Having” clause
• Cube (& Rollup) operator
• Using dimension hierarchy
  – average by region (within store)
  – maximum by month (within date)


Query & Analysis Tools
• Query Building
• Report Writers (comparisons, growth, graphs, …)
• Spreadsheet Systems
• Web Interfaces
• Data Mining


Other Operations
• Time functions
  – e.g., time average
• Computed Attributes
  – e.g., commission = sales * rate
• Text Queries
  – e.g., find documents with words X AND B
  – e.g., rank documents by frequency of
    words X, Y, Z


Implementing a Warehouse

• Monitoring: Sending data from sources
• Integrating: Loading, cleansing, ...
• Processing: Query processing, indexing, ...
• Managing: Metadata, Design, ...


Multi-Tiered Architecture

[Figure: four tiers, left to right.
 Data Sources: operational DBs and other sources, fed through a
 monitor & integrator (extract, transform, load, refresh).
 Data Storage: data warehouse, data marts and metadata.
 OLAP Engine: OLAP server that serves the front end.
 Front-End Tools: analysis, query, reports, data mining.]

Monitoring

• Source types: relational, flat file, IMS, VSAM, IDMS,
  WWW, news-wire, …
• Incremental vs. refresh

customer   id    name    address     city
           53    joe     10 main     sfo
           81    fred    12 main     sfo
           111   sally   80 willow   la     (new)


Data Cleaning

• Migration (e.g., yen → dollars)
• Scrubbing: use domain-specific knowledge (e.g., social
  security numbers)
• Fusion (e.g., mail list, customer merging):
  billing DB customer1(Joe) + service DB customer2(Joe)
  → merged_customer(Joe)
• Auditing: discover rules & relationships
  (like data mining)
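
Migration, scrubbing and fusion can each be illustrated with a small function. Everything in this sketch is an assumption made for illustration: the exchange rate is a made-up constant, the scrubbing rule only normalizes nine-digit numbers, and fusion matches customers on a lower-cased name, which a real system would not rely on.

```python
import re

YEN_PER_DOLLAR = 110.0  # assumed rate, for illustration only

def migrate_yen_to_dollars(yen):
    """Migration: convert a value to the warehouse's unit."""
    return round(yen / YEN_PER_DOLLAR, 2)

def scrub_ssn(raw):
    """Scrubbing: use domain knowledge (nine digits) to normalize or reject."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 9:
        return None
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"

def fuse(billing, service):
    """Fusion: merge records that describe the same customer."""
    merged = {}
    for rec in billing + service:
        key = rec["name"].strip().lower()
        entry = merged.setdefault(key, {"name": rec["name"].strip()})
        # Keep every non-empty attribute except the matching key itself.
        entry.update({k: v for k, v in rec.items() if v and k != "name"})
    return merged

billing_db = [{"name": "Joe", "phone": "555-0100", "email": None}]
service_db = [{"name": " joe", "phone": None, "email": "joe@example.com"}]
merged_customers = fuse(billing_db, service_db)
```

The two Joe records collapse into one merged customer that carries both the phone number and the e-mail address.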


Loading Data
• Incremental vs. refresh
• Off-line vs. on-line
• Frequency of loading
  – At night, 1x a week/month, continuously
• Parallel/Partitioned load


ETL

• Extraction - data relevant to the tasks are selected and retrieved
  from a variety of sources.
• Transformation - data is consolidated by performing summary or
  aggregation operations.
• Cleansing - since data comes from a number of sources, errors
  and anomalies are common. There is a need to remove
  anomalies, remove errors, and handle missing and irrelevant data.
  Some tools are available for doing this.


Data Cleaning

Data cleaning overcomes problems like the following:
• Inconsistent field lengths of same items
• Inconsistent values for same items
• Inconsistent interpretation of same terms
• Missing entries
• Violation of integrity constraints
Data cleaning can be a very demanding task.


Data Warehouse Process

• Integration - combining data from many, perhaps
  heterogeneous, sources. This is a non-trivial task since
  different sources will use different formats, field
  lengths, codes, and descriptions for the same data items.
• Loading - before loading, additional processing may be
  needed, e.g. checking integrity constraints, building
  derived tables, indices, access paths


Data Warehouse Process

• Refresh - warehouse data needs to be periodically
  updated as the operational data changes. There are
  several different ways of updating a data warehouse:
  • The data warehouse could be periodically
    reconstructed from the base sources (perhaps
    overnight, once a week)
  • The data warehouse could be updated incrementally at
    regular intervals, for example each week or even each month.

The updates need to be logically correct since the warehouse data is
derived data.
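
"Logically correct" can be read as: an incremental update must leave the warehouse in the same state that a full reconstruction would produce. The sketch below checks this for a per-day sum; the data and the function names are hypothetical.

```python
def full_rebuild(source_rows):
    """Recompute the per-day totals from all source rows."""
    totals = {}
    for day, amt in source_rows:
        totals[day] = totals.get(day, 0) + amt
    return totals

def incremental_update(summary, new_rows):
    """Apply only the rows that arrived since the last refresh."""
    updated = dict(summary)
    for day, amt in new_rows:
        updated[day] = updated.get(day, 0) + amt
    return updated

old_rows = [(1, 12), (1, 11), (1, 50), (1, 8)]
new_rows = [(2, 44), (2, 4)]

rebuilt = full_rebuild(old_rows + new_rows)
refreshed = incremental_update(full_rebuild(old_rows), new_rows)
```

The two results agree here because sum is a distributive measure; a holistic measure such as the median could not be refreshed this way without access to the underlying rows.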


Why Separate Data Warehouse?

• High performance for both systems
  – DBMS: tuned for OLTP
  – Warehouse: tuned for OLAP
• Different functions and different data:
  – missing data: Decision support requires historical data which
    operational DBs do not typically maintain
  – data consolidation: Decision support requires consolidation
    (aggregation, summarization) of data from many sources
  – data quality: different sources typically use inconsistent data
    representations, codes and formats which have to be
    reconciled


OLAP

Codd defines On-line Analytical Processing or OLAP as
the dynamic enterprise analysis required to create,
manipulate, animate, and synthesize information from
exegetical, contemplative, and formulaic data analysis
models.
OLAP generally involves highly complex queries
involving large amounts of data that use one or more
aggregates. OLAP deals only with historical data
accurate at a given point in time.


OLAP Characteristics

We discuss the following OLAP characteristics listed by
Codd:
• Dynamic data analysis - involving historical data of
  multiple dimensions manipulated in many different
  ways with the aim of studying changes occurring in the
  enterprise
• Common enterprise data - OLAP uses the enterprise
  data but in a very different way to discover why some
  particular situations occurred
• Synergistic implementation - data synthesis, analysis,
  and consolidation


OLAP Characteristics

• Four enterprise data models
  • Categorical - comparison of historical values
  • Exegetical - discovering reasons for what the categorical
    model found
  • Contemplative - “what if” analysis of the data
  • Formulaic - how to reach a desired goal


OLAP Examples
1. Amazon analyzes purchases by its customers to come
up with an individual screen with products of likely
interest to the customer.
2. Analysts at Wal-Mart look for items with increasing
sales in some region.



One Database or Two?
• Downsides of co-existing OLTP and OLAP workloads
– Poor memory management
– Conflicting data access patterns
– Variable latency
• Solution: separate databases
– User-facing OLTP database for high-volume transactions
– Data warehouse for OLAP workloads
– How do we connect the two?



OLTP/OLAP Architecture

[Figure: OLTP → ETL (Extract, Transform, and Load) → OLAP]


OLTP/OLAP Integration
• OLTP database for user-facing transactions
– Retain records of all activity
– Periodic ETL (e.g., nightly)
• Extract-Transform-Load (ETL)
– Extract records from source
– Transform: clean data, check integrity, aggregate, etc.
– Load into OLAP database
• OLAP database for data warehousing
– Business intelligence: reporting, ad hoc queries, data mining, etc.
– Feedback to improve OLTP services

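
The extract, transform and load steps can be sketched end to end with Python's built-in sqlite3 module standing in for the OLAP database. The source records, the table name `sales_summary` and the cleaning rule (drop rows with a missing amount) are all hypothetical choices for this illustration.

```python
import sqlite3

# Hypothetical records as they might come out of an OLTP system.
source_rows = [
    {"order_id": "o100", "prod": "p1", "store": "c1", "amt": "12"},
    {"order_id": "o102", "prod": "p2", "store": "c1", "amt": "11"},
    {"order_id": "o105", "prod": "p1", "store": "c3", "amt": "50"},
    {"order_id": "o106", "prod": "p1", "store": "c3", "amt": None},  # missing
]

def extract(rows):
    """Extract: read the records from the source."""
    return list(rows)

def transform(rows):
    """Transform: drop incomplete rows, fix types, aggregate per product/store."""
    totals = {}
    for r in rows:
        if r["amt"] is None:
            continue
        key = (r["prod"], r["store"])
        totals[key] = totals.get(key, 0) + int(r["amt"])
    return sorted((p, s, t) for (p, s), t in totals.items())

def load(conn, summary):
    """Load: write the summarized rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_summary"
        "(prodId TEXT, storeId TEXT, amt INTEGER)"
    )
    conn.executemany("INSERT INTO sales_summary VALUES (?, ?, ?)", summary)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source_rows)))
loaded = warehouse.execute(
    "SELECT * FROM sales_summary ORDER BY prodId, storeId"
).fetchall()
```

In a periodic (e.g. nightly) ETL job the same three functions would run against the day's new records rather than a fixed list.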


Codd’s OLAP Evaluation Rules

Codd in his 1993 paper lists the following 12 rules for
evaluating OLAP products:
• Multidimensional conceptual view - to make a variety
  of manipulations (e.g. slice and dice) relatively easy
• Transparency - user should know what data is being
  used and where from
• Accessibility - able to use data in enterprise database
  as well as legacy systems


Codd’s OLAP Rules

• Consistent reporting performance - consistent
  reporting performance as the number of dimensions
  grows
• Client-server architecture - OLAP often uses
  mainframe data but users want access from desktop
• Generic dimensionality - different dimensions should
  not be treated differently
• Dynamic sparse matrix handling - always a lot of
  missing data, OLAP should be able to adjust to that
• Multi-user support - obviously required


Codd’s OLAP Rules

• Unrestricted cross-dimensional operations - should be
  able to infer associated calculations
• Intuitive data manipulation - operations should not
  require the use of a menu or a number of iterations
• Flexible reporting - logical presentation of data
• Unlimited dimensions and aggregation levels - some
  applications need as many as 15-20 dimensions


ROLAP vs. MOLAP
• ROLAP:
  Relational On-Line Analytical Processing
• MOLAP:
  Multi-Dimensional On-Line Analytical Processing


Summary

• Data warehouse
  – A subject-oriented, integrated, time-variant, and nonvolatile
    collection of data in support of management’s decision-making
    process
• A multi-dimensional model of a data warehouse
  – Star schema, snowflake schema, fact constellations
  – A data cube consists of dimensions & measures
• OLAP operations: drilling, rolling, slicing, dicing and pivoting


Summary

• OLAP servers: ROLAP, MOLAP, HOLAP
• Efficient computation of data cubes
  – Partial vs. full vs. no materialization
