Which customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
What impact will new products/services have on revenue and margins?
Data, Data everywhere yet ...
I can’t find the data I need
data is scattered over the network
many versions, subtle differences
I can’t get the data I need
need an expert to get the data
I can’t understand the data I found
available data poorly documented
I can’t use the data I found
results are unexpected
data needs to be transformed from one
form to another
Information Crisis
Over 50% of business users found it “difficult”
or “very difficult” to get the information they
need.
Will the queries of my solution provide the answer in time? With accuracy?
What should I do? Should I go for Business Intelligence? How?
What is Data Warehousing?
Data
Evolution
60’s: Batch reports
hard to find and analyze information
inflexible and expensive, reprogram every new request
BI—Business Intelligence
Provides analytics
May lock a whole table for a table scan
Deriving results from large amounts of data is an important goal
Increasingly the source of great business benefit
Different Systems

                 OLTP            ODS                    OLAP                   DM/DW
Business Focus   Operational     Operational/Tactical   Tactical               Strategic/Tactical
End User Tools   Client/Server   Client/Server, Web     Client/Server, Web     Client/Server, Web
DB Technology    Relational      Relational             Cubic                  Relational
Time Variant: The data stored may not be current; it varies with time, and every record has an element of time.
Example: sales data for the last 5 years, etc.
Data: the data that runs the business (operational systems) vs. current and historical information (the warehouse)
Granularity Too Low
Results in an exponential increase in the size requirements
of the warehouse.
For example, if each time record represents an hour, there will
be one sales fact record for each hour of the day:
8,760 sales fact records for a 365-day year, for each
combination of Product, Client, and Organization
If daily sales facts are all that are required, the number of
records in the database can be reduced dramatically.
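A minimal sketch of the arithmetic above; the count of 1,000 Product/Client/Organization combinations is an illustrative assumption, not a figure from the slides:

# Rough fact-table row counts at different time grains (illustrative numbers).
HOURS_PER_YEAR = 24 * 365          # 8,760 hourly buckets in a non-leap year
DAYS_PER_YEAR = 365
combinations = 1_000               # assumed Product x Client x Organization combinations

hourly_rows = HOURS_PER_YEAR * combinations
daily_rows = DAYS_PER_YEAR * combinations

print(f"hourly grain: {hourly_rows:,} rows/year")          # 8,760,000
print(f"daily grain:  {daily_rows:,} rows/year")           # 365,000
print(f"reduction factor: {hourly_rows // daily_rows}x")   # 24x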
Operational Data Store (ODS)
The Operational Data Store is used for tactical decision making
while the DW supports strategic decisions. It contains
transaction data, at the lowest level of detail for the subject area
subject-oriented, just like a DW
integrated, just like a DW
volatile (or updateable), unlike a DW
an ODS is like a transaction processing system
information gets overwritten with updated data
no history is maintained (other than an audit trail or operational history)
current, i.e., not time-variant, unlike a DW
current data, up to a few years
Operational Data Store (ODS)
An ODS is a collection of integrated databases designed to
support the monitoring of operations. Unlike the databases of
OLTP applications (that are function oriented), the ODS
contains subject oriented, volatile, and current enterprise-wide
detailed information. It serves as a system of record that
provides comprehensive views of data in operational sources.
Integrated
Data is cleansed, standardized and placed into a consistent data model
Volatile
UPDATEs occur regularly, whereas data warehouses are typically refreshed via periodic batch loads
Examples
Goldman Sachs
Disparate, global real estate investment details collected daily, then
[Diagram: Independent data marts. Separate ETL processes load Finance, SMS, HR, and External data marts, each serving its own reports.]
[Diagram: Dependent data marts. ETL loads a central Data Warehouse, which feeds the Finance, SMS, HR, and External data marts behind a shared reporting infrastructure.]
An approach also used in the early days, but refined over time.
Originally it called for extensive up-front effort in building the DW; it now
recommends building the DW incrementally.
Data Warehouse Architectures (A)
Data Mart Bus (conformed)
Usually employing a Bottom-Up approach (Kimball)
[Diagram: Data mart bus. ETL loads a staging area that feeds conformed Finance, SMS, HR, and External data marts behind a shared reporting infrastructure.]
An approach also used in the early days, but refined over time.
Originally it suggested building silos; it now recommends an enterprise
perspective.
Data Warehouse Architectures (A)
Central Data Warehouse
Usually employing a Hybrid approach
[Diagram: Central data warehouse. ETL loads a staging area and a single Data Warehouse, which directly serves the Finance, SMS, HR, and External reports through a shared reporting infrastructure.]
[Diagram: Multiple existing warehouses (DW 1, DW 2, DW 3) are integrated by ETL into an Enterprise Data Warehouse, which serves the Finance, SMS, HR, and External reports through a shared reporting infrastructure.]
Simpler data access
Single ETL for the enterprise data warehouse (EDW)
Dependent data marts loaded from the EDW
Data Warehouse Architectures (B)
Logical data mart and Active data warehouse
ODS and data warehouse are one and the same
Near-real-time ETL for the active data warehouse
Data marts are NOT separate databases, but logical views of the data warehouse
Easier to create new data marts
Data Warehouse Architectures (B)
Three-layer architecture: reconciled and derived data
Reconciled data: detailed, current data intended to be the single,
authoritative source for all decision support.
Derived data: data that have been selected, formatted, and aggregated
for end-user decision support applications.
Metadata: technical and business data that describe the properties or
characteristics of other data.
Data Sources
Unstructured data – Support of any text file type in 32 languages
Incremental Extraction
Only the data that has changed since a well-defined event back in
history will be extracted
The further downstream you go from the originating data source, the
more you increase the risk of extracting corrupt data. Barring rare
exceptions, maintain the practice of sourcing data only from the
system-of-record.
Analyze your source system
obtain an ER model for the system, or reverse-engineer one (develop
one by looking at the metadata of the system)
reverse engineering is not the same as “forward engineering”, i.e.,
given the ER models of the source systems, deriving the dimensional
schema of the data warehouse
Data Analysis
Reverse engineering of the understanding of a source system
unique identifiers and natural keys
data types
relationships between tables (1-to-1, many-to-1, many to
many), problematic when source database does not have
foreign keys defined
discrete relationships (static data, reference tables)
Data content analysis
NULL values, especially in foreign keys; NULLs result in lossy
joins (see the sketch after this list)
In spite of the most detailed analysis, we recommend
using outer join logic when extracting from relational
source systems, simply because referential integrity often
cannot be trusted on remote systems
Dates in non-date fields
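To illustrate the NULL/outer-join point above, here is a minimal pandas sketch; the table and column names are hypothetical. An inner join silently drops source rows whose foreign key is NULL or unmatched, while a left outer join keeps them so they can be flagged:

import pandas as pd

# Hypothetical source extracts: orders reference customers via customer_id.
orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "customer_id": [10, None, 99]})   # NULL and orphaned foreign keys
customers = pd.DataFrame({"customer_id": [10, 20],
                          "name": ["Acme", "Globex"]})

inner = orders.merge(customers, on="customer_id", how="inner")
outer = orders.merge(customers, on="customer_id", how="left", indicator=True)

print(len(inner))   # 1 -- two orders were silently lost by the inner join
print(len(outer))   # 3 -- all orders preserved by the outer join
print(outer[outer["_merge"] == "left_only"]["order_id"].tolist())   # [2, 3] need attention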
Extract data from disparate systems
What is the standard for the
enterprise?
ODBC, OLE DB, JDBC, .NET
access databases from
Windows applications, so that
applications are portable
performance is a major drawback
every DBMS has an ODBC
driver, even flat files
Adds two layers of interaction
between the ETL and the
database
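A minimal sketch of ODBC access from an ETL job, assuming the pyodbc package is installed and a hypothetical DSN named src_erp exists; every identifier in the query is illustrative:

import pyodbc

# Connect through the ODBC driver manager; the DSN hides the actual DBMS behind it.
conn = pyodbc.connect("DSN=src_erp;UID=etl_reader;PWD=secret", timeout=30)
cursor = conn.cursor()

# Pull only the columns the ETL needs, constrained on an (assumed) indexed column.
cursor.execute(
    "SELECT order_id, customer_id, order_date FROM orders WHERE order_date >= ?",
    "2024-01-01",
)
for row in cursor.fetchall():
    print(row.order_id, row.customer_id, row.order_date)

conn.close()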
Extracting from different sources
Mainframe
COBOL copybooks give you the data types
EBCDIC rather than ASCII character set (FTP does the
translation between the mainframe and Unix/Windows; a minimal
decoding sketch follows this list)
Working with redefined fields (to save space, the same field
is used for different types of data)
Extracting from IMS, IDMS, Adabas
you need special adapters, or you get someone on those
systems to give you a file
XML sources, Web log files: doable, if you understand the structure of
those sources
Enterprise-Resource-Planning ERP Systems (SAP, PeopleSoft, Oracle)
Don’t treat it as a relational system -- it’s a mess
Use adapters
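As an aside on the mainframe bullet above, here is a minimal sketch of translating EBCDIC bytes yourself when a file was transferred in binary mode rather than through FTP's character translation. It assumes US EBCDIC (Python codec cp037); the sample bytes are illustrative:

# EBCDIC (code page 037) bytes for the text "HELLO 123".
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x40, 0xF1, 0xF2, 0xF3])

# Decode EBCDIC to a Python (Unicode) string, then re-encode as ASCII if needed.
text = ebcdic_bytes.decode("cp037")
print(text)                    # HELLO 123
print(text.encode("ascii"))    # b'HELLO 123'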
Extracting Changed Data
Using Audit Columns
Use the last-update timestamp, populated by triggers or the front-end application (a minimal sketch follows this list)
Must ensure that the timestamp is dependable, that is, if the front-end modifies it, a batch job
does not override it
Index the timestamp if it is dependable
Database log scraping or sniffing
Take the log of the source file and try to determine the transactions that affect you
Sniffing does it real time
Timed extracts
Retrieve all records from the source that were modified “today”
POTENTIALLY dangerous -- what if the process fails today? When it runs tomorrow, you’d
have lost today’s changes
Process of elimination
Preserve yesterday’s data in the stage area
Bring today’s entire data in the stage area
Perform a comparison
Inefficient, but the most reliable
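A minimal sketch of the audit-column technique mentioned above, using Python's standard sqlite3 module as a stand-in for the source DBMS; the table name, column names, and watermark values are hypothetical:

import sqlite3

def extract_changed_rows(conn, last_watermark):
    """Pull rows whose last_update timestamp is after the previous run's watermark."""
    cur = conn.execute(
        "SELECT order_id, status, last_update FROM orders "
        "WHERE last_update > ? ORDER BY last_update",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # New watermark = max timestamp actually seen, so a failed run can simply be re-run.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Usage: the watermark from the previous successful load would be persisted by the ETL job.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, last_update TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'NEW', '2024-01-02'), (2, 'SHIPPED', '2024-01-05')")
changed, watermark = extract_changed_rows(conn, "2024-01-01")
print(changed, watermark)   # both rows, watermark '2024-01-05'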
Initial and Incremental Loads
Create two tables, previous-load and current-load
Load into the current-load, compare with the previous-load; when you are done, drop the
previous-load, rename the current-load to previous-load, and create a new current-load
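A minimal sketch of the process-of-elimination / two-table comparison described above, done here with in-memory dictionaries keyed by the natural key; all names and values are illustrative:

def diff_snapshots(previous_load, current_load):
    """Compare yesterday's and today's full extracts to derive inserts, updates, deletes."""
    inserts = {k: v for k, v in current_load.items() if k not in previous_load}
    deletes = {k: v for k, v in previous_load.items() if k not in current_load}
    updates = {k: v for k, v in current_load.items()
               if k in previous_load and previous_load[k] != v}
    return inserts, updates, deletes

previous_load = {"C10": ("Acme", "PA"), "C20": ("Globex", "NY")}
current_load  = {"C10": ("Acme", "NJ"), "C30": ("Initech", "TX")}

ins, upd, dele = diff_snapshots(previous_load, current_load)
print(ins)    # {'C30': ('Initech', 'TX')}
print(upd)    # {'C10': ('Acme', 'NJ')}
print(dele)   # {'C20': ('Globex', 'NY')}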
Tips for Extracting
Constrain on indexed columns
Retrieve only the data you need
Use DISTINCT sparingly
Use the SET operations sparingly
Use HINT (HINT tells the DBMS to make sure
it uses a certain index)
Avoid NOT
Avoid functions in the where clause
Avoid subqueries
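To illustrate the "constrain on indexed columns" and "avoid functions in the WHERE clause" tips, here is the same extract filter written two ways; the table and column names are hypothetical, and the index on order_date is assumed:

# Non-sargable: applying a function to the indexed column forces a full scan on most DBMSs.
slow_sql = "SELECT * FROM orders WHERE YEAR(order_date) = 2024"

# Sargable: constrain the indexed column directly with a range, and select only needed columns.
fast_sql = ("SELECT order_id, customer_id, amount FROM orders "
            "WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'")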
Transformation (ETL)
Data Profiling
Gather metadata
Identify and Prepare Data for Profiling
Value Analysis
Structure Analysis
Single Object Data Rule Analysis
Multiple Object Data Rule Analysis
Data Cleaning
Parsing
Correcting
Standardizing
Matching
Consolidating
Cleaning Deliverables
Keep accurate records of the types of data
quality problems you look for, when you look,
what you look at, etc
Is data quality getting better or worse?
Which source systems generate the most data
quality errors?
Is there any correlation between data quality
levels and the performance of the organization
as a whole?
Cleaning and Conforming
While the extracting and loading parts of an
ETL process simply move data, the cleaning
and conforming part (the transformation part)
truly adds value
How do we deal with dirty data?
Data Profiling report
The Error Event fact table
Audit Dimension
Defining Data Quality
The basic definition of data quality is data accuracy,
which means
Correct: the values of the data are valid, e.g., my
resident state is PA
Unambiguous: The values of the data can mean only
one thing, e.g., there is only one PA
Consistent: the values of the data use the same
format, e.g., PA and not Penn, or Pennsylvania
Complete: data are not null, and aggregates do not
lose data somewhere in the information flow
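A minimal sketch of screening a record against the criteria above (correct, unambiguous, consistent, complete); the customer record fields and the list of valid state codes are illustrative assumptions:

VALID_STATES = {"PA", "NY", "NJ", "CA"}          # illustrative reference data

def quality_errors(record):
    """Return a list of data-quality problems found in one customer record."""
    errors = []
    state = record.get("state")
    if state is None or record.get("name") is None:
        errors.append("incomplete: required field is null")
    if state is not None and state not in VALID_STATES:
        # Catches non-standard spellings like 'Penn' or 'Pennsylvania' as well as invalid codes.
        errors.append(f"invalid or non-standard state value: {state!r}")
    return errors

print(quality_errors({"name": "Jones", "state": "PA"}))            # []
print(quality_errors({"name": "Jones", "state": "Pennsylvania"}))  # non-standard value
print(quality_errors({"name": None, "state": None}))               # incomplete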
Who cares about information
quality?
Most organizations accept low-quality data as normal;
after all, we are profitable, aren’t we?
In fact, as long as information quality is roughly the
same across the competition, it’s probably acceptable
But look at what happened to the U.S. auto
manufacturers (GM, Ford, Chrysler), who have been
losing ground consistently to the Japanese over
automobile quality
The high cost of low quality data #1
Some Metro Nashville city pensioners were overpaid $2.3 million from 1987
to 1995, while another set of pensioners were underpaid $2.6 million, as a
result of incorrect pension calculations (The Tennessean, March 21, 1998)
Two 20-year-old “calculation errors” socked Los Angeles County’s
pension systems with $1.2 billion in unforeseen liabilities and will force
county officials to make $25 million/year of unplanned contributions to
make up the difference (Los Angeles Times, April 8, 1998)
Wrong price data in retail databases may cost American consumers as
much as $2.5 billion in overcharges annually. Data audits show 4 out of
5 errors in prices are overcharges. (Information Week Sept 1992)
Four years later, 1 out of 20 items scanned incorrectly, according to a
Federal Trade Commission study of 17,000 items.
The high cost of low quality data #2
The US Attorney General’s office has stated that approximately $23
billion, or 14% of the health care dollar, is wasted in fraud or incorrect
billing (Nashville Business Journal, Sept 1997)
In 1992, 96,000 IRS tax refund checks were returned as undeliverable
due to bad addresses
No fewer than 1 out of 6 US registered voters on voter registration lists
have either moved or are deceased, according to audits that compare
voter registration lists with the US postal office change-of-address lists
Electronic data audits reveal that invalid data values in a typical
customer database averages 15-20%
Barbra Streisand pulled her investment account from her investment
bank because it misspelled her name as “Barbara”
The high cost of low quality data #3
The Gartner Group estimated the worldwide cost to modify software
and change databases to fix the Y2K problem at $400-$600 billion.
T. Capers Jones says this estimate is low; it should be $1.5 trillion. The
cost to fix this single pervasive error is one eighth of the US federal
debt ($8 trillion, Oct 2005).
Another way to look at it. The 50 most profitable companies in the world
earned a combined $178 billion in profits in 1996. If the entire profit of
these companies was used to fix the problem, it would only fix about
12% of the problem
And MS Excel, in the year 2000, still regarded 1900 as a leap year (which it
is not).
Data Profiling Deliverable
Start before building the ETL system
Data profiling analysis including
Schema definitions
Business objects
Domains
Data Sources
Table definitions
Synonyms
Data rules
Value rules
Issues that need to be addressed
Loading (ETL)
Data are physically moved to the data
warehouse
The loading takes place within a “load
window”
The trend is to near real time updates of the
data warehouse as the warehouse is
increasingly used for operational applications
Dimensional Modeling
The process and outcome of designing logical
database schemas created to support OLAP and
Data Warehousing solutions
Used by most contemporary BI solutions
– “Right” mix of normalization and denormalization
often called Dimensional Normalization
– Some use for full data warehouse design
– Others use for data mart designs
Consists of two primary types of tables
– Dimension tables
– Fact tables
Dimensional Modeling …
Dimension Tables!
Organized hierarchies of categories, levels, and members
Used to “slice” and query within a cube
Business perspective from which data is looked upon
Collection of text attributes that are highly correlated
(e.g. Product, Store, Time)
Shared with multiple fact relationships to provide data
correlation
Dimensional Modeling …
Dimension Details
Attributes
Descriptive characteristics of an entity
Building blocks of dimensions, describe each instance
Usually text fields, with discrete values
e.g., the flavor of a product, the size of a product
Dimension Keys
Surrogate Keys
Candidate Business Keys
Dimension Granularity
Granularity in general is the level of detail of data contained
in an entity
A dimension’s granularity is the lowest-level object that
uniquely identifies a member
Typically the identifying name of a dimension
Dimensional Modeling …
Hierarchies
Dimensional Modeling …
DW - Surrogate Keys
OLTP – Natural Keys
Production Keys
Intelligent Keys
Smart Keys
NKs (Natural Keys) tell us something about the record they represent
For example: Student IDNO - 2003B4A7290
DW - Surrogate Keys
Integer keys
Artificial Keys
Non-intelligent Keys
Meaningless Keys
SKs do not tell us anything about the record they represent
Surrogate Keys - Advantages
Buffers the DW from operational changes
Saves Space
Faster Joins
Allows proper handling of changing dimensions
Dimensional Modeling …
DW - Surrogate Keys
Buffering DW from operational changes
Production keys are often reused
For example, inactive account numbers or obsolete product codes are reassigned after a
period of dormancy
Not a problem in operational system, but can cause problems in a DW
SKs allow the DW to differentiate between the two instances of the same production
key
Space Saving
Surrogate Keys are integers
4 bytes of space
Are 4 bytes enough?
Nearly 4 billion values!!!
For example
a date data type occupies 8 bytes
with 10 million records in the fact table
space saving = 4 bytes x 10 million = 40 million bytes, about 38.15 MB
Faster Joins
Every join between dimension table and fact table is based on SKs and not on NKs
Which is faster?
Comparing 2 strings
Comparing 2 integers
But the issue is – Do we need joins in the first place?
Changing Dimensions
Surrogate keys help preserve historical data in the same dimension.
Dimensional Modeling …
Dimension Table
The final component of the dimension, besides the SK and NK, is a set of
descriptive attributes (possibly large, approximately 100 in a dimension).
The DW architect should not call for numeric fields in the dimension
tables; only textual attributes belong there. All descriptive attributes
should be truly static, or should change only slowly and episodically.
Example: the difference between a measured fact and a numeric
descriptive attribute is obvious in 98% of cases, but sometimes it
takes time to distinguish. Take the case of catalog price: the standard
catalog price is numeric, so it could be taken as a fact, but what happens
when the standard price of a product changes? We cannot treat this
numeric attribute as a measure or fact; it should go in the dimension
table as a numeric descriptive attribute.
Contains attributes for dimensions
50 to 100 attributes common
Best attributes are textual and descriptive
DW is only as good as the dimension attributes
Contains hierarchical information, albeit redundantly
Entry points into the fact table
Dimensional Modeling …
Dimension Types
Static Dimension:
When a dimension is static and is not being updated
for historical changes to individual rows, there is 1-to-1
relationship between the PK (SK) and the Natural key
(NK)
i.e. PK (SK) : NK :: 1 : 1
Dynamic or Slowly changing Dimension:-
When a dimension is slowly changing we generate
many PKs (SK) for each Natural Key as we track the
history of changes to the dimension then the
relationship between the PK (SK) to Natural Key (NK)
is N-to-1.
i.e. PK (SK) : NK :: N : 1
Dimensional Modeling …
Dimension Types
Big dimensions
Big means wide as well as deep
Examples: Customer, Product, Location
Millions of records with hundreds of fields (insurance
customers), or hundreds of millions of records with a few
fields (supermarket customers)
Almost always derived from multiple sources
These dimensions should be conformed
Small Dimensions
Many of the dimensions in the DW are tiny lookup tables with only a few
records and one or two columns, e.g., the transaction-type dimension
Examples: Transaction Type, Claim Status
Built by typing into a spreadsheet and loading the data into the DW
These dimensions need not be, and should not be, conformed across the
various fact tables
Dimensional Modeling …
One dimension or two?
In dimensional modeling we usually assume that dimensions are independent. A good
statistician, however, could demonstrate a degree of correlation between, say, the product
dimension and the store dimension; if the degree of correlation is high, we could combine
the two dimensions into one.
But understand that if there are 10,000 rows in the product dimension and 100 rows in the
store dimension, the combined dimension can contain up to 1,000,000 rows, which is
a disadvantage of combining the two dimensions.
There may also be more than one independent type of correlation between the
two dimensions.
Exclude and discard all flags and texts: not a good option.
Place the flags and texts unchanged in the fact table: also not good, as it
swells the fact table to no specific advantage.
Make each flag and textual field a separate dimension on its own:
not good, because it increases the number of dimension tables.
Best approach: keep only those flags and texts that are meaningful, and group all
the useful flags into a single “junk” dimension. These will be useful for
constraining queries based on flag/text values.
More on junk dimensions: some data sources have a dozen or more operational codes attached to
the fact table records, many of which have very low cardinalities. Even if there is no
obvious correlation between the values of the operational codes, a single junk
dimension can be created to bring all these little codes into one dimension and tidy
up the design. The records in the junk dimension should probably be created as they
are encountered in the data, rather than beforehand as the Cartesian product of all
the separate codes; the incrementally produced junk dimension is likely to be
much smaller than the full Cartesian product of all the code values. A minimal sketch follows.
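A minimal sketch of building a junk dimension incrementally, assigning a surrogate key only to flag combinations actually encountered in the incoming fact rows; all names and values are illustrative:

junk_dim = {}          # maps (payment_type, gift_wrap_flag, rush_flag) -> surrogate key

def junk_key(payment_type, gift_wrap_flag, rush_flag):
    """Return the surrogate key for this combination, creating a junk-dimension row if new."""
    combo = (payment_type, gift_wrap_flag, rush_flag)
    if combo not in junk_dim:
        junk_dim[combo] = len(junk_dim) + 1    # next surrogate key
    return junk_dim[combo]

# As fact rows stream in, only observed combinations get rows -- far fewer than the
# Cartesian product of all possible code values.
for fact in [("VISA", "Y", "N"), ("CASH", "N", "N"), ("VISA", "Y", "N")]:
    print(junk_key(*fact))
print(junk_dim)   # two rows, not (number of payment types) x 2 x 2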
Dimensional Modeling …
Degenerate Dimension
When a parent-child relationship exists and
the grain of the fact table is the child, the
parent is kind of left out in the design process
Example:
the grain of the fact table is the line item in an order
the order number is a significant part of the key
but we don’t create a dimension for the order
number, because it would be useless
we insert the order number as part of the key, as
if it were a dimension, but we don’t create a
dimension table for it
Dimensional Modeling …
Date and Time Dimension
Virtually everywhere: measurements are defined at specific times, repeated over time, etc.
Most common: a calendar-day dimension with the grain of a single day and many attributes
Doesn’t have a conventional source: built by hand or in a spreadsheet
Holidays, workdays, fiscal periods, week numbers, last-day-of-month flags must be entered manually
10 years are about 4K rows (a generation sketch follows)
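A minimal sketch of generating a calendar-day dimension with Python's standard library; holidays and fiscal periods would still have to be maintained by hand, as noted above, and the attribute set shown is illustrative:

from datetime import date, timedelta

def build_date_dimension(start, years=10):
    """Generate one row per calendar day with a few typical attributes."""
    rows = []
    day = start
    end = date(start.year + years, start.month, start.day)
    key = 1
    while day < end:
        rows.append({
            "date_key": key,                       # surrogate key
            "full_date": day.isoformat(),
            "day_of_week": day.strftime("%A"),
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
            "is_month_end": (day + timedelta(days=1)).day == 1,
        })
        day += timedelta(days=1)
        key += 1
    return rows

dim = build_date_dimension(date(2020, 1, 1))
print(len(dim))        # 3,653 rows for 10 years -- roughly the "4K rows" above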
Dimensional Modeling …
Slow-changing Dimensions
When the DW receives notification that some
record in a dimension has changed, there are
three basic responses:
Type 1 slowly changing dimension (Overwrite)
Type 2 slowly changing dimension (Partitioning
History)
Type 3 slowly changing dimension (Alternate
Realities)
Dimensional Modeling …
Type 1 Slowly Changing Dimension (Overwrite)
Overwrite one or more values of the dimension with the new value
Use when
the data are corrected
there is no interest in keeping history
there is no need to run previous reports or the changed value is immaterial to the report
Type 1 Overwrite results in an UPDATE SQL statement when the value changes
If a column is Type-1, the ETL subsystem must
Add the dimension record, if it’s a new value or
Update the dimension attribute in place
Must also update any Staging tables, so that any subsequent DW load from the staging tables
will preserve the overwrite
This update never affects the surrogate key
But it affects materialized aggregates that were built on the value that changed (will be
discussed more next week when we talk about delivering fact tables)
Beware of ETL tools “Update else Insert” statements, which are convenient but inefficient
Some developers use “UPDATE else INSERT” for fast changing dimensions and “INSERT else UPDATE”
for very slow changing dimensions
Better Approach: Segregate INSERTS from UPDATES, and feed the DW independently for the updates
and for the inserts
No need to invoke a bulk loader for small tables, simply execute the SQL updates, the performance impact
is immaterial, even with the DW logging the SQL statement
For larger tables, a loader is preferable, because SQL updates will result in unacceptable database
logging activity
Turn the logger off before you update with SQL Updates and separate SQL Inserts
Or use a bulk loader
Prepare the new dimension in a staging file
Drop the old dimension table
Load the new dimension table using the bulk loader
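A minimal sketch of Type-1 handling, assuming the dimension is held as a dict keyed by the natural key; a real ETL job would issue the corresponding INSERT/UPDATE statements and refresh the staging copy as described above, and all field names here are illustrative:

dimension = {}        # natural_key -> {"sk": surrogate key, "name": ..., "city": ...}
next_sk = 1

def type1_apply(natural_key, attributes):
    """Insert a new dimension row, or overwrite attributes in place (Type 1)."""
    global next_sk
    row = dimension.get(natural_key)
    if row is None:
        dimension[natural_key] = {"sk": next_sk, **attributes}   # new member: assign an SK
        next_sk += 1
    else:
        row.update(attributes)   # overwrite in place; the surrogate key never changes

type1_apply("CUST-42", {"name": "Jones", "city": "Pittsburgh"})
type1_apply("CUST-42", {"city": "Philadelphia"})     # correction: history is NOT kept
print(dimension["CUST-42"])   # {'sk': 1, 'name': 'Jones', 'city': 'Philadelphia'}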
Dimensional Modeling …
Type-2 Slowly Changing Dimension (Partitioning History)
Standard
With a Type-2 change, you might want to include the following additional
attributes in the dimension
Date of change
Exact timestamp of change
Reason for change
Current Flag (current/expired)
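A minimal sketch of a Type-2 change: the current row is expired and a new row is added with a fresh surrogate key plus the tracking attributes listed above; all field names and values are illustrative:

from datetime import date

rows = []          # the dimension table: one entry per row / surrogate key
next_sk = 1

def type2_apply(natural_key, attributes, change_date, reason="attribute change"):
    """Partition history: close out the current row and insert a new current row."""
    global next_sk
    for row in rows:
        if row["natural_key"] == natural_key and row["current_flag"]:
            row["current_flag"] = False
            row["end_date"] = change_date          # expire the old version
    rows.append({"sk": next_sk, "natural_key": natural_key, **attributes,
                 "begin_date": change_date, "end_date": None,
                 "change_reason": reason, "current_flag": True})
    next_sk += 1

type2_apply("CUST-42", {"city": "Pittsburgh"}, date(2020, 1, 1), "initial load")
type2_apply("CUST-42", {"city": "Philadelphia"}, date(2023, 6, 1), "customer moved")
print([(r["sk"], r["city"], r["current_flag"]) for r in rows])
# [(1, 'Pittsburgh', False), (2, 'Philadelphia', True)]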
Dimensional Modeling …
Type-2 Slowly Changing Dimension (Partitioning History)
Example dimension: REGION (region key, number, name, office address, manager name)
Dimensional Modeling …
Fact Constellation or Galaxy Schema
[Diagram: a Sales Fact Table and a Shipping Fact Table share dimensions such as time (day, day_of_the_week, month, quarter, year) and item (item_name, brand, type, supplier_type); the Sales Fact Table also references branch_key, and the Shipping Fact Table references shipper_key and from_location.]
1. The dimension records must have a begin and end timestamp so that they can
be located quickly for the late arriving fact
2. We must be willing to accept late arriving facts and invalidate previously
published reports
3. If using partitioning you must guarantee that the late arriving fact is put in its
correct partition
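A minimal sketch of point 1 above: given Type-2 dimension rows that carry begin/end dates, a late-arriving fact is assigned the surrogate key that was in effect on the fact's transaction date. The data values and field names are illustrative:

from datetime import date

# Type-2 dimension rows for one natural key, with validity intervals.
customer_rows = [
    {"sk": 1, "natural_key": "CUST-42", "begin": date(2020, 1, 1), "end": date(2023, 6, 1)},
    {"sk": 2, "natural_key": "CUST-42", "begin": date(2023, 6, 1), "end": None},  # current
]

def surrogate_key_asof(rows, natural_key, fact_date):
    """Find the surrogate key whose validity interval contains the late fact's date."""
    for row in rows:
        if (row["natural_key"] == natural_key
                and row["begin"] <= fact_date
                and (row["end"] is None or fact_date < row["end"])):
            return row["sk"]
    raise LookupError("no dimension version covers this date")

print(surrogate_key_asof(customer_rows, "CUST-42", date(2021, 3, 15)))   # 1 (old version)
print(surrogate_key_asof(customer_rows, "CUST-42", date(2024, 2, 1)))    # 2 (current version)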
Dimensional Modeling …
Error Event Table Deliverable
Built as a star schema
Each data quality error or issue is added to the table
Dimensional Modeling …
Audit Dimension Deliverable
Captures the specific data quality of a given table:
audit key (PK)
overall quality category (text)
overall quality score (integer)
completeness category (text)
completeness score (integer)
validation category (text)
validation score (integer)
out-of-bounds category (text)
out-of-bounds score (integer)
number of screens failed
max severity score
extract time stamp
clean time stamp
conform time stamp
ETL system version
allocation version
currency conversion version
other audit attributes
Factless Fact Tables
There are applications in which fact tables do
not have non-key data but do have
foreign keys for the associated dimensions.
Dimensional modeling is the name of the logical design technique often used
for data warehouses. It is different from entity-relationship modeling.
For example, a query that requests the total sales income and quantity sold for a
range of products in a specific geographical region for a specific time period can
typically be answered in a few seconds or less regardless of how many
hundreds of millions of rows of data are stored in the data warehouse database.
Entity–Relationship Modeling
[Diagram: ER model with entities such as Customer, Demographics, CustomerSubscriptions, Salesperson, Zones, and City.]
Dimensional Modeling
[Diagram: Dimensional model. A Subscription Sales fact table carries keys (EffectiveDateKey, CustomerKey, SubscriptionsKey, PaymentKey, CampaignKey, SalesPersonKey, RouteKey, DemographicsKey) and measures (UnitsSold, DollarsSold, DiscountCost, PremiumCost), surrounded by the Customer, Date, Payment, Subscriptions, Campaign, Salesperson, Route, and Demographics dimensions.]
[Diagram: side-by-side comparison of an Entity-Relationship model and a Dimensional model.]
Few Facts:
Q: Ralph Kimball invented the fact and dimension terminology.
A: While Ralph played a critical role in establishing these terms
as industry standards, he didn’t “invent” the concepts. As best as
we can determine, the terms facts and dimensions originated
from a joint research projected conducted by General Mills and
Dartmouth University in the 1960s. By the 1970s, both AC
Nielsen and IRI used these terms consistently when describing
their syndicated data offerings. Ralph first heard about
“dimensions,” “facts,” and “conformed dimensions” from AC
Nielsen in 1983 as they were explaining their dimensional
structures for simplifying the presentation of analytic information.