Вы находитесь на странице: 1из 67

Data Warehouse

Architecture
Objectives
 Define Data Warehouse Architecture
 Define Data Warehouse and Data Mart
 Present a Data Warehouse Architectural
Framework
 Demo – Data Enterprise Integration Server
Information Systems
Architecture
 Information Systems Architecture is the
process of making the key choices that are
essential to the development of an
information system. Architecture includes:
◦ Guiding Principles:
◦ Approaches/philosophies
◦ “Logical” representations of a system
◦ Hardware/Operating System
◦ Computing model: client/server vs traditional vs
Web-based
◦ Tools and technologies
 It is key, when making these choices that they
are:
◦ Requirements driven
◦ Take into consideration operational, technical and
financial feasibility
◦ Made within an architectural framework

Architecture Drivers
 There are a lot of Drivers of Architecture

Corporat
Corporat
ee
Politics
Politics
Business
Business System
System
Plan
Plan Qualities
Qualities

Architecture
Architecture
Emergin
Emergin
Current
Current gg
Systems
Systems Technolo
Technolo
gies
gies
End
EndUser
User
Require
Require
ments
ments
How is Architecture Different
from Design?
 Its not – Architecture can be considered
‘high-level’ design
 Architecture includes those aspects of the
design that are essential to the
information system
 Architecture Example:
◦ Users must be able to self-serve (guiding
principle)
◦ “We will use a “hub and spoke” design where
data will be placed in a central data warehouse,
then be propagated to one or more data marts.
(approach)
◦ We will normalize data in the central warehouse
and use a dimensional design in the data marts
(approach)
◦ We will use Oracle 8i as our DBMS (technical
architecture)
Architecture vs Design

 Not Architecture:
◦ The Order subject area will be composed of the
following tables: order_fact, customer_dim,
product_dim and time_dim
◦ The customer_dim table will have the following
attributes…….

The Value of
 Communication:
Architecture
◦ To business sponsors, and business users
◦ Between members of the project team
 Planning:
◦ Cross Check for Project Plan
◦ Ensure that all important components of the
data warehouse are accounted for
 Flexibility and Growth
◦ Thinking about overall architecture will reduce
risk associated with the ‘success’ of the data
warehouse
 Learning
 Productivity and Reuse
What’s different about DW
Architecture?
 Transaction processing systems – growth is
(relatively) predictable
 Example:
◦ A company uses SAP for order processing
◦ They are opening a new retail store
◦ They predict (based on experience) 2000
transactions per week
◦ To process this volume, we need 3 workstations
to capture the transactions
◦ Peak time each day is 11-2 when 50% of
transactions occur

What’s Different About Data
Warehouse Architecture?
 Successdrives
explosive growth
Data ◦ More users
Warehouse ◦ More (complex)
queries
◦ More data
 Performance is
unpredictable
Growth

SAP R/3
◦ Unpredictable
Siebel
queries
◦ Unpredictable use
patterns
Time
The Great Data Warehouse
Architecture Debate
 Bill Inmon: “The
enterprise data If you build it,
They will come
warehouse”

 Ralph Kimball: “data
marts”

 The compromise:
“Hub and Spoke” or
“Federated” models
What is a Data Mart?
 A data mart is a collection of subject areas
organized for decision support based on
the specific needs of a given user group.
 Each mart may widely different from others
(as we will see)
 Typically, data marts are built on the
dimensional data model:
◦ Facts – things that the organization wants to
measure: revenue, orders, shipments,
purchases, etc.
◦ Dimensions – the means by which the
organization wants to analyze the measures
(facts) – by customer, by time, by product –
BY ANY COMBINATION!!

What is a Data Mart?
There are two kinds of data marts--dependent

and independent.
 A dependent data mart is one whose source is a
data warehouse.
 An independent data mart is one whose source is
the legacy applications environment. All
dependent data marts are fed by the same
source--the data warehouse. Each independent
data mart is fed uniquely and separately by the
legacy applications environment.
 Dependent data marts are architecturally and
structurally sound.
 Independent data marts have a number of
significant issues
Data Warehouse vs.
Data Marts
What comes first
From the Data Warehouse to
Data Marts
Information

Individually Less
Structured

History
Departmentally Normalized
Structured Detailed

Organizationally More
Structured Data Warehouse

Data
Data Warehouse and Data
Marts
OLAP
Data Mart
Lightly summarized
Departmentally structured

Organizationally structured
Atomic
Detailed Data Warehouse Data
Data Mart Centric

Data Sources

Data Marts

Data Warehouse
Problems with Data Mart
Centric Solution

If you end up creating multiple warehouses, integrating them


is a problem
True Warehouse

Data Sources

Data Warehouse

Data Marts
Data Warehouse
Architectures
 Generic Two-Level Architecture
 Independent Data Mart
 Dependent Data Mart and Operational
Data Store
 Logical Data Mart and Real-Time Data
Warehouse
 Three-Layer architecture

All involve some form of extraction, transformation and loading (ETL)


ETL

19
Generic two-level data warehousing architecture

L
One,
company-
wide
T warehouse

Periodic extraction  data is not completely current in warehouse

20
Independent data mart data Data marts:
warehousing architecture Mini-warehouses, limited in scope

T
E

Separate ETL for each Data access complexity


independent data mart due to multiple data marts

21
Dependent data mart with operational data ODS provides option for
store: a three-level architecture obtaining current data

T
E Simpler data access
Single ETL for
enterprise data warehouse Dependent data marts
(EDW) loaded from EDW

22
Logical data mart and real time ODS and data warehouse
are one and the same
warehouse architecture

T
E
Near real-time ETL for Data marts are NOT separate databases,
Data Warehouse but logical views of the data warehouse
 Easier to create new data marts
23
Three-layer data architecture for a data warehouse

24
The Major Data Warehouse
Architectures

 Independent data marts


 Hub and spoke architecture
 Data mart bus architecture
 Federated data warehouse


Independent data mart
architecture

End user
access/
applications

Data staging

Independent data marts


Data sources

•Developed independently.
•No conformed dimensions (i.e., does not have the same categories and labels
for data elements in data marts which would allowdata across data marts to be
combined).
•Built to a business unit or functional area.
Federated architecture
Data mart
Data staging

End user
access/
applications

Federated data store


Data stores

Data warehouse

•Key spokesperson: Doug Hackney (2000, 2002).


•Combines data in an organization’s existing data warehousing environment.
•Characterized by combing key metrics and measures in existing data marts,
data warehouses and legacy systems.
Data Warehouse
Architecture Selection
Architecture selection

Architecture selection factors


Information interdependence
Upper management’s information needs
Urgency of need
View of the data warehouse Data warehouse architectures
Compatibility with existing systems
Independent data mart
Nature of end user tasks Data mart bus architecture
Resource constraints Hub and spoke architecture
Perceived ability of the IT staff Federated

Source of sponsorship
Expert influence
Best Practice #1
ØUse a data model that is optimized for
information retrieval
◦ dimensional model
◦ denormalized
◦ hybrid approach

DW Architecture Best Practices12/04/09 32


Data Acquisition Processes

 Extract Transform Load (ETL)


◦ the process of unloading or copying data
from the source systems, transforming it
into the format and data model required in
the BI environment, and loading it to the
DW
◦ also, a software development tool for
building ETL processes (an ETL tool)
◦ many production DWs use COBOL or other
general-purpose programming languages
to implement ETL

DW Architecture Best Practices12/04/09 33


The ETL Process

 Capture/Extract
 Scrub or data cleansing
 Transform
 Load and Index

ETL = Extract, transform, and load

34
Capture/Extract…obtaining a snapshot of a chosen subset
of the source data for loading into the data warehouse
Steps in data
reconciliation

Static extract = Incremental extract =


capturing a snapshot of the capturing changes that
source data at a point in have occurred since the
time last static extract
35
Scrub/Cleanse…uses pattern recognition and AI
techniques to upgrade data quality
Steps in data
reconciliation
(cont.)

Fixing errors: misspellings, Also: decoding, reformatting,


erroneous dates, incorrect field time stamping, conversion, key
usage, mismatched addresses, generation, merging, error
missing data, duplicate data, detection/logging, locating
inconsistencies missing data

36
Transform = convert data from format of operational
system to format of data warehouse
Steps in data
reconciliation
(cont.)

Record-level: Field-level:
Selection–data partitioning single-field–from one field to one field
Joining–data combining multi-field–from many fields to one, or
Aggregation–data summarization one field to many

37
Load/Index= place transformed data
into the warehouse and create indexes
Steps in data
reconciliation
(cont.)

Refresh mode: bulk rewriting Update mode: only changes in


of target data at periodic intervals source data are written to data
warehouse

38
Data Quality Assurance

 data cleansing
◦ the process of validating and enriching
the data as it is published to the DW
◦ also, a software development tool for
building data cleansing processes (a
data cleansing tool)
◦ many production DWs have only very
rudimentary data quality assurance
processes

DW Architecture Best Practices12/04/09 39


Data Acquisition &
Cleansing
 getting data loaded efficiently and
correctly is critical to the success of your
DW
◦ implementation of data acquisition &
cleansing processes represents from 50
to 80% of effort on typical DW projects
◦ inaccurate data content can be ‘the kiss
of death’ for user acceptance

DW Architecture Best Practices12/04/09 40


Best Practice #2
ØCarefully design the data acquisition and
cleansing processes for your DW
◦ Ensure the data is processed efficiently and
accurately
◦ Consider acquiring ETL and Data Cleansing tools
◦ Use them well!

DW Architecture Best Practices12/04/09 41


Data Model

 Already discussed the benefits of a


dimensional model
 No matter whether dimensional modeling
or any other design approach is used,
the data model must be documented

DW Architecture Best Practices12/04/09 42


Documenting the Data
Model

 The best practice is to use some kind of data


modeling tool
◦ CA ERwin
◦ Sybase PowerDesigner
◦ Oracle Designer
◦ IBM Rational Rose
◦ Etc.
 Different tools support different modeling notations,
but they are more or less equivalent anyway
 Most tools allow sharing of their metadata with an ETL
tool

DW Architecture Best Practices12/04/09 43


Data Model Standards
 data model standards appropriate for the
environment and tools chosen in your data
warehouse should be adopted
 considerations should be given to data access
tool(s) and integration with overall enterprise
standards
 standards must be documented and enforced
within the DW team
◦ someone must ‘own’ the data model
 to ensure a quality data model, all changes should
be reviewed thru some formal process

DW Architecture Best Practices12/04/09 44


Data Model Metadata

 Business definitions should be recorded


for every field (unless they are technical
fields only)
 Domain of data should be recorded
 Sample values should be included
 As more metadata is populated into the
modeling tool it becomes increasingly
important to be able to share this data
across ETL and Data Access tools

DW Architecture Best Practices12/04/09 45


Metadata Architecture

 The strategy for sharing data model and


other metadata should be formalized and
documented
 Metadata management tools should be
considered & the overall metadata
architecture should be carefully planned

DW Architecture Best Practices12/04/09 46


Best Practice #3
ØDesign a metadata architecture that allows
sharing of metadata between components
of your DW

DW Architecture Best Practices12/04/09 47


Alternative Architecture
Approaches
 Bill Inmon: “Corporate Information Factory”
 Hub and Spoke philosophy
 “JBOC” – just a bunch of cubes
 Let it evolve naturally

DW Architecture Best Practices12/04/09 48


What We Want
(Architectural Principal)
 In most cases, business and IT agree that
the data warehouse should provide a
‘single version of the truth’
 Any approach that can result in disparate
data marts or cubes is undesireable
 This is known as data silos or…

DW Architecture Best Practices12/04/09 49


Enterprise DW
Architecture
 how to design an enterprise data
warehouse and ensure a ‘single
version of the truth’?
 according to Kimball:
◦ start with an overall data architecture
phase
◦ use “Data Warehouse Bus” design to
integrate multiple data marts
◦ use incremental approach by building
one data mart at a time

DW Architecture Best Practices12/04/09 50


Data Warehouse Bus
Architecture
 named for the bus in a computer
◦ standard interface that allows you to plug
in cdrom, disk drive, etc.
◦ these peripherals work together smoothly
 provides framework for data marts to
fit together
 allows separate data marts to be
implemented by different groups,
even at different times

DW Architecture Best Practices12/04/09 51


Data Mart Definition

 data mart is a complete subset of the


overall data warehouse
◦ a single business process OR
◦ a group of related business processes
 think of a data mart as a collection of
related fact tables sharing
conformed dimensions, aka a ‘fact
constellation’

DW Architecture Best Practices12/04/09 52


Designing The DW Bus
 determine which dimensions will be
shared across multiple data marts
 conform the shared dimensions
 produce a master suite of shared dimensions
 determine which facts will be shared
across data marts
 conform the facts
standardize the definitions of facts

DW Architecture Best Practices12/04/09 53


Dimension Granularity
 conformed dimensions will usually be
granular
◦ makes it easy to integrate with various base
level fact tables
◦ easy to extend fact table by adding new facts
◦ no need to drop or reload fact tables, and no
keys have to be changed

DW Architecture Best Practices12/04/09 54


Conforming Dimensions

 by adhering to standards, the separate


data marts can be plugged together
◦ e.g. customer, product, time
 they can even share data usefully, for
example in a drill across report
 ensures reports or queries from different
data marts share the same context

DW Architecture Best Practices12/04/09 55


Data Consolidation
 a current trend in BI/DW is ‘data
consolidation’
 from a software vendor perspective, it is
tempting to simplify this:
◦ ‘we can keep all the tables for all your disparate
applications in one physical database’

DW Architecture Best Practices12/04/09 56


Data Integration

 To truly achieve ‘a single version of


the truth’, must do more than
simply consolidating application
databases
 Must integrate data models and
establish common terms of
reference

DW Architecture Best Practices12/04/09 57


Best Practice #4
ØTake an approach that consolidates data into
‘a single version of the truth’
◦ Data Warehouse Bus
conformed dimensions & facts
◦ OR?

DW Architecture Best Practices12/04/09 58


Operational Data Store
(ODS)
 a single point of integration for disparate
operational systems
 contains integrated data at the most
detailed level (transactional)
 may be loaded in ‘near real time’ or
periodically
 can be used for centralized operational
reporting

DW Architecture Best Practices12/04/09 59


Best Practice #5
ØConsider implementing an ODS only when
information retrieval requirements are
near the bottom of the data abstraction
pyramid and/or when there are multiple
operational sources that need to be
accessed
◦ Must ensure that the data model is integrated,
not just consolidated
◦ May consider 3NF data model
◦ Avoid at all costs a ‘data dumping ground’

DW Architecture Best Practices12/04/09 60


Capacity Planning

 DW workloads are typically very


demanding, especially for I/O capacity
 Successful implementations tend to grow
very quickly, both in number of users
and data volume
 Rules of thumb do exist for sizing the
hardware platform to provide adequate
initial performance
◦ typically based on estimated ‘raw’ data size
of proposed database e.g. 100-150 Gb per
modern CPU

DW Architecture Best Practices12/04/09 61


SMP Server Scale Up

 Scaling performance within a single SMP


server is referred to as ‘scale up’
 Database benchmarks suggest Windows
scalability is near that of Linux
 IBM claims near-linear scalability for Linux
(on commodity hardware) up to about 4
processors
◦ Probably not cost effective to scale up Linux
much beyond 4 processors
 IBM claims near-linear scalability for AIX
on POWER5 up to about 8 processors

DW Architecture Best Practices12/04/09 62


Scale Up vs. Scale Out

 To obtain the total number of processors


required for the estimated DW
workload, must plan either to scale up
or scale out
 Both options are viable but, all other
things being equal, scaling up is less
disruptive to end users and requires
less work to implement
◦ scaling up can offer lower hardware
investment, if practical
◦ however, network bandwidth or latency
issues can limit effectiveness of parallelism

DW Architecture Best Practices12/04/09 63


Best Practice #6

ØCreate a capacity plan for your BI


application & monitor it carefully
ØConsider future additional performance
demands
◦ Establish standard performance benchmark
queries and regularly run them
◦ Implement capacity monitoring tools
◦ Build scalability into your architecture
◦ May need to allow for scaling both up and
out!

DW Architecture Best Practices12/04/09 64


Open Source Affordability

 Another emerging trend in IT generally is to


utilize Open Source software running on
commodity hardware
◦ this is expected to offer lower total cost of ownership
◦ certainly, GNU/Linux and other Open Source
initiatives do provide very good functionality and
quality for minimal cost
 This trend also applies to BI & DW:
◦ most traditional rdbms’s are now supported on Linux
◦ however, open source rdbms’s lag behind on
providing good performance for DW queries

DW Architecture Best Practices12/04/09 65


DW Appliances

 DW appliances, consisting of
packaged solutions providing all
required software and hardware, are
beginning to offer very promising
price/performance
 production experience is limited so
far, so this is not yet a ‘best practice’

DW Architecture Best Practices12/04/09 66


Role of an ODS in DW
Architecture

 In the case where an ODS is a


necessary component of the overall
DW, it should be carefully integrated
into the overall architecture
 Can also be used for:
◦ Staging area
◦ Master/reference data management
◦ Etc…

DW Architecture Best Practices12/04/09 67