
Building an Effective Data Warehouse Architecture
May 7-9, 2014 | San Jose, CA

James Serra, Big Data Evangelist
Microsoft

Other Presentations

Building an Effective Data Warehouse Architecture
Reasons for building a DW and the various approaches and DW concepts (Kimball vs Inmon)

Building a Big Data Solution (Building an Effective Data Warehouse Architecture with Hadoop, the cloud and MPP)
Explains what Big Data is, its benefits including use cases, and how Hadoop, the cloud, and MPP fit in

Finding business value in Big Data (What exactly is Big Data and why should I care?)
Very similar to Building a Big Data Solution, but the target audience is business users/CxOs instead of architects

How does Microsoft solve Big Data?
Covers the Microsoft products that can be used to create a Big Data solution

Modern Data Warehousing with the Microsoft Analytics Platform System
The next step in data warehouse performance is APS, an MPP appliance

Power BI, Azure ML, Azure HDInsights, Azure Data Factory,

About Me
Business Intelligence Consultant, in IT for 28 years
Microsoft, Big Data Evangelist
Worked as desktop/web/database developer, DBA, BI and DW architect and
developer, MDM architect, PDW developer
Been perm, contractor, consultant, business owner
Presenter at PASS Business Analytics Conference and PASS Summit
MCSE for SQL Server 2012: Data Platform and BI
Blog at JamesSerra.com
SQL Server MVP
Author of book Reporting with Microsoft SQL Server 2012

I tried to build a data warehouse on my own

And ended up passed-out drunk in a Denny's parking lot

Let's prevent that from happening

Agenda

What a Data Warehouse is not
What is a Data Warehouse and why use one?
Fast Track Data Warehouse (FTDW)
Appliances
Data Warehouse vs Data Mart
Kimball and Inmon Methodologies
Populating a Data Warehouse
ETL vs ELT
Surrogate Keys
SSAS Cubes
Modern Data Warehouse

What a Data Warehouse is not

A data warehouse is not a copy of a source database with the name prefixed with
DW
It is not a copy of multiple tables (e.g. customer) from various source systems
unioned together in a view
It is not a dumping ground for tables from various sources with not much design
put into it

Data Warehouse Maturity Model

Courtesy of Wayne Eckerson

What is a Data Warehouse and why use one?
A data warehouse is where you store data from multiple data sources to be used for historical and
trend analysis reporting. It acts as a central repository for many subject areas and contains the "single
version of truth". It is NOT to be used for OLTP applications.
Reasons for a data warehouse:
Reduce stress on production system
Optimized for read access, sequential disk scans
Integrate many sources of data
Keep historical records (no need to save hardcopy reports)
Restructure/rename tables and fields, model data
Protect against source system upgrades
Use Master Data Management, including hierarchies
No IT involvement needed for users to create reports
Improve data quality and plug holes in source systems
One version of the truth
Easy to create BI solutions on top of it (e.g. SSAS Cubes)

Why use a Data Warehouse?

Legacy applications + databases = chaos
[Diagram: separate systems for Production Control, MRP, Inventory Control, Parts Management, Logistics, Shipping, Raw Goods, Order Control, Purchasing, Finance, Marketing, Sales, Accounting, Management Reporting, Engineering, Actuarial, and Human Resources]

Enterprise data warehouse = order
Continuity
Consolidation
Control
Compliance
Collaboration

Single version of the truth
Every question = decision

Two purposes of a data warehouse: 1) save time building reports; 2) slice and dice in ways you could
not do before

Hardware Solutions
Fast Track Data Warehouse - A reference configuration optimized for data
warehousing. This saves an organization from having to commit resources to
configure and build the server hardware. Fast Track Data Warehouse hardware is
tested for data warehousing, which eliminates guesswork and is designed to save
you months of configuration, setup, testing and tuning. You just need to install
the OS and SQL Server
Appliances - Microsoft has made available SQL Server appliances (SMP and MPP)
that allow customers to deploy data warehouse (DW), business intelligence (BI)
and database consolidation solutions in a very short time, with all the
components pre-configured and pre-optimized. These appliances include all the
hardware, software and services for a complete, ready-to-run, out-of-the-box,
high-performance, energy-efficient solution

Data Warehouse Fast Track for SQL Server 2014
A workload-specific database system design and validation program for Microsoft partners and customers

[Diagram: Windows Server 2012 R2 and SQL Server 2014, with processors, networking, servers, and storage]

Software
SQL Server 2014 Enterprise
Windows Server 2012 R2
Database configuration
Workload-specific
Database architecture
SQL Server settings
Windows Server settings
Performance guidance

Hardware system design
Tight specifications for servers, storage, and networking
Resource balanced and validated
Latest-generation servers and storage, including solid-state disks (SSDs)

Options for data warehouse solutions
Balancing flexibility and choice

[Figure: three options compared on time to solution (HIGH to LOW) and price (HIGH); the work stages shown are installation, configuration, and tuning and optimization]

By yourself
Existing or procured hardware and support
Procured software and support
Offerings: SQL Server 2014, Windows Server 2012 R2, System Center 2012 SP1

With a reference architecture
Existing or procured hardware and support
Procured software and support
Offerings: Private Cloud Fast Track, Data Warehouse Fast Track

With an appliance
Procured appliance and support
Offerings: Data Warehouse Fast Track, Analytics Platform System

Optional, if you have hardware already

Data Warehouse Fast Track
Microsoft-validated reference architecture advantages

Faster deployment
Reduced risk
Flexibility and choice

Vendors with 2014 Fast Track Appliances

Dell
EMC
HP/SanDisk
Lenovo
NEC
Tegile

Data Warehouse vs Data Mart


Data Warehouse: A single organizational repository of enterprise-wide data across many or all subject areas

Holds multiple subject areas
Holds very detailed information
Works to integrate all data sources
Feeds data marts

Data Mart: Subset of the data warehouse that is usually oriented to a specific subject (finance, marketing, sales)

The logical combination of all the data marts is a data warehouse

In short, a data warehouse contains many subject areas, and a data mart contains just one of those subject areas

Kimball and Inmon Methodologies


Two approaches for building data warehouses

Kimball and Inmon Myths


Myth: Kimball is a bottom-up approach without enterprise focus
Really top-down: BUS matrix (choose business processes/data sources), conformed dimensions, MDM

Myth: Inmon requires a ton of up-front design that takes a long time
Inmon says to build the DW iteratively, not with a big bang approach (p. 91 BDW, p. 21 Imhoff)

Myth: Star schema data marts are not allowed in Inmon's model
Inmon says they are good for direct end-user access of data (p. 365 BDW), good for data marts (p. 12 TTA)

Myth: Very few companies use the Inmon method
Survey had 39% Inmon vs 26% Kimball. Many have an EDW

Myth: The Kimball and Inmon architectures are incompatible
They can work together to provide a better solution

Kimball and Inmon Methodologies


Relational (Inmon) vs Dimensional (Kimball)
Relational Modeling:
Entity-Relationship (ER) model
Normalization rules
Many tables using joins
History tables, natural keys
Good for indirect end-user access of data

Dimensional Modeling:
Facts and dimensions, star schema
Fewer tables but have duplicate data (de-normalized)
Easier for users to understand (but strange for IT people used to relational)
Slowly changing dimensions, surrogate keys
Good for direct end-user access of data
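
A minimal sketch of the dimensional model described above, written in Python against an in-memory SQLite database so that it runs anywhere. SQLite only stands in for SQL Server here, and every table and column name (dim_product, dim_date, fact_sales, and so on) is a hypothetical chosen for illustration, not something from the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,  -- surrogate key
        product_name TEXT,
        category TEXT                     -- de-normalized onto the product row
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        year INTEGER,
        month INTEGER
    );
    CREATE TABLE fact_sales (
        product_key INTEGER REFERENCES dim_product (product_key),
        date_key INTEGER REFERENCES dim_date (date_key),
        sales_amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'Road Bike', 'Bikes'), (2, 'Helmet', 'Accessories');
    INSERT INTO dim_date VALUES (20140101, 2014, 1), (20140201, 2014, 2);
    INSERT INTO fact_sales VALUES
        (1, 20140101, 1000.0), (1, 20140201, 500.0), (2, 20140101, 50.0);
""")

def sales_by_category(conn):
    """Slice the fact table by two dimensions with one join per dimension."""
    return conn.execute("""
        SELECT p.category, d.year, SUM(f.sales_amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """).fetchall()

result = sales_by_category(conn)
```

The query touches one fact table and joins each dimension exactly once, which is the shape that tends to make a star schema easier for end users to query directly.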

Relational Model vs Dimensional Model


[Figure: a relational model diagram shown beside a dimensional model diagram]

If you are a business user, which model is easier to use?

Kimball and Inmon Methodologies


Kimball:
Logical data warehouse (BUS), made up of subject areas (data marts)
Business driven, users have active participation
Decentralized data marts (not required to be a separate physical data store)
Independent dimensional data marts optimized for reporting/analytics
Integrated via Conformed Dimensions (provides consistency across data sources)
2-tier (data mart, cube), less ETL, no data duplication

Inmon:
Enterprise data model (CIF) that is an enterprise data warehouse (EDW)
IT driven, users have passive participation
Centralized atomic normalized tables (off limits to end users)
Later create dependent data marts that are separate physical subsets of data and can be used for multiple purposes
Integration via enterprise data model

Kimball Model
[Diagram: DW Bus Architecture. Components shown: OLTP data sources; staging (Staging Areas 1-3); the data warehouse (star schema subject areas) holding Data Mart 1 and Data Mart 2 with atomic data; a dimensionalized view; cube processing into multidimensional models; and the reporting layer. SSIS moves data between the stages.]

Why staging: Limit source contention (ELT), Recoverability, Backup, Auditing

Inmon Model
[Diagram: Corporate Information Factory (CIF). Components shown: OLTP data sources; staging (Staging Areas 1-3); the data warehouse (normalized) with atomic data; Data Mart 1 (normalized) and Data Mart 2 (normalized); a dimensionalized view; cube processing into multidimensional and tabular models; and the reporting layer. SSIS moves data between the stages.]

Reasons to add an Enterprise Data Warehouse

Single version of the truth
May make building dimensions easier using lightly denormalized tables in the EDW
instead of going directly from the OLTP source
A normalized EDW results in enterprise-wide consistency, which makes it easier
to spin off the data marts at the expense of duplicated data
Less daily ETL refresh and reconciliation if you have many sources and many data
marts in multiple databases
One place to control data (no duplication of effort and data)
Reason not to: If you have only a few sources that need reporting quickly

Which model to use?


Models are not that different, having become similar over the years, and can
complement each other
It boils down to this: Inmon creates a normalized DW before creating a dimensional
data mart, and Kimball skips the normalized DW
With tweaks to each model, they look very similar (adding a normalized EDW to
Kimball, dimensionally structured data marts to Inmon)
Bottom line: Understand both approaches and pick parts from both for your
situation; there is no need to choose just one approach
BUT, no solution will be effective unless you possess soft skills (leadership,
communication, planning, and interpersonal relationships)

Hybrid Model
[Diagram: Corporate Information Factory (CIF) combined with the DW Bus Architecture. Components shown: OLTP data sources; a mirror of OLTP (subset); staging (Staging Areas 1-3); the EDW (normalized data warehouse) with atomic data; the data warehouse (star schema subject areas) holding Data Mart 1 and Data Mart 2 with atomic data; cube processing into multidimensional and tabular models; and the reporting layer. SSIS moves data between the stages.]

In the DW Bus Architecture, each data mart could be a schema (broken out by business process subject
areas), all in one database. Another option is to have each data mart in its own database with all databases
on one server or spread among multiple servers. Also, the staging areas, CIF, and DW Bus can all be on the
same powerful server (MPP)
Advice: Use SQL Server Views to interface between each level in the model
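
One way to picture the advice about views, as a hedged sketch rather than the presenter's implementation: the next layer reads only from a view, so staging tables can be renamed or restructured without breaking what sits above them. The example uses Python with an in-memory SQLite database as a stand-in for SQL Server, and the names stg_cust, v_customer, and their columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging table keeps the source system's cryptic column names.
    CREATE TABLE stg_cust (cust_id INTEGER, cust_nm TEXT, actv_flg INTEGER);
    INSERT INTO stg_cust VALUES (1, 'Contoso', 1), (2, 'Fabrikam', 0);

    -- The view is the contract the next layer depends on.
    CREATE VIEW v_customer AS
        SELECT cust_id AS customer_id, cust_nm AS customer_name
        FROM stg_cust
        WHERE actv_flg = 1;
""")

def active_customers(conn):
    """Downstream loads query the view, never the staging table directly."""
    return conn.execute(
        "SELECT customer_id, customer_name FROM v_customer ORDER BY customer_id"
    ).fetchall()

customers = active_customers(conn)
```

If the staging table later changes shape, only the view definition needs to be updated; the query in active_customers stays the same.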

Kimball Methodology

From: Kimball's The Microsoft Data Warehouse Toolkit

Kimball defines a development lifecycle, whereas Inmon is just about the data warehouse (not
how it is used)

Populating a Data Warehouse


Determine frequency of data pull (daily, weekly, etc.)
Full Extraction: All data (usually dimension tables)
Incremental Extraction: Only data changed since the last run (fact tables)

How to determine data that has changed:
Timestamp - Last Updated
Change Data Capture (CDC)
Partitioning by date
Triggers on tables
MERGE SQL Statement
Column DEFAULT value populated with date

Online Extraction: Data from source. First create copy of source:
Replication
Database Snapshot
Availability Groups

Offline Extraction: Data from flat file
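
The timestamp approach to incremental extraction can be sketched as follows. This is an illustration under stated assumptions, not the deck's own code: it uses Python with an in-memory SQLite database in place of a real source system, and the orders table, its last_updated column, and both helper functions are hypothetical. The idea is to remember the newest timestamp seen in the previous run and pull only rows changed after it.

```python
import sqlite3

def extract_changed_rows(source, last_run):
    """Return rows whose last_updated is later than the previous run's mark."""
    return source.execute(
        "SELECT order_id, amount, last_updated FROM orders "
        "WHERE last_updated > ? ORDER BY last_updated",
        (last_run,),
    ).fetchall()

def next_high_water_mark(rows, last_run):
    """Newest timestamp just extracted, or the old mark if nothing changed."""
    return max((row[2] for row in rows), default=last_run)

# Hypothetical source table; ISO 8601 strings sort chronologically as text.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER, amount REAL, last_updated TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2014-05-06T09:00:00"),
        (2, 25.0, "2014-05-07T14:30:00"),
        (3, 40.0, "2014-05-08T08:15:00"),
    ],
)

last_run = "2014-05-07T00:00:00"
changed = extract_changed_rows(source, last_run)
last_run = next_high_water_mark(changed, last_run)
```

A timestamp column does not reveal deleted rows, which is one reason the list above also includes options such as Change Data Capture and triggers.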

ETL vs ELT

Extract, Transform, and Load (ETL)


Transform while hitting source system
No staging tables
Processing done by ETL tools (SSIS)
Extract, Load, Transform (ELT)
Uses staging tables
Processing done by target database engine (SSIS: Execute T-SQL Statement
task instead of Data Flow Transform tasks)
Use for big volumes of data
Use when source and target databases are the same
Use with the Analytics Platform System (APS)
ELT is better since database engine is more efficient than SSIS
Best use of database engine: Transformations
Best use of SSIS: Data pipeline and workflow management
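
The ELT pattern above can be sketched in a few lines, assuming Python with an in-memory SQLite database standing in for the target engine. The first step loads raw rows into a staging table untouched, and the second runs one set-based SQL statement so the database does the transformation. The table names stg_sales and sales_by_region and both functions are hypothetical.

```python
import sqlite3

def load_staging(conn, raw_rows):
    """Extract + Load: copy source rows into staging exactly as received."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stg_sales "
        "(sale_date TEXT, region TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", raw_rows)

def transform_in_database(conn):
    """Transform: clean and aggregate with a single set-based statement."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region "
        "(region TEXT PRIMARY KEY, total_amount REAL)"
    )
    conn.execute("DELETE FROM sales_by_region")
    conn.execute("""
        INSERT INTO sales_by_region (region, total_amount)
        SELECT UPPER(TRIM(region)), SUM(CAST(amount AS REAL))
        FROM stg_sales
        GROUP BY UPPER(TRIM(region))
    """)

conn = sqlite3.connect(":memory:")
load_staging(conn, [
    ("2014-05-07", " west ", "10.5"),
    ("2014-05-08", "West", "4.5"),
    ("2014-05-08", "east", "3"),
])
transform_in_database(conn)
totals = conn.execute(
    "SELECT region, total_amount FROM sales_by_region ORDER BY region"
).fetchall()
```

In an SSIS package the second step would correspond to an Execute T-SQL Statement task, with SSIS itself handling only the pipeline and workflow.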

Surrogate Keys
Surrogate Keys: Unique identifiers not derived from the source system

Embedded in fact tables as foreign keys to dimension tables
Allows integrating data from multiple source systems
Protects from source key changes in the source system
Allows for slowly changing dimensions
Allows you to create rows in the dimension that don't exist in the source (-1 in
fact table for unassigned)
Improves performance (joins) and database size by using integer type instead
of text
Implemented via identity column on dimension tables
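
A minimal sketch of the surrogate key points above, assuming Python with an in-memory SQLite database. SQLite's AUTOINCREMENT column stands in for a SQL Server identity column, and dim_customer, its columns, and the helper functions are hypothetical. The dimension generates its own integer key, keeps the source system's key only as an attribute, and reserves -1 for the unassigned member.

```python
import sqlite3

def create_customer_dimension(conn):
    conn.executescript("""
        CREATE TABLE dim_customer (
            customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
            source_system TEXT,
            source_id TEXT,                                  -- natural key
            customer_name TEXT,
            UNIQUE (source_system, source_id)
        );
        -- Row that exists only in the warehouse, for facts with no customer.
        INSERT INTO dim_customer
            (customer_key, source_system, source_id, customer_name)
        VALUES (-1, 'N/A', 'N/A', 'Unassigned');
    """)

def add_customer(conn, source_system, source_id, name):
    """Insert a dimension row and return the generated surrogate key."""
    cur = conn.execute(
        "INSERT INTO dim_customer (source_system, source_id, customer_name) "
        "VALUES (?, ?, ?)",
        (source_system, source_id, name),
    )
    return cur.lastrowid

def lookup_customer_key(conn, source_system, source_id):
    """Map a source key to its surrogate key, or -1 when there is no match."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer "
        "WHERE source_system = ? AND source_id = ?",
        (source_system, source_id),
    ).fetchone()
    return row[0] if row else -1

conn = sqlite3.connect(":memory:")
create_customer_dimension(conn)
crm_key = add_customer(conn, "CRM", "C-100", "Contoso")
erp_key = add_customer(conn, "ERP", "100", "Contoso Ltd")
```

Because the lookup is by source system plus source key, two systems that reuse the same identifier still get distinct surrogate keys, and a fact row with no matching customer falls back to -1.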

SSAS Cubes
Reasons to report off cubes instead of the data warehouse:

Aggregating (summarizing) the data for performance
Multidimensional analysis: slice, dice, drill down, show details
Can store hierarchies
Built-in support for KPIs
Security: You can use the security settings to give end users access to only those
parts (slices) of the cube relevant to them
Built-in advanced time calculations, e.g. 12-month rolling average
Easily use Excel to view data via PivotTables
Automatically handles Slowly Changing Dimensions (SCD)
Required for PerformancePoint, Power View, and SSAS data mining

Data Warehouse Architecture

Modern Data Warehouse


Think about future needs:
Increasing data volumes
Real-time performance
New data sources and types (Hadoop)
Cloud-born data

Solution: Microsoft Analytics Platform System:
Scalable
MPP architecture
HDInsight
PolyBase

Follow-on presentation: Building a Big Data Solution (Building an Effective Data Warehouse Architecture with Hadoop, the cloud, and MPP)

Resources

Data Warehouse Architecture - Kimball and Inmon methodologies: http://bit.ly/SrzNHy
SQL Server 2012: Multidimensional vs tabular: http://bit.ly/SrzX1x
Data Warehouse vs Data Mart: http://bit.ly/SrAi4p
Fast Track Data Warehouse Reference Architecture for SQL Server 2014: http://bit.ly/1xuX9m6
Complex reporting off a SSAS cube: http://bit.ly/SrAEYw
Surrogate Keys: http://bit.ly/SrAIrp
Normalizing Your Database: http://bit.ly/SrAHnc
Difference between ETL and ELT: http://bit.ly/SrAKQa
Microsoft's Data Warehouse offerings: http://bit.ly/xAZy9h
Microsoft SQL Server Reference Architecture and Appliances: http://bit.ly/y7bXY5
Methods for populating a data warehouse: http://bit.ly/SrARuZ
Great white paper: Microsoft EDW Architecture, Guidance and Deployment Best Practices: http://bit.ly/SrAZug
End-User Microsoft BI Tools - Clearing up the confusion: http://bit.ly/SrBMLT
Microsoft Appliances: http://bit.ly/YQIXzM
Why You Need a Data Warehouse: http://bit.ly/1fwEq0j
Data Warehouse Maturity Model: http://bit.ly/xl4mGM
Operational Data Store (ODS) Defined: http://bit.ly/1H6wnE7
The Modern Data Warehouse: http://bit.ly/1xuX4Py

Q&A

James Serra, Big Data Evangelist


Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck will be posted under the Presentations tab)