Objectives
Definition of terms
Reasons for the information gap between information needs and availability
Reasons for the need for data warehousing
Describe three levels of data warehouse architectures
List four steps of data reconciliation
Describe two components of the star schema
Estimate fact table size
Design a data mart
Introduction
Problem: a one-thing-at-a-time approach to development results in islands of information systems and uncoordinated, inconsistent databases. Most systems are developed to support operational processing, not decision making.
Solution: a data warehouse consolidates and integrates information from many internal and external sources and arranges it in a meaningful format for making accurate and timely business decisions.
Data Warehouse:
Definition
A subject-oriented, integrated, time-variant, nonupdatable collection of data used in support of management decision-making processes
Subject-oriented: e.g., customers, patients, students, products
Integrated: consistent naming conventions, formats, and encoding structures; drawn from multiple data sources
Time-variant: can study trends and changes
Nonupdatable: read-only, periodically refreshed; never deleted
Companywide view
Data in operational systems are often fragmented and inconsistent, distributed across a variety of incompatible hardware and software platforms. Typical problems:
Inconsistent key structures
Synonyms
Free-form versus structured fields
Inconsistent data values
Missing data
Operational system:
Runs the business in real time, based on current data
Examples: sales order processing, reservation systems, patient registration
Informational system:
Supports decision making based on historical point-in-time and prediction data
Examples: sales trend analysis, customer segmentation, human resources planning
Physical architectures:
Generic Two-Level Architecture
Independent Data Mart
Three-Layer Data Architecture
Dependent Data Mart and Operational Data Store
Logical Data Mart and Real-Time Data Warehouse
2-layers:
Source data systems (operational databases)
Data and metadata storage area (data warehouse)
Data are extracted from internal and external source systems.
Data are transformed and integrated before loading into the data warehouse.
The data warehouse is organized for decision support.
Users access the data warehouse by means of query languages or analytical tools.
Data Mart:
A data warehouse that is limited in scope and supports decision making for a particular business function or end-user group
Independent: filled with data extracted directly from the operational environment
Dependent: filled with data derived from the enterprise data warehouse
Reasons for building independent data marts:
Firms faced with short-term, expedient business objectives
Lower cost and time commitment
Organizational and political difficulties in reaching a common view of enterprise data in a central data warehouse
Technology constraints
Data marts:
Mini-warehouses, limited in scope
Separate ETL for each independent data mart
Data access complexity due to multiple data marts
A separate ETL process is developed for each data mart.
Data marts may not be consistent with one another.
There is no capability to drill down into greater detail or into related facts in other data marts.
Scaling costs are high: each new application creates a new data mart and repeats the ETL steps.
3-layers:
Source data systems (operational databases)
Operational data store
Data and metadata storage (enterprise data warehouse and dependent data marts)
Enterprise data warehouse (EDW): a centralized, integrated data warehouse that is the single source of all data made available to decision support
Operational data store (ODS): an integrated, subject-oriented, updatable, current-valued, enterprise-wide, detailed database designed to serve operational users as they do decision support processing. It holds current, detailed data for drilling down and also serves as a staging area for loading data into the EDW.
Figure 11-4 Dependent data mart with operational data store: a three-level architecture
ODS provides option for obtaining current data
Single ETL for enterprise data warehouse (EDW)
Simpler data access
Dependent data marts loaded from EDW
Real-time data warehouse: accepts a near-real-time feed of transactional data from operational systems, analyzes warehouse data, and in near-real-time relays business rules to the data warehouse and operational systems so that immediate action can be taken in response to business events
Logical data marts are not physically separate databases, but different relational views of a data warehouse
Characteristics:
Data are moved into the data warehouse rather than to a separate staging area
New data marts can be created quickly
Data marts are always up to date
Figure 11-5 Logical data mart and real time warehouse architecture
Near real-time ETL for Data Warehouse
Data marts are NOT separate databases, but logical views of the data warehouse
Easier to create new data marts
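The idea of a logical data mart as a relational view can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under assumed names: the sales table, its columns, and the east_sales_mart view are hypothetical and do not come from the slides.

```python
import sqlite3

# Hypothetical warehouse table; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, "
    "product TEXT, sale_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, product, sale_date, amount) VALUES (?, ?, ?, ?)",
    [
        ("East", "Widget", "2024-01-05", 100.0),
        ("East", "Gadget", "2024-01-09", 250.0),
        ("West", "Widget", "2024-01-11", 75.0),
    ],
)

# The "data mart" is only a view: no data is copied out of the warehouse.
conn.execute(
    "CREATE VIEW east_sales_mart AS "
    "SELECT product, SUM(amount) AS total_amount "
    "FROM sales WHERE region = 'East' GROUP BY product"
)


def east_mart_rows(connection):
    """Return the current contents of the logical data mart."""
    query = "SELECT product, total_amount FROM east_sales_mart ORDER BY product"
    return connection.execute(query).fetchall()


before = east_mart_rows(conn)

# A new warehouse row is visible through the view immediately.
conn.execute(
    "INSERT INTO sales (region, product, sale_date, amount) "
    "VALUES ('East', 'Widget', '2024-01-12', 40.0)"
)
after = east_mart_rows(conn)
```

Because the view is evaluated against the warehouse at query time, the mart needs no separate load step, which is why such marts are quick to create and always up to date.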
Operational Data
Data stored in the various operational systems throughout the organization
Reconciled Data
The data stored in the enterprise data warehouse
Single, authoritative source for all decision support applications
Generally not intended for direct access by end users
Derived Data
The data stored in the data marts
Selected, formatted, and aggregated for end-user decision-support applications
Data Characteristics
Status Data
Data representing the state of some part of the database at some point in time
Examples: the before image and after image of a database record that has been altered
Event Data
Data describing a database transaction (create, update, or delete)
Typically stored only temporarily in logs and then deleted (or archived)
Status vs. Event Data
Status (before image) -> Event -> Status (after image)
Transient Data
Data in which changes to existing records are written over previous records, destroying the previous record
Typical of operational systems
Periodic Data
Data that are never physically altered or deleted once added to the data store
Transient vs. Periodic Data
With transient data, changes to existing records are written over previous records, thus destroying the previous data content.
Periodic data are never physically altered or deleted once they have been added to the store.
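The contrast between transient and periodic data can be sketched in a few lines of Python. This is only an illustration: the account balances, the record_balance helper, and the balance_as_of helper are hypothetical names chosen for the example.

```python
from datetime import date

# Transient store: one slot per key, so an update destroys the old value.
transient = {}

# Periodic store: append-only list of (key, value, effective_date) records.
periodic = []


def record_balance(account, balance, effective):
    """Apply the same change to both stores to contrast their behaviour."""
    transient[account] = balance                    # overwrite in place
    periodic.append((account, balance, effective))  # add, never alter


def balance_as_of(account, as_of):
    """Look up the most recent periodic record on or before a given date."""
    matches = [r for r in periodic if r[0] == account and r[2] <= as_of]
    return max(matches, key=lambda r: r[2])[1] if matches else None


record_balance("A-100", 500.0, date(2024, 1, 1))
record_balance("A-100", 350.0, date(2024, 2, 1))

current = transient["A-100"]                              # only the latest value survives
january = balance_as_of("A-100", date(2024, 1, 15))       # history is still available
```

Only the periodic store can answer a point-in-time question such as the January balance; the transient store has already lost it.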
New descriptive attributes
New business activity attributes
New classes of descriptive attributes
Descriptive attributes become more refined
Descriptive data are related to one another
New source of data
Typical operational data:
Transient: not historical
Not normalized (perhaps due to denormalization for performance)
Restricted in scope: not comprehensive
Sometimes poor quality: inconsistencies and errors
After ETL, reconciled data should be:
Detailed: not summarized yet
Historical: periodic
Normalized: 3rd normal form or higher
Comprehensive: enterprise-wide perspective
Timely: data should be current enough to assist decision making
Quality controlled: accurate with full integrity
Capture: extracting the relevant data from the source files used to fill the EDW
The relevant data are typically a subset of all the data contained in the operational systems
Two types of capture:
Static - captures a snapshot of the required source data at a point in time, for initial EDW loading
Incremental - for ongoing updates of an existing EDW; captures only the changes that have occurred in the source data since the last capture
Capture/Extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Incremental extract = capturing changes that have occurred since the last static extract
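The two capture types can be sketched as follows. This is a minimal illustration that assumes every source row carries a last-modified timestamp; the modified field, the sample rows, and the function names are hypothetical.

```python
from datetime import datetime

# Hypothetical source rows; "modified" is an assumed last-change timestamp.
source = [
    {"id": 1, "name": "Ann", "modified": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "name": "Bob", "modified": datetime(2024, 1, 2, 9, 0)},
]


def static_extract(rows):
    """Snapshot of all required source rows at a point in time."""
    return [dict(r) for r in rows]


def incremental_extract(rows, last_capture):
    """Only the rows changed since the previous capture."""
    return [dict(r) for r in rows if r["modified"] > last_capture]


initial_load = static_extract(source)
last_capture = datetime(2024, 1, 3, 0, 0)

# After the initial load, one row changes and one row is added.
source[1] = {"id": 2, "name": "Robert", "modified": datetime(2024, 1, 4, 8, 0)}
source.append({"id": 3, "name": "Cy", "modified": datetime(2024, 1, 5, 8, 0)})

changes = incremental_extract(source, last_capture)
changed_ids = [r["id"] for r in changes]
```

A timestamp comparison like this does not see rows that were deleted from the source, so a real incremental capture would need another mechanism, such as a change log, to detect deletions.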
Data scrubbing (cleansing) removes or corrects errors and inconsistencies present in operational data values (e.g., inconsistently spelled names, impossible birth dates, out-of-date zip codes, missing data).
It may use pattern recognition and other artificial intelligence techniques.
It is only part of the solution to poor-quality data (see next slide).
A formal program in total quality management (TQM) should be implemented. It focuses on defect prevention rather than defect correction.
time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Transform: converts selected data from the format of the source system to the format of the EDW
Record-Level Functions
Selection - selecting data according to predefined criteria (we can use SQL: SELECT)
Joining - consolidating related data from multiple sources (SQL: join tables together if the source data are relational)
Normalization - discussed in Chapter 5
Aggregation - summarizing detailed data (for data marts)
Field-Level Functions
Single-field transformations
Multi-field transformations
Transform = convert data from format of operational system to format of data warehouse
Table lookup: another …
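The record-level and field-level functions named above can be sketched together in plain Python. This is an illustration under assumed data: the STATE_LOOKUP table, the row layout, and the transform function are hypothetical, and the lookup stands in for one kind of single-field transformation.

```python
from collections import defaultdict

# Hypothetical lookup table mapping source-system codes to warehouse values.
STATE_LOOKUP = {"01": "AL", "02": "AK", "04": "AZ"}

source_rows = [
    {"store": "S1", "state_code": "01", "amount": 120.0, "status": "posted"},
    {"store": "S1", "state_code": "01", "amount": 80.0, "status": "posted"},
    {"store": "S2", "state_code": "04", "amount": 50.0, "status": "void"},
    {"store": "S3", "state_code": "99", "amount": 30.0, "status": "posted"},
]


def transform(rows):
    """Select posted rows, translate codes by lookup, and sum by store."""
    totals = defaultdict(float)
    rejected = []
    for row in rows:
        if row["status"] != "posted":                # record-level: selection
            continue
        state = STATE_LOOKUP.get(row["state_code"])  # field-level: table lookup
        if state is None:
            rejected.append(row)                     # unknown code: log it, do not guess
            continue
        totals[(row["store"], state)] += row["amount"]  # record-level: aggregation
    return dict(totals), rejected


totals, rejected = transform(source_rows)
```

Rows with codes missing from the lookup table are set aside rather than loaded, which matches the error detection and logging role mentioned for the reconciliation steps.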
The last step in data reconciliation is to load the selected data into the EDW and to create the desired indexes.
Two modes for loading data:
Refresh Mode - employs bulk writing or rewriting of the data at periodic intervals
Update Mode - only changes in the source data are written to the data warehouse
Typically used for ongoing data warehouse maintenance
To support the periodic nature of warehouse data, these new records are usually written to the data warehouse without overwriting or deleting previous records
Load/Index = place transformed data into the warehouse and create indexes
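The difference between refresh mode and update mode can be sketched with an in-memory list standing in for the warehouse table. The function names, the load_date field, and the sample rows are assumptions made for this example; index creation is omitted.

```python
def refresh_load(warehouse, snapshot):
    """Refresh mode: bulk rewrite of the target from a full snapshot."""
    warehouse.clear()
    warehouse.extend(dict(row) for row in snapshot)
    return warehouse


def update_load(warehouse, changes, load_date):
    """Update mode: append only the changed rows, stamped with the load date.

    Earlier versions are kept, so the warehouse data stay periodic.
    """
    for row in changes:
        warehouse.append({**row, "load_date": load_date})
    return warehouse


warehouse = []
refresh_load(warehouse, [{"id": 1, "price": 10.0}, {"id": 2, "price": 20.0}])
update_load(warehouse, [{"id": 2, "price": 22.0}], "2024-02-01")

# Both the original and the changed price for id 2 remain available.
history_for_2 = [r["price"] for r in warehouse if r["id"] == 2]
```

Update mode writes the new version next to the old one instead of replacing it, which is what keeps the history queryable.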
Derived Data
Recall that derived data refers to the data stored in data marts
This is the layer that users typically interact with through decision-support applications.
These data have typically been designed for use by particular groups of end users or specific individuals.
Objectives
Ease of use for decision support applications
Fast response to predefined user queries
Customized data for particular target audiences
Ad-hoc query support
Data mining capabilities
Characteristics
Both detailed data and aggregate data exist
Detailed data are often (but not always) periodic
Aggregate data are formatted to respond quickly to predetermined (or common) queries
Most common data model = star schema (also called dimensional model)
Star schema:
A simple database design in which dimensional data are separated from fact or event data
Fact table
Contains factual or quantitative data about a business, such as units sold or orders booked
Dimension table
Holds descriptive data about the subjects of the business
Source of attributes used to qualify, categorize, or summarize facts
Examples: product, customer, period
Each dimension table has a one-to-many relationship to the central fact table; the fact table is an n-ary associative entity that links the various dimensions
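A small star schema can be sketched with Python's built-in sqlite3 module. The table names, columns, and sample rows below are hypothetical; the point is only the shape: one fact table whose key combines the surrogate keys of the surrounding dimension tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive data, one surrogate key each.
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT, category TEXT);
    CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, city TEXT);

    -- Fact table: quantitative data, keyed by the combination of dimension keys.
    CREATE TABLE sales_fact (
        product_key  INTEGER REFERENCES product_dim(product_key),
        period_key   INTEGER REFERENCES period_dim(period_key),
        store_key    INTEGER REFERENCES store_dim(store_key),
        units_sold   INTEGER,
        dollars_sold REAL,
        PRIMARY KEY (product_key, period_key, store_key)
    );

    INSERT INTO product_dim VALUES (1, 'Oak desk', 'Furniture'), (2, 'Desk lamp', 'Lighting');
    INSERT INTO period_dim  VALUES (1, 2024, 1), (2, 2024, 2);
    INSERT INTO store_dim   VALUES (1, 'Austin'), (2, 'Boston');
    INSERT INTO sales_fact  VALUES (1, 1, 1, 3, 900.0), (2, 1, 1, 10, 250.0),
                                   (1, 2, 2, 1, 300.0);
    """
)

# Dimension attributes qualify (WHERE) and categorize (GROUP BY) the facts.
rows = conn.execute(
    """
    SELECT p.category, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.product_key = f.product_key
    JOIN period_dim  AS t ON t.period_key  = f.period_key
    WHERE t.year = 2024
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()
```

Each fact row points to exactly one row in every dimension, while a dimension row may be referenced by many fact rows, which is the one-to-many relationship described above.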
Excellent for ad-hoc queries, but bad for online transaction processing
Dimension table keys must be surrogate (nonintelligent and non-business related).
Grain of the fact table:
Transactional grain: finest level
Aggregated grain: more summarized
Finer grains give better market basket analysis capability
The finer the grain, the more dimensions exist and the more rows the fact table has
Duration of the database:
Natural duration: 13 months or 5 quarters
Financial institutions may need longer duration
Older data is more difficult to source and cleanse
Both grain and duration affect fact table size.
Two steps to estimate the number of rows:
1. Estimate the number of possible values for each dimension associated with the fact table.
2. Multiply the values obtained in the first step after making any necessary adjustments.
If the size of each field is known, we can estimate the storage size on disk.
Number of stores = 1,000
Number of products = 10,000
Number of periods = 24 months
Suppose that, on average, 50% of all products appear on sales records in a given month.
Total rows = 1,000 stores * 5,000 active products * 24 months = 120,000,000 rows
6 fields, 4 bytes/field
Total size = 120,000,000 * 6 * 4 = 2,880,000,000 bytes = 2.88 gigabytes
Suppose 20% of products are active on a given day.
Total rows = 1,000 stores * 2,000 products * 720 days = 1,440,000,000 rows
Total size = 1,440,000,000 * 6 * 4 = 34,560,000,000 bytes = 34.56 gigabytes
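The two estimates above follow the same arithmetic, which can be written as a small helper. The function name and its parameters are illustrative; the figures passed in are the ones from the monthly and daily examples, with 6 fields of 4 bytes each.

```python
def estimate_fact_table(dimension_sizes, active_fraction=1.0,
                        fields=6, bytes_per_field=4):
    """Return (rows, bytes) for a fact table.

    dimension_sizes: number of possible values per dimension.
    active_fraction: adjustment for combinations that actually occur,
                     e.g. the share of products sold in a period.
    """
    rows = 1
    for size in dimension_sizes:
        rows *= size
    rows = round(rows * active_fraction)
    return rows, rows * fields * bytes_per_field


# Monthly grain: 1,000 stores, 10,000 products (50% active), 24 months.
monthly_rows, monthly_bytes = estimate_fact_table([1_000, 10_000, 24], 0.5)

# Daily grain: same stores and products (20% active), 720 days.
daily_rows, daily_bytes = estimate_fact_table([1_000, 10_000, 720], 0.2)
```

Moving from a monthly to a daily grain multiplies the row count by 12 in this example, even though a smaller share of products is active on any single day.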
Identify subjects of the data mart
Identify dimensions and facts
Indicate how data is derived from enterprise data warehouses, including derivation rules
Indicate how data is derived from operational data store, including derivation rules
Identify available reports and predefined queries
Identify data analysis techniques (e.g. drill-down)
Identify responsible people
Tools available to query and analyze the data stored in data warehouses and data marts include:
Traditional query and reporting tools
On-line analytical processing (OLAP)
Data mining tools
Data visualization tools
On-line analytical processing (OLAP): the use of a set of graphical tools that provide users with multidimensional views of their data and allow them to analyze the data using simple windowing techniques
OLAP Operations
Cube slicing: come up with a 2-D view of the data
Drill-down: going from summary to more detailed views
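Both operations can be sketched on a tiny three-dimensional cube held in a Python dictionary. The dimensions (product, region, month), the cell values, and the function names are hypothetical and serve only to show the mechanics.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, region, month) -> units sold.
cube = {
    ("Shoes", "East", "Jan"): 10, ("Shoes", "East", "Feb"): 12,
    ("Shoes", "West", "Jan"): 7,  ("Shoes", "West", "Feb"): 9,
    ("Hats",  "East", "Jan"): 4,  ("Hats",  "East", "Feb"): 6,
    ("Hats",  "West", "Jan"): 3,  ("Hats",  "West", "Feb"): 5,
}


def slice_cube(cells, product):
    """Fix one dimension value to get a 2-D (region x month) view."""
    return {(r, m): v for (p, r, m), v in cells.items() if p == product}


def summarize_by_product(cells):
    """Summary level: total units per product."""
    totals = defaultdict(int)
    for (p, _r, _m), v in cells.items():
        totals[p] += v
    return dict(totals)


def drill_down(cells, product):
    """Detail behind one summary cell: units per region for that product."""
    totals = defaultdict(int)
    for (p, r, _m), v in cells.items():
        if p == product:
            totals[r] += v
    return dict(totals)


shoes_slice = slice_cube(cube, "Shoes")
summary = summarize_by_product(cube)
shoes_by_region = drill_down(cube, "Shoes")
```

The drill-down result for a product always adds up to that product's summary figure, since it only breaks the same cells out by one more dimension.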
Summary report
Starting with summary data, users can obtain details for particular cells
Data mining: knowledge discovery using a blend of statistical, AI, and computer graphics techniques
Goals:
Explain observed events or conditions
Confirm hypotheses
Explore data for new or unexpected relationships
Techniques:
Statistical regression
Decision tree induction
Clustering and signal processing
Affinity
Sequence association
Case-based reasoning
Rule discovery
Neural nets
Fractals