SCCS 453

Data Warehousing and Data Mining


Lecture 1
Overview of Data Warehousing and Data Mining

Songsri Tangsripairoj, Ph.D.


ccsts@mahidol.ac.th

Department of Computer Science


Faculty of Science, Mahidol University

SCCS 453 DW and DM Songsri Tangsripairoj, Ph.D. 1


Semester 2, Year 2006
Data Warehousing
p Evolution of Database Technology
p Evolution of Data Warehousing
p Operational and Data Warehouse System
p Operational Data and Informational Data
p OLTP and OLAP
p The Modern Data Warehouse
p Data Marts

Evolution of Database Technology
p 1960s:
n Data collection, database creation, IMS and network DBMS
p 1970s:
n Relational data model, relational DBMS implementation
p 1980s:
n RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
n Application-oriented DBMS (spatial, scientific, engineering, etc.)
p 1990s:
n Data mining, data warehousing, multimedia databases, and Web
databases
p 2000s:
n Stream data management and mining
n Data mining and its applications
n Web technology (XML, data integration) and global information systems
Evolution of Data Warehousing…
p Since the 1970s, organizations have gained competitive
advantage through systems that automate business
processes to offer more efficient and cost-effective
services to the customer

p This resulted in the accumulation of growing amounts of
data in operational systems
n Support day-to-day business operations such as order
processing, inventory control, out-patient billing, and so on.
n Provide online information and produce a variety of reports to
monitor and run the business.

…Evolution of Data Warehousing…
p In the 1990s, organizations typically had numerous
operational systems with overlapping and sometimes
contradictory definitions

p Organizations focused on ways to use operational data to
support decision-making, as a means of gaining
competitive advantage

p Operational systems were never designed to provide
strategic information
n Which product lines to expand
n Which markets to strengthen
Characteristics of Strategic Information
p Integrated
n Have a single enterprise-wide view
p Integrity
n Be accurate and conform to business rules
p Accessible
n Easily accessible with intuitive access paths and responsive for
analysis
p Credible
n Every business factor must have one and only one value
p Timely
n Information must be available within the stipulated time frame

The Information Crisis
p Organizations have lots of data
p Information technology resources and systems
are not effective at turning all that data into
useful strategic information

“Most companies are faced with an information crisis
not because of lack of sufficient data,
but because the available data is
not readily usable for strategic decision making.”
[Ponniah 2001]
Evolution of Data Warehousing…
p Organizations need to turn their archives of data
into a source of knowledge, so that a single
integrated / consolidated view of the
organization’s data is presented to the user

p Data warehousing is a new paradigm
n intended to provide strategic information
n deemed the solution to meet the requirements of a
system capable of supporting decision-making,
receiving data from multiple operational data sources

Data Warehousing
p a collection of decision support technologies,
p aimed at enabling the knowledge worker
(executive, manager, analyst) to make better
and faster decisions

Organizations’ Use of Data Warehousing
p Manufacturing (cost reduction and order shipment)
p Retail (customer loyalty and market planning)
p Financial (credit card analysis and risk analysis)
p Transportation (route profitability)
p Telecommunications (call analysis and fraud
detection)
p Utilities (power usage management)
p Government (manpower planning)

Operational System
p Get the data in
p Make the wheels of the organization turn
p Deal with one record at a time
p Users repeatedly perform the same operational tasks over and over

Data Warehouse System
p Get the information out
p Watch the wheels of the organization turn
p Hundreds or thousands of rows are searched and aggregated
p Users continuously change the kinds of questions they ask

Operational System
Get the data in
Making the wheels of business turn

p Take an order
p Process a claim
p Make a shipment
p Generate an invoice
p Receive cash
p Reserve an airline seat

Data Warehouse System
Get the information out
Watching the wheels of business turn
p Show me the top-selling products
p Show me the problem regions
p Tell me why (drill down)
p Let me see other data (drill across)
p Show the highest margins
p Alert me when a district sells below
target
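
The requests above can be sketched as aggregate queries. This is a minimal illustration using Python's built-in sqlite3 module; the `sales` table, its rows, and the per-district target are hypothetical and not part of the lecture.

```python
import sqlite3

# Hypothetical sales facts: (region, district, product, amount)
SALES = [
    ("North", "N1", "widget", 120.0),
    ("North", "N2", "widget", 40.0),
    ("North", "N2", "gadget", 30.0),
    ("South", "S1", "gadget", 200.0),
    ("South", "S1", "widget", 50.0),
]
TARGET_PER_DISTRICT = 100.0  # assumed target, for illustration only

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, district TEXT, product TEXT, amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", SALES)

# "Show me the top-selling products"
top_products = conn.execute(
    "SELECT product, SUM(amount) AS total FROM sales "
    "GROUP BY product ORDER BY total DESC"
).fetchall()

# Summary level: one total per region
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# "Tell me why" (drill down): the same measure, one level deeper, for one region
north_districts = conn.execute(
    "SELECT district, SUM(amount) FROM sales WHERE region = ? "
    "GROUP BY district ORDER BY district",
    ("North",),
).fetchall()

# "Alert me when a district sells below target"
below_target = conn.execute(
    "SELECT district, SUM(amount) FROM sales "
    "GROUP BY district HAVING SUM(amount) < ? ORDER BY district",
    (TARGET_PER_DISTRICT,),
).fetchall()
```

Every query here reads and aggregates many rows and changes none, which is the access pattern the slide attributes to the data warehouse system.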
Operational & Informational Data
p Operational / application-oriented / transactional data
n Support transactional functions, e.g., bank card withdrawals and
deposits
n Detailed, with controlled redundancy, updateable, and reflecting
current values
n Answer such questions as “How many gadgets were sold to
customer number 123 on Sep. 19?”

p Informational / decision-support data
n Organized around subjects, e.g., customer, product, etc.
n Summarized, redundant, non-updateable
n Answer such questions as “What three products resulted in the
most frequent calls to the hotline over the past quarter?”
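
The two sample questions can be contrasted in code. The sketch below uses Python's sqlite3 module; the table names, column names, rows, and the choice of July to September 2006 as "the past quarter" are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_lines (customer_id INTEGER, product TEXT,
                          qty INTEGER, order_date TEXT);
CREATE TABLE hotline_calls (product TEXT, call_date TEXT);
""")

# Hypothetical operational rows
conn.executemany("INSERT INTO order_lines VALUES (?, ?, ?, ?)", [
    (123, "gadget", 2, "2006-09-19"),
    (123, "gadget", 1, "2006-09-19"),
    (123, "widget", 5, "2006-09-19"),
    (456, "gadget", 4, "2006-09-19"),
    (123, "gadget", 7, "2006-09-20"),
])

# Hypothetical call log; the last five calls fall outside the quarter
calls = (
    [("gadget", "2006-07-03")] * 4
    + [("widget", "2006-08-11")] * 3
    + [("gizmo", "2006-09-02")] * 2
    + [("doohickey", "2006-09-15")]
    + [("gadget", "2006-05-01")] * 5
)
conn.executemany("INSERT INTO hotline_calls VALUES (?, ?)", calls)

# Operational question: narrow, detailed, about current transactions
gadgets_sold = conn.execute(
    "SELECT SUM(qty) FROM order_lines "
    "WHERE customer_id = ? AND product = ? AND order_date = ?",
    (123, "gadget", "2006-09-19"),
).fetchone()[0]

# Informational question: summarizes a subject over a period of history
top_three = conn.execute(
    "SELECT product, COUNT(*) AS n FROM hotline_calls "
    "WHERE call_date BETWEEN ? AND ? "
    "GROUP BY product ORDER BY n DESC, product LIMIT 3",
    ("2006-07-01", "2006-09-30"),
).fetchall()
```

The first query touches a handful of rows selected by key values; the second scans a quarter of history and ranks the result.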
Operational & Informational Data
p Operational data supports online transaction
processing (OLTP) applications

p Informational data supports online analytical
processing (OLAP) applications

OLTP & OLAP
p Operational System: OLTP applications on a database of operational data
p Data Warehouse System: OLAP applications on a data warehouse of informational data

OLTP Apps. & OLAP Apps.
Data Content
p OLTP: Hold up-to-date, detailed, individual, application-oriented data
p OLAP: Hold historical, summarized, consolidated, subject-oriented data
Data Stability
p OLTP: Continuous, high-frequency changes
p OLAP: Periodic, scheduled batch changes
Data Model (Schema)
p OLTP: Normalized to reflect the ACID properties and support day-to-day decisions
p OLAP: De-normalized to reflect end-user analysis needs and support strategic decisions
ACID (Atomicity, Consistency, Isolation, and Durability)
OLTP Apps. & OLAP Apps.
Data Access
p OLTP: Predictable, structured, and repetitive access
p OLAP: Ad hoc, unstructured, and heuristic access
Access Type
p OLTP: Read/insert/delete/update
p OLAP: Read/aggregate
Data Size
p OLTP: Tends to be MB-GB in size
p OLAP: Tends to be GB-TB in size
Key Performance
p OLTP: Maximizing transaction throughput is the key performance metric
p OLAP: Query throughput and response times are more important than transaction throughput
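
Atomicity, the first of the ACID properties that OLTP schemas are said to reflect, can be sketched with a small transfer between two accounts. This is an illustration only: the `accounts` table, the `transfer` and `balances` helpers, and the amounts are hypothetical, and the sketch relies on the sqlite3 connection context manager, which commits on success and rolls back when an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one unit; return True on success."""
    try:
        with conn:  # commit if the block succeeds, roll back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit made
        # earlier in the same transaction has been rolled back too.
        return False


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM accounts"))


first = transfer(conn, 1, 2, 30)    # succeeds
second = transfer(conn, 1, 2, 500)  # would overdraw account 1
final = balances(conn)
```

The failed transfer leaves both balances exactly as the successful one left them: either every update in the transaction takes effect or none does.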
The Modern Data Warehouse
p A data warehouse is a copy of transaction data
specifically structured for querying, analysis and
reporting
n Transaction data is not updated or changed later by
the data warehouse.
n Transaction data is specially structured, and may
have been transformed when it was placed in the
warehouse
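
A minimal sketch of this definition, assuming two hypothetical operational systems that record the same kind of sale in different formats. The record layouts, field names, and the `WarehouseRow` type are invented for illustration; the point is that rows are transformed on the way in and only ever appended.

```python
from datetime import datetime
from typing import NamedTuple


class WarehouseRow(NamedTuple):
    source: str
    customer_id: int
    sale_date: str  # ISO 8601 date for every source
    amount: float   # one currency unit for every source


# Hypothetical extracts from two operational systems
ORDERS_SYSTEM_A = [{"cust": "00123", "date": "19/09/2006", "amount_cents": 1999}]
ORDERS_SYSTEM_B = [{"customer_id": 456, "sold_on": "2006-09-20", "amount": 5.5}]


def from_system_a(rec):
    """System A stores string ids, day/month/year dates, and cents."""
    return WarehouseRow(
        source="A",
        customer_id=int(rec["cust"]),
        sale_date=datetime.strptime(rec["date"], "%d/%m/%Y").date().isoformat(),
        amount=rec["amount_cents"] / 100,
    )


def from_system_b(rec):
    """System B already uses ISO dates and decimal amounts."""
    return WarehouseRow("B", rec["customer_id"], rec["sold_on"], float(rec["amount"]))


def load(warehouse, records, transform):
    """Return a new tuple with transformed copies appended.

    Rows already in the warehouse are never updated or removed.
    """
    return warehouse + tuple(transform(r) for r in records)


warehouse = ()
warehouse = load(warehouse, ORDERS_SYSTEM_A, from_system_a)
warehouse = load(warehouse, ORDERS_SYSTEM_B, from_system_b)
```

Using immutable tuples mirrors the slide's point that the warehouse does not change transaction data after loading it.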

Position of the Data Warehouse
Within the Organization

Data Warehouse Roles and Structures
The DW has the following primary functions:
p A direct reflection of the business rules of the
enterprise.
p The collection point for strategic information.
p The historical store of strategic information.
p The source of information later delivered to data
marts.
p The source of stable data regardless of how the
business processes may change.

What Can a DW Do?
Some of the benefits of a DW are:
p Immediate information delivery
p Data integration from across and even outside
the organization
p Future vision from historical trends
p Tools for looking at data in new ways
p Freedom from IS department resource
limitations (end users can easily create most of
their queries and reports themselves)
Data Marts
p A data mart is a smaller, more focused data
warehouse. It reflects the business rules of a
specific business unit.
p The data mart does not need to cleanse its
data because that was done when it went into
the warehouse.
p It is a set of tables for direct access by users.
p These tables are designed for aggregation.
p It is typically a data source for data mining rather than
for traditional statistical analysis.
Position of the Data Mart Within the Organization

[Figure: three data marts, each connected by data delivery to decision support information.]

Data Marts and the Data Warehouse
Legacy systems feed data to the warehouse.
The warehouse feeds specialized information to departments.
[Figure: legacy systems and operational data stores feed the organizational data warehouse, which feeds the Finance, Sales, Marketing, and Accounting data marts.]
The Data Mart is More Specialized
The data mart serves the needs of one business unit, not the organization.

Organizational Data Warehouse
•Corporate
•Highly granular data
•Normalized design
•Robust historical data
•Large data volume
•Data Model driven data
•Versatile
•General purpose DBMS technologies

Data Marts
•Departmentalized
•Summarized, aggregated data
•Star join design
•Limited historical data
•Limited data volume
•Requirements driven data
•Focused on departmental needs
•Multi-dimensional DBMS technologies

[Figure: the organizational data warehouse feeding the Sales, Finance, Marketing, and Accounting data marts.]
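
The "star join design" listed for data marts can be sketched as one fact table joined to its dimension tables and then aggregated. The example below uses Python's sqlite3 module rather than a multi-dimensional DBMS; the table names (`fact_sales`, `dim_product`, `dim_date`) and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales  (product_id INTEGER REFERENCES dim_product(product_id),
                          date_id    INTEGER REFERENCES dim_date(date_id),
                          amount     REAL);
""")
conn.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "widget", "hardware"), (2, "gadget", "hardware"), (3, "manual", "books")],
)
conn.executemany("INSERT INTO dim_date VALUES (?, ?)", [(1, "2006-08"), (2, "2006-09")])
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 1, 10.0), (2, 1, 20.0), (3, 1, 5.0), (1, 2, 15.0), (3, 2, 2.5)],
)

# Star join: the fact table in the middle, one join per dimension,
# then aggregation along the chosen dimension attributes.
monthly_by_category = conn.execute("""
    SELECT d.month, p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date    AS d ON d.date_id = f.date_id
    GROUP BY d.month, p.category
    ORDER BY d.month, p.category
""").fetchall()
```

The result is the kind of summarized, aggregated table a departmental user would query directly.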

Typical Multidatabase Report and Screen Generation
Data download and transformation contribute to retrieval costs for every report or screen generated.
[Figure: each report or screen draws directly on Source Systems A, B, C, and D.]

Typical DW Report and Screen Generation
Data upload and transformation costs occur just once. Retrieval costs are lower.
[Figure: Source Systems A, B, C, and D feed the Organizational Data Warehouse, from which reports and screens are generated.]
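
The cost argument of the last two slides can be written as a rough model. The cost units below are hypothetical, periodic warehouse refreshes are ignored, and the per-report retrieval cost is assumed equal in both setups, which is conservative because the slide states that warehouse retrieval costs are lower.

```python
def multidatabase_cost(n_reports: int, transform: float, retrieve: float) -> float:
    """Every report repeats the download and transformation step."""
    return n_reports * (transform + retrieve)


def warehouse_cost(n_reports: int, transform: float, retrieve: float) -> float:
    """Upload and transformation are paid once; each report only retrieves."""
    return transform + n_reports * retrieve


# Hypothetical cost units, chosen only to show the shape of the comparison
TRANSFORM, RETRIEVE = 10.0, 2.0
costs = {
    n: (
        multidatabase_cost(n, TRANSFORM, RETRIEVE),
        warehouse_cost(n, TRANSFORM, RETRIEVE),
    )
    for n in (1, 5, 20)
}
```

Under these assumptions the two approaches cost the same for a single report, and the warehouse pulls ahead as soon as a second report is generated.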

Data Mining
p Foundations of Data Mining
p The Roots of Data Mining
p Data Mining Discipline
p Data Mining and Data Warehousing
p Knowledge Discovery (KDD) Process

Foundations of Data Mining
p Data mining is the process of using raw data to
infer important business relationships.
p Despite a consensus on the value of data mining,
a great deal of confusion exists about what it is.
p It is a collection of powerful techniques intended
for analyzing large datasets.
p There is no single data mining approach, but
rather a set of techniques that can be used in
combination with each other.

The Roots of Data Mining
p The approach has roots in practice dating back
over 30 years.
p In the early 1960s, data mining was called
statistical analysis, and the pioneers were
statistical software companies such as SAS and
SPSS.
p By the 1980s, the traditional techniques had
been augmented by new methods such as fuzzy
logic and neural networks.

Data Mining Discipline
[Figure: data mining at the center, surrounded by domain experts, visualization, machine learning, statistics, and databases.]

Domain Experts’ Aspects
p DM requires domain experts to determine the
problems and to understand the data for a particular
domain

Statistics’ Aspects
p DM can be perceived as investigating and applying
techniques for Exploratory Data Analysis
p The interpretation of statistical knowledge is one of the
major issues
p Techniques used: sampling, clustering, classification,
regression, prediction, …

Database’s Aspects
p DM can be about OLAP, the warehouse model, data marts, and
decision support systems
p Analysis and derivation of new knowledge are important
issues
p Techniques used: data modeling, file organization, query
languages, query processing and optimization, …
p Databases are able to store and retrieve data
n but lack analysis and derivation of new knowledge
Machine Learning’s Aspects
p DM can be about algorithms that are computationally
practical for learning patterns from large data sets.
p Scalability with respect to data size is an important issue
p Techniques used: neural networks, Bayesian networks,
genetic algorithms, knowledge representation, decision
trees, …
Visualization’s Aspects
p DM can be about clustering and compressing high
dimensional data and transforming the data into a visual
form for conveying “useful” information in data
p Visualization of data could improve knowledge
understanding
p Techniques used: graphical representation, computer
vision, …
Data Mining and Data Warehousing
p DW & OLAP tools allow users to easily navigate
and visualize data
n “What has been going on?”

p DM tools allow users to model the data and
predict future events
n “What is going to happen next and how can I profit?”

Data Mining and Data Warehousing
Business Problems

p Retrospective analysis (Past & Present)
n e.g., performance analysis of sales for the last 2
years across different regions, product types, and
demographics

p Predictive analysis (Future)
n e.g., a predictive model that describes defection rates
of customers to the competition

Data Mining and Data Warehousing

[Figure: data warehousing turns data in the database into information in the data warehouse; data mining turns it into knowledge.]

Data Mining and Data Warehousing
p Data mining does not require the use of a
warehouse, but a warehouse may be the best foundation for
mining.
p If multiple analyses are run in sequence, the
data need to be held constant (as in a DW). In
an operational database, data change often.
p Also important is that the data in the DW is
integrated and stable

Data Mining and Data Warehousing

Layers of corporate data and the tools that reach them:
p Surface: SQL for simple queries and reporting
p Shallow: OLAP and statistical tools for summaries, analysis, and forecasting
p Hidden: data mining and statistical tools for classification, clustering, and prediction

Knowledge Discovery (KDD) Process
p Data mining: core of the knowledge discovery process
[Figure: KDD process flow from bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection and Transformation, Task-relevant Data, Data Mining, Pattern Evaluation.]
KDD Process: Several Key Steps
p Learning the application domain
n relevant prior knowledge and goals of application
p Creating a target data set: data selection
p Data cleaning and preprocessing: (may take 60% of effort!)
p Data reduction and transformation
n Find useful features, dimensionality/variable reduction, invariant
representation
p Choosing functions of data mining
n summarization, classification, regression, association, clustering
p Choosing the mining algorithm(s)
p Data mining: search for patterns of interest
p Pattern evaluation and knowledge presentation
n visualization, transformation, removing redundant patterns, etc.
p Use of discovered knowledge
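
The steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not a reference implementation: the records are invented, the mining function chosen is association (counting item pairs that occur together), and the function names `select`, `clean`, `transform`, `mine`, and `evaluate` are hypothetical labels for the corresponding steps.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records: (customer, basket); some are dirty
RAW = [
    ("c1", ["Bread", "milk ", "eggs"]),
    ("c2", ["bread", "Milk"]),
    ("c3", None),  # missing basket
    ("c4", ["milk", "bread", "bread"]),
    ("c5", ["eggs"]),
]


def select(records):
    """Create the target data set: keep only the baskets."""
    return [basket for _, basket in records]


def clean(baskets):
    """Drop missing baskets; normalize case and whitespace."""
    return [[item.strip().lower() for item in b] for b in baskets if b]


def transform(baskets):
    """Reduce each basket to a sorted list of distinct items."""
    return [sorted(set(b)) for b in baskets]


def mine(baskets):
    """Count how often each pair of items occurs together."""
    counts = Counter()
    for b in baskets:
        counts.update(combinations(b, 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support):
    """Keep pairs whose support (fraction of baskets) meets the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }


baskets = transform(clean(select(RAW)))
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
```

Even in this toy case most of the code handles selection, cleaning, and transformation rather than mining, which is consistent with the slide's remark about where the effort goes.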