Вы находитесь на странице: 1из 22

NOTES

on
DATA WAREHOUSING AND DATA MINING
MCA-III Semester

By:
Dr. Asha Ambhaikar
Professor & Dean Students Welfare

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 1
Kalinga University, Naya Raipur
Data Warehousing & Data Mining

MCA 305 External Marks:70 Internal Marks: 30

UNIT- I

The Compelling Need for data warehousing: Escalating Need for strategic information, failures of Past
decision-support systems, operational versus decision-support systems, data warehousing – the only viable
solution, data warehouse defined Data warehouse – The building Blocks: Defining Features, data warehouses
and data marts, overview of the components, metadata in the data warehouse Defining the business
requirements: Dimensional analysis, OLAP operations : Drilldown and roll-up, slice-and-dice or rotation.

UNIT- II Principles of dimensional modeling: , the STAR schema, STAR Schema Keys, Advantages of the STAR
Schema Dimensional Modeling: Updates to the Dimension tables, miscellaneous dimensions, the snowflake
schema, aggregate fact tables, families of STARS ,Steps for the Design & Construction of Data warehouse :
Framework , Architecture , Type of OLAP Servers : ROLAP , MOLAP , Data warehouse implementation tolls &
techniques.

UNIT- III Data Mining, Data Mining of what kind of Data , Knowledge discovery process (KDD) , What kind of
patterns can be mined , OLAP versus data mining, data mining and the data warehouse, Data mining
functionalities, classification Systems , Data processing : Cleaning , Integration & transformation, Reduction .
Data Mining primitives: What defines a Data Mining Task.

UNIT- IV Data Mining Query language (DMQL), Cluster Analysis : Partitioning , Hierarchical Density , Grid &
Model based methods., Major Data Mining Techniques, Cluster detection, decision Trees, memory-based
reasoning, link analysis, neural networks, genetic algorithms, moving into data mining, Data Mining
Applications, Benefits of data mining & applications.

Text books: 1. Paul Raj Poonia, “Fundamentals of Data Warehousing”, John Wiley & Sons, 2003. 2. W. H.
Inmon, “Building the Operational Data Store”,2nd Ed., John Wiley, 1999 3. Sam Anahony, “Data Warehousing
in the Real World: A Practical Guide for Building Decision Support Systems”, John Wiley, 2004. 4. Jarke,
“Fundamentals of Data Warehouse”, Springer

References books: 1. Kamber and Han, “Data Mining Concepts and Techniques”, Hartcourt India P. Ltd.,
2001. 2. G. K. Gupta, “Introduction to Data Mining with Case Studies”, PHI, 2006. 3. A. B. M. Shawkat Ali,
Saleh A. Wasimi, “Data Mining Methods and Techniques”, Cengage Learning, 2009. 4. Pang - Ning, Michael-
Steinbach, “Introduction to Data Mining”, Pearson, 4th Ed., 2009.

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 2
UNIT-I

COMPILING NEED FOR DATA WAREHOUSING

What is Data Warehousing?

A process of transforming
Information data into information
and making it available to
users in a timely enough
manner to make a
difference

Data

 Technique for assembling and managing data from various sources for the
purpose of answering business questions. Thus making decisions that were
not previous possible

 A decision support database maintained separately from the organization’s


operational database

 “A data warehouse is a subject-oriented, integrated, time-variant, and


nonvolatile collection of data in support of management’s decision-making
process.”—W. H. Inmon

 Data warehousing:

 The process of constructing and using data warehouses

Subject Oriented

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 3
 Organized around major subjects, such as customer, product, sales.

 Focusing on the modeling and analysis of data for decision makers, not on
daily operations or transaction processing.

 Provide a simple and concise view around particular subject issues by


excluding data that are not useful in the decision support process.

Integrated:

 Constructed by integrating multiple, heterogeneous data sources

 relational databases, flat files, on-line transaction records

 Data cleaning and data integration techniques are applied.

 Ensure consistency in naming conventions, encoding structures,


attribute measures, etc. among different data sources

 E.g., Hotel price: currency, tax, breakfast covered, etc.

 When data is moved to the warehouse, it is converted.

Time variant:

 The time horizon for the data warehouse is significantly longer than that of
operational systems.

 Operational database: current value data.

 Data warehouse data: provide information from a historical perspective


(e.g., past 5-10 years)

 Every key structure in the data warehouse

 Contains an element of time, explicitly or implicitly

 But the key of operational data may or may not contain “time element”.

Non volatile:

 A physically separate store of data transformed from the operational


environment.

 Operational update of data does not occur in the data warehouse


environment.

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 4
 Does not require transaction processing, recovery, and concurrency
control mechanisms

 Requires only two operations in data accessing:

 initial loading of data and access of data.

Very large Databases:

 Terabytes -- 10^12 bytes: Walmart-- 24 Terabytes

 Petabytes -- 10^15 bytes: Geographic Information Systems

 Exabytes -- 10^18 bytes: National Medical Records

 Zettabytes -- 10^21 bytes: Weather images

 Zottabytes -- 10^24 bytes: Intelligence Agency Videos

Need for Data Ware House:


A Data Warehouse is the necessary foundation for a true Business Intelligence
solution. Almost all sophisticated, mature BI systems are based on one. The
alternative to creating a data warehouse is to use data accessed directly from the
applications that create it.

Reporting and analysis can only be done properly by experts.


A Rapid Decision data warehouse eliminates all these problems while offering many
other advantages.

Evolution of Data Warehouse:

Data Warehouse vs. Heterogeneous DBMS


 Traditional heterogeneous DB integration:

 Build on top of heterogeneous databases

 Query driven approach

 When a query is posed to a client site, a meta-dictionary is used


to translate the query into queries appropriate for individual

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 5
heterogeneous sites involved, and the results are integrated into
a global answer set

 Complex information filtering, compete for resources

 Data warehouse: update-driven, high performance

 Information from heterogeneous sources is integrated in advance and


stored in warehouses for direct query and analysis

Benefits of Data warehouse:


 Better Information

 Better Strategies and plans

 Better tactics and decisions

 More efficient processed

 Time saving

 Reduction in paper reporting

 Increase in knowledge worker productivity

 Supports all decision makers’ data requirements

 Provide ready access to critical data

 Insulates operation databases from ad hoc processing

 Provides high-level summary information

 Provides drill down capabilities


Yields

 Improved business knowledge

 Competitive advantage

 Enhances customer service and satisfaction

 Facilitates decision making

 Help streamline business processes

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 6
Goals of Data Warehouse:
 Makes an organization’s information accessible.
 Makes the organization’s information consistent.
 Is an adaptive and durable source of information
 Is a secure support that protects the
organization’s information asset
 Is the foundation for decision making

Data Warehouse vs. Operational DBMS:


 OLTP (on-line transaction processing)

 Major task of traditional relational DBMS

 Day-to-day operations: purchasing, inventory, banking, manufacturing,


payroll, registration, accounting, etc.

 OLAP (on-line analytical processing)

 Major task of data warehouse system

 Data analysis and decision making

 Distinct features (OLTP vs. OLAP):

 User and system orientation: customer vs. market

 Data contents: current, detailed vs. historical, consolidated

 Database design: ER + application vs. star + subject

 View: current, local vs. evolutionary, integrated

 Access patterns: update vs. read-only but complex queries

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 7
OLTP vs Data Warehouse
 OLTP  Warehouse (DSS)
 Application Oriented  Subject Oriented
 Used to run business  Used to analyze business
 Detailed data  Summarized and refined
 Current up to date  Snapshot data
 Isolated Data  Integrated Data
 Repetitive access  Ad-hoc access
 Clerical User  Knowledge User
(Manager)

 OLTP  Data Warehouse


 Performance Sensitive  Performance relaxed
 Few Records accessed at a  Large volumes accessed at
time (tens) a time(millions)
 Mostly Read (Batch
 Read/Update Access Update)
 Redundancy present
 No data redundancy  Database Size 100
 Database Size 100MB - GB - few terabytes
100 GB

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 8
To summarize ...

 OLTP Systems are


used to “run” a
business

 The Data Warehouse


helps to “optimize” the
business

Why Separate Data Warehouse?


 High performance for both systems

 DBMS— tuned for OLTP: access methods, indexing, concurrency


control, recovery

 Warehouse—tuned for OLAP: complex OLAP queries, multidimensional


view, and consolidation.

Need of Data Warehouse:


 missing data: Decision support requires historical data which
operational DBs do not typically maintain

 data consolidation: DS requires consolidation (aggregation,


summarization) of data from heterogeneous sources

 data quality: different sources typically use inconsistent data


representations, codes and formats which have to be reconciled.

What are the various Trends in Data Warehouse?


There are Three Complementary Trends in Data Warehouse:

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 9
 Data Warehousing

_ Consolidate data from many sources in one large repository.

– Loading, periodic synchronization of replicas.

– Semantic integration.

 OLAP:

– Complex SQL queries and views.

– Queries based on spreadsheet-style operations and multidimensional” view


of data.

 – Interactive and “online” queries.

 Data Mining

_ Exploratory search for interesting

trends and anomalies.

Apart from above basic Trends, In the early stages, four significant factors drove
many companies to move into data warehousing:

 violent competition

 Government deregulation

 Need to restore internal processes

 Imperative for customized marketing

Significant Factors:
These significant factors reflect the new

trends in data warehousing:

 Multiple Data Types

 Data Visualization

 Parallel Processing

 Query Tools

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 10
 Browser Tools

 Data Fusion

 Multidimensional Analysis

 Agent Technology

 E-Business- ERP, KM, CRM

Basic Elements of Data warehouse:


 Source System
 Staging Area
 Presentation Area
 End User Data Access Tools
 Metadata

Data warehouse Architecture


Monitor
& OLAP Server
other Metadata
sources Integrator

Analysis
Operational Extract Query
Transform Data Serve Reports
DBs Load
Refresh
Warehouse Data mining

Data Marts

Data Sources Data Storage OLAP Engine Front-End Tools

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 11
Working of Data Warehouse:
Bottom Layer:

 The bottom layer is a DW database servers that is almost always a relational


database system
 Data from operational databases and external sources are extracted using
application program interfaces known as gateways

 It is supported by primary system


 It has repository that is metadata (data about data)

 Which is responsible for extracting the information from DW according to the


queries given by the end users
 Metadata is the bridge between DW and the DSS

 It provides logical linkage between data and application


 Metadata can pinpoint access to information across the entire DW and can
unable the development of applications which automatically updates
themselves to reflect data warehouse content change.
Middle Layer:

 The middle layer consists of OLAP server


 OLAP means On Line Analytical Processing

 It is used to perform analysis on data


and transform it in to useful information for decision making

 OLAP is a continuously iterative process


 OLAP servers are implemented by either ROLAP,MOLAP or HOLAP

TOP Layer:
 The top layer is a client

 That is the end user


 It consists of

1.query and reporting tools


Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 12
2.Analysis tools and

3. Data Mining Tools


 It acts as an interface between the user and the server

 This layer takes queries from the users


 And then send it to the servers

 Receiving information records back and


 Gives them as output to the end users.

 Eg. Analysis of weather forecasting, predictions and so on.

DEFINING THE BUSINESS REQUIREMENTS:


 A requirements definition for the data warehouse can, therefore, be
based on business dimensions such as product, geography, time, and
promotion.
 Information packages are the backbone of the requirements definition.
An information package records the critical measurements or facts and
business dimensions along which the facts are normally analyzed.
 Interviews and group sessions are standard methods for collecting
requirements.

 Key people to be interviewed or to be included in group sessions are


senior executives (including the sponsors), departmental managers,
business analysts, and operational systems DBAs.
 Review all existing documentation of related operational systems.
 Scope and content of the requirements definition document include
data sources, data transformation, data storage, information delivery,
and information package diagrams.

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 13
Project Planning and Management:

Objectives:
 Review the essentials of planning for Data Warehouse
 While planning a data Warehouse key issues to be considered include:
setting proper expectations, assessing risk, deciding between Top-
down and Bottom-up approaches, choosing from vendor solution.

 Business requirement not technology must drive your project.


 The data warehouse project without the full support of the top
management and without strong, enthusiastic and executive
sponsored.
 Data warehouse projects are different from OLTP projects.

 Discuss the project team, organization’s role and responsibilities.


 Participation of the user is mandatory for success of the Data
Warehouse project. Users can participate in a variety of ways.

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 14
 Consider the warning sign and success factors in the final analysis;
adopt a practical approach to built a successful data warehouse.
 Project Manager: It is a serious mistake to have a project manager who
is more technology-oriented than user-oriented and business-oriented.

 New Paradigm: Data warehousing is new for most companies;


innovative project management methods are essential to deal with the
unexpected challenges.
 Team Roles: Team roles are not to be assigned arbitrarily; the roles
must reflect the needs of each individual data warehouse project.
 Data Quality: Three critical aspects of data in the data warehouse are:
quality, quality and quality.
 User Requirements: Although obvious, user requirements alone form
the driving force of every task on the project schedule.
 Building for Growth: Number of users and number of queries shoot up
very quickly after deployment; data warehouses not built for growth will
fall to pieces.
 Project Politics: The first data warehouse project in a company poses
challenges and threats to users at different levels; trying to handle
project politics is like walking the known tightrope, to be compressed
with extreme caution.
 Realistic Expectations: It is easy to promise the world in the first data
warehouse project; setting expectations at the right and attainable
levels is the best course.

 Dimensional Data Modeling: A well-designed dimensional data model is


a required foundation and blueprint.
 External Data: A data warehouse does not live by internal data alone;
data from relevant external sources is an absolutely necessary
ingredient.
 Training: Data warehouse user tools are different and new. If the users
do not know how to use the tools, they will not use the data warehouse.
An unused data warehouse is a failed data warehouse.

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 15
Fig. 1.2 Data Warehouse Deployment Phases

Dimensional Analysis (Multidimensional Data Model)


 A data warehouse is based on a multidimensional data model which views data in the form
of a data cube

 A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

 Dimension tables, such as item (item_name, brand, type), or time(day, week, month,
quarter, year)

 Fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables

 In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D
cuboid, which holds the highest-level of summarization, is called the apex cuboid. The
lattice of cuboids forms a data cube.

Multidimensionality:

 3-D + Spreadsheets (OLAP has this)

 Data can be organized the way managers like to see them, rather than the way that the
system analysts do

 Different presentations of the same data can be arranged easily and quickly

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 16
 Dimensions: products, salespeople, market segments, business units, geographical
locations, distribution channels, country, or industry

 Measures: money, sales volume, head count, inventory profit, actual versus forecast

 Time: daily, weekly, monthly, quarterly, or yearly

Multidimensional Data
 Sales volume as a function of product, month,
and region
Dimensions: Product, Location, Time
Hierarchical summarization paths

Industry Region Year

Category Country Quarter


Product

Product City Month Week

Office Day

Month

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 17
A Sample Data Cube
Total annual sales
Date of TV in U.S.A.
1Qtr 2Qtr 3Qtr 4Qtr sum
TV
PC U.S.A
VCR

Country
sum
Canada

Mexico

sum

Cuboids Corresponding to the Cube

all
0-D(apex) cuboid
product date country
1-D cuboids

product,date product,country date, country


2-D cuboids

3-D(base) cuboid
product, date, country

OLAP Operations
 OLAP means On Line Analytical Processing.

 It is used to perform analysis on data and transform it into information for decision making
purpose.

 OLAP is a continuous iterative process.

 A common operation is to aggregate a measure over one or more dimensions.

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 18
– Find total sales.

– Find total sales for each city, or for each state.

– Find top five products ranked by total sales.

Typical OLAP Operations

 Roll up (drill-up): summarize data

 Drill down (roll down): reverse of roll-up

 Slice and dice:

 project and select

 Pivot : rotate

Roll-up and Drill Down


Higher Level of
Aggregation
 Sales Channel

 Region

 Country

 State

 Location Address

 Sales Representative

Low-level
Details

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 19
Slicing and Dicing

Product The Telecomm Slice

Household

Telecomm

Video Europe
Far East
Audio India

Retail Direct Special Sales Channel

A Visual Operation: Pivot


(Rotate)

Juice
Cola 10

Milk 47

Crea 30
m 12 Product

3/1 3/2 3/3 3/4


Date
Typical OLAP Operations

 Roll up (drill-up): summarize data

 by climbing up hierarchy or by dimension reduction


Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 20
 This operation performs aggregation on the data cube, either by climbing up a
concept of hierarchy for a dimension or by dimension reduction.

 When roll up is performed by dimension reduction, one or more dimensions are


removed from the given cube.

 Drill down (roll down): reverse of roll-up

 from higher level summary to lower level summary or detailed data, or introducing
new dimensions

 It navigates from less detailed data to more detailed data.

This can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions

 Slice and dice:

 project and select

 The slice operation performs a selection on one dimension of the given cube resulting
in a sub cube

 The dice operation defines a sub cube by performing a selection on two or more
dimensions

 Pivot (rotate):

 It is visualization operation that rotates the data axes in new view in order to provide
an alternative presentation of the data.

reorient the cube, visualization, 3D to series of 2D planes.

 Other operations

 drill across: Executes queries involving (across) more than one fact table

 drill through: Operation uses relational SQL facilities to drill through the bottom level
of the data cube to its back-end relational tables

OLAP Is FASMI

 Fast

 Analysis to

 Share

 Multidimensional
Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 21
 Information

--------------------

Important Questions:
1. What is Data Warehouse? Explain in detail.

2. What are the goals of Data Warehouse?

3. What is the need of Data Warehouse?


4. What are the benefits of Data warehouse?

5. Write the Data Warehouse applications with its storage capacities.


6. Explain an Architecture of Data Warehouse with neat diagram.

7. What are the various elements of Data warehouse?


8. Explain and give the Business requirements of Data Warehouse.

9. How to plan for Data warehouse?


10. What is Multidimensional data model? Explain with neat diagram.

11. What are the various OLAP operations? Explain each with neat diagram.

Dr. Asha Ambhaikar, Professor and Dean Students Welfare, Kalinga University, Naya Raipur Page 22

Вам также может понравиться