
INTRODUCTION

Data Warehousing and Mining is simply one component of modern reporting architectures. The real goal of reporting systems is decision support or its modern equivalent, business intelligence: to help people make better, more intelligent decisions.

The problem with present database reporting systems is that they do not provide full-fledged support for the following:

• Accessibility: getting the required information whenever it is needed
• Timeliness: the time taken to produce and deliver a report
• Formats: support for formats such as spreadsheets, graphs, maps, etc.
• Integrity: accuracy and reliability of the data

In order to achieve the above characteristics, which are the main requirements of present-day Information Systems departments, Data Warehousing and Mining is preferred over conventional database systems.

Definition of Data Warehousing:

A Data Warehouse is a database of data gathered from many systems, intended to support management reporting and decision making. This process of gathering data is called Data Warehousing.

Definition of Data Mining:

Data Mining is a form of analysis intended to create predictive models that are used to discover patterns and relationships in the data. These patterns and relationships can help in making better business decisions.


CHARACTERISTICS OF DATA WAREHOUSE

A Data Warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management decisions.

• Subject oriented: the Data Warehouse deals with all the subjects of corporate data, e.g. sales, finance, customers, etc.
• Integrated: it integrates data from different database systems (heterogeneous data) into a single homogeneous form.
• Non-volatile: the Data Warehouse is a read-only database; its data is not overwritten or deleted, so it is non-volatile.
• Time variant: historical data with chronological importance, i.e. historical data is maintained and can be analysed over time.

One of the most important characteristics of a Data Warehouse is that it allows some redundancy of data in order to support an excellent reporting system. This is one of the main differences between an ordinary database and a Data Warehouse: an ordinary database does not accept redundancy.

GOALS OF DATA WAREHOUSING AND MINING

• To provide a reliable, single, integrated source of information.
• To give end users access to their data without reliance on reports produced by the Information Systems (IS) department.
• To allow analysis of corporate data, the building of predictive models, and improved Business Intelligence.

ARCHITECTURE OF DATA WAREHOUSE


[Architecture diagram: OLTP source systems (DATA STORE1) feed the Integration Layer / Data Warehouse (DATA STORE2) via FLOW1, which feeds the Data Mart / HPQS (DATA STORE3) via FLOW2, which in turn feeds OLAP end-user reporting (DATA STORE4) via FLOW3.]

As shown in the architecture, the approach to reporting systems and data warehousing is built around:

• Four data structures for the storage of data. They are:

1. DATA STORE1, called Online Transaction Processing (OLTP)
2. DATA STORE2, called the Integration Layer or Data Warehouse
3. DATA STORE3, called the Data Mart or High Performance Query Structure (HPQS)
4. DATA STORE4, called Online Analytical Processing (OLAP)

• Three data flow paths between the four data structures. They are:

1. FLOW1, from DATA STORE1 to DATA STORE2
2. FLOW2, from DATA STORE2 to DATA STORE3


3. FLOW3, from DATA STORE3 to DATA STORE4

This architecture can be explained as an ETL (Extract, Transform, Load) architecture, i.e. it is divided into three phases:

1. Extract phase
2. Transform phase
3. Loading phase

EXTRACT PHASE

This phase is the process of transferring data from DATA STORE1 to DATA STORE2.

DATA STORE1 is also called the Source Systems: the transaction processing system or systems, such as sales, accounting and distribution, that will provide data to your warehouse. Each of these systems has a database of information that end users need access to. Frequently, they need access to data that has been integrated from many of these systems.

There are different mechanisms for extracting that data from its sources; this is called the data extraction step.

The data from the different source systems may be in different formats, say Oracle, FoxPro, MS-Access, etc. These data have to be integrated and transferred to DATA STORE2.

The art of determining what records to extract from the source system is frequently called change data capture. The point of change data capture is to recognize which source records have changed and how, so that just the changed records are moved to the warehouse. Deletes can be very difficult to recognize because they leave no trace: the deleted record is just gone. If you need to recognize deletes, you may have to jump through a few hoops to figure out whether records have been deleted from your data source.

Some general techniques used to recognize changes to source database tables are:

• Timestamps: the lucky among us extract data from systems that timestamp records whenever they are inserted or updated. In these situations, change data capture is reduced to a search through the source tables to determine which records have changed (a query sketch is given after this list).

• Triggers: a great technique for capturing changes in source records is to put triggers on the source tables. Every time a record is inserted into, updated in, or deleted from a source table, these triggers write a corresponding message to a log. The warehouse uses the information in these logs to determine how to update itself (a trigger sketch is given after this list).

• File compares: probably the least desirable technique for identifying changes in your source data is to compare the file as it appears today with a copy of how it appeared when you last loaded the warehouse. Not only is this technique difficult to implement, it is also less accurate, because it compares only periodic "snapshots" of the data. Thus, if you load your warehouse weekly, you will only be able to see the new state of the database every week, not every change that occurred during the week.
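A minimal sketch of the first two techniques follows. All table, column and bind-variable names here (customer, last_updated, customer_changes, :last_load_date) are illustrative assumptions, not taken from any particular source system, and the "log" of the trigger technique is modelled as a log table:

-- Timestamp technique: pull only the records changed since the last load.
SELECT *
FROM   customer
WHERE  last_updated > :last_load_date;

-- Trigger technique: log every insert, update and delete on the source table.
CREATE TABLE customer_changes (
    customer_no  NUMBER,
    change_type  VARCHAR2(1),          -- 'I' insert, 'U' update, 'D' delete
    change_time  DATE DEFAULT SYSDATE
);

CREATE OR REPLACE TRIGGER trg_customer_cdc
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW
BEGIN
    IF INSERTING THEN
        INSERT INTO customer_changes (customer_no, change_type) VALUES (:NEW.customer_no, 'I');
    ELSIF UPDATING THEN
        INSERT INTO customer_changes (customer_no, change_type) VALUES (:NEW.customer_no, 'U');
    ELSE
        INSERT INTO customer_changes (customer_no, change_type) VALUES (:OLD.customer_no, 'D');
    END IF;
END;
/

The warehouse load then reads customer_changes, rather than scanning the whole source table, and applies just those changes.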

TRANSFORM PHASE

After the Extract phase, data is present in the DATA STORE2. Transform phase

is where this data is Transformed into the required form in the DATA STORE2.


Some of the fundamental steps in the Transform phase are:

1. Converting heterogeneous data to homogeneous data
2. Adding surrogate keys
3. Removing dirty data
4. Normalization

Step 1:

The data in DATA STORE2 comes from the different source systems of DATA STORE1, so it is heterogeneous and has to be converted to homogeneous data. This is the reason why DATA STORE2 is called the Integration Layer or Warehouse.
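As an illustrative sketch of this kind of integration (the staging table and the codes shown are assumptions, not taken from the paper), a transformation can standardize values that the source systems encode differently:

-- Different sources may encode gender as 'M'/'F', 'MALE'/'FEMALE' or '1'/'2';
-- the integration layer maps them all to one homogeneous code.
UPDATE stage_customer
SET    gender = CASE
                    WHEN UPPER(gender) IN ('M', 'MALE', '1')   THEN 'M'
                    WHEN UPPER(gender) IN ('F', 'FEMALE', '2') THEN 'F'
                    ELSE 'U'
                END;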

Step 2:

Special keys, which have no business meaning, are added to the data in DATA STORE2. For example, rather than using the customer number as the key on the CUSTOMER table, you might use a surrogate key that is simply a sequential number generated by your warehouse load programs. The customer number would still appear on the table; it just wouldn't be the primary key of the table.

Surrogate keys open up a whole range of functionality that is otherwise unavailable.
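A minimal sketch of how such a key could be generated (the sequence and table names are illustrative assumptions):

-- A sequence provides surrogate keys with no business meaning.
CREATE SEQUENCE customer_key_seq;

-- The load program assigns the next surrogate key to each incoming record;
-- the business customer number is kept as an ordinary column.
INSERT INTO dw_customer (customer_key, customer_no, customer_name)
SELECT customer_key_seq.NEXTVAL, customer_no, customer_name
FROM   stage_customer;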

Step 3:

Dirty data is the set of records that are not useful to the database. Dirty data can be handled in the following ways:

• Ignoring them.
• Rejecting bad records, but saving them in a separate file for manual review (see the sketch after this list).
• Loading as much of the bad record as possible and pointing out the errors for later correction.
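A minimal sketch of the second option, with a reject table standing in for the separate file (the table names and the validation rule are illustrative assumptions):

-- Copy records that fail a basic validation rule to a reject table for manual review ...
INSERT INTO stage_customer_rejects
SELECT *
FROM   stage_customer
WHERE  customer_no IS NULL;

-- ... and remove them from the clean staging data before loading.
DELETE FROM stage_customer
WHERE  customer_no IS NULL;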


Step 4:

Normalization of the data is done in DATA STORE2. A normalized database is like a flat file that has been broken up into smaller files or tables in order to store the data more efficiently. As in all databases, it is strongly recommended that referential integrity constraints be enabled in your warehouse / integration layer.
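For example, a referential integrity constraint between two integration-layer tables might be enabled like this (the table and constraint names are assumptions for illustration):

-- Every sales row must reference an existing customer in the integration layer.
ALTER TABLE dw_sales
    ADD CONSTRAINT fk_sales_customer
    FOREIGN KEY (customer_key) REFERENCES dw_customer (customer_key);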

After performing the above four steps, the data is efficiently stored in DATA STORE2.

LOADING PHASE

The transformed data is sent to DATA STORE3, which is called the DATA MART. The Loading phase constitutes DATA FLOW2 of the architecture.

DEFINITION OF DATA MART:

Data Marts are databases that share many of the features of data warehouses but are smaller in scope.

Like a warehouse, the data in a mart may come from multiple systems, although in our standard architecture this data will have been integrated in the warehouse before it ever comes to the mart.

Marts differ from warehouses in that, whereas the warehouse focuses on the needs of the entire organization, marts are dedicated to particular subject areas or to the needs of a particular department or business function.

The Loading phase can use several schemas. Two of them are:

• Star Schema, in which the data is maintained in one fact table and multiple dimension tables (a sketch is given after this list).
• Snowflake Schema, in which the data is maintained in normalized dimension tables.
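A minimal star-schema sketch (the fact and dimension tables below, and their columns, are illustrative assumptions rather than anything prescribed by the paper):

-- Dimension tables: denormalized descriptive data.
CREATE TABLE dim_customer (
    customer_key   NUMBER PRIMARY KEY,
    customer_no    NUMBER,
    customer_name  VARCHAR2(100),
    region         VARCHAR2(50)
);

CREATE TABLE dim_time (
    time_key       NUMBER PRIMARY KEY,
    calendar_date  DATE,
    month_name     VARCHAR2(20),
    year_no        NUMBER
);

-- Fact table: the measures, plus a foreign key to every dimension.
CREATE TABLE fact_sales (
    customer_key   NUMBER REFERENCES dim_customer (customer_key),
    time_key       NUMBER REFERENCES dim_time (time_key),
    quantity       NUMBER,
    sales_amount   NUMBER(12,2)
);

In a snowflake schema the dimension tables themselves would be normalized further, for example by moving region out of dim_customer into its own table.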

This DATA STORE3 is also called a High Performance Query Structure (HPQS). Data Marts, or HPQSs, are databases and data structures set up specifically to support end-user queries. These databases are most frequently managed by either relational database engines or multidimensional database engines.

Note: Data Marts and HPQSs are logical, not physical, concepts. Frequently, an organization's data warehouse and its data marts will share the same computer.

Finally, after the Extract, Transform and Loading of the data, DATA FLOW3 of the architecture takes place.

DATA FLOW3 is the transfer of data from the High Performance Query Structures to the end-user reporting applications, DATA STORE4.

There are query tools to transfer data from the Data Marts to the end user; these tools also perform the formatting of the results.
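As an illustration of the kind of query such a tool would run against the star schema sketched above (the same illustrative table names are assumed), a report of total sales by region and month could be:

-- Aggregate the fact table by two dimensions; the query tool formats the result.
SELECT c.region,
       t.month_name,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f,
       dim_customer c,
       dim_time t
WHERE  f.customer_key = c.customer_key
AND    f.time_key     = t.time_key
GROUP BY c.region, t.month_name;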

DATA STORE4 is the data in the end user's hands. The report in the user's hands is the end of the information utility. It is also the last data store in our preferred warehousing architecture.

ETL TOOLS:

• INFORMATICA

• MICROSTRATEGY

• ABIONTASH

• ORACLE WAREHOUSE BUILDER

• SAS (Statistical Analysis System)


OLAP TOOLS:

• BUSINESS OBJECTS

• COGNOS

ORACLE8i - DATA WAREHOUSE FEATURES

• Relational Database

• PL/SQL development tools

• Oracle Warehouse Builder (OWB) – Extract, Transform and Load (ETL)

• Express – Multidimensional database engine

• Discoverer – Relational OLAP query tool

• Oracle Data Mining Suite

APPLICATION OF DATA WAREHOUSING AND MINING IN

GOVERNMENT DEPARTMENTS

All the Government departments are connected by a private network, which we assume to be highly secure and reliable. The transactions in each Government department are online. A Centralized Data Warehouse Server is maintained at one particular place, and the transactions of all the Government departments are transferred to this Centralized Server, the Data Warehouse Server. The topology of the network corresponds to the architecture of the Data Warehouse, as shown in Fig. 1.


[Fig. 1 shows three departments (Telephone Dept., Railway Dept. and Registration Dept.) sending their details, such as employee details and credit and debit details, to the Centralized Server (DWHS), which loads Data Marts that are queried through OLAP.]

DWHS: Data Warehouse Server; OLAP: Online Analytical Processing

Fig. 1: Application topology of Data Warehousing and Mining

The above is an example with three departments in the network. Data from the three departments is extracted and transformed in the Centralized Server (DWHS), and the transformed data is sent to the different Data Marts. The Data Marts can analyze the data and answer the most complex queries of the user. Report generation is immediate, so any decisions for development can be taken immediately. The data from the Centralized Server can even be published on the internet so that people know about the Government's activities and can question the Government if any incorrect data is found in the Data Warehouse. For example, the details of the salary and loans of a particular employee are available through the respective department systems, and the employee's assets are obtained from the Registration department. This data can be analyzed to conclude whether a particular employee is loyal to the Government or not, and the respective actions can be taken. Thus, Data Warehousing and Mining can take both the private and public sectors to a top level.

