
INTRODUCTION

Data Warehousing and Mining is simply one component of modern reporting architectures. The real goal of reporting systems is decision support or its modern equivalent, business intelligence: to help people make better, more intelligent decisions.

The problem with present database reporting systems is that they do not provide full-fledged support for the following:

• Accessibility: getting the required information whenever it is needed
• Timeliness: the time taken to produce and deliver a report
• Formats: support for formats such as spreadsheets, graphs, maps, etc.
• Integrity: accuracy and reliability of the data

In order to achieve the above characteristics, which are the main requirements of present-day Information Systems departments, Data Warehousing and Mining is preferred over conventional database systems.

Definition of Data Warehousing:

A Data Warehouse is a database of data gathered from many systems, intended to support management reporting and decision making. This process of gathering data is called Data Warehousing.

Definition of Data Mining:

Data Mining is a form of analysis intended to create predictive models that are used to discover patterns and relationships in the data. These patterns and relationships can help in making better business decisions.


CHARACTERISTICS OF DATA WAREHOUSE

A Data Warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of management decisions.

• Subject oriented: the Data Warehouse deals with all the subjects of corporate data, e.g. sales, finance, customers, etc.
• Integrated: it integrates data from different database systems (heterogeneous data) into a single homogeneous form.
• Non-volatile: the Data Warehouse is a read-only database; its data is not overwritten or deleted, so it is non-volatile.
• Time variant: historical data with chronological importance, i.e. historical data is maintained and can be analysed over time.

One of the most important characteristics of a Data Warehouse is that it allows some redundancy of data in order to support an excellent reporting system. This is one of the main differences between an ordinary database and a Data Warehouse: an ordinary database does not accept redundancy.

GOALS OF DATA WAREHOUSING AND MINING

• To provide a reliable, single, integrated source of information.
• To give end users access to their data without reliance on reports produced by the Information Systems (IS) department.
• To allow analysis of corporate data, the building of predictive models, and improved Business Intelligence.

ARCHITECTURE OF DATA WAREHOUSE


[Architecture diagram: OLTP source systems (DATA STORE1) feed the Integration Layer / Data Warehouse (DATA STORE2) via FLOW1, which feeds the Data Mart / HPQS (DATA STORE3) via FLOW2, which in turn feeds OLAP end-user reporting (DATA STORE4) via FLOW3.]

As shown in the architecture, the approach to reporting systems and data warehousing is built around:

• Four data structures for the storage of data. They are:

1. DATA STORE1, called Online Transaction Processing (OLTP)
2. DATA STORE2, called the Integration Layer or Data Warehouse
3. DATA STORE3, called the Data Mart or High Performance Query Structure (HPQS)
4. DATA STORE4, called Online Analytical Processing (OLAP)

• Three data flow paths between the four data structures. They are:

1. FLOW1, from DATA STORE1 to DATA STORE2
2. FLOW2, from DATA STORE2 to DATA STORE3


3. FLOW3, from DATA STORE3 to DATA STORE4

This architecture can be explained as an ETL (Extract, Transform, Load) architecture, i.e. it is divided into three phases:

1. Extract phase
2. Transform phase
3. Loading phase

EXTRACT PHASE

This phase is the process of transferring data from DATA STORE1 to DATA STORE2.

DATA STORE1 is also called the Source Systems: the transaction processing system or systems, such as sales, accounting and distribution, that will provide data to your warehouse. Each of these systems has a database of information that end users need access to. Frequently, they need access to data that has been integrated from many of these systems.

There are different mechanisms for extracting that data from its sources; this is called the data extraction step.

The data from the different source systems may be in different formats, say Oracle, FoxPro, MS-Access, etc. These data have to be integrated and transferred to DATA STORE2.

The art of determining what records to extract from the source system is frequently called change data capture. The point of change data capture is to recognize which source records have changed and how, so that just the changed records are moved to the warehouse. Deletes can be very difficult to recognize because they leave no trace: the deleted record is just gone. If you need to recognize deletes, you may have to jump through a few hoops to figure out whether records have been deleted from your data source.

Some general techniques used to recognize changes to source database tables are:

• Timestamps: the lucky among us extract data from systems that timestamp records whenever they are inserted or updated. In these situations, change data capture is reduced to a search through the source tables to determine which records have changed (a query sketch is given after this list).

• Triggers: a great technique for capturing changes in source records is to put triggers on the source tables. Every time a record is inserted into, updated in, or deleted from a source table, these triggers write a corresponding message to a log. The warehouse uses the information in these logs to determine how to update itself (a trigger sketch is given after this list).

• File compares: probably the least desirable technique for identifying changes in your source data is to compare the file as it appears today with a copy of how it appeared when you last loaded the warehouse. Not only is this technique difficult to implement, it is also less accurate, because it compares only periodic "snapshots" of the data. Thus, if you load your warehouse weekly, you will only be able to see the new state of the database every week, not every change that occurred during the week.
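A minimal sketch of the first two techniques follows. All table, column and bind-variable names here (customer, last_updated, customer_changes, :last_load_date) are illustrative assumptions, not taken from any particular source system, and the "log" of the trigger technique is modelled as a log table:

-- Timestamp technique: pull only the records changed since the last load.
SELECT *
FROM   customer
WHERE  last_updated > :last_load_date;

-- Trigger technique: log every insert, update and delete on the source table.
CREATE TABLE customer_changes (
    customer_no  NUMBER,
    change_type  VARCHAR2(1),          -- 'I' insert, 'U' update, 'D' delete
    change_time  DATE DEFAULT SYSDATE
);

CREATE OR REPLACE TRIGGER trg_customer_cdc
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW
BEGIN
    IF INSERTING THEN
        INSERT INTO customer_changes (customer_no, change_type) VALUES (:NEW.customer_no, 'I');
    ELSIF UPDATING THEN
        INSERT INTO customer_changes (customer_no, change_type) VALUES (:NEW.customer_no, 'U');
    ELSE
        INSERT INTO customer_changes (customer_no, change_type) VALUES (:OLD.customer_no, 'D');
    END IF;
END;
/

The warehouse load then reads customer_changes, rather than scanning the whole source table, and applies just those changes.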

TRANSFORM PHASE

After the Extract phase, data is present in the DATA STORE2. Transform phase

is where this data is Transformed into the required form in the DATA STORE2.


Some of the fundamental steps in the Transform phase are:

1. Converting heterogeneous data to homogeneous data
2. Adding surrogate keys
3. Removing dirty data
4. Normalization

Step 1:

The data in DATA STORE2 comes from the different source systems of DATA STORE1, so it is heterogeneous and has to be converted to homogeneous data. This is the reason why DATA STORE2 is called the Integration Layer or Warehouse.
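As an illustrative sketch of this kind of integration (the staging table and the codes shown are assumptions, not taken from the paper), a transformation can standardize values that the source systems encode differently:

-- Different sources may encode gender as 'M'/'F', 'MALE'/'FEMALE' or '1'/'2';
-- the integration layer maps them all to one homogeneous code.
UPDATE stage_customer
SET    gender = CASE
                    WHEN UPPER(gender) IN ('M', 'MALE', '1')   THEN 'M'
                    WHEN UPPER(gender) IN ('F', 'FEMALE', '2') THEN 'F'
                    ELSE 'U'
                END;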

Step 2:

Special keys, which have no business meaning, are added to the data in DATA STORE2. For example, rather than using the customer number as the key on the CUSTOMER table, you might use a surrogate key that is simply a sequential number generated by your warehouse load programs. The customer number would still appear on the table; it just wouldn't be the primary key of the table.

Surrogate keys open up a whole range of functionality that is otherwise unavailable.
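A minimal sketch of how such a key could be generated (the sequence and table names are illustrative assumptions):

-- A sequence provides surrogate keys with no business meaning.
CREATE SEQUENCE customer_key_seq;

-- The load program assigns the next surrogate key to each incoming record;
-- the business customer number is kept as an ordinary column.
INSERT INTO dw_customer (customer_key, customer_no, customer_name)
SELECT customer_key_seq.NEXTVAL, customer_no, customer_name
FROM   stage_customer;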

Step 3:

Dirty data is the set of records that are not useful to the database. Dirty data can be handled in the following ways:

• Ignoring them.
• Rejecting bad records, but saving them in a separate file for manual review (see the sketch after this list).
• Loading as much of the bad record as possible and pointing out the errors for later correction.
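A minimal sketch of the second option, with a reject table standing in for the separate file (the table names and the validation rule are illustrative assumptions):

-- Copy records that fail a basic validation rule to a reject table for manual review ...
INSERT INTO stage_customer_rejects
SELECT *
FROM   stage_customer
WHERE  customer_no IS NULL;

-- ... and remove them from the clean staging data before loading.
DELETE FROM stage_customer
WHERE  customer_no IS NULL;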


Step 4:

Normalization of the data is done in DATA STORE2. A normalized database is like a flat file that has been broken up into smaller files or tables in order to store the data more efficiently. As in all databases, it is strongly recommended that referential integrity constraints be enabled in your warehouse / integration layer.
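For example, a referential integrity constraint between two integration-layer tables might be enabled like this (the table and constraint names are assumptions for illustration):

-- Every sales row must reference an existing customer in the integration layer.
ALTER TABLE dw_sales
    ADD CONSTRAINT fk_sales_customer
    FOREIGN KEY (customer_key) REFERENCES dw_customer (customer_key);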

After performing the above four steps, the data is efficiently stored in DATA STORE2.

LOADING PHASE

The transformed data is sent to DATA STORE3, which is called the DATA MART. The Loading phase constitutes DATA FLOW2 of the architecture.

DEFINITION OF DATA MART:

Data Marts are databases that share many of the features of data warehouses but are smaller in scope.

Like a warehouse, the data in a mart may come from multiple systems, although in our standard architecture this data will have been integrated in the warehouse before it ever comes to the mart.

Marts differ from warehouses in that, whereas the warehouse focuses on the needs of the entire organization, marts are dedicated to particular subject areas or to the needs of a particular department or business function.

The Loading phase can use several schemas. Two of them are:

• Star Schema, in which the data is maintained in one fact table and multiple dimension tables (a sketch is given after this list).
• Snowflake Schema, in which the data is maintained in normalized dimension tables.
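A minimal star-schema sketch (the fact and dimension tables below, and their columns, are illustrative assumptions rather than anything prescribed by the paper):

-- Dimension tables: denormalized descriptive data.
CREATE TABLE dim_customer (
    customer_key   NUMBER PRIMARY KEY,
    customer_no    NUMBER,
    customer_name  VARCHAR2(100),
    region         VARCHAR2(50)
);

CREATE TABLE dim_time (
    time_key       NUMBER PRIMARY KEY,
    calendar_date  DATE,
    month_name     VARCHAR2(20),
    year_no        NUMBER
);

-- Fact table: the measures, plus a foreign key to every dimension.
CREATE TABLE fact_sales (
    customer_key   NUMBER REFERENCES dim_customer (customer_key),
    time_key       NUMBER REFERENCES dim_time (time_key),
    quantity       NUMBER,
    sales_amount   NUMBER(12,2)
);

In a snowflake schema the dimension tables themselves would be normalized further, for example by moving region out of dim_customer into its own table.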

This DATA STORE3 is also called a High Performance Query Structure (HPQS). Data Marts, or HPQSs, are databases and data structures set up specifically to support end-user queries. These databases are most frequently managed by either relational database engines or multidimensional database engines.

Note: Data Marts and HPQSs are logical, not physical, concepts. Frequently, an organization's data warehouse and its data marts will share the same computer.

Finally, after the Extract, Transform and Loading of the data, DATA FLOW3 of the architecture takes place.

DATA FLOW3 is the transfer of data from the High Performance Query Structures to the end-user reporting applications, DATA STORE4.

There are query tools to transfer data from the Data Marts to the end user; these tools also perform the formatting of the results.
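As an illustration of the kind of query such a tool would run against the star schema sketched above (the same illustrative table names are assumed), a report of total sales by region and month could be:

-- Aggregate the fact table by two dimensions; the query tool formats the result.
SELECT c.region,
       t.month_name,
       SUM(f.sales_amount) AS total_sales
FROM   fact_sales f,
       dim_customer c,
       dim_time t
WHERE  f.customer_key = c.customer_key
AND    f.time_key     = t.time_key
GROUP BY c.region, t.month_name;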

DATA STORE4 is the data in the end user's hands. The report in the user's hands is the end of the information utility. It is also the last data store in our preferred warehousing architecture.

ETL TOOLS:

• INFORMATICA

• MICROSTRATEGY

• ABIONTASH

• ORACLE WAREHOUSE BUILDER

• SAS (Statistical Analysis System)


OLAP TOOLS:

• BUSINESS OBJECTS

• COGNOS

ORACLE8i - DATA WAREHOUSE FEATURES

• Relational Database

• PL/SQL development tools

• Oracle Warehouse Builder (OWB) – Extract, Transform and Load (ETL)

• Express – Multidimensional database engine

• Discoverer – Relational OLAP query tool

• Oracle Data Mining Suite

APPLICATION OF DATA WAREHOUSING AND MINING IN

GOVERNMENT DEPARTMENTS

All the Government departments are connected by a private network, which we assume to be highly secure and reliable. The transactions in each Government department are online. A Centralized Data Warehouse Server is maintained at one particular place, and the transactions of all the Government departments are transferred to this Centralized Server, the Data Warehouse Server. The topology of the network corresponds to the architecture of the Data Warehouse, as shown in Fig. 1.


[Fig. 1 shows three departments (Telephone Dept., Railway Dept. and Registration Dept.) sending their details, such as employee details and credit and debit details, to the Centralized Server (DWHS), which loads Data Marts that are queried through OLAP.]

DWHS: Data Warehouse Server; OLAP: Online Analytical Processing

Fig. 1: Application topology of Data Warehousing and Mining

The above is an example with three departments in the network. Data from the three departments is extracted and transformed in the Centralized Server (DWHS), and the transformed data is sent to the different Data Marts. The Data Marts can analyze the data and answer the most complex queries of the user. Report generation is immediate, so any decisions for development can be taken immediately. The data from the Centralized Server can even be published on the internet so that people know about the Government's activities and can question the Government if any incorrect data is found in the Data Warehouse. For example, the details of the salary and loans of a particular employee are available through the respective department systems, and the employee's assets are obtained from the Registration department. This data can be analyzed to conclude whether a particular employee is loyal to the Government or not, and the respective actions can be taken. Thus, Data Warehousing and Mining can take both the private and public sectors to a top level.

