Soft

CHANDIPADAR, BERHAMPUR
TECHNICAL SEMINAR ON
“DATAMINING AND WAREHOUSING”
Seminar Guide By
DR. ASHALLATA PANIGRAHY
Submitted by:
Datamining &Warehousing Page 1

NAME: RAJALAXMI PANDA
REGD. No: 0701220262

BRANCH: CSE
(Approved by A.I.C.T.E. New Delhi & Affiliated to B.P.U.T. Rourkela)

Chandipadar, Bhatakumarada, Berhampur, Orissa
This is to certify that the student RAJALAXMI PANDA of Computer
Science & Engineering has undergone the live seminar on
“DATAMINING AND WAREHOUSING” and has prepared this seminar report
by virtue of her diligence and adherence to advice.

She has successfully completed every aspect of this seminar with
great sincerity. Her sincerity and devotion during the seminar
were very much appreciated.
We wish her all success for a bright future.
Dr. Ashallata Panigrahy Dr. Ashallata Panigrahy
(HOD, CSE) (Seminar Guide)
(Principal) (Seal)
Before making a foray into the details of the seminar topic on
“DATAMINING AND WAREHOUSING”, I would like to take this
opportunity to express my gratitude and heartfelt obligation to
all those who have helped me in completing this seminar.
I have the greatest pleasure in offering my profound respect and
sincere thanks to Dr. Ashallata Panigrahy (H.O.D. of CSE) for
support in achieving the objective of my seminar.

I also owe thanks to my friends and my Seminar Guide,
Dr. Ashallata Panigrahy, for their support and encouragement
during this seminar.
Submitted by
NAME: RAJALAXMI PANDA
REGD NO: 0701220262
BRANCH: CSE
Group: 01
SEM: 8th
DECLARATION
I declare that I am fully responsible for the technical seminar
slides and this hard copy, prepared to fulfil the seminar
requirement. I have done the work myself, to the best of my
knowledge, under the guidance of Dr. Ashallata Panigrahy. If I
have made any mistakes, I request you to excuse me, as I am a
student under your guidance.
Submitted by
NAME: RAJALAXMI PANDA
REGD NO: 0701220262
BRANCH: CSE
GROUP: 01
SEM: 8th
ABSTRACT

Data mining, the extraction of hidden predictive information
from large databases, is a powerful new technology with great
potential to help companies focus on the most important
information in their data warehouses. Data mining tools predict
future trends and behaviors, allowing businesses to make
proactive, knowledge-driven decisions. The automated, prospective
analyses offered by data mining move beyond the analyses of past
events provided by retrospective tools typical of decision
support systems. Data mining tools can answer business questions
that traditionally were too time-consuming to resolve. Data
mining techniques can be implemented rapidly on existing software
and hardware platforms to enhance the value of existing
information resources, and can be integrated with new products
and systems as they are brought online.
Analyzing data can provide further knowledge about a business by
going beyond the data explicitly stored to derive knowledge about
the business.

Contents
1. INTRODUCTION
   WHAT IS DATA MINING?
   GOALS OF DATA MINING
2. DATA MINING BACKGROUND
3. HOW DATA MINING WORKS?
   DATA MINING PROCESS
4. DATA MINING MODELS AND ALGORITHMS
5. DATA WAREHOUSING
   CHARACTERISTICS OF DATA WAREHOUSING
6. PROCESSES IN DATA WAREHOUSE
7. APPLICATION, ADVANTAGES OF DATA MINING
8. APPLICATION, ADVANTAGES, DISADVANTAGES OF DATA WAREHOUSING
9. CONCLUSION
10. REFERENCE

INTRODUCTION
What is Data Mining?
The objective of data mining is to extract valuable information
from your data, to discover the “hidden gold.” This gold is the
valuable information in that data. Small changes in strategy,
provided by data mining’s discovery process, can translate into a
difference of millions of dollars to the bottom line. With the
proliferation of data warehouses, data mining tools are fast
becoming a business necessity. An important point to remember,
however, is that you do not need a data warehouse to use data
mining successfully: all you need is data.
“Data mining is the search for relationships and global patterns
that exist in large databases but are ‘hidden’ among the vast
amount of data, such as a relationship between patient data and
their medical diagnosis. These relationships represent valuable
knowledge about the database and the objects in the database and,
if the database is a faithful mirror, of the real world
registered by the database.”

GOALS OF DATA MINING
The two primary goals of data mining tend to be prediction and
description. Prediction involves using some variables or fields in
the data set to predict unknown or future values of other
variables of interest. Description, on the other hand, focuses on
finding patterns describing the data that can be interpreted by
humans. Therefore, it is possible to put data-mining activities into
one of two categories:
1) Predictive data mining, which produces the model of the
system described by the given data set, or
2) Descriptive data mining, which produces new, nontrivial
information based on the available data set.
The goals of prediction and description are achieved by using
data-mining techniques, some of which are described later in this
report, for the following primary data-mining tasks:
1. Classification - discovery of a predictive learning function
that classifies a data item into one of several predefined classes.
2. Regression - discovery of a predictive learning function,
which maps a data item to a real-value prediction variable.

3. Clustering - a common descriptive task in which one seeks to
identify a finite set of categories or clusters to describe the data.
4. Summarization - an additional descriptive task that involves
methods for finding a compact description for a set (or subset) of
data.
5. Dependency Modeling - finding a local model that describes
significant dependencies between variables or between the
values of a feature in a data set or in a part of a data set.
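To make the regression task above concrete, here is a minimal sketch in Python of a predictive function that maps a data item to a real value. The data (advertising spend and sales) and the function names fit_line and predict are invented for illustration and are not from the seminar material; the fit is an ordinary least-squares line.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Map a data item x to a real-valued prediction."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data: advertising spend (x) and resulting sales (y).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
model = fit_line(spend, sales)
forecast = predict(model, 5.0)  # prediction for an unseen spend value
```

A classification task would differ only in the output: the learned function would return one of several predefined class labels instead of a real number.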
DATA MINING BACKGROUND
Data Mining has drawn on a number of fields such as
inductive learning, machine learning, statistics, etc.
Inductive Learning: Induction is the inference of information
from data, and inductive learning is the model-building process
in which the environment, i.e. the database, is analyzed with a
view to finding patterns. Similar objects are grouped into
classes, and rules are formulated whereby it is possible to
predict the class of unseen objects.
Statistics: Statistics has a solid theoretical foundation, but
the results from statistics can be overwhelming and difficult
to interpret, as they require user guidance as to where and
how to analyze the data. Data mining, however, allows the
expert’s knowledge of the data and the advanced analysis
techniques of the computer to work together. An example of
statistical induction is the average rate of failure of
machines.
Machine Learning: Machine learning is the automation of a
learning process, where learning is tantamount to the
construction of rules based on observations of environmental
states and transitions. This is a broad field, which includes
not only learning from examples but also reinforcement
learning, learning with a teacher, etc.
HOW DATA MINING WORKS?
Data mining includes several steps: problem analysis, data
extraction, data cleansing, rules development, output analysis
and review. Data may, however, be derived from almost any
source.
DATA MINING PROCESSES
1. State the problem
2. Collect the data
3. Perform processing
4. Estimate the model (mine the data)
5. Interpret the model and draw conclusions
THE DATA MINING PROCESS
Here is a process for extracting hidden knowledge from your
data warehouse, your customer information file, or any other
company database.
1. Identify The Objective -- Before you begin, be clear on
what you hope to accomplish with your analysis. Know in
advance the business goal of the data mining. Establish whether
or not the goal is measurable.
2. Select The Data -- Once you have defined your goal,
your next step is to select the data to meet this goal. It may be
your customer information file.
3. Prepare The Data -- Once you've assembled the data,
you must decide which attributes to convert into usable
formats. Consider the input of domain experts, that is, the
creators and users of the data.
4. Audit The Data -- Evaluate the structure of your data in
order to determine the appropriate tools.
5. Select The Tools -- Two concerns drive the selection of
the appropriate data-mining tool: your business objectives and
your data structure. Both should guide you to the same tool.
6. Format The Solution -- In conjunction with your data audit,
your business objective and your choice of tool determine
the format of your solution.
7. Construct The Model -- It is at this point that the data mining
process begins. Usually the first step is to use a random number
seed to split the data into a training set and a test set, and
then to construct and evaluate a model. The generation of
classification rules, decision trees, clustering sub-groups,
scores, code, weights and evaluation data/error rates takes place
at this stage.
8. Validate The Findings -- Share and discuss the results of the
analysis with the business client or domain expert. Ensure that
the findings are correct and appropriate to the business
objectives.
9. Deliver The Findings -- Provide a final report to the business
unit or client. The report should document the entire data mining
process, including data preparation, tools used, test results,
source code, and rules.
10. Integrate The Solution -- Share the findings with all
interested end-users in the appropriate business units.
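Step 7 above mentions using a random number seed to split the data into a training set and a test set. A minimal sketch of that split, using only the Python standard library, could look like the following. The function name train_test_split, the 25% test fraction, and the seed value 42 are assumptions chosen for illustration.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=42):
    """Shuffle a copy of the records with a seeded generator, then split."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    shuffled = list(records)   # copy, so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical data set of 20 record identifiers.
records = list(range(20))
train, test = train_test_split(records)
train_again, test_again = train_test_split(records)
```

Because the seed is fixed, running the split again yields the same two sets, so a model can be rebuilt and re-evaluated on exactly the same data when the findings are validated in step 8.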
DATA MINING MODELS AND ALGORITHMS
NEURAL NETWORKS
Neural networks are of particular interest because they offer a
means of efficiently modeling large and complex problems in
which there may be hundreds of predictor variables that have
many interactions. (Actual biological neural networks are
incomparably more complex.) Neural nets may be used in
classification problems (where the output is a categorical
variable) or for regressions (where the output variable is
continuous).

A neural network starts with an input layer,
where each node corresponds to a predictor variable. These
input nodes are connected to a number of nodes in a hidden
layer. Each input node is connected to every node in the hidden
layer. The nodes in the hidden layer may be connected to nodes
in another hidden layer, or to an output layer. The output layer
consists of one or more response variables.
After the input layer, each node takes in a set of inputs,
multiplies each by a connection weight (Wxy), adds them
together, applies a function (called the activation or squashing
function) to the sum, and passes the output to the node(s) in the
next layer. Each node may be viewed as a predictor variable or
as a combination of predictor variables. The connection weights
(W’s) are the unknown parameters, which are estimated by a
training method.
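The node computation described above (weight the inputs, add them, apply a squashing function) can be sketched in a few lines of Python. The weights below are arbitrary illustrative values, not trained ones, and the logistic sigmoid is just one common choice of activation function; like the description above, the sketch leaves out any bias term.

```python
import math


def sigmoid(z):
    """A common squashing function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, activation=sigmoid):
    """Multiply each input by its weight, add them up, apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)


def layer_output(inputs, weight_rows):
    """Every node in a layer sees all inputs; one row of weights per node."""
    return [node_output(inputs, row) for row in weight_rows]


# Hypothetical network: 2 predictor variables, 2 hidden nodes, 1 output node.
inputs = [1.0, 2.0]
hidden_weights = [[0.5, -0.25], [0.0, 0.0]]
hidden = layer_output(inputs, hidden_weights)
output = node_output(hidden, [1.0, 1.0])
```

Training would consist of adjusting the numbers in hidden_weights and the output weights until the outputs match known responses; that part is not shown here.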

Fig – A simple Neural Network
Users must be conscious of several facts about neural
networks: First, neural networks are not easily interpreted.
There is no explicit rationale given for the decisions or
predictions a neural network makes. Second, they tend to
overfit the training data unless very stringent measures, such as
weight decay and/or cross validation, are used judiciously. This
is due to the very large number of parameters of the neural
network, which, if allowed to be of sufficient size, will fit any
data set arbitrarily well when allowed to train to convergence.
Third, neural networks require an extensive amount of training
time unless the problem is very small. Once trained, however,
they can provide predictions very quickly.
DECISION TREES
Decision trees are a way of representing a series
of rules that lead to a class or value. For example, you may wish
to classify loan applicants as good or bad credit risks. The
figure below shows a simple decision tree that solves this problem
while illustrating all the basic components of a decision tree: the
decision node, branches and leaves.
A Simple Decision Tree Structure.
Depending on the algorithm, each node may have two or
more branches. For example, CART generates trees with only
two branches at each node. Such a tree is called a binary tree.
When more than two branches are allowed it is called a
multiway tree. Each branch will lead either to another decision
node or to the bottom of the tree, called a leaf node. By
navigating the decision tree you can assign a value or class to a
case by deciding which branch to take, starting at the root node
and moving to each subsequent node until a leaf node is
reached. Each node uses the data from the case to choose the
appropriate branch.
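The navigation just described can be sketched as a short Python routine. The tree below is a hypothetical binary tree for the loan example; the attributes income, high_debt and years_in_job and their thresholds are invented for illustration and are not taken from the seminar's figure.

```python
# Decision nodes are dicts holding a test and two branches;
# leaf nodes are plain class labels.
tree = {
    "test": lambda case: case["income"] > 40000,
    "yes": {
        "test": lambda case: case["high_debt"],
        "yes": "Bad Risk",
        "no": "Good Risk",
    },
    "no": {
        "test": lambda case: case["years_in_job"] > 5,
        "yes": "Good Risk",
        "no": "Bad Risk",
    },
}


def classify(node, case):
    """Walk from the root to a leaf, choosing one branch per decision node."""
    while isinstance(node, dict):
        node = node["yes"] if node["test"](case) else node["no"]
    return node


applicant = {"income": 52000, "high_debt": False, "years_in_job": 2}
label = classify(tree, applicant)
```

Each pass through the loop corresponds to one decision node using the data from the case to pick a branch; the loop ends when a leaf is reached.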
Decision trees are grown through an iterative splitting of data
into discrete groups, where the goal is to maximize the
“distance” between groups at each split. One of the distinctions
between decision tree methods is how they measure this
distance. While the details of such measurement are beyond the
scope of this introduction, you can think of each split as
separating the data into new groups, which are as different from
each other as possible. This is also sometimes called making the
groups purer. Using our simple example where the data had two
possible output classes (Good Risk and Bad Risk), it would be
preferable if each data split found a criterion resulting in “pure”
groups with instances of only one class instead of both classes.
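As an illustration of what "purer" can mean in practice, the sketch below uses Gini impurity, the measure used by CART. It is chosen here only as an example; other decision tree methods use other measures, and the class labels and splits are invented.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 two-class mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted average impurity of the two groups a split produces."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


# Hypothetical node with three Good Risk and three Bad Risk cases.
parent = ["Good", "Good", "Good", "Bad", "Bad", "Bad"]
pure_split = (["Good", "Good", "Good"], ["Bad", "Bad", "Bad"])
mixed_split = (["Good", "Bad", "Good"], ["Bad", "Good", "Bad"])

pure_score = split_impurity(*pure_split)
mixed_score = split_impurity(*mixed_split)
```

A tree-growing algorithm built on this measure would prefer the first split, because its weighted impurity is lower than that of the second.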
Decision trees used to predict categorical variables are called
classification trees because they place instances in categories
or classes. Decision trees used to predict continuous variables
are called regression trees.

The example we’ve been using up until now has been very
simple. The tree is easy to understand and interpret. However,
trees can become very complicated. Imagine the complexity of
a decision tree derived from a database of hundreds of
attributes and a response variable with a dozen output classes.
Decision trees make few passes through the data (no more
than one pass for each level of the tree) and they work well with
many predictor variables. As a consequence, models can be
built very quickly, making them suitable for large data sets.
Trees left to grow without bound take longer to build and
become unintelligible, but more importantly they overfit the
data. Tree size can be controlled via stopping rules that limit
growth. One common stopping rule is simply to limit the
maximum depth to which a tree may grow. Another stopping
rule is to establish a lower limit on the number of records in a
node and not do splits below this limit.
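The two stopping rules above can be sketched as follows. This is a skeleton only: it splits each node's records in half as a placeholder where a real algorithm would search for the best split, and the parameter names max_depth and min_records are assumptions.

```python
def grow(records, depth=0, max_depth=3, min_records=4):
    """Return a nested dict describing the shape of the grown tree."""
    # Stopping rule 1: do not grow beyond the maximum depth.
    # Stopping rule 2: do not split a node holding too few records.
    if depth >= max_depth or len(records) < min_records:
        return {"leaf": True, "size": len(records)}
    mid = len(records) // 2  # placeholder for a real best-split search
    return {
        "leaf": False,
        "left": grow(records[:mid], depth + 1, max_depth, min_records),
        "right": grow(records[mid:], depth + 1, max_depth, min_records),
    }


def depth_of(node):
    """Number of levels of decision nodes above the deepest leaf."""
    if node["leaf"]:
        return 0
    return 1 + max(depth_of(node["left"]), depth_of(node["right"]))


deep_limited = grow(list(range(100)), max_depth=3, min_records=4)
size_limited = grow(list(range(6)), max_depth=10, min_records=4)
```

In the first call the depth limit stops growth; in the second, the six records are split once and the resulting groups of three are too small to split again.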
An alternative to stopping rules is to prune the tree. The tree
is allowed to grow to its full size and then, using either built-in
heuristics or user intervention, the tree is pruned back to the
smallest size that does not compromise accuracy. For example,
a branch or subtree that the user feels is inconsequential
because it has very few cases might be removed.
DATA WAREHOUSING
Data Warehousing is the process of extracting and
transforming operational data into informational data and
loading it into a central data store or warehouse. Once the data
is loaded it is accessible via desktop query and analysis tools by
the decision makers.
The data within the actual warehouse itself has a distinct
structure with the emphasis on different levels of summarization
as shown in the figure below. A data warehouse is a relational
database management system (RDBMS) designed specifically to
meet the needs of decision support rather than transaction
processing. It can be loosely defined as any centralized data
repository which can be queried for business benefit.

Characteristics of a data warehouse
There are generally four characteristics that describe a data
warehouse.
1. Subject Oriented: Data are organized according to subject
instead of application. The data organized by subject contain
only the information necessary for decision support processing.
2. Integrated: When data resides in many separate applications in
the operational environment, the encoding of data is often
inconsistent. When data are moved into the warehouse, they are
converted to a consistent encoding.
3. Time-Variant: The data warehouse contains a place for storing
data that are 5 to 10 years old, or older, to be used for
comparisons, trends, and forecasting. These data are not updated.
4. Non-Volatile: Data are not updated or changed in any way once
they enter the data warehouse, but are only loaded and accessed.
Processes in Data Warehouse
The first phase in data warehousing is to “insulate” your
current operational information. The data warehouse thus
retrieves data from a variety of heterogeneous operational
databases. The data is then transformed and delivered
to the data warehouse/store based on the selected model (or
mapping definitions). The data transformation and movement
processes are executed whenever an update to the
warehouse data is required, so there should be some form of
automation to manage and execute these functions.
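The retrieve, transform and deliver steps above can be sketched with Python's built-in sqlite3 module standing in for the warehouse. Everything specific here is an assumption made for illustration: the two source systems (billing and support), their rows, the gender encodings, and the table name customer_dim.

```python
import sqlite3

# Hypothetical extracts from two operational systems that encode
# the same attribute in different ways.
billing_rows = [("C001", "M"), ("C002", "F")]
support_rows = [("C003", "male"), ("C004", "female")]

# Mapping definition: source encodings -> the single warehouse encoding.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}


def transform(rows, source):
    """Apply the mapping so every row uses the warehouse encoding."""
    return [(cust_id, GENDER_MAP[gender.lower()], source)
            for cust_id, gender in rows]


def load(conn, rows):
    """Deliver transformed rows to the warehouse table."""
    conn.executemany("INSERT INTO customer_dim VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
warehouse.execute(
    "CREATE TABLE customer_dim (cust_id TEXT PRIMARY KEY, gender TEXT, source TEXT)"
)
load(warehouse, transform(billing_rows, "billing"))
load(warehouse, transform(support_rows, "support"))

genders = [row[0] for row in warehouse.execute(
    "SELECT gender FROM customer_dim ORDER BY cust_id")]
```

In a real installation these functions would be run by a scheduler whenever the warehouse needs refreshing, which is the automation the paragraph above calls for.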
The information that describes the model and definitions of
the source data elements is called metadata. The metadata is
the means by which the end user finds and understands the
data in the warehouse, and it is an important part of the
warehouse. The metadata should at the very least contain:
• Structure of the data
• Algorithm used for summarization
• Mapping from the operational environment
to the data warehouse.
Data Warehousing and OLTP systems
A database built for online transaction processing (OLTP) is
generally regarded as unsuitable for data warehousing, as it has
been designed with a different set of needs in mind, i.e.
maximizing transaction capacity, and typically has hundreds of
tables in order not to lock out users. Data warehouses are
interested in query processing as opposed to transaction
processing.

OLTP systems cannot be repositories of facts and historical data
for business analysis. They cannot quickly answer ad hoc
queries, and rapid retrieval is almost impossible. The data is
inconsistent and changing, duplicate entries exist, entries can
be missing, and there is an absence of historical data, which is
necessary to analyze trends. Basically, OLTP offers large amounts
of raw data which are not easily understood. The data warehouse
offers the potential to retrieve and analyze information easily
and quickly.
APPLICATIONS OF DATA MINING:
Data mining has many and varied fields of application, some
of which are listed below:
• A pharmaceutical company
• A credit card company
• Banking
• Insurance and health care
• Medicine
• Large consumer packaged goods company
BENEFITS OF DATA MINING:
• Data mining can aid direct marketers by providing them
with useful and accurate trends about their customers’
purchasing behavior. Based on these trends, marketers
can direct their marketing attention to their customers with
more precision.
• Data mining can assist financial institutions in areas such as
credit reporting and loan information.
• Data mining can aid law enforcers in identifying criminal
suspects as well as apprehending these criminals by
examining trends in location, crime type, habit, and other
patterns of behavior.
• Data mining can assist researchers by speeding up their
data analysis, thus allowing them more time to
work on other projects.
APPLICATIONS OF DATA WAREHOUSING:
Some of the applications data warehousing can be used for are:
• Decision support
• Financial forecasting
• Insurance fraud analysis
• Call record analysis
• Agriculture
BENEFITS OF DATA WAREHOUSING
• A data warehouse provides a common data model for all
data of interest, regardless of the data's source.
• Prior to loading data into the data warehouse,
inconsistencies are identified and resolved. This greatly
simplifies reporting and analysis.
• Because they are separate from operational systems, data
warehouses provide retrieval of data without slowing down
operational systems.
• Data warehouses can record historical information for data
source tables that are not set up to save an update history.
DISADVANTAGES OF DATA WAREHOUSING

• Before data can be stored within the warehouse, it must be
extracted, cleaned, and loaded. This is a process that can take
a long period of time.
• If users are not trained properly, they may choose not to work
with the data warehouse.
• If the data warehouse can be accessed via the internet, this
could lead to a large number of security problems.
• Another problem with the data warehouse is that it is difficult
to maintain.
CONCLUSION
Comprehensive data warehouses that integrate operational
data with customer, supplier, and market information have
resulted in an explosion of information. Competition requires
timely and sophisticated analysis on an integrated view of the
data. However, there is a growing gap between more powerful
storage and retrieval systems and the users’ ability to
effectively analyze and act on the information they contain. A
new technological leap is needed to structure and prioritize
information for specific end-user problems. Data mining tools can
make this leap. Quantifiable business benefits have been proven
through the integration of data mining with current information
systems, and new products are on the horizon that will bring this
integration to an even wider audience of users.
References
➢ META Group, Application Development Strategies: “Data Mining
for Data Warehouses: Uncovering Hidden Patterns”.
➢ Gartner Group, High Performance Computing Research Note.
➢ Gartner Group, Advanced Technologies and Applications Research
Notes.
Websites
www.datamining.com
www.askjeevs.com
