

Submitted To
Department of MCA,
Vignan University,

Submitted By
E.Raja Sekhar,
MCA II Year(09491F0040),
QIS College of Engg&Tech,


Data mining is the process of exploration and analysis, by automatic or
semiautomatic means, of large quantities of data in order to discover meaningful patterns
and rules. Revisiting that definition a few years later, we find that, for the most part, it
stands up pretty well. We like the emphasis on large quantities of data, since data
volumes continue to increase. We like the notion that the patterns and rules to be found
ought to be meaningful. If there is anything we regret, it is the phrase “by automatic or
semiautomatic means”. Not because it is untrue (without automation it would be
impossible to mine the huge quantities of data being generated today), but because we feel
there has come to be too much focus on the automatic techniques and not enough on the
exploration and analysis. This has misled many people into believing that data mining is
a product that can be bought rather than a discipline that must be mastered.
The newest answer is data mining, which is being used both to increase revenues
and to reduce costs. The potential returns are enormous. Innovative organizations
worldwide are already using data mining to locate and appeal to higher-value customers,
to reconfigure their product offerings to increase sales, and to minimize losses due to
error or fraud. Data mining is a process that uses a variety of data analysis tools to
discover patterns and relationships in data that may be used to make valid predictions.
The first and simplest analytical step in data mining is to describe the data — summarize
its statistical attributes (such as means and standard deviations), visually review it using
charts and graphs, and look for potentially meaningful links among variables (such as
values that often occur together). Collecting, exploring and selecting the right data are
critically important. But data description alone cannot provide an action plan. You must
build a predictive model based on patterns determined from known results, then test
that model on results outside the original sample. A good model should never be
confused with reality (you know a road map isn’t a perfect representation of the actual
road), but it can be a useful guide to understanding your business.
The final step is to empirically verify the model. For example, from a database
of customers who have already responded to a particular offer, you’ve built a model
predicting which prospects are likeliest to respond to the same offer. Can you rely on this
prediction? Send a mailing to a portion of the new list and see what results you get.

What can Data mining do?

The term data mining is often thrown around rather loosely. We use the term for a
specific set of activities, all of which involve extracting meaningful new information from
the data. The six activities are:
• Classification
• Estimation
• Prediction
• Affinity grouping or association rules
• Clustering
• Description and visualization

Classification consists of examining the features of a newly presented object and
assigning to it a predefined class. For our purposes, the objects to be classified are
generally represented by records in a database. The act of classification consists of
updating each record by filling in a field with a class code. The classification task is
characterized by a well-defined definition of the classes, and a training set consisting of
preclassified examples. The task is to build a model that can be applied to unclassified
data in order to classify it.
Examples of classification tasks include:
• Assigning keywords to articles as they come in off the news wire.
• Classifying credit applicants as low, medium, or high risk.
• Determining which home telephone lines are used for Internet access.
• Assigning customers to predefined customer segments.

Classification deals with discrete outcomes: yes or no; debit card, mortgage, or car
loan. Estimation deals with continuously valued outcomes. Given some input data, we
use estimation to come up with a value for some unknown continuous variable such as
income, height, or credit card balance. In practice, estimation is often used to perform a
classification task. A bank trying to decide to whom it should offer a home equity loan might
run all its customers through a model that gives them each a score, such as a number
between 0 and 1. This is actually an estimate of the probability that the person will
respond positively to an offer. This approach has the great advantage that the individual
records may now be rank ordered from most likely to least likely to respond. The
classification task now comes down to establishing a threshold score. Anyone with a
score greater than or equal to the threshold will receive the offer.
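
The ranking and threshold step described above can be sketched in a few lines of Python. This is only an illustration: the customer names, the scores, and the function name select_by_threshold are hypothetical, and the scores are assumed to come from some estimation model that is not shown here.

```python
def select_by_threshold(scores, threshold):
    """Rank customers by score (highest first) and keep those at or above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [customer for customer, score in ranked if score >= threshold]


# Hypothetical response-probability scores between 0 and 1.
scores = {"A": 0.91, "B": 0.35, "C": 0.60, "D": 0.12, "E": 0.78}

# Everyone scoring at or above the threshold receives the offer.
offer_list = select_by_threshold(scores, threshold=0.60)
print(offer_list)
```

Raising or lowering the threshold changes how many customers receive the offer without rebuilding the model, which is the practical advantage of scoring first and classifying afterwards.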

Any prediction can be thought of as classification or estimation. The difference
is one of emphasis. When data mining is used to classify a phone line as primarily used for
Internet access or a credit card transaction as fraudulent, we do not expect to be able to go
back later to see if the classification was correct. Our classification may be correct or
incorrect, but the uncertainty is due only to incomplete knowledge: out in the real world,
the relevant actions have already taken place.

Clustering is the task of segmenting a diverse group into a number of more similar
subgroups or clusters. What distinguishes clustering from classification is that clustering
does not rely on predefined classes. In clustering, there are no predefined classes and no
examples. The records are
grouped together on the basis of self-similarity. It is up to the miner to determine what
meaning, if any, to attach to the resulting cluster. A particular cluster of symptoms might
indicate a particular disease. Dissimilar clusters of video and music purchases might
indicate membership in different subcultures.
Clustering is often done as a prelude to some other form of data mining or
modeling. For example, clustering might be the first step in a market segmentation
effort. Instead of trying to come up with a one-size-fits-all rule for “what kind of
promotion do customers respond to best,” first divide the customer base into clusters of
people with similar buying habits, and then ask what kind of promotion works best for
each cluster.

Need of Data mining:-

A senior vice president in the credit card group of a large bank has spent
thousands of dollars developing a response model. This model is designed to identify the
prospects who are most likely to respond to the bank’s next offering. The VP is told that
by using the model, she can save money: using only 20% of the prospect list will yield
70% of the responders. However, despite these findings, she replies that she wants every
responder, not just some of them. Getting every responder requires using the entire
prospect list, since no model is perfect. In this case, data mining is not necessary.
Data mining is not always necessary. As this example shows, not every marketing
campaign requires a response model. Sometimes, the purpose of the campaign is to
communicate to everyone. Or, the communication might be particularly inexpensive,
such as a billing insert or e-mail, so there may be less reason to focus only on responders.
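
The trade-off in the response model example can be put into numbers. The sketch below uses only the two figures quoted above (20% of the list, 70% of the responders); the function name lift is our own label for the ratio, a term commonly used for this measure but not one the example itself uses.

```python
def lift(fraction_contacted, fraction_of_responders_reached):
    """How many times better than random selection the model's top slice performs."""
    return fraction_of_responders_reached / fraction_contacted


contacted = 0.20   # share of the prospect list that is mailed
reached = 0.70     # share of all responders found in that slice

print(f"lift: {lift(contacted, reached):.1f}")
print(f"responders missed: {1.0 - reached:.0%}")
```

The ratio works out to about 3.5, but roughly 30% of the responders are still missed, which is why a VP who wants every responder has no use for the model.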

Evaluation of Data mining:-

It’s remarkable how the relationships between data sometimes must be
discovered. They are not readily apparent without inspection, and they must be brought to the
surface. This is an everyday occurrence evident in all areas of our lives. In the province
of Ontario, for example, the last six digits of a person’s driver’s license are the year,
month, and day of their birth.
Discovery deliberately goes out with no predetermined idea of what the search will
find. There is no intervention in the process by the end user; the data mining discovery
process wades through the source data looking for similarities and occurrences of data,
which allows for grouping and pattern identification. In the electronic data warehousing
environment, this process must be able to do its thing in a relatively short time period.
Rapid delivery of results is crucial to the adoption of a data-mining product.
So much data is captured in some operational systems that the contents of
seemingly unimportant data elements can easily become lost in the big picture. Suppose
the customer address-tracking component of a system insisted that a ZIP code be entered
when new customer information is recorded. This is deliberate. Until the information
embedded in this code becomes useful to the analyst, the ZIP code is just another field, like
credit limit and payment
method. Every time an invoice is generated, the ZIP code tags along and appears on the
header of the bill, as well as on the address on the envelope. The complete address
record for each customer can yield invaluable geographical information about customer

Knowing streets, cities, and building numbers is useful, but the ZIP code allows
you to pinpoint segments of large areas within heavily populated urban areas. In the
United States, the “ZIP plus four” ZIP code provides such pinpoint accuracy that
knowledge of the city and state is not required. Take that ZIP code data and merge it
with Census data, and you now have the ability to determine the latitude and longitude of
any given person. You can now look at large areas of the population or even drill down
to a particular street. In this way, the substance of data becomes important, and data
element values can provide an enormous payoff when absorbed by a data mining initiative.

Our experience has taught us that even with the best of tools, you still need an
individual “super user” who understands the business and, more importantly, the data and
what the relationships between the data mean. Finding these relationships can be quite a
task in itself. Very few people have the mathematical background to approach this in a
step-by-step “repeatable and teachable” process. So what you get many times is just
more intelligent guesses.

Up until now, very few organizations had a “super user” with all the right skills,
so they could never fully harness the potential of their data. With its predictive modeling,
Oracle Darwin gives your “super user” the tools they need to gain insight into your data
and unlock its potential. To me, Oracle Darwin is the third generation of data-mining
tools. With the first tools you gained hindsight, with the next foresight, and now you can
gain insight.

Data mining techniques:-

Data mining uses a number of techniques to discover patterns and uncover trends
in data warehouse data. By looking briefly at three of these techniques, you
will begin to see how data mining works.

Neural Network:
Neural network based mining is especially suited to identifying patterns or
forecasting trends based on previously identified behavior. A trend identifies a movement
in habit based on past behavior. When a particular company’s share price over the last six
months is analyzed, you may be able to predict a trend in share price over the next
number of weeks.
The roots of this type of processing lie in what was learned from working with the
central nervous system. Knowledge can be learned from a set of widely disparate,
complex, or imprecise data. There are three layers to the network: the bottom layer
receives input, the hidden (middle) layer performs the work and the output layer presents
the analyst with outputs. In a marketing organization, the inputs could be historical
information pertaining to the spending habits of clients in close proximity to the time the
company undertook significant new marketing initiatives. The hidden, or middle, layer
processes incoming information and passes results in the form of patterns and trends to
the output layer. The input layer, the hidden layer and the output layer are made up of
nodes. These nodes are another term for processing elements, which are likened to the
neurons in the brain; hence, the terminology neural network.

Fig: Neural Network Structure

When this network is trained on the information in the input layer, it takes on an
eerily humanlike quality as it becomes an expert at ingesting seemingly unrelated
elements of data and spitting out results to the output. The number of nodes, though
unknown in the hidden layer, decreases as the results rise to the surface and are spewed
out by the output layer. The above figure shows the structure of a neural network and
how each node in every layer is interconnected to each node in the adjacent layer.
Suppose the network shown in the above figure weighted factors affecting the risk
of loaning money to a segment of the general public. At the input layer, each node would
contain information related to a single factor about the borrower, such as:
• Age
• Checking account bank balance
• Annual income
• Credit rating
• Years at current job
• Marital status
• Monthly balance carried on charge card
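
The way values flow from the input layer through the hidden layer to the output layer can be sketched as follows. This is a minimal illustration, not a trained model: the weights, biases, the sigmoid activation, and the three scaled borrower factors are all assumptions chosen for the example, and training (how the weights are learned) is not shown.

```python
import math


def sigmoid(x):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_output(inputs, weights, biases):
    """Each node sums its weighted inputs, adds a bias, and applies the sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Pass the inputs through the hidden layer, then through the output layer."""
    hidden = layer_output(inputs, hidden_weights, hidden_biases)
    return layer_output(hidden, output_weights, output_biases)


# Three borrower factors, already scaled to 0..1 (hypothetical values),
# for example age, annual income, and credit rating.
borrower = [0.4, 0.7, 0.9]

# Hypothetical weights: 2 hidden nodes with 3 inputs each, 1 output node with 2 inputs.
hidden_weights = [[0.5, -0.2, 0.8], [-0.6, 0.9, 0.1]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.2, -0.7]]
output_biases = [-0.3]

risk_score = forward(borrower, hidden_weights, hidden_biases,
                     output_weights, output_biases)[0]
print(f"risk score: {risk_score:.3f}")
```

Every node in one layer feeds every node in the next, which matches the fully interconnected structure described for the figure.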

Automatic Cluster Detection:

There are many mathematical approaches to finding clusters in data. Some
methods, called divisive methods, start by considering all records to be part of one big
cluster. That cluster is then split into two or more smaller clusters, which are themselves
split until eventually each record has a cluster all to itself. At each step in the process,
some measure of the value of the splits is recorded so that the best set of clusters can be
chosen at the end. Other methods, called agglomerative methods, start with each record
occupying a separate cluster, and iteratively combine clusters until there is one big one
containing all the records. There are also self-organizing maps, a specialized form of
neural network that can be used for cluster detection. K-means, a clustering algorithm, is
available in a wide variety of commercial data mining tools and is more easily explained
than most. It works best when the input data is primarily numeric. For an example of
clustering, consider an analysis of supermarket shopping behavior based on loyalty card
data. Simply take each customer and create a field for the total amount purchased in
various departments in the supermarket over the course of some period of time: dairy, meat,
cereal, fresh produce, and so on. This data is all numeric, so k-means clustering can work
with it quite easily. In fact, the automatic cluster detection algorithm will find clusters of
the customers with similar purchasing patterns.
How K-Means Cluster Detection Works:
This algorithm divides a data set into a predetermined number of clusters. That
number is the “k” in the phrase k-means. A mean is, of course, just what a statistician
calls an average. In this case it refers to the average location of the members of a
particular cluster.
To form clusters, each record is mapped to a point in “record space”. The space
has as many dimensions as there are fields in the records. The value of each field is
interpreted as a distance from the origin along the corresponding axis of the space. In
order for this geometric interpretation to be useful, the fields must all be converted into
numbers and the numbers must be normalized so that a change in one dimension is
comparable to a change in another.
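
One common way to make the dimensions comparable is to rescale each field to the range 0 to 1. The sketch below assumes this min-max scaling, which is only one of several possible normalizations; the function name min_max_normalize and the two-field records (age, income) are hypothetical.

```python
def min_max_normalize(records):
    """Rescale every field to the range 0..1 so that no field dominates the distance."""
    columns = list(zip(*records))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    normalized = []
    for record in records:
        row = []
        for value, low, high in zip(record, lows, highs):
            span = high - low
            # A field with a single value carries no information; map it to 0.
            row.append((value - low) / span if span else 0.0)
        normalized.append(row)
    return normalized


# Hypothetical records: (age in years, income in dollars).
records = [[25, 20000], [40, 50000], [55, 80000]]
print(min_max_normalize(records))
```

Without this step, a difference of a few thousand dollars in income would swamp any difference in age when distances between records are computed.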
Records are assigned to clusters through an iterative process that starts with
clusters centered at essentially random locations in the record space and moves the
cluster centroids (another name for cluster means) around until each one is actually at
the center of some cluster of records. This process is illustrated in the following figure:

In the first step, we select k data points to be the seeds, more or less arbitrarily.
Each of the seeds is an embryonic cluster with only one element. In this example, k is 3.
In the second step, we assign each record to the cluster whose centroid is nearest.
As we continue to work through the k-means algorithm, pay particular attention to
the fate of the point with the box drawn around it. On the basis of the initial seeds, it is
assigned to the cluster controlled by seed number 2 because it is closer to that seed than to
either of the others. It is on seed 3’s side of the perpendicular separating seeds 1 and 3, on
seed 2’s side of the perpendicular separating seeds 2 and 3, and on seed 2’s side of the
perpendicular separating seeds 1 and 2.
At this point, every point has been assigned to exactly one of the three clusters
centered around the original seeds. The next step is to calculate the centroids of the new
clusters. This is simply a matter of averaging the positions of the points in each cluster
along each dimension. Remember that k-means clustering requires that the data values be
numeric. Therefore, it is possible to calculate the average position just by taking the
average of each field. If there are 200 records assigned to a cluster and we are clustering
based on four fields from those records, then geometrically we have 200 points in a
four-dimensional space. The location of each record is described by four fields, with the
form (x1, x2, x3, x4). The value of x1 for the new centroid is the average of all 200 x1 values, and
similarly for x2, x3, and x4.
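
The assign-then-average loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: distances are Euclidean, the seeds are passed in by the caller rather than chosen at random, and the six two-dimensional points are made up for the example. The function names are our own.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


def mean_point(points):
    """Average position of a group of points, one field at a time."""
    return [sum(values) / len(points) for values in zip(*points)]


def k_means(points, seeds, max_iterations=100):
    """Alternate between assigning points and recomputing centroids until stable."""
    centroids = [list(seed) for seed in seeds]
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [nearest(point, centroids) for point in points]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centroids are final
        assignment = new_assignment
        for index in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == index]
            if members:  # an empty cluster keeps its previous centroid
                centroids[index] = mean_point(members)
    return centroids, assignment


# Hypothetical normalized records with two fields each; k is 2 here.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = k_means(points, seeds=[(1, 1), (8, 8)])
print(assignment)
print(centroids)
```

The number of seeds passed in is the k of k-means, so the caller decides in advance how many clusters will be found.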

Decision trees:-
Decision trees are a wonderfully versatile tool for data mining. Decision trees
seem to come in nearly as many varieties as actual trees in a tropical rain forest. And, like
deciduous and coniferous trees, there are two main types of decision trees:

Classification trees label records and assign them to the proper class. Classification trees
can also provide the confidence that the classification is correct. In this case, the
classification tree reports the class probability, which is the confidence that a record is in
a given class.
Regression trees estimate the value of a target variable that takes on numeric values. So,
a regression tree might calculate the amount that a donor will contribute or the
expected size of the claims made by an insured person.

How decision trees are built:-

Decision trees are built through a process known as recursive partitioning.

Recursive partitioning is an iterative process of splitting the data up into partitions and
then splitting them up some more. Initially, all of the records in the training set (the preclassified
records that are used to determine the structure of the tree) are together in one big box.
The algorithm then tries breaking up the data, using every possible binary split on every
field. So, if the age takes on 72 values, from 18 to 90, then one split separates everyone who is
20 or older from everyone who is younger. And so on. The algorithm chooses the split that best
separates the records, and the partitioning procedure is then
applied to each of the new boxes. The process continues until no more useful splits can
be found. So, the heart of the algorithm is the rule that determines the initial split.
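
A single round of the split search can be sketched as follows. The text does not say how competing splits are scored, so this example assumes Gini impurity, a common choice; the function names, ages, and response labels are hypothetical. Only one field and one level of the tree are shown; a full tree builder would apply the same search again to each resulting box.

```python
def gini(labels):
    """Gini impurity: 0.0 when every record in the box has the same class."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_binary_split(values, labels):
    """Try every threshold on one numeric field; return (threshold, weighted impurity)."""
    best = None
    for threshold in sorted(set(values))[1:]:
        left = [label for value, label in zip(values, labels) if value < threshold]
        right = [label for value, label in zip(values, labels) if value >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical training records: age and whether the person responded.
ages = [18, 19, 22, 35, 47, 60]
responded = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_binary_split(ages, responded)
print(f"split at age >= {threshold}, weighted impurity {impurity}")
```

Here the split "35 or older" produces two completely pure boxes, so no further splitting of this field would be useful.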

How decision trees work:-

The discussion of clustering described how the fields in a record can be
viewed as the coordinates of the record in a multidimensional record space. That
geometric way of thinking is useful when talking about decision trees as well. Each
branch of a decision tree is a test on a single variable that cuts the space into two or more
pieces. For concreteness and simplicity, consider a simple example where there are only
two variables, X and Y. These variables take on values from 0 to 100. Each split in the
tree is constrained to be binary. That is to say, at every node in the tree, a record will go
either left or right based on some test of either X or Y.
A decision tree has been grown until every box is completely pure in the sense
that it contains only one species of dinosaur. Such a tree is fine as a description of this
particular arrangement of stegosauruses and triceratopses, but is unlikely to do a good job
of classifying another similar set of prehistoric reptiles.


The primary benefit of data mining is the ability to turn feeling into facts. When
meeting with coworkers, how many times do you say, “I just have a gut feeling that
product X is more attractive to consumer group Y than the rest of our customer base”?
This feeling can become a fact when you’re armed with the ability to sift through
corporate data, looking for behavior patterns and customer habits. It also protects you
from your gut feelings, because we all realize that many times they’re not right. The
fundamental benefit of data mining is then twofold.
Data mining can be used to support or refute feelings people have about how
business is doing. It can be used to add credibility to these feelings and warrant the
dedication of more resources and time to the most productive areas of a company’s

This benefit deals with situations where a company starts the data mining process
with an idea of what they are looking for. This is called targeted data mining. Data
mining can also discover unexpected patterns in behavior, patterns that were not under
consideration when the mining exercise commenced. This is called out-of-the-blue data
mining.
• Shipping orders:-

A group of clerks in a retail building supplies chain is systematically short-shipping
orders and hiding the discrepancy between the requisition for goods and the freight
bill going out with the delivery. This could be uncovered by analyzing the makeup of bona
fide orders and what is found to be a premature depletion of corresponding stock.

• Credit vouchers:

A retail clothing giant notices an unusual number of credit vouchers going out on
one shift every Saturday morning in their sportswear and athletic shoes departments. By
analyzing the volume and amounts of credit voucher transactions, management would be
able to detect times when volume is repeatedly higher than the norm.
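
A very simple version of "higher than the norm" can be sketched with a standard-score test. This is an assumption made for illustration: the text does not prescribe a method, the cutoff of two standard deviations is arbitrary, and the shift names and voucher counts are invented.

```python
from statistics import mean, stdev


def unusual_periods(volumes, z_threshold=2.0):
    """Return the periods whose volume is far above the average of all periods."""
    values = list(volumes.values())
    average = mean(values)
    spread = stdev(values)
    if spread == 0:
        return []  # every period looks the same, so nothing stands out
    return [
        period
        for period, volume in volumes.items()
        if (volume - average) / spread > z_threshold
    ]


# Hypothetical credit voucher counts per shift for one week.
volumes = {
    "Mon AM": 4, "Mon PM": 5, "Tue AM": 3, "Tue PM": 4,
    "Wed AM": 5, "Wed PM": 4, "Thu AM": 3, "Thu PM": 5,
    "Fri AM": 4, "Fri PM": 5, "Sat AM": 19, "Sat PM": 4,
}
print(unusual_periods(volumes))
```

Running the same check week after week, and noting which shift keeps being flagged, would give the "repeatedly higher" signal that management is looking for.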

• Timesheets:

After auditing payroll at a factory, a company notices an excessive amount of
overtime over a six-week period for a handful of employees. Through a data mining
effort, they uncover time sheets that had been deliberately altered after a management
signature had been obtained.

• Transactions:

Using data mining, a banking institution could analyze historical data and develop an
understanding of “normal” business operations-debits, credits, transfers, and so on.
When a frequency is tacked onto each transaction activity as well as its size, source, and
recipient information, the institution can go about the same analysis against current
transactions. If behavior out of the norm is detected, they can engage the services of
internal, and perhaps external, auditors to resolve the problem.

Return on Investment:-

A significant segment of the companies looking at, or already adopting, data
warehouse technology spend millions of dollars on new business initiatives. The
research and development costs are astronomical. An oil company can spend upwards of
$35 million (U.S.) on an oil rig. Data mining historical data from within the company and
any government or other external data available to the firm could help answer the
big-ticket question: “Will this effort pay off?”

Everyone has struggled with time; so little seems to exist, and so much needs to
be accomplished. Most workdays are supposed to last seven to nine hours. Time
management has become crucial in this day and age. In a business environment, where a
finite number of hours exist in a day, wading through data to discover areas that will
yield the best results is a benefit of data mining. This is your return on investment.
Business decision makers always try to dedicate the most time and resources to initiatives
with the best return. Looking for the best way to proceed, given a finite amount of
dollars and people available, is a form of targeted data mining.

Scalability of the Electronic Solution

The major players in the data-mining arena provide solutions that are robust and
scalable. A robust data mining solution is one that performs well and can display results
in an acceptable time period. The length of that acceptable time period depends on
factors such as the user’s past experiences and expectations.

A common occurrence may be for one person to prepare a set of parameters and
variables for a data mining exercise, press the ENTER key, and go off to a meeting. Two
hours later, the computer screen sits full of the mining results, patiently waiting for the
user’s return. On the other hand, another user may sit impatiently in front of the screen,
fingers drumming on the tabletop, waiting for what seems a light-year for results (in this
situation, three to five minutes can seem an eternity). A robust solution can help get
information to your users in a more timely fashion.

Successful mining software providers’ products can ingest anything from small
amounts of data all the way up to voluminous amounts. The ability to work with a wide
range of input datasets is part of this phenomenon called scalability.

Another component of scalability is the ability to deploy a data mining solution
on a stand-alone personal computer, on a small group of computers tied together by a
local area network, or on an enterprise-wide set of corporate computers. The transition
from single to multiple users must be both transparent and seamless to the users and easy
to deploy for the professionals responsible for a company-wide or workgroup-wide data
mining effort.

Financial services:-
Global technological changes have reduced market barriers, creating new ways of
doing business and new arrivals. Leading banks, insurance companies and securities
firms are competing for customers by offering one-stop shopping for a growing list of
products and services. Increased product complexity, new instruments, the “endless”
day and exposure to volatile markets have increased the need of data analysis software
and services for asset management, risk management and portfolio optimization.

Data mining as a Research tool:-

One way that data mining can lower costs is at the very beginning of the product
lifecycle, during the research and development. The pharmaceutical industry provides a
good example. This industry is characterized by very high R&D costs. Globally, the
industry spends $13 billion a year on research and development of prescription drugs. At
the risk of oversimplifying a very complex business, the drug development process is
like a large funnel with millions of chemical compounds going in the wide end and only
a few safe and useful drugs emerging at the narrow end.
An entire discipline of bioinformatics has grown up to mine and interpret the
data being generated by high throughput screening and other frontiers of biology such as
the mapping of the entire human genetic sequence.

Data mining for process improvement:-

Another area where data mining can be profitably used to save money is
manufacturing. Many modern manufacturing processes, from chip fabrication to brewing,
are controlled using statistical process control. Sensors keep track of pressure,
temperature, speed, color, humidity, and whatever other variables are appropriate to the
particular process. Computer programs watch over the output from these sensors and order
slight adjustments to keep the readings within the proper bounds.
The consequence of a problem in the manufacturing process is often a ruined
batch of product and a very costly shutdown and restart of the process. Once again the data
produced by automated manufacturing systems is perfect for data mining: huge volumes
of input, consisting of precisely measured values for scores or hundreds of variables, and
a few simple outputs like “good” and “bad”.
Data mining for marketing:-

Many of the most successful applications of data mining are in the marketing
arena, especially in the area known as database marketing. In this world data mining is
used in both the cost term and the revenue term of the profit equation. In database marketing the
database refers to a collection of data on prospective targets of a marketing campaign.
Data mining can be used to cut marketing costs by eliminating calls and letters to
people who are very unlikely to respond to an offer. For example, the AARP (American
Association of Retired Persons) saved millions of dollars by excluding the 10 percent of
eligible households who were judged least likely to become members. On the revenue
side, data mining can also be used to spot the most valuable prospects, those likely to
buy the highest insurance coverage amounts or the most expensive, high-margin

Data mining for customer relationship:-

The phrase “customer relationship management” seems to be on the lips of every
chief executive and management consultant these days. Good customer relationship
management means presenting a single image of the company across all the many
channels a customer may use to interact with the firm, and keeping a single image of the
customer that is shared across the enterprise. Good customer relationship management
requires understanding who customers are and what they like and don’t like. It means
anticipating their needs and addressing them proactively. It means recognizing when they
are unhappy and doing something about it before they get fed up and go to a competitor.
Data mining plays a leading role in every facet of customer relationship
management. Only through the application of data mining techniques can a large
enterprise hope to turn the myriad records in its customer database into some sort of
coherent picture of its customers.
The final thread is decision support. Over the past few decades, people have been
gathering data into databases to make better informed decisions. Data mining is a natural
extension of this effort.
The Future of Data Mining:-

The foundation of data mining is nothing new: the desire to conduct business
based on past customer behavior. With the advent of larger and more powerful
computers, data mining technology is primed to take off over the next few years. Not
long ago, there were few electronic data mining solutions. A glut of solutions now
exists on the market, and their numbers are still growing.

The biggest reason data mining has a bright future is the computing power of
today’s high-end machines. One of the industry’s hottest phrases is parallel processing.
This type of processing is well suited to data mining, which can easily be derailed by
hardware not robust enough to ingest and analyze multiple gigabytes of data. It seems
you can’t turn around without stumbling on yet another bigger and better parallel
computing solution.

As a new breed of frontline employees contribute to the business decision making
process, it makes so much sense to arm them with effective data mining technology.
These individuals live and breathe the business operations, experience firsthand contact
with consumers of the company’s products and services and, therefore, are players in the
selection of future business endeavors and initiatives. They may not actually make the
final decision, but they’re responsible for giving recommendations based on their
frontline knowledge of the business. Increasingly sophisticated electronic data mining
solutions enable these players to make better recommendations to upper management.

Tools like Oracle Darwin are using technology to close the gaps between the
users’ business knowledge and their understanding of how to use data-mining
techniques like regression trees (decision trees), neural networks, and nearest neighbor
(clustering). These techniques will allow them to find those hidden relations within the
data and give them the insight to predict future customer behaviors.
The foundation of data mining has been around for some time. Machine learning
and artificial intelligence are two disciplines dedicated to training machines to learn from
past experiences and, based on that training, make predictions about the future.
Naturally, many have based their Ph.D. dissertations on machine learning, and a wealth
of books, journals, and technical conferences exist worldwide that deal with the subject
of data mining. We will certainly see more about these topics in the years to come.

This only makes sense; the Web is the most efficient distribution mechanism for
information in the history of mankind. Utilizing it helps leverage your data warehouse


Mastering Data Mining

--Michael J.A. Berry & Gordon S. Linoff