THROUGH
DATA MINING
BY
M.GURUNATH G.SIVACHANDRA
murarigurunath_gupta@yahoo.co.in
sivachandra22@yahoo.com
CONTENTS
Abstract
Foundation
Data mining
Data mining applications
Customer Relationship Management (CRM)
Data mining & CRM issues - Relation
Decision Trees
Generating Decision Tree through Simple Probabilistic Approach - An Attempt
The Algorithm
Problem Solving Issues
Addressing various CRM issues
Algorithm Implementation
The Problem:
Definition
User Selection
Data Definition
Solution:
Splitter algorithm - Implementation
Finding MSA (Most Significant Attribute)
Result:
MSA and Attribute selection priority
The Final Decision Tree
Conclusion
References
ABSTRACT
Almost every real-time process is being automated in today's competitive world of technological
advancements. Automation has become the lifeblood of modern life. Data Mining is one of the
most powerful automation tools, as it has evolved from the concept of Knowledge Discovery.
Knowledge Discovery is an intelligent process, and Data Mining performs it artificially, thus
being Artificially Intelligent. Extracting data from large databases through pruning and other
implicit means is not a single-handed job. Data Mining has been a boon to many technical
fields, and management fields are not immune to it either. Customer Relationship Management
is one such field which has been under focus for the deployment of Data Mining. Many financial
companies and business organizations have grown from rags to riches by tackling marketing
issues through Data Mining. However, small organizations can only dream of incorporating
mining into their business, as they cannot afford such software. CRM is a soft issue and does not
need complex mining software based on Bayesian curves and normalized distributions. The key
aspect is that it requires simplicity rather than a high level of accuracy. We have taken efforts to
produce a rather simple algorithm based on probabilistic classification to generate a Decision
Tree. Traversing the Decision Tree, we can predict values for unknown attributes. Tree
generation requires training the algorithm with numerous samples; increasing the number of
samples will automatically enhance the accuracy of the algorithm. Our algorithm would bridge
the gap between such small marketing organizations and the technology of Data Mining.
FOUNDATION
• Data Mining
Data Mining is a tool that automates the detection of relevant patterns in a database. It is a
technology which, on its progressive path, leads to Knowledge Discovery, thereby making the
system Artificially Intelligent. In practical terms, the system is made self-reliant. There are
certain prerequisites for performing data mining: a database full of statistical data, and certain
efficient pruning algorithms to mine it, form the core. William Frawley and
Gregory Shapiro (MIT Press, 1991) defined it as "…the nontrivial extraction of implicit,
previously unknown and potentially useful information from data…" In other words, it is the
process of discovering meaningful correlations and hidden patterns by mining large amounts of
data stored in warehouses (large repositories of data).
The major advantage is its capability to build predictive models rather than being retrospective.
Thus data mining is about exploration and analysis: the automatic or semi-automatic processing
of large quantities of data can help uncover meaningful patterns and rules.
Requirements
People often ask: since statistics can already extract knowledge from previously existing data,
what is new about data mining? Data mining effectively automates the statistical process,
leading to more accuracy and a reduced burden. Intelligent systems learn from events,
i.e. they discover knowledge and act more relevantly in the future. Thus data analysis through
data mining amounts to automated, self-decisive tests that yield the most appropriate, or
rather best-suited, solutions to various situations. So we use data mining as the tool to churn
voluminous samples of data for identification, and to determine actions based on rules obtained
from Knowledge Discovery.
Applications
Data mining is not restricted to any particular field. In fact, it has now become an integral part of
every database-oriented application. However, the following fields have gained the most from
the tool:
Marketing
E-commerce
Medicine
Telecommunications
Transportation
Research
Law and order
The Process
Data mining uses simple tools to perform the churning process on a large ocean of data. The
following tasks are performed:
Discovering knowledge
Segmentation
Classification
Association
Preferencing
Visualizing data
Consider a small credit-risk example. The decision tree algorithm might determine that the most
significant attribute for predicting credit risk is debt level. The first split in the decision tree is
therefore made on debt level. One of the two new nodes (Debt = High) is a leaf node, containing
three cases with bad credit and no cases with good credit. In this example, a high debt level is a
perfect predictor of bad credit risk. The other node (Debt = Low) is still mixed, having three
good credit cases and one bad credit case. The decision tree algorithm then chooses employment
type as the next most significant predictor of credit risk. The split on employment type has two
leaf nodes, indicating that self-employed people have a higher bad-credit probability. This is, of
course, a small example based on synthetic data, but it illustrates how the decision tree can use
known attributes of the credit applicants to predict credit risk. In reality there are typically far
more attributes for each credit applicant, and the number of applicants would be very large.
When the scale of the problem expands, it is difficult for a person to manually extract the rules
that identify good and bad credit risks. The classification algorithm can consider hundreds of
attributes and millions of records to come up with a decision tree that describes rules for credit
risk prediction.
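The credit-risk example above can be sketched in a few lines of Python. This is only an illustration of the idea of picking the purest first split, not the paper's algorithm; the data and attribute names ("debt", "employment") are our own synthetic stand-ins for the seven cases described.

```python
# Synthetic credit cases from the example: 3 high-debt cases are all bad;
# of the 4 low-debt cases, only the self-employed one is bad.
from collections import Counter

cases = [
    {"debt": "high", "employment": "salaried", "risk": "bad"},
    {"debt": "high", "employment": "self-emp", "risk": "bad"},
    {"debt": "high", "employment": "salaried", "risk": "bad"},
    {"debt": "low",  "employment": "salaried", "risk": "good"},
    {"debt": "low",  "employment": "salaried", "risk": "good"},
    {"debt": "low",  "employment": "self-emp", "risk": "good"},
    {"debt": "low",  "employment": "self-emp", "risk": "bad"},
]

def misclassified(attr):
    """Cases left misclassified if we split on attr and predict the
    majority risk inside each branch."""
    errors = 0
    for value in {c[attr] for c in cases}:
        risks = [c["risk"] for c in cases if c[attr] == value]
        errors += len(risks) - Counter(risks).most_common(1)[0][1]
    return errors

best = min(["debt", "employment"], key=misclassified)
print(best)  # debt gives the purest first split
```

Splitting on debt leaves only one mixed case, while splitting on employment leaves three, so the tree splits on debt first, exactly as described above.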
Algorithm:
1) Start
2) Define an array for attributes and data types // Atr[ ],Dt[ ]
3) Define a structure for template // Template {<data type1> <attribute1>,
<data type2> <attribute2>,
.
.
<data typeN> <attributeN>, int NOC};
// [NOC—no. of occurrence of a template]
4) Define a structure for training model. // Train {<data type1> <attribute1>,
<data type2> <attribute2>,
.
.
<data typeN> <attributeN>};
5) Get the (total number of attributes – 1) as max. //[Key attribute discarded]
6) For each attribute get the data type // Get it in an array Atr [ ], Dt [ ]
//[CRM issues may require data of three types:-
Boolean (Yes/No - married/single)
Numerical ranges (1000-2000 - salary)
char[ ] (Engineer, Doctor - profession) ]
7) If the attribute is of numeric range, call the splitter algorithm to find the ranges.
//[Splitter algorithm -
/* i) Get min value, max value
ii) x = 1
iii) I = 10
iv) n = min value / I
v) If (n lies between 0 and 9)
Range[x] = min value to (min value + I)
min value = min value + I
x++
Else
I = I * 10
vi) If (min value != max value)
Repeat from step iv)
Else
Return x */ ]
8) Get the value returned by splitter algorithm into NOS [An]
// [no. of values an attribute takes is contained in NOS.]
9) If data type is Boolean NOS [An] =2
10) If data type is char [ ] NOS [An] = no. of different entries
11) For I = 1 to max sum up NOS [Ai] as totalnos
12) Calculate tempno. = (totalnos. x max)
//[The above value is the total no. of (attribute, value) combinations,
i.e. total no. of attribute values x total no. of attributes]
13) Create tempno. (Calculated above) number of objects for the structure
Template.
14) By iteration, all possible values for the template objects are filled using the
Filltemp algorithm.
// [Filltemp algorithm
/* For I = 1 to max
For J = I+1 to max-1
Set A[I] to one value,
Set all possible values for A[J]
Call sort function and remove redundant templates */
]
15) Obtain no. of samples to be entered for training as trno.
16) Create trno (Obtained above) number of objects for the structure Train.
// [The above operation is termed as training.]
17) For each template (object in structure Template), obtain the NOC value by
comparing it with the objects of structure Train.
// [The above operation finds the no. of occurrences of a template]
18) Call function MSA
// [MSA would find the most significant attribute to predict the predictable
attribute]
MSA
/* i) Get the predictable attribute
ii) Get values for the other attributes
iii) For I = 1 to NOS [Apr] // for each state of the predictable attribute
Select all the templates with the current value for the predictable
attribute
Sort based on NOC
Find the attribute with the most same-value occurrences
Return that attribute as msa[1]
Find the next attribute using the previous two steps till all attributes
except the predictable attribute are selected */
19) Using the array MSA [ ] generate the Decision Tree.
20) The value for the predictable attribute is contained in the template that matches
the values of the other attributes and has the maximum NOC value among such
templates.
21) Stop
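Step 7's splitter can be sketched in Python as follows. This is our hypothetical rendering, not the authors' code: where the pseudocode returns the count x, this sketch returns the ranges themselves (their number is `len` of the result). Bucket width is the running order of magnitude of the minimum, so a 0-40 age span yields decade buckets while a 1000-2000 salary span yields a single range.

```python
def splitter(min_value, max_value):
    """Cut [min_value, max_value] into ranges whose width is the
    current order of magnitude of the running minimum (sketch of the
    paper's splitter pseudocode; integer inputs assumed)."""
    ranges = []
    i = 10
    # Loop while the minimum has not yet reached the maximum
    # (using < rather than != guards against overshooting max).
    while min_value < max_value:
        if 0 <= min_value // i <= 9:          # n = min / I lies in 0..9
            ranges.append((min_value, min_value + i))
            min_value += i
        else:
            i *= 10                           # widen the bucket
    return ranges

print(splitter(0, 40))  # [(0, 10), (10, 20), (20, 30), (30, 40)]
```

For a salary attribute, `splitter(1000, 2000)` grows `i` to 1000 and returns the single range `(1000, 2000)`, matching the pseudocode's behavior of scaling the interval before emitting buckets.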
Solution: {Finding the MSA and plotting the Decision Tree}
Once the above training model is fed in and the algorithm trains the miner:
The no. of attributes is identified to be 8 [max = 8] (9 attributes - 1 key attribute).
Three attributes are entered as Boolean and result in 6 possible values in total.
The splitter algorithm returns {NOS - no. of possible states/ranges}:
NOS for Income as 4
NOS for Age as 4
NOS for Duration as 4
Offers could assume 4 different values.
Therefore totalnos = 6 + 4 + 4 + 4 + 4 = 22
Now tempno. = totalnos. x max
= 22 x 8 = 176 templates
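The counts above follow directly from steps 11 and 12 of the algorithm. A quick sketch (attribute names are ours; the NOS values are the ones from the worked example):

```python
# NOS per attribute: three Booleans contribute 2 states each,
# the three split numeric attributes and Offers contribute 4 each.
nos = {"bool_1": 2, "bool_2": 2, "bool_3": 2,
       "income": 4, "age": 4, "duration": 4, "offer": 4}
max_attrs = 8                        # 9 attributes minus the key attribute

totalnos = sum(nos.values())         # step 11
tempno = totalnos * max_attrs        # step 12
print(totalnos, tempno)              # 22 176
```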
Finding MSA
The predictable attribute here is Offers.
It can take 4 possible values, viz. a, b, c, d.
For (Offer = a), we have 176/4 = 44 templates.
Similarly, each of the offers b, c, d would have 44 templates.
Intersecting those templates with the objects of structure Train
(the contents of structure Train being the training model, i.e. the above table),
we obtain 3 intersections for a, 3 for b, 2 for c and 2 for d.
The MSA is determined by the highest NOC (no. of occurrences), which is represented in the
form of ratios in the following table.
Temp  Income       Age          Emp.        Durn.        Edu.       Sex
a1    9-10k  2:1   20-30 3:0    Emp   1:2   20-30 1:1:1  Gra   2:1  M 2:1
a2    10-20k       20-30        S-Emp       0-10         N-Gra      M
a3    10-20k       20-30        Emp         10-20        Gra        F
b1    10-20k 2:1   30-40 1:1:1  Emp   2:1   30-40 1:1:1  Gra   2:1  F 1:2
b2    30-40k       40-50        S-Emp       0-10         Gra        M
b3    10-20k       20-30        Emp         10-20        N-Gra      F
c1    20-30k 2:0   20-30 1:1    S-Emp 2:0   0-10  2:0    Gra   1:1  F 1:1
c2    20-30k       50-60        S-Emp       0-10         N-Gra      M
d1    40-50k       40-50        S-Emp       20-30        N-Gra      M
d2    40-50k 2:0   40-50 2:0    Emp   1:1   0-10  1:1    Gra   1:1  F 1:1
MSA   1            2            4           5            3          6
• Result:
The Most Significant Attribute (MSA) in determining the offer was found to be Income,
followed by Age, Education, Employment, Duration of Relationship and Sex, respectively.
• The Final Decision Tree
The Decision Tree, as we have seen before, may not always be complete. After we split
according to the order generated by the MSA, we stop at nodes called leaf nodes, where either
there are no cases for that particular node, or all the cases belong to the same state, or the cases
are distributed among the states such that no further split is possible.
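Once the tree is built, step 20 of the algorithm reads a prediction off the trained templates: among the templates that agree with the known attribute values, the one with the highest NOC supplies the predictable attribute. A minimal sketch, with hypothetical template contents of our own:

```python
def predict(templates, known, target):
    """Among templates matching all known attribute values, return the
    target attribute of the one with the highest NOC (step 20)."""
    matches = [t for t in templates
               if all(t[k] == v for k, v in known.items())]
    best = max(matches, key=lambda t: t["NOC"])
    return best[target]

# Two illustrative templates; only NOC breaks the tie between offers.
templates = [
    {"income": "10-20k", "age": "20-30", "offer": "a", "NOC": 3},
    {"income": "10-20k", "age": "20-30", "offer": "b", "NOC": 1},
]
print(predict(templates, {"income": "10-20k", "age": "20-30"}, "offer"))  # a
```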
• Conclusion:
Thus we have shown how Data Mining could be deployed for CRM issues. We believe that the
steps outlined in this paper and the algorithm formulated are to a great extent successful in
optimizing the existing CRM process.
Though the algorithm may have some limitations, we suppose soft issues like CRM can bear
with them, as a high level of accuracy is generally not required. Successful implementation of
the algorithm would benefit small business organizations, some of which cannot afford to buy
Clementine or any other data mining software and would otherwise be left without Data Mining.
Data mining could provide drastic performance improvements for such business-oriented
organizations as well. We expect our algorithm to bridge the gap between such organizations
and the technology of Data Mining. We are trying to build a programmable model, and before
getting to the bottom of such a process, we would like to analyze the possible merits and
demerits of our algorithm. We welcome useful ideas and criticism.
References.
Text:
Building Data Mining Applications for CRM - Alex Berson, Stephen Smith,
Kurt Thearling - Tata McGraw-Hill edition, 2000
Fundamentals of Database Systems - Elmasri, Navathe - 3rd edition, Pearson
Education, 2000
Websites:
www.microsoft.com
www.spss.com
www.google.com