
Data Mining

&
Business Intelligence

Lab Manual
FH-2015

Third year
Information Technology

Course Objectives:
1. To introduce the concept of data mining as an important tool for enterprise data management and as a cutting edge technology for building competitive advantage.
2. To enable students to effectively identify sources of data and process it for data mining.
3. To make students well versed in all data mining algorithms, methods and tools.
4. Learning how to gather and analyze large sets of data to gain useful business understanding.
5. To impart skills that can enable students to approach business problems analytically by identifying opportunities to derive business value from data.

Course Outcomes
On successful completion of the course students should be able to:
CO1: Demonstrate an understanding of the importance of data mining and the principles of business intelligence.
CO2: Prepare the data needed for data mining algorithms in terms of attributes and class inputs, and training, validation, and testing files.
CO3: Implement the appropriate data mining methods like classification, clustering or association mining on large data sets.
CO4: Define and apply metrics to measure the performance of various data mining algorithms.
CO5: Apply BI to solve practical problems: analyze the problem domain, use the data collected in the enterprise, apply the appropriate data mining technique, interpret and visualize the results, and provide decision support.

TPCT's
Terna Engineering College, Nerul, Navi Mumbai
Department of Information Technology
Term Duration: From 08/01/2015 to 15/04/2015
Experiment List
Subject: DMBI
Class: TEIT-VI
Ex. No. | Title                                                                     | CO   | PO                     | Level
1       | Tutorial: Hands on Data Exploration and Data Preprocessing               | 1, 2 | a, b, g                | Assessment
2       | Implementation of Decision Tree Classifier using WEKA                    | 3, 4 | b, g, k                | Usage
3       | Implementation of Naïve Bayes Classifier using JAVA                      | 3, 4 | a, b, c, i             | Usage
4       | Implementation of Random Forest Classifier using WEKA                    | 3, 4 | b, g, k                | Usage
5       | Implementation of K-means clustering using JAVA                          | 3, 4 | a, b, c, i             | Usage
6       | Implementation of Agglomerative clustering using WEKA                    | 3, 4 | b, g, k                | Usage
7       | Implementation of Density Based Clustering: DBSCAN and OPTICS using WEKA | 3, 4 | b, g, k                | Usage
8       | Implementation of Association Mining (Apriori, FPM) using WEKA           | 3, 4 | a, b, f, g, k          | Usage
9       | Study of BI tool - Oracle BI, XLMiner, Rapid Miner                       | 1, 5 | i, k                   | Assessment
10      | Case Study: Business Intelligence Mini Project                           | -    | a, b, c, d, f, g, j, k | Familiarity and Usage

Case Study (Experiment 10): a group of three students will perform a separate BI project, and the report must contain:
a) Problem definition, identifying which data mining task is needed
b) Identify and use a standard data mining dataset available for the problem
c) Implement the data mining algorithm of choice
d) Interpret and visualize the results
e) Provide clearly the BI decision that is to be taken as a result of mining

Subject Incharge                                                              HOD

Text Books:


1. Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition
2. G. Shmueli, N.R. Patel, P.C. Bruce, Data Mining for Business Intelligence: Concepts, Techniques, and
Applications in Microsoft Office Excel with XLMiner, 1st Edition, Wiley India.
3. Carlo Vercellis, Business Intelligence: Data Mining and Optimization for Decision Making ,Wiley India
Publications.

References:
1. P. N. Tan, M. Steinbach, Vipin Kumar, Introduction to Data Mining, Pearson Education
2. Michael Berry and Gordon Linoff Data Mining Techniques, 2nd Edition Wiley Publications.
3. Michael Berry and Gordon Linoff Mastering Data Mining- Art & science of CRM, Wiley Student
Edition
4. Vikram Pudi & Radha Krishna, Data Mining, Oxford Higher Education.

Additional Books:
1. MacLennan Jamie, Tang ZhaoHui and Crivat Bogdan, Data Mining with Microsoft SQL Server 2008,
Wiley India Edition.
2. Alex Berson and Smith, Data Mining and Data Warehousing and OLAP, McGraw Hill Publication.
3. Arijay Chaudhry & P. S. Deshpande, Multidimensional Data Analysis and Data Mining Dreamtech Press
4. Carlo Vercellis, Business Intelligence : Data Mining and Optimization for Decision Making John Wiley
& Sons, Ltd. ISBN: 978-0-470-51138-1, 2009
5. M. H. Dunham, Data Mining Techniques and Algorithms, Prentice Hall-2000.

Experiment No 1
Aim: Tutorial: Hands on Data Exploration and Data preprocessing.

Objective:
After completing this experiment you will be able to:
1. Understand exploratory data analysis (EDA).
2. Understand the need of data preprocessing.

COs to be achieved (PO: a, b, g):
CO1: Demonstrate an understanding of the importance of data mining and the principles of business intelligence.
CO2: Prepare the data needed for data mining algorithms in terms of attributes and class inputs, and training, validation, and testing files.

Theory:
Data mining is an analytic process designed to explore large amounts of data in search of consistent and valuable hidden knowledge, so the first step in this field is an initial data exploration.
Exploratory data analysis makes use of the human ability to recognize patterns based on previous experience. Drawing on information and knowledge accumulated over time, people can recognize forms, trends and patterns that appear systematically in data and that cannot always be brought out by classical methods of investigation. Thus, to choose the optimal data mining methodology for the available data, we first need to analyze and explore the data with well-known statistical means.
Basically, exploratory data analysis (EDA) is the part of statistics that deals with reviewing, communicating and using data when little is known about them in advance. EDA uses various techniques, many of them based on visualization, in order to:
- Maximize insight into the data;
- Reveal underlying structure;
- Extract important variables;
- Detect outliers/anomalies;
- Identify fundamental assumptions to be tested afterwards;
- Develop simple enough models;
- Determine the optimal setting of parameters;
- Suggest some hypotheses concerning the causes of the observed phenomena;
- Suggest appropriate statistical techniques for the available data;
- Provide knowledge for further data collection in support of research or experimentation.
EDA techniques are used as a preamble to the data mining process; the results must be verified and validated by applying them to different data sets to test their quality.

Basic Statistical Descriptions of Data


For data preprocessing to be successful, it is essential to have an overall picture of your data. Basic
statistical descriptions can be used to identify properties of the data and highlight which data values
should be treated as noise or outliers.
The three areas of basic statistical descriptions are:
- Measures of central tendency, which measure the location of the middle or center of a data distribution. Intuitively speaking, given an attribute, where do most of its values fall? In particular, we discuss the mean, median, mode, and midrange.
- Dispersion of the data: how are the data spread out? The most common data dispersion measures are the range, quartiles, and interquartile range; the five-number summary and boxplots; and the variance and standard deviation of the data. These measures are useful for identifying outliers.
- Graphic displays of basic statistical descriptions, which help us visually inspect the data. Most statistical or graphical data presentation software packages include bar charts, pie charts, and line graphs. Other popular displays of data summaries and distributions include quantile plots, quantile-quantile plots, histograms, and scatter plots.
Measuring the Central Tendency: Mean, Median, and Mode
If we were to plot the observations for income, where would most of the values fall? This gives us
an idea of the central tendency of the data. Measures of central tendency include the mean, median,

mode. The most common and effective numeric measure of the center of a set of data is the (arithmetic) mean. The mean of a set of N values x1, x2, ..., xN is
x̄ = (x1 + x2 + ... + xN) / N

This corresponds to the built-in aggregate function, average (avg() in SQL), provided in relational
database systems. Although the mean is the single most useful quantity for describing a data set, it
is not always the best way of measuring the center of the data. A major problem with the mean is its
sensitivity to extreme (e.g., outlier) values. Even a small number of extreme values can corrupt the
mean. For example, the mean salary at a company may be substantially pushed up by that of a few
highly paid managers.
For skewed (asymmetric) data, a better measure of the center of data is the median, which is the
middle value in a set of ordered data values. It is the value that separates the higher half of a data set
from the lower half.
The mode is another measure of central tendency. The mode for a set of data is the value that
occurs most frequently in the set. Therefore, it can be determined for qualitative and quantitative
attributes. It is possible for the greatest frequency to correspond to several different values, which
results in more than one mode. Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal. In general, a data set with two or more modes is multimodal. At
the other extreme, if each data value occurs only once, then there is no mode.
Measuring the Dispersion of Data: Range, Quartiles, Variance &Standard Deviation
The dispersion or spread of numeric data includes range, quantiles, quartiles, percentiles, and the
interquartile range. The five-number summary, which can be displayed as a boxplot, is useful in
identifying outliers. Variance and standard deviation also indicate the spread of a data distribution.
The range of the set is the difference between the largest (max()) and smallest (min()) values.
Suppose that the data for attribute X are sorted in increasing numeric order. Imagine that we can
pick certain data points so as to split the data distribution into equal-size consecutive sets, as in
Figure-1. These data points are called quantiles. Quantiles are points taken at regular intervals of a
data distribution, dividing it into essentially equalsize consecutive sets.

Figure-1 A plot of the data distribution for some attribute X. The quantiles plotted are quartiles. The
three quartiles divide the distribution into four equal-size consecutive subsets. The second quartile
corresponds to the median.

The 2-quantile is the data point dividing the lower and upper halves of the data distribution. It
corresponds to the median. The 4-quantiles are the three data points that split the data distribution
into four equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles. The 100-quantiles are more commonly referred to as percentiles;
they divide the data distribution into 100 equal-sized consecutive sets. The median, quartiles, and
percentiles are the most widely used forms of quantiles.
The quartiles give an indication of a distribution's center, spread, and shape. The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR) and is defined as
IQR = Q3 - Q1
No single numeric measure of spread (e.g., IQR) is very useful for describing skewed distributions. Therefore, it is more informative to also provide the two quartiles Q1 and Q3, along with the median. A common rule of thumb for identifying suspected outliers is to single out values falling at least 1.5 x IQR above the third quartile or below the first quartile.
Because Q1, the median, and Q3 together contain no information about the endpoints (e.g., tails) of
the data, a fuller summary of the shape of a distribution can be obtained by providing the lowest
and highest data values as well. This is known as the five-number summary. The five-number
summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3, and the smallest
and largest individual observations, written in the order of Minimum, Q1, Median, Q3, Maximum.
Example of a five-number summary
For numerical attribute values (minimum, Q1, Q2, Q3, maximum):
Attribute values: 6 47 49 15 42 41 7 39 43 40 36
Sorted: 6 7 15 36 39 40 41 42 43 47 49
Q1 = 15 (lower quartile)
Q2 = median = 40
(mean = 33.18)
Q3 = 43 (upper quartile)
Q3 - Q1 = 28 (interquartile range)
Variance and standard deviation are measures of data dispersion. They indicate how spread out a
data distribution is. A low standard deviation means that the data observations tend to be very close
to the mean, while a high standard deviation indicates that the data are spread out over a large range
of values.
The variance of N observations, x1, x2, ..., xN, for a numeric attribute X is
σ² = (1/N) Σi (xi - x̄)²
where x̄ is the mean value of the observations. The standard deviation, σ, of the observations is the square root of the variance, σ².
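These measures can be reproduced in a few lines of Java. The sketch below (not part of the original manual) computes the five-number summary, mean, variance, and standard deviation for the attribute values used in the example above; the quartile convention used here (median of the lower and upper halves) is an assumption that happens to reproduce the Q1 and Q3 quoted in the text.

import java.util.Arrays;

class BasicStats {
    // Median of the sorted slice a[from, to); a must already be sorted.
    static double median(double[] a, int from, int to) {
        int len = to - from;
        int mid = from + len / 2;
        return (len % 2 == 1) ? a[mid] : (a[mid - 1] + a[mid]) / 2.0;
    }

    public static void main(String[] args) {
        double[] x = {6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36};
        Arrays.sort(x);
        int n = x.length;

        double mean = Arrays.stream(x).sum() / n;          // ≈ 33.18
        double q2 = median(x, 0, n);                       // 40
        double q1 = median(x, 0, n / 2);                   // 15 (median of lower half)
        double q3 = median(x, (n + 1) / 2, n);             // 43 (median of upper half)
        double iqr = q3 - q1;                              // 28

        // Population variance and standard deviation, as defined in the text.
        double variance = Arrays.stream(x).map(v -> (v - mean) * (v - mean)).sum() / n;
        double sd = Math.sqrt(variance);

        System.out.printf("min=%.0f Q1=%.0f median=%.0f Q3=%.0f max=%.0f IQR=%.0f%n",
                x[0], q1, q2, q3, x[n - 1], iqr);
        System.out.printf("mean=%.2f variance=%.2f stddev=%.2f%n", mean, variance, sd);
    }
}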
Histograms
Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X. If
X is nominal, such as automobile model or item type, then a pole or vertical bar is drawn for each
known value of X. The height of the bar indicates the frequency (i.e., count) of that X value. The
resulting graph is more commonly known as a bar chart. The problem becomes more complicated
in the case of continuous numerical data. Here, the division (grouping) of the numerical data in

certain classes (usually intervals) is necessary to draw the corresponding histogram. Concretely, to each class (group) represented on one axis, there will correspond the relative frequency of occurrence (or the number of observations) on the other axis.
Data Preprocessing:
Today's real-world databases are highly susceptible to noisy, missing, and inconsistent data due to
their typically huge size (often several gigabytes or more) and their likely origin from multiple,
heterogeneous sources. Low-quality data will lead to low-quality mining results.
There are several data preprocessing techniques. Data cleaning can be applied to remove noise and
correct inconsistencies in data. Data integration merges data from multiple sources into a coherent
data store such as a data warehouse. Data reduction can reduce data size by, for instance,
aggregating, eliminating redundant features, or clustering. Data transformations (e.g.,
normalization) may be applied, where data are scaled to fall within a smaller range like 0.0 to 1.0.
This can improve the accuracy and efficiency of mining algorithms involving distance
measurements. These techniques are not mutually exclusive; they may work together. For example,
data cleaning can involve transformations to correct wrong data, such as by transforming all entries
for a date field to a common format.
Tasks in data preprocessing:
Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies.
Data integration: using multiple databases, data cubes, or files.
Data transformation: normalization and aggregation.
Data reduction: reducing the volume but producing the same or similar analytical results.
Data discretization: part of data reduction, replacing numerical attributes with nominal ones.
Data cleaning
1. Fill in missing values (attribute or class value):
o Ignore the tuple: usually done when class label is missing.
o Use the attribute mean (or majority nominal value) to fill in the missing value (a small sketch of this option follows this list).
o Use the attribute mean (or majority nominal value) for all samples belonging to the
same class.
o Predict the missing value by using a learning algorithm: consider the attribute with
the missing value as a dependent (class) variable and run a learning algorithm
(usually Bayes or decision tree) to predict the missing value.
2. Identify outliers and smooth out noisy data:
o Binning
Sort the attribute values and partition them into bins (see "Unsupervised
discretization" below);
Then smooth by bin means, bin median, or bin boundaries.
o Clustering: group values in clusters and then detect and remove outliers (automatic
or manual)
o Regression: smooth by fitting the data into regression functions.
3. Correct inconsistent data: use domain knowledge or expert decision.
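As a concrete illustration of the first option above (filling in a missing value with the attribute mean), here is a minimal Java sketch. It is not part of the manual, and the choice of Double.NaN to mark a missing value is an assumption.

class MeanImputation {
    // Replace every missing (NaN) entry of the column with the mean of the observed entries.
    static void fillWithMean(double[] column) {
        double sum = 0;
        int count = 0;
        for (double v : column)
            if (!Double.isNaN(v)) { sum += v; count++; }
        double mean = (count > 0) ? sum / count : 0.0;
        for (int i = 0; i < column.length; i++)
            if (Double.isNaN(column[i])) column[i] = mean;
    }

    public static void main(String[] args) {
        double[] income = {40000, Double.NaN, 52000, 61000, Double.NaN};
        fillWithMean(income);
        System.out.println(java.util.Arrays.toString(income)); // NaNs replaced by 51000.0
    }
}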
Data transformation
1. Normalization:
o Scaling attribute values to fall within a specified range.

Example: to transform V in [Min, Max] to V' in [0, 1], apply V' = (V - Min) / (Max - Min).
o Scaling by using mean and standard deviation (useful when min and max are unknown or when there are outliers): V' = (V - Mean) / StDev.
(Both scalings are sketched in code after this list.)
2. Aggregation: moving up in the concept hierarchy on numeric attributes.
3. Generalization: moving up in the concept hierarchy on nominal attributes.
4. Attribute construction: replacing or adding new attributes inferred by existing attributes.
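A minimal Java sketch of the two scalings described in item 1 above (min-max normalization to [0, 1] and z-score scaling by mean and standard deviation); the sample values are made up for illustration only.

class Normalization {
    // V' = (V - Min) / (Max - Min); assumes max > min.
    static double[] minMax(double[] v) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double x : v) { min = Math.min(min, x); max = Math.max(max, x); }
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++)
            out[i] = (v[i] - min) / (max - min);
        return out;
    }

    // V' = (V - Mean) / StDev (population standard deviation).
    static double[] zScore(double[] v) {
        double mean = 0, sd = 0;
        for (double x : v) mean += x;
        mean /= v.length;
        for (double x : v) sd += (x - mean) * (x - mean);
        sd = Math.sqrt(sd / v.length);
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++)
            out[i] = (v[i] - mean) / sd;
        return out;
    }

    public static void main(String[] args) {
        double[] v = {20, 40, 60, 80, 100};
        System.out.println(java.util.Arrays.toString(minMax(v))); // [0.0, 0.25, 0.5, 0.75, 1.0]
        System.out.println(java.util.Arrays.toString(zScore(v)));
    }
}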
Data reduction
1. Reducing the number of attributes
o Data cube aggregation: applying roll-up, slice or dice operations.
o Removing irrelevant attributes: attribute selection (filtering and wrapper methods),
searching the attribute space (see Lecture 5: Attribute-oriented analysis).
o Principal component analysis (numeric attributes only): searching for a lower dimensional space that can best represent the data.
2. Reducing the number of attribute values
o Binning (histograms): reducing the number of attribute values by grouping them into intervals (bins).
o Clustering: grouping values in clusters.
o Aggregation or generalization
3. Reducing the number of tuples
o Sampling
Discretization and generating concept hierarchies
1. Unsupervised discretization - class variable is not used. (Both binning schemes below are sketched in code after this list.)
o Equal-interval (equiwidth) binning: split the whole range of numbers in intervals
with equal size.
o Equal-frequency (equidepth) binning: use intervals containing equal number of
values.
2. Supervised discretization - uses the values of the class variable.
o Using class boundaries. Three steps:
Sort values.
Place breakpoints between values belonging to different classes.
3. Generating concept hierarchies: recursively applying partitioning or discretization methods.
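A small Java sketch of the two unsupervised binning schemes above, equal-width and equal-frequency; the sample price values are illustrative only.

import java.util.Arrays;

class Binning {
    // Equal-width binning: returns the bin index (0..bins-1) of each value.
    static int[] equalWidth(double[] v, int bins) {
        double min = Arrays.stream(v).min().getAsDouble();
        double max = Arrays.stream(v).max().getAsDouble();
        double width = (max - min) / bins;
        int[] idx = new int[v.length];
        for (int i = 0; i < v.length; i++)
            idx[i] = Math.min(bins - 1, (int) ((v[i] - min) / width));
        return idx;
    }

    // Equal-frequency binning: splits the sorted values into bins of (roughly) equal size.
    static double[][] equalFrequency(double[] v, int bins) {
        double[] sorted = v.clone();
        Arrays.sort(sorted);
        double[][] out = new double[bins][];
        int start = 0;
        for (int b = 0; b < bins; b++) {
            int end = (int) Math.round((b + 1) * sorted.length / (double) bins);
            out[b] = Arrays.copyOfRange(sorted, start, end);
            start = end;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] prices = {4, 8, 15, 21, 21, 24, 25, 28, 34};
        System.out.println(Arrays.toString(equalWidth(prices, 3)));         // bin index of each value
        System.out.println(Arrays.deepToString(equalFrequency(prices, 3))); // three bins of three values each
    }
}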
Conclusion:
Basic statistical descriptions provide the analytical foundation for data preprocessing. The basic
statistical measures for data summarization include mean, median, and mode for measuring the
central tendency of data; and range, quantiles, quartiles, interquartile range, variance, and standard
deviation for measuring the dispersion of data. Graphical representations (e.g., boxplots,
histograms, and scatter plots) facilitate visual inspection of the data and are thus useful for data
preprocessing and mining.
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers,
and correct inconsistencies in the data. Data integration combines data from multiple sources to
form a coherent data store. The resolution of semantic heterogeneity, metadata, correlation analysis,
tuple duplication detection, and data conflict detection contribute to smooth data integration. Data

reduction techniques obtain a reduced representation of the data while minimizing the loss of
information content. These include methods of dimensionality reduction, numerosity reduction, and
data compression. Data transformation routines convert the data into appropriate forms for mining.
Data discretization transforms numeric data by mapping values to interval or concept labels. Such
methods can be used to automatically generate concept hierarchies for the data, which allows for
mining at multiple levels of granularity. Discretization techniques include binning, histogram
analysis, cluster analysis, decision tree analysis, and correlation analysis. For nominal data, concept
hierarchies may be generated based on schema definitions as well as the number of distinct values
per attribute.

Data Exploration:

Experiment No 2
AIM: Implementation of Decision Tree Classifier using WEKA.

Objectives:
After completing this experiment you will be able to:
1. List the differences among the learning types: supervised and unsupervised.
2. Identify examples of classification tasks, including the available input features.
3. Apply the Decision Trees algorithm to generate classification rules.
COs to be achieved (PO: b, g, k):
CO3: Implement the appropriate data mining methods like classification, clustering or association mining on large data sets.
CO4: Define and apply metrics to measure the performance of various data mining algorithms.

Theory:
Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends. Such analysis can help provide us
with a better understanding of the data at large. Whereas classification predicts categorical
(discrete, unordered) labels, prediction models continuous valued functions.
How does classification work? Data classification is a two-step process:
In the first step, a classifier is built describing a predetermined set of data classes or concepts. This
is the learning step (or training phase), where a classification algorithm builds the classifier by
analyzing or learning from a training set made up of database tuples and their associated class
labels. Because the class label of each training tuple is provided, this step is also known as
supervised learning (i.e., the learning of the classifier is supervised in that it is told to which class
each training tuple belongs).
This first step of the classification process can also be viewed as the learning of a mapping or
function, y = f (X), that can predict the associated class label y of a given tuple X. In this view, we
wish to learn a mapping or function that separates the data classes. Typically, this mapping is
represented in the form of classification rules, Decision Trees, or mathematical formulae. In the
second step, we determine if the model's accuracy is acceptable, and if so, we use the model to
classify new data.
Classification by Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where
Each internal node (nonleaf node) denotes a test on an attribute,
Each branch represents an outcome of the test, and
Each leaf node (or terminal node) holds a class label.
The topmost node in a tree is the root node.
How are decision trees used for classification? Given a tuple, X, for which the associated class
label is unknown, the attribute values of the tuple are tested against the decision tree. A path is
traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees
can easily be converted to classification rules.
Why are decision tree classifiers so popular?
- The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery.
- Decision trees can handle high dimensional data.
- Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans.
- The learning and classification steps of decision tree induction are simple and fast.
- In general, decision tree classifiers have good accuracy.
- Decision trees are the basis of several commercial rule induction systems.
- Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
However, successful use may depend on the data at hand.

Figure-1: A decision tree for the concept buys computer, indicating whether a customer at
AllElectronics is likely to purchase a computer. Each internal (nonleaf) node represents a test on an
attribute. Each leaf node represents a class (either buys computer = yes or buys computer = no).
Decision Tree Induction
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning,
developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). Quinlan later presented
C4.5 (a successor of ID3), which became a benchmark to which newer supervised learning
algorithms are often compared. In 1984, a group of statisticians (L. Breiman, J. Friedman, R.
Olshen, and C. Stone) published the book Classification and Regression Trees (CART). ID3, C4.5,
and CART adopt a greedy (i.e., nonbacktracking) approach in which decision trees are constructed
in a top-down recursive divide-and-conquer manner.
Algorithm: Generate decision tree. Generate a decision tree from the training tuples of data partition
D.
Input:
Data partition, D, which is a set of training tuples and their associated class labels;
attribute list, the set of candidate attributes;
Attribute selection method, a procedure to determine the splitting criterion that best
partitions the data tuples into individual classes. This criterion consists of a splitting
attribute and, possibly, either a split point or splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C then
(3)
return N as a leaf node labeled with the class C;
(4) if attribute list is empty then
(5)
return N as a leaf node labeled with the majority class in D; // majority voting
(6) Apply Attribute selection method(D, attribute list) to find the best splitting criterion;
(7) label node N with splitting criterion;
(8) if splitting attribute is discrete-valued and multiway splits allowed then // not
restricted to binary trees
(9) attribute list ← attribute list - splitting attribute; // remove splitting attribute
(10) for each outcome j of splitting criterion
// partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition

(12) if Dj is empty then


(13)
attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate decision tree(Dj, attribute list) to node N;
endfor
(15) return N;
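The Attribute selection method in step (6) is left abstract in the listing above; ID3 and C4.5, mentioned earlier, use information gain (or gain ratio) for this purpose. As an illustration only (not part of the manual's listing), the sketch below computes the entropy Info(D) of the 14-instance weather data used in the Weka run later in this experiment (9 yes, 5 no) and the information gain of the outlook attribute, whose per-value class counts can be read off the J48 tree shown below.

class InfoGainDemo {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy of a class distribution given as raw counts.
    static double info(double... counts) {
        double total = 0, h = 0;
        for (double c : counts) total += c;
        for (double c : counts)
            if (c > 0) h -= (c / total) * log2(c / total);
        return h;
    }

    public static void main(String[] args) {
        double infoD = info(9, 5);                                      // ≈ 0.940 bits
        // outlook splits D into sunny (2 yes, 3 no), overcast (4, 0), rainy (3, 2).
        double infoOutlook = 5.0 / 14 * info(2, 3)
                           + 4.0 / 14 * info(4, 0)
                           + 5.0 / 14 * info(3, 2);
        System.out.printf("Info(D) = %.3f%n", infoD);
        System.out.printf("Gain(outlook) = %.3f%n", infoD - infoOutlook); // ≈ 0.247
    }
}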
Conclusion:

- The decision tree provides a framework that takes into account not only the experimental data when designing a classifier, but also a structure that allows better generalization capability.
- Decision trees are not very sensitive to outliers, since splitting is based on the proportion of samples within the split ranges and not on absolute values.
- Highly nonlinear relationships between variables cause simple regression models to fail and thus make such models invalid. Decision trees, however, do not require any assumption of linearity in the data, so we can use them in scenarios where we know the variables are nonlinearly related.

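The run below was produced with the Weka Explorer GUI. The same model can also be built from Java code; the following is only a sketch, assuming weka.jar is on the classpath and the bundled weather.nominal.arff file is in the working directory (the file location is an assumption, not part of the manual).

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Weather {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);              // class attribute: play

        J48 tree = new J48();
        tree.setOptions(new String[] {"-C", "0.25", "-M", "2"});   // same options as the run below
        tree.buildClassifier(data);
        System.out.println(tree);                                  // prints the pruned tree

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));    // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}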
Weka output:

=== Run information ===


Scheme:weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
=== Classifier model (full training set) ===
J48 pruned tree
------------------
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  : 5
Size of the tree  : 8

Time taken to build model: 0.02 seconds


=== Summary ===
Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Mean absolute error                      0.2857
Total Number of Instances               14

=== Detailed Accuracy By Class ===


               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.778     0.6       0.7        0.778     0.737       0.789     yes
                 0.4       0.222     0.5        0.4       0.444       0.789     no
Weighted Avg.    0.643     0.465     0.629      0.643     0.632       0.789
=== Confusion Matrix ===
a b <-- classified as
7 2 | a = yes
3 2 | b = no

Experiment No 3
AIM: Implementation of Naïve Bayes Classifier using JAVA.

Objectives:
After completing this experiment you will be able to:
1. List the differences among the learning types: supervised and unsupervised.
2. Apply the simple statistical learning algorithm such as Naive Bayesian Classifier to a
classification task and measure the classifier's accuracy
COs to be achieved (PO: a, b, c, i):
CO3: Implement the appropriate data mining methods like classification, clustering or association mining on large data sets.
CO4: Define and apply metrics to measure the performance of various data mining algorithms.

Theory:
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such
as the probability that a given tuple belongs to a particular class. Bayesian classification is based on
Bayes' theorem.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. Bayesian belief networks are graphical models which, unlike naïve Bayesian classifiers, allow the representation of dependencies among subsets of attributes.
Bayes Theorem
Let X be a data tuple. In Bayesian terms, X is considered evidence. Let H be some hypothesis,
such as that the data tuple X belongs to a specified class C. For classification problems, we want to
determine P(H|X), the probability that the hypothesis H holds given the evidence or observed
data tuple X. In other words, we are looking for the probability that tuple X belongs to class C,
given that we know the attribute description of X.
P(X) is the prior probability of X. Using our example, it is the probability that a person from our set
of customers is 35 years old and earns $40,000.
P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability
that any given customer will buy a computer, regardless of age, income, or any other information,
for that matter.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example,
suppose our world of data tuples is confined to customers described by the attributes age and
income, respectively, and that X is a 35-year-old customer with an income of $40,000. Suppose that
H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that
customer X will buy a computer given that we know the customer's age and income.
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability
that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a
computer.
Bayes' theorem is:
P(H|X) = P(X|H) P(H) / P(X)

Naïve Bayesian Classification

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely, that is,
P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize
P(X|Ci)P(Ci). Note that the class prior probabilities may be estimated by P(Ci)=|Ci,D|/|D|,where
|Ci,D| is the number of training tuples of class Ci in D.
4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of
class conditional independence is made. This presumes that the values of the attributes are
conditionally independent of one another, given the class label of the tuple (i.e., that there are no
dependence relationships among the attributes). Thus,
P(X|Ci) = P(x1|Ci) x P(x2|Ci) x ... x P(xn|Ci)

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) from the training tuples.
Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we look at
whether the attribute is categorical or continuous-valued. For instance, to compute P(X|Ci), we
consider the following:
(a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for
Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
(b) If Ak is continuous-valued, then the attribute is typically assumed to have a Gaussian distribution with mean μ and standard deviation σ, which are estimated from the training tuples of class Ci and then used to compute P(xk|Ci).
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier
predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i.

In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum.
Example: Predicting a class label using naïve Bayesian classification. We wish to predict the class label of a tuple using naïve Bayesian classification, given the same training data as used for decision tree induction. The data tuples are described by the attributes age, income, student, and credit rating. The class label attribute, buys computer, has two distinct values (namely, {yes, no}). Let C1 correspond to the class buys computer = yes and C2 correspond to buys computer = no. The tuple we wish to classify is:
X = (age = youth, income = medium, student = yes, credit rating = fair)

We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be computed based on the training tuples:
P(buys computer = yes) = 9/14 = 0.643
P(buys computer = no) = 5/14 = 0.357
To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:
P(age = youth | buys computer = yes) = 2/9 = 0.222
P(age = youth | buys computer = no) = 3/5 = 0.600
P(income = medium | buys computer = yes) = 4/9 = 0.444
P(income = medium | buys computer = no) = 2/5 = 0.400
P(student = yes | buys computer = yes) = 6/9 = 0.667
P(student = yes | buys computer = no) = 1/5 = 0.200
P(credit rating = fair | buys computer = yes) = 6/9 = 0.667
P(credit rating = fair | buys computer = no) = 2/5 = 0.400
Using the above probabilities, we obtain
P(X | buys computer = yes) = P(age = youth | buys computer = yes) x P(income = medium | buys computer = yes) x P(student = yes | buys computer = yes) x P(credit rating = fair | buys computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044.
Similarly,
P(X | buys computer = no) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019.
To find the class, Ci, that maximizes P(X|Ci)P(Ci), we compute
P(X | buys computer = yes) P(buys computer = yes) = 0.044 x 0.643 = 0.028
P(X | buys computer = no) P(buys computer = no) = 0.019 x 0.357 = 0.007
Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
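The arithmetic of this worked example can be checked with a few lines of Java (not part of the manual); all probabilities below are the ones listed in the text.

class NaiveBayesExample {
    public static void main(String[] args) {
        double pYes = 9.0 / 14, pNo = 5.0 / 14;                 // class priors
        double pxGivenYes = 2.0/9 * 4.0/9 * 6.0/9 * 6.0/9;      // ≈ 0.044
        double pxGivenNo  = 3.0/5 * 2.0/5 * 1.0/5 * 2.0/5;      // ≈ 0.019
        System.out.printf("yes: %.3f%n", pxGivenYes * pYes);    // ≈ 0.028
        System.out.printf("no : %.3f%n", pxGivenNo * pNo);      // ≈ 0.007 -> predict buys computer = yes
    }
}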
Conclusion:

- The naïve Bayesian classifier is particularly suited when the dimensionality of the inputs is high. Parameter estimation for naïve Bayes models uses the method of maximum likelihood. In spite of its over-simplified assumptions, it often performs well in many complex real-world situations.
- Its accuracy is high compared with many other classification techniques.
- The naïve Bayesian technique is not computationally intensive; it requires only a small amount of training data to estimate its parameters.

JAVA CODE:

import java.util.*;
class weather{
static char outlook[]={'S','S','O','R','R','R','O','S','S','R','S','O','O','R'};
static char temperature[]={'H','H','H','M','C','C','C','M','C','M','M','M','H','M'};
static char humidity[]={'P','P','P','P','N','N','N','P','N','N','N','P','N','P'};
static char windy[]={'F','T','F','F','F','T','T','F','F','F','T','T','F','T'};
static char class1[]={'N','N','P','P','P','N','P','N','P','P','P','P','P','N'};
static double prob[][]=new double[4][2];
static double pp=9.0/14.0;
static double npp=5.0/14.0;
static int flag=0;
static int flag1=0;
static double play_N=1;
static double notplay_N=1;
static void cal_N(int a)
{
if(a==1)
{
for(int i=0;i<4;++i)
play_N*=prob[i][0];
play_N*=pp;
//System.out.println("\nValue of N of play \n"+play_N);
}
else
{
for(int i=0;i<4;++i)
notplay_N*=prob[i][1];
notplay_N*=npp;
//System.out.println("\nValue of N of No play \n"+notplay_N);
}
}
static double cal_play_prob(char ch)
{
double prob=0;
double count=0;
if(flag==0)
{
for(int i=0;i<14;++i)
if(outlook[i]==ch && class1[i]=='P')
++count;
prob=count/9.0;
flag=1;
}
else if(flag==1)
{
for(int i=0;i<14;++i)
if(temperature[i]==ch && class1[i]=='P')

++count;
prob=count/9.0;
flag=2;
}
else if(flag==2)
{
for(int i=0;i<14;++i)
if(humidity[i]==ch && class1[i]=='P')
++count;
prob=count/9.0;
flag=3;
}
else if(flag==3)
{
for(int i=0;i<14;++i)
if(windy[i]==ch && class1[i]=='P')
++count;
prob=count/9.0;
}
return prob;
}
static double cal_noplay_prob(char ch)
{
double prob=0;
double count=0;
if(flag1==0)
{
for(int i=0;i<14;++i)
if(outlook[i]==ch && class1[i]=='N')
++count;
prob=count/5.0;
flag1=1;
}
else if(flag1==1)
{
for(int i=0;i<14;++i)
if(temperature[i]==ch && class1[i]=='N')
++count;
prob=count/5.0;
flag1=2;
}
else if(flag1==2)
{
for(int i=0;i<14;++i)
if(humidity[i]==ch && class1[i]=='N')
++count;
prob=count/5.0;
flag1=3;
}

else if(flag1==3)
{
for(int i=0;i<14;++i)
if(windy[i]==ch && class1[i]=='N')
++count;
prob=count/5.0;
}
return prob;
}
public static void main(String args[])
{
Scanner scr=new Scanner(System.in);
System.out.println("Table\n");
System.out.println("Outlook\t Temperature\t Humidity\t Windy \tClass");
for(int i=0;i<14;++i)
{
System.out.print(outlook[i]+"\t\t"+temperature[i]+"\t\t"+humidity[i]+"\t\t"+windy[i]+"\t\t"+
class1[i]);
System.out.println();
}
System.out.println("Menu:\nOutlook: Sunny=S Overcast=O Rain=R\tTemperature: Hot=H Mild=M Cool=C\n");
System.out.println("Humidity: Peak=P Normal=N\t\tWindy: True=T False=F\n\nYour input should belong to one of these classes.\n");
System.out.println("class1: Play=P class2:Not Play=NP");
System.out.println("\nEnter your input: example. t={rain,hot,peak,false} input will be R,H,P,F");
String s=scr.nextLine();
char ch;
int count=0;
for(int i=0;i<8;i+=2)
{
ch=s.charAt(i);
prob[count][0]=cal_play_prob(ch);
prob[count][1]=cal_noplay_prob(ch);
++count;
}
cal_N(1);
cal_N(2);
double pt=play_N+notplay_N;
double prob_of_play=0;
double prob_of_noplay=0;
prob_of_play=play_N/pt;
prob_of_noplay=notplay_N/pt;
System.out.println("\nProbability of play "+prob_of_play);
System.out.println("\nProbability of NO play "+prob_of_noplay );
if(prob_of_play>prob_of_noplay)
System.out.println("\nThe new tuple classified under \"PLAY\" category. Hence there will be play!!!");
else

System.out.println("\nThe new tuple classified under \"NO PLAY\" category. Hence there will be NO play.");
}
}

OUTPUT:

Experiment No 4
AIM: Implementation of Random Forest Classifier using WEKA

Objectives:
After completing this experiment you will be able to:
1. Explain ensemble learning methods such as bagging and boosting.
2. Describe how a random forest is constructed from bootstrap samples and random attribute selection.
3. Apply the Random Forest classifier in WEKA and measure its accuracy.
COs to be achieved (PO: b, g, k):
CO3: Implement the appropriate data mining methods like classification, clustering or association mining on large data sets.
CO4: Define and apply metrics to measure the performance of various data mining algorithms.

Theory
Recently there has been a lot of interest in ensemble learning methods that generate many
classifiers and aggregate their results. Two well-known methods are boosting and bagging of
classification trees. In boosting, successive trees give extra weight to points incorrectly predicted by
earlier predictors. In the end, a weighted vote is taken for prediction. In bagging, successive trees do
not depend on earlier trees; each is independently constructed using a bootstrap sample of the
data set. In the end, a simple majority vote is taken for prediction.
Imagine that each of the classifiers in the ensemble is a decision tree classifier so that the collection
of classifiers is a forest. The individual decision trees are generated using a random selection of
attributes at each node to determine the split. More formally, each tree depends on the values of a
random vector sampled independently and with the same distribution for all trees in the forest.
During classification, each tree votes and the most popular class is returned.
Random forests can be built using bagging in tandem with random attribute selection. A
training set, D, of d tuples is given. The general procedure to generate k decision trees for
the ensemble is as follows.
For each iteration, i, (i = 1, 2, ..., k), a training set, Di, of d tuples is sampled with
replacement from D. That is, each Di is a bootstrap sample of D, so that some tuples may
occur more than once in Di, while others may be excluded.
Let F be the number of attributes to be used to determine the split at each node, where F is
much smaller than the number of available attributes.
To construct a decision tree classifier, Mi, randomly select, at each node, F attributes as
candidates for the split at the node.
The CART methodology is used to grow the trees. The trees are grown to maximum size
and are not pruned. Random forests formed this way, with random input selection, are
called Forest-RI.
Another form of random forest, called Forest-RC, uses random linear combinations of the input
attributes. Instead of randomly selecting a subset of the attributes, it creates new attributes (or
features) that are a linear combination of the existing attributes. That is, an attribute is generated by
specifying L, the number of original attributes to be combined. At a given node, L attributes are
randomly selected and added together with coefficients that are uniform random numbers on [-1, 1].
F linear combinations are generated, and a search is made over these for the best split. This form of
random forest is useful when there are only a few attributes available, so as to reduce the correlation
between individual classifiers.
Breiman proposed random forests, which add an additional layer of randomness to bagging. In
addition to constructing each tree using a different bootstrap sample of the data, random forests
change how the classification or regression trees are constructed. In standard trees, each node is
split using the best split among all variables. In a random forest, each node is split using the best
among a subset of predictors randomly chosen at that node. This somewhat counterintuitive strategy
turns out to perform very well compared to many other classifiers, including discriminant analysis,
support vector machines and neural networks, and is robust against overfitting. Random forests are
comparable in accuracy to AdaBoost, yet are more robust to errors and outliers. The generalization
error for a forest converges as long as the number of trees in the forest is large. The accuracy of a
random forest depends on the strength of the individual classifiers and a measure of the dependence
between them. The ideal is to maintain the strength of individual classifiers without increasing their
correlation. Random forests are insensitive to the number of attributes selected for consideration at
each split. Typically, up to log2d +1 is chosen. (An interesting empirical observation was that using
a single random input attribute may result in good accuracy that is often higher than when using
several attributes.) Because random forests consider many fewer attributes for each split, they are

efficient on very large databases. They can be faster than either bagging or boosting. Random
forests give internal estimates of variable importance. It is very user-friendly in the sense that it has
only two parameters (the number of variables in the random subset at each node and the number of
trees in the forest), and is usually not very sensitive to their values.
The advantages of random forest are:
It is one of the most accurate learning algorithms available. For many data sets, it produces a
highly accurate classifier.
It runs efficiently on large databases.
It can handle thousands of input variables without variable deletion.
It gives estimates of what variables are important in the classification.
It generates an internal unbiased estimate of the generalization error as the forest building
progresses.
It has an effective method for estimating missing data and maintains accuracy when a large
proportion of the data are missing.
It has methods for balancing error in class population unbalanced data sets.
Generated forests can be saved for future use on other data.
Prototypes are computed that give information about the relation between the variables and
the classification.
It computes proximities between pairs of cases that can be used in clustering, locating
outliers, or (by scaling) give interesting views of the data.
The capabilities of the above can be extended to unlabeled data, leading to unsupervised
clustering, data views and outlier detection.
It offers an experimental method for detecting variable interactions.
Disadvantages
Random forests have been observed to overfit for some datasets with noisy
classification/regression tasks.
For data including categorical variables with different number of levels, random forests are
biased in favor of those attributes with more levels. Therefore, the variable importance
scores from random forest are not reliable for this type of data.
Conclusion
Fast processing and results!
RF is fast to build. Even faster to predict!
Practically speaking, not requiring cross-validation alone for model selection
significantly speeds training by 10x-100x or more.
Fully parallelizable to go even faster!
Automatic predictor selection from large number of candidates
Resistance to over training
Ability to handle data without preprocessing
data does not need to be rescaled, transformed, or modified
resistant to outliers
automatic handling of missing values
Cluster identification can be used to generate tree-based clusters through sample proximity.

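The output below comes from the Weka GUI. For completeness, here is a minimal sketch of producing a comparable run from Java code; it assumes Weka on the classpath and the weather.nominal.arff file, and the option strings follow the Scheme line below but may differ between Weka versions.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestWeather {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        RandomForest rf = new RandomForest();
        rf.setOptions(new String[] {"-I", "10", "-K", "0", "-S", "1"}); // 10 trees, default #features, seed 1
        rf.buildClassifier(data);
        System.out.println(rf);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(rf, data, 10, new Random(1));           // 10-fold CV, as in the run below
        System.out.println(eval.toSummaryString());
    }
}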
Weka Output:
=== Run information ===
Scheme:weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
Random forest of 10 trees, each constructed while considering 3 random features.
Out of bag error: 0.4286
Time taken to build model: 0.01 seconds

Experiment No 5
AIM: Implementation of K-means clustering using JAVA.

Objectives:
After completing this experiment you will be able to:
1. Explain the differences among the two main styles of learning: supervised and unsupervised.
2. Implement simple algorithms for unsupervised learning.
3. Explain the problem of outlier on clustering algorithms.

COs to be achieved (PO: a, b, c, i):
CO3: Implement the appropriate data mining methods like classification, clustering or association mining on large data sets.
CO4: Define and apply metrics to measure the performance of various data mining algorithms.

Theory:
The process of grouping a set of physical or abstract objects into classes of similar objects is called
clustering. A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters. The grouping is accomplished by finding
similarities between data according to characteristics found in the data itself. Thus clustering is
viewed to be driven by the data itself and is often based on the similarity between attribute values.
Major Clustering Methods
Partitioning methods:
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups,
which together satisfy the following requirements:
(1) Each group must contain at least one object, and
(2) Each object must belong to exactly one group.
Notice that the second requirement can be relaxed in some fuzzy partitioning techniques.
To achieve global optimality in partitioning-based clustering would require the exhaustive
enumeration of all of the possible partitions. Instead, most applications adopt one of a few popular
heuristic methods, such as
(1) The k-means algorithm, where each cluster is represented by the mean value of the objects in the
cluster, and
(2) The k-medoids algorithm, where each cluster is represented by one of the objects located near
the center of the cluster.
These heuristic clustering methods work well for finding spherical-shaped clusters in small to
medium-sized databases.
Hierarchical methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical method can be classified as being either agglomerative or divisive, based on how the
hierarchical decomposition is formed.
Density-based methods: Most partitioning methods cluster objects based on the distance between
objects. Other clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density (number of objects or
data points) in the neighborhood exceeds some threshold; that is, for each data point within a
given cluster, the neighborhood of a given radius has to contain at least a minimum number of
points.
DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters
according to a density-based connectivity analysis.

K-Means Clustering:
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters.
The clusters are formed based on distance. The objects within a cluster are similar whereas the
objects of different clusters are dissimilar. Cluster similarity is measured in regard to the mean
value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity. A commonly used criterion is the sum of squared errors,
E = Σ (i = 1..k) Σ (p ∈ Ci) |p - mi|²
where E is the sum of the square error for all objects in the data set; p is the point in space
representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional).
In other words, for each object in each cluster, the distance from the object to its cluster center is
squared, and the distances are summed. This criterion tries to make the resulting k clusters as
compact and as separate as possible.
Initial values for the means are arbitrarily assigned. The convergence criteria could be based on the
squared error, a fixed number of iterations, or the point at which no (or only a very small) number of tuples is reassigned to different clusters.
Algorithm: k-means.
The k-means algorithm for partitioning, where each cluster's center is represented by the mean
value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) Arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3)
(re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster;
(4)
update the cluster means, i.e., calculate the mean value of the objects for
each cluster;
(5) until no change;
The method is relatively scalable and efficient in processing large data sets because the
computational complexity of the algorithm is O(nkt), where n is the total number of objects, k is the
number of clusters, and t is the number of iterations.
Example:
Suppose that we are given the following items to cluster
{2, 4, 10, 12, 3, 20, 30, 11, 25}
And let k=2;
K-means clustering: Initially suppose the means m1= 2 and m2 =4;
Using Euclidean distance we find that
k1 = { 2, 3}
k2 ={ 4,10,12,20,30,11,25}

m1      m2      K1                          K2
2       4       {2, 3}                      {4, 10, 11, 12, 20, 25, 30}
2.5     16      {2, 3, 4}                   {10, 11, 12, 20, 25, 30}
3       18      {2, 3, 4, 10}               {11, 12, 20, 25, 30}
4.75    19.6    {2, 3, 4, 10, 11, 12}       {20, 25, 30}
7       25      {2, 3, 4, 10, 11, 12}       {20, 25, 30}

K-means finds a local optimum and may actually miss the global optimum.
K-means does not work on categorical data.
K-means does not handle outliers well.

Conclusion:
In this experiment we have studied the k-means clustering method, which is a partitioning-based method. K-means is the simplest method for clustering, but the number of clusters must be given in advance and it is very sensitive to outliers.
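Besides the hand-written Java program below, the same kind of clustering can be driven through Weka's SimpleKMeans (presumably how the "Weka Output" screenshot at the end of this experiment was obtained). The following is only a sketch: it assumes Weka on the classpath and an ARFF file of the data to cluster; the file name numbers.arff is made up.

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansWeka {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("numbers.arff").getDataSet();   // one numeric attribute

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2);          // k = 2, as in the worked example above
        km.setSeed(1);
        km.buildClusterer(data);
        System.out.println(km);        // cluster centroids and sizes

        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}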

Java Code
import java.util.*;
class k_means
{
static int count1,count2,count3;
static int d[];
static int k[][];
static int tempk[][];
static double m[];
static double diff[];
static int n,p;
static int cal_diff(int a)
{
for(int i=0;i<p;++i)
{
if(a>m[i])
diff[i]=a-m[i];
else
diff[i]=m[i]-a;
}
int val=0;
double temp=diff[0];
for(int i=0;i<p;++i)
{
if(diff[i]<temp)
{
temp=diff[i];
val=i;
}
}//end of for loop
return val;
}

static void cal_mean()


{
for(int i=0;i<p;++i)
m[i]=0; // initializing means to 0
int cnt=0;
for(int i=0;i<p;++i)

{
cnt=0;
for(int j=0;j<n-1;++j)
{
if(k[i][j]!=-1)
{
m[i]+=k[i][j];
++cnt;
}
}
m[i]=m[i]/cnt;
}
}
static int check1()
{
for(int i=0;i<p;++i)
for(int j=0;j<n;++j)
if(tempk[i][j]!=k[i][j])
{
return 0;
}
return 1;
}

public static void main(String args[])


{
Scanner scr=new Scanner(System.in);
/* Accepting number of elements */
System.out.println("Enter the number of elements ");
n=scr.nextInt();
d=new int[n];
/* Accepting elements */
System.out.println("Enter "+n+" elements: ");
for(int i=0;i<n;++i)
d[i]=scr.nextInt();
/* Accepting num of clusters */
System.out.println("Enter the number of clusters: ");
p=scr.nextInt();
/* Initialising arrays */
k=new int[p][n];
tempk=new int[p][n];
m=new double[p];
diff=new double[p];
/* Initializing m */
for(int i=0;i<p;++i)
m[i]=d[i];
int temp=0;
int flag=0;
do
{

for(int i=0;i<p;++i)
for(int j=0;j<n;++j)
{
k[i][j]=-1;
}
for(int i=0;i<n;++i) // for loop will cal cal_diff(int) for every element.
{
temp=cal_diff(d[i]);
if(temp==0)
k[temp][count1++]=d[i];
else if(temp==1)
k[temp][count2++]=d[i];
else if(temp==2)
k[temp][count3++]=d[i];
}

cal_mean(); // call to method which will calculate mean at this step.


flag=check1(); // check if terminating condition is satisfied.
if(flag!=1)
/*Take backup of k in tempk so that you can check for equivalence in next step*/
for(int i=0;i<p;++i)
for(int j=0;j<n;++j)
tempk[i][j]=k[i][j];
System.out.println("\n\nAt this step");
System.out.println("\nValue of clusters");
for(int i=0;i<p;++i)
{
System.out.print("K"+(i+1)+"{ ");
for(int j=0;k[i][j]!=-1 && j<n-1;++j)
System.out.print(k[i][j]+" ");
System.out.println("}");
}//end of for loop
System.out.println("\nValue of m ");
for(int i=0;i<p;++i)
System.out.print("m"+(i+1)+"="+m[i]+" ");
count1=0;count2=0;count3=0;
}
while(flag==0);
{
System.out.println("\n\n\nThe Final Clusters By Kmeans are as follows: ");
for(int i=0;i<p;++i)
{
System.out.print("K"+(i+1)+"{ ");
for(int j=0;k[i][j]!=-1 && j<n-1;++j)
System.out.print(k[i][j]+" ");
System.out.println("}");
}
}
}
}

Output:

Weka Output:

Experiment No 6
AIM: Implementation of Agglomerative clustering using WEKA

Objectives:
After completing this experiment you will be able to:
1. Explain the differences among the two main styles of learning: supervised and unsupervised.
2. Implement simple algorithms for unsupervised learning.
3. Explain the problem of outlier on clustering algorithms.

COs to be achieved (PO: b, g, k):
CO3: Implement the appropriate data mining methods like classification, clustering or association mining on large data sets.
CO4: Define and apply metrics to measure the performance of various data mining algorithms.

Theory:
The process of grouping a set of physical or abstract objects into classes of similar objects is called
clustering. A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters. The grouping is accomplished by finding
similarities between data according to characteristics found in the data itself. Thus clustering is
viewed to be driven by the data itself and is often based on the similarity between attribute values.
Major Clustering Methods
Partitioning methods:
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the
data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object, and (2) each object must belong to exactly one group.
To achieve global optimality in partitioning-based clustering would require the exhaustive
enumeration of all of the possible partitions. Instead, most applications adopt one of a few popular
heuristic methods, such as k-means algorithm, where each cluster is represented by the mean value
of the objects in the cluster, and k-medoids algorithm, where each cluster is represented by one of
the objects located near the center of the cluster.
Density-based methods: Most partitioning methods cluster objects based on the distance
between objects. Other clustering methods have been developed based on the notion of density.
Their general idea is to continue growing the given cluster as long as the density (number of objects
or data points) in the neighborhood exceeds some threshold; that is, for each data point within a
given cluster, the neighborhood of a given radius has to contain at least a minimum number of
points.
DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters
according to a density-based connectivity analysis.
Hierarchical methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A
hierarchical method can be classified as being either agglomerative or divisive, based on how the
hierarchical decomposition is formed.
The agglomerative approach, also called the bottom-up approach, starts with each object forming a
separate group. It successively merges the objects or groups that are close to one another, until all
of the groups are merged into one (the topmost level of the hierarchy), or until a termination
condition holds.
The divisive approach, also called the top-down approach, starts with all of the objects in the same
cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each
object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be
undone. (A solution is to integrate hierarchical clustering with iterative relocation, as in BIRCH.)
Hierarchical Clustering Algorithms:
A hierarchical clustering method works by grouping data objects into a tree of clusters.
Hierarchical clustering methods can be further classified as either agglomerative or divisive.
Agglomerative: the hierarchical decomposition is formed in a bottom-up fashion (merging).
Divisive: the hierarchical decomposition is formed in a top-down fashion (splitting).
AGNES (AGglomerative NESting) is an agglomerative hierarchical clustering method, and
DIANA (DIvisive ANAlysis) is a divisive hierarchical clustering method.

Agglomerative hierarchical clustering:


This bottom-up strategy starts by placing each object in its own cluster and then merges these
atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until
certain termination conditions are satisfied. Most hierarchical clustering methods belong to this
category. They differ only in their definition of inter-cluster similarity.
A tree structure called a dendrogram is commonly used to represent the process of hierarchical
clustering. Four widely used measures for distance between clusters are as follows:

Minimum distance:  dmin(Ci, Cj)  = min |p - p'|  over all p in Ci and p' in Cj
Maximum distance:  dmax(Ci, Cj)  = max |p - p'|  over all p in Ci and p' in Cj
Mean distance:     dmean(Ci, Cj) = |mi - mj|
Average distance:  davg(Ci, Cj)  = (1 / (ni * nj)) * sum of |p - p'|  over all p in Ci and p' in Cj

where |p - p'| is the distance between two objects or points p and p'; mi is the mean of cluster Ci;
and ni is the number of objects in Ci.
When an algorithm uses the minimum distance, dmin(Ci, Cj), to measure the distance
between clusters, it is sometimes called a nearest-neighbor clustering algorithm. Moreover,
if the clustering process is terminated when the distance between nearest clusters exceeds an
arbitrary threshold, it is called a single-linkage algorithm.
When an algorithm uses the maximum distance, dmax(Ci, Cj), to measure the distance
between clusters, it is sometimes called a farthest-neighbor clustering algorithm. If the
clustering process is terminated when the maximum distance between nearest clusters
exceeds an arbitrary threshold, it is called a complete-linkage algorithm.
Example: Agglomerative (AGNES) and divisive (DIANA) hierarchical clustering on a data set of five objects, {a, b, c, d, e}.
Initially, AGNES places each object into a cluster of its own. The clusters are then merged step-by-step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1
and an object in C2 form the minimum Euclidean distance between any two objects from different
clusters. This is a single-linkage approach in that each cluster is represented by all of the objects in
the cluster, and the similarity between two clusters is measured by the similarity of the closest pair
of data points belonging to different clusters. The cluster merging process repeats
until all of the objects are eventually merged to form one cluster. In DIANA, all of the objects are
used to form one initial cluster. The cluster is split according to some principle, such as the
maximum Euclidean distance between the closest neighboring objects in the cluster. The cluster
splitting process repeats until, eventually, each new cluster contains only a single object.
In either agglomerative or divisive hierarchical clustering, the user can specify the desired number
of clusters as a termination condition. A tree structure called a dendrogram is commonly used to
represent the process of hierarchical clustering.
It shows how objects are grouped together step by step. Figure-2 shows a dendrogram for the five
objects presented in Figure-1, where l = 0 shows the five objects as singleton clusters at level 0.
At l = 1, objects a and b are grouped together to form the first cluster, and they stay together at all
subsequent levels. We can also use a vertical axis to show the similarity scale between clusters. For
example, when the similarity of two groups of objects, {a, b} and {c, d, e}, is roughly 0.16, they are
merged together to form a single cluster.

Figure-1: Agglomerative and divisive hierarchical clustering on data objects {a, b, c, d, e}

Figure-2 Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}
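The clustering described above can also be run programmatically rather than through the Weka Explorer. The following is a minimal sketch, assuming Weka 3.7 or later is on the classpath (where the agglomerative clusterer is weka.clusterers.HierarchicalClusterer, with -N for the number of clusters and -L for the link type) and assuming an ARFF file named iris.arff in the working directory; the file name and option values are illustrative placeholders, not part of the prescribed lab procedure.

import weka.clusterers.HierarchicalClusterer;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class AgglomerativeDemo {
    public static void main(String[] args) throws Exception {
        // Load any numeric ARFF data set; "iris.arff" is a placeholder file name.
        Instances data = DataSource.read("iris.arff");

        // Agglomerative (bottom-up) clustering: 3 clusters, single linkage (nearest neighbor).
        HierarchicalClusterer agnes = new HierarchicalClusterer();
        agnes.setOptions(Utils.splitOptions("-N 3 -L SINGLE"));
        agnes.buildClusterer(data);

        // Print the learned cluster tree and the assignment of the first few instances.
        System.out.println(agnes);
        for (int i = 0; i < 5; i++)
            System.out.println("instance " + i + " -> cluster "
                    + agnes.clusterInstance(data.instance(i)));
    }
}

Replacing -L SINGLE with -L COMPLETE switches from the single-linkage (nearest-neighbor) criterion to the complete-linkage (farthest-neighbor) criterion described above.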


Conclusion:
In this experiment we have studied agglomerative hierarchical clustering, a bottom-up
strategy that starts by placing each object in its own cluster and then merges these atomic clusters
into larger and larger clusters until all of the objects are in a single cluster.
In the case of hierarchical clustering we do not have to assume any particular number of clusters; any
desired number of clusters can be obtained by cutting the dendrogram at the proper level. The
major limitation of hierarchical clustering is that once a decision is made to combine two clusters, it
cannot be undone.

Weka Output:

Experiment No 7
AIM: Implementation of Density Based Clustering: DBSCAN and OPTICS using WEKA.

Objectives:
After completing this experiment you will be able to:
1. Explain the differences between the two main styles of learning: supervised and unsupervised.
2. Implement simple algorithms for unsupervised learning.
3. Explain the effect of outliers on clustering algorithms.

COs to be achieved:

CON    Course Outcomes                                                        PO
CO3    Implement the appropriate data mining methods like classification,     b, g, k
       clustering or association mining on large data sets.
CO4    Define and apply metrics to measure the performance of various data
       mining algorithms.

Note: Student will perform this experiment as an exercise.
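As a starting point for the exercise, the sketch below shows one way to drive a density-based clusterer from Java. It assumes Weka 3.6.x, where the clusterer ships as weka.clusterers.DBScan with -E (epsilon radius) and -M (minimum points); in Weka 3.7+ the DBSCAN and OPTICS implementations are installed separately through the optics_dbScan package and the class names may differ. The data file name and parameter values are illustrative only.

import weka.clusterers.DBScan;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class DBScanDemo {
    public static void main(String[] args) throws Exception {
        // Load any numeric ARFF data set; the file name is a placeholder.
        Instances data = DataSource.read("iris.arff");

        // Density-based clustering: epsilon-neighbourhood radius 0.9, at least 6 points per core region.
        DBScan dbscan = new DBScan();
        dbscan.setOptions(Utils.splitOptions("-E 0.9 -M 6"));
        dbscan.buildClusterer(data);

        // The textual model lists the discovered clusters and the points treated as noise.
        System.out.println(dbscan);
    }
}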

Experiment No 8
AIM: Implementation of Association Mining (Apriori, FPM) using WEKA.

Objectives:
After completing this experiment you will be able to:
1. Explain the value of finding associations in market basket data.
2. Characterize the kinds of patterns that can be discovered by association rule mining.
3. Describe how to extend a relational system to find patterns using association rules.
COs to be achieved:

CON    Course Outcomes                                                        PO
CO3    Implement the appropriate data mining methods like classification,     a, b, f, g, k
       clustering or association mining on large data sets.
CO4    Define and apply metrics to measure the performance of various data
       mining algorithms.

Theory:
Frequent itemset mining leads to the discovery of associations and correlations among items in
large transactional or relational data sets. With massive amounts of data continuously being
collected and stored, many industries are becoming interested in mining such patterns from their
databases. The discovery of interesting correlation relationships among huge amounts of business
transaction records can help in many business decision-making processes, such as catalog design,
cross-marketing, and customer
shopping behavior analysis.
A typical example of frequent itemset mining is market basket analysis. This process analyzes
customer buying habits by finding associations between the different items that customers place in
their shopping baskets. The discovery of such associations can help retailers develop marketing
strategies by gaining insight into which items are frequently purchased together by customers.

Frequent Itemsets, Closed Itemsets, and Association Rules


1. An association rule is of the form X => Y,
Where X = {x1, x2, x3, ..., xn} and Y = {y1, y2, y3, ..., ym} are sets of items,
the xi and yj being distinct items for all i and all j, i.e. X ∩ Y = ∅ (X and Y share no items).
2. This association states that if a customer buys X, he or she is also likely to buy Y.
3. In general, any association rule has the form LHS (left-hand side) => RHS (right-hand side),
where LHS and RHS are sets of items.
4. The set LHS U RHS is called an itemset, the set of items purchased by customers.
5. For an association rule to be of interest to a data miner, the rule should satisfy some interest
measure.
Two common interest measures are support and confidence.
6. The support for a rule LHS => RHS is with respect to the itemset:
6.1 It refers to how frequently a specific itemset occurs in the database.
6.2 That is, the support is the percentage of transactions that contain all of the items in the
itemset, LHS U RHS.
6.3 If the support is low, it implies that there is no overwhelming evidence that items in LHS
U RHS occur together, because the itemset occurs in only a small fraction of
transactions.
7. The confidence is with regard to the inference shown in the rule.
7.1 The confidence of the rule LHS => RHS is computed as:
Support (LHS U RHS) / Support (LHS).
7.2 We can think of it as the probability that the items in RHS will be purchased given that
the items in LHS are purchased by a customer.
7.3 Another term for confidence is strength of the rule.
8. Example:

   TID      Time    Items Bought
   101      6.35    Milk, Bread, Cookies, Juice
   792      7.38    Milk, Juice
   1130     8.05    Milk, Eggs
   1735     8.40    Bread, Cookies, Coffee

   Figure-2: Example transactions in the market-basket model.

8.1 As an example of support and confidence, consider the following two rules:
Milk => Juice and Bread => Juice.
8.2 Looking at our four sample transactions in Figure-2, we see that
The support of {Milk, Juice} is 50% and
The support of {Bread, Juice} is only 25%.
The confidence of Milk => Juice is 66.7% (meaning that, of three transactions in which
milk occurs, two contain juice) and
The confidence of Bread => Juice is 50% (meaning that one of two transactions
containing bread also contains juice).
8.3 As we can see, support and confidence do not necessarily go hand in hand.
The goal of mining association rules, then, is to generate all possible rules that exceed
some minimum user-specified support and confidence thresholds. The problem is thus
decomposed into two sub-problems:
8.3.1 Generate all itemsets that have a support that exceeds the threshold. These sets of
items are called large (or frequent) itemsets. Note that large here means large
support.
8.3.2 For each large itemset, all the rules that have a minimum confidence are generated
as follows:
For a large itemset X and any Y ⊂ X, let Z = X - Y; then if support(X)/support(Z) >
minimum confidence, the rule Z => Y (that is, X - Y => Y) is a valid rule.
9. Generating rules by using all large itemsets and their supports is relatively straightforward.
However, discovering all large itemsets together with the value for their support is a major
problem if the cardinality of the set of items is very high. A typical supermarket has thousands of
items. The number of distinct itemsets is 2^m, where m is the number of items, and counting
support for all possible itemsets becomes very computation-intensive. To reduce the
combinatorial search space, algorithms for finding association rules utilize the following
properties:
9.1 A subset of a large itemset must also be large (that is, each subset of a large itemset
exceeds the minimum required support); this is the downward closure property.
9.2 Conversely, a superset of a small itemset is also small (implying that it does not have
enough support); this is the antimonotonicity property.
9.3 These two properties help in reducing the search space of possible solutions. That is,
once an itemset is found to be small, then any extension to that itemset, formed by
adding one or more items to the set, will also yield a small itemset.
Conclusion:

In the process of finding frequent itemsets, Apriori avoids the wasted effort of counting candidate
itemsets that are known to be infrequent. The candidates are generated by joining the frequent
itemsets level-wise, and the candidates are then pruned according to the Apriori property.

As a result, the number of remaining candidate itemsets that require further support checking
becomes much smaller, which dramatically reduces the computation, the I/O cost and the memory
requirement.

The main limitation of Apriori is the number of database scans required for generating the
frequent itemsets.
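The run shown below was produced with the Weka Explorer, but the same associator can be invoked from Java. The following minimal sketch assumes Weka is on the classpath and a local copy of the bundled supermarket.arff; note that the Explorer run below additionally applied the Remove attribute filter, which the sketch omits for brevity.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load the market-basket data; supermarket.arff ships with the Weka distribution.
        Instances data = DataSource.read("supermarket.arff");

        // 10 best rules, ranked by confidence (-T 0), minimum confidence 0.7, minimum support 0.1.
        Apriori apriori = new Apriori();
        apriori.setOptions(Utils.splitOptions("-N 10 -T 0 -C 0.7 -M 0.1"));
        apriori.buildAssociations(data);

        // Prints the sizes of the large itemsets and the best rules found.
        System.out.println(apriori);
    }
}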

Weka Output

=== Run information ===

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.7 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     supermarket-weka.filters.unsupervised.attribute.Remove-R10,14-16,19-25,27-44,46-50,52-217
Instances:    4627
Attributes:   17
              department1
              department2
              department3
              department4
              department5
              department6
              department7
              department8

Apriori
=======

Minimum support: 0.1 (463 instances)
Minimum metric <confidence>: 0.7
Number of cycles performed: 18

Generated sets of large itemsets:

Size of set of large itemsets L(1): 7

Size of set of large itemsets L(2): 11

Size of set of large itemsets L(3): 4

Best rules found:

 1. department1=t biscuits=t 669 ==> bread and cake=t 562    conf:(0.84)
 2. tea=t biscuits=t 627 ==> bread and cake=t 525    conf:(0.84)
 3. biscuits=t coffee=t 719 ==> bread and cake=t 591    conf:(0.82)
 4. biscuits=t soft drinks=t 1193 ==> bread and cake=t 969    conf:(0.81)
 5. biscuits=t 2605 ==> bread and cake=t 2083    conf:(0.8)
 6. tea=t 896 ==> bread and cake=t 709    conf:(0.79)
 7. coffee=t 1094 ==> bread and cake=t 837    conf:(0.77)
 8. department1=t 1047 ==> bread and cake=t 794    conf:(0.76)
 9. soft drinks=t 1888 ==> bread and cake=t 1429    conf:(0.76)
10. baby needs=t 619 ==> bread and cake=t 467    conf:(0.75)

Experiment No 9
Aim: Study of BI tool - Oracle BI, XL-Miner, Rapid Miner.

Objectives:
After completing this experiment you will be able to:
1. Understand the working of Rapid Miner.
2. Use Rapid Miner as a BI tool.

COs to be achieved:

CON    Course Outcomes                                                        PO
CO5    Apply BI to solve practical problems: analyze the problem domain,      i, k
       use the data collected in enterprise apply the appropriate data mining
       technique, interpret and visualize the results and provide decision
       support.

Theory:
Business Intelligence (BI) is a terminology representing a collection of processes, tools and
technologies helpful in achieving more profit by considerably improving the productivity, sales and
service of an enterprise. With the help of BI methods, corporate data can be organized and analyzed
in a better way and then converted into useful knowledge needed to initiate a profitable business
action. Thus it is about turning raw, collected data into intelligent information by analyzing and
re-arranging the data according to the relationships between the data items, and by knowing what
data to collect and manage and in what context.
A company's collected raw data is an important asset in which one can find solutions to many of an
organization's critical questions, such as 'what was the net profit for a particular product last year,
what will the sales be this year, and what are the key factors to focus on this year in order to increase
the sales?'. So there arises the necessity of a well-planned BI system, which can lead to greater
profitability by reducing the operating costs, increasing the sales and thereby improving the
customer satisfaction of an enterprise.
With the help of a Business Intelligence System, a company may improve its business or rule over
its competitors by exploring and exploiting its data to know the customer preferences, nature of
customers, supply chains, geographical influences, pricings and how to increase its overall business
efficiency.
Business Intelligence enables us to take some action based on the intelligence acquired using BI
strategy. If this knowledge or information is not utilized properly in the right direction, there is no
point in analyzing and finding the intelligence.
For example, let's assume a company has implemented a BI system to analyze customer interests
and requirements enabling them to promote a particular product in the near future. All the analysis
and knowledge management will be pointless and a waste of investment if they don't have a proper
plan to approach the right customer at the right time. So Business Intelligence is all about strategies
in increasing business efficiency while vastly cutting down the operating costs.
Implementing a Business Intelligence system in an organization requires a significant amount of
money to be invested in order to build and deploy the BI system and its applications. It requires
skilled top-level managers to build an ROI (Return on Investment) model to analyze the costs
involved in implementing and maintaining these BI models and methods, so as to get the return on
investment sooner.
A proper business action should be taken based on the strategies derived with the help of these
intelligence models. Often an erroneous model and wrong assumptions can bring a loss much
greater than the cost of building the entire Business Intelligence system itself. Once everything is
done properly, in the way an organisation wants it to be, the benefit that comes out of it is priceless.
Topmost executives of an organization are really interested in aggregated facts or numbers to take
decisions rather than querying several databases (that are normalized) to get the data and do the
comparison by themselves. OLAP tools visualize the data in an understandable format, like in the
form of Scorecards and Dashboards with Key Performance Indicators enabling managers to monitor
and take immediate actions. In today's business life, OLAP plays a vital role by assisting decision
makers in the field of banking and finance, hospitals, insurance, manufacturing, pharmaceuticals
etc., to measure facts across geography, demography, product, and sales.

Polynomial regression is a form of linear regression in which the relationship between the
independent variable x and the dependent variable y is modeled as an nth order polynomial. In
RapidMiner, y is the label attribute and x is the set of regular attributes that are used for the
prediction of y. Polynomial regression fits a nonlinear relationship between the value of x and the
corresponding conditional mean of y, denoted E(y | x), and has been used to describe nonlinear
phenomena such as the growth rate of tissues and the progression of disease epidemics. Although
polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is
linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are
estimated from the data. For this reason, polynomial regression is considered to be a special case of
multiple linear regressions.
The goal of regression analysis is to model the expected value of a dependent variable y in terms of
the value of an independent variable (or vector of independent variables) x. In simple linear
regression, the following model is used:
y = w0 + (w1 * x)
In this model, for each unit increase in the value of x, the conditional expectation of y increases by
w1 units. In many settings, such a linear relationship may not hold. For example, if we are
modeling the yield of a chemical synthesis in terms of the temperature at which the synthesis takes
place, we may find that the yield improves by increasing amounts for each unit increase in
temperature. In this case, we might propose a quadratic model of the form:
y = w0 + (w1 * x) + (w2 * x ^ 2)
In this model, when the temperature is increased from x to x + 1 units, the expected yield changes
by w1 + w2 + (2 * w2 * x). The fact that the change in yield depends on x is what makes the
relationship nonlinear (this must not be confused with saying that this is nonlinear regression; on
the contrary, this is still a case of linear regression). In general, we can model the expected value of
y as an nth order polynomial, yielding the general polynomial regression model:
y = w0 + (w1 * x ^ 1) + (w2 * x ^ 2) + . . . + (wn * x ^ n)
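To make the least-squares fitting behind this model concrete, the following self-contained Java sketch (not RapidMiner code) fits the quadratic model y = w0 + (w1 * x) + (w2 * x^2) to a small illustrative data set by solving the normal equations (X^T X) w = X^T y with Gaussian elimination; the sample values are made up for demonstration.

public class QuadraticFit {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6};
        double[] y = {2.1, 4.9, 9.2, 16.8, 24.9, 36.2};   // roughly y = x^2 plus noise

        int n = x.length, terms = 3;                       // coefficients w0, w1, w2
        double[][] X = new double[n][terms];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < terms; j++)
                X[i][j] = Math.pow(x[i], j);               // design matrix columns: 1, x, x^2

        // Build the normal equations A w = b, where A = X^T X and b = X^T y
        double[][] A = new double[terms][terms];
        double[] b = new double[terms];
        for (int i = 0; i < terms; i++) {
            for (int j = 0; j < terms; j++)
                for (int r = 0; r < n; r++)
                    A[i][j] += X[r][i] * X[r][j];
            for (int r = 0; r < n; r++)
                b[i] += X[r][i] * y[r];
        }

        // Gaussian elimination (forward elimination) followed by back-substitution
        for (int col = 0; col < terms; col++) {
            for (int row = col + 1; row < terms; row++) {
                double factor = A[row][col] / A[col][col];
                for (int j = col; j < terms; j++)
                    A[row][j] -= factor * A[col][j];
                b[row] -= factor * b[col];
            }
        }
        double[] w = new double[terms];
        for (int i = terms - 1; i >= 0; i--) {
            double s = b[i];
            for (int j = i + 1; j < terms; j++)
                s -= A[i][j] * w[j];
            w[i] = s / A[i][i];
        }

        System.out.printf("y = %.3f + (%.3f * x) + (%.3f * x^2)%n", w[0], w[1], w[2]);
    }
}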
Regression is a technique used for numerical prediction. It is a statistical measure that attempts to
determine the strength of the relationship between one dependent variable ( i.e. the label attribute)
and a series of other changing variables known as independent variables (regular attributes). Just
like Classification is used for predicting categorical labels, Regression is used for predicting a
continuous value. For example, we may wish to predict the salary of university graduates with 5
years of work experience, or the potential sales of a new product given its price. Regression is often
used to determine how much specific factors such as the price of a commodity, interest rates,
particular industries or sectors influence the price movement of an asset.

Conclusion:
Business Intelligence is a term that refers to taking advantage of data and converting it into
intelligent information or knowledge by carefully observing data patterns or trends. These
findings are key factors in helping any business to improve its current business processes so as to
gain more customer satisfaction, increase sales, produce more profit, etc. The knowledge obtained
from several report-based analyses may lead to new business changes or improvements, thus helping
the organization to grow in the targeted direction.

Source:
Rapid miner stock prediction
https://www.youtube.com/watch?v=LbtZU1_i9Qk
http://au.finance.yahoo.com
http://finance.yahoo.com/q/hp?s=KO data link for Coca Cola
http://finance.yahoo.com/q/hp?s=MCD data link for McDonald's Corp. (MCD)

Experiment No 10
Aim: Case Study: Business Intelligence Mini Project

Objectives:
After completing this experiment you will be able to:
1. Understand BI concepts.
2. Apply BI concepts to credit card fraud detection.
COs to be achieved:

CON    After successful completion of the course students should be able to   PO
CO1    Demonstrate an understanding of the importance of data mining and the  a, b, c,
       principles of business intelligence.                                   d, f, g,
CO5    Apply BI to solve practical problems: analyze the problem domain, use  j, k
       the data collected in enterprise apply the appropriate data mining
       technique, interpret and visualize the results and provide decision
       support.

Theory:
Fraud detection is a challenging field of research, development and creativity! Fraud is a billion-dollar business and it is increasing every year. Fraud involves one or more persons who
intentionally act secretly to deprive another of something of value, for their own benefit. Fraud is as
old as humanity itself and can take an unlimited variety of different forms. However, in recent
years, the development of new technologies has also provided further ways in which criminals may
commit fraud. In addition to that, business reengineering, reorganization or downsizing may
weaken or eliminate control, while new information systems may present additional opportunities
to commit fraud.
Nowadays the usage of credit cards has dramatically increased. As the credit card becomes the most
popular mode of payment for both online and regular purchases, cases of fraud associated with it
are also rising. Various data mining techniques like classification, clustering and association mining
(Apriori) can be integrated to represent the sequence of operations in credit card transaction
processing and to show how they can be used for the detection of fraud.
Facts:
About 2.000.000 lost each day!
A fraud transaction every 9 seconds.
33% of cardholders affected by fraud.
ONLY 0.141% fraudulent transactions.
Challenge:
No universal fraud patterns : What is normal for one cardholder is unusual for another
Fraud patterns changing dynamically : Thieves are clever: action => reaction
Huge volumes of data : Hundreds of transactions per second, millions of accounts
Build an intelligent, self-learning system that detects fraud in real-time!
The classical rule-based approach takes time to analyze the data and find the patterns of fraudulent
transactions; by the time decisions based on frequent patterns can be taken, a new fraud pattern has
already been invented by the criminals. The cardholders lose money and complain, and the banks
investigate the complaints and try to understand the new pattern. In short, in the classical rule-based
approach a new rule is implemented only a few weeks later; such a system is expensive to build
(knowledge intensive) and difficult to maintain (many rules), and because the situation is dynamically
changing, rules frequently have to be added, modified, or removed.
A perfect fraud detection system must be tuned to every cardholder. Each cardholder is treated
individually and the system is adaptive, i.e. it evolves with slow/small changes in cardholder behavior
as well as fast ones, with high accuracy. A system based on cardholder profiles can be a better
alternative. Every cardholder gets a vector of parameters that describes his/her behavior: an
average-behavior profile. The system constantly compares this long-term profile with the recent
behavior of the cardholder. Transactions that do not fit into the cardholder's profile are flagged as
suspicious (or are blocked). Profiles are updated with every single transaction, so the system
constantly adapts to (slow and small) changes in the cardholder's behavior.

Figure: A system based on cardholder profiles.
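As an illustration of the profile idea (not a prescribed implementation), the sketch below keeps a per-cardholder profile consisting only of a running mean and standard deviation of transaction amounts, flags a transaction as suspicious when it deviates from the long-term average by more than three standard deviations, and otherwise updates the profile with the new transaction. Real systems track many more profile variables; the class name, threshold and sample amounts are hypothetical.

public class CardholderProfile {
    private double mean = 0, m2 = 0;   // running mean and sum of squared deviations (Welford's method)
    private long count = 0;

    public boolean isSuspicious(double amount) {
        if (count < 10) return false;                 // too little history to judge reliably
        double std = Math.sqrt(m2 / (count - 1));
        return Math.abs(amount - mean) > 3 * std;     // more than 3 standard deviations away
    }

    public void update(double amount) {               // called for every accepted transaction
        count++;
        double delta = amount - mean;
        mean += delta / count;
        m2 += delta * (amount - mean);
    }

    public static void main(String[] args) {
        CardholderProfile profile = new CardholderProfile();
        double[] history = {420, 350, 510, 390, 460, 480, 370, 440, 405, 495, 430};
        for (double amt : history) profile.update(amt);      // build the long-term profile

        double[] incoming = {450, 9000};                      // the second one should be flagged
        for (double amt : incoming) {
            if (profile.isSuspicious(amt))
                System.out.println("FLAG: " + amt + " does not fit this cardholder's profile");
            else {
                profile.update(amt);
                System.out.println("OK:   " + amt + " accepted, profile updated");
            }
        }
    }
}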

Data Mining Challenges:

- Identification of profile variables
- Sampling (very skewed distributions)
- Development of the scoring model
- Optimization criterion: what do we optimize?
  - Number of detected fraud transactions?
  - Number of detected fraud cards?
  - Amount of money saved?
Conclusion:
A system based on cardholder profiles is a powerful system which does not require manual tuning
(it is self-learning!). It has very high scalability and speed (it could serve the whole of India on a
single PC!). It could even be implemented as a distributed system (smart terminals and smart cards).
