
International Journal of Scientific Research Engineering & Technology (IJSRET)

Volume 2 Issue 12 pp 840-843

March 2014

www.ijsret.org

ISSN 2278 0882

A STUDY ON MACHINE LEARNING TECHNIQUES USED IN DATA MINING

S. Rajasulochana1, M. Nagulanand2, Ramasubash M.P.3
1,2,3 M.E Computer Science and Engineering, SriGuru Institute of Technology, Coimbatore

ABSTRACT
Data mining and machine learning are two areas of
active research. Machine learning techniques enable a
machine to improve its performance based on previous
results. Data mining makes use of machine learning
techniques to solve many real-world problems. This paper
provides a state-of-the-art overview of the machine
learning techniques used in data mining, from a layman's
perspective.

Keywords: Data mining, machine learning, deep learning,
classification, clustering, reinforcement learning

I. INTRODUCTION
This is an era in which information plays a vital
role in all sorts of processing. With the advent of
computers, large amounts of information are flooding the
web; hence gathering, analyzing and processing the data to
extract the required patterns has become a serious issue.
It is difficult to deal with the big data obtained as a
result of data mining, so a means of storing and
retrieving data efficiently is required. There are also
cases where it is essential to classify the data based on
class labels, to cluster the relevant data and to associate
them based on patterns in order to arrive at an inference.
Data mining, an important step in knowledge discovery in
databases (KDD), deals with the process of retrieving
useful information (patterns) from the ample amount of
available raw data. Data mining has its roots in classical
statistics, artificial intelligence and machine learning,
and it makes use of machine learning techniques. Unlike
traditional data retrieval, which retrieves records for a
given query, data mining is the process of discovering
patterns that are not explicitly stored in the database,
i.e., the implicit patterns in the data. Yet data mining
faces several issues, such as security and social issues,
performance issues and data source issues.

II. DATA MINING
Data mining, often defined as knowledge discovery
in databases (KDD), is an iterative sequence of steps that
involves the following:
Data cleaning
Data integration
Data selection
Data transformation
Relevance analysis

[Figure: Data cleaning -> Data integration -> Data selection -> Data transformation -> Data mining -> Pattern evaluation -> Knowledge representation]
Fig. 1 Data Mining process

Thus data mining is the process of extracting
patterns from data that have been consolidated and
transformed.
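The preprocessing steps of the KDD sequence (cleaning, integration, selection, transformation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the records, the field names (`id`, `age`, `income`, `city`) and the helper function names are hypothetical, and min-max scaling is just one possible transformation.

```python
# Hypothetical toy records from two sources; None marks a missing value.
source_a = [
    {"id": 1, "age": 25, "income": 30000},
    {"id": 2, "age": None, "income": 45000},
    {"id": 3, "age": 40, "income": 60000},
]
source_b = [
    {"id": 1, "city": "Coimbatore"},
    {"id": 3, "city": "Chennai"},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(left, right, key):
    """Data integration: join two sources on a shared key."""
    lookup = {r[key]: r for r in right}
    return [{**rec, **lookup[rec[key]]} for rec in left if rec[key] in lookup]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Data transformation: min-max scale one numeric field to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

prepared = transform(
    select(integrate(clean(source_a), source_b, "id"), ["age", "income"]),
    "income",
)
print(prepared)
```

The output of this chain is the consolidated, transformed data on which a mining algorithm would then be run.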
Computing is used in all fields that involve
processing large amounts of data. Industries, educational
institutions, researchers exploring the natural world and
social websites all act as sources of data. These data have
to be effectively classified, clustered, pruned and
rendered in order to obtain the required patterns. Machine
learning plays a vital role in carrying out these
activities. Data mining has now moved on to the next stage,
called semantic mining or ontology-based mining, where
prediction is done based on a training set of data.

III. MACHINE LEARNING
Inaccurate estimates produced by statistical
estimation result in poor performance. Machine learning
is an approach that aims at optimizing the performance of
the solution, and thus aims to provide better estimates
than statistical estimation. The performance can be
optimized with the help of either example data, called the
training set, or past experience. Machine learning can
be either inductive or deductive. The core of machine
learning is learning: using observations or past
experience on a given set of data to do better in the
future. The main goal of machine learning is to devise an
algorithm that learns automatically from past experience.
Thus the machine learning paradigm can best be viewed as
"programming by example". Solving a problem with machine
learning involves considering the following [3] [4]:
Task identification
Performance analysis
Knowledge identification
Knowledge representation
Identifying the learning paradigm to use
Constructing a training experience for the learner
Machine learning algorithms have also been used in:
Speech recognition
Driving automobiles
Playing world-class backgammon
Program generation
Routing in communication networks
Understanding handwritten text
Data mining
Health care, etc.
Machine learning problems can be broadly classified as:
Supervised learning
Unsupervised learning
Reinforcement learning
Agent-based modeling and basket analysis are
some other types of machine learning problems that do not
fall into these three categories.


IV. SUPERVISED LEARNING
In supervised learning the given set of data
is labeled with pre-defined classes. Supervised learning
can be further classified into two types, based on the type
of variable (discrete or continuous) to which it is
applied, namely:
Classification
Regression
Decision trees, neural networks and Naive Bayes use
classification algorithms, whereas regression, association
rules and clustering use prediction algorithms.
A. Classification
More than 90% of machine learning problems
are classification problems.
1. Classification by decision tree induction
A decision tree is the simplest form of classification
used in data mining. CART (Classification and
Regression Trees), ID3 (Iterative Dichotomiser 3),
CHAID (CHi-squared Automatic Interaction Detector),
MARS and C4.5 are some of the widely used decision tree
methods. These algorithms differ in the way the split
point is chosen. When the target variable has more than two
categories, a variant of decision tree induction called
the C4.5 algorithm is used, and in the case of a binary
split the typical CART procedure is used. There are two
phases in a decision tree classifier, namely:
Growth phase
Prune phase
The initial phase of building a decision tree is
called the growth phase. The pruning phase reduces
overfitting, i.e., it removes the parts of the tree that
reflect noisy data or outliers. Overfitting can be reduced
by either pre-pruning or post-pruning. In pre-pruning, a
node is not split any further when a chosen measure falls
below a threshold (choosing an appropriate threshold is
indeed difficult), whereas in post-pruning branches are
removed from a fully grown tree.
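The remark that decision tree algorithms differ in how the split point is chosen can be made concrete with one such criterion. The sketch below computes entropy-based information gain, the measure commonly associated with ID3. The toy data set, the attribute names and the function names are illustrative assumptions; CART or CHAID would use other measures (for example Gini impurity or a chi-squared test).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: decide whether to play given outlook and wind.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

In the growth phase, a tree builder of this kind would split on the attribute with the highest gain and repeat the procedure on each resulting subset.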
B. Regression
Regression differs from classification in that it
is used to predict the behavior of continuous variables,
whereas classification is used to predict discrete
(categorical) variables. Regression generates a numerical
value as the estimated outcome, whereas classification
identifies the categorical class label for the given data.
Let $x_i$ be the variable used to predict the outcome,
called the independent variable; let $y_i$ be the observed
value of the variable being predicted, called the dependent
variable; and let $\hat{y}_i$ be the predicted value of the
dependent variable. A model built using these variables,
which helps in predicting one variable from one or more
other variables, is called the regression model.
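As an illustration of a regression model, the sketch below fits a straight line that predicts the dependent variable from a single independent variable by ordinary least squares. Linear least squares is an assumption made here for illustration (the text does not name a particular model), and the data points are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical observations lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys)
predictions = [a + b * x for x in xs]  # predicted values of the dependent variable
print(a, b, predictions)
```

With noisy data the predicted values would differ from the observed ones, and the fit minimizes the sum of the squared differences.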


Fig. 2 An example of how a machine learner is trained to
recognize images using a training set (a corrupted image of
the number 8) that is labeled or identified as the
number 8.

Supervised learning provides the learning
algorithm with a labeled set of data, from which the
inference is generated. The inference function is defined as
$f: X \rightarrow Y$, where X is the set of input objects,
also referred to as vectors, and Y is the set of output
objects, typically called the supervisory signal. Let X be
given as $\{x_1, x_2, \ldots, x_n\}$ and Y be given as
$\{y_1, y_2, \ldots, y_n\}$. Then a pair such as $(x_1, y_1)$
forms a training example, and the set
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ forms the
training set.

[Figure: Training set -> Learning algorithm]
Fig. 3 Supervised learning algorithm

The learning problem in supervised learning is
termed a regression problem when the target variable
is continuous and a classification problem when the
target variable is discrete. Supervised learning is similar
to a teacher teaching an elementary school student.
C. Issues in Supervised Learning
Bias-variance trade-off
Function complexity and amount of training data
Dimensionality of the input space
Noise in the output values

Fig. 4 Some of the modeling objectives and supervised
learning techniques
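
To make the idea of a training set of labeled pairs concrete, the sketch below builds an inference function from a list of (x, y) examples using a 1-nearest-neighbour rule. The nearest-neighbour rule is chosen here only because it is short; it is not a technique named in the text, and the feature vectors and labels are hypothetical.

```python
import math

def train_nearest_neighbour(training_set):
    """Return an inference function built from (x, y) training examples."""
    def predict(x):
        # Label of the training example closest to x (Euclidean distance).
        _, label = min(training_set, key=lambda pair: math.dist(pair[0], x))
        return label
    return predict

# Hypothetical training set: each pair is (input vector, supervisory signal).
training_set = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.5), "large"),
]

f = train_nearest_neighbour(training_set)
print(f((2.0, 1.0)), f((7.5, 8.0)))
```

Because the labels are discrete, this is a classification problem; replacing the labels with numbers and averaging the nearest neighbours would turn it into a regression problem.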

V. UNSUPERVISED LEARNING
Unsupervised learning is the problem of trying to
find hidden structure in input data that is not labeled.
Thus the task is to find clusters in the given unlabeled
data set. Unsupervised learning is similar to a teacher
teaching a graduate student. The following are some of the
approaches to unsupervised learning:
Clustering (e.g. K-means, hierarchical clustering)
Association rule mining
Hidden Markov model
Blind signal separation
A. Clustering
When a training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(k)}\}$
is given and no output label $y^{(i)}$ is provided, the
learning problem is called an unsupervised learning
problem. Clustering is the process of grouping data
points using a measure of similarity such as correlation
or Euclidean distance [1]. Clustering paves the way for
pattern recognition.
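
A minimal sketch of K-means clustering, with Euclidean distance as the similarity measure, is given below. The points, the choice of k = 2, the fixed iteration count and the naive initialisation (the first k points serve as the initial centroids) are assumptions made to keep the example short and deterministic.

```python
import math

def k_means(points, k, iterations=10):
    """Plain K-means: returns (centroids, cluster index of each point)."""
    centroids = list(points[:k])  # naive initialisation: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignment

# Hypothetical unlabeled data forming two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, assignment = k_means(points, k=2)
print(centroids, assignment)
```

No labels are supplied at any point; the grouping emerges only from the distances between the points.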

Fig. 5 Some of the modeling objectives and unsupervised
learning techniques

VI. REINFORCEMENT LEARNING
Reinforcement learning, sometimes treated as a form
of unsupervised learning, is a form of learning what to do,
or in other words of mapping situations to actions. That
is, it helps the learning agent to learn the behavior of
the system based on feedback from the environment. In other
words, it is the process of learning from actions. It finds
application in sequential decision making and control
problems where explicit supervision is not possible.
Reinforcement learning algorithms make use of a reward
function that marks the learning agent as either
successful or unsuccessful. For a right move the learning
agent is given a positive reward, and for a wrong move or
failure it is given a negative reward. That is,
reinforcement learning is associated with the learning of
policies: for example, "what to do" rather than
"what is that".
Reinforcement learning has been successful in
applications such as autonomous helicopter flight,
cell-phone network routing, factory control, marketing
strategy selection, legged robot locomotion and efficient
web-page indexing.
The advantage of reinforcement learning is that
the algorithm improves its accuracy over time as it reads
more training data, modifying its rules whenever it makes a
wrong prediction.
Reinforcement learning problems are usually
posed using a Markov Decision Process (MDP).

VII. DEEP LEARNING
Deep learning is supposed to be the future of
machine learning. It is widely used in image recognition
and in the identification of molecules for drug design. It
follows a layer-by-layer, or hierarchical, approach to
classification in the case of supervised learning.

VIII. CONCLUSION
A bird's-eye overview of the various machine
learning techniques used in data mining has been
presented. The techniques are described so that they can be
understood easily from a layman's perspective. Machine
learning techniques can be applied in all phases of data
mining, thereby achieving efficiency.

ACKNOWLEDGEMENT
The authors would like to thank the staff and
students of SriGuru Institute of Technology for their
support and guidance. The authors would also like to thank
their friends and family members for their valuable
comments.

REFERENCES
[1] Yogesh Singh, Pradeep Kumar Bhatia & Omprakash
Sangwan, "A Review of Studies on Machine Learning
Techniques", International Journal of Computer Science
and Security, Volume (1) : Issue (1).
[2] R. Agarwal, M. Mehta, J. Shafer, R. Srikant, A.
Arning, T. Bollinger, "The Quest Data Mining System",
Proceedings of the 1996 International Conference on Data
Mining and Knowledge Discovery (KDD'96), Portland,
Oregon, pp. 244-249, August 1996.
[3] Clifton Phua, Vincent Lee, Kate Smith & Ross
Gayler, "A Comprehensive Survey of Data Mining-Based
Fraud Detection Research".
[4] Rob Schapire, "A lecture note on Theoretical Machine
Learning".
[5] Jiban K Pal, "Usefulness and application of data
mining in extracting information from different
perspective", Annals of Library and Information Studies,
Vol. 58, March 2011, pp 7-16.