CERTIFICATE
Sanchita Mandlik (B211058)
Kajal Kamble (B211045)
Prof. R. T. Waghamode, Dr. Sangave
Department of Computer Engineering
ZEAL College of Engineering and Research
INDEX
1. Introduction
2. Classification Algorithms
3. Output Screens
4. Comparison of Classification
5. Conclusion
6. Bibliography

Name of Project:
Outcomes:
Introduction:
Nowadays, data mining plays a vital role in various fields and is one of the most important areas of research, with the objective of finding meaningful information in data stored in huge data sets. Data mining, or knowledge discovery, has become an area of growing significance because it helps in analyzing data from different perspectives and summarizing it into useful information. Data mining is defined as extracting information from huge sets of data.
Here we use an electronic card transaction data set with 600 samples. We classify the services of the data set using the data value, period, magnitude, units, and status attributes. Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, lacking in certain behaviors or trends, and likely to contain many errors. Data preprocessing is a proven method of resolving such issues. After preprocessing, we apply classifiers to predict the service of each record: the decision tree classifier (ID3), KNN, and the random forest classifier.
Sample Dataset
Software Requirement:
PyCharm IDE
Python 3.7
Hardware requirement:
Laptop/PC, 4 GB RAM, 64-bit Windows OS.
The basic process of loading data from a CSV file into a pandas DataFrame (with all going well) is achieved using the read_csv function in pandas.
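A minimal sketch of this step; the file name used here is an assumed placeholder for our electronic card transaction CSV:

import pandas as pd

# Load the electronic card transaction data set.
# "ect_data.csv" is an assumed placeholder file name.
df = pd.read_csv("ect_data.csv")
print(df.head())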
1. Label Encoding:
In our dataset, two columns of X have string values. We convert them into numeric form using the label encoder function. In the STATUS column there are 4 values, i.e. F, P, C, R, and in the UNITS column there are 3 values, i.e. Dollars, Percent, Number.
Label encoding converts the data into machine-readable form by assigning a unique number (starting from 0) to each class of data. This may lead to a priority issue when training on the data: a label with a high value may be treated as having higher priority than a label with a lower value.
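A sketch of this step with scikit-learn's LabelEncoder; the exact column name casing is assumed from the description above:

from sklearn.preprocessing import LabelEncoder

# Convert the string-valued STATUS and UNITS columns to integer codes.
# Column names are assumptions based on the data set description above.
df["STATUS"] = LabelEncoder().fit_transform(df["STATUS"])  # F, P, C, R -> 0..3
df["UNITS"] = LabelEncoder().fit_transform(df["UNITS"])    # Dollars, Number, Percent -> 0..2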
2. Handling Missing Values:
Here we use the Imputer function to handle missing values. In our dataset, the Data_value column has missing values left as blank spaces. Each blank is read as a NaN value, and this NaN value is replaced by the mean of the Data_value column.
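A sketch of this step; the report names the older scikit-learn Imputer class, whose equivalent in current scikit-learn is SimpleImputer:

import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaN entries in Data_value with the column mean.
# SimpleImputer is the modern replacement for the older Imputer class.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[["Data_value"]] = imputer.fit_transform(df[["Data_value"]])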
3. Transformation:
This step transforms the data into forms suitable for the mining process by scaling the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
The Standard Scaler assumes your data is normally distributed within each
feature and will scale them such that the distribution is now centred around 0,
with a standard deviation of 1.
The mean and standard deviation are calculated for the feature, and then the feature is scaled as:

xi_scaled = (xi - mean(x)) / stdev(x)
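A sketch of this step with scikit-learn's StandardScaler; the exact feature column names are assumptions based on the attribute list above:

from sklearn.preprocessing import StandardScaler

# Standardize each feature to mean 0 and standard deviation 1.
# The feature column names below are assumptions from the description above.
features = ["Period", "Data_value", "Magnitude", "UNITS", "STATUS"]
X = df[features].values
X = StandardScaler().fit_transform(X)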
The data we use is usually split into training data and test data. The training set contains a known output, and the model learns on this data in order to generalize to other data later on. We have the test dataset (or subset) in order to test our model's predictions on it.
We use the train_test_split function to make the split. The argument test_size=0.3 indicates the percentage of the data that should be held out for testing.
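A sketch of the split; the label column name "Service" is a hypothetical stand-in for whichever column holds the service classes:

from sklearn.model_selection import train_test_split

# Hold out 30% of the samples for testing.
# "Service" is a hypothetical name for the label column.
y = df["Service"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)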
A. Decision Tree (ID3):
A decision tree mimics human-level thinking. That is why decision trees are easy to understand and interpret.
Algorithm:
1. Compute the entropy for the data set.
2. For every attribute/feature:
   a. Calculate the entropy for all categorical values.
   b. Take the average information entropy for the current attribute.
   c. Calculate the information gain for the current attribute.
3. Pick the attribute with the highest information gain.
4. Repeat until the desired tree is obtained.
Entropy
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy to calculate the homogeneity of a sample. If the sample is completely homogeneous, the entropy is zero; if the sample is equally divided, it has an entropy of one.
Entropy controls how a decision tree decides to split the data; it affects how a decision tree draws its boundaries. Entropy is the measure of impurity, disorder, or uncertainty in a bunch of examples.
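For reference, for a set S with class proportions p_i, entropy is computed as:

$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$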
Information Gain
Shannon invented the concept of entropy, which measures the impurity of the input set. In physics and mathematics, entropy refers to the randomness or impurity in a system. In information theory, it refers to the impurity in a group of examples. Information gain is the decrease in entropy: it computes the difference between the entropy before the split and the average entropy after the split of the dataset based on the given attribute values. The ID3 (Iterative Dichotomiser) decision tree algorithm uses information gain.
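For an attribute A, the gain is the entropy of S minus the weighted average entropy of the subsets S_v induced by each value v of A:

$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$

A minimal sketch of this classifier with scikit-learn, reusing the earlier split. Note that scikit-learn's DecisionTreeClassifier implements an optimized CART, so criterion="entropy" gives ID3-style information-gain splitting rather than a literal ID3:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Entropy-based splitting approximates ID3's information-gain criterion.
dt = DecisionTreeClassifier(criterion="entropy", random_state=42)
dt.fit(X_train, y_train)
print("Decision tree accuracy:", accuracy_score(y_test, dt.predict(X_test)))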
B. K-Nearest Neighbor:
K-Nearest Neighbor is a simple algorithm that stores all the available cases and classifies new data or cases based on a similarity measure. It searches the pattern space for the k training tuples that are closest to the unknown test tuple.
Closeness is calculated using Euclidean distance. The KNN classifier can be extremely slow when classifying test tuples, since each classification costs O(n) distance computations, where n is the number of training tuples. KNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.
Both for classification and regression, a useful technique is to assign weights to the contributions of the neighbors, so that nearer neighbors contribute more to the average than more distant ones. For example, a common weighting scheme gives each neighbor a weight of 1/d, where d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class (for KNN classification) or the property value (for KNN regression) is known.
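A sketch of this 1/d-style weighting in scikit-learn, where weights="distance" makes each neighbor vote with the inverse of its distance:

from sklearn.neighbors import KNeighborsClassifier

# weights="distance" weights each neighbor's vote by 1/d.
knn_weighted = KNeighborsClassifier(n_neighbors=3, weights="distance")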
Parameter Selection
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification but make the boundaries between classes less distinct. A good k can be selected by various heuristic techniques (see hyperparameter optimization). The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm. The accuracy of the KNN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. One popular approach is the use of evolutionary algorithms to optimize feature scaling; another is to scale features by the mutual information of the training data with the training classes.
Properties
KNN is a special case of a variable-bandwidth, kernel density "balloon"
estimator with a uniform kernel. The naive version of the algorithm is
easy to implement by computing the distances from the test example to
all stored examples, but it is computationally intensive for large training
sets. Using an approximate nearest neighbor search algorithm makes
KNN computationally tractable even for large data sets. Many nearest
neighbor search algorithms have been proposed over the years; these
generally seek to reduce the number of distance evaluations actually
performed.
Classification
A case is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors as measured by a distance function. If k = 1, the case is simply assigned to the class of its nearest neighbor.
Distance Function
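The example below uses the Euclidean distance between tuples x and y:

$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$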
Example: Consider the following data set and predict the service for the given new tuple X, where k = 3.
By Euclidean distance:
1) D(X, 1) = sqrt((20210 - 36422)^2 + (2006 - 2007)^2) = 16212
2) D(X, 2)
Department of Computer Engineering
ZEAL College Of Engineering and Research,
3) D(X, 3) = sqrt((20210 - 80468)^2 + (2006 - 2017)^2) = 60258
4) D(X, 4) = sqrt((20210 - 62614)^2 + (2006 - 2012)^2) = 42404
5) D(X, 5) = sqrt((20210 - 22272)^2 + (2006 - 2007)^2) = 2062
With k = 3, the majority service among the three nearest neighbors is Credit; hence the new tuple X is assigned the service Credit.
Algorithm:
1. Compute the distance between the new tuple and every training tuple.
2. Select the k training tuples with the smallest distances.
3. Assign the new tuple the class most common among these k neighbors.
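A minimal scikit-learn sketch of these steps, reusing the split from earlier; k = 3 as in the worked example, with Euclidean distance as the default metric:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# k-NN with Euclidean distance (the default metric) and k = 3.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("KNN accuracy:", accuracy_score(y_test, knn.predict(X_test)))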
C. Random Forest:
A random forest classifier creates a set of decision trees from randomly selected subsets of the training set. It then aggregates the votes from the different decision trees to decide the final class of the test object. Random Forest is an ensemble algorithm.
Algorithm (Training):
1. Randomly select "k" features from the total "m" features, where k << m.
2. Among the "k" features, calculate the node "d" using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until "l" number of nodes has been reached.
5. Build the forest by repeating steps 1 to 4 "n" times to create "n" trees.
Algorithm (Prediction):
1. Take the test features and use the rules of each randomly created decision tree to predict the outcome, storing each predicted outcome (target).
2. Calculate the votes for each predicted target.
3. Take the highest-voted predicted target as the final prediction from the random forest algorithm.
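A minimal sketch with scikit-learn's RandomForestClassifier, again reusing the earlier split; n_estimators=100 is an assumed choice of forest size:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# An ensemble of decision trees; each tree votes and the majority wins.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
print("Random forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))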
Presentation:
Output Screens:
Conclusion:
Bibliography:
https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/
https://scikit-learn.org/stable/modules/preprocessing.html
https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn