Credit Card Fraud Detection: Arnav Madan

CREDIT CARD FRAUD DETECTION
A SUMMER TRAINING REPORT
Submitted by
ARNAV MADAN
41414803117
in partial fulfillment of Summer Internship for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
Maharaja Agrasen Institute of Technology

Maharaja Agrasen Institute of Technology
To Whom It May Concern
I, Arnav Madan, Enrollment No. 41414803117, a student of Bachelor of Technology (IT), class of
2017-21, Maharaja Agrasen Institute of Technology, Delhi, hereby declare that the Summer Training
project report entitled “Credit Card Fraud Detection” is an original work and that it has not been
submitted to any other institute for the award of any other degree.
Date: 10 September 2019

Place: Delhi
Arnav Madan
41414803117
Information Technology
4I456
ACKNOWLEDGEMENT
Every Summer Training, big or small, is successful largely due to the effort of a number of wonderful
people who have always given their valuable advice or lent a helping hand. I sincerely appreciate the
inspiration, support and guidance of all those people who have been instrumental in making this
project a success.
I, ARNAV MADAN, a student of Maharaja Agrasen Institute of Technology, Rohini
(INFORMATION TECHNOLOGY), am extremely grateful to “WAC” for the confidence bestowed
in me and for entrusting me with my seminar entitled “CREDIT CARD FRAUD DETECTION”.
At this juncture I feel deeply honoured in expressing my sincere thanks to JITENDER
PUROHIT for making the resources available at the right time and providing valuable insights leading
to the successful completion of my seminar.
I express my gratitude to Mr. M.L.SHARMA (HOD, IT) for arranging the summer training on a good
schedule.
I would also like to thank all the IT faculty members of MAIT for their critical advice and guidance,
without which this seminar would not have been possible.
Last but not least, I express my deep gratitude to my family members and my friends, who
have been a constant source of inspiration during the preparation of this seminar report.
Arnav Madan
41414803117
4I456
CANDIDATE’S DECLARATION
I hereby declare that the work presented in the training report entitled “CREDIT
CARD FRAUD DETECTION”, in partial fulfillment for the award of the Degree of “Bachelor of
Technology” in the Department of Information Technology and submitted to the Department of
Information Technology, Maharaja Agrasen Institute of Technology, is a record of my own investigations
carried out under the guidance of Jitendar Purohit, Data Science faculty at Kyrion Technologies.
I have not submitted the matter presented in this report anywhere for the award of any other
degree.
Arnav Madan
Information technology
Enrolment no:
41414803117
ABSTRACT
In this project, we were asked to experiment with a real-world dataset and to explore how
machine learning algorithms can be used to find patterns in data. We were expected to gain
experience with common data-mining and machine learning techniques and to submit a report
about the dataset and the algorithms used. After performing the required tasks
on a dataset of my choice, I present my final report here.
TABLE OF CONTENTS
Certificate
Declaration
Acknowledgement
Declaration
Abstract
Chapter 1: Introduction To Machine Learning
Chapter 2: Understanding Our Data
2.1: Introduction
2.2: Gathering Sense of Our Data
Chapter 3: Dataset and Preprocessing
3.1: Dataset
3.2: Feature Scaling
3.3: Random Undersampling
3.4: Equally Distributing Data
Chapter 4: Naive Bayes Algorithm
4.2: The Bayes Rule
4.3: Naive Predictor Performance
Chapter 5: Classifiers
5.1.1: Logistic Regression
5.1.2: Support Vector Classifier
5.1.3: Decision tree
5.1.4: K Nearest Classifier
5.2: Cross Validation Curve
5.3: Learning Curve
Conclusion
References
LIST OF FIGURES
Fig no. Caption
Fig 3.1 5 rows x 31 columns
Fig 3.2 Fraud percentage of dataset
Fig 3.3 Number of frauds transaction vs no fraud transaction
Fig 3.4 Output showing if any column contain null values
Fig 3.5 Plots showing distribution of time and amount respectively
Fig 3.6 Dataset with scaled amount and time
Fig 3.7 Equally distributed class (blue = no frauds, red = frauds)
Fig 3.8 Histogram plot of each parameter
Fig 4.1 Output of Naive Bayes Predictor
Fig 5.1 Diagram Showing Logistic Regression
Fig 5.2 Diagram Showing Support Vector Machine
Fig 5.3 Diagram Showing Decision Tree
Fig 5.4 Diagram Showing K nearest Classifier
Fig 5.5 Cross Validation scores of classifiers
Fig 5.6 Diagram Showing Learning Rate of following cases
Fig 5.7 Training set and Cross Validation score of classifiers
Fig 5.8 Classification Report of all classifiers

CHAPTER 1
INTRODUCTION
Machine learning is a sub-domain of computer science which evolved from the study of
pattern recognition in data, and also from the computational learning theory in artificial
intelligence. It is the first-class ticket to most interesting careers in data analytics today. As
data sources proliferate along with the computing power to process them, going straight to the
data is one of the most straightforward ways to quickly gain insights and make predictions.
Machine Learning can be thought of as the study of a list of sub-problems, viz: decision
making, clustering, classification, forecasting, deep-learning, inductive logic programming,
support vector machines, reinforcement learning, similarity and metric learning, genetic
algorithms, sparse dictionary learning, etc. Supervised learning is the machine
learning task of inferring a function from labeled data; classification is the case in which the
output is a discrete class label. In supervised learning, we have a
training set and a test set. The training and test sets consist of examples consisting of
input and output vectors, and the goal of the supervised learning algorithm is to infer a function
that maps the input vector to the output vector with minimal error. In an optimal scenario, a
model trained on a set of examples will classify an unseen example in a correct fashion, which
requires the model to generalize from the training set in a reasonable way. In layman’s terms,
supervised learning can be termed as the process of concept learning, where a brain is exposed to
a set of inputs and result vectors and the brain learns the concept that relates said inputs to
outputs. A wide array of supervised machine learning algorithms is available to the machine
learning enthusiast, for example Neural Networks, Decision Trees, Support Vector Machines,
Random Forest, Naïve Bayes Classifier, Bayes Net, Majority Classifier etc., and they
each have their own merits and demerits. There is no single algorithm that works for all cases, as
stated by the No Free Lunch theorem. In this project, we try to find patterns in a dataset [1]
of credit card transactions made by European cardholders, and attempt to
throw various intelligently picked algorithms at the data and see what sticks.
Problems and Issues in Supervised learning:
Before we get started, we must know about how to pick a good machine learning
algorithm for the given dataset. To intelligently pick an algorithm to use for a supervised learning
task, we must consider the following factors :
1. Heterogeneity of Data:
Many algorithms, such as neural networks and support vector machines, work best when their
feature vectors are homogeneous, numeric and normalized. Algorithms that
employ distance metrics are particularly sensitive to this, so if the data is
heterogeneous, these methods should not be the first choice. Decision Trees can handle
heterogeneous data very easily.
2. Redundancy of Data:
If the data contains redundant information, i.e. highly correlated values,
distance-based methods may perform poorly because of numerical instability. In
this case, some form of regularization can be employed to prevent this
situation.
3. Dependent Features:
If there is some dependence between the feature vectors, then algorithms that
monitor complex interactions like Neural Networks and Decision Trees fare better
than other algorithms.
4. Bias-Variance Tradeoff:
A learning algorithm is biased for a particular input x if, when trained on each of
several different training sets, it is systematically incorrect when predicting the correct output for x,
whereas a learning algorithm has high variance for a particular input x if it predicts
different output values when trained on different training sets. The prediction error of
a learned classifier is related to the sum of the (squared) bias and the variance of the learning
algorithm, so neither can be high without making the prediction error high. Many
machine learning algorithms can adjust the balance
between bias and variance, either automatically or through a user-set parameter,
and using such algorithms helps resolve this situation.
5. Curse of Dimensionality:
If the problem has an input space that has a large number of dimensions, and the
problem only depends on a subspace of the input space with small dimensions, the
machine learning algorithm can be confused by the huge number of dimensions and
hence the variance of the algorithm can be high. In practice, if the data scientist can
manually remove irrelevant features from the input data, this is likely to improve the
accuracy of the learned function. In addition, there are many algorithms for feature
selection that seek to identify the relevant features and discard the irrelevant ones, as well as
dimensionality reduction techniques such as Principal Component Analysis (PCA), an
unsupervised method. These reduce the dimensionality.
6. Overfitting:
The programmer should know that the output values may
contain inherent noise resulting from human or sensor errors. In this
case, the algorithm must not attempt to infer a function that exactly matches all the
data. Fitting the data too closely causes overfitting, after which the model
will answer perfectly for all training examples but will have a very high error for
unseen samples. Practical ways of preventing this are stopping the learning process
early and filtering the data in the pre-learning phase to remove
noise.
Only after considering all these factors can we pick a supervised learning algorithm that
works for the dataset we are working on. For example, if we were working with a dataset
consisting of heterogeneous data, then decision trees would fare better than other algorithms. If
the input space of the dataset we were working on had 1000 dimensions, then it’s better to first
perform PCA on the data before using a supervised learning algorithm on it.
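As an illustration of the last point, the sketch below chains PCA with a classifier using scikit-learn. The synthetic data, the choice of 20 components and the use of logistic regression are assumptions made only for this example; they are not taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a high-dimensional dataset (assumption: 1000 features,
# of which only a few are informative).
X, y = make_classification(
    n_samples=300, n_features=1000, n_informative=10, random_state=0
)

# Reduce to 20 principal components, then fit a classifier on the reduced data.
model = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))
model.fit(X, y)

reduced = model.named_steps["pca"].transform(X)
print(reduced.shape)  # (300, 20)
```

Wrapping both steps in one pipeline means the PCA projection is learned only from the data passed to fit, which keeps the reduction and the classifier consistent when new samples are predicted.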
CHAPTER 2
UNDERSTANDING OUR DATA

2.1 Introduction
Credit card fraud is a wide-ranging term for theft and fraud committed using or involving a payment
card, such as a credit card or debit card, as a fraudulent source of funds in a transaction. The
purpose may be to obtain goods without paying or to obtain unauthorised funds from an account.
Credit card fraud is also an adjunct to identity theft. Although incidences of credit card fraud are
limited to about 0.1% of all card transactions, they have resulted in huge financial losses as the
fraudulent transactions have been large value transactions. It is important that credit card companies
are able to recognise fraudulent credit card transactions so that customers are not charged for items
that they did not purchase. What we need is an algorithm, which could classify a transaction as
fraudulent or non-fraudulent. Doing so will benefit both the credit card companies and the
customers who have to go through the ordeal.
In this kernel we will use various predictive models to see how accurate they are in detecting
whether a transaction is a normal payment or a fraud. As described in the dataset, the features are
scaled and the names of the features are not shown due to privacy reasons. Nevertheless, we can
still analyse some important aspects of the dataset. Let's start!
Our Goals:
• Understand the distribution of the "little" data that was provided to us.
• Create a 50/50 sub-dataframe ratio of "Fraud" and "Non-Fraud" transactions.(NearMiss
Algorithm)
• Determine the Classifiers we are going to use and decide which one has a higher accuracy.
• Create a Neural Network and compare the accuracy to our best classifier.
• Understand common mistakes made with imbalanced datasets.
2.2 Gathering Sense of Our Data

The first thing we must do is gather a basic sense of our data. Remember, except for
the transaction time and amount, we don't know what the other columns are (due to privacy reasons). The
only thing we know is that those unknown columns have already been scaled.
• The transaction amount is relatively small. The mean of all the amounts is
approximately USD 88.
• There are no "Null" values, so we don't have to work on ways to replace values.
• Most of the transactions are Non-Fraud (99.83%), while Fraud transactions
make up only 0.17% of the dataframe.
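The checks listed above could be reproduced with pandas roughly as follows. This is a minimal sketch: the helper name summarize and the tiny made-up dataframe are assumptions for illustration, and only the column names 'Amount' and 'Class' come from the dataset description.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Return the basic facts discussed above for a transactions dataframe."""
    class_share = df["Class"].value_counts(normalize=True) * 100
    return {
        "mean_amount": df["Amount"].mean(),
        "null_values": int(df.isnull().sum().sum()),
        "non_fraud_pct": class_share.get(0, 0.0),
        "fraud_pct": class_share.get(1, 0.0),
    }

# Tiny made-up frame standing in for the real data (assumption); with the
# Kaggle file one would pass the dataframe loaded from that CSV instead.
toy = pd.DataFrame({
    "Amount": [10.0, 250.0, 4.5, 87.5],
    "Class": [0, 0, 0, 1],
})
summary = summarize(toy)
print(summary)
```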
CHAPTER 3
Dataset and Preprocessing
3.1 Dataset
The dataset is provided by Kaggle [1] and contains transactions made by credit cards in September
2013 by European cardholders. This dataset presents transactions that occurred in two days, where
we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class
(frauds) accounts for 0.172% of all transactions.
It contains only numerical input variables, which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, the original features and more background information
about the data could not be provided. Features V1, V2, ... V28 are the principal components
obtained with PCA; the only features which have not been transformed with PCA are 'Time' and
'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first
transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used
for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes
value 1 in case of fraud and 0 otherwise.
Figure 3.1: 5 rows x 31 columns
Figure 3.2: Fraud percentage of dataset
Figure 3.3: Number of frauds transaction vs no fraud transaction

Using the info() function, we can check whether any of the columns contain null values. Looking at the
output, it can be seen that all 31 columns have only non-null values.
Figure 3.4: Output showing if any column contain null values
Figure 3.5: Plots showing distribution of time and amount respectively
3.2 Feature Scaling

Through the above dataset analysis it can be seen that all the columns are scaled except
the Amount and Time features. Many machine learning algorithms use the Euclidean distance
between two data points in their computations, and this is a problem: if left alone, these algorithms only
take in the magnitude of features and neglect the units. The results would vary greatly between
different units, for example 5 kg and 5000 g. Features with high magnitudes will weigh in a lot more in the
distance calculations than features with low magnitudes. To suppress this effect, we need to bring all
features to the same level of magnitude. This can be achieved by scaling.
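The effect of units on distance can be seen in a small worked example. The sketch below uses only the Python standard library; the two items and their (weight, price) values are made up, and standardization to zero mean and unit standard deviation is just one possible scaling, not necessarily the one used in this project.

```python
import math
import statistics

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(v - mean) / std for v in column]

def scale_rows(rows):
    """Standardize every column of a list of rows."""
    columns = [standardize(col) for col in zip(*rows)]
    return list(zip(*columns))

# Two made-up items described by (weight, price): same items, different units.
in_kg = [(5.0, 20.0), (6.0, 90.0)]
in_grams = [(5000.0, 20.0), (6000.0, 90.0)]

d_kg = euclidean(*in_kg)        # dominated by the price difference
d_grams = euclidean(*in_grams)  # dominated by the weight difference

# After standardizing each column, the unit of measurement no longer matters.
d_kg_scaled = euclidean(*scale_rows(in_kg))
d_grams_scaled = euclidean(*scale_rows(in_grams))
print(d_kg, d_grams, d_kg_scaled, d_grams_scaled)
```

Before scaling, the same pair of items is roughly 70 units apart in one representation and roughly 1000 units apart in the other; after scaling the two distances agree.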
In this phase of our kernel, we will first scale the Time and Amount columns, so that they are
scaled like the other columns. We also need to create a sub-sample
of the dataframe in order to have an equal number of Fraud and Non-Fraud cases, helping
our algorithms better understand the patterns that determine whether a transaction is a fraud or not.
What is a sub-Sample?
In this scenario, our sub-sample will be a dataframe with a 50/50 ratio of fraud and non-fraud
transactions, meaning our sub-sample will have the same number of fraud and non-fraud
transactions.
Why do we create a sub-Sample?

At the beginning of this notebook we saw that the original dataframe was heavily imbalanced.
Using the original dataframe will cause the following issues:
• Overfitting: Our classification models will assume that in most cases there are no frauds.
What we want is for our model to be certain when a fraud occurs.
• Wrong Correlations: Although we don't know what the "V" features stand for, it will be
useful to understand how each of these features influences the result (Fraud or No Fraud). With
an imbalanced dataframe we are not able to see the true correlations between the class
and the features.
Figure 3.6: Dataset with scaled amount and time
3.3 Random Under-sampling

In this phase of the project we will implement "Random Under Sampling", which basically consists
of removing data in order to have a more balanced dataset and thus prevent our models from
overfitting.
Steps:
• The first thing we have to do is determine how imbalanced our class is (use
"value_counts()" on the class column to determine the count for each label).
• Once we determine how many instances are considered fraud transactions (Fraud = "1"),
we should bring the non-fraud transactions to the same number as the fraud transactions
(assuming we want a 50/50 ratio). This will be equivalent to 492 cases of fraud and 492 cases
of non-fraud transactions.
• After implementing this technique, we have a sub-sample of our dataframe with a 50/50
ratio with regard to our classes. The next step is to shuffle the
data to see if our models can maintain a certain accuracy every time we run this script.
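The steps above can be sketched with pandas as follows. The function name random_undersample, the seed and the made-up dataframe are assumptions for illustration; only the 'Class' column name comes from the dataset.

```python
import pandas as pd

def random_undersample(df: pd.DataFrame, label: str = "Class", seed: int = 42) -> pd.DataFrame:
    """Return a shuffled 50/50 sub-sample by down-sampling the majority class."""
    counts = df[label].value_counts()          # step 1: how imbalanced are we?
    minority = counts.idxmin()
    n_minority = int(counts.min())

    # step 2: keep every minority row and draw the same number of majority rows
    minority_rows = df[df[label] == minority]
    majority_rows = df[df[label] != minority].sample(n=n_minority, random_state=seed)

    # step 3: combine the equal-sized groups, then shuffle the rows
    balanced = pd.concat([minority_rows, majority_rows])
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

# Made-up frame with 5 frauds among 100 rows (assumption, for illustration only).
toy = pd.DataFrame({"V1": range(100), "Class": [1] * 5 + [0] * 95})
balanced = random_undersample(toy)
print(balanced["Class"].value_counts())
```

On the real data the same call would keep all 492 fraud rows and draw 492 non-fraud rows at random.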
3.4 Equally Distributing Data
In order to go further with our analysis and preprocessing, we need to have our dataframe perfectly
balanced.
Figure 3.7: Equally distributed class (blue = no frauds, red = frauds)
Figure 3.8: Histogram plot of each parameter
CHAPTER 4
Naive Bayes Algorithm
4.1 Introduction
Naive Bayes is a probabilistic machine learning algorithm that can be used in a wide variety of
classification tasks. Typical applications include filtering spam, classifying documents, sentiment
prediction etc. It is based on the works of Rev. Thomas Bayes (1702–61) and hence the name.
But why is it called ‘Naive’?
The name naive is used because it assumes that the features that go into the model are independent of
each other. That is, changing the value of one feature does not directly influence or change the
value of any of the other features used in the algorithm.
Alright. By the sounds of it, Naive Bayes does seem to be a simple yet powerful algorithm. But why
is it so popular?
That’s because there is a significant advantage with NB. Since it is a simple probabilistic model, the
algorithm can be coded up easily and the predictions made real quick. Real-time quick. Because of
this, it is easily scalable and is traditionally the algorithm of choice for real-world applications (apps)
that are required to respond to users’ requests instantaneously.
But before you go into Naive Bayes, you need to understand what ‘Conditional Probability’ is and
what the ‘Bayes Rule’ is.
And by the end of this tutorial, you will know:
• How exactly Naive Bayes Classifier works step-by-step
• What is Gaussian Naive Bayes, when is it used and how it works?
• How to code it up in R and Python
• How to improve your Naive Bayes models?
4.2 The Bayes Rule
The Bayes Rule is a way of going from P(X|Y), known from the training dataset, to find P(Y|X).
For two events A and B, the rule states:
P(A|B) = P(B|A) * P(A) / P(B)
To do this, we replace A and B in the above formula with the response Y and the feature X, respectively:
P(Y|X) = P(X|Y) * P(Y) / P(X)
For observations in test or scoring data, X would be known while Y is unknown. For each
row of the test dataset, you want to compute the probability of Y given that X has already happened.
What happens if Y has more than 2 categories? We compute the probability of each class of Y and
let the highest win.
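A short worked example may make the rule concrete. In the sketch below, the prior of 0.172% is the fraud share reported for this dataset, while the two likelihood values are hypothetical numbers chosen only to illustrate the arithmetic.

```python
def posterior(prior: float, likelihood: float, likelihood_given_not: float) -> float:
    """P(Y | X) from P(Y), P(X | Y) and P(X | not Y) via the Bayes Rule."""
    # P(X) is expanded over the two classes (law of total probability).
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence

p_fraud = 0.00172            # share of frauds reported for this dataset
p_flag_given_fraud = 0.90    # hypothetical: P(X | fraud)
p_flag_given_normal = 0.05   # hypothetical: P(X | non-fraud)

p_fraud_given_flag = posterior(p_fraud, p_flag_given_fraud, p_flag_given_normal)
print(round(p_fraud_given_flag, 4))
```

Even with a feature that is far more likely under fraud than under normal use, the posterior stays at about 3% because frauds are so rare. This is the class imbalance showing up directly in the formula.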
Naive Bayes Classifier (Generative Learning Model) :

It is a classification technique based on Bayes’ Theorem with an assumption of independence
among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature. Even if these features
depend on each other or upon the existence of the other features, all of these properties
independently contribute to the probability. A Naive Bayes model is easy to build and particularly
useful for very large data sets. Despite its simplicity, Naive Bayes is known to outperform even
highly sophisticated classification methods in some cases.
4.3 Naive Predictor Performance

The naive predictor will be a model which predicts all the transactions as Non-Fraudulent. The
following will be the definitions for the model:
• "Fraud(Class = 1)" is a negative class.

• "Non-Fraud(Class = 0)" is a Positive class.
The purpose of generating a naive predictor is simply to show what a base model without any
intelligence would look like. In the real world, ideally your base model would be either the results
of a previous model or could be based on a research paper upon which you are looking to improve.
When there is no benchmark model set, getting a result better than random choice is a place you
could start from.
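To see why such a baseline is deceptive on imbalanced data, its scores can be worked out by hand. The sketch below assumes the naive predictor is evaluated on the full dataset, using the class counts from Section 3.1; the evaluation behind Figure 4.1 may have used a different sample.

```python
# Class counts reported for the full dataset in Section 3.1.
n_total = 284_807
n_fraud = 492
n_non_fraud = n_total - n_fraud

# The naive predictor labels every transaction as Non-Fraud (Class = 0).
correct = n_non_fraud            # every genuine transaction is labelled correctly
frauds_detected = 0              # no transaction is ever labelled as fraud

accuracy = correct / n_total
fraud_recall = frauds_detected / n_fraud

print(f"accuracy = {accuracy:.4f}, fraud recall = {fraud_recall:.1f}")
```

The accuracy is above 99.8% even though not a single fraud is caught, which is why accuracy alone is a poor benchmark here.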

Figure 4.1: Output of Naive Bayes Predictor
CHAPTER 5
CLASSIFIERS
5.1 Introduction
In this section I'll be using 4 different classifiers to classify the transactions as Fraudulent or
Non-Fraudulent in the randomly undersampled dataset. My aim is to compare the performance of the
Naive predictor (benchmark model) with that of the classifiers that I choose. The following are the
classifiers that I'll be using:
• Logistic Regression
• Support Vector Classifier
• Decision Tree
• K-Nearest Classifier
The randomly undersampled dataset used for the analysis is undersampled before cross validation
and hence the models are prone to overfitting. To get the best model, undersampling should be done
during cross validation, i.e. within each training fold.
5.1.1 Logistic Regression (Predictive Learning Model) :
It is a statistical method for analysing a data set in which there are one or more independent
variables that determine an outcome. The outcome is measured with a dichotomous variable (in
which there are only two possible outcomes). The goal of logistic regression is to find the best
fitting model to describe the relationship between the dichotomous characteristic of interest
(dependent variable = response or outcome variable) and a set of independent (predictor or
explanatory) variables. An advantage over other binary classification methods such as nearest neighbor is that it
also explains quantitatively the factors that lead to the classification.
Representation Used for Logistic Regression:
Logistic regression uses an equation as the representation, very much like linear regression.
Figure 5.1: Diagram Showing Logistic Regression
Input values (x) are combined linearly using weights or coefficient values (referred to by the Greek
letter Beta) to predict an output value (y). A key difference from linear regression is that the
output value being modeled is a binary value (0 or 1) rather than a numeric value.
Below is an example logistic regression equation:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
Where y is the predicted output, b0 is the bias or intercept term and b1 is the coefficient for the
single input value (x). Each column in your input data has an associated b coefficient (a constant
real value) that must be learned from your training data.
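The equation above translates directly into code. In this minimal sketch the coefficient values and the 0.5 decision threshold are hypothetical; in practice b0 and b1 are learned from the training data.

```python
import math

def predict_probability(x: float, b0: float, b1: float) -> float:
    """Logistic regression output y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def predict_class(x: float, b0: float, b1: float, threshold: float = 0.5) -> int:
    """Turn the probability into a binary label (1 if at or above the threshold)."""
    return 1 if predict_probability(x, b0, b1) >= threshold else 0

# Hypothetical coefficients; real values are learned from the training data.
b0, b1 = -1.0, 2.0
for x in (-1.0, 0.5, 3.0):
    print(x, predict_probability(x, b0, b1), predict_class(x, b0, b1))
```

The function always returns a value between 0 and 1, which is read as the probability of the positive class and then thresholded to obtain the binary output.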
5.1.2 Support Vector Classifier
Introduction to SVMs:
In machine learning, support vector machines (SVMs, also support vector networks) are supervised
learning models with associated learning algorithms that analyze data used for classification and
regression analysis.
An SVM model is a representation of the examples as points in space, mapped so that the examples
of the separate categories are divided by a clear gap that is as wide as possible.
In addition to performing linear classification, SVMs can efficiently perform a non-linear
classification, implicitly mapping their inputs into high-dimensional feature spaces.
What is Support Vector Machine?
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating
hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs
an optimal hyperplane which categorizes new examples.
Figure 5.2: Diagram Showing Support Vector Machine
What does SVM do?

Given a set of training examples, each marked as belonging to one or the other of two categories, an
SVM training algorithm builds a model that assigns new examples to one category or the other,
making it a non-probabilistic binary linear classifier.
With this basic understanding in place, the classifier can be applied in practice using machine
learning tools such as scikit-learn for Python.
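A minimal sketch of a linear support vector classifier with scikit-learn is shown below. The six 2-D points are made up for illustration and the linear kernel is an assumption; this is not the configuration used to produce the results in this report.

```python
from sklearn.svm import SVC

# Made-up, linearly separable 2-D points (assumption, for illustration only).
X = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [4.0, 4.0], [5.0, 5.0], [4.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)

# The support vectors are the training points that define the separating gap.
print(clf.support_vectors_)

predictions = clf.predict([[0.5, 0.5], [4.5, 4.5]])
print(predictions)
```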
5.1.3 Decision Tree
Figure 5.3: Figure Showing Decision Tree
Decision tree builds classification or regression models in the form of a tree structure. It breaks
down a data set into smaller and smaller subsets while at the same time an associated decision tree
is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision
node has two or more branches and a leaf node represents a classification or decision. The topmost
decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can
handle both categorical and numerical data.
A tree can be “learned” by splitting the source set into subsets based on an attribute value test. This
process is repeated on each derived subset in a recursive manner called recursive partitioning. The
recursion is completed when the subset at a node all has the same value of the target variable, or
when splitting no longer adds value to the predictions. The construction of decision tree classifier
does not require any domain knowledge or parameter setting, and therefore is appropriate for
exploratory knowledge discovery. Decision trees can handle high dimensional data. In general
decision tree classifier has good accuracy. Decision tree induction is a typical inductive approach to
learn knowledge on classification.
Decision trees classify instances by sorting them down the tree from the root to some leaf node,
which provides the classification of the instance. An instance is classified by starting at the root
node of the tree, testing the attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute, as shown in the above figure. This process is then
repeated for the subtree rooted at the new node.
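The traversal described above can be sketched in a few lines. The tree below is written by hand: the feature name scaled_amount, the thresholds and the leaf labels are hypothetical and were not learned from the project data.

```python
# A hand-written tree (hypothetical thresholds, not learned from the project data).
# Decision nodes test one attribute; leaf nodes carry the class label.
tree = {
    "feature": "scaled_amount", "threshold": 1.5,
    "left": {"label": "Non-Fraud"},
    "right": {
        "feature": "V14", "threshold": -2.0,
        "left": {"label": "Fraud"},
        "right": {"label": "Non-Fraud"},
    },
}

def classify(node: dict, instance: dict) -> str:
    """Sort an instance down the tree from the root until a leaf is reached."""
    while "label" not in node:
        branch = "left" if instance[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(classify(tree, {"scaled_amount": 3.0, "V14": -5.1}))  # Fraud
print(classify(tree, {"scaled_amount": 0.2, "V14": 0.4}))   # Non-Fraud
```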
5.1.4 K Nearest Classifier
The k-nearest-neighbors algorithm is a classification algorithm, and it is supervised: it takes a bunch
of labelled points and uses them to learn how to label other points. To label a new point, it looks at
the labelled points closest to that new point (those are its nearest neighbors), and has those
neighbors vote, so whichever label most of the neighbors have is the label for the new point (the
“k” is the number of neighbors it checks).
Figure 5.4: Diagram Showing K nearest Classifier
K-Nearest Neighbors is one of the most basic yet essential classification algorithms in Machine
Learning. It belongs to the supervised learning domain and finds wide application in pattern
recognition, data mining and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make
any underlying assumptions about the distribution of data (as opposed to other algorithms such
as GMM, which assume a Gaussian distribution of the given data).
We are given some prior data (also called training data), which classifies coordinates into groups
identified by an attribute.
An object is classified by a majority vote of its neighbors, with the object being assigned to the
class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1,
then the object is simply assigned to the class of that single nearest neighbor.
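The majority vote can be written out with the standard library alone. In this sketch the training points and labels are made up, and Euclidean distance is assumed as the distance measure.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-D training data (assumption, for illustration only).
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6), (3.0, 3.0), (3.2, 2.8), (2.9, 3.4)]
labels = ["Non-Fraud", "Non-Fraud", "Non-Fraud", "Fraud", "Fraud", "Fraud"]

print(knn_predict(points, labels, (0.3, 0.3), k=3))  # Non-Fraud
print(knn_predict(points, labels, (3.1, 3.1), k=1))  # Fraud
```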
5.2 Cross Validation
Learning the parameters of a prediction function and testing it on the same data is a methodological
mistake: a model that would just repeat the labels of the samples that it has just seen would have a
perfect score but would fail to predict anything useful on yet-unseen data. This situation is called
overfitting. To avoid it, it is common practice when performing a (supervised) machine learning
experiment to hold out part of the available data as a test set X_test, y_test. Note that the word
“experiment” is not intended to denote academic use only, because even in commercial
settings machine learning usually starts out experimentally. In a typical cross validation workflow
for model training, the best parameters can be determined by grid search techniques.
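A cross-validated comparison of the four classifiers could look roughly like the sketch below. The synthetic data from make_classification, the 5 folds and the default hyperparameters are assumptions for illustration; the scores it prints are not the results of this report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, roughly balanced stand-in for the undersampled data (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Classifier": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Classifier": KNeighborsClassifier(),
}

cv_scores = {}
for name, clf in classifiers.items():
    # 5-fold cross-validation: each fold is held out once while the rest trains.
    scores = cross_val_score(clf, X, y, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {cv_scores[name]:.3f}")
```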
Figure 5.5: Cross Validation scores of classifiers
5.3 Learning Curve

A learning curve refers to a plot of the prediction accuracy/error vs. the training set size (i.e. how
much better the model gets at predicting the target as you increase the number of instances used to
train it). Usually both the training and test/validation performance are plotted together so we can
diagnose the bias-variance tradeoff (i.e. determine if we benefit from adding more training data,
and assess the model complexity by controlling regularization or number of features).
A related but separate concept is the learning rate: the amount
that the weights are updated during training is referred to as the step size or the “learning rate.”
Specifically, the learning rate is a configurable hyperparameter used in the training of neural
networks that has a small positive value, often in the range between 0.0 and 1.0.
Figure 5.6: Diagram Showing Learning Rate of following cases
The above plots show the following cases (these graphs are better understood when seen upside
down, as we are plotting the error function against the number of instances):
• Plot 1 : This is a case of Underfitting (High Bias) as it has very high training and
cross-validation error, meaning that the model performs badly on both the training and cross-validation
sets.
• Plot 2 : This is the Ideal case as it has low training and cross-validation errors and they
seem to converge as the number of instances increases. We strive to find such a model for our
predictions.
• Plot 3 : This is a case of Overfitting (High Variance) as it has very low training error and
pretty high cross-validation error, meaning that the model has learned each and every detail
of the training data so closely that if anything different is fed to it, it won't be able to perform well.
The training and cross-validation error lines also don't seem to converge at any point.
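The numbers behind such plots can be computed with scikit-learn's learning_curve helper. The sketch below uses synthetic data and logistic regression as assumptions for illustration; it prints the errors instead of plotting them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data (assumption); the project data is not reproduced here.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

train_sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5,
)

# Convert accuracy to error so the numbers match the description of the plots.
train_error = 1.0 - train_scores.mean(axis=1)
cv_error = 1.0 - cv_scores.mean(axis=1)
for n, tr_err, cv_err in zip(train_sizes, train_error, cv_error):
    print(f"{int(n):4d} samples: training error {tr_err:.3f}, cross-validation error {cv_err:.3f}")
```

A large, persistent gap between the two error columns points to high variance, while two high values that sit close together point to high bias.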
Learning rate is a hyper-parameter that controls how much we are adjusting the weights of our
network with respect to the loss gradient. The lower the value, the slower we travel along the
downward slope.
Figure 5.7: Training set and Cross Validation score of classifiers

Logistic Regression and Support Vector Classifier show the best score in training and cross-
validation set and are pretty close to the ideal case.
Figure 5.8: Classification Report of all classifiers

Conclusion
In this project, I tried to find a model that would help me predict whether a transaction is
fraudulent or non-fraudulent. To achieve this, I downloaded a dataset from Kaggle. I then
preprocessed the dataset and performed exploratory data analysis on it to study all of its
features and to check which of them influence whether a transaction is fraudulent or non-fraudulent.
Initially, I defined a Naive predictor to act as the benchmark model for all the other models
that I put to the test. After that, I took four different classifiers, tested them on the randomly
undersampled dataset and evaluated several metrics on it: accuracy, precision, recall and F1-score. It
can be concluded that the Naive predictor performed best on the undersampled dataset and will be
the best model for credit card fraud detection.
Limitations
On our undersampled data, our model fails to classify a large number of non-fraud
transactions correctly and instead misclassifies them as fraud cases. Imagine that
people making regular purchases had their cards blocked because our model
classified their transactions as fraudulent: this would be a huge disadvantage
for the financial institution, since the number of customer complaints and the
level of customer dissatisfaction would increase. The next step of this analysis
will be to perform outlier removal on our oversampled dataset and see whether our
accuracy on the test set improves.
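The cost of such misclassifications can be quantified with the false positive rate and the precision. The sketch below uses small hand-made label lists (hypothetical values chosen for illustration, not results from this project), where 1 denotes fraud and 0 denotes non-fraud.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = fraud, 0 = non-fraud."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1   # fraud correctly flagged
        elif predicted == 1 and actual == 0:
            fp += 1   # regular purchase wrongly flagged (card blocked)
        elif predicted == 0 and actual == 0:
            tn += 1   # regular purchase correctly passed
        else:
            fn += 1   # fraud missed
    return tp, fp, tn, fn


# Hypothetical test set: 95 non-fraud and 5 fraud transactions.
y_true = [0] * 95 + [1] * 5
# Hypothetical predictions: 10 regular purchases flagged, 1 fraud missed.
y_pred = [0] * 85 + [1] * 10 + [1] * 4 + [0] * 1

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)

print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision = {precision:.3f}, recall = {recall:.3f}, "
      f"false positive rate = {false_positive_rate:.3f}")
```

In this toy case the recall is high, yet most of the flagged transactions are regular purchases, which is exactly the situation described above: many customers would have their cards blocked even though the model catches most frauds.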
References
[1] https://www.kaggle.com/mlg-ulb/creditcardfraud
[2] https://www.datasciencelearner.com/design-best-machine-learning-datasets/
[4] https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10
[5] https://towardsdatascience.com/naive-bayes-in-machine-learning-f49cc8f831b4
[7] Machine Learning - Over- & Undersampling - Python/Scikit/Scikit-Imblearn by Coding-Maniac
[8] https://matplotlib.org/users/style_sheets.html
