
ETCS-454

Roll No: 04414802716

Semester: 8th

Group: 8-C-3

Sector – 22, Rohini, New Delhi – 110085

MACHINE LEARNING LAB

PRACTICAL RECORD

Branch : CSE

PRACTICAL DETAILS

1. Introduction to Machine Learning Lab with Python (3.x).

2. Understanding of machine learning algorithms.

3. Understand clustering approaches and implement the K-means algorithm using Sci-Kit Learn.

4. Study of datasets and understanding attribute evaluation in regard to problem description.

5. Working of major classifiers: a) Naïve Bayes b) Decision Tree c) CART d) ARIMA e) Linear and logistic regression f) Support vector machine g) KNN.

6. Implement supervised learning (KNN classification). Estimate the accuracy using 5-fold cross-validation.

7. Introduction to R. Be aware of the basics of machine learning methods in R.

8. Develop a machine learning method using neural networks in Python to predict stock prices based on past price variation.

MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY

VISION

To nurture young minds in a learning environment of high academic value and imbibe spiritual and ethical values with technological and management competence.

MISSION

The Institute shall endeavour to incorporate the following basic missions in the teaching methodology:

❖ Practical exercises in all Engineering and Management disciplines shall be carried out by hardware equipment as well as the related software, enabling a deeper understanding of basic concepts and encouraging an inquisitive nature.

❖ Life-Long Learning: The Institute strives to match technological advancements and encourages students to keep updating their knowledge, enhancing their skills and inculcating a habit of continuous learning.

❖ The Institute aims to develop the technological and management skills of students so that they are intellectually capable and competent professionals with industrial aptitude to face the challenges of globalization.

❖ The aim is to create a synergy of different fields of study with different attributes by encouraging analytical thinking.

❖ Digitization of Learning Processes: The Institute provides seamless opportunities for innovative learning in all Engineering and Management disciplines through digitization of learning processes, using analysis, synthesis, simulation, graphics, tutorials and related tools to create a platform for a multi-disciplinary approach.

❖ Entrepreneurship: The Institute strives to develop potential Engineers and Managers by enhancing their skills and research capabilities so that they emerge as successful entrepreneurs and responsible citizens.


EXPERIMENT 1

An Introduction to Python

Python is a popular object-oriented programming language with the capabilities of a high-level programming language. Its easy-to-learn syntax and portability make it popular these days. The following facts give us an introduction to Python −

● Python was developed by Guido van Rossum at Stichting Mathematisch Centrum in the Netherlands.

● It was written as the successor of a programming language named ‘ABC’.

● Its first version was released in 1991.

● The name Python was picked by Guido van Rossum from a TV show named Monty Python’s Flying Circus.

● It is an open-source programming language, which means that we can freely download it and use it to develop programs. It can be downloaded from www.python.org.

● Python has features of both Java and C: on one hand it has the elegance of ‘C’ code, and on the other it has classes and objects, like Java, for object-oriented programming.

● It is an interpreted language, which means the source code of a Python program is first converted into bytecode and then executed by the Python virtual machine.
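This compile-then-interpret step can be observed directly with the standard-library `dis` module; a minimal sketch:

```python
import dis

def add(a, b):
    return a + b

# CPython has already compiled the function body to bytecode; dis
# prints the instructions the Python virtual machine will execute.
dis.dis(add)
```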

Every programming language has some strengths as well as weaknesses, and so does Python.

Strengths

● According to studies and surveys, Python is the fifth most important language as well as the most popular language for machine learning and data science. It is because of the following strengths that Python has −

● Easy to learn and understand − The syntax of Python is simple; hence it is relatively easy, even for beginners, to learn and understand the language.

● Multi-purpose language − Python is a multi-purpose programming language because it supports structured programming and object-oriented programming as well as functional programming.

● Huge number of modules − Python has a huge number of modules covering every aspect of programming. These modules are easily available for use, making Python an extensible language.

● Support of an open source community − Being an open source programming language, Python is supported by a very large developer community. Because of this, bugs are easily fixed by the Python community. This characteristic makes Python very robust and adaptive.

● Scalability − Python is a scalable programming language because it provides a better structure for supporting large programs than shell scripts do.

Weakness

● Although Python is a popular and powerful programming language, it has its own weakness of slow execution speed.

● The execution speed of Python is slow compared to compiled languages because Python is an interpreted language. This can be a major area of improvement for the Python community.

Installing Python

For working in Python, we must first install it. You can perform the installation of Python in either of the following two ways −

● Installing Python individually

● Using a pre-packaged Python distribution: Anaconda

If you want to install Python on your computer, then you need to download only the binary code applicable for your platform. Python distributions are available for Windows, Linux and Mac platforms.

The following is a quick overview of installing Python on the above-mentioned platforms −

On Unix and Linux platforms

With the help of the following steps, we can install Python on Unix and Linux platforms −

● First, go to www.python.org/downloads/.

● Next, click on the link to download the zipped source code available for Unix/Linux.

● Now, download and extract the files.

● Next, we can edit the Modules/Setup file if we want to customize some options.

○ Run the ./configure script

○ make

○ make install

Anaconda is a packaged compilation of Python which has all the libraries widely used in data science. We can follow these steps to set up a Python environment using Anaconda −

● Step 1 − First, we need to download the required installation package from the Anaconda distribution. The link for the same is www.anaconda.com/distribution/. You can choose from Windows, Mac and Linux OS as per your requirement.

● Step 2 − Next, select the Python version you want to install on your machine. The latest Python version is 3.7. There you will get options for both the 64-bit and 32-bit graphical installers.

● Step 3 − After selecting the OS and Python version, the Anaconda installer will be downloaded on your computer. Now, double-click the file and the installer will install the Anaconda package.

● Step 4 − To check whether it is installed or not, open a command prompt and type Python as follows
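A quick check from within the interpreter itself confirms which version Anaconda put on the PATH; a minimal sketch:

```python
import sys

# If this runs at all, the installation succeeded; the tuple shows the
# major, minor and micro version of the interpreter that was found.
print(sys.version_info[:3])
```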

Why Python for Machine Learning?

Python is the fifth most important language as well as the most popular language for machine learning and data science. The following are the features of Python that make it the preferred choice of language for data science −

● Extensive set of packages

Python has an extensive and powerful set of packages which are ready to be used in various domains. It has packages like numpy, scipy, pandas and scikit-learn, which are required for machine learning and data science.

● Easy prototyping

Another important feature of Python that makes it the choice of language for data science is easy and fast prototyping. This feature is useful for developing new algorithms.

● Collaboration feature

The field of data science basically needs good collaboration, and Python provides many useful tools that make this extremely easy.

● One language for many domains

A typical data science project includes various domains like data extraction, data manipulation, data analysis, feature extraction, modelling, evaluation, deployment and updating the solution. As Python is a multi-purpose language, it allows the data scientist to address all these domains from a common platform.
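As a small illustration of the package ecosystem, a vectorized computation with numpy (assumed installed) replaces an explicit loop:

```python
import numpy as np

# A vectorized mean over ten numbers; packages like numpy push such
# loops into compiled code, which also offsets Python's slow execution speed.
x = np.arange(10)
print(x.mean())  # → 4.5
```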

In this section, let us discuss some core data science libraries that form the components of the Python machine learning ecosystem. These useful components make Python an important language for data science. Though there are many such components, let us discuss some of the important components of the Python ecosystem here −

● Jupyter Notebook − Jupyter notebooks basically provide an interactive computational environment for developing Python-based data science applications.

EXPERIMENT 2

Aim : Understanding of Machine learning algorithms.

This taxonomy, or way of organizing machine learning algorithms, is useful because it forces us to think about the roles of the input data and the model preparation process, and to select the approach that is most appropriate for the problem in order to get the best result.

1. Supervised Learning

Input data is called training data and has a known label or result, such as spam/not-spam or a stock price at a time.

A model is prepared through a training process in which it is required to make predictions and is

corrected when those predictions are wrong. The training process continues until the model

achieves a desired level of accuracy on the training data.

Example algorithms include: Logistic Regression and the Back Propagation Neural Network.
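A minimal sketch of this train-correct-predict loop, using scikit-learn's LogisticRegression on made-up labelled data (the dataset here is an illustrative assumption, not from the text):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training data with known labels, as supervised learning requires.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Fitting repeatedly corrects the model against the known labels.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.5], [2.5]]))
```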

2. Unsupervised Learning

Input data is not labelled and does not have a known result.

A model is prepared by deducing structures present in the input data. This may be to extract

general rules. It may be through a mathematical process to systematically reduce redundancy, or

it may be to organize data by similarity.

Example problems are clustering, dimensionality reduction and association rule learning.

3. Semi-Supervised Learning

Input data is a mixture of labelled and unlabelled examples. There is a desired prediction problem, but the model must learn the structures to organize the data as well as make predictions.

Example algorithms are extensions to other flexible methods that make assumptions about how

to model the unlabeled data.

Regression Algorithms

Regression is concerned with modelling the relationship between variables, and the model is iteratively refined using a measure of error in the predictions made by the model.

Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning. This may be confusing because we can use "regression" to refer both to the class of problem and to the class of algorithms. Really, regression is a process.

● Linear Regression

● Logistic Regression

● Stepwise Regression

● Multivariate Adaptive Regression Splines (MARS)

● Locally Estimated Scatterplot Smoothing (LOESS)

Instance-based Algorithms

Such methods typically build up a database of example data and compare new data to the

database using a similarity measure in order to find the best match and make a prediction. For

this reason, instance-based methods are also called winner-take-all methods and memory-based

learning. Focus is put on the representation of the stored instances and similarity measures used

between instances.

● Learning Vector Quantization (LVQ)

● Self-Organizing Map (SOM)

● Locally Weighted Learning (LWL)

● Support Vector Machines (SVM)

Regularization Algorithms

An extension made to another method (typically regression methods) that penalizes models

based on their complexity, favoring simpler models that are also better at generalizing.

The most popular regularization algorithms are:

● Ridge Regression

● Least Absolute Shrinkage and Selection Operator (LASSO)

● Elastic Net

● Least-Angle Regression (LARS)

Decision Tree Algorithms

Decision tree methods construct a model of decisions based on actual values of attributes in the data.

Decisions fork in tree structures until a prediction decision is made for a given record. Decision

trees are trained on data for classification and regression problems. Decision trees are often fast

and accurate and a big favorite in machine learning.

● Iterative Dichotomiser 3 (ID3)

● C4.5 and C5.0 (different versions of a powerful approach)

● Chi-squared Automatic Interaction Detection (CHAID)

● Decision Stump

● M5

● Conditional Decision Trees

Bayesian Algorithms

Bayesian methods are those that explicitly apply Bayes’ Theorem for problems such as

classification and regression.

● Naive Bayes

● Gaussian Naive Bayes

● Multinomial Naive Bayes

● Averaged One-Dependence Estimators (AODE)

● Bayesian Belief Network (BBN)

● Bayesian Network (BN)

Clustering Algorithms

Clustering, like regression, describes the class of problems and the class of methods.

Clustering methods are typically organized by the modeling approaches such as centroid-based

and hierarchical. All methods are concerned with using the inherent structures in the data to best

organize the data into groups of maximum commonality.

● k-Means

● k-Medians

● Expectation Maximisation (EM)

● Hierarchical Clustering

Association Rule Learning Algorithms

Association rule learning methods extract rules that best explain observed relationships between variables in data.

These rules can discover important and commercially useful associations in large

multidimensional datasets that can be exploited by an organization.

The most popular association rule learning algorithms are:

● Apriori algorithm

● Eclat algorithm

Artificial Neural Network Algorithms

Artificial neural networks are models that are inspired by the structure and/or function of biological neural networks.

They are a class of pattern matching that are commonly used for regression and classification

problems but are really an enormous subfield composed of hundreds of algorithms and variations

for all manner of problem types.

The most popular artificial neural network algorithms are:

● Perceptron

● Multilayer Perceptrons (MLP)

● Back-Propagation

● Stochastic Gradient Descent

● Hopfield Network

● Radial Basis Function Network (RBFN)

Deep Learning Algorithms

Deep learning methods are a modern update to artificial neural networks that exploit abundant cheap computation.

They are concerned with building much larger and more complex neural networks and, as noted above, many methods are concerned with very large datasets of labelled analog data, such as image, text, audio and video.

● Recurrent Neural Networks (RNNs)

● Long Short-Term Memory Networks (LSTMs)

● Stacked Auto-Encoders

● Deep Boltzmann Machine (DBM)

● Deep Belief Networks (DBN)

Dimensionality Reduction Algorithms

Like clustering methods, dimensionality reduction methods seek and exploit the inherent structure in the data, but in this case in an unsupervised manner, in order to summarize or describe the data using less information.

This can be useful to visualize high-dimensional data or to simplify data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression.

● Principal Component Regression (PCR)

● Partial Least Squares Regression (PLSR)

● Sammon Mapping

● Multidimensional Scaling (MDS)

● Projection Pursuit

● Linear Discriminant Analysis (LDA)

● Mixture Discriminant Analysis (MDA)

● Quadratic Discriminant Analysis (QDA)

● Flexible Discriminant Analysis (FDA)

Ensemble Algorithms

Ensemble methods are models composed of multiple weaker models that are independently

trained and whose predictions are combined in some way to make the overall prediction.

Much effort is put into what types of weak learners to combine and the ways in which to

combine them. This is a very powerful class of techniques and as such is very popular.

● Boosting

● Bootstrapped Aggregation (Bagging)

● AdaBoost

● Weighted Average (Blending)

● Stacked Generalization (Stacking)

● Gradient Boosting Machines (GBM)

● Gradient Boosted Regression Trees (GBRT)

● Random Forest
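As a sketch of the ensemble idea, a random forest (scikit-learn's RandomForestClassifier, assumed installed) combines many randomized decision trees by voting:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# 50 trees, each trained on a bootstrap sample of the data;
# predictions are combined by majority vote across the ensemble.
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.score(X, y))  # training accuracy
```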

EXPERIMENT 3

Aim : Understand clustering approaches and implement the K-means algorithm using Sci-Kit Learn.

Clustering

It is basically a type of unsupervised learning method . An unsupervised learning method is a

method in which we draw references from datasets consisting of input data without labelled

responses. Generally, it is used as a process to find meaningful structure, explanatory underlying

processes, generative features, and groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to the data points in other groups. It is basically a grouping of objects on the basis of similarity and dissimilarity between them.

For example, the data points in the graph below, clustered together, can be classified into one single group. We can distinguish the clusters, and we can identify that there are 3 clusters in the picture below.

K-Means Algorithm

The KMeans algorithm clusters data by trying to separate samples in n groups of equal variance,

minimizing a criterion known as the inertia or within-cluster sum-of-squares (see below). This

algorithm requires the number of clusters to be specified. It scales well to a large number of

samples and has been used across a large range of application areas in many different fields.

The k-means algorithm divides a set of samples into disjoint clusters, each described by the mean μ of the samples in the cluster. The means are commonly called the cluster “centroids”. The K-means algorithm aims to choose centroids that minimise the inertia, or within-cluster sum-of-squares criterion.

K-means is often referred to as Lloyd’s algorithm. In basic terms, the algorithm has three steps.

The first step chooses the initial centroids, with the most basic method being to choose samples

from the dataset . After initialization, K-means consists of looping between the two other steps.

The first step assigns each sample to its nearest centroid. The second step creates new centroids

by taking the mean value of all of the samples assigned to each previous centroid. The difference

between the old and the new centroids are computed and the algorithm repeats these last two

steps until this value is less than a threshold. In other words, it repeats until the centroids do not

move significantly.

K-means is equivalent to the expectation-maximization algorithm with a small, all-equal,

diagonal covariance matrix.

The algorithm can also be understood through the concept of Voronoi diagrams. First the

Voronoi diagram of the points is calculated using the current centroids. Each segment in the

Voronoi diagram becomes a separate cluster. Secondly, the centroids are updated to the mean of

each segment. The algorithm then repeats this until a stopping criterion is fulfilled. Usually, the

algorithm stops when the relative decrease in the objective function between iterations is less

than the given tolerance value. This is not the case in this implementation: iteration stops when

centroids move less than the tolerance.

Given enough time, K-means will always converge, however this may be to a local minimum.

This is highly dependent on the initialization of the centroids. As a result, the computation is

often done several times, with different initializations of the centroids.
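scikit-learn's own KMeans guards against bad initializations in exactly this way: its `n_init` parameter reruns the algorithm from several random starts and keeps the run with the lowest inertia. A small sketch on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated pairs of points; n_init=10 restarts the algorithm
# from 10 random initializations and keeps the best (lowest-inertia) run.
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(sorted(km.cluster_centers_.tolist()))
```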

import numpy as np
from numpy.linalg import norm

class Kmeans:
    '''Implementing the K-means algorithm.'''

    def __init__(self, n_clusters, max_iter=100, random_state=123):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.random_state = random_state

    def initialize_centroids(self, X):
        # Choose n_clusters random samples as the initial centroids
        rng = np.random.RandomState(self.random_state)
        random_idx = rng.permutation(X.shape[0])
        centroids = X[random_idx[:self.n_clusters]]
        return centroids

    def compute_centroids(self, X, labels):
        # New centroid = mean of the samples assigned to each cluster
        centroids = np.zeros((self.n_clusters, X.shape[1]))
        for k in range(self.n_clusters):
            centroids[k, :] = np.mean(X[labels == k, :], axis=0)
        return centroids

    def compute_distance(self, X, centroids):
        # Squared Euclidean distance from every sample to every centroid
        distance = np.zeros((X.shape[0], self.n_clusters))
        for k in range(self.n_clusters):
            row_norm = norm(X - centroids[k, :], axis=1)
            distance[:, k] = np.square(row_norm)
        return distance

    def find_closest_cluster(self, distance):
        # Assign each sample to its nearest centroid
        return np.argmin(distance, axis=1)

    def compute_sse(self, X, labels, centroids):
        # Within-cluster sum of squares (the inertia)
        distance = np.zeros(X.shape[0])
        for k in range(self.n_clusters):
            distance[labels == k] = norm(X[labels == k] - centroids[k], axis=1)
        return np.sum(np.square(distance))

    def fit(self, X):
        self.centroids = self.initialize_centroids(X)
        for i in range(self.max_iter):
            old_centroids = self.centroids
            distance = self.compute_distance(X, old_centroids)
            self.labels = self.find_closest_cluster(distance)
            self.centroids = self.compute_centroids(X, self.labels)
            # Stop when the centroids no longer move
            if np.all(old_centroids == self.centroids):
                break
        self.error = self.compute_sse(X, self.labels, self.centroids)

    def predict(self, X):
        distance = self.compute_distance(X, self.centroids)
        return self.find_closest_cluster(distance)

DATA:

The dataset has 272 observations and 2 features. The data covers the waiting time between

eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National

Park, Wyoming, USA. We will try to find K subgroups within the data points and group them

accordingly. Below is the description of the features:

● eruptions (float): Eruption time in minutes.

● waiting (int): Waiting time for the next eruption.

# Modules

import matplotlib.pyplot as plt

import pandas as pd

import seaborn as sns

from sklearn.cluster import KMeans

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import silhouette_samples, silhouette_score

df = pd.read_csv('../data/old_faithful.csv')

# Standardize the data

X_std = StandardScaler().fit_transform(df)

km = Kmeans(n_clusters=2, max_iter=100)

km.fit(X_std)

centroids = km.centroids

fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(X_std[km.labels == 0, 0], X_std[km.labels == 0, 1], c='green', label='cluster 1')
plt.scatter(X_std[km.labels == 1, 0], X_std[km.labels == 1, 1], c='blue', label='cluster 2')
plt.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=300, c='r', label='centroid')
plt.legend()
plt.xlim([-2, 2])
plt.ylim([-2, 2])
plt.xlabel('Eruption time in mins')
plt.ylabel('Waiting time to next eruption')
plt.title('Visualization of clustered data', fontweight='bold')
ax.set_aspect('equal');

# Run the Kmeans algorithm and get the index of data points clusters

sse = []

list_k = list(range(1, 10))

for k in list_k:
    km = KMeans(n_clusters=k)
    km.fit(X_std)
    sse.append(km.inertia_)

plt.figure(figsize=(6, 6))

plt.plot(list_k, sse, '-o')

plt.xlabel(r'Number of clusters *k*')

plt.ylabel('Sum of squared distance');
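The silhouette functions imported above give a complementary check on the choice of k; a self-contained sketch on synthetic blobs (data generated here, not the geyser set):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two tight, well-separated Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

# Scores near 1 mean samples sit far from neighbouring clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(silhouette_score(X, km.labels_))
```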

EXPERIMENT 4

Aim : Study of datasets and understanding attribute evaluation in regard to problem description.

There are a lot of data files that store the attribute details of a problem description, and they store data in one of the following formats:

1. CSV

2. ARFF

3. Excel - xls

4. Sqlite

5. XML

For our use case, we’ll consider CSVs and understand attribute evaluation.

For that we’ll take the example of the white variant of the Wine Quality data set, which is available on the UCI Machine Learning Repository, and try to extract as many insights from the data set as possible using EDA.

To start with, import the necessary libraries (for this example pandas, numpy, matplotlib and seaborn) and load the data set.

● To take a closer look at the data we use the “.head()” function of the pandas library, which returns the first five observations of the data set. Similarly, “.tail()” returns the last five observations of the data set.

● The dataset comprises 4898 observations and 12 characteristics.

● Out of these, one is the dependent variable and the remaining 11 are independent variables (physico-chemical characteristics).

● No variable column has null/missing values.

● Here we see that the mean value is less than the median value of each column, which is represented by 50% (50th percentile) in the index column.

● There is a notably large difference between the 75th percentile and max values of the predictors “residual sugar”, “free sulfur dioxide” and “total sulfur dioxide”.

● Thus observations 1 and 2 suggest that there are extreme values (outliers) in our data set.

● The “quality” score scale ranges from 1 to 10, where 1 is poor and 10 is the best.

● Quality ratings 1, 2 and 10 are not given by any observation; the only scores obtained are between 3 and 9.

● “quality” has most values concentrated in the categories 5, 6 and 7.

● Only a few observations were made for the categories 3 and 9.
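The pandas calls mentioned above can be sketched on a toy frame (a stand-in for the real UCI CSV, which would be loaded with pd.read_csv instead):

```python
import pandas as pd

# Toy stand-in with one exaggerated outlier in 'residual sugar'.
df = pd.DataFrame({'residual sugar': [1.2, 1.4, 60.0, 1.3],
                   'quality': [5, 6, 6, 7]})

print(df.head())                # first observations, as in the text
print(df.describe())            # 75% vs. max exposes the outlier
print(df.isnull().sum().sum())  # count of missing values
```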

EXPERIMENT 5

Aim : Working of major classifiers.

a) Naïve Bayes

b) Decision Tree

c) Logistic regression

d) Support vector machine

e) KNN

f) ARIMA

a) Naïve Bayes

Loading Data

Let's first load the required wine dataset from scikit-learn datasets.

from sklearn import datasets

#Load dataset

wine = datasets.load_wine()

Exploring Data

You can print the target and feature names, to make sure you have the right dataset, as such:

print "Features: ", wine.feature_names

print "Labels: ", wine.target_names

Features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash',

'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols',

'proanthocyanins', 'color_intensity', 'hue',

'od280/od315_of_diluted_wines', 'proline']

Labels: ['class_0' 'class_1' 'class_2']

# print the shape of the data (features)
wine.data.shape

(178, 13)

# print the wine data features (top 5 records)
print(wine.data[0:5])

1.27000000e+02 2.80000000e+00 3.06000000e+00 2.80000000e-01

2.29000000e+00 5.64000000e+00 1.04000000e+00 3.92000000e+00

1.06500000e+03]

[ 1.32000000e+01 1.78000000e+00 2.14000000e+00 1.12000000e+01

1.00000000e+02 2.65000000e+00 2.76000000e+00 2.60000000e-01

1.28000000e+00 4.38000000e+00 1.05000000e+00 3.40000000e+00

1.05000000e+03]

[ 1.31600000e+01 2.36000000e+00 2.67000000e+00 1.86000000e+01

1.01000000e+02 2.80000000e+00 3.24000000e+00 3.00000000e-01

2.81000000e+00 5.68000000e+00 1.03000000e+00 3.17000000e+00

1.18500000e+03]

[ 1.43700000e+01 1.95000000e+00 2.50000000e+00 1.68000000e+01

1.13000000e+02 3.85000000e+00 3.49000000e+00 2.40000000e-01

2.18000000e+00 7.80000000e+00 8.60000000e-01 3.45000000e+00

1.48000000e+03]

[ 1.32400000e+01 2.59000000e+00 2.87000000e+00 2.10000000e+01

1.18000000e+02 2.80000000e+00 2.69000000e+00 3.90000000e-01

1.82000000e+00 4.32000000e+00 1.04000000e+00 2.93000000e+00

7.35000000e+02]]

# print the wine labels (0: class_0, 1: class_1, 2: class_2)

print(wine.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

1 1 1

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

2 2 2

2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]

Splitting Data

First, you separate the columns into dependent and independent variables(or features and label).

Then you split those variables into train and test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3, random_state=109) # 70% training and 30% test

Model Generation

After splitting, we will generate a Naive Bayes model on the training set and perform prediction

on test set features.

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

Evaluating Model

After model generation, check the accuracy using actual and predicted values.

from sklearn import metrics

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

>>> ('Accuracy:', 0.90740740740740744)

b) Decision Tree

Also on the Wine Dataset used above.

Model Generation

After splitting, we will generate a Decision Tree model on the training set and perform

prediction on test set features.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

Evaluating Model

After model generation, check the accuracy using actual and predicted values.

from sklearn import metrics

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

c) Logistic Regression

Also on the Wine Dataset used above.

Model Generation

After splitting, we will generate a Logistic Regression model on the training set and perform

prediction on test set features.

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

Evaluating Model

After model generation, check the accuracy using actual and predicted values.

from sklearn import metrics

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

d) Support Vector Machine

Model Generation

After splitting, we will generate a Support Vector Machine model on the training set and

perform prediction on test set features.

from sklearn import svm

#Create a Support Vector Machine classifier

clf =svm.SVC()

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

Evaluating Model

After model generation, check the accuracy using actual and predicted values.

from sklearn import metrics

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

e) k Nearest Neighbour

Model Generation

After splitting, we will generate a k Nearest Neighbour model on the training set and perform prediction on test set features.

#Import the k Nearest Neighbour classifier
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=2, algorithm='ball_tree')

clf.fit(X_train, y_train)

#Predict the response for the test dataset
y_pred = clf.predict(X_test)

Evaluating Model

After model generation, check the accuracy using actual and predicted values.

from sklearn import metrics

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

f) ARIMA

from pandas import read_csv
from datetime import datetime
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)

print(series.head())

series.plot()

pyplot.show()

Month

1901-01-01 266.0

1901-02-01 145.9

1901-03-01 183.1

1901-04-01 119.3

1901-05-01 180.3

Name: Sales, dtype: float64

Model Training

from pandas import read_csv
from pandas import DataFrame
from datetime import datetime
from statsmodels.tsa.arima_model import ARIMA
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)

# fit model

model = ARIMA(series, order=(5,1,0))

model_fit = model.fit(disp=0)

print(model_fit.summary())

# plot residual errors

residuals = DataFrame(model_fit.resid)

residuals.plot()

pyplot.show()

residuals.plot(kind='kde')

pyplot.show()

print(residuals.describe())

OUTPUT

                             ARIMA Model Results
==============================================================================
Dep. Variable:                D.Sales   No. Observations:                   35
Model:                 ARIMA(5, 1, 0)   Log Likelihood                -196.170
Method:                       css-mle   S.D. of innovations             64.241
Date:                Mon, 12 Dec 2016   AIC                            406.340
Time:                        11:09:13   BIC                            417.227
Sample:                    02-01-1901   HQIC                           410.098
                         - 12-01-1903
=================================================================================
                    coef    std err          z      P>|z|      [95.0% Conf. Int.]
---------------------------------------------------------------------------------
const            12.0649      3.652      3.304      0.003       4.908      19.222
ar.L1.D.Sales    -1.1082      0.183     -6.063      0.000      -1.466      -0.750
ar.L2.D.Sales    -0.6203      0.282     -2.203      0.036      -1.172      -0.068
ar.L3.D.Sales    -0.3606      0.295     -1.222      0.231      -0.939       0.218
ar.L4.D.Sales    -0.1252      0.280     -0.447      0.658      -0.674       0.424
ar.L5.D.Sales     0.1289      0.191      0.673      0.506      -0.246       0.504
                                    Roots
=============================================================================
                  Real           Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -1.0617           -0.5064j            1.1763           -0.4292
AR.2           -1.0617           +0.5064j            1.1763            0.4292
AR.3            0.0816           -1.3804j            1.3828           -0.2406
AR.4            0.0816           +1.3804j            1.3828            0.2406
AR.5            2.9315           -0.0000j            2.9315           -0.0000
-----------------------------------------------------------------------------

Inference

from pandas import read_csv
from datetime import datetime
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot

def parser(x):
    return datetime.strptime('190'+x, '%Y-%m')

series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0],
    index_col=0, squeeze=True, date_parser=parser)

X = series.values

size = int(len(X) * 0.66)

train, test = X[0:size], X[size:len(X)]

history = [x for x in train]

predictions = list()

for t in range(len(test)):
    model = ARIMA(history, order=(5,1,0))
    model_fit = model.fit(disp=0)
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = test[t]
    history.append(obs)
    print('predicted=%f, expected=%f' % (yhat, obs))

error = mean_squared_error(test, predictions)

print('Test MSE: %.3f' % error)

# plot

pyplot.plot(test)

pyplot.plot(predictions, color= 'red')

pyplot.show()

OUTPUT

predicted=349.117688, expected=342.300000

predicted=306.512968, expected=339.700000

predicted=387.376422, expected=440.400000

predicted=348.154111, expected=315.900000

predicted=386.308808, expected=439.300000

predicted=356.081996, expected=401.300000

predicted=446.379501, expected=437.400000

predicted=394.737286, expected=575.500000

predicted=434.915566, expected=407.600000

predicted=507.923407, expected=682.000000

predicted=435.483082, expected=475.300000

predicted=652.743772, expected=581.300000

predicted=546.343485, expected=646.900000

Test MSE: 6958.325

EXPERIMENT 6

Implement supervised learning (KNN classification). Estimate the accuracy using 5-fold cross-validation.

k-Nearest Neighbors

The k-Nearest Neighbors algorithm or KNN for short is a very simple technique.

The entire training dataset is stored. When a prediction is required, the k-most similar records to

a new record from the training dataset are then located. From these neighbors, a summarized

prediction is made.

Similarity between records can be measured many different ways. A problem or data-specific

method can be used. Generally, with tabular data, a good starting point is the Euclidean distance.

Once the neighbors are discovered, the summary prediction can be made by returning the most

common outcome or taking the average. As such, KNN can be used for classification or

regression problems.

from random import seed

from random import randrange

from csv import reader

from math import sqrt

def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax

def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])

def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for _ in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores

def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)-1):
        distance += (row1[i] - row2[i])**2
    return sqrt(distance)

# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
    distances = list()
    for train_row in train:
        dist = euclidean_distance(test_row, train_row)
        distances.append((train_row, dist))
    distances.sort(key=lambda tup: tup[1])
    neighbors = list()
    for i in range(num_neighbors):
        neighbors.append(distances[i][0])
    return neighbors

def predict_classification(train, test_row, num_neighbors):
    neighbors = get_neighbors(train, test_row, num_neighbors)
    output_values = [row[-1] for row in neighbors]
    prediction = max(set(output_values), key=output_values.count)
    return prediction

# kNN Algorithm
def k_nearest_neighbors(train, test, num_neighbors):
    predictions = list()
    for row in test:
        output = predict_classification(train, row, num_neighbors)
        predictions.append(output)
    return predictions

seed(1)
filename = 'iris.csv'
dataset = load_csv(filename)
for i in range(len(dataset[0])-1):
    str_column_to_float(dataset, i)
# convert class column to integers
str_column_to_int(dataset, len(dataset[0])-1)
# evaluate algorithm
n_folds = 5
num_neighbors = 5
scores = evaluate_algorithm(dataset, k_nearest_neighbors, n_folds, num_neighbors)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores)/float(len(scores))))

OUTPUT

Scores: [96.66666666666667, 96.66666666666667, 100.0, 90.0, 100.0]

Mean Accuracy: 96.667%
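For comparison, the same 5-fold estimate can be obtained in a few lines with scikit-learn. This sketch loads iris from sklearn rather than from iris.csv:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation of a 5-nearest-neighbour classifier
X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("Scores:", (scores * 100).round(3))
print("Mean Accuracy: %.3f%%" % (scores.mean() * 100))
```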

EXPERIMENT 7

Introduction to R. Be aware of the basics of machine learning methods in R.

Libraries

# Load multiple libraries

p.names <- c('xgboost', 'caret', 'dplyr', 'e1071')

lapply(p.names, library, character.only = TRUE)

# Getting help and documentation

?functionName

help(functionName)

example(functionName)

Variables

Defining variables is straightforward: we can use either the "=" or the "<-" operator. One unusual thing, if you come from Python, is that variable names may contain dots ".", so you get variables like "my.data.vector". This is very common in code snippets found online.

my_var <- 54

my.data.vector = c(34, 54, 65)

# Clean a variable

my_var <- NULL

Functions

● Use the function keyword with parameters inside parenthesis.

● Use return as exit points

The following small function, prod_age, takes creation_date as an argument. With an if statement we handle the missing (NA) cases; otherwise we cast the value to a date.

prod_age <- function(creation_date) {
    if (is.na(creation_date)) {return(as.numeric(-1))}
    else {return(as.Date(creation_date))}
}

The read_delim function, from the readr library offers a lot of tools to read most of filetypes.

In the example below, we specify the data type of each column. The file has 56 columns, and we

want all of them to be read as characters, so we use the col_types argument with “c…c”, each

character corresponding to a column.

# Load the library

library(readr)

# Create dataframe from CSV file

my_dataframe <- read_delim("C:/path/to/file.csv",

delim = "|",

escape_double = FALSE,

col_types = paste(rep("c", 56), collapse = ''))

Subsetting a dataframe

Dataframes are not only encountered when importing your dataset; sometimes function results are dataframes. The main tool for subsetting is the bracket operator.

To access a specific column, use the $ operator, very convenient.

To access specific rows, we use the [] operator. You might be

familiar with this syntax : [rows, columns]

# Works with numeric indices

y_train <- my.dataframe[c(0:100), 8]

# Works with negative indices to exclude

y_test <- my.dataframe[-c(0:100), 8]

# Here is another technique still using the bracket syntax. The which

# and names operators are used to subset rows and columns.

filtered.dataframe <- my.dataframe[

which(my.dataframe$col1 == 2), # Filter rows on condition

names(my.dataframe) %in% c("col1","col2","col3")] # Subset cols

The subset function: the first argument is the dataframe, then the filter condition on rows, then the columns to select.

filtered.dataframe <- subset(
    my.dataframe,
    col1 == 2,
    select = c("col1","col2","col3"))

plyr is a popular library for data manipulation. From this came the dplyr package, which

introduces a grammar for the most common data manipulation challenges.

If you come from Python, you may be familiar with chaining commands with a dot. Here with

dplyr you can do just the same with a special pipe : %>%.

starwars %>%

filter(species == "Droid")

starwars %>%

select(name, ends_with("color"))

starwars %>%

mutate(name, bmi = mass / ((height / 100) ^ 2)) %>%

select(name:mass, bmi)

starwars %>%

arrange(desc(mass))

starwars %>%

group_by(species) %>%

summarise(

n = n(),

mass = mean(mass, na.rm = TRUE)

) %>%

filter(n > 1)
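For readers coming from Python, the dplyr pipelines above map almost one-to-one onto pandas method chaining. A sketch with a small hand-made stand-in for the starwars data (rows and values invented for illustration):

```python
import pandas as pd

# Tiny invented stand-in for dplyr's starwars dataset
starwars = pd.DataFrame({
    "name": ["C-3PO", "R2-D2", "Luke"],
    "species": ["Droid", "Droid", "Human"],
    "height": [167.0, 96.0, 172.0],
    "mass": [75.0, 32.0, 77.0],
})

result = (starwars
          .loc[starwars["species"] == "Droid"]  # filter(species == "Droid")
          .assign(bmi=lambda d: d["mass"] / ((d["height"] / 100) ** 2))  # mutate(bmi = ...)
          .sort_values("mass", ascending=False))  # arrange(desc(mass))
print(result[["name", "bmi"]])
```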

dplyr is actually very convenient for data filtering and exploration, and the grammar is straightforward.

Modify column values.

When a dataframe object is created, we access specific columns with the $ operator.

my_dataframe <- subset(my_dataframe, COLNAME != 'str_value')

# Assign 0 where column values match condition

non_conformites$REGUL_DAYS[non_conformites$REGUL_DAYS_NUM < 0] <- 0

# Create new column from existing columns

table$AMOUNT <- table$Q_LITIG * table$PRICE

# Delete a column

my_dataframe$COLNAME <- NULL

Once we have a dataframe and functions ready, we often need to apply transformations to columns.

Here we use the apply operator. It applies an operation over a block of structured data, so it is not limited to dataframes. Of course, every element must have the same type.

prod_age <- function(creation_date) {
    if (is.na(creation_date)) {return(as.numeric(-1))}
    else {return(as.Date(creation_date))}
}

# Apply function on column

mytable$PRODUCT_AGE <-

apply(mytable[,c('DATE_CREA'), drop=F], 1, function(x) prod_age(x))
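The same column-wise pattern exists in pandas, where Series.apply plays the role of R's apply here. A sketch with invented dates; the prod_age below mirrors the R function, returning -1 for missing values (here the year stands in for the age computation):

```python
import pandas as pd

# Invented column of creation dates, including a missing value
df = pd.DataFrame({"DATE_CREA": ["2020-01-15", None, "2019-06-01"]})

def prod_age(creation_date):
    # Mirror the R function: flag missing dates with -1
    if pd.isna(creation_date):
        return -1
    return pd.Timestamp(creation_date).year

# Apply the function to every value of the column
df["PRODUCT_AGE"] = df["DATE_CREA"].apply(prod_age)
print(df["PRODUCT_AGE"].tolist())  # [2020, -1, 2019]
```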

Plotting

R comes with several libraries for plotting data. The plot function is similar to plt.plot in Python.

RStudio is very convenient for plotting: it has a dedicated plotting window, with the ability to go back to previous plots.

Line charts

plot(

ref_sales$Date, ref_sales$Sales,

type = 'l',

xlab = "Date", ylab = "Sales",

main = paste('Sales evolution over time for : ', article_ref)

)

Various charts

R being the language of statisticians, it comes with various charts for plotting data distributions.

values <- c(1, 4, 8, 2, 4)

barplot(values)

hist(values)

pie(values)

boxplot(values)

Machine learning : XGBoost library

The xgboost package is a good starting point, as it is well documented. It makes it possible to gain quick insights on a dataset, such as feature importance, as we will see below.

For this part, we need those specific libraries :

- xgboost : Let’s work around XGB famous algorithm.

- caret : Classification And REgression Training, includes lots of data processing functions

- dplyr : A fast, consistent tool for working with data frame like objects, both in memory and out

of memory.

Train-Test split

Once the dataframe is prepared, we split it into train and test sets, using an index (inTrain):

set.seed(1337)
inTrain <- createDataPartition(y = my.dataframe$label, p = 0.85, list = FALSE)
X_train = xgb.DMatrix(as.matrix(my.dataframe[inTrain, ] %>% select(-label)))
y_train = my.dataframe[inTrain, ]$label
X_test = xgb.DMatrix(as.matrix(my.dataframe[-inTrain, ] %>% select(-label)))
y_test = my.dataframe[-inTrain, ]$label

The parameter search function will:

- Take our train/test sets as input.
- Define a trainControl for cross validation.
- Define a grid for parameters.
- Set up an XGB model including the parameter search.
- Evaluate the model's accuracy.
- Return the set of best parameters.

# Cross validation init

xgb_trcontrol = trainControl(method = "cv", number = 5,

allowParallel = TRUE,

verboseIter = T, returnData = FALSE)

# Param grid

xgbGrid <- expand.grid(nrounds = 60, #nrounds = c(10,20,30,40),

max_depth = 20, #max_depth = c(3, 5, 10, 15, 20, 30),

colsample_bytree = 0.6,#colsample_bytree = seq(0.5, 0.9, length.out

= 5),

eta = 0.005, #eta = c(0.001, 0.0015, 0.005, 0.1),

gamma=0, min_child_weight = 1, subsample = 1

)

# Model and parameter search

xgb_model = train(xtrain, ytrain, trControl = xgb_trcontrol,

tuneGrid = xgbGrid, method = "xgbTree",

verbose=2,

#objective="multi:softprob",

eval_metric="mlogloss")

#num_class=3)

# Evaluate the model

xgb.pred = predict(xgb_model, xtest, reshape=T)

xgb.pred = as.data.frame(xgb.pred, col.names=c("pred"))

result = sum(xgb.pred$xgb.pred==ytest) / nrow(xgb.pred)

print(paste("Final Accuracy =",sprintf("%1.2f%%", 100*result)))

return(xgb_model)

}

Once the parameter search is done, we can use the best parameters directly to define our working model; we access each element with the $ operator:

best.model <- xgboost(
    data = as.matrix(my.dataframe[inTrain, ] %>% select(-IMPORTANCE)),
    label = as.matrix(as.numeric(my.dataframe[inTrain,]$IMPORTANCE)-1),
    nrounds = xgb_model$bestTune$nrounds,
    max_depth = xgb_model$bestTune$max_depth,
    eta = xgb_model$bestTune$eta,
    gamma = xgb_model$bestTune$gamma,
    colsample_bytree = xgb_model$bestTune$colsample_bytree,
    min_child_weight = xgb_model$bestTune$min_child_weight,
    subsample = xgb_model$bestTune$subsample,
    objective = "multi:softprob", num_class=3)

Compute and plot feature importance

Here again, a lot of functions are available in the xgboost package.

The documentation presents most of them.

xgb_feature_imp <- xgb.importance(

colnames(donnees[inTrain, ] %>% select(-label)),

model = best.model

)

gg <- xgb.ggplot.importance(xgb_feature_imp, 40); gg

Below is an example of a feature importance plot, as displayed in RStudio. The clusters shown by xgboost simply group features with similar scores; there is no other specific meaning to them.

EXPERIMENT 8

Develop a machine learning method using Neural Networks in Python to predict stock prices based on past price variation.

Most introductory texts to Neural Networks bring up brain analogies when describing them.

Without delving into brain analogies, I find it easier to simply describe Neural Networks as a

mathematical function that maps a given input to a desired output.

Neural Networks consist of the following components:

● An input layer, x

● An arbitrary amount of hidden layers

● An output layer, ŷ

● A set of weights and biases between each layer, W and b

● A choice of activation function for each hidden layer, σ. In this tutorial, we’ll use a

Sigmoid activation function.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # derivative of the sigmoid, expressed in terms of its output x
    return x * (1.0 - x)

class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x
        self.weights1 = np.random.rand(self.input.shape[1], 4)
        self.weights2 = np.random.rand(4, 1)
        self.y = y
        self.output = np.zeros(self.y.shape)

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))

    def backprop(self):
        # application of the chain rule to find the derivative of the
        # loss function with respect to weights2 and weights1
        d_weights2 = np.dot(self.layer1.T, (2*(self.y - self.output)
            * sigmoid_derivative(self.output)))
        d_weights1 = np.dot(self.input.T, (np.dot(2*(self.y - self.output)
            * sigmoid_derivative(self.output), self.weights2.T)
            * sigmoid_derivative(self.layer1)))
        # update the weights with the derivative (slope) of the loss function
        self.weights1 += d_weights1
        self.weights2 += d_weights2
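A minimal, self-contained training loop for a network of this shape, shown on the classic XOR toy problem. The sigmoid helpers are repeated here so the sketch runs on its own; the iteration count and seed are arbitrary choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # derivative of the sigmoid, expressed in terms of its output x
    return x * (1.0 - x)

np.random.seed(0)
# XOR inputs; the constant third column acts as a bias input
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

weights1 = np.random.rand(3, 4)
weights2 = np.random.rand(4, 1)

# Loss before training, for comparison
output = sigmoid(sigmoid(X @ weights1) @ weights2)
initial_loss = np.mean((y - output) ** 2)

for _ in range(5000):
    # feedforward
    layer1 = sigmoid(X @ weights1)
    output = sigmoid(layer1 @ weights2)
    # backprop: same chain-rule updates as in the class above
    d_w2 = layer1.T @ (2 * (y - output) * sigmoid_derivative(output))
    d_w1 = X.T @ ((2 * (y - output) * sigmoid_derivative(output)) @ weights2.T
                  * sigmoid_derivative(layer1))
    weights1 += d_w1
    weights2 += d_w2

final_loss = np.mean((y - output) ** 2)
print("loss: %.4f -> %.4f" % (initial_loss, final_loss))
```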

Using keras to build a RNN based Neural Network to predict Google Stock Prices

# Importing Packages

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

import keras

from keras.models import load_model

from sklearn.preprocessing import MinMaxScaler

from keras.models import Sequential

from keras.layers import Dense, Dropout, LSTM

# Importing Dataset

dataset_train = pd.read_csv('google_train.csv')

train_dataset = dataset_train.iloc[:, 1:2].values

# Scaling

sc = MinMaxScaler()

train_dataset_scaled = sc.fit_transform (train_dataset)
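A quick illustration of what the scaler does, on toy prices chosen for round numbers: fit_transform maps values into [0, 1], and inverse_transform, used later on the model's predictions, maps them back to the price scale:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy price column (one feature, as in the stock data)
prices = np.array([[100.0], [150.0], [200.0]])

sc = MinMaxScaler()
scaled = sc.fit_transform(prices)
print(scaled.ravel())                        # values mapped into [0, 1]
print(sc.inverse_transform(scaled).ravel())  # back to the original prices
```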

# Data Preprocessing

X_train = []
y_train = []
for i in range(60, len(train_dataset_scaled)):
    X_train.append(train_dataset_scaled[i-60: i, 0])
    y_train.append(train_dataset_scaled[i, 0])
X_train, y_train = np.array(X_train), np.array(y_train)
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))

# Model Generation

regressor = Sequential()

regressor.add(LSTM(units=50, return_sequences=True,

input_shape=(X_train.shape[1],1)))

regressor.add(Dropout(0.2))

regressor.add(LSTM(units=50, return_sequences=True))

regressor.add(Dropout(0.2))

regressor.add(LSTM(units=50, return_sequences=True))

regressor.add(Dropout(0.2))

regressor.add(LSTM(units=50))

regressor.add(Dropout(0.2))

regressor.add(Dense(units=1))

# Model Compilation

regressor.compile('adam', loss='mean_squared_error')

## Model Training

regressor.fit(X_train, y_train, batch_size=32, epochs=50)

Epoch 45/50

3213/3213 [==============================] - 12s 4ms/step - loss:

6.9449e-04

Epoch 46/50

3213/3213 [==============================] - 12s 4ms/step - loss:

8.2133e-04

Epoch 47/50

3213/3213 [==============================] - 12s 4ms/step - loss:

7.4369e-04

Epoch 48/50

3213/3213 [==============================] - 12s 4ms/step - loss:

6.2250e-04

Epoch 49/50

3213/3213 [==============================] - 12s 4ms/step - loss:

9.9489e-04

Epoch 50/50

3213/3213 [==============================] - 12s 4ms/step - loss:

6.9578e-04

Inference

dataset_test = pd.read_csv('google_test.csv')

real_values = dataset_test.iloc[:, 1:2].values

Preprocessing

dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']), axis=0)
dataset_total = dataset_total[len(dataset_total) - len(dataset_test) - 60:].values
dataset_total = dataset_total.reshape(-1, 1)
dataset_total = sc.transform(dataset_total)
X_test = []
for i in range(60, len(dataset_total)):
    X_test.append(dataset_total[i-60: i, 0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))

pred = regressor.predict(X_test)

pred = sc.inverse_transform(pred)

# Visualising Results

plt.plot(real_values, label = 'real values')

plt.plot(pred, label = 'predicted values')

plt.xlabel('Time')

plt.ylabel('Stock Prices')

plt.legend()

plt.show()