
Machine Learning

with Python

The Basics

By David V.
Copyright 2017 by David V.

All rights reserved. No part of this

publication may be reproduced,

distributed, or transmitted in any form or

by any means, including photocopying,

recording, or other electronic or

mechanical methods, without the prior

written permission of the author, except

in the case of brief quotations embodied
in critical reviews and certain other
noncommercial uses permitted by
copyright law.
Table of Contents
Chapter 1- Getting Started
Chapter 2- Python and matplotlib for Data
Chapter 3- Logistic

While all attempts have been made to

verify the information provided in this

book, the author does not assume any

responsibility for errors, omissions, or

contrary interpretations of the subject

matter contained within. The information

provided in this book is for educational
and entertainment purposes only. The
reader is responsible for his or her own
actions and the author does not accept

any responsibilities for any liabilities or

damages, real or perceived, resulting

from the use of this information.

The trademarks that are used are without any consent, and the publication of the trademark is without permission or backing by the trademark owner. All trademarks and brands within this book are for clarifying purposes only and are owned by the owners themselves, who are not affiliated with this document.


Machine learning is a very common field

of study in the world today. With Python,

it is easy for you to implement the

concept of machine learning. This is

because this programming language has

numerous libraries which can help you

create production systems employing the
concept of machine learning. This book
helps you learn this. Enjoy reading!
Chapter 1- Getting Started


The only way for you to understand

machine learning is by doing and then

completing projects, beginning with the

small ones. Python is an interpreted

programming language, and very
powerful. This language can be used for
the purposes of research, as well as the
creation of production systems as it is a

complete programming language.

Python has numerous libraries and

modules from which you can choose,

meaning that there are different ways that

you can accomplish a particular task.

The following are the best steps to
follow when doing a machine learning
project in Python:

1. Define Problem.

2. Prepare Data.

3. Evaluate Algorithms.

4. Improve Results.

5. Present Results.

The above steps will also help you learn

how to use some new tools or a

First Project

Our first project should be the one for

classifying iris flowers. This project is

good for the following reasons:

1. It has numeric attributes, making it

possible for us to figure out how
the loading and handling of data can
be done.

2. The problem falls under the category

of classification, meaning that it is

possible for us to implement it by use

of a supervised learning algorithm.

3. It is also a multi-class classification

problem, and it needs a specialized

form of handling.

4. The project is small, and it will fit

into the memory very well.

5. Our numeric attributes will be in the same

scale and same units. This means that

we will not have to do any special

transformations or scaling so as to
get started.

The following is what we will cover in

this small project:

1. Installing Python and the SciPy platform.


2. Loading the dataset.

3. Summarizing the dataset.

4. Visualizing the dataset.

5. Evaluating some algorithms.

6. Making predictions.
Installing Python and the
SciPy Platform

You should install the Python SciPy

platform in your system if you have not

installed it. The following are the SciPy

libraries which you should install on

your system:

scipy
numpy
matplotlib
pandas
sklearn

These libraries can be installed in a

number of ways. The best way for you to

do the installation is by choosing a

single way and then following it

throughout the installation process.
Installing via pip
In the case of Mac and Linux users, it is
possible for you to install the SciPy

libraries via pip. The installation of

these libraries will be done in wheel

package format. For you to do the

installation, you must ensure that you

have installed both pip and Python into

your system. However, the installation of

these libraries with pip in Windows

does not work properly.

First, begin by upgrading pip to the latest

version. This can be done by use of the

following command:

python -m pip install --upgrade pip

You can then use pip to install the SciPy

packages. The following command

demonstrates how to do this:

pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose

The packages will then be installed to

your local user, and no permissions will

be needed for you to write to the

In the case of user installs, please ensure

that the user install executable directory

is on the PATH. In Linux, the PATH can

be set as follows:

# This should be added at the end of

~/.bashrc file


In OSX, the PATH can be set as follows:

# This should be added at the end of

~/.bash_profile file

Note that you must use your correct

Installation via Linux
Package Manager

The installation can also be done much

quicker from the repositories of the

various Linux distributions. Note that

such installations will be system-wide,

and they will be somehow outdated

compared to the ones which are installed

via pip.
For users of Ubuntu and Debian, the
installation of the libraries can be done

by executing the following command on

the terminal:

sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose
For users of Fedora 22 and the later

versions, use the following command:

sudo dnf install numpy scipy python-matplotlib ipython python-pandas sympy python-nose atlas-devel
Installation via Mac
Package Manager

Unlike Linux, the Mac does not come with a package manager, but there exist several package managers which you can install, such as MacPorts.

The installation of the SciPy libraries by use of MacPorts can be done by executing the following command:

sudo port install py35-numpy py35-scipy py35-matplotlib py35-ipython +notebook py35-pandas py35-sympy

Now that you have the SciPy libraries

necessary for this project, you can move

to the next step.
Start Python, Check for

You should check so as to be sure that

the Python was installed properly and it

is running as expected.

Just open the terminal and then type the


Below is a script which can be used for

testing the environment. It works by

importing each of the libraries which are

required for our project, and it will then

return the version for each library. This

is the script:

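# Check versions of the libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
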

You will then get the versions for each

of the libraries if the installation was

done successfully.

Load Data
We will be using the data set for the iris
flower. It is a very famous dataset,

widely used in machine learning by

almost everyone.

It has 150 observations of iris flowers. It

has four columns for flower

measurements in centimeters. The fifth

will have the species to which the

flower belongs. There are three species,
and each flower must belong to one of them.

Importing the Libraries

There are objects, functions, and

libraries which should be used in this

project. We can import them into the

project as follows:
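# Load libraries
import pandas
# scatter_matrix lives in pandas.plotting (pandas.tools.plotting in old pandas versions)
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
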
Everything should load without an error. If you get an error, stop and fix your environment before you continue.

Loading the Dataset

We can load the data directly from the UCI Machine Learning repository. We will use pandas to load the data, and also to explore it with both data visualization and descriptive statistics.

The name of each column has to be specified when loading the data. This will help us later when we explore the data. This can be done as shown below:

# Load dataset
url =
names = ['sepal-length', 'sepal-width',
'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

The data should load without incident. If you have network problems, you can download the iris.data file into the working directory and load it in the same way, changing the URL to the name of the local file.

It is possible for us to learn the number

of attributes and the rows (instances)

which we have by use of the shape

property. This can be done as shown below:

# shape
print(dataset.shape)

There should be 150 instances and 5 attributes, as shown below:

(150, 5)

We can then eyeball the data as follows:

# head
print(dataset.head(20))

This will give the first 20 rows of data

which are contained in the file. This is

shown below:
    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa

We can then go ahead and look for the

statistics of each attribute. This should

include the mean, the min, max, the

count, and some other percentiles. This

is shown below:

# descriptions
print(dataset.describe())

You will observe that there will be a

similar scale for all the numeric values,

and the same ranges.

Class Distribution

We now need to know the number of

instances for the classes. This can be
viewed as an absolute count. This is
demonstrated below:

# class distribution
print(dataset.groupby('class').size())

You will then find that all the classes

have the same number of the instances.

Data Visualizations
Now that we have some basic idea
regarding the data, it is good for us to

extend it through visualizations. Let us

have a look at the plots.

Univariate Plots

These will help us to understand each

attribute. They are the plots for the

individual variables. Since the input variables are numeric, we can create box and whisker plots for each of them. This is shown below:

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

This will help us have a good picture of how the input variables are distributed. We can also create a histogram of each input variable to learn more about its distribution:

# histograms
dataset.hist()
plt.show()

Two of the input variables appear to have a roughly Gaussian distribution. This is useful to note, because there are algorithms which can exploit this assumption.

Multivariate Plots

We should explore how the variables

interact with each other. Scatterplots for

the pair of the attributes will help us

identify any structured relationships

between the input variables.

# scatter plot matrix
scatter_matrix(dataset)
plt.show()


You will realize that there is a diagonal

grouping for some of the attribute pairs.

This is an indication of a high correlation and a predictable relationship.

Evaluation of Algorithms

We should make some data models, and

then estimate their accuracy based on

unseen data.

Creating the Validation Dataset

We want to know whether the model we created is any good. Statistical methods will be used later to estimate the accuracy of the models on unseen data. We also want a more concrete estimate of the accuracy of the best model, which we get by evaluating it on actual unseen data. To do this, we hold back part of the data, which the algorithms will not see during training. This is shown below:

# Split-out validation dataset

array = dataset.values

X = array[:,0:4]

Y = array[:,4]

validation_size = 0.20

seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

The loaded dataset has been split into

two, whereby 80% of it will be used for
training the models, and 20% of this will
be held as the validation dataset. The

training data is now contained in the

X_train and Y_train for purposes of

training the model while X_validation

and Y_validation are sets which will be

used later.

Test Harness
10-fold cross validation will be used to estimate the accuracy. With this, the dataset will be split into 10 parts, whereby training will be done on 9 parts, while the remaining part will be used for testing. This will be repeated for all combinations of train-test splits. This is shown below:
# Test options and the evaluation metric


seed = 7

scoring = 'accuracy'
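
The 10-fold procedure can be illustrated with a minimal sketch in plain Python. The helper name k_fold_indices is hypothetical and is not part of scikit-learn; it only shows how each part takes one turn as the test set, under the assumption that the rows are split in order without shuffling.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Hypothetical helper for illustration only; rows are split in order,
    without shuffling.
    """
    indices = list(range(n_samples))
    # The first (n_samples % n_splits) folds receive one extra row
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 120 training rows (80% of the 150 iris rows) split into 10 parts
folds = list(k_fold_indices(120, 10))
for train, test in folds[:2]:
    print(len(train), len(test))
```

Every row ends up in the test part exactly once, so each model is trained and scored 10 times on different splits of the same training data.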

The models are to be evaluated by use of the accuracy metric. This is the number of correctly predicted instances divided by the total number of instances in the dataset; multiplying it by 100 gives a percentage (for example, 0.95 corresponds to 95%). We will pass the scoring variable when we build and evaluate each model.
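
As a small worked example of the accuracy calculation, the following sketch computes the ratio by hand. The function name accuracy and the labels are made up for illustration; this is not the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels, used only to illustrate the arithmetic
y_true = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "versicolor", "virginica"]

score = accuracy(y_true, y_pred)
print(score)          # fraction of correct predictions
print(score * 100)    # the same value as a percentage
```

Four of the five made-up predictions are correct, so the score is 4 divided by 5.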

Build Models

We are not sure of the best algorithms or

the best configurations for use in solving

this kind of problem. There are 6
possible algorithms which we can use.
These include the following:

1. Logistic Regression (LR).
2. Gaussian Naive Bayes (NB).
3. Linear Discriminant Analysis (LDA).
4. K-Nearest Neighbors (KNN).
5. Classification and Regression Trees (CART).
6. Support Vector Machines (SVM).

We will be mixing the simple linear and

non-linear algorithms. The random

number seed will be reset before each

run, and this will help us ensure that the

execution of each algorithm is done by

use of similar data splits. This serves to

help us be sure that the results can be

directly compared. Let us begin by

evaluating our six models:

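# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    # shuffle=True is required by recent scikit-learn versions when random_state is set
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
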

Selecting the best Model

Currently, we have 6 models as well as

the accuracy estimation for each of these

models. Our aim is to do a comparison

between the models, and then choose the

most accurate one. The program should

give the following result once executed:

LR: 0.966667 (0.040825)

LDA: 0.975000 (0.038188)

KNN: 0.983333 (0.033333)

CART: 0.975000 (0.038188)

NB: 0.975000 (0.053359)

SVM: 0.981667 (0.025000)

As shown in the above output, the KNN

seems to be the one with the highest

estimated accuracy score. We can go

ahead and create a plot of model

evaluation and then compare the mean

accuracy and spread of each model.

Note that the evaluation of each
algorithm was done 10 times.

# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()


Making Predictions

Note that we found the KNN model to be the most accurate of the ones tested. It is now time for us to get the accuracy of this model on the validation set.

This will provide us with a final independent check on the accuracy of the best model. It is good to keep a validation set in case a mistake was made during training, such as overfitting to the training set or a data leak. Either of these would lead to an overly optimistic result.

The KNN model can be run directly on the validation set, and we then summarize the result as a final accuracy score, a confusion matrix, and a classification report. This is shown below:

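# Make predictions on the validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
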

You will then get the accuracy, which is 0.9 or 90%. The confusion matrix shows the three errors which were made. The classification report then gives the breakdown for each class by precision, recall, f1-score, and support. Here is the result:


0.9
[[ 7  0  0]
 [ 0 11  1]
 [ 0  2  9]]
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.85      0.92      0.88        12
 Iris-virginica       0.90      0.82      0.86        11
    avg / total       0.90      0.90      0.90        30
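
To see where the numbers in the report come from, the sketch below recomputes them from the confusion matrix in the output above, in plain Python. It assumes that rows are the true classes and columns are the predicted classes; the variable names are chosen for illustration only.

```python
# Confusion matrix from the output above: rows are true classes,
# columns are predicted classes
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
matrix = [
    [7, 0, 0],
    [0, 11, 1],
    [0, 2, 9],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

report = {}
for i, label in enumerate(labels):
    true_positive = matrix[i][i]
    predicted_as_label = sum(row[i] for row in matrix)   # column sum
    actually_label = sum(matrix[i])                      # row sum (support)
    precision = true_positive / predicted_as_label
    recall = true_positive / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), actually_label)

print(accuracy)
for label, values in report.items():
    print(label, values)
```

The three off-diagonal entries (1 + 2) are the three errors, and 27 correct predictions out of 30 give the accuracy of 0.9.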

Chapter 2- Python and matplotlib for Data

The Python libraries can be used

together with matplotlib for the purpose

of exploring data. Let us discuss how

this can be done.

Load the Data set

This should be the first step in machine
learning. This should be observed data,

and you can choose to collect it, or you

may choose to browse for the data from

various data sources so as to get the data

sets. In this case, you may choose to load

the digits data set which comes with

scikit-learn, which is a Python library.
For the data to be loaded, we have to import the datasets module from sklearn. You can then use the load_digits() method from datasets to load the data. This is shown below:

# Import `datasets` from the `sklearn`
from sklearn import datasets
# Load in `digits` data
digits = datasets.load_digits()
# Print `digits` data
print(digits)

Note that the datasets module also has other methods for loading and fetching popular reference datasets, and it also provides artificial data generators. If we needed to pull the data from an external source such as a CSV file instead, we could do it with pandas as follows:

# Import `pandas` library as `pd`
import pandas as pd

# Load in data with the `read_csv()`

digits =

# Print out the `digits`
print(digits)

Note that when the data is downloaded in this manner, it comes already split into a training set and a test set, indicated by the .tra and .tes extensions. Both files should be loaded to work through the whole project. The above command will only load the training set.
Exploring the Data

Before you can begin to use a particular

data set, it is good for you to read its

description and try to learn something.

For the case of scikit-learn, this

information is not readily available, but

when you import the data from a

particular data source, it will have a
description and this will be enough
information for you to gain enough
insights into the data. However, it is good for you to have sufficient background knowledge about the data set.

Performing exploratory data analysis (EDA) on a data set can be quite difficult.

Gather Information on the Data

Suppose you have not checked any

folder with the data description. It is

time for you to begin gathering the


After printing out the digits data once you have loaded it through the scikit-learn datasets module, you will notice that a lot of information is available. You should now be aware of things such as the target values as well as the description of the data. The digits data can be accessed through the data attribute. Similarly, the target attribute gives access to the target values, and the DESCR attribute gives the description.
If you need to know the keys which are
available, you just have to execute

digits.keys(). You can try the one

shown below:

# Get keys of `digits` data
print(digits.keys())
# Print out data
print(digits.data)
# Print out target values
print(digits.target)
# Print out description of `digits` data
print(digits.DESCR)

You can then go ahead and check the type of your data. If you used read_csv() to import the data, you will have a data frame with all the data. There will be no description component, but you can resort to head() or tail() to inspect the data. In that case, it is always good to read the data description folder.

The data attribute should be used for

isolating the numpy array from the

digits data and the shape attribute

should be used for finding more. The

same can also be done for target and

DESCR. An attribute known as

images also exists, and this is used for

describing the data in the images.

The shape attribute of an array can be

used as shown below:

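# numpy is needed for `np.unique()`
import numpy as np
# Isolate `digits` data
digits_data = digits.data
# Inspect the shape
print(digits_data.shape)
# Isolate target values with the `target` attribute
digits_target = digits.target
# Inspect the shape
print(digits_target.shape)
# Print number of the unique labels
number_digits = len(np.unique(digits.target))
print(number_digits)
# Isolate the `images`
digits_images = digits.images
# Inspect the shape
print(digits_images.shape)
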
Visualizing the Data Images

using matplotlib
It is also possible for you to visualize

the images which you are using. There

are multiple libraries in Python which

can be used for this purpose, but here,

we will be using matplotlib. This can

be used as shown below:

# Import matplotlib library
import matplotlib.pyplot as plt
# Figure size (width, height) in inches
fig = plt.figure(figsize=(6, 6))
# Adjust the subplots
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# For each of 64 images
for i in range(64):
    # Initialize the subplots: add a subplot in a grid of 8 by 8, at the i+1-th position
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    # Display image at i-th position
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # label image with a target value
    ax.text(0, 7, str(digits.target[i]))
# Show the plot
plt.show()

The above code might seem lengthy and even overwhelming. Note that we began by importing the library, which is matplotlib.pyplot. We then set up a figure with dimensions of 6 inches by 6 inches. This creates a canvas on which all the subplots with the images will be displayed. We have also adjusted the spacing of the subplots on the left, right, bottom, and top. We then created a loop to fill the figure. The subplots are initialized one by one, and each is added at its own position on a grid which measures 8 by 8. Each image is displayed at its position on the grid in turn. We have also used binary colors, which give us white, black, and gray values. We have used nearest as the interpolation method, which means that the images are displayed without smoothing.

As a finishing touch, text is added to the subplots. The target labels are printed at the coordinates (0,7) of each subplot, meaning that they will be visible in the bottom-left corner of the subplot. The line plt.show() displays the plot. More simply, it is also possible to visualize the images with their target labels as shown below:

# Import matplotlib
import matplotlib.pyplot as plt
# Join images and the target labels into a list
images_and_labels = list(zip(digits.images, digits.target))
# for each of the first 8 elements contained in the list
for index, (image, label) in enumerate(images_and_labels[:8]):
    # initialize a subplot of the 2X4 grid at the index+1-th position
    plt.subplot(2, 4, index + 1)
    # Do not plot any axes
    plt.axis('off')
    # Display images in all the subplots
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Add a title to every subplot
    plt.title('Training: ' + str(label))
# Show the plot
plt.show()
Note that once we had imported matplotlib.pyplot, we went ahead to zip our two numpy arrays together, and then saved the result in a variable named images_and_labels. Each element of this list holds an instance of digits.images and the corresponding value of digits.target.

Principal Component
Analysis (PCA)

Since the digits data set has 64 features, it is a challenge to work with. It becomes hard for us to understand the structure and keep an overview of the digits data. You are working with a high-dimensional data set.

High dimensionality results when one tries to describe objects via a collection of features. The problem with high-dimensional data is that the algorithms have to take in many features. Having many dimensions may also mean that the data points are located far from every other point, and the distances between data points become less informative.

Principal Component Analysis (PCA) will help us solve this problem. It works by finding a linear combination of two variables which contains most of the information. This new variable, or principal component, can be used to replace the two original variables. You can see it as a linear transformation method which yields the directions that maximize the variance of the data.
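
For the two-variable case described above, the idea can be sketched in plain Python. This is an illustration under simplifying assumptions, not how scikit-learn computes PCA: the function names and the data are made up, and the direction is found with the closed-form angle of the major axis of a 2x2 covariance matrix.

```python
import math


def first_principal_component(xs, ys):
    """Return the unit direction of maximum variance for two variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Entries of the 2x2 covariance matrix
    sxx = sum((x - mean_x) ** 2 for x in xs) / n
    syy = sum((y - mean_y) ** 2 for y in ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Angle of the major axis of the covariance matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)


def project(xs, ys, direction):
    """Project the centered points onto the direction: one new variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx, dy = direction
    return [(x - mean_x) * dx + (y - mean_y) * dy for x, y in zip(xs, ys)]


# Made-up, strongly correlated measurements
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

direction = first_principal_component(xs, ys)
component = project(xs, ys, direction)
print(direction)
print(component)
```

Because the made-up points lie exactly on a line, the single new variable keeps all of the variance of the two original ones.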

scikit-learn makes it easy to apply PCA to your data. This is shown below:

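# Import `PCA`
from sklearn.decomposition import PCA
# Create a randomized PCA model which takes in two components
# (older scikit-learn versions exposed this as `RandomizedPCA`)
randomized_pca = PCA(n_components=2, svd_solver='randomized')
# Fit and transform data to model
reduced_data_rpca = randomized_pca.fit_transform(digits.data)
# Create a regular PCA model
pca = PCA(n_components=2)
# Fit and transform data to model
reduced_data_pca = pca.fit_transform(digits.data)
# Inspect the shape
print(reduced_data_pca.shape)
# Print out data
print(reduced_data_rpca)
print(reduced_data_pca)
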

Note that in the above example, we have used a randomized PCA model (RandomizedPCA() in older versions of scikit-learn). This is because it performs better when there is a large number of dimensions. You may choose to replace the randomized PCA estimator object with a regular PCA model and observe the difference.

It is good for you to keep in mind how

the model is told to keep two

components only. This is a way of

ensuring that you will only have two-

dimensional data for plotting. Also, you

should note that the target class is not

passed with labels to PCA
transformation, since one needs to
investigate whether the PCA reveals the
distribution of different labels and

whether it is possible for the instances to

be separated from each other clearly. A

scatterplot can now be built for the

purpose of visualizing the data:

colors = ['black', 'purple', 'blue', 'yellow', 'white', 'lime', 'cyan', 'orange', 'red', 'gray']
for i in range(len(colors)):
    x = reduced_data_rpca[:, 0][digits.target == i]
    y = reduced_data_rpca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])
plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()

The matplotlib should then be used for

visualizing the data.

Preprocessing the Data

Data should be prepared well before it

can be modeled. The preparation step is
commonly known as preprocessing.

Data Normalization

We will begin by preprocessing the data.

The digits data can, for example, be

standardized by use of the scale()

method. The following example

demonstrates this:
# Import
from sklearn.preprocessing import scale
# Apply the `scale()` method to `digits` data
data = scale(digits.data)

Once the data has been scaled, the

distribution of each attribute will be
shifted so that its mean can be 0 and the
standard deviation can be 1.
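
The effect of standardization on a single attribute can be sketched in plain Python. The function name standardize and the sample values are made up for illustration, and the population standard deviation is assumed here.

```python
import statistics


def standardize(values):
    """Shift and rescale values so that their mean is 0 and their standard deviation is 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation
    if std == 0:
        # A constant column carries no spread; map it to zeros
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up pixel intensities for one feature column
column = [0.0, 4.0, 8.0, 12.0, 16.0]
scaled = standardize(column)
print(scaled)
print(statistics.mean(scaled), statistics.pstdev(scaled))
```

The guard for a zero standard deviation is a design choice of this sketch: it avoids a division by zero for columns in which every value is the same.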

Split the Data into Test and

Training Sets

For you to assess the performance of

your model later, the data set should be

divided into two parts: a test set and a

training set. The first one will be used
for evaluating the system which has been
trained, while the second one will be
used for training the system.

A common way to approach this is to take 2/3 of the data set as the training set and the remaining 1/3 as the test set. Consider the example given below:

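# Import `train_test_split` (it lives in `sklearn.model_selection`;
# `sklearn.cross_validation` was its location in old scikit-learn versions)
from sklearn.model_selection import train_test_split
# Split `digits` data into the training and the test sets
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)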
Note that the above code does not follow the 2/3 and 1/3 proportions exactly: in the arguments of train_test_split(), test_size has been set to 0.25, so a quarter of the data is held out for testing. You also see that the random_state parameter has been set to a value of 42. This argument ensures that the split is always the same, which makes the results reproducible.
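
The role of a fixed seed can be illustrated with a small sketch in plain Python. The helper split_indices is hypothetical and does not reproduce the exact rows that train_test_split() selects; it only shows that the same seed gives the same split, and how the set sizes follow from test_size. Rounding the test size up is an assumption of this sketch.

```python
import math
import random


def split_indices(n_samples, test_size, random_state):
    """Shuffle row indices with a fixed seed and split them into train and test."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    # Assumption: the test set size is rounded up
    n_test = math.ceil(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


# 1797 is the number of samples in the scikit-learn digits data set
train_a, test_a = split_indices(1797, 0.25, random_state=42)
train_b, test_b = split_indices(1797, 0.25, random_state=42)

print(len(train_a), len(test_a))
print(test_a == test_b)   # same seed, same split
```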
Now that the data set has been split into
train and test sets, the numbers can be

inspected before the data can be

modeled. This is shown below:

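# numpy is needed for `np.unique()`
import numpy as np
# Number of the training features
n_samples, n_features = X_train.shape
# Print out the `n_samples`
print(n_samples)
# Print out the `n_features`
print(n_features)
# Number of the Training labels
n_digits = len(np.unique(y_train))
# Inspect the `y_train`
print(len(y_train))
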
The training set X_train should now have 1347 samples, which is 3/4 of the 1797 samples in the original data set. The remaining 450 samples make up the test set, X_test and y_test.

Clustering digits Data

At this point, all the data has been prepared and stored in the right variables, but we have not yet performed any actual learning or modeling.

The time has come for us to find the clusters of the training set. The model can be set up by use of KMeans() from the cluster module. You will observe that only three arguments are passed to it: init, n_clusters, and random_state.

You will remember the last argument from when we split our data into the training and test sets. It is responsible for ensuring that we get reproducible results. Consider the code given below:

# Import `cluster` module
from sklearn import cluster
# Create KMeans model
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)
# Fit training data `X_train` to model
clf.fit(X_train)

The init argument indicates the initialization method. Although it defaults to k-means++, it is spelled out explicitly in the code. The argument n_clusters has been set to 10. This number specifies the number of groups or clusters which will be formed from the data, as well as the number of centroids which will be generated. Note that a cluster centroid represents the middle of the cluster.
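
How clusters and centroids relate can be seen in a minimal k-means sketch in plain Python. This is an illustration only, not the scikit-learn implementation: the function k_means and the points are made up, and the first k points are used as initial centroids instead of the k-means++ initialization.

```python
def k_means(points, k, n_iterations=10):
    """Minimal k-means on 2-D points; returns (centroids, labels).

    Simplification: the first k points are used as initial centroids,
    instead of the k-means++ initialization mentioned in the text.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest centroid
        for idx, (x, y) in enumerate(points):
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            labels[idx] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its points
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels


# Two made-up, well separated groups of points
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Each centroid ends up at the mean of the points assigned to it, which is what is meant by the middle of the cluster.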

Note that if you add the n_init parameter to the KMeans() function, you can determine the number of different centroid initializations which the algorithm will try. The images making up the cluster centers can be visualized as shown below:

# Import matplotlib
import matplotlib.pyplot as plt
# Figure size in inches
fig = plt.figure(figsize=(8, 3))
# Add the title
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')
# For all the labels (0-9)
for i in range(10):
    # Initialize the subplots in a grid measuring 2X5, at the i+1th position
    ax = fig.add_subplot(2, 5, 1 + i)
    # Display the images
    ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
    # Don't show axes
    plt.axis('off')
# Show plot
plt.show()
The next step should be prediction of

labels of the test set. This can be done as

shown below:
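# Predict labels for the `X_test`
y_pred = clf.predict(X_test)
# Print out first 100 instances of the `y_pred`
print(y_pred[:100])
# Print out first 100 instances of the `y_test`
print(y_test[:100])
# Study shape of cluster centers
print(clf.cluster_centers_.shape)
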
In the above example, we are predicting

the values of the test set, and this has

450 samples. The result of this has been

stored in the y_pred. We have then

gone ahead so as to print out the first
100 instances of the y_pred and the
y_test, and some results should be
observed immediately.

We can now visualize the labels which

have been predicted. This can be done

as shown below:

# Import `Isomap()`
from sklearn.manifold import Isomap
# Create an isomap and fit the `digits` data to it
X_iso = Isomap(n_neighbors=10).fit_transform(X_train)
# Compute the cluster centers and then predict the cluster index for each sample
clusters = clf.fit_predict(X_train)
# Create plot with the subplots in grid measuring 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))
# Adjust the layout
fig.suptitle('Predicted Versus the Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)
# Add the scatterplots to subplots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')
# Show the plots
plt.show()
Isomap() is used here to reduce the dimensionality of the high-dimensional digits data set. It differs from PCA in that Isomap is a non-linear reduction method. You can try running the above code with PCA instead of Isomap and compare the results. The solution is shown here:

# Import `PCA()`
from sklearn.decomposition import PCA

# Model and then fit the `digits` data to the PCA model
X_pca = PCA(n_components=2).fit_transform(X_train)

# Compute the cluster centers and predict the cluster index for each sample
clusters = clf.fit_predict(X_train)

# Create a plot with subplots in a 1x2 grid
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.suptitle('Predicted Versus the Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)

# Add the scatterplots to the subplots
ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')

# Show plots
plt.show()
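To see why PCA counts as a linear reduction method, it may help to write the projection out by hand. The sketch below is not how scikit-learn is meant to be used; the function name `pca_2d` and the random stand-in data with 64 features are assumptions for illustration. It centers the data and projects it onto the first two principal directions obtained from a singular value decomposition.

```python
import numpy as np

def pca_2d(X):
    """Project X onto its first two principal components (a purely linear map)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Hypothetical stand-in for the digits features: 100 samples with 64 features each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 64))

X_2d = pca_2d(X_demo)
print(X_2d.shape)
print('variance along each component:', X_2d.var(axis=0))
```

Every output coordinate is a weighted sum of the input features, which is what "linear" means here; Isomap instead builds its embedding from distances along a neighborhood graph.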

Evaluating the Clustering

We should now evaluate the performance of our model. In other words, we need to know how accurate our model's predictions are.

Let us begin by printing out a confusion matrix:

# Import `metrics` from `sklearn`
from sklearn import metrics

# Print out the confusion matrix with `confusion_matrix()`
print(metrics.confusion_matrix(y_test, y_pred))


The confusion matrix alone does not tell the whole story. We can apply several cluster quality metrics to learn how good our clusters are, that is, how well the cluster labels fit the correct labels. Consider the following:

from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score, adjusted_rand_score,
                             adjusted_mutual_info_score, silhouette_score)

print('% 9s' % 'inertia homo compl v-meas ARI AMI silhouette')
print('%i %.3f %.3f %.3f %.3f %.3f %.3f'
      % (clf.inertia_,
         homogeneity_score(y_test, y_pred),
         completeness_score(y_test, y_pred),
         v_measure_score(y_test, y_pred),
         adjusted_rand_score(y_test, y_pred),
         adjusted_mutual_info_score(y_test, y_pred),
         silhouette_score(X_test, y_pred, metric='euclidean')))

Several metrics are worth considering. The homogeneity score tells us the extent to which each cluster contains only data points which belong to a single class. The completeness score measures the extent to which all data points that are members of a given class are also elements of the same cluster. The V-measure score is the harmonic mean of homogeneity and completeness.

The adjusted Rand index (ARI) measures the similarity between two clusterings. It considers all pairs of samples and counts the pairs which are assigned to the same or to different clusters in the predicted and true clusterings.

The Adjusted Mutual Info (AMI) score is also used to compare clusterings. It measures the similarity between the data points in the two clusterings while correcting for chance groupings, and it takes a maximum value of 1 when the clusterings are equivalent.

The silhouette score measures how similar an object is to its own cluster compared to the other clusters. Its value ranges between -1 and 1. A higher value indicates that the object is well matched to the cluster it belongs to and poorly matched to the neighboring clusters. If many points have a high value, the cluster configuration is good.
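The first three scores can be made concrete with a from-scratch sketch based on entropy. This is an illustration under the usual entropy-based definitions, not scikit-learn's code; the function names and the four-sample toy labelings are assumptions chosen to show the extreme cases.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropy(a, b):
    """H(a | b): the uncertainty left in `a` once `b` is known."""
    a, b = np.asarray(a), np.asarray(b)
    total = 0.0
    for value in np.unique(b):
        mask = b == value
        total += mask.mean() * entropy(a[mask])
    return total

def homogeneity_completeness_v(y_true, y_pred):
    h_classes, h_clusters = entropy(y_true), entropy(y_pred)
    # Homogeneity: knowing the cluster should leave no doubt about the class.
    homogeneity = 1.0 if h_classes == 0 else 1.0 - conditional_entropy(y_true, y_pred) / h_classes
    # Completeness: knowing the class should leave no doubt about the cluster.
    completeness = 1.0 if h_clusters == 0 else 1.0 - conditional_entropy(y_pred, y_true) / h_clusters
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    # V-measure: harmonic mean of the two scores.
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v

y_true = [0, 0, 1, 1]
print(homogeneity_completeness_v(y_true, [1, 1, 0, 0]))  # same grouping, swapped cluster ids
print(homogeneity_completeness_v(y_true, [0, 1, 2, 3]))  # every point in its own cluster
print(homogeneity_completeness_v(y_true, [0, 0, 0, 0]))  # everything in one cluster
```

The first case shows that these scores do not care which number a cluster is given, which matters for k-means because its cluster indices are arbitrary.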

From the above explanation, it is clear that our values are not good. In our case, the silhouette score is about 0, which means that samples lie very close to the decision boundary between two neighboring clusters. This is an indication that the samples might have been assigned to the wrong clusters.

The ARI shows that not all data points in a given cluster are similar, while the completeness score shows that some data points were not assigned to the correct cluster. You should consider another estimator to predict the labels for the digits data.

Support Vector Machines

Consider the code given below:

# Import `train_test_split`
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(
    digits.data, digits.target, digits.images, test_size=0.25, random_state=42)

# Import `svm` model
from sklearn import svm

# Create SVC model
svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')

# Fit data to SVC model
svc_model.fit(X_train, y_train)

Following the scikit-learn algorithm map, the first model to try is a linear SVC, which has been applied to the digits data here. Note that we have used both X_train and y_train to fit the data to our SVC model. This is very different from clustering, where no labels are used. Also, the value for gamma has been set manually (it only affects non-linear kernels such as rbf, so it has no effect with the linear kernel). You can automatically obtain good values for the parameters by the use of tools such as cross validation and grid search.

Grid search is a systematic way to adjust the parameters. This can be done as shown below:

# Split `digits` data into two equal sets
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, random_state=0)

# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Set parameter candidates
parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

# Create a classifier with the parameter candidates
clf = GridSearchCV(estimator=svm.SVC(), param_grid=parameter_candidates)

# Train classifier on the training data
clf.fit(X_train, y_train)

# Print out results
print('Best score for training data:', clf.best_score_)
print('Best parameters:', clf.best_params_)
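To make clear what the grid search actually iterates over, the sketch below expands the parameter candidates into the individual combinations. It uses only the standard library; the helper name `expand_grid` is an assumption for illustration (scikit-learn provides its own `ParameterGrid` class for this purpose).

```python
from itertools import product

parameter_candidates = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

def expand_grid(param_grids):
    """List every parameter combination that a grid search would evaluate."""
    combos = []
    for grid in param_grids:
        names = sorted(grid)
        # Cartesian product of the value lists within one grid.
        for values in product(*(grid[name] for name in names)):
            combos.append(dict(zip(names, values)))
    return combos

combos = expand_grid(parameter_candidates)
print(len(combos), 'candidate settings')
for combo in combos[:3]:
    print(combo)
```

Each of these settings is fitted and scored with cross validation, so the cost of a grid search grows with the product of the list lengths.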






You should then take the classifier with the parameter candidates which have just been created and apply it to the second part of the data set. Next, a new classifier should be trained with the best parameters found via grid search. You will score the result to see whether the best parameters found in the grid search actually work.

# Apply the classifier to the test data, and view the accuracy score
clf.score(X_test, y_test)

# Train and then score a new classifier with the grid search parameters
svm.SVC(C=10, kernel='rbf', gamma=0.001).fit(X_train, y_train).score(X_test, y_test)

You will see that the parameters work well. In the SVC classifier created earlier, the penalty parameter of the error term, C, was set to 100, and the kernel was explicitly specified as linear. The kernel argument specifies the kernel which is to be used in the algorithm, and it defaults to rbf. There exist other types of kernels which you can specify, such as poly and linear.

A kernel can be seen as a similarity function: it computes the similarity between training data points. Once the kernel has been provided to an algorithm together with the labels and the training data, one gets a classifier, just as we have in this case. Here we have trained a model which is able to assign unseen objects to a specific category. An SVM typically tries to divide the data points linearly.

The results obtained from the grid search indicate that an rbf kernel would have worked better. Both gamma and the penalty parameter were well specified.
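The idea of a kernel as a similarity function can be sketched in a few lines of NumPy. The two functions below follow the standard linear and RBF kernel formulas; the function names and the three sample vectors are assumptions for illustration and are not part of the digits example.

```python
import numpy as np

def linear_kernel(x, y):
    """Linear kernel: the plain dot product of the two vectors."""
    return float(np.dot(x, y))

def rbf_kernel(x, y, gamma=0.001):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

a = np.array([0.0, 1.0, 2.0])
b = np.array([0.0, 1.0, 2.5])    # close to a
c = np.array([10.0, -5.0, 7.0])  # far from a

print('rbf(a, a) =', rbf_kernel(a, a))
print('rbf(a, b) =', rbf_kernel(a, b))
print('rbf(a, c) =', rbf_kernel(a, c))
print('linear(a, b) =', linear_kernel(a, b))
```

With the RBF kernel, identical points get the largest possible similarity and the value shrinks as points move apart; a larger gamma makes it shrink faster.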

We can now proceed with our trained model and predict values for the test set. This is shown below:

# Predict the labels of `X_test`
print(svc_model.predict(X_test))

# Print `y_test` to check the results
print(y_test)



The images can also be visualized together with their predicted labels. This is shown below:

# Import matplotlib
import matplotlib.pyplot as plt

# Assign the predicted values to `predicted`
predicted = svc_model.predict(X_test)

# Zip together `images_test` and the `predicted` values in a list
images_and_predictions = list(zip(images_test, predicted))

# For the first 4 elements in `images_and_predictions`
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    # Initialize the subplots in a 1x4 grid at position index+1
    plt.subplot(1, 4, index + 1)
    # Do not show the axes
    plt.axis('off')
    # Display the image in the subplot
    plt.imshow(image, cmap=plt.cm.binary)
    # Add a title to the plot
    plt.title('Predicted: ' + str(prediction))

# Show the plot
plt.show()


Note that we have zipped the images together with the predicted values, and only the first four elements of images_and_predictions have been taken. We should now determine the performance of the model. This is shown below:

# Import `metrics`
from sklearn import metrics

# Print the classification report of `y_test` and `predicted`
print(metrics.classification_report(y_test, predicted))

# Print the confusion matrix of `y_test` and `predicted`
print(metrics.confusion_matrix(y_test, predicted))
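What the confusion matrix and the classification report summarize can be shown with a small hand-rolled version. The sketch below is an illustration only: the helper name `confusion_counts` and the seven toy labels are assumptions. It follows the convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count how often true class t (row) was predicted as class p (column)."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels, made up for this sketch.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

cm = confusion_counts(y_true, y_pred, n_classes=3)
# Precision: of everything predicted as a class, how much was right (column-wise).
precision = cm.diagonal() / cm.sum(axis=0)
# Recall: of everything that truly is a class, how much was found (row-wise).
recall = cm.diagonal() / cm.sum(axis=1)
accuracy = cm.trace() / cm.sum()

print(cm)
print('precision per class:', precision)
print('recall per class:   ', recall)
print('accuracy:', accuracy)
```

Correct predictions sit on the diagonal, so a good classifier produces a matrix whose off-diagonal entries are mostly zero.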

You will notice that this model performs much better than the clustering model which we used earlier. It is also possible for you to visualize the predicted and actual labels by use of Isomap(). This is shown below:

# Import `Isomap()`
from sklearn.manifold import Isomap

# Create an isomap and fit the `digits` data to it
X_iso = Isomap(n_components=2).fit_transform(X_train)

# Predict the labels of the training samples with the SVC model
predicted = svc_model.predict(X_train)

# Create a plot with subplots in a 1x2 grid
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust the layout
fig.subplots_adjust(top=0.85)

# Add the scatterplots to the subplots
ax[0].scatter(X_iso[:, 0], X_iso[:, 1], c=predicted)
ax[0].set_title('Predicted labels')
ax[1].scatter(X_iso[:, 0], X_iso[:, 1], c=y_train)
ax[1].set_title('Actual Labels')

# Add title
fig.suptitle('Predicted versus actual labels', fontsize=14, fontweight='bold')

# Show the plot
plt.show()


The visualization should confirm the classification report.

Chapter 3- Logistic Regression

Logistic regression is a classification algorithm. Its learning process is closely related to that of linear regression, with the difference being that the cost and gradient functions are formulated differently. Logistic regression passes its output through a sigmoid (logistic) activation function rather than producing the continuous output of linear regression.

To demonstrate how this works, let us begin by importing the data which is to be used in the exercise:

import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Build the path portably instead of hard-coding Windows backslashes
path = os.path.join(os.getcwd(), 'data', 'data1.txt')
data = pd.read_csv(path, header=None, names=['Exam 1', 'Exam 2', 'Admitted'])


This data has two continuous independent variables, Exam 1 and Exam 2. The Admitted label is our prediction target, and it is binary-valued: a value of 0 indicates that the student has not been admitted, while a value of 1 indicates that the student has been admitted. This can be visualized with colors on a graph by use of the following code:

# Split the rows by class
positive = data[data['Admitted'].isin([1])]
negative = data[data['Admitted'].isin([0])]

fig, ax = plt.subplots(figsize=(12,8))
ax.scatter(positive['Exam 1'], positive['Exam 2'], s=50, c='b', marker='o', label='Admitted')
ax.scatter(negative['Exam 1'], negative['Exam 2'], s=50, c='r', marker='x', label='Not Admitted')
ax.legend()
ax.set_xlabel('Exam 1 Score')
ax.set_ylabel('Exam 2 Score')

The plot suggests a nearly linear decision boundary. Because it curves slightly, it is impossible to classify all the examples correctly with a straight line, but it is possible to get close.

It is now time for us to implement logistic regression so as to train a model which finds the optimal decision boundary and then makes class predictions. Our first step should be the implementation of a sigmoid function. This is shown below:

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Above is the activation function for the output of logistic regression. It converts a continuous input into a value between 0 and 1. This value can be interpreted as the class probability, that is, the likelihood that an input example should be classified positively. Combining this probability with a threshold value gives a discrete label prediction. It helps to visualize the output of the function so as to see what is happening. This is shown below:

nums = np.arange(-10, 10, step=1)

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(nums, sigmoid(nums), 'r')
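The step from probability to discrete label can be sketched numerically. The input scores below are made up for illustration, and the threshold of 0.5 is the conventional choice rather than something fixed by the method.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical linear scores for five examples.
z = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])

probabilities = sigmoid(z)
# Apply a 0.5 threshold to turn probabilities into class labels.
labels = (probabilities >= 0.5).astype(int)

print(probabilities)
print(labels)
```

A score of zero maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class when the threshold is 0.5.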

We should then go ahead and write a cost function. The cost function evaluates the model's performance on the training data for a given set of model parameters. The cost function for logistic regression is as follows:

def cost(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    return np.sum(first - second) / len(X)

The output is reduced to a single scalar value: the error averaged over the training examples, quantified as a function of the difference between the class probability assigned by the model and the true label. Note that the implementation is vectorized, because the model's predictions for the whole data set are computed in one statement, which is sigmoid(X * theta.T).
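Written as a formula, the cost computed by the code above has the following form. The notation is an assumption introduced here for illustration: m is the number of training examples, x^{(i)} and y^{(i)} are the i-th example and its label, and h_theta is the sigmoid of the linear score.

```latex
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}, \qquad
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ -y^{(i)} \log h_\theta\big(x^{(i)}\big) - \big(1 - y^{(i)}\big) \log\big(1 - h_\theta\big(x^{(i)}\big)\big) \right]
```

The two terms inside the sum correspond to the variables first and second in the code, and the division by m corresponds to dividing by len(X).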


Now is a good time to test the cost function so as to be sure that it is working well, but we first need to do some setup. This can be done as shown below:

# add a ones column - this makes the matrix multiplication work out more easily
data.insert(0, 'Ones', 1)

# set X (training data) and y (target variable)
cols = data.shape[1]
X = data.iloc[:, 0:cols-1]
y = data.iloc[:, cols-1:cols]

# convert to numpy arrays and initialize the parameter array theta
X = np.array(X.values)
y = np.array(y.values)
theta = np.zeros(3)

It is good practice to check the shapes of the data structures which you are using to make sure that they are sensible. This is especially useful when implementing matrix multiplication.

X.shape, theta.shape, y.shape

With the model parameters theta initialized to zeros, we can calculate the cost of the initial solution:

cost(theta, X, y)
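A handy sanity check for this step: when every parameter is zero, the sigmoid outputs 0.5 for every example, so the cost equals ln 2 (about 0.693) no matter what the data look like. The sketch below verifies this with a plain-array version of the cost function; the five rows of exam scores are made up and merely stand in for the real data file.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    """Mean cross-entropy cost, written with plain arrays instead of np.matrix."""
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

# Made-up exam scores with a leading column of ones, as in the setup above.
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(5), rng.uniform(30, 100, size=(5, 2))])
y_demo = np.array([0, 0, 1, 1, 1])

initial_cost = cost(np.zeros(3), X_demo, y_demo)
print(initial_cost, np.log(2))
```

If the cost at theta equal to zero differs from ln 2, the cost function or the data setup probably contains a mistake.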

Our cost function is working. Our next step should be to write a function which computes the gradient of the model parameters, so that we learn how to change the parameters to improve the model's outcome on the training data. The function can be written as shown below:

def gradient(theta, X, y):
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)

    # number of model parameters
    parameters = int(theta.ravel().shape[1])
    grad = np.zeros(parameters)

    error = sigmoid(X * theta.T) - y

    for i in range(parameters):
        term = np.multiply(error, X[:,i])
        grad[i] = np.sum(term) / len(X)

    return grad
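One way to gain confidence in a gradient function is to compare it with a numerical approximation of the cost's slope. The sketch below does this with plain-array versions of the cost and gradient; the function `numerical_gradient`, the random data, and the test point theta are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """Vectorized gradient of the mean cross-entropy cost."""
    return X.T @ (sigmoid(X @ theta) - y) / len(X)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central-difference estimate of the gradient, one parameter at a time."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Random stand-in data: a ones column plus two features, with random binary labels.
rng = np.random.default_rng(1)
X_demo = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
y_demo = (rng.uniform(size=20) > 0.5).astype(float)
theta = np.array([0.1, -0.2, 0.3])

analytic = gradient(theta, X_demo, y_demo)
numeric = numerical_gradient(theta, X_demo, y_demo)
print('largest difference:', np.max(np.abs(analytic - numeric)))
```

The single matrix expression in this version of gradient computes the same quantity as the per-parameter loop above, just without the explicit loop.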

Note that this function does not perform gradient descent; it only computes a single gradient step. Let's use the fmin_tnc function from SciPy's optimize module, which optimizes the parameters given functions for the computation of the cost and the gradients. This is shown below:

import scipy.optimize as opt

result = opt.fmin_tnc(func=cost, x0=theta, fprime=gradient, args=(X, y))
cost(result[0], X, y)
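To show what an optimizer does with the cost and gradient functions, here is a minimal sketch that swaps in plain gradient descent for the SciPy routine. It is a stand-in, not the method used above: the synthetic standardized scores, the noisy admission rule, the learning rate, and the number of iterations are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(X)

# Synthetic data: two standardized scores, admitted when their noisy sum is positive.
rng = np.random.default_rng(0)
n = 200
scores = rng.normal(size=(n, 2))
noise = rng.normal(size=n)
y = (scores.sum(axis=1) + noise > 0).astype(float)
X = np.column_stack([np.ones(n), scores])

theta = np.zeros(3)
initial_cost = cost(theta, X, y)

# Plain gradient descent: repeatedly step against the gradient.
learning_rate = 0.1
for _ in range(500):
    theta -= learning_rate * gradient(theta, X, y)
final_cost = cost(theta, X, y)

# Training accuracy as a percentage of correctly classified examples.
predictions = (sigmoid(X @ theta) >= 0.5).astype(float)
accuracy = 100.0 * np.mean(predictions == y)

print('cost before:', initial_cost)
print('cost after: ', final_cost)
print('training accuracy (%):', accuracy)
```

A dedicated optimizer usually reaches the minimum in far fewer steps and without a hand-picked learning rate, which is why it is the more practical choice.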

Our next step should be writing a function which gives us predictions for the dataset X by use of the learned parameters theta. The function can then be used for scoring the training accuracy of the classifier.

def predict(theta, X):
    probability = sigmoid(X * theta.T)
    return [1 if x >= 0.5 else 0 for x in probability]

theta_min = np.matrix(result[0])
predictions = predict(theta_min, X)
correct = [1 if ((a == 1 and b == 1) or (a == 0 and b == 0)) else 0 for (a, b) in zip(predictions, y)]
# Percentage of correct predictions
accuracy = 100.0 * sum(map(int, correct)) / len(correct)
print('accuracy = {0}%'.format(accuracy))


This gives an accuracy of 89%, which is a good result. Keep in mind that this is the accuracy on the training set, since no separate test set was held out.

We have come to the end of this book. That is how machine learning is done in Python. Always begin by choosing the training data set, and then continue with the rest of the steps!