
OLS ALGORITHM

What Is the Least Squares Method?


The "least squares" method is a form of mathematical regression analysis used
to determine the line of best fit for a set of data, providing a visual demonstration
of the relationship between the data points. Each point of data represents the
relationship between a known independent variable and an unknown dependent
variable.

What Does the Least Squares Method Tell You?


The least squares method provides the overall rationale for the placement of the
line of best fit among the data points being studied. The most common
application of this method, sometimes referred to as "linear" or "ordinary" least
squares, aims to create a straight line that minimizes the sum of the squared
errors, that is, the squared residuals between each observed value and the value
predicted by the model.

This method of regression analysis begins with a set of data points to be plotted
on an x- and y-axis graph. An analyst using the least squares method will
generate a line of best fit that explains the potential relationship between
independent and dependent variables.

In regression analysis, dependent variables are illustrated on the vertical y-axis,


while independent variables are illustrated on the horizontal x-axis. These
designations will form the equation for the line of best fit, which is determined
from the least squares method.

In contrast to a linear problem, a non-linear least squares problem has no closed-form
solution and is generally solved by iteration. The discovery of the least squares
method is attributed to Carl Friedrich Gauss, who developed the method in 1795.

KEY TAKEAWAYS

 The least squares method is a statistical procedure to find the best fit for a
set of data points by minimizing the sum of the offsets or residuals of
points from the plotted curve.
 Least squares regression is used to predict the behavior of dependent
variables.
Example of the Least Squares Method
An example of the least squares method is an analyst who wishes to test the
relationship between a company’s stock returns, and the returns of the index for
which the stock is a component. In this example, the analyst seeks to test the
dependence of the stock returns on the index returns. To achieve this, all of the
returns are plotted on a chart. The index returns are then designated as the
independent variable, and the stock returns are the dependent variable. The line
of best fit provides the analyst with coefficients explaining the level of
dependence.
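As a rough illustration of this example, here is a minimal Python sketch using hypothetical daily returns (the numbers are made up); np.polyfit performs the ordinary least squares fit of a straight line.

import numpy as np

index_returns = np.array([0.010, -0.004, 0.007, 0.002, -0.011, 0.005])  # independent variable (x)
stock_returns = np.array([0.012, -0.006, 0.010, 0.001, -0.015, 0.007])  # dependent variable (y)

# degree-1 polyfit = ordinary least squares fit of a straight line
slope, intercept = np.polyfit(index_returns, stock_returns, 1)
print(f"stock_return = {intercept:.5f} + {slope:.3f} * index_return")

The fitted slope is the coefficient describing how strongly the stock's returns depend on the index returns.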

The Line of Best Fit Equation


The line of best fit determined from the least squares method has an equation
that tells the story of the relationship between the data points. Line of best fit
equations may be determined by computer software models, which include a
summary of outputs for analysis, where the coefficients and summary outputs
explain the dependence of the variables being tested.

Least Squares Regression Line


If the data shows a linear relationship between two variables, the line that best
fits this linear relationship is known as the least squares regression line, which
minimizes the vertical distances from the data points to the regression line. The
term "least squares" is used because this line yields the smallest possible sum of
squared errors (the residual variance).

Introduction to Gradient Descent Algorithm


(along with variants) in Machine Learning
Introduction
Optimization is always the ultimate goal whether you are dealing with a real life problem or
building a software product. I, as a computer science student, always fiddled with optimizing
my code to the extent that I could brag about its fast execution.

Optimization basically means getting the optimal output for your problem. If you read the
recent article on optimization, you would be acquainted with how optimization plays an
important role in our real-life.

Optimization in machine learning is slightly different. Generally, while optimizing, we know
exactly what our data looks like and which areas we want to improve. But in machine learning
we have no idea what our "new data" will look like, let alone how to optimize on it.

So in machine learning, we perform optimization on the training data and check its
performance on new validation data.
Broad applications of Optimization
There are various kinds of optimization techniques which are applied across various domains
such as

 Mechanics – For eg: deciding the surface of an aerospace design
 Economics – For eg: cost minimization
 Physics – For eg: optimizing time in quantum computing

Optimization has many more advanced applications like deciding optimal route for
transportation, shelf-space optimization, etc.

Many popular machine learning algorithms, such as linear regression, k-nearest neighbours
and neural networks, depend upon optimization techniques. The applications of optimization
are limitless, and it is a widely researched topic in both academia and industry.

In this article, we will look at a particular optimization technique called Gradient Descent. It
is the most commonly used optimization technique when dealing with machine learning.

Table of Content
1. What is Gradient Descent?
2. Challenges in executing Gradient Descent
3. Variants of Gradient Descent algorithm
4. Implementation of Gradient Descent
5. Practical tips on applying gradient descent
6. Additional Resources

1. What is Gradient Descent?


To explain Gradient Descent I’ll use the classic mountaineering example.

Suppose you are at the top of a mountain, and you have to reach a lake which is at the lowest
point of the mountain (a.k.a valley). A twist is that you are blindfolded and you have zero
visibility to see where you are headed. So, what approach will you take to reach the lake?
The best way is to check the ground near you and observe where the land tends to descend.
This will give an idea in what direction you should take your first step. If you follow the
descending path, it is very likely you would reach the lake.

To represent this graphically, notice the below graph.

Let us now map this scenario in mathematical terms.

Suppose we want to find out the best parameters (θ1) and (θ2) for our learning algorithm.
Similar to the analogy above, we find comparable mountains and valleys when we plot our
"cost space". The cost space is nothing but how our algorithm would perform when we choose a
particular value for a parameter.

So on the y-axis we have the cost J(θ), plotted against our parameters θ1 and θ2 on the x-axis
and z-axis respectively. Here, hills are regions of high cost and valleys are regions of low cost.

Now there are many types of gradient descent algorithms. They can be classified by two
methods mainly:

 On the basis of data ingestion


1. Full Batch Gradient Descent Algorithm
2. Stochastic Gradient Descent Algorithm

In full batch gradient descent algorithms, you use the whole dataset at once to compute the
gradient, whereas in stochastic gradient descent you take a sample of the data while computing
the gradient.
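As a rough sketch of the difference, here both styles compute the gradient of a mean squared error cost; the data, model and sample size are purely illustrative.

import numpy as np

def gradient(X, y, theta):
    # gradient of the mean squared error cost (1/m) * ||X.theta - y||^2
    m = len(y)
    return (2.0 / m) * X.T.dot(X.dot(theta) - y)

X = np.random.randn(1000, 3)
y = X.dot(np.array([1.5, -2.0, 0.5])) + 0.1 * np.random.randn(1000)
theta = np.zeros(3)

full_batch_grad = gradient(X, y, theta)            # full batch: uses all 1000 rows

idx = np.random.choice(len(y), size=32, replace=False)
stochastic_grad = gradient(X[idx], y[idx], theta)  # stochastic: uses a random sample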

 On the basis of differentiation techniques


1. First order Differentiation
2. Second order Differentiation

Gradient descent requires calculating the gradient by differentiating the cost function. We can
either use first order differentiation or second order differentiation.

2. Challenges in executing Gradient Descent


Gradient Descent is a sound technique which works in most cases. But there are many
cases where gradient descent does not work properly or fails to work altogether. There are
three main areas where this can happen:

1. Data challenges
2. Gradient challenges
3. Implementation challenges

2.1 Data Challenges


 If the data is arranged in a way that poses a non-convex optimization problem, it is
very difficult to perform optimization using gradient descent. Gradient descent works best
for problems with a well-defined convex optimization objective.
 Even when optimizing a convex problem, there may be numerous minimal points. The
lowest point is called the global minimum, whereas the rest of the points are called local
minima. Our aim is to reach the global minimum while avoiding local minima.
 There is also the saddle point problem. This is a point where the gradient is zero but
which is not an optimal point. There is no specific way to avoid such points, and this is
still an active area of research.
2.2 Gradient Challenges
 If the execution is not done properly while using gradient descent, it may lead to
problems like vanishing gradients or exploding gradients. These problems occur when the
gradient is too small or too large, and because of this the algorithm does not converge.

2.3 Implementation Challenges


 Most neural network practitioners don't generally pay attention to implementation,
but it's very important to look at the resource utilization of networks. For example, when
implementing gradient descent, it is very important to note how many resources you will
require. If the memory is too small for your application, then the network would fail.
 Also, it's important to keep track of things like floating point considerations and
hardware / software prerequisites.

3. Variants of Gradient Descent algorithms


Let us look at most commonly used gradient descent algorithms and their implementations.

3.1 Vanilla Gradient Descent


This is the simplest form of gradient descent technique. Here, vanilla means pure / without
any adulteration. Its main feature is that we take small steps in the direction of the minima by
taking gradient of the cost function.

Let’s look at its pseudocode.

update = learning_rate * gradient_of_parameters

parameters = parameters - update

Here, we make an update to the parameters by taking the gradient of the cost function with
respect to the parameters and multiplying it by a learning rate, which is essentially a constant
suggesting how fast we want to move towards the minimum. The learning rate is a
hyper-parameter and should be chosen with care.


3.2 Gradient Descent with Momentum


Here, we tweak the above algorithm in such a way that we pay heed to the prior step before
taking the next step.

Here’s a pseudocode.

update = learning_rate * gradient

velocity = previous_update * momentum


parameter = parameter + velocity - update

Here, our update is the same as that of vanilla gradient descent. But we introduce a new
term called velocity, which considers the previous update and a constant which is called
momentum.


3.3 ADAGRAD
ADAGRAD uses an adaptive technique for updating the learning rate. In this algorithm, we
adapt the learning rate on the basis of how the gradient has been changing over all the
previous iterations.

Here’s a pseudocode

grad_component = previous_grad_component + (gradient * gradient)

rate_change = square_root(grad_component) + epsilon

adapted_learning_rate = learning_rate / rate_change


update = adapted_learning_rate * gradient

parameter = parameter - update

In the above code, epsilon is a small constant used to avoid division by zero and to keep the
rate of change of the learning rate in check.

3.4 ADAM
ADAM is one more adaptive technique which builds on ADAGRAD and further reduces its
downsides. In other words, you can think of it as momentum + ADAGRAD.

Here’s a pseudocode.

adapted_gradient = (beta1 * previous_adapted_gradient) + ((1 - beta1) * gradient)

gradient_component = (beta2 * previous_gradient_component) + ((1 - beta2) * gradient * gradient)

adapted_learning_rate = learning_rate / (square_root(gradient_component) + epsilon)

update = adapted_learning_rate * adapted_gradient

parameter = parameter - update

Here beta1 and beta2 are constants (decay rates) that keep the changes in the gradient and the
learning rate in check, and epsilon plays the same role as in ADAGRAD.
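For reference, here is a minimal numpy sketch of an ADAM-style update applied to a simple quadratic cost; the cost function, learning rate and iteration count are illustrative, not taken from the article.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    m = beta1 * m + (1 - beta1) * grad              # moving average of the gradient
    v = beta2 * v + (1 - beta2) * (grad ** 2)       # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)                    # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return theta, m, v

# minimize J(theta) = theta^2, whose gradient is 2 * theta
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)  # close to 0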

There are also second order methods like L-BFGS. You can see an implementation of this
algorithm in the scipy library.
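As a small illustration of calling such a method from scipy (the cost function here is the standard Rosenbrock test function, chosen only for illustration):

import numpy as np
from scipy.optimize import minimize

def cost(theta):
    # Rosenbrock function, a common optimization test problem
    return (1 - theta[0]) ** 2 + 100 * (theta[1] - theta[0] ** 2) ** 2

def grad(theta):
    # analytic gradient of the Rosenbrock function
    return np.array([
        -2 * (1 - theta[0]) - 400 * theta[0] * (theta[1] - theta[0] ** 2),
        200 * (theta[1] - theta[0] ** 2),
    ])

result = minimize(cost, x0=np.array([-1.0, 2.0]), jac=grad, method='L-BFGS-B')
print(result.x)  # close to [1.0, 1.0], the true minimum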
4. Implementation of Gradient Descent
We will now look at a basic implementation of gradient descent using python.

Here we will use gradient descent optimization to find the best parameters for our deep
learning model on an image recognition problem: identifying digits from a given 28 x 28
image. We have a subset of images for training and the rest for testing our model. In this
article we will look at how we define gradient descent and see how our algorithm performs.
Refer to this article for an end-to-end implementation in Python.

Here is the main code for defining vanilla gradient descent,

import theano.tensor as T  # theano is imported as T to compute the gradients

params = [weights_hidden, weights_output, bias_hidden, bias_output]

def sgd(cost, params, lr=0.05):
    # gradients of the cost with respect to every parameter
    grads = T.grad(cost=cost, wrt=params)
    updates = []
    # vanilla gradient descent update for each parameter
    for p, g in zip(params, grads):
        updates.append([p, p - g * lr])
    return updates

updates = sgd(cost, params)

Now we break it down to understand it better.

We defined a function sgd with arguments cost, params and lr. These represent J(θ) as
seen previously, θ (i.e. the parameters of our deep learning algorithm) and our learning rate.
We set the default learning rate to 0.05, but this can easily be changed as per our preference.

def sgd(cost, params, lr=0.05):

We then defined the gradients of our parameters with respect to the cost function. Here we
use the theano library to find the gradients, having imported theano.tensor as T.

grads = T.grad(cost=cost, wrt=params)

and finally iterated through all the parameters to find out the updates for all possible
parameters. You can see that we use vanilla gradient descent here.

for p, g in zip(params, grads):
    updates.append([p, p - g * lr])

We can then use this function to find the optimal parameters for our neural network. On
using this function, we find that our neural network does a good enough job of identifying
the digits in our images, as seen below.

Prediction is: 8
In this implementation, we see that on using gradient descent we can get optimal
parameters for our deep learning algorithm.

5. Practical tips on applying gradient descent


Each of the above mentioned gradient descent algorithms has its strengths and
weaknesses. I'll just mention some quick tips which might help you choose the right algorithm.

 For rapid prototyping, use adaptive techniques like Adam/Adagrad. These help in
getting quicker results with much less effort, as you don't require much hyper-parameter
tuning.
 To get the best results, you should use vanilla gradient descent or momentum.
Gradient descent is slower to reach the desired results, but these results are mostly better
than those of adaptive techniques.
 If your data is small enough to fit in a single iteration, you can use second order
techniques like L-BFGS. This is because second order techniques are extremely fast and
accurate, but are only feasible when the data is small enough.
 There is also an emerging method (which I haven't tried but looks promising) of using
learned features to predict the learning rates of gradient descent. Go through this paper for
more details.

Now there are many reasons why a neural network fails to learn. But it helps immensely if
you can monitor where your algorithm is going wrong.
When applying gradient descent, you can look at these points which might be helpful in
circumventing the problem:

 Error rates – You should check the training and testing error after specific iterations
and make sure both of them decrease. If that is not the case, there might be a
problem!
 Gradient flow in hidden layers – Check that the network doesn't show a vanishing
gradient or exploding gradient problem.
 Learning rate – which you should check when using adaptive techniques.

A comprehensive beginners guide for Linear,


Ridge and Lasso Regression in Python and R
Table of Contents
1. Simple models for Prediction
2. Linear Regression
3. The Line of Best Fit
4. Gradient Descent
5. Using Linear Regression for prediction
6. Evaluate your Model – R square and Adjusted R squared
7. Using all the features for prediction
8. Polynomial Regression
9. Bias and Variance
10. Regularization
11. Ridge Regression
12. Lasso Regression
13. ElasticNet Regression
14. Types of Regularization Techniques [Optional]

1. Simple models for Prediction


Let us start by making predictions using a few simple approaches. If I were to ask
you what could be the simplest way to predict the sales of an item, what would you say?

Model 1 – Mean sales:


Even without any knowledge of machine learning, you can say that if you have to predict
sales for an item, it would be the average sales over the last few days / months / weeks.

It is a good thought to start, but it also raises a question – how good is that model?

It turns out that there are various ways in which we can evaluate how good our model is. The
most common way is the Mean Squared Error. Let us understand how to measure it.
Prediction error

To evaluate how good a model is, let us understand the impact of wrong predictions. If we
predict sales to be higher than what they turn out to be, the store will spend a lot of money
making unnecessary arrangements, which would lead to excess inventory. On the other hand,
if I predict it too low, I will lose out on sales opportunities.

So, the simplest way of calculating error would be to calculate the difference between the
predicted and actual values. However, if we simply add these differences, they might cancel
out, so we square the errors before adding them. We also divide the sum by the number of
data points to obtain a mean error, so that it does not depend on the number of data points.

MSE = (e1² + e2² + .... + en²) / n

This is known as the mean squared error. Here e1, e2, ...., en are the differences between the
actual and the predicted values, and n is the number of data points.

So, in our first model, what would be the mean squared error? On predicting the mean for all
the data points, we get a mean squared error of 29,11,799. That looks like a huge error. Maybe
it's not such a good idea to simply predict the average value.

Let's see if we can think of something to reduce the error.

Model 2 – Average Sales by Location:


We know that location plays a vital role in the sales of an item. For example, let us say
sales of cars would be much higher in Delhi than in Varanasi. Therefore let us use
the data of the column 'Outlet_Location_Type'.

So basically, let us calculate the average sales for each location type and predict
accordingly.

On predicting the same, we get mse = 28,75,386, which is less than in our previous case. So
we can see that by using a characteristic (location), we have reduced the error.

Now, what if there are multiple features on which the sales depend? How would
we predict sales using this information? Linear regression comes to our rescue.
2. Linear Regression
Linear regression is the simplest and most widely used statistical technique for predictive
modeling. It basically gives us an equation with our features as independent variables, on
which our target variable (sales, in our case) depends.

So what does the equation look like? The linear regression equation looks like this:

Y = Θ0 + Θ1*X1 + Θ2*X2 + .... + Θn*Xn

Here, we have Y as our dependent variable (Sales), X’s are the independent variables and
all thetas are the coefficients. Coefficients are basically the weights assigned to the
features, based on their importance. For example, if we believe that the sales of an item
depend more on the type of location than on the size of the store, it means that sales would
be higher in a smaller outlet in a tier 1 city than in a bigger outlet in a tier 3 city. Therefore,
the coefficient of location type would be greater than that of store size.

So, firstly, let us try to understand linear regression with only one feature, i.e., only one
independent variable. Our equation then becomes:

Y = Θ0 + Θ1*X

This equation is called the simple linear regression equation; it represents a straight line,
where 'Θ0' is the intercept and 'Θ1' is the slope of the line. Take a look at the plot below
between sales and MRP.

Surprisingly, we can see that the sales of a product increase with an increase in its MRP.
The dotted red line represents our regression line, or the line of best fit. But one
question arises: how would you find this line?
3. The Line of Best Fit
As you can see below there can be so many lines which can be used to estimate Sales
according to their MRP. So how would you choose the best fit line or the regression line?

The main purpose of the best fit line is that our predicted values should be close to the
actual or observed values, because there is no point in predicting values which are far
away from the real values. In other words, we try to minimize the difference between the
values we predict and the observed values, which is what we term the error.
A graphical representation of the error is shown below. These errors are also called
residuals; they are indicated by vertical lines showing the difference between the
predicted and actual values.
Okay, now we know that our main objective is to find out the error and minimize it. But
before that, let’s think of how to deal with the first part, that is, to calculate the error. We
already know that error is the difference between the value predicted by us and the
observed value. Let’s just consider three ways through which we can calculate error:

 Sum of residuals (∑(Y - h(X))) – it might result in the cancelling out of positive and
negative errors.
 Sum of the absolute values of residuals (∑|Y - h(X)|) – the absolute value prevents
the cancellation of errors.
 Sum of squares of residuals (∑(Y - h(X))²) – this is the method used most in practice,
since it penalizes a large error much more than a small one, so that there is a
significant difference between making big errors and small errors, which makes it
easier to differentiate between candidate lines and select the best fit line.

Therefore, the sum of squares of these residuals is denoted by:

E = ∑(h(x) - y)², summed over all m training examples,

where h(x) is the value predicted by us, h(x) = Θ1*x + Θ0, y is the actual value and m is the
number of rows in the training set.

The cost Function

So let’s say, you increased the size of a particular shop, where you predicted that the sales
would be higher. But despite increasing the size, the sales in that shop did not increase that
much. So the cost applied in increasing the size of the shop, gave you negative results.

So, we need to minimize these costs. Therefore we introduce a cost function, which is
basically used to define and measure the error of the model:

J(Θ) = (1/2m) * ∑(h(x) - y)²

If you look at this equation carefully, it is very similar to the sum of squared errors, with a
factor of 1/2m multiplied in to ease the mathematics.

So in order to improve our prediction, we need to minimize the cost function. For this
purpose we use the gradient descent algorithm. So let us understand how it works.
4. Gradient Descent
Let us consider an example: we need to find the minimum value of the equation
Y = 5x + 4x². In mathematics, we simply take the derivative of this equation with respect to
x and equate it to zero (here, 5 + 8x = 0, so x = -5/8). This gives us the point where the
equation is at its minimum. Substituting that value back gives us the minimum value of the
equation.

Gradient descent works in a similar manner. It iteratively updates Θ, to find a point where
the cost function would be minimum. If you wish to study gradient descent in depth, I would
highly recommend going through this article.
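As a small illustration of the idea, here is a sketch of gradient descent fitting the simple linear regression Y = Θ0 + Θ1*x on toy data; the data, learning rate and iteration count are only illustrative.

import numpy as np

# toy data following roughly y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
m = len(y)

theta0, theta1 = 0.0, 0.0   # intercept and slope
lr = 0.05                   # learning rate

for _ in range(2000):
    h = theta0 + theta1 * x                   # predictions h(x)
    grad0 = (1.0 / m) * np.sum(h - y)         # dJ/dΘ0
    grad1 = (1.0 / m) * np.sum((h - y) * x)   # dJ/dΘ1
    theta0 = theta0 - lr * grad0
    theta1 = theta1 - lr * grad1

print(theta0, theta1)  # approaches the least squares estimates (about 1 and 2)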

5. Using Linear Regression for Prediction


Now let us consider using Linear Regression to predict Sales for our big mart sales
problem.

Model 3 – Enter Linear Regression:


From the previous case, we know that using the right features improves our accuracy. So
now let us use two features, MRP and the store establishment year, to estimate sales.

Let us build a linear regression model in Python considering only these two features.

# importing basic libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sklearn.model_selection import train_test_split

# importing the test and train files
train = pd.read_csv('Train.csv')
test = pd.read_csv('test.csv')

# importing linear regression from sklearn
from sklearn.linear_model import LinearRegression
lreg = LinearRegression()

# splitting into training and cv for cross validation
X = train.loc[:,['Outlet_Establishment_Year','Item_MRP']]
x_train, x_cv, y_train, y_cv = train_test_split(X,train.Item_Outlet_Sales)

# training the model
lreg.fit(x_train,y_train)

# predicting on cv
pred = lreg.predict(x_cv)

# calculating mse
mse = np.mean((pred - y_cv)**2)

In this case, we got mse = 19,10,586.53, which is much smaller than our model 2.
Therefore predicting with the help of two features is much more accurate.

Let us take a look at the coefficients of this linear regression model.

# calculating coefficients

coeff = DataFrame(x_train.columns)

coeff['Coefficient Estimate'] = Series(lreg.coef_)

coeff

Therefore, we can see that MRP has a high coefficient, meaning items having higher prices
have better sales.
6. Evaluating your Model – R square and adjusted R-
square
How accurate do you think the model is? Do we have any evaluation metric, so that we can
check this? Actually we have a quantity, known as R-Square.

R-Square: It determines how much of the total variation in Y (the dependent variable) is
explained by the variation in X (the independent variables). Mathematically, it can be written as:

R² = 1 - (Sum of Squared Residuals / Total Sum of Squares) = 1 - ∑(y - ŷ)² / ∑(y - ȳ)²

The value of R-square is always between 0 and 1, where 0 means that the model does not
explain any of the variability in the target variable (Y) and 1 means it explains the full
variability in the target variable.

Now let us check the r-square for the above model.

lreg.score(x_cv,y_cv)

0.3287

In this case, R² is about 32%, meaning only 32% of the variance in sales is explained by the
year of establishment and the MRP. In other words, knowing the year of establishment and
the MRP gives you only 32% of the information needed to make an accurate prediction about
sales.

Now what would happen if I introduce one more feature in my model, will my model predict
values more closely to its actual value? Will the value of R-Square increase?

Let us consider another case.

Model 4 – Linear regression with more variables


We learnt that, by using two variables rather than one, we improved our ability to make
accurate predictions about the item sales.

So, let us introduce another feature, 'Item_Weight', into the previous model. Now let's build
a regression model with these three features.

X = train.loc[:,['Outlet_Establishment_Year','Item_MRP','Item_Weight']]

## splitting into training and cv for cross validation
x_train, x_cv, y_train, y_cv = train_test_split(X,train.Item_Outlet_Sales)

## training the model
lreg.fit(x_train,y_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

It produces an error because the Item_Weight column has some missing values. So let us
impute them with the mean of the other non-null entries.

train['Item_Weight'].fillna((train['Item_Weight'].mean()), inplace=True)

Let us try to run the model again.

## splitting into training and cv for cross validation
x_train, x_cv, y_train, y_cv = train_test_split(X,train.Item_Outlet_Sales)

## training the model
lreg.fit(x_train,y_train)

## predicting on cv
pred = lreg.predict(x_cv)

## calculating mse
mse = np.mean((pred - y_cv)**2)
mse

1853431.59

## calculating coefficients
coeff = DataFrame(x_train.columns)
coeff['Coefficient Estimate'] = Series(lreg.coef_)

## calculating r-square
lreg.score(x_cv,y_cv)

0.32942

Therefore we can see that the mse is further reduced. There is also an increase in the value
of R-square. Does it mean that the addition of item weight is useful for our model?

Adjusted R-square
The only drawback of R² is that if new predictors (X) are added to our model, R² only
increases or remains constant; it never decreases. Because of this, we cannot judge whether
increasing the complexity of our model is making it more accurate.

That is why we use the "Adjusted R-Square".

The Adjusted R-Square is a modified form of R-Square that has been adjusted for the
number of predictors in the model. It incorporates the model's degrees of freedom. The
adjusted R-Square only increases if the new term improves the model accuracy:

Adjusted R² = 1 - [(1 - R²)(N - 1) / (N - p - 1)]

where

R² = sample R-square

p = number of predictors

N = total sample size
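A small helper for this formula (the sample size and predictor count in the call below are hypothetical):

def adjusted_r_square(r_square, n, p):
    # n = total sample size, p = number of predictors
    return 1 - (1 - r_square) * (n - 1) / (n - p - 1)

print(adjusted_r_square(0.3294, 2000, 3))  # hypothetical values for illustration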

7. Using all the features for prediction


Now let us build a model containing all the features. While building the previous regression
models, I only used continuous features. This is because we need to treat categorical
variables differently before they can be used in a linear regression model. There are
different techniques to treat them; here I have used one hot encoding (converting each class
of a categorical variable into a feature). Other than that, I have also imputed the missing
values for outlet size.

Data pre-processing steps for regression model


# imputing missing values
train['Item_Visibility'] =
train['Item_Visibility'].replace(0,np.mean(train['Item_Visibility']))

train['Outlet_Establishment_Year'] = 2013 - train['Outlet_Establishment_Year']

train['Outlet_Size'].fillna('Small',inplace=True)

# creating dummy variables to convert categorical into numeric values

mylist = list(train.select_dtypes(include=['object']).columns)

dummies = pd.get_dummies(train[mylist], prefix= mylist)

train.drop(mylist, axis=1, inplace = True)

X = pd.concat([train,dummies], axis =1 )

Building the model


import numpy as np

import pandas as pd

from pandas import Series, DataFrame

import matplotlib.pyplot as plt

%matplotlib inline

train = pd.read_csv('training.csv')

test = pd.read_csv('testing.csv')

# importing linear regression

from sklearn.linear_model import LinearRegression

lreg = LinearRegression()

# for cross validation

from sklearn.model_selection import train_test_split

X = train.drop('Item_Outlet_Sales',1)

x_train, x_cv, y_train, y_cv = train_test_split(X, train.Item_Outlet_Sales, test_size=0.3)

# training a linear regression model on train

lreg.fit(x_train,y_train)

# predicting on cv

pred_cv = lreg.predict(x_cv)

# calculating mse

mse = np.mean((pred_cv - y_cv)**2)

mse

1348171.96

# evaluation using r-square

lreg.score(x_cv,y_cv)

0.54831541460870059

Clearly, we can see that there is a great improvement in both mse and R-square, which
means that our model now is able to predict much closer values to the actual values.

Selecting the right features for your model


When we have a high dimensional data set, it would be highly inefficient to use all the
variables since some of them might be imparting redundant information. We would need to
select the right set of variables which give us an accurate model as well as are able to
explain the dependent variable well. There are multiple ways to select the right set of
variables for the model. First among them would be the business understanding and
domain knowledge. For instance, while predicting sales, we know that marketing efforts
should positively impact sales, making them an important feature for your model. We should
also take care that the variables we select are not correlated among themselves.

Instead of manually selecting the variables, we can automate this process by using forward
or backward selection. Forward selection starts with the most significant predictor and adds
a variable at each step. Backward elimination starts with all predictors in the model and
removes the least significant variable at each step. The selection criterion can be any
statistical measure, such as R-square, t-statistic, etc.
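As one sketch of automating this, scikit-learn's RFE performs a backward-style recursive feature elimination; it assumes the x_train and y_train objects built earlier in this article, and the number of features to keep is an arbitrary illustrative choice.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# repeatedly fits the model and drops the least important feature
selector = RFE(LinearRegression(), n_features_to_select=10)
selector.fit(x_train, y_train)

selected = x_train.columns[selector.support_]   # the retained variables
print(selected)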

Interpretation of Regression Plots


Take a look at the residual vs fitted values plot.
# residual plot

x_plot = plt.scatter(pred_cv, (pred_cv - y_cv), c='b')

plt.hlines(y=0, xmin= -1000, xmax=5000)

plt.title('Residual plot')

We can see a funnel like shape in the plot. This shape indicates Heteroskedasticity. The
presence of non-constant variance in the error terms results in heteroskedasticity. We can
clearly see that the variance of error terms(residuals) is not constant. Generally, non-
constant variance arises in presence of outliers or extreme leverage values. These values
get too much weight, thereby disproportionately influencing the model’s performance. When
this phenomenon occurs, the confidence interval for out of sample prediction tends to be
unrealistically wide or narrow.

We can easily check this by looking at the residual vs fitted values plot. If heteroskedasticity
exists, the plot will exhibit a funnel-shaped pattern as shown above. This indicates signs of
non-linearity in the data which have not been captured by the model. I would highly
recommend going through this article for a detailed understanding of the assumptions and
interpretation of regression plots.

In order to capture these non-linear effects, we have another type of regression, known as
polynomial regression. So let us now understand it.
8. Polynomial Regression
Polynomial regression is another form of regression in which the maximum power of the
independent variable is more than 1. In this regression technique, the best fit line is not a
straight line; instead, it is in the form of a curve.

Quadratic regression, or regression with a second order polynomial, is given by the following
equation:

Y = Θ1 + Θ2*x + Θ3*x²

Now take a look at the plot given below.

Clearly the quadratic equation fits the data better than the simple linear equation. In this case,
do you think the R-square value of quadratic regression will be greater than that of simple
linear regression? Definitely yes, because quadratic regression fits the data better. While
quadratic and cubic polynomials are common, you can also add higher degree polynomials.
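As a rough sketch of quadratic regression with scikit-learn (the single feature and target values below are toy data, not the big mart data):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 4.3, 8.8, 15.9, 26.2])      # roughly quadratic toy data

poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)                  # columns: 1, x, x²

model = LinearRegression().fit(x_poly, y)
print(model.score(x_poly, y))                   # R² of the quadratic fit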

Below figure shows the behavior of a polynomial equation of degree 6.


So do you think it’s always better to use higher order polynomials to fit the data set. Sadly,
no. Basically, we have created a model that fits our training data well but fails to estimate
the real relationship among variables beyond the training set. Therefore our model performs
poorly on the test data. This problem is called as over-fitting. We also say that the model
has high variance and low bias.

Similarly, we have another problem called underfitting. It occurs when our model neither
fits the training data nor generalizes to new data.
A model is underfit when it has high bias and low variance.

9. Bias and Variance in regression models


What do bias and variance actually mean? Let us understand this through the example of
archery targets.

Let's say we have a model which is very accurate; the error of our model will therefore be low,
meaning low bias and low variance, as shown in the first figure. All the data points fit within the
bulls-eye. If the variance increases, the spread of our data points increases, which results in
less accurate predictions. And as the bias increases, the error between the predicted values
and the observed values increases.

Now, how are bias and variance balanced to produce a good model? Take a look at the
image below and try to understand.

As we add more and more parameters to our model, its complexity increases, which results
in increasing variance and decreasing bias, i.e., overfitting. So we need to find the optimum
point where the decrease in bias equals the increase in variance. In practice, there is no
analytical way to find this point. So how do we deal with high variance or high bias?

To overcome underfitting, or high bias, we can add new parameters to our model so that the
model complexity increases, thus reducing the high bias.

Now, how can we overcome Overfitting for a regression model?

Basically there are two methods to overcome overfitting,

 Reduce the model complexity


 Regularization

Here we will discuss Regularization in detail, and how to use it to make your
model more generalized.

10. Regularization
You have your model ready, you have predicted your output. So why do you need to study
regularization? Is it necessary?

Suppose you have taken part in a competition in which you need to predict a continuous
variable. You applied linear regression and predicted your output. Voila! You are on the
leaderboard. But wait, you see that there are still many people above you. But you did
everything right, so how is that possible?

"Everything should be made as simple as possible, but not simpler" – Albert Einstein

What we did was the simple thing, and everybody else did it too; now let us look at making
the model simpler still. That is why we will try to optimize our model with the help of
regularization.

In regularization, we normally keep the same number of features but reduce the magnitude
of the coefficients Θj. How will reducing the coefficients help us?

Let us take a look at the feature coefficients in our regression model above.

# checking the magnitude of coefficients

predictors = x_train.columns

coef = Series(lreg.coef_,predictors).sort_values()

coef.plot(kind='bar', title='Model Coefficients')


We can see that the coefficients of Outlet_Identifier_OUT027 and
Outlet_Type_Supermarket_Type3 (the last two) are much higher compared to the rest of the
coefficients. Therefore, the total sales of an item are driven more by these two features.

How can we reduce the magnitude of the coefficients in our model? For this purpose, we have
different types of regression techniques which use regularization to overcome this
problem. So let us discuss them.

11. Ridge Regression


Let us first implement it on our above problem and check whether it performs better than
our linear regression model.

from sklearn.linear_model import Ridge

## training the model

ridgeReg = Ridge(alpha=0.05, normalize=True)

ridgeReg.fit(x_train,y_train)

pred = ridgeReg.predict(x_cv)

## calculating mse

mse = np.mean((pred - y_cv)**2)

mse

1348171.96

## calculating score

ridgeReg.score(x_cv,y_cv)

0.5691

So, we can see that there is a slight improvement in our model because the value of R-Square
has increased. Note that alpha is a hyperparameter of Ridge, which means that it is not
automatically learned by the model; instead, it has to be set manually.

Here we have considered alpha = 0.05. But let us consider different values of alpha and plot
the coefficients for each case.

You can see that, as we increase the value of alpha, the magnitude of the coefficients
decreases; the values approach zero but never reach exactly zero.

If you calculate the R-square for each alpha, you will see that the value of R-square is
maximum at alpha = 0.05. So we have to choose alpha wisely, by iterating over a range of
values and using the one which gives us the lowest error.
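A sketch of such a sweep over alpha, assuming the x_train and y_train objects built earlier in the article (the alpha values are illustrative):

import pandas as pd
from sklearn.linear_model import Ridge

coefs = {}
for alpha in [0.01, 0.05, 0.5, 5, 50]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(x_train, y_train)
    coefs[alpha] = pd.Series(ridge.coef_, index=x_train.columns)

# one group of bars per feature, one bar per alpha value
pd.DataFrame(coefs).plot(kind='bar', title='Ridge coefficients for different alpha values')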

So, now you have an idea of how to implement it, but let us also take a look at the
mathematics. Until now our idea was simply to minimize the cost function, such that the
predicted values are as close as possible to the desired result.

Now take a look at the cost function for ridge regression:

min( ∑(y - h(x))² + λ ∑Θ² )

Here, if you notice, we come across an extra term, known as the penalty term. The λ given
here is denoted by the alpha parameter in the Ridge function. So by changing the value of
alpha, we are basically controlling the penalty term: the higher the value of alpha, the bigger
the penalty, and therefore the more the magnitude of the coefficients is reduced.

Important Points:
 It shrinks the parameters, therefore it is mostly used to prevent multicollinearity.
 It reduces the model complexity by coefficient shrinkage.
 It uses the L2 regularization technique (which I will discuss later in this article).

Now let us consider another type of regression technique which also makes use of
regularization.

12. Lasso regression


LASSO (Least Absolute Shrinkage and Selection Operator) is quite similar to ridge, but let's
understand the difference between them by implementing it in our big mart problem.

from sklearn.linear_model import Lasso

lassoReg = Lasso(alpha=0.3, normalize=True)

lassoReg.fit(x_train,y_train)

pred = lassoReg.predict(x_cv)
# calculating mse

mse = np.mean((pred - y_cv)**2)

mse

1346205.82

lassoReg.score(x_cv,y_cv)

0.5720

As we can see, the mse has further decreased and the value of R-square for our model has
increased. Therefore, the lasso model is predicting better than both the linear and ridge models.

Again, let's change the value of alpha and see how it affects the coefficients.

So we can see that, even at small values of alpha, the magnitude of the coefficients has
reduced a lot. By looking at the plots, can you figure out the difference between ridge and lasso?

In ridge, as we increased the value of alpha, the coefficients were approaching zero; in the
case of lasso, even at smaller alphas, our coefficients are reduced to exactly zero. Therefore,
lasso selects only some features while reducing the coefficients of the others to zero. This
property is known as feature selection, and it is absent in the case of ridge.

The mathematics behind lasso regression is quite similar to that of ridge, the only difference
being that instead of adding the squares of Θ, we add the absolute values of Θ:

min( ∑(y - h(x))² + λ ∑|Θ| )

Here too, λ is the hyperparameter, whose value is equal to the alpha in the Lasso function.
Important Points:
 It uses the L1 regularization technique (discussed later in this article).
 It is generally used when we have a large number of features, because it automatically
does feature selection.

Now that you have a basic understanding of ridge and lasso regression, let's think of an
example where we have a large dataset, let's say with 10,000 features. And we know that
some of the independent features are correlated with other independent features. Then
think: which regression would you use, Ridge or Lasso?

Let’s discuss it one by one. If we apply ridge regression to it, it will retain all of the features
but will shrink the coefficients. But the problem is that model will still remain complex as
there are 10,000 features, thus may lead to poor model performance.

Instead of ridge what if we apply lasso regression to this problem. The main problem with
lasso regression is when we have correlated variables, it retains only one variable and sets
other correlated variables to zero. That will possibly lead to some loss of information
resulting in lower accuracy in our model.

Then what is the solution for this problem? Actually we have another type of regression,
known as elastic net regression, which is basically a hybrid of ridge and lasso regression.
So let’s try to understand it.

13. Elastic Net Regression


Before going into the theory part, let us implement this too in big mart sales problem. Will it
perform better than ridge and lasso? Let’s check!

from sklearn.linear_model import ElasticNet

ENreg = ElasticNet(alpha=1, l1_ratio=0.5, normalize=False)

ENreg.fit(x_train,y_train)

pred_cv = ENreg.predict(x_cv)

#calculating mse

mse = np.mean((pred_cv - y_cv)**2)

mse

1773750.73

ENreg.score(x_cv,y_cv)
0.4504

So we get a value of R-Square which is much lower than that of both ridge and lasso. Can you
think why? The reason behind this drop is basically that we don't have a large set of
features. Elastic net regression generally works well when we have a big dataset.

Note, here we had two parameters alpha and l1_ratio. First let’s discuss, what happens in
elastic net, and how it is different from ridge and lasso.

Elastic net is basically a combination of both L1 and L2 regularization. So if you know
elastic net, you can implement both Ridge and Lasso by tuning the parameters. It uses
both L1 and L2 penalty terms, therefore its equation looks as follows:

min( ∑(y - h(x))² + λ1 ∑|Θ| + λ2 ∑Θ² )

So how do we adjust the lambdas in order to control the L1 and L2 penalty term? Let us
understand by an example. You are trying to catch a fish from a pond. And you only have a
net, then what would you do? Will you randomly throw your net? No, you will actually wait
until you see one fish swimming around, then you would throw the net in that direction to
basically collect the entire group of fishes. Therefore even if they are correlated, we still
want to look at their entire group.

Elastic net regression works in a similar way. Let's say we have a bunch of correlated
independent variables in a dataset, then elastic net will simply form a group consisting of
these correlated variables. Now if any one of the variable of this group is a strong predictor
(meaning having a strong relationship with dependent variable), then we will include the
entire group in the model building, because omitting other variables (like what we did in
lasso) might result in losing some information in terms of interpretation ability, leading to a
poor model performance.

So, if you look at the code above, we need to define alpha and l1_ratio while defining the
model. Alpha and l1_ratio are the parameters which you can set accordingly if you wish to
control the L1 and L2 penalty separately. Actually, we have

Alpha = a + b and l1_ratio = a / (a+b)

where a and b are the weights assigned to the L1 and L2 terms respectively. So when we
change the values of alpha and l1_ratio, a and b are set accordingly such that they control
the trade-off between L1 and L2 as:

a * (L1 term) + b* (L2 term)

Let alpha (or a+b) = 1, and now consider the following cases:
 If l1_ratio =1, therefore if we look at the formula of l1_ratio, we can see that l1_ratio
can only be equal to 1 if a=1, which implies b=0. Therefore, it will be a lasso penalty.
 Similarly if l1_ratio = 0, implies a=0. Then the penalty will be a ridge penalty.
 For l1_ratio between 0 and 1, the penalty is the combination of ridge and lasso.

So let us adjust alpha and l1_ratio, and try to understand their effect on the coefficients.
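As a rough sketch of these boundary cases, assuming the x_train and y_train objects built earlier (alpha and the l1_ratio values are illustrative):

from sklearn.linear_model import ElasticNet

# l1_ratio = 1 behaves like lasso; values near 0 approach ridge
for l1_ratio in [1.0, 0.5, 0.1]:
    en = ElasticNet(alpha=1.0, l1_ratio=l1_ratio)
    en.fit(x_train, y_train)
    n_zero = (en.coef_ == 0).sum()
    print(l1_ratio, n_zero, "coefficients set to exactly zero")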
Now you have a basic understanding of ridge, lasso and elastic net regression. Along the way,
we came across two terms, L1 and L2, which are two types of regularization.
To sum up, lasso and ridge are direct applications of L1 and L2 regularization
respectively.

Random forest is like a bootstrapping algorithm applied to the decision tree (CART) model. Say
we have 1,000 observations in the complete population with 10 variables. Random forest builds
multiple CART models with different samples and different initial variables.

What is Logistic Regression?


Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0,
Yes / No, True / False) given a set of independent variables. To represent binary/categorical
outcome, we use dummy variables. You can also think of logistic regression as a special case
of linear regression when the outcome variable is categorical, where we are using log of odds
as dependent variable. In simple words, it predicts the probability of occurrence of an event
by fitting data to a logit function.
Derivation of Logistic Regression Equation
Logistic Regression is part of a larger class of algorithms known as Generalized Linear Model
(glm). In 1972, Nelder and Wedderburn proposed this model in an effort to provide a means
of applying linear regression to problems which were not directly suited to it. In fact, they
proposed a class of different models (linear regression, ANOVA, Poisson regression, etc.)
which included logistic regression as a special case.

The fundamental equation of generalized linear model is:

g(E(y)) = α + βx1 + γx2

Here, g() is the link function, E(y) is the expectation of target variable and α + βx1 + γx2 is
the linear predictor ( α,β,γ to be predicted). The role of link function is to ‘link’ the expectation
of y to linear predictor.

Important Points

1. GLM does not assume a linear relationship between the dependent and independent
variables. However, it assumes a linear relationship between the link function and the
independent variables in the logit model.
2. The dependent variable need not be normally distributed.
3. It does not use OLS (Ordinary Least Squares) for parameter estimation. Instead, it
uses maximum likelihood estimation (MLE).
4. Errors need to be independent but not normally distributed.

Let’s understand it further using an example:

We are provided a sample of 1000 customers. We need to predict the probability that a
customer will buy (y) a particular magazine. Since we have a categorical outcome variable,
we'll use logistic regression.

To start with logistic regression, I’ll first write the simple linear regression equation with
dependent variable enclosed in a link function:

g(y) = βo + β(Age) ---- (a)

Note: For ease of understanding, I’ve considered ‘Age’ as independent variable.


In logistic regression, we are only concerned about the probability of outcome dependent
variable ( success or failure). As described above, g() is the link function. This function is
established using two things: Probability of Success(p) and Probability of Failure(1-p). p
should meet following criteria:

1. It must always be positive (since p >= 0)


2. It must always be less than or equal to 1 (since p <= 1)

Now, we’ll simply satisfy these 2 conditions and get to the core of logistic regression. To
establish link function, we’ll denote g() with ‘p’ initially and eventually end up deriving this
function.

Since probability must always be positive, we’ll put the linear equation in exponential form.
For any value of slope and dependent variable, exponent of this equation will never be
negative.

p = exp(βo + β(Age)) = e^(βo + β(Age)) ------- (b)

To make the probability less than 1, we must divide p by a number greater than p. This can
simply be done by:

p = exp(βo + β(Age)) / (exp(βo + β(Age)) + 1) = e^(βo + β(Age)) / (e^(βo + β(Age)) + 1) ----- (c)

Using (a), (b) and (c), we can redefine the probability as:

p = e^y / (1 + e^y) --- (d)

where p is the probability of success. Equation (d) is the logistic (sigmoid) function.

If p is the probability of success, 1-p will be the probability of failure which can be written as:

q = 1 - p = 1 - (e^y / (1 + e^y)) --- (e)

where q is the probability of failure


On dividing (d) by (e), we get:

p / (1 - p) = e^y

After taking the log on both sides, we get:

log(p / (1 - p)) = y

log(p / (1 - p)) is the link function. The logarithmic transformation on the outcome variable
allows us to model a non-linear association in a linear way.

After substituting the value of y, we get:

log(p / (1 - p)) = βo + β(Age)

This is the equation used in Logistic Regression. Here, p/(1 - p) is the odds ratio. Whenever the
log of the odds ratio is positive, the probability of success is more than 50%. A typical logistic
model plot is shown below. You can see that the probability never goes below 0 or above 1.
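As a small numeric illustration of equation (d), with purely hypothetical coefficients for the Age example:

import numpy as np

beta0, beta1 = -8.0, 0.2          # hypothetical coefficients βo and β
age = np.array([20, 30, 40, 50, 60])

y = beta0 + beta1 * age           # linear predictor
p = np.exp(y) / (1 + np.exp(y))   # probability of success, p = e^y / (1 + e^y)

print(np.round(p, 3))             # stays between 0 and 1 and rises with age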
Performance of Logistic Regression Model
To evaluate the performance of a logistic regression model, we must consider few metrics.
Irrespective of tool (SAS, R, Python) you would work on, always look for:

1. AIC (Akaike Information Criteria) – The analogous metric of adjusted R² in logistic


regression is AIC. AIC is the measure of fit which penalizes model for the number of model
coefficients. Therefore, we always prefer model with minimum AIC value.

2. Null Deviance and Residual Deviance – Null Deviance indicates the response predicted
by a model with nothing but an intercept. Lower the value, better the model. Residual
deviance indicates the response predicted by a model on adding independent variables.
Lower the value, better the model.

3. Confusion Matrix: It is nothing but a tabular representation of actual vs predicted
values. This helps us to find the accuracy of the model and avoid overfitting. This is what it
looks like:

You can calculate the accuracy of your model with:

Accuracy = (True Positives + True Negatives) / (Total number of observations)

From the confusion matrix, Specificity and Sensitivity can be derived as follows:

Sensitivity (True Positive Rate) = TP / (TP + FN)
Specificity (True Negative Rate) = TN / (TN + FP)

Specificity and Sensitivity play a crucial role in deriving the ROC curve.

4. ROC Curve: The Receiver Operating Characteristic (ROC) curve summarizes the model's
performance by evaluating the trade-offs between the true positive rate (sensitivity) and the
false positive rate (1 - specificity). For plotting the ROC, it is advisable to assume p > 0.5, since
we are more concerned about the success rate. The ROC summarizes the predictive power for
all possible values of p > 0.5. The area under the curve (AUC), also referred to as the index of
accuracy (A) or concordance index, is a widely used performance metric for the ROC curve:
the higher the area under the curve, the better the predictive power of the model. Below is a
sample ROC curve. The ROC of a perfect predictive model has a true positive rate of 1 and a
false positive rate of 0, and the curve touches the top left corner of the graph.
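As a sketch of computing these metrics with scikit-learn (the outcomes and predicted probabilities below are made up for illustration):

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                     # hypothetical actual outcomes
y_prob = np.array([0.2, 0.4, 0.8, 0.6, 0.3, 0.9, 0.55, 0.35])   # hypothetical predicted probabilities

y_pred = (y_prob > 0.5).astype(int)        # threshold at p = 0.5
print(confusion_matrix(y_true, y_pred))    # tabulates actual vs predicted
print(roc_auc_score(y_true, y_prob))       # area under the ROC curve (AUC)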

Note: For model performance, you can also consider likelihood function. It is called so,
because it selects the coefficient values which maximizes the likelihood of explaining the
observed data. It indicates goodness of fit as its value approaches one, and a poor fit of the
data as its value approaches zero.
Decision Tree – Simplified!
I started working as a business analyst in my previous organisation. I transitioned from a
Business Intelligence (BI) Analyst to become a Business Analyst. During the initial days of
tenure as a business analyst, I had a bias towards using a classification technique
– DECISION TREE. This was because of its inherent simplicity and many advantages. We
will discuss these in more details later in this article.

Later on, I figured out that decision trees are one of the most commonly used techniques
among business analysts. They not only help us with prediction and classification, but are also
a very effective tool for understanding the behaviour of various variables. In this article, we
will discuss this algorithm in detail.

What is a Decision Tree?


Decision tree is a type of supervised learning algorithm (having a pre-defined target variable)
that is mostly used in classification problems. It works for both categorical and continuous
input and output variables. In this technique, we split the population or sample into two or
more homogeneous sets (or sub-populations) based on most significant splitter / differentiator
in input variables.

Example:-

Let’s say we have a sample of 30 students with three variables Gender (Boy/ Girl), Class( IX/
X) and Height (5 to 6 ft). 15 out of these 30 play cricket in leisure time. Now, I want to create
a model to predict who will play cricket during leisure period? In this problem, we need to
segregate students who play cricket in their leisure time based on highly significant input
variable among all three.

This is where a decision tree helps: it will segregate the students based on all values of the
three variables and identify the variable which creates the most homogeneous sets of students
(sets which are heterogeneous to each other). In the snapshot below, you can see that the
variable Gender is able to identify the most homogeneous sets compared to the other two
variables.

As mentioned above, the decision tree identifies the most significant variable and the value of
it that gives the most homogeneous sets of the population. Now the question which arises is:
how does it identify the variable and the split? To do this, decision trees use various
algorithms, which we will discuss in the next article.
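As a rough sketch of fitting such a tree with scikit-learn (the handful of rows below are hypothetical, since the 30-student sample is not reproduced here):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    'Gender_Boy': [1, 1, 0, 0, 1, 0, 1, 0],           # 1 = boy, 0 = girl
    'Class_IX':   [1, 0, 1, 0, 1, 0, 0, 1],           # 1 = class IX, 0 = class X
    'Height':     [5.2, 5.8, 5.1, 5.5, 5.9, 5.3, 5.7, 5.0],
    'Plays':      [1, 1, 0, 0, 1, 0, 1, 0],           # target: plays cricket or not
})

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(data[['Gender_Boy', 'Class_IX', 'Height']], data['Plays'])

# the index of the variable chosen for the first split, and the feature importances
print(tree.tree_.feature[0], tree.feature_importances_)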

Types of Decision Tree


The type of decision tree is based on the type of target variable we have. It can be of two
types:

1. Binary Variable Decision Tree: A decision tree which has a binary target variable is
called a Binary Variable Decision Tree. Example: the student problem above, where the
target variable was "Student will play cricket or not", i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree which has a continuous target
variable is called a Continuous Variable Decision Tree.

Example: let's say we have a problem of predicting whether a customer will pay his renewal
premium with an insurance company (yes/no). Here we know that the income of the customer
is a significant variable, but the insurance company does not have income details for all
customers. Since we know this is an important variable, we can build a decision tree to predict
customer income based on occupation, product and various other variables. In this case, we
are predicting values of a continuous variable.

Terminology related to Decision Trees:


Let’s look at the basic terminology used with Decision trees:

ROOT Node: It represents the entire population or sample, which further gets divided into two
or more homogeneous sets.

SPLITTING: It is the process of dividing a node into two or more sub-nodes.

Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.

Leaf/ Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.

Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You
can say it is the opposite process of splitting.

Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.

Parent and Child Node: A node which is divided into sub-nodes is called the parent node of
those sub-nodes, whereas the sub-nodes are the children of the parent node.

These are the terms commonly used for decision trees. As we know that every algorithm has
advantages and disadvantages, below I am discussing some of these for decision trees.

Advantages:
1. Easy to Understand: Decision tree output is very easy to understand even for people
from non-analytical background. It does not require any statistical knowledge to read
and interpret them. Its graphical representation is very intuitive and users can easily
relate their hypothesis.
2. Useful in Data exploration: Decision tree is one of the fastest way to identify most
significant variables and relation between two or more variables. With the help of
decision trees, we can create new variables / features that has better power to predict
target variable. You can refer article (Trick to enhance power of regression model) for
one such trick. It can also be used in data exploration stage. For example, we are
working on a problem where we have information available in hundreds of variables,
there decision tree will help to identify most significant variable.
3. Less data cleaning required: It requires less data cleaning compared to some other
modeling techniques. It is not influenced by outliers and missing values to a fair
degree.
4. Data type is not a constraint: It can handle both numerical and categorical variables.
5. Non Parametric Method: Decision tree is considered to be a non-parametric method.
This means that decision trees have no assumptions about the space distribution
and the classifier structure.

Disadvantages:
1. Overfit: Over fitting is one of the most practical difficulty for decision tree models. This
problem gets solved by use of random forests, which we will discuss some other day.
2. Not fit for continuous variables: While working with continuous numerical variables,
decision tree looses information when it categorizes variables in different categories.

Table of Contents
1. What is a Decision Tree? How does it work?
2. Regression Trees vs Classification Trees
3. How does a tree decide where to split?
4. What are the key parameters of model building and how can we avoid over-
fitting in decision trees?
5. Are tree based models better than linear models?
6. Working with Decision Trees in R and Python
7. What are the ensemble methods of trees based model?
8. What is Bagging? How does it work?
9. What is Random Forest ? How does it work?
10. What is Boosting ? How does it work?
11. Which is more powerful: GBM or Xgboost?
12. Working with GBM in R and Python
13. Working with Xgboost in R and Python
14. Where to Practice ?

1. What is a Decision Tree? How does it work?


Decision tree is a type of supervised learning algorithm (having a pre-defined target variable)
that is mostly used in classification problems. It works for both categorical and continuous
input and output variables. In this technique, we split the population or sample into two or
more homogeneous sets (or sub-populations) based on the most significant splitter / differentiator
among the input variables.

Example:-

Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X)
and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, I want to create
a model to predict who will play cricket during leisure time. In this problem, we need to
segregate students who play cricket in their leisure time based on the most significant input
variable among all three.

This is where a decision tree helps: it will segregate the students based on all values of the three
variables and identify the variable which creates the best homogeneous sets of students
(which are heterogeneous to each other). In the snapshot below, you can see that the variable
Gender is able to identify the most homogeneous sets compared to the other two variables.

As mentioned above, a decision tree identifies the most significant variable and the value of that
variable which gives the most homogeneous sets of the population. Now the question which arises is:
how does it identify the variable and the split? To do this, decision trees use various algorithms, which we
will discuss in the following section.
Types of Decision Trees
The type of decision tree depends on the type of target variable we have. It can be of two
types:

1. Categorical Variable Decision Tree: A decision tree with a categorical target
variable is called a categorical variable decision tree. Example:- In the above
student problem, the target variable was “Student will play cricket
or not”, i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree with a continuous target variable
is called a continuous variable decision tree.

Example:- Let’s say we have a problem of predicting whether a customer will pay his renewal
premium with an insurance company (yes/no). Here we know that the income of the customer is
a significant variable, but the insurance company does not have income details for all customers.
Since we know this is an important variable, we can build a decision tree to predict
customer income based on occupation, product and various other variables. In this case, we
are predicting values for a continuous variable.

Important Terminology related to Decision Trees


Let’s look at the basic terminology used with decision trees:

1. Root Node: It represents the entire population or sample, and this further gets divided into
two or more homogeneous sets.
2. Splitting: It is the process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, it is called a
decision node.
4. Leaf/ Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning.
You can say it is the opposite process of splitting.
6. Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: A node which is divided into sub-nodes is called the parent node
of the sub-nodes, whereas the sub-nodes are the children of the parent node.

These are the terms commonly used for decision trees. As every algorithm has
advantages and disadvantages, below are the important ones to know.

Advantages
1. Easy to understand: Decision tree output is very easy to understand, even for people
from a non-analytical background. It does not require any statistical knowledge to read
and interpret. Its graphical representation is very intuitive and users can easily
relate it to their hypotheses.
2. Useful in data exploration: A decision tree is one of the fastest ways to identify the most
significant variables and the relations between two or more variables. With the help of
decision trees, we can create new variables / features that have better power to predict the
target variable. You can refer to the article (Trick to enhance power of regression model) for
one such trick. It can also be used in the data exploration stage. For example, if we are
working on a problem where we have information available in hundreds of variables,
a decision tree will help to identify the most significant ones.
3. Less data cleaning required: It requires less data cleaning compared to some other
modeling techniques. It is not influenced by outliers and missing values to a fair
degree.
4. Data type is not a constraint: It can handle both numerical and categorical variables.
5. Non-parametric method: The decision tree is considered to be a non-parametric method.
This means that decision trees make no assumptions about the space distribution
and the classifier structure.

Disadvantages
1. Over fitting: Over fitting is one of the most practical difficulties for decision tree
models. This problem gets solved by setting constraints on model parameters and by
pruning (discussed in detail below).
2. Not fit for continuous variables: While working with continuous numerical
variables, the decision tree loses information when it categorizes variables into different
categories.

2. Regression Trees vs Classification Trees


We all know that the terminal nodes (or leaves) lie at the bottom of the decision tree. This
means that decision trees are typically drawn upside down, such that the leaves are at the
bottom and the root is at the top (shown below).

Both types of trees work in almost the same way; let’s look at the primary differences and
similarities between classification and regression trees:

1. Regression trees are used when the dependent variable is continuous. Classification trees
are used when the dependent variable is categorical.
2. In the case of a regression tree, the value obtained by a terminal node in the training data is
the mean response of the observations falling in that region. Thus, if an unseen data
observation falls in that region, we’ll make its prediction with the mean value.
3. In the case of a classification tree, the value (class) obtained by a terminal node in the training
data is the mode of the observations falling in that region. Thus, if an unseen data
observation falls in that region, we’ll make its prediction with the mode value.
4. Both trees divide the predictor space (independent variables) into distinct and non-
overlapping regions. For the sake of simplicity, you can think of these regions as high-
dimensional boxes.
5. Both trees follow a top-down greedy approach known as recursive binary splitting.
We call it ‘top-down’ because it begins from the top of the tree, when all the observations
are in a single region, and successively splits the predictor space into two
new branches down the tree. It is known as ‘greedy’ because the algorithm cares
only about the current split (it looks for the best variable available), and not about future
splits which might lead to a better tree.
6. This splitting process is continued until a user-defined stopping criterion is reached. For
example, we can tell the algorithm to stop once the number of observations per
node becomes less than 50.
7. In both cases, the splitting process results in fully grown trees until the stopping
criterion is reached. But a fully grown tree is likely to overfit the data, leading to poor
accuracy on unseen data. This brings in ‘pruning’, one of the techniques used to
tackle overfitting. We’ll learn more about it in a following section.

3. How does a tree decide where to split?


The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria
are different for classification and regression trees.

Decision trees use multiple algorithms to decide on splitting a node into two or more sub-nodes.
The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words,
we can say that the purity of the node increases with respect to the target variable. The decision tree
splits the nodes on all available variables and then selects the split which results in the most
homogeneous sub-nodes.

The algorithm selection is also based on the type of target variable. Let’s look at the four most
commonly used algorithms in decision trees:

Gini

Gini says that if we select two items from a population at random, they must be of the same
class, and the probability of this is 1 if the population is pure.

1. It works with a categorical target variable “Success” or “Failure”.


2. It performs only binary splits.
3. The higher the value of Gini, the higher the homogeneity.
4. CART (Classification and Regression Tree) uses the Gini method to create binary splits.

Steps to Calculate Gini for a split

1. Calculate Gini for each sub-node, using the formula: sum of squares of the probabilities of
success and failure (p^2+q^2).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.

Example:- Referring to the example used above, where we want to segregate the students
based on the target variable (playing cricket or not). In the snapshot below, we split the
population using the two input variables Gender and Class. Now, I want to identify which split
produces more homogeneous sub-nodes using Gini.

Split on Gender:

1. Gini for sub-node Female = (0.2)*(0.2)+(0.8)*(0.8) = 0.68


2. Gini for sub-node Male = (0.65)*(0.65)+(0.35)*(0.35) = 0.55
3. Weighted Gini for the split on Gender = (10/30)*0.68+(20/30)*0.55 = 0.59

Similarly, for the split on Class:

1. Gini for sub-node Class IX = (0.43)*(0.43)+(0.57)*(0.57) = 0.51


2. Gini for sub-node Class X = (0.56)*(0.56)+(0.44)*(0.44) = 0.51
3. Weighted Gini for the split on Class = (14/30)*0.51+(16/30)*0.51 = 0.51

Above, you can see that the Gini score for the split on Gender is higher than for the split on Class;
hence, the node split will take place on Gender.

You might often come across the term ‘Gini Impurity’, which is determined by subtracting the
Gini value from 1. So mathematically we can say,

Gini Impurity = 1-Gini
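
To make the arithmetic above concrete, here is a minimal Python sketch of the weighted Gini calculation, using the student counts from the running example (the helper names are illustrative, not from any library):

def gini_score(p):
    # Gini score of a node with success probability p: p^2 + q^2 (higher = more homogeneous)
    return p ** 2 + (1 - p) ** 2

def weighted_gini(nodes):
    # nodes: list of (count, success_probability) pairs, one per sub-node
    total = sum(count for count, _ in nodes)
    return sum((count / total) * gini_score(p) for count, p in nodes)

# Split on Gender: Female (10 students, 2 play), Male (20 students, 13 play)
print(round(weighted_gini([(10, 2/10), (20, 13/20)]), 2))   # -> 0.59
# Split on Class: IX (14 students, 6 play), X (16 students, 9 play)
print(round(weighted_gini([(14, 6/14), (16, 9/16)]), 2))    # -> 0.51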

Chi-Square
It is an algorithm to find the statistical significance of the differences between sub-
nodes and the parent node. We measure it by the sum of squares of the standardized differences
between the observed and expected frequencies of the target variable.

1. It works with a categorical target variable “Success” or “Failure”.


2. It can perform two or more splits.
3. The higher the value of Chi-Square, the higher the statistical significance of the differences
between the sub-node and the parent node.
4. The Chi-Square of each node is calculated using the formula:
Chi-square = sqrt((Actual – Expected)^2 / Expected)
5. It generates a tree called CHAID (Chi-square Automatic Interaction Detector).

Steps to Calculate Chi-square for a split:

1. Calculate the Chi-square for each individual node by calculating the deviation for both
Success and Failure.
2. Calculate the Chi-square of the split as the sum of all Success and Failure Chi-square
values of each node of the split.

Example: Let’s work with the example above that we used to calculate Gini.

Split on Gender:

1. First we populate the node Female: fill in the actual values for “Play
Cricket” and “Not Play Cricket”; here these are 2 and 8 respectively.
2. Calculate the expected values for “Play Cricket” and “Not Play Cricket”; here it would be
5 for both, because the parent node has a probability of 50% and we apply the same
probability to the Female count (10).
3. Calculate the deviations using the formula Actual – Expected. For “Play Cricket” it is (2 –
5 = -3) and for “Not Play Cricket” it is (8 – 5 = 3).
4. Calculate the Chi-square of the node for “Play Cricket” and “Not Play Cricket” using the
formula sqrt((Actual – Expected)^2 / Expected). You can refer to the table below
for the calculation.
5. Follow similar steps to calculate the Chi-square values for the Male node.
6. Now add all the Chi-square values to calculate the Chi-square for the split on Gender.

Split on Class:

Perform similar steps of calculation for the split on Class and you will come up with the table below.
Above, you can see that Chi-square also identifies the Gender split as more significant
compared to Class.
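
Here is a small Python sketch (illustrative helper names, not a library API) that reproduces the Chi-square comparison of the two splits described above:

import math

def node_chi_square(actual, expected):
    # Per the formula above: sqrt((Actual - Expected)^2 / Expected), summed over "Play" and "Not Play"
    return sum(math.sqrt((a - e) ** 2 / e) for a, e in zip(actual, expected))

# Split on Gender: Female (2 play / 8 not, expected 5 / 5), Male (13 / 7, expected 10 / 10)
chi_gender = node_chi_square([2, 8], [5, 5]) + node_chi_square([13, 7], [10, 10])
# Split on Class: IX (6 / 8, expected 7 / 7), X (9 / 7, expected 8 / 8)
chi_class = node_chi_square([6, 8], [7, 7]) + node_chi_square([9, 7], [8, 8])
print(round(chi_gender, 2), round(chi_class, 2))   # ~4.58 vs ~1.46, so the Gender split wins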

Information Gain:
Look at the image below and think about which node can be described easily. I am sure your
answer is C, because it requires less information as all values are similar. On the other hand,
B requires more information to describe it and A requires the maximum information. In other
words, we can say that C is a pure node, B is less impure and A is more impure.

Now, we can conclude that a less impure node requires less information to describe
it, and a more impure node requires more information. Information theory has a measure to
define this degree of disorganization in a system, known as entropy. If the sample is
completely homogeneous, then the entropy is zero, and if the sample is equally divided
(50% – 50%), it has an entropy of one.

Entropy can be calculated using the formula:

Entropy = -p log2(p) - q log2(q)

Here p and q are the probabilities of success and failure respectively in that node. Entropy is also
used with a categorical target variable. It chooses the split which has the lowest entropy compared
to the parent node and other splits. The lesser the entropy, the better it is.

Steps to calculate entropy for a split:

1. Calculate the entropy of the parent node.


2. Calculate the entropy of each individual node of the split and calculate the weighted average of
all sub-nodes available in the split.

Example: Let’s use this method to identify the best split for the student example.

1. Entropy for the parent node = -(15/30) log2 (15/30) – (15/30) log2 (15/30) = 1. Here 1
shows that it is an impure node.
2. Entropy for the Female node = -(2/10) log2 (2/10) – (8/10) log2 (8/10) = 0.72 and for the
Male node, -(13/20) log2 (13/20) – (7/20) log2 (7/20) = 0.93.
3. Entropy for the split on Gender = weighted entropy of the sub-nodes = (10/30)*0.72 +
(20/30)*0.93 = 0.86
4. Entropy for the Class IX node = -(6/14) log2 (6/14) – (8/14) log2 (8/14) = 0.99 and for the
Class X node, -(9/16) log2 (9/16) – (7/16) log2 (7/16) = 0.99.
5. Entropy for the split on Class = (14/30)*0.99 + (16/30)*0.99 = 0.99

Above, you can see that the entropy for the split on Gender is the lowest of all, so the tree will
split on Gender. Information gain for a split is the parent entropy minus the weighted entropy of the
sub-nodes; since the parent entropy here is 1, it simplifies to 1 – Entropy.
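
The same comparison in a short Python sketch (illustrative helpers, using the counts from the example):

import math

def entropy(p):
    # Binary entropy: -p*log2(p) - q*log2(q); a pure node has entropy 0
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def split_entropy(nodes):
    # nodes: list of (count, success_probability) pairs; weighted entropy of the split
    total = sum(count for count, _ in nodes)
    return sum((count / total) * entropy(p) for count, p in nodes)

print(round(split_entropy([(10, 2/10), (20, 13/20)]), 2))   # Gender split -> ~0.86
print(round(split_entropy([(14, 6/14), (16, 9/16)]), 2))    # Class split  -> ~0.99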

Reduction in Variance
Till now, we have discussed algorithms for a categorical target variable. Reduction in
variance is an algorithm used for continuous target variables (regression problems). This
algorithm uses the standard formula of variance to choose the best split. The split with the lower
variance is selected as the criterion to split the population:

Variance = sum((X - X-bar)^2) / n

where X-bar is the mean of the values, X is an actual value and n is the number of values.

Steps to calculate variance:

1. Calculate the variance for each node.


2. Calculate the variance for each split as the weighted average of the node variances.

Example:- Let’s assign the numerical value 1 for play cricket and 0 for not playing cricket. Now
follow the steps to identify the right split:

1. Variance for the Root node: here the mean value is (15*1 + 15*0)/30 = 0.5 and we have 15
ones and 15 zeros. The variance would be ((1-0.5)^2+(1-0.5)^2+….15 times+(0-
0.5)^2+(0-0.5)^2+…15 times) / 30, which can be written as (15*(1-0.5)^2+15*(0-0.5)^2)
/ 30 = 0.25
2. Mean of Female node = (2*1+8*0)/10 = 0.2 and Variance = (2*(1-0.2)^2+8*(0-0.2)^2) /
10 = 0.16
3. Mean of Male node = (13*1+7*0)/20 = 0.65 and Variance = (13*(1-0.65)^2+7*(0-
0.65)^2) / 20 = 0.23
4. Variance for the split on Gender = weighted variance of the sub-nodes = (10/30)*0.16 + (20/30)
*0.23 = 0.21
5. Mean of Class IX node = (6*1+8*0)/14 = 0.43 and Variance = (6*(1-0.43)^2+8*(0-
0.43)^2) / 14 = 0.24
6. Mean of Class X node = (9*1+7*0)/16 = 0.56 and Variance = (9*(1-0.56)^2+7*(0-
0.56)^2) / 16 = 0.25
7. Variance for the split on Class = (14/30)*0.24 + (16/30)*0.25 = 0.25

Above, you can see that the Gender split has a lower variance compared to the parent node (and to
the Class split), so the split would take place on the Gender variable.
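
And the corresponding Python sketch for reduction in variance (illustrative helpers; the exact values differ slightly from the rounded ones above):

def node_variance(values):
    # Plain variance: sum((X - mean)^2) / n
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def split_variance(nodes):
    # nodes: one list of 0/1 values per sub-node; weighted average of the node variances
    total = sum(len(v) for v in nodes)
    return sum((len(v) / total) * node_variance(v) for v in nodes)

female, male = [1] * 2 + [0] * 8, [1] * 13 + [0] * 7
class_ix, class_x = [1] * 6 + [0] * 8, [1] * 9 + [0] * 7
print(round(split_variance([female, male]), 3))       # Gender split -> 0.205 (~0.21 above)
print(round(split_variance([class_ix, class_x]), 3))  # Class split  -> 0.246 (~0.25 above)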

Until here, we have learnt the basics of decision trees and the decision making process
involved in choosing the best splits when building a tree model. As I said, decision trees can be
applied to both regression and classification problems. Let’s understand these aspects in
detail.

4. What are the key parameters of tree modeling and how can we avoid over-fitting in decision trees?


Overfitting is one of the key challenges faced while modeling decision trees. If there is no limit
set on a decision tree, it will give you 100% accuracy on the training set because, in the worst
case, it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal
while modeling a decision tree, and it can be done in 2 ways:

1. Setting constraints on tree size


2. Tree pruning

Let's discuss both of these briefly.

Setting Constraints on Tree Size


This can be done by using various parameters which are used to define a tree. First, let's
look at the general structure of a decision tree.
The parameters used for defining a tree are further explained below. These parameters
are described irrespective of tool: it is important to understand the role they play in tree
modeling, and they are available in both R & Python. A minimal sklearn sketch follows the list.

1. Minimum samples for a node split


o Defines the minimum number of samples (or observations) which are
required in a node to be considered for splitting.
o Used to control over-fitting. Higher values prevent a model from learning
relations which might be highly specific to the particular sample selected for a
tree.
o Values that are too high can lead to under-fitting; hence, it should be tuned using CV.
2. Minimum samples for a terminal node (leaf)
o Defines the minimum samples (or observations) required in a terminal node
or leaf.
o Used to control over-fitting similar to min_samples_split.
o Generally lower values should be chosen for imbalanced class problems
because the regions in which the minority class will be in majority will be very
small.
3. Maximum depth of tree (vertical depth)
o The maximum depth of a tree.
o Used to control over-fitting as higher depth will allow model to learn relations
very specific to a particular sample.
o Should be tuned using CV.
4. Maximum number of terminal nodes
o The maximum number of terminal nodes or leaves in a tree.
o Can be defined in place of max_depth. Since binary trees are created, a depth
of ‘n’ would produce a maximum of 2^n leaves.
5. Maximum features to consider for split
o The number of features to consider while searching for the best split. These will
be randomly selected.
o As a thumb rule, the square root of the total number of features works well, but we
should check up to 30-40% of the total number of features.
o Higher values can lead to over-fitting, but this depends on the case.
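
As a concrete illustration, these five constraints map directly onto scikit-learn's DecisionTreeClassifier arguments. The sketch below is only a template: the data frame df and the specific values are assumptions, to be tuned with cross-validation.

from sklearn.tree import DecisionTreeClassifier

# df is a hypothetical pandas DataFrame with a binary "target" column
model = DecisionTreeClassifier(
    min_samples_split=50,    # 1. minimum samples for a node split
    min_samples_leaf=20,     # 2. minimum samples for a terminal node (leaf)
    max_depth=5,             # 3. maximum depth of tree
    max_leaf_nodes=32,       # 4. maximum number of terminal nodes
    max_features="sqrt",     # 5. maximum features to consider for a split
    random_state=0,
)
model.fit(df.drop(columns="target"), df["target"])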

Tree Pruning
As discussed earlier, the technique of setting constraints is a greedy approach. In other words,
it will check for the best split instantaneously and move forward until one of the specified
stopping conditions is reached. Let’s consider the following case when you’re driving:

There are 2 lanes:

1. A lane with cars moving at 80km/h


2. A lane with trucks moving at 30km/h

At this instant, you are the yellow car and you have 2 choices:

1. Take a left and overtake the other 2 cars quickly


2. Keep moving in the present lane

Let’s analyze these choices. In the former choice, you’ll immediately overtake the car ahead
and reach behind the truck and start moving at 30 km/h, looking for an opportunity to move
back right. All cars originally behind you move ahead in the meanwhile. This would be the
optimum choice if your objective is to maximize the distance covered in the next, say, 10 seconds.
In the latter choice, you sail through at the same speed, cross the trucks and then overtake, maybe,
depending on the situation ahead. Greedy you!
This is exactly the difference between a normal decision tree and
pruning. A decision tree with constraints won’t see the truck ahead and will adopt a greedy
approach by taking a left. On the other hand, if we use pruning, we in effect look a few steps
ahead and make a choice.

So we know pruning is better. But how do we implement it in a decision tree? The idea is simple.

1. We first grow the decision tree to a large depth.


2. Then we start at the bottom and start removing leaves which are giving us negative
returns when compared to the top.
3. Suppose a split gives us a gain of, say, -10 (a loss of 10) and then the next split on
that gives us a gain of 20. A simple decision tree will stop at step 1, but with pruning, we
will see that the overall gain is +10 and keep both splits.

Note that, at the time of writing, sklearn’s decision tree classifier did not support pruning
(newer scikit-learn versions add cost-complexity pruning via the ccp_alpha parameter). Advanced
packages like xgboost have adopted tree pruning in their implementation. The
library rpart in R also provides a function to prune. Good for R users!
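
For Python users on a recent scikit-learn release, a hedged sketch of cost-complexity pruning (assumes X_train, y_train, X_test, y_test already exist; this is not part of the original article's code):

from sklearn.tree import DecisionTreeClassifier

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
# Fit one tree per candidate alpha and keep the one that scores best on held-out data
trees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
         for a in path.ccp_alphas]
best = max(trees, key=lambda t: t.score(X_test, y_test))
print(best.get_depth(), best.get_n_leaves())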

5. Are tree based models better than linear models?


“If I can use logistic regression for classification problems and linear regression for
regression problems, why is there a need to use trees?” Many of us have this question.
And this is a valid one too.

Actually, you can use any algorithm. It is dependent on the type of problem you are solving.
Let’s look at some key factors which will help you to decide which algorithm to use:

1. If the relationship between the dependent & independent variables is well approximated by


a linear model, linear regression will outperform a tree based model.
2. If there is high non-linearity & a complex relationship between the dependent &
independent variables, a tree model will outperform a classical regression method.
3. If you need to build a model which is easy to explain to people, a decision tree
model will always do better than a linear model. Decision tree models are even simpler
to interpret than linear regression!

6. Working with Decision Trees in R and Python


For R users and Python users, the decision tree is quite easy to implement. Let’s quickly look at
the code that can get you started with this algorithm. For ease of use, I’ve shared
standard code where you’ll need to replace your data set name and variables to get started.


For R users, there are multiple packages available to implement decision trees, such as
ctree, rpart, tree, etc.

> library(rpart)

> x <- cbind(x_train,y_train)

# grow tree

> fit <- rpart(y_train ~ ., data = x,method="class")

> summary(fit)

#Predict Output
> predicted= predict(fit,x_test)

In the code above:

 y_train – represents the dependent variable.
 x_train – represents the independent variables.
 x – represents the training data.

For Python users, below is the code:

#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree

#Assumed you have X (predictor) and y (target) for the training data set and x_test (predictor) of the test data set

# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini') # for classification; you can change the criterion to 'entropy' (information gain); by default it is 'gini'
# model = tree.DecisionTreeRegressor() for regression

# Train the model using the training set and check the score
model.fit(X, y)
model.score(X, y)

#Predict Output
predicted = model.predict(x_test)

7. What are ensemble methods in tree based modeling?


The literal meaning of the word ‘ensemble’ is group. Ensemble methods involve a group of
predictive models working together to achieve better accuracy and model stability. Ensemble methods are
known to give a significant boost to tree based models.

Like every other model, a tree based model also suffers from the plague of bias and variance.
Bias means, ‘how much on average are the predicted values different from the actual
value.’ Variance means, ‘how different will the predictions of the model be at the same point
if different samples are taken from the same population.’

If you build a small tree, you will get a model with low variance and high bias. How do you
manage to balance the trade-off between bias and variance?

Normally, as you increase the complexity of your model, you will see a reduction in prediction
error due to lower bias in the model. As you continue to make your model more complex, you
end up over-fitting your model and your model will start suffering from high variance.

A champion model should maintain a balance between these two types of errors. This is
known as the trade-off management of bias-variance errors. Ensemble learning is one way
to execute this trade-off analysis.

Some of the commonly used ensemble methods include: Bagging, Boosting and Stacking. In this
tutorial, we’ll focus on Bagging and Boosting in detail.

8. What is Bagging? How does it work?


Bagging is a technique used to reduce the variance of our predictions by combining the
results of multiple classifiers modeled on different sub-samples of the same data set. The
following figure will make it clearer:
The steps followed in bagging are:

1. Create Multiple Data Sets:


o Sampling is done with replacement on the original data and new datasets are
formed.
o The new data sets can have a fraction of the columns as well as of the rows, which
are generally hyper-parameters in a bagging model.
o Taking row and column fractions less than 1 helps in making robust models,
less prone to overfitting.
2. Build Multiple Classifiers:
o Classifiers are built on each data set.
o Generally the same classifier is modeled on each data set and predictions are
made.
3. Combine Classifiers:
o The predictions of all the classifiers are combined using a mean, median or
mode value depending on the problem at hand.
o The combined values are generally more robust than a single model.

Note that here the number of models built is not a hyper-parameter. A higher number of
models is always better than, or gives similar performance to, a lower number. It can be
theoretically shown that, under some assumptions, the variance of the combined predictions
is reduced to 1/n (n: number of classifiers) of the original variance.
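
The steps above can be reproduced almost line for line with scikit-learn's BaggingClassifier; the sketch below is illustrative and assumes training arrays already exist.

from sklearn.ensemble import BaggingClassifier

# BaggingClassifier uses a decision tree as its default base model;
# max_samples and max_features are the row and column fractions from step 1.
bag = BaggingClassifier(
    n_estimators=100,   # number of models to combine
    max_samples=0.8,    # row fraction drawn for each model
    max_features=0.8,   # column fraction drawn for each model
    bootstrap=True,     # sampling with replacement
    random_state=0,
)
bag.fit(X_train, y_train)
predictions = bag.predict(X_test)   # majority vote across the 100 classifiers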

There are various implementations of bagging models. Random forest is one of them and
we’ll discuss it next.
9. What is Random Forest ? How does it work?
Random Forest is considered to be a panacea for all data science problems. On a funny note,
when you can’t think of any algorithm (irrespective of the situation), use random forest!

Random Forest is a versatile machine learning method capable of performing both regression
and classification tasks. It also undertakes dimensionality reduction, treats missing
values and outlier values, and covers other essential steps of data exploration, and it does a
fairly good job. It is a type of ensemble learning method, where a group of weak models combine
to form a powerful model.

How does it work?


In Random Forest, we grow multiple trees as opposed to a single tree in CART model (see
comparison between CART and Random Forest here, part1 and part2). To classify a new
object based on attributes, each tree gives a classification and we say the tree “votes” for that
class. The forest chooses the classification having the most votes (over all the trees in the
forest) and in case of regression, it takes the average of outputs by different trees.

It works in the following manner. Each tree is planted & grown as follows:
1. Assume the number of cases in the training set is N. A sample of these N cases is
taken at random, but with replacement. This sample will be the training set for growing
the tree.
2. If there are M input variables, a number m<M is specified such that at each node, m
variables are selected at random out of the M. The best split on these m is used to
split the node. The value of m is held constant while we grow the forest.
3. Each tree is grown to the largest extent possible and there is no pruning.
4. Predict new data by aggregating the predictions of the ntree trees (i.e., majority vote
for classification, average for regression).

To understand more in detail about this algorithm using a case study, please read this article
“Introduction to Random forest – Simplified“.

Advantages of Random Forest


 This algorithm can solve both types of problems, i.e. classification and regression, and
does a decent estimation on both fronts.
 One of the benefits of Random Forest which excites me most is its power to handle large
data sets with higher dimensionality. It can handle thousands of input variables and
identify the most significant ones, so it is considered one of the dimensionality
reduction methods. Further, the model outputs the importance of each variable, which can be
a very handy feature (on some random data set).

 It has an effective method for estimating missing data and maintains accuracy when
a large proportion of the data is missing.
 It has methods for balancing errors in data sets where classes are imbalanced.
 The capabilities of the above can be extended to unlabeled data, leading to
unsupervised clustering, data views and outlier detection.
 Random Forest involves sampling of the input data with replacement, called
bootstrap sampling. Here about one third of the data is not used for training and can be used
for testing. These are called the out of bag samples. The error estimated on these out of
bag samples is known as the out of bag error. Studies of out of bag error estimates
give evidence that the out-of-bag estimate is as accurate as using a test set
of the same size as the training set. Therefore, using the out-of-bag error estimate
removes the need for a set-aside test set.

Disadvantages of Random Forest


 It surely does a good job at classification, but not as good a job at regression,
as it does not give precise continuous predictions. In the case of regression,
it doesn’t predict beyond the range of the training data, and it may over-fit data
sets that are particularly noisy.
 Random Forest can feel like a black box approach for statistical modelers – you have
very little control over what the model does. You can at best try different parameters
and random seeds!
Python & R implementation
Random forests have well-known implementations in R packages and in Python’s scikit-
learn. Let’s look at the code for loading a random forest model in R and Python below:

R Code

> library(randomForest)

> x <- cbind(x_train,y_train)

# Fitting model

> fit <- randomForest(y_train ~ ., data = x, ntree=500)

> summary(fit)

#Predict Output

> predicted= predict(fit,x_test)
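
Python Code (a minimal scikit-learn sketch, assuming X_train, y_train and x_test already exist):

from sklearn.ensemble import RandomForestClassifier

# Fitting model: 500 trees, mirroring ntree=500 in the R code above
model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

#Predict Output
predicted = model.predict(x_test)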

10. What is Boosting? How does it work?


Definition: The term ‘Boosting’ refers to a family of algorithms which convert weak learners into
strong learners.

Let’s understand this definition in detail by solving a problem of spam email identification:

How would you classify an email as SPAM or not? Like everyone else, our initial approach
would be to identify ‘spam’ and ‘not spam’ emails using criteria like the following. If:

1. The email has only one image file (promotional image), it’s a SPAM
2. The email has only link(s), it’s a SPAM
3. The email body consists of a sentence like “You won a prize money of $ xxxxxx”, it’s a SPAM
4. The email is from our official domain “Analyticsvidhya.com”, it’s not a SPAM
5. The email is from a known source, it’s not a SPAM

Above, we’ve defined multiple rules to classify an email as ‘spam’ or ‘not spam’. But do you
think these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email as ‘spam’ or ‘not
spam’. Therefore, these rules are called weak learners.

To convert a weak learner into a strong learner, we combine the predictions of the weak learners
using methods like:

 Using an average / weighted average


 Considering the prediction with the higher vote

For example: above, we have defined 5 weak learners. Out of these 5, 3 vote ‘SPAM’
and 2 vote ‘Not a SPAM’. In this case, by default, we’ll consider the email as SPAM
because we have the higher vote (3) for ‘SPAM’.

How does it work?


Now we know that boosting combines weak learners, a.k.a. base learners, to form a strong rule.
An immediate question which should pop into your mind is, ‘How does boosting identify weak rules?‘

To find a weak rule, we apply base learning (ML) algorithms with a different distribution. Each
time the base learning algorithm is applied, it generates a new weak prediction rule. This is an
iterative process. After many iterations, the boosting algorithm combines these weak rules
into a single strong prediction rule.

Here’s another question which might haunt you, ‘How do we choose a different distribution for
each round?’

For choosing the right distribution, here are the steps:

Step 1: The base learner takes all the distributions and assigns equal weight or attention to
each observation.

Step 2: If there is any prediction error caused by the first base learning algorithm, then we pay
higher attention to the observations having prediction errors. Then, we apply the next base
learning algorithm.
Step 3: Iterate Step 2 till the limit of the base learning algorithm is reached or higher accuracy is
achieved.

Finally, it combines the outputs from the weak learners and creates a strong learner which
eventually improves the prediction power of the model. Boosting pays higher focus to
examples which are mis-classified or have higher errors due to the preceding weak rules.
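
As a concrete, hedged illustration of this re-weighting loop, AdaBoost is the classic example and takes only a few lines in scikit-learn (assumes X_train, y_train and X_test already exist):

from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
model.fit(X_train, y_train)   # each round, mis-classified observations receive a higher weight
predicted = model.predict(X_test)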
There are many boosting algorithms which impart an additional boost to the model’s accuracy. In
this tutorial, we’ll learn about the two most commonly used algorithms, i.e. Gradient Boosting
(GBM) and XGBoost.

11. Which is more powerful: GBM or Xgboost?


I’ve always admired the boosting capabilities of the xgboost algorithm. At times, I’ve found that
it provides better results compared to GBM implementations, but at times you might find that
the gains are just marginal. When I explored more about its performance and the science behind
its high accuracy, I discovered many advantages of Xgboost over GBM:

1. Regularization:
o A standard GBM implementation has no regularization, unlike XGBoost; the
regularization also helps to reduce overfitting.
o In fact, XGBoost is also known as a ‘regularized boosting‘ technique.
2. Parallel Processing:
o XGBoost implements parallel processing and is blazingly fast compared
to GBM.
o But hang on, we know that boosting is a sequential process, so how can it be
parallelized? We know that each tree can be built only after the previous one,
so what stops us from making a tree using all cores? I hope you get where I’m
coming from. Check this link out to explore further.
o XGBoost also supports implementation on Hadoop.
3. High Flexibility
o XGBoost allows users to define custom optimization objectives and
evaluation criteria.
o This adds a whole new dimension to the model and there is no limit to what
we can do.
4. Handling Missing Values
o XGBoost has an in-built routine to handle missing values.
o The user is required to supply a value different from other observations and pass
it as a parameter. XGBoost tries different things as it encounters a missing
value at each node and learns which path to take for missing values in future.
5. Tree Pruning:
o A GBM would stop splitting a node when it encounters a negative loss in the
split. Thus it is more of a greedy algorithm.
o XGBoost, on the other hand, makes splits up to the max_depth specified and
then starts pruning the tree backwards, removing splits beyond which there
is no positive gain.
o Another advantage is that sometimes a split with a negative loss, say -2, may be
followed by a split with a positive loss of +10. GBM would stop as it encounters -2.
But XGBoost will go deeper; it will see a combined effect of +8 for the splits
and keep both.
6. Built-in Cross-Validation
o XGBoost allows the user to run a cross-validation at each iteration of the
boosting process, and thus it is easy to get the exact optimum number of
boosting iterations in a single run.
o This is unlike GBM, where we have to run a grid search and only a limited
set of values can be tested.
7. Continue on Existing Model
o User can start training an XGBoost model from its last iteration of previous run.
This can be of significant advantage in certain specific applications.
o GBM implementation of sklearn also has this feature so they are even on this
point.

12. Working with GBM in R and Python


Before we start working, let’s quickly understand the important parameters and the working
of this algorithm. This will be helpful for both R and Python users. Below is the overall pseudo-
code of the GBM algorithm for 2 classes:

1. Initialize the outcome

2. Iterate from 1 to the total number of trees

2.1 Update the weights for targets based on the previous run (higher for the ones mis-classified)

2.2 Fit the model on a selected subsample of data

2.3 Make predictions on the full set of observations

2.4 Update the output with the current results, taking into account the learning rate

3. Return the final output.

This is an extremely simplified (probably naive) explanation of GBM’s working. But it will help
beginners understand this algorithm.

Let’s consider the important GBM parameters used to improve model performance in Python:

1. learning_rate
o This determines the impact of each tree on the final outcome (step 2.4). GBM
works by starting with an initial estimate which is updated using the output of
each tree. The learning parameter controls the magnitude of this change in the
estimates.
o Lower values are generally preferred as they make the model robust to the
specific characteristics of each tree, thus allowing it to generalize well.
o Lower values also require a higher number of trees to model all the relations
and will be computationally expensive.
2. n_estimators
o The number of sequential trees to be modeled (step 2).
o Though GBM is fairly robust to a higher number of trees, it can still overfit at
some point. Hence, this should be tuned using CV for a particular learning rate.
3. subsample
o The fraction of observations to be selected for each tree. Selection is done by
random sampling.
o Values slightly less than 1 make the model robust by reducing the variance.
o Typical values of ~0.8 generally work fine but can be fine-tuned further.

Apart from these, there are certain miscellaneous parameters which affect overall
functionality:

1. loss
o It refers to the loss function to be minimized in each split.
o It can have various values for the classification and regression cases. Generally the
default values work fine. Other values should be chosen only if you understand
their impact on the model.
2. init
o This affects the initialization of the output.
o This can be used if we have made another model whose outcome is to be
used as the initial estimate for GBM.
3. random_state
o The random number seed, so that the same random numbers are generated every
time.
o This is important for parameter tuning. If we don’t fix the random number, then
we’ll have different outcomes for subsequent runs on the same parameters and
it becomes difficult to compare models.
o It can potentially result in overfitting to a particular random sample selected.
We can try running models for different random samples, which is
computationally expensive and generally not done.
4. verbose
o The type of output to be printed when the model fits. The different values can
be:
 0: no output generated (default)
 1: output generated for trees at certain intervals
 >1: output generated for all trees
5. warm_start
o This parameter has an interesting application and can help a lot if used
judiciously.
o Using this, we can fit additional trees on previous fits of a model. It can save a
lot of time, and you should explore this option for advanced applications.
6. presort
o Selects whether to presort the data for faster splits.
o The selection is made automatically by default, but it can be changed if needed.

I know it’s a long list of parameters, but I have simplified it for you in an Excel file which you
can download from this GitHub repository.

For R users, using the caret package, these are the main tuning parameters:

1. n.trees – It refers to the number of iterations, i.e. the number of trees to grow.
2. interaction.depth – It determines the complexity of the tree, i.e. the total number of splits it
has to perform on a tree (starting from a single node).
3. shrinkage – It refers to the learning rate. This is similar to learning_rate in Python
(shown above).
4. n.minobsinnode – It refers to the minimum number of training samples required in a node
to perform splitting.

GBM in R (with cross validation)


I’ve shared the standard code in R and Python. At your end, you’ll be required to change the
name of the dependent variable and the data set used in the code below. Considering the
ease of implementing GBM in R, one can easily perform tasks like cross validation and grid
search with this package.

> library(caret)

> fitControl <- trainControl(method = "cv", number = 10)  # 10-fold cross validation

> tune_Grid <- expand.grid(interaction.depth = 2,

                           n.trees = 500,

                           shrinkage = 0.1,

                           n.minobsinnode = 10)

> set.seed(825)

> fit <- train(y_train ~ ., data = train,

               method = "gbm",

               trControl = fitControl,

               verbose = FALSE,

               tuneGrid = tune_Grid)

> predicted = predict(fit, test, type = "prob")[,2]

GBM in Python
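
(The original Python snippet is missing here; the following is a minimal scikit-learn sketch using the parameters discussed above, assuming X_train, y_train and x_test exist.)

from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    learning_rate=0.1,   # shrinkage applied to each tree's contribution
    n_estimators=500,    # number of sequential trees
    subsample=0.8,       # fraction of observations used per tree
    max_depth=2,
    random_state=825,
)
model.fit(X_train, y_train)
predicted = model.predict_proba(x_test)[:, 1]   # probability of the positive class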

13. Working with XGBoost in R and Python


XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting
algorithm. Its support for parallel computing makes it at least 10 times faster than
older gradient boosting implementations. It supports various objective functions, including
regression, classification and ranking.

R Tutorial: For R users, this is a complete tutorial on XGBoost which explains the parameters
along with code in R. Check Tutorial.

Python Tutorial: For Python users, this is a comprehensive tutorial on XGBoost, good to get
you started. Check Tutorial.
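
For a quick start without leaving this page, here is a hedged sketch using the xgboost Python package's scikit-learn-style interface (assumes X_train, y_train and x_test exist; the parameter values are only examples):

from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=500, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
predicted = model.predict(x_test)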

14. Where to practice?


Practice is the one true method of mastering any concept. Hence, you need to start
practicing if you wish to master these algorithms.

Till here, you’ve gained significant knowledge of tree based models along with their
practical implementation. It’s time that you start working on them. Here are open practice
problems where you can participate and check your live rankings on the leaderboard:

 Practice Problem: Food Demand Forecasting Challenge – Predict the demand of meals for a meal delivery company
 Practice Problem: HR Analytics Challenge – Identify the employees most likely to get promoted
 Practice Problem: Predict Number of Upvotes – Predict the number of upvotes on a query asked at an online question & answer platform

KNN

Table of Contents
 When do we use KNN algorithm?
 How does the KNN algorithm work?
 How do we choose the factor K?

When do we use KNN algorithm?


KNN can be used for both classification and regression predictive problems. However, it is
more widely used in classification problems in the industry. To evaluate any technique we
generally look at 3 important aspects:

1. Ease to interpret output

2. Calculation time

3. Predictive Power

Let us take a few examples to place KNN on this scale:

The KNN algorithm fares well across all parameters of consideration. It is commonly used for its
ease of interpretation and low calculation time.
How does the KNN algorithm work?
Let’s take a simple case to understand this algorithm. Following is a spread of red circles
(RC) and green squares (GS):

You intend to find out the class of the blue star (BS). BS can either be RC or GS and nothing
else. The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote
from. Let’s say K = 3. Hence, we will now make a circle with BS as the center, just big enough
to enclose only three data points on the plane. Refer to the following diagram for more details:

The three closest points to BS are all RC. Hence, with a good confidence level, we can say that
BS should belong to the class RC. Here, the choice became very obvious as all three
votes from the closest neighbors went to RC. The choice of the parameter K is very crucial in
this algorithm. Next we will understand the factors to be considered in order to conclude the
best K.
How do we choose the factor K?
First let us try to understand what exactly K influences in the algorithm. If we look at the last
example, given that all 6 training observations remain constant, with a given K value we
can make boundaries for each class. These boundaries will segregate RC from GS. In the same
way, let’s try to see the effect of the value of “K” on the class boundaries. Following are the
different boundaries separating the two classes for different values of K.

If you watch carefully, you can see that the boundary becomes smoother with increasing
value of K. With K increasing to infinity it finally becomes all blue or all red, depending on the
total majority. The training error rate and the validation error rate are two parameters we
need to assess for different K values. Following is the curve for the training error rate with
varying value of K:

As you can see, the error rate at K=1 is always zero for the training sample. This is because the
closest point to any training data point is itself. Hence the prediction is always accurate with
K=1. If the validation error curve were similar, our choice of K would have been 1.
Following is the validation error curve with varying value of K:

This makes the story clearer. At K=1, we were overfitting the boundaries. Hence, the error
rate initially decreases and reaches a minimum. After the minimum point, it then increases with
increasing K. To get the optimal value of K, you can segregate the training and validation sets
from the initial dataset. Now plot the validation error curve to get the optimal value of K. This
value of K should be used for all predictions.
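
A minimal Python sketch of this procedure, assuming a feature matrix X and labels y already exist (the K range is an illustrative choice):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_val, y_val)   # validation accuracy for this K
best_k = max(scores, key=scores.get)      # the K with the lowest validation error
print(best_k)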
NAIVE BAYES

Table of Contents
1. What is Naive Bayes algorithm?
2. How Naive Bayes Algorithms works?
3. What are the Pros and Cons of using Naive Bayes?
4. 4 Applications of Naive Bayes Algorithm
5. Steps to build a basic Naive Bayes Model in Python
6. Tips to improve the power of Naive Bayes Model

What is Naive Bayes algorithm?


It is a classification technique based on Bayes’ Theorem with an assumption of independence
among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a
particular feature in a class is unrelated to the presence of any other feature.

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches
in diameter. Even if these features depend on each other or upon the existence of the other
features, all of these properties independently contribute to the probability that this fruit is an
apple and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets. Along with
simplicity, Naive Bayes is known to outperform even highly sophisticated classification
methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and
P(x|c). Look at the equation below:

P(c|x) = P(x|c) * P(c) / P(x)

Above,

 P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
 P(c) is the prior probability of class.
 P(x|c) is the likelihood which is the probability of predictor given class.
 P(x) is the prior probability of predictor.

How does the Naive Bayes algorithm work?


Let’s understand it using an example. Below I have a training data set of weather and the
corresponding target variable ‘Play’ (suggesting possibilities of playing). Now, we need to
classify whether players will play or not based on the weather condition. Let’s follow the steps
below to perform it.

Step 1: Convert the data set into a frequency table.

Step 2: Create a Likelihood table by finding probabilities like Overcast probability = 0.29
and probability of playing = 0.64.

Step 3: Now, use the Naive Bayesian equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of the prediction.

Problem: Players will play if the weather is sunny. Is this statement correct?

We can solve it using the method of posterior probability discussed above.

P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher probability.
Naive Bayes uses a similar method to predict the probability of different classes based on
various attributes. This algorithm is mostly used in text classification and with problems
having multiple classes.
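
The arithmetic can be checked in a couple of lines (values taken from the worked example above):

p_sunny_given_yes, p_yes, p_sunny = 3/9, 9/14, 5/14
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # -> 0.6, so "Yes" (play) is the more likely class given Sunny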

What are the Pros and Cons of Naive Bayes?


Pros:

 It is easy and fast to predict the class of a test data set. It also performs well in multi-class
prediction.
 When the assumption of independence holds, a Naive Bayes classifier performs better
compared to other models like logistic regression, and you need less training data.
 It performs well with categorical input variables compared to numerical variable(s).
For numerical variables, a normal distribution is assumed (bell curve, which is a strong
assumption).

Cons:

 If a categorical variable has a category (in the test data set) which was not observed in the
training data set, then the model will assign it a 0 (zero) probability and will be unable to
make a prediction. This is often known as “Zero Frequency”. To solve this, we can use
a smoothing technique. One of the simplest smoothing techniques is called Laplace
estimation.
 On the other side, Naive Bayes is also known to be a bad estimator, so the probability
outputs from predict_proba are not to be taken too seriously.
 Another limitation of Naive Bayes is the assumption of independent predictors. In real
life, it is almost impossible to get a set of predictors which are completely
independent.

4 Applications of Naive Bayes Algorithms


 Real time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast.
Thus, it can be used for making predictions in real time.
 Multi-class Prediction: This algorithm is also well known for its multi-class prediction
feature. Here we can predict the probability of multiple classes of the target variable.
 Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers,
mostly used in text classification (due to better results in multi-class problems and the
independence rule), have a higher success rate compared to other algorithms. As a
result, they are widely used in spam filtering (identifying spam e-mail) and sentiment analysis
(in social media analysis, to identify positive and negative customer sentiments).
 Recommendation System: A Naive Bayes Classifier and Collaborative
Filtering together build a recommendation system that uses machine learning and
data mining techniques to filter unseen information and predict whether a user would
like a given resource or not.

How to build a basic model using Naive Bayes in Python and R?

Again, scikit-learn (a Python library) will help here to build a Naive Bayes model in Python.
There are three types of Naive Bayes models under the scikit-learn library:

 Gaussian: It is used in classification and it assumes that features follow a normal


distribution.

 Multinomial: It is used for discrete counts. For example, let’s say we have a text
classification problem. Instead of “word occurring in the document” (a Bernoulli trial), we
go one step further and count how often a word occurs in the document; you can think of it
as “the number of times outcome x_i is observed over the n trials”.

 Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros
and ones). One application would be text classification with the ‘bag of words’ model,
where the 1s & 0s are “word occurs in the document” and “word does not occur in the
document” respectively.
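
A minimal Gaussian Naive Bayes sketch in scikit-learn (assumes X, y for training and x_test for scoring already exist):

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)
predicted = model.predict(x_test)
print(model.predict_proba(x_test))   # class probabilities; treat with caution (see Cons above)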

Tips to improve the power of the Naive Bayes Model


Here are some tips for improving the power of a Naive Bayes model:

 If continuous features do not have a normal distribution, we should use a transformation


or other methods to convert them into a normal distribution.
 If the test data set has a zero frequency issue, apply a smoothing technique such as “Laplace
Correction” to predict the class of the test data set.
 Remove correlated features, as highly correlated features are voted twice in the
model and this can lead to over-inflating their importance.
 Naive Bayes classifiers have limited options for parameter tuning, like alpha=1 for
smoothing, fit_prior=[True|False] to learn class prior probabilities or not, and some
other options (look at the details here). I would recommend focusing on your pre-processing
of data and on feature selection.
 You might think of applying some classifier combination techniques like ensembling,
bagging and boosting, but these methods would not help. Actually, “ensembling,
boosting, bagging” won’t help since their purpose is to reduce variance. Naive Bayes
has no variance to minimize.
