

1. Parametric vs Non-Parametric models:

 For a parametric model, only the learned parameters, which are fixed in number, are needed to
predict new data. Examples: linear regression, logistic regression, SVM.
 A non-parametric model needs both the parameters and the data observed so far to make
predictions. It makes no underlying assumption about the data.
 Non-parametric models are flexible and can make better predictions. Choose one when there is
lots of data, no prior knowledge, and no great worry about picking exactly the right features.
 Examples: KNN, decision trees, ANN.

2. Supervised vs Unsupervised models:

 For supervised learning, we have prior knowledge of the output values for the input samples. So,
the goal is to learn the function that maps the features to the output.
 Examples are classification (discrete output) and regression (continuous output). E.g. Logistic,
SVM, Random Forests, face recognition.
 If you have small amount of data, opt for low complexity i.e low variance model otherwise it
would result in overfitting.
 For unsupervised learning, we do not have any labeled outputs. So, the task is to determine, the
inherent structure present within the data points.
 Examples are clustering (discrete) and dimensionality reduction (continuous). E.g. k-means, PCA,
customer segmentation.

3. Overfitting vs Underfitting models:

 A model learns relationships between the inputs called features, and outputs called labels, from
a training dataset. During training, the model is given both the features and the labels and
learns how to map the former to the latter.
 A trained model is evaluated on a testing set, where we only give it the features and it makes
predictions. We compare the predictions with the known labels for the testing set to calculate
accuracy. Models can take many shapes, from simple linear regressions to deep neural networks,
but all supervised models are based on the fundamental idea of learning relationships between
inputs and outputs from training data.

 Variance: how much the model's estimates depend on the particular training data; the spread or
uncertainty in the estimates. For OLS linear regression, Var(b^) = (SER)^2 * (X'X)^-1, where X is the
matrix of independent variables and SER is the standard error of the regression; (X'X)^-1 scales the
variance-covariance matrix of the estimates.
 Bias: a measure of the difference between the expected value of the predicted
function/parameter and its actual value; the ACCURACY of the estimates. For linear regression it is
E[b^] - b. A high-bias model makes strong assumptions about the data, and its errors come from
those assumptions. E.g. fitting linear regression to nonlinear data results in high bias.
 Generalization/Test error: with Y = f(X) + e as the actual value and f^(X) as the prediction, the
expected test error decomposes as E[(Y - f^(X))^2] = Bias^2 + Variance + Var(e), where Var(e) is the
irreducible error.
 Bias and variance move in opposite directions.
 Underfit model: poor in-sample prediction; low variance and high bias. It will have high training
and test error.
 Overfit model: good in-sample prediction but poor out-of-sample prediction; high variance and
low bias. It will have very low training error but high testing error, high cross-validation error and a
large generalization gap.
 Remember the error vs complexity curve for both. Training error can be computed directly;
generalization/true error cannot — it is the expected value of the test error, and the generalization
gap is the difference between test and training error.
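The underfit/overfit behaviour above can be sketched numerically: fit polynomials of increasing degree to noisy nonlinear data and compare train vs held-out error. The data, degrees, and noise level below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 30)
x_test = np.linspace(0, 1, 100)
f = lambda x: np.sin(2 * np.pi * x)          # true (nonlinear) signal
y_train = f(x_train) + rng.normal(0, 0.3, x_train.size)
y_test = f(x_test) + rng.normal(0, 0.3, x_test.size)

def errors(degree):
    coefs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    mse = lambda x, y: np.mean((np.polyval(coefs, x) - y) ** 2)
    return mse(x_train, y_train), mse(x_test, y_test)

tr1, te1 = errors(1)    # underfit: high bias, high train AND test error
tr4, te4 = errors(4)    # reasonable complexity
tr12, te12 = errors(12) # overfit: very low train error, typically worse test error
print(tr1, te1, tr4, te4, tr12, te12)
```

Training error can only go down as degree grows (nested least-squares fits), which is exactly why it cannot be used alone to pick model complexity.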

4. About KNN:

1. Non-Parametric, i.e. no assumptions about data. Therefore, KNN could and probably should be
one of the first choices for a classification study when there is little or no prior knowledge
about the distribution of the data.
2. Supervised.
3. Used for both classification and regression.
4. It is a lazy algorithm: it does not use the training data points to build any generalization,
i.e. there is almost no training phase. It keeps all the training data and does the work at
prediction time.
5. K in KNN is the number of nearest neighbors we wish to take vote from. For example, in the
webpage given above, if we take K=3, we draw a circle around the blue star to enclose 3 data
points on the plane. And then the majority data points are the class of that star.
 We take a single test case and multiple rows of training set. For clarity, here single test
case means one row, which can have multiple columns. Consider that single point
represented in n dimensions, where n is the number of columns.
 We calculate the Euclidean distance as the measure of distance between the test case and each
row of the training set. Euclidean distance is the straight-line distance between 2 points in
N-dimensional space.
 Sort the distances in ascending order. (K nearest)
 Get top K rows from the sorted array and then the most frequent class of these rows.
 That would be the predicted class.
6. How to decide K? This is called Hyperparameter optimization.
I. If K is too small, then Overfitting: low training error but high testing error. For example,
for K=1, training error is 0. But testing error is too large.
II. As K increases, the testing error drops up to a point but then increases again. So, decide
the optimal value of K at the minima.
III. As a rule of thumb, K is often chosen odd (to break ties) and around the square root of n.
7. Output: For Classification, it is a discrete variable of some class. For regression, it is an average
of the values of its neighbors.
8. Scikit-learn: K is specified along with the test data. Based on point 5, the K nearest training
rows are selected and the test case is classified.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
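A minimal end-to-end sketch of the steps above (distances, sort, majority vote) using scikit-learn; the toy data points are made up for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1],      # class 0 cluster
           [8, 8], [8, 9], [9, 8]]      # class 1 cluster
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # K = 3, Euclidean distance by default
knn.fit(X_train, y_train)                   # "lazy": fit just stores the data
print(knn.predict([[2, 2], [9, 9]]))        # majority vote of 3 nearest rows -> [0 1]
```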

5. CART: (Rpart, Caret, Sklearn)

Python: from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
clf = DecisionTreeClassifier()  # or DecisionTreeRegressor() for regression


 Supervised learning.
 Non-parametric: when there is no idea about the distribution of the data, a non-parametric
method like this is appropriate.
 Binary splits because the problem is that multiway splits fragments the data too quickly,
leaving insufficient data at the next level down. Hence, we would want to use such splits
only when needed. Since multiway splits can be achieved by a series of binary splits, the
latter are preferred. Multiway splitting results in overfitting.
 Different tree-based methods are C4.5, CART, C5.0.
 CART- Divide the dataset into homogenous subsets (uses Gini Index for classification)
based on the most significant splitter. Eg Gender, Age and city are 3 variables given.
 In case of regression tree, the value obtained by terminal nodes in the training data is
the mean response of observation falling in that region. Thus, if an unseen data
observation falls in that region, we’ll make its prediction with mean values of known
target variable. In case of classification tree, the value (class) obtained by terminal node
in the training data is the mode of observations falling in that region. Thus, if an unseen
data observation falls in that region, we’ll make its prediction with mode value
 Both the trees follow a top-down greedy approach known as recursive binary splitting.
We call it as ‘top-down’ because it begins from the top of tree when all the observations
are available in a single region and successively splits the predictor space into two new
branches down the tree. It is known as ‘greedy’ because, the algorithm cares (looks for
best variable available) about only the current split, and not about future splits which
will lead to a better tree.
 Root Node -> Decision Node -> Leaf/Terminal Node
 Different criteria measure the best/homogeneity of the target variable in a given node.
A) Gini Impurity for Classification trees (USED BY CART):
1. The probability that a randomly selected element would be labeled incorrectly if it
were randomly classified according to the distribution of the labels in the node.
2. The Gini impurity can be computed by summing, over labels, the probability (Pi) of an item
with label i being chosen times the probability (1 - Pi) of a mistake in categorizing that item.
3. If the node contains one class only, then impurity is 0, which is the minimum value. The
maximum value occurs when all classes have the same probability/are equally weighted. The
objective is to minimize the Gini impurity.
4. For the example given below: GI of the female node is 1 - 0.2^2 - 0.8^2 = 2*0.2*0.8 = 0.32.
GI for the male node: 2*0.65*0.35 = 0.45.
Weighted impurity of the split = (10/30)*0.32 + (20/30)*0.45 = 0.41. The chosen split is the one
with the least weighted impurity.

B) Information gain/Entropy for Classification trees: Used by C4.5, C5.0 decision trees.
1. Entropy is a measure of the degree of disorganization in a system. Therefore, more
the entropy, more the information it requires to describe that system. If the sample
is completely homogeneous, then the entropy is zero and if the sample is an equally
divided (50% – 50%), it has entropy of one.

For example, I divide a parent node that has 30 elements into 2 decision nodes on
the basis of gender.
Female node has 10 total and has 2 Yes and 8 No. Male node has 20 total with 13 yes
and 7 no.
So, the entropies are: -(2/10)*log2(2/10) - (8/10)*log2(8/10) = 0.72 for the female node, and 0.93
for the male node.
Total entropy for the split: (10/30)*0.72 + (20/30)*0.93 = 0.86. This should be the minimum
(equivalently, information gain = parent entropy - 0.86 should be the maximum).

C) Variance Reduction for Regression trees (Used by CART):

1. How much the variance is reduced when I move from a parent node to a child node.
2. Assign value of 1 for a yes and value of 0 for a no.
3. So, mean for the female: 0.2. Variance is (2/10) *(1-0.2)^2+(8/10)(0-0.2)^2=0.16.
Mean of a male is 0.65 and Variance of male is 0.23. So, calculate the variance as
weighted average=0.21. Parent node had 0.25.
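The three criteria above can be checked with a few lines of Python, using the gender example's numbers (female node 2 yes / 8 no, male node 13 yes / 7 no, 30 observations in the parent):

```python
from math import log2

def gini(p):                 # Gini impurity of a binary node, p = P(yes)
    return 1 - p**2 - (1 - p)**2

def entropy(p):              # entropy (base 2) of a binary node
    if p in (0, 1):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def variance(p):             # variance when yes = 1, no = 0
    return p * (1 - p)

w_f, w_m = 10 / 30, 20 / 30  # node weights
p_f, p_m = 2 / 10, 13 / 20   # proportion of "yes" in each node

g_split = w_f * gini(p_f) + w_m * gini(p_m)          # ~0.41
e_split = w_f * entropy(p_f) + w_m * entropy(p_m)    # ~0.86
v_split = w_f * variance(p_f) + w_m * variance(p_m)  # ~0.205 (parent: 0.25)
print(g_split, e_split, v_split)
```

Note that for a 0/1-coded target the node variance simplifies to p(1-p), which matches the weighted-square computation in point 3 above.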

 Parameter Tuning in a decision tree:

Overfitting is one of the key challenges faced while modeling decision trees. If no limit is set
on a decision tree, it will give you 100% accuracy on the training set because, in the worst case,
it will end up making one leaf per observation. Thus, preventing overfitting is pivotal while
modeling a decision tree, and it can be done in 2 ways:
1. Setting constraints on how the tree is constructed: Similar to RF like minimum number
of observations in a parent node to split further, leaf size/tree depth, maximum number
of terminal nodes, no of features to be considered for split. CROSS VALIDATION to
decide the optimal number for parameters.

The technique of setting constraints is a greedy approach. In other words, the algorithm checks
for the best split at each step and moves forward until one of the specified stopping
conditions is reached; it does not think about future splits.

2. Tree pruning: In decision trees, pruning is a method to avoid overfitting. Pruning means
selecting the subtree that leads to the lowest test error rate: prune the subtree whose
replacement by a leaf with the majority class label gives the highest reduction in error. We can
use cross-validation to estimate the test error rate of a subtree.

 Decision trees can be unstable because small variations in the data might result in a
completely different tree being generated. This problem is mitigated by using decision
trees within an ensemble.


6. Ensemble methods:

 Ensemble methods are machine-learning techniques that combine several base models (usually of
the same type) in order to produce one optimal predictive model. The purpose is to minimize
either the bias or the variance.
Total generalization error = Bias^2 + Variance (+ irreducible error). So, ensemble methods are used
to decrease the total error, either by decreasing the variance or by decreasing the bias.

There are 2 types of Ensemble techniques:

 Bagging
 Boosting

Bagging and Boosting decrease the variance of your single estimate, since averaging several
estimates from different models reduces variance. So, the result may be a model with higher
stability.

If the problem is that the single model gets a very low performance, Bagging will rarely get
a better bias. However, Boosting could generate a combined model with lower errors as it
optimizes the advantages and reduces pitfalls of the single model.
By contrast, if the difficulty of the single model is over-fitting, then Bagging is the best option.
Boosting for its part doesn't help to avoid over-fitting; in fact, this technique faces that
problem itself. For this reason, Bagging is effective more often than Boosting.
[Margin note PP1: Why is Bagging not for bias reduction, and why is Boosting not for overfitting?]

1. Bagging and Boosting get N learners by generating additional data in the training stage. N new
training data sets are produced by random sampling with replacement from the original set. By
sampling with replacement some observations may be repeated in each new training data set.
For example:

2. In the case of Bagging, any element has the same probability to appear in a new data set. The
training stage is parallel for Bagging (i.e., each model is built independently)

However, for Boosting (initially equally weighted) the observations are weighted and therefore
some of them will take part in the new sets more often. Boosting builds the new learner in a
sequential way:
In Boosting algorithms each classifier is trained on data, considering the previous classifiers'
success. After each training step, the weights are redistributed. Misclassified data increases its
weight to emphasize the most difficult cases. In this way, subsequent learners will focus on
them during their training.
[Margin note PP2: So essentially Boosting refers to overweighting the harder observations so
that the model learns to detect them more easily.]

3. To predict the class of new data we only need to apply the N learners to the new observations.
In Bagging the result is obtained by averaging the responses of the N learners (or majority
vote). However, Boosting assigns a second set of weights, this time for the N classifiers, in
order to take a weighted average of their estimates.
In the Boosting training stage, the algorithm allocates weights to each resulting model. A
learner with a good classification result on the training data will be assigned a higher
weight than a poor one. So, when evaluating a new learner, boosting needs to keep track
of learners' errors, too.

Some of the Boosting techniques include an extra condition to keep or discard a single
learner. For example, in AdaBoost, the most renowned, an error less than 50% is required
to keep the model; otherwise, the iteration is repeated until achieving a learner better
than a random guess.

In BAGGING, for each bootstrapped sample (drawn with replacement) all the features of the data
are considered while making a tree; multiple trees are formed, and their predictions are
aggregated. Because every tree sees all features, the trees are correlated, so the variance
reduction is less than in random forests.
Bagging improves accuracy only for unstable classifiers, i.e. those where a small change in the
input leads to large changes in the output; a decision tree is an example. KNN is a stable
classifier, so bagging can harmfully impact its output; Boosting could be used instead.
Boosting has a large number of parameters to tune, and is therefore computationally expensive.
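The bagging-vs-boosting contrast above can be sketched with scikit-learn: BaggingClassifier trains trees independently on bootstrap samples, while AdaBoostClassifier trains learners sequentially and re-weights the misclassified cases. The dataset and settings are illustrative, not from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: full-depth trees, each fit independently on a bootstrap sample
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0).fit(X_tr, y_tr)
# Boosting: shallow learners (stumps by default), fit sequentially
boost = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print(bag.score(X_te, y_te), boost.score(X_te, y_te))
```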

7. Random Forest:

randomForest(formula, data=NULL, ..., subset, na.action=na.fail)

It is a type of Bagging technique on decision trees but with the below differences:

a) Each tree selects m out of M random/different features for splitting. This means that 2
trees generated on same training data will have m randomly different variables selected
at each split, hence this is how the trees will get de-correlated and will be independent
of each other. Therefore, less prone to overfitting. VERY VERY VERY IMPORTANT.
b) In random forests, about one-third of the cases are left out of the bootstrap sample and
not used in the construction of the kth tree, called OOB sample. Hence, there is no need
for cross-validation, or a separate test set to get an unbiased estimate of the test set
error. (subset in the above command) VERY VERY VERY IMPORTANT.
c) The value of m is held constant while growing the forest.
d) In R, random forest internally takes care of missing values using mean/ mode
e) For regression, the random forest takes an average of all the individual decision tree
estimates. For classification, the random forest will take a majority vote for the
predicted class.
f) IMPORTANCE parameters:
There are two measures of importance given for each variable in the random forest.
 Accuracy based importance is a measure of by how much including a variable increases
accuracy or of by how much removing a variable decreases accuracy. Thus, higher the
number more important it becomes. As mentioned before each tree has its own out-of-bag
sample of data that was not used during construction.
For regression accuracy is %INCMSE:
i. First, the prediction accuracy (MSE for regression) on the out-of-bag sample is measured.
Call it mse0.
ii. Then, the values of that variable in the out-of-bag sample are randomly shuffled,
keeping all other variables the same. Random shuffling means that, on average, the shuffled
variable has no predictive power. Measure the MSE again and call it MSE(j).
iii. %IncMSE of the j'th variable is (MSE(j) - mse0)/mse0 * 100%.

For classification it is MeanDecreaseAccuracy, (which is the error rate):

Accuracy: Overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 = 0.91.
Misclassification Rate: Overall, how often is it wrong? (FP+FN)/total = (10+5)/165 = 0.09
equivalent to 1 minus Accuracy also known as "Error Rate".
The same process as for %IncMSE, except the quantity measured is accuracy. So, the greater the
decrease in accuracy (i.e. the increase in the error rate) after shuffling a variable, the more
important that variable is. VERY VERY VERY IMPORTANT.

 Purity based importance: Total decrease in the node impurities / (Total increase in node
purities) from splitting on the variable, averaged over all trees. For regression, it is measured by
the reduction in Residual sum of squares (IncNodePurity), whenever a variable is chosen to
split. For classification, the node impurity is measured by the Gini impurity (MeanDecreaseGini)
For each variable, the sum of the Gini/RSS decrease across every tree of the forest is
accumulated every time that variable is chosen to split a node. The sum is divided by the
number of trees in the forest to give an average.

One advantage of the Gini-based importance is that the Gini calculations are already performed
during training, so minimal extra computation is required.
A disadvantage is that splits are biased towards variables with many classes, which also biases
the importance measure.
IncNodePurity is biased and should only be used if the extra computation time of calculating
%IncMSE is unacceptable. Since it only takes ~5-25% extra time to calculate %IncMSE, this
would almost never happen

g) Each tree is grown to the largest extent possible: no pruning, each tree is grown fully. VERY
VERY VERY IMPORTANT.
h) Parameter Tuning: Mainly, there are three parameters in the random forest algorithm
which you should look at (for tuning); cross-validation is the method, and there are relatively
few parameters to tune. VERY VERY VERY IMPORTANT. "RandomizedSearchCV" in sklearn
automates the search.
 ntree – As the name suggests, the number of trees to grow. More trees give more stable
estimates, but it will be more computationally expensive to build the model.
 mtry – It refers to how many variables we should select at a node split. As mentioned
above, the default value is p/3 for regression and sqrt(p) for classification. Increasing
mtry/max_features gives each node more options to consider and can improve performance, but
it also makes the individual trees more correlated, decreasing the diversity that is the USP
of random forest, and it slows the algorithm down. Hence, you need to strike the right
balance and choose the optimal mtry/max_features, e.g. by cross-validation.
 leaf size/node size – It refers to how many observations we want in the terminal nodes.
This parameter is inversely related to tree depth: the higher the number, the shallower the
tree. Too deep a tree means overfitting, but too shallow a tree might fail to recognize useful
signals in the data. For example, with too large a leaf size, say 500 in the example above, the
tree would stop growing after the second split itself, meaning poor predictive performance.
i) Not useful for small datasets; a black box, so it can't be used for descriptive statistics;
gives preference to ordinal features with a greater number of levels; and is computationally
expensive.

j) Important links/Extra:
 Importance Parameters
 VarImpPlot(model) and importance(model) gives the same results. Importance () is a part
of randomForest package. However, varIMP is a part of Caret Package.
 Therefore, in order to avoid waiting time, let’s impute the missing values using median /
mode imputation method; i.e., missing values in the integer variable will be imputed with
median and factor variables will be imputed with mode (most frequent value).

8. PCA:
1. Dimension reduction by feature extraction, i.e. basically create new variables (principal
components) from the old ones such that each new one is a linear combination of the
original variables, sorted in decreasing order of the variance they explain. So,
later on, we can delete the least important ones.
2. The new variables created are independent of each other, because they are orthogonal
to each other, one of the assumptions of linear regression.
3. Eigenvalues & Eigenvectors: solve det(A - λI) = 0, where A is a square matrix (here built from
the X variables) and I is the identity matrix. The solutions λ1, λ2, … are the eigenvalues.
Then, substituting each λ back into (A - λI) and solving (A - λI)x = 0 gives the corresponding
vectors x1, x2, …, the eigenvectors, such that A·x = λ·x.

4. I have my original matrix Z formed out of X by standardization. Calculate Z’Z to get the
variance-covariance matrix. The sum of elements along the diagonal is the total
variance. Decompose my Z’Z into PDP^-1
5. D matrix has diagonal elements as eigenvalues and rest as 0, P has corresponding
columns as eigenvectors. The eigenvalues on the diagonal of D will be associated with
the corresponding column in P — that is, the first element of D is λ₁ and the
corresponding eigenvector is the first column of P. So, sort D from largest to smallest,
in this way P also gets sorted and forms P*.
6. So, we get Z* = ZP*. Z* contains the principal components; the covariance matrix of Z* is
diagonal, with the sorted eigenvalues on the diagonal.
7. To determine how many features to keep, set a threshold for the proportion of variance
explained.
8. In Z'Z we have the variances along the diagonal, and in D the diagonal elements are the
sorted eigenvalues (rest 0). So, the first eigenvalue divided by the total variance is the %
variance explained by that component.
9. Basically, variance explained is eigenvalue/sum of eigenvalues. Because each eigenvalue is
roughly the importance of its corresponding eigenvector, the proportion of variance explained
is the sum of the eigenvalues of the features you kept divided by the sum of the eigenvalues
of all features.
10. https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-
11. PCA finds, in the data space, the dimension (direction) with the largest variance out of
the overall variance 1.343730519+.619205620+1.485549631 = 3.448. That largest
variance would be 1.651354285. Then it finds the dimension of the second largest
variance, orthogonal to the first one, out of the remaining 3.448-1.651354285 overall
variance. That 2nd dimension would be 1.220288343 variance. And so on.
12. An eigenvalue greater (less) than one implies that this component is summarizing a
component of the total variance which exceeds (is less than) the information provided by
the original variable. Therefore, it is common that only principal components with
eigenvalues greater than one are considered.
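Steps 4-9 above can be carried out directly with NumPy: standardize, eigendecompose the covariance matrix, sort, project, and read off the variance explained. The toy data matrix is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: 100 observations, 3 correlated variables
X = rng.normal(size=(100, 3)) @ np.array([[2, 0, 0], [0.5, 1, 0], [0, 0, 0.2]])

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize
C = Z.T @ Z / (len(Z) - 1)                 # covariance matrix (scaled Z'Z)

eigvals, eigvecs = np.linalg.eigh(C)       # eigendecomposition (C is symmetric)
order = np.argsort(eigvals)[::-1]          # sort eigenvalues largest first
eigvals, P = eigvals[order], eigvecs[:, order]

Z_star = Z @ P                             # principal components Z* = ZP*
explained = eigvals / eigvals.sum()        # proportion of variance per component
print(explained)                           # sums to 1
```

This also settles point 6: the variances of the columns of Z* equal the sorted eigenvalues, i.e. the covariance matrix of Z* is the diagonal matrix D.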

9. GLM:
 Generalized linear models, the general equation for which is:
g(E(y)) = α + βx1 + γx2
It has 3 components. Here, g() is the link function, E(y) is the expectation of the target
variable, and α + βx1 + γx2 is the linear predictor (α, β, γ to be estimated). The role of the
link function is to 'link' the expectation of y to the linear predictor. For example, g(E(y)) =
E(y) for linear regression and g(E(y)) = log(p/(1-p)) for logistic regression, where p is the
probability of success.
 Above 3 components: systematic component, the explanatory variables or the RHS in
the above equation. Random component, the response variable and its probability
function. For example, normal distribution for linear regression and Binomial
distribution for logistic regression. Link function, as above specifies the link between
the random and the systematic component.
 Binomial distribution is a discrete probability distribution represented by B(n, p), where
n is the number of trials and p is the probability of success. The mean is np and the variance
is np(1-p). A single trial is called a Bernoulli trial.
 Assumptions/Important points:
1. GLM does not assume a linear relationship between the dependent and independent
variables. However, it assumes a linear relationship between the link-transformed expectation
of the response and the independent variables.
2. The dependent variable and error need not to be normally distributed. But both have
independent distribution
3. It does not use OLS (Ordinary Least Squares) for parameter estimation. Instead, it uses
maximum likelihood estimation (MLE), a parameter estimation method that chooses the
parameters maximizing the likelihood of the observed data.


10. Logistic Regression:

 Logistic regression is a part of GLM. It is for classification, i.e. to get a binary
outcome (Yes/No) given a set of independent variables, we use logistic regression.
 For logistic regression, the link function is log(p/1-p), where p is the prob. of success.

 The logistic regression method assumes that:

a) The outcome is a binary or dichotomous variable like yes vs no, positive vs negative,
1 vs 0.
b) There is a linear relationship between the logit of the outcome and each predictor
variable. Recall that the logit function is logit(p) = log(p/(1-p)), where p is the probability
of the outcome (it is the inverse of the sigmoid). Check with a scatterplot.
c) There are no extreme values or outliers in the continuous predictors.
d) There is no high intercorrelations (i.e. multicollinearity) among the predictors.
Check by VIF()

 https://www.analyticsvidhya.com/blog/2015/11/beginners-guide-on-logistic-
regression-in-r/ This is the derivation of the logit function.

 The RHS is continuous. But to map the predicted values to probabilities, we use the
sigmoid function: p, the probability of success, = exp(y)/(1+exp(y)) = 1/(1+exp(-y)), where
y = b0 + b1x1 + b2x2 + … So, the graph of p is the sigmoid curve.

 Once the sigmoid function is formed, we can define the threshold.
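The sigmoid mapping and thresholding just described can be sketched in a few lines; the coefficients below are invented for illustration, not fitted values.

```python
import numpy as np

def sigmoid(y):
    # p = 1 / (1 + exp(-y)) maps the continuous linear predictor into (0, 1)
    return 1 / (1 + np.exp(-y))

b0, b1 = -1.0, 2.0                 # hypothetical coefficients
x = np.array([-2.0, 0.0, 0.5, 3.0])
p = sigmoid(b0 + b1 * x)           # predicted probabilities of success
pred = (p >= 0.5).astype(int)      # apply a 0.5 threshold
print(p.round(3), pred)            # pred -> [0 0 1 1]
```

Note that p = 0.5 exactly when the linear predictor y = 0, so the 0.5 threshold corresponds to a linear decision boundary in x.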

11. LASSO and RIDGE REGRESSION (glmnet package in R):

 As we add more and more parameters to our model, its complexity increases, which
results in increasing variance and decreasing bias, thus overfitting. Too many features
result in multicollinearity. Variance is also increased when the independent variables
suffer from multicollinearity.
 One way to measure multicollinearity is the variance inflation factor (VIF), which assesses
how much the variance of an estimated regression coefficient increases if your predictors
are correlated. If no factors are correlated, the VIFs will all be 1.
 Multicollinearity causes issue in hypothesis testing, since t tests and p value tests
become unreliable.
 Look at the residual vs fitted values plot. If heteroskedasticity exists, the plot would exhibit
a funnel-shaped pattern.

 To overcome underfitting or high bias, we can basically add new parameters to
our model so that the model complexity increases, thus reducing high bias.
 There are the following methods to overcome overfitting / reduce the features:
 Regularization: Ridge/lasso
 Ensemble methods
 Pruning in case of Decision trees.

 For regularization, we do not remove the features but reduce the coefficients of
those features by introducing a penalty factor.

 For ridge regression (L2 regularization), we add a penalty term controlled by the
hyperparameter λ to the objective function:
minimize RSS + λ·Σ βj²
(In glmnet, the mixing hyperparameter α = 0 selects the ridge penalty.)
Therefore, as λ becomes larger, the coefficients shrink, variance decreases, and bias
increases. We need to tune λ to balance the two, as we cannot afford high bias either.

 There are two ways we could tackle this issue. A more traditional approach would be
to choose λ such that some information criterion, e.g., AIC or BIC, is the
smallest. A more machine learning-like approach is to perform cross-validation
and select the value of λ that minimizes the cross-validated sum of squared residuals
(or some other measure). The former approach emphasizes the model's fit to the
data, while the latter is more focused on its predictive performance.
 In R, the glmnet package implements ridge regression.

 Lasso, or Least Absolute Shrinkage and Selection Operator, is quite similar

conceptually to ridge regression. It also adds a penalty for non-zero coefficients, but
unlike ridge regression which penalizes sum of squared coefficients (the so-
called L2 penalty), lasso penalizes the sum of their absolute values (L1 penalty).

 As a result, for high values of λ, many coefficients are exactly zeroed under
lasso, which is never the case in ridge regression.
 Therefore, lasso selects only some features, reducing the coefficients of the others
exactly to zero. This property is known as feature selection, and it is absent in
ridge regression.

 Both methods allow to use correlated predictors, but they solve multicollinearity
issue differently:
 In ridge regression, the coefficients of correlated predictors are similar, while
 in lasso, one of the correlated predictors gets a larger coefficient and the rest
are (nearly) zeroed.
 Elastic net is a combination of both.
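A minimal sketch of the ridge-vs-lasso behaviour described above, on made-up data with two nearly duplicate predictors and one irrelevant one:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly duplicate of x1
x3 = rng.normal(size=200)                    # irrelevant predictor
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + rng.normal(scale=0.1, size=200) # true signal lives on x1

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print(ridge.coef_.round(2))   # x1 and x2 roughly share the weight; x3 small
print(lasso.coef_.round(2))   # some coefficients are exactly zero
```

With correlated predictors, ridge splits the weight between x1 and x2 (their coefficients sum to roughly 3), while lasso zeroes out at least one coefficient — the feature-selection property noted above.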


12. SVM:
Classifies two classes by drawing a hyperplane/line between them.
SVM at its core tries to achieve a good margin. The margin is the separation between the line and
the closest points of each class. A good margin is one where this separation is large for both
classes.

Hyperparameters include:
C, the regularization parameter: a lower C allows a larger margin but some misclassification;
a larger C tolerates little or no misclassification at the cost of a narrower margin.
Gamma: with low gamma, points far away from the plausible separating line are also considered
in the calculation of the line, whereas high gamma means only the points close to the plausible
line are considered.
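A hedged sketch of the C hyperparameter with scikit-learn's SVC (RBF kernel); the dataset and the specific C/gamma values are illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

soft = SVC(C=0.1, gamma=0.1).fit(X, y)   # low C: wide margin, tolerates errors
hard = SVC(C=100, gamma=0.1).fit(X, y)   # high C: narrow margin, fewer errors

# a softer margin usually keeps more support vectors than a hard one
print(len(soft.support_), len(hard.support_))
print(soft.score(X, y), hard.score(X, y))
```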
13. Gradient descent:
To minimize the cost function, I will use gradient descent optimization. The objective is
to find the parameters for which the cost function is minimum.

To start with finding the right values we initialize the values of our parameters with some
random numbers and Gradient Descent then starts at that point. Then it takes one step
after another in the steepest downside direction till it reaches the point where the cost
function is as small as possible.

The equation below describes what Gradient Descent does:
b = a - γ·∇f(a)
b describes the next position of our climber, while a represents his current position. The minus
sign refers to the minimization part of gradient descent. The γ in the middle is a weighting
factor (the learning rate), and the gradient term ∇f(a) points in the direction of steepest
ascent, so stepping against it moves in the direction of steepest descent.

How big the steps are that Gradient Descent takes into the direction of the local minimum
are determined by the so-called learning rate. It determines how fast or slow we will
move towards the optimal weights.
In order for Gradient Descent to reach the local minimum, we have to set the learning
rate to an appropriate value, which is neither too low nor too high.

This is because if the steps it takes are too big, it may never reach the local minimum,
instead bouncing back and forth across the valley of the convex function. If you set the
learning rate to a very small value, gradient descent will eventually reach the local minimum,
but it may take far too much time.
If gradient descent is working properly, the cost function should decrease after every
iteration. The higher the gradient, the steeper the slope and the faster a model can learn.
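The update rule above can be sketched in a few lines. This is a minimal illustration, not a production optimizer; the quadratic function f(x) = (x − 3)², the starting point, and the learning rate are made-up values chosen so the minimum is known to be at x = 3:

```python
# Minimal gradient descent sketch: repeatedly apply b = a - gamma * grad_f(a).
# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3) and whose
# minimum is at x = 3 (an assumed example, not from the notes).

def gradient_descent(grad_f, start, learning_rate=0.1, n_steps=100):
    a = start
    for _ in range(n_steps):
        a = a - learning_rate * grad_f(a)  # one step in the steepest-descent direction
    return a

x_min = gradient_descent(lambda x: 2 * (x - 3), start=10.0)
# x_min converges toward 3.0
```

With the learning rate set to 0.1, each step shrinks the distance to the minimum by a constant factor; a much larger rate would overshoot and oscillate, exactly as described above.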
14. GBM is basically boosting + gradient descent.
Parameters: learning rate, number of trees to fit, minimum number of observations per
leaf, depth of each tree, etc.

15. Missing value imputation:

1. Types of missing values: MCAR (missing completely at random: no pattern in the
missing data), MAR (missing at random: some probabilistic pattern in the missing
data), MNAR (missing not at random: a non-ignorable pattern in the missing data).
2. Ways: deletion, mean/median/mode imputation, linear regression, multivariate
imputation by chained equations (MICE, which produces m sets of imputed values).
3. Depends on the algorithms used. Example Ensemble techniques resolve missing values
themselves either by ignoring or imputing them.

16. K-Means Clustering:

Types of clustering:
 Hard Clustering: In hard clustering, each data point either belongs to a cluster completely
or not. For example, in the above example each customer is put into one group out of the
10 groups.
 Soft Clustering: In soft clustering, instead of putting each data point into a separate
cluster, a probability or likelihood of that data point being in those clusters is assigned.
For example, in the above scenario each customer is assigned a probability of being in
any of the 10 clusters of the retail store.


1. Specify the desired number of clusters K.

2. Randomly assign each data point to a cluster.
3. Compute cluster centroids.
4. Re-assign each point to the closest cluster centroid.
5. Re-compute cluster centroids, and repeat steps 4-5 until the assignments stop
changing.
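The five steps can be sketched as follows. To keep this toy version deterministic, a round-robin assignment stands in for the random initial assignment of step 2; the one-dimensional data points are made up, and in practice you would use a library implementation such as scikit-learn's KMeans:

```python
def kmeans_1d(points, k, n_iters=20):
    # Step 2 (simplified): round-robin instead of random assignment,
    # so this sketch is deterministic.
    labels = [i % k for i in range(len(points))]
    centroids = [0.0] * k
    for _ in range(n_iters):
        # Steps 3 and 5: compute / re-compute each cluster centroid.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
        # Step 4: re-assign each point to the closest centroid.
        labels = [min(range(k), key=lambda j, p=p: abs(p - centroids[j]))
                  for p in points]
    return labels, centroids

# Two obvious groups, around 1 and around 9:
labels, centroids = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.7], k=2)
```

The two clusters settle on centroids near 1.0 and 9.07, and each point ends up with the label of its nearby group.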



 Central Limit Theorem:

a) If the sample size n is sufficiently large (n > 30), the sampling distribution of the sample
means will be approximately normal. Remember what’s going on here: random samples
of size n are repeatedly being taken from an overall larger population. Each of these
random samples has its own mean, which is itself a random variable, and this set of
sample means has a distribution that is approximately normal.
b) The mean of the population and the mean of the distribution of all possible sample means
are equal.
c) The variance of the distribution of sample means is the population variance divided by
the sample size.
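Claims (b) and (c) can be checked with a quick simulation. This sketch draws samples from a uniform(0, 1) population (mean 0.5, variance 1/12); the sample size, number of samples, and seed are arbitrary choices for illustration:

```python
import random
import statistics

# CLT sketch: repeatedly draw random samples of size n from a non-normal
# (uniform) population and look at the distribution of the sample means.
random.seed(42)
n, n_samples = 50, 20000

sample_means = [statistics.fmean(random.random() for _ in range(n))
                for _ in range(n_samples)]

mean_of_means = statistics.fmean(sample_means)     # close to 0.5 (claim b)
var_of_means = statistics.pvariance(sample_means)  # close to (1/12)/n (claim c)
```

A histogram of `sample_means` would also look approximately normal (claim a), even though the underlying population is flat.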

 Moments of a distribution: Mean, Variance, Skewness (measures symmetry), Kurtosis
(measures the clustering of outcomes in the peak and the tails).

Positively Skewed: Mean>Median>Mode

Negatively Skewed: Mean<Median<Mode
Normal Distribution: Mean=Median=Mode.

Kurtosis of a normal distribution: 3, i.e. excess kurtosis=0.

A leptokurtic distribution is more peaked than normal; therefore, its excess kurtosis
(kurtosis − 3) > 0.
A platykurtic distribution is less peaked than normal.
Relative to a normal distribution, a leptokurtic distribution will have a greater percentage of
small deviations from the mean and a greater percentage of extremely large deviations from
the mean. This means there is a relatively greater probability of an observed value being either
close to the mean or far from the mean.

 MSE = Σ(Predicted − Actual)²/N.

RMSE: Square root of MSE.

 SSR = N·MSE = Σ(Predicted − Actual)². For OLS, the objective is to minimize this SSR and
not to maximize R².
TSS = Σ(Y − Ymean)² = N·Variance.

 R² = 1 − (SSR/TSS) = 1 − (MSE/Variance).

Called Coefficient of Determination.
How much variability in Y (around its mean) is explained by X variables? Basically, how much
TSS is explained by ESS.
Square of Correlation coefficient between Yi and Yi predicted.
 However, R squared is not adequate.
1. Increasing the number of X variables will increase R squared, even if the terms are not
significant. So, we can use Adjusted R squared, AIC and BIC for model comparison.
Adjusted R squared = 1 − (1 − R²)(N − 1)/(N − (k + 1))
2. AIC and BIC introduce a penalty factor for a higher number of terms. Increasing the
number of terms decreases the error term T·log(SSR/T), but the penalty balances it
out. BIC has the stricter penalty: the per-parameter penalty factor is 2 for AIC and
log T for BIC.
Criteria: Lowest AIC or SBIC.

AIC=2K+T*log (SSR/T)
BIC= K*logT+T*log (SSR/T)
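A small sketch tying these formulas together, using the definitions above (T observations, k slope terms, K = k + 1); the four-point actual/predicted toy series is made up:

```python
import math

# Compute SSR, TSS, R^2, adjusted R^2, AIC and BIC exactly as defined
# in the notes above.

def fit_metrics(actual, predicted, k):
    T = len(actual)
    y_mean = sum(actual) / T
    ssr = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    tss = sum((a - y_mean) ** 2 for a in actual)
    r2 = 1 - ssr / tss
    adj_r2 = 1 - (1 - r2) * (T - 1) / (T - (k + 1))
    K = k + 1  # slopes plus the constant
    aic = 2 * K + T * math.log(ssr / T)
    bic = K * math.log(T) + T * math.log(ssr / T)
    return {"R2": r2, "adj_R2": adj_r2, "AIC": aic, "BIC": bic}

metrics = fit_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], k=1)
# metrics["R2"] -> 0.98, metrics["adj_R2"] -> 0.97
```

Note the adjusted R² is always at most R², since the (N − 1)/(N − (k + 1)) factor inflates the unexplained share.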

 Solving the OLS regression by minimizing the SSR would give us the value of the coefficients.
So, α = Ymean − 𝛽·Xmean and 𝛽 = Cov(X, Y)/Var(X)
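The closed-form solution can be sketched directly. The data here is a made-up exact line y = 1 + 2x, so OLS recovers the intercept and slope exactly:

```python
# Simple-regression OLS in closed form, as above:
# beta = Cov(X, Y) / Var(X), alpha = mean(Y) - beta * mean(X).

def ols_simple(x, y):
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    cov_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n
    var_x = sum((xi - x_mean) ** 2 for xi in x) / n
    beta = cov_xy / var_x
    alpha = y_mean - beta * x_mean
    return alpha, beta

# Data generated exactly as y = 1 + 2x:
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
alpha, beta = ols_simple(x, y)  # alpha -> 1.0, beta -> 2.0
```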

 OLS is BLUE (best linear unbiased estimator).

Best/Efficient: OLS estimator has minimum variance/SSR.
Linear: Linear in parameters and not variables.
Unbiased: Expected value of predicted alpha and beta is equal to actual values.
Consistent: estimates converge to true values as sample size increases.
 The standard error of the regression is the standard deviation of the error term, i.e.
the square root of SSR/(T − K), where T is the total sample size and K = k + 1, with k the
number of independent parameters. If we divide by T instead of T − K, the estimate
becomes biased.

 Confidence Interval in general for a Normal distribution: Predicted Value ±
(Zcritical or Tcritical)·SE. Z critical is used when the population variance is known;
SE = population standard deviation/√N. The Z values are 1.65, 1.96 and 2.58 for 90%,
95% and 99% confidence.
T critical is used when the population variance is unknown and the sample variance is
used instead; SE = sample standard deviation/√N.
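As a quick sketch of the z-interval formula (the sample mean, population SD, and sample size below are made-up numbers, and 1.96 is the 95% critical value from above):

```python
import math

# 95% z-interval: mean +/- z * sigma / sqrt(N), with z = 1.96.

def z_confidence_interval(sample_mean, pop_sd, n, z=1.96):
    se = pop_sd / math.sqrt(n)  # standard error of the sample mean
    return (sample_mean - z * se, sample_mean + z * se)

lo, hi = z_confidence_interval(sample_mean=100.0, pop_sd=15.0, n=36)
# se = 15 / 6 = 2.5, so the interval is (95.1, 104.9)
```

Swapping 1.96 for the appropriate t critical value (at N − 1 degrees of freedom) gives the t-interval described above.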

 What if the distribution is non-normal? With known population variance, use Z as long
as N > 30. With unknown population variance, use T as long as N > 30.

 The test statistic for hypothesis testing is otherwise calculated as (sample estimate −
hypothesized value)/SE. Use a two-tail test if the null hypothesis contains an equality.
The Z value at the 95% confidence level is 1.96 for a two-tail test and 1.65 for a one-tail test.

 Confidence Interval of a coefficient in Regression: Predicted Value ± Tcritical·SE, where
SE is the standard error of the coefficient as reported in the regression output.
 So, if we have to calculate the CI of Y in regression, error can arise from the error term
and from error in coefficient estimation. If we assume it is only due to the error term,
then SE = standard error of the regression, which we get directly in R. If it is due to both
factors, then there is a separate, wider formula (a prediction interval).
 Matrix form: 𝛽_OLS = (XᵀX)⁻¹Xᵀy
 Variance-covariance matrix of 𝛽 = (SER)²·(XᵀX)⁻¹; the standard errors of the
coefficients are the square roots of its diagonal elements.
 Types of Distribution:
1. Poisson: Discrete probability distribution just like Binomial, that expresses the
probability of a given number of events occurring in a fixed interval of time or space if
these events occur with a known constant rate and independently of the time since the
last event. Example: the number of phone calls received by a call center per hour. λ refers
to the expected number of occurrences per interval, which is also both the mean and the
variance.

2. Normal: For any normally distributed random variable, 68% of the outcomes are within
one standard deviation of the expected value (mean), and approximately 95% of the
outcomes are within two standard deviations of the expected value.
For Standard Normal, we calculate the Z score= (X-mean)/SD. From Z table, we calculate
the probability that something is less than X.

3. Student’s T distribution: It is an appropriate distribution to use when constructing

confidence intervals based on small samples (n < 30) from populations with unknown
variance and an approximately normal distribution. It may also be appropriate to use
the t distribution when the population variance is unknown, and the sample size is large
enough that the central limit theorem will assure that the sampling distribution is
approximately normal.
It is defined by a single parameter, the degrees of freedom (df), where the degrees of
freedom equal the number of sample observations minus 1 (n − 1) for sample means.
It has more probability in the tails (fatter tails) than the normal distribution. As the
degrees of freedom (the sample size) gets larger, the shape of the t-distribution more
closely approaches a standard normal distribution.

4. Chi-squared and F distributions (both skewed like the lognormal): Chi-squared is used
for hypothesis tests of a population variance against some fixed value. It is bounded at
0, just like the lognormal. Like the t-test, it requires degrees of freedom and a
confidence level; compare the test statistic with the critical value to accept/reject the
null hypothesis.
The F distribution is used for comparing the variances of two different populations. It
has 2 DOF, one for each population. Compare the F statistic with the F critical value to
reject/accept the null hypothesis.

The F-statistic is also used to test the joint null hypothesis that the slope parameters are
jointly insignificant.

The DF are K and N − K − 1, where K is the number of independent parameters,
excluding the constant.
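The Poisson probability mass function from point 1 above can be sketched directly; the call-center rate of 4 calls per hour is an assumed example:

```python
import math

# Poisson pmf: probability of exactly k events in an interval when the
# expected count per interval is lam. Mean and variance both equal lam.

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

# A call center averaging 4 calls per hour: probability of exactly 2 calls.
p = poisson_pmf(2, 4.0)  # 4^2 * e^-4 / 2! ≈ 0.1465
```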

 Assumptions of OLS:
1. Parameters should be linear.
Detection: Can be checked with residuals vs fitted value plots; a parabolic pattern in the
residuals indicates non-linearity.
Remedy: To overcome the issue of non-linearity, you can apply a non-linear
transformation to the predictors, such as log(X), √X or X², or transform the dependent
variable.
2. No perfect multicollinearity.
 Perfect implies correlation is 1 between explanatory variables. When correlation is
less than 1 called imperfect.
 The OLS coefficients are still unbiased and consistent but inefficient.
 R squared may still be high, therefore, it only affects individual coefficients
hypothesis tests, i.e. high standard errors and therefore unreliable T
tests/confidence intervals.
 In this situation the coefficient estimates of the multiple regression may change
erratically in response to small changes in the model or the data.
 Matrix X has less than full rank, and therefore the moment matrix XᵀX cannot be
inverted: 𝛽_OLS = (XᵀX)⁻¹Xᵀy does not exist.
 Detection: Multiple ways:
1. Correlation matrix.
2. Insignificant regression coefficients for the affected variables in the multiple
regression, but a rejection of the joint hypothesis that those coefficients are all
zero (using an F-test).
3. Variance Inflation Factor: measures the increase in the variance of the estimated
coefficients because of multicollinearity. The VIF is computed separately for each
explanatory variable in the model and is interpreted as the ratio of the actual
var(𝛽ᵢ) to what the variance would have been if xᵢ were not linearly related to the
other x's in the model: VIFᵢ = 1/(1 − Rᵢ²), where Rᵢ² comes from regressing xᵢ on the
other X variables.
A rule of thumb is that VIF values of 10 or more indicate variables that warrant
further investigation.
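With only two explanatory variables, the VIF computation reduces to one simple regression of one variable on the other. A sketch, where the nearly collinear toy data x1 and x2 is made up:

```python
# VIF sketch: regress x_i on the other explanatory variable(s), take that
# auxiliary regression's R^2, and compute VIF_i = 1 / (1 - R_i^2).

def r_squared(x, y):
    # R^2 of a simple OLS regression of y on x.
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    beta = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
            / sum((a - x_mean) ** 2 for a in x))
    alpha = y_mean - beta * x_mean
    ssr = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y))
    tss = sum((b - y_mean) ** 2 for b in y)
    return 1 - ssr / tss

def vif(xi, x_other):
    return 1 / (1 - r_squared(x_other, xi))

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 2.1, 2.9, 4.2, 5.0]  # nearly collinear with x1 -> large VIF
```

Here `vif(x2, x1)` comes out far above the rule-of-thumb threshold of 10, flagging the pair as collinear.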

 Consequences:
Type II errors, since the inflated standard errors make significant variables appear
insignificant.
Another consequence of multicollinearity is overfitting in the regression analysis.
 Remedies:
1. Ridge regression
2. PCA
3. Obtain more data, if possible. This is the preferred solution. More data can
produce more precise parameter estimates (with lower standard errors), as seen
from the variance inflation factor formula for the variance of a regression
coefficient estimate in terms of the sample size and the degree of multicollinearity.
4. Dropping some variables might be an approach in some cases. However, if those
variables are significant, then they would go into error terms and thus give us
Biased estimates.

3. The error has an expected value of 0, conditional on the explanatory variables: E[e|X] = 0.

The mean of the residuals will always be zero provided that there is a constant term in
the regression.

4. The error terms are IID or Presence of spherical disturbances, i.e. Homoskedasticity:
Variance (error term) is finite/constant across all observations and No Autocorrelation
between error terms.

 Consequences:
Again, it would still give us unbiased estimates but not BLUE i.e. inefficient.
So, we can still use OLS, but standard errors would be incorrect.

 Two forms of heteroskedasticity:

Unconditional heteroskedasticity occurs when the heteroskedasticity is not related to
the level of the independent variables, which means that it doesn’t systematically
increase or decrease with changes in the value of the independent variable(s). While this
is a violation of the equal variance assumption, it usually causes no major problems with
the regression.
Conditional heteroskedasticity is heteroskedasticity that is related to the level of (i.e.,
conditional on) the independent variable. For example, conditional heteroskedasticity
exists if the variance of the residual term increases as the value of the independent
variable increases, as shown in Figure 1. Notice in this figure that the residual variance
associated with the larger values of the independent variable, X, is larger than the residual
variance associated with the smaller values of X. Conditional heteroskedasticity does
create significant problems for statistical inference.
 Detection:
1. Y/Residuals vs X plot.
2. Residual vs Fitted Value plot

3. White’s test:
To test for constant variance one undertakes an auxiliary regression analysis: this regresses
the squared residuals from the original regression model onto a set of regressors that
contain the original regressors along with their squares and cross-products. One then
inspects the R2. The Lagrange multiplier (LM) test statistic is the product of the R2 value
and sample size.
This follows a chi-squared distribution, with degrees of freedom equal to P − 1,
where P is the number of estimated parameters (in the auxiliary regression).
If LM>Chi squared value, reject null that it is homoscedastic.

The logic of the test is as follows. First, the squared residuals from the original model
serve as a proxy for the variance of the error term at each observation. (The error term
is assumed to have a mean of zero, and the variance of a zero-mean random variable is
just the expectation of its square.) The independent variables in the auxiliary regression
account for the possibility that the error variance depends on the values of the original
regressors in some way (linear or quadratic). If the error term in the original model is in
fact homoscedastic (has a constant variance) then the coefficients in the auxiliary
regression (besides the constant) should be statistically indistinguishable from zero and
the R2 should be “small". Conversely, a “large" R2 (scaled by the sample size so that it
follows the chi-squared distribution) counts against the hypothesis of homoskedasticity.

4. Breusch-Pagan test.
Same procedure as white’s test. Just that auxiliary regression involves only x terms and
not non-linear terms (squares and cross products).
Disadvantage: Only tests for linear forms of heteroskedasticity.
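The single-regressor case can be sketched by hand: regress the squared residuals on x, form LM = n·R², and compare with the chi-squared critical value (3.84 for 1 df at the 5% level). The residual series below is made up so that its spread grows with x, i.e. conditional heteroskedasticity:

```python
# Breusch-Pagan sketch for one regressor: auxiliary regression of the
# squared residuals on x, then LM = n * R^2 of that regression.

def aux_r_squared(x, y):
    # R^2 of a simple OLS regression of y on x.
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    beta = (sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
            / sum((a - x_mean) ** 2 for a in x))
    alpha = y_mean - beta * x_mean
    ssr = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y))
    tss = sum((b - y_mean) ** 2 for b in y)
    return 1 - ssr / tss

def breusch_pagan_lm(x, residuals):
    e2 = [e ** 2 for e in residuals]  # squared residuals proxy the error variance
    return len(x) * aux_r_squared(x, e2)

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
resid = [0.1, -0.1, 0.2, -0.3, 0.5, -0.6, 0.9, -1.0]  # spread grows with x
lm = breusch_pagan_lm(x, resid)  # above 3.84 -> reject homoskedasticity
```

In practice you would use a library routine such as statsmodels' `het_breuschpagan`; the manual version above is only meant to make the LM = n·R² mechanics concrete.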

 Remedies:
1. Mostly heteroskedasticity results from mis-specification. So, check for any missing
variables or change the specification.
2. To overcome heteroskedasticity, a possible way is to transform the response variable,
such as log(Y) or √Y.
3. Robust standard errors: Here for calculating the variance-covariance of 𝛽, we use
different formula.

4. Weighted Least Squares is also used (to be discussed later).


 Autocorrelation (the other part of assumption 4) has the same consequences as
heteroskedasticity.
 Detection:
1. ACF plot of the residuals.
2. Durbin-Watson Test.
An approximate value for the DW statistic is 2(1 − ρ), where ρ is the first-order
autocorrelation of the residuals.
It must lie between 0 and 4. If DW = 2, implies no autocorrelation, 0 < DW < 2 implies
positive autocorrelation while 2 < DW < 4 indicates negative autocorrelation. Also get
the P value to check the null hypothesis of no autocorrelation.

3. LM Test / Breusch-Godfrey test:
The DW test only checks for first-order autocorrelation and cannot be used when the
regression includes a lagged dependent variable, so we use the LM test instead.

4. For time series, I have the Ljung-Box and Box-Pierce test statistic.
Ho: no autocorrelation, Ha: Serial Autocorrelation.
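The Durbin-Watson statistic from point 2 is simple to compute directly. In this sketch the two made-up residual series illustrate the two ends of the 0-4 range:

```python
# Durbin-Watson statistic:
# DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2, roughly 2 * (1 - rho).

def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Strongly positively autocorrelated residuals give DW close to 0;
# sign-alternating (negatively autocorrelated) residuals give DW close to 4.
dw_pos = durbin_watson([1.0, 1.1, 1.2, 1.1, 1.0, 0.9])
dw_neg = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```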

 Remedy:
Same as before.
Take the first difference series.

5. No correlation between X variables and error term.

An explanatory variable X is endogenous if it is (contemporaneously) correlated with
the error e.
 Consequences: the OLS estimates are biased and inconsistent.
 Causes: omitted variables, measurement error, mis-specified functional form.
 Detection: Ramsey RESET test. The intuition behind the test is that if non-linear
combinations of the explanatory variables have any power in explaining the response
variable, the model is mis-specified, in the sense that the data generating process might
be better approximated by a polynomial or another non-linear functional form.
 Remedy: Instrumental Variables.


6. The error terms should be normally distributed.

 Detection:
Q-Q plot: A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one
another. If both sets of quantiles came from the same distribution, we should see the
points forming a roughly straight line, with the theoretical Normal quantiles on the X axis
and the sample quantiles on the Y axis.

Jarque-Bera test: tests whether the sample skewness and excess kurtosis of the residuals
are jointly zero; the statistic follows a chi-squared distribution with 2 degrees of freedom.

If the errors are not normally distributed, a non-linear transformation of the variables
(response or predictors) can bring improvement in the model.
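The Jarque-Bera statistic is built from the sample skewness S and kurtosis K as JB = n/6 · (S² + (K − 3)²/4), with the 5% chi-squared critical value at 2 df being 5.99. A sketch, where the two residual series are made-up illustrations:

```python
# Jarque-Bera normality test sketch:
# JB = n/6 * (S^2 + (K - 3)^2 / 4), chi-squared with 2 df under the null.

def jarque_bera(residuals):
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)

jb_symmetric = jarque_bera([-2.0, -1.0, 0.0, 1.0, 2.0])   # no skew: small JB
jb_skewed = jarque_bera([0.0, 0.0, 0.0, 0.0, 0.0, 10.0])  # one outlier: larger JB
```

Residuals that are skewed or fat-tailed push the statistic up; with realistic sample sizes, values above 5.99 reject normality at the 5% level.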

R Sentiment Analysis, Important packages: tm (text mining), tidyverse

1. Setting Up Twitter API using: twitteR package, setup_twitter_oauth() function that
requires 4 inputs. (consumer keys and secrets, access tokens and secrets).
2. Getting n tweets using searchTwitter and obtaining screen name and text from each of
the n tweets.