
For a parametric model, predicting new data requires only the parameters, which are fixed in number.

Examples: Linear regression, Logistic regression, SVM.

For a non-parametric model, we need the parameters plus the training data that has been observed; there is no underlying assumption about the data.

Non-parametric models are flexible and can make better predictions. Choose a non-parametric model when there is lots of data, no prior knowledge about its distribution, and you are not worrying too much about picking exactly the right features.

Examples: KNN, Decision trees, ANN.

For supervised learning, we have prior knowledge of the output values for the input samples. So, the goal is to learn the function that maps the features to the outputs.

Examples are classification (discrete output) and regression (continuous output), e.g. Logistic regression, SVM, Random Forests, face recognition.

If you have a small amount of data, opt for a low-complexity, i.e. low-variance, model; otherwise the result would be overfitting.

For unsupervised learning, we do not have any labeled outputs. So, the task is to discover the inherent structure present within the data points.

Examples are clustering (discrete) and dimensionality reduction (continuous), e.g. k-means, PCA, image segmentation.

A model learns relationships between the inputs called features, and outputs called labels, from

a training dataset. During training, the model is given both the features and the labels and

learns how to map the former to the latter.

A trained model is evaluated on a testing set, where we only give it the features and it makes

predictions. We compare the predictions with the known labels for the testing set to calculate

accuracy. Models can take many shapes, from simple linear regressions to deep neural networks,

but all supervised models are based on the fundamental idea of learning relationships between

inputs and outputs from training data.

Variance: It refers to how much the model depends on the training data, i.e. the spread or uncertainty in the estimates. For linear regression, the variance-covariance matrix of the coefficient estimates is (SER)^2*(X'X)^-1, where X is the matrix of independent variables. (A model that always predicts a constant has zero variance.)

Bias: It is basically a measure of the difference between the expected value of the predicted function/parameter and its actual value; it reflects the ACCURACY of the estimates. For linear regression it is E[b^]-b; for a general predictor it is E[f^(X)]-f(X). A high-bias model makes strong assumptions about the data, and its errors come from those assumptions, e.g. fitting a linear regression to nonlinear data results in high bias.

Generalization/Test error: with Y=f(X)+e as the true data-generating process and f^(X) as the fitted function, the expected test error decomposes into Bias^2 + Variance + irreducible error (the variance of e). The same decomposition applies to OLS.

Bias and Variance move in opposite directions.

Underfit model: Poor in-sample prediction. Low Variance and High Bias. It will have high training

and test error.

Overfit model: Good in-sample prediction but poor out-of-sample prediction. High Variance and Low Bias. It will have a very low training error but a high testing error, a high cross-validation error and a high generalization gap.

Remember the error vs. complexity curve for both training and test error. The generalization/true error cannot be computed directly; it is the expected value of the error over all test sets, and the generalization gap is the difference between test and training error.

4. About KNN:

(https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/)

(https://medium.com/@adi.bronshtein/a-quick-introduction-to-k-nearest-neighbors-algorithm-62214cea29c7)

1. Non-Parametric, i.e. no assumptions about data. Therefore, KNN could and probably should be

one of the first choices for a classification study when there is little or no prior knowledge

about the distribution of the data.

2. Supervised (the labels of the training data are known).

3. Used for both classification and regression.

4. It is a lazy algorithm, i.e. it does not use the training data points to build any generalization, so there is almost no training phase. This means it keeps all the training data around for the testing phase.

5. K in KNN is the number of nearest neighbors we wish to take a vote from. For example, on the webpage given above, if we take K=3, we draw a circle around the blue star that encloses the 3 closest data points on the plane, and the majority class among those points becomes the predicted class of the star.

We take a single test case and multiple rows of training set. For clarity, here single test

case means one row, which can have multiple columns. Consider that single point

represented in n dimensions, where n is the number of columns.

We calculate the Euclidean distance between the test case and each row of the training set. The Euclidean distance is the straight-line distance between 2 points in N dimensions.

Sort the distances in ascending order (K nearest). Get the top K rows from the sorted array and take the most frequent class among these rows; that would be the predicted class. (A minimal sketch of these steps is given below.)
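A rough sketch of the steps above, assuming the features are numeric and stored in NumPy arrays (the names X_train, y_train and x_test are placeholders):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Euclidean distance between the single test row and every training row
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # sort ascending and take the indices of the K nearest rows
    nearest = np.argsort(distances)[:k]
    # majority vote among the K nearest labels gives the predicted class
    return Counter(y_train[nearest]).most_common(1)[0][0]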

6. How to decide K? This is called Hyperparameter optimization.

I. If K is too small, then Overfitting: low training error but high testing error. For example,

for K=1, training error is 0. But testing error is too large.

II. As K increases, the testing error drops up to a point but then increases again. So, decide

the optimal value of K at the minima.

III. As a rule of thumb, K should be odd (to avoid ties) or around the square root of n.

7. Output: For Classification, it is a discrete variable of some class. For regression, it is an average

of the values of its neighbors.

8. Scikit learn algorithm: K is specified along with test data. Based on point 5, K rows in training

are selected and hence classified.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)   # note the keyword is n_neighbors

5. DECISION TREES:

Python:
from sklearn import tree
clf = tree.DecisionTreeClassifier()          # or tree.DecisionTreeRegressor()
model1 = clf.fit(X_train, y_train)
results = model1.predict(X_test)

R: rpart(formula, data, method="class" or "anova", parms=list(split="information"), plus other hyperparameters)

Supervised learning.

Non-parametric: when we have no idea about the distribution of the data, a non-parametric method should be used.

Binary splits are used because multiway splits fragment the data too quickly, leaving insufficient data at the next level down. Hence, we would want to use multiway splits only when needed. Since a multiway split can be achieved by a series of binary splits, the latter are preferred; multiway splitting tends to result in overfitting.

Different tree-based methods are C4.5, CART, C5.0.

CART divides the dataset into homogeneous subsets (using the Gini index for classification) based on the most significant splitter, e.g. when Gender, Age and City are the 3 variables given.

In the case of a regression tree, the value assigned by a terminal node is the mean response of the training observations falling in that region. Thus, if an unseen observation falls in that region, we make its prediction with that mean value. In the case of a classification tree, the value (class) assigned by a terminal node is the mode of the training observations falling in that region, so an unseen observation falling in that region is predicted with that mode value.

Both types of tree follow a top-down greedy approach known as recursive binary splitting. We call it 'top-down' because it begins at the top of the tree, when all the observations are in a single region, and successively splits the predictor space into two new branches down the tree. It is 'greedy' because the algorithm cares only about the current split (it looks for the best variable available), not about future splits that might lead to a better overall tree.

Root Node -> Decision Node -> Leaf/Terminal Node

Different criteria measure the best/homogeneity of the target variable in a given node.

A) Gini Impurity for Classification trees (USED BY CART):

1. Probability that a randomly selected element would be labeled incorrectly if that

new instance is randomly classified according to the distribution of the labels in the

subset.

2. The Gini impurity can be computed by summing, over the classes, the probability Pi of an item with label i being chosen times the probability (1-Pi) of a mistake in categorizing that item: Gini = Σ Pi(1-Pi) = 1 - Σ Pi^2.

3. If the node contains one class only, then impurity is 0. The minimum value is

therefore 0. Maximum value is when all classes have same probability/equally

weighted. The objective is to minimize the Gini impurity.

4. For the example used below (female node: 2 Yes / 8 No out of 10; male node: 13 Yes / 7 No out of 20): Gini of the female node is 2*0.2*0.8 = 1-0.04-0.64 = 0.32, and Gini of the male node is 2*0.65*0.35 = 0.45. The weighted Gini for the split is (10/30)*0.32 + (20/30)*0.45 = 0.41; we choose the split with the lowest weighted impurity.

B) Information gain/Entropy for Classification trees: Used by C4.5, C5.0 decision trees.

1. Entropy is a measure of the degree of disorganization in a system: the higher the entropy, the more information is required to describe the system. If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided (50%-50%) it has an entropy of one.

For example, I divide a parent node that has 30 elements into 2 decision nodes on

the basis of gender.

Female node has 10 total and has 2 Yes and 8 No. Male node has 20 total with 13 yes

and 7 no.

So, the entropies (log base 2) are: -(2/10)*log(2/10) - (8/10)*log(8/10) = 0.72 for the female node and 0.93 for the male node.

The total entropy for the split is (10/30)*0.72 + (20/30)*0.93 = 0.86. This should be as low as possible (equivalently, the information gain, parent entropy minus 0.86, should be as high as possible).

C) Reduction in Variance for regression trees:

1. How much the variance is reduced when I move from a parent node to the child nodes.

2. Assign value of 1 for a yes and value of 0 for a no.

3. So, the mean for the female node is 0.2 and its variance is (2/10)*(1-0.2)^2 + (8/10)*(0-0.2)^2 = 0.16. The mean of the male node is 0.65 and its variance is 0.23. The variance after the split, calculated as a weighted average, is 0.21; the parent node had 0.25, so the split reduces variance. (A quick numeric check of these split criteria is sketched below.)
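A quick check of the numbers above (a sketch; log base 2 for the entropies):

import numpy as np

def gini(p):                 # p = list of class proportions in a node
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    return -sum(pi * np.log2(pi) for pi in p if pi > 0)

# Female node: 2 Yes / 8 No; Male node: 13 Yes / 7 No
print(gini([0.2, 0.8]), gini([0.65, 0.35]))                     # 0.32, ~0.45
print(10/30 * gini([0.2, 0.8]) + 20/30 * gini([0.65, 0.35]))    # ~0.41
print(entropy([0.2, 0.8]), entropy([13/20, 7/20]))              # ~0.72, ~0.93

# Reduction in variance: code Yes=1, No=0 and compare the weighted child variance to the parent
female = np.array([1]*2 + [0]*8)
male = np.array([1]*13 + [0]*7)
print(10/30 * female.var() + 20/30 * male.var())                # ~0.21 vs parent 0.25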

Overfitting is one of the key challenges faced while modeling decision trees. If no limits are set on a decision tree, it will give you 100% accuracy on the training set, because in the worst case it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in 2 ways:

1. Setting constraints on how the tree is constructed: similar to RF, e.g. the minimum number of observations in a parent node required to split further, leaf size/tree depth, the maximum number of terminal nodes, and the number of features to be considered for a split. Use CROSS VALIDATION to decide the optimal values of these parameters. The tree still greedily looks for the best split at each step and moves forward until one of the specified stopping conditions is reached; it does not think about future splits.

2. Tree pruning: In decision trees, pruning is a method to avoid overfitting. Pruning means selecting the subtree that leads to the lowest test error rate: prune away the subtree whose replacement by a leaf (with the majority class label) yields the highest reduction in error. We can use cross validation to estimate the test error rate of a subtree.
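One way to approximate this in scikit-learn is cost-complexity pruning: grow the full tree, compute the candidate pruning strengths, and keep the one with the best cross-validated score. A minimal sketch (X_train and y_train are placeholder arrays):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# grow the full tree, then get the candidate pruning strengths (alphas)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X_train, y_train, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[scores.index(max(scores))]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)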

Decision trees can be unstable because small variations in the data might result in a

completely different tree being generated. This problem is mitigated by using decision

trees within an ensemble.

6. ENSEMBLE METHODS:

(https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f)

Ensemble methods are machine learning techniques that combine several base models, built with the same technique, in order to produce one optimal predictive model. The purpose is to reduce either the bias or the variance.

Total generalization error = Bias^2 + Variance (+ irreducible error). So, ensemble methods are used to decrease the total error, either by decreasing the variance or by decreasing the bias.

Bagging

Boosting

Bagging and Boosting decrease the variance of your single estimate, as they combine several estimates from different models and averaging reduces variance. So, the result may be a model with higher stability.

If the problem is that the single model gets a very low performance, Bagging will rarely get

a better bias. However, Boosting could generate a combined model with lower errors as it

optimizes the advantages and reduces pitfalls of the single model.

By contrast, if the difficulty of the single model is over-fitting, then Bagging is the best option.

Boosting for its part doesn’t help to avoid over-fitting; in fact, this technique is faced with this

problem itself. For this reason, Bagging is effective more often than Boosting.

1. Bagging and Boosting get N learners by generating additional data in the training stage. N new

training data sets are produced by random sampling with replacement from the original set. By

sampling with replacement, some observations may be repeated in each new training data set (for example, an observation may appear twice in one bootstrap sample and not at all in another).

2. In the case of Bagging, any element has the same probability to appear in a new data set. The

training stage is parallel for Bagging (i.e., each model is built independently)

However, for Boosting, the observations are weighted (initially equally), and therefore some of them will take part in the new sets more often. Boosting builds the new learner in a sequential way:

In Boosting algorithms each classifier is trained on data, considering the previous classifiers’

success. After each training step, the weights are redistributed. Misclassified data increases its

weights to emphasize the most difficult cases. In this way, subsequent learners will focus on

them during their training.

3. To predict the class of new data we only need to apply the N learners to the new observations.

In Bagging the result is obtained by averaging the responses of the N learners (or majority

vote). However, Boosting assigns a second set of weights, this time for the N classifiers, in

order to take a weighted average of their estimates.

In the Boosting training stage, the algorithm allocates weights to each resulting model. A learner with a good classification result on the training data will be assigned a higher weight than a poor one. So, when evaluating a new learner, Boosting needs to keep track of the learners' errors too. Some boosting algorithms include an extra condition to keep or discard a single learner: for example, in AdaBoost, the most renowned, an error of less than 50% is required to keep the model; otherwise, the iteration is repeated until a learner better than a random guess is obtained.

In BAGGING all the features of the data in a bootstrapped sample (drawing samples with

replacement) are taken while making a tree, multiple trees are formed, and aggregation is done

to find out the prediction. The trees are therefore correlated, so the variance is reduced by less than the ideal factor of 1/N.

Bagging improves the accuracy only of unstable classifiers, i.e. those where a small change in the input leads to large changes in the output; a decision tree is an example. KNN is a stable classifier, so bagging can even harm its output; boosting could be used instead.

Boosting has a large number of parameters to tune and is therefore computationally expensive.
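A minimal scikit-learn sketch of the two ideas, with decision trees as the base learners (X_train and y_train are placeholders; parameter values are illustrative):

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: N trees fit in parallel on bootstrap samples; prediction by majority vote
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100).fit(X_train, y_train)

# Boosting: trees fit sequentially, re-weighting misclassified observations each round
boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5).fit(X_train, y_train)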

7. RANDOM FORESTS:

randomForest (formula, data=NULL, ..., subset, na.action=na.fail)

a) Each tree selects m out of M features at random for splitting. This means that 2 trees generated on the same training data will have m randomly different variables selected at each split; this is how the trees get de-correlated and become independent of each other. Therefore, the forest is less prone to overfitting. VERY VERY VERY IMPORTANT.

b) In random forests, about one-third of the cases are left out of the bootstrap sample and

not used in the construction of the kth tree, called OOB sample. Hence, there is no need

for cross-validation, or a separate test set to get an unbiased estimate of the test set

error. (subset in the above command) VERY VERY VERY IMPORTANT.

c) The value of m is held constant while growing the forest.

d) In R, random forest internally takes care of missing values using mean/ mode

imputation. VERY VERY VERY IMPORTANT.

e) For regression, the random forest takes an average of all the individual decision tree

estimates. For classification, the random forest will take a majority vote for the

predicted class.

f) IMPORTANCE parameters:

There are two measures of importance given for each variable in the random forest.

Accuracy-based importance measures by how much including a variable increases accuracy, or equivalently by how much removing (permuting) it decreases accuracy. Thus, the higher the number, the more important the variable. As mentioned before, each tree has its own out-of-bag sample of data that was not used during its construction. (A scikit-learn sketch of both importance measures is given at the end of this item.)

For regression, the accuracy-based measure is %IncMSE:

i. First, the prediction accuracy (MSE for regression) on the out-of-bag sample is measured. Call it MSE0.

ii. Then, the values of that variable in the out-of-bag sample are randomly shuffled, keeping all other variables the same. Random shuffling means that, on average, the shuffled variable has no predictive power. Call the resulting error MSE(j).

iii. %IncMSE of the j-th variable is (MSE(j) - MSE0)/MSE0 * 100%.

Accuracy: Overall, how often is the classifier correct? (TP+TN)/total = (100+50)/165 = 0.91.

Misclassification Rate: Overall, how often is it wrong? (FP+FN)/total = (10+5)/165 = 0.09

equivalent to 1 minus Accuracy also known as "Error Rate".

For classification, the same permutation process is used, but the metric is Accuracy instead of MSE (MeanDecreaseAccuracy). So the larger the increase in the error rate when a variable is shuffled, the more important that variable is. VERY VERY VERY IMPORTANT.

Purity based importance: Total decrease in the node impurities / (Total increase in node

purities) from splitting on the variable, averaged over all trees. For regression, it is measured by

the reduction in Residual sum of squares (IncNodePurity), whenever a variable is chosen to

split. For classification, the node impurity is measured by the Gini impurity (MeanDecreaseGini)

For each variable, the sum of the Gini/RSS decrease across every tree of the forest is

accumulated every time that variable is chosen to split a node. The sum is divided by the

number of trees in the forest to give an average.

One advantage of the Gini-based importance is that the Gini calculations are already performed

during training, so minimal extra computation is required.

A disadvantage is that splits are biased towards variables with many classes, which also biases

the importance measure.

IncNodePurity is biased and should only be used if the extra computation time of calculating

%IncMSE is unacceptable. Since it only takes ~5-25% extra time to calculate %IncMSE, this

would almost never happen
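A rough scikit-learn analogue of the two measures (here permutation importance is computed on a held-out set rather than on the OOB sample used by R's randomForest; X_train, y_train, X_test, y_test are placeholders):

from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rf = RandomForestRegressor(n_estimators=500).fit(X_train, y_train)

# purity-based importance (impurity decrease, analogous to IncNodePurity): already
# computed during training, but biased towards features with many levels
print(rf.feature_importances_)

# accuracy-based importance (analogous to %IncMSE): shuffle one column at a time
# and measure how much the score degrades
perm = permutation_importance(rf, X_test, y_test, n_repeats=10)
print(perm.importances_mean)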

g) Each tree is grown to the largest extent possible, with no pruning; each tree is grown fully. VERY VERY VERY IMPORTANT.

h) Parameter Tuning: Mainly, there are three parameters in the random forest algorithm which you should look at (for tuning); cross validation is one method for choosing them, and there are relatively few parameters to tune. VERY VERY VERY IMPORTANT. "RandomizedSearchCV" in scikit-learn can automate the search (a sketch is given after the parameter list below).

ntree – As the name suggests, the number of trees to grow. The more trees, the more stable the predictions, but building the model becomes more computationally expensive.

mtry – It refers to how many variables we should select at a node split. As mentioned above, the default value is p/3 for regression and sqrt(p) for classification. Increasing mtry/max_features generally improves the performance of the model, as at each node we now have a higher number of options to consider. However, this is not necessarily true, because it also decreases the diversity of the individual trees, which is the USP of the random forest, and it can therefore increase overfitting. And, for sure, you decrease the speed of the algorithm by increasing mtry/max_features. Hence, you need to strike the right balance and choose the optimal mtry/max_features.

leaf size/node size – It refers to how many observations we want in the terminal nodes. This parameter is inversely related to tree depth: the higher the number, the lower the tree depth. Too small a leaf size means a very deep tree, i.e. overfitting. On the flip side, if we choose too large a leaf size, say 500 in the above example, the tree will stop growing after the second split itself, and with such a shallow tree it might fail to recognize useful signals from the data, meaning poor predictive performance.
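A minimal sketch of tuning these parameters with RandomizedSearchCV (placeholder data; the search ranges are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [200, 500, 1000],        # ntree
    "max_features": ["sqrt", 0.33, 0.5],     # mtry
    "min_samples_leaf": [1, 5, 10, 25],      # leaf/node size
}
search = RandomizedSearchCV(RandomForestClassifier(), param_distributions,
                            n_iter=20, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)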

i) Not useful for small datasets; a black box, so it can't be used for descriptive purposes; gives preference to categorical/ordinal features with a greater number of levels; and it is computationally expensive.

j) Important links/Extra:

Importance Parameters

(https://www.displayr.com/how-is-variable-importance-calculated-for-a-random-forest/)

https://stats.stackexchange.com/questions/162465/in-a-random-forest-is-larger-incmse-better-or-worse

https://stats.stackexchange.com/questions/197827/how-to-interpret-mean-decrease-in-accuracy-and-mean-decrease-gini-in-random-fore

varImpPlot(model) and importance(model) give the same results. importance() is part of the randomForest package, whereas varImp() is part of the caret package.

Therefore, in order to avoid waiting time, let’s impute the missing values using median /

mode imputation method; i.e., missing values in the integer variable will be imputed with

median and factor variables will be imputed with mode (most frequent value).

8. PCA:

1. Dimension reduction by feature extraction, i.e. we create new variables (principal components) from the old ones such that each new variable is a combination of all the original variables, and the new variables are sorted in order of the variance they explain. So later on, we can drop the least important ones.

2. The new variables created are uncorrelated with each other, because they are orthogonal to each other; this satisfies the no-multicollinearity requirement of linear regression.

3. Eigenvalues & Eigenvectors: solve det(A - λI) = 0, where A is a square matrix (here the covariance matrix of the X variables) and I is the identity matrix. The values of λ obtained are called eigenvalues. For each eigenvalue λ1, λ2, ... we then solve (A - λI)x = 0; the solutions x1, x2, ... (whose dimension depends on A) are the eigenvectors, such that A*x = λ*x.

4. I have my original matrix Z formed from X by standardization. Calculate Z'Z (up to a factor of 1/(n-1)) to get the variance-covariance matrix; the sum of the elements along its diagonal is the total variance. Decompose Z'Z into PDP^-1.

5. D matrix has diagonal elements as eigenvalues and rest as 0, P has corresponding

columns as eigenvectors. The eigenvalues on the diagonal of D will be associated with

the corresponding column in P — that is, the first element of D is λ₁ and the

corresponding eigenvector is the first column of P. So, sort D from largest to smallest,

in this way P also gets sorted and forms P*.

6. (Step 7 in the linked article.) So, we get Z* = ZP*. Z* contains the principal components, and its variance-covariance matrix is diagonal, with the sorted eigenvalues on the diagonal.

7. To determine how many features to keep, set a threshold for the proportion of variance

explained

8. As we know, Z'Z has the variances along its diagonal, and the D matrix has the sorted eigenvalues along its diagonal with zeros elsewhere. So, the first eigenvalue divided by the sum of the eigenvalues (the total variance) is the % of variance explained by the first component.

9. Basically, variance explained is eigenvalue / sum of eigenvalues. Because each eigenvalue is roughly the importance of its corresponding eigenvector, the proportion of variance explained is the sum of the eigenvalues of the features you kept divided by the sum of the eigenvalues of all features.

10. https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained

11. PCA finds, in the data space, the dimension (direction) with the largest variance out of

the overall variance 1.343730519+.619205620+1.485549631 = 3.448. That largest

variance would be 1.651354285. Then it finds the dimension of the second largest

variance, orthogonal to the first one, out of the remaining 3.448-1.651354285 overall

variance. That 2nd dimension would be 1.220288343 variance. And so on.

12. An eigenvalue greater (less) than one implies that this component is summarizing a

component of the total variance which exceeds (is less than) the information provided by

the original variable. Therefore, it is common that only principal components with

eigenvalues greater than one are considered.
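A minimal NumPy sketch of the procedure described above (random placeholder data; in practice X would be your own feature matrix):

import numpy as np

X = np.random.randn(200, 3)                      # placeholder data with 3 features
Z = (X - X.mean(axis=0)) / X.std(axis=0)         # standardize
cov = np.cov(Z, rowvar=False)                    # variance-covariance matrix of Z
eigvals, eigvecs = np.linalg.eigh(cov)           # eigen-decomposition (cov = P D P^-1)

order = np.argsort(eigvals)[::-1]                # sort eigenvalues largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z_star = Z @ eigvecs                             # principal components, Z* = Z P*
explained = eigvals / eigvals.sum()              # proportion of variance explained
print(explained.cumsum())                        # keep components up to a chosen threshold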

9. GLM:

Generalized linear models, the general equation for which is g(E(y)) = α + βx1 + γx2, have 3 components. Here, g() is the link function, E(y) is the expectation of the target variable and α + βx1 + γx2 is the linear predictor (α, β, γ are to be estimated). The role of the link function is to 'link' the expectation of y to the linear predictor. For example, g(E(y)) = E(y) for linear regression and g(E(y)) = log(p/(1-p)) for logistic regression, where p is the probability of success.

Above 3 components: systematic component, the explanatory variables or the RHS in

the above equation. Random component, the response variable and its probability

function. For example, normal distribution for linear regression and Binomial

distribution for logistic regression. Link function, as above specifies the link between

the random and the systematic component.

The Binomial distribution is a discrete probability distribution represented by B(n, p), where n is the number of trials and p is the probability of success. The mean is np and the variance is np(1-p). A single event is called a Bernoulli trial.

Assumptions/Important points:

1.GLM does not assume a linear relationship between dependent and independent

variables. However, it assumes a linear relationship between link function and

independent variables.

2. The dependent variable and the errors need not be normally distributed, but the observations should be independently distributed.

3. It does not use OLS (Ordinary Least Squares) for parameter estimation. Instead, it uses maximum likelihood estimation (MLE), a parameter estimation method in which you maximize the likelihood of the observed data under the assumed distribution.

10. LOGISTIC REGRESSION:

To model a binary outcome (Yes/No), given a set of independent variables, we use logistic regression. For logistic regression, the link function is log(p/(1-p)), where p is the probability of success.

a) The outcome is a binary or dichotomous variable like yes vs no, positive vs negative,

1 vs 0.

b) There is a linear relationship between the logit of the outcome and each predictor variable. Recall that the logit function is logit(p) = log(p/(1-p)), where p is the probability of the outcome. Check with a scatterplot.

c) There are no extreme values or outliers in the continuous predictors. Check with a boxplot.

d) There are no high intercorrelations (i.e. multicollinearity) among the predictors. Check with the VIF.

https://www.analyticsvidhya.com/blog/2015/11/beginners-guide-on-logistic-regression-in-r/ This is the derivation of the logit function.

The RHS is continuous, so to map the predicted values to probabilities we use the sigmoid function. P, the probability of success, = exp(y)/(1+exp(y)) = 1/(1+exp(-y)), where y = b0 + b1x1 + b2x2 + … So, the graph of p against y is the sigmoid curve.
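A small sketch of the same idea in Python, fitting by maximum likelihood with statsmodels (X_data and y_data are placeholder arrays, y_data binary 0/1):

import numpy as np
import statsmodels.api as sm

def sigmoid(y):
    # maps the linear predictor y = b0 + b1*x1 + ... onto (0, 1)
    return 1 / (1 + np.exp(-y))

X = sm.add_constant(X_data)                # add the intercept column
model = sm.Logit(y_data, X).fit()          # estimated by MLE, not OLS
print(model.summary())                     # coefficients are on the log-odds (logit) scale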

11. LASSO and RIDGE REGRESSION (glmnet package in R):

As we add more and more parameters to our model, its complexity increases, which

results in increasing variance and decreasing bias, thus overfitting. Too many features

result in multicollinearity. Variance is also increased when the independent variables

suffer from multicollinearity.

One way to measure multicollinearity is the variance inflation factor (VIF), which assesses

how much the variance of an estimated regression coefficient increases if your predictors

are correlated. If no factors are correlated, the VIFs will all be 1.

Multicollinearity causes issues in hypothesis testing, since the t-tests and p-values become unreliable.

Look at the residuals vs. fitted values plot. If heteroskedasticity exists, the plot will exhibit a funnel-shaped pattern.

Conversely, adding more features to our model increases the model complexity and thus reduces high bias.

The following methods help overcome overfitting / reduce the effective number of features:

Regularization: Ridge/lasso

PCA

Ensemble methods

Pruning in case of Decision trees.

For regularization, we do not remove the features but shrink the coefficients of those features by introducing a penalty term into the cost function, controlled by the penalty factor λ: minimize SSR + λ*Σβj^2 (ridge) or SSR + λ*Σ|βj| (lasso). As λ increases, the coefficients shrink further and the bias increases. We need to tune λ, because we cannot accept too high a bias.

There are two ways we could tackle this issue. A more traditional approach would be

to choose λ such that some information criterion, e.g., AIC or BIC, is the

smallest. A more machine learning-like approach is to perform cross-validation

and select the value of λ that minimizes the cross-validated sum of squared residuals

(or some other measure). The former approach emphasizes the model's fit to the

data, while the latter is more focused on its predictive performance.

In R, the glmnet package implements ridge regression. Lasso regression is similar conceptually to ridge regression. It also adds a penalty for non-zero coefficients, but unlike ridge regression, which penalizes the sum of squared coefficients (the so-called L2 penalty), lasso penalizes the sum of their absolute values (the L1 penalty).

As a result, for high values of λ, many coefficients are exactly zeroed under

lasso, which is never the case in ridge regression.

Therefore, lasso keeps only some features while shrinking the coefficients of the others to zero. This property is known as feature selection and is absent in the case of ridge.

Both methods allow the use of correlated predictors, but they solve the multicollinearity issue differently:

In ridge regression, the coefficients of correlated predictors are similar, but

reduced.

In lasso, one of the correlated predictors has a larger coefficient, while the rest

are (nearly) zeroed.

Elastic net is a combination of both.

https://www.datacamp.com/community/tutorials/tutorial-ridge-lasso-elastic-net

https://www.analyticsvidhya.com/blog/2017/06/a-comprehensive-guide-for-linear-ridge-and-lasso-regression/

GOOD ARTICLES ON REGULARIZATION.
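A minimal scikit-learn sketch of both penalties with cross-validated choice of λ (called alpha here; X_train and y_train are placeholders):

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

alphas = np.logspace(-3, 3, 50)             # candidate penalty factors (lambda)

# Ridge (L2): shrinks correlated coefficients towards each other, never exactly to zero
ridge = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)

# Lasso (L1): for large enough lambda many coefficients become exactly zero (feature selection)
lasso = LassoCV(alphas=alphas, cv=5).fit(X_train, y_train)

print(ridge.alpha_, lasso.alpha_)           # the selected penalties
print(lasso.coef_)                          # note the exact zeros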

12. SVM:

Classifies two classes by drawing a hyperplane/line between them.

At its core, SVM tries to achieve a good margin. The margin is the distance from the separating line/hyperplane to the closest points of each class. A good margin is one where this separation is large for both classes.

Hyperparameters include:

C, the regularization parameter: a lower C allows a larger margin at the cost of some misclassification, while a larger C tolerates little or no misclassification (and risks overfitting).

Gamma: with a low gamma, points far away from the plausible separating line are also considered when computing the line, whereas a high gamma means only the points close to the plausible line are considered.
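A short scikit-learn sketch of tuning these two hyperparameters for an RBF-kernel SVM (placeholder data; the grids are illustrative):

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# C: small -> wide margin, tolerates misclassification; large -> narrow margin, fits training data closely
# gamma: small -> far-away points influence the boundary; large -> only nearby points do
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)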

13. Gradient descent:

To minimize the cost function, I will use gradient descent optimization. The objective is

to find the parameters for which the cost function is minimum.

To start with finding the right values we initialize the values of our parameters with some

random numbers and Gradient Descent then starts at that point. Then it takes one step

after another in the steepest downside direction till it reaches the point where the cost

function is as small as possible.

The equation b = a - γ∇f(a) describes what Gradient Descent does: b is the next position of our climber, while a represents his current position. The minus sign refers to the minimization part of gradient descent. The gamma in the middle is a weighting factor (the learning rate), and the gradient term ∇f(a) gives the direction of steepest ascent, so subtracting it moves us in the direction of steepest descent.

How big the steps are that Gradient Descent takes into the direction of the local minimum

are determined by the so-called learning rate. It determines how fast or slow we will

move towards the optimal weights.

In order for Gradient Descent to reach the local minimum, we have to set the learning

rate to an appropriate value, which is neither too low nor too high.

This is because if the steps it takes are too big, it may never reach the local minimum, since it just bounces back and forth across the valley of the convex cost function. If you set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but it may take far too much time.

If gradient descent is working properly, the cost function should decrease after every

iteration.

The higher the gradient, the steeper the slope and the faster a model can learn
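A tiny sketch of the update rule b = a - γ∇f(a) on a simple one-dimensional function (the function and settings are illustrative):

import numpy as np

def gradient_descent(grad_f, a_init, learning_rate=0.1, n_steps=100):
    a = np.asarray(a_init, dtype=float)
    for _ in range(n_steps):
        a = a - learning_rate * grad_f(a)    # b = a - gamma * grad f(a)
    return a

# minimize f(a) = (a - 3)^2, whose gradient is 2*(a - 3); the minimum is at a = 3
print(gradient_descent(lambda a: 2 * (a - 3), a_init=[0.0]))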

14. GBM is basically boosting + gradient descent.

Parameters: learning rate, number of trees to fit, minimum number of observations in a leaf, depth of each tree, etc.

1. Types of missing values: MCAR (completely no pattern in the missing data), MAR (some

probabilistic pattern in the missing data), MNAR (non-ignorable pattern in the missing

data)

2. Ways to handle them: deletion, mean/median/mode imputation, linear regression imputation, multivariate imputation by chained equations (MICE, i.e. m sets of imputed values).

3. The choice also depends on the algorithm used; for example, some ensemble/tree implementations resolve missing values themselves, either by ignoring or by imputing them.

Types of clustering:

Hard Clustering: In hard clustering, each data point either belongs to a cluster completely

or not. For example, in the above example each customer is put into one group out of the

10 groups.

Soft Clustering: In soft clustering, instead of putting each data point into a separate

cluster, a probability or likelihood of that data point to be in those clusters is assigned.

For example, in the above scenario each customer is assigned a probability to be in

either of 10 clusters of the retail store.

Steps (k-means; a from-scratch sketch follows below):

1. Choose the number of clusters K.

2. Randomly assign each data point to a cluster.

3. Compute the cluster centroids.

4. Re-assign each point to the closest cluster centroid.

5. Re-compute the cluster centroids, and repeat steps 4-5 until the assignments stop changing.
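A from-scratch sketch of these steps (placeholder NumPy data; it ignores the empty-cluster edge case for brevity):

import numpy as np

def kmeans(X, k, n_iter=100):
    labels = np.random.randint(0, k, size=len(X))            # step 2: random assignment
    for _ in range(n_iter):
        centroids = np.array([X[labels == j].mean(axis=0)    # steps 3/5: compute centroids
                              for j in range(k)])
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                         # step 4: re-assign to closest centroid
    return labels, centroids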

ECONOMETRICS:

a) If the sample size n is sufficiently large (n > 30), the sampling distribution of the sample

means will be approximately normal. Remember what’s going on here: random samples

of size n are repeatedly being taken from an overall larger population. Each of these

random samples has its own mean, which is itself a random variable, and this set of

sample means has a distribution that is approximately normal.

b) The mean of the population and the mean of the distribution of all possible sample means

are equal.

c) The variance of the distribution of sample means is the population variance divided by

the sample size.

Positively Skewed: Mode < Median < Mean

Negatively Skewed: Mean < Median < Mode

Normal Distribution: Mean = Median = Mode

Kurtosis (measures the clustering of observations in the center and tails): Leptokurtic is more peaked than normal, so excess kurtosis (kurtosis - 3) > 0. Platykurtic is less peaked than normal.

Relative to a normal distribution, a leptokurtic distribution will have a greater percentage of

small deviations from the mean and a greater percentage of extremely large deviations from

the mean. This means there is a relatively greater probability of an observed value being either

close to the mean or far from the mean.

RMSE: Square root of MSE.

SSR: N*MSE. For OLS, the objective is to minimize this SSR and not to maximize R2.

TSS: Σ(Y - Ymean)^2 = N*Variance.

R^2 = ESS/TSS = 1 - SSR/TSS is called the Coefficient of Determination. It measures how much of the variability in Y (around its mean) is explained by the X variables, i.e. how much of TSS is explained by ESS. It is also the square of the correlation coefficient between Yi and predicted Yi.

However, R squared is not adequate.

1. Increasing the number of X variables will increase R squared, even if the terms are not

significant. So, we can use Adjusted R squared, AIC and BIC for model comparison.

Adjusted R squared = 1 - (1 - R^2)*(N - 1)/(N - (k + 1))

2. We introduce a penalty factor for a higher number of terms. Adding terms decreases the error term, but the penalty (2K/T for AIC, K*log(T)/T for BIC) balances it out. BIC has the stricter penalty: the per-parameter penalty is basically 2 for AIC and log(T) for BIC.

Criteria: Lowest AIC or SBIC.

AIC=2K+T*log (SSR/T)

BIC= K*logT+T*log (SSR/T)

Solving the OLS regression by minimizing the SSR would give us the value of the coefficients.

So, α^ = Ymean - β^*Xmean and β^ = Cov(X, Y)/Var(X)

Best/Efficient: OLS estimator has minimum variance/SSR.

Linear: Linear in parameters and not variables.

Unbiased: Expected value of predicted alpha and beta is equal to actual values.

Consistent: estimates converge to true values as sample size increases.

The standard error of the regression is the standard deviation of the error term, i.e. the square root of SSR/(T - K), where T is the total sample size and K = k + 1, with k the number of independent variables. If we divide by T instead (SSR/T), the estimate becomes biased.

Confidence interval = point estimate ± (Z critical or T critical)*SE. Z critical is used with a known population variance; SE = standard deviation of the population/sqrt(N). The critical values are 1.65, 1.96 and 2.58 for 90%, 95% and 99% confidence.

T critical is used when only the sample variance is known; SE = standard deviation of the sample/sqrt(N). With an unknown population variance, use T; when N > 30 the Z approximation is also acceptable.

The test statistic for hypothesis testing is otherwise calculated as (sample estimate - hypothesized value)/SE. Use a two-tailed test if the null hypothesis is an equality. The Z value at the 95% confidence level is 1.96 for a two-tailed test and 1.65 for a one-tailed test.

standard error of coefficient = Coefficient standard deviation/ Sqrt. (N).

So, if we have to calculate the CI of Y in regression, error can be due to error term and due

to error in coefficient estimation. If we assume only due to error term, then SE= standard

error of regression, which we get directly in R. If due to both factors, then there is a separate

formula.

Variance-covariance matrix of β^: (SER)^2*(X'X)^-1; the standard errors of the coefficients are the square roots of its diagonal elements.

Matrix form: β^(OLS) = (X'X)^-1 X'y
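A small NumPy sketch of these matrix formulas (x1, x2 and y are placeholder arrays of equal length):

import numpy as np

# beta_hat = (X'X)^-1 X'y ; Var(beta_hat) = SER^2 * (X'X)^-1
X = np.column_stack([np.ones(len(x1)), x1, x2])      # design matrix with a constant
beta = np.linalg.inv(X.T @ X) @ X.T @ y
residuals = y - X @ beta
T, K = X.shape
ser2 = residuals @ residuals / (T - K)               # SSR/(T - K), squared standard error of regression
cov_beta = ser2 * np.linalg.inv(X.T @ X)
se_beta = np.sqrt(np.diag(cov_beta))                 # standard errors of the coefficients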

Types of Distribution:

1. Poisson: Discrete probability distribution just like Binomial, that expresses the

probability of a given number of events occurring in a fixed interval of time or space if

these events occur with a known constant rate and independently of the time since the

last event; for example, the number of phone calls received by a call center per hour. λ refers to the expected number of successes per unit, which is both the mean and the variance.

2. Normal: For any normally distributed random variable, 68% of the outcomes are within

one standard deviation of the expected value (mean), and approximately 95% of the

outcomes are within two standard deviations of the expected value.

For Standard Normal, we calculate the Z score= (X-mean)/SD. From Z table, we calculate

the probability that something is less than X.

3. T distribution: used to construct confidence intervals based on small samples (n < 30) from populations with unknown

variance and an approximately normal distribution. It may also be appropriate to use

the t distribution when the population variance is unknown, and the sample size is large

enough that the central limit theorem will assure that the sampling distribution is

approximately normal.

It is defined by a single parameter, the degrees of freedom (df), where the degrees of

freedom are equal to the number of sample observations minus 1, n — 1, for sample

means.

It has more probability in the tails (fatter tails) than the normal distribution. As the

degrees of freedom (the sample size) gets larger, the shape of the t-distribution more

closely approaches a standard normal distribution.

4. Chi Squared and F distribution (Graphs same as Lognormal for both): Chi-squared for

hypothesis testing of a population variance to some fixed value. Bounded at 0 just like

lognormal. It requires the degrees of freedom and a confidence level, just like T; compare the test statistic (n-1)*s^2/σ0^2 with the chi-squared critical value to accept/reject the null hypothesis.

The F distribution is used for comparing the variances of two different populations. It has 2 DOF, one for each sample. Compare the F statistic (the ratio of the two sample variances) with the F critical value to reject/accept the null hypothesis.

DF is K and N-K-1. K is the number of independent parameters, excluding the constant.

Assumptions of OLS:

1. Parameters should be linear.

Detection: can be checked with residuals vs. fitted value plots; a parabolic pattern in such a plot indicates non-linearity.

Remedy: to overcome the issue of non-linearity, you can apply a non-linear transformation to the predictors, such as log(X), √X or X², or transform the dependent variable.

2. No perfect multicollinearity.

Perfect implies correlation is 1 between explanatory variables. When correlation is

less than 1 called imperfect.

The OLS coefficients are still unbiased and consistent but inefficient.

R squared may still be high, therefore, it only affects individual coefficients

hypothesis tests, i.e. high standard errors and therefore unreliable T

tests/confidence intervals.

In this situation the coefficient estimates of the multiple regression may change

erratically in response to small changes in the model or the data.

Under perfect multicollinearity, the matrix X has less than full rank, and therefore the moment matrix X'X cannot be inverted: β^(OLS) = (X'X)^-1 X'y does not exist.

Detection: Multiple ways:

1. Correlation matrix.

2. Insignificant regression coefficients for the affected variables in the multiple

regression, but a rejection of the joint hypothesis that those coefficients are all

zero (using an F-test).

3. Variance Inflation Factor: measures the increase in the variance of the estimated coefficients because of multicollinearity. The VIF is computed separately for each explanatory variable in the model and is interpreted as the ratio of the actual var(β^) to what the variance would have been if xi were not linearly related to the other x variables in the model.

A rule of thumb is that VIF values of 10 or more indicate variables that warrant further investigation. VIFi = 1/(1 - Ri^2), where Ri^2 is obtained by regressing Xi on the other X variables. (A statsmodels sketch follows below.)
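A minimal sketch of computing VIFs with statsmodels (X_df is a placeholder DataFrame of explanatory variables):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

exog = sm.add_constant(X_df)                         # add the intercept column
vif = [variance_inflation_factor(exog.values, i)     # skip index 0, the constant
       for i in range(1, exog.shape[1])]
print(dict(zip(X_df.columns, vif)))                  # values of 10 or more warrant a closer look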

Consequences:

Type II errors, since the inflated standard errors make significant variables appear insignificant.

Another consequence of multicollinearity is Overfitting in the regression analysis.

Remedies:

1. Ridge regression

2. PCA

3. Obtain more data, if possible. This is the preferred solution. More data can

produce more precise parameter estimates (with lower standard errors), as seen

from the formula in variance inflation factor for the variance of the estimate of

a regression coefficient in terms of the sample size and the degree of

multicollinearity.

4. Dropping some variables might be an approach in some cases. However, if those variables belong in the model, their effect moves into the error term and thus gives us biased estimates (omitted variable bias).

3. The errors have zero mean: the mean of the residuals will always be zero provided that there is a constant term in the regression.

4. The error terms are IID or Presence of spherical disturbances, i.e. Homoskedasticity:

Variance (error term) is finite/constant across all observations and No Autocorrelation

between error terms.

Homoskedasticity:

Consequences:

Again, it would still give us unbiased estimates but not BLUE i.e. inefficient.

So, we can still use OLS, but standard errors would be incorrect.

Unconditional heteroskedasticity occurs when the heteroskedasticity is not related to

the level of the independent variables, which means that it doesn’t systematically

increase or decrease with changes in the value of the independent variable(s). While this

is a violation of the equal variance assumption, it usually causes no major problems with

the regression.

Conditional heteroskedasticity is heteroskedasticity that is related to the level of (i.e.,

conditional on) the independent variable. For example, conditional heteroskedasticity

exists if the variance of the residual term increases as the value of the independent

variable increases, as shown in Figure 1. Notice in this figure that the residual variance

associated with the larger values of the independent variable, X, is larger than the residual

variance associated with the smaller values of X. Conditional heteroskedasticity does

create significant problems for statistical inference.

Detection:

1. Y/Residuals vs X plot.

2. Residual vs Fitted Value plot

3. White’s test:

To test for constant variance one undertakes an auxiliary regression analysis: this regresses

the squared residuals from the original regression model onto a set of regressors that

contain the original regressors along with their squares and cross-products. One then

inspects the R2. The Lagrange multiplier (LM) test statistic is the product of the R2 value

and sample size.

This follows a chi-squared distribution, with degrees of freedom equal to P − 1,

where P is the number of estimated parameters (in the auxiliary regression).

If LM > the chi-squared critical value, reject the null hypothesis that the errors are homoskedastic.

The logic of the test is as follows. First, the squared residuals from the original model

serve as a proxy for the variance of the error term at each observation. (The error term

is assumed to have a mean of zero, and the variance of a zero-mean random variable is

just the expectation of its square.) The independent variables in the auxiliary regression

account for the possibility that the error variance depends on the values of the original

regressors in some way (linear or quadratic). If the error term in the original model is in

fact homoscedastic (has a constant variance) then the coefficients in the auxiliary

regression (besides the constant) should be statistically indistinguishable from zero and

the R2 should be “small". Conversely, a “large" R2 (scaled by the sample size so that it

follows the chi-squared distribution) counts against the hypothesis of homoskedasticity.

4. Breusch-Pagan test.

Same procedure as white’s test. Just that auxiliary regression involves only x terms and

not non-linear terms (squares and cross products).

Disadvantage: Only tests for linear forms of heteroskedasticity.
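Both tests are available in statsmodels; a minimal sketch (X_df and y are placeholders for the original regressors and the response):

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white, het_breuschpagan

exog = sm.add_constant(X_df)
ols = sm.OLS(y, exog).fit()

lm, lm_pvalue, f_stat, f_pvalue = het_white(ols.resid, exog)       # squares and cross-products
bp_lm, bp_pvalue, _, _ = het_breuschpagan(ols.resid, exog)         # linear terms only
print(lm_pvalue, bp_pvalue)            # a small p-value rejects the null of homoskedasticity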

Remedies:

1. Mostly heteroskedasticity results from mis-specification. So, check for any missing

variables or change the specification.

2. To overcome heteroskedasticity, a possible way is to transform the response variable

such as log(Y) or √Y.

3. Robust standard errors: here, for calculating the variance-covariance matrix of β^, we use a different formula (e.g. White/Huber robust standard errors).

Autocorrelation:

Same Consequences.

Detection:

1. ACF plot of the residuals.

2. Durbin-Watson Test.

Approximate value for the DW statistic is: 2(1-p), where p is the autocorrelation.

It must lie between 0 and 4. If DW = 2, implies no autocorrelation, 0 < DW < 2 implies

positive autocorrelation while 2 < DW < 4 indicates negative autocorrelation. Also get

the P value to check the null hypothesis of no autocorrelation.

3. LM Test / Breusch-Godfrey: the DW test only checks for first-order correlation and is invalid when lagged dependent variables are included as regressors, so we use the LM test instead.

4. For time series, I have the Ljung-Box and Box-Pierce test statistic.

Ho: no autocorrelation, Ha: Serial Autocorrelation.
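A short statsmodels sketch of the residual autocorrelation checks (placeholder names as before; the OLS model is refit so the snippet is self-contained):

import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_ljungbox

resid = sm.OLS(y, sm.add_constant(X_df)).fit().resid
print(durbin_watson(resid))                    # approx. 2 -> no first-order autocorrelation
print(acorr_ljungbox(resid, lags=[10]))        # small p-value -> reject 'no autocorrelation'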

Remedy:

Same as before.

Take the first difference series.

An explanatory variable X is endogenous if it is (contemporaneously) correlated with

the error e.

Consequences:

BIASED and INCONSISTENT estimates,

Mis-specified functional form: The intuition behind the test is that if non-linear

combinations of the explanatory variables have any power in explaining the response

variable, the model is mis-specified in the sense that the data generating process might

be better approximated by a polynomial or another non-linear functional form.

Causes are Omitted Variables, measurement error.

Detection: Ramsey reset Test.

Remedy: Instrumental Variables.

Normality of the error terms. Detection:

Q-Q plot: A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one

another. If both sets of quantiles came from the same distribution, we should see the points

forming a line that's roughly straight, with the normal (theoretical) quantiles on the X axis and the sample quantiles on the Y axis.

If the errors are not normally distributed, non – linear transformation of the variables

(response or predictors) can bring improvement in the model.

CONFUSION MATRIX

STRESS TESTING, CECL, CCAR, DFAST, BASEL III.

1. Setting Up Twitter API using: twitteR package, setup_twitter_oauth() function that

requires 4 inputs. (consumer keys and secrets, access tokens and secrets).

2. Getting n tweets using searchTwitter and obtaining screen name and text from each of

the n tweets.

3.
