
Machine Learning Top Algorithms Revision Material

March 1, 2016

Decision Trees

When to use decision trees?


- When there is a fixed set of features and each feature takes on a small set of values
- When the target function has a small, discrete set of values
- It is robust to errors in the classification of the training data and can also handle some missing values. Why? Because it considers all available training examples during learning.

Important algorithms and extensions


1. ID3
(a) The hypothesis space for ID3 consists of all possible hypothesis functions, and amongst the multiple valid hypotheses, the algorithm prefers shorter trees over longer trees.
(b) Why shorter trees? Occam's razor: prefer the simplest hypothesis that fits the data. Why?
(c) It is a greedy search algorithm, i.e. the algorithm never backtracks to reconsider its previous choices, so it may end up at a local optimum
(d) The crux of the algorithm is specifying a way to choose an optimal root node
(e) The optimal root node is selected by calculating information gain, which measures how well the node/feature separates the training examples w.r.t. the target classification
(f) For a variable with only binary values, entropy is defined as
    $\mathrm{Entropy}(S) = -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$
    Therefore, for a training set with 9 positive and 5 negative values of the target variable, the entropy will be
    $-\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940$

(g) Information gain of a node with attribute A is defined as
    $\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \mathrm{Entropy}(S_v)$
    where S is the set of all examples at the root and $S_v$ is the set of examples in the child node corresponding to value v of A. The node with maximum information gain is selected as root (see the code sketch at the end of this section)
2. C4.5
(a) If the training data has significant noise, the learned tree over-fits.
(b) Over-fitting can be resolved by post-pruning. One successful method is called Rule Post Pruning, and a variant of the algorithm is called C4.5
(c) Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over-fitting to occur
(d) Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node. Why?
(e) Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy. This is repeated on the rule until the accuracy worsens.
(f) Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances
(g) A drawback is the need to keep a separate validation set. To apply the algorithm using the same training set, use a pessimistic estimate of accuracy
Other Modifications
- Substitute the most frequently occurring value at that node in case of missing values
- Divide information gains by weights if the features are to be weighted
- Use thresholds to split continuous values
- Use a modified version of information gain for choosing the root node, such as the Split Information Ratio
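
A minimal Python sketch of the entropy and information gain calculations above. The dictionary-based example representation and the function names are illustrative assumptions, not from the source:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = - sum over classes of p * log2(p)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, labels, feature):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v).

    examples: list of dicts mapping feature names to values
    labels:   target classification of each example
    """
    total = len(labels)
    # Partition the labels by the value this feature takes on each example.
    partitions = {}
    for x, y in zip(examples, labels):
        partitions.setdefault(x[feature], []).append(y)
    return entropy(labels) - sum(
        (len(subset) / total) * entropy(subset)
        for subset in partitions.values())

# The 9-positive / 5-negative training set from item 1(f):
labels = ['+'] * 9 + ['-'] * 5
print(round(entropy(labels), 3))  # 0.94
```

ID3 would call information_gain once per candidate feature, pick the feature with the maximum value as the root, and then recurse on each child.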

References
Machine Learning - Tom Mitchell, Chapter 3

Logistic Regression

When to use logistic regression?


- When the target variable is discrete-valued

Algorithm
1. The hypothesis function is of the form (for binary classification)
   $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$
   where g is the logistic function, which varies between 0 and 1


2. How did we choose this logistic function? Answer - GLMs
3. Let
   $P(y = 1 \mid x; \theta) = h_\theta(x)$
   $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$
   More generally,
   $P(y \mid x; \theta) = h_\theta(x)^{y} (1 - h_\theta(x))^{1-y}$
4. The likelihood, which is the joint probability over the training set, will be
   $L(\theta) = P(\vec{y} \mid X; \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}$

5. We maximize the log of this likelihood function using the batch gradient ascent method
   $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$
6. For one training example, the derivative of the log likelihood w.r.t. $\theta_j$ is (simple derivation)
   $\frac{\partial \ell(\theta)}{\partial \theta_j} = (y - h_\theta(x)) x_j$
7. Therefore the update rule becomes
   $\theta_j := \theta_j + \alpha (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}$
   (see the code sketch after this list)

8. Newton's method can also be used to maximize the log likelihood for faster convergence, but it involves computing the inverse of the Hessian of the log likelihood w.r.t. $\theta$ at each iteration
9. For multi-class problems, use Softmax Regression, which can also be derived from GLM theory
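
A minimal sketch of items 5-7 as batch gradient ascent. The toy data, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    """The logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, iters=1000):
    """Maximize the log likelihood by batch gradient ascent.

    X: (m, n) design matrix including an intercept column of ones
    y: (m,) vector of 0/1 labels
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)            # h_theta(x) for all m examples
        theta += alpha * (X.T @ (y - h))  # theta := theta + alpha * gradient
    return theta

# Toy usage: larger x should push the predicted probability towards 1.
X = np.column_stack([np.ones(4), [0.1, 0.4, 0.6, 0.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = logistic_regression(X, y)
print(sigmoid(X @ theta))  # monotonically increasing probabilities
```

Note this sums the per-example update from item 7 over the whole training set, i.e. batch rather than stochastic gradient ascent.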

References
Stanford CS229 Notes 1

Linear Regression

When to use linear regression?


- For any regression problem

Algorithm
1. The hypothesis function is of the form
   $h_\theta(x) = \theta^T x$, with $x_0 = 1$
2. Intuitively, we reduce the sum of squared errors over the training data to find $\theta$
3. The cost function becomes
   $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$
   and to solve, we minimize $J(\theta)$


4. Why did we choose the least squares cost function? Answer - Assume the training data is just Gaussian noise around an underlying target function. The log of the probability distribution of the noise, i.e. the log likelihood, reduces to the least squares formula
5. Use gradient descent to minimize cost function
6. Normal equation solution (see the code sketch after this list):
   $\theta = (X^T X)^{-1} X^T \vec{y}$
7. Locally weighted linear regression is one modification to account for highly
varying data without over-fitting
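
A minimal sketch of the normal equation solution from item 6. The toy data is an illustrative assumption; np.linalg.solve is used rather than forming the matrix inverse explicitly, which is the numerically preferred route:

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage: recover y = 1 + 2x from noiseless data.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # x0 = 1 intercept column
y = 1.0 + 2.0 * x
print(fit_normal_equation(X, y))  # approximately [1., 2.]
```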

References
Stanford CS229 Notes 1

Nearest Neighbours

Gaussian Discriminant Analysis

When to use GDA?


- For classification problems with continuous features
- GDA makes stronger assumptions about the training data than logistic regression; when these assumptions are correct, GDA generally performs better than logistic regression

Algorithm
1. GDA is a generative algorithm
2. Intuitive explanation for generative algorithms - First, looking at elephants, we can build a model of what elephants look like. Then, looking
at dogs, we can build a separate model of what dogs look like. Finally, to
classify a new animal, we can match the new animal against the elephant
model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in
the training set.
3. Model (a sketch of the maximum likelihood fit follows):
   $y \sim \mathrm{Bernoulli}(\phi)$
   $x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$
   $x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$
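
A minimal sketch of fitting this model. The closed-form maximum likelihood estimates below follow the standard GDA derivation; the function name and array layout are illustrative assumptions:

```python
import numpy as np

def fit_gda(X, y):
    """Maximum likelihood estimates of phi, mu_0, mu_1 and the shared Sigma.

    X: (m, n) matrix of continuous features
    y: (m,) vector of 0/1 labels
    """
    phi = y.mean()                    # Bernoulli parameter P(y = 1)
    mu0 = X[y == 0].mean(axis=0)      # mean of class-0 examples
    mu1 = X[y == 1].mean(axis=0)      # mean of class-1 examples
    # Shared covariance: average outer product of examples centred
    # on their own class mean.
    centred = X - np.where((y == 1)[:, None], mu1, mu0)
    sigma = centred.T @ centred / len(y)
    return phi, mu0, mu1, sigma
```

A new example x is then classified by comparing $p(x \mid y = 0)(1 - \phi)$ against $p(x \mid y = 1)\,\phi$, i.e. picking the class with the larger posterior.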

Naive Bayes

Bayes Networks

Adaboost

SVM


K-means Clustering


Expectation Maximization


SVD


PCA


Random Forests


Artificial Neural Networks
