
Machine learning has three categories:

1- Supervised learning: algorithms learn from labeled data. After
studying the data, these techniques can determine which label should be
given to new data, by observing patterns in the labeled examples and
associating those patterns with new, unlabeled data. Supervised learning
covers classification, which predicts the category an item belongs to
(an item may be valid for multiple categories), and regression, which
predicts a numeric value.
2- Unsupervised learning: finds patterns in data that has no labels,
for example by grouping similar items together or by reducing huge
datasets to a limited set of features.
3- Reinforcement learning: training algorithms to take certain
actions and receive rewards for them.

Deep learning combines the three above, but it has three barriers:

Data- a lot of data is required

Computing power- a lot of computing power is required

Explainability- some decisions that are made won't have clear reasoning; therefore, not all applications of machine learning will require deep learning

Supervised Machine Learning

Thus far supervised machine learning has gained the most traction in use cases across business applications. Studying labeled data, these techniques
can extend patterns to unlabeled data.
Supervised learning is used for many business applications from spam filters to movie recommendations. We looked at the two broad categories of
supervised machine learning:

 Classification
 Regression

Deep learning can be used within supervised machine learning to create techniques that are better at image recognition or identifying when a movie
was created based on the video footage.

Unsupervised Machine Learning

Unsupervised techniques are the second most used in business applications. By learning patterns even when data do not have labels, these techniques
can group items together that are likely to be similar.
Unsupervised techniques are used for business applications from figuring out market segments to, again, building recommendation engines. There are a
lot of ways you can build recommendation engines! But more on that in term 2.

These two techniques (supervised and unsupervised) will be the main techniques focused on in this term. You will also use deep learning as a supervised learning technique
for image recognition.

Reinforcement Learning
The final type of machine learning, reinforcement learning, has recently been gaining a lot of traction, but it is still limited in its business use cases.
There are a number of obstacles in training these algorithms, and the approaches are not as streamlined as the other approaches you will see in this
term.

With that being said, recent publicity in reinforcement learning has come from AlphaGo and autonomous vehicles. Additionally, reinforcement
learning is often used in gaming agents like you see in Open AI Gym.

This program will not have any applications of reinforcement learning, because its use cases are not common within data science applications.

Main algorithms for machine learning

Classification- answers questions of the form yes/no (example: is the patient sick or not?)
Regression- answers questions of the form how much (example: how much does this house cost?)

Linear regression is used to find a linear relationship between the target and one or more predictors.
If w1 is increased, the slope increases and the line rotates. If w2 is
increased, the line moves up in parallel (and down if w2 is decreased).
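As a minimal sketch (plain Python, illustrative names), the line can be written as ŷ = w1·x + w2, which makes the effect of each weight easy to see:

```python
# Illustrative sketch: a line y_hat = w1 * x + w2.
# w1 controls the slope (rotation), w2 the intercept (parallel shift).
def predict(x, w1, w2):
    """Prediction of the line with slope w1 and intercept w2."""
    return w1 * x + w2

print(predict(2, w1=1.0, w2=0.0))  # 2.0
print(predict(2, w1=2.0, w2=0.0))  # 4.0 -- larger w1, steeper slope
print(predict(2, w1=1.0, w2=3.0))  # 5.0 -- larger w2, line shifted up
```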

The two most common error functions for linear regression are the mean absolute error and the mean squared error.

1) Mean absolute error


The error at a point (x, y) is y − ŷ, the vertical distance between the point and the prediction on the line. This is not the actual distance to the line;
that would be the perpendicular (90-degree) distance. The total error is the sum of these distances over all points, and the mean absolute error is the
sum of the absolute errors divided by m (the number of points in our dataset).

Each error should be taken as positive (its absolute value); otherwise negative errors would cancel positive errors.
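A minimal sketch of the mean absolute error described above (plain Python; the data values are made up for illustration):

```python
# Mean absolute error: sum of the positive vertical distances |y - y_hat|,
# divided by m, the number of points.
def mean_absolute_error(y, y_hat):
    m = len(y)
    return sum(abs(yi - yhi) for yi, yhi in zip(y, y_hat)) / m

y_true = [3.0, 5.0, 2.5]   # actual values (illustrative)
y_pred = [2.5, 5.0, 4.0]   # predictions from the line (illustrative)
print(mean_absolute_error(y_true, y_pred))  # (0.5 + 0.0 + 1.5) / 3 ≈ 0.667
```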

Root mean squared error (RMSE): RMSE is a quadratic scoring rule that also measures the average magnitude of the error. It’s the square root of the
average of squared differences between prediction and actual observation.

Differences: Taking the square root of the average squared errors has some interesting implications for RMSE. Since the errors are squared before they
are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE should be more useful when large errors are particularly
undesirable. The three tables below show examples where MAE is steady and RMSE increases as the variance associated with the frequency
distribution of error magnitudes also increases.

RMSE avoids taking the absolute value.
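A small sketch contrasting the two metrics (made-up error values): both sets of errors have the same MAE, but the set with one large error has a higher RMSE, since squaring weights large errors more heavily:

```python
import math

def mean_absolute_error(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def root_mean_squared_error(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

even   = ([0, 0, 0, 0], [2, 2, 2, 2])   # errors: 2, 2, 2, 2
skewed = ([0, 0, 0, 0], [0, 0, 0, 8])   # errors: 0, 0, 0, 8
print(mean_absolute_error(*even), root_mean_squared_error(*even))      # 2.0 2.0
print(mean_absolute_error(*skewed), root_mean_squared_error(*skewed))  # 2.0 4.0
```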

A derivative is a way to show a rate of change: that is, the amount by which a function is changing at one given point. For functions that act on the real
numbers, it is the slope of the tangent line at a point on the graph.
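This can be checked numerically with a finite-difference approximation (a sketch for intuition, not how libraries actually compute gradients):

```python
# Central-difference approximation of the derivative: the slope of the
# tangent line, estimated from two nearby points a distance 2h apart.
def derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx x^2 = 2x, so at x = 3 the slope is 6.
print(round(derivative(lambda x: x ** 2, 3.0), 4))  # 6.0
```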

Mean vs Total Squared (or Absolute) Error


A potential confusion is the following: How do we know if we should use the mean or the total squared (or absolute) error?

The total squared error is the sum of errors at each point, given by the following equation:

M = \sum_{i=1}^m \frac{1}{2} (y - \hat{y})^2,

whereas the mean squared error is the average of these errors, given by the following equation, where m is the number of points:

T = \sum_{i=1}^m \frac{1}{2m} (y - \hat{y})^2.

The good news is, it doesn't really matter. As we can see, the total squared error is just a multiple of the mean squared error, since

M = mT.

Therefore, since differentiation is a linear operation, the gradient of M is also m times the gradient of T.

However, the gradient descent step consists of subtracting the gradient of the error times the learning rate α. Therefore, choosing between the
mean squared error and the total squared error really just amounts to picking a different learning rate.

In real life, we'll have algorithms that will help us determine a good learning rate to work with. Therefore, if we use the mean error or the total error,
the algorithm will just end up picking a different learning rate.
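A quick numerical check of the M = mT relationship for the gradients (assuming a linear model ŷ = w1·x + w2, so the per-point derivative of ½(y − ŷ)² with respect to w1 is −(y − ŷ)·x; the data values are made up):

```python
# Gradient of the total squared error vs. the mean squared error with
# respect to w1, for a linear model y_hat = w1 * x + w2.
def gradients_w1(y, y_hat, x):
    m = len(y)
    total_grad = sum(-(yi - yhi) * xi for yi, yhi, xi in zip(y, y_hat, x))
    mean_grad = total_grad / m     # exactly 1/m of the total gradient
    return total_grad, mean_grad

total_g, mean_g = gradients_w1([1.0, 2.0], [0.5, 1.0], [1.0, 2.0])
print(total_g, mean_g)             # -2.5 -1.25
print(total_g == 2 * mean_g)       # True: the factor m is absorbed by the learning rate
```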

Batch vs Stochastic Gradient Descent


At this point, it seems that we've seen two ways of doing linear regression.

 By applying the squared (or absolute) trick at every point in our data one by one, and repeating this process many times.
 By applying the squared (or absolute) trick at every point in our data all at the same time, and repeating this process many times.

More specifically, the squared (or absolute) trick, when applied to a point, gives us some values to add to the weights of the model. We can add these
values, update our weights, and then apply the squared (or absolute) trick on the next point. Or we can calculate these values for all the points, add
them, and then update the weights with the sum of these values.

The first is called stochastic gradient descent. The second is called batch gradient descent.
The question is, which one is used in practice?

Actually, in most cases, neither. Think about this: if your data is huge, both are a bit slow computationally. The best way to do linear regression is to
split your data into many small batches, each with roughly the same number of points. Then, use each batch to update your weights. This is called
mini-batch gradient descent.
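A minimal sketch of mini-batch gradient descent for the linear model above (illustrative data and hyperparameters; real libraries handle this for you):

```python
import random

def mini_batch_gd(x, y, batch_size=2, learn_rate=0.05, epochs=2000, seed=0):
    """Fit y ≈ w1*x + w2 by mini-batch gradient descent (a minimal sketch)."""
    random.seed(seed)
    w1, w2 = 0.0, 0.0
    idx = list(range(len(x)))
    for _ in range(epochs):
        random.shuffle(idx)                      # new batches each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            # Gradient of the mean squared error over just this batch.
            g1 = sum(-(y[i] - (w1 * x[i] + w2)) * x[i] for i in batch) / len(batch)
            g2 = sum(-(y[i] - (w1 * x[i] + w2)) for i in batch) / len(batch)
            w1 -= learn_rate * g1
            w2 -= learn_rate * g2
    return w1, w2

# Data generated from y = 2x + 1; the fit should land near w1 ≈ 2, w2 ≈ 1.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
print(mini_batch_gd(x, y))
```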