MACHINE LEARNING

The Process

When people think of Machine Learning, they often think of a program that is
taking in data and spitting out predictions and insights. The process of performing
Machine Learning often requires many more steps before and after the predictive
analytics.

We try to think of the Machine Learning process as:

1. Formulating a Question
2. Finding and Understanding the Data
3. Cleaning the Data and Feature Engineering
4. Choosing a Model
5. Tuning and Evaluating
6. Using the Model and Presenting Results

1. Formulating a Question

What is it that we want to find out? How will we reach the success criteria that
we set?

Let’s say we are performing machine learning for a high-traffic fast-casual
restaurant chain, and our goal is to improve the customer experience. We can
serve this goal in many ways. When we’re thinking about creating a model, we
have to narrow down to one measurable, specific task. For example, we might say
we want to predict the wait times for customers’ food orders within 2 minutes, so
that we can give them an accurate time estimate.

2. Finding and Understanding the Data

Arguably the largest chunk of time in any machine learning process is finding the
relevant data to help answer your question, and getting it into the format
necessary for performing predictive analysis.

We know that for supervised learning, we need labeled datasets, or datasets that
have clear labels of what their ground truth is. For an example like the restaurant
wait time, this would mean we would need many examples of past orders, tagged
with how long the wait time was. Maybe the restaurant already tracks this data,
but we might need to augment the data collection with a timer that starts when
the customer orders, stops when the customer receives their food, and records
that information.
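As a rough illustration, a minimal sketch of such a timer in Python might look like the following; the function names and the order log structure are hypothetical, not part of any existing system:

from datetime import datetime

order_log = []

def start_order(order_id):
    # Timer starts when the customer places the order
    order_log.append({"order_id": order_id,
                      "ordered_at": datetime.now(),
                      "served_at": None})

def complete_order(order_id):
    # Timer stops when the customer receives their food,
    # and the wait time is recorded in minutes
    for order in order_log:
        if order["order_id"] == order_id and order["served_at"] is None:
            order["served_at"] = datetime.now()
            wait = order["served_at"] - order["ordered_at"]
            order["wait_minutes"] = wait.total_seconds() / 60
            break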

Creating this system of recording data, as well as gathering enough data to be
able to train our model, will take time.

Once you have your data, you want to understand it so that you will know what
model to apply and what the outputs will mean. First, you will want to examine
the summary statistics:

- Calculate means and medians to understand the distribution
- Calculate percentiles
- Find correlations that indicate relationships

You may also want to visualize the data, perhaps using box plots to identify
outliers, histograms to show the basic structure of the data, and scatter plots to
examine relationships between variables.
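A hedged sketch of these exploration steps with pandas and Matplotlib might look like this; the orders.csv file and the wait_minutes column are assumptions for illustration:

import pandas as pd
import matplotlib.pyplot as plt

orders = pd.read_csv("orders.csv")  # hypothetical dataset of past orders

# Summary statistics: means, medians, and percentiles
print(orders["wait_minutes"].mean())
print(orders["wait_minutes"].median())
print(orders["wait_minutes"].quantile([0.25, 0.5, 0.75]))

# Correlations between the numeric columns
print(orders.select_dtypes("number").corr())

# Visualizations: histogram for structure, box plot for outliers
orders["wait_minutes"].plot(kind="hist", bins=20)
plt.show()
orders["wait_minutes"].plot(kind="box")
plt.show()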

Let’s say we’re examining the existing distribution of wait times. We see that the
overall average is 6.25 minutes per order. But we also produce this histogram:

We might glean from this that there are two main groups of orders. One group
seems to cluster around 4 minutes, while another, smaller, group seems to cluster
around 11 minutes. We could use this to modify our question and build a model that
will classify whether an order falls into the “short” timeframe or the “long”
timeframe. Is it dependent on the food that was ordered? The time of day of the
order?

Perhaps we just become aware of the bimodality of our data. If our model
consistently predicts a wait time of around 6 or 7 minutes, then we are not taking
into account the true structure of our data.

3. Cleaning the Data and Feature Engineering

Real data is messy! Data may have errors. Some columns may be empty. The
features we’re interested in might require string manipulation to extract. Cleaning
the data refers to the process by which we address missing values and outliers,
among other things that may affect our insights.

We may see that we have a group of orders that took over 20 minutes, due to an
emergency in the kitchen one afternoon. This is pushing our average wait time up,
and may skew our predictions. If we want to model the more general functioning
of the restaurant, we may want to remove these values.
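A minimal sketch of that cleaning step, assuming the same hypothetical orders DataFrame from the sketch above:

# Drop the outlier orders caused by the kitchen emergency
orders = orders[orders["wait_minutes"] <= 20]

# Drop rows where the wait time was never recorded
orders = orders.dropna(subset=["wait_minutes"])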

Feature Engineering refers to the process by which we choose the important
features (or columns) to look at, and make the appropriate transformations to
prepare our data for our model.

We might try:

- Normalizing or standardizing the data
- Augmenting the data by adding new columns
- Removing unnecessary columns

After we test our model on the data we have, we might go back and reengineer
features to see if we get a better result.
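A hedged sketch of a couple of these transformations with scikit-learn; the column names are assumptions for illustration:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Choose the columns (features) we think matter
features = orders[["quantity", "hour_of_day", "employees_working"]]

# Standardize: rescale each column to zero mean and unit variance
standardized = StandardScaler().fit_transform(features)

# Or normalize: rescale each column to the [0, 1] range
normalized = MinMaxScaler().fit_transform(features)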

4. Choosing a Model

Once we understand our dataset and know the problem we are trying to solve,
we can begin to choose a model that will help us tackle our problem.
If we are attempting to find a continuous output, like predicting the number of
minutes someone should wait for their order, we would use a regression
algorithm.

If we are attempting to classify an input, like determining if an order will take
under 5 minutes or over 10 minutes, then we would use a classification algorithm.
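As a small sketch of that distinction, both of the estimators below are real scikit-learn classes, but the framing of the problem is just our running example:

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

regressor = LinearRegression()       # continuous output: minutes until the order is ready
classifier = KNeighborsClassifier()  # discrete output: "short" vs. "long" order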

The different classification and regression algorithms work better on different
types of datasets. We use different models on categorical and numerical data,
and different models on datasets with many features and datasets with few
features. Our models also have different levels of interpretability — how easy is it
for us to see what these results mean and what led to them? When we teach the
models, we will discuss the tradeoffs of using each one.

5. Tuning and Evaluating

We often want to set a metric of success, so that we know the model we’ve
chosen is good enough. Are we looking for accuracy? Precision? Some
combination of the two? We discuss this in our lesson on Precision and Accuracy.

Each model has a variety of parameters that change how it makes decisions. We
can adjust these and compare the chosen evaluation metrics of the different
variants to find the most accurate model.

For example, let’s say we’re using a K-Nearest Neighbors regression algorithm to
solve the wait time prediction problem. This algorithm uses a parameter k, which
you will learn about in the KNN lesson. We can adjust k to get different results.

Is it ideal to compare against 3 nearest neighbors? 10? 1? We can try many
different values of k and see which one gives us the highest level of accuracy.
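A hedged sketch of such a search, assuming feature and label arrays x_train, x_test, y_train, and y_test already exist:

from sklearn.neighbors import KNeighborsRegressor

best_k, best_score = None, float("-inf")
for k in range(1, 51):
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)  # R² on held-out data
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)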
From this analysis, we would set our k to be 26, which got the highest level of
accuracy.

6. Using the Model and Presenting Results

When you achieve the level of accuracy you want on your training set, you can
use the model on the data you actually care about analyzing.

For our example, we can now start inputting new orders. The input could be an
order, with features like:

- the type of item ordered
- the quantity
- the time of day
- the number of employees working

The output would be how long the order is expected to take. This information
could be displayed to users.
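A minimal sketch of that prediction step, assuming a model has already been fitted on these four features and that the item type is encoded numerically (both assumptions for illustration):

new_order = [[2,    # type of item ordered, as a numeric code
              3,    # quantity
              12,   # time of day (hour)
              5]]   # number of employees working

estimated_wait = model.predict(new_order)
print(estimated_wait[0])  # predicted wait in minutes, shown to the customer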

An important step is being able to convey what you’ve learned and created, so
that people can use it in the future.

Sometimes you learn more about your data by looking at the model. For example,
using Multiple Linear Regression can give you insights into the importance of
each feature. We can create a feature importance graph to visualize this for those
unfamiliar with our model:
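A minimal sketch of such a graph, using the coefficients of a fitted LinearRegression model as a rough stand-in for feature importance; the feature names are assumptions:

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train)

feature_names = ["item type", "quantity", "hour of day", "employees working"]
plt.bar(feature_names, model.coef_)
plt.ylabel("coefficient")
plt.show()
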
Your Process

The process we have outlined is a fairly standard process for performing machine
learning. As you get experience going through this process on your own, with your
own problems, you will start to form your own process. The steps may not be
linear! As you clean your data, you may uncover a better question to ask. As you
tune your model, you may realize you need more data, and go back to the
collection step.

The important part is to stay curious, and to keep iterating until you find a model
that works the best!

SUPERVISED VS UNSUPERVISED

As humans, we have many different ways we learn things. The way you learned
calculus, for example, is probably not the same way you learned to stack blocks.
The way you learned the alphabet is probably wildly different from the way you
learned how to tell if objects are approaching you or going away from you. The
latter you might not even realize you learned at all!

Similarly, when we think about making programs that can learn, we have to think
about these programs learning in different ways. Two main ways that we can
approach machine learning are Supervised Learning and Unsupervised Learning.
Both are useful for different situations or kinds of data available.
Supervised Learning

Let’s imagine you’re first learning about different genres in music. Your music
teacher plays you an indie rock song and says “This is an indie rock song”. Then,
they play you a K-pop song and tell you “This is a K-pop song”. Then, they play you
a techno track and say “This is techno”. You go through many examples of these
genres.

The next time you’re listening to the radio, and you hear techno, you may think
“This is similar to the 5 techno tracks I heard in class today. This must be techno!”

Even though the teacher didn’t tell you about this techno track, they gave you
enough examples of songs that were techno, so you could recognize more
examples of it.

When we explicitly tell a program what we expect the output to be, and let it
learn the rules that produce expected outputs from given inputs, we are
performing supervised learning.

A common example of this is image classification. Often, we want to build
systems that will be able to describe a picture. To do this, we normally show a
program thousands of examples of pictures, with labels that describe them.
During this process, the program adjusts its internal parameters. Then, when we
show it a new example of a photo with an unknown description, it should be able
to produce a reasonable description of the photo.

When you complete a Captcha and identify the images that have cars, you’re
labeling images! A supervised machine learning algorithm can now use those
pictures that you’ve tagged to make its car-image predictor more accurate.

Unsupervised Learning

Let’s say you are an alien who has been observing the meals people eat. You’ve
embedded yourself into the body of an employee at a typical tech startup, and
you see people eating breakfasts, lunches, and snacks. Over the course of a
couple weeks, you surmise that for breakfast people mostly eat foods like:

- Cereals
- Bagels
- Granola bars

Lunch is usually a combination of:

- Some sort of vegetable
- Some sort of protein
- Some sort of grain

Snacks are usually a piece of fruit or a handful of nuts. No one explicitly told you
what kinds of foods go with each meal, but you learned from natural observation
and put the patterns together. In unsupervised learning, we don’t tell the
program anything about what we expect the output to be. The program itself
analyzes the data it encounters and tries to pick out patterns and group the data
in meaningful ways.

An example of this includes clustering to create segments in a business’s user
population. In this case, an unsupervised learning algorithm would probably
create groups (or clusters) based on parameters that a human may not even
consider.
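A hedged sketch of what that segmentation could look like with scikit-learn's KMeans; the users DataFrame and its columns are hypothetical:

from sklearn.cluster import KMeans

# Hypothetical usage features for each user
features = users[["visits_per_week", "average_spend", "account_age_days"]]

model = KMeans(n_clusters=3)
segments = model.fit_predict(features)  # a cluster label for each user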

Summary

We have gone over the difference between supervised and unsupervised learning:

- Supervised Learning: data is labeled and the program learns to predict the output from the input data
- Unsupervised Learning: data is unlabeled and the program learns to recognize the inherent structure in the input data
What is Scikit-Learn?

An open-source machine learning library for Python, built on NumPy, SciPy, and Matplotlib.

In this course, we will learn how to construct various machine learning algorithms
from scratch. In the real world, however, we don’t want to recreate a complex
algorithm every time we want to use it. Writing an algorithm from scratch is a
great way to understand the fundamental principles of why it works, but we may
not get the efficiency or reliability we need.

Scikit-learn is a library in Python that provides many unsupervised and supervised
learning algorithms. It’s built upon some of the technology you might already be
familiar with, like NumPy, pandas, and Matplotlib!

The functionality that scikit-learn provides includes:

- Regression, including Linear and Logistic Regression
- Classification, including K-Nearest Neighbors
- Clustering, including K-Means and K-Means++
- Model selection
- Preprocessing, including Min-Max Normalization

As you move through Codecademy’s Machine Learning content, you will become
familiar with many of these terms. You will also see scikit-learn (in Python,
sklearn) modules being used. For example:

sklearn.linear_model.LinearRegression()

is a Linear Regression model inside the linear_model module of sklearn.

The power of scikit-learn will greatly aid your creation of robust Machine Learning
programs.

As you build robust Machine Learning programs, it’s helpful to have all the sklearn
commands in one place in case you forget.

Linear Regression

Import and create the model:

from sklearn.linear_model import LinearRegression

your_model = LinearRegression()

Fit:

your_model.fit(x_training_data, y_training_data)

- .coef_: contains the coefficients
- .intercept_: contains the intercept

Predict:

predictions = your_model.predict(your_x_data)

- .score(): returns the coefficient of determination R²

Naive Bayes

Import and create the model:

from sklearn.naive_bayes import MultinomialNB

your_model = MultinomialNB()

Fit:
your_model.fit(x_training_data, y_training_data)

Predict:

# Returns a list of predicted classes - one prediction for every data point
predictions = your_model.predict(your_x_data)

# For every data point, returns a list of probabilities of each class
probabilities = your_model.predict_proba(your_x_data)

K-Nearest Neighbors

Import and create the model:

from sklearn.neighbors import KNeighborsClassifier

your_model = KNeighborsClassifier()

Fit:

your_model.fit(x_training_data, y_training_data)

Predict:

# Returns a list of predicted classes - one prediction for every data point
predictions = your_model.predict(your_x_data)

# For every data point, returns a list of probabilities of each class
probabilities = your_model.predict_proba(your_x_data)

K-Means

Import and create the model:

from sklearn.cluster import KMeans

your_model = KMeans(n_clusters=4, init='random')


- n_clusters: number of clusters to form and number of centroids to generate
- init: method for initialization
  - k-means++: K-Means++ [default]
  - random: K-Means
- random_state: the seed used by the random number generator [optional]

Fit:

your_model.fit(x_training_data)

Predict:

predictions = your_model.predict(your_x_data)

Validating the Model

Import and print accuracy, recall, precision, and F1 score:

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

print(accuracy_score(true_labels, guesses))
print(recall_score(true_labels, guesses))
print(precision_score(true_labels, guesses))
print(f1_score(true_labels, guesses))

Import and print the confusion matrix:

from sklearn.metrics import confusion_matrix

print(confusion_matrix(true_labels, guesses))

Training Sets and Test Sets

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2)

- train_size: the proportion of the dataset to include in the train split
- test_size: the proportion of the dataset to include in the test split
- random_state: the seed used by the random number generator [optional]
