

An introduction to
Machine Learning eMag Issue 50 - April 2017


Introduction to Machine Learning with Python
This series explores various topics and techniques in machine learning, arguably the most talked-about area of technology and computer science over the past several years. In this article, Michael Manapat begins with an extended case study in Python: how can we build a machine learning model to detect credit card fraud?

Practicing Machine Learning with Optimism
Using machine learning to solve real-world problems often presents challenges that weren't initially considered during the development of the machine learning method. Alyssa Frazee addresses a few examples of such issues and hopefully provides some suggestions (and inspiration) for how to overcome the challenges using straightforward analyses on the data you already have.

Anomaly Detection for Time Series Data with Deep Learning
This article introduces neural networks, including brief descriptions of feed-forward neural networks and recurrent neural networks, and describes how to build a recurrent neural network that detects anomalies in time series data. To make the discussion concrete, Tom Hanlon shows how to build a neural network using Deeplearning4j, a popular open-source deep-learning library for the JVM.

Real-World, Man-Machine Algorithms
In this article, Edwin Chen and Justin Palmer talk about the end-to-end flow of developing machine learning models: where you get training data, how you pick the ML algorithm, what you must address after your model is deployed, and so forth.

Book Review: Andrew McAfee and Erik Brynjolfsson's The Second Machine Age
Andrew McAfee and Erik Brynjolfsson begin their book The Second Machine Age with a simple question: what innovation has had the greatest impact on human history?

GENERAL FEEDBACK feedback@infoq.com
ADVERTISING sales@infoq.com
EDITORIAL editors@infoq.com

facebook.com/InfoQ  @InfoQ  google.com/+InfoQ  linkedin.com/company/infoq
Michael Manapat leads work on Stripe's machine learning products,
including Stripe Radar. Prior to Stripe, he was an
engineer at Google and a postdoctoral fellow in
and lecturer on applied mathematics at Harvard. He
received a Ph.D. in mathematics from MIT.


Machine learning has long powered many products we interact with daily, from
intelligent assistants like Apple's Siri and Google Now, to recommendation
engines like Amazon's that suggest new products to buy, to the ad ranking systems
used by Google and Facebook.

More recently, machine learning has entered the public consciousness because of
advances in deep learning: these include AlphaGo's defeat of Go grandmaster
Lee Sedol and impressive new products around image recognition and machine

While much of the press around machine learning has focused on achievements
that were not previously possible, the full range of machine learning methods
(from traditional techniques that have been around for decades to more recent
approaches with neural networks) can be deployed to solve many important (but
perhaps more prosaic) problems that businesses face. Examples of these
applications include, but are by no means limited to, fraud prevention,
time-series forecasting, and spam detection.

InfoQ has curated a series of articles for this introduction to machine learning
eMagazine covering everything from the very basics of machine learning (what
are typical classifiers and how do you measure their performance?), to production
considerations (how do you deal with changing patterns in data after you've
deployed your model?), to newer techniques in deep learning. After reading through
this series, you should be ready to start on a few machine learning experiments of
your own.

Introduction to Machine Learning with Python

Michael Manapat

This e-mag will explore various topics and techniques in machine

learning, arguably the most-talked-about area of technology and
computer science over the past several years.

Machine learning at a high level has been covered in previous InfoQ articles (see, for example, "Getting Started with Machine Learning" in the "Getting a Handle on Data Science" e-mag and series), and this article and the ones that follow it elaborate on many of the concepts and methods discussed earlier with emphasis on concrete examples and venture into some new areas, including neural networks and deep learning.

We'll begin, in this article, with an extended case study in Python: how can we build a machine-learning model to detect credit-card fraud? (While we'll use the language of fraud detection, much of what we do may apply with little modification to other classification problems, for example, ad-click prediction.) Along the way, we'll encounter many of the key ideas and terms in machine learning.

Target: Credit-card fraud
Businesses that sell products online inevitably have to deal with fraud. In a typical fraudulent transaction, the fraudster will obtain stolen credit-card numbers and use them to purchase goods online. The fraudsters will then sell those goods elsewhere at a discount, pocketing the proceeds, while the business must bear the cost of the chargeback. You can read more about the details of credit-card fraud here.

An Introduction to Machine Learning // eMag Issue 50 - Apr 2017 5

Logistic regression is appropriate for binary classification when the relationship between the input variables and the output we're trying to predict is linear or when it's important to be able to interpret the model (by, for example, isolating the impact that any one input variable has on the prediction).

Decision trees and random forests are non-linear models that can capture more complex relationships well but are less amenable to human interpretation.

It's important to assess model performance appropriately to verify that your model will perform well on data it has not seen before.

Putting a machine-learning model into production involves many considerations distinct from those in the model development process: for example, how do you compute model inputs synchronously? What information do you need to log every time you score? And how do you determine the performance of your model in production?

Let's say we're an online business that has been experiencing fraud for some time, and we'd like to use machine learning to help with the problem. More specifically, every time a transaction is made, we'd like to predict whether or not it'll turn out to be fraudulent (i.e., whether or not the authorized cardholder is making the purchase) so that we can take appropriate action. This type of machine-learning problem is known as classification as we are assigning every incoming payment to one of two classes: fraud or not-fraud.

For every historical payment, we have a Boolean that indicates whether the charge was fraudulent (fraudulent) and other attributes that we think might be indicative of fraud: for example, the payment in US dollars (amount), the country in which the card was issued (card_country), and the number of payments made with the card at our business in the past day (card_use_24h). Thus, the data we have to build our predictive model might look like the following CSV:

    fraudulent,charge_time,amount,card_country,card_use_24h
    False,2015-12-31T23:59:59Z,8396,US,1
    False,2015-12-31T23:59:59Z,2359,US,0
    False,2015-12-31T23:59:59Z,1480,US,3
    False,2015-12-31T23:59:59Z,535,US,3
    False,2015-12-31T23:59:59Z,10305,US,1
    False,2015-12-31T23:59:59Z,2783,US,0

There are two important details we're going to skip over in our discussion, but they're worth keeping in mind as they are just as important as, if not more so than, the basics of model building we're covering here.

First, there is the problem of data science in determining what features we think are indicative of fraud. In our example, we've identified the payment amount, the country in which the card was issued, and the number of times the card was used in the past day as features that we think may be useful in predicting fraud. In general, you'll need to spend a lot of time looking at data to determine what's useful and what's not.

Second, there is the problem of data infrastructure in computing the values of features: we need those values for all historical samples to train the model but we also need their real-time values as payments come in to score new transactions appropriately. It's unlikely that, before we began worrying about fraud, we were already maintaining and recording the number of card uses over 24-hour rolling windows, so if we find that that feature is useful for fraud detection, we'll need to be able to compute it both in production and in batch. Depending on the definition of the feature, this can be highly non-trivial.

predict whether These problems together are frequently referred to as feature engi-
neering and are often the most involved (and impactful) parts of indus-

or not itll turn out trial machine learning.

to be fraudulent Logistic regression

Lets start with one of the most basic possible models: a linear one. Well

(i.e., whether or attempt to find coefficients a, b, Z so that:

not the authorized For every payment, well plug in the values of amount, card_coun-

cardholder is try, and card_use_24h into the formula above, and if the probability
is greater than 0.5, well predict that the payment is fraudulent and
otherwise well predict that its legitimate.
making the Even before we discuss how to compute a, b, Z, there are two imme-
purchase) so diate problems to address:

that we can
Probability(fraud) needs to be a number between 0 and 1, but the
quantity on the right side can get arbitrarily large (in absolute value) de-
pending on the values of amount and card_use_24h (if those feature
take appropriate values are sufficiently large and one of a or b is nonzero).

action. card_country isnt a number: it takes one of a number of values (say

US, AU, GB, and so forth). Such so-called categorical features need to
be encoded appropriately before we can train our model.

Logit function
To address the first problem, instead of directly modeling p = Probability(fraud), we'll model what is known as the log-odds of fraud, so our model becomes:

    log(p/(1-p)) = a*amount + b*card_use_24h + ... + Z

If an event has probability p, its odds are p/(1-p), which is why the left side is called the "log odds" or "logit."

Given values of a, b, ..., Z, and the features, we can compute the predicted probability of fraud by inverting the function above to get:

    p = 1/(1 + e^(-(a*amount + b*card_use_24h + ... + Z)))

The probability of fraud p is a sigmoidal function of the linear function L = a*amount + b*card_use_24h + ... + Z and looks like the following:

[Figure: the sigmoid curve of p plotted against L]

Regardless of the value of the linear function, the sigmoid maps it to a number between 0 and 1, which is a legitimate probability.
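
As a minimal sketch of the relationship between the log-odds and the probability, the snippet below defines the logit and its inverse (the sigmoid) and checks that one undoes the other. The coefficient values are hypothetical, chosen only for illustration; they are not fitted values from the article.

```python
import math

def logit(p):
    """Log-odds of an event with probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(L):
    """Inverse of logit: maps any real number L to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-L))

# Hypothetical coefficients (assumptions for illustration only).
a, b, Z = 0.0001, 0.5, -3.0
amount, card_use_24h = 8396, 1

# Linear function of the features, then the implied probability of fraud.
L = a * amount + b * card_use_24h + Z
p = sigmoid(L)

print("linear function L:", L)
print("probability p:", p)
print("logit(p) recovers L:", logit(p))
```

However large or small L becomes, p stays strictly between 0 and 1, which is the property the text relies on.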

Categorical variables
To address the second problem in our list, we'll take the categorical variable card_country (which, say, takes one of N distinct values) and expand it into N-1 "dummy" variables. These new features will be Booleans of the form card_country=AU, card_country=GB, etc. We only need N-1 dummies because the Nth value is implied when the N-1 dummies are all false. For simplicity, let's say that card_country can take just one of three values here: AU, GB, or US. Then we need two dummy variables to encode it, and the model we would like to fit (i.e., find the coefficient values for) is:

    log(p/(1-p)) = a*amount + b*card_use_24h + c*(card_country=AU) + d*(card_country=GB) + Z

This type of model is known as a logistic regression.

Fitting the model
How do we determine the values of a, b, c, d, and Z? Let's start by picking random guesses for them. We can define the likelihood of this set of guesses as:

    Likelihood(guesses) = (product of p over all fraudulent samples) * (product of 1-p over all non-fraudulent samples)

That is, take every sample in our data set and compute the predicted probability of fraud p given our guesses of a, b, c, d, and Z (and the feature values for each sample) using:

    p = 1/(1 + e^(-(a*amount + b*card_use_24h + c*(card_country=AU) + d*(card_country=GB) + Z)))

For every sample that actually was fraudulent, we'd like p to be close to 1, and for every sample that was not fraudulent, we'd like p to be close to 0 (and so 1-p should be close to 1). Thus, we take the product of p over all fraudulent samples with the product of 1-p over all non-fraudulent samples to assess the accuracy of our guesses for a, b, c, d, and Z. We'd like to make the likelihood function as large as possible (i.e., as close as possible to 1). Starting with our guess, we'll iteratively tweak a, b, c, d, and Z to improve the likelihood until we find that we can no longer increase it by perturbing the coefficients. One common method for this optimization is stochastic gradient descent.
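
The likelihood described above can be sketched in a few lines of NumPy. The four toy samples, their labels, and both sets of coefficient guesses are made up for illustration; the feature columns are assumed to be amount, card_use_24h, card_country=AU, and card_country=GB, in that order.

```python
import numpy as np

def likelihood(coefs, intercept, X, y):
    """Product of p over fraudulent samples and of 1-p over the rest."""
    linear = X @ coefs + intercept
    p = 1.0 / (1.0 + np.exp(-linear))
    return np.prod(np.where(y, p, 1.0 - p))

# Toy data (assumed values): amount, card_use_24h, AU dummy, GB dummy.
X = np.array([
    [8396.0, 1.0, 0.0, 0.0],
    [2359.0, 0.0, 0.0, 0.0],
    [1480.0, 3.0, 0.0, 0.0],
    [535.0, 3.0, 1.0, 0.0],
])
y = np.array([True, False, False, False])  # made-up labels

# Guess 1: all coefficients zero, so every p is 0.5.
baseline = likelihood(np.zeros(4), 0.0, X, y)

# Guess 2: larger amounts push the probability of fraud up.
better = likelihood(np.array([0.001, 0.0, 0.0, 0.0]), -5.0, X, y)

print("baseline likelihood:", baseline)
print("second guess likelihood:", better)
```

On real data sets the product of many probabilities underflows quickly, so implementations typically maximize the sum of log-probabilities (the log-likelihood) instead; the maximizing coefficients are the same.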

Implementation in Python
Now we'll use some standard open-source tools in Python to put into practice the theory we've just discussed. We'll use pandas, which brings R-like data frames to Python, and scikit-learn, a popular machine-learning package. Let's say the sample data we described above is in a CSV file named data.csv; we can load the data and take a peek at it with pandas' read_csv and head functions.
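
A minimal sketch of loading the data follows. To keep the example self-contained, it reads the sample rows shown earlier from an in-memory string rather than from data.csv; with a real file, `pd.read_csv("data.csv")` would take its place.

```python
import io

import pandas as pd

# Sample rows from the CSV shown earlier, held in memory so the example
# runs without a data.csv file on disk.
csv_text = """fraudulent,charge_time,amount,card_country,card_use_24h
False,2015-12-31T23:59:59Z,8396,US,1
False,2015-12-31T23:59:59Z,2359,US,0
False,2015-12-31T23:59:59Z,1480,US,3
False,2015-12-31T23:59:59Z,535,US,3
False,2015-12-31T23:59:59Z,10305,US,1
False,2015-12-31T23:59:59Z,2783,US,0
"""

data = pd.read_csv(io.StringIO(csv_text))

# Peek at the first few rows and the inferred column types.
print(data.head())
print(data.dtypes)
```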


We can encode card_country into the appropriate dummy variables with:
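
One way to do this encoding, sketched here with `pandas.get_dummies` on a small made-up data frame, is shown below. Dropping the US column is an assumption that matches the text's choice of AU and GB as the two dummies; the rows and labels are illustrative only.

```python
import pandas as pd

# Small made-up data frame with the columns used in the article.
data = pd.DataFrame({
    "fraudulent": [False, True, False],
    "amount": [8396, 2359, 1480],
    "card_country": ["US", "AU", "GB"],
    "card_use_24h": [1, 0, 3],
})

# Expand card_country into one Boolean column per country.
encoded = pd.get_dummies(data, columns=["card_country"], prefix="card_country")

# Keep N-1 dummies: US is implied when the AU and GB dummies are both false.
encoded = encoded.drop(columns=["card_country_US"])

# Separate the target from the features, as scikit-learn expects.
y = encoded["fraudulent"]
X = encoded.drop(columns=["fraudulent"])

print(X)
```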

Now the data frame has all the data we need, dummy variables and all, to train our model. We've split up the target (the variable we're trying to predict, which in this case is fraudulent) and the features as scikit-learn takes them as different parameters.

Before proceeding with the model training, there's one more issue to discuss. We'd like our model to generalize well, i.e., it should be accurate when classifying payments that we haven't seen before and it should not just capture the idiosyncratic patterns in the payments we happen to have already seen. To make sure that we don't overfit our models to the noise in the data we have, we'll separate the data into two sets: a training set that we'll use to estimate the model parameters (a, b, c, d, and Z) and a validation set (also called a test set) that we'll use to compute metrics of model performance (see the next section on what these are). If a model overfits, it will perform well on the training set (as it will have learned the patterns in the set) but poorly on the validation set. There are other approaches to cross-validation (for example, k-fold cross-validation), but a train-test split will serve our purposes here.

We can easily split our data into training and testing sets with scikit-learn as follows:
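
A minimal sketch of such a split, using scikit-learn's `train_test_split` on synthetic data, is shown below. The feature values and labels are randomly generated stand-ins, and the `test_size` and `random_state` values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded payment data (all values made up).
rng = np.random.default_rng(0)
n = 300
country = rng.choice(["US", "AU", "GB"], size=n)
X = pd.DataFrame({
    "amount": rng.integers(100, 20000, n),
    "card_use_24h": rng.integers(0, 5, n),
    "card_country_AU": country == "AU",
    "card_country_GB": country == "GB",
})
y = pd.Series(rng.random(n) < 0.1, name="fraudulent")

# Hold out roughly one third of the samples for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

print("training samples:", len(X_train))
print("validation samples:", len(X_test))
```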


In this example, we'll use two thirds of the data to train the model and one third of
the data to validate it.

We're now ready to train the model, which at this point is a triviality:
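
The training step can be sketched as follows with scikit-learn's `LogisticRegression`. The data are synthetic, generated from made-up "true" coefficients, and `max_iter=1000` is an assumption added so the solver has room to converge; the fitted numbers are therefore not the article's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data (assumed): amount, card_use_24h, AU dummy, GB dummy.
rng = np.random.default_rng(0)
n = 500
amount = rng.uniform(1, 100, n)
card_use_24h = rng.integers(0, 5, n)
country = rng.choice(["US", "AU", "GB"], size=n)
X_train = np.column_stack(
    [amount, card_use_24h, country == "AU", country == "GB"]
).astype(float)

# Made-up generating process so that the labels depend on the features.
true_logit = 0.03 * amount + 0.8 * card_use_24h - 4.0
y_train = rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))

# Fit the model by maximizing the (regularized) likelihood.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# coef_ holds a, b, c, d (one per feature); intercept_ holds Z.
a, b, c, d = model.coef_[0]
Z = model.intercept_[0]
print("a, b, c, d:", a, b, c, d)
print("Z:", Z)

# Predicted probability of fraud for the first five samples.
print(model.predict_proba(X_train[:5])[:, 1])
```

Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, so the fitted coefficients are slightly shrunk relative to a pure maximum-likelihood fit.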

The fit function runs the fitting procedure (which maximizes the likelihood func-
tion described above), and then we can query the returned object for the values of
a, b, c, and d (in coef_) and Z (in intercept_). Our final model is:

Evaluating model performance

Once we've trained a model, we need to determine how good that model is at predicting the variable of interest (in this case, the Boolean that indicates whether the payment is believed to be fraudulent or not). Recall that we said we'd classify a payment as fraudulent if Probability(fraud) is greater than 0.5 and that we'd classify it as legitimate otherwise. Two quantities frequently used to measure performance given a model and a classification policy such as ours are:

the false-positive rate, the fraction of all legitimate charges that are incorrectly classified as fraudulent, and

the true-positive rate (also known as recall or sensitivity), the fraction of all fraudulent charges that are correctly classified as fraudulent.

While there are many measures of classifier performance, we'll focus on these two.

Ideally, the false-positive rate will be close to 0 and the true-positive rate will be close to 1. As we vary the probability threshold at which we classify a charge as fraudulent (above we said it was 0.5, but we can choose any value between 0 and 1: low values mean we're more aggressive in labeling payments as fraudulent and high values mean we're more conservative), the false-positive rate and true-positive rate trace out a curve that depends on how good our model is. This is known as the ROC curve and can be computed easily with scikit-learn:
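
A minimal sketch of the ROC computation with `sklearn.metrics.roc_curve` follows. The eight labels and scores are made up for illustration; in the article's setting they would be the validation-set labels and the model's predicted fraud probabilities.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up validation labels (1 = fraud) and predicted fraud probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.55, 0.9])

# Each index i gives the false- and true-positive rates obtained when
# charges with a score of at least thresholds[i] are labeled fraudulent.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Inspect one point on the curve.
i = len(thresholds) // 2
print("threshold:", thresholds[i])
print("false-positive rate:", fpr[i])
print("true-positive rate:", tpr[i])
```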


The variables fpr, tpr, and thresholds contain the data for the full ROC curve, but we've picked a sample point here: if we say a charge is fraudulent if Probability(fraud) is greater than 0.514, then the false-positive rate is 0.374 and the true-positive rate is 0.681. The whole ROC curve and the point we picked out are depicted in the graph below.

[Figure: ROC curve for the model (blue line), with the selected point marked]

The better a model is, the closer the ROC curve (the blue line above) will hug the left and top borders of the graph (the model would achieve perfect performance in the top left corner). Note that the ROC curve tells us how good our model is, and this can be captured with a single number: the area under the curve (AUC). The closer the AUC is to 1, the more accurate the model. The AUC score for our current model is:
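
The AUC can be computed with `sklearn.metrics.roc_auc_score`, sketched here on made-up labels and scores rather than the article's model output, so the number printed is not the article's AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up validation labels (1 = fraud) and predicted fraud probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.55, 0.9])

# AUC equals the fraction of (fraud, non-fraud) pairs in which the
# fraudulent charge receives the higher score: here 14 of 16 pairs.
auc_value = roc_auc_score(y_true, y_score)
print("AUC:", auc_value)
```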

Of course, when we put a model into production to take an action, we generally need to act on the model-outputted probabilities by comparing them to a threshold as we did above, saying that a charge is predicted to be fraudulent if Probability(fraud) > 0.5. Thus, the performance of our model for a specific application corresponds to a point on the ROC curve: the curve just controls the tradeoff between false-positive rate and true-positive rate, i.e., the policy options we have at our disposal.


Decision trees and random forests
The model above, a logistic regression, is an example of a linear machine-learning model. Imagine that every sample payment we have is a point in space whose coordinates are the values of features. If we had just two features, each sample point would be a point in the x-y plane. A linear model like logistic regression will generally perform well if we can separate the fraudulent samples from the non-fraudulent samples with a linear function: in the two-feature case, that just means that almost all the fraudulent samples lie on one side of a line and almost all the non-fraudulent samples lie on the other side of that line.

It's often the case that the relationship between predictive features and the target variable we're trying to predict is nonlinear, in which case we should use a nonlinear model to capture the relationship. One powerful and intuitive type of nonlinear model is a decision tree like the following:

[Figure: example decision tree, showing the feature threshold and gini value at each node]

At each node, we compare the value of a specified feature to some threshold and branch either to the left or the right depending on the output of the comparison. We continue in this manner (like a game of 20 Questions, though trees do not need to be 20 levels deep) until we reach a leaf of the tree. The leaf consists of all the samples in our training set for which the comparisons at each node satisfied the path we took down the tree, and the fraction of samples in the leaf that are fraudulent is the predicted probability of fraud that the model reports. When we have a new sample to be classified, we generate its features and play the 20 Questions game until we reach a leaf, which contains a predicted probability of fraud we can assign to that transaction.

In brief, we create a decision tree by selecting a feature and threshold at each node to maximize some notion of information gain or discriminatory power (the "gini" shown in the figure above) and proceed recursively until we hit some pre-specified stopping criterion. While we won't go further into the details of producing the decision tree, training such a model with scikit-learn is as easy as training a logistic regression (or any other model, in fact):
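
As a sketch of how little changes between model types, the snippet below trains a `DecisionTreeClassifier` and, through the same interface, a `RandomForestClassifier` (the ensemble model the text turns to next). The data are synthetic and follow a made-up nonlinear rule; `max_depth`, `n_estimators`, and `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed): two features, amount and card_use_24h.
rng = np.random.default_rng(0)
n = 600
amount = rng.uniform(1, 100, n)
card_use_24h = rng.integers(0, 5, n)
X = np.column_stack([amount, card_use_24h]).astype(float)

# Made-up nonlinear rule: fraud is concentrated in mid-sized payments on
# heavily used cards; 5% of the labels are flipped to simulate noise.
rule = (amount > 30) & (amount < 70) & (card_use_24h >= 2)
y = rule ^ (rng.random(n) < 0.05)

# Same fit / predict_proba interface as the logistic regression.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
forest = RandomForestClassifier(
    n_estimators=100, max_depth=4, random_state=0
).fit(X, y)

# Predicted probability of fraud for the first five samples.
print("tree:  ", tree.predict_proba(X[:5])[:, 1])
print("forest:", forest.predict_proba(X[:5])[:, 1])
```

Limiting `max_depth` is one simple stopping criterion; a tree grown without such a limit can keep splitting until every leaf holds a single training sample.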


One issue with decision trees is that they can easily be overfit. A very deep tree in which each leaf has just one sample from the training data will often capture noise pertinent to each sample and not general trends, but random-forest models can help address this. In a random forest, we train a large number of decision trees, but each tree is trained on just a subset of the data we have available, and when building each tree, we only consider a subset of features for splitting. The predicted probability of fraud is then simply the average of the probabilities produced by all the trees in the forest. Training each tree on just a subset of the data, and only considering a subset of the features as split candidates at each node, reduces the correlation between the trees and makes overfitting less likely.

To summarize, linear models like logistic regressions are appropriate when the relationship between the features and the target variable is linear or when we'd like to be able to isolate the impact that any given feature has on the prediction (as this can be directly read off the regression coefficient). On the other hand, nonlinear models like decision trees and random forests are harder to interpret but can capture more complex relationships.

Productionizing machine learning
Training a machine-learning model is just one step in the process of using machine learning to solve a business problem. As described above, model training generally must be preceded by the work of feature engineering. And once we have a model, we need to put it into production to take appropriate actions (by blocking payments assessed to be fraudulent, for example). Some call this "productionizing" it.

While we won't go into detail here, productionization can involve a number of challenges: for instance, we may use Python for model development while our production stack is in Ruby. If that is the case, we'll either need to port our model to Ruby by serializing it in some format from Python and having our production Ruby code load the serialization or use a service-oriented architecture with service calls from Ruby to Python.

We'll also want to maintain our model's performance metrics in production (as distinct from metrics as computed on the validation data). Depending on how we use our model, this can be difficult because the mere act of using the model to dictate actions can result in not having the data to compute these metrics. Other articles in this series will consider some of these problems.

Supporting materials
A Jupyter notebook with all the code examples above can be found here, and sample data for model training can be found here.


Practicing Machine Learning with Optimism

Alyssa Frazee is a machine-learning engineer at Stripe, where she builds models to detect fraud
in online credit-card payments. Before Stripe, she did a Ph.D. in biostatistics and fell in love with
programming at the Recurse Center. Find her on Twitter at @acfrazee.

Using machine learning to solve real-world problems often presents challenges that weren't initially considered during the development of the machine-learning (ML) method, but encountering challenges from our very own application is part of the joy of being a practitioner!

You can use simulations to determine your confidence around estimates of a given metric.

When you use your machine-learning model to take actions that affect outcomes in the world, you need to have a system for counterfactual evaluation.

You can generate explanations for black-box model decisions, and those explanations can help with model interpretation and debugging (even if they're rudimentary).

This article will address a few examples of such issues and hopefully will provide some suggestions (and inspiration) for how to overcome the challenges using straightforward analyses on the data we already have.

Perhaps we'd like to quantify the uncertainty around one of our business metrics. Unfortunately, adding error bars around any metric more complicated than an average can be daunting. Reasonable formulas for the width of the error bars often assume that our data points are independent, which is almost never true in any business: for example, we might have multiple data points per customer or customers connected to each other on a social network. Another common assumption is that our business metric is normally distributed across users, which often fails with super-users or a large number of inactive users. But never fear: simulations and non-parametric methods can almost always be used to create error bars, and all we need is a few lines of code and some computing power.

Or perhaps we're using a binary classifier in production: for example, we may be deciding whether or not to show a website visitor a specific advertisement or whether or not to decline a credit-card transaction due to fraud risk. A classifier that results in action being taken can actually become its own adversary by stopping us from observing the outcome for observations in one of the classes: we never get to see whether a website visitor would have clicked an ad if we don't show it and we never get to see if a credit-card charge was actually fraudulent unless we process it, since we're missing the data to evaluate. Luckily, there are statistical methods for addressing this.

Finally, we may be using a black-box model: a model that makes accurate, fast predictions that computers easily understand but that aren't designed to be examined post hoc by a human (random forests are a canonical example). Do our users want understandable explanations for decisions that the model made? Simple modeling techniques can handle that problem too.

One of my favorite things about being a statistician-turned-ML-practitioner is the optimism of the field. It feels strange to highlight optimism in fields concerned with data analysis: statisticians have a bit of a reputation for being party poopers when they point out to collaborators flaws in experimental designs, violations of model assumptions, or issues arising because of missing data. But the optimism I've seen derives from the fact that ML practitioners have been doing their very best to develop techniques for overcoming these sorts of problems. We can correct expensive but badly designed biology experiments after the fact. We can build regression models even if our data is correlated in surprising or unquantifiable ways that rule out standard linear regression. We can empirically estimate what could have been if we had missing data.

I mention these examples because they (and countless others like them) have led me to believe that you can solve most data problems with relatively simple techniques. I'm loath to give up on answering an empirical machine learning question just because, at first glance, our data set isn't quite textbook. What follows are a few examples of ML problems that at one point seemed insurmountable but that can be tackled with some straightforward solutions.

Problem 1: Your model becomes its own adversary
Adversarial machine learning is a fascinating subfield of ML that deals with model-building within a system whose data changes over time due to an external adversary, i.e., someone trying to exploit weaknesses in the current model or someone who benefits from the model making a mistake. Fraud and security are two huge application areas in adversarial ML.

I work on machine learning at Stripe, a company that builds payments infrastructure for the Internet. Specifically, I build ML models to automatically detect and block fraudulent payments across our platform. My team aims to decline charges being made without the consent of the cardholder. We identify fraud using disputes: cardholders file disputes against businesses where their cards are used without their authorization.

In this scenario, our obvious adversaries are fraudsters: people trying to charge stolen credit-card numbers for financial gain. Intelligent fraudsters are generally aware that banks and
payment processors have models in place to block fraudulent transactions, so they're constantly looking for ways to get around them, and we strive to keep our models current in order to get ahead of bad actors.

However, a more subtle adversary is the model itself: once we launch a model in production, standard evaluation metrics for binary classifiers (like precision and recall, described in the first article of this series) can become impossible to calculate. If we block a credit-card charge, the charge never happens and so we can't determine if it would have been fraudulent. This means we can't estimate model performance. Any increase in observed fraud rate could theoretically be chalked up to an increase in inbound fraud rather than degradation in model performance; we can't know without outcome data. The model is its own adversary, in a loose sense, since it works against model improvements by obscuring performance metrics and depleting the supply of training data. This can also be thought of as an unfortunate "missing data" problem: we're missing the outcomes for all of the charges the model blocks. Other ML applications suffer from the same issue: for example, in advertising, it's impossible to see whether a certain visitor to a website would have clicked an ad if it never gets shown to that visitor (based on a model's predicted click probability for that user).

Having labeled training data and model performance metrics is business critical, so we developed a relatively simple approach to work around the issue: we let through a tiny, randomly chosen sample of the charges our models ordinarily would have blocked and see what happens, i.e., observe whether or not the charge is fraudulent, and fill in some of our missing data. The probability of reversing a model's decision to block a transaction is dependent on how confident the model is that the charge is fraudulent. Charges the model is less certain about have higher probabilities of being reversed; charges the model gives very high fraud probabilities are approximately never reversed. The reversal probabilities are recorded.

We can then use a statistical technique called inverse probability weighting to reconstruct a fully labeled data set of charges. The idea behind inverse probability weighting is that a charge whose outcome we know because a model's block decision was reversed with a probability of 5% represents 20 charges: itself, plus 19 other charges like it whose block decisions the model didn't reverse. So we essentially create a data set containing 20 copies of that charge. From there, we can calculate all the usual binary classifier metrics for our model: precision, recall, false-positive rate, etc. We can also estimate things like incoming fraudulent volume and create weighted training data sets for new, improved models.

Here, we first took advantage of our ability to change the way the underlying system works: we don't control who makes payments on Stripe but we're able to be creative with what happens after the payment in order to get the data we need to improve fraud detection. Our reversal probabilities, which varied with our model's certainty, reflected the business requirements of this solution: we should almost always block charges we know to be fraudulent in the interest of doing what's best for the businesses that depend on us for their payments. And even though the best business solution is not the ideal solution for data analysis, we made use of a classic statistical method to correct for that. Keeping smart system modifications in mind and remembering that we can often adjust our post hoc analyses were both key insights to solving this problem.

Problem 2: Error-bar calculations seem impossible
Determining the margin of error on any estimate is (a) very important, since the certainty in your estimate can very much affect how we act on that infor-

I'll illustrate this challenge with a real example from the previous section: estimating the recall of our credit-card-fraud model using the inverse-probability-weighted data. We'd like to know what percentage of incoming fraud our existing production model blocks. Let's assume we have a data frame, df, with four columns: the charge id; a Boolean fraud that indicates whether or not the charge was actually fraudulent; a Boolean predicted_fraud that indicates whether or not our model classified the charge as fraudulent; and weight, the probability that we observed the charge's outcome. The formula for model recall (in pseudo-code) is then:

    recall=((df.fraud&df.pre-

techniques we can use to get around this problem.

Computational methods for error-bar estimation work in virtually any scenario and have almost no assumptions that go along with them. My favorite, and the one I'm going to talk about now, is the bootstrap, invented by Brad Efron in 1979. Efron proved that confidence intervals calculated this way have all the mathematical properties you'd expect from a confidence interval. The main disadvantage to methods like the bootstrap is that they're computationally intensive, but it's 2017 and computing power is cheap, and what made this sometimes unusable in 1979 is basically a non-issue today.
mation later, and (b) often ter- dicted_fraud).toint*df. Bootstrapping involves estimat-
rifyingly challenging. Standard weight)/ (df.fraud*df. ing variation in our observed
error formulas can only get us so weight) data set using sampling: we take
far; once we try to put error bars a large number of samples with
around any quantity that isnt an (Note that & is an element-wise replacement from the original
average, things quickly get com- logical on the df.fraud and data set, each with the same
plicated. Many standard error df.predicted_fraud vectors, number of observations as in the
formulas also require some esti- and * is a vector dot product.) original data set. We then calcu-
mate of correlation or covariance There isnt a known closed-form late our estimated metric (recall)
a quantification of how the solution for calculating a confi- on each of those bootstrap sam-
data points going into the calcu- dence interval (i.e., calculating ples. The 2.5th percentile and the
lation are related to each other the widths of the error bars) 97.5th percentile of those esti-
or an assumption that those around an estimator like that. mated recalls are then our lower
data points are independent. Luckily, there are straightforward and upper bounds for a 95% con-

001 from numpy import percentile

002 from numpy.random import randint
004 def recall(df):
005 return ((df.fraud & df.predicted_fraud).toint * df.weight) / (df.fraud *
007 n = len(df)
008 num_bootstrap_samples = 10000
009 bootstrapped_recalls = []
010 for _ in xrange(num_bootstrap_samples):
011 sampled_data = df.iloc[randint(0, n, size=n)]
012 est_recall = recall(sampled_data)
013 boostrapped_recalls.append(est_recall)
015 ci_lower = percentile(bootstrapped_results, 2.5)
016 ci_upper = percentile(bootstrapped_results, 97.5)

18 An Introduction to Machine Learning // eMag Issue 50 - Apr 2017

fidence interval for the true recall.
Heres the algorithm in Python,
classification accuracy. Each item
is run through all of the trees, and For many
assuming df is the same data the individual tree decisions are
frame as in the example below. averaged (i.e., the trees vote) to
get a final prediction. Random
With techniques like the boot-
strap and the jackknife, error bars
forests are flexible, perform well,
and quickly evaluate in real-time, ML systems,
can almost always be estimated. so theyre a common choice for
They might have to be done in
batch rather than in real time, but
production ML models. But its
very hard to translate several sets
we can basically always calculate
an accurate measure of uncer-
of splits plus tree votes into an
intuitive explanation for the final individual decisions
tainty! prediction.

It turns out that research has

made by the model
Problem 3: A human
must interpret a black-
looked at this problem. Even
though it likely wont be part of is crucial.
box models decisions an introductory ML course, the
ML models are commonly de- body of knowledge is out there.
scribed as magic or black boxes But perhaps more importantly,
the important thing is what this problem exemplifies the les-
goes into them and what comes son that a simple solution can
out, not necessarily how that out- work really well. Again, this is
put is calculated. Sometimes this 2017 and raw computing power
is what we want: many consum- is abundant and cheap. One way
ers dont really need to see inside to get a rudimentary explanation
the sausage factory, as they say for a black-box model is to write
they just want a tasty tubular a simulation: vary one feature at
treat. But other times, a predic- a time across its domain, and see
tion from a box full of magic isnt how the prediction changes or
satisfying. For many production maybe change the values of two
ML systems, understanding in- covariates at a time and see how
dividual decisions made by the the predictions change. Another
model is crucial: at Stripe, we re- approach (one we used for a while
cently made our machine learn- here at Stripe) is to recalculate the
ing model decisions visible to the predicted outcome probability by
online businesses we support, treating each feature in turn as
which means business owners missing (and non-deterministi-
can understand what factors led cally traversing both paths when-
to our models decisions to de- ever there is a split acting on the
cline or allow a charge. omitted feature); the features that
change the predicted outcome
As noted in the introduction, ran- probabilities the most can be
dom forests are a canonical ex- considered the most important.
ample of a black-box model, and This produced some confusing
theyre at the core of Stripes fraud explanations, but worked reason-
models. The basic idea behind ably well until we were able to im-
a random forest is that it is com- plement a more formal solution,
posed of a set of decision trees. which we now use for explana-
The trees are constructed by find- tions to our customer businesses.
ing the splits or questions (e.g.,
Does the country this credit card With each solution we imple-
was issued in match the country mented, we anecdotally experi-
of the IP address its being used enced a marked improvement
from?) that optimize some clas- in overall understanding of why
sification criterion, e.g., overall specific decisions were made.


Having explanations available was also a great debugging tool for identifying specific areas where our models were making systematic mistakes. Being able to solve this problem again highlights how we have the tools (straightforward math and a few computers) to solve business-critical ML problems, and even simple solutions were better than none. Having a "this is solvable" mindset helped us implement a useful system, and it's given me optimism about our ability to get what we need from the data we have.
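The one-feature-at-a-time simulation described above can be sketched in a few lines. This is only an illustration under stated assumptions: toy_fraud_score is a hypothetical stand-in for a black-box model, and the feature names, baseline charge, and value ranges are invented for the example; none of them come from Stripe's systems.

```python
def toy_fraud_score(charge):
    """Hypothetical black-box model: returns a fraud probability in [0, 1]."""
    score = 0.05
    if charge["country_mismatch"]:
        score += 0.40
    if charge["amount"] > 500:
        score += 0.25
    if charge["card_age_days"] < 7:
        score += 0.10
    return min(score, 1.0)


def one_at_a_time_sensitivity(predict, baseline, domains):
    """Sweep each feature across its domain while holding the others at
    their baseline values, and record how far the prediction moves."""
    spreads = {}
    for feature, values in domains.items():
        predictions = []
        for value in values:
            probe = dict(baseline)  # copy so the baseline is never mutated
            probe[feature] = value
            predictions.append(predict(probe))
        spreads[feature] = max(predictions) - min(predictions)
    return spreads


# Assumed baseline charge and candidate values for each feature.
baseline = {"country_mismatch": False, "amount": 120, "card_age_days": 400}
domains = {
    "country_mismatch": [False, True],
    "amount": [10, 120, 600, 2000],
    "card_age_days": [1, 30, 400],
}

spreads = one_at_a_time_sensitivity(toy_fraud_score, baseline, domains)
# Features ordered from most to least influential for this baseline charge.
ranked = sorted(spreads, key=spreads.get, reverse=True)
```

Because each feature is varied in isolation, a sketch like this misses interactions between features, which is one reason the text also suggests changing two covariates at a time.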

Other reasons I'm optimistic

I remain hopeful about being able to solve important problems with data. In addition to the examples in the introduction and the three problems outlined above, several other small, simple, clever ways we can solve problems came to mind:

Unwieldy data sets: If a data set is too large to fit in memory or computations are unreasonably slow because of the amount of data, down-sample. Many questions can be reasonably answered using a sample of the data set (and most statistical techniques were developed for random samples anyway).

Lack of rare events in sampled data sets: For example, I often get random samples of charges that don't contain any fraud, since fraud is a rare event. A strategy here is to take all of the fraud in the original data set, down-sample the non-fraud, and use sample weights (similar to the inverse-probability weighting discussed above) in the final analysis.

Beautiful, clever computational tricks to calculate computationally intensive quantities: Exponentially weighted moving averages are a nice example here. Moving averages are notoriously hard to compute (since we have to keep all of the data points in the window of interest), but exponentially weighted moving averages get at the same idea and use aggregates, so they are much faster. HyperLogLogs are a lovely approximation of all-time counts without needing to scan the entire data set, and HLLSeries are their really cool counterpart for counts in a specific window. These strategies are all approximations, but ML is an approximation anyway.
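The rare-event strategy from the list above (keep all of the fraud, down-sample the non-fraud, and weight the result) might look roughly like this. It is a sketch on synthetic data; the function names, the 5% keep rate, and the 1% fraud rate are assumptions chosen for illustration.

```python
import random


def downsample_with_weights(charges, keep_rate, seed=0):
    """Keep every fraudulent charge (weight 1) and a random fraction of the
    non-fraudulent ones (weight 1 / keep_rate, the inverse of the
    probability that the charge was kept)."""
    rng = random.Random(seed)
    sample = []
    for charge in charges:
        if charge["fraud"]:
            sample.append({**charge, "weight": 1.0})
        elif rng.random() < keep_rate:
            sample.append({**charge, "weight": 1.0 / keep_rate})
    return sample


def weighted_fraud_rate(sample):
    """Estimate the fraud rate of the full data set from the weighted sample."""
    total = sum(c["weight"] for c in sample)
    fraud = sum(c["weight"] for c in sample if c["fraud"])
    return fraud / total


# Synthetic data: 10,000 charges, 1% of them fraudulent.
charges = [{"id": i, "fraud": i % 100 == 0} for i in range(10_000)]
sample = downsample_with_weights(charges, keep_rate=0.05)
estimate = weighted_fraud_rate(sample)
```

The sample is a small fraction of the original data yet still contains every fraudulent charge, and the weights keep the estimated fraud rate close to the true 1%.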

These are just a few data-driven ways to overcome the everyday challenges of practical machine learning.
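As one concrete instance of the computational tricks mentioned above, an exponentially weighted moving average needs only a single running aggregate rather than the whole window of data points. The sketch below assumes a smoothing factor alpha between 0 and 1 and seeds the average with the first observation; both choices are illustrative conventions rather than anything prescribed by the article.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average that keeps only one number
    of state (the current average) instead of a window of data points."""
    average = None
    history = []
    for value in values:
        if average is None:
            average = value  # seed the average with the first observation
        else:
            average = alpha * value + (1 - alpha) * average
        history.append(average)
    return history


# The average drifts toward the new level after the jump from 10 to 20.
smoothed = ewma([10, 10, 10, 20, 20, 20], alpha=0.5)
```

A larger alpha makes the average react faster to recent values; a smaller alpha gives a smoother, slower-moving estimate.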

Anomaly Detection for Time-Series Data with Deep Learning

Tom Hanlon is currently at Skymind, where he is developing a training program for Deeplearning4J. The consistent thread in Tom's career has been data, from MySQL to Hadoop and now neural networks.

The increasing accuracy of deep neural networks for solving problems such as speech and image recognition has stoked attention and research devoted to deep learning and AI more generally. But widening popularity has also resulted in confusion.

Neural nets are a type of machine learning model that mimic biological neurons: data comes in through an input layer and flows through nodes with various activation thresholds.

Recurrent neural networks are a type of neural net that maintain internal memory of the inputs they've seen before, so they can learn about time-dependent structures in streams of data.

This article introduces neural networks, including brief descriptions of feed-forward neural networks and recurrent neural networks, and describes how to build a recurrent neural network that detects anomalies in time-series data. To make our discussion concrete, we'll show how to build a neural network using Deeplearning4j, a popular open-source deep-learning library for the JVM.

What are neural networks?
Artificial neural networks are algorithms initially conceived to emulate biological neurons. The analogy, however, is a loose one. The features of biological neurons that artificial neural networks mirror include connections between the nodes and an activation threshold, or trigger, for each neuron to fire.

By building a system of connected artificial neurons, we obtain systems we can train to learn higher-level patterns in data and to perform useful functions such as regression, classification, clustering, and prediction.

The comparison to biological neurons only goes so far. An artificial neural network is a collection of compute nodes. We pass data, represented as a numeric array, into a network's input layer and the data proceeds through the network's so-called hidden layers until the network generates an output or decision about the data. We then compare the net's resulting output to expected results (ground-truth labels applied to the data, for example) and use the difference between the network's guess and the right answer to incrementally correct the activation thresholds of the net's nodes. As we repeat this process, the net's outputs converge on the expected results.

A whole neural network of many nodes can run on a single machine. It is important to note, for those coming from distributed systems, that a neural network is not necessarily a distributed system of multiple machines. "Node", here, means a place where computation occurs.

Training process
To build a neural network, we need a basic understanding of the training process and how the net generates output. While we won't go deep into the equations, a brief description follows.

The net's input nodes receive a numeric array, perhaps a multidimensional array called a tensor, that represents the input data. For example, each pixel in an image may be represented by a scalar that is then fed to a node. That input data passes through the coefficients, or parameters, of the net and, through multiplication, those coefficients will amplify or mute the input, depending on its learned importance, i.e., whether or not that pixel should affect the net's decision about the entire input.

Initially, the coefficients are random; i.e., the network is created knowing nothing about the structure of the data. The activation function of each node determines the output of that node given an input or set of inputs. So the node either fires or does not, depending on whether or not the strength of the stimulus it receives, the product of the input and the coefficient, surpasses the threshold of activation.

In a so-called dense or fully connected layer, the output of each node passes to all nodes of the subsequent layer. This continues through all hidden dense layers, ending with the output layer, where the network reaches a decision about the input. At the output layer, the net's decision about the input is evaluated against the expected decision (e.g., do the pixels in this image represent a cat or a dog?). The error is calculated by comparing the net's guess to the true answer given by the ground-truth labels, and that error is used to update the coefficients of the network in order to change how the net assigns importance to different pixels in the image. The goal is to decrease the error between generated and expected outputs, to correctly label a dog as a dog.

While deep learning is a complicated process involving matrix algebra, derivatives, probability, and intensive hardware utilization as large matrices of coefficients are modified, the end user does not need to be exposed to all the complexity.

There are, however, some basic parameters that we should be aware of to help understand how neural networks function. These are the activation function, optimization algorithm, and objective function (also known as the loss, cost, or error function).

The activation function determines whether and to what extent a signal should be sent to connected nodes. A frequently used activation is just a basic step function that is 0 if its input is less than some threshold and 1 if its input exceeds the threshold. A node with a step-function activation function thus sends either a 0 or a 1 to connected nodes. The optimization algorithm determines how the network learns or, more accurately, how weights are modified after determining the error. The most commonly used optimization algorithm is stochastic gradient descent. Finally, a cost function is a measure of error, which gauges how well the neural network performed when making decisions about a given training sample, compared to the expected output.

Open-source frameworks such as Keras for Python or Deeplearning4j for the JVM make it fairly easy to get started building neural networks. Deciding which network architecture to use often involves matching our data type to a known solved problem and then modifying an existing architecture to suit our use case.

Types of neural networks and their applications
Neural networks have been known and used for many decades. But a number of important technological trends have recently made deep neural nets much more effective.

Computing power has increased with the advent of GPUs to increase the speed of the matrix operations as well as with larger distributed-computing frameworks, making it possible to train neural nets faster and iterate quickly through many combinations of hyperparameters to find the right architecture.

Larger data sets are being generated, and large, high-quality, labeled data sets such as ImageNet already exist. As a rule, the more data a machine-learning algorithm is trained on, the more accurate it will be.

Finally, advances in how we understand and build the neural-network algorithms have resulted in neural networks consistently setting new accuracy records in competitions for computer vision, speech recognition, machine translation, and many other machine-perception and goal-oriented tasks.

Although the universe of neural-network architectures is large, a few main types of networks have seen wide use.
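To make the step activation, dense layer, and cost function described above concrete, here is a minimal sketch of a single forward pass. The weights, thresholds, input values, and the choice of squared error as the cost are all assumptions picked for illustration; a real framework would learn the weights and would typically use smoother activation functions.

```python
def step(x, threshold=0.0):
    """Step activation: 1 if the input exceeds the threshold, else 0."""
    return 1 if x > threshold else 0


def dense_layer(inputs, weights, thresholds):
    """Every node sees every input: it sums its weighted inputs and fires
    only if that stimulus exceeds its activation threshold."""
    outputs = []
    for node_weights, threshold in zip(weights, thresholds):
        stimulus = sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(step(stimulus, threshold))
    return outputs


def squared_error(predicted, expected):
    """A simple cost function: sum of squared differences."""
    return sum((p - e) ** 2 for p, e in zip(predicted, expected))


pixels = [0.9, 0.1, 0.4]                         # assumed input values
weights = [[1.0, -1.0, 0.5], [-0.5, 2.0, 0.0]]   # one row of weights per node
thresholds = [0.5, 0.5]

outputs = dense_layer(pixels, weights, thresholds)
cost = squared_error(outputs, expected=[1, 1])
```

Here the first node fires and the second does not, so the cost is nonzero; training would adjust the weights to reduce it.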


Feed-forward neural networks
A feed-forward neural network has an input layer, an output layer, and one or more hidden layers. Feed-forward neural networks make good universal approximators (functions that map any input to any output) and can be used to build general-purpose models.

This type of neural network can be used for both classification and regression. For example, when using a feed-forward network for classification, the number of neurons on the output layer is equal to the number of classes. Conceptually, the output neuron that fires determines the class that the network has predicted. More accurately, each output neuron returns a probability that the record matches that class, and the class with the highest probability is chosen as the model's output classification.

The benefit of feed-forward neural networks such as multilayer perceptrons is that they are easy to use, less complicated than other types of nets, and available in a wide variety of examples.

Convolutional neural networks
Convolutional neural networks are similar to feed-forward neural nets, at least in the way that data passes through the network. Their form is roughly modeled on the visual cortex. Convolutional nets pass several filters like magnifying glasses over an underlying image. Those filters focus on feature recognition on a subset of the image, a patch or tile, and then repeat that process in a series of tiles across the image field.

Each filter is looking for a different pattern in the visual data. For example, one filter might look for a horizontal line, another might look for a diagonal line, another for a vertical. Those lines are known as features, and as the filters pass over the image, they construct feature maps that locate each kind of line each time it occurs at a different place in the image. Different objects in images (cats, 747s, masticating juicers) generate different sorts of feature maps, which can ultimately be used to classify photos. Convolutional networks have proven very useful in the field of image and video recognition (and because sound can be represented visually in the form of a spectrogram, convolutional networks are widely used for voice recognition and machine-transcription tasks as well).

Both convolutional and feed-forward network types can analyze images, but how they analyze them is different. While a convolutional neural network steps through overlapping sections of the image and trains by learning to recognize features in each section, a feed-forward network trains on the complete image. A feed-forward network trained on images that always depict a feature in a particular position or orientation may not recognize that feature when it shows up in an uncommon position, while a convolutional network would recognize it, if trained well.

Convolutional neural networks are used for tasks such as image, video, voice, and sound recognition as well as in autonomous vehicles.

This article focuses on recurrent neural networks, but convolutional neural networks have performed so well with image recognition that we should acknowledge their utility.

Recurrent neural networks
Unlike feed-forward neural networks, the hidden layer nodes of a recurrent neural network (RNN) maintain an internal state, a memory, that updates with new input fed into the network. Those nodes make decisions based both on the current input and on what has come before. RNNs can use that internal state to process relevant data in arbitrary sequences of inputs, such as time series.
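The idea of an internal state that updates with each new input can be illustrated with a single recurrent unit. This is a deliberately tiny sketch: the weights w_in and w_rec, the tanh activation, and the input sequences are assumptions for the example, and practical RNNs (such as LSTMs) use many units and learned weights.

```python
import math


def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One time step of a single-unit recurrent cell: the new state depends
    on the current input and on the previous state."""
    return math.tanh(w_in * x + w_rec * h_prev + bias)


def run_sequence(inputs, w_in=1.0, w_rec=0.5, bias=0.0):
    """Feed a sequence through the cell, carrying the state forward."""
    h = 0.0  # the memory starts empty
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states


# Both sequences end with the same input (1.0) but have different histories.
states_a = run_sequence([0.0, 0.0, 1.0])
states_b = run_sequence([1.0, 1.0, 1.0])
```

Although the final input is identical, the final states differ, because the cell's output reflects what came before as well as the current input.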


RNNs are used for handwriting recognition, speech recognition, log analysis, fraud detection, and [...]

They are best for data sets that contain a temporal dimension like logs of web or server activity, sensor data from hardware or medical devices, financial transactions, or call records. Tracking dependencies and correlations within data over many time steps requires that we know the current state and some number of previous states. Although this might be possible with a typical feed-forward network that ingests a window of events and subsequently moves that window through time, such an approach would limit us to dependencies captured by the window, and the solution would not be [...]

A better approach to track long-term dependencies over time is some sort of memory that stores significant events so that later events can be understood and classified in context. The beauty of an RNN is that the memory contained in its hidden layers learns the significance of these time-dependent features on its own over very long windows.

In what follows, we will discuss the application of recurrent networks to both character generation and network-anomaly detection. What makes an RNN useful for anomaly detection in time-series data is this ability to detect dependent features across many time steps.

Applying recurrent neural networks
Although our example will be monitoring activity on a computer network, it might be useful to start by discussing a simpler use case for RNNs. The Internet has multiple examples of using RNNs for generating text, one character at a time, after being trained on a corpus of text to predict the next letter given what's gone before. Let's take a look at the features of an RNN by looking more closely at that use case.

RNNs for character generation
RNNs can be trained to treat characters in the English language as a series of time-dependent events. The network will learn that one character frequently follows another ("e" follows "h" in "the", "he", and "she") and, as it predicts the next character in a sequence, it will train to reduce error by comparisons with actual English text.

When fed the complete works of Shakespeare, for example, an RNN can then generate impressively Shakespeare-like output: for example, "Why, Salisbury must find his flesh and thought". When fed a sufficiently large amount of Java code, an RNN will emit something that almost compiles.

Java is an interesting example because its structure includes many nested dependencies. Every parenthesis that opens will eventually close. Every open curly brace pairs with a closed curly brace down the line. These are dependencies not located immediately next to one another; the distance between multiple events can vary. Without being told about these dependent structures, an RNN will learn them.

In anomaly detection, we will be asking our neural net to learn similar, perhaps hidden or non-obvious patterns in data. Just as a character generator understands the structure of data well enough to generate a simulacrum of it, an RNN used for anomaly detection understands the structure of the data well enough to know whether what it is fed looks normal or not.

The example of character generation is useful to show that RNNs are capable of learning temporal dependencies over varying ranges of time. An RNN can use that same capability for anomaly detection in network activity logs.


Applied to text, anomaly detection might reveal grammatical errors, because grammar structures what we write. Likewise, network behavior has a structure: it follows predictable patterns that can be learned. An RNN trained on normal network activity would perceive a network intrusion to be as anomalous as a sentence without punctuation.

As an aside, the trained network does not necessarily note that certain activities happen at certain times (it does not know that a particular day is Sunday), but it does notice those more obvious temporal patterns we would be aware of, along with other connections between events that might not be apparent.

A sample project in network-anomaly detection
Suppose we wanted to detect network anomalies with the understanding that an anomaly might point to hardware failure, application failure, or an intrusion.

What our model will show us
The RNN will train on a numeric representation of network activity logs, feature vectors that translate the raw mix of text and numerical data in logs.

By feeding a large volume of network activity logs, with each log line a time step, to the RNN, the neural net will learn what normal and expected network activity looks like. When this trained network is fed new activity from the network, it will be able to classify the activity as normal and expected or anomalous.

Training a neural net to recognize expected behavior has an advantage, because it is rare to have a large volume of abnormal data or to have enough to accurately classify all abnormal behavior. We train our network on the normal data we have so that it alerts us to non-normal activity in the future. We train for the opposite where we have enough data about attacks.

We'll outline how to approach this problem using Deeplearning4j, a widely used open-source library for deep learning on the JVM. Deeplearning4j comes with a variety of tools that are useful throughout the model development process: its DataVec is a collection of tools to assist with the extract-transform-load (ETL) tasks used to prepare data for model training. Just as Sqoop helps load data into Hadoop, DataVec helps load data into neural nets by cleaning, preprocessing, normalizing, and standardizing data. It's similar to Trifacta's Wrangler but focused a bit more on binary data.

Getting started
The first stage includes typical big-data tasks and ETL. We need to gather, move, store, prepare, normalize, and vectorize the logs. We must decide on the size of the time steps. Data transformation may require significant effort, since JSON logs, text logs, and logs with inconsistent labeling patterns will have to be read and converted into a numeric array. DataVec can help transform and normalize that data. As is the norm when developing machine-learning models, the data must be split into a training set and a test (or evaluation) set.

Training the network
The net's initial training will run on the training split of the input data.

For the first training runs, we may need to adjust some hyperparameters (hyperparameters are parameters that control the configuration of the model and how it trains) so that the model actually learns from the data, and does so in a reasonable amount of time. We discuss a few hyperparameters below. As the model trains, we should look for a steady decrease in error.

There is a risk that a neural network model will overfit on the data. A model that has been trained to the point of overfitting the data set will get good scores on the training data, but will not make accurate decisions about data it has never seen before. It doesn't "generalize", in machine-learning parlance. Deeplearning4J provides regularization tools and "early stopping" that help prevent overfitting while training.

Training the neural net is the step that will take the most time and hardware. Running training on GPUs will lead to a significant decrease in training time, especially for image recognition, but additional hardware comes with additional cost, so it's important that your deep-learning framework uses hardware as efficiently as possible. Cloud services such as Azure and Amazon provide access to GPU-based instances, and neural nets can be trained on heterogeneous clusters with scalable commodity servers as well as purpose-built machines.

Productionizing the model
Deeplearning4J provides a ModelSerializer class to save a trained model. A trained model can be saved and used (i.e., deployed to production) or updated later with further training.


When performing network-anomaly detection in production, we need to serialize log files into the same format
that the model trained on and, based on the output of the neural network, we would get reports on whether the
current activity was in the range of normal expected network behavior.

Sample code
The configuration of a recurrent neural network might look something like this:

    // numLabelClasses (the number of output classes) is assumed to be defined elsewhere.
    MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(123)  // fixed seed so the random weight initialization is reproducible
        .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
        .weightInit(WeightInit.XAVIER)
        .updater(Updater.NESTEROVS).momentum(0.9)
        .learningRate(0.005)  // step size taken along the error gradient
        .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
        .gradientNormalizationThreshold(0.5)
        .list()
        // Recurrent (LSTM) hidden layer: 1 input feature, 10 hidden units.
        .layer(0, new GravesLSTM.Builder().activation("tanh").nIn(1).nOut(10).build())
        // Output layer: one probability per label class.
        .layer(1, new RnnOutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
            .activation("softmax").nIn(10).nOut(numLabelClasses).build())
        .pretrain(false).backprop(true).build();
    MultiLayerNetwork net = new MultiLayerNetwork(conf);
    net.init();

Let's describe a few important lines of this code.

    .seed(123)

This sets a random seed to initialize the neural net's weights, in order to obtain reproducible results. Typically, coefficients are initialized randomly, so to obtain consistent results while adjusting other hyperparameters, we need to set a seed so we can use the same random weights over and over as we tune and test.

    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)

This determines which optimization algorithm to use (in this case, stochastic gradient descent) to determine how to modify the weights to improve the error score. We probably won't have to modify this.

    .learningRate(0.005)

When using stochastic gradient descent, the error gradient (that is, the relation of a change in coefficients to a change in the net's error) is calculated and the weights are moved along this gradient in an attempt to move the error towards a minimum. Stochastic gradient descent gives us the direction of less error, and the learning rate determines how big of a step is taken in that direction. If the learning rate is too high, we may overshoot the error minimum; if it is too low, our training will take forever. This is a hyperparameter that we may need to adjust.
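The effect of the learning rate can be seen on a toy problem. The sketch below uses plain (non-stochastic) gradient descent on the one-dimensional error function w**2, which is an assumption made purely for illustration; the three learning rates are likewise arbitrary examples and are unrelated to the 0.005 used in the configuration above.

```python
def gradient_descent(learning_rate, steps=25, start=5.0):
    """Minimize error(w) = w**2, whose gradient is 2 * w, starting from
    `start` and taking `steps` updates against the gradient."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w -= learning_rate * gradient  # move against the gradient
    return w


too_low = gradient_descent(0.001)    # barely moves away from the start
reasonable = gradient_descent(0.1)   # ends close to the minimum at 0
too_high = gradient_descent(1.1)     # overshoots further on every step
```

With the small rate the weight has hardly moved after 25 steps, with the large rate each step overshoots the minimum and the weight diverges, and the middle value converges toward zero.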

Getting help
There is an active community of Deeplearning4J users who can be found on several support channels on Gitter.


Real-World, Man-Machine Algorithms

Edwin Chen works at Hybrid, a platform for machine learning and human labor. He used to
build machine-learning systems for Google, Twitter, Dropbox, and quantitative finance.

Justin Palmer is founder of topfunnel, software for recruiters, and works on Hybrid. He was
most recently VP of data at LendingHome and has built ML products for speech recognition and
natural language processing at Basis Technology and MITRE.

The previous articles in this eMag focused on the algorithmic part of machine learning (ML): training simple classifiers, pitfalls in classification, and the basics of neural nets. But the algorithmic part of ML is just one small part of the process of deploying a model to solve a real-world problem.

Let's talk about the end-to-end flow of developing ML models: where we get training data, how we pick the ML algorithm, what we must address after our model is deployed, and so forth.

End-to-end model deployment
There are many ML classification problems for which using log data is standard, essentially giving us labels for free. For example, ad-click-prediction models are typically trained on which ads users click on, video-recommendation systems make heavy use of which videos you've watched in the past, etc.


Key takeaways
Real-world machine learning isn't simply about training a model once. Getting training data is often a complicated problem, and we will need continuous monitoring and retraining even after the first deployment.

In order to get training data, we often need a large group of human workers to label and annotate data. But this presents a quality-control problem, which we may need statistical monitoring to detect.

Model selection and feature selection are important but are often constrained by the amount of data we have available. Even if a model or feature doesn't work now, it may work later on as we get more data.

Our users and our products will change, and the performance of our machine-learning models will change with them. We'll need to re-gather training data, reevaluate the algorithms and features we chose, and retrain our models, so try to automate these steps as much as possible.

However, even these systems need to move beyond simple click data once they reach large enough scale and sophistication; for instance, because they're heavily biased towards clicks, it can be difficult to tune the systems to show new ads and new videos to users, and so explore-exploit algorithms become necessary.

What's more, many of these systems eventually incorporate explicitly human-generated labels as well. For instance, Netflix employs over 40 people to hand-tag movies and TV shows in order to make better recommendations and generate labels like "Foreign movies featuring a strong female lead," YouTube hand-labels every ad to have better features when making ad-click predictions, and Google trains its search algorithm in part on scores that a large, internal team of dedicated raters gives to query-webpage pairs.

Suppose we're an e-commerce site like eBay or Etsy. We've started to see a lot of spammy profiles selling Viagra, drugs, and other blacklisted products, and we want to fight the problem with machine learning. But how do we do this?

1. First, we're going to need people to label training data. We can't use logs; our users aren't flagging things for us and even if they were, they're surely wildly biased (and spammers themselves would misuse the system). But gathering training data is a generally difficult problem in and of itself. We'll need hundreds of thousands of labels, requiring thousands of hours of work. Where will we get these?

2. Next, we'll need to build and deploy an actual ML algorithm. Even with ML expertise, this is a difficult and time-consuming process: how do we choose an algorithm, how do we choose the features to input into the algorithm, and how do we do this in a repeatable manner, so that we can easily experiment with different models and parameters?

3. We can't rest after we've deployed our first spam classifier. As we get new sources of users, or spammers get more creative, the types of spam appearing on our website will quickly change, so we'll need to continually rerun steps 1 and 2, which is a surprisingly difficult process to automate, especially while maintaining the accuracy levels we need.

4. Even with a working, mature ML pipeline, we're not finished. We don't want to accidentally flag and remove legitimate users, so there will always be cases near the ML decision boundary that we need a human to go and look at. But how do we build a scalable human-labor pipeline that seamlessly integrates into our ML and returns results in real time?
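
Step 4 implies a routing decision: confident predictions are handled automatically, while borderline ones go to a person. The following is a minimal Python sketch of that idea. The function name `route`, the use of a spam probability as input, and the thresholds 0.1 and 0.9 are all assumptions made for illustration; the article does not describe how such routing is actually implemented.

```python
def route(spam_probability, low=0.1, high=0.9):
    """Decide what to do with one example given a classifier's spam probability.

    Scores close to 0 or 1 are trusted. Anything in between is treated as
    being near the decision boundary and is sent to a human reviewer.
    The thresholds are illustrative assumptions.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")
    if spam_probability >= high:
        return "spam"
    if spam_probability <= low:
        return "not spam"
    return "needs human review"


if __name__ == "__main__":
    for p in (0.02, 0.55, 0.97):
        print(p, "->", route(p))
```

Tightening the two thresholds sends more examples to people and fewer to the model, which is the trade-off between labeling cost and the risk of removing legitimate users.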


At Hybrid, our platform for ML and large-scale human labor, we realized that we were building this complicated
pipeline over and over for many of our problems, so we built a way to abstract all the complexity behind a single
API call:

# Create your classifier
curl https://www.hybridml.com/api/classifiers \
  -u API_KEY: \
  -d categories="spam, not spam" \
  -d accuracy=0.99

# Start classifying
curl https://www.hybridml.com/api/classify \
  -u API_KEY: \
  -d classifier_id=ABCDEFG \
  -d text="Come buy the latest Viagra at 50% off."

Behind the scenes, the same call automatically and invisibly decides whether an ML classifier is reliable enough to classify the example on its own or whether it needs human intervention. Models get built automatically, they're continually retrained, and the caller never has to worry whether more data is needed.

In the rest of this article, we'll go into more detail on the problems we described above, problems that are common to all efforts to deploy ML to solve real-world problems.

Labels for training
In order to train any spam classifier, we first need a training set of "spam" and "not spam" labels. One way to provide these is to use our site's visitors and logs. Just add a button that allows visitors to mark profiles as spam and use the results as a training set.

However, this can be a problem for several reasons. Most of our visitors will ignore the button, so our training set is likely to be very small. It's easily gamed: spammers can simply start marking legitimate profiles as spammy. It's also likely to be biased in unknown ways (after all, plenty of people are fooled by spammy Nigerian e-mail).

Another way to come up with training data is to label a bunch of profiles ourselves. But this is almost certainly a waste of time and resources: spam probably constitutes less than 1-2% of all profiles, so we'd need hundreds of thousands of profile classifications (and thousands of hours) in order to form a reasonable training set.

What we need, then, is a large group of workers to comb through a large set of profiles and mark them as "spam" or "not spam" according to a set of instructions. Common ways to find workers to perform these types of tasks include hiring off of Craigslist or using online crowdsourcing platforms like Amazon Mechanical Turk, Crowdflower, or Hybrid.

However, the work generated by Craigslist or Mechanical Turk workers is often low quality; at Hybrid, we've often seen spam rates, where workers randomly click on labels, as high as 80-90%. So we'll need to monitor worker output for accuracy.

One common monitoring technique is to label a number of profiles as "spam" or "not spam" ourselves and randomly send them to our workers in order to see if the workers agree with our labels.

Another potential approach is to use statistical distribution tests to catch outlier workers. For example, imagine a simple image-labeling task: if most workers label 80% of the images as "cat" and 20% as "not cat", then a worker who labels only 45% of images as "cat" should probably be flagged.

One difficulty, though, is that workers deviate from each other in completely legitimate ways. For example, people may tend to upload more cat images during the day, or spammers may tend to operate during the night. In these cases, daytime workers will have higher rates of "cat" and "not spam" labels compared to those who work at night. To account for this kind of natural deviation, a more sophisticated approach is to apply non-parametric Bayesian techniques to cluster worker output, which we then measure for deviations.
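
The simple distribution check from the cat-labeling example (flagging a worker who labels 45% of images "cat" when most label 80%) can be sketched in Python as follows. The article does not name a specific test, so this sketch assumes a normal-approximation z-test on each worker's "cat" rate against the pooled rate of all other workers. The function name `flag_outlier_workers`, the worker names, and the threshold of 3 are hypothetical. It does not handle the legitimate day/night deviations described above.

```python
import math


def flag_outlier_workers(cat_counts, total_counts, z_threshold=3.0):
    """Return workers whose 'cat' rate deviates strongly from everyone else's.

    cat_counts and total_counts map worker id -> number of 'cat' labels and
    total number of labels. Each worker is compared with the pooled rate of
    all *other* workers using a z-test for a proportion.
    """
    flagged = []
    for worker, n in total_counts.items():
        others_cat = sum(c for w, c in cat_counts.items() if w != worker)
        others_total = sum(t for w, t in total_counts.items() if w != worker)
        if n == 0 or others_total == 0:
            continue  # nothing to compare
        expected = others_cat / others_total
        std_err = math.sqrt(expected * (1.0 - expected) / n)
        if std_err == 0.0:
            continue  # everyone else is unanimous; the z-score is undefined
        z = (cat_counts[worker] / n - expected) / std_err
        if abs(z) > z_threshold:
            flagged.append(worker)
    return flagged


if __name__ == "__main__":
    cats = {"ann": 80, "bob": 79, "cy": 82, "eve": 81, "fay": 78, "dee": 45}
    totals = {name: 100 for name in cats}
    print(flag_outlier_workers(cats, totals))
```

Comparing each worker with the others, rather than with an average that includes that worker, keeps a single extreme worker from shifting the baseline they are judged against.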


Model selection
Once we have enough training labels, we can start building our ML models. Which algorithm should we use? For text classification, for example, three common algorithms are naive Bayes, logistic regression, and deep neural networks.

We won't go deeply into how to choose an ML classifier, but one way to think about the difference between different algorithms is in terms of the bias/variance tradeoff: simpler models tend to perform worse than complex models with large amounts of data (they aren't powerful enough to model the problem, so they have high bias), but they can often perform better when the data is limited (complex models can easily be overfit and are sensitive to small changes in the data, so they exhibit high variance).

As a result, it's fine (often, actually, better) to start with a simpler model or fewer features if we have only a few labels, and to add more sophistication as we get more data.

This also means that we should later re-evaluate a more powerful algorithm that is less accurate early on.

We take this approach at Hybrid. Our goal is to always have the most accurate ML, whether we have 500 data points or 500,000. As a result, we automatically transition between different algorithms: we usually start with the simpler models that perform better with limited amounts of data and switch to more powerful models as more data comes in, depending on how different algorithms perform on an out-of-sample test set.

Feature selection
There are two approaches to choosing which features to use in the ML algorithm.

The more manual approach is to think of feature selection as a pre-processing step: we score each feature independent of the ML model and only keep the top N features or the features that pass some threshold. For example, a common feature-selection algorithm is to score whether the feature has a different distribution under each class (e.g., when considering whether the word "Viagra" should be kept as a feature in an e-mail spam classifier, we can compare whether "Viagra" appears significantly more often in spam vs. non-spam e-mail) and to choose the features with the greatest differences in distributions between classes.

Another, increasingly common approach is to let the ML algorithm select features by itself. For example, logistic regression models can take a regularization parameter that effectively controls whether coefficients in the model are biased towards zero. By experimenting with different values of this parameter and monitoring accuracy on a test set, the model automatically decides which features to zero out (i.e., throw away) and which features to keep.

It's also often useful to add crossed features. For example, suppose teenagers in general and Londoners in general tend to click on ads, but teenagers in London do not. An ad-click-prediction model with a "user is teenager AND user lives in London" feature would likely perform better than a model that only contains separate "teenager" and "Londoner" features. This is one of the advantages of deep neural networks: because of their fully connected, hierarchical structure, they can automatically discover feature crosses on their own, whereas models like logistic regression need feature crosses fed into them.

Adapting to changes
The final question we'll look at is how to take care of changes in the data distribution. For example, suppose we've built a spam classifier but suddenly experience a spurt of user growth in a new country or a new spammer has decided to target our website. Because these are new sources of data, our existing classifier is unlikely to be able to accurately handle these cases.

One common practice is to gather new labels on a regular and frequent basis, but this can be inefficient: how do we know how many new labels we need to gather, and what if the data distribution hasn't actually changed?

As a result, another common practice is to only gather new labels and retrain models every few months or so. But this is problematic, since quality may severely degrade in the meantime.

One solution is to randomly send examples for human labeling (e.g., less than 1% of the time once a model has reached high-enough accuracy). Doing so, we have an unbiased set of samples we can monitor accuracy against, so we can quickly detect if something has changed. You can also monitor the ML scores returned by the algorithm; if the distribution of these scores changes, this is another indication that the underlying data has changed and the models need a fresh regime of training.
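
Monitoring the distribution of ML scores can be sketched in a few lines of Python. The article does not say which comparison to use; this sketch assumes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distribution functions), and the function names and the 0.3 alert threshold are hypothetical choices for illustration.

```python
import bisect


def ks_statistic(reference, recent):
    """Two-sample Kolmogorov-Smirnov statistic for two lists of scores.

    Returns the largest gap between the empirical CDFs of the two samples:
    0.0 means the samples look identical, 1.0 means they do not overlap.
    """
    ref = sorted(reference)
    rec = sorted(recent)
    if not ref or not rec:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    for x in ref + rec:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_rec = bisect.bisect_right(rec, x) / len(rec)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_rec))
    return largest_gap


def scores_have_drifted(reference, recent, threshold=0.3):
    """Flag drift when the KS statistic exceeds an illustrative threshold."""
    return ks_statistic(reference, recent) > threshold


if __name__ == "__main__":
    last_month = [0.05, 0.10, 0.10, 0.20, 0.90, 0.95]   # made-up scores
    this_week = [0.40, 0.50, 0.55, 0.60, 0.60, 0.70]    # made-up scores
    print(ks_statistic(last_month, this_week))
    print(scores_have_drifted(last_month, this_week))
```

A check like this needs no labels, so it can run on every batch of predictions, while the randomly sampled human labels provide the slower but more direct accuracy measurement.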


Book Review: Erik Brynjolfsson and Andrew McAfee's The Second Machine Age

by Charles Humble

Erik Brynjolfsson is the director of the MIT Center for Digital Business and one of the most
cited scholars in information systems and economics. He is a cofounder of MIT's Initiative on the
Digital Economy, along with Andrew McAfee. He and McAfee are the only people named to both
the Thinkers 50 list of the world's top management thinkers and the Politico 50 group of people
transforming American politics.

Andrew McAfee is a principal research scientist at the MIT Center for Digital Business and the
author of Enterprise 2.0. He is a cofounder of MIT's Initiative on the Digital Economy, along
with Erik Brynjolfsson. He and Brynjolfsson are the only people named to both the Thinkers 50
list of the world's top management thinkers and the Politico 50 group of people transforming
American politics.

Erik Brynjolfsson and Andrew McAfee begin The Second Machine Age
with a simple question: what innovation has had the greatest impact on
human history?

Key takeaways
A combination of exponential growth in computing power and the increasing digitization of all our data is propelling the recent advances in technology (most of which are advances in machine learning).

There are no clear measures for the impact of the major technological advances of recent years; traditional measures like GDP are inadequate.

The second machine age will have increasing economic inequality as a side effect because of the winner-takes-all nature of digital markets.

Innovation is meant in the broadest sense: agriculture and the domestication of animals were innovations, as were the advent of various religions and forms of government, the printing press, and the cotton gin. But which of these changed the course of humanity the most (and how is that even determined)?

To start, Brynjolfsson and McAfee suggest population and measures of social development as approximate yardsticks. Using either of them, the arc of human history decisively moves up and to the right (as Silicon Valley startups would have all of their metrics do) starting around 1765. The authors argue that the trigger for this growth was James Watt's steam engine, a general-purpose technological innovation more than three times as efficient as its predecessors and one that essentially kicked off the Industrial Revolution.

Brynjolfsson and McAfee, researchers at the MIT Center for Digital Business who have made careers studying the impact of the Internet on business, believe that we're on the precipice of another such revolution (a second machine age) and provide some anecdotal evidence for this. These examples all have the same form: a decade ago we were frustratingly far from progress in the area and, almost overnight, the problems had been solved (generally by advances in machine learning). The work here progressed in the same way that Ernest Hemingway described how people go bankrupt in The Sun Also Rises: gradually, then suddenly.

Among the examples are self-driving cars, now completely unremarkable on the freeways of Northern California, which only a decade ago seemed out of reach. As recently as 2004, the DARPA Grand Challenge to build a car that could autonomously navigate a course in the desert ended disastrously, with all the entrants failing just a few hours in (Popular Science derided the competition as a "Debacle in the Desert"). There was also IBM's Jeopardy-winning Watson, which thoroughly demolished the two most successful human Jeopardy contestants. Watson absorbed massive amounts of information, including the entirety of Wikipedia, and was able to answer instantaneously and correctly even when the clues involved typical-for-Jeopardy puns and indirection (it correctly offered "pentathlon" as the answer to "A 1976 entree in the modern this was kicked out for wiring his epee to score points without touching his foe"). And although it was developed after the book was published, we could add DeepMind's AlphaGo, the first Go program ever to beat a professional player. In October 2015, AlphaGo defeated the reigning three-time European champion Fan Hui 5-0, and in March 2016, it defeated Lee Sedol, the top Go player in the world over the past decade, 4-1. Because Go is so combinatorially complex (on average, the number of possible moves a player can make is almost an order of magnitude more than the equivalent number in chess), it was generally believed that we were still several years away from achievements like those of AlphaGo.

Why has the progress here been so sudden in the past several years? One plausible, specific answer for many of these advances goes unmentioned: developments in neural networks and deep learning. But Brynjolfsson and McAfee focus on three higher-level explanations.

First, there's the exponential growth described by Moore's Law: transistor density doubles every 18 months. Citing Ray Kurzweil's rough rule of thumb that things meaningfully change after 32 doublings (once you're in "the second half of the chess board") and the fact that the Bureau of Economic Analysis first cited information technology as a corporate investment category in 1958, the authors peg 2006 as when Moore's Law put us into a new regime of computing (32 doublings at 18 months each is 48 years, and 1958 plus 48 is 2006).

Second, there's the trend of the digitization of everything: maps, books, speech. They're all being stored digitally in a form that's amenable to processing and analysis. For example, the navigation app Waze uses several streams of information: digitized street maps, location coordinates for cars broadcast by the app, and alerts about traffic jams, among others. It's Waze's ability to bring these streams together and make them useful for its users that causes the service to be so popular.

Digitized information is so powerful because it can be reproduced without cost and therefore can be used in innumerable applications.

Finally, the authors describe innovation as being driven by a recombination of existing technologies.

The Web itself is a pretty straightforward combination: the Internet's much older TCP/IP data-transmission network; a markup language called HTML that specified how text, pictures, and so on should be laid out; and simple software called a browser to display the results. None of these elements was particularly novel. Their combination was revolutionary.

As the Internet facilitates the availability of information and other resources, this process of recombination accelerates. Brynjolfsson and McAfee write: "Today, people with connected smartphones or tablets anywhere in the world have access to many (if not most) of the same communications resources and information that we do while sitting in our offices."

So, it seems, we're on the brink of a revolution (these mind-boggling technologies being anecdotal evidence of that), but how will that revolution manifest? A growth in population like the one that attended the Industrial Revolution is impossible, so is this a revolution just of awe and wonder, or is there a measure that captures just how fast and meaningfully these technologies are changing the world? While the authors talk about the inadequacy of traditional economic measures to capture the change (GDP in particular is the bugbear here: "When a business traveler calls home to talk to her children via Skype, that may add zero to GDP, but it's hardly worthless"), they do not offer a clear metric for at least the positive impact (or "bounty") of recent progress.

On the other hand, Brynjolfsson and McAfee do an admirable job of talking concretely about at least one of the negative impacts of all this change: economic inequality (or what they call the "spread"). "Digital technologies can replicate valuable ideas, insights, and innovations at very low cost," they write. "This creates bounty for society and wealth for innovators, but diminishes the demand for previously important types of labor, which can leave many people with reduced incomes."

To those who may argue that tax policy, the influence of the finance industry, or social norms are the source of growing inequality, the authors note that inequality in Sweden, Finland, and Germany has actually increased more rapidly over the past 20 or 30 years than it has in the U.S. Technology is the culprit here, and it has been more disruptive in recent years for two reasons.

The primary reason is that work in digital goods, machine-learning algorithms, Internet software, and so forth is not subject to capacity constraints. The best manual laborer can only sell so many hours of his or her work, leaving opportunities for the second-best laborer (though at an appropriately lower rate). On the other hand, a software programmer who writes a slightly better mapping application (one that loads a little faster, has slightly more complete data, or prettier icons) might completely dominate a market. There would be little, if any, demand for the tenth-best mapping application, even if it got the job done almost as well.

This effect is magnified by globalization, the second reason. Local leaders, who previously could safely serve their users, are now getting disrupted by global leaders: a locally produced mapping application has no advantage over Google Maps, whereas a local plumber is not in danger of competition from a better, foreign plumber.

While the book thoroughly discussed inequality as a consequence of recent advances in technology in general and artificial intelligence in particular, I felt the arguments and coverage were weaker in two areas. First, the policy recommendations were mostly quite generic (almost admittedly so, as the authors referred to them as "Econ 101" policies). These included suggestions to focus on schooling (emphasizing ideation, large-frame pattern recognition, and complex communication instead of the three Rs), to encourage startups, to support science and immigration, and to upgrade infrastructure. While these are all sound policy suggestions, they are generically good and don't specifically address the issues around new artificial intelligence. Their recommendations for the long term do try to be a little more targeted towards the employment impact of new technology, but the authors seem somewhat fatalistic: perhaps a basic income or a negative income tax, they suggest, could help all those who will be displaced.

Second, while inequality is a major issue, the authors discuss other difficult problems only in passing in the closing pages of the book. These include issues of privacy, fragility in highly coupled systems, and the possibility of the singularity and machine self-awareness.

The Second Machine Age was first published in 2014 (and issued in paperback in 2016), and it feels like it just barely missed deep learning as a framework for understanding why progress has been so significant recently and for anticipating upcoming issues and challenges. In a summary of research that now feels oddly archaic, the authors write that "innovators often take cues from biology as they're working, but it would be a mistake to think that this is always the case, or that major recent AI advances have come about because we're getting better at mimicking human thought. Current AI, in short, looks intelligent, but it's an artificial resemblance. That might change in the future."

Indeed it has, and we're beginning to see all the consequences of these changes.


