Machine Learning
Matthew Mayo
What is automated machine learning (AutoML)? Why do we need it? What
are some of the AutoML tools that are available? What does its future
hold? Read this article for answers to these and other AutoML questions.
toolkit, and its use does not actually factor into all data science tasks. For
example, if prediction is part of a given data science task, machine
learning will be a useful component; however, machine learning may not
play into a descriptive analytics task at all.
Even for predictive tasks, data science encompasses much more than the
actual predictive modeling. Data scientist Sandro Saitta, when discussing
the potential confusion between AutoML and automated data science,
had this to say:
The misconception comes from the confusion between the whole Data
Science process (see for example CRISP-DM) and the sub-tasks of
data preparation (feature extraction, etc.) and modeling (algorithm
selection, hyper-parameters tuning, etc.) which I call Machine
Learning.
[...]
When you read news about tools that automate Data Science and Data
Science competitions, people with no industry experience may be
confused and think that Data Science is only modeling and can be fully
automated.
He is absolutely correct, and it's not just a matter of semantics. If you
want (need?) more clarification on the relationship between machine
learning and data science (and several other related concepts), read this.
Further, data scientist and leading automated machine learning proponent
Randy Olson states that effective machine learning design requires us to:
1. Always tune the hyperparameters for our models
2. Always try out many different models
3. Always explore numerous feature representations for our data
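These recommendations are precisely what AutoML systems automate. For a sense of the manual loop being replaced, here is a minimal sketch of steps 1 and 2 with plain scikit-learn; the candidate models and hyperparameter grids below are arbitrary illustrative choices (not Olson's), and the feature-representation step is omitted for brevity:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=42)

# Try several model families, each with its own hyperparameter grid
candidates = [
    (KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}),
    (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1.0, 10.0]}),
]

best_score, best_model = -1.0, None
for estimator, grid in candidates:
    # Cross-validated hyperparameter tuning for this model family
    search = GridSearchCV(estimator, grid, cv=5)
    search.fit(X_train, y_train)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(type(best_model).__name__, round(best_model.score(X_test, y_test), 3))
```

An AutoML tool performs essentially this search, but over a far larger space of models, preprocessors, and hyperparameters, and with a smarter strategy than exhaustive grids.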
Taking all of the above into account, if we consider AutoML to be the tasks
of algorithm selection, hyperparameter tuning, iterative modeling, and
model assessment, we can start to define what AutoML actually is.
Enam elaborates on the difficulties of machine learning, and focuses
on the nature of algorithms (emphasis added):
An aspect of this difficulty involves building an intuition for what tool
should be leveraged to solve a problem. This requires being aware of
available algorithms and models and the trade-offs and
constraints of each one.
[...]
The difficulty is that machine learning is a fundamentally hard
debugging problem. Debugging for machine learning happens in two
cases: 1) your algorithm doesn't work or 2) your algorithm doesn't work
well enough.[...] Very rarely does an algorithm work the first time
and so this ends up being where the majority of time is spent in
building algorithms.
Enam then eloquently elaborates on this framing from the algorithm
research point of view. Again, however, what he says applies to... well,
applying algorithms. If an algorithm does not work, or does not work well
enough, then the process of choosing and refining it becomes iterative, and
this iteration exposes an opportunity for automation, hence automated
machine learning.
I have previously attempted to capture AutoML's essence as follows:
If, as Sebastian Raschka has described it, computer programming is
about automation, and machine learning is "all about automating
automation," then automated machine learning is "the automation of
automating automation." Follow me, here: programming relieves us by
managing rote tasks; machine learning allows computers to learn how
to best perform these rote tasks; automated machine learning allows
for computers to learn how to optimize the outcome of learning how to
perform these rote actions.
A more robust sample, for using Auto-sklearn with the MNIST dataset,
follows:
import autosklearn.classification
import sklearn.datasets
import sklearn.metrics
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split

digits = sklearn.datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))
Of additional note, Auto-sklearn won both the auto and the tweakathon
tracks of the ChaLearn AutoML challenge.
You can read the Auto-sklearn development team's winning blog
submission to the recent KDnuggets automated data science and
machine learning blog contest here, as well as a follow-up interview with
the developers here. Auto-sklearn is the result of research conducted at
the University of Freiburg.
The result of this run is a pipeline that achieves 98% testing accuracy,
along with the Python code for said pipeline being exported to the
tpot-mnist-pipeline.py file, shown below:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = np.recfromcsv('PATH/TO/DATA/FILE', delimiter='COLUMN_SEPARATOR')
features = tpot_data.view((np.float64, len(tpot_data.dtype.names)))
features = np.delete(features, tpot_data.dtype.names.index('class'), axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = make_pipeline(
    KNeighborsClassifier(n_neighbors=3, weights="uniform")
)

exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)
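Note that the exported script reads from a CSV path placeholder. To exercise the same pipeline without a CSV file, one can feed it the digits data directly; the following is a sketch of that substitution (using scikit-learn's load_digits), not part of TPOT's own output:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Same pipeline TPOT exported, fitted on the digits data loaded in-memory
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = make_pipeline(KNeighborsClassifier(n_neighbors=3, weights="uniform"))
pipeline.fit(X_train, y_train)
print(round(pipeline.score(X_test, y_test), 3))
```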
TPOT can be obtained via its official GitHub repo, while its documentation
is available here.
A KDnuggets article, providing an overview of both TPOT and AutoML,
written by TPOT lead developer Randy Olson, can be found here. A
followup interview with Randy is available here.
TPOT is developed at the University of Pennsylvania Institute for
Biomedical Informatics, with funding from NIH grant R01 AI117694.
Of course, these are not the only AutoML tools available. Others include
Hyperopt (Hyperopt-sklearn), Auto-WEKA, and Spearmint. I would wager
that a number of additional projects will become available over the next
few years, both of the research and industrial-strength varieties.