
Overfitting and Cross Validation

Daniel Alejandro Gonzalez Bandala


275203, Machine Learning Course.
February 7th, 2017

Abstract. This document is a brief discussion of two basic concepts in machine
learning, intended to give a better understanding of the algorithms used in this
field. Based on the scientific literature, we define the meaning and characteristics
of overfitting and cross-validation while avoiding formal mathematical definitions.

1 Introduction
This document discusses two basic concepts in machine learning; the next two
sections focus on each of them in turn: overfitting and cross-validation.

2 Overfitting
Overfitting occurs when a statistical model or machine learning algorithm
captures the noise of the data as part of the modeled behavior. In other
words, overfitting occurs when the model or the algorithm fits the data too
well. Specifically, overfitting occurs if the model or algorithm shows low bias
but high variance. Overfitting is often the result of an excessively complicated
model, and it can be prevented by fitting multiple models and using validation or
cross-validation to compare their predictive accuracies on test data (Cai, 2014).
According to Hawkins (2004), overfitting is the use of models or procedures that
violate parsimony; that is, they include more terms than necessary or use more
complex approaches than necessary. Hawkins identifies several types of overfitting:

- Using a model that is more flexible than it needs to be.
- Using a model that includes irrelevant components.
- In a feature selection problem, using models that include unneeded predictors,
  which leads to worse decisions.

He also gives some reasons why overfitting is not desirable:

- Adding irrelevant predictors can make predictions worse.
- The choice of model impacts its portability.
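As a brief illustration of the low-bias, high-variance behavior described above, the following sketch (not part of the cited papers; it assumes Python with NumPy, and all variable names are illustrative) fits polynomials of increasing degree to noisy samples of a sine curve and compares training and test errors:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # The most flexible fit tends to have the lowest training error but a
    # much larger test error: low bias, high variance, i.e. overfitting.
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The most flexible model typically attains the lowest training error yet generalizes worst, which is the overfitting pattern described by Cai (2014) and Hawkins (2004).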

3 Cross validation
Cross-validation is a technique to evaluate and compare learning algorithms when
an explicit validation set is not available. It can be defined as follows:

Cross-validation is a statistical method of evaluating and comparing learning
algorithms by dividing data into two segments: one used to learn or train a
model and the other used to validate the model (Refaeilzadeh, Tang, & Liu, 2009).

The idea behind cross-validation is to create a number of partitions of sample
observations, known as the validation sets, from the training data set. After
fitting a model to the training data, its performance is measured against each
validation set and then averaged, yielding a better assessment of how the model
will perform when asked to predict for new observations (Lucas, 2016).

Cross-validation is used to avoid overfitting of learning algorithms; the size of
the validation set and the number of partitions may vary depending on the type of
cross-validation used.

3.1 Resubstitution validation


In this kind of validation, the same training set is used for validation. This
validation process uses all the available data but suffers seriously from over-
fitting. That is, the algorithm might perform well on the available data yet
poorly on future unseen test data (Refaeilzadeh et al., 2009).
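A minimal sketch of this pitfall (assuming Python with scikit-learn; the decision tree and the synthetic data are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (50, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, 50)

# Train and evaluate on the very same observations (resubstitution).
model = DecisionTreeRegressor(random_state=0).fit(X, y)
resub_mse = mean_squared_error(y, model.predict(X))
print(f"resubstitution MSE: {resub_mse:.4f}")  # near zero, yet unseen data would fare worse
```

An unrestricted tree can memorize the training data, so the resubstitution error is close to zero even though the model would perform poorly on new observations.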

3.2 Hold-out validation


Hold-out validation is the simplest kind of cross-validation. The data set is separated into two
sets, called the training set and the testing set. The function approximator

fits a function using the training set only. Then the function approximator
is asked to predict the output values for the data in the testing set (it has
never seen these output values before). The errors it makes are accumulated
as before to give the mean absolute test set error, which is used to evaluate
the model. The advantage of this method is that it is usually preferable to the
residual (resubstitution) method and takes no longer to compute. However, its evaluation
can have a high variance. The evaluation may depend heavily on which data
points end up in the training set and which end up in the test set, and thus
the evaluation may be significantly different depending on how the division
is made (Schneider & Moore, 2000).
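A minimal hold-out sketch (assuming Python with scikit-learn; the 70/30 split, the linear model, and the synthetic data are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (100, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 100)

# Separate the data into a training set and a testing set (70/30).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # fit on the training set only
y_pred = model.predict(X_test)                     # predict outputs never seen during training
print(f"mean absolute test error: {mean_absolute_error(y_test, y_pred):.4f}")
```

Because the score depends on which points end up in the test set, repeating the split with a different random seed can give a noticeably different estimate, which is the high variance mentioned above.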

3.3 K-fold Cross-validation


This is the most common use of cross-validation. Observations are split into
K partitions, the model is trained on K - 1 partitions, and the test error is
estimated on the left-out partition k. The process is repeated for k = 1, 2, ..., K
and the results are averaged. If K = n, the process is referred to as Leave-One-Out
Cross-Validation, or LOOCV for short. This approach has low bias, is computationally
cheap, but the estimates of each fold are highly correlated (Lucas, 2016).
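A short K-fold sketch (assuming Python with scikit-learn; K = 5 and the linear model are illustrative choices, not prescribed by the cited sources):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (100, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 100)

# Each of the K folds is held out once; the model is fit on the other K - 1.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kfold, scoring="neg_mean_squared_error")
print("per-fold MSE:", -scores)
print(f"averaged cross-validation MSE: {-scores.mean():.4f}")
```

The averaged fold error is the quantity used to compare models or parameter settings.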

3.4 Leave-one-out Cross-validation


Leave-one-out cross-validation is K-fold cross-validation taken to its logical extreme, with K equal to N, the
number of data points in the set. That means that N separate times, the
function approximator is trained on all the data except for one point and a
prediction is made for that point. As before the average error is computed
and used to evaluate the model. The evaluation given by leave-one-out cross
validation error (LOO-XVE) is good, but at first pass it seems very expensive
to compute. Fortunately, locally weighted learners can make LOO predictions
just as easily as they make regular predictions. That means computing the
LOO-XVE takes no more time than computing the residual error and it is a
much better way to evaluate models (Schneider & Moore, 2000).
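A leave-one-out sketch (assuming Python with scikit-learn; here the N model fits are done naively rather than with the locally weighted shortcut the text mentions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, (30, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.1, 30)

# N separate fits: each observation is predicted by a model trained on the
# remaining N - 1 observations, and the absolute errors are averaged.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring="neg_mean_absolute_error")
print(f"LOO average absolute error: {-scores.mean():.4f}")
```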

4 Conclusion
Understanding the concept and the disadvantages of overfitting shows the importance
of choosing the right learning algorithm and tuning its parameters to their optimal
values. Cross-validation helps us to properly validate a learning algorithm while
making good use of a single data set, giving us the chance to improve the performance
of the chosen learning algorithm.

References
Cai, E. (2014). Machine learning lesson of the day: overfitting and underfitting.
Retrieved March 20, 2014, from http://www.statsblogs.com/2014/03/20/machine-learning-lesson-of-the-day-overfitting-and-underfitting/
Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information
and Computer Sciences, 44(1), 1-12.
Lucas, B. (2016). Cross-validation: estimating prediction error. Retrieved
April 29, 2016, from https://datascienceplus.com/cross-validation-estimating-prediction-error/
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. In Encyclopedia
of database systems (pp. 532-538). Springer.
Schneider, J., & Moore, A. W. (2000). A locally weighted learning tutorial using
Vizier 1.0. Carnegie Mellon University, The Robotics Institute.
