1 Introduction
This document discusses two basic concepts in machine learning: overfitting
and cross-validation. The next two sections focus on each of them in turn.
2 Overfitting
Overfitting occurs when a statistical model or machine learning algorithm
captures the noise of the data as part of the modeled behavior. In other
words, overfitting occurs when the model or the algorithm fits the data too
well. Specifically, overfitting occurs when the model or algorithm shows low
bias but high variance. Overfitting is often the result of an excessively
complicated model, and it can be prevented by fitting multiple models and
using validation or cross-validation to compare their predictive accuracies
on test data (Cai, 2014). According to Hawkins (2004), overfitting is the use
of models or procedures that violate parsimony, that is, they include more
terms than necessary or use more complex approaches than necessary. One type
of overfitting Hawkins defines is using a model that includes irrelevant
components.
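The low-bias/high-variance pattern described above can be sketched with a small synthetic experiment. The data, the noise level, and the polynomial degrees below are all illustrative assumptions, not taken from the cited works; the only guaranteed outcome is that the more complex model fits the training data at least as well, since its basis contains the simpler model's terms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a simple underlying trend: y = x + noise (an assumption).
x = np.linspace(0.0, 1.0, 30)
y = x + rng.normal(scale=0.1, size=x.size)

# Split alternate points into training and test sets.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def sse(coeffs, xs, ys):
    # Sum of squared errors of a fitted polynomial on (xs, ys).
    return float(np.sum((np.polyval(coeffs, xs) - ys) ** 2))

# A parsimonious model (degree 1) versus an excessively complex one (degree 9).
simple = np.polyfit(x_train, y_train, deg=1)
complex_model = np.polyfit(x_train, y_train, deg=9)

# The complex model fits the training data at least as well, because it can
# also chase the noise -- the low-bias side of the low-bias/high-variance trade.
train_simple = sse(simple, x_train, y_train)
train_complex = sse(complex_model, x_train, y_train)

# Its behavior on unseen test points is what reveals (or rules out) overfitting.
test_simple = sse(simple, x_test, y_test)
test_complex = sse(complex_model, x_test, y_test)
```

Comparing `test_simple` and `test_complex` is exactly the validation-on-test-data comparison that Cai (2014) recommends for preventing overfitting.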
3 Cross-validation
Cross-validation is a technique to evaluate and compare learning algorithms
when an explicit validation set is not available. In the basic holdout
method, the data set is separated into a training set and a testing set,
and the function approximator fits a function using the training set only.
Then the function approximator
is asked to predict the output values for the data in the testing set (it has
never seen these output values before). The errors it makes are accumulated
as before to give the mean absolute test set error, which is used to evaluate
the model. The advantage of this method is that it is usually preferable to
the residual method and takes no longer to compute. However, its evaluation
can have a high variance. The evaluation may depend heavily on which data
points end up in the training set and which end up in the test set, and thus
the evaluation may be significantly different depending on how the division
is made (Schneider & Moore, 2000).
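The holdout evaluation described above can be sketched as follows. The data set, the 70/30 split ratio, and the linear function approximator are illustrative assumptions; the mean absolute test-set error at the end is the evaluation quantity the text refers to.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data set: y = 2x + 1 plus noise (an assumption for illustration).
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Hold out 30% of the points; the model never sees their output values.
idx = rng.permutation(x.size)
train_idx, test_idx = idx[:70], idx[70:]

# Fit the function approximator (a line here) on the training set only.
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

# Accumulate the errors on the held-out points into the mean absolute
# test-set error, which is then used to evaluate the model.
pred = slope * x[test_idx] + intercept
mae = float(np.mean(np.abs(pred - y[test_idx])))
```

Re-running the split with a different permutation generally yields a different `mae`, which is the high-variance weakness of the holdout evaluation noted above.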
4 Conclusion
Understanding the concept and the disadvantages of overfitting shows the
importance of choosing the right learning algorithm and tuning its
parameters to their optimal values. Cross-validation helps us properly
validate our learning algorithm while making good use of a single set of
data, giving us the chance to improve the performance of our chosen learning
algorithm.
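Making "good use of a single set of data" can be sketched with k-fold cross-validation, in which every point serves in a test fold exactly once and the fold errors are averaged. The fold count k = 5, the synthetic data, and the linear model below are assumptions for illustration, not details from the cited works.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical single data set: y = 2x + 1 plus noise (an assumption).
x = rng.uniform(0.0, 10.0, size=90)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# Partition the shuffled indices into k folds.
k = 5
folds = np.array_split(rng.permutation(x.size), k)

fold_errors = []
for i in range(k):
    # Fold i is the test set; the remaining folds form the training set.
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Fit on the training folds only, then score on the held-out fold.
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    pred = slope * x[test_idx] + intercept
    fold_errors.append(float(np.mean(np.abs(pred - y[test_idx]))))

# Averaging over the folds reduces the dependence on any single division.
cv_error = float(np.mean(fold_errors))
```

Because the average is taken over all k divisions, `cv_error` is less sensitive to how any one split falls than a single holdout estimate.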
References
Cai, E. (2014). Machine learning lesson of the day: Overfitting and underfitting.
Retrieved March 20, 2014, from http://www.statsblogs.com/2014/03/20/machine-learning-lesson-of-the-day-overfitting-and-underfitting/
Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical
Information and Computer Sciences, 44(1), 1–12.
Lucas, B. (2016). Cross-validation: Estimating prediction error. Retrieved
April 29, 2016, from https://datascienceplus.com/cross-validation-estimating-prediction-error/
Refaeilzadeh, P., Tang, L., & Liu, H. (2009). Cross-validation. In Encyclopedia
of Database Systems (pp. 532–538). Springer.
Schneider, J. & Moore, A. W. (2000). A locally weighted learning tutorial
using vizier 1.0. Carnegie Mellon University, the Robotics Institute.