Overfitting

Contents

Statistical inference
Regression
Machine learning
Consequences
Remedy
Underfitting
Resolving underfitting
Benign overfitting
See also
Notes
References
Further reading
External links

Figure 1. The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and is likely to have a higher error rate on new unseen data, illustrated by black-outlined dots, compared to the black line.

Figure 2. Noisy (roughly linear) data is fitted to a linear function and a polynomial function. Although the polynomial function is a perfect fit, the linear function can be expected to generalize better: if the two functions were used to extrapolate beyond the fitted data, the linear function should make better predictions.

Figure 3. The blue dashed line represents an underfitted model. A straight line can never fit a parabola. This model is too simple.

In mathematical modeling, overfitting is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably. An overfitted model is a mathematical model that contains more parameters than can be justified by the data. In the special case of a polynomial model, the parameters are the polynomial's coefficients, so their number is set by the degree of the polynomial. The essence of overfitting is to have unknowingly extracted some of the residual variation (i.e., the noise) as if that variation represented underlying model structure.
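The polynomial case can be illustrated with a minimal sketch in the spirit of Figure 2. The data here are an assumption made purely for illustration: ten synthetic points drawn from a hypothetical line y = 2x + 1 with Gaussian noise, a fixed random seed, and a helper named `fit_and_score` that is not part of any library. A degree-9 polynomial has ten coefficients, enough to pass through all ten training points, while a degree-1 polynomial has only two.

```python
import numpy as np

def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Fit a polynomial by least squares and return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

# Synthetic data (an assumption for illustration): a noisy straight line.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + 1.0 + rng.normal(scale=0.2, size=x_train.size)

# Test points lie beyond the training range, so this measures extrapolation.
x_test = np.linspace(1.0, 1.5, 20)
y_test = 2.0 * x_test + 1.0 + rng.normal(scale=0.2, size=x_test.size)

# Degree 1: two coefficients. Degree 9: ten coefficients, one per training point.
linear_train, linear_test = fit_and_score(1, x_train, y_train, x_test, y_test)
poly_train, poly_test = fit_and_score(9, x_train, y_train, x_test, y_test)

print(f"degree 1: train MSE = {linear_train:.3g}, test MSE = {linear_test:.3g}")
print(f"degree 9: train MSE = {poly_train:.3g}, test MSE = {poly_test:.3g}")
```

Under these assumptions, the degree-9 fit should report a training error close to zero, because it has absorbed the noise into its coefficients, together with a much larger error on the unseen points than the linear fit. The exact numbers depend on the seed and the noise level chosen.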