Bias–variance tradeoff

Function and noisy data

Spread=5

Spread=1

Spread=0.1

A function (red) is approximated using radial basis functions (blue). Several trials are shown in each graph. For each trial, a few noisy data points are provided as a training set (top). For a wide spread (image 2) the bias is high: the RBFs cannot fully approximate the function (especially the central dip), but the variance between different trials is low. As spread decreases (image 3 and 4) the bias decreases: the blue curves more closely approximate the red. However, depending on the noise in different trials the variance between trials increases. In the lowermost image the approximated values for x=0 varies wildly depending on where the data points were located.

In statistics and machine learning, the bias–variance tradeoff describes the relationship between a model's complexity, the accuracy of its predictions, and how well it can make predictions on previously unseen data that were not used to train the model. In general, as we increase the number of tunable parameters in a model, it becomes more flexible, and can better fit a training data set. It is said to have lower error, or bias. However, for more flexible models, there will tend to be greater variance to the model fit each time we take a set of samples to create a new training data set. It is said that there is greater variance in the model's estimated parameters.

The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:^[1]^[2]

The bias error is an error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).
The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an algorithm modeling the random noise in the training data (overfitting).

The bias–variance decomposition is a way of analyzing a learning algorithm's expected generalization error with respect to a particular problem as a sum of three terms, the bias, variance, and a quantity called the irreducible error, resulting from noise in the problem itself.

Motivation

High bias, low variance
High bias, high variance
Low bias, low variance
Low bias, high variance

The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data, but also generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities (i.e. underfit) in the data.

It is an often made fallacy^[3]^[4] to assume that complex models must have high variance. High variance models are "complex" in some sense, but the reverse needs not be true.^[5] In addition, one has to be careful how to define complexity. In particular, the number of parameters used to describe the model is a poor measure of complexity. This is illustrated by an example adapted from:^[6] The model $f_{a,b}(x)=a\sin(bx)$ has only two parameters ( $a,b$ ) but it can interpolate any number of points by oscillating with a high enough frequency, resulting in both a high bias and high variance.

An analogy can be made to the relationship between accuracy and precision. Accuracy is a description of bias and can intuitively be improved by selecting from only local information. Consequently, a sample will appear accurate (i.e. have low bias) under the aforementioned selection conditions, but may result in underfitting. In other words, test data may not agree as closely with training data, which would indicate imprecision and therefore inflated variance. A graphical example would be a straight line fit to data exhibiting quadratic behavior overall. Precision is a description of variance and generally can only be improved by selecting information from a comparatively larger space. The option to select many data points over a broad sample space is the ideal condition for any analysis. However, intrinsic constraints (whether physical, theoretical, computational, etc.) will always play a limiting role. The limiting case where only a finite number of data points are selected over a broad sample space may result in improved precision and lower variance overall, but may also result in an overreliance on the training data (overfitting). This means that test data would also not agree as closely with the training data, but in this case the reason is inaccuracy or high bias. To borrow from the previous example, the graphical representation would appear as a high-order polynomial fit to the same data exhibiting quadratic behavior. Note that error in each case is measured the same way, but the reason ascribed to the error is different depending on the balance between bias and variance. To mitigate how much information is used from neighboring observations, a model can be smoothed via explicit regularization, such as shrinkage.

Bias–variance decomposition of mean squared error

Suppose that we have a training set consisting of a set of points $x_{1},\dots ,x_{n}$ and real values $y_{i}$ associated with each point $x_{i}$ . We assume that the data is generated by a function $f(x)$ such as $y=f(x)+\varepsilon$ , where the noise, $\varepsilon$ , has zero mean and variance $\sigma ^{2}$ .

We want to find a function ${\hat {f}}(x;D)$ , that approximates the true function $f(x)$ as well as possible, by means of some learning algorithm based on a training dataset (sample) $D=\{(x_{1},y_{1})\dots ,(x_{n},y_{n})\}$ . We make "as well as possible" precise by measuring the mean squared error between $y$ and ${\hat {f}}(x;D)$ : we want $(y-{\hat {f}}(x;D))^{2}$ to be minimal, both for $x_{1},\dots ,x_{n}$ and for points outside of our sample. Of course, we cannot hope to do so perfectly, since the $y_{i}$ contain noise $\varepsilon$ ; this means we must be prepared to accept an irreducible error in any function we come up with.

Finding an ${\hat {f}}$ that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning. It turns out that whichever function ${\hat {f}}$ we select, we can decompose its expected error on an unseen sample $x$ (i.e. conditional to x) as follows:^[7]^: 34^[8]^: 223

\operatorname {E} _{D,\varepsilon }{\Big }={\Big (}\operatorname {Bias} _{D}{\big }{\Big )}^{2}+\operatorname {Var} _{D}{\big }+\sigma ^{2}

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]