Loss function

In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) ^[1] is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function. An objective function is either a loss function or its opposite (in specific domains, variously called a reward function, a profit function, a utility function, a fitness function, etc.), in which case it is to be maximized. The loss function could include terms from several levels of the hierarchy.

In statistics, typically a loss function is used for parameter estimation, and the event in question is some function of the difference between estimated and true values for an instance of data. The concept, as old as Laplace, was reintroduced in statistics by Abraham Wald in the middle of the 20th century.^[2] In the context of economics, for example, this is usually economic cost or regret. In classification, it is the penalty for an incorrect classification of an example. In actuarial science, it is used in an insurance context to model benefits paid over premiums, particularly since the works of Harald Cramér in the 1920s.^[3] In optimal control, the loss is the penalty for failing to achieve a desired value. In financial risk management, the function is mapped to a monetary loss.

Examples

Regret

Leonard J. Savage argued that, when using non-Bayesian methods such as minimax, the loss function should be based on the idea of regret; that is, the loss associated with a decision should be the difference between the consequences of the best decision that could have been made had the underlying circumstances been known and those of the decision that was in fact taken before they were known.
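As a minimal sketch of this idea, the following Python fragment converts a table of losses into a table of regrets and then applies the minimax criterion to it. The loss values and the names `regret_table` and `minimax_regret` are hypothetical and chosen only for illustration.

```python
def regret_table(loss):
    """loss[d][s] is the loss of decision d under state s.

    The regret is the loss actually incurred minus the smallest loss
    that any decision could have achieved in that same state.
    """
    n_states = len(loss[0])
    best = [min(row[s] for row in loss) for s in range(n_states)]
    return [[row[s] - best[s] for s in range(n_states)] for row in loss]


def minimax_regret(loss):
    """Index of the decision whose worst-case regret is smallest."""
    worst = [max(row) for row in regret_table(loss)]
    return min(range(len(worst)), key=worst.__getitem__)


# Hypothetical losses: rows are decisions, columns are states of nature.
LOSS = [
    [10, 2],  # decision 0
    [4, 6],   # decision 1
    [7, 7],   # decision 2
]
REGRET = regret_table(LOSS)
CHOICE = minimax_regret(LOSS)
```

With these illustrative numbers, decision 1 has the smallest worst-case regret, even though decision 2 has the smallest worst-case loss when judged against decision 0.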

Quadratic loss function

The use of a quadratic loss function is common, for example when using least squares techniques. It is often more mathematically tractable than other loss functions because of the properties of variances, as well as being symmetric: an error above the target causes the same loss as the same magnitude of error below the target. If the target is t, then a quadratic loss function is

\lambda (x)=C(t-x)^{2}\;

for some constant C; the value of the constant makes no difference to a decision, and can be ignored by setting it equal to 1. This is also known as the squared error loss (SEL).^[1]
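The claim that the constant C does not affect the decision can be checked numerically. The sketch below assumes a small set of hypothetical target values and a grid of candidate decisions, and picks the candidate with the smallest total quadratic loss; the names `quadratic_loss` and `best_action` are illustrative only.

```python
def quadratic_loss(x, t, c=1.0):
    """Quadratic loss C * (t - x)^2 for decision x and target t."""
    return c * (t - x) ** 2


def best_action(candidates, targets, c=1.0):
    """Candidate that minimizes the total quadratic loss over all targets."""
    return min(candidates,
               key=lambda x: sum(quadratic_loss(x, t, c) for t in targets))


# Hypothetical targets and a grid of candidate decisions 0.0, 0.1, ..., 10.0.
TARGETS = [1.0, 2.0, 6.0]
CANDIDATES = [i / 10 for i in range(101)]

CHOICE_C1 = best_action(CANDIDATES, TARGETS, c=1.0)
CHOICE_C5 = best_action(CANDIDATES, TARGETS, c=5.0)
```

Both values of C select the same candidate, here the mean of the targets, because scaling every loss by a positive constant does not change their ordering.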

Many common statistics, including t-tests, regression models, design of experiments, and much else, use least squares methods applied using linear regression theory, which is based on the quadratic loss function.

The quadratic loss function is also used in linear-quadratic optimal control problems. In these problems, even in the absence of uncertainty, it may not be possible to achieve the desired values of all target variables. Often loss is expressed as a quadratic form in the deviations of the variables of interest from their desired values; this approach is tractable because it results in linear first-order conditions. In the context of stochastic control, the expected value of the quadratic form is used. Because errors are squared, the quadratic loss gives outliers far more weight than typical observations, so alternatives such as the Huber, log-cosh and SMAE losses are used when the data contain many large outliers.
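The sensitivity of the quadratic loss to outliers can be illustrated by fitting a single location value to data containing one extreme observation. The sketch below uses the standard Huber loss with an assumed threshold of 1 and a simple grid search; the data set and the function names are hypothetical.

```python
def quadratic(r):
    """Quadratic loss of a residual r."""
    return r * r


def huber(r, delta=1.0):
    """Huber loss: quadratic near zero, linear beyond the threshold delta."""
    a = abs(r)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def fit_location(data, loss, grid):
    """Grid value that minimizes the total loss of the residuals."""
    return min(grid, key=lambda m: sum(loss(x - m) for x in data))


# Hypothetical data with one large outlier; candidates 0.0, 0.1, ..., 100.0.
DATA = [1.0, 2.0, 3.0, 4.0, 100.0]
GRID = [i / 10 for i in range(1001)]

QUADRATIC_FIT = fit_location(DATA, quadratic, GRID)
HUBER_FIT = fit_location(DATA, huber, GRID)
```

Under these assumptions the quadratic fit lands on the sample mean, which the outlier pulls far from the bulk of the data, while the Huber fit stays close to the four typical observations.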

0-1 loss function

In statistics and decision theory, a frequently used loss function is the 0-1 loss function

L({\hat {y}},y)=\left[{\hat {y}}\neq y\right]

using Iverson bracket notation, i.e. it evaluates to 1 when ${\hat {y}}\neq y$, and 0 otherwise.
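A direct transcription of this definition is sketched below, together with its average over a set of predictions, which gives the misclassification rate. The labels and the helper name `misclassification_rate` are hypothetical.

```python
def zero_one_loss(y_hat, y):
    """Iverson bracket [y_hat != y]: 1 for a wrong prediction, 0 otherwise."""
    return int(y_hat != y)


def misclassification_rate(predictions, labels):
    """Average 0-1 loss over paired predictions and true labels."""
    total = sum(zero_one_loss(p, t) for p, t in zip(predictions, labels))
    return total / len(labels)


# Hypothetical predictions and true labels.
PREDICTIONS = ["a", "b", "a", "b"]
LABELS = ["a", "a", "a", "b"]
RATE = misclassification_rate(PREDICTIONS, LABELS)
```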

Constructing loss and objective functions

In many applications, objective functions, including loss functions as a particular case, are determined by the problem formulation. In other situations, the decision maker's preference must be elicited and represented by a scalar-valued function (also called a utility function) in a form suitable for optimization, a problem that Ragnar Frisch highlighted in his Nobel Prize lecture.^[4] The existing methods for constructing objective functions are collected in the proceedings of two dedicated conferences.^[5]^[6] In particular, Andranik Tangian showed that the most usable objective functions, quadratic and additive, are determined by a few indifference points. He used this property in models for constructing these objective functions from either ordinal or cardinal data elicited through computer-assisted interviews with decision makers.^[7]^[8] Among other things, he constructed objective functions to optimally distribute budgets for 16 Westphalian universities^[9] and the European subsidies for equalizing unemployment rates among 271 German regions.^[10]

Expected loss

In some contexts, the value of the loss function itself is a random quantity because it depends on the outcome of a random variable X.

Statistics

Both frequentist and Bayesian statistical theory involve making a decision based on the expected value of the loss function; however, this quantity is defined differently under the two paradigms.

Frequentist expected loss

We first define the expected loss in the frequentist context. It is obtained by taking the expected value with respect to the probability distribution, P_θ, of the observed data, X. This is also referred to as the risk function^[11]^[12]^[13]^[14] of the decision rule δ and the parameter θ. Here the decision rule depends on the outcome of X. The risk function is given by:

R(\theta ,\delta )=\operatorname {E} _{\theta }L{\big (}\theta ,\delta (X){\big )}=\int _{X}L{\big (}\theta ,\delta (x){\big )}\,\mathrm {d} P_{\theta }(x).

Here, θ is a fixed but possibly unknown state of nature, X is a vector of observations stochastically drawn from a population, $\operatorname {E} _{\theta }$ is the expectation over all population values of X, dP_θ is a probability measure over the event space of X (parametrized by θ) and the integral is evaluated over the entire support of X.
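For a discrete observation the integral reduces to a sum over the support of X, which makes the risk function easy to evaluate exactly. The sketch below assumes, purely for illustration, that X is binomial with n trials and success probability θ, uses squared error as the loss, and compares two hypothetical decision rules: the sample proportion and a shrunken estimate.

```python
from math import comb


def binom_pmf(x, n, theta):
    """P_theta(X = x) for X ~ Binomial(n, theta)."""
    return comb(n, x) * theta ** x * (1 - theta) ** (n - x)


def risk(theta, rule, n, loss):
    """R(theta, delta): expected loss of the rule over the support of X."""
    return sum(loss(theta, rule(x, n)) * binom_pmf(x, n, theta)
               for x in range(n + 1))


def squared_error(theta, estimate):
    return (theta - estimate) ** 2


def proportion(x, n):
    """Sample proportion x / n."""
    return x / n


def shrunken(x, n):
    """Estimate pulled toward 1/2 (an illustrative alternative rule)."""
    return (x + 1) / (n + 2)


N = 10
RISK_PROP_MID = risk(0.5, proportion, N, squared_error)
RISK_SHRUNK_MID = risk(0.5, shrunken, N, squared_error)
RISK_PROP_EDGE = risk(0.05, proportion, N, squared_error)
RISK_SHRUNK_EDGE = risk(0.05, shrunken, N, squared_error)
```

In this example the shrunken rule has lower risk at θ = 0.5 and higher risk at θ = 0.05, which shows that the risk is a function of θ and that neither rule is better for every state of nature.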

Bayes risk

In a Bayesian approach, the expectation is calculated using the prior distribution $\pi ^{*}$ of the parameter θ:

\rho (\pi ^{*},a)=\int _{\Theta }\int _{\mathbf {X}}L(\theta ,a({\mathbf {x}}))\,\mathrm {d} P({\mathbf {x}}\vert \theta )\,\mathrm {d} \pi ^{*}(\theta )=\int _{\mathbf {X}}\int _{\Theta }L(\theta ,a({\mathbf {x}}))\,\mathrm {d} \pi ^{*}(\theta \vert {\mathbf {x}})\,\mathrm {d} M({\mathbf {x}})

where M(x), the marginal distribution of the data in which θ has been "integrated out," is known as the predictive likelihood, $\pi ^{*}(\theta \vert {\mathbf {x}})$ is the posterior distribution, and the order of integration has been changed. One should then choose the action a^* that minimizes this expected loss, which is referred to as the Bayes risk. In the second expression, the inner integral over θ is known as the posterior risk, and minimizing it with respect to the decision a for each x also minimizes the overall Bayes risk. This optimal decision a^* is known as the Bayes (decision) rule: it minimizes the average loss over all possible states of nature θ and over all possible (probability-weighted) data outcomes. One advantage of the Bayesian approach is that one need only choose the optimal action under the actually observed data to obtain a uniformly optimal one, whereas choosing the frequentist optimal decision rule as a function of all possible observations is a much more difficult problem. Of equal importance, the Bayes rule reflects consideration of loss outcomes under different states of nature θ.
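The equivalence between minimizing the posterior risk for each observation and minimizing the overall Bayes risk can be checked on a small discrete problem. The sketch below assumes two states, two observations and two actions; all probabilities, losses and names are hypothetical.

```python
from itertools import product

# Hypothetical prior, likelihood P(x | theta) and loss L(theta, a).
PRIOR = {"healthy": 0.9, "ill": 0.1}
LIKELIHOOD = {
    "healthy": {"neg": 0.8, "pos": 0.2},
    "ill": {"neg": 0.1, "pos": 0.9},
}
LOSS = {
    ("healthy", "treat"): 1, ("healthy", "wait"): 0,
    ("ill", "treat"): 0, ("ill", "wait"): 10,
}
ACTIONS = ["treat", "wait"]
OBSERVATIONS = ["neg", "pos"]


def posterior(x):
    """Posterior over theta given x, normalized by the marginal of x."""
    joint = {t: PRIOR[t] * LIKELIHOOD[t][x] for t in PRIOR}
    marginal = sum(joint.values())
    return {t: joint[t] / marginal for t in joint}


def posterior_risk(action, x):
    """Expected loss of an action under the posterior given x."""
    post = posterior(x)
    return sum(LOSS[(t, action)] * post[t] for t in post)


def bayes_risk(rule):
    """Expected loss of a decision rule over both theta and x."""
    return sum(PRIOR[t] * LIKELIHOOD[t][x] * LOSS[(t, rule[x])]
               for t in PRIOR for x in OBSERVATIONS)


# Bayes rule: minimize the posterior risk separately for each observation.
BAYES_RULE = {x: min(ACTIONS, key=lambda a: posterior_risk(a, x))
              for x in OBSERVATIONS}

# Brute force: Bayes risk of every possible decision rule.
ALL_RULES = [dict(zip(OBSERVATIONS, acts))
             for acts in product(ACTIONS, repeat=len(OBSERVATIONS))]
BEST_RISK = min(bayes_risk(rule) for rule in ALL_RULES)
```

The rule built observation by observation attains the same Bayes risk as the best rule found by enumerating all four decision rules, without ever having to compare whole rules.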

Examples in statistics

For a scalar parameter θ, a decision function whose output ${\hat {\theta }}$ is an estimate of θ, and a quadratic loss function (squared error loss) $L(\theta ,{\hat {\theta }})=(\theta -{\hat {\theta }})^{2},$ the risk function becomes the mean squared error of the estimate, $R(\theta ,{\hat {\theta }})=\operatorname {E} _{\theta }(\theta -{\hat {\theta }})^{2}.$ An estimator found by minimizing the mean squared error estimates the posterior distribution's mean.

In density estimation, the unknown parameter is the probability density itself. The loss function is typically chosen to be a norm in an appropriate function space. For example, for the L² norm, $L(f,{\hat {f}})=\|f-{\hat {f}}\|_{2}^{2},$ the risk function becomes the mean integrated squared error $R(f,{\hat {f}})=\operatorname {E} \|f-{\hat {f}}\|^{2}.$
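The link between squared error loss and the posterior mean in the scalar case can be checked on a toy posterior. The sketch below assumes a hypothetical three-point posterior for θ and searches a grid of candidate estimates for the one with the smallest posterior expected squared error.

```python
# Hypothetical discrete posterior: value of theta -> posterior probability.
POSTERIOR = {0.2: 0.5, 0.4: 0.3, 0.8: 0.2}


def posterior_expected_loss(estimate, posterior):
    """Posterior expectation of the squared error (theta - estimate)^2."""
    return sum(p * (theta - estimate) ** 2 for theta, p in posterior.items())


def posterior_mean(posterior):
    return sum(p * theta for theta, p in posterior.items())


# Candidate estimates 0.00, 0.01, ..., 1.00.
GRID = [i / 100 for i in range(101)]
BEST = min(GRID, key=lambda a: posterior_expected_loss(a, POSTERIOR))
MEAN = posterior_mean(POSTERIOR)
```

Up to the resolution of the grid, the minimizing estimate coincides with the posterior mean.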

Economic choice under uncertainty

In economics, decision-making under uncertainty is often modelled using the von Neumann–Morgenstern utility function of the uncertain variable of interest, such as end-of-period wealth. Since the value of this variable is uncertain, so is the value of the utility function; it is the expected value of utility that is maximized.
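As an illustration, the sketch below compares a certain wealth level with a risky prospect under two utility functions: a linear one (risk-neutral) and a logarithmic one (risk-averse). The wealth levels, probabilities and the choice of logarithmic utility are assumptions made only for this example.

```python
from math import log


def expected_utility(lottery, utility):
    """Expected utility of a lottery given as (wealth, probability) pairs."""
    return sum(p * utility(w) for w, p in lottery)


def linear(w):
    """Risk-neutral utility: utility equals wealth."""
    return w


# Hypothetical end-of-period wealth prospects.
SAFE = [(100.0, 1.0)]
RISKY = [(50.0, 0.5), (160.0, 0.5)]

RISK_NEUTRAL_PREFERS_RISKY = (
    expected_utility(RISKY, linear) > expected_utility(SAFE, linear)
)
RISK_AVERSE_PREFERS_SAFE = (
    expected_utility(SAFE, log) > expected_utility(RISKY, log)
)
```

The risky prospect has the higher expected wealth, so the linear utility prefers it, while the concave logarithmic utility ranks the certain outcome higher.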

Decision rules

A decision rule makes a choice using an optimality criterion. Some commonly used criteria are:

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]