Kolmogorov complexity

In algorithmic information theory (a subfield of computer science and mathematics), the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program (in a predetermined programming language) that produces the object as output. It is a measure of the computational resources needed to specify the object, and is also known as algorithmic complexity, Solomonoff–Kolmogorov–Chaitin complexity, program-size complexity, descriptive complexity, or algorithmic entropy. It is named after Andrey Kolmogorov, who first published on the subject in 1963^[1]^[2] and is a generalization of classical information theory.

The notion of Kolmogorov complexity can be used to state and prove impossibility results akin to Cantor's diagonal argument, Gödel's incompleteness theorem, and Turing's halting problem. In particular, no program P computing a lower bound for each text's Kolmogorov complexity can return a value essentially larger than P's own length (see section § Chaitin's incompleteness theorem); hence no single program can compute the exact Kolmogorov complexity for infinitely many texts.

Definition

Intuition

Consider the following two strings of 32 lowercase letters and digits:

abababababababababababababababab , and

4c1j5b2p0cv4w1x8rx2y39umgw5q85s7

The first string has a short English-language description, namely "write ab 16 times", which consists of 17 characters. The second one has no obvious simple description (using the same character set) other than writing down the string itself, i.e., "write 4c1j5b2p0cv4w1x8rx2y39umgw5q85s7" which has 38 characters. Hence the operation of writing the first string can be said to have "less complexity" than writing the second.

More formally, the complexity of a string is the length of the shortest possible description of the string in some fixed universal description language (the sensitivity of complexity relative to the choice of description language is discussed below). It can be shown that the Kolmogorov complexity of any string cannot be more than a few bytes larger than the length of the string itself. Strings like the abab example above, whose Kolmogorov complexity is small relative to the string's size, are not considered to be complex.

The Kolmogorov complexity can be defined for any mathematical object, but for simplicity the scope of this article is restricted to strings. We must first specify a description language for strings. Such a description language can be based on any computer programming language, such as Lisp, Pascal, or Java. If P is a program which outputs a string x, then P is a description of x. The length of the description is just the length of P as a character string, multiplied by the number of bits in a character (e.g., 7 for ASCII).

We could, alternatively, choose an encoding for Turing machines, where an encoding is a function which associates to each Turing Machine M a bitstring <M>. If M is a Turing Machine which, on input w, outputs string x, then the concatenated string <M> w is a description of x. For theoretical analysis, this approach is more suited for constructing detailed formal proofs and is generally preferred in the research literature. In this article, an informal approach is discussed.

Any string s has at least one description. For example, the second string above is output by the pseudo-code:

function GenerateString2()
    return "4c1j5b2p0cv4w1x8rx2y39umgw5q85s7"

whereas the first string is output by the (much shorter) pseudo-code:

function GenerateString1()
    return "ab" × 16

If a description d(s) of a string s is of minimal length (i.e., using the fewest bits), it is called a minimal description of s, and the length of d(s) (i.e. the number of bits in the minimal description) is the Kolmogorov complexity of s, written K(s). Symbolically,

K(s) = |d(s)|.

The length of the shortest description will depend on the choice of description language; but the effect of changing languages is bounded (a result called the invariance theorem).

Plain Kolmogorov complexity C

There are two definitions of Kolmogorov complexity: plain and prefix-free. The plain complexity is the minimal description length of any program, and denoted $C(x)$ while the prefix-free complexity is the minimal description length of any program encoded in a prefix-free code, and denoted $K(x)$ . The plain complexity is more intuitive, but the prefix-free complexity is easier to study.

By default, all equations hold only up to an additive constant. For example, $f(x)=g(x)$ really means that $f(x)=g(x)+O(1)$ , that is, $\exists c,\forall x,|f(x)-g(x)|\leq c$ .

Let $U:2^{*}\to 2^{*}$ be a computable function mapping finite binary strings to binary strings. It is a universal function if, and only if, for any computable $f:2^{*}\to 2^{*}$ , we can encode the function in a "program" $s_{f}$ , such that $\forall x\in 2^{*},U(s_{f}x)=f(x)$ . We can think of $U$ as a program interpreter, which takes in an initial segment describing the program, followed by data that the program should process.

One problem with plain complexity is that $C(xy)\not <C(x)+C(y)$ , because intuitively speaking, there is no general way to tell where to divide an output string just by looking at the concatenated string. We can divide it by specifying the length of $x$ or $y$ , but that would take $O(\min(\ln x,\ln y))$ extra symbols. Indeed, for any $c>0$ there exists $x,y$ such that $C(xy)\geq C(x)+C(y)+c$ .^[3]

Typically, inequalities with plain complexity have a term like $O(\min(\ln x,\ln y))$ on one side, whereas the same inequalities with prefix-free complexity have only $O(1)$ .

The main problem with plain complexity is that there is something extra sneaked into a program. A program not only represent for something with its code, but also represent its own length. In particular, a program $x$ may represent a binary number up to $\log _{2}|x|$ , simply by its own length. Stated in another way, it is as if we are using a termination symbol to denote where a word ends, and so we are not using 2 symbols, but 3. To fix this defect, we introduce the prefix-free Kolmogorov complexity.^[4]

Prefix-free Kolmogorov complexity K

A prefix-free code is a subset of $2^{*}$ such that given any two different words $x,y$ in the set, neither is a prefix of the other. The benefit of a prefix-free code is that we can build a machine that reads words from the code forward in one direction, and as soon as it reads the last symbol of the word, it knows that the word is finished, and does not need to backtrack or a termination symbol.

Define a prefix-free Turing machine to be a Turing machine that comes with a prefix-free code, such that the Turing machine can read any string from the code in one direction, and stop reading as soon as it reads the last symbol. Afterwards, it may compute on a work tape and write to a write tape, but it cannot move its read-head anymore.

This gives us the following formal way to describe K.^[5]

Fix a prefix-free universal Turing machine, with three tapes: a read tape infinite in one direction, a work tape infinite in two directions, and a write tape infinite in one direction.
The machine can read from the read tape in one direction only (no backtracking), and write to the write tape in one direction only. It can read and write the work tape in both directions.
The work tape and write tape start with all zeros. The read tape starts with an input prefix code, followed by all zeros.
Let $S$ be the prefix-free code on $2^{*}$ , used by the universal Turing machine.

Note that some universal Turing machines may not be programmable with prefix codes. We must pick only a prefix-free universal Turing machine.

The prefix-free complexity of a string $x$ is the shortest prefix code that makes the machine output $x$ :

K(x):=\min\{|c|:c\in S,U(c)=x\}

[1]

[2]

[3]

[4]

[5]