
Non-negative matrix factorization
Illustration of approximate non-negative matrix factorization: the matrix V is represented by the two smaller matrices W and H, which, when multiplied, approximately reconstruct V.

Non-negative matrix factorization (NMF or NNMF), also called non-negative matrix approximation,[1][2] is a group of algorithms in multivariate analysis and linear algebra in which a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as the processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically.

NMF finds applications in such fields as astronomy,[3][4] computer vision, document clustering,[1] missing data imputation,[5] chemometrics, audio signal processing, recommender systems,[6][7] and bioinformatics.[8]

History

In chemometrics, non-negative matrix factorization has a long history under the name "self modeling curve resolution".[9] In this framework the vectors in the right matrix are continuous curves rather than discrete vectors. Early work on non-negative matrix factorizations was also performed by a Finnish group of researchers in the 1990s under the name positive matrix factorization.[10][11][12] It became more widely known as non-negative matrix factorization after Lee and Seung investigated the properties of the algorithm and published some simple and useful algorithms for two types of factorizations.[13][14]

Background

Let the matrix V be the product of the matrices W and H:

V = W H

Matrix multiplication can be implemented as computing the column vectors of V as linear combinations of the column vectors in W, using coefficients supplied by the columns of H. That is, each column of V can be computed as

v_i = W h_i

where v_i is the i-th column vector of the product matrix V and h_i is the i-th column vector of the matrix H.

When multiplying matrices, the dimensions of the factor matrices can be significantly lower than those of the product matrix, and it is this property that forms the basis of NMF: NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if V is an m × n matrix, W is an m × p matrix, and H is a p × n matrix, then p can be significantly less than both m and n.
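To make these shapes concrete, here is a minimal NumPy sketch; the matrices are random and purely illustrative:

    import numpy as np

    # Illustrative dimensions: V is m x n, the factors W (m x p) and H (p x n)
    # are much smaller because p << m and p << n.
    m, n, p = 10000, 500, 10

    rng = np.random.default_rng(0)
    W = rng.random((m, p))   # non-negative features matrix
    H = rng.random((p, n))   # non-negative coefficients matrix

    V = W @ H                # the product recovers the full m x n shape
    assert V.shape == (m, n)

    # Each column of V is a linear combination of the columns of W,
    # weighted by the corresponding column of H (v_i = W h_i).
    i = 42
    assert np.allclose(V[:, i], W @ H[:, i])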

Here is an example based on a text-mining application:

  • Let the input matrix (the matrix to be factored) be V with 10000 rows and 500 columns where words are in rows and documents are in columns. That is, we have 500 documents indexed by 10000 words. It follows that a column vector v in V represents a document.
  • Assume we ask the algorithm to find 10 features in order to generate a features matrix W with 10000 rows and 10 columns and a coefficients matrix H with 10 rows and 500 columns.
  • The product of W and H is a matrix with 10000 rows and 500 columns, the same shape as the input matrix V; if the factorization worked, it is a reasonable approximation of V.
  • From the treatment of matrix multiplication above it follows that each column in the product matrix WH is a linear combination of the 10 column vectors in the features matrix W with coefficients supplied by the coefficients matrix H.

This last point is the basis of NMF because we can consider each original document in our example as being built from a small set of hidden features. NMF generates these features.

It is useful to think of each feature (column vector) in the features matrix W as a document archetype comprising a set of words, where each word's cell value defines the word's rank in the feature: the higher a word's cell value, the higher the word's rank in the feature. A column in the coefficients matrix H represents an original document, with each cell value defining the document's rank for a feature. We can now reconstruct a document (a column vector of the input matrix) as a linear combination of our features (the column vectors of W), where each feature is weighted by the feature's cell value from the document's column in H.
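As a sketch of this text-mining setup, the snippet below runs scikit-learn's NMF on a small random stand-in for the term-document matrix; the dimensions and feature count are placeholders, not the 10000 × 500 example above (which would run the same way, only slower):

    import numpy as np
    from sklearn.decomposition import NMF

    # Small random stand-in for a words x documents matrix
    # (rows = words, columns = documents); real data would be counts
    # or tf-idf weights, which are likewise non-negative.
    rng = np.random.default_rng(0)
    V = rng.random((100, 25))

    model = NMF(n_components=10, init="nndsvd", max_iter=500)
    W = model.fit_transform(V)   # features matrix, here 100 x 10
    H = model.components_        # coefficients matrix, here 10 x 25

    # W @ H has the shape of V and should approximate it.
    print(np.linalg.norm(V - W @ H))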

Clustering property

NMF has an inherent clustering property,[15] i.e., it automatically clusters the columns of the input data V = (v_1, ..., v_n).

More specifically, the approximation of V by V ≈ WH is achieved by finding W and H that minimize the error function (using the Frobenius norm)

‖V − WH‖_F, subject to W ≥ 0, H ≥ 0.
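One standard way to carry out this minimization is the multiplicative update rule of Lee and Seung; the sketch below is a minimal NumPy implementation, where the random initialization, the fixed iteration count, and the small epsilon guard against division by zero are all illustrative choices:

    import numpy as np

    def nmf_multiplicative(V, p, n_iter=200, eps=1e-10, seed=0):
        """Minimize ||V - WH||_F with the Lee-Seung multiplicative updates."""
        rng = np.random.default_rng(seed)
        m, n = V.shape
        W = rng.random((m, p))
        H = rng.random((p, n))
        for _ in range(n_iter):
            # Both updates multiply by ratios of non-negative quantities,
            # so W and H stay non-negative throughout.
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H

    V = np.random.default_rng(1).random((100, 25))
    W, H = nmf_multiplicative(V, p=10)
    print(np.linalg.norm(V - W @ H))   # Frobenius-norm reconstruction error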

If we furthermore impose an orthogonality constraint on H, i.e. H Hᵀ = I, then the above minimization is mathematically equivalent to the minimization of K-means clustering.[15]

Furthermore, the computed H gives the cluster membership: if H_kj > H_ij for all i ≠ k, this suggests that the input datum v_j belongs to the k-th cluster. The computed W gives the cluster centroids: the k-th column of W gives the cluster centroid of the k-th cluster. This centroid's representation can be significantly enhanced by convex NMF.
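In code, reading cluster assignments off a factorization then amounts to an argmax over each column of H; a short sketch, reusing the nmf_multiplicative helper from the sketch above:

    import numpy as np

    rng = np.random.default_rng(2)
    V = rng.random((100, 25))
    W, H = nmf_multiplicative(V, p=3)   # helper defined in the sketch above

    # Column j of V is assigned to the cluster k whose coefficient H[k, j]
    # is largest; the columns of W play the role of cluster centroids.
    labels = np.argmax(H, axis=0)
    print(labels)   # one cluster index per column of V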

When the orthogonality constraint H Hᵀ = I is not explicitly imposed, the orthogonality holds to a large extent, and the clustering property holds too.







