Generally, in real-world datasets, we have the following problem:
High Dimensions = Lots of Features
The primary problem associated with high-dimensionality in the machine learning field is model overfitting, which reduces the ability to generalize beyond the examples in the training set. Richard Bellman described this phenomenon in 1961 as the Curse of Dimensionality:
“Many algorithms that work fine in low dimensions become intractable when the input is high-dimensional”.
PCA is a powerful unsupervised learning technique that takes an orthogonal projection of high-dimensional data onto a (possibly lower-dimensional) subspace so that the variance of the projected data is maximized and we do not lose too much information. It is one of the simplest and most robust ways of doing dimensionality reduction. Other methods include linear discriminant analysis (LDA), factor analysis, etc.
PCA is useful for the following purposes:
- Visualization
- Further processing by machine learning algorithms
- More efficient use of resources (e.g., time, memory, communication)
- Statistical: fewer dimensions lead to better generalization
- Noise removal (improving data quality)
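As a concrete illustration of these uses, here is a minimal sketch of dimensionality reduction with PCA; it assumes NumPy and scikit-learn, and the synthetic data, variable names, and choice of \(k = 2\) are purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 200 samples with 10 correlated features
# generated from 2 underlying latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

# Project onto the first k = 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (200, 2): ready for plotting or downstream models
print(pca.explained_variance_ratio_)  # fraction of variance retained by each component
```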
1. Mathematics of Principal Components
Given \(N\) vectors \(\{ \vec{x}_i \in \mathbb{R}^n \}\) (in machine learning, \(n\) denotes the number of features), we want to summarize them by projecting down into a \(k\)-dimensional subspace. The summary of the given data will be the projection of the original vectors onto \(k\) directions, the principal components, which span the subspace.
Before we explain the mathematical idea behind the PCA algorithm, we need to understand the inner product and change of basis in \(\mathbb{R}^n\).
1.1 Inner Product and Change of Basis
Let \(\vec{x}\) and \(\vec{y}\) be two vectors in \(\mathbb{R}^n\) such that
\[\begin{align} \vec{x} = \sum a_{i} \vec{e}_i, \ \ \ \vec{y} = \sum b_{i} \vec{e}_i, \end{align}\]here, \(\vec{e}_i, \ i = 1, 2, \cdots, n\) form an orthonormal basis of \(\mathbb{R}^n\). For example, in the three-dimensional space \(\mathbb{R}^3\), we could choose the following standard basis:
\[\begin{align} \vec{e}_1 = (1,0,0),\quad \vec{e}_2 = (0,1,0),\quad \vec{e}_3=(0,0,1). \end{align}\]We define the inner product of \(\vec{x}\) and \(\vec{y}\) as
\[\langle \vec{x},\ \vec{y} \rangle = ( a_1,a_2,\cdots,a_n ) \cdot \left( \begin{array}{c} b_1 \\ b_2 \\ \vdots \\ b_n \end{array} \right) = a_1b_1+a_2b_2+\cdots+a_nb_n.\]which can be written as
\[\langle \vec{x}, \ \vec{y} \rangle = |\vec{x}||\vec{y}|\cos(\theta)\]where \(\theta\) is the angle between the two vectors \(\vec{x}\) and \(\vec{y}\). Geometrically speaking, when \(\vec{y}\) is a unit vector, the inner product of \(\vec{x}\) and \(\vec{y}\) is the length of the projection of \(\vec{x}\) onto \(\vec{y}\).
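As a quick numerical check of these two equivalent formulas (a sketch assuming NumPy; the vectors are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
y = np.array([3.0, 0.0, 4.0])

# Coordinate formula: sum of a_i * b_i.
dot = np.dot(x, y)

# Geometric formula: |x| |y| cos(theta).
cos_theta = dot / (np.linalg.norm(x) * np.linalg.norm(y))
print(dot)                                                # 11.0
print(np.linalg.norm(x) * np.linalg.norm(y) * cos_theta)  # 11.0 as well

# Length of the projection of x onto the unit vector y / |y|.
print(np.dot(x, y / np.linalg.norm(y)))                   # 11 / 5 = 2.2
```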
We will also need the notion of change of bases. Given any vector \(\vec{x} = \sum a_{i} \vec{e}_i\) with respect to the orthonormal bases \(\{\vec{e}_1, \cdots, \vec{e}_n \}\), its representation under a different orthonormal basis \(\{\vec{f}_1, \cdots, \vec{f}_n \}\) is given by
\[\begin{align} \vec{x} & = \sum \langle \vec{f}_i, \ \vec{x} \rangle \vec{f}_i = (\vec{f}_1, \vec{f}_2, \cdots, \vec{f}_n) \cdot \left( \begin{array}{c}\langle \vec{f}_1, \ \vec{x} \rangle\\ \langle \vec{f}_2, \ \vec{x} \rangle \\ \vdots \\ \langle \vec{f}_n, \ \vec{x} \rangle \end{array} \right)\\ &= (\vec{f}_1, \vec{f}_2, \cdots, \vec{f}_n) \cdot \begin{bmatrix} \langle \vec{f}_1, \ \vec{e}_1 \rangle & \langle \vec{f}_1, \ \vec{e}_2 \rangle &\cdots & \langle \vec{f}_1, \ \vec{e}_n \rangle \\ \langle \vec{f}_2, \ \vec{e}_1 \rangle & \langle \vec{f}_2, \ \vec{e}_2 \rangle &\cdots & \langle \vec{f}_2, \ \vec{e}_n\rangle \\\vdots & \vdots & \ddots & \vdots\\ \langle \vec{f}_n, \ \vec{e}_1 \rangle & \langle \vec{f}_n, \ \vec{e}_2 \rangle &\cdots & \langle \vec{f}_n, \ \vec{e}_n\rangle \end{bmatrix} \cdot \left( \begin{array}{c} a_1 \\ a_2 \\ \vdots \\ a_n \end{array} \right) \end{align}\]The matrix in the last equation is usually called the change-of-basis matrix between the two bases. Conversely, multiplication by such a matrix can itself be understood as a change of basis, a viewpoint we will use later.
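The sketch below builds such a change-of-basis matrix for a small example in \(\mathbb{R}^2\) (assuming NumPy; the rotated basis \(\vec{f}_1, \vec{f}_2\) is an arbitrary illustrative choice):

```python
import numpy as np

# Standard basis e_1, e_2 and a rotated orthonormal basis f_1, f_2.
angle = np.pi / 6
f1 = np.array([np.cos(angle), np.sin(angle)])
f2 = np.array([-np.sin(angle), np.cos(angle)])

# Change-of-basis matrix: entry (i, j) is <f_i, e_j>.
F = np.vstack([f1, f2])

x = np.array([2.0, 1.0])      # coordinates (a_1, a_2) with respect to e_1, e_2
coords_in_f = F @ x           # new coordinates <f_i, x> with respect to f_1, f_2

# Reconstruct x from its new coordinates: x = sum_i <f_i, x> f_i.
x_back = coords_in_f @ F
print(np.allclose(x, x_back))   # True
```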
So far, we have discussed how a change of basis affects the representation of vectors. Now the question is: for a given data set \(\{ \vec{x}_i \in \mathbb{R}^n \}_{i=1}^N\), can we choose \(k\) basis vectors such that the projection of the dataset onto the subspace spanned by them retains as much information as possible?
Intuitively speaking, we want the projection of the original data set to be as “spread out” as possible, which is measured mathematically by the variance of the projected data.
In the following, we will assume that our data have been “centered”, so that every variable has mean 0:
\[\begin{align} \vec{\mu}_x = \frac{1}{N} \sum^N_{i=1} \vec{x}_i = \vec{0}. \end{align}\]
1.2 Covariance
Given two vectors \(\vec{x}\) and \(\vec{y}\) in \(\mathbb{R}^n\) as above, we define their covariance by the following
\[\begin{align} Cov(\vec{x},\vec{y})=\frac{1}{n}\sum_{i=1}^n {a_ib_i}. \end{align}\]The first principal component is the direction in space along which projections have the largest variance. The second principal component is the direction that maximizes variance among all directions orthogonal to the first. The \(k\)-th component is the variance-maximizing direction orthogonal to the previous \(k−1\) components. There are \(k\) principal components in all.
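In practice, the principal components can be computed as the eigenvectors of the sample covariance matrix, sorted by decreasing eigenvalue (equivalently, via an SVD of the centered data matrix). The following sketch, assuming NumPy and the centering convention above, is one way to find the top \(k\) directions; the function name and test data are illustrative.

```python
import numpy as np

def principal_components(X, k):
    """Top-k principal directions (columns) and their variances for data X of shape (N, n)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)     # n x n sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    return eigvecs[:, order], eigvals[order]

# Illustrative usage on random correlated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
W, variances = principal_components(X, k=2)
projection = (X - X.mean(axis=0)) @ W          # data expressed in the top-2 directions
```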
Another approach would be, rather than maximizing variance, to look for the projection with the smallest average (mean-squared) distance between the original vectors and their projections onto the principal components; this turns out to be equivalent to maximizing the variance.
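The equivalence is easy to verify numerically. In the sketch below (assuming NumPy, with centered two-dimensional data), the unit direction that maximizes the projected variance is the same one that minimizes the mean-squared reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
X = X - X.mean(axis=0)                        # center the data

# Scan unit directions w(theta) in the plane.
thetas = np.linspace(0.0, np.pi, 1000)
dirs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)

proj = X @ dirs.T                             # projection coefficients onto each direction
variance = (proj ** 2).mean(axis=0)           # variance along each direction (data is centered)
mse = (X ** 2).sum(axis=1).mean() - (proj ** 2).mean(axis=0)   # mean-squared residual

# Both criteria pick out the same direction.
print(thetas[variance.argmax()], thetas[mse.argmin()])
```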
PCA does not make an explicit Gaussianity assumption. It finds a coordinate system (an eigenvector basis) that maximizes the variance explained in the data. The orthogonality of principal components implies that PCA finds the most uncorrelated components to explain as much variation in the data as possible. For multivariate Gaussian distributions, zero correlation between components implies independence, a property that is not true for most distributions. [1] Another way of thinking about this is that Gaussian-distributed data have no higher-order statistics beyond variance; therefore PCA is extremely efficient at representing multivariate normally distributed data.
[1] http://en.wikipedia.org/wiki/Normally_distributed_and_uncorrelated_does_not_imply_independent
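A quick illustration of the point in [1] (a sketch assuming NumPy): for a symmetric but non-Gaussian variable \(x\), the pair \((x, x^2)\) is uncorrelated yet clearly dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)   # symmetric, non-Gaussian
y = x ** 2                                  # completely determined by x

# Correlation is (approximately) zero, yet x and y are far from independent.
print(np.corrcoef(x, y)[0, 1])              # close to 0
```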
But if the data is not Gaussian, there are higher-order statistics beyond variance that are not taken into account by PCA. While PCA captures only uncorrelated components, these uncorrelated components are not necessarily independent for general distributions. Thus, ICA advocates prefer minimizing mutual information (or relative entropy, i.e., Kullback-Leibler divergence) for non-Gaussian data, because two random variables with zero mutual information are statistically independent.
The original ICA paper by Comon explains this well [2].
[2] http://www.i3s.unice.fr/~comon/FichiersPdf/como94SP.pdf
ICA works well in problems where we wish to un-mix truly independent signals, for instance separating distinct audio signals in the “cocktail party” problem. There is a downside to ICA, as the original Comon paper points out: if one finds K components with ICA, it does not consistently rank the importance of the components or the amount of variance of the data explained by each component. If you compare ICA with K+1 components vs. K components, the results can look extremely different. With PCA, by contrast, the first K components are always ranked by how much of the signal/variance of the data they explain, and these components are reproducible should you choose to find a larger number of components (if you apply PCA twice to similar data, the first K components are the same). If you have no basis for believing that your signal components come from “independent” sources, PCA or its variations are still good, or at least capture information up to the second moment in your signal.
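To make the contrast concrete, the following sketch runs PCA and FastICA from scikit-learn on two synthetic mixed signals in the spirit of the cocktail-party example; the source waveforms and mixing matrix are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Two independent, non-Gaussian sources mixed into two observed channels.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(3 * t), np.sign(np.sin(5 * t))])   # sine wave + square wave
A = np.array([[1.0, 0.5], [0.4, 1.0]])                          # mixing matrix
X = S @ A.T                                                     # observed mixtures

# PCA returns uncorrelated components ranked by explained variance;
# FastICA searches for statistically independent (non-Gaussian) components.
pca_components = PCA(n_components=2).fit_transform(X)
ica_components = FastICA(n_components=2, random_state=0).fit_transform(X)
```

On data like this, ICA tends to recover the original sine and square waves (up to scale and ordering), while PCA's components generally remain mixtures; PCA, however, always reports how much variance each component explains.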