Dimensionality Reduction


Exploratory data analysis and visualisation are of critical importance to many areas of science. The fundamental problem of dimensionality reduction is how to discover compact representations of high-dimensional data (1).

High-dimensional datasets are defined by their inclusion of a large number of distinct variables; for example, a patient's health status can comprise more than 100 measured parameters covering blood analysis, immune system status, genetic background, nutrition, alcohol and drug consumption, operations, treatments and diagnosed diseases (2). Each of these parameters is a dimension (or feature) of the dataset and can be used to identify patterns and trends. This might suggest that a higher-dimensional dataset is always preferable, since more features should allow us to distinguish between conditions and find correlations more effectively. In practice this is not the case: effectively reducing the number of dimensions in a dataset is a key problem in machine learning, and lower-dimensional data is desirable because of the curse of dimensionality (3).


The Curse of Dimensionality

The curse of dimensionality is the problem of detecting structure in data embedded in a high-dimensional space. When classifying objects, it is necessary to use features: numerical descriptions of some characteristic of each object type, so that statistical algorithms and data analysis can be applied. An example would be using the colour values of an image of a car to determine whether it was produced by Audi or BMW. Clearly, this feature alone is not adequate to identify the car's manufacturer with any reliability. It is possible, therefore, to increase the dimensionality of the problem by adding more features, perhaps determining the shape of each car using edge detection and colour gradients. These additional features, in combination, could be used by our classifier to identify the make of each car more accurately.

As mentioned before, it is tempting at this point to increase the dimensionality further, adding more and more features until we can perfectly classify every make of car using hundreds of different features. It is here that the curse of dimensionality comes into play:

"For a fixed size of training data, increasing the dimensionality (number of features) past some optimal number causes classifier performance to degrade substantially."

This conclusion is based on Hughes' phenomenon (4) and is depicted in Figure 1. The curse of dimensionality stems from the decreasing density of a fixed number of data points as the number of dimensions increases (5). In a high-dimensional space, the points of a finite random sample drawn from a bounded volume end up far apart from one another, accumulating in the corners of the space (6).

Figure 1: As the dimensionality increases, the performance of the classifier increases rapidly until an optimum point. Beyond this point, increases in dimensionality without similar growth in the number of training samples detracts from the performance of the classifier (3).
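The accumulation of points in the corners is easy to demonstrate numerically. The sketch below (an illustration of the idea only, not taken from the cited sources) samples points uniformly from the hypercube [-1, 1]^d and reports the fraction that fall outside the inscribed unit ball, i.e. in the corner regions; this fraction rapidly approaches 100% as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 100_000

for d in (2, 3, 5, 10, 20):
    # Sample points uniformly from the hypercube [-1, 1]^d.
    points = rng.uniform(-1.0, 1.0, size=(n_points, d))
    # A point is "in a corner" if it lies outside the inscribed unit ball.
    outside = np.linalg.norm(points, axis=1) > 1.0
    print(f"d = {d:2d}: {100 * outside.mean():5.1f}% of points lie outside the inscribed ball")
```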

Using the car example, assume we wanted to classify Audis and BMWs using a single feature (for example, the proportion of pixels that are “red”) that varies between 0 and 1, with training data covering 10% of the total range of feature values. The required training sample would be 10% of the complete population of Audis and BMWs. If we added another feature that varied in the same way (moving to a two-dimensional feature space) and still wanted 10% coverage of the feature space, we would need to cover more than 31% of the range of each feature (see Figure 2).

31% of possible Feature1 values × 31% of possible Feature2 values ≈ 10% of the total feature space
Figure 2: As the number of dimensions increases, data becomes more sparse and we require larger and larger training samples to attain the same coverage
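More generally (an extrapolation of the example above rather than a result from the cited sources), covering a fixed fraction p of a d-dimensional feature space requires covering a fraction p^(1/d) of the range of each individual feature, which tends towards 100% as d grows:

\[ r_d = p^{1/d}: \quad r_1 = 0.10, \quad r_2 \approx 0.32, \quad r_3 \approx 0.46, \quad r_{10} \approx 0.79 \quad (p = 0.10) \]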

If the size of the training data did not increase along with the number of dimensions, the data would become more and more sparse as the dimensionality increased (5). This data sparsity leads to a phenomenon known as overfitting, in which the classifier begins to learn specific details of the training data which are not relevant to the general concept that is being classified (3). Overfitting results in poor classifier performance when asked to classify any real-world data, which is unlikely to conform to the case-specific details that the classifier has learnt. Put simply, a linearly increasing number of dimensions necessitates an exponentially increasing number of training data points to maintain an equivalent level of classifier performance when exposed to real-world data.

To summarise, high-dimensional spaces are hard to reason about and impossible to visualise, and the number of possible feature-value combinations grows exponentially with each added dimension, making a complete enumeration of all subspaces practically impossible. Additionally, incorporating a large number of features increases the level of noise in the data and causes the training data to become sparser, necessitating exponential increases in training sample size for linear increases in dimensionality in order to avoid overfitting (6).

Reducing the number of dimensions leads to imperfect classification of the training data; however, classifier performance on real-world data is still greatly superior to that obtained with the full high-dimensional dataset, because overfitting is avoided.


Locally Linear Embedding

The paper that revolutionised the field of dimensionality reduction, “Nonlinear Dimensionality Reduction by Locally Linear Embedding”, was written by Sam Roweis of the Gatsby Computational Neuroscience Unit at UCL in collaboration with Lawrence Saul of AT&T's Research Lab (1). The paper described Locally Linear Embedding (LLE), an unsupervised learning algorithm that “attempts to discover nonlinear structure in high dimensional data by exploiting the local symmetries of linear reconstructions” (7). LLE takes a high-dimensional dataset as input and maps it to a single, global coordinate system of lower dimensionality. The algorithm is adept at generating highly nonlinear embeddings: it identifies complex groupings of data with nonlinear relationships in high dimensions and maintains those groupings while mapping the data points to a lower-dimensional space (1). This contrasts with Principal Component Analysis, one of the main linear techniques for dimensionality reduction, which performs a direct linear mapping of the data to a lower-dimensional space and is therefore inappropriate for nonlinear problems.

LLE takes a single parameter, k, and begins by finding the k nearest neighbours of every point in the dataset, forming a k-nearest-neighbour (kNN) graph. It is important that k is large enough to produce a connected kNN graph; if the chosen k is too small, the graph may split into disconnected components, making it unsuitable for LLE. Next, a set of weights is computed for each point that best linearly reconstructs the point from its k neighbours (i.e. the best possible description of the point as a linear combination of the k points closest to it). Finally, an eigenvector-based optimisation is used to find a low-dimensional embedding of the points in which each point is still described, as closely as possible, by the same linear combination of its k neighbours. This is the key to how LLE retains the structure of high-dimensional data in lower dimensions (7).
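As an illustration of this pipeline, the whole algorithm can be run in a few lines. This is a sketch using scikit-learn's implementation rather than the authors' original code; the Swiss-roll dataset, k = 12 and two output dimensions are arbitrary choices.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A classic nonlinear test set: a 2-D sheet rolled up in 3-D space.
X, colour = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)

# k nearest neighbours per point, and the target dimensionality.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)

# Builds the kNN graph, solves for the reconstruction weights and
# computes the eigenvector-based low-dimensional embedding.
Y = lle.fit_transform(X)

print(X.shape, "->", Y.shape)  # (1500, 3) -> (1500, 2)
```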

Figure 3: A mapping of a 3-dimensional embedding to a 2-dimensional space, whilst retaining non-linear structure within the original data using Locally Linear Embedding (LLE). Image from (8)

Problems with LLE

Since its introduction, LLE has become a classic method of nonlinear dimensionality reduction due to its ability to handle large amounts of high-dimensional data and its innovative approach to locating low-dimensional structure. LLE is also simpler to compute than its competitors (including Isomap, Laplacian eigenmaps and local tangent space alignment), and returns useful results on a wider range of datasets (9). Despite this, LLE is still imperfect; some of its known weaknesses are the following (10):

  • If a low-density sample is provided to LLE, or the points have not been sampled uniformly, then LLE is unable to locate non-uniform warps and folds.
  • Selection of the parameter k (which is used to calculate the k-nearest-neighbors graph in the first step of the algorithm) has a substantial impact on the performance of LLE.
  • LLE is extremely sensitive to any form of noise within the data; even small amounts of noise relative to the dataset can cause it to fail to produce lower-dimensional coordinates.
  • LLE can encounter ill-conditioned eigenvalue problems
    • A square matrix is ill-conditioned if it is invertible but can quickly become non-invertible if some of its entries are changed by a small amount
    • Solving linear equation systems with coefficient matrices that are ill-conditioned is difficult, as small variations in the data can result in hugely different answers, for example, take: \[A = \begin{bmatrix}4.5 & 3.1 \\1.6 & 1.1 \end{bmatrix}, \space b =\begin{bmatrix}19.249 \\ 6.843 \end{bmatrix}, \space b_{1} =\begin{bmatrix}19.25 \\ 6.84 \end{bmatrix}\]
    • If we solve Ax=b, then x = (3.94, 0.49), but if we solve Ax=b1, then x = (2.9, 2.0)
    • The minor change from b to b1 could easily occur as a result of rounding or floating-point error. As we have seen, these minute changes cause huge shifts in the output, and LLE's vulnerability to ill-conditioned eigenvalue problems is therefore a concern (the example is reproduced in the sketch after this list).
  • LLE is an unsupervised algorithm and assumes that all of the data resides on one continuous manifold, which is inherently untrue for classification problems with multiple classes.
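The ill-conditioning example above can be verified directly; the following is a minimal NumPy sketch, using exactly the matrix and right-hand sides from the example.

```python
import numpy as np

# Ill-conditioned system from the example above: det(A) = -0.01.
A = np.array([[4.5, 3.1],
              [1.6, 1.1]])
b  = np.array([19.249, 6.843])
b1 = np.array([19.25, 6.84])

print("condition number:", np.linalg.cond(A))      # very large for this A
print("solution for b :", np.linalg.solve(A, b))   # approximately [3.94, 0.49]
print("solution for b1:", np.linalg.solve(A, b1))  # approximately [2.9 , 2.0 ]
```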

Modifications to LLE

To address some of the issues mentioned above, researchers have proposed many extensions to LLE.

ISOLLE: LLE with Geodesic Distance

The only nonlinear step in the LLE algorithm is the selection of each point's nearest neighbours, and this step plays a crucial part in the algorithm's performance. If the points are sampled in a biased way or contaminated by noise, the estimated reconstruction weights used by LLE poorly reflect the local geometry of the manifold, which in turn degrades the lower-dimensional embedding. One proposed solution is to alter the distance metric used by LLE. The original algorithm uses the Euclidean distance directly, which can result in short circuits: assignments of neighbours that are actually very distant from the data point along the manifold, as seen in Figure 4. An extension of LLE known as ISOLLE instead selects neighbours by geodesic distance, the length of the shortest path connecting two points through a neighbourhood graph. There is evidence that ISOLLE reduces the number of short circuits occurring while locating the neighbours of a point (11).

Figure 4: Use of Euclidean distance (left) has resulted in a short circuit, as the two points appear relatively close to each other, despite being on opposite ends of the “horseshoe” formation. Using Geodesic distance (right) eliminates this issue, as the shortest path between the two points is very long.
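Geodesic distances of this kind can be estimated as in the sketch below, which follows the general Isomap/ISOLLE idea of combining a neighbourhood graph with shortest-path search; it is not the reference implementation from (11), and the function name, the semicircle test data and the parameter values are my own choices.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, n_neighbors=10):
    """Approximate geodesic distances by running shortest-path search
    on a k-nearest-neighbour graph weighted with Euclidean edge lengths."""
    graph = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
    # Treat the graph as undirected so edges can be traversed both ways.
    return shortest_path(graph, method="D", directed=False)

# Example: points on a "horseshoe" (semicircle); the geodesic distance between
# the two tips is much larger than their Euclidean distance.
theta = np.linspace(0, np.pi, 200)
X = np.column_stack([np.cos(theta), np.sin(theta)])
D_geo = geodesic_distances(X, n_neighbors=5)
print("Euclidean tip-to-tip:", np.linalg.norm(X[0] - X[-1]))  # 2.0
print("Geodesic tip-to-tip :", D_geo[0, -1])                  # roughly pi
```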

Improved LLE through New Distance Computing

Utilising a different method of computing distance between points is also an effective way to reduce the effect that the choice of the K parameter has upon LLE’s dimensionality reduction. In standard LLE, the nearest-neighbour points cover a larger area when the surrounding region is sparse and a smaller area when the surrounding region is dense with points. It is desirable to eliminate the effects of this uneven dispersal of points so that our low dimensional embedding is less influenced by the sample points’ distribution.

Improved LLE uses a new distance formula based upon the Euclidean distance (12):

\[d_{ij} = \frac{\left|x_{i}-x_{j} \right|}{\sqrt{M(i)M(j)}}, \quad \text{where} \quad M(i) = \frac{1}{K}\sqrt{\sum_{l=1}^K \left| x_{i} - x_{l} \right|^{2}}\]

The effect of using this modified formula is that distances between sample points in dense regions become larger, while distances between sample points in sparse regions become smaller: in essence, the distribution of sample points is made more uniform, reducing the influence of their original distribution (12). A much larger range of values of K can then be used to produce good results, as seen below in the reduction of a three-dimensional half-cylinder to two dimensions (all images from (10)).
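A direct transcription of the formula above into NumPy is sketched below; the function name and the use of a full pairwise distance matrix are my own choices rather than details from (12).

```python
import numpy as np

def improved_lle_distances(X, K):
    """Density-adjusted distances d_ij = |x_i - x_j| / sqrt(M(i) M(j)),
    where M(i) = (1/K) * sqrt(sum of squared distances to the K nearest neighbours of x_i)."""
    # Pairwise Euclidean distances.
    diff = X[:, None, :] - X[None, :, :]
    D = np.linalg.norm(diff, axis=-1)

    # Squared distances from each point to its K nearest neighbours (excluding itself).
    nearest_sq = np.sort(D, axis=1)[:, 1:K + 1] ** 2
    M = np.sqrt(nearest_sq.sum(axis=1)) / K

    # Rescale: distances grow in dense regions (small M) and shrink in sparse regions (large M).
    return D / np.sqrt(np.outer(M, M))
```

The neighbour-selection step of LLE can then use these adjusted distances in place of the raw Euclidean ones.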

Figure 5: 3-dimensional half cylinder
Figure 6: 3-dimensional half cylinder reduced to 2 dimensions using standard LLE (left) and Improved LLE (right), with K = 4
Figure 7: 3-dimensional half cylinder reduced to 2 dimensions using standard LLE (left) and Improved LLE (right), with K = 9
Figure 8: 3-dimensional half cylinder reduced to 2 dimensions using standard LLE (left) and Improved LLE (right), with K = 19

When K is too small (K = 4), neither standard LLE nor Improved LLE produces acceptable results. From K = 9 onwards, however, Improved LLE exhibits superior performance; standard LLE does not produce good results until K = 19, at which point Improved LLE is still performing well (and has been since K = 9).


Data from the UCI handwritten digit database was used to test the performance of the Improved LLE algorithm. In this dataset, each grayscale image of a handwritten character is 8 × 8 pixels (64 dimensions), plus an additional dimension for its label, giving 65 dimensions per sample. A wide range of values of d and K was tested (where d is the reduced number of dimensions and K is the number of nearest neighbours). Following the dimensionality reduction, the resulting data was passed to a k-nearest-neighbours classifier to predict the class of each query image. Table 1 shows the results of this testing; the values in the table are error rates (the average percentage of misclassified patterns in the testing set). It is evident from the data that the average error rate is much lower when using Improved LLE with the same values of d and K (10).

Table 1: Results for standard LLE and Improved LLE when tested with the UCI handwritten digit dataset, table from (10)
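The evaluation pipeline can be reproduced in outline as follows. This is only a sketch: it uses scikit-learn's 8 × 8 digits dataset as a stand-in for the UCI database and scikit-learn's standard LLE rather than Improved LLE, so the numbers will not match Table 1; d = 10, K = 12 and the kNN classifier settings are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 8 x 8 grayscale digit images flattened to 64 features, plus a class label.
X, y = load_digits(return_X_y=True)

# Reduce 64 dimensions to d = 10 using K = 12 nearest neighbours.
embedding = LocallyLinearEmbedding(n_neighbors=12, n_components=10, random_state=0)
X_low = embedding.fit_transform(X)

# Classify the embedded points with a k-nearest-neighbours classifier.
X_train, X_test, y_train, y_test = train_test_split(X_low, y, test_size=0.3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(f"error rate: {100 * (1 - clf.score(X_test, y_test)):.1f}%")
```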

Conclusion

Dimensionality reduction is a crucial part of preprocessing data before it is used in machine learning. Reducing the number of dimensions in a way that preserves structure is highly desirable, and LLE is adept at accomplishing this task. Nevertheless, the algorithm is susceptible to short circuits, vulnerable to noisy data and sensitive to the value of its sole parameter, k. Many extensions to the algorithm alter the Euclidean distance used by standard LLE; these alterations have proved effective at reducing short circuits and parameter sensitivity. Extensions also exist to convert LLE into a supervised algorithm, making it suitable for problems with multiple classes and several, potentially discontiguous, manifolds.


References

1. Roweis ST, Saul LK. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science. 2000;290(5500):2323-6.

2. Willhelm J. What are some examples of high-dimensional data? 2014 [Available from: https://www.researchgate.net/post/What_are_some_examples_of_high-dimensional_data].

3. Spruyt V. The Curse of Dimensionality in Classification. 2014 [Available from: http://www.visiondummy.com/2014/04/curse-dimensionality-affect-classification/].

4. Hughes G. On the mean accuracy of statistical pattern recognizers. IEEE Transactions on Information Theory. 1968;14(1):55-63.

5. Rojas R. The Curse of Dimensionality. Freie Universität Berlin; 15/02/2015.

6. Alonso MC, Malpica JA, Agirre AMd. Consequences of the Hughes Phenomenon on Some Classification Techniques. ASPRS 2011 Annual Conference; 01/05/2011; Milwaukee, Wisconsin; 2011.

7. Roweis ST, Saul LK. An Introduction to Locally Linear Embedding. 2001.

8. Vanderplas J, Connolly A. Reducing the Dimensionality of Data: Locally Linear Embedding of Sloan Galaxy Spectra. The Astronomical Journal. 2009;138(5):1365.

9. Saul LK, Roweis ST. Think globally, fit locally: unsupervised learning of low dimensional manifolds. The Journal of Machine Learning Research. 2003;4:119-55.

10. Chen J, Liu Y. Locally linear embedding: a survey. Artificial Intelligence Review. 2011;36(1):29-48.

11. Varini C, Degenhard A, Nattkemper TW. ISOLLE: LLE with geodesic distance. Neurocomputing. 2006;69(13–15):1768-71.

12. Wang H, Zheng J, Yao Z, Li L. Improved Locally Linear Embedding Through New Distance Computing. In: Wang J, Yi Z, Zurada JM, Lu B-L, Yin H, editors. Advances in Neural Networks - ISNN 2006: Third International Symposium on Neural Networks, Chengdu, China, May 28 - June 1, 2006, Proceedings, Part I. Berlin, Heidelberg: Springer Berlin Heidelberg; 2006. p. 1326-33.