If you work with machine learning, chances are you have already come across the curse of dimensionality. To understand it, we first need to define dimensionality. In the context of this blog, dimensionality refers to the number of dimensions of the data in a given problem, that is, the number of attributes or features in our dataset.
In this blog, we will explore the curse of dimensionality and, in particular, how deep learning models prove very effective at overcoming it in a wide range of real-life problems. We will take a deep dive into some of the possible reasons why deep learning can defeat this curse, and give an overview of the relationship between the two.
So, what is the curse of dimensionality?
As the number of dimensions of a dataset grows, it is typically necessary to collect more training samples in order to cover enough of the problem space for a model to learn the underlying patterns and generalize. The number of samples needed grows very rapidly, roughly exponentially, with the number of dimensions. This is known as the curse of dimensionality.
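To make the growth rate concrete, here is a rough back-of-the-envelope count. It assumes (purely for illustration) that we split each of the d features into k equal intervals and want at least one training sample in every resulting cell; the symbols k, d and N_min are introduced here and are not part of any standard definition.

```latex
N_{\min}(d) = k^{d}, \qquad \text{e.g. } k = 10:\quad
N_{\min}(2) = 10^{2},\quad N_{\min}(3) = 10^{3},\quad N_{\min}(10) = 10^{10}
```

Under this simplified counting, every extra feature multiplies the number of cells to cover by k, which is why the sample requirement explodes so quickly.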
Let’s use a metaphor to explain this: Imagine there is a cockroach in your house. You probably want to catch it to get rid of it, which implies anticipating (learning) its movement pattern. The cockroach only moves in a 2D space, so its direction has two components: left-right (X axis) and forward-backward (Y axis). Predicting where the cockroach will be in the next second won’t be too hard. However, now imagine there is a mosquito instead. Mosquitoes are able to fly, so there is now a new dimension to its movement: up-down (Z axis). It becomes substantially harder to catch the mosquito, as its movement is significantly more complex. Adding just one dimension has significantly increased the complexity of our problem.
Indeed, adding dimensions to the data greatly increases the complexity of a problem, and therefore the number of examples an algorithm needs to learn and generalize. The underlying cause is that the data becomes sparse: each sample in the dataset (in the example above, the animal’s position) is one particular combination of values across many features, a single point in a (too) vast space of possibilities. The more dimensions we add, the sparser the data becomes and the more samples an algorithm needs to see before it can generate acceptable predictions. High dimensionality makes learning not only hard, but even impossible in some instances. For instance, it was due to this curse that for years AI professionals and academic researchers struggled to solve problems in fields such as big data, speech recognition, computer vision or natural language processing.
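The sparsity effect can be observed numerically. The sketch below is only an illustration under simple assumptions: it draws a fixed number of points uniformly at random in a unit hypercube and measures how far each point is from its nearest neighbor. The function name, the sample size of 500 and the chosen dimensions are all arbitrary choices made for this example.

```python
import numpy as np

def mean_nearest_neighbor_distance(n_samples, n_dims, seed=0):
    """Average distance from each random point to its closest neighbor."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_samples, n_dims))  # uniform in the unit hypercube
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(sq_dists, np.inf)  # ignore the distance of a point to itself
    nearest = np.sqrt(np.maximum(sq_dists.min(axis=1), 0.0))
    return float(nearest.mean())

dims = [2, 10, 100]
distances = {d: mean_nearest_neighbor_distance(500, d) for d in dims}
for d, dist in distances.items():
    print(f"{d:>3} dimensions: mean nearest-neighbor distance = {dist:.3f}")
```

With the number of samples held fixed, the nearest neighbor of each point drifts further away as dimensions are added, which is one way of seeing that the same dataset covers less and less of the space.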
Breaking the curse of dimensionality with deep learning
Most machine learning models are in fact affected by this curse. Yet, we observe that deep learning models are able to successfully tackle a wide range of challenging real-life high-dimensional problems without needing the enormous amounts of training data that the curse would suggest. One example of a deep learning model that breaks the curse of dimensionality is the classic ResNet-152 architecture. ResNet-152 is a 152-layer residual neural network with over 60,200,000 learnable parameters.
Now, suppose we train this network on the images of the ImageNet dataset. This dataset contains about 1,280,000 training images with 1,000 labels/classes (in a perfectly balanced dataset, there would be 1,280 images per class), and 150,528 features per sample (the images have a size of 224x224 pixels, times 3 RGB channels per pixel). The curse of dimensionality tells us that a model with 60 million learnable parameters, fed a 150k-dimensional input and just 1,280 training samples per class, should never be able to learn properly, let alone generalize well. But it does. Below we explain how this and other deep learning algorithms are able to learn in these conditions.
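The figures quoted in this example can be checked with a few lines of Python. The numbers are the ones used in the text; the variable names are only illustrative.

```python
# Input dimensionality of one image: height x width x RGB channels.
height, width, channels = 224, 224, 3
features_per_image = height * width * channels

# Dataset and model sizes as quoted in the text.
training_images = 1_280_000
classes = 1_000
parameters = 60_200_000

images_per_class = training_images // classes
parameters_per_image = parameters / training_images

print(features_per_image)           # 150528
print(images_per_class)             # 1280
print(round(parameters_per_image))  # 47
```

In other words, the network has roughly 47 learnable parameters for every training image, and each class is represented by far fewer samples than there are input features.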
How does deep learning tackle the curse of dimensionality?
So, why are deep learning algorithms often not affected by this curse? The short answer is, we don’t exactly know; it remains an open question in the field of deep learning. That being said, we have some ideas about why this could be the case, and it might be related to the very nature of the neural networks that make up the backbone of deep learning algorithms.
Explanation #1: Manifold hypothesis and automatic feature extraction
The manifold hypothesis states that high-dimensional data actually lies on (or near) a lower-dimensional manifold embedded in the high-dimensional space. In other words, high-dimensional data can be described and learned using fewer dimensions or features, because there is some underlying lower-dimensional pattern that can be found and exploited. This is no surprise in the field of machine learning, where feature selection is common practice.
Deep neural networks are able to uncover these underlying patterns because training reinforces strong predictors. Neural networks iteratively give more importance to the features that are most relevant for generating predictions, so the dimensionality reduction happens naturally as the network learns. This is known as automatic feature extraction.
Automatic feature extraction is one of the advantages of deep learning models, in particular deep neural networks. It’s interesting to note that deep learning models learn to extract useful features directly from the raw data. Unlike many other machine learning methods, they don’t rely on manual feature engineering to perform this feature extraction.
Automatic feature extraction is accomplished through the learning process of the neural network. A neural network consists of multiple layers of interconnected neurons. Each neuron takes the outputs of the previous layer as inputs, computes a weighted sum of them, and passes the result through an activation function to produce its output. During training, the network iteratively adjusts these weights, guided by the gradient of the loss function, which measures the discrepancy between the desired and the obtained output. Minimizing the loss leads the model to assign larger weights to the features with higher predictive power.
It’s important to note that although deep neural networks excel at extracting important features from raw data, it’s not guaranteed that they will always find these features. Whether they do also depends on other factors, such as the regularization techniques used during training and the network’s representational capacity.
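A toy illustration of this idea, under strong simplifying assumptions: the sketch below generates data that truly has only 3 degrees of freedom, embeds it in a 100-dimensional space with a random linear map plus a little noise, and then checks how much of the variance is captured by 3 directions. Real manifolds are generally curved rather than linear, so this is only the simplest possible case; all sizes and names are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, ambient_dim, intrinsic_dim = 500, 100, 3

# Latent coordinates: the "true" low-dimensional description of each sample.
latent = rng.normal(size=(n_samples, intrinsic_dim))
# A random linear map embeds the latent points in the 100-dimensional space.
embedding = rng.normal(size=(intrinsic_dim, ambient_dim))
data = latent @ embedding + 0.01 * rng.normal(size=(n_samples, ambient_dim))

# Singular values of the centered data show how many directions carry variance.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained = singular_values ** 2 / np.sum(singular_values ** 2)
top_share = float(explained[:intrinsic_dim].sum())

print(f"Share of variance in the first {intrinsic_dim} directions: {top_share:.4f}")
```

Although every sample has 100 features, almost all of the variance sits in 3 directions, so a model that discovers those directions effectively faces a 3-dimensional problem.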
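The weight-adjustment loop described above can be sketched with a single neuron. This is a deliberately minimal example, not a deep network: it assumes a synthetic dataset in which only the first of five features determines the label, a sigmoid activation, a cross-entropy loss, and plain gradient descent with an arbitrarily chosen learning rate and step count.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 5

# Only feature 0 determines the label; features 1 to 4 are pure noise.
X = rng.normal(size=(n_samples, n_features))
y = (X[:, 0] > 0).astype(float)

weights = np.zeros(n_features)
bias = 0.0
learning_rate = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(200):
    predictions = sigmoid(X @ weights + bias)  # weighted sum, then activation
    error = predictions - y                    # cross-entropy gradient w.r.t. the weighted sum
    grad_w = X.T @ error / n_samples           # gradient of the loss w.r.t. each weight
    grad_b = error.mean()
    weights -= learning_rate * grad_w          # step against the gradient
    bias -= learning_rate * grad_b

print(np.round(weights, 2))
```

After training, the weight attached to the informative feature dominates the weights on the noise features, which is the single-neuron version of the network learning which inputs matter.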
Explanation #2: Locality and symmetry
Locality and symmetry are two important concepts in deep neural networks in the context of the curse of dimensionality.
Locality is a concept that is very close to the manifold hypothesis; it refers to the idea that nearby data points in high-dimensional spaces have similar characteristics. Put differently, patterns in the data exhibit local structure. Deep learning models are very effective at finding these local relationships and exploiting them thanks to the hierarchical representation learning process. As the network learns, each layer progressively extracts local features from the data and combines them into larger patterns, which helps break the curse of dimensionality.
Symmetry relates to the behavior of the network rather than the configuration of the data. It refers to the invariance of the network’s output under certain transformations of the input data. In simple terms, it means that the network can still recognize the input even if some transformations are applied to some of the samples. For example, a deep neural network trained on an image dataset can recognize objects or patterns even if their position differs from image to image. Robustness to other transformations, such as rotations, is typically weaker and often has to be encouraged, for example by training on rotated copies of the images.
Symmetry can help break the curse of dimensionality because it reduces the number of different configurations of the same patterns the network needs to learn, therefore eliminating the need to cover the full input space in the training dataset.
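As a small illustration of this point, the sketch below shows translation invariance only (not rotation), using a 1D signal for simplicity. It assumes a single convolution-style filter followed by global max pooling; the kernel values stand in for a hypothetical learned pattern detector and are made up for the example.

```python
import numpy as np

def conv_then_global_max(signal, kernel):
    # Slide the same small kernel over every position (weight sharing),
    # then keep only the strongest response (global max pooling).
    responses = np.correlate(signal, kernel, mode="valid")
    return float(responses.max())

kernel = np.array([1.0, 2.0, 1.0])   # hypothetical learned pattern detector
pattern = np.array([1.0, 2.0, 1.0])

signal_a = np.zeros(20)
signal_a[3:6] = pattern              # pattern near the start
signal_b = np.zeros(20)
signal_b[12:15] = pattern            # same pattern, shifted to the right

score_a = conv_then_global_max(signal_a, kernel)
score_b = conv_then_global_max(signal_b, kernel)
print(score_a, score_b)              # 6.0 6.0
```

The detector responds identically wherever the pattern appears, so the training set does not need to contain the pattern at every possible position.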
Both locality and symmetry are important concepts in this context, and can be leveraged to break the curse of dimensionality. Some authors, however, looked at this question and analyzed deep learning models, inference algorithms and data as an integrated system for different network architectures. They concluded that it’s locality rather than symmetry that breaks the curse of dimensionality.
Explanation #3: Regularization techniques
Regularization techniques are usually applied when training deep learning models to avoid overfitting. Although these techniques are not explicitly aimed at fighting the curse of dimensionality, they can be very helpful when confronting this issue, as they help prevent the model from learning noise or irrelevant patterns in the data. Methods such as dropout, weight decay and batch normalization are commonly used as regularization techniques.
Dropout refers to randomly dropping out hidden and visible units of a network during training. Weight decay penalizes complexity by adding to the loss function the sum of the squared weights multiplied by a small number, the weight decay coefficient. Batch normalization standardizes the inputs of a layer over each mini-batch, applied either directly to the network’s inputs or to the activations of a previous layer.
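For orientation only, here is a minimal NumPy sketch of the three operations as described above. It makes several assumptions: dropout is written in its common "inverted" form (survivors are rescaled), batch normalization omits the learned scale and shift parameters, and the function names and example values are invented for this illustration rather than taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, drop_prob, rng):
    # Inverted dropout: zero each unit with probability drop_prob during
    # training and rescale the survivors so the expected value is unchanged.
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask / (1.0 - drop_prob)

def loss_with_weight_decay(data_loss, weights, weight_decay):
    # Adds the sum of squared weights, scaled by the weight decay coefficient.
    return data_loss + weight_decay * np.sum(weights ** 2)

def batch_norm(batch, eps=1e-5):
    # Standardize each feature over the mini-batch (learned scale/shift omitted).
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mean) / np.sqrt(var + eps)

activations = rng.normal(loc=3.0, scale=2.0, size=(64, 8))
dropped = dropout(activations, drop_prob=0.5, rng=rng)
normalized = batch_norm(activations)
penalized = loss_with_weight_decay(1.0, np.array([1.0, -2.0]), weight_decay=0.01)

print(float((dropped == 0).mean()))   # fraction of units dropped, close to 0.5
print(penalized)                      # data loss plus 0.01 * (1 + 4)
```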
We will not dive into these methods in this blog, as they are more relevant when discussing overfitting.
The curse of dimensionality is a common issue in the field of machine learning. For years, it slowed down efforts in fields such as big data, speech recognition, natural language processing or image processing. Deep learning, however, has proven very effective in overcoming the curse of dimensionality in a wide variety of machine learning problems.
In this blog, we took a deep dive into a number of reasons that could explain how deep learning is able to break the curse of dimensionality. Although it remains an open question in the field, we laid out some plausible reasons why these algorithms are effective at fighting this curse; namely automatic feature extraction, locality and symmetry, and regularization techniques.
It is worth mentioning, however, that deep learning is not immune to all limitations derived from high dimensionality. Although the characteristics of deep learning described in this blog make these algorithms quite good at fighting the curse of dimensionality in many cases, the performance of the models can still be affected by factors such as data quality, model architecture, hyperparameter tuning, and computational resources. In practice, this means that there will still be high-dimensional problems for which additional work such as feature engineering or hyperparameter tuning will be needed in order for a deep learning model to learn and generalize.