While walking around on holiday in Mexico, I was listening to Lex Fridman’s Artificial Intelligence podcast with guest Tomaso Poggio, who said something that struck a chord with me:
This has been one of the puzzles about neural networks.
How can you get something that really works when you have so much freedom?
The main question that came to mind was: why can we train such big neural networks, in which the number of parameters is greater than the number of training samples? Let us, for example, look at the classic ResNet-152 architecture. ResNet-152 is a 152-layer Residual Neural Network with over 60,200,000 learnable parameters.
It is trained on the images of ImageNet. These images are 224x224 pixels with channels for Red, Green and Blue (RGB), which means there are 224x224x3 = 150,528 pixel activations, so the input is 150,528-dimensional. ImageNet furthermore has 1,280,000 training images spread over 1,000 labels/classes. If the images are perfectly balanced between the classes, that comes down to 1,280 images per class in the ideal case.
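As a quick sanity check, here is a small Python sketch that reproduces these numbers. The torchvision call at the end is optional and assumes the library is installed; it counts the learnable parameters of ResNet-152 directly.

```python
# Input dimensionality of an ImageNet image as fed to ResNet-152.
height, width, channels = 224, 224, 3
input_dim = height * width * channels
print(input_dim)  # 150528 pixel activations

# Training images per class, assuming a perfectly balanced ImageNet.
train_images, num_classes = 1_280_000, 1_000
print(train_images // num_classes)  # 1280 images per class

# Optional: count the learnable parameters of ResNet-152 (requires torchvision).
# The total comes out at roughly 60.2 million.
from torchvision.models import resnet152

model = resnet152()  # randomly initialized, no pretrained weights needed
print(sum(p.numel() for p in model.parameters()))
```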
Everything classical machine learning teaches about dimensionality says that a model built up of 60 million learnable parameters, with a 150k-dimensional input and just 1,280 training samples per class, should never be able to learn properly, let alone generalize well.
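To make "more parameters than training samples" concrete, a rough back-of-the-envelope calculation using the figures above looks like this:

```python
# Ratio of learnable parameters to training samples for ResNet-152 on ImageNet,
# using the (approximate) figures quoted above.
parameters = 60_200_000
training_samples = 1_280_000
print(parameters / training_samples)  # roughly 47 parameters per training image
```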
In this blogpost, I would like to dive into why it is that these very large neural networks work so well even though they have vastly more parameters than training samples. To do this, we will need some background on the Curse of Dimensionality, as it is taught in classical textbooks.
Subsequently, I will take a look at neural networks and explain the Universal Approximation Theorem. Since the Universal Approximation Theorem only holds for single-hidden-layer neural networks, we will examine whether it is possible to extend this unique property to deeper networks. Once we have these building blocks in place, we will combine them to see how they help explain how giant neural networks can still learn.