Deep Neural Nets: a brief recap

Neural nets have experienced a surge of interest in recent years. This post is a summary of the subject from my non-academic point of view.

In the 80s, feed-forward neural networks with a single hidden layer became pretty popular when a simple and efficient algorithm was devised to train them: the backpropagation algorithm.

As is usual with machine learning algorithms, a naive implementation can be written very concisely without much effort. In fact, all one needs to understand to train a network is the chain rule.

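A neural network is nothing more than a hierarchy of composed function applications. Take a net with a single neuron $f$ and a loss function $L$, evaluated as $L(f(\mathbf{x}, \mathbf{w}), y)$, where $\mathbf{x}$ is the input, $\mathbf{w}$ are the weights and $y$ is the target value.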

If you know how to compute the partial derivative of the loss function $L$ you want to minimize with respect to its input $f(\mathbf{x}, \mathbf{w})$, i.e. $\frac{\partial L}{\partial f}$, and the derivative of $f$ with respect to its weights $\mathbf{w}$, i.e. $\frac{\partial f}{\partial \mathbf{w}}$, you can determine the derivative of $L$ with respect to the weights $\mathbf{w}$ using the chain rule: $\frac{\partial L}{\partial \mathbf{w}} = \frac{\partial L}{\partial f}\frac{\partial f}{\partial \mathbf{w}}$.

That’s all you need to apply gradient descent and iteratively update the parameters $\mathbf{w}$: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial L}{\partial f}\frac{\partial f}{\partial \mathbf{w}}$, where $\alpha$ is the learning rate, in order to reach a solution that minimizes the loss function. It’s easy to imagine applying this procedure recursively on a hierarchy of composed functions.
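
To make the update rule concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions of my own: the neuron is linear, $f(\mathbf{x}, \mathbf{w}) = \mathbf{w} \cdot \mathbf{x}$, the loss is the squared error $(f - y)^2$, and the toy data and the names `train`, `alpha` and `epochs` are made up for the example.

```python
# Minimal sketch: one linear neuron trained with gradient descent.
# Assumptions (not from the post): f(x, w) = w . x and L(f, y) = (f - y)^2.

def f(x, w):
    """Single linear neuron: weighted sum of the inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))

def dL_df(fx, y):
    """Derivative of the squared-error loss with respect to the neuron output."""
    return 2.0 * (fx - y)

def df_dw(x, w):
    """Derivative of f with respect to each weight; for a linear neuron it is x."""
    return list(x)

def train(data, w, alpha=0.05, epochs=200):
    """Apply w <- w - alpha * dL/df * df/dw once per sample, for several epochs."""
    for _ in range(epochs):
        for x, y in data:
            outer = dL_df(f(x, w), y)   # dL/df
            inner = df_dw(x, w)         # df/dw
            w = [wi - alpha * outer * gi for wi, gi in zip(w, inner)]
    return w

# Toy data generated by y = 2*x1 - 1*x2, so the ideal weights are [2, -1].
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = train(data, [0.0, 0.0])
```

With this noise-free toy data the learned `w` should end up very close to `[2, -1]`. For a deeper net the only change is that `df_dw` itself gets expanded by the chain rule, layer by layer.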

Unfortunately, training deep networks, i.e. networks with more than one hidden layer, is not a simple task. There are several reasons why the backpropagation algorithm can fail to find a good solution: vanishing gradients and overfitting come to mind, and both have been explored in depth elsewhere. Interest in neural nets faded away once researchers realized this.
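
To get a feel for the vanishing gradient, consider a deliberately simplified toy setup of my own: a chain with one sigmoid unit per layer, every weight equal to 1 and every pre-activation equal to 0. The chain rule multiplies the gradient by the sigmoid's derivative at each layer, and that derivative is never larger than 0.25. The function names below are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_scale(depth, z=0.0, weight=1.0):
    """Factor by which the chain rule scales a gradient passed back through
    `depth` sigmoid units that all share the same pre-activation and weight."""
    return (weight * sigmoid_prime(z)) ** depth

# Scaling factor after 1, 5 and 10 layers in this toy chain.
scales = {depth: gradient_scale(depth) for depth in (1, 5, 10)}
```

Even in this best case for the sigmoid, the factor after 10 layers is below one millionth, so the earliest layers barely receive any learning signal.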

Then, in 2006/2007, researchers showed that by using unlabeled data it’s possible to train a deep neural net using a greedy, layer-by-layer approach. The main idea is based on a network of neurons able to reconstruct the original input with the smallest amount of error, similarly to what Principal Component Analysis does. Such a network is called an Autoencoder and is composed of three layers:

• the input layer of dimension $d$
• the hidden layer of dimension $d'$ with $d' < d$ (simplifying assumption)
• the output layer of dimension $d$

The hidden layer learns a lower dimensional representation of the input, from which the output layer can reconstruct the original signal with the minimum error possible. Once an Autoencoder is trained, the representation learned by the hidden layer can be used as input for another Autoencoder. The process is repeated, forming a so-called stacked Autoencoder. Effectively, each Autoencoder learns a set of new features from the ones learned by the previous Autoencoder. To give you a concrete example, the first Autoencoder might learn to detect the edges of a picture, the second one contours, and so on.
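
The greedy stacking procedure can be sketched roughly as follows. This is a toy illustration under simplifying assumptions of my own, not the method from the 2006/2007 papers: linear units, no biases, squared reconstruction error and plain stochastic gradient descent. The helper names (`matvec`, `reconstruction_error`, `train_autoencoder`) and the synthetic data are hypothetical.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def reconstruction_error(data, W_enc, W_dec):
    """Mean squared error between each input and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = matvec(W_dec, matvec(W_enc, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)

def train_autoencoder(data, d_hidden, alpha=0.02, epochs=300, seed=0):
    """Fit a linear autoencoder d -> d_hidden -> d with SGD; return (W_enc, W_dec)."""
    rng = random.Random(seed)
    d = len(data[0])
    W_enc = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d_hidden)]
    W_dec = [[rng.uniform(-0.5, 0.5) for _ in range(d_hidden)] for _ in range(d)]
    for _ in range(epochs):
        for x in data:
            h = matvec(W_enc, x)                     # hidden representation
            x_hat = matvec(W_dec, h)                 # reconstruction
            err = [a - b for a, b in zip(x_hat, x)]  # dL/dx_hat (up to a factor 2)
            # Chain rule: push the error back through the decoder to the hidden layer.
            dh = [sum(W_dec[i][k] * err[i] for i in range(d)) for k in range(d_hidden)]
            for i in range(d):
                for k in range(d_hidden):
                    W_dec[i][k] -= alpha * err[i] * h[k]
            for k in range(d_hidden):
                for j in range(d):
                    W_enc[k][j] -= alpha * dh[k] * x[j]
    return W_enc, W_dec

# Toy unlabeled data: 4-dimensional points that really live on a 2-D plane.
rng = random.Random(1)
data = []
for _ in range(20):
    s, t = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data.append([s, t, s + t, s - t])

# Greedy stacking: the codes of the first autoencoder feed the second one.
enc1, dec1 = train_autoencoder(data, d_hidden=2)
codes1 = [matvec(enc1, x) for x in data]
enc2, dec2 = train_autoencoder(codes1, d_hidden=1)
stacked_encoders = [enc1, enc2]
```

Each autoencoder is trained on its own and only its encoder half is kept. Because the toy points lie on a plane, the first autoencoder should be able to reconstruct them almost perfectly from two hidden units, while the second one necessarily loses information.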

Once you have trained a stacked Autoencoder, you can initialize the weights of a deep network with $N$ hidden layers using the weights of the $N$ hidden layers of the stacked Autoencoder (pre-training). From here on you can use the neural net just like any ordinary one and train it using labeled data (fine-tuning). Using unlabeled data to pre-train a deep neural net was a big thing back in 2006/2007. This older talk by Geoff Hinton was truly inspiring, but things have changed since then.
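
The initialization step might look like the sketch below. The weight values are made up for illustration and stand in for encoders that have already been trained; the names `pretrained_encoders`, `init_from_pretraining` and `forward`, the tanh hidden units and the zero-initialized output layer are all assumptions of mine.

```python
import copy
import math

# Hypothetical encoder weights of two already trained autoencoders (4 -> 3 -> 2).
# The numbers are invented for the example.
pretrained_encoders = [
    [[0.2, -0.1, 0.4, 0.0],
     [0.0, 0.3, -0.2, 0.1],
     [-0.3, 0.1, 0.0, 0.2]],
    [[0.5, -0.4, 0.1],
     [0.1, 0.2, -0.3]],
]

def init_from_pretraining(encoders, n_outputs):
    """Copy the encoder weights into the hidden layers and add a fresh output layer."""
    hidden = copy.deepcopy(encoders)
    n_last = len(hidden[-1])
    output = [[0.0] * n_last for _ in range(n_outputs)]  # untrained output layer
    return hidden + [output]

def forward(layers, x):
    """Forward pass: tanh on the hidden layers, linear output layer."""
    for W in layers[:-1]:
        x = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]
    return [sum(w * v for w, v in zip(row, x)) for row in layers[-1]]

net = init_from_pretraining(pretrained_encoders, n_outputs=1)
prediction = forward(net, [1.0, 0.5, -0.5, 0.0])
```

Since the output layer starts at zero, `prediction` is `[0.0]` until fine-tuning begins. Fine-tuning is then ordinary backpropagation over all layers of `net` using labeled data.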

In 2012 Hinton et al. showed that it is possible to train a deep convolutional neural net to classify images without any sort of pre-training and, at the same time, beat traditional Computer Vision approaches. Since then unsupervised pre-training has largely fallen out of favour as a research topic in universities but, nevertheless, it was the catalyst that led to more funding and ultimately to where we are now.

A state-of-the-art deep convolutional neural network for image classification is based on a handful of powerful ingredients:

• many hidden layers (or it wouldn’t be a deep net);
• convolutional layers followed by pooling layers early on;
• rectified linear units instead of the classic sigmoid activation function, to learn faster;
• dropout to approximate the average result of many nets, to reduce overfitting.
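
Most of these ingredients fit in a few lines each. The sketch below is a toy version under assumptions of my own: it works on a 1-D signal instead of an image to stay short, the kernel is hand-picked rather than learned, and dropout is written in its common "inverted" form (rescaling during training), which is one variant among several. All function names are hypothetical.

```python
import random

def relu(z):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, z)

def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, which is what most nets call convolution."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool(values, size=2):
    """Non-overlapping max pooling; trailing leftovers are dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

def dropout(values, p=0.5, rng=random, training=True):
    """Inverted dropout: zero each unit with probability p and rescale the rest.
    At test time all units are used, which approximates averaging many nets."""
    if not training:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

signal = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, -2.0]
edge_kernel = [-1.0, 1.0]  # responds to rising edges in the signal
features = [relu(z) for z in conv1d(signal, edge_kernel)]
pooled = max_pool(features)
```

Here `features` is `[1.0, 2.0, 0.0, 0.0, 0.0, 0.0]` and `pooled` is `[2.0, 0.0, 0.0]`: the rising edge at the start of the signal survives, everything else is zeroed by the ReLU. During training one would additionally pass activations through `dropout`.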

The devil is in the details, but the basic concepts and ideas are easy to grasp, and a simple implementation can be written in a couple of afternoons.

The real difficulty is more of an engineering one: how do you write the most efficient code? If I have whetted your appetite and you are curious to have a look at a serious implementation, you should check out cuda-convnet, an extremely well written and efficient multi-GPU implementation of convolutional neural nets.