Slideshow (in French)
"Intelligence artificielle, machine learning et deep learning" by @vm-lbbe (Vincent Miele)
La folle histoire du deep learning, by Y. Le Cun
A short primer on Deep Learning
by @vm-lbbe (Vincent Miele)
Well, what's a neural network?
Well, what's a neuron then? It is a dot product that takes a vector as input and returns a value (more precisely, the dot product is followed by an activation function f that post-processes the dot-product result). Each neuron has its own vector of weights w used to perform the dot product. This is exactly what is represented in a).
Ok. Let's now assume we have a data vector as input and we want to predict a value or a class from this data. First, the model processes the data vector with a series of neurons (neurons are stacked into a layer): this is what is represented for the first layer in b). Then, the vector of results produced by this layer is processed by the neurons of another layer (2nd layer in b)). And so on, until an output layer: this layer is built to answer the original question, i.e. predicting a value or a class.
This makes a network of neurons which is quite large (or deep): a deep neural network.
(courtesy of Sandra Vieira)
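As a toy illustration of this forward pass (a minimal numpy sketch; the layer sizes and names are made up for the example, not taken from the figure):

```python
import numpy as np

def layer(x, W, b, f):
    """One layer: each row of W is one neuron's weight vector w.
    The layer returns every neuron's activated dot product."""
    return f(W @ x + b)

relu = lambda z: np.maximum(z, 0.0)            # activation function f
identity = lambda z: z

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # input data vector (4 features)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)  # 1st layer: 3 neurons
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)  # output layer: 2 neurons

h = layer(x, W1, b1, relu)                     # vector of results of the 1st layer
y = layer(h, W2, b2, identity)                 # output, e.g. 2 class scores
print(y.shape)  # (2,)
```

Each layer is just a matrix-vector product (all the dot products at once) followed by the activation.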
Ok then, what's deep learning?
All the neurons' weights are unknown... It is necessary to learn the best weights to achieve the best performance. In other words, when the data pass through the neural network, which weights yield good predictions? Learning a deep neural network: deep learning!
However, there is no closed-form solution for the best weights given an input dataset... Here comes the magical part! The idea is to rely on a training set for which the answer (the value or the class to predict) is known. It works by having the model make predictions on the training data and using the prediction error to update the model (i.e. the weights) in such a way as to reduce that error. In practice, we rely on a loss function L that measures the correctness of the predictions. The learning algorithm iteratively updates the weights to make the loss decrease, using the partial derivatives of this loss function with respect to every weight parameter (see next figure). It alternates two steps: [part of] the training set is scanned forward to compute the loss, then follows a backward step (backpropagation) where the weights are modified. And so on. Many iterations are necessary and this is a huge amount of computation!
(borrowed from here)
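This forward/backward loop can be sketched on the simplest possible model, a linear one with a mean-squared-error loss (a numpy sketch with made-up data; real deep learning frameworks compute the gradients automatically):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))          # training set: 100 samples, 3 features
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                         # known answers for the training set

w = np.zeros(3)                        # the weights to learn
lr = 0.1                               # learning rate
for step in range(200):
    pred = X @ w                       # forward: compute the predictions
    err = pred - y
    loss = (err ** 2).mean()           # loss L: how wrong are we?
    grad = 2 * X.T @ err / len(X)      # backward: dL/dw for every weight
    w -= lr * grad                     # update the weights, loss decreases

print(np.round(w, 2))                  # close to true_w = [1, -2, 0.5]
```

The same loop, with backpropagation to compute the gradients layer by layer, is what trains a deep network.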
Now, what's a convolutional neural network, aka a CNN?
We now assume the input data is an image (an array of pixels). This data, seen as a vector, is 1/ huge and 2/ spatially correlated (a pixel is linked to its neighbors). There is therefore a need 1/ to be parsimonious in the number of neurons and 2/ to exploit the spatial correlation (a single pixel is rarely informative). The proposed idea is to use "sliding neurons", aka filters: each of these neurons deals with a small box of pixels (for instance a 3x3 square) and slides over the image to analyze every possible box: this is the convolution. The largest part of a CNN is then a series of convolutional layers (see below - convolution/feature learning part) plus additional subtleties (not discussed here). The last part of a CNN, for the classification step, is a regular neural network (as presented previously) that takes as input a vectorized version (see the flatten layer below) of the results of the convolutional part.
(courtesy of unknown)
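The "sliding neuron" idea fits in a few lines (a numpy sketch of one convolution with valid padding and stride 1; the image and filter are made-up toy values):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small filter over the image and take a dot product
    at every position: the result is one feature map."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            box = image[i:i + kh, j:j + kw]   # current box of pixels
            out[i, j] = np.sum(box * kernel)  # the sliding neuron's dot product
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "image"
edge_filter = np.array([[1., 0., -1.]] * 3)       # 3x3 vertical-edge filter
fmap = convolve2d(image, edge_filter)
print(fmap.shape)  # (3, 3): one value per possible 3x3 box
```

A convolutional layer applies many such filters, producing one feature map each; the filter weights are learned exactly like the other weights.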
A (simplified) glossary
- Batch size: The training set is divided into batches of a fixed size. The batch size is a hyperparameter that defines the number of images to work through before updating the internal model parameters.
- Dropout: Layer that randomly sets some neurons to zero in order to ignore them. This forces the model to be redundant and makes it less likely to overfit.
- Epoch: One epoch corresponds to training on all batches, i.e. one full pass over the training set.
- Feature map: Repeated application of the same filter to an input results in a map of activations called a feature map.
- Flatten layer: Flattens a feature map into a 1-D column.
- Filter: Imagine a small filter (more or less a sliding neuron) sliding left to right across the image, from top to bottom.
- Learning rate: Hyperparameter that controls the descent intensity along the slope in gradient-based methods.
- Pooling layer / MaxPool: Layer that divides the neurons into groups; a pooling operation is applied to each group. For max pooling, only the neuron with the highest value in each group is retained.
- Overfitting: When the model is trained for too many epochs, it becomes too close to the training dataset and is unable to generalize to the test dataset.
- ReLU: Non-linear layer: $f(x)=\max(x,0)$, usually used after convolutional layers to introduce non-linearity in the model.
- Stride: The stride controls how the filter of a convolution moves on the input by controlling the distance in each dimension between two convolution operations.
- Stochastic/Mini-batch Gradient Descent: Iterative algorithm where "gradient" refers to the calculation of an error gradient (the slope of the error) and "descent" refers to moving down along that slope. SGD: batch size = 1. Mini-batch GD: 1 < batch size < size of the training set.
- Zero padding: Adds zeros around the input tensor of a convolution so that the output has the same shape as the input.
- BatchNormalization: Layer normalizing the activations of the previous layer by subtracting the batch mean and dividing by the batch standard deviation. It speeds up training and improves performance. For more details, see here.
- Global Average Pooling: Average pooling over the first two dimensions of a 3D tensor. Can be used after the last convolution to reduce the number of parameters.
- Data augmentation: Technique that artificially increases the number of images in the dataset. It consists in training the model with images from the training set that are modified (orientation, light, ...) each time they are used during the epochs.
- Accuracy: Percentage of correct predictions.
- Weights: Represent the state of a model; the weights are the values of all the parameters of the model. They change after each batch and are usually saved after each epoch.
- Train/Validation/Test sets: In deep learning, the train set is for training, the validation set is usually used for choosing the weights associated with the highest validation accuracy (or any other metric), and the test set is for evaluating the performance of the model once the weights have been chosen and set.
- Channels/depth: Color images have 3 channels: red, green, and blue. A filter must always have the same number of channels as the input, often referred to as "depth". Therefore a 3x3 filter would in fact be 3x3x3 for rows, columns, and depth. However, the filter is applied to the input using a dot product operation which results in a single value, so each filter produces a single feature map. This means that the output of a convolutional layer with 32 filters has depth 32, one for each of the 32 feature maps created.
- Freeze: Freezing prevents the weights of a neural network layer from being modified during the backward pass of training.
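Several of the entries above (ReLU, MaxPool, flatten layer) can be illustrated on a toy 4x4 feature map (a numpy sketch with made-up values, not a library implementation):

```python
import numpy as np

fmap = np.array([[1., -2., 3., 0.],
                 [4., 5., -6., 2.],
                 [0., 1., 2., -3.],
                 [7., -1., 0., 4.]])

relu = np.maximum(fmap, 0.0)  # ReLU: f(x) = max(x, 0), negatives become 0

# MaxPool with 2x2 groups: keep only the highest value of each group
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)      # [[5. 3.]
                   #  [7. 4.]]

flat = pooled.flatten()  # flatten layer: 1-D vector, ready for dense layers
print(flat.shape)  # (4,)
```

The reshape trick groups the 4x4 map into four 2x2 blocks, and the max over the within-block axes performs the pooling.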