data-science-ipython-notebooks/deep-learning/keras-tutorial/2.1 Supervised Learning - ConvNets.ipynb
2017-08-26 08:47:27 -04:00

907 lines
22 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Credits: Forked from [deep-learning-keras-tensorflow](https://github.com/leriomaggio/deep-learning-keras-tensorflow) by Valerio Maggio"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Convolutional Neural Network"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"### References:\n",
"\n",
"Some of the images and the content I used came from this great couple of blog posts \\[1\\] [https://adeshpande3.github.io/adeshpande3.github.io/]() and \\[2\\] the terrific book, [\"Neural Networks and Deep Learning\"](http://neuralnetworksanddeeplearning.com/) by Michael Nielsen. (**Strongly recommend**) "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A convolutional neural network (CNN, or ConvNet) is a type of **feed-forward** artificial neural network in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The networks consist of multiple layers of small neuron collections which process portions of the input image, called **receptive fields**. \n",
"\n",
"The outputs of these collections are then tiled so that their input regions overlap, to obtain a _better representation_ of the original image; this is repeated for every such layer."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## How does it look like?"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"slideshow": {
"slide_type": "-"
}
},
"source": [
"<img src=\"imgs/convnets_cover.png\" width=\"70%\" />\n",
"\n",
"> source: https://flickrcode.files.wordpress.com/2014/10/conv-net2.png"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# The Problem Space \n",
"\n",
"## Image Classification"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Image classification is the task of taking an input image and outputting a class (a cat, dog, etc) or a probability of classes that best describes the image. \n",
"\n",
"For humans, this task of recognition is one of the first skills we learn from the moment we are born and is one that comes naturally and effortlessly as adults."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"These skills of being able to quickly recognize patterns, *generalize* from prior knowledge, and adapt to different image environments are ones that we do not share with machines."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Inputs and Outputs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"imgs/cnn1.png\" width=\"70%\" />\n",
"\n",
"source: [http://www.pawbuzz.com/wp-content/uploads/sites/551/2014/11/corgi-puppies-21.jpg]()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"When a computer sees an image (takes an image as input), it will see an array of pixel values. \n",
"\n",
"Depending on the resolution and size of the image, it will see a 32 x 32 x 3 array of numbers (The 3 refers to RGB values).\n",
"\n",
"let's say we have a color image in JPG form and its size is 480 x 480. The representative array will be 480 x 480 x 3. Each of these numbers is given a value from 0 to 255 which describes the pixel intensity at that point."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Goal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What we want the computer to do is to be able to differentiate between all the images its given and figure out the unique features that make a dog a dog or that make a cat a cat. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"When we look at a picture of a dog, we can classify it as such if the picture has identifiable features such as paws or 4 legs. \n",
"\n",
"In a similar way, the computer should be able to perform image classification by looking for *low level* features such as edges and curves, and then building up to more abstract concepts through a series of **convolutional layers**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Structure of a CNN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> A more detailed overview of what CNNs do would be that you take the image, pass it through a series of convolutional, nonlinear, pooling (downsampling), and fully connected layers, and get an output. As we said earlier, the output can be a single class or a probability of classes that best describes the image. \n",
"\n",
"source: [1]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Convolutional Layer"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The first layer in a CNN is always a **Convolutional Layer**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"<img src =\"imgs/conv.png\" width=\"50%\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Convolutional filters\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Convolutional Filter much like a **kernel** in image recognition is a small matrix useful for blurring, sharpening, embossing, edge detection, and more. \n",
"\n",
"This is accomplished by means of convolution between a kernel and an image.\n",
"\n",
"The main difference _here_ is that the conv matrices are **learned**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<img src=\"imgs/keDyv.png\" width=\"90%\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"As the filter is sliding, or **convolving**, around the input image, it is multiplying the values in the filter with the original pixel values of the image (aka computing **element wise multiplications**)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"imgs/cnn2.png\" width=\"80%\">"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"iNow, we repeat this process for every location on the input volume. (Next step would be moving the filter to the right by 1 unit, then right again by 1, and so on)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"After sliding the filter over all the locations, we are left with an array of numbers usually called an **activation map** or **feature map**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## High Level Perspective\n",
"\n",
"Lets talk about briefly what this convolution is actually doing from a high level. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Each of these filters can be thought of as **feature identifiers** (e.g. *straight edges, simple colors, curves*)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"<img src=\"imgs/cnn3.png\" width=\"70%\" />"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Visualisation of the Receptive Field"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"imgs/cnn4.png\" width=\"80%\" />"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<img src=\"imgs/cnn5.png\" width=\"80%\" />"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"<img src=\"imgs/cnn6.png\" width=\"80%\" />"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The value is much lower! This is because there wasnt anything in the image section that responded to the curve detector filter. Remember, the output of this conv layer is an activation map. \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Going Deeper Through the Network"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Now in a traditional **convolutional neural network** architecture, there are other layers that are interspersed between these conv layers.\n",
"\n",
"<img src=\"https://adeshpande3.github.io/assets/Table.png\"/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## ReLU (Rectified Linear Units) Layer"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
" After each conv layer, it is convention to apply a *nonlinear layer* (or **activation layer**) immediately afterward.\n",
"\n",
"\n",
"The purpose of this layer is to introduce nonlinearity to a system that basically has just been computing linear operations during the conv layers (just element wise multiplications and summations)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"In the past, nonlinear functions like tanh and sigmoid were used, but researchers found out that **ReLU layers** work far better because the network is able to train a lot faster (because of the computational efficiency) without making a significant difference to the accuracy."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"It also helps to alleviate the **vanishing gradient problem**, which is the issue where the lower layers of the network train very slowly because the gradient decreases exponentially through the layers"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"(**very briefly**)\n",
"\n",
"Vanishing gradient problem depends on the choice of the activation function. \n",
"\n",
"Many common activation functions (e.g `sigmoid` or `tanh`) *squash* their input into a very small output range in a very non-linear fashion. \n",
"\n",
"For example, sigmoid maps the real number line onto a \"small\" range of [0, 1]."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"As a result, there are large regions of the input space which are mapped to an extremely small range. \n",
"\n",
"In these regions of the input space, even a large change in the input will produce a small change in the output - hence the **gradient is small**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### ReLu\n",
"\n",
"The **ReLu** function is defined as $f(x) = \\max(0, x),$ [2]\n",
"\n",
"A smooth approximation to the rectifier is the *analytic function*: $f(x) = \\ln(1 + e^x)$\n",
"\n",
"which is called the **softplus** function.\n",
"\n",
"The derivative of softplus is $f'(x) = e^x / (e^x + 1) = 1 / (1 + e^{-x})$, i.e. the **logistic function**.\n",
"\n",
"[2] [http://www.cs.toronto.edu/~fritz/absps/reluICML.pdf]() by G. E. Hinton "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Pooling Layers"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
" After some ReLU layers, it is customary to apply a **pooling layer** (aka *downsampling layer*)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"In this category, there are also several layer options, with **maxpooling** being the most popular. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Example of a MaxPooling filter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"imgs/MaxPool.png\" width=\"80%\" />"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Other options for pooling layers are average pooling and L2-norm pooling. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The intuition behind this Pooling layer is that once we know that a specific feature is in the original input volume (there will be a high activation value), its exact location is not as important as its relative location to the other features. \n",
"\n",
"Therefore this layer drastically reduces the spatial dimension (the length and the width but not the depth) of the input volume.\n",
"\n",
"This serves two main purposes: reduce the amount of parameters; controlling overfitting. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"An intuitive explanation for the usefulness of pooling could be explained by an example: \n",
"\n",
"Lets assume that we have a filter that is used for detecting faces. The exact pixel location of the face is less relevant then the fact that there is a face \"somewhere at the top\""
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Dropout Layer"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The **dropout layers** have the very specific function to *drop out* a random set of activations in that layers by setting them to zero in the forward pass. Simple as that. \n",
"\n",
"It allows to avoid *overfitting* but has to be used **only** at training time and **not** at test time. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Fully Connected Layer"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The last layer, however, is an important one, namely the **Fully Connected Layer**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Basically, a FC layer looks at what high level features most strongly correlate to a particular class and has particular weights so that when you compute the products between the weights and the previous layer, you get the correct probabilities for the different classes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"imgs/ConvNet LeNet.png\" width=\"90%\" />"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# CNN in Keras"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"**Keras** supports:\n",
"\n",
"- 1D Convolutional Layers;\n",
"- 2D Convolutional Layers;\n",
"- 3D Convolutional Layers;\n",
"\n",
"The corresponding `keras` package is `keras.layers.convolutional`"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Convolution1D\n",
"\n",
"```python\n",
"from keras.layers.convolutional import Convolution1D\n",
"Convolution1D(nb_filter, filter_length, init='uniform',\n",
" activation='linear', weights=None,\n",
" border_mode='valid', subsample_length=1,\n",
" W_regularizer=None, b_regularizer=None,\n",
" activity_regularizer=None, W_constraint=None,\n",
" b_constraint=None, bias=True, input_dim=None,\n",
" input_length=None)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
">Convolution operator for filtering neighborhoods of **one-dimensional inputs**. When using this layer as the first layer in a model, either provide the keyword argument `input_dim` (int, e.g. 128 for sequences of 128-dimensional vectors), or `input_shape` (tuple of integers, e.g. (10, 128) for sequences of 10 vectors of 128-dimensional vectors)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Example\n",
"\n",
"```python\n",
"\n",
"# apply a convolution 1d of length 3 to a sequence with 10 timesteps,\n",
"# with 64 output filters\n",
"model = Sequential()\n",
"model.add(Convolution1D(64, 3, border_mode='same', input_shape=(10, 32)))\n",
"# now model.output_shape == (None, 10, 64)\n",
"\n",
"# add a new conv1d on top\n",
"model.add(Convolution1D(32, 3, border_mode='same'))\n",
"# now model.output_shape == (None, 10, 32)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Convolution2D\n",
"\n",
"```python\n",
"from keras.layers.convolutional import Convolution2D\n",
"Convolution2D(nb_filter, nb_row, nb_col, \n",
" init='glorot_uniform',\n",
" activation='linear', weights=None,\n",
" border_mode='valid', subsample=(1, 1),\n",
" dim_ordering='default', W_regularizer=None,\n",
" b_regularizer=None, activity_regularizer=None,\n",
" W_constraint=None, b_constraint=None, \n",
" bias=True)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Example\n",
"\n",
"```python\n",
"\n",
"# apply a 3x3 convolution with 64 output filters on a 256x256 image:\n",
"model = Sequential()\n",
"model.add(Convolution2D(64, 3, 3, border_mode='same', \n",
" input_shape=(3, 256, 256)))\n",
"# now model.output_shape == (None, 64, 256, 256)\n",
"\n",
"# add a 3x3 convolution on top, with 32 output filters:\n",
"model.add(Convolution2D(32, 3, 3, border_mode='same'))\n",
"# now model.output_shape == (None, 32, 256, 256)\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Dimensions of Conv filters in Keras"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The complex structure of ConvNets *may* lead to a representation that is challenging to understand."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Of course, the dimensions vary according to the dimension of the Convolutional filters (e.g. 1D, 2D)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Convolution1D\n",
"\n",
"**Input Shape**:\n",
"\n",
"**3D** tensor with shape: (`samples`, `steps`, `input_dim`).\n",
"\n",
"**Output Shape**:\n",
"\n",
"**3D** tensor with shape: (`samples`, `new_steps`, `nb_filter`)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Convolution2D\n",
"\n",
"**Input Shape**:\n",
"\n",
"**4D** tensor with shape: \n",
"\n",
"- (`samples`, `channels`, `rows`, `cols`) if `dim_ordering='th'`\n",
"- (`samples`, `rows`, `cols`, `channels`) if `dim_ordering='tf'`\n",
"\n",
"**Output Shape**:\n",
"\n",
"**4D** tensor with shape:\n",
"\n",
"- (`samples`, `nb_filter`, `new_rows`, `new_cols`) \n",
"if `dim_ordering='th'`\n",
"- (`samples`, `new_rows`, `new_cols`, `nb_filter`) if `dim_ordering='tf'`"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}