Deep Learning ver3 Lesson 1

Vikas Jha
5 min readOct 23, 2018

Following a suggestion from a friend in Jan 2018, I started with Deep Learning for Coders v2. The non-conventional top-down approach followed by Jeremy and Rachel seemed counter-intuitive at first; however, it does seem to work. Nevertheless, I made the blunder of watching the videos at 2x and not practising enough coding, a mistake Jeremy has warned about many times during his lectures.

In v3, I plan to work on as many datasets as possible and, going by the advice from Rachel (https://medium.com/@racheltho/why-you-yes-you-should-blog-7d2544ac1045), I also plan to write a brief blog post on each of the 7 lessons based on my notes taken during the live sessions.

The course

Disclaimer: I missed the initial 20 minutes of the lecture due to internet issues.

· The course is taught by Jeremy Howard & Rachel Thomas.

· The pre-requisites for this course are ‘high school mathematics + 1 year of coding experience’. Python coding knowledge is an advantage.

· The course consists of 7 lessons, each around 1.5 to 2 hours long, with an expected 8–10 hours of work each week. It is recommended that one first goes through the full lesson/video without trying to understand/google everything.

Setting up the virtual machine

· The deep learning setup requires an Nvidia GPU (https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning). It is possible to use a personal computer/server for the course; however, it would be slower (high-end machines are not easy/cheap to come by) and setting up the machine takes time and energy. It is highly recommended to rent access to servers which have everything preinstalled.

· On the course website, 5 options are suggested. While Paperspace, Salamander, and SageMaker provide easy, ready-to-run options, some installation is required for Google Cloud Platform and AWS EC2.

The approach

Top-down approach: ‘Learn how it works before you learn why it works. Get your hands dirty coding as much as possible.’

Getting started

· Jupyter notebook is recommended as it is interactive and can show plots/images. Use ‘Shift+Enter’ to run the cell contents.

· The first cell contains the magic commands %matplotlib inline, %reload_ext autoreload, and %autoreload 2 (read more at: https://towardsdatascience.com/the-top-5-magic-commands-for-jupyter-notebooks-2bf0c5ae4bb8). These commands ensure that matplotlib plots are shown inside the Jupyter notebook and that edited modules are automatically reloaded.
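A minimal sketch of what such a first cell looks like (the exact order of the magic commands does not matter):

    # reload edited Python modules automatically before running code
    %reload_ext autoreload
    %autoreload 2
    # render matplotlib plots inside the notebook instead of a separate window
    %matplotlib inline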

· The next cell loads the fastai library. The fastai library sits on top of PyTorch; it is faster and needs fewer lines of code than Keras. It supports vision, text, tabular, and collab (collaborative filtering) applications.

· An import of everything (‘import *’) is used, and the usual coding recommendations are not followed, as this is a research/experimentation environment where all components should be readily available. In a production environment, proper coding guidelines should be followed.
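Assuming the fastai v1 library used when the course was released, the import cell looks roughly like this:

    from fastai import *          # core fastai utilities
    from fastai.vision import *   # everything needed for image/vision models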

Data and processing

· Data is taken from two sources: Kaggle datasets & academic datasets. These datasets provide strong baselines/benchmarks against which models can be compared.

· The Oxford-IIIT Pet Dataset (http://www.robots.ox.ac.uk/~vgg/data/pets/): a 37-category pet dataset with roughly 200 images for each class; it has 12 cat & 25 dog breeds. Previous versions of this course used the ‘cats vs dogs’ dataset, which had become too easy to work on over the years with easily available models. Classification on the Oxford-IIIT Pet Dataset is a problem of fine-grained classification (i.e. classification between similar categories).

· path = untar_data(URLs.PETS), where URLs.PETS is the URL of the dataset. pathlib Path objects are used instead of plain strings, as they work across platforms (needs more explanation).
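A minimal sketch of downloading and exploring the dataset, assuming the fastai v1 API (untar_data downloads and extracts the data only the first time, then returns a pathlib Path):

    from fastai.vision import *

    path = untar_data(URLs.PETS)       # download & extract; returns a pathlib Path
    path_img = path/'images'           # Path objects compose with '/' on any OS
    fnames = get_image_files(path_img) # list of image file Paths
    print(fnames[:3])                  # peek at a few file names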

· The labels are the targets of the prediction process. In this dataset, the label is encoded in each image’s file name, which has the format ‘breedname_number.jpg’. Regular expressions are used to extract the labels from the file names.
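As a sketch, the regular expression used in the lesson notebook captures everything before the trailing ‘_&lt;number&gt;.jpg’ as the label (the file name below is a hypothetical example):

    import re

    pat = r'/([^/]+)_\d+.jpg$'                  # group(1) = breed name
    fname = 'images/american_bulldog_146.jpg'   # hypothetical example file name
    print(re.search(pat, fname).group(1))       # -> 'american_bulldog'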

· Size of images: in this course, the images used are square (i.e. height = width) and of size 224 x 224. Images of a different shape/size are resized and cropped accordingly. Needing all images to be the same size is one of the drawbacks of deep learning models. Handling variable-size images is covered in part 2 of the course.

· DataBunch object: contains the training and validation data.

· Normalization of images: individual pixel values range from 0 to 255; they are normalized so that the values have mean 0 and standard deviation 1.

· Image augmentation: generating new images from existing ones; it helps to avoid over-fitting.

· Padding: a concept to be discussed in coming classes.
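Putting the above together, a rough sketch of building the DataBunch with the fastai v1 API (the regex, image size and normalization statistics follow the lesson notebook; the batch size is an assumption):

    from fastai.vision import *

    path_img = untar_data(URLs.PETS)/'images'
    fnames = get_image_files(path_img)
    pat = r'/([^/]+)_\d+.jpg$'          # label = breed name in the file name

    data = ImageDataBunch.from_name_re(
        path_img, fnames, pat,
        ds_tfms=get_transforms(),       # default augmentation (flips, rotations, ...)
        size=224,                       # resize/crop every image to 224 x 224
        bs=64)                          # batch size (an assumption)
    data = data.normalize(imagenet_stats)   # per-channel mean 0 / std 1 using ImageNet stats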

Model

· A CNN (Convolutional Neural Network) requires 2 things to train: data & an architecture.

· ResNet architecture:
— Why ResNet? It works well in most cases.
— 34-layer architecture: the 50-layer version needs more memory to run; 34 layers work fine for the current problem.
— Metric: printed for the validation set.

· 1st run: it takes time as the initial ResNet weights are downloaded. These weights are not random but come from training the model on a large number of images, so the model does not start from nothing/random weights; it is then trained further, with additional data, for the specific problem being targeted. For instance, starting from ResNet-34, 30 images were enough to train a world-class model to differentiate between cricket and baseball images.
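A sketch of creating the learner, assuming the fastai v1 API (the function was called create_cnn in the version used at the time; later releases renamed it cnn_learner):

    # downloads ImageNet-pretrained ResNet-34 weights on the first run
    learn = create_cnn(data, models.resnet34, metrics=error_rate)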

· fit_one_cycle: based on a paper by Leslie Smith released in March 2018. It fits the model following the 1cycle policy (need more explanation). It works better than fit.
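A minimal usage sketch (4 epochs is the number used in the lesson notebook):

    learn.fit_one_cycle(4)   # train for 4 epochs with the 1cycle learning-rate schedule
    learn.save('stage-1')    # save the weights so fine-tuning can restart from here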

· The first model gives around 94% accuracy, compared to the 2012 research paper (source??) where a problem-specific model achieved only 59% accuracy.

· ResNet vs Inception: ResNet scores higher on the Stanford DAWNBench benchmark. Inception is also not as resilient.

· Lower layers vs higher layers:

— The lower the layer, the more basic the patterns it captures. The first layer may be capturing horizontal, vertical and diagonal lines.

— The second layer learns from combinations of features from the first layer, and so on.

— As basic patterns/geometrical shapes are much the same everywhere, there is no need to train the lower layers heavily.

— Unfreezing all layers and retraining them at the same rate leads to a loss of accuracy, even compared to the model before fine-tuning, as the lower-layer weights get disturbed. Different layers capture different levels of semantic complexity, so unlearning and retraining all layers at the same rate lowers accuracy. Hence, the earlier layers should either be left untouched or fine-tuned very gently with a very low learning rate (see the sketch after this list).

— The Zeiler and Fergus paper on visualizing CNNs shows the patterns captured by the different layers.

— Learning rate finder: tells how quickly the parameters can be updated. To capture the different learning-rate requirements of different layers, a range (slice) of learning rates is passed when fine-tuning.
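A rough sketch of the fine-tuning workflow with the fastai v1 API (the specific learning-rate range is an illustrative assumption; in practice it is read off the learning-rate finder plot):

    learn.unfreeze()        # make all layers trainable
    learn.lr_find()         # try a range of learning rates on a few mini-batches
    learn.recorder.plot()   # plot loss vs learning rate to pick a sensible range

    # earlier layers get the smallest learning rate, later layers the largest
    learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))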

Good work by fastai students:

· Hamel Husain: Towards Natural Language Semantic Code Search (enables semantic search of code: https://githubengineering.com/towards-natural-language-semantic-code-search/)

· Sara Hooker: a model to alert forest rangers based on chainsaw sounds <could not find link>

· Christine Mcleavey Payne: Combined Deep learning with Music (http://christinemcleavey.com/)

· Alexandre Cadrin-Chênevert: Radiologist. Deep learning applied to radiology. Pointed out overfitting <link??>

· Melissa Fabros: a first-round winner of the “AI For Everyone” Challenge. Built a system to identify faces that are under-represented in conventional training datasets (darker-skinned faces). (https://www.crowdfundinsider.com/2017/09/121836-crowdflower-names-kiva-engineer-first-round-winner-1-million-ai-everyone-challenge/)

· Karthik: phone-based object identification for the visually challenged.
