Stanford University https://www.youtube.com/watch?v=DAOcjicFr1Y
- first layer output: $55 \times 55 \times 96$
- parameters: $11 \times 11 \times 3 \times 96 \approx 35K$
- TODO feature map size? the output depth depends on how many filters you have in the layer =)
- Pooling layer: $3 \times 3$, stride 2, output: $27 \times 27 \times 96$
- parameters: 0
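A quick sanity check of the numbers above, as a minimal sketch. It assumes the usual AlexNet conv1 setup (a 227x227x3 input, 96 11x11 filters, stride 4, no padding) and the 3x3 stride-2 max pool; the helper name is mine.

```python
# Minimal sketch: verify the conv1 / pool1 shapes and parameter counts quoted above.
# Assumes a 227x227x3 input, 96 filters of size 11x11 with stride 4,
# then a 3x3 max pool with stride 2.

def conv_output_size(w: int, f: int, stride: int, pad: int = 0) -> int:
    """Spatial output size of a conv/pool layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * pad) // stride + 1

# conv1: 11x11x3 filters, 96 of them, stride 4
conv1_out = conv_output_size(227, 11, 4)          # -> 55
conv1_params = 11 * 11 * 3 * 96 + 96              # weights + biases ~= 35K

# pool1: 3x3, stride 2, no parameters
pool1_out = conv_output_size(55, 3, 2)            # -> 27

print(f"conv1 output: {conv1_out}x{conv1_out}x96, params: {conv1_params}")
print(f"pool1 output: {pool1_out}x{pool1_out}x96, params: 0")
```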
- first use of ReLU
- the network is split into two parts only because of limited GPU memory at the time
- 55 x 55 x 96 -> 55 x 55 x 48 x 2 (two GPUs)
- result
- still used in transfer learning tasks
- QA: why did this CNN beat all the other approaches? - no real idea, but it was the first deep-learning-based approach
- ZFNet is basically AlexNet with tuned hyperparameters like stride, # of filters, and so on
- 8 layers (AlexNet) -> 19 layers (VGG) -> 22 layers (GoogLeNet) -> 152 layers (ResNet, insane!)
- small filters, deeper networks, catching more details :p
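A rough illustration of the "small filters, deeper networks" point, assuming C channels in and out at every layer: three stacked 3x3 convs cover the same 7x7 receptive field as a single 7x7 conv, but with fewer parameters and more nonlinearities in between.

```python
# Sketch: parameter count of three stacked 3x3 convs vs one 7x7 conv,
# assuming C channels in and C channels out at every layer (biases ignored).
C = 256
stacked_3x3 = 3 * (3 * 3 * C * C)   # three 3x3 conv layers
single_7x7 = 7 * 7 * C * C          # one 7x7 conv layer
print(stacked_3x3, single_7x7)      # 1,769,472 vs 3,211,264
```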
- float32 uses 4 bytes per value
- roughly 100 MB of memory per image just for one forward pass; training needs even more because the backward pass reuses these activations
- 138M parameters (VGG-16), 60M (AlexNet)
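A back-of-the-envelope sketch of where numbers like these come from, assuming float32 (4 bytes per value); the parameter totals are the quoted ones and the conv1_1 shape is the standard VGG first-layer output.

```python
# Back-of-the-envelope memory math, assuming float32 (4 bytes per value).
BYTES_PER_FLOAT = 4

# Parameter memory: just the weights, independent of batch size.
vgg16_params = 138_000_000
alexnet_params = 60_000_000
print(f"VGG-16 weights:  {vgg16_params * BYTES_PER_FLOAT / 1e6:.0f} MB")
print(f"AlexNet weights: {alexnet_params * BYTES_PER_FLOAT / 1e6:.0f} MB")

# Activation memory example: VGG's first conv layer alone stores a
# 224x224x64 feature map per image.
conv1_1_floats = 224 * 224 * 64
print(f"VGG conv1_1 activations: {conv1_1_floats * BYTES_PER_FLOAT / 1e6:.1f} MB per image")
# Summing over all layers gives on the order of 100 MB per image for the
# forward pass; the backward pass needs these activations again during training.
```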
- QA: the depth of the network (number of layers) vs. the depth of an image / feature map (number of channels) are two different things
- QA: when people design a network, what do they base the depth on?
- basically, more computational resources give better results
- you could use pooling layers to decrease the parameters - maybe check 李弘毅's "Why Deep" lecture
- QA: we don't need to keep all of these in memory, right? - true, but we also run backprop to update the weights, so keeping them in memory makes the updates more efficient
- Some complexity analysis
- standard layer notation like "3x3 conv, 64", e.g. conv1-1 (3x3 filters, 64 kernels)
- some details
- fc7 is a good feature representation (often used in transfer learning)
- QA: what's the difference between localization and detection? localization assumes only one object, detection handles multiple objects
- New design - inception module
- no FC layer
- only 5M params, 12x fewer than AlexNet
- local network topology
- concatenate all filter outputs together depth-wise
- QA: if we just do this naively - it's computationally expensive
- so much computation that Google uses dimensionality reduction before doing the conv ops
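A minimal sketch of the naive inception idea in PyTorch: parallel conv/pool branches keep the same spatial size via padding and get concatenated depth-wise. The filter counts (128/192/96 on a 28x28x256 input) roughly follow the example used in the lecture; treat them as illustrative.

```python
import torch
import torch.nn as nn

# Naive inception-style module (sketch): run parallel branches on the same
# input, keep the spatial size with padding, and concatenate along the
# channel (depth) dimension.
class NaiveInception(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 128, kernel_size=1)              # 1x1 conv
        self.b2 = nn.Conv2d(in_ch, 192, kernel_size=3, padding=1)   # 3x3 conv
        self.b3 = nn.Conv2d(in_ch, 96,  kernel_size=5, padding=2)   # 5x5 conv
        self.b4 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # 3x3 pool

    def forward(self, x):
        # All branches keep the 28x28 spatial size; concat depth-wise.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 256, 28, 28)
print(NaiveInception()(x).shape)  # torch.Size([1, 672, 28, 28]) -- the depth keeps growing
```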
- each $f \times f$ conv is zero-padded so the spatial size is preserved - QA: what is a $1 \times 1$ conv?
- used properly, a 1x1 conv reduces the depth to a lower dimension!
- check the result! Google calls this a bottleneck layer
- QA: what do we lose by applying a 1x1 conv?
- basically no idea, but it helps the model work well
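A rough multiply-count comparison showing why the 1x1 bottleneck helps, using the same illustrative 28x28x256 input and filter counts as the sketch above; the bottleneck depth of 64 is an assumption in the spirit of the lecture example.

```python
# Rough multiply counts for one inception-style module on a 28x28x256 input.
# ops(conv) = H * W * out_ch * (k * k * in_ch)
def conv_ops(hw, out_ch, k, in_ch):
    return hw * hw * out_ch * k * k * in_ch

HW, IN = 28, 256

# Naive version: the 3x3 and 5x5 convs operate directly on 256 channels.
naive = (conv_ops(HW, 128, 1, IN)
         + conv_ops(HW, 192, 3, IN)
         + conv_ops(HW, 96, 5, IN))

# Bottleneck version: 1x1 convs first squeeze the depth down to 64.
B = 64
bottleneck = (conv_ops(HW, 128, 1, IN)                               # direct 1x1 branch
              + conv_ops(HW, B, 1, IN) + conv_ops(HW, 192, 3, B)     # 1x1 -> 3x3
              + conv_ops(HW, B, 1, IN) + conv_ops(HW, 96, 5, B)      # 1x1 -> 5x5
              + conv_ops(HW, B, 1, IN))                              # 1x1 after the pool branch

print(f"naive: {naive/1e6:.0f}M multiplies, bottleneck: {bottleneck/1e6:.0f}M multiplies")
```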
- 3 places to inject the training loss: the whole network's classifier, plus 2 auxiliary classifiers at lower layers
- the reason is that it is a deep network; injecting loss at a middle layer gives more gradient to the earlier layers, interesting :P
- 22 layers
- QAs: are the auxiliary outputs actually useful for the final classification?
- they do average all these losses coming out, but she's basically not sure, might check the paper
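A sketch of how the three losses might be combined during training; the 0.3 discount on the auxiliary losses is the weight I recall from the GoogLeNet paper (treat it as an assumption), and `main_logits` / `aux1_logits` / `aux2_logits` are hypothetical names for the three classifier heads.

```python
import torch.nn.functional as F

# Sketch: combine the main loss with the two auxiliary-classifier losses.
# The 0.3 weight follows my reading of the GoogLeNet paper (an assumption);
# main_logits / aux1_logits / aux2_logits are hypothetical outputs of the
# three classifier heads for the same batch of labels.
def googlenet_style_loss(main_logits, aux1_logits, aux2_logits, labels, aux_weight=0.3):
    main = F.cross_entropy(main_logits, labels)
    aux1 = F.cross_entropy(aux1_logits, labels)
    aux2 = F.cross_entropy(aux2_logits, labels)
    # The auxiliary losses only matter during training; at test time they are discarded.
    return main + aux_weight * (aux1 + aux2)
```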
- QAs: in the bottleneck layer, is it possible to use other dimensionality reduction techniques?
- yes, you could do that, but a 1x1 conv is really convenient here
- QAs: why do we need to inject gradient into the earlier layers?
- basically they have a vanishing-gradient problem even though they use a ??? activation function
- this network basically won everything: COCO, ILSVRC...
- what happens when we continue stacking deeper layers on a "plain" CNN?
- the deeper network does worse
- and it is not about overfitting, because the training error is also worse
- then the authors formed a hypothesis!
- a deep network should perform at least as well as a shallow one: copy the shallow network and make the remaining layers identity functions
- ok, how could we design the model so that it is easier to learn?
- QAs: when you use the word "residual", what are you talking about exactly?
- let $F(x)$ be a transformation of $x$, and suppose $H(x)$ is the transformation plus the input, i.e. $H(x) = F(x) + x$
- we want to learn something like $H(x)$, which is hard to learn directly
- how about we learn it partially? we learn $F(x)$, then add $x$ to get $H(x)$
- so the residual means $F(x)$, i.e. $F(x) = H(x) - x$
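A minimal residual block sketch in PyTorch that makes the $H(x) = F(x) + x$ structure concrete; the channel count and the two-conv layout are illustrative assumptions, not the exact ResNet block.

```python
import torch
import torch.nn as nn

# Minimal residual block sketch: the layers compute F(x), and the output is
# H(x) = F(x) + x, so the block only has to learn the residual F(x) = H(x) - x.
class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)  # F(x) + x, then a nonlinearity

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock()(x).shape)  # same shape as the input
```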
- QAs: in practice, do we still learn a weight here?
- ????? can't tell
- QAs: why is learning the residual easier?
- just their hypothesis, and it worked well; it also implies that most layers are close to the identity $x$
- it doesn't prove anything, just intuition and hypothesis
- QAs: have people tried other ways to combine the input and output of a layer?
- it's an active research area, and she basically doesn't know
- beat the human baseline; the human metric came from a grad student in 李飛飛's lab, who spent a whole week....
- all right, these are the main classification networks in use
- VGG is heavy!
- inference time
- the first appearance of the 1x1 conv, treated as an MLP layer
- how many layers / which layers should we skip to get more backprop information?
- a 50-layer wide ResNet outperforms the original 152-layer ResNet
- researchers are trying to figure out the connection between block size, depth of the network, and so on
- it has a flavor similar to the inception module
- stochastic depth!
- randomly drop a subset of layers during training and replace them with the identity
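A minimal sketch of the stochastic-depth idea: during training each residual block is dropped (replaced by the identity) with some probability, while at test time every block is used. The drop probability and block structure are illustrative assumptions, and the test-time scaling details vary by implementation.

```python
import torch
import torch.nn as nn

# Sketch of stochastic depth: during training, skip the residual branch with
# probability p_drop, so the block degenerates to the identity; at test time
# the full network is always used (scaling details vary by implementation).
class StochasticDepthBlock(nn.Module):
    def __init__(self, channels=64, p_drop=0.2):
        super().__init__()
        self.p_drop = p_drop
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        if self.training and torch.rand(1).item() < self.p_drop:
            return x                      # block dropped: pure identity
        return torch.relu(self.f(x) + x)  # normal residual block
```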
- Densely connected convolutional networks
- still tackling the vanishing-gradient problem, trying another approach: dense connections :P
- more use of the 1x1 conv: AlexNet-level accuracy with 1/50 the parameters
- An interesting thing: when VGG and GoogLeNet were published, batch normalization had not been invented yet
- training these relatively deep networks was very challenging
- VGG and GoogLeNet tried some hacky ways to get their deep models to converge
- VGG: 16- and 19-layer final versions
- they actually trained an 11-layer model first, because 11 layers would converge, then added layers in the middle to get the deeper versions to work
- GoogLeNet: uses auxiliary classifiers to inject gradient into the middle layers
- once you have batch normalization, you don't need these ugly hacks to get deeper models to converge
- funny skip-block design with two nice properties
- if we set all the weights in the residual block to zero, the block is completely the identity: $F(x) = 0$, so the output of $F(x) + x$ is just $x$, which makes it easy to see whether a layer is needed or not
- kind of like L2 regularization: driving the weights (and thus $F(x)$) toward zero means the optimization encourages the model to not use the layer and just use the identity
- good gradient flow in the backward pass
- with $H(x) = F(x) + x$, the local gradient is $\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + 1$, so the upstream gradient flows straight through the shortcut, like a gradient super highway, which lets us train much more easily and faster
- think about the gradient flow and it makes more sense
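A tiny autograd check of the "gradient super highway" claim: for $H(x) = F(x) + x$ the derivative is $\partial F/\partial x + 1$, so the upstream gradient reaches $x$ directly through the shortcut. The scalar toy residual branch is of course just an illustration.

```python
import torch

# Toy check of the gradient-flow argument: H(x) = F(x) + x, so
# dH/dx = dF/dx + 1 -- the "+ 1" is the shortcut carrying the upstream
# gradient straight back to x, no matter how small dF/dx is.
x = torch.tensor(2.0, requires_grad=True)

def F(x):
    return 0.01 * x ** 2   # a residual branch with a tiny local gradient

H = F(x) + x
H.backward()
print(x.grad)              # 0.01 * 2 * x + 1 = 1.04: dominated by the shortcut's +1
```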
- they add additional short-cut connections between layers
- these give direct gradient flow during backprop
- managing your gradient flow is super important everywhere in machine learning
- a lot of the params are in the FC layers
- which should be avoided