Resource

Stanford University https://www.youtube.com/watch?v=DAOcjicFr1Y

Deep Learning Frameworks

CNN architectures

LeNet

AlexNet - the first large CNN

  • beat the rest of the non-deep-learning-based methods on ImageNet

  • first conv layer output: $55 \times 55 \times 96$

  • parameters: $11 \times 11 \times 3 \times 96 \approx$ 35K (see the quick check after this list)

  • TODO: feature map depth? it depends on how many filters you have in the layer =)

  • pooling layer: $3 \times 3$, stride 2, output: $27 \times 27 \times 96$

  • parameters: 0

  • first use of ReLU

  • the network is split into two parts purely because of limited GPU memory at the time
  • 55 x 55 x 96 -> 55 x 55 x 48 x 2 (two GPUs)

  • result

  • still used in transfer learning tasks

  • QA: why did the CNN beat all of them? - no definitive answer, but it is the first deep-learning-based approach
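
A quick sanity check of the numbers above (a minimal sketch; the 227x227x3 input, stride 4, and zero padding are the commonly quoted AlexNet conv1 settings and are assumed here):

```python
# Sanity-check the AlexNet conv1 / pool1 numbers quoted above.
# Assumed setup: 227x227x3 input, 96 filters of size 11x11, stride 4, no padding.

def output_size(in_size, kernel, stride, pad=0):
    """Spatial output size of a conv or pooling layer."""
    return (in_size + 2 * pad - kernel) // stride + 1

# CONV1: 96 filters of 11x11x3, stride 4
conv1 = output_size(227, kernel=11, stride=4)
print(f"conv1 output: {conv1}x{conv1}x96")          # 55x55x96
print(f"conv1 params: {11 * 11 * 3 * 96:,}")        # 34,848 (~35K)

# POOL1: 3x3, stride 2 -- pooling layers have no parameters
pool1 = output_size(conv1, kernel=3, stride=2)
print(f"pool1 output: {pool1}x{pool1}x96")          # 27x27x96
```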

ZFNet - ILSVRC 2013 winner

  • ZFNet is basically AlexNet with tuned hyperparameters, like stride, # of filters, and so on

Trends

8 layers (AlexNet) -> 19 layers (VGG) -> 22 layers (GoogLeNet) -> 152 layers (ResNet, insane!)

VGG

  • small filters, deeper networks, catching more details :p

  • stacking small filters also gives the same effective receptive field as one big filter with fewer params - which is interesting (see the sketch at the end of this section)

  • each value is 4 bytes in float32 representation

  • roughly 100 MB of activation memory per image for a single forward pass

  • 138M parameters (VGG-16), 60M (AlexNet)

  • QA: "depth" of the network (number of layers) vs. depth of an image/feature map (number of channels) are different things

  • QA: when people design a network, what do they base the depth on?

    • basically, more computational resources gives better results
    • you could use pooling layers to decrease the params - maybe check 李弘毅's "Why Deep" lecture
  • QA: we don't need to keep all the params in memory, right? - true, but we also do backprop to update the weights, so keeping them in memory is more efficient

  • Some complexity analysis

  • standard layer notation like "3x3 conv, 64" or conv1-1 (3x3 filters, 64 kernels)

  • some details
  • fc7 is a good feature representation (may be used in transfer learning)
  • QA: localization vs. detection? localization assumes a single object, detection handles multiple objects
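
Why the small filters help (a minimal sketch; the channel count C is an assumed example value, kept the same for input and output): three stacked 3x3 convs cover the same 7x7 effective receptive field as a single 7x7 conv, but with fewer parameters and more nonlinearities in between.

```python
# Compare one 7x7 conv vs. three stacked 3x3 convs, C channels in and out.

def stacked_receptive_field(kernel, num_layers):
    """Effective receptive field of stacked stride-1 convs with the same kernel size."""
    return 1 + num_layers * (kernel - 1)

C = 256  # assumed channel count
print("receptive field of 3 stacked 3x3 convs:", stacked_receptive_field(3, 3))  # 7
print("params, one 7x7 conv:    ", 7 * 7 * C * C)          # 49 * C^2
print("params, three 3x3 convs: ", 3 * (3 * 3 * C * C))    # 27 * C^2
```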

GoogLeNet

  • New design - inception module
  • no FC layer
  • only 5M params, 12x fewer than AlexNet

Inception module

  • local network topology ("network within a network")

  • concatenate all filter outputs together depth-wise
  • QA: if we want to do this naively - it is computationally expensive

  • that is a lot of computation - Google uses dimensionality reduction before the expensive conv ops
  • the $f \times f$ convs are zero-padded so the spatial size is preserved for the depth-wise concatenation
  • QA: what is a $1 \times 1$ conv?
  • used properly, a 1x1 conv reduces the depth (number of channels) to a lower dimension!

  • check the result (and the rough count after this list)! Google calls these bottleneck layers

  • QA: what do we lose by applying a 1x1 conv?
    • no definitive answer, but it helps the model work well

  • 3 places to compute a loss: the end of the whole network, plus 2 auxiliary classifiers at lower layers
  • The reason is that it is a deep network; injecting a loss at the middle layers gives more gradient to the earlier layers, interesting :P
  • 22 layers
  • QAs: are the auxiliary outputs actually useful for final classification?
    • they do average all these losses coming out, but basically not sure, might check the paper
  • QAs: in the bottleneck layer, is it possible to use another dimensionality reduction technique?
    • yes, you could do that, but a 1x1 conv is really convenient here
  • QAs: why do we need to inject gradient into the earlier layers?
    • basically they have a gradient vanishing problem even though they use a ??? activation func
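
To see why the bottleneck matters, here is a rough multiply-add count for a single 5x5 branch, with and without a 1x1 reduction in front (a minimal sketch; the 28x28x256 input and the 64/96 filter counts are assumed example numbers, not the exact GoogLeNet configuration):

```python
# Rough multiply-add count for a 5x5 branch of an inception-style module.

def conv_madds(h, w, c_in, c_out, k):
    """Multiply-adds for a k x k conv producing an h x w x c_out output."""
    return h * w * c_out * k * k * c_in

H = W = 28
C_IN, C_MID, C_OUT = 256, 64, 96   # assumed example sizes

direct = conv_madds(H, W, C_IN, C_OUT, k=5)
with_bottleneck = (conv_madds(H, W, C_IN, C_MID, k=1)      # 1x1 "bottleneck" reduce
                   + conv_madds(H, W, C_MID, C_OUT, k=5))  # then the 5x5 conv

print(f"5x5 conv directly:         {direct:,}")            # ~482M
print(f"1x1 reduce, then 5x5 conv: {with_bottleneck:,}")   # ~133M
```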

ResNet

  • this network basically won everything: COCO, ILSVRC, ...
  • what happens when we continue stacking deeper layers on a "plain" CNN?

  • the deeper network does worse
  • and it is not about overfitting, because the training error is also worse
  • then the creators had a hypothesis!

  • a deep network should perform at least as well as a shallower one: copy the shallow network's layers and make the remaining layers the identity function

  • OK, so how can we design our model so that this is easier to learn?

  • QAs: when you use the word "residual", what are you talking about exactly?

  • $F(x)$ is a transformation of $x$

  • suppose $H(x)$ is the transformation plus the input, i.e. $H(x) = F(x) + x$

  • we want to learn the full mapping $H(x)$ directly, which is hard to learn

  • how about we learn it partially? we learn only $F(x)$

  • then, adding $x$, we get $H(x)$

  • so the residual means $F(x) = H(x) - x$

  • QAs: in practice, do we still learn weights?

    • ????? can't tell
  • QAs: why is learning the residual easier?

    • just their hypothesis, and it worked well; it also implies that most layers are close to the identity (output ≈ $x$)
    • it is not proving anything, just intuition and a hypothesis
  • QAs: have people tried other ways to combine the input and the output of a layer?

    • it's an active research area, and basically she (the lecturer) doesn't know

  • they also use 1x1 convs inside the block to control computational complexity (a minimal block sketch follows this list)

  • beat human performance; the human baseline came from a grad student in 李飛飛's (Fei-Fei Li's) lab, who spent an entire week....

  • all right, these are the main classification networks in use
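
A minimal sketch of a basic residual block, assuming a PyTorch-style implementation; the real ResNet also uses bottleneck blocks (with the 1x1 convs mentioned above) and projection shortcuts when the spatial size or channel count changes.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Residual block: output = ReLU(F(x) + x)."""

    def __init__(self, channels):
        super().__init__()
        # F(x): two 3x3 convs with batch norm; padding=1 keeps the spatial size,
        # so the addition with the identity shortcut is shape-compatible.
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)  # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)           # (batch, channels, H, W)
print(BasicResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```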

Comparing complexity

  • VGG is heavy!

  • inference time

Other architectures to know

NIN (Network in Network)

start with 53

  • the first appearance of the 1x1 conv, which is treated as an MLP layer applied at each spatial position

Improving ResNet

  • how many layers / which layers should the skip connection span to get more backprop information?

  • Wide residual blocks!

  • a 50-layer wide ResNet outperforms the 152-layer original ResNet

  • researchers are trying to figure out the relationship between block width/size, network depth, and so on

  • this has a similar flavor to the inception module

  • stochastic depth!
  • randomly drop a subset of layers during training, replacing them with the identity (see the sketch below)
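
A minimal sketch of the stochastic-depth idea for a single residual block (assuming PyTorch; `residual_fn` and the survival probability are illustrative placeholders, and the test-time rescaling follows the usual formulation of the paper):

```python
import torch

def stochastic_depth_block(x, residual_fn, survival_prob=0.8, training=True):
    """One residual block with stochastic depth: sometimes skip F(x) entirely."""
    if training:
        if torch.rand(1).item() > survival_prob:
            return x                       # block dropped: identity only
        return x + residual_fn(x)          # block kept: usual residual update
    # at test time every block is active; scale the residual by its survival prob
    return x + survival_prob * residual_fn(x)

# Toy usage with a stand-in residual function
x = torch.randn(1, 64, 8, 8)
out = stochastic_depth_block(x, residual_fn=lambda t: 0.1 * t)
print(out.shape)  # torch.Size([1, 64, 8, 8])
```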

other architectures

FractalNet

DenseNet

  • Densely connected convolutional network
  • still addressing the vanishing-gradient problem, trying another approach: dense connections :P (a tiny dense-block sketch follows)
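
A minimal sketch of a dense block (assuming PyTorch; the growth rate and layer count are assumed example values): each layer receives the concatenation of the block input and all earlier layers' outputs.

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Dense block: layer i sees the concatenation of all previous feature maps."""

    def __init__(self, in_channels, growth_rate=12, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            c_in = in_channels + i * growth_rate  # channels grow as outputs accumulate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(c_in),
                nn.ReLU(inplace=True),
                nn.Conv2d(c_in, growth_rate, kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # dense connectivity
        return torch.cat(features, dim=1)

x = torch.randn(1, 16, 32, 32)
print(TinyDenseBlock(16)(x).shape)  # torch.Size([1, 52, 32, 32])  (16 + 3*12)
```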

Efficient networks

  • more use of 1x1 convs (SqueezeNet): AlexNet-level accuracy at 1/50 the size

Recap

  • introduced by Justin Johnson

  • An interesting thing is:

  • when VGG and GoogLeNet were published, batch normalization had not been invented yet

  • training these relatively deep networks was very challenging

  • so VGG and GoogLeNet used some hacky ways to get their deep models to converge

VGG (2014)

  • the final versions have 16 and 19 layers
  • they actually first trained an 11-layer model, because 11 layers would converge, and then added extra layers in the middle to make the deeper versions work

GoogLeNet

  • uses auxiliary classifiers to inject gradient into the middle layers

Batch normalization

  • once you have this technique, you don't need these ugly hacks to get deeper models to converge

ResNet

  • the funny skip-connection block design has two nice properties:
    • if we set all the weights in the residual block to zero, the block computes exactly the identity - this makes it easy for the model to "decide" whether a layer is needed: $F(x) = 0$ means the output is just $x$
      • with L2 regularization (which drives weights toward zero), zero weights in $F(x) + x$ just mean the block becomes the identity
      • driving $F(x)$ toward zero means the optimization encourages the model not to use the layer and just pass the identity through
    • good gradient flow in the backward pass
      • since $H(x) = F(x) + x$, the gradient is $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial H}\left(\frac{\partial F}{\partial x} + 1\right)$, so the upstream gradient flows through the identity path unchanged (derivation sketched below)
      • this feeds more gradient back, like a gradient super-highway, which lets us train much more easily and quickly
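
A quick check of the gradient claim above, written out for one block (a sketch using scalar-style derivatives; real blocks use Jacobians, but the structure is the same):

$$
H(x) = F(x) + x
\;\Longrightarrow\;
\frac{\partial L}{\partial x}
= \frac{\partial L}{\partial H}\,\frac{\partial H}{\partial x}
= \frac{\partial L}{\partial H}\left(\frac{\partial F}{\partial x} + 1\right)
= \underbrace{\frac{\partial L}{\partial H}\,\frac{\partial F}{\partial x}}_{\text{residual path}}
+ \underbrace{\frac{\partial L}{\partial H}}_{\text{identity "highway"}}
$$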

DenseNet and FractalNet

  • if you think about gradient flow, they make more sense
  • they add additional shortcut connections between layers
  • these give direct gradient flow during backprop

Sum up

  • managing your gradient flow is super important everywhere in machine learning

Parameters

  • a lot of the params are in the FC layers
  • which should be avoided (see the quick calculation below)
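
A quick back-of-the-envelope calculation of why the FC layers dominate (a minimal sketch; the layer shapes are the commonly published VGG-16 ones and are assumed here, bias terms ignored):

```python
# Rough parameter counts for a VGG-16-style conv layer vs. its FC layers.

def conv_params(k, c_in, c_out):
    """Parameters of a k x k conv layer: one k*k*c_in kernel per output channel."""
    return k * k * c_in * c_out

def fc_params(n_in, n_out):
    """Parameters of a fully connected layer (weights only)."""
    return n_in * n_out

print(f"conv3-512 (512 -> 512):  {conv_params(3, 512, 512):>12,}")      # ~2.4M
print(f"fc6 (7*7*512 -> 4096):   {fc_params(7 * 7 * 512, 4096):>12,}")  # ~102.8M
print(f"fc7 (4096 -> 4096):      {fc_params(4096, 4096):>12,}")         # ~16.8M
```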

Other Resources