Stanford University https://www.youtube.com/watch?v=DAOcjicFr1Y
- first layer output: $55 \times 55 \times 96$
- parameters: $11 \times 11 \times 3 \times 96 \approx 35K$
- TODO feature map size? the output depth depends on how many filters you have in the layer =)
- Pooling layer: $3 \times 3$, stride 2, output: $27 \times 27 \times 96$
- parameters: 0
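A quick sanity check of the numbers above, as a minimal sketch. It assumes the usual AlexNet conv1 setup (a 227x227x3 input, 96 11x11 filters, stride 4, no padding) and the 3x3 stride-2 max pool; the helper name is mine.

```python
# Minimal sketch: verify the conv1 / pool1 shapes and parameter counts quoted above.
# Assumes a 227x227x3 input, 96 filters of size 11x11 with stride 4,
# then a 3x3 max pool with stride 2.

def conv_output_size(w: int, f: int, stride: int, pad: int = 0) -> int:
    """Spatial output size of a conv/pool layer: (W - F + 2P) / S + 1."""
    return (w - f + 2 * pad) // stride + 1

# conv1: 11x11x3 filters, 96 of them, stride 4
conv1_out = conv_output_size(227, 11, 4)          # -> 55
conv1_params = 11 * 11 * 3 * 96 + 96              # weights + biases ~= 35K

# pool1: 3x3, stride 2, no parameters
pool1_out = conv_output_size(55, 3, 2)            # -> 27

print(f"conv1 output: {conv1_out}x{conv1_out}x96, params: {conv1_params}")
print(f"pool1 output: {pool1_out}x{pool1_out}x96, params: 0")
```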
- first use of ReLU
- the network is split into two parts only because of limited GPU memory at the time
- 55 x 55 x 96 -> 55 x 55 x 48 x 2 (two GPUs)
- result
- still used in transfer learning tasks
- QA: why did this CNN beat all the other approaches? - no real idea, but it was the first deep-learning-based approach
- ZFNet is basically AlexNet with tuned hyperparameters like stride, # of filters, and so on
- 8 layers (AlexNet) -> 19 layers (VGG) -> 22 layers (GoogLeNet) -> 152 layers (ResNet, insane!)
- small filters, deeper networks, catching more details :p
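A rough illustration of the "small filters, deeper networks" point, assuming C channels in and out at every layer: three stacked 3x3 convs cover the same 7x7 receptive field as a single 7x7 conv, but with fewer parameters and more nonlinearities in between.

```python
# Sketch: parameter count of three stacked 3x3 convs vs one 7x7 conv,
# assuming C channels in and C channels out at every layer (biases ignored).
C = 256
stacked_3x3 = 3 * (3 * 3 * C * C)   # three 3x3 conv layers
single_7x7 = 7 * 7 * C * C          # one 7x7 conv layer
print(stacked_3x3, single_7x7)      # 1,769,472 vs 3,211,264
```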
- float32 uses 4 bytes per value
- roughly 100 MB of memory per image just for one forward pass; training needs even more because the backward pass reuses these activations
- 138M parameters (VGG-16), 60M (AlexNet)
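A back-of-the-envelope sketch of where numbers like these come from, assuming float32 (4 bytes per value); the parameter totals are the quoted ones and the conv1_1 shape is the standard VGG first-layer output.

```python
# Back-of-the-envelope memory math, assuming float32 (4 bytes per value).
BYTES_PER_FLOAT = 4

# Parameter memory: just the weights, independent of batch size.
vgg16_params = 138_000_000
alexnet_params = 60_000_000
print(f"VGG-16 weights:  {vgg16_params * BYTES_PER_FLOAT / 1e6:.0f} MB")
print(f"AlexNet weights: {alexnet_params * BYTES_PER_FLOAT / 1e6:.0f} MB")

# Activation memory example: VGG's first conv layer alone stores a
# 224x224x64 feature map per image.
conv1_1_floats = 224 * 224 * 64
print(f"VGG conv1_1 activations: {conv1_1_floats * BYTES_PER_FLOAT / 1e6:.1f} MB per image")
# Summing over all layers gives on the order of 100 MB per image for the
# forward pass; the backward pass needs these activations again during training.
```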
- QA: the depth of the network (number of layers) vs. the depth of an image / feature map (number of channels) are two different things
- QA: when people design a network, what do they base the depth on?
- basically, more computational resources give better results
- you could use pooling layers to decrease the parameters - maybe check 李弘毅's "Why Deep" lecture
- QA: we don't need to keep all of these in memory, right? - true, but we also run backprop to update the weights, so keeping them in memory makes the updates more efficient
- Some complexity analysis
- standard layer notation like "3x3 conv, 64", e.g. conv1-1 (3x3 filters, 64 kernels)
- some details
- fc7 is a good feature representation (often used in transfer learning)
- QA: what's the difference between localization and detection? localization assumes only one object, detection handles multiple objects
- New design - inception module
- no FC layer
- only 5M params, 12x fewer than AlexNet
- local network topology
- concatenate all filter outputs together depth-wise
- QA: if we just do this naively - it's computationally expensive
- so much computation that Google uses dimensionality reduction before doing the conv ops
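A minimal sketch of the naive inception idea in PyTorch: parallel conv/pool branches keep the same spatial size via padding and get concatenated depth-wise. The filter counts (128/192/96 on a 28x28x256 input) roughly follow the example used in the lecture; treat them as illustrative.

```python
import torch
import torch.nn as nn

# Naive inception-style module (sketch): run parallel branches on the same
# input, keep the spatial size with padding, and concatenate along the
# channel (depth) dimension.
class NaiveInception(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 128, kernel_size=1)              # 1x1 conv
        self.b2 = nn.Conv2d(in_ch, 192, kernel_size=3, padding=1)   # 3x3 conv
        self.b3 = nn.Conv2d(in_ch, 96,  kernel_size=5, padding=2)   # 5x5 conv
        self.b4 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # 3x3 pool

    def forward(self, x):
        # All branches keep the 28x28 spatial size; concat depth-wise.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 256, 28, 28)
print(NaiveInception()(x).shape)  # torch.Size([1, 672, 28, 28]) -- the depth keeps growing
```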
- each $f \times f$ conv is zero-padded so the spatial size is preserved - QA: what is a $1 \times 1$ conv?
- used properly, a 1x1 conv reduces the depth to a lower dimension!
- check the result! Google calls this a bottleneck layer
- QA: what do we lose by applying a 1x1 conv?
- basically no idea, but it helps the model work well
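A rough multiply-count comparison showing why the 1x1 bottleneck helps, using the same illustrative 28x28x256 input and filter counts as the sketch above; the bottleneck depth of 64 is an assumption in the spirit of the lecture example.

```python
# Rough multiply counts for one inception-style module on a 28x28x256 input.
# ops(conv) = H * W * out_ch * (k * k * in_ch)
def conv_ops(hw, out_ch, k, in_ch):
    return hw * hw * out_ch * k * k * in_ch

HW, IN = 28, 256

# Naive version: the 3x3 and 5x5 convs operate directly on 256 channels.
naive = (conv_ops(HW, 128, 1, IN)
         + conv_ops(HW, 192, 3, IN)
         + conv_ops(HW, 96, 5, IN))

# Bottleneck version: 1x1 convs first squeeze the depth down to 64.
B = 64
bottleneck = (conv_ops(HW, 128, 1, IN)                               # direct 1x1 branch
              + conv_ops(HW, B, 1, IN) + conv_ops(HW, 192, 3, B)     # 1x1 -> 3x3
              + conv_ops(HW, B, 1, IN) + conv_ops(HW, 96, 5, B)      # 1x1 -> 5x5
              + conv_ops(HW, B, 1, IN))                              # 1x1 after the pool branch

print(f"naive: {naive/1e6:.0f}M multiplies, bottleneck: {bottleneck/1e6:.0f}M multiplies")
```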
- 3 places to inject the training loss: the whole network's classifier, plus 2 auxiliary classifiers at lower layers
- the reason is that it is a deep network; injecting loss at a middle layer gives more gradient to the earlier layers, interesting :P
- 22 layers
- QAs: are the auxiliary outputs actually useful for the final classification?
- they do average all these losses coming out, but she's basically not sure, might check the paper
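A sketch of how the three losses might be combined during training; the 0.3 discount on the auxiliary losses is the weight I recall from the GoogLeNet paper (treat it as an assumption), and `main_logits` / `aux1_logits` / `aux2_logits` are hypothetical names for the three classifier heads.

```python
import torch.nn.functional as F

# Sketch: combine the main loss with the two auxiliary-classifier losses.
# The 0.3 weight follows my reading of the GoogLeNet paper (an assumption);
# main_logits / aux1_logits / aux2_logits are hypothetical outputs of the
# three classifier heads for the same batch of labels.
def googlenet_style_loss(main_logits, aux1_logits, aux2_logits, labels, aux_weight=0.3):
    main = F.cross_entropy(main_logits, labels)
    aux1 = F.cross_entropy(aux1_logits, labels)
    aux2 = F.cross_entropy(aux2_logits, labels)
    # The auxiliary losses only matter during training; at test time they are discarded.
    return main + aux_weight * (aux1 + aux2)
```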
- QAs: in the bottleneck layer, is it possible to use other dimensionality reduction techniques?
- yes, you could do that, but a 1x1 conv is really convenient here
- QAs: why do we need to inject gradient into the earlier layers?
- basically they have a vanishing-gradient problem even though they use a ??? activation function
- this network basically won everything: COCO, ILSVRC...
- what happens when we continue stacking deeper layers on a "plain" CNN?
- the deeper network does worse
- and it is not about overfitting, because the training error is also worse
- then the authors formed a hypothesis!
- a deep network should perform at least as well as a shallow one: copy the shallow network and make the remaining layers identity functions
- ok, how could we design the model so that it is easier to learn?
- QAs: when you use the word "residual", what are you talking about exactly?
- let $F(x)$ be a transformation of $x$, and suppose $H(x)$ is the transformation plus the input, i.e. $H(x) = F(x) + x$
- we want to learn something like $H(x)$, which is hard to learn directly
- how about we learn it partially? we learn $F(x)$, then add $x$ to get $H(x)$
- so the residual means $F(x)$, i.e. $F(x) = H(x) - x$
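A minimal residual block sketch in PyTorch that makes the $H(x) = F(x) + x$ structure concrete; the channel count and the two-conv layout are illustrative assumptions, not the exact ResNet block.

```python
import torch
import torch.nn as nn

# Minimal residual block sketch: the layers compute F(x), and the output is
# H(x) = F(x) + x, so the block only has to learn the residual F(x) = H(x) - x.
class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return torch.relu(self.f(x) + x)  # F(x) + x, then a nonlinearity

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock()(x).shape)  # same shape as the input
```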
- QAs: in practice, do we still learn a weight here?
- ????? can't tell
- QAs: why is learning the residual easier?
- just their hypothesis, and it worked well; it also implies that most layers are close to the identity $x$
- it doesn't prove anything, just intuition and hypothesis
- QAs: have people tried other ways to combine the input and output of a layer?
- it's an active research area, and she basically doesn't know
- beat the human baseline; the human metric came from a grad student in 李飛飛's lab, who spent a whole week....
- all right, these are the main classification networks in use
- VGG is heavy!
- inference time
- the first appearance of the 1x1 conv, treated as an MLP layer
- how many layers / which layers should we skip to get more backprop information?
- a 50-layer wide ResNet outperforms the original 152-layer ResNet
- researchers are trying to figure out the connection between block size, depth of the network, and so on
- it has a flavor similar to the inception module
- stochastic depth!
- randomly drop a subset of layers during training and replace them with the identity
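A minimal sketch of the stochastic-depth idea: during training each residual block is dropped (replaced by the identity) with some probability, while at test time every block is used. The drop probability and block structure are illustrative assumptions, and the test-time scaling details vary by implementation.

```python
import torch
import torch.nn as nn

# Sketch of stochastic depth: during training, skip the residual branch with
# probability p_drop, so the block degenerates to the identity; at test time
# the full network is always used (scaling details vary by implementation).
class StochasticDepthBlock(nn.Module):
    def __init__(self, channels=64, p_drop=0.2):
        super().__init__()
        self.p_drop = p_drop
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        if self.training and torch.rand(1).item() < self.p_drop:
            return x                      # block dropped: pure identity
        return torch.relu(self.f(x) + x)  # normal residual block
```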
- Densely connected convolutional networks
- still tackling the vanishing-gradient problem, trying another approach: dense connections :P
- more use of the 1x1 conv: AlexNet-level accuracy with 1/50 the parameters
- An interesting thing: when VGG and GoogLeNet were published, batch normalization had not been invented yet
- training these relatively deep networks was very challenging
- VGG and GoogLeNet tried some hacky ways to get their deep models to converge
- VGG: 16- and 19-layer final versions
- they actually trained an 11-layer model first, because 11 layers would converge, then added layers in the middle to get the deeper versions to work
- GoogLeNet: uses auxiliary classifiers to inject gradient into the middle layers
- once you have batch normalization, you don't need these ugly hacks to get deeper models to converge
- funny skip-block design with two nice properties
- if we set all the weights in the residual block to zero, the block is completely the identity: $F(x) = 0$, so the output of $F(x) + x$ is just $x$, which makes it easy to see whether a layer is needed or not
- kind of like L2 regularization: driving the weights (and thus $F(x)$) toward zero means the optimization encourages the model to not use the layer and just use the identity
- good gradient flow in the backward pass
- with $H(x) = F(x) + x$, the local gradient is $\frac{\partial H}{\partial x} = \frac{\partial F}{\partial x} + 1$, so the upstream gradient flows straight through the shortcut, like a gradient super highway, which lets us train much more easily and faster
- think about the gradient flow and it makes more sense
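A tiny autograd check of the "gradient super highway" claim: for $H(x) = F(x) + x$ the derivative is $\partial F/\partial x + 1$, so the upstream gradient reaches $x$ directly through the shortcut. The scalar toy residual branch is of course just an illustration.

```python
import torch

# Toy check of the gradient-flow argument: H(x) = F(x) + x, so
# dH/dx = dF/dx + 1 -- the "+ 1" is the shortcut carrying the upstream
# gradient straight back to x, no matter how small dF/dx is.
x = torch.tensor(2.0, requires_grad=True)

def F(x):
    return 0.01 * x ** 2   # a residual branch with a tiny local gradient

H = F(x) + x
H.backward()
print(x.grad)              # 0.01 * 2 * x + 1 = 1.04: dominated by the shortcut's +1
```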
- they add additional short-cut connections between layers
- these give direct gradient flow during backprop
- managing your gradient flow is super important everywhere in machine learning
- a lot of the params are in the FC layers
- which should be avoided