DAVIAN Lab. Deep Learning Winter Study (2021)

  • Writer: Sunjun Kweon

Information


Neural Network

  • Motivation: Linear classifiers are not very powerful; it is hard to classify data that cannot be separated by a single line (hyperplane).

  • Stacking multiple layers with non-linearity lets the network express more complex functions.

  • Instead of just the linear score s = W1x, a 2-layer neural network computes the score s = W2 f(W1x).
  • The function f, which introduces the non-linearity, is called the activation function. Popular choices include the sigmoid, tanh, and ReLU (a few are sketched in the code below).
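
A minimal NumPy sketch of the 2-layer score s = W2 f(W1x), with a few of the popular activation functions written out explicitly. The layer sizes, weights, and function names here are made up for illustration:

```python
import numpy as np

# Common activation functions (applied element-wise).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def two_layer_score(x, W1, W2, f=relu):
    """Score of a 2-layer network: s = W2 f(W1 x)."""
    return W2 @ f(W1 @ x)

# Arbitrary sizes for the example: 4-dim input, 5 hidden units, 3 classes.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((3, 5))

print(two_layer_score(x, W1, W2, f=relu))  # non-linear score
print(W2 @ W1 @ x)                         # without f, the layers collapse into one linear map
```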

Jacobian

- Consider a vector-valued function f(x) = (f1(x), f2(x), ..., fm(x)) and a small change Δx.

f(x+Δx) = [f1(x+Δx), f2(x+Δx), ..., fm(x+Δx)] ≈ f(x) + [∇f1(x), ..., ∇fm(x)]ᵀ Δx

The Jacobian of f at x, J(x), is the matrix whose i-th row is ∇fi(x)ᵀ, i.e. its (i, j) entry is ∂fi(x)/∂xj, so the expansion above reads f(x+Δx) ≈ f(x) + J(x)Δx.

- Example: R^m to R^n (y = Ax). The Jacobian is the n×m matrix A itself, since ∂yi/∂xj = Aij.

The Jacobian can be viewed as the generalization of the partial derivative to multidimensional mappings.
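
A quick numerical sanity check of this example (the sizes and step size h are arbitrary choices for the sketch): approximate the Jacobian of y = Ax column by column with finite differences and compare it to A.

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    """Finite-difference Jacobian of f at x: J[i, j] ≈ ∂f_i/∂x_j."""
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = h
        J[:, j] = (f(x + dx) - y) / h
    return J

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # maps R^4 -> R^3
x = rng.standard_normal(4)

J = numerical_jacobian(lambda v: A @ v, x)
print(np.allclose(J, A, atol=1e-4))   # True: the Jacobian of y = Ax is A
```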

- Example: R^(m×n) to R (y = f(X)). The derivative dy/dX is itself an m×n matrix (the matrix derivative) whose (i, j) entry is ∂y/∂Xij.
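
As an illustration, take the hypothetical choice f(X) = aᵀXb (not a function from the lecture, just a convenient example), whose matrix derivative is the outer product a bᵀ; the sketch below checks this entry-wise with finite differences.

```python
import numpy as np

def numerical_matrix_grad(f, X, h=1e-6):
    """Finite-difference derivative of a scalar f(X) w.r.t. each entry of X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            dX = np.zeros_like(X)
            dX[i, j] = h
            G[i, j] = (f(X + dX) - f(X)) / h
    return G

rng = np.random.default_rng(0)
a, b = rng.standard_normal(3), rng.standard_normal(4)
X = rng.standard_normal((3, 4))

f = lambda M: a @ M @ b                            # scalar-valued function of a matrix
G = numerical_matrix_grad(f, X)
print(np.allclose(G, np.outer(a, b), atol=1e-4))   # True: dy/dX = a bᵀ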

- Example: R^(m×n) to R^k (y = Wx, where W is an m×n matrix and x is an n-dim vector, viewed as a function of W). The Jacobian has dimension k×(m·n) with k = m, and its i-th row is xᵀ placed in the columns corresponding to the i-th row of W, with zeros elsewhere (since ∂yi/∂Wij = xj and ∂yi/∂Wlj = 0 for l ≠ i).
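
The sketch below (with arbitrary sizes) flattens W row by row and checks numerically that the i-th row of the Jacobian of y = Wx with respect to W is xᵀ sitting in the columns that belong to row i of W, and zero elsewhere.

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# y as a function of the flattened (row-major) weight matrix.
def y_of_w(w_flat):
    return w_flat.reshape(m, n) @ x

# Finite-difference Jacobian: shape (m, m*n).
h = 1e-6
w_flat = W.ravel()
J = np.zeros((m, m * n))
for j in range(m * n):
    dw = np.zeros(m * n)
    dw[j] = h
    J[:, j] = (y_of_w(w_flat + dw) - y_of_w(w_flat)) / h

# Expected: row i is zero except for x^T in columns [i*n, (i+1)*n).
expected = np.kron(np.eye(m), x)
print(np.allclose(J, expected, atol=1e-4))   # True
```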

Backpropagation

- Directly computing the gradient of the whole neural network at once is complicated, so we use backpropagation.

- Backpropagation comes from the chain rule.

When we want to calculate the gradient (Jacobian) of the loss with respect to a certain weight, we multiply the upstream gradient (which flows backward from the loss) by the local gradient. We then use the gradient descent algorithm to optimize the loss. (Note: the derivative can be either a gradient or a Jacobian, but it must have the same dimensions as the variable being updated.)

- Scalar example with a computational graph

q = x + y and f = q·z

df/dz can be calculated directly from f = q·z: df/dz = q.

For df/dx, the upstream gradient is df/dq = z and the local gradient is dq/dx = 1, so df/dx = z.

For df/dy, the upstream gradient is df/dq = z and the local gradient is dq/dy = 1, so df/dy = z.
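
The same example in code, with made-up input values: a forward pass through the graph, then a backward pass that multiplies the upstream gradient by the local gradient at each node.

```python
# Forward pass: q = x + y, f = q * z.
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass (upstream gradient * local gradient at each node).
df_df = 1.0                  # start from the output
df_dq = df_df * z            # local gradient of f = q*z w.r.t. q is z  -> -4
df_dz = df_df * q            # local gradient w.r.t. z is q             ->  3
df_dx = df_dq * 1.0          # local gradient of q = x+y w.r.t. x is 1  -> -4
df_dy = df_dq * 1.0          # local gradient w.r.t. y is 1             -> -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```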

- Backpropagation in a neural network

We have to compute the gradient with respect to Wn for the update and the gradient with respect to Xn to pass backward to the previous, (n-1)-th, layer; dL/dXn+1 is the upstream gradient received from the next, (n+1)-th, layer.
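
As an illustrative sketch (the layer convention Xn+1 = Xn Wn for a batch of row vectors is an assumption here, not necessarily the one in the lecture figure), the two products below give dL/dWn for the update and dL/dXn to pass backward:

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 8, 5, 3          # batch size and layer widths (arbitrary)

X_n = rng.standard_normal((B, d_in))
W_n = rng.standard_normal((d_in, d_out))
X_next = X_n @ W_n                # forward: X_{n+1} = X_n W_n

dL_dX_next = rng.standard_normal((B, d_out))   # upstream gradient from layer n+1

# Backward: multiply the upstream gradient by the local Jacobians.
dL_dW_n = X_n.T @ dL_dX_next      # same shape as W_n -> used for the weight update
dL_dX_n = dL_dX_next @ W_n.T      # same shape as X_n -> passed backward to layer n-1

print(dL_dW_n.shape, dL_dX_n.shape)   # (5, 3) (8, 5)
```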