- Writer: Sunjun Kweon
- Title: (cs231n) Lecture 4 : Introduction to Neural Networks
- Link: http://cs231n.stanford.edu/slides/2020/lecture_4.pdf
- Keywords: Neural Networks, Jacobians, Backpropagation
-
Motivation : Linear classifiers are not very powerful. It is hard to classify data that cannot be separated by a single line (hyperplane)
-
Stacking multiple layers with non-linearity lets the network express more complex functions
- Instead of just the linear score s=W1x, a 2-layer neural network's score is s=W2f(W1x) (see the sketch below)
- f, which introduces the non-linearity, is called the activation function. Popular choices for activation functions include sigmoid, tanh, and ReLU
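A minimal numpy sketch of this score function (my own illustration, not from the slides), using ReLU as the activation f and arbitrary toy shapes:

```python
import numpy as np

def two_layer_score(x, W1, W2):
    """Score of a 2-layer network: s = W2 f(W1 x), with f = ReLU."""
    h = np.maximum(0, W1 @ x)   # hidden activations after the non-linearity
    s = W2 @ h                  # class scores
    return s

# toy shapes: 4-dim input, 10 hidden units, 3 classes
rng = np.random.default_rng(0)
x  = rng.standard_normal(4)
W1 = rng.standard_normal((10, 4))
W2 = rng.standard_normal((3, 10))
print(two_layer_score(x, W1, W2).shape)  # (3,)
```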
- Consider a vector function f(x)=(f1(x), f2(x), ..., fm(x)) and a small change Δx.
f(x+Δx) = [f1(x+Δx), f2(x+Δx), ..., fm(x+Δx)] ≈ f(x) + [∇f1(x), ..., ∇fm(x)]^T Δx
The Jacobian of f at x is the matrix [∇f1(x), ..., ∇fm(x)]^T, whose (i, j) entry is ∂fi/∂xj.
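One way to sanity-check this definition is a finite-difference approximation, where row i of the result approximates ∇fi(x)^T. The helper below is my own illustrative sketch, not part of the lecture:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Approximate the Jacobian J of f at x, with J[i, j] = d f_i / d x_j."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(f(x), dtype=float)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)   # central difference
    return J

# example: f(x) = (x0 * x1, sin(x0)) from R^2 to R^2
f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
print(numerical_jacobian(f, [1.0, 2.0]))
# roughly [[2, 1], [cos(1), 0]]
```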
- Example : R^m to R^n (y=Ax)
The Jacobian is the generalization of the derivative to a multidimensional mapping; here the Jacobian dy/dx is simply A (an n×m matrix).
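A quick numerical check of this fact (my own sketch; the 3×5 shape is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))   # A maps R^5 -> R^3 (m = 5, n = 3)
x = rng.standard_normal(5)

# finite-difference Jacobian of y = A x with respect to x
eps = 1e-6
J = np.zeros((3, 5))
for j in range(5):
    dx = np.zeros(5)
    dx[j] = eps
    J[:, j] = (A @ (x + dx) - A @ (x - dx)) / (2 * eps)

print(np.allclose(J, A))  # True: the Jacobian of y = Ax is A itself
```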
- Example : R^(m×n) to R (y=f(X))
The derivative dy/dX is itself an m×n matrix (the matrix derivative), with (i, j) entry ∂y/∂Xij.
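A simple illustration of my own (not the slide's example): for y equal to the sum of squared entries of X, the matrix derivative is 2X and has the same shape as X:

```python
import numpy as np

X = np.arange(6, dtype=float).reshape(2, 3)

# y = f(X) = sum of squared entries, a map from R^(2x3) to R
y = np.sum(X ** 2)

# the matrix derivative dy/dX has the same shape as X; for this f it is 2X
dydX = 2 * X
print(y, dydX.shape)  # 55.0 (2, 3)
```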
- Example : R^(mn) to R^k (y=Wx, where W is an m×n matrix and x is an n-dim vector, so k=m)
The Jacobian of y with respect to the flattened W has dimension k×(m·n); its i-th row is ∂yi/∂W, which is x^T in the columns corresponding to the i-th row of W and zero elsewhere.
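A sketch of this structure (my own illustration): the flattened Jacobian equals the Kronecker product of the m×m identity with x^T, which the code checks against finite differences:

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Jacobian of y = W x with respect to the row-major flattened W: shape (m, m*n).
# Row i holds x^T in the block belonging to W's i-th row and zeros elsewhere,
# i.e. the Kronecker product of the m-by-m identity with x^T.
J = np.kron(np.eye(m), x)                     # shape (m, m*n)

# finite-difference check over the flattened W
eps = 1e-6
J_num = np.zeros((m, m * n))
for k in range(m * n):
    dW = np.zeros(m * n)
    dW[k] = eps
    y_plus = (W.flatten() + dW).reshape(m, n) @ x
    y_minus = (W.flatten() - dW).reshape(m, n) @ x
    J_num[:, k] = (y_plus - y_minus) / (2 * eps)

print(J.shape, np.allclose(J, J_num))  # (3, 12) True
```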
- Directly optimizing the whole neural network by deriving its gradient in one shot is complicated. Therefore we use backpropagation.
- Backpropagation comes from the chain rule
When we want to calculate the gradient (Jacobian) of the loss with respect to a certain weight, we multiply the upstream gradient (which flows backward from the loss) by the local gradient. Then we use the gradient descent algorithm to optimize the loss, as in the sketch below. (Note: the derivative can be either a gradient or a Jacobian, but it must have the same dimensions as the variable being updated.)
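A toy sketch of that loop (my own example, with an arbitrary linear score, squared loss, and learning rate):

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])   # fixed input
t = 1.0                                    # regression target
w = np.zeros(5)                            # weights to optimize
lr = 0.01                                  # learning rate

for _ in range(200):
    s = w @ x                # forward: score
    L = (s - t) ** 2         # forward: squared loss
    dL_ds = 2 * (s - t)      # upstream gradient, flowing back from the loss
    ds_dw = x                # local gradient of the score w.r.t. w
    dL_dw = dL_ds * ds_dw    # chain rule: upstream * local, same shape as w
    w -= lr * dL_dw          # gradient descent update

print(L)  # close to 0 after the updates
```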
-Scalar example with computational graph
q=x+y and f=q*z
df/dz can be directly calculated from f=q*z (df/dz = q)
df/dx's upstream gradient is df/dq (= z) and its local gradient is dq/dx (= 1)
df/dy's upstream gradient is df/dq (= z) and its local gradient is dq/dy (= 1)
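Putting this graph's forward and backward passes into code (the particular input values are just for illustration):

```python
# forward pass through the graph q = x + y, f = q * z
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass: each gradient = upstream gradient * local gradient
df_df = 1.0          # gradient at the output
df_dq = z * df_df    # local gradient of f = q*z w.r.t. q is z
df_dz = q * df_df    # local gradient of f = q*z w.r.t. z is q
df_dx = 1.0 * df_dq  # local gradient of q = x+y w.r.t. x is 1
df_dy = 1.0 * df_dq  # local gradient of q = x+y w.r.t. y is 1

print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```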
- Backpropagation in a neural network
We have to get the gradient of Wn for the weight update and the gradient of Xn to pass it backward to the previous layer; dL/dX(n+1) is the upstream gradient received from the next, i.e. (n+1)-th, layer (see the sketch below).
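A minimal sketch of one linear layer's backward pass, assuming Y = W @ X with one example per column of X (this layout is my assumption, not necessarily the slide's convention):

```python
import numpy as np

def linear_backward(W, X, dL_dY):
    """
    Backward pass of a linear layer Y = W @ X.
    W: (out, in) weights; X: (in, batch) inputs, one example per column;
    dL_dY: (out, batch) upstream gradient received from the next layer.
    """
    dL_dW = dL_dY @ X.T   # same shape as W -> used for the weight update
    dL_dX = W.T @ dL_dY   # same shape as X -> passed further backward
    return dL_dW, dL_dX

# toy usage
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 2))
dL_dY = rng.standard_normal((3, 2))
dL_dW, dL_dX = linear_backward(W, X, dL_dY)
print(dL_dW.shape, dL_dX.shape)  # (3, 4) (4, 2)
```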