@@ -119,7 +119,7 @@ Let's consider a network with a hidden layer.
The weight matrix of layer $L$ is denoted $W^{(L)}$, and its bias vector is denoted $B^{(L)}$. We choose the sigmoid function, denoted $\sigma$, as the activation function. The output vector of layer $L$ before activation is denoted $Z^{(L)}$; after activation it is denoted $A^{(L)}$. By convention, we write $A^{(0)}$ for the network input vector. Thus $Z^{(L+1)} = W^{(L+1)}A^{(L)} + B^{(L+1)}$ and $A^{(L+1)} = \sigma\left(Z^{(L+1)}\right)$. In our example, the output is $\hat{Y} = A^{(2)}$.
Let $Y$ be the desired output. We use mean squared error (MSE) as the cost function. Thus, the cost is $C = \frac{1}{N_{out}}\sum_{i=1}^{N_{out}} (\hat{y}_i - y_i)^2$, where $N_{out}$ is the number of output neurons.
-1. Prove that $\sigma' = \sigma \times (1-\sigma)$
+1. Prove that $`\sigma' = \sigma \times (1-\sigma)`$
2. Express $\frac{\partial C}{\partial A^{(2)}}$, i.e. the vector of the $\frac{\partial C}{\partial a^{(2)}_i}$, as a function of $A^{(2)}$ and $Y$.
3. Using the chain rule, express $\frac{\partial C}{\partial Z^{(2)}}$, i.e. the vector of the $\frac{\partial C}{\partial z^{(2)}_i}$, as a function of $\frac{\partial C}{\partial A^{(2)}}$ and $A^{(2)}$.
4. Similarly, express $\frac{\partial C}{\partial W^{(2)}}$, i.e. the matrix of the $\frac{\partial C}{\partial w^{(2)}_{i,j}}$, as a function of $\frac{\partial C}{\partial Z^{(2)}}$ and $A^{(1)}$.
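For reference, hedged sketches of the setup and of questions 1–4 follow. First, the forward pass and cost described above, as a minimal NumPy sketch under assumed shapes (3 inputs, 4 hidden units, 2 outputs, a single sample); all variable names here are illustrative and not taken from the exercise.

```python
import numpy as np

def sigmoid(z):
    """Elementwise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 3 inputs, 4 hidden units, 2 outputs.
W1, B1 = rng.normal(size=(4, 3)), np.zeros((4, 1))  # layer 1
W2, B2 = rng.normal(size=(2, 4)), np.zeros((2, 1))  # layer 2

A0 = rng.normal(size=(3, 1))  # A^(0): network input vector
Y = rng.normal(size=(2, 1))   # desired output

# Forward pass: Z^(L+1) = W^(L+1) A^(L) + B^(L+1), A^(L+1) = sigma(Z^(L+1)).
Z1 = W1 @ A0 + B1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + B2
A2 = sigmoid(Z2)  # Y_hat = A^(2)

# Cost: C = (1/N_out) * sum_i (y_hat_i - y_i)^2, i.e. the MSE.
C = np.mean((A2 - Y) ** 2)
print(C)
```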
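For question 1, one way the derivation can go (a sketch, not the official solution), assuming the standard definition $`\sigma(z) = \frac{1}{1+e^{-z}}`$:

```math
\sigma'(z) = \frac{e^{-z}}{\left(1+e^{-z}\right)^2}
           = \frac{1}{1+e^{-z}} \cdot \frac{e^{-z}}{1+e^{-z}}
           = \sigma(z)\bigl(1-\sigma(z)\bigr),
```

since $`\frac{e^{-z}}{1+e^{-z}} = 1 - \frac{1}{1+e^{-z}} = 1-\sigma(z)`$.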
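For question 2, a sketch: differentiating the MSE term by term, only the $i$-th summand depends on $`a^{(2)}_i = \hat{y}_i`$, so

```math
\frac{\partial C}{\partial a^{(2)}_i} = \frac{2}{N_{out}}\left(a^{(2)}_i - y_i\right)
\quad\Longrightarrow\quad
\frac{\partial C}{\partial A^{(2)}} = \frac{2}{N_{out}}\left(A^{(2)} - Y\right).
```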
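For question 3, since $`a^{(2)}_i = \sigma(z^{(2)}_i)`$, the chain rule combined with question 1 gives, componentwise and then vectorized with the Hadamard (elementwise) product $`\odot`$:

```math
\frac{\partial C}{\partial z^{(2)}_i} = \frac{\partial C}{\partial a^{(2)}_i}\,\sigma'\!\left(z^{(2)}_i\right)
\quad\Longrightarrow\quad
\frac{\partial C}{\partial Z^{(2)}} = \frac{\partial C}{\partial A^{(2)}} \odot A^{(2)} \odot \left(1 - A^{(2)}\right).
```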
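For question 4, since $`z^{(2)}_i = \sum_j w^{(2)}_{i,j} a^{(1)}_j + b^{(2)}_i`$, we have $`\partial z^{(2)}_i / \partial w^{(2)}_{i,j} = a^{(1)}_j`$, so the matrix of partials is an outer product:

```math
\frac{\partial C}{\partial w^{(2)}_{i,j}} = \frac{\partial C}{\partial z^{(2)}_i}\, a^{(1)}_j
\quad\Longrightarrow\quad
\frac{\partial C}{\partial W^{(2)}} = \frac{\partial C}{\partial Z^{(2)}} \left(A^{(1)}\right)^{\mathsf{T}}.
```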