Wednesday, October 26, 2016

Neural Networks and Deep Learning, 2

2. How the backpropagation algorithm works

The activation a^l_j of the j^{th} neuron in the l^{th} layer is related to the activations in the (l-1)^{th} layer by:
a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right)
Vectorization simplifies it to:
a^{l} = \sigma(w^l a^{l-1}+b^l)
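As a concrete illustration, here is a minimal NumPy sketch of the vectorized feedforward rule; the helper names `sigmoid` and `feedforward` and the list-of-matrices layout for `weights` and `biases` are my own choices, not from the book.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Apply a^l = sigma(w^l a^{l-1} + b^l) layer by layer.

    weights[l] has shape (n_l, n_{l-1}) and biases[l] has shape (n_l, 1),
    so the activation a stays a column vector throughout.
    """
    for w, b in zip(weights, biases):
        a = sigmoid(np.dot(w, a) + b)
    return a
```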
The goal of backpropagation is to compute the partial derivatives of the cost function with respect to any weight or bias in the network.
Two assumptions about the cost function are that it is:
  • an average over the cost functions for individual training examples
  • a function of the output activations of the neural network.
The error of a neuron is defined as:
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}
where z^l_j is the weighted input to the neuron. We use \frac{\partial C}{\partial z^l_j} rather than \frac{\partial C}{\partial a^l_j} because it makes the math simpler.

Backpropagation is based on 4 fundamental equations; a code sketch of all four follows the list.

  1. error in the output layer (the last layer): \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j); for the quadratic cost this becomes \delta^L = (a^L-y) \odot \sigma'(z^L)
  2. error in terms of the error in the next layer: \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)
  3. rate of change of the cost with respect to any bias: \frac{\partial C}{\partial b^l_j} = \delta^l_j
  4. rate of change of the cost with respect to weight: \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j
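To make the four equations concrete, here is a hedged NumPy sketch that computes them for a single training example, assuming the quadratic cost (so \partial C / \partial a^L = a^L - y, matching equation 1 above); the names `zs`, `activations`, and `backprop_deltas` are illustrative, not the book's.

```python
import numpy as np

def sigmoid_prime(z):
    """Derivative sigma'(z) of the logistic function."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backprop_deltas(weights, zs, activations, y):
    """Compute delta^l for every layer with equations 1 and 2,
    then the gradients via equations 3 and 4.

    zs[l] and activations[l] are the weighted inputs and activations
    stored during the forward pass; y is the desired output.
    Assumes the quadratic cost, so dC/da^L = a^L - y.
    """
    # Equation 1: error in the output layer.
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    deltas = [delta]
    # Equation 2: propagate the error backwards through the layers.
    for w, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        delta = np.dot(w.T, delta) * sigmoid_prime(z)
        deltas.insert(0, delta)
    # Equation 3: dC/db^l_j = delta^l_j.
    grad_b = deltas
    # Equation 4: dC/dw^l_{jk} = a^{l-1}_k delta^l_j, i.e. delta^l (a^{l-1})^T.
    grad_w = [np.dot(d, a.T) for d, a in zip(deltas, activations[:-1])]
    return grad_b, grad_w
```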
algorithm (a usage sketch follows this list):
  1. Input x: set the corresponding activation a^1 for the input layer
  2. Feedforward: compute the weighted inputs z^l and activations a^l for the remaining layers
  3. Output error: compute the error \delta^L in the last layer using equation 1
  4. Backpropagate the error: compute \delta^l for the earlier layers using equation 2
  5. Output: obtain the gradient of the cost function from equations 3 and 4.
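Putting the five steps together, a small usage sketch for one training example, reusing the hypothetical `sigmoid` and `backprop_deltas` helpers from the sketches above:

```python
import numpy as np

def backprop(x, y, weights, biases):
    """Steps 1-5 for one training example (x, y):
    feedforward while storing z^l and a^l, then backpropagate."""
    # Step 1: the input activation a^1 is just x.
    activation = x
    activations = [x]   # a^1, ..., a^L
    zs = []             # z^2, ..., z^L
    # Step 2: feedforward, storing the weighted inputs and activations.
    for w, b in zip(weights, biases):
        z = np.dot(w, activation) + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # Steps 3-5: output error, backpropagation, and the gradients.
    return backprop_deltas(weights, zs, activations, y)
```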
The backward movement is a consequence of the fact that the cost is a function of the outputs from the network.
The backpropagation algorithm is a clever way of keeping track of small perturbations to the weights/biases as they propagate through the network, reach the output, and then affect the cost.