Chain Rule · latentspace

forward propagation

loss calculation

back propagation

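The chain rule is the calculus rule for differentiating composite functions. If one variable depends on a second, and that second variable depends on a third, the chain rule gives the derivative of the first variable with respect to the third as the product of the intermediate derivatives.
In this example, the goal is to update the current \(w_{5}\) with gradient descent (updated \(w_{5}\) = current \( w_{5}\) \(- \, \eta \, \frac{\partial L}{\partial w_{5}} \)). The gradient \(\frac{\partial L}{\partial w_{5}}\) cannot be computed in a single step, because \(L\) depends on \(w_{5}\) only through \(\hat{y}\) and \(z_{3}\), so the chain rule splits it into three factors: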

\(\frac{\partial L}{\partial w_{5}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_{3}} \cdot \frac{\partial z_{3}}{\partial w_{5}} \)

\(L = (y - \hat{y})^2 \;\Rightarrow\; \frac{\partial L}{\partial \hat{y}} = 2(y - \hat{y}) \cdot (-1) \)
\(\\\)
\(\frac{\partial L}{\partial \hat{y}} = -2(1 - 0.645) = -0.71 \)
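
As a quick sanity check, this derivative can be compared against a central finite difference. The sketch below assumes the squared-error loss \(L = (y - \hat{y})^2\) with \(y = 1\) and \(\hat{y} = 0.645\) from the example; the function names and the step size `eps` are illustrative choices, not part of the original walkthrough.

```python
def loss(y, y_hat):
    """Squared error for a single example."""
    return (y - y_hat) ** 2


def dloss_dyhat(y, y_hat):
    """Analytic derivative of the squared error with respect to y_hat."""
    return -2.0 * (y - y_hat)


y, y_hat = 1.0, 0.645  # values from the worked example
eps = 1e-6             # finite-difference step (illustrative choice)

analytic = dloss_dyhat(y, y_hat)
# Central difference: (L(y_hat + eps) - L(y_hat - eps)) / (2 * eps)
numeric = (loss(y, y_hat + eps) - loss(y, y_hat - eps)) / (2 * eps)

print(round(analytic, 2))              # -0.71
print(abs(analytic - numeric) < 1e-6)  # True
```

If the two values disagree, the usual suspect is a dropped sign from the inner derivative of \((y - \hat{y})\).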

Then, to get \(\frac{\partial \hat{y}}{\partial z_{3}}\), we need the derivative of the sigmoid function \( \sigma(x) = \frac{1}{1 + e^{-x}} \), which follows from the quotient rule with \(f(x) = 1\) and \(g(x) = 1 + e^{-x}\):

\(\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) = \frac{f'(x) \cdot g(x) - f(x) \cdot g'(x)}{g(x)^2}\)
\(\frac{d}{dx} \sigma(x) = \frac{(0) \cdot (1 + e^{-x}) - (1) \cdot (-e^{-x})}{(1 + e^{-x})^2}\)
\(\frac{d}{dx} \sigma(x) = \frac{e^{-x}}{(1 + e^{-x})^2}\)
\( 1 + e^{-x} = \frac{1}{\sigma(x)} \)
\((1 + e^{-x})^2 = \left( \frac{1}{\sigma(x)} \right)^2 = \frac{1}{\sigma(x)^2}\)
\(\frac{e^{-x}}{(1 + e^{-x})^2} = \frac{e^{-x}}{\frac{1}{\sigma(x)^2}}\)
\(\frac{e^{-x}}{\frac{1}{\sigma(x)^2}} = e^{-x} \cdot \sigma(x)^2\)
\(e^{-x} = \frac{1 - \sigma(x)}{\sigma(x)}\)
\(\frac{d}{dx} \sigma(x) = \frac{1 - \sigma(x)}{\sigma(x)} \cdot \sigma(x)^2\)
\(\frac{d}{dx} \sigma(x) = \sigma(x) \cdot (1 - \sigma(x))\)

\(\frac{\partial \hat{y}}{\partial z_{3}} = \hat{y}(1 - \hat{y}) = 0.645(1 - 0.645) = 0.229\)
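
The identity \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\) derived above is easy to check numerically. The following is a minimal sketch: the helper names, the sample points, and the tolerance are assumptions made for illustration, and only \(\hat{y} = 0.645\) comes from the example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Closed-form derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


eps = 1e-6  # finite-difference step (illustrative choice)
max_error = 0.0
for x in (-2.0, 0.0, 0.5, 3.0):  # arbitrary sample points
    numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
    max_error = max(max_error, abs(numeric - sigmoid_grad(x)))

print(max_error < 1e-8)  # True

# The example already knows y_hat = sigma(z3), so no exponential is needed.
y_hat = 0.645
dyhat_dz3 = y_hat * (1.0 - y_hat)
print(round(dyhat_dz3, 3))  # 0.229
```

Writing the derivative in terms of \(\sigma(x)\) itself is what makes the backward pass cheap: the forward-pass output can be reused directly.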

Lastly, \(\frac{\partial z_{3}}{\partial w_{5}}\) is simple. Since \(z_3 = h_{1}w_{5} + h_{2}w_{6}\), the term \(h_{2}w_{6}\) does not depend on \(w_{5}\) and drops out, leaving only the coefficient of \(w_{5}\):

\(\frac{\partial z_{3}}{\partial w_{5}} = h_{1} \require{enclose}\enclose{horizontalstrike}{w_{5}} + \require{enclose}\enclose{horizontalstrike}{h_{2}w_{6}} = h_{1} = 0.615 \)

Now, using the chain rule, we can combine all the derivatives to calculate the final gradient:

\(\frac{\partial L}{\partial w_{5}} = -0.71 \cdot 0.229 \cdot 0.615 \approx -0.10\)
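
The product of the three factors can be reproduced in a few lines and cross-checked with a finite difference on the loss as a function of \(w_5\). This is a sketch under stated assumptions: \(y = 1\), \(\hat{y} = 0.645\) and \(h_1 = 0.615\) come from the example, while the starting value `w5 = 0.5` and the constant `rest` (standing in for \(h_{2}w_{6}\)) are hypothetical, chosen only so that the forward pass reproduces \(\hat{y} = 0.645\).

```python
import math


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


y, y_hat, h1 = 1.0, 0.645, 0.615  # values from the worked example

# The three chain-rule factors.
dL_dyhat = -2.0 * (y - y_hat)      # dL/dy_hat
dyhat_dz3 = y_hat * (1.0 - y_hat)  # dy_hat/dz3 (sigmoid derivative)
dz3_dw5 = h1                       # dz3/dw5
grad_w5 = dL_dyhat * dyhat_dz3 * dz3_dw5

# Hypothetical setup for the numerical check: w5 is a placeholder, and
# `rest` plays the role of h2 * w6 so that sigmoid(z3) equals y_hat.
w5 = 0.5
z3 = math.log(y_hat / (1.0 - y_hat))  # inverse sigmoid of y_hat
rest = z3 - h1 * w5


def loss_of_w5(w):
    """Loss as a function of w5 alone, everything else held fixed."""
    return (y - sigmoid(h1 * w + rest)) ** 2


eps = 1e-6  # finite-difference step (illustrative choice)
numeric = (loss_of_w5(w5 + eps) - loss_of_w5(w5 - eps)) / (2 * eps)

print(round(grad_w5, 3))              # -0.1
print(abs(grad_w5 - numeric) < 1e-7)  # True
```

The gradient does not depend on the placeholder `w5` here, because `rest` is adjusted to keep \(z_3\) fixed; only \(h_1\), \(y\) and \(\hat{y}\) enter the analytic result.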

This value can be used in gradient descent to update the weight: \( w_5 = w_5 - \eta \cdot \frac{\partial L}{\partial w_5} = 0.551 \). Because the gradient is negative, the update increases \(w_5\).
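
The update rule itself is a one-liner. In the sketch below, the function name, the starting weight `w5 = 0.5`, and the learning rate `eta = 0.5` are hypothetical placeholders for illustration; they are not taken from the example and are not meant to reproduce the value 0.551 exactly. Only the three gradient factors come from the text.

```python
def gradient_descent_step(weight, grad, learning_rate):
    """One gradient descent update: weight - learning_rate * grad."""
    return weight - learning_rate * grad


# Gradient assembled from the three factors in the worked example.
grad_w5 = -0.71 * 0.229 * 0.615

w5 = 0.5   # hypothetical current weight
eta = 0.5  # hypothetical learning rate

new_w5 = gradient_descent_step(w5, grad_w5, eta)

# A negative gradient means the loss decreases as w5 grows,
# so the update moves w5 upward.
print(new_w5 > w5)  # True
```

Keeping the step as a small pure function makes it straightforward to apply the same rule to the other weights once their gradients are known.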