Chain Rule
05/09/2023

Backpropagation (역전파) with chain rule

Let's say there's a very simple MLP with two inputs, \(x_{1} = 0.5\) and \(x_{2} = 0.3\), two hidden units with pre-activations \(z_{1}\), \(z_{2}\) and activations \(h_{1}\), \(h_{2}\), and one output unit with pre-activation \(z_{3}\) and activation \(\hat{y}\). The weights, as used in the forward pass below, are \(w_{1} = 0.7\), \(w_{2} = 0.3\), \(w_{3} = 0.4\), \(w_{4} = 0.6\), \(w_{5} = 0.55\) and \(w_{6} = 0.45\).

Here \( h_{1} \), \( h_{2}\) and \( \hat{y} \) use a sigmoid \(\sigma\) as the non-linear activation, and the loss function \(L\) is the mean squared error (MSE). The learning rate \(\eta\) is 0.1. The label \(y\) is 1.

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \quad \quad L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

Training a neural network can be divided into 3 steps. Iterating these steps is called training or optimization.

1. Forward propagation (순전파)

\( z_{1} = x_{1}w_{1} + x_{2}w_{3} = 0.5 \cdot 0.7 + 0.3 \cdot 0.4 = 0.47 \)
\( z_{2} = x_{1}w_{2} + x_{2}w_{4} = 0.5 \cdot 0.3 + 0.3 \cdot 0.6 = 0.33 \)
\( h_{1} = \sigma(0.47) \approx 0.615 \)
\( h_{2} = \sigma(0.33) \approx 0.582 \)
\( z_{3} = h_{1}w_{5} + h_{2}w_{6} = 0.615 \cdot 0.55 + 0.582 \cdot 0.45 \approx 0.6 \)
\( \hat{y} = \sigma(0.6) \approx 0.645 \)

2. Loss calculation

\( L = \frac{1}{1} \sum_{i=1}^1 (y_i - \hat{y}_i)^2 \)
\( L = (1 - 0.645)^2 \)
\( L \approx 0.126 \)

3. Back propagation (역전파)

The chain rule is the calculus rule for differentiating composite functions. If one variable depends on a second, and that second variable depends on a third, the chain rule gives the derivative of the first variable with respect to the third as the product of the two intermediate derivatives.
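Before applying the chain rule to this network, the forward pass and loss computed above can be reproduced with a short script. This is a minimal sketch in plain Python; the function names `sigmoid`, `forward` and `mse` are hypothetical helpers chosen for illustration, and the weights are the ones read off the forward-pass equations. Because the script does not round intermediate values, its results can differ from the hand calculation in the third decimal place.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(x1: float, x2: float, w: tuple) -> tuple:
    """Forward pass of the 2-2-1 MLP; returns every intermediate value."""
    w1, w2, w3, w4, w5, w6 = w
    z1 = x1 * w1 + x2 * w3      # pre-activation of hidden unit 1
    z2 = x1 * w2 + x2 * w4      # pre-activation of hidden unit 2
    h1, h2 = sigmoid(z1), sigmoid(z2)
    z3 = h1 * w5 + h2 * w6      # pre-activation of the output unit
    y_hat = sigmoid(z3)
    return z1, z2, h1, h2, z3, y_hat


def mse(y: float, y_hat: float) -> float:
    """MSE for a single sample (n = 1)."""
    return (y - y_hat) ** 2


# Values from the worked example: w1..w6, inputs x1, x2, and label y = 1.
weights = (0.7, 0.3, 0.4, 0.6, 0.55, 0.45)
z1, z2, h1, h2, z3, y_hat = forward(0.5, 0.3, weights)
loss = mse(1.0, y_hat)

print(f"z1={z1:.3f} z2={z2:.3f} h1={h1:.3f} h2={h2:.3f}")
print(f"z3={z3:.3f} y_hat={y_hat:.3f} loss={loss:.3f}")
```

Returning all intermediate values from `forward` is a deliberate choice here: back propagation reuses \(h_{1}\) and \(\hat{y}\), so keeping them avoids recomputing the forward pass.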
In this example, the current \(w_{5}\) is updated with gradient descent (updated \(w_{5}\) = current \( w_{5}\) \(- \, \eta \, \frac{\partial L}{\partial w_{5}} \)). The gradient \(\frac{\partial L}{\partial w_{5}}\) cannot be obtained in a single step, because \(L\) depends on \(w_{5}\) only through \(\hat{y}\) and \(z_{3}\), so the chain rule is used:

\(\frac{\partial L}{\partial w_{5}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_{3}} \cdot \frac{\partial z_{3}}{\partial w_{5}} \)

The first factor follows directly from \(L = (y - \hat{y})^2\) (with \(n = 1\)):

\(\frac{\partial L}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} (y - \hat{y})^2 = 2(y - \hat{y}) \cdot (-1) = -2(y - \hat{y}) \)
\(\frac{\partial L}{\partial \hat{y}} = -2(1 - 0.645) = -0.71 \)

Then, to get \(\frac{\partial \hat{y}}{\partial z_{3}}\), we need the derivative of the sigmoid function \( \sigma(x) \). Applying the quotient rule with \(f(x) = 1\) and \(g(x) = 1 + e^{-x}\):

\(\frac{d}{dx} \left( \frac{f(x)}{g(x)} \right) = \frac{f'(x) \cdot g(x) - f(x) \cdot g'(x)}{g(x)^2}\)
\(\frac{d}{dx} \sigma(x) = \frac{(0) \cdot (1 + e^{-x}) - (1) \cdot (-e^{-x})}{(1 + e^{-x})^2}\)
\(\frac{d}{dx} \sigma(x) = \frac{e^{-x}}{(1 + e^{-x})^2}\)

This can be rewritten in terms of \(\sigma(x)\) itself. From the definition of the sigmoid:

\( 1 + e^{-x} = \frac{1}{\sigma(x)} \)
\((1 + e^{-x})^2 = \left( \frac{1}{\sigma(x)} \right)^2 = \frac{1}{\sigma(x)^2}\)
\(\frac{e^{-x}}{(1 + e^{-x})^2} = \frac{e^{-x}}{\frac{1}{\sigma(x)^2}} = e^{-x} \cdot \sigma(x)^2\)

Solving \( 1 + e^{-x} = \frac{1}{\sigma(x)} \) for \(e^{-x}\) and substituting:

\(e^{-x} = \frac{1 - \sigma(x)}{\sigma(x)}\)
\(\frac{d}{dx} \sigma(x) = \frac{1 - \sigma(x)}{\sigma(x)} \cdot \sigma(x)^2\)
\(\frac{d}{dx} \sigma(x) = \sigma(x) \cdot (1 - \sigma(x))\)

Since \(\hat{y} = \sigma(z_3)\), we can substitute \(\hat{y}\) into the result:

\(\frac{\partial \hat{y}}{\partial z_{3}} = \hat{y}(1 - \hat{y}) = 0.645(1 - 0.645) = 0.229\)

Lastly, \(\frac{\partial z_{3}}{\partial w_{5}}\) is simple.
Since \(z_3 = h_{1}w_{5} + h_{2}w_{6}\), and the term \(h_{2}w_{6}\) does not depend on \(w_{5}\), only the coefficient of \(w_{5}\) remains:

\(\frac{\partial z_{3}}{\partial w_{5}} = \frac{\partial}{\partial w_{5}} \left( h_{1}w_{5} + h_{2}w_{6} \right) = h_{1} = 0.615 \)

Now, using the chain rule, we can combine all the derivatives to calculate the final gradient:

\(\frac{\partial L}{\partial w_{5}} = -0.71 \cdot 0.229 \cdot 0.615 \approx -0.1\)

This value is used in gradient descent to update the weight:

\( w_5 = w_5 - \eta \cdot \frac{\partial L}{\partial w_5} = 0.55 - 0.1 \cdot (-0.1) = 0.56 \)
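
The chain-rule product and the weight update can also be checked numerically. The following is a minimal sketch in plain Python, assuming the same inputs, weights, label and learning rate as the worked example. The helper `loss_for_w5` and the central finite-difference check are additions for illustration and are not part of the original derivation. The script prints the gradient and the updated weight so they can be compared with the hand calculation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def loss_for_w5(w5: float, h1: float, h2: float, w6: float, y: float) -> float:
    """Single-sample squared error as a function of w5 only (hypothetical helper)."""
    y_hat = sigmoid(h1 * w5 + h2 * w6)
    return (y - y_hat) ** 2


# Values from the worked example.
x1, x2, y, eta = 0.5, 0.3, 1.0, 0.1
w1, w2, w3, w4, w5, w6 = 0.7, 0.3, 0.4, 0.6, 0.55, 0.45

# Forward pass (intermediate values are not rounded here).
h1 = sigmoid(x1 * w1 + x2 * w3)
h2 = sigmoid(x1 * w2 + x2 * w4)
z3 = h1 * w5 + h2 * w6
y_hat = sigmoid(z3)

# The three chain-rule factors.
dL_dyhat = -2.0 * (y - y_hat)        # derivative of (y - y_hat)^2 w.r.t. y_hat
dyhat_dz3 = y_hat * (1.0 - y_hat)    # sigmoid derivative, sigma * (1 - sigma)
dz3_dw5 = h1                         # z3 is linear in w5 with coefficient h1
grad_w5 = dL_dyhat * dyhat_dz3 * dz3_dw5

# Independent check: central finite difference of the loss w.r.t. w5.
eps = 1e-6
numeric_grad = (
    loss_for_w5(w5 + eps, h1, h2, w6, y) - loss_for_w5(w5 - eps, h1, h2, w6, y)
) / (2.0 * eps)

# One gradient descent step on w5.
w5_new = w5 - eta * grad_w5

print(f"dL/dy_hat={dL_dyhat:.3f} dy_hat/dz3={dyhat_dz3:.3f} dz3/dw5={dz3_dw5:.3f}")
print(f"analytic grad={grad_w5:.4f} numeric grad={numeric_grad:.4f}")
print(f"updated w5={w5_new:.4f}")
```

The finite-difference value serves only as a sanity check on the analytic gradient. In practice, backpropagation computes the analytic form for every weight, reusing shared factors such as \(\frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_{3}}\).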