Non-linearity
04/18/2023

Linearity

Stacking multiple linear functions with no non-linear function between them yields a single linear function, because a composition of linear transformations is itself linear.

Linear function definition: In its most basic form, a linear function is written as follows, where \(W\) is a weight matrix, \(x\) is the input, and \(b\) is a bias.
\[ \,\\ f(x) = Wx + b \,\\ \]

Combination of two linear functions: Suppose we have two linear functions:
\[ \,\\ f(x) = W_1 x + b_1 \\ g(x) = W_2 x + b_2 \,\\ \]
Applying one after the other gives:
\[ \,\\ g(f(x)) = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2 \,\\ \]
This is still a linear function of the form \(h(x) = Cx + d\), where \(C = W_2 W_1\) and \(d = W_2 b_1 + b_2\). This shows that stacking linear functions is equivalent to a single linear transformation with new parameters \(C\) and \(d\). Therefore, no matter how many linear functions you stack, the result is always one linear function.

Non-Linearity for Neural Networks

The world is non-linear. Many real-world phenomena involve complex relationships that cannot be modeled with linear functions alone. Introducing non-linear activation functions between the linear layers lets neural networks represent these relationships.

\(\text{Sigmoid}\): Maps the input to a value between 0 and 1 using the formula:
\[ \,\\ \text{Sigmoid}\,(x) = \frac{1}{1 + e^{-x}} \,\\ \]

\(\text{ReLU}\) (Rectified Linear Unit): A simple non-linear function that returns \(0\) if \(x\) is less than or equal to \(0\), and \(x\) itself otherwise. It is defined as:
\[ \,\\ \text{ReLU}\,(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} \,\\ \]

\(\text{LeakyReLU}\): A simple non-linear function that returns \(x\) if \(x\) is greater than \(0\), and \( \alpha x \) otherwise, where \( \alpha \) is a small positive constant (for example, 0.01).
It is defined as:
\[ \,\\ \text{LeakyReLU}\,(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases} \,\\ \]

\(\text{Tanh}\): Maps the input to a value between -1 and 1 using the formula:
\[ \,\\ \text{Tanh}\,(x) = \frac{e^x \,-\, e^{-x}}{e^x \,+\, e^{-x}} \,\\ \]

\(\text{Softmax}\): Converts a vector of logits into a probability distribution, where each output is a value between 0 and 1 and the outputs sum to 1. For an input vector \(x\) with \(n\) elements, the formula is:
\[ \,\\ \text{Softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} \,\\ \]

Visualization
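To get a feel for the shapes of these functions, the formulas above can be evaluated on a grid of inputs, which is the data a plot of the curves would draw. The following is a minimal NumPy sketch and not code from the original post: the function names, the grid range, `alpha=0.01`, and the random matrix shapes are all assumptions chosen for illustration. The last few lines also check numerically that two stacked linear functions collapse into one with \(C = W_2 W_1\) and \(d = W_2 b_1 + b_2\).

```python
import numpy as np


def sigmoid(x):
    # 1 / (1 + e^{-x}); every output lies strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-x))


def relu(x):
    # x where x > 0, otherwise 0
    return np.maximum(x, 0.0)


def leaky_relu(x, alpha=0.01):
    # x where x > 0, otherwise alpha * x (alpha=0.01 is an assumed value)
    return np.where(x > 0, x, alpha * x)


def tanh(x):
    # np.tanh computes (e^x - e^{-x}) / (e^x + e^{-x})
    return np.tanh(x)


def softmax(x):
    # Subtracting the maximum avoids overflow in exp and leaves the
    # result unchanged, because the shift cancels in the ratio.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)


# Sample each element-wise activation on an assumed grid from -5 to 5.
xs = np.linspace(-5.0, 5.0, 11)
curves = {
    "sigmoid": sigmoid(xs),
    "relu": relu(xs),
    "leaky_relu": leaky_relu(xs),
    "tanh": tanh(xs),
}
for name, ys in curves.items():
    print(name, np.round(ys, 3))

# Softmax acts on a whole vector of logits rather than element by element.
probs = softmax(np.array([1.0, 2.0, 3.0]))
print("softmax", np.round(probs, 3))

# Numerical check of the Linearity section with arbitrary random parameters.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

stacked = W2 @ (W1 @ x + b1) + b2      # g(f(x))
C, d = W2 @ W1, W2 @ b1 + b2           # parameters of the single linear map
collapsed = C @ x + d                  # h(x) = Cx + d
print("stacked == collapsed:", np.allclose(stacked, collapsed))
```

Passing `xs` and each entry of `curves` to a plotting library would produce the curves themselves. Inserting any of the activations between the two linear maps, for example `W2 @ relu(W1 @ x + b1) + b2`, is what prevents the collapse into a single `C @ x + d`.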