Non-linearity

04/18/2023

Linearity
- Stacking multiple linear functions without any non-linear functions results in a single linear function because the combination of linear transformations is still linear.
  1. Linear function definition: In its most basic form, a linear function is written as below, where \(W\) is a weight matrix, \(x\) is the input, and \(b\) is a bias. \[ \,\\ f(x) = Wx + b \,\\ \]
  2. Combination of two linear functions: Let's say we have two linear functions as below: \[ \,\\ f(x) = W_1 x + b_1 \\ g(x) = W_2 x + b_2 \,\\ \] then applying one after the other results in: \[ \,\\ g(f(x)) = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2 \,\\ \] This is still a linear function of the form \(h(x) = Cx + d\), where \(C = W_2 W_1\) and \(d = W_2 b_1 + b_2\).
  3. This result shows that stacking linear functions is equivalent to a single linear transformation with new parameters \(C\) and \(d\). Applying the same argument repeatedly, no matter how many linear functions you stack, the result is always one linear function.
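
The collapse described above can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`matvec`, `matmul`, `vecadd`) and the parameter values are illustrative assumptions, not part of the original notes.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vecadd(u, v):
    """Add two vectors element-wise."""
    return [a + b for a, b in zip(u, v)]

# Arbitrary example parameters (assumed values, not from a trained model).
W1, b1 = [[1.0, 2.0], [3.0, 4.0]], [0.5, -1.0]
W2, b2 = [[0.0, 1.0], [-1.0, 2.0]], [2.0, 0.0]

def f(x):
    return vecadd(matvec(W1, x), b1)   # f(x) = W1 x + b1

def g(x):
    return vecadd(matvec(W2, x), b2)   # g(x) = W2 x + b2

# Collapsed parameters: C = W2 W1 and d = W2 b1 + b2
C = matmul(W2, W1)
d = vecadd(matvec(W2, b1), b2)

def h(x):
    return vecadd(matvec(C, x), d)     # h(x) = C x + d

x = [1.0, -2.0]
stacked = g(f(x))    # two layers applied one after the other
collapsed = h(x)     # the single equivalent layer
print(stacked, collapsed)  # both print [-4.0, -9.5]
```

With these values, \(C = \begin{pmatrix} 3 & 4 \\ 5 & 6 \end{pmatrix}\) and \(d = (1, -2.5)\), so the two-layer stack and the single layer give the same output for any input.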

Non-Linearity for Neural Networks
- The world is non-linear. Many real-world phenomena involve complex relationships that cannot be modeled using only linear functions. Placing non-linear activation functions between the linear layers lets neural networks represent these relationships. Common activation functions include:
  - \(\text{Sigmoid}\): Maps the input to a value between 0 and 1 using the formula: \[ \,\\ \text{Sigmoid}\,(x) = \frac{1}{1 + e^{-x}} \,\\ \]
  - \(\text{ReLU}\) (Rectified Linear Unit): A simple non-linear function that returns \(0\) if \(x\) is less than or equal to \(0\), and otherwise returns \(x\) itself. It is defined as: \[ \,\\ \text{ReLU}\,(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} \,\\ \]
  - \(\text{LeakyReLU}\): A simple non-linear function that returns \(x\) if \(x\) is greater than \(0\), otherwise it returns \( \alpha x \), where \( \alpha \) is a small constant. It is defined as: \[ \,\\ \text{LeakyReLU}\,(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases} \,\\ \]
  - \(\text{Tanh}\): Maps the input to a value between -1 and 1 using the formula: \[ \,\\ \text{Tanh}\,(x) = \frac{e^x \,-\, e^{-x}}{e^x \,+\, e^{-x}} \,\\ \]
  - \(\text{Softmax}\): It converts logits into a probability distribution, where each output is a value between 0 and 1, and the sum of the outputs is 1. The formula for softmax is: \[ \,\\ \text{Softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} \,\\ \]
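
The formulas above translate almost directly into code. The following is a minimal sketch using only Python's `math` module; the function names, the default `alpha=0.01`, and the numerically safer forms of sigmoid and softmax are assumptions made for illustration, not something the notes prescribe.

```python
import math

def sigmoid(x):
    """Sigmoid(x) = 1 / (1 + e^{-x}), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)          # algebraically equal form for negative x
    return e / (1.0 + e)

def relu(x):
    """ReLU(x) = x if x > 0, else 0."""
    return x if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """LeakyReLU(x) = x if x > 0, else alpha * x (alpha is an assumed default)."""
    return x if x > 0 else alpha * x

def tanh(x):
    """Tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}), via the standard library."""
    return math.tanh(x)

def softmax(xs):
    """Softmax over a list of logits; subtracting the max leaves the result unchanged."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(sigmoid(0.0), relu(-2.0), leaky_relu(-2.0), tanh(0.0))
print(probs, sum(probs))
```

Unlike the other four functions, softmax takes a whole vector of logits rather than a single value, which is why it accepts a list here.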

Visualization