Generative Adversarial Networks
08/22/2023

Keywords to search: Maximum Likelihood Estimation (MLE), Markov Chain, Feedback Loop, Latent Variables, Noise-Contrastive Estimation (NCE)

Introduction

In the proposed adversarial nets framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

In this article, we explore the special case in which the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We refer to this special case as adversarial nets. We can train both models using only the highly successful backpropagation and dropout algorithms, and sample from the generative model using only forward propagation.

Related work

Score Matching: This method attempts to match the gradient of the model's log-density to the gradient of the data log-density by minimizing
\[
\mathbb{E}_{\mathbf{x} \sim p_{data}(\mathbf{x})} \left[ \left\| \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) - \nabla_{\mathbf{x}} \log p_{data}(\mathbf{x}) \right\|_2^2 \right]
\]
The term \( \nabla_{\mathbf{x}} \log p_{data}(\mathbf{x}) \) is known as the score of the data distribution.
(That is, the gradient of the log-density of the true data distribution.) The term \( \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}) \) is the gradient of the model's log-probability with respect to the data \(\mathbf{x}\).

Noise-Contrastive Estimation (NCE)

Adversarial nets

To learn the generator's distribution \(p_g\) over data \(x\), we define a prior on input noise variables \(p_z(z)\), then represent a mapping to data space as \(G(z;\theta_g)\). We also define \(D(x;\theta_d)\), which outputs a single scalar: the probability that \(x\) came from the data rather than from \(p_g\) (a binary real-versus-fake classification).

We train \(D\) to maximize the probability of assigning the correct label to both training examples and samples from \(G\). We simultaneously train \(G\) so that \(D(G(z))\) is close to 1, i.e. to minimize \(\log (1 - D(G(z)))\). In other words, \(D\) and \(G\) play a two-player minimax game with value function \(V(D, G)\):
\[
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} [ \log D(x) ] + \mathbb{E}_{z \sim p_{z}(z)} [ \log (1 - D(G(z))) ].
\]
In practice, the above equation may not provide sufficient gradient for \(G\) to learn well. Early in learning, when \(G\) is poor, \(D\) can reject samples with high confidence because they are clearly different from the training data, and \(\log (1 - D(G(z)))\) saturates. Rather than training \(G\) to minimize \(\log (1 - D(G(z)))\), we can train \(G\) to maximize \(\log D(G(z))\). Writing \(\mathcal{L}\) for the classification loss (for example, binary cross-entropy), the resulting losses are:
\[
\begin{aligned}
\text{Loss}_D &= \mathcal{L}(D(x), \, \text{label}_{real}) + \mathcal{L}(D(G(z)), \, \text{label}_{fake}) \\
\text{Loss}_G &= \mathcal{L}(D(G(z)), \, \text{label}_{real})
\end{aligned}
\]

A pedagogical explanation of GAN training

blue dashed line: Discriminative Distribution
green solid line: Generative Distribution
black dotted line: Real Distribution

\((\text{b})\): \(D\) is trained to discriminate samples from data, converging to \(D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \).
\((\text{c})\): The gradient of \(D\) has guided \(G(z)\) to flow to regions that are more likely to be classified as real data.
\((\text{d})\): If \(G\) and \(D\) have enough capacity, they reach a point at which neither can improve because \( p_g = p_{data} \). The discriminator can no longer tell the two distributions apart, i.e. \( D(x) = \frac{1}{2} \).
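
The two claims above (that \(G\) gets little gradient early in training, and that the game settles at \(D(x) = \frac{1}{2}\)) can be checked numerically. The following is a minimal sketch rather than a training loop. It assumes toy discrete distributions over three outcomes (the numbers are hypothetical, chosen only for illustration), evaluates the value function directly on \(p_g\) instead of sampling \(z\), and assumes the discriminator output is a sigmoid of a logit. The function names are illustrative and do not come from any library.

```python
import math


def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]


def value_function(d, p_data, p_g):
    """V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))]."""
    real_term = sum(pd * math.log(dx) for pd, dx in zip(p_data, d) if pd > 0)
    fake_term = sum(pg * math.log(1.0 - dx) for pg, dx in zip(p_g, d) if pg > 0)
    return real_term + fake_term


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def generator_grads(logit):
    """Gradients w.r.t. the discriminator logit a, with D(G(z)) = sigmoid(a)."""
    d = sigmoid(logit)
    saturating = -d           # d/da log(1 - sigmoid(a)): the minimax objective
    non_saturating = 1.0 - d  # d/da log(sigmoid(a)): the alternative objective
    return saturating, non_saturating


# Hypothetical toy distributions over three outcomes (assumption, not real data).
p_data = [0.5, 0.3, 0.2]
p_g_poor = [0.1, 0.1, 0.8]

# A poor generator: the optimal discriminator separates the two distributions.
d_star = optimal_discriminator(p_data, p_g_poor)
v_poor = value_function(d_star, p_data, p_g_poor)

# A perfect generator (p_g = p_data): D* is 1/2 everywhere.
d_equal = optimal_discriminator(p_data, p_data)
v_equal = value_function(d_equal, p_data, p_data)

# D confidently rejects a generated sample: logit far below zero.
sat, non_sat = generator_grads(-6.0)

print("D* (poor G):   ", d_star)
print("D* (p_g = p_data):", d_equal)
print("V  (poor G):   ", v_poor)
print("V  (p_g = p_data):", v_equal)
print("gradient magnitudes (saturating, non-saturating):", abs(sat), abs(non_sat))
```

When \(p_g = p_{data}\), both expectations reduce to \(\log \frac{1}{2}\), so the value function equals \(-\log 4\); any mismatch between the two distributions gives the optimal discriminator a higher value. The last line illustrates why the alternative generator objective is preferred early in training: when \(D(G(z))\) is near 0, the gradient of \(\log(1 - D(G(z)))\) with respect to the logit is near 0, while the gradient of \(\log D(G(z))\) is near 1.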