Conditional Generative Adversarial Nets

01/02/2024

By conditioning the model on additional information, it is possible to direct the data generation process. Such conditioning could be based on class labels, on some part of the data for inpainting, or even on data from a different modality.
In this work we show how to construct a conditional adversarial net. For empirical results we demonstrate two sets of experiments: one on the MNIST digit dataset conditioned on class labels, and one on the MIR Flickr 25,000 dataset for multi-modal learning.

Many interesting problems are more naturally thought of as a probabilistic one-to-many mapping. For instance, in the case of image labeling there may be many different tags that could appropriately be applied to a given image, and different (human) annotators may use different (but typically synonymous or related) terms to describe the same image.
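If natural-language labels are learned as vector representations whose geometric relations are meaningful, then even when a prediction is wrong it tends to land close to the truth (for example, "desk" predicted as "table"), and the model generalizes better.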

Generative adversarial nets were recently introduced as a novel way to train a generative model. They consist of two 'adversarial' models: a generative model \(G\), which captures the data distribution, and a discriminative model \(D\), which estimates the probability that a given sample comes from the real data distribution \(p_{data}\) rather than from the generative distribution \(p_{g}\). The two models play a minimax game with value function \(V(D, G)\):
\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} [ \log D(x) ] + \mathbb{E}_{z \sim p_{z}(z)} [ \log (1 - D(G(z))) ]. \]
In terms of a per-sample classification loss \(\mathcal{L}\), the training losses are:
\[ \begin{aligned} Loss_D &= \mathcal{L}(D(x), \, label_{real}) + \mathcal{L}(D(G(z)), \, label_{fake}) \\ Loss_G &= \mathcal{L}(D(G(z)), \, label_{real}) \end{aligned} \]
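The value function and the loss form above are linked by the choice of \(\mathcal{L}\). As a worked step, assume \(\mathcal{L}\) is binary cross-entropy with \(label_{real} = 1\) and \(label_{fake} = 0\) (an assumption; the text does not fix \(\mathcal{L}\)). For a single real sample \(x\) and a single noise draw \(z\):
\[ \begin{aligned} \mathcal{L}(p, t) &= -\left[ t \log p + (1 - t) \log (1 - p) \right] \\ Loss_D &= -\log D(x) - \log (1 - D(G(z))) \\ Loss_G &= -\log D(G(z)) \end{aligned} \]
Under that assumption, minimizing \(Loss_D\) over \(D\) corresponds, in expectation, to maximizing \(V(D, G)\) over \(D\). \(Loss_G\) is not literally the minimax term \(\log (1 - D(G(z)))\): it is the commonly used non-saturating alternative, which pushes \(G\) in the same direction but gives stronger gradients when \(D\) confidently rejects generated samples.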

Generative adversarial nets can be extended to a conditional model if both the generator and the discriminator are conditioned on some extra information \(\color{blue}{y}\). It could be any kind of auxiliary information, such as class labels or data from other modalities. We can perform the conditioning by feeding \(\color{blue}{y}\) into both \(D\) and \(G\) as an additional input layer.
In the generator, the prior input noise \(p_{z}(z)\) and \(y\) are combined in a joint hidden representation, and the adversarial training framework allows for considerable flexibility in how this hidden representation is composed\(^{1}\).
- \(^{1}\): For now we simply have the conditioning input and prior noise as inputs to a single hidden layer of an MLP, but one could imagine using higher order interactions allowing for complex generation mechanisms.
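The conditioning described above can be sketched as a concatenation followed by one hidden layer. This is a minimal illustration of the footnote's "single hidden layer" idea, not the authors' exact architecture: the layer sizes, the ReLU activation, the weight initialization, and all function names are assumptions made for the example.

```python
import random

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer (illustrative init)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense_relu(inputs, layer):
    """Apply a dense layer followed by ReLU."""
    weights, biases = layer
    return [max(0.0, sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def joint_hidden(primary, y, layer):
    """Concatenate the primary input (z for G, x for D) with y, then apply one hidden layer."""
    return dense_relu(primary + y, layer)

rng = random.Random(0)
NOISE_DIM, DATA_DIM, NUM_CLASSES, HIDDEN = 8, 12, 10, 16  # hypothetical sizes

# Conditioning information y: here a one-hot class label for class 3.
y = [1.0 if i == 3 else 0.0 for i in range(NUM_CLASSES)]

# Generator side: noise z and y feed the same hidden layer.
z = [rng.random() for _ in range(NOISE_DIM)]
g_layer = init_layer(NOISE_DIM + NUM_CLASSES, HIDDEN, rng)
g_hidden = joint_hidden(z, y, g_layer)

# Discriminator side: a data sample x (a random stand-in here) and the same y.
x = [rng.random() for _ in range(DATA_DIM)]
d_layer = init_layer(DATA_DIM + NUM_CLASSES, HIDDEN, rng)
d_hidden = joint_hidden(x, y, d_layer)
```

Concatenation is the simplest way to let every hidden unit see both inputs; the higher order interactions mentioned in the footnote would replace `joint_hidden` with something richer.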

The objective function of the two-player minimax game then becomes:
\[ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} [ \log D(x \,|\, \color{blue}{y}) ] + \mathbb{E}_{z \sim p_{z}(z)} [ \log (1 - D(G(z \,|\, \color{blue}{y}))) ]. \]
The corresponding training losses are:
\[ \begin{aligned} Loss_D &= \mathcal{L}(D(x \,|\, \color{blue}{y}), \, label_{real}) + \mathcal{L}(D(G(z \,|\, \color{blue}{y})), \, label_{fake}) \\ Loss_G &= \mathcal{L}(D(G(z \,|\, \color{blue}{y})), \, label_{real}) \end{aligned} \]
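A small numeric sketch of these conditional losses follows. It assumes \(\mathcal{L}\) is binary cross-entropy with real = 1 and fake = 0, which the text does not specify. The lists `d_real` and `d_fake` are hypothetical discriminator outputs standing in for \(D(x \,|\, y)\) and \(D(G(z \,|\, y))\) on a minibatch; no network is trained here.

```python
import math

def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for one predicted probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def mean(values):
    return sum(values) / len(values)

def discriminator_loss(d_real, d_fake):
    """Loss_D: real pairs scored against label 1, generated pairs against label 0."""
    return mean([bce(p, 1.0) for p in d_real]) + mean([bce(p, 0.0) for p in d_fake])

def generator_loss(d_fake):
    """Loss_G: generated pairs scored against the 'real' label."""
    return mean([bce(p, 1.0) for p in d_fake])

def value_estimate(d_real, d_fake):
    """Minibatch estimate of V(D, G)."""
    return (mean([math.log(p) for p in d_real])
            + mean([math.log(1.0 - p) for p in d_fake]))

# Hypothetical discriminator outputs for three real and three generated pairs.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.1, 0.3, 0.2]

loss_d = discriminator_loss(d_real, d_fake)
loss_g = generator_loss(d_fake)
v_hat = value_estimate(d_real, d_fake)
```

With these choices `loss_d` is the negative of `v_hat`, so a discriminator that lowers its loss raises the value estimate, while `loss_g` falls as the discriminator's scores on generated pairs move toward 1.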

We trained a conditional adversarial net on MNIST images conditioned on their class labels, encoded as one-hot vectors.
In the generator, a noise prior \(z\) was drawn from a uniform distribution within the unit hypercube.
We present these results more as a proof-of-concept than as a demonstration of efficacy.
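The generator inputs for this experiment can be sketched as follows: one-hot labels for the ten digit classes, and noise drawn uniformly from the unit hypercube. The noise dimensionality, the number of samples per row, and the function names are assumptions for illustration; the generator itself is not included.

```python
import random

def one_hot(label, num_classes=10):
    """Encode an integer class label as a one-hot list."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def sample_noise(dim, rng):
    """Draw z uniformly from the unit hypercube [0, 1)^dim."""
    return [rng.random() for _ in range(dim)]

def conditioned_grid_inputs(num_classes, samples_per_row, noise_dim, rng):
    """Build one row of (z, y) pairs per label, so every row shares a single label."""
    return [
        [(sample_noise(noise_dim, rng), one_hot(label, num_classes))
         for _ in range(samples_per_row)]
        for label in range(num_classes)
    ]

NOISE_DIM = 100       # assumed here; not stated in the text above
SAMPLES_PER_ROW = 5   # hypothetical grid width

rng = random.Random(0)
grid = conditioned_grid_inputs(10, SAMPLES_PER_ROW, NOISE_DIM, rng)
```

Passing each `(z, y)` pair through a trained generator would yield a grid in which row `k` contains samples conditioned on digit `k`, with variation within a row coming only from `z`.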

Generated MNIST digits, each row conditioned on one label

In this section we demonstrate automated tagging of images, with multi-label predictions, using conditional adversarial nets to generate a (possibly multi-modal) distribution of tag-vectors conditional on image features.
For evaluation, we generate 100 samples for each image and, for each sample, find the top 20 closest words using the cosine similarity between the vector representations of the vocabulary words and the sample. We then select the top 10 most common words among all 100 samples. The figure below shows some samples of the user-assigned tags and annotations along with the generated tags.
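The evaluation procedure above can be sketched in a few lines. The toy vocabulary, its 2-dimensional word vectors, and the three "generated" samples are made up for illustration, and the function names are hypothetical; the procedure in the text uses 100 samples, 20 nearest words per sample, and 10 final tags.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def closest_words(sample, word_vectors, k):
    """Return the k vocabulary words whose vectors are most similar to the sample."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine_similarity(sample, word_vectors[word]),
        reverse=True,
    )
    return ranked[:k]

def predict_tags(samples, word_vectors, per_sample_k=20, final_k=10):
    """Count each sample's nearest words, then keep the most frequent words overall."""
    counts = Counter()
    for sample in samples:
        counts.update(closest_words(sample, word_vectors, per_sample_k))
    return [word for word, _ in counts.most_common(final_k)]

# Toy data: four words and three generated tag-vectors for one image.
word_vectors = {
    "sky": [1.0, 0.0],
    "cloud": [0.9, 0.1],
    "sea": [0.0, 1.0],
    "beach": [0.1, 0.9],
}
samples = [[1.0, 0.05], [0.95, 0.0], [0.1, 1.0]]

tags = predict_tags(samples, word_vectors, per_sample_k=2, final_k=2)
```

In this toy case two of the three samples sit near "sky" and "cloud", so those two words win the vote. Voting across many samples is what lets a multi-modal distribution of tag-vectors surface several plausible tags rather than a single nearest word.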

Samples of generated tags