Conditional Generative Adversarial Nets

Mehdi Mirza; Simon Osindero

Mehdi Mirza
Département d’informatique et de recherche opérationnelle
Université de Montréal
Montréal, QC H3C 3J7
mirzamom@iro.umontreal.ca

Simon Osindero
Flickr / Yahoo Inc.
San Francisco, CA 94103
osindero@yahoo-inc.com

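Abstract

Generative Adversarial Nets [8] were recently introduced as a novel way to train generative models. In this work we introduce the conditional version of generative adversarial nets, which can be constructed by simply feeding the data y that we wish to condition on to both the generator and the discriminator. We show that this model can generate MNIST digits conditioned on class labels. We also illustrate how this model could be used to learn a multi-modal model, and provide preliminary examples of an application to image tagging in which we demonstrate how this approach can generate descriptive tags that are not part of the training labels.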

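1 Introduction

Generative adversarial nets were recently introduced as an alternative framework for training generative models, one that sidesteps the difficulty of approximating many intractable probabilistic computations.

Adversarial nets have several advantages: Markov chains are never needed, gradients are obtained using backpropagation alone, no inference is required during learning, and a wide variety of factors and interactions can easily be incorporated into the model.

Furthermore, as demonstrated in [8], they can produce state-of-the-art log-likelihood estimates and realistic samples.

In an unconditioned generative model, there is no control over the modes of the data being generated. However, by conditioning the model on additional information it is possible to direct the data generation process. Such conditioning could be based on class labels, on some part of the data for inpainting as in [5], or even on data from a different modality.

In this work we show how to construct a conditional adversarial net. For empirical results we present two sets of experiments: one on the MNIST digit dataset conditioned on class labels, and one on the MIR Flickr 25,000 dataset [10] for multi-modal learning.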

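2 Related Work

2.1 Multi-modal Learning for Image Labelling

Despite the many recent successes of supervised neural networks (and convolutional networks in particular) [13, 17], it remains challenging to scale such models to accommodate an extremely large number of predicted output categories. A second issue is that much of the work to date has focused on learning one-to-one mappings from input to output. However, many interesting problems are more naturally thought of as probabilistic one-to-many mappings. For instance, in the case of image labelling there may be many different tags that could appropriately be applied to a given image, and different (human) annotators may use different (but typically synonymous or related) terms to describe the same image.

One way to help address the first issue is to leverage additional information from other modalities: for instance, by using natural language corpora to learn a vector representation for labels in which geometric relations are semantically meaningful. When making predictions in such a space, we benefit from the fact that when a prediction is wrong we are still often "close" to the truth (e.g. predicting "table" instead of "chair"), and also from the fact that we can naturally generalize predictions to labels that were not seen during training. Works such as [3] have shown that even a simple linear mapping from image feature space to word representation space can yield improved classification performance.

One way to address the second issue is to use a conditional probabilistic generative model: the input is taken to be the conditioning variable, and the one-to-many mapping is instantiated as a conditional predictive distribution.

[16] take a similar approach to this problem and train a multi-modal Deep Boltzmann Machine on the MIR Flickr 25,000 dataset, as we do in this work.

Additionally, in [12] the authors show how to train a supervised multi-modal neural language model, and they are able to generate descriptive sentences for images.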
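The "close to the truth" property of a semantic label space can be illustrated with a small lookup. The sketch below is not taken from the paper: the three-dimensional label vectors and the predicted vector are hypothetical values chosen by hand, whereas real label embeddings would be learned from a text corpus. It ranks labels by cosine similarity to a predicted vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm

# Hypothetical label embeddings: related labels point in similar directions.
label_vectors = {
    "chair": [0.9, 0.1, 0.0],
    "table": [0.8, 0.2, 0.1],
    "river": [0.0, 0.1, 0.9],
}

def rank_labels(predicted):
    """Return label names sorted from most to least similar to `predicted`."""
    return sorted(
        label_vectors,
        key=lambda name: cosine(predicted, label_vectors[name]),
        reverse=True,
    )

# A hypothetical model output that lands near the furniture labels.
ranking = rank_labels([0.85, 0.12, 0.02])
```

With these toy vectors the two furniture labels rank ahead of "river", so confusing one for the other is a small error in the embedding space even though it counts as a full miss under exact-match classification.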

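3 Conditional Adversarial Nets

3.1 Generative Adversarial Nets

Generative adversarial nets were recently introduced as a novel way to train a generative model. They consist of two "adversarial" models: a generative model $G$ that captures the data distribution, and a discriminative model $D$ that estimates the probability that a sample came from the training data rather than from $G$. Both $G$ and $D$ can be non-linear mapping functions, such as multi-layer perceptrons.

To learn a generator distribution $p_g$ over data $x$, the generator builds a mapping function from a prior noise distribution $p_z(z)$ to data space as $G(z; \theta_g)$. The discriminator $D(x; \theta_d)$ outputs a single scalar representing the probability that $x$ came from the training data rather than from $p_g$.

$G$ and $D$ are trained simultaneously: we adjust the parameters of $G$ to minimize $\log(1 - D(G(z)))$ and adjust the parameters of $D$ to maximize $\log D(x)$, as if they were playing a two-player min-max game with value function $V(D, G)$: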

$$\min_{G}\max_{D}V(D,G)=\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)]+\mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z)))] \tag{1}$$
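As a rough illustration of the value function above, the following sketch estimates V(D, G) from a small batch of discriminator outputs. It is not the authors' implementation: the function name `value_function`, the `eps` clamp, and the example probabilities are all assumptions made for this example.

```python
import math

def value_function(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of V(D, G) from discriminator outputs.

    d_real: values of D(x) for samples x drawn from the training data
    d_fake: values of D(G(z)) for noise samples z drawn from the prior
    eps:    small clamp that keeps log() away from zero
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from generated data outputs 0.5
# everywhere, which gives V = log(0.5) + log(0.5) = -log(4).
v_confused = value_function([0.5, 0.5], [0.5, 0.5])

# A discriminator that separates the two sets well pushes V up towards 0.
v_confident = value_function([0.99, 0.98], [0.01, 0.02])
```

The discriminator is rewarded for raising this quantity and the generator for lowering the second term, which is the min-max structure of the game.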

[1] Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2013). Better mixing via deep representations. In ICML'2013.

[2] Bengio, Y., Thibodeau-Laufer, E., Alain, G., and Yosinski, J. (2014). Deep generative stochastic networks trainable by backprop. In Proceedings of the 30th International Conference on Machine Learning (ICML'14).

[3] Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al. (2013). DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129.

[4] Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics, pages 315–323.

[5] Goodfellow, I., Mirza, M., Courville, A., and Bengio, Y. (2013a). Multi-prediction deep Boltzmann machines. In Advances in Neural Information Processing Systems, pages 548–556.

[6] Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. (2013b). Maxout networks. In ICML'2013.

[7] Goodfellow, I. J., Warde-Farley, D., Lamblin, P., Dumoulin, V., Mirza, M., Pascanu, R., Bergstra, J., Bastien, F., and Bengio, Y. (2013c). Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214.

[8] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In NIPS'2014.

[9] Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. Technical report, arXiv:1207.0580.

[10] Huiskes, M. J. and Lew, M. S. (2008). The MIR Flickr retrieval evaluation. In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval, New York, NY, USA. ACM.

[11] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In ICCV'09.

[12] Kiros, R., Zemel, R., and Salakhutdinov, R. (2013). Multimodal neural language models. In Proc. NIPS Deep Learning Workshop.

[13] Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS'2012).

[14] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. In International Conference on Learning Representations: Workshops Track.

[15] Russakovsky, O. and Fei-Fei, L. (2010). Attribute learning in large-scale datasets. In European Conference of Computer Vision (ECCV), International Workshop on Parts and Attributes, Crete, Greece.

[16] Srivastava, N. and Salakhutdinov, R. (2012). Multimodal learning with deep Boltzmann machines. In NIPS'2012.

[17] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2014). Going deeper with convolutions. arXiv preprint arXiv:1409.4842.

Model	MNIST
DBN [1]
Stacked CAE [1]
Deep GSN [2]
Adversarial nets
Conditional adversarial nets

	User tags + annotations	Generated tags
![track](track.jpg)	montanha, trem, inverno, frio, people, male, plant life, tree, structures, transport, car	taxi, passenger, line, transportation, railway station, passengers, railways, signals, rail, rails
![cake](cake.jpg)	food, raspberry, delicious, homemade	chicken, fattening, cooked, peanut, cream, cookie, house made, bread, biscuit, bakes
![river](river.jpg)	water, river	creek, lake, along, near, river, rocky, treeline, valley, woods, waters
![baby](baby.jpg)	people, portrait, female, baby, indoor	love, people, posing, girl, young, strangers, pretty, women, happy, life


4.1 Unimodal (a distribution with a single peak)


$$\min_{G}\max_{D}V(D,G)=\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x \mid y)]+\mathbb{E}_{z\sim p_{z}(z)}[\log(1-D(G(z \mid y)))] \tag{2}$$
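In the conditional objective, both networks receive y alongside their usual input. One simple way to realize this, assumed here purely for illustration, is to one-hot encode a class label and concatenate it to the noise vector for G and to the sample for D. The sketch below uses single linear layers with random, untrained weights; the sizes, layer shapes, and helper names are hypothetical and not the architecture used in the paper.

```python
import math
import random

def one_hot(label, num_classes):
    """Encode an integer class label as a one-hot list of floats."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def linear(weights, bias, inputs):
    """Apply one affine layer: weights @ inputs + bias."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def init(rows, cols, rng):
    """Small random weight matrix (untrained, for shape illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
NOISE_DIM, NUM_CLASSES, DATA_DIM = 4, 10, 6  # hypothetical sizes

g_w, g_b = init(DATA_DIM, NOISE_DIM + NUM_CLASSES, rng), [0.0] * DATA_DIM
d_w, d_b = init(1, DATA_DIM + NUM_CLASSES, rng), [0.0]

def generator(z, y):
    # G(z | y): the condition is appended to the noise vector.
    return linear(g_w, g_b, z + one_hot(y, NUM_CLASSES))

def discriminator(x, y):
    # D(x | y): the same condition is appended to the sample being judged.
    return sigmoid(linear(d_w, d_b, x + one_hot(y, NUM_CLASSES))[0])

z = [rng.uniform(-1.0, 1.0) for _ in range(NOISE_DIM)]
fake = generator(z, y=3)          # a sample generated for class 3
score = discriminator(fake, y=3)  # probability that (fake, 3) is real
```

Because y enters the generator's input, the same noise vector z yields different outputs for different labels, which is what lets the label steer generation.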
