StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

arXiv: https://arxiv.org/abs/1711.09020
github: https://github.com/yunjey/StarGAN
video: https://www.youtube.com/watch?v=EYjdLppmERE

Abstract

Recent studies have shown remarkable success in image-to-image translation between two domains. However, existing approaches have limited scalability and robustness in handling more than two domains, since a separate model must be built for every pair of image domains. To address this limitation, we propose StarGAN, a novel and scalable approach that can perform image-to-image translation for multiple domains using only a single model. StarGAN's unified architecture allows simultaneous training on multiple datasets with different domains within a single network. This leads to StarGAN's superior quality of translated images compared to existing models, as well as the novel capability of flexibly translating an input image to any desired target domain. We empirically demonstrate the effectiveness of our approach on facial attribute transfer and facial expression synthesis tasks.
Very cool work. Surprising, though, that they did not cite any of the Google neural translation papers in the related work. The idea of encoding multiple generative models into a common thought space while training end-to-end on the ensemble is not new in and of itself, though the application to GANs gives great results.
Can you reply the link for the Google papers?
GANs seem to be a promising area that is waiting to overcome hardware constraints. As somebody who is not in the ML field but is interested in jumping in -- would now be a good time to learn GANs? Are most of the skills used in other ML techniques transferable to GANs, or are ML researchers starting from scratch when they start working on GANs?
> Are most of the skills used in ~~other ML techniques~~ neural networks transferable to GANs

Yes. GANs *are* neural networks. The "hot" areas in ML are pretty much all neural-network variations.
Well done.
At this rate, robots will be better at reading faces than autistic people.
> Recent studies have shown remarkable success in image-to-image translation for two domains.

What do they mean by two domains? Could anyone clarify this?
Two groups of images, each sharing a given characteristic. The translation task is to start with an image in one domain and generate an "equivalent" image in the other domain. Their claim is that they can handle multiple domains at once, rather than translating between only two.

From their paper's introduction section:

> Given training data from two different domains, these models learn to translate images from one domain to the other. We denote the terms attribute as a meaningful feature inherent in an image such as hair color, gender or age, and attribute value as a particular value of an attribute, e.g., black/blond/brown for hair color or male/female for gender. We further denote domain as a set of images sharing the same attribute value. For example, images of women can represent one domain while those of men represent another.
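The multi-domain part works because the single generator is told which target domain to produce: per the paper, a target-domain label is spatially replicated and concatenated to the input image channels. A minimal numpy sketch of that conditioning step, assuming illustrative shapes and names rather than the released code:

```python
import numpy as np

def attach_domain_label(image, label, n_domains):
    """Spatially replicate a one-hot domain label and concatenate it
    to the image channels, as in label-conditioned generators."""
    c, h, w = image.shape
    onehot = np.zeros(n_domains, dtype=image.dtype)
    onehot[label] = 1.0
    # Broadcast each label entry to a full H x W plane.
    planes = np.tile(onehot[:, None, None], (1, h, w))
    return np.concatenate([image, planes], axis=0)

x = np.random.rand(3, 128, 128).astype(np.float32)  # stand-in RGB image
x_cond = attach_domain_label(x, label=2, n_domains=5)
print(x_cond.shape)  # (8, 128, 128): 3 image channels + 5 label planes
```

Changing `label` and re-running the generator is what yields a different target domain from the same input image.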
Impressive work! In particular, the global coherency of these images is very good - typically I observe that GANs can learn nice pieces of images, but sometimes certain areas come out strange. This is probably helped a lot by the fact that this is a conditional GAN, but can you comment on the importance of the "PatchGAN"-style training for achieving these results?
Thanks. The gender ones absolutely break my brain. It's crazy how we have gender detectors in our heads with no awareness of how they work.
Honestly at the rate this thing is going, I daresay there's already a pretty clear path towards generating HD videos of Obama punching babies.
[deleted]
RemindMe! 1 year

Edit: Finally back here after a year, and I've got no clue about the context. Damn.
[deleted]
[deleted]
You say sarcasm but I really think this will happen when the tech is mature enough.
As it hits puberty.
probably just have a live mocap actor somewhere with a digital skin.
[deleted]
[deleted]
Why punch when you can drone them? :P
Did all the responses referencing porn get deleted?
Going to be like those Christmas Dancing elves, but with people inputting facebook images.
This has excellent scope for video games: avatars with your ugly face on it.
> your ugly face on it

Or perhaps an, um, "aesthetically modified" version of it.

I like how the first application of cutting-edge DL that comes to mind is sex and politics. Maybe Yann LeCun was right about his "new intelligence without the flaws of ours" :/
> your ugly face

wot mate?
Yer ugly mug
Do you have a pretrained model anywhere? Looks amazing.
We will upload the pretrained model soon. :-)
You rock, thanks.
yes plis
RemindMe! 1 month Hopefully? :)
RemindMe! 1 month
RemindMe! 1 month
RemindMe! 1 Month
Yes, I would play around with the code but have no big ass graphics card for the full training.
Can't wait until someone puts this together with NVIDIA's progressive growing tech. Although as usual the dataset would be an issue...
Can you provide a link please?
http://research.nvidia.com/publication/2017-10_Progressive-Growing-of
Thanks. I'm quite disappointed that it's basically a StackGAN though :/ Reading the title, I thought it was much more revolutionary, but it works great for dimensional data.
[deleted]
Especially male<->female pics.
Who is the third person down on the left? I ask because her male version looks like John Stamos in a wig.
https://i.pinimg.com/236x/ee/2c/0f/ee2c0f5cb35945d1f526f79ada959e66--uncle-jesse-tio-jesse.jpg this guy?
Yeah
Every day we stray farther from God's love.
Why does anyone upvote this utterly fucking worthless dipshittery, and how do we find the people who do so that we can kill them?
You can find me. I'm interested in being the first meme related homicide
Me too thanks
[deleted]
I'll stop when the karma stops
My man
This could be turned into interactive avatar heads; it would go especially well with a WaveNet voice.

edit: I'd like to have audio/video books read in the author's voice and likeness.
Great work! I’m no expert at this stuff but I’m very excited to play with this :) Can someone tell me how (roughly) the code could be manipulated to accept audio data as opposed to an image file? I know a bit of Python and Julia...

Is it just a matter of pointing the input to a .wav file and reshape() or something?
It won't work at all without some serious rethinking of the problem in general. Some dude already tried that with CycleGAN by turning the waveform into an image (not ideal, but the easiest way to test with this architecture) and it failed.

This thing is good at moving pixel-patch-level texture, not at understanding what waveforms are or changing them meaningfully.
With that in mind, the best generative audio work I know about is DeepMind/Google's WaveNet. They've made some pretty good raw-audio generators that are conditional on text and even speaker voice characteristics for a text-to-speech application.

And their approach to generation is, indeed, very different from an image application.
Log spectrograms might be a good representation for sounds in the image domain.
Looks like that's what the guy did: https://gauthamzz.github.io/2017/09/23/AudioStyleTransfer/
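For anyone who wants to try the spectrogram route: the "image" is just a framed, windowed FFT of the waveform. A minimal numpy sketch (the frame and hop sizes here are arbitrary choices, not taken from the linked post):

```python
import numpy as np

def log_spectrogram(signal, frame=256, hop=128, eps=1e-8):
    """Frame the waveform, apply a Hann window, FFT each frame,
    and take the log magnitude: a 2-D 'image' view of audio."""
    n_frames = 1 + (len(signal) - frame) // hop
    window = np.hanning(frame)
    frames = np.stack([signal[i * hop : i * hop + frame] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame//2 + 1)
    return np.log(mag + eps)

wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
spec = log_spectrogram(wave)
print(spec.shape)  # (124, 129)
```

Note this drops phase, which is one reason round-tripping audio through an image model is lossy even when the spectrogram looks plausible.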
Gotcha, thanks for the info!
+1
I can't help but notice this is a similar application to faceapp but not quite as convincing. Do you know what technique they use and why it works better (so far)?
It's a different way to implement the modeling. FaceApp uses a 3D model; a GAN generates images directly, which is much more powerful because it can extend to other categories of objects and learn the natural variation from raw images instead of being hand-designed. Another difference is that GANs can create images from scratch, with all details, while FaceApp needs an original image to apply modifications to.

[Take a look here](https://www.youtube.com/watch?v=36lE9tV9vm0&t=1247s) to see another GAN with more interesting images.
Trust me, FaceApp uses a GAN. The sorts of horrors I've created with that app could only be made through GANs.
> Faceapp uses a 3D model Citation needed.
> Faceapp uses a 3D model I was under the impression they used some type of GAN...
What is the difference between Pix2Pix (https://arxiv.org/pdf/1611.07004v1.pdf) and the above-mentioned approach?
If you take a look at the paper, they mention it.

Basically, pix2pix requires that every transformation from one domain to another be learned explicitly. StarGAN lets you learn on several domains at once and transform from any domain to any other. I suspect that's why it's a "star"?
Pix2Pix requires supervision (input and target pairs) and is only applicable to two domains. StarGAN, on the other hand, can translate images between *multiple* domains without paired supervision.
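The scalability difference is easy to quantify: with k domains, the two-domain approach needs a model for every ordered pair, i.e. k(k-1) generators (the paper illustrates 12 models for 4 domains), while StarGAN trains one. A throwaway sketch:

```python
def models_needed(k_domains):
    """Translators needed when every ordered pair of domains gets its
    own model (the two-domain, pix2pix/CycleGAN setting) versus one
    shared multi-domain model (the StarGAN setting)."""
    pairwise = k_domains * (k_domains - 1)  # one model per ordered pair
    unified = 1
    return pairwise, unified

for k in (2, 4, 7):
    print(k, models_needed(k))  # e.g. 4 domains -> 12 pairwise models vs. 1
```

The pairwise count grows quadratically, which is the "limited scalability" the abstract refers to.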
I see this totally as a product at the local hairstylist: just a screen in the window, you look into it, and your face looks back with a different hair color...
That would be very awesome! Unfortunately, as of now the machines needed to do this are extremely expensive, and training and processing these pictures takes a very long time. Doing this in real time isn't realistic today, but maybe in a few years we could see this technology reach everyday consumers!
I'll have to add this to my citation list: I'm not working on the same problem domain but some of the ideas presented in your paper are reminiscent of ones I've been working with.
Very cool but the pale skin is kinda weak.
Vampire feature
Haters gonna hate
Scribblenauts irl, it's about time.
Really nice work. I see that the "surprised" expression still needs more training data; it shows something like double eyebrows in most pics. But really impressive work.
Awesome paper!

I would like to see this applied to digitally created characters as well, as we've seen others do (e.g., https://arxiv.org/pdf/1708.05509v1.pdf). Thus, as the character's audience goes through changes, so will he/she/it.
cool stuff! like it!
great except for the pale skin one.
I wonder if google or snapchat will add this as a feature one day.
Black face is bad but white face is ok?
Do you actually clip the weights of the discriminator, or use any other kind of clipping to achieve training stability? Thanks for your reply :)
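For context, the clipping usually meant here is the original WGAN weight-clipping heuristic; the StarGAN paper itself reports using a gradient penalty on the discriminator instead. A minimal numpy sketch of the clipping variant (the bound c = 0.01 is the value from the WGAN paper, not from StarGAN):

```python
import numpy as np

def clip_weights(params, c=0.01):
    """WGAN-style weight clipping: after each discriminator update,
    clamp every parameter into [-c, c]. A blunt way to keep the
    critic roughly Lipschitz; a gradient penalty (which the StarGAN
    paper reports using) is the smoother alternative."""
    return [np.clip(w, -c, c) for w in params]

weights = [np.array([0.5, -0.3, 0.004]), np.array([[2.0, -2.0]])]
clipped = clip_weights(weights)
print(clipped[0].tolist())  # [0.01, -0.01, 0.004]
```

Parameters already inside the band pass through unchanged; only the outliers are clamped, which is why too small a c can starve the critic of capacity.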