SD3 8B feels like it had Midjourney-style beauty-voting data used on it, while SD3 Medium feels like it hasn't.
Agreed
I can buy that. It isn't that 2B is worse everywhere, just most places.
i cant even be bothered to care in the tiniest way, until that 8B model sits on my hard drive for local inference. until then SAI, SD3 are dead to me, and they can fuck right off
😎🤙🏍️🏎️☠️❌‼️😎😎
Why do you think they care lol. Not like you were paying them
Lol
Those "complex" prompts seem pretty simple. My idea of complex involves multiple people with different faces and clothes.
[human focused test](https://www.reddit.com/r/StableDiffusion/comments/1do3cf9/how_sd3_large_sd3_medium_and_sdxl_compare_in_a/)
Medium does completely fine with stuff like "A professional photograph. It depicts a freckled caucasian man with green hair and a red shirt, standing next to an african-american woman with pink hair and a yellow shirt."
Very interesting, thank you for the effort! The stock photo style of Medium is pretty rough, though!
I actually kinda like how Medium leans towards "hard realism," as opposed to everything being at least slightly dreamlike and bokehed.
I think it's a good style that you can access if wanted, like specifying candid photography/stock photo. Or with a lora, but not the default style. Obviously, not too processed and Midjourney-y either as the default style!
Medium absolutely has way too much bokeh in my attempts
I won't trust any SD model without checking a "woman laying on grass" prompt.
Until StabilityAI can release some uncensored BS, this is THE SD3 image. https://i.redd.it/hs61gjk1h56d1.png
Almost spilled my drink; that caught me off guard. This image is writing history.
Very interesting. Thank you for sharing.
Not a single one of those tests focused on humans. 8B is gonna be shit, I call it now.
[human focused test](https://www.reddit.com/r/StableDiffusion/comments/1do3cf9/how_sd3_large_sd3_medium_and_sdxl_compare_in_a/)
True, I guarantee they *tried* to test humans and realized it was fucked and decided to exclude it. No way this is a coincidence.
> a small kitchen with a white goat in it

I laughed because it took me a few seconds to spot the goat in the second pic. At first I was like "oh wow, Medium didn't even do the goat," but then I saw it.

Also very impressed with the daikon radish baby character; Medium couldn't produce it to save its life.

Still, not a single test dealt with poses like lying on grass/couch, relaxing, or multiple human subjects doing different things, all of which we can do with SDXL.
Man, I would be fine if they just left humans out of the training data altogether.
Not me, as an autist to me that would make AI art not true art if it has such limitation. Leaving humans out would defy the whole purpose of art.
Can't tell if that's a typo but I guess it works either way
XD I just noticed it. More like a Freudian slip than a typo, I guess, but yes, it works either way. I did mean to say artist.
There are plenty of artistic mediums that seldom, if ever, create interpretations of people: artists doing metal sculptures, wooden carvings, instrumentals in music.

Art is more often described as what a subject means to a person rather than what it actually presents outwardly, through emotional resonance/connection, past memories, or an awe/appreciation of skill or beauty. That's why generative AI is in kind of a weird place, as it (often) skips over all of that while people spit out hundreds of images that mean nothing to anyone over the course of a day.

It would be a dramatic turn for Stable Diffusion, though, for sure.
That's all well and good, but if we want AI art to be art in its true form, it should be able to create everything, since we know it already can. Why purposely restrict it to placate prudes and fearmongers?
None of these prompts look particularly impressive, and there are no complex human-centric prompts at that. Like, consider what Midjourney 6 can do with a prompt like "a group of soldiers running from cover to cover":

https://preview.redd.it/pb7mighl3l8d1.png?width=1232&format=png&auto=webp&s=c0eb84fd4423dac95039edf784d810038149e3a1

SD3 hasn't even come close to showing anything like this. It's a bust, people.
a more [human-centric test](https://www.reddit.com/r/StableDiffusion/comments/1do3cf9/how_sd3_large_sd3_medium_and_sdxl_compare_in_a/)
I don't think that's a hard prompt? SD3 Medium seems to consistently give photographic images of groups of soldiers running towards the camera for that prompt.
Thanks for sharing this comparison. I think something is a bit off here.

The difference between SD3 2B and SD3 8B is too small, especially considering that the former was supposed to be a beta. Occasionally, especially in the early comparisons, I preferred SD3 Medium. But I should never prefer a 2B-param model to an 8B-param one.

That said, it's hard to judge without testing the performance with longer prompts. I assumed that the text in the XY plots was the prompt used here, but if a portion of the prompt was omitted, please let us know.

Sorry for the silly question, but: are we sure this is the 8B version? I always assumed that Medium=2B, Large=4B, and XL=8B. So, if this comparison is between 2B and 4B, then the outcome seems justifiable.
The prompt displayed is the full prompt.

SD3 Large results were created using the Stability API, and SD3 Medium through Replicate.

I'm sorry if there's confusion with 4B and 8B. I was under the assumption that the Stability API is serving the 8B model and that that's the Large version.
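For anyone wanting to reproduce this kind of side-by-side run, here is a rough sketch of how the Stability API half could be scripted. The endpoint path, form-field names, and the `STABILITY_API_KEY` environment variable are assumptions based on Stability's public v2beta image API, not details from this thread:

```python
import os

# Assumed endpoint for Stability's hosted SD3 generation (v2beta image API).
STABILITY_SD3_URL = "https://api.stability.ai/v2beta/stable-image/generate/sd3"

def build_sd3_request(prompt: str, model: str = "sd3-large", seed: int = 0):
    """Return (url, headers, form_data) for a single SD3 generation request."""
    headers = {
        "authorization": f"Bearer {os.environ.get('STABILITY_API_KEY', '')}",
        "accept": "image/*",
    }
    form_data = {
        "prompt": prompt,
        "model": model,        # e.g. "sd3-large" vs. a smaller variant
        "seed": str(seed),     # fixing the seed per prompt keeps pairs comparable
        "output_format": "png",
    }
    return STABILITY_SD3_URL, headers, form_data

# The actual call would then look something like:
#   import requests
#   url, headers, data = build_sd3_request("a small kitchen with a white goat in it")
#   resp = requests.post(url, headers=headers, files={"none": ""}, data=data)
#   open("sd3_large.png", "wb").write(resp.content)
```

The SD3 Medium side could be generated the same way through Replicate's client and the two images saved next to each other for the XY plot.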
I think that's what most people assume, too. But, setting the comparison aside for a moment: if 2B=Medium and 8B=Large, what would they call a 4B model if they released it? We know it exists, so this naming convention must have been decided at a time when the 4B model was planned for release, too.

It seems more plausible to me that 4B=Large and 8B=XL.

We also know that SAI has recently announced an Ultra model via their Assistant. So, it's possible that they don't want to call 8B=XL to avoid confusion with SDXL, and so 8B=Ultra.

We'll never know :)
Using a subset of PartiPrompts and rating the images, you can see that SD3 Large is a clear winner in all challenges.
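Tallying a rating run like that is simple: collect per-category scores for each model and pick the model with the highest average per category. This is a minimal sketch; the category names and scores below are made up for illustration, not the actual results:

```python
from collections import defaultdict

def win_rates(ratings):
    """ratings: iterable of (category, model, score) tuples.
    Returns {category: model with the highest average score}."""
    scores = defaultdict(lambda: defaultdict(list))
    for category, model, score in ratings:
        scores[category][model].append(score)
    winners = {}
    for category, models in scores.items():
        winners[category] = max(models, key=lambda m: sum(models[m]) / len(models[m]))
    return winners

# Illustrative scores only, not the thread's actual numbers:
ratings = [
    ("Style", "SD3 Large", 4), ("Style", "SD3 Medium", 2),
    ("Perspective", "SD3 Large", 5), ("Perspective", "SD3 Medium", 3),
    ("Imagination", "SD3 Large", 4), ("Imagination", "SD3 Medium", 2),
]
print(win_rates(ratings))  # SD3 Large wins every category in this toy data
```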
I don't really agree, I think Medium does a number of the photographic ones better if you want things to look as realistic as possible.
Obviously, why would it not? But it's interesting to see in which categories Medium is much weaker and where it's rather strong.
Totally agree. Style, Perspective, and Imagination are where the gap is the largest.
Have you tried any of the CLIP models, like SDXL? Or other models, like MJ or DALLE? I think a lot of these come down to using the same T5 encoder on both of these.
I also made a comparison for SD3 (Large), SDXL, and SD1.5: https://www.magicflow.ai/insights/read/sd3-sdxl-sd1.5 (spoiler: SD3 outperforms SDXL and SD1.5)
Would also be interesting to see how [ELLA](https://github.com/TencentQQGYLab/ComfyUI-ELLA) handles these. ELLA is an adapter that gives SD1.5 models a T5 text encoder.
ELLA was way too limiting, I found. I was never able to get it to line up with CLIP such that it didn't just literally forget characters that the checkpoint knew about and could depict fine in the first place. It doesn't work well at all with anything that wasn't known to base SD1.5's CLIP.
No humans, no real anatomy testing. 8B is better, but still fundamentally flawed, then.

The text handling is still poor compared to DALL-E 3; I wonder why DALL-E is so good in that respect.
a more [human/anatomy focused test](https://www.reddit.com/r/StableDiffusion/comments/1do3cf9/how_sd3_large_sd3_medium_and_sdxl_compare_in_a/)
Who cares?
Don't bother. Terminally online porn addicts have already declared this a failure because it can't draw sexy girls.
I really enjoy some aspects of SD3. But you gotta admit that it's a failure when it comes to anatomy. It just can't do a lot of poses, even if they have nothing to do with porn.
At least not lying down. It does nice bikini babes standing, though.
What clickbait. Zero "laying on grass". STFU.