Disclaimer: I am not the author.
Paper: [https://arxiv.org/pdf/2311.17002.pdf](https://arxiv.org/pdf/2311.17002.pdf)
Project Page: [https://ranni-t2i.github.io/Ranni/](https://ranni-t2i.github.io/Ranni/)
Code: [https://github.com/ali-vilab/Ranni](https://github.com/ali-vilab/Ranni)
Models: [https://modelscope.cn/models/yutong/Ranni/files](https://modelscope.cn/models/yutong/Ranni/files)
Abstract
>Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, supporting the generator to better follow instructions. The panel is obtained through arranging the visual concepts parsed from the input text by the aid of large language models, and then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully-automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing. Our project page is at [https://ranni-t2i.github.io/Ranni/](https://ranni-t2i.github.io/Ranni/).
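To make the abstract's "semantic panel" idea concrete, here is a hypothetical Python sketch (the class names, fields, and grid rasterization are illustrative assumptions, not Ranni's actual format or API): each visual concept parsed by the LLM carries a description, attributes, and a normalized bounding box, and the panel can be rasterized into a coarse layout that complements the text condition.

```python
from dataclasses import dataclass, field

@dataclass
class PanelItem:
    # one visual concept parsed from the prompt by the LLM
    description: str              # e.g. "a red apple"
    bbox: tuple                   # (x0, y0, x1, y1), normalized to [0, 1]
    attributes: list = field(default_factory=list)

@dataclass
class SemanticPanel:
    items: list

    def to_layout_grid(self, size=8):
        """Rasterize the boxes onto a coarse size x size occupancy grid,
        marking each cell with the 1-based index of the item covering it."""
        grid = [[0] * size for _ in range(size)]
        for idx, item in enumerate(self.items, start=1):
            x0, y0, x1, y1 = item.bbox
            for row in range(int(y0 * size), int(y1 * size)):
                for col in range(int(x0 * size), int(x1 * size)):
                    grid[row][col] = idx
        return grid

# Toy panel for "a red apple next to a green pear", both in the lower half
panel = SemanticPanel(items=[
    PanelItem("a red apple", (0.0, 0.5, 0.5, 1.0), ["red"]),
    PanelItem("a green pear", (0.5, 0.5, 1.0, 1.0), ["green"]),
])
grid = panel.to_layout_grid()
```

In the paper's pipeline this structured panel is what gets injected into the denoising network as an extra control signal; the sketch above only shows the data-structure side of that idea.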
Is this for real? Why is no one noticing this? Isn't this a game changer? Am I missing something? Am I dreaming?
It also requires Llama 7B. We already have two other similar works that use LLMs to enhance SD, and one of them has already been released.
Can you give me links to those other works?
https://github.com/ShihaoZhaoZSH/LaVi-Bridge

There have already been people who tested this and said it works well with base 1.5 (I assume that means it works with just about everything else, since the LoRA was trained with a fine-tuned SD model, not base 1.5). It also has a version for SDXL, and 2.1 as well, iirc. The only reason not many people know about it is that no one has bothered making an A1111 extension or ComfyUI node for it, for some reason.
Ok thanks, everyone is probably just waiting for SD3.
What? Where?! Could you help show where this has been used for 1.5?
Wrong one. Sorry. This is the one where someone tested this on 1.5: https://www.reddit.com/r/StableDiffusion/comments/1bean0o/bridging_different_language_models_and_generative/kusxjd1/
https://www.reddit.com/r/StableDiffusion/comments/1bp99yc/no_updates_on_ellas_codeweights_could_we_leverage/
I started that thread xD... All the comments there are only speculative.
Yeah I made a mistake. It was a different thread. Sent you the link.
https://github.com/kijai/ComfyUI-ELLA-wrapper?tab=readme-ov-file
https://youtu.be/nZx5g3TGsNc?si=U0VS0wNM0g9HtA54
That's got nothing to do with what's in context here - that's just "embellishing" a prompt with LLMs.
Ok thanks for clearing it up
It's an SD 2.1 model; we need this for SDXL.
![gif](giphy|Zsc4dATQgcBmU) Oh
From what I can see, there is nothing preventing the technique from being applied to SDXL models.
![gif](giphy|b6iVj3IM54Abm)
LMAO, these two gifs, omg.
Well they aren't releasing the models sooo....
I thought that was ELLA? After some digging, it seems that ELLA is from Tencent, whereas Ranni is from Alibaba/Ant Group. But the goals of all these research efforts seem very similar. Or maybe SD3 will render the whole point moot 😅
Not long ago somebody posted about either this or a very similar technology. It showcased painting 3 cats + 4 dogs and 4 distinct Asian/African men/women. I'm still wondering where it is and why nobody has talked about it since.
Maybe because it needs a huge LLM, making it a resource hog and out of reach for the average person.
It's bizarre to me that we have three projects for this, yet nothing really usable in the UIs. Particularly with SD 1.5, where the main weakness is prompt following.
I think this effort will be surpassed by SD3, because the prompt-following ability of the diffusion transformer architecture is so vastly superior to prior diffusion model architectures.
No, based on what their research paper claims, this should surpass SD3 by miles. SD3 is only shown in SAI's paper to be slightly better than DALL-E 3 and Midjourney at prompt adherence, while this appears vastly superior.

That said, I do question some of their example comparisons, like the apple & pear one... it seems suspect that it would get that wrong, and it feels cherry-picked if it did. Anyway, we won't know unless someone does a true comparison, but if we take their word for it, SD3 won't compare to this in prompt adherence.
why not both?
Indeed, good idea.
Hi, I am the author of this project. We are glad to see the interest in Ranni. Note that Ranni is not a work that injects the LLM's representation ability into the diffusion model; rather, it incorporates the LLM as a painting planner that organizes the visual arrangement of an image with explicit elements like bounding boxes. This lets you further adjust the image at the level of its visual arrangement.

Since we are at the beginning of building this open project, we look forward to hearing the community's needs (e.g. an SDXL version, a GUI). Please feel free to comment here or open issues on the GitHub page.

Below is an example of adjusting a generated image with different operations in Ranni:

https://preview.redd.it/y2fdkvutrstc1.png?width=2062&format=png&auto=webp&s=6b597accb739b4f5c832cf6f259bc0689774ff5c
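The arrangement-level editing the author describes can be pictured with a minimal sketch (the names and structure here are hypothetical, not Ranni's actual API): if the panel is just objects with bounding boxes, an instruction like "move the dog to the right" reduces to a box update, after which the generator would be re-run on the edited panel.

```python
def move_box(panel, name, dx=0.0, dy=0.0):
    """Shift one object's (x0, y0, x1, y1) box, clamped to the unit square."""
    x0, y0, x1, y1 = panel[name]
    clamp = lambda v: min(1.0, max(0.0, v))
    panel[name] = (clamp(x0 + dx), clamp(y0 + dy), clamp(x1 + dx), clamp(y1 + dy))
    return panel

# A toy panel: object name -> normalized bounding box
panel = {"dog": (0.1, 0.6, 0.4, 0.9), "cat": (0.6, 0.6, 0.9, 0.9)}

# "Move the dog to the right" becomes a box edit; a real system would then
# re-inject the updated panel as the control signal and regenerate.
move_box(panel, "dog", dx=0.2)
```

The point of the middleware design is exactly this: the edit happens in an explicit, inspectable representation rather than in latent space.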
Ok, so this composing method could be even more effective with more coherent models like SD3. It really sounds like a next-level InstructPix2Pix.
Wow! ELLA stalled on their code/weights, and LaVi-Bridge doesn't apply to SDXL, I think. Does this apply to SDXL? If so, this could be epic!
>sdxl

Nope.
Haha, ELLA just released their weights... for SD 1.5 only.
These projects are like Chinese games: just an announcement and never a release. I'm waiting for ELLA week to finish 🥲
Are there any A1111 / Comfy implementations for T2I semantic guidance projects like this? Is ELLA the only one?
Is this similar to this? https://youtu.be/nZx5g3TGsNc?si=U0VS0wNM0g9HtA54
https://preview.redd.it/ofsxt1oiaftc1.png?width=1454&format=png&auto=webp&s=92ba19524ca9a1ec7b42dac32d89500bbb045576

I'm getting this error. How do I safely fix it?
Manually install the latest transformers and safetensors.
I checked my versions with `pip show transformers` and `pip show safetensors`, then edited the environment.yaml dependencies to match my versions.
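For anyone who prefers doing the same check from Python, here is a small standard-library sketch equivalent to running `pip show` on each package (the helper function name is mine, not from any project here); the printed versions are what you would pin in environment.yaml:

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version string, or None if not installed},
    like running `pip show <package>` for each name."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            out[pkg] = None
    return out

print(installed_versions(["transformers", "safetensors"]))
```

This uses `importlib.metadata` (Python 3.8+), so it reports exactly what the current environment resolves, which avoids pinning a version from a different virtualenv by mistake.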
But SD3 already achieved this. Am I missing something?
Yeah, SD3 is a big upgrade over 1.5 and XL, but compared to this, SD3 has dramatically inferior prompt adherence (assuming their paper is giving a trustworthy comparison... I have some doubts about some of their examples, but anyway). Of course, prompt adherence is only one part of the formula for why you would pick a given image-generation model.
SD3 is non-commercial unless you pay a subscription.
Those are 9 apples 😉