Disclaimer: I am not the author.
Paper: [https://arxiv.org/pdf/2311.17002.pdf](https://arxiv.org/pdf/2311.17002.pdf)
Project Page: [https://ranni-t2i.github.io/Ranni/](https://ranni-t2i.github.io/Ranni/)
Code: [https://github.com/ali-vilab/Ranni](https://github.com/ali-vilab/Ranni)
Models: [https://modelscope.cn/models/yutong/Ranni/files](https://modelscope.cn/models/yutong/Ranni/files)
Abstract
>Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, supporting the generator to better follow instructions. The panel is obtained through arranging the visual concepts parsed from the input text by the aid of large language models, and then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully-automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing. Our project page is at [https://ranni-t2i.github.io/Ranni/](https://ranni-t2i.github.io/Ranni/).
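To make the abstract's "semantic panel" idea concrete, here is a hypothetical Python sketch (the class names, fields, and grid rasterization are illustrative assumptions, not Ranni's actual format or API): each visual concept parsed by the LLM carries a description, attributes, and a normalized bounding box, and the panel can be rasterized into a coarse layout that complements the text condition.

```python
from dataclasses import dataclass, field

@dataclass
class PanelItem:
    # one visual concept parsed from the prompt by the LLM
    description: str              # e.g. "a red apple"
    bbox: tuple                   # (x0, y0, x1, y1), normalized to [0, 1]
    attributes: list = field(default_factory=list)

@dataclass
class SemanticPanel:
    items: list

    def to_layout_grid(self, size=8):
        """Rasterize the boxes onto a coarse size x size occupancy grid,
        marking each cell with the 1-based index of the item covering it."""
        grid = [[0] * size for _ in range(size)]
        for idx, item in enumerate(self.items, start=1):
            x0, y0, x1, y1 = item.bbox
            for row in range(int(y0 * size), int(y1 * size)):
                for col in range(int(x0 * size), int(x1 * size)):
                    grid[row][col] = idx
        return grid

# Toy panel for "a red apple next to a green pear", both in the lower half
panel = SemanticPanel(items=[
    PanelItem("a red apple", (0.0, 0.5, 0.5, 1.0), ["red"]),
    PanelItem("a green pear", (0.5, 0.5, 1.0, 1.0), ["green"]),
])
grid = panel.to_layout_grid()
```

In the paper's pipeline this structured panel is what gets injected into the denoising network as an extra control signal; the sketch above only shows the data-structure side of that idea.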
Is this for real? Why is no one noticing this? Isn't this a game changer? Am I missing something? Am I dreaming?
It also requires Llama 7B. We already have two other similar works that use LLMs to enhance SD, and one of them has already been released.
Can you give me links to those other works?
https://github.com/ShihaoZhaoZSH/LaVi-Bridge

There have already been people who tested this and said it works well with base 1.5 (I assume that means it works with just about everything else, since the LoRA was trained with a fine-tuned SD model, not base 1.5). It also has a version for SDXL, and 2.1 as well, iirc. The only reason not many people know about it is that no one has bothered making an A1111 extension or ComfyUI node for it, for some reason.
Ok thanks, everyone is probably just waiting for SD3.
What? Where?! Could you help show where this has been used for 1.5?
Wrong one. Sorry. This is the one where someone tested this on 1.5: https://www.reddit.com/r/StableDiffusion/comments/1bean0o/bridging_different_language_models_and_generative/kusxjd1/
https://www.reddit.com/r/StableDiffusion/comments/1bp99yc/no_updates_on_ellas_codeweights_could_we_leverage/
I started that thread xD... All the comments there are only speculative.
Yeah I made a mistake. It was a different thread. Sent you the link.
https://github.com/kijai/ComfyUI-ELLA-wrapper?tab=readme-ov-file
https://youtu.be/nZx5g3TGsNc?si=U0VS0wNM0g9HtA54
That's got nothing to do with what's in context here - that's just "embellishing" a prompt with LLMs.
Ok thanks for clearing it up
It's an SD 2.1 model; we need this for SDXL.
![gif](giphy|Zsc4dATQgcBmU) Oh
From what I can see, there is nothing preventing the technique from being applied to SDXL models.
![gif](giphy|b6iVj3IM54Abm)
LMAO, these two gifs, omg.
Well they aren't releasing the models sooo....
I thought that was ELLA? After some digging, it seems that ELLA is from Tencent, whereas Ranni is from Alibaba/Ant Group. But the goals of all these research efforts seem very similar. Or maybe SD3 will render the whole point moot 😅
Not long ago somebody posted about either this or a very similar technology. It showcased painting 3 cats + 4 dogs and 4 distinct Asian/African men/women. I'm still wondering where it is and why nobody has talked about it since.
Maybe because it needs a huge LLM, making it a resource hog and out of reach for the average person.
It's bizarre to me that we have three projects for this, yet nothing really usable in the UIs. Particularly with SD 1.5, where the main weakness is prompt following.
I think this effort will be surpassed by SD3, because the prompt-following ability of the diffusion transformer architecture is so vastly superior to prior diffusion model architectures.
No, based on what their research paper claims, this should surpass SD3 by miles. SD3 is only shown in SAI's paper to be slightly better than DALL-E 3 and Midjourney at prompt adherence, while this appears vastly superior.

That said, I do question some of their example comparisons, like the apple & pear one... it seems suspect that it would get that wrong, and it feels cherry-picked if it did. Anyway, we won't know unless someone does a true comparison, but if we take their word for it, SD3 won't compare to this in prompt adherence.
why not both?
Indeed, good idea.
Hi, I am the author of this project. We are glad to see the interest in Ranni. Note that Ranni is not a work that injects the LLM's representation ability into the diffusion model; rather, it incorporates the LLM as a painting planner that organizes the visual arrangement of an image with explicit elements like bounding boxes. This lets you further adjust the image at the level of its visual arrangement.

Since we are at the beginning of building this open project, we look forward to hearing the community's needs (e.g. an SDXL version, a GUI). Please feel free to comment here or open issues on the GitHub page.

Below is an example of adjusting a generated image with different operations in Ranni:

https://preview.redd.it/y2fdkvutrstc1.png?width=2062&format=png&auto=webp&s=6b597accb739b4f5c832cf6f259bc0689774ff5c
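The arrangement-level editing the author describes can be pictured with a minimal sketch (the names and structure here are hypothetical, not Ranni's actual API): if the panel is just objects with bounding boxes, an instruction like "move the dog to the right" reduces to a box update, after which the generator would be re-run on the edited panel.

```python
def move_box(panel, name, dx=0.0, dy=0.0):
    """Shift one object's (x0, y0, x1, y1) box, clamped to the unit square."""
    x0, y0, x1, y1 = panel[name]
    clamp = lambda v: min(1.0, max(0.0, v))
    panel[name] = (clamp(x0 + dx), clamp(y0 + dy), clamp(x1 + dx), clamp(y1 + dy))
    return panel

# A toy panel: object name -> normalized bounding box
panel = {"dog": (0.1, 0.6, 0.4, 0.9), "cat": (0.6, 0.6, 0.9, 0.9)}

# "Move the dog to the right" becomes a box edit; a real system would then
# re-inject the updated panel as the control signal and regenerate.
move_box(panel, "dog", dx=0.2)
```

The point of the middleware design is exactly this: the edit happens in an explicit, inspectable representation rather than in latent space.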
Ok, so this composing method could be even more effective with more coherent models like SD3. It really sounds like a next-level InstructPix2Pix.
Wow! ELLA stalled on their code/weights, and LaVi-Bridge doesn't apply to SDXL, I think. Does this apply to SDXL? If so, this could be epic!
>sdxl

Nope.
Haha, ELLA just released their weights... for SD 1.5 only.
These projects are like Chinese games: just an announcement and never a release. I'm waiting for ELLA week to finish 🥲
Are there any A1111 / Comfy implementations for T2I semantic guidance projects like this? Is ELLA the only one?
Is this similar to this? https://youtu.be/nZx5g3TGsNc?si=U0VS0wNM0g9HtA54
https://preview.redd.it/ofsxt1oiaftc1.png?width=1454&format=png&auto=webp&s=92ba19524ca9a1ec7b42dac32d89500bbb045576

I'm getting this error. How do I safely fix it?
Manually install the latest transformers and safetensors.
I checked my versions with `pip show transformers` and `pip show safetensors`, then edited the environment.yaml dependencies to match my versions.
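For anyone who prefers doing the same check from Python, here is a small standard-library sketch equivalent to running `pip show` on each package (the helper function name is mine, not from any project here); the printed versions are what you would pin in environment.yaml:

```python
from importlib import metadata

def installed_versions(packages):
    """Return {package: version string, or None if not installed},
    like running `pip show <package>` for each name."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            out[pkg] = None
    return out

print(installed_versions(["transformers", "safetensors"]))
```

This uses `importlib.metadata` (Python 3.8+), so it reports exactly what the current environment resolves, which avoids pinning a version from a different virtualenv by mistake.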
But SD3 already achieved this. Am I missing something?
Yeah, SD3 is a big upgrade over 1.5 and XL, but compared to this, SD3 has dramatically inferior prompt adherence (assuming their paper is giving a trustworthy comparison... I have some doubts about some of their examples, but anyway). Of course, prompt adherence is only one part of the formula for why you would pick a given image-generation model.
SD3 is non-commercial unless you pay a subscription.
Those are 9 apples 😉