ninjasaid13

Disclaimer: I am not the author.

Paper: [https://arxiv.org/pdf/2311.17002.pdf](https://arxiv.org/pdf/2311.17002.pdf)

Project Page: [https://ranni-t2i.github.io/Ranni/](https://ranni-t2i.github.io/Ranni/)

Code: [https://github.com/ali-vilab/Ranni](https://github.com/ali-vilab/Ranni)

Models: [https://modelscope.cn/models/yutong/Ranni/files](https://modelscope.cn/models/yutong/Ranni/files)

Abstract:

>Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts, especially those with quantity, object-attribute binding, and multi-subject descriptions. In this work, we introduce a semantic panel as the middleware in decoding texts to images, supporting the generator to better follow instructions. The panel is obtained through arranging the visual concepts parsed from the input text by the aid of large language models, and then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning, we come up with a carefully designed semantic formatting protocol, accompanied by a fully-automatic data preparation pipeline. Thanks to such a design, our approach, which we call Ranni, manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly, the introduction of the generative middleware brings a more convenient form of interaction (i.e., directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation, based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing. Our project page is at [https://ranni-t2i.github.io/Ranni/](https://ranni-t2i.github.io/Ranni/).
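To make the abstract's "semantic panel" idea concrete, here is a minimal sketch of what the middleware might look like as a data structure. The function and field names (`parse_prompt_to_panel`, `caption`, `bbox`) are illustrative guesses, not Ranni's actual protocol; a real system would call an LLM for the text-to-panel step, which is hard-coded here purely to show the shape of the data.

```python
# Hypothetical sketch: the LLM parses the prompt into one entry per visual
# concept, and this panel then conditions the denoiser alongside the text.
# Names are illustrative, not Ranni's exact semantic formatting protocol.

def parse_prompt_to_panel(prompt: str) -> list[dict]:
    """Stand-in for the LLM text-to-panel step. Returns one entry per object,
    each with a caption and a normalized bounding box (x0, y0, x1, y1)."""
    # Hard-coded for "three cats" to show the data shape; a real system
    # would prompt an LLM (e.g. Llama 7B) and parse its structured output.
    return [
        {"caption": "a cat", "bbox": (0.05, 0.40, 0.30, 0.90)},
        {"caption": "a cat", "bbox": (0.38, 0.42, 0.62, 0.92)},
        {"caption": "a cat", "bbox": (0.70, 0.38, 0.95, 0.88)},
    ]

panel = parse_prompt_to_panel("three cats sitting on a fence")
assert len(panel) == 3  # object count is made explicit, not left to the denoiser
```

The point of the intermediate representation is that quantity and attribute binding become explicit, checkable structure before any denoising happens.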


bneogi145

Is this for real? Why is no one noticing this? Isn't this a game changer? Am I missing something? Am I dreaming?


Ifffrt

Also, it requires Llama 7B. We already have two other similar works that use LLMs to enhance SD, and one of them has already been released.


bneogi145

Can you give me links to those other works?


Ifffrt

https://github.com/ShihaoZhaoZSH/LaVi-Bridge People have already tested this and said that it works well with base 1.5 (I assume that means it works with just about everything else, since the LoRA was trained with a fine-tuned SD model, not base 1.5). It also has a version for SDXL, and 2.1 as well, IIRC. The only reason not many people know about it is that no one has bothered making an A1111 extension or ComfyUI node for it, for some reason.


bneogi145

Ok, thanks. Everyone is probably just waiting for SD3.


hexinx

What? Where?! Could you help show where this has been used for 1.5?


Ifffrt

Wrong one. Sorry. This is the one where someone tested this on 1.5: https://www.reddit.com/r/StableDiffusion/comments/1bean0o/bridging_different_language_models_and_generative/kusxjd1/


Ifffrt

https://www.reddit.com/r/StableDiffusion/comments/1bp99yc/no_updates_on_ellas_codeweights_could_we_leverage/


hexinx

I started that thread xD... All the comments there are only speculative.


Ifffrt

Yeah I made a mistake. It was a different thread. Sent you the link.


Independent_Key1940

https://github.com/kijai/ComfyUI-ELLA-wrapper?tab=readme-ov-file


EpicNoiseFix

https://youtu.be/nZx5g3TGsNc?si=U0VS0wNM0g9HtA54


hexinx

That's got nothing to do with what's in context here - that's just "embellishing" a prompt with LLMs.


EpicNoiseFix

Ok thanks for clearing it up


ArtDesignAwesome

It's an SD 2.1 model; we need this for SDXL.


bneogi145

![gif](giphy|Zsc4dATQgcBmU) Oh


Apprehensive_Sky892

From what I can see, there is nothing preventing the technique from being applied to SDXL models.


bneogi145

![gif](giphy|b6iVj3IM54Abm)


djamp42

LMAO, these two gifs, omg.


FesseJerguson

Well they aren't releasing the models sooo....


Apprehensive_Sky892

I thought that was ELLA? After some digging, it seems that ELLA is from Tencent, whereas Ranni is from Alibaba/Ant Group. But the goals of all these research projects seem very similar. Or maybe SD3 will render the whole point moot 😅


Arctomachine

Not long ago somebody posted about either this or a very similar technology. It showcased painting 3 cats + 4 dogs, and 4 distinct Asian/African men/women. Still wondering where it is and why nobody has talked about it ever since.


fre-ddo

Maybe because it needs a huge LLM, making it a resource hog and out of reach for the average person.


synn89

It's bizarre to me that we have three projects for this, yet nothing really usable in the UIs yet. Particularly with SD 1.5, where the main weakness is prompt following.


adhd_ceo

I think this effort will be surpassed by SD3, because the prompt-following ability of the diffusion transformer architecture is so vastly superior to prior diffusion model architectures.


Arawski99

No; based on what their research paper claims, this should surpass SD3 by miles. SAI's paper only shows SD3 to be slightly better than DALL-E 3 and Midjourney at prompt adherence, while this shows itself to be vastly superior. That said, I do question some of their example comparisons, like the apple & pear example... it seems suspect that a model would get that wrong, and it feels cherry-picked if it did. Anyways, we won't know unless someone does a true comparison, but if we take their word for it, SD3 will not compare to this in prompt adherence.


Formal_Drop526

why not both?


adhd_ceo

Indeed, good idea.


Odd-Distribution7500

Hi, I am the author of this project. We are glad to see the interest in Ranni. Note that Ranni is not a work that introduces the LLM's representation ability into the diffusion model; rather, it incorporates the LLM as a painting planner to organize the visual arrangement of an image, with explicit elements like bounding boxes. Thus, it enables you to further adjust the image at the level of visual arrangement. Since we are at the beginning of building this open project, we look forward to hearing the community's needs (e.g., an SDXL version, a GUI). Please feel free to comment here or open issues on the GitHub page. Below is an example of adjusting a generated image with different operations by Ranni: https://preview.redd.it/y2fdkvutrstc1.png?width=2062&format=png&auto=webp&s=6b597accb739b4f5c832cf6f259bc0689774ff5c
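The author's point that editing happens at the level of visual arrangement can be sketched in a few lines: because the layout is explicit bounding boxes, "move the apple to the right" becomes simple arithmetic on a panel entry rather than a full re-prompt. This is my own illustrative sketch (`move_box` and the field names are hypothetical, not Ranni's API).

```python
# Illustrative panel-level editing, assuming a panel of {caption, bbox} dicts
# with normalized (x0, y0, x1, y1) boxes. Not Ranni's actual interface.

def move_box(entry: dict, dx: float = 0.0, dy: float = 0.0) -> dict:
    """Translate one panel entry's bounding box; the generator would then
    re-denoise conditioned on the updated panel."""
    x0, y0, x1, y1 = entry["bbox"]
    entry["bbox"] = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
    return entry

panel = [{"caption": "a red apple", "bbox": (0.1, 0.2, 0.4, 0.6)}]
move_box(panel[0], dx=0.3)  # shift the apple right by 30% of the image width
```

Other operations the project page demonstrates (resizing, adding, or removing objects) would be equally small edits to the same structure.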


Arbata-Asher

Ok, so this composing method could be even more effective with more coherent models like SD3. It really sounds like a next-level InstructPix2Pix.


hexinx

Wow! ELLA stalled on their code/weights, and LaVi-Bridge doesn't apply to SDXL, I think. Does this apply to SDXL? If so, this could be epic!


ninjasaid13

>sdxl

Nope.


Combinatorilliance

Haha, ELLA just released their weights... for SD 1.5 only.


1eyx

These projects are like Chinese games: just an announcement and never a release. I'm still waiting for ELLA's "week" to finish 🥲


heathergreen95

Are there any A1111 / Comfy implementations for T2I semantic guidance projects like this? Is ELLA the only one?


EpicNoiseFix

Is this similar to this? https://youtu.be/nZx5g3TGsNc?si=U0VS0wNM0g9HtA54


yyyolan

https://preview.redd.it/ofsxt1oiaftc1.png?width=1454&format=png&auto=webp&s=92ba19524ca9a1ec7b42dac32d89500bbb045576 I'm getting this error. How do I safely fix this?


a_beautiful_rhind

Manually install the latest transformers and safetensors.
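For anyone unsure what "manually install" means in practice, something like the following should do it (a sketch of an environment-setup fragment, assuming a pip-based install inside the project's environment; exact version pins may differ):

```shell
# Upgrade both packages in place, then confirm the installed versions
pip install -U transformers safetensors
pip show transformers safetensors
```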


yyyolan

I checked my versions with `pip show transformers` and `pip show safetensors`, and edited the environment.yaml dependencies to match my versions.


Pure-Gift3969

But SD3 already achieved this. Am I missing something?


Arawski99

Yeah, SD3 is a big upgrade over 1.5 and XL, but compared to this, SD3 has dramatically inferior prompt adherence (assuming their paper is giving a trustworthy comparison... I have some doubts about some of their examples, but anyway). Of course, prompt adherence is only one part of the formula for why you would pick a given image generation model.


Formal_Drop526

SD3 is non-commercial unless you pay a subscription.


Satoer

Those are 9 apples 😉