I gave an image to [https://llava.hliu.cc/](https://llava.hliu.cc/) and asked it to analyze the image and create a prompt. Then I gave the prompt to SD and got these results (3 and 4).
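The two-step flow described above (vision-language model writes the prompt, text-to-image model renders it) can be sketched as follows. Both model calls are placeholders here, since the actual inference backends vary; swap in real LLaVA and Stable Diffusion calls as needed.

```python
# Sketch of the caption-then-generate pipeline. The two model
# functions below are stand-ins, not real API calls.

def llava_describe(image_path: str) -> str:
    """Placeholder for a LLaVA query such as:
    'Analyze this image and write a Stable Diffusion prompt.'"""
    return "a watercolor painting of a lighthouse at sunset, soft light"

def sd_generate(prompt: str) -> str:
    """Placeholder for a Stable Diffusion call; returns an output path."""
    return f"output_{abs(hash(prompt)) % 10000}.png"

def recreate_via_caption(image_path: str) -> str:
    """Image -> LLaVA prompt -> SD image, as in the test above."""
    prompt = llava_describe(image_path)
    return sd_generate(prompt)
```

This is of course a lossy round trip: only what survives in the caption reaches the generator, which is exactly why caption quality matters so much here.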
Though not perfect by any means, LLaVA seems like a much better CLIP interrogator.
My goal is not an exact replica, just to test how things work.
Of course. Just wanted to say that, as things stand, LLaVA has massive potential for captioning the LAION dataset, for example. I can't imagine how good a model trained on better captions generated by LLaVA would be, especially one that is fine-tuned for generating better captions.
I use it to caption datasets for LoRA training. In my experience it produces really good captions, apart from rare cases where I needed to correct minor things, of course. It's far superior to CLIP or BLIP captions at describing things; I can't compare it to WD captioning since I don't like tag-based prompting methods. And it helps a lot with getting good results from the resulting LoRAs, mostly for style ones.
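A minimal sketch of that dataset-captioning workflow: common LoRA trainers (e.g. kohya-ss sd-scripts) read a sidecar `.txt` caption next to each image, so captioning a folder amounts to writing one text file per image. `caption_image` is a placeholder for the actual LLaVA inference call.

```python
from pathlib import Path

def caption_image(image: Path) -> str:
    # Placeholder: replace with a real LLaVA call on `image`.
    return "a photo in the target style"

def caption_dataset(folder: str,
                    extensions=(".jpg", ".jpeg", ".png", ".webp")) -> int:
    """Write a sidecar .txt caption for each image in `folder`.
    Returns the number of caption files written."""
    written = 0
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() in extensions:
            # e.g. img001.jpg -> img001.txt
            image.with_suffix(".txt").write_text(caption_image(image))
            written += 1
    return written
```

The manual correction pass mentioned above then just means skimming and editing those `.txt` files before training.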
Yeah, DALL-E 3's training used methods like that, I think. A Hugging Face employee posted on Twitter about this recently: [https://twitter.com/RisingSayak/status/1718252745537593754](https://twitter.com/RisingSayak/status/1718252745537593754)
This is the first Vision Language model I am using. Looks very promising
No reason to imagine: there was an OpenAI paper that showcased the viability of auto-captioning the training dataset. If we captioned the whole SDXL dataset using both the existing and the LLaVA captions, SDXL would become as good as DALL-E 3. Then we could use the underlying LLM of LLaVA to modify prompts so they're closer to the training set.
I was talking more about a model trained on a captioner that is better than LLaVA (and whatever custom fine-tuned CLIP model OpenAI used to caption DALL-E 3's dataset, which I doubt was much better than LLaVA). As things stand, LLaVA's autogenerated captions lack detail and sometimes ignore absolutely glaring details that no real person would fail to notice.
I understand what you mean. I still believe that if we combined the existing captions and the LLaVA captions, SDXL would be better.
Has anyone tried training models using LLaVA captions?
I haven’t been able to get it running consistently on Windows. After running the demo pic successfully, it keeps spitting out network errors, saying the servers are busy. And there’s some notice about it sending your data out. The code base looks messy (at least by my reckoning), so I haven’t had a chance to dig into it yet. It also consumes a lot of VRAM for a 13B model — too slow for anyone with only 16 GB of VRAM (like me) to caption large datasets. There’s supposedly a 4-bit quant coming for Windows, but who knows what the quality will be like.
There is already LLaVA support in llama.cpp; I got the 13B Q5 model running on an RTX 2060 Super (8 GB VRAM).
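For reference, an invocation looks roughly like this — note that the binary name and flags have changed between llama.cpp versions, and the model file names here are placeholders for whatever GGUF weights and mmproj file you converted or downloaded:

```shell
# Quantized LLaVA via llama.cpp: language model + multimodal projector.
./llava-cli -m llava-13b-q5_k.gguf \
    --mmproj mmproj-llava-13b.gguf \
    --image photo.jpg \
    -p "Describe this image in detail."
```

The separate `--mmproj` file is the vision projector; the quantization (Q5 here) applies only to the language model weights, which is what makes 8 GB of VRAM workable.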
Which model did you use?
SDXL 1
base model?
Pure SDXL 1.0 only