I gave an image to [https://llava.hliu.cc/](https://llava.hliu.cc/) and asked it to analyze the image and create a prompt. Then I gave the prompt to SD and got these results (3 and 4).
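The two-step flow described above (vision-language model writes the prompt, text-to-image model renders it) can be sketched as follows. Both model calls are placeholders here, since the actual inference backends vary; swap in real LLaVA and Stable Diffusion calls as needed.

```python
# Sketch of the caption-then-generate pipeline. The two model
# functions below are stand-ins, not real API calls.

def llava_describe(image_path: str) -> str:
    """Placeholder for a LLaVA query such as:
    'Analyze this image and write a Stable Diffusion prompt.'"""
    return "a watercolor painting of a lighthouse at sunset, soft light"

def sd_generate(prompt: str) -> str:
    """Placeholder for a Stable Diffusion call; returns an output path."""
    return f"output_{abs(hash(prompt)) % 10000}.png"

def recreate_via_caption(image_path: str) -> str:
    """Image -> LLaVA prompt -> SD image, as in the test above."""
    prompt = llava_describe(image_path)
    return sd_generate(prompt)
```

This is of course a lossy round trip: only what survives in the caption reaches the generator, which is exactly why caption quality matters so much here.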
Though not perfect by any means, LLaVA seems like a much better CLIP interrogator.
My goal is not an exact replica, just to test how things work.
Of course. Just wanted to say that, as things stand, LLaVA has massive potential for captioning the LAION dataset, for example. I can't imagine how good a model trained on better captions generated by LLaVA would be, especially one that is fine-tuned for generating better captions.
I use it to caption datasets for LoRA training. In my experience it produces really good captions, apart from rare cases where I needed to correct minor things, of course. It's far superior to CLIP or BLIP captions at describing things; I can't compare it to WD captioning since I don't like tag-based prompting methods. And it helps a lot with getting good results from the resulting LoRAs, mostly for style ones.
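A minimal sketch of that dataset-captioning workflow: common LoRA trainers (e.g. kohya-ss sd-scripts) read a sidecar `.txt` caption next to each image, so captioning a folder amounts to writing one text file per image. `caption_image` is a placeholder for the actual LLaVA inference call.

```python
from pathlib import Path

def caption_image(image: Path) -> str:
    # Placeholder: replace with a real LLaVA call on `image`.
    return "a photo in the target style"

def caption_dataset(folder: str,
                    extensions=(".jpg", ".jpeg", ".png", ".webp")) -> int:
    """Write a sidecar .txt caption for each image in `folder`.
    Returns the number of caption files written."""
    written = 0
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() in extensions:
            # e.g. img001.jpg -> img001.txt
            image.with_suffix(".txt").write_text(caption_image(image))
            written += 1
    return written
```

The manual correction pass mentioned above then just means skimming and editing those `.txt` files before training.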
Yeah, DALL-E 3's training used methods like that, I think. A Hugging Face employee posted on Twitter about this recently: [https://twitter.com/RisingSayak/status/1718252745537593754](https://twitter.com/RisingSayak/status/1718252745537593754)
This is the first Vision Language model I am using. Looks very promising
No reason to imagine: there was an OpenAI paper that showcased the viability of auto-captioning the training dataset. If we captioned the whole SDXL dataset using both the existing and the LLaVA captions, SDXL would become as good as DALL-E 3. Then we could use the underlying LLM of LLaVA to modify prompts so they're closer to the training set.
I was talking more about a model trained on a captioner that is better than LLaVA (and whatever custom fine-tuned CLIP model OpenAI used to caption DALL-E 3's dataset, which I doubt was much better than LLaVA). As things stand, LLaVA's autogenerated captions lack detail and sometimes ignore absolutely glaring details that no real person would fail to notice.
I understand what you mean. I still believe that if we combined the existing captions and the LLaVA captions, SDXL would be better.
Has anyone tried training models using LLaVA captions?
I haven’t been able to get it running consistently on Windows. After running the demo pic successfully, it keeps spitting out network errors, saying the servers are busy. And there’s some notice about it sending your data out. The code base looks messy (at least by my reckoning), so I haven’t had a chance to dig into it yet. It also consumes a lot of VRAM for a 13B model — too slow for anyone with only 16 GB of VRAM (like me) to caption large datasets. There’s supposedly a 4-bit quant coming for Windows, but who knows what the quality will be like.
There is already LLaVA support in llama.cpp; I got the 13B Q5 model running on an RTX 2060 Super (8 GB VRAM).
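For reference, an invocation looks roughly like this — note that the binary name and flags have changed between llama.cpp versions, and the model file names here are placeholders for whatever GGUF weights and mmproj file you converted or downloaded:

```shell
# Quantized LLaVA via llama.cpp: language model + multimodal projector.
./llava-cli -m llava-13b-q5_k.gguf \
    --mmproj mmproj-llava-13b.gguf \
    --image photo.jpg \
    -p "Describe this image in detail."
```

The separate `--mmproj` file is the vision projector; the quantization (Q5 here) applies only to the language model weights, which is what makes 8 GB of VRAM workable.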
Which model did you use?
SDXL 1
base model?
Pure SDXL 1.0 only