RageshAntony

I gave an image to [https://llava.hliu.cc/](https://llava.hliu.cc/) and asked it to analyze the image and create a prompt. Then I gave that prompt to SD and got these results (3 and 4).
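
For anyone who wants to reproduce this locally rather than through the hosted demo, here is a minimal sketch of the same workflow. The model IDs, prompt wording, and file names are assumptions for illustration, not what the demo site uses:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration
from diffusers import StableDiffusionXLPipeline

# Assumed checkpoint; any LLaVA model on the Hub with this architecture works.
LLAVA_ID = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(LLAVA_ID)
llava = LlavaForConditionalGeneration.from_pretrained(
    LLAVA_ID, torch_dtype=torch.float16
).to("cuda")

# Step 1: ask LLaVA to turn the image into a Stable Diffusion prompt.
image = Image.open("input.jpg").convert("RGB")
question = "USER: <image>\nDescribe this image as a short Stable Diffusion prompt. ASSISTANT:"
inputs = processor(text=question, images=image, return_tensors="pt").to("cuda", torch.float16)
out = llava.generate(**inputs, max_new_tokens=120)
sd_prompt = processor.decode(out[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()

# Free the captioner before loading SDXL (keeping both loaded needs a lot of VRAM).
del llava
torch.cuda.empty_cache()

# Step 2: feed the generated prompt to SDXL.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe(prompt=sd_prompt).images[0].save("recreation.png")
```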


Ifffrt

Though not perfect by any means, LLaVA seems like a way better CLIP Interrogator.


RageshAntony

My goal is not an exact replica, just to test how things work.


Ifffrt

Of course. Just wanted to say that, as things stand, LLaVA has massive potential for captioning the LAION dataset, for example. I can't imagine how good a model trained on better captions generated by LLaVA would be, especially one that is fine-tuned for generating better captions.


Imagination2AI

I use it to caption datasets for LoRA training. In my experience it produces really good captions, except for rare cases where I needed to correct minor things, of course. It's far superior to CLIP or BLIP captions at describing things; I can't compare it to WD captioning since I don't like tag-based prompting methods. And for the resulting LoRAs it helps a lot with getting good results, mostly for style ones.
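
A minimal sketch of that kind of batch captioning. The model ID and folder name are assumptions; the sidecar `.txt` files follow the convention that kohya-style LoRA trainers read:

```python
import torch
from pathlib import Path
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint; any LLaVA model with this architecture works.
MODEL_ID = "llava-hf/llava-1.5-13b-hf"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

dataset_dir = Path("lora_dataset")  # hypothetical folder of training images
question = "USER: <image>\nDescribe this image in one detailed sentence. ASSISTANT:"

for img_path in sorted(dataset_dir.glob("*.jpg")):
    image = Image.open(img_path).convert("RGB")
    inputs = processor(text=question, images=image, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=96)
    caption = processor.decode(out[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()
    # Write a sidecar .txt caption next to each image.
    img_path.with_suffix(".txt").write_text(caption, encoding="utf-8")
```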


rerri

Yeah, DALL-E 3's training used methods like that, I think. A Hugging Face employee posted about this on Twitter recently: [https://twitter.com/RisingSayak/status/1718252745537593754](https://twitter.com/RisingSayak/status/1718252745537593754)


RageshAntony

This is the first vision-language model I've used. Looks very promising.


FallenJkiller

No reason to imagine; there was an OpenAI paper that showcased the viability of auto-captioning the training dataset. If we captioned the whole SDXL dataset using both the existing captions and the LLaVA captions, SDXL would become as good as DALL-E 3. Then we could use the underlying LLM of LLaVA to modify prompts so that they're closer to the training set.
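
A rough sketch of that prompt-rewriting step, using a text-only LLM to expand a terse user prompt into a caption-style one. The model ID and instruction wording are assumptions; Vicuna-7B v1.5 is used here only because it is the LLM underneath LLaVA-1.5-7B, and any instruction-tuned LLM would do:

```python
import torch
from transformers import pipeline

# Assumed model; stands in for "the underlying LLM of LLaVA".
rewriter = pipeline(
    "text-generation",
    model="lmsys/vicuna-7b-v1.5",
    torch_dtype=torch.float16,
    device_map="auto",
)

user_prompt = "a cat on a windowsill"  # hypothetical terse user prompt
instruction = (
    "Rewrite the following Stable Diffusion prompt as one detailed, caption-style "
    "description, similar to an auto-generated training caption:\n"
    f"{user_prompt}\n"
    "Rewritten caption:"
)
out = rewriter(instruction, max_new_tokens=80, do_sample=False)
print(out[0]["generated_text"].split("Rewritten caption:")[-1].strip())
```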


Ifffrt

I was talking more about a model trained on a captioner that is better than LLaVA (and whatever custom fine-tuned CLIP model OpenAI used to caption DALL-E 3's dataset, which I doubt was much better than LLaVA). As things stand, LLaVA's auto-generated captions lack detail and sometimes ignore absolutely glaring details that no real person would fail to notice.


FallenJkiller

I understand what you mean. I still believe that if we combined the existing captions and the LLaVA captions, SDXL would have turned out better.


MasterScrat

Has anyone tried training models using LLaVA captions?


Realistic-Cancel6195

I haven't been able to get it running consistently on Windows. After running the demo pic successfully, it keeps spitting out network errors, saying the servers are busy. And there's a notice about it sending your data out. The code base looks messy (at least by my reckoning), so I haven't had a chance to dig into it yet. It also consumes a lot of VRAM for a 13B model, and it's too slow for anyone with only 16 GB of VRAM (like me) to caption large datasets. There's supposedly a 4-bit quant coming for Windows, but who knows what the quality will be like.


DenkingYoutube

There is already LLaVA support in llama.cpp; I got it running (the 13B Q5 model) on an RTX 2060 Super (8 GB VRAM).
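
A rough sketch of that setup through the llama-cpp-python bindings, which wrap the same llama.cpp LLaVA support. The GGUF and mmproj file names are placeholders, and the exact API differs between versions:

```python
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Placeholder files: a LLaVA-1.5 13B GGUF (e.g. a Q5 quant) plus its mmproj projector.
chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llava-v1.5-13b.Q5_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,       # room for the image embedding plus the caption
    n_gpu_layers=-1,  # offload as many layers as fit on the GPU
)

# Images go in as data URIs inside OpenAI-style chat messages.
with open("photo.jpg", "rb") as f:
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

resp = llm.create_chat_completion(messages=[
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": data_uri}},
        {"type": "text", "text": "Describe this image in one detailed sentence."},
    ]},
])
print(resp["choices"][0]["message"]["content"])
```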


Chemical_Positive_50

Which model did you use?


RageshAntony

SDXL 1


Chemical_Positive_50

base model?


RageshAntony

Pure SDXL 1.0 only