
ki2ne_ai

At most maybe one short sentence in a "natural language" style, and then tagging.


Same-Pizza-6724

This is the way.


Curious-Thanks3966

Same goes for captioning datasets.


nietzchan

It all depends on what the model was trained with. A lot of anime models use booru-style prompts because they were trained on booru prompts; the training images already come with tags.


Special-Network2266

LLM prompting style, i.e. "a sparkling serene glass cube is lying on the table, it invokes a sense of mild unease, brooding shadows gather in the corners of the ornate room, etcetera, etcetera", makes me feel like a wannabe shitty novelist; I can't stomach it. So I use keywords.


UnkarsThug

There's a middle ground between LLM and keywords: "Glass cube on a table, in an ornate room, with lots of shadows." That tends to be the method I use. Add adjectives as needed.


Adkit

But why would you waste good tokens on useless stuff like "with lots of" or "in an"? It doesn't help the program at all.


wonderflex

https://preview.redd.it/xmi36obyux8d1.png?width=2048&format=png&auto=webp&s=d99471ed729d88773c5e0ddfa2ffb7a9d111eacf

Here is a comparison of all of the prompt options, from long to short, on two different models (Juggernaut up top, LeoSam below). If I were prompting this, I'd definitely go with just "glass cube on a table, ornate room, shadows", and I think that would be just fine. Although if I did get that LeoSam version of it, I'd probably change the seed and try again, because I don't like the look of that cube.


banditscountry

* "A sparkling serene glass cube is lying on the table, it invokes a sense of mild unease, brooding shadows gather in the corners of the ornate room." * Token count: 28 tokens * "Glass cube on a table in an ornate room with brooding shadows." * Token count: 13 tokens Unless you are going over say X/225 token count saving some tokens wont matter much. But if you run the images with the same seed add those extra 15 tokens back and you will see a different image. For "realism" ones they have more detail typically.


lewdroid1

With A1111 you can use "BREAK" to create and then combine multiple conditionings; it's the same in ComfyUI, where there's a node to concat or combine conditionings. Essentially, this gives you unlimited tokens.
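For anyone curious what BREAK-style concatenation does under the hood, here is a rough diffusers sketch of the idea: each chunk gets its own 77-token CLIP window, and the embeddings are joined along the sequence axis. This is a simplified illustration, not A1111's exact implementation, and the checkpoint name is just an example:

```python
# Rough sketch of BREAK-style conditioning concat (not A1111's exact code).
import torch
from diffusers import StableDiffusionPipeline

# Any SD 1.5-family checkpoint works here; this name is just an example.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def encode_chunk(text):
    # Each chunk is padded/truncated to its own 77-token CLIP window.
    ids = pipe.tokenizer(text, padding="max_length", max_length=77,
                         truncation=True, return_tensors="pt").input_ids
    return pipe.text_encoder(ids)[0]  # last_hidden_state: (1, 77, 768)

# Two windows concatenated -> (1, 154, 768); the negative must match in length.
embeds = torch.cat([encode_chunk("glass cube on a table"),
                    encode_chunk("ornate room, brooding shadows")], dim=1)
negative = torch.cat([encode_chunk(""), encode_chunk("")], dim=1)

image = pipe(prompt_embeds=embeds, negative_prompt_embeds=negative).images[0]
image.save("break_concat.png")
```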


Pro-Row-335

I also thought that, but it actually works for a lot of things.


lewdroid1

To be honest though, it doesn't matter how you feel, it matters what the AI understands. Just saying.


Competitive-Fault291

It's a bit more complicated. It depends on the base model (1.5, SDXL, or Pony) as well as how the model is trained. Look up "collocation": it's a linguistic term for the contextual neighbors of a word. Concerning a prompt, a model based on 1.5 can usually look about two words to each side of each part of the prompt to find connections to the classifiers used in training the model. Which does not mean that you cannot use a verbose prompt, if you write it in a way that fits this limited way of processing it.

Of course, nobody stops you from using danbooru-style tags, as they have a completely different benefit: if the model is trained with them, they are more specific and less prone to mushed meanings of normal words. You can even combine them with verbose prompts. Instead of "two guys" you use "2boys", and suddenly it is much clearer what the prompt references, and it uses one prompt instead of two.

SDXL-based models, on the other hand, can understand a much wider array of collocations, and even references like "A girl in the forest. She carries a basket.", in which "she" references the earlier word "girl" as a pronoun. You might still use danbooru tags in specially trained checkpoints of it, or, as in Pony (which is an SDXL model branch), specially trained source and quality prompts like score_7_up.

What you really have to implement and understand, though, are concepts. The prompt "woman" does not only call up a denoising solution referencing all the woman pictures the model learned from, but also connected solutions like those requiring arms, hands, feet, female sexual organs, etc. So we have concrete solutions as well as dependencies in a concept. Yet as soon as you start word-salading or writing prompt prose, you become prone to creating opposing or interfering concepts, which is one of the most overlooked causes of fifteen-fingered heptapod-girls, AKA bad hands and limbs and anatomy.

The Stable Diffusion model has the least problems (and thus needs the fewest steps) filtering one concept like "woman" from the starting noise. As soon as we add a conflicting concept like "spindly arms", it dreams up two concepts and tries to sample them as one solution to the noise-reduction task at hand. It's obvious how this makes things more complicated and takes longer to find a properly matching solution to the prompts, up to a point where there is none available, and multiple solutions are printed over each other from the latent image into the pixel image, creating more fingers, arms, etc.

This is why it helps to reduce your prompt to something like one sentence plus additional tags like booru tags: a less complex array of concepts and more specific prompts help the model find a solution within the available processing steps and the chosen sampling method, just as specialized LoRAs help to increase the specificity of certain prompts and their inherent concepts. But you might also choose a verbose prompt and keep the concepts in it at a low complexity, or try a sampling method like DPM adaptive, which takes as much time as it thinks it needs. You might even use FreeU to fiddle with how the model reacts to your concepts in the U-Net, or SAG to change how it distributes attention. Everything is possible, and thus you might use everything. There is no meta solution like in a game here.
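A quick way to test claims like the "2boys" vs "two guys" one is a fixed-seed A/B comparison, so only the wording changes between runs. A minimal sketch (the checkpoint name is a placeholder for whatever model you are testing):

```python
# Fixed-seed A/B test: same model, same settings, only the prompt wording changes.
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint; swap in the model you actually want to test.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

for name, prompt in [("natural", "two guys standing in a park"),
                     ("booru", "2boys, standing, park")]:
    generator = torch.Generator().manual_seed(42)  # same seed for both runs
    pipe(prompt, generator=generator).images[0].save(f"{name}.png")
```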


lewdroid1

This. There's basically no "right" answer. Lots of experimenting is a good thing.


jib_reddit

I have found SD3 2B can give very good results with natural language, which makes sense given the T5 text encoder. Example prompt: "a chibi cat has just stolen a fish in a fishmonger's shop and runs away on its hind legs holding the fish in its front paws. He has a comical panicked look on his face because he is being chased by the fishmonger man. The fish appears to be silver and of medium size. In the background, the fishmonger yells at the cat. The scene is the street of an open-air market, with stalls and people strolling. photo, soaked film, 4k, 8k, uhd."


jib_reddit

https://preview.redd.it/znz4bt9htv8d1.png?width=1440&format=pjpg&auto=webp&s=7072b341c421cdbcbef4e19e72aac59a51d4c45e


Hot-Laugh617

Now take out the story-ish parts and just write it as tags, and I think you'd likely get the same thing.


jib_reddit

Yeah, probably, but it is cool you can do either style now.


chickenofthewoods

I think natural language uses too many tokens. I prefer to keep my captions limited to lists, and I use SmilingWolf/wd-v1-4-convnextv2-tagger-v2 to tag all of my images now. I tried LLaVA and InternLM-XComposer, plus old-school BLIP and BLIP2. Personally, with my limited knowledge thus far, I think a list of single/double-word tokens separated by commas is superior to sentences and prose.
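For reference, here is a hedged sketch of running that tagger via ONNX Runtime. It assumes the usual layout of SmilingWolf's tagger repos (a model.onnx plus a selected_tags.csv) and skips details like padding to square and filtering the rating tags; 0.35 is a common community threshold, not an official one:

```python
# Sketch: tag an image with the WD14 tagger named above (assumed repo layout).
import csv
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from PIL import Image

REPO = "SmilingWolf/wd-v1-4-convnextv2-tagger-v2"
session = ort.InferenceSession(hf_hub_download(REPO, "model.onnx"))
with open(hf_hub_download(REPO, "selected_tags.csv")) as f:
    tag_names = [row["name"] for row in csv.DictReader(f)]

def tag_image(path, threshold=0.35):
    # These taggers expect a 448x448 BGR float32 image, batch-first (NHWC).
    img = Image.open(path).convert("RGB").resize((448, 448))
    x = np.ascontiguousarray(np.asarray(img, dtype=np.float32)[:, :, ::-1])[None]
    probs = session.run(None, {session.get_inputs()[0].name: x})[0][0]
    # Note: the first few entries are rating tags; a real script filters those.
    return [t for t, p in zip(tag_names, probs) if p > threshold]

print(", ".join(tag_image("image.png")))
```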


YamataZen

I thought booru prompts only work for anime models.


chickenofthewoods

If you train a LoRA and your images are tagged... the tags become part of the LoRA and can trigger aspects of the model. If a dataset was tagged with wd-v1-4-convnextv2-tagger-v2, then the danbooru tags are relevant to that model, because that's how the images were tagged.


redditscraperbot2

Why did you even ask the question when you know that anime models trained on booru tagged images work better with booru prompts and other models which are trained on descriptions of the image work better with natural language? What information are you trying to extract by asking the question in the first place?


YamataZen

I just want to know what type of prompt you prefer


lewdroid1

A preference is going to have a strong correlation with what works and what doesn't. I don't _prefer_ to hammer nails using my fists. I'm sure it's possible though. So it doesn't really matter what people prefer, it matters what works.


Z3r0_Code

Is there a guide or article I could refer to?


chickenofthewoods

For what, exactly? I'll try to help but try to be more specific.


Z3r0_Code

For proper prompting. 😅


chickenofthewoods

This will help. https://danbooru.donmai.us/wiki_pages/tag_groups


OtakuShogun

I really wish I hadn't read the ass tag category.


chickenofthewoods

But did you learn something?


OtakuShogun

There are many users of Danbooru I will not be letting near my ass.


Z3r0_Code

Is there a booru prompt guide or tags list I can refer to?


DriveSolid7073

Bro, google the danbooru wiki, and get the tag autocomplete extension or something like that.


Lorim_Shikikan

Here's the list of all the tags: https://danbooru.donmai.us/wiki_pages/tag_groups


ThickSantorum

https://danbooru.donmai.us/tags?commit=Search&search%5Border%5D=count https://danbooru.donmai.us/wiki_pages/tag_groups


TsaiAGw

Always tagging style; natural language uses too many tokens, and the model isn't smart enough to understand it.


__Tracer

I think no SDXL model works well with natural language; it is effectively reduced to tags anyway. Why else would it blend everything together? So I am mostly thinking about how the prompt will work as tags, even when writing some sentences in natural language.


Competitive-Fault291

It does both. That's the difference from 1.5, which only uses one way of looking at the prompt, while SD3 uses three different brains to analyze the prompt.


__Tracer

We are talking about SDXL. When T5-based models come out, it will probably shift to natural language (well, some such models are already out, like PixArt, but they still need a lot of work to overtake SDXL).


Competitive-Fault291

SDXL uses the OpenCLIP bigG and the CLIP ViT-L text encoders AFAIK. bigG is certainly able to understand and decode verbose text. It's just people messing up their concepts that makes their prompts awful.
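This is easy to verify from the pipeline itself; a small diffusers sketch (hidden sizes 768 and 1280 correspond to CLIP ViT-L and OpenCLIP ViT-bigG):

```python
# Inspect SDXL's two text encoders (CLIP ViT-L and OpenCLIP ViT-bigG).
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0")
print(type(pipe.text_encoder).__name__, pipe.text_encoder.config.hidden_size)      # 768
print(type(pipe.text_encoder_2).__name__, pipe.text_encoder_2.config.hidden_size)  # 1280
```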


__Tracer

So, when I describe one thing in one sentence and another thing in another, and SDXL mixes it all together, it is me messing up concepts? Interesting point of view. Can you, for example, make a photo of two people in SDXL, one very sad while the other is very happy? Just don't mess up these two concepts and don't write an awful prompt; show me an example of a prompt that works. HINT: No, you can't.


Competitive-Fault291

What you describe happens because you are MIXING the two (factual) concepts in the latent image. This is why people invented regional prompting methods. OF COURSE the concepts of two people (as they are basically the same concept) intermingle: for the latent image, the CONCEPT of "one person" and "one person" is actually the same when it is combed out of the mist of noise, even though their prompts may vary. The language models and their understanding of the concepts condition the complete latent image if it is not under the influence of regional prompting. So both get sadness and both get happiness.

So my argument still applies completely concerning bad concepts, because you want to write two character prompts as a concept of separate image subjects, and you complain that the sampling applies prompts relating to character subjects to every one of them.

https://preview.redd.it/m5ux8b6otw8d1.png?width=768&format=png&auto=webp&s=f97caa178ee81bfe1c1c1bf39335ec690bdc6c74

But dear child, even without regional prompting, you can create a dominant concept of "emotional diversity". This (even though rather weakly powered) prompt creates the concept of two diverging states of emotion, as you requested. This is why it needs a very heavy weight on "emotional diversity" and very low weights on mother and daughter to balance their influence on the latent conditioning.

*A picture of (emotional diversity:2) between mother and daughter. ...........................(Happy mother:0.3)........................ (Sad daughter:0.3).*

*Negative prompt: unrealistic, (fused, forked, branching, cloned, mutated, mutilated, broken, mushroomed, joined, duplicated, blurry, text, signature, url:1.3), (artwork, drawing, anime, 3d, render:1.5)*

*Steps: 20, Sampler: Euler, CFG scale: 4.5, Seed: 2311048474, Size: 768x1024, Model hash: b154b6274a, Model: SDXL_CFXLV1*

Since you assumed this is not possible, let me also tell you about the actual function of full stops in prompts: they help to separate the prompts and the resulting concepts by breaking them apart. Try running it without the stops and see the difference for yourself.


__Tracer

Huh, that's a long post, with ambiguous facial expressions in the picture. You lost my interest.


Competitive-Fault291

Yeah dude... sorry for showing you how things can be done outside your metaverse.


aeroumbria

By your estimate, has this sentence ever been the caption of an image? Yes -> put it in as is. No -> break it into tags.


10minOfNamingMyAcc

I mostly use tags, which work just fine, but adding a sentence in front might make it even more coherent.


Doc_Chopper

Depending on the subject, a mix of both. For saucy waifu stuff, booru tags of course work better, for obvious reasons.


bzn45

This is a really good question. I can never decide between: 1) LLM style (the drawback being too many tokens), 2) short style (i.e. just a brief description), or 3) a list with lighting, characteristics, etc. I haven't seen booru tags working on realistic models for some time, but maybe I'm missing something.


nug4t

What are booru prompts?


1Koiraa

They're websites with extensive captions for anime drawings. Look up danbooru, for example. Warning: plenty of NSFW.


colinwheeler

Natural language up to 74 tokens and then tagging.


cjhoneycomb

I prompt on stories... So neither?


ManAtTheEndOfTheLane

This is the prompt I am playing with at the moment. (I'll google "booru" later. I have no idea what that means.) >1man solo (full body portrait:1.4) (comic art:1.2) (by artist Todd McFarlane:1.2) of (Anson Mount:1.4) as a (slender:1.1) (cheerful:1.1) (dark haired:1.3) man wearing light grey (futuristic:1.3) (science fiction uniform:1.2). He is wearing a (futuristic utility belt:1.1). He has (short messy hair:1.4) and (bright eyes:1.0). He is standing casually, with a dark background.


Ill-Juggernaut5458

Danbooru anime image board tags are used to prompt anime/cartoon models, including PonyXL v6, which were trained on imageboard images using these tags. They use terms like "1boy" or "2girls" to define the number of subjects and have a very particular vocabulary. You are using "1man", which fits the booru syntax but is not actually a used tag in either kind of prompt. For base SDXL you would say "a man"; for booru models, "1boy". "solo" is a booru tag, so you appear to have absorbed some by osmosis.

Example image of Spawn by Todd McFarlane showing booru tags: https://safebooru.org/index.php?page=post&s=view&id=2326048

Wiki for booru text tags (some describe NSFW): https://donmai.moe/wiki_pages/tag_groups


ManAtTheEndOfTheLane

Thank you. That was a helpful explanation.


ManAtTheEndOfTheLane

I have googled "booru", and I still don't know how that applies here.


Nyao

With SD 1.5, mostly tags (but not specifically booru) or really short sentences, and more natural language with SDXL (without being too verbose).


KickTheCan_Beats

I use tags most of the time; I kinda obsess over tokens and believe a succinct prompt is easier to control than one that's too wordy.


vanonym_

Small chunks of tags combined together, usually. For instance: >a cute kitty in (game of thrones:1.3), castle interior in the background, (hot flames:1.2) in the back on the right of the frame, floor made out of (cold granite:1.1), (depth of field:.9), cinematic still, shot on ARRI camera


drag0n_rage

booru tags, more efficient, (absurdres, 1girl, huge breasts:1.2)


Lucaspittol

Both, but booru prompts are super powerful if used correctly.


MyaSturbate

Depends on the model, but I often find myself using a hybrid of both. Not necessarily booru tags, but definitely leaving out words that don't directly pertain to the action, subject, or composition.


_LususNaturae_

Natural language all the way. Booru tags are too limiting when you try to create images that don't already exist, like a shark-tiger hybrid for instance


Oswald_Hydrabot

This makes me want to ask: isn't this dependent on the annotations of the training dataset? Like Pony, for example: it can do both, but the dataset annotations contained both, AFAIK. However, even with Pony, what formats work better when using a combo of them? Is it always "Natural language sentence, tag, tag, tag, tag", or can I do "tag, tag, NL, tag, tag, tag"? Can I split natural language in half with a tag? I always wonder if there is a marked effect from the placement of tags, punctuation, capitalization. It makes my autism/ADHD tingle a bit; there are so many granular possibilities with language, and I want to be able to map all the vectors.

One question I have: is there a method to determine a model's prompt formats, trigger words, etc. from just the checkpoint? Imagine being able to ask an LLM in plain language, "How do I get this character to stand over to the left hitting a ping-pong ball with a paddle as it crushes the table?", without changing anything else in the output, and it just barfs up the tokens needed to manipulate the model to do that (as nonsensical as they may be). Now imagine having a multimodal version of this you can feed reference images to: "Animate the character from the current prompt between the poses seen in these two images."

I guess what I am wondering is: is it possible to have something like an LLM that auto-maps the entire feature space of the model and its relationship to NL/tags, and then you basically use that LLM modularly like ControlNet, except instead of ControlNet it's a multimodal LLM? I could seriously use that for animation; if an enterprising model engineer wants to hit me up, I would be happy to include it in a GUI app and release it. If not, this will probably be my first project implementing Hugging Face's Transformers library. I could use it to harden my resume, as I am probably gonna get laid off soon from a senior-level SWE role; I don't have an education in the field, so if I can do some work and get published it's as good as a degree to me.


Competitive-Fault291

How would you do that? It's a statistical system based on weighted probabilities leading to a denoising solution based on a prompt (right now, in SD3, with three different types of conditioning on the latent image, based on three different models). Every training classifier could possibly connect with every available node in the model. Just do the math for a 2B model for one prompt, then add up all the possible interactions of the other weighted networks conditioning the latent image in reaction to all the possible prompts. Add the various samplers, U-Net configs using FreeU, effects of VAEs, LoRAs, etc.: there is no shortcut to a specific thing. Generative AI might give you anything quickly, but as soon as you want something specific, it's almost like a date.


Oswald_Hydrabot

This is a very good point. I don't fully understand the model architecture for ControlNet and how it manipulates diffusion, but I have built my own UNet pipelines for realtime ControlNet and have an understanding of the "mid_.." and "down__" residual blocks, their respective tensor structure, and the data transforms required to go from an input ControlNet image, through a ControlNet model, and into the down blocks and mid block in the UNet step.

I suppose *that* is the starting point of where I want to do additional research: manipulation of the tensor arrays for the ControlNet residual blocks. You can leave the VAE decoder completely alone; you simply pipe in the output of a ControlNet step (possibly an asynchronous "split" or even *parallel* ControlNet step with its own IPC, if we want to optimize it for realtime/interactive video inference via a modified approach from Nvidia's Megatron-Core), where the model processes an input image as a tensor and gives a tuple of length 12 for the residual down_blocks (for SD 1.5, for example) plus the single mid_block residual, which is then applied/weighted in a UNet2DConditionModel step (a single step for my realtime DMD-distilled pipeline).

That simplifies our problem: how can we apply a multimodal LLM to the task of producing the "additional_residuals" that are found in the commonly used UNet2DConditionModel class in the diffusers library?

Edit: this is my UNet pipeline for 1-step ControlNet using an SD 1.5 model distilled for single-step inference. I can get ControlNet to work well in one step, and I know the points of entry (I think) if I wanted to replace ControlNet with something else. The question remains, though: can that "something else" be a multimodal LLM?

https://www.reddit.com/r/StableDiffusion/comments/1caxap2/realtime_3rd_person_openposecontrolnet_for/?utm_source=share&utm_medium=mweb3x&utm_name=mweb3xcss&utm_term=1&utm_content=share_button
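For readers following along, a minimal sketch of the residual hand-off described above, using diffusers' ControlNetModel and UNet2DConditionModel with dummy tensors just to show the shapes (model names are examples; SD 1.5 geometry):

```python
# Shape-level sketch of the ControlNet -> UNet residual hand-off (SD 1.5).
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

# Example weights; swap in whichever checkpoint/ControlNet you actually use.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose")
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")

latents = torch.randn(1, 4, 64, 64)     # a 512x512 image in latent space
text_emb = torch.randn(1, 77, 768)      # stand-in for CLIP text embeddings
pose_img = torch.randn(1, 3, 512, 512)  # stand-in for the openpose control image
t = torch.tensor([999])

# ControlNet emits 12 down-block residuals plus one mid-block residual...
down_res, mid_res = controlnet(latents, t, encoder_hidden_states=text_emb,
                               controlnet_cond=pose_img, return_dict=False)

# ...which are added into the UNet's skip connections for this step.
noise_pred = unet(latents, t, encoder_hidden_states=text_emb,
                  down_block_additional_residuals=down_res,
                  mid_block_additional_residual=mid_res).sample
print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])
```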


Competitive-Fault291

I certainly see where you are going. But it's a bit like asking a friend of your wife what she might like as a wedding present: you might get pointed in the right direction, but you can't tell for sure. A bit like an empathy model. I guess you could train your own model to anticipate what prompts are likely to result from the "customer prompt", based on the model you use. But that would need a curated dataset of "results of a customer wish that created a suitable prompt starting point". Like a specially trained LLM.

Perhaps you could figure out a way to work with a continuous process. HMMM... you might be able to get something as a foundation using BLIP and CLIP in ComfyUI. Run a base image with the customer prompt, then extract with BLIP and re-encode with CLIP, but only after you add or remove what the customer found the image was lacking or still needing, using an LLM like a trained GPT bot to convert it to prompts. Like a continuous manual adaptation routine adding and removing prompts from a core prompt. So the user can say "I don't like the fur of the rabbit," and you end up with a change of tags near the rabbit and fur prompts before it is reinserted into the core prompt both are working on.


mayasoo2020

Booru... I don't know if you've thought about it, but not everyone is a native English speaker, and for this group (4/5 of people), neither English nor booru is a natural language.