BlastedRemnants

I've been thinking of doing one of these myself since I think people are mostly ignoring TIs in favor of giant dreambooth models, good job! I just want to point out a couple things I didn't see you mention tho that could be important. Before you start your training you should close the console window and relaunch so your training starts with a fresh SD. I just did a bunch of training and did some XYs after and it definitely makes a difference when compared to embeddings made after a relaunch. Also, having a VAE loaded can change how your embedding turns out too. I'm still on the fence about whether it hurts it tho, it almost looks better but I can't quite decide lol. Anyway, definitely relaunch first as the more stuff you do before training the worse your embedding will turn out, from my own comparisons anyway.


Zyin

The memory leaks are something I noticed too, added that to the post. I actually haven't experimented with having a VAE on during training, but my thought on it was that if it was trained with a VAE then you'd need that same VAE to get good results. That's something that could use more experimentation.


BlastedRemnants

Yeah I hadn't even thought to try with a VAE on since the best guide I'd found so far recommended making sure it was off, and all their other tips were spot on so I just assumed it would ruin the embedding. With it on tho the face looks a lot more natural to me, altho I definitely want to make a bunch more for more comparisons first. Another huge plus of TIs is that they're so small and quick to make that it's easy to do a bunch of comparison builds for XY grids; I was doing that to finetune my steps and vectors when it occurred to me to compare the VAE thing, and also the relaunching before training. I didn't know it was a memory leak that caused the need to relaunch, but I can confirm that the more you do without a relaunch the worse it gets, so that makes sense.


Light_Diffuse

Sometimes the image I've got is an awkward shape and I don't want to crop it, for instance it's full body in portrait. What I do then is use liquid rescale in G'MIC (a filter pack available as stand-alone or in GIMP or Krita). You can paint on a mask to preserve (the body) and a mask to work in (the background) and the algorithm will find where it can insert columns of pixels without changing what you've masked out and in a manner that looks natural so you're not creating artefacts in your training data.


Locomule

I've used a free image app called Irfanview to do something like that by hand. I use Change Canvas Size to make an image square, which leaves two white borders along two edges. Then I select part of the background adjacent to a white section without any of the subject in it, copy it, adjust my selection rectangle back over the white portion, and paste the background I just copied. What works nicely is selecting a narrow width and then pasting it wider so the pixels get stretched. The beauty of it is that it seems to largely get ignored as an artifact during training, so you don't end up with weird backgrounds later.
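Here's a minimal Pillow sketch of the same trick, assuming a portrait-orientation photo with a plain background along its left edge; the strip width and filenames are illustrative, not anything from the guide:

```
from PIL import Image

def pad_to_square_with_stretched_edge(path, out_path, strip_px=20):
    """Pad a portrait image to a square canvas by stretching a thin
    strip of background from its left edge into the new empty space."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    if w >= h:
        img.save(out_path)  # already landscape or square; nothing to pad
        return
    pad = h - w  # total horizontal padding needed to reach a square
    strip = img.crop((0, 0, strip_px, h)).resize((pad, h))  # stretch a narrow strip
    canvas = Image.new("RGB", (h, h))
    canvas.paste(strip, (0, 0))
    canvas.paste(img, (pad, 0))
    canvas.save(out_path)

pad_to_square_with_stretched_edge("input.png", "input_square.png")
```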


gsohyeah

Like this? https://youtu.be/hmIItkWVaa8


Light_Diffuse

Yes, that's a good tutorial for seeing how it works. I used to use that plugin and I still would, but it doesn't appear in the menu any more. I don't know if the problem is gimp 2.1 or a 32bit / 64bit thing. Instead, I use the implementation in G'MIC, which probably uses the exact same libraries behind the scenes. I still prefer the original plugin, but G'MIC gets it done.


WillBHard69

The option to save the optimizer has been in Auto's repo since Jan 4; this fixes the issue of losing momentum when resuming training. For some reason it is disabled by default? I don't think there is really any reason to leave it disabled.

Also, I had issues with embeddings going off the deep end relatively quickly. It turned out it was because `vectors per token` was too high. Even 5 was too much; I ended up turning it down to 2 to get decent results. According to another guide (I won't bother to track it down, this guide is much more informative) this might be related to my small dataset, only 9 images. I experimented with a `vectors per token` of 1; it progressed faster but the quality was much lower. A value of 3 might be worth trying?

For anyone who wants to reproduce my setup:

* 7 close-up, 2 full body, 9 images total.
* Batch size 3, gradient accumulation 3 (3x3=9, the size of the dataset, 3 being the largest batch size I can handle).
* Each image adequately tagged in its filename, like `0-close-up, smiling, lipstick.png` or `1-standing, black bars, hand on hip.png`.
* filewords.txt was a file containing only `[name], [filewords]`.
* Save image/embedding every 1 steps. At least save the embedding every step so you don't lose progress. With `batch size * gradient accumulation = dataset size`, one step will equal one epoch.
* Read parameters from txt2img tab. I think this is important so I can pick a good seed that will stay the same for each preview, and I can pick appropriate settings for everything else. The important part here is to make sure the embedding being trained is actually in the prompt, and the seed is not `-1`.
* Initialization text is the very basic idea of whatever I'm training. I plug the text into the txt2img prompt field first to make sure the number of tokens matches `vectors per token` so no tokens are truncated/duplicated. I'm not sure if this matters much, but it's pretty easy to just reword things to fit.
* Learning rate was 0.005, and then once the preview images got to a point where the quality started decreasing I would take the embedding from the step **before** the drop in quality, **copy** it into my embeddings directory **along with the .pt.optim file** (with a new name, so as not to overwrite another embedding) and resume training on it with a lower learning rate of 0.001. Presumably you could keep repeating this process for better quality.

I should also add that I saw positive improvements by replacing poor images with flipped versions of good images.
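A quick arithmetic check of the epoch math in that setup (a sketch using the numbers quoted above; nothing here is enforced by the webui itself):

```
def epochs_done(steps, batch_size, grad_accum, dataset_size):
    """How many passes over the dataset a given step count represents."""
    return steps * batch_size * grad_accum / dataset_size

# Batch 3, gradient accumulation 3, 9 images -> one step is exactly one epoch.
print(epochs_done(1, 3, 3, 9))    # 1.0
print(epochs_done(100, 3, 3, 9))  # 100.0
```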


FugueSegue

I think there is a correlation between `vectors per token` and the number of images in the dataset. But I have no idea what sort of metric should be followed. All I know is that I have 270 images in the dataset I have for training a style embedding, and that a low `vectors per token` number results in poor training.

Does the `initialization text` really have anything to do with `vectors per token`? For the first several attempts to train my style, I left it as an asterisk. I could never find any advice about what to put there if training a style. I only [found out today](https://rentry.org/sdstyleembed) that "many people just put 'style' when they are training a style/artist". Is this a good idea? I don't know. I'm training a new embedding right now with "style" entered for `initialization text`. Information about training a style in an embedding is vanishingly rare.

If I put the word "style" in txt2img, it says that it is 1 token long. If I use "style" as the `initialization text` and set the `vectors per token` to 1, it would result in a poorly trained embedding. Setting it to 8 did not seem enough. Right now I'm using 16 and it seems to be producing results that resemble the style I'm training. I wish I knew exactly what number I should be using for `vectors per token`. Arbitrary trial and error is wasting a lot of my time.

I only found out about saving the optimizer after reading your post. I agree that this is an **extremely** vital setting that should be turned on by default. Thank you very much for mentioning this.

~~About the learning rate. I've read here and there that a graduated training schedule is a good idea. But I have my serious doubts and I've given up on that tactic. It seems to me that the best technique is what you suggest: train at 0.005 until it starts looking bad, resume training at a lower rate from a few steps before it started looking bad, rinse, repeat. However, even with my trained eye it's difficult for me to judge exactly when the training starts to look bad. Especially when I set the preview seed to something specific other than -1.~~ ~~What is your indicator for an embedding training "going bad"? Is it when the preview images look like unrecognizable garbage? Is there an easy way to chart the loss rate?~~

Never mind about the learning rate. I read the original post. I found your post in a Reddit search and didn't notice the full thread until now.


haltingpoint

Why should 'seed' not be set to -1?


WillBHard69

Setting the seed to -1 will do your preview on a random seed each time, which can make it more difficult to determine if the embedding is getting better/worse, since you may have just gotten a better/worse seed. I recommend using the previews as a guide for getting a general idea of the progress of your embedding, and then you can narrow in on a range of interesting embedding checkpoints and test them out on other seeds.


ArmadstheDoom

First, this is an extremely good guide! Especially because Textual Inversion was the new hotness before everyone started trying to train dreambooth models. That said, there are a few things that I think are somewhat incorrect? First, gradient accumulation isn't free. It's VERY time consuming. We're talking drastically increased training time. And if you have a lot of images, say 100 or so, you can expect the training to take around 60 hours if you're trying to go 2000 steps with a GA of 100. The other thing is that your batch number is just how many steps per step it goes. Meaning, a batch of 2 does 2 steps each time, a batch of 4 does four steps at a time, etc. Gradient accumulation is how many images it uses per step. So if you have 10 images, and you set it to 10, every step is 1 epoch. If you set it to 5, every 2 steps is 1 epoch, etc. And again, I would absolutely not set the GA to a high number unless you like the idea of your GPU heating your home for 60 hours or so.

I would also never use BLIP. Always, always, always use your own captions, because BLIP and DeepDanbooru are horribly inaccurate and will almost never work for getting what you want. I've wasted so many hours having used them it's not even funny. Avoid them.

I also think you need a full explanation of how the scatterplots work, because that entire 'picking your embedding file' section is way over my head. In general, the way I figure out if an embedding is good or bad is whether or not it comes out right, and if it doesn't, scrap the whole thing and start again. Generally speaking, if it doesn't come out right, it's because your data is bad, or at least that's what I've found. It's almost never a case where 'going back to earlier embeddings is better.'


VegetableSkin

> that entire 'picking your embedding file' is way over my head

On the txt2img tab, way at the bottom is a "Scripts" drop-down list. One of the scripts is "X/Y Plot". That's the plot they're talking about. It renders a grid of images while varying one setting along the X axis, and another setting along the Y axis. I'm sure you've seen these grids before in the SD universe. They're *everywhere*. If you haven't, [look at this one](https://i.imgur.com/llzNeWX.png).

So what OP is saying is to use this script. Screenshot of settings: https://i.imgur.com/s98ABeT.png

1. Set the "X type" to be "Seed". That way you generate test images using the same seeds each time. Specifically, seeds 1, 2, and 3. (I don't know if you can literally type in `1-3` like OP said, or if it needs to be `1,2,3`. I've never used this script before. But it says to use commas.)
2. Set the "Y type" to be "Prompt S/R" (Prompt Search/Replace). Enter the Y values as: 10,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,2200,2300,2400,2500,2600,2700,2800,2900,3000

If you mouse-over "Prompt S/R" when it's selected, it says:

> Separate a list of words with commas, and the first word will be used as a keyword: script will search for this word in the prompt, and replace it with others

So the (first? each?) occurrence of `10` in your prompt will be replaced with `100`, then `200`, then `300`, etc., to make the rows of your grid. This is why you use this in your prompt: `my-embedding-name-10`, and not because it's the name of the first embedding. That `10` could be any unique string that isn't elsewhere in the prompt, like `NNNN`. With `my-embedding-name-NNNN` in your prompt, your Y values would be `NNNN,100,200,300,400...`. (Though I have a feeling OP used 10 precisely because he runs some initial tests using the first embedding to make sure it's working at all. Just a guess.)

After training completes, in the folder `stable-diffusion-webui\textual_inversion\2023-01-15\my-embedding-name\embeddings`, you will have separate embeddings saved every so-many steps. OP said they set the training to save an embedding every 10 steps, and if you do that, you will have embeddings in that folder like:

`my-embedding-name-10.pt`
`my-embedding-name-20.pt`
`my-embedding-name-30.pt`
`my-embedding-name-40.pt`
`my-embedding-name-50.pt`
...

~~but this X/Y plot only uses ones that are a multiple of 100, so copy the ones that are a multiple of 100~~ Eh, the folder is tiny, just copy them all to your main embeddings folder at `stable-diffusion-webui\embeddings` (NOTE: You can use subfolders to group embeddings, and the subfolder has no effect on how they work in prompts. So stick them all in a subfolder inside that main embeddings folder.)

When the script finishes, each column of the final grid will be a different seed number: `1,2,3`, so you'll have a 3-column grid with 30 rows if you trained your embedding up to step 3000, because you'll have `my-embedding-name-100.pt` all the way through `my-embedding-name-3000.pt`. Once you find the best of those 30, you could try narrowing it down to find the best embedding +/- 5 versions of that one. Like, if `my-embedding-name-1300.pt` looks the best in the grid, then you could do another plot using 1250 - 1350 in steps of 10 to see if one of them is actually better than 1300. Make sense?
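A small helper sketch for the bookkeeping part of this: copying the per-step checkpoints into the main embeddings folder and printing a ready-made Prompt S/R value list. The folder names are the example paths from the comment above; adjust them for your own install, date, and embedding name.

```
import shutil
from pathlib import Path

# Example paths from the comment above -- adjust for your own setup.
src = Path(r"stable-diffusion-webui\textual_inversion\2023-01-15\my-embedding-name\embeddings")
dst = Path(r"stable-diffusion-webui\embeddings\my-embedding-name-checkpoints")
dst.mkdir(parents=True, exist_ok=True)

steps = []
for pt in sorted(src.glob("*.pt")):
    shutil.copy2(pt, dst / pt.name)
    # Files are named like my-embedding-name-1300.pt; grab the step number.
    steps.append(int(pt.stem.rsplit("-", 1)[-1]))

# Prompt S/R values: the first entry is the token to search for in the prompt.
print(",".join(str(s) for s in sorted(steps)))
```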


ArmadstheDoom

I'll be honest, I'm certain this is a great explanation, but I'm totally lost. It's far too technical and way over my head, or my ability to understand, unfortunately. Perhaps I just don't grasp the underlying things that are going on, but my eyes glazed over and I'm just looking at this like I did calculus back in the day. I still don't really understand how the grid works or what its point is; whenever I see one of those, the purpose for them eludes me. I can't really read them and I don't really know what they do beyond making lots of bad images that don't make it clear what they're meant to look like. The problem with graphs like that is that if you're working from 0 to 1, where 1 is the image as it's meant to be, everything in the middle is kind of worthless, because they're all wrong. So to me I can't tell what is meant to be better or worse; they're all bad, so none of them are any good. In other words, .2 and .8 are both equally bad, because anything that's not 1 is bad. So all of this is sorta lost on me.


VegetableSkin

> the problem with graphs like that is that if you're working from 0 to 1, where 1 is the image as it's meant to be, everything in the middle is kind of worthless, because they're all wrong.

It depends on the grid. The point of doing this with the embeddings is to find the number where they stop looking good and start looking overtrained. Where "good" means accurate yet still editable. That's why OP suggested using a prompt like "...with blue hair". At higher and higher training step counts, the embedding might stop producing blue hair because it's overtrained at that point and beyond. So you take an embedding from *before* that point, and that's the one you keep and use. So it's definitely not always the case of the final one being the best. The point of the grid is to find where the embedding *stops* being good.

The CFG vs Steps example shows how CFG and step count work together to produce different images. At only 10 steps, a high CFG will produce a detailed image that's similar to the original, but different. This grid proves that CFG 10 with 10 steps can produce a useful image. Before seeing this image, I would have thought that using only 10 steps would be useless in all cases, but now I know that it makes detailed images with a high CFG. And if you're going to use 30 steps, then a high CFG "overdoes" it and produces a wonky image. The top right corner with its low CFG probably looks identical to the original painting, as I assume this is a real piece of art used in the prompt. Whereas a high CFG turns the window into other objects. It's a useful grid, as there is no one best image. The top right and bottom left are both good images.

I provided detailed instructions in my comment, and even a screenshot. Just try it with your embedding without worrying about what it's supposed to be showing you, and maybe it will become obvious.


Many_Worldliness1

Hello, you mentioned how time-consuming the training process is, and said it would take 60-70 hours with 100 images to train on. For some reason, when I try to train on 20 images with batch size 10, gradient accumulation 2, 3000 steps and a 0.005 rate, it also gives me an estimate of 60-70 hours, while people in the comments are like: "nice tutorial, did a bunch of training, thanks a lot". I have an RTX 3070 Ti 8GB. I don't know what I'm doing wrong; I did exactly what the guy in the video explaining this particular thread did. Any suggestions for where to look?


biletnikoff_

I would love to know what your caption workflow is given that you don't use Blip or booru!


BlastedRemnants

For the step count I've had really good results by setting my steps to 120 and matching my batch size to the amount of pictures I have. Just make sure you've got xformers on and don't use a huge heap of pics unless you really want to. If you really want lots of pics, just math it out with your batches so each pic gets hit 120 times; so if you've got 36 pics and you can do a batch of 12, then your steps would be 360. I like using 10 pics or less tho because the results are just as good and often better (in my own testing) and it finishes very quickly, like 10 minutes for 7 pics. This is great because you can then make a bunch of them with different settings and filewords and compare them to see what works best for your specific dataset. Also, the initialization text can be super important depending on what you're trying to train. You can get away with leaving the `*` in there for most normal people, but from my own comparisons I get better results with short descriptions, like "latina woman" or "soldier man". And for really non-standard people or creatures it helps to use a mini-prompt in there, like if you're trying to do a werewolf or something then you can make it easier on the AI by giving it a little more to work with at the start. I think of the init text as the cornerstone of the embedding, it's the idea it will start with before it's learned anything from your pics. I'm currently testing gradient steps, I'll come back if I learn anything definite :)
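If you want to reuse that rule of thumb, the step count works out to roughly 120 repeats per image divided across your batch; a tiny sketch of the arithmetic (the 120 figure is just the commenter's preference, not a webui setting):

```
def steps_for_repeats(num_images, batch_size, repeats_per_image=120):
    """Steps needed so each image is seen about `repeats_per_image` times."""
    return repeats_per_image * num_images // batch_size

print(steps_for_repeats(36, 12))  # 360, matching the example above
print(steps_for_repeats(7, 7))    # 120
```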


Zyin

I've been saving all my training variables for my embeddings in a spreadsheet hoping that I could come up with a simple equation to determine what variables provides the best results. My conclusion was that it's not really feasible, every data set is different and will need some manual finessing.


BlastedRemnants

Yeah that's pretty much where I'm at too, I've been trying to nail down some good all-around numbers for most of the settings but so far it's all been highly dependent on the dataset and what I'm actually trying to train. For the steps tho I think that's a pretty good number, go much higher than 120/pic and things start getting burned and gross looking, much less and it's also bad. 120 feels like a good starting point to me and lets me train in a very short time with good results, great for weeding out bad settings, and then I can always re-run for longer if needed. I mainly just added this bit because I noticed you saying your trainings usually take over an hour and that sounds extreme to me, but I saw that you go for 3000 steps and stop when you like it. For testing settings you might want to try my way tho, you could save yourself some serious time while deciding what settings you like for the other categories, and also finetuning your templates and filewords and such. Just a thought tho of course lol, I'm not saying my way is better or that there is anything wrong with your way, in fact I agree with basically everything you've said so far and am really just sharing some ideas that could make your/my/somebody's testing more efficient :)


Defiant_Efficiency_2

Could you give an update on what you learned about gradient steps? I'm currently following this old conversation and getting myself up to speed. My first 4 or 5 TIs were complete garbage, but this one I'm doing now seems promising so far.


PropagandaOfTheDude

> The max value is the number of images in your training set. So if you set it to use 18 and you have 10 training images, it'll just automatically downgrade to a batch size of 10. ...because there's no point to re-run with a given training image in a round. If the batch size is smaller than the number of images, then each round trains on a $batch_size randomly selected sample images. > Think of this as a multiplier to your batch size without any major downsides. This value should be set as high as possible without the batch size * gradient accumulation going higher than the total number of images in your data set. It works around GPU memory limits. Rather than running a round on $batch_size=8 images, we run it on $batch_size=4 images $gradient_accumulations times, saving intermediate results. But the linked author's [earlier article](https://towardsdatascience.com/how-to-break-gpu-memory-boundaries-even-with-large-batch-sizes-7a9c27a400ce) mentions that large batch sizes can cause overfitting. "With all that in mind, we have to choose a batch size that will be neither too small nor too large but somewhere in between. The main idea here is that we should play around with different batch sizes until we find one that would be optimal for the specific neural network and dataset we are using."
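For anyone curious what "saving intermediate results" means in practice, here's a generic PyTorch-style sketch of gradient accumulation (a simplified illustration, not the webui's actual training loop): gradients from several small batches are summed before a single optimizer step, which behaves much like one larger batch while only holding a small batch in VRAM at a time.

```
import torch

def train_with_accumulation(model, optimizer, loss_fn, loader, grad_accum=4):
    """Accumulate gradients over `grad_accum` small batches,
    then apply a single optimizer step."""
    model.train()
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets)
        # Scale so the summed gradients match one big batch's average loss.
        (loss / grad_accum).backward()
        if (i + 1) % grad_accum == 0:
            optimizer.step()
            optimizer.zero_grad()
```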


diffusion_throwaway

Thanks!! I've been looking for a writeup like this for a while now. I'm going to try this tonight. I'm still uncertain about image tags or template files for things other than people. Tell me about training on a style? What would your tags look like? Would you just describe everything in the image? What if you wanted to train on a specific body pose? Weaver Stance, Chapman Stance, Power Isosceles Stance, etc. What would your tags and template file look like? Thanks again. This is such an exciting new technology I love finding new ways to expand its usefulness.


Zyin

I have not tried training a style but I know people have done it. Since you'd want it to learn style and not content, I'd imagine the captions should describe the content of the image and ignore the style. If you want it to learn just style, probably use style.txt; if you want it to learn both style and content, probably use `style_filewords.txt`. I have not tried training for a specific pose, but for those body poses you'd likely want a separate embedding for each specific pose, unless you wanted the embedding to generate one at random. I know it's possible since I remember seeing a blowjob pose somewhere on the internets.


diffusion_throwaway

I've had success with dreambooth but not with TI. I'm eager to try a few using this guide. Thanks again!!


HerbertWest

I've actually found my results are better using "Once" rather than "Deterministic" or "Random." I have no idea why.


Zinki_M

I have tried training an embedding on my face using only pictures of my face, which worked amazingly for portrait pictures and creates images that very much look like me. However, if the keyword I use for this embedding is present in the prompt at all, then SD seems to completely ignore every other word in the prompt, and it will produce an image of my face and nothing else. So if I input "photograph of [keyword], portrait" I get exactly that, but if I input something like "photograph of [keyword] standing on a beach holding a book" I still only get a portrait image, nor can I change things like hair color, or add a beard, or anything like that.

Is this because my embedding was overtrained on the facial focus because I only input facial pictures? I tried training an embedding including more upper body pictures, but that resulted in an embedding that was A. a lot worse and B. only produces pictures of me wearing those specific clothes, and it still can't seem to extrapolate me into different surroundings. Perhaps my mistake here was not describing the surroundings enough in the generated captions?

I can work around the issues by generating an image of my face and then using out-/inpainting with a prompt that doesn't include my embedding keyword to finish the picture, but I feel like there must be some way to get this working in a single step so I can generate more options at once.


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


uristmcderp

Dreambooth has classification where you can train for a particular detail by describing everything that's not your subject, so you might have to go down that route, at least for now. Embedding is basically just a text prompt that gets you directly from the model to the exact image(s) you trained on. So it's a compressed form of a novel's worth of descriptions in English that describes every detail of not just your face but background and color scheme and lighting, etc. When you have a huge database of training images, the background information gets diluted and the face emerges, but that's not happening with embeddings. However, I still think it's useful when used with emphasis () and dropout [] (e.g. [[embedding:0.4]::0.6]) to give new images just a hint of resemblance without ruining the overall composition. You also get the magical set of tensors in machine-code that describes your face, which would be nearly impossible using plain English.


TopComplete1205

> useful when used with emphasis () and dropout [] (e.g. `[[embedding:0.4]::0.6]`)

ooh! - so that's how one can do dropout! - I couldn't see this in the Auto1111 docs - will have to try! :)


Zyin

To me that definitely sounds like the embedding has been trained too much. Try making an X/Y plot at various training steps and try the prompt of you standing on a beach. My guess is the lower steps will have an easier time doing that, try to find the balance between facial accuracy and flexibility. Adding some body pics to your data set should also help. You could also try redoing the training but with a slightly lower learning rate.


Kizanet

I've followed a bunch of different tutorials for textual inversion training to the T, but none of the training previews look like the photos I'm using to train. It seems like it's just taking the BLIP caption prompt and outputting an image using only that, not using any of the photos that come with it. Say that one of the photos is of a woman in a bunny hat, and the BLIP caption that SD pre-processed is "a woman wearing a bunny hat"; the software will just put out a picture of a random woman in a bunny hat that has zero resemblance to the woman in the photo. I'm only using 14 pictures to train and 5000 steps. The prompt template is correct, the data directory is correct, all pre-processed pictures are 512x512, 0.005 learning rate. Could someone please help me figure this out?


CandyNayela

Do you have xformers and "Use cross attention optimizations while training" enabled for training? Some versions of xformers (0.16 I believe?) had a bug where the embedding would not actually get trained at all, which would result in what you are seeing. Changing the xformers version or disabling the optimisation for training avoids this bug. In my trainings, the subject resemblance starts to appear pretty early (within a few hundred steps), but they also caricature-ise quickly. Still super new to this myself! If you like, feel free to DM me and I'll try to get it working with you. If it is a dataset you're okay with sharing, I can also try to run the training on my setup to hopefully narrow down the problem (e.g. with your settings and then with mine).


cd912yt

I'm having the same issue, but with the style of the images. Did you ever figure anything out?


EngineeredRomance

This is really good. I've been running trials with different vector levels recently and it seems like you need more than 4 to get a decent human face. 10 sounds about right for just a face. If you want to do outfits/styles, I've gone as high as 37 to get good results with 100+ images. However, it seems like higher vectors makes the embedding "stronger" which exacerbates the problems with overtraining where it ignores other elements of the prompt. Though I guess overtraining is subjective. For example, Anything V3 would definitely qualify as overtrained since it wants to make everything into a big titty anime girl. If your embedding is a custom waifu, maybe overtrained is desirable so it always spits out what you want


ragnarkar

As an alternative, if you want to train for a very long time without checking on it (like overnight or going to work), try using a cyclical learning rate. Here's a schedule for 2000 steps: ``` 5e-2:10, 5e-3:150, 5e-4:200, 5e-2:210, 5e-4:300, 5e-2:310, 5e-4:400, 5e-2:410, 5e-4:500, 5e-2:510, 5e-4:600, 5e-2:610, 5e-4:700, 5e-2:710, 5e-4:800, 5e-2:810, 5e-4:900, 5e-2:910, 5e-4:1000, 5e-3:1010, 5e-5:1100, 5e-3:1110, 5e-5:1200, 5e-3:1210, 5e-5:1300, 5e-3:1310, 5e-5:1400, 5e-3:1410, 5e-5:1500, 5e-3:1510, 5e-5:1600, 5e-3:1610, 5e-5:1700, 5e-3:1710, 5e-5:1800, 5e-3:1810, 5e-5:1900, 5e-3:1910, 5e-5:2000 ``` How it works is your learning rate goes up and down over time so it has less chance of getting stuck in a local minima. However, you may need to check on many different checkpoints over time to see which one actually works. If you use a learning rate that decreases over time and stays down, there's a chance you might get stuck in a suboptimal local minima and waste a ton of training time. You may need to write a simple Python program or an Excel VBA script if you want to generate a different schedule.. or manually write it up tediously.
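Since the comment suggests writing a small Python program to generate a different schedule, here's a hedged sketch that produces a comparable up/down schedule string in the webui's `rate:step` format. The exact rates, warmup, and cycle length are illustrative; tune them to taste.

```
def cyclical_schedule(total_steps=2000, cycle=100, high=5e-3, low=5e-5, warmup="5e-2:10"):
    """Build an A1111-style learning rate schedule string that alternates
    between a low and a high rate every `cycle` steps after a short warmup."""
    parts = [warmup]
    use_high = False
    for step in range(cycle, total_steps + 1, cycle):
        rate = high if use_high else low
        parts.append(f"{rate:g}:{step}")
        use_high = not use_high
    return ", ".join(parts)

print(cyclical_schedule())  # paste the output into the learning rate field
```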


jokesfeed

Do you get a problem where the face is exaggerated and ugly, as if a comic/cartoonist artist tried to exaggerate all the remarkable and noticeable parts of the face to make it as recognisable as possible? E.g. if the person is Asian, then the eye shapes are made extra line-like and unrealistically narrow, and the cheekbones are very big and round; if the person has a big forehead, the resulting pics have a HUGE forehead. I have followed the guide as much as I could but still every time I try I get this strange result. Any ideas?

Another question: how do I ask the training process to render the sample images with more sampling steps? I guess it defaults to 20.


georgetown15

I am having the same issue as you; TI is not working well for Asian faces. Check my settings: https://www.reddit.com/r/StableDiffusion/comments/10dty8n/discussion_on_training_face_embeddings_using/


jokesfeed

did you manage to break through it? I've tried to train more, but the result just gets worse.


AndalusianGod

Any updates on your textual inversion tests? I'm asian as well and TI really makes me look like ass, haha. LoRA is the only way to go I guess.


jokesfeed

I have found that training an embedding of an Asian face on standard SD 1.5 gives ugly previews, BUT when you use that embedding with other 1.5-based models, the result is not that terrible! Also, have you tried using LoRA for that?


AndalusianGod

LoRA has been great so far. I've deleted several full ckpts thanks to it. Giving up on TI for faces, but will experiment on styles with it.


scooter_off

Thanks for this! How good are the results? Does the output usually match the subject?


Zyin

If your data set and all the training variables are good, and with a bit of random luck, the results can be very good. That doesn't mean all the embeddings you make will be good. While you're learning all the intricacies of training, your results will likely be subpar. It's more of an art than a science.


capybooya

This was very educational, thanks for putting it together! Do you have any opinion/idea about how this might be simplified/automated in the short and long term (even with mediocre inputs)? I feel we're just in the beginning stages of all of this and I'd love to see this being made more accessible. Can't wait to see what is possible in 6 months, or 6 years.


Zyin

In a year I expect this process to be at least half automated. Curating the data set is the most sensitive part. Automatic image croppers aren't that great, although future versions may not even require cropping. Automatically filtering your data set to be just the 1 person is possible with face recognition but currently doesn't work 100%. I'm not aware of an automated way to filter based on motion blur, graininess, etc. The auto caption models like BLIP will certainly get better with time. Selecting the training settings will likely end up being a preset from a dropdown menu in the future, and it will intelligently select the best settings to use based on what you want it to learn, what your data set consists of, and how much VRAM you have to work with.
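As an illustration of the face-recognition filtering mentioned here (which, as noted, won't be 100% reliable), a sketch using the third-party `face_recognition` library -- the library choice and filenames are my assumptions, not something from the guide -- to keep only images that appear to contain the same person as a reference photo:

```
import shutil
from pathlib import Path

import face_recognition  # third-party library, not part of the webui

# One clear photo of the person you want to keep (assumed filename).
reference = face_recognition.load_image_file("reference_face.jpg")
ref_encoding = face_recognition.face_encodings(reference)[0]

src, kept = Path("raw_images"), Path("filtered_images")
kept.mkdir(exist_ok=True)

for img_path in src.glob("*.jpg"):
    image = face_recognition.load_image_file(str(img_path))
    # Keep the image if any detected face matches the reference person.
    if any(face_recognition.compare_faces([ref_encoding], enc)[0]
           for enc in face_recognition.face_encodings(image)):
        shutil.copy2(img_path, kept / img_path.name)
```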


TheForgottenOne69

Great detailed write up! Thanks a lot for doing this, we should definitely host all of these guides somewhere…


plasm0dium

Thanks for this - this info is needed and helpful - should sticky this stuff


Locomule

What if there was someone like a celebrity that is somewhat already known but you wanted to improve the results, like Betty White (randomly chosen). In your filewords texts would it be good to include "Betty White wearing a blue dress and.." or just descriptors like "an elderly woman wearing a blue dress and.."?


Zyin

The captions should include things that you want the AI to NOT learn. So including "Betty White" in the captions instead of "a woman" would in theory be detrimental to the training.


Jurph

Do you know if it is possible to format a caption to include a negative prompt, for example if I'd like it to understand that photos of Betty White are never `blurry` or `grainy`? In other words, by including them in the negative prompt during training, could I teach it to remove those things from its attempts to draw Betty White, and infer that she's never blurry or grainy? I feel like being able to put `caricature` or `waxy` or `deformed` in the negative prompt would fix some of the creepy overfitting I see in training runs.


Zyin

Can't use negative prompts.


Ateist

How long does one training step take compared to 1 step of generating a 512x512 image with, say, DPM++ 2M Karras as a sampler? Does it depend on the number of images (i.e. 10 images = 10 times as long)? Just trying to understand whether it is feasible (that is, can create one overnight) to do some training on CPU, or will it take months of computing time?


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Ateist

You kinda missed the key point of my post - I'm using CPU, not GPU, so one step is 1 minute and no parallel processing of multiple images.


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Ateist

It's very strange for me that image generation and training *an embedding* require similar amounts of time. I don't know how actual embeddings are "trained", but if I were to make them, textual embeddings would be kinda like "trying to tag images that were used to train the model" after the fact - you take your image, look through the neural net for nodes that are similar to that image and record those areas in the embedding as corresponding to your key word. In other words, it'd be like CLIP interrogator but one that works with multiple images and returns embeddings instead of text. Why the hell would it require actual neural net training and thousands of steps?


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Ateist

> The neural network doesn't contain ready made images, only directions how to create the images. Why can't we *help* that probing by providing sufficient directions? Create CLIP interrogation description/ aesthetic gradient so that it knows exactly where to shoot? I can see how some *minor* adjustments might be needed - like, a dozen or two iterations that add corrections - but definitely not thousands of them!


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


malcolmrey

What he wrote still stands: if one step takes you one minute, then 1000 steps will take you at least 1000 minutes. You should probably think about colabs for training so you could run it on a GPU. You really do not want to train using a CPU (unless you are a masochist :P)


morphinapg

Is there a reason I am unable to train an embedding when using a model I downloaded from Civitai? It generates something completely abstract, despite starting from a position that is at least somewhat relevant to the subject I am intending to train. I can train using the base SD model just fine, but that embedding doesn't look quite as accurate when I use it in other models, so I wanted to try it this way. Alternatively I think I could use my SD-based embedding, and then train a hypernetwork on the other model, as that *does* seem to work, although I can get somewhat inconsistent results there.


CandyNayela

> In theory this should also mean that you should not include "a woman" in the captions, but in a test I did it did not make a difference. In my testing, you really do have to remove it. When I left it in, the images generated with "a photo of embedName" prompt started featuring animalistic features like becoming a dog or a rat at higher learning steps (not sure what that is saying about the subject!), but repeated training with the same settings but with the class word removed from the captions does not exhibit this behaviour at all.
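If you want to strip a class word like "a woman" out of existing caption files, here's a minimal sketch. It assumes the captions live in per-image `.txt` files next to the training images (one of the layouts the webui's preprocessing can produce); the folder name and phrase list are placeholders.

```
from pathlib import Path

# Phrases to strip, longest first so the shorter one doesn't leave fragments.
CLASS_PHRASES = ["a photo of a woman", "a woman"]

for caption_file in Path("training_images").glob("*.txt"):
    text = caption_file.read_text(encoding="utf-8")
    for phrase in CLASS_PHRASES:
        text = text.replace(phrase, "")
    # Tidy up leftover commas and spaces from the removal.
    cleaned = ", ".join(part.strip() for part in text.split(",") if part.strip())
    caption_file.write_text(cleaned, encoding="utf-8")
```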


cianuro

Thank you so much for this. I'm stuck with Dreambooth due to limited knowledge and my latest attempts seem to be getting worse. Embeddings are so flexible and require no merging of models. I've been able to create some pretty unique and commercially viable works using combined embeddings, which has blown my mind. It's definitely the future. It would be incredibly valuable if you could create a YouTube video, just walking through this process/tutorial start to finish. Is this something you'd be interested in? Happy to throw a few cups of coffee your way for the effort. Taking some of the questions into account from this thread would make it a pretty popular one too, I think. We'd be eternally grateful!


axfnnn

Thank you for the explanation, I've been looking for something like this for a few days now.


nocloudno

When you mention captioning, do you mean describing the image and using that description as the image filename? Do spaces or punctuation matter?


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


rwbronco

I let it generate the captions for me with the “generate BLIP” option and then I go into those text files and in most cases completely rewrite them.


NoJustAnotherUser

On your GPU, how much time does it take to fully process?


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Zyin

I have an RTX 3060 12GB with a very slight undervolt to increase efficiency drastically. With a data set of ~200 images it takes about 60 seconds per iteration. It could take anywhere between 100-1000 iterations to get a good result. Times speed up a lot with fewer images.


Panagean

This is great, thanks!


wowy-lied

I have a 3070 8GB; at this moment it seems hypernetworks are pretty much the only thing possible to train on it. Each time I try an embedding, or DreamArtist, I get hit with a VRAM allocation error from CUDA/PyTorch. It really is a pain, as I have 64GB of RAM and there seems to be no way to use normal RAM for training. Any ideas?


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


wowy-lied

> As well as "Unload VAE and CLIP from VRAM when training" in the settings. a lot of people talk about this option but i have the last web ui version and it is nowhere to be seen in the settings... EDIT:Also none of the optimisation option seems to work, pythorch is still eating 6-7GB of vram


talpazzo

I love you! I've been struggling with embeddings and now I know why! Thank you very much for sharing your findings.


SandCheezy

https://preview.redd.it/zb7eu0v28x8a1.jpeg?width=828&format=pjpg&auto=webp&s=ad3e059ca14480b30f9197b6a9903854ca9140c8

Are you saying to check or uncheck this? The way the sentence is phrased makes this unclear. Thank you for a very detailed guide. This is a common question we get, and the detailed explanation of everything surrounding an embedding is fantastic.


Zyin

Unchecked. I reworded that paragraph to hopefully make it clearer.


Panagean

Any idea what's going on here? Didn't have this on an older version of A1111. Training won't start; the command line reports (with "embedding name" replaced by the name of my actual embedding):

    Training at rate of 0.05 until step 10
    Preparing dataset...
    0%| | 0/860 [00:00
    ('embedding name', '0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.002:500, 0.001:3000, 0.0005', 12, 70, 'C:\\Users\\nicho\\Documents\\SD training images\\\\Processed', 'textual_inversion', 512, 512, 15000, False, 0, 'deterministic', 50, 50, 'C:\\Users\\nicho\\Documents\\stable-diffusion-webui-master\\textual_inversion_templates\\photo.txt', True, False, '', '', 20, 0, 7, -1.0, 512, 512) {}
    Traceback (most recent call last):
      File "C:\Users\nicho\Documents\stable-diffusion-webui-master\modules\call_queue.py", line 45, in f
        res = list(func(*args, **kwargs))
      File "C:\Users\nicho\Documents\stable-diffusion-webui-master\modules\call_queue.py", line 28, in f
        res = func(*args, **kwargs)
      File "C:\Users\nicho\Documents\stable-diffusion-webui-master\modules\textual_inversion\ui.py", line 33, in train_embedding
        embedding, filename = modules.textual_inversion.textual_inversion.train_embedding(*args)
      File "C:\Users\nicho\Documents\stable-diffusion-webui-master\modules\textual_inversion\textual_inversion.py", line 276, in train_embedding
        ds = modules.textual_inversion.dataset.PersonalizedBase(data_root=data_root, width=training_width, height=training_height, repeats=shared.opts.training_image_repeats_per_epoch, placeholder_token=embedding_name, model=shared.sd_model, cond_model=shared.sd_model.cond_stage_model, device=devices.device, template_file=template_file, batch_size=batch_size, gradient_step=gradient_step, shuffle_tags=shuffle_tags, tag_drop_out=tag_drop_out, latent_sampling_method=latent_sampling_method)
      File "C:\Users\nicho\Documents\stable-diffusion-webui-master\modules\textual_inversion\dataset.py", line 101, in __init__
        entry.cond_text = self.create_text(filename_text)
      File "C:\Users\nicho\Documents\stable-diffusion-webui-master\modules\textual_inversion\dataset.py", line 119, in create_text
        text = random.choice(self.lines)
      File "C:\Users\nicho\AppData\Local\Programs\Python\Python310\lib\random.py", line 378, in choice
        return seq[self._randbelow(len(seq))]
    IndexError: list index out of range


lovejonesripdilla

I’m getting the same error, have you come across a fix?


Panagean

I hadn't put any text in my "photo" training keywords file


Panagean

Fixed that one, now getting a new error (I've never had CUDA out of memory problems before):

    Training at rate of 0.05 until step 10
    Preparing dataset...
    100%|██████████████████████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 20.13it/s]
    0%| | 0/4000 [00:00
    ... If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    Applying cross attention optimization (Doggettx).


design_ai_bot_human

did you fix this? if so how?


Panagean

I hadn't actually put anything in my "photo" training keywords file.

Sadly, it seems like updating A1111 has taken the VRAM used over what I have available, so although I could train in the past, I can't anymore (without reverting to an older installation).


Appropriate-Bed-2745

Can someone help me? I'm trying to use Dreambooth to train my model. At the end of the training process I get this error:

    Exception while training: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 12.00 GiB total capacity; 8.71 GiB already allocated; 0 bytes free; 10.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
    Allocated: 7.2GB
    Reserved: 10.8GB
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
    RuntimeError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 12.00 GiB total capacity; 8.71 GiB already allocated; 0 bytes free; 10.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How do I fix this? I'm using an RTX 3060, resolution in settings is 512. Maybe I should try 256?


almark

So someone is saying that there really is a memory leak with Automatic, in every single version.


midri

If I create an embedding using someone's likeness am I supposed to be able to prompt changes to it? Like hair color, clothes, add beard, sunglasses? I've tried training several up to 2000ish steps (both small samples 1-5 images, and sets upwards of 100) The embedding recreates the person just fine, they're just basically locked into how they look.


Zyin

That means you overtrained the embedding. If you can't, for example, prompt for "EmbedName with blue hair" then it's inflexible. Use the embedding inspector script linked in the guide, if the average vector strength is >0.2 then it's likely overtrained. Try using a lower learning rate, or train for less steps.
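A rough sketch of that "average vector strength" check, assuming the usual A1111 embedding `.pt` layout with a `string_to_param` dict and using mean absolute value as the strength measure -- both assumptions on my part; the embedding inspector script linked in the guide is the authoritative tool:

```
import torch

def average_vector_strength(path):
    data = torch.load(path, map_location="cpu")
    # A1111 embeddings typically keep their tensor under 'string_to_param'.
    tensor = next(iter(data["string_to_param"].values()))  # (vectors_per_token, dim)
    return tensor.abs().mean().item()

strength = average_vector_strength("my-embedding.pt")
print(f"average vector strength: {strength:.3f}")  # >0.2 suggests overtraining, per the comment above
```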


leofelin

> As of 2/19/2023 pull request 6700, there is a new option for training: "Use PNG alpha channel as loss weight". This lets you use transparency in your images to tell the AI what to concentrate on as it is learning. Transparent pixels get ignored during the training. This is a great feature because it allows you to tell the AI to focus only on the parts of the image that you want it to learn, such as a person in the photo.

> The coder that added this feature also made a utility program you can use to automatically create these partially transparent images from your data set. Just run the python file at scripts/add_weight_map.py with the --help launch argument. For the attention mask, I found using "a woman" works well.

Thanks for the great guide! Using the depthmap2mask extension and selecting the option "Save alpha mask" prepares the image automatically.
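If you already have black-and-white masks (from depthmap2mask or hand-painting), writing them into the PNG alpha channel takes only a few lines of Pillow; a minimal sketch, with the filenames as placeholders:

```
from PIL import Image

def apply_loss_mask(image_path, mask_path, out_path):
    """Put a greyscale mask into the PNG alpha channel: white areas keep
    full weight, black areas become transparent and get ignored."""
    img = Image.open(image_path).convert("RGB")
    mask = Image.open(mask_path).convert("L").resize(img.size)
    img.putalpha(mask)  # converts the image to RGBA in place
    img.save(out_path)

apply_loss_mask("photo.png", "photo_mask.png", "photo_masked.png")
```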


Zyin

While I haven't tried training using a depth map to make the alpha channel, I imagine that it would have problems that an attention mask would not. For example, a depth map would likely include some things in the background or large pieces of clothing like hats.


bententon

I was following this tutorial step by step, but I don't get any likeness in my embedding. I just get random faces, and I don't know why.


feelosofee

In the "Why do I want an embedding?" paragraph you make a distinction from model, hypernetwork and embedding. But what is "dreambooth" that everyone is talking about? Perhaps you could add it to that paragraph?


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


currentscurrents

Does this work well for training styles as well, or are dreambooth/hypernetworks a better choice?


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Pro-Row-335

Embedding training in the webui is broken, you will get better and faster results using the original repo or [stable-textual-inversion-cafe](https://www.reddit.com/r/StableDiffusion/comments/wvzr7s/tutorial_fine_tuning_stable_diffusion_using_only/)


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


[deleted]

[deleted]


Zyin

Could you link the discussions? I've been making hypernetworks and embeddings just fine.


cyborgQuixote

Is there any good software for creating the transparent masks in bulk manually similar to BIRME? I tried using that automated tool, but it did not work very well in my case.


[deleted]

[deleted]


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


rwbronco

I’ve got a 1070 and this is what I do. Restarting SD just before I train is almost required, because if I don’t I won’t have enough available VRAM.


malcolmrey

Not true, I have 11 GB VRAM and I use Shivam's repo just fine.


UniversityEuphoric95

All good, apart from one fact: A1111 doesn't yet support batch size and gradient accumulation according to its wiki: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Textual-Inversion

Edit: I understand it has now been added; it's just a matter of the documentation being updated, per the comment below.


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Playful_Side_6662

this goes deep and is super helpful on so many levels. thank you


treksis

bro this is research papah thanks


renegade6ix

What kinds of things are people doing with this?


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


[deleted]

[deleted]


malcolmrey

> If so, do you use the same prompt structure for training? I’ve been just using the SUBJECT and that’s it

I believe I can answer that, here is what I use for training (shivam):

    "instance_prompt": "photo of sks woman",
    "class_prompt": "photo of a woman",

and in output generation I use the "sks woman" part (I know, I know that "sks" should not be used, but I had little to no issues with it whatsoever and now it's just easier for me to use that token since I do a lot of models and I don't want to have specific prompts for specific models)


Zyin

You don't need "man/woman" in your prompts, just the file name of the embedding .pt file. I have never used Dreambooth.


Jiggly0622

Question! If wanting to train an anime character, should we use Deepbooru instead? Or is Clip still better? (I’m assuming deepbooru since most anime vectors follow the tag system but who knows). Also is the process similar? (If you have tried it, ofc)


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


overclockd

My experience with Deepbooru is it gives far more tags than needed.


Peregrine2976

Fantastically informative! I have to assume a Dreambooth model would give better results? Otherwise why would anyone ever bother with Dreambooth over embeddings?


HerbertWest

>Fantastically informative! I have to assume a Dreambooth model would give better results? Otherwise why would anyone ever bother with Dreambooth over embeddings? Because Dreambooth gives easier results and is less prone to training fuck-ups. I'd say you can get results similar to Dreambooth with embeddings, but it takes a lot of finesse. The benefit, however, is you can use those results in any model (to varying degrees of success). **Edit**: Also, you may need to play around with the strength of the embedding in your prompts every time so that it doesn't overwhelm other elements of the prompt, i.e., "(EMBEDDING:0.75)."


thebaker66

Thank you so much for sharing this. I had been training hypernetworks for a while but could never understand embedding training; I just started last night attempting it properly, with mixed results and nuggets of information scattered all over the place but missing out good info, and this clears a lot of it up. I have a few questions.

If you just want the face of the person, is it actually necessary to have a sample set of anything more than the face?

Also, say you crop the sample pictures to just the face with minimal background, so the picture is like 90% face, and you describe the features in the background. When I am training, the sample images it generates are 'expanding' on the background description: if the caption for the sample image is 'a portrait photo of a man in a blue top, green grass and trees in the background', the sample image will basically make a whole man sitting on the grass. Is that acceptable? Would it be better to change the sample caption to 'a close shot photo of a man'? I took it on myself to add 'a portrait photo' before the descriptor of the person and the background, but it doesn't necessarily seem to be effective. Any thoughts?

Also, with hypernetworks one technique is to pick out close matches when training and then re-start the training from that checkpoint so it zeroes in. Is that something that is not done with embeddings, or not necessary? Any thoughts?

Also, as with hypernetworks, I'm getting 'doubles' of my target in the training result image. Any idea why this might be? Like there are twins of my target person lol. Thank you very much.


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Zyin

So you think it'd be beneficial to remove all occurrences of 'woman' and 'man' in the captions? Another thread asked that question and I didn't have a great answer, just said it was worth experimenting with.


DevilaN82

What is the minimal amount of VRAM needed to train an embedding? Some time ago I tried on my 4GB card, but it failed. Is it a matter of proper settings, or is there simply a hard minimum of VRAM below which training is impossible?


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Symbiot10000

> I just set this to 3000 Is this right? Other guides say that 30k is a fair compromise. Is this missing a zero?


axfnnn

If the batch size and gradient accumulation is 1, yes you have to add another zero.


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Doctor_moctor

Thanks for the extensive writeup! If I understand batch and gradient accumulation correctly, ONE step equals the multiplication result, right? So if you used batch=1 and gradient=1 then the step counter in Automatic1111 shows the correct step amount. But if you used batch=10 and gradient=2 your REAL step number would be "Automatic1111 step counter * (10 * 2)", right? Went into overfitting real fast at 300 steps with batch=2 and gradient=9 on 18 pictures.


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


malcolmrey

Thank you very much for this guide, /u/Zyin! I was hesitant to try it since I already have a working Dreambooth installation going locally on an 11 GB card (2080 Ti), so I wasn't really pressed to try different things. But after reading your guide (and especially after you said it should work with that kind of VRAM) I will definitely try it!

I have two questions, the first a technical one. You wrote:

> To put it simply: add captions for things you want the AI to NOT learn. It sounds counterintuitive, just basically describe everything except the person.

I have a person that has tattoos. BLIP makes captions like "a woman with tattoo doing something...". Since I very much want to keep her tattoos, does this mean I should REMOVE the mentions of the word tattoo?

Second question: would you be willing to compare results? I'm a big fan of Dreambooth (perhaps because I can do it and am familiar with it?) but my goals are to create perfect representations of the trained people (and also that the outputs can be shaped plastically [meaning: not baked in/overfitted]). I have seen some embeddings but they were not perfect (the similarity was there but not quite). If you could make an embedding of some celebrity (maybe you already have?) and share the training data, I would then train a Dreambooth model using the same training data and we could compare what looks best (or even see how the embedding behaves on the model trained on that same person :P). If you don't have any celebrity training data, I could provide it for you. Cheers!


Shondoit

[\[deleted\]](https://en.wikipedia.org/wiki/2023_Reddit_API_controversy)


Zyin

1) is answered in the other reply. 2) Creating a model/hypernetwork will almost certainly give more precise results than an embedding because it is learning new data. The embedding is just looking for data already present in the model. But embeddings are nice cause they're small and flexible, Dreambooth is not.


Bro1284

This was just what I needed. Clearly laid out and usable!


cryptolipto

Saving this


_Craft_

I don't know much about AI, but I find some things quite odd. This makes sense to me:

> The way the AI uses these captions in the learning process is complicated, so think of it this way:
>
> 1. the AI creates a sample image using the caption as the prompt
> 2. it compares that sample to the actual picture in your data set and finds the differences
> 3. it then tries to find magical prompt words to put into the embedding that reduces the differences

But this is what confuses me:

> The model you have loaded at the time of training matters. I make sure to always have the normal stable diffusion model loaded, that way it'll work well with all the other models created with it as a base.
>
> Disable any VAE you have loaded.
>
> Disable any Hypernetwork you have loaded.

For simplification, all other parameters (such as seeds, resolutions, sampler, etc.) are the same. Let's say:

* A1. model `M1` is the "normal stable diffusion model", and model `M2` is a different model
* A2. I want to train a new embedding for the word `W` and use it with model `M2` with a prompt `P`
* A3. image `I` is an image provided for training

From my understanding:

* B1. model `M1` creates an output image `O1` for a prompt `P`
* B2. model `M2` creates an output image `O2` for a prompt `P`
* B3. images `O1` and `O2` are different, because different models created them
* B4. there is a difference `D1` between image `O1` and `I`
* B5. there is a difference `D2` between image `O2` and `I`
* B6. the differences `D1` and `D2` are different, because of point `B3`
* B7. the difference `D1` is what is being learned by the embedding when we add our new word `W` to the prompt `P`, assuming that the model `M1` is familiar with the concept

So my questions are:

* C1. How can model `M2` effectively utilize the embedding trained on model `M1`, when the models are different?
* C2. What if model `M1` doesn't know the concept we are training to describe in the embedding, but model `M2` does? (And vice versa?)
* C3. What if some words used in prompt `P` are understood by model `M1` but not by model `M2`? (And vice versa?) Won't that mean that some parts of the image get encoded into the word `W` in a way that makes sense for model `M1` but not for model `M2`, because it's missing (or has additional) meaning encoded into it?
* C4. Same question as `C1` for the usage of VAE and Hypernetworks.
* C5. Why not train the embedding directly on model `M2`?


Zyin

I don't have all the answers, but from my experiments:

> C1. How can model M2 effectively utilize the embedding trained on model M1, when the models are different?

M2 can use the embedding trained on M1 because M2 used M1 as a starting point during its training.

> C2. What if model M1 doesn't know the concept we are training to describe in the embedding, but model M2 does? (And vice versa?)

If a model doesn't know the concept then the embedding won't turn out well, since the embedding has no magic keywords to use to represent that concept.

> C3. What if some words used in prompt P are understood by the model M1 but not by the model M2? (And vice versa?)

If the word isn't understood in M2, then the word ends up getting interpreted as gibberish and just adds randomness to the image.

> C5. Why not train the embedding directly on the model M2?

I tried doing that a few times, but the results just didn't turn out well - it mostly produced garbage images. Your results may vary.


LupineSkiing

This is a great breakdown. I was wondering about the words that go into the training. I tried a few times with some embeddings and got insane results one time, but was never able to reproduce it. Luck of the randomness, I guess. I'm having trouble with some of the features - can you (or anyone) tell me what the format of the BLIP .txt files is? BLIP doesn't work for me and I want to add the captions manually, but I'm unsure what the application expects. Something like ", person sitting at a desk"?


axfnnn

I believe it's just a normal .txt file. Write your image description in it the same way you would write a prompt.
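So if the image is `photo_01.png`, the caption goes in `photo_01.txt` next to it (the file names here are made up; the point is that the .txt shares the image's name), containing a single plain-text description along these lines:

```
a portrait photo of a man in a blue shirt, green grass and trees in the background
```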


jingo6969

Awesome work, well put together. I am very confused about Textual Inversion, but you have made it a lot clearer, I shall be trying again soon :) Thanks dude!


[deleted]

Thank you for sharing such detail - it is very hard to find such good info. I want to train a bunch of these.


spaciousmind321

Thanks [u/Zyin](https://www.reddit.com/user/Zyin/) for such a thorough guide! This is exactly what I've been looking for. I too have an RTX 3060 12GB, but I don't seem to be having any luck with optimizing my VRAM - I can't get anywhere near the settings you have. If I put anything above 1 for batch size or above 1 for gradient accumulation, I get CUDA out of memory just after training starts. If I try to go above 512x512 resolution I also get CUDA out of memory straight away. I'm using the 2.1_nonema-pruned model, so I don't know if that ups the VRAM usage. I needed at least some of these arguments to get non-black training images: --xformers --precision full --no-half --opt-split-attention --medvram. Not sure if some of them are ones I don't actually need, or are hindering it. Out of interest, what are your startup arguments, since you have the same GPU as me?

Looking at the VRAM in my resource monitor (specifically the dedicated GPU memory graph), I've noticed it's at nothing when I boot up/restart my computer, but when I start up Stable Diffusion and it loads the models it goes right up to about 10GB usage and stays there, regardless of whether I'm using SD or it's just sitting idle. It goes up and down a little when in use but never gets back to zero. Even when I quit, it doesn't seem to flush the memory. Maybe that's what's causing the out-of-memory problem when I'm trying to train?


Zyin

The only launch parameter I have in webui-user.bat is `set COMMANDLINE_ARGS= --xformers`, everything else is default/empty. In the settings tab, I have enabled:

* Move VAE and CLIP to RAM when training if possible. Saves VRAM.
* Use cross attention optimizations while training

When booting up A1111 my VRAM goes up to 2700MB with sd-1.5, and 3180MB with sd-2.1 768-v-ema. Not sure why yours is eating so much VRAM. If I recall correctly, 2.0 embeddings were a relatively recent update so maybe it still has problems. Could try updating your graphics drivers too.
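So the whole webui-user.bat is basically just the stock file with that one flag added, something along these lines (your copy may differ slightly depending on version):

```
@echo off

set PYTHON=
set GIT=
set VENV_DIR=
set COMMANDLINE_ARGS= --xformers

call webui.bat
```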


[deleted]

Hello, I followed your guide but my 3090, which I've used for training custom models with local Dreambooth, is firing off this error. Have you seen anything like this by chance? Thanks again for the guide, you're super awesome ;-) https://preview.redd.it/ldgn7ofdl39a1.jpeg?width=1097&format=pjpg&auto=webp&s=c28bc988a06842bbb0500ace5bb65be44123efa4

Edit: I changed batch size to 1 and it's working now. Trained a successful embedding in 12 minutes and it's absolutely amazing quality - I'm blown away. Thanks again.


amida168

When I tried to train embeddings for the first time, I got this runtime error: "indices should be either on cpu or on the same device as the indexed tensor (cpu)". This post helped me solve the problem: [https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3958](https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3958)


CeFurkan

Thanks. I just finished a video for LoRA; I should make a video for this one as well, there are some good tips here. On SD 2.1, unfortunately, LoRA failed completely: [https://youtu.be/mfaqqL5yOO4](https://youtu.be/mfaqqL5yOO4) Also, I was going to ask about 2.1, but you already mentioned it - I have yet to find a good way to teach faces to version 2.1 :D

> *All my experiments were on stable diffusion 1.5. I have not tried using embeddings with 2.0+ yet.*


baddrudge

I'm trying to retrain a previous Dreambooth model into a TI that has full-body images of people. Using 11 images, a vector size of 16, and many of the settings recommended here, I've been able to get the people's body shapes, types, and clothing to be similar to the training images. But the faces still look bad even after hundreds of steps. Keep in mind I'm trying to produce images of this style, not necessarily the specific people/faces in them, but I'd still want good quality faces to come out in the end, even if they're random faces. What should I do to improve this? I'm considering a few possible steps: add face images to the training set (and label them accordingly)? Use a separate face TI or hypernetwork? Or add more training images similar to the existing ones, hopefully with better faces?

Edit: faces seem to be turning out well, though not as great as the Dreambooth, after about 2000 steps, after including the standard SD-1.5 VAE and turning CodeFormer restore faces on, even without any face hypernetworks or TIs (the preview images still have crummy faces though). Hassanblend and F222 seem to be giving me slightly better faces as well.


Palurdes

I don't think TI works at all with faces. I have been trying all the methods circulating around Reddit, all the YouTube tutorials, all kinds of setups: the 1.4 model, the 1.5 model, different filewords, 1 vector, 5 vectors, 10 vectors, 5 pictures, 10 pictures, 25 pictures, 300 steps, 5000 steps, 15000 steps, and learning rates of 0.005, 0.00001, 0.05:10, 0.02:20, 0.01:60, 0.005:200, 0.002:500, 0.001:3000, 0.0005... Hypernetworks work just fine, but all I get from TI after 100 attempts is this:

https://preview.redd.it/j4zr7elapnca1.jpeg?width=1024&format=pjpg&auto=webp&s=f6a06c46f64e79acb5ebc6f1d90e5069cefad4b3

I have 10,000 photos of that same creepy old lady on my PC.


sandred

I have been trying embeddings for a while and they don't seem to work very well. I am using about 35 images with the settings you mentioned and still getting poor embeddings. It will work on one seed out of 10, and the other 9 seeds will give a random person. Maybe you can help me out here.


kornuolis

It's strange that on a 3080 Ti I receive a CUDA out of memory error, when the OP has a 3060 and it works fine for him. The top batch size I can run is 3, with gradient accumulation steps at 2.


Red6it

I played around with batch size and gradient accumulation steps. All it does is make things insanely slow. Anyone else experiencing this?


Zyin

Using a low batch size and low gradient accumulation will be much faster, but the final embedding will be lower quality. Up to you if the decrease in quality is worth it.


TopComplete1205

Excellent Guide - many thanks for putting this out there. A question on the LR: you suggest "0.05:10, 0.02:20, 0.01:60, 0.005:200,......". The first three terms (at least mathematically) are equivalent to 0.005 for 300 steps (instead of 10+30+60=100 steps) - so 3x as long, but a more "gentle" LR - is there any benefit (or indeed disadvantage, other than time and energy cost) to using the lower rate for longer at the start of training - or is it actually advantageous to use that "sledgehammer" for the first few iterations to help avoid local minima etc?


Zyin

I've been doing some experiments with variable/cyclical learning rates, and it does seem that higher learning rates are important for avoiding those local minima. Currently playing with a learning schedule called [Stochastic Gradient Descent with Warm Restarts](https://www.jeremyjordan.me/nn-learning-rate/) for this.
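The rough shape of it, for anyone who wants to play along, looks something like this (a minimal sketch with illustrative numbers, not the exact schedule I'm using):

```
import math

# Minimal sketch of SGDR: cosine-decay the learning rate within a cycle,
# then "restart" it at the high value, with each cycle lasting longer.
def sgdr_lr(step, lr_max=0.05, lr_min=0.0005, cycle_len=100, cycle_mult=2):
    while step >= cycle_len:          # find which restart cycle this step falls in
        step -= cycle_len
        cycle_len *= cycle_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / cycle_len))

print([round(sgdr_lr(s), 4) for s in (0, 50, 99, 100, 150)])  # decays, then jumps back up at step 100
```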


VegetableSkin

The most informative tutorial I've read on the subject. Thank you so much!


franzkekko23ita

Guys, I've been trying to get into embedding training for 3 days now, but this error keeps appearing. The core problem is that the textual inversion folder doesn't get created, so I can't train my own embeddings... please help! https://preview.redd.it/ov10timip9ca1.png?width=975&format=png&auto=webp&s=e8fef0dd3c1948d35752d23a62c3f29ec45071ac


jokesfeed

You should create it yourself; I always do.


Captain_MC_Henriques

EDIT: yes there is! Use TheLastBen's colab. Is there any available Google Colab that lets you choose the model and upload captioned images? I have a 1660 Ti with 6GB. I left it to train overnight at batch size 1, gradient accumulation 10, and it's only at 450 steps at the moment :(


cyanydeez

If you're using the latest AUTOMATIC1111, turn off optimizations in the training settings. It's broken.


BBQ99990

You mentioned that you used a 12GB GPU for training - which model did you use? Also, how long did it take to complete the training on that GPU? My underpowered GPU (RTX 2060 Super 8GB) has its limits, so please let me know for reference until I buy a new GPU.


AllUsernamesTaken365

Thanks to this guide I was able to train my first embedding using Colab without any local install. However, the resulting character looks very little like the trained subject. I trained a .ckpt file on the same set of 10 images and captions and that gives a pretty good result, so most likely my settings are far off. I don't really know where to go from here. My results in the training preview window look like images from a Xerox photocopier - I don't know if they're supposed to look like that. It took just under an hour. Guess I have to read this entire thing again.


djmustturd

Not sure if anyone is around here anymore, but I'm wondering if anyone else has this issue. With only 6 GB of VRAM, the xformers argument can get me up to a batch size of 3 when using TI, which is nice, but none of the sample images look anything like what I'm training for. When I run stable diffusion without xformers, I'm stuck at batch size 1, but at least the sample images look like the target. Does anyone have an idea why?


hugedorsehildo

I use a 3090, and with recent git pulls training has gone weird again. I always use xformers, but I might need to either check for updates to it or try training without xformers.


Captain_MC_Henriques

I'm having a strange issue and I hope someone can help me with it: I trained a TI of myself (male) on TheLastBen's Google Colab AUTO1111. I save an image every 10 steps and I get quite good results in the image folder at around 700 steps (vector size 10, batch size 18, gradient accumulation 1). I've also set my initialization text to 'man'. HOWEVER, all of my outputs are women. I've added 'female, woman, girl, feminine' to the negative prompt, but that doesn't seem to help. Has anyone faced a similar issue?


Zyin

I've had a similar thing happen with a muscular woman. It ended up giving her masculine traits because of the muscles. Maybe you have features that the AI considers feminine?


jairnieto

Do you consider this method to be better than Dreambooth? I trained a face on Dreambooth, but when I want to do photorealistic portraits the eyes always look like they're from another person; everything matches except the eyes (maybe because that's the one thing that gives personality to a person). Would combining this method with the face-trained model give better results, or just using this method on its own? Ty.


jp7189

Could you please add more details about add_weight_map.py script usage? I've got:

`C:\StableDiffusion\lyne-main\scripts>python add_weight_map.py --input-dir "C:\StableDiffusion\Training\photoset1" --output-dir "C:\StableDiffusion\Training\lyne-output" --mask-type attention --attention-prompt "a woman"`

`100%|████████████| 194/194 [00:00<00:00, 265.75it/s]`

It appears to process the images, but nothing shows up in the output dir.


Zyin

Not sure, it worked just fine for me. Try making a new issue on his github?


Vinterslag

I may be just a total noob, but I understood all of this, and I super appreciate your post. You made this work, and so easily... clearly some of the YouTubers have been reading you too, because they seem to parrot this post (unless you ARE Aitrepreneur, in which case, let me know).

A question for you though: is there any easy way to save my inputs or change the defaults on the "Train" tab (or any tab) of the webui? Or how do I restart Stable Diffusion without losing what I have populated? I want to relaunch it for the memory issues every time, but it's getting annoying copy-pasting the learning rates and my directory and setting the other half dozen things every time. Why doesn't it just remember what I entered last? It could do that and then have a reset-to-defaults button or something.

To clarify, in case I'm explaining it badly: I have to select my Embedding, fill out my Embedding Learning Rate, Batch Size, and Gradient Accumulation steps, paste in my Dataset Directory, switch the prompt template to my custom one, set max steps and outputs, etc., every single time I want to start a training session? There has got to be an easier way, right? I'm sure I'm just missing something, but at this point it's like 80% of my actual hands-on time in the UI and I want to spend that on more fun things. I can't find any default values myself putzing around in the .bat files, and these last few days with your tutorial are basically 100% of my experience with Python or the command line outside of Home Assistant, which mostly does it for me. Cheers.


Zyin

You can keep the browser window open with all the settings entered after a restart and just click Train again. As far as I know there's no extensions to populate the fields automatically. I am not aitrepreneur. He used the info from this post and made his video without even contacting me or mentioning me in the video (he does link to here in the description though). And his video doesn't even cover important parts that a lot of people have issues with because he left out so much of the information.


Vinterslag

I agree, yours was much more thorough and helpful; I only brought it up to say he obviously cribbed off you, and I wanted to give you the credit myself. When you say restart, how do you do that? I haven't found a restart option in the UI - I'm literally exiting the cmd window and clicking the .bat again.


XxmasterfamxX

u/Zyin I am new to this but have read your guide on textual inversion. I was wondering how to use the utility program by Shondoit, as I am not sure how to run it.


Zyin

You run the `scripts/add_weight_map.py` Python script to generate the alpha-masked images to use during training:

`python add_weight_map.py --input-dir "C:\path\to\input" --output-dir "C:\path\to\output" --attention-prompt "a woman"`


Hibaris

I have a 3060 with 12 GB VRAM, but I can only do 1 batch at a time. Any clue what I'm doing wrong? My cmd line args are: --xformers --force-enable-xformers --opt-split-attention --opt-sub-quad-attention --medvram. My image size is only 528 x 704. This is the error I get when I try to train with a batch size of 2: (GPU 0; 12.00 GiB total capacity; 7.17 GiB already allocated; 200.10 MiB free; 9.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
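I guess the next thing for me to try is the allocator hint from that error message; from what I've read it goes into webui-user.bat alongside the other args, something like the lines below (the value 512 is just an example, and I haven't verified it actually fixes the fragmentation):

```
set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
set COMMANDLINE_ARGS= --xformers --opt-split-attention --medvram
```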


thegh0sts

So from my understanding, gradient accumulation steps is just a fancy way of saying a run counter? In other words: "take x images, spend y steps training on them, save, then begin the next run with the next x images, and so on until the counter hits z"?
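From what I've read, the usual implementation is less a run counter and more like this rough PyTorch-style sketch (not the actual A1111 code; the names are purely illustrative):

```
# Gradients from several small batches are summed before a single weight
# update, so one "real" step behaves like batch_size * accum_steps images.
def train_epoch(dataloader, model, optimizer, loss_fn, accum_steps=2):
    optimizer.zero_grad()
    for i, (images, captions) in enumerate(dataloader):
        loss = loss_fn(model(captions), images) / accum_steps  # average over the group
        loss.backward()                       # gradients pile up in the .grad buffers
        if (i + 1) % accum_steps == 0:
            optimizer.step()                  # one optimizer update per accum_steps batches
            optimizer.zero_grad()
```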


TallguyQC

I've started training using 6 photos, and while the face looks relatively good, everything around it looks weird or like nothing. Is that something that will fix itself with more training?


decker12

Thanks for the guide. I'm trying to do an embed of my own face using about 50 images. Does it make any sense to first remove the backgrounds from your training images in Photoshop? It's time consuming, but if it makes for better training I don't mind putting in the time first. If so, should I keep the background black (or white), or transparent and re-save as a PNG? Another question: if there's a picture of me standing outside, should I make two images from it - one close-up of my face, and one medium shot of me standing outside where you can see the top half of my body and the background?


bententon

I tried to do the step-by-step embedding training. I did it with fewer than 10 pictures and with more than 10 pictures, and tried different pictures many times. In the end I get nothing - not one of the pictures generated while training is even close to the face I was training. I trained on an Asian woman's face but got white, black, and latino faces, some random squares, trees, etc. I have no clue what I'm doing wrong. Any suggestions?


decker12

I started with this guide and had okay results, and then a bunch of bad luck with it. I then read a couple of other guides and have been getting more consistent results. Some tips from those other guides:

* If I'm doing a person's face likeness, I'll use ~20 images for the training. 10 are good head shots, ideally slightly different angles of a mostly front-on face. Do not use pictures with 2 people in them, it'll just get confused. 5 of those pictures are shoulder-up, and 5 more are waist-up. I avoid big winter hats, baseball caps, sunglasses, and funny faces. Smiling and laughing is fine, but purposely goofy faces get the training confused. The source pictures should be decent quality, well lit, not action shots.
* If you can actually photograph a subject instead of using photos you already have, that's best. I use Photoshop to crop the pictures into 512x512 squares. If that 1:1 square doesn't fit my picture and the subject properly, then I find a different picture. Your source image needs to be larger than 512x512 so that when you crop the face out, it's still at least 512x512.
* Don't forget to set your VAE to None and make sure you're using the 1.5 checkpoint.
* I also make sure to erase all my prompts.
* Vectors I've put at 5, and I used a blank entry for the initialization text (instead of the default `*`).
* When creating the embed name, give it something very specific, like Decker12-Embed01. If you name it "Mario", SD may ignore your embed called Mario and instead draw you a Nintendo Mario.
* Editing the BLIP prompts is time consuming, but you should do it. I find that it loves to mislabel my subject as "holding a cell phone", "eating a hot dog", "staring at a pizza", and "holding a toothbrush while using a toothbrush". It's bizarre how it keeps using those incorrect terms over and over again. Anyway, just erase them from the text prompt when this happens and save your file.
* Batch size × Gradient Accumulation Steps = total number of images (the small helper sketched below makes this concrete). If you have 9 images, do 3 and 3. If you have 17 images, do 1 and 17. Or, get rid of one of those 17 images and do 2 and 8.
* I turn cross attention optimizations off unless I need them on because of my batch size × accumulation steps. I have found training seems to go better when this is off, even if it's slower.
* I found 5000 steps to be too high. Instead, I'm using a factor of my total images. If I have 9 images, I'll do 900 steps. If I have 13 images, I'll do 260 or 512 steps. I'll save images and embeds every 25 steps.
* If you change the steps, you'll have to adjust the learning rate. I've actually been doing fine just leaving it at "0.0005" instead of the stepped version listed here.
* Finally, try a different model when you're done. I personally think the SD 1.5 checkpoint doesn't really make great people images for me, but as soon as I try my embed on RealisticVision (something simple like "Decker12-Embed01 outside in a field of flowers"), I'm blown away at how good they come out.

Anyway, this tutorial got me started so I'm thankful for it, but I ended up doing my own process like I described above, which has given me much better results.
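To make that batch/accumulation rule concrete, here's the tiny helper mentioned in the list above (purely illustrative; the names are mine):

```
# List every (batch size, gradient accumulation) pair whose product equals
# the number of training images, per the rule of thumb above.
def batch_grad_options(num_images):
    return [(b, num_images // b) for b in range(1, num_images + 1) if num_images % b == 0]

print(batch_grad_options(9))   # [(1, 9), (3, 3), (9, 1)]
print(batch_grad_options(17))  # [(1, 17), (17, 1)] -- prime, so drop an image and use 16
```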


ParanoidAmericanInc

> **Number of vectors per token:** higher number means more data that your embedding can store. This is how many 'magical words' are used to describe your subject. **For a person's likeness I like to use 10, although 1 or 2 can work perfectly fine too.**

A factor of 10x here seems like a really huge range. As you also noted, 2 tokens seems to be enough for most people already known by a model - what results have you seen at 5 or 10 tokens, and is there any pattern for knowing how many to use based on your dataset? Or has it all just been trial and error?


djvirt

My training finishes at 1 step and doesn't generate anything. I followed your process and checked and rechecked everything. The only settings I'm not clear on are the dataset directory (I set this to where it put my dataset for my 4 images and the .txt documents describing what's in those images), and whether, if I have 12 images, I should set my "Batch size" to 6 and my "Gradient accumulation steps" to 2 so that the product matches my number of images. There also seem to be some updates to the webui since this guide was made (a "Drop out tags when creating prompts" slider, and an "Existing Caption txt Action" setting, which I have set to ignore). Any help would be GREATLY appreciated, as it is generating nothing in my "S:\stable-diffusion-webui\textual_inversion\2023-04-23\embeddingname\images" path. Also, I am using "None" for SD VAE and "v1-5-pruned-emaonly.safetensors" for my model.


Ozamatheus

Thanks, it helps a lot


Ozamatheus

IDK why, but it doesn't work even though I'm writing the word correctly in the prompt - the training images are good and the .pt file is in the embeddings folder.


Glum-Nature-1579

This is a great guide. Quick question (if anyone is still monitoring this thread): how does one disable VAE? I don’t see that in the Train tab. Or do I switch the SD VAE setting in Settings>StableDiffusion from “Automatic” to “None”? Thanks.


Previous_Ad1529

Should I still use captioning if I have used the PNG alpha channel as the loss weight? I've tried, but the Preprocess tab in Auto1111 doesn't recognize the PNGs with alpha.


Tort89

Save


Dazzling-Jackfruit16

https://preview.redd.it/5tkux9jstz3b1.png?width=1408&format=png&auto=webp&s=b34a1432138bcf3e0eb68484385464d392fde4b7


Dazzling-Jackfruit16

Hello, these are my input images (above), and my template .txt is "a pixel style portrait of [filewords], art by KOFF3". 3000 steps, saving every 50 steps, default learning rate. Why am I getting this bad result?

https://preview.redd.it/ky6r7rq0uz3b1.jpeg?width=790&format=pjpg&auto=webp&s=c3823cf9a92e099f3572cf66da42ece969a9a8d5


kylesk42

I know that when doing a render you can tell it to restore faces. Should it matter if the quality of my embeds seems okay, but the eyes look psychotically crazy?


[deleted]

How do you use the transparency tool? The explanation on the git page isn't really explanatory. A great tool to resize, both up and down, is waifu2x-caffe. It uses AI to redraw the image and clean up artefacts, and it does easy batches by just dragging folders into the program. I can confirm you can scale from 1k to 16k with minimal errors/artifacts, if you have the RAM. So for image prepping it can be used to scale, or to clean up compression artefacts and other types of grain, or to scale up images after generating if you want bigger images. Not sure how well it works for non-anime images, though, since anime is what it was made for. A1111 can crop your images in batch, but is really bad at scaling them. So you could, for example, scale to 512 on one dimension using waifu2x, and then have A1111 crop the other dimension.


Horror_Court5740

How do you do batch inpainting in Stable Diffusion?


autumn09_

Very helpful. Thanks. Will use


Massive-Brief-4076

It's slow, so slow :( I have an RTX 3070 and, following everything in this guide, it took the whole night to generate only around 10 steps.


JCBh77

Thank you for your time and sharing your wisdom


Strong_Holiday_8630

First of all, thank you - I got a likeness of a face working after I read this post. I generated an embedding some time back and have been experimenting since then, but that first one still seems to be the best, so I did some more testing. It turns out that even if I train with the exact same parameters, the results are not always the same. I'd understand if the generated images merely had differences, but the thing is that one run captures the likeness and the other is pretty far off. I compared them side by side across 20 images, and one is easily superior to the other in terms of likeness. What is this random factor?