Performance Testing: A Somewhat Comprehensive Guide #246

d8ahazard · 2022-11-17T17:42:09Z

d8ahazard
Nov 17, 2022
Maintainer

Overview

So, I did a thing. I wanted to know the impact of (almost) every tuning parameter currently available on speed, memory usage, and output image quality and editability.

As such, I created 7x different dreambooth models, and adjusted one parameter at a time and recorded the maximum logged memory usage, total run time, and average iterations/s for each run.

TL/DR

Memory usages below are most likely NOT 100% accurate, but more of a snapshot. Consider them like a rough rule of thumb, versus the end-all, be-all benchmark of what GPUs can run with what params.

The table below shows results with various training params. The closest I can get to 8GB is with the "sanchez3" prams, while sanchez 7 offers the most overall speed.

sanchez7 is "the fastest", but sanchez1 shows potential to be ultimately faster, as it requires less total training steps.

Training the text encoder is a massive boost to number of required steps.

I need to do more testing with prior preservation and [filewords] enbled, but I was lazy so those are not used here.

WHOO, Science!!

Dataset

The dataset for this test consisted of 32 images of the character Rick Sanchez from the TV show "Rick and Morty". These images were selected from appx 500 images downloaded randomly from Google Images, with 50 images of the best quality and style selected and cropped, and then another 18 images removed mostly just because I was being picky. Should I have used an even 30 images? Probably. But, hey, whatcha gonna do?

Test Parameters

For this test, prior preservation and advanced caption usage ([Filewords]) were not used. Those will be tested separately.

Each test was conducted using the same dataset and a new set of weights extracted from the v1.5-pruned.ckpt, with checkpoints saved every 500 steps for total of 2000 training steps. A learning rate of 0.000002 or 2e-6 was used, versus the "default" of 5e-6.

Note

The "maximum" memory recorded is not a true representation of the maximum VRAM used, but more of a snapshot taken after all data is loaded to GPU and 5x training steps have been completed. It is possible that more VRAM is taken up by the training process during checkpoint or preview generation, or just somewhere else during the training process.

Results

Below is a table of the checkpoint names, training parameters used for each test, and the resulting "maximum" memory usage and training times for each test.

Checkpoint	prior_preservation	Filewords	precision	8Bit Adam	text_encoder	ema	no_cache_latents	grad_check	IT/S	Loss	Duration	Max VRAM
sanchez1 (base)	no	no	no	no	yes	yes	yes	yes	1.14it/s	0.0712	29:07:00	18.3
sanchez2	no	no	fp16	no	yes	yes	yes	yes	1.02it/s	0.0712	32:31:00	16.7
sanchez3	no	no	fp16	yes	yes	yes	yes	yes	1.21it/s	0.0711	27:33:00	8.2
sanchez4	no	no	fp16	yes	no	yes	yes	yes	1.32it/s	0.0718	25:15:00	9.5
sanchez5	no	no	fp16	yes	no	no	yes	yes	1.68it/s	0.0718	19:50:00	10.8
sanchez6	no	no	fp16	yes	no	no	no	yes	1.86it/s	0.0712	17:54:00	10.5
sanchez7	no	no	fp16	yes	no	no	no	no	2.47it/s	0.0712	13:28:00	13.2

Performance/Tuning Observations

Enabling FP16 without 8bit Adam actually causes training time to increase.
Turning off several settings that I thought would improve memory usage actually seem to degrade it. The optimal settings for memory usage seem to be the sanchez3 settings:

I should probably test xformers, which would also help identify when certain combinations of settings cause crashes.

sanchez3: fp16, 8bit_adam=True, text_encoder=True, ema=True, no_cache_latents=True, grad_check=True

For maximum speed, sanchez7 is where it's at:

sanchez7: fp16, 8bit_adam=True, text_encoder=False, ema=False, no_cache_latents=False, grad_check=False

Checkpoints created using FP16 are always 2GB. This is because I setting "half=True" whenever precision is FP16. Is this right? Should I disable it? Make it optional? IDK.

Image Generation Params

All images were generated at 20 steps using the DPM++ 2M Sampler, 7.5 CFG Scale, and a seed of 420420. Wubba lubba dub dub!

Prompt word is "riksan", with "smiling", "in space", "wearing armor", and "psychedelic background" as modifiers.

Output observations

Disabling the text encoder definitely has a noticeable effect on image results. It would almost seem like the number of required training steps is reduced by a factor of 2 or more. Editability is also somewhat reduced, but this may also be resolved by using prior preservation or [filewords] with properly captioned input images.

Using EMA when training does seem to offer a small amount of improvement in output image quality, but not such a major change that I'd worry about not being able to use it to run on a lower-powered GPU.

Surprisingly, not caching latents actually seems to have an impact on image output, as evidenced by the differences in the sanchez5 and sanchez6 images. I definitely want to see what happens when using the sanchez3 set of params but with "not cache latents" disabled.

Gradient checkpointing doesn't seem to have much effect at all on image output, but further testing is required.

Other Thoughts, Future Testing

I definitely need to test the effect of prior preservation, [filewords], and the forthcoming "shuffle after epoch" feature.

Xformers too, and maybe "accelerate launch".

Because I just kind of randomly chose which setting was changed next (aside from FP16, which requires restarting the app), there's definitely a chance that some other combination of parameters could have greater performance savings.

Re-doing this same test with an actual human subject would probably yield wildly different results.

How does changing the dataset size impact memory usage in various stages of testing?

Generating preview/comparison images for the 500-step checkpoint might be necessary for results trained using the text encoder, especially if using [filewords] for prompts.

sgsdxzy · 2022-11-17T18:23:16Z

sgsdxzy
Nov 17, 2022
Collaborator

I did a small test of xformers in #230
xformers is a must for 10GB cards on Windows.

2 replies

d8ahazard Nov 17, 2022
Maintainer Author

My only hesitation in setting up xformers is that it seems to cause issues with (or without?) certain settings.

Do you know offhand what these are?

sgsdxzy Nov 17, 2022
Collaborator

Don't know much. I only have 12G so my choices in parameters are very limited.
It seems that xformers+no mixed precision would cause some errors, but no mixed precision alone causes oom anyway.

sgsdxzy · 2022-11-17T18:58:11Z

sgsdxzy
Nov 17, 2022
Collaborator

I do think something's off, why does 3 consumes less memory than 4? It needs to train an extra text encoder.

2 replies

sgsdxzy Nov 17, 2022
Collaborator

I think you can paste memory show in nvidia-smi, it is not accurate as it counts memory reserved but not used, however it is a safe upper bound of the maximum usage, or to say, if it never reports more than 10G, then you can safely train on 3080.

d8ahazard Nov 17, 2022
Maintainer Author

Either my measurements are off, or it's because we're enabling adam and fp16 in sanchez3, and the other params are about as optimized as possible, then changing subsequent values add to speed, but detract from VRAM usage.

Jonseed · 2022-11-17T21:08:58Z

Jonseed
Nov 17, 2022

Great data! Thank you for putting this together! I can't wait for the prior preservation and caption usage tests.

Why did you choose a lower learning rate of 2e-6, instead of the default 5e-6? Is the general consensus that a lower learning rate is better for training?

What is training EMA? What does it do? I thought it took a lot more VRAM, but it looks like not necessarily (two of the lowest max VRAM usage on your chart trained EMA)?

Does training from the 7GB pruned ckpt file take more memory than the 4GB emaonly version? I know the 7GB is supposed to be better for finetuning/training? (I haven't tried the 7GB ckpt on my 3060 12GB.)

4 replies

0xItx Nov 17, 2022

Does training from the 7GB pruned ckpt file take more memory than the 4GB emaonly version? I know the 7GB is supposed to be better for finetuning/training? (I haven't tried the 7GB ckpt on my 3060 12GB.)

Does that also apply for Dreambooth? I was under the impression that the difference between emaonly & the "all weights" model is relevant only for native training.

Jonseed Nov 17, 2022

It's a good question. The Readme on runwayml's Hugging Face repo for the ckpt says:

4.27GB, ema-only weight. uses less VRAM - suitable for inference
7.7GB, ema+non-ema weights. uses more VRAM - suitable for fine-tuning

I dunno how Dreambooth fits into that...

d8ahazard Nov 18, 2022
Maintainer Author

Great data! Thank you for putting this together! I can't wait for the prior preservation and caption usage tests.

Why did you choose a lower learning rate of 2e-6, instead of the default 5e-6? Is the general consensus that a lower learning rate is better for training?

What is training EMA? What does it do? I thought it took a lot more VRAM, but it looks like not necessarily (two of the lowest max VRAM usage on your chart trained EMA)?

Does training from the 7GB pruned ckpt file take more memory than the 4GB emaonly version? I know the 7GB is supposed to be better for finetuning/training? (I haven't tried the 7GB ckpt on my 3060 12GB.)

I went with the lower LR because the Huggingface blog suggests a lower value, specifically the 2e-6 (or 1e-6) for people. 2e-6 seems to work fine in that case as well, but I'm sure more testing is required to be sure.

https://huggingface.co/blog/dreambooth

Training EMA - I'm honestly not entirely sure. EMA stands for Estimated Moving Averages, so I think it has to do with how the Unet of the model handles...something, and using EMA can theoretically provide slightly better results - as demonstrated. How it applies to the size of the models or whether one takes up more space than another, I'm not sure.

In that regard, I wouldn't be surprised if we were actually missing some bit in the extraction routine that converts checkpoints to weights for the EMA stuff, but again, I could be wrong. The resulting output models are the same size, so maybe I'm just missing something in converting them back. Something to research.

And, again, while I have no scientific data to back this up, I do seem to get better results training stuff on the 7GB version than I do with a 4GB "extended" model. So, maybe it's possible to use the extra bit from the larger checkpoint with ones that don't have EMA to make any model "suitable" for fine-tuning. OR, training with EMA in this dreambooth has a larger impact on checkpoints that don't have it to begin with?

So many "maybes". Which is why I'm doing my best to generate some somewhat reproducible data. ;)

sgsdxzy Nov 18, 2022
Collaborator

I wrote my findings about EMA here: #85 (comment)

One other thing I'd like to note is that this EMA implementation is not complete. To do some serious training, one should start from the 7G model, which contains both full weight and ema weight. One should init the training weights using full weight, init the ema weight using ema weight from ckpt, train in full weight and update ema weight every step or epoch, save both weights on end, and distribute/use ema weight for inference. This is how real training is done. But most users just start from the 4G model and cannot afford to load two weights simutaneously anyway. In this scenario, ema might not be very helpful (but it at least lessens overfit a little).

Jonseed · 2022-11-18T00:43:46Z

Jonseed
Nov 18, 2022

So far for me, on my 3060 12GB, I can train with EMA on, but I can't generate sample previews, even though at 5 steps it says that it has only reserved 8GB. I get a CUDA OOM. It must take a lot of VRAM for those sample images! If I turn training EMA off, then I get samples, even though the reserved jumps to between 9-9.6GB. So the training EMA must affect the sample generation.

0 replies

mykeehu · 2022-11-18T09:03:21Z

mykeehu
Nov 18, 2022

What I noticed is that today the program offered constant training by pressing the Person button, and the day before it offered Scale Learning Rate for the same data set.

Yesterday it created the classifiers folder under the model name, today it created it under working as classifiers_concept_0

For my 28 images today it recommended 1400 steps with constant training, which is not enough because the golden multiplier is 100, so 2800 is supposed to be good.

And what bothers me a bit is that the loss used to get smaller as he learned, but now it hardly changes, but rather increases.🧐 The end result looks as if he has learned nothing. I am sad, I am getting lost.

So I don't know what it is recommending training parameters based on if there was no git patch that changed this... 🙄🤔

3 replies

d8ahazard Nov 19, 2022
Maintainer Author

What I noticed is that today the program offered constant training by pressing the Person button, and the day before it offered Scale Learning Rate for the same data set.

Yesterday it created the classifiers folder under the model name, today it created it under working as classifiers_concept_0

For my 28 images today it recommended 1400 steps with constant training, which is not enough because the golden multiplier is 100, so 2800 is supposed to be good.

And what bothers me a bit is that the loss used to get smaller as he learned, but now it hardly changes, but rather increases.🧐 The end result looks as if he has learned nothing. I am sad, I am getting lost.

So I don't know what it is recommending training parameters based on if there was no git patch that changed this... 🙄🤔

The wizards are definitely a WIP based on a few other blogs I found, and I'm inclined to agree with you that the number of steps is not enough. What I'm working towards is testing settings for various subjects and parameters and trying to come up with better math other than "images * 100". If we're training the text encoder, that may be drastically different. Same if using prior preservation. And the learning rate almost certainly plays a factor.

So, the wizards are definitely not to be considered definitive. I'll adjust them as I learn more, especially where learning rate and the other settings I mentioned are a factor.

mykeehu Nov 20, 2022

I'm really looking forward to it, because at the moment I've been using the 2e-06 LR, whether I'm using dynamic or constant training. With constant it took more of the images I was learning, while with variable weight training it produced a nicer result, but it didn't always generate what I wanted. Plus, it's interesting that for some keywords, I had to do some tricks in the prompt to keep it from being too strong or just overwhelming. Now it also happened that the number of images *100 was not enough, instead *150 was good for the same LR without classification images. So I needed more rather than less for LR, but I'm still experimenting.
Isn't it possible to go back to an earlier step number here if I want to change the LR timing? Or is it enough to unpack the ckpt?

sgsdxzy Nov 20, 2022
Collaborator

I'm really looking forward to it, because at the moment I've been using the 2e-06 LR, whether I'm using dynamic or constant training. With constant it took more of the images I was learning, while with variable weight training it produced a nicer result, but it didn't always generate what I wanted. Plus, it's interesting that for some keywords, I had to do some tricks in the prompt to keep it from being too strong or just overwhelming. Now it also happened that the number of images *100 was not enough, instead *150 was good for the same LR without classification images. So I needed more rather than less for LR, but I'm still experimenting. Isn't it possible to go back to an earlier step number here if I want to change the LR timing? Or is it enough to unpack the ckpt?

I think unpack the ckpt is enough.

poondoggle · 2022-11-18T22:55:57Z

poondoggle
Nov 18, 2022

I've been trying to run the Sanchez3 settings, but I get the following error: CUDNN_STATUS_INTERNAL_ERROR followed by a bunch of other crash data. I have an RTX 3060 with 12GB of VRAM. I can check every other setting, but EMA crashes every time.

2 replies

Jonseed Nov 18, 2022

yeah, I haven't been able to run with EMA either, and I have the same card. Mine crashes when it tries to generate a preview. I wonder if it could get through training if I turn off previews... I haven't seen the CUDNN error before.

poondoggle Nov 18, 2022

I have previews turned off on my side and it still crashes.

cerega66 · 2022-11-21T05:47:23Z

cerega66
Nov 21, 2022

I will add my observations on training dreambooth.

So I train on 3060 with the following parameters:
--mixed_precision=fp16
--train_batch_size=1
--gradient_accumulation_steps=1
--use_8bit_adam

I don't use savepoints as the learning rate and number of steps I'll list below gives me the best result.

Let's start by using the promt class.

In this case, the rule "less is better" is more appropriate here. But there are some problems. It all depends on the class promt on which you will train. If the model does not know much about the editability of this class, your class will have the same problems.
Also picky about the selection of the exact class.
As already revealed, the best learning rate is 1e-6 at 100 steps per frame (this is written in many guides). This gives a good balance. There is no overtraining or undertraining. If you want faster or more accurate, just recalculate. Example: 2e-6 at 50 steps, 5e-7 at 200 steps.
Also, as I have not tried, in my experience it is not suitable for training for style.

Now without class promt.

Here, as my experience has shown, "more is better" is more suitable.
He trains styles very well.
I found a learning rate that gives a fairly balanced result, which is 1.72e-6 at 100 steps per frame. Recalculating other values is also easy.
Here, in addition to training quality (1.72e-6) and training strength (100 steps per frame), there is such a parameter as training depth (the number of frames itself, to be more precise: learning rate * number of frames * steps per frame). The more frames of the style/object, the more it "penetrates" the model and gets better editability. As I managed to figure out so far, if the number of frames is more than 100, this is enough for a good depth of training. If it is less than 50, then with an increase in the number of prompts, the object / style is less and less similar to itself. The best result for me was achieved in the style training with 650 frames. The style transfer accuracy there reaches 100% and also allows you to display objects in this style that were not on the original frames.

Of the minuses, I would only note that it changes the entire model.

Now for more details about the coefficients.

The learning rate (1e-6) determines the accuracy of learning how small details will look and also affects editability.

The strength of training (the number of steps per frame) determines how much to learn from the image; if you set more than the balance value, it leads to overtraining (distortion, overexposure), if less, to undertraining (kinks). There is a way to determine how close you are to overtraining. Build an X/Y plot for the CFG scale parameter from 1 to 30 and see at what value distortion appears and overexposure. The lower this value, the closer you are to overtraining.

Depth of learning (learning rate * number of frames * steps per frame). Determines how much your style/object permeates the model. The more the better editability and style transfer to other objects in the model.

Training Comparison without class promt examples:

Number of frames: 50
Learning rate 1.72e-6
Number of steps: 5000 (100 per frame)
Result: Good balance, poor editability.

Number of frames: 50
Learning rate 8.6e-7
Number of steps: 10000 (200 per frame)
Result: Good balance, poor editability, better detail.

Number of frames: 50
Learning rate 1.72e-6
Number of steps: 10000 (200 per frame)
Result: Overtraining, good editability.

Frames: 100
Learning rate 1.72e-6
Number of steps: 10000 (100 per frame)
Result: Good balance, good editability.

This is all that I managed to find out for the quality of the training, if anyone needs it, I can write how to prepare images for training the style/object.

102 replies

cerega66 Dec 7, 2022

In theory, as your frame count increases, the multiplier should drop. They wrote it when they suggested using 4-5 frames, but it seems to me that when using 10 you need x100 and at 30 x33 is probably suitable, since when using 66 of your frames you can not use them, so I think we can try to find a dependency.

mykeehu Dec 7, 2022

At 30 pictures, interesting things started to emerge. Either the prompt was stronger, but the skilus was weak (it didn't generate what I wanted), or the style overpowered the prompt. I tried to find a middle ground, but failed. I trained on 600 class images, up to 760-1360 steps, but at 1180 the prompt held, then the style dominated. I did this on 5e-6 LR. I deleted this model and project.
Now I'm approaching it from the other direction: not using class images, I trained on 1e-6 up to 1980 steps (66%). Here it was just starting to take over the style, so I'm going up to 3000.

Update 1: 1e-6 didn't get what it needed even with 3000 steps, model deleted :( I raise LR to 2e-6, start again, without class images.
Update 2: Even with 2e-6 steps, it didn't deliver what it needed for 3000. I go up to 4800 with 300 saves, at 3600 it's not even good...

cerega66 Dec 7, 2022

Unfortunately, I can't help you in this case. Without the training frames and settings trainings, I can’t suggest anything.

cerega66 Dec 7, 2022

If you have a discord, you can write to me in it. It will be easier for me to communicate with you on such an occasion. I am in the group that created d8ahazard, I have the same nickname there.

mykeehu Dec 7, 2022

I'll look you up on discord, although it didn't give up on you or the group.

About the training: it generated good from 4200 steps, but not stable enough and above 4950 the model is getting stronger. I looked at 4950 because I went back to the old formula, so at 5e-6 it was 100 steps per image, then at 2e-6 how much. I multiplied that by 66% and came up with (5/2) x 100 x 0.66 x number of images (30) = 4950

But I'll get the rest on discord, because I can't do it here :)

razord93 · 2022-11-22T04:04:15Z

razord93
Nov 22, 2022

I had an idea of taking the preview.
And generate a text prompt from them, maybe like 5 to 15 of them,
(Using WD 1.4 tagger, it's blazing fast.)
And when the average word variation starts evening out
Have it replaces a single specific prompt with a generalized prompt.
Or it could use a negative word and try to have it denoise out the sameness.

And after 100 Iterations slowly have it rise until it drops back again faster than it rose, like a bubble that pops and collapses.

There are many patterns that could have an interesting effect.
maybe it could work, but it would require
many more steps to ensure that it is not shocked by too much noise.

1 reply

d8ahazard Dec 2, 2022
Maintainer Author

https://github.com/d8ahazard/sd_smartprocess (Or just "Smart Process" from the extensions page)

You can caption your images, swap the class/subject, etc. I find that captioning everything helps a lot with editability, even without using prior preservation.

cerega66 · 2022-11-23T11:27:29Z

cerega66
Nov 23, 2022

I decided here all the same to experiment deeper with the use of the promt class and without.

So what is the experiment. I train three models, the same data is input, the difference is only in the promt class. The first one will not have it, the second one will have it general (abstract), the third one will have a clarifying one.

So, here is an example of my experiment. In all models, the key promt is the same: myperson. Models will have a prefix in the form of a number. Number of frames 18, steps 100 per frame.

myperson1
learning rate: 1.72e-6.

myperson2
learning rate: 1e-6.
class promt: woman.

myperson3
learning rate: 1e-6.
class promt: woman, black hair, blue eyes and etc. (I tried to give the most accurate promt, what is on each training frame).

Then I built graphs by cfg scale and model name.

The first graph included one promt: myperson.
Result: 1 model produced slightly similar images. 2 and 3 did not give similar ones. (I look at the image itself, and not the detail and hair color, for example).
Next, I tried adding weight to the promt: (myperson:1.2).
Result: recognition of 1 model increased, 2 and 3 did not change.
Then I began to add clarifying promts one at a time. The more there were, the better was the recognition of 1 model. In the last request, models 2 and 3 also had some recognition, but worse than model 1.
I thought that the learning rate of 1 model was to blame here and trained another one but with a coefficient of 1e-6. Gave a name: myperson1a.
Repeatedly carried out the same generation.
Result: Yes, the problem was in the speed of learning. But the difference is still in favor of model 1a.
If with a simple query: myperson, there was no difference in recognition between the three models, then with a query with maximum refinement, model 1a still gave the best result.
But there is also one interesting point, model 1a produced almost the same images as model 3. I would say that the similarity was about 90-95%. while with model 2, the similarity was about 50%.

Then I decide to conduct another experiment, but increase the learning rate for all models to 5e-6. This will give a recess of 5 times. Then I ran the same tests.
Result: I was surprised. With a simple request: myperson, the 3rd model clearly won the rest 2. The transfer of style, the image of the details was the best for 3. Then comes the 2nd model. And 1 model takes the last place. As for overfitting for all 3 models, it is observed on cfg scale 12 for all. But models 1 and 3 are less prone to overtraining. Model 2 gives distortion already at 12, and 1 and 3 do not give even at 18.
Then I decided to conduct a test with editability. And then everything changed. Model 1 is in the lead. Next up is Model 2. And the 3rd model sometimes does not give me the image of the character at all.

I think I have given enough information regarding the use or not use of the promt class.
To generalize, using a promt class, as I already wrote, worsens editability. The more accurate it is, the worse it is. But the better the output is when generated using just the keyword. At the moment, training without a class of promts will remain my favorite.

Now about the coefficients.
As you can see, using the same learning rate when not using the promt class as when using it gives a worse learning outcome. Therefore, I recommend increasing the learning rate by 1.72 times. This gives similar results.
Now about why increasing the learning rate gave us deeper learning. I already wrote an approximate formula for understanding the depth of learning: learning rate * number of steps per frame * number of frames. This does not convey a 100% accurate depth calculation formula, but only an understanding of the dependencies. If we take as an ideal option when all our frames have 100% quality information among themselves. Then we get the same effect of each coefficient on the depth. But in reality this is not the case. This can only be said about the learning rate and the number of steps per frame. But their increase greatly affects the learning itself and can lead to overfitting, unlike increasing the number of frames (increasing the learning rate by 5 times will give the same retraining as increasing the number of steps by 5 times, and increasing the number of frames by 5 times, if they are ideal, will not give retraining at all).

And what if you don't have enough frames. Well, everything is simple here. The best option is to increase the number of frames. If this is not possible, then increase the learning rate, or increase the number of steps per frame. Increasing the speed will give you the same learning time, but less detail. Increasing the steps will increase the training time but keep the level of detail.

Next, I stop experimenting with the promt class. Now I will start experimenting with sequential learning of concepts in one model.

2 replies

d8ahazard Dec 2, 2022
Maintainer Author

Now you need to research how "scale learning rate" affects your outputs. I don't mess with it, but I had someone say that "polynomial" lr scheduler + 500 warmup steps and scaling was great, but they didn't specify their starting learning rate.

cerega66 Dec 2, 2022

For any question "what if I do this" I have an interest to check it out. And there are already quite a lot of such questions. And the more answers I get, the more questions appear. Despite the fact that I do not generate images (only for tests) and experiment almost 24/7, the questions only get more. Despite the fact that concept_list appeared a long time ago, I only recently started working on it and have already found quite a few problems when using it. So for example the concepts are not explicitly mixed. I noticed this most when training 9 concepts.
Regarding the scalable lr. I already had experience with him in hypernets and the principle is similar here. But I am against its use. Why? At the first steps, in fact, the base of your concept is determined and how detailed it will be. And if you start at 1e-5 and end at 1e-7 it doesn't mean that your concept will be equal to the same concept but fully trained on 1e-7. This method reduces training time by losing accuracy in some details. But a full workout with the same accuracy will always be better. Many people think that this method is simple and you take a rough concept first and then refine its details and this is almost true but not quite. A rough analogy, imagine that you want to get a circle, but instead of putting a circle right away, you put a square first and then add it to the circle. You will not get a circle, you will get a square with smooth corners.

d8ahazard · 2022-11-24T15:44:54Z

d8ahazard
Nov 24, 2022
Maintainer Author

So I've been training a model for the past week or so, and finally got it to output the image I want:

           .--.
          /} p \             /}
         `~)-) /           /` }
          ( / /          /`}.' }
           / / .-'""-.  / ' }-'}
          / (.'       \/ '.'}_.}
         |            `}   .}._}
         |     .-=-';   } ' }_.}
         \    `.-=-;'  } '.}.-}
          '.   -=-'    ;,}._.}
            `-,_  __.'` '-._}
          jgs   `|||
               .=='=,

0 replies

chakalakasp · 2022-12-02T03:43:31Z

chakalakasp
Dec 2, 2022

Question: have you tried changing the gradient accumulation steps to be larger or, optimally, to match the size of your dataset? From what I've been reading on this, this setting is designed to get you higher quality results by allowing you to somewhat mimic cards with enough RAM to run larger batches on the dataset. https://kozodoi.me/python/deep%20learning/pytorch/tutorial/2021/02/19/gradient-accumulation.html

1 reply

d8ahazard Dec 2, 2022
Maintainer Author

I have not, but it definitely sounds promising.

chakalakasp · 2022-12-02T14:26:51Z

chakalakasp
Dec 2, 2022

Watching this stuff get developed in real time is a blast. You guys are doing some seriously cool work! :)

…

On Fri, Dec 2, 2022 at 07:45, d8ahazard ***@***.***> wrote: > So you really puzzled me. I did a few more experiments and also made some calculations. And we can say that I found formulas for calculating lr and steps based on the number of images per training. But this is only on condition that you do not use additional parameters that will affect the quality of the workout. I did all the calculations without using the prompt class. If you use them, then, in theory, you will have to do calculations for each class of promt separately and the difference will be different from promt to promt and from model to model. I can’t tell you all the details right now, since it’s not convenient to do this from the phone, so I’ll describe it as briefly as I can. > > Let's start with overtraining. We will look at the fact that there would be no overtraining on the entire cfg:1-30 range, and also we are not looking at concept recognition, only overtraining. So I got the formula: steps per frame = 1e-4/lr. So we get that with 1e-6 it is 100 steps and with 1e-5 it is 10 and this is confirmed by my experiments. Now about recognition, I got the following formula: the number of steps = 1e-2 / 1e-6. We get that for recognition, tending to 100% of the concept, with 1e-6 you need 10000 steps (100 * 100, 1000 * 10, 10 * 1000), with 1e-5 you need 1000 steps (100 * 10, 10 * 100) and given that we know that overtraining does not occur at 10 steps per frame, the best option for this lr is a minimum of 100 frames per training. And so we get that with frames less than 100 we are not able to pick up lr and steps per frame without leading to overtraining. But we also see that increasing frames above 100 in theory gives a positive effect. The prompt class can help you, I answered in another topic how they work. In fact, these are substitutes for your concept. It mixes your frames and class frames to get more frames per workout. But no one will tell you how to choose the steps to lr, it's just too much. But in theory the number of class frames should affect, so mixing 10 of yours and 100 of the class will give you 1000 frames that are needed for 1e-5, but this needs to be checked and I will do it somehow. > > Also if you are training without a promt class don't look for overtraining by generating images with only one of your instance promt. Always add either a small change or an abstract class to whatever you want. For example: myperson, woman. > > Probably I will start writing a separate guide in a new topic where I will write all this in detail with examples, but this will obviously not be soon. Glad to see we came up with similar math: [image](https://user-images.githubusercontent.com/1633844/205306056-3036acaf-a91f-465d-b64f-276b18862d0a.png) I wouldn't call it the end-all, be-all solution for all training options, but for "general" training with text encoder and prior preservation, it seemed like a decent starting point for people. — Reply to this email directly, [view it on GitHub](#246 (reply in thread)), or [unsubscribe](https://github.com/notifications/unsubscribe-auth/AMNT5O2SPK55OPQNA3J74YLWLH4P7ANCNFSM6AAAAAASDUJBUE). You are receiving this because you commented.Message ID: ***@***.***>

1 reply

d8ahazard Dec 2, 2022
Maintainer Author

Wait till I finish 2.0 support! :D

Performance Testing: A Somewhat Comprehensive Guide #246

d8ahazard Nov 17, 2022 Maintainer

Overview

TL/DR

Dataset

Test Parameters

Note

Results

Performance/Tuning Observations

Image Generation Params

Output observations

Other Thoughts, Future Testing

Replies: 13 comments · 120 replies

sgsdxzy Nov 17, 2022 Collaborator

d8ahazard Nov 17, 2022 Maintainer Author

sgsdxzy Nov 17, 2022 Collaborator

sgsdxzy Nov 17, 2022 Collaborator

sgsdxzy Nov 17, 2022 Collaborator

d8ahazard Nov 17, 2022 Maintainer Author

d8ahazard Nov 18, 2022 Maintainer Author

sgsdxzy Nov 18, 2022 Collaborator

d8ahazard Nov 19, 2022 Maintainer Author

sgsdxzy Nov 20, 2022 Collaborator

d8ahazard Dec 2, 2022 Maintainer Author

d8ahazard Dec 2, 2022 Maintainer Author

d8ahazard Nov 24, 2022 Maintainer Author

d8ahazard Dec 2, 2022 Maintainer Author

d8ahazard Dec 2, 2022 Maintainer Author

d8ahazard
Nov 17, 2022
Maintainer

Replies: 13 comments 120 replies

sgsdxzy
Nov 17, 2022
Collaborator

d8ahazard Nov 17, 2022
Maintainer Author

sgsdxzy Nov 17, 2022
Collaborator

sgsdxzy
Nov 17, 2022
Collaborator

sgsdxzy Nov 17, 2022
Collaborator

d8ahazard Nov 17, 2022
Maintainer Author

d8ahazard Nov 18, 2022
Maintainer Author

sgsdxzy Nov 18, 2022
Collaborator

d8ahazard Nov 19, 2022
Maintainer Author

sgsdxzy Nov 20, 2022
Collaborator

d8ahazard Dec 2, 2022
Maintainer Author

d8ahazard Dec 2, 2022
Maintainer Author

d8ahazard
Nov 24, 2022
Maintainer Author

d8ahazard Dec 2, 2022
Maintainer Author

d8ahazard Dec 2, 2022
Maintainer Author