Performance Testing: A Somewhat Comprehensive Guide #246
-
I did a small test of xformers in #230
-
I do think something's off: why does 3 consume less memory than 4? It needs to train an extra text encoder.
-
Great data! Thank you for putting this together! I can't wait for the prior preservation and caption usage tests. Why did you choose a lower learning rate of 2e-6 instead of the default 5e-6? Is the general consensus that a lower learning rate is better for training? What is training EMA, and what does it do? I thought it took a lot more VRAM, but apparently not necessarily (two of the runs with the lowest max VRAM usage on your chart trained EMA)? Does training from the 7GB pruned ckpt file take more memory than the 4GB emaonly version? I understand the 7GB is supposed to be better for finetuning/training? (I haven't tried the 7GB ckpt on my 3060 12GB.)
-
So far, on my 3060 12GB, I can train with EMA on, but I can't generate sample previews, even though at 5 steps it says it has only reserved 8GB; I get a CUDA OOM. It must take a lot of VRAM to generate those sample images! If I turn training EMA off, then I get samples, even though the reserved memory jumps to between 9 and 9.6GB. So training EMA must affect the sample generation.
-
What I noticed is that today the program offered constant training when pressing the Person button, while the day before it offered Scale Learning Rate for the same dataset. Yesterday it created the classifiers folder under the model name; today it created it under working as classifiers_concept_0. For my 28 images it recommended 1400 steps with constant training today, which is not enough, because the golden multiplier is 100, so 2800 should be good. What also bothers me a bit is that the loss used to get smaller as it learned, but now it hardly changes, or rather increases. 🧐 The end result looks as if it has learned nothing. I am sad, I am getting lost. So I don't know what it is basing the recommended training parameters on if there was no git patch that changed this... 🙄🤔
-
I've been trying to run the sanchez3 settings, but I get the following error: CUDNN_STATUS_INTERNAL_ERROR, followed by a bunch of other crash output. I have an RTX 3060 with 12GB of VRAM. I can check every other setting, but EMA crashes every time.
-
I will add my observations on training Dreambooth. I train on a 3060 with the following parameters: I don't use save points, since the learning rate and number of steps I list below give me the best result.

Let's start with using a class prompt. In this case the rule "less is better" is more appropriate. But there are some problems: it all depends on the class prompt you train on. If the model does not handle editability of that class well, your concept will have the same problems.

Now without a class prompt. Here, as my experience has shown, "more is better" is more suitable. The only downside I would note is that it changes the entire model.

Now for more detail about the coefficients. The learning rate (1e-6) determines how accurately small details are learned and also affects editability. The strength of training (the number of steps per frame) determines how much is learned from each image; set it above the balance value and you get overtraining (distortion, overexposure), below it and you get undertraining (artifacts). There is a way to tell how close you are to overtraining: build an X/Y plot over the CFG scale parameter from 1 to 30 and see at what value distortion and overexposure appear. The lower that value, the closer you are to overtraining. Depth of learning (learning rate * number of frames * steps per frame) determines how much your style/object permeates the model; the higher it is, the better the editability and the better the style transfers to other objects in the model.

Training comparison without class prompt (example runs: 50 frames, 50 frames, 50 frames, and 100 frames; the comparison images are not reproduced here).

This is all I managed to find out about training quality. If anyone needs it, I can write up how to prepare images for training a style/object.
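For anyone who wants to plug numbers into these heuristics, here is a minimal Python sketch. The function names and example values are mine and purely illustrative; nothing here is part of the extension.

```python
# Rough calculator for the "depth of learning" heuristic described above.
# All names and example numbers are illustrative, not part of any tool.

def depth_of_learning(lr: float, num_frames: int, steps_per_frame: int) -> float:
    """Depth = learning rate * number of frames * steps per frame."""
    return lr * num_frames * steps_per_frame

def cfg_sweep(start: int = 1, stop: int = 30, step: int = 1) -> list[int]:
    """CFG scale values for an X/Y plot; the lowest value where distortion or
    overexposure appears hints at how close the model is to overtraining."""
    return list(range(start, stop + 1, step))

# Example: 50 frames at 100 steps/frame with lr 1e-6
print(depth_of_learning(1e-6, 50, 100))  # 0.005
print(cfg_sweep()[:5])                   # [1, 2, 3, 4, 5]
```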
-
I had an idea of taking the preview, and after 100 iterations slowly having it rise until it drops back again faster than it rose, like a bubble that pops and collapses. There are many patterns that could have an interesting effect.
-
I decided to experiment more deeply with using a class prompt and without one. So, the experiment: I train three models on the same input data; the only difference is the class prompt. The first has none, the second has a general (abstract) one, and the third has a more specific one. In all models the instance prompt is the same: myperson. The models are numbered myperson1, myperson2, myperson3. Number of frames: 18, 100 steps per frame.

Then I built plots over CFG scale and model name. The first plot used a single prompt: myperson. Then I decided to run another experiment, increasing the learning rate for all models to 5e-6, which increases the training depth by a factor of 5, and ran the same tests. I think I have given enough information regarding whether or not to use a class prompt.

Now about the coefficients, and what to do if you don't have enough frames. It's simple: the best option is to increase the number of frames. If that is not possible, then increase the learning rate or increase the number of steps per frame. Increasing the learning rate keeps the training time the same but gives less detail; increasing the steps increases the training time but keeps the level of detail (see the sketch below).

Next I'm going to stop experimenting with the class prompt and start experimenting with training concepts sequentially in one model.
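To make that trade-off concrete, here is a small sketch of the two compensation options under the depth heuristic described above. The numbers and function name are illustrative only.

```python
# Two ways to reach the same training "depth" when you cannot add more frames.
# Purely illustrative, following the lr * frames * steps-per-frame heuristic above.

def steps_for_depth(target_depth: float, lr: float, num_frames: int) -> int:
    """Steps per frame needed to reach target_depth at a given lr and frame count."""
    return round(target_depth / (lr * num_frames))

target = 1e-6 * 100 * 100   # depth of a 100-frame run at lr 1e-6, 100 steps/frame
frames = 18                 # what you actually have

# Option A: keep lr, add steps per frame (longer run, detail preserved)
print(steps_for_depth(target, 1e-6, frames))   # ~556 steps per frame

# Option B: raise lr 5x, keep the run time manageable (same time, less detail)
print(steps_for_depth(target, 5e-6, frames))   # ~111 steps per frame
```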
-
So I've been training a model for the past week or so, and finally got it to output the image I want:
-
Question: have you tried changing the gradient accumulation steps to be larger or, optimally, to match the size of your dataset? From what I've been reading, this setting is designed to give you higher-quality results by letting you somewhat mimic cards with enough VRAM to run larger batches on the dataset. https://kozodoi.me/python/deep%20learning/pytorch/tutorial/2021/02/19/gradient-accumulation.html
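For reference, the mechanical pattern behind gradient accumulation looks roughly like this generic PyTorch sketch (not the extension's actual training loop; the model and data are stand-ins):

```python
import torch

# Generic gradient-accumulation pattern (illustrative only): gradients from
# `accum_steps` small batches are summed before one optimizer step, which
# approximates training with a batch `accum_steps` times larger.
accum_steps = 8

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-6)
loss_fn = torch.nn.MSELoss()

optimizer.zero_grad()
for i in range(32):  # stand-in for iterating over a dataloader
    x, y = torch.randn(4, 16), torch.randn(4, 1)
    loss = loss_fn(model(x), y) / accum_steps  # scale so summed grads average out
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```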
-
Watching this stuff get developed in real time is a blast. You guys are doing some seriously cool work! :)
…On Fri, Dec 2, 2022 at 07:45, d8ahazard ***@***.***> wrote:
> So you really puzzled me. I did a few more experiments and also made some calculations, and I can say I found formulas for calculating lr and steps based on the number of images per training run. But this only holds if you do not use additional parameters that affect the quality of the training. I did all the calculations without using a class prompt; if you use one, then in theory you will have to do the calculations for each class prompt separately, and the result will differ from prompt to prompt and from model to model. I can't give all the details right now, since it's not convenient to do this from my phone, so I'll describe it as briefly as I can.
>
> Let's start with overtraining. We will only look at whether overtraining appears anywhere on the cfg:1-30 range, and we are not looking at concept recognition, only overtraining. The formula I got is: steps per frame = 1e-4 / lr. So with 1e-6 that is 100 steps and with 1e-5 it is 10, and this is confirmed by my experiments. Now about recognition; here I got: total steps = 1e-2 / lr. So for recognition approaching 100% of the concept, with 1e-6 you need 10000 steps (100 * 100, 1000 * 10, 10 * 1000), and with 1e-5 you need 1000 steps (100 * 10, 10 * 100). Given that overtraining does not occur at 10 steps per frame, the best option for that lr is a minimum of 100 frames per training run. So with fewer than 100 frames we cannot pick lr and steps per frame without running into overtraining, while increasing frames above 100 should in theory give a positive effect. A class prompt can help you here; I answered in another topic how they work. In effect, they are substitutes for your concept: it mixes your frames and class frames to get more frames per training run. But no one will tell you how to choose the steps for the lr, it's just too much. In theory the number of class frames should matter, so mixing 10 of yours and 100 of the class would give you the 1000 frames needed for 1e-5, but this needs to be checked and I will do it at some point.
>
> Also, if you are training without a class prompt, don't look for overtraining by generating images with only your instance prompt on its own. Always add either a small modifier or an abstract class to whatever you generate, for example: myperson, woman.
>
> I will probably start writing a separate guide in a new topic where I describe all of this in detail with examples, but that will obviously not be soon.
Glad to see we came up with similar math:
[image](https://user-images.githubusercontent.com/1633844/205306056-3036acaf-a91f-465d-b64f-276b18862d0a.png)
I wouldn't call it the end-all, be-all solution for all training options, but for "general" training with text encoder and prior preservation, it seemed like a decent starting point for people.
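For convenience, here is the quoted rule of thumb written out as a tiny Python calculator. The 1e-4 and 1e-2 constants come from the message above; this is just an illustration, not anything built into the trainer.

```python
# Rules of thumb from the quoted message, expressed as a small calculator.
# steps per frame before overtraining ~= 1e-4 / lr
# total steps for ~100% concept recognition ~= 1e-2 / lr
# => minimum frames = total steps / steps per frame = 100, independent of lr.

def max_steps_per_frame(lr: float) -> float:
    return 1e-4 / lr

def total_steps_for_recognition(lr: float) -> float:
    return 1e-2 / lr

for lr in (1e-6, 2e-6, 1e-5):
    spf = max_steps_per_frame(lr)
    total = total_steps_for_recognition(lr)
    print(f"lr={lr:.0e}: <= {spf:.0f} steps/frame, ~{total:.0f} total steps, "
          f"so >= {total / spf:.0f} frames")
# lr=1e-06: <= 100 steps/frame, ~10000 total steps, so >= 100 frames
# lr=1e-05: <= 10 steps/frame, ~1000 total steps, so >= 100 frames
```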
-
Overview
So, I did a thing. I wanted to know the impact of (almost) every tuning parameter currently available on speed, memory usage, and output image quality and editability.
As such, I created 7x different dreambooth models, and adjusted one parameter at a time and recorded the maximum logged memory usage, total run time, and average iterations/s for each run.
TL/DR
The memory usages below are most likely NOT 100% accurate, but more of a snapshot. Consider them a rough rule of thumb rather than the end-all, be-all benchmark of which GPUs can run with which params.
The table below shows results with various training params. The closest I can get to 8GB is with the "sanchez3" params, while sanchez7 offers the best overall speed.
sanchez7 is "the fastest", but sanchez1 shows potential to be ultimately faster, as it requires fewer total training steps.
Training the text encoder massively reduces the number of required training steps.
I need to do more testing with prior preservation and [filewords] enabled, but I was lazy, so those are not used here.
WHOO, Science!!
Dataset
The dataset for this test consisted of 32 images of the character Rick Sanchez from the TV show "Rick and Morty". These images were selected from appx 500 images downloaded randomly from Google Images, with 50 images of the best quality and style selected and cropped, and then another 18 images removed mostly just because I was being picky. Should I have used an even 30 images? Probably. But, hey, whatcha gonna do?
Test Parameters
For this test, prior preservation and advanced caption usage ([Filewords]) were not used. Those will be tested separately.
Each test was conducted using the same dataset and a new set of weights extracted from the v1.5-pruned.ckpt, with checkpoints saved every 500 steps for a total of 2000 training steps. A learning rate of 0.000002 (2e-6) was used, versus the "default" of 5e-6.
Note
The "maximum" memory recorded is not a true representation of the maximum VRAM used, but more of a snapshot taken after all data is loaded to GPU and 5x training steps have been completed. It is possible that more VRAM is taken up by the training process during checkpoint or preview generation, or just somewhere else during the training process.
Results
Below is a table of the checkpoint names, training parameters used for each test, and the resulting "maximum" memory usage and training times for each test.
Performance/Tuning Observations
Enabling FP16 without 8bit Adam actually causes training time to increase.
Turning off several settings that I thought would improve memory usage actually seems to degrade it. The optimal settings for memory usage seem to be the sanchez3 settings:
sanchez3: fp16, 8bit_adam=True, text_encoder=True, ema=True, no_cache_latents=True, grad_check=True
I should probably test xformers, which would also help identify when certain combinations of settings cause crashes.
For maximum speed, sanchez7 is where it's at:
sanchez7: fp16, 8bit_adam=True, text_encoder=False, ema=False, no_cache_latents=False, grad_check=False
Checkpoints created using FP16 are always 2GB. This is because I set "half=True" whenever precision is FP16. Is this right? Should I disable it? Make it optional? IDK.
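Roughly speaking, half=True amounts to casting the weights to fp16 before saving, which halves the on-disk size. Here is a sketch of the idea (not the extension's actual save code; the function and file names are hypothetical):

```python
import torch

def save_checkpoint(state_dict: dict, path: str, half: bool = True) -> None:
    """Illustration of why fp16 checkpoints land around 2GB: casting float32
    tensors to float16 before saving halves their on-disk size."""
    if half:
        state_dict = {
            k: v.half() if torch.is_tensor(v) and v.is_floating_point() else v
            for k, v in state_dict.items()
        }
    torch.save(state_dict, path)

# save_checkpoint(model.state_dict(), "sanchez3_fp16.ckpt")  # hypothetical usage
```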
Image Generation Params
All images were generated at 20 steps using the DPM++ 2M Sampler, 7.5 CFG Scale, and a seed of 420420. Wubba lubba dub dub!
Prompt word is "riksan", with "smiling", "in space", "wearing armor", and "psychedelic background" as modifiers.
Output observations
Disabling the text encoder definitely has a noticeable effect on image results; training it seems to cut the number of required training steps by a factor of 2 or more. Editability is also somewhat reduced, but this may be resolved by using prior preservation or [filewords] with properly captioned input images.
Using EMA when training does seem to offer a small improvement in output image quality, but not such a major change that I'd worry about having to disable it to run on a lower-powered GPU.
Surprisingly, not caching latents actually seems to have an impact on image output, as evidenced by the differences in the sanchez5 and sanchez6 images. I definitely want to see what happens when using the sanchez3 set of params but with "not cache latents" disabled.
Gradient checkpointing doesn't seem to have much effect at all on image output, but further testing is required.
Other Thoughts, Future Testing
I definitely need to test the effect of prior preservation, [filewords], and the forthcoming "shuffle after epoch" feature.
Xformers too, and maybe "accelerate launch".
Because I just kind of randomly chose which setting was changed next (aside from FP16, which requires restarting the app), there's definitely a chance that some other combination of parameters could have greater performance savings.
Re-doing this same test with an actual human subject would probably yield wildly different results.
How does changing the dataset size impact memory usage in various stages of testing?
Generating preview/comparison images for the 500-step checkpoint might be necessary for results trained using the text encoder, especially if using [filewords] for prompts.