Feature Request: Image-to-Image Fine-Tuning Example #10662
Hi, in diffusion models we don't have a dedicated image-to-image fine-tuning example because the model doesn't need one: image-to-image works by adding noise to the original image before inference, so the same text-to-image weights are reused. For what you're asking, there are other methods people use. For style transfer you can use IP-Adapters, and for image restoration you typically use a Tile ControlNet. We have training examples for ControlNet, and you can look at the original IP-Adapter training. There's also the case of inpainting; sadly we don't have a training example for it and don't have the bandwidth to add one, but you just need to add the extra input channels and fine-tune. That said, lately people just train a ControlNet for inpainting, which seems to work a lot better, and most recent models don't even ship an inpainting version.
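To make that concrete, here is a heavily simplified sketch of what "adding noise to the original image" means in code. It is not the actual pipeline implementation (the real `StableDiffusionImg2ImgPipeline` also handles strength-dependent timestep selection, dtypes and guidance), and the model id is just an example:

```python
import torch
from diffusers import AutoencoderKL, DDPMScheduler

# Load only the pieces needed for the sketch (example model id).
model_id = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

image = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed input image in [-1, 1]

# Encode the input image into latents, as the img2img pipeline does.
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor

# "strength" controls how much of the noise schedule is applied:
# lower strength = less noise = output stays closer to the input image.
strength = 0.75
start_timestep = int(scheduler.config.num_train_timesteps * strength)
noise = torch.randn_like(latents)
noisy_latents = scheduler.add_noise(latents, noise, torch.tensor([start_timestep]))

# Denoising then starts from `noisy_latents` instead of pure random noise,
# which is why no separate image-to-image fine-tuning of the model is needed.
```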
Thank you for your detailed response! It was incredibly helpful and clarified a lot for me 👍 Just to confirm: the image-to-image pipeline is only relevant during inference, where instead of starting from completely random noise we begin with the initial image plus some added noise, correct? Regarding ControlNet, is there any specific reason why LoRA fine-tuning wasn't added on top of it?
Correct, you can refer to the image-to-image pipelines if you want to see how it's done.
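For example, a minimal inference sketch with the SD 1.5 img2img pipeline (the model id, input path, prompt and strength below are placeholders for illustration):

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

# Example model id; any text-to-image checkpoint with an img2img pipeline works.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Placeholder path: replace with your own image (local file or URL).
init_image = load_image("path/to/your_input.png").resize((512, 512))

result = pipe(
    prompt="a watercolor painting in the target style",
    image=init_image,
    strength=0.6,        # how far the output is allowed to drift from the input
    guidance_scale=7.5,
).images[0]
result.save("img2img_result.png")
```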
There is one called Control-LoRA, which was kind of popular in the early days, but the full ControlNet proved to be a lot better and I don't see people using it lately. There were some attempts to add it to Diffusers, but the lack of bandwidth and the low usage made the effort go stale; we will gladly welcome and help any effort if someone in the community wants to finish it, though. Still, the latest ControlNets are a lot better, and even the ones for Flux are full models, but maybe there will be a breakthrough in the future where a small model (or even a LoRA) can achieve the same thing, like what's happening with LLMs.
I would be happy to help in any way possible. How should I get started? Are there any specific resources or steps you recommend before starting? I'm working on transferring images between two domains and I need high fidelity and quality, but I have limited resources (2-3 A100 GPUs at most). Do you think there are options other than Control-LoRA? I'm assuming that full ControlNet training will be too demanding. Thank you again for your insights!
For ControlNet I suggest you start with the basic example here. Please don't make the mistake of jumping to the more advanced setups before doing the basic ones; people often try that, fail, and then give up. As with any fine-tuning, you should always start from the very basics and then move on to the more advanced trainings.

Also, you're mistaken in thinking that you need very high VRAM or multiple GPUs to train ControlNets: you can do it with a single 8GB GPU if you want, the trade-off being the time it takes to train. I've seen people train and release really good checkpoints (full fine-tuning) with a single 3090, but it took them months instead of days. With 2-3 A100s it shouldn't take that long, though it won't be done in a day either; you have to find the best batch size and learning rate for the hardware you have.

Don't expect a ControlNet to be good with a dataset of just 10,000 images. You can always repeat images, but quality will suffer. The good ControlNets I've seen that published their training parameters were trained on at least 500,000 images, and the best one, in my opinion, claims to have been trained on over 10,000,000 (though it is a union ControlNet).

It also depends on your choice of base model: SDXL seems to be the best fit for you, but the best model right now is Flux, which is a lot more demanding in hardware. If you want the fastest option, with somewhat lower quality, you can always choose SD 1.5, which you could probably train really fast with 3 A100s.
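For orientation, here is a rough sketch of what the core training step in the basic ControlNet example boils down to. It is heavily simplified (dummy tensors stand in for a real dataloader and text encoder, and the real script in examples/controlnet adds accelerate, validation, checkpointing, etc.):

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, ControlNetModel, DDPMScheduler, UNet2DConditionModel

# Example base model; only the ControlNet copy receives gradients.
model_id = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
controlnet = ControlNetModel.from_unet(unet)
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

vae.requires_grad_(False)
unet.requires_grad_(False)
optimizer = torch.optim.AdamW(controlnet.parameters(), lr=1e-5)

# Dummy batch: target image, conditioning image (e.g. canny edges), text embeddings.
pixel_values = torch.randn(1, 3, 512, 512)
conditioning = torch.randn(1, 3, 512, 512)
text_embeds = torch.randn(1, 77, 768)   # would come from the CLIP text encoder

# Encode the target image and noise it at a random timestep.
with torch.no_grad():
    latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
noise = torch.randn_like(latents)
timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# The ControlNet produces residuals from the conditioning image,
# which are injected into the frozen UNet.
down_res, mid_res = controlnet(
    noisy_latents, timesteps,
    encoder_hidden_states=text_embeds,
    controlnet_cond=conditioning,
    return_dict=False,
)
noise_pred = unet(
    noisy_latents, timesteps,
    encoder_hidden_states=text_embeds,
    down_block_additional_residuals=down_res,
    mid_block_additional_residual=mid_res,
).sample

# Standard epsilon-prediction objective, exactly as in text-to-image training.
loss = F.mse_loss(noise_pred, noise)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```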
Thanks for the answers! Closing the issue.
Hello, and thank you for maintaining this amazing repository!
While working with the Diffusers library, I noticed there is a folder containing fine-tuning examples for text-to-image models but not for image-to-image fine-tuning.
Since image-to-image models have many use cases (e.g., style transfer, image restoration, or domain-specific adaptation), a fine-tuning example for this task would greatly benefit the community and improve accessibility for users looking to customize such models.
Questions:
I'd be happy to contribute or collaborate on this feature if it's considered valuable.
Thank you in advance for your time and response!