From 79cff8f58d356cc9916ae2c70b244f87a0e7cd49 Mon Sep 17 00:00:00 2001 From: stevhliu Date: Thu, 3 Apr 2025 15:24:37 -0700 Subject: [PATCH 1/7] refactor adapter docs --- docs/source/en/_toctree.yml | 18 +- docs/source/en/using-diffusers/dreambooth.md | 35 ++ .../en/using-diffusers/loading_adapters.md | 416 ------------------ docs/source/en/using-diffusers/merge_loras.md | 266 ----------- .../textual_inversion_inference.md | 113 ++--- 5 files changed, 76 insertions(+), 772 deletions(-) create mode 100644 docs/source/en/using-diffusers/dreambooth.md delete mode 100644 docs/source/en/using-diffusers/loading_adapters.md delete mode 100644 docs/source/en/using-diffusers/merge_loras.md diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 64063c3be1d1..99b160fae9aa 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -17,8 +17,6 @@ title: AutoPipeline - local: tutorials/basic_training title: Train a diffusion model - - local: tutorials/using_peft_for_inference - title: Load LoRAs for inference - local: tutorials/fast_diffusion title: Accelerate inference of text-to-image diffusion models - local: tutorials/inference_with_big_models @@ -33,11 +31,19 @@ title: Load schedulers and models - local: using-diffusers/other-formats title: Model files and layouts - - local: using-diffusers/loading_adapters - title: Load adapters - local: using-diffusers/push_to_hub title: Push files to the Hub title: Load pipelines and adapters +- sections: + - local: using-diffusers/using_peft_for_inference + title: LoRA + - local: using-diffusers/ip_adapter + title: IP-Adapter + - local: using-diffusers/dreambooth + title: DreamBooth + - local: using-diffusers/textual_inversion_inference + title: Textual inversion + title: Adapters - sections: - local: using-diffusers/unconditional_image_generation title: Unconditional image generation @@ -97,8 +103,6 @@ title: SDXL Turbo - local: using-diffusers/kandinsky title: Kandinsky - - local: using-diffusers/ip_adapter - title: IP-Adapter - local: using-diffusers/omnigen title: OmniGen - local: using-diffusers/pag @@ -109,8 +113,6 @@ title: T2I-Adapter - local: using-diffusers/inference_with_lcm title: Latent Consistency Model - - local: using-diffusers/textual_inversion_inference - title: Textual inversion - local: using-diffusers/shap-e title: Shap-E - local: using-diffusers/diffedit diff --git a/docs/source/en/using-diffusers/dreambooth.md b/docs/source/en/using-diffusers/dreambooth.md new file mode 100644 index 000000000000..6c37124cb7ff --- /dev/null +++ b/docs/source/en/using-diffusers/dreambooth.md @@ -0,0 +1,35 @@ + + +# DreamBooth + +[DreamBooth](https://huggingface.co/papers/2208.12242) is a method for generating personalized images of a specific instance. It works by fine-tuning the model on 3-5 images of the subject (for example, a cat) that is associated with a unique identifier (`sks cat`). This allows you to use `sks cat` in your prompt to trigger the model to generate images of your cat in different settings, lighting, poses, and styles. + +DreamBooth checkpoints are typically a few GBs in size because it contains the full model weights. + +Load the DreamBooth checkpoint with [`~DiffusionPipeline.from_pretrained`] and include the unique identifier in the prompt to activate its generation. 
+ +```py +import torch +from diffusers import AutoPipelineForText2Image + +pipeline = AutoPipelineForText2Image.from_pretrained( + "sd-dreambooth-library/herge-style", + torch_dtype=torch.float16 +).to("cuda") +prompt = "A cute sks herge_style brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration" +pipeline(prompt).images[0] +``` + +
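Generation is stochastic by default. For reproducible results, you can pass a seeded `generator` to the pipeline, as sketched below (continuing from the pipeline and prompt above).

```py
# seed the generator for reproducible results
generator = torch.Generator(device="cpu").manual_seed(0)
pipeline(prompt, generator=generator).images[0]
```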
+ +
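Because a DreamBooth checkpoint contains the full model weights, it can also be reused for other tasks. The example below is a rough sketch that reuses the components of the pipeline above for image-to-image with [`~AutoPipelineForImage2Image.from_pipe`]; the starting image is borrowed from another guide and is only a placeholder.

```py
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

# reuse the already loaded components instead of reloading the checkpoint
pipeline_i2i = AutoPipelineForImage2Image.from_pipe(pipeline)

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png")
prompt = "A cute sks herge_style brown bear, stunning color scheme, masterpiece, illustration"
pipeline_i2i(prompt, image=init_image, strength=0.6).images[0]
```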
\ No newline at end of file diff --git a/docs/source/en/using-diffusers/loading_adapters.md b/docs/source/en/using-diffusers/loading_adapters.md deleted file mode 100644 index 3400774e6b6a..000000000000 --- a/docs/source/en/using-diffusers/loading_adapters.md +++ /dev/null @@ -1,416 +0,0 @@ - - -# Load adapters - -[[open-in-colab]] - -There are several [training](../training/overview) techniques for personalizing diffusion models to generate images of a specific subject or images in certain styles. Each of these training methods produces a different type of adapter. Some of the adapters generate an entirely new model, while other adapters only modify a smaller set of embeddings or weights. This means the loading process for each adapter is also different. - -This guide will show you how to load DreamBooth, textual inversion, and LoRA weights. - - - -Feel free to browse the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer), [LoRA the Explorer](https://huggingface.co/spaces/multimodalart/LoraTheExplorer), and the [Diffusers Models Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) for checkpoints and embeddings to use. - - - -## DreamBooth - -[DreamBooth](https://dreambooth.github.io/) finetunes an *entire diffusion model* on just several images of a subject to generate images of that subject in new styles and settings. This method works by using a special word in the prompt that the model learns to associate with the subject image. Of all the training methods, DreamBooth produces the largest file size (usually a few GBs) because it is a full checkpoint model. - -Let's load the [herge_style](https://huggingface.co/sd-dreambooth-library/herge-style) checkpoint, which is trained on just 10 images drawn by Hergé, to generate images in that style. For it to work, you need to include the special word `herge_style` in your prompt to trigger the checkpoint: - -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained("sd-dreambooth-library/herge-style", torch_dtype=torch.float16).to("cuda") -prompt = "A cute herge_style brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration" -image = pipeline(prompt).images[0] -image -``` - -
- -
- -## Textual inversion - -[Textual inversion](https://textual-inversion.github.io/) is very similar to DreamBooth and it can also personalize a diffusion model to generate certain concepts (styles, objects) from just a few images. This method works by training and finding new embeddings that represent the images you provide with a special word in the prompt. As a result, the diffusion model weights stay the same and the training process produces a relatively tiny (a few KBs) file. - -Because textual inversion creates embeddings, it cannot be used on its own like DreamBooth and requires another model. - -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda") -``` - -Now you can load the textual inversion embeddings with the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] method and generate some images. Let's load the [sd-concepts-library/gta5-artwork](https://huggingface.co/sd-concepts-library/gta5-artwork) embeddings and you'll need to include the special word `` in your prompt to trigger it: - -```py -pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork") -prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, style" -image = pipeline(prompt).images[0] -image -``` - -
- -
- -Textual inversion can also be trained on undesirable things to create *negative embeddings* to discourage a model from generating images with those undesirable things like blurry images or extra fingers on a hand. This can be an easy way to quickly improve your prompt. You'll also load the embeddings with [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`], but this time, you'll need two more parameters: - -- `weight_name`: specifies the weight file to load if the file was saved in the 🤗 Diffusers format with a specific name or if the file is stored in the A1111 format -- `token`: specifies the special word to use in the prompt to trigger the embeddings - -Let's load the [sayakpaul/EasyNegative-test](https://huggingface.co/sayakpaul/EasyNegative-test) embeddings: - -```py -pipeline.load_textual_inversion( - "sayakpaul/EasyNegative-test", weight_name="EasyNegative.safetensors", token="EasyNegative" -) -``` - -Now you can use the `token` to generate an image with the negative embeddings: - -```py -prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, EasyNegative" -negative_prompt = "EasyNegative" - -image = pipeline(prompt, negative_prompt=negative_prompt, num_inference_steps=50).images[0] -image -``` - -
- -
- -## LoRA - -[Low-Rank Adaptation (LoRA)](https://huggingface.co/papers/2106.09685) is a popular training technique because it is fast and generates smaller file sizes (a couple hundred MBs). Like the other methods in this guide, LoRA can train a model to learn new styles from just a few images. It works by inserting new weights into the diffusion model and then only the new weights are trained instead of the entire model. This makes LoRAs faster to train and easier to store. - - - -LoRA is a very general training technique that can be used with other training methods. For example, it is common to train a model with DreamBooth and LoRA. It is also increasingly common to load and merge multiple LoRAs to create new and unique images. You can learn more about it in the in-depth [Merge LoRAs](merge_loras) guide since merging is outside the scope of this loading guide. - - - -LoRAs also need to be used with another model: - -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -``` - -Then use the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method to load the [ostris/super-cereal-sdxl-lora](https://huggingface.co/ostris/super-cereal-sdxl-lora) weights and specify the weights filename from the repository: - -```py -pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora", weight_name="cereal_box_sdxl_v1.safetensors") -prompt = "bears, pizza bites" -image = pipeline(prompt).images[0] -image -``` - -
- -
- -The [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method loads LoRA weights into both the UNet and text encoder. It is the preferred way for loading LoRAs because it can handle cases where: - -- the LoRA weights don't have separate identifiers for the UNet and text encoder -- the LoRA weights have separate identifiers for the UNet and text encoder - -To directly load (and save) a LoRA adapter at the *model-level*, use [`~loaders.PeftAdapterMixin.load_lora_adapter`], which builds and prepares the necessary model configuration for the adapter. Like [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`], [`~loaders.PeftAdapterMixin.load_lora_adapter`] can load LoRAs for both the UNet and text encoder. For example, if you're loading a LoRA for the UNet, [`~loaders.PeftAdapterMixin.load_lora_adapter`] ignores the keys for the text encoder. - -Use the `weight_name` parameter to specify the specific weight file and the `prefix` parameter to filter for the appropriate state dicts (`"unet"` in this case) to load. - -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -pipeline.unet.load_lora_adapter("jbilcke-hf/sdxl-cinematic-1", weight_name="pytorch_lora_weights.safetensors", prefix="unet") - -# use cnmt in the prompt to trigger the LoRA -prompt = "A cute cnmt eating a slice of pizza, stunning color scheme, masterpiece, illustration" -image = pipeline(prompt).images[0] -image -``` - -
- -
- -Save an adapter with [`~loaders.PeftAdapterMixin.save_lora_adapter`]. - -To unload the LoRA weights, use the [`~loaders.StableDiffusionLoraLoaderMixin.unload_lora_weights`] method to discard the LoRA weights and restore the model to its original weights: - -```py -pipeline.unload_lora_weights() -``` - -### Adjust LoRA weight scale - -For both [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] and [`~loaders.UNet2DConditionLoadersMixin.load_attn_procs`], you can pass the `cross_attention_kwargs={"scale": 0.5}` parameter to adjust how much of the LoRA weights to use. A value of `0` is the same as only using the base model weights, and a value of `1` is equivalent to using the fully finetuned LoRA. - -For more granular control on the amount of LoRA weights used per layer, you can use [`~loaders.StableDiffusionLoraLoaderMixin.set_adapters`] and pass a dictionary specifying by how much to scale the weights in each layer by. -```python -pipe = ... # create pipeline -pipe.load_lora_weights(..., adapter_name="my_adapter") -scales = { - "text_encoder": 0.5, - "text_encoder_2": 0.5, # only usable if pipe has a 2nd text encoder - "unet": { - "down": 0.9, # all transformers in the down-part will use scale 0.9 - # "mid" # in this example "mid" is not given, therefore all transformers in the mid part will use the default scale 1.0 - "up": { - "block_0": 0.6, # all 3 transformers in the 0th block in the up-part will use scale 0.6 - "block_1": [0.4, 0.8, 1.0], # the 3 transformers in the 1st block in the up-part will use scales 0.4, 0.8 and 1.0 respectively - } - } -} -pipe.set_adapters("my_adapter", scales) -``` - -This also works with multiple adapters - see [this guide](https://huggingface.co/docs/diffusers/tutorials/using_peft_for_inference#customize-adapters-strength) for how to do it. - - - -Currently, [`~loaders.StableDiffusionLoraLoaderMixin.set_adapters`] only supports scaling attention weights. If a LoRA has other parts (e.g., resnets or down-/upsamplers), they will keep a scale of 1.0. - - - -### Hotswapping LoRA adapters - -A common use case when serving multiple adapters is to load one adapter first, generate images, load another adapter, generate more images, load another adapter, etc. This workflow normally requires calling [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`], [`~loaders.StableDiffusionLoraLoaderMixin.set_adapters`], and possibly [`~loaders.peft.PeftAdapterMixin.delete_adapters`] to save memory. Moreover, if the model is compiled using `torch.compile`, performing these steps requires recompilation, which takes time. - -To better support this common workflow, you can "hotswap" a LoRA adapter, to avoid accumulating memory and in some cases, recompilation. It requires an adapter to already be loaded, and the new adapter weights are swapped in-place for the existing adapter. - -Pass `hotswap=True` when loading a LoRA adapter to enable this feature. It is important to indicate the name of the existing adapter, (`default_0` is the default adapter name), to be swapped. If you loaded the first adapter with a different name, use that name instead. - -```python -pipe = ... -# load adapter 1 as normal -pipeline.load_lora_weights(file_name_adapter_1) -# generate some images with adapter 1 -... -# now hot swap the 2nd adapter -pipeline.load_lora_weights(file_name_adapter_2, hotswap=True, adapter_name="default_0") -# generate images with adapter 2 -``` - - - - -Hotswapping is not currently supported for LoRA adapters that target the text encoder. 
- - - -For compiled models, it is often (though not always if the second adapter targets identical LoRA ranks and scales) necessary to call [`~loaders.lora_base.LoraBaseMixin.enable_lora_hotswap`] to avoid recompilation. Use [`~loaders.lora_base.LoraBaseMixin.enable_lora_hotswap`] _before_ loading the first adapter, and `torch.compile` should be called _after_ loading the first adapter. - -```python -pipe = ... -# call this extra method -pipe.enable_lora_hotswap(target_rank=max_rank) -# now load adapter 1 -pipe.load_lora_weights(file_name_adapter_1) -# now compile the unet of the pipeline -pipe.unet = torch.compile(pipeline.unet, ...) -# generate some images with adapter 1 -... -# now hot swap adapter 2 -pipeline.load_lora_weights(file_name_adapter_2, hotswap=True, adapter_name="default_0") -# generate images with adapter 2 -``` - -The `target_rank=max_rank` argument is important for setting the maximum rank among all LoRA adapters that will be loaded. If you have one adapter with rank 8 and another with rank 16, pass `target_rank=16`. You should use a higher value if in doubt. By default, this value is 128. - -However, there can be situations where recompilation is unavoidable. For example, if the hotswapped adapter targets more layers than the initial adapter, then recompilation is triggered. Try to load the adapter that targets the most layers first. Refer to the PEFT docs on [hotswapping](https://huggingface.co/docs/peft/main/en/package_reference/hotswap#peft.utils.hotswap.hotswap_adapter) for more details about the limitations of this feature. - - - -Move your code inside the `with torch._dynamo.config.patch(error_on_recompile=True)` context manager to detect if a model was recompiled. If you detect recompilation despite following all the steps above, please open an issue with [Diffusers](https://github.com/huggingface/diffusers/issues) with a reproducible example. - - - -### Kohya and TheLastBen - -Other popular LoRA trainers from the community include those by [Kohya](https://github.com/kohya-ss/sd-scripts/) and [TheLastBen](https://github.com/TheLastBen/fast-stable-diffusion). These trainers create different LoRA checkpoints than those trained by 🤗 Diffusers, but they can still be loaded in the same way. 
- - - - -To load a Kohya LoRA, let's download the [Blueprintify SD XL 1.0](https://civitai.com/models/150986/blueprintify-sd-xl-10) checkpoint from [Civitai](https://civitai.com/) as an example: - -```sh -!wget https://civitai.com/api/download/models/168776 -O blueprintify-sd-xl-10.safetensors -``` - -Load the LoRA checkpoint with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method, and specify the filename in the `weight_name` parameter: - -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -pipeline.load_lora_weights("path/to/weights", weight_name="blueprintify-sd-xl-10.safetensors") -``` - -Generate an image: - -```py -# use bl3uprint in the prompt to trigger the LoRA -prompt = "bl3uprint, a highly detailed blueprint of the eiffel tower, explaining how to build all parts, many txt, blueprint grid backdrop" -image = pipeline(prompt).images[0] -image -``` - - - -Some limitations of using Kohya LoRAs with 🤗 Diffusers include: - -- Images may not look like those generated by UIs - like ComfyUI - for multiple reasons, which are explained [here](https://github.com/huggingface/diffusers/pull/4287/#issuecomment-1655110736). -- [LyCORIS checkpoints](https://github.com/KohakuBlueleaf/LyCORIS) aren't fully supported. The [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method loads LyCORIS checkpoints with LoRA and LoCon modules, but Hada and LoKR are not supported. - - - - - - -Loading a checkpoint from TheLastBen is very similar. For example, to load the [TheLastBen/William_Eggleston_Style_SDXL](https://huggingface.co/TheLastBen/William_Eggleston_Style_SDXL) checkpoint: - -```py -from diffusers import AutoPipelineForText2Image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -pipeline.load_lora_weights("TheLastBen/William_Eggleston_Style_SDXL", weight_name="wegg.safetensors") - -# use by william eggleston in the prompt to trigger the LoRA -prompt = "a house by william eggleston, sunrays, beautiful, sunlight, sunrays, beautiful" -image = pipeline(prompt=prompt).images[0] -image -``` - - - - -## IP-Adapter - -[IP-Adapter](https://ip-adapter.github.io/) is a lightweight adapter that enables image prompting for any diffusion model. This adapter works by decoupling the cross-attention layers of the image and text features. All the other model components are frozen and only the embedded image features in the UNet are trained. As a result, IP-Adapter files are typically only ~100MBs. - -You can learn more about how to use IP-Adapter for different tasks and specific use cases in the [IP-Adapter](../using-diffusers/ip_adapter) guide. - -> [!TIP] -> Diffusers currently only supports IP-Adapter for some of the most popular pipelines. Feel free to open a feature request if you have a cool use case and want to integrate IP-Adapter with an unsupported pipeline! -> Official IP-Adapter checkpoints are available from [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter). - -To start, load a Stable Diffusion checkpoint. 
- -```py -from diffusers import AutoPipelineForText2Image -import torch -from diffusers.utils import load_image - -pipeline = AutoPipelineForText2Image.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda") -``` - -Then load the IP-Adapter weights and add it to the pipeline with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. - -```py -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") -``` - -Once loaded, you can use the pipeline with an image and text prompt to guide the image generation process. - -```py -image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/load_neg_embed.png") -generator = torch.Generator(device="cpu").manual_seed(33) -images = pipeline( -    prompt='best quality, high quality, wearing sunglasses', -    ip_adapter_image=image, -    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", -    num_inference_steps=50, -    generator=generator, -).images[0] -images -``` - -
-    -
- -### IP-Adapter Plus - -IP-Adapter relies on an image encoder to generate image features. If the IP-Adapter repository contains an `image_encoder` subfolder, the image encoder is automatically loaded and registered to the pipeline. Otherwise, you'll need to explicitly load the image encoder with a [`~transformers.CLIPVisionModelWithProjection`] model and pass it to the pipeline. - -This is the case for *IP-Adapter Plus* checkpoints which use the ViT-H image encoder. - -```py -from transformers import CLIPVisionModelWithProjection - -image_encoder = CLIPVisionModelWithProjection.from_pretrained( - "h94/IP-Adapter", - subfolder="models/image_encoder", - torch_dtype=torch.float16 -) - -pipeline = AutoPipelineForText2Image.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - image_encoder=image_encoder, - torch_dtype=torch.float16 -).to("cuda") - -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors") -``` - -### IP-Adapter Face ID models - -The IP-Adapter FaceID models are experimental IP Adapters that use image embeddings generated by `insightface` instead of CLIP image embeddings. Some of these models also use LoRA to improve ID consistency. -You need to install `insightface` and all its requirements to use these models. - - -As InsightFace pretrained models are available for non-commercial research purposes, IP-Adapter-FaceID models are released exclusively for research purposes and are not intended for commercial use. - - -```py -pipeline = AutoPipelineForText2Image.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - torch_dtype=torch.float16 -).to("cuda") - -pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name="ip-adapter-faceid_sdxl.bin", image_encoder_folder=None) -``` - -If you want to use one of the two IP-Adapter FaceID Plus models, you must also load the CLIP image encoder, as this models use both `insightface` and CLIP image embeddings to achieve better photorealism. - -```py -from transformers import CLIPVisionModelWithProjection - -image_encoder = CLIPVisionModelWithProjection.from_pretrained( - "laion/CLIP-ViT-H-14-laion2B-s32B-b79K", - torch_dtype=torch.float16, -) - -pipeline = AutoPipelineForText2Image.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", - image_encoder=image_encoder, - torch_dtype=torch.float16 -).to("cuda") - -pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name="ip-adapter-faceid-plus_sd15.bin") -``` diff --git a/docs/source/en/using-diffusers/merge_loras.md b/docs/source/en/using-diffusers/merge_loras.md deleted file mode 100644 index e3ade4b01cf0..000000000000 --- a/docs/source/en/using-diffusers/merge_loras.md +++ /dev/null @@ -1,266 +0,0 @@ - - -# Merge LoRAs - -It can be fun and creative to use multiple [LoRAs]((https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora)) together to generate something entirely new and unique. This works by merging multiple LoRA weights together to produce images that are a blend of different styles. Diffusers provides a few methods to merge LoRAs depending on *how* you want to merge their weights, which can affect image quality. - -This guide will show you how to merge LoRAs using the [`~loaders.PeftAdapterMixin.set_adapters`] and [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) methods. 
To improve inference speed and reduce memory-usage of merged LoRAs, you'll also see how to use the [`~loaders.StableDiffusionLoraLoaderMixin.fuse_lora`] method to fuse the LoRA weights with the original weights of the underlying model. - -For this guide, load a Stable Diffusion XL (SDXL) checkpoint and the [KappaNeuro/studio-ghibli-style](https://huggingface.co/KappaNeuro/studio-ghibli-style) and [Norod78/sdxl-chalkboarddrawing-lora](https://huggingface.co/Norod78/sdxl-chalkboarddrawing-lora) LoRAs with the [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method. You'll need to assign each LoRA an `adapter_name` to combine them later. - -```py -from diffusers import DiffusionPipeline -import torch - -pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea") -pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng") -``` - -## set_adapters - -The [`~loaders.PeftAdapterMixin.set_adapters`] method merges LoRA adapters by concatenating their weighted matrices. Use the adapter name to specify which LoRAs to merge, and the `adapter_weights` parameter to control the scaling for each LoRA. For example, if `adapter_weights=[0.5, 0.5]`, then the merged LoRA output is an average of both LoRAs. Try adjusting the adapter weights to see how it affects the generated image! - -```py -pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8]) - -generator = torch.manual_seed(0) -prompt = "A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai" -image = pipeline(prompt, generator=generator, cross_attention_kwargs={"scale": 1.0}).images[0] -image -``` - -
- -
- -## add_weighted_adapter - -> [!WARNING] -> This is an experimental method that adds PEFTs [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) method to Diffusers to enable more efficient merging methods. Check out this [issue](https://github.com/huggingface/diffusers/issues/6892) if you're interested in learning more about the motivation and design behind this integration. - -The [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) method provides access to more efficient merging method such as [TIES and DARE](https://huggingface.co/docs/peft/developer_guides/model_merging). To use these merging methods, make sure you have the latest stable version of Diffusers and PEFT installed. - -```bash -pip install -U diffusers peft -``` - -There are three steps to merge LoRAs with the [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) method: - -1. Create a [PeftModel](https://huggingface.co/docs/peft/package_reference/peft_model#peft.PeftModel) from the underlying model and LoRA checkpoint. -2. Load a base UNet model and the LoRA adapters. -3. Merge the adapters using the [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) method and the merging method of your choice. - -Let's dive deeper into what these steps entail. - -1. Load a UNet that corresponds to the UNet in the LoRA checkpoint. In this case, both LoRAs use the SDXL UNet as their base model. - -```python -from diffusers import AutoModel -import torch - -unet = AutoModel.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", - subfolder="unet", -).to("cuda") -``` - -Load the SDXL pipeline and the LoRA checkpoints, starting with the [ostris/ikea-instructions-lora-sdxl](https://huggingface.co/ostris/ikea-instructions-lora-sdxl) LoRA. - -```python -from diffusers import DiffusionPipeline - -pipeline = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - variant="fp16", - torch_dtype=torch.float16, - unet=unet -).to("cuda") -pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea") -``` - -Now you'll create a [PeftModel](https://huggingface.co/docs/peft/package_reference/peft_model#peft.PeftModel) from the loaded LoRA checkpoint by combining the SDXL UNet and the LoRA UNet from the pipeline. - -```python -from peft import get_peft_model, LoraConfig -import copy - -sdxl_unet = copy.deepcopy(unet) -ikea_peft_model = get_peft_model( - sdxl_unet, - pipeline.unet.peft_config["ikea"], - adapter_name="ikea" -) - -original_state_dict = {f"base_model.model.{k}": v for k, v in pipeline.unet.state_dict().items()} -ikea_peft_model.load_state_dict(original_state_dict, strict=True) -``` - -> [!TIP] -> You can optionally push the ikea_peft_model to the Hub by calling `ikea_peft_model.push_to_hub("ikea_peft_model", token=TOKEN)`. - -Repeat this process to create a [PeftModel](https://huggingface.co/docs/peft/package_reference/peft_model#peft.PeftModel) from the [lordjia/by-feng-zikai](https://huggingface.co/lordjia/by-feng-zikai) LoRA. 
- -```python -pipeline.delete_adapters("ikea") -sdxl_unet.delete_adapters("ikea") - -pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng") -pipeline.set_adapters(adapter_names="feng") - -feng_peft_model = get_peft_model( - sdxl_unet, - pipeline.unet.peft_config["feng"], - adapter_name="feng" -) - -original_state_dict = {f"base_model.model.{k}": v for k, v in pipe.unet.state_dict().items()} -feng_peft_model.load_state_dict(original_state_dict, strict=True) -``` - -2. Load a base UNet model and then load the adapters onto it. - -```python -from peft import PeftModel - -base_unet = AutoModel.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - torch_dtype=torch.float16, - use_safetensors=True, - variant="fp16", - subfolder="unet", -).to("cuda") - -model = PeftModel.from_pretrained(base_unet, "stevhliu/ikea_peft_model", use_safetensors=True, subfolder="ikea", adapter_name="ikea") -model.load_adapter("stevhliu/feng_peft_model", use_safetensors=True, subfolder="feng", adapter_name="feng") -``` - -3. Merge the adapters using the [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) method and the merging method of your choice (learn more about other merging methods in this [blog post](https://huggingface.co/blog/peft_merging)). For this example, let's use the `"dare_linear"` method to merge the LoRAs. - -> [!WARNING] -> Keep in mind the LoRAs need to have the same rank to be merged! - -```python -model.add_weighted_adapter( - adapters=["ikea", "feng"], - weights=[1.0, 1.0], - combination_type="dare_linear", - adapter_name="ikea-feng" -) -model.set_adapters("ikea-feng") -``` - -Now you can generate an image with the merged LoRA. - -```python -model = model.to(dtype=torch.float16, device="cuda") - -pipeline = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", unet=model, variant="fp16", torch_dtype=torch.float16, -).to("cuda") - -image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0] -image -``` - -
- -
- -## fuse_lora - -Both the [`~loaders.PeftAdapterMixin.set_adapters`] and [add_weighted_adapter](https://huggingface.co/docs/peft/package_reference/lora#peft.LoraModel.add_weighted_adapter) methods require loading the base model and the LoRA adapters separately which incurs some overhead. The [`~loaders.lora_base.LoraBaseMixin.fuse_lora`] method allows you to fuse the LoRA weights directly with the original weights of the underlying model. This way, you're only loading the model once which can increase inference and lower memory-usage. - -You can use PEFT to easily fuse/unfuse multiple adapters directly into the model weights (both UNet and text encoder) using the [`~loaders.lora_base.LoraBaseMixin.fuse_lora`] method, which can lead to a speed-up in inference and lower VRAM usage. - -For example, if you have a base model and adapters loaded and set as active with the following adapter weights: - -```py -from diffusers import DiffusionPipeline -import torch - -pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea") -pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng") - -pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8]) -``` - -Fuse these LoRAs into the UNet with the [`~loaders.lora_base.LoraBaseMixin.fuse_lora`] method. The `lora_scale` parameter controls how much to scale the output by with the LoRA weights. It is important to make the `lora_scale` adjustments in the [`~loaders.lora_base.LoraBaseMixin.fuse_lora`] method because it won’t work if you try to pass `scale` to the `cross_attention_kwargs` in the pipeline. - -```py -pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0) -``` - -Then you should use [`~loaders.StableDiffusionLoraLoaderMixin.unload_lora_weights`] to unload the LoRA weights since they've already been fused with the underlying base model. Finally, call [`~DiffusionPipeline.save_pretrained`] to save the fused pipeline locally or you could call [`~DiffusionPipeline.push_to_hub`] to push the fused pipeline to the Hub. - -```py -pipeline.unload_lora_weights() -# save locally -pipeline.save_pretrained("path/to/fused-pipeline") -# save to the Hub -pipeline.push_to_hub("fused-ikea-feng") -``` - -Now you can quickly load the fused pipeline and use it for inference without needing to separately load the LoRA adapters. - -```py -pipeline = DiffusionPipeline.from_pretrained( - "username/fused-ikea-feng", torch_dtype=torch.float16, -).to("cuda") - -image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0] -image -``` - -You can call [`~~loaders.lora_base.LoraBaseMixin.unfuse_lora`] to restore the original model's weights (for example, if you want to use a different `lora_scale` value). However, this only works if you've only fused one LoRA adapter to the original model. If you've fused multiple LoRAs, you'll need to reload the model. - -```py -pipeline.unfuse_lora() -``` - -### torch.compile - -[torch.compile](../optimization/torch2.0#torchcompile) can speed up your pipeline even more, but the LoRA weights must be fused first and then unloaded. Typically, the UNet is compiled because it is such a computationally intensive component of the pipeline. 
- -```py -from diffusers import DiffusionPipeline -import torch - -# load base model and LoRAs -pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea") -pipeline.load_lora_weights("lordjia/by-feng-zikai", weight_name="fengzikai_v1.0_XL.safetensors", adapter_name="feng") - -# activate both LoRAs and set adapter weights -pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8]) - -# fuse LoRAs and unload weights -pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0) -pipeline.unload_lora_weights() - -# torch.compile -pipeline.unet.to(memory_format=torch.channels_last) -pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True) - -image = pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", generator=torch.manual_seed(0)).images[0] -``` - -Learn more about torch.compile in the [Accelerate inference of text-to-image diffusion models](../tutorials/fast_diffusion#torchcompile) guide. - -## Next steps - -For more conceptual details about how each merging method works, take a look at the [🤗 PEFT welcomes new merging methods](https://huggingface.co/blog/peft_merging#concatenation-cat) blog post! diff --git a/docs/source/en/using-diffusers/textual_inversion_inference.md b/docs/source/en/using-diffusers/textual_inversion_inference.md index 6315caef10b6..766f5f509d52 100644 --- a/docs/source/en/using-diffusers/textual_inversion_inference.md +++ b/docs/source/en/using-diffusers/textual_inversion_inference.md @@ -10,109 +10,58 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Textual inversion +# Textual Inversion -[[open-in-colab]] +[Textual Inversion](https://huggingface.co/papers/2208.01618) is a method for generating personalized images of a concept. It works by fine-tuning a models word embeddings on 3-5 images of the concept (for example, pixel art) that is associated with a unique token (``). This allows you to use the `` token in your prompt to trigger the model to generate pixel art images. -The [`StableDiffusionPipeline`] supports textual inversion, a technique that enables a model like Stable Diffusion to learn a new concept from just a few sample images. This gives you more control over the generated images and allows you to tailor the model towards specific concepts. You can get started quickly with a collection of community created concepts in the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer). - -This guide will show you how to run inference with textual inversion using a pre-learned concept from the Stable Diffusion Conceptualizer. If you're interested in teaching a model new concepts with textual inversion, take a look at the [Textual Inversion](../training/text_inversion) training guide. - -Import the necessary libraries: +Textual Inversion weights are very lightweight and typically only a few KBs because they're only word embeddings. However, this also means the word embeddings need to be loaded after loading a model with [`~DiffusionPipeline.from_pretrained`]. 
```py import torch -from diffusers import StableDiffusionPipeline -from diffusers.utils import make_image_grid -``` - -## Stable Diffusion 1 and 2 - -Pick a Stable Diffusion checkpoint and a pre-learned concept from the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer): - -```py -pretrained_model_name_or_path = "stable-diffusion-v1-5/stable-diffusion-v1-5" -repo_id_embeds = "sd-concepts-library/cat-toy" -``` - -Now you can load a pipeline, and pass the pre-learned concept to it: +from diffusers import AutoPipelineForText2Image -```py -pipeline = StableDiffusionPipeline.from_pretrained( - pretrained_model_name_or_path, torch_dtype=torch.float16, use_safetensors=True +pipeline = AutoPipelineForText2Image.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + torch_dtype=torch.float16 ).to("cuda") - -pipeline.load_textual_inversion(repo_id_embeds) ``` -Create a prompt with the pre-learned concept by using the special placeholder token ``, and choose the number of samples and rows of images you'd like to generate: +Load the word embeddings with [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] and include the unique token in the prompt to activate its generation. ```py -prompt = "a grafitti in a favela wall with a on it" - -num_samples_per_row = 2 -num_rows = 2 -``` - -Then run the pipeline (feel free to adjust the parameters like `num_inference_steps` and `guidance_scale` to see how they affect image quality), save the generated images and visualize them with the helper function you created at the beginning: - -```py -all_images = [] -for _ in range(num_rows): - images = pipeline(prompt, num_images_per_prompt=num_samples_per_row, num_inference_steps=50, guidance_scale=7.5).images - all_images.extend(images) - -grid = make_image_grid(all_images, num_rows, num_samples_per_row) -grid +pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork") +prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, style" +pipeline(prompt).images[0] ```
- +
-## Stable Diffusion XL - -Stable Diffusion XL (SDXL) can also use textual inversion vectors for inference. In contrast to Stable Diffusion 1 and 2, SDXL has two text encoders so you'll need two textual inversion embeddings - one for each text encoder model. - -Let's download the SDXL textual inversion embeddings and have a closer look at it's structure: - -```py -from huggingface_hub import hf_hub_download -from safetensors.torch import load_file - -file = hf_hub_download("dn118/unaestheticXL", filename="unaestheticXLv31.safetensors") -state_dict = load_file(file) -state_dict -``` - -``` -{'clip_g': tensor([[ 0.0077, -0.0112, 0.0065, ..., 0.0195, 0.0159, 0.0275], - ..., - [-0.0170, 0.0213, 0.0143, ..., -0.0302, -0.0240, -0.0362]], - 'clip_l': tensor([[ 0.0023, 0.0192, 0.0213, ..., -0.0385, 0.0048, -0.0011], - ..., - [ 0.0475, -0.0508, -0.0145, ..., 0.0070, -0.0089, -0.0163]], -``` +Textual Inversion can also be trained to learn *negative embeddings* to steer generation away from unwanted characteristics such as "blurry" or "ugly". It is useful for improving image quality. -There are two tensors, `"clip_g"` and `"clip_l"`. -`"clip_g"` corresponds to the bigger text encoder in SDXL and refers to -`pipe.text_encoder_2` and `"clip_l"` refers to `pipe.text_encoder`. - -Now you can load each tensor separately by passing them along with the correct text encoder and tokenizer -to [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`]: +EasyNegative is a widely used negative embedding that contains multiple learned negative concepts. Load the negative embeddings and specify the file name and token associated with the negative embeddings. Pass the token to `negative_prompt` in your pipeline to activate it. ```py -from diffusers import AutoPipelineForText2Image import torch +from diffusers import AutoPipelineForText2Image -pipe = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16) -pipe.to("cuda") +pipeline = AutoPipelineForText2Image.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_textual_inversion( + "EvilEngine/easynegative", + weight_name="easynegative.safetensors", + token="easynegative" +) -pipe.load_textual_inversion(state_dict["clip_g"], token="unaestheticXLv31", text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2) -pipe.load_textual_inversion(state_dict["clip_l"], token="unaestheticXLv31", text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer) +prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration" +negative_prompt = "easynegative" -# the embedding should be used as a negative embedding, so we pass it as a negative prompt -generator = torch.Generator().manual_seed(33) -image = pipe("a woman standing in front of a mountain", negative_prompt="unaestheticXLv31", generator=generator).images[0] -image +pipeline(prompt, negative_prompt).images[0] ``` + +
+ +
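To remove the learned embeddings from the pipeline, use [`~loaders.TextualInversionLoaderMixin.unload_textual_inversion`]. The snippet below is a minimal sketch that continues from the pipeline above.

```py
# remove the loaded embeddings and their tokens from the tokenizer and text encoder
pipeline.unload_textual_inversion()

# different embeddings can then be loaded into the same pipeline
pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork")
```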
\ No newline at end of file From 804fbd238e43e60bec29658bae5e47e14dd57670 Mon Sep 17 00:00:00 2001 From: stevhliu Date: Fri, 4 Apr 2025 15:33:44 -0700 Subject: [PATCH 2/7] ip-adapter --- docs/source/en/using-diffusers/ip_adapter.md | 709 ++---------------- .../textual_inversion_inference.md | 2 - 2 files changed, 67 insertions(+), 644 deletions(-) diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md index 5f483fbbdfee..fa3f726074fc 100644 --- a/docs/source/en/using-diffusers/ip_adapter.md +++ b/docs/source/en/using-diffusers/ip_adapter.md @@ -12,172 +12,105 @@ specific language governing permissions and limitations under the License. # IP-Adapter -[IP-Adapter](https://hf.co/papers/2308.06721) is an image prompt adapter that can be plugged into diffusion models to enable image prompting without any changes to the underlying model. Furthermore, this adapter can be reused with other models finetuned from the same base model and it can be combined with other adapters like [ControlNet](../using-diffusers/controlnet). The key idea behind IP-Adapter is the *decoupled cross-attention* mechanism which adds a separate cross-attention layer just for image features instead of using the same cross-attention layer for both text and image features. This allows the model to learn more image-specific features. +[IP-Adapter](https://huggingface.co/papers/2308.06721) is a lightweight adapter designed to integrate image-based guidance into text-to-image diffusion models. The adapter uses an image encoder to extract image features that are passed to the newly added cross-attention layers in the UNet and fine-tuned. The original UNet model, and the existing cross-attention layers corresponding to text features, is frozen. Decoupling the cross-attention for image and text features enables more fine-grained and controllable generation. -> [!TIP] -> Learn how to load an IP-Adapter in the [Load adapters](../using-diffusers/loading_adapters#ip-adapter) guide, and make sure you check out the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section which requires manually loading the image encoder. +IP-Adapter files are typically ~100MBs because they only contain the image embeddings. This means you need to load a model first, and then load the IP-Adapter with [`~loaders.IPAdapterMixin.load_ip_adapter`]. -This guide will walk you through using IP-Adapter for various tasks and use cases. - -## General tasks - -Let's take a look at how to use IP-Adapter's image prompting capabilities with the [`StableDiffusionXLPipeline`] for tasks like text-to-image, image-to-image, and inpainting. We also encourage you to try out other pipelines such as Stable Diffusion, LCM-LoRA, ControlNet, T2I-Adapter, or AnimateDiff! - -In all the following examples, you'll see the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method. This method controls the amount of text or image conditioning to apply to the model. A value of `1.0` means the model is only conditioned on the image prompt. Lowering this value encourages the model to produce more diverse images, but they may not be as aligned with the image prompt. Typically, a value of `0.5` achieves a good balance between the two prompt types and produces good results. - -> [!TIP] -> In the examples below, try adding `low_cpu_mem_usage=True` to the [`~loaders.IPAdapterMixin.load_ip_adapter`] method to speed up the loading time. 
- - - - -Crafting the precise text prompt to generate the image you want can be difficult because it may not always capture what you'd like to express. Adding an image alongside the text prompt helps the model better understand what it should generate and can lead to more accurate results. - -Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights. +Use the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] parameter to scale the influence of the IP-Adapter during generation. A value of `1.0` means the model is only conditioned on the image prompt, and `0.5` typically produces balanced results between the text and image prompt. ```py +import torch from diffusers import AutoPipelineForText2Image from diffusers.utils import load_image -import torch -pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin") -pipeline.set_ip_adapter_scale(0.6) +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="sdxl_models", + weight_name="ip-adapter_sdxl.bin" +) +pipeline.set_ip_adapter_scale(0.8) ``` -Create a text prompt and load an image prompt before passing them to the pipeline to generate an image. +Pass an image to `ip_adapter_image` along with a text prompt to generate an image. ```py image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png") -generator = torch.Generator(device="cpu").manual_seed(0) -images = pipeline( +pipeline( prompt="a polar bear sitting in a chair drinking a milkshake", ip_adapter_image=image, negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality", - num_inference_steps=100, - generator=generator, -).images -images[0] +).images[0] ``` -
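If loading the adapter weights is slow, pass `low_cpu_mem_usage=True` to [`~loaders.IPAdapterMixin.load_ip_adapter`] to speed it up. This is a small sketch with the same checkpoint as above.

```py
# optionally speed up loading the adapter weights
pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl.bin",
    low_cpu_mem_usage=True
)
```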
-
- -
IP-Adapter image
-
-
- -
generated image
-
-
- -
- - -IP-Adapter can also help with image-to-image by guiding the model to generate an image that resembles the original image and the image prompt. +Take a look at the examples below to learn how to use IP-Adapter for other tasks. -Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights. + + ```py +import torch from diffusers import AutoPipelineForImage2Image from diffusers.utils import load_image -import torch -pipeline = AutoPipelineForImage2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin") -pipeline.set_ip_adapter_scale(0.6) -``` - -Pass the original image and the IP-Adapter image prompt to the pipeline to generate an image. Providing a text prompt to the pipeline is optional, but in this example, a text prompt is used to increase image quality. +pipeline = AutoPipelineForImage2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="sdxl_models", + weight_name="ip-adapter_sdxl.bin" +) +pipeline.set_ip_adapter_scale(0.8) -```py image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png") -ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_2.png") - -generator = torch.Generator(device="cpu").manual_seed(4) -images = pipeline( +ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png") +pipeline( prompt="best quality, high quality", image=image, ip_adapter_image=ip_image, - generator=generator, - strength=0.6, -).images -images[0] + strength=0.5, +).images[0] ``` -
-
- -
original image
-
-
- -
IP-Adapter image
-
-
- -
generated image
-
-
-
- - -IP-Adapter is also useful for inpainting because the image prompt allows you to be much more specific about what you'd like to generate. - -Load a Stable Diffusion XL (SDXL) model and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. Use the `subfolder` parameter to load the SDXL model weights. + ```py -from diffusers import AutoPipelineForInpainting -from diffusers.utils import load_image import torch +from diffusers import AutoPipelineForImage2Image +from diffusers.utils import load_image -pipeline = AutoPipelineForInpainting.from_pretrained("diffusers/stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16).to("cuda") -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin") +pipeline = AutoPipelineForImage2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="sdxl_models", + weight_name="ip-adapter_sdxl.bin" +) pipeline.set_ip_adapter_scale(0.6) -``` - -Pass a prompt, the original image, mask image, and the IP-Adapter image prompt to the pipeline to generate an image. -```py mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_mask.png") image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_bear_1.png") ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_gummy.png") - -generator = torch.Generator(device="cpu").manual_seed(4) -images = pipeline( +pipeline( prompt="a cute gummy bear waving", image=image, mask_image=mask_image, ip_adapter_image=ip_image, - generator=generator, - num_inference_steps=100, -).images -images[0] +).images[0] ``` -
-
- -
original image
-
-
- -
IP-Adapter image
-
-
- -
generated image
-
-
-
- - -IP-Adapter can also help you generate videos that are more aligned with your text prompt. For example, let's load [AnimateDiff](../api/pipelines/animatediff) with its motion adapter and insert an IP-Adapter into the model with the [`~loaders.IPAdapterMixin.load_ip_adapter`] method. + -> [!WARNING] -> If you're planning on offloading the model to the CPU, make sure you run it after you've loaded the IP-Adapter. When you call [`~DiffusionPipeline.enable_model_cpu_offload`] before loading the IP-Adapter, it offloads the image encoder module to the CPU and it'll return an error when you try to run the pipeline. +The [`~DiffusionPipeline.enable_model_cpu_offload`] method is useful for reducing memory, but you should enable it **after** the IP-Adapter is loaded. Otherwise, the IP-Adapter's image encoder is also offloaded to the CPU and returns an error. ```py import torch @@ -185,8 +118,15 @@ from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter from diffusers.utils import export_to_gif from diffusers.utils import load_image -adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16) -pipeline = AnimateDiffPipeline.from_pretrained("emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16) +adapter = MotionAdapter.from_pretrained( + "guoyww/animatediff-motion-adapter-v1-5-2", + torch_dtype=torch.float16 +) +pipeline = AnimateDiffPipeline.from_pretrained( + "emilianJR/epiCRealism", + motion_adapter=adapter, + torch_dtype=torch.float16 +) scheduler = DDIMScheduler.from_pretrained( "emilianJR/epiCRealism", subfolder="scheduler", @@ -197,548 +137,33 @@ scheduler = DDIMScheduler.from_pretrained( ) pipeline.scheduler = scheduler pipeline.enable_vae_slicing() - pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") pipeline.enable_model_cpu_offload() -``` -Pass a prompt and an image prompt to the pipeline to generate a short video. - -```py ip_adapter_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_inpaint.png") - -output = pipeline( +pipeline( prompt="A cute gummy bear waving", negative_prompt="bad quality, worse quality, low resolution", ip_adapter_image=ip_adapter_image, num_frames=16, guidance_scale=7.5, num_inference_steps=50, - generator=torch.Generator(device="cpu").manual_seed(0), -) -frames = output.frames[0] -export_to_gif(frames, "gummy_bear.gif") +).frames[0] ``` -
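The output is a list of frames rather than a single image. To save them as a GIF, assign the result and pass it to `export_to_gif` (imported above); the sketch below assumes the same `pipeline` and `ip_adapter_image` as the example.

```py
frames = pipeline(
    prompt="A cute gummy bear waving",
    negative_prompt="bad quality, worse quality, low resolution",
    ip_adapter_image=ip_adapter_image,
    num_frames=16,
).frames[0]
# save the generated frames as a GIF
export_to_gif(frames, "gummy_bear.gif")
```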
-[removed figure: IP-Adapter image | generated video]
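The condensed AnimateDiff example keeps the `export_to_gif` import but no longer shows how the returned frames are saved. A short follow-up, reusing the names from the removed version of the snippet:

```py
# capture the frames from the call above and write them out as a GIF
frames = pipeline(
    prompt="A cute gummy bear waving",
    negative_prompt="bad quality, worse quality, low resolution",
    ip_adapter_image=ip_adapter_image,
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=50,
).frames[0]
export_to_gif(frames, "gummy_bear.gif")
```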
-## Configure parameters - -There are a couple of IP-Adapter parameters that are useful to know about and can help you with your image generation tasks. These parameters can make your workflow more efficient or give you more control over image generation. - -### Image embeddings - -IP-Adapter enabled pipelines provide the `ip_adapter_image_embeds` parameter to accept precomputed image embeddings. This is particularly useful in scenarios where you need to run the IP-Adapter pipeline multiple times because you have more than one image. For example, [multi IP-Adapter](#multi-ip-adapter) is a specific use case where you provide multiple styling images to generate a specific image in a specific style. Loading and encoding multiple images each time you use the pipeline would be inefficient. Instead, you can precompute and save the image embeddings to disk (which can save a lot of space if you're using high-quality images) and load them when you need them. - -> [!TIP] -> This parameter also gives you the flexibility to load embeddings from other sources. For example, ComfyUI image embeddings for IP-Adapters are compatible with Diffusers and should work ouf-of-the-box! - -Call the [`~StableDiffusionPipeline.prepare_ip_adapter_image_embeds`] method to encode and generate the image embeddings. Then you can save them to disk with `torch.save`. - -> [!TIP] -> If you're using IP-Adapter with `ip_adapter_image_embedding` instead of `ip_adapter_image`', you can set `load_ip_adapter(image_encoder_folder=None,...)` because you don't need to load an encoder to generate the image embeddings. - -```py -image_embeds = pipeline.prepare_ip_adapter_image_embeds( - ip_adapter_image=image, - ip_adapter_image_embeds=None, - device="cuda", - num_images_per_prompt=1, - do_classifier_free_guidance=True, -) - -torch.save(image_embeds, "image_embeds.ipadpt") -``` - -Now load the image embeddings by passing them to the `ip_adapter_image_embeds` parameter. - -```py -image_embeds = torch.load("image_embeds.ipadpt") -images = pipeline( - prompt="a polar bear sitting in a chair drinking a milkshake", - ip_adapter_image_embeds=image_embeds, - negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality", - num_inference_steps=100, - generator=generator, -).images -``` - -### IP-Adapter masking - -Binary masks specify which portion of the output image should be assigned to an IP-Adapter. This is useful for composing more than one IP-Adapter image. For each input IP-Adapter image, you must provide a binary mask. - -To start, preprocess the input IP-Adapter images with the [`~image_processor.IPAdapterMaskProcessor.preprocess()`] to generate their masks. For optimal results, provide the output height and width to [`~image_processor.IPAdapterMaskProcessor.preprocess()`]. This ensures masks with different aspect ratios are appropriately stretched. If the input masks already match the aspect ratio of the generated image, you don't have to set the `height` and `width`. - -```py -from diffusers.image_processor import IPAdapterMaskProcessor - -mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png") -mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png") - -output_height = 1024 -output_width = 1024 - -processor = IPAdapterMaskProcessor() -masks = processor.preprocess([mask1, mask2], height=output_height, width=output_width) -``` - -
-[removed figure: mask one | mask two]
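The reshape shown a little further down assumes `preprocess` returns one single-channel mask per input image. A quick sanity check of that assumption (the exact shape printed here is an expectation, not taken from the original guide):

```py
# expected: torch.Size([2, 1, 1024, 1024]) — one (1, height, width) mask per input image
print(masks.shape)
```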
- -When there is more than one input IP-Adapter image, load them as a list and provide the IP-Adapter scale list. Each of the input IP-Adapter images here corresponds to one of the masks generated above. - -```py -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"]) -pipeline.set_ip_adapter_scale([[0.7, 0.7]]) # one scale for each image-mask pair - -face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png") -face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png") - -ip_images = [[face_image1, face_image2]] - -masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])] -``` - -
-[removed figure: IP-Adapter image one | IP-Adapter image two]
- -Now pass the preprocessed masks to `cross_attention_kwargs` in the pipeline call. - -```py -generator = torch.Generator(device="cpu").manual_seed(0) -num_images = 1 - -image = pipeline( - prompt="2 girls", - ip_adapter_image=ip_images, - negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", - num_inference_steps=20, - num_images_per_prompt=num_images, - generator=generator, - cross_attention_kwargs={"ip_adapter_masks": masks} -).images[0] -image -``` - -
-[removed figure: IP-Adapter masking applied | no IP-Adapter masking applied]
- -## Specific use cases - -IP-Adapter's image prompting and compatibility with other adapters and models makes it a versatile tool for a variety of use cases. This section covers some of the more popular applications of IP-Adapter, and we can't wait to see what you come up with! - -### Face model - -Generating accurate faces is challenging because they are complex and nuanced. Diffusers supports two IP-Adapter checkpoints specifically trained to generate faces from the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) repository: - -* [ip-adapter-full-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-full-face_sd15.safetensors) is conditioned with images of cropped faces and removed backgrounds -* [ip-adapter-plus-face_sd15.safetensors](https://huggingface.co/h94/IP-Adapter/blob/main/models/ip-adapter-plus-face_sd15.safetensors) uses patch embeddings and is conditioned with images of cropped faces - -Additionally, Diffusers supports all IP-Adapter checkpoints trained with face embeddings extracted by `insightface` face models. Supported models are from the [h94/IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) repository. - -For face models, use the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) checkpoint. It is also recommended to use [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models. - -```py -import torch -from diffusers import StableDiffusionPipeline, DDIMScheduler -from diffusers.utils import load_image - -pipeline = StableDiffusionPipeline.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", - torch_dtype=torch.float16, -).to("cuda") -pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-full-face_sd15.bin") - -pipeline.set_ip_adapter_scale(0.5) - -image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_einstein_base.png") -generator = torch.Generator(device="cpu").manual_seed(26) +## Parameters -image = pipeline( - prompt="A photo of Einstein as a chef, wearing an apron, cooking in a French restaurant", - ip_adapter_image=image, - negative_prompt="lowres, bad anatomy, worst quality, low quality", - num_inference_steps=100, - generator=generator, -).images[0] -image -``` - -
-[removed figure: IP-Adapter image | generated image]
- -To use IP-Adapter FaceID models, first extract face embeddings with `insightface`. Then pass the list of tensors to the pipeline as `ip_adapter_image_embeds`. - -```py -import torch -from diffusers import StableDiffusionPipeline, DDIMScheduler -from diffusers.utils import load_image -from insightface.app import FaceAnalysis - -pipeline = StableDiffusionPipeline.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", - torch_dtype=torch.float16, -).to("cuda") -pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) -pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name="ip-adapter-faceid_sd15.bin", image_encoder_folder=None) -pipeline.set_ip_adapter_scale(0.6) - -image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png") - -ref_images_embeds = [] -app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider']) -app.prepare(ctx_id=0, det_size=(640, 640)) -image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB) -faces = app.get(image) -image = torch.from_numpy(faces[0].normed_embedding) -ref_images_embeds.append(image.unsqueeze(0)) -ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0) -neg_ref_images_embeds = torch.zeros_like(ref_images_embeds) -id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda") - -generator = torch.Generator(device="cpu").manual_seed(42) - -images = pipeline( - prompt="A photo of a girl", - ip_adapter_image_embeds=[id_embeds], - negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", - num_inference_steps=20, num_images_per_prompt=1, - generator=generator -).images -``` - -Both IP-Adapter FaceID Plus and Plus v2 models require CLIP image embeddings. You can prepare face embeddings as shown previously, then you can extract and pass CLIP embeddings to the hidden image projection layers. +## Applications -```py -from insightface.utils import face_align - -ref_images_embeds = [] -ip_adapter_images = [] -app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider']) -app.prepare(ctx_id=0, det_size=(640, 640)) -image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB) -faces = app.get(image) -ip_adapter_images.append(face_align.norm_crop(image, landmark=faces[0].kps, image_size=224)) -image = torch.from_numpy(faces[0].normed_embedding) -ref_images_embeds.append(image.unsqueeze(0)) -ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0) -neg_ref_images_embeds = torch.zeros_like(ref_images_embeds) -id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda") - -clip_embeds = pipeline.prepare_ip_adapter_image_embeds( - [ip_adapter_images], None, torch.device("cuda"), num_images, True)[0] - -pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16) -pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False # True if Plus v2 -``` - -### Multi IP-Adapter - -More than one IP-Adapter can be used at the same time to generate specific images in more diverse styles. For example, you can use IP-Adapter-Face to generate consistent faces and characters, and IP-Adapter Plus to generate those faces in a specific style. 
+### Face models -> [!TIP] -> Read the [IP-Adapter Plus](../using-diffusers/loading_adapters#ip-adapter-plus) section to learn why you need to manually load the image encoder. - -Load the image encoder with [`~transformers.CLIPVisionModelWithProjection`]. - -```py -import torch -from diffusers import AutoPipelineForText2Image, DDIMScheduler -from transformers import CLIPVisionModelWithProjection -from diffusers.utils import load_image - -image_encoder = CLIPVisionModelWithProjection.from_pretrained( - "h94/IP-Adapter", - subfolder="models/image_encoder", - torch_dtype=torch.float16, -) -``` - -Next, you'll load a base model, scheduler, and the IP-Adapters. The IP-Adapters to use are passed as a list to the `weight_name` parameter: - -* [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder -* [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) has the same architecture but it is conditioned with images of cropped faces - -```py -pipeline = AutoPipelineForText2Image.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - torch_dtype=torch.float16, - image_encoder=image_encoder, -) -pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) -pipeline.load_ip_adapter( - "h94/IP-Adapter", - subfolder="sdxl_models", - weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"] -) -pipeline.set_ip_adapter_scale([0.7, 0.3]) -pipeline.enable_model_cpu_offload() -``` - -Load an image prompt and a folder containing images of a certain style you want to use. - -```py -face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png") -style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy" -style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)] -``` - -
-[removed figure: IP-Adapter image of face | IP-Adapter style images]
- -Pass the image prompt and style images as a list to the `ip_adapter_image` parameter, and run the pipeline! - -```py -generator = torch.Generator(device="cpu").manual_seed(0) - -image = pipeline( - prompt="wonderwoman", - ip_adapter_image=[style_images, face_image], - negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", - num_inference_steps=50, num_images_per_prompt=1, - generator=generator, -).images[0] -image -``` - -
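If the generated character leans too heavily toward the style images (or toward the face), the per-adapter scales set earlier can be rebalanced before calling the pipeline again; the values below are only an illustration:

```py
# order matches the adapters loaded above: [style (plus), face (plus-face)]
pipeline.set_ip_adapter_scale([0.6, 0.4])
```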
+### Multiple IP-Adapters ### Instant generation -[Latent Consistency Models (LCM)](../using-diffusers/inference_with_lcm_lora) are diffusion models that can generate images in as little as 4 steps compared to other diffusion models like SDXL that typically require way more steps. This is why image generation with an LCM feels "instantaneous". IP-Adapters can be plugged into an LCM-LoRA model to instantly generate images with an image prompt. - -The IP-Adapter weights need to be loaded first, then you can use [`~StableDiffusionPipeline.load_lora_weights`] to load the LoRA style and weight you want to apply to your image. - -```py -from diffusers import DiffusionPipeline, LCMScheduler -import torch -from diffusers.utils import load_image - -model_id = "sd-dreambooth-library/herge-style" -lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5" - -pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16) - -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") -pipeline.load_lora_weights(lcm_lora_id) -pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config) -pipeline.enable_model_cpu_offload() -``` - -Try using with a lower IP-Adapter scale to condition image generation more on the [herge_style](https://huggingface.co/sd-dreambooth-library/herge-style) checkpoint, and remember to use the special token `herge_style` in your prompt to trigger and apply the style. - -```py -pipeline.set_ip_adapter_scale(0.4) - -prompt = "herge_style woman in armor, best quality, high quality" -generator = torch.Generator(device="cpu").manual_seed(0) - -ip_adapter_image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png") -image = pipeline( - prompt=prompt, - ip_adapter_image=ip_adapter_image, - num_inference_steps=4, - guidance_scale=1, -).images[0] -image -``` - -
- ### Structural control -To control image generation to an even greater degree, you can combine IP-Adapter with a model like [ControlNet](../using-diffusers/controlnet). A ControlNet is also an adapter that can be inserted into a diffusion model to allow for conditioning on an additional control image. The control image can be depth maps, edge maps, pose estimations, and more. - -Load a [`ControlNetModel`] checkpoint conditioned on depth maps, insert it into a diffusion model, and load the IP-Adapter. - -```py -from diffusers import StableDiffusionControlNetPipeline, ControlNetModel -import torch -from diffusers.utils import load_image - -controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth" -controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16) - -pipeline = StableDiffusionControlNetPipeline.from_pretrained( - "stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16) -pipeline.to("cuda") -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin") -``` - -Now load the IP-Adapter image and depth map. - -```py -ip_adapter_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png") -depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png") -``` - -
-[removed figure: IP-Adapter image | depth map]
- -Pass the depth map and IP-Adapter image to the pipeline to generate an image. - -```py -generator = torch.Generator(device="cpu").manual_seed(33) -image = pipeline( - prompt="best quality, high quality", - image=depth_map, - ip_adapter_image=ip_adapter_image, - negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", - num_inference_steps=50, - generator=generator, -).images[0] -image -``` - -
- -### Style & layout control - -[InstantStyle](https://arxiv.org/abs/2404.02733) is a plug-and-play method on top of IP-Adapter, which disentangles style and layout from image prompt to control image generation. This way, you can generate images following only the style or layout from image prompt, with significantly improved diversity. This is achieved by only activating IP-Adapters to specific parts of the model. - -By default IP-Adapters are inserted to all layers of the model. Use the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method with a dictionary to assign scales to IP-Adapter at different layers. - -```py -from diffusers import AutoPipelineForText2Image -from diffusers.utils import load_image -import torch - -pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin") - -scale = { - "down": {"block_2": [0.0, 1.0]}, - "up": {"block_0": [0.0, 1.0, 0.0]}, -} -pipeline.set_ip_adapter_scale(scale) -``` - -This will activate IP-Adapter at the second layer in the model's down-part block 2 and up-part block 0. The former is the layer where IP-Adapter injects layout information and the latter injects style. Inserting IP-Adapter to these two layers you can generate images following both the style and layout from image prompt, but with contents more aligned to text prompt. - -```py -style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg") - -generator = torch.Generator(device="cpu").manual_seed(26) -image = pipeline( - prompt="a cat, masterpiece, best quality, high quality", - ip_adapter_image=style_image, - negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry", - guidance_scale=5, - num_inference_steps=30, - generator=generator, -).images[0] -image -``` - -
-[removed figure: IP-Adapter image | generated image]
- -In contrast, inserting IP-Adapter to all layers will often generate images that overly focus on image prompt and diminish diversity. - -Activate IP-Adapter only in the style layer and then call the pipeline again. - -```py -scale = { - "up": {"block_0": [0.0, 1.0, 0.0]}, -} -pipeline.set_ip_adapter_scale(scale) - -generator = torch.Generator(device="cpu").manual_seed(26) -image = pipeline( - prompt="a cat, masterpiece, best quality, high quality", - ip_adapter_image=style_image, - negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry", - guidance_scale=5, - num_inference_steps=30, - generator=generator, -).images[0] -image -``` - -
-[removed figure: IP-Adapter only in style layer | IP-Adapter in all layers]
- -Note that you don't have to specify all layers in the dictionary. Those not included in the dictionary will be set to scale 0 which means disable IP-Adapter by default. +### Style and layout control \ No newline at end of file diff --git a/docs/source/en/using-diffusers/textual_inversion_inference.md b/docs/source/en/using-diffusers/textual_inversion_inference.md index 766f5f509d52..9923bc22fd69 100644 --- a/docs/source/en/using-diffusers/textual_inversion_inference.md +++ b/docs/source/en/using-diffusers/textual_inversion_inference.md @@ -55,10 +55,8 @@ pipeline.load_textual_inversion( weight_name="easynegative.safetensors", token="easynegative" ) - prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration" negative_prompt = "easynegative" - pipeline(prompt, negative_prompt).images[0] ``` From ecae5d0df3ed55ba0b541b2e4f40efabf5f11fbb Mon Sep 17 00:00:00 2001 From: stevhliu Date: Tue, 15 Apr 2025 15:23:53 -0700 Subject: [PATCH 3/7] ip adapter --- docs/source/en/using-diffusers/ip_adapter.md | 463 ++++++++++++++++++- 1 file changed, 461 insertions(+), 2 deletions(-) diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md index fa3f726074fc..b9be2a41a563 100644 --- a/docs/source/en/using-diffusers/ip_adapter.md +++ b/docs/source/en/using-diffusers/ip_adapter.md @@ -154,16 +154,475 @@ pipeline(
-## Parameters +## Model variants + +There are two variants of IP-Adapter, Plus and FaceID. The Plus variant uses patch embeddings and the ViT-H image encoder. FaceID variant uses face embeddings generated from InsightFace. + + + + +```py +import torch +from transformers import CLIPVisionModelWithProjection, AutoPipelineForText2Image + +image_encoder = CLIPVisionModelWithProjection.from_pretrained( + "h94/IP-Adapter", + subfolder="models/image_encoder", + torch_dtype=torch.float16 +) + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + image_encoder=image_encoder, + torch_dtype=torch.float16 +).to("cuda") + +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="sdxl_models", + weight_name="ip-adapter-plus_sdxl_vit-h.safetensors" +) +``` + + + + +```py +import torch +from transformers import AutoPipelineForText2Image + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") + +pipeline.load_ip_adapter( + "h94/IP-Adapter-FaceID", + subfolder=None, + weight_name="ip-adapter-faceid_sdxl.bin", + image_encoder_folder=None +) +``` + +To use a IP-Adapter FaceID Plus model, load the CLIP image encoder as well as [`~transformers.CLIPVisionModelWithProjection`]. + +```py +from transformers import AutoPipelineForText2Image, CLIPVisionModelWithProjection + +image_encoder = CLIPVisionModelWithProjection.from_pretrained( + "laion/CLIP-ViT-H-14-laion2B-s32B-b79K", + torch_dtype=torch.float16, +) + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + image_encoder=image_encoder, + torch_dtype=torch.float16 +).to("cuda") + +pipeline.load_ip_adapter( + "h94/IP-Adapter-FaceID", + subfolder=None, + weight_name="ip-adapter-faceid-plus_sd15.bin" +) +``` + + + + +## Image embeddings + +The `prepare_ip_adapter_image_embeds` generates image embeddings you can reuse if you're running the pipeline multiple times because you have more than one image. Loading and encoding multiple images each time you use the pipeline can be inefficient. Precomputing the image embeddings ahead of time, saving them to disk, and loading them when you need them is more efficient. + +```py +import torch +from diffusers import AutoPipelineForText2Image + +pipeline = AutoPipelineForImage2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") + +image_embeds = pipeline.prepare_ip_adapter_image_embeds( + ip_adapter_image=image, + ip_adapter_image_embeds=None, + device="cuda", + num_images_per_prompt=1, + do_classifier_free_guidance=True, +) + +torch.save(image_embeds, "image_embeds.ipadpt") +``` + +Reload the image embeddings by passing them to the `ip_adapter_image_embeds` parameter. Set `image_encoder_folder` to `None` because you don't need the image encoder anymore to generate the image embeddings. + +> [!TIP] +> You can also load image embeddings from other sources such as ComfyUI. 
+ +```py +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="sdxl_models", + image_encoder_folder=None, + weight_name="ip-adapter_sdxl.bin" +) +pipeline.set_ip_adapter_scale(0.8) +image_embeds = torch.load("image_embeds.ipadpt") +pipeline( + prompt="a polar bear sitting in a chair drinking a milkshake", + ip_adapter_image_embeds=image_embeds, + negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality", + num_inference_steps=100, + generator=generator, +).images[0] +``` + +## Masking + +Binary masking enables assigning an IP-Adapter image to a specific area of the output image, making it useful for composing multiple IP-Adapter images. Each IP-Adapter image requires a binary mask. + +Load the [`~image_processor.IPAdapterMaskProcessor`] to preprocess the image masks. For the best results, provide the output `height` and `width` to ensure masks with different aspect ratios are appropriately sized. If the input masks already match the aspect ratio of the generated image, you don't need to set the `height` and `width`. + +```py +import torch +from diffusers import AutoPipelineForText2Image +from diffusers.image_processor import IPAdapterMaskProcessor +from diffusers.utils import load_image + +pipeline = AutoPipelineForImage2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") + +mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png") +mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png") + +processor = IPAdapterMaskProcessor() +masks = processor.preprocess([mask1, mask2], height=1024, width=1024) +``` + +Provide both the IP-Adapter images and their scales as a list. Pass the preprocessed masks to `cross_attention_kwargs` in the pipeline. + +```py +face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png") +face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png") + +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="sdxl_models", + weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"] +) +pipeline.set_ip_adapter_scale([[0.7, 0.7]]) + +ip_images = [[face_image1, face_image2]] +masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])] + +pipeline( + prompt="2 girls", + ip_adapter_image=ip_images, + negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", + cross_attention_kwargs={"ip_adapter_masks": masks} +).images[0] +``` ## Applications +The section below covers some popular applications of IP-Adapter. + ### Face models +Face generation and preserving its details can be challenging. To help generate more accurate faces, there are checkpoints specifically conditioned on images of cropped faces. You can find the face models in the [h94/IP-Adapter](https://huggingface.co/h94/IP-Adapter) repository or the [h94/IP-Adapter-FaceID](https://huggingface.co/h94/IP-Adapter-FaceID) repository. The FaceID checkpoints use the FaceID embeddings from [InsightFace](https://github.com/deepinsight/insightface) instead of CLIP image embeddings. + +We recommend using the [`DDIMScheduler`] or [`EulerDiscreteScheduler`] for face models. 
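Either recommended scheduler is swapped in with `from_config` before loading the IP-Adapter. The snippets below use [`DDIMScheduler`]; a sketch of the same step with [`EulerDiscreteScheduler`] instead:

```py
from diffusers import EulerDiscreteScheduler

# replace the pipeline's default scheduler while keeping its config
pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config)
```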
+ + + + +```py +import torch +from diffusers import StableDiffusionPipeline, DDIMScheduler +from diffusers.utils import load_image + +pipeline = StableDiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + torch_dtype=torch.float16, +).to("cuda") +pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="models", + weight_name="ip-adapter-full-face_sd15.bin" +) + +pipeline.set_ip_adapter_scale(0.5) +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_einstein_base.png") + +pipeline( + prompt="A photo of Einstein as a chef, wearing an apron, cooking in a French restaurant", + ip_adapter_image=image, + negative_prompt="lowres, bad anatomy, worst quality, low quality", + num_inference_steps=100, +).images[0] +``` + + + + +For FaceID models, extract the face embeddings and pass them as a list of tensors to `ip_adapter_image_embeds`. + +```py +# pip install insightface +import torch +from diffusers import StableDiffusionPipeline, DDIMScheduler +from diffusers.utils import load_image +from insightface.app import FaceAnalysis + +pipeline = StableDiffusionPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + torch_dtype=torch.float16, +).to("cuda") +pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) +pipeline.load_ip_adapter( + "h94/IP-Adapter-FaceID", + subfolder=None, + weight_name="ip-adapter-faceid_sd15.bin", + image_encoder_folder=None +) +pipeline.set_ip_adapter_scale(0.6) + +image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png") + +ref_images_embeds = [] +app = FaceAnalysis(name="buffalo_l", providers=['CUDAExecutionProvider', 'CPUExecutionProvider']) +app.prepare(ctx_id=0, det_size=(640, 640)) +image = cv2.cvtColor(np.asarray(image), cv2.COLOR_BGR2RGB) +faces = app.get(image) +image = torch.from_numpy(faces[0].normed_embedding) +ref_images_embeds.append(image.unsqueeze(0)) +ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0) +neg_ref_images_embeds = torch.zeros_like(ref_images_embeds) +id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda") + +pipeline( + prompt="A photo of a girl", + ip_adapter_image_embeds=[id_embeds], + negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", +).images[0] +``` + +The IP-Adapter FaceID Plus and Plus v2 models require CLIP image embeddings. Prepare the face embeddings and then extract and pass the CLIP embeddings to the hidden image projection layers. + +```py +clip_embeds = pipeline.prepare_ip_adapter_image_embeds( + [ip_adapter_images], None, torch.device("cuda"), num_images, True)[0] + +pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16) +# set to True if using IP-Adapter FaceID Plus v2 +pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = False +``` + + + + ### Multiple IP-Adapters +Combine multiple IP-Adapters to generate images in more diverse styles. For example, you can use IP-Adapter Face to generate consistent faces and characters and IP-Adapter Plus to generate those faces in specific styles. + +Load an image encoder with [`~transformers.CLIPVisionModelWithProjection`]. 
+ +```py +import torch +from diffusers import AutoPipelineForText2Image, DDIMScheduler +from transformers import CLIPVisionModelWithProjection +from diffusers.utils import load_image + +image_encoder = CLIPVisionModelWithProjection.from_pretrained( + "h94/IP-Adapter", + subfolder="models/image_encoder", + torch_dtype=torch.float16, +) +``` + +Load a base model, scheduler and the following IP-Adapters. + +- [ip-adapter-plus_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder +- [ip-adapter-plus-face_sdxl_vit-h](https://huggingface.co/h94/IP-Adapter#ip-adapter-for-sdxl-10) uses patch embeddings and a ViT-H image encoder but it is conditioned on images of cropped faces + +```py +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16, + image_encoder=image_encoder, +) +pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="sdxl_models", + weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"] +) +pipeline.set_ip_adapter_scale([0.7, 0.3]) +# enable_model_cpu_offload to reduce memory usage +pipeline.enable_model_cpu_offload() +``` + +Load an image and a folder containing images of a certain style to apply. + +```py +face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png") +style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy" +style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)] +``` + +Pass style and face images as a list to `ip_adapter_image`. + +```py +generator = torch.Generator(device="cpu").manual_seed(0) + +pipeline( + prompt="wonderwoman", + ip_adapter_image=[style_images, face_image], + negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", +).images[0] +``` + ### Instant generation +[Latent Consistency Models (LCM)](../api/pipelines/latent_consistency_models) can generate images 4 steps or less, unlike other diffusion models which require a lot more steps, making it feel "instantaneous". IP-Adapters are compatible with LCM models to instantly generate images. + +Load the IP-Adapter weights and load the LoRA weights with [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights]. + +```py +import torch +from diffusers import DiffusionPipeline, LCMScheduler +from diffusers.utils import load_image + +pipeline = DiffusionPipeline.from_pretrained( + "sd-dreambooth-library/herge-style", + torch_dtype=torch.float16 +) + +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="models", + weight_name="ip-adapter_sd15.bin" +) +pipeline.load_lora_weights("latent-consistency/lcm-lora-sdv1-5") +pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config) +# enable_model_cpu_offload to reduce memory usage +pipeline.enable_model_cpu_offload() +``` + +Try using a lower IP-Adapter scale to condition generation more on the style you want to apply, and remember to use the special token in your prompt to trigger its generation. 
+ +```py +pipeline.set_ip_adapter_scale(0.4) + +prompt = "herge_style woman in armor, best quality, high quality" + +ip_adapter_image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png") +pipeline( + prompt=prompt, + ip_adapter_image=ip_adapter_image, + num_inference_steps=4, + guidance_scale=1, +).images[0] +``` + ### Structural control -### Style and layout control \ No newline at end of file +For structural control, combine IP-Adapter with [ControlNet](../api/pipelines/controlnet) conditioned on depth maps, edge maps, pose estimations, and more. + +The example below loads a [`ControlNetModel`] checkpoint conditioned on depth maps and combines it with a IP-Adapter. + +```py +import torch +from diffusers.utils import load_image +from diffusers import StableDiffusionControlNetPipeline, ControlNetModel + +controlnet = ControlNetModel.from_pretrained( + "lllyasviel/control_v11f1p_sd15_depth", + torch_dtype=torch.float16 +) + +pipeline = StableDiffusionControlNetPipeline.from_pretrained( + "stable-diffusion-v1-5/stable-diffusion-v1-5", + controlnet=controlnet, + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="models", + weight_name="ip-adapter_sd15.bin" +) +``` + +Pass the depth map and IP-Adapter image to the pipeline. + +```py +pipeline( + prompt="best quality, high quality", + image=depth_map, + ip_adapter_image=ip_adapter_image, + negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", +).images[0] +``` + +### Style and layout control + +For style and layout control, combine IP-Adapter with [InstantStyle](https://huggingface.co/papers/2404.02733). InstantStyle separates *style* (color, texture, overall feel) and *content* from each other. It only applies the style in style-specific blocks of the model to prevent it from distorting other areas of an image. This generates images with stronger and more consistent styles and better control over the layout. + +The IP-Adapter is only activated for specific parts of the model. Use the [`~loaders.IPAdapterMixin.set_ip_adapter_scale`] method to scale the influence of the IP-Adapter in different layers. The example below activates the IP-Adapter in the second layer of the models down `block_2` and up `block_0`. Down `block_2` is where the IP-Adapter injects layout information and up `block_0` is where style is injected. + +```py +import torch +from diffusers import AutoPipelineForText2Image +from diffusers.utils import load_image + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_ip_adapter( + "h94/IP-Adapter", + subfolder="sdxl_models", + weight_name="ip-adapter_sdxl.bin" +) + +scale = { + "down": {"block_2": [0.0, 1.0]}, + "up": {"block_0": [0.0, 1.0, 0.0]}, +} +pipeline.set_ip_adapter_scale(scale) +``` + +Load the style image and generate an image. + +```py +style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg") + +pipeline( + prompt="a cat, masterpiece, best quality, high quality", + ip_adapter_image=style_image, + negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry", + guidance_scale=5, +).images[0] +``` + +You can also insert the IP-Adapter in all the model layers. 
This tends to generate images that focus more on the image prompt and may reduce the diversity of generated images. Only activate the IP-Adapter in up `block_0` or the style layer. + +> [!TIP] +> You don't need to specify all the layers in the `scale` dictionary. Layers not included are set to 0, which means the IP-Adapter is disabled. + +```py +scale = { + "up": {"block_0": [0.0, 1.0, 0.0]}, +} +pipeline.set_ip_adapter_scale(scale) + +pipeline( + prompt="a cat, masterpiece, best quality, high quality", + ip_adapter_image=style_image, + negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry", + guidance_scale=5, +).images[0] +``` \ No newline at end of file From b933d5b14a5b353fc1ca946a64664760866ab3fe Mon Sep 17 00:00:00 2001 From: stevhliu Date: Tue, 15 Apr 2025 16:12:58 -0700 Subject: [PATCH 4/7] fix toctree --- docs/source/en/_toctree.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 99b160fae9aa..4d54945b7f2e 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -35,7 +35,7 @@ title: Push files to the Hub title: Load pipelines and adapters - sections: - - local: using-diffusers/using_peft_for_inference + - local: tutorials/using_peft_for_inference title: LoRA - local: using-diffusers/ip_adapter title: IP-Adapter From 4a79ebf4ebffa408524a7d7e53b9d469957452ac Mon Sep 17 00:00:00 2001 From: stevhliu Date: Tue, 15 Apr 2025 17:54:59 -0700 Subject: [PATCH 5/7] fix toctree --- docs/source/en/_toctree.yml | 2 - .../en/tutorials/using_peft_for_inference.md | 214 +----------------- 2 files changed, 8 insertions(+), 208 deletions(-) diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 4d54945b7f2e..058c67e96aae 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -65,8 +65,6 @@ title: Create a server - local: training/distributed_inference title: Distributed inference - - local: using-diffusers/merge_loras - title: Merge LoRAs - local: using-diffusers/scheduler_features title: Scheduler features - local: using-diffusers/callback diff --git a/docs/source/en/tutorials/using_peft_for_inference.md b/docs/source/en/tutorials/using_peft_for_inference.md index 33414a331ea7..7135dc865579 100644 --- a/docs/source/en/tutorials/using_peft_for_inference.md +++ b/docs/source/en/tutorials/using_peft_for_inference.md @@ -12,216 +12,18 @@ specific language governing permissions and limitations under the License. [[open-in-colab]] -# Load LoRAs for inference +# LoRA -There are many adapter types (with [LoRAs](https://huggingface.co/docs/peft/conceptual_guides/adapter#low-rank-adaptation-lora) being the most popular) trained in different styles to achieve different effects. You can even combine multiple adapters to create new and unique images. +## Adjust weight scale -In this tutorial, you'll learn how to easily load and manage adapters for inference with the 🤗 [PEFT](https://huggingface.co/docs/peft/index) integration in 🤗 Diffusers. You'll use LoRA as the main adapter technique, so you'll see the terms LoRA and adapter used interchangeably. +## Hotswap -Let's first install all the required libraries. 
+## Merge -```bash -!pip install -q transformers accelerate peft diffusers -``` +### set_adapters -Now, load a pipeline with a [Stable Diffusion XL (SDXL)](../api/pipelines/stable_diffusion/stable_diffusion_xl) checkpoint: +### add_weighted_adapter -```python -from diffusers import DiffusionPipeline -import torch +### fuse_lora -pipe_id = "stabilityai/stable-diffusion-xl-base-1.0" -pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda") -``` - -Next, load a [CiroN2022/toy-face](https://huggingface.co/CiroN2022/toy-face) adapter with the [`~diffusers.loaders.StableDiffusionXLLoraLoaderMixin.load_lora_weights`] method. With the 🤗 PEFT integration, you can assign a specific `adapter_name` to the checkpoint, which lets you easily switch between different LoRA checkpoints. Let's call this adapter `"toy"`. - -```python -pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy") -``` - -Make sure to include the token `toy_face` in the prompt and then you can perform inference: - -```python -prompt = "toy_face of a hacker with a hoodie" - -lora_scale = 0.9 -image = pipe( - prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0) -).images[0] -image -``` - -![toy-face](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_8_1.png) - -With the `adapter_name` parameter, it is really easy to use another adapter for inference! Load the [nerijs/pixel-art-xl](https://huggingface.co/nerijs/pixel-art-xl) adapter that has been fine-tuned to generate pixel art images and call it `"pixel"`. - -The pipeline automatically sets the first loaded adapter (`"toy"`) as the active adapter, but you can activate the `"pixel"` adapter with the [`~loaders.peft.PeftAdapterMixin.set_adapters`] method: - -```python -pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel") -pipe.set_adapters("pixel") -``` - -Make sure you include the token `pixel art` in your prompt to generate a pixel art image: - -```python -prompt = "a hacker with a hoodie, pixel art" -image = pipe( - prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0) -).images[0] -image -``` - -![pixel-art](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_12_1.png) - - - -By default, if the most up-to-date versions of PEFT and Transformers are detected, `low_cpu_mem_usage` is set to `True` to speed up the loading time of LoRA checkpoints. - - - -## Merge adapters - -You can also merge different adapter checkpoints for inference to blend their styles together. - -Once again, use the [`~loaders.peft.PeftAdapterMixin.set_adapters`] method to activate the `pixel` and `toy` adapters and specify the weights for how they should be merged. - -```python -pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0]) -``` - - - -LoRA checkpoints in the diffusion community are almost always obtained with [DreamBooth](https://huggingface.co/docs/diffusers/main/en/training/dreambooth). DreamBooth training often relies on "trigger" words in the input text prompts in order for the generation results to look as expected. When you combine multiple LoRA checkpoints, it's important to ensure the trigger words for the corresponding LoRA checkpoints are present in the input text prompts. 
- - - -Remember to use the trigger words for [CiroN2022/toy-face](https://hf.co/CiroN2022/toy-face) and [nerijs/pixel-art-xl](https://hf.co/nerijs/pixel-art-xl) (these are found in their repositories) in the prompt to generate an image. - -```python -prompt = "toy_face of a hacker with a hoodie, pixel art" -image = pipe( - prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=torch.manual_seed(0) -).images[0] -image -``` - -![toy-face-pixel-art](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_16_1.png) - -Impressive! As you can see, the model generated an image that mixed the characteristics of both adapters. - -> [!TIP] -> Through its PEFT integration, Diffusers also offers more efficient merging methods which you can learn about in the [Merge LoRAs](../using-diffusers/merge_loras) guide! - -To return to only using one adapter, use the [`~loaders.peft.PeftAdapterMixin.set_adapters`] method to activate the `"toy"` adapter: - -```python -pipe.set_adapters("toy") - -prompt = "toy_face of a hacker with a hoodie" -lora_scale = 0.9 -image = pipe( - prompt, num_inference_steps=30, cross_attention_kwargs={"scale": lora_scale}, generator=torch.manual_seed(0) -).images[0] -image -``` - -Or to disable all adapters entirely, use the [`~loaders.peft.PeftAdapterMixin.disable_lora`] method to return the base model. - -```python -pipe.disable_lora() - -prompt = "toy_face of a hacker with a hoodie" -image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0] -image -``` - -![no-lora](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_20_1.png) - -### Customize adapters strength - -For even more customization, you can control how strongly the adapter affects each part of the pipeline. For this, pass a dictionary with the control strengths (called "scales") to [`~loaders.peft.PeftAdapterMixin.set_adapters`]. - -For example, here's how you can turn on the adapter for the `down` parts, but turn it off for the `mid` and `up` parts: -```python -pipe.enable_lora() # enable lora again, after we disabled it above -prompt = "toy_face of a hacker with a hoodie, pixel art" -adapter_weight_scales = { "unet": { "down": 1, "mid": 0, "up": 0} } -pipe.set_adapters("pixel", adapter_weight_scales) -image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0] -image -``` - -![block-lora-text-and-down](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_block_down.png) - -Let's see how turning off the `down` part and turning on the `mid` and `up` part respectively changes the image. 
-```python -adapter_weight_scales = { "unet": { "down": 0, "mid": 1, "up": 0} } -pipe.set_adapters("pixel", adapter_weight_scales) -image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0] -image -``` - -![block-lora-text-and-mid](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_block_mid.png) - -```python -adapter_weight_scales = { "unet": { "down": 0, "mid": 0, "up": 1} } -pipe.set_adapters("pixel", adapter_weight_scales) -image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0] -image -``` - -![block-lora-text-and-up](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_block_up.png) - -Looks cool! - -This is a really powerful feature. You can use it to control the adapter strengths down to per-transformer level. And you can even use it for multiple adapters. -```python -adapter_weight_scales_toy = 0.5 -adapter_weight_scales_pixel = { - "unet": { - "down": 0.9, # all transformers in the down-part will use scale 0.9 - # "mid" # because, in this example, "mid" is not given, all transformers in the mid part will use the default scale 1.0 - "up": { - "block_0": 0.6, # all 3 transformers in the 0th block in the up-part will use scale 0.6 - "block_1": [0.4, 0.8, 1.0], # the 3 transformers in the 1st block in the up-part will use scales 0.4, 0.8 and 1.0 respectively - } - } -} -pipe.set_adapters(["toy", "pixel"], [adapter_weight_scales_toy, adapter_weight_scales_pixel]) -image = pipe(prompt, num_inference_steps=30, generator=torch.manual_seed(0)).images[0] -image -``` - -![block-lora-mixed](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_block_mixed.png) - -## Manage adapters - -You have attached multiple adapters in this tutorial, and if you're feeling a bit lost on what adapters have been attached to the pipeline's components, use the [`~diffusers.loaders.StableDiffusionLoraLoaderMixin.get_active_adapters`] method to check the list of active adapters: - -```py -active_adapters = pipe.get_active_adapters() -active_adapters -["toy", "pixel"] -``` - -You can also get the active adapters of each pipeline component with [`~diffusers.loaders.StableDiffusionLoraLoaderMixin.get_list_adapters`]: - -```py -list_adapters_component_wise = pipe.get_list_adapters() -list_adapters_component_wise -{"text_encoder": ["toy", "pixel"], "unet": ["toy", "pixel"], "text_encoder_2": ["toy", "pixel"]} -``` - -The [`~loaders.peft.PeftAdapterMixin.delete_adapters`] function completely removes an adapter and their LoRA layers from a model. 
- -```py -pipe.delete_adapters("toy") -pipe.get_active_adapters() -["pixel"] -``` - -## PeftInputAutocastDisableHook - -[[autodoc]] hooks.layerwise_casting.PeftInputAutocastDisableHook +## torch.compile \ No newline at end of file From 1a03c6b7b011852dee9cac5ca49d86c69045674b Mon Sep 17 00:00:00 2001 From: stevhliu Date: Wed, 16 Apr 2025 15:29:14 -0700 Subject: [PATCH 6/7] lora --- .../en/tutorials/using_peft_for_inference.md | 611 +++++++++++++++++- 1 file changed, 606 insertions(+), 5 deletions(-) diff --git a/docs/source/en/tutorials/using_peft_for_inference.md b/docs/source/en/tutorials/using_peft_for_inference.md index 7135dc865579..d21edd25f989 100644 --- a/docs/source/en/tutorials/using_peft_for_inference.md +++ b/docs/source/en/tutorials/using_peft_for_inference.md @@ -10,20 +10,621 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -[[open-in-colab]] - # LoRA -## Adjust weight scale +[LoRA (Low-Rank Adaptation)](https://huggingface.co/papers/2106.09685) is a method for quickly training a model for a new task. It works by freezing the original model weights and adding a small number of *new* trainable parameters. This means it is significantly faster and cheaper to adapt an existing model to new tasks, such as generating images in a new style. + +LoRA checkpoints are typically only a couple hundred MBs in size, so they're very lightweight and easy to store. Load these smaller set of weights into an existing base model with [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] and specify the file name. + + + + +```py +import torch +from diffusers import AutoPipelineForText2Image + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_lora_weights( + "ostris/super-cereal-sdxl-lora", + weight_name="cereal_box_sdxl_v1.safetensors", + adapter_name="cereal" +) +pipeline("bears, pizza bites").images[0] +``` + + + + +```py +import torch +from diffusers import LTXConditionPipeline +from diffusers.utils import export_to_video, load_image + +pipeline = LTXConditionPipeline.from_pretrained( + "Lightricks/LTX-Video-0.9.5", torch_dtype=torch.bfloat16 +) + +pipeline.load_lora_weights( + "Lightricks/LTX-Video-Cakeify-LoRA", + weight_name="ltxv_095_cakeify_lora.safetensors", + adapter_name="cakeify" +) +pipeline.set_adapters("cakeify") + +# use "CAKEIFY" to trigger the LoRA +prompt = "CAKEIFY a person using a knife to cut a cake shaped like a Pikachu plushie" +image = load_image("https://huggingface.co/Lightricks/LTX-Video-Cakeify-LoRA/resolve/main/assets/images/pikachu.png") + +video = pipeline( + prompt=prompt, + image=image, + width=576, + height=576, + num_frames=161, + decode_timestep=0.03, + decode_noise_scale=0.025, + num_inference_steps=50, +).frames[0] +export_to_video(video, "output.mp4", fps=26) +``` + + + + +The [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] method is the preferred way to load LoRA weights into the UNet and text encoder because it can handle cases where: + +- the LoRA weights don't have separate UNet and text encoder identifiers +- the LoRA weights have separate UNet and text encoder identifiers + +The [`~loaders.PeftAdapterMixin.load_lora_adapter`] method is used to directly load a LoRA adapter at the *model-level*. It builds and prepares the necessary model configuration for the adapter. 
This method can also load the LoRA adapter into the UNet and text encoder. + +For example, if you're only loading a LoRA into the UNet, [`~loaders.PeftAdapterMixin.load_lora_adapter`] ignores the text encoder keys. Use the `prefix` parameter to filter and load the appropriate state dicts, `"unet"` to load. + +```py +import torch +from diffusers import AutoPipelineForText2Image + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.unet.load_lora_adapter( + "jbilcke-hf/sdxl-cinematic-1", + weight_name="pytorch_lora_weights.safetensors", + adapter_name="cinematic" + prefix="unet" +) +# use cnmt in the prompt to trigger the LoRA +pipeline("A cute cnmt eating a slice of pizza, stunning color scheme, masterpiece, illustration").images[0] +``` + +## torch.compile + +[torch.compile](../optimization/torch2.0#torchcompile) speeds up inference by compiling the PyTorch model to use optimized kernels. Before compiling, the LoRA weights need to be fused into the base model and unloaded first. + +```py +import torch +from diffusers import DiffusionPipeline + +# load base model and LoRA +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_lora_weights( + "ostris/ikea-instructions-lora-sdxl", + weight_name="ikea_instructions_xl_v1_5.safetensors", + adapter_name="ikea" +) + +# activate LoRA and set adapter weight +pipeline.set_adapters("ikea", adapter_weights=0.7) + +# fuse LoRAs and unload weights +pipeline.fuse_lora(adapter_names=["ikea"], lora_scale=1.0) +pipeline.unload_lora_weights() +``` + +Typically, the UNet is compiled because its the most compute intensive component of the pipeline. + +```py +pipeline.unet.to(memory_format=torch.channels_last) +pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True) + +pipeline("A bowl of ramen shaped like a cute kawaii bear").images[0] +``` + +## Weight scale + +The `scale` parameter is used to control how much of a LoRA to apply. A value of `0` is equivalent to only using the base model weights and a value of `1` is equivalent to fully using the LoRA. + + + + +For simple use cases, you can pass `cross_attention_kwargs={"scale": 1.0}` to the pipeline. + +```py +import torch +from diffusers import AutoPipelineForText2Image + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_lora_weights( + "ostris/super-cereal-sdxl-lora", + weight_name="cereal_box_sdxl_v1.safetensors", + adapter_name="cereal" +) +pipeline("bears, pizza bites", cross_attention_kwargs={"scale": 1.0}).images[0] +``` -## Hotswap + + + +> [!WARNING] +> The [`~loaders.PeftAdapterMixin.set_adapters`] method only scales attention weights. If a LoRA has ResNets or down and upsamplers, these components keep a scale value of `1.0`. + +For finer control over each individual component of the UNet or text encoder, pass a dictionary instead. In the example below, the `"down"` block in the UNet is scaled by 0.9 and you can further specify in the `"up"` block the scales of the transformers in `"block_0"` and `"block_1"`. If a block like `"mid"` isn't specified, the default value `1.0` is used. 
+ +```py +import torch +from diffusers import AutoPipelineForText2Image + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_lora_weights( + "ostris/super-cereal-sdxl-lora", + weight_name="cereal_box_sdxl_v1.safetensors", + adapter_name="cereal" +) +scales = { + "text_encoder": 0.5, + "text_encoder_2": 0.5, + "unet": { + "down": 0.9, + "up": { + "block_0": 0.6, + "block_1": [0.4, 0.8, 1.0], + } + } +} +pipeline.set_adapters("cereal", scales) +pipeline("bears, pizza bites").images[0] +``` + + + + +## Hotswapping + +Hotswapping LoRAs is an efficient way to work with multiple LoRAs while avoiding accumulating memory from multiple calls to [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] and in some cases, recompilation, if a model is compiled. This workflow requires a loaded LoRA because the new LoRA weights are swapped in place for the existing loaded LoRA. + +```py +import torch +from diffusers import DiffusionPipeline + +# load base model and LoRAs +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_lora_weights( + "ostris/ikea-instructions-lora-sdxl", + weight_name="ikea_instructions_xl_v1_5.safetensors", + adapter_name="ikea" +) +``` + +> [!WARNING] +> Hotswapping is unsupported for LoRAs that target the text encoder. + +Set `hotswap=True` in [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] to swap the second LoRA. Use the `adapter_name` parameter to indicate which LoRA to swap (`default_0` is the default name). + +```py +pipeline.load_lora_weights( + "lordjia/by-feng-zikai", + hotswap=True, + adapter_name="ikea" +) +``` + +### Compiled models + +For compiled models, use [`~loaders.lora_base.LoraBaseMixin.enable_lora_hotswap`] to avoid recompilation when hotswapping LoRAs. This method should be called *before* loading the first LoRA and `torch.compile` should be called *after* loading the first LoRA. + +> [!TIP] +> The [`~loaders.lora_base.LoraBaseMixin.enable_lora_hotswap`] method isn't always necessary if the second LoRA targets the identical LoRA ranks and scales as the first LoRA. + +Within [`~loaders.lora_base.LoraBaseMixin.enable_lora_hotswap`], the `target_rank` parameter is important for setting the rank for all LoRA adapters. Setting it to `max_rank` sets it to the highest value. For LoRAs with different ranks, you set it to a higher rank value. The default rank value is 128. + +```py +import torch +from diffusers import DiffusionPipeline + +# load base model and LoRAs +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +# 1. enable_lora_hotswap +pipeline.enable_lora_hotswap(target_rank=max_rank) +pipeline.load_lora_weights( + "ostris/ikea-instructions-lora-sdxl", + weight_name="ikea_instructions_xl_v1_5.safetensors", + adapter_name="ikea" +) +# 2. torch.compile +pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True) + +# 3. hotswap +pipeline.load_lora_weights( + "lordjia/by-feng-zikai", + hotswap=True, + adapter_name="ikea" +) +``` + +> [!TIP] +> Move your code inside the `with torch._dynamo.config.patch(error_on_recompile=True)` context manager to detect if a model was recompiled. 
If a model is recompiled despite following all the steps above, please open an [issue](https://github.com/huggingface/diffusers/issues) with a reproducible example. + +There are still scenarios where recompulation is unavoidable, such as when the hotswapped LoRA targets more layers than the initial adapter. Try to load the LoRA that targets the most layers *first*. For more details about this limitation, refer to the PEFT [hotswapping](https://huggingface.co/docs/peft/main/en/package_reference/hotswap#peft.utils.hotswap.hotswap_adapter) docs. ## Merge +The weights from each LoRA can be merged together to produce a blend of multiple existing styles. There are several methods for merging LoRAs, each of which differ in *how* the weights are merged (may affect generation quality). + ### set_adapters +The [`~loaders.PeftAdapterMixin.set_adapters`] method merges LoRAs by concatenating their weighted matrices. Pass the LoRA names to [`~loaders.PeftAdapterMixin.set_adapters`] and use the `adapter_weights` parameter to control the scaling of each LoRA. For example, if `adapter_weights=[0.5, 0.5]`, the output is an average of both LoRAs. + +> [!TIP] +> The `"scale"` parameter determines how much of the merged LoRA to apply. See the [Weight scale](#weight-scale) section for more details. + +```py +import torch +from diffusers import DiffusionPipeline + +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_lora_weights( + "ostris/ikea-instructions-lora-sdxl", + weight_name="ikea_instructions_xl_v1_5.safetensors", + adapter_name="ikea" +) +pipeline.load_lora_weights( + "lordjia/by-feng-zikai", + weight_name="fengzikai_v1.0_XL.safetensors", + adapter_name="feng" +) +pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8]) +# use by Feng Zikai to activate the lordjia/by-feng-zikai LoRA +pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", cross_attention_kwargs={"scale": 1.0}).images[0] +``` + +
+<!-- generated image -->
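Because [`~loaders.PeftAdapterMixin.set_adapters`] doesn't change the underlying model weights, you can call it again at any point to re-balance the merge without reloading the LoRAs. The snippet below is a minimal sketch that reuses the pipeline and the `ikea` and `feng` adapters from the example above.

```py
# re-balance the same adapters with different weights (no reloading required)
pipeline.set_adapters(["ikea", "feng"], adapter_weights=[1.0, 0.5])
pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai", cross_attention_kwargs={"scale": 1.0}).images[0]
```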
+ ### add_weighted_adapter +> [!TIP] +> This is an experimental method and you can refer to PEFTs [Model merging](https://huggingface.co/docs/peft/developer_guides/model_merging) for more details. Take a look at this [issue](https://github.com/huggingface/diffusers/issues/6892) if you're interested in the motivation and design behind this integration. + +The [`~peft.LoraModel.add_weighted_adapter`] method enables more efficient merging methods like [TIES](https://huggingface.co/papers/2306.01708) or [DARE](https://huggingface.co/papers/2311.03099). These merging methods remove redundant and potentially interfering parameters from merged models. Keep in mind the LoRA ranks need to have identical ranks to be merged. + +Make sure the latest stable version of Diffusers and PEFT is installed. + +```bash +pip install -U -q diffusers peft +``` + +Load a UNET that corresponds to the LoRA UNet. + +```py +import copy +import torch +from diffusers import AutoModel, DiffusionPipeline +from peft import get_peft_model, LoraConfig, PeftModel + +unet = AutoModel.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", + subfolder="unet", +).to("cuda") +``` + +Load a pipeline, pass the UNet to it, and load a LoRA. + +```py +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + variant="fp16", + torch_dtype=torch.float16, + unet=unet +).to("cuda") +pipeline.load_lora_weights( + "ostris/ikea-instructions-lora-sdxl", + weight_name="ikea_instructions_xl_v1_5.safetensors", + adapter_name="ikea" +) +``` + +Create a [`~peft.PeftModel`] from the LoRA checkpoint by combining the first UNet you loaded and the LoRA UNet from the pipeline. + +```py +sdxl_unet = copy.deepcopy(unet) +ikea_peft_model = get_peft_model( + sdxl_unet, + pipeline.unet.peft_config["ikea"], + adapter_name="ikea" +) + +original_state_dict = {f"base_model.model.{k}": v for k, v in pipeline.unet.state_dict().items()} +ikea_peft_model.load_state_dict(original_state_dict, strict=True) +``` + +> [!TIP] +> You can save and reuse the `ikea_peft_model` by pushing it to the Hub as shown below. +> ```py +> ikea_peft_model.push_to_hub("ikea_peft_model", token=TOKEN) +> ``` + +Repeat this process and create a [`~peft.PeftModel`] for the second LoRA. + +```py +pipeline.delete_adapters("ikea") +sdxl_unet.delete_adapters("ikea") + +pipeline.load_lora_weights( + "lordjia/by-feng-zikai", + weight_name="fengzikai_v1.0_XL.safetensors", + adapter_name="feng" +) +pipeline.set_adapters(adapter_names="feng") + +feng_peft_model = get_peft_model( + sdxl_unet, + pipeline.unet.peft_config["feng"], + adapter_name="feng" +) + +original_state_dict = {f"base_model.model.{k}": v for k, v in pipe.unet.state_dict().items()} +feng_peft_model.load_state_dict(original_state_dict, strict=True) +``` + +Load a base UNet model and load the adapters. + +```py +base_unet = AutoModel.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16, + use_safetensors=True, + variant="fp16", + subfolder="unet", +).to("cuda") + +model = PeftModel.from_pretrained( + base_unet, + "stevhliu/ikea_peft_model", + use_safetensors=True, + subfolder="ikea", + adapter_name="ikea" +) +model.load_adapter( + "stevhliu/feng_peft_model", + use_safetensors=True, + subfolder="feng", + adapter_name="feng" +) +``` + +Merge the LoRAs with [`~peft.LoraModel.add_weighted_adapter`] and specify how you want to merge them with `combination_type`. 
The example below uses the `"dare_linear"` method (refer to this [blog post](https://huggingface.co/blog/peft_merging) to learn more about these merging methods), which randomly prunes some weights and then performs a weighted sum of the tensors based on the set weightage of each LoRA in `weights`. + +Activate the merged LoRAs with [`~loaders.PeftAdapterMixin.set_adapters`]. + +```py +model.add_weighted_adapter( + adapters=["ikea", "feng"], + combination_type="dare_linear", + weights=[1.0, 1.0], + adapter_name="ikea-feng" +) +model.set_adapters("ikea-feng") + +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + unet=model, + variant="fp16", + torch_dtype=torch.float16, +).to("cuda") +pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai").images[0] +``` + +
+<!-- generated image -->
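If you want to reuse the merged adapter later, it can also be saved with PEFT. This is a sketch under the assumption that the `ikea-feng` adapter created above is still loaded; the save path is illustrative and PEFT's `selected_adapters` argument is used to export only the merged adapter.

```py
# save only the merged "ikea-feng" adapter so it can be reloaded later with PeftModel.from_pretrained
model.save_pretrained("path/to/ikea-feng", selected_adapters=["ikea-feng"])
```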
+ ### fuse_lora -## torch.compile \ No newline at end of file +The [`~loaders.lora_base.LoraBaseMixin.fuse_lora`] method fuses the LoRA weights directly with the original UNet and text encoder weights of the underlying model. This reduces the overhead of loading the underlying model for each LoRA because it only loads the model once, which lowers memory usage and increases inference speed. + +```py +import torch +from diffusers import DiffusionPipeline + +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_lora_weights( + "ostris/ikea-instructions-lora-sdxl", + weight_name="ikea_instructions_xl_v1_5.safetensors", + adapter_name="ikea" +) +pipeline.load_lora_weights( + "lordjia/by-feng-zikai", + weight_name="fengzikai_v1.0_XL.safetensors", + adapter_name="feng" +) +pipeline.set_adapters(["ikea", "feng"], adapter_weights=[0.7, 0.8]) +``` + +Call [`~loaders.lora_base.LoraBaseMixin.fuse_lora`] to fuse them. The `lora_scale` parameter controls how much to scale the output by with the LoRA weights. It is important to make this adjustment now because passing `scale` to `cross_attention_kwargs` won't work in the pipeline. + +```py +pipeline.fuse_lora(adapter_names=["ikea", "feng"], lora_scale=1.0) +``` + +Unload the LoRA weights since they're already fused with the underlying model. Save the fused pipeline with either [`~DiffusionPipeline.save_pretrained`] to save it locally or [`~PushToHubMixin.push_to_hub`] to save it to the Hub. + + + + +```py +pipeline.unload_lora_weights() +pipeline.save_pretrained("path/to/fused-pipeline") +``` + + + + +```py +pipeline.unload_lora_weights() +pipeline.push_to_hub("fused-ikea-feng") +``` + + + + +The fused pipeline can now be quickly loaded for inference without requiring each LoRA to be separately loaded. + +```py +pipeline = DiffusionPipeline.from_pretrained( + "username/fused-ikea-feng", torch_dtype=torch.float16, +).to("cuda") +pipeline("A bowl of ramen shaped like a cute kawaii bear, by Feng Zikai").images[0] +``` + +Use [`~loaders.LoraLoaderMixin.unfuse_lora`] to restore the underlying models weights, for example, if you want to use a different `lora_scale` value. You can only unfuse if there is a single LoRA fused. For example, it won't work with the pipeline from above because there are multiple fused LoRAs. In these cases, you'll need to reload the entire model. + +```py +pipeline.unfuse_lora() +``` + +
+<!-- generated image -->
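If only one LoRA was fused, you can unfuse it and fuse it again to try a different `lora_scale` without reloading anything. The snippet below is a minimal sketch assuming a pipeline where only the `ikea` adapter was loaded and fused.

```py
# unfuse restores the original model weights but keeps the adapter loaded
pipeline.unfuse_lora()
# fuse again with a lower scale to weaken the LoRA's effect
pipeline.fuse_lora(adapter_names=["ikea"], lora_scale=0.6)
```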
+ +## Manage + +Diffusers provides several methods to help you manage working with LoRAs. These methods can be especially useful if you're working with multiple LoRAs. + +### set_adapters + +[`~loaders.PeftAdapterMixin.set_adapters`] also activates the current LoRA to use if there are multiple active LoRAs. This allows you to switch between different LoRAs by specifying their name. + +```py +import torch +from diffusers import DiffusionPipeline + +pipeline = DiffusionPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.load_lora_weights( + "ostris/ikea-instructions-lora-sdxl", + weight_name="ikea_instructions_xl_v1_5.safetensors", + adapter_name="ikea" +) +pipeline.load_lora_weights( + "lordjia/by-feng-zikai", + weight_name="fengzikai_v1.0_XL.safetensors", + adapter_name="feng" +) +# activates the feng LoRA instead of the ikea LoRA +pipeline.set_adapters("feng") +``` + +### save_lora_adapter + +Save an adapter with [`~loaders.PeftAdapterMixin.save_lora_adapter`]. + +```py +import torch +from diffusers import AutoPipelineForText2Image + +pipeline = AutoPipelineForText2Image.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + torch_dtype=torch.float16 +).to("cuda") +pipeline.unet.load_lora_adapter( + "jbilcke-hf/sdxl-cinematic-1", + weight_name="pytorch_lora_weights.safetensors", + adapter_name="cinematic" + prefix="unet" +) +pipeline.save_lora_adapter("path/to/save", adapter_name="cinematic") +``` + +### unload_lora_weights + +The [`~loaders.lora_base.LoraBaseMixin.unload_lora_weights`] method unloads any LoRA weights in the pipeline to restore the underlying model weights. + +```py +pipeline.unload_lora_weights() +``` + +### disable_lora + +The [`~loaders.PeftAdapterMixin.disable_lora`] method disables all LoRAs (but they're still kept on the pipeline) and restores the pipeline to the underlying model weights. + +```py +pipeline.disable_lora() +``` + +### get_active_adapters + +The [`~loaders.lora_base.LoraBaseMixin.get_active_adapters`] method returns a list of active LoRAs attached to a pipeline. + +```py +pipeline.get_active_adapters() +["cereal", "ikea"] +``` + +### get_list_adapters + +The [`~loaders.lora_base.LoraBaseMixin.get_list_adapters`] method returns the active LoRAs for each component in the pipeline. + +```py +pipeline.get_list_adapters() +{"unet": ["cereal", "ikea"], "text_encoder_2": ["cereal"]} +``` + +### delete_adapters + +The [`~loaders.PeftAdapterMixin.delete_adapters`] method completely removes a LoRA and its layers from a model. + +```py +pipeline.delete_adapters("ikea") +``` + +## Resources + +Browse the [LoRA Studio](https://lorastudio.co/models) for different LoRAs to use or you can upload your favorite LoRAs from Civitai to the Hub with the Space below. + + \ No newline at end of file From c6845db22db1354ce7cbcb5bcc2c079c73ec442e Mon Sep 17 00:00:00 2001 From: stevhliu Date: Wed, 23 Apr 2025 14:12:38 -0700 Subject: [PATCH 7/7] images --- docs/source/en/using-diffusers/ip_adapter.md | 169 ++++++++++++++++++- 1 file changed, 164 insertions(+), 5 deletions(-) diff --git a/docs/source/en/using-diffusers/ip_adapter.md b/docs/source/en/using-diffusers/ip_adapter.md index b9be2a41a563..ec5b1537cda1 100644 --- a/docs/source/en/using-diffusers/ip_adapter.md +++ b/docs/source/en/using-diffusers/ip_adapter.md @@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License. 
# IP-Adapter -[IP-Adapter](https://huggingface.co/papers/2308.06721) is a lightweight adapter designed to integrate image-based guidance into text-to-image diffusion models. The adapter uses an image encoder to extract image features that are passed to the newly added cross-attention layers in the UNet and fine-tuned. The original UNet model, and the existing cross-attention layers corresponding to text features, is frozen. Decoupling the cross-attention for image and text features enables more fine-grained and controllable generation. +[IP-Adapter](https://huggingface.co/papers/2308.06721) is a lightweight adapter designed to integrate image-based guidance with text-to-image diffusion models. The adapter uses an image encoder to extract image features that are passed to the newly added cross-attention layers in the UNet and fine-tuned. The original UNet model and the existing cross-attention layers corresponding to text features is frozen. Decoupling the cross-attention for image and text features enables more fine-grained and controllable generation. IP-Adapter files are typically ~100MBs because they only contain the image embeddings. This means you need to load a model first, and then load the IP-Adapter with [`~loaders.IPAdapterMixin.load_ip_adapter`]. @@ -46,6 +46,17 @@ pipeline( ).images[0] ``` +
+<!-- image comparison: IP-Adapter image | generated image -->
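When you no longer need image prompting, the adapter can be removed to restore the original pipeline. The call below is a small sketch using the pipeline from the example above.

```py
# remove the IP-Adapter weights and image projection layers from the pipeline
pipeline.unload_ip_adapter()
```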
+ Take a look at the examples below to learn how to use IP-Adapter for other tasks. @@ -77,6 +88,21 @@ pipeline( ).images[0] ``` +
+<!-- image comparison: input image | IP-Adapter image | generated image -->
+ @@ -107,10 +133,25 @@ pipeline( ).images[0] ``` +
+<!-- image comparison: input image | IP-Adapter image | generated image -->
+
-The [`~DiffusionPipeline.enable_model_cpu_offload`] method is useful for reducing memory, but you should enable it **after** the IP-Adapter is loaded. Otherwise, the IP-Adapter's image encoder is also offloaded to the CPU and returns an error. +The [`~DiffusionPipeline.enable_model_cpu_offload`] method is useful for reducing memory and it should be enabled **after** the IP-Adapter is loaded. Otherwise, the IP-Adapter's image encoder is also offloaded to the CPU and returns an error. ```py import torch @@ -151,6 +192,17 @@ pipeline( ).frames[0] ``` +
+<!-- image comparison: IP-Adapter image | generated video -->
+
@@ -301,6 +353,17 @@ processor = IPAdapterMaskProcessor() masks = processor.preprocess([mask1, mask2], height=1024, width=1024) ``` +
+<!-- image comparison: mask 1 | mask 2 -->
+ Provide both the IP-Adapter images and their scales as a list. Pass the preprocessed masks to `cross_attention_kwargs` in the pipeline. ```py @@ -325,6 +388,29 @@ pipeline( ).images[0] ``` +
+<!-- image comparison: IP-Adapter image 1 | IP-Adapter image 2 | generated image -->
+ ## Applications The section below covers some popular applications of IP-Adapter. @@ -365,6 +451,17 @@ pipeline( ).images[0] ``` +
+<!-- image comparison: IP-Adapter image | generated image -->
+ @@ -473,6 +570,17 @@ style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/ma style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)] ``` +
+<!-- image comparison: face image | style images -->
+ Pass style and face images as a list to `ip_adapter_image`. ```py @@ -485,11 +593,18 @@ pipeline( ).images[0] ``` +
+<!-- generated image -->
+ ### Instant generation [Latent Consistency Models (LCM)](../api/pipelines/latent_consistency_models) can generate images 4 steps or less, unlike other diffusion models which require a lot more steps, making it feel "instantaneous". IP-Adapters are compatible with LCM models to instantly generate images. -Load the IP-Adapter weights and load the LoRA weights with [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights]. +Load the IP-Adapter weights and load the LoRA weights with [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`]. ```py import torch @@ -512,7 +627,7 @@ pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config) pipeline.enable_model_cpu_offload() ``` -Try using a lower IP-Adapter scale to condition generation more on the style you want to apply, and remember to use the special token in your prompt to trigger its generation. +Try using a lower IP-Adapter scale to condition generation more on the style you want to apply and remember to use the special token in your prompt to trigger its generation. ```py pipeline.set_ip_adapter_scale(0.4) @@ -528,6 +643,13 @@ pipeline( ).images[0] ``` +
+<!-- generated image -->
+ ### Structural control For structural control, combine IP-Adapter with [ControlNet](../api/pipelines/controlnet) conditioned on depth maps, edge maps, pose estimations, and more. @@ -567,6 +689,21 @@ pipeline( ).images[0] ``` +
+<!-- image comparison: IP-Adapter image | depth map | generated image -->
+ ### Style and layout control For style and layout control, combine IP-Adapter with [InstantStyle](https://huggingface.co/papers/2404.02733). InstantStyle separates *style* (color, texture, overall feel) and *content* from each other. It only applies the style in style-specific blocks of the model to prevent it from distorting other areas of an image. This generates images with stronger and more consistent styles and better control over the layout. @@ -608,6 +745,17 @@ pipeline( ).images[0] ``` +
+<!-- image comparison: style image | generated image -->
+ You can also insert the IP-Adapter in all the model layers. This tends to generate images that focus more on the image prompt and may reduce the diversity of generated images. Only activate the IP-Adapter in up `block_0` or the style layer. > [!TIP] @@ -625,4 +773,15 @@ pipeline( negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry", guidance_scale=5, ).images[0] -``` \ No newline at end of file +``` + +
+<!-- image comparison: generated image (style layer only) | generated image (all layers) -->
\ No newline at end of file