Add EasyAnimateV5.1 text-to-video, image-to-video, control-to-video generation model #10626

Merged 34 commits on Mar 3, 2025 (changes shown from 32 commits)

Commits
5609fc2
Update EasyAnimate V5.1
bubbliiiing Jan 22, 2025
4978561
Merge branch 'main' of https://github.com/bubbliiiing/diffusers
bubbliiiing Jan 22, 2025
0b01118
Add docs && add tests && Fix comments problems in transformer3d and vae
bubbliiiing Feb 4, 2025
914f460
delete comments and remove useless import
bubbliiiing Feb 4, 2025
19fcc7d
delete process
bubbliiiing Feb 4, 2025
a7821be
Update EXAMPLE_DOC_STRING
bubbliiiing Feb 6, 2025
a6a5509
Merge branch 'main' into easyanimate
a-r-r-o-w Feb 14, 2025
6c0d81d
rename transformer file
a-r-r-o-w Feb 14, 2025
414bf8f
make fix-copies
a-r-r-o-w Feb 14, 2025
98602d8
make style
a-r-r-o-w Feb 14, 2025
02f8c26
refactor pt. 1
a-r-r-o-w Feb 14, 2025
d5b3db9
update toctree.yml
a-r-r-o-w Feb 14, 2025
c3eebb2
add model tests
a-r-r-o-w Feb 14, 2025
90ce00f
Update layer_norm for norm_added_q and norm_added_k in Attention
bubbliiiing Feb 21, 2025
301711b
Fix processor problem
bubbliiiing Feb 24, 2025
0f80373
refactor vae
a-r-r-o-w Feb 24, 2025
528d97e
Fix problem in comments
bubbliiiing Feb 25, 2025
2e1b4f5
Merge branch 'main' of github.com:bubbliiiing/diffusers
bubbliiiing Feb 25, 2025
9e8a249
Merge branch 'main' into easyanimate
a-r-r-o-w Feb 27, 2025
96ccfb5
Merge branch 'main' of https://github.com/bubbliiiing/diffusers into …
a-r-r-o-w Feb 27, 2025
58edc80
refactor tiling; remove einops dependency
a-r-r-o-w Feb 27, 2025
9f04fa1
fix docs path
a-r-r-o-w Feb 27, 2025
0451caf
make fix-copies
a-r-r-o-w Feb 27, 2025
7f1b78d
Update src/diffusers/pipelines/easyanimate/pipeline_easyanimate_contr…
a-r-r-o-w Feb 27, 2025
0059a35
update _toctree.yml
a-r-r-o-w Feb 27, 2025
ca252f6
Merge branch 'main' into main
yiyixuxu Mar 1, 2025
8b616d2
fix test
a-r-r-o-w Mar 1, 2025
905cd50
update
a-r-r-o-w Mar 1, 2025
5c7d8ab
update
a-r-r-o-w Mar 1, 2025
b4e73ba
update
a-r-r-o-w Mar 1, 2025
e511864
make fix-copies
a-r-r-o-w Mar 1, 2025
fce2f9e
Merge branch 'main' into main
DN6 Mar 3, 2025
856231c
fix tests
a-r-r-o-w Mar 3, 2025
f4c810a
Merge branch 'main' into main
a-r-r-o-w Mar 3, 2025
6 changes: 6 additions & 0 deletions docs/source/en/_toctree.yml
@@ -290,6 +290,8 @@
title: CogView4Transformer2DModel
- local: api/models/dit_transformer2d
title: DiTTransformer2DModel
- local: api/models/easyanimate_transformer3d
title: EasyAnimateTransformer3DModel
- local: api/models/flux_transformer
title: FluxTransformer2DModel
- local: api/models/hunyuan_transformer2d
@@ -352,6 +354,8 @@
title: AutoencoderKLHunyuanVideo
- local: api/models/autoencoderkl_ltx_video
title: AutoencoderKLLTXVideo
- local: api/models/autoencoderkl_magvit
title: AutoencoderKLMagvit
- local: api/models/autoencoderkl_mochi
title: AutoencoderKLMochi
- local: api/models/autoencoder_kl_wan
@@ -430,6 +434,8 @@
title: DiffEdit
- local: api/pipelines/dit
title: DiT
- local: api/pipelines/easyanimate
title: EasyAnimate
- local: api/pipelines/flux
title: Flux
- local: api/pipelines/control_flux_inpaint
37 changes: 37 additions & 0 deletions docs/source/en/api/models/autoencoderkl_magvit.md
@@ -0,0 +1,37 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# AutoencoderKLMagvit

The 3D variational autoencoder (VAE) model with KL loss used in [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) was introduced by Alibaba PAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import AutoencoderKLMagvit

vae = AutoencoderKLMagvit.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="vae", torch_dtype=torch.float16).to("cuda")
```
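
Beyond loading, here is a minimal encode/decode round-trip sketch. The `(batch, channels, frames, height, width)` layout, frame count, and resolution below are illustrative assumptions rather than documented requirements:

```python
import torch
from diffusers import AutoencoderKLMagvit

vae = AutoencoderKLMagvit.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="vae", torch_dtype=torch.float16
).to("cuda")

# Dummy pixel-space video: 1 sample, 3 channels, 9 frames, 256x256 (assumed layout)
video = torch.randn(1, 3, 9, 256, 256, dtype=torch.float16, device="cuda")

with torch.no_grad():
    latents = vae.encode(video).latent_dist.sample()  # encode() returns an AutoencoderKLOutput
    reconstruction = vae.decode(latents).sample  # decode() returns a DecoderOutput
```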

## AutoencoderKLMagvit

[[autodoc]] AutoencoderKLMagvit
- decode
- encode
- all

## AutoencoderKLOutput

[[autodoc]] models.autoencoders.autoencoder_kl.AutoencoderKLOutput

## DecoderOutput

[[autodoc]] models.autoencoders.vae.DecoderOutput
30 changes: 30 additions & 0 deletions docs/source/en/api/models/easyanimate_transformer3d.md
@@ -0,0 +1,30 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# EasyAnimateTransformer3DModel

A Diffusion Transformer model for 3D data from [EasyAnimate](https://github.com/aigc-apps/EasyAnimate) was introduced by Alibaba PAI.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import EasyAnimateTransformer3DModel

transformer = EasyAnimateTransformer3DModel.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
```
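
The transformer can also be passed directly to [`EasyAnimatePipeline`], for example after quantizing or otherwise modifying it. A minimal sketch, continuing from the snippet above:

```python
from diffusers import EasyAnimatePipeline

# Reuse the transformer loaded above inside the full pipeline (illustrative).
pipe = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    transformer=transformer,
    torch_dtype=torch.float16,
)
```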

## EasyAnimateTransformer3DModel

[[autodoc]] EasyAnimateTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
88 changes: 88 additions & 0 deletions docs/source/en/api/pipelines/easyanimate.md
@@ -0,0 +1,88 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# EasyAnimate
[EasyAnimate](https://github.com/aigc-apps/EasyAnimate) by Alibaba PAI.

The description from its GitHub page:
*EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.*

This pipeline was contributed by [bubbliiiing](https://github.com/bubbliiiing). The original codebase can be found at [aigc-apps/EasyAnimate](https://github.com/aigc-apps/EasyAnimate). The original weights can be found under [hf.co/alibaba-pai](https://huggingface.co/alibaba-pai).

There are two official EasyAnimate checkpoints for text-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh) | torch.float16 |
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |

There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-InP) | torch.float16 |

There are two official EasyAnimate checkpoints available for control-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control) | torch.float16 |
| [`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`](https://huggingface.co/alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera) | torch.float16 |

For the EasyAnimateV5.1 series:
- Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions; the width and height can vary from 256 to 1024.
- Both T2V and I2V models support generation with 1 to 49 frames and work best within this range. Exporting videos at 8 FPS is recommended (a minimal sketch follows below).
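
The following is a minimal text-to-video sketch based on the quantized example further below; the prompt and settings are illustrative, and `enable_model_cpu_offload` is assumed here to help the 12B model fit on a single GPU:

```py
import torch
from diffusers import EasyAnimatePipeline
from diffusers.utils import export_to_video

pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()  # offload idle submodules to CPU to reduce VRAM usage

prompt = "A cat walks on the grass, realistic style."
video = pipeline(prompt=prompt, num_frames=49, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=8)
```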

## Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [`EasyAnimatePipeline`] for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline
from diffusers.utils import export_to_video

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A cat walks on the grass, realistic style."
negative_prompt = "bad detailed"
video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=8)
```

## EasyAnimatePipeline

[[autodoc]] EasyAnimatePipeline
- all
- __call__

## EasyAnimatePipelineOutput

[[autodoc]] pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput
10 changes: 10 additions & 0 deletions src/diffusers/__init__.py
@@ -94,6 +94,7 @@
"AutoencoderKLCogVideoX",
"AutoencoderKLHunyuanVideo",
"AutoencoderKLLTXVideo",
"AutoencoderKLMagvit",
"AutoencoderKLMochi",
"AutoencoderKLTemporalDecoder",
"AutoencoderKLWan",
@@ -109,6 +110,7 @@
"ControlNetUnionModel",
"ControlNetXSAdapter",
"DiTTransformer2DModel",
"EasyAnimateTransformer3DModel",
"FluxControlNetModel",
"FluxMultiControlNetModel",
"FluxTransformer2DModel",
@@ -293,6 +295,9 @@
"CogView4Pipeline",
"ConsisIDPipeline",
"CycleDiffusionPipeline",
"EasyAnimateControlPipeline",
"EasyAnimateInpaintPipeline",
"EasyAnimatePipeline",
"FluxControlImg2ImgPipeline",
"FluxControlInpaintPipeline",
"FluxControlNetImg2ImgPipeline",
@@ -620,6 +625,7 @@
AutoencoderKLCogVideoX,
AutoencoderKLHunyuanVideo,
AutoencoderKLLTXVideo,
AutoencoderKLMagvit,
AutoencoderKLMochi,
AutoencoderKLTemporalDecoder,
AutoencoderKLWan,
@@ -635,6 +641,7 @@
ControlNetUnionModel,
ControlNetXSAdapter,
DiTTransformer2DModel,
EasyAnimateTransformer3DModel,
FluxControlNetModel,
FluxMultiControlNetModel,
FluxTransformer2DModel,
@@ -798,6 +805,9 @@
CogView4Pipeline,
ConsisIDPipeline,
CycleDiffusionPipeline,
EasyAnimateControlPipeline,
EasyAnimateInpaintPipeline,
EasyAnimatePipeline,
FluxControlImg2ImgPipeline,
FluxControlInpaintPipeline,
FluxControlNetImg2ImgPipeline,
4 changes: 4 additions & 0 deletions src/diffusers/models/__init__.py
100644 → 100755
@@ -33,6 +33,7 @@
_import_structure["autoencoders.autoencoder_kl_cogvideox"] = ["AutoencoderKLCogVideoX"]
_import_structure["autoencoders.autoencoder_kl_hunyuan_video"] = ["AutoencoderKLHunyuanVideo"]
_import_structure["autoencoders.autoencoder_kl_ltx"] = ["AutoencoderKLLTXVideo"]
_import_structure["autoencoders.autoencoder_kl_magvit"] = ["AutoencoderKLMagvit"]
_import_structure["autoencoders.autoencoder_kl_mochi"] = ["AutoencoderKLMochi"]
_import_structure["autoencoders.autoencoder_kl_temporal_decoder"] = ["AutoencoderKLTemporalDecoder"]
_import_structure["autoencoders.autoencoder_kl_wan"] = ["AutoencoderKLWan"]
@@ -72,6 +73,7 @@
_import_structure["transformers.transformer_allegro"] = ["AllegroTransformer3DModel"]
_import_structure["transformers.transformer_cogview3plus"] = ["CogView3PlusTransformer2DModel"]
_import_structure["transformers.transformer_cogview4"] = ["CogView4Transformer2DModel"]
_import_structure["transformers.transformer_easyanimate"] = ["EasyAnimateTransformer3DModel"]
_import_structure["transformers.transformer_flux"] = ["FluxTransformer2DModel"]
_import_structure["transformers.transformer_hunyuan_video"] = ["HunyuanVideoTransformer3DModel"]
_import_structure["transformers.transformer_ltx"] = ["LTXVideoTransformer3DModel"]
@@ -109,6 +111,7 @@
AutoencoderKLCogVideoX,
AutoencoderKLHunyuanVideo,
AutoencoderKLLTXVideo,
AutoencoderKLMagvit,
AutoencoderKLMochi,
AutoencoderKLTemporalDecoder,
AutoencoderKLWan,
@@ -144,6 +147,7 @@
ConsisIDTransformer3DModel,
DiTTransformer2DModel,
DualTransformer2DModel,
EasyAnimateTransformer3DModel,
FluxTransformer2DModel,
HunyuanDiT2DModel,
HunyuanVideoTransformer3DModel,
5 changes: 4 additions & 1 deletion src/diffusers/models/attention_processor.py
100644 → 100755
@@ -274,7 +274,10 @@ def __init__(
            self.to_add_out = None

        if qk_norm is not None and added_kv_proj_dim is not None:
-           if qk_norm == "fp32_layer_norm":
+           if qk_norm == "layer_norm":
+               self.norm_added_q = nn.LayerNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
+               self.norm_added_k = nn.LayerNorm(dim_head, eps=eps, elementwise_affine=elementwise_affine)
+           elif qk_norm == "fp32_layer_norm":
                self.norm_added_q = FP32LayerNorm(dim_head, elementwise_affine=False, bias=False, eps=eps)
                self.norm_added_k = FP32LayerNorm(dim_head, elementwise_affine=False, bias=False, eps=eps)
            elif qk_norm == "rms_norm":
1 change: 1 addition & 0 deletions src/diffusers/models/autoencoders/__init__.py
@@ -5,6 +5,7 @@
from .autoencoder_kl_cogvideox import AutoencoderKLCogVideoX
from .autoencoder_kl_hunyuan_video import AutoencoderKLHunyuanVideo
from .autoencoder_kl_ltx import AutoencoderKLLTXVideo
from .autoencoder_kl_magvit import AutoencoderKLMagvit
from .autoencoder_kl_mochi import AutoencoderKLMochi
from .autoencoder_kl_temporal_decoder import AutoencoderKLTemporalDecoder
from .autoencoder_kl_wan import AutoencoderKLWan