Commit 59c0d48

release 61-frame model
1 parent c09444b commit 59c0d48

45 files changed, +8586 -1 lines changed

.gitignore

+1
@@ -0,0 +1 @@
1+
resources/

README.md

+123-1
@@ -1,2 +1,124 @@
11
# FancyVideo
2-
This is the official reproduction of FancyVideo.
2+
3+
This repository is the official implementation of [FancyVideo](https://360cvgroup.github.io/FancyVideo/).
4+
5+
**[FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance](https://arxiv.org/abs/2408.08189)**
6+
<br>
7+
Jiasong Feng*, Ao Ma*, Jing Wang*, Bo Cheng, Xiaodan Liang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
8+
<br>
9+
[![arXiv](https://img.shields.io/badge/arXiv-2408.08189-b31b1b.svg)](https://arxiv.org/abs/2408.08189)
10+
[![Project Page](https://img.shields.io/badge/Project-Website-green)](https://360cvgroup.github.io/FancyVideo/)
11+
12+
Our code builds upon [AnimateDiff](https://github.com/guoyww/AnimateDiff), and we also incorporate insights from [CV-VAE](https://github.com/AILab-CVC/CV-VAE), [Res-Adapter](https://github.com/bytedance/res-adapter), and [Long-CLIP](https://github.com/beichenzbc/Long-CLIP) to enhance our project. We appreciate the open-source contributions of these works.
13+
14+
15+
## 🔥 News
16+
- **[2024/08/19]** We initialized this GitHub repository and released the inference code and the 61-frame model.
17+
- **[2024/08/15]** We released the [FancyVideo](https://arxiv.org/abs/2408.08189) paper.
18+
19+
20+
## Quick Demos
21+
Video demos are available on the [project page](https://360cvgroup.github.io/FancyVideo/); some of them were contributed by the community. You can generate your own videos with the inference code described below.
22+
23+
24+
## Quick Start
25+
### 0. Experimental environment
26+
We tested the inference code on a machine with an NVIDIA RTX 3090 GPU (24 GB) and CUDA 12.1.
27+
28+
### 1. Setup repository and environment
29+
```
30+
git clone https://github.com/360CVGroup/FancyVideo.git
31+
cd FancyVideo
32+
33+
conda create -n fancyvideo python=3.10
34+
conda activate fancyvideo
35+
pip install -r requirements.txt
36+
```
37+
38+
### 2. Prepare the models
39+
```
40+
mkdir -p resources/models
41+
42+
# fancyvideo-ckpts
43+
wget -O resources/models/fancyvideo_ckpts.zip "https://drive.google.com/uc?export=download&id=1m4UqKVQ3POI5ei1A9yppHX_H--8PKMtn"
44+
unzip resources/models/fancyvideo_ckpts.zip -d resources/models
45+
46+
# cv-vae
47+
wget -O resources/models/CV-VAE.zip "https://drive.google.com/uc?export=download&id=1Xal1fxVbVWf0jjiPK5gb_1-lOh0w8G_r"
48+
unzip resources/models/CV-VAE.zip -d resources/models
49+
50+
# res-adapter
51+
wget -O resources/models/res-adapter.zip "https://drive.google.com/uc?export=download&id=18EawVd1HJtrQds703sLqoYZtLfbUgLm4"
52+
unzip resources/models/res-adapter.zip -d resources/models
53+
54+
# longclip
55+
wget -O resources/models/LongCLIP-L.zip "https://drive.google.com/uc?export=download&id=1-DDPcbAbmGZJPHsdl1PgFMVtxmOnUtc7"
56+
unzip resources/models/LongCLIP-L.zip -d resources/models
57+
58+
# sdv1.5-base-models (you can also download them from civitai.com)
59+
wget -O resources/models/sd_v1-5_base_models.zip "https://drive.google.com/uc?export=download&id=1pxrAVT8OQKyyyW2WgImqEQrectbIpkBH"
60+
unzip resources/models/sd_v1-5_base_models.zip -d resources/models
61+
62+
# stable-diffusion-v1-5
63+
git lfs install
64+
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 resources/models/stable-diffusion-v1-5
65+
```
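Google Drive can answer a direct `wget` request for a large file with an HTML confirmation page instead of the archive, in which case `unzip` fails. The following is a minimal sketch of a check that each downloaded file is really a zip archive before it is extracted. The helper names `extract_if_valid` and `extract_all`, and the choice to extract into `resources/models`, are assumptions for illustration and are not part of the repository.

```python
import zipfile
from pathlib import Path

# Archive names copied from the wget commands above.
ARCHIVES = [
    "fancyvideo_ckpts.zip",
    "CV-VAE.zip",
    "res-adapter.zip",
    "LongCLIP-L.zip",
    "sd_v1-5_base_models.zip",
]


def extract_if_valid(archive_path, target_dir):
    """Extract archive_path into target_dir; return False if it is not a zip file."""
    archive_path = Path(archive_path)
    if not archive_path.is_file() or not zipfile.is_zipfile(archive_path):
        return False
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target_dir)
    return True


def extract_all(models_root="resources/models"):
    """Try every archive and return the names that could not be extracted."""
    root = Path(models_root)
    return [name for name in ARCHIVES if not extract_if_valid(root / name, root)]
```

Any name returned by `extract_all()` points to a download that is missing or is not a zip archive and should be fetched again, for example through a browser.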
66+
After downloading the models, your resources folder should look like this:
67+
```
68+
📦 resources/
69+
├── 📂 models/
70+
│ └── 📂 fancyvideo_ckpts/
71+
│ └── 📂 CV-VAE/
72+
│ └── 📂 res-adapter/
73+
│ └── 📂 LongCLIP-L/
74+
│ └── 📂 sd_v1-5_base_models/
75+
│ └── 📂 stable-diffusion-v1-5/
76+
```
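A quick way to confirm the layout above before running inference is to check that each expected folder exists. This is a minimal sketch: the folder names are copied from the tree, while the helper `missing_model_dirs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names copied from the tree above.
EXPECTED_MODEL_DIRS = [
    "fancyvideo_ckpts",
    "CV-VAE",
    "res-adapter",
    "LongCLIP-L",
    "sd_v1-5_base_models",
    "stable-diffusion-v1-5",
]


def missing_model_dirs(models_root="resources/models"):
    """Return the expected sub-folders that are absent under models_root."""
    root = Path(models_root)
    return [name for name in EXPECTED_MODEL_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_model_dirs()
    if missing:
        print("Missing model folders:", ", ".join(missing))
    else:
        print("All expected model folders are present.")
```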
77+
78+
### 3. Customize your own videos
79+
#### 3.1 Image to Video
80+
Due to the limited image generation capabilities of the SD1.5 model, we recommend generating the initial frame using a more advanced T2I model, such as SDXL, and then using our model's I2V capabilities to create the video.
81+
```
82+
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=./ python scripts/demo.py --config configs/inference/i2v.yaml
83+
```
84+
#### 3.2 Text to Video with different base models
85+
Our model has general T2V capabilities and can be combined with community SD1.5 base models.
86+
```
87+
# use the base model of pixars
88+
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=./ python scripts/demo.py --config configs/inference/t2v_pixars.yaml
89+
90+
# use the base model of realcartoon3d
91+
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=./ python scripts/demo.py --config configs/inference/t2v_realcartoon3d.yaml
92+
93+
# use the base model of toonyou
94+
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=./ python scripts/demo.py --config configs/inference/t2v_toonyou.yaml
95+
```
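The three commands above can also be run back to back from Python. This is a minimal sketch that mirrors the shell invocation, with the same environment variables, script path, and config paths. The helper names and the `dry_run` flag are assumptions for illustration.

```python
import os
import subprocess
import sys

# Config paths copied from the commands above.
CONFIGS = [
    "configs/inference/t2v_pixars.yaml",
    "configs/inference/t2v_realcartoon3d.yaml",
    "configs/inference/t2v_toonyou.yaml",
]


def build_command(config_path):
    """Return the argument list equivalent to `python scripts/demo.py --config ...`."""
    return [sys.executable, "scripts/demo.py", "--config", config_path]


def build_env(gpu_id=0):
    """Copy the current environment and set the two variables used in the README."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    env["PYTHONPATH"] = "./"
    return env


def run_all(configs=CONFIGS, gpu_id=0, dry_run=True):
    """Print each command (dry run) or execute it, stopping at the first failure."""
    for config_path in configs:
        cmd = build_command(config_path)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, env=build_env(gpu_id), check=True)
```

Call `run_all(dry_run=False)` from the repository root to run the three configs one after another.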
96+
97+
98+
## Reference
99+
- AnimateDiff: https://github.com/guoyww/AnimateDiff
100+
- CV-VAE: https://github.com/AILab-CVC/CV-VAE
101+
- Res-Adapter: https://github.com/bytedance/res-adapter
102+
- Long-CLIP: https://github.com/beichenzbc/Long-CLIP
103+
104+
105+
## We Are Hiring
106+
We are seeking academic interns in the AIGC field. If interested, please send your resume to [[email protected]](mailto:[email protected]).
107+
108+
109+
## BibTeX
110+
```
111+
@misc{feng2024fancyvideodynamicconsistentvideo,
112+
title={FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance},
113+
author={Jiasong Feng and Ao Ma and Jing Wang and Bo Cheng and Xiaodan Liang and Dawei Leng and Yuhui Yin},
114+
year={2024},
115+
eprint={2408.08189},
116+
archivePrefix={arXiv},
117+
primaryClass={cs.CV},
118+
url={https://arxiv.org/abs/2408.08189},
119+
}
120+
```
121+
122+
123+
## License
124+
This project is licensed under the [Apache License (Version 2.0)](https://github.com/modelscope/modelscope/blob/master/LICENSE).

configs/inference/i2v.yaml

+26
@@ -0,0 +1,26 @@
1+
model:
2+
base_model_type: "realisticVisionV60B1_v51VAE"
3+
model_path: "resources/models"
4+
text_to_video_mm_path: "resources/models/fancyvideo_ckpts/vae_3d_61_frames/mp_rank_00_model_states.pt"
5+
base_model_path: "resources/models/sd_v1-5_base_models/realisticVisionV60B1_v51VAE.safetensors"
6+
res_adapter_type: "res_adapter_v2"
7+
trained_keys: ["motion_modules.", "conv_in.weight", "fps_embedding.", "motion_embedding."]
8+
vae_type: "vae_3d"
9+
use_fps_embedding: true
10+
use_motion_embedding: true
11+
common_positive_prompt: "Best quality, masterpiece, ultra high res, photorealistic, Ultra realistic illustration, hyperrealistic, 8k"
12+
common_negative_prompt: "(low quality:1.3), (worst quality:1.3),poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face,Facial blurring,a large crowd, many people,advertising, information, news, watermark, text, username, signature,out of frame, low res, error, cropped, worst quality, low quality, artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, nsfw, breast, naked, eroticism"
13+
14+
inference:
15+
infer_mode: "i2v"
16+
resolution: [768, 768]
17+
video_length: 16
18+
output_fps: 25
19+
cond_fps: 25
20+
cond_motion_score: 3.0
21+
use_noise_scheduler_snr: true
22+
seed: 22
23+
prompt_path: "resources/demos/test_prompts/test_i2v_prompt.txt"
24+
reference_image_folder: "resources/demos/reference_images/768x768"
25+
output_folder: "resources/demos/samples/i2v/realisticVisionV60B1_v51VAE/768x768"
26+
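The `trained_keys` entry in this config lists fragments of parameter names. One plausible reading, which is an assumption and is not confirmed by this diff, is that only checkpoint entries whose names contain one of these fragments are loaded on top of the base model. A minimal sketch of such a filter follows, with a plain dict standing in for a real state dict. The function name `select_trained` and the sample parameter names are hypothetical.

```python
# Fragments copied from the `trained_keys` entry of the config.
TRAINED_KEYS = ["motion_modules.", "conv_in.weight", "fps_embedding.", "motion_embedding."]


def select_trained(state_dict, trained_keys=TRAINED_KEYS):
    """Keep only entries whose name contains at least one trained-key fragment."""
    return {
        name: value
        for name, value in state_dict.items()
        if any(fragment in name for fragment in trained_keys)
    }


# Hypothetical parameter names, used only to illustrate the matching rule.
example = {
    "down_blocks.0.motion_modules.0.proj.weight": 1,
    "conv_in.weight": 2,
    "conv_in.bias": 3,
    "fps_embedding.linear_1.weight": 4,
    "mid_block.attentions.0.proj_in.weight": 5,
}
selected = select_trained(example)
```

Under this reading, `conv_in.bias` and the attention projection are not selected, because neither name contains any of the listed fragments.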

configs/inference/t2v_pixars.yaml

+26
@@ -0,0 +1,26 @@
1+
model:
2+
base_model_type: "pixarsRendermanInspo_mk1"
3+
model_path: "resources/models"
4+
text_to_video_mm_path: "resources/models/fancyvideo_ckpts/vae_3d_61_frames/mp_rank_00_model_states.pt"
5+
base_model_path: "resources/models/sd_v1-5_base_models/pixarsRendermanInspo_mk1.safetensors"
6+
res_adapter_type: "res_adapter_v2"
7+
trained_keys: ["motion_modules.", "conv_in.weight", "fps_embedding.", "motion_embedding."]
8+
vae_type: "vae_3d"
9+
use_fps_embedding: true
10+
use_motion_embedding: true
11+
common_positive_prompt: "Best quality, masterpiece, ultra high res, photorealistic, Ultra realistic illustration, hyperrealistic, 8k"
12+
common_negative_prompt: "(low quality:1.3), (worst quality:1.3),poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face,Facial blurring,a large crowd, many people,advertising, information, news, watermark, text, username, signature,out of frame, low res, error, cropped, worst quality, low quality, artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, nsfw, breast, naked, eroticism"
13+
14+
inference:
15+
infer_mode: "t2v"
16+
resolution: [768, 768]
17+
video_length: 16
18+
output_fps: 25
19+
cond_fps: 25
20+
cond_motion_score: 3.0
21+
use_noise_scheduler_snr: true
22+
seed: 22
23+
prompt_path: "resources/demos/test_prompts/pixarsRendermanInspo_mk1.txt"
24+
reference_image_folder: "resources/demos/reference_images/768x768"
25+
output_folder: "resources/demos/samples/t2v/pixarsRendermanInspo_mk1/768x768"
26+

configs/inference/t2v_realcartoon3d.yaml

+26
@@ -0,0 +1,26 @@
1+
model:
2+
base_model_type: "realcartoon3d_v15"
3+
model_path: "resources/models"
4+
text_to_video_mm_path: "resources/models/fancyvideo_ckpts/vae_3d_61_frames/mp_rank_00_model_states.pt"
5+
base_model_path: "resources/models/sd_v1-5_base_models/realcartoon3d_v15.safetensors"
6+
res_adapter_type: "res_adapter_v2"
7+
trained_keys: ["motion_modules.", "conv_in.weight", "fps_embedding.", "motion_embedding."]
8+
vae_type: "vae_3d"
9+
use_fps_embedding: true
10+
use_motion_embedding: true
11+
common_positive_prompt: "Best quality, masterpiece, ultra high res, photorealistic, Ultra realistic illustration, hyperrealistic, 8k"
12+
common_negative_prompt: "(low quality:1.3), (worst quality:1.3),poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face,Facial blurring,a large crowd, many people,advertising, information, news, watermark, text, username, signature,out of frame, low res, error, cropped, worst quality, low quality, artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, nsfw, breast, naked, eroticism"
13+
14+
inference:
15+
infer_mode: "t2v"
16+
resolution: [768, 768]
17+
video_length: 16
18+
output_fps: 25
19+
cond_fps: 25
20+
cond_motion_score: 3.0
21+
use_noise_scheduler_snr: true
22+
seed: 22
23+
prompt_path: "resources/demos/test_prompts/realcartoon3d_v15.txt"
24+
reference_image_folder: "resources/demos/reference_images/768x768"
25+
output_folder: "resources/demos/samples/t2v/realcartoon3d_v15/768x768"
26+

configs/inference/t2v_toonyou.yaml

+26
@@ -0,0 +1,26 @@
1+
model:
2+
base_model_type: "toonyou_beta3"
3+
model_path: "resources/models"
4+
text_to_video_mm_path: "resources/models/fancyvideo_ckpts/vae_3d_61_frames/mp_rank_00_model_states.pt"
5+
base_model_path: "resources/models/sd_v1-5_base_models/toonyou_beta3.safetensors"
6+
res_adapter_type: "res_adapter_v2"
7+
trained_keys: ["motion_modules.", "conv_in.weight", "fps_embedding.", "motion_embedding."]
8+
vae_type: "vae_3d"
9+
use_fps_embedding: true
10+
use_motion_embedding: true
11+
common_positive_prompt: "Best quality, masterpiece, ultra high res, photorealistic, Ultra realistic illustration, hyperrealistic, 8k"
12+
common_negative_prompt: "(low quality:1.3), (worst quality:1.3),poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face,Facial blurring,a large crowd, many people,advertising, information, news, watermark, text, username, signature,out of frame, low res, error, cropped, worst quality, low quality, artifacts, ugly, duplicate, morbid, mutilated, extra fingers, mutated hands, poorly drawn hands, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, nsfw, breast, naked, eroticism"
13+
14+
inference:
15+
infer_mode: "t2v"
16+
resolution: [768, 768]
17+
video_length: 16
18+
output_fps: 25
19+
cond_fps: 25
20+
cond_motion_score: 3.0
21+
use_noise_scheduler_snr: true
22+
seed: 22
23+
prompt_path: "resources/demos/test_prompts/toonyou_beta3.txt"
24+
reference_image_folder: "resources/demos/reference_images/768x768"
25+
output_folder: "resources/demos/samples/t2v/toonyou_beta3/768x768"
26+
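The three T2V configs in this commit differ only in `base_model_type`, `base_model_path`, `prompt_path`, and `output_folder`, and each of those values follows from the base model name. A sketch that derives the four fields from a name, with the path patterns copied from the configs; the function `t2v_overrides` is hypothetical and not part of the repository.

```python
def t2v_overrides(base_model_type, resolution=(768, 768)):
    """Build the four config fields that vary between the T2V configs."""
    size = f"{resolution[0]}x{resolution[1]}"
    return {
        "base_model_type": base_model_type,
        "base_model_path": f"resources/models/sd_v1-5_base_models/{base_model_type}.safetensors",
        "prompt_path": f"resources/demos/test_prompts/{base_model_type}.txt",
        "output_folder": f"resources/demos/samples/t2v/{base_model_type}/{size}",
    }


# Base model names copied from the three T2V configs.
for name in ["pixarsRendermanInspo_mk1", "realcartoon3d_v15", "toonyou_beta3"]:
    print(t2v_overrides(name)["output_folder"])
```

Adding another SD1.5 community checkpoint would then amount to placing its `.safetensors` file and prompt file at the derived paths and copying an existing config with these four values replaced.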
6 binary files not shown.
