Once upon a Diffision

In this study, we explore the application of generative models in dynamically creating descriptive images from fairy tales for educational purposes, aiming to enhance language learning and comprehension among children. We proposed a pipeline for adapting Clip-Guided diffusion model to fairy tale content, utilizing techniques like textual inversion to fine-tune the models. Additionally, we compare the performance of Clip-Guided diffusion model and Clip-Guided VQGAN in generating fairy tale descriptor images, evaluating their efficiency.

Feed forward VQGAN-CLIP

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt. This is done by training a model that takes as input a text prompt, and returns as an output the VQGAN latent space, which is then transformed into an RGB image. The model is trained on a dataset of text prompts and can be used on unseen text prompts. The loss function is minimizing the distance between the CLIP generated image features and the CLIP input text features. Additionally, a diversity loss can be used to make increase the diversity of the generated images given the same prompt.

CLIP-Guided Diffusion

This architecture is essentially a U-Net based on a Wide ResNet framework, with the substitution of group normalization for weight normalization to streamline implementation. Each model consists of two convolutional residual blocks per resolution level. Additionally, multi-head self-attention blocks are incorporated at the 16×16 and 8×8 resolution levels, positioned between the convolutional blocks.

Dataset

We have chosen the Old Book Illustrations dataset from Hugging Face. This dataset contains 4172 illustrations scanned from literature of 19-20th centuries and related metadata such as image title, book title, image description, image caption, illustration format and subject.

Pipeline for the fairytale project:

Output of our model:

Prompt: "Little Red Riding Hood with her bright red hood and innocent face wanders through the dark and eerie woods unaware of the danger lurking nearb"

Install dependencies.

conda create -n ff_vqgan_clip_env python=3.8
conda activate ff_vqgan_clip_env
# Install pytorch/torchvision - See https://pytorch.org/get-started/locally/ for more info.
(ff_vqgan_clip_env) conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
(ff_vqgan_clip_env) pip install -r requirements.txt

Train

Modify configs/example.yaml as needed.

$ (ff_vqgan_clip_venv) python main.py train configs/example.yaml

Test

`python -u main.py test cc12m_32x1024_vitgan.th "Prompt"`

Acknowledgements

Thanks to VitGAN authors, the VitGAN model is from https://github.com/wilile26811249/ViTGAN
Thanks to CompVis for Taming Transformers https://github.com/CompVis/taming-transformers, the code uses VQGAN pre-trained model and VGG16 feature space perceptual loss https://github.com/CompVis/taming-transformers/blob/master/taming/modules/losses/lpips.py
Thanks to OpenCLIP authors for CLIP-like code/models pre-trained on LAION-400M and LAION-2B

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
configs		configs
LICENSE		LICENSE
README.md		README.md
app.py		app.py
cog.yaml		cog.yaml
dataset.jpg		dataset.jpg
diffusion_model_output.jpg		diffusion_model_output.jpg
download_weights.py		download_weights.py
fairytale_pipeline.png		fairytale_pipeline.png
features.pkl		features.pkl
gen.png		gen.png
inference.py		inference.py
main.py		main.py
mlp_mixer_pytorch.py		mlp_mixer_pytorch.py
predict.py		predict.py
requirements.txt		requirements.txt
taming.zip		taming.zip
transformer.py		transformer.py
vitgan.py		vitgan.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Once upon a Diffision

Feed forward VQGAN-CLIP

CLIP-Guided Diffusion

Dataset

Pipeline for the fairytale project:

Output of our model:

Install dependencies.

Train

Test

Acknowledgements

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Once upon a Diffision

Feed forward VQGAN-CLIP

CLIP-Guided Diffusion

Dataset

Pipeline for the fairytale project:

Output of our model:

Install dependencies.

Train

Test

Acknowledgements

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages