I will break the Stable Diffusion pipeline down into its individual components.
- Stable Diffusion is a text-to-image latent diffusion model created by researchers and engineers from CompVis, Stability AI, and LAION.
- It is trained on 512x512 images from a subset of the LAION-5B database. LAION-5B is the largest freely accessible multi-modal dataset that currently exists.
- Paper: High-Resolution Image Synthesis with Latent Diffusion Models
- Diffusion models are machine learning systems trained to denoise random Gaussian noise step by step.
- A drawback of diffusion models is that the reverse denoising process is slow because of its repeated, sequential nature. In addition, these models consume a lot of memory because they operate in pixel space, which is high-dimensional.
- Stable Diffusion is a kind of latent diffusion model.
- Latent diffusion models reduce memory and compute complexity.
- A latent diffusion model runs the diffusion process in a lower-dimensional space (the latent space) rather than in pixel space, which avoids the heavy resource requirements; the sketch below compares the two dimensionalities.
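As a rough illustration of the savings (a minimal sketch; the 4x64x64 latent shape assumes the SD v1 VAE with 8x spatial downsampling):

```python
# Pixel space vs. latent space for a single 512x512 RGB image.
# Assumes the SD v1 VAE: 4 latent channels, 8x spatial downsampling.
pixel_dims = 3 * 512 * 512   # 786,432 values per image
latent_dims = 4 * 64 * 64    # 16,384 values per image
print(pixel_dims // latent_dims)  # 48 -> 48x fewer values for the UNet to denoise
```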
- Models: CLIP / UNet / VAE
- Other components: scheduler / latent vectors (the sketch below shows how to inspect them on a loaded pipeline)
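To see how these components are bundled, we can load a pipeline and list them (a minimal sketch; `runwayml/stable-diffusion-v1-5` is just an example checkpoint):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# `components` maps names to the bundled parts:
# tokenizer / text_encoder (CLIP), unet, vae, scheduler, ...
print(pipe.components.keys())
```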
- `StableDiffusionPipeline` inherits from `FromSingleFileMixin` to load a single checkpoint file into each model (CLIP, VAE, UNet); usage is sketched below.
- The `download_from_original_stable_diffusion_ckpt` function is defined for the `FromSingleFileMixin`.
- We can check the default models and checkpoint references in the `download_from_original_stable_diffusion_ckpt` function.
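A single-file load looks like this (a minimal sketch; the local `.safetensors` path is a hypothetical example):

```python
from diffusers import StableDiffusionPipeline

# `from_single_file` is provided by FromSingleFileMixin; the path is a placeholder.
pipe = StableDiffusionPipeline.from_single_file("./v1-5-pruned-emaonly.safetensors")
```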
- The CLIP model consists of a tokenizer for converting text to tokens and a text encoder for compressing the token information.
- CLIP is used to inject the text condition into the UNet's embedding space.
```python
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14") if tokenizer is None else tokenizer
# The text encoder also uses the openai/clip-vit-large-patch14 model.
```
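Putting both pieces together (a minimal sketch; the prompt is arbitrary):

```python
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Tokenize the prompt to fixed-length token ids (max length 77 for CLIP).
tokens = tokenizer(
    "a photograph of an astronaut riding a horse",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)
# Encode the tokens into text embeddings of shape (1, 77, 768).
text_embeddings = text_encoder(tokens.input_ids)[0]
```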
- The UNet takes two inputs: the noisy latents (Gaussian noise) and the text embeddings (from CLIP's text encoder). Its output is the predicted noise residual; a forward pass is sketched below.
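A single UNet forward pass (a minimal sketch; the random tensors stand in for real latents and CLIP embeddings, and `runwayml/stable-diffusion-v1-5` is just an example checkpoint):

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

latents = torch.randn(1, 4, 64, 64)        # noisy latents for a 512x512 image
text_embeddings = torch.randn(1, 77, 768)  # stand-in for CLIP text embeddings

with torch.no_grad():
    # The output is the predicted noise residual, with the same shape as the latents.
    noise_pred = unet(latents, timestep=10, encoder_hidden_states=text_embeddings).sample
```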
- The VAE's encoder is used to compress the image information into a lower-dimensional latent representation.
- The VAE's decoder is used to generate images from the latents produced by the UNet's denoising process.
- To be precise, the decoder is responsible for restoring the embedded image to the original image.
- The VAE exists to reduce the computational cost of generating high-resolution images; an encode/decode round trip is sketched below.
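An encode/decode round trip (a minimal sketch; the random tensor stands in for an image normalized to [-1, 1], and `runwayml/stable-diffusion-v1-5` is just an example checkpoint):

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for a normalized image

with torch.no_grad():
    # Encode 3x512x512 pixels into 4x64x64 latents, scaled for the diffusion process.
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    # Decode back to pixel space.
    reconstructed = vae.decode(latents / vae.config.scaling_factor).sample
```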
- A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together.
- All pipeline types inherit from the base `DiffusionPipeline` class.
- `DiffusionPipeline` itself inherits from `ConfigMixin`, the class that handles the pipeline's configuration; a full inference call is sketched below.
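End-to-end, generating an image is then a single call (a minimal sketch; `runwayml/stable-diffusion-v1-5` is just an example checkpoint and the prompt is arbitrary):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# Internally this runs CLIP encoding -> scheduler-driven UNet denoising -> VAE decoding.
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```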
```python
# https://github.dev/huggingface/diffusers/blob/965e52ce611108559a0ebab75c8b421d1229c5ab/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L72
class StableDiffusionPipeline(DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin):
    ...
```
Environment setup:

```sh
make env
conda activate 02-hack-diffusers-pipeline
make setup
```