Question: Reusing VAE #107

Open
wangjia184 opened this issue Jan 31, 2025 · 0 comments

Comments

@wangjia184

Hi

Thanks for this project; it is amazing work for real-time video generation.

I am hoping to reuse the high-compression VAE from this project to build a real-time talking-face model.
My idea is to train a model that outputs latent frames, which are then decoded by the VAE.
But when I looked into the code, I noticed that noise at a scale of 0.05 is mixed into the latents before decoding:

    # Draw Gaussian noise with the same shape as the latents
    noise = torch.randn_like(latents)

    # Broadcast scalar arguments to one value per batch element
    if not isinstance(decode_timestep, list):
        decode_timestep = [decode_timestep] * latents.shape[0]
    if decode_noise_scale is None:
        decode_noise_scale = decode_timestep
    elif not isinstance(decode_noise_scale, list):
        decode_noise_scale = [decode_noise_scale] * latents.shape[0]

    decode_timestep = torch.tensor(decode_timestep).to(latents.device)
    # Reshape to (B, 1, 1, 1, 1) so it broadcasts over (B, C, F, H, W) latents
    decode_noise_scale = torch.tensor(decode_noise_scale).to(latents.device)[
        :, None, None, None, None
    ]
    # Linearly mix noise into the latents before handing them to the decoder
    latents = latents * (1 - decode_noise_scale) + noise * decode_noise_scale
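
To make my intent concrete, the decode path I imagine looks roughly like this (a minimal sketch only; the vae.decode(latents, timestep, ...) call is my assumption from reading the pipeline, so the exact signature may well differ):

    import torch

    @torch.no_grad()
    def decode_latents(vae, latents, t=0.05, generator=None):
        # Mix Gaussian noise into the latents at scale t, exactly as in
        # the snippet above: x = (1 - t) * latents + t * noise
        noise = torch.randn(
            latents.shape, generator=generator,
            device=latents.device, dtype=latents.dtype,
        )
        latents = latents * (1 - t) + noise * t

        # Condition the decoder on the same timestep so it performs the
        # last denoising step while converting latents to pixels.
        # NOTE: assumed decode signature; please correct me if the
        # project's actual API differs.
        timestep = torch.full((latents.shape[0],), t, device=latents.device)
        return vae.decode(latents, timestep, return_dict=False)[0]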

The paper also describes this design:

In contrast, we propose tasking the VAE decoder with performing the last denoising step in conjunction with converting latents to pixels. This modification is particularly impactful at high latent compression rates, where not all high-frequency details can be reconstructed and must instead be generated.

Our holistic approach tasks the VAE decoder with performing the last denoising step in conjunction with converting the latents into pixels. To validate this design choice, we performed an internal user study comparing videos generated according to our approach to videos generated with the common approach, where denoising is performed solely by the diffusion-transformer, in latent space.
For the first set of results, our VAE decoder was conditioned on timestep t = 0.05. For the second set, the VAE decoder was conditioned on timestep t = 0.0 and did not perform any denoising.
The survey results indicated that videos generated by our method were strongly preferred over the standard results. The improvement was particularly evident in high-motion videos, where artifacts caused by the strong compression were mitigated by the VAE decoder’s last-step denoising.

My question is whether I can reuse the VAE model while keeping its weights frozen. Can I simply treat it as a normal VAE when training my model, and just add the 0.05 noise before decoding? Or must the main model be trained together with the VAE?
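
Concretely, the training setup I have in mind is roughly the following (a hypothetical sketch: face_model, dataloader, and optimizer are placeholders for my own training code, and the vae.encode(...).latent_dist call is an assumed diffusers-style API):

    import torch
    import torch.nn.functional as F

    # Keep the pretrained VAE frozen; only my model is trained.
    vae.requires_grad_(False)
    vae.eval()

    for video, audio_features in dataloader:
        with torch.no_grad():
            # Assumed encode API; the actual call may differ.
            target_latents = vae.encode(video).latent_dist.sample()

        # face_model is my own network, predicting latent frames from audio
        pred_latents = face_model(audio_features)
        loss = F.mse_loss(pred_latents, target_latents)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # At inference, decode pred_latents with the 0.05-noise step shown above.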

Thanks in advance
