Question: Reusing VAE #107

Open
wangjia184 opened this issue Jan 31, 2025 · 0 comments

Comments

@wangjia184

Hi

Thanks for this project; it is amazing work for real-time video generation.

I am hoping to reuse the high-compression VAE from this project to build a real-time talking-face model.
My idea is to train a model that outputs latent frames, which are then decoded by the VAE.
But when I looked into the code, I noticed that noise at a scale of 0.05 is mixed into the latents before decoding:

    # Draw Gaussian noise with the same shape as the latents
    noise = torch.randn_like(latents)

    # Broadcast scalar arguments to one value per batch element
    if not isinstance(decode_timestep, list):
        decode_timestep = [decode_timestep] * latents.shape[0]
    if decode_noise_scale is None:
        decode_noise_scale = decode_timestep
    elif not isinstance(decode_noise_scale, list):
        decode_noise_scale = [decode_noise_scale] * latents.shape[0]

    decode_timestep = torch.tensor(decode_timestep).to(latents.device)
    # Reshape to (B, 1, 1, 1, 1) so it broadcasts over (B, C, F, H, W) latents
    decode_noise_scale = torch.tensor(decode_noise_scale).to(latents.device)[
        :, None, None, None, None
    ]
    # Linearly mix noise into the latents before handing them to the decoder
    latents = latents * (1 - decode_noise_scale) + noise * decode_noise_scale
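
To make my intent concrete, the decode path I imagine looks roughly like this (a minimal sketch only; the vae.decode(latents, timestep, ...) call is my assumption from reading the pipeline, so the exact signature may well differ):

    import torch

    @torch.no_grad()
    def decode_latents(vae, latents, t=0.05, generator=None):
        # Mix Gaussian noise into the latents at scale t, exactly as in
        # the snippet above: x = (1 - t) * latents + t * noise
        noise = torch.randn(
            latents.shape, generator=generator,
            device=latents.device, dtype=latents.dtype,
        )
        latents = latents * (1 - t) + noise * t

        # Condition the decoder on the same timestep so it performs the
        # last denoising step while converting latents to pixels.
        # NOTE: assumed decode signature; please correct me if the
        # project's actual API differs.
        timestep = torch.full((latents.shape[0],), t, device=latents.device)
        return vae.decode(latents, timestep, return_dict=False)[0]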

The paper also describes this design:

In contrast, we propose tasking the VAE decoder with performing the last denoising step in conjunction with converting latents to pixels. This modification is particularly impactful at high latent compression rates, where not all high-frequency details can be reconstructed and must instead be generated.

Our holistic approach tasks the VAE decoder with performing the last denoising step in conjunction with converting the latents into pixels. To validate this design choice, we performed an internal user study comparing videos generated according to our approach to videos generated with the common approach, where denoising is performed solely by the diffusion-transformer, in latent space.
For the first set of results, our VAE decoder was conditioned on timestep t = 0.05. For the second set, the VAE decoder was conditioned on timestep t = 0.0 and did not perform any denoising.
The survey results indicated that videos generated by our method were strongly preferred over the standard results. The improvement was particularly evident in high-motion videos, where artifacts caused by the strong compression were mitigated by the VAE decoder’s last-step denoising.

My question is whether I can reuse the VAE model while keeping its weights frozen. Can I simply treat it as a normal VAE when training my model, and just add the 0.05 noise before decoding? Or must the main model be trained together with the VAE?
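
Concretely, the training setup I have in mind is roughly the following (a hypothetical sketch: face_model, dataloader, and optimizer are placeholders for my own training code, and the vae.encode(...).latent_dist call is an assumed diffusers-style API):

    import torch
    import torch.nn.functional as F

    # Keep the pretrained VAE frozen; only my model is trained.
    vae.requires_grad_(False)
    vae.eval()

    for video, audio_features in dataloader:
        with torch.no_grad():
            # Assumed encode API; the actual call may differ.
            target_latents = vae.encode(video).latent_dist.sample()

        # face_model is my own network, predicting latent frames from audio
        pred_latents = face_model(audio_features)
        loss = F.mse_loss(pred_latents, target_latents)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # At inference, decode pred_latents with the 0.05-noise step shown above.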

Thanks in advance
