Hi,

Thanks for this project; it is amazing work for real-time video generation.
I am looking into reusing the high-compression VAE from this project to build a real-time talking-face model. My plan was to train a model that outputs latent frames, which are then decoded by the VAE. But when I looked into the code, I noticed that noise at level 0.05 is added to the latents before decoding.
Also, the paper describes this design (quoting):

> In contrast, we propose tasking the VAE decoder with performing the last denoising step in conjunction with converting latents to pixels. This modification is particularly impactful at high latent compression rates, where not all high-frequency details can be reconstructed and must instead be generated.
>
> Our holistic approach tasks the VAE decoder with performing the last denoising step in conjunction with converting the latents into pixels. To validate this design choice, we performed an internal user study comparing videos generated according to our approach to videos generated with the common approach, where denoising is performed solely by the diffusion-transformer, in latent space. For the first set of results, our VAE decoder was conditioned on timestep t = 0.05. For the second set, the VAE decoder was conditioned on timestep t = 0.0 and did not perform any denoising. The survey results indicated that videos generated by our method were strongly preferred over the standard results. The improvement was particularly evident in high-motion videos, where artifacts caused by the strong compression were mitigated by the VAE decoder’s last-step denoising.
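For reference, this is how I currently read the decode-time recipe. Everything below is just a sketch based on my understanding; the `vae.decode(..., timestep=...)` signature and the rectified-flow-style noising are my assumptions, not the project's actual API:

```python
import torch

# My guess at the decode step (names are mine, not the project's API):
# perturb the predicted latents with noise at t = 0.05, then let the
# timestep-conditioned VAE decoder perform the final denoising step
# while converting latents to pixels.
DECODE_T = 0.05

@torch.no_grad()
def decode_with_last_step_denoising(vae, latents):
    noise = torch.randn_like(latents)
    # Assumed rectified-flow noising at level t: x_t = (1 - t) * x_0 + t * noise
    noisy_latents = (1.0 - DECODE_T) * latents + DECODE_T * noise
    # Hypothetical signature: the decoder is told which timestep it is
    # denoising from, so it can generate the missing high-frequency detail.
    return vae.decode(noisy_latents, timestep=DECODE_T)
```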
My question is whether I can reuse this VAE while keeping its weights frozen. Could I treat it as a normal VAE when training my own model, and simply add noise at t = 0.05 before decoding? Or must the main model be trained together with the VAE? Concretely, I picture something like the sketch below.
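Again just a sketch of the setup I have in mind; `model`, `dataloader`, `optimizer`, and the `vae.encode` signature are placeholders I made up, and the loss is a plain latent-space MSE:

```python
import torch
import torch.nn.functional as F

# Freeze the pretrained VAE; only my latent predictor is trained.
vae.eval()
for p in vae.parameters():
    p.requires_grad_(False)

for audio, video in dataloader:
    with torch.no_grad():
        target_latents = vae.encode(video)   # assumed encoder API
    pred_latents = model(audio)              # my model outputs latent frames
    loss = F.mse_loss(pred_latents, target_latents)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At inference I would then add the t = 0.05 noise to pred_latents before
# calling the frozen decoder, as in the decode sketch above.
```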
Thanks in advance