
Additional conditioning layers #112

Open

ArEnSc opened this issue Feb 12, 2025 · 4 comments


ArEnSc commented Feb 12, 2025

I am trying to find a way to augment LTX to support additional conditioning by using a pose video latent (encoding that information through the VAE) to enable pose controllability.

I have looked at how you conditioned the model to support image-to-video. I was considering fine-tuning the model with image-to-video conditioning and adding an extra cross-attention layer alongside the original one, along the lines of:

```python
# Simplified idea (per transformer block)
x_sa   = self_attn(x)                     # self-attention over video tokens
x_ca_a = cross_attn_A(x_sa, cond_text)    # existing cross-attention: text condition
x_ca_b = cross_attn_B(x_ca_a, cond_pose)  # new cross-attention: pose condition
output = feed_forward(x_ca_b)             # final feed-forward network
```

The plan is to freeze the base model and fine-tune only the new cross-attention layers, along the lines of the sketch below.
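A minimal sketch of what I mean by the freezing setup, assuming `transformer` is the model with the new layers added and `cross_attn_B` is the (hypothetical) attribute name of the new pose cross-attention:

```python
import torch

# Freeze everything, then unfreeze only the new pose cross-attention layers.
for p in transformer.parameters():
    p.requires_grad = False

trainable = []
for name, module in transformer.named_modules():
    if name.endswith("cross_attn_B"):  # hypothetical module name
        for p in module.parameters():
            p.requires_grad = True
            trainable.append(p)

optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```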

Does this seem like a sensible thing to do? Do you have any tips or thoughts? How would you recommend doing this, @yoavhacohen?

yoavhacohen (Collaborator) commented

Adding new cross-attention layers is a valid option, but you’ll likely need to handle positional embeddings. The current cross-attention layers don’t use positional embeddings, so you might want to reference how we handle them in self-attention.
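For reference, a generic rotary-embedding sketch (this is an illustration, not our actual self-attention code; `apply_rope` and the shared-grid detail below are assumptions):

```python
import torch

def apply_rope(x, freqs):
    """Rotate feature pairs of x (B, H, N, D) by per-position complex phases
    freqs (N, D/2). Generic rotary-embedding sketch."""
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_c = x_c * freqs  # broadcasts over batch and head dims
    return torch.view_as_real(x_c).flatten(-2).type_as(x)

# In the new pose cross-attention, the pose latents would share the video's
# spatiotemporal grid (an assumption), so the same frequencies could be
# applied to both queries and keys before computing attention:
#   q = apply_rope(q, freqs); k = apply_rope(k, freqs)
```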

Keep in mind that this is different from how we condition the model for image-to-video, which is implemented as a temporal inpainting task using a different timestep for the conditioning tokens—see the paper for details.

Your approach of freezing the model and fine-tuning only the new cross-attention layers makes sense, especially if your goal is to minimize catastrophic forgetting while maintaining the base model’s capabilities. You might also consider training a LoRA adapter for the rest of the model alongside the new cross-attention layers to allow for more flexible adaptation.
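If it helps, a hedged sketch of that combination using the `peft` library (the `target_modules` names are hypothetical and would need to match the transformer's actual projection module names):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # hypothetical names
)
transformer = get_peft_model(transformer, lora_config)

# The new pose cross-attention layers stay fully trainable alongside the
# LoRA adapters; the frozen base weights are untouched.
```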

Would be happy to discuss more details if you have specific constraints or goals in mind!


ArEnSc commented Feb 12, 2025

@yoavhacohen Just to clarify: I am trying to extend and build upon image-to-video conditioning with pose conditioning. Are you suggesting that the current image-to-video conditioning mechanism can be extended to also incorporate explicit pose information to better guide the temporal generation? How do you envision integrating this additional pose conditioning, or would the new cross-attention layers be the more effective route?
The end result I am trying to achieve is to use the pose conditioning to steer a character in an image.


yoavhacohen commented Feb 13, 2025

Our image-to-video conditioning is implemented as a temporal inpainting task, using a different timestep for the conditioning tokens; it doesn't rely on cross-attention.

If you want to apply a similar approach for pose conditioning, simply add more tokens with the same positional embeddings as the generated ones, but assign a different timestep embedding to the conditioning tokens.
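A hedged sketch of that token layout (all shapes, values, and names here are illustrative placeholders, not the repo's API):

```python
import torch

B, N_gen, N_pose, D = 2, 1024, 1024, 128

x_gen  = torch.randn(B, N_gen, D)   # noisy latent tokens to be denoised
x_pose = torch.randn(B, N_pose, D)  # clean (or lightly noised) pose latents

# Conditioning tokens reuse the positional embeddings of the tokens they
# describe, but get a distinct, near-clean per-token timestep.
tokens = torch.cat([x_pose, x_gen], dim=1)      # (B, N_pose + N_gen, D)
t_gen  = torch.full((B, N_gen),  0.80)          # current denoising timestep
t_cond = torch.full((B, N_pose), 0.05)          # near-clean timestep
timesteps = torch.cat([t_cond, t_gen], dim=1)   # one timestep per token
```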

yoavhacohen self-assigned this Feb 13, 2025

ArEnSc commented Feb 18, 2025

Ok, so I believe I understand...

You concatenate the pose tokens and the seed-image tokens alongside the noise for the remaining sequence you want to predict, and you keep the pose and seed-image tokens at a low, near-clean noise level.

I suspect this will increase memory usage a bit during inference because of the longer sequence.

The target to predict is the flow (velocity) toward the whole conditioned sequence.

During inference, you initialize the conditioning tokens with a tiny bit of added noise. Then after denoising, you just peel off the conditioning tokens and keep the generated ones.
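If I have this right, something like the following (hedged sketch; `denoise` is a stand-in for the real iterative sampler, and all shapes are made up):

```python
import torch

B, N_cond, N_gen, D = 1, 512, 1024, 128
cond_tokens  = torch.randn(B, N_cond, D)  # pose + seed-image latents, near-clean
noise_tokens = torch.randn(B, N_gen, D)   # pure noise for the frames to generate

def denoise(tokens):  # placeholder for the full iterative sampling loop
    return tokens

out = denoise(torch.cat([cond_tokens, noise_tokens], dim=1))

# Peel off the conditioning part; keep only the generated tokens.
generated = out[:, N_cond:, :]
assert generated.shape == (B, N_gen, D)
```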
