Additional conditioning layers #112
Adding new cross-attention layers is a valid option, but you’ll likely need to handle positional embeddings. The current cross-attention layers don’t use positional embeddings, so you might want to reference how we handle them in self-attention. Keep in mind that this is different from how we condition the model for image-to-video, which is implemented as a temporal inpainting task using a different timestep for the conditioning tokens—see the paper for details. Your approach of freezing the model and fine-tuning only the new cross-attention layers makes sense, especially if your goal is to minimize catastrophic forgetting while maintaining the base model’s capabilities. You might also consider training a LoRA adapter for the rest of the model alongside the new cross-attention layers to allow for more flexible adaptation. Would be happy to discuss more details if you have specific constraints or goals in mind!
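For reference, a minimal sketch of the freeze-then-fine-tune setup described above. The module-name keyword `cross_attn_pose` is hypothetical and would need to match however the new layers are actually registered in the model:

```python
from torch import nn

def mark_trainable(model: nn.Module, new_layer_keyword: str = "cross_attn_pose") -> nn.Module:
    """Freeze the base model and unfreeze only the newly added cross-attention
    layers, identified here (hypothetically) by a substring of their module name."""
    # Freeze everything first.
    for p in model.parameters():
        p.requires_grad = False
    # Unfreeze only the newly added cross-attention modules.
    for name, module in model.named_modules():
        if new_layer_keyword in name:
            for p in module.parameters():
                p.requires_grad = True
    return model
```

A LoRA adapter on the frozen layers (e.g. via the peft library) could be attached on top of this if more adaptation capacity is needed, as suggested above.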
@yoavhacohen
Our image-to-video conditioning is implemented as a temporal inpainting task that uses a different timestep for the conditioning tokens; it doesn’t rely on cross-attention. If you want to apply a similar approach for pose conditioning, simply add more tokens with the same positional embeddings as the generated ones, but assign a different timestep embedding to the conditioning tokens.
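A rough sketch of what that token/timestep assembly could look like (names and shapes are illustrative, not the actual LTX-Video code):

```python
import torch

def build_conditioned_input(noisy_tokens, cond_tokens, t, cond_t=0.0):
    """noisy_tokens: (B, N, D) target tokens noised at timestep t
    cond_tokens:  (B, M, D) conditioning tokens (e.g. VAE-encoded pose frames)
    t:            (B,) timestep sampled for the target tokens
    cond_t:       small fixed timestep assigned to the conditioning tokens"""
    B, M, _ = cond_tokens.shape
    N = noisy_tokens.shape[1]
    tokens = torch.cat([cond_tokens, noisy_tokens], dim=1)       # (B, M + N, D)
    # Per-token timesteps: conditioning tokens get cond_t, the rest get t.
    t_cond = torch.full((B, M), cond_t, device=t.device, dtype=t.dtype)
    t_target = t[:, None].expand(B, N)
    per_token_t = torch.cat([t_cond, t_target], dim=1)           # (B, M + N)
    # Positional embeddings are shared with the frames the conditioning tokens
    # correspond to; only the timestep embedding distinguishes them.
    return tokens, per_token_t
```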
Ok, so I believe I understand: you concatenate the pose tokens alongside the seed image tokens and the noise for the remaining sequence you want to predict. I suspect this will increase memory use a bit during inference because of the longer sequence. The target to predict is the flow (velocity) toward the whole conditioned sequence. During inference, you add a tiny bit of noise to the initial conditioning tokens.
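If that reading is right, inference could look roughly like the sketch below: a plain Euler sampler under an assumed rectified-flow parameterization (x_t = (1 - t) * data + t * noise), with the conditioning tokens lightly noised once and then held fixed. The `model` call signature and the timestep conventions are assumptions, not the actual LTX-Video code:

```python
import torch

@torch.no_grad()
def sample_with_condition(model, cond_tokens, noise, num_steps=40, cond_t=0.02):
    """Euler sampler sketch: conditioning tokens are lightly noised once and held
    fixed while the generated tokens are integrated from noise toward data.
    Assumes `model(tokens, per_token_t)` returns a per-token velocity."""
    B, M, _ = cond_tokens.shape
    N = noise.shape[1]
    # A tiny bit of noise on the conditioning tokens, matching their small timestep.
    cond = (1.0 - cond_t) * cond_tokens + cond_t * torch.randn_like(cond_tokens)
    x = noise                                              # (B, N, D) pure noise
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=noise.device)
    for i in range(num_steps):
        t, t_next = ts[i].item(), ts[i + 1].item()
        tokens = torch.cat([cond, x], dim=1)
        per_token_t = torch.cat(
            [torch.full((B, M), cond_t, device=x.device),
             torch.full((B, N), t, device=x.device)], dim=1)
        v = model(tokens, per_token_t)                     # (B, M + N, D) velocity
        x = x + (t_next - t) * v[:, M:]                    # update generated tokens only
    return x
```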
I am trying to find a way to augment LTX to support additional conditioning from a pose video latent (encoded through the VAE) to enable pose controllability.
I have looked at how you conditioned the model to support image-to-video. I was considering fine-tuning the model with image-to-video conditioning and adding an extra cross-attention layer alongside the original one, along the lines of:
Simplified idea:

```python
x_sa = self_attn(x)                    # self-attention
x_ca_A = cross_attn_A(x_sa, cond_A)    # cross-attention for the text condition
x_ca_B = cross_attn_B(x_ca_A, cond_B)  # cross-attention for the pose condition
output = feed_forward(x_ca_B)          # final feed-forward network
```
The plan is to freeze the model and fine-tune only the new cross-attention layers.
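For concreteness, here is a self-contained sketch of such a block with pre-normalization and residual connections. It uses `nn.MultiheadAttention` as a stand-in for LTX-Video's actual attention modules (which also handle positional and timestep embeddings), so all names and shapes here are assumptions:

```python
from torch import nn

class DualCrossAttnBlock(nn.Module):
    """Sketch of the proposed block: self-attn -> cross-attn (text)
    -> cross-attn (pose) -> feed-forward, each with a residual connection."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn_pose = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3, self.norm4 = (nn.LayerNorm(dim) for _ in range(4))

    def forward(self, x, text_cond, pose_cond):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]                         # self-attention
        h = self.norm2(x)
        x = x + self.cross_attn_text(h, text_cond, text_cond)[0]   # text conditioning
        h = self.norm3(x)
        x = x + self.cross_attn_pose(h, pose_cond, pose_cond)[0]   # pose conditioning (new, trainable)
        x = x + self.ff(self.norm4(x))                             # feed-forward
        return x
```

Only the `cross_attn_pose` and its norm would be trainable in the frozen-model setup; everything else is initialized from and kept at the base weights.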
Does this seem like a sensible thing to do? Do you have any tips or thoughts?
How would you recommend doing this, @yoavhacohen?