
Additional conditioning layers #112

Open

ArEnSc opened this issue Feb 12, 2025 · 4 comments


ArEnSc commented Feb 12, 2025

I am trying to find a way to augment LTX to support additional conditioning by using a pose video latent (encoding that information through the VAE) to enable pose controllability.

I have looked at how you conditioned the model to support image-to-video. I was considering fine-tuning the model with image-to-video conditioning and adding an extra cross-attention layer alongside the original one, along the lines of:

```python
# Simplified idea (per transformer block)
x_sa   = self_attn(x)                     # self-attention over video tokens
x_ca_a = cross_attn_A(x_sa, cond_text)    # existing cross-attention: text condition
x_ca_b = cross_attn_B(x_ca_a, cond_pose)  # new cross-attention: pose condition
output = feed_forward(x_ca_b)             # final feed-forward network
```

The plan is to freeze the base model and fine-tune only the new cross-attention layers, along the lines of the sketch below.
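A minimal sketch of what I mean by the freezing setup, assuming `transformer` is the model with the new layers added and `cross_attn_B` is the (hypothetical) attribute name of the new pose cross-attention:

```python
import torch

# Freeze everything, then unfreeze only the new pose cross-attention layers.
for p in transformer.parameters():
    p.requires_grad = False

trainable = []
for name, module in transformer.named_modules():
    if name.endswith("cross_attn_B"):  # hypothetical module name
        for p in module.parameters():
            p.requires_grad = True
            trainable.append(p)

optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```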

Does this seem like a sensible thing to do? Do you have any tips or thoughts? How would you recommend doing this, @yoavhacohen?

yoavhacohen (Collaborator) commented

Adding new cross-attention layers is a valid option, but you’ll likely need to handle positional embeddings. The current cross-attention layers don’t use positional embeddings, so you might want to reference how we handle them in self-attention.
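For reference, a generic rotary-embedding sketch (this is an illustration, not our actual self-attention code; `apply_rope` and the shared-grid detail below are assumptions):

```python
import torch

def apply_rope(x, freqs):
    """Rotate feature pairs of x (B, H, N, D) by per-position complex phases
    freqs (N, D/2). Generic rotary-embedding sketch."""
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    x_c = x_c * freqs  # broadcasts over batch and head dims
    return torch.view_as_real(x_c).flatten(-2).type_as(x)

# In the new pose cross-attention, the pose latents would share the video's
# spatiotemporal grid (an assumption), so the same frequencies could be
# applied to both queries and keys before computing attention:
#   q = apply_rope(q, freqs); k = apply_rope(k, freqs)
```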

Keep in mind that this is different from how we condition the model for image-to-video, which is implemented as a temporal inpainting task using a different timestep for the conditioning tokens—see the paper for details.

Your approach of freezing the model and fine-tuning only the new cross-attention layers makes sense, especially if your goal is to minimize catastrophic forgetting while maintaining the base model’s capabilities. You might also consider training a LoRA adapter for the rest of the model alongside the new cross-attention layers to allow for more flexible adaptation.
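If it helps, a hedged sketch of that combination using the `peft` library (the `target_modules` names are hypothetical and would need to match the transformer's actual projection module names):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # hypothetical names
)
transformer = get_peft_model(transformer, lora_config)

# The new pose cross-attention layers stay fully trainable alongside the
# LoRA adapters; the frozen base weights are untouched.
```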

Would be happy to discuss more details if you have specific constraints or goals in mind!


ArEnSc commented Feb 12, 2025

@yoavhacohen Just to clarify: I am trying to extend and build upon image-to-video conditioning with pose conditioning. Are you suggesting that the current image-to-video conditioning mechanism can be extended to also incorporate explicit pose information to better guide the temporal generation? How do you envision integrating this additional pose conditioning, or would the new cross-attention layers be the more effective route?
The end result I am trying to achieve is to use the pose conditioning to steer a character in an image.


yoavhacohen commented Feb 13, 2025

Our image-to-video conditioning is implemented as a temporal inpainting task, using a different timestep for the conditioning tokens; it doesn't rely on cross-attention.

If you want to apply a similar approach for pose conditioning, simply add more tokens with the same positional embeddings as the generated ones, but assign a different timestep embedding to the conditioning tokens.
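A hedged sketch of that token layout (all shapes, values, and names here are illustrative placeholders, not the repo's API):

```python
import torch

B, N_gen, N_pose, D = 2, 1024, 1024, 128

x_gen  = torch.randn(B, N_gen, D)   # noisy latent tokens to be denoised
x_pose = torch.randn(B, N_pose, D)  # clean (or lightly noised) pose latents

# Conditioning tokens reuse the positional embeddings of the tokens they
# describe, but get a distinct, near-clean per-token timestep.
tokens = torch.cat([x_pose, x_gen], dim=1)      # (B, N_pose + N_gen, D)
t_gen  = torch.full((B, N_gen),  0.80)          # current denoising timestep
t_cond = torch.full((B, N_pose), 0.05)          # near-clean timestep
timesteps = torch.cat([t_cond, t_gen], dim=1)   # one timestep per token
```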

yoavhacohen self-assigned this Feb 13, 2025

ArEnSc commented Feb 18, 2025

Ok, so I believe I understand...

You concatenate the pose tokens and the seed-image tokens alongside the noise for the remaining sequence you want to predict, and you keep the pose and seed-image tokens at a low, near-clean noise level.

I suspect this will increase memory usage a bit during inference because of the longer sequence.

The target to predict is the flow (velocity) toward the whole conditioned sequence.

During inference, you initialize the conditioning tokens with a tiny bit of added noise. Then after denoising, you just peel off the conditioning tokens and keep the generated ones.
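If I have this right, something like the following (hedged sketch; `denoise` is a stand-in for the real iterative sampler, and all shapes are made up):

```python
import torch

B, N_cond, N_gen, D = 1, 512, 1024, 128
cond_tokens  = torch.randn(B, N_cond, D)  # pose + seed-image latents, near-clean
noise_tokens = torch.randn(B, N_gen, D)   # pure noise for the frames to generate

def denoise(tokens):  # placeholder for the full iterative sampling loop
    return tokens

out = denoise(torch.cat([cond_tokens, noise_tokens], dim=1))

# Peel off the conditioning part; keep only the generated tokens.
generated = out[:, N_cond:, :]
assert generated.shape == (B, N_gen, D)
```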
