Audio-to-Speech produces incorrect output when pose conditioning is disabled (enable_skeleton_cross_attn = False)

Thank you very much for your excellent work on this project and for sharing it openly.

When running the model in audio-to-speech mode without pose conditioning, setting
enable_skeleton_cross_attn = False causes the model to produce incorrect output, as shown in the screenshot and video below.

The frame aspect ratio is 1:1.

<img width="832" height="480" alt="Image" src="https://github.com/user-attachments/assets/622637c7-8c8a-4270-995b-7727978eb30c" />

https://github.com/user-attachments/assets/ce1c2ec3-ce2c-46d0-af70-2a9dbe6275e2
