
Qwen Image prompt encoding is not padding to max seq len #12075

@bghira


Describe the bug

The pipeline method QwenImagePipeline.encode_prompts is not padding correctly: it pads to the longest sequence length in the batch, which produces very short embeddings that are out of distribution for the model's training set.

The padding should remain at 1024 tokens even after the system prompt is dropped, and the attention mask has to be expanded to match.
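A minimal sketch of the intended behavior, padding both tensors out to the fixed maximum after encoding. The helper name pad_to_max_length is hypothetical, and the actual fix may instead live at the tokenizer/encoder level; this only illustrates the shape change:

```python
import torch
import torch.nn.functional as F

def pad_to_max_length(prompt_embeds, prompt_embeds_mask, max_sequence_length=1024):
    """Right-pad the prompt embeddings and their attention mask to a fixed length."""
    pad = max_sequence_length - prompt_embeds.shape[1]
    if pad > 0:
        # F.pad pads from the last dim backwards: (0, 0) leaves the embedding
        # dim alone, (0, pad) extends the sequence dim on the right.
        prompt_embeds = F.pad(prompt_embeds, (0, 0, 0, pad))
        # Padded positions get mask value 0 so attention ignores them.
        prompt_embeds_mask = F.pad(prompt_embeds_mask, (0, pad))
    return prompt_embeds, prompt_embeds_mask

# Example with the shapes from the logs below:
embeds = torch.randn(1, 5, 3584)
mask = torch.ones(1, 5, dtype=torch.long)
embeds, mask = pad_to_max_length(embeds, mask)
assert embeds.shape == (1, 1024, 3584) and mask.shape == (1, 1024)
```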

Reproduction

Execute QwenImagePipeline.encode_prompts() with a short prompt and check the resulting shapes.
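For concreteness, a reproduction sketch along these lines; the model id, the public method name (encode_prompt in recent diffusers, vs. encode_prompts as written in this report), and its exact signature are assumptions to check against your checkout:

```python
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

# Encode a very short prompt and inspect the returned embeddings and mask.
prompt_embeds, prompt_embeds_mask = pipe.encode_prompt(
    prompt="minecraft", device="cuda"
)
print(f"{prompt_embeds.shape=}, {prompt_embeds_mask.shape=}")
# Observed: torch.Size([1, 5, 3584]) -- padded only to the longest
# prompt in the batch instead of the 1024-token maximum.
```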

Logs

Before the fix (prompt was simply "minecraft"):

prompt_embeds.shape=torch.Size([1, 5, 3584]), prompt_embeds_mask.shape=torch.Size([1, 5])

After the fix:

prompt_embeds.shape=torch.Size([1, 1024, 3584]), prompt_embeds_mask.shape=torch.Size([1, 1024])

This leads to extremely high loss at training time unless very long prompts are used.

At inference time, it causes patch-embedding artifacts, because RoPE encounters position indices that fall outside the distribution seen during training.

System Info

Latest git main.

Who can help?

No response
