
Qwen Image prompt encoding is not padding to max seq len #12075

@bghira


Describe the bug

The pipeline method QwenImagePipeline.encode_prompts is not padding correctly: it pads to the longest sequence length in the batch, which produces very short embeddings that are out of distribution for the model's training set.

The padding should remain at 1024 tokens even after the system prompt is dropped, and the attention mask has to be expanded to match.
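A minimal sketch of the intended behavior, padding both tensors out to the fixed maximum after encoding. The helper name pad_to_max_length is hypothetical, and the actual fix may instead live at the tokenizer/encoder level; this only illustrates the shape change:

```python
import torch
import torch.nn.functional as F

def pad_to_max_length(prompt_embeds, prompt_embeds_mask, max_sequence_length=1024):
    """Right-pad the prompt embeddings and their attention mask to a fixed length."""
    pad = max_sequence_length - prompt_embeds.shape[1]
    if pad > 0:
        # F.pad pads from the last dim backwards: (0, 0) leaves the embedding
        # dim alone, (0, pad) extends the sequence dim on the right.
        prompt_embeds = F.pad(prompt_embeds, (0, 0, 0, pad))
        # Padded positions get mask value 0 so attention ignores them.
        prompt_embeds_mask = F.pad(prompt_embeds_mask, (0, pad))
    return prompt_embeds, prompt_embeds_mask

# Example with the shapes from the logs below:
embeds = torch.randn(1, 5, 3584)
mask = torch.ones(1, 5, dtype=torch.long)
embeds, mask = pad_to_max_length(embeds, mask)
assert embeds.shape == (1, 1024, 3584) and mask.shape == (1, 1024)
```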

Reproduction

Execute QwenImagePipeline.encode_prompts() with a short prompt and check the resulting shapes.
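For concreteness, a reproduction sketch along these lines; the model id, the public method name (encode_prompt in recent diffusers, vs. encode_prompts as written in this report), and its exact signature are assumptions to check against your checkout:

```python
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

# Encode a very short prompt and inspect the returned embeddings and mask.
prompt_embeds, prompt_embeds_mask = pipe.encode_prompt(
    prompt="minecraft", device="cuda"
)
print(f"{prompt_embeds.shape=}, {prompt_embeds_mask.shape=}")
# Observed: torch.Size([1, 5, 3584]) -- padded only to the longest
# prompt in the batch instead of the 1024-token maximum.
```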

Logs

Before the fix (prompt was simply "minecraft"):

prompt_embeds.shape=torch.Size([1, 5, 3584]), prompt_embeds_mask.shape=torch.Size([1, 5])

After the fix:

prompt_embeds.shape=torch.Size([1, 1024, 3584]), prompt_embeds_mask.shape=torch.Size([1, 1024])

This leads to extremely high loss at training time unless very long prompts are used.

At inference time, it causes patch-embedding artifacts, because RoPE encounters position indices that fall outside the distribution seen during training.

System Info

Latest git main.

Who can help?

No response
