(Original Note): Claude removed this line when editing, but I wanted to fully disclose that this issue was discovered and written up by Claude code
Update (corrected): the regression is wider than the title suggests — it already exists in transformers 4.57.6, the latest 4.x. So this is not a v5 regression; it broke somewhere between 4.48 and 4.57.
## System info

- `transformers` versions tested:
  - 4.48.3 — works (audio conditioning active)
  - 4.57.6 — broken (audio ignored)
  - 5.5.4 — broken (audio ignored)
- Python 3.12.13, PyTorch 2.11.0+cu130, CUDA on RTX 3080 Ti, fp16
- Model: `facebook/musicgen-melody`
## Description

`MusicgenMelodyForConditionalGeneration.generate()` ignores the audio reference (`input_features`) in transformers ≥ 4.57. Two reference audios with different chroma classes produce byte-identical generated audio for the same text prompt and seed. The same code under 4.48 produces meaningfully different output.
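As a sanity check on the choice of test tones (a standalone sketch, independent of transformers): A4 = 440 Hz and Eb4 ≈ 311.13 Hz sit six semitones apart, so they map to different chroma classes and should yield clearly distinct chroma features.

```python
import math

def semitone_distance(f1: float, f2: float) -> float:
    """Distance in equal-tempered semitones between two frequencies."""
    return 12 * math.log2(f1 / f2)

def pitch_class(freq: float, ref_a4: float = 440.0) -> int:
    """Chroma class 0-11 (0 = A) of the nearest equal-tempered pitch."""
    return round(12 * math.log2(freq / ref_a4)) % 12

print(semitone_distance(440.0, 311.13))        # ≈ 6.0 (a tritone)
print(pitch_class(440.0), pitch_class(311.13))  # 0 vs 6: different chroma classes
```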
## Minimal reproducer

```python
import numpy as np
import torch
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration

device, dtype = "cuda", torch.float16
proc = AutoProcessor.from_pretrained("facebook/musicgen-melody")
model = MusicgenMelodyForConditionalGeneration.from_pretrained(
    "facebook/musicgen-melody", torch_dtype=dtype
).to(device)

# Two 4-second sine tones a tritone apart, so their chroma classes differ.
sr = 32000
t = np.linspace(0, 4, sr * 4)
A = (0.4 * np.sin(2 * np.pi * 440.00 * t)).astype(np.float32)   # A4
Eb = (0.4 * np.sin(2 * np.pi * 311.13 * t)).astype(np.float32)  # Eb4 — different chroma class

def gen(audio):
    inputs = proc(text=["jazz"], audio=audio, sampling_rate=sr,
                  padding=True, return_tensors="pt").to(device)
    inputs["input_features"] = inputs["input_features"].to(dtype)
    torch.manual_seed(42)
    return model.generate(**inputs, max_new_tokens=100,
                          do_sample=True, guidance_scale=3.0)

a, e = gen(A), gen(Eb)
diff = (a[0, 0].float() - e[0, 0].float()).abs().mean().item()
print(f"output diff (A vs Eb): {diff:.4f}")
```
## Results

| transformers | output diff (A vs Eb) | audio conditioning |
|---|---|---|
| 4.48.3 | 0.1610 | works |
| 4.57.6 | 0.0000 | ignored |
| 5.5.4 | 0.0000 | ignored |
## Where the chain breaks (per v5.5.4 tracing)

- The processor extracts chroma features correctly: `input_features` has shape `(1, N, 12)` and its values differ between A and Eb (chroma abs diff ≈ 0.17).
- `_prepare_encoder_hidden_states_kwargs_for_generation` does receive `input_features` in `model_kwargs` (verified by hooking).
- The returned `encoder_hidden_states` (audio prefix concatenated with the text encoder output) differs between A and Eb — mean abs diff ≈ 0.015 against a mean absolute magnitude of ≈ 0.034, i.e. the audio-prefix portion is meaningfully different.
- But the per-step logits produced inside `model.generate` are byte-identical between the two audios.

So the audio conditioning reaches the encoder-hidden-states side correctly, yet the decoder produces identical logits regardless — suggesting the audio prefix is not actually attended to by the decoder. dtype was not the cause: promoting `audio_enc_to_dec_proj` to fp32 with cast hooks did not change the result.
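The "verified by hooking" checks above can be reproduced with a generic method wrapper like this (a sketch using a stand-in class — the real check wraps `_prepare_encoder_hidden_states_kwargs_for_generation` on the actual model instance):

```python
import functools

def log_kwargs(obj, method_name, seen):
    """Wrap obj.<method_name> in place to record which kwargs each call receives."""
    original = getattr(obj, method_name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        seen.append(sorted(kwargs))  # record kwarg names, then pass through
        return original(*args, **kwargs)

    setattr(obj, method_name, wrapper)

# Stand-in for the model; substitute the real MusicGen-Melody model and
# method name to confirm input_features actually arrives in model_kwargs.
class FakeModel:
    def prepare(self, **model_kwargs):
        return model_kwargs

m, seen = FakeModel(), []
log_kwargs(m, "prepare", seen)
m.prepare(input_features="chroma", attention_mask="mask")
print(seen)  # [['attention_mask', 'input_features']]
```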
## Bisection range

Broken: 4.57.6 and 5.5.4. Last known good: 4.48.3. I haven't bisected within the 4.48 → 4.57 range.
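Pinning down the breaking release within that range can be mechanized as a standard binary search over the released versions (a sketch: the `is_broken` predicate stands in for `pip install transformers==X` plus running the reproducer, and the 4.52.0 breakpoint below is purely hypothetical):

```python
def bisect_versions(versions, is_broken):
    """Binary-search versions ordered old -> new, where the oldest is known
    good and the newest is known broken; return (last_good, first_bad)."""
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] broken
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[lo], versions[hi]

# Hypothetical release list and breakpoint, for illustration only.
releases = ["4.48.3", "4.49.0", "4.50.0", "4.51.0", "4.52.0", "4.53.0", "4.57.6"]
print(bisect_versions(releases, lambda v: v >= "4.52.0"))  # ('4.51.0', '4.52.0')
```

Each probe costs one install-and-run, so the ~10 releases in the real range need only 3-4 probes.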
## Disclosure

This regression was identified by Claude Code during a debugging session — disclosing for transparency.