examples/convert_jax_model_to_pytorch.py silently drops LoRA adapter weights for LoRA-finetuned checkpoints #958

This issue is both a bug report (silent failure in the official conversion script) and a feature request (LoRA-aware conversion). I have a working implementation available and would like to know whether a PR would be welcome before preparing it.

## Environment

- OS: Ubuntu 24.04
- Python: 3.10 (PyTorch worker) and 3.11 (main runtime)
- openpi commit: `c23745b`
- The bug is silent (no exception, no warning), so there is no traceback.

## Summary

The official conversion script `examples/convert_jax_model_to_pytorch.py` does not handle LoRA-finetuned JAX checkpoints. All LoRA adapter weights (`lora_a`, `lora_b`) are silently dropped during the subsequent `load_state_dict(..., strict=False)` call. The resulting PyTorch model loads without errors but produces outputs that diverge from the JAX original, because the finetuning delta is lost.

This bug likely explains the symptoms reported in several existing open issues for users who finetuned with the official LoRA variants:

- #840 (significant performance drop after JAX→PyTorch conversion)
- #729 (PyTorch version PI0.5 mismatch with JAX-converted checkpoint)
- #810 (Pi05 output mismatch between PyTorch and JAX weights)

In each of these cases, users observed the symptom (the model behaves differently after conversion), but the root cause, the silently dropped LoRA adapters, was not identified.

## Reproduction

The bug appears when:

1. A pi0 or pi05 model is finetuned in JAX with the official LoRA variants:
   - `paligemma_variant="gemma_2b_lora"`
   - `action_expert_variant="gemma_300m_lora"`

2. The resulting checkpoint is converted with the official script:

   ```bash
   python examples/convert_jax_model_to_pytorch.py \
       --checkpoint_dir <path_to_lora_checkpoint> \
       --output_path <output_path> \
       --config_name <lora_config_name>
   ```

3. The script completes successfully with no warning.

4. Inspect the load result:

   ```python
   load_result = model.load_state_dict(state_dict, strict=False)
   print("unexpected:", len(load_result.unexpected_keys))
   print("missing:", load_result.missing_keys)
   ```

   Observed on our reproduction:
   - `unexpected_keys`: 20 entries (all `lora_a` / `lora_b` keys)
   - `missing_keys`: 2 entries (only `lm_head.weight`, which is tied/unused)

5. Run inference and compare against the JAX original on the same observation. Outputs diverge.

## Numerical impact

Same-input comparison (prompt: "pick up the blue cube", saved observation), JAX CPU vs PyTorch eager:

| Stage | max_abs | mean_abs |
|-------|---------|----------|
| Official converter (LoRA silently dropped) | 0.0309 | 0.0025 |
| LoRA-aware merge applied | 0.0252 | 0.0021 |
| LoRA merge + attn_vec_einsum fix | ~0.017 | ~0.0011 |
| LoRA merge + attn_vec_einsum fix + float32 storage | 0.0017 | 0.0001 |
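
For anyone reproducing the comparison, the two columns can be computed as below. This is a minimal sketch that assumes `max_abs` and `mean_abs` are the maximum and mean of the elementwise absolute difference between the JAX and PyTorch outputs for the same observation. The function name `output_diff` is hypothetical and not part of openpi.

```python
import numpy as np


def output_diff(jax_output, torch_output):
    """Return max/mean absolute difference between two same-shaped outputs.

    Hypothetical helper: both inputs are expected to be array-likes that have
    already been moved to host memory (e.g. converted to NumPy arrays).
    """
    a = np.asarray(jax_output, dtype=np.float64)
    b = np.asarray(torch_output, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    diff = np.abs(a - b)
    return {"max_abs": float(diff.max()), "mean_abs": float(diff.mean())}


# Toy example with made-up numbers (not the data behind the table above).
example = output_diff([[0.0, 1.0], [2.0, 3.0]], [[0.0, 1.5], [2.0, 2.0]])
print(example)
```

Accumulating in float64 keeps the metric itself from adding rounding noise to differences that are already on the order of 1e-3.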

## Root cause

The conversion flow in `examples/convert_jax_model_to_pytorch.py` (the `slice_paligemma_state_dict` / `slice_gemma_state_dict` path) has no step that merges LoRA adapter weights into the base weights before producing the PyTorch state_dict. The `lora_a` / `lora_b` keys therefore remain in the state_dict and are silently discarded by the `strict=False` load, with no warning to the user.
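
One way to turn the silent drop into a loud failure is to inspect the result of the non-strict load. The sketch below is hypothetical (the helper name, the key names in the usage example, and the `lm_head.weight` allowlist are assumptions based on the observations in this issue); it only relies on the fact that `load_state_dict` returns `missing_keys` and `unexpected_keys` lists.

```python
LORA_MARKERS = ("lora_a", "lora_b")


def check_load_result(missing_keys, unexpected_keys,
                      allowed_missing_suffixes=("lm_head.weight",)):
    """Raise ValueError if a non-strict load dropped or missed weights.

    Hypothetical helper: pass in the two lists returned by
    `model.load_state_dict(state_dict, strict=False)`.
    """
    # Unexpected keys that look like LoRA adapters were never merged.
    dropped_lora = [k for k in unexpected_keys
                    if any(marker in k for marker in LORA_MARKERS)]
    other_unexpected = [k for k in unexpected_keys if k not in dropped_lora]
    # Missing keys are tolerated only if they match the allowlist.
    bad_missing = [k for k in missing_keys
                   if not k.endswith(allowed_missing_suffixes)]

    problems = []
    if dropped_lora:
        problems.append(f"{len(dropped_lora)} LoRA adapter keys were dropped")
    if other_unexpected:
        problems.append(f"unexpected keys: {other_unexpected}")
    if bad_missing:
        problems.append(f"missing keys: {bad_missing}")
    if problems:
        raise ValueError("; ".join(problems))


# Usage with made-up key names: the first call passes, the second raises.
check_load_result(["model.lm_head.weight"], [])
try:
    check_load_result([], ["layers.0.attn.lora_a", "layers.0.attn.lora_b"])
except ValueError as err:
    print("load rejected:", err)
```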

## Implementation notes

Implementing this correctly requires handling two non-obvious quirks in openpi's runtime LoRA path. A naive standard-LoRA merge formula will produce incorrect results.

1. **`attn_vec_einsum` LoRA**: at runtime, the second LoRA einsum in `openpi.models.gemma.Attention` sums over the head dimension (N), so the equivalent merged weight uses `sum_N(lora_b)` rather than a per-head outer product.

2. **MLP `FeedForward._dot()`**: adds the LoRA delta to the base output without applying the `alpha/rank` scaling factor that standard LoRA implementations use. Merged MLP weights must mirror this behavior (i.e., merge without the scaling).
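
The two quirks can be checked numerically with a small NumPy sketch. This is an illustration under stated assumptions, not the proposed implementation: the einsum equations, tensor shapes, and the `scale` value are reconstructed from the description above rather than copied from openpi, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Batch, tokens, heads, head dim, model dim, LoRA rank, MLP hidden dim (toy sizes).
B, T, N, H, D, R, F = 2, 3, 4, 5, 6, 2, 7
scale = 0.5  # hypothetical stand-in for the attention LoRA scaling value

# --- Quirk 1: attn_vec_einsum -------------------------------------------
x = rng.standard_normal((B, T, N, H))
w = rng.standard_normal((N, H, D))
lora_a = rng.standard_normal((N, H, R))
lora_b = rng.standard_normal((N, R, D))

# Assumed runtime path: both LoRA einsums contract the head axis N,
# so the second one sums lora_b over heads.
low_rank = np.einsum("BTNH,NHL->BTL", x, lora_a)
runtime = (np.einsum("BTNH,NHD->BTD", x, w)
           + scale * np.einsum("BTL,NLD->BTD", low_rank, lora_b))

# Equivalent merge: use sum_N(lora_b), shared across all heads.
w_merged = w + scale * np.einsum("NHL,LD->NHD", lora_a, lora_b.sum(axis=0))
merged = np.einsum("BTNH,NHD->BTD", x, w_merged)

# Naive per-head merge (standard LoRA formula applied head by head).
w_naive = w + scale * np.einsum("NHL,NLD->NHD", lora_a, lora_b)
naive = np.einsum("BTNH,NHD->BTD", x, w_naive)

print("sum_N merge matches runtime:", np.allclose(runtime, merged))
print("per-head merge matches runtime:", np.allclose(runtime, naive))

# --- Quirk 2: MLP delta without alpha/rank scaling ------------------------
x_mlp = rng.standard_normal((B, T, D))
w_mlp = rng.standard_normal((D, F))
mlp_a = rng.standard_normal((D, R))
mlp_b = rng.standard_normal((R, F))

# Assumed runtime path: base output plus an unscaled low-rank delta.
runtime_mlp = x_mlp @ w_mlp + (x_mlp @ mlp_a) @ mlp_b
merged_unscaled = x_mlp @ (w_mlp + mlp_a @ mlp_b)
merged_scaled = x_mlp @ (w_mlp + scale * (mlp_a @ mlp_b))

print("unscaled MLP merge matches runtime:", np.allclose(runtime_mlp, merged_unscaled))
print("scaled MLP merge matches runtime:", np.allclose(runtime_mlp, merged_scaled))
```

Under these assumptions, only the `sum_N(lora_b)` attention merge and the unscaled MLP merge reproduce the runtime outputs.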

## Additional finding

Storing the merged weights in bfloat16 causes additional numerical drift (post-merge `max_abs ~0.017` in bf16 vs `~0.0017` in float32 on the same observation; see the table above). The merged checkpoint should be stored in float32 even when downstream inference runs in bfloat16. This is consistent with the precision-related observations in #810.
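
The effect can be illustrated without any model: bfloat16 keeps only 8 significant bits, so a small LoRA delta added to a weight near 1.0 can be rounded away entirely when the merged value is stored. The sketch below emulates bfloat16 rounding with NumPy bit operations (NumPy has no native bfloat16 dtype). The weight and delta magnitudes are made up for illustration and are not taken from the checkpoint discussed here.

```python
import numpy as np


def round_to_bfloat16(x):
    """Emulate float32 -> bfloat16 -> float32 (round to nearest, ties to even).

    Illustrative only: NaN/inf handling is omitted.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Rounding bias: 0x7FFF plus the lowest bit that survives truncation.
    bias = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    rounded = (bits + bias) & np.uint32(0xFFFF0000)
    return rounded.view(np.float32)


rng = np.random.default_rng(0)
base = rng.standard_normal(10_000).astype(np.float32)            # toy base weights
delta = (1e-3 * rng.standard_normal(10_000)).astype(np.float32)  # toy LoRA delta

merged_f32 = base + delta
merged_bf16 = round_to_bfloat16(merged_f32)
storage_error = np.abs(merged_bf16 - merged_f32)

print("mean |delta|:        ", float(np.abs(delta).mean()))
print("mean storage error:  ", float(storage_error.mean()))
print("1.001 stored as bf16:", float(round_to_bfloat16(np.float32(1.001))))
```

When the storage error is comparable in size to the delta itself, a meaningful part of the finetuning signal is lost at save time, which is consistent with storing the merged checkpoint in float32.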

## Proposed direction

I have a working implementation that:

1. Merges LoRA adapters into base weights before applying the existing `slice_paligemma_state_dict` / `slice_gemma_state_dict` flow.
2. Handles the two runtime quirks above.
3. Defaults to float32 precision for the saved merged checkpoint.
4. Validates the load (no `unexpected_keys`; `missing_keys` limited to the known-tied `lm_head`).

Repository (WIP, includes domain-specific scaffolding to be cleaned before PR): https://github.com/Ret1ehS/OpenPi-Auboi5/tree/main/tools

Two options for the PR shape:

- **(a)** Add LoRA support to `examples/convert_jax_model_to_pytorch.py` as an additive flag (auto-detected from the config, or a `--merge-lora` option). Minimally invasive to the existing script.
- **(b)** Add a separate `examples/convert_jax_lora_to_pytorch.py` script. Cleaner separation but slightly more code duplication.

I'm happy to prepare a PR in either form, including a numerical correctness test against JAX inference. Would such a PR be welcome, and if so, which form do maintainers prefer?
