54 commits
3c0d9ae
fix(distillation): reverse-KL server path NaN on variable completion …
k1064190 Apr 19, 2026
ea1cc3c
style(distillation tests): slim docstrings to match TRL convention
k1064190 Apr 19, 2026
d3f6a18
test(distillation): guard end-to-end tests against vacuous log-history
k1064190 Apr 19, 2026
5d3d085
test(distillation): parametrize end-to-end test, drop vacuous JSD case
k1064190 Apr 19, 2026
88826fd
Update AsyncGRPO example with GSM8K and tested hyperparameters (#5580)
sergiopaniego Apr 20, 2026
badeb47
Merge branch 'main' into fix/distillation-server-nan-on-variable-comp…
k1064190 Apr 20, 2026
1d9b612
[docs] Add chat templates page to web docs (#5581)
sergiopaniego Apr 20, 2026
9502575
Add additional model parameters to `TestSupportsToolCalling` for impr…
qgallouedec Apr 20, 2026
06244b0
Fix CI with dev dependencies for Llava models (#5499)
albertvillanova Apr 20, 2026
4a2dc7c
Differentiate Phi-3 and Phi-3.5 in tests (#5546)
qgallouedec Apr 20, 2026
6e1705a
Set _tokenizer as trainer attribute (#5489)
albertvillanova Apr 20, 2026
b8d69f7
Align KTO with DPO: Support dict eval_dataset (#5599)
albertvillanova Apr 20, 2026
4ca2e9b
Align KTO with DPO: Align tokenization (#5601)
albertvillanova Apr 20, 2026
d5b534e
Check prefix preservation at the token level (#5559)
qgallouedec Apr 20, 2026
dfe3788
Replace wrong comment about chat template with EOS (#5607)
albertvillanova Apr 20, 2026
14ca4af
Align KTO with DPO: Support IterableDataset (#5600)
albertvillanova Apr 20, 2026
0a54b4d
Drop vLLM 0.11 support (#5549)
qgallouedec Apr 21, 2026
1cc2b98
Align KTO with DPO: Remove maybe_apply_chat_template (#5606)
albertvillanova Apr 21, 2026
ecf9cb3
[TPO] experimental TPO trainer (#5506)
kashif Apr 21, 2026
efa22bc
refactor(distillation): address review feedback on server reverse-KL fix
k1064190 Apr 21, 2026
a08e713
fix: Pass AsyncGRPOTrainer's processing_class to AsyncRolloutWorker (…
xuanduy04 Apr 21, 2026
166d550
docs: update RapidFire AI integration with FSDP and multi-backend tra…
kamran-rapidfireAI Apr 22, 2026
edaf6ec
Fix generate_tiny_models for gpt-oss (#5622)
albertvillanova Apr 22, 2026
6a4a077
Added speculative_config to vllm-serve (#5605)
Ofir408 Apr 22, 2026
9a52d73
feat(glm-4-moe): Add `{% generation %}` markers for training chat tem…
casinca Apr 22, 2026
95e76d5
Fix docstring style in vllm-serve script (#5628)
albertvillanova Apr 22, 2026
3256995
feat: add Gemma/Gemma2 training chat templates with generation marker…
ps-abhi Apr 22, 2026
b3da4eb
Align KTO with DPO: Inline tokenization, new output format, DataColla…
albertvillanova Apr 22, 2026
644d173
feat: add Phi-3 training chat template with generation markers (#5526)
RudrenduPaul Apr 22, 2026
6da8ec5
Remove `forward_masked_logits` (#5626)
qgallouedec Apr 23, 2026
a9cfe47
Use `PreTrainedTokenizerBase` for tokenizer type hints (#5629)
qgallouedec Apr 23, 2026
1996c39
Add doc-builder style check to pre-commit and CI (#5630)
albertvillanova Apr 24, 2026
b43476a
Align and update doc-builder commit hash in CI GitHub Actions (#5631)
albertvillanova Apr 24, 2026
4c8b2e9
Align KTO with DPO: Move completion assembly from _prepare_dataset to…
albertvillanova Apr 24, 2026
208337c
Hotfix CI: Add ruff dependency to doc-builder style check (#5634)
albertvillanova Apr 24, 2026
c693ca1
Fix entropy calculation in SFT (#5620)
qgallouedec Apr 24, 2026
43cbd78
Renaming of internal variables: `async_reward_X` to `async_X` (#5616)
qgallouedec Apr 24, 2026
3aa9519
Align KTO with DPO: Remove BOS/EOS handling (#5635)
albertvillanova Apr 24, 2026
2f10689
Qwen3.6 integration (#5642)
qgallouedec Apr 26, 2026
9679645
Release: v1.3 (#5647)
qgallouedec Apr 26, 2026
4798893
⬆️ Bump dev version (#5648)
qgallouedec Apr 26, 2026
923c318
Align KTO with DPO: Remove model_init parameter (#5659)
albertvillanova Apr 27, 2026
510a6f5
Align KTO with DPO: Remove preprocess_logits_for_metrics parameter (#…
albertvillanova Apr 27, 2026
a7648ba
Add tiny Qwen3-4B-Instruct-2507 (#5586)
qgallouedec Apr 27, 2026
9bcf729
Chunked cross-entropy loss for SFT (up to –50% VRAM) (#5575)
qgallouedec Apr 27, 2026
8d3a3a2
Fix missing PEFT validation when passing peft_config to core trainers…
albertvillanova Apr 28, 2026
4d0fd7d
Fix missing PEFT availability check when passing peft_config to exper…
albertvillanova Apr 28, 2026
9516563
Align KTO with DPO: Align PEFT handling (#5661)
albertvillanova Apr 28, 2026
4455858
Set _tokenizer attribute in experimental trainers (#5566)
albertvillanova Apr 28, 2026
574ebe0
Fix peft_config type hint in experimental trainers (#5666)
albertvillanova Apr 28, 2026
788555a
Add Cohere training chat template (#5627)
dschulmeist Apr 28, 2026
88e0ed4
Simplify peft_config handling in core trainers (#5673)
albertvillanova Apr 29, 2026
fdad6d8
Simplify peft_config handling in experimental trainers (#5674)
albertvillanova Apr 29, 2026
f85334a
Merge branch 'main' into fix/distillation-server-nan-on-variable-comp…
cmpatino Apr 29, 2026
2 changes: 1 addition & 1 deletion .github/workflows/build_documentation.yml
@@ -12,7 +12,7 @@ env:

 jobs:
   build:
-    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
+    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
     with:
       commit_sha: ${{ github.sha }}
       package: trl
2 changes: 1 addition & 1 deletion .github/workflows/build_pr_documentation.yml
@@ -13,7 +13,7 @@ concurrency:
 jobs:
   build:
     if: github.event.pull_request.draft == false
-    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@90b4ee2c10b81b5c1a6367c4e6fc9e2fb510a7e3 # main
+    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
     with:
       commit_sha: ${{ github.event.pull_request.head.sha }}
       pr_number: ${{ github.event.number }}
2 changes: 1 addition & 1 deletion .github/workflows/tests_latest.yml
@@ -26,7 +26,7 @@ jobs:
     steps:
       - name: Git checkout
         uses: actions/checkout@v6
-        with: { ref: v1.2-release }
+        with: { ref: v1.3-release }

       - name: Set up Python 3.12
         uses: actions/setup-python@v6
2 changes: 1 addition & 1 deletion .github/workflows/upload_pr_documentation.yml
@@ -8,7 +8,7 @@ on:

 jobs:
   build:
-    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@9ad2de8582b56c017cb530c1165116d40433f1c6 # main
+    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391 # main
     with:
       package_name: trl
     secrets:
16 changes: 9 additions & 7 deletions .pre-commit-config.yaml
@@ -8,10 +8,12 @@ repos:
       - id: ruff-format
         types_or: [ python, pyi ]

-  # - repo: https://github.com/codespell-project/codespell
-  #   rev: v2.1.0
-  #   hooks:
-  #   - id: codespell
-  #     args:
-  #       - --ignore-words-list=nd,reacher,thist,ths,magent,ba
-  #       - --skip=docs/css/termynal.css,docs/js/termynal.js
+  - repo: local
+    hooks:
+      - id: doc-builder-style
+        name: Check style with doc-builder
+        language: python
+        entry: doc-builder style trl tests docs/source --max_len 119
+        additional_dependencies: ["git+https://github.com/huggingface/doc-builder@2430c1ec91d04667414e2fa31ecfc36c153ea391", ruff] # See GH-5633
+        pass_filenames: false
+        types_or: [python, markdown, rst]
2 changes: 1 addition & 1 deletion CITATION.cff
@@ -37,5 +37,5 @@ keywords:
   - language model alignment
   - post-training
 license: Apache-2.0
-version: '1.2'
+version: '1.3'
 date-released: '2020-03-27'
1 change: 0 additions & 1 deletion Makefile
@@ -10,7 +10,6 @@ test:
 precommit:
 	python scripts/add_copyrights.py
 	pre-commit run --all-files
-	doc-builder style trl tests docs/source --max_len 119

 slow_tests:
 	pytest -m "slow" tests/ $(if $(IS_GITHUB_CI),--report-log "slow_tests.log",)
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-1.3.0.dev0
+1.4.0.dev0
4 changes: 4 additions & 0 deletions docs/source/_toctree.yml
@@ -7,6 +7,8 @@
     title: Quickstart
   title: Getting started
 - sections:
+  - local: chat_templates
+    title: Chat Templates
   - local: dataset_formats
     title: Dataset Formats
   - local: paper_index
@@ -133,6 +135,8 @@
     title: SDPO
   - local: ssd_trainer
     title: SSD
+  - local: tpo_trainer
+    title: TPO
   - local: xpo_trainer
     title: XPO
   title: Experimental
2 changes: 2 additions & 0 deletions docs/source/chat_template_utils.md
@@ -1,5 +1,7 @@
 # Chat template utilities

+For an overview of the chat templates bundled with TRL and the rationale behind the training patches, see [Chat Templates](chat_templates).
+
 ## clone_chat_template

 [[autodoc]] clone_chat_template
113 changes: 113 additions & 0 deletions docs/source/chat_templates.md
@@ -0,0 +1,113 @@
# Chat Templates

A [chat template](https://huggingface.co/docs/transformers/en/chat_templating) is a Jinja2 snippet that formats messages into the string a model was trained on. For example:

```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
>>> tokenizer.chat_template
"{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
>>> tokenizer.apply_chat_template([{"role": "user", "content": "Hi!"}], tokenize=False)
'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nHi!<|im_end|>\n'
```

In most cases you don't need to worry about chat templates: models ship their template along with the tokenizer, and TRL applies it for you. The whole thing is transparent. But some TRL recipes rely on features that most shipped templates don't include:

- **SFT with `assistant_only_loss=True`** needs `{% generation %}` / `{% endgeneration %}` markers around assistant output, so the loss mask can target only assistant tokens.
- **GRPO with tool calls** needs the template to be *prefix-preserving*: appending a tool message must not change how earlier messages are rendered.

TRL ships patched templates under [`trl/chat_templates/`](https://github.com/huggingface/trl/tree/main/trl/chat_templates) for common families (Qwen, Llama, DeepSeek-V3, GPT-OSS, ...) and swaps them in automatically for supported models. For any other model, you'll need to patch its template yourself. The rest of this page catalogs what's bundled.
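
To see what the markers do, here is a minimal sketch using an illustrative inline template (not one of the bundled files). With a template that contains `{% generation %}` blocks, `return_assistant_tokens_mask=True` makes the tokenizer return a mask that is 1 exactly for the tokens inside those blocks:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Illustrative mini-template with generation markers (not one of the bundled templates):
template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}"
    "{{ '<|im_start|>user\n' + message['content'] + '<|im_end|>\n' }}"
    "{% else %}"
    "{{ '<|im_start|>assistant\n' }}"
    "{% generation %}{{ message['content'] + '<|im_end|>' }}{% endgeneration %}"
    "{{ '\n' }}"
    "{% endif %}"
    "{% endfor %}"
)

messages = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello!"},
]
out = tokenizer.apply_chat_template(
    messages,
    chat_template=template,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
# 1 for tokens inside the generation block ("Hello!<|im_end|>"), 0 for prompt tokens.
print(out["assistant_masks"])
```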

## Supported model families

TRL stores reference copies of the original templates so it can identify supported models at init and swap in a training template when needed. The following families are recognized: Cohere, DeepSeek-V3, Gemma, GLM-4-MoE, GPT-OSS, Llama 3 / 3.1 / 3.2, Qwen2.5, Qwen3, Qwen3-VL, Qwen3.5, Qwen3.6.

## Training templates

Patched templates that fix training-specific issues, swapped in at init when tools are enabled (GRPO) or when `assistant_only_loss=True` (SFT).

### `cohere_training.jinja`

Patched Cohere template. Diff vs `cohere.jinja`:

Wrap assistant message output with `{% generation %}` / `{% endgeneration %}` so that `return_assistant_tokens_mask=True` produces correct masks for SFT assistant-only loss.

### `deepseekv3_training.jinja`

Patched DeepSeek-V3 template. Diff vs `deepseekv3.jinja`:

- Use `| tojson` on `tool['function']['arguments']` so that `arguments` can be passed as a `dict` (the documented format per [transformers docs](https://huggingface.co/docs/transformers/en/chat_extras#tool-calling-example)). The original template uses raw string concatenation, which crashes on dict inputs.
- Wrap assistant message output with `{% generation %}` / `{% endgeneration %}` markers for SFT assistant-only loss.

### `gemma_training.jinja`

Patched Gemma template (shared by Gemma and Gemma2, which ship identical chat templates). Diff vs `gemma.jinja`:

Split the unified assistant output so that the `<start_of_turn>model\n` header (a prompt cue, not generated by the model) sits outside the generation block, and wrap the assistant content with `{% generation %}` / `{% endgeneration %}` markers for SFT assistant-only loss.

### `glm4moe_training.jinja`

Patched GLM-4-MoE template. Diff vs `glm4moe.jinja`:

Require both `<think>` and `</think>` to be present before parsing, to avoid incorrect splitting when the model generates only one tag:

```diff
- {%- if '</think>' in content %}
+ {%- if '<think>' in content and '</think>' in content %}
```

Wrap assistant message output (including the thinking block and tool calls) with `{% generation %}` / `{% endgeneration %}` markers for SFT assistant-only loss.

### `qwen3_training.jinja`

Patched Qwen3 template. Diff vs `qwen3.jinja`:

Require both `<think>` and `</think>` to be present before parsing, to avoid incorrect splitting when the model generates only one tag:

```diff
- {%- if '</think>' in content %}
+ {%- if '<think>' in content and '</think>' in content %}
```

Always include the thinking block regardless of message position. The original conditionally omits it based on `loop.last`, which changes the assistant rendering when a tool message is appended, breaking prefix-preservation:

```diff
- {%- if loop.index0 > ns.last_query_index %}
- {%- if loop.last or (not loop.last and reasoning_content) %}
- {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
- {%- else %}
- {{- '<|im_start|>' + message.role + '\n' + content }}
- {%- endif %}
- {%- else %}
- {{- '<|im_start|>' + message.role + '\n' + content }}
- {%- endif %}
+ {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
```

Wrap assistant message output with `{% generation %}` / `{% endgeneration %}` so that `return_assistant_tokens_mask=True` produces correct masks for SFT assistant-only loss.

### `gptoss_training.jinja`

Patched GPT-OSS template. Diff vs `gptoss.jinja`:

Wrap assistant message output with `{% generation %}` / `{% endgeneration %}` so that `return_assistant_tokens_mask=True` produces correct masks for SFT assistant-only loss.

### `llama3_training.jinja`

Patched Llama 3 template. Diff vs `llama3.jinja`:

Wrap assistant message output with `{% generation %}` / `{% endgeneration %}` so that `return_assistant_tokens_mask=True` produces correct masks for SFT assistant-only loss.

### `qwen2_5_training.jinja`

Patched Qwen2.5 template. Diff vs `qwen2_5.jinja`:

Wrap assistant message output with `{% generation %}` / `{% endgeneration %}` so that `return_assistant_tokens_mask=True` produces correct masks for SFT assistant-only loss.

### `qwen3_6_training.jinja`

Patched Qwen3.6 template. Diff vs `qwen3_6.jinja`: the same set of changes as `qwen3_training.jinja`. Require both `<think>` and `</think>` to be present before parsing, drop the `loop.index0 > ns.last_query_index` conditional so the thinking block is always emitted (prefix-preservation), and wrap assistant output with `{% generation %}` / `{% endgeneration %}` markers for SFT assistant-only loss.

## Related utilities

See [Chat Template Utilities](chat_template_utils) for the helper functions ([`clone_chat_template`], [`is_chat_template_prefix_preserving`], [`get_training_chat_template`]) that operate on these templates.
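
As a concrete illustration of the prefix-preservation property those helpers check, here is a sketch that tests it at the string level. TRL's actual check (see #5559) compares token sequences, but the idea is the same: rendering an extended conversation must start with the rendering of the original one.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

history = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "Let me check."},
]
# Append a tool message, as the GRPO tool call loop does:
extended = history + [{"role": "tool", "content": "4"}]

rendered = tokenizer.apply_chat_template(history, tokenize=False)
rendered_ext = tokenizer.apply_chat_template(extended, tokenize=False)

# A prefix-preserving template renders the earlier turns identically,
# so this prints True; a template that rewrites earlier assistant turns
# (e.g. by dropping thinking blocks) would fail this check.
print(rendered_ext.startswith(rendered))
```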
4 changes: 4 additions & 0 deletions docs/source/grpo_trainer.md
@@ -632,6 +632,9 @@ trainer = GRPOTrainer(
 Each tool must be a standard Python function with **type-hinted arguments and return types**, along with a **Google-style docstring** describing its purpose, arguments, and return value.
 For more details, see the [Passing tools guide](https://huggingface.co/docs/transformers/en/chat_extras#passing-tools).

+> [!TIP]
+> The GRPO tool call loop requires the chat template to be *prefix-preserving* (appending a tool message must not change how earlier messages are rendered). For known model families (e.g. Qwen3, DeepSeek-V3), TRL automatically swaps in a patched training template when tools are enabled. See [Chat Templates](chat_templates#training-templates) for the full list.
+
 Example:

 ```python
@@ -748,6 +751,7 @@ Tested with:
 - [**Qwen3**](https://huggingface.co/collections/Qwen/qwen3) — e.g., `Qwen/Qwen3-0.6B`
 - [**Qwen3-VL**](https://huggingface.co/collections/Qwen/qwen3-vl) — e.g., `Qwen/Qwen3-VL-2B-Instruct`
 - [**Qwen3.5**](https://huggingface.co/collections/Qwen/qwen35) — e.g., `Qwen/Qwen3.5-2B`
+- [**Qwen3.6**](https://huggingface.co/collections/Qwen/qwen36) — e.g., `Qwen/Qwen3.6-35B-A3B`

 > [!TIP]
 > Compatibility with all LLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes.
38 changes: 38 additions & 0 deletions docs/source/paper_index.md
@@ -1403,6 +1403,44 @@ training_args = CPOConfig(
)
```

## Triple Preference Optimization

Papers relating to the [`experimental.tpo.TPOTrainer`]

### Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization

**📜 Paper**: https://huggingface.co/papers/2405.16681

Introduces Triple Preference Optimization (TPO), a preference learning method that aligns an LLM with three responses per prompt — a gold (`reference`) completion, a preferred (`chosen`) completion and a dispreferred (`rejected`) completion — in a single optimization step. TPO combines a contrastive objective on the (chosen, rejected) pair with a supervised NLL term on the gold response, removing the need for a separate SFT stage and the reference model used in DPO. Used in TRL via [`experimental.tpo.TPOTrainer`]. To reproduce the paper's setting (Llama-3-Base, 5K), use this configuration:

```python
from trl.experimental.tpo import TPOConfig

training_args = TPOConfig(
loss_type="sigmoid", # contrastive loss between chosen and rejected (Section 3 of the paper)
tpo_alpha=1.0, # weight of the NLL term on the gold response (Section 3 of the paper)
beta=0.01, # β temperature (Table 6 of the paper)
learning_rate=5e-7, # Table 6 of the paper
num_train_epochs=1,
max_length=1024,
)
```

To use the TPO-L variant (length-normalized log-probabilities with a target reward margin γ), set `loss_type="tpo-l"` and `tpo_l_gamma`:

```python
from trl.experimental.tpo import TPOConfig

training_args = TPOConfig(
loss_type="tpo-l", # length-normalized variant (Section 3 of the paper)
tpo_alpha=1.0,
beta=0.01,
tpo_l_gamma=0.5, # γ target reward margin (Table 6 of the paper, Llama-3-Base 5K)
learning_rate=5e-7,
num_train_epochs=1,
)
```
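
A minimal end-to-end sketch of the trainer, assuming a DPO-style API and a dataset with `prompt`/`chosen`/`rejected`/`reference` columns — the column names are an assumption here, so check the [`experimental.tpo.TPOTrainer`] docs for the exact expected schema:

```python
from datasets import Dataset
from trl.experimental.tpo import TPOConfig, TPOTrainer

# Toy dataset; the column names are assumed, not verified against the trainer.
train_dataset = Dataset.from_dict(
    {
        "prompt": ["What color is the sky?"],
        "chosen": ["Blue."],
        "rejected": ["Green."],
        "reference": ["The sky is blue."],  # gold completion for the NLL term
    }
)

training_args = TPOConfig(loss_type="sigmoid", tpo_alpha=1.0, beta=0.01)
trainer = TPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",  # any causal LM; string or preloaded model
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```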

## Nash Learning from Human Feedback

Papers relating to the [`experimental.nash_md.NashMDTrainer`]