CI often fails for TestDPOTrainer::test_train_toolcall_data: https://github.com/huggingface/trl/actions/runs/22581457191/job/65414409040
torch.OutOfMemoryError: CUDA out of memory
FAILED tests/test_dpo_trainer.py::TestDPOTrainer::test_train_toolcall_data - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 42.19 MiB is free. Process 21219 has 284.00 MiB memory in use. Process 21222 has 3.40 GiB memory in use. Process 21213 has 258.00 MiB memory in use. Process 21216 has 10.77 GiB memory in use. Of the allocated memory 10.59 GiB is allocated by PyTorch, and 42.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
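The OOM message suggests expandable segments as a fragmentation workaround. A possible, untested mitigation sketch for the test process is below; it assumes the variable name quoted in the error applies to the PyTorch version used in CI (older releases read PYTORCH_CUDA_ALLOC_CONF instead):

```python
import os

# Allocator hint quoted in the OOM message; it must be set before torch initializes CUDA.
# PYTORCH_ALLOC_CONF is the name shown in the error above; older PyTorch releases read
# PYTORCH_CUDA_ALLOC_CONF, so both are set here defensively.
os.environ.setdefault("PYTORCH_ALLOC_CONF", "expandable_segments:True")
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # noqa: E402  (imported after the env vars so the allocator picks them up)
```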
Stacktrace:
> trainer.train()
tests/test_dpo_trainer.py:852:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.14/site-packages/transformers/trainer.py:1412: in train
return inner_training_loop(
.venv/lib/python3.14/site-packages/transformers/trainer.py:1742: in _inner_training_loop
tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
trl/trainer/dpo_trainer.py:1422: in training_step
return super().training_step(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.venv/lib/python3.14/site-packages/transformers/trainer.py:1951: in training_step
loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
trl/trainer/dpo_trainer.py:1417: in compute_loss
return self._compute_loss(model, inputs, return_outputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
trl/trainer/dpo_trainer.py:1349: in _compute_loss
per_token_entropy = entropy_from_logits(shift_logits.detach())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
logits = tensor([[[ 0.0229, 0.0105, -0.0195, ..., 0.0082, 0.0046, 0.0031],
[ 0.0212, 0.0045, -0.0179, ..., 0.0...64, 0.0064, 0.0093],
[ 0.0219, 0.0097, -0.0102, ..., 0.0106, 0.0090, 0.0105]]],
device='cuda:0')
chunk_size = 128
def entropy_from_logits(logits: torch.Tensor, chunk_size: int = 128) -> torch.Tensor:
"""
Compute the Shannon entropy (in nats) for each row of *logits* in a memory-efficient way.
Instead of materializing the full softmax for all rows at once, the logits are flattened to shape (N, num_classes),
where N is the product of all leading dimensions. Computation is then performed in chunks of size `chunk_size`
along this flattened dimension, reducing peak memory usage. The result is reshaped back to match the input's
leading dimensions.
Args:
logits (`torch.Tensor`):
Logits tensor of shape `(..., num_classes)`. Entropy is taken along the last axis; all leading dimensions
are preserved in the output.
chunk_size (`int`, *optional*, defaults to `128`):
Number of rows from the flattened logits to process per iteration. Smaller values reduce memory usage at
the cost of more iterations.
Returns:
`torch.Tensor`:
Entropy values with shape `logits.shape[:-1]`.
"""
original_shape = logits.shape[:-1] # all dims except num_classes
num_classes = logits.shape[-1]
# Flatten all leading dimensions into one
flat_logits = logits.reshape(-1, num_classes)
entropies = []
for chunk in flat_logits.split(chunk_size, dim=0):
logps = F.log_softmax(chunk, dim=-1)
> chunk_entropy = -(torch.exp(logps) * logps).sum(-1)
^^^^^^^^^^^^^^^^^^^^^^^^
E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 42.19 MiB is free. Process 21219 has 284.00 MiB memory in use. Process 21222 has 3.40 GiB memory in use. Process 21213 has 258.00 MiB memory in use. Process 21216 has 10.77 GiB memory in use. Of the allocated memory 10.59 GiB is allocated by PyTorch, and 42.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
trl/trainer/utils.py:602: OutOfMemoryError
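For context, the failing line allocates per-chunk temporaries whose size scales with chunk_size * num_classes, so even the chunked path can tip over a GPU that is already near capacity. Below is a minimal standalone sketch of the same chunked entropy computation; the shapes, dtype, and vocabulary size are illustrative assumptions, not values taken from the failing test:

```python
import torch
import torch.nn.functional as F

def entropy_from_logits(logits: torch.Tensor, chunk_size: int = 128) -> torch.Tensor:
    # Same chunked scheme as trl/trainer/utils.py: flatten all leading dims,
    # process chunk_size rows at a time, then restore the leading shape.
    flat_logits = logits.reshape(-1, logits.shape[-1])
    entropies = []
    for chunk in flat_logits.split(chunk_size, dim=0):
        logps = F.log_softmax(chunk, dim=-1)
        # exp(logps) and the elementwise product are chunk-sized float32 temporaries,
        # roughly chunk_size * num_classes * 4 bytes each. With ~150k classes (an
        # assumption) that is about 128 * 150_000 * 4 B ≈ 73 MiB per buffer, the same
        # order of magnitude as the 76 MiB allocation reported in the OOM above.
        entropies.append(-(torch.exp(logps) * logps).sum(-1))
    return torch.cat(entropies).reshape(logits.shape[:-1])

if __name__ == "__main__":
    logits = torch.randn(2, 64, 150_000)  # (batch, seq, vocab) -- illustrative sizes only
    print(entropy_from_logits(logits).shape)  # torch.Size([2, 64])
```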
Note that this recurrent CI error is still raised even after the merge of: