
CI often fails with torch.OutOfMemoryError: CUDA out of memory #5207

@albertvillanova

Description


CI often fails for TestDPOTrainer::test_train_toolcall_data: https://github.com/huggingface/trl/actions/runs/22581457191/job/65414409040

torch.OutOfMemoryError: CUDA out of memory

  FAILED tests/test_dpo_trainer.py::TestDPOTrainer::test_train_toolcall_data - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 42.19 MiB is free. Process 21219 has 284.00 MiB memory in use. Process 21222 has 3.40 GiB memory in use. Process 21213 has 258.00 MiB memory in use. Process 21216 has 10.77 GiB memory in use. Of the allocated memory 10.59 GiB is allocated by PyTorch, and 42.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
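The allocator hint in the error message can be tried as a CI-side mitigation. The variable must be set before PyTorch initializes the CUDA allocator, e.g. in the job's environment or at the very top of the test entrypoint. A minimal sketch, assuming a recent PyTorch that reads `PYTORCH_ALLOC_CONF` (older releases read `PYTORCH_CUDA_ALLOC_CONF` instead):

```python
import os

# Must run before `import torch` touches CUDA, otherwise it has no effect.
# Name taken from the error message above; older PyTorch versions use
# PYTORCH_CUDA_ALLOC_CONF for the same setting.
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
```

Note this only reduces fragmentation; with four test processes sharing one 14.74 GiB GPU, it may not be enough on its own.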

Stacktrace:

   >       trainer.train()
  
  tests/test_dpo_trainer.py:852: 
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
  .venv/lib/python3.14/site-packages/transformers/trainer.py:1412: in train
      return inner_training_loop(
  .venv/lib/python3.14/site-packages/transformers/trainer.py:1742: in _inner_training_loop
      tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  trl/trainer/dpo_trainer.py:1422: in training_step
      return super().training_step(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  .venv/lib/python3.14/site-packages/transformers/trainer.py:1951: in training_step
      loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  trl/trainer/dpo_trainer.py:1417: in compute_loss
      return self._compute_loss(model, inputs, return_outputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  trl/trainer/dpo_trainer.py:1349: in _compute_loss
      per_token_entropy = entropy_from_logits(shift_logits.detach())
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
  
  logits = tensor([[[ 0.0229,  0.0105, -0.0195,  ...,  0.0082,  0.0046,  0.0031],
           [ 0.0212,  0.0045, -0.0179,  ...,  0.0...64,  0.0064,  0.0093],
           [ 0.0219,  0.0097, -0.0102,  ...,  0.0106,  0.0090,  0.0105]]],
         device='cuda:0')
  chunk_size = 128
  
      def entropy_from_logits(logits: torch.Tensor, chunk_size: int = 128) -> torch.Tensor:
          """
          Compute the Shannon entropy (in nats) for each row of *logits* in a memory-efficient way.
      
          Instead of materializing the full softmax for all rows at once, the logits are flattened to shape (N, num_classes),
          where N is the product of all leading dimensions. Computation is then performed in chunks of size `chunk_size`
          along this flattened dimension, reducing peak memory usage. The result is reshaped back to match the input's
          leading dimensions.
      
          Args:
              logits (`torch.Tensor`):
                  Logits tensor of shape `(..., num_classes)`. Entropy is taken along the last axis; all leading dimensions
                  are preserved in the output.
              chunk_size (`int`, *optional*, defaults to `128`):
                  Number of rows from the flattened logits to process per iteration. Smaller values reduce memory usage at
                  the cost of more iterations.
      
          Returns:
              `torch.Tensor`:
                  Entropy values with shape `logits.shape[:-1]`.
          """
          original_shape = logits.shape[:-1]  # all dims except num_classes
          num_classes = logits.shape[-1]
      
          # Flatten all leading dimensions into one
          flat_logits = logits.reshape(-1, num_classes)
      
          entropies = []
          for chunk in flat_logits.split(chunk_size, dim=0):
              logps = F.log_softmax(chunk, dim=-1)
  >           chunk_entropy = -(torch.exp(logps) * logps).sum(-1)
                                ^^^^^^^^^^^^^^^^^^^^^^^^
  E           torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 42.19 MiB is free. Process 21219 has 284.00 MiB memory in use. Process 21222 has 3.40 GiB memory in use. Process 21213 has 258.00 MiB memory in use. Process 21216 has 10.77 GiB memory in use. Of the allocated memory 10.59 GiB is allocated by PyTorch, and 42.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  
  trl/trainer/utils.py:602: OutOfMemoryError
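The failing line `-(torch.exp(logps) * logps).sum(-1)` allocates two `[chunk_size, vocab]` temporaries per chunk (the `exp` result and the product). One of them can be avoided with an in-place multiply, since the logits are already detached at the call site. A hypothetical low-memory variant for illustration, not the TRL implementation (`entropy_from_logits_lowmem` is a made-up name):

```python
import torch
import torch.nn.functional as F


def entropy_from_logits_lowmem(logits: torch.Tensor, chunk_size: int = 128) -> torch.Tensor:
    """Chunked Shannon entropy (nats) over the last axis.

    Assumes `logits` is detached: the in-place `mul_` below would break
    autograd through `exp` if gradients were required.
    """
    original_shape = logits.shape[:-1]
    flat = logits.reshape(-1, logits.shape[-1])
    out = torch.empty(flat.shape[0], dtype=flat.dtype, device=flat.device)
    for start in range(0, flat.shape[0], chunk_size):
        chunk = flat[start:start + chunk_size]
        logps = F.log_softmax(chunk, dim=-1)
        p = logps.exp()   # one [chunk, vocab] temporary
        p.mul_(logps)     # in-place: reuses it instead of allocating a second
        out[start:start + chunk_size] = -p.sum(-1)
    return out.reshape(original_shape)
```

Even so, the dominant allocation here is the `shift_logits` tensor itself, so shrinking `chunk_size` or the test model's vocab size may matter more than the per-chunk temporaries.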

Note that this recurrent CI error is still raised even after the merge of:
