
CI often fails with torch.OutOfMemoryError: CUDA out of memory #5207

@albertvillanova

Description


CI often fails for TestDPOTrainer::test_train_toolcall_data: https://github.com/huggingface/trl/actions/runs/22581457191/job/65414409040

torch.OutOfMemoryError: CUDA out of memory

  FAILED tests/test_dpo_trainer.py::TestDPOTrainer::test_train_toolcall_data - torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 42.19 MiB is free. Process 21219 has 284.00 MiB memory in use. Process 21222 has 3.40 GiB memory in use. Process 21213 has 258.00 MiB memory in use. Process 21216 has 10.77 GiB memory in use. Of the allocated memory 10.59 GiB is allocated by PyTorch, and 42.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
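The allocator hint in the error message can be tried as a CI-side mitigation. The variable must be set before PyTorch initializes the CUDA allocator, e.g. in the job's environment or at the very top of the test entrypoint. A minimal sketch, assuming a recent PyTorch that reads `PYTORCH_ALLOC_CONF` (older releases read `PYTORCH_CUDA_ALLOC_CONF` instead):

```python
import os

# Must run before `import torch` touches CUDA, otherwise it has no effect.
# Name taken from the error message above; older PyTorch versions use
# PYTORCH_CUDA_ALLOC_CONF for the same setting.
os.environ["PYTORCH_ALLOC_CONF"] = "expandable_segments:True"
```

Note this only reduces fragmentation; with four test processes sharing one 14.74 GiB GPU, it may not be enough on its own.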

Stacktrace:

   >       trainer.train()
  
  tests/test_dpo_trainer.py:852: 
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
  .venv/lib/python3.14/site-packages/transformers/trainer.py:1412: in train
      return inner_training_loop(
  .venv/lib/python3.14/site-packages/transformers/trainer.py:1742: in _inner_training_loop
      tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  trl/trainer/dpo_trainer.py:1422: in training_step
      return super().training_step(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  .venv/lib/python3.14/site-packages/transformers/trainer.py:1951: in training_step
      loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  trl/trainer/dpo_trainer.py:1417: in compute_loss
      return self._compute_loss(model, inputs, return_outputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  trl/trainer/dpo_trainer.py:1349: in _compute_loss
      per_token_entropy = entropy_from_logits(shift_logits.detach())
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
  
  logits = tensor([[[ 0.0229,  0.0105, -0.0195,  ...,  0.0082,  0.0046,  0.0031],
           [ 0.0212,  0.0045, -0.0179,  ...,  0.0...64,  0.0064,  0.0093],
           [ 0.0219,  0.0097, -0.0102,  ...,  0.0106,  0.0090,  0.0105]]],
         device='cuda:0')
  chunk_size = 128
  
      def entropy_from_logits(logits: torch.Tensor, chunk_size: int = 128) -> torch.Tensor:
          """
          Compute the Shannon entropy (in nats) for each row of *logits* in a memory-efficient way.
      
          Instead of materializing the full softmax for all rows at once, the logits are flattened to shape (N, num_classes),
          where N is the product of all leading dimensions. Computation is then performed in chunks of size `chunk_size`
          along this flattened dimension, reducing peak memory usage. The result is reshaped back to match the input's
          leading dimensions.
      
          Args:
              logits (`torch.Tensor`):
                  Logits tensor of shape `(..., num_classes)`. Entropy is taken along the last axis; all leading dimensions
                  are preserved in the output.
              chunk_size (`int`, *optional*, defaults to `128`):
                  Number of rows from the flattened logits to process per iteration. Smaller values reduce memory usage at
                  the cost of more iterations.
      
          Returns:
              `torch.Tensor`:
                  Entropy values with shape `logits.shape[:-1]`.
          """
          original_shape = logits.shape[:-1]  # all dims except num_classes
          num_classes = logits.shape[-1]
      
          # Flatten all leading dimensions into one
          flat_logits = logits.reshape(-1, num_classes)
      
          entropies = []
          for chunk in flat_logits.split(chunk_size, dim=0):
              logps = F.log_softmax(chunk, dim=-1)
  >           chunk_entropy = -(torch.exp(logps) * logps).sum(-1)
                                ^^^^^^^^^^^^^^^^^^^^^^^^
  E           torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 14.74 GiB of which 42.19 MiB is free. Process 21219 has 284.00 MiB memory in use. Process 21222 has 3.40 GiB memory in use. Process 21213 has 258.00 MiB memory in use. Process 21216 has 10.77 GiB memory in use. Of the allocated memory 10.59 GiB is allocated by PyTorch, and 42.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
  
  trl/trainer/utils.py:602: OutOfMemoryError
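The failing line `-(torch.exp(logps) * logps).sum(-1)` allocates two `[chunk_size, vocab]` temporaries per chunk (the `exp` result and the product). One of them can be avoided with an in-place multiply, since the logits are already detached at the call site. A hypothetical low-memory variant for illustration, not the TRL implementation (`entropy_from_logits_lowmem` is a made-up name):

```python
import torch
import torch.nn.functional as F


def entropy_from_logits_lowmem(logits: torch.Tensor, chunk_size: int = 128) -> torch.Tensor:
    """Chunked Shannon entropy (nats) over the last axis.

    Assumes `logits` is detached: the in-place `mul_` below would break
    autograd through `exp` if gradients were required.
    """
    original_shape = logits.shape[:-1]
    flat = logits.reshape(-1, logits.shape[-1])
    out = torch.empty(flat.shape[0], dtype=flat.dtype, device=flat.device)
    for start in range(0, flat.shape[0], chunk_size):
        chunk = flat[start:start + chunk_size]
        logps = F.log_softmax(chunk, dim=-1)
        p = logps.exp()   # one [chunk, vocab] temporary
        p.mul_(logps)     # in-place: reuses it instead of allocating a second
        out[start:start + chunk_size] = -p.sum(-1)
    return out.reshape(original_shape)
```

Even so, the dominant allocation here is the `shift_logits` tensor itself, so shrinking `chunk_size` or the test model's vocab size may matter more than the per-chunk temporaries.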

Note that this recurrent CI error is still raised even after the merge of:
