Iterable Dataset #2852


Status: Open. Wants to merge 6 commits into base: impl-step-based-ckpt.

Conversation

@felipemello1 (Contributor) commented on Jun 26, 2025

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Enable Iterable datasets in torchtune.

Context: built on top of the ongoing step-based-ckpt PR: #2384

Tips when reviewing this PR

Follow this order:

  1. recipes/configs/llama3_2/3B_full.yaml: see the configs
  2. torchtune/datasets/_iterable_base.py: base class for iterable dataset
  3. torchtune/datasets/_hf_iterable.py: dataset based on HF -- can be replaced easily; downstream code does not expect HF.
  4. torchtune/datasets/_interleaved.py: interleave the datasets
  5. torchtune/data/_metrics.py: metrics transform to create the metrics
  6. torchtune/data/_aggregator.py: aggregate the metrics at the recipe level
  7. recipes/full_finetune_distributed.py: everything put together
  8. unit tests


Changelog

  1. Datasets are infinite
  2. The user no longer defines epochs, but training steps (how many times we update the optimizer)
  3. Support for dataset mixing -- follow-up PRs will enable curriculum learning
  4. Support for dataset metric logging -- the user can see epochs per dataset, the distribution of token lengths, etc. It is easy to add new metrics.
  5. HF-agnostic: even though the current dataset is HF, the dataloader, packing, data mixing, and metric logging are agnostic to it
  6. Well tested in the distributed setting -- WARNING: we need better testing for the multiprocess dataloader. It doesn't guarantee determinism, so I postponed testing this setting

Config and builder design based on the discussions after this RFC: #2785
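
To make the changelog concrete, here is a minimal, self-contained sketch (not the PR's code; all names are illustrative) of the kind of interface it implies: an infinite, checkpointable iterable dataset that treats epochs as a metric rather than a stop condition and tags each sample with per-dataset metrics.

from typing import Any, Iterator

from torch.utils.data import IterableDataset


class ToyInfiniteTextDataset(IterableDataset):
    def __init__(self, name: str, samples: list[list[int]]):
        self.dataset_name = name
        self._samples = samples
        self._num_epochs = 0
        self._index = 0

    def __iter__(self) -> Iterator[dict[str, Any]]:
        while True:  # infinite: epochs are tracked as a metric, not a stop condition
            if self._index >= len(self._samples):
                self._index = 0
                self._num_epochs += 1
            tokens = self._samples[self._index]
            self._index += 1
            yield {
                "tokens": tokens,
                "metrics": [
                    (self.dataset_name, "num_epochs", self._num_epochs),
                    (self.dataset_name, "seq_len", len(tokens)),
                ],
            }

    def state_dict(self) -> dict[str, Any]:
        # Enough to resume mid-epoch from a step-based checkpoint.
        return {"index": self._index, "num_epochs": self._num_epochs}

    def load_state_dict(self, state: dict[str, Any]) -> None:
        self._index = state["index"]
        self._num_epochs = state["num_epochs"]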

Next steps:
1. Gather feedback on metric logging. E.g. we can add more aggregation types.
2. Polish the code a little bit
3. Add packing from this RFC: #2819
4. Add curriculum learning
5. Docs?

Test plan

[Screenshots of test runs]

UNTESTED: resuming from checkpoint in the recipe. However, we have plenty of tests showing that resuming works for these iterable datasets.


pytorch-bot bot commented Jun 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2852

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job, 1 Unrelated Failure

As of commit 93fa743 with merge base 3d73591:

NEW FAILURE - The following job has failed:

  • GPU tests / gpu_test (3.11, stable) (gh)
    tests/recipes/test_qat_lora_finetune_distributed.py::TestQATLoRAFinetuneDistributedRecipe::test_training_state_on_resume_with_async_checkpointing[llama3/8B_qat_lora-llama3-tune-False]

CANCELLED JOB - The following job was cancelled. Please retry:

  • GPU tests / gpu_test (3.10, stable) (gh)
    tests/recipes/test_qat_lora_finetune_distributed.py::TestQATLoRAFinetuneDistributedRecipe::test_training_state_on_resume_with_async_checkpointing[llama3/8B_qat_lora-llama3-tune-False]

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jun 26, 2025
@felipemello1 changed the title from "first commit" to "Iterable Dataset" on Jun 26, 2025
@@ -94,3 +95,72 @@ def slimorca_dataset(
)
return PackedDataset(ds, max_seq_len=tokenizer.max_seq_len)
return ds


def slimorca_iterable_dataset(
felipemello1 (author):

Added here to demonstrate the datamix iterable dataset with this example. Personally, I dislike exposing all of the args and defaults. I would prefer to expose only what's specific to this builder.

return tokenized_dict


def sft_iterable_dataset(
felipemello1 (author):

Its only purpose is to hardcode the 'output_transform'.

Comment on lines +101 to +104
logger.warning(
f"Child dataset {self._datasets[ds_name].dataset_name} was exhausted. "
"This is unexpected for an infinite dataset. Re-initializing its iterator."
)
felipemello1 (author):

Not 100% sure I like this.

Contributor:

Let's do this: simply have a subclass for InfiniteIterable so this is super explicit
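
A rough illustration of that suggestion (TuneIterableDataset exists in this PR; the infinite subclass and the interleaver shown here are hypothetical sketches, not the PR's code):

from abc import ABC, abstractmethod
from typing import Any, Iterator

from torch.utils.data import IterableDataset


class TuneIterableDataset(IterableDataset, ABC):
    @abstractmethod
    def __iter__(self) -> Iterator[dict[str, Any]]:
        ...


class InfiniteTuneIterableDataset(TuneIterableDataset, ABC):
    """Marker base class: subclasses promise that iteration never stops,
    so exhaustion is a bug rather than an expected condition."""


class InterleavedDatasetSketch:
    def __init__(self, datasets: list[InfiniteTuneIterableDataset]):
        # Enforce the contract up front instead of warning and re-initializing
        # iterators at runtime.
        for ds in datasets:
            if not isinstance(ds, InfiniteTuneIterableDataset):
                raise TypeError(
                    f"{type(ds).__name__} must be an InfiniteTuneIterableDataset"
                )
        self._datasets = datasets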

@@ -101,3 +102,64 @@ def alpaca_dataset(
original Alpaca dataset, `yahma/alpaca-cleaned <https://huggingface.co/datasets/yahma/alpaca-cleaned>`_.
See the dataset page and :func:`~torchtune.datasets.alpaca_dataset` for more details.
"""


def alpaca_iterable_dataset(
felipemello1 (author):

Added here to demonstrate the datamix iterable dataset with this example. Personally, I dislike exposing all of the args and defaults. I would prefer to expose only what's specific to this builder.

Contributor:

But you are doing this with ``load_dataset_kwargs``, right? Or did you mean something else?

Contributor:

nit: it's a function, so... get_alpaca_iterable_dataset?

@Darktex (Contributor) left a comment:

Great PR! I mainly had a question on the interaction with packing and on the SFT transform


# Each stat becomes its own metric
# For percentiles, it is an approximation by computing avg of averages
metrics[(ds_name, f"{metric_name}_mean")] = {

if not grouped:
return reduced

for key, metric_dicts in grouped.items():
Contributor:

Seems weird to write the code below twice. Can you factor it out and simply call it after reducing on rank zero?
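
A sketch of that refactor (the helper names and the shape of `grouped` are assumptions, not the PR's actual aggregator code): compute the per-key aggregation in one helper and call it once, after the rank-zero reduce.

from statistics import mean


def _aggregate_grouped(
    grouped: dict[tuple[str, str], list[float]]
) -> dict[tuple[str, str], float]:
    # The one place that turns grouped raw values into final metrics.
    return {key: mean(values) for key, values in grouped.items()}


def finalize_metrics(
    reduced: dict[tuple[str, str], float],
    grouped: dict[tuple[str, str], list[float]],
    is_rank_zero: bool,
) -> dict[tuple[str, str], float]:
    if not grouped or not is_rank_zero:
        return reduced
    reduced.update(_aggregate_grouped(grouped))
    return reduced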

agg_type: AggregationType


class MetricTransform(Protocol):
Contributor:

Can you explain a bit more about what the API is? It's not clear to me because we have __call__, which doesn't have a name that tells me what it does :D
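
For reference, a sketch of one way to spell out the contract (`agg_type` / `AggregationType` appear in the diff above; the other names are illustrative): a metric transform is a per-sample callable that appends Metric records to the sample.

from dataclasses import dataclass
from enum import Enum
from typing import Any, Protocol


class AggregationType(Enum):
    SUM = "sum"
    MEAN = "mean"


@dataclass
class Metric:
    dataset_name: str
    name: str
    value: float
    agg_type: AggregationType


class MetricTransform(Protocol):
    def __call__(self, sample: dict[str, Any]) -> dict[str, Any]:
        """Return the sample with Metric records appended under sample['metrics']."""
        ...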

...


class StandardMetricTransform(MetricTransform):
Contributor:

Hmmm, the name makes me think this is an ABC. Maybe DefaultTrainingMetricTransform?


@@ -178,3 +180,99 @@ def __call__(self, sample: Mapping[str, Any]) -> dict[str, Any]:
tokenized_dict = transformed_sample

return tokenized_dict


class SFTOutputTransform(Transform):
Contributor:

This transform is critical for end-to-end performance, so this is a piece where we need to spend cycles to optimize.

I see several ways to find more performance in this code:

  1. No numpy. Even if we haven't moved to GPU yet, doing everything in torch ensures that the code will work regardless and makes it more robust
  2. I think np.where allocates?
  3. Asserts are debug-only statements that get disabled if you pass the -O flag, so this check feels more like it should be a runtime check

I asked an LLM to rewrite it given these constraints and it gave me this, which looks reasonable on the surface:

from typing import Any, Mapping

import torch

# NOTE: `Transform` is assumed to be torchtune's transform base/protocol,
# imported in the real file.
CROSS_ENTROPY_IGNORE_IDX = -100          # set to whatever you use

class SFTOutputTransform(Transform):
    """
    Build the `"labels"` tensor for causal-LM SFT.

    Expects each dataset element to contain **1-D** torch tensors
    * ``"tokens"`` – token IDs, dtype=torch.long
    * ``"mask"``   – bool/int where **True** marks positions to ignore

    Produces ``"labels"`` of the same shape such that

        labels[t] =  tokens[t+1]                # shift left
        labels[t] =  IGNORE_IDX  if mask[t+1]   # respect mask
        labels[-1] = IGNORE_IDX                 # last token has no target

    All ops are vectorised; only one fresh tensor (`labels`) is allocated.
    """

    __slots__ = ()      # tiny perf win (avoids per-instance __dict__)

    def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
        try:
            tokens: torch.Tensor = sample["tokens"]
            mask:   torch.Tensor = sample["mask"]
        except KeyError:
            raise ValueError(
                "SFTOutputTransform expects 'tokens' and 'mask' in the sample."
            )

        if tokens.ndim != 1 or mask.ndim != 1:
            raise ValueError("Both 'tokens' and 'mask' must be 1-D tensors.")

        # ── build labels ────────────────────────────────────────────────
        # 1. pre-fill with IGNORE so we don’t need extra assignments later
        labels = tokens.new_full(tokens.shape, CROSS_ENTROPY_IGNORE_IDX)

        # 2. left-shift via cheap views (no copy)
        labels[:-1].copy_(tokens[1:])

        # 3. apply mask in-place (single fused kernel on GPU/CPU)
        labels[:-1].masked_fill_(mask[1:].bool(), CROSS_ENTROPY_IGNORE_IDX)

        # (labels[-1] is already IGNORE_IDX from the new_full above)

        # ── return a shallow-copied mapping so the original sample stays intact
        out = dict(sample)
        out["labels"] = labels
        return out
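
A quick sanity check of that snippet (my example values, not from the PR):

sample = {
    "tokens": torch.tensor([10, 11, 12, 13]),
    "mask": torch.tensor([True, True, False, False]),
}
out = SFTOutputTransform()(sample)
# Shifted labels are [11, 12, 13, IGNORE]; mask[1:] = [True, False, False]
# then masks out the first position, giving [-100, 12, 13, -100].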

self._sampling_generator = torch.Generator().manual_seed(seed)

# Normalize weights to sum to 1
# TODO: make it a property? rely on ds.weight?
Contributor:

I'd make it a property so it remains visible. Very cheap anyway
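
A minimal sketch of that property (class and attribute names are assumptions, not the PR's code):

import torch


class WeightedInterleaveSketch:
    def __init__(self, weights: dict[str, float], seed: int = 0):
        self._raw_weights = weights
        self._sampling_generator = torch.Generator().manual_seed(seed)

    @property
    def normalized_weights(self) -> torch.Tensor:
        # Recomputed on access: cheap, always reflects the current raw weights,
        # and keeps the door open for curriculum-style weight updates later.
        w = torch.tensor(list(self._raw_weights.values()), dtype=torch.float)
        return w / w.sum()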


while True:
# Sample which dataset to use
ds_idx = torch.multinomial(
Contributor:

Hmmmmm, I think we should also log what we did. It's reasonable and cheap to accumulate a list that maps iteration_id to dataset_id in self.datasets. When this guy dumps its state, this log should be part of it
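
A sketch of that bookkeeping (names are illustrative; the real interleaved dataset would pull a sample from the chosen child rather than return its name):

import torch


class SamplingLogSketch:
    """Records which child dataset served each iteration and exposes the log
    through state_dict(), so it gets dumped together with the rest of the state."""

    def __init__(self, dataset_names: list[str], weights: list[float], seed: int = 0):
        self._dataset_names = dataset_names
        w = torch.tensor(weights, dtype=torch.float)
        self._weights = w / w.sum()
        self._sampling_generator = torch.Generator().manual_seed(seed)
        self._iteration = 0
        self._sampling_log: list[tuple[int, str]] = []  # (iteration_id, dataset_name)

    def _choose_dataset(self) -> str:
        ds_idx = torch.multinomial(
            self._weights, 1, generator=self._sampling_generator
        ).item()
        ds_name = self._dataset_names[ds_idx]
        self._sampling_log.append((self._iteration, ds_name))
        self._iteration += 1
        return ds_name

    def state_dict(self) -> dict:
        # One small entry per step, so checkpointing the log is cheap.
        return {"iteration": self._iteration, "sampling_log": list(self._sampling_log)}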


from torch.utils.data import IterableDataset


class TuneIterableDataset(IterableDataset, ABC):
Contributor:

We need this guy to interact with packing, and IIUC that isn't currently happening?

The algorithm we should implement is this (rough sketch after the list):

  1. One batch can be made of multiple calls to next. We keep taking samples until we exceed the max seq len. When we do, we put the last one aside (we'll use it to start the next batch), pad the current one to max len, and return.
  2. The calls to next will go to the interleaved dataset, so we automatically construct mixed batches from multiple datasets without much effort.
  3. Also, every time we call next we should make space for logging transforms (which we are; you already wrote them). I think it's ok to make your metric transforms and aggregators an optional property here so the semantics are clearer.
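
A rough sketch of that greedy packing loop (the function name and the bare token-list representation are assumptions; a real version would also carry masks/labels and document boundaries, and would plug the metric transforms in at each next call):

from collections.abc import Iterable, Iterator


def pack_samples(
    sample_iter: Iterator[list[int]], max_seq_len: int, pad_id: int = 0
) -> Iterable[list[int]]:
    """Greedily pack an infinite (e.g. interleaved) stream of token lists."""
    carry: list[int] | None = None
    while True:  # assumes the upstream iterator never stops, as in this PR
        pack: list[int] = []
        while True:
            sample = carry if carry is not None else next(sample_iter)
            carry = None
            if pack and len(pack) + len(sample) > max_seq_len:
                carry = sample  # set aside; it will start the next pack
                break
            pack.extend(sample[:max_seq_len])  # truncate oversized single samples
            if len(pack) >= max_seq_len:
                break
        yield pack + [pad_id] * (max_seq_len - len(pack))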
