Add support for Slurm arrays by sarahyurick · Pull Request #2059 · NVIDIA-NeMo/Curator

sarahyurick · 2026-06-09T21:31:11Z

TODO:

Add retry support
Add FailedTask support
Add a tutorial
Add nemo-curator-slurm-cli (not planned for this PR)
Address case when SLURM_ARRAY_TASK_COUNT > cluster limit
Add tests

Signed-off-by: Sarah Yurick <sarahyurick@gmail.com>

copy-pr-bot · 2026-06-09T21:31:14Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

sarahyurick · 2026-06-12T16:37:10Z

        # Guarantee every emitted task has a task_id (derived id, or uuid fallback).
        results = self._post_process_task_ids(tasks, results)

+        self._record_failed_tasks([r for r in results if isinstance(r, FailedTask)])


Discussed with @abhinavg4. For now, the PR keeps track of FailedTask instances by looking for a user-set environment variable, FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR", and writing one JSON file per failed task to the specified directory.

I went with the environment variable and file-write approach because it seems more reliable than trying to handle a global Python variable, etc. It is an environment variable so that BaseStageAdapter does not have to propagate an additional parameter to every single stage (which I think would involve updating the executors as well). Open to other suggestions.

abhinavg4

OK, I think a lot of the util functions are coming from this feature, and there might be an easier way to do this. Continuing in DMs.

sarahyurick

There are really only two functions:

1. Write info about failed tasks
2. Use task IDs to filter by Slurm array index

I can move them to util scripts if that makes the code easier to read.
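
A minimal sketch of the failed-task recording described in this thread, assuming one JSON file per failed task written to the directory named by NEMO_CURATOR_FAILED_TASKS_DIR. The FailedTask dataclass here is a hypothetical stand-in (the real class and its fields are not shown in this conversation), and record_failed_tasks, the file naming scheme, and the JSON layout are illustrative assumptions rather than the PR's implementation.

```python
import json
import os
import uuid
from dataclasses import dataclass
from pathlib import Path

# Variable name taken from the comment above.
FAILED_TASKS_DIR_ENV_VAR = "NEMO_CURATOR_FAILED_TASKS_DIR"


@dataclass
class FailedTask:
    """Hypothetical stand-in for Curator's FailedTask; real fields may differ."""

    task_id: str
    error: str


def record_failed_tasks(failed: list[FailedTask]) -> list[Path]:
    """Write one JSON file per failed task; no-op if the env var is unset."""
    out_dir = os.environ.get(FAILED_TASKS_DIR_ENV_VAR)
    if not out_dir or not failed:
        return []
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for task in failed:
        # Assumed naming scheme: a uuid suffix keeps concurrent writers
        # from overwriting each other's files.
        path = directory / f"{task.task_id}-{uuid.uuid4().hex}.json"
        path.write_text(json.dumps({"task_id": task.task_id, "error": task.error}))
        written.append(path)
    return written
```

Because each worker only ever creates new files, no coordination between processes is needed, which is one reason a file-per-task layout can be simpler than shared in-memory state. The cost at scale is a large number of small files to read back.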

praateekmahajan

Took a super quick look; here are some general thoughts:

1. Instead of adding the same three or four fields to every "source" stage, can we have a base class and inherit from it?
2. Alternatively (or maybe in addition), pipeline.build, IIRC, now dynamically sets is_source_stage=True on the first stage, so can we just rely on that? If we do, then inside backends/base.py we can say "if this is a source stage AND Slurm is enabled, then just use task_id as my key and decide which shard it belongs to". This is something @abhinavg4 and I had discussed. It reduces the number of changes needed across the Curator codebase and also generalizes, since source stages have a task_id that is likely assigned using get_determenistic_task_id, which is a hash(metadat['source_files']).

sarahyurick · 2026-06-15T21:18:41Z

For 1, sure.

For 2, we could, but it makes this PR dependent on the resumability PR, which I thought we were trying to avoid. Also, it is not immediately obvious to me how it would work for source stages that are not a FilePartitioningStage. I get the general idea, but I am not convinced that it would always work. It sounds to me like it would probably have to convert all unselected tasks to NoneTask, maybe?
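
The idea discussed in this thread, using a source task's task_id as the key that decides which shard it belongs to, can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: shard_for_task and select_tasks are hypothetical helper names, the SHA-256-modulo scheme is one possible stable mapping, and unselected tasks are simply dropped here rather than converted to NoneTask.

```python
import hashlib


def shard_for_task(task_id: str, total_shards: int) -> int:
    """Map a task_id to a shard in [0, total_shards) using a stable hash."""
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and would differ across Slurm array tasks.
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total_shards


def select_tasks(task_ids: list[str], shard_index: int, total_shards: int) -> list[str]:
    """Keep only the task IDs assigned to this shard (hypothetical helper)."""
    return [t for t in task_ids if shard_for_task(t, total_shards) == shard_index]
```

Every array task evaluates the same function over the same task IDs, so the shards are disjoint and together cover all tasks without any communication between jobs. This only holds when the task IDs themselves are deterministic across processes.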

abhinavg4

I think I still have some major comments about the algorithm, especially about how the writing and reading will work at scale. Will continue in the DMs.

abhinavg4 · 2026-06-23T18:10:20Z

+SLURM_ARRAY_ENABLED_ENV_VAR = "NEMO_CURATOR_SLURM_ARRAY_ENABLED"
+SLURM_ARRAY_SHARD_INDEX_ENV_VAR = "NEMO_CURATOR_SLURM_ARRAY_SHARD_INDEX"
+SLURM_ARRAY_TOTAL_SHARDS_ENV_VAR = "NEMO_CURATOR_SLURM_ARRAY_TOTAL_SHARDS"
+SLURM_ARRAY_MINIMUM_SHARD_INDEX_ENV_VAR = "NEMO_CURATOR_SLURM_ARRAY_MINIMUM_SHARD_INDEX"


Do we need all these variables, or can they be inferred automatically? I think the initial design that @praateekmahajan had in mind was that we just add --array=1-100 to the Slurm submit command and everything else works out of the box. Currently, it seems like the effort on the user's side is a bit more than that?

sarahyurick

The idea is to give the user full control if they need to override the total number of shards (needed for reruns) or the minimum shard index (needed to get around any Slurm array size limits). To enable Slurm array partitioning, the only thing explicitly required is:

NEMO_CURATOR_SLURM_ARRAY_ENABLED=1

With that set, the remaining values are picked up automatically from the environment variables without any issues. The user can then override them with NEMO_CURATOR_SLURM_ARRAY_SHARD_INDEX, etc. as desired (but this is not required).
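
A rough sketch of the resolution order described above, assuming the NEMO_CURATOR_* overrides take precedence and the values otherwise come from the variables Slurm sets for array jobs (SLURM_ARRAY_TASK_ID, SLURM_ARRAY_TASK_COUNT, SLURM_ARRAY_TASK_MIN). ShardConfig and resolve_shard_config are hypothetical names, and the exact semantics here (a zero-based shard index obtained by subtracting the minimum index, and "1" as the enabling value) are assumptions for illustration, not confirmed behavior of the PR.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional

# Override variable names taken from the diff above.
ENABLED = "NEMO_CURATOR_SLURM_ARRAY_ENABLED"
SHARD_INDEX = "NEMO_CURATOR_SLURM_ARRAY_SHARD_INDEX"
TOTAL_SHARDS = "NEMO_CURATOR_SLURM_ARRAY_TOTAL_SHARDS"
MIN_SHARD_INDEX = "NEMO_CURATOR_SLURM_ARRAY_MINIMUM_SHARD_INDEX"


@dataclass(frozen=True)
class ShardConfig:
    """Hypothetical container for the resolved values."""

    shard_index: int  # zero-based (assumption)
    total_shards: int


def resolve_shard_config(env: Optional[Mapping[str, str]] = None) -> Optional[ShardConfig]:
    """Return the shard config, or None when array partitioning is not enabled."""
    env = os.environ if env is None else env
    if env.get(ENABLED) != "1":
        return None
    # User overrides win; otherwise fall back to Slurm's array variables.
    raw_index = int(env.get(SHARD_INDEX) or env["SLURM_ARRAY_TASK_ID"])
    total = int(env.get(TOTAL_SHARDS) or env["SLURM_ARRAY_TASK_COUNT"])
    minimum = int(env.get(MIN_SHARD_INDEX) or env.get("SLURM_ARRAY_TASK_MIN", "0"))
    index = raw_index - minimum
    if not 0 <= index < total:
        raise ValueError(f"shard index {index} is outside [0, {total})")
    return ShardConfig(shard_index=index, total_shards=total)
```

Under these assumptions, a job submitted with --array=1-100 and only NEMO_CURATOR_SLURM_ARRAY_ENABLED=1 exported would resolve to shard indices 0 through 99 out of 100 shards, with no other user input.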

abhinavg4 · 2026-06-23T18:14:54Z

        # Guarantee every emitted task has a task_id (derived id, or uuid fallback).
        results = self._post_process_task_ids(tasks, results)

+        self._record_failed_tasks([r for r in results if isinstance(r, FailedTask)])


Ok I think a lot of the util functions are coming because of this feature, and there might be an easier way for this. Continuing in DMs.

basic slurm array file partitioning (bde2217)

sarahyurick and others added 4 commits June 9, 2026 14:54

add slurm array params to composite stages using filepartitioningstage (a0595f6)
add tutorial and tests (43ee179)
Merge branch 'main' into slurm_array (cae17b3)
ruff (acfeceb)

sarahyurick commented Jun 11, 2026

Comment thread nemo_curator/stages/text/deduplication/semantic.py Outdated

sarahyurick marked this pull request as ready for review June 11, 2026 17:31

sarahyurick requested review from a team, abhinavg4 and suiyoubi as code owners June 11, 2026 17:31

copy-pr-bot Bot temporarily deployed to public June 11, 2026 17:31 Inactive

copy-pr-bot Bot temporarily deployed to test June 11, 2026 17:32 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci June 11, 2026 17:32 Inactive

sarahyurick and others added 3 commits June 11, 2026 13:49

ruff (717edac)
Merge branch 'main' into slurm_array (8f2345b)
greptile comments (ebba73e)

sarahyurick commented Jun 12, 2026

Merge branch 'main' into slurm_array (5e58793)

praateekmahajan reviewed Jun 15, 2026

sarahyurick and others added 11 commits June 16, 2026 13:33

TextSemanticDeduplicationWorkflow revert (ad8f68a)
Merge branch 'main' into slurm_array (437270f)
Merge branch 'main' into slurm_array (b55ec47)
use SlurmArrayConfig dataclass (672f3d2)
use base stage adapter and source stage instead of file partitioning stage (2814eec)
formatting (4dacd31)
ruff (d54d3a3)
greptile feedback (5e18093)
update tutorial (94312c4)
Merge branch 'main' into slurm_array (298a632)
Merge branch 'main' into slurm_array (4d5d20a)

abhinavg4 requested changes Jun 23, 2026

sarahyurick commented Jun 23, 2026

Comment thread nemo_curator/backends/base.py Outdated

save bulk updates (e0a8af5)

greptile-apps Bot reviewed Jun 24, 2026

Comment thread nemo_curator/utils/retry_manifest.py

sarahyurick and others added 8 commits June 24, 2026 13:19

ruff (62a4493)
add greptile suggestion (11c74cc)
ruff (67e0eef)
Merge branch 'main' into slurm_array (3ee13ff)
Merge branch 'main' into slurm_array (14ddc3c)
use login node to submit retry jobs (f024975)
Merge branch 'main' into slurm_array (ebc93dc)
Merge branch 'main' into slurm_array (3fc2dbc)
