
feat: resume interrupted dataset generation runs (sync + async engine) #526

Open

przemekboruta wants to merge 16 commits into NVIDIA-NeMo:main from przemekboruta:main

Conversation


@przemekboruta (Contributor) commented Apr 13, 2026

Summary

Closes #525

Adds a resume: ResumeMode = ResumeMode.NEVER parameter to DataDesigner.create() and DatasetBuilder.build(). Generation picks up from where the interrupted run left off — for both the sync and async engines.

```python
from data_designer import DataDesigner, ResumeMode

dd = DataDesigner(...)
dd.add_column(...)

# First run — interrupted mid-way
results = dd.create(config_builder, num_records=10_000)

# After restart — picks up from the last completed batch/row-group
results = dd.create(config_builder, num_records=10_000, resume=ResumeMode.ALWAYS)

# Or: resume only if the config has not changed, otherwise start fresh
results = dd.create(config_builder, num_records=10_000, resume=ResumeMode.IF_POSSIBLE)
```

Changes

| Layer | Change |
| --- | --- |
| ArtifactStorage | New ResumeMode(StrEnum) enum (NEVER/ALWAYS/IF_POSSIBLE); resume: ResumeMode = ResumeMode.NEVER field; resolved_dataset_name skips timestamp logic on ALWAYS/IF_POSSIBLE; new clear_partial_results() |
| DatasetBatchManager.start() | New start_batch and initial_actual_num_records params (default 0, no breakage) |
| DatasetBuilder.build() | New resume: ResumeMode param; _load_resume_state() reads and validates metadata.json; _build_with_resume() skips completed batches (sync); _build_async() skips completed row groups (async); _check_resume_config_compatibility() compares config fingerprints and invalidates the resolved_dataset_name cache on IF_POSSIBLE downgrade; partial-completion warning moved before the return in _build_async (was dead code) |
| RowGroupBufferManager.__init__() | New initial_actual_num_records and initial_total_num_batches params to seed counters on resume |
| DatasetBuilder._find_completed_row_group_ids() | New helper — scans parquet-files/ for batch_*.parquet to determine which async row groups are already done |
| finalize_row_group closure | Now writes incremental metadata.json after every row-group checkpoint (not just at the end), making all async runs resumable if interrupted |
| DataDesigner.create() | Exposes resume: ResumeMode, passes it through to ArtifactStorage and builder.build() |
| bool return in _build_with_resume / _build_async | build() gates run_after_generation on the return value so processors are never re-run on an already-complete dataset |
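
As a small illustration of the DatasetBatchManager.start() change in the table above, a minimal sketch of the counter seeding (attribute names follow the commit messages later in this thread; the real class has more to it):

```python
# Minimal sketch of the counter seeding on resume; defaults of 0 keep all
# existing call sites unaffected. Attribute names follow this thread, the
# rest of the class is elided.
class DatasetBatchManager:
    def __init__(self, num_records: int, buffer_size: int) -> None:
        self.num_records = num_records
        self.buffer_size = buffer_size
        self._current_batch_number = 0
        self._actual_num_records = 0

    def start(self, start_batch: int = 0, initial_actual_num_records: int = 0) -> None:
        # A resuming caller seeds both counters from the loaded resume state;
        # a fresh run passes nothing and starts at zero.
        self._current_batch_number = start_batch
        self._actual_num_records = initial_actual_num_records
```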

ResumeMode semantics

| Mode | Behaviour |
| --- | --- |
| NEVER (default) | Always start a fresh run; an existing dataset gets a timestamped directory |
| ALWAYS | Resume from the last checkpoint; raise DatasetGenerationError if incompatible |
| IF_POSSIBLE | Resume if the current config fingerprint matches the stored one; silently start fresh otherwise (no error) |

Validation and error cases

  • Missing metadata.json (interrupted before first batch): restarts from scratch (both engines)
  • num_records less than already-generated records → DatasetGenerationError; num_records greater than original target is allowed (extends the dataset)
  • buffer_size mismatch → DatasetGenerationError
  • Column/model config changed + ALWAYS → DatasetGenerationError; with IF_POSSIBLE → silent fresh start, resolved_dataset_name cache invalidated so the fresh run gets a timestamped directory
  • Dataset already complete → warning logged, returns existing path without re-running processors (both engines)
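
A minimal sketch of those validation rules, assuming the _ResumeState and error names used in this thread (the shipped checks live in _load_resume_state and may differ in detail):

```python
# Sketch only: assumes the _ResumeState fields and DatasetGenerationError
# name described in this thread; the shipped logic lives in _load_resume_state.
from dataclasses import dataclass

class DatasetGenerationError(Exception):
    pass

@dataclass
class _ResumeState:
    actual_num_records: int
    buffer_size: int

def validate_resume(state: _ResumeState, num_records: int, buffer_size: int) -> None:
    # Shrinking the target below what is already on disk is an error;
    # a larger target is allowed and simply extends the dataset.
    if num_records < state.actual_num_records:
        raise DatasetGenerationError(
            "Cannot resume: num_records is smaller than the records already generated."
        )
    # Batch and row-group boundaries depend on buffer_size, so it must match exactly.
    if buffer_size != state.buffer_size:
        raise DatasetGenerationError("Cannot resume: buffer_size differs from the original run.")
```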

Test plan

  • test_resolved_dataset_name_resume_uses_existing_folder
  • test_resolved_dataset_name_resume_raises_when_no_existing_folder
  • test_resolved_dataset_name_resume_raises_when_folder_is_empty
  • test_resolved_dataset_name_if_possible_uses_existing_folder
  • test_resolved_dataset_name_if_possible_uses_clean_name_when_no_existing_folder
  • test_clear_partial_results_removes_partial_folder
  • test_clear_partial_results_is_noop_when_no_partial_folder
  • test_start_with_start_batch
  • test_start_with_initial_actual_num_records
  • test_start_with_start_batch_and_initial_actual_num_records
  • test_start_default_values_unchanged
  • test_build_resume_starts_fresh_without_metadata
  • test_build_resume_raises_when_num_records_below_actual
  • test_build_resume_allows_larger_num_records
  • test_build_resume_raises_on_buffer_size_mismatch
  • test_build_resume_runs_remaining_batches
  • test_build_resume_logs_warning_when_already_complete
  • test_build_resume_already_complete_does_not_run_after_generation_processors
  • test_find_completed_row_group_ids_empty_dir
  • test_find_completed_row_group_ids_with_files
  • test_find_completed_row_group_ids_ignores_non_batch_files
  • test_build_async_resume_logs_warning_when_already_complete
  • test_build_async_resume_starts_fresh_without_metadata
  • test_build_async_resume_already_complete_does_not_run_after_generation_processors
  • test_find_completed_row_group_ids_used_for_initial_total_batches
  • test_initial_actual_num_records_from_filesystem_in_crash_window
  • test_build_async_resume_skip_row_groups_contains_completed_ids

@przemekboruta requested a review from a team as a code owner — Apr 13, 2026 11:15

greptile-apps Bot commented Apr 13, 2026

Greptile Summary

This PR adds a resume: ResumeMode parameter to DataDesigner.create() and DatasetBuilder.build() that lets interrupted generation runs pick up from the last completed batch (sync) or row group (async), using metadata checkpointing and filesystem scanning to determine progress.

  • P1 — IF_POSSIBLE raises on first run: _check_resume_config_compatibility() returns True when the dataset directory (and config file) don't yet exist, upgrading IF_POSSIBLE to ALWAYS. _write_builder_config() then resolves base_dataset_path → resolved_dataset_name, which raises ArtifactStorageError("Cannot resume: no existing dataset found") because ALWAYS + missing directory = error. The documented behaviour is a silent fresh start. Fix: return False from _check_resume_config_compatibility when the dataset directory is absent.

Confidence Score: 4/5

Mostly safe to merge; one P1 path causes IF_POSSIBLE to raise instead of silently starting fresh when no prior dataset exists.

One P1 bug caps the score at 4. The core resume logic (sync + async), crash-window handling, and processor-skip behaviour are all well-tested and correct. The bug only affects the IF_POSSIBLE mode on first-ever runs where no dataset directory exists yet.

dataset_builder.py — specifically _check_resume_config_compatibility() and the IF_POSSIBLE upgrade block in build()

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Core resume logic — _build_with_resume (sync) and _build_async (async) added; the IF_POSSIBLE → ALWAYS upgrade in build() has a bug when no prior dataset directory exists |
| packages/data-designer-engine/src/data_designer/engine/storage/artifact_storage.py | Adds ResumeMode enum, resume field, clear_partial_results(), and conditional path logic in resolved_dataset_name; logic is correct for cases covered by tests |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py | Adds start_batch and initial_actual_num_records params to start() to seed counters on resume; backwards-compatible defaults |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py | Adds initial_actual_num_records and initial_total_num_batches constructor params to seed async counters; straightforward and correct |
| packages/data-designer/src/data_designer/interface/data_designer.py | Exposes resume: ResumeMode on DataDesigner.create() and threads it through to ArtifactStorage and builder.build() |
| packages/data-designer/src/data_designer/interface/results.py | Adds export() method supporting jsonl/csv/parquet formats; unrelated to resume but a clean addition |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Comprehensive resume tests added; no coverage for IF_POSSIBLE + no prior dataset at the build() level |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["build(resume=...)"] --> B{resume == IF_POSSIBLE?}
    B -- Yes --> C["_check_resume_config_compatibility()"]
    C -- config differs --> D["resume = NEVER\nartifact_storage.resume = NEVER\npop cache"]
    C -- compatible OR no config file --> E["resume = ALWAYS\nartifact_storage.resume = ALWAYS\npop cache"]
    E -. no prior dir .-> F["🐛 resolved_dataset_name raises\nArtifactStorageError"]
    B -- No --> G["_write_builder_config()"]
    D --> G
    E -- prior dir exists --> G
    G --> H{resume == ALWAYS AND no metadata.json?}
    H -- Yes --> I["clear_partial_results()\nresume = NEVER"]
    H -- No --> J{async engine?}
    I --> J
    J -- Yes --> K["_build_async(..., resume=resume)"]
    J -- No, resume=ALWAYS --> L["_build_with_resume(...)"]
    J -- No, resume=NEVER --> M["standard batch loop"]
    K --> N{all row groups done?}
    L --> O{all batches done?}
    N -- Yes --> P["return False\nskip processors"]
    N -- No --> Q["skip completed IDs\nrun remaining row groups"]
    O -- Yes --> P
    O -- No --> R["run remaining batches"]
    P --> S["return final_dataset_path"]
    Q --> T["run_after_generation"]
    R --> T
    M --> T
    T --> S
```

@przemekboruta changed the title from "feat: resume interrupted dataset generation runs (sync engine)" to "feat: resume interrupted dataset generation runs (sync + async engine)" — Apr 13, 2026
przemekboruta added a commit to przemekboruta/DataDesigner that referenced this pull request Apr 13, 2026
…set already complete

_build_with_resume and _build_async now return False when the dataset is already
complete (early-return path), True otherwise. build() skips
_processor_runner.run_after_generation() on False, preventing processors from
calling shutil.rmtree and rewriting an already-finalized dataset.

Fixes the issue raised in review: greptile P1 comment on PR NVIDIA-NeMo#526.
@github-actions

Issue #525 has been triaged. The linked issue check is being re-evaluated.

@andreatgretel added and then removed the agent-review (Trigger agentic CI review) label — Apr 13, 2026
@andreatgretel added the agent-review (Trigger agentic CI review) label — Apr 16, 2026
@github-actions

Code Review: PR #526 — Resume interrupted dataset generation runs (sync + async engine)

Summary

This PR adds a resume: bool = False parameter to DataDesigner.create() and DatasetBuilder.build(), enabling users to resume interrupted dataset generation from the last completed batch (sync) or row group (async). The implementation touches 5 source files and 4 test files across the data-designer-engine and data-designer packages.

Scope: ~860 additions, ~16 deletions across 10 files (including a plan doc and comprehensive tests).

The feature is well-designed: it leverages existing metadata.json checkpoints, validates run-parameter compatibility, handles edge cases (already-complete, no-metadata, parameter mismatch), and correctly separates the sync and async resume paths. The plan diverged from implementation in a positive way — the async engine now supports resume (the plan initially deferred it).

Findings

High Severity

(H1) _load_resume_state return value discarded in async resume path
dataset_builder.py:411 — In _build_async, when resume=True, the call self._load_resume_state(num_records, buffer_size) is made for validation only — the returned _ResumeState is discarded. This is intentional (the async path derives state from the filesystem instead), but it's confusing. The validation-only intent should be made explicit, e.g. by extracting a _validate_resume_params() method or assigning to _ with a comment. As-is, a future maintainer might remove the "unused" call and break parameter validation for async resume.

Medium Severity

(M1) _find_completed_row_group_ids parses batch filenames with split("_", 1)[1]
dataset_builder.py:381 — The glob pattern is batch_*.parquet and the ID is extracted via p.stem.split("_", 1)[1]. This works for batch_00000: the stem splits to "00000", and int("00000") = 0. However, if a file like batch_00000_extra.parquet appeared (e.g., from a future format change), split("_", 1)[1] would yield "00000_extra" and int() would raise ValueError, which is caught. This is acceptable but fragile. Consider using a regex r"^batch_(\d+)$" on the stem for robustness, as sketched below.
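
A sketch of the regex-hardened variant suggested here, keeping the same glob and return shape (illustrative, not the shipped helper):

```python
# Illustrative regex-hardened parser; same glob and frozenset return shape as
# described for _find_completed_row_group_ids(), but not the shipped code.
import re
from pathlib import Path

_BATCH_STEM = re.compile(r"^batch_(\d+)$")

def find_completed_row_group_ids(parquet_dir: Path) -> frozenset[int]:
    completed: set[int] = set()
    for p in parquet_dir.glob("batch_*.parquet"):
        m = _BATCH_STEM.match(p.stem)
        # A stem like "batch_00000_extra" no longer reaches int() at all;
        # only exact batch_<digits> stems count as completed row groups.
        if m:
            completed.add(int(m.group(1)))
    return frozenset(completed)
```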

(M2) initial_actual_num_records calculation assumes uniform batch sizes
dataset_builder.py:418-420 — The async resume path computes initial_actual_num_records as:

sum(min(buffer_size, num_records - rg_id * buffer_size) for rg_id in completed_ids)

This formula assumes each row group was written with exactly min(buffer_size, remaining) rows, ignoring dropped rows. If the original run dropped rows within a row group (e.g., due to generation failures), the actual count would be lower. However, actual_num_records in the sync path also counts written records (not requested), and the metadata from write_metadata stores the true post-drop count. This means the filesystem-derived count may overestimate vs. what was actually written. The comment at line 414 acknowledges metadata may lag, but the formula's assumption about no drops could lead to inflated actual_num_records in the final metadata when some rows were dropped in completed groups.
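
A worked example of the concern, with hypothetical numbers:

```python
# Hypothetical numbers: three completed row groups are assumed to hold
# 100 + 100 + 50 = 250 rows, even if drops meant fewer were actually written.
buffer_size, num_records = 100, 250
completed_ids = [0, 1, 2]
estimate = sum(min(buffer_size, num_records - rg_id * buffer_size) for rg_id in completed_ids)
assert estimate == 250  # overestimates whenever a completed group dropped rows
```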

(M3) batch_manager.start() calls reset(), zeroing state on the resume path
dataset_batch_manager.py:177 — start() calls self.reset() which sets _current_batch_number = 0 and _actual_num_records = 0, then immediately overrides them. The reset(delete_files=False) call is harmless here (it doesn't delete files), but it does zero out internal state that's immediately overwritten. While functionally correct, this coupling is subtle — if reset() ever gains side effects beyond zeroing counters, the resume path would break silently.

Low Severity

(L1) Plan/implementation divergence: async engine support
The plan document (plans/525/resume-interrupted-runs.md) states in the Design Decisions table: "Async engine: Raise DatasetGenerationError if DATA_DESIGNER_ASYNC_ENGINE=1 with resume=True" and in Trade-offs: "Resume support for async engine: deferred to a follow-up." The implementation fully supports async resume. The plan should be updated to reflect the actual implementation.

(L2) _ResumeState.buffer_size field is redundant
dataset_builder.py:91 — _ResumeState stores buffer_size but it's always set to the same buffer_size parameter that was already validated. The field is never read after construction in _build_with_resume — the method uses the buffer_size parameter directly. The field could be removed to avoid confusion.

(L3) Incremental metadata writes add I/O overhead to async engine
dataset_builder.py:443 — write_metadata is now called after every row group checkpoint in finalize_row_group. For large datasets with many small row groups, this adds per-row-group disk I/O. The trade-off (resumability vs. performance) is reasonable, but worth noting in documentation or the PR description. The final write_metadata call at line 478 is documented as redundant ("overwrites the last incremental write with identical content") — good.

(L4) Test file has mid-file imports
test_dataset_builder.py:927-429 — The resume test section re-imports json, Path, and ArtifactStorage with underscore-prefixed aliases (_json, _Path, _ArtifactStorage) mid-file. While this works, it's unconventional and potentially confusing. Standard practice is to add imports at the top of the file.

(L5) No validation of start_batch or initial_actual_num_records bounds
dataset_batch_manager.py:165-166 — The new start_batch and initial_actual_num_records parameters have no validation (e.g., start_batch >= 0, start_batch <= num_batches, initial_actual_num_records >= 0). Since these are only called from internal resume code that validates upstream, this is acceptable — but defensive checks would prevent misuse if the method is called from new paths in the future.

Positive Observations

  • Comprehensive test coverage: 20+ new test cases covering validation errors, already-complete detection, async/sync paths, filesystem-vs-metadata crash window scenarios, and processor non-invocation on skip.
  • Clean separation of sync/async resume: The sync path uses _build_with_resume with DatasetBatchManager, while the async path extends _build_async with skip_row_groups and filesystem-based counters. No shared mutable state between the two paths.
  • Filesystem as source of truth for async: The decision to derive initial_actual_num_records from the filesystem rather than potentially-stale metadata (lines 414-420) handles the crash window correctly and is well-documented.
  • Graceful degradation for missing metadata: The build() method at line 188 handles the case where metadata.json is missing (interrupted before any batch completed) by logging and restarting fresh, rather than raising an error. This is a UX improvement over the plan's original "raise error" approach.
  • No breaking changes: All new parameters default to their pre-existing behavior (resume=False, start_batch=0, initial_actual_num_records=0).
  • Incremental metadata writes enable async resumability — a meaningful improvement over the plan's deferred-async-resume decision.

Verdict

Approve with suggestions. The implementation is solid, well-tested, and handles edge cases thoughtfully. The high-severity finding (H1) is a readability/maintainability concern rather than a correctness bug — the discarded return value works because _load_resume_state raises on validation failure. The medium-severity findings (M1-M3) are minor robustness concerns. None of these block merging, but H1 and M2 are worth addressing before or shortly after merge.

@github-actions (Bot) removed the agent-review (Trigger agentic CI review) label — Apr 16, 2026


nabinchha commented Apr 28, 2026

cc @johnnygreco @andreatgretel

Suggestion: add an IF_POSSIBLE mode to resume for idempotent retry workflows

First, thanks for landing this — being able to resume interrupted runs at all is a big quality-of-life win for long jobs. This suggestion is about extending the API one step further so it composes cleanly with automated retry/orchestration workflows.

Dependencies

This suggestion depends on #584 (deterministic hash to uniquely identify a workflow config), which is being shipped soon. The behavior matrix below uses that hash as its definition of "compatible" — i.e., whether the on-disk run was produced by an equivalent workflow config. #584 should land first.

Motivation

The current API is binary:

  • resume=False → start fresh (default; collisions get a timestamp suffix)
  • resume=True → resume; raise if there's no resumable state

This works well for interactive use, where the caller knows up front whether they're starting a new run or resuming an existing one.

It's awkward for automated workflows where the same invocation may be a first run or a retry:

  • A wrapper that resubmits the same job script after infra failures
  • A CI/cron pipeline that re-runs a generation job on a fixed schedule
  • Any orchestrator that doesn't track per-job "have I run this before?" state externally

In all of these, the caller has to:

  1. Stat the output directory
  2. Decide whether resume is appropriate
  3. Pass the right value to create()

That logic ends up reimplemented in every wrapper, and it has to know about DD's storage layout (where metadata lives, what counts as "resumable") to do it correctly. As DD's storage layout evolves, every wrapper breaks.
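
Concretely, every wrapper ends up carrying something like this (a hypothetical sketch; the layout knowledge baked into it is exactly what breaks when DD's storage evolves):

```python
# Hypothetical wrapper-side logic that IF_POSSIBLE would absorb. The metadata
# filename and artifact path here are illustrative assumptions; knowing them
# is precisely the layout coupling described above.
from pathlib import Path

def resume_flag_for(dataset_dir: Path) -> bool:
    # The wrapper must know which file marks resumable state, and where it lives.
    return (dataset_dir / "metadata.json").exists()

results = dd.create(
    config_builder,
    num_records=10_000,
    resume=resume_flag_for(Path("artifacts/my_dataset")),  # hypothetical path
)
```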

Proposal

A ResumeMode enum with three values

```python
from enum import StrEnum

class ResumeMode(StrEnum):
    NEVER = "never"              # current resume=False
    ALWAYS = "always"            # current resume=True
    IF_POSSIBLE = "if_possible"  # new

def create(
    self,
    *,
    dataset_name: str | None = None,
    resume: ResumeMode = ResumeMode.NEVER,
    ...
) -> DatasetResult: ...
```

The key property of IF_POSSIBLE: the caller passes the same value on every invocation and DD does the right thing based on what's actually on disk. The caller no longer needs to reason about prior state.
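
So an orchestrated job reduces to a single unconditional call (same shape as the example in the PR description):

```python
# Safe on the first attempt and on every retry alike: DataDesigner inspects
# the on-disk state instead of the caller doing so.
results = dd.create(config_builder, num_records=10_000, resume=ResumeMode.IF_POSSIBLE)
```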

Behavior matrix

"Compatible" below means the persisted config_hash (#584) matches the current invocation's hash — i.e., the on-disk run was produced by an equivalent workflow config.

| State on disk | ResumeMode.NEVER | ResumeMode.ALWAYS | ResumeMode.IF_POSSIBLE |
| --- | --- | --- | --- |
| Folder missing or empty | create (timestamp on collision elsewhere) | raise | create in dataset_name |
| metadata.json present, compatible | timestamp-suffix new folder | resume | resume |
| metadata.json present, incompatible | timestamp-suffix new folder | raise | raise |
| Folder has data but no metadata.json | timestamp-suffix new folder | raise | raise |

The crucial line is the third one: under IF_POSSIBLE, an incompatible-config case must raise, not silently start fresh. Silently overwriting a folder that belongs to a different config is worse than failing loudly. The whole point of IF_POSSIBLE is "I might be a retry of myself" — if the hash says it isn't, the folder belongs to an unrelated run that happened to land on the same dataset_name, and the right response is to surface that collision rather than paper over it.
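
A minimal sketch of the matrix as straight-line resolution logic (hypothetical helper and error messages; config_matches is None stands for the no-metadata.json row):

```python
# Hypothetical helper encoding the matrix above; error messages are illustrative.
# config_matches is True/False when metadata.json is present, None when absent.
from pathlib import Path

class ArtifactStorageError(Exception):
    pass

def resolve_action(dataset_dir: Path, mode: str, config_matches: bool | None) -> str:
    has_data = dataset_dir.exists() and any(dataset_dir.iterdir())
    if not has_data:
        if mode == "always":
            raise ArtifactStorageError("Cannot resume: no existing dataset found")
        return "fresh"  # NEVER and IF_POSSIBLE both create in dataset_name
    if mode == "never":
        return "fresh-timestamped"  # never clobber an existing folder
    if config_matches is not True:
        # Incompatible config, or data with no metadata.json: the folder belongs
        # to some other run, so fail loudly rather than silently overwrite it.
        raise ArtifactStorageError("Cannot resume: incompatible or missing run metadata")
    return "resume"
```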

Implementation sketch

  • Define ResumeMode (probably in data_designer.config) and re-export from the public package.
  • ArtifactStorage: change resume: bool to resume: ResumeMode. In resolved_dataset_name, the IF_POSSIBLE branch returns dataset_name unchanged whether or not the folder currently exists, and never raises on missing/empty folders.
  • DatasetBuilder._load_resume_state:
  • Tests should cover:
    • All four state-on-disk cases × all three ResumeMode values
    • The hash-mismatch error path for ALWAYS and IF_POSSIBLE
    • String coercion via StrEnum (resume="if_possible" resolves to ResumeMode.IF_POSSIBLE) so config-driven callers stay ergonomic

Other considerations

  • Concurrency. IF_POSSIBLE plus a shared dataset_name across two concurrent processes is a race. Worth documenting that resume assumes a single writer per dataset_name. A lockfile in the dataset folder would make this enforceable, but is probably a separate piece of work.
  • Cleanup semantics. clear_partial_results() should fire in IF_POSSIBLE mode the same way it does in ALWAYS — partial results from a previous interrupted run shouldn't leak into the resumed (or fresh) run.

Related cleanups (separate from the API change)

While reading the PR, two small things stood out that are worth a follow-up regardless of the tri-state proposal:

  • The partial-completion warning at the end of _build_async is unreachable because of the return True immediately above it. Moving the warning above the return restores user-visible feedback for incomplete async runs.
  • _load_resume_state raises DatasetGenerationError from a FileNotFoundError without from exc, dropping the original traceback. Chaining it would help future debugging.

@andreatgretel

Thank you for taking this on, the plan in plans/525/ made the trade-offs easy to follow. The main asks before merge are in the comments above: the no-metadata fallback running before sync/async split (#1), the async already-complete check not surviving run_after_generation (#2), the dead-code warning at line 511 (greptile already flagged), and a few smaller follow-throughs (builder_config validation, happy-path test).

One thought: we could eventually checkpoint at the task level rather than the row-group level. However, that would need a sidecar format (parquet only wants whole row groups), concurrency-safe writes from many parallel asyncio tasks, and a CompletionTracker replay path on resume — probably 3-5x the code here, plus new edge cases for skipped/dropped cells. At the row-group level the lost-work blast radius is bounded by buffer_size LLM calls per crash, which is fine for the common case. The architecture here (per-cell update_cell + CompletionTracker) is well-shaped to add task-level checkpointing later by intercepting cell writes, so this PR doesn't paint anyone into a corner.

@przemekboruta (Author)

Thanks everyone for the thorough review — really useful catches across the board. Here's what was addressed:

@johnnygreco / @nabinchha

  • Added _GenerationOutcome enum (GENERATED / ALREADY_COMPLETE) replacing the bare bool returns — build() now gates run_after_generation on status is _GenerationOutcome.GENERATED, so processors are never re-run on an already-complete dataset
  • num_records validation changed from exact match to < actual_num_records — you can now resume with a larger or equal target, e.g. resume=True, num_records=6000 after a run interrupted at 5000 records
  • No-metadata fallback moved inside the sync-only branch; async handles it internally via _find_completed_row_group_ids() so the crash-window parquet files are not silently discarded

@andreatgretel

  • Async already-complete check now uses max(metadata_count, filesystem_count) — metadata count is authoritative after AFTER_GENERATION processors have rewritten parquet-files/, filesystem count is authoritative in the crash window when metadata lags; taking the max covers both cases (see the sketch after this list)
  • Added _check_resume_config_compatibility(): reads builder_config.json before _write_builder_config() overwrites it, compares the data_designer section (ignoring library_version), and raises DatasetGenerationError when it differs — prevents silently mixing batches generated with incompatible configs
  • Added test_build_resume_runs_remaining_batches: 3 batches total, 1 already done → asserts _run_batch is called with current_batch_number=1 and 2 only, not 0
  • Also fixed the dead-code regression: the "Surface partial completion" warning block was sitting after return True in _build_async — moved it before the return so it actually executes
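
A compact sketch of that decision, as a hypothetical standalone check:

```python
def is_already_complete(metadata_count: int, filesystem_count: int, num_records: int) -> bool:
    # Metadata is authoritative after AFTER_GENERATION rewrites parquet-files/;
    # the filesystem is authoritative in the crash window when metadata lags.
    # Taking the max covers both cases without an extra flag.
    return max(metadata_count, filesystem_count) >= num_records
```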

PR description updated to reflect the changed semantics.

@nabinchha left a comment

Thanks for grinding through several review rounds on this, @przemekboruta — the iteration is visible (the _GenerationOutcome enum, crash-window handling, and config-mismatch guard are all clear improvements). Most of my notes are about leaving the public API in a place that won't need a breaking change next time we touch it.

Summary

Adds an opt-in resume: bool to DataDesigner.create() / DatasetBuilder.build() that picks generation up from the last completed batch (sync) or row group (async), with metadata.json-based state for sync and filesystem reconciliation for async. The implementation matches the stated intent and the PR description tracks the (post-iteration) semantics — the in-code docstrings have drifted a bit in the process.

Findings

Critical — Let's fix these before merge

packages/data-designer/src/data_designer/interface/data_designer.py:193 — ResumeMode enum not introduced; should land alongside #587

  • What: The earlier ResumeMode proposal (NEVER / ALWAYS / IF_POSSIBLE) was only partially picked up. The new _GenerationOutcome enum is internal — it's a return-value type for _build_with_resume / _build_async, not the public input parameter. The user-facing API is still resume: bool on DataDesigner.create(), DatasetBuilder.build(), and ArtifactStorage.resume.
  • Why: This is a one-shot decision — once resume: bool ships in a release, every adoption locks us into bool semantics. Migrating to resume: ResumeMode later is a breaking change (bool and StrEnum aren't substitutable in user code, even with StrEnum's string coercion); the back-compat workaround is bool | ResumeMode for one or more deprecation cycles, which leaves a permanent wart in the public signature. The good news: the dependency for IF_POSSIBLE is no longer abstract — PR #587 (closes #584) is open and provides DataDesignerConfig.fingerprint() / fingerprint_config() with a versioned CONFIG_HASH_VERSION, which is exactly the "compatible vs. incompatible config on disk" check IF_POSSIBLE needs. So the cleanest sequencing is: land #587 first, then this PR ships the full ResumeMode enum including IF_POSSIBLE in one go — no scaffolding-then-fill-in two-PR dance, no breaking API change later.
  • Suggestion: Once #587 merges, replace resume: bool with the full enum here:
```python
from enum import StrEnum

class ResumeMode(StrEnum):
    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"

def create(self, *, resume: ResumeMode = ResumeMode.NEVER, ...) -> ...: ...
```

_load_resume_state / _check_resume_config_compatibility then call fingerprint_config(self._data_designer_config) and compare against a config_hash field persisted into metadata.json on each write. The behavior matrix from the original ResumeMode thread (compatible-fingerprint resumes, incompatible-fingerprint raises under both ALWAYS and IF_POSSIBLE) drops in cleanly on top of #587's API. If #587 stalls and this PR needs to ship sooner, the fallback is the scaffolding-only version (NEVER / ALWAYS, no IF_POSSIBLE) — but please don't ship resume: bool. cc @nabinchha / @johnnygreco for the sequencing call.

Warnings — Worth addressing

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:198 and packages/data-designer/src/data_designer/interface/data_designer.py:215 — Stale num_records docstring

  • What: Both build() and create() still say "The run parameters (num_records, buffer_size) must match those of the original run", but per the latest iteration num_records only needs to be >= actual_num_records (the whole point of the johnnygreco/nabinchha "change your mind" semantic that you adopted).
  • Why: The PR description was updated to reflect the new semantics, but users will read the docstring, not the PR description. They'll think they have to use the same num_records and end up either restarting from scratch or filing a bug.
  • Suggestion: Update both docstrings to something like "buffer_size must match the original run; num_records may be the same or larger and is treated as the new target. Resuming with num_records < actual_num_records_so_far raises DatasetGenerationError."

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:265 — FileNotFoundError traceback dropped

  • What: _load_resume_state re-raises FileNotFoundError as DatasetGenerationError without from exc, dropping the original traceback. This was specifically called out in the prior review thread (the "Related cleanups" section of the ResumeMode comment) and not picked up.
  • Why: STYLEGUIDE.md is explicit on this: "Re-raise with context so the original traceback is preserved." It's also a one-character fix.
  • Suggestion:
```python
except FileNotFoundError as exc:
    raise DatasetGenerationError(
        "🛑 Cannot resume: metadata.json not found in the existing dataset directory. "
        "Run without resume=True to start a new generation."
    ) from exc
```

Suggestions — Take it or leave it

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:702-705 — Silent fallback when saved config is unreadable

  • What: _check_resume_config_compatibility swallows OSError and JSONDecodeError and returns silently — the comment says "unreadable config — skip check rather than block the resume".
  • Why: That's a defensible choice for partially-corrupted state, but it means a resume can silently mix incompatible batches if the saved config happens to be corrupt. At least logging a warning would make the situation visible.
  • Suggestion: logger.warning("⚠️ Saved builder_config.json is unreadable — skipping config compatibility check on resume.") before the return.

packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py:1283 — Async happy-path resume test asserts inputs but not the skip set

  • What: test_initial_actual_num_records_from_filesystem_in_crash_window is the closest async analog to the sync test_build_resume_runs_remaining_batches, but it captures initial_actual_num_records / initial_total_num_batches and not skip_row_groups. The "only the missing row groups get scheduled" invariant is the actual user-facing guarantee for async resume.
  • Why: The sync test locks down "batches 1, 2 ran, not 0" — a regression that swapped skip_row_groups for an empty set in async would not be caught by any of the current tests.
  • Suggestion: Add a sibling assertion to the existing test (or a new tiny one):
assert captured["skip_row_groups"] == frozenset({0, 1})

(captured the same way you already capture the other two kwargs.)

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:428-550 — _build_async is now ~120 lines and densely nested

  • What: The resume-resolution block at lines 452-495 (no-prior-state vs metadata vs filesystem-only, plus the max(meta_count, fs_count) already-complete decision) is doing real work in the middle of the scheduler-setup function.
  • Why: Mostly readability — the precedence rules between metadata and _find_completed_row_group_ids() are subtle and worth being able to point at in isolation. Also makes them more testable as a unit.
  • Suggestion (optional, can defer to a follow-up): extract _resolve_async_resume_state(num_records, buffer_size) -> tuple[frozenset[int], int, int, _GenerationOutcome | None] returning the skip_row_groups, initial_actual_num_records, initial_total_num_batches, and an ALREADY_COMPLETE sentinel when the dataset is done. The caller then becomes a straight-line dispatch.

What Looks Good

  • _GenerationOutcome cleanly fixes both prior bugs at once — gating run_after_generation on status is _GenerationOutcome.GENERATED is the right shape, and it kills both the dead-warning and the AFTER_GENERATION-rerun problems with one mechanism.
  • Crash-window handling via max(meta_count, fs_count) is exactly the right call. Metadata is authoritative once AFTER_GENERATION has rewritten parquet-files/; filesystem is authoritative when metadata lags one row group. Using the max covers both cases without needing a flag.
  • _check_resume_config_compatibility runs before _write_builder_config() — subtle but critical ordering. Easy to get wrong; you got it right and called it out in the docstring.
  • Test coverage genuinely exercises the failure modes that matter — config-mismatch, buffer-mismatch, num_records-too-small, no-metadata-fallback, crash-window filesystem reconciliation, and the "did we actually skip the right batches" invariant for sync. ~20 new tests, well-scoped.

Verdict

Needs changes / needs sequencing — the Critical (ResumeMode) is blocking; once resume: bool ships, the API decision is locked in. Cleanest path is to wait for #587 to land and then ship the full enum (incl. IF_POSSIBLE) in this PR. The two Warnings (stale docstrings, missing from exc) are small follow-throughs from the prior review thread that are much cheaper to fix here than in a follow-up. Suggestions are take-or-leave.


This review was generated by an AI assistant.

@nabinchha

Quick follow-up: #587 just merged, so the dependency for the Critical finding above is resolved — IF_POSSIBLE is no longer hypothetical and the cleanest path is to land the full ResumeMode enum (incl. IF_POSSIBLE) in this PR rather than the scaffolding-only fallback.

A small correction to the suggested code in my earlier review: the shipped public surface is DataDesignerConfig.fingerprint() returning a dict, not a freestanding helper (the fingerprint_config() module-level function exists but is intentionally not re-exported from data_designer.config — implementation detail). Updated sketch:

```python
from enum import StrEnum

class ResumeMode(StrEnum):
    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"

def create(self, *, resume: ResumeMode = ResumeMode.NEVER, ...) -> ...: ...
```

For the compatibility check inside _check_resume_config_compatibility:

```python
fp = self._data_designer_config.fingerprint()  # {"config_hash": "sha256:…", "config_hash_algo": "sha256", "config_hash_version": 1}
# Persist fp into metadata.json on each write, then on resume:
if saved.get("config_hash_version") != fp["config_hash_version"]:
    # Scheme changed under us — treat as "unknown identity" rather than mismatch.
    ...
elif saved.get("config_hash") != fp["config_hash"]:
    raise DatasetGenerationError("🛑 Cannot resume: config has changed since the interrupted run.")
```

Two small notes worth folding in:

  • Use config_hash_version as a precondition. If the on-disk config_hash_version doesn't match the current one, the right behavior under both ALWAYS and IF_POSSIBLE is "unknown identity" (raise with a clear "config-hash scheme upgraded — start fresh" message), not a silent mismatch.
  • Persist the fingerprint at every metadata write, not just at resume. Both DatasetBatchManager.finish_batch and RowGroupBufferManager.write_metadata need to include config_hash / config_hash_version so an interrupted run that's resumed later still has the original identity to compare against. Cheapest place to plumb it through is the BuilderConfig-adjacent layer.
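
A sketch of the second note above: the shape the persisted identity fields might take in metadata.json (field names follow the fingerprint() dict shown earlier; the write path itself is hypothetical):

```python
# Hypothetical write path; field names follow the fingerprint() dict above.
import json
from pathlib import Path

def write_metadata(path: Path, actual_num_records: int, num_completed_batches: int, fp: dict) -> None:
    payload = {
        "actual_num_records": actual_num_records,
        "num_completed_batches": num_completed_batches,
        # Persisted on every write, not just at resume time, so an interrupted
        # run keeps its original identity for later comparison.
        "config_hash": fp["config_hash"],
        "config_hash_version": fp["config_hash_version"],
    }
    path.write_text(json.dumps(payload))
```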

Otherwise the original review stands — the two Warnings (stale docstrings, missing from exc) are still small follow-throughs worth addressing in the same pass.


This review was generated by an AI assistant.

- ArtifactStorage gains a `resume: bool = False` field
- resolved_dataset_name skips timestamp logic when resume=True,
  returning the existing dataset folder name as-is
- Raises ArtifactStorageError on resume=True when the target folder
  is absent or empty (no data to resume from)
- New clear_partial_results() removes in-flight partial results
  left over from an interrupted run

Fixes NVIDIA-NeMo#525
DatasetBatchManager.start() now accepts:
- start_batch: int = 0  — first batch index to process
- initial_actual_num_records: int = 0  — records already on disk

Both default to 0 so all existing call sites are unaffected.

Fixes NVIDIA-NeMo#525
- build() gains a resume: bool = False parameter
- _load_resume_state() reads metadata.json and validates that
  num_records and buffer_size match the original run
- _build_with_resume() skips completed batches, clears in-flight
  partial results, and continues from the first incomplete batch
- Raises DatasetGenerationError with clear messages for:
  - missing metadata.json (interrupted before first batch completes)
  - num_records mismatch
  - buffer_size mismatch
  - DATA_DESIGNER_ASYNC_ENGINE=1 (not yet supported)
- Logs a warning and returns early when dataset is already complete

Fixes NVIDIA-NeMo#525
- create() gains resume: bool = False
- _create_resource_provider() passes resume to ArtifactStorage
- builder.build() receives the resume flag

Fixes NVIDIA-NeMo#525
Covers:
- ArtifactStorage.resolved_dataset_name with resume=True
- ArtifactStorage.clear_partial_results()
- DatasetBatchManager.start() with start_batch and
  initial_actual_num_records
- DatasetBuilder.build(resume=True): missing metadata, num_records
  mismatch, buffer_size mismatch, already-complete detection

Fixes NVIDIA-NeMo#525
…INE=1)

- Add _find_completed_row_group_ids() to scan parquet-files/ for already-written
  row groups by parsing batch_*.parquet filenames
- _build_async() now accepts resume=True: loads metadata, finds completed row groups,
  clears partial results, and logs progress; returns early if all row groups are done
- _prepare_async_run() accepts skip_row_groups, initial_actual_num_records, and
  initial_total_num_batches so the scheduler only processes remaining row groups
  and RowGroupBufferManager starts from the correct counts
- RowGroupBufferManager.__init__ gains initial_actual_num_records and
  initial_total_num_batches params to seed the counters on resume
- finalize_row_group closure now writes incremental metadata after each checkpoint
  so any run (resume or not) can be resumed if interrupted mid-way
- Remove the guard that rejected resume=True with DATA_DESIGNER_ASYNC_ENGINE=1
- Add tests for all new paths
…sync resume

Metadata can lag by one row group if a crash occurs between
move_partial_result_to_final_file_path and write_metadata. Using
len(completed_ids) from the filesystem scan instead of
state.num_completed_batches ensures the final metadata reflects the
actual number of parquet files present, not the potentially stale
metadata count.
Adds DatasetCreationResults.export(path, format=) supporting jsonl,
csv, and parquet. The CLI create command gains --output-format / -f
which writes dataset.<format> alongside the parquet batch files.
…efore first batch)

When a run is interrupted before any row group or batch completes, metadata.json
is never written. Previously resume=True would raise DatasetGenerationError in
this case. Now build() detects the missing file, logs an info message, clears
any leftover partial results and falls back to a clean fresh run.

This is the common scenario for small datasets (fewer records than buffer_size)
where all records fit in a single row group.
…ync resume

In the crash window (row group written to disk but write_metadata crashed before
updating the file), both initial_total_num_batches and initial_actual_num_records
now use the filesystem-discovered completed_ids as source of truth.  Previously
initial_actual_num_records was read from potentially stale metadata, causing
actual_num_records in the final metadata to be undercounted by one row group.

Also adds a test covering the partial-resume crash-window scenario.
…/IF_POSSIBLE)

- Introduces ResumeMode(StrEnum) in artifact_storage.py for use across all layers
- Replaces resume: bool with resume: ResumeMode in DatasetBuilder.build(),
  DataDesigner.create(), ArtifactStorage, and _build_async()
- Adds _check_resume_config_compatibility() using config fingerprints to support
  IF_POSSIBLE: falls back to a fresh run when config has changed since last run
- Relaxes num_records validation from strict equality to num_records >= actual_num_records,
  allowing dataset extension on resume; buffer_size must still match exactly
- Preserves exception chain with 'from exc' on FileNotFoundError in _load_resume_state
- Exports ResumeMode from data_designer.interface for users to import
- Adds skip_row_groups assertion test and IF_POSSIBLE storage behavior tests
Comment on lines +235 to +242:

```python
if resume == ResumeMode.IF_POSSIBLE:
    if not self._check_resume_config_compatibility():
        logger.info(
            "▶️ Config has changed since the last run — starting a fresh generation (resume=IF_POSSIBLE)."
        )
        resume = ResumeMode.NEVER
    else:
        resume = ResumeMode.ALWAYS
```

P1 — IF_POSSIBLE fresh run silently overwrites the previous dataset

_check_resume_config_compatibility() is the first call that accesses self.artifact_storage.base_dataset_path, which triggers and caches the @cached_property resolved_dataset_name. Because ArtifactStorage.resume is still IF_POSSIBLE at that moment, the property evaluates the "dataset exists and is non-empty → return original name" branch and locks in the original directory name.

When the check then returns False (configs differ) and the local variable resume is set to NEVER, ArtifactStorage.resume is never updated. All subsequent path accesses (including _write_builder_config() and the build loop itself) use the cached original name, so the fresh generation writes directly into — and destroys — the completed previous dataset, with no timestamp-renaming or warning.

Fix: after deciding to fall back to NEVER, also update the storage object so the cached property reflects the correct mode:

```python
if not self._check_resume_config_compatibility():
    logger.info(...)
    resume = ResumeMode.NEVER
    self.artifact_storage.resume = ResumeMode.NEVER  # keep ArtifactStorage in sync
```

…downgrades to NEVER

ArtifactStorage's Pydantic model validator accesses base_dataset_path at
construction time, caching resolved_dataset_name under IF_POSSIBLE semantics
before build() can set resume=NEVER. Pop the stale cache entry so the property
re-resolves with the correct NEVER semantics (timestamped directory).

Also fixes _check_resume_config_compatibility() to use artifact_path/dataset_name
directly instead of base_dataset_path, and adds a regression test covering the
cache-bypass scenario.
Comment on lines +506 to +507:

```python
if not config_path.exists():
    return True
```

P1 — IF_POSSIBLE raises on first run when no prior dataset exists

_check_resume_config_compatibility() returns True whenever the config file is absent — including the case where the dataset directory has never been created. That causes the IF_POSSIBLE branch to unconditionally upgrade resume to ALWAYS and pop the cache (lines 251–253). When _write_builder_config() then evaluates self.artifact_storage.base_dataset_path, resolved_dataset_name re-runs under resume=ALWAYS and the directory still doesn't exist, which raises ArtifactStorageError("🛑 Cannot resume: no existing dataset found …").

The expected behaviour for IF_POSSIBLE + no prior dataset is a silent fresh start. The fix is to return False from _check_resume_config_compatibility (can't compare → can't resume) when the dataset directory itself doesn't exist:

```python
def _check_resume_config_compatibility(self) -> bool:
    dataset_dir = Path(self.artifact_storage.artifact_path) / self.artifact_storage.dataset_name
    if not dataset_dir.exists() or not any(dataset_dir.iterdir()):
        return False  # No prior run — treat as incompatible so IF_POSSIBLE starts fresh
    config_path = dataset_dir / SDG_CONFIG_FILENAME
    if not config_path.exists():
        return True
    ...
```
