feat: resume interrupted dataset generation runs (sync + async engine)#526
przemekboruta wants to merge 16 commits into NVIDIA-NeMo:main
Conversation
Greptile Summary

This PR adds a resume mechanism for interrupted dataset generation runs. Per-file overview:
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Core resume logic — _build_with_resume (sync) and _build_async (async) added; IF_POSSIBLE → ALWAYS upgrade in build() has a bug when no prior dataset directory exists |
| packages/data-designer-engine/src/data_designer/engine/storage/artifact_storage.py | Adds ResumeMode enum, resume field, clear_partial_results(), and conditional path logic in resolved_dataset_name; logic is correct for cases covered by tests |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/dataset_batch_manager.py | Adds start_batch and initial_actual_num_records params to start() to seed counters on resume; backwards-compatible defaults |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py | Adds initial_actual_num_records and initial_total_num_batches constructor params to seed async counters; straightforward and correct |
| packages/data-designer/src/data_designer/interface/data_designer.py | Exposes resume: ResumeMode on DataDesigner.create() and threads it through to ArtifactStorage and builder.build() |
| packages/data-designer/src/data_designer/interface/results.py | Adds export() method supporting jsonl/csv/parquet formats; unrelated to resume but clean addition |
| packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py | Comprehensive resume tests added; no coverage for IF_POSSIBLE + no prior dataset at the build() level |
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["build(resume=...)"] --> B{resume == IF_POSSIBLE?}
B -- Yes --> C["_check_resume_config_compatibility()"]
C -- config differs --> D["resume = NEVER\nartifact_storage.resume = NEVER\npop cache"]
C -- compatible OR no config file --> E["resume = ALWAYS\nartifact_storage.resume = ALWAYS\npop cache"]
E -. no prior dir .-> F["🐛 resolved_dataset_name raises\nArtifactStorageError"]
B -- No --> G["_write_builder_config()"]
D --> G
E -- prior dir exists --> G
G --> H{resume == ALWAYS AND no metadata.json?}
H -- Yes --> I["clear_partial_results()\nresume = NEVER"]
H -- No --> J{async engine?}
I --> J
J -- Yes --> K["_build_async(..., resume=resume)"]
J -- No, resume=ALWAYS --> L["_build_with_resume(...)"]
J -- No, resume=NEVER --> M["standard batch loop"]
K --> N{all row groups done?}
L --> O{all batches done?}
N -- Yes --> P["return False\nskip processors"]
N -- No --> Q["skip completed IDs\nrun remaining row groups"]
O -- Yes --> P
O -- No --> R["run remaining batches"]
P --> S["return final_dataset_path"]
Q --> T["run_after_generation"]
R --> T
M --> T
T --> S
```
Reviews (13): Last reviewed commit: "fix(builder): move partial-completion wa..."
Issue #525 has been triaged. The linked issue check is being re-evaluated.
Code Review: PR #526 — Resume interrupted dataset generation runs (sync + async engine)

Summary

This PR adds a … Scope: ~860 additions, ~16 deletions across 10 files (including a plan doc and comprehensive tests). The feature is well-designed: it leverages existing …

Findings

High Severity

(H1) …

Medium Severity

(M1) …
(M2) `sum(min(buffer_size, num_records - rg_id * buffer_size) for rg_id in completed_ids)` — this formula assumes each row group was written with exactly …
(M3) …

Low Severity

(L1) Plan/implementation divergence: async engine support …
(L2) …
(L3) Incremental metadata writes add I/O overhead to async engine
(L4) Test file has mid-file imports
(L5) No validation of …

Positive Observations

…
Verdict

Approve with suggestions. The implementation is solid, well-tested, and handles edge cases thoughtfully. The high-severity finding (H1) is a readability/maintainability concern rather than a correctness bug — the discarded return value works because …
cc @johnnygreco @andreatgretel

Suggestion: add an enum-based `ResumeMode` (NEVER / ALWAYS / IF_POSSIBLE) in place of the boolean `resume` flag. Proposed behavior matrix:
| State on disk | `ResumeMode.NEVER` | `ResumeMode.ALWAYS` | `ResumeMode.IF_POSSIBLE` |
|---|---|---|---|
| Folder missing or empty | create (timestamp on collision elsewhere) | raise | create in `dataset_name` |
| `metadata.json` present, compatible | timestamp-suffix new folder | resume | resume |
| `metadata.json` present, incompatible | timestamp-suffix new folder | raise | raise |
| Folder has data but no `metadata.json` | timestamp-suffix new folder | raise | raise |
The crucial line is the third one: under IF_POSSIBLE, an incompatible-config case must raise, not silently start fresh. Silently overwriting a folder that belongs to a different config is worse than failing loudly. The whole point of IF_POSSIBLE is "I might be a retry of myself" — if the hash says it isn't, the folder belongs to an unrelated run that happened to land on the same dataset_name, and the right response is to surface that collision rather than paper over it.
Implementation sketch
- Define `ResumeMode` (probably in `data_designer.config`) and re-export it from the public package.
- `ArtifactStorage`: change `resume: bool` to `resume: ResumeMode`. In `resolved_dataset_name`, the `IF_POSSIBLE` branch returns `dataset_name` unchanged whether or not the folder currently exists, and never raises on missing/empty folders (see the sketch after this list).
- `DatasetBuilder._load_resume_state`:
  - Compare the persisted `config_hash` ("Add a deterministic hash to uniquely identify a workflow config", #584) against the current invocation's hash; mismatch → raise a clear "config drift" error.
  - In `IF_POSSIBLE` mode, `FileNotFoundError` → "no resume state" → fresh start. Hash mismatch and corrupt metadata still raise (same as `ALWAYS`).
- Tests should cover:
  - All four state-on-disk cases × all three `ResumeMode` values
  - The hash-mismatch error path for `ALWAYS` and `IF_POSSIBLE`
  - String coercion via `StrEnum` (`resume="if_possible"` resolves to `ResumeMode.IF_POSSIBLE`) so config-driven callers stay ergonomic
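For concreteness, here is a minimal, self-contained sketch of how `resolved_dataset_name` could implement the behavior matrix above. The field names (`artifact_path`, `dataset_name`, `resume`) and `ArtifactStorageError` come from this thread; the class itself, its constructor, the timestamp format, and the error message wording are illustrative assumptions, not the shipped code.

```python
from datetime import datetime
from enum import StrEnum
from functools import cached_property
from pathlib import Path


class ArtifactStorageError(RuntimeError):
    """Stand-in for the engine's real storage error type."""


class ResumeMode(StrEnum):
    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"


class ArtifactStorageSketch:
    """Stripped-down stand-in for ArtifactStorage, just enough to show the dispatch."""

    def __init__(self, artifact_path: str, dataset_name: str, resume: ResumeMode = ResumeMode.NEVER) -> None:
        self.artifact_path = artifact_path
        self.dataset_name = dataset_name
        self.resume = resume

    @cached_property
    def resolved_dataset_name(self) -> str:
        target = Path(self.artifact_path) / self.dataset_name
        has_data = target.exists() and any(target.iterdir())
        if self.resume is ResumeMode.ALWAYS:
            if not has_data:
                raise ArtifactStorageError("🛑 Cannot resume: no existing dataset found.")
            return self.dataset_name  # reuse the existing folder
        if self.resume is ResumeMode.IF_POSSIBLE:
            # Never raise here; whether to actually resume is decided later by the
            # config-hash comparison in DatasetBuilder._load_resume_state.
            return self.dataset_name
        # NEVER: avoid clobbering an existing folder by appending a timestamp suffix.
        if has_data:
            return f"{self.dataset_name}-{datetime.now():%Y%m%d-%H%M%S}"
        return self.dataset_name
```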
Other considerations

- Concurrency: `IF_POSSIBLE` plus a shared `dataset_name` across two concurrent processes is a race. Worth documenting that resume assumes a single writer per `dataset_name`. A lockfile in the dataset folder would make this enforceable (see the sketch after this list), but is probably a separate piece of work.
- Cleanup semantics: `clear_partial_results()` should fire in `IF_POSSIBLE` mode the same way it does in `ALWAYS` — partial results from a previous interrupted run shouldn't leak into the resumed (or fresh) run.
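As a rough illustration of the lockfile idea, a sketch that fails fast when a second writer targets the same dataset folder; the `.writer.lock` filename and the error handling are hypothetical.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def single_writer_lock(dataset_dir: Path):
    """Fail fast if another process is already writing into dataset_dir."""
    dataset_dir.mkdir(parents=True, exist_ok=True)
    lock_path = dataset_dir / ".writer.lock"  # hypothetical lockfile name
    try:
        # O_CREAT | O_EXCL makes creation atomic: a second writer gets FileExistsError.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(
            f"Another process appears to be writing to {dataset_dir} (found {lock_path})."
        ) from None
    try:
        os.write(fd, str(os.getpid()).encode())  # record the owner pid for debugging
        yield
    finally:
        os.close(fd)
        lock_path.unlink(missing_ok=True)
```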
Related cleanups (separate from the API change)

While reading the PR, two small things stood out that are worth a follow-up regardless of the tri-state proposal:

- The partial-completion warning at the end of `_build_async` is unreachable because of the `return True` immediately above it. Moving the warning above the return restores user-visible feedback for incomplete async runs.
- `_load_resume_state` raises `DatasetGenerationError` from a `FileNotFoundError` without `from exc`, dropping the original traceback. Chaining it would help future debugging.
Thank you for taking this on, the plan in … One thought is that we could try doing checkpointing on a task level already. However, that would need a sidecar format (parquet only wants whole row groups), concurrency-safe writes from many parallel asyncio tasks, and a …
Thanks everyone for the thorough review — really useful catches across the board. Here's what was addressed:
PR description updated to reflect the changed semantics.
nabinchha left a comment
Thanks for grinding through several review rounds on this, @przemekboruta — the iteration is visible (the _GenerationOutcome enum, crash-window handling, and config-mismatch guard are all clear improvements). Most of my notes are about leaving the public API in a place that won't need a breaking change next time we touch it.
Summary
Adds an opt-in resume: bool to DataDesigner.create() / DatasetBuilder.build() that picks generation up from the last completed batch (sync) or row group (async), with metadata.json-based state for sync and filesystem reconciliation for async. The implementation matches the stated intent and the PR description tracks the (post-iteration) semantics — the in-code docstrings have drifted a bit in the process.
Findings
Critical — Let's fix these before merge
packages/data-designer/src/data_designer/interface/data_designer.py:193 — ResumeMode enum not introduced; should land alongside #587
- What: The earlier `ResumeMode` proposal (NEVER/ALWAYS/IF_POSSIBLE) was only partially picked up. The new `_GenerationOutcome` enum is internal — it's a return-value type for `_build_with_resume` / `_build_async`, not the public input parameter. The user-facing API is still `resume: bool` on `DataDesigner.create()`, `DatasetBuilder.build()`, and `ArtifactStorage.resume`.
- Why: This is a one-shot decision — once `resume: bool` ships in a release, every adoption locks us into `bool` semantics. Migrating to `resume: ResumeMode` later is a breaking change (`bool` and `StrEnum` aren't substitutable in user code, even with `StrEnum`'s string coercion); the back-compat workaround is `bool | ResumeMode` for one or more deprecation cycles, which leaves a permanent wart in the public signature. The good news: the dependency for `IF_POSSIBLE` is no longer abstract — PR #587 (closes #584) is open and provides `DataDesignerConfig.fingerprint()` / `fingerprint_config()` with a versioned `CONFIG_HASH_VERSION`, which is exactly the "compatible vs. incompatible config on disk" check `IF_POSSIBLE` needs. So the cleanest sequencing is: land #587 first, then this PR ships the full `ResumeMode` enum including `IF_POSSIBLE` in one go — no scaffolding-then-fill-in two-PR dance, no breaking API change later.
- Suggestion: Once #587 merges, replace `resume: bool` with the full enum here:
```python
from enum import StrEnum

class ResumeMode(StrEnum):
    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"

def create(self, *, resume: ResumeMode = ResumeMode.NEVER, ...) -> ...: ...
```

`_load_resume_state` / `_check_resume_config_compatibility` then call `fingerprint_config(self._data_designer_config)` and compare against a `config_hash` field persisted into `metadata.json` on each write. The behavior matrix from the original `ResumeMode` thread (compatible-fingerprint resumes, incompatible-fingerprint raises under both `ALWAYS` and `IF_POSSIBLE`) drops in cleanly on top of #587's API. If #587 stalls and this PR needs to ship sooner, the fallback is the scaffolding-only version (`NEVER` / `ALWAYS`, no `IF_POSSIBLE`) — but please don't ship `resume: bool`. cc @nabinchha / @johnnygreco for the sequencing call.
Warnings — Worth addressing
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:198 and packages/data-designer/src/data_designer/interface/data_designer.py:215 — Stale num_records docstring
- What: Both `build()` and `create()` still say "The run parameters (num_records, buffer_size) must match those of the original run", but per the latest iteration `num_records` only needs to be `>= actual_num_records` (the whole point of the johnnygreco/nabinchha "change your mind" semantic that you adopted).
- Why: The PR description was updated to reflect the new semantics, but users will read the docstring, not the PR description. They'll think they have to use the same `num_records` and end up either restarting from scratch or filing a bug.
- Suggestion: Update both docstrings to something like "`buffer_size` must match the original run; `num_records` may be the same or larger and is treated as the new target. Resuming with `num_records < actual_num_records_so_far` raises `DatasetGenerationError`."
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:265 — FileNotFoundError traceback dropped
- What: `_load_resume_state` re-raises `FileNotFoundError` as `DatasetGenerationError` without `from exc`, dropping the original traceback. This was specifically called out in the prior review thread (the "Related cleanups" section of the `ResumeMode` comment) and not picked up.
- Why: STYLEGUIDE.md is explicit on this: "Re-raise with context so the original traceback is preserved." It's also a one-line fix.
- Suggestion:

```python
except FileNotFoundError as exc:
    raise DatasetGenerationError(
        "🛑 Cannot resume: metadata.json not found in the existing dataset directory. "
        "Run without resume=True to start a new generation."
    ) from exc
```

Suggestions — Take it or leave it
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:702-705 — Silent fallback when saved config is unreadable
- What: `_check_resume_config_compatibility` swallows `OSError` and `JSONDecodeError` and returns silently — the comment says "unreadable config — skip check rather than block the resume".
- Why: That's a defensible choice for partially-corrupted state, but it means a resume can silently mix incompatible batches if the saved config happens to be corrupt. At least logging a warning would make the situation visible.
- Suggestion: `logger.warning("⚠️ Saved builder_config.json is unreadable — skipping config compatibility check on resume.")` before the `return`.
packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py:1283 — Async happy-path resume test asserts inputs but not the skip set
- What: `test_initial_actual_num_records_from_filesystem_in_crash_window` is the closest async analog to the sync `test_build_resume_runs_remaining_batches`, but it captures `initial_actual_num_records` / `initial_total_num_batches` and not `skip_row_groups`. The "only the missing row groups get scheduled" invariant is the actual user-facing guarantee for async resume.
- Why: The sync test locks down "batches 1, 2 ran, not 0" — a regression that swapped `skip_row_groups` for an empty set in async would not be caught by any of the current tests.
- Suggestion: Add a sibling assertion to the existing test (or a new tiny one): `assert captured["skip_row_groups"] == frozenset({0, 1})` (captured the same way you already capture the other two kwargs).
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:428-550 — _build_async is now ~120 lines and densely nested
- What: The resume-resolution block at lines 452-495 (no-prior-state vs metadata vs filesystem-only, plus the `max(meta_count, fs_count)` already-complete decision) is doing real work in the middle of the scheduler-setup function.
- Why: Mostly readability — the precedence rules between `metadata` and `_find_completed_row_group_ids()` are subtle and worth being able to point at in isolation. Also makes them more testable as a unit.
- Suggestion (optional, can defer to a follow-up): extract `_resolve_async_resume_state(num_records, buffer_size) -> tuple[frozenset[int], int, int, _GenerationOutcome | None]` returning the `skip_row_groups`, `initial_actual_num_records`, `initial_total_num_batches`, and an `ALREADY_COMPLETE` sentinel when the dataset is done. The caller then becomes a straight-line dispatch (a standalone sketch follows below).
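For illustration, a standalone sketch of that extraction with the inputs passed in explicitly instead of read from `self`. The `max(meta_count, fs_count)` reconciliation and the per-row-group record formula are lifted from this review thread; the `_GenerationOutcome` member names and the metadata field it reads are assumptions, not the shipped code.

```python
import math
from enum import Enum, auto


class _GenerationOutcome(Enum):
    GENERATED = auto()
    ALREADY_COMPLETE = auto()


def resolve_async_resume_state(
    completed_ids: frozenset[int],  # row-group ids found in parquet-files/ on disk
    meta_count: int,                # num_completed_batches recorded in metadata.json (0 if absent)
    num_records: int,
    buffer_size: int,
) -> tuple[frozenset[int], int, int, _GenerationOutcome | None]:
    # Crash window: a row group may already be on disk while metadata.json still
    # shows the previous count, so trust whichever source reports more progress.
    initial_total_num_batches = max(meta_count, len(completed_ids))
    # Records implied by the completed row groups (the last group may be short).
    initial_actual_num_records = sum(
        min(buffer_size, num_records - rg_id * buffer_size) for rg_id in completed_ids
    )
    outcome = None
    if initial_total_num_batches >= math.ceil(num_records / buffer_size):
        outcome = _GenerationOutcome.ALREADY_COMPLETE
    return completed_ids, initial_actual_num_records, initial_total_num_batches, outcome
```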
What Looks Good
- `_GenerationOutcome` cleanly fixes both prior bugs at once — gating `run_after_generation` on `status is _GenerationOutcome.GENERATED` is the right shape, and it kills both the dead-warning and the AFTER_GENERATION-rerun problems with one mechanism.
- Crash-window handling via `max(meta_count, fs_count)` is exactly the right call. Metadata is authoritative once AFTER_GENERATION has rewritten `parquet-files/`; filesystem is authoritative when metadata lags one row group. Using the max covers both cases without needing a flag.
- `_check_resume_config_compatibility` runs before `_write_builder_config()` — subtle but critical ordering. Easy to get wrong; you got it right and called it out in the docstring.
- Test coverage genuinely exercises the failure modes that matter — config-mismatch, buffer-mismatch, num_records-too-small, no-metadata-fallback, crash-window filesystem reconciliation, and the "did we actually skip the right batches" invariant for sync. ~20 new tests, well-scoped.
Verdict
Needs changes / needs sequencing — the Critical (ResumeMode) is blocking; once resume: bool ships, the API decision is locked in. Cleanest path is to wait for #587 to land and then ship the full enum (incl. IF_POSSIBLE) in this PR. The two Warnings (stale docstrings, missing from exc) are small follow-throughs from the prior review thread that are much cheaper to fix here than in a follow-up. Suggestions are take-or-leave.
This review was generated by an AI assistant.
Quick follow-up: #587 just merged, so the dependency for the Critical finding above is resolved. A small correction to the suggested code in my earlier review: the shipped public surface is …

```python
from enum import StrEnum

class ResumeMode(StrEnum):
    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"

def create(self, *, resume: ResumeMode = ResumeMode.NEVER, ...) -> ...: ...
```

For the compatibility check inside …:

```python
fp = self._data_designer_config.fingerprint()
# {"config_hash": "sha256:…", "config_hash_algo": "sha256", "config_hash_version": 1}
# Persist fp into metadata.json on each write, then on resume:
if saved.get("config_hash_version") != fp["config_hash_version"]:
    # Scheme changed under us — treat as "unknown identity" rather than mismatch.
    ...
elif saved.get("config_hash") != fp["config_hash"]:
    raise DatasetGenerationError("🛑 Cannot resume: config has changed since the interrupted run.")
```

Two small notes worth folding in:

…

Otherwise the original review stands — the two Warnings (stale docstrings, missing `from exc`) still apply.

This review was generated by an AI assistant.
- `ArtifactStorage` gains a `resume: bool = False` field
- `resolved_dataset_name` skips timestamp logic when `resume=True`, returning the existing dataset folder name as-is
- Raises `ArtifactStorageError` on `resume=True` when the target folder is absent or empty (no data to resume from)
- New `clear_partial_results()` removes in-flight partial results left over from an interrupted run

Fixes NVIDIA-NeMo#525

`DatasetBatchManager.start()` now accepts:

- `start_batch: int = 0` — first batch index to process
- `initial_actual_num_records: int = 0` — records already on disk

Both default to 0 so all existing call sites are unaffected.

Fixes NVIDIA-NeMo#525

- `build()` gains a `resume: bool = False` parameter
- `_load_resume_state()` reads `metadata.json` and validates that `num_records` and `buffer_size` match the original run
- `_build_with_resume()` skips completed batches, clears in-flight partial results, and continues from the first incomplete batch
- Raises `DatasetGenerationError` with clear messages for: missing `metadata.json` (interrupted before first batch completes), `num_records` mismatch, `buffer_size` mismatch, `DATA_DESIGNER_ASYNC_ENGINE=1` (not yet supported)
- Logs a warning and returns early when dataset is already complete

Fixes NVIDIA-NeMo#525

- `create()` gains `resume: bool = False`
- `_create_resource_provider()` passes `resume` to `ArtifactStorage`
- `builder.build()` receives the resume flag

Fixes NVIDIA-NeMo#525

Covers:

- `ArtifactStorage.resolved_dataset_name` with `resume=True`
- `ArtifactStorage.clear_partial_results()`
- `DatasetBatchManager.start()` with `start_batch` and `initial_actual_num_records`
- `DatasetBuilder.build(resume=True)`: missing metadata, `num_records` mismatch, `buffer_size` mismatch, already-complete detection

Fixes NVIDIA-NeMo#525

…INE=1)

- Add `_find_completed_row_group_ids()` to scan `parquet-files/` for already-written row groups by parsing `batch_*.parquet` filenames
- `_build_async()` now accepts `resume=True`: loads metadata, finds completed row groups, clears partial results, and logs progress; returns early if all row groups are done
- `_prepare_async_run()` accepts `skip_row_groups`, `initial_actual_num_records`, and `initial_total_num_batches` so the scheduler only processes remaining row groups and `RowGroupBufferManager` starts from the correct counts
- `RowGroupBufferManager.__init__` gains `initial_actual_num_records` and `initial_total_num_batches` params to seed the counters on resume
- `finalize_row_group` closure now writes incremental metadata after each checkpoint so any run (resume or not) can be resumed if interrupted mid-way
- Remove the guard that rejected `resume=True` with `DATA_DESIGNER_ASYNC_ENGINE=1`
- Add tests for all new paths
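As a reference for the filename-scan approach this commit describes, a hedged sketch; the `batch_<id>.parquet` pattern is inferred from the commit message and the helper name is simplified from the private method it mirrors.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from the commit message; the real pattern
# (zero-padding, prefixes) may differ.
_BATCH_FILE_RE = re.compile(r"^batch_(\d+)\.parquet$")


def find_completed_row_group_ids(parquet_dir: Path) -> frozenset[int]:
    """Return the row-group ids that already have a final parquet file on disk."""
    if not parquet_dir.exists():
        return frozenset()
    ids: set[int] = set()
    for path in parquet_dir.iterdir():
        match = _BATCH_FILE_RE.match(path.name)
        if match:  # ignore partial results, metadata, and other non-batch files
            ids.add(int(match.group(1)))
    return frozenset(ids)
```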
…set already complete

`_build_with_resume` and `_build_async` now return `False` when the dataset is already complete (early-return path), `True` otherwise. `build()` skips `_processor_runner.run_after_generation()` on `False`, preventing processors from calling `shutil.rmtree` and rewriting an already-finalized dataset. Fixes the issue raised in review: greptile P1 comment on PR NVIDIA-NeMo#526.

…sync resume

Metadata can lag by one row group if a crash occurs between `move_partial_result_to_final_file_path` and `write_metadata`. Using `len(completed_ids)` from the filesystem scan instead of `state.num_completed_batches` ensures the final metadata reflects the actual number of parquet files present, not the potentially stale metadata count.

Adds `DatasetCreationResults.export(path, format=)` supporting jsonl, csv, and parquet. The CLI create command gains `--output-format` / `-f`, which writes `dataset.<format>` alongside the parquet batch files.
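The commit names the `export(path, format=)` surface but not its internals; the sketch below shows the kind of format dispatch such a method might perform using pandas writers, which is an assumption rather than the actual `DatasetCreationResults` implementation.

```python
from pathlib import Path

import pandas as pd


def export_dataset(df: pd.DataFrame, path: str | Path, format: str = "jsonl") -> Path:
    """Write a generated dataset to jsonl, csv, or parquet (the formats this commit lists)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    if format == "jsonl":
        df.to_json(path, orient="records", lines=True)
    elif format == "csv":
        df.to_csv(path, index=False)
    elif format == "parquet":
        df.to_parquet(path, index=False)
    else:
        raise ValueError(f"Unsupported export format: {format!r}")
    return path
```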
…efore first batch)

When a run is interrupted before any row group or batch completes, `metadata.json` is never written. Previously `resume=True` would raise `DatasetGenerationError` in this case. Now `build()` detects the missing file, logs an info message, clears any leftover partial results, and falls back to a clean fresh run. This is the common scenario for small datasets (fewer records than `buffer_size`) where all records fit in a single row group.

…ync resume

In the crash window (row group written to disk but `write_metadata` crashed before updating the file), both `initial_total_num_batches` and `initial_actual_num_records` now use the filesystem-discovered `completed_ids` as the source of truth. Previously `initial_actual_num_records` was read from potentially stale metadata, causing `actual_num_records` in the final metadata to be undercounted by one row group. Also adds a test covering the partial-resume crash-window scenario.

…/IF_POSSIBLE)

- Introduces `ResumeMode(StrEnum)` in `artifact_storage.py` for use across all layers
- Replaces `resume: bool` with `resume: ResumeMode` in `DatasetBuilder.build()`, `DataDesigner.create()`, `ArtifactStorage`, and `_build_async()`
- Adds `_check_resume_config_compatibility()` using config fingerprints to support `IF_POSSIBLE`: falls back to a fresh run when the config has changed since the last run
- Relaxes `num_records` validation from strict equality to `num_records >= actual_num_records`, allowing dataset extension on resume; `buffer_size` must still match exactly
- Preserves the exception chain with `from exc` on `FileNotFoundError` in `_load_resume_state`
- Exports `ResumeMode` from `data_designer.interface` for users to import
- Adds a `skip_row_groups` assertion test and `IF_POSSIBLE` storage behavior tests
```python
if resume == ResumeMode.IF_POSSIBLE:
    if not self._check_resume_config_compatibility():
        logger.info(
            "▶️ Config has changed since the last run — starting a fresh generation (resume=IF_POSSIBLE)."
        )
        resume = ResumeMode.NEVER
    else:
        resume = ResumeMode.ALWAYS
```
IF_POSSIBLE fresh run silently overwrites the previous dataset
_check_resume_config_compatibility() is the first call that accesses self.artifact_storage.base_dataset_path, which triggers and caches the @cached_property resolved_dataset_name. Because ArtifactStorage.resume is still IF_POSSIBLE at that moment, the property evaluates the "dataset exists and is non-empty → return original name" branch and locks in the original directory name.
When the check then returns False (configs differ) and the local variable resume is set to NEVER, ArtifactStorage.resume is never updated. All subsequent path accesses (including _write_builder_config() and the build loop itself) use the cached original name, so the fresh generation writes directly into — and destroys — the completed previous dataset, with no timestamp-renaming or warning.
Fix: after deciding to fall back to NEVER, also update the storage object so the cached property reflects the correct mode:
```python
if not self._check_resume_config_compatibility():
    logger.info(...)
    resume = ResumeMode.NEVER
    self.artifact_storage.resume = ResumeMode.NEVER  # keep ArtifactStorage in sync
```
…downgrades to NEVER

`ArtifactStorage`'s Pydantic model validator accesses `base_dataset_path` at construction time, caching `resolved_dataset_name` under `IF_POSSIBLE` semantics before `build()` can set `resume=NEVER`. Pop the stale cache entry so the property re-resolves with the correct `NEVER` semantics (timestamped directory). Also fixes `_check_resume_config_compatibility()` to use `artifact_path`/`dataset_name` directly instead of `base_dataset_path`, and adds a regression test covering the cache-bypass scenario.
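The cache-pop mechanics are easy to see with a plain `functools.cached_property`; the toy class below is not `ArtifactStorage`, it only demonstrates that the first access stores the value in the instance `__dict__` and that dropping that key forces re-resolution.

```python
from functools import cached_property


class Example:
    def __init__(self) -> None:
        self.mode = "if_possible"

    @cached_property
    def resolved_name(self) -> str:
        # First access computes the value and stores it in the instance __dict__.
        return f"resolved-under-{self.mode}"


obj = Example()
print(obj.resolved_name)                 # "resolved-under-if_possible" (now cached)
obj.mode = "never"
print(obj.resolved_name)                 # still the stale cached value
obj.__dict__.pop("resolved_name", None)  # drop the cache entry, as the fix above does
print(obj.resolved_name)                 # "resolved-under-never", re-resolved
```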
```python
if not config_path.exists():
    return True
```
IF_POSSIBLE raises on first run when no prior dataset exists
_check_resume_config_compatibility() returns True whenever the config file is absent — including the case where the dataset directory has never been created. That causes the IF_POSSIBLE branch to unconditionally upgrade resume to ALWAYS and pop the cache (lines 251–253). When _write_builder_config() then evaluates self.artifact_storage.base_dataset_path, resolved_dataset_name re-runs under resume=ALWAYS and the directory still doesn't exist, which raises ArtifactStorageError("🛑 Cannot resume: no existing dataset found …").
The expected behaviour for IF_POSSIBLE + no prior dataset is a silent fresh start. The fix is to return False from _check_resume_config_compatibility (can't compare → can't resume) when the dataset directory itself doesn't exist:
```python
def _check_resume_config_compatibility(self) -> bool:
    dataset_dir = Path(self.artifact_storage.artifact_path) / self.artifact_storage.dataset_name
    if not dataset_dir.exists() or not any(dataset_dir.iterdir()):
        return False  # No prior run — treat as incompatible so IF_POSSIBLE starts fresh
    config_path = dataset_dir / SDG_CONFIG_FILENAME
    if not config_path.exists():
        return True
    ...
```
Summary
Closes #525
Adds `resume: ResumeMode = ResumeMode.NEVER` to `DataDesigner.create()` and `DatasetBuilder.build()`. Generation picks up from where the interrupted run left off — for both the sync and async engines.

Changes
| Component | Change |
|---|---|
| `ArtifactStorage` | `ResumeMode(StrEnum)` enum (`NEVER`/`ALWAYS`/`IF_POSSIBLE`); `resume: ResumeMode = ResumeMode.NEVER` field; `resolved_dataset_name` skips timestamp logic on `ALWAYS`/`IF_POSSIBLE`; new `clear_partial_results()` |
| `DatasetBatchManager.start()` | `start_batch` and `initial_actual_num_records` params (default 0, no breakage) |
| `DatasetBuilder.build()` | `resume: ResumeMode` param; `_load_resume_state()` reads and validates `metadata.json`; `_build_with_resume()` skips completed batches (sync); `_build_async()` skips completed row groups (async); `_check_resume_config_compatibility()` compares config fingerprints and invalidates the `resolved_dataset_name` cache on `IF_POSSIBLE` downgrade; partial-completion warning moved before the `return` in `_build_async` (was dead code) |
| `RowGroupBufferManager.__init__()` | `initial_actual_num_records` and `initial_total_num_batches` params to seed counters on resume |
| `DatasetBuilder._find_completed_row_group_ids()` | scans `parquet-files/` for `batch_*.parquet` to determine which async row groups are already done |
| `finalize_row_group` closure | writes `metadata.json` after every row-group checkpoint (not just at the end), making all async runs resumable if interrupted |
| `DataDesigner.create()` | `resume: ResumeMode`, passes it through to `ArtifactStorage` and `builder.build()` |
| `bool` return in `_build_with_resume`/`_build_async` | `build()` gates `run_after_generation` on the return value so processors are never re-run on an already-complete dataset |

ResumeMode semantics
| Mode | Behavior |
|---|---|
| `NEVER` (default) | start a fresh generation |
| `ALWAYS` | resume the existing dataset; `DatasetGenerationError` if incompatible |
| `IF_POSSIBLE` | resume when the saved config is compatible; otherwise fall back to a fresh run |

Validation and error cases
- Missing `metadata.json` (interrupted before first batch): restarts from scratch (both engines)
- `num_records` less than already-generated records → `DatasetGenerationError`; `num_records` greater than original target is allowed (extends the dataset)
- `buffer_size` mismatch → `DatasetGenerationError`
- Config mismatch with `ALWAYS` → `DatasetGenerationError`; with `IF_POSSIBLE` → silent fresh start, `resolved_dataset_name` cache invalidated so the fresh run gets a timestamped directory

Test plan
- `test_resolved_dataset_name_resume_uses_existing_folder`
- `test_resolved_dataset_name_resume_raises_when_no_existing_folder`
- `test_resolved_dataset_name_resume_raises_when_folder_is_empty`
- `test_resolved_dataset_name_if_possible_uses_existing_folder`
- `test_resolved_dataset_name_if_possible_uses_clean_name_when_no_existing_folder`
- `test_clear_partial_results_removes_partial_folder`
- `test_clear_partial_results_is_noop_when_no_partial_folder`
- `test_start_with_start_batch`
- `test_start_with_initial_actual_num_records`
- `test_start_with_start_batch_and_initial_actual_num_records`
- `test_start_default_values_unchanged`
- `test_build_resume_starts_fresh_without_metadata`
- `test_build_resume_raises_when_num_records_below_actual`
- `test_build_resume_allows_larger_num_records`
- `test_build_resume_raises_on_buffer_size_mismatch`
- `test_build_resume_runs_remaining_batches`
- `test_build_resume_logs_warning_when_already_complete`
- `test_build_resume_already_complete_does_not_run_after_generation_processors`
- `test_find_completed_row_group_ids_empty_dir`
- `test_find_completed_row_group_ids_with_files`
- `test_find_completed_row_group_ids_ignores_non_batch_files`
- `test_build_async_resume_logs_warning_when_already_complete`
- `test_build_async_resume_starts_fresh_without_metadata`
- `test_build_async_resume_already_complete_does_not_run_after_generation_processors`
- `test_find_completed_row_group_ids_used_for_initial_total_batches`
- `test_initial_actual_num_records_from_filesystem_in_crash_window`
- `test_build_async_resume_skip_row_groups_contains_completed_ids`
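One of the coverage points suggested earlier in the thread (string coercion through `StrEnum`) could look like the sketch below; the only assumption beyond the Changes list above is the test's own name and parametrization.

```python
import pytest

from data_designer.interface import ResumeMode  # export listed in the Changes section above


@pytest.mark.parametrize(
    ("raw", "expected"),
    [
        ("never", ResumeMode.NEVER),
        ("always", ResumeMode.ALWAYS),
        ("if_possible", ResumeMode.IF_POSSIBLE),
    ],
)
def test_resume_mode_string_coercion(raw: str, expected: ResumeMode) -> None:
    # StrEnum members compare equal to their string values, so config-driven
    # callers can pass plain strings and still hit the same code paths.
    assert ResumeMode(raw) is expected
    assert expected == raw
```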