
feat(results): add export() method and --output-format CLI flag#540

Open
przemekboruta wants to merge 9 commits into NVIDIA-NeMo:main from przemekboruta:feat/dataset-export

Conversation

@przemekboruta
Contributor

@przemekboruta przemekboruta commented Apr 13, 2026

Summary

Closes #566

  • Adds DatasetCreationResults.export(path, format=) supporting jsonl, csv, and parquet formats
  • Adds DatasetCreationResults.count_records() which reads record counts from Parquet file metadata only (no data pages loaded, no OOM risk)
  • Adds --output-format / -f flag to the data-designer create CLI command with click.Choice validation; writes dataset.<format> alongside the parquet batch files
  • Output format is inferred from the file extension by default (case-insensitive); format= is an optional explicit override (e.g. write a .txt file as JSONL)
  • SUPPORTED_EXPORT_FORMATS is derived from get_args(ExportFormat) so the Literal and the tuple cannot drift apart
  • Unsupported format raises InvalidFileFormatError (consistent with project error conventions)
  • JSONL export uses date_format="iso" for consistent datetime serialization across formats
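
The Literal-to-tuple derivation can be sketched in isolation (a minimal reconstruction; only the names ExportFormat and SUPPORTED_EXPORT_FORMATS are taken from the PR, the rest is illustrative):

```python
from typing import Literal, get_args

# The Literal is the single source of truth for supported formats.
ExportFormat = Literal["jsonl", "csv", "parquet"]

# Derived at import time, so adding a format to the Literal automatically
# extends the tuple used for validation; the two cannot drift apart.
SUPPORTED_EXPORT_FORMATS: tuple[str, ...] = get_args(ExportFormat)
```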

Implementation notes

Streaming export (no OOM risk)

export() never materialises the full dataset in memory. Instead of calling load_dataset(), it reads the individual batch parquet files from artifact_storage.final_dataset_path one at a time and appends each to the output file, keeping peak memory proportional to a single batch regardless of dataset size:

  • JSONL / CSV — each batch is appended to the output file; CSV header is written only on the first batch
  • Parquet — pyarrow.parquet.ParquetWriter is used to write each batch as a row group into one output file; schemas are unified with pa.unify_schemas(promote_options="permissive") to handle minor numeric type drift (e.g. int64 → float64 across batches); truly incompatible schemas (different column names) raise InvalidFileFormatError

count_records() (no OOM risk)

count_records() reads row counts from Parquet file metadata only — no data pages are loaded:

sum(pq.read_metadata(f).num_rows for f in batch_files)

The CLI uses count_records() instead of len(load_dataset()) to report the final record count, avoiding loading the full dataset into memory just for a count.

Format inference

results.export("output.jsonl")          # inferred from extension
results.export("output.csv")            # inferred from extension
results.export("output.parquet")        # inferred from extension
results.export("output.txt", format="jsonl")  # explicit override
results.export("output.JSONL")          # case-insensitive extension

CLI:

data-designer create config.yaml --output-format jsonl
data-designer create config.yaml -n 500 -f csv

Test plan

  • test_export_writes_file — parametrized over all 3 formats
  • test_export_jsonl_content — each line is valid JSON, all records present
  • test_export_csv_content — single header row, full record count
  • test_export_parquet_content — DataFrame round-trip across two batches
  • test_export_infers_format_from_extension — format omitted, inferred from .jsonl
  • test_export_explicit_format_overrides_extension — .txt written as JSONL
  • test_export_parquet_schema_unification — two batches with diverging column types (int64/float64) merged correctly via permissive promotion
  • test_export_parquet_incompatible_schemas_raises — different column names across batches raises InvalidFileFormatError
  • test_export_uppercase_extension_is_recognised — .JSONL treated same as .jsonl
  • test_export_unknown_extension_raises — raises InvalidFileFormatError
  • test_export_unsupported_explicit_format_raises — raises InvalidFileFormatError
  • test_export_no_batch_files_raises — raises ArtifactStorageError
  • test_export_returns_path_object — returns Path for str input
  • test_count_records — reads row counts from Parquet metadata, no data pages loaded
  • test_run_create_with_output_format_happy_path — export called with correct path
  • test_run_create_export_failure_exits — export exception exits with code 1
  • Existing CLI delegation tests updated for new output_format parameter

Adds DatasetCreationResults.export(path, format=) supporting jsonl,
csv, and parquet. The CLI create command gains --output-format / -f
which writes dataset.<format> alongside the parquet batch files.
@przemekboruta przemekboruta requested a review from a team as a code owner April 13, 2026 19:26
@greptile-apps
Contributor

greptile-apps Bot commented Apr 13, 2026

Greptile Summary

Adds DatasetCreationResults.export() for streaming JSONL/CSV/Parquet export across batch files, a count_records() metadata-only row counter, and a --output-format / -f CLI flag validated by click.Choice before generation begins. The implementation is clean: format inference, schema unification for Parquet, and the SUPPORTED_EXPORT_FORMATS / ExportFormat coupling are all well-handled.

Confidence Score: 5/5

Safe to merge; the one flagged issue is an edge-case JSONL blank-line bug affecting empty batch files, which are not produced by the current engine.

All findings are P2. The implementation correctly handles format validation via click.Choice before generation, streams memory-efficiently, and is well-tested. The empty-batch JSONL guard has a subtle logic error but only fires in a scenario the engine does not produce today.

packages/data-designer/src/data_designer/interface/results.py — _export_jsonl empty-batch guard

Important Files Changed

Filename — Overview
packages/data-designer/src/data_designer/interface/results.py — Adds export() (JSONL/CSV/Parquet streaming export) and count_records() (metadata-only row count). Minor edge-case bug: empty-batch guard in _export_jsonl writes a spurious blank JSONL line instead of being a no-op.
packages/data-designer/src/data_designer/cli/controllers/generation_controller.py — Replaces load_dataset() + len() with count_records() for efficient row counting; adds export path construction and error handling after generation. Format validation is delegated to click.Choice at the CLI layer, so invalid formats are rejected before any work begins.
packages/data-designer/src/data_designer/cli/commands/create.py — Adds --output-format / -f CLI option backed by click.Choice(SUPPORTED_EXPORT_FORMATS), forwarded to the controller.
packages/data-designer/tests/interface/test_results.py — Comprehensive test coverage for export(): all three formats, format inference, explicit override, schema unification, error paths, and count_records().
packages/data-designer/tests/cli/controllers/test_generation_controller.py — Updates mock helper to use count_records() instead of load_dataset(); adds happy-path and failure-path tests for --output-format.
packages/data-designer/tests/cli/commands/test_create_command.py — Updates existing delegation tests with output_format=None; adds a new test verifying output_format is forwarded to the controller.
packages/data-designer/tests/cli/test_main.py — Trivial update: adds output_format=None to the existing dispatch assertion.

Sequence Diagram

sequenceDiagram
    actor User
    participant CLI as create_command
    participant Ctrl as GenerationController
    participant DD as DataDesigner
    participant Results as DatasetCreationResults
    participant FS as Artifact Storage

    User->>CLI: data-designer create config.yaml -f jsonl
    Note over CLI: click.Choice validates format before any work starts
    CLI->>Ctrl: run_create(..., output_format="jsonl")
    Ctrl->>DD: create(config, num_records, dataset_name)
    DD-->>FS: writes batch_*.parquet files
    DD-->>Ctrl: results
    Ctrl->>Results: count_records()
    Results->>FS: pq.read_metadata(batch_*.parquet)
    FS-->>Results: num_rows (metadata only)
    Results-->>Ctrl: total row count
    Ctrl->>Results: export(base_path/dataset.jsonl)
    loop each batch file
        Results->>FS: read_parquet(batch_N.parquet)
        FS-->>Results: DataFrame chunk
        Results->>FS: append chunk to dataset.jsonl
    end
    Results-->>Ctrl: Path("dataset.jsonl")
    Ctrl-->>User: Dataset created — N record(s) generated
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
packages/data-designer/src/data_designer/interface/results.py:222-225
**Empty-batch guard writes a spurious blank line in JSONL output**

When a batch file is empty, `to_json(orient="records", lines=True)` returns `""`. `f.write("")` is a no-op, then `not "".endswith("\n")` is `True`, so a bare `"\n"` is written. That produces a blank line in the JSONL output, which is not a valid JSONL record and will cause downstream parsers to fail on that line.

The guard was intended only to ensure the last record's line ends with a newline, but it fires incorrectly for empty content too. The fix is to skip the newline append when `content` is empty:

```suggestion
            f.write(content)
            if content and not content.endswith("\n"):
                # Guard against batches where to_json() omits the trailing newline.
                f.write("\n")
```

Reviews (8) — Last reviewed commit: "fix(results): fix parquet export schema ..."

Contributor

@andreatgretel andreatgretel left a comment


Thanks for this @przemekboruta - export() fills a real gap and the core implementation is clean. A few things to tighten up before merging:

  • Per CONTRIBUTING.md, features should have an associated issue. Could you open one retroactively and link it here with Closes #NNN?
  • The success message / export ordering and double dataset load need fixing (see inline comments)
  • Need controller-level tests for the new output_format path - happy path, bad format rejection, and export failure
  • A couple of style guide items flagged inline (error types, import placement)

Left details in inline comments. Nice catch on the lazy import fix in the third commit.

…, import hygiene

- Derive SUPPORTED_EXPORT_FORMATS from get_args(ExportFormat) so the two can't drift apart
- Replace ValueError with InvalidFileFormatError in export() — consistent with project error conventions
- Add date_format="iso" to to_json() for consistent datetime serialization across formats
- Add click.Choice(SUPPORTED_EXPORT_FORMATS) to --output-format CLI option for parse-time
  validation, better --help output, and tab completion
- Fix double load_dataset() in run_create: inline len() so the DataFrame ref dies before export
- Move success message after the export block to avoid "Dataset created" followed by "Export failed"
- Move imports to module level in test_results.py (json, Path, lazy already imported)
- Add controller-level tests for output_format happy path, bad format rejection, and export failure
@przemekboruta
Contributor Author

hey @andreatgretel! thanks for the review. All points are addressed now.

@github-actions
Contributor

Issue #566 has been triaged. The linked issue check is being re-evaluated.

andreatgretel
andreatgretel previously approved these changes Apr 22, 2026
Contributor

@andreatgretel andreatgretel left a comment


Thanks @przemekboruta, all review comments are addressed - approving!

We should be able to merge soon. Seeking one more approval - cc @NVIDIA-NeMo/data-designer-reviewers.

@nabinchha
Contributor

Heads up on OOM risk with large datasets

The proposed export() method should avoid calling load_dataset() internally. Today, load_dataset() materializes the entire dataset into a single pandas DataFrame via pd.read_parquet() -- for a large partitioned dataset, this will OOM.

The data is already on disk as individual batch files (batch_00000.parquet, batch_00001.parquet, ...), so export() can stream batch-by-batch into a single output file without ever holding the full dataset in memory:

  • JSONL / CSV -- iterate over batch files, append each chunk to the output file. These formats are naturally appendable, so this produces one valid file with memory proportional to a single batch.
  • Parquet -- use pyarrow.parquet.ParquetWriter to write each batch as a row group into one output file.

One nuance for the parquet path: batch files can have slightly mismatched schemas (e.g., a column inferred as int64 in one batch and float64 in another). The fix is cheap -- read just the metadata from each batch file upfront with pq.read_schema(), merge with pa.unify_schemas(), and cast each batch to the unified schema before writing.

Example implementation sketch:

from __future__ import annotations

from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


SUPPORTED_FORMATS = {"jsonl", "csv", "parquet"}


def export(self, path: str | Path) -> Path:
    output = Path(path)
    fmt = output.suffix.lstrip(".")

    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"Unsupported file extension {output.suffix!r}. "
            f"Use one of: {', '.join(SUPPORTED_FORMATS)}"
        )

    batch_files = sorted(self.artifact_storage.final_dataset_path.glob("*.parquet"))

    if not batch_files:
        raise FileNotFoundError("No batch parquet files found.")

    if fmt == "jsonl":
        _export_jsonl(batch_files, output)
    elif fmt == "csv":
        _export_csv(batch_files, output)
    elif fmt == "parquet":
        _export_parquet(batch_files, output)

    return output


def _export_jsonl(batch_files: list[Path], output: Path) -> None:
    with output.open("w") as f:
        for batch_file in batch_files:
            chunk = pd.read_parquet(batch_file)
            content = chunk.to_json(orient="records", lines=True)
            f.write(content)
            # Only append a newline when one is actually missing; an
            # unconditional write would insert blank lines between batches.
            if content and not content.endswith("\n"):
                f.write("\n")


def _export_csv(batch_files: list[Path], output: Path) -> None:
    for i, batch_file in enumerate(batch_files):
        chunk = pd.read_parquet(batch_file)
        chunk.to_csv(output, mode="a", header=(i == 0), index=False)


def _export_parquet(batch_files: list[Path], output: Path) -> None:
    schemas = [pq.read_schema(f) for f in batch_files]
    unified = pa.unify_schemas(schemas)

    with pq.ParquetWriter(output, unified) as writer:
        for batch_file in batch_files:
            table = pq.read_table(batch_file)
            table = table.cast(unified)
            writer.write_table(table)

Usage:

results.export("output.jsonl")
results.export("output.csv")
results.export("output.parquet")

…atasets

- Rewrite export() to read batch parquet files one at a time instead of
  materialising the full dataset via load_dataset(); peak memory is now
  proportional to a single batch regardless of dataset size
- Infer output format from file extension by default; format= parameter
  kept as an explicit override (e.g. writing .txt as JSONL)
- _export_parquet unifies schemas across batches (pa.unify_schemas) to
  handle type drift (e.g. int64 vs float64 in the same column)
- Drop format= from the controller's export() call — path already carries
  the correct extension
- Rewrite export tests around real batch parquet files (stub_batch_dir
  fixture); add tests for multi-batch output, schema unification, unknown
  extension, empty batch directory, and explicit format override
@przemekboruta
Copy link
Copy Markdown
Contributor Author

Hey @nabinchha! I totally agree with your proposal and I've implemented it. Looking forward to your feedback

PR description updated

@przemekboruta przemekboruta requested a review from nabinchha April 22, 2026 19:05
Contributor

@nabinchha nabinchha left a comment


Thanks @przemekboruta! One blocker on memory safety and a few small wins to tidy up before merge.

Summary

Adds DatasetCreationResults.export(path, format=) (jsonl/csv/parquet) that streams batch parquet files one at a time to bound peak memory, plus --output-format/-f on data-designer create. Matches the stated intent; extension-based inference with explicit override is a nice ergonomic touch.

Findings

Critical — Let's fix these before merge

packages/data-designer/src/data_designer/cli/controllers/generation_controller.py:161 — CLI path still OOMs on large datasets, defeating the streaming export

  • What: num_records = len(results.load_dataset()) reads every batch parquet file into a single in-memory DataFrame just to print the row count, and it runs before results.export(). For any dataset big enough to need streaming, the CLI OOMs on this line before the streaming export gets a chance to run.
  • Why: The PR's headline property is "peak memory proportional to a single batch regardless of dataset size." The programmatic API honors that — results.export(...) called directly from Python is safe. But data-designer create ... --output-format X (the most visible entry point, and the one users of #566 are most likely to hit) does not. This isn't a nit: the motivating use case is broken through the CLI.
  • Suggestion: Count rows from parquet metadata — no data pages loaded:
    batch_files = sorted(results.artifact_storage.final_dataset_path.glob("batch_*.parquet"))
    num_records = sum(lazy.pq.read_metadata(f).num_rows for f in batch_files)
    Cleaner option: expose a count_records() (or similar) helper on DatasetCreationResults / ArtifactStorage so the CLI doesn't have to know about batch file layout. Either way, this should land in this PR rather than a follow-up, since it's tied to the claim the PR is making.

Warnings — Worth addressing

packages/data-designer/src/data_designer/cli/controllers/generation_controller.py:131-137 — Redundant validation now that click.Choice is in place

  • What: create.py:44 uses click_type=click.Choice(list(SUPPORTED_EXPORT_FORMATS)), which means Click rejects any invalid --output-format at parse time with its own error message — the run_create controller can never be reached with a bad value via the CLI. The controller-level check (and the local import at line 131) is dead code from the CLI's perspective.
  • Why: It's a small YAGNI violation (the controller isn't a documented public API, the only caller is create_command) and the style guide asks for imports at module level — create.py already imports SUPPORTED_EXPORT_FORMATS at the top, so lazy-loading doesn't buy anything here either. It also means we maintain two parallel error messages for the same condition.
  • Suggestion: Either (a) drop lines 131-137 and the corresponding test_run_create_invalid_output_format_exits test, or (b) if we want the controller to remain independently callable with defensive validation, move the import to module level and keep the check. I'd lean toward (a) since click.Choice already gives better --help output and tab completion.

packages/data-designer/src/data_designer/interface/results.py:229-234 — _export_parquet leaks raw pyarrow errors on genuinely incompatible schemas

  • What: pa.unify_schemas(schemas) handles numeric widening fine (tested), but raises pyarrow.lib.ArrowInvalid if batches have differing column names or truly incompatible types. Likewise table.cast(unified_schema) can raise on an unsupported cast. Those escape unwrapped.
  • Why: Per AGENTS.md / STYLEGUIDE.md, third-party exceptions at module boundaries should be normalized to data_designer error types. Callers writing except InvalidFileFormatError (or except DataDesignerError) will miss these, and the raw pyarrow traceback leaks an implementation detail of the engine into user-facing error handling.
  • Suggestion: Wrap the unify/cast block:
    try:
        unified_schema = lazy.pa.unify_schemas(schemas)
    except lazy.pa.lib.ArrowInvalid as e:
        raise InvalidFileFormatError(f"Cannot unify batch schemas for parquet export: {e}") from e

Suggestions — Take it or leave it

packages/data-designer/src/data_designer/interface/results.py:138 — File extension matching is case-sensitive

  • What: path.suffix.lstrip(".") returns "JSONL" for output.JSONL, which fails the format lookup.
  • Suggestion: path.suffix.lstrip(".").lower() — tiny ergonomic win, no downside.

packages/data-designer/src/data_designer/interface/results.py:211-212 — The trailing-newline guard in _export_jsonl is probably unnecessary

  • What: to_json(orient="records", lines=True) always emits a trailing \n in supported pandas versions, so if not content.endswith("\n") never fires. Not harmful, just noise.
  • Suggestion: Either drop the check, or add a one-line comment noting which pandas behavior it guards against so a future reader doesn't puzzle over it.

What Looks Good

  • Schema unification for parquet type drift — the pa.unify_schemas + table.cast approach is exactly the right move, and the dedicated test_export_parquet_schema_unification test (int64 → float64 across batches) nails the motivating case. Nice catch during the rewrite.
  • Streaming helpers are cleanly factored — _export_jsonl / _export_csv / _export_parquet are small, single-purpose, and easy to reason about in isolation. The export() dispatcher stays a readable 15ish lines.
  • Error-type choices are consistent with the codebase — InvalidFileFormatError for format problems, ArtifactStorageError for "no batch files" is the right split. And the docstring → raised-type fix in commit 50a93d98 is a good correctness catch before merge.
  • Extension-inference-with-override — results.export("output.csv") vs. results.export("output.txt", format="jsonl") is a clean API. SUPPORTED_EXPORT_FORMATS derived from get_args(ExportFormat) means the Literal and tuple can't drift.
  • Test coverage is thorough — parametrized format coverage, content round-trips, multi-batch streaming, schema unification, extension inference, explicit override, empty batch dir, unsupported format, and the controller-level happy/sad paths. Good mix.

Verdict

Needs changes — the pre-export load_dataset() on line 161 breaks the PR's own memory-safety claim for the CLI path and should be fixed in this PR. The redundant controller validation is a smaller cleanup worth bundling in. The rest are genuine take-it-or-leave-it suggestions.


This review was generated by an AI assistant.

…g, UX

- Replace load_dataset() with count_records() in CLI to avoid OOM on
  large datasets; add count_records() method using pq.read_metadata
  (reads file metadata only, no data pages loaded)
- Remove redundant format validation in controller — click.Choice in
  create.py already rejects invalid values at parse time; dead code
  removed along with corresponding test
- Wrap pa.unify_schemas / table.cast ArrowInvalid as InvalidFileFormatError
  to normalize third-party exceptions at module boundaries per AGENTS.md
- Lowercase file extension before format lookup so .JSONL/.CSV/.PARQUET
  are accepted without error
- Add clarifying comment to trailing-newline guard in _export_jsonl
- Add tests: count_records(), uppercase extension, incompatible schemas
@przemekboruta przemekboruta requested a review from nabinchha April 25, 2026 09:51
@nabinchha
Contributor

@przemekboruta, looks like tests are failing in CI. We will need to get those resolved to move this along.

…th bug

- Use promote_options="permissive" in pa.unify_schemas so minor numeric
  type drift (int64 vs float64) is handled by promotion instead of raising
- Also catch ArrowTypeError from unify_schemas and ValueError from
  table.cast() — the actual exception types thrown by pyarrow for these
  cases (ArrowInvalid alone is not sufficient)
- Wrap base_dataset_path in Path() in generation_controller.run_create
  to guard against callers that return a str (mock returns str, Path
  does not support / with str operands)
- Update test_export_parquet_incompatible_schemas_raises to match the
  new error source: with permissive unification, different-column-name
  batches fail at cast() not at unify_schemas(), so the match string
  changes from "Cannot unify batch schemas" to "Cannot cast batch"


Development

Successfully merging this pull request may close these issues.

feat: add export() method and --output-format CLI flag to DatasetCreationResults

3 participants