Add Lance writer stage#2112

Open
VibhuJawa wants to merge 3 commits into NVIDIA-NeMo:main from VibhuJawa:feat/lance-writer

Conversation

@VibhuJawa

Contributor

Split from #2106. This is PR 2 of 3 in the Lance IO stack.

Stacked on #2111. The branch is based on feat/lance-reader; until #2111 merges, GitHub may show reader changes in this PR when viewed against main. For the exact PR2 delta, compare:
VibhuJawa/NeMo-Curator@feat/lance-reader...feat/lance-writer

What changed:

  • Adds LanceWriter for writing DocumentBatch tables as Lance fragments.
  • Adds checkpoint record helpers and commit_lance_checkpoint for publishing staged fragments.
  • Expands the lance extra from pylance to lance-ray only when writer support lands.
  • Adds focused writer tests, including blob preservation from LanceReader metadata.

Validation:

  • uv run --extra lance --group test pytest -q tests/stages/text/io/reader/test_lance.py tests/stages/text/io/writer/test_lance.py
  • uv run ruff check nemo_curator/stages/text/io/writer/__init__.py nemo_curator/stages/text/io/writer/lance.py nemo_curator/utils/lance.py tests/stages/text/io/writer/test_lance.py
  • git diff --check

@copy-pr-bot

copy-pr-bot Bot commented Jun 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@VibhuJawa VibhuJawa force-pushed the feat/lance-writer branch 5 times, most recently from 6e9d584 to 0b418d3 Compare June 24, 2026 22:16
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@VibhuJawa VibhuJawa force-pushed the feat/lance-writer branch 8 times, most recently from 2a529a0 to 72e5ede Compare June 24, 2026 23:38
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@VibhuJawa VibhuJawa force-pushed the feat/lance-writer branch from 72e5ede to d197e67 Compare June 24, 2026 23:41
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@VibhuJawa VibhuJawa force-pushed the feat/lance-writer branch from d197e67 to b07fd9c Compare June 25, 2026 00:12
@VibhuJawa VibhuJawa marked this pull request as ready for review June 25, 2026 17:17
@VibhuJawa VibhuJawa requested review from a team as code owners June 25, 2026 17:17
@VibhuJawa VibhuJawa requested review from sarahyurick and removed request for a team June 25, 2026 17:17
@greptile-apps

greptile-apps Bot commented Jun 25, 2026

Contributor

Greptile Summary

This PR adds LanceWriter and commit_lance_checkpoint for writing DocumentBatch tables as Lance fragments with a two-phase checkpoint-then-commit flow, along with LanceReader/LanceReaderStage/LancePartitioningStage for reading Lance datasets with fragment-level parallelism, blob restoration, and pinned version tracking. The BaseReader is generalized to a generic BaseReader[ReaderTask] with a new BaseFileReader shim for existing JSONL/Parquet readers.

  • Lance writer (writer/lance.py): LanceWriter.process calls lance_ray.write_fragment, serializes each LanceFragment via pickle+base64 into JSON checkpoint records, and defers commit to commit_lance_checkpoint which uses LanceFragmentCommitter.
  • Lance reader (reader/lance.py): LancePartitioningStage fans out fragment groups into LanceReadTasks pinned to a dataset version; LanceReaderStage reads those fragments, restores blob columns, and forwards Lance schema metadata downstream for round-trip fidelity.
  • Checkpoint utilities (utils/lance.py): helpers for writing/reading per-fragment JSON records, a _COMMITTED marker for idempotent re-runs, and Lance row-address metadata column helpers.
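
The pickle+base64 record encoding described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function names `encode_record` and `decode_record` are hypothetical, a plain dict stands in for the Lance fragment object, and only the record keys (`kind`, `task_id`, `fragment_index`, `fragment`) are taken from the code quoted in the review.

```python
import base64
import json
import pickle


def encode_record(fragment: object, task_id: str, fragment_index: int) -> str:
    """Serialize one fragment into a JSON checkpoint record (hypothetical helper)."""
    # pickle produces bytes; base64 makes them safe to embed in a JSON string.
    payload = base64.b64encode(pickle.dumps(fragment)).decode("ascii")
    return json.dumps(
        {
            "kind": "lance_write",
            "task_id": task_id,
            "fragment_index": fragment_index,
            "fragment": payload,
        }
    )


def decode_record(text: str) -> object:
    """Inverse of encode_record. pickle.loads is only safe on trusted input."""
    record = json.loads(text)
    return pickle.loads(base64.b64decode(record["fragment"]))


# A plain dict stands in for a real fragment metadata object (assumption).
text = encode_record({"fragment_id": 0, "num_rows": 3}, task_id="task-0", fragment_index=0)
restored = decode_record(text)
```

Because the record is plain JSON, it can be written to any filesystem-like store; the trade-off is that whoever can write to that store controls what `pickle.loads` later executes.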

Confidence Score: 3/5

The writer's two-phase commit is not fully crash-safe in append mode, and the checkpoint format relies on unpickling data read back from remote storage.

The append-mode double-commit window (a crash after on_write_complete succeeds but before the _COMMITTED marker is written) can silently duplicate rows in the dataset, with no error on retry. Persisting pickle payloads to remote storage meaningfully widens the attack surface compared with lance-ray's in-memory usage. Both issues sit in the new commit path, which is the writer's central correctness guarantee.

nemo_curator/stages/text/io/writer/lance.py and nemo_curator/utils/lance.py deserve the most attention — the commit_lance_checkpoint crash-recovery logic and checkpoint marker validation.

Important Files Changed

| Filename | Overview |
|---|---|
| nemo_curator/stages/text/io/writer/lance.py | New LanceWriter stage and commit_lance_checkpoint; has a double-commit bug for append mode and uses pickle.loads on persisted checkpoint data (security concern) |
| nemo_curator/utils/lance.py | New checkpoint helpers and Lance metadata utilities; _COMMITTED marker check skips dataset_path validation, creating a silent wrong-version return if commit_path is reused |
| nemo_curator/stages/text/io/reader/lance.py | New LanceReader composite stage with partitioning and fragment-level reads; version pinning, blob restoration, and scanner kwarg routing look correct |
| nemo_curator/stages/text/io/reader/base.py | Refactored to ReaderTask generic and added BaseFileReader; allow_empty flag and _stage_perf propagation are clean additions |
| pyproject.toml | Adds lance extra with lance-ray>=0.4; pylance (which provides the lance package) is only a transitive dependency and should be declared directly |
| tests/stages/text/io/writer/test_lance.py | Good writer tests covering blob preservation, checkpoint idempotency, and empty-batch handling; crash-recovery scenario for append mode is not tested |
| tests/stages/text/io/reader/test_lance.py | Comprehensive reader tests covering partitioning, blob restoration, version pinning, empty filters, and version conflict detection |
| nemo_curator/stages/text/io/writer/__init__.py | Exports LanceWriter and commit_lance_checkpoint alongside existing writers |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Pipeline
    participant LanceWriter
    participant lance_ray as lance_ray.write_fragment
    participant CheckpointFS as Checkpoint Storage
    participant Commit as commit_lance_checkpoint
    participant LanceDB as Lance Dataset

    Pipeline->>LanceWriter: process(DocumentBatch)
    LanceWriter->>lance_ray: write_fragment([table], path, schema)
    lance_ray-->>LanceWriter: [(LanceFragment, schema), ...]
    LanceWriter->>LanceWriter: pickle+base64 encode each fragment
    LanceWriter->>CheckpointFS: write_lance_checkpoint_record(record_id, JSON)
    LanceWriter-->>Pipeline: FileGroupTask(record_paths)

    Note over Pipeline,Commit: After all batches processed

    Pipeline->>Commit: commit_lance_checkpoint(path, commit_path)
    Commit->>CheckpointFS: read_lance_checkpoint (check _COMMITTED marker)
    CheckpointFS-->>Commit: records (or committed_version if marker exists)
    Commit->>Commit: decode fragments via pickle.loads
    Commit->>LanceDB: LanceFragmentCommitter.on_write_complete([payloads])
    LanceDB-->>Commit: version N
    Commit->>CheckpointFS: "write_lance_checkpoint_marker(version=N)"
    Commit-->>Pipeline: version N

Reviews (1): Last reviewed commit: "Add Lance writer stage"

Comment on lines +159 to +186
def commit_lance_checkpoint(
    path: str,
    commit_path: str,
    *,
    storage_options: dict[str, Any] | None = None,
    checkpoint_storage_options: dict[str, Any] | None = None,
) -> int:
    """Commit records written by ``LanceWriter`` and return the Lance version."""
    import lance
    from lance_ray import LanceFragmentCommitter

    records, committed_version = read_lance_checkpoint(commit_path, "lance_write", checkpoint_storage_options)
    if committed_version is not None:
        return committed_version

    _validate_checkpoint_path(records, path)
    mode = str(_single_checkpoint_value(records, "mode", "write mode"))
    fragments = _decode_write_fragments(records)
    schema = fragments[0][1]

    committer = LanceFragmentCommitter(path, schema=schema, mode=mode, storage_options=storage_options)
    if mode == "append":
        committer.on_write_start(schema)
    fragment_payloads = [(pickle.dumps(fragment), pickle.dumps(schema)) for fragment, schema in fragments]
    committer.on_write_complete([fragment_payloads])
    version = lance.dataset(path, storage_options=storage_options).version
    write_lance_checkpoint_marker(commit_path, version, checkpoint_storage_options)
    return version

P1 Potential double-commit for "append" mode on retry

If the process crashes after committer.on_write_complete succeeds but before write_lance_checkpoint_marker writes the _COMMITTED marker, a subsequent retry will call commit_lance_checkpoint again. Because the marker is absent, read_lance_checkpoint returns the original records. For "append" mode, on_write_start(schema) is called again — this time it reads the already-incremented dataset version as the new read_version — and on_write_complete commits the same fragments a second time, appending duplicate rows. "create" and "overwrite" modes are likely idempotent or will error; append is not.

A safer pattern is to read and store the post-commit version atomically with the commit (e.g. pass it through LanceFragmentCommitter), or to check whether the committed version's history already contains the expected fragments before issuing a second on_write_complete.
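
One way to picture the second suggestion (checking history before committing again) is the sketch below. It is only an illustration under stated assumptions: `FakeDataset`, `commit_id_for`, and `idempotent_append` are hypothetical names, the dataset is an in-memory stand-in rather than the Lance or lance-ray API, and a real implementation would need some way to record or recover an equivalent commit identity from the dataset's version history.

```python
import hashlib


class FakeDataset:
    """In-memory stand-in for a versioned dataset (hypothetical, not the Lance API)."""

    def __init__(self) -> None:
        # One entry per committed version: (commit_id, fragment ids).
        self.history: list[tuple[str, list[str]]] = []

    @property
    def version(self) -> int:
        return len(self.history)

    def append(self, commit_id: str, fragments: list[str]) -> int:
        self.history.append((commit_id, list(fragments)))
        return self.version


def commit_id_for(fragments: list[str]) -> str:
    """Derive a stable identity for a set of staged fragments."""
    return hashlib.sha256("\n".join(sorted(fragments)).encode("utf-8")).hexdigest()


def idempotent_append(dataset: FakeDataset, fragments: list[str]) -> int:
    """Append fragments unless an earlier version already committed the same set."""
    commit_id = commit_id_for(fragments)
    for version, (seen_id, _) in enumerate(dataset.history, start=1):
        if seen_id == commit_id:
            # A previous attempt already landed; return its version instead of re-appending.
            return version
    return dataset.append(commit_id, fragments)


dataset = FakeDataset()
first = idempotent_append(dataset, ["frag-a", "frag-b"])
# Simulates a retry after a crash that happened before the marker was written.
retry = idempotent_append(dataset, ["frag-a", "frag-b"])
```

With this guard the retry returns the version of the original commit and the dataset does not grow, which is the behavior the marker alone cannot guarantee in append mode.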

Comment on lines +148 to +156
def _decode_write_fragments(records: list[dict[str, Any]]) -> list[tuple[object, pa.Schema]]:
    from lance.schema import json_to_schema

    return [
        (pickle.loads(base64.b64decode(record["fragment"])), json_to_schema(record["schema"]))  # noqa: S301
        for record in sorted(
            records, key=lambda record: (str(record.get("task_id", "")), record.get("fragment_index", 0))
        )
    ]

P1 Deserialized pickle data originates from persistent storage

_decode_write_fragments calls pickle.loads on base64-decoded bytes read from JSON files under commit_path. The # noqa: S301 suppresses the linter warning, but the underlying concern is real: unlike lance-ray's use of pickle for in-memory Ray object passing, these payloads are persisted to disk or remote object storage between process invocations. If an attacker gains write access to commit_path (a realistic threat when commit_path is an S3/GCS prefix shared across a job), they can replace any *.json checkpoint file with a maliciously crafted pickle payload, and commit_lance_checkpoint will execute arbitrary code.

Consider storing the fragment metadata using Lance's own JSON serialization (lance.LanceFragment.metadata() / json_to_schema) rather than pickle, or adding an HMAC over the file contents to detect tampering.
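
The HMAC option could look roughly like the sketch below, which uses only the Python standard library. The helper names `sign_record` and `verify_record`, the envelope layout, and the way the key is supplied are all assumptions for illustration; the PR does not define them. Verification must happen before any pickle payload in the record is decoded.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> str:
    """Wrap a checkpoint record in an envelope carrying an HMAC-SHA256 tag (hypothetical helper)."""
    body = json.dumps(record, sort_keys=True)
    tag = hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return json.dumps({"body": body, "hmac": tag})


def verify_record(text: str, key: bytes) -> dict:
    """Return the record only if its tag matches; raise ValueError otherwise."""
    envelope = json.loads(text)
    expected = hmac.new(key, envelope["body"].encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    if not hmac.compare_digest(expected, envelope["hmac"]):
        raise ValueError("checkpoint record failed HMAC verification")
    return json.loads(envelope["body"])


key = b"example-key-distributed-out-of-band"  # assumption: shared secret not stored under commit_path
signed = sign_record({"kind": "lance_write", "mode": "append", "fragment": "AAAA"}, key)
record = verify_record(signed, key)
```

This only helps if the key lives somewhere the attacker cannot write; storing it next to the checkpoint files would defeat the purpose.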

Comment on lines +73 to +93
def read_lance_checkpoint(
    commit_path: str,
    kind: str,
    storage_options: dict[str, Any] | None = None,
) -> tuple[list[dict[str, Any]], int | None]:
    fs, fs_path = _checkpoint_fs_path(commit_path, storage_options)
    marker_path = _checkpoint_path(fs_path, _COMMITTED_MARKER)
    if fs.exists(marker_path):
        with fs.open(marker_path) as stream:
            return [], int(json.loads(stream.read())["version"])

    records = []
    for record_path in sorted(fs.glob(_checkpoint_path(fs_path, _RECORDS_DIR, "*.json"))):
        with fs.open(record_path) as stream:
            record = json.loads(stream.read())
        if record.get("kind") == kind:
            records.append(record)
    if not records:
        msg = f"No {kind} checkpoint records found under {commit_path}"
        raise ValueError(msg)
    return records, None

P2 _COMMITTED marker check skips dataset_path validation

When the _COMMITTED marker exists, read_lance_checkpoint returns early with the stored version number and an empty records list. The caller (commit_lance_checkpoint) never reaches _validate_checkpoint_path, so there is no verification that the marker corresponds to the same path argument. If commit_path is accidentally reused for a different dataset (e.g. due to a misconfigured pipeline), commit_lance_checkpoint will silently return the wrong version for that dataset. Storing the dataset_path in the marker JSON and validating it on read would catch this class of misconfiguration.
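
A minimal sketch of that suggestion, using a local directory in place of the fsspec filesystem the PR uses. The function names `write_marker` and `read_marker` and the `dataset_path` key are hypothetical; only the `_COMMITTED` file name and the `version` key come from the quoted code.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def write_marker(commit_dir: Path, dataset_path: str, version: int) -> None:
    """Record both the committed version and the dataset it belongs to."""
    commit_dir.mkdir(parents=True, exist_ok=True)
    marker = {"dataset_path": dataset_path, "version": version}
    (commit_dir / "_COMMITTED").write_text(json.dumps(marker))


def read_marker(commit_dir: Path, dataset_path: str) -> Optional[int]:
    """Return the committed version, or None if no marker exists yet."""
    marker_file = commit_dir / "_COMMITTED"
    if not marker_file.exists():
        return None
    marker = json.loads(marker_file.read_text())
    if marker.get("dataset_path") != dataset_path:
        # The checkpoint directory was reused for a different dataset.
        msg = f"marker is for {marker.get('dataset_path')!r}, not {dataset_path!r}"
        raise ValueError(msg)
    return int(marker["version"])


with tempfile.TemporaryDirectory() as tmp:
    commit_dir = Path(tmp) / "commit"
    before = read_marker(commit_dir, "s3://bucket/dataset-a")
    write_marker(commit_dir, "s3://bucket/dataset-a", 7)
    after = read_marker(commit_dir, "s3://bucket/dataset-a")
```

Calling `read_marker` with a different dataset path against the same directory raises instead of returning version 7, which turns the silent misconfiguration into a visible error.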

Comment thread pyproject.toml
"sentence-transformers",
]

lance = ["lance-ray>=0.4"]

P2 lance (distributed as pylance) is imported directly in nemo_curator/stages/text/io/reader/lance.py and nemo_curator/stages/text/io/writer/lance.py, but only lance-ray is listed as a direct dependency. The pylance package is currently a transitive dep of lance-ray, but this will silently break if lance-ray ever changes its dependency graph. Declaring pylance explicitly pins the contract.

Suggested change
- lance = ["lance-ray>=0.4"]
+ lance = ["lance-ray>=0.4", "pylance"]

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
