Add Lance reader stage by VibhuJawa · Pull Request #2111 · NVIDIA-NeMo/Curator

VibhuJawa · 2026-06-24T21:07:17Z

Split from #2106. This is PR 1 of 3 in the Lance IO stack.

What changed:

Generalizes text IO readers with a task-oriented BaseReader plus BaseFileReader for existing file-path readers.
Adds Lance fragment partitioning and LanceReaderStage on top of BaseReader.
Emits Lance row/fragment metadata for downstream annotation writers and preserves Lance schema metadata for later writes.
Adds a reader-only lance extra using pylance.

Stack:

LanceReader (this PR)
LanceWriter on top of this branch
LanceAnnotationWriter on top of LanceWriter

Validation:

uv run --extra lance --group test pytest -q tests/stages/text/io/reader/test_lance.py tests/stages/text/io/reader/test_jsonl.py tests/stages/text/io/reader/test_parquet.py -k "not IdGenerator and not id_generation"
uv run ruff check nemo_curator/stages/text/io/reader/base.py nemo_curator/stages/text/io/reader/jsonl.py nemo_curator/stages/text/io/reader/parquet.py nemo_curator/stages/text/io/reader/lance.py nemo_curator/utils/lance.py tests/stages/text/io/reader/test_lance.py

Note: the full JSONL ID-generator reader test path was not used for this focused validation because the local untracked outputs/ benchmark artifacts make Ray runtime-env packaging exceed 512 MiB.

copy-pr-bot · 2026-06-24T21:07:21Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

VibhuJawa

Please fix these things

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

VibhuJawa · 2026-06-25T17:18:54Z

+            scanner_kwargs["columns"] = fields
+        return scanner_kwargs
+
+    def read_task(


Main class to review

greptile-apps · 2026-06-25T17:25:30Z

Greptile Summary

This PR introduces a LanceReader composite stage for reading Lance datasets into DocumentBatch objects, following the same partitioning-then-reading pattern used by the existing JSONL and Parquet readers. The base reader abstraction is generalized to support non-file-path task types via BaseReader[ReaderTask] and a new BaseFileReader shim that preserves backward compatibility for the JSONL and Parquet readers.

LancePartitioningStage enumerates fragments, pins a dataset version into each LanceReadTask, and fans out; LanceReaderStage consumes those tasks, optionally handling Lance blob-v2 column restoration and emitting row-address/fragment-id metadata columns.
BaseReader._validate_result is tightened from duck-typing to explicit isinstance checks, an allow_empty escape hatch is added, and _stage_perf is now propagated from incoming tasks to the output DocumentBatch.
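The partitioning step described above (enumerate fragments, chunk them by fragments_per_partition, pin one version into every task) can be sketched in plain Python. This is a minimal illustration, not the PR's code: `ReadTask` and `partition_fragments` are hypothetical stand-ins for LanceReadTask and the body of LancePartitioningStage.process().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadTask:
    """Hypothetical stand-in for LanceReadTask: fragment IDs plus a pinned version."""

    fragment_ids: tuple[int, ...]
    version: int


def partition_fragments(fragment_ids: list[int], version: int, fragments_per_partition: int) -> list[ReadTask]:
    """Chunk fragment IDs into tasks that all carry the same dataset version."""
    if fragments_per_partition < 1:
        msg = "fragments_per_partition must be at least 1"
        raise ValueError(msg)
    return [
        ReadTask(tuple(fragment_ids[start : start + fragments_per_partition]), version)
        for start in range(0, len(fragment_ids), fragments_per_partition)
    ]


# Five fragments, two per task: the last task holds the remainder.
tasks = partition_fragments([0, 1, 2, 3, 4], version=7, fragments_per_partition=2)
```

Because every task carries the same version, a reader that reopens the dataset later still sees the snapshot the partitioner enumerated, even if the dataset has been appended to in between.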

Confidence Score: 4/5

Safe to merge; all identified concerns are non-blocking and do not affect correctness for the documented use cases.

The version-pinning design is solid and well-tested. The key-parsing pattern in _dataset_kwargs / _scanner_kwargs (pop-then-catch-all) is functional but fragile to future extension. The Python-level row-address loop in utils/lance.py will be noticeably slow on large partitions. None of these affect correctness for the current feature scope.

Files to review most closely: nemo_curator/stages/text/io/reader/lance.py (read_kwargs key-parsing contract and silent empty-dataset path) and nemo_curator/utils/lance.py (vectorisation of fragment-ID extraction).

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/text/io/reader/lance.py | New file implementing LancePartitioningStage, LanceReaderStage, and LanceReader; contains the shared read_kwargs mutation design and key logic around blob-v2 restoration and version pinning. |
| nemo_curator/stages/text/io/reader/base.py | Refactored to generalize BaseReader over a ReaderTask TypeVar; adds ReaderOutput, BaseFileReader shim, allow_empty flag, and _stage_perf propagation – backward-compatible changes. |
| nemo_curator/utils/lance.py | New utility with Python-level row-address bit-shifting to derive fragment IDs; efficient approach would use PyArrow compute ops instead of to_pylist() loop. |
| nemo_curator/stages/text/io/reader/jsonl.py | Minimal refactor: JsonlReaderStage now extends BaseFileReader instead of BaseReader; read_data return type corrected to pd.DataFrame. |
| nemo_curator/stages/text/io/reader/parquet.py | Minimal refactor mirroring jsonl.py: ParquetReaderStage now extends BaseFileReader. |
| tests/stages/text/io/reader/test_lance.py | New test covering partitioning, blob restoration, version pinning, version conflict, and column overrides; good coverage but imports constants from reader module instead of utils. |
| pyproject.toml | Adds a new optional lance extra (pylance>=7) and includes it in the all group. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant U as User
    participant LR as LanceReader (CompositeStage)
    participant LP as LancePartitioningStage
    participant Lance as lance.dataset
    participant LRS as LanceReaderStage
    participant Utils as nemo_curator/utils/lance

    U->>LR: LanceReader(path, read_kwargs, ...)
    LR->>LR: decompose()
    LR-->>LP: LancePartitioningStage(path, read_kwargs)
    LR-->>LRS: LanceReaderStage(path, fields, read_kwargs)

    note over LP: Fan-out stage
    LP->>Lance: "lance.dataset(path, **dataset_kwargs)"
    Lance-->>LP: dataset (version N)
    LP->>LP: enumerate fragments, chunk by fragments_per_partition
    LP-->>LRS: "LanceReadTask(frag_ids, version=N) x M"

    note over LRS: Per-task execution
    LRS->>Lance: "lance.dataset(path, version=N)"
    Lance-->>LRS: versioned dataset
    LRS->>LRS: detect blob-v2 columns
    LRS->>Lance: dataset.scanner(fragments, with_row_address).to_table()
    Lance-->>LRS: pa.Table with _rowaddr
    opt blob-v2 columns present
        LRS->>Lance: "dataset.read_blobs(column, addresses, preserve_order=True)"
        Lance-->>LRS: blob payloads
        LRS->>LRS: table.set_column(blob_array)
    end
    opt include_lance_metadata
        LRS->>Utils: add_lance_metadata_columns(table)
        Utils->>Utils: rename _rowaddr to __lance_rowaddr
        Utils->>Utils: shift_right rowaddrs to get __lance_fragid
        Utils-->>LRS: enriched pa.Table
    end
    LRS->>LRS: attach schema JSON to metadata
    LRS-->>U: "DocumentBatch(data=pa.Table, _metadata={lance:{schema,version,...}})"


Comments Outside Diff (2)

nemo_curator/stages/text/io/reader/lance.py, line 307-338 (link)

Silent empty-dataset case in LancePartitioningStage.process()

When the Lance dataset has zero fragments (e.g. the path is correct but the dataset was just created or compacted away), available_fragments is empty, the for loop produces zero iterations, and process() returns [] without any log warning or error. Downstream pipeline stages then produce no output and users have no signal that the source dataset was empty. A logger.warning when both self.fragment_ids is None and available_fragments is empty would surface this at no additional cost.
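A minimal sketch of the suggested warning, assuming a standard-library logger rather than whatever logger the codebase uses. The function name `warn_if_empty` and its boolean return value are illustrative and not part of the PR.

```python
import logging
from typing import Optional

logger = logging.getLogger("lance_partitioning_sketch")


def warn_if_empty(path: str, fragment_ids: Optional[list[int]], available_fragments: list[int]) -> bool:
    """Log a warning when no fragment filter was given and the dataset has no fragments.

    Returns True when the warning fired, so callers and tests can observe it.
    """
    if fragment_ids is None and not available_fragments:
        logger.warning("Lance dataset at %s has no fragments; no read tasks will be emitted", path)
        return True
    return False
```

Restricting the warning to the case where fragment_ids is None keeps an explicit fragment filter from triggering it; a filter that matches nothing is a separate condition that may deserve its own message.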
nemo_curator/stages/text/io/reader/lance.py, line 383-393 (link)

_restore_blob_v2_columns loads all row-addresses into a Python list

table["_rowaddr"].combine_chunks().to_pylist() materialises every row address as a Python int before calling dataset.read_blobs. For a partition with hundreds of thousands of rows this defeats the purpose of streaming and can spike resident memory. If dataset.read_blobs accepts a PyArrow array directly, passing the array (or a slice of it) would avoid the Python-level materialisation.

Reviews (1): Last reviewed commit: "Refine Lance reader option handling"

greptile-apps · 2026-06-25T17:25:34Z

+def lance_fragment_ids_from_row_addresses(rowaddr_column: pa.ChunkedArray) -> pa.Array:
+    rowaddrs = rowaddr_column.combine_chunks().cast(pa.uint64())
+    return pa.array([int(value) >> 32 for value in rowaddrs.to_pylist()], type=pa.uint64())


The fragment-ID extraction converts the entire column to a Python list and performs the bit-shift in a CPython loop. For a table with millions of rows this can add measurable latency. PyArrow's compute layer performs the same shift vectorised in C++ without materialising a Python list.

Suggested change

-def lance_fragment_ids_from_row_addresses(rowaddr_column: pa.ChunkedArray) -> pa.Array:
-    rowaddrs = rowaddr_column.combine_chunks().cast(pa.uint64())
-    return pa.array([int(value) >> 32 for value in rowaddrs.to_pylist()], type=pa.uint64())
+def lance_fragment_ids_from_row_addresses(rowaddr_column: pa.ChunkedArray) -> pa.Array:
+    import pyarrow.compute as pc
+
+    rowaddrs = rowaddr_column.combine_chunks().cast(pa.uint64())
+    return pc.shift_right(rowaddrs, 32).cast(pa.uint64())

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
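The arithmetic behind both versions can be checked with plain integers. The sketch below assumes a 64-bit row address whose upper 32 bits are the fragment ID, which is what the shift in this thread relies on. Treating the lower 32 bits as the row offset within the fragment is an assumption about the layout, and the helper names are illustrative.

```python
FRAGMENT_SHIFT = 32
ROW_OFFSET_MASK = (1 << FRAGMENT_SHIFT) - 1


def make_row_address(fragment_id: int, row_offset: int) -> int:
    """Pack a fragment ID into the upper 32 bits and a row offset into the lower 32."""
    return (fragment_id << FRAGMENT_SHIFT) | (row_offset & ROW_OFFSET_MASK)


def split_row_address(row_address: int) -> tuple[int, int]:
    """Recover (fragment_id, row_offset); the shift is the same one the reader applies."""
    return row_address >> FRAGMENT_SHIFT, row_address & ROW_OFFSET_MASK


addresses = [make_row_address(3, 0), make_row_address(3, 41), make_row_address(12, 5)]
fragment_ids = [split_row_address(address)[0] for address in addresses]
```

A vectorised implementation performs this same shift on every element; only where the loop runs changes, not the result.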

greptile-apps · 2026-06-25T17:25:35Z

+    def decompose(self) -> list[ProcessingStage]:
+        if self.task_type != "document":
+            msg = f"Converting DocumentBatch to {self.task_type} is not supported yet."
+            raise NotImplementedError(msg)
+
+        return [
+            LancePartitioningStage(
+                path=self.path,
+                fragments_per_partition=self.fragments_per_partition,
+                fragment_ids=self.fragment_ids,
+                read_kwargs=self.read_kwargs,
+            ),
+            LanceReaderStage(
+                path=self.path,
+                fields=self.fields,
+                read_kwargs=self.read_kwargs,
+                include_lance_metadata=self.include_lance_metadata,
+            ),


Shared read_kwargs dict passed to both sub-stages

decompose() passes self.read_kwargs (a single dict object) to both LancePartitioningStage and LanceReaderStage. Each stage copies the dict in its own __post_init__, so the current code is safe. The risk is in LanceReaderStage._dataset_kwargs / _scanner_kwargs, which pop() from their local copy of read_kwargs in a specific order: dataset keys are consumed first, then scanner keys, then all remaining keys are forwarded to the scanner via scanner_kwargs.update(read_kwargs). If a future caller or subclass omits that local dict(read_kwargs or {}) copy in read_task, or if new dataset-level keys are added without a corresponding pop() in _dataset_kwargs, those keys will silently leak into the scanner and produce confusing Lance errors. Documenting the key-parsing contract (which keys go where) or asserting on unrecognised keys after both _dataset_kwargs and _scanner_kwargs have run would make this boundary explicit.
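One way to make the boundary explicit is to split the options against declared key sets and reject anything left over. This is a stricter variant than the behaviour described above, where leftover keys are forwarded to the scanner. The key sets here are illustrative assumptions, not the PR's actual contract, and `split_read_kwargs` is a hypothetical helper.

```python
from typing import Any, Optional

# Illustrative key sets only; the real contract lives in _dataset_kwargs / _scanner_kwargs.
DATASET_KEYS = frozenset({"version", "storage_options"})
SCANNER_KEYS = frozenset({"columns", "filter", "batch_size"})


def split_read_kwargs(read_kwargs: Optional[dict[str, Any]]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split options into (dataset_kwargs, scanner_kwargs) and reject unknown keys."""
    remaining = dict(read_kwargs or {})  # local copy so the caller's dict is never mutated
    dataset_kwargs = {key: remaining.pop(key) for key in list(remaining) if key in DATASET_KEYS}
    scanner_kwargs = {key: remaining.pop(key) for key in list(remaining) if key in SCANNER_KEYS}
    if remaining:
        msg = f"Unrecognised read_kwargs keys: {sorted(remaining)}"
        raise ValueError(msg)
    return dataset_kwargs, scanner_kwargs
```

With this shape a misspelled key fails at the split instead of surfacing later as a scanner error, at the cost of having to extend the key sets whenever a new option is supported.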

sarahyurick

Hi, just walking through the PR and leaving some minor comments for now. Will try to do more of a deep dive soon.

sarahyurick · 2026-06-26T17:24:18Z

-            msg = f"No data read from files in task {task.task_id}"
-            raise ValueError(msg)
+    def _effective_read_kwargs(self) -> dict[str, Any]:
+        return dict(self.read_kwargs or {})


Nit but I don't see a reason for having a 1 line helper function.

sarahyurick · 2026-06-26T17:24:30Z

        )

+    def _output_metadata(self, task: ReaderTask, _output: ReaderOutput) -> dict[str, Any]:
+        return task._metadata


Same comment as above.

sarahyurick · 2026-06-26T17:27:13Z

+        return dataset_kwargs
+
+    def process(self, _: EmptyTask) -> list[LanceReadTask]:
+        import lance


Top-level import?

sarahyurick · 2026-06-26T17:28:05Z

+        allow_empty: Whether filtered reads may return empty tables without raising.
+    """
+
+    path: str = ""


Make it a hard requirement instead of having a check in the post init:

Suggested change

-    path: str = ""
+    path: str

sarahyurick · 2026-06-26T17:29:08Z

+        return output.metadata if output.metadata is not None else task._metadata
+
+    def _restore_blob_v2_columns(self, dataset: object, table: pa.Table, blob_columns: list[str]) -> pa.Table:
+        import lance


Same here, should it be a top-level import? We can lazy-load the LanceReader so that lance is not a hard dependency.
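A sketch of what lazy-loading could look like, using a module-level __getattr__ (PEP 562) in the package's __init__.py. The helper `make_lazy_getattr` is hypothetical, and the commented usage line assumes LanceReader is exported from the lance.py module named in this PR.

```python
import importlib
from typing import Any, Callable


def make_lazy_getattr(package: str, lazy_attrs: dict[str, str]) -> Callable[[str], Any]:
    """Build a PEP 562 module-level __getattr__ that imports submodules on first access."""

    def __getattr__(name: str) -> Any:
        if name not in lazy_attrs:
            msg = f"module {package!r} has no attribute {name!r}"
            raise AttributeError(msg)
        # The submodule, and any top-level "import lance" inside it, loads only here.
        module = importlib.import_module(lazy_attrs[name])
        return getattr(module, name)

    return __getattr__


# In a package __init__.py this might read (illustrative):
# __getattr__ = make_lazy_getattr(__name__, {"LanceReader": "nemo_curator.stages.text.io.reader.lance"})
```

With this in place lance.py can import lance at the top of the file, and users who never touch LanceReader never trigger that import.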

sarahyurick · 2026-06-26T17:30:44Z

+        for column in blob_columns:
+            payloads = [
+                payload
+                for _, payload in dataset.read_blobs(column, addresses=rowaddrs, preserve_order=True)  # type: ignore[attr-defined]


General question does # type: ignore[attr-defined] matter for the codebase? Like will it break without it?

sarahyurick · 2026-06-26T17:31:26Z

+        import lance
+        from lance.schema import schema_to_json


Similar comment as above.

sarahyurick · 2026-06-26T17:34:30Z

+)
+from nemo_curator.tasks import EmptyTask
+
+pytest.importorskip("lance")


Should we create a @pytest.mark.lance or something instead? I kinda worry about using importorskip because unless someone is explicitly checking the relevant CI job then it could just silently skip or something.
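One way to address the silent-skip concern is to let CI opt in to failing when the dependency is missing. The sketch below isolates that decision in a plain function. The environment variable name CURATOR_REQUIRE_LANCE and the function itself are hypothetical, and wiring it into a pytest marker or conftest hook is left out.

```python
import importlib.util
import os


def missing_dependency_action(module_name: str, require_env_var: str) -> str:
    """Return 'run', 'skip', or 'fail' for tests that need an optional top-level module."""
    if importlib.util.find_spec(module_name) is not None:
        return "run"
    # A CI job that is supposed to exercise the optional tests sets the variable to "1".
    return "fail" if os.environ.get(require_env_var) == "1" else "skip"


action = missing_dependency_action("lance", "CURATOR_REQUIRE_LANCE")
```

A collection hook could call this for tests carrying the marker, skipping locally while failing loudly in the CI job that is expected to have lance installed.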

This was referenced Jun 24, 2026

Add Lance writer stage #2112

Open

Add Lance annotation writer stage #2113

Draft

Add Lance reader and writer stages #2106

Closed

VibhuJawa commented Jun 24, 2026


VibhuJawa force-pushed the feat/lance-reader branch 5 times, most recently from b5f9ac2 to 0c5f4df Compare June 24, 2026 22:16

Add Lance reader stage

944dc51

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

VibhuJawa force-pushed the feat/lance-reader branch 7 times, most recently from daf8b22 to 9a39e0d Compare June 24, 2026 23:38

Refine Lance reader option handling

bc20753

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

VibhuJawa force-pushed the feat/lance-reader branch from 9a39e0d to bc20753 Compare June 24, 2026 23:41

VibhuJawa marked this pull request as ready for review June 25, 2026 17:17

VibhuJawa requested review from a team as code owners June 25, 2026 17:17

VibhuJawa requested review from weijiac0619 and removed request for a team June 25, 2026 17:17

VibhuJawa commented Jun 25, 2026


greptile-apps Bot reviewed Jun 25, 2026


VibhuJawa requested review from ayushdg and sarahyurick June 25, 2026 21:09

sarahyurick reviewed Jun 26, 2026

