Add Lance reader stage #2111

Open
VibhuJawa wants to merge 2 commits into NVIDIA-NeMo:main from VibhuJawa:feat/lance-reader

Conversation

@VibhuJawa

Contributor

Split from #2106. This is PR 1 of 3 in the Lance IO stack.

What changed:

  • Generalizes text IO readers with a task-oriented BaseReader plus BaseFileReader for existing file-path readers.
  • Adds Lance fragment partitioning and LanceReaderStage on top of BaseReader.
  • Emits Lance row/fragment metadata for downstream annotation writers and preserves Lance schema metadata for later writes.
  • Adds a reader-only lance extra using pylance.

Stack:

  1. LanceReader (this PR)
  2. LanceWriter on top of this branch
  3. LanceAnnotationWriter on top of LanceWriter

Validation:

  • uv run --extra lance --group test pytest -q tests/stages/text/io/reader/test_lance.py tests/stages/text/io/reader/test_jsonl.py tests/stages/text/io/reader/test_parquet.py -k "not IdGenerator and not id_generation"
  • uv run ruff check nemo_curator/stages/text/io/reader/base.py nemo_curator/stages/text/io/reader/jsonl.py nemo_curator/stages/text/io/reader/parquet.py nemo_curator/stages/text/io/reader/lance.py nemo_curator/utils/lance.py tests/stages/text/io/reader/test_lance.py

Note: the full JSONL ID-generator reader test path was not used for this focused validation because the local untracked outputs/ benchmark artifacts make Ray runtime-env packaging exceed 512 MiB.

@copy-pr-bot

copy-pr-bot Bot commented Jun 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@VibhuJawa VibhuJawa left a comment

Contributor Author

Please fix these things

Comment thread nemo_curator/stages/text/io/reader/base.py Outdated
Comment thread nemo_curator/stages/text/io/reader/base.py Outdated
Comment thread nemo_curator/stages/text/io/reader/base.py Outdated
Comment thread nemo_curator/stages/text/io/reader/lance.py Outdated
Comment thread nemo_curator/stages/text/io/reader/lance.py Outdated
Comment thread nemo_curator/stages/text/io/reader/lance.py Outdated
Comment thread nemo_curator/stages/text/io/reader/lance.py
@VibhuJawa VibhuJawa force-pushed the feat/lance-reader branch 5 times, most recently from b5f9ac2 to 0c5f4df Compare June 24, 2026 22:16
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@VibhuJawa VibhuJawa force-pushed the feat/lance-reader branch 7 times, most recently from daf8b22 to 9a39e0d Compare June 24, 2026 23:38
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
@VibhuJawa VibhuJawa force-pushed the feat/lance-reader branch from 9a39e0d to bc20753 Compare June 24, 2026 23:41
@VibhuJawa VibhuJawa marked this pull request as ready for review June 25, 2026 17:17
@VibhuJawa VibhuJawa requested review from a team as code owners June 25, 2026 17:17
@VibhuJawa VibhuJawa requested review from weijiac0619 and removed request for a team June 25, 2026 17:17
        scanner_kwargs["columns"] = fields
        return scanner_kwargs

    def read_task(

Contributor Author

Main class to review

@greptile-apps

greptile-apps Bot commented Jun 25, 2026

Contributor

Greptile Summary

This PR introduces a LanceReader composite stage for reading Lance datasets into DocumentBatch objects, following the same partitioning-then-reading pattern used by the existing JSONL and Parquet readers. The base reader abstraction is generalized to support non-file-path task types via BaseReader[ReaderTask] and a new BaseFileReader shim that preserves backward compatibility for the JSONL and Parquet readers.

  • LancePartitioningStage enumerates fragments, pins a dataset version into each LanceReadTask, and fans out; LanceReaderStage consumes those tasks, optionally handling Lance blob-v2 column restoration and emitting row-address/fragment-id metadata columns.
  • BaseReader._validate_result is tightened from duck-typing to explicit isinstance checks, an allow_empty escape hatch is added, and _stage_perf is now propagated from incoming tasks to the output DocumentBatch.

Confidence Score: 4/5

Safe to merge; all identified concerns are non-blocking and do not affect correctness for the documented use cases.

The version-pinning design is solid and well-tested. The key-parsing pattern in _dataset_kwargs / _scanner_kwargs (pop-then-catch-all) is functional but fragile to future extension. The Python-level row-address loop in utils/lance.py will be noticeably slow on large partitions. None of these affect correctness for the current feature scope.

Files needing the most attention: nemo_curator/stages/text/io/reader/lance.py (read_kwargs key-parsing contract and silent empty-dataset path) and nemo_curator/utils/lance.py (vectorisation of fragment-ID extraction).

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/text/io/reader/lance.py | New file implementing LancePartitioningStage, LanceReaderStage, and LanceReader; contains the shared read_kwargs mutation design and key logic around blob-v2 restoration and version pinning. |
| nemo_curator/stages/text/io/reader/base.py | Refactored to generalize BaseReader over a ReaderTask TypeVar; adds ReaderOutput, BaseFileReader shim, allow_empty flag, and _stage_perf propagation. The changes are backward-compatible. |
| nemo_curator/utils/lance.py | New utility with Python-level row-address bit-shifting to derive fragment IDs; a more efficient approach would use PyArrow compute ops instead of a to_pylist() loop. |
| nemo_curator/stages/text/io/reader/jsonl.py | Minimal refactor: JsonlReaderStage now extends BaseFileReader instead of BaseReader; read_data return type corrected to pd.DataFrame. |
| nemo_curator/stages/text/io/reader/parquet.py | Minimal refactor mirroring jsonl.py: ParquetReaderStage now extends BaseFileReader. |
| tests/stages/text/io/reader/test_lance.py | New test covering partitioning, blob restoration, version pinning, version conflict, and column overrides; good coverage, but it imports constants from the reader module instead of utils. |
| pyproject.toml | Adds a new optional lance extra (pylance>=7) and includes it in the all group. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant U as User
    participant LR as LanceReader (CompositeStage)
    participant LP as LancePartitioningStage
    participant Lance as lance.dataset
    participant LRS as LanceReaderStage
    participant Utils as nemo_curator/utils/lance

    U->>LR: LanceReader(path, read_kwargs, ...)
    LR->>LR: decompose()
    LR-->>LP: LancePartitioningStage(path, read_kwargs)
    LR-->>LRS: LanceReaderStage(path, fields, read_kwargs)

    note over LP: Fan-out stage
    LP->>Lance: "lance.dataset(path, **dataset_kwargs)"
    Lance-->>LP: dataset (version N)
    LP->>LP: enumerate fragments, chunk by fragments_per_partition
    LP-->>LRS: "LanceReadTask(frag_ids, version=N) x M"

    note over LRS: Per-task execution
    LRS->>Lance: "lance.dataset(path, version=N)"
    Lance-->>LRS: versioned dataset
    LRS->>LRS: detect blob-v2 columns
    LRS->>Lance: dataset.scanner(fragments, with_row_address).to_table()
    Lance-->>LRS: pa.Table with _rowaddr
    opt blob-v2 columns present
        LRS->>Lance: "dataset.read_blobs(column, addresses, preserve_order=True)"
        Lance-->>LRS: blob payloads
        LRS->>LRS: table.set_column(blob_array)
    end
    opt include_lance_metadata
        LRS->>Utils: add_lance_metadata_columns(table)
        Utils->>Utils: rename _rowaddr to __lance_rowaddr
        Utils->>Utils: shift_right rowaddrs to get __lance_fragid
        Utils-->>LRS: enriched pa.Table
    end
    LRS->>LRS: attach schema JSON to metadata
    LRS-->>U: "DocumentBatch(data=pa.Table, _metadata={lance:{schema,version,...}})"

Comments Outside Diff (2)

  1. nemo_curator/stages/text/io/reader/lance.py, lines 307-338

    P2 Silent empty-dataset case in LancePartitioningStage.process()

    When the Lance dataset has zero fragments (e.g. the path is correct but the dataset was just created or compacted away), available_fragments is empty, the for loop produces zero iterations, and process() returns [] without any log warning or error. Downstream pipeline stages then produce no output and users have no signal that the source dataset was empty. A logger.warning when both self.fragment_ids is None and available_fragments is empty would surface this at no additional cost.

  2. nemo_curator/stages/text/io/reader/lance.py, lines 383-393

    P2 _restore_blob_v2_columns loads all row-addresses into a Python list

    table["_rowaddr"].combine_chunks().to_pylist() materialises every row address as a Python int before calling dataset.read_blobs. For a partition with hundreds of thousands of rows this defeats the purpose of streaming and can spike resident memory. If lance.read_blobs accepts a PyArrow array directly, passing the array (or a slice of it) would avoid the Python-level materialisation.
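
The warning proposed in the first comment above could look roughly like the following. This is a minimal sketch, not the PR's code: `chunk_fragment_ids` is a hypothetical helper, and only the names `fragments_per_partition` and `available_fragments` come from the discussion.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


def chunk_fragment_ids(
    available_fragments: list[int],
    fragments_per_partition: int,
    requested_fragment_ids: list[int] | None = None,
) -> list[list[int]]:
    """Hypothetical helper: group fragment IDs into partitions of a fixed size."""
    if fragments_per_partition < 1:
        msg = "fragments_per_partition must be >= 1"
        raise ValueError(msg)
    if requested_fragment_ids is None and not available_fragments:
        # Surface the otherwise silent case: a valid path with zero fragments.
        logger.warning("Lance dataset has no fragments; no read tasks will be emitted.")
        return []
    selected = available_fragments if requested_fragment_ids is None else requested_fragment_ids
    return [
        selected[start : start + fragments_per_partition]
        for start in range(0, len(selected), fragments_per_partition)
    ]
```

The function still returns an empty list for an empty dataset, so downstream behaviour is unchanged; the only difference is that the log now records why nothing was produced.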

Reviews (1): Last reviewed commit: "Refine Lance reader option handling"

Comment on lines +21 to +23
def lance_fragment_ids_from_row_addresses(rowaddr_column: pa.ChunkedArray) -> pa.Array:
    rowaddrs = rowaddr_column.combine_chunks().cast(pa.uint64())
    return pa.array([int(value) >> 32 for value in rowaddrs.to_pylist()], type=pa.uint64())

Contributor

P2 The fragment-ID extraction converts the entire column to a Python list and performs the bit-shift in a CPython loop. For a table with millions of rows this can add measurable latency. PyArrow's compute layer performs the same shift vectorised in C++ without materialising a Python list.

Suggested change
- def lance_fragment_ids_from_row_addresses(rowaddr_column: pa.ChunkedArray) -> pa.Array:
-     rowaddrs = rowaddr_column.combine_chunks().cast(pa.uint64())
-     return pa.array([int(value) >> 32 for value in rowaddrs.to_pylist()], type=pa.uint64())
+ def lance_fragment_ids_from_row_addresses(rowaddr_column: pa.ChunkedArray) -> pa.Array:
+     import pyarrow.compute as pc
+     rowaddrs = rowaddr_column.combine_chunks().cast(pa.uint64())
+     return pc.shift_right(rowaddrs, 32).cast(pa.uint64())

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
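
For readers unfamiliar with why a right shift by 32 yields the fragment ID, the bit layout can be illustrated in plain Python. This sketch assumes the row address is a 64-bit value with the fragment ID in the upper 32 bits and a per-fragment row offset in the lower 32 bits; the upper half follows from the `>> 32` in the code above, while the meaning of the lower half is an assumption here. Both helper names are hypothetical.

```python
FRAGMENT_ID_SHIFT = 32
ROW_OFFSET_MASK = (1 << FRAGMENT_ID_SHIFT) - 1


def make_row_address(fragment_id: int, row_offset: int) -> int:
    """Pack a fragment ID and a row offset into one 64-bit row address (assumed layout)."""
    return (fragment_id << FRAGMENT_ID_SHIFT) | (row_offset & ROW_OFFSET_MASK)


def split_row_address(row_address: int) -> tuple[int, int]:
    """Recover (fragment_id, row_offset) from a packed row address."""
    return row_address >> FRAGMENT_ID_SHIFT, row_address & ROW_OFFSET_MASK


address = make_row_address(fragment_id=3, row_offset=7)
fragment_id, row_offset = split_row_address(address)
```

Whichever implementation is chosen, the per-value operation is the same shift; the suggestion only moves it from a Python loop into PyArrow's compute kernels.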

Comment on lines +293 to +310
    def decompose(self) -> list[ProcessingStage]:
        if self.task_type != "document":
            msg = f"Converting DocumentBatch to {self.task_type} is not supported yet."
            raise NotImplementedError(msg)

        return [
            LancePartitioningStage(
                path=self.path,
                fragments_per_partition=self.fragments_per_partition,
                fragment_ids=self.fragment_ids,
                read_kwargs=self.read_kwargs,
            ),
            LanceReaderStage(
                path=self.path,
                fields=self.fields,
                read_kwargs=self.read_kwargs,
                include_lance_metadata=self.include_lance_metadata,
            ),

Contributor

P2 Shared read_kwargs dict passed to both sub-stages

decompose() passes self.read_kwargs (a single dict object) to both LancePartitioningStage and LanceReaderStage. Each stage copies the dict in its own __post_init__, so the current code is safe. The risk is in LanceReaderStage._dataset_kwargs / _scanner_kwargs, which pop() from their local copy of read_kwargs in a specific order: dataset keys are consumed first, then scanner keys, then all remaining keys are forwarded to the scanner via scanner_kwargs.update(read_kwargs). If a future caller or subclass omits that local dict(read_kwargs or {}) copy in read_task, or if new dataset-level keys are added without a corresponding pop() in _dataset_kwargs, those keys will silently leak into the scanner and produce confusing Lance errors. Documenting the key-parsing contract (which keys go where) or asserting on unrecognised keys after both _dataset_kwargs and _scanner_kwargs have run would make this boundary explicit.
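
One way to make that boundary explicit is to split the options against allow-lists and reject anything left over. The sketch below is illustrative only: `split_read_kwargs` is a hypothetical function, and the two key sets are assumptions chosen for the example, not the PR's actual lists.

```python
from __future__ import annotations

from typing import Any

# Illustrative allow-lists (assumptions, not the key sets used in the PR).
DATASET_KEYS = frozenset({"version", "storage_options"})
SCANNER_KEYS = frozenset({"columns", "filter", "limit"})


def split_read_kwargs(read_kwargs: dict[str, Any] | None) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split user options into dataset-level and scanner-level kwargs."""
    remaining = dict(read_kwargs or {})  # local copy, so the caller's dict is never mutated
    dataset_kwargs = {key: remaining.pop(key) for key in list(remaining) if key in DATASET_KEYS}
    scanner_kwargs = {key: remaining.pop(key) for key in list(remaining) if key in SCANNER_KEYS}
    if remaining:
        # Fail loudly instead of forwarding unknown keys to the scanner.
        msg = f"Unrecognised read_kwargs keys: {sorted(remaining)}"
        raise ValueError(msg)
    return dataset_kwargs, scanner_kwargs
```

The trade-off is that a strict allow-list must be updated whenever a new Lance option should be supported, whereas the current catch-all forwards new scanner options without code changes.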

@VibhuJawa VibhuJawa requested review from ayushdg and sarahyurick June 25, 2026 21:09

@sarahyurick sarahyurick left a comment

Contributor

Hi, just walking through the PR and leaving some minor comments for now. Will try to do more of a deep dive soon.

            msg = f"No data read from files in task {task.task_id}"
            raise ValueError(msg)
    def _effective_read_kwargs(self) -> dict[str, Any]:
        return dict(self.read_kwargs or {})

Contributor

Nit, but I don't see a reason for having a one-line helper function.

        )

    def _output_metadata(self, task: ReaderTask, _output: ReaderOutput) -> dict[str, Any]:
        return task._metadata

Contributor

Same comment as above.

        return dataset_kwargs

    def process(self, _: EmptyTask) -> list[LanceReadTask]:
        import lance

Contributor

Top-level import?

        allow_empty: Whether filtered reads may return empty tables without raising.
    """

    path: str = ""

Contributor

Make it a hard requirement instead of having a check in the post init:

Suggested change
-     path: str = ""
+     path: str
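
The effect of dropping the default can be seen in a small standalone dataclass. This is a sketch under assumptions: `LanceReaderSketch` is a hypothetical stand-in, not the stage class from the PR, and it only borrows the field names `path` and `fragments_per_partition`.

```python
from dataclasses import dataclass


@dataclass
class LanceReaderSketch:
    """Hypothetical stand-in showing a required field with no default."""

    path: str  # required: omitting it fails at construction time
    fragments_per_partition: int = 1


reader = LanceReaderSketch(path="/data/example.lance")

try:
    LanceReaderSketch()  # type: ignore[call-arg]
except TypeError as exc:
    missing_path_error = str(exc)
```

One caveat: if a parent dataclass already declares fields with defaults, adding a field without a default in the subclass raises a TypeError when the class is defined, unless the dataclass uses kw_only=True (Python 3.10+). That may be why the original code used an empty-string default plus a post-init check.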

        return output.metadata if output.metadata is not None else task._metadata

    def _restore_blob_v2_columns(self, dataset: object, table: pa.Table, blob_columns: list[str]) -> pa.Table:
        import lance

Contributor

Same here, should it be a top-level import? We can lazy-load the LanceReader so that lance is not a hard dependency.
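
If the function-level imports are kept for now, one middle ground is a single helper that performs the import and raises an actionable error when the extra is missing. This is a sketch, not the project's convention: `require_optional_module` is a hypothetical name, and the wording of the error message is an assumption.

```python
import importlib
from types import ModuleType


def require_optional_module(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, pointing at the extra to install if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        msg = f"'{module_name}' is required for this stage; install the '{extra}' extra."
        raise ImportError(msg) from exc


# Usage inside a stage method would look like:
#     lance = require_optional_module("lance", extra="lance")
```

Lazy-loading the reader class at package level, as suggested above, would instead allow a normal top-level import inside the Lance module itself.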

        for column in blob_columns:
            payloads = [
                payload
                for _, payload in dataset.read_blobs(column, addresses=rowaddrs, preserve_order=True)  # type: ignore[attr-defined]

Contributor

General question: does # type: ignore[attr-defined] matter for the codebase? Like, will it break without it?
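
For context, the comment has no runtime effect. It only silences a static type checker, which would otherwise report that `object` (the annotation on `dataset`) has no `read_blobs` attribute. Whether removing it breaks anything depends on whether CI runs such a checker. A structural type is one way to drop the ignore; the sketch below assumes only what the snippet shows, namely that `read_blobs` yields pairs whose second element is the payload. `SupportsReadBlobs`, `collect_blob_payloads`, and `FakeDataset` are hypothetical names.

```python
from __future__ import annotations

from collections.abc import Iterable, Sequence
from typing import Any, Protocol


class SupportsReadBlobs(Protocol):
    """Hypothetical structural type for objects exposing read_blobs."""

    def read_blobs(
        self, column: str, *, addresses: Sequence[int], preserve_order: bool
    ) -> Iterable[tuple[Any, Any]]: ...


def collect_blob_payloads(dataset: SupportsReadBlobs, column: str, rowaddrs: Sequence[int]) -> list[Any]:
    # No "type: ignore" needed: the checker knows dataset has read_blobs.
    return [payload for _, payload in dataset.read_blobs(column, addresses=rowaddrs, preserve_order=True)]


class FakeDataset:
    """Stand-in used only to exercise the sketch; not a Lance dataset."""

    def read_blobs(self, column: str, *, addresses: Sequence[int], preserve_order: bool) -> list[tuple[int, bytes]]:
        return [(address, f"{column}-{address}".encode()) for address in addresses]


payloads = collect_blob_payloads(FakeDataset(), "image", [1, 2])
```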

Comment on lines +229 to +230
import lance
from lance.schema import schema_to_json

Contributor

Similar comment as above.

)
from nemo_curator.tasks import EmptyTask

pytest.importorskip("lance")

Contributor

Should we create a @pytest.mark.lance or something instead? I kinda worry about using importorskip because unless someone is explicitly checking the relevant CI job then it could just silently skip or something.
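
A marker-based approach could be sketched in a conftest.py along these lines. Everything here is an assumption for illustration: the `lance` marker name follows the suggestion above, and the `REQUIRE_LANCE_TESTS` environment variable is a hypothetical switch that a dedicated CI job could set so that a missing dependency fails the run instead of skipping silently.

```python
import importlib.util
import os


def optional_dependency_missing(module_name: str) -> bool:
    """Return True when the module cannot be found in the current environment."""
    return importlib.util.find_spec(module_name) is None


def pytest_configure(config):
    # Register the marker so that --strict-markers does not reject it.
    config.addinivalue_line("markers", "lance: tests that need the optional lance extra")


def pytest_collection_modifyitems(config, items):
    import pytest  # imported here so the helper above stays importable without pytest

    if not optional_dependency_missing("lance"):
        return
    if os.environ.get("REQUIRE_LANCE_TESTS") == "1":
        # Hypothetical CI switch: turn a silent skip into a hard failure.
        raise pytest.UsageError("lance is not installed but REQUIRE_LANCE_TESTS=1 was set")
    skip_lance = pytest.mark.skip(reason="lance extra not installed")
    for item in items:
        if item.get_closest_marker("lance") is not None:
            item.add_marker(skip_lance)
```

Tests would then carry `@pytest.mark.lance` (or a module-level `pytestmark`), and the Lance CI job could also select them with `-m lance`.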

2 participants