Add Lance reader and writer stages #2106
Closed
VibhuJawa wants to merge 16 commits into
@claude review
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
cb513a1 to cc51a67
Pull request overview
Adds first-pass Lance integration for Curator text IO, enabling fragment-partitioned reading into Arrow-backed DocumentBatch tasks, checkpointed writes/annotation updates, and driver-side commit helpers to publish staged writes.
Changes:
- Introduces `LanceReader` (partition + read stages) with support for filter/projection, blob v2 restoration, and optional Lance metadata columns.
- Adds `LanceWriter` and `LanceAnnotationWriter` plus checkpoint/commit helpers (`commit_lance_checkpoint`, `commit_lance_annotation_checkpoint`) to publish staged writes.
- Wires in a new optional `lance` extra (`lance-ray>=0.4`), updates CI/unit test env sync, and adds targeted pytest coverage.
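The staged-write flow summarized above (workers emit checkpoint records, the driver publishes them in a single commit) can be sketched with the standard library alone. Everything in this sketch is an assumption for illustration: the record layout, the `_COMMITTED` marker name, and the functions `write_checkpoint_record` and `commit_checkpoint` are hypothetical and are not the PR's `commit_lance_checkpoint` implementation. The commit step only collects payloads at the point where a real helper would publish to a Lance dataset.

```python
import json
from pathlib import Path

MARKER_NAME = "_COMMITTED"  # hypothetical marker file name, not taken from the PR


def write_checkpoint_record(checkpoint_dir: Path, task_id: str, payload: dict) -> Path:
    """Worker side: persist one record describing a staged write."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    record_path = checkpoint_dir / f"{task_id}.json"
    tmp_path = checkpoint_dir / f"{task_id}.json.tmp"
    tmp_path.write_text(json.dumps({"task_id": task_id, "payload": payload}))
    # Rename into place so the driver never reads a half-written record.
    tmp_path.replace(record_path)
    return record_path


def commit_checkpoint(checkpoint_dir: Path) -> list[dict]:
    """Driver side: gather all records and publish them once."""
    marker = checkpoint_dir / MARKER_NAME
    if marker.exists():
        # Already committed: return the recorded payloads instead of republishing.
        return json.loads(marker.read_text())
    records = [json.loads(p.read_text()) for p in sorted(checkpoint_dir.glob("*.json"))]
    payloads = [record["payload"] for record in records]
    # Assumption: a real helper would publish `payloads` to the dataset here.
    marker.write_text(json.dumps(payloads))
    return payloads
```

Writing the marker last is what makes a retried commit in this sketch a no-op; records added after the marker exists are ignored until a new checkpoint directory is used.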
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| uv.lock | Adds locked dependencies for lance-ray and its transitive requirements; registers lance extra and includes it in all. |
| pyproject.toml | Defines the new optional dependency extra lance and includes it under all. |
| tests/L0_Unit_Test_CPU.sh | Ensures CPU unit test env installs the lance extra. |
| nemo_curator/stages/text/io/reader/__init__.py | Exposes LanceReader from the text IO reader package. |
| nemo_curator/stages/text/io/writer/__init__.py | Exposes LanceWriter / LanceAnnotationWriter from the text IO writer package. |
| nemo_curator/stages/text/io/reader/lance.py | Implements Lance dataset partitioning by fragment and fragment-scoped scanning into DocumentBatch (with optional metadata + blob v2 restore). |
| nemo_curator/stages/text/io/writer/lance.py | Implements staged fragment writes and fragment-local annotation updates with checkpoint record emission. |
| nemo_curator/stages/text/io/lance_utils.py | Adds checkpoint record/marker read-write helpers and reserved Lance metadata column constants. |
| nemo_curator/stages/text/io/lance_commit.py | Adds driver-side commit helpers to publish checkpointed writes/updates to a Lance dataset. |
| tests/stages/text/io/reader/test_lance.py | Adds coverage for partitioning/version pinning, filter/projection behavior, blob v2, and metadata columns. |
| tests/stages/text/io/writer/test_lance.py | Adds coverage for checkpoint commits, blob v2 round trips, annotation prepare/update flows, and rejection of invalid update sets. |
Comment on lines +254 to +256:
`)`

`def _validate_unique_rowaddrs(table: pa.Table, fragment_id: int) -> None:`
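The body of `_validate_unique_rowaddrs` is not shown in this thread, so the following is only a sketch of what such a check could look like. It takes a plain list of integers rather than a `pa.Table`, and it relies on Lance's row-address convention of storing the fragment ID in the upper 32 bits and the row offset in the lower 32 bits. The fragment-membership check and the function name `validate_unique_rowaddrs` are assumptions, not the PR's code.

```python
from collections import Counter


def validate_unique_rowaddrs(rowaddrs: list[int], fragment_id: int) -> None:
    """Raise ValueError if an address repeats or points outside `fragment_id`."""
    # Upper 32 bits of a row address hold the fragment ID.
    foreign = [addr for addr in rowaddrs if addr >> 32 != fragment_id]
    if foreign:
        raise ValueError(
            f"{len(foreign)} row address(es) fall outside fragment {fragment_id}"
        )
    # Any address seen more than once would update the same row twice.
    duplicates = sorted(addr for addr, count in Counter(rowaddrs).items() if count > 1)
    if duplicates:
        raise ValueError(
            f"duplicate row addresses in fragment {fragment_id}: {duplicates}"
        )
```

Rejecting duplicates before a fragment-local update avoids an ambiguous result where two rows in the update set target the same stored row.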
VibhuJawa commented Jun 24, 2026
90c9c47 to 10c0de8
VibhuJawa left a comment:
Some more cleanup
`import lance`
`from lance.schema import schema_to_json`

`version = (task._metadata.get("lance") or {}).get("version")`
It's fine; let's just assume we will get the version from `task._metadata.get("lance")`.
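A small sketch of how the lookup in the quoted line behaves when the `"lance"` entry is present, missing, or explicitly `None`. `FakeTask` and `pinned_lance_version` are hypothetical names for illustration; only the expression itself comes from the diff.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class FakeTask:
    """Hypothetical stand-in for a Curator task; only `_metadata` is modeled."""

    _metadata: dict[str, Any] = field(default_factory=dict)


def pinned_lance_version(task: FakeTask) -> Optional[int]:
    # `or {}` covers both a missing "lance" key and an explicit None value,
    # so the chained .get() never runs on None.
    return (task._metadata.get("lance") or {}).get("version")
```

With this shape, a task that carries no Lance metadata yields `None`, which a caller could treat as "read the latest version".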
`table = dataset.scanner(**scanner_kwargs).to_table()`
`if table.num_rows == 0:`
`    return None`
`lance_schema = pa.schema([dataset.schema.field(name) for name in table.column_names if name in dataset.schema.names])`
Why don't we just get the schema from `dataset.schema`?
This was referenced Jun 24, 2026
Description
Adds a first-pass Lance integration for Curator text IO:
- `LanceReader` partitions Lance datasets by fragment and returns Arrow-backed `DocumentBatch`es.
- `LanceWriter` writes Curator batches to Lance and records checkpointed commit payloads.
- `LanceAnnotationWriter` appends/validates annotation columns with Lance `add_columns(...)`, writes fragment-local row updates, and records checkpointed commit payloads.
- `writer/lance.py` and publish checkpointed writes with `commit_lance_checkpoint(...)` and annotation-column updates with `commit_lance_annotation_checkpoint(...)`.
- `nemo_curator/utils/lance.py` and reuse Curator filesystem/hash helpers.
Usage
Append annotation columns back to an existing Lance dataset:
Testing
- `uv run ruff check nemo_curator/utils/lance.py nemo_curator/stages/text/io/reader/lance.py nemo_curator/stages/text/io/writer/lance.py nemo_curator/stages/text/io/writer/__init__.py tests/stages/text/io/reader/test_lance.py tests/stages/text/io/writer/test_lance.py pyproject.toml`
- `bash -n tests/L0_Unit_Test_CPU.sh`
- `uv run pytest tests/stages/text/io/reader/test_lance.py tests/stages/text/io/writer/test_lance.py -q`
Checklist