feat: add micro_commit_batch_size param in lance_compaction function #6141
huleilei wants to merge 1 commit into Eventual-Inc:main
Greptile Overview
Greptile Summary
This PR adds a new micro_commit_batch_size parameter to the Lance compaction helper. The change fits cleanly into the existing Lance connector surface area: it is an additive parameter on the compaction helper and is passed through to the underlying LanceDB/Lance optimize/compaction call, so users can tune commit batching behavior during compaction.
Confidence Score: 5/5
Sequence Diagram
sequenceDiagram
participant U as User
participant Py as daft.io.lance._lance
participant LC as daft.io.lance.lance_compaction
participant Lance as LanceDB/Lance
U->>Py: lance_compaction(..., micro_commit_batch_size=K)
Py->>LC: lance_compaction(..., micro_commit_batch_size=K)
LC->>Lance: optimize/compact(..., micro_commit_batch_size=K)
Lance-->>LC: compaction result
LC-->>Py: return result
Py-->>U: return result
cc103d3 to 8810cdd
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6141      +/-   ##
==========================================
+ Coverage   73.38%   73.41%   +0.02%
==========================================
  Files         990      993       +3
  Lines      128853   129163     +310
==========================================
+ Hits        94557    94823     +266
- Misses      34296    34340      +44
@universalmind303 @Jay-ju could you please review this when convenient? Thanks.
dataset = lance.dataset(str(dataset_path))
post_fragments = len(dataset.get_fragments())
post_rows = dataset.count_rows()
This check isn't very convincing. Shouldn't the test verify that micro_commit_batch_size actually has an effect?
The purpose of this test case is primarily to validate that the micro_commit_batch_size parameter works. However, performing batch commits during compaction may create multiple Lance dataset versions. In the current implementation of LanceDB, batch compaction can produce an inconsistent number of versions because of scenarios such as data conflicts, so the version count is not a stable thing to assert on. Therefore, I believe that only the functionality needs to be validated in this context.
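To make the "functionality only" position concrete, here is a minimal sketch of the kind of invariant such a test can check without depending on the number of dataset versions. The helper name `compaction_preserved_data` is hypothetical and not part of the PR, and the assumption that compaction never increases the fragment count is mine rather than something the PR states. The inputs are plain integers such as the `post_fragments` and `post_rows` values computed in the snippet under review.

```python
def compaction_preserved_data(
    pre_rows: int,
    post_rows: int,
    pre_fragments: int,
    post_fragments: int,
) -> bool:
    """Hypothetical check: compaction must keep every row and must not
    leave more fragments than it started with (assumed invariant).

    The number of dataset versions is deliberately not checked, because
    the reply above notes it can vary between runs.
    """
    rows_unchanged = post_rows == pre_rows
    fragments_not_increased = post_fragments <= pre_fragments
    return rows_unchanged and fragments_not_increased
```

A test could then assert `compaction_preserved_data(pre_rows, post_rows, pre_fragments, post_fragments)` once for each batch size it exercises.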
Changes Made
Introduce micro_commit_batch_size in daft.io.lance.compact_files and thread it through to lance_compaction to control commit batching during compaction. Docs are updated (docs/connectors/lance.md) and tests are extended (tests/io/lancedb/test_lancedb_compaction.py). When all tasks are committed in a single batch, the call returns CompactionMetrics; for multi-batch commits, it returns None. There are no breaking changes, and the risk is minimal because the parameter is additive and optional.
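The return-value rule described above can be illustrated with a small, self-contained sketch. This is not the PR's implementation: `SketchMetrics`, `split_into_batches`, and `commit_tasks` are hypothetical names, the real CompactionMetrics comes from Lance, and the sketch assumes that leaving the parameter unset means "commit everything in one batch".

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class SketchMetrics:
    """Hypothetical stand-in for the CompactionMetrics mentioned above."""

    tasks_committed: int


def split_into_batches(tasks: Sequence, batch_size: int) -> List[list]:
    """Chunk tasks into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    return [list(tasks[i : i + batch_size]) for i in range(0, len(tasks), batch_size)]


def commit_tasks(
    tasks: Sequence, micro_commit_batch_size: Optional[int] = None
) -> Optional[SketchMetrics]:
    """Return metrics for a single-batch commit, None for a multi-batch one."""
    if micro_commit_batch_size is None:
        # Assumption: an unset parameter means one commit covering all tasks.
        batch_size = max(len(tasks), 1)
    else:
        batch_size = micro_commit_batch_size
    batches = split_into_batches(tasks, batch_size)
    if len(batches) <= 1:
        # One commit covers everything, so a single metrics object is meaningful.
        return SketchMetrics(tasks_committed=len(tasks))
    # Several commits were made; no single metrics object describes them all.
    return None
```

Under these assumptions, callers that set a small batch size should be prepared to handle a None result.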
Related Issues