[RFC] feat!: kernel-based log replay #3137
|
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
Not in its current form, but updating Snapshot, and with that the log segment, definitely needs to go in here... |
Force-pushed: f8049db to e7c7766
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3137      +/-   ##
==========================================
- Coverage   71.81%   71.41%   -0.41%
==========================================
  Files         145      152       +7
  Lines       45972    46811     +839
  Branches    45972    46811     +839
==========================================
+ Hits        33016    33429     +413
- Misses      10859    11216     +357
- Partials     2097     2166      +69
|
Force-pushed: d59867e to 4f8ff2d
Signed-off-by: Robert Pack <[email protected]>
Force-pushed: 4f8ff2d to 5d2cf48
|
@roeap I assume that with the introduction of the CommitCacheObjectStore, you would want to instantiate two object stores on the log store: one for commits, and one for reading/writing parquet. With regards to the object store for reading/writing parquet, the folks at "seafowl" built an interesting caching layer for reading parquet files: https://github.com/splitgraph/seafowl/blob/main/src/object_store/cache.rs. I asked whether they could publish that as a crate; I think it could be really valuable for read operations that require scans. |
|
Well .. this very naive caching implementation is for now mainly meant to not double down on some of the "regrets" of our past selves when it comes to the Snapshot implementation. By now the parquet read is very selective in delta-rs and delta-kernel-rs, with column selection and row group filtering... as such the assumption is that we do not need to cache data from checkpoints and can focus on caching all these expensive json commit reads. This significantly simplifies the data we keep in memory - essentially just reconciled add action data - while not incurring too much of a penalty for repeated json (commit) reads.

But this is mostly just a stop-gap for adopting kernel "the right way", or at least not in an obviously wrong way 😆. As you rightfully mention, there is much more that can be done. IIRC, datafusion also at least has the wiring to inject caching of parquet footers, which should make scanning snapshots for actions other than adds much more efficient as well.

Without having spent too much time thinking about it, I think the abstraction you mentioned is much nicer - i.e. we are aware of what type of file we are reading. For us this would, in a kernel world, mean we would hoist some caching up to a higher level, the json and parquet handler traits in

One could argue that this is more or less what we are doing now, keeping all arrow state in memory, but I would say that we can build something much more efficient - and shareable across snapshots - at the engine layer. We could also do things like a local file cache etc ..

One thing I discussed with @rtyler is to move the caching object store to a dedicated PR, as we can get that merged much quicker than this one - which may yet take some time :). Also, we can think about whether we can (and should) iterate on our configuration system a bit. The tombstones config, for instance, has had no effect for a while now. |
|
@roeap on your last note, I think that could indeed be useful, to already provide the benefit of it. I haven't looked at that code in depth, but I assume you can limit the cache size? |
|
Indeed you can - right now it's a hard-coded count, but in a separate PR this should be configurable. The crate also makes it simple to use other weights - e.g. limit by size - as well as to choose eviction policies. Some of these we should allow users to configure, but hopefully we can just have great defaults based on what we know about delta tables 😆. |
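To make the "limit by size" idea concrete, here is a minimal std-only sketch of a cache whose budget is measured in bytes rather than in entries. It is not the crate used in this PR: `ByteBoundedCache`, its methods, and the first-in-first-out eviction are all hypothetical choices made for illustration.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical byte-weighted cache: the budget is the summed payload size,
/// and the oldest entries are evicted first (FIFO) when the budget is exceeded.
pub struct ByteBoundedCache {
    max_bytes: usize,
    used_bytes: usize,
    order: VecDeque<String>,
    entries: HashMap<String, Vec<u8>>,
}

impl ByteBoundedCache {
    pub fn new(max_bytes: usize) -> Self {
        Self {
            max_bytes,
            used_bytes: 0,
            order: VecDeque::new(),
            entries: HashMap::new(),
        }
    }

    pub fn get(&self, key: &str) -> Option<&[u8]> {
        self.entries.get(key).map(|v| v.as_slice())
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }

    pub fn insert(&mut self, key: &str, value: Vec<u8>) {
        // Drop any previous value for this key so its weight is not counted twice.
        if let Some(old) = self.entries.remove(key) {
            self.used_bytes -= old.len();
            self.order.retain(|k| k.as_str() != key);
        }
        // A value larger than the whole budget is never cached.
        if value.len() > self.max_bytes {
            return;
        }
        // Evict the oldest entries until the new value fits.
        while self.used_bytes + value.len() > self.max_bytes {
            let Some(oldest) = self.order.pop_front() else { break };
            if let Some(old) = self.entries.remove(&oldest) {
                self.used_bytes -= old.len();
            }
        }
        self.used_bytes += value.len();
        self.order.push_back(key.to_string());
        self.entries.insert(key.to_string(), value);
    }
}
```

A count-based limit is the special case where every entry weighs 1; weighting by payload length keeps memory bounded even when individual commit files differ a lot in size.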
ion-elgreco
left a comment
Looks good overall!
Self {
    inner,
    check: Arc::new(cache_json),
    cache: Arc::new(Cache::new(100)),
I think we should add a Weight capacity here as well with a configurable env var to limit the Bytes held in memory
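A minimal sketch of how such an env-var-driven byte limit could be read, assuming a variable name and a default that are both made up for illustration (`DELTA_COMMIT_CACHE_MAX_BYTES` and 64 MiB are not defined anywhere in delta-rs):

```rust
use std::env;

/// Hypothetical variable name, chosen only for this sketch.
const CACHE_MAX_BYTES_VAR: &str = "DELTA_COMMIT_CACHE_MAX_BYTES";
/// Hypothetical default budget: 64 MiB.
const DEFAULT_CACHE_MAX_BYTES: usize = 64 * 1024 * 1024;

/// Parse a raw setting; missing, non-numeric, or zero values fall back to the default.
pub fn parse_cache_max_bytes(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|bytes| *bytes > 0)
        .unwrap_or(DEFAULT_CACHE_MAX_BYTES)
}

/// Read the limit from the process environment.
pub fn cache_max_bytes_from_env() -> usize {
    parse_cache_max_bytes(env::var(CACHE_MAX_BYTES_VAR).ok().as_deref())
}
```

Keeping the parsing separate from the environment lookup makes the fallback behaviour testable without mutating process-wide state.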
    files: Option<RecordBatch>,
}

impl Snapshot for EagerSnapshot {
One thing that is missing is the DeltaTableConfig. I added this some time ago to the old snapshot because we sometimes need to be aware, in the operation, of how the table got loaded.
/// Get the table config which is loaded with of the snapshot
pub fn load_config(&self) -> &DeltaTableConfig {
    self.snapshot.load_config()
}
        self.snapshot.table_root()
    }

    fn version(&self) -> Version {
No more version_timestamp as well?
fn next(&mut self) -> Option<Self::Item> {
    if self.index < self.paths.len() {
        let path = self.paths.value(self.index).to_string();
        let add = AddVisitor::visit_add(self.index, path, self.getters.as_slice())
Is this always guaranteed to find the next add action?
        ))
    }

    pub fn stats(&self) -> Option<&str> {
Why does a logicalFileView have stats?
fn extract_column<'a>(
    mut parent: &'a dyn ProvidesColumnByName,
    col: &[impl AsRef<str>],
The name threw me off, I thought it was multiple columns, but it's a single column_path
res.and_then(|(data, predicate)| {
    let batch: RecordBatch =
        ArrowEngineData::try_from_engine_data(data)?.into();
    Ok(filter_record_batch(&batch, &BooleanArray::from(predicate))?)
Why even filter when the predicate was None?
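The point can be sketched independently of arrow: when no selection mask exists, the rows can be passed through untouched instead of being filtered with an all-true mask. The function below is a hypothetical stand-in (plain `Vec` rows and a `bool` slice instead of `RecordBatch` and `BooleanArray`), not the code from this PR.

```rust
/// Keep only the rows whose mask entry is `true`.
/// With no mask at all, the rows are returned as-is and no filtering work is done.
/// Assumption for this sketch: rows beyond the end of a shorter mask are dropped.
pub fn apply_selection<T>(rows: Vec<T>, mask: Option<&[bool]>) -> Vec<T> {
    match mask {
        // Fast path: nothing to filter, hand the data back unchanged.
        None => rows,
        Some(mask) => rows
            .into_iter()
            .zip(mask.iter().copied())
            .filter_map(|(row, keep)| keep.then_some(row))
            .collect(),
    }
}
```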
    start_version: Option<Version>,
    limit: Option<usize>,
) -> DeltaResult<Box<dyn Iterator<Item = (Version, CommitInfo)>>> {
    // let start_version = start_version.into();
Old line?
let end_version = start_version.unwrap_or_else(|| self.version());
let start_version = limit
    .and_then(|limit| {
        if limit == 0 {
            Some(end_version)
        } else {
            Some(end_version.saturating_sub(limit as u64 - 1))
This is highly confusing xd, the end version becomes the start versions when passed, and then the start_versions becomes the end version again when there is no limit :S
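One way to reduce the confusion is to name the two bounds by what they are (newest and oldest version read) rather than reusing start/end. The sketch below mirrors the arithmetic of the quoted snippet for the `limit` cases; the behaviour without a limit is not visible in the snippet, so walking back to version 0 is an assumption, and `VersionWindow` and `history_window` are hypothetical names.

```rust
/// Inclusive range of versions to read, named by position in history.
#[derive(Debug, PartialEq, Eq)]
pub struct VersionWindow {
    pub oldest: u64,
    pub newest: u64,
}

/// `latest` is the snapshot's current version, `newest` an optional upper bound
/// supplied by the caller, and `limit` the maximum number of commits to return.
pub fn history_window(latest: u64, newest: Option<u64>, limit: Option<usize>) -> VersionWindow {
    let newest = newest.unwrap_or(latest);
    let oldest = match limit {
        // Mirrors the quoted snippet: a limit of 0 collapses the window to `newest`.
        Some(0) => newest,
        // `n` commits ending at `newest`, clamped at version 0.
        Some(n) => newest.saturating_sub(n as u64 - 1),
        // Assumption: without a limit, read all the way back to version 0.
        None => 0,
    };
    VersionWindow { oldest, newest }
}
```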
    store: Arc<dyn ObjectStore>,
    version: impl Into<Option<Version>>,
) -> DeltaResult<Self> {
    // TODO: how to deal with the dedicated IO runtime? Would this already be covered by the
We currently do that all the way at the beginning in logstore_with
@roeap What's the status of integrating this, any ETA?
This commit was modified from delta-io#3137 to enable an independent merge to bring some of these structural changes needed for delta-kernel-rs integration in piecemeal Signed-off-by: Robert Pack <[email protected]> Signed-off-by: R. Tyler Croy <[email protected]>
|
@roeap @ion-elgreco @rtyler does anyone know the status of this PR? IIUC moving to delta-kernel is foundational for delta-rs moving forward, closing many gaps in protocol compliance like column mapping, deletion vectors, V2 checkpoints with sidecars etc. Yet the only PR that seems to be working towards it has been open since January and doesn't seem to be concluding. Can anyone share broad roadmap plans for delta-rs here? Thanks. |
There is no official roadmap. |
Thanks. |
You could take this PR and try to hook the new log replay to delta-rs |
I don't want to bifurcate the effort, so just so I understand the current status: is this PR currently abandoned? Should I fork it and continue work on it, or start fresh? What's the best game plan here? |
It's not abandoned afaik, but @roeap has done some prep work in recent PRs to get things more compatible, see https://github.com/delta-io/delta-rs/pulls?q=is%3Apr+author%3Aroeap+is%3Aclosed. Also, @rtyler has already moved some of the dat testing in this PR into main. I would just continue on this work and see if you can come up with something; @roeap, however, can give the best answer here. |
* chore: remove cdf feature Signed-off-by: Ion Koutsouris <[email protected]> * Correct Python docs for incremental compaction on OPTIMIZE * fix: added restored metadata as action to the next committed version Signed-off-by: Alexander Falk <[email protected]> * chore: add a regression test to ensure restore respects metadata actions Fixes delta-io#3352 Signed-off-by: R. Tyler Croy <[email protected]> * chore: fix some minor build warnings Signed-off-by: R. Tyler Croy <[email protected]> * fix: handle unknown features Signed-off-by: Robert Pack <[email protected]> * fix: update to latest kernel state Signed-off-by: Robert Pack <[email protected]> * test: update or disable tests with unsupported features Signed-off-by: Robert Pack <[email protected]> * refactor: move transaction module to kernel Signed-off-by: Robert Pack <[email protected]> * chore: clippy Signed-off-by: Robert Pack <[email protected]> * chore: move proofs into dedicated folder Signed-off-by: Robert Pack <[email protected]> * refactor: move storage module into logstore Signed-off-by: Robert Pack <[email protected]> * feat: harmonize storage config parsing Signed-off-by: Robert Pack <[email protected]> * refactor: remove RetryConfigParse trait Signed-off-by: Robert Pack <[email protected]> * feat!: formalize parsing of storage options Signed-off-by: Robert Pack <[email protected]> * feat: centrally apply object store layers Signed-off-by: Robert Pack <[email protected]> * refactor: isolate factories for storage / log store integrations Signed-off-by: Robert Pack <[email protected]> * fix: url parsing inconsistencies Signed-off-by: Robert Pack <[email protected]> * fix: PR feedback Signed-off-by: Robert Pack <[email protected]> * fix: PR feedback Signed-off-by: Robert Pack <[email protected]> * fix: clippy warnings Signed-off-by: Andrew Lamb <[email protected]> * feat: derive macro for config implementations Signed-off-by: Robert Pack <[email protected]> * feat: error handling in derive macro 
Signed-off-by: Robert Pack <[email protected]> * refactor: move str_ist_truthy to config Signed-off-by: Robert Pack <[email protected]> * chore: clippy Signed-off-by: Robert Pack <[email protected]> * Chore: put a couple symbols behind the right feature gate Signed-off-by: R. Tyler Croy <[email protected]> * update for kernel 0.10.0 Signed-off-by: Zach Schuermann <[email protected]> * fix daft docs Signed-off-by: Zach Schuermann <[email protected]> * Fix the default target size Signed-off-by: Hiromu Hota <[email protected]> * gate RetryConfig usage on 'cloud' feature Signed-off-by: Ze'ev Maor <[email protected]> * add compile_error if neither 'rustls' not 'native-tls' are enabled Signed-off-by: Ze'ev Maor <[email protected]> * feat: Update to Datafusion 47.0.0 Signed-off-by: Andrew Lamb <[email protected]> * feat: Update to Datafusion 47.0.0 Signed-off-by: Andrew Lamb <[email protected]> * chore: re-enable hdfs support and add a teensy tiny unit test Signed-off-by: R. Tyler Croy <[email protected]> * chore: tighten up the checking on the predicate comparison Signed-off-by: R. Tyler Croy <[email protected]> * chore: bump versions of rust crates for another release party Signed-off-by: R. Tyler Croy <[email protected]> * chore: remove unused dependencies Signed-off-by: R. Tyler Croy <[email protected]> * chore: ensure derive is ready for publishing too Signed-off-by: R. Tyler Croy <[email protected]> * chore: the mount crate reqiores the cloud feature now Signed-off-by: R. Tyler Croy <[email protected]> * chore: hdfs requires the cloud feature Signed-off-by: R. Tyler Croy <[email protected]> * chore: modify the publish script to take the required crate ordering into consideration Signed-off-by: R. Tyler Croy <[email protected]> * chore: remove unnecessary datafusion dependency for mount Signed-off-by: R. 
Tyler Croy <[email protected]> * chore: reduce feature/dependency footprint for subcrates The object_store crate does not require its cloud feature in order to use RetryConfig, so most of the subcrates can shed a cloud and datafusion feature. Cleaning this up allows for avoiding the ObjectStoreFactory trait's ambiguous implementation which can cause problem if a subcrate is implemented using the "non-cloud" arm but then is included in a dependency tree where the "cloud" feature is enabled by another dependency. This was sort of only theoretically possible but did manifest during `cargo publish` operations. Additionally the removal of a datafusion feature when it iis not necessary results in ~100 fewer crates at compile and link time for those (hi!) working within subcrates Signed-off-by: R. Tyler Croy <[email protected]> * feat: introduce VacuumMode::Full for cleaning up orphaned files This allows an optional but not-on-by-default mode of removing untracked files in the delta table directory. Delta/Spark supports a "lite" and "full" mode for [vacuum]. This change is intentionally not making "full" the default as it is for Delta/Spark since that may have unintended consequences for our users who have become accustomed to "lite" being the default. Fixes delta-io#2349 [vacuum]: https://docs.delta.io/latest/delta-utility.html#remove-files-no-longer-referenced-by-a-delta-table Signed-off-by: R. Tyler Croy <[email protected]> * fix: if field contains space in constraint expression, the check will fail Signed-off-by: Alexander Falk <[email protected]> * chore: add test for handling fields with spaces in constraints Signed-off-by: R. Tyler Croy <[email protected]> * chore(deps): Update sqlparser requirement from 0.53.0 to 0.56.0 Updates the requirements on [sqlparser](https://github.com/apache/datafusion-sqlparser-rs) to permit the latest version. 
- [Changelog](https://github.com/apache/datafusion-sqlparser-rs/blob/main/CHANGELOG.md) - [Commits](apache/datafusion-sqlparser-rs@v0.53.0...v0.56.0) --- updated-dependencies: - dependency-name: sqlparser dependency-version: 0.56.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * chore(deps): Update foyer requirement from 0.16.1 to 0.17.0 Updates the requirements on [foyer](https://github.com/foyer-rs/foyer) to permit the latest version. - [Release notes](https://github.com/foyer-rs/foyer/releases) - [Changelog](https://github.com/foyer-rs/foyer/blob/main/CHANGELOG.md) - [Commits](foyer-rs/foyer@v0.16.1...v0.17.0) --- updated-dependencies: - dependency-name: foyer dependency-version: 0.17.0 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * chore: setup dat test scaffolding This commit was modified from delta-io#3137 to enable an independent merge to bring some of these structural changes needed for delta-kernel-rs integration in piecemeal Signed-off-by: Robert Pack <[email protected]> Signed-off-by: R. Tyler Croy <[email protected]> * chore: bring dat test loading into the root Signed-off-by: R. Tyler Croy <[email protected]> * chore: enable dat testing with the existing code prior to bringing kernel replay in, I would like "classic" delta-rs parsing Closes delta-io#863 Signed-off-by: R. Tyler Croy <[email protected]> * chore: missed a version bump for core Signed-off-by: R. Tyler Croy <[email protected]> * fix: build Unity Catalog crate without DataFusion Signed-off-by: Heran Lin <[email protected]> * fix: drop column earlier Signed-off-by: Ion Koutsouris <[email protected]> * chore: add a regression test for delta-io#3413 Signed-off-by: R. 
Tyler Croy <[email protected]> * chore: include license file in deltalake-derive crate Signed-off-by: Andrew Kane <[email protected]> * chore(deps): bump foyer to v0.17.2 to prevent from wrong result Signed-off-by: MrCroxx <[email protected]> * fix: pin arrow to 55.0.0 Signed-off-by: Ion Koutsouris <[email protected]> * feat: during LakeFS file operations, skip merge when 0 changes Signed-off-by: Sam Meyer-Reed <[email protected]> * Fix broken test Signed-off-by: Sam Meyer-Reed <[email protected]> * Fix lakefs diff API parameter order Signed-off-by: Sam Meyer-Reed <[email protected]> * Fix formatting Signed-off-by: Sam Meyer-Reed <[email protected]> * added gc valid check Signed-off-by: JustinRush80 <[email protected]> * chore: bump crate versions which are due for release Signed-off-by: R. Tyler Croy <[email protected]> * feat: spawn io with spawn service Signed-off-by: Ion Koutsouris <[email protected]> * fix: pin arrow to 55.0.0 Signed-off-by: Ion Koutsouris <[email protected]> * chore: rely on the testing during coverage generation to speed up tests Signed-off-by: R. Tyler Croy <[email protected]> * chore: make codecov more vigorously enforced to help ensure quality Signed-off-by: R. Tyler Croy <[email protected]> * chore: prepare py-1.0 release Signed-off-by: Ion Koutsouris <[email protected]> * Upgrade load_with_datetime to ignore any uncommited deltas in any subdirectory of delta_log. Signed-off-by: Corwin Joy <[email protected]> Co-authored-by: Adam Reeve <[email protected]> * feat(datafusion): file pruning based on pushdown limit for partition cols filters Signed-off-by: Adrian Tanase <[email protected]> * feat(datafusion): optmize partition pruning, pushdown full predicates for DF integration Signed-off-by: Adrian Tanase <[email protected]> * chore: experiment with using sccache in GitHub Actions Signed-off-by: R. Tyler Croy <[email protected]> * chore: cleanup the CODEOWNERS a bit for more accurate review assignments Signed-off-by: R. 
Tyler Croy <[email protected]> * chore: only check our documentation, not dependencies Signed-off-by: R. Tyler Croy <[email protected]> * chore: refactor the Rust build to use as much as possible of sccache Signed-off-by: R. Tyler Croy <[email protected]> * chore: remove unused code and deps Signed-off-by: Robert Pack <[email protected]> * chore: remove peek_next_commit on DeltaTable which has been deprecated since 0.22.4 Signed-off-by: R. Tyler Croy <[email protected]> * chore: refactor some symbols out of table/mod.rs into their own files This makes things a little cleaner when reviewing this code and preparing for refactors Signed-off-by: R. Tyler Croy <[email protected]> * docs: add 1.0.0 migration guide Signed-off-by: Ion Koutsouris <[email protected]> * refactor: more specific factory parameter names Signed-off-by: Robert Pack <[email protected]> * feat: expose kernel Engine on LogStore Signed-off-by: Robert Pack <[email protected]> * chore: pr feedback and test fixes Signed-off-by: Robert Pack <[email protected]> * test: avoid circular dependency with core/test crates Signed-off-by: Robert Pack <[email protected]> * refactor: use LogStore in Snapshot / LogSegment APIs Signed-off-by: Robert Pack <[email protected]> * chore: build default tests with the crate in CI Signed-off-by: R. Tyler Croy <[email protected]> * chore: enable the datafusion feature for integration tests which need it Signed-off-by: R. Tyler Croy <[email protected]> * chore: annotate tests which require datafusion appropriately Signed-off-by: R. Tyler Croy <[email protected]> * ci: add spellchecker to pr tests Signed-off-by: Robert Pack <[email protected]> * chore: mark more tests which require datafusion Signed-off-by: R. 
Tyler Croy <[email protected]> * refactor: move from pyarrow to arro3 Signed-off-by: Ion Koutsouris <[email protected]> * chore: pr feedback Signed-off-by: Ion Koutsouris <[email protected]> * refactor: use root store in log processing Signed-off-by: Robert Pack <[email protected]> * fix: use more accurate log path parsing Signed-off-by: Robert Pack <[email protected]> * chore: set correct markers Signed-off-by: Ion Koutsouris <[email protected]> * chore: update kernel Signed-off-by: Robert Pack <[email protected]> * chore: update kernel Signed-off-by: Robert Pack <[email protected]> * fix: remove problematic typos configuration and fix Spellcheck issues Signed-off-by: Florian VALEYE <[email protected]> * feat: use kernel checkpoint writer Signed-off-by: Robert Pack <[email protected]> * refactor: use kernel log segment for some log inspection Signed-off-by: Robert Pack <[email protected]> * chore: remove unused time_utils Signed-off-by: Robert Pack <[email protected]> * chore: more typos Signed-off-by: Robert Pack <[email protected]> * refactor: remove protocol error Signed-off-by: Robert Pack <[email protected]> * feat: add table description and name API for Python Add convenient methods to set table description and name through the Python API. 
Signed-off-by: Florian VALEYE <[email protected]> * feat: add validator crate and use to have update table metadata validation in Rust Signed-off-by: Florian VALEYE <[email protected]> * chore: remove unused stats parsed field Signed-off-by: Robert Pack <[email protected]> * fix: arro3 schema conversion logic Signed-off-by: Ion Koutsouris <[email protected]> * chore: update migration docs Signed-off-by: Ion Koutsouris <[email protected]> * chore: improve wording Signed-off-by: Ion Koutsouris <[email protected]> * chore: update kernel to 0.11 Signed-off-by: Robert Pack <[email protected]> * fix: set casting safe param to False Signed-off-by: Ion Koutsouris <[email protected]> * chore: add xfail to flaky test Signed-off-by: Ion Koutsouris <[email protected]> * fix bullet list formatting Signed-off-by: Avril Aysha <[email protected]> * refactor!: get transaction versions for specific applications Signed-off-by: Robert Pack <[email protected]> * test: improve storage config testing Signed-off-by: Robert Pack <[email protected]> * chore: exclude Invariants from the default writer v2 feature set Invariants cannot be supported without datafusion present, the code should not pretend they exist. This also helps ensure a number of non-invariant related test can run without datafusion present Signed-off-by: R. Tyler Croy <[email protected]> * refactor!: remove and deprecate some python methods Signed-off-by: Robert Pack <[email protected]> * fix: ensure projecting only columns that exist in new files afte schema update Signed-off-by: Alex Wilcoxson <[email protected]> * docs: update link to df Signed-off-by: Raz Luvaton <[email protected]> * chore: update runner Signed-off-by: Ion Koutsouris <[email protected]> * ci: improve coverage collection Signed-off-by: Robert Pack <[email protected]> * chore: prepare for the next python release Signed-off-by: R. 
Tyler Croy <[email protected]> * chore!: remove get_earliest_version Signed-off-by: Robert Pack <[email protected]> * refactor!: have DeltaTable::version return an Option Signed-off-by: Robert Pack <[email protected]> * Revert "chore: add test for handling fields with spaces in constraints" This reverts commit 89ddf07. * Revert "fix: if field contains space in constraint expression, the check will fail" This reverts commit 6babfb6. * fix: spaced columns parsing Signed-off-by: Ion Koutsouris <[email protected]> * chore: update tests Signed-off-by: Ion Koutsouris <[email protected]> * chore: fmt Signed-off-by: Ion Koutsouris <[email protected]> * fix: wrong schema set in table provider Signed-off-by: Ion Koutsouris <[email protected]> * chore: bump version Signed-off-by: Ion Koutsouris <[email protected]> * refactor: move LazyTableProvider into python crate Signed-off-by: Robert Pack <[email protected]> * feat: add convenience extensions for kernel engine types Signed-off-by: Robert Pack <[email protected]> --------- Signed-off-by: Ion Koutsouris <[email protected]> Signed-off-by: Alexander Falk <[email protected]> Signed-off-by: R. 
Tyler Croy <[email protected]> Signed-off-by: Robert Pack <[email protected]> Signed-off-by: Andrew Lamb <[email protected]> Signed-off-by: Zach Schuermann <[email protected]> Signed-off-by: Hiromu Hota <[email protected]> Signed-off-by: Ze'ev Maor <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: Heran Lin <[email protected]> Signed-off-by: Andrew Kane <[email protected]> Signed-off-by: MrCroxx <[email protected]> Signed-off-by: Sam Meyer-Reed <[email protected]> Signed-off-by: JustinRush80 <[email protected]> Signed-off-by: Corwin Joy <[email protected]> Signed-off-by: Adrian Tanase <[email protected]> Signed-off-by: Florian VALEYE <[email protected]> Signed-off-by: Avril Aysha <[email protected]> Signed-off-by: Alex Wilcoxson <[email protected]> Signed-off-by: Raz Luvaton <[email protected]> Co-authored-by: Ion Koutsouris <[email protected]> Co-authored-by: Roy Kim <[email protected]> Co-authored-by: Alexander Falk <[email protected]> Co-authored-by: R. Tyler Croy <[email protected]> Co-authored-by: Robert Pack <[email protected]> Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Zach Schuermann <[email protected]> Co-authored-by: Hiromu Hota <[email protected]> Co-authored-by: Ze'ev Maor <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Heran Lin <[email protected]> Co-authored-by: Andrew Kane <[email protected]> Co-authored-by: MrCroxx <[email protected]> Co-authored-by: Sam Meyer-Reed <[email protected]> Co-authored-by: JustinRush80 <[email protected]> Co-authored-by: Corwin Joy <[email protected]> Co-authored-by: Adam Reeve <[email protected]> Co-authored-by: Adrian Tanase <[email protected]> Co-authored-by: Florian VALEYE <[email protected]> Co-authored-by: Avril Aysha <[email protected]> Co-authored-by: Alex Wilcoxson <[email protected]> Co-authored-by: Raz Luvaton <[email protected]>
Description
This PR aims to provide new implementations for the current Snapshot (now called LazySnapshot) and EagerSnapshot, backed by the delta-kernel-rs library.

This PR focuses on the implementation of the new snapshots, but avoids updating all usages and removing the old ones. I plan to provide some stacked PRs that actually use these in operations etc., hoping that this way reviews and feedback can be a bit more streamlined.

To reduce churn in the codebase after the switch has been made, we introduce a trait Snapshot which is implemented by the new snapshots and should also be implemented for DeltaTableState. We can now establish a more uniform API across the Snapshot variants, since Kernel's execution model allows us to avoid async in all APIs.

One of the most significant conceptual changes is how eager the EagerSnapshot is. The parquet reading in both delta-rs and delta-kernel-rs has evolved a lot since the EagerSnapshot was first written, and handles pushdown of columns and predicates much more effectively. To mitigate the cost of repeated reads of commit data, we introduce a simple caching layer in the form of an ObjectStore implementation that caches commit reads in memory. Right now this is a simple brute-force approach to allow for migration, but hopefully it will be extended in the future to also avoid json parsing and to cache parquet metadata reads.

Any feedback on the direction this is taking is greatly appreciated.
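As a rough illustration of the caching layer described above, the sketch below wraps a store and memoizes only reads of commit files, which the Delta protocol names `_delta_log/<20-digit version>.json`. It is a simplified, synchronous, std-only stand-in: the `BlobStore` trait, `CommitCache`, and `CountingStore` are hypothetical and do not reflect the real `ObjectStore` trait or the implementation in this PR.

```rust
use std::collections::HashMap;

/// True for Delta commit files: `_delta_log/<20 decimal digits>.json`.
pub fn is_commit_json(path: &str) -> bool {
    let Some((dir, file)) = path.rsplit_once('/') else { return false };
    let Some(stem) = file.strip_suffix(".json") else { return false };
    (dir == "_delta_log" || dir.ends_with("/_delta_log"))
        && stem.len() == 20
        && stem.bytes().all(|b| b.is_ascii_digit())
}

/// Hypothetical, synchronous stand-in for an object store.
pub trait BlobStore {
    fn get(&mut self, path: &str) -> Option<Vec<u8>>;
}

/// Wrapper that keeps commit JSON payloads in memory and passes
/// every other read (checkpoints, data files) straight through.
pub struct CommitCache<S: BlobStore> {
    inner: S,
    cached: HashMap<String, Vec<u8>>,
}

impl<S: BlobStore> CommitCache<S> {
    pub fn new(inner: S) -> Self {
        Self { inner, cached: HashMap::new() }
    }

    pub fn inner(&self) -> &S {
        &self.inner
    }
}

impl<S: BlobStore> BlobStore for CommitCache<S> {
    fn get(&mut self, path: &str) -> Option<Vec<u8>> {
        if !is_commit_json(path) {
            return self.inner.get(path);
        }
        if let Some(hit) = self.cached.get(path) {
            return Some(hit.clone());
        }
        let bytes = self.inner.get(path)?;
        self.cached.insert(path.to_string(), bytes.clone());
        Some(bytes)
    }
}

/// In-memory store that counts reads, used only to observe cache hits.
pub struct CountingStore {
    pub blobs: HashMap<String, Vec<u8>>,
    pub reads: usize,
}

impl BlobStore for CountingStore {
    fn get(&mut self, path: &str) -> Option<Vec<u8>> {
        self.reads += 1;
        self.blobs.get(path).cloned()
    }
}
```

Commit files are immutable once written, which is what makes caching them by path safe in this sketch; the map here is unbounded, so a real layer would also need a capacity limit.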