feat: streamed execution in MERGE #3145
Conversation
Signed-off-by: Ion Koutsouris <[email protected]>
Force-pushed from aa35301 to 92a8f7b
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #3145      +/-   ##
==========================================
- Coverage    71.96%    71.77%    -0.19%
==========================================
  Files          135       135
  Lines        43581     43717     +136
  Branches     43581     43717     +136
==========================================
+ Hits         31361     31378      +17
- Misses       10181     10293     +112
- Partials      2039      2046       +7
==========================================

View full report in Codecov by Sentry.
took me a bit to read through all of this - very exciting stuff!
Do I read that correctly, that this right now mainly affects tables that are coming from the Python side, or cases where someone specifies a table externally in Rust that is not a Delta table (or at least not ours? :)
So the flow where this makes the most sense would be where someone has a RecordBatchReader or Dataset in Python and wants to merge that with our table? Our tables being datasets right now?
Also excited about this because once we have kernelized things, this fits very well into the execution model (I think).
This reminds me - at some point we were discussing that we may no longer be able to use the … Is that correct, or am I remembering something wrong? In any case, we should be able to provide a RecordBatchReader implementation if we can no longer use that API.
I want to land this and test it out with some bigger integration tests. I think this is really important work! 🥳
Yes, at least in an automatic fashion. But Rust users can use these lines to get the same result and then pass the DataFrame into our operation. They do have to enable the flag streaming=True, though, otherwise the early_filter_construction will consume the stream before the writer receives it.
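The single-consumption problem described here can be illustrated with a plain Python generator (an analogy, not the actual delta-rs code): once an upstream step drains the stream, the downstream writer sees nothing.

```python
def batch_stream():
    # Stands in for a RecordBatchStream: it can be iterated exactly once.
    for i in range(3):
        yield f"batch-{i}"

stream = batch_stream()

# If early filter construction scans the stream first...
seen_by_filter = list(stream)

# ...the writer is handed an already-exhausted iterator.
seen_by_writer = list(stream)
```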
I am not sure if we can automatically enable streaming when we see a certain TableProvider. I believe we can do downcast_ref on TableProviders, but I'm not sure how to do this once it's in a logical plan. Especially when people have their own implementation of a LazyTableProvider it will become tricky, but maybe you have some ideas?
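In Python terms, the downcast check being discussed amounts to an isinstance test against a known concrete provider type (all names here are hypothetical stand-ins; the Rust side would use downcast_ref on a dyn TableProvider). The sketch also shows why user-defined providers are tricky: the check only matches types we know about.

```python
class TableProvider:
    """Stand-in for DataFusion's TableProvider trait object."""

class LazyTableProvider(TableProvider):
    """Hypothetical one-shot, stream-backed provider."""

def should_stream(provider: TableProvider) -> bool:
    # Matches our own lazy type, but a third-party lazy provider that
    # doesn't subclass LazyTableProvider would slip through -- the same
    # limitation downcast_ref has with unknown concrete types.
    return isinstance(provider, LazyTableProvider)
```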
Exactly, anything that provides a RecordBatchStream will be consumed and marked as streaming mode. I am currently also refactoring the schema evolution into a logical node for write (https://github.com/ion-elgreco/delta-rs/tree/feat/lazy_schema_evolution), which brings us closer to streamed execution as well.
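At its core, schema evolution is roughly a union of target and source fields. A toy sketch of that merge (field names mapped to type strings; the real logical node handles far more, e.g. nested types and casts):

```python
def evolve_schema(target: dict, source: dict) -> dict:
    # Keep every target field as-is; append source-only fields at the end.
    evolved = dict(target)
    for name, dtype in source.items():
        if name not in evolved:
            evolved[name] = dtype
    return evolved
```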
Yeah, that is a tricky one, because the Dataset construct doesn't provide an interface to filter down dataset fragments. RecordBatchReaders are quite feasible, but we should have some interface to push filters and projections down into the read. Maybe even using pyarrow expressions.
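The pushdown interface being asked for could look like this in spirit: the consumer hands a predicate and a column projection to the source before the stream starts, so filtering happens at read time instead of after materialization (all names here are hypothetical, and rows are modeled as plain dicts for illustration).

```python
class FilteringReader:
    """Toy stream source that accepts filter and projection pushdown."""

    def __init__(self, rows, predicate=None, columns=None):
        self._rows = rows                            # iterable of dict "rows"
        self._predicate = predicate or (lambda row: True)
        self._columns = columns                      # None means all columns

    def __iter__(self):
        for row in self._rows:
            if self._predicate(row):
                if self._columns is None:
                    yield row
                else:
                    # Apply the projection pushed down by the consumer.
                    yield {c: row[c] for c in self._columns}

rows = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
reader = FilteringReader(rows, predicate=lambda r: r["id"] > 1, columns=["v"])
```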
Description
This adds support for streamed execution in MERGE, with one caveat:
_In the future we can look into the work of influxdb or lancedb, to maybe solve getting the statistics and then push this filter down to the delta scan directly._