-
Notifications
You must be signed in to change notification settings - Fork 320
feat: track rows processed during model evaluation #5162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d722c67 to
c3a8760
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice start, I think I follow the logic. I've left some comments for your consideration.
I'm not sure about the track_row_count parameter on the engine adapter methods. I'd like to understand more about the undesirable queries that were getting picked up - perhaps that indicates we were being too coarse with the statistics gathering / wrapping in the wrong place
|
Building on @erindru's feedback, I wonder if in general this tracking logic could get pushed to the SnapshotEvaluator layer. Right now the engine adapter methods return nothing but if they returned row count then it seems like the SnapshotEvaluator could then determine what to do with that information. I don't think though I deeply understand this problem and this comes from a quick review and my general intuition. |
1e405ce to
9377241
Compare
If we pass row counts around through the entire call stack between the scheduler and the engine adapter, I believe it starts getting very messy with the number of method signature changes. The scheduler is where we want to display the information, but it's typically on a different thread and reasonably far removed from This design that Trey is proposing means that we don't have to change the return type of the engine adapter methods or have to worry about passing things back up the stack / rely on every component in the chain between the It essentially uses a combination of thread-local + global state to bypass the call stack. It emits row counts in imo this is a much simpler / less invasive implementation |
9377241 to
6957330
Compare
6957330 to
d4d5afa
Compare
d00e0b9 to
a2bdfdc
Compare
3347837 to
7c0cc46
Compare
7c0cc46 to
61161ca
Compare
Captures the number of rows processed (and potentially other information) during a model evaluation for display to the user.
Example running sushi on Bigquery (includes bytes processed for each model batch):
Context
cursor.row_countproperty that contains the number of rows affected by a DML command execution (e.g.,INSERT,CTAS)Implementation context
_executemethod, so we read the row_count property there immediately aftercursor.executeDetermining what to track
track_row_count=Truein the execute call within themengine_adapter.insert_appendTracker implementation
QueryExecutionContextdataclass_executecalls singlerecord_executionmethod, which determines which tracker is active and should be used (if any)