-
Notifications
You must be signed in to change notification settings - Fork 70
Fix pickling of DQEngine #931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
✅ 434/434 passed, 3 flaky, 36 skipped, 2h55m58s total Flaky tests:
Running from acceptance #3200 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes pickling and unpickling issues with DQEngine and related objects to enable their use in Spark operations like foreachBatch. The changes introduce a PickleableMixin that removes non-serializable WorkspaceClient instances during pickling, and adds a cached get_workspace_client() utility function to lazily recreate the client when needed.
Key Changes:
- Added
PickleableMixinto handle serialization of classes withWorkspaceClientattributes - Introduced
get_workspace_client()utility for lazy WorkspaceClient initialization - Made
workspace_clientparameter optional (defaulting to None) across engine classes and storage handlers - Updated storage handlers and factory classes to use the pickleable pattern
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/databricks/labs/dqx/mixins.py | New mixin implementing __getstate__ and __setstate__ for pickling WorkspaceClient-containing classes |
| src/databricks/labs/dqx/utils.py | Added cached get_workspace_client() function for lazy client initialization |
| src/databricks/labs/dqx/base.py | Updated DQEngineBase to inherit from PickleableMixin and make workspace_client optional |
| src/databricks/labs/dqx/engine.py | Made workspace_client parameter optional in DQEngineCore and DQEngine |
| src/databricks/labs/dqx/profiler/*.py | Made workspace_client optional in DQProfiler and DQGenerator |
| src/databricks/labs/dqx/checks_storage.py | Applied PickleableMixin to storage handlers and made workspace_client optional |
| src/databricks/labs/dqx/installer/mixins.py | Updated InstallationMixin to inherit from PickleableMixin and use lazy client access |
| src/databricks/labs/dqx/installer/workflow_installer.py | Changed _ws to ws property references |
| src/databricks/labs/dqx/contexts/workspace_context.py | Made workspace_client optional with lazy initialization |
| tests/integration/test_object_serialization.py | New integration tests for pickling various DQX objects |
| tests/integration/test_apply_checks.py | Added test for foreachBatch usage with DQEngine |
| tests/integration/test_utils.py | Added test for get_workspace_client() |
| pyproject.toml | Added cloudpickle dependency |
| tests/conftest.py | Fixed comment formatting |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <[email protected]>
| telemetry so that requests are attributed to *dqx*. | ||
| """ | ||
| return self._workspace_client | ||
| return self._ws or get_workspace_client() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we simplify this, and have this logic in constructor only to avoid duplications
| def __getstate__(self): | ||
| """Removes the WorkspaceClient when the object is pickled.""" | ||
| state = self.__dict__.copy() | ||
| if '_ws' in state: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should standarize the naming and always use either ws or workspace client? we do the same in telemetry, we could avoid all these if statements.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #931 +/- ##
===========================================
- Coverage 90.02% 54.15% -35.87%
===========================================
Files 60 61 +1
Lines 5221 5268 +47
===========================================
- Hits 4700 2853 -1847
- Misses 521 2415 +1894 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Changes
This PR fixes issues with pickling and unpickling of the
DQEngineand associated objects which prevent injection ofDQEngineinto methods likeforeachBatch.I have introduced a
WorkspaceClientSerDeMixinto implement__getstate__and__setstate__methods to reset theWorkspaceClient.I have also introduced a
get_workspace_client()utility method which is called whenDQEngineneeds to be used without an internalWorkspaceClientattribute.Linked issues
Resolves #922
Tests