@ghanse ghanse commented Nov 13, 2025

Changes

This PR fixes pickling and unpickling issues in DQEngine and associated objects that prevented DQEngine from being injected into methods such as foreachBatch.

I have introduced a WorkspaceClientSerDeMixin that implements the __getstate__ and __setstate__ methods so that the WorkspaceClient is reset when an object is pickled and unpickled.
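
As a rough illustration of the idea (a minimal sketch, not the code in this PR), the mixin can replace the client with None in the pickled state. `FakeWorkspaceClient` and `Engine` are hypothetical stand-ins, and the `_ws` attribute name is an assumption:

```python
import pickle
import threading


class FakeWorkspaceClient:
    """Hypothetical stand-in for WorkspaceClient; the lock makes it unpicklable."""

    def __init__(self):
        self._lock = threading.Lock()


class WorkspaceClientSerDeMixin:
    """Sketch only: keep the client out of the pickled state."""

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_ws"] = None  # assumed attribute name; never serialize the client
        return state

    def __setstate__(self, state):
        # The restored object starts without a client.
        self.__dict__.update(state)


class Engine(WorkspaceClientSerDeMixin):
    """Hypothetical stand-in for DQEngine."""

    def __init__(self, ws=None, name="engine"):
        self._ws = ws
        self.name = name


if __name__ == "__main__":
    engine = Engine(FakeWorkspaceClient(), name="dq")
    restored = pickle.loads(pickle.dumps(engine))
    print(restored.name, restored._ws)
```

The original object keeps its client; only the serialized copy loses it, so the receiving side has to obtain a client some other way.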

I have also introduced a get_workspace_client() utility function, which is called when DQEngine is used without an internal WorkspaceClient attribute.
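
A hedged sketch of how such a fallback might look, assuming the function is cached so that repeated calls reuse one client. `FakeWorkspaceClient`, `Engine`, and the `ws` property are illustrative names, not the PR's actual implementation:

```python
from functools import lru_cache


class FakeWorkspaceClient:
    """Hypothetical stand-in for the real WorkspaceClient."""


@lru_cache(maxsize=1)
def get_workspace_client():
    """Create the client on first use and reuse it afterwards."""
    return FakeWorkspaceClient()


class Engine:
    """Hypothetical stand-in for DQEngine."""

    def __init__(self, ws=None):
        self._ws = ws

    @property
    def ws(self):
        # Fall back to the shared client when none was injected,
        # for example after the object has been unpickled.
        return self._ws or get_workspace_client()
```

An explicitly injected client still takes precedence; the cached function is only consulted when the attribute is empty.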

Linked issues

Resolves #922

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

github-actions bot commented Nov 13, 2025

✅ 434/434 passed, 3 flaky, 36 skipped, 2h55m58s total

Flaky tests:

  • 🤪 test_profiler_serverless (3m28.79s)
  • 🤪 test_e2e_workflow_serverless (4m7.198s)
  • 🤪 test_e2e_workflow_serverless (7m48.853s)

Running from acceptance #3200

Copilot AI left a comment

Pull Request Overview

This PR fixes pickling and unpickling issues with DQEngine and related objects to enable their use in Spark operations like foreachBatch. The changes introduce a PickleableMixin that removes non-serializable WorkspaceClient instances during pickling, and adds a cached get_workspace_client() utility function to lazily recreate the client when needed.

Key Changes:

  • Added PickleableMixin to handle serialization of classes with WorkspaceClient attributes
  • Introduced get_workspace_client() utility for lazy WorkspaceClient initialization
  • Made workspace_client parameter optional (defaulting to None) across engine classes and storage handlers
  • Updated storage handlers and factory classes to use the pickleable pattern

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Summary per file

| File | Description |
|------|-------------|
| src/databricks/labs/dqx/mixins.py | New mixin implementing __getstate__ and __setstate__ for pickling WorkspaceClient-containing classes |
| src/databricks/labs/dqx/utils.py | Added cached get_workspace_client() function for lazy client initialization |
| src/databricks/labs/dqx/base.py | Updated DQEngineBase to inherit from PickleableMixin and make workspace_client optional |
| src/databricks/labs/dqx/engine.py | Made workspace_client parameter optional in DQEngineCore and DQEngine |
| src/databricks/labs/dqx/profiler/*.py | Made workspace_client optional in DQProfiler and DQGenerator |
| src/databricks/labs/dqx/checks_storage.py | Applied PickleableMixin to storage handlers and made workspace_client optional |
| src/databricks/labs/dqx/installer/mixins.py | Updated InstallationMixin to inherit from PickleableMixin and use lazy client access |
| src/databricks/labs/dqx/installer/workflow_installer.py | Changed _ws to ws property references |
| src/databricks/labs/dqx/contexts/workspace_context.py | Made workspace_client optional with lazy initialization |
| tests/integration/test_object_serialization.py | New integration tests for pickling various DQX objects |
| tests/integration/test_apply_checks.py | Added test for foreachBatch usage with DQEngine |
| tests/integration/test_utils.py | Added test for get_workspace_client() |
| pyproject.toml | Added cloudpickle dependency |
| tests/conftest.py | Fixed comment formatting |

telemetry so that requests are attributed to *dqx*.
"""
- return self._workspace_client
+ return self._ws or get_workspace_client()

Can we simplify this and keep this logic in the constructor only, to avoid duplication?
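
One possible reading of this suggestion, as a sketch only: resolve the fallback once in `__init__` so that accessors simply return the attribute. `EngineBase`, `FakeWorkspaceClient`, and the simplified `get_workspace_client()` below are hypothetical stand-ins:

```python
class FakeWorkspaceClient:
    """Hypothetical stand-in for the real WorkspaceClient."""


def get_workspace_client():
    """Simplified stand-in for the utility function discussed in this PR."""
    return FakeWorkspaceClient()


class EngineBase:
    """Hypothetical base class that resolves the client in one place."""

    def __init__(self, ws=None):
        # Resolve the fallback once, here, instead of in every accessor.
        self._ws = ws if ws is not None else get_workspace_client()

    @property
    def ws(self):
        return self._ws
```

One caveat with this shape: unpickling does not run `__init__`, so if the client is dropped in `__getstate__`, `__setstate__` would need to apply the same fallback.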

def __getstate__(self):
    """Removes the WorkspaceClient when the object is pickled."""
    state = self.__dict__.copy()
    if '_ws' in state:

@mwojtyczka mwojtyczka Nov 18, 2025

Maybe we should standardize the naming and always use either ws or workspace client? We do the same in telemetry, and we could avoid all these if statements.
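
To illustrate the point (a sketch with hypothetical class names; the `_ws` and `_workspace_client` attribute names are assumptions), compare a mixin that has to check several attribute names with one that relies on a single name:

```python
class MixedNamesMixin:
    """Needs one branch per attribute name used across classes."""

    def __getstate__(self):
        state = self.__dict__.copy()
        if "_ws" in state:
            state["_ws"] = None
        if "_workspace_client" in state:
            state["_workspace_client"] = None
        return state


class SingleNameMixin:
    """With one agreed name, no branching is needed."""

    def __getstate__(self):
        return {**self.__dict__, "_ws": None}


class LegacyHandler(MixedNamesMixin):
    """Hypothetical class using the longer attribute name."""

    def __init__(self, client):
        self._workspace_client = client


class Handler(SingleNameMixin):
    """Hypothetical class using the standardized attribute name."""

    def __init__(self, client):
        self._ws = client
```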

codecov bot commented Nov 19, 2025

Codecov Report

❌ Patch coverage is 48.31461% with 46 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.15%. Comparing base (e34661b) to head (408c16d).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| src/databricks/labs/dqx/checks_storage.py | 40.00% | 24 Missing ⚠️ |
| src/databricks/labs/dqx/installer/mixins.py | 50.00% | 6 Missing ⚠️ |
| src/databricks/labs/dqx/base.py | 37.50% | 5 Missing ⚠️ |
| src/databricks/labs/dqx/mixins.py | 69.23% | 4 Missing ⚠️ |
| .../databricks/labs/dqx/contexts/workspace_context.py | 25.00% | 3 Missing ⚠️ |
| src/databricks/labs/dqx/utils.py | 25.00% | 3 Missing ⚠️ |
| src/databricks/labs/dqx/profiler/profiler.py | 0.00% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (e34661b) and HEAD (408c16d).

HEAD has 1 upload less than BASE:

| Flag | BASE (e34661b) | HEAD (408c16d) |
|------|----------------|----------------|
|      | 2              | 1              |

Additional details and impacted files
@@             Coverage Diff             @@
##             main     #931       +/-   ##
===========================================
- Coverage   90.02%   54.15%   -35.87%     
===========================================
  Files          60       61        +1     
  Lines        5221     5268       +47     
===========================================
- Hits         4700     2853     -1847     
- Misses        521     2415     +1894     

Development

Successfully merging this pull request may close these issues.

[BUG]: Using instantiated DQengine inside foreachbatch in serverless/shared access mode is throwing up error

3 participants