@jominjohny
Contributor

Changes

LLM-based primary key (PK) detector

Linked issues

#484

Resolves #..

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

@github-actions

github-actions bot commented Nov 17, 2025

✅ 426/426 passed, 1 flaky, 35 skipped, 3h37m52s total

Flaky tests:

  • 🤪 test_e2e_workflow_serverless (10m2.645s)

Running from acceptance #3228

The cyclic-import warnings are pre-existing architectural issues in the codebase and are not caused by our changes. They all appear in tests/e2e/test_pii_detection_checks.py.
However, we can disable the cyclic-import warning globally since:

  • These are complex architectural dependencies
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces LLM-based primary key detection functionality for the dqx library, enabling automatic identification of primary key columns in database tables using AI analysis.

Key changes:

  • Added DatabricksPrimaryKeyDetector class for LLM-powered primary key detection with duplicate validation and retry logic
  • Integrated PK detection into the profiler workflow via ProfilerRunner
  • Created compare_datasets_with_llm wrapper function that auto-detects primary keys for dataset comparison
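
The "duplicate validation and retry logic" in the first bullet could be sketched roughly as follows. This is not the PR's DatabricksPrimaryKeyDetector implementation: `has_duplicates`, `detect_primary_key`, and the `propose` callback (standing in for the LLM call) are hypothetical names, and plain in-memory rows are assumed instead of a Spark table.

```python
from collections import Counter
from typing import Callable, List, Mapping, Optional, Sequence

Row = Mapping[str, object]
# Hypothetical stand-in for the LLM call: receives the candidates rejected
# so far and returns a new list of candidate key columns.
Proposer = Callable[[List[List[str]]], List[str]]


def has_duplicates(rows: Sequence[Row], key_columns: Sequence[str]) -> bool:
    """Return True if any combination of key column values occurs more than once."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return any(n > 1 for n in counts.values())


def detect_primary_key(
    rows: Sequence[Row], propose: Proposer, max_retries: int = 3
) -> Optional[List[str]]:
    """Ask for candidate keys until one is unique across rows or retries run out."""
    rejected: List[List[str]] = []
    for _ in range(max_retries):
        candidate = list(propose(rejected))
        if candidate and not has_duplicates(rows, candidate):
            return candidate
        # Remember the failed candidate so the next proposal can avoid it.
        rejected.append(candidate)
    return None


# Example with a fake proposer: the first guess has duplicates, the second does not.
rows = [
    {"order_id": 1, "customer": "a"},
    {"order_id": 1, "customer": "b"},
    {"order_id": 2, "customer": "a"},
]


def fake_propose(rejected: List[List[str]]) -> List[str]:
    return ["order_id"] if not rejected else ["order_id", "customer"]


result = detect_primary_key(rows, fake_propose)
```

Passing the rejected candidates back to the proposer is one possible design; it lets a retry prompt tell the model which column sets already failed the uniqueness check.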

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/databricks/labs/dqx/llm/llm_pk_detector.py | Core implementation of DSPy-based primary key detector with Databricks integration |
| src/databricks/labs/dqx/profiler/generator.py | Added detect_primary_keys_with_llm method to DQGenerator |
| src/databricks/labs/dqx/profiler/profiler.py | Added convenience method for PK detection via DQProfiler |
| src/databricks/labs/dqx/profiler/profiler_runner.py | Integrated automatic PK detection into profiler workflow |
| src/databricks/labs/dqx/check_funcs.py | Added wrapper function for dataset comparison with auto PK detection |
| src/databricks/labs/dqx/llm/__init__.py | Updated LLM dependencies to include langchain packages |
| tests/unit/test_llm_pk_integration.py | Unit tests for LLM PK detection functionality |
| tests/integration/test_llm_pk_detection.py | Integration tests for end-to-end PK detection |
| pyproject.toml | Updated dependencies and pylint configuration |
| demos/dqx_demo_llm_pk_detection.py | Demo notebook showcasing PK detection capabilities |

