perf(ingestion): compile regex patterns for ingestion filtering hot path #15463
Open: rob-1019 wants to merge 8 commits into datahub-project:master from rob-1019:compile-regex
+35
−15
AllowDenyPattern now compiles regex patterns once using cached_property instead of recompiling on every match. This affects database, schema, and table filtering across all SQL connectors and many non-SQL sources including BigQuery, S3, Kafka, Looker, PowerBI, and others. Similarly, Snowflake's temporary_tables_pattern is now compiled once at config initialization rather than on every table check. The optimization reduces regex compilation overhead in the hot path during metadata extraction without changing filtering behavior.
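The compile-once approach described above can be sketched roughly as follows. This is a minimal illustration, not the DataHub implementation: the class name `AllowDenySketch`, its attribute names, and the matching semantics (deny checked first, `re.match` anchoring, case-insensitive by default) are assumptions made for the example.

```python
import re
from functools import cached_property
from typing import List, Optional, Pattern


class AllowDenySketch:
    """Hypothetical stand-in for an allow/deny filter; not the DataHub class."""

    def __init__(
        self,
        allow: Optional[List[str]] = None,
        deny: Optional[List[str]] = None,
        ignore_case: bool = True,
    ) -> None:
        self.allow = allow if allow is not None else [".*"]
        self.deny = deny if deny is not None else []
        self.ignore_case = ignore_case

    @property
    def _flags(self) -> int:
        return re.IGNORECASE if self.ignore_case else 0

    @cached_property
    def _compiled_allow(self) -> List[Pattern[str]]:
        # Runs on first access only; the result is stored on the instance.
        return [re.compile(p, self._flags) for p in self.allow]

    @cached_property
    def _compiled_deny(self) -> List[Pattern[str]]:
        return [re.compile(p, self._flags) for p in self.deny]

    def allowed(self, name: str) -> bool:
        # Hot path: only pre-compiled Pattern objects are used here.
        if any(p.match(name) for p in self._compiled_deny):
            return False
        return any(p.match(name) for p in self._compiled_allow)


pattern = AllowDenySketch(allow=[r"prod_.*"], deny=[r".*_tmp$"])
print(pattern.allowed("prod_orders"))
print(pattern.allowed("prod_orders_tmp"))
```

Because `cached_property` stores its result on the instance, a sketch like this assumes the pattern lists are not mutated after the first call to `allowed`.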
Switch from cached-property package to stdlib functools.cached_property to fix mypy type checking errors with disallow_untyped_decorators.
… incompatibility Replace @cached_property with manual caching using hasattr/setattr pattern. This avoids triggering stricter Pydantic inspection on Python 3.11.
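For comparison, the manual hasattr/setattr caching mentioned in this commit might look something like the sketch below. It uses a plain Python class rather than a Pydantic model, and the class name `TempTableConfigSketch`, the method names, and the private attribute name are all hypothetical. A Pydantic model may restrict setting undeclared attributes, so the actual change may have had to store the cache differently.

```python
import re
from typing import Iterable, List, Pattern


class TempTableConfigSketch:
    """Hypothetical config object; plain class, not a Pydantic model."""

    def __init__(self, temporary_tables_pattern: Iterable[str]) -> None:
        self.temporary_tables_pattern = list(temporary_tables_pattern)

    def compiled_temporary_tables_pattern(self) -> List[Pattern[str]]:
        # Manual cache: compile on the first call, reuse afterwards.
        if not hasattr(self, "_compiled_temp_patterns"):
            setattr(
                self,
                "_compiled_temp_patterns",
                [
                    re.compile(p, re.IGNORECASE)
                    for p in self.temporary_tables_pattern
                ],
            )
        return getattr(self, "_compiled_temp_patterns")

    def is_temporary_table(self, name: str) -> bool:
        return any(
            p.match(name) for p in self.compiled_temporary_tables_pattern()
        )


cfg = TempTableConfigSketch([r".*\.tmp_.*"])
print(cfg.is_temporary_table("analytics.tmp_sessions"))
```

The trade-off against `cached_property` is mostly cosmetic: no decorator is involved, at the cost of a string attribute name that type checkers cannot verify.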
…lity Use stdlib functools.cached_property for new compiled pattern caching, and add it to ConfigModel.ignored_types alongside the cached-property package to ensure Pydantic v2 compatibility on Python 3.11+.
sgomezvillamor (Contributor) approved these changes on Dec 3, 2025 and left a comment:
Could add docstrings to the cached properties explaining the performance motivation
sgomezvillamor added a commit that referenced this pull request on Dec 3, 2025
Extends Rob's regex optimization pattern (#15463) to additional ingestion hot paths:

1. **SqlQueriesSource**: Pre-compile temp_table_patterns using @cached_property
   - Called for every table during query processing
   - Eliminates repeated regex compilation overhead
2. **BigQuery**: Pre-compile sharded table & wildcard patterns at module level
   - get_table_and_shard(): Called for every BigQuery table
   - get_table_display_name(): Called for table name normalization
   - is_sharded_table(): Called during table classification
3. **PowerBI ODBC**: Pre-compile platform detection patterns at module level
   - normalize_platform_from_driver(): Called for every ODBC connection
   - normalize_platform_name(): Called during platform normalization
   - Affects 18+ database platform patterns

All changes follow the same optimization strategy as #15463:
- Compile regex patterns once at initialization
- Use compiled Pattern objects in hot path
- Maintain exact behavioral equivalence
- No config changes or breaking changes

Expected impact: Performance improvement for ingestion workloads with:
- High volume of temp table checks (SqlQueriesSource)
- Large BigQuery datasets with sharded tables
- PowerBI sources with many ODBC connections

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
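The module-level variant mentioned in this commit message could look roughly like the sketch below. The regex and the function `split_table_and_shard` are hypothetical illustrations (a table name ending in an eight-digit date suffix), not the patterns or functions used in the BigQuery source.

```python
import re
from typing import Optional, Tuple

# Compiled once at import time. Hypothetical pattern: "<base>_<YYYYMMDD>".
_SHARD_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<shard>\d{8})$")


def split_table_and_shard(name: str) -> Tuple[str, Optional[str]]:
    """Return (base_name, shard), or (name, None) if there is no shard suffix."""
    match = _SHARD_SUFFIX.match(name)
    if match is None:
        return name, None
    return match.group("base"), match.group("shard")


print(split_table_and_shard("events_20240101"))
print(split_table_and_shard("events"))
```

A module-level constant suits patterns that are fixed in code, whereas the cached-property approach suits patterns that come from user configuration and are only known once the config object exists.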
Labels
ingestion
PR or Issue related to the ingestion of metadata
merge-pending-ci
A PR that has passed review and should be merged once CI is green.
Optimize regex evaluation in the ingestion hot path by pre-compiling regexes once at the start of ingestion instead of at each evaluation.
AllowDenyPattern now compiles regex patterns once using cached_property instead of recompiling on every match. This affects database, schema, and table filtering across all SQL connectors and many non-SQL sources including BigQuery, S3, Kafka, Looker, PowerBI, and others.
Similarly, Snowflake's temporary_tables_pattern is now compiled once at config initialization rather than on every table check.
The optimization reduces regex compilation overhead in the hot path during metadata extraction without changing filtering behavior.