@rob-1019 rob-1019 commented Dec 2, 2025

Optimize regex evaluation in the ingestion hot path by pre-compiling each regex once at the start of ingestion instead of on every evaluation.

AllowDenyPattern now compiles regex patterns once using cached_property instead of recompiling on every match. This affects database, schema, and table filtering across all SQL connectors and many non-SQL sources including BigQuery, S3, Kafka, Looker, PowerBI, and others.

Similarly, Snowflake's temporary_tables_pattern is now compiled once at config initialization rather than on every table check.

The optimization reduces regex compilation overhead in the hot path during metadata extraction without changing filtering behavior.
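
As a rough illustration of the compile-once approach described above, here is a minimal sketch using `functools.cached_property`. The class `AllowDenySketch`, its fields, and its `allowed` method are hypothetical stand-ins chosen for this example; they are not DataHub's actual `AllowDenyPattern` implementation.

```python
import re
from functools import cached_property
from typing import List


class AllowDenySketch:
    """Hypothetical allow/deny filter; not DataHub's AllowDenyPattern."""

    def __init__(self, allow: List[str], deny: List[str], ignore_case: bool = True) -> None:
        self.allow = allow
        self.deny = deny
        self.ignore_case = ignore_case

    @property
    def _flags(self) -> int:
        return re.IGNORECASE if self.ignore_case else 0

    @cached_property
    def _compiled_allow(self) -> List["re.Pattern[str]"]:
        # Runs on first access only; the list is then stored on the instance.
        return [re.compile(p, self._flags) for p in self.allow]

    @cached_property
    def _compiled_deny(self) -> List["re.Pattern[str]"]:
        return [re.compile(p, self._flags) for p in self.deny]

    def allowed(self, name: str) -> bool:
        # Deny rules win; otherwise at least one allow rule must match.
        if any(p.match(name) for p in self._compiled_deny):
            return False
        return any(p.match(name) for p in self._compiled_allow)
```

One consequence of this design, under the assumptions of the sketch: changing `allow` or `deny` after the first call to `allowed` would not be picked up, because the compiled lists are already cached on the instance.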

  • [x] The PR conforms to DataHub's Contributing Guideline (particularly PR Title Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added ingestion PR or Issue related to the ingestion of metadata community-contribution PR or Issue raised by member(s) of DataHub Community labels Dec 2, 2025
@datahub-cyborg datahub-cyborg bot added the needs-review Label for PRs that need review from a maintainer. label Dec 2, 2025
Switch from cached-property package to stdlib functools.cached_property
to fix mypy type checking errors with disallow_untyped_decorators.
… incompatibility

Replace @cached_property with manual caching using hasattr/setattr pattern.
This avoids triggering stricter Pydantic inspection on Python 3.11.
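
A minimal sketch of what manual caching with `hasattr`/`setattr` could look like on a plain class. The class name `ManualCacheSketch` and the attribute `_compiled_patterns` are assumptions made for illustration; the actual commit may differ, particularly in how it interacts with Pydantic models.

```python
import re
from typing import List


class ManualCacheSketch:
    """Hypothetical config object that caches compiled patterns by hand."""

    def __init__(self, patterns: List[str]) -> None:
        self.patterns = patterns

    def compiled_patterns(self) -> List["re.Pattern[str]"]:
        # Compile on the first call, then reuse the list stored on the instance.
        if not hasattr(self, "_compiled_patterns"):
            compiled = [re.compile(p, re.IGNORECASE) for p in self.patterns]
            setattr(self, "_compiled_patterns", compiled)
        return getattr(self, "_compiled_patterns")
```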
…lity

Use stdlib functools.cached_property for new compiled pattern caching,
and add it to ConfigModel.ignored_types alongside the cached-property
package to ensure Pydantic v2 compatibility on Python 3.11+.
@anshbansal anshbansal removed the community-contribution PR or Issue raised by member(s) of DataHub Community label Dec 2, 2025
Contributor

@sgomezvillamor sgomezvillamor left a comment

Could add docstrings to the cached properties explaining the performance motivation
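
For instance, a docstring along these lines might capture the motivation. This is only a sketch: `TempTableFilterSketch` and `_compiled_temporary_tables_pattern` are hypothetical names, not the identifiers used in the PR.

```python
import re
from functools import cached_property
from typing import List


class TempTableFilterSketch:
    """Hypothetical holder for a list of temp-table regex strings."""

    def __init__(self, temporary_tables_pattern: List[str]) -> None:
        self.temporary_tables_pattern = temporary_tables_pattern

    @cached_property
    def _compiled_temporary_tables_pattern(self) -> List["re.Pattern[str]"]:
        """Compiled form of ``temporary_tables_pattern``.

        Cached because the temp-table check runs for every table during
        ingestion; compiling once here avoids repeating that work per check.
        """
        return [re.compile(p, re.IGNORECASE) for p in self.temporary_tables_pattern]

    def is_temp_table(self, name: str) -> bool:
        return any(p.match(name) for p in self._compiled_temporary_tables_pattern)
```

`functools.cached_property` copies the wrapped function's `__doc__`, so the explanation stays visible through `help()` on the class.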

@datahub-cyborg datahub-cyborg bot added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed needs-review Label for PRs that need review from a maintainer. labels Dec 3, 2025
sgomezvillamor added a commit that referenced this pull request Dec 3, 2025
Extends Rob's regex optimization pattern (#15463) to additional ingestion hot paths:

1. **SqlQueriesSource**: Pre-compile temp_table_patterns using @cached_property
   - Called for every table during query processing
   - Eliminates repeated regex compilation overhead

2. **BigQuery**: Pre-compile sharded table & wildcard patterns at module level
   - get_table_and_shard(): Called for every BigQuery table
   - get_table_display_name(): Called for table name normalization
   - is_sharded_table(): Called during table classification

3. **PowerBI ODBC**: Pre-compile platform detection patterns at module level
   - normalize_platform_from_driver(): Called for every ODBC connection
   - normalize_platform_name(): Called during platform normalization
   - Affects 18+ database platform patterns

All changes follow the same optimization strategy as #15463:
- Compile regex patterns once at initialization
- Use compiled Pattern objects in hot path
- Maintain exact behavioral equivalence
- No config changes or breaking changes

Expected impact: Performance improvement for ingestion workloads with:
- High volume of temp table checks (SqlQueriesSource)
- Large BigQuery datasets with sharded tables
- PowerBI sources with many ODBC connections

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
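
The module-level variant mentioned in the commit message above can be sketched as follows. The regex, the constant `_SHARD_SUFFIX`, and the function `split_shard` are assumptions for illustration only; they are not the patterns or function bodies used in DataHub's BigQuery source.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: a base name, an underscore, then a YYYYMMDD suffix.
# Compiled once at import time rather than inside the function body.
_SHARD_SUFFIX = re.compile(r"^(?P<base>.+)_(?P<shard>\d{8})$")


def split_shard(table_name: str) -> Tuple[str, Optional[str]]:
    """Return (base_name, shard) or (table_name, None) if there is no shard suffix."""
    m = _SHARD_SUFFIX.match(table_name)
    if m is None:
        return table_name, None
    return m.group("base"), m.group("shard")
```

Compared with `cached_property`, a module-level constant suits patterns that do not depend on user config, since every caller shares the same compiled object.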