feat(ingestion/hive-metastore): add upstream lineage to hive-metastore #15435
base: master
Codecov Report ❌ Patch coverage
metadata-ingestion/src/datahub/ingestion/source/sql/hive/hive_metastore_source.py
    except (ValueError, TypeError, AttributeError) as e:
        logger.warning(
            f"Failed to create storage dataset MCPs for {storage_location}: {e}",
            exc_info=True,
        )
        return
Is this exception-swallowing pattern what we usually do in other sources when emission fails?
The code coverage report shows little coverage in failure/exception scenarios; we could improve a little there, mainly in hive_source.py. Also concerning is the lack of coverage for …
I'll add these today so that we can get this signed off.
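As a rough illustration of the failure-path coverage discussed above, a test for a handler shaped like the reviewed `except` block could look like the sketch below. `emit_storage_dataset` and its `build_mcps` argument are hypothetical stand-ins, not names from this PR; only the `except` clause mirrors the quoted snippet.

```python
import logging
import unittest
from typing import Callable, Iterable, List

logger = logging.getLogger("hive_storage_lineage_sketch")


def emit_storage_dataset(
    storage_location: str,
    build_mcps: Callable[[str], Iterable[str]],
) -> List[str]:
    """Hypothetical wrapper: return the built items, or [] if building fails."""
    try:
        return list(build_mcps(storage_location))
    except (ValueError, TypeError, AttributeError) as e:
        # Same shape as the reviewed handler: log with traceback, emit nothing.
        logger.warning(
            f"Failed to create storage dataset MCPs for {storage_location}: {e}",
            exc_info=True,
        )
        return []


class FailurePathTest(unittest.TestCase):
    def test_failure_is_logged_and_nothing_is_emitted(self) -> None:
        def broken_builder(location: str) -> Iterable[str]:
            raise ValueError("bad location")

        # assertLogs fails the test if no WARNING is emitted on this logger.
        with self.assertLogs(logger, level="WARNING") as captured:
            result = emit_storage_dataset("s3://bucket/path", broken_builder)

        self.assertEqual(result, [])
        self.assertIn("s3://bucket/path", captured.output[0])
```

Asserting on the log record, rather than only on the empty result, makes the test fail if the handler is later changed to swallow the error silently.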
This reverts commit 8e8549d.
Add Storage Lineage to Hive/Hive Metastore + Code Refactoring
Summary
Adds storage lineage support to Hive and Hive Metastore connectors, enabling lineage tracking between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, DBFS). Also refactors Hive sources into a clean directory structure.
Key Changes
Storage Lineage (Opt-in Feature)
New configuration options (disabled by default):
- emit_storage_lineage: Enable storage lineage extraction
- hive_storage_lineage_direction: Set direction (upstream or downstream)
- include_column_lineage: Enable column-level lineage
- storage_platform_instance: Platform instance for storage URNs
Supported platforms: S3, Azure (ADLS/ABFS), GCS, HDFS, DBFS, local files
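To make the options and platform list above concrete, here is a minimal sketch of a config fragment and of how a table's storage location might be resolved to a platform. The option names come from the description; the values, the `SCHEME_TO_PLATFORM` table, the platform identifiers (for example `abs` for Azure), and `storage_platform_for` are assumptions for illustration and may not match the PR's implementation.

```python
from typing import Optional
from urllib.parse import urlparse

# Option names are from the PR description; the values are illustrative only.
example_config = {
    "emit_storage_lineage": True,
    "hive_storage_lineage_direction": "upstream",
    "include_column_lineage": True,
    "storage_platform_instance": "prod",  # hypothetical instance name
}

# Assumed scheme-to-platform table; the PR's real mapping may differ.
SCHEME_TO_PLATFORM = {
    "s3": "s3",
    "s3a": "s3",
    "s3n": "s3",
    "abfs": "abs",
    "abfss": "abs",
    "adl": "abs",
    "wasb": "abs",
    "wasbs": "abs",
    "gs": "gcs",
    "hdfs": "hdfs",
    "dbfs": "dbfs",
    "file": "file",
}


def storage_platform_for(location: str) -> Optional[str]:
    """Return the assumed platform for a storage URI, or None if unrecognized."""
    scheme = urlparse(location).scheme.lower()
    return SCHEME_TO_PLATFORM.get(scheme)


if __name__ == "__main__":
    for uri in ("s3a://bucket/warehouse/db.db/t", "dbfs:/mnt/data/t", "/tmp/t"):
        print(uri, "->", storage_platform_for(uri))
```

Returning `None` for an unrecognized scheme lets the caller skip lineage for that table instead of raising.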
Code Refactoring
- hive.py → hive/hive_source.py
- hive_metastore.py → hive/hive_metastore_source.py
- hive/storage_lineage.py with shared logic
- HiveStorageLineageConfigMixin to eliminate duplication
- setup.py entry points to use fully qualified paths
Code Quality Improvements
- HiveStorageLineageConfig to Pydantic model
- LineageDirection and StoragePlatform StrEnums for type safety
- get_db_schema (now raises ValueError for invalid input)
- get_workunits_internal (specific exceptions + proper logging)
PR Checks
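The code-quality items listed above (enum-typed direction, a validated config, and `ValueError` on invalid input) can be sketched with the standard library alone. This is not the PR's code: a `dataclass` stands in for the Pydantic model, a `str`-mixin `Enum` stands in for the StrEnum, the defaults are assumed, and `get_db_schema_sketch` with its `<db>.<table>` input format is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class LineageDirection(str, Enum):
    """String-valued enum, so config values compare equal to plain strings."""

    UPSTREAM = "upstream"
    DOWNSTREAM = "downstream"


@dataclass
class StorageLineageConfigSketch:
    # Defaults are assumptions; the description only says the feature is opt-in.
    emit_storage_lineage: bool = False
    hive_storage_lineage_direction: LineageDirection = LineageDirection.UPSTREAM
    include_column_lineage: bool = False
    storage_platform_instance: Optional[str] = None

    def __post_init__(self) -> None:
        # Enum lookup by value raises ValueError for any unexpected string.
        self.hive_storage_lineage_direction = LineageDirection(
            self.hive_storage_lineage_direction
        )


def get_db_schema_sketch(dataset_identifier: str) -> Tuple[str, str]:
    """Hypothetical parser: split '<db>.<table>' or raise ValueError."""
    parts = dataset_identifier.split(".") if dataset_identifier else []
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"Expected '<db>.<table>', got {dataset_identifier!r}")
    return parts[0], parts[1]
```

Failing fast with `ValueError` at parse time keeps bad identifiers from propagating into URN construction, where the resulting error would be harder to trace.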