-
Notifications
You must be signed in to change notification settings - Fork 92
implement get_lineage tool for complete lineage across dbt resources #461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
DevonFulcher
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work!
src/dbt_mcp/discovery/client.py
Outdated
| } | ||
| """) | ||
|
|
||
| GET_SEEDS = textwrap.dedent(""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just merged a PR that adds GQL queries for seeds & snapshots. Can we reuse those queries here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the new GQL folder pattern, I have placed the GQLs in a folder, but I dont think we can reuse the existing GQLs. The reason is because the queries use resources() with AppliedResourcesFilter, which requires knowing the unique_id upfront (via the uniqueIds array).
The new search gql queries we created for the lineage are lightweight searches using the specific seeds()/snapshots() endpoints with GenericMaterializedFilter's identifier field for name-based matching.
Through introspection, I noticed that GenericMaterializedFilter has native identifier support for name matching, so we use that for 1-call lookups instead of the package enumeration pattern.
src/dbt_mcp/discovery/tools.py
Outdated
| direction: str = "both", | ||
| types: list[str] | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better if these were enums or some other stronger type than strings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I have Implemented this.
| matches = await context.lineage_fetcher.search_all_resources(name) | ||
| if not matches: | ||
| raise InvalidParameterError( | ||
| f"No resource found with name '{name}' in searchable resource types " | ||
| f"(models, sources, seeds, snapshots).\n\n" | ||
| f"If this is an exposure, test, or metric, you must use the full unique_id instead:\n" | ||
| f" • For exposures: get_lineage(unique_id='exposure.project.{name}')\n" | ||
| f" • For tests: get_lineage(unique_id='test.project.{name}')\n" | ||
| f" • For metrics: get_lineage(unique_id='metric.project.{name}')\n\n" | ||
| f"Note: The Discovery API does not support searching exposures, tests, or metrics by name. " | ||
| f"You can find unique IDs in your dbt Cloud project or manifest.json." | ||
| ) | ||
| if len(matches) == 1: | ||
| resolved_unique_id = matches[0]["uniqueId"] | ||
| else: | ||
| # Multiple matches - try elicitation first, fallback to disambiguation | ||
| try: | ||
| # Format matches for display | ||
| match_descriptions = [ | ||
| f"{m['resourceType']}: {m['uniqueId']}" for m in matches | ||
| ] | ||
| message = ( | ||
| f"Multiple resources found with name '{name}':\n" | ||
| + "\n".join(f" {i+1}. {desc}" for i, desc in enumerate(match_descriptions)) | ||
| + "\n\nSelect the unique_id of the resource you want:" | ||
| ) | ||
|
|
||
| result = await ctx.elicit( | ||
| message=message, | ||
| schema=ResourceSelection | ||
| ) | ||
|
|
||
| if result.action == "accept": | ||
| # Validate the selected unique_id is in matches | ||
| selected_id = result.data.unique_id | ||
| if selected_id in [m["uniqueId"] for m in matches]: | ||
| resolved_unique_id = selected_id | ||
| else: | ||
| raise InvalidParameterError( | ||
| f"Selected unique_id '{selected_id}' not in available matches" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty interesting! How do you think it compares to the approach I took in my recent changes here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually came up with this pattern after introspection of the Discovery API since ModelAppliedFilter, SourceAppliedFilter, and GenericMaterializedFilter all have an identifier field for name-based searching, I just used that native support instead which means we can just make fewer calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think either approach works well, and I feel okay with trying this out. It will be our first use of elicitation, so that is exciting! It is a creative solution. Thanks for that.
At some point, we should consolidate our approaches, though. I would like the Discovery tools to work uniformly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm yeah definitely, I think we should be able to get elicitation to work with either approaches. Question for you then can we assume the approach here would be the standard going forward https://github.com/dbt-labs/dbt-mcp/blob/main/src/dbt_mcp/discovery/client.py#L625-L650?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the approach could be: try elicitation. If the client doesn't support elicitation, fall back to the method you linked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point is that this could be the method for all discovery tools, not just lineage. That uniformity doesn't have to be done in this PR, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright I will just raise a separate issue to standardize the approach for discovery tools
| ) | ||
| descendants_result = await self._fetch_lineage_single_direction( | ||
| unique_id, LineageDirection.DESCENDANTS, types | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the reason for the two API calls to simplify labeling ancestors and descendants? The selector syntax allows for both directions here, but I assume the separate calls make it easier to label the nodes. Is that right?
One suggestion, if we keep two separate calls, they can be done concurrently with asyncio.gather().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup! the reason for the two calls is so we can group them into ancestors vs descendants. When using +uniqueId+, the API returns everything mixed together, so we'd need extra logic to figure out which nodes are uancestors vs descendants. I have implemented the asyncio.gather() for parallel execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a new get_lineage tool that retrieves complete upstream and downstream lineage for dbt resources in a single API call, addressing user feedback about needing multiple calls to traverse the lineage graph. The implementation includes name-based search with disambiguation support and fallback mechanisms for resource types that require unique IDs.
Key Changes:
- New
LineageFetcherclass handles resource search and lineage retrieval across models, sources, seeds, and snapshots - Interactive disambiguation via MCP elicitation when multiple resources match a name
- Client-side pagination limiting results to 50 nodes per direction
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/dbt_mcp/discovery/client.py |
Adds LineageFetcher class with search and lineage fetching capabilities, including GraphQL queries for seeds, snapshots, and lineage |
src/dbt_mcp/discovery/tools.py |
Implements get_lineage tool with parameter validation, name resolution, and elicitation-based disambiguation |
src/dbt_mcp/prompts/discovery/get_lineage.md |
Documents tool usage, parameters, limitations, and examples |
src/dbt_mcp/tools/tool_names.py |
Registers new GET_LINEAGE tool name |
src/dbt_mcp/tools/toolsets.py |
Adds GET_LINEAGE to discovery toolset |
tests/unit/discovery/test_lineage_fetcher.py |
Tests for resource search, selector building, lineage fetching, and pagination |
tests/unit/discovery/test_get_lineage_tool.py |
Tests for parameter validation, name resolution, and disambiguation flow |
tests/unit/discovery/conftest.py |
Adds fixtures for LineageFetcher and mock contexts |
.changes/unreleased/Enhancement or New Feature-20251129-121002.yaml |
Changelog entry for the new feature |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "matches": matches, | ||
| "instruction": "Please call get_lineage again with the unique_id parameter set to one of the matches above.", | ||
| } | ||
|
|
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type ignore comment suggests that resolved_unique_id could potentially be None at runtime. Since the code paths ensure it's always assigned (either from unique_id parameter, single match, or elicitation), consider adding an assertion to make this guarantee explicit: assert resolved_unique_id is not None before line 291.
| assert resolved_unique_id is not None, "resolved_unique_id must not be None at this point" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| {"uniqueId": "source.test.raw.customers", "name": "customers", "resourceType": "Source"}, | ||
| ] | ||
| # Simulate elicitation failure | ||
| mock_mcp_context.elicit.side_effect = Exception("Elicitation not supported") |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test simulates a generic Exception, but the actual code catches all exceptions at line 281 in tools.py. Consider testing with more specific exception types (e.g., TimeoutError, ConnectionError) to ensure the fallback behavior works correctly for different failure scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/dbt_mcp/discovery/client.py
Outdated
| "first": PAGE_SIZE, | ||
| } | ||
|
|
||
| # Execute query - already a direct reference to the query string |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The comment on line 1026 'already a direct reference to the query string' is unclear about why this reassignment is necessary. Consider clarifying that this extracts the GraphQL query string from the configuration dictionary for readability.
| # Execute query - already a direct reference to the query string | |
| # Extract the GraphQL query string from the configuration dictionary for readability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment clarified added in branch
src/dbt_mcp/discovery/tools.py
Outdated
| # Format matches for display | ||
| match_descriptions = [ | ||
| f"{m['resourceType']}: {m['uniqueId']}" for m in matches | ||
| ] | ||
| message = ( | ||
| f"Multiple resources found with name '{name}':\n" | ||
| + "\n".join(f" {i+1}. {desc}" for i, desc in enumerate(match_descriptions)) | ||
| + "\n\nSelect the unique_id of the resource you want:" |
Copilot
AI
Dec 2, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The message construction spans multiple operations (list comprehension, string concatenation, enumeration). Consider extracting this into a helper method _format_disambiguation_message(name: str, matches: list[dict]) -> str to improve readability and testability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added _format_disambiguation_message
461f5fc to
62cf578
Compare
| # ============================================================================ | ||
| # Lineage Tool Classes and Configuration | ||
| # ============================================================================ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than a demarcation like this, should we move this to a different file? It seems we could break this file up a bit more in general.
| logger.debug(f"Elicitation not completed: {type(e).__name__}: {e}") | ||
| return { | ||
| "status": "disambiguation_required", | ||
| "message": f"Multiple resources found with name '{name}'", | ||
| "matches": matches, | ||
| "instruction": "Please call get_lineage again with the unique_id parameter set to one of the matches above.", | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should handle this situation in a different way. Not all clients support elicitation, so we should fall back to a different method. Perhaps we should return a list of trees? Additionally, are you familiar with how elicitation is handled with remote MCP? We should ensure this tool is compatible with remote MCP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I will remove elicitation for now as I am not sure how it would work in remote mcp, I will create a different issue to look into this approach.
Summary
A User building a catalog-like experience with dbt and Snowflake via Claude Desktop requested full lineage capabilities #110:
What Changed
Discovery Phase: GraphQL Introspection
We used GraphQL introspection to validate design decisions and understand the Discovery API's capabilities.
Why we used Sequential Search Strategy
Question: Does
LineageFiltersupport searching by name/identifier? LineageFilterFinding: -
LineageFilteronly accepts full unique IDs (noidentifierornamefield). Due to this Sequential search across resource types is necessary.Two API Calls for "both" Direction
Question: Does
ModelLineageNodeinclude direction metadata to categorize ancestors vs descendants?View ModelLineageNode Introspection
Finding: - No direction metadata exists (no
relationshipDirection,isAncestor,isDescendant, ordependsOnfields). Two separate API calls required for "both" direction.Graphql Client-Side Pagination
Question: Does the
lineagequery support server-side pagination parameters?Introspect
Finding: - No pagination parameters available (no
first,after,offset, orlimitarguments). Client-side 50-node limit is necessary.List Of Resource Search Support
Question: Which resource types support searching by name (via
identifierfield)?link Introspection
Finding: - 4 out of 7 resource types support name search:
ModelAppliedFilter.identifier)SourceAppliedFilter.identifier)GenericMaterializedFilter.identifier)GenericMaterializedFilter.identifier)identifierfield)identifierfield)Design Approach
Complete Resource Search Coverage
Based on introspection findings:
identifier?ModelAppliedFilterSourceAppliedFilterGenericMaterializedFilterGenericMaterializedFilterExposureFilterTestAppliedFilterMetricFilterKey Design Decisions
get_lineagetool)+uniqueIdanduniqueId+calls to categorize ancestors vs descendantsunique_idformat examples for their input (e.g.,exposure.project.{name}), (4) Explains API limitation so users understand whyCritical Test Cases
1. Disambiguation Flow (Which is one of the Important UX Feature in this change). This model was selected because it will trigger eliciation.
Prompt:
Use mcp inspector to force elicitation


2. Prompt like Error Message (Resource Type Limitation Handling)
Prompt:
3. Full Lineage (Feature Completeness)
Prompt:
Why
Implementing to close #110
Related Issues
Closes ##110
Related to ##110
Checklist
Additional Notes
Related Issues
Closes #
Related to #
Checklist
Additional Notes