Skip to content

Conversation

@Muizzkolapo
Copy link
Contributor

@Muizzkolapo Muizzkolapo commented Nov 27, 2025

Summary

A User building a catalog-like experience with dbt and Snowflake via Claude Desktop requested full lineage capabilities #110:

"Get the full lineage of a model, not only its children/parents. To parse the full lineage with the official MCP it would require quite a few calls to the tool to parse through all layers, which would be nice to get the server doing async instead providing the full list as a return."


What Changed

Discovery Phase: GraphQL Introspection

We used GraphQL introspection to validate design decisions and understand the Discovery API's capabilities.

Why we used Sequential Search Strategy

Question: Does LineageFilter support searching by name/identifier? LineageFilter

Finding: - LineageFilter only accepts full unique IDs (no identifier or name field). Due to this Sequential search across resource types is necessary.


Two API Calls for "both" Direction

Question: Does ModelLineageNode include direction metadata to categorize ancestors vs descendants?
View ModelLineageNode Introspection

Finding: - No direction metadata exists (no relationshipDirection, isAncestor, isDescendant, or dependsOn fields). Two separate API calls required for "both" direction.


Graphql Client-Side Pagination

Question: Does the lineage query support server-side pagination parameters?

Introspect

Finding: - No pagination parameters available (no first, after, offset, or limit arguments). Client-side 50-node limit is necessary.


List Of Resource Search Support

Question: Which resource types support searching by name (via identifier field)?

link Introspection

Finding: - 4 out of 7 resource types support name search:

  • ✅ Models (via ModelAppliedFilter.identifier)
  • ✅ Sources (via SourceAppliedFilter.identifier)
  • ✅ Seeds (via GenericMaterializedFilter.identifier)
  • ✅ Snapshots (via GenericMaterializedFilter.identifier)
  • ❌ Exposures (no identifier field)
  • ❌ Tests (no identifier field)
  • ❌ Metrics (no filter fields)

Design Approach

Complete Resource Search Coverage

Based on introspection findings:

Resource Type Filter Type Has identifier? Search Support
Models ModelAppliedFilter ✅ Yes ✅ Implemented
Sources SourceAppliedFilter ✅ Yes ✅ Implemented
Seeds GenericMaterializedFilter ✅ Yes ✅ Implemented
Snapshots GenericMaterializedFilter ✅ Yes ✅ Implemented
Exposures ExposureFilter ❌ No ❌ Requires unique_id
Tests TestAppliedFilter ❌ No ❌ Requires unique_id
Metrics MetricFilter ❌ No fields ❌ Requires unique_id

Key Design Decisions

Decision Choice Rationale
Name Resolution Internal (resolve inside get_lineage tool) Better UX - single call instead of requiring users to manually resolve names to unique IDs
Search Order Models → Sources → Seeds → Snapshots Optimized by frequency: Models (~90%), Sources (~8%), Seeds (~1%), Snapshots (~1%) - fast path for common cases
Search Implementation Sequential search across resource types No unified search API exists - must query each filter type separately
Disambiguation MCP Elicitation with clarifying fallback prompt Try interactive selection first using elicitation to get the response from user if the client supports it and, fall back to structured response on timeout/error - guides both LLM and human users
Direction="both" Two API calls + merge No direction metadata in responses - we had to make separate +uniqueId and uniqueId+ calls to categorize ancestors vs descendants
Pagination Hard limit: 50 nodes/direction No server-side pagination available - client-side limit protects LLM token budget
Unsearchable Types Clarifying prompt like error messages with examples to guide LLM/User to next steps when users request for non supported resource types When name search finds no matches, return helpful error that: (1) Lists which types ARE searchable (models, sources, seeds, snapshots), (2) Suggests the user might be searching for an exposure/test/metric, (3) Shows exact unique_id format examples for their input (e.g., exposure.project.{name}), (4) Explains API limitation so users understand why

Critical Test Cases

1. Disambiguation Flow (Which is one of the Important UX Feature in this change). This model was selected because it will trigger eliciation.

Prompt:

"Get lineage for customers"
image

Use mcp inspector to force elicitation
image
image

2. Prompt like Error Message (Resource Type Limitation Handling)

Prompt:

"Show lineage for customer_dashboard"
image

3. Full Lineage (Feature Completeness)

Prompt:

"Show full lineage for orders_snapshot"
image

Why

Implementing to close #110

Related Issues

Closes ##110
Related to ##110

Checklist

Additional Notes

Related Issues

Closes #
Related to #

Checklist

Additional Notes

@Muizzkolapo Muizzkolapo marked this pull request as ready for review November 28, 2025 21:38
@Muizzkolapo Muizzkolapo requested review from a team, b-per and jasnonaz as code owners November 28, 2025 21:38
@Muizzkolapo Muizzkolapo changed the title implement-get-lineage implement get_lineage tool for complete lineage across dbt resources Nov 28, 2025
@Muizzkolapo Muizzkolapo marked this pull request as draft November 29, 2025 11:44
@Muizzkolapo Muizzkolapo marked this pull request as ready for review November 29, 2025 12:18
@Muizzkolapo Muizzkolapo requested a review from a team as a code owner November 29, 2025 12:18
Copy link
Collaborator

@DevonFulcher DevonFulcher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work!

}
""")

GET_SEEDS = textwrap.dedent("""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just merged a PR that adds GQL queries for seeds & snapshots. Can we reuse those queries here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new GQL folder pattern, I have placed the GQLs in a folder, but I dont think we can reuse the existing GQLs. The reason is because the queries use resources() with AppliedResourcesFilter, which requires knowing the unique_id upfront (via the uniqueIds array).

The new search gql queries we created for the lineage are lightweight searches using the specific seeds()/snapshots() endpoints with GenericMaterializedFilter's identifier field for name-based matching.

Through introspection, I noticed that GenericMaterializedFilter has native identifier support for name matching, so we use that for 1-call lookups instead of the package enumeration pattern.

Comment on lines 202 to 203
direction: str = "both",
types: list[str] | None = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better if these were enums or some other stronger type than strings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I have Implemented this.

Comment on lines 233 to 453
matches = await context.lineage_fetcher.search_all_resources(name)
if not matches:
raise InvalidParameterError(
f"No resource found with name '{name}' in searchable resource types "
f"(models, sources, seeds, snapshots).\n\n"
f"If this is an exposure, test, or metric, you must use the full unique_id instead:\n"
f" • For exposures: get_lineage(unique_id='exposure.project.{name}')\n"
f" • For tests: get_lineage(unique_id='test.project.{name}')\n"
f" • For metrics: get_lineage(unique_id='metric.project.{name}')\n\n"
f"Note: The Discovery API does not support searching exposures, tests, or metrics by name. "
f"You can find unique IDs in your dbt Cloud project or manifest.json."
)
if len(matches) == 1:
resolved_unique_id = matches[0]["uniqueId"]
else:
# Multiple matches - try elicitation first, fallback to disambiguation
try:
# Format matches for display
match_descriptions = [
f"{m['resourceType']}: {m['uniqueId']}" for m in matches
]
message = (
f"Multiple resources found with name '{name}':\n"
+ "\n".join(f" {i+1}. {desc}" for i, desc in enumerate(match_descriptions))
+ "\n\nSelect the unique_id of the resource you want:"
)

result = await ctx.elicit(
message=message,
schema=ResourceSelection
)

if result.action == "accept":
# Validate the selected unique_id is in matches
selected_id = result.data.unique_id
if selected_id in [m["uniqueId"] for m in matches]:
resolved_unique_id = selected_id
else:
raise InvalidParameterError(
f"Selected unique_id '{selected_id}' not in available matches"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty interesting! How do you think it compares to the approach I took in my recent changes here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually came up with this pattern after introspection of the Discovery API since ModelAppliedFilter, SourceAppliedFilter, and GenericMaterializedFilter all have an identifier field for name-based searching, I just used that native support instead which means we can just make fewer calls.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either approach works well, and I feel okay with trying this out. It will be our first use of elicitation, so that is exciting! It is a creative solution. Thanks for that.

At some point, we should consolidate our approaches, though. I would like the Discovery tools to work uniformly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm yeah definitely, I think we should be able to get elicitation to work with either approaches. Question for you then can we assume the approach here would be the standard going forward https://github.com/dbt-labs/dbt-mcp/blob/main/src/dbt_mcp/discovery/client.py#L625-L650?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the approach could be: try elicitation. If the client doesn't support elicitation, fall back to the method you linked.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that this could be the method for all discovery tools, not just lineage. That uniformity doesn't have to be done in this PR, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright I will just raise a separate issue to standardize the approach for discovery tools

)
descendants_result = await self._fetch_lineage_single_direction(
unique_id, LineageDirection.DESCENDANTS, types
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the reason for the two API calls to simplify labeling ancestors and descendants? The selector syntax allows for both directions here, but I assume the separate calls make it easier to label the nodes. Is that right?

One suggestion, if we keep two separate calls, they can be done concurrently with asyncio.gather().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup! the reason for the two calls is so we can group them into ancestors vs descendants. When using +uniqueId+, the API returns everything mixed together, so we'd need extra logic to figure out which nodes are uancestors vs descendants. I have implemented the asyncio.gather() for parallel execution.

@DevonFulcher DevonFulcher requested a review from Copilot December 2, 2025 15:41
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a new get_lineage tool that retrieves complete upstream and downstream lineage for dbt resources in a single API call, addressing user feedback about needing multiple calls to traverse the lineage graph. The implementation includes name-based search with disambiguation support and fallback mechanisms for resource types that require unique IDs.

Key Changes:

  • New LineageFetcher class handles resource search and lineage retrieval across models, sources, seeds, and snapshots
  • Interactive disambiguation via MCP elicitation when multiple resources match a name
  • Client-side pagination limiting results to 50 nodes per direction

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/dbt_mcp/discovery/client.py Adds LineageFetcher class with search and lineage fetching capabilities, including GraphQL queries for seeds, snapshots, and lineage
src/dbt_mcp/discovery/tools.py Implements get_lineage tool with parameter validation, name resolution, and elicitation-based disambiguation
src/dbt_mcp/prompts/discovery/get_lineage.md Documents tool usage, parameters, limitations, and examples
src/dbt_mcp/tools/tool_names.py Registers new GET_LINEAGE tool name
src/dbt_mcp/tools/toolsets.py Adds GET_LINEAGE to discovery toolset
tests/unit/discovery/test_lineage_fetcher.py Tests for resource search, selector building, lineage fetching, and pagination
tests/unit/discovery/test_get_lineage_tool.py Tests for parameter validation, name resolution, and disambiguation flow
tests/unit/discovery/conftest.py Adds fixtures for LineageFetcher and mock contexts
.changes/unreleased/Enhancement or New Feature-20251129-121002.yaml Changelog entry for the new feature

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

"matches": matches,
"instruction": "Please call get_lineage again with the unique_id parameter set to one of the matches above.",
}

Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type ignore comment suggests that resolved_unique_id could potentially be None at runtime. Since the code paths ensure it's always assigned (either from unique_id parameter, single match, or elicitation), consider adding an assertion to make this guarantee explicit: assert resolved_unique_id is not None before line 291.

Suggested change
assert resolved_unique_id is not None, "resolved_unique_id must not be None at this point"

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

{"uniqueId": "source.test.raw.customers", "name": "customers", "resourceType": "Source"},
]
# Simulate elicitation failure
mock_mcp_context.elicit.side_effect = Exception("Elicitation not supported")
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test simulates a generic Exception, but the actual code catches all exceptions at line 281 in tools.py. Consider testing with more specific exception types (e.g., TimeoutError, ConnectionError) to ensure the fallback behavior works correctly for different failure scenarios.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

"first": PAGE_SIZE,
}

# Execute query - already a direct reference to the query string
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment on line 1026 'already a direct reference to the query string' is unclear about why this reassignment is necessary. Consider clarifying that this extracts the GraphQL query string from the configuration dictionary for readability.

Suggested change
# Execute query - already a direct reference to the query string
# Extract the GraphQL query string from the configuration dictionary for readability

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment clarified added in branch

Comment on lines 250 to 257
# Format matches for display
match_descriptions = [
f"{m['resourceType']}: {m['uniqueId']}" for m in matches
]
message = (
f"Multiple resources found with name '{name}':\n"
+ "\n".join(f" {i+1}. {desc}" for i, desc in enumerate(match_descriptions))
+ "\n\nSelect the unique_id of the resource you want:"
Copy link

Copilot AI Dec 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The message construction spans multiple operations (list comprehension, string concatenation, enumeration). Consider extracting this into a helper method _format_disambiguation_message(name: str, matches: list[dict]) -> str to improve readability and testability.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added _format_disambiguation_message

@Muizzkolapo Muizzkolapo force-pushed the feat/110-get-lineage-tool branch from 461f5fc to 62cf578 Compare December 2, 2025 20:07
Comment on lines +680 to +682
# ============================================================================
# Lineage Tool Classes and Configuration
# ============================================================================
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than a demarcation like this, should we move this to a different file? It seems we could break this file up a bit more in general.

Comment on lines +463 to +469
logger.debug(f"Elicitation not completed: {type(e).__name__}: {e}")
return {
"status": "disambiguation_required",
"message": f"Multiple resources found with name '{name}'",
"matches": matches,
"instruction": "Please call get_lineage again with the unique_id parameter set to one of the matches above.",
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should handle this situation in a different way. Not all clients support elicitation, so we should fall back to a different method. Perhaps we should return a list of trees? Additionally, are you familiar with how elicitation is handled with remote MCP? We should ensure this tool is compatible with remote MCP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I will remove elicitation for now as I am not sure how it would work in remote mcp, I will create a different issue to look into this approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: support for lineage beyond parent/child

2 participants