Conversation

@rubywtl rubywtl commented Sep 5, 2025

What this PR does:
This is the blog post for the distributed query execution project.

Which issue(s) this PR fixes:
Related to distributed query execution proposal

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@rubywtl rubywtl marked this pull request as draft September 5, 2025 23:35

@dsabsay dsabsay left a comment

@rubywtl This is wonderful and ambitious work (the code and the blog post both). Great diagrams, too. I left some suggestions.


Our new distributed query execution approach takes a more sophisticated route by breaking queries down to executable fragments at the expression level. Unlike traditional query-level processing where results are merged only at the query frontend, this fine-grained approach enables dynamic collaboration between execution units during runtime. Each fragment operates independently while maintaining the ability to exchange partial results with other units as needed. This enhanced granularity not only increases opportunities for parallel processing but also enables deeper query optimization. The system ultimately combines these distributed results to produce the final output, achieving better resource utilization and performance compared to conventional query splitting strategies.

![Comparison between previous query splitting strategy v.s. distributed query execution](/images/blog/2025/distributed-exec-splittingStrat.png)

The expression-level splitting part of this diagram may not be obvious to some readers. In particular, it's not clear what is meant by an "operator". Perhaps we could add an example PromQL expression to this diagram to help people see how it gets split?
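
For illustration, one possible example (with hypothetical metric names, not taken from the post): a query such as `sum(rate(http_requests_total[5m])) / sum(rate(http_requests_failed_total[5m]))` could be cut at the division operator, so that each `sum(rate(...))` branch becomes an independent fragment executed by its own querier, while a root fragment applies the division to the partial results it receives from the two children.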


## Results

Distributed query execution has demonstrated significant improvements in query processing by effectively reducing resource contention. This enhancement is achieved by distributing query workloads across multiple queriers, increasing the practical memory limit for high-cardinality queries that previously failed due to memory constraints.

Again, not sure contention is the right word.

@@ -0,0 +1,67 @@
# A First Look at Distributed Query Execution in Cortex

> One of the persistent challenges in Cortex has been dealing with resource contention in a single querier node, which occurs when too much data is pulled. While implementing safeguards through limits has been effective as a temporary solution, we needed to address this issue at its root to allow Cortex to process more data efficiently. This is where distributed query execution comes into play, offering a solution that both expands query limits and improves efficiency through parallel processing and result aggregation.

This is very pedantic of me, but I don't think resource contention is the root problem, but really the memory capacity of single nodes.

Contention implies a competition for resources (e.g. between two concurrently running queries), which we might address with a more sophisticated scheduling algorithm. In this case, we are improving the horizontal scalability of queries. While it might alleviate contention, it doesn't necessarily. I think it would be more precise to reference vertical vs. horizontal scalability. Again, I am being pedantic :)

I would reword this first sentence to say "One of the persistent challenges in Cortex has been vertical scaling limits in individual querier nodes, ..."


### Query Scheduler: Fragment Coordination

The Query Scheduler implements a sophisticated coordination mechanism that orchestrates the distributed execution of query fragments. It performs a bottom-up traversal of the logical plan, identifying cut points at remote nodes to ensure proper execution order of child-to-root. This approach guarantees that child fragments are enqueued and processed before their parent fragments, maintaining data dependency requirements and ensure there are not too many idle querier.

Suggested change
The Query Scheduler implements a sophisticated coordination mechanism that orchestrates the distributed execution of query fragments. It performs a bottom-up traversal of the logical plan, identifying cut points at remote nodes to ensure proper execution order of child-to-root. This approach guarantees that child fragments are enqueued and processed before their parent fragments, maintaining data dependency requirements and ensure there are not too many idle querier.
The Query Scheduler implements a sophisticated coordination mechanism that orchestrates the distributed execution of query fragments. It performs a bottom-up traversal of the logical plan, identifying cut points at remote nodes to ensure proper execution order of child-to-root. This approach guarantees that child fragments are enqueued and processed before their parent fragments, maintaining data dependency requirements and ensure there are not too many idle queriers.
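
To make the child-before-parent ordering concrete, here is a minimal sketch of a bottom-up traversal that enqueues fragments rooted at remote cut points before the fragments that depend on them. The types and names are hypothetical simplifications, not the actual Cortex scheduler code.

```go
// Hypothetical, simplified model of a logical plan tree. A Remote node marks
// a cut point: the subtree rooted there runs as a separate fragment.
package main

import "fmt"

type Node struct {
	Name     string
	Remote   bool
	Children []*Node
}

// collectFragments walks the plan bottom-up, so fragments rooted at deeper
// remote nodes are appended to the queue before the fragments that consume
// their partial results.
func collectFragments(n *Node, queue *[]*Node) {
	for _, c := range n.Children {
		collectFragments(c, queue)
	}
	if n.Remote {
		*queue = append(*queue, n)
	}
}

func main() {
	// Plan for a binary expression like sum(A) / sum(B): both sum branches
	// are remote fragments; the dividing root depends on their results.
	plan := &Node{
		Name:   "div",
		Remote: true,
		Children: []*Node{
			{Name: "sum(A)", Remote: true},
			{Name: "sum(B)", Remote: true},
		},
	}

	var queue []*Node
	collectFragments(plan, &queue)
	for i, f := range queue {
		fmt.Printf("enqueue %d: %s\n", i+1, f.Name) // sum(A), sum(B), then div
	}
}
```

Enqueued this way, no parent fragment sits in the queue ahead of the children whose partial results it needs, which is what keeps queriers from idling while they wait on dependencies.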

While the current implementation of distributed query execution already offers benefits, it represents just the beginning of Cortex's optimization journey. To fully realize its potential, several key enhancements are needed:

### Enhance Distributed Optimizer
Distributed optimizer currently support binary expressions, but in the future it should manage more complex operations, including complicated join functions and advanced query patterns.

Suggested change
Distributed optimizer currently support binary expressions, but in the future it should manage more complex operations, including complicated join functions and advanced query patterns.
Distributed optimizer currently supports binary expressions, but in the future it should manage more complex operations, including complicated join functions and advanced query patterns.

### Storage Layer Sharding
Implementing more sophisticated storage sharding strategies can better distribute data across the cluster. For example, allow max(A) to be split to max(A[shard=0]) and max(A[shard=1]). Initial experiments with ingester storage have already demonstrated impressive results: binary expressions show 1.7-2.8x performance improvements, while multiple binary expressions achieve up to 5x speed.While not included in the initial release, we invite contributors to continue to develop these sharding capabilities for both ingestor and store-gateway components.

Suggested change
Implementing more sophisticated storage sharding strategies can better distribute data across the cluster. For example, allow max(A) to be split to max(A[shard=0]) and max(A[shard=1]). Initial experiments with ingester storage have already demonstrated impressive results: binary expressions show 1.7-2.8x performance improvements, while multiple binary expressions achieve up to 5x speed.While not included in the initial release, we invite contributors to continue to develop these sharding capabilities for both ingestor and store-gateway components.
Implementing more sophisticated storage sharding strategies can better distribute data across the cluster. For example, allow max(A) to be split to max(A[shard=0]) and max(A[shard=1]). Initial experiments with ingester storage have already demonstrated impressive results: binary expressions show 1.7-2.8x performance improvements, while multiple binary expressions achieve up to 5x speed. While not included in the initial release, we invite contributors to continue to develop these sharding capabilities for both ingestor and store-gateway components.
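
In other words, this kind of split is possible because `max` is decomposable: the root fragment only has to take the maximum of the per-shard maxima, so `max(A)` can be recovered exactly as `max(max(A[shard=0]), max(A[shard=1]))`, with each per-shard aggregation running on a different querier.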
