Support multiple quorums on a single LighthouseServer using gRPC metadata-based room assignment #189

Open: wants to merge 3 commits into base: main

Conversation

MattKotzbauer
Contributor

GH Issue: #173

Extends Lighthouse to support multiple independent quorums on a single server by tagging each gRPC call with a room-id metadata header and feeding requests through a lightweight router that maintains per-room state in a DashMap<String, Arc<Lighthouse>>. LighthouseClient now accepts an optional room_id argument and automatically injects the corresponding metadata header into each heartbeat and quorum request, while untagged calls continue to use a default namespace. In multi_quorum_test.py, I create two clients with distinct room IDs and have them form independent quorums on the same server port.
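For readers skimming the thread, the per-room routing described above can be sketched roughly as follows. This is a std-only illustration, not the code in this PR: `Room` stands in for `Lighthouse`, a `Mutex<HashMap>` stands in for the `DashMap`, request metadata is modelled as a plain `HashMap<String, String>`, and the header name `"room-id"` and default namespace `"default"` are assumed values.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Assumed header name; the PR text only says "room-id metadata header".
const ROOM_ID_HEADER: &str = "room-id";
/// Assumed name of the namespace used by untagged calls.
const DEFAULT_ROOM: &str = "default";

/// Hypothetical stand-in for `Lighthouse`: it only tracks who joined.
#[derive(Default)]
struct Room {
    participants: Mutex<Vec<String>>,
}

impl Room {
    /// Registers a replica (idempotently) and returns the participant count.
    fn join(&self, replica_id: &str) -> usize {
        let mut participants = self.participants.lock().unwrap();
        if !participants.iter().any(|r| r == replica_id) {
            participants.push(replica_id.to_string());
        }
        participants.len()
    }
}

/// Router keeping one `Room` per room id.
/// The PR uses `DashMap<String, Arc<Lighthouse>>`; `Mutex<HashMap<..>>`
/// is used here only to keep the sketch free of external crates.
#[derive(Default)]
struct Router {
    rooms: Mutex<HashMap<String, Arc<Room>>>,
}

impl Router {
    /// Reads the room id from request metadata, falling back to the default.
    fn room_id(metadata: &HashMap<String, String>) -> String {
        metadata
            .get(ROOM_ID_HEADER)
            .cloned()
            .unwrap_or_else(|| DEFAULT_ROOM.to_string())
    }

    /// Returns the room for `id`, creating it lazily on first use.
    fn room(&self, id: &str) -> Arc<Room> {
        let mut rooms = self.rooms.lock().unwrap();
        rooms.entry(id.to_string()).or_default().clone()
    }
}
```

Two callers that send different `room-id` values get different `Arc<Room>` instances and therefore independent participant lists, while callers that send no header share the default room.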

Open to making any changes to the code or approach 🤙.

Warmly,
Matt

@facebook-github-bot added the CLA Signed label May 5, 2025
Member

@d4l3k left a comment

Thanks for putting this together! This looks very promising, and I'm excited to have this in torchft.

I think there are a couple of ways we can make this cleaner/more generic, which I've commented on.

let room = self.room(&id).await;
<Arc<Lighthouse> as LighthouseService>::heartbeat(&room, req).await
}
}
Member

I think this is fine as is, since this is fairly minimal boilerplate per request, but I think we can do even better.

By doing this at the Service layer instead of the LighthouseService layer, we can have it automatically work for all endpoints on the LighthouseService.

Can you look into this and see how feasible it is? If it's not any cleaner, we can land this as is.

Some pointers:

There's also https://github.com/teimuraz/tonic-middleware which might be useful
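The Service-layer idea can be illustrated with a toy sketch, under heavy assumptions: the `Service` trait below is a simplified, synchronous stand-in for `tower::Service` (the real trait is async and has `poll_ready`), `HttpRequest<B>` is a hypothetical stand-in for `http::Request<B>`, and `RoomRouter`, `CountingService`, and the `"room-id"` header name are illustrative only. The point is that the router only inspects headers, so it never needs to know which endpoint is being called or what the body type is.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified synchronous stand-in for `tower::Service` (assumption).
trait Service<Req> {
    type Response;
    fn call(&self, req: Req) -> Self::Response;
}

/// Hypothetical HTTP-like request: endpoint path, headers, opaque body.
struct HttpRequest<B> {
    path: String,
    headers: HashMap<String, String>,
    body: B,
}

/// Routes every request to a per-room inner service, created on demand.
struct RoomRouter<S, F> {
    make_room: F,
    rooms: Mutex<HashMap<String, Arc<S>>>,
}

impl<S, F: Fn() -> S> RoomRouter<S, F> {
    fn new(make_room: F) -> Self {
        RoomRouter { make_room, rooms: Mutex::new(HashMap::new()) }
    }
}

impl<B, S, F> Service<HttpRequest<B>> for RoomRouter<S, F>
where
    S: Service<HttpRequest<B>>,
    F: Fn() -> S,
{
    type Response = S::Response;

    fn call(&self, req: HttpRequest<B>) -> S::Response {
        // Only the headers are read; path and body pass through untouched.
        let id = req
            .headers
            .get("room-id") // assumed header name
            .cloned()
            .unwrap_or_else(|| "default".to_string());
        let room = {
            let mut rooms = self.rooms.lock().unwrap();
            rooms
                .entry(id)
                .or_insert_with(|| Arc::new((self.make_room)()))
                .clone()
        };
        room.call(req)
    }
}

/// Toy inner service: counts calls it receives, regardless of endpoint.
#[derive(Default)]
struct CountingService {
    calls: Mutex<usize>,
}

impl<B> Service<HttpRequest<B>> for CountingService {
    type Response = (String, usize);

    fn call(&self, req: HttpRequest<B>) -> (String, usize) {
        let mut calls = self.calls.lock().unwrap();
        *calls += 1;
        (req.path, *calls)
    }
}

/// Helper that builds a request with an optional room tag and empty body.
fn request(room: Option<&str>, path: &str) -> HttpRequest<()> {
    let mut headers = HashMap::new();
    if let Some(room) = room {
        headers.insert("room-id".to_string(), room.to_string());
    }
    HttpRequest { path: path.to_string(), headers, body: () }
}
```

With this shape, `RoomRouter::new(CountingService::default)` handles both `heartbeat` and `quorum` paths without any per-endpoint code, which is the property the review comment is after. Whether the same shape fits tonic's generated server types is left open here.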

Contributor Author

I made an initial attempt at doing the routing at the Service layer rather than the LighthouseService layer, but have had trouble adapting between the tonic message types (tonic::Request/Response) and the Tower message types (http::Request/Response): tonic::Request/Response wraps the body in tonic::body::BoxBody and carries gRPC-specific extensions, while the Tower stack we're intercepting expects a bare http::Request/Response<B> where the body implements HttpBody. I haven't yet found a concise way to do this.

If I keep at this, I'd see whether I can get something working that relies more on tonic-middleware. Perhaps there's a way to stay entirely in the tonic domain that keeps the implementation and debugging cleaner?

Labels: CLA Signed
3 participants