Support multiple quorums on a single LighthouseServer using gRPC metadata-based room assignment #189
…tiple quorums on a single LighthouseServer (pytorch#173)
Thanks for putting this together! This looks very promising, and I'm excited to have this in torchft.
I think there are a couple of ways we can make this cleaner and more generic; I've left comments on them inline.
        let room = self.room(&id).await;
        <Arc<Lighthouse> as LighthouseService>::heartbeat(&room, req).await
    }
}
I think this is fine as is, since it's fairly minimal boilerplate per request, but I think we can do even better.
By doing this at the Service layer instead of the LighthouseService layer, we can have it work automatically for all endpoints on the LighthouseService.
Can you look into this and see how feasible it is? If it's not any cleaner, we can land this as is.
Some pointers:
- https://docs.rs/tower-service/0.3.3/tower_service/trait.Service.html
- https://github.com/hyperium/tonic/blob/b303caa52ba8bbe8172310be7165a80b7c2a53f8/examples/src/tower/server.rs#L83-L109
There's also https://github.com/teimuraz/tonic-middleware which might be useful
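To illustrate the idea of routing once at the service layer instead of per RPC, here is a minimal standard-library sketch. The `Service` trait, `Request`, `RoomService`, and `RoomRouter` below are hypothetical stand-ins, not the real `tower_service::Service` or tonic types (which are asynchronous and poll-based), and the `"default"` fallback name is an assumption.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified, synchronous stand-in for a request-handling service.
/// NOT the real `tower_service::Service` trait.
pub trait Service {
    fn call(&self, req: &Request) -> String;
}

/// Hypothetical request: an endpoint path plus string headers.
pub struct Request {
    pub path: String,
    pub headers: HashMap<String, String>,
}

/// Hypothetical per-room service that handles every endpoint for one room.
pub struct RoomService {
    pub name: String,
}

impl Service for RoomService {
    fn call(&self, req: &Request) -> String {
        format!("{} handled {}", self.name, req.path)
    }
}

/// Middleware that selects the inner service from the `room-id` header,
/// so individual endpoints need no routing code of their own.
#[derive(Default)]
pub struct RoomRouter {
    rooms: Mutex<HashMap<String, Arc<RoomService>>>,
}

impl Service for RoomRouter {
    fn call(&self, req: &Request) -> String {
        // Untagged requests fall back to an assumed "default" room.
        let id = req
            .headers
            .get("room-id")
            .map(String::as_str)
            .unwrap_or("default");
        // Look up or lazily create the room, then release the lock
        // before forwarding the request.
        let room = {
            let mut rooms = self.rooms.lock().unwrap();
            rooms
                .entry(id.to_string())
                .or_insert_with(|| Arc::new(RoomService { name: id.to_string() }))
                .clone()
        };
        room.call(req)
    }
}
```

Because the router only inspects headers and forwards the whole request, any endpoint added to the inner service later would be routed without extra code. A real version would still have to deal with the async `http::Request` and response body types.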
I made an initial attempt to do the routing at the Service layer rather than the LighthouseService layer, but had trouble adapting between the tonic message types (`tonic::Request/Response`) and the Tower message types (`http::Request/Response`): `tonic::Request/Response` wraps the body in `tonic::body::BoxBody` and carries gRPC-specific extensions, while the Tower stack we're intercepting expects a bare `http::Request/Response<B>` where the body implements `HttpBody`. I haven't yet found a concise way to do this.
If I were to keep at this, I'd see if I could get something working that relies more on `tonic-middleware`. Perhaps there's a way to stay entirely in the `tonic` domain that keeps the implementation and debugging cleaner?
…ng add_room_header for each RPC call
GH Issue: #173
Extends Lighthouse to support multiple independent quorums on a single server by tagging each gRPC call with a `room-id` metadata header and feeding requests through a lightweight router that maintains per-room state with a `DashMap<String, Arc<Lighthouse>>`. `LighthouseClient` now accepts an optional `room_id` argument and automatically injects the corresponding metadata header into each `heartbeat` and `quorum` request, while untagged calls continue to use a default namespace. In `multi_quorum_test.py`, I created two clients with distinct room IDs and had them form independent quorums on the same server port.
Open to making any changes to the code or approach 🤙.
Warmly,
Matt
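For readers who want a concrete picture of the per-room state described in the PR summary, the following is a rough standard-library sketch, not the PR's code. It swaps the `DashMap` for a `Mutex<HashMap<...>>`, and `Room`, `Router`, `DEFAULT_ROOM`, and the replica names are hypothetical, chosen only for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the per-room Lighthouse state.
#[derive(Debug, Default)]
pub struct Room {
    /// Replica IDs that have sent a heartbeat to this room.
    pub participants: Mutex<Vec<String>>,
}

/// Assumed name for the namespace used by untagged calls.
pub const DEFAULT_ROOM: &str = "default";

/// Maps room IDs to independently owned room state.
#[derive(Default)]
pub struct Router {
    rooms: Mutex<HashMap<String, Arc<Room>>>,
}

impl Router {
    /// Returns the room for `room_id`, creating it on first use.
    /// `None` models a request that carries no `room-id` header.
    pub fn room(&self, room_id: Option<&str>) -> Arc<Room> {
        let id = room_id.unwrap_or(DEFAULT_ROOM);
        let mut rooms = self.rooms.lock().unwrap();
        rooms.entry(id.to_string()).or_default().clone()
    }

    /// Records a heartbeat from `replica_id` in the selected room only.
    pub fn heartbeat(&self, room_id: Option<&str>, replica_id: &str) {
        let room = self.room(room_id);
        let mut participants = room.participants.lock().unwrap();
        // Repeated heartbeats from the same replica are recorded once.
        if !participants.iter().any(|p| p == replica_id) {
            participants.push(replica_id.to_string());
        }
    }
}
```

Each room holds its own participant list behind its own `Arc`, so a heartbeat tagged for one room never touches another room's membership, and untagged heartbeats land in the default room.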