Support multiple quorums on a single LighthouseServer using gRPC metadata-based room assignment #189

MattKotzbauer · 2025-05-05T22:33:29Z

GH Issue: #173

Extends Lighthouse to support multiple independent quorums on a single server by tagging each gRPC call with a room-id metadata header and feeding requests through a lightweight router that maintains per-room state in a DashMap<String, Arc<Lighthouse>>. LighthouseClient now accepts an optional room_id argument and automatically injects the corresponding metadata header into each heartbeat and quorum request, while untagged calls continue to use a default namespace. In multi_quorum_test.py, I create two clients with distinct room IDs and have them form independent quorums on the same server port.
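As a rough illustration of the idea (a Python sketch, not the Rust code in this PR), the snippet below models a client that attaches a `room-id` header to every call and a server-side router that keeps separate state per room. Only the `room-id` header and the default-namespace fallback come from the description above; `RoomRouter`, `RoomClient`, `DEFAULT_ROOM`, and the `join` method are hypothetical names invented for the example.

```python
from typing import Dict, List, Optional

ROOM_HEADER = "room-id"   # metadata key named in the PR description
DEFAULT_ROOM = "default"  # hypothetical name for the fallback namespace


class RoomRouter:
    """Toy stand-in for the server-side router: one member list per room."""

    def __init__(self) -> None:
        self.rooms: Dict[str, List[str]] = {}

    def join(self, metadata: Dict[str, str], replica_id: str) -> List[str]:
        # Untagged calls fall back to the default room.
        room = metadata.get(ROOM_HEADER, DEFAULT_ROOM)
        members = self.rooms.setdefault(room, [])
        members.append(replica_id)
        return list(members)


class RoomClient:
    """Toy client that injects the room header into every call it makes."""

    def __init__(self, router: RoomRouter, room_id: Optional[str] = None) -> None:
        self.router = router
        self.metadata = {ROOM_HEADER: room_id} if room_id is not None else {}

    def join(self, replica_id: str) -> List[str]:
        return self.router.join(self.metadata, replica_id)


router = RoomRouter()
client_a = RoomClient(router, room_id="jobA")
client_b = RoomClient(router, room_id="jobB")
untagged = RoomClient(router)

# The same replica id in different rooms never ends up in the same member list.
members_a = client_a.join("replica0")
members_b = client_b.join("replica0")
members_default = untagged.join("replica0")
```

Because the room is resolved from the metadata on every call, callers that never set `room_id` keep working against the fallback room.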

Open to making any changes to the code or approach 🤙.

Warmly,
Matt

d4l3k

Thanks for putting this together! This looks very promising, and I'm excited to have this in torchft.

I think there are a couple of ways we can make this cleaner/more generic, which I've commented on.

src/lib.rs

d4l3k · 2025-05-06T19:05:45Z

src/router.rs

+        let room = self.room(&id).await;
+        <Arc<Lighthouse> as LighthouseService>::heartbeat(&room, req).await
+    }
+}


I think this is fine as is, since this is fairly minimal boilerplate per request, but I think we can do even better.

By doing this at the Service layer instead of the LighthouseService layer, we can have it automatically work for all endpoints on the LighthouseService.

Can you look into this and see how feasible it is? If it's not any cleaner, we can land this as is.

Some pointers:

https://docs.rs/tower-service/0.3.3/tower_service/trait.Service.html

https://github.com/hyperium/tonic/blob/b303caa52ba8bbe8172310be7165a80b7c2a53f8/examples/src/tower/server.rs#L83-L109

There's also https://github.com/teimuraz/tonic-middleware which might be useful

I made an initial attempt to do the routing at the Service layer rather than the LighthouseService layer, but have had trouble adapting between the tonic message types (tonic::Request/Response) and the Tower message types (http::Request/Response). tonic::Request/Response wraps the body in tonic::body::BoxBody and carries gRPC-specific extensions, while the Tower stack we're intercepting expects a bare http::Request/Response<B> where the body implements HttpBody. I haven't yet found a concise way to do this.

If I were to keep at this, I'd see if I could get something working that relies more on tonic-middleware - perhaps there's a way to stay entirely in the tonic domain that keeps the implementation and debugging cleaner?

mixing the two is a bit tricky -- we probably need to stay at the tower layer. Why do you need to access the tonic::Request/Response objects? It's all HTTP at the end of the day so seems like we should be able to operate at the tower/http layer and view the metadata as a header?

middleware might work though it may be too high level

Ah I see, it became easier when I had router.rs operate entirely at the tower layer rather than trying to mix Service and tonic. Most recent commit has router.rs at the tower level, which lets us start the lighthouse server with a call to
Server::builder().add_service(router).serve(addr)
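To show why routing at the HTTP layer covers every endpoint without per-RPC boilerplate, here is a small Python sketch (not the tower/tonic code from this PR). It treats a service as a callable over `(path, headers, body)`, picks the room from the `room-id` header, and forwards the request untouched. `HeaderRouter`, `make_room_service`, the `"default"` fallback name, and the `/example.Service/...` paths are all placeholders invented for the example.

```python
from typing import Callable, Dict, Tuple

# A "service" here is any callable: (path, headers, body) -> (status, body).
Service = Callable[[str, Dict[str, str], bytes], Tuple[int, bytes]]


def make_room_service(room: str) -> Service:
    """Hypothetical per-room service; a real one would implement the RPCs."""

    def handle(path: str, headers: Dict[str, str], body: bytes) -> Tuple[int, bytes]:
        # Echo which room handled which path so the routing is observable.
        return 200, f"{room}:{path}".encode()

    return handle


class HeaderRouter:
    """Routes on a header only, so it never needs to know the endpoint list."""

    def __init__(self, factory: Callable[[str], Service]) -> None:
        self._factory = factory
        self._rooms: Dict[str, Service] = {}

    def __call__(self, path: str, headers: Dict[str, str], body: bytes) -> Tuple[int, bytes]:
        room = headers.get("room-id", "default")
        service = self._rooms.get(room)
        if service is None:
            service = self._rooms[room] = self._factory(room)
        # Forward the request unchanged; no per-endpoint code is involved.
        return service(path, headers, body)


router = HeaderRouter(make_room_service)
heartbeat = router("/example.Service/Heartbeat", {"room-id": "jobA"}, b"")
quorum = router("/example.Service/Quorum", {}, b"")
```

Adding a new RPC to the inner service would require no change to `HeaderRouter`, which is the property the Service-layer approach is after.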

d4l3k

@MattKotzbauer this is looking great and thanks for pushing this through! Just need some small cleanups

d4l3k · 2025-05-20T20:21:26Z

src/interceptor.rs~

@@ -0,0 +1,12 @@
+use tonic::{Request, Status, service::Interceptor};


Is this file intentional?

Ah that's unintentional, removing in next commit

d4l3k · 2025-05-20T20:23:45Z

src/lib.rs

@@ -654,7 +689,7 @@ impl LighthouseServer {
    /// Returns:
    ///    str: The address of the lighthouse server.
    fn address(&self) -> PyResult<String> {
-        Ok(self.lighthouse.address().to_string())
+        Ok(self.bind.clone())


this unfortunately isn't sufficient -- bind could be something like "0.0.0.0:0" which will bind to a random port. Address needs to be the routable http address i.e. http://foo.bar:1324

Hmm, perhaps we could use similar calls as the Lighthouse class uses to resolve host IP and address? Will include a version of this in next commit, though am also down to change it
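The general technique being discussed, reading the real port back from the bound socket instead of echoing the bind string, can be sketched in Python as follows. This is an illustration only, not the Rust change that landed; `routable_address` is a hypothetical helper, and using `socket.gethostname()` as the advertised host is an assumption that the hostname is resolvable by clients.

```python
import socket
from typing import Tuple


def routable_address(bind_host: str, bind_port: int) -> Tuple[socket.socket, str]:
    """Bind a listener and build an http://host:port address from the bound socket."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((bind_host, bind_port))
    listener.listen()
    # getsockname() reports the port the OS actually chose when bind_port is 0.
    _, port = listener.getsockname()
    # Assumption: the local hostname is what clients should connect to.
    # "0.0.0.0" itself is a wildcard and not a usable destination.
    host = socket.gethostname()
    return listener, f"http://{host}:{port}"


listener, address = routable_address("0.0.0.0", 0)
bound_port = listener.getsockname()[1]
listener.close()
```

With a bind string of `0.0.0.0:0`, `address` contains the concrete port assigned by the OS rather than `0`.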

d4l3k · 2025-05-20T20:24:22Z

src/router.rs

+/// gRPC server for a single room (inner state = `Arc<Lighthouse>`).
+type GrpcSvc = LighthouseServiceServer<Arc<Lighthouse>>;
+
+#[derive(Clone)]


why does Router need to be Cloneable?

I mainly made Router Cloneable so that calls to tonic's add_service would compile when constructing the LighthouseServer in src/bin/lighthouse.rs and src/lib.rs

d4l3k · 2025-05-20T20:32:30Z

src/router.rs

+        }
+
+        // Build room state once.
+        let lh = Lighthouse::new(tmpl.clone())


can we pass in the id into Lighthouse so we can prepend it to the Lighthouse log messages?

Sounds good, will include in next commit
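One common way to get a per-room prefix on every log line is to wrap the logger once, at construction time, rather than formatting the id at each call site. The Python sketch below shows that pattern with the standard `logging.LoggerAdapter`; it is an analogy for the Rust change, and `RoomLogAdapter`, the logger name, and the `[room]` prefix format are hypothetical choices.

```python
import io
import logging


class RoomLogAdapter(logging.LoggerAdapter):
    """Prepends the room id to every message logged through this adapter."""

    def process(self, msg, kwargs):
        return f"[{self.extra['room_id']}] {msg}", kwargs


# Capture output in memory so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(message)s"))

base = logging.getLogger("lighthouse_example")
base.addHandler(handler)
base.setLevel(logging.INFO)
base.propagate = False

log = RoomLogAdapter(base, {"room_id": "jobA"})
log.info("quorum formed with %d participants", 2)
output = stream.getvalue()
```

Each room would get its own adapter, so interleaved messages from different quorums stay distinguishable.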

d4l3k · 2025-05-20T20:35:13Z

torchft/multi_quorum_test.py

+
+import pytest
+
+import torchft._torchft as ext


can we use the torchft.coordination API for this test instead?

Sg, including in next commit

d4l3k · 2025-05-20T20:35:30Z

torchft/multi_quorum_test.py

@@ -0,0 +1,44 @@
+from __future__ import annotations
+
+import datetime as _dt


we usually just do from datetime import timedelta

Sg, including in next commit (moving test into lighthouse_test.py and will use existing imports)

d4l3k · 2025-05-20T20:37:50Z

src/router.rs

+        rooms: Arc<DashMap<String, Arc<GrpcSvc>>>,
+        tmpl: LighthouseOpt,
+        id: &str,
+    ) -> Arc<GrpcSvc> {


Should this be typed Arc<LighthouseServiceServer> instead?

This sounds good - am changing in line with the below thread (returning an Arc<Lighthouse> and then wrapping it in a LighthouseServiceServer once the method returns).

d4l3k · 2025-05-20T20:39:41Z

src/router.rs

+            .await
+            .expect("failed to create Lighthouse");
+
+        let svc_new = Arc::new(LighthouseServiceServer::new(lh));


Should we just be returning Arc<Lighthouse> from this method and constructing the LighthouseServiceServer wrapper on demand, so we don't need to clone it in the parent method?

Oh true, will include in next commit
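The shape of that suggestion, caching only the shared per-room state and building the thin service wrapper when a request needs it, looks roughly like this in Python. It is a sketch under assumptions, not the code from the follow-up commit: `RoomState`, `ServiceWrapper`, and `Rooms` are hypothetical stand-ins for the Rust types discussed here.

```python
import threading
from typing import Dict


class RoomState:
    """Hypothetical stand-in for the expensive, shared per-room state."""

    def __init__(self, room_id: str) -> None:
        self.room_id = room_id


class ServiceWrapper:
    """Hypothetical cheap wrapper that only holds a reference to the state."""

    def __init__(self, state: RoomState) -> None:
        self.state = state


class Rooms:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._states: Dict[str, RoomState] = {}

    def get_or_create(self, room_id: str) -> RoomState:
        # Only the shared state is cached; it is built once per room.
        with self._lock:
            state = self._states.get(room_id)
            if state is None:
                state = self._states[room_id] = RoomState(room_id)
            return state

    def service_for(self, room_id: str) -> ServiceWrapper:
        # The wrapper is constructed on demand around the cached state.
        return ServiceWrapper(self.get_or_create(room_id))


rooms = Rooms()
first = rooms.service_for("jobA")
second = rooms.service_for("jobA")
```

The two wrappers are distinct objects, but both point at the same cached state, so nothing room-specific is duplicated.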

d4l3k · 2025-05-20T20:40:42Z

torchft/multi_quorum_test.py

@@ -0,0 +1,44 @@
+from __future__ import annotations


could move this test to lighthouse_test.py to keep it with the rest of the lighthouse tests

…to Arc<Lighthouse>, Lighthouse::new now takes id prefix, test relocated to lighthouse_test.py and now uses coordination API, LighthouseServer now resolves host/port from the bound socket to give a routable http://host:port address

d4l3k

LGTM!

d4l3k · 2025-05-28T22:39:20Z

@MattKotzbauer looks like lint and unit tests are failing, not sure if that's related to this PR though

Initial attempt at gRPC metadata-based room assignment to support multiple quorums on a single LighthouseServer (pytorch#173)

fedd473

facebook-github-bot added the CLA Signed label May 5, 2025

Merge branch 'main' into HTTP/2_Multi_Room_Lighthouse

eb482e5

d4l3k requested changes May 6, 2025

Matt Kotzbauer and others added 2 commits May 7, 2025 15:57

Interceptor attached via LighthouseClient constructor rather than using add_room_header for each RPC call

5ab4c0c

Tonic-level routing changed to tower-level in src/router.rs - Server::builder calls (in src/bin/lighthouser.rs, src/lib.rs) and torchft/multi_quorum_test.py modified to reflect change.

0a9ce34

d4l3k requested changes May 20, 2025

d4l3k approved these changes May 23, 2025

d4l3k added 2 commits May 29, 2025 15:41

Merge remote-tracking branch 'origin' into HTTP/2_Multi_Room_Lighthouse

8aed1fc

lint

53ec8be

d4l3k force-pushed the HTTP/2_Multi_Room_Lighthouse branch from dc92b42 to 53ec8be Compare May 29, 2025 22:59

