[Feature Request] Network performance benchmarks #586
This is interesting -- so you're saying we should model the relationship between message size and bandwidth/latency and use that information to change how we split the model?
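As a rough illustration of that idea (a hypothetical sketch, not exo's partitioning code; all names below are made up), per-link measurements like NetPIPE's could be fitted to the usual t(n) = latency + n/bandwidth model and the fit used to estimate the cost of moving a given tensor across that link when weighing split points:

```python
# Hypothetical sketch: fit t(n) = alpha + n / beta to measured transfer times,
# then use the model to estimate the cost of shipping a given payload size.
# Not exo's API; names, structure, and the sample numbers are illustrative only.
import numpy as np

def fit_link_model(sizes_bytes, times_s):
    """Least-squares fit of latency alpha (seconds) and bandwidth beta (bytes/s)."""
    sizes = np.asarray(sizes_bytes, dtype=float)
    A = np.column_stack([np.ones_like(sizes), sizes])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(times_s, dtype=float), rcond=None)
    alpha, inv_beta = coeffs
    return alpha, 1.0 / inv_beta

def transfer_cost(n_bytes, alpha, beta):
    """Predicted time to move n_bytes over the modeled link."""
    return alpha + n_bytes / beta

# Example: measurements of the kind NetPIPE produces (message size, time).
sizes = [1_000, 10_000, 100_000, 1_000_000]
times = [60e-6, 75e-6, 210e-6, 1.55e-3]
alpha, beta = fit_link_model(sizes, times)
print(f"latency ~{alpha*1e6:.0f} us, bandwidth ~{beta/1e9:.2f} GB/s")
print(f"predicted cost for a 4 MB activation: {transfer_cost(4_000_000, alpha, beta)*1e3:.2f} ms")
```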
My hosts file wasn't correct. The task was likely distributed to two processes on the same machine. The bandwidth at 65kbytes is still a bit unrealistically high, but it's more reasonable than the previous plots. np.hosts needed to be in this format:
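(A sketch with placeholder hostnames, assuming OpenMPI-style hostfile syntax; limiting each machine to one slot keeps the two NetPIPE ranks on separate hosts.)

```
# hypothetical np.hosts: one entry per physical machine, one slot each
machine-a.local slots=1
machine-b.local slots=1
```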
I also see a bit higher iperf3 speeds after re-connecting the cables: 35 Gbit/sec with one cable, 50 Gbit/sec with both. The connections have to be fully removed and re-made to hit these numbers.
Tried to get a benchmark on the 10G Ethernet built into these two devices. Could not get NetPIPE to progress beyond:
iperf3 works fine and shows 9.41 Gbit/sec in either direction. Either I'm holding it wrong or something's not behaving correctly under macOS Sequoia.
How do users notice network degradation between exo nodes in regular operation, whether the link is WiFi, Thunderbolt, Ethernet, or something else? And how do they characterize those links beyond what simpler tools like iperf3 report?
A typical tool for this is NetPIPE: https://netpipe.cs.ksu.edu/ . It uses MPI and SSH to answer the question: with a message size of N bytes, what are the minimum latency and maximum bandwidth between any two given nodes?
Installation
NetPIPE won't compile as-is on macOS (mpicc Apple clang-1600.0.26.4); the makefile must be modified to remove the RT library (i.e. drop -lrt from the link flags, since librt doesn't exist on macOS).
On Fedora 41 (mpicc gcc version 14.2.1 20240912), a different patch is required to compile:
np.hosts config
On any system, the self-to-self performance is measured with the following np.hosts file:
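(Sketch, assuming OpenMPI-style hostfile syntax: a single localhost entry with two slots keeps both ranks on the local machine.)

```
# hypothetical self-to-self np.hosts: both ranks run locally
localhost slots=2
```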
For a two-machine system, you'd use this kind of file:
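(Again a sketch with placeholder hostnames; substitute your machines' hostnames or addresses, and keep one slot per machine so the two ranks land on different hosts.)

```
# hypothetical two-machine np.hosts
mac-mini-1.local slots=1
mac-mini-2.local slots=1
```

With an MPI build, the benchmark is then typically launched with something like `mpirun -np 2 --hostfile np.hosts ./NPmpi` (the exact binary name depends on the build target you choose).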
The resulting output can then be plotted with npplot, which depends on gnuplot: https://gitlab.beocat.ksu.edu/PeterGottesman/netpipe-5.x/-/blob/master/npplot?ref_type=heads .
I created a gist that does everything except installation here: https://gist.github.com/phansel/26677111a61a53c0c3cdbdf94ae1a66e.
A future version of exo could characterize each path in a cluster at runtime and use that to improve resource allocation or report connectivity issues (e.g. degraded cable or connector).
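As a purely hypothetical sketch of the reporting half (not exo's actual code; names and numbers are illustrative), each node could keep a per-link baseline and flag links whose measured bandwidth drops well below it:

```python
# Hypothetical runtime link-health check: compare each link's latest measured
# bandwidth against a stored known-good baseline and flag big drops.
from dataclasses import dataclass

@dataclass
class LinkStats:
    peer: str
    baseline_gbps: float   # bandwidth measured when the link was known-good
    latest_gbps: float     # most recent measurement (e.g. a NetPIPE-style probe)

def degraded_links(links, threshold=0.5):
    """Return links whose current bandwidth is below threshold * baseline."""
    return [l for l in links if l.latest_gbps < threshold * l.baseline_gbps]

links = [
    LinkStats("mac-mini-2.local", baseline_gbps=35.0, latest_gbps=34.1),
    LinkStats("macbook.local",    baseline_gbps=9.4,  latest_gbps=2.1),
]
for link in degraded_links(links):
    print(f"warning: link to {link.peer} is degraded "
          f"({link.latest_gbps:.1f} Gbit/s vs baseline {link.baseline_gbps:.1f})")
```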
I'm curious what the TB4/TB5 performance looks like between a couple of Mac Mini nodes, or between a Mac Mini and a laptop on AC power vs. on its internal battery. Not much data on 40Gb TB4 or "80Gb" TB5 latency out there.
@AlexCheema props for publishing exo!