Running multi-node workloads

To run multi-node workloads, you need to update /etc/hosts appropriately on every node, and have a hostfile.txt file in the project root directory.
To update /etc/hosts, map each node's randomly assigned Voltage Park name to its private IP address; this must be done on every node.
For example, if you are running a multi-node test on nodes 10.15.47.81 and 10.15.16.1, update the /etc/hosts file on both with the following:
10.15.47.81 g488.voltagepark.net g488
10.15.16.1 g081.voltagepark.net g081
The names (g488 and g081) come from Voltage Park; you'll get different names on your own cluster. Note that you need sudo permissions to update the /etc/hosts file.
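One way to apply the mappings above is a small idempotent loop that only appends entries that are not already present. This is a sketch: it defaults to a local hosts.example file so you can dry-run it safely; set HOSTS_FILE=/etc/hosts and run with sudo to apply it for real. The g488/g081 names and IPs are this guide's example values, not ones you will see on your cluster.

```shell
# Append each mapping only if it is not already in the target file.
# HOSTS_FILE defaults to a local copy for a safe dry run; point it at
# /etc/hosts (with sudo) to apply for real. Substitute your own names/IPs.
HOSTS_FILE="${HOSTS_FILE:-./hosts.example}"
touch "$HOSTS_FILE"
while read -r entry; do
  grep -qF "$entry" "$HOSTS_FILE" || echo "$entry" >> "$HOSTS_FILE"
done <<'EOF'
10.15.47.81 g488.voltagepark.net g488
10.15.16.1 g081.voltagepark.net g081
EOF
```

Because the loop skips entries that already exist, it is safe to re-run on a node that has already been configured.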
Inside hostfile.txt, specify the number of slots (GPUs) for each node, for example:
10.15.47.81 slots=8
10.15.16.1 slots=8
On each node, make sure you have git cloned the project into the same path as on the master node.
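The two steps above can be sketched as follows. This creates hostfile.txt with the example nodes, then prints (without running) the clone command needed on each worker; REPO_URL and the target path are placeholders you'd replace with your own, and running the printed commands assumes passwordless SSH between nodes.

```shell
# Create hostfile.txt in the project root (substitute your own IPs and
# slot counts).
cat > hostfile.txt <<'EOF'
10.15.47.81 slots=8
10.15.16.1 slots=8
EOF

REPO_URL="git@example.com:your-org/your-project.git"  # placeholder
PROJECT_DIR="$PWD"                                    # same path on every node
# Print the clone command for each host; review them, then run them once
# passwordless SSH is set up.
awk '{print $1}' hostfile.txt | while read -r host; do
  echo ssh "$host" "git clone $REPO_URL $PROJECT_DIR"
done
```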
Finally, to run the DeepSpeed script, run the following on your master node:
./deepspeed.sh --ib-disable inference/model.py
if the nodes are connected via Ethernet; otherwise, run:
./deepspeed.sh inference/model.py
This will sync your file system to each node and then run the DeepSpeed script. You can optionally specify a port other than the default of 29500 by running:
./deepspeed.sh --port 29501 inference/model.py
Note: The number of pipeline stages is automatically determined by the number of hosts in the hostfile.txt file.
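Since the stage count follows the host count, you can confirm how many pipeline stages you'll get with a quick line count. The snippet below recreates the two-node example hostfile.txt for illustration; in practice you would just count your existing file.

```shell
# Recreate the two-node example hostfile, then count non-empty lines:
# that count is the number of pipeline stages.
printf '10.15.47.81 slots=8\n10.15.16.1 slots=8\n' > hostfile.txt
NUM_STAGES=$(grep -c . hostfile.txt)
echo "pipeline stages: $NUM_STAGES"
```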