First, please update the submodule for the project.
git submodule update --init --recursive
Second, we need mamba or conda installed. You can install mamba here
We also need to set up the remote nodes for mpirun or torchrun. Make sure that /etc/hosts and ~/.ssh/config are properly configured, i.e. you can ssh to each remote node by hostname without a password.
We need two conda environments, which we call ori and emu: ori holds the unmodified original code, and emu holds the code for the emulator. The environment variable ENV_PATH specifies the path to the conda environment. In this example, we set ENV_PATH=~/neurona/ori for ori and ENV_PATH=~/neurona/emu for emu. With ENV_PATH set, use the following commands to create the environment.
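The passwordless-ssh requirement above can be checked before launching mpirun or torchrun. The following is a minimal sketch, not part of this repository: the helper names `ssh_check_command` and `can_ssh` are hypothetical, and it assumes an OpenSSH client on the PATH. `BatchMode=yes` makes ssh fail instead of prompting for a password, so a node that still needs a password is reported as unreachable.

```python
import subprocess


def ssh_check_command(host, timeout_s=5):
    """Build an ssh command that runs `true` on the host and never prompts."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                 # fail rather than ask for a password
        "-o", f"ConnectTimeout={timeout_s}",   # do not hang on unreachable hosts
        host,
        "true",
    ]


def can_ssh(host, timeout_s=5):
    """Return True if a passwordless ssh login to the host succeeds."""
    try:
        result = subprocess.run(
            ssh_check_command(host, timeout_s),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the connection attempt hung
        return False
    return result.returncode == 0
```

For example, `can_ssh("node1")` (where `node1` stands in for one of your hostnames in /etc/hosts) should return True on a correctly configured setup.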
mkdir -p $ENV_PATH
git submodule update --init --recursive
bash ./scripts/create_env.sh $ENV_PATH # takes about 30 minutes
nvcc --version
# expect:
# Cuda compilation tools, release 11.8, V11.8.89
# Build cuda_11.8.r11.8/compiler.31833905_0
We need to keep $ENV_PATH set in config.sh.
touch config.sh
# in config.sh
export ENV_PATH=your_env_path
First, build nccl, and make sure you have the conda environment properly configured.
We need to build nccl from source, for both emu and ori.
cd nccl
git switch [emu/ori] # switch to the branch you want to build
cd ..
. ./config.sh
bash ./scripts/build_nccl.sh # should take less than 5 minutes
Add the following variables to config.sh to set up the required environment:
You can use query_gpu.py to get the compute capability of your GPU.
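As an illustration of how a compute capability maps to the NVCC_GENCODE value used in the config below, here is a minimal sketch. It is not the repository's query_gpu.py; the function names are hypothetical, and it assumes a driver recent enough that `nvidia-smi` supports the `compute_cap` query field. It reads the first GPU only.

```python
import subprocess


def parse_compute_cap(text):
    """Parse nvidia-smi output such as '8.6' into (major, minor), first GPU only."""
    first_line = text.strip().splitlines()[0]
    major, minor = first_line.strip().split(".")
    return int(major), int(minor)


def gencode_flag(major, minor):
    """Format the NVCC_GENCODE value for one compute capability."""
    cc = f"{major}{minor}"
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


def local_gencode_flag():
    """Query the local GPU through nvidia-smi and return its gencode flag."""
    output = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return gencode_flag(*parse_compute_cap(output))
```

On a machine with a GPU, `print(local_gencode_flag())` prints a string that can be pasted into the NVCC_GENCODE line of config.sh.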
#config for debug
export NCCL_DEBUG=MOD
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG_FILE=your_debug_file.$(date "+%Y-%m-%d %H:%M:%S")_%h:%p%h:%p
export NCCL_PROTO=Simple
export NCCL_ALGO=Ring
export NCCL_BUILD_PATH=your_nccl_build_path # a local file system like /tmp is recommended
# export NVCC_GENCODE="-gencode=arch=compute_[your_compute],code=sm_[your_sm]"
# export ONLY_FUNCS="AllReduce Sum (f16|f32) RING SIMPLE"
# these two are used to reduce compile time
export DEBUG=1
# for current experiments, we only use 2 nodes, 1 gpu per node
export CUDA_VISIBLE_DEVICES=0
export OMPI_COMM_WORLD_SIZE=2
export OMPI_COMM_WORLD_LOCAL_RANK=0
#config for release
export NCCL_DEBUG=VERSION
export NCCL_DEBUG_SUBSYS=INIT
export NCCL_PROTO=Simple
export NCCL_ALGO=Ring
export NCCL_BUILD_PATH=your_nccl_build_path # a local file system like /tmp is recommended
unset ONLY_FUNCS
unset NVCC_GENCODE
export DEBUG=0
# for current experiments, we only use 2 nodes, 1 gpu per node
export CUDA_VISIBLE_DEVICES=0
export OMPI_COMM_WORLD_SIZE=2
export OMPI_COMM_WORLD_LOCAL_RANK=0
After building nccl, we need to build pytorch using our nccl.
cd pytorch
git switch [emu/ori] # switch to the branch you want to build
cd ..
bash ./scripts/build_pytorch.sh # takes about 30 minutes
conda activate $ENV_PATH
python
import torch
torch.__version__ # expect 2.2.0a0+git8ac9b20
torch.cuda.is_available()
torch.cuda.nccl.version() # expect 2.19.4
Please check the README.md in the eval folder for more details.
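The interactive check of the pytorch build (version string, CUDA availability, NCCL version) can also be scripted. The sketch below is not part of the repository, and the names `format_nccl_version` and `check_build` are hypothetical. It assumes that `torch.cuda.nccl.version()` returns a tuple such as `(2, 19, 4)`, as recent PyTorch releases do, and it imports torch lazily so that the helper functions load even outside the conda environment.

```python
EXPECTED_TORCH = "2.2.0a0+git8ac9b20"  # expected value quoted in this README
EXPECTED_NCCL = "2.19.4"               # expected value quoted in this README


def format_nccl_version(version):
    """Turn a version tuple such as (2, 19, 4) into the string '2.19.4'."""
    return ".".join(str(part) for part in version)


def check_build():
    """Return a list of mismatches; an empty list means the build looks right."""
    import torch  # imported here so this module loads without torch installed

    problems = []
    if torch.__version__ != EXPECTED_TORCH:
        problems.append(f"torch version is {torch.__version__}")
    if not torch.cuda.is_available():
        problems.append("CUDA is not available")
    else:
        nccl = format_nccl_version(torch.cuda.nccl.version())
        if nccl != EXPECTED_NCCL:
            problems.append(f"NCCL version is {nccl}")
    return problems
```

Run `print(check_build())` inside the activated environment; any entries in the returned list name the checks that did not match.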