ideas from Sam: https://github.com/saforem2: stas00#17 (comment) https://quarto.org/, https://quarto.org/docs/gallery/, https://kevinheavey.github.io/modern-polars/, https://quarto.org/docs/output-formats/pdf-basics.html
https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
https://github.com/argonne-lcf/dlio_benchmark
Incoming suggestions from Ross Wightman to integrate:
- I'd try to separate volumes by workload: keep the 'lots of small files', high-churn stuff like environments and code separate from bulk storage like datasets and checkpoints. Possibly even split those two, since datasets are largely static while checkpoints are being rotated all the time.
- When datasets are on network storage, just like bucket storage, they should consist of large files AND be read as large files (sequentially, in large chunks, not mmapped!). Avoid seeking within datasets.
- Setups like HF datasets can be deceiving: it might look like one big file, but it is often being mmap'd and the IO read pattern is nuts, like 3-4x more iops than if you'd read them as individual files. Mmap loading can be turned off, but then for a lot of datasets you move the problem into the DataLoader processes, which have to read too much data into memory at once. Better awareness of the tradeoffs for different use cases is needed, and especially of using Iterable streaming when appropriate (see the sketch after this list).
- In a way, bucket storage like s3, via its interface limitations, enforces access patterns that are reasonable for such storage backends. The problem with a network filesystem mounted as a folder is that it invites doing whatever you want (mmap files, write loads of little ones, delete them all, etc.).
- One also cannot expect to treat a distributed filesystem like their local disk. If you separate volumes by workload you'd probably be able to utilize a much higher % of the total storage. Don't mix high-churn small files with low-churn large files.
- Also, note that once your datasets are optimally friendly for a large, distributed network filesystem, they can usually just be streamed from bucket storage in cloud systems that have that option. So it's better to move them off the network filesystem in that case.
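A minimal sketch of the mmap vs. iterable-streaming tradeoff mentioned above, using the HF `datasets` API (the dataset name is only an illustration, not one from these notes):

```python
from datasets import load_dataset

# default map-style dataset: backed by a memory-mapped Arrow file; random access
# works, but on a network filesystem the read pattern can degenerate into many
# small reads (high iops)
ds_mmap = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# iterable streaming: shards are read sequentially in large chunks, which is much
# friendlier to network/bucket storage and avoids materializing the whole dataset
ds_stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)
for sample in ds_stream.take(4):
    print(sample["text"][:80])
```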
Memory leak checking:
cuda-memcheck --leak-check full python program.py
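For leaks at the framework level (as opposed to native CUDA code), a crude complementary check - not a cuda-memcheck feature, just a sketch - is to compare PyTorch's allocated CUDA memory around a suspect block:

```python
import gc
import torch

def check_cuda_leak(fn, *args, **kwargs):
    """Report how much CUDA memory remains allocated after running fn."""
    torch.cuda.synchronize()
    gc.collect()
    torch.cuda.empty_cache()
    before = torch.cuda.memory_allocated()
    fn(*args, **kwargs)
    torch.cuda.synchronize()
    gc.collect()
    torch.cuda.empty_cache()
    after = torch.cuda.memory_allocated()
    print(f"still allocated after fn: {after - before} bytes")
```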
Race detection:
cuda-memcheck --tool racecheck
with extra options: --save to save the output to disk, --print-level to control the output verbosity
cuda-memcheck --tool racecheck --racecheck-report analysis
gdb with cuda
cuda-gdb
- integrate debug_utils.py
A good table with scaling equations for each type of parallelism: https://www.cerebras.net/blog/cerebras-sets-record-for-largest-ai-models-ever-trained-on-single-device#summary
Make a new benchmark section:
- nccl-tests
- all_reduce_bench.py (a simplified sketch follows after this list)
- https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/communication
- like nccl-tests, another common set of benchmarks used at HPC sites is the OSU micro-benchmarks, e.g. osu_lat, osu_bw, and osu_bibw.
https://mvapich.cse.ohio-state.edu/benchmarks/
These are MPI-based benchmarks. They can be run using GPUDirect RDMA, so you can measure MPI performance between GPUs, either on the same node or between nodes.
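A minimal sketch of an all-reduce benchmark along the lines of all_reduce_bench.py (simplified, not the actual script; assumes NCCL and a torchrun launch):

```python
# launch with e.g.: torchrun --nproc_per_node=8 all_reduce_sketch.py
import os
import time
import torch
import torch.distributed as dist

def main():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group("nccl")

    x = torch.ones(1024**3 // 4, dtype=torch.float32, device="cuda")  # 1 GiB payload

    for _ in range(5):  # warmup
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    if dist.get_rank() == 0:
        n = dist.get_world_size()
        size_gb = x.numel() * x.element_size() / 1024**3
        algbw = size_gb / elapsed
        busbw = algbw * 2 * (n - 1) / n  # ring all-reduce correction factor
        print(f"algbw: {algbw:.1f} GB/s, busbw: {busbw:.1f} GB/s")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```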
References:
Non-IB specific:
ifconfig - display the status of the currently active interfaces
ip addr show - display the addresses for every link configured on the system
Display the local host's IB device status (3 different views):
ibstat
ibstatus
ibv_devinfo
Scan IB network:
ibnetdiscover - scan the topology
ibroute - display the unicast and multicast forwarding tables for the switches
ibdiagnet - IB diagnostic net
Check for network errors:
ibcheckerrors - check if the error counters of a port/node are within predefined thresholds
ibchecknet - perform port/node/errors check on the subnet
Test IB network configuration:
ibcheckport - perform some basic tests on the specified port
ibchecknode - perform some basic tests on the specified node
ibclearcounters - clear port counters for the InfiniBand subnet
Other checks:
iblinkinfo
ibcheck
wwibcheck
ibswitch - verify that an IB-QNEM is installed in the shelf
ibhosts - list all hosts in the IB network
ibswitches - list all IB switches
Tracing:
ibping - ping/pong between InfiniBand nodes
ibsysstat - obtain basic information for remote nodes (hostname, cpus, memory, utilization)
ibswitches - scan the net or use an existing net topology file and list all switches
ibhosts - scan the net or use an existing net topology file and list all hosts
Display network topology:
iblinkinfo -R
Use ifconfig to discover IPoIB networks, e.g. if you get an ib0 device with inet addr:100.1.1.102, you can connect to it, e.g. ping 100.1.1.102
Find the controller:
lspci | grep Mellanox
Print the driver configuration (the interface name comes from ifconfig):
ethtool -i enP49239s1
The perftest package includes:
ib_send_bw
ib_send_lat
ib_write_bw
ib_write_lat
ib_read_bw
ib_read_lat
ib_atomic_bw
ib_atomic_lat
example: ib_send_bw -a address - test bandwidth
qperf - measures bandwidth and latency between two nodes (TCP/IP and RDMA transports)
If the network is much slower than it should be, one might have to specify which HCAs to use (run ibv_devinfo to list the HCAs):
export NCCL_IB_HCA=mlx5
You might need to install the IB packages on the VMs:
sudo apt-get install -y automake dh-make git libcap2 libnuma-dev libtool make pkg-config udev curl librdmacm-dev rdma-core \
libgfortran5 bison chrpath flex graphviz gfortran tk dpatch quilt swig tcl ibverbs-utils infiniband-diags
sudo sed -i -e 's/# OS.EnableRDMA=y/OS.EnableRDMA=y/g' /etc/waagent.conf
- Verbs: allow commands to be executed on a feature-rich IB switch.
- integrate the features from testing_utils.py
- NUMA affinities
https://github.com/LLNL/mpibind/tree/master/python - mpibind for Python enables the use of the mpibind algorithm in arbitrary Python programs.
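Unrelated to mpibind's own algorithm (which is NUMA-topology aware), here is a minimal hand-rolled sketch of per-rank CPU core binding; it assumes Linux and the standard torchrun LOCAL_RANK/LOCAL_WORLD_SIZE env vars, and naively splits the allowed cores evenly:

```python
import os

def bind_local_rank(local_rank: int, local_world_size: int) -> None:
    """Pin this process to its own slice of the allowed CPU cores (Linux only)."""
    cores = sorted(os.sched_getaffinity(0))
    per_rank = max(1, len(cores) // local_world_size)
    mine = cores[local_rank * per_rank : (local_rank + 1) * per_rank]
    os.sched_setaffinity(0, set(mine))

# call early in each worker, e.g. under torchrun:
# bind_local_rank(int(os.environ["LOCAL_RANK"]), int(os.environ["LOCAL_WORLD_SIZE"]))
```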
- Training hanging detection tool:
This is to expand: https://github.com/stas00/ml-engineering/tree/master/fault-tolerance#is-job-hanging-watchdog
notes from Adam:
https://github.com/LLNL/STAT - the Stack Trace Analysis Tool https://hpc.llnl.gov/software/development-environment-software/stat-stack-trace-analysis-tool
https://github.com/grondo/io-watchdog
And you can see how we integrated STAT a ways down on this page:
https://hpc.llnl.gov/software/development-environment-software/stat-stack-trace-analysis-tool
There are some "action" scripts one has to write, which io-watchdog executes when it detects a hang. The contents of those aren't shown on the page, but I could look that up if you're curious. The user would create a config file like:
search /usr/local/tools/io-watchdog/actions
timeout = 20m
actions = STAT, kill
This configured io-watchdog to assume the job was stuck if it saw no output for 20 minutes (from rank 0), and then to run "STAT" to collect a stack trace and run "kill" to scancel the job. We had a couple others, like one to email the user that io-watchdog detected a hang. One then launches with:
srun --io-watchdog mpi_application
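Where io-watchdog isn't available, the same idea can be approximated in-process - a minimal sketch (not STAT/io-watchdog) that dumps all Python thread stacks if the training loop stops reporting progress:

```python
import faulthandler
import threading
import time

class HangWatchdog:
    """Dump all thread stack traces if no progress is reported for `timeout` seconds."""
    def __init__(self, timeout=20 * 60, poll=60):
        self.timeout = timeout
        self.poll = poll
        self.last_progress = time.monotonic()
        threading.Thread(target=self._watch, daemon=True).start()

    def notify(self):
        # call this from the training loop after every step
        self.last_progress = time.monotonic()

    def _watch(self):
        while True:
            time.sleep(self.poll)
            if time.monotonic() - self.last_progress > self.timeout:
                faulthandler.dump_traceback()  # poor man's STAT: all stacks to stderr
                self.last_progress = time.monotonic()

# usage:
# watchdog = HangWatchdog(timeout=20 * 60)
# for batch in dataloader:
#     ...
#     watchdog.notify()
```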
A quick demo of SCR. The Python API to use it is pretty clean:
Install the SCR library (C + MPI) https://scr.readthedocs.io/en/v3.0/users/build.html#cmake
Install the scr.py module: https://github.com/LLNL/scr/tree/develop/python#installing-the-scr-python-module
Example checkpoint in python: https://github.com/LLNL/scr/blob/1878de8756c2b51882a7cda7b97b142eae4e3995/python/scr_example.py#L64-L105