ideas from Sam: https://github.com/saforem2: stas00#17 (comment) https://quarto.org/, https://quarto.org/docs/gallery/, https://kevinheavey.github.io/modern-polars/, https://quarto.org/docs/output-formats/pdf-basics.html
https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
https://github.com/argonne-lcf/dlio_benchmark
Incoming suggestions from Ross Wightman to integrate:
- I'd try to separate volumes by workload: keep the 'lots of small files', high-churn stuff like environments and code separate from bulk storage like datasets and checkpoints. Possibly even split those two, since datasets are largely static while checkpoints are being rotated all the time.
- When datasets are on network storage, just like bucket storage, they should consist of large files AND be read as large files (sequentially, in large chunks, not mmapped!). Avoid seeking within datasets.
- Setups like HF datasets can be deceiving: it might look like one big file, but it is often being mmap'd and the IO read pattern is nuts, like 3-4x more iops than if you'd read them as individual files. Mmap loading can be turned off, but then for a lot of datasets you move the problem into the DataLoader processes, which have to read too much data into memory at once. Better awareness of the tradeoffs for different use cases is needed, and especially of using Iterable streaming when appropriate (see the sketch after this list).
- In a way, bucket storage like s3, via its interface limitations, enforces access patterns that are reasonable for such storage backends. The problem with a network filesystem mounted as a folder is that it invites doing whatever you want (mmap files, write loads of little ones, delete them all, etc.).
- One also cannot expect to treat a distributed filesystem like their local disk. If you separate volumes by workload you'd probably be able to utilize a much higher % of the total storage. Don't mix high-churn small files with low-churn large files.
- Also, note that once your datasets are optimally friendly for a large, distributed network filesystem, they can usually just be streamed from bucket storage in cloud systems that have that option. So it's better to move them off the network filesystem in that case.
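A minimal sketch of the mmap vs. iterable-streaming tradeoff mentioned above, using the HF `datasets` API (the dataset name is only an illustration, not one from these notes):

```python
from datasets import load_dataset

# default map-style dataset: backed by a memory-mapped Arrow file; random access
# works, but on a network filesystem the read pattern can degenerate into many
# small reads (high iops)
ds_mmap = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# iterable streaming: shards are read sequentially in large chunks, which is much
# friendlier to network/bucket storage and avoids materializing the whole dataset
ds_stream = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)
for sample in ds_stream.take(4):
    print(sample["text"][:80])
```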
Memory leak checking:
cuda-memcheck --leak-check full python program.py
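For leaks at the framework level (as opposed to native CUDA code), a crude complementary check - not a cuda-memcheck feature, just a sketch - is to compare PyTorch's allocated CUDA memory around a suspect block:

```python
import gc
import torch

def check_cuda_leak(fn, *args, **kwargs):
    """Report how much CUDA memory remains allocated after running fn."""
    torch.cuda.synchronize()
    gc.collect()
    torch.cuda.empty_cache()
    before = torch.cuda.memory_allocated()
    fn(*args, **kwargs)
    torch.cuda.synchronize()
    gc.collect()
    torch.cuda.empty_cache()
    after = torch.cuda.memory_allocated()
    print(f"still allocated after fn: {after - before} bytes")
```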
Race detection:
cuda-memcheck --tool racecheck
with extra options: --save to save the output to disk, --print-level to control the output verbosity
cuda-memcheck --tool racecheck --racecheck-report analysis
gdb with cuda
cuda-gdb
- integrate debug_utils.py
A good table with scaling equations for each type of parallelism: https://www.cerebras.net/blog/cerebras-sets-record-for-largest-ai-models-ever-trained-on-single-device#summary
Make a new benchmark section:
- nccl-tests
- all_reduce_bench.py (a simplified sketch follows after this list)
- https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/communication
- like nccl-tests, another common set of benchmarks used at HPC sites is the OSU micro-benchmarks, e.g. osu_lat, osu_bw, and osu_bibw.
https://mvapich.cse.ohio-state.edu/benchmarks/
These are MPI-based benchmarks. They can be run using GPUDirect RDMA, so you can measure MPI performance between GPUs, either on the same node or between nodes.
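A minimal sketch of an all-reduce benchmark along the lines of all_reduce_bench.py (simplified, not the actual script; assumes NCCL and a torchrun launch):

```python
# launch with e.g.: torchrun --nproc_per_node=8 all_reduce_sketch.py
import os
import time
import torch
import torch.distributed as dist

def main():
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group("nccl")

    x = torch.ones(1024**3 // 4, dtype=torch.float32, device="cuda")  # 1 GiB payload

    for _ in range(5):  # warmup
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    if dist.get_rank() == 0:
        n = dist.get_world_size()
        size_gb = x.numel() * x.element_size() / 1024**3
        algbw = size_gb / elapsed
        busbw = algbw * 2 * (n - 1) / n  # ring all-reduce correction factor
        print(f"algbw: {algbw:.1f} GB/s, busbw: {busbw:.1f} GB/s")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```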
References:
Non-IB specific:
ifconfig - display the status of the currently active interfaces
ip addr show - display the addresses for every link configured on the system
Display the local host's IB device status (3 different views):
ibstat
ibstatus
ibv_devinfo
Scan IB network:
ibnetdiscover - scan the topology
ibroute - display the unicast and multicast forwarding tables for the switches
ibdiagnet - IB diagnostic net
Check for network errors:
ibcheckerrors - check if the error counters of a port/node are within predefined thresholds
ibchecknet - perform port/node/errors check on the subnet
Test IB network configuration:
ibcheckport - perform some basic tests on the specified port
ibchecknode - perform some basic tests on the specified node
ibclearcounters - clear port counters for the InfiniBand subnet
Other checks:
iblinkinfo
ibcheck
wwibcheck
ibswitch - verify that an IB-QNEM is installed in the shelf
ibhosts - list all hosts in the IB network
ibswitches - list all IB switches
Tracing:
ibping - ping/pong between InfiniBand nodes
ibsysstat - obtain basic information for remote nodes (hostname, cpus, memory, utilization)
ibswitches - scan the net or use an existing net topology file and list all switches
ibhosts - scan the net or use an existing net topology file and list all hosts
Display network topology:
iblinkinfo -R
Use ifconfig to discover IPoIB networks, e.g. if you get an ib0 device with inet addr:100.1.1.102, you can connect to it, e.g. ping 100.1.1.102
Find the controller:
lspci | grep Mellanox
Print the driver configuration (the interface name comes from ifconfig):
ethtool -i enP49239s1
The perftest package includes:
ib_send_bw
ib_send_lat
ib_write_bw
ib_write_lat
ib_read_bw
ib_read_lat
ib_atomic_bw
ib_atomic_lat
example: ib_send_bw -a address - test bandwidth
qperf - measures bandwidth and latency between two nodes (TCP/IP and RDMA transports)
If the network is much slower than it should be, one might have to specify which HCAs to use (run ibv_devinfo to list the HCAs):
export NCCL_IB_HCA=mlx5
You might need to install the IB packages on the VMs:
sudo apt-get install -y automake dh-make git libcap2 libnuma-dev libtool make pkg-config udev curl librdmacm-dev rdma-core \
libgfortran5 bison chrpath flex graphviz gfortran tk dpatch quilt swig tcl ibverbs-utils infiniband-diags
sudo sed -i -e 's/# OS.EnableRDMA=y/OS.EnableRDMA=y/g' /etc/waagent.conf
- Verbs: allow commands to be executed on a feature-rich IB switch.
- integrate the features from testing_utils.py
- NUMA affinities
https://github.com/LLNL/mpibind/tree/master/python - mpibind for Python enables the use of the mpibind algorithm in arbitrary Python programs.
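Unrelated to mpibind's own algorithm (which is NUMA-topology aware), here is a minimal hand-rolled sketch of per-rank CPU core binding; it assumes Linux and the standard torchrun LOCAL_RANK/LOCAL_WORLD_SIZE env vars, and naively splits the allowed cores evenly:

```python
import os

def bind_local_rank(local_rank: int, local_world_size: int) -> None:
    """Pin this process to its own slice of the allowed CPU cores (Linux only)."""
    cores = sorted(os.sched_getaffinity(0))
    per_rank = max(1, len(cores) // local_world_size)
    mine = cores[local_rank * per_rank : (local_rank + 1) * per_rank]
    os.sched_setaffinity(0, set(mine))

# call early in each worker, e.g. under torchrun:
# bind_local_rank(int(os.environ["LOCAL_RANK"]), int(os.environ["LOCAL_WORLD_SIZE"]))
```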
- Training hanging detection tool:
This is to expand: https://github.com/stas00/ml-engineering/tree/master/fault-tolerance#is-job-hanging-watchdog
notes from Adam:
https://github.com/LLNL/STAT - the Stack Trace Analysis Tool https://hpc.llnl.gov/software/development-environment-software/stat-stack-trace-analysis-tool
https://github.com/grondo/io-watchdog
And you can see how we integrated STAT a ways down on this page:
https://hpc.llnl.gov/software/development-environment-software/stat-stack-trace-analysis-tool
There are some "action" scripts one has to write, which io-watchdog executes when it detects a hang. The contents of those aren't shown on the page, but I could look that up if you're curious. The user would create a config file like:
search /usr/local/tools/io-watchdog/actions
timeout = 20m
actions = STAT, kill
This configured io-watchdog to assume the job was stuck if it saw no output for 20 minutes (from rank 0), and then to run "STAT" to collect a stack trace and run "kill" to scancel the job. We had a couple others, like one to email the user that io-watchdog detected a hang. One then launches with:
srun --io-watchdog mpi_application
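Where io-watchdog isn't available, the same idea can be approximated in-process - a minimal sketch (not STAT/io-watchdog) that dumps all Python thread stacks if the training loop stops reporting progress:

```python
import faulthandler
import threading
import time

class HangWatchdog:
    """Dump all thread stack traces if no progress is reported for `timeout` seconds."""
    def __init__(self, timeout=20 * 60, poll=60):
        self.timeout = timeout
        self.poll = poll
        self.last_progress = time.monotonic()
        threading.Thread(target=self._watch, daemon=True).start()

    def notify(self):
        # call this from the training loop after every step
        self.last_progress = time.monotonic()

    def _watch(self):
        while True:
            time.sleep(self.poll)
            if time.monotonic() - self.last_progress > self.timeout:
                faulthandler.dump_traceback()  # poor man's STAT: all stacks to stderr
                self.last_progress = time.monotonic()

# usage:
# watchdog = HangWatchdog(timeout=20 * 60)
# for batch in dataloader:
#     ...
#     watchdog.notify()
```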
A quick demo of SCR. The Python API to use it is pretty clean:
Install the SCR library (C + MPI) https://scr.readthedocs.io/en/v3.0/users/build.html#cmake
Install the scr.py module: https://github.com/LLNL/scr/tree/develop/python#installing-the-scr-python-module
Example checkpoint in python: https://github.com/LLNL/scr/blob/1878de8756c2b51882a7cda7b97b142eae4e3995/python/scr_example.py#L64-L105