KungFu Contributor Guide

Intro

KungFu is a distributed machine learning framework implemented in Golang and C++. It provides Python binding for TensorFlow.

KungFu can be integrated into existing TensorFlow programs as easy as Horovod. It can be used as a standalone collective communication library for C/C++/Golang programs, similar to MPI.

Requirements

Golang1.11 or above is required (For the Modules feature).
CMake is required for building the C++ sources.
Python and TensorFlow is required if you are going to build the TensorFlow binding.
gtest is used for unittest, it can be auto fetched and built from source.

Project Structure

All source code are under ./srcs/<lang>/ where <lang> := cpp | go | python.

Libraries

libkungfu-comm: a C library implemented in Golang
libkungfu: the public C/C++ library based on kungfu-comm

Concepts

Peer: the basic unit in a KungFu cluster. A Peer usually represents a system process.
PeerID: PeerID is the unique identifier of a Peer. It tells the KungFu runner how to start the Peer and let peers to locate each other. PeerID is immutable during the life cycle of a Peer.
Session: a group of Peers interconnected by Connections. Currently a Peer can be in at most one Session at the same time. Peers in a Session can perform collective commuication operations, such as allreduce, broadcast.
PeerList: a PeerList is an ordered list of PeerID from all peers in a Session. All Peers in the same Session have the same PeerList, which allows Rank to be defined.
Connection: the communication channel between two Peers. A Connection is the high-level abstraction of a network connection, usually TCP.
Message: the basic communication element in a Connection.
HostSpec: HostSpec is the metadata that describes a host machine.
Graph: A directed communication graph of peers. A graph may contain self loops. The vertices are numbered from 0 to n - 1.

Components

Server: A TCP-based server that accepts Connections
Client: A TCP-based client that can send data using Connections
Handler: A TCP-based connection handler that can handle Connections

Useful commands

Format code

./scripts/clean-code.sh --fmt-py

Build for release

# build a .whl package for release
pip3 wheel -vvv --no-index .

Docker

# Run the following command in the KungFu folder
docker build -f docker/Dockerfile.tf-gpu -t kungfu:gpu .

# Run the docker
docker run -it kungfu:gpu

Use NVIDIA NCCL

KungFu can use NCCL to leverage GPU-GPU direct communication.

# uncomment to use your own NCCL
# export NCCL_HOME=$HOME/local/nccl

KUNGFU_ENABLE_NCCL=1 pip3 install .

To use NVLink, add the following to your Python code

...

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
from kungfu.python import _get_cuda_index
config.gpu_options.visible_device_list = str(_get_cuda_index())

...

with tf.Session(config=config) as sess:

...

and add -allow-nvlink flag to kungfu-run command

# export NCCL_DEBUG=INFO # uncomment to enable
kungfu-run -np 4 -allow-nvlink python3 benchmarks/system/benchmark_kungfu.py

Debug

export KUNGFU_CONFIG_LOG_LEVEL=DEBUG # or INFO | WARN | ERROR, the default is INFO

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CONTRIBUTING.md

CONTRIBUTING.md

KungFu Contributor Guide

Intro

Requirements

Project Structure

Libraries

Concepts

Components

Useful commands

Format code

Build for release

Docker

Use NVIDIA NCCL

Debug

Files

CONTRIBUTING.md

Latest commit

History

CONTRIBUTING.md

File metadata and controls

KungFu Contributor Guide

Intro

Requirements

Project Structure

Libraries

Concepts

Components

Useful commands

Format code

Build for release

Docker

Use NVIDIA NCCL

Debug