Commit aaf3bfa

luomai authored and lgarithm committed
Transparent distributed model training through TensorLayer GPU Trainer (tensorlayer#700)
* Add Horovod trainer.
* fix
* working on the mnist tutorial.
* fix build error
* first working demo
* is_fix=True
* working on fixing Hao's comments
* Fix more of Hao's comments.
* fix
* fix
* add todos
* fix
* fix runtime error
* fix lints
* fix
* Deprecate old distributed APIs.
* check in the script to help users install the distributed runner environment.
* wip: validation.
* minor
* fix
* fix
* Add validation
* wip: validation loss.
* validation works
* minor
* clean code
* simplify
* improve the APIs
* fix codacy
* install libopenmpi-dev before building docs
* run apt as root
* fix API
* revert openmpi and horovod installation
* format code
* more API tuning.
* more API refinement
* rename
* fix lints
* simplify code
* fix code.
* we must guard the use of the session after should_stop is set.
* fix bug
* test validation.
* Add documentation.
* add more comments.
* minor
* install openmpi and horovod when building docs
* fix yapf errors.
* yapf
* fix installing horovod
* requirements_distributed.txt
* add distributed to extras_require
* install openmpi when READTHEDOCS=True
* custom_build_ext
* simplify
* fix cmdclass
* try fix
* try fix
* fix import
* fix
* use setup_py_install
* disable codacy for scripts
* disable bandit engine
* add Trainer to autosummary
* autofunction:: Trainer
* fix
* disable codacy for setup.py
* remove Trainer from doc
* add Dockerfile for building docs locally
* disable crazy Codacy for Docker files!
* check in a script to help install CUDA 9 and cuDNN 7
* fix
* minor fix
* add cifar10 example for distributed trainer
* change default parameters
* add a log storage option for TensorDB.
* add comments
* tensordb
* remove the tensordb example.
* add changelog
* update scripts
* remove tensordb option
* add option for learning rate.
* more comments.
* add an option
* fix logging
* fix .travis.yml
* fix format
* build doc with python 3.6 (pypa/setuptools#885)
* use latest RTD builder
* try fix rtd
* try fix rtd
* use absolute path
* fix SCRIPT_DIR
1 parent dd7093b commit aaf3bfa

23 files changed: +675 −91 lines changed
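
Distilled from the two tutorials added in this diff, the heart of the new API looks roughly like the following self-contained sketch; random data stands in for a real dataset, and exact signatures should be checked against the tutorials below. Launch it under OpenMPI, e.g. mpirun -np 2 python3 sketch.py, with the full set of recommended flags given in the CIFAR10 tutorial's docstring.

# A minimal, hedged sketch of the new tl.distributed.Trainer API,
# distilled from the tutorials added below; random data stands in
# for a real dataset.
import numpy as np
import tensorflow as tf
import tensorlayer as tl


def build_train(x, y_):
    # Must return (network, cost, log_tensors), as in the tutorials.
    net = tl.layers.InputLayer(x, name='in')
    net = tl.layers.DenseLayer(net, n_units=10, act=tf.identity, name='out')
    cost = tl.cost.cross_entropy(net.outputs, y_, name='cost')
    return net, cost, {'cost': cost}


images = np.random.rand(256, 784).astype(np.float32)
labels = np.random.randint(0, 10, size=256).astype(np.int64)
training_dataset = tf.data.Dataset.zip((
    tf.data.Dataset.from_tensor_slices(images),
    tf.data.Dataset.from_tensor_slices(labels),
))

trainer = tl.distributed.Trainer(
    build_training_func=build_train, training_dataset=training_dataset,
    batch_size=32, optimizer=tf.train.RMSPropOptimizer,
    optimizer_args={'learning_rate': 0.001}
)
trainer.train_to_end()  # or drive the loop manually, as the tutorials show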

.codacy.yaml

+9
@@ -0,0 +1,9 @@
+# https://support.codacy.com/hc/en-us/articles/115002130625-Codacy-Configuration-File
+---
+engines:
+  bandit:
+    enabled: false # FIXME: make it work
+exclude_paths:
+  - scripts/*
+  - setup.py
+  - docker/**/*

.gitignore

+2-1
@@ -114,6 +114,7 @@ venv3/
 .vscode/

 # TensorLayer Directories
+checkpoints
 data/
 lib_win/

@@ -123,4 +124,4 @@ update_tl.py

 # Data Files and ByteCode files
 *.gz
-*.npz
+*.npz

.readthedocs.yml

+9-9
@@ -1,14 +1,14 @@
+# https://docs.readthedocs.io/en/latest/yaml-config.html
+
+build:
+  image: latest # For python 3.6
+
 formats:
   - epub
   - pdf

 python:
-  version: 3.5
-  pip_install: true
-  extra_requirements:
-    - contrib_loggers
-    - db
-    - dev
-    - doc
-    - extra
-    - test
+  version: 3.6
+  # to build customized extension, we have to use setup_py_install instead of pip_install
+  # https://docs.readthedocs.io/en/latest/yaml-config.html#python-setup-py-install
+  setup_py_install: true

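The switch to setup_py_install exists because the commit adds a customized build step (the commit message mentions custom_build_ext and fix cmdclass); the setup.py change itself is not shown on this page. For readers unfamiliar with the pattern, here is a hypothetical sketch of such a hook; the class body and metadata are illustrative assumptions, not the commit's actual code.

# Hypothetical sketch of a cmdclass hook like the custom_build_ext the
# commit message mentions; setup.py is not shown in this diff, so the
# class body and metadata below are illustrative assumptions only.
from setuptools import setup
from setuptools.command.build_ext import build_ext


class custom_build_ext(build_ext):
    def run(self):
        # A custom step could, e.g., prepare system dependencies such as
        # OpenMPI before the normal extension build runs.
        build_ext.run(self)


setup(
    name='tensorlayer',
    cmdclass={'build_ext': custom_build_ext},
)
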
.travis.yml

+8-2
@@ -59,8 +59,14 @@ matrix:


 install:
-  - if [[ -v _DOC_AND_YAPF_TEST ]]; then pip install tensorflow; else pip install tensorflow==$_TF_VERSION; fi
-  - pip install -e .[all_dev]
+  - |
+    if [[ -v _DOC_AND_YAPF_TEST ]]; then
+      pip install tensorflow
+      ./scripts/install-horovod-for-doc-test.sh
+    else
+      pip install tensorflow==$_TF_VERSION;
+    fi
+  - pip install -e .[all_cpu_dev]


 script:

CHANGELOG.md

+5-2
@@ -72,6 +72,7 @@ To release a new version, please update the changelog as followed:
 - API:
   - `tl.model.vgg19` added (PR #698)
   - `tl.logging.contrib.hyperdash` added (PR #739)
+  - `tl.distributed.trainer` added (PR #700)
 - Documentation:
   - Add binary, ternary and dorefa links (PR #711)
   - Update input scale of VGG16 and VGG19 to 0~1 (PR #736)
@@ -83,6 +84,7 @@ To release a new version, please update the changelog as followed:
 - `tutorial_models_vgg19` has been introduced to show how to use `tl.model.vgg19` (PR #698).
 - fix bug of `tutorial_bipedalwalker_a3c_continuous_action.py` (PR #734, Issue #732)
 - `tutorial_models_vgg16` and `tutorial_models_vgg19` has been changed the input scale from [0,255] to [0,1](PR #710)
+- `tutorial_mnist_distributed_trainer.py` and `tutorial_cifar10_distributed_trainer.py` are added to explain the uses of Distributed Trainer (PR #700)

 ### Changed
 - all the input scale in both vgg16 and vgg19 has been changed the input scale from [0,255] to [0,1](PR #710)
@@ -110,9 +112,10 @@ To release a new version, please update the changelog as followed:

 ### Contributors
 - @DEKHTIARJonathan: #739 #747 #750
-- @lgarithm: #705
+- @lgarithm: #705 #700
 - @OwenLiuzZ: #698 #710
-- @zsdonghao: #711 #712 #734 #736 #737
+- @zsdonghao: #711 #712 #734 #736 #737 #700
+- @luomai: #700

 ## [1.9.0] - 2018-06-16

Makefile

+6
@@ -44,3 +44,9 @@ format:

 install3:
 	pip3 install -U . --user
+
+
+TAG = tensorlayer-docs:snaphot
+
+doc:
+	docker build --rm -t $(TAG) -f docker/docs/Dockerfile .
docker/docs/Dockerfile

+14
@@ -0,0 +1,14 @@
+FROM ubuntu:bionic
+
+ADD docker/docs/sources.list.ustc /etc/apt/sources.list
+ENV DEBIAN_FRONTEND=noninteractive
+RUN apt update && \
+    apt install -y python3-pip python3-tk python-qt4 wget && \
+    pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple tensorflow
+ADD . /tensorlayer
+WORKDIR /tensorlayer
+RUN ln -s `which pip3` /usr/bin/pip && \
+    ./scripts/install-horovod-for-doc-test.sh
+RUN pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple .
+RUN pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -e .[all]
+RUN make -C docs html

docker/docs/sources.list.ustc

+15
@@ -0,0 +1,15 @@
+deb http://mirrors.ustc.edu.cn/ubuntu/ bionic main restricted
+
+deb http://mirrors.ustc.edu.cn/ubuntu/ bionic-updates main restricted
+
+deb http://mirrors.ustc.edu.cn/ubuntu/ bionic universe
+deb http://mirrors.ustc.edu.cn/ubuntu/ bionic-updates universe
+
+deb http://mirrors.ustc.edu.cn/ubuntu/ bionic multiverse
+deb http://mirrors.ustc.edu.cn/ubuntu/ bionic-updates multiverse
+
+deb http://mirrors.ustc.edu.cn/ubuntu/ bionic-backports main restricted universe multiverse
+
+deb http://mirrors.ustc.edu.cn/ubuntu bionic-security main restricted
+deb http://mirrors.ustc.edu.cn/ubuntu bionic-security universe
+deb http://mirrors.ustc.edu.cn/ubuntu bionic-security multiverse

docs/modules/distributed.rst

+3-2
@@ -18,10 +18,11 @@ Check this `minst example <https://github.com/tensorlayer/tensorlayer/blob/maste


 Distributed training
----------------------
+--------------------
+

 TaskSpecDef
-^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^

 .. autofunction:: TaskSpecDef

docs/user/example.rst

+1-1
@@ -19,7 +19,7 @@ Basics
 - Data augmentation with TFRecord. Effective way to load and pre-process data, see `tutorial_tfrecord*.py <https://github.com/tensorlayer/tensorlayer/tree/master/example>`__ and `tutorial_cifar10_tfrecord.py <https://github.com/tensorlayer/tensorlayer/blob/master/example/tutorial_cifar10_tfrecord.py>`__.
 - Data augmentation with TensorLayer, see `tutorial_image_preprocess.py <https://github.com/tensorlayer/tensorlayer/blob/master/example/tutorial_image_preprocess.py>`__.
 - Float 16 half-precision model, see `tutorial_mnist_float16.py <https://github.com/tensorlayer/tensorlayer/blob/master/example/tutorial_mnist_float16.py>`__.
-- Distributed Training. `mnist <https://github.com/tensorlayer/tensorlayer/blob/master/example/tutorial_mnist_distributed.py>`__ and `imagenet <https://github.com/tensorlayer/tensorlayer/blob/master/example/tutorial_inceptionV3_tfslim.py>`__ by `jorgemf <https://github.com/jorgemf>`__.
+- Transparent distributed training. `mnist <https://github.com/tensorlayer/tensorlayer/blob/master/example/tutorial_mnist_distributed_trainer.py>`__ by `luomai <https://github.com/luomai>`__.

 Vision
 ==================

example/tutorial_cifar10_distributed_trainer.py

+120

@@ -0,0 +1,120 @@
+#! /usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+1. Before you start, run this script: https://github.com/tensorlayer/tensorlayer/blob/distributed/scripts/download_and_install_openmpi3_linux.sh
+2. Update the PATH with OpenMPI bin by running: PATH=$PATH:$HOME/local/openmpi/bin
+   Update the PATH in ~/.bashrc if you want OpenMPI to be ready once the machine start
+3. Then XXXXX Milo please add this part
+    mpirun -np 2 \
+        -bind-to none -map-by slot \
+        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
+        -mca pml ob1 -mca btl ^openib \
+        python3 xxxxx.py
+"""
+
+import numpy as np
+import multiprocessing
+import tensorflow as tf
+import tensorlayer as tl
+from tensorlayer.layers import InputLayer, Conv2d, BatchNormLayer, DenseLayer, FlattenLayer, MaxPool2d
+
+tf.logging.set_verbosity(tf.logging.DEBUG)
+tl.logging.set_verbosity(tl.logging.DEBUG)
+
+
+def make_dataset(images, labels):
+    img = tf.data.Dataset.from_tensor_slices(images)
+    lab = tf.data.Dataset.from_tensor_slices(np.array(labels, dtype=np.int64))
+    return tf.data.Dataset.zip((img, lab))
+
+
+def data_aug_train(img, ann):
+    # 1. Randomly crop a [height, width] section of the image.
+    img = tf.random_crop(img, [24, 24, 3])
+    # 2. Randomly flip the image horizontally.
+    img = tf.image.random_flip_left_right(img)
+    # 3. Randomly change brightness.
+    img = tf.image.random_brightness(img, max_delta=63)
+    # 4. Randomly change contrast.
+    img = tf.image.random_contrast(img, lower=0.2, upper=1.8)
+    # 5. Subtract off the mean and divide by the variance of the pixels.
+    img = tf.image.per_image_standardization(img)
+    return img, ann
+
+
+def data_aug_valid(img, ann):
+    # 1. Crop the central [height, width] of the image.
+    img = tf.image.resize_image_with_crop_or_pad(img, 24, 24)
+    # 2. Subtract off the mean and divide by the variance of the pixels.
+    img = tf.image.per_image_standardization(img)
+    return img, ann
+
+
+def model(x, is_train):
+    with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
+        net = InputLayer(x, name='input')
+        net = Conv2d(net, 64, (5, 5), (1, 1), padding='SAME', b_init=None, name='cnn1')
+        net = BatchNormLayer(net, is_train, act=tf.nn.relu, name='batch1')
+        net = MaxPool2d(net, (3, 3), (2, 2), padding='SAME', name='pool1')
+
+        net = Conv2d(net, 64, (5, 5), (1, 1), padding='SAME', b_init=None, name='cnn2')
+        net = BatchNormLayer(net, is_train, act=tf.nn.relu, name='batch2')
+        net = MaxPool2d(net, (3, 3), (2, 2), padding='SAME', name='pool2')
+
+        net = FlattenLayer(net, name='flatten')
+        net = DenseLayer(net, 384, act=tf.nn.relu, name='d1relu')
+        net = DenseLayer(net, 192, act=tf.nn.relu, name='d2relu')
+        net = DenseLayer(net, 10, act=None, name='output')
+    return net
+
+
+def build_train(x, y_):
+    net = model(x, is_train=True)
+    cost = tl.cost.cross_entropy(net.outputs, y_, name='cost_train')
+    L2 = 0
+    for p in tl.layers.get_variables_with_name('relu/W', True, True):
+        L2 += tf.contrib.layers.l2_regularizer(0.004)(p)
+    cost = cost + L2
+    accurate_prediction = tf.equal(tf.argmax(net.outputs, 1), y_)
+    accuracy = tf.reduce_mean(tf.cast(accurate_prediction, tf.float32), name='accuracy_train')
+    log_tensors = {'cost': cost, 'accuracy': accuracy}
+    return net, cost, log_tensors
+
+
+def build_validation(x, y_):
+    net = model(x, is_train=False)
+    cost = tl.cost.cross_entropy(net.outputs, y_, name='cost_test')
+    accurate_prediction = tf.equal(tf.argmax(net.outputs, 1), y_)
+    accuracy = tf.reduce_mean(tf.cast(accurate_prediction, tf.float32), name='accuracy_test')
+    return net, [cost, accuracy]
+
+
+if __name__ == '__main__':
+    # Load CIFAR10 data
+    X_train, y_train, X_test, y_test = tl.files.load_cifar10_dataset(shape=(-1, 32, 32, 3), plotable=False)
+
+    # Setup the trainer
+    training_dataset = make_dataset(X_train, y_train)
+    training_dataset = training_dataset.map(data_aug_train, num_parallel_calls=multiprocessing.cpu_count())
+    # validation_dataset = make_dataset(X_test, y_test)
+    # validation_dataset = training_dataset.map(data_aug_valid, num_parallel_calls=multiprocessing.cpu_count())
+    trainer = tl.distributed.Trainer(
+        build_training_func=build_train, training_dataset=training_dataset, batch_size=128,
+        optimizer=tf.train.RMSPropOptimizer, optimizer_args={'learning_rate': 0.0001}
+        # validation_dataset=validation_dataset, build_validation_func=build_validation
+    )
+
+    # There are multiple ways to use the trainer:
+    # 1. Easiest way to train all data: trainer.train_to_end()
+    # 2. Train with validation in the middle: trainer.train_and_validate_to_end(validate_step_size=100)
+    # 3. Train with full control like follows:
+    while not trainer.session.should_stop():
+        try:
+            # Run a training step synchronously.
+            trainer.train_on_batch()
+            # TODO: do whatever you like to the training session.
+        except tf.errors.OutOfRangeError:
+            # The dataset would throw the OutOfRangeError when it reaches the end
+            break
+
+    # TODO: Test the trained model
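
The commented-out lines above sketch the validation path. Note that the second commented line maps data_aug_valid over training_dataset, which looks like a typo for validation_dataset. A hedged sketch of the wired-up version follows; the validate_step_size keyword is taken from the script's own comment, so treat the exact signature as an assumption.

# Wiring up the validation path hinted at by the commented-out lines
# above; they map data_aug_valid over training_dataset, which looks
# like a typo for validation_dataset, corrected here.
validation_dataset = make_dataset(X_test, y_test)
validation_dataset = validation_dataset.map(data_aug_valid, num_parallel_calls=multiprocessing.cpu_count())

trainer = tl.distributed.Trainer(
    build_training_func=build_train, training_dataset=training_dataset, batch_size=128,
    optimizer=tf.train.RMSPropOptimizer, optimizer_args={'learning_rate': 0.0001},
    validation_dataset=validation_dataset, build_validation_func=build_validation
)

# Validate every 100 training steps (keyword taken from the tutorial's
# own comment; treat the exact signature as an assumption).
trainer.train_and_validate_to_end(validate_step_size=100)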

example/tutorial_mnist_distributed_trainer.py

+73

@@ -0,0 +1,73 @@
+#! /usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+import numpy as np
+import tensorflow as tf
+import tensorlayer as tl
+
+tf.logging.set_verbosity(tf.logging.DEBUG)
+tl.logging.set_verbosity(tl.logging.DEBUG)
+
+
+def make_dataset(images, labels):
+    ds1 = tf.data.Dataset.from_tensor_slices(images)
+    ds2 = tf.data.Dataset.from_tensor_slices(np.array(labels, dtype=np.int64))
+    return tf.data.Dataset.zip((ds1, ds2))
+
+
+def model(x, is_train):
+    with tf.variable_scope('mlp', reuse=tf.AUTO_REUSE):
+        network = tl.layers.InputLayer(x, name='input')
+        network = tl.layers.DropoutLayer(network, keep=0.8, name='drop1', is_fix=True, is_train=is_train)
+        network = tl.layers.DenseLayer(network, 800, tf.nn.relu, name='relu1')
+        network = tl.layers.DropoutLayer(network, keep=0.5, name='drop2', is_fix=True, is_train=is_train)
+        network = tl.layers.DenseLayer(network, 800, tf.nn.relu, name='relu2')
+        network = tl.layers.DropoutLayer(network, keep=0.5, name='drop3', is_fix=True, is_train=is_train)
+        network = tl.layers.DenseLayer(network, n_units=10, act=tf.identity, name='output')
+    return network
+
+
+def build_train(x, y_):
+    net = model(x, is_train=True)
+    cost = tl.cost.cross_entropy(net.outputs, y_, name='cost_train')
+    accurate_prediction = tf.equal(tf.argmax(net.outputs, 1), y_)
+    accuracy = tf.reduce_mean(tf.cast(accurate_prediction, tf.float32), name='accuracy_train')
+    log_tensors = {'cost': cost, 'accuracy': accuracy}
+    return net, cost, log_tensors
+
+
+def build_validation(x, y_):
+    net = model(x, is_train=False)
+    cost = tl.cost.cross_entropy(net.outputs, y_, name='cost_test')
+    accurate_prediction = tf.equal(tf.argmax(net.outputs, 1), y_)
+    accuracy = tf.reduce_mean(tf.cast(accurate_prediction, tf.float32), name='accuracy_test')
+    return net, [cost, accuracy]
+
+
+if __name__ == '__main__':
+    # Load MNIST data
+    X_train, y_train, X_val, y_val, X_test, y_test = tl.files.load_mnist_dataset(shape=(-1, 784))
+
+    # Setup the trainer
+    training_dataset = make_dataset(X_train, y_train)
+    # validation_dataset = make_dataset(X_val, y_val)
+    trainer = tl.distributed.Trainer(
+        build_training_func=build_train, training_dataset=training_dataset, batch_size=32,
+        optimizer=tf.train.RMSPropOptimizer, optimizer_args={'learning_rate': 0.001}
+        # validation_dataset=validation_dataset, build_validation_func=build_validation
+    )
+
+    # There are multiple ways to use the trainer:
+    # 1. Easiest way to train all data: trainer.train_to_end()
+    # 2. Train with validation in the middle: trainer.train_and_validate_to_end(validate_step_size=100)
+    # 3. Train with full control like follows:
+    while not trainer.session.should_stop():
+        try:
+            # Run a training step synchronously.
+            trainer.train_on_batch()
+            # TODO: do whatever you like to the training session.
+        except tf.errors.OutOfRangeError:
+            # The dataset would throw the OutOfRangeError when it reaches the end
+            break
+
+    # TODO: Test the trained model
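
Both tutorials end with a "TODO: Test the trained model". One plausible way to fill that in, assuming the Trainer checkpoints its variables under ./checkpoints (suggested by this commit's .gitignore addition); the path and this restore flow are assumptions, not a documented Trainer API.

# Hedged sketch for the trailing TODO: evaluate the trained MLP on the
# test set. Assumes the Trainer wrote checkpoints under ./checkpoints
# (suggested by the .gitignore change in this commit); the path and
# this restore flow are assumptions, not a documented Trainer API.
x = tf.placeholder(tf.float32, shape=[None, 784], name='x_eval')
y_ = tf.placeholder(tf.int64, shape=[None], name='y_eval')
net = model(x, is_train=False)  # reuses variables via tf.AUTO_REUSE
acc = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(net.outputs, 1), y_), tf.float32))

with tf.Session() as sess:
    # Restore only the model's own variables, skipping optimizer slots.
    saver = tf.train.Saver(tl.layers.get_variables_with_name('mlp'))
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    print('test accuracy: %f' % sess.run(acc, feed_dict={x: X_test, y_: y_test}))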

scripts/download_and_install_openmpi3_linux.sh

+34

@@ -0,0 +1,34 @@
+#!/bin/sh
+
+set -e
+
+if [ $(uname) == "Darwin" ]; then
+    NPROC=$(sysctl -n hw.ncpu)
+else
+    NPROC=$(nproc)
+fi
+
+mkdir -p $HOME/openmpi_tmp && cd $HOME/openmpi_tmp
+
+# TODO: upgrade to latest version once https://github.com/open-mpi/ompi/pull/5296 is in the release
+MPI_MAJOR=3
+MPI_MINOR=1
+
+VERSION=${MPI_MAJOR}.${MPI_MINOR}.0
+FILENAME=openmpi-${VERSION}.tar.bz2
+FOLDER=openmpi-${VERSION}
+URL=https://download.open-mpi.org/release/open-mpi/v${MPI_MAJOR}.${MPI_MINOR}/${FILENAME}
+
+[ ! -f ${FILENAME} ] && curl -vLOJ $URL
+tar -xf ${FILENAME}
+cd ${FOLDER}
+
+# will take about 8 min or longer depends on your machine
+./configure --prefix=$HOME/local/openmpi
+make -j ${NPROC} all
+make install
+
+rm -rf $HOME/openmpi_tmp
+
+echo 'Update the PATH with OpenMPI bin by running: PATH=$PATH:$HOME/local/openmpi/bin'
+echo 'Update the PATH in ~/.bashrc if you want OpenMPI to be ready once the machine start'
