This repository was archived by the owner on Jan 6, 2023. It is now read-only.

Commit 111801e

Kiuk Chung authored and facebook-github-bot committed
fix broken doc string in rdzv module and revamp the layout of the rendezvous.html page (#87)
Summary: Pull Request resolved: #87

see title

Differential Revision: D20896988

fbshipit-source-id: 354644348c1b4fe5a91c8d431ef23668e481a92b
1 parent 7888608 commit 111801e

5 files changed: 213 additions, 106 deletions

docs/source/examples.rst

Lines changed: 20 additions & 9 deletions
@@ -1,5 +1,5 @@
 Examples
-=========
+=============

 The examples below run on the `torchelastic/examples <https://hub.docker.com/r/torchelastic/examples>`_
 Docker image, built from the `examples/Dockerfile <https://github.com/pytorch/elastic/blob/master/examples/Dockerfile>`_.
@@ -71,12 +71,15 @@ Launch ``$NUM_CUDA_DEVICES`` number of workers on a single node:
 --batch-size 32
 /workspace/data/tiny-imagenet-200

-Multi-container, multi-worker
--------------------------------
+Multi-container
+----------------

 In this example we will launch multiple containers on a single node.
-Each container is running multiple workers.
-This demonstrates how a multi-node launch would work (each node runs a container).
+Please follow the instructions in the multi-container example
+`README <https://github.com/pytorch/elastic/tree/master/examples/multi_container/README.md>`_.
+
+Each container runs multiple workers. This demonstrates how a multi-node launch
+would work (each node runs a container occupying the whole node).

 The high-level differences between a single-container vs multi-container
 launches are:
@@ -85,14 +88,22 @@ launches are:
 2. An etcd server must be setup before starting the worker containers.
 3. Remove ``--with_etcd`` and specify ``--rdzv_backend``, ``--rdzv_endpoint`` and ``--rdzv_id``.

-For more information see `elastic launch <distributed.html>`_).
+For more information see `elastic launch <distributed.html>`_.
+

-<PLACEHOLDER, add multi-container example instructions here>

-Multi-node, multi-worker
--------------------------
+Multi-node
+-----------

 The multi-node, multi-worker case is similar to running multi-container, multi-worker.
 Simply run each container on a separate node, occupying the entire node.
 Alternatively, you can use our kubernetes
 `elastic job controller <kubernetes.html>`_ to launch a multi-node job.
+
+.. warning:: We recommend you setup a highly available etcd server when
+             deploying multi-node jobs in production as this is the single
+             point of failure for your jobs. Depending on your usecase
+             you can either sidecar an etcd server with each job or setup
+             a shared etcd server. If etcd does not meet your requirements
+             you can implement your own rendezvous handler and use our
+             APIs to create a custom launcher.
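
Reviewer note on the ``examples.rst`` change above: the third listed difference (drop ``--with_etcd``, pass ``--rdzv_backend``, ``--rdzv_endpoint`` and ``--rdzv_id``) can be sketched as a small flag builder. This is only an illustration under assumptions. The helper ``rdzv_flags`` is hypothetical and not part of torchelastic, the backend value ``etcd``, the endpoint ``etcd-host:2379`` and the job id ``my_job`` are placeholders, and the ``--flag=value`` spelling is assumed rather than taken from the docs.

```python
from typing import List


def rdzv_flags(multi_container: bool, endpoint: str = "", job_id: str = "") -> List[str]:
    """Return the rendezvous-related launcher flags for a launch mode.

    Hypothetical helper for illustration only; not a torchelastic API.
    """
    if not multi_container:
        # Single container: keep --with_etcd, as in the single-node examples.
        return ["--with_etcd"]
    if not endpoint or not job_id:
        raise ValueError("multi-container launches need an etcd endpoint and a job id")
    # Multi-container: every container points at the same, already running
    # etcd server and uses the same job id. The value "etcd" is an assumption.
    return [
        "--rdzv_backend=etcd",
        f"--rdzv_endpoint={endpoint}",
        f"--rdzv_id={job_id}",
    ]


print(rdzv_flags(False))
print(rdzv_flags(True, endpoint="etcd-host:2379", job_id="my_job"))
```

Every container of one job would receive the same three values, which is why the etcd server has to exist before the first container starts.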

docs/source/rendezvous.rst

Lines changed: 38 additions & 3 deletions
@@ -2,19 +2,54 @@ Rendezvous
 ==========

 .. automodule:: torchelastic.rendezvous
-.. currentmodule:: torchelastic.rendezvous

 Below is a state diagram describing how rendezvous works.

 .. image:: etcd_rdzv_diagram.png

+
+Handler
+--------------------
+
+.. currentmodule:: torchelastic.rendezvous
+
+.. autoclass:: RendezvousHandler
+   :members:
+
+Exceptions
+-------------
+.. autoclass:: RendezvousClosedException
+.. autoclass:: RendezvousTimeoutException
+.. autoclass:: RendezvousNonRetryableError
+
+Implmentations
+----------------
+
 Etcd Rendezvous
----------------
+****************

 .. currentmodule:: torchelastic.rendezvous.etcd_rendezvous

 .. autoclass:: EtcdRendezvousHandler
-   :members:

 .. autoclass:: EtcdRendezvous
    :members:
+
+.. autoclass:: EtcdStore
+   :members:
+
+Etcd Server
+*************
+
+The ``EtcdServer`` is a convenience class that makes it easy for you to
+start and stop an etcd server on a subprocess. This is useful for testing
+or single-node (multi-worker) deployments where manually setting up an
+etcd server on the side is cumbersome.
+
+.. warning:: For production and multi-node deployments please consider
+             properly deploying a highly available etcd server as this is
+             the single point of failure for your distributed jobs.
+
+.. currentmodule:: torchelastic.rendezvous.etcd_server
+
+.. autoclass:: EtcdServer
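
Reviewer note on the new "Etcd Server" section above: the "start and stop a server on a subprocess" pattern it describes can be sketched generically. This is not the ``EtcdServer`` implementation and does not use its API. ``SubprocessServer`` is a hypothetical stand-in, and a sleeping Python process takes the place of the etcd binary so the sketch runs anywhere.

```python
import subprocess
import sys
from typing import Optional, Sequence


class SubprocessServer:
    """Hypothetical stand-in for a server managed as a child process."""

    def __init__(self, cmd: Sequence[str]) -> None:
        self._cmd = list(cmd)
        self._proc: Optional[subprocess.Popen] = None

    def start(self) -> None:
        if self._proc is not None:
            raise RuntimeError("server already started")
        self._proc = subprocess.Popen(self._cmd)

    def is_running(self) -> bool:
        # poll() returns None while the child process is still alive.
        return self._proc is not None and self._proc.poll() is None

    def stop(self, timeout: float = 5.0) -> None:
        if self._proc is None:
            return
        self._proc.terminate()
        try:
            self._proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            # Escalate if the child ignores the termination request.
            self._proc.kill()
            self._proc.wait()
        self._proc = None

    def __enter__(self) -> "SubprocessServer":
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.stop()


# Placeholder command: a sleeping Python process instead of an etcd binary.
placeholder_cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
with SubprocessServer(placeholder_cmd) as server:
    running_inside = server.is_running()
running_after = server.is_running()
print(running_inside, running_after)
```

Tying the server's lifetime to a ``with`` block keeps test code from leaking child processes, which fits the testing and single-node use the section names. It also shows why the warning applies: the server dies with the process that started it.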

torchelastic/rendezvous/api.py

Lines changed: 7 additions & 6 deletions
@@ -21,7 +21,8 @@ class RendezvousClosedException(Exception):

 class RendezvousTimeoutException(Exception):
     """
-    Raised from `next_rendezvous` to signal that the rendezvous did not
+    Raised from ``RendezvousHandler.next_rendezvous()`` to signal that the
+    rendezvous did not
     succeed within the allocated time. This is meant to be interpreted
     as a non-retryable type of failure.
     """
@@ -31,7 +32,7 @@ class RendezvousTimeoutException(Exception):

 class RendezvousNonRetryableError(Exception):
     """
-    Raised from any of the `RendezvousHandler` methods when a failure
+    Raised from any of the ``RendezvousHandler`` methods when a failure
     occured that should not be retried with the same worker process.
     """

@@ -60,12 +61,12 @@ def next_rendezvous(self) -> Tuple["torch.distributed.Store", int, int]:
         process is included in the formed worker group), or a timeout occurs, or
         rendezvous was marked closed.

-        Returns a tuple of (``c10d Store``, ``rank``, ``world size``)
+        Returns: a tuple of (``c10d Store``, ``rank``, ``world size``)

         Raises:
-            ``RendezvousClosedException`` if rendezvous for the current
+            RendezvousClosedException - if rendezvous for the current
                job is closed.
-            ``RendezvousTimeoutException`` on timeout
+            RendezvousTimeoutException - on timeout
         """
         pass

@@ -76,7 +77,7 @@ def is_closed(self) -> bool:
         which means all future attempts to re-rendezvous (within same job) will
         fail.

-        .. note:: ``is_closed``/``set_closed`` have semantics of eventual
+        .. note:: ``is_closed`` and ``set_closed`` have semantics of eventual
                   propagation, and should not be used for synchronization.
                   The intention here is that if at least one worker decides
                   the job is finished, it will close the rendezvous, and
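
Reviewer note on the ``api.py`` docstrings above: they describe ``next_rendezvous()`` as returning a ``(store, rank, world size)`` tuple and raising on a closed rendezvous or a timeout. A minimal sketch of how a caller might act on that contract follows. The exception classes are local stand-ins that only mirror the names in the diff so the snippet runs without torchelastic, and ``FakeHandler`` and ``join_or_exit`` are hypothetical names. Treating a closed rendezvous as a clean exit is an assumption drawn from the ``is_closed`` note.

```python
from typing import Optional, Tuple


# Local stand-ins mirroring the exception names in api.py (not imported from
# torchelastic, so the sketch is self-contained).
class RendezvousClosedException(Exception):
    pass


class RendezvousTimeoutException(Exception):
    pass


class FakeHandler:
    """Hypothetical handler that returns a canned result or raises."""

    def __init__(self, outcome) -> None:
        self._outcome = outcome

    def next_rendezvous(self):
        if isinstance(self._outcome, Exception):
            raise self._outcome
        return self._outcome


def join_or_exit(handler) -> Optional[Tuple[object, int, int]]:
    """Return (store, rank, world_size), or None if the rendezvous is closed."""
    try:
        return handler.next_rendezvous()
    except RendezvousClosedException:
        # Assumption: a closed rendezvous means some worker finished the job,
        # so this worker stops instead of retrying.
        return None
    # RendezvousTimeoutException is deliberately not caught: the docstring
    # describes it as non-retryable, so it propagates to the caller.


store = object()  # placeholder for a c10d Store
print(join_or_exit(FakeHandler((store, 0, 2)))[1:])
print(join_or_exit(FakeHandler(RendezvousClosedException())))
```

Letting the timeout propagate keeps the retry decision with whatever supervises the worker process, which matches the "non-retryable" wording in the docstring.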
