
Commit 01f7b62

kaushikb11 and Borda authored
Update WarpDrive notebook with fixes (#167)
Co-authored-by: Jirka Borovec <[email protected]>
1 parent 14535ff commit 01f7b62

2 files changed: +46 -155 lines

lightning_examples/warp-drive/.meta.yml

+1 -1
@@ -19,7 +19,7 @@ description: This notebook introduces multi-agent reinforcement learning (MARL)
   white paper - https://arxiv.org/abs/2108.13976.
 
 requirements:
-  - rl-warp-drive>=1.5.1
+  - rl-warp-drive>=1.6.5
   - ffmpeg-python
 accelerator:
   - GPU

lightning_examples/warp-drive/multi_agent_rl.py

+45 -154
@@ -6,34 +6,46 @@
 # ## Introduction
 
 # %% [markdown]
-# This tutorial provides a demonstration of a multi-agent Reinforcement Learning (RL) training loop with [WarpDrive](https://github.com/salesforce/warp-drive). WarpDrive is a flexible, lightweight, and easy-to-use RL framework that implements end-to-end deep multi-agent RL on a single GPU (Graphics Processing Unit). Using the extreme parallelization capability of GPUs, it enables [orders-of-magnitude faster RL](https://arxiv.org/abs/2108.13976) compared to common implementations that blend CPU simulations and GPU models. WarpDrive is extremely efficient as it runs simulations across multiple agents and multiple environment replicas in parallel and completely eliminates the back-and-forth data copying between the CPU and the GPU.
+# This tutorial provides a demonstration of a multi-agent Reinforcement Learning (RL) training loop with [WarpDrive](https://github.com/salesforce/warp-drive). WarpDrive is a flexible, lightweight, and easy-to-use RL framework that implements end-to-end deep multi-agent RL on a GPU (Graphics Processing Unit). Using the extreme parallelization capability of GPUs, it enables [orders-of-magnitude faster RL](https://arxiv.org/abs/2108.13976) compared to common implementations that blend CPU simulations and GPU models. WarpDrive is extremely efficient as it runs simulations across multiple agents and multiple environment replicas all in parallel and completely eliminates the back-and-forth data copying between the CPU and the GPU during every step. As such, WarpDrive
+# - Can simulate 1000s of agents in each environment and thousands of environments in parallel, harnessing the extreme parallelism capability of GPUs.
+# - Eliminates communication between the CPU and GPU, and also within the GPU, as read and write operations occur in-place.
+# - Is fully compatible with PyTorch, a highly flexible and very fast deep learning framework.
+# - Implements parallel action sampling on CUDA C, which is ~3x faster than using PyTorch's sampling methods.
+# - Allows for large-scale distributed training on multiple GPUs.
 #
-# We have integrated WarpDrive with the [Pytorch Lightning](https://www.pytorchlightning.ai/) framework, which greatly reduces the trainer boilerplate code, and improves training flexibility.
+# Below is an overview of WarpDrive's layout of computational and data structures on a single GPU.
+# ![](https://blog.salesforceairesearch.com/content/images/2021/08/warpdrive_framework_overview.png)
+# Computations are organized into blocks, with multiple threads in each block. Each block runs a simulation environment and each thread
+# simulates an agent in an environment. Blocks can access the shared GPU memory that stores simulation data and neural network policy models. A DataManager and FunctionManager enable defining multi-agent RL GPU workflows with Python APIs. For more details, please read our white [paper](https://arxiv.org/abs/2108.13976).
 #
-# Below, we demonstrate how to use WarpDrive and PytorchLightning together to train a game of [Tag](https://github.com/salesforce/warp-drive/blob/master/example_envs/tag_continuous/tag_continuous.py) where multiple *tagger* agents are trying to run after and tag multiple other *runner* agents. As such, the Warpdrive framework comprises several utility functions that help easily implement any (OpenAI-)*gym-style* RL environment, and furthermore, provides quality-of-life tools to train it end-to-end using just a few lines of code. You may familiarize yourself with WarpDrive with the help of these [tutorials](https://github.com/salesforce/warp-drive/tree/master/tutorials).
+# The WarpDrive framework comprises several utility functions that help easily implement any (OpenAI-)*gym-style* RL environment and, furthermore, provides quality-of-life tools to train it end-to-end using just a few lines of code. You may familiarize yourself with WarpDrive with the help of these [tutorials](https://github.com/salesforce/warp-drive/tree/master/tutorials).
 #
 # We invite everyone to **contribute to WarpDrive**, including adding new multi-agent environments, proposing new features and reporting issues on our open source [repository](https://github.com/salesforce/warp-drive).
+#
+# We have integrated WarpDrive with the [PyTorch Lightning](https://www.pytorchlightning.ai/) framework, which greatly reduces the trainer boilerplate code and improves training modularity and flexibility. It abstracts away most of the engineering pieces of code, so users can focus on research and building models, and iterate on experiments really fast. PyTorch Lightning also provides support for easily running the model on any hardware, performing distributed training, model checkpointing, performance profiling, logging and visualization.
+#
+# Below, we demonstrate how to use WarpDrive and PyTorch Lightning together to train a game of [Tag](https://github.com/salesforce/warp-drive/blob/master/example_envs/tag_continuous/tag_continuous.py) where multiple *tagger* agents try to run after and tag multiple other *runner* agents. Here's a sample depiction of the game of Tag with $100$ runners and $5$ taggers.
+# ![](https://blog.salesforceairesearch.com/content/images/2021/08/same_speed_50fps-1.gif)
 
+# %% [markdown]
+# ## Dependencies
 
 # %%
 import logging
 
-import matplotlib.pyplot as plt
-import mpl_toolkits.mplot3d.art3d as art3d
-import numpy as np
 import torch
 from example_envs.tag_continuous.tag_continuous import TagContinuous
-
-# from IPython.display import HTML
-from matplotlib import animation
-from matplotlib.patches import Polygon
 from pytorch_lightning import Trainer
 from warp_drive.env_wrapper import EnvWrapper
-from warp_drive.training.pytorch_lightning_trainer import CudaCallback, PerfStatsCallback, WarpDriveModule
+from warp_drive.training.pytorch_lightning import CUDACallback, PerfStatsCallback, WarpDriveModule
+
+# Uncomment below to enable animation visualizations.
+# from example_envs.utils.generate_rollout_animation import generate_tag_env_rollout_animation
+# from IPython.display import HTML
+
 
 # %%
-_NUM_AVAILABLE_GPUS = torch.cuda.device_count()
-assert _NUM_AVAILABLE_GPUS > 0, "This notebook needs a GPU to run!"
+assert torch.cuda.device_count() > 0, "This notebook only runs on a GPU!"
 
 # %%
 # Set logger level e.g., DEBUG, INFO, WARNING, ERROR.
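Given the requirement bump to `rl-warp-drive>=1.6.5` in `.meta.yml`, a quick sanity check can be run alongside the imports above. This is a sketch, not part of the notebook; it assumes the `warp_drive` package exposes `__version__` at its root (hence the `getattr` fallback).

import torch

try:
    import warp_drive  # installed via `pip install rl-warp-drive>=1.6.5`

    print("warp_drive version:", getattr(warp_drive, "__version__", "unknown"))
except ImportError as err:
    raise SystemExit("Please install the notebook requirements: rl-warp-drive>=1.6.5") from err

assert torch.cuda.device_count() > 0, "This notebook only runs on a GPU!"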
@@ -46,7 +58,7 @@
 #
 # For our experiment, we consider an environment wherein $5$ taggers and $100$ runners play the game of [Tag](https://github.com/salesforce/warp-drive/blob/master/example_envs/tag_continuous/tag_continuous.py) on a $20 \times 20$ plane. The game lasts $200$ timesteps. Each agent chooses its own acceleration and turn actions at every timestep, and we use mechanics to determine how the agents move over the grid. When a tagger gets close to a runner, the runner is tagged, and is eliminated from the game. For the configuration below, the runners and taggers have the same unit skill levels, or top speeds.
 #
-# We train the agents using $50$ environments or simulations running in parallel. With WarpDrive, each simulation runs on sepate GPU blocks.
+# We train the agents using $50$ environments or simulations running in parallel. With WarpDrive, each simulation runs on separate GPU blocks.
 #
 # There are two separate policy networks used for the tagger and runner agents. Each network is a fully-connected model with two layers each of $256$ dimensions. We use the Advantage Actor Critic (A2C) algorithm for training. WarpDrive also currently provides the option to use the Proximal Policy Optimization (PPO) algorithm instead.
 
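For intuition, each policy network described above (two fully-connected hidden layers of $256$ units) corresponds to something like the following plain-PyTorch sketch. WarpDrive builds its actual models from the run configuration below; `obs_dim` and `num_actions` here are illustrative placeholders.

import torch.nn as nn


def make_policy_net(obs_dim: int, num_actions: int) -> nn.Module:
    # Two hidden layers of 256 units each, followed by a head producing action logits.
    return nn.Sequential(
        nn.Linear(obs_dim, 256),
        nn.ReLU(),
        nn.Linear(256, 256),
        nn.ReLU(),
        nn.Linear(256, num_actions),
    )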
@@ -67,10 +79,10 @@
         max_acceleration=0.1,
         # minimum acceleration
         min_acceleration=-0.1,
-        # 3*pi/4 radians
-        max_turn=2.35,
-        # -3*pi/4 radians
-        min_turn=-2.35,
+        # maximum turn (in radians)
+        max_turn=2.35,  # 3pi/4 radians
+        # minimum turn (in radians)
+        min_turn=-2.35,  # -3pi/4 radians
         # number of discretized accelerate actions
         num_acceleration_levels=10,
         # number of discretized turn actions
@@ -79,19 +91,21 @@
         skill_level_tagger=1.0,
         # skill level for the runner
         skill_level_runner=1.0,
-        # each agent only sees full or partial information
+        # each agent sees the full (or partial) information of the world
         use_full_observation=False,
         # flag to indicate if a runner stays in the game after getting tagged
         runner_exits_game_after_tagged=True,
         # number of other agents each agent can see
+        # used in the case use_full_observation is False
         num_other_agents_observed=10,
-        # positive reward for the tagger upon tagging a runner
+        # positive reward for a tagger upon tagging a runner
         tag_reward_for_tagger=10.0,
-        # negative reward for the runner upon getting tagged
+        # negative reward for a runner upon getting tagged
         tag_penalty_for_runner=-10.0,
         # reward at the end of the game for a runner that isn't tagged
         end_of_game_reward_for_runner=1.0,
-        # margin between a tagger and runner to consider the runner as 'tagged'.
+        # distance margin between a tagger and runner
+        # to consider the runner as being 'tagged'
         tagging_distance=0.02,
     ),
     # Trainer settings.
@@ -100,8 +114,9 @@
         num_envs=50,
         # total batch size used for training per iteration (across all the environments)
         train_batch_size=10000,
-        # total number of episodes to run the training for (can be arbitrarily high!)
-        num_episodes=50000,
+        # total number of episodes to run the training for
+        # This can be set arbitrarily high!
+        num_episodes=500,
     ),
     # Policy network settings.
     policy=dict(
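The `run_config` above is what the notebook hands to WarpDrive. A hedged sketch of that wiring is shown below; the `WarpDriveModule` keyword names are assumptions based on the WarpDrive tutorials and do not appear in this diff.

# Wrap the CPU environment so that `num_envs` replicas run on the GPU.
env_wrapper = EnvWrapper(
    TagContinuous(**run_config["env"]),
    num_envs=run_config["trainer"]["num_envs"],
    use_cuda=True,
)

# Map each policy name to the agent ids it controls. `taggers` and `num_agents` are
# environment attributes used elsewhere in this notebook; the mapping itself is inferred.
tagger_ids = set(env_wrapper.env.taggers)
policy_tag_to_agent_id_map = {
    "tagger": sorted(tagger_ids),
    "runner": [aid for aid in range(env_wrapper.env.num_agents) if aid not in tagger_ids],
}

wd_module = WarpDriveModule(
    env_wrapper=env_wrapper,
    config=run_config,
    policy_tag_to_agent_id_map=policy_tag_to_agent_id_map,
    verbose=True,
)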
@@ -173,136 +188,12 @@
 #
 # We have created a helper function (see below) to visualize an episode rollout. Internally, this function uses the WarpDrive module's `fetch_episode_states` API to fetch the data arrays on the GPU for the duration of an entire episode. Specifically, we fetch the state arrays pertaining to agents' x and y locations on the plane and indicators on which agents are still active in the game. Note that this function may be invoked at any time during training, and it will use the state of the policy models at that time to sample actions and generate the visualization.
 
-# %%
-def generate_tag_env_rollout_animation(
-    warp_drive_module,
-    fps=25,
-    tagger_color="#C843C3",
-    runner_color="#245EB6",
-    runner_not_in_game_color="#666666",
-    fig_width=6,
-    fig_height=6,
-):
-    assert warp_drive_module is not None
-    episode_states = warp_drive_module.fetch_episode_states(["loc_x", "loc_y", "still_in_the_game"])
-    assert isinstance(episode_states, dict)
-    env = warp_drive_module.cuda_envs.env
-
-    fig, ax = plt.subplots(1, 1, figsize=(fig_width, fig_height))  # , constrained_layout=True
-    ax.remove()
-    ax = fig.add_subplot(1, 1, 1, projection="3d")
-
-    # Bounds
-    ax.set_xlim(0, 1)
-    ax.set_ylim(0, 1)
-    ax.set_zlim(-0.01, 0.01)
-
-    # Surface
-    corner_points = [(0, 0), (0, 1), (1, 1), (1, 0)]
-
-    poly = Polygon(corner_points, color=(0.1, 0.2, 0.5, 0.15))
-    ax.add_patch(poly)
-    art3d.pathpatch_2d_to_3d(poly, z=0, zdir="z")
-
-    # "Hide" side panes
-    ax.xaxis.set_pane_color((1.0, 1.0, 1.0, 0.0))
-    ax.yaxis.set_pane_color((1.0, 1.0, 1.0, 0.0))
-    ax.zaxis.set_pane_color((1.0, 1.0, 1.0, 0.0))
-
-    # Hide grid lines
-    ax.grid(False)
-
-    # Hide axes ticks
-    ax.set_xticks([])
-    ax.set_yticks([])
-    ax.set_zticks([])
-
-    # Hide axes
-    ax.set_axis_off()
-
-    # Set camera
-    ax.elev = 40
-    ax.azim = -55
-    ax.dist = 10
-
-    # Try to reduce whitespace
-    fig.subplots_adjust(left=0, right=1, bottom=-0.2, top=1)
-
-    # Plot init data
-    lines = [None for _ in range(env.num_agents)]
-
-    for idx in range(len(lines)):
-        if idx in env.taggers:
-            lines[idx] = ax.plot3D(
-                episode_states["loc_x"][:1, idx] / env.grid_length,
-                episode_states["loc_y"][:1, idx] / env.grid_length,
-                0,
-                color=tagger_color,
-                marker="o",
-                markersize=10,
-            )[0]
-        else:  # runners
-            lines[idx] = ax.plot3D(
-                episode_states["loc_x"][:1, idx] / env.grid_length,
-                episode_states["loc_y"][:1, idx] / env.grid_length,
-                [0],
-                color=runner_color,
-                marker="o",
-                markersize=5,
-            )[0]
-
-    init_num_runners = env.num_agents - env.num_taggers
-
-    def _get_label(timestep, n_runners_alive, init_n_runners):
-        line1 = "Continuous Tag\n"
-        line2 = "Time Step:".ljust(14) + f"{timestep:4.0f}\n"
-        frac_runners_alive = n_runners_alive / init_n_runners
-        pct_runners_alive = f"{n_runners_alive:4} ({frac_runners_alive * 100:.0f}%)"
-        line3 = "Runners Left:".ljust(14) + pct_runners_alive
-        return line1 + line2 + line3
-
-    label = ax.text(
-        0,
-        0,
-        0.02,
-        _get_label(0, init_num_runners, init_num_runners).lower(),
-    )
-
-    label.set_fontsize(14)
-    label.set_fontweight("normal")
-    label.set_color("#666666")
-
-    def animate(i):
-        for idx, line in enumerate(lines):
-            line.set_data_3d(
-                episode_states["loc_x"][i : i + 1, idx] / env.grid_length,
-                episode_states["loc_y"][i : i + 1, idx] / env.grid_length,
-                np.zeros(1),
-            )
-
-            still_in_game = episode_states["still_in_the_game"][i, idx]
-
-            if still_in_game:
-                pass
-            else:
-                line.set_color(runner_not_in_game_color)
-                line.set_marker("")
-
-        n_runners_alive = episode_states["still_in_the_game"][i].sum() - env.num_taggers
-        label.set_text(_get_label(i, n_runners_alive, init_num_runners).lower())
-
-    ani = animation.FuncAnimation(fig, animate, np.arange(0, env.episode_length + 1), interval=1000.0 / fps)
-    plt.close()
-
-    return ani
-
-
 # %% [markdown]
 # The animation below shows a sample realization of the game episode before training, i.e., with randomly chosen agent actions. The $5$ taggers are marked in pink, while the $100$ blue agents are the runners. Both the taggers and runners move around randomly and about half the runners remain at the end of the episode.
 
 # %%
-
-# anim = generate_tag_env_rollout_animation(wd_module)
+# Uncomment below to enable animation visualizations.
+# anim = generate_tag_env_rollout_animation(wd_module, fps=25)
 # HTML(anim.to_html5_video())
 
 # %% [markdown]
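As a lighter-weight alternative to the animation helper (now imported, commented out, from `example_envs.utils.generate_rollout_animation`), the same `fetch_episode_states` call can back a static snapshot. This is a sketch; it assumes `wd_module` has been built as above, that the fetched arrays are NumPy arrays as in the previous helper, and that matplotlib is installed.

import matplotlib.pyplot as plt

episode_states = wd_module.fetch_episode_states(["loc_x", "loc_y", "still_in_the_game"])
env = wd_module.cuda_envs.env

last = -1  # final timestep of the fetched episode
alive = episode_states["still_in_the_game"][last].astype(bool)
x = episode_states["loc_x"][last] / env.grid_length
y = episode_states["loc_y"][last] / env.grid_length

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(x[alive], y[alive], s=10, label="still in the game")
ax.scatter(x[~alive], y[~alive], s=10, c="#666666", marker="x", label="tagged runners")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
ax.legend()
ax.set_title("Agent positions at the end of a rollout")
plt.show()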
@@ -314,7 +205,7 @@ def animate(i):
 log_freq = run_config["saving"]["metrics_log_freq"]
 
 # Define callbacks.
-cuda_callback = CudaCallback(module=wd_module)
+cuda_callback = CUDACallback(module=wd_module)
 perf_stats_callback = PerfStatsCallback(
     batch_size=wd_module.training_batch_size,
     num_iters=wd_module.num_iters,
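For context, these callbacks plug into a standard Lightning `Trainer`. The sketch below shows a single-GPU setup; the `log_freq` keyword for `PerfStatsCallback` and the exact `Trainer` arguments are assumptions, not part of this diff.

perf_stats_callback = PerfStatsCallback(
    batch_size=wd_module.training_batch_size,
    num_iters=wd_module.num_iters,
    log_freq=log_freq,  # assumed keyword; `log_freq` is read from run_config["saving"] above
)

trainer = Trainer(
    gpus=1,  # WarpDrive requires a CUDA device
    callbacks=[cuda_callback, perf_stats_callback],
)
trainer.fit(wd_module)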
@@ -355,12 +246,12 @@ def animate(i):
 # ## Visualize an episode-rollout after training
 
 # %%
-
-# anim = generate_tag_env_rollout_animation(wd_module)
+# Uncomment below to enable animation visualizations.
+# anim = generate_tag_env_rollout_animation(wd_module, fps=25)
 # HTML(anim.to_html5_video())
 
 # %% [markdown]
-# Note: In the configuration above, we have set the trainer to only train on $50000$ rollout episodes, but you can increase the `num_episodes` configuration parameter to train further. As more training happens, the runners learn to escape the taggers, and the taggers learn to chase after the runner. Sometimes, the taggers also collaborate to team-tag runners. A good number of episodes to train on (for the configuration we have used) is $2$M or higher.
+# Note: In the configuration above, we have set the trainer to train on only $500$ rollout episodes, but you can increase the `num_episodes` configuration parameter to train further. As more training happens, the runners learn to escape the taggers, and the taggers learn to chase after the runners. Sometimes, the taggers also collaborate to team-tag runners. A good number of episodes to train on (for the configuration we have used) is $2$M or higher.
 
 # %%
 # Finally, close the WarpDrive module to clear up the CUDA memory heap
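To train longer, as the note above recommends, raise the episode budget before the WarpDrive module is built; for example:

# Train on roughly 2M episodes instead of 500 (expect a much longer run).
run_config["trainer"]["num_episodes"] = 2_000_000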
