A2C and Synchronous extension #17

Open

wants to merge 66 commits into master

Commits (66), showing changes from 1 commit
0ce32a4
CARTPOLE - create vanilla a2c and synchronous a2c
juice1000 Mar 23, 2020
4918b37
CARTPOLE - test
juice1000 Mar 24, 2020
63b4092
CARTPOLE - renew model and parameters, add args and plotter function,…
juice1000 Mar 24, 2020
a459f81
CARTPOLE - for all: terminate when score higher than 500 within range
juice1000 Mar 25, 2020
2045869
CARTPOLE - refactor a2c to without torch.mp module
juice1000 Mar 25, 2020
490767a
CARTPOLE - stabilize algorithms: true synchronous learning for a2c_sy…
juice1000 Mar 27, 2020
b7a4bf6
CARTPOLE - training duration in plotter, args and plotter in utils, s…
juice1000 Mar 28, 2020
f9ace27
CARTPOLE - if load model, then check for 100 iterations of ep_r over 500
juice1000 Mar 29, 2020
5d78d64
CARTPOLE - if load_model, then check for last 100 iterations of ep_r …
juice1000 Mar 29, 2020
ae02294
Merge branch 'master' of https://github.com/juice1000/pytorch-A3C
juice1000 Mar 29, 2020
93e6726
CARTPOLE - initialize only one process when load_model
juice1000 Mar 29, 2020
2317cea
CARTPOLE - utils gets optimizer, a2c_sync NN update after done
juice1000 Mar 29, 2020
add50a1
CARTPOLE - add episode_time plotter and confidence intervall calculat…
juice1000 Apr 8, 2020
b9ec460
CARTPOLE - create A2Cs and A3C shared NN
juice1000 Apr 8, 2020
e35afd8
CARTPOLE - fix confidence interval error in Sync_A2C and A3C, action_…
juice1000 Apr 8, 2020
6d32122
CARTPOLE - refactor shared network algorithms, delete one confidence …
juice1000 Apr 8, 2020
211ce33
CARTPOLE - refactor shared NNs and lr, take out time sleep in test ph…
juice1000 Apr 8, 2020
aca0f93
CARTPOLE - results and saved models
juice1000 Apr 8, 2020
810b960
CARTPOLE - plots of results
juice1000 Apr 8, 2020
d943e7e
VIZDOOM - create A2C-vizdoom for deadly corridor, create viz_utils
juice1000 Apr 9, 2020
671f057
CARTPOLE - create directory
juice1000 Apr 9, 2020
019f612
VIZDOOM - create directory
juice1000 Apr 9, 2020
7104cf7
cleanup
juice1000 Apr 9, 2020
b6e57e6
VIZDOOM - create A3C model
juice1000 Apr 10, 2020
d3a4033
VIZDOOM - create A3C part2
juice1000 Apr 10, 2020
06a759f
VIZDOOM - create synchronous a2c
juice1000 Apr 10, 2020
608e20a
VIZDOOM - add plotter and game replay in a2c
juice1000 Apr 10, 2020
52ac70c
VIZDOOM - plotter and stopwatch for a2c_sync and a3c
juice1000 Apr 10, 2020
e139bbe
CARTPOLE - refactor synchronous coordinator in A2C_Sync
juice1000 Apr 11, 2020
a5179b2
VIZDOOM - reinitialize state within each episode, adjust plotter boun…
juice1000 Apr 11, 2020
e403c84
CARTPOLE - remove normalizing reward
juice1000 Apr 12, 2020
2026ba8
VIZDOOM - UPDATE_GLOBAL_ITER = 5 for A3C, correct error in push and pull
juice1000 Apr 12, 2020
42e27f5
CARTPOLE - normalized and raw duration and reward plotting
juice1000 Apr 13, 2020
b0ae642
VIZDOOM - normalized and raw reward and duration, update plotter boun…
juice1000 Apr 13, 2020
51978ff
CART/DOOM - turn off episode breaker for not normalized runs
juice1000 Apr 13, 2020
261cd3c
CARTPOLE - correct coding mistakes in A3C and A2C-sync
juice1000 Apr 13, 2020
53beb17
VIZDOOM - correct A2C and A3C, adjust plotter boundaries
juice1000 Apr 14, 2020
5dc7e25
CARTPOLE - create save_model.py, enable save data mode for all
juice1000 Apr 14, 2020
8065952
CARTPOLE - correct coding mistakes with save_data
juice1000 Apr 14, 2020
aed3bb4
CARTPOLE - separate training and test data
juice1000 Apr 14, 2020
05b2b14
CARTPOLE - remove time stamp in plots
juice1000 Apr 16, 2020
6fa38cb
CARTPOLE - add new plots, add plot data, add newly trained models
juice1000 Apr 16, 2020
534a930
VIZDOOM - save_data for A3C, create algo_compare for doom
juice1000 Apr 17, 2020
f98f2e1
VIZDOOM - new synchronous implementation, reinitialize threads and sa…
juice1000 Apr 18, 2020
439e964
VIZDOOM - refactor viz_utils
juice1000 Apr 18, 2020
8f2e140
VIZDOOM - refactor plotter
juice1000 Apr 18, 2020
d71c28a
VIZDOOM - redefine save model condition
juice1000 Apr 19, 2020
89b3e6d
VIZDOOM - refactor action_queue, take action_queue out from viz_utils
juice1000 Apr 20, 2020
bd7b79e
VIZDOOM - set ylim to 1000 in viz_utils
juice1000 Apr 24, 2020
19b5fb4
VIZDOOM - combine CSVs and plot for A2C-sync
juice1000 Apr 25, 2020
fbe2680
VIZDOOM - algo_compare with subplots of training and testing data
juice1000 Apr 25, 2020
3c8501a
CARTPOLE - make subplots for training and testing data
juice1000 Apr 25, 2020
77975c3
CART/VIZ - probabilities plot
juice1000 Apr 26, 2020
220c2ed
CART/VIZ - code corrections
juice1000 Apr 26, 2020
5bc01fc
VIZDOOM - fix probabilities plot and ep-rew-time plot, fix action_que…
juice1000 May 1, 2020
983bc7c
VIZDOOM - code grooming and refactor algo_compare
juice1000 May 4, 2020
5a1fb36
Update README.md
juice1000 May 31, 2020
0adf081
VIZDOOM - push plots and plot data
May 31, 2020
2ffe7d0
Merge branch 'master' of https://github.com/juice1000/Synchronous-vs-…
May 31, 2020
3a2b84c
Update README.md
juice1000 May 31, 2020
c106ea0
Update README.md
juice1000 May 31, 2020
14e4c57
CARTPOLE - move results to higher directory
May 31, 2020
8aef27b
Merge branch 'master' of https://github.com/juice1000/Synchronous-vs-…
May 31, 2020
8f16737
Update README.md
juice1000 May 31, 2020
72d755d
Update README.md
juice1000 Dec 17, 2022
e606ecf
Update README.md
juice1000 Dec 17, 2022
CARTPOLE - create vanilla a2c and synchronous a2c
juice1000 authored Mar 23, 2020
commit 0ce32a4f6f4b146b27e8e2a347afd032177a0b97
151 changes: 151 additions & 0 deletions a2c_cart.py
@@ -0,0 +1,151 @@
"""
Reinforcement Learning (A2C, adapted from an A3C implementation) using PyTorch + multiprocessing.
The simplest implementation for discrete action (CartPole).
"""

import torch
import torch.nn as nn
from utils import v_wrap, set_init, push_and_pull, record
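# Helpers imported from the repo's utils.py (not part of this diff). Based on the upstream
# pytorch-A3C layout they presumably behave as follows: v_wrap converts numpy arrays to torch
# tensors, set_init initializes layer weights, push_and_pull syncs the local and global
# networks, and record updates the shared episode counters and the result queue.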
import torch.nn.functional as F
import torch.multiprocessing as mp
from shared_adam import SharedAdam
import gym
import os

import argparse
import time
from datetime import datetime
timestr = time.strftime("%d.%m.%Y - %H:%M:%S")

os.environ["OMP_NUM_THREADS"] = "1"

UPDATE_GLOBAL_ITER = 500
GAMMA = 0.9
MAX_EP = 3000
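# UPDATE_GLOBAL_ITER: global-episode interval used in the sync check inside Worker.run,
# GAMMA: reward discount factor, MAX_EP: total number of episodes across all workers.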

env = gym.make('CartPole-v0')
N_S = env.observation_space.shape[0]
N_A = env.action_space.n
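# CartPole-v0: the observation is 4-dimensional (cart position/velocity, pole angle/angular
# velocity) and there are 2 discrete actions (push cart left or right).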


def handleArguments():
"""Handles CLI arguments and saves them globally"""
parser = argparse.ArgumentParser(
description="Switch between modes in A2C or loading models from previous games")
parser.add_argument("--demo_mode", "-d", help="Renders the gym environment", action="store_true")
parser.add_argument("--load_model", "-l", help="Loads the model of previously gained training data", action="store_true")
global args
args = parser.parse_args()


class Net(nn.Module):
def __init__(self, s_dim, a_dim):
super(Net, self).__init__()
self.s_dim = s_dim
self.a_dim = a_dim
self.pi1 = nn.Linear(s_dim, 128)
self.pi2 = nn.Linear(128, a_dim)
self.v1 = nn.Linear(s_dim, 128)
self.v2 = nn.Linear(128, 1)
set_init([self.pi1, self.pi2, self.v1, self.v2])
self.distribution = torch.distributions.Categorical

def forward(self, x):
pi1 = torch.tanh(self.pi1(x))
logits = self.pi2(pi1)
v1 = torch.tanh(self.v1(x))
values = self.v2(v1)
return logits, values

def set_init(layers):
for layer in layers:
nn.init.normal_(layer.weight, mean=0., std=0.1)
nn.init.constant_(layer.bias, 0.)
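    # Note: this set_init duplicates the helper imported from utils; the set_init([...]) call in
    # __init__ resolves to the imported function, so this copy appears to be unused here.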

def choose_action(self, s):
self.eval()
logits, _ = self.forward(s)
prob = F.softmax(logits, dim=1).data
m = self.distribution(prob)
return m.sample().numpy()[0]

def loss_func(self, s, a, v_t):
self.train()
logits, values = self.forward(s)
td = v_t - values
c_loss = td.pow(2)

probs = F.softmax(logits, dim=1)
m = self.distribution(probs)
exp_v = m.log_prob(a) * td.detach().squeeze()
a_loss = -exp_v
total_loss = (c_loss + a_loss).mean()
return total_loss
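    # A2C loss: the critic term is the squared TD/advantage error (v_t - V(s))^2 and the actor
    # term is -log pi(a|s) * advantage, with the advantage detached so policy gradients do not
    # flow into the value head; the total loss is the mean of critic plus actor terms.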


class Worker(mp.Process):
def __init__(self, gnet, opt, global_ep, global_ep_r, res_queue, name):
super(Worker, self).__init__()
self.name = 'w%02i' % name
self.g_ep, self.g_ep_r, self.res_queue = global_ep, global_ep_r, res_queue
self.gnet, self.opt = gnet, opt
self.lnet = Net(N_S, N_A) # local network
self.env = gym.make('CartPole-v0').unwrapped

def run(self):
global episode
total_step = 1
while self.g_ep.value < MAX_EP:
s = self.env.reset()
buffer_s, buffer_a, buffer_r = [], [], []
ep_r = 0.
while True:
if self.name == 'w00':
self.env.render()
a = self.lnet.choose_action(v_wrap(s[None, :]))
s_, r, done, _ = self.env.step(a)
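                # CartPole returns +1 per step; the terminal reward is overridden with -1 to
                # penalize the episode ending (pole falling or cart leaving the track).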
if done: r = -1
ep_r += r
buffer_a.append(a)
buffer_s.append(s)
buffer_r.append(r)

if self.g_ep.value % UPDATE_GLOBAL_ITER == 0 or done: # update global and assign to local net
# sync
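                    # push_and_pull (from utils, not shown in this diff) presumably computes the
                    # n-step returns and A2C loss on the local buffers, applies the resulting
                    # gradients to the global net via the shared optimizer, and then copies the
                    # global weights back into the local net.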
push_and_pull(self.opt, self.lnet, self.gnet, done, s_, buffer_s, buffer_a, buffer_r, GAMMA)
buffer_s, buffer_a, buffer_r = [], [], []

if done: # done and print information
record(self.g_ep, self.g_ep_r, ep_r, self.res_queue, self.name)
break
s = s_
total_step += 1
self.res_queue.put(None)


if __name__ == "__main__":

gnet = Net(N_S, N_A) # global network
gnet.share_memory() # share the global parameters in multiprocessing
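    # SharedAdam (from shared_adam.py, not shown in this diff) is presumably an Adam optimizer
    # whose state tensors live in shared memory so that all worker processes update the same
    # optimizer state.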
opt = SharedAdam(gnet.parameters(), lr=1e-4, betas=(0.92, 0.999)) # global optimizer
global_ep, global_ep_r, res_queue = mp.Value('i', 0), mp.Value('d', 0.), mp.Queue()

# parallel training
worker = Worker(gnet, opt, global_ep, global_ep_r, res_queue, 0)
worker.start()
res = [] # record episode reward to plot
while True:
r = res_queue.get()
if r is not None:
res.append(r)
else:
break


import matplotlib.pyplot as plt

plt.plot(res)
plt.ylabel('Average Reward')
plt.xlabel('Episode')
plt.show()

162 changes: 162 additions & 0 deletions sa2c_cart.py
@@ -0,0 +1,162 @@
"""
Reinforcement Learning (synchronous A2C, adapted from an A3C implementation) using PyTorch + multiprocessing.
The simplest implementation for discrete action (CartPole).
View more on my Chinese tutorial page [莫烦Python](https://morvanzhou.github.io/).
"""

import torch
import torch.nn as nn
from utils import v_wrap, set_init, push_and_pull, record
import torch.nn.functional as F
import torch.multiprocessing as mp
from shared_adam import SharedAdam
import gym
import os

import argparse
import time
from datetime import datetime
timestr = time.strftime("%d.%m.%Y - %H:%M:%S")

os.environ["OMP_NUM_THREADS"] = "1"

UPDATE_GLOBAL_ITER = 500
GAMMA = 0.9
MAX_EP = 3000
episode = 0

env = gym.make('CartPole-v0')
N_S = env.observation_space.shape[0]
N_A = env.action_space.n


def handleArguments():
"""Handles CLI arguments and saves them globally"""
parser = argparse.ArgumentParser(
description="Switch between modes in A2C or loading models from previous games")
parser.add_argument("--demo_mode", "-d", help="Renders the gym environment", action="store_true")
parser.add_argument("--load_model", "-l", help="Loads the model of previously gained training data", action="store_true")
global args
args = parser.parse_args()


class Net(nn.Module):
def __init__(self, s_dim, a_dim):
super(Net, self).__init__()
self.s_dim = s_dim
self.a_dim = a_dim
self.pi1 = nn.Linear(s_dim, 128)
self.pi2 = nn.Linear(128, a_dim)
self.v1 = nn.Linear(s_dim, 128)
self.v2 = nn.Linear(128, 1)
set_init([self.pi1, self.pi2, self.v1, self.v2])
self.distribution = torch.distributions.Categorical

def forward(self, x):
pi1 = torch.tanh(self.pi1(x))
logits = self.pi2(pi1)
v1 = torch.tanh(self.v1(x))
values = self.v2(v1)
return logits, values

def set_init(layers):
for layer in layers:
nn.init.normal_(layer.weight, mean=0., std=0.1)
nn.init.constant_(layer.bias, 0.)

def choose_action(self, s):
self.eval()
logits, _ = self.forward(s)
prob = F.softmax(logits, dim=1).data
m = self.distribution(prob)
return m.sample().numpy()[0]

def loss_func(self, s, a, v_t):
self.train()
logits, values = self.forward(s)
td = v_t - values
c_loss = td.pow(2)

probs = F.softmax(logits, dim=1)
m = self.distribution(probs)
exp_v = m.log_prob(a) * td.detach().squeeze()
a_loss = -exp_v
total_loss = (c_loss + a_loss).mean()
return total_loss


class Worker(mp.Process):
def __init__(self, gnet, opt, global_ep, global_ep_r, res_queue, name):
super(Worker, self).__init__()
self.name = 'w%02i' % name
self.g_ep, self.g_ep_r, self.res_queue = global_ep, global_ep_r, res_queue
self.gnet, self.opt = gnet, opt
self.lnet = Net(N_S, N_A) # local network
self.env = gym.make('CartPole-v0').unwrapped

def run(self):
global episode
total_step = 1
while self.g_ep.value < MAX_EP:
s = self.env.reset()
buffer_s, buffer_a, buffer_r = [], [], []
ep_r = 0.
while True:
if self.name == 'w00':
self.env.render()
a = self.lnet.choose_action(v_wrap(s[None, :]))
s_, r, done, _ = self.env.step(a)
if done: r = -1
ep_r += r
buffer_a.append(a)
buffer_s.append(s)
buffer_r.append(r)

if self.g_ep.value % UPDATE_GLOBAL_ITER == 0 or done: # update global and assign to local net
# sync
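                    # Synchronization mechanism added by this commit: whenever the global episode
                    # counter is a nonzero multiple of UPDATE_GLOBAL_ITER, the worker sleeps for
                    # 10 seconds before pushing to the global net (time.sleep(10) returns None,
                    # which is what actually gets put on res_queue), so the workers pause around
                    # the same point in training.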
if self.g_ep.value % UPDATE_GLOBAL_ITER == 0 and self.g_ep.value != 0:
print (self.g_ep.value)
print("sleep...")
self.res_queue.put(time.sleep(10))
push_and_pull(self.opt, self.lnet, self.gnet, done, s_, buffer_s, buffer_a, buffer_r, GAMMA)
buffer_s, buffer_a, buffer_r = [], [], []

if done: # done and print information
record(self.g_ep, self.g_ep_r, ep_r, self.res_queue, self.name)
break
s = s_
total_step += 1
self.res_queue.put(None)


if __name__ == "__main__":

gnet = Net(N_S, N_A) # global network
gnet.share_memory() # share the global parameters in multiprocessing
opt = SharedAdam(gnet.parameters(), lr=1e-4, betas=(0.92, 0.999)) # global optimizer
global_ep, global_ep_r, res_queue = mp.Value('i', 0), mp.Value('d', 0.), mp.Queue()

# parallel training
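    # Unlike a2c_cart.py, which starts a single worker, this script launches one Worker process
    # per CPU core; all of them share gnet and the SharedAdam optimizer.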
workers = [Worker(gnet, opt, global_ep, global_ep_r, res_queue, i) for i in range(mp.cpu_count())]
[w.start() for w in workers]
res = [] # record episode reward to plot
while True:
#print ("Episode: ", global_ep.value)
r = res_queue.get()
if r is not None:
res.append(r)
#if global_ep.value % 500 == 0:
#res.put(time.sleep(5))
#[w.join() for w in workers]
elif global_ep.value == MAX_EP:
break
else:
pass  # no score attached to this queue item (a worker pushed None, e.g. during its sleep); keep polling

import matplotlib.pyplot as plt

plt.plot(res)
plt.ylabel('Average Reward')
plt.xlabel('Episode')
plt.show()