- Tensorflow-gpu > 1.13 with eager execution, or tensorflow 2.x
- Tensorflow-probability 0.6.0
- OpenAI baselines
- OpenAI Gym
- Bootstrapped EBU (the basic algorithm underlying BEBU-UCB, BEBU-IDS, and OB2I)
- Bootstrapped DQN
- EBU (we provide a clean implementation at https://github.com/Baichenjia/EBU)
- UCB-Bonus (see the sketch after this list)
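For orientation, one common form of UCB-style action selection over an ensemble of bootstrapped Q heads is to pick the action maximizing the ensemble mean plus a scaled ensemble standard deviation. The sketch below illustrates only that general idea; the function name `ucb_action`, the `ucb_coef` default, and the `(num_heads, num_actions)` array layout are assumptions for the example, not this repository's API, and the exact bonus used here may differ.

```python
import numpy as np

def ucb_action(q_values_per_head, ucb_coef=0.1):
    """Pick an action via an upper-confidence bound over bootstrapped Q heads.

    q_values_per_head: array of shape (num_heads, num_actions) holding each
    head's Q-value estimates for the current state.
    """
    q = np.asarray(q_values_per_head, dtype=np.float64)
    mean_q = q.mean(axis=0)   # ensemble mean per action
    std_q = q.std(axis=0)     # ensemble disagreement per action (uncertainty proxy)
    return int(np.argmax(mean_q + ucb_coef * std_q))

# Toy usage: 4 heads, 3 actions.
print(ucb_action(np.random.randn(4, 3), ucb_coef=0.5))
```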
The following command should train an agent on "Breakout".
python run_atari.py --env BreakoutNoFrameskip-v4 --reward-type ucb --ebu
The following commands should train agents on "Breakout" with the other baseline methods.
python run_atari.py --env BreakoutNoFrameskip-v4 --ebu
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection ucb --ebu
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection ids --ebu
(vote is used for evaluation)
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection vote --ebu
python run_atari.py --env BreakoutNoFrameskip-v4
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection vote
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection ucb
python run_atari.py --env BreakoutNoFrameskip-v4 --action-selection ids
Any method can be combined with the Randomized Prior Function by adding the `--prior` flag.
For example, run Bootstrapped DQN + Randomized Prior Function as follows:
python run_atari.py --env BreakoutNoFrameskip-v4 --prior
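For background, a Randomized Prior Function adds the output of a fixed, randomly initialized prior network to a trainable network, so each ensemble member keeps a distinct prior throughout training. Below is a minimal, hypothetical tf.keras sketch of that idea; the class name, layer sizes, and `prior_scale` parameter are illustrative assumptions and do not correspond to the code in this repository (see `models.py` for the actual implementation).

```python
import tensorflow as tf

class PriorQNetwork(tf.keras.Model):
    """Sketch: Q(s) = q_trainable(s) + prior_scale * q_prior(s),
    where the prior network stays fixed at its random initialization."""

    def __init__(self, num_actions, prior_scale=1.0):
        super().__init__()
        self.prior_scale = prior_scale
        self.trainable_net = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(num_actions),
        ])
        self.prior_net = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(num_actions),
        ])
        self.prior_net.trainable = False  # the prior is never updated during training

    def call(self, obs):
        # Gradients flow only through the trainable network.
        return self.trainable_net(obs) + self.prior_scale * tf.stop_gradient(self.prior_net(obs))
```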
- `deepq.py` contains stepping the environment, storing experience, and saving models.
- `deepq_learner.py` contains the action-selection methods, the bonus, and bootstrapped DQN/EBU training.
- `replay_buffer.py` contains two replay buffer classes, for BDQN and BEBU respectively. The memory consumption has been highly optimized.
- `models.py` contains the Q-network, the bootstrapped Q-network with multiple heads, and the bootstrapped Q-network with the Randomized Prior Function.
- `run_atari.py` contains the hyper-parameter settings. Running this file starts training.
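As a rough, self-contained sketch of what a bootstrapped replay buffer does (each transition also stores a Bernoulli mask so that every head trains only on its own sub-sample of the data), consider the toy buffer below. It is an illustration under stated assumptions, not the memory-optimized implementation in `replay_buffer.py`; the class name and the `mask_prob` default are made up for the example.

```python
import numpy as np

class BootstrapReplayBuffer:
    """Toy ring buffer that stores a per-transition bootstrap mask over heads."""

    def __init__(self, capacity, num_heads, mask_prob=0.5):
        self.capacity = capacity
        self.num_heads = num_heads
        self.mask_prob = mask_prob
        self.storage = []
        self.next_idx = 0

    def add(self, obs, action, reward, next_obs, done):
        # Each head sees this transition only if its mask entry is 1.
        mask = np.random.binomial(1, self.mask_prob, size=self.num_heads)
        data = (obs, action, reward, next_obs, done, mask)
        if self.next_idx >= len(self.storage):
            self.storage.append(data)
        else:
            self.storage[self.next_idx] = data
        self.next_idx = (self.next_idx + 1) % self.capacity

    def sample(self, batch_size):
        idxs = np.random.randint(0, len(self.storage), size=batch_size)
        cols = list(zip(*(self.storage[i] for i in idxs)))
        # Returns: obs, actions, rewards, next_obs, dones, masks (each as an array).
        return [np.array(c) for c in cols]
```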
The data for each run is stored on disk under the result directory, in a sub-directory named
`<env-id>-<algorithm>-<date>-<time>`. Each run directory contains:
- `log.txt` records the episode, exploration rate, episodic rewards in training (after normalization, as used for training), episodic scores (raw scores), current timesteps, and percentage completed.
- `monitor.csv` is the environment monitor file written by the `logger` from OpenAI Baselines.
- `parameters.txt` lists all hyper-parameters used in training.
- `progress.csv` contains the same data as `log.txt` but in csv format.
- `evaluate scores.txt` records the evaluation of the policy for 108000 frames every 1e5 training steps, with 30 no-op evaluation.
- `model_10M.h5`, `model_20M.h5`, `model_best_10M.h5`, `model_best_20M.h5` are the saved policy files.
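As a small, hypothetical example of consuming these outputs, a run's learning curve can be plotted from `progress.csv` with pandas. The column names `"steps"` and `"episode_score"` below are assumptions; check the header row of your own `progress.csv` for the actual names.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Point this at an actual run directory, e.g. result/<env-id>-<algorithm>-<date>-<time>.
run_dir = "result/<env-id>-<algorithm>-<date>-<time>"

# "steps" and "episode_score" are assumed column names, not necessarily the real headers.
df = pd.read_csv(f"{run_dir}/progress.csv")
plt.plot(df["steps"], df["episode_score"])
plt.xlabel("training steps")
plt.ylabel("episodic score")
plt.show()
```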