Code for the first assignment of the course IIC3675 - Reinforcement Learning. The report is in the `docs/` folder.
To replicate the reported plots, first clone this repo and then run `Main.py` as follows:

```shell
python Main.py
```

Depending on your Python installation, you may need to replace `python` with `python3`.
When it starts, the program will ask you to select the experiment to replicate, offering three options: Epsilon Greedy, Optimistic Initial Values, and Gradient Bandit. Once you have selected an experiment, it will execute 2,000 runs of 1,000 steps each. Finally, the program will ask for the names you want to give the generated plots, which will be saved in the `imgs/` folder.
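The experiment structure described above (many independent runs, each of 1,000 steps on a multi-armed testbed) can be sketched roughly as follows. This is an illustrative sketch of an epsilon-greedy run, not the actual code in `Main.py`; all names here are hypothetical.

```python
import numpy as np

def run_epsilon_greedy(n_arms=10, steps=1000, epsilon=0.1, rng=None):
    """One run of epsilon-greedy on a Gaussian multi-armed testbed (sketch)."""
    rng = rng or np.random.default_rng()
    q_true = rng.normal(0.0, 1.0, n_arms)   # true action values
    q_est = np.zeros(n_arms)                # sample-average estimates
    counts = np.zeros(n_arms)
    rewards = np.zeros(steps)
    for t in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))   # explore
        else:
            a = int(np.argmax(q_est))       # exploit
        r = rng.normal(q_true[a], 1.0)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental sample average
        rewards[t] = r
    return rewards

# The reported curves average over 2,000 runs; 100 suffice for a quick check.
avg = np.mean([run_epsilon_greedy(rng=np.random.default_rng(i))
               for i in range(100)], axis=0)
```

Averaging the per-step rewards across runs, as above, produces the "average rewards" curve the program saves.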
**Warning:** The Gradient Bandit experiment takes considerably longer to execute than the other two: a Gradient Bandit run takes about two minutes, while the others take about 15 seconds.
Below is a reference of what you should get by selecting each option.
First, you will have to name the average rewards plot, which should look like this:
Then, the program will ask you the name for the optimal action percentage plot, which should be similar to the following:
The plot generated for this experiment should look like this:
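The idea behind optimistic initial values can be sketched as follows: action-value estimates start at a high value (e.g. 5.0) with a constant step size, so a purely greedy policy is still forced to explore early on. This is an illustrative sketch under those assumptions, not the repo's actual implementation.

```python
import numpy as np

def run_optimistic(n_arms=10, steps=1000, q0=5.0, alpha=0.1, rng=None):
    """Greedy selection with optimistic initial estimates (sketch).
    The high initial value q0 drives early exploration without epsilon."""
    rng = rng or np.random.default_rng()
    q_true = rng.normal(0.0, 1.0, n_arms)
    q_est = np.full(n_arms, q0)             # optimistic start
    optimal = int(np.argmax(q_true))
    hits = np.zeros(steps)
    for t in range(steps):
        a = int(np.argmax(q_est))           # purely greedy
        r = rng.normal(q_true[a], 1.0)
        q_est[a] += alpha * (r - q_est[a])  # constant step size
        hits[t] = float(a == optimal)
    return hits

# Fraction of runs choosing the optimal action at each step
pct_optimal = np.mean([run_optimistic(rng=np.random.default_rng(i))
                       for i in range(200)], axis=0)
```

Averaging the `hits` arrays across runs yields the "optimal action percentage" curve shown in the plot.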
An important aspect to mention is that our implementation differs slightly from the version seen in class, which has the following pseudocode:
However, our implementation updates the baseline (the average reward) differently. For more details, please check the `GradientBandit.py` file and the report included with the assignment.
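For reference, a single gradient bandit update with an incremental average-reward baseline can be sketched as below. This follows the standard formulation of the algorithm and is only illustrative; the function and variable names do not match `GradientBandit.py`, and the exact ordering of the baseline update there may differ.

```python
import numpy as np

def gradient_bandit_step(H, avg_reward, t, reward, action,
                         alpha=0.1, use_baseline=True):
    """One preference update of a gradient bandit (illustrative sketch)."""
    pi = np.exp(H - H.max())
    pi /= pi.sum()                          # softmax action probabilities
    baseline = avg_reward if use_baseline else 0.0
    one_hot = np.zeros_like(H)
    one_hot[action] = 1.0
    # Raise the preference of the taken action if reward beats the baseline,
    # lower the others proportionally to their probabilities.
    H = H + alpha * (reward - baseline) * (one_hot - pi)
    avg_reward += (reward - avg_reward) / t  # incremental average baseline
    return H, avg_reward
```

With `use_baseline=False` the same code reproduces the no-baseline variant compared in the plot.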
The generated plot from this experiment should be similar to this:
**Note:** The blue and green curves are the methods with a baseline, whereas the orange and red curves are the methods without one.