The application trains, visualizes, and then simulates various RL agents on multiple environments. It bridges Model-Based (Dynamic Programming) and Model-Free (Monte Carlo and Temporal Difference) approaches, providing real-time visualization of value functions and policy convergence.
This framework integrates environments from the Gymnasium library, supporting both native discrete environments and continuous control environments discretized with custom wrappers.
- Taxi-v3: A navigation and delivery task in which the agent picks up a passenger and drops them off at the correct destination.
- FrozenLake-v1 (4x4 & 8x8): A slippery gridworld in which the agent must travel from the Start tile to the Goal tile while avoiding holes.
  - Configuration: Includes a toggle for "Slippery" vs. "Deterministic" dynamics.
- GridWorld-v0: A custom 5x5 grid environment containing specific start/goal states and penalty zones (pits); a minimal sketch of such a custom environment follows below.
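The custom GridWorld follows the standard Gymnasium `Env` interface. Below is a minimal, hypothetical sketch of such an environment; the class name, reward values, and pit locations are illustrative and not necessarily the project's actual implementation:

```python
import gymnasium as gym
from gymnasium import spaces


class GridWorld5x5(gym.Env):
    """Minimal 5x5 gridworld: start at (0, 0), goal at (4, 4), pits end the episode with a penalty."""

    def __init__(self, pits=((2, 2), (3, 1))):           # pit locations are illustrative
        self.size = 5
        self.start, self.goal = (0, 0), (4, 4)
        self.pits = set(pits)
        self.observation_space = spaces.Discrete(self.size * self.size)
        self.action_space = spaces.Discrete(4)            # 0=up, 1=right, 2=down, 3=left

    def _to_obs(self, pos):
        return pos[0] * self.size + pos[1]

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = self.start
        return self._to_obs(self.pos), {}

    def step(self, action):
        moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
        dr, dc = moves[int(action)]
        row = min(max(self.pos[0] + dr, 0), self.size - 1)
        col = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (row, col)
        if self.pos == self.goal:
            return self._to_obs(self.pos), 10.0, True, False, {}    # illustrative goal reward
        if self.pos in self.pits:
            return self._to_obs(self.pos), -10.0, True, False, {}   # illustrative pit penalty
        return self._to_obs(self.pos), -1.0, False, False, {}       # per-step cost
```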
To apply tabular RL methods to continuous control problems, the following environments use state-space discretization (binning):
- CartPole-v1 (Discrete):
  - Discretization: The continuous state space (Position, Velocity, Angle, Angular Velocity) is bucketed into bins of sizes `[6, 6, 12, 12]`, respectively.
  - Bounds: Velocity and Angular Velocity are clamped to ensure manageable table sizes.
- MountainCar-v0 (Discrete):
  - Discretization: Position and Velocity are discretized into `[12, 12]` bins.
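A minimal sketch of the binning approach; the clamped bounds and the `discretize` helper are illustrative rather than the project's exact wrapper:

```python
import numpy as np

# Illustrative CartPole bounds: the velocity terms are unbounded in the raw
# observation space, so they are clamped to keep the Q-table small.
CARTPOLE_BOUNDS = [(-4.8, 4.8), (-3.0, 3.0), (-0.418, 0.418), (-3.5, 3.5)]
CARTPOLE_BINS = [6, 6, 12, 12]


def discretize(obs, bounds, n_bins):
    """Map a continuous observation to a tuple of bin indices (a Q-table key)."""
    idx = []
    for value, (low, high), bins in zip(obs, bounds, n_bins):
        value = float(np.clip(value, low, high))
        edges = np.linspace(low, high, bins + 1)[1:-1]    # bins - 1 interior edges
        idx.append(int(np.digitize(value, edges)))        # index in [0, bins - 1]
    return tuple(idx)


# Example: one CartPole observation -> discrete state
state = discretize([0.1, 0.5, 0.02, -0.3], CARTPOLE_BOUNDS, CARTPOLE_BINS)
```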
The application implements a comprehensive suite of tabular algorithms, divided into Model-Based and Model-Free methods.
These require a known transition model of the environment, P(s', r | s, a).
- Value Iteration: Iteratively updates state values using the Bellman Optimality Equation until convergence.
- Policy Iteration: Alternates between Policy Evaluation (computing Vπ for the current policy) and Policy Improvement (acting greedily w.r.t. Vπ).
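A minimal Value Iteration sketch, assuming the environment exposes its transition table via `env.unwrapped.P` (as Gymnasium's toy-text environments such as Taxi-v3 and FrozenLake-v1 do); the loop structure is illustrative, not the project's exact code:

```python
import numpy as np


def value_iteration(env, gamma=0.99, theta=1e-6):
    """Sweep the Bellman Optimality update until the largest value change falls below theta."""
    P = env.unwrapped.P                          # P[s][a] -> list of (prob, next_state, reward, done)
    n_states, n_actions = env.observation_space.n, env.action_space.n

    def q_value(V, s, a):
        return sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])

    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q_value(V, s, a) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                        # convergence threshold θ
            break

    # Extract the greedy policy from the converged value function.
    policy = np.array([int(np.argmax([q_value(V, s, a) for a in range(n_actions)]))
                       for s in range(n_states)])
    return V, policy
```

Policy Iteration follows the same pattern but alternates full evaluation sweeps under a fixed policy with a greedy improvement step.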
These learn directly from experience (episodes) without a prior model.
- Q-Learning: Off-policy Temporal Difference (TD) control algorithm.
- SARSA: On-policy Temporal Difference control algorithm (the off-policy/on-policy difference is sketched in the update-rule example after this list).
- Monte Carlo Control (ε-Greedy): Estimates action values by averaging returns and uses ε-greedy exploration.
- First-Visit Monte Carlo: Updates state values based on the first time a state is visited in an episode.
- Every-Visit Monte Carlo: Updates state values based on every visit to a state within an episode.
- TD(0) Prediction: Updates state values based on the immediate next reward and estimated value of the next state.
- TD n-step: Updates values based on returns computed over n steps into the future.
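The off-policy/on-policy distinction referenced above comes down to which action value the update bootstraps on. The sketch below shows both updates in one illustrative loop (function and variable names are assumptions, not the project's code):

```python
import numpy as np


def epsilon_greedy(Q, s, epsilon, rng):
    """With probability ε take a random action, otherwise act greedily w.r.t. Q."""
    return int(rng.integers(Q.shape[1])) if rng.random() < epsilon else int(np.argmax(Q[s]))


def td_control(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1, sarsa=False, seed=0):
    """Tabular Q-Learning (sarsa=False) or SARSA (sarsa=True) on a discrete environment."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = epsilon_greedy(Q, s2, epsilon, rng)
            # Q-Learning bootstraps on the greedy next action; SARSA on the action actually taken next.
            target = r + gamma * (Q[s2, a2] if sarsa else Q[s2].max()) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```

Monte Carlo Control differs in that it waits until the episode ends and moves Q toward the full observed return rather than a bootstrapped target.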
| Algorithm | Taxi-v3 | FrozenLake (4x4 / 8x8) | GridWorld | CartPole (Discrete) | MountainCar (Discrete) |
|---|---|---|---|---|---|
| Value Iteration (DP) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Policy Iteration (DP) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Q-Learning | ✅ | ✅ | ✅ | ✅ | ✅ |
| SARSA | ✅ | ✅ | ✅ | ✅ | ✅ |
| MC Control (ε-Greedy) | ✅ | ✅ | ✅ | ✅ | ✅ |
| First Visit MC | ✅ | ✅ | ✅ | ✅ | ✅ |
| Every Visit MC | ✅ | ✅ | ✅ | ✅ | ✅ |
| TD(0) Prediction | ✅ | ✅ | ✅ | ✅ | ✅ |
| TD n-step | ✅ | ✅ | ✅ | ✅ | ✅ |
The user interface allows for granular control over hyperparameters, enabling experimentation with convergence behavior and learning dynamics.
- General Parameters:
  - Discount Factor (γ): Adjustable from 0.1 to 0.99; determines the importance of future rewards (a short discounted-return example follows the parameter table below).
  - Training Episodes: Adjustable (e.g., between 100 and 50,000) for Model-Free algorithms.
- Algorithm-Specific Parameters:
  - Learning Rate (α): Controls the step size for TD updates (Q-Learning, SARSA, TD(0), TD n-step).
  - Epsilon (ε): Controls the trade-off between exploration and exploitation in ε-greedy policies.
  - Convergence Threshold (θ): Defines the stopping criterion for Dynamic Programming algorithms (default 1e-6).
  - Max Steps (Simulation): Sets a limit on episode length during the live simulation phase.
| Parameter | Symbol | Scope / Algorithm | Range / Default | Description |
|---|---|---|---|---|
| Discount Factor | γ | Global | 0.1 – 0.99 | Determines the importance of future rewards. Higher values (e.g. 0.99) encourage long-term planning; lower values focus on immediate rewards. |
| Training Episodes | N | Model-Free | 100 – 50,000 | Total number of episodes the agent interacts with the environment to learn a policy. |
| Learning Rate | α | TD / Q-Learning | 0.01 – 1.0 | Controls how much new information overrides old information. α = 0 means no learning; α = 1 uses only the latest information. |
| Epsilon | ε | ε-Greedy | 0.0 – 1.0 | Controls exploration vs. exploitation. Probability of taking a random action to explore new states. |
| Convergence Theta | θ | Model-Based (DP) | 1e-6 (default) | Stopping threshold for Value / Policy Iteration. Training stops when ΔV < θ. |
| Is Slippery | — | FrozenLake only | True / False | Toggles stochastic dynamics. If True, the agent may slip to unintended tiles. |
| Max Steps | — | Live Simulation | 20 – 300 | Safety limit for post-training simulation to prevent infinite loops. |
| Discretization Bins | — | CartPole | [6, 6, 12, 12] | Fixed bin sizes for Position, Velocity, Angle, and Angular Velocity to discretize the continuous state space. |
| Discretization Bins | — | MountainCar | [12, 12] | Fixed bin sizes for Position and Velocity to discretize the continuous state space. |
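To make the effect of γ concrete, the short example below compares discounted returns for a constant per-step cost of -1 (the step penalty used by Taxi-v3 and MountainCar-v0); the 200-step horizon is arbitrary:

```python
import numpy as np

# Discounted return of a constant -1 per-step cost over a 200-step episode.
rewards = np.full(200, -1.0)
for gamma in (0.5, 0.9, 0.99):
    discounted = sum(r * gamma ** t for t, r in enumerate(rewards))
    print(f"gamma={gamma}: discounted return ≈ {discounted:.1f}")
# gamma=0.5 ≈ -2.0, gamma=0.9 ≈ -10.0, gamma=0.99 ≈ -86.6:
# the higher the discount factor, the more distant costs and rewards matter.
```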
The framework uses Plotly and Matplotlib to generate high-fidelity, interactive visualizations of the learning process.
- Heatmaps (Grid Environments): Uses `plotly.figure_factory` to create annotated heatmaps.
  - Plots the intensity of the learned Value Function V(s).
  - Overlays directional arrows (↑ ↓ ← →) for the optimal Policy π(s).
  - For Taxi-v3, the high-dimensional state space is projected onto a 2D density heatmap.
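A sketch of how such a heatmap can be assembled with `plotly.figure_factory.create_annotated_heatmap`, assuming a 4x4 FrozenLake value table and a greedy policy array; the arrow mapping and layout choices are illustrative:

```python
import numpy as np
import plotly.figure_factory as ff


def value_policy_heatmap(V, policy, n_rows=4, n_cols=4):
    """Annotated heatmap of V(s) with the greedy action drawn as an arrow in each cell."""
    arrows = {0: "←", 1: "↓", 2: "→", 3: "↑"}             # FrozenLake action order: left/down/right/up
    z = np.asarray(V, dtype=float).reshape(n_rows, n_cols)
    text = [[arrows[int(policy[r * n_cols + c])] for c in range(n_cols)] for r in range(n_rows)]
    fig = ff.create_annotated_heatmap(
        z=np.flipud(z),                                    # put grid row 0 at the top of the figure
        annotation_text=np.flipud(np.array(text)).tolist(),
        colorscale="Viridis",
        showscale=True,
    )
    return fig
```

In a Streamlit app the figure would typically be displayed with `st.plotly_chart(fig)`.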
- 3D Surface Plots (Control Environments):
  - Used for CartPole and MountainCar.
  - Plots a 3D surface of the Value Function against two discretized state dimensions, e.g., Angle vs. Angular Velocity for CartPole and Position vs. Velocity for MountainCar.
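A sketch of the surface view, assuming a MountainCar Q-table shaped `(12, 12, n_actions)` and a max-over-actions reduction to V(s); axis labels and colorscale are assumptions:

```python
import numpy as np
import plotly.graph_objects as go


def value_surface(Q, x_label="Position bin", y_label="Velocity bin"):
    """3D surface of V(s) = max_a Q(s, a) over two discretized state dimensions."""
    V = np.asarray(Q).max(axis=-1)                        # (12, 12, n_actions) -> (12, 12)
    xs, ys = np.arange(V.shape[0]), np.arange(V.shape[1])
    fig = go.Figure(data=[go.Surface(z=V.T, x=xs, y=ys, colorscale="Viridis")])
    fig.update_layout(scene=dict(xaxis_title=x_label, yaxis_title=y_label, zaxis_title="V(s)"))
    return fig
```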
- Reward History: Uses Matplotlib to plot Total Reward per episode.
- Moving Average: Overlays a rolling mean (window size scales with the episode count) to smooth variance and reveal learning trends.
- Logarithmic Scaling: Provides a toggle for Symmetric Log Scale on the Y-axis for better handling of large negative rewards.
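A sketch of the reward-history chart with a rolling mean and an optional symmetric log scale; the window heuristic shown is illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_rewards(episode_rewards, use_symlog=False):
    """Total reward per episode plus a moving average whose window scales with episode count."""
    rewards = np.asarray(episode_rewards, dtype=float)
    window = max(1, len(rewards) // 50)                   # illustrative dynamic window size
    moving_avg = np.convolve(rewards, np.ones(window) / window, mode="valid")

    fig, ax = plt.subplots()
    ax.plot(rewards, alpha=0.3, label="Reward per episode")
    ax.plot(np.arange(window - 1, len(rewards)), moving_avg, label=f"Moving average (w={window})")
    if use_symlog:
        ax.set_yscale("symlog")                           # symmetric log handles large negative rewards
    ax.set_xlabel("Episode")
    ax.set_ylabel("Total reward")
    ax.legend()
    return fig
```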
- Step-by-Step Rendering: Replays the behavior of the trained agent in the environment using frames rendered as RGB arrays (`env.render()` with `render_mode="rgb_array"`).
- Data Logging: Displays a synchronized data table showing Step Count, State ID, Action Taken, Immediate Reward, and Cumulative Reward in real time.
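A sketch of the replay loop, assuming a Gymnasium environment created with `render_mode="rgb_array"`, a tabular greedy policy, and Streamlit placeholders for display; all names are illustrative:

```python
import gymnasium as gym
import pandas as pd
import streamlit as st


def replay(env_id, policy, max_steps=200):
    """Roll out the greedy policy, streaming each rendered frame and a synchronized step log."""
    env = gym.make(env_id, render_mode="rgb_array")
    obs, _ = env.reset()
    frame_slot = st.empty()
    log, cumulative = [], 0.0
    for step in range(max_steps):
        action = int(policy[obs])
        obs, reward, terminated, truncated, _ = env.step(action)
        cumulative += reward
        frame_slot.image(env.render())                    # RGB array frame
        log.append({"Step": step, "State": obs, "Action": action,
                    "Reward": reward, "Cumulative": cumulative})
        if terminated or truncated:
            break
    env.close()
    st.dataframe(pd.DataFrame(log))
```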
- Frontend/UI: Streamlit
- RL Core: Gymnasium, NumPy
- Visualization: Plotly Graph Objects, Plotly Figure Factory, Matplotlib, PIL (Python Imaging Library)