Interactive Reinforcement Learning


1. Introduction

The application trains, visualizes, and then simulates various RL agents on multiple environments. It bridges Model-Based (Dynamic Programming) and Model-Free (Monte Carlo and Temporal Difference) approaches, providing real-time visualization of value functions and policy convergence.

2. Implemented Environments

This framework integrates environments from the Gymnasium library, supporting both native discrete environments and continuous control environments discretized with custom wrappers.

A. Grid-Based Environments

  • Taxi-v3: The navigation and delivery problem of picking up and dropping off passengers.
  • FrozenLake-v1 (4x4 & 8x8): A slippery gridworld in which the agent must travel from Start to Goal while avoiding holes.
    • Configuration: Includes a toggle for "Slippery" vs. "Deterministic" dynamics (see the creation sketch after this list).
  • GridWorld-v0: A custom 5x5 grid environment containing specific start/goal states and penalty zones (pits).
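A minimal sketch of how these environments can be created with Gymnasium; the map size and the "Slippery" toggle correspond to the map_name and is_slippery arguments of FrozenLake-v1 (GridWorld-v0 is a custom id assumed to be registered by this repository, so it is omitted here):

```python
# Sketch: creating the grid-based environments with Gymnasium.
import gymnasium as gym

taxi = gym.make("Taxi-v3")
lake_4x4 = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)   # stochastic ("slippery") dynamics
lake_8x8 = gym.make("FrozenLake-v1", map_name="8x8", is_slippery=False)  # deterministic dynamics
```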

B. Discretized Control Environments

To apply tabular RL methods to continuous control problems, the following environments use state-space discretization (binning), sketched in the code after this list:

  • CartPole-v1 (Discrete):

    • Discretization: Continuous state space (Position, Velocity, Angle, Angular Velocity) is bucketed into bins of sizes [6, 6, 12, 12] respectively.
    • Bounds: Velocity and Angular Velocity bounds are clamped to ensure manageable table sizes.
  • MountainCar-v0 (Discrete):

    • Discretization: Position and Velocity are discretized into [12, 12] bins.
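As a rough illustration of the binning described above (a sketch under the stated bin sizes, not the repository's exact wrapper; the helper names and the clamped velocity bounds are assumptions):

```python
# Illustrative state-space discretization for CartPole-v1 with bins [6, 6, 12, 12].
# LOW/HIGH clamp the unbounded velocity dimensions to finite values (assumed here).
import numpy as np
import gymnasium as gym

N_BINS = np.array([6, 6, 12, 12])            # position, velocity, angle, angular velocity
LOW  = np.array([-4.8, -4.0, -0.418, -4.0])
HIGH = np.array([ 4.8,  4.0,  0.418,  4.0])

def discretize(obs):
    """Map a continuous observation to a tuple of bin indices usable as a Q-table key."""
    ratios = (np.clip(obs, LOW, HIGH) - LOW) / (HIGH - LOW)
    return tuple((ratios * (N_BINS - 1)).round().astype(int))

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)
state = discretize(obs)                      # tuple of bin indices used as a table key
```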

3. Implemented Algorithms

The application implements a comprehensive suite of tabular algorithms, divided into Model-Based and Model-Free methods.

A. Model-Based (Dynamic Programming)

These require a known transition model of the environment, P(s', r | s, a); a Value Iteration sketch follows the list below.

  1. Value Iteration: Iteratively updates state values using the Bellman Optimality Equation until convergence.
  2. Policy Iteration: Alternates between Policy Evaluation (computing Vπ) and Policy Improvement (acting greedily with respect to Vπ).
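Illustrative only (not the repository's code): a compact Value Iteration sketch using the transition tables that Gymnasium's toy-text environments expose via env.unwrapped.P; gamma and theta correspond to the Discount Factor and Convergence Theta described in Section 4.

```python
# Value Iteration over an environment with a known transition model.
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)
P = env.unwrapped.P                     # P[s][a] -> list of (prob, next_state, reward, terminated)
n_states, n_actions = env.observation_space.n, env.action_space.n

def value_iteration(gamma=0.99, theta=1e-6):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: best expected return over all actions
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])
                 for a in range(n_actions)]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:               # stop when the largest value change falls below theta
            break
    # Extract the greedy policy with respect to the converged value function
    policy = np.array([
        np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r, _ in P[s][a])
                   for a in range(n_actions)])
        for s in range(n_states)
    ])
    return V, policy
```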

B. Model-Free (Monte Carlo & Temporal Difference)

These learn directly from experience (episodes) without a prior model; a Q-Learning/SARSA sketch appears at the end of this section.

  1. Q-Learning: Off-policy Temporal Difference (TD) control algorithm.
  2. SARSA: On-policy Temporal Difference control algorithm.
  3. Monte Carlo Control (ε-Greedy): Estimates action values by averaging returns and uses ε-greedy exploration.
  4. First-Visit Monte Carlo: Updates state values based on the first time a state is visited in an episode.
  5. Every-Visit Monte Carlo: Updates state values based on every visit to a state in an episode.
  6. TD(0) Prediction: Updates state values based on the immediate next reward and estimated value of the next state.
  7. TD n-step: Updates values based on returns computed over n steps into the future.
| Algorithm | Taxi-v3 | FrozenLake (4x4 / 8x8) | GridWorld | CartPole (Discrete) | MountainCar (Discrete) |
| --- | --- | --- | --- | --- | --- |
| Value Iteration (DP) | | | | | |
| Policy Iteration (DP) | | | | | |
| Q-Learning | | | | | |
| SARSA | | | | | |
| MC Control (ε-Greedy) | | | | | |
| First Visit MC | | | | | |
| Every Visit MC | | | | | |
| TD(0) Prediction | | | | | |
| TD n-step | | | | | |
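For concreteness, a hedged sketch of tabular Q-Learning with ε-greedy exploration; replacing the greedy bootstrap max over Q[s2] with the value of the action actually taken next turns it into SARSA. Hyperparameter names mirror Section 4; this is an illustration, not the application's training loop.

```python
# Tabular Q-Learning (off-policy TD control) on Taxi-v3.
import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon, episodes = 0.1, 0.99, 0.1, 5000

for _ in range(episodes):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s2, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # off-policy TD target: bootstrap from the greedy value of the next state
        Q[s, a] += alpha * (r + gamma * (0.0 if terminated else np.max(Q[s2])) - Q[s, a])
        s = s2
```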

4. Parameter Adjustment Capabilities

The user interface allows for granular control over hyperparameters, enabling experimentation with convergence behavior and learning dynamics.

  • General Parameters:

    • Discount Factor (γ): Adjustable from 0.1 to 0.99 (determines the importance of future rewards).
    • Training Episodes: The range can be adjusted, for example, between 100 and 50,000 for Model-Free algorithms.
  • Algorithm-Specific Parameters:

    • Learning Rate (α): Controls the step size for TD updates (Q-Learning, SARSA, TD(0), TD n-step).
    • Epsilon (ε): Controls the trade-off between exploration and exploitation in ε-greedy policies.
    • Convergence Threshold (θ): Defines the stopping criterion for Dynamic Programming algorithms (default 1e-6).
    • Max Steps (Simulation): Sets a limit for the length of an episode during the live simulation phase.
| Parameter | Symbol | Scope / Algorithm | Range / Default | Description |
| --- | --- | --- | --- | --- |
| Discount Factor | γ | Global | 0.1 – 0.99 | Determines the importance of future rewards. Higher values (e.g. 0.99) encourage long-term planning; lower values focus on immediate rewards. |
| Training Episodes | N | Model-Free | 100 – 50,000 | Total number of episodes the agent interacts with the environment to learn a policy. |
| Learning Rate | α | TD / Q-Learning | 0.01 – 1.0 | Controls how much new information overrides old information. α = 0 means no learning; α = 1 uses only the latest information. |
| Epsilon | ε | ε-Greedy | 0.0 – 1.0 | Controls exploration vs. exploitation: the probability of taking a random action to explore new states. |
| Convergence Theta | θ | Model-Based (DP) | 1e-6 (default) | Stopping threshold for Value / Policy Iteration. Training stops when ΔV < θ. |
| Is Slippery | | FrozenLake only | True / False | Toggles stochastic dynamics. If True, the agent may slip to unintended tiles. |
| Max Steps | | Live Simulation | 20 – 300 | Safety limit for post-training simulation to prevent infinite loops. |
| Discretization Bins | | CartPole | [6, 6, 12, 12] | Fixed bin sizes for Position, Velocity, Angle, and Angular Velocity to discretize the continuous state space. |
| Discretization Bins | | MountainCar | [12, 12] | Fixed bin sizes for Position and Velocity to discretize the continuous state space. |
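Controls like these are typically exposed as Streamlit sidebar widgets; the snippet below is an illustrative sketch using the ranges from the table above, and the app's actual labels, layout, and defaults may differ.

```python
# Illustrative Streamlit sidebar controls for the hyperparameters above.
import streamlit as st

gamma = st.sidebar.slider("Discount Factor (γ)", 0.1, 0.99, 0.99)
episodes = st.sidebar.slider("Training Episodes", 100, 50_000, 5_000, step=100)
alpha = st.sidebar.slider("Learning Rate (α)", 0.01, 1.0, 0.1)
epsilon = st.sidebar.slider("Epsilon (ε)", 0.0, 1.0, 0.1)
theta = st.sidebar.number_input("Convergence Theta (θ)", value=1e-6, format="%.1e")
is_slippery = st.sidebar.checkbox("Is Slippery (FrozenLake)", value=True)
max_steps = st.sidebar.slider("Max Steps (Simulation)", 20, 300, 100)
```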

5. Visualization Techniques

The framework uses Plotly and Matplotlib to generate high-fidelity, interactive visualizations of the learning process.

A. Value Landscape Visualization

  • Heatmaps (Grid Environments): Uses plotly.figure_factory to create annotated heatmaps.

    • Plots the learned Value Function V(s) as color intensity.
    • Overlays directional arrows for the optimal Policy (π).
    • For Taxi-v3, high-dimensional states are projected onto a 2D density heatmap (see the heatmap sketch after this list).
  • 3D Surface Plots (Control Environments):

    • Used for CartPole and MountainCar.
    • Plots a 3D surface representing the Value Function against discretized state dimensions, e.g., Angle vs. Angular Velocity for CartPole, Position vs. Velocity for MountainCar.
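A rough sketch of the annotated heatmap for a 4x4 FrozenLake value function with overlaid policy arrows; the placeholder V and policy arrays stand in for a trained agent's output, and the repository's plotting code may differ.

```python
# Annotated value-function heatmap with policy arrows (FrozenLake 4x4).
import numpy as np
import plotly.figure_factory as ff

V = np.random.rand(16)                      # placeholder value function (use learned values)
policy = np.random.randint(0, 4, 16)        # placeholder greedy policy
arrows = {0: "←", 1: "↓", 2: "→", 3: "↑"}   # FrozenLake action ids: Left, Down, Right, Up

z = V.reshape(4, 4)
text = np.array([arrows[int(a)] for a in policy]).reshape(4, 4)
fig = ff.create_annotated_heatmap(z=z, annotation_text=text, colorscale="Viridis")
fig.show()
```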

B. Convergence Analysis

  • Reward History: Uses Matplotlib to plot Total Reward per episode.
  • Moving Average: Overlays a rolling mean (window size scales with the episode count) to smooth out variance and show learning trends; see the sketch below.
  • Logarithmic Scaling: Provides a toggle for Symmetric Log Scale on the Y-axis for better handling of large negative rewards.
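A minimal sketch of the reward-history plot with a rolling mean and symmetric log scaling; the rewards array here is a placeholder for the per-episode totals collected during training.

```python
# Reward history with a dynamic-window moving average and symlog Y-axis.
import numpy as np
import matplotlib.pyplot as plt

rewards = np.random.randn(2000).cumsum()          # placeholder training curve
window = max(10, len(rewards) // 50)              # window size scales with episode count
moving_avg = np.convolve(rewards, np.ones(window) / window, mode="valid")

plt.plot(rewards, alpha=0.3, label="Total reward per episode")
plt.plot(np.arange(window - 1, len(rewards)), moving_avg, label=f"Moving average ({window})")
plt.yscale("symlog")                              # handles large negative rewards gracefully
plt.xlabel("Episode")
plt.ylabel("Total reward")
plt.legend()
plt.show()
```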

C. Live Simulation

  • Step-by-Step Rendering: Replays the behavior of the trained agent in the environment using env.render() with render_mode='rgb_array' (sketched below).
  • Data Logging: Displays a synchronized data table showing Step Count, State ID, Action Taken, Immediate Reward, and Cumulative Reward in real time.
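A sketch of the replay loop, assuming a learned Q-table and the Max Steps limit from Section 4 (both given placeholder values below); each rendered frame and log row could then be streamed to the UI.

```python
# Greedy replay of a trained agent with frame rendering and step logging.
import numpy as np
import gymnasium as gym

env = gym.make("Taxi-v3", render_mode="rgb_array")
Q = np.zeros((env.observation_space.n, env.action_space.n))  # placeholder: use the trained Q-table
max_steps = 100                                              # placeholder: Max Steps control

s, _ = env.reset()
cumulative = 0.0
for step in range(max_steps):
    a = int(np.argmax(Q[s]))                     # greedy action from the learned Q-table
    s, r, terminated, truncated, _ = env.step(a)
    cumulative += r
    frame = env.render()                         # RGB frame for display (e.g. st.image)
    # log row: step count, state id, action taken, immediate reward, cumulative reward
    if terminated or truncated:
        break
```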

6. Technical Stack

  • Frontend/UI: Streamlit
  • RL Core: Gymnasium, NumPy
  • Visualization: Plotly Graph Objects, Plotly Figure Factory, Matplotlib, PIL (Python Imaging Library)

Docs

Useful sources
