The application trains, visualizes, and then simulates various RL agents on multiple environments. It bridges Model-Based (Dynamic Programming) and Model-Free (Monte Carlo and Temporal Difference) approaches, providing real-time visualization of value functions and policy convergence.
This framework integrates environments from the Gymnasium library, supporting both native discrete environments and continuous control environments discretized with custom wrappers.
- Taxi-v3: A navigation and delivery task in which the agent picks up a passenger and drops them off at the correct destination.
- FrozenLake-v1 (4x4 & 8x8): A slippery gridworld in which the agent must travel from the Start tile to the Goal tile while avoiding holes.
  - Configuration: Includes a toggle for "Slippery" vs. "Deterministic" dynamics.
- GridWorld-v0: A custom 5x5 grid environment containing specific start/goal states and penalty zones (pits); a minimal sketch of such a custom environment follows below.
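The custom GridWorld follows the standard Gymnasium `Env` interface. Below is a minimal, hypothetical sketch of such an environment; the class name, reward values, and pit locations are illustrative and not necessarily the project's actual implementation:

```python
import gymnasium as gym
from gymnasium import spaces


class GridWorld5x5(gym.Env):
    """Minimal 5x5 gridworld: start at (0, 0), goal at (4, 4), pits end the episode with a penalty."""

    def __init__(self, pits=((2, 2), (3, 1))):           # pit locations are illustrative
        self.size = 5
        self.start, self.goal = (0, 0), (4, 4)
        self.pits = set(pits)
        self.observation_space = spaces.Discrete(self.size * self.size)
        self.action_space = spaces.Discrete(4)            # 0=up, 1=right, 2=down, 3=left

    def _to_obs(self, pos):
        return pos[0] * self.size + pos[1]

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.pos = self.start
        return self._to_obs(self.pos), {}

    def step(self, action):
        moves = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
        dr, dc = moves[int(action)]
        row = min(max(self.pos[0] + dr, 0), self.size - 1)
        col = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (row, col)
        if self.pos == self.goal:
            return self._to_obs(self.pos), 10.0, True, False, {}    # illustrative goal reward
        if self.pos in self.pits:
            return self._to_obs(self.pos), -10.0, True, False, {}   # illustrative pit penalty
        return self._to_obs(self.pos), -1.0, False, False, {}       # per-step cost
```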
To apply tabular RL methods to continuous control problems, the following environments use state-space discretization (binning):
- CartPole-v1 (Discrete):
  - Discretization: The continuous state space (Position, Velocity, Angle, Angular Velocity) is bucketed into bins of sizes `[6, 6, 12, 12]`, respectively.
  - Bounds: Velocity and Angular Velocity are clamped to ensure manageable table sizes.
- MountainCar-v0 (Discrete):
  - Discretization: Position and Velocity are discretized into `[12, 12]` bins.
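A minimal sketch of the binning approach; the clamped bounds and the `discretize` helper are illustrative rather than the project's exact wrapper:

```python
import numpy as np

# Illustrative CartPole bounds: the velocity terms are unbounded in the raw
# observation space, so they are clamped to keep the Q-table small.
CARTPOLE_BOUNDS = [(-4.8, 4.8), (-3.0, 3.0), (-0.418, 0.418), (-3.5, 3.5)]
CARTPOLE_BINS = [6, 6, 12, 12]


def discretize(obs, bounds, n_bins):
    """Map a continuous observation to a tuple of bin indices (a Q-table key)."""
    idx = []
    for value, (low, high), bins in zip(obs, bounds, n_bins):
        value = float(np.clip(value, low, high))
        edges = np.linspace(low, high, bins + 1)[1:-1]    # bins - 1 interior edges
        idx.append(int(np.digitize(value, edges)))        # index in [0, bins - 1]
    return tuple(idx)


# Example: one CartPole observation -> discrete state
state = discretize([0.1, 0.5, 0.02, -0.3], CARTPOLE_BOUNDS, CARTPOLE_BINS)
```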
The application implements a comprehensive suite of tabular algorithms, divided into Model-Based and Model-Free methods.
These require a known transition model of the environment, P(s', r | s, a).
- Value Iteration: Iteratively updates state values using the Bellman Optimality Equation until convergence.
- Policy Iteration: Alternates between Policy Evaluation (computing Vπ for the current policy) and Policy Improvement (acting greedily w.r.t. Vπ).
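A minimal Value Iteration sketch, assuming the environment exposes its transition table via `env.unwrapped.P` (as Gymnasium's toy-text environments such as Taxi-v3 and FrozenLake-v1 do); the loop structure is illustrative, not the project's exact code:

```python
import numpy as np


def value_iteration(env, gamma=0.99, theta=1e-6):
    """Sweep the Bellman Optimality update until the largest value change falls below theta."""
    P = env.unwrapped.P                          # P[s][a] -> list of (prob, next_state, reward, done)
    n_states, n_actions = env.observation_space.n, env.action_space.n

    def q_value(V, s, a):
        return sum(p * (r + gamma * V[s2] * (not done)) for p, s2, r, done in P[s][a])

    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q_value(V, s, a) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                        # convergence threshold θ
            break

    # Extract the greedy policy from the converged value function.
    policy = np.array([int(np.argmax([q_value(V, s, a) for a in range(n_actions)]))
                       for s in range(n_states)])
    return V, policy
```

Policy Iteration follows the same pattern but alternates full evaluation sweeps under a fixed policy with a greedy improvement step.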
These learn directly from experience (episodes) without a prior model.
- Q-Learning: Off-policy Temporal Difference (TD) control algorithm.
- SARSA: On-policy Temporal Difference control algorithm (the off-policy/on-policy difference is sketched in the update-rule example after this list).
- Monte Carlo Control (ε-Greedy): Estimates action values by averaging returns and uses ε-greedy exploration.
- First-Visit Monte Carlo: Updates state values based on the first time a state is visited in an episode.
- Every-Visit Monte Carlo: Updates state values based on every visit to a state within an episode.
- TD(0) Prediction: Updates state values based on the immediate next reward and estimated value of the next state.
- TD n-step: Updates values based on returns computed over n steps into the future.
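The off-policy/on-policy distinction referenced above comes down to which action value the update bootstraps on. The sketch below shows both updates in one illustrative loop (function and variable names are assumptions, not the project's code):

```python
import numpy as np


def epsilon_greedy(Q, s, epsilon, rng):
    """With probability ε take a random action, otherwise act greedily w.r.t. Q."""
    return int(rng.integers(Q.shape[1])) if rng.random() < epsilon else int(np.argmax(Q[s]))


def td_control(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1, sarsa=False, seed=0):
    """Tabular Q-Learning (sarsa=False) or SARSA (sarsa=True) on a discrete environment."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = epsilon_greedy(Q, s2, epsilon, rng)
            # Q-Learning bootstraps on the greedy next action; SARSA on the action actually taken next.
            target = r + gamma * (Q[s2, a2] if sarsa else Q[s2].max()) * (not terminated)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s2, a2
    return Q
```

Monte Carlo Control differs in that it waits until the episode ends and moves Q toward the full observed return rather than a bootstrapped target.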
| Algorithm | Taxi-v3 | FrozenLake (4x4 / 8x8) | GridWorld | CartPole (Discrete) | MountainCar (Discrete) |
|---|---|---|---|---|---|
| Value Iteration (DP) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Policy Iteration (DP) | ✅ | ✅ | ✅ | ❌ | ❌ |
| Q-Learning | ✅ | ✅ | ✅ | ✅ | ✅ |
| SARSA | ✅ | ✅ | ✅ | ✅ | ✅ |
| MC Control (ε-Greedy) | ✅ | ✅ | ✅ | ✅ | ✅ |
| First Visit MC | ✅ | ✅ | ✅ | ✅ | ✅ |
| Every Visit MC | ✅ | ✅ | ✅ | ✅ | ✅ |
| TD(0) Prediction | ✅ | ✅ | ✅ | ✅ | ✅ |
| TD n-step | ✅ | ✅ | ✅ | ✅ | ✅ |
The user interface allows for granular control over hyperparameters, enabling experimentation with convergence behavior and learning dynamics.
- General Parameters:
  - Discount Factor (γ): Adjustable from 0.1 to 0.99; determines the importance of future rewards (a short discounted-return example follows the parameter table below).
  - Training Episodes: Adjustable (e.g., between 100 and 50,000) for Model-Free algorithms.
- Algorithm-Specific Parameters:
  - Learning Rate (α): Controls the step size for TD updates (Q-Learning, SARSA, TD(0), TD n-step).
  - Epsilon (ε): Controls the trade-off between exploration and exploitation in ε-greedy policies.
  - Convergence Threshold (θ): Defines the stopping criterion for Dynamic Programming algorithms (default 1e-6).
  - Max Steps (Simulation): Sets a limit on episode length during the live simulation phase.
| Parameter | Symbol | Scope / Algorithm | Range / Default | Description |
|---|---|---|---|---|
| Discount Factor | γ | Global | 0.1 – 0.99 | Determines the importance of future rewards. Higher values (e.g. 0.99) encourage long-term planning; lower values focus on immediate rewards. |
| Training Episodes | N | Model-Free | 100 – 50,000 | Total number of episodes the agent interacts with the environment to learn a policy. |
| Learning Rate | α | TD / Q-Learning | 0.01 – 1.0 | Controls how much new information overrides old information. α = 0 means no learning; α = 1 uses only the latest information. |
| Epsilon | ε | ε-Greedy | 0.0 – 1.0 | Controls exploration vs. exploitation. Probability of taking a random action to explore new states. |
| Convergence Theta | θ | Model-Based (DP) | 1e-6 (default) | Stopping threshold for Value / Policy Iteration. Training stops when ΔV < θ. |
| Is Slippery | — | FrozenLake only | True / False | Toggles stochastic dynamics. If True, the agent may slip to unintended tiles. |
| Max Steps | — | Live Simulation | 20 – 300 | Safety limit for post-training simulation to prevent infinite loops. |
| Discretization Bins | — | CartPole | [6, 6, 12, 12] | Fixed bin sizes for Position, Velocity, Angle, and Angular Velocity to discretize the continuous state space. |
| Discretization Bins | — | MountainCar | [12, 12] | Fixed bin sizes for Position and Velocity to discretize the continuous state space. |
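To make the effect of γ concrete, the short example below compares discounted returns for a constant per-step cost of -1 (the step penalty used by Taxi-v3 and MountainCar-v0); the 200-step horizon is arbitrary:

```python
import numpy as np

# Discounted return of a constant -1 per-step cost over a 200-step episode.
rewards = np.full(200, -1.0)
for gamma in (0.5, 0.9, 0.99):
    discounted = sum(r * gamma ** t for t, r in enumerate(rewards))
    print(f"gamma={gamma}: discounted return ≈ {discounted:.1f}")
# gamma=0.5 ≈ -2.0, gamma=0.9 ≈ -10.0, gamma=0.99 ≈ -86.6:
# the higher the discount factor, the more distant costs and rewards matter.
```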
The framework uses Plotly and Matplotlib to generate high-fidelity, interactive visualizations of the learning process.
- Heatmaps (Grid Environments): Uses `plotly.figure_factory` to create annotated heatmaps.
  - Plots the intensity of the learned Value Function V(s).
  - Overlays directional arrows (↑ ↓ ← →) for the optimal Policy π(s).
  - For Taxi-v3, the high-dimensional state space is projected onto a 2D density heatmap.
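A sketch of how such a heatmap can be assembled with `plotly.figure_factory.create_annotated_heatmap`, assuming a 4x4 FrozenLake value table and a greedy policy array; the arrow mapping and layout choices are illustrative:

```python
import numpy as np
import plotly.figure_factory as ff


def value_policy_heatmap(V, policy, n_rows=4, n_cols=4):
    """Annotated heatmap of V(s) with the greedy action drawn as an arrow in each cell."""
    arrows = {0: "←", 1: "↓", 2: "→", 3: "↑"}             # FrozenLake action order: left/down/right/up
    z = np.asarray(V, dtype=float).reshape(n_rows, n_cols)
    text = [[arrows[int(policy[r * n_cols + c])] for c in range(n_cols)] for r in range(n_rows)]
    fig = ff.create_annotated_heatmap(
        z=np.flipud(z),                                    # put grid row 0 at the top of the figure
        annotation_text=np.flipud(np.array(text)).tolist(),
        colorscale="Viridis",
        showscale=True,
    )
    return fig
```

In a Streamlit app the figure would typically be displayed with `st.plotly_chart(fig)`.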
- 3D Surface Plots (Control Environments):
  - Used for CartPole and MountainCar.
  - Plots a 3D surface of the Value Function against two discretized state dimensions, e.g., Angle vs. Angular Velocity for CartPole and Position vs. Velocity for MountainCar.
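A sketch of the surface view, assuming a MountainCar Q-table shaped `(12, 12, n_actions)` and a max-over-actions reduction to V(s); axis labels and colorscale are assumptions:

```python
import numpy as np
import plotly.graph_objects as go


def value_surface(Q, x_label="Position bin", y_label="Velocity bin"):
    """3D surface of V(s) = max_a Q(s, a) over two discretized state dimensions."""
    V = np.asarray(Q).max(axis=-1)                        # (12, 12, n_actions) -> (12, 12)
    xs, ys = np.arange(V.shape[0]), np.arange(V.shape[1])
    fig = go.Figure(data=[go.Surface(z=V.T, x=xs, y=ys, colorscale="Viridis")])
    fig.update_layout(scene=dict(xaxis_title=x_label, yaxis_title=y_label, zaxis_title="V(s)"))
    return fig
```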
- Reward History: Uses Matplotlib to plot Total Reward per episode.
- Moving Average: Overlays a rolling mean (window size scales with the episode count) to smooth variance and reveal learning trends.
- Logarithmic Scaling: Provides a toggle for Symmetric Log Scale on the Y-axis for better handling of large negative rewards.
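A sketch of the reward-history chart with a rolling mean and an optional symmetric log scale; the window heuristic shown is illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_rewards(episode_rewards, use_symlog=False):
    """Total reward per episode plus a moving average whose window scales with episode count."""
    rewards = np.asarray(episode_rewards, dtype=float)
    window = max(1, len(rewards) // 50)                   # illustrative dynamic window size
    moving_avg = np.convolve(rewards, np.ones(window) / window, mode="valid")

    fig, ax = plt.subplots()
    ax.plot(rewards, alpha=0.3, label="Reward per episode")
    ax.plot(np.arange(window - 1, len(rewards)), moving_avg, label=f"Moving average (w={window})")
    if use_symlog:
        ax.set_yscale("symlog")                           # symmetric log handles large negative rewards
    ax.set_xlabel("Episode")
    ax.set_ylabel("Total reward")
    ax.legend()
    return fig
```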
- Step-by-Step Rendering: Replays the behavior of the trained agent in the environment using frames rendered as RGB arrays (`env.render()` with `render_mode="rgb_array"`).
- Data Logging: Displays a synchronized data table showing Step Count, State ID, Action Taken, Immediate Reward, and Cumulative Reward in real time.
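A sketch of the replay loop, assuming a Gymnasium environment created with `render_mode="rgb_array"`, a tabular greedy policy, and Streamlit placeholders for display; all names are illustrative:

```python
import gymnasium as gym
import pandas as pd
import streamlit as st


def replay(env_id, policy, max_steps=200):
    """Roll out the greedy policy, streaming each rendered frame and a synchronized step log."""
    env = gym.make(env_id, render_mode="rgb_array")
    obs, _ = env.reset()
    frame_slot = st.empty()
    log, cumulative = [], 0.0
    for step in range(max_steps):
        action = int(policy[obs])
        obs, reward, terminated, truncated, _ = env.step(action)
        cumulative += reward
        frame_slot.image(env.render())                    # RGB array frame
        log.append({"Step": step, "State": obs, "Action": action,
                    "Reward": reward, "Cumulative": cumulative})
        if terminated or truncated:
            break
    env.close()
    st.dataframe(pd.DataFrame(log))
```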
- Frontend/UI: Streamlit
- RL Core: Gymnasium, NumPy
- Visualization: Plotly Graph Objects, Plotly Figure Factory, Matplotlib, PIL (Python Imaging Library)