Merged
4 changes: 3 additions & 1 deletion assume/reinforcement_learning/learning_utils.py
@@ -212,7 +212,9 @@ def transform_buffer_data(nested_dict: dict, device: th.device) -> np.ndarray:
for values in unit_data.values():
if values:
val = values[0]
feature_dim = 1 if val.ndim == 0 else len(val)
feature_dim = (
1 if isinstance(val, (int | float)) or val.ndim == 0 else len(val)
)
break
if feature_dim is not None:
break
@@ -198,7 +198,7 @@ def __init__(
act_dim: int,
float_type,
unique_obs_dim: int,
num_timeseries_obs_dim: int = 3,
num_timeseries_obs_dim: int,
*args,
**kwargs,
):
14 changes: 9 additions & 5 deletions assume/scenario/loader_csv.py
@@ -757,7 +757,13 @@ def setup_world(

bidding_params = config.get("bidding_strategy_params", {})

# handle initial learning parameters before leanring_role exists
if config.get("learning_mode"):
raise ValueError(
"The 'learning_mode' parameter in the top-level of the config.yaml has been moved to 'learning_config'. "
"Please adjust your config file accordingly."
)

# handle initial learning parameters before learning_role exists
learning_dict = config.get("learning_config", {})
# those settings need to be overridden before passing to the LearningConfig
if learning_dict:
@@ -1030,15 +1036,13 @@ def run_learning(

Args:
world (World): An instance of the World class representing the simulation environment.
inputs_path (str): The path to the folder containing input files necessary for the simulation.
scenario (str): The name of the scenario for the simulation.
study_case (str): The specific study case for the simulation.
verbose (bool, optional): A flag indicating whether to enable verbose logging. Defaults to False.

Note:
- The function uses a ReplayBuffer to store experiences for training the DRL agents.
- It iterates through training episodes, updating the agents and evaluating their performance at regular intervals.
- Initial exploration is active at the beginning and is disabled after a certain number of episodes to improve the performance of DRL algorithms.
- Upon completion of training, the function performs an evaluation run using the best policy learned during training.
- Upon completion of training, the function performs an evaluation run using the last policy learned during training.
- The best policies are chosen based on the average reward obtained during the evaluation runs, and they are saved for future use.
"""
from assume.reinforcement_learning.buffer import ReplayBuffer
49 changes: 49 additions & 0 deletions docs/source/learning.rst
@@ -175,6 +175,12 @@ You can read more about the different algorithms and the learning role in :doc:`
The Learning Results in ASSUME
=====================================

Learning results are not easy to understand and judge. ASSUME supports different visualisations to track the learning progress.
Furthermore, we want to raise awareness of common pitfalls in the interpretation of learning results.

Visualisations
--------------

Similarly to the other results, the learning progress is tracked in the database, either with PostgreSQL or TimescaleDB. The latter enables the use of the
predefined dashboards to track the learning process in the "Assume:Training Process" dashboard. The following pictures show the learning process of a simple reinforcement learning setting.
A more detailed description is given in the dashboard itself.
@@ -207,3 +213,46 @@ After starting the server, open the following URL in your browser:

TensorBoard will then display dashboards for scalars, histograms, graphs, projectors, and other relevant visualizations, depending on the metrics that
the training pipeline currently exports.
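
As a rough illustration of how such scalars reach TensorBoard, the sketch below logs a per-episode reward with PyTorch's ``SummaryWriter``.
The log directory and tag names are made up for this example and are not the tags ASSUME actually exports.

.. code-block:: python

    from torch.utils.tensorboard import SummaryWriter

    # hypothetical log directory and tag, for illustration only
    writer = SummaryWriter(log_dir="tensorboard/example_run")

    for episode, episode_reward in enumerate([10.5, 12.1, 11.8]):
        # each call adds one point to the scalars dashboard
        writer.add_scalar("reward/episode_reward", episode_reward, episode)

    writer.close()

Pointing ``tensorboard --logdir`` at the same directory then makes the logged curve appear under the scalars tab.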

Interpretation
--------------

Once the environment and learning algorithm are specified, agents are trained and behaviours begin to emerge. The modeller (you) analyses the reward in the
visualisations described above. This raises a basic modelling question:

*How can we judge whether what has been learned is meaningful?*

Unlike supervised learning, we do not have a ground-truth target or an error metric that reliably decreases as behaviour improves. In multi-agent settings,
the notion of an “optimal” solution is often unclear. What we *do* observe are rewards – signals chosen by the modeller. How informative these signals are
depends heavily on the reward design and on how other agents behave. Therefore:

**Do not rely on rewards alone.** Behaviour itself must be examined carefully.
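
For example, one simple behavioural check is to compare a unit's submitted bid prices with its marginal cost over time. The file name, unit id, and
column names below are assumptions made for this sketch; the exact output format depends on how your simulation results are exported.

.. code-block:: python

    import pandas as pd

    # hypothetical results export -- adjust the path and column names to your setup
    orders = pd.read_csv("outputs/example_case/market_orders.csv")

    unit_orders = orders[orders["unit_id"] == "pp_gas_1"]
    markup = unit_orders["price"] - unit_orders["marginal_cost"]

    # a persistently implausible markup is worth investigating,
    # even if the reward curve looks good
    print(markup.describe())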

**Why solely reward-based evaluation is problematic**

Let :math:`R_i` denote the episodic return of agent :math:`i` under the joint policy :math:`\pi=(\pi_1,\dots,\pi_n)`. A common but potentially misleading
heuristic is to evaluate behaviour by the total reward,

.. math::

S(\pi) = \sum_{i=1}^n \mathbb{E}[R_i].

A larger :math:`S(\pi)` does *not* imply that the learned behaviour is better or more stable. In a multi-agent environment, each agent’s learning alters the
effective environment faced by the others. The same policy can therefore earn very different returns depending on which opponent snapshot it encounters. High
aggregate rewards can arise from:

* temporary exploitation of weaknesses of other agents,
* coordination effects that occur by chance rather than by design,
* behaviour that works against training opponents but fails in other situations.

Rewards are thus, at best, an indirect proxy for “good behaviour.” They measure how well a policy performs *under the specific reward function and opponent
behaviour*, not whether it is robust, interpretable, or aligned with the modeller’s intent.
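
A tiny numerical sketch with made-up returns illustrates the point: the aggregate :math:`S(\pi)` prefers the second joint policy even though one agent is
clearly worse off.

.. code-block:: python

    # made-up expected returns for two agents under two joint policies
    returns_balanced = {"agent_1": 50.0, "agent_2": 48.0}
    returns_exploit = {"agent_1": 90.0, "agent_2": 20.0}  # agent_1 exploits agent_2

    s_balanced = sum(returns_balanced.values())  # 98.0
    s_exploit = sum(returns_exploit.values())    # 110.0

    # the aggregate ranks the exploitative outcome higher,
    # although agent_2's performance collapsed
    print(s_balanced, s_exploit)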

**Implications for policy selection**

This issue becomes visible when deciding which policy to evaluate at the end of training. We generally store (i) the policy with the highest average reward and
(ii) the final policy. However, these two can differ substantially in their behaviour. The framework therefore uses the **final policy** for evaluation to
avoid selecting a high-reward snapshot that may be far from stable.

The most robust learning performance can be achieved through **early stopping** with a very large number of episodes. In that case, training halts once results
are stable, and the final policy is likely also the stable one. This behaviour should be monitored by the modeller in TensorBoard.
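
A rough sketch of the kind of check early stopping implies, based on the ``early_stopping_steps`` and ``early_stopping_threshold`` settings; this is an
illustrative rule, not the framework's actual implementation.

.. code-block:: python

    import numpy as np

    def should_stop(validation_rewards: list[float], steps: int = 10, threshold: float = 0.05) -> bool:
        """Illustrative rule: stop once the moving average of the validation
        rewards improves by less than `threshold` compared to the previous window."""
        if len(validation_rewards) < 2 * steps:
            return False
        recent = np.mean(validation_rewards[-steps:])
        previous = np.mean(validation_rewards[-2 * steps : -steps])
        return bool(recent - previous < threshold)
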
61 changes: 35 additions & 26 deletions docs/source/learning_algorithm.rst
@@ -6,10 +6,10 @@
Reinforcement Learning Algorithms
##################################

In the chapter :doc:`learning` we got a general overview of how RL is implemented for a multi-agent setting in Assume.
In the chapter :doc:`learning` we got a general overview of how RL is implemented for a multi-agent setting in ASSUME.
If you want to apply these RL algorithms to a new problem, you do not necessarily need to understand how the RL algorithms work in detail.
All that is needed is to adapt the bidding strategies, which is covered in the tutorials.
However, for the interested reader, we will give a brief overview of the RL algorithms used in Assume.
However, for the interested reader, we will give a brief overview of the RL algorithms used in ASSUME.
We start with the learning role, which is the core of the learning implementation.

The Learning Role
@@ -29,28 +29,37 @@ The following table shows the options that can be adjusted and gives a short exp
======================================== ==========================================================================================================
learning config item description
======================================== ==========================================================================================================
continue_learning Whether to use pre-learned strategies and then continue learning.
trained_policies_save_path Where to store the newly trained rl strategies - only needed when learning_mode is set
trained_policies_load_path If pre-learned strategies should be used, where are they stored? - only needed when continue_learning
max_bid_price The maximum bid price which limits the action of the actor to this price.
learning_mode Should we use learning mode at all? If not, the learning bidding strategy is overwritten with a default strategy.
algorithm Specifies which algorithm to use. Currently, only MATD3 is implemented.
actor_architecture The architecture of the neural networks used in the algorithm for the actors. The architecture is a list of names specifying the "policy" used e.g. multi layer perceptron (mlp).
learning_rate The learning rate, also known as step size, which specifies how much the new policy should be considered in the update.
learning_rate_schedule Which learning rate decay to use. Defaults to None. Currently only "linear" decay available.
training_episodes The number of training episodes, whereby one episode is the entire simulation horizon specified in the general config.
episodes_collecting_initial_experience The number of episodes collecting initial experience, whereby this means that random actions are chosen instead of using the actor network
train_freq Defines the frequency in time steps at which the actor and critic are updated.
gradient_steps The number of gradient steps.
batch_size The batch size of experience considered from the buffer for an update.
gamma The discount factor, with which future expected rewards are considered in the decision-making.
device The device to use.
noise_sigma The standard deviation of the distribution used to draw the noise, which is added to the actions and forces exploration.
noise_dt Determines how quickly the noise weakens over time / used for noise scheduling.
noise_scale The scale of the noise, which is multiplied by the noise drawn from the distribution.
action_noise_schedule Which action noise decay to use. Defaults to None. Currently only "linear" decay available.
early_stopping_steps The number of steps considered for early stopping. If the moving average reward does not improve over this number of steps, the learning is stopped.
early_stopping_threshold The value by which the average reward needs to improve to avoid early stopping.
learning_mode Should we use learning mode at all? If False, the learning bidding strategy is loaded from trained_policies_load_path and no training occurs. Default is False.
evaluation_mode This setting is modified internally. Whether to run in evaluation mode. If True, the agent uses the learned policy without exploration noise and no training updates occur. Default is False.
continue_learning Whether to use pre-learned strategies and then continue learning. If True, loads existing policies from trained_policies_load_path and continues training. Default is False.
trained_policies_save_path The directory path - relative to the scenario's inputs_path - where newly trained RL policies (actor and critic networks) will be saved. Only needed when learning_mode is True. Value is set in setup_world(). Defaults otherwise to None.
trained_policies_load_path The directory path - relative to the scenario's inputs_path - from which pre-trained policies should be loaded. Needed when continue_learning is True or using pre-trained strategies. Default is None.
min_bid_price The minimum bid price which limits the action of the actor to this price. Used to constrain the actor's output to a realistic price range. Default is -100.0.
max_bid_price The maximum bid price which limits the action of the actor to this price. Used to constrain the actor's output to a realistic price range. Default is 100.0.
device The device to use for PyTorch computations. Options include "cpu", "cuda", or specific CUDA devices like "cuda:0". Default is "cpu".
episodes_collecting_initial_experience The number of episodes at the start during which random actions are chosen instead of using the actor network. This helps populate the replay buffer with diverse experiences. Default is 5.
exploration_noise_std The standard deviation of Gaussian noise added to actions during exploration in the environment. Higher values encourage more exploration. Default is 0.2.
training_episodes The number of training episodes, where one episode is the entire simulation horizon specified in the general config. Default is 100.
validation_episodes_interval The interval (in episodes) at which validation episodes are run to evaluate the current policy's performance without training updates. Default is 5.
train_freq Defines the frequency in time steps at which the actor and critic networks are updated. Accepts time strings like "24h" for 24 hours or "1d" for 1 day. Default is "24h".
batch_size The batch size of experiences sampled from the replay buffer for each training update. Larger batches provide more stable gradients but require more memory. In environments with many learning agents we advise small batch sizes. Default is 128.
gradient_steps The number of gradient descent steps performed during each training update. More steps can lead to better learning but increase computation time. Default is 100.
learning_rate The learning rate (step size) for the optimizer, which controls how much the policy and value networks are updated during training. Default is 0.001.
learning_rate_schedule Which learning rate decay schedule to use. Currently only "linear" decay is available, which linearly decreases the learning rate over time. Default is None (constant learning rate).
early_stopping_steps The number of validation steps over which the moving average reward is calculated for early stopping. If the reward doesn't change by early_stopping_threshold over this many steps, training stops. If None, defaults to training_episodes / validation_episodes_interval + 1.
early_stopping_threshold The minimum improvement in moving average reward required to avoid early stopping. If the reward improvement is less than this threshold over early_stopping_steps, training is terminated early. Default is 0.05.
algorithm Specifies which reinforcement learning algorithm to use. Currently, only "matd3" (Multi-Agent Twin Delayed Deep Deterministic Policy Gradient) is implemented. Default is "matd3".
replay_buffer_size The maximum number of transitions stored in the replay buffer for experience replay. Larger buffers allow for more diverse training samples. Default is 500000.
gamma The discount factor for future rewards, ranging from 0 to 1. Higher values give more weight to long-term rewards in decision-making. Default is 0.99.
actor_architecture The architecture of the neural networks used for the actors. Options include "mlp" (Multi-Layer Perceptron) and "lstm" (Long Short-Term Memory). Default is "mlp".
policy_delay The frequency (in gradient steps) at which the actor policy is updated. TD3 updates the critic more frequently than the actor to stabilize training. Default is 2.
noise_sigma The standard deviation of the Ornstein-Uhlenbeck or Gaussian noise distribution used to generate exploration noise added to actions. Default is 0.1.
noise_scale The scale factor multiplied by the noise drawn from the distribution. Larger values increase exploration. Default is 1.
noise_dt The time step parameter for the Ornstein-Uhlenbeck process, which determines how quickly the noise decays over time. Used for noise scheduling. Default is 1.
action_noise_schedule Which action noise decay schedule to use. Currently only "linear" decay is available, which linearly decreases exploration noise over training. Default is "linear".
tau The soft update coefficient for updating target networks. Controls how slowly target networks track the main networks. Smaller values mean slower updates. Default is 0.005.
target_policy_noise The standard deviation of noise added to target policy actions during critic updates. This smoothing helps prevent overfitting to narrow policy peaks. Default is 0.2.
target_noise_clip The maximum absolute value for clipping the target policy noise. Prevents the noise from being too large. Default is 0.5.
======================================== ==========================================================================================================
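
As a minimal example of how these options fit together, a ``learning_config`` section could look as follows once loaded from the config.yaml into Python;
the values below are mostly the documented defaults and are illustrative rather than tuned recommendations.

.. code-block:: python

    # example values only, mirroring a learning_config section of config.yaml
    learning_config = {
        "learning_mode": True,
        "algorithm": "matd3",
        "actor_architecture": "mlp",
        "training_episodes": 100,
        "validation_episodes_interval": 5,
        "episodes_collecting_initial_experience": 5,
        "train_freq": "24h",
        "batch_size": 128,
        "gradient_steps": 100,
        "learning_rate": 0.001,
        "gamma": 0.99,
        "max_bid_price": 100.0,
        "device": "cpu",
    }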

How to use continue learning
@@ -147,9 +156,9 @@ Overall, the replay buffer is instrumental in stabilizing the learning process i
enhancing their robustness and performance by providing a diverse and non-correlated set of training samples.


How are they used in Assume?
How are they used in ASSUME?
============================
In principal Assume allows for different buffers to be implemented. They just need to adhere to the structure presented in the base buffer. Here we will present the different buffers already implemented, which is only one, yet.
In principle, ASSUME allows for different buffers to be implemented. They just need to adhere to the structure presented in the base buffer. Here we will present the different buffers already implemented, which is currently only one.
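
To make the idea concrete, the sketch below shows the rough shape such a buffer needs to have; the class and method names are hypothetical and simplified,
not ASSUME's actual base buffer API.

.. code-block:: python

    import numpy as np

    class MinimalReplayBuffer:
        """Hypothetical, simplified buffer: store transitions and sample random batches."""

        def __init__(self, size: int, obs_dim: int, act_dim: int, n_agents: int):
            self.size = size
            self.pos = 0
            self.full = False
            self.observations = np.zeros((size, n_agents, obs_dim), dtype=np.float32)
            self.actions = np.zeros((size, n_agents, act_dim), dtype=np.float32)
            self.rewards = np.zeros((size, n_agents), dtype=np.float32)

        def add(self, obs, actions, rewards):
            # overwrite the oldest entries once the buffer is full
            self.observations[self.pos] = obs
            self.actions[self.pos] = actions
            self.rewards[self.pos] = rewards
            self.pos = (self.pos + 1) % self.size
            self.full = self.full or self.pos == 0

        def sample(self, batch_size: int):
            upper = self.size if self.full else self.pos
            idx = np.random.randint(0, upper, size=batch_size)
            return self.observations[idx], self.actions[idx], self.rewards[idx]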


The simple replay buffer