2 changes: 1 addition & 1 deletion README.md
@@ -35,7 +35,7 @@ This approach is practical for modeling interactions among competing market part
To support market design analysis in transforming electricity systems, we developed the ASSUME framework - a flexible and modular agent-based modeling tool for electricity market research.
ASSUME enables researchers to customize components such as agent representations, market configurations, and bidding strategies, utilizing pre-built modules for standard operations.
With the setup in ASSUME, researchers can simulate strategic interactions in electricity markets under a wide range of scenarios, from comparing market designs and modeling congestion management to analyzing the behavior of learning storage operators and renewable producers.
The framework supports studies on bidding under uncertainty, regulatory interventions, and multi-agent dynamics, making it ideal for exploring emergent behaviour and testing new market mechanisms.
The framework supports studies on bidding under uncertainty, regulatory interventions, and multi-agent dynamics, making it ideal for exploring emergent behavior and testing new market mechanisms.
ASSUME has been utilized in research studies addressing diverse questions in electricity market design and operation.
It has explored the role of complex bids, demonstrated the effects of industrial demand-side flexibility for congestion management, and advanced the explainability of emergent strategies in learning agents.

4 changes: 3 additions & 1 deletion assume/reinforcement_learning/learning_utils.py
@@ -212,7 +212,9 @@ def transform_buffer_data(nested_dict: dict, device: th.device) -> np.ndarray:
for values in unit_data.values():
if values:
val = values[0]
feature_dim = 1 if val.ndim == 0 else len(val)
feature_dim = (
1 if isinstance(val, (int | float)) or val.ndim == 0 else len(val)
)
break
if feature_dim is not None:
break
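
The added isinstance guard matters because plain Python numbers carry no ndim attribute, so checking val.ndim alone would raise an AttributeError for them, which appears to be what this change protects against. A minimal sketch of the distinction, using invented values rather than actual buffer contents:

import numpy as np

# Illustrative values only, not real buffer entries.
plain_float = 3.5                # plain float: no .ndim, caught by the isinstance check
zero_d = np.array(3.5)           # 0-dimensional array: .ndim == 0, so feature_dim = 1
vector = np.array([1.0, 2.0])    # 1-dimensional array: feature_dim = len(vector) = 2

for val in (plain_float, zero_d, vector):
    feature_dim = 1 if isinstance(val, int | float) or val.ndim == 0 else len(val)
    print(type(val).__name__, feature_dim)   # float 1, ndarray 1, ndarray 2
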
@@ -198,7 +198,7 @@ def __init__(
act_dim: int,
float_type,
unique_obs_dim: int,
num_timeseries_obs_dim: int = 3,
num_timeseries_obs_dim: int,
*args,
**kwargs,
):
14 changes: 9 additions & 5 deletions assume/scenario/loader_csv.py
@@ -757,7 +757,13 @@ def setup_world(

bidding_params = config.get("bidding_strategy_params", {})

# handle initial learning parameters before leanring_role exists
if config.get("learning_mode"):
raise ValueError(
"The 'learning_mode' parameter in the top-level of the config.yaml has been moved to 'learning_config'. "
"Please adjust your config file accordingly."
)

# handle initial learning parameters before learning_role exists
learning_dict = config.get("learning_config", {})
# those settings need to be overridden before passing to the LearningConfig
if learning_dict:
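
For orientation, a minimal sketch of the relocation the ValueError above asks for, written as the dictionaries the loader sees after parsing config.yaml; only the two keys named in the message are shown, and the flag value is an assumption:

# Sketch only: minimal parsed config structures (assumed values).
old_config = {
    "learning_mode": True,   # top-level key, now rejected with the ValueError above
    "learning_config": {},
}

new_config = {
    "learning_config": {
        "learning_mode": True,   # moved inside learning_config
    },
}
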
@@ -1030,15 +1036,13 @@ def run_learning(

Args:
world (World): An instance of the World class representing the simulation environment.
inputs_path (str): The path to the folder containing input files necessary for the simulation.
scenario (str): The name of the scenario for the simulation.
study_case (str): The specific study case for the simulation.
verbose (bool, optional): A flag indicating whether to enable verbose logging. Defaults to False.

Note:
- The function uses a ReplayBuffer to store experiences for training the DRL agents.
- It iterates through training episodes, updating the agents and evaluating their performance at regular intervals.
- Initial exploration is active at the beginning and is disabled after a certain number of episodes to improve the performance of DRL algorithms.
- Upon completion of training, the function performs an evaluation run using the best policy learned during training.
- Upon completion of training, the function performs an evaluation run using the last policy learned during training.
- The best policies are chosen based on the average reward obtained during the evaluation runs, and they are saved for future use.
"""
from assume.reinforcement_learning.buffer import ReplayBuffer
2 changes: 1 addition & 1 deletion assume/world.py
@@ -202,7 +202,7 @@ def setup(
simulation_id (str): The unique identifier for the simulation.
save_frequency_hours (int): The frequency (in hours) at which to save simulation data.
bidding_params (dict, optional): Parameters for bidding. Defaults to an empty dictionary.
learning_config (dict | None, optional): Configuration for the learning process. Defaults to None.
learning_dict (dict, optional): Configuration for the learning process. Defaults to an empty dictionary.
manager_address: The address of the manager.
**kwargs: Additional keyword arguments.

6 changes: 2 additions & 4 deletions docs/source/index.rst
@@ -11,7 +11,7 @@ its primary objectives are to ensure usability and customizability for a wide ra
users and use cases in the energy system modeling community.

The unique feature of the ASSUME tool-box is the integration of **Deep Reinforcement
Learning** methods into the behavioural strategies of market agents.
Learning** methods into the behavioral strategies of market agents.
The model offers various predefined agent representations for both the demand and
generation sides, which can be used as plug-and-play modules, simplifying the
reinforcement of learning strategies. This setup enables research into new market
@@ -70,7 +70,6 @@ Documentation
examples_basic
example_simulations

**User Guide**

User Guide
==========
@@ -115,8 +114,7 @@ User Guide
assume


Indices and tables
==================
**Indices & Tables**

* :ref:`genindex`
* :ref:`modindex`
4 changes: 2 additions & 2 deletions docs/source/introduction.rst
@@ -21,7 +21,7 @@ Architecture
In the following figure the architecture of the framework is depicted. It can be roughly divided into two parts.
On the left side of the world class the markets are located and on the right side the market participants,
which are here named units. Both world are connected via the orders that market participants place on the markets.
The learning capacbility is sketched out with the yellow classes on the right side, namely the units side.
The learning capability is sketched out with the yellow classes on the right side, namely the units side.

.. image:: img/architecture.svg
:align: center
@@ -79,7 +79,7 @@ Market Participants
===================

The market participants, here labeled units, comprise all entities acting in the respective markets and are at
the core of any agent-based simulation model. The entirety of their behaviour leads to the market and system
the core of any agent-based simulation model. The entirety of their behavior leads to the market and system
outcome as a bottom-up simulation model, respectively.

Modularity of Units
59 changes: 53 additions & 6 deletions docs/source/learning.rst
@@ -24,7 +24,7 @@ The Basics of Reinforcement Learning
In general, RL and deep reinforcement learning (DRL) in particular, open new prospects for agent-based electricity market modeling.
Such algorithms offer the potential for agents to learn bidding strategies in the interplay between market participants.
In contrast to traditional rule-based approaches, DRL allows for a faster adaptation of the bidding strategies to a changing market
environment, which is impossible with fixed strategies that a market modeller explicitly formulates. Hence, DRL algorithms offer the
environment, which is impossible with fixed strategies that a market modeler explicitly formulates. Hence, DRL algorithms offer the
potential for simulated electricity market agents to develop bidding strategies for future markets and test emerging markets' mechanisms
before their introduction into real-world systems.

@@ -139,7 +139,7 @@ The Actor

We will explain the way learning works in ASSUME starting from the interface to the simulation, namely the bidding strategy of the power plants.
The bidding strategy, per definition in ASSUME, defines the way we formulate bids based on the technical restrictions of the unit.
In a learning setting, this is done by the actor network. Which maps the observation to an action. The observation thereby is managed and collected by the units operator as
In a learning setting, this is done by the actor network which maps the observation to an action. The observation thereby is managed and collected by the units operator as
summarized in the following picture. As you can see in the current working version, the observation space contains a residual load forecast for the next 24 hours and a price
forecast for 24 hours, as well as the current capacity of the power plant and its marginal costs.

@@ -148,15 +148,15 @@ forecast for 24 hours, as well as the current capacity of the power plant and it
:width: 500px

The action space is a continuous space, which means that the actor can choose any price between 0 and the maximum bid price defined in the code. It gives two prices for two different parts of its capacity.
One, namley :math:`p_{inflex}` for the minimum capacity of the power plant and one for the rest ( :math:`p_{flex}`). The action space is defined in the config file and can be adjusted to your needs.
One, namely :math:`p_{inflex}` for the minimum capacity of the power plant and one for the rest ( :math:`p_{flex}`). The action space is defined in the config file and can be adjusted to your needs.
After the bids are formulated in the bidding strategy they are sent to the market via the units operator.

.. image:: img/ActorOutput.jpg
:align: center
:width: 500px
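
To make the action space more tangible, here is a small sketch of how two actor outputs could be scaled to the two prices described above; the function name, the linear scaling, and the bid format are illustrative assumptions, not the ASSUME implementation.

.. code-block:: python

   def actions_to_bids(actions, max_bid_price, p_min, p_max):
       """Map two actor outputs in [0, 1] to price-volume pairs for the
       inflexible (minimum) capacity and the remaining flexible capacity."""
       price_inflex = actions[0] * max_bid_price   # p_inflex for the minimum capacity
       price_flex = actions[1] * max_bid_price     # p_flex for the rest
       return [
           {"price": price_inflex, "volume": p_min},
           {"price": price_flex, "volume": p_max - p_min},
       ]

   # Invented example: 300 MW plant, 100 MW minimum capacity, 100 EUR/MWh bid cap.
   print(actions_to_bids([0.2, 0.5], max_bid_price=100, p_min=100, p_max=300))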

In the case you are eager to integrate different learning bidding strategies or equip a new unit with learning,
you need to touch these methods. To enable an easy start with the use of reinforcement learning in ASSUME we provide a tutorial in colab on github.
you need to touch these methods. To enable an easy start with the use of reinforcement learning in ASSUME we provide a tutorial in colab on GitHub.

The Critic
----------
Expand All @@ -175,8 +175,14 @@ You can read more about the different algorithms and the learning role in :doc:`
The Learning Results in ASSUME
=====================================

Similarly to the other results, the learning progress is tracked in the database, either with postgresql or timescale. The latter enables the usage of the
predefined dashboards to track the leanring process in the "Assume:Training Process" dashboard. The following pictures show the learning process of a simple reinforcement learning setting.
Learning results are not easy to understand and judge. ASSUME supports different visualizations to track the learning progress.
Furthermore, we want to raise awareness of common pitfalls in interpreting learning results.

Visualizations
--------------

Similarly to the other results, the learning progress is tracked in the database, either with PostgreSQL or TimescaleDB. The latter enables the usage of the
predefined dashboards to track the learning process in the "ASSUME:Training Process" dashboard. The following pictures show the learning process of a simple reinforcement learning setting.
A more detailed description is given in the dashboard itself.

.. image:: img/Grafana_Learning_1.jpeg
@@ -207,3 +213,44 @@ After starting the server, open the following URL in your browser:

TensorBoard will then display dashboards for scalars, histograms, graphs, projectors, and other relevant visualizations, depending on the metrics that
the training pipeline currently exports.

Interpretation
--------------

Once the environment and learning algorithm are specified, agents are trained and behaviors begin to emerge. The modeler (you) analyzes the reward in the
visualizations described above. This raises a basic modeling question:

*How can we judge whether what has been learned is meaningful?*

Unlike supervised learning, we do not have a ground-truth target or an error metric that reliably decreases as behavior improves. In multi-agent settings,
the notion of an “optimal” solution is often unclear. What we *do* observe are rewards – signals chosen by the modeler. How informative these signals are
depends heavily on the reward design and on how other agents behave. Therefore:

**Do not rely on rewards alone.** Behavior itself must be examined carefully.

**Why solely reward-based evaluation is problematic**

Let :math:`R_i` denote the episodic return of agent :math:`i` under the joint policy :math:`\pi=(\pi_1,\dots,\pi_n)`. A common but potentially misleading
heuristic is to evaluate behavior by the total reward,

.. math::

   S(\pi) = \sum_{i=1}^n \mathbb{E}[R_i].

A larger :math:`S(\pi)` does *not* imply that the learned behavior is better or more stable. In a multi-agent environment, each agent’s learning alters the
effective environment faced by the others. The same policy can therefore earn very different returns depending on which opponent snapshot it encounters. High
aggregate rewards can arise from:

* temporary exploitation of weaknesses of other agents,
* coordination effects that occur by chance rather than by design,
* behavior that works against training opponents but fails in other situations.

Rewards are thus, at best, an indirect proxy for “good behavior.” They measure how well a policy performs *under the specific reward function and opponent
behavior*, not whether it is robust, interpretable, or aligned with the modeler’s intent.
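
As a stylized numerical illustration (the numbers are invented for exposition): suppose that under joint policy :math:`\pi^A` agent 1 earns a return of 10 and agent 2 earns 0, while under :math:`\pi^B` both agents earn 4, so

.. math::

   S(\pi^A) = 10 + 0 = 10 > 8 = 4 + 4 = S(\pi^B).

Policy :math:`\pi^A` wins on aggregate reward, but if agent 1's return stems from exploiting a weak training snapshot of agent 2, the advantage can disappear as soon as agent 2 improves, while :math:`\pi^B` may remain stable.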

**Implications for policy selection**

This issue becomes visible when deciding which policy to evaluate at the end of training. We generally store (i) the policy with the highest average reward and
(ii) the final policy. However, these two can differ substantially in their behavior. The framework therefore uses the **final policy** for evaluation to
avoid selecting a high-reward snapshot that may be far from stable.

The most robust learning performance can be achieved through **early stopping** with a very large number of episodes. In that case, training halts once results
are stable, and the final policy is likely also the stable one. This behavior should be monitored by the modeler in TensorBoard.
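
As one possible aid, a minimal sketch of such a stability check on the history of evaluation rewards; the function name, window size, and tolerance are illustrative and not part of the ASSUME API.

.. code-block:: python

   from statistics import mean

   def rewards_are_stable(eval_rewards, window=10, tolerance=0.02):
       """Return True once the moving average of evaluation rewards changes
       by at most `tolerance` (relative) between the last two windows."""
       if len(eval_rewards) < 2 * window:
           return False
       previous = mean(eval_rewards[-2 * window:-window])
       current = mean(eval_rewards[-window:])
       return abs(current - previous) / max(abs(previous), 1e-8) <= tolerance

   # Invented reward history that has flattened out; prints True.
   history = [2.5, 2.7, 2.8, 2.79, 2.81, 2.8, 2.8, 2.82, 2.8, 2.79,
              2.8, 2.81, 2.8, 2.79, 2.81, 2.8, 2.8, 2.81, 2.8, 2.8]
   print(rewards_are_stable(history))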