diff --git a/README.md b/README.md index 165571145..650f581cd 100644 --- a/README.md +++ b/README.md @@ -35,7 +35,7 @@ This approach is practical for modeling interactions among competing market part To support market design analysis in transforming electricity systems, we developed the ASSUME framework - a flexible and modular agent-based modeling tool for electricity market research. ASSUME enables researchers to customize components such as agent representations, market configurations, and bidding strategies, utilizing pre-built modules for standard operations. With the setup in ASSUME, researchers can simulate strategic interactions in electricity markets under a wide range of scenarios, from comparing market designs and modeling congestion management to analyzing the behavior of learning storage operators and renewable producers. -The framework supports studies on bidding under uncertainty, regulatory interventions, and multi-agent dynamics, making it ideal for exploring emergent behaviour and testing new market mechanisms. +The framework supports studies on bidding under uncertainty, regulatory interventions, and multi-agent dynamics, making it ideal for exploring emergent behavior and testing new market mechanisms. ASSUME has been utilized in research studies addressing diverse questions in electricity market design and operation. It has explored the role of complex bids, demonstrated the effects of industrial demand-side flexibility for congestion management, and advanced the explainability of emergent strategies in learning agents. diff --git a/assume/reinforcement_learning/learning_utils.py b/assume/reinforcement_learning/learning_utils.py index 01952e91a..3ec682a1c 100644 --- a/assume/reinforcement_learning/learning_utils.py +++ b/assume/reinforcement_learning/learning_utils.py @@ -212,7 +212,9 @@ def transform_buffer_data(nested_dict: dict, device: th.device) -> np.ndarray: for values in unit_data.values(): if values: val = values[0] - feature_dim = 1 if val.ndim == 0 else len(val) + feature_dim = ( + 1 if isinstance(val, (int | float)) or val.ndim == 0 else len(val) + ) break if feature_dim is not None: break diff --git a/assume/reinforcement_learning/neural_network_architecture.py b/assume/reinforcement_learning/neural_network_architecture.py index 44969a7be..a173b4b5c 100644 --- a/assume/reinforcement_learning/neural_network_architecture.py +++ b/assume/reinforcement_learning/neural_network_architecture.py @@ -198,7 +198,7 @@ def __init__( act_dim: int, float_type, unique_obs_dim: int, - num_timeseries_obs_dim: int = 3, + num_timeseries_obs_dim: int, *args, **kwargs, ): diff --git a/assume/scenario/loader_csv.py b/assume/scenario/loader_csv.py index 810e964b1..c58f86fb9 100644 --- a/assume/scenario/loader_csv.py +++ b/assume/scenario/loader_csv.py @@ -757,7 +757,13 @@ def setup_world( bidding_params = config.get("bidding_strategy_params", {}) - # handle initial learning parameters before leanring_role exists + if config.get("learning_mode"): + raise ValueError( + "The 'learning_mode' parameter in the top-level of the config.yaml has been moved to 'learning_config'. " + "Please adjust your config file accordingly." + ) + + # handle initial learning parameters before learning_role exists learning_dict = config.get("learning_config", {}) # those settings need to be overridden before passing to the LearningConfig if learning_dict: @@ -1030,15 +1036,13 @@ def run_learning( Args: world (World): An instance of the World class representing the simulation environment. 
- inputs_path (str): The path to the folder containing input files necessary for the simulation. - scenario (str): The name of the scenario for the simulation. - study_case (str): The specific study case for the simulation. + verbose (bool, optional): A flag indicating whether to enable verbose logging. Defaults to False. Note: - The function uses a ReplayBuffer to store experiences for training the DRL agents. - It iterates through training episodes, updating the agents and evaluating their performance at regular intervals. - Initial exploration is active at the beginning and is disabled after a certain number of episodes to improve the performance of DRL algorithms. - - Upon completion of training, the function performs an evaluation run using the best policy learned during training. + - Upon completion of training, the function performs an evaluation run using the last policy learned during training. - The best policies are chosen based on the average reward obtained during the evaluation runs, and they are saved for future use. """ from assume.reinforcement_learning.buffer import ReplayBuffer diff --git a/assume/world.py b/assume/world.py index 3b987d428..b21ed83a3 100644 --- a/assume/world.py +++ b/assume/world.py @@ -202,7 +202,7 @@ def setup( simulation_id (str): The unique identifier for the simulation. save_frequency_hours (int): The frequency (in hours) at which to save simulation data. bidding_params (dict, optional): Parameters for bidding. Defaults to an empty dictionary. - learning_config (dict | None, optional): Configuration for the learning process. Defaults to None. + learning_dict (dict, optional): Configuration for the learning process. Defaults to an empty dictionary. manager_address: The address of the manager. **kwargs: Additional keyword arguments. diff --git a/docs/source/index.rst b/docs/source/index.rst index 892f4dda5..e15fe0c3f 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -11,7 +11,7 @@ its primary objectives are to ensure usability and customizability for a wide ra users and use cases in the energy system modeling community. The unique feature of the ASSUME tool-box is the integration of **Deep Reinforcement -Learning** methods into the behavioural strategies of market agents. +Learning** methods into the behavioral strategies of market agents. The model offers various predefined agent representations for both the demand and generation sides, which can be used as plug-and-play modules, simplifying the reinforcement of learning strategies. This setup enables research into new market @@ -70,7 +70,6 @@ Documentation examples_basic example_simulations -**User Guide** User Guide ========== @@ -115,8 +114,7 @@ User Guide assume -Indices and tables -================== +**Indices & Tables** * :ref:`genindex` * :ref:`modindex` diff --git a/docs/source/introduction.rst b/docs/source/introduction.rst index 89ee141da..c143553ec 100644 --- a/docs/source/introduction.rst +++ b/docs/source/introduction.rst @@ -21,7 +21,7 @@ Architecture In the following figure the architecture of the framework is depicted. It can be roughly divided into two parts. On the left side of the world class the markets are located and on the right side the market participants, which are here named units. Both world are connected via the orders that market participants place on the markets. -The learning capacbility is sketched out with the yellow classes on the right side, namely the units side. 
+The learning capability is sketched out with the yellow classes on the right side, namely the units side. .. image:: img/architecture.svg :align: center @@ -79,7 +79,7 @@ Market Participants =================== The market participants, here labeled units, comprise all entities acting in the respective markets and are at -the core of any agent-based simulation model. The entirety of their behaviour leads to the market and system +the core of any agent-based simulation model. The entirety of their behavior leads to the market and system outcome as a bottom-up simulation model, respectively. Modularity of Units diff --git a/docs/source/learning.rst b/docs/source/learning.rst index 6189cb804..7cd45e136 100644 --- a/docs/source/learning.rst +++ b/docs/source/learning.rst @@ -24,7 +24,7 @@ The Basics of Reinforcement Learning In general, RL and deep reinforcement learning (DRL) in particular, open new prospects for agent-based electricity market modeling. Such algorithms offer the potential for agents to learn bidding strategies in the interplay between market participants. In contrast to traditional rule-based approaches, DRL allows for a faster adaptation of the bidding strategies to a changing market -environment, which is impossible with fixed strategies that a market modeller explicitly formulates. Hence, DRL algorithms offer the +environment, which is impossible with fixed strategies that a market modeler explicitly formulates. Hence, DRL algorithms offer the potential for simulated electricity market agents to develop bidding strategies for future markets and test emerging markets' mechanisms before their introduction into real-world systems. @@ -139,7 +139,7 @@ The Actor We will explain the way learning works in ASSUME starting from the interface to the simulation, namely the bidding strategy of the power plants. The bidding strategy, per definition in ASSUME, defines the way we formulate bids based on the technical restrictions of the unit. -In a learning setting, this is done by the actor network. Which maps the observation to an action. The observation thereby is managed and collected by the units operator as +In a learning setting, this is done by the actor network which maps the observation to an action. The observation thereby is managed and collected by the units operator as summarized in the following picture. As you can see in the current working version, the observation space contains a residual load forecast for the next 24 hours and a price forecast for 24 hours, as well as the current capacity of the power plant and its marginal costs. @@ -148,7 +148,7 @@ forecast for 24 hours, as well as the current capacity of the power plant and it :width: 500px The action space is a continuous space, which means that the actor can choose any price between 0 and the maximum bid price defined in the code. It gives two prices for two different parts of its capacity. -One, namley :math:`p_{inflex}` for the minimum capacity of the power plant and one for the rest ( :math:`p_{flex}`). The action space is defined in the config file and can be adjusted to your needs. +One, namely :math:`p_{inflex}` for the minimum capacity of the power plant and one for the rest ( :math:`p_{flex}`). The action space is defined in the config file and can be adjusted to your needs. After the bids are formulated in the bidding strategy they are sent to the market via the units operator. .. 
image:: img/ActorOutput.jpg @@ -156,7 +156,7 @@ After the bids are formulated in the bidding strategy they are sent to the marke :width: 500px In the case you are eager to integrate different learning bidding strategies or equip a new unit with learning, -you need to touch these methods. To enable an easy start with the use of reinforcement learning in ASSUME we provide a tutorial in colab on github. +you need to touch these methods. To enable an easy start with the use of reinforcement learning in ASSUME we provide a tutorial in colab on GitHub. The Critic ---------- @@ -175,8 +175,14 @@ You can read more about the different algorithms and the learning role in :doc:` The Learning Results in ASSUME ===================================== -Similarly to the other results, the learning progress is tracked in the database, either with postgresql or timescale. The latter enables the usage of the -predefined dashboards to track the leanring process in the "Assume:Training Process" dashboard. The following pictures show the learning process of a simple reinforcement learning setting. +Learning results are not easy to understand and judge. ASSUME supports different visualizations to track the learning progress. +Further we want to raise awareness for common pitfalls with learning result interpretation. + +Visualizations +-------------- + +Similarly to the other results, the learning progress is tracked in the database, either with PostgreSQL or TimescaleDB. The latter enables the usage of the +predefined dashboards to track the learning process in the "ASSUME:Training Process" dashboard. The following pictures show the learning process of a simple reinforcement learning setting. A more detailed description is given in the dashboard itself. .. image:: img/Grafana_Learning_1.jpeg @@ -207,3 +213,44 @@ After starting the server, open the following URL in your browser: TensorBoard will then display dashboards for scalars, histograms, graphs, projectors, and other relevant visualizations, depending on the metrics that the training pipeline currently exports. + +Interpretation +-------------- + +Once the environment and learning algorithm are specified, agents are trained and behaviors begin to emerge. The modeler (you) analyzes the reward in the +visualizations described above. This raises a basic modeling question: + + *How can we judge whether what has been learned is meaningful?* + +Unlike supervised learning, we do not have a ground-truth target or an error metric that reliably decreases as behavior improves. In multi-agent settings, +the notion of an “optimal” solution is often unclear. What we *do* observe are rewards – signals chosen by the modeler. How informative these signals are +depends heavily on the reward design and on how other agents behave. Therefore: + + **Do not rely on rewards alone.** Behavior itself must be examined carefully. +**Why solely reward-based evaluation is problematic** + +Let :math:`R_i` denote the episodic return of agent :math:`i` under the joint policy :math:`\pi=(\pi_1,\dots,\pi_n)`. A common but potentially misleading +heuristic is to evaluate behavior by the total reward, +.. math:: + + S(\pi) = \sum_{i=1}^n \mathbb{E}[R_i]. + +A larger :math:`S(\pi)` does *not* imply that the learned behavior is better or more stable. In a multi-agent environment, each agent’s learning alters the +effective environment faced by the others. The same policy can therefore earn very different returns depending on which opponent snapshot it encounters. 
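For intuition, the short sketch below (plain NumPy, purely illustrative and not part of ASSUME) shows how two very different distributions of per-agent returns yield exactly the same aggregate return :math:`S(\pi)`:

```python
# Illustrative only: the aggregate return hides how rewards are split across agents.
import numpy as np

# Episodic returns R_i of three agents under two hypothetical joint policies.
returns_balanced = np.array([10.0, 10.0, 10.0])  # all agents perform similarly
returns_skewed = np.array([28.0, 1.0, 1.0])      # one agent exploits the others


def aggregate_return(returns: np.ndarray) -> float:
    """Estimate S(pi) = sum_i E[R_i] from one observed return per agent."""
    return float(returns.sum())


print(aggregate_return(returns_balanced))  # 30.0
print(aggregate_return(returns_skewed))    # 30.0 -> same S(pi), very different behavior
```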
High +aggregate rewards can arise from: + +* temporary exploitation of weaknesses of other agents, +* coordination effects that occur by chance rather than by design, +* behavior that works against training opponents but fails in other situations. + +Rewards are thus, at best, an indirect proxy for “good behavior.” They measure how well a policy performs *under the specific reward function and opponent +behavior*, not whether it is robust, interpretable, or aligned with the modeler’s intent. + +**Implications for policy selection** + +This issue becomes visible when deciding which policy to evaluate at the end of training. We generally store (i) the policy with the highest average reward and +(ii) the final policy. However, these two can differ substantially in their behavior. The framework therefore uses the **final policy** for evaluation to +avoid selecting a high-reward snapshot that may be far from stable. + +The most robust learning performance can be achieved through **early stopping** with a very large number of episodes. In that case, training halts once results +are stable, and the final policy is likely also the stable one. This behavior should be monitored by the modeler in TensorBoard. diff --git a/docs/source/learning_algorithm.rst b/docs/source/learning_algorithm.rst index 2e72d44ed..2121d2345 100644 --- a/docs/source/learning_algorithm.rst +++ b/docs/source/learning_algorithm.rst @@ -6,10 +6,10 @@ Reinforcement Learning Algorithms ################################## -In the chapter :doc:`learning` we got a general overview of how RL is implemented for a multi-agent setting in Assume. +In the chapter :doc:`learning` we got a general overview of how RL is implemented for a multi-agent setting in ASSUME. If you want to apply these RL algorithms to a new problem, you do not necessarily need to understand how the RL algorithms work in detail. All that is needed is to adapt the bidding strategies, which is covered in the tutorials. -However, for the interested reader, we will give a brief overview of the RL algorithms used in Assume. +However, for the interested reader, we will give a brief overview of the RL algorithms used in ASSUME. We start with the learning role, which is the core of the learning implementation. The Learning Role @@ -29,28 +29,37 @@ The following table shows the options that can be adjusted and gives a short exp ======================================== ========================================================================================================== learning config item description ======================================== ========================================================================================================== - continue_learning Whether to use pre-learned strategies and then continue learning. - trained_policies_save_path Where to store the newly trained rl strategies - only needed when learning_mode is set - trained_policies_load_path If pre-learned strategies should be used, where are they stored? - only needed when continue_learning - max_bid_price The maximum bid price which limits the action of the actor to this price. - learning_mode Should we use learning mode at all? If not, the learning bidding strategy is overwritten with a default strategy. - algorithm Specifies which algorithm to use. Currently, only MATD3 is implemented. - actor_architecture The architecture of the neural networks used in the algorithm for the actors. The architecture is a list of names specifying the "policy" used e.g. multi layer perceptron (mlp). 
- learning_rate The learning rate, also known as step size, which specifies how much the new policy should be considered in the update. - learning_rate_schedule Which learning rate decay to use. Defaults to None. Currently only "linear" decay available. - training_episodes The number of training episodes, whereby one episode is the entire simulation horizon specified in the general config. - episodes_collecting_initial_experience The number of episodes collecting initial experience, whereby this means that random actions are chosen instead of using the actor network - train_freq Defines the frequency in time steps at which the actor and critic are updated. - gradient_steps The number of gradient steps. - batch_size The batch size of experience considered from the buffer for an update. - gamma The discount factor, with which future expected rewards are considered in the decision-making. - device The device to use. - noise_sigma The standard deviation of the distribution used to draw the noise, which is added to the actions and forces exploration. - noise_dt Determines how quickly the noise weakens over time / used for noise scheduling. - noise_scale The scale of the noise, which is multiplied by the noise drawn from the distribution. - action_noise_schedule Which action noise decay to use. Defaults to None. Currently only "linear" decay available. - early_stopping_steps The number of steps considered for early stopping. If the moving average reward does not improve over this number of steps, the learning is stopped. - early_stopping_threshold The value by which the average reward needs to improve to avoid early stopping. + learning_mode Should we use learning mode at all? If False, the learning bidding strategy is loaded from trained_policies_load_path and no training occurs. Default is False. + evaluation_mode This setting is modified internally. Whether to run in evaluation mode. If True, the agent uses the learned policy without exploration noise and no training updates occur. Default is False. + continue_learning Whether to use pre-learned strategies and then continue learning. If True, loads existing policies from trained_policies_load_path and continues training. Default is False. + trained_policies_save_path The directory path - relative to the scenario's inputs_path - where newly trained RL policies (actor and critic networks) will be saved. Only needed when learning_mode is True. Value is set in setup_world(). Defaults otherwise to None. + trained_policies_load_path The directory path - relative to the scenario's inputs_path - from which pre-trained policies should be loaded. Needed when continue_learning is True or using pre-trained strategies. Default is None. + min_bid_price The minimum bid price which limits the action of the actor to this price. Used to constrain the actor's output to a realistic price range. Default is -100.0. + max_bid_price The maximum bid price which limits the action of the actor to this price. Used to constrain the actor's output to a realistic price range. Default is 100.0. + device The device to use for PyTorch computations. Options include "cpu", "cuda", or specific CUDA devices like "cuda:0". Default is "cpu". + episodes_collecting_initial_experience The number of episodes at the start during which random actions are chosen instead of using the actor network. This helps populate the replay buffer with diverse experiences. Default is 5. 
+ exploration_noise_std The standard deviation of Gaussian noise added to actions during exploration in the environment. Higher values encourage more exploration. Default is 0.2. + training_episodes The number of training episodes, where one episode is the entire simulation horizon specified in the general config. Default is 100. + validation_episodes_interval The interval (in episodes) at which validation episodes are run to evaluate the current policy's performance without training updates. Default is 5. + train_freq Defines the frequency in time steps at which the actor and critic networks are updated. Accepts time strings like "24h" for 24 hours or "1d" for 1 day. Default is "24h". + batch_size The batch size of experiences sampled from the replay buffer for each training update. Larger batches provide more stable gradients but require more memory. In environments with many learning agents we advise small batch sizes. Default is 128. + gradient_steps The number of gradient descent steps performed during each training update. More steps can lead to better learning but increase computation time. Default is 100. + learning_rate The learning rate (step size) for the optimizer, which controls how much the policy and value networks are updated during training. Default is 0.001. + learning_rate_schedule Which learning rate decay schedule to use. Currently only "linear" decay is available, which linearly decreases the learning rate over time. Default is None (constant learning rate). + early_stopping_steps The number of validation steps over which the moving average reward is calculated for early stopping. If the reward doesn't change by early_stopping_threshold over this many steps, training stops. If None, defaults to training_episodes / validation_episodes_interval + 1. + early_stopping_threshold The minimum improvement in moving average reward required to avoid early stopping. If the reward improvement is less than this threshold over early_stopping_steps, training is terminated early. Default is 0.05. + algorithm Specifies which reinforcement learning algorithm to use. Currently, only "matd3" (Multi-Agent Twin Delayed Deep Deterministic Policy Gradient) is implemented. Default is "matd3". + replay_buffer_size The maximum number of transitions stored in the replay buffer for experience replay. Larger buffers allow for more diverse training samples. Default is 500000. + gamma The discount factor for future rewards, ranging from 0 to 1. Higher values give more weight to long-term rewards in decision-making. Default is 0.99. + actor_architecture The architecture of the neural networks used for the actors. Options include "mlp" (Multi-Layer Perceptron) and "lstm" (Long Short-Term Memory). Default is "mlp". + policy_delay The frequency (in gradient steps) at which the actor policy is updated. TD3 updates the critic more frequently than the actor to stabilize training. Default is 2. + noise_sigma The standard deviation of the Ornstein-Uhlenbeck or Gaussian noise distribution used to generate exploration noise added to actions. Default is 0.1. + noise_scale The scale factor multiplied by the noise drawn from the distribution. Larger values increase exploration. Default is 1. + noise_dt The time step parameter for the Ornstein-Uhlenbeck process, which determines how quickly the noise decays over time. Used for noise scheduling. Default is 1. + action_noise_schedule Which action noise decay schedule to use. 
Currently only "linear" decay is available, which linearly decreases exploration noise over training. Default is "linear". + tau The soft update coefficient for updating target networks. Controls how slowly target networks track the main networks. Smaller values mean slower updates. Default is 0.005. + target_policy_noise The standard deviation of noise added to target policy actions during critic updates. This smoothing helps prevent overfitting to narrow policy peaks. Default is 0.2. + target_noise_clip The maximum absolute value for clipping the target policy noise. Prevents the noise from being too large. Default is 0.5. ======================================== ========================================================================================================== How to use continue learning @@ -64,7 +73,7 @@ The learning process will then start from these pre-trained networks instead of In other words, the input layer of the critics will vary depending on the number of agents. To enable the use of continue learning between simulations with varying agent sizes, a mapping is implemented that ensures the loaded critics are adapted to match the new number of agents. -This process will fail, when the number of hidden layers differs between the loaded critic and the new critic. In this case, you will need to retrain the networks from scratch. Further, different chosen neural network arhcitectures for the critic (or actor) between the loaded and new networks will also lead to a failure of the continue learning process. +This process will fail, when the number of hidden layers differs between the loaded critic and the new critic. In this case, you will need to retrain the networks from scratch. Further, different chosen neural network architectures for the critic (or actor) between the loaded and new networks will also lead to a failure of the continue learning process. The Algorithms @@ -112,7 +121,7 @@ at the beginning of each episode. For more information regarding the buffer see The core of the algorithm is embodied by the :func:`assume.reinforcement_learning.algorithms.matd3.TD3.update_policy` in the learning algorithms. Here, the critic and the actor are updated according to the algorithm. The network architecture for the actor in the RL algorithm can be customized by specifying the network architecture used. -In stablebaselines3 they are also referred to as "policies". The architecture is defined as a list of names that represent the layers of the neural network. +In Stable Baselines3 they are also referred to as "policies". The architecture is defined as a list of names that represent the layers of the neural network. For example, to implement a multi-layer perceptron (MLP) architecture for the actor, you can set the "actor_architecture" config item to ["mlp"]. This will create a neural network with multiple fully connected layers. @@ -147,9 +156,9 @@ Overall, the replay buffer is instrumental in stabilizing the learning process i enhancing their robustness and performance by providing a diverse and non-correlated set of training samples. -How are they used in Assume? +How are they used in ASSUME? ============================ -In principal Assume allows for different buffers to be implemented. They just need to adhere to the structure presented in the base buffer. Here we will present the different buffers already implemented, which is only one, yet. +In principal ASSUME allows for different buffers to be implemented. 
They just need to adhere to the structure presented in the base buffer. Here we will present the different buffers already implemented, which is only one, yet. The simple replay buffer diff --git a/docs/source/redispatch_modeling.rst b/docs/source/redispatch_modeling.rst index a3e238c19..61ff63fd2 100644 --- a/docs/source/redispatch_modeling.rst +++ b/docs/source/redispatch_modeling.rst @@ -3,10 +3,10 @@ .. SPDX-License-Identifier: AGPL-3.0-or-later -Congestion Management and Redispatch Modelling +Congestion Management and Redispatch Modeling =============================================== -This section demonstrates the modelling and simulation of the redispatch mechanism using PyPSA as a plug-and-play module within the ASSUME framework. +This section demonstrates the modeling and simulation of the redispatch mechanism using PyPSA as a plug-and-play module within the ASSUME framework. The model primarily considers grid constraints to identify bottlenecks in the grid, resolve them using the redispatch algorithm, and account for dispatches from the EOM (Energy-Only Market). Concept of Redispatch @@ -17,7 +17,7 @@ The locational mismatch between electricity demand and generation requires the t When transmission capacity is insufficient to meet demand, generation must be reduced at locations with low demand and increased at locations with high demand. This process is known as **Redispatch**. In addition to spot markets, the redispatch mechanism is used to regulate grid flows and avoid congestion issues. It is operated and controlled by the system operators (SO). -Overview of Redispatch Modelling in PyPSA +Overview of Redispatch Modeling in PyPSA ------------------------------------------ The PyPSA network model can be created to visualize line flows using EOM clearing outcomes of generation and loads at different nodes (locations). @@ -72,10 +72,10 @@ A PyPSA network model can be created by defining nodes as locations for power ge Currently, a limitation of the PyPSA model is the inability to define flexible loads. -Modelling Redispatch in ASSUME +Modeling Redispatch in ASSUME -------------------------------- -Modelling redispatch in the ASSUME framework using PyPSA primarily includes two parts: +Modeling redispatch in the ASSUME framework using PyPSA primarily includes two parts: Congestion Identification -------------------------- @@ -102,7 +102,7 @@ Steps for Redispatch EOM market dispatches are fixed to model redispatch from power plants with accurate cost considerations. EOM dispatches are treated as a `Load` in the network, with dispatches specified via `p_set`. Generators are assigned a positive sign, and demands are given a negative sign. 2. **Upward Redispatch from Market and Reserved Power Plants** - Due to PyPSA’s limitations in modelling load flexibility, upward redispatch is added as a `Generator` with a positive sign. The maximum available capacity for upward redispatch is restricted using the `p_max_pu` factor, estimated as the difference between the current generation and the maximum power of the power plant. + Due to PyPSA’s limitations in modeling load flexibility, upward redispatch is added as a `Generator` with a positive sign. The maximum available capacity for upward redispatch is restricted using the `p_max_pu` factor, estimated as the difference between the current generation and the maximum power of the power plant. 
```python p_max_pu_up = (max_power - volume) / max_power diff --git a/docs/source/release_notes.rst b/docs/source/release_notes.rst index f7a3c27d1..90540f126 100644 --- a/docs/source/release_notes.rst +++ b/docs/source/release_notes.rst @@ -28,7 +28,10 @@ Upcoming Release - **Direct write access:** All learning-capable entities (units, unit operators, market agents) now write learning data directly to the learning role. - **Centralized logic:** Learning-related functionality is now almost always contained within the learning role, improving maintainability. - **Note:** Distributed learning across multiple machines is no longer supported, but this feature was not in active use. - +- **Restructured learning configuration**: All learning-related configuration parameters are now contained within a single `learning_config` dictionary in the `config.yaml` file. This change simplifies configuration management and avoids ambiguous setting of defaults. + - **Note:** ``learning_mode`` is moved from the top-level config to `learning_config`. Existing config files need to be updated accordingly. +- **Learning_role in all cases involving DRL**: The `learning_role` is now available in all simulations involving DRL, also if pre-trained strategies are loaded and no policy updates are performed. This change ensures consistent handling of learning configurations and simplifies the codebase by removing special cases. +- **Final DRL simulation with last policies**: After training, the final simulation now uses the last trained policies instead of the best policies. This change provides a more accurate representation of the learned behavior, as the last policies reflect the most recent training state. Additionally, multi-agent simulations do not always converge to the maximum reward. E.g. competing agents may underbid each other to gain market share, leading to lower overall rewards while reaching a stable state nevertheless. **New Features:** - **Unit Operator Portfolio Strategy**: A new bidding strategy type that enables portfolio optimization, where the default is called `UnitsOperatorEnergyNaiveDirectStrategy`. This strategy simply passes through bidding decisions of individual units within a portfolio, which was the default behavior beforehand as well. Further we added 'UnitsOperatorEnergyHeuristicCournotStrategy' which allows to model bidding behavior of a portfolio of units in a day-ahead market. The strategy calculates the optimal bid price and quantity for each unit in the portfolio, taking into account markup and the production costs of the units. This enables users to simulate and analyze the impact of strategic portfolio bidding on market outcomes and unit profitability. diff --git a/docs/source/unit_operator.rst b/docs/source/unit_operator.rst index 18d09c8fe..e6dccdf09 100644 --- a/docs/source/unit_operator.rst +++ b/docs/source/unit_operator.rst @@ -6,7 +6,7 @@ Unit Operator ============== Assume is created using flexible and usable abstractions, while still providing flexibility to cover most use cases of market modeling. This is also true for the unit operator class. -In general the task of this calls can range from simple passing the bids of the technical units through to complex portfolio optimisation of all units assigned to one operator. This text aims +In general the task of this calls can range from simple passing the bids of the technical units through to complex portfolio optimization of all units assigned to one operator. 
This text aims to explain its current functionalities and the possibilities the unit @@ -25,10 +25,10 @@ The unit operator is responsible for the following tasks: As one can see from all the task the unit oporator covers, that it orchestrates and coordinates the technical units and the markets. -Portfolio Optimisation +Portfolio Optimization ---------------------- The main felxibility a unit oporator is, that we can process all the bids and technical constraints a the unit oporator gets from its technical units -however we want, before sending them to the market. This allows us to implement a portfolio optimisation for the technical units assigned to the unit operator. +however we want, before sending them to the market. This allows us to implement a portfolio optimization for the technical units assigned to the unit operator. A respective function is in place for that. Yet, it is not used in the current version of the unit operator. The function is called :func:`assume.common.units_operator.UnitsOperator.formulate_bids`. For example, one could think of coordinating the bids of a battery and a PV unit to maximise the self-consumption of the PV unit under one unit operator. diff --git a/docs/source/units.rst b/docs/source/units.rst index 2af818340..b5272f2fb 100644 --- a/docs/source/units.rst +++ b/docs/source/units.rst @@ -6,7 +6,7 @@ Unit Types ########### -In power system modelling, various unit types are used to represent the components responsible for generating, storing, and consuming electricity. These units are essential for simulating the operation of the grid and ensuring a reliable balance between supply and demand. +In power system modeling, various unit types are used to represent the components responsible for generating, storing, and consuming electricity. These units are essential for simulating the operation of the grid and ensuring a reliable balance between supply and demand. The primary unit types in this context include: @@ -16,7 +16,7 @@ The primary unit types in this context include: 3. **Demand Units**: These represent consumers of electricity, such as households, industries, or commercial buildings, whose electricity consumption is typically fixed and not easily adjustable based on real-time grid conditions. Demand units will therefore be modelled with inelastic demand most often. However, representation of elastic bidding is possible with this unit type. -Each unit type has specific characteristics that affect how the power system operates, and understanding these is key to modelling and optimizing grid performance. +Each unit type has specific characteristics that affect how the power system operates, and understanding these is key to modeling and optimizing grid performance. .. 
include:: demand_side_agent.rst diff --git a/examples/notebooks/03_custom_unit_example.ipynb b/examples/notebooks/03_custom_unit_example.ipynb index 3a94b2fde..21dd92468 100644 --- a/examples/notebooks/03_custom_unit_example.ipynb +++ b/examples/notebooks/03_custom_unit_example.ipynb @@ -67,16 +67,15 @@ "# this cell is used to display the image in the notebook when using colab\n", "# or running the notebook locally\n", "\n", - "import os\n", - "\n", + "from pathlib import Path\n", "from IPython.display import Image, display\n", "\n", - "image_path = \"assume-repo/docs/source/img/Electrolyzer.png\"\n", - "alt_image_path = \"../../docs/source/img/Electrolyzer.png\"\n", - "\n", - "if os.path.exists(image_path):\n", + "image_path = Path(\"assume-repo/docs/source/img/Electrolyzer.png\")\n", + "alt_image_path = Path(\"../../docs/source/img/Electrolyzer.png\")\n", + "\n", + "if image_path.exists():\n", " display(Image(image_path))\n", - "elif os.path.exists(alt_image_path):\n", + "elif alt_image_path.exists():\n", " display(Image(alt_image_path))" ] }, diff --git a/examples/notebooks/04a_reinforcement_learning_algorithm_example.ipynb b/examples/notebooks/04a_reinforcement_learning_algorithm_example.ipynb index 13618b4e7..a683eb299 100644 --- a/examples/notebooks/04a_reinforcement_learning_algorithm_example.ipynb +++ b/examples/notebooks/04a_reinforcement_learning_algorithm_example.ipynb @@ -283,7 +283,15 @@ "source": [ "from IPython.display import Image, display\n", "\n", - "display(Image(\"../../docs/source/img/Assume_run_learning_loop.png\", width=400))" + "from pathlib import Path\n", + "\n", + "image_path = Path(\"assume-repo/docs/source/img/Assume_run_learning_loop.png\")\n", + "alt_image_path = Path(\"../../docs/source/img/Assume_run_learning_loop.png\")\n", + "\n", + "if image_path.exists():\n", + " display(Image(image_path, width=400))\n", + "elif alt_image_path.exists():\n", + " display(Image(alt_image_path, width=400))" ] }, { @@ -306,15 +312,13 @@ "\n", " Args:\n", " world (World): An instance of the World class representing the simulation environment.\n", - " inputs_path (str): The path to the folder containing input files necessary for the simulation.\n", - " scenario (str): The name of the scenario for the simulation.\n", - " study_case (str): The specific study case for the simulation.\n", + " verbose (bool, optional): A flag indicating whether to enable verbose logging. 
Defaults to False.\n", "\n", " Note:\n", " - The function uses a ReplayBuffer to store experiences for training the DRL agents.\n", " - It iterates through training episodes, updating the agents and evaluating their performance at regular intervals.\n", " - Initial exploration is active at the beginning and is disabled after a certain number of episodes to improve the performance of DRL algorithms.\n", - " - Upon completion of training, the function performs an evaluation run using the best policy learned during training.\n", + " - Upon completion of training, the function performs an evaluation run using the last policy learned during training.\n", " - The best policies are chosen based on the average reward obtained during the evaluation runs, and they are saved for future use.\n", " \"\"\"\n", "\n", @@ -357,7 +361,7 @@ " # Information that needs to be stored across episodes, aka one simulation run\n", " inter_episodic_data = {\n", " \"buffer\": ReplayBuffer(\n", - " buffer_size=world.learning_config.replay_buffer_size,\n", + " buffer_size=world.learning_role.learning_config.replay_buffer_size,\n", " obs_dim=world.learning_role.rl_algorithm.obs_dim,\n", " act_dim=world.learning_role.rl_algorithm.act_dim,\n", " n_rl_units=len(world.learning_role.rl_strats),\n", @@ -377,14 +381,14 @@ " # -----------------------------------------\n", "\n", " validation_interval = min(\n", - " world.learning_role.training_episodes,\n", - " world.learning_config.validation_episodes_interval,\n", + " world.learning_role.learning_config.training_episodes,\n", + " world.learning_role.learning_config.validation_episodes_interval,\n", " )\n", "\n", " eval_episode = 1\n", "\n", " for episode in tqdm(\n", - " range(1, world.learning_role.training_episodes + 1),\n", + " range(1, world.learning_role.learning_config.training_episodes + 1),\n", " desc=\"Training Episodes\",\n", " ):\n", " # -----------------------------------------\n", @@ -409,7 +413,7 @@ " if (\n", " episode % validation_interval == 0\n", " and episode\n", - " >= world.learning_role.episodes_collecting_initial_experience\n", + " >= world.learning_role.learning_config.episodes_collecting_initial_experience\n", " + validation_interval\n", " ):\n", " world.reset()\n", @@ -454,11 +458,11 @@ " # save the policies after each episode in case the simulation is stopped or crashes\n", " if (\n", " episode\n", - " >= world.learning_role.episodes_collecting_initial_experience\n", + " >= world.learning_role.learning_config.episodes_collecting_initial_experience\n", " + validation_interval\n", " ):\n", " world.learning_role.rl_algorithm.save_params(\n", - " directory=f\"{world.learning_role.trained_policies_save_path}/last_policies\"\n", + " directory=f\"{world.learning_role.learning_config.trained_policies_save_path}/last_policies\"\n", " )\n", "\n", " # container shutdown implicitly with new initialisation\n", @@ -472,7 +476,7 @@ " # especially if previous strategies were loaded from an external source.\n", " # This is useful when continuing from a previous learning session.\n", " world.scenario_data[\"config\"][\"learning_config\"][\"trained_policies_load_path\"] = (\n", - " f\"{world.learning_role.trained_policies_save_path}/avg_reward_eval_policies\"\n", + " f\"{world.learning_role.learning_config.trained_policies_save_path}/avg_reward_eval_policies\"\n", " )\n", "\n", " # load scenario for evaluation\n", @@ -535,7 +539,10 @@ "\n", " # if enough initial experience was collected according to specifications in learning config\n", " # turn off initial exploration 
and go into full learning mode\n", - " if self.episodes_done >= self.episodes_collecting_initial_experience:\n", + " if (\n", + " self.episodes_done\n", + " >= self.learning_config.episodes_collecting_initial_experience\n", + " ):\n", " self.turn_off_initial_exploration()\n", "\n", " self.set_noise_scale(inter_episodic_data[\"noise_scale\"])\n", @@ -611,15 +618,7 @@ " algorithm (RLAlgorithm): The name of the reinforcement learning algorithm.\n", " \"\"\"\n", " if algorithm == \"matd3\":\n", - " self.rl_algorithm = TD3(\n", - " learning_role=self,\n", - " learning_rate=self.learning_rate,\n", - " episodes_collecting_initial_experience=self.episodes_collecting_initial_experience,\n", - " gradient_steps=self.gradient_steps,\n", - " batch_size=self.batch_size,\n", - " gamma=self.gamma,\n", - " actor_architecture=self.actor_architecture,\n", - " )\n", + " self.rl_algorithm = TD3(learning_role=self)\n", " else:\n", " logger.error(f\"Learning algorithm {algorithm} not implemented!\")" ] @@ -767,7 +766,9 @@ " for _ in range(self.gradient_steps):\n", " self.n_updates += 1\n", "\n", - " transitions = self.learning_role.buffer.sample(self.batch_size)\n", + " transitions = self.learning_role.buffer.sample(\n", + " self.learning_config.batch_size\n", + " )\n", " states, actions, next_states, rewards = (\n", " transitions.observations,\n", " transitions.actions,\n", @@ -777,9 +778,13 @@ "\n", " with th.no_grad():\n", " # Select action according to policy and add clipped noise\n", - " # Select action according to policy and add clipped noise\n", - " noise = th.randn_like(actions) * self.target_policy_noise\n", - " noise = noise.clamp(-self.target_noise_clip, self.target_noise_clip)\n", + " noise = (\n", + " th.randn_like(actions) * self.learning_config.target_policy_noise\n", + " )\n", + " noise = noise.clamp(\n", + " -self.learning_config.target_noise_clip,\n", + " self.learning_config.target_noise_clip,\n", + " )\n", "\n", " next_actions = th.stack(\n", " [\n", @@ -795,15 +800,15 @@ " next_actions = next_actions.transpose(0, 1).contiguous()\n", " next_actions = next_actions.view(-1, n_rl_agents * self.act_dim)\n", "\n", - " all_actions = actions.view(self.batch_size, -1)\n", + " all_actions = actions.view(self.learning_config.batch_size, -1)\n", "\n", " # Precompute unique observation parts for all agents\n", " unique_obs_from_others = states[\n", " :, :, self.obs_dim - self.unique_obs_dim :\n", - " ].reshape(self.batch_size, n_rl_agents, -1)\n", + " ].reshape(self.learning_config.batch_size, n_rl_agents, -1)\n", " next_unique_obs_from_others = next_states[\n", " :, :, self.obs_dim - self.unique_obs_dim :\n", - " ].reshape(self.batch_size, n_rl_agents, -1)\n", + " ].reshape(self.learning_config.batch_size, n_rl_agents, -1)\n", "\n", " # Loop over all agents and update their actor and critic networks\n", " for i, strategy in enumerate(self.learning_role.rl_strats.values()):\n", @@ -827,15 +832,19 @@ " # Construct final state representations\n", " all_states = th.cat(\n", " (\n", - " states[:, i, :].reshape(self.batch_size, -1),\n", - " other_unique_obs.reshape(self.batch_size, -1),\n", + " states[:, i, :].reshape(self.learning_config.batch_size, -1),\n", + " other_unique_obs.reshape(self.learning_config.batch_size, -1),\n", " ),\n", " dim=1,\n", " )\n", " all_next_states = th.cat(\n", " (\n", - " next_states[:, i, :].reshape(self.batch_size, -1),\n", - " other_next_unique_obs.reshape(self.batch_size, -1),\n", + " next_states[:, i, :].reshape(\n", + " self.learning_config.batch_size, -1\n", + " ),\n", + 
" other_next_unique_obs.reshape(\n", + " self.learning_config.batch_size, -1\n", + " ),\n", " ),\n", " dim=1,\n", " )\n", @@ -847,7 +856,8 @@ " )\n", " next_q_values, _ = th.min(next_q_values, dim=1, keepdim=True)\n", " target_Q_values = (\n", - " rewards[:, i].unsqueeze(1) + self.gamma * next_q_values\n", + " rewards[:, i].unsqueeze(1)\n", + " + self.learning_config.gamma * next_q_values\n", " )\n", "\n", " # Get current Q-values estimates for each critic network\n", @@ -867,7 +877,7 @@ " critic.optimizer.step()\n", "\n", " # Delayed policy updates\n", - " if self.n_updates % self.policy_delay == 0:\n", + " if self.n_updates % self.learning_config.policy_delay == 0:\n", " # Compute actor loss\n", " state_i = states[:, i, :]\n", " action_i = actor(state_i)\n", @@ -877,7 +887,8 @@ "\n", " # calculate actor loss\n", " actor_loss = -critic.q1_forward(\n", - " all_states, all_actions_clone.view(self.batch_size, -1)\n", + " all_states,\n", + " all_actions_clone.view(self.learning_config.batch_size, -1),\n", " ).mean()\n", "\n", " actor.optimizer.zero_grad(set_to_none=True)\n", @@ -887,7 +898,7 @@ " actor.optimizer.step()\n", "\n", " # Perform batch-wise Polyak update at the end (instead of inside the loop)\n", - " if self.n_updates % self.policy_delay == 0:\n", + " if self.n_updates % self.learning_config.policy_delay == 0:\n", " all_critic_params = []\n", " all_target_critic_params = []\n", "\n", @@ -904,8 +915,14 @@ " all_target_actor_params.extend(strategy.actor_target.parameters())\n", "\n", " # Perform batch-wise Polyak update (NO LOOPS)\n", - " polyak_update(all_critic_params, all_target_critic_params, self.tau)\n", - " polyak_update(all_actor_params, all_target_actor_params, self.tau)" + " polyak_update(\n", + " all_critic_params,\n", + " all_target_critic_params,\n", + " self.learning_config.tau,\n", + " )\n", + " polyak_update(\n", + " all_actor_params, all_target_actor_params, self.learning_config.tau\n", + " )" ] }, { @@ -951,6 +968,7 @@ "outputs": [], "source": [ "learning_config = {\n", + " \"learning_mode\": True,\n", " \"continue_learning\": False,\n", " \"trained_policies_save_path\": None,\n", " \"max_bid_price\": 100,\n", @@ -984,7 +1002,6 @@ " data = yaml.safe_load(file)\n", "\n", "# store our modifications to the config file\n", - "data[\"base\"][\"learning_mode\"] = True\n", "data[\"base\"][\"learning_config\"] = learning_config\n", "\n", "# Write the modified data back to the file\n", @@ -1015,50 +1032,7 @@ "lines_to_next_cell": 0, "outputId": "e30f4279-7a4e-4efc-9cfb-61416e4fe2f1" }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Training Episodes: 7%|▋ | 7/100 [00:48<07:28, 4.83s/it]\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A\n", - "\u001b[A" - ] - } - ], + "outputs": [], "source": [ "import os\n", "\n", @@ -1103,9 +1077,9 @@ " )\n", "\n", " # run learning if learning mode is enabled\n", - " # needed as we simulate the modelling horizon multiple times to train reinforcement learning run_learning(world)\n", + " # needed as we simulate the modeling 
horizon multiple times to train reinforcement learning run_learning(world)\n", "\n", - " if world.learning_config.learning_mode:\n", + " if world.learning_mode:\n", " run_learning(world)\n", "\n", " # after the learning is done we make a normal run of the simulation, which equals a test run\n", @@ -1128,12 +1102,6 @@ "**Next up:** [4.2 Designing Adaptive Bidding Strategies in ASSUME using Reinforcement Learning](04b_reinforcement_learning_example.ipynb)\n", "\n" ] - }, - { - "cell_type": "markdown", - "id": "7ead7320", - "metadata": {}, - "source": [] } ], "metadata": { diff --git a/examples/notebooks/04b_reinforcement_learning_example.ipynb b/examples/notebooks/04b_reinforcement_learning_example.ipynb index aed2b3b3c..73ccdaa0d 100644 --- a/examples/notebooks/04b_reinforcement_learning_example.ipynb +++ b/examples/notebooks/04b_reinforcement_learning_example.ipynb @@ -219,7 +219,7 @@ "source": [ "try:\n", " from assume import World\n", - " from assume.strategies.learning_strategies import LearningStrategy\n", + " from assume.strategies.learning_strategies import TorchLearningStrategy\n", "\n", " print(\"✅ ASSUME framework is installed and functional.\")\n", "except ImportError as e:\n", @@ -261,6 +261,7 @@ "source": [ "# Standard Python modules\n", "import logging # For logging messages during simulation and debugging\n", + "import os # For operating system interactions\n", "from datetime import timedelta # To handle market time resolutions (e.g., hourly steps)\n", "\n", "import matplotlib.pyplot as plt\n", @@ -280,8 +281,8 @@ " run_learning,\n", ")\n", "from assume.strategies.learning_strategies import (\n", - " LearningStrategy, # Abstract base for RL bidding strategies\n", " MinMaxStrategy, # Abstract class for powerplant-like strategies\n", + " TorchLearningStrategy, # Abstract base for RL bidding strategies\n", ")" ] }, @@ -331,16 +332,16 @@ "metadata": {}, "outputs": [], "source": [ - "import os\n", + "from pathlib import Path\n", "\n", "from IPython.display import SVG, display\n", "\n", - "image_path = \"assume-repo/docs/source/img/architecture.svg\"\n", - "alt_image_path = \"../../docs/source/img/architecture.svg\"\n", + "image_path = Path(\"assume-repo/docs/source/img/architecture.svg\")\n", + "alt_image_path = Path(\"../../docs/source/img/architecture.svg\")\n", "\n", - "if os.path.exists(image_path):\n", + "if image_path.exists():\n", " display(SVG(image_path))\n", - "elif os.path.exists(alt_image_path):\n", + "elif alt_image_path.exists():\n", " display(SVG(alt_image_path))" ] }, @@ -467,7 +468,7 @@ "source": [ "### 3.3 Defining the Strategy Class and Constructor\n", "\n", - "To enable learning, we define a custom class that extends `LearningStrategy` and initializes key dimensions for the model:" + "To enable learning, we define a custom class that extends `TorchLearningStrategy` and initializes key dimensions for the model:" ] }, { @@ -477,7 +478,7 @@ "metadata": {}, "outputs": [], "source": [ - "class SingleBidLearningStrategy(LearningStrategy, MinMaxStrategy):\n", + "class EnergyLearningSingleBidStrategy(TorchLearningStrategy, MinMaxStrategy):\n", " \"\"\"\n", " A simple reinforcement learning bidding strategy.\n", " \"\"\"\n", @@ -522,7 +523,7 @@ "# however, you should have all functions in a single class when using this example in .py files\n", "\n", "\n", - "class SingleBidLearningStrategy(SingleBidLearningStrategy):\n", + "class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):\n", " def get_individual_observations(self, unit, start, end):\n", " 
\"\"\"\n", " Define custom unit-specific observations for the RL agent.\n", @@ -599,7 +600,7 @@ "# however, you should have all functions in a single class when using this example in .py files\n", "\n", "\n", - "class SingleBidLearningStrategy(SingleBidLearningStrategy):\n", + "class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):\n", " def get_individual_observations(self, unit, start, end):\n", " # --- Current volume & marginal cost ---\n", " current_volume = unit.get_output_before(start)\n", @@ -668,7 +669,7 @@ "source": [ "### 4.2 Understanding `get_actions()`\n", "\n", - "The method `get_actions(next_observation)` in `BaseLearningStrategy` defines how actions are computed in different modes of operation.\n", + "The method `get_actions(next_observation)` in `TorchLearningStrategy` defines how actions are computed in different modes of operation.\n", "\n", "Here is a simplified overview of the logic:\n", "\n", @@ -758,7 +759,7 @@ "# however, you should have all functions in a single class when using this example in .py files\n", "\n", "\n", - "class SingleBidLearningStrategy(SingleBidLearningStrategy):\n", + "class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):\n", " def get_actions(self, next_observation):\n", " \"\"\"\n", " Compute actions based on the current observation, optionally applying noise for exploration.\n", @@ -809,7 +810,7 @@ "# however, you should have all functions in a single class when using this example in .py files\n", "\n", "\n", - "class SingleBidLearningStrategy(SingleBidLearningStrategy):\n", + "class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):\n", " def get_actions(self, next_observation):\n", " # Get the base action and associated noise from the parent implementation\n", " curr_action, noise = super().get_actions(next_observation)\n", @@ -917,30 +918,12 @@ "Note that `max_power` is **positive**, as this strategy models a generator offering energy. 
For a **consumer or demand bid**, the volume would be **negative** to reflect load withdrawal.\n" ] }, - { - "cell_type": "markdown", - "id": "155f2f1e", - "metadata": {}, - "source": [ - "### 5.4 Why We Store Everything in `unit.outputs`\n", - "\n", - "The outputs of the bidding process are stored in two places:\n", - "\n", - "* `unit.outputs[\"rl_observations\"]` and `[\"rl_actions\"]`:\n", - " Stored as lists to be written into the replay buffer for learning.\n", - "\n", - "* `unit.outputs[\"actions\"]` and `[\"exploration_noise\"]`:\n", - " Stored as `pandas.Series` for compatibility with the unit’s internal logging and database structure.\n", - "\n", - "This dual storage ensures that both the simulation engine and the learning backend have access to the relevant data.\n" - ] - }, { "cell_type": "markdown", "id": "7838b5df", "metadata": {}, "source": [ - "### 5.5 Controlling Action Dimensions\n", + "### 5.4 Controlling Action Dimensions\n", "\n", "By changing the `act_dim` in the strategy constructor, you can control the number of outputs returned by the actor network:\n", "\n", @@ -964,7 +947,7 @@ "id": "be5a6cd5", "metadata": {}, "source": [ - "### 5.6 Full Code Implementation\n", + "### 5.5 Full Code Implementation\n", "\n", "Here is the complete `calculate_bids()` implementation:" ] @@ -982,7 +965,7 @@ "# however, you should have all functions in a single class when using this example in .py files\n", "\n", "\n", - "class SingleBidLearningStrategy(SingleBidLearningStrategy):\n", + "class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):\n", " def calculate_bids(self, unit, market_config, product_tuples, **kwargs):\n", " start = product_tuples[0][0]\n", " end = product_tuples[0][1]\n", @@ -1021,13 +1004,8 @@ " },\n", " ]\n", "\n", - " # store results in unit outputs as lists to be written to the buffer for learning\n", - " unit.outputs[\"rl_observations\"].append(next_observation)\n", - " unit.outputs[\"rl_actions\"].append(actions)\n", - "\n", - " # store results in unit outputs as series to be written to the database by the unit operator\n", - " unit.outputs[\"actions\"].at[start] = actions\n", - " unit.outputs[\"exploration_noise\"].at[start] = noise\n", + " if self.learning_mode:\n", + " self.learning_role.add_actions_to_cache(self.unit_id, start, actions, noise)\n", "\n", " return bids" ] @@ -1140,7 +1118,7 @@ "# however, you should have all functions in a single class when using this example in .py files\n", "\n", "\n", - "class SingleBidLearningStrategy(SingleBidLearningStrategy):\n", + "class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):\n", " def calculate_reward(self, unit, marketconfig, orderbook):\n", " \"\"\"\n", " Reward function: implement profit and (optionally) opportunity cost.\n", @@ -1177,12 +1155,18 @@ " # === Normalize reward to ~[-1, 1] ===\n", " scaling = 1 / (self.max_bid_price * unit.max_power)\n", " reward = scaling * (order_profit - regret_scale * opportunity_cost)\n", + " regret = regret_scale * opportunity_cost\n", + "\n", + " # Store results in unit outputs\n", + " # Note: these are not learning-specific results but stored for all units for analysis\n", + " unit.outputs[\"profit\"].loc[start:end_excl] += order_profit\n", + " unit.outputs[\"total_costs\"].loc[start:end_excl] += order_cost\n", "\n", - " unit.outputs[\"profit\"].loc[start:end_excl] = order_profit\n", - " unit.outputs[\"reward\"].loc[start:end_excl] = reward\n", - " unit.outputs[\"regret\"].loc[start:end_excl] = regret_scale * opportunity_cost\n", - " 
unit.outputs[\"total_costs\"].loc[start:end_excl] = order_cost\n", - " unit.outputs[\"rl_rewards\"].append(reward)" + " # write rl-rewards to buffer\n", + " if self.learning_mode:\n", + " self.learning_role.add_reward_to_cache(\n", + " unit.id, start, reward, regret, order_profit\n", + " )" ] }, { @@ -1272,7 +1256,7 @@ "# however, you should have all functions in a single class when using this example in .py files\n", "\n", "\n", - "class SingleBidLearningStrategy(SingleBidLearningStrategy):\n", + "class EnergyLearningSingleBidStrategy(EnergyLearningSingleBidStrategy):\n", " def calculate_reward(\n", " self,\n", " unit,\n", @@ -1336,16 +1320,19 @@ "\n", " # scaling factor to normalize the reward to the range [-1,1]\n", " scaling = 1 / (self.max_bid_price * unit.max_power)\n", - "\n", " reward = scaling * (order_profit - regret_scale * opportunity_cost)\n", + " regret = regret_scale * opportunity_cost\n", "\n", - " # Store results in unit outputs, which are later written to the database by the unit operator.\n", - " unit.outputs[\"profit\"].loc[start:end_excl] = order_profit\n", - " unit.outputs[\"reward\"].loc[start:end_excl] = reward\n", - " unit.outputs[\"regret\"].loc[start:end_excl] = regret_scale * opportunity_cost\n", - " unit.outputs[\"total_costs\"].loc[start:end_excl] = order_cost\n", + " # Store results in unit outputs\n", + " # Note: these are not learning-specific results but stored for all units for analysis\n", + " unit.outputs[\"profit\"].loc[start:end_excl] += order_profit\n", + " unit.outputs[\"total_costs\"].loc[start:end_excl] += order_cost\n", "\n", - " unit.outputs[\"rl_rewards\"].append(reward)" + " # write rl-rewards to buffer\n", + " if self.learning_mode:\n", + " self.learning_role.add_reward_to_cache(\n", + " unit.id, start, reward, regret, order_profit\n", + " )" ] }, { @@ -1414,6 +1401,7 @@ "\n", "| Parameter | Description |\n", "| --------------------------------------------- | ------------------------------------------------------------------------------------- |\n", + "| **learning\\_mode** | If `True`, performs the policy updates and evaluates learned policies. | \n", "| **continue\\_learning** | If `True`, resumes training from saved policy checkpoints. |\n", "| **trained\\_policies\\_save\\_path** | File path where trained policies will be saved. |\n", "| **trained\\_policies\\_load\\_path** | Path to pre-trained policies to load. |\n", @@ -1490,7 +1478,7 @@ " world = World(database_uri=db_uri, export_csv_path=csv_path)\n", "\n", " # 2. Register your learning strategy\n", - " world.bidding_strategies[\"pp_learning\"] = SingleBidLearningStrategy\n", + " world.bidding_strategies[\"pp_learning\"] = EnergyLearningSingleBidStrategy\n", "\n", " # 3. Load scenario and case\n", " load_scenario_folder(\n", @@ -1501,7 +1489,7 @@ " )\n", "\n", " # 4. Run the training phase\n", - " if world.learning_config.learning_mode:\n", + " if world.learning_mode:\n", " run_learning(world)\n", "\n", " # 5. 
Execute final evaluation run (no exploration)\n", @@ -1697,6 +1685,7 @@ "SELECT\n", " start_time AS time,\n", " price,\n", + " accepted_price,\n", " unit_id,\n", " simulation\n", "FROM market_orders\n", @@ -1733,11 +1722,18 @@ " label=\"Next Unit's Marginal Cost (85.7 €)\",\n", ")\n", "\n", + "plt.plot(\n", + " bids_df[\"time\"],\n", + " bids_df[\"accepted_price\"],\n", + " label=\"Accepted Price\",\n", + " color=\"tab:orange\",\n", + ")\n", + "\n", "plt.title(\"Bidding Behavior of RL Agent (pp_6)\")\n", "plt.xlabel(\"Time\")\n", "plt.ylabel(\"Bid Price (€/MWh)\")\n", "plt.legend()\n", - "plt.ylim(50, 100)\n", + "plt.ylim(30, 100)\n", "plt.grid(True)\n", "plt.tight_layout()\n", "plt.show()" @@ -1845,7 +1841,7 @@ ], "metadata": { "kernelspec": { - "display_name": "assume-framework", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -1859,7 +1855,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.9" + "version": "3.13.9" } }, "nbformat": 4, diff --git a/examples/notebooks/04c_reinforcement_learning_storage_example.ipynb b/examples/notebooks/04c_reinforcement_learning_storage_example.ipynb index 3412a636a..bcdcd4a15 100644 --- a/examples/notebooks/04c_reinforcement_learning_storage_example.ipynb +++ b/examples/notebooks/04c_reinforcement_learning_storage_example.ipynb @@ -207,7 +207,7 @@ "source": [ "try:\n", " from assume import World\n", - " from assume.strategies.learning_strategies import LearningStrategy\n", + " from assume.strategies.learning_strategies import TorchLearningStrategy\n", "\n", " print(\"✅ ASSUME framework is installed and functional.\")\n", "except ImportError as e:\n", @@ -249,6 +249,7 @@ "source": [ "# Standard Python modules\n", "import logging # For logging messages during simulation and debugging\n", + "import os # For operating system interactions\n", "from datetime import timedelta # To handle market time resolutions (e.g., hourly steps)\n", "\n", "import matplotlib.pyplot as plt\n", @@ -268,8 +269,8 @@ " run_learning,\n", ")\n", "from assume.strategies.learning_strategies import (\n", - " LearningStrategy, # Abstract base for RL bidding strategies\n", " MinMaxChargeStrategy, # Abstract class for storage-like strategies\n", + " TorchLearningStrategy, # Abstract base for RL bidding strategies\n", ")" ] }, @@ -319,16 +320,16 @@ "metadata": {}, "outputs": [], "source": [ - "import os\n", + "from pathlib import Path\n", "\n", "from IPython.display import SVG, display\n", "\n", - "image_path = \"assume-repo/docs/source/img/architecture.svg\"\n", - "alt_image_path = \"../../docs/source/img/architecture.svg\"\n", + "image_path = Path(\"assume-repo/docs/source/img/architecture.svg\")\n", + "alt_image_path = Path(\"../../docs/source/img/architecture.svg\")\n", "\n", - "if os.path.exists(image_path):\n", + "if image_path.exists():\n", " display(SVG(image_path))\n", - "elif os.path.exists(alt_image_path):\n", + "elif alt_image_path.exists():\n", " display(SVG(alt_image_path))" ] }, @@ -455,7 +456,7 @@ "source": [ "### 3.3 Exercise 1: Choose a Suitable Foresight for Storage Agents\n", "\n", - "To enable learning for storage units, we define a custom strategy class that extends `LearningStrategy`. This class specifies key dimensions such as the size of the observation and action spaces. 
One crucial parameter you need to define is the **foresight**—how many future time steps the agent considers when making decisions.\n", + "To enable learning for storage units, we define a custom strategy class that extends `TorchLearningStrategy`. This class specifies key dimensions such as the size of the observation and action spaces. One crucial parameter you need to define is the **foresight**—how many future time steps the agent considers when making decisions.\n", "\n", "Unlike dispatchable power plants, storage units face **temporally coupled decisions**: they must charge at one point in time and discharge at another, often hours later. This delay between cost and profit means that storage agents require a **longer foresight** than units that act on short-term signals.\n", "\n", @@ -472,7 +473,7 @@ "metadata": {}, "outputs": [], "source": [ - "class StorageEnergyLearningStrategy(LearningStrategy, MinMaxChargeStrategy):\n", + "class StorageEnergyLearningStrategy(TorchLearningStrategy, MinMaxChargeStrategy):\n", " \"\"\"\n", " A simple reinforcement learning bidding strategy.\n", " \"\"\"\n", @@ -510,7 +511,7 @@ "outputs": [], "source": [ "# @title Solution Exercise 1\n", - "class StorageEnergyLearningStrategy(LearningStrategy, MinMaxChargeStrategy):\n", + "class StorageEnergyLearningStrategy(TorchLearningStrategy, MinMaxChargeStrategy):\n", " \"\"\"\n", " A simple reinforcement learning bidding strategy.\n", " \"\"\"\n", @@ -763,7 +764,7 @@ "> **Note for advanced users:** \n", "> The environment for storage units is **not fully Markovian**. Future rewards depend on past actions — particularly the prices at which energy was charged. \n", "> To mitigate this partial observability, we **augment the observation space** with the **average cost of stored energy**. This acts as a memory proxy, helping the agent assess whether selling at a given price is profitable. \n", - "> This approach is a form of *state augmentation*, commonly used in reinforcement learning to approximate Markovian behaviour in **partially observable environments (POMDPs)**.\n", + "> This approach is a form of *state augmentation*, commonly used in reinforcement learning to approximate Markovian behavior in **partially observable environments (POMDPs)**.\n", "\n", "\n", "### 3.5 Summary\n", @@ -772,7 +773,7 @@ "* The base class handles forecasted residual load and price, as well as historical price signals.\n", "* For storage units, individual observations include the **state of charge** and the **cost of stored energy**, which reflects past purchase prices and is updated over time.\n", "* You implemented the logic for updating this cost after market actions—this is crucial for enabling the agent to assess profitability when selling energy.\n", - "* These observations directly affect agent behaviour and learning convergence—thoughtful design matters.\n", + "* These observations directly affect agent behavior and learning convergence—thoughtful design matters.\n", "\n", "In the next chapter, you will define **how the agent selects actions** based on its observations, and how **exploration** is introduced during initial training to populate the learning buffer.\n", "\n", @@ -934,30 +935,12 @@ "Note that `max_power` is **positive**, as this strategy models a generator offering energy. For a **consumer or demand bid**, the volume would be **negative** to reflect load withdrawal." 
] }, - { - "cell_type": "markdown", - "id": "155f2f1e", - "metadata": {}, - "source": [ - "### 5.4 Why We Store Everything in `unit.outputs`\n", - "\n", - "The outputs of the bidding process are stored in two places:\n", - "\n", - "* `unit.outputs[\"rl_observations\"]` and `[\"rl_actions\"]`:\n", - " Stored as lists to be written into the replay buffer for learning.\n", - "\n", - "* `unit.outputs[\"actions\"]` and `[\"exploration_noise\"]`:\n", - " Stored as `pandas.Series` for compatibility with the unit’s internal logging and database structure.\n", - "\n", - "This dual storage ensures that both the simulation engine and the learning backend have access to the relevant data.\n" - ] - }, { "cell_type": "markdown", "id": "7838b5df", "metadata": {}, "source": [ - "### 5.5 Controlling Action Dimensions\n", + "### 5.4 Controlling Action Dimensions\n", "\n", "By changing the `act_dim` in the strategy constructor, you can control the number of outputs returned by the actor network:\n", "\n", @@ -981,7 +964,7 @@ "id": "be5a6cd5", "metadata": {}, "source": [ - "### 5.6 Full Code Implementation\n", + "### 5.5 Full Code Implementation\n", "\n", "Here is the complete `calculate_bids()` implementation:" ] @@ -1033,23 +1016,20 @@ "\n", " start = product_tuples[0][0]\n", " end_all = product_tuples[-1][1]\n", - " # =============================================================================\n", - " # 1. Get the observations, which are the basis of the action decision\n", - " # =============================================================================\n", + "\n", " next_observation = self.create_observation(\n", " unit=unit,\n", " market_id=market_config.market_id,\n", " start=start,\n", " end=end_all,\n", " )\n", - "\n", " # =============================================================================\n", - " # 2. Get the actions, based on the observations\n", + " # Get the Actions, based on the observations\n", " # =============================================================================\n", " actions, noise = self.get_actions(next_observation)\n", "\n", " # =============================================================================\n", - " # 3. Transform actions into bids\n", + " # 3. 
Transform Actions into bids\n", " # =============================================================================\n", " # the absolute value of the action determines the bid price\n", " bid_price = abs(actions[0]) * self.max_bid_price\n", @@ -1059,7 +1039,6 @@ " elif actions[0] >= 0:\n", " bid_direction = \"sell\"\n", "\n", - " # these are function from the technical representation of storages\n", " _, max_discharge = unit.calculate_min_max_discharge(start, end_all)\n", " _, max_charge = unit.calculate_min_max_charge(start, end_all)\n", "\n", @@ -1092,12 +1071,8 @@ " }\n", " )\n", "\n", - " unit.outputs[\"rl_observations\"].append(next_observation)\n", - " unit.outputs[\"rl_actions\"].append(actions)\n", - "\n", - " # store results in unit outputs as series to be written to the database by the unit operator\n", - " unit.outputs[\"actions\"].at[start] = actions\n", - " unit.outputs[\"exploration_noise\"].at[start] = noise\n", + " if self.learning_mode:\n", + " self.learning_role.add_actions_to_cache(self.unit_id, start, actions, noise)\n", "\n", " return bids" ] @@ -1259,10 +1234,13 @@ " )\n", "\n", " # === Store results ===\n", + " # Note: these are not learning-specific results but stored for all units for analysis\n", " unit.outputs[\"profit\"].loc[start:end_excl] += profit\n", - " unit.outputs[\"reward\"].loc[start:end_excl] = reward\n", - " unit.outputs[\"total_costs\"].loc[start:end_excl] = order_cost\n", - " unit.outputs[\"rl_rewards\"].append(reward)" + " unit.outputs[\"total_costs\"].loc[start:end_excl] += order_cost\n", + "\n", + " # write rl-rewards to buffer\n", + " if self.learning_mode:\n", + " self.learning_role.add_reward_to_cache(unit.id, start, reward, 0, profit)" ] }, { @@ -1397,12 +1375,14 @@ " duration_hours=duration,\n", " max_bid_price=self.max_bid_price,\n", " )\n", - "\n", " # === Store results ===\n", + " # Note: these are not learning-specific results but stored for all units for analysis\n", " unit.outputs[\"profit\"].loc[start:end_excl] += profit\n", - " unit.outputs[\"reward\"].loc[start:end_excl] = reward\n", - " unit.outputs[\"total_costs\"].loc[start:end_excl] = order_cost\n", - " unit.outputs[\"rl_rewards\"].append(reward)" + " unit.outputs[\"total_costs\"].loc[start:end_excl] += order_cost\n", + "\n", + " # write rl-rewards to buffer\n", + " if self.learning_mode:\n", + " self.learning_role.add_reward_to_cache(unit.id, start, reward, 0, profit)" ] }, { @@ -1471,6 +1451,7 @@ "\n", "| Parameter | Description |\n", "| --------------------------------------------- | ------------------------------------------------------------------------------------- |\n", + "| **learning\\_mode** | If `True`, performs the policy updates and evaluates learned policies. | \n", "| **continue\\_learning** | If `True`, resumes training from saved policy checkpoints. |\n", "| **trained\\_policies\\_save\\_path** | File path where trained policies will be saved. |\n", "| **trained\\_policies\\_load\\_path** | Path to pre-trained policies to load. |\n", @@ -1559,7 +1540,7 @@ " )\n", "\n", " # 4. Run the training phase\n", - " if world.learning_config.learning_mode:\n", + " if world.learning_mode:\n", " run_learning(world)\n", "\n", " # 5. Execute final evaluation run (no exploration)\n", @@ -1775,7 +1756,7 @@ "\n", "One key factor is that Storage 2 quickly learns that its actions can **influence the market price** — it becomes a price setter in certain hours. 
This feedback between its bidding strategy and the resulting price allows it to understand the reward signal more clearly and improve faster. In contrast, Storage 1 rarely becomes price-setting and thus finds it harder to link its actions to outcomes. Without this feedback loop, learning is significantly slower or even stagnant. Here we can see a slight increase in the evaluation reward at the end, which indicates that Storage 1 might recover.\n",
     "\n",
-    "To mitigate this, we often use a **warm start** strategy in practice: agents are initialised with policies that have already learned basic behavioural patterns, such as first charge and then discharge or how to bid in a stationary environment. This helps agents reach the price-setting regime more quickly and facilitates meaningful learning, especially in multi-agent setups.\n"
+    "To mitigate this, we often use a **warm start** strategy in practice: agents are initialized with policies that have already learned basic behavioral patterns, such as first charge and then discharge or how to bid in a stationary environment. This helps agents reach the price-setting regime more quickly and facilitates meaningful learning, especially in multi-agent setups.\n"
    ]
   },
   {
@@ -1893,7 +1874,7 @@
     "- **Blue dots** indicate charging actions (buy bids), where the storage unit purchases electricity at lower prices.\n",
     "- **Orange dots** represent discharging actions (sell bids), where electricity is sold back to the market at higher prices.\n",
     "\n",
-    "From the visual distribution, we can observe a typical storage behaviour:\n",
+    "From the visual distribution, we can observe a typical storage behavior:\n",
     "- Charging occurs during **low-price hours**, typically at night or early morning.\n",
     "- Discharging is concentrated in **higher-price hours**, typically in the afternoon or evening.\n",
     "\n",
@@ -1983,17 +1964,11 @@
     "* If you are interested in the general algorithm behind the MADDPG and how it is integrated into ASSUME, look into [04a_RL_algorithm_example](./04a_reinforcement_learning_algorithm_example.ipynb) \n",
     "* In the small example we could see what a good bidding behavior of the agent might be and, hence, could judge learning easily, but what if we model many agents in new simulations? 
We provide explainable RL mechanisms in another tutorial for you to dive into [09_example_Sim_and_xRL](./09_example_Sim_and_xRL.ipynb) \n" ] - }, - { - "cell_type": "markdown", - "id": "03d38ae6", - "metadata": {}, - "source": [] } ], "metadata": { "kernelspec": { - "display_name": "assume-framework", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -2007,7 +1982,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.9" + "version": "3.13.9" } }, "nbformat": 4, diff --git a/examples/notebooks/09_example_Sim_and_xRL.ipynb b/examples/notebooks/09_example_Sim_and_xRL.ipynb index 6a026608a..97b0ea452 100644 --- a/examples/notebooks/09_example_Sim_and_xRL.ipynb +++ b/examples/notebooks/09_example_Sim_and_xRL.ipynb @@ -428,6 +428,7 @@ " }\n", " },\n", " \"learning_config\": {\n", + " \"learning_mode\": True,\n", " \"continue_learning\": False,\n", " \"max_bid_price\": 100,\n", " \"algorithm\": \"matd3\",\n", @@ -523,10 +524,10 @@ " world.learning_role.rl_algorithm.initialize_policy()\n", "\n", " # check if we already stored policies for this simulation\n", - " save_path = world.learning_config.trained_policies_save_path\n", + " save_path = world.learning_role.learning_config.trained_policies_save_path\n", "\n", " if Path(save_path).is_dir():\n", - " if world.learning_config.continue_learning:\n", + " if world.learning_role.learning_config.continue_learning:\n", " logger.warning(\n", " f\"Save path '{save_path}' exists.\\n\"\n", " \"You are in continue learning mode. New strategies may overwrite previous ones.\\n\"\n", @@ -566,7 +567,7 @@ " # Information that needs to be stored across episodes, aka one simulation run\n", " inter_episodic_data = {\n", " \"buffer\": ReplayBuffer(\n", - " buffer_size=world.learning_config.replay_buffer_size,\n", + " buffer_size=world.learning_role.learning_config.replay_buffer_size,\n", " obs_dim=world.learning_role.rl_algorithm.obs_dim,\n", " act_dim=world.learning_role.rl_algorithm.act_dim,\n", " n_rl_units=len(world.learning_role.rl_strats),\n", @@ -586,16 +587,17 @@ " # -----------------------------------------\n", "\n", " validation_interval = min(\n", - " world.learning_role.training_episodes,\n", - " world.learning_config.validation_episodes_interval,\n", + " world.learning_role.learning_config.training_episodes,\n", + " world.learning_role.learning_config.validation_episodes_interval,\n", " )\n", "\n", " # Ensure training episodes exceed the sum of initial experience and one evaluation interval\n", " min_required_episodes = (\n", - " world.learning_role.episodes_collecting_initial_experience + validation_interval\n", + " world.learning_role.learning_config.episodes_collecting_initial_experience\n", + " + validation_interval\n", " )\n", "\n", - " if world.learning_role.training_episodes < min_required_episodes:\n", + " if world.learning_role.learning_config.training_episodes < min_required_episodes:\n", " raise ValueError(\n", " f\"Training episodes ({world.learning_role.training_episodes}) must be greater than the sum of initial experience episodes ({world.learning_role.episodes_collecting_initial_experience}) and evaluation interval ({validation_interval}).\"\n", " )\n", @@ -603,7 +605,7 @@ " eval_episode = 1\n", "\n", " for episode in tqdm(\n", - " range(1, world.learning_role.training_episodes + 1),\n", + " range(1, world.learning_role.learning_config.training_episodes + 1),\n", " desc=\"Training Episodes\",\n", " ):\n", " # -----------------------------------------\n", @@ -624,12 
+626,13 @@ "\n", " # -----------------------------------------\n", " # Store the entire buffer for xAI workflow\n", - " if episode == world.learning_role.training_episodes:\n", + " if episode == world.learning_role.learning_config.training_episodes:\n", " export = inter_episodic_data[\"buffer\"].observations.tolist()\n", "\n", " with open(\n", " os.path.join(\n", - " world.learning_role.trained_policies_save_path, \"buffer_obs.json\"\n", + " world.learning_role.learning_config.trained_policies_save_path,\n", + " \"buffer_obs.json\",\n", " ),\n", " \"w\",\n", " ) as f:\n", @@ -639,7 +642,7 @@ " if (\n", " episode % validation_interval == 0\n", " and episode\n", - " >= world.learning_role.episodes_collecting_initial_experience\n", + " >= world.learning_role.learning_config.episodes_collecting_initial_experience\n", " + validation_interval\n", " ):\n", " world.reset()\n", @@ -683,11 +686,11 @@ " # save the policies after each episode in case the simulation is stopped or crashes\n", " if (\n", " episode\n", - " >= world.learning_role.episodes_collecting_initial_experience\n", + " >= world.learning_role.learning_config.episodes_collecting_initial_experience\n", " + validation_interval\n", " ):\n", " world.learning_role.rl_algorithm.save_params(\n", - " directory=f\"{world.learning_role.trained_policies_save_path}/last_policies\"\n", + " directory=f\"{world.learning_role.learning_config.trained_policies_save_path}/last_policies\"\n", " )\n", "\n", " # container shutdown implicitly with new initialisation\n", @@ -701,7 +704,7 @@ " # especially if previous strategies were loaded from an external source.\n", " # This is useful when continuing from a previous learning session.\n", " world.scenario_data[\"config\"][\"learning_config\"][\"trained_policies_load_path\"] = (\n", - " f\"{world.learning_role.trained_policies_save_path}/avg_reward_eval_policies\"\n", + " f\"{world.learning_role.learning_config.trained_policies_save_path}/avg_reward_eval_policies\"\n", " )\n", "\n", " # load scenario for evaluation\n", @@ -798,7 +801,7 @@ ")\n", "\n", "# If learning mode is enabled, run the reinforcement learning loop\n", - "if world.learning_config.learning_mode:\n", + "if world.learning_mode:\n", " run_learning(world)\n", "\n", "# Run the simulation\n", @@ -956,7 +959,7 @@ "outputs": [], "source": [ "!pip install matplotlib\n", - "!pip install shap==0.47.1\n", + "!pip install shap==0.50.0\n", "!pip install scikit-learn==1.6.1" ] }, @@ -1100,7 +1103,9 @@ "import pandas as pd\n", "import shap\n", "import torch as th\n", - "from sklearn.model_selection import train_test_split" + "from sklearn.model_selection import train_test_split\n", + "\n", + "th.manual_seed(42)" ] }, { diff --git a/examples/notebooks/10a_DSU_and_flexibility.ipynb b/examples/notebooks/10a_DSU_and_flexibility.ipynb index 1d106c2d8..34ef00e1d 100644 --- a/examples/notebooks/10a_DSU_and_flexibility.ipynb +++ b/examples/notebooks/10a_DSU_and_flexibility.ipynb @@ -776,16 +776,16 @@ "metadata": {}, "outputs": [], "source": [ - "import os\n", + "from pathlib import Path\n", "\n", "from IPython.display import Image, display\n", "\n", - "image_path = \"assume-repo/docs/source/img/dsm_integration.PNG\"\n", - "alt_image_path = \"../../docs/source/img/dsm_integration.PNG\"\n", + "image_path = Path(\"assume-repo/docs/source/img/dsm_integration.PNG\")\n", + "alt_image_path = Path(\"../../docs/source/img/dsm_integration.PNG\")\n", "\n", - "if os.path.exists(image_path):\n", + "if image_path.exists():\n", " display(Image(image_path))\n", - "elif 
os.path.exists(alt_image_path):\n", + "elif alt_image_path.exists():\n", " display(Image(alt_image_path))" ] }, @@ -847,12 +847,12 @@ "metadata": {}, "outputs": [], "source": [ - "image_path = \"assume-repo/docs/source/img/Demand_Attribute.png\"\n", - "alt_image_path = \"../../docs/source/img/Demand_Attribute.png\"\n", + "image_path = Path(\"assume-repo/docs/source/img/Demand_Attribute.png\")\n", + "alt_image_path = Path(\"../../docs/source/img/Demand_Attribute.png\")\n", "\n", - "if os.path.exists(image_path):\n", + "if image_path.exists():\n", " display(Image(image_path, width=600))\n", - "elif os.path.exists(alt_image_path):\n", + "elif alt_image_path.exists():\n", " display(Image(alt_image_path, width=600))" ] }, @@ -884,12 +884,12 @@ "metadata": {}, "outputs": [], "source": [ - "image_path = \"assume-repo/docs/source/img/Industry.png\"\n", - "alt_image_path = \"../../docs/source/img/Industry.png\"\n", + "image_path = Path(\"assume-repo/docs/source/img/Industry.png\")\n", + "alt_image_path = Path(\"../../docs/source/img/Industry.png\")\n", "\n", - "if os.path.exists(image_path):\n", + "if image_path.exists():\n", " display(Image(image_path, width=600))\n", - "elif os.path.exists(alt_image_path):\n", + "elif alt_image_path.exists():\n", " display(Image(alt_image_path, width=600))" ] }, @@ -900,12 +900,12 @@ "metadata": {}, "outputs": [], "source": [ - "image_path = \"assume-repo/docs/source/img/Building.png\"\n", - "alt_image_path = \"../../docs/source/img/Building.png\"\n", + "image_path = Path(\"assume-repo/docs/source/img/Building.png\")\n", + "alt_image_path = Path(\"../../docs/source/img/Building.png\")\n", "\n", - "if os.path.exists(image_path):\n", + "if image_path.exists():\n", " display(Image(image_path, width=600))\n", - "elif os.path.exists(alt_image_path):\n", + "elif alt_image_path.exists():\n", " display(Image(alt_image_path, width=600))" ] }, @@ -1448,12 +1448,12 @@ "metadata": {}, "outputs": [], "source": [ - "image_path = \"assume-repo/docs/source/img/Peak load.png\"\n", - "alt_image_path = \"../../docs/source/img/Peak load.png\"\n", + "image_path = Path(\"assume-repo/docs/source/img/Peak load.png\")\n", + "alt_image_path = Path(\"../../docs/source/img/Peak load.png\")\n", "\n", - "if os.path.exists(image_path):\n", + "if image_path.exists():\n", " display(Image(image_path))\n", - "elif os.path.exists(alt_image_path):\n", + "elif alt_image_path.exists():\n", " display(Image(alt_image_path))" ] }, @@ -1481,12 +1481,12 @@ "metadata": {}, "outputs": [], "source": [ - "image_path = \"assume-repo/docs/source/img/RE availability.png\"\n", - "alt_image_path = \"../../docs/source/img/RE availability.png\"\n", + "image_path = Path(\"assume-repo/docs/source/img/RE availability.png\")\n", + "alt_image_path = Path(\"../../docs/source/img/RE availability.png\")\n", "\n", - "if os.path.exists(image_path):\n", + "if image_path.exists():\n", " display(Image(image_path))\n", - "elif os.path.exists(alt_image_path):\n", + "elif alt_image_path.exists():\n", " display(Image(alt_image_path))" ] }, @@ -2932,7 +2932,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "assume-framework", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -2946,7 +2946,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.9" + "version": "3.13.9" } }, "nbformat": 4, diff --git a/examples/notebooks/11_redispatch.ipynb b/examples/notebooks/11_redispatch.ipynb index 34bd28b69..2c3d48f41 100644 --- a/examples/notebooks/11_redispatch.ipynb 
+++ b/examples/notebooks/11_redispatch.ipynb
@@ -4,9 +4,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "# 11. Redispatch modelling using PyPSA\n",
+    "# 11. Redispatch modeling using PyPSA\n",
     "\n",
-    "This tutorial demonstrates modelling and simulation of redispatch mechanism using PyPSA as a plug and play module in ASSUME-framework. The model will be created mainly taking grid constraints into consideration to identify grid bottlenecks with dispatches from EOM and resolve them using the redispatch algorithm.\n",
+    "This tutorial demonstrates modeling and simulation of the redispatch mechanism using PyPSA as a plug-and-play module in the ASSUME framework. The model is built with grid constraints taken into consideration in order to identify grid bottlenecks resulting from EOM dispatch and to resolve them using the redispatch algorithm.\n",
     "\n",
     "## Concept of Redispatch\n",
     "\n",
diff --git a/examples/notebooks/11a_redispatch_dsm.ipynb b/examples/notebooks/11a_redispatch_dsm.ipynb
index aaff718ba..f1672f817 100644
--- a/examples/notebooks/11a_redispatch_dsm.ipynb
+++ b/examples/notebooks/11a_redispatch_dsm.ipynb
@@ -7,11 +7,11 @@
   "id": "344c88d7"
  },
  "source": [
-    "# 11a. Redispatch modelling in the ASSUME Framework\n",
+    "# 11a. Redispatch modeling in the ASSUME Framework\n",
     "\n",
     "Welcome to the ASSUME DSM Workshop!\n",
     "\n",
-    "This tutorial demonstrates modelling and simulation of redispatch mechanism using **PyPSA** as a plug and play module in **ASSUME-framework**. The model will be created mainly taking grid constraints into consideration to identify grid bottlenecks with dispatches from EOM and resolve them using the redispatch algorithm.\n",
+    "This tutorial demonstrates modeling and simulation of the redispatch mechanism using **PyPSA** as a plug-and-play module in the **ASSUME framework**. The model is built with grid constraints taken into consideration in order to identify grid bottlenecks resulting from EOM dispatch and to resolve them using the redispatch algorithm.\n",
     "\n",
     "---\n",
     "\n",
@@ -30,9 +30,9 @@
     "\n",
     "### Key Sections\n",
     "\n",
-    "- **Section 1:** 3 node example for modelling Redispatch (Hands-on)\n",
-    "- **Section 2:** 3 node example for modelling DSM Units ( Demonstration)\n",
-    "- **Section 3:** Germany scale example for modelling Redispatch (Demonstration)\n"
+    "- **Section 1:** 3-node example for modeling Redispatch (Hands-on)\n",
+    "- **Section 2:** 3-node example for modeling DSM Units (Demonstration)\n",
+    "- **Section 3:** Germany-scale example for modeling Redispatch (Demonstration)\n"
    ]
   },
  {