2 changes: 1 addition & 1 deletion README.md
@@ -35,7 +35,7 @@ This approach is practical for modeling interactions among competing market part
To support market design analysis in transforming electricity systems, we developed the ASSUME framework - a flexible and modular agent-based modeling tool for electricity market research.
ASSUME enables researchers to customize components such as agent representations, market configurations, and bidding strategies, utilizing pre-built modules for standard operations.
With the setup in ASSUME, researchers can simulate strategic interactions in electricity markets under a wide range of scenarios, from comparing market designs and modeling congestion management to analyzing the behavior of learning storage operators and renewable producers.
The framework supports studies on bidding under uncertainty, regulatory interventions, and multi-agent dynamics, making it ideal for exploring emergent behaviour and testing new market mechanisms.
The framework supports studies on bidding under uncertainty, regulatory interventions, and multi-agent dynamics, making it ideal for exploring emergent behavior and testing new market mechanisms.
ASSUME has been utilized in research studies addressing diverse questions in electricity market design and operation.
It has explored the role of complex bids, demonstrated the effects of industrial demand-side flexibility for congestion management, and advanced the explainability of emergent strategies in learning agents.

4 changes: 3 additions & 1 deletion assume/reinforcement_learning/learning_utils.py
@@ -212,7 +212,9 @@ def transform_buffer_data(nested_dict: dict, device: th.device) -> np.ndarray:
for values in unit_data.values():
if values:
val = values[0]
feature_dim = 1 if val.ndim == 0 else len(val)
feature_dim = (
1 if isinstance(val, (int | float)) or val.ndim == 0 else len(val)
)
break
if feature_dim is not None:
break
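
The added isinstance guard matters because plain Python numbers carry no ndim attribute, so checking val.ndim alone would raise an AttributeError for them, which appears to be what this change protects against. A minimal sketch of the distinction, using invented values rather than actual buffer contents:

import numpy as np

# Illustrative values only, not real buffer entries.
plain_float = 3.5                # plain float: no .ndim, caught by the isinstance check
zero_d = np.array(3.5)           # 0-dimensional array: .ndim == 0, so feature_dim = 1
vector = np.array([1.0, 2.0])    # 1-dimensional array: feature_dim = len(vector) = 2

for val in (plain_float, zero_d, vector):
    feature_dim = 1 if isinstance(val, int | float) or val.ndim == 0 else len(val)
    print(type(val).__name__, feature_dim)   # float 1, ndarray 1, ndarray 2
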
@@ -198,7 +198,7 @@ def __init__(
act_dim: int,
float_type,
unique_obs_dim: int,
num_timeseries_obs_dim: int = 3,
num_timeseries_obs_dim: int,
*args,
**kwargs,
):
14 changes: 9 additions & 5 deletions assume/scenario/loader_csv.py
@@ -757,7 +757,13 @@ def setup_world(

bidding_params = config.get("bidding_strategy_params", {})

# handle initial learning parameters before leanring_role exists
if config.get("learning_mode"):
raise ValueError(
"The 'learning_mode' parameter in the top-level of the config.yaml has been moved to 'learning_config'. "
"Please adjust your config file accordingly."
)

# handle initial learning parameters before learning_role exists
learning_dict = config.get("learning_config", {})
# those settings need to be overridden before passing to the LearningConfig
if learning_dict:
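
For orientation, a minimal sketch of the relocation the ValueError above asks for, written as the dictionaries the loader sees after parsing config.yaml; only the two keys named in the message are shown, and the flag value is an assumption:

# Sketch only: minimal parsed config structures (assumed values).
old_config = {
    "learning_mode": True,   # top-level key, now rejected with the ValueError above
    "learning_config": {},
}

new_config = {
    "learning_config": {
        "learning_mode": True,   # moved inside learning_config
    },
}
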
@@ -1030,15 +1036,13 @@ def run_learning(

Args:
world (World): An instance of the World class representing the simulation environment.
inputs_path (str): The path to the folder containing input files necessary for the simulation.
scenario (str): The name of the scenario for the simulation.
study_case (str): The specific study case for the simulation.
verbose (bool, optional): A flag indicating whether to enable verbose logging. Defaults to False.

Note:
- The function uses a ReplayBuffer to store experiences for training the DRL agents.
- It iterates through training episodes, updating the agents and evaluating their performance at regular intervals.
- Initial exploration is active at the beginning and is disabled after a certain number of episodes to improve the performance of DRL algorithms.
- Upon completion of training, the function performs an evaluation run using the best policy learned during training.
- Upon completion of training, the function performs an evaluation run using the last policy learned during training.
- The best policies are chosen based on the average reward obtained during the evaluation runs, and they are saved for future use.
"""
from assume.reinforcement_learning.buffer import ReplayBuffer
2 changes: 1 addition & 1 deletion assume/world.py
@@ -202,7 +202,7 @@ def setup(
simulation_id (str): The unique identifier for the simulation.
save_frequency_hours (int): The frequency (in hours) at which to save simulation data.
bidding_params (dict, optional): Parameters for bidding. Defaults to an empty dictionary.
learning_config (dict | None, optional): Configuration for the learning process. Defaults to None.
learning_dict (dict, optional): Configuration for the learning process. Defaults to an empty dictionary.
manager_address: The address of the manager.
**kwargs: Additional keyword arguments.

6 changes: 2 additions & 4 deletions docs/source/index.rst
@@ -11,7 +11,7 @@ its primary objectives are to ensure usability and customizability for a wide ra
users and use cases in the energy system modeling community.

The unique feature of the ASSUME tool-box is the integration of **Deep Reinforcement
Learning** methods into the behavioural strategies of market agents.
Learning** methods into the behavioral strategies of market agents.
The model offers various predefined agent representations for both the demand and
generation sides, which can be used as plug-and-play modules, simplifying the
reinforcement of learning strategies. This setup enables research into new market
@@ -70,7 +70,6 @@ Documentation
examples_basic
example_simulations

**User Guide**

User Guide
==========
@@ -115,8 +114,7 @@ User Guide
assume


Indices and tables
==================
**Indices & Tables**

* :ref:`genindex`
* :ref:`modindex`
4 changes: 2 additions & 2 deletions docs/source/introduction.rst
@@ -21,7 +21,7 @@ Architecture
In the following figure the architecture of the framework is depicted. It can be roughly divided into two parts.
On the left side of the world class the markets are located and on the right side the market participants,
which are here named units. Both world are connected via the orders that market participants place on the markets.
The learning capacbility is sketched out with the yellow classes on the right side, namely the units side.
The learning capability is sketched out with the yellow classes on the right side, namely the units side.

.. image:: img/architecture.svg
:align: center
@@ -79,7 +79,7 @@ Market Participants
===================

The market participants, here labeled units, comprise all entities acting in the respective markets and are at
the core of any agent-based simulation model. The entirety of their behaviour leads to the market and system
the core of any agent-based simulation model. The entirety of their behavior leads to the market and system
outcome as a bottom-up simulation model, respectively.

Modularity of Units
59 changes: 53 additions & 6 deletions docs/source/learning.rst
@@ -24,7 +24,7 @@ The Basics of Reinforcement Learning
In general, RL and deep reinforcement learning (DRL) in particular, open new prospects for agent-based electricity market modeling.
Such algorithms offer the potential for agents to learn bidding strategies in the interplay between market participants.
In contrast to traditional rule-based approaches, DRL allows for a faster adaptation of the bidding strategies to a changing market
environment, which is impossible with fixed strategies that a market modeller explicitly formulates. Hence, DRL algorithms offer the
environment, which is impossible with fixed strategies that a market modeler explicitly formulates. Hence, DRL algorithms offer the
potential for simulated electricity market agents to develop bidding strategies for future markets and test emerging markets' mechanisms
before their introduction into real-world systems.

@@ -139,7 +139,7 @@ The Actor

We will explain the way learning works in ASSUME starting from the interface to the simulation, namely the bidding strategy of the power plants.
The bidding strategy, per definition in ASSUME, defines the way we formulate bids based on the technical restrictions of the unit.
In a learning setting, this is done by the actor network. Which maps the observation to an action. The observation thereby is managed and collected by the units operator as
In a learning setting, this is done by the actor network which maps the observation to an action. The observation thereby is managed and collected by the units operator as
summarized in the following picture. As you can see in the current working version, the observation space contains a residual load forecast for the next 24 hours and a price
forecast for 24 hours, as well as the current capacity of the power plant and its marginal costs.

@@ -148,15 +148,15 @@ forecast for 24 hours, as well as the current capacity of the power plant and it
:width: 500px

The action space is a continuous space, which means that the actor can choose any price between 0 and the maximum bid price defined in the code. It gives two prices for two different parts of its capacity.
One, namley :math:`p_{inflex}` for the minimum capacity of the power plant and one for the rest ( :math:`p_{flex}`). The action space is defined in the config file and can be adjusted to your needs.
One, namely :math:`p_{inflex}` for the minimum capacity of the power plant and one for the rest ( :math:`p_{flex}`). The action space is defined in the config file and can be adjusted to your needs.
After the bids are formulated in the bidding strategy they are sent to the market via the units operator.

.. image:: img/ActorOutput.jpg
:align: center
:width: 500px
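
To make the action space more tangible, here is a small sketch of how two actor outputs could be scaled to the two prices described above; the function name, the linear scaling, and the bid format are illustrative assumptions, not the ASSUME implementation.

.. code-block:: python

   def actions_to_bids(actions, max_bid_price, p_min, p_max):
       """Map two actor outputs in [0, 1] to price-volume pairs for the
       inflexible (minimum) capacity and the remaining flexible capacity."""
       price_inflex = actions[0] * max_bid_price   # p_inflex for the minimum capacity
       price_flex = actions[1] * max_bid_price     # p_flex for the rest
       return [
           {"price": price_inflex, "volume": p_min},
           {"price": price_flex, "volume": p_max - p_min},
       ]

   # Invented example: 300 MW plant, 100 MW minimum capacity, 100 EUR/MWh bid cap.
   print(actions_to_bids([0.2, 0.5], max_bid_price=100, p_min=100, p_max=300))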

In the case you are eager to integrate different learning bidding strategies or equip a new unit with learning,
you need to touch these methods. To enable an easy start with the use of reinforcement learning in ASSUME we provide a tutorial in colab on github.
you need to touch these methods. To enable an easy start with the use of reinforcement learning in ASSUME we provide a tutorial in colab on GitHub.

The Critic
----------
Expand All @@ -175,8 +175,14 @@ You can read more about the different algorithms and the learning role in :doc:`
The Learning Results in ASSUME
=====================================

Similarly to the other results, the learning progress is tracked in the database, either with postgresql or timescale. The latter enables the usage of the
predefined dashboards to track the leanring process in the "Assume:Training Process" dashboard. The following pictures show the learning process of a simple reinforcement learning setting.
Learning results are not easy to understand and judge. ASSUME supports different visualizations to track the learning progress.
Furthermore, we want to raise awareness of common pitfalls in interpreting learning results.

Visualizations
--------------

Similarly to the other results, the learning progress is tracked in the database, either with PostgreSQL or TimescaleDB. The latter enables the usage of the
predefined dashboards to track the learning process in the "ASSUME:Training Process" dashboard. The following pictures show the learning process of a simple reinforcement learning setting.
A more detailed description is given in the dashboard itself.

.. image:: img/Grafana_Learning_1.jpeg
@@ -207,3 +213,44 @@ After starting the server, open the following URL in your browser:

TensorBoard will then display dashboards for scalars, histograms, graphs, projectors, and other relevant visualizations, depending on the metrics that
the training pipeline currently exports.

Interpretation
--------------

Once the environment and learning algorithm are specified, agents are trained and behaviors begin to emerge. The modeler (you) analyzes the reward in the
visualizations described above. This raises a basic modeling question:

*How can we judge whether what has been learned is meaningful?*

Unlike supervised learning, we do not have a ground-truth target or an error metric that reliably decreases as behavior improves. In multi-agent settings,
the notion of an “optimal” solution is often unclear. What we *do* observe are rewards – signals chosen by the modeler. How informative these signals are
depends heavily on the reward design and on how other agents behave. Therefore:

**Do not rely on rewards alone.** Behavior itself must be examined carefully.

**Why solely reward-based evaluation is problematic**

Let :math:`R_i` denote the episodic return of agent :math:`i` under the joint policy :math:`\pi=(\pi_1,\dots,\pi_n)`. A common but potentially misleading
heuristic is to evaluate behavior by the total reward,

.. math::

   S(\pi) = \sum_{i=1}^n \mathbb{E}[R_i].

A larger :math:`S(\pi)` does *not* imply that the learned behavior is better or more stable. In a multi-agent environment, each agent’s learning alters the
effective environment faced by the others. The same policy can therefore earn very different returns depending on which opponent snapshot it encounters. High
aggregate rewards can arise from:

* temporary exploitation of weaknesses of other agents,
* coordination effects that occur by chance rather than by design,
* behavior that works against training opponents but fails in other situations.

Rewards are thus, at best, an indirect proxy for “good behavior.” They measure how well a policy performs *under the specific reward function and opponent
behavior*, not whether it is robust, interpretable, or aligned with the modeler’s intent.
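
As a stylized numerical illustration (the numbers are invented for exposition): suppose that under joint policy :math:`\pi^A` agent 1 earns a return of 10 and agent 2 earns 0, while under :math:`\pi^B` both agents earn 4, so

.. math::

   S(\pi^A) = 10 + 0 = 10 > 8 = 4 + 4 = S(\pi^B).

Policy :math:`\pi^A` wins on aggregate reward, but if agent 1's return stems from exploiting a weak training snapshot of agent 2, the advantage can disappear as soon as agent 2 improves, while :math:`\pi^B` may remain stable.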

**Implications for policy selection**

This issue becomes visible when deciding which policy to evaluate at the end of training. We generally store (i) the policy with the highest average reward and
(ii) the final policy. However, these two can differ substantially in their behavior. The framework therefore uses the **final policy** for evaluation to
avoid selecting a high-reward snapshot that may be far from stable.

The most robust learning performance can be achieved through **early stopping** with a very large number of episodes. In that case, training halts once results
are stable, and the final policy is likely also the stable one. This behavior should be monitored by the modeler in TensorBoard.
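
As one possible aid, a minimal sketch of such a stability check on the history of evaluation rewards; the function name, window size, and tolerance are illustrative and not part of the ASSUME API.

.. code-block:: python

   from statistics import mean

   def rewards_are_stable(eval_rewards, window=10, tolerance=0.02):
       """Return True once the moving average of evaluation rewards changes
       by at most `tolerance` (relative) between the last two windows."""
       if len(eval_rewards) < 2 * window:
           return False
       previous = mean(eval_rewards[-2 * window:-window])
       current = mean(eval_rewards[-window:])
       return abs(current - previous) / max(abs(previous), 1e-8) <= tolerance

   # Invented reward history that has flattened out; prints True.
   history = [2.5, 2.7, 2.8, 2.79, 2.81, 2.8, 2.8, 2.82, 2.8, 2.79,
              2.8, 2.81, 2.8, 2.79, 2.81, 2.8, 2.8, 2.81, 2.8, 2.8]
   print(rewards_are_stable(history))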