
docs: Package structure RFC #1

Open · wants to merge 3 commits into `main`

Conversation

lewisjared
Contributor

@lewisjared lewisjared commented Oct 30, 2024

Description

Adds a new RFC for the proposed package structure of the repository.


@climate-dude climate-dude left a comment


  • What are "REF packages"? Are they the external metrics/benchmarking packages? Does it include CMEC, or just ESMValTool, ILAMB, PMP, etc. in this initial implementation?
  • Does this RFC propose that some version of those individual packages will be copied into this repo from their respective existing repositories? Does that mean modifications to the packages here may require cherry-picking and integration into the native package repos? It appears this complication may be addressed in bullets under Rationale and alternatives.

@lewisjared
Contributor Author

What are "REF packages"? Are they the external metrics/benchmarking packages? Does it include CMEC, or just ESMValTool, ILAMB, PMP, etc. in this initial implementation?

There will be one package per benchmarking package, describing the metrics that benchmarking package provides. That info is then used to determine which packages to call when new data arrives. I've been using the term MetricProvider for these external benchmarking packages, but I'm happy to tweak that language.

For packages that already interface with the CMEC driver, there will be some common functionality to help write out the required CMEC configuration files. The configuration will use as much of the existing EMDS standard/CMEC configuration as possible, with extensions to support this use-case.

I'll reword to be more precise.
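The dispatch described above (matching newly arrived data against the metrics each provider declares) could be sketched roughly like this. All names and the registry layout are invented purely for illustration; the RFC does not specify any of this:

```python
# Hypothetical sketch of provider dispatch: each MetricProvider package
# declares which variables its metrics cover, and incoming data is matched
# against those declarations. Provider names and layout are illustrative only.
PROVIDERS = {
    "ilamb": {"variables": {"gpp", "tas"}},
    "pmp": {"variables": {"pr", "psl"}},
}


def providers_for(variable: str) -> list[str]:
    """Return the providers whose declared metrics cover this variable."""
    return [name for name, spec in PROVIDERS.items() if variable in spec["variables"]]


print(providers_for("pr"))  # ['pmp']
```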

Does this RFC propose that some version of those individual packages will be copied into this repo from their respective existing repositories? Does that mean modifications to the packages here may require cherry-picking and integration into the native package repos? It appears this complication may be addressed in bullets under Rationale and alternatives.

No code needs to be vendored/copied and it should require little to no modification to the existing source. The packages will be mostly configuration and glue code.

@nocollier

Each metrics package will come with its own set of dependencies. Odds are that they will clash with other packages.

Maybe this is clear in everyone's mind but mine. Are we sure that this is the case? I ask this because it introduces a lot of complexity to REF that could be avoided if all packages can run in the same environment. It would depend on what REF wants from ILAMB, but if I can provide it from ilamb3 I currently depend on:

numpy
xarray
pint-xarray
cf_xarray
pandas
intake
matplotlib
cartopy

Is it worth someone (I volunteer, we need this for ESGF anyway) trying to install each package in the same environment and run tests? Or has a multi-environment structure already been decided on and I am just not realizing?

@nocollier

The framework will be an application rather than a library...

Can't it be both? I do this in ILAMB. It is a library, but when installing I also provide scripts in the setup:

https://github.com/rubisco-sfa/ILAMB/blob/6780ef0824a8a245ae60e518d5b5fc25f970f3d6/setup.py#L109

These are placed in your PATH when the package is installed and can be invoked like executables.
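The `scripts` approach described above looks roughly like this in a `setup.py`. The package and script names below are placeholders, not ILAMB's actual configuration:

```python
# Sketch of the setuptools "scripts" approach referenced above.
# Package and script names are placeholders for illustration only.
from setuptools import setup

setup(
    name="example-metrics",
    version="0.1.0",
    packages=["example_metrics"],
    # Files listed here are installed onto the user's PATH,
    # so the library can also be invoked like an application.
    scripts=["bin/example-run"],
)
```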

@nocollier

Splitting out each package to a separate repository is a possibility...

This would also give package providers the option to keep the REF glue in their main repository. ILAMB does this with auxiliary files that CMEC uses. It might be preferable for some to make sure testing of their own packages does not break REF connections.

@lee1043

lee1043 commented Oct 30, 2024

The framework will be an application rather than a library...

Can't it be both? I do this in ILAMB. It is a library, but when installing I also provide scripts in the setup:

https://github.com/rubisco-sfa/ILAMB/blob/6780ef0824a8a245ae60e518d5b5fc25f970f3d6/setup.py#L109

These are placed in your PATH when the package is installed and can be invoked like executables.

I second @nocollier's idea -- it might be able to be both. PMP does the same as ILAMB -- it is both an application and a library, with a setup.py for installation. It can also be installed via conda.

@acordonez

Is it worth someone (I volunteer, we need this for ESGF anyway) trying to install each package in the same environment and run tests? Or has a multi-environment structure already been decided on and I am just not realizing?

CMEC is designed such that every package has its own conda environment. The conda environment name is actually hard-coded for each package, but you can change the source code to use it differently. Historically I've had a hard time getting the PMP in particular to install in environments with other packages. Not opposed to doing it differently but currently that's how CMEC expects environments to work.

@minxu74

minxu74 commented Oct 30, 2024

Is it worth someone (I volunteer, we need this for ESGF anyway) trying to install each package in the same environment and run tests? Or has a multi-environment structure already been decided on and I am just not realizing?

CMEC is designed such that every package has its own conda environment. The conda environment name is actually hard-coded for each package, but you can change the source code to use it differently. Historically I've had a hard time getting the PMP in particular to install in environments with other packages. Not opposed to doing it differently but currently that's how CMEC expects environments to work.

I may be wrong. I originally thought that each package would be containerized with its own environment and benchmarking data, and would run in a separate container. Only model outputs and package outputs would be shared among these containers. @acordonez I wonder if CMEC can handle containers?

@acordonez

I may be wrong. I originally thought that each package would be containerized with its own environment and benchmarking data, and would run in a separate container. Only model outputs and package outputs would be shared among these containers. @acordonez I wonder if CMEC can handle containers?

@minxu74 Are you referring to Docker containers? We haven't built anything specific into CMEC to use containers, but cmec-driver is all Python and, as far as I know, should run in a container without major issues.

@lewisjared
Contributor Author

Thanks for the feedback everyone. I've pulled the comments from @nocollier and @lee1043 into a separate thread with a proposed change of text.

@lewisjared
Contributor Author

It does raise the need to come to an agreement on one environment vs multiple. I wouldn't say it has been decided, but it is a pretty fundamental decision that we should discuss. Perhaps that needs a discussion doc in itself? I'll add it to the agenda for tomorrow and then I can write up the results.

@lewisjared
Contributor Author

To throw my 2 cents in, I'm a strong advocate of multiple environments. I agree it adds additional complexity, but I've seen projects be burnt when trying to bring together multiple science packages (IIASA's scenario database, the AR6 WG3 climate assessment and openscm-runner (single interface for multiple simple climate models)). I'm sure the ESMValTool team has spent a lot of time fussing with packaging and the "blessed" environment.

A single environment might work now, but it's very fragile, as any one dependency may make the environment unsolvable or limit which packages can be part of the REF in future. ESMValTool has a large set of dependencies. I'm not sure if @bouweandela intends to pull in the whole ESMValTool system or just the core so this might not be quite as bad as it looks.

Splitting into multiple, decoupled and isolated environments would put the control of packaging into the hands of the benchmarking providers who know how best to do that. This will shift significant complexity into WP1 which I'm ok with because it results in a lower maintenance burden for benchmarking packages and a lower barrier to entry which I think is a good trade off.

@bouweandela

A single environment might work now, but it's very fragile, as any one dependency may make the environment unsolvable or limit which packages can be part of the REF in future. ESMValTool has a large set of dependencies. I'm not sure if @bouweandela intends to pull in the whole ESMValTool system or just the core so this might not be quite as bad as it looks.

It depends on which metrics/diagnostics ESMValTool should be able to run. For example, we have an esmvaltool-python package available on conda-forge (basically the esmvaltool package from PyPI with only the dependencies of diagnostics written in Python), so that would avoid the need for NCL/R/Julia dependencies.

While it is probably possible to install everything into a single environment (I expect the ESMValTool environment will also have all the requirements for the other diagnostic packages) most of the time, there may be moments where not everyone is in sync with the latest dependencies and then there will be trouble, so I agree that using isolated environments is the most reliable solution.

Using docker containers is great for reproducibility, and if users can just download them it shouldn't be difficult for them to install, but it does add an additional burden for metrics package maintainers if they need to create containers. It is also good to take into account that not all HPC systems allow running containers, not even Apptainer containers. Local (conda) environments are also great if you want to change the code of a metric to e.g. improve a plot. Therefore, I think it would be best for adoption if the REF could support both (Docker/Apptainer) containers and conda environments.

@lewisjared
Contributor Author

@bouweandela Do you have an example of an HPC facility that has a lot of restrictions?

The lack of docker would make the installation process painful, but it shouldn't be a deal-breaker. The impact would likely be felt in the inability to use other ancillary services that are easy to pull in as docker containers (web servers, caches, databases, etc.). It might require a customised deployment, as the lack of docker probably suggests other limitations too.

This package will be a dependency of all other packages as it will describe the interfaces that metrics providers must implement.
This allows us to keep the core functionality separate from the metrics providers.
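As a rough illustration of the kind of interface such a core package might describe, a metrics provider could implement an abstract base class like the one below. None of these names are fixed by the RFC; everything here is hypothetical:

```python
# Hypothetical sketch of the interface a core package might define for
# metrics providers. All class, method, and metric names are illustrative.
from abc import ABC, abstractmethod


class MetricsProvider(ABC):
    """What a benchmarking package would implement to plug into the REF."""

    @abstractmethod
    def metrics(self) -> list[str]:
        """Names of the metrics this provider offers."""

    @abstractmethod
    def run(self, metric: str, dataset: str) -> dict:
        """Calculate a single metric for a dataset and return the results."""


class ExampleProvider(MetricsProvider):
    """Toy provider showing the shape of an implementation."""

    def metrics(self) -> list[str]:
        return ["global-mean"]

    def run(self, metric: str, dataset: str) -> dict:
        return {"metric": metric, "dataset": dataset, "value": 0.0}
```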

### `ref-metrics-*`


It would be easier for people who want to use the REF with a custom metric provider if these were maintained outside of the main REF repository. If we move the example to a separate repository, it will be much easier for relative outsiders to see how to implement the interface because they don't have to dig through all of the REF infrastructure and configuration to find what they're looking for.

Contributor Author


Good point.

Let's migrate the example outside of this repository. It is probably best to wait until we are happy with a PoC, as it is easier to refactor all in one spot. We will also need to publish the ref-core package to PyPI/conda-forge first.

Perhaps something for the new year.


The package will be named `ref-metrics-<provider>` where `<provider>` is the name of the provider.

### `ref-api` (TBD)


The way I understand it, there are at least two use cases that each need their own application.

  1. An automated system running on an ESGF node that listens to the ESGF message queue announcing that new datasets have been published and then kicks off the required diagnostic runs and integrates/publishes the results.
  2. Human users who want to run the tool themselves; this could be from the command line or from a Jupyter notebook

It looks like this proposes 1), but maybe it would be more convenient to start with 2) as having that is convenient for development. Or is that covered somewhere else already?

Contributor Author


I would add/tweak another one:

  1. Production deployment
  2. Modelling center where they are ingesting their pre-published data. Probably by running a CLI command. This requires a bulk of the services needed for a production deployment if they want to track results.
  3. Running metrics directly via a common interface without any tracking of results

I'm assuming that modelling centers want some form of tracking, but that is only a guess, so we should validate it. The tracking requires a database, which is likely just SQLite, so no additional services are needed.

There is a lot of overlap between 1 and 2, but 3 is probably how benchmarking package maintainers will develop their packages. Each of the metrics packages should also be able to be run directly in a notebook if you don't require the complexity that comes with the "compute engine".

Comment on lines +74 to +75
The package may have indirect dependencies on the metrics packages to avoid conflicting dependencies.
That adds additional complexity, but will be more flexible in the long run.


I'm not sure what this means, I would recommend leaving this out or making it more explicit what is intended.

Another drawback is that it will be harder to manage different package managers.
The project currently uses [uv](https://docs.astral.sh/uv/) as a package manager, as the core and framework
will not require any complex-to-install dependencies,
but science packages may require more complex dependencies.


If I understand it correctly, we're not planning to do any science in the REF. The science should happen in the metrics/diagnostics packages. Therefore I would not expect that we need any science dependencies.

Contributor Author


Correct, but the ref-metrics-* packages would pull those science dependencies in. Those metric provider packages should probably live outside of the cmip-ref repository, like you suggested for the example package. That example package could then become a template. It would also provide some flexibility for metrics packages to use whatever package installer they need.

I'll update the text to reflect this.

@bouweandela
Copy link

bouweandela commented Nov 14, 2024

@bouweandela Do you have an example of an HPC facility that has a lot of restrictions?

For example, Levante at DKRZ does not support (singularity) containers on its compute nodes. The feature has been disabled for months now. It is not very publicly announced; I could only find this note.

The lack of docker would make the installation process painful, but it shouldn't be a deal-breaker.

Not necessarily, if we just make sure that every diagnostics package can be either installed from PyPI or conda-forge.

The impact would likely be felt in the inability to use other ancillary services that are easy to pull in as docker containers (web servers, caches, databases, etc.). It might require a customised deployment, as the lack of docker probably suggests other limitations too.

I imagine that a deployment on ESGF would typically happen on a Kubernetes cluster or on one or more virtual machines. Typically, system administrators give you a lot more freedom to install things on a virtual machine than on an HPC node, so I would not be too worried about that.

However, for modelling groups/individual researchers that want to use the tool to evaluate their model on their own infrastructure, requiring web servers and databases and docker containers may be a dealbreaker as these typically require more time and technical skills to set up and maintain than they have.

@lewisjared
Contributor Author

requiring web servers and databases and docker containers

SQLite should allow us to use a database without additional external services. For serving the results locally, we can use Python-based web servers to serve the static content. Both of these are likely different dependencies than what is used in production (SQLite might be fine, though).
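As a minimal sketch of the SQLite idea, using only the Python standard library (the table layout below is invented purely for illustration):

```python
import sqlite3

# Hypothetical sketch: tracking metric runs in SQLite requires no external
# database service. The table layout is invented for illustration.
conn = sqlite3.connect(":memory:")  # a real deployment would use a file path
conn.execute("CREATE TABLE runs (provider TEXT, metric TEXT, status TEXT)")
conn.execute(
    "INSERT INTO runs VALUES (?, ?, ?)", ("example", "global-mean", "complete")
)
rows = conn.execute("SELECT provider, metric, status FROM runs").fetchall()
print(rows)  # [('example', 'global-mean', 'complete')]
conn.close()
```

Serving static results locally could similarly be as simple as running `python -m http.server` in the output directory.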

@bouweandela Do you have access to Levante so we could do some testing there?
