
docs: Package structure RFC #1

Open · wants to merge 3 commits into `main`

Conversation

lewisjared
Contributor

@lewisjared lewisjared commented Oct 30, 2024

Description

Adds a new RFC for the proposed package structure of the repository.


@climate-dude climate-dude left a comment


  • What are "REF packages"? Are they the external metrics/benchmarking packages? Does it include CMEC, or just ESMValTool, ILAMB, PMP, etc. in this initial implementation?
  • Does this RFC propose that some version of those individual packages will be copied into this repo from their respective existing repositories? Does that mean modifications to the packages here may require cherry-picking and integration into the native package repos? It appears this complication may be addressed in bullets under Rationale and alternatives.

@lewisjared
Contributor Author

What are "REF packages"? Are they the external metrics/benchmarking packages? Does it include CMEC, or just ESMValTool, ILAMB, PMP, etc. in this initial implementation?

There will be one package per benchmarking package, describing the metrics that benchmarking package provides. That info is then used to determine which packages to call when new data arrives. I've been using the term MetricProvider for these external benchmarking packages, but I'm happy to tweak that language.

For packages that already interface with the CMEC driver, there will be some common functionality to help write out the required CMEC configuration files. The configuration will use as much of the existing EMDS standard/CMEC configuration as possible, with extensions to support this use-case.

I'll reword to be more precise.
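The dispatch described above (matching newly arrived data against the metrics each provider declares) could be sketched roughly like this. All names and the registry layout are invented purely for illustration; the RFC does not specify any of this:

```python
# Hypothetical sketch of provider dispatch: each MetricProvider package
# declares which variables its metrics cover, and incoming data is matched
# against those declarations. Provider names and layout are illustrative only.
PROVIDERS = {
    "ilamb": {"variables": {"gpp", "tas"}},
    "pmp": {"variables": {"pr", "psl"}},
}


def providers_for(variable: str) -> list[str]:
    """Return the providers whose declared metrics cover this variable."""
    return [name for name, spec in PROVIDERS.items() if variable in spec["variables"]]


print(providers_for("pr"))  # ['pmp']
```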

Does this RFC propose that some version of those individual packages will be copied into this repo from their respective existing repositories? Does that mean modifications to the packages here may require cherry-picking and integration into the native package repos? It appears this complication may be addressed in bullets under Rationale and alternatives.

No code needs to be vendored/copied and it should require little to no modification to the existing source. The packages will be mostly configuration and glue code.

@nocollier

Each metrics package will come with its own set of dependencies. Odds are that they will clash with other packages.

Maybe this is clear in everyone's mind but mine. Are we sure that this is the case? I ask this because it introduces a lot of complexity to REF that could be avoided if all packages can run in the same environment. It would depend on what REF wants from ILAMB, but if I can provide it from ilamb3 I currently depend on:

numpy
xarray
pint-xarray
cf_xarray
pandas
intake
matplotlib
cartopy

Is it worth someone (I volunteer, we need this for ESGF anyway) trying to install each package in the same environment and run tests? Or has a multi-environment structure already been decided on and I am just not realizing?

@nocollier

The framework will be an application rather than a library...

Can't it be both? I do this in ILAMB. It is a library, but when installing I also provide scripts in the setup:

https://github.com/rubisco-sfa/ILAMB/blob/6780ef0824a8a245ae60e518d5b5fc25f970f3d6/setup.py#L109

These are placed in your PATH when the package is installed and can be invoked like executables.
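The `scripts` approach described above looks roughly like this in a `setup.py`. The package and script names below are placeholders, not ILAMB's actual configuration:

```python
# Sketch of the setuptools "scripts" approach referenced above.
# Package and script names are placeholders for illustration only.
from setuptools import setup

setup(
    name="example-metrics",
    version="0.1.0",
    packages=["example_metrics"],
    # Files listed here are installed onto the user's PATH,
    # so the library can also be invoked like an application.
    scripts=["bin/example-run"],
)
```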

@nocollier

Splitting out each package to a separate repository is a possibility...

This would also give package providers the option to keep the REF glue in their main repository. ILAMB does this with auxiliary files that CMEC uses. It might be preferable for some to make sure testing of their own packages does not break REF connections.

@lee1043

lee1043 commented Oct 30, 2024

The framework will be an application rather than a library...

Can't it be both? I do this in ILAMB. It is a library, but when installing I also provide scripts in the setup:

https://github.com/rubisco-sfa/ILAMB/blob/6780ef0824a8a245ae60e518d5b5fc25f970f3d6/setup.py#L109

These are placed in your PATH when the package is installed and can be invoked like executables.

I second @nocollier's idea -- it might be able to be both. PMP does the same as ILAMB -- it is both an application and a library, with a setup.py for installation. It can also be installed via conda.

@acordonez

Is it worth someone (I volunteer, we need this for ESGF anyway) trying to install each package in the same environment and run tests? Or has a multi-environment structure already been decided on and I am just not realizing?

CMEC is designed such that every package has its own conda environment. The conda environment name is actually hard-coded for each package, but you can change the source code to use it differently. Historically I've had a hard time getting the PMP in particular to install in environments with other packages. Not opposed to doing it differently but currently that's how CMEC expects environments to work.

@minxu74

minxu74 commented Oct 30, 2024

Is it worth someone (I volunteer, we need this for ESGF anyway) trying to install each package in the same environment and run tests? Or has a multi-environment structure already been decided on and I am just not realizing?

CMEC is designed such that every package has its own conda environment. The conda environment name is actually hard-coded for each package, but you can change the source code to use it differently. Historically I've had a hard time getting the PMP in particular to install in environments with other packages. Not opposed to doing it differently but currently that's how CMEC expects environments to work.

I may be wrong. I originally thought that each package would be containerized with its own environment and benchmarking data, and would run in a separate container. Only model outputs and package outputs would be shared among these containers. @acordonez I wonder if CMEC can handle containers?

@acordonez

I may be wrong. I originally thought that each package would be containerized with its own environment and benchmarking data, and would run in a separate container. Only model outputs and package outputs would be shared among these containers. @acordonez I wonder if CMEC can handle containers?

@minxu74 Are you referring to Docker containers? We haven't built anything specific into CMEC to use containers, but cmec-driver is all Python and, as far as I know, should run in a container without major issues.

@lewisjared
Contributor Author

Thanks for the feedback everyone. I've pulled the comments from @nocollier and @lee1043 into a separate thread with a proposed change of text.

@lewisjared
Contributor Author

It does raise the need to come to an agreement on one environment vs multiple. I wouldn't say it has been decided, but it is a pretty fundamental decision that we should discuss. Perhaps that needs a discussion doc in itself? I'll add it to the agenda for tomorrow and then I can write up the results.

@lewisjared
Contributor Author

To throw my 2 cents in, I'm a strong advocate of multiple environments. I agree it adds additional complexity, but I've seen projects be burnt when trying to bring together multiple science packages (IIASA's scenario database, the AR6 WG3 climate assessment and openscm-runner (single interface for multiple simple climate models)). I'm sure the ESMValTool team has spent a lot of time fussing with packaging and the "blessed" environment.

A single environment might work now, but it's very fragile, as any one dependency may make the environment unsolvable or limit which packages can be part of the REF in future. ESMValTool has a large set of dependencies. I'm not sure if @bouweandela intends to pull in the whole ESMValTool system or just the core so this might not be quite as bad as it looks.

Splitting into multiple, decoupled and isolated environments would put the control of packaging into the hands of the benchmarking providers who know how best to do that. This will shift significant complexity into WP1 which I'm ok with because it results in a lower maintenance burden for benchmarking packages and a lower barrier to entry which I think is a good trade off.

@bouweandela

A single environment might work now, but it's very fragile, as any one dependency may make the environment unsolvable or limit which packages can be part of the REF in future. ESMValTool has a large set of dependencies. I'm not sure if @bouweandela intends to pull in the whole ESMValTool system or just the core so this might not be quite as bad as it looks.

It depends on which metrics/diagnostics ESMValTool should be able to run. For example, we have an esmvaltool-python package available on conda-forge (basically the esmvaltool package from PyPI with only the dependencies of diagnostics written in Python), so that would avoid the need for NCL/R/Julia dependencies.

While it is probably possible to install everything into a single environment (I expect the ESMValTool environment will also have all the requirements for the other diagnostic packages) most of the time, there may be moments where not everyone is in sync with the latest dependencies and then there will be trouble, so I agree that using isolated environments is the most reliable solution.

Using docker containers is great for reproducibility, and if users can just download them it shouldn't be difficult for them to install, but it does add an additional burden for metrics package maintainers if they need to create containers. It is also good to take into account that not all HPC systems allow running containers, not even Apptainer containers. Local (conda) environments are also great if you want to change the code of a metric to e.g. improve a plot. Therefore, I think it would be best for adoption if the REF could support both (Docker/Apptainer) containers and conda environments.

@lewisjared
Contributor Author

@bouweandela Do you have an example of an HPC facility that has a lot of restrictions?

The lack of docker would make the installation process painful, but it shouldn't be a deal-breaker. The impact would likely be felt in the inability to use other ancillary services that are easy to pull in as docker containers (web servers, caches, databases, etc.). It might require a customised deployment, as the lack of docker probably suggests other limitations too.

This package will be a dependency of all other packages as it will describe the interfaces that metrics providers must implement.
This allows us to keep the core functionality separate from the metrics providers.
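As a rough illustration of the kind of interface such a core package might describe, a metrics provider could implement an abstract base class like the one below. None of these names are fixed by the RFC; everything here is hypothetical:

```python
# Hypothetical sketch of the interface a core package might define for
# metrics providers. All class, method, and metric names are illustrative.
from abc import ABC, abstractmethod


class MetricsProvider(ABC):
    """What a benchmarking package would implement to plug into the REF."""

    @abstractmethod
    def metrics(self) -> list[str]:
        """Names of the metrics this provider offers."""

    @abstractmethod
    def run(self, metric: str, dataset: str) -> dict:
        """Calculate a single metric for a dataset and return the results."""


class ExampleProvider(MetricsProvider):
    """Toy provider showing the shape of an implementation."""

    def metrics(self) -> list[str]:
        return ["global-mean"]

    def run(self, metric: str, dataset: str) -> dict:
        return {"metric": metric, "dataset": dataset, "value": 0.0}
```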

### `ref-metrics-*`


It would be easier for people who want to use the REF with a custom metric provider if these were maintained outside of the main REF repository. If we move the example to a separate repository, it will be much easier for relative outsiders to see how to implement the interface because they don't have to dig through all of the REF infrastructure and configuration to find what they're looking for.

Contributor Author


Good point.

Let's migrate the example outside of this repository. It is probably best to wait until we are happy with a PoC, as it is easier to refactor all in one spot. We will also need to publish the ref-core package to PyPI/conda-forge first.

Perhaps something for the new year.


The package will be named `ref-metrics-<provider>` where `<provider>` is the name of the provider.

### `ref-api` (TBD)


The way I understand it, there are at least two use cases that each need their own application.

  1. An automated system running on an ESGF node that listens to the ESGF message queue announcing that new datasets have been published and then kicks off the required diagnostic runs and integrates/publishes the results.
  2. Human users who want to run the tool themselves; this could be from the command line or from a Jupyter notebook

It looks like this proposes 1), but maybe it would be more convenient to start with 2) as having that is convenient for development. Or is that covered somewhere else already?

Contributor Author


I would add/tweak another one:

  1. Production deployment
  2. Modelling center where they are ingesting their pre-published data. Probably by running a CLI command. This requires a bulk of the services needed for a production deployment if they want to track results.
  3. Running metrics directly via a common interface without any tracking of results

I'm assuming that modelling centers want some form of tracking, but that is only a guess, so we should validate it. The tracking requires a database, which is likely just SQLite, so no additional services are needed.

There is a lot of overlap between 1 and 2, but 3 is probably how benchmarking package maintainers will develop their packages. Each of the metrics packages should also be able to be run directly in a notebook if you don't require the complexity that comes with the "compute engine".

Comment on lines +74 to +75
The package may have indirect dependencies on the metrics packages to avoid conflicting dependencies.
That adds additional complexity, but will be more flexible in the long run.


I'm not sure what this means, I would recommend leaving this out or making it more explicit what is intended.

Another drawback is that it will be harder to manage different package managers.
The project currently uses [uv](https://docs.astral.sh/uv/) as a package manager, as the core and framework
will not require any complex-to-install dependencies,
but science packages may require more complex dependencies.


If I understand it correctly, we're not planning to do any science in the REF. The science should happen in the metrics/diagnostics packages. Therefore I would not expect that we need any science dependencies.

Contributor Author


Correct, but the ref-metrics-* packages would pull those science dependencies in. Those metric provider packages should probably live outside of the cmip-ref repository, like you suggested for the example package. That example package could then become a template. It would also provide some flexibility for metrics packages to use whatever package installer they need.

I'll update the text to reflect this.

@bouweandela
Copy link

bouweandela commented Nov 14, 2024

@bouweandela Do you have an example of an HPC facility that has a lot of restrictions?

For example, Levante at DKRZ does not support (singularity) containers on its compute nodes. The feature has been disabled for months now. It is not very publicly announced; I could only find this note.

The lack of docker would make the installation process painful, but it shouldn't be a deal-breaker.

Not necessarily, if we just make sure that every diagnostics package can be either installed from PyPI or conda-forge.

The impact would likely be felt in the inability to use other ancillary services that are easy to pull in as docker containers (web servers, caches, databases, etc.). It might require a customised deployment, as the lack of docker probably suggests other limitations too.

I imagine that a deployment on ESGF would typically happen on a Kubernetes cluster or on one or more virtual machines. Typically, system administrators give you a lot more freedom to install things on a virtual machine than on an HPC node, so I would not be too worried about that.

However, for modelling groups/individual researchers that want to use the tool to evaluate their model on their own infrastructure, requiring web servers and databases and docker containers may be a dealbreaker as these typically require more time and technical skills to set up and maintain than they have.

@lewisjared
Contributor Author

requiring web servers and databases and docker containers

SQLite should allow us to use a database without additional external services. For serving the results locally, we can use Python-based web servers to serve the static content. Both of these are likely different dependencies than what is used in production (SQLite might be fine, though).
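As a minimal sketch of the SQLite idea, using only the Python standard library (the table layout below is invented purely for illustration):

```python
import sqlite3

# Hypothetical sketch: tracking metric runs in SQLite requires no external
# database service. The table layout is invented for illustration.
conn = sqlite3.connect(":memory:")  # a real deployment would use a file path
conn.execute("CREATE TABLE runs (provider TEXT, metric TEXT, status TEXT)")
conn.execute(
    "INSERT INTO runs VALUES (?, ?, ?)", ("example", "global-mean", "complete")
)
rows = conn.execute("SELECT provider, metric, status FROM runs").fetchall()
print(rows)  # [('example', 'global-mean', 'complete')]
conn.close()
```

Serving static results locally could similarly be as simple as running `python -m http.server` in the output directory.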

@bouweandela Do you have access to Levante so we could do some testing there?
