Welcome to the Framework for the Rapid Development of Environmental Foundation Models (FRAME-FM)!
FRAME-FM is an open-source software framework designed to enable the fast, scalable, and accessible development of Foundation Models (FMs) for large-scale environmental datasets. This includes petabyte-scale archives held by the UK’s NERC Environmental Data Service (EDS).
While environmental data archives are vast and information-rich, they are difficult to process directly. This project addresses this critical gap by lowering the barrier to using these datasets. FRAME-FM provides standardised workflows, infrastructure, and tools that allow users to train, evaluate, fine-tune and publish foundation models efficiently.
- Who is FRAME-FM Built for?
- Running FRAME-FM Locally
- Running FRAME-FM on JASMIN
- Other Examples
- Technical Documentation
FRAME-FM is intended both for scientists who are experienced in Machine Learning and for those who are new to it.
For those experienced in Machine Learning, FRAME-FM speeds up the data wrangling processes by wrapping data loaders and setting up a thorough Extract ➡️ Transform ➡️ Load (ETL) pipeline. This ETL pipeline also has data caching to lower resource usage and speed up the models.
For scientists who are new to Machine Learning, FRAME-FM provides a good starting point for gaining experience. FRAME-FM abstracts away the need to understand schedulers and settings. Instead, you can gain confidence with FRAME-FM and develop the knowledge needed to continue using PyTorch and Hydra in future projects, either within or outside of FRAME-FM.
The FRAME-FM repository contains many directories. The core directories are explained below, alongside details of FRAME-FM's dependencies.
| Dependency | Purpose |
|---|---|
| PyTorch Lightning | A high-level wrapper around PyTorch. It is used to build and train the foundation models. |
| Hydra | Manages the configuration for FRAME-FM through YAML config files, whose values can be overridden from the command line. |
| MLflow | Allows recording and tracking runs either via a web GUI or through output logs. |
All of the source code for the FRAME-FM project can be found in "src/FRAME_FM".
More specifically, this directory contains:
| Subfolder | Description |
|---|---|
| utils | This directory contains wrappers around important PyTorch Lightning classes. Wrappers like LightningDataModule and LightningModule contain customisations essential for FRAME-FM such as standardisations, constructions and consistent logging. |
| dataloaders | The LightningDataModule implementations for loading and pre-processing external datasets. This allows FRAME-FM to keep the model code separate from the data-handling, transform and batching logic. |
| datasets | Contains dataset wrappers to make applying transformations to input data easier. |
| models | The LightningModule implementations (LightningModule comes from PyTorch Lightning). Each of the files within defines a modular and configurable model that should be instantiated through Hydra. |
| training | This holds the Hydra-driven train.py script used to launch model training runs. This directory also holds training entry points, MLflow logging helpers, callbacks and trainer utilities. |
| transforms | This directory will hold all of our transformation classes and relevant utilities to glue them into FRAME-FM. |
FRAME-FM can be run within Jupyter / Marimo notebooks or on Windows, but it is designed to be run on Linux. Any computer can run FRAME-FM, but it is intended to be used on supercomputer platforms like EDS' JASMIN or the University of Bristol's Isambard. FRAME-FM is also intended for x86 machines, but has run successfully on ARM-based machines.
Machine Learning is generally GPU-intensive and GPUs are typically more costly than CPUs. The FRAME-FM Extract ➡️ Transform ➡️ Load (ETL) pipeline is built so that pre-processing and caching (which are IO and/or CPU-intensive) happen before training/inference happens on GPUs. This approach is intended to optimise the use of GPUs when the actual machine learning is taking place.
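
The idea of doing the CPU-bound work once and caching the result can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not FRAME-FM's implementation: the function names (`extract`, `transform`, `load_with_cache`), the min-max scaling and the JSON cache format are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def extract(source: list[float]) -> list[float]:
    """Extract: stand-in for reading raw values from an archive."""
    return list(source)


def transform(values: list[float]) -> list[float]:
    """Transform: min-max scale the values to [0, 1] (a CPU-bound step)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero for constant input
    return [(v - lo) / span for v in values]


def load_with_cache(source: list[float], cache_dir: Path) -> tuple[list[float], bool]:
    """Run extract -> transform once, then serve later calls from disk.

    Returns the processed values and whether they came from the cache.
    """
    # Hypothetical cache key: a hash of the raw input.
    key = hashlib.sha256(json.dumps(source).encode()).hexdigest()[:16]
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text()), True
    processed = transform(extract(source))
    cache_file.write_text(json.dumps(processed))
    return processed, False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = [2.0, 4.0, 6.0]
        first, first_hit = load_with_cache(raw, Path(tmp))
        second, second_hit = load_with_cache(raw, Path(tmp))
        print(first, first_hit)    # [0.0, 0.5, 1.0] False
        print(second, second_hit)  # [0.0, 0.5, 1.0] True
```

The second call skips the transform entirely, which is the property that lets the expensive pre-processing run ahead of time on CPU nodes.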
The guidance below assumes that you already have all the access required to connect to and use JASMIN.
If you are not using JASMIN, you will need to make sure that you have Python 3.11-3.13 and pip installed.
This documentation is intended to be run within Linux. However, with some small tweaks the below commands can be run in Windows using WSL or Git Bash.
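
If you are unsure whether your interpreter falls inside the supported range, a short check like the one below may help. It is a minimal sketch: the helper name `is_supported` is hypothetical, and the bounds are simply the 3.11 to 3.13 range stated above.

```python
import sys

# Supported range taken from the text above: Python 3.11 to 3.13 inclusive.
MIN_VERSION = (3, 11)
MAX_VERSION = (3, 13)


def is_supported(version: tuple[int, int]) -> bool:
    """Return True if (major, minor) lies within the supported range."""
    return MIN_VERSION <= version <= MAX_VERSION


if __name__ == "__main__":
    current = (sys.version_info.major, sys.version_info.minor)
    status = "supported" if is_supported(current) else "not supported"
    print(f"Python {current[0]}.{current[1]} is {status}")
```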
Below are the steps to run train.py using the default configuration:
- **SSH onto JASMIN**

  Once you have access to JASMIN, you should be able to set up a jump host (`-J`) to SSH onto one of the Sci servers through JASMIN's login servers. To do this, run:

      ssh -A <username>@sci-vm-04.jasmin.ac.uk -J <username>@login.jasmin.ac.uk

- **Load the required module**

  Once connected, load the required modules with:

      module load jaspy

- **Set up your Python path**

  Set up your Python path by running the following command:

      export PYTHONPATH="$PWD/src:$PYTHONPATH"

  When running `python --version`, it should show Python 3.12.

- **Set up a virtual environment**

  Next, use uv to create a virtual environment (venv). Dependencies will then be installed specifically within that environment.

      pip install uv
      uv venv
      source .venv/bin/activate

- **Install all dependencies**

  The dependencies that need to be installed are listed in `pyproject.toml`.

      uv sync
      uv add torchgeo --optional data

- **Initialise configs**

  By default, FRAME-FM expects a `configs` directory to be present in your current working directory. Run the following command to initialise a configs directory if one does not already exist. (Note that you can override this location by setting the `CONFIG_DIR` environment variable. If set, the init command will use this path instead.)

      uv run framefm config init

- **Configure Training**

  Edit the file `configs/config.yaml` and choose your platform. Platform configs are stored in `configs/platforms/<platform name>.yaml`. By default the "jasmin-small" config is used, which allocates a single GPU and 2 CPU cores on JASMIN's Orchid cluster.

- **Run a Training Job**

      framefm train run

- **Monitoring Training with MLflow**

  If you would like to use MLflow to record the training output, follow the Logging README.

The TorchX and Hydra configs can be viewed and edited at runtime via the CLI. Training can also be run via the CLI, passing options and overrides.

| Command | Arguments | Flags | Description |
|---|---|---|---|
| `framefm config list` | None | `--torchx` | Recursively lists all YAML files in `configs/`. If `--torchx` is used, only verifies the `.torchxconfig` location. |
| `framefm config display` | `CONFIG_FILE` | `--torchx` | Prints the contents of a specific file. Use `--torchx` to skip the file argument and view the TorchX config. |
| `framefm config view-defaults` | None | None | Displays the top-level Hydra default configurations from `configs/config.yaml`. |
| `framefm config edit` | `FILE KV_PAIRS` | None | Updates YAML values. Format: `key:value` or `key:val1,key2:val2`. Note: Overwrites formatting and removes comments. |
| `framefm config edit-torchx` | `KV_PAIRS` | None | Updates `.torchxconfig`. Format: `<table>-<key>:<value>` (e.g., `defaults-cpu:4` to update the CPU value in the defaults section). |
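
To make the two `KV_PAIRS` formats concrete, the sketch below parses them into Python dictionaries. It is an illustration of the documented formats only, not the parser FRAME-FM uses: the function names are hypothetical, values are kept as plain strings, and `batch_size` in the usage example is a made-up key.

```python
def parse_kv_pairs(text: str) -> dict[str, str]:
    """Parse 'key:value' or 'key:val1,key2:val2' into a dict of strings."""
    pairs: dict[str, str] = {}
    for item in text.split(","):
        # Split on the first colon only, so values may contain colons.
        key, sep, value = item.partition(":")
        if not sep or not key.strip():
            raise ValueError(f"expected key:value, got {item!r}")
        pairs[key.strip()] = value.strip()
    return pairs


def parse_torchx_pairs(text: str) -> dict[tuple[str, str], str]:
    """Parse '<table>-<key>:<value>' items into {(table, key): value}."""
    result: dict[tuple[str, str], str] = {}
    for full_key, value in parse_kv_pairs(text).items():
        # The table name is everything before the first hyphen.
        table, sep, key = full_key.partition("-")
        if not sep or not table or not key:
            raise ValueError(f"expected <table>-<key>:<value>, got {full_key!r}")
        result[(table, key)] = value
    return result


if __name__ == "__main__":
    print(parse_kv_pairs("seed:99,batch_size:32"))
    print(parse_torchx_pairs("defaults-cpu:4"))
```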

| Command | Description |
|---|---|
| `framefm train run` | Starts a model training run. Configuration is loaded from the `configs/` directory using Hydra. You can override any config value directly from the command line without editing any YAML files. |
| `framefm train run [OPTIONS] [OVERRIDES]...` | Starts a model training run with the given options and overrides to the Hydra config. |

| Option | Short | Description |
|---|---|---|
| `--verbose` | `-v` | Print the fully resolved Hydra config to screen before training starts. |
| `--scheduler` | `-s` | The TorchX scheduler to use for running the training job. `local` runs immediately on the local server, others submit jobs. Default is `local`. |
| `--help` | | Show help and exit. |

Any positional arguments after `train run` are passed to Hydra as config overrides.

| Syntax | Meaning | Example |
|---|---|---|
| `key=value` | Override an existing value. Raises an error if the key does not exist. | `seed=99` |
| `+key=value` | Append a new key. Raises an error if the key already exists. | `+experiment=baseline` |
| `++key=value` | Override or append; safe either way. | `++new_key=99` |
| `~key=value` | Remove a key from the config. | `~logging=demo_mlflow` |
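
The four prefixes can be read as four small rules on a dictionary. The sketch below mimics those rules on a flat Python dict purely to illustrate the table above. It is not Hydra's implementation: the function name `apply_override` is hypothetical, nested keys and type conversion are ignored, and values stay as strings.

```python
def apply_override(config: dict[str, str], override: str) -> dict[str, str]:
    """Return a copy of a flat config with one Hydra-style override applied."""
    result = dict(config)

    # ~key or ~key=value: remove an existing key.
    if override.startswith("~"):
        key = override[1:].split("=", 1)[0]
        if key not in result:
            raise KeyError(f"cannot remove missing key {key!r}")
        del result[key]
        return result

    key, sep, value = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")

    if key.startswith("++"):
        # ++key=value: override or append, whichever applies.
        result[key[2:]] = value
    elif key.startswith("+"):
        # +key=value: append only; the key must not exist yet.
        name = key[1:]
        if name in result:
            raise KeyError(f"key {name!r} already exists")
        result[name] = value
    else:
        # key=value: override only; the key must already exist.
        if key not in result:
            raise KeyError(f"key {key!r} does not exist")
        result[key] = value
    return result


if __name__ == "__main__":
    config = {"seed": "42", "logging": "demo_mlflow"}
    for item in ["seed=99", "+experiment=baseline", "++new_key=99", "~logging=demo_mlflow"]:
        config = apply_override(config, item)
    print(config)  # {'seed': '99', 'experiment': 'baseline', 'new_key': '99'}
```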

| Use Case | Command | Description |
|---|---|---|
| Run with defaults | `framefm train run` | Executes the training pipeline using the default configuration settings. |
| Override config group | `framefm train run model=demo_autoencoder` | Replaces a top-level config group (like model or data) with a specific choice. |
| Override multiple | `framefm train run model=demo_autoencoder data=land_cover_map seed=99` | Combines several overrides in a single execution. |
| Override nested | `framefm train run model=demo_autoencoder model.lr=1e-4` | Targets a specific parameter within a config group (e.g., learning rate). |
| Append experiment | `framefm train run +experiment=baseline` | Uses the `+` prefix to add a new experiment configuration file to the run. |
| Remove logging | `framefm train run ~logging=demo_mlflow` | Uses the `~` prefix to disable or remove a specific logging configuration. |
| Print resolved config | `framefm train run --verbose model=demo_autoencoder` | Uses the `--verbose` flag to display the final configuration before training starts. |

When using Slurm, please ensure you have activated the virtual environment before scheduling the training.
Run your training via TorchX:

    framefm train run model=demo_autoencoder

**Where to check the Slurm wrapper logs**

TorchX generates a wrapper log when submitting via Slurm. If no job_dir was specified in the config (look under the platforms config), the wrapper log is written to:

    /home/users//FRAME-FM/slurm-.out

**Where to check the actual training logs**

The real training output is written by the worker:

    /home/users//FRAME-FM/slurm--worker-0.out
    /home/users//FRAME-FM/slurm--worker-0.err

- worker-0.out → stdout from your training process (metrics, progress).
- worker-0.err → stderr (errors, exceptions).

| Command | Description / Use | Notes / Examples |
|---|---|---|
| `squeue -u $USER` | Lists all current jobs submitted by your user | Shows job ID, name, state, partition, nodes, runtime, etc. Useful to quickly check running or pending jobs. |
| `sacct -u $USER -S <start_date> --format=JobID,JobName,State,Partition,Start,End` | Shows historical jobs for your user starting from `<start_date>` | Example: `sacct -u $USER -S 2026-03-24 --format=JobID,JobName,State,Partition,Start,End` lists all jobs since that date with their start/end times. |
| `scontrol show job <JobID>` | Shows detailed info for a specific job | Example: `scontrol show job 10176547` displays stdout/stderr paths, resources, nodes, time limits, QoS, etc. |
| `tail -f <stdout_file>` | Follows real-time stdout logs of a job | Example: `tail -f /home/users/<user>/FRAME-FM/slurm-10176547-worker-0.out` shows the actual training logs. |
| `scancel <JobID>` | Cancels a specific job | Example: `scancel 10176547` stops that job. |
| `scancel -u $USER` | Cancels all jobs for your user | Useful to clear all pending/running jobs before starting new experiments. |
| `sacct -j <JobID> --format=JobID,JobName,State,ExitCode,Elapsed` | Check the status and exit code of a specific job | Example: `sacct -j 10176547 --format=JobID,JobName,State,ExitCode,Elapsed` helps debug why a job failed. |
| `squeue -j <JobID>` | Check if a specific job is still running/pending | Example: `squeue -j 10176547` |
| `scontrol requeue <JobID>` | Requeues a failed or cancelled job | Useful when a job hits a transient error and you want to retry without resubmitting. |

Tips:

- `StdOut` and `StdErr` from `scontrol show job <JobID>` show the exact log file locations.
- `squeue` only shows active/pending jobs, while `sacct` can show completed, failed, or canceled jobs.
- Use `tail -f` on the worker logs to monitor training in real time.

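
If you prefer to inspect the worker logs from Python rather than with `tail`, something like the following may be a starting point. It is a sketch under assumptions: the file-name pattern is inferred from the `slurm-10176547-worker-0.out` example above, the helper names `worker_log_paths` and `last_lines` are hypothetical, and the demo writes a throwaway file in a temporary directory instead of reading a real job's log.

```python
import tempfile
from collections import deque
from pathlib import Path


def worker_log_paths(job_dir: Path, job_id: int, worker: int = 0) -> tuple[Path, Path]:
    """Build the assumed stdout/stderr log paths for one worker of a job."""
    stem = f"slurm-{job_id}-worker-{worker}"
    return job_dir / f"{stem}.out", job_dir / f"{stem}.err"


def last_lines(path: Path, count: int = 10) -> list[str]:
    """Return the last `count` lines of a text file without trailing newlines."""
    with path.open() as handle:
        # A bounded deque keeps only the final `count` lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out_path, err_path = worker_log_paths(Path(tmp), 10176547)
        # Stand-in log content, written only for this demonstration.
        out_path.write_text("epoch 1\nepoch 2\nepoch 3\n")
        print(out_path.name)
        print(last_lines(out_path, 2))
```

Unlike `tail -f`, this reads the file once and returns, so it suits a quick check rather than live monitoring.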
Hydra organises config files into groups (e.g. model/, data/). Passing a group name as an override swaps the entire file loaded for that group.

    configs/
      config.yaml
      model/
        demo_autoencoder.yaml
        convAE.yaml
      data/
        demo_eurosat.yaml
        land_cover_map.yaml
      trainer/
        default.yaml
      logging/
        demo_mlflow.yaml

For example, model=convAE tells Hydra to load configs/model/convAE.yaml instead of the default.
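
The mapping from a group override to a file on disk can be sketched as below. This only illustrates the naming convention described here and is not how Hydra resolves configs internally: the function name `resolve_group_files` is hypothetical, and the `DEFAULTS` dictionary is a hand-copied stand-in for the `defaults` list in `configs/config.yaml`.

```python
from pathlib import Path

# Stand-in for the defaults list in configs/config.yaml (copied by hand).
DEFAULTS = {
    "data": "demo_eurosat",
    "model": "demo_autoencoder",
    "trainer": "default",
    "logging": "demo_mlflow",
}


def resolve_group_files(
    defaults: dict[str, str],
    overrides: list[str],
    config_root: Path = Path("configs"),
) -> dict[str, Path]:
    """Map each config group to the YAML file that would be loaded for it."""
    choices = dict(defaults)
    for override in overrides:
        group, sep, choice = override.partition("=")
        if not sep or group not in choices:
            raise ValueError(f"unknown group override {override!r}")
        # A group override swaps the whole file chosen for that group.
        choices[group] = choice
    return {
        group: config_root / group / f"{choice}.yaml"
        for group, choice in choices.items()
    }


if __name__ == "__main__":
    files = resolve_group_files(DEFAULTS, ["model=convAE"])
    for group, path in files.items():
        print(f"{group}: {path.as_posix()}")
```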
The root config file is configs/config.yaml. Its defaults list determines which group configs are loaded unless overridden on the command line:

    defaults:
      - data: demo_eurosat
      - model: demo_autoencoder
      - trainer: default
      - logging: demo_mlflow
      - _self_

    seed: 42

If you would like to run FRAME-FM without using the CLI, you can run train.py directly with Python:

    python src/FRAME_FM/training/train.py

FRAME-FM uses Sphinx to generate and store technical documentation. This is a mixture of hand-written and autogenerated documentation, including thorough details on the configs and transformers.
To view this documentation, visit https://nerc-eds.github.io/FRAME-FM/.