Current expt time for running jobs in payu status#702
Current expt time for running jobs in payu status#702Qian-HuiChen wants to merge 8 commits intopayu-org:masterfrom
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #702 +/- ##
==========================================
+ Coverage 63.30% 64.04% +0.74%
==========================================
Files 67 67
Lines 5426 5541 +115
==========================================
+ Hits 3435 3549 +114
- Misses 1991 1992 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
7d79a8a to
2896232
Compare
|
Good morning @anton-seaice @blimlim @ezhilsabareesh8, It will be highly appreciated if you could provide some feedback from a user perspective:
Thanks! |
|
Thanks for adding this in! Great to have this implemented for so many model drivers. I was testing running this branch with a pre-existing MOM6 configuration, and I am getting a warning when running As the run has completed, the work directory no longer exists so should change the code to only query the current run time only when the model is running. The other initial thing I noticed was |
blimlim
left a comment
There was a problem hiding this comment.
Thanks @Qian-HuiChen, I've had a look through the changes and I think they're looking really good. Being able to query the run date like this would be very useful.
I've added a few small comments and questions throughout.
For the ESM1.6 implementation, all the relevant files (work/atmosphere/namelists, work/atmosphere/atm.fort6.pe0, and the suggested work/atmosphere/um.res.yaml) are all used and produced specifically by the UM submodel. Because of this, I'm wondering whether it could make sense to move the implementation of get_curr_exp_time from access_esm1p6.py to um.py, and then access it using the get_cur_expt_time_using_submodel method in ESM1.6?
payu/models/access_esm1p6.py
Outdated
| def read_start_date(self, nl_path): | ||
| """Read the start date from the namelist file""" | ||
| try: | ||
| namelist = f90nml.read(nl_path) | ||
| model_basis_time = namelist.get('nlstcall', {}).get('model_basis_time', None) | ||
| if model_basis_time is None: | ||
| warnings.warn(f"model_basis_time not found in {nl_path}") | ||
| return None | ||
| return datetime(*model_basis_time) | ||
| except Exception as e: | ||
| warnings.warn(f"Could not read file {nl_path}: {e}") | ||
| return None |
There was a problem hiding this comment.
Apologies @Qian-HuiChen, I'm think my previous suggestion for reading the start date from the work/atmosphere/namelists file might not be the best one! It should work perfectly as is, but I'm thinking there might be existing methods which could be used to reduce duplication.
Each UM restart directory contains a calendar yaml file um.res.yaml, which holds a copy of the restart date (see /g/data/vk83/prerelease/configurations/inputs/access-esm1p6/modern/pre-industrial/restart/2026.02.20/atmosphere for an example).
I'd forgotten that we'd previously added a method to the um.py driver, get_restart_datetime which reads the date from this file, and uses it for the date-based restart pruning.
I'm wondering if it would make sense to reuse this method to get the start date, as it could be pointed to the copy of um.res.yaml in the atmosphere/work directory.
One issue is that the um.res.yaml is technically not required to run the model, and it's possible for it to be missing. This would nearly never be the case though, but there would have to be a safety check/warning in case it is missing.
There was a problem hiding this comment.
Thanks for pointing this out! It is better to reuse the get_restart_datetime than writing a new one.
I have changed the code to read out the starting date from work/atmosphere/um.res.yaml.
| ) | ||
|
|
||
| if json_output: | ||
| print(json.dumps(data, indent=4)) |
There was a problem hiding this comment.
Should the current experiment time also be written to the json output, or would there not really be any uses for it?
There was a problem hiding this comment.
I agree it will be good to have the current expt time here as well. I have added it in the latest commit.
There was a problem hiding this comment.
If adding current experiment time to the json and the formatted display then it might make sense to add it in build_job_info() to reduce duplication and use status == model-run
There was a problem hiding this comment.
While testing this out I just noticed that payu status --json also doesn't print out the model finish time after the run has completed:
pay========================================
Run: 0
Job ID: 163732205.gadi-pbs
Run ID: cd891450d95f0b8454f880b80bbb9fed40c034b4
Stage: archive
Total queue time: 0h 1m 27s
Model Finish Time: 0001-01-06T00:00:00
Exit Status: 0 (Success)
Model Exit Code: 0 (Success)
Output Log: /home/565/sw6175/esm1.6/misc/gettime-test/pre-industrial.o163732205
Error Log: /home/565/sw6175/esm1.6/misc/gettime-test/pre-industrial.e163732205
Job File: /scratch/tm70/sw6175/access-esm/archive/gettime-test-dev-preindustrial+concentrations-d33399d8/payu_jobs/0/run/163732205.gadi-pbs.json
========================================
u(payu-env) [sw6175@gadi-login-01 gettime-test]$ payu status --json
{
"experiment_uuid": "d33399d8-053d-4eb6-aa5b-2b7e0ff783e5",
"runs": {
"0": {
"run": [
{
"job_id": "163732205.gadi-pbs",
"run_id": "cd891450d95f0b8454f880b80bbb9fed40c034b4",
"stage": "archive",
"exit_status": 0,
"model_exit_status": 0,
"stdout_file": "/home/565/sw6175/esm1.6/misc/gettime-test/pre-industrial.o163732205",
"stderr_file": "/home/565/sw6175/esm1.6/misc/gettime-test/pre-industrial.e163732205",
"job_file": "/scratch/tm70/sw6175/access-esm/archive/gettime-test-dev-preindustrial+concentrations-d33399d8/payu_jobs/0/run/163732205.gadi-pbs.json",
"start_time": "2026-03-23T16:23:21.697106"
}
]
}
}
}
I'm wondering whether it would also work to add the finish time in build_job_info(), though I don't think I properly understand yet how the finish time is set.
There was a problem hiding this comment.
though I don't think I properly understand yet how the finish time is set.
model_finish_time is parsed from restart files by get_model_restart_datetimes() in experiment.py when stage is changed to archive.
whether it would also work to add the finish time in build_job_info()
Sounds good! I have implemented this in the latest commit. Now payu status --json should return something like:
{
"experiment_uuid": "eaxxxx7",
"runs": {
"5": {
"run": [
{
::
"stage": "archive",
::
"start_time": "2026-03-27Txxx",
"model_finish_time": "0012-02-11T00:00:00"
}
]
}
}
}
I also confirm that model_finish_time is not written twice in job file.
payu/models/cesm_cmeps.py
Outdated
| if os.path.exists(log_path): | ||
| with open(log_path, 'r') as f: | ||
| for line in reversed(f.readlines()): | ||
| if line.startswith(" memory_write: model date"): |
There was a problem hiding this comment.
It looks like its currently hardcoded to write this every model day:
as long as
med_phases_profile
is run.
So this seems fairly robust that "memory_write: model date" would always be available
There was a problem hiding this comment.
Thanks for checking this in the model code 👍
payu/subcommands/status_cmd.py
Outdated
| metadata.setup() | ||
| expt = Experiment(lab, config_path=config_path) | ||
| expt.init_models() | ||
| cur_expt_time = expt.get_model_cur_expt_time() |
There was a problem hiding this comment.
I wonder if we skip get_model_cur_expt_time() this step if stage is payu setup:
e.g.
$ payu status
/g/data/tm70/as2285/payu-dev/payu-dev/lib64/python3.11/site-packages/payu/models/cesm_cmeps.py:439: UserWarning: Log file /scratch/tm70/as2285/access-om3/work/1112-update-25km-topo/log/med.log does not exist or does not contain current model time.
========================================
Run: 0
Job ID: 163373011.gadi-pbs
Stage: setup
Job File: /scratch/tm70/as2285/access-om3/archive/1112-update-25km-topo/payu_jobs/0/run/163373011.gadi-pbs.json
shows a warning when it's the expected behaviour
There was a problem hiding this comment.
or job is in the queue still
There was a problem hiding this comment.
Yeah this could be deferred to a later point when we know the the current stage is "model-run"
There was a problem hiding this comment.
Thanks for picking this up!
Following the suggestion from @jo-basevi , I have changed payu to run get_model_cur_expt_time() only during model-run stage in the latest commit.
|
For OM3 - this appears to work ok. It does take a while for the model to initialise and run the first day, is it unclear that the warning always gets returned during this initialisation" |
jo-basevi
left a comment
There was a problem hiding this comment.
Just added a couple more comments.
I wonder if each get_cur_expt_time() model driver method could return a cftime.datetime object. It would catch if a valid date was parsed from the files correctly, and it would also store calendar information that could be used at a later point
As there may (?) be updates to models and their file contents in future, I wonder if we could add a optional command-line flag to silence warnings?
Another option to protect payu status erroring out in the future would be to have a higher level exception catch, e.g. something like:
try:
current_model_time = self.model.get_cur_expt_time()
except Exception:
warnings.warn("Unexpected error parsing current experiment time")Could also raise Exceptions for FileNotFound, ValueErrors etc during parsing for current run time, and let those exceptions bubble up and then catch exceptions at a higher level and then display a warning of the error. The benefits of this is that you don't need to return and handle None values as much.
payu/subcommands/status_cmd.py
Outdated
| metadata.setup() | ||
| expt = Experiment(lab, config_path=config_path) | ||
| expt.init_models() | ||
| cur_expt_time = expt.get_model_cur_expt_time() |
There was a problem hiding this comment.
Yeah this could be deferred to a later point when we know the the current stage is "model-run"
payu/models/model.py
Outdated
| def get_cur_expt_time(self): | ||
| """For model not implemented experiment time calculate/read-out, | ||
| leaves a warning and returns None.""" | ||
| print("Getting current experiment time is not yet implemented.") |
There was a problem hiding this comment.
Could leave this print statement out or change it to a warning?
There was a problem hiding this comment.
After further consideration, I agree that a persistent warning is too intrusive, especially since payu status remains functional to display all other info, even if the experiment time isn't available.
In the latest commit, I have moved these messages to logger.debug(). This suppresses the 'noise' for standard users while ensuring the status output remains clean (see below). If a developer needs to troubleshoot why a specific model's time isn't fetched, they can simply run with the debug logging level enabled.
========================================
Run: 0
Job ID: 100000.gadi-pbs
Run ID: aaaaa0000000
Stage: model-run
Job File: /scratch/tm70/...
========================================
What do you think?
There was a problem hiding this comment.
I think that could work. We don't have command-line flags currently that enables debug logging easily however. I think there was some talk about that in this issue: #673.
As this PR already has a number of changes, an option could be to use warnings now and then change it in a future PR?
There was a problem hiding this comment.
Right. I change this back to a print statement. When I use warnings.warn, it includes the file path and line numbers in the output. I want to avoid these technical "noise" in payu status display, so I left it as a print statement.
payu/models/accessom2.py
Outdated
| warnings.warn(f"Log file {log_path} does not exist or does not contain current model time.") | ||
| return None | ||
|
|
||
| except KeyError as e: |
There was a problem hiding this comment.
What raises the KeyError here? If 'cur_exp-datetime' is in a string but not a top-level key in the json?
Could also guard against errors with json.loads(line) where line isn't valid json.
There was a problem hiding this comment.
I was thinking if line cannot be parsed as json or 'cur_exp-datetime' is not the top key.
I changed it to:
if 'cur_exp-datetime' in line:
line_json = json.loads(line)
time_str = line_json.get('cur_exp-datetime', None)
if time_str is not None:
return cftime.datetime.strptime(time_str, '%Y-%m-%dT%H:%M:%S')
and let the error bubble up and then the top level will catch the error.
payu/models/cesm_cmeps.py
Outdated
| warn(f"Log file {log_path} does not exist or does not contain current model time.") | ||
| return None | ||
|
|
||
| except KeyError as e: |
There was a problem hiding this comment.
Should this be an IndexError for line.split()[4]
There was a problem hiding this comment.
You are right! It is good to correctly specify the error.
I also wondered if the whole change could be configured in config.yaml, e.g. the filename, the search string etc. But payu already has lots of configuration items hard coded in, so it seemed unnecessarily complicated ! |
…ur_expt_time to um; log error at debug level.
|
Thanks @jo-basevi for pointing out the exception handling. I totally agree that we should use so it simplifies the code without returning a trillion amount of None.
This is also a great idea. I have changed the driver to return |
jo-basevi
left a comment
There was a problem hiding this comment.
Thank you for adding in all those changes! Just added a few more comments.
| import logging | ||
| logger = logging.getLogger(__name__) | ||
|
|
||
| from payu.models.model import Model | ||
|
|
There was a problem hiding this comment.
| import logging | |
| logger = logging.getLogger(__name__) | |
| from payu.models.model import Model | |
| import logging | |
| from payu.models.model import Model | |
| logger = logging.getLogger(__name__) |
Just a small change to move setting the logger to after the imports. I think this applies to a couple other files
There was a problem hiding this comment.
Sounds good! I have moved all logger to after the imports in the latest commit. Some of the model drivers now raise an error which will bubble up to get_model_cur_expt_time() in build_job_info(), so logger is no longer needed and removed in those.
payu/models/accessom2.py
Outdated
| return cftime.datetime.strptime(time_str, '%Y-%m-%dT%H:%M:%S') | ||
|
|
||
| logger.debug(f"cur_exp-datetime not found in {log_path}") | ||
| return None |
There was a problem hiding this comment.
If catching exceptions higher up, could raise a ValueError here instead?
There was a problem hiding this comment.
That makes so much more sense. Thanks!
payu/models/model.py
Outdated
| logger.debug("Current experiment time not implemented for this model.") | ||
| return None |
There was a problem hiding this comment.
could raise NotImplementedError("Current experiment time not implemented for this model.")?
There was a problem hiding this comment.
Sounds good. I have improved this in the latest commit.
payu/models/mom6.py
Outdated
| 3: "proleptic_gregorian", | ||
| 4: "noleap" | ||
| } | ||
| return cftime_calendars[calendar_int] |
There was a problem hiding this comment.
Could move this function to the mom_mixin.py class to keep the ocean_solo.res logic together? Then the CFTIME_CALENDARS can be a global variable used by both get_restart_datetime and get_calendar
There was a problem hiding this comment.
Sounds good! It does make the code much cleaner.
payu/models/mom6.py
Outdated
| logger.debug(f"date_init not found in {input_path}") | ||
| return None |
There was a problem hiding this comment.
Could raise another ValueError here?
There was a problem hiding this comment.
Sounds good. Implemented in the latest commit.
payu/subcommands/status_cmd.py
Outdated
| config_path=config_path) | ||
| metadata.setup() | ||
| expt = Experiment(lab, config_path=config_path) | ||
| expt.init_models() |
There was a problem hiding this comment.
| expt.init_models() |
No need to run this as init_models() is already run as part of Experiment initialisation.
There was a problem hiding this comment.
Oh I see. Thanks for pointing this out :)
| ) | ||
|
|
||
| if json_output: | ||
| print(json.dumps(data, indent=4)) |
There was a problem hiding this comment.
If adding current experiment time to the json and the formatted display then it might make sense to add it in build_job_info() to reduce duplication and use status == model-run
payu/telemetry.py
Outdated
| timings: Optional[dict[str, Any]] = None, | ||
| cur_expt_time: Optional[str] = None |
There was a problem hiding this comment.
| timings: Optional[dict[str, Any]] = None, | |
| cur_expt_time: Optional[str] = None | |
| timings: Optional[dict[str, Any]] = None |
This code can safely be removed now I think
payu/models/model.py
Outdated
| def get_cur_expt_time(self): | ||
| """For model not implemented experiment time calculate/read-out, | ||
| leaves a warning and returns None.""" | ||
| print("Getting current experiment time is not yet implemented.") |
There was a problem hiding this comment.
I think that could work. We don't have command-line flags currently that enables debug logging easily however. I think there was some talk about that in this issue: #673.
As this PR already has a number of changes, an option could be to use warnings now and then change it in a future PR?
payu/status.py
Outdated
| cur_expt_time = None | ||
| try: | ||
| cur_expt_time = expt.get_model_cur_expt_time() | ||
| except (FileNotFoundError, IndexError, OSError, json.JSONDecodeError, UnboundLocalError) as e: |
There was a problem hiding this comment.
What causes the UnboundLocalError?
There was a problem hiding this comment.
There was somewhere that could potentially call cur_expt_time before assigning. But I think it has been changed now. I have removed UnboundLocalError in the lates commit.
…build_job_info; raise error and bubble up in build_job_info.
Thanks @jo-basevi. It sounds good! I have moved the calling of |
blimlim
left a comment
There was a problem hiding this comment.
Thanks @Qian-HuiChen for putting in those changes, they're looking really good! I've just had another read through, and added some small comments
payu/models/um.py
Outdated
| import os | ||
| import shutil | ||
| import string | ||
| import cftime |
There was a problem hiding this comment.
It looks like this one isn't directly being used!
There was a problem hiding this comment.
Good catch! Thanks :)
payu/models/um.py
Outdated
| """Get the current experiment time from file""" | ||
| log_path = os.path.join(self.expt.work_path, 'atmosphere', 'atm.fort6.pe0') | ||
|
|
||
| start_date = self.get_restart_datetime(self.expt.work_path + '/atmosphere') |
There was a problem hiding this comment.
Probably just being a bit pedantic here, but it could be worth consistently using os.path.join as with the log_path = line further above
There was a problem hiding this comment.
Sounds good to keep it consistent.
payu/models/model.py
Outdated
|
|
||
| Returns: | ||
| -------- | ||
| Current experiment time in string (e.g., 1900-01-01T01:00:00) |
There was a problem hiding this comment.
With the latest changes, the docstring can be updated to specify that it returns a cftime.datetime rather than string
There was a problem hiding this comment.
Good catch! Thanks :)
| ) | ||
|
|
||
| if json_output: | ||
| print(json.dumps(data, indent=4)) |
There was a problem hiding this comment.
While testing this out I just noticed that payu status --json also doesn't print out the model finish time after the run has completed:
pay========================================
Run: 0
Job ID: 163732205.gadi-pbs
Run ID: cd891450d95f0b8454f880b80bbb9fed40c034b4
Stage: archive
Total queue time: 0h 1m 27s
Model Finish Time: 0001-01-06T00:00:00
Exit Status: 0 (Success)
Model Exit Code: 0 (Success)
Output Log: /home/565/sw6175/esm1.6/misc/gettime-test/pre-industrial.o163732205
Error Log: /home/565/sw6175/esm1.6/misc/gettime-test/pre-industrial.e163732205
Job File: /scratch/tm70/sw6175/access-esm/archive/gettime-test-dev-preindustrial+concentrations-d33399d8/payu_jobs/0/run/163732205.gadi-pbs.json
========================================
u(payu-env) [sw6175@gadi-login-01 gettime-test]$ payu status --json
{
"experiment_uuid": "d33399d8-053d-4eb6-aa5b-2b7e0ff783e5",
"runs": {
"0": {
"run": [
{
"job_id": "163732205.gadi-pbs",
"run_id": "cd891450d95f0b8454f880b80bbb9fed40c034b4",
"stage": "archive",
"exit_status": 0,
"model_exit_status": 0,
"stdout_file": "/home/565/sw6175/esm1.6/misc/gettime-test/pre-industrial.o163732205",
"stderr_file": "/home/565/sw6175/esm1.6/misc/gettime-test/pre-industrial.e163732205",
"job_file": "/scratch/tm70/sw6175/access-esm/archive/gettime-test-dev-preindustrial+concentrations-d33399d8/payu_jobs/0/run/163732205.gadi-pbs.json",
"start_time": "2026-03-23T16:23:21.697106"
}
]
}
}
}
I'm wondering whether it would also work to add the finish time in build_job_info(), though I don't think I properly understand yet how the finish time is set.
test/models/test_mom6.py
Outdated
|
|
||
| import pytest | ||
| import f90nml | ||
| import logging |
There was a problem hiding this comment.
I think this one might not be needed here
There was a problem hiding this comment.
Good catch! Thanks :)
test/models/test_mom6.py
Outdated
| cur_expt_time = expt.get_model_cur_expt_time() | ||
|
|
||
|
|
||
| def test_read_start_date(caplog): |
There was a problem hiding this comment.
Just wanted to check if the caplog fixture is required here, and similarly for test_read_timestep
There was a problem hiding this comment.
Yeah it was used in previous versions but the new tests didn't use caplog anymore. I removed them now.
test/models/test_um.py
Outdated
| import cftime | ||
| import yaml | ||
| import f90nml | ||
| import logging |
There was a problem hiding this comment.
It looks like these two aren't currently being used
There was a problem hiding this comment.
Good catch! Thanks :)
test/models/test_um.py
Outdated
| parsed_run_dt = expt.model.get_restart_datetime(restart_path) | ||
| assert parsed_run_dt == date | ||
|
|
||
| def test_convert_timestep(caplog): |
There was a problem hiding this comment.
It looks like caplog isn't currently being used here and could be removed from the test
There was a problem hiding this comment.
Good catch! Thanks :)
| def get_cur_expt_time(self): | ||
| """ Use UM submodel to get the current experiment time.""" | ||
| model_types = ['um'] | ||
| return self.get_cur_expt_time_using_submodel(model_types) | ||
|
|
There was a problem hiding this comment.
Probably not super important as people will be moving to ESM1.6, however since ESM1.5 also uses the UM, it would be simple to add this feature to it. I think it would just involve copying these lines into access.py
There was a problem hiding this comment.
I have implemented this in access-esm1.5 in the latest commit.
In
payu status, display current experiment time for running jobs, as:This is implemented in models:
For other models, no current expt time will be displayed.
In
payu status, display current model finish time for archived jobs, e.g.,Closes #701 as part of #640.