EVS v2.0 Eval - jevs_plots_mesoscale_grid2obs walltime failure #926
Description
On Tuesday, March 17, `00/evs/v2.0/plots/mesoscale/jevs_plots_mesoscale_grid2obs` failed on walltime for EVS v2.0 in eval mode.
Here are two important points before getting into details:
- Production implementation is still scheduled for Tuesday, March 24. NCO does not anticipate changing the implementation date at this time.
- This issue is to request an investigation into the failure.
Runtimes
- EVS v2.0 Eval: typically ~125-140 minutes
- EVS v2.0 NCO para: typically ~118-143 minutes
- EVS v1.0 NCO prod: typically ~108-148 minutes (job: `00/evs/v1.0/mesoscale/plots/jevs_mesoscale_grid2obs_plots`)
These runtimes are fairly consistent. I don't see any concerning values, but I can provide more data if it's useful for debugging.
Analysis of $DATA and individual tasks
I saved related files to `/lfs/h1/nco/idsb/noscrub/russell.manser/projects/evs/eval_failures/`:
- `$DATA` directory: `evs_plots_mesoscale_grid2obs_last90days.eval.failed/`
- Job output: `evs_plots_mesoscale_grid2obs_last90days.o101934296`
GDIT helped with an analysis of slow tasks in the failed job:
- Many tasks were running for over 60 minutes, with the longest tasks running for 120+ minutes
- A list of task runtimes is in the temporary data directory at `grid2obs/out/workdirs/job_times.out`
- The slow tasks appeared to stall at `ush/mesoscale/df_preprocessing.py`

Some examples of slowdowns at `ush/mesoscale/df_preprocessing.py`:
- `job88/logs/EVS_verif_plotting_job88_grid2obs_747261.out` - ran for 90 minutes
- `job233/logs/EVS_verif_plotting_job233_grid2obs_747261.out` - ran for 89 minutes
I ran a test job in NCO test space and saved the $DATA directory. I found that the longest tasks typically run for ~50 minutes in a "normal" run.
GDIT did not find any evidence of disk slowdowns, node issues, or other system problems while this job was running.
Analysis of ush/mesoscale/df_preprocessing.py
The slowdown was reported in `df_preprocessing.py` at line 70. The next line that performs substantive work appears to be the call to `prune_data` on line 81.
EVS/ush/mesoscale/df_preprocessing.py
Lines 70 to 88 in dd3247c
```python
logger.info(f"Looking for stat files in {stats_dir} using the"
            + f" template: {output_base_template}")
if type(fcst_lev) == str:
    fcst_lev = [fcst_lev]
elif type(fcst_lev) != list:
    e = (f"fcst_lev ({fcst_lev}) is neither a string nor a list."
         + f" Please check the plotting script"
         + f" (e.g., \"lead_average.py\") to make sure"
         + f" fcst_level(s) is preprocessed correctly.")
    logger.error(e)
    raise TypeError(e)
prune_data(
    stats_dir, prune_dir, tmp_dir, output_base_template,
    valid_range, str(eval_period).upper(), str(verif_case).lower(),
    str(verif_type).lower(), str(line_type).upper(), str(domain),
    [' '+str(fcst_var_name)+' ' for fcst_var_name in fcst_var_names],
    str(var_name).upper(), model_list,
    [' '+str(lev)+' ' for lev in fcst_lev]
)
```
Following this into `ush/mesoscale/prune_stat_files.py`, I found a subprocess call to `grep -R`:
EVS/ush/mesoscale/prune_stat_files.py
Lines 114 to 118 in dd3247c
```python
for met_stat_file in met_stat_files:
    ps = subprocess.run('grep -R "'+model+'" '+met_stat_file+filter_cmd,
                        shell=True, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, encoding='UTF-8')
    grep_output = ps.stdout
```
This looks like a potential source of the slowdown if multiple tasks are working on the same files, or if there are many linked files. However, I am not sure whether that happens in practice, as I was not able to easily trace the work each task performs.
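If the per-file `grep` subprocesses do turn out to be the bottleneck (one fork/exec plus a shell per stat file), one possible streamlining is to do the filtering in-process. This is only a sketch under two assumptions: the stat files are plain text, and the `patterns` list stands in for the model name plus whatever `filter_cmd` appends (the real `filter_cmd` is a shell pipeline I have not unpacked here):

```python
def filter_stat_file(path, patterns):
    """Return the lines of one stat file that contain every pattern.

    Sketch of an in-process alternative to the per-file `grep -R`
    subprocess in prune_stat_files.py; it opens only the named
    regular file, so it cannot recurse through symlinks.
    """
    matches = []
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            if all(p in line for p in patterns):
                matches.append(line)
    return ''.join(matches)
```

This removes both the process-spawn overhead and any chance of `-R` descending into linked directories, at the cost of reimplementing whatever filtering `filter_cmd` performs.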
Questions
- Are there any file links that `grep -R` could get stuck on?
- Do these tasks need to follow file links, or do they only look at regular files?
- Do we expect that multiple tasks might run `prune_data` at the same time on the same files?
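On the first question: GNU grep documents `-R` as dereferencing all symbolic links during recursion, while `-r` follows only symlinks named on the command line, so symlinks (or a link cycle) under the search path are at least a plausible `grep -R` hazard. Here is a small self-contained check (hypothetical, not EVS code) that could be run on the target filesystem to confirm the behavior of the installed grep:

```python
import os
import subprocess
import tempfile

def grep_follows_symlink(flag):
    """Return True if `grep <flag>` finds a match that is reachable
    only through a directory symlink inside the searched tree."""
    with tempfile.TemporaryDirectory() as top:
        real = os.path.join(top, 'real')
        os.mkdir(real)
        with open(os.path.join(real, 'a.stat'), 'w') as f:
            f.write('MODELX 1.0\n')
        search = os.path.join(top, 'search')
        os.mkdir(search)
        # symlinked directory inside the tree being searched
        os.symlink(real, os.path.join(search, 'link'))
        ps = subprocess.run(['grep', flag, 'MODELX', search],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, encoding='UTF-8')
        return 'MODELX' in ps.stdout
```

With GNU grep, `grep_follows_symlink('-R')` should return True and `grep_follows_symlink('-r')` False; if the stat directories contain symlinks back into large trees, `-R` would re-read everything behind them.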
Please investigate the walltime failure further. I am able to perform more tests as requested. If the tasks are indeed working on the same files, GDIT has offered to provide support for streamlining this workflow.