EVS v2.0 Eval - jevs_plots_mesoscale_grid2obs walltime failure #926

@RussellManser-NCO

Description

On Tuesday, March 17, the 00/evs/v2.0/plots/mesoscale/jevs_plots_mesoscale_grid2obs job exceeded its walltime limit during the EVS v2.0 evaluation run.

Here are two important points before getting into details:

  • Production implementation is still scheduled for Tuesday, March 24. NCO does not anticipate changing the implementation date at this time.
  • This issue is to request an investigation into the failure.

Runtimes

  • EVS v2.0 Eval: typically ~125-140 minutes
  • EVS v2.0 NCO para: typically ~118-143 minutes
  • EVS v1.0 NCO prod: typically ~108-148 minutes (job: 00/evs/v1.0/mesoscale/plots/jevs_mesoscale_grid2obs_plots)

These runtimes are fairly consistent. I don't see any concerning values, but I can provide more data if it's useful for debugging.

Analysis of $DATA and individual tasks

I saved related files to /lfs/h1/nco/idsb/noscrub/russell.manser/projects/evs/eval_failures/

  • $DATA directory: evs_plots_mesoscale_grid2obs_last90days.eval.failed/
  • Job output: evs_plots_mesoscale_grid2obs_last90days.o101934296

GDIT helped analyze the slow tasks in the failed job:

  • Many tasks were running for over 60 minutes, with the longest tasks running for 120+ minutes
  • A list of task runtimes is in the temporary data directory at grid2obs/out/workdirs/job_times.out
  • The slow tasks appeared to stall in ush/mesoscale/df_preprocessing.py

Some examples of slowdowns at ush/mesoscale/df_preprocessing.py:

  • job88/logs/EVS_verif_plotting_job88_grid2obs_747261.out - ran for 90 minutes
  • job233/logs/EVS_verif_plotting_job233_grid2obs_747261.out - ran for 89 minutes

I ran a test job in NCO test space and saved the $DATA directory. I found that the longest tasks typically run for ~50 minutes in a "normal" run.

GDIT did not find any evidence of disk slowdowns, node issues, or other system problems while this job was running.

Analysis of ush/mesoscale/df_preprocessing.py

The slowdown was reported in df_preprocessing.py at line 70. The next line that performs work appears to be the prune_data call on line 81.

```python
logger.info(f"Looking for stat files in {stats_dir} using the"
            + f" template: {output_base_template}")
if type(fcst_lev) == str:
    fcst_lev = [fcst_lev]
elif type(fcst_lev) != list:
    e = (f"fcst_lev ({fcst_lev}) is neither a string nor a list."
         + f" Please check the plotting script"
         + f" (e.g., \"lead_average.py\") to make sure"
         + f" fcst_level(s) is preprocessed correctly.")
    logger.error(e)
    raise TypeError(e)
prune_data(
    stats_dir, prune_dir, tmp_dir, output_base_template,
    valid_range, str(eval_period).upper(), str(verif_case).lower(),
    str(verif_type).lower(), str(line_type).upper(), str(domain),
    [' '+str(fcst_var_name)+' ' for fcst_var_name in fcst_var_names],
    str(var_name).upper(), model_list,
    [' '+str(lev)+' ' for lev in fcst_lev]
)
```
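To confirm whether the time is actually spent inside prune_data, a minimal timing wrapper could be added around the call. This is a hypothetical sketch for debugging only; the `timed` helper does not exist in EVS, and the usage comment mirrors the excerpt above.

```python
# Hypothetical debugging aid: log wall-clock time spent in a block
# so slow prune_data calls show up directly in the task logs.
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def timed(label):
    """Log the elapsed wall-clock time of the enclosed block at INFO level."""
    start = time.monotonic()
    try:
        yield
    finally:
        logger.info("%s took %.1f s", label, time.monotonic() - start)

# Possible usage around the call in df_preprocessing.py:
# with timed("prune_data"):
#     prune_data(stats_dir, prune_dir, tmp_dir, ...)
```

Correlating these per-call timings with job_times.out would show whether prune_data accounts for the 60-120 minute task runtimes.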

Following this into ush/mesoscale/prune_stat_files.py, I found a subprocess call to `grep -R`:

```python
for met_stat_file in met_stat_files:
    ps = subprocess.run('grep -R "'+model+'" '+met_stat_file+filter_cmd,
                        shell=True, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, encoding='UTF-8')
    grep_output = ps.stdout
```

This is a potential slowdown point if multiple tasks are reading the same files, or if there are many linked files. However, I could not easily trace which files each task was processing, so I cannot confirm this.
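One relevant detail: GNU grep's `-R` flag dereferences all symbolic links during recursion, while `-r` does not, so `-R` on a symlinked stat archive could re-read or loop over linked data. If that turns out to be the problem, one option is to do the matching in-process instead of spawning a shell per file. The sketch below is an illustration only, assuming the loop above; `grep_model_lines`, `met_stat_files`, and `extra_terms` (standing in for whatever `filter_cmd` pipes through) are my own hypothetical names.

```python
# Hypothetical in-process replacement for the per-file `grep -R` subprocess.
# Scanning each .stat file directly avoids one shell spawn per file and
# never follows symlinks (which `grep -R`, capital R, does).
import os

def grep_model_lines(met_stat_files, model, extra_terms=()):
    """Return lines containing `model` (and every term in `extra_terms`)
    from regular files only; symlinks are skipped entirely."""
    matches = []
    for path in met_stat_files:
        # Skip symlinks and non-regular files so linked data is never re-read.
        if os.path.islink(path) or not os.path.isfile(path):
            continue
        with open(path, encoding="UTF-8", errors="replace") as f:
            for line in f:
                if model in line and all(t in line for t in extra_terms):
                    matches.append(line)
    return matches
```

Whether this helps depends on whether the cost is in the `grep` processes themselves or in the underlying file I/O, which the timing data should clarify.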

Questions

  • Are there any file links that grep -R could get stuck on?
  • Do these tasks need to follow file links, or do they only look at regular files?
  • Do we expect that multiple tasks might run prune_data at the same time on the same files?
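If the answer to the last question is yes, one way to rule contention in or out is to serialize the pruning step with an advisory lock and see whether the slowdown disappears. This is a diagnostic sketch, not a proposed fix; `exclusive_lock` and the lock-file path are hypothetical.

```python
# Hypothetical diagnostic: serialize concurrent prune_data calls with an
# advisory file lock (Unix-only, via fcntl) to test the contention theory.
import fcntl
from contextlib import contextmanager

@contextmanager
def exclusive_lock(lock_path):
    """Block until an exclusive advisory lock on lock_path is held,
    then release it when the block exits."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

# Possible usage in df_preprocessing.py:
# with exclusive_lock(os.path.join(tmp_dir, "prune_data.lock")):
#     prune_data(...)
```

If runtimes normalize under the lock, overlapping reads of the same stat files are the likely culprit; if not, the cause is elsewhere.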

Please investigate the walltime failure further. I can run more tests as requested. If the tasks are indeed working on the same files, GDIT has offered support for streamlining this workflow.
