EVS v2.0 Eval - jevs_plots_mesoscale_grid2obs walltime failure #926
Description
On Tuesday, March 17, `00/evs/v2.0/plots/mesoscale/jevs_plots_mesoscale_grid2obs` failed on walltime for EVS v2.0 in eval mode.
Here are two important points before getting into details:
- Production implementation is still scheduled for Tuesday, March 24. NCO does not anticipate changing the implementation date at this time.
- This issue is to request an investigation into the failure.
Runtimes
- EVS v2.0 Eval: typically ~125-140 minutes
- EVS v2.0 NCO para: typically ~118-143 minutes
- EVS v1.0 NCO prod: typically ~108-148 minutes (job: `00/evs/v1.0/mesoscale/plots/jevs_mesoscale_grid2obs_plots`)
These runtimes are fairly consistent. I don't see any concerning values, but I can provide more data if it's useful for debugging.
Analysis of $DATA and individual tasks
I saved related files to `/lfs/h1/nco/idsb/noscrub/russell.manser/projects/evs/eval_failures/`:
- `$DATA` directory: `evs_plots_mesoscale_grid2obs_last90days.eval.failed/`
- Job output: `evs_plots_mesoscale_grid2obs_last90days.o101934296`
GDIT helped with an analysis of slow tasks in the failed job:
- Many tasks were running for over 60 minutes, with the longest tasks running for 120+ minutes
- A list of task runtimes is in the temporary data directory at `grid2obs/out/workdirs/job_times.out`
- The slow tasks appeared to stall at `ush/mesoscale/df_preprocessing.py`

Some examples of slowdowns at `ush/mesoscale/df_preprocessing.py`:
- `job88/logs/EVS_verif_plotting_job88_grid2obs_747261.out` - ran for 90 minutes
- `job233/logs/EVS_verif_plotting_job233_grid2obs_747261.out` - ran for 89 minutes
I ran a test job in NCO test space and saved the $DATA directory. I found that the longest tasks typically run for ~50 minutes in a "normal" run.
GDIT did not find any evidence of disk slowdowns, node issues, or other system problems while this job was running.
Analysis of ush/mesoscale/df_preprocessing.py
The slowdown was reported in `df_preprocessing.py` at line 70. The next line that performs substantive work appears to be the call to `prune_data` on line 81.
EVS/ush/mesoscale/df_preprocessing.py
Lines 70 to 88 in dd3247c
```python
logger.info(f"Looking for stat files in {stats_dir} using the"
            + f" template: {output_base_template}")
if type(fcst_lev) == str:
    fcst_lev = [fcst_lev]
elif type(fcst_lev) != list:
    e = (f"fcst_lev ({fcst_lev}) is neither a string nor a list."
         + f" Please check the plotting script"
         + f" (e.g., \"lead_average.py\") to make sure"
         + f" fcst_level(s) is preprocessed correctly.")
    logger.error(e)
    raise TypeError(e)
prune_data(
    stats_dir, prune_dir, tmp_dir, output_base_template,
    valid_range, str(eval_period).upper(), str(verif_case).lower(),
    str(verif_type).lower(), str(line_type).upper(), str(domain),
    [' '+str(fcst_var_name)+' ' for fcst_var_name in fcst_var_names],
    str(var_name).upper(), model_list,
    [' '+str(lev)+' ' for lev in fcst_lev]
)
```
Following this into `ush/mesoscale/prune_stat_files.py`, I found a subprocess call to `grep -R`:
EVS/ush/mesoscale/prune_stat_files.py
Lines 114 to 118 in dd3247c
```python
for met_stat_file in met_stat_files:
    ps = subprocess.run('grep -R "'+model+'" '+met_stat_file+filter_cmd,
                        shell=True, stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT, encoding='UTF-8')
    grep_output = ps.stdout
```
This looks like a potential source of the slowdown if multiple tasks are working on the same files, or if there are many linked files. However, I am not sure whether that happens in practice, as I was not able to easily trace the work each task performs.
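If the per-file `grep` subprocesses do turn out to be the bottleneck (one fork/exec plus a shell per stat file), one possible streamlining is to do the filtering in-process. This is only a sketch under two assumptions: the stat files are plain text, and the `patterns` list stands in for the model name plus whatever `filter_cmd` appends (the real `filter_cmd` is a shell pipeline I have not unpacked here):

```python
def filter_stat_file(path, patterns):
    """Return the lines of one stat file that contain every pattern.

    Sketch of an in-process alternative to the per-file `grep -R`
    subprocess in prune_stat_files.py; it opens only the named
    regular file, so it cannot recurse through symlinks.
    """
    matches = []
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            if all(p in line for p in patterns):
                matches.append(line)
    return ''.join(matches)
```

This removes both the process-spawn overhead and any chance of `-R` descending into linked directories, at the cost of reimplementing whatever filtering `filter_cmd` performs.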
Questions
- Are there any file links that `grep -R` could get stuck on?
- Do these tasks need to follow file links, or do they only look at regular files?
- Do we expect that multiple tasks might run `prune_data` at the same time on the same files?
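On the first question: GNU grep documents `-R` as dereferencing all symbolic links during recursion, while `-r` follows only symlinks named on the command line, so symlinks (or a link cycle) under the search path are at least a plausible `grep -R` hazard. Here is a small self-contained check (hypothetical, not EVS code) that could be run on the target filesystem to confirm the behavior of the installed grep:

```python
import os
import subprocess
import tempfile

def grep_follows_symlink(flag):
    """Return True if `grep <flag>` finds a match that is reachable
    only through a directory symlink inside the searched tree."""
    with tempfile.TemporaryDirectory() as top:
        real = os.path.join(top, 'real')
        os.mkdir(real)
        with open(os.path.join(real, 'a.stat'), 'w') as f:
            f.write('MODELX 1.0\n')
        search = os.path.join(top, 'search')
        os.mkdir(search)
        # symlinked directory inside the tree being searched
        os.symlink(real, os.path.join(search, 'link'))
        ps = subprocess.run(['grep', flag, 'MODELX', search],
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, encoding='UTF-8')
        return 'MODELX' in ps.stdout
```

With GNU grep, `grep_follows_symlink('-R')` should return True and `grep_follows_symlink('-r')` False; if the stat directories contain symlinks back into large trees, `-R` would re-read everything behind them.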
Please investigate the walltime failure further. I am able to perform more tests as requested. If the tasks are indeed working on the same files, GDIT has offered to provide support for streamlining this workflow.