RFC: Bulk delete of jobs #844

Open
pmrv opened this issue Sep 20, 2022 · 7 comments
Assignees: pmrv
Labels: enhancement (New feature or request), question (Further information is requested)

pmrv (Contributor) commented Sep 20, 2022

We have discussed a few times already that deleting a lot of jobs takes too much time; removing thousands of jobs can take a few hours.

Here's an idea of how to do it better:

  1. get a list of jobs to be deleted
  2. prune or enrich the list based on linked jobs (master, parent, etc.)
  3. prune the list based on job status (e.g. running or submitted)
  4. make a backup of the database entries
  5. delete the database entries in bulk (even doing it sequentially only takes a few tens of seconds for ~1k jobs)
  6. delete the working directories of each job, aborting once there's an error
  7. if there was an error, restore the database entries of the jobs that were skipped.

Optionally we could stop after 2. & 3. to be cautious.

Some points on why I picked this order:

    3. comes after 1. and 2. because jobs added in 2. might not be finished yet.
    5. comes before 6. to avoid a potential race condition between multiple pyiron processes, where a working directory is already deleted but the database is not updated yet.
    4. & 7. are there to restore the database entries of jobs that have not been deleted on disk because of errors.

(A rough sketch of how steps 4. to 7. could fit together is given below.)
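
To make the ordering concrete, here is a minimal, self-contained sketch of steps 4. to 7. It uses a plain dict as a stand-in for the job table, and the <project>/<jobname>_hdf5 working-directory layout is only an assumption for illustration, not a statement about the actual pyiron schema or API:

import shutil
from pathlib import Path

def bulk_delete(db_rows, job_ids):
    # db_rows is a stand-in for the job table: job id -> row dict with
    # 'project' and 'job' entries; this is not the real pyiron database API.
    # 4. back up the database entries of the jobs to be deleted
    backup = {jid: db_rows[jid] for jid in job_ids}
    # 5. delete the database entries in bulk
    for jid in job_ids:
        del db_rows[jid]
    # 6. delete the working directories, aborting on the first error
    deleted = set()
    try:
        for jid, row in backup.items():
            # assumed layout: the working directory lives under <project>/<job>_hdf5
            workdir = Path(row["project"]) / (row["job"] + "_hdf5")
            if workdir.exists():
                shutil.rmtree(workdir)
            deleted.add(jid)
    except OSError:
        # 7. restore the database entries of the jobs that were skipped
        for jid, row in backup.items():
            if jid not in deleted:
                db_rows[jid] = row
        raise
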
pmrv added the enhancement and question labels and self-assigned this issue on Sep 20, 2022
mgt16-LANL commented
I'll comment here in support of speeding this up. On my end, restarting from a set of bulk calculations with pyiron takes an absurd amount of time - it's taking ~3 minutes per job with delete_existing_job=True vs. ~1 second to submit a job when there is no existing job. I'm finding it faster to prune out the jobs that didn't previously finish before restarting the pyiron workflow. Faster job removal is key for having "resubmit-able" workflows.

jan-janssen (Member) commented
I have a pull request open for this (#1182); maybe you can check if it solves your issue, then we should move forward and include it in the next release.

pmrv (Contributor, Author) commented Oct 23, 2023

I'm guessing #1182 and #1215 won't help much with the use case of something like

for _ in ...:
  j = pr.create.job.MyJob(..., delete_existing_job=True)
  j.run()

since it will still touch every job sequentially.

pmrv (Contributor, Author) commented Oct 23, 2023

@mgt16-LANL Can you run your workflow (re-)submission under cProfile and send the resulting profile here?

import cProfile
prof = cProfile.Profile()
prof.enable()

... # your workflow

prof.disable()
prof.dump_stats('myprofile.pyprof')

mgt16-LANL commented Oct 23, 2023

> I have a pull request open for this (#1182); maybe you can check if it solves your issue, then we should move forward and include it in the next release.

@jan-janssen - Unfortunately @pmrv is right. For context - I am running thousands of calculations in a single project without the Postgres database backend enabled. I am intentionally giving each run more jobs than will finish during the walltime and would like to simply restart the ones that don't finish. While debugging why it was taking 3 minutes for each job to start upon resubmission in a ~10,000 job submission vs. 1 second on initial submission, I found that the bulk of the time was most likely being spent removing each job, e.g. pr.remove_job(job_id) took more than 2 minutes. On a smaller project (~3000 calculations, where it definitely runs faster) I ran the following under cProfile:

import cProfile
from pyiron import Project
from tqdm import tqdm
prof = cProfile.Profile()
prof.enable()
pr = Project('others_async_adf')
table = pr.job_table()
adf_non_finish = table[(table.status != 'finished') & (table.job.str.contains('adf'))]
for i,row in tqdm(adf_non_finish.iloc[0:50].iterrows(),total=adf_non_finish.iloc[0:50].shape[0]):
    pr.remove_job(row['id'])
prof.disable()
prof.dump_stats('myprofile.pyprof')

For the removal of 50 calculations this was the tqdm output:
100%|██████████| 50/50 [09:21<00:00, 11.24s/it]
pyprof.zip

And the .pyprof file is attached (zipped).
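
In case it is useful for reading the attached profile: the dumped stats can be inspected with the standard-library pstats module, e.g.

import pstats

# load the dumped profile and print the 20 most expensive calls by cumulative time
stats = pstats.Stats('myprofile.pyprof')
stats.sort_stats('cumulative').print_stats(20)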

My temporary workaround to make this go faster is to hardcode a separate script that clears out the directories of jobs I know didn't finish using pathlib, which is much faster than the pr.remove_job() route - but it would be much cleaner for re-submissions if pyiron were faster at removing jobs.

Edit/note: The especially slow deletions (e.g. 2 mins) might have been occurring when the filesystem was having issues - this is on a system where that can be hard to tell. Either way, 11 s/deletion is way too slow for a serial set of submissions -> it adds up to hours of delay on thousands of jobs.

Edit2: I tracked down the timing when I tried this on the larger project:

for i,row in tqdm(adf_non_finish.iterrows(),total=adf_non_finish.shape[0]):
    pr.remove_job(row['id'])

tqdm output:
0%| | 4/8698 [05:12<190:42:47, 78.97s/it]

pyiron.log output (initial submission):
2023-10-13 01:41:45,399 - pyiron_log - INFO - run job: Gd0_arch id: None, status: initialized
2023-10-13 01:41:45,731 - pyiron_log - INFO - run job: Gd0_arch id: None, status: created
2023-10-13 01:41:45,735 - pyiron_log - INFO - run job: Gd1_arch id: None, status: initialized
2023-10-13 01:41:46,060 - pyiron_log - INFO - run job: Gd1_arch id: None, status: created
pyiron.log output (re-submission):
2023-10-19 13:00:04,173 - pyiron_log - INFO - run job: Er116_adf_0 id: None, status: initialized
2023-10-19 13:05:26,037 - pyiron_log - INFO - run job: Er116_adf_0 id: None, status: created
2023-10-19 13:05:26,104 - pyiron_log - INFO - run job: Lu116_adf_0 id: None, status: initialized
2023-10-19 13:11:28,282 - pyiron_log - INFO - run job: Lu116_adf_0 id: None, status: created

pmrv (Contributor, Author) commented Oct 23, 2023

Ah, you are using FileTable. According to the profile, most of the time (~80%) is spent here, where essentially the whole job table is re-read for a get_child_ids query (probably once per job). I know you don't run a central database, but would switching to the sqlite backend be an option?
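
If it helps: switching to the sqlite backend should only be a change in the ~/.pyiron configuration file. From memory the relevant entries look roughly like the following (treat the exact key names as an assumption and check the pyiron documentation before relying on them):

[DEFAULT]
TYPE = SQLite
FILE = pyiron.db
PROJECT_PATHS = ~/pyiron/projects
RESOURCE_PATHS = ~/pyiron/resources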

mgt16-LANL commented
Thanks @pmrv, I am a bit hesitant to switch to the sqlite backend since I'll often have jobs running in multiple project submissions at the same time, and I'm a bit concerned about file locking. My workaround code was:

from tqdm import tqdm
from pathlib import Path
import os.path as osp
import shutil

p = Path('others_async_adf/')
# File names will not get compressed to .tar.bz2 when job is not complete.
paths = [x for x in p.rglob('input.ams')] 
print('Total_unfinished',len(paths))
for path in tqdm(paths,total=len(paths)):
    # Get rid of directory
    rmdir = osp.join('.',path.parts[0],path.parts[1])
    shutil.rmtree(rmdir)
    # Get rid of .h5 of job name.
    unlink_path = Path(osp.join(path.parts[0],path.parts[2] + '.h5'))
    unlink_path.unlink()

This basically just finds the files that exist only for unfinished jobs and removes both the working directory and the .h5 file; it took on the order of 1 minute to clear out over ~9000 unfinished jobs. This might be a useful workaround for others.
