RFC: Bulk delete of jobs #844

Open
pmrv opened this issue Sep 20, 2022 · 7 comments
Assignees: pmrv
Labels: enhancement (New feature or request), question (Further information is requested)

pmrv (Contributor) commented Sep 20, 2022

We have discussed a few times already that deleting a lot of jobs takes too much time; removing thousands of jobs can take a few hours.

Here's an idea of how to do it better:

  1. get a list of jobs to be deleted
  2. prune or enrich the list based on linked jobs (master, parent, etc.)
  3. prune the list based on job status (e.g. running or submitted)
  4. make a backup of the database entries
  5. delete the database entries in bulk (even doing it sequentially only takes a few tens of seconds for ~1k jobs)
  6. delete the working directories of each job, aborting once there's an error
  7. if there was an error, restore the database entries of the jobs that were skipped.

Optionally we could stop after 2. & 3. to be cautious.

Some points on why I picked this order:

    3. comes after 1. and 2. because jobs added in 2. might not be finished yet.
    5. comes before 6. to avoid a potential race condition between multiple pyiron processes, where a working directory is already deleted but the database is not updated yet.
    4. & 7. are there to restore the database entries of jobs that have not been deleted on disk because of errors.

(A rough sketch of how steps 4. to 7. could fit together is given below.)
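
To make the ordering concrete, here is a minimal, self-contained sketch of steps 4. to 7. It uses a plain dict as a stand-in for the job table, and the <project>/<jobname>_hdf5 working-directory layout is only an assumption for illustration, not a statement about the actual pyiron schema or API:

import shutil
from pathlib import Path

def bulk_delete(db_rows, job_ids):
    # db_rows is a stand-in for the job table: job id -> row dict with
    # 'project' and 'job' entries; this is not the real pyiron database API.
    # 4. back up the database entries of the jobs to be deleted
    backup = {jid: db_rows[jid] for jid in job_ids}
    # 5. delete the database entries in bulk
    for jid in job_ids:
        del db_rows[jid]
    # 6. delete the working directories, aborting on the first error
    deleted = set()
    try:
        for jid, row in backup.items():
            # assumed layout: the working directory lives under <project>/<job>_hdf5
            workdir = Path(row["project"]) / (row["job"] + "_hdf5")
            if workdir.exists():
                shutil.rmtree(workdir)
            deleted.add(jid)
    except OSError:
        # 7. restore the database entries of the jobs that were skipped
        for jid, row in backup.items():
            if jid not in deleted:
                db_rows[jid] = row
        raise
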
pmrv added the enhancement and question labels and self-assigned this issue on Sep 20, 2022
mgt16-LANL commented
I'll comment here in support of speeding this up. On my end, restarting from a set of bulk calculations with pyiron takes an absurd amount of time - it's taking ~3 minutes per job with delete_existing_job=True vs. ~1 second to submit a job when there is no existing job. I'm finding it faster to prune out the jobs that didn't previously finish before restarting the pyiron workflow. Faster job removal is key for having "resubmit-able" workflows.

jan-janssen (Member) commented
I have a pull request open for this (#1182); maybe you can check if it solves your issue, then we should move forward and include it in the next release.

pmrv (Contributor, Author) commented Oct 23, 2023

I'm guessing #1182 and #1215 won't help much with the use case of something like

for _ in ...:
  j = pr.create.job.MyJob(..., delete_existing_job=True)
  j.run()

since it will still touch every job sequentially.

pmrv (Contributor, Author) commented Oct 23, 2023

@mgt16-LANL Can you run your workflow (re-)submission under cProfile and send the resulting profile here?

import cProfile
prof = cProfile.Profile()
prof.enable()

... # your workflow

prof.disable()
prof.dump_stats('myprofile.pyprof')

mgt16-LANL commented Oct 23, 2023

> I have a pull request open for this (#1182); maybe you can check if it solves your issue, then we should move forward and include it in the next release.

@jan-janssen - Unfortunately @pmrv is right. For context - I am running thousands of calculations in a single project without the Postgres database backend enabled. I am intentionally giving each run more jobs than will finish during the walltime and would like to simply restart the ones that don't finish. While debugging why it was taking 3 minutes for each job to start upon resubmission in a ~10,000 job submission vs. 1 second on initial submission, I found that the bulk of the time was most likely being spent removing each job, e.g. pr.remove_job(job_id) took more than 2 minutes. On a smaller project (~3000 calculations, where it definitely runs faster) I ran the following under cProfile:

import cProfile
from pyiron import Project
from tqdm import tqdm
prof = cProfile.Profile()
prof.enable()
pr = Project('others_async_adf')
table = pr.job_table()
adf_non_finish = table[(table.status != 'finished') & (table.job.str.contains('adf'))]
for i,row in tqdm(adf_non_finish.iloc[0:50].iterrows(),total=adf_non_finish.iloc[0:50].shape[0]):
    pr.remove_job(row['id'])
prof.disable()
prof.dump_stats('myprofile.pyprof')

For the removal of 50 calculations this was the tqdm output:
100%|██████████| 50/50 [09:21<00:00, 11.24s/it]
pyprof.zip

And the .pyprof file is attached (zipped).
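
In case it is useful for reading the attached profile: the dumped stats can be inspected with the standard-library pstats module, e.g.

import pstats

# load the dumped profile and print the 20 most expensive calls by cumulative time
stats = pstats.Stats('myprofile.pyprof')
stats.sort_stats('cumulative').print_stats(20)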

My temporary workaround to make this go faster is to hardcode a separate script that clears out the directories of jobs I know didn't finish using pathlib, which is much faster than the pr.remove_job() route - but it would be much cleaner for re-submissions if pyiron were faster at removing jobs.

Edit/note: The especially slow deletions (e.g. 2 mins) might have been occurring when the filesystem was having issues - this is on a system where that can be hard to tell. Either way, 11 s/deletion is way too slow for a serial set of submissions -> it adds up to hours of delay on thousands of jobs.

Edit2: I tracked down the timing when I tried this on the larger project:

for i,row in tqdm(adf_non_finish.iterrows(),total=adf_non_finish.shape[0]):
    pr.remove_job(row['id'])

tqdm output:
0%| | 4/8698 [05:12<190:42:47, 78.97s/it]

pyiron.log output (initial submission):
2023-10-13 01:41:45,399 - pyiron_log - INFO - run job: Gd0_arch id: None, status: initialized
2023-10-13 01:41:45,731 - pyiron_log - INFO - run job: Gd0_arch id: None, status: created
2023-10-13 01:41:45,735 - pyiron_log - INFO - run job: Gd1_arch id: None, status: initialized
2023-10-13 01:41:46,060 - pyiron_log - INFO - run job: Gd1_arch id: None, status: created
pyiron.log output (re-submission):
2023-10-19 13:00:04,173 - pyiron_log - INFO - run job: Er116_adf_0 id: None, status: initialized
2023-10-19 13:05:26,037 - pyiron_log - INFO - run job: Er116_adf_0 id: None, status: created
2023-10-19 13:05:26,104 - pyiron_log - INFO - run job: Lu116_adf_0 id: None, status: initialized
2023-10-19 13:11:28,282 - pyiron_log - INFO - run job: Lu116_adf_0 id: None, status: created

pmrv (Contributor, Author) commented Oct 23, 2023

Ah, you are using FileTable. According to the profile, most of the time (~80%) is spent here, where essentially the whole job table is re-read for a get_child_ids query (probably once per job). I know you don't run a central database, but would switching to the sqlite backend be an option?
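
If it helps: switching to the sqlite backend should only be a change in the ~/.pyiron configuration file. From memory the relevant entries look roughly like the following (treat the exact key names as an assumption and check the pyiron documentation before relying on them):

[DEFAULT]
TYPE = SQLite
FILE = pyiron.db
PROJECT_PATHS = ~/pyiron/projects
RESOURCE_PATHS = ~/pyiron/resources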

mgt16-LANL commented
Thanks @pmrv, I am a bit hesitant to switch to the sqlite backend since I'll often have jobs running in multiple project submissions at the same time, and I'm a bit concerned about file locking. My workaround code was:

from tqdm import tqdm
from pathlib import Path
import os.path as osp
import shutil

p = Path('others_async_adf/')
# File names will not get compressed to .tar.bz2 when job is not complete.
paths = [x for x in p.rglob('input.ams')] 
print('Total_unfinished',len(paths))
for path in tqdm(paths,total=len(paths)):
    # Get rid of directory
    rmdir = osp.join('.',path.parts[0],path.parts[1])
    shutil.rmtree(rmdir)
    # Get rid of .h5 of job name.
    unlink_path = Path(osp.join(path.parts[0],path.parts[2] + '.h5'))
    unlink_path.unlink()

This basically just finds the files that exist only for unfinished jobs and removes both the working directory and the .h5 file; it took on the order of 1 minute to clear out over ~9000 unfinished jobs. This might be a useful workaround for others.
