RFC: Bulk delete of jobs #844
I'll comment here in support of speeding this up. On my end, restarting from a set of bulk calculations through pyiron takes an absurd amount of time: it takes ~3 minutes per job with delete_existing_job=True versus ~1 second to submit a job when no previous job exists. I am finding it faster to prune out the jobs that didn't previously finish before restarting the pyiron workflow. Faster job removal is key for "resubmit-able" workflows.
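For concreteness, the resubmission pattern being described looks roughly like the sketch below. This is a minimal illustration rather than the actual workflow; the project name, job names, and the Lammps job type are placeholders:

```python
from pyiron import Project

pr = Project('bulk_calcs')  # placeholder project name
for name in ['job_0001', 'job_0002']:  # placeholder job names
    job = pr.create.job.Lammps(name)
    # On resubmission, an existing job must first be removed; this removal,
    # triggered by delete_existing_job=True, is where the per-job overhead appears.
    job.run(delete_existing_job=True)
```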
I have a pull request open for this (#1182). Maybe you can check whether it solves your issue; then we should move forward and include it in the next release.
@mgt16-LANL Can you run your workflow (re-)submission under cProfile?

```python
import cProfile

prof = cProfile.Profile()
prof.enable()
...  # your workflow
prof.disable()
prof.dump_stats('myprofile.pyprof')
```
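The resulting dump can then be inspected with the standard library's pstats module; a minimal sketch, where the sort key and the number of printed rows are arbitrary choices:

```python
import pstats

stats = pstats.Stats('myprofile.pyprof')
# Sort by cumulative time to see which call tree dominates the removals.
stats.sort_stats('cumulative').print_stats(20)
```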
@jan-janssen - Unfortunately @pmrv is right. For context: I am running thousands of calculations in a single project without the POSTGRES database backend enabled. I am intentionally giving each run more jobs than will finish during the walltime, and I would like to simply restart the ones that don't finish. While debugging why each job took ~3 minutes to start upon resubmission in a ~10,000-job submission, versus ~1 second on initial submission, I found that the bulk of the time was most likely spent removing each existing job; e.g. pr.remove_job(job_id) took more than 2 minutes. On a smaller project (~3,000 calculations, where it definitely runs faster) I ran the following under cProfile:

```python
import cProfile
from pyiron import Project
from tqdm import tqdm

prof = cProfile.Profile()
prof.enable()
pr = Project('others_async_adf')
table = pr.job_table()
adf_non_finish = table[(table.status != 'finished') & (table.job.str.contains('adf'))]
# Remove the first 50 unfinished jobs while timing each removal.
for i, row in tqdm(adf_non_finish.iloc[0:50].iterrows(), total=adf_non_finish.iloc[0:50].shape[0]):
    pr.remove_job(row['id'])
prof.disable()
prof.dump_stats('myprofile.pyprof')
```

For the removal of 50 calculations this was the tqdm output: [tqdm output attached]. The .pyprof file is attached (zipped). My temporary workaround to make this go faster is to hardcode a separate script that clears out the directories of the jobs I know didn't finish using pathlib, which is much faster than the pr.remove_job() route. Still, I think it would be much cleaner for re-submissions if pyiron removed jobs faster.

Edit/note: The especially slow deletions (e.g. 2 minutes) might have occurred while the filesystem was having issues; this is on a system where that can be hard to tell. Either way, 11 s per deletion is far too slow for a serial set of submissions: at ~11 s each, ~9,000 removals add up to roughly 27 hours of delay.

Edit2: Tracked down the timing when I tried this on the larger project:

```python
for i, row in tqdm(adf_non_finish.iterrows(), total=adf_non_finish.shape[0]):
    pr.remove_job(row['id'])
```

tqdm output: [attached]
pyiron.log output (initial submission): [attached]
Ah, you are using the file-based database backend.
Thanks @pmrv, I am a bit hesitant to switch to the POSTGRES database backend on this system. My pathlib workaround looks like this:

```python
from tqdm import tqdm
from pathlib import Path
import os.path as osp
import shutil

p = Path('others_async_adf/')
# File names will not get compressed to .tar.bz2 when a job is not complete,
# so a remaining input.ams marks an unfinished job.
paths = [x for x in p.rglob('input.ams')]
print('Total_unfinished', len(paths))
for path in tqdm(paths, total=len(paths)):
    # Get rid of the job's working directory.
    rmdir = osp.join('.', path.parts[0], path.parts[1])
    shutil.rmtree(rmdir)
    # Get rid of the .h5 file named after the job.
    unlink_path = Path(osp.join(path.parts[0], path.parts[2] + '.h5'))
    unlink_path.unlink()
```

This took on the order of 1 minute to clear out over ~9,000 unfinished jobs.
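A variant of the same cleanup that doesn't depend on globbing for a code-specific input file could start from the job table instead. This is only a sketch, assuming the table's project column holds each job's project directory and that the working directory sits next to the .h5 file as <job>_hdf5, as in the script above:

```python
import shutil
from pathlib import Path
from pyiron import Project

pr = Project('others_async_adf')
table = pr.job_table()
for _, row in table[table.status != 'finished'].iterrows():
    base = Path(row['project'])  # assumes 'project' holds the project directory path
    # Remove the working directory and the job's HDF5 file directly,
    # bypassing the slow pr.remove_job() path discussed above.
    shutil.rmtree(base / (row['job'] + '_hdf5'), ignore_errors=True)
    (base / (row['job'] + '.h5')).unlink(missing_ok=True)
```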
We have discussed a few times already that deleting a lot of jobs takes too much time; removing thousands of jobs can take a few hours.
Here's an idea of how to do it better:
Optionally we could stop after steps 2 and 3 to be cautious.
Some points on why I picked this order:
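As a rough illustration of the direction this RFC points in, a bulk delete could batch the database deletions into a single call and sweep the filesystem once, instead of paying one full round trip per job. Everything in this sketch is hypothetical, in particular db.delete_items, which is not an existing pyiron API:

```python
import shutil
from pathlib import Path

def bulk_remove(pr, job_ids):
    """Hypothetical sketch: delete many jobs with one DB call and one FS sweep."""
    table = pr.job_table()
    rows = table[table.id.isin(job_ids)]
    # Drop all database rows in a single batched call instead of one per job
    # (delete_items is a hypothetical method, not pyiron's real database API).
    pr.db.delete_items(job_ids)
    # Then remove HDF5 files and working directories in one pass.
    for _, row in rows.iterrows():
        base = Path(row['project'])
        (base / (row['job'] + '.h5')).unlink(missing_ok=True)
        shutil.rmtree(base / (row['job'] + '_hdf5'), ignore_errors=True)
```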