Using spot policy on batch runs #315

Open
betolink opened this issue Mar 3, 2025 · 4 comments

Comments

betolink commented Mar 3, 2025

I don't know whether it's possible to force the spot VM policy when running a batch job (see https://docs.coiled.io/user_guide/batch.html).
Something like:

coiled batch run --name cluster --vm-type=t4g.medium --spot-policy=spot script.py
Member

ntabris commented Mar 3, 2025

Hi, @betolink. That's not supported just yet, but it's something we're planning to add pretty soon—most likely in the next week or two.

For your use case, is it okay if a task running on a spot instance that gets interrupted/preempted is marked as failed and not retried? Or would retries (and potentially replacing the spot instance) be important to you for using batch?

Author

betolink commented Mar 3, 2025

Hi @ntabris, yeah, it's OK if the cluster is not provisioned, or fails, when spot VMs are not available. A max-retries and wait-between-retries option would be ideal! And if a user really needs to get it done, then we could use spot_with_fallback, no? Thanks!!
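
As a rough illustration of the max-retries / wait-between-retries idea, here is a minimal Python sketch of a client-side retry wrapper. Everything in it is hypothetical: `SpotUnavailable`, `run_with_retries`, and the parameter names are invented for this example and are not Coiled options or APIs.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SpotUnavailable(Exception):
    """Hypothetical error meaning no spot capacity could be obtained."""


def run_with_retries(
    job: Callable[[], T],
    max_retries: int = 3,
    wait_between_retries: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call job(); on SpotUnavailable, wait and retry up to max_retries times.

    The job is attempted at most max_retries + 1 times in total. Any other
    exception propagates immediately without a retry.
    """
    retries_used = 0
    while True:
        try:
            return job()
        except SpotUnavailable:
            if retries_used >= max_retries:
                raise  # retries exhausted: surface the failure to the caller
            retries_used += 1
            sleep(wait_between_retries)
```

Here `job` would be whatever submits the batch run; the `sleep` parameter exists only so the wait can be replaced in tests.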

Member

ntabris commented Mar 3, 2025

We'll definitely support on-demand, spot, and spot with fallback as options for what happens when initially creating the cluster to run your batch job.

It's less clear to me right now what we'll do about spot VMs that have been running for a while and get reclaimed (as can happen for a spot VM).

Do you expect your individual batch tasks to be fast (say, seconds to a few minutes) or longer running (say, many minutes to hours)? How many batch tasks per job do you expect (a few, tens, hundreds)?
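
One way to read the three options mentioned in this thread (on-demand, spot, spot with fallback) is as a small decision model for the initial provisioning step. The sketch below is a toy model under that reading, not Coiled's implementation; `SpotPolicy` and `choose_vm` are names made up for this example, and it deliberately ignores the open question of spot VMs that are reclaimed after running for a while.

```python
from enum import Enum
from typing import Optional


class SpotPolicy(Enum):
    # Values mirror the option names used in this thread.
    ON_DEMAND = "on-demand"
    SPOT = "spot"
    SPOT_WITH_FALLBACK = "spot_with_fallback"


def choose_vm(policy: SpotPolicy, spot_available: bool) -> Optional[str]:
    """Return the kind of VM to request, or None if nothing can be provisioned.

    Toy assumption: on-demand capacity is always available.
    """
    if policy is SpotPolicy.ON_DEMAND:
        return "on-demand"
    if spot_available:
        return "spot"
    if policy is SpotPolicy.SPOT_WITH_FALLBACK:
        return "on-demand"  # fall back when no spot capacity exists
    return None  # strict spot policy: the job is not provisioned
```

Under this model, the strict `spot` policy is the only one that can leave the job unprovisioned, which matches the "it's OK if the cluster is not provisioned" case described earlier in the thread.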

Author

betolink commented Mar 3, 2025

In my case (processing many files) the run could take minutes to hours. I'm currently testing but I expect the tasks would be in the "tens" and each VM will process a few thousand files.
