Support accelerator directive for local executor #5850
base: master
Conversation
✅ Deploy Preview for nextflow-docs-staging canceled.
@bentsherman The feature implemented in this PR would really help us use all the local GPUs without having to schedule tasks on them manually. Do you know when this feature will be released? Or is it even planned?
Right now I just put it out so that people can try it out, so I encourage you to try it with a local build of this PR. In principle we do want to have this, just haven't decided whether it should be part of
b4b321e to 069653d
Hi @bentsherman or @pditommaso, I finally got some time to try this out. However, I was not able to compile Nextflow from source. I used the following steps:

The error I get is as below: I'm not really sure why I receive the 403 status code during the download. Do you have any ideas to fix this? I would really like to try out this feature on our local GPU machines.
Hi @bentsherman or @pditommaso, I'd be happy if you could have a look at this PR: |
07b6a01 to d1f1a8a
Hi @bentsherman or @pditommaso, could you please also have a look at this PR: It proposes a fix to respect gpuIDs set in
d1f1a8a to ec6a888
ec6a888 to b88d058
@thealanjason thank you again; you actually inspired me to improve the overall approach and make it more generic. I removed the

@pditommaso I think this PR is ready for serious consideration. Using
It's great that now NVIDIA, AMD, and HIP devices can be handled generically :)
b88d058 to 90d1422
90d1422 to 28bff07
Just commenting to say this would help on SO many cloud compute deployments.
@ECM893 do you typically use the local executor in the cloud for GPUs? If so, I'm curious what your process looks like.
Yes. |
I've added one suggestion. Please check that the meaning is retained. Otherwise the docs look good. Approved.
0df91bb to 1eb6b61
Hi @bentsherman and @pditommaso

The command I am using is

My observation is in the environment variable

I'm not sure why this happens, but maybe some more tests are needed.
@thealanjason thanks for testing. I added an end-to-end test based on your use case, but I wasn't able to replicate your issue. Can you add

When a pipeline runs multiple GPU-enabled tasks on the same node, each task will see all GPUs and will not try to coordinate which task should use which GPU.
NVIDIA provides the `CUDA_VISIBLE_DEVICES` environment variable to control which tasks can see which GPUs, and users generally have to manage this variable themselves. Some HPC schedulers can assign this variable automatically, or use cgroups to control GPU visibility at a lower level.

Nextflow should be able to manage this variable for the local executor, so that the user doesn't have to add complex pipeline logic to do the same. Running a GPU workload locally on a multi-GPU node is a common use case, so it is worth doing.
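For illustration, the kind of manual bookkeeping referred to above might look roughly like the following; the process name, input shape, and `run_inference.sh` are hypothetical, and the upstream channel logic that hands out `gpu_id` values is the part users currently have to write themselves.

```groovy
// Hypothetical sketch of the manual approach: the pipeline itself pairs each
// task with a GPU index and exports CUDA_VISIBLE_DEVICES in the task script.
process gpuTask {
    input:
    tuple val(sample), val(gpu_id)   // gpu_id must be assigned by upstream channel logic

    script:
    """
    export CUDA_VISIBLE_DEVICES=${gpu_id}
    run_inference.sh ${sample}
    """
}
```

Even then, nothing prevents two concurrent tasks from receiving the same index unless that channel logic is written carefully, which is exactly the coordination problem described above.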
See the docs in the PR for usage.
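As a rough, non-authoritative sketch of the intended usage (the PR docs are authoritative; the process and script names below are made up), a task requests GPUs with the existing `accelerator` directive and the local executor is expected to export a per-task `CUDA_VISIBLE_DEVICES`:

```groovy
// Sketch only: request one GPU per task. With this PR, the local executor is
// expected to pick a free device and set CUDA_VISIBLE_DEVICES for the task.
process gpuTask {
    accelerator 1

    script:
    """
    echo "Assigned GPUs: \$CUDA_VISIBLE_DEVICES"
    run_inference.sh
    """
}
```

Concurrent tasks should then see disjoint device lists, up to the number of GPUs available on the node.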
To use with containers, you might have to add `CUDA_VISIBLE_DEVICES` to `docker.envWhitelist`. I'm not sure whether `CUDA_VISIBLE_DEVICES` works with containers or if you have to set `NVIDIA_VISIBLE_DEVICES`. You may also need `--gpus` for the docker command in order to use the GPUs at all; that can be set in `docker.runOptions` (a minimal config sketch is shown below).

See also: #5570
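A minimal `nextflow.config` sketch of the container settings mentioned above; `docker.envWhitelist` and `docker.runOptions` are standard Nextflow options, but whether `--gpus all` is needed depends on your Docker/NVIDIA runtime setup:

```groovy
// nextflow.config (sketch): pass CUDA_VISIBLE_DEVICES through to the container
// and expose the GPUs to Docker if your setup requires it.
docker {
    enabled      = true
    envWhitelist = 'CUDA_VISIBLE_DEVICES'
    runOptions   = '--gpus all'
}
```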