Replies: 1 comment
✦ My Analysis of the CodeRabbit Finding

Valid concern - this is a real tradeoff. The CodeRabbit bot correctly identifies that num_gpus=0 removes Ray's GPU resource accounting. However, there are mitigating factors:

The root problem:

Recommended approach (defer to Issue #314): This finding is essentially Issue #314, which I already created. The solution should be:

For now:

✦ Do you want me to:
server/app/workers/ray_actors.py
Comment on lines +30 to +33
```python
# No GPU requirement: Allow scheduling on CPU-only clusters.
# Plugins can detect and use GPU at runtime via CUDA availability.
# This fixes Issue #301: "No available node types can fulfill resource request {'GPU': 1.0}"
@ray.remote(num_gpus=0)
```
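The runtime-detection claim in this excerpt is easy to picture with a minimal sketch. Assuming the plugins use PyTorch (the review's scripts below grep for torch.cuda), device selection inside a plugin might look like the following; pick_device is a hypothetical helper, not code from the repository:

```python
import torch

def pick_device() -> str:
    # Hypothetical helper: fall back to CPU when the worker Ray
    # scheduled us onto has no visible CUDA device.
    return "cuda" if torch.cuda.is_available() else "cpu"
```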
⚠️ Potential issue | 🟠 Major
coderabbitai bot commented 25 minutes ago
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "== Ray decorators with GPU accounting =="
rg -n --type=py '@ray\.remote\(num_gpus=' server/app
echo
echo "== Spawn sites for StreamingToolActor and execute_pipeline_remote =="
rg -n -C2 --type=py 'StreamingToolActor\.remote\(|execute_pipeline_remote\.remote\(' server/app
echo
echo "== Plugin validate methods =="
rg -n -C3 --type=py 'def validate\(' server/app
echo
echo "== CUDA-specific model placement hints =="
rg -n -C3 --type=py 'torch\.cuda|\.to\("cuda"|device\s*=\s*"cuda"|torch\.device\("cuda"' server/app
```
Repository: rogermt/forgesyte
Length of output: 5913
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "== Plugin implementations: Does validate() load models? =="
find server/app/plugins -name "*.py" -type f | head -20 | while read -r f; do
  if rg -q 'def validate' "$f"; then
    echo "File: $f"
    rg -A20 'def validate' "$f" | head -25
    echo "---"
  fi
done
echo
echo "== Concurrency controls: semaphore, lock, max_tasks =="
rg -n --type=py 'semaphore|asyncio\.Lock|Semaphore|max_task' server/app/workers server/app/services | head -30
echo
echo "== Ray cluster config or resource setup =="
fd --type f '(ray_config|cluster|resources)' server/app
echo
echo "== validate() call sites and context =="
rg -n -B3 -A5 '\.validate\(' server/app/workers server/app/services | head -50
```
Repository: rogermt/forgesyte
Length of output: 1629
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "== Concrete plugin subclasses and validate() overrides =="
find server/app/plugins -name "*.py" -type f | while read -r f; do
  if grep -Eq 'class.*BasePlugin|class.*Plugin' "$f" 2>/dev/null; then
    echo "=== $f ==="
    rg -A30 'def validate\(' "$f"
  fi
done
echo
echo "== Ray actor/task options and resource configuration =="
rg -n -B5 -A10 '@ray\.remote' server/app/workers/ray_actors.py server/app/ray_tasks.py
echo
echo "== Actor/task execution: check for concurrent limits =="
rg -n -C5 'active_actors|max_concurrent|num_tasks' server/app/services/vision_analysis.py
echo
echo "== Full context of execute_pipeline_remote and StreamingToolActor creation =="
rg -n -B10 -A10 'def execute_pipeline_remote' server/app/ray_tasks.py | head -50
```
Repository: rogermt/forgesyte
Length of output: 565
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "== All plugin files with classes =="
find server/app/plugins -name "*.py" -type f -exec grep -l 'class.*Plugin' {} \;
echo
echo "== Plugin directory structure =="
find server/app/plugins -type f -name "*.py" | head -20
echo
echo "== Full execute_pipeline_remote definition =="
sed -n '90,130p' server/app/ray_tasks.py
echo
echo "== Full StreamingToolActor definition =="
sed -n '30,60p' server/app/workers/ray_actors.py
echo
echo "== vision_analysis.py actor creation context =="
sed -n '100,125p' server/app/services/vision_analysis.py
echo
echo "== Check for Ray task/actor options: num_cpus, max_concurrency, resources =="
rg -n 'num_cpus|max_concurrency|resources|scheduling_strategy' server/app/workers server/app/ray_tasks.py
```
Repository: rogermt/forgesyte
Length of output: 5178
Restore GPU resource accounting to prevent concurrent model-loading OOM.
Setting num_gpus=0 fixes CPU-only scheduling but removes Ray's only constraint on GPU memory. Since plugin.validate() can load models into VRAM, and StreamingToolActor is created per client/tool (vision_analysis.py:114) whilst execute_pipeline_remote is dispatched per job (worker.py:518), multiple actors and tasks will schedule concurrently and all attempt GPU preload without any admission control—increasing OOM risk significantly under load.
To maintain CPU-friendly defaults, add a plugin-specific option (e.g. .options(num_gpus=...) for GPU-heavy plugins), a custom Ray resource type, or an application-level semaphore around model preloading.
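As a rough illustration of the first option, the spawn site could pass GPU requirements through Ray's .options() API. This is a hedged sketch, not the repository's code: the requires_gpu flag and spawn_streaming_actor helper are hypothetical stand-ins for whatever plugin metadata Issue #314 settles on.

```python
import ray

@ray.remote  # no fixed num_gpus: resources are decided per call via .options()
class StreamingToolActor:
    def __init__(self, tool_name: str):
        self.tool_name = tool_name

def spawn_streaming_actor(tool_name: str, requires_gpu: bool):
    # Hypothetical helper: reserve a GPU in Ray's scheduler only for
    # GPU-heavy plugins, so CPU-only clusters can still schedule the
    # num_gpus=0 path while GPU loads get admission control.
    num_gpus = 1 if requires_gpu else 0
    return StreamingToolActor.options(num_gpus=num_gpus).remote(tool_name)
```

Because .options() applies per call, the same actor class serves both CPU-only and GPU deployments without changing the decorator.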
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In server/app/workers/ray_actors.py around lines 30-33, the current @ray.remote(num_gpus=0) decorator removes GPU accounting and allows concurrent GPU model loads; revert/remove the fixed num_gpus=0 and instead enforce GPU admission per-plugin or per-actor: update actor/task creation sites (e.g., where StreamingToolActor is instantiated in vision_analysis.py and where execute_pipeline_remote is dispatched in worker.py) to call .options(num_gpus=...) for GPU-heavy plugins based on plugin.validate() or plugin metadata, or register/require a custom Ray resource (e.g., "gpu_model") and allocate it via .options(resources={...}); alternatively implement an application-level semaphore around plugin.validate()/model preload to serialize GPU loads. Ensure references to the ray.remote decorated actor/class are preserved and that plugin metadata drives the chosen num_gpus/resource allocation.
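The application-level alternative named in the prompt can be sketched as a process-wide semaphore that serializes model preloads. Assuming an async call path, one possible shape follows; the semaphore and wrapper are hypothetical, and plugin.validate() (assumed synchronous here) is the method named in the finding:

```python
import asyncio

# Hypothetical admission control: allow at most one GPU model preload
# at a time across all concurrently scheduled actors and tasks.
_gpu_load_semaphore = asyncio.Semaphore(1)

async def validate_with_admission(plugin) -> None:
    # Serialize VRAM-heavy preloads so concurrent StreamingToolActor
    # instances and pipeline tasks cannot all load models at once.
    async with _gpu_load_semaphore:
        # Run the (assumed blocking) validate() off the event loop.
        await asyncio.to_thread(plugin.validate)
```

This trades throughput for safety: GPU loads queue behind one another, while CPU-only plugins can bypass the wrapper entirely.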