Dagster jobs fail because of missing resources in kubernetes #3033

Open
IcaroG opened this issue Feb 12, 2025 · 6 comments
Labels
c:devops Devops tooling

Comments

@IcaroG
Contributor

IcaroG commented Feb 12, 2025

Describe the Bug

Dagster schedules a new pod for each step of a job. Since we have a limited amount of resources in our cluster, these pods sometimes fail to schedule with an OutOfCpu error.

Expected Behavior

Jobs shouldn't fail because of limited resources. We should either decrease the number of concurrent jobs allowed to run in Dagster or increase the node pool in our cluster.

IcaroG added the c:devops Devops tooling label Feb 12, 2025

linear bot commented Feb 12, 2025

github-project-automation bot moved this to Backlog in OSO Feb 12, 2025
@IcaroG
Contributor Author

IcaroG commented Feb 12, 2025

I've already decreased the concurrency in Dagster from 30 to 15, but we still see some errors when running a big job. I'm wondering if there are other configurations, or differences between steps/ops/jobs in Dagster, that we should look at.
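
For reference, the run-level concurrency limit only caps how many runs launch at once; with the k8s_job_executor each run can still spawn a pod per step. A minimal sketch of throttling the step pods themselves, assuming the executor's `max_concurrent` option is the right knob here (the value, job name, and selection are placeholders):

```python
from dagster import define_asset_job
from dagster_k8s import k8s_job_executor

# Cap the number of step pods a single run may have scheduled at once, so a
# large job can't flood the node pool even when the run-level limit allows it.
# The value 5 is illustrative, not a recommendation.
throttled_k8s_executor = k8s_job_executor.configured({"max_concurrent": 5})

# Attach the throttled executor to a job (name and selection are placeholders).
big_job = define_asset_job(
    name="big_job",
    selection="*",
    executor_def=throttled_k8s_executor,
)
```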

IcaroG self-assigned this Feb 12, 2025
@ryscheng
Member

Relatedly, the core-assets job should fail when this happens and models get skipped (right now it's silently succeeding).

@ravenac95
Member

Hrm. Very interesting. Let me think on this. We might need to do some combination of Kubernetes and Dagster things to make this work well.

@ravenac95
Member

Ok. Hrm this is an annoying one and I think I see the issue but we have a couple things at play here that make this difficult.

So Dagster has a couple of somewhat convoluted abstractions. The first is the run, which is started by the run launcher (which is itself invoked by the run coordinator). The second is the step, which is managed by the executor. We enabled the k8s_job_executor as the executor, which might be a mistake except in some circumstances. We did this because it’s what was required for setting per-asset CPU/memory requests. That said, some jobs probably don’t need it, which could help some of them complete; in particular, anything that calls out to separate services and does no actual processing on its own.
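
A minimal sketch of what opting such a lightweight job out of the k8s_job_executor could look like, assuming the built-in multiprocess executor is acceptable for it (the job name and selection are placeholders):

```python
from dagster import define_asset_job, multiprocess_executor

# For a job that mostly calls out to external services and does little local
# processing, run its steps as subprocesses inside the run pod instead of
# launching a Kubernetes job (and pod) per step.
lightweight_job = define_asset_job(
    name="lightweight_service_calls",
    selection="*",
    executor_def=multiprocess_executor,
)
```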

The alternative is that we figure out how to have the run launcher pods use little to no CPU/memory. I’m curious whether it’s the run pods or the step pods that are getting OutOfCpu errors.

@ravenac95
Member

Oh, another thing on this: we now have some data on pod CPU/memory usage. We should adjust the requests and limits of these Dagster jobs, maybe make the requests smaller (to allow more colocation) and the limits a little higher. This might end up being somewhat contentious on resources, but if we're running a bunch of pods just to monitor each run, then those pods likely aren't all that busy. If we also specifically set some jobs to use the multiprocess executor, then we'll likely end up with fewer pods all around. For example, the alerting job should just use the multiprocess executor (I'm not sure how many pods it attempts to spawn).
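
A minimal sketch of the per-asset request/limit tuning via the dagster-k8s/config op tag; the asset name and the numbers are placeholders that would need to come from the usage data mentioned above:

```python
from dagster import asset

@asset(
    op_tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    # Smaller requests let the scheduler pack more step pods
                    # onto a node; higher limits still allow bursts.
                    "requests": {"cpu": "250m", "memory": "512Mi"},
                    "limits": {"cpu": "1", "memory": "2Gi"},
                }
            }
        }
    }
)
def example_model():
    ...
```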
