Dagster jobs fail because of missing resources in kubernetes #3033

Open
IcaroG opened this issue Feb 12, 2025 · 6 comments
Labels
c:devops Devops tooling

Comments

@IcaroG
Contributor

IcaroG commented Feb 12, 2025

Describe the Bug

Dagster schedules a new pod for each step of a job. Since we have a limited amount of resources in our cluster, these pods sometimes fail to schedule with an OutOfCpu error.

Expected Behavior

Jobs shouldn't fail because of limited resources. We should either decrease the number of concurrent jobs allowed to run in Dagster or increase the node pool in our cluster.

IcaroG added the c:devops Devops tooling label Feb 12, 2025

linear bot commented Feb 12, 2025

github-project-automation bot moved this to Backlog in OSO Feb 12, 2025
@IcaroG
Contributor Author

IcaroG commented Feb 12, 2025

I've already decreased the concurrency in Dagster from 30 to 15, but we still see some errors when running a big job. I'm wondering if there are other configurations, or differences between steps/ops/jobs in Dagster, that we should look at.
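
For reference, the run-level concurrency limit only caps how many runs launch at once; with the k8s_job_executor each run can still spawn a pod per step. A minimal sketch of throttling the step pods themselves, assuming the executor's `max_concurrent` option is the right knob here (the value, job name, and selection are placeholders):

```python
from dagster import define_asset_job
from dagster_k8s import k8s_job_executor

# Cap the number of step pods a single run may have scheduled at once, so a
# large job can't flood the node pool even when the run-level limit allows it.
# The value 5 is illustrative, not a recommendation.
throttled_k8s_executor = k8s_job_executor.configured({"max_concurrent": 5})

# Attach the throttled executor to a job (name and selection are placeholders).
big_job = define_asset_job(
    name="big_job",
    selection="*",
    executor_def=throttled_k8s_executor,
)
```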

IcaroG self-assigned this Feb 12, 2025
@ryscheng
Member

Relatedly, the core-assets job should fail when this happens and models get skipped (right now it's silently succeeding).

@ravenac95
Member

Hrm. Very interesting. Let me think on this. We might need to do some combination of Kubernetes and Dagster things to make this work well.

@ravenac95
Member

Ok. Hrm this is an annoying one and I think I see the issue but we have a couple things at play here that make this difficult.

So Dagster has a couple of somewhat convoluted abstractions. The first is the run, which is started by the run launcher (which is itself invoked by the run coordinator). The second is the step, which is managed by the executor. We enabled the k8s_job_executor as the executor, which might be a mistake except in some circumstances. We did this because it’s what was required for setting per-asset CPU/memory requests. That said, some jobs probably don’t need it, which could help some of them complete; in particular, anything that calls out to separate services and does no actual processing on its own.
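
A minimal sketch of what opting such a lightweight job out of the k8s_job_executor could look like, assuming the built-in multiprocess executor is acceptable for it (the job name and selection are placeholders):

```python
from dagster import define_asset_job, multiprocess_executor

# For a job that mostly calls out to external services and does little local
# processing, run its steps as subprocesses inside the run pod instead of
# launching a Kubernetes job (and pod) per step.
lightweight_job = define_asset_job(
    name="lightweight_service_calls",
    selection="*",
    executor_def=multiprocess_executor,
)
```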

The alternative is that we figure out how to have the run launcher pods use little to no CPU/memory. I’m curious whether it’s the run pods or the step pods that are getting OutOfCpu errors.

@ravenac95
Member

Oh, another thing on this: we now have some data on pod CPU/memory usage. We should adjust the requests and limits of these Dagster jobs, maybe make the requests smaller (to allow more colocation) and the limits a little higher. This might end up being somewhat contentious on resources, but if we're running a bunch of pods just to monitor each run, then those pods likely aren't all that busy. If we also specifically set some jobs to use the multiprocess executor, then we'll likely end up with fewer pods all around. For example, the alerting job should just use the multiprocess executor (I'm not sure how many pods it attempts to spawn).
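
A minimal sketch of the per-asset request/limit tuning via the dagster-k8s/config op tag; the asset name and the numbers are placeholders that would need to come from the usage data mentioned above:

```python
from dagster import asset

@asset(
    op_tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    # Smaller requests let the scheduler pack more step pods
                    # onto a node; higher limits still allow bursts.
                    "requests": {"cpu": "250m", "memory": "512Mi"},
                    "limits": {"cpu": "1", "memory": "2Gi"},
                }
            }
        }
    }
)
def example_model():
    ...
```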
