
Could you use Resalloc to avoid the initial waiting for machines? #288

Open
praiskup opened this issue Sep 5, 2024 · 15 comments

Comments

@praiskup
Member

praiskup commented Sep 5, 2024

Resalloc implements a machine pool pre-allocation: https://github.com/praiskup/resalloc
Used e.g. by Fedora Copr and OSH
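For readers unfamiliar with the idea, here is a minimal Python sketch of the pre-allocation pattern resalloc implements (this is not resalloc's actual API; `PreallocPool`, `spawn`, and the pool size are illustrative assumptions):

```python
import queue
import threading

class PreallocPool:
    """Minimal sketch of the pre-allocation idea: keep N resources
    started ahead of demand, so a request is served immediately
    instead of waiting for a fresh VM to boot."""

    def __init__(self, spawn, prealloc=2):
        self._spawn = spawn          # callable that starts a new resource (e.g. a VM)
        self._ready = queue.Queue()
        for _ in range(prealloc):    # fill the pool up-front
            self._ready.put(self._spawn())

    def take(self):
        # Hand out a pre-started resource right away, then refill the
        # pool in the background so the next caller also waits ~0s.
        resource = self._ready.get()
        threading.Thread(target=lambda: self._ready.put(self._spawn())).start()
        return resource
```

The real resalloc additionally tracks tickets and resource state in a database and supports multiple named pools; the sketch only shows why the first consumer does not pay the spawn latency.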

@praiskup
Member Author

praiskup commented Sep 5, 2024

Resalloc might be missing some important features for multi-platform-controller, let us know if so (I'm the author of the project and one of the current maintainers)

@ifireball
Member

Probably not going to happen; we are unlikely to integrate a Python project into MPC. Also, the existing dynamic-pool capability we already have is probably good enough. If we do, at some point, incorporate something external, it would probably be based on DRA or on Kata Containers with out-of-cluster hypervisors.

@praiskup
Member Author

praiskup commented Sep 5, 2024

we are probably not going to integrate a Python project into MPC

This runs as a separate small service; in OSH, for example, it runs containerized.

on DRA or using Kata-containers

Has any of those tools resolved the "pre-allocation" of resources?

@ifireball
Member

ifireball commented Sep 5, 2024

Has any of those tools resolved the "pre-allocation" of resources?

Those tools have huge communities and flexible external-provider support; some of those providers may support pre-allocated resource pools. But my point is that we don't want to use our own tools and would rather adopt emerging K8s standards.

This runs as a separate small service, containerized in OSH, e.g.

So one more service we would need to run, scale, monitor, write SOPs for, debug, etc.

I suppose it's also stateful, so we would also need to maintain some kind of database for it?

@praiskup
Member Author

praiskup commented Sep 5, 2024

But my point is that we don't want to use our own tools and would rather adopt emerging K8s standards.

I appreciate this perspective! But there's a tool that solves the problem, and somebody (else?) needs to either use it or reimplement it. And I'm curious whether it is worth waiting; is there any ETA?

So one more service we would need to run, scale, monitor, write SOPs for, debug, etc.

Yes, sure. For the existing deployments (hundreds of VMs maintained in parallel, in multiple clouds and on multiple dedicated hypervisors), the service itself is the smallest problem. The real problem is keeping up, and detecting the reasons why a particular VM in a given pool doesn't spawn; this is exactly where the service helps the administrator.

The scale of Konflux may be different, though, and at that scale operating this particular service might be substantial.

I suppose it's also stateful, so we would also need to maintain some kind of database for it?

It depends on the use case; the database may be local (and created when the pod starts). But then, if you restart the service, the allocated resources (VMs) also get cleaned up.

@praiskup
Member Author

praiskup commented Sep 5, 2024

#283 seems related

@ifireball
Member

#283 seems related

That is simply augmenting our IBM Cloud driver to support the pooling logic we already have (which already works very well on AWS). As I already mentioned, for now it seems good enough for what we need.

@praiskup
Member Author

praiskup commented Sep 5, 2024

/me is coming from the RPM build system world...

works very well on AWS

Allocation of a worker now takes >= 2 minutes in my experience with Konflux instances.

The build time varies depending on which RPM package is being built, but most of them are built in Mock in less than a minute. If we want to make the builds SLSA-isolated, we'll need to run Mock twice (the second run would be faster), so let me claim 2 minutes in total.

Allocating machines on demand gives us a 2×2-minutes-or-more penalty for every single build (the majority of the task time is spent on VM allocation). What this ticket proposes is to reduce that to a little-to-zero penalty.
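A quick back-of-the-envelope check of that penalty, using the figures quoted in this thread (assumed round numbers, not measurements):

```python
# Figures from the discussion above (assumptions, not measurements):
vm_alloc_min = 2      # one on-demand VM allocation takes >= 2 minutes
allocations = 2       # a SLSA-isolated build = two Mock runs = two VMs
mock_work_min = 2     # both Mock runs together take roughly 2 minutes

penalty_min = allocations * vm_alloc_min    # 4 minutes spent waiting
total_min = penalty_min + mock_work_min     # 6 minutes of wall time
print(f"{penalty_min / total_min:.0%} of the build is VM allocation")
```

Under these assumptions roughly two-thirds of the wall time is allocation overhead, which is the portion a pre-allocated pool could eliminate.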

@ifireball
Member

ifireball commented Sep 5, 2024

Well, for most cases I know of, the fully dynamic on-demand configuration is used, and that would indeed incur the penalty you mention. It's possible to configure things differently to use the so-called "dynamic pool", but that kind of conversation probably belongs in the support channels of the team maintaining your cluster, not in an issue for the open source controller project.

BTW, if you are using some isolation like mock, it may also be possible to run multiple pipelines on the same periodically reallocated host to make things even faster.

@praiskup
Member Author

praiskup commented Oct 9, 2024

BTW, if you are using some isolation like mock

I forgot to react: mock doesn't really isolate, but we do mock-in-podman for this.

it may also be possible to run multiple pipelines on the same periodically reallocated host to make things even faster.

Don't you have documentation for this? The builds often consume a lot of resources, and I'm still afraid that we cannot let multiple users onto the same machine at the same time for security reasons, but I'd like to understand how this works. It could be an option.

@ifireball
Member

BTW, if you are using some isolation like mock

I forgot to react: mock doesn't really isolate, but we do mock-in-podman for this.

IIRC mock can run its root environment inside a namespace, not just a plain chroot, though depending on what you do, sometimes a chroot is enough. But all of that is beside the point.

it may also be possible to run multiple pipelines on the same periodically reallocated host to make things even faster.

Don't you have documentation for this? The builds often consume a lot of resources, and I'm still afraid that we cannot let multiple users onto the same machine at the same time for security reasons, but I'd like to understand how this works. It could be an option.

See the "Dynamic Pool" section of the MPC architecture document:
https://github.com/konflux-ci/architecture/blob/main/architecture/multi-platform-controller.md
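As an illustration only, a dynamic pool entry in the MPC host-config ConfigMap might look roughly like the following. The key names and values here are assumptions modeled on MPC's `dynamic.*` host-config convention; consult the linked architecture document and the controller's own docs for the authoritative format.

```yaml
# Hypothetical sketch of an MPC host-config entry for a dynamic AWS
# pool; key names and values are illustrative, not authoritative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: host-config
  namespace: multi-platform-controller
data:
  dynamic.linux-arm64.type: "aws"
  dynamic.linux-arm64.region: "us-east-1"
  dynamic.linux-arm64.ami: "ami-0123456789abcdef0"   # placeholder AMI ID
  dynamic.linux-arm64.instance-type: "m6g.large"
  dynamic.linux-arm64.max-instances: "4"
```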

@cgwalters

#197 is also strongly related here.

When (hopefully not "if") copr switches to scheduling pods in Kubernetes to build RPMs, then all the logic that already exists for both vertical and horizontal pod autoscaling applies. resalloc has a high overlap with those today.

@ifireball
Member

@cgwalters there is some discussion going on about the possibility of completely replacing MPC for the Red Hat clusters, if you're interested, reach out to me on the internal slack so I can loop you in.

@xsuchy

xsuchy commented Jan 14, 2025

When (hopefully not "if") copr switches to scheduling pods in Kubernetes to build RPMs,

Hmm. Last time I checked, running K8s on different arches (x86_64, aarch64, s390x, ppc64le) in several regions across two different cloud providers was non-trivial. Is it production-ready now?

@cgwalters

Hmm. Last time I checked, running K8s on different arches (x86_64, aarch64, s390x, ppc64le) in several regions across two different cloud providers was non-trivial. Is it production-ready now?

There are several levels to this. But to start, there's: https://docs.openshift.com/container-platform/4.17/post_installation_configuration/configuring-multi-arch-compute-machines/multi-architecture-configuration.html

That said, if you dig in, a lot of the discussion in #197 is about different techniques for doing Kubernetes-like things for builds (and general runtime) without necessarily running a persistent cluster, which this project heavily overlaps with (and copr/koji overlap with as well).
