
Could you use Resalloc to avoid the initial waiting for machines? #288

Open
praiskup opened this issue Sep 5, 2024 · 15 comments

Comments

@praiskup
Member

praiskup commented Sep 5, 2024

Resalloc implements a machine pool pre-allocation: https://github.com/praiskup/resalloc
Used e.g. by Fedora Copr and OSH
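For readers unfamiliar with the idea, here is a minimal Python sketch of the pre-allocation pattern resalloc implements (this is not resalloc's actual API; `PreallocPool`, `spawn`, and the pool size are illustrative assumptions):

```python
import queue
import threading

class PreallocPool:
    """Minimal sketch of the pre-allocation idea: keep N resources
    started ahead of demand, so a request is served immediately
    instead of waiting for a fresh VM to boot."""

    def __init__(self, spawn, prealloc=2):
        self._spawn = spawn          # callable that starts a new resource (e.g. a VM)
        self._ready = queue.Queue()
        for _ in range(prealloc):    # fill the pool up-front
            self._ready.put(self._spawn())

    def take(self):
        # Hand out a pre-started resource right away, then refill the
        # pool in the background so the next caller also waits ~0s.
        resource = self._ready.get()
        threading.Thread(target=lambda: self._ready.put(self._spawn())).start()
        return resource
```

The real resalloc additionally tracks tickets and resource state in a database and supports multiple named pools; the sketch only shows why the first consumer does not pay the spawn latency.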

@praiskup
Member Author

praiskup commented Sep 5, 2024

Resalloc might be missing some important features for multi-platform-controller, let us know if so (I'm the author of the project and one of the current maintainers)

@ifireball
Member

Probably not going to happen; we are unlikely to integrate a Python project into MPC. Also, the existing dynamic-pool capability we already have is probably good enough. If we do, at some point, incorporate something external, it would probably be based on DRA or on Kata Containers with out-of-cluster hypervisors.

@praiskup
Member Author

praiskup commented Sep 5, 2024

we are probably not going to integrate a Python project into MPC

This runs as a separate small service; in OSH, for example, it runs containerized.

on DRA or using Kata-containers

Has any of those tools resolved the "pre-allocation" of resources?

@ifireball
Member

ifireball commented Sep 5, 2024

Has any of those tools resolved the "pre-allocation" of resources?

Those tools have huge communities and flexible external-provider support; some of those providers may support pre-allocated resource pools. But my point is that we don't want to use our own tools and would rather adopt emerging K8s standards.

This runs as a separate small service, containerized in OSH, e.g.

So one more service we would need to run, scale, monitor, write SOPs for, debug, etc.

I suppose it's also stateful, so we would also need to maintain some kind of database for it?

@praiskup
Member Author

praiskup commented Sep 5, 2024

But my point is that we don't want to use our own tools and would rather adopt emerging K8s standards.

I appreciate this perspective! But there's a tool that solves the problem, and somebody (else?) needs to either use it or reimplement it. And I'm curious whether it is worth waiting; is there any ETA?

So one more service we would need to run, scale, monitor, write SOPs for, debug, etc.

Yes, sure. For the existing deployments (hundreds of VMs maintained in parallel, in multiple clouds and on multiple dedicated hypervisors), the service itself is the smallest problem. The real problem is keeping up, and detecting the reasons why a particular VM in a given pool doesn't spawn; this is exactly where the service helps the administrator.

The scale of Konflux may be different, though, and at that scale operating this particular service might be substantial.

I suppose it's also stateful, so we would also need to maintain some kind of database for it?

It depends on the use case; the database may be local (and created when the pod starts). But then, if you restart the service, the allocated resources (VMs) also get cleaned up.

@praiskup
Member Author

praiskup commented Sep 5, 2024

#283 seems related

@ifireball
Member

#283 seems related

That is simply augmenting our IBM Cloud driver to support the pooling logic we already have (which already works very well on AWS). As I already mentioned, for now it seems good enough for what we need.

@praiskup
Member Author

praiskup commented Sep 5, 2024

/me is coming from the RPM build system world...

works very well on AWS

Allocation of a worker now takes >= 2 minutes in my experience with Konflux instances.

The build time varies depending on which RPM package is being built, but most of them are built in Mock in less than a minute. If we want to make the builds SLSA-isolated, we'll need to run Mock twice (the second run would be faster), so let me claim 2 minutes in total.

Allocating machines on demand gives us a 2×2-minutes-or-more penalty for every single build (the majority of the task time is spent on VM allocation). What this ticket proposes is to reduce that to a little-to-zero penalty.
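A quick back-of-the-envelope check of that penalty, using the figures quoted in this thread (assumed round numbers, not measurements):

```python
# Figures from the discussion above (assumptions, not measurements):
vm_alloc_min = 2      # one on-demand VM allocation takes >= 2 minutes
allocations = 2       # a SLSA-isolated build = two Mock runs = two VMs
mock_work_min = 2     # both Mock runs together take roughly 2 minutes

penalty_min = allocations * vm_alloc_min    # 4 minutes spent waiting
total_min = penalty_min + mock_work_min     # 6 minutes of wall time
print(f"{penalty_min / total_min:.0%} of the build is VM allocation")
```

Under these assumptions roughly two-thirds of the wall time is allocation overhead, which is the portion a pre-allocated pool could eliminate.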

@ifireball
Member

ifireball commented Sep 5, 2024

Well, for most cases I know of, the fully dynamic on-demand configuration is used, and that would indeed incur the penalty you mention. It's possible to configure things differently to use the so-called "dynamic pool", but that kind of conversation probably belongs in the support channels of the team maintaining your cluster, not in an issue for the open source controller project.

BTW, if you are using some isolation like mock, it may also be possible to run multiple pipelines on the same periodically reallocated host to make things even faster.

@praiskup
Member Author

praiskup commented Oct 9, 2024

BTW, if you are using some isolation like mock

I forgot to react: mock doesn't really isolate, but we do mock-in-podman for this.

it may also be possible to run multiple pipelines on the same periodically reallocated host to make things even faster.

Don't you have documentation for this? The builds often consume a lot of resources, and I'm still afraid that we cannot let multiple users onto the same machine at the same time for security reasons, but I'd like to understand how this works. It could be an option.

@ifireball
Member

BTW, if you are using some isolation like mock

I forgot to react: mock doesn't really isolate, but we do mock-in-podman for this.

IIRC mock can run its root environment inside a namespace, not just a plain chroot, though depending on what you do, sometimes a chroot is enough. But all of that is beside the point.

it may also be possible to run multiple pipelines on the same periodically reallocated host to make things even faster.

Don't you have documentation for this? The builds often consume a lot of resources, and I'm still afraid that we cannot let multiple users onto the same machine at the same time for security reasons, but I'd like to understand how this works. It could be an option.

See the "Dynamic Pool" section of the MPC architecture document:
https://github.com/konflux-ci/architecture/blob/main/architecture/multi-platform-controller.md
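As an illustration only, a dynamic pool entry in the MPC host-config ConfigMap might look roughly like the following. The key names and values here are assumptions modeled on MPC's `dynamic.*` host-config convention; consult the linked architecture document and the controller's own docs for the authoritative format.

```yaml
# Hypothetical sketch of an MPC host-config entry for a dynamic AWS
# pool; key names and values are illustrative, not authoritative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: host-config
  namespace: multi-platform-controller
data:
  dynamic.linux-arm64.type: "aws"
  dynamic.linux-arm64.region: "us-east-1"
  dynamic.linux-arm64.ami: "ami-0123456789abcdef0"   # placeholder AMI ID
  dynamic.linux-arm64.instance-type: "m6g.large"
  dynamic.linux-arm64.max-instances: "4"
```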

@cgwalters

#197 is also strongly related here.

When (hopefully not "if") copr switches to scheduling pods in Kubernetes to build RPMs, then all the logic that already exists for both vertical and horizontal pod autoscaling applies. resalloc has a high overlap with those today.

@ifireball
Member

@cgwalters there is some discussion going on about the possibility of completely replacing MPC for the Red Hat clusters, if you're interested, reach out to me on the internal slack so I can loop you in.

@xsuchy

xsuchy commented Jan 14, 2025

When (hopefully not "if") copr switches to scheduling pods in Kubernetes to build RPMs,

Hmm. Last time I checked, running K8s on different arches (x86_64, aarch64, s390x, ppc64le) in several regions across two different cloud providers was non-trivial. Is it production-ready now?

@cgwalters

Hmm. Last time I checked, running K8s on different arches (x86_64, aarch64, s390x, ppc64le) in several regions across two different cloud providers was non-trivial. Is it production-ready now?

There are several levels to this. But to start, there's: https://docs.openshift.com/container-platform/4.17/post_installation_configuration/configuring-multi-arch-compute-machines/multi-architecture-configuration.html

That said, if you dig in, a lot of the discussion in #197 is about different techniques for doing Kubernetes-like things for builds (and general runtime) without necessarily running a persistent cluster, which this project heavily overlaps with (and copr/koji overlap with as well).
