Autoscaling - Dynamic Services #657

Labels: PO issue (created by Product owners)

@mguidon

mguidon commented Aug 4, 2022

Auto-scale the docker swarm nodes for the dynamic services.

  • Reuse the tooling created in Autoscaling - Computational Backend #656 for scaling of the cluster
  • The metrics must come from the docker swarm and must include GPU
  • This will require assigning/consuming proper numbers for the available VRAM/AIRAM
  • Users will have to wait for machines to become available, and they need to be notified about that
  • The use case will be running tons of sim4life:web instances
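
A minimal sketch of the scale-up decision described above, assuming each pending dynamic service declares CPU, RAM and VRAM requirements and that the free resources per swarm node are already known. `Resources` and `count_unschedulable` are hypothetical names for illustration, not part of osparc-simcore, and the first-fit placement is a simplification of what the docker swarm scheduler actually does.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource vector; VRAM is tracked like any other quantity."""

    cpus: float
    ram_gib: float
    vram_gib: float = 0.0

    def fits_in(self, free: Resources) -> bool:
        # A request fits only if every dimension fits, GPU memory included.
        return (
            self.cpus <= free.cpus
            and self.ram_gib <= free.ram_gib
            and self.vram_gib <= free.vram_gib
        )

    def minus(self, used: Resources) -> Resources:
        return Resources(
            self.cpus - used.cpus,
            self.ram_gib - used.ram_gib,
            self.vram_gib - used.vram_gib,
        )


def count_unschedulable(pending: list[Resources], free_per_node: list[Resources]) -> int:
    """First-fit placement: return how many pending services find no node with room."""
    free = list(free_per_node)  # work on a copy, the caller's list stays untouched
    unplaced = 0
    for request in pending:
        for index, node_free in enumerate(free):
            if request.fits_in(node_free):
                free[index] = node_free.minus(request)
                break
        else:
            unplaced += 1
    return unplaced


# Example: one CPU-only node, one node with 8 GiB of VRAM, two GPU services pending.
nodes = [Resources(cpus=4, ram_gib=16), Resources(cpus=2, ram_gib=8, vram_gib=8)]
pending = [Resources(2, 4, 8), Resources(2, 4, 8), Resources(4, 8)]
print(count_unschedulable(pending, nodes))
```

If the count is above zero, the cluster would need at least one more node, and the users whose services are waiting are the ones to notify.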
@mrnicegyu11

mrnicegyu11 commented Sep 7, 2022

a

@sanderegg

sanderegg commented Oct 17, 2022

Goal for sprint Katherine Switzer

  • show scaling of nodes in the frontend
  • run autoscaling app background task
  • Gracefully handle: No more machines on AWS available
  • --> start a service and wait for a new node to start

@sanderegg

Update on sprint Katherine Switzer

  • New Autoscaling application in osparc-simcore stack
    • unit tested
    • allows backend devs to contribute
    • migration still in progress
  • Frontend notifications in progress (#3502)

@colinRawlings assigned mrnicegyu11 and mguidon and unassigned GitHK on Nov 14, 2022
@sanderegg

sanderegg commented Nov 14, 2022

Goal for sprint Athena

  • Migrate autoscaling script into osparc-simcore autoscaling app osparc-simcore#3558
  • Clean up nodes that remain listed in the docker swarm once their machines are gone
  • Gracefully handle the case where AWS has no more machines available (take the next bigger one) --> add an ENV SAN
  • all sim4life-light-dy shall start on auto-scaled nodes
  • add VRAM, Generic resources handling
  • test on AWS staging by deploying the full stack (reconfigure deployer) SAN/ALL
  • apply to AWS to get more vCPUs [ALL]
  • metrics (prometheus)
  • Robustness to bugs in dy-sidecar (restarts, manual killing,...)
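
The "take the next big one" fallback from the list above could look roughly like this. It is only a sketch: `NoCapacityError`, `launch_with_fallback` and the `launch` callable are hypothetical stand-ins, the real EC2 call and its error handling are not shown, and the candidate list is assumed to be ordered from smallest to largest (it could, for instance, be read from the ENV mentioned above).

```python
from __future__ import annotations

from typing import Callable, Sequence


class NoCapacityError(Exception):
    """Hypothetical error: the provider has no machine of the requested type."""


def launch_with_fallback(
    candidates: Sequence[str],
    start_index: int,
    launch: Callable[[str], str],
) -> tuple[str, str]:
    """Try the requested type first, then each bigger one in order.

    Returns (instance_type, instance_id) for the first type that could be
    started and raises NoCapacityError if every candidate is exhausted.
    """
    for instance_type in candidates[start_index:]:
        try:
            return instance_type, launch(instance_type)
        except NoCapacityError:
            continue  # no capacity for this size, move on to the next bigger one
    raise NoCapacityError("no capacity left for any candidate instance type")


def fake_launch(instance_type: str) -> str:
    """Stand-in for the real provider call: pretend 'small' is sold out."""
    if instance_type == "small":
        raise NoCapacityError(instance_type)
    return f"id-{instance_type}"


print(launch_with_fallback(["small", "medium", "large"], 0, fake_launch))
```

Re-raising once the list is exhausted keeps the "no machines at all" case visible to the caller, so the user can be told to wait instead of the request failing silently.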

@elisabettai

@mrnicegyu11 please add new devops stuff

@mrnicegyu11

The following tasks also have to be tackled on the DevOps side:

  • Determine how to provide the AMI image for autoscaled nodes (talk to @mguidon)
  • Increase the Linux password security of the machines
  • Create an AWS functional user with limited capabilities for the autoscaling task

@sanderegg

sanderegg commented Dec 1, 2022

Update on sprint Athena

Done

Ongoing

Next steps

@sanderegg

Goal for sprint Zefram Cochrane

@mguidon

mguidon commented Jan 20, 2023

Think about adding 1 machine as a buffer.

@sanderegg

sanderegg commented Feb 17, 2023

@sanderegg added this to the Mithril milestone on Feb 28, 2023
@sanderegg

sanderegg commented Feb 28, 2023

@sanderegg

sanderegg commented Mar 1, 2023

@drniiken for info, these are my latest tests with different combinations, as of yesterday, in staging-AWS (staging.osparc.io) using the latest code (not yet in production). We can check together tomorrow if you're interested.

testing s4l-lite 2.0.106 startup times

  • times [min:sec] coming from the osparc logs
  • fresh means the machine is created on demand (no buffer)
  • buffer means the machine was pre-created and is available
  • no pre-pull means nothing is pulled on machine start
  • ops pre-pull means only the monitoring docker images are pulled on start
  • ops + 2.0.106 (presumably the "full pre-pull" rows in the table) means the monitoring images and the largest image of s4l-lite are pulled on start
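
| type                        | service-started | new machine ready | sidecar first log | S4l-lite ready |
|-----------------------------|-----------------|-------------------|-------------------|----------------|
| fresh no pre-pull           | 00:00           | 02:18             | 04:30             | 07:09          |
| fresh ops pre-pull          | 00:00           | 02:21             | 03:53             | 06:37          |
| fresh full pre-pull         | 00:00           | 02:15             | 03:55             | 06:31          |
| buffer no pre-pull          | 00:00           | 00:09             | 01:42             | 04:18          |
| buffer full pre-pull        | 00:00           | 00:03             | 01:37             | 02:08          |
| hot drained node (standard) | 00:00           | 00:05             | 00:40             | ~01:00         |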
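
To see where the time goes, the cumulative timestamps listed above can be turned into per-phase durations. The sketch below only re-uses the reported numbers; `to_seconds`, `phase_durations` and the phase labels (machine, sidecar, service) are names chosen here for illustration, and the "~" on the last value is simply dropped.

```python
def to_seconds(stamp: str) -> int:
    """Convert a 'min:sec' stamp (optionally prefixed with '~') to seconds."""
    minutes, seconds = stamp.lstrip("~").split(":")
    return int(minutes) * 60 + int(seconds)


# Cumulative stamps per run: service-started, new machine ready,
# sidecar first log, S4l-lite ready (copied from the reported timings).
RUNS = {
    "fresh no pre-pull": ("00:00", "02:18", "04:30", "07:09"),
    "fresh ops pre-pull": ("00:00", "02:21", "03:53", "06:37"),
    "fresh full pre-pull": ("00:00", "02:15", "03:55", "06:31"),
    "buffer no pre-pull": ("00:00", "00:09", "01:42", "04:18"),
    "buffer full pre-pull": ("00:00", "00:03", "01:37", "02:08"),
    "hot drained node (standard)": ("00:00", "00:05", "00:40", "~01:00"),
}


def phase_durations(stamps: tuple) -> list:
    """Seconds spent between each pair of consecutive milestones."""
    seconds = [to_seconds(stamp) for stamp in stamps]
    return [later - earlier for earlier, later in zip(seconds, seconds[1:])]


for name, stamps in RUNS.items():
    machine, sidecar, service = phase_durations(stamps)
    total = to_seconds(stamps[-1])
    print(
        f"{name:28s} machine={machine:4d}s sidecar={sidecar:4d}s "
        f"service={service:4d}s total={total:4d}s"
    )
```

With these numbers, the last phase (sidecar first log to S4l-lite ready) takes 156 s in the buffer no pre-pull run and 31 s in the buffer full pre-pull run, while the machine phase drops from over two minutes on fresh machines to a few seconds on buffered ones.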
