Autoscaling - Dynamic Services #657

mguidon · 2022-08-04T09:20:42Z

Auto-scale the docker swarm nodes for the dynamic services.

Reuse the tooling created in Autoscaling - Computational Backend #656 to scale the cluster.
The metrics must come from the docker swarm and must include GPU.
This will require assigning/consuming proper numbers for the available VRAM/AIRAM.
As a result, users will have to wait for machines to become available, and they need to be notified about that.
The use case will be running tons of sim4life:web instances.
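
To illustrate the idea of deciding on a scale-up from swarm metrics that include GPU memory, here is a minimal sketch. It is not the autoscaling app's implementation: the `Resources` model, the `needs_scale_up` helper and the first-fit placement are assumptions made for this example only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource model for a swarm node or a service request."""

    cpus: float
    ram_bytes: int
    vram_bytes: int = 0  # GPU memory, assumed to be tracked like CPU and RAM

    def fits_in(self, other: Resources) -> bool:
        """True if this request can be satisfied by `other`."""
        return (
            self.cpus <= other.cpus
            and self.ram_bytes <= other.ram_bytes
            and self.vram_bytes <= other.vram_bytes
        )

    def minus(self, used: Resources) -> Resources:
        """Resources left after `used` has been reserved."""
        return Resources(
            self.cpus - used.cpus,
            self.ram_bytes - used.ram_bytes,
            self.vram_bytes - used.vram_bytes,
        )


def needs_scale_up(pending: list[Resources], available: list[Resources]) -> bool:
    """Return True if at least one pending service fits on no existing node.

    Uses a simple first-fit placement (an assumption, not a documented policy).
    """
    remaining = list(available)
    for request in pending:
        for index, node in enumerate(remaining):
            if request.fits_in(node):
                remaining[index] = node.minus(request)
                break
        else:
            # no node could host this request: a new machine is needed
            return True
    return False
```

In this sketch a service that requests VRAM on a cluster without GPU nodes triggers a scale-up even when CPU and RAM are free, which is why the metrics have to include GPU.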

mrnicegyu11 · 2022-09-07T09:11:27Z

a

sanderegg · 2022-10-17T14:45:18Z

Goal for sprint Katherine Switzer

show scaling of nodes in the frontend
run autoscaling app background task
Gracefully handle: No more machines on AWS available
--> start a service and wait for a new node to start

sanderegg · 2022-11-07T07:46:59Z

Update on sprint Katherine Switzer

New Autoscaling application in osparc-simcore stack
- unit tested
- allows backend devs to contribute
- migration still in progress
Frontend notifications in progress (#3502)

sanderegg · 2022-11-14T10:54:48Z

Goal for sprint Athena

Migrate autoscaling script into osparc-simcore autoscaling app osparc-simcore#3558
cleanup remaining nodes in the docker swarm, once they are gone
Gracefully handle the case where AWS has no more machines available (take the next bigger one) --> add an ENV [SAN]
all sim4life-light-dy shall start on auto-scaled nodes
add VRAM, Generic resources handling
test on AWS staging by deploying the full stack (reconfigure deployer) [SAN/ALL]
apply to AWS to get more vCPUs [ALL]

metrics (prometheus)
Robustness to bugs in dy-sidecar (restarts, manual killing,...)
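
The "take the next bigger one" goal above can be sketched as a fallback over an ordered list of machine types read from an environment variable. This is only an illustration: `NoCapacityError`, `parse_allowed_types`, `launch_with_fallback` and the injected `launch` callable are hypothetical names, and no real AWS call is made here.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence


class NoCapacityError(Exception):
    """Hypothetical error: the provider cannot supply the requested type."""


def parse_allowed_types(raw: str) -> list[str]:
    """Parse a comma-separated ENV value, ordered from smallest to biggest."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def launch_with_fallback(
    instance_types: Sequence[str], launch: Callable[[str], str]
) -> Optional[str]:
    """Try each type in order and return the id of the first started machine.

    Returns None when no type could be started, so the caller can notify
    the user instead of failing.
    """
    for instance_type in instance_types:
        try:
            return launch(instance_type)
        except NoCapacityError:
            continue  # try the next bigger machine type
    return None
```

Passing the launcher as a callable keeps the fallback logic testable without cloud access; a real launcher would wrap the cloud client and translate its capacity errors into `NoCapacityError`.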

elisabettai · 2022-11-24T11:05:08Z

@mrnicegyu11 please add new devops stuff

mrnicegyu11 · 2022-11-24T15:41:41Z

The following tasks also have to be tackled; they need to be done on the DevOps side:

Determine how to provide the AMI image for autoscaled nodes (talk to @mguidon)
Increase the Linux password security of the machines
Create an AWS functional user with limited capabilities for the autoscaling task

sanderegg · 2022-12-01T07:52:15Z

Update on sprint Athena

Done

♻️ Use common RabbitMQ client (⚠️ devops) osparc-simcore#3502
Migrate autoscaling script into osparc-simcore autoscaling app osparc-simcore#3558
✨Autoscaling: automatically cleanup nodes from the docker swarm osparc-simcore#3617
Autoscaling: Testing results osparc-simcore#3633
All sim4life-light-dy shall start on auto-scaled nodes (through changes in DB) - POC done
Apply to AWS to ask for more vCPUs [ALL]

Ongoing

✨ Autoscaling: connect with rabbitmq osparc-simcore#3620, Define a way to inform the user about auto-scaling of nodes in the swarm osparc-simcore#3341
Connect AWS metrics to OPS monitoring (prometheus, grafana), Autoscaling metrics osparc-simcore#3559
⬆️ Maintenance: Upgrade python-socketio + flakyness osparc-simcore#3622, Revert "Revert "⬆️ Maintenance: Upgrade python-socketio + flakyness (… osparc-simcore#3631

Next steps

Autoscaling: handle down scaling cluster osparc-simcore#3627
Autoscaling: handle autoscaling app restart osparc-simcore#3628
Autoscaling: handle when AWS cannot provide machines osparc-simcore#3629
add VRAM, Generic resources handling (not really useful for current use-case)
optimize scaling up/scaling down policies (have buffer machine? create multiple instances in one call)
Robustness to bugs in dy-sidecar (restarts, manual killing,...)
DevOPS: Create AWS EC2 specific user with restricted access rights for autoscaling service
DevOPS: Create AMI with preloaded images, determine an automatic system if possible
DevOPS: Improve linux system password security

sanderegg · 2022-12-12T08:02:01Z

Goal for sprint Zefram Cochrane

Revert "Revert "⬆️ Maintenance: Upgrade python-socketio + flakyness (… osparc-simcore#3631
Have one boto3 and aiodocker client in the autoscaling app osparc-simcore#3600
Autoscaling: handle down scaling cluster osparc-simcore#3627
Autoscaling: handle autoscaling app restart osparc-simcore#3628
-------------------------------------------- Autoscaling heavier testing ready -------------------------------------
Autoscaling: handle when AWS cannot provide machines osparc-simcore#3629
Autoscaling metrics osparc-simcore#3559
Define a way to inform the user about auto-scaling of nodes in the swarm osparc-simcore#3341
Autoscaling: E2E testing osparc-simcore#3663
DevOPS: Create AWS EC2 specific user with restricted access rights for autoscaling service
DevOPS: Create AMI with preloaded images (dynamic-sidecar, portainer-agent, cadvisor, node exporter, nvidia-exporter?, service images), create an automatic system if possible

sanderegg · 2023-01-10T07:45:42Z

Update on sprint Zefram Cochrane

Done

Ongoing

Next steps

Autoscaling: Starting sim4life on a freshly started node fails osparc-simcore#3746
Autoscaling: Handle multiple new services needing resources asynchronously osparc-simcore#3743
Autoscaling: handle when AWS cannot provide machines osparc-simcore#3629
Autoscaling metrics osparc-simcore#3559
DevOPS: Create AWS EC2 specific user with restricted access rights for autoscaling service
DevOPS: Create AMI with preloaded images (dynamic-sidecar, portainer-agent, cadvisor, node exporter, nvidia-exporter?, service images), create an automatic system if possible

mguidon · 2023-01-20T09:12:22Z

Think about adding 1 machine as a buffer.
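
A minimal sketch of what a one-machine buffer could mean for the scale-up count, assuming the autoscaler knows how many machines are needed right now, how many idle machines are ready, and how many are still starting. The function name and parameters are hypothetical.

```python
def machines_to_create(
    needed_now: int, idle_ready: int, starting: int, buffer_size: int = 1
) -> int:
    """Number of machines to request so that `buffer_size` spare ones remain.

    needed_now: machines required by services that cannot be placed yet
    idle_ready: machines that are up and currently unused
    starting:   machines already requested but not yet ready
    """
    target = needed_now + buffer_size
    return max(0, target - idle_ready - starting)
```

With the default buffer of 1, an idle cluster with no spare machine requests one machine, and a cluster that already has one idle machine requests none.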

sanderegg · 2023-02-17T09:35:37Z

Update on sprint Resistance is Futile

Done

Autoscaling: E2E testing osparc-simcore#3663
Define a way to inform the user about auto-scaling of nodes in the swarm osparc-simcore#3341
- ✨ Enhancement/improve progress messages osparc-simcore#3773
Autoscaling: Handle multiple new services needing resources asynchronously osparc-simcore#3743
- ✨🐛 Autoscaling: async creation of machines osparc-simcore#3747
Autoscaling: Starting sim4life on a freshly started node fails osparc-simcore#3746
Fresh start on autoscaled node osparc-simcore#3789
✨Autoscaling: have a buffer of machines always ready (episode I) (⚠️ devops) osparc-simcore#3790
✨ Autoscaling: buffer machines Episode II (⚠️ devops) osparc-simcore#3799
Autoscaling: Filesystem issues for sim4life osparc-simcore#3804
Fine tuning of progress messages when starting services osparc-simcore#3810

Ongoing

Next steps

Autoscaling: handle when AWS cannot provide machines osparc-simcore#3629
Autoscaling metrics osparc-simcore#3559
Autoscaling: provide a timeout before removing a docker node that is "down" osparc-simcore#3684
Autoscaling: smart machine buffering osparc-simcore#3808
DevOPS: Create AWS EC2 specific user with restricted access rights for autoscaling service
DevOPS: system for automatic creation of AMIs, if possible

sanderegg · 2023-02-28T14:24:12Z

Goal for sprint Mithril

Autoscaling: Cold started EC2 machines are slow osparc-simcore#3893
- pre-pull s4l-lite-core:latest into autoscaled nodes
- AMI with 1 disk in production
- buffer in production
- evaluate performance
🚀 Release v1.50.0 #892
-------------------------------------------- Autoscaling MVP -------------------------------------
Autoscaling: handle when AWS cannot provide machines osparc-simcore#3629
Autoscaling: provide a timeout before removing a docker node that is "down" osparc-simcore#3684
Autoscaling: smart machine buffering osparc-simcore#3808
Autoscaling: check agent is not doing a backup prior to terminating EC2 instances osparc-simcore#3939
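
The "timeout before removing a docker node that is down" item can be illustrated as follows. This is a sketch under the assumption that the autoscaler records when each node was first seen as down; `nodes_to_remove` and its arguments are hypothetical and do not reflect osparc-simcore#3684.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def nodes_to_remove(
    down_since: dict[str, datetime], now: datetime, grace: timedelta
) -> list[str]:
    """Return the names of nodes that have been down for at least `grace`.

    Nodes that went down more recently are kept, so that a short network
    glitch or a restart does not remove a machine that is about to come back.
    """
    return sorted(
        name for name, since in down_since.items() if now - since >= grace
    )
```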

sanderegg · 2023-03-01T07:22:30Z

@drniiken for info, these are my latest tests (as of yesterday) with different combinations in staging-AWS (staging.osparc.io), using the latest code (not yet in production). We can check together tomorrow if you're interested.

testing s4l-lite 2.0.106 startup times

times [min:sec] are taken from the osparc logs
fresh means the machine is created on demand (no buffer)
buffer means the machine was pre-created and is available
no pre-pull means nothing is pulled on machine start
ops pre-pull means only the monitoring docker images are pulled on start
full pre-pull (ops + 2.0.106) means the monitoring images and the largest image of s4l-lite are pulled on start

| type | service-started | new machine ready | sidecar first log | S4l-lite ready |
|---|---|---|---|---|
| fresh no pre-pull | 00:00 | 02:18 | 04:30 | 07:09 |
| fresh ops pre-pull | 00:00 | 02:21 | 03:53 | 06:37 |
| fresh full pre-pull | 00:00 | 02:15 | 03:55 | 06:31 |
| buffer no pre-pull | 00:00 | 00:09 | 01:42 | 04:18 |
| buffer full pre-pull | 00:00 | 00:03 | 01:37 | 02:08 |
| hot drained node (standard) | 00:00 | 00:05 | 00:40 | ~01:00 |
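
To make the gains easier to read, the "S4l-lite ready" times above can be compared against the slowest case (fresh, no pre-pull). The snippet below only re-uses the numbers from the table; the helper names are chosen for this example, and the approximate "hot drained node" row is left out because its value is not exact.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'min:sec' string into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds: int) -> str:
    """Convert seconds back into a zero-padded 'min:sec' string."""
    return f"{total_seconds // 60:02d}:{total_seconds % 60:02d}"


# "S4l-lite ready" column of the table above
READY = {
    "fresh no pre-pull": "07:09",
    "fresh ops pre-pull": "06:37",
    "fresh full pre-pull": "06:31",
    "buffer no pre-pull": "04:18",
    "buffer full pre-pull": "02:08",
}

baseline = to_seconds(READY["fresh no pre-pull"])
SAVED = {name: to_mmss(baseline - to_seconds(t)) for name, t in READY.items()}

for name, saved in SAVED.items():
    print(f"{name}: {saved} faster than fresh no pre-pull")
# buffer full pre-pull: 05:01 faster than fresh no pre-pull
```

Pre-pulling alone saves about half a minute on a fresh machine (00:32 to 00:38), the buffer alone saves 02:51, and combining both saves 05:01.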

sanderegg · 2023-03-27T20:03:24Z

Update on sprint Mithril

Done

MVP completed and deployed to production.
Other steps (features/bugs) shall be tackled in the maintenance case and/or as separate PO cases.

esraneufeld added the PO issue Created by Product owners label Aug 5, 2022

mrnicegyu11 assigned Surfict Aug 5, 2022

This was referenced Aug 8, 2022

S-D25.5 Simulation Framework resource allocation Y4M05 #350

Closed

S-D25.4 Simulation framework SCHEDULER Y4M05 #349

Closed

sanderegg added the Epic label Sep 9, 2022

pcrespov mentioned this issue Sep 18, 2022

✨ is3339/autoscaling integrates script in app ITISFoundation/osparc-simcore#3364

Merged

8 tasks

mrnicegyu11 mentioned this issue Oct 13, 2022

Deploy S4L Lite on AWS - DevOps #740

Closed

12 tasks

elisabettai assigned sanderegg Oct 17, 2022

pcrespov unassigned sanderegg Oct 17, 2022

elisabettai assigned GitHK and sanderegg Oct 17, 2022

colinRawlings assigned mrnicegyu11 and mguidon and unassigned GitHK Nov 14, 2022

sanderegg mentioned this issue Nov 23, 2022

✨ Migrate autoscaling (⚠️ devops) ITISFoundation/osparc-simcore#3566

Merged

mguidon mentioned this issue Dec 12, 2022

s4l-lite: Requirements for release #798

Closed

13 tasks

sanderegg unassigned mrnicegyu11 Feb 28, 2023

sanderegg added this to the Mithril milestone Feb 28, 2023

sanderegg unassigned Surfict Feb 28, 2023

sanderegg unassigned mguidon Feb 28, 2023

sanderegg closed this as completed Apr 4, 2023

sanderegg mentioned this issue Sep 21, 2023

Automate AMI creation for autoscaled machines ITISFoundation/osparc-ops-environments#356

Closed
