Adding scripts for dealing with scheduled instance restart (SCP-5949) #2216
Conversation
Review comment on remount_portal_source.sh:

    # script to add to the root crontab to remount attached disk in correct location after scheduled instance restart
    # crontab should be entered as follows
    # @reboot /root/remount_portal_source.sh > /dev/null 2>&1
Suggested change:

    # @reboot /root/remount_portal_source.sh > /dev/null 2>&1
    # More context: https://github.com/broadinstitute/single_cell_portal_core/pull/2216
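If it helps, a minimal sketch of installing this entry non-interactively, assuming the script has already been copied to /root/remount_portal_source.sh and made executable:

    # append the @reboot entry to the root crontab, preserving any existing entries
    sudo crontab -l 2>/dev/null | { cat; echo "@reboot /root/remount_portal_source.sh > /dev/null 2>&1"; } | sudo crontab -

    # confirm the entry is present
    sudo crontab -l | grep '@reboot'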
Review comment on restart_portal_container.sh:

    # script to add to root crontab on a deployed host to check for crashed Docker containers and restart them
    # crontab entry should be as follows:
    # */5 * * * * /root/restart_portal_container.sh > /dev/null 2>&1
Suggested change:

    # */5 * * * * /root/restart_portal_container.sh > /dev/null 2>&1
    # More context: https://github.com/broadinstitute/single_cell_portal_core/pull/2216
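The body of restart_portal_container.sh is not shown in this comment; purely as an illustrative sketch (not the actual script), a check-and-restart run driven by the */5 entry above could look roughly like this, where the container name single_cell is an assumed placeholder:

    #! /usr/bin/env bash
    # hypothetical sketch: restart the portal container if it is no longer running
    CONTAINER="single_cell"   # placeholder name, not taken from this PR

    # `docker ps` lists only running containers; an empty result means the container is down
    if [ -z "$(docker ps --quiet --filter "name=${CONTAINER}")" ]; then
      echo "$(date): ${CONTAINER} not running, restarting" >> /root/restart_log.txt   # log path is an assumption
      docker restart "${CONTAINER}"
    fi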
bin/remount_portal_source.sh (Outdated)
    @@ -0,0 +1,12 @@
    #! /usr/bin/env bash

    # script to add to the root crontab to remount attached disk in correct location after scheduled instance restart
Suggested change:

    # script to add to the root crontab to remount attached disk in correct location after scheduled instance restart,
    # which is used for staging and development VMs, but not production
Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

    @@             Coverage Diff              @@
    ##           development    #2216      +/-   ##
    ===============================================
    + Coverage        70.34%   70.50%   +0.16%
    ===============================================
      Files              332      332
      Lines            28492    28493       +1
      Branches          2518     2518
    ===============================================
    + Hits             20042    20090      +48
    + Misses            8303     8256      -47
      Partials           147      147
BACKGROUND & CHANGES
This update adds two scripts to source that are used for restarting the staging portal Docker container after the VM restarts on schedule. The issue had been that when a GCE VM automatically reboots, the data disk does not automatically remount at the correct location. Since this disk holds all of the portal source code, the Docker container could not automatically restart, which led to irrecoverable issues that required a full deployment to address. Now, the remount script determines the correct attachment point and mounts the disk automatically on reboot, and the cron job restarts the container 5 minutes later. No deployment is required, as all processes inside the container can resume without error. Both scripts have been added to the staging VM and the root crontab.

This unfortunately takes about 15 minutes total to recover due to our "less than ideal" load balancer health check setup, which shunts traffic away from the VM when the container becomes unavailable. It takes ~1 minute for the VM to restart at 8 AM, and the Docker container restarts at 8:05. It takes ~2 minutes to boot, and the health check then runs again between 8:10 and 8:15. Once the backend service (i.e. the portal container) is deemed healthy, the load balancer recovers and normal traffic resumes. There is no way to apply a schedule to health checks - only an interval - and classic HTTPS load balancers can't be used without a health check.
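The merged bin/remount_portal_source.sh is the authoritative version; purely as an illustration of the approach described above, a remount-on-reboot script can locate the data disk by its stable GCE device path and mount it only if it is missing. The device ID, mount point, and log path below are assumptions for the sketch, not values taken from this PR:

    #! /usr/bin/env bash
    # illustrative sketch: remount the attached data disk after a scheduled reboot
    DISK="/dev/disk/by-id/google-portal-data"   # assumed persistent disk ID
    MOUNT_POINT="/home/docker-user"             # assumed location of the portal source checkout
    LOG="/root/remount_log.txt"                 # assumed log path

    echo "$(date): checking mount for ${MOUNT_POINT}" >> "${LOG}"
    # only mount if the disk is not already mounted at the expected location
    if ! mountpoint -q "${MOUNT_POINT}"; then
      mount -o discard,defaults "${DISK}" "${MOUNT_POINT}" >> "${LOG}" 2>&1
      echo "$(date): remounted ${DISK} at ${MOUNT_POINT}" >> "${LOG}"
    fi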
MANUAL TESTING
The simplest way to verify this is that the staging instance restarted this morning without any direct intervention. Additionally, the remount_log.txt file has activity from this morning showing the disk remounted correctly (note that the timestamps are UTC and the VM runs in us-central1).
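For anyone spot-checking a later scheduled restart by hand, verification amounts to confirming the disk is mounted and the container is back up, for example (the mount point is the same assumed placeholder used in the sketch above):

    # confirm the data disk is mounted where the portal source lives
    df -h | grep /home/docker-user

    # confirm the portal container is running again
    docker ps

    # review remount activity from the reboot (log path assumed)
    cat /root/remount_log.txt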