
Adding scripts for dealing with scheduled instance restart (SCP-5949) #2216

Merged
bistline merged 4 commits into development from jb-staging-deploy-schedule on Mar 11, 2025

Conversation

bistline (Contributor)

BACKGROUND & CHANGES

This update adds two scripts to source that restart the staging portal Docker container after the VM's scheduled restart. The issue had been that when the GCE VM rebooted automatically, the data disk did not remount at the correct location. Since this disk holds all of the portal source code, the Docker container could not restart on its own, which led to irrecoverable issues that required a full deployment to fix. Now, a script determines the disk's correct attachment point and mounts it automatically on reboot, and a cron job restarts the container within five minutes. No deployment is required, as all processes inside the container resume without error. Both scripts have been added to the staging VM and the root crontab; a sketch of the remount logic follows the listing below:

(ansible) [root@singlecell-01 ~]# crontab -l
0 * * * * service google-fluentd restart
*/5 * * * * /root/restart_portal_container.sh > /dev/null 2>&1
@reboot /root/remount_portal_source.sh > /dev/null 2>&1
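For illustration, here is a hedged sketch of what the remount logic might look like. GCE exposes attached disks under /dev/disk/by-id/google-<device-name>, and the device name below is inferred from the log output in MANUAL TESTING; the mount point is an assumption, not taken from the PR:

#!/usr/bin/env bash
# hedged sketch of remount_portal_source.sh, not the PR's actual script
# the by-id symlink resolves to whatever block device the disk attached as
# (e.g. /dev/sdc), even when the device letter changes between reboots
DISK="/dev/disk/by-id/google-singlecell-data-disk"
MOUNT_POINT="/home/docker-user"   # assumed mount point, not from the PR
LOG="/home/jenkins/remount_log.txt"

DEVICE=$(readlink -f "$DISK")

# mount only if the data disk is not already at the expected location
if ! mountpoint -q "$MOUNT_POINT"; then
  mount "$DEVICE" "$MOUNT_POINT"
  echo "$(date): remounting google-singlecell-data-disk from $DEVICE" >> "$LOG"
fi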

Recovery unfortunately takes about 15 minutes in total due to our "less than ideal" load balancer health check setup, which shunts traffic away from the VM while the container is unavailable. The VM takes ~1 minute to restart at 8:00 AM, and the Docker container restarts at 8:05. The container takes ~2 minutes to boot, and the health check then runs again between 8:10 and 8:15. Once the backend service (i.e., the portal container) is deemed healthy, the load balancer recovers and normal traffic resumes. Health checks cannot be scheduled, only configured with an interval (see the example below), and classic HTTPS load balancers can't be used without a health check.
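For reference, a health check's cadence is set with an interval flag rather than a schedule; a hypothetical example, with an assumed health check name that is not taken from this PR:

# assumes a classic HTTPS health check named portal-health-check
gcloud compute health-checks update https portal-health-check \
  --check-interval=300s --timeout=10s --healthy-threshold=2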

MANUAL TESTING

The simplest verification is that the staging instance recovered from this morning's scheduled restart without any direct intervention. Additionally, the remount_log.txt file has activity from this morning showing the disk remounted correctly (note the time is UTC and the VM runs in us-central1):

(ansible) [root@singlecell-01 ~]# cat /home/jenkins/remount_log.txt
Tue Mar 11 12:00:37 PM UTC 2025: remounting google-singlecell-data-disk from /dev/sdc

@bistline requested review from eweitz and jlchang on March 11, 2025 at 14:23.

eweitz (Member) left a comment:

Code looks good! I suggest some non-blocking maintainability refinements.

Thanks for updating the playbook. More context is in Slack.


# script to add to the root crontab to remount attached disk in correct location after scheduled instance restart
# crontab should be entered as follows
# @reboot /root/remount_portal_source.sh > /dev/null 2>&1

Suggested change
# @reboot /root/remount_portal_source.sh > /dev/null 2>&1
# @reboot /root/remount_portal_source.sh > /dev/null 2>&1
# More context: https://github.com/broadinstitute/single_cell_portal_core/pull/2216


# script to add to root crontab on a deployed host to check for crashed Docker containers and restart
# crontab entry should be as follows:
# */5 * * * * /root/restart_portal_container.sh > /dev/null 2>&1

Suggested change
# */5 * * * * /root/restart_portal_container.sh > /dev/null 2>&1
# */5 * * * * /root/restart_portal_container.sh > /dev/null 2>&1
# More context: https://github.com/broadinstitute/single_cell_portal_core/pull/2216
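For illustration, a hedged sketch of the restart check this comment block describes; the container name single_cell is an assumption, not taken from the PR:

#!/usr/bin/env bash
# hedged sketch of restart_portal_container.sh, not the PR's actual script
CONTAINER="single_cell"   # assumed container name

# if the container is not among the running containers, start it again
if ! docker ps --format '{{.Names}}' | grep -qx "$CONTAINER"; then
  docker start "$CONTAINER"
fi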

@@ -0,0 +1,12 @@
#! /usr/bin/env bash

# script to add to the root crontab to remount attached disk in correct location after scheduled instance restart

Suggested change
# script to add to the root crontab to remount attached disk in correct location after scheduled instance restart
# script to add to the root crontab to remount attached disk in correct location after scheduled instance restart,
# which is used for staging and development VMs, but not production


codecov bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.50%. Comparing base (12f666c) to head (5777768).
Report is 11 commits behind head on development.

Additional details and impacted files

[Impacted file tree graph]

@@               Coverage Diff               @@
##           development    #2216      +/-   ##
===============================================
+ Coverage        70.34%   70.50%   +0.16%     
===============================================
  Files              332      332              
  Lines            28492    28493       +1     
  Branches          2518     2518              
===============================================
+ Hits             20042    20090      +48     
+ Misses            8303     8256      -47     
  Partials           147      147              

see 5 files with indirect coverage changes


@bistline added the "build failure: false positive" label (build error confirmed as false positive, e.g. an upstream service has a problem) on Mar 11, 2025.
@bistline merged commit 691448b into development on Mar 11, 2025; 5 checks passed.
The github-actions bot deleted the jb-staging-deploy-schedule branch on March 11, 2025 at 15:59.