
Adding scripts for dealing with scheduled instance restart (SCP-5949) #2216

Merged
bistline merged 4 commits into development from jb-staging-deploy-schedule on Mar 11, 2025

Conversation

bistline (Contributor)

BACKGROUND & CHANGES

This update adds two scripts to source that restart the staging portal Docker container after the VM's scheduled restart. The issue had been that when the GCE VM rebooted automatically, the data disk did not remount at the correct location. Since this disk holds all of the portal source code, the Docker container could not restart on its own, which led to irrecoverable issues that required a full deployment to fix. Now, a script determines the disk's correct attachment point and mounts it automatically on reboot, and a cron job restarts the container within five minutes. No deployment is required, as all processes inside the container resume without error. Both scripts have been added to the staging VM and the root crontab; a sketch of the remount logic follows the listing below:

(ansible) [root@singlecell-01 ~]# crontab -l
0 * * * * service google-fluentd restart
*/5 * * * * /root/restart_portal_container.sh > /dev/null 2>&1
@reboot /root/remount_portal_source.sh > /dev/null 2>&1
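For illustration, here is a hedged sketch of what the remount logic might look like. GCE exposes attached disks under /dev/disk/by-id/google-<device-name>, and the device name below is inferred from the log output in MANUAL TESTING; the mount point is an assumption, not taken from the PR:

#!/usr/bin/env bash
# hedged sketch of remount_portal_source.sh, not the PR's actual script
# the by-id symlink resolves to whatever block device the disk attached as
# (e.g. /dev/sdc), even when the device letter changes between reboots
DISK="/dev/disk/by-id/google-singlecell-data-disk"
MOUNT_POINT="/home/docker-user"   # assumed mount point, not from the PR
LOG="/home/jenkins/remount_log.txt"

DEVICE=$(readlink -f "$DISK")

# mount only if the data disk is not already at the expected location
if ! mountpoint -q "$MOUNT_POINT"; then
  mount "$DEVICE" "$MOUNT_POINT"
  echo "$(date): remounting google-singlecell-data-disk from $DEVICE" >> "$LOG"
fi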

Recovery unfortunately takes about 15 minutes in total due to our "less than ideal" load balancer health check setup, which shunts traffic away from the VM while the container is unavailable. The VM takes ~1 minute to restart at 8:00 AM, and the Docker container restarts at 8:05. The container takes ~2 minutes to boot, and the health check then runs again between 8:10 and 8:15. Once the backend service (i.e., the portal container) is deemed healthy, the load balancer recovers and normal traffic resumes. Health checks cannot be scheduled, only configured with an interval (see the example below), and classic HTTPS load balancers can't be used without a health check.
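For reference, a health check's cadence is set with an interval flag rather than a schedule; a hypothetical example, with an assumed health check name that is not taken from this PR:

# assumes a classic HTTPS health check named portal-health-check
gcloud compute health-checks update https portal-health-check \
  --check-interval=300s --timeout=10s --healthy-threshold=2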

MANUAL TESTING

The simplest verification is that the staging instance recovered from this morning's scheduled restart without any direct intervention. Additionally, the remount_log.txt file has activity from this morning showing the disk remounted correctly (note the time is UTC and the VM runs in us-central1):

(ansible) [root@singlecell-01 ~]# cat /home/jenkins/remount_log.txt
Tue Mar 11 12:00:37 PM UTC 2025: remounting google-singlecell-data-disk from /dev/sdc

@bistline requested review from eweitz and jlchang on March 11, 2025 at 14:23.

eweitz (Member) left a comment:

Code looks good! I suggest some non-blocking maintainability refinements.

Thanks for updating the playbook. More context is in Slack.


# script to add to the root crontab to remount attached disk in correct location after scheduled instance restart
# crontab should be entered as follows
# @reboot /root/remount_portal_source.sh > /dev/null 2>&1

Suggested change
# @reboot /root/remount_portal_source.sh > /dev/null 2>&1
# @reboot /root/remount_portal_source.sh > /dev/null 2>&1
# More context: https://github.com/broadinstitute/single_cell_portal_core/pull/2216


# script to add to root crontab on a deployed host to check for crashed Docker containers and restart
# crontab entry should be as follows:
# */5 * * * * /root/restart_portal_container.sh > /dev/null 2>&1

Suggested change
# */5 * * * * /root/restart_portal_container.sh > /dev/null 2>&1
# */5 * * * * /root/restart_portal_container.sh > /dev/null 2>&1
# More context: https://github.com/broadinstitute/single_cell_portal_core/pull/2216
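For illustration, a hedged sketch of the restart check this comment block describes; the container name single_cell is an assumption, not taken from the PR:

#!/usr/bin/env bash
# hedged sketch of restart_portal_container.sh, not the PR's actual script
CONTAINER="single_cell"   # assumed container name

# if the container is not among the running containers, start it again
if ! docker ps --format '{{.Names}}' | grep -qx "$CONTAINER"; then
  docker start "$CONTAINER"
fi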

@@ -0,0 +1,12 @@
#! /usr/bin/env bash

# script to add to the root crontab to remount attached disk in correct location after scheduled instance restart

Suggested change
# script to add to the root crontab to remount attached disk in correct location after scheduled instance restart
# script to add to the root crontab to remount attached disk in correct location after scheduled instance restart,
# which is used for staging and development VMs, but not production


codecov bot commented Mar 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.50%. Comparing base (12f666c) to head (5777768).
Report is 11 commits behind head on development.

Additional details and impacted files

[Impacted file tree graph]

@@               Coverage Diff               @@
##           development    #2216      +/-   ##
===============================================
+ Coverage        70.34%   70.50%   +0.16%     
===============================================
  Files              332      332              
  Lines            28492    28493       +1     
  Branches          2518     2518              
===============================================
+ Hits             20042    20090      +48     
+ Misses            8303     8256      -47     
  Partials           147      147              

see 5 files with indirect coverage changes


@bistline added the "build failure: false positive" label (build error confirmed as false positive, e.g. an upstream service has a problem) on Mar 11, 2025.
@bistline merged commit 691448b into development on Mar 11, 2025; 5 checks passed.
The github-actions bot deleted the jb-staging-deploy-schedule branch on March 11, 2025 at 15:59.