Commit 49f84d2

[scheduled-job-alerts] Add alerting for failed scheduled jobs
1 parent c21a7e3 commit 49f84d2

7 files changed, +376 -0 lines changed


terraform/modules/scheduled-job/README.md

Lines changed: 75 additions & 0 deletions
@@ -12,6 +12,7 @@ Creates a complete scheduled setup:
 - Storage bucket with lifecycle management
 - Secret Manager IAM bindings
 - Source code change detection
+- **Slack alerting** for job failures (optional)

 ## Quick Start

@@ -85,6 +86,10 @@ module "my_data_processor" {
       version = "latest"
     }
   ]
+
+  # Enable Slack alerting for job failures (enabled by default)
+  slack_api_token = "xoxb-your-slack-api-token"
+  slack_channel = "#1s-and-0s"
 }
 ```

@@ -241,6 +246,13 @@ module "data_processor" {
 - `job_args` - Command arguments ([])
 - `job_image` - Container image URL (required)

+### Alerting (optional)
+- `enable_alerting` - Whether to enable alerting for job failures (true)
+- `slack_webhook_url` - Slack webhook URL for sending failure notifications (null)
+- `slack_channel` - Slack channel to send notifications to (e.g., "#1s-and-0s") ("#1s-and-0s")
+- `alert_project_id` - GCP project ID where monitoring and alerting resources will be created (defaults to project_id) (null)
+- `notification_email` - Email address for additional failure notifications (null)
+
 ## Outputs

 - `resource_name` - Name of deployed function or job
@@ -251,6 +263,10 @@ module "data_processor" {
 - `storage_bucket_name` - Storage bucket name
 - `execution_type` - The execution type used

+### Alerting Outputs (when `enable_alerting = true`)
+- `monitoring_notification_channel_name` - Name of the monitoring notification channel
+- `alert_policy_names` - Names of the monitoring alert policies
+
 ## Repository Structure

 ```
@@ -389,6 +405,65 @@ Or use Cloud Build directly:
 gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/YOUR_JOB_NAME:latest ./jobs/your-job
 ```

+## Alerting
+
+The module supports optional Slack alerting for job failures. When enabled, it creates:
+
+- **Monitoring policies**: Cloud Monitoring alert policies for different failure scenarios
+- **Slack notification channel**: Direct integration with Slack using webhooks
+
+### Enabling Alerting
+
+```hcl
+module "my_job_with_alerts" {
+  source = "git::https://github.com/Khan/terraform-modules.git//terraform/modules/scheduled-job?ref=v1.0.0"
+
+  # ... other configuration ...
+
+  # Alerting is enabled by default, just provide the webhook URL
+  slack_webhook_url = "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
+  slack_channel = "#1s-and-0s" # Default channel
+
+  # Optional: Use different project for alerting resources
+  alert_project_id = "my-monitoring-project"
+}
+```
+
+### What Gets Monitored
+
+When alerting is enabled, the module creates monitoring policies for:
+
+1. **Cloud Function failures** (when `execution_type = "function"`)
+   - Function execution failures
+   - Error rates and timeouts
+
+2. **Cloud Run Job failures** (when `execution_type = "job"`)
+   - Job execution failures
+   - Task completion failures
+   - Timeout violations
+
+3. **Cloud Scheduler failures**
+   - Scheduler job execution failures
+   - Missed scheduled runs
+
+4. **Excessive retries**
+   - Jobs that have been retrying more than 3 times in 10 minutes
+   - Indicates persistent issues that need investigation
+
+### Slack Message Format
+
+The Slack notifications include:
+- Job name and status
+- Resource information
+- Failure condition details
+- Timestamp and incident ID
+- Color-coded messages (red for failures, green for recovery)
+
+### Security
+
+- Slack webhook URL is stored securely in the monitoring notification channel
+- All alerting resources are created in the specified project (or same project as the job)
+
 ## Common Cron Patterns

 | Schedule | Description |
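
As a rough illustration of the "Cloud Run Job failures" item in the README section above, the following is a minimal sketch of one possible alert policy. It is not the module's implementation, which is not among the files shown here. The resource label `job_failures`, the variables `job_name` and `notification_channel_id`, the 10-minute window, and the choice of the `run.googleapis.com/job/completed_execution_count` metric with `result = "failed"` are all assumptions.

```hcl
# Hypothetical inputs for this sketch; the module may name or wire these differently.
variable "alert_project_id" {
  type        = string
  description = "Project that holds the monitoring resources."
}

variable "job_name" {
  type        = string
  description = "Name of the Cloud Run job to watch."
}

variable "notification_channel_id" {
  type        = string
  description = "Full ID of an existing Cloud Monitoring notification channel."
}

# Assumed shape of a failure alert: fire when at least one execution of the
# job finishes with result = "failed" inside a 10-minute alignment window.
resource "google_monitoring_alert_policy" "job_failures" {
  project      = var.alert_project_id
  display_name = "Scheduled job failed: ${var.job_name}"
  combiner     = "OR"

  conditions {
    display_name = "Failed executions > 0"

    condition_threshold {
      filter = join(" AND ", [
        "resource.type = \"cloud_run_job\"",
        "resource.labels.job_name = \"${var.job_name}\"",
        "metric.type = \"run.googleapis.com/job/completed_execution_count\"",
        "metric.labels.result = \"failed\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = 0
      duration        = "0s"

      aggregations {
        alignment_period   = "600s"
        per_series_aligner = "ALIGN_SUM"
      }
    }
  }

  notification_channels = [var.notification_channel_id]
}
```

Taking the notification channel as an input keeps the sketch independent of how the Slack channel itself is created.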

terraform/modules/scheduled-job/examples/simple-job/README.md

Lines changed: 18 additions & 0 deletions
@@ -9,6 +9,7 @@ This example demonstrates how to use the scheduled-job module to create a Cloud
 - Service account with appropriate permissions
 - Container image built automatically using Cloud Build
 - Secret Manager IAM bindings
+- Slack alerting for job failures (enabled by default)

 ## Key differences from Cloud Functions

@@ -24,6 +25,7 @@ This example demonstrates how to use the scheduled-job module to create a Cloud
 ```bash
 export TF_VAR_project_id="your-gcp-project"
 export TF_VAR_secrets_project_id="your-secrets-project"
+export TF_VAR_slack_webhook_url="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
 ```

 2. Initialize and apply:
@@ -69,6 +71,10 @@ module "daily_data_processor" {
   job_command = ["python", "processor.py"]
   job_args = [] # Additional arguments if needed

+  # Alerting is enabled by default
+  slack_webhook_url = var.slack_webhook_url
+  slack_channel = "#1s-and-0s"
+
   # ... other configuration
 }
 ```
@@ -89,3 +95,15 @@ The job code in `job-code/processor.py` is a simple Python script that:
 - **Branch-based Caching**: Cloud Build caches layers based on branch names for faster builds.
 - Jobs are triggered via HTTP calls to the Cloud Run Jobs API, not via PubSub like Cloud Functions.
 - Jobs can run for longer periods and have more resources than Cloud Functions.
+
+## Alerting
+
+This example includes Slack alerting for job failures by default. The alerting system:
+
+- Monitors job execution failures and timeouts
+- Alerts on excessive retries (more than 3 times in 10 minutes)
+- Sends notifications to the #1s-and-0s Slack channel (configurable)
+- Uses Google Cloud Monitoring's native Slack webhook integration
+- Provides detailed failure information including job name, status, and timestamps
+
+To disable alerting, set `enable_alerting = false` in the module configuration.
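
The opt-out described in the last line above could look like the following minimal sketch. The module label `daily_data_processor_no_alerts` and the relative `source` path are assumptions for illustration; only `enable_alerting` comes from the README.

```hcl
module "daily_data_processor_no_alerts" {
  # Assumed relative path from the example directory to the module root.
  source = "../../"

  # ... other configuration ...

  # Skip creation of the notification channel and alert policies.
  enable_alerting = false
}
```

With alerting disabled, `slack_webhook_url` can presumably stay at its documented default of `null`.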

terraform/modules/scheduled-job/examples/simple-job/main.tf

Lines changed: 16 additions & 0 deletions
@@ -69,6 +69,13 @@ module "daily_data_processor" {
       version = "latest"
     }
   ]
+
+  # Alerting is enabled by default, just provide the webhook URL
+  slack_webhook_url = var.slack_webhook_url
+  slack_channel = var.slack_channel
+
+  # Optional: Use different project for alerting resources
+  alert_project_id = var.alert_project_id
 }

 # Output the job details
# Output the job details
@@ -91,3 +98,12 @@ output "image_info" {
     image_tag = module.daily_data_processor_image.image_tag
   }
 }
+
+# Output alerting information
+output "alerting_info" {
+  description = "Information about the alerting setup"
+  value = {
+    monitoring_notification_channel_name = module.daily_data_processor.monitoring_notification_channel_name
+    alert_policy_names = module.daily_data_processor.alert_policy_names
+  }
+}

terraform/modules/scheduled-job/examples/simple-job/variables.tf

Lines changed: 18 additions & 0 deletions
@@ -13,3 +13,21 @@ variable "region" {
   type = string
   default = "us-central1"
 }
+
+variable "slack_webhook_url" {
+  description = "Slack webhook URL for sending failure notifications"
+  type = string
+  sensitive = true
+}
+
+variable "slack_channel" {
+  description = "Slack channel to send notifications to"
+  type = string
+  default = "#1s-and-0s"
+}
+
+variable "alert_project_id" {
+  description = "GCP project ID where monitoring and alerting resources will be created (optional, defaults to project_id)"
+  type = string
+  default = null
+}
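
Two optional refinements to these variables, sketched under assumptions rather than taken from this commit: a `validation` block that rejects values that do not look like Slack incoming-webhook URLs, and a `locals` expression showing one way the documented "defaults to project_id" fallback could be written. The prefix check and the local name `effective_alert_project_id` are hypothetical. The block is written as a standalone file, so it repeats minimal declarations of `project_id` and `alert_project_id`.

```hcl
variable "project_id" {
  type = string
}

variable "alert_project_id" {
  type    = string
  default = null
}

variable "slack_webhook_url" {
  description = "Slack webhook URL for sending failure notifications"
  type        = string
  sensitive   = true

  # Assumed check: Slack incoming webhooks live under hooks.slack.com/services/.
  validation {
    condition     = can(regex("^https://hooks\\.slack\\.com/services/", var.slack_webhook_url))
    error_message = "The slack_webhook_url value must start with https://hooks.slack.com/services/."
  }
}

locals {
  # coalesce returns the first argument that is neither null nor an empty
  # string, so an unset alert_project_id falls back to project_id.
  effective_alert_project_id = coalesce(var.alert_project_id, var.project_id)
}
```

The error message deliberately omits the value itself, since the variable is marked `sensitive`.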
