75 changes: 75 additions & 0 deletions terraform/modules/scheduled-job/README.md
@@ -12,6 +12,7 @@ Creates a complete scheduled setup:
- Storage bucket with lifecycle management
- Secret Manager IAM bindings
- Source code change detection
- **Slack alerting** for job failures (optional)

## Quick Start

@@ -85,6 +86,10 @@ module "my_data_processor" {
version = "latest"
}
]

# Slack alerting for job failures (enabled by default)
slack_webhook_url = "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
slack_channel = "#1s-and-0s"
}
```

@@ -241,6 +246,13 @@ module "data_processor" {
- `job_args` - Command arguments ([])
- `job_image` - Container image URL (required)

### Alerting (optional)
- `enable_alerting` - Whether to enable alerting for job failures (true)
- `slack_webhook_url` - Slack webhook URL for sending failure notifications (null)
- `slack_channel` - Slack channel to send notifications to ("#1s-and-0s")
- `alert_project_id` - GCP project ID where monitoring and alerting resources are created; falls back to `project_id` when unset (null)
- `notification_email` - Email address for additional failure notifications (null)
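
As a rough sketch of how these inputs combine, the block below sends failures to both Slack and email and places the alerting resources in a separate project. The email address and monitoring project ID are placeholders, and `var.slack_webhook_url` is assumed to be declared by the caller as a sensitive variable.

```hcl
module "data_processor" {
  source = "git::https://github.com/Khan/terraform-modules.git//terraform/modules/scheduled-job?ref=v1.0.0"

  # ... other configuration ...

  enable_alerting    = true                    # the default, shown for clarity
  slack_webhook_url  = var.slack_webhook_url   # assumed caller-side sensitive variable
  slack_channel      = "#1s-and-0s"            # the default channel
  notification_email = "oncall@example.com"    # placeholder address
  alert_project_id   = "my-monitoring-project" # placeholder; omit to use project_id
}
```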

## Outputs

- `resource_name` - Name of deployed function or job
@@ -251,6 +263,10 @@ module "data_processor" {
- `storage_bucket_name` - Storage bucket name
- `execution_type` - The execution type used

### Alerting Outputs (when `enable_alerting = true`)
- `monitoring_notification_channel_name` - Name of the monitoring notification channel
- `alert_policy_names` - Names of the monitoring alert policies

## Repository Structure

```
@@ -389,6 +405,65 @@ Or use Cloud Build directly:
gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/YOUR_JOB_NAME:latest ./jobs/your-job
```

## Alerting

The module supports optional Slack alerting for job failures. When enabled, it creates:

- **Monitoring policies**: Cloud Monitoring alert policies for different failure scenarios
- **Slack notification channel**: Direct integration with Slack using webhooks

### Enabling Alerting

```hcl
module "my_job_with_alerts" {
  source = "git::https://github.com/Khan/terraform-modules.git//terraform/modules/scheduled-job?ref=v1.0.0"

  # ... other configuration ...

  # Alerting is enabled by default; just provide the webhook URL
  slack_webhook_url = "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
  slack_channel     = "#1s-and-0s" # Default channel

  # Optional: Use a different project for alerting resources
  alert_project_id = "my-monitoring-project"
}
```

### What Gets Monitored

When alerting is enabled, the module creates monitoring policies for:

1. **Cloud Function failures** (when `execution_type = "function"`)
   - Function execution failures
   - Error rates and timeouts

2. **Cloud Run Job failures** (when `execution_type = "job"`)
   - Job execution failures
   - Task completion failures
   - Timeout violations

3. **Cloud Scheduler failures**
   - Scheduler job execution failures
   - Missed scheduled runs

4. **Excessive retries**
   - Jobs that retry more than 3 times within 10 minutes
   - Indicates a persistent issue that needs investigation
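
For orientation, a policy of this kind could look roughly like the sketch below. It is not the module's actual implementation: the resource name, the three variables, and the choice to approximate "excessive retries" by counting failed Cloud Run Job executions are all assumptions made for illustration.

```hcl
# Hypothetical inputs; the module's real variable names may differ.
variable "project_id" { type = string }
variable "job_name" { type = string }
variable "notification_channel_id" { type = string }

# Illustrative sketch only, not the policy this module creates.
resource "google_monitoring_alert_policy" "job_failures_sketch" {
  project      = var.project_id
  display_name = "Scheduled job failures (sketch)"
  combiner     = "OR"

  conditions {
    display_name = "More than 3 failed executions in 10 minutes"

    condition_threshold {
      # Select failed executions of a single Cloud Run Job.
      filter = join(" AND ", [
        "resource.type = \"cloud_run_job\"",
        "resource.labels.job_name = \"${var.job_name}\"",
        "metric.type = \"run.googleapis.com/job/completed_execution_count\"",
        "metric.labels.result = \"failed\"",
      ])

      comparison      = "COMPARISON_GT"
      threshold_value = 3
      duration        = "0s" # fire as soon as the threshold is crossed

      aggregations {
        alignment_period   = "600s"      # 10-minute window
        per_series_aligner = "ALIGN_SUM" # total failures within the window
      }
    }
  }

  notification_channels = [var.notification_channel_id]
}
```

The 10-minute alignment period and the threshold of 3 mirror the "excessive retries" rule above; a Cloud Function or Cloud Scheduler policy would need a different resource type and metric in the filter.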

### Slack Message Format

The Slack notifications include:
- Job name and status
- Resource information
- Failure condition details
- Timestamp and incident ID
- Color-coded messages (red for failures, green for recovery)

### Security

- The Slack webhook URL is stored securely in the monitoring notification channel
- All alerting resources are created in `alert_project_id` when it is set, and otherwise in the same project as the job
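
One way to keep the webhook URL out of version control is to read it from Secret Manager at plan time, sketched below. The secret name `slack-webhook-url` and the `secrets_project_id` variable are assumptions for illustration; the module itself only requires that `slack_webhook_url` receive the URL.

```hcl
# Hypothetical variable naming the project that holds the secret.
variable "secrets_project_id" {
  type = string
}

# Reads the latest version of a secret assumed to contain the webhook URL.
data "google_secret_manager_secret_version" "slack_webhook" {
  project = var.secrets_project_id
  secret  = "slack-webhook-url" # hypothetical secret name
}

module "my_job_with_alerts" {
  source = "git::https://github.com/Khan/terraform-modules.git//terraform/modules/scheduled-job?ref=v1.0.0"

  # ... other configuration ...

  slack_webhook_url = data.google_secret_manager_secret_version.slack_webhook.secret_data
}
```

Note that the secret value is still written to Terraform state, so the state backend should be access-controlled as well.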

## Common Cron Patterns

| Schedule | Description |
18 changes: 18 additions & 0 deletions terraform/modules/scheduled-job/examples/simple-job/README.md
@@ -9,6 +9,7 @@ This example demonstrates how to use the scheduled-job module to create a Cloud
- Service account with appropriate permissions
- Container image built automatically using Cloud Build
- Secret Manager IAM bindings
- Slack alerting for job failures (enabled by default)

## Key differences from Cloud Functions

@@ -24,6 +25,7 @@ This example demonstrates how to use the scheduled-job module to create a Cloud
```bash
export TF_VAR_project_id="your-gcp-project"
export TF_VAR_secrets_project_id="your-secrets-project"
export TF_VAR_slack_webhook_url="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
```

2. Initialize and apply:
@@ -69,6 +71,10 @@ module "daily_data_processor" {
job_command = ["python", "processor.py"]
job_args = [] # Additional arguments if needed

# Alerting is enabled by default
slack_webhook_url = var.slack_webhook_url
slack_channel = "#1s-and-0s"

# ... other configuration
}
```
@@ -89,3 +95,15 @@ The job code in `job-code/processor.py` is a simple Python script that:
- **Branch-based Caching**: Cloud Build caches layers based on branch names for faster builds.
- Jobs are triggered via HTTP calls to the Cloud Run Jobs API, not via PubSub like Cloud Functions.
- Jobs can run for longer periods and have more resources than Cloud Functions.

## Alerting

This example includes Slack alerting for job failures by default. The alerting system:

- Monitors job execution failures and timeouts
- Alerts on excessive retries (more than 3 times in 10 minutes)
- Sends notifications to the #1s-and-0s Slack channel (configurable)
- Uses Google Cloud Monitoring's native Slack webhook integration
- Provides detailed failure information including job name, status, and timestamps

To disable alerting, set `enable_alerting = false` in the module configuration.
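
A minimal sketch of the disabled configuration follows. Since the alerting outputs are documented as available only when `enable_alerting = true`, the `alerting_info` output in this example's `main.tf` would presumably need to be removed or guarded as well; that is an assumption, not something confirmed here.

```hcl
module "daily_data_processor" {
  source = "../../"

  # ... other configuration ...

  # Turn off all failure alerting for this job.
  enable_alerting = false
}
```

The relative `source` path is a guess at how this example references the module and may differ from the actual file.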
16 changes: 16 additions & 0 deletions terraform/modules/scheduled-job/examples/simple-job/main.tf
@@ -69,6 +69,13 @@ module "daily_data_processor" {
version = "latest"
}
]

# Alerting is enabled by default; just provide the webhook URL
slack_webhook_url = var.slack_webhook_url
slack_channel = var.slack_channel

# Optional: Use different project for alerting resources
alert_project_id = var.alert_project_id
}

# Output the job details
@@ -91,3 +98,12 @@ output "image_info" {
image_tag = module.daily_data_processor_image.image_tag
}
}

# Output alerting information
output "alerting_info" {
  description = "Information about the alerting setup"
  value = {
    monitoring_notification_channel_name = module.daily_data_processor.monitoring_notification_channel_name
    alert_policy_names                   = module.daily_data_processor.alert_policy_names
  }
}
18 changes: 18 additions & 0 deletions terraform/modules/scheduled-job/examples/simple-job/variables.tf
@@ -13,3 +13,21 @@ variable "region" {
type = string
default = "us-central1"
}

variable "slack_webhook_url" {
  description = "Slack webhook URL for sending failure notifications"
  type        = string
  sensitive   = true
}

variable "slack_channel" {
  description = "Slack channel to send notifications to"
  type        = string
  default     = "#1s-and-0s"
}

variable "alert_project_id" {
  description = "GCP project ID where monitoring and alerting resources will be created (optional, defaults to project_id)"
  type        = string
  default     = null
}