Commit 3807522

Add Temporal worker connection failure
1 parent 0424517 commit 3807522

4 files changed: +81 -0 lines changed
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+rules:
+  - cre:
+      id: CRE-2025-0074
+      severity: 0
+      title: Temporal Worker → Server Downtime → Connection Refused Failure
+      category: workflow-orchestration-connectivity
+      author: Prequel
+      description: |
+        Detects failure when a Temporal worker is unable to reach the Temporal server.
+        - This typically occurs during startup or after server downtime.
+        - Worker log contains gRPC error: "connection refused".
+
+      cause: |
+        ROOT CAUSES:
+        - Temporal server was stopped or not reachable on port 7233.
+        - Worker container started before the server was healthy.
+        - Network misconfiguration or resource contention during startup.
+
+      impact: |
+        BUSINESS IMPACT:
+        - Temporal workflows do not start or progress.
+        - Tasks queue up without being picked up by a worker.
+        - Can go unnoticed without active monitoring.
+
+      impactScore: 10
+      tags:
+        - temporal
+        - worker
+        - grpc
+        - connection-refused
+        - startup-failure
+
+      mitigation: |
+        IMMEDIATE:
+        - Confirm server is running: `docker ps`, `docker logs temporal`
+        - Ensure worker starts only after server is available.
+        - Restart worker manually or with retry logic.
+        RECOVERY ACTIONS:
+        - Add health checks for the server container.
+        - Use startup dependencies in Docker Compose.
+        - Improve alerting for “connection refused” messages.
+        PREVENTION STRATEGIES:
+        - Monitor port 7233 availability.
+        - Delay worker startup until Temporal is healthy.
+        - Add retry logic in the worker code.
+      mitigationScore: 6
+      references:
+        - https://docs.temporal.io/
+        - https://docs.temporal.io/production-deployment/production-deployment-overview
+        - https://grpc.io/docs/guides/
+        - https://github.com/temporalio/samples-go
+
+      applications:
+        - name: "temporal"
+          version: ">=1.27.0"
+    metadata:
+      kind: prequel
+      id: 7AUUSrCz9ssNvDvPwBVfAK
+
+    rule:
+      sequence:
+        window: 120s
+        event:
+          source: cre.log.temporal.worker
+        order:
+          - regex: "Failed to connect to Temporal: failed reaching server"
+          - regex: "connection refused"

rules/cre-2025-0074/test.log

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+2025-06-05T09:40:02Z worker-1 | 2025/06/05 09:35:42 INFO No logger configured for temporal client. Created default one.
+2025-06-05T09:40:02Z worker-1 | 2025/06/05 09:35:42 Failed to connect to Temporal: failed reaching server: Frontend is not healthy yet
+2025-06-05T09:40:02Z worker-1 | 2025/06/05 09:39:47 INFO No logger configured for temporal client. Created default one.
+2025-06-05T09:40:02Z worker-1 | 2025/06/05 09:39:47 INFO Started Worker Namespace default TaskQueue example-task-queue WorkerID 1@4e2f67868860@
+2025-06-05T09:40:06Z worker-1 | 2025/06/05 09:40:06 WARN Failed to poll for task. Namespace default TaskQueue example-task-queue WorkerID 1@4e2f67868860@ WorkerType ActivityWorker Error closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: "graceful_stop"
+2025-06-05T09:40:06Z worker-1 | 2025/06/05 09:40:06 WARN Failed to poll for task. Namespace default TaskQueue example-task-queue WorkerID 1@4e2f67868860@ WorkerType WorkflowWorker Error closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: NO_ERROR, debug data: "graceful_stop"
+2025-06-05T09:40:06Z worker-1 | 2025/06/05 09:40:06 WARN Failed to poll for task. Namespace default TaskQueue example-task-queue WorkerID 1@4e2f67868860@ WorkerType WorkflowWorker Error last connection error: connection error: desc = "transport: Error while dialing: dial tcp 172.22.0.3:7233: connect: connection refused"
+2025-06-05T09:40:06Z worker-1 | 2025/06/05 09:40:06 WARN Failed to poll for task. Namespace default TaskQueue example-task-queue WorkerID 1@4e2f67868860@ WorkerType ActivityWorker Error last connection error: connection error: desc = "transport: Error while dialing: dial tcp 172.22.0.3:7233: connect: connection refused"

rules/tags/categories.yaml

Lines changed: 3 additions & 0 deletions
@@ -138,3 +138,6 @@ categories:
   - name: demo-problems
     displayName: Demo Problems
     description: This is a category for demos
+  - name: distributed-worker-connectivity
+    displayName: Distributed Worker Connectivity Issues
+    description: Failures where a distributed systems worker fails to reach or stay connected to the orchestration backend (e.g., Temporal, Celery).

rules/tags/tags.yaml

Lines changed: 3 additions & 0 deletions
@@ -111,6 +111,9 @@ tags:
   - name: prometheus
     displayName: Prometheus
     description: Problems with scraping, rule evaluation, or querying Prometheus data.
+  - name: connection-refused-startup
+    displayName: Connection Refused on Startup
+    description: Failures that occur when a service (e.g., Temporal worker) tries to connect to a backend and receives a connection refused error.
   - name: psycopg2
     displayName: Psycopg2
     description: Python client errors related to connecting or querying PostgreSQL using psycopg2.
