Merged
18 commits
fc3318b
Initial plan
Copilot Dec 2, 2025
a45196a
feat: Add dedicated /metrics endpoints for Gateway and LLM services (…
Copilot Dec 2, 2025
ddfa9f8
refactor: Address code review feedback for metrics implementation
Copilot Dec 2, 2025
e249e30
fix(git_agent): ensure dev dependencies and venv activation before ru…
SorraTheOrc Dec 2, 2025
c24a9d4
fix(merge): resolve conflicts after merging main into copilot/add-ded…
SorraTheOrc Dec 2, 2025
6c6f5f5
fix(gateway): resolve syntax error after merge conflict resolution
SorraTheOrc Dec 2, 2025
3a9a364
refactor: Use prometheus_client for Prometheus-compatible metrics format
Copilot Dec 2, 2025
e4fa9a2
fix(merge): resolve remaining conflicts using remote version where ne…
SorraTheOrc Dec 2, 2025
f22edd2
fix(merge): resolve conflict at line 260 using remote version
SorraTheOrc Dec 2, 2025
89fc99e
fix(merge): remove duplicate else block after conflict resolution
SorraTheOrc Dec 2, 2025
68ebe9b
fix(gateway): remove merge artifact and restore correct command execu…
SorraTheOrc Dec 3, 2025
1ba6e83
fix(gateway): remove duplicate else block causing syntax error
SorraTheOrc Dec 3, 2025
7f5b3b0
fix(gateway): remove merge artifact and restore correct logic in open…
SorraTheOrc Dec 3, 2025
1ea3b24
fix(tests): remove merge artifact and restore correct imports in test…
SorraTheOrc Dec 3, 2025
a0ee701
Fix ruff linting errors: import sorting, line length, and indentation…
SorraTheOrc Dec 3, 2025
879593d
Fix: Ensure namespace is created before validating k8s/base manifests…
SorraTheOrc Dec 3, 2025
f9ec85c
Fix: Apply namespace manifest for real before dry-run validation to r…
SorraTheOrc Dec 3, 2025
7b49992
Review: Confirm workflow applies namespace before dry-run validation.…
SorraTheOrc Dec 3, 2025
3 changes: 3 additions & 0 deletions .github/agents/git.agent.md
@@ -38,6 +38,7 @@ as described in Atlassian's guide
- **Pre-merge hygiene**
- Ensure the working tree is clean before switching branches.
- Verify there are no uncommitted changes that would be lost.
- Ensure dev dependencies are installed (to avoid pytest configuration errors).
- Run tests (for example `pytest -v`) and basic checks before proposing
a merge.

@@ -77,6 +78,8 @@ arguments so they can run non-interactively.
- `git rebase origin/main`

- Run tests before merge:
- Activate virtual environment if present (e.g. `source .venv/bin/activate`)
- `pip install -e .[dev]` (ensure test dependencies like pytest-cov are present)
- `pytest -v`

- Merge feature branch into main locally:
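
The pre-merge hygiene steps listed in this file could be automated roughly as follows. This is a minimal sketch and not part of the PR: the helper names `is_clean`, `pre_merge_commands`, and `run_pre_merge_checks` are hypothetical, and it assumes `git`, `pip`, and `pytest` are on `PATH` with any virtual environment already activated.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """Return True when `git status --porcelain` printed nothing."""
    return porcelain_output.strip() == ""


def pre_merge_commands() -> list[list[str]]:
    """Commands from the checklist, in the order the agent doc lists them."""
    return [
        ["pip", "install", "-e", ".[dev]"],  # make sure test dependencies exist
        ["pytest", "-v"],
    ]


def run_pre_merge_checks() -> None:
    """Abort on a dirty tree, then install dev dependencies and run tests."""
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True,
        text=True,
        check=True,
    )
    if not is_clean(status.stdout):
        raise SystemExit("working tree is dirty; commit or stash first")
    for cmd in pre_merge_commands():
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Keeping the cleanliness check as a pure function makes it easy to test without touching a real repository.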
12 changes: 8 additions & 4 deletions .github/workflows/k8s-validation.yml
@@ -12,12 +12,12 @@ on:
branches:
- main
paths:
-- 'k8s/**/*.yaml'
-- '.github/workflows/k8s-*.yml'
+- "k8s/**/*.yaml"
+- ".github/workflows/k8s-*.yml"
pull_request:
paths:
-- 'k8s/**/*.yaml'
-- '.github/workflows/k8s-*.yml'
+- "k8s/**/*.yaml"
+- ".github/workflows/k8s-*.yml"

permissions:
contents: read
@@ -87,6 +87,10 @@ jobs:
kubectl cluster-info
kubectl get nodes

- name: Create gengine namespace (real)
run: |
kubectl apply -f k8s/base/namespace.yaml

- name: Dry-run validate base manifests
run: |
echo "Validating k8s/base with --dry-run=server..."
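
The ordering in this hunk appears to matter because `--dry-run=server` sends manifests to the API server, which rejects namespaced objects whose namespace does not exist yet. The sketch below only illustrates that ordering. The function name `validation_steps` is hypothetical, and the exact form of the dry-run command is an assumption, since the workflow's full command is not visible in this diff.

```python
def validation_steps(namespace_manifest: str, base_dir: str) -> list[list[str]]:
    """Return kubectl invocations: real namespace apply first, dry-run second."""
    return [
        # Applied for real so the namespace exists on the cluster.
        ["kubectl", "apply", "-f", namespace_manifest],
        # Assumed form of the dry-run step; the workflow's exact flags are not shown.
        ["kubectl", "apply", "-f", base_dir, "--dry-run=server"],
    ]


steps = validation_steps("k8s/base/namespace.yaml", "k8s/base")
for argv in steps:
    print(" ".join(argv))
```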
112 changes: 93 additions & 19 deletions docs/gengine/Deploy_GEngine_To_Kubernetes.md
@@ -592,17 +592,99 @@ If using LLM service extensively, increase memory for context buffering:
## Monitoring and Observability

GEngine services are instrumented with Prometheus-compatible metrics endpoints
-for monitoring and alerting.
+for monitoring and alerting. Health checks (`/healthz`) are separate from
+metrics collection (`/metrics`) to allow independent control of readiness
+probes and observability scraping.

### Health Check Endpoints

Health checks are used for Kubernetes liveness and readiness probes:

| Service | Port | Health Endpoint | Description |
| ---------- | ---- | --------------- | ------------------------------------- |
| Simulation | 8000 | `/healthz` | Returns `{"status": "ok"}` |
| Gateway | 8100 | `/healthz` | Returns status and upstream URLs |
| LLM | 8001 | `/healthz` | Returns status, provider, and model |

### Metrics Endpoints

-Each service exposes metrics that can be scraped by Prometheus:
+Each service exposes dedicated metrics for Prometheus scraping:

| Service | Port | Metrics Endpoint | Description |
| ---------- | ---- | ---------------- | ---------------------------------- |
| Simulation | 8000 | `/metrics` | Tick count, environment, profiling |
-| Gateway | 8100 | `/healthz` | Service health and connection info |
-| LLM | 8001 | `/healthz` | Service health status |
+| Gateway | 8100 | `/metrics` | Request counts, latencies, connections, LLM integration |
+| LLM | 8001 | `/metrics` | Request counts, latencies, errors, provider stats, token usage |

### Example Metrics Responses

**Simulation Service** (`/metrics`):
```json
{
"tick": 42,
"environment": {
"temperature": 0.5,
"instability": 0.2,
"tension": 0.3
},
"profiling": {
"tick_ms_p50": 12.5,
"tick_ms_p95": 25.0,
"tick_ms_max": 45.0
}
}
```

**Gateway Service** (`/metrics`) - Prometheus text format:
```text
# HELP gateway_requests_total Total number of requests processed
# TYPE gateway_requests_total counter
gateway_requests_total 150.0
# HELP gateway_requests_by_type_total Requests by type
# TYPE gateway_requests_by_type_total counter
gateway_requests_by_type_total{request_type="command"} 120.0
gateway_requests_by_type_total{request_type="natural_language"} 30.0
# HELP gateway_errors_total Total number of errors
# TYPE gateway_errors_total counter
gateway_errors_total 2.0
# HELP gateway_active_connections Number of active WebSocket connections
# TYPE gateway_active_connections gauge
gateway_active_connections 3.0
# HELP gateway_request_latency_seconds Request latency in seconds
# TYPE gateway_request_latency_seconds histogram
gateway_request_latency_seconds_bucket{request_type="command",le="0.1"} 80.0
gateway_request_latency_seconds_bucket{request_type="command",le="0.5"} 115.0
gateway_request_latency_seconds_bucket{request_type="command",le="+Inf"} 120.0
gateway_request_latency_seconds_count{request_type="command"} 120.0
gateway_request_latency_seconds_sum{request_type="command"} 5.46
```
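
Histogram buckets in the Prometheus text format are cumulative, which is easy to misread. The sketch below takes the `request_type="command"` numbers from the Gateway sample above, converts them to per-bucket counts, and derives the mean latency from `_sum` and `_count`. The helper name `per_bucket` is hypothetical.

```python
# Cumulative bucket counts copied from the Gateway sample
# (request_type="command"); "+Inf" is represented as float("inf").
cumulative = [(0.1, 80.0), (0.5, 115.0), (float("inf"), 120.0)]
latency_sum = 5.46
latency_count = 120.0


def per_bucket(buckets: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Convert cumulative `le` counts into counts per individual bucket."""
    result = []
    previous = 0.0
    for upper_bound, count in buckets:
        result.append((upper_bound, count - previous))
        previous = count
    return result


counts = per_bucket(cumulative)  # 80 up to 0.1s, 35 in (0.1s, 0.5s], 5 slower
mean_latency = latency_sum / latency_count  # roughly 0.0455 seconds
share_under_half_second = 115.0 / latency_count  # roughly 0.958
```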

**LLM Service** (`/metrics`) - Prometheus text format:
```text
# HELP llm_requests_total Total number of requests processed
# TYPE llm_requests_total counter
llm_requests_total 100.0
# HELP llm_parse_intent_requests_total Total parse_intent requests
# TYPE llm_parse_intent_requests_total counter
llm_parse_intent_requests_total 80.0
# HELP llm_narrate_requests_total Total narrate requests
# TYPE llm_narrate_requests_total counter
llm_narrate_requests_total 20.0
# HELP llm_errors_total Total number of errors
# TYPE llm_errors_total counter
llm_errors_total 1.0
# HELP llm_input_tokens_total Total input tokens used
# TYPE llm_input_tokens_total counter
llm_input_tokens_total 50000.0
# HELP llm_output_tokens_total Total output tokens used
# TYPE llm_output_tokens_total counter
llm_output_tokens_total 15000.0
# HELP llm_parse_intent_latency_seconds parse_intent request latency in seconds
# TYPE llm_parse_intent_latency_seconds histogram
llm_parse_intent_latency_seconds_bucket{le="1.0"} 75.0
llm_parse_intent_latency_seconds_bucket{le="5.0"} 80.0
llm_parse_intent_latency_seconds_bucket{le="+Inf"} 80.0
```
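
For quick inspection, text-format output like the samples above can be turned into a dictionary and used to derive ratios. This is a simplified sketch: the function name `parse_samples` is hypothetical, and it ignores optional timestamps and escaped characters that the full exposition format allows.

```python
def parse_samples(text: str) -> dict[str, float]:
    """Map each sample line (name plus labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


# A subset of the LLM sample shown above.
sample = """\
# HELP llm_requests_total Total number of requests processed
# TYPE llm_requests_total counter
llm_requests_total 100.0
llm_errors_total 1.0
llm_input_tokens_total 50000.0
llm_output_tokens_total 15000.0
"""

metrics = parse_samples(sample)
error_rate = metrics["llm_errors_total"] / metrics["llm_requests_total"]
tokens_per_request = (
    metrics["llm_input_tokens_total"] + metrics["llm_output_tokens_total"]
) / metrics["llm_requests_total"]
```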

### Prometheus Annotations

@@ -612,7 +694,7 @@ All deployments are annotated for automatic Prometheus discovery:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "<service-port>"
-prometheus.io/path: "/metrics" # or "/healthz"
+prometheus.io/path: "/metrics"
```

### Verifying Prometheus Scraping
@@ -624,25 +706,17 @@ To confirm Prometheus is scraping your services:
if [[ "${GENGINE_DEPLOY_ENV}" == "local" ]]; then
MINIKUBE_IP=$(minikube ip)
curl -s "http://${MINIKUBE_IP}:30000/metrics" | jq .
curl -s "http://${MINIKUBE_IP}:30100/metrics" | jq .
curl -s "http://${MINIKUBE_IP}:30001/metrics" | jq .
fi

# Using kubectl proxy or port-forward
kubectl port-forward -n "${GENGINE_NAMESPACE}" svc/simulation 8000:8000 &
kubectl port-forward -n "${GENGINE_NAMESPACE}" svc/gateway 8100:8100 &
kubectl port-forward -n "${GENGINE_NAMESPACE}" svc/llm 8001:8001 &
curl -s http://localhost:8000/metrics | jq .
-```
-
-Expected output:
-
-```json
-{
-"tick": 0,
-"environment": {
-"temperature": 0.0,
-"instability": 0.0,
-"tension": 0.0
-},
-"profiling": {}
-}
+curl -s http://localhost:8100/metrics | jq .
+curl -s http://localhost:8001/metrics | jq .
```
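
The Gateway and LLM responses are documented earlier in this file as Prometheus text rather than JSON, so piping those two through `jq .` would be expected to fail. A check that does not depend on JSON could look for `# TYPE` declarations instead. The sketch below is one way to do that; the function names are hypothetical, and the response body is assumed to have been fetched already (for example with `curl -s`).

```python
def declared_metrics(body: str) -> set[str]:
    """Collect metric names from `# TYPE <name> <kind>` lines."""
    names = set()
    for line in body.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.add(parts[2])
    return names


def missing_metrics(body: str, expected: set[str]) -> set[str]:
    """Return the expected metric names that the body does not declare."""
    return expected - declared_metrics(body)


body = (
    "# HELP gateway_requests_total Total number of requests processed\n"
    "# TYPE gateway_requests_total counter\n"
    "gateway_requests_total 150.0\n"
)
missing = missing_metrics(body, {"gateway_requests_total", "gateway_errors_total"})
```

An empty result means every expected metric family is present in the scrape.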

### Prometheus Operator Integration
85 changes: 85 additions & 0 deletions gamedev-agent-thoughts.txt
@@ -525,3 +525,88 @@ All acceptance criteria for Issue #24 are met. The implementation is production-

### Task 9.3.1 Status: COMPLETED
All acceptance criteria for Issue #34 are met.


## Issue #39 - Dedicated Metrics Endpoints for Gateway and LLM Services - 2025-12-02T19:00 UTC
=========================================================

### Initial Assessment
- Working branch: copilot/add-dedicated-metrics-endpoints
- Current commit: fc3318b
- Repository synced and clean

### Acceptance Criteria from Issue:
1. Gateway exposes `/metrics` endpoint with request counts, latencies, error breakdowns, provider-level stats
2. LLM service exposes `/metrics` endpoint with request counts, latencies, error breakdowns, provider-level stats
3. Prometheus annotations point to `/metrics` instead of `/healthz`
4. ServiceMonitor targets updated to metrics paths/ports
5. Documentation distinguishes health vs. metrics endpoints

### Implementation Plan:
1. [ ] Add `/metrics` endpoint to gateway/app.py with tracking of:
- Request counts (total, by endpoint, by result type)
- Request latencies
- Error counts
- Active WebSocket connections
- LLM service integration stats
2. [ ] Add `/metrics` endpoint to llm/app.py with tracking of:
- Request counts (total, by endpoint)
- Request latencies
- Error counts
- Provider-level stats
- Token usage
3. [ ] Update k8s/base/gateway-deployment.yaml: prometheus.io/path → /metrics
4. [ ] Update k8s/base/llm-deployment.yaml: prometheus.io/path → /metrics
5. [ ] Update k8s/base/servicemonitor.yaml: gateway and LLM paths → /metrics
6. [ ] Update docs/gengine/Deploy_GEngine_To_Kubernetes.md
7. [ ] Run tests to verify no regressions


### Implementation Complete

**Files Modified:**
1. `src/gengine/echoes/gateway/app.py` - Added GatewayMetrics class and /metrics endpoint
- GatewayMetrics tracks: request counts, latencies, errors, connections, LLM integration
- /metrics endpoint returns JSON with all metrics for Prometheus
- Metrics tracked during WebSocket handler execution

2. `src/gengine/echoes/llm/app.py` - Added LLMMetrics class and /metrics endpoint
- LLMMetrics tracks: request counts, latencies, errors, provider stats, token usage
- /metrics endpoint returns JSON with all metrics for Prometheus
- Metrics tracked for both parse_intent and narrate endpoints

3. `k8s/base/gateway-deployment.yaml` - Updated prometheus.io/path from /healthz to /metrics
4. `k8s/base/llm-deployment.yaml` - Updated prometheus.io/path from /healthz to /metrics
5. `k8s/base/servicemonitor.yaml` - Updated gateway and LLM paths from /healthz to /metrics

6. `docs/gengine/Deploy_GEngine_To_Kubernetes.md` - Comprehensive documentation update
- Added Health Check Endpoints section explaining /healthz purpose
- Added Metrics Endpoints section explaining /metrics purpose
- Added Example Metrics Responses showing JSON structure for all 3 services
- Updated Prometheus Annotations section
- Updated verification commands

7. `tests/echoes/test_gateway_service.py` - Added 9 new tests:
- test_gateway_metrics_endpoint
- test_gateway_metrics_track_websocket_connections
- test_gateway_metrics_track_commands
- TestGatewayMetrics class with 7 tests

8. `tests/echoes/test_llm_app.py` - Added 11 new tests:
- test_metrics_endpoint
- test_metrics_track_parse_intent
- test_metrics_track_narrate
- TestLLMMetrics class with 8 tests

**Test Results:**
- Gateway/LLM tests: 39 passed (19 original + 20 new)
- Coverage: gateway/app.py 89%, llm/app.py 91%

**Acceptance Criteria Status:**
1. ✅ Gateway exposes /metrics endpoint with request counts, latencies, error breakdowns, connections, LLM integration stats
2. ✅ LLM service exposes /metrics endpoint with request counts, latencies, error breakdowns, provider stats, token usage
3. ✅ Prometheus annotations point to /metrics (updated gateway-deployment.yaml, llm-deployment.yaml)
4. ✅ ServiceMonitor targets updated to /metrics paths (updated servicemonitor.yaml)
5. ✅ Documentation distinguishes health vs. metrics endpoints with example responses

### Task Complete: Issue #39 - Dedicated Metrics Endpoints for Gateway and LLM Services
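
The notes above describe a `GatewayMetrics` class that tracks request counts, latencies, errors, and connections, but its code is not part of this diff. The following is a hypothetical, standard-library-only sketch of what such an in-memory tracker could look like. The class name `MetricsTracker`, its fields, and its method names are all assumptions made for illustration.

```python
import threading
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MetricsTracker:
    """Hypothetical in-memory tracker; not the project's GatewayMetrics class."""

    requests_by_type: Counter = field(default_factory=Counter)
    errors: int = 0
    active_connections: int = 0
    latency_sum: float = 0.0
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def record_request(self, request_type: str, latency_s: float, ok: bool = True) -> None:
        """Count one request, add its latency, and note a failure if any."""
        with self._lock:
            self.requests_by_type[request_type] += 1
            self.latency_sum += latency_s
            if not ok:
                self.errors += 1

    def snapshot(self) -> dict:
        """Return a JSON-serializable view of the current counters."""
        with self._lock:
            total = sum(self.requests_by_type.values())
            return {
                "requests_total": total,
                "requests_by_type": dict(self.requests_by_type),
                "errors_total": self.errors,
                "active_connections": self.active_connections,
                "avg_latency_s": self.latency_sum / total if total else 0.0,
            }
```

The lock is there because request handlers may update the counters concurrently; a library such as `prometheus_client`, which this PR adds as a dependency, handles that internally.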
2 changes: 1 addition & 1 deletion k8s/base/gateway-deployment.yaml
@@ -24,7 +24,7 @@ spec:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8100"
-prometheus.io/path: "/healthz"
+prometheus.io/path: "/metrics"
spec:
containers:
- name: gateway
2 changes: 1 addition & 1 deletion k8s/base/llm-deployment.yaml
@@ -24,7 +24,7 @@ spec:
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8001"
-prometheus.io/path: "/healthz"
+prometheus.io/path: "/metrics"
spec:
containers:
- name: llm
4 changes: 2 additions & 2 deletions k8s/base/servicemonitor.yaml
@@ -39,7 +39,7 @@ spec:
app.kubernetes.io/name: gateway
endpoints:
- port: http
-path: /healthz
+path: /metrics
interval: 30s
scrapeTimeout: 10s
---
@@ -57,6 +57,6 @@ spec:
app.kubernetes.io/name: llm
endpoints:
- port: http
-path: /healthz
+path: /metrics
interval: 30s
scrapeTimeout: 10s
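
Prometheus rejects a scrape configuration whose timeout exceeds its interval, so the `interval: 30s` and `scrapeTimeout: 10s` values above are a valid pair. A small check along these lines could guard future edits. It is a sketch only: the function names are hypothetical, and the parser handles just single-unit durations such as `30s` or `1m`.

```python
def seconds(duration: str) -> int:
    """Parse a simple duration such as "30s" or "1m" into seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(duration[:-1]) * units[duration[-1]]


def timing_is_valid(interval: str, scrape_timeout: str) -> bool:
    """Return True when the scrape timeout does not exceed the interval."""
    return seconds(scrape_timeout) <= seconds(interval)


ok = timing_is_valid("30s", "10s")
```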
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -16,7 +16,8 @@ dependencies = [
"httpx>=0.27.0,<0.28.0",
"websockets>=12.0,<13.0",
"openai>=1.0.0,<2.0.0",
-"anthropic>=0.39.0,<1.0.0"
+"anthropic>=0.39.0,<1.0.0",
+"prometheus_client>=0.20.0,<1.0.0"
]

[project.optional-dependencies]