TheWizardsCode · SorraTheOrc · Dec 3, 2025 · Dec 2, 2025 · Dec 2, 2025 · Dec 2, 2025
diff --git a/.github/agents/git.agent.md b/.github/agents/git.agent.md
@@ -38,6 +38,7 @@ as described in Atlassian's guide
 - **Pre-merge hygiene**
   - Ensure the working tree is clean before switching branches.
   - Verify there are no uncommitted changes that would be lost.
+  - Ensure dev dependencies are installed (to avoid pytest configuration errors).
   - Run tests (for example `pytest -v`) and basic checks before proposing
     a merge.
 
@@ -77,6 +78,8 @@ arguments so they can run non-interactively.
   - `git rebase origin/main`
 
 - Run tests before merge:
+  - Activate the virtual environment if present (e.g. `source .venv/bin/activate`)
+  - `pip install -e ".[dev]"` (ensure test dependencies like pytest-cov are present; the quotes stop shells such as zsh from globbing the brackets)
   - `pytest -v`
 
 - Merge feature branch into main locally:

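Reviewer note: the pre-merge hygiene steps in this file (clean working tree, dev dependencies installed, then tests) could also be checked programmatically. The sketch below is only an illustration under assumptions: the helper names `is_clean`, `missing_modules`, and `premerge_problems` are hypothetical and not part of this PR, and it assumes pytest-cov is importable as the module `pytest_cov`.

```python
import importlib.util
import subprocess


def is_clean(porcelain_output):
    """Return True when `git status --porcelain` printed nothing."""
    return porcelain_output.strip() == ""


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def premerge_problems(required=("pytest", "pytest_cov")):
    """Collect human-readable reasons why a merge should not proceed yet.

    `required` is an assumed list of dev dependencies; adjust it to match
    the project's [dev] extra.
    """
    problems = []
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    if not is_clean(status):
        problems.append("working tree has uncommitted changes")
    for name in missing_modules(required):
        problems.append(f"dev dependency not installed: {name}")
    return problems
```

Calling `premerge_problems()` from the repository root returns an empty list when it is reasonable to go on to `pytest -v`; a non-empty list names what to fix first.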
diff --git a/.github/workflows/k8s-validation.yml b/.github/workflows/k8s-validation.yml
@@ -12,12 +12,12 @@ on:
     branches:
       - main
     paths:
-      - 'k8s/**/*.yaml'
-      - '.github/workflows/k8s-*.yml'
+      - "k8s/**/*.yaml"
+      - ".github/workflows/k8s-*.yml"
   pull_request:
     paths:
-      - 'k8s/**/*.yaml'
-      - '.github/workflows/k8s-*.yml'
+      - "k8s/**/*.yaml"
+      - ".github/workflows/k8s-*.yml"
 
 permissions:
   contents: read
@@ -87,6 +87,10 @@ jobs:
           kubectl cluster-info
           kubectl get nodes
 
+      - name: Create gengine namespace (real)
+        run: |
+          kubectl apply -f k8s/base/namespace.yaml
+
       - name: Dry-run validate base manifests
         run: |
           echo "Validating k8s/base with --dry-run=server..."

diff --git a/docs/gengine/Deploy_GEngine_To_Kubernetes.md b/docs/gengine/Deploy_GEngine_To_Kubernetes.md
@@ -592,17 +592,99 @@ If using LLM service extensively, increase memory for context buffering:
 ## Monitoring and Observability
 
 GEngine services are instrumented with Prometheus-compatible metrics endpoints
-for monitoring and alerting.
+for monitoring and alerting. Health checks (`/healthz`) are separate from
+metrics collection (`/metrics`) to allow independent control of readiness
+probes and observability scraping.
+
+### Health Check Endpoints
+
+Health checks are used for Kubernetes liveness and readiness probes:
+
+| Service    | Port | Health Endpoint | Description                           |
+| ---------- | ---- | --------------- | ------------------------------------- |
+| Simulation | 8000 | `/healthz`      | Returns `{"status": "ok"}`            |
+| Gateway    | 8100 | `/healthz`      | Returns status and upstream URLs      |
+| LLM        | 8001 | `/healthz`      | Returns status, provider, and model   |
 
 ### Metrics Endpoints
 
-Each service exposes metrics that can be scraped by Prometheus:
+Each service exposes a dedicated metrics endpoint for Prometheus scraping:
 
 | Service    | Port | Metrics Endpoint | Description                        |
 | ---------- | ---- | ---------------- | ---------------------------------- |
 | Simulation | 8000 | `/metrics`       | Tick count, environment, profiling |
-| Gateway    | 8100 | `/healthz`       | Service health and connection info |
-| LLM        | 8001 | `/healthz`       | Service health status              |
+| Gateway    | 8100 | `/metrics`       | Request counts, latencies, connections, LLM integration |
+| LLM        | 8001 | `/metrics`       | Request counts, latencies, errors, provider stats, token usage |
+
+### Example Metrics Responses
+
+**Simulation Service** (`/metrics`):
+```json
+{
+  "tick": 42,
+  "environment": {
+    "temperature": 0.5,
+    "instability": 0.2,
+    "tension": 0.3
+  },
+  "profiling": {
+    "tick_ms_p50": 12.5,
+    "tick_ms_p95": 25.0,
+    "tick_ms_max": 45.0
+  }
+}
+```
+
+**Gateway Service** (`/metrics`) - Prometheus text format:
+```text
+# HELP gateway_requests_total Total number of requests processed
+# TYPE gateway_requests_total counter
+gateway_requests_total 150.0
+# HELP gateway_requests_by_type_total Requests by type
+# TYPE gateway_requests_by_type_total counter
+gateway_requests_by_type_total{request_type="command"} 120.0
+gateway_requests_by_type_total{request_type="natural_language"} 30.0
+# HELP gateway_errors_total Total number of errors
+# TYPE gateway_errors_total counter
+gateway_errors_total 2.0
+# HELP gateway_active_connections Number of active WebSocket connections
+# TYPE gateway_active_connections gauge
+gateway_active_connections 3.0
+# HELP gateway_request_latency_seconds Request latency in seconds
+# TYPE gateway_request_latency_seconds histogram
+gateway_request_latency_seconds_bucket{request_type="command",le="0.1"} 80.0
+gateway_request_latency_seconds_bucket{request_type="command",le="0.5"} 115.0
+gateway_request_latency_seconds_bucket{request_type="command",le="+Inf"} 120.0
+gateway_request_latency_seconds_count{request_type="command"} 120.0
+gateway_request_latency_seconds_sum{request_type="command"} 5.46
+```
+
+**LLM Service** (`/metrics`) - Prometheus text format:
+```text
+# HELP llm_requests_total Total number of requests processed
+# TYPE llm_requests_total counter
+llm_requests_total 100.0
+# HELP llm_parse_intent_requests_total Total parse_intent requests
+# TYPE llm_parse_intent_requests_total counter
+llm_parse_intent_requests_total 80.0
+# HELP llm_narrate_requests_total Total narrate requests
+# TYPE llm_narrate_requests_total counter
+llm_narrate_requests_total 20.0
+# HELP llm_errors_total Total number of errors
+# TYPE llm_errors_total counter
+llm_errors_total 1.0
+# HELP llm_input_tokens_total Total input tokens used
+# TYPE llm_input_tokens_total counter
+llm_input_tokens_total 50000.0
+# HELP llm_output_tokens_total Total output tokens used
+# TYPE llm_output_tokens_total counter
+llm_output_tokens_total 15000.0
+# HELP llm_parse_intent_latency_seconds parse_intent request latency in seconds
+# TYPE llm_parse_intent_latency_seconds histogram
+llm_parse_intent_latency_seconds_bucket{le="1.0"} 75.0
+llm_parse_intent_latency_seconds_bucket{le="5.0"} 80.0
+llm_parse_intent_latency_seconds_bucket{le="+Inf"} 80.0
+```
 
 ### Prometheus Annotations
 
@@ -612,7 +694,7 @@ All deployments are annotated for automatic Prometheus discovery:
 annotations:
   prometheus.io/scrape: "true"
   prometheus.io/port: "<service-port>"
-  prometheus.io/path: "/metrics"  # or "/healthz"
+  prometheus.io/path: "/metrics"
 ```
 
 ### Verifying Prometheus Scraping
@@ -624,25 +706,17 @@ To confirm Prometheus is scraping your services:
 if [[ "${GENGINE_DEPLOY_ENV}" == "local" ]]; then
   MINIKUBE_IP=$(minikube ip)
   curl -s "http://${MINIKUBE_IP}:30000/metrics" | jq .
+  curl -s "http://${MINIKUBE_IP}:30100/metrics"
+  curl -s "http://${MINIKUBE_IP}:30001/metrics"
 fi
 
 # Using kubectl proxy or port-forward
 kubectl port-forward -n "${GENGINE_NAMESPACE}" svc/simulation 8000:8000 &
+kubectl port-forward -n "${GENGINE_NAMESPACE}" svc/gateway 8100:8100 &
+kubectl port-forward -n "${GENGINE_NAMESPACE}" svc/llm 8001:8001 &
 curl -s http://localhost:8000/metrics | jq .
-```
-
-Expected output:
-
-```json
-{
-  "tick": 0,
-  "environment": {
-    "temperature": 0.0,
-    "instability": 0.0,
-    "tension": 0.0
-  },
-  "profiling": {}
-}
+curl -s http://localhost:8100/metrics
+curl -s http://localhost:8001/metrics
 ```
 
 ### Prometheus Operator Integration

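Reviewer note: the Gateway and LLM examples in this doc are in Prometheus text format, not JSON, so `jq` cannot parse them. For a quick sanity check of a scraped response, something like the following stdlib-only sketch may help. It is an assumption-laden illustration, not code from this PR: `parse_samples` and `mean_latency` are hypothetical helper names, and the parser covers only simple sample lines like the ones shown in the examples.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Excerpt of the Gateway example response from the documentation.
SAMPLE_TEXT = """\
# HELP gateway_requests_total Total number of requests processed
# TYPE gateway_requests_total counter
gateway_requests_total 150.0
gateway_request_latency_seconds_bucket{request_type="command",le="+Inf"} 120.0
gateway_request_latency_seconds_count{request_type="command"} 120.0
gateway_request_latency_seconds_sum{request_type="command"} 5.46
"""


def parse_samples(text):
    """Parse text-format metrics into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            raise ValueError(f"not a valid sample line: {line!r}")
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def mean_latency(samples, metric, **match_labels):
    """Average of a histogram, computed as <metric>_sum / <metric>_count."""
    total = count = None
    for name, labels, value in samples:
        if any(labels.get(k) != v for k, v in match_labels.items()):
            continue
        if name == metric + "_sum":
            total = value
        elif name == metric + "_count":
            count = value
    if total is None or not count:
        return None
    return total / count
```

With the excerpt above, `mean_latency(parse_samples(SAMPLE_TEXT), "gateway_request_latency_seconds", request_type="command")` divides 5.46 seconds by 120 requests, giving roughly 0.0455 seconds per command.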
diff --git a/gamedev-agent-thoughts.txt b/gamedev-agent-thoughts.txt
@@ -525,3 +525,88 @@ All acceptance criteria for Issue #24 are met. The implementation is production-
 
 ### Task 9.3.1 Status: COMPLETED
 All acceptance criteria for Issue #34 are met.
+
+
+## Issue #39 - Dedicated Metrics Endpoints for Gateway and LLM Services - 2025-12-02T19:00 UTC
+=========================================================
+
+### Initial Assessment
+- Working branch: copilot/add-dedicated-metrics-endpoints
+- Current commit: fc3318b
+- Repository synced and clean
+
+### Acceptance Criteria from Issue:
+1. Gateway exposes `/metrics` endpoint with request counts, latencies, error breakdowns, provider-level stats
+2. LLM service exposes `/metrics` endpoint with request counts, latencies, error breakdowns, provider-level stats
+3. Prometheus annotations point to `/metrics` instead of `/healthz`
+4. ServiceMonitor targets updated to metrics paths/ports
+5. Documentation distinguishes health vs. metrics endpoints
+
+### Implementation Plan:
+1. [ ] Add `/metrics` endpoint to gateway/app.py with tracking of:
+   - Request counts (total, by endpoint, by result type)
+   - Request latencies
+   - Error counts
+   - Active WebSocket connections
+   - LLM service integration stats
+2. [ ] Add `/metrics` endpoint to llm/app.py with tracking of:
+   - Request counts (total, by endpoint)
+   - Request latencies
+   - Error counts
+   - Provider-level stats
+   - Token usage
+3. [ ] Update k8s/base/gateway-deployment.yaml: prometheus.io/path → /metrics
+4. [ ] Update k8s/base/llm-deployment.yaml: prometheus.io/path → /metrics
+5. [ ] Update k8s/base/servicemonitor.yaml: gateway and LLM paths → /metrics
+6. [ ] Update docs/gengine/Deploy_GEngine_To_Kubernetes.md
+7. [ ] Run tests to verify no regressions
+
+
+### Implementation Complete
+
+**Files Modified:**
+1. `src/gengine/echoes/gateway/app.py` - Added GatewayMetrics class and /metrics endpoint
+   - GatewayMetrics tracks: request counts, latencies, errors, connections, LLM integration
+   - /metrics endpoint returns all metrics in Prometheus text format
+   - Metrics tracked during WebSocket handler execution
+
+2. `src/gengine/echoes/llm/app.py` - Added LLMMetrics class and /metrics endpoint
+   - LLMMetrics tracks: request counts, latencies, errors, provider stats, token usage
+   - /metrics endpoint returns all metrics in Prometheus text format
+   - Metrics tracked for both parse_intent and narrate endpoints
+
+3. `k8s/base/gateway-deployment.yaml` - Updated prometheus.io/path from /healthz to /metrics
+4. `k8s/base/llm-deployment.yaml` - Updated prometheus.io/path from /healthz to /metrics
+5. `k8s/base/servicemonitor.yaml` - Updated gateway and LLM paths from /healthz to /metrics
+
+6. `docs/gengine/Deploy_GEngine_To_Kubernetes.md` - Comprehensive documentation update
+   - Added Health Check Endpoints section explaining /healthz purpose
+   - Added Metrics Endpoints section explaining /metrics purpose
+   - Added Example Metrics Responses for all 3 services (JSON for Simulation, Prometheus text format for Gateway and LLM)
+   - Updated Prometheus Annotations section
+   - Updated verification commands
+
+7. `tests/echoes/test_gateway_service.py` - Added 9 new tests:
+   - test_gateway_metrics_endpoint
+   - test_gateway_metrics_track_websocket_connections
+   - test_gateway_metrics_track_commands
+   - TestGatewayMetrics class with 7 tests
+
+8. `tests/echoes/test_llm_app.py` - Added 11 new tests:
+   - test_metrics_endpoint
+   - test_metrics_track_parse_intent
+   - test_metrics_track_narrate
+   - TestLLMMetrics class with 8 tests
+
+**Test Results:**
+- Gateway/LLM tests: 39 passed (19 original + 20 new)
+- Coverage: gateway/app.py 89%, llm/app.py 91%
+
+**Acceptance Criteria Status:**
+1. ✅ Gateway exposes /metrics endpoint with request counts, latencies, error breakdowns, connections, LLM integration stats
+2. ✅ LLM service exposes /metrics endpoint with request counts, latencies, error breakdowns, provider stats, token usage
+3. ✅ Prometheus annotations point to /metrics (updated gateway-deployment.yaml, llm-deployment.yaml)
+4. ✅ ServiceMonitor targets updated to /metrics paths (updated servicemonitor.yaml)
+5. ✅ Documentation distinguishes health vs. metrics endpoints with example responses
+
+### Task Complete: Issue #39 - Dedicated Metrics Endpoints for Gateway and LLM Services
diff --git a/k8s/base/gateway-deployment.yaml b/k8s/base/gateway-deployment.yaml
@@ -24,7 +24,7 @@ spec:
       annotations:
         prometheus.io/scrape: "true"
         prometheus.io/port: "8100"
-        prometheus.io/path: "/healthz"
+        prometheus.io/path: "/metrics"
     spec:
       containers:
         - name: gateway

diff --git a/k8s/base/llm-deployment.yaml b/k8s/base/llm-deployment.yaml
@@ -24,7 +24,7 @@ spec:
       annotations:
         prometheus.io/scrape: "true"
         prometheus.io/port: "8001"
-        prometheus.io/path: "/healthz"
+        prometheus.io/path: "/metrics"
     spec:
       containers:
         - name: llm

diff --git a/k8s/base/servicemonitor.yaml b/k8s/base/servicemonitor.yaml
@@ -39,7 +39,7 @@ spec:
       app.kubernetes.io/name: gateway
   endpoints:
     - port: http
-      path: /healthz
+      path: /metrics
       interval: 30s
       scrapeTimeout: 10s
 ---
@@ -57,6 +57,6 @@ spec:
       app.kubernetes.io/name: llm
   endpoints:
     - port: http
-      path: /healthz
+      path: /metrics
       interval: 30s
       scrapeTimeout: 10s
diff --git a/pyproject.toml b/pyproject.toml
@@ -16,7 +16,8 @@ dependencies = [
     "httpx>=0.27.0,<0.28.0",
     "websockets>=12.0,<13.0",
     "openai>=1.0.0,<2.0.0",
-    "anthropic>=0.39.0,<1.0.0"
+    "anthropic>=0.39.0,<1.0.0",
+    "prometheus_client>=0.20.0,<1.0.0"
 ]
 
 [project.optional-dependencies]