
feat: implement risk management and (wip: production readiness) #53

Merged
chizy7 merged 2 commits into main from feat/risk-management
Feb 20, 2026

Conversation

@chizy7
Owner

@chizy7 chizy7 commented Feb 20, 2026

Summary

  • Add core/risk/ module with 5 components: RiskManager, CircuitBreaker, VaREngine, AlertManager, and DisasterRecovery
  • Integrate pre-trade risk checks and circuit breaker into BasicMarketMaker's trading loop
  • Add 7 REST API endpoints for risk monitoring and health/readiness probes
  • Add Kubernetes deployment manifests (deploy/k8s/) for production deployment
  • Wire AuditLogger across all risk components and strategy layer
  • Add comprehensive documentation covering risk management, K8s deployment, and disaster recovery

Summary by CodeRabbit

  • New Features

    • Complete risk management suite: pre-trade checks, exposure limits, circuit breaker, VaR engine, alerting, disaster recovery, and runtime risk endpoints.
  • Infrastructure

    • Kubernetes manifests (StatefulSet, services, network policy, PVC, PDB) and runtime image changes: healthcheck, runtime dirs, curl.
  • Documentation

    • Risk management guide, disaster recovery runbook, Kubernetes deployment guide, testing and benchmarks, README updates.
  • Tests

    • Unit tests and performance benchmarks for risk components; new REST risk/health endpoints.
  • Chores

    • Pre-commit YAML hook updated to allow multi-document YAML.

@chizy7 chizy7 self-assigned this Feb 20, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Feb 20, 2026

📝 Walkthrough

Adds a new risk management layer (RiskManager, CircuitBreaker, VaREngine, AlertManager, DisasterRecovery), risk configuration, Kubernetes deployment manifests, REST endpoints, extensive tests and benchmarks, build changes to include a risk library, and runtime integration into main and strategies.

Changes

Cohort / File(s) Summary
Risk Core
core/risk/RiskManager.h, core/risk/RiskManager.cpp, core/risk/RiskConfig.h, core/risk/CircuitBreaker.h, core/risk/CircuitBreaker.cpp, core/risk/VaREngine.h, core/risk/VaREngine.cpp, core/risk/AlertManager.h, core/risk/AlertManager.cpp, core/risk/DisasterRecovery.h, core/risk/DisasterRecovery.cpp
Adds five singleton components implementing pre-/post-trade checks, stateful circuit breaker, multi-method VaR engine, throttled alerting, and disaster recovery (persistence, backups, reconciliation). Includes JSON (de)serialization, threading/atomic safety, callbacks, and public APIs for state/management.
Configuration
config/default_config.json, deploy/k8s/configmap.yaml
Introduces risk_management config section with limits, circuit_breaker, var, auto_hedge, and alerts; adds corresponding ConfigMap entry for runtime Kubernetes configuration.
Build & Tests
CMakeLists.txt, .pre-commit-config.yaml, tests/unit/*, tests/performance/RiskCheckBenchmark.cpp
Adds risk static library target, links it into strategy/visualization, exposes new unit tests and a benchmark executable; enables YAML multi-document in pre-commit hook.
App Integration & Strategy
main.cpp, strategies/basic/BasicMarketMaker.h, strategies/basic/BasicMarketMaker.cpp, strategies/config/StrategyConfig.h, Dockerfile
Initializes and wires risk components at startup/shutdown, saves risk state on shutdown, integrates RiskManager/CircuitBreaker checks into strategy flows, updates BasicMarketMaker lifecycle, and adds runtime dirs & HEALTHCHECK to Dockerfile.
REST API & Visualization
visualization/WebServer.h, visualization/WebServer.cpp
Exposes risk endpoints (/api/risk/state, /api/risk/var, /api/risk/limits, /api/risk/circuit-breaker, /api/risk/alerts) and health/readiness; extends PerformanceData with VaR/exposure/utilization fields.
Kubernetes Manifests
deploy/k8s/namespace.yaml, deploy/k8s/deployment.yaml, deploy/k8s/service.yaml, deploy/k8s/secret.yaml, deploy/k8s/networkpolicy.yaml, deploy/k8s/pdb.yaml
Adds namespace, StatefulSet with probes and PVC, services for WS/API, ConfigMap/Secret, NetworkPolicy, and PodDisruptionBudget for production deployment.
Documentation & Guides
README.md, docs/RISK_MANAGEMENT.md, docs/DISASTER_RECOVERY.md, docs/KUBERNETES_DEPLOYMENT.md, docs/PERFORMANCE_BENCHMARKS.md, docs/TESTING_GUIDE.md, docs/ROADMAP.md
Adds comprehensive risk management, DR runbook, k8s deployment guide, testing instructions, performance benchmark updates, and README updates describing the Risk Layer.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant RM as RiskManager
    participant CB as CircuitBreaker
    participant AM as AlertManager
    participant VE as VaREngine
    participant DR as DisasterRecovery
    participant Exchange as Exchange

    Client->>CB: publishPrice(price)
    CB->>CB: evaluateTriggers()
    alt breaker opens
        CB->>AM: raiseAlert(RAPID_PRICE_MOVE, CRITICAL, ...)
        CB-->>Client: trading disallowed
    end

    Client->>RM: checkOrder(side, price, qty)
    alt rejected
        RM-->>Client: rejection (limit/drawdown/halted)
    else approved
        RM-->>Client: APPROVED
        Client->>Exchange: placeOrder(...)
        Exchange-->>RM: onFill(...)
        RM->>RM: updateState()
        RM->>AM: raiseAlert(DRAWDOWN_BREACH, WARNING, ...)
        RM->>VE: addPnLUpdate(...)
        VE->>VE: recomputeVaR()
    end

    loop periodic
        RM->>DR: saveRiskState(snapshot)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 I hopped through code, with whiskers bright,

I wove VaR and breakers into the night,
Alerts I chirped when markets pranced,
Backups stashed, so systems danced,
A tidy burrow for production's light. 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 27.16%, which is insufficient. The required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title accurately reflects the main change: implementing risk management and production readiness features through new modules, components, and infrastructure.
  • Description check (✅ Passed): The PR description is mostly complete with a clear summary of changes, but lacks specific checklist items from the template (Type of Change, Areas Changed, Testing, Performance Impact, Security Considerations, Checklist, Related Issues).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/risk-management

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 9

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
docs/PERFORMANCE_BENCHMARKS.md (1)

216-232: ⚠️ Potential issue | 🟡 Minor

Missing expected result ranges for risk benchmarks.

The "Expected Results Validation" section lists expected ranges for latency, throughput, order book, and routing benchmarks, but omits the newly added risk benchmarks (risk_check_benchmark). Adding expected ranges (e.g., pre-trade risk check: 500–1000ns, circuit breaker: <10ns) would make the validation section consistent with the new benchmark.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/PERFORMANCE_BENCHMARKS.md` around lines 216 - 232, The docs' "Expected
Results Validation" is missing expected ranges for the new risk benchmarks;
update the validation section to include ranges for risk_check_benchmark and its
subchecks (e.g., add "Pre-trade risk check: 500–1000ns" and "Circuit breaker:
<10ns" or equivalent agreed values), and ensure the new "Routing Benchmark
(NEW)" additions are mirrored by corresponding entries in the Expected Results
Validation; locate mentions of risk_check_benchmark, pre-trade risk check, and
circuit breaker in the markdown and add the expected numeric ranges in the same
format as the existing latency/throughput/order book entries.
config/default_config.json (1)

17-22: ⚠️ Potential issue | 🟠 Major

Duplicate risk configuration keys across risk and risk_management blocks.

risk.maxDrawdownPct (5.0) and risk_management.limits.max_drawdown_pct (5.0) configure the same concept, as do risk.maxTradingVolume (100.0) and risk_management.limits.max_daily_volume (100.0). The codebase reads from both: StrategyConfig loads from the old risk block while RiskConfig loads from risk_management. Having two sources of truth for the same limits is error-prone — a future edit may update one but not the other, causing divergence.

Additionally, stopLossPct and takeProfitPct exist only in the old risk block; they should either be migrated to risk_management or explicitly documented as strategy-specific config.

Consider consolidating to a single config structure and ensuring all modules read from the same source.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/default_config.json` around lines 17 - 22, There are duplicate risk
settings between the old "risk" block and the new "risk_management" block
causing two sources of truth; consolidate by migrating/merging the shared keys
(maxDrawdownPct/max_drawdown_pct and maxTradingVolume/max_daily_volume) into a
single canonical block (preferably "risk_management.limits"), update
StrategyConfig to read those values from RiskConfig (or have StrategyConfig
consume RiskConfig) so only one source is used, remove the deprecated "risk"
block from the default config, and either move stopLossPct and takeProfitPct
into "risk_management" (e.g., limits.stop_loss_pct, limits.take_profit_pct) or
clearly document them as strategy-specific and ensure only StrategyConfig reads
them; also update any config schema/validation to reflect the single structure
so consumers no longer read both.
strategies/basic/BasicMarketMaker.cpp (2)

52-69: ⚠️ Potential issue | 🟡 Minor

Race between m_strategyThread launch and m_isRunning store.

m_isRunning is set to true on line 65, after the thread is already launched on line 62. If another thread calls start() between lines 62 and 65, the m_isRunning guard on line 54 won't prevent a second thread from being spawned. This is a pre-existing issue, but now that the function emits audit events it may be worth noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@strategies/basic/BasicMarketMaker.cpp` around lines 52 - 69, The start()
method in BasicMarketMaker has a race where m_isRunning is set after launching
m_strategyThread; fix by performing an atomic check-and-set on m_isRunning
before spawning the thread (e.g., use compare_exchange_strong to transition
m_isRunning from false to true), only proceed to store m_shouldStop=false, start
m_strategyThread and emit AUDIT_SYSTEM_EVENT if the CAS succeeds, and if the CAS
fails return false so a second thread cannot be spawned; ensure cleanup on
failure (do not start the thread if you couldn't claim the running flag) and
keep references to m_strategyThread and m_shouldStop consistent with this atomic
claim.
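
For reference, a minimal sketch of the compare_exchange_strong approach described above; class and member names mirror the review comment and are illustrative, not the actual BasicMarketMaker interface:

```cpp
#include <atomic>
#include <thread>

class StrategySketch {
 public:
  bool start() {
    bool expected = false;
    // Atomically claim the running flag; a losing racer sees the flag already
    // set and returns without spawning a second thread.
    if (!m_isRunning.compare_exchange_strong(expected, true)) {
      return false;
    }
    m_shouldStop.store(false);
    m_thread = std::thread([this] { run(); });
    return true;
  }

  void stop() {
    if (!m_isRunning.exchange(false)) {
      return;  // not running
    }
    m_shouldStop.store(true);
    if (m_thread.joinable()) {
      m_thread.join();
    }
  }

  ~StrategySketch() { stop(); }

 private:
  void run() {
    while (!m_shouldStop.load()) {
      // trading loop body
      std::this_thread::yield();
    }
  }

  std::atomic<bool> m_isRunning{false};
  std::atomic<bool> m_shouldStop{true};
  std::thread m_thread;
};
```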

553-569: ⚠️ Potential issue | 🟠 Major

updateStatistics() computes PnL but never calls RiskManager::onPnLUpdate() — drawdown and daily-loss limits won't trigger.

updateStatistics recalculates estimatedPnL every loop iteration and stores it, but the RiskManager is never informed. Without calling onPnLUpdate, the risk manager's drawdown tracking and daily-loss halting (tested in RiskManagerTests.cpp and verified to work) are effectively disabled in the live strategy.

Proposed fix
   // In a real system, we would track realized P&L from fills
   double estimatedPnL = unrealizedPnL;
   m_pnl.store(estimatedPnL, std::memory_order_relaxed);
+
+  // Notify risk manager so drawdown / daily-loss limits are enforced
+  risk::RiskManager::getInstance().onPnLUpdate(estimatedPnL);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@strategies/basic/BasicMarketMaker.cpp` around lines 553 - 569, The
updateStatistics() method updates m_pnl but never notifies the risk manager;
modify it to, after storing estimatedPnL (m_pnl.store(...)), check that
m_riskManager is non-null and call m_riskManager->onPnLUpdate(estimatedPnL) so
the RiskManager (and its drawdown/daily-loss logic) receives each PnL update;
ensure the call is made outside the m_statsMutex lock to avoid potential
deadlocks and keep the null-check (or use a safe pointer) to prevent crashes.
🟠 Major comments (14)
Dockerfile-124-125 (1)

124-125: ⚠️ Potential issue | 🟠 Major

Container runs as root; consider adding a non-root USER

The new runtime directories are created without explicit ownership or permission settings. More broadly, the final image has no USER instruction, so the application process runs as root. This violates the principle of least privilege and is flagged by standard container security scanners (e.g., Trivy, Hadolint).

🔒 Proposed hardening
+# Create non-root user
+RUN groupadd -r pinnaclemm && useradd -r -g pinnaclemm pinnaclemm

 # Create logs and data directories
-RUN mkdir -p /app/logs /app/data /app/data/backups
+RUN mkdir -p /app/logs /app/data /app/data/backups \
+    && chown -R pinnaclemm:pinnaclemm /app

+USER pinnaclemm

 # Default command
 ENTRYPOINT ["/app/pinnaclemm"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 124 - 125, Create a non-root runtime user and set
proper ownership/permissions for the created directories so the container does
not run as root: after the RUN mkdir -p /app/logs /app/data /app/data/backups
step, add commands to create a dedicated user/group (e.g., appuser/appgroup),
chown the directories (/app/logs, /app/data, /app/data/backups) to that
user/group and set restrictive permissions, and finally add a USER instruction
to switch to that non-root user for the remaining image runtime; reference the
mkdir-created paths and the Dockerfile USER instruction and the chosen username
(e.g., appuser) when implementing.
deploy/k8s/networkpolicy.yaml-15-23 (1)

15-23: ⚠️ Potential issue | 🟠 Major

Ingress rules with no from selector allow any pod in the cluster to reach the risk API.

In Kubernetes NetworkPolicy, a rule that specifies only ports without a from clause permits traffic from any source — any pod in any namespace (and any external address that can reach the pod network). For ports 8081 (risk state, VaR, circuit-breaker controls, alert history) and 8080 (live WebSocket dashboard), this is an overly permissive posture for a production financial system.

Add a from clause to restrict ingress to the namespaces that legitimately need access (e.g., the ingress controller namespace, monitoring namespace, or same-namespace pods only).

🛡️ Proposed fix — scope ingress to intended consumers
   ingress:
-    # Allow inbound traffic to WebSocket port
-    - ports:
-        - port: 8080
-          protocol: TCP
-    # Allow inbound traffic to REST API port
-    - ports:
-        - port: 8081
-          protocol: TCP
+    # Allow inbound traffic from ingress controller and monitoring namespace only
+    - from:
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: ingress-nginx  # adjust to your ingress controller namespace
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: monitoring      # adjust to your monitoring namespace
+      ports:
+        - port: 8080
+          protocol: TCP
+        - port: 8081
+          protocol: TCP
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/k8s/networkpolicy.yaml` around lines 15 - 23, The NetworkPolicy
ingress rules for ports 8080 and 8081 are currently missing a from: clause and
thus allow traffic from any source; update the ingress block in
networkpolicy.yaml to add a from: selector restricting sources (e.g., a
namespaceSelector matching your ingress controller namespace, monitoring
namespace, and/or a podSelector for same-namespace pods) for the entries that
list port: 8080 and port: 8081, ensuring the policy only permits those intended
consumers to reach the risk API/WebSocket endpoints; keep policyTypes: [
"Ingress" ] (or add it if missing) so the rule is enforced.
core/risk/CircuitBreaker.cpp-130-141 (1)

130-141: ⚠️ Potential issue | 🟠 Major

static local spreadSamples is never reset on re-initialization.

spreadSamples is a function-static variable that monotonically increments. When initialize() is called again (e.g., during tests or runtime reconfiguration), m_baselineInitialized is reset to false but spreadSamples retains its old value (≥ 20). On the very next onSpread call after re-init, the ++spreadSamples >= 20 check passes immediately, setting m_baselineInitialized = true after only a single EMA update instead of the intended ~20-sample warm-up.

Proposed fix: use a member variable instead of static

In CircuitBreaker.h, add a member:

size_t m_spreadSamples{0};

In CircuitBreaker.cpp, reset it in initialize():

  {
    std::lock_guard<std::mutex> lock(m_baselineMutex);
    m_baselineSpread = 0.0;
    m_baselineVolume = 0.0;
    m_baselineInitialized = false;
+   m_spreadSamples = 0;
  }

And in onSpread, replace the static:

-     static size_t spreadSamples = 0;
-     ++spreadSamples;
-     if (spreadSamples >= 20) {
+     ++m_spreadSamples;
+     if (m_spreadSamples >= 20) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/CircuitBreaker.cpp` around lines 130 - 141, The bug is that the
function-static spreadSamples in onSpread() never resets across
re-initializations so m_baselineInitialized gets re-enabled immediately; replace
the static counter with a per-instance member (e.g., add size_t
m_spreadSamples{0} to the CircuitBreaker class), reset m_spreadSamples to 0 in
initialize(), and use/increment m_spreadSamples inside onSpread() instead of the
static spreadSamples so the ~20-sample warm-up behaves correctly.
core/risk/RiskManager.cpp-228-230 (1)

228-230: ⚠️ Potential issue | 🟠 Major

onPnLUpdate stores newPnL (total) as the daily PnL — likely incorrect.

Line 230 sets m_dailyPnL to the total PnL value passed in, not a session-relative delta. The comment on line 228 acknowledges this ambiguity. However, checkOrder (line 126) and getDailyLossUtilization both treat m_dailyPnL as the daily loss figure. If newPnL is a cumulative total (as the parameter name and m_totalPnL assignment on line 208 suggest), the daily loss limit check will fire against the wrong baseline.

Clarify the contract: either compute dailyPnL = newPnL - pnlAtDayStart, or require callers to pass the daily-scoped value.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/RiskManager.cpp` around lines 228 - 230, onPnLUpdate currently
stores the total PnL into m_dailyPnL which breaks downstream logic (checkOrder
and getDailyLossUtilization) that expect a session/daily delta; change
onPnLUpdate to compute and store the session-relative daily PnL (e.g., dailyPnL
= newPnL - pnlAtDayStart or newPnL - m_totalPnLPrevious) before calling
m_dailyPnL.store, or alternatively change the API contract and validate callers
to pass the delta; update references to m_totalPnL and m_dailyPnL in onPnLUpdate
(and any reset logic that records pnlAtDayStart) so that m_dailyPnL represents
the daily delta used by checkOrder and getDailyLossUtilization.
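
For illustration, a minimal sketch of the day-start-baseline option; member names such as m_pnlAtDayStart are hypothetical and the real RiskManager layout may differ:

```cpp
#include <atomic>

class DailyPnLSketch {
 public:
  // Called at daily reset time: record the cumulative PnL as the new baseline.
  void startNewTradingDay(double currentTotalPnL) {
    m_pnlAtDayStart.store(currentTotalPnL, std::memory_order_relaxed);
  }

  // Called with the cumulative (total) PnL; the daily figure is stored as a
  // delta from the day-start baseline, which is what limit checks should read.
  void onPnLUpdate(double newTotalPnL) {
    m_totalPnL.store(newTotalPnL, std::memory_order_relaxed);
    double daily =
        newTotalPnL - m_pnlAtDayStart.load(std::memory_order_relaxed);
    m_dailyPnL.store(daily, std::memory_order_relaxed);
  }

  double dailyPnL() const { return m_dailyPnL.load(std::memory_order_relaxed); }

 private:
  std::atomic<double> m_totalPnL{0.0};
  std::atomic<double> m_dailyPnL{0.0};
  std::atomic<double> m_pnlAtDayStart{0.0};
};
```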
core/risk/RiskManager.cpp-85-94 (1)

85-94: ⚠️ Potential issue | 🟠 Major

Reading m_limits fields without synchronization on the hot path is formally UB.

Lines 88, 97–98, 108, 117, 125, 138, and 152–154 read members of m_limits (a non-atomic struct) while updateLimits() can write it concurrently under m_stateMutex. The comment acknowledges this as "best-effort," but concurrent read/write of non-atomic types is undefined behavior in C++ — not merely a torn read.

Since RiskLimits is a small POD struct, a pragmatic fix is to snapshot it at the top of checkOrder under the mutex (one lock acquisition per check) rather than per-field reads throughout.

Proposed fix — snapshot limits once
 RiskCheckResult RiskManager::checkOrder(OrderSide side, double price,
                                         double quantity,
                                         const std::string& symbol) {
+  // Snapshot limits once to avoid data-race UB on the hot path
+  RiskLimits limits;
+  {
+    std::lock_guard<std::mutex> lock(m_stateMutex);
+    limits = m_limits;
+  }
+
   // 1. Halted check
   if (m_halted.load(std::memory_order_acquire)) {

Then replace all m_limits.XXX references with limits.XXX in the method body.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/RiskManager.cpp` around lines 85 - 94, The method checkOrder reads
non-atomic members of m_limits concurrently with updateLimits() which writes it
under m_stateMutex, causing UB; fix by taking m_stateMutex at the start of
checkOrder(), copy m_limits into a local RiskLimits limits snapshot, then
release the mutex and replace all uses of m_limits.* in checkOrder() with
limits.* so the hot path uses the stable local copy (refer to symbols:
checkOrder, m_limits, updateLimits, m_stateMutex, RiskLimits).
core/risk/AlertManager.h-102-103 (1)

102-103: ⚠️ Potential issue | 🟠 Major

Data race on m_lastAlertTime in isThrottled.

isThrottled (called from raiseAlert at line 33, before acquiring m_alertsMutex) performs a find + read on m_lastAlertTime without synchronization, while raiseAlert writes to this map under m_alertsMutex at line 50. Concurrent read of std::unordered_map during a write is undefined behavior in C++—a rehash triggered by insertion can corrupt the iterator/bucket structure.

Options:

  1. Acquire m_alertsMutex in isThrottled (simple, acceptable since alert raising is not on the trading hot path).
  2. Use a fixed-size std::array indexed by static_cast<int>(type) for lock-free access with atomics.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/AlertManager.h` around lines 102 - 103, isThrottled currently reads
m_lastAlertTime without synchronization while raiseAlert writes to it under
m_alertsMutex, causing a data race; fix by synchronizing accesses: either
acquire m_alertsMutex at the start of isThrottled (simplest) or replace
m_lastAlertTime with a fixed-size container of atomics indexed by the alert type
for lock-free reads/writes. Update isThrottled to use the chosen approach and
ensure all write sites (e.g., raiseAlert) use the same synchronization
mechanism, referencing m_lastAlertTime, isThrottled, raiseAlert, and
m_alertsMutex to locate and modify the code.
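
A minimal sketch of option 1 (taking the same mutex in isThrottled); the key type and throttle window here are placeholders, not the component's actual API:

```cpp
#include <chrono>
#include <mutex>
#include <unordered_map>

enum class AlertType { DRAWDOWN_BREACH, RAPID_PRICE_MOVE };

class AlertThrottleSketch {
 public:
  bool isThrottled(AlertType type) {
    // Same mutex as the writer, so the map is never read during a rehash.
    std::lock_guard<std::mutex> lock(m_alertsMutex);
    auto it = m_lastAlertTime.find(type);
    if (it == m_lastAlertTime.end()) return false;
    return std::chrono::steady_clock::now() - it->second < m_throttleWindow;
  }

  void recordAlert(AlertType type) {
    std::lock_guard<std::mutex> lock(m_alertsMutex);
    m_lastAlertTime[type] = std::chrono::steady_clock::now();
  }

 private:
  std::mutex m_alertsMutex;
  std::unordered_map<AlertType, std::chrono::steady_clock::time_point>
      m_lastAlertTime;
  std::chrono::seconds m_throttleWindow{30};
};
```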
core/risk/VaREngine.cpp-246-253 (1)

246-253: ⚠️ Potential issue | 🟠 Major

Monte Carlo simulations use identical random draws on every cycle.

m_rng is seeded once in the constructor and never advanced. Each call copies it via std::mt19937 localRng(m_rng), so every invocation of calculateMonteCarloVaR starts from the same RNG state. While different mean/stddev values cause different output values, the underlying uniform variates are always identical — this eliminates the sampling diversity that Monte Carlo is supposed to provide and makes MC VaR a deterministic transform of parametric VaR.

Fix: make m_rng mutable and use it directly (advancing its state), or store and advance the localRng across iterations.

Proposed fix — advance RNG state across calls

In VaREngine.h, mark m_rng as mutable, and in the method:

-  // Use a local copy of the RNG to keep this method const-correct
-  std::mt19937 localRng(m_rng);
+  // Advance the engine's RNG so successive calls produce different draws
+  std::mt19937 localRng(m_rng);
+  m_rng = localRng; // will be updated after the loop below; see alternative

A cleaner approach: drop const from calculateMonteCarloVaR and use m_rng directly (guarded by the fact that only the background thread calls it), or keep a mutable RNG with a dedicated mutex.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/VaREngine.cpp` around lines 246 - 253, calculateMonteCarloVaR
currently copies the RNG state each call via std::mt19937 localRng(m_rng),
causing identical random draws every invocation; fix by advancing shared RNG
state instead of reusing the same seed: make m_rng mutable (or remove const from
calculateMonteCarloVaR) and use m_rng directly when sampling (or store a
persistent localRng in the class and update it between calls), ensuring the RNG
state (m_rng or the persistent localRng) is advanced on each draw rather than
re-seeded from the same state.
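
As a complement to the diff above, a minimal sketch of the non-const variant that advances the shared RNG on every call; names are hypothetical rather than the actual VaREngine API:

```cpp
#include <algorithm>
#include <cstddef>
#include <random>

class MonteCarloVaRSketch {
 public:
  // Deliberately non-const: each call draws from (and thereby advances) m_rng,
  // so successive simulation cycles use fresh random variates.
  double simulateWorstReturn(double mean, double stddev, std::size_t paths) {
    std::normal_distribution<double> dist(mean, stddev);
    double worst = 0.0;
    for (std::size_t i = 0; i < paths; ++i) {
      worst = std::min(worst, dist(m_rng));
    }
    return worst;  // most negative simulated return across all paths
  }

 private:
  std::mt19937 m_rng{std::random_device{}()};
};
```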
visualization/WebServer.h-236-242 (1)

236-242: ⚠️ Potential issue | 🟠 Major

handleGetRiskVaR is a stub — architectural issue prevents access to actual VaR data.

The implementation returns a static placeholder message instead of calling VaREngine::toJson(). However, the root cause is architectural: VaREngine lacks a singleton pattern and is instantiated locally in main() with no mechanism for WebServer to access it. Unlike RiskManager, CircuitBreaker, and AlertManager which provide getInstance() accessors, VaREngine has no such pattern.

To fix this, either:

  1. Convert VaREngine to a singleton following the pattern of other risk components, or
  2. Pass the VaREngine instance to WebServer at initialization time
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@visualization/WebServer.h` around lines 236 - 242, handleGetRiskVaR currently
returns a placeholder because WebServer cannot access the real VaREngine; fix by
wiring VaREngine into WebServer: either (A) make VaREngine follow the existing
singleton pattern (add a static getInstance() method and internal static
instance, mirroring RiskManager/CircuitBreaker/AlertManager) so handleGetRiskVaR
can call VaREngine::getInstance()->toJson(), or (B) inject the VaREngine
instance into WebServer (add a constructor or setter on WebServer to store a
VaREngine* or reference and update main() to pass the locally-constructed
VaREngine), then modify WebServer::handleGetRiskVaR to call the instance's
toJson() and return that JSON instead of the static placeholder.
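
For option 1, a minimal Meyers-singleton sketch following the getInstance() pattern the review attributes to the other risk components; the members shown are placeholders:

```cpp
#include <string>

class VaREngineSketch {
 public:
  static VaREngineSketch& getInstance() {
    static VaREngineSketch instance;  // thread-safe initialization since C++11
    return instance;
  }

  VaREngineSketch(const VaREngineSketch&) = delete;
  VaREngineSketch& operator=(const VaREngineSketch&) = delete;

  // Placeholder for the real serialization of the latest VaR result.
  std::string toJson() const { return "{}"; }

 private:
  VaREngineSketch() = default;
};

// A handler such as handleGetRiskVaR could then return
// VaREngineSketch::getInstance().toJson() instead of a static placeholder.
```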
core/risk/AlertManager.h-1-4 (1)

1-4: 🛠️ Refactor suggestion | 🟠 Major

Missing direct #include <nlohmann/json.hpp>.

This header uses nlohmann::json in the Alert struct (line 44) and in several method signatures (lines 57–60, 81–82), but relies on RiskConfig.h to transitively provide the include. If RiskConfig.h ever stops including it, this header will break. Add the include directly.

Proposed fix
 #include "../utils/TimeUtils.h"
 #include "RiskConfig.h"

+#include <nlohmann/json.hpp>
+
 #include <atomic>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/AlertManager.h` around lines 1 - 4, AlertManager.h currently uses
nlohmann::json in the Alert struct and several method signatures but relies on a
transitive include; add a direct include for <nlohmann/json.hpp> at the top of
AlertManager.h so Alert (and methods that take/return nlohmann::json) compile
independently of RiskConfig.h changes—update the header to `#include`
<nlohmann/json.hpp> before any declarations that reference nlohmann::json (e.g.,
the Alert struct and public methods in AlertManager).
docs/DISASTER_RECOVERY.md-264-268 (1)

264-268: ⚠️ Potential issue | 🟠 Major

POST /api/risk/circuit-breaker/reset is documented but not implemented.

The manual override command on line 267 sends a POST to /api/risk/circuit-breaker/reset, but only GET /api/risk/circuit-breaker is routed in WebServer.cpp. An operator using this command during an emergency will receive a 404.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/DISASTER_RECOVERY.md` around lines 264 - 268, The docs call out a POST
endpoint that doesn't exist; add a POST route in WebServer.cpp alongside the
existing GET /api/risk/circuit-breaker handler (where routes are registered,
e.g., setupRoutes/registerRoutes) to handle POST
/api/risk/circuit-breaker/reset, implement a handler function (e.g.,
handlePostCircuitBreakerReset) that invokes the circuit-breaker reset operation
on the appropriate component (call the existing circuit breaker reset method
such as circuitBreaker.reset() or riskManager.resetCircuitBreaker()) and return
an appropriate HTTP 200/204 on success and 4xx/5xx on failure; wire that handler
into the server/router registration so the described kubectl curl command no
longer returns 404.
docs/DISASTER_RECOVERY.md-117-133 (1)

117-133: ⚠️ Potential issue | 🟠 Major

Documentation references multiple endpoints that are not implemented.

The manual position-adjustment procedure (Section 5, lines 117-129) references /api/positions and /api/reconcile which do not exist in visualization/WebServer.cpp. An operator following this runbook would receive 404 errors.

Additionally:

  • Line 235 references /api/risk/status but the implemented endpoint is /api/risk/state
  • Line 267 references POST /api/risk/circuit-breaker/reset but only the GET variant exists
  • Section 10.2 references /api/metrics for Prometheus scraping, which is not implemented
  • The /api/ready endpoint documentation (table, Section 10.1) claims it checks "recovered state" and "exchange connectivity," but the actual implementation only verifies RiskManager halt state and CircuitBreaker trading allowed status

Either implement the missing endpoints or update the documentation to reflect what is actually available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/DISASTER_RECOVERY.md` around lines 117 - 133, The docs list multiple
HTTP endpoints that don't exist or differ from implementation; update either the
docs or the server. Either implement the missing handlers in
visualization/WebServer.cpp (add a GET /api/positions handler and a POST
/api/reconcile handler, add /api/metrics for Prometheus, add POST
/api/risk/circuit-breaker/reset in addition to the existing GET, and add an
alias /api/risk/status that delegates to the existing /api/risk/state handler)
or modify DISASTER_RECOVERY.md and the API table to accurately reflect the
implemented handlers (e.g., change /api/risk/status → /api/risk/state, mark
circuit-breaker reset as GET-only or add POST behavior, remove /api/positions
and /api/reconcile or document manual reconciliation steps), and update the
/api/ready description to match the readyEndpoint implementation in
WebServer.cpp (only RiskManager halt and CircuitBreaker trading checks). Ensure
changes reference the actual handler names in WebServer.cpp so documentation and
code remain consistent.
core/risk/DisasterRecovery.cpp-52-99 (1)

52-99: ⚠️ Potential issue | 🟠 Major

saveRiskState is not atomically saving both files — partial failure leaves inconsistent state.

If the risk-state rename on Line 73 succeeds but the strategy-state write/rename (Lines 80-91) fails, the system ends up with a new risk state paired with the old strategy state. For a disaster-recovery component, this inconsistency could cause incorrect state restoration.

Consider writing both temp files first, then renaming both in sequence, so that at least the risk state rename doesn't happen unless both temp files are fully written.

Proposed fix — write both temp files, then rename both
   try {
     std::string riskPath = getRiskStatePath();
     std::string riskTmpPath = riskPath + ".tmp";
+    std::string strategyPath = getStrategyStatePath();
+    std::string strategyTmpPath = strategyPath + ".tmp";
 
+    // Write both temp files first
     {
       std::ofstream ofs(riskTmpPath, std::ios::trunc);
       if (!ofs.is_open()) {
         ...
         return false;
       }
       ofs << riskState.dump(2);
       ofs.flush();
+      if (!ofs.good()) {
+        spdlog::error("DisasterRecovery: risk state write failed");
+        return false;
+      }
     }
 
-    bfs::rename(riskTmpPath, riskPath);
-
-    std::string strategyPath = getStrategyStatePath();
-    std::string strategyTmpPath = strategyPath + ".tmp";
-
     {
       std::ofstream ofs(strategyTmpPath, std::ios::trunc);
       ...
       ofs << strategyState.dump(2);
       ofs.flush();
+      if (!ofs.good()) {
+        spdlog::error("DisasterRecovery: strategy state write failed");
+        return false;
+      }
     }
 
+    // Rename both only after both temp files are written successfully
+    bfs::rename(riskTmpPath, riskPath);
     bfs::rename(strategyTmpPath, strategyPath);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/DisasterRecovery.cpp` around lines 52 - 99, The current
DisasterRecovery::saveRiskState writes and renames the risk temp file before
writing/renaming the strategy temp file, allowing a partial commit; change the
flow to write both temp files fully (riskTmpPath and strategyTmpPath) first
(open, write, flush, close) and only after both writes succeed perform the two
bfs::rename calls (or perform a safe two-phase commit sequence), ensuring that
if either write fails no rename occurs and existing files remain unchanged; also
preserve existing logging behavior (use the same spdlog error messages on write
failures and the existing exception catch) so failures remain visible.
deploy/k8s/deployment.yaml-31-48 (1)

31-48: ⚠️ Potential issue | 🟠 Major

Missing securityContext — container runs as root with full privileges.

Trivy (KSV-0014, KSV-0118) and Checkov (CKV_K8S_20, CKV_K8S_23) both flag this. For a production trading system handling exchange credentials, this is a meaningful security posture gap. The container should run as non-root with a read-only root filesystem, dropped capabilities, and no privilege escalation.

Proposed fix — add securityContext at both pod and container level
     spec:
       terminationGracePeriodSeconds: 60
+      securityContext:
+        runAsNonRoot: true
+        runAsUser: 1000
+        fsGroup: 1000
       containers:
         - name: pinnaclemm
           image: pinnaclemm:latest
           imagePullPolicy: IfNotPresent
+          securityContext:
+            allowPrivilegeEscalation: false
+            readOnlyRootFilesystem: true
+            capabilities:
+              drop:
+                - ALL

Note: With readOnlyRootFilesystem: true, ensure the writable mount points (/data/pinnaclemm, /var/log/pinnaclemm) cover all write paths. You may also need a writable /tmp via an emptyDir volume.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/k8s/deployment.yaml` around lines 31 - 48, The deployment currently
runs the container "pinnaclemm" as root; add both a pod-level securityContext
and a container-level securityContext for the "pinnaclemm" container: set
runAsNonRoot: true with runAsUser (e.g., 1000) and runAsGroup, set fsGroup at
pod-level, enforce readOnlyRootFilesystem: true at the container level, disallow
privilege escalation (allowPrivilegeEscalation: false), and drop all
capabilities (capabilities: drop: ["ALL"]); also add a seccompProfile
(runtime/default) at pod-level. Because readOnlyRootFilesystem prevents writes,
add emptyDir volumes and mounts for any writable paths the app needs (e.g.,
/data/pinnaclemm, /var/log/pinnaclemm, /tmp) and reference those volumeMounts in
the "pinnaclemm" container so the app retains required write access.
main.cpp-236-244 (1)

236-244: ⚠️ Potential issue | 🟠 Major

Hardcoded relative paths for audit log, data directory, and backups ignore K8s environment variables.

The K8s manifest (deployment.yaml) sets PINNACLE_DATA_DIR=/data/pinnaclemm and PINNACLE_LOG_DIR=/var/log/pinnaclemm, but main.cpp uses hardcoded relative paths that are never read:

  • Line 238: "logs/audit.log" for the audit logger
  • Line 249: "data" for the persistence directory
  • Line 306: "data/backups" for disaster recovery

In the K8s pod, these resolve to the container's working directory rather than the mounted volumes (/data/pinnaclemm PVC and /var/log/pinnaclemm emptyDir), causing logs and state data to be lost on pod restart.

Read the environment variables and construct absolute paths at initialization.

Also applies to: 249-249, 304-307

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.cpp` around lines 236 - 244, The code currently uses hardcoded relative
paths (e.g., auditLogger.initialize("logs/audit.log")) which ignores the
Kubernetes-set env vars; update initialization to read PINNACLE_LOG_DIR and
PINNACLE_DATA_DIR via std::getenv (fallback to safe defaults), construct
absolute paths using std::filesystem::path, and pass those paths to
AuditLogger::initialize (replace "logs/audit.log"), the persistence directory
creation (replace "data"), and the backups directory (replace "data/backups");
ensure you use the resolved log dir + "audit.log" for auditLogger.initialize,
setCurrentSession remains the same, and create/ensure directories exist before
use.
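
A minimal sketch of the env-var resolution described above; the fallback directory names are the current relative defaults and the helper name is hypothetical:

```cpp
#include <cstdlib>
#include <filesystem>

namespace fs = std::filesystem;

// Resolve a directory from an environment variable, falling back to the
// existing relative default, and make sure it exists before use.
fs::path dirFromEnv(const char* var, const fs::path& fallback) {
  const char* value = std::getenv(var);
  fs::path dir = (value && *value) ? fs::path(value) : fallback;
  fs::create_directories(dir);
  return dir;
}

int main() {
  fs::path logDir = dirFromEnv("PINNACLE_LOG_DIR", "logs");
  fs::path dataDir = dirFromEnv("PINNACLE_DATA_DIR", "data");

  fs::path auditLog = logDir / "audit.log";  // would be passed to the audit logger
  fs::path backupDir = dataDir / "backups";  // would be passed to disaster recovery
  fs::create_directories(backupDir);
  return 0;
}
```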
🟡 Minor comments (17)
docs/RISK_MANAGEMENT.md-154-154 (1)

154-154: ⚠️ Potential issue | 🟡 Minor

Minor grammatical gap — missing subject.

isVaRBreached(portfolioValue) can optionally be used as a pre-trade check → It can optionally be used as a pre-trade check.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/RISK_MANAGEMENT.md` at line 154, The sentence lacks a subject; change
the line referencing the function name so it reads "It can optionally be used as
a pre-trade check." — i.e., update the sentence containing
`isVaRBreached(portfolioValue)` to prepend "It" so it becomes: "It can
optionally be used as a pre-trade check."
deploy/k8s/secret.yaml-1-6 (1)

1-6: ⚠️ Potential issue | 🟡 Minor

Add deploy/k8s/secret.yaml to .gitignore to prevent accidental credential commits.

The file comment warns against committing real credentials, but the file is tracked in git. A developer editing values in-place for local testing can accidentally stage and commit real credentials. The stronger pattern is:

  1. Rename this file to secret.yaml.template (tracked in git with placeholders).
  2. Add deploy/k8s/secret.yaml to .gitignore (the live file is never tracked).

Also, the personal attribution # WARNING FROM CHIZY: on line 1 should be # WARNING: or # IMPORTANT: for institutional consistency in shared infrastructure files.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/k8s/secret.yaml` around lines 1 - 6, The tracked secret file contains
a personal header "WARNING FROM CHIZY:" and placeholder credentials; rename the
tracked file to secret.yaml.template (keep placeholders) and create an untracked
live file named secret.yaml for local use, then add secret.yaml to .gitignore so
live credentials are never committed; also replace the header line "WARNING FROM
CHIZY:" with a neutral institutional header like "WARNING:" (or "IMPORTANT:") in
the template.
docs/RISK_MANAGEMENT.md-327-333 (1)

327-333: ⚠️ Potential issue | 🟡 Minor

Source Files table is missing the "Lines" column data (MD056).

The header declares three columns (| File | Lines | Purpose |) but every data row only has two cells, dropping the Lines column entirely. Most Markdown renderers will either misalign or elide the rightmost column.

📝 Proposed fix
-| File | Lines | Purpose |
+| File | Purpose |
 |---|---|---|
-| `core/risk/RiskConfig.h` | Config structs and JSON serialization |
-| `core/risk/RiskManager.h/.cpp` | Pre-trade checks, position tracking, auto-hedging |
-| `core/risk/CircuitBreaker.h/.cpp` | Market circuit breaker state machine |
-| `core/risk/VaREngine.h/.cpp` | Value at Risk with Monte Carlo |
-| `core/risk/AlertManager.h/.cpp` | Alert system with throttling |
-| `core/risk/DisasterRecovery.h/.cpp` | State persistence and backup management |
+| `core/risk/RiskConfig.h` | Config structs and JSON serialization |
+| `core/risk/RiskManager.h/.cpp` | Pre-trade checks, position tracking, auto-hedging |
+| `core/risk/CircuitBreaker.h/.cpp` | Market circuit breaker state machine |
+| `core/risk/VaREngine.h/.cpp` | Value at Risk with Monte Carlo |
+| `core/risk/AlertManager.h/.cpp` | Alert system with throttling |
+| `core/risk/DisasterRecovery.h/.cpp` | State persistence and backup management |

Alternatively, populate the Lines column with the actual line counts if known.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/RISK_MANAGEMENT.md` around lines 327 - 333, The markdown table in
RISK_MANAGEMENT.md declares three columns in the header but each row only has
two cells, which breaks table alignment; update the table header and each data
row (the rows for core/risk/RiskConfig.h, RiskManager.h/.cpp,
CircuitBreaker.h/.cpp, VaREngine.h/.cpp, AlertManager.h/.cpp,
DisasterRecovery.h/.cpp) to include a third "Lines" cell—either populate with
actual line counts or a placeholder like "-"—so every row matches the header and
the table renders correctly.
tests/performance/RiskCheckBenchmark.cpp-39-53 (1)

39-53: ⚠️ Potential issue | 🟡 Minor

Stale target comment: "< 100 ns" vs. documented 750 ns.

Line 42 states "Target: < 100 ns per call" but the documented benchmark result in PERFORMANCE_BENCHMARKS.md (line 116) shows BM_RiskCheckOrder at 750 ns. The target should be updated to reflect realistic expectations (~1 μs range given 8 sequential atomic loads).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/performance/RiskCheckBenchmark.cpp` around lines 39 - 53, Update the
stale target comment in BM_RiskCheckOrder to reflect the realistic benchmark
(~750 ns / ~1 μs) instead of "< 100 ns": edit the comment block above the
BM_RiskCheckOrder function in tests/performance/RiskCheckBenchmark.cpp
(referencing BM_RiskCheckOrder and RiskManager::checkOrder) to state the new
target (e.g., "~750 ns (~1 μs) per call") and, if helpful, add a short note that
the value comes from PERFORMANCE_BENCHMARKS.md measured results.
docs/PERFORMANCE_BENCHMARKS.md-113-120 (1)

113-120: ⚠️ Potential issue | 🟡 Minor

Fenced code block missing language specifier.

The code block at line 113 lacks a language annotation (e.g., ```text), which triggers the MD040 markdownlint warning. All other benchmark output blocks in this file have the same pattern, but since the linter flags this one, consider adding a language tag for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/PERFORMANCE_BENCHMARKS.md` around lines 113 - 120, The fenced code block
containing the benchmark output (the block with entries like BM_RiskCheckOrder,
BM_CircuitBreakerCheck, BM_OnFill, BM_OnPnLUpdate) is missing a language
specifier; update that fence to include a language tag (e.g., ```text or
```console) so it matches the other benchmark blocks and resolves the MD040 lint
warning, ensuring the change is applied to the specific fenced block that wraps
those benchmark lines.
core/risk/RiskConfig.h-156-166 (1)

156-166: ⚠️ Potential issue | 🟡 Minor

confidenceLevel95 and confidenceLevel99 are not deserialized from JSON.

VaRConfig defines confidenceLevel95 and confidenceLevel99 fields (lines 74–75), but fromJson does not read them from the "var" section, and toJson (lines 221–226) does not write them. This means these parameters are not configurable via the JSON config file and will always use their defaults (0.95 and 0.99).

If this is intentional (confidence levels are fixed), consider removing the fields from VaRConfig or documenting the rationale. If they should be configurable, add them to both fromJson and toJson.

Proposed fix for fromJson
         config.var.varLimitPct =
             v.value("var_limit_pct", config.var.varLimitPct);
+        config.var.confidenceLevel95 =
+            v.value("confidence_level_95", config.var.confidenceLevel95);
+        config.var.confidenceLevel99 =
+            v.value("confidence_level_99", config.var.confidenceLevel99);
       }
Proposed fix for toJson
          {"var",
           {{"window_size", var.windowSize},
            {"simulation_count", var.simulationCount},
            {"horizon", var.horizon},
            {"update_interval_ms", var.updateIntervalMs},
-           {"var_limit_pct", var.varLimitPct}}},
+           {"var_limit_pct", var.varLimitPct},
+           {"confidence_level_95", var.confidenceLevel95},
+           {"confidence_level_99", var.confidenceLevel99}}},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/RiskConfig.h` around lines 156 - 166, VaRConfig's confidenceLevel95
and confidenceLevel99 are never deserialized/serialized; update the fromJson
routine that reads the "var" section to set config.var.confidenceLevel95 and
config.var.confidenceLevel99 (using v.value(..., currentValue)) and also update
the toJson routine that emits the "var" section to write confidenceLevel95 and
confidenceLevel99 from config.var into the JSON output; reference the VaRConfig
struct and the existing config.var usage in fromJson/toJson to add these two
fields so they become configurable.
core/risk/VaREngine.h-62-65 (1)

62-65: ⚠️ Potential issue | 🟡 Minor

Remove unused m_resultsMutex — it is dead code.

The double-buffering pattern with m_activeBuffer atomic provides complete synchronization for lock-free reads and writes. All methods accessing m_results (getLatestResult, isVaRBreached, getCurrentVaR95Pct, getCurrentVaR99Pct, toJson, and calculationLoop) use only the atomic variable with proper acquire/release memory ordering. The mutex is never referenced in the implementation and should be deleted.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/VaREngine.h` around lines 62 - 65, Remove the unused mutex member
m_resultsMutex from the VaREngine class: it is dead code because the
double-buffering uses the atomic m_activeBuffer with acquire/release semantics
to synchronize access to m_results[2]; delete the declaration of std::mutex
m_resultsMutex and any unused includes or references tied to it, and ensure no
methods (getLatestResult, isVaRBreached, getCurrentVaR95Pct, getCurrentVaR99Pct,
toJson, calculationLoop) reference m_resultsMutex to avoid unused-symbol or
include warnings.
core/risk/VaREngine.cpp-317-333 (1)

317-333: ⚠️ Potential issue | 🟡 Minor

Comment/code mismatch: says "population" stddev, implements sample stddev (N−1).

Line 318 says "population" but line 332 divides by data.size() - 1, which is Bessel-corrected sample standard deviation. For VaR estimation from a sample, N−1 is typically correct — update the comment.

Proposed fix
-// Helper: standard deviation (population)
+// Helper: standard deviation (sample, Bessel-corrected)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/VaREngine.cpp` around lines 317 - 333, The comment above
VaREngine::calculateStdDev is incorrect: it says "population" but the
implementation uses the sample (Bessel-corrected) standard deviation dividing by
data.size() - 1; update the comment to state "sample standard deviation (N-1)"
or otherwise make the wording consistent with the implementation so readers of
calculateStdDev see it computes the sample stddev using N-1.
tests/unit/VaREngineTests.cpp-164-168 (1)

164-168: ⚠️ Potential issue | 🟡 Minor

ToJson key assertion is overly permissive and may mask regressions.

The test accepts three different key names (historicalVaR95, historical_var_95, var_95). The actual implementation in VaREngine::toJson() uses "historical_var_95". Pin the assertion to the real key so a rename is caught as a breaking change.

Proposed fix
-  EXPECT_TRUE(j.contains("historicalVaR95") ||
-              j.contains("historical_var_95") || j.contains("var_95"));
+  EXPECT_TRUE(j.contains("historical_var_95"));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/VaREngineTests.cpp` around lines 164 - 168, The test's ToJson key
assertion is too permissive; update the assertion in VaREngineTests (the
EXPECT_TRUE block checking j.contains(...)) to require the exact key used by
VaREngine::toJson(), "historical_var_95", instead of allowing "historicalVaR95"
or "var_95"; keep the subsequent EXPECT_FALSE(j.dump().empty()) check unchanged
so serialization is still validated.
tests/unit/VaREngineTests.cpp-86-93 (1)

86-93: ⚠️ Potential issue | 🟡 Minor

Test silently passes when no calculation has run.

The if (result.sampleCount > 0) guard means that if the background thread doesn't complete a cycle within 300 ms (e.g., under CI load), the EXPECT_NEAR is skipped entirely and the test reports success with no actual assertion. The same pattern applies to ParametricVaR (line 110).

Consider calling calculateAll() directly (or exposing a blocking trigger) instead of relying on sleep timing, or use ASSERT_GT(result.sampleCount, 0u) before the value check so the test fails visibly rather than silently skipping the assertion.

Proposed fix
-  if (result.sampleCount > 0) {
-    EXPECT_NEAR(std::abs(result.historicalVaR95), 0.01645, 0.005);
-  }
+  ASSERT_GT(result.sampleCount, 0u) << "Background calculation did not run in time";
+  EXPECT_NEAR(std::abs(result.historicalVaR95), 0.01645, 0.005);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/VaREngineTests.cpp` around lines 86 - 93, Test currently can
silently pass because the `if (result.sampleCount > 0)` guard skips assertions
when no background calculation finished; replace that non-failing guard with a
hard check so the test fails visibly or trigger the calculation directly.
Specifically, for the block using `engine.getLatestResult()` and
`result.sampleCount`, either call the synchronous trigger `calculateAll()` (or
expose/invoke the blocking trigger used by the engine) before fetching
`getLatestResult()`, or replace `if (result.sampleCount > 0)` with
`ASSERT_GT(result.sampleCount, 0u)` followed by the `EXPECT_NEAR` so the test
fails when no samples exist; apply the same change to the `ParametricVaR` test
at the analogous check.
docs/DISASTER_RECOVERY.md-324-342 (1)

324-342: ⚠️ Potential issue | 🟡 Minor

Fenced code block missing language specifier.

The code block starting at line 324 has no language tag. Adding a specifier (e.g., ```text) satisfies markdownlint MD040 and enables syntax highlighting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/DISASTER_RECOVERY.md` around lines 324 - 342, The fenced code block
containing the event patterns (starting with "# Circuit breaker events" and
lines like "circuit.*breaker.*tripped", "recoverState.*Success",
"daily.*loss.*limit", "exchange.*disconnect") lacks a language specifier; update
the opening fence from ``` to a labeled fence such as ```text (or ```yaml/```ini
if preferred) so markdownlint MD040 is satisfied and syntax highlighting is
enabled for that block.
docs/DISASTER_RECOVERY.md-278-282 (1)

278-282: ⚠️ Potential issue | 🟡 Minor

Readiness probe description overstates what the implementation actually checks.

The doc states the readiness check verifies "recovered state, connected to the exchange, and is ready to accept traffic." The actual implementation (WebServer.cpp lines 1140-1142) only checks !RiskManager.isHalted() && CircuitBreaker.isTradingAllowed(). Exchange connectivity and recovery status are not verified. Update the description to match reality, or enhance the probe.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/DISASTER_RECOVERY.md` around lines 278 - 282, The readiness description
is inaccurate: the GET /api/ready endpoint (documented in DISASTER_RECOVERY.md)
currently only checks RiskManager.isHalted() and
CircuitBreaker.isTradingAllowed() in WebServer.cpp, not exchange connectivity or
recovery state; update the doc to reflect the actual checks (e.g., "returns 200
when the process is not halted and trading is allowed (uses
RiskManager.isHalted() and CircuitBreaker.isTradingAllowed())") or alternatively
modify the implementation in WebServer.cpp to include additional checks (add
explicit Exchange.isConnected() and StateRecovery.isRecovered() checks alongside
RiskManager.isHalted() and CircuitBreaker.isTradingAllowed()) so the description
matches behavior.
docs/DISASTER_RECOVERY.md-233-236 (1)

233-236: ⚠️ Potential issue | 🟡 Minor

Endpoint path mismatch: doc says /api/risk/status but the code implements /api/risk/state.

Line 235 references http://localhost:8081/api/risk/status, but WebServer.cpp (line 870) routes /api/risk/state. This will 404 if followed literally.

Proposed fix
-  curl -s http://localhost:8081/api/risk/status | python3 -m json.tool
+  curl -s http://localhost:8081/api/risk/state | python3 -m json.tool
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/DISASTER_RECOVERY.md` around lines 233 - 236, The docs call the endpoint
/api/risk/status but the server registers /api/risk/state (see the route handler
for GET "/api/risk/state" in WebServer.cpp), causing a 404; fix by updating the
snippet in DISASTER_RECOVERY.md to use http://localhost:8081/api/risk/state (or,
if you prefer the status name, rename the route in WebServer.cpp to
"/api/risk/status" and adjust its handler accordingly) so the documentation and
the WebServer route (GET "/api/risk/state") match.
visualization/WebServer.cpp-1127-1138 (1)

1127-1138: ⚠️ Potential issue | 🟡 Minor

uptime_ms is not actual process uptime — it's steady_clock epoch millis.

getCurrentMillis() returns the time since the steady_clock epoch (typically system boot), not the duration since the process started. For a meaningful uptime field, subtract the process start time.

Sketch

Store the start time once (e.g., in RestAPIServer::start()) and subtract:

+// In RestAPIServer or a global
+static const uint64_t s_startMillis = utils::TimeUtils::getCurrentMillis();

 http::response<http::string_body> RestAPIServer::handleGetHealth() {
   json health = {{"status", "healthy"},
                  {"timestamp", utils::TimeUtils::getCurrentNanos()},
-                 {"uptime_ms", utils::TimeUtils::getCurrentMillis()}};
+                 {"uptime_ms", utils::TimeUtils::getCurrentMillis() - s_startMillis}};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@visualization/WebServer.cpp` around lines 1127 - 1138, handleGetHealth
currently uses utils::TimeUtils::getCurrentMillis() which is steady_clock epoch
millis, not process uptime; fix by recording the process start time when
RestAPIServer::start() runs (add a member like start_time_nanos or start_time_ms
on RestAPIServer and set it in start()), then change
RestAPIServer::handleGetHealth to compute uptime_ms as
utils::TimeUtils::getCurrentMillis() (or getCurrentNanos()) minus the stored
start time and emit that value for "uptime_ms" while keeping other fields
(status, timestamp) unchanged; ensure types/units match (nanos vs millis) and
adjust subtraction accordingly.
docs/KUBERNETES_DEPLOYMENT.md-227-234 (1)

227-234: ⚠️ Potential issue | 🟡 Minor

Fenced code block missing a language specifier.

Static analysis (markdownlint MD040) flags this block. Adding a language identifier (e.g., text or bash) silences the warning and improves rendering.

Proposed fix
-```
+```text
 # Risk events
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/KUBERNETES_DEPLOYMENT.md` around lines 227 - 234, The fenced code block
under the "# Risk events" section in KUBERNETES_DEPLOYMENT.md lacks a language
specifier, triggering markdownlint MD040; update the triple-backtick opener for
that block (the block containing "# Risk events", "REJECTED_", "CLOSED -> OPEN",
"Manual trip", "EMERGENCY SAVE", "halt") to include a language identifier such
as text (e.g., change ``` to ```text) so the linter warning is silenced and
rendering is improved.
docs/KUBERNETES_DEPLOYMENT.md-139-143 (1)

139-143: ⚠️ Potential issue | 🟡 Minor

Health probe configuration values don't match deployment.yaml.

The documentation table lists liveness as initialDelaySeconds: 30, periodSeconds: 10 and readiness as initialDelaySeconds: 10, periodSeconds: 5, but the actual manifest (deploy/k8s/deployment.yaml Lines 49-64) specifies the opposite: liveness initialDelaySeconds: 15, periodSeconds: 30 and readiness initialDelaySeconds: 5, periodSeconds: 10.

Also, the endpoint column conflates the port with the path (e.g., GET /api/health:8081).

Proposed fix
-| **Liveness** | `GET /api/health:8081` | `initialDelaySeconds: 30`, `periodSeconds: 10` |
-| **Readiness** | `GET /api/ready:8081` | `initialDelaySeconds: 10`, `periodSeconds: 5` |
+| **Liveness** | `GET /api/health` (port 8081) | `initialDelaySeconds: 15`, `periodSeconds: 30` |
+| **Readiness** | `GET /api/ready` (port 8081) | `initialDelaySeconds: 5`, `periodSeconds: 10` |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/KUBERNETES_DEPLOYMENT.md` around lines 139 - 143, The table in
docs/KUBERNETES_DEPLOYMENT.md is inconsistent with the manifest: update the
Liveness and Readiness rows so the health probe names and their
initialDelaySeconds/periodSeconds match the manifest values (set Liveness
initialDelaySeconds: 15, periodSeconds: 30 and Readiness initialDelaySeconds: 5,
periodSeconds: 10) and fix the Endpoint column to separate path and port (use
`GET /api/health` and `GET /api/ready` in the Endpoint column and add a separate
Port column or remove the :8081 suffix) so the docs correctly reflect the
liveness/readiness probes and ports used by the deployment.
core/risk/DisasterRecovery.cpp-535-589 (1)

535-589: ⚠️ Potential issue | 🟡 Minor

emergencySave does not verify write success after flush().

Lines 557-558 and 572-573 write and flush but never check ofs.good() or ofs.fail(). A disk-full or I/O error would silently produce a truncated file, and success would remain true. This is especially important for the emergency-save path since it's the last chance to persist state before shutdown.

Proposed fix
         ofs << riskState.dump(2);
         ofs.flush();
+        if (!ofs.good()) {
+          spdlog::error("DisasterRecovery: emergency risk state write error");
+          success = false;
+        }

Apply the same pattern for the strategy state write block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/DisasterRecovery.cpp` around lines 535 - 589, The emergencySave
function writes and flushes both risk and strategy state but never verifies the
ofstream status afterward, so I/O errors (disk full, truncated writes) can be
missed; in DisasterRecovery::emergencySave, after each ofs.flush() (the risk
state write around getRiskStatePath() and the strategy state write around
getStrategyStatePath()), check ofs.fail() or !ofs.good() and if true set success
= false and log an spdlog::error with the path and errno/strerror detail (or
e.what() if available) so the function accurately reports failure; apply this
same verification pattern to both write blocks and ensure success is only true
when both writes report good() afterwards.

Comment thread core/risk/AlertManager.cpp
Comment thread core/risk/CircuitBreaker.h
Comment thread core/risk/DisasterRecovery.cpp Outdated
Comment thread core/risk/RiskManager.cpp
Comment thread core/risk/RiskManager.cpp
Comment thread Dockerfile
Comment thread strategies/config/StrategyConfig.h Outdated
Comment thread tests/unit/AlertManagerTests.cpp
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 9

🧹 Nitpick comments (3)
Dockerfile (1)

125-127: /app/data/backups is ephemeral without a VOLUME declaration or mounted persistent volume.

The DisasterRecovery component writes backups to /app/data/backups. Without a VOLUME instruction (or a Kubernetes PersistentVolumeClaim mount), these backups are discarded on every container restart, which defeats the purpose of disaster recovery. The same applies to /app/logs if log persistence across restarts is required.

📦 Suggested VOLUME declarations
 RUN mkdir -p /app/logs /app/data /app/data/backups
+
+VOLUME ["/app/logs", "/app/data"]

Consider running the application as a non-root user.

The runtime stage has no USER directive, so the market-maker process runs as root inside the container. This is a security risk, especially for a production financial application deployed to Kubernetes (K8s manifests are also introduced in this PR). Adding a dedicated non-root user is straightforward:

🔒 Suggested non-root user setup
 WORKDIR /app

+RUN groupadd -r pinnaclemm && useradd -r -g pinnaclemm pinnaclemm

 COPY --from=builder /app/build/pinnaclemm /app/
 COPY --from=builder /app/config/ /app/config/

 RUN mkdir -p /app/logs /app/data /app/data/backups
+RUN chown -R pinnaclemm:pinnaclemm /app

+USER pinnaclemm
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Dockerfile` around lines 125 - 127, The Dockerfile creates /app/logs and
/app/data/backups but does not declare them as persistent volumes and also does
not drop root; add VOLUME declarations for "/app/data/backups" (and "/app/logs"
if you need log persistence) so backups written by the DisasterRecovery
component are persisted, and add a non-root user (via a dedicated user creation
and a USER instruction) in the runtime stage so the market-maker process does
not run as root; ensure ownership of the volume paths is transferred to that
user (chown) before switching to USER.
core/risk/DisasterRecovery.cpp (1)

1-8: Remove unused #include "RiskManager.h" at line 3.

The RiskManager symbol is never referenced in this translation unit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/DisasterRecovery.cpp` around lines 1 - 8, The file includes an
unused header "RiskManager.h" (the RiskManager symbol is never referenced in
DisasterRecovery.cpp); remove the line `#include "RiskManager.h"` from
DisasterRecovery.cpp to eliminate the unnecessary dependency and recompile to
ensure no missing symbols.
core/risk/RiskManager.cpp (1)

98-174: Lock-free reads of m_limits in checkOrder() are technically undefined behavior (data race with updateLimits()).

m_limits is protected by m_stateMutex on write (updateLimits, initialize) but is read throughout checkOrder() without any lock (lines 101, 110–111, 121, 130, 138, 151, 165–167). Concurrent mutation via updateLimits() creates a data race per the C++ memory model, which is UB regardless of the underlying platform's atomicity guarantees for double.

The comment at lines 98–100 documents the trade-off for the rate-limit field only. Consider replacing m_limits with a std::atomic<std::shared_ptr<RiskLimits>> (C++20) or a seqlock, or at minimum extending the "best-effort" comment to cover all limit fields in the hot path so the trade-off is explicit.

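Sketch of the snapshot approach (assumes C++20's std::atomic<std::shared_ptr<T>>; the field names inside RiskLimits below are illustrative, not the project's actual members):

#include <atomic>
#include <cmath>
#include <memory>

struct RiskLimits {
  double maxOrderSize = 0.0;
  double maxPosition = 0.0;
  // ... remaining limit fields
};

class RiskManagerSketch {
public:
  void updateLimits(const RiskLimits& newLimits) {
    // Writers publish a fresh immutable snapshot; readers never observe a torn value.
    m_limits.store(std::make_shared<const RiskLimits>(newLimits),
                   std::memory_order_release);
  }

  bool checkOrder(double orderSize, double resultingPosition) const {
    // A single atomic load gives the hot path a consistent view of all limits.
    const auto limits = m_limits.load(std::memory_order_acquire);
    return orderSize <= limits->maxOrderSize &&
           std::abs(resultingPosition) <= limits->maxPosition;
  }

private:
  std::atomic<std::shared_ptr<const RiskLimits>> m_limits{
      std::make_shared<const RiskLimits>()};
};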
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/RiskManager.cpp` around lines 98 - 174, The checkOrder() function
currently performs lock-free reads of m_limits which is a data race; change
m_limits to an atomic shared pointer (e.g.,
std::atomic<std::shared_ptr<RiskLimits>>) and update initialize() and
updateLimits() to atomically store a new shared_ptr, then at the start of
checkOrder() do a single atomic load of the shared_ptr into a local const auto
limits = ... and replace all direct m_limits.X accesses with limits->X so
checkOrder() reads a consistent snapshot without holding m_stateMutex;
alternatively implement a seqlock around m_limits but be sure
updateLimits()/initialize() write under the writer path and checkOrder() reads
the snapshot atomically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@core/risk/DisasterRecovery.cpp`:
- Around line 169-171: The current fixed absolute epsilon (epsilon = 1e-8) used
to set result.positionsMatch by comparing std::abs(result.discrepancy) < epsilon
is too fragile across position magnitudes; change the comparison in
DisasterRecovery.cpp to use a magnitude-aware tolerance instead (e.g., compute a
relative tolerance based on the larger of the two position magnitudes or a
configurable per-symbol threshold, with a small absolute floor to avoid zero):
calculate tolerance = max(relativeTolerance * max(abs(expectedPosition),
abs(actualPosition)), absoluteFloor) or fetch a symbol-specific tolerance, then
set result.positionsMatch = std::abs(result.discrepancy) < tolerance; keep
references to result.discrepancy and result.positionsMatch and add configuration
lookup or parameter for per-symbol thresholds if available.
- Around line 27-46: The initialize method in DisasterRecovery currently assigns
m_backupDirectory before attempting to create it, so on failure the object
appears configured; modify DisasterRecovery::initialize to either clear
m_backupDirectory (set to empty) or set a new member flag m_initialized=false
before returning from the catch block, and set m_initialized=true only after
successful directory creation and logging; additionally add checks at the start
of public operations such as DisasterRecovery::saveRiskState and
DisasterRecovery::createBackup to verify m_initialized (or non-empty
m_backupDirectory) and return/throw an error if not initialized. Ensure
references to m_backupDirectory and the new m_initialized flag are updated
throughout the class to prevent using an invalid path.
- Around line 246-251: The metadata write block that opens
std::ofstream(backupPath + "/backup_meta.json") silently ignores open/write
failures; if ofs.is_open() is false the function still returns success causing
listBackups() to mark the backup invalid. Update the write logic around the
ofstream in DisasterRecovery.cpp so that when opening the file fails or
writing/meta.dump() results in stream errors you log a clear error (including
backupPath) and return false (or propagate an error) instead of proceeding; also
check ofs.fail()/ofs.good() after the write and flush to ensure the write
succeeded before logging success or returning true. Ensure references to
backup_meta.json, ofs, backupPath and listBackups() are used to locate and
update the correct code paths.
- Around line 51-98: The saveRiskState implementation can leave on-disk state
inconsistent and orphaned .tmp files; modify saveRiskState so it writes both
temp files first, then perform the renames in a guarded two-phase step with
cleanup and rollback: write riskTmpPath and strategyTmpPath (as you already do),
then before renaming create backups of existing riskPath/strategyPath (e.g.,
riskPath + ".bak", strategyPath + ".bak") and rename originals to backups, then
rename both tmp files to the final paths, and if any rename fails catch the
exception and restore from backups (or remove newly moved files) and always
remove any leftover .tmp files and .bak files in all error paths; ensure the
bfs::rename calls and exception handlers reference
getRiskStatePath/getStrategyStatePath, riskTmpPath/strategyTmpPath, and perform
directory fsync / flush where available to reduce window, and log errors via
spdlog and return false on failure.
- Around line 379-400: The deleteBackup/createBackup/restoreBackup flow uses
getBackupPath(label) which joins m_backupDirectory with an attacker-controlled
label; fix by validating/sanitizing label before any filesystem ops: either
reject labels containing '/', '\\' or any ".." segments (use
InputValidator::validateString with PATH_TRAVERSAL_PATTERNS) OR compute the
absolute/lexically_normalized path from getBackupPath(label) and assert it has
m_backupDirectory as a prefix (compare normalized paths) before calling
bfs::remove_all or bfs::copy_file; apply the same check in deleteBackup,
createBackup and restoreBackup to prevent escaping the backup directory.

In `@core/risk/RiskManager.cpp`:
- Around line 248-250: m_dailyPnL is being set equal to m_totalPnL in
onPnLUpdate which makes daily loss tracking wrong after reset; instead record a
snapshot of total PnL at day start (e.g. add m_totalPnLAtDayStart or
m_totalPnLSnapshot, updated in resetDaily()) and change onPnLUpdate to store
newPnL into m_totalPnL then compute m_dailyPnL = newPnL - m_totalPnLAtDayStart
(using the same atomic memory order semantics as current stores) so auto-halt
and pre-trade checks compare a true day-relative P&L.
- Around line 504-526: fromJson currently can throw nlohmann::json::type_error
on type-mismatched fields and uses an unsigned int literal for
orders_this_second; wrap the body of RiskManager::fromJson in a try/catch that
catches std::exception (or nlohmann::json::type_error) and logs the error via
spdlog::error, then restore safe defaults for the atomics (m_position,
m_totalPnL, m_peakPnL, m_dailyPnL, m_dailyVolume, m_netExposure,
m_grossExposure, m_halted, m_ordersThisSecond, m_currentSecond) and guarded
members (m_haltReason, m_dailyResetTime) to ensure initialization continues;
also change the default for "orders_this_second" to an explicit uint32_t{0} and
for "current_second"/"daily_reset_time" to uint64_t{0} to avoid implicit
narrowing.
- Around line 549-568: checkDailyReset() currently reads m_dailyResetTime under
m_stateMutex then releases the lock before calling resetDaily(), allowing two
threads to both pass the check and perform resets; fix by claiming the reset
slot while holding m_stateMutex: inside RiskManager::checkDailyReset(), after
computing todayMs, acquire the lock, if m_dailyResetTime < todayMs set
m_dailyResetTime = todayMs (or now) and then perform the daily-zeroing while
still holding the mutex (either call resetDaily() while holding m_stateMutex or
inline the resets to m_dailyPnL, m_dailyVolume, m_ordersThisSecond), then
release the lock so no other thread can race to zero counters for the same day.
- Around line 364-384: hedgeLoop currently blocks in std::this_thread::sleep_for
for up to m_limits.hedgeIntervalMs and cannot be interrupted by clearing
m_hedgeRunning; replace that sleep with an interruptible wait using a
std::condition_variable (e.g. m_hedgeCv) and an associated mutex (m_hedgeMutex):
wait_for on m_hedgeCv for intervalMs with a predicate that checks
!m_hedgeRunning (or simply check return value) so the loop wakes immediately
when shutdown is requested; add m_hedgeCv and m_hedgeMutex to the RiskManager
header alongside m_hedgeRunning, and ensure the destructor / initialize path
that clears m_hedgeRunning calls m_hedgeCv.notify_all() (before joining the
thread) so the waiting hedgeLoop exits and join() returns promptly, then
continue to call evaluateHedge() only when still running.

---

Duplicate comments:
In `@Dockerfile`:
- Around line 131-133: Remove the redundant "|| exit 1" from the HEALTHCHECK
command: the HEALTHCHECK line using curl -f already returns a non-zero exit code
on HTTP errors, so edit the HEALTHCHECK declaration (the line containing
"HEALTHCHECK --interval=30s ... CMD curl -f http://localhost:8081/api/health ||
exit 1") to call curl -f without the trailing "|| exit 1".

---

Nitpick comments:
In `@core/risk/DisasterRecovery.cpp`:
- Around line 1-8: The file includes an unused header "RiskManager.h" (the
RiskManager symbol is never referenced in DisasterRecovery.cpp); remove the line
`#include "RiskManager.h"` from DisasterRecovery.cpp to eliminate the unnecessary
dependency and recompile to ensure no missing symbols.

In `@core/risk/RiskManager.cpp`:
- Around line 98-174: The checkOrder() function currently performs lock-free
reads of m_limits which is a data race; change m_limits to an atomic shared
pointer (e.g., std::atomic<std::shared_ptr<RiskLimits>>) and update initialize()
and updateLimits() to atomically store a new shared_ptr, then at the start of
checkOrder() do a single atomic load of the shared_ptr into a local const auto
limits = ... and replace all direct m_limits.X accesses with limits->X so
checkOrder() reads a consistent snapshot without holding m_stateMutex;
alternatively implement a seqlock around m_limits but be sure
updateLimits()/initialize() write under the writer path and checkOrder() reads
the snapshot atomically.

In `@Dockerfile`:
- Around line 125-127: The Dockerfile creates /app/logs and /app/data/backups
but does not declare them as persistent volumes and also does not drop root; add
VOLUME declarations for "/app/data/backups" (and "/app/logs" if you need log
persistence) so backups written by the DisasterRecovery component are persisted,
and add a non-root user (via a dedicated user creation and a USER instruction)
in the runtime stage so the market-maker process does not run as root; ensure
ownership of the volume paths is transferred to that user (chown) before
switching to USER.

Comment on lines +27 to +46
void DisasterRecovery::initialize(const std::string& backupDirectory) {
std::lock_guard<std::mutex> lock(m_mutex);
m_backupDirectory = backupDirectory;

try {
if (!bfs::exists(m_backupDirectory)) {
bfs::create_directories(m_backupDirectory);
spdlog::info("DisasterRecovery: created backup directory: {}",
m_backupDirectory);
}
} catch (const bfs::filesystem_error& e) {
spdlog::error("DisasterRecovery: failed to create backup directory {}: {}",
m_backupDirectory, e.what());
return;
}

spdlog::info("DisasterRecovery initialized - backupDir={}",
m_backupDirectory);
AUDIT_SYSTEM_EVENT("DisasterRecovery initialized", true);
}

⚠️ Potential issue | 🟡 Minor

m_backupDirectory is set even when directory creation fails.

On the error path (line 40 return), m_backupDirectory has already been assigned at line 29. Every subsequent call (saveRiskState, createBackup, etc.) will silently proceed against a non-existent directory, turning what should be an init-time failure into a series of cryptic runtime errors.

Consider resetting m_backupDirectory to empty (or throwing) before returning, and guarding all operations with an m_initialized flag:

🛡️ Proposed fix
 void DisasterRecovery::initialize(const std::string& backupDirectory) {
   std::lock_guard<std::mutex> lock(m_mutex);
-  m_backupDirectory = backupDirectory;

   try {
+    m_backupDirectory = backupDirectory;
     if (!bfs::exists(m_backupDirectory)) {
       bfs::create_directories(m_backupDirectory);
       spdlog::info("DisasterRecovery: created backup directory: {}",
                    m_backupDirectory);
     }
   } catch (const bfs::filesystem_error& e) {
     spdlog::error("DisasterRecovery: failed to create backup directory {}: {}",
-                  m_backupDirectory, e.what());
+                  backupDirectory, e.what());
+    m_backupDirectory.clear();   // leave object in a known unusable state
     return;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/DisasterRecovery.cpp` around lines 27 - 46, The initialize method
in DisasterRecovery currently assigns m_backupDirectory before attempting to
create it, so on failure the object appears configured; modify
DisasterRecovery::initialize to either clear m_backupDirectory (set to empty) or
set a new member flag m_initialized=false before returning from the catch block,
and set m_initialized=true only after successful directory creation and logging;
additionally add checks at the start of public operations such as
DisasterRecovery::saveRiskState and DisasterRecovery::createBackup to verify
m_initialized (or non-empty m_backupDirectory) and return/throw an error if not
initialized. Ensure references to m_backupDirectory and the new m_initialized
flag are updated throughout the class to prevent using an invalid path.

Comment on lines +51 to +98
bool DisasterRecovery::saveRiskState(const nlohmann::json& riskState,
const nlohmann::json& strategyState) {
std::lock_guard<std::mutex> lock(m_mutex);

try {
// Atomic write for risk state: write to temp file, then rename
std::string riskPath = getRiskStatePath();
std::string riskTmpPath = riskPath + ".tmp";

{
std::ofstream ofs(riskTmpPath, std::ios::trunc);
if (!ofs.is_open()) {
spdlog::error("DisasterRecovery: failed to open tmp file for risk "
"state: {}",
riskTmpPath);
return false;
}
ofs << riskState.dump(2);
ofs.flush();
}

bfs::rename(riskTmpPath, riskPath);

// Atomic write for strategy state
std::string strategyPath = getStrategyStatePath();
std::string strategyTmpPath = strategyPath + ".tmp";

{
std::ofstream ofs(strategyTmpPath, std::ios::trunc);
if (!ofs.is_open()) {
spdlog::error("DisasterRecovery: failed to open tmp file for strategy "
"state: {}",
strategyTmpPath);
return false;
}
ofs << strategyState.dump(2);
ofs.flush();
}

bfs::rename(strategyTmpPath, strategyPath);

spdlog::debug("DisasterRecovery: saved risk and strategy state");
return true;
} catch (const std::exception& e) {
spdlog::error("DisasterRecovery: failed to save risk state: {}", e.what());
return false;
}
}

⚠️ Potential issue | 🟠 Major

Two-file save is not atomic: state divergence and orphaned .tmp on partial failure.

Two separate issues:

  1. State divergence: if the risk-state rename succeeds (line 72) but the strategy-state write then fails (lines 80–84 or line 90 throws), the on-disk files are left inconsistent — risk state reflects the new snapshot, strategy state the old one. A restart will load mismatched states.

  2. Orphaned .tmp: on every error path, the partially-written strategy_state.json.tmp (and potentially risk_state.json.tmp) is never cleaned up.

🛡️ Proposed fix
+    // Stage both files before committing either (minimises divergence window)
     std::string riskPath = getRiskStatePath();
     std::string riskTmpPath = riskPath + ".tmp";
+    std::string strategyPath = getStrategyStatePath();
+    std::string strategyTmpPath = strategyPath + ".tmp";
+
+    // Helper lambda to remove tmp files on failure
+    auto cleanupTmps = [&]() {
+      bfs::remove(bfs::path(riskTmpPath));
+      bfs::remove(bfs::path(strategyTmpPath));
+    };

     {
       std::ofstream ofs(riskTmpPath, std::ios::trunc);
       if (!ofs.is_open()) {
         spdlog::error(...);
+        cleanupTmps();
         return false;
       }
       ofs << riskState.dump(2);
       ofs.flush();
     }

-    bfs::rename(riskTmpPath, riskPath);

-    std::string strategyPath = getStrategyStatePath();
-    std::string strategyTmpPath = strategyPath + ".tmp";

     {
       std::ofstream ofs(strategyTmpPath, std::ios::trunc);
       if (!ofs.is_open()) {
         spdlog::error(...);
+        cleanupTmps();
         return false;
       }
       ofs << strategyState.dump(2);
       ofs.flush();
     }

+    // Both tmp files written successfully — commit atomically
+    bfs::rename(riskTmpPath, riskPath);
     bfs::rename(strategyTmpPath, strategyPath);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/DisasterRecovery.cpp` around lines 51 - 98, The saveRiskState
implementation can leave on-disk state inconsistent and orphaned .tmp files;
modify saveRiskState so it writes both temp files first, then perform the
renames in a guarded two-phase step with cleanup and rollback: write riskTmpPath
and strategyTmpPath (as you already do), then before renaming create backups of
existing riskPath/strategyPath (e.g., riskPath + ".bak", strategyPath + ".bak")
and rename originals to backups, then rename both tmp files to the final paths,
and if any rename fails catch the exception and restore from backups (or remove
newly moved files) and always remove any leftover .tmp files and .bak files in
all error paths; ensure the bfs::rename calls and exception handlers reference
getRiskStatePath/getStrategyStatePath, riskTmpPath/strategyTmpPath, and perform
directory fsync / flush where available to reduce window, and log errors via
spdlog and return false on failure.

Comment on lines +169 to +171
// Use a small epsilon for floating-point comparison
constexpr double epsilon = 1e-8;
result.positionsMatch = std::abs(result.discrepancy) < epsilon;

⚠️ Potential issue | 🟠 Major

Fixed absolute epsilon 1e-8 is fragile across position magnitudes.

For large notional positions (e.g. 10,000,000 USD lots expressed as a double), normal IEEE-754 rounding can produce discrepancies exceeding 1e-8, causing spurious mismatch alerts. Conversely, for some instruments a discrepancy of 1e-8 units is a genuine error worth flagging. A single hardcoded absolute threshold does not generalize across asset classes.

Consider a relative epsilon or a configurable per-symbol threshold:

🛡️ Suggested approach
-  constexpr double epsilon = 1e-8;
-  result.positionsMatch = std::abs(result.discrepancy) < epsilon;
+  // Use a relative epsilon: allow up to 1e-9 relative error, with an
+  // absolute floor of 1e-12 to handle near-zero positions.
+  const double scale = std::max(std::abs(localPosition),
+                                std::abs(exchangePosition));
+  const double threshold = std::max(1e-12, scale * 1e-9);
+  result.positionsMatch = std::abs(result.discrepancy) < threshold;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/DisasterRecovery.cpp` around lines 169 - 171, The current fixed
absolute epsilon (epsilon = 1e-8) used to set result.positionsMatch by comparing
std::abs(result.discrepancy) < epsilon is too fragile across position
magnitudes; change the comparison in DisasterRecovery.cpp to use a
magnitude-aware tolerance instead (e.g., compute a relative tolerance based on
the larger of the two position magnitudes or a configurable per-symbol
threshold, with a small absolute floor to avoid zero): calculate tolerance =
max(relativeTolerance * max(abs(expectedPosition), abs(actualPosition)),
absoluteFloor) or fetch a symbol-specific tolerance, then set
result.positionsMatch = std::abs(result.discrepancy) < tolerance; keep
references to result.discrepancy and result.positionsMatch and add configuration
lookup or parameter for per-symbol thresholds if available.

Comment on lines +246 to +251
{
std::ofstream ofs(backupPath + "/backup_meta.json");
if (ofs.is_open()) {
ofs << meta.dump(2);
}
}

⚠️ Potential issue | 🟡 Minor

Silent metadata write failure leaves backup marked invalid in listBackups().

If ofs.is_open() is false at line 248, the metadata file is never written but the function still returns true and logs success. listBackups() will subsequently mark this backup as info.valid = false (line 356) because it cannot read the metadata, while the backup itself is structurally complete. The caller has no indication anything went wrong.

🛡️ Proposed fix
     {
       std::ofstream ofs(backupPath + "/backup_meta.json");
-      if (ofs.is_open()) {
-        ofs << meta.dump(2);
-      }
+      if (!ofs.is_open()) {
+        spdlog::warn("DisasterRecovery: failed to write metadata for backup '{}'",
+                     label);
+      } else {
+        ofs << meta.dump(2);
+      }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/DisasterRecovery.cpp` around lines 246 - 251, The metadata write
block that opens std::ofstream(backupPath + "/backup_meta.json") silently
ignores open/write failures; if ofs.is_open() is false the function still
returns success causing listBackups() to mark the backup invalid. Update the
write logic around the ofstream in DisasterRecovery.cpp so that when opening the
file fails or writing/meta.dump() results in stream errors you log a clear error
(including backupPath) and return false (or propagate an error) instead of
proceeding; also check ofs.fail()/ofs.good() after the write and flush to ensure
the write succeeded before logging success or returning true. Ensure references
to backup_meta.json, ofs, backupPath and listBackups() are used to locate and
update the correct code paths.

Comment on lines +379 to +400
bool DisasterRecovery::deleteBackup(const std::string& label) {
std::lock_guard<std::mutex> lock(m_mutex);

try {
std::string backupPath = getBackupPath(label);

if (!bfs::exists(backupPath)) {
spdlog::warn("DisasterRecovery: backup '{}' does not exist", label);
return false;
}

bfs::remove_all(backupPath);

spdlog::info("DisasterRecovery: backup '{}' deleted", label);
AUDIT_SYSTEM_EVENT("Backup deleted: " + label, true);
return true;
} catch (const std::exception& e) {
spdlog::error("DisasterRecovery: failed to delete backup '{}': {}", label,
e.what());
return false;
}
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and examine the DisasterRecovery.cpp file
fd -e cpp -name "*DisasterRecovery*" | head -20

Repository: chizy7/PinnacleMM

Length of output: 291


🏁 Script executed:

# Search for getBackupPath implementation
rg -n "getBackupPath" --type=cpp -A 3 -B 1

Repository: chizy7/PinnacleMM

Length of output: 1811


🏁 Script executed:

# Check all backup-related functions and how they use labels
rg -n "createBackup|restoreBackup|deleteBackup|listBackups" --type=cpp -A 5 -B 1

Repository: chizy7/PinnacleMM

Length of output: 5434


🏁 Script executed:

# Check for REST API endpoints related to backups
rg -n "backup|Backup" --type=cpp -i | grep -i "route\|endpoint\|handler\|post\|get\|rest" | head -30

Repository: chizy7/PinnacleMM

Length of output: 1992


🏁 Script executed:

# Look for any existing validation of label parameter
rg -n "label.*find|label.*empty|label.*valid" --type=cpp -i

Repository: chizy7/PinnacleMM

Length of output: 43


🏁 Script executed:

# Search for REST API server setup, routes, endpoints
rg -i "route|handler|endpoint|server|http|rest|api" --type=cpp | grep -i "backup\|createBackup\|restoreBackup\|deleteBackup" | head -30

Repository: chizy7/PinnacleMM

Length of output: 1293


🏁 Script executed:

# Look for any Web framework or HTTP server code
fd -e cpp -e h | xargs grep -l "http\|request\|route\|handler" | head -20

Repository: chizy7/PinnacleMM

Length of output: 671


🏁 Script executed:

# Search for all occurrences of createBackup, restoreBackup, deleteBackup to find all call sites
rg "createBackup|restoreBackup|deleteBackup" --type=cpp

Repository: chizy7/PinnacleMM

Length of output: 973


🏁 Script executed:

# Check what m_backupDirectory is set to and if there's path canonicalization
rg -n "m_backupDirectory" --type=cpp -B 2 -A 2

Repository: chizy7/PinnacleMM

Length of output: 5073


🏁 Script executed:

# Verify the actual behavior of Boost.Filesystem path concatenation with absolute paths
cat > /tmp/test_bfs.cpp << 'EOF'
#include <boost/filesystem.hpp>
#include <iostream>

namespace bfs = boost::filesystem;

int main() {
    // Test 1: Absolute path on right side
    bfs::path base("/data/backups");
    bfs::path abs_right("/etc/passwd");
    std::cout << "Test 1 - base / absolute:\n";
    std::cout << "  Result: " << (base / abs_right).string() << "\n";
    
    // Test 2: Relative path with ..
    bfs::path rel_right("../../etc/cron.d");
    std::cout << "Test 2 - base / relative_with_dotdot:\n";
    std::cout << "  Result: " << (base / rel_right).string() << "\n";
    
    return 0;
}
EOF
# Note: compilation not available, but we can reason about it from Boost documentation

Repository: chizy7/PinnacleMM

Length of output: 43


🏁 Script executed:

# Search for any input validation or sanitization functions
rg "sanitize|validate|clean|escape" --type=cpp -i | head -20

Repository: chizy7/PinnacleMM

Length of output: 1555


🏁 Script executed:

# Check RestAPI or handlers directory if it exists
fd -type d -name "*api*" -o -name "*rest*" -o -name "*handler*" -o -name "*endpoint*"

Repository: chizy7/PinnacleMM

Length of output: 231


🏁 Script executed:

# Check WebServer.cpp for backup-related endpoints
cat core/risk/DisasterRecovery.cpp | head -50

Repository: chizy7/PinnacleMM

Length of output: 1720


🏁 Script executed:

# Look at WebServer.cpp to see what endpoints it exposes
cat visualization/WebServer.cpp | grep -i "backup\|route\|get\|post" | head -40

Repository: chizy7/PinnacleMM

Length of output: 2216


🏁 Script executed:

# Search for where the backup functions might be called from HTTP handlers
rg "createBackup|restoreBackup|deleteBackup" --type=cpp -B 5 -A 5

Repository: chizy7/PinnacleMM

Length of output: 6409


🏁 Script executed:

# Check if there are any REST endpoints that might call backup functions indirectly
rg "disasterRecovery\." --type=cpp | head -30

Repository: chizy7/PinnacleMM

Length of output: 237


🏁 Script executed:

# Verify the exact Boost.Filesystem behavior by checking its documentation context
rg "bfs::" --type=cpp | grep "/" | head -20

Repository: chizy7/PinnacleMM

Length of output: 1643


🏁 Script executed:

# Check the complete createBackup, restoreBackup, and deleteBackup implementations
sed -n '192,263p' core/risk/DisasterRecovery.cpp

Repository: chizy7/PinnacleMM

Length of output: 2652


🏁 Script executed:

# Check complete deleteBackup implementation
sed -n '379,400p' core/risk/DisasterRecovery.cpp

Repository: chizy7/PinnacleMM

Length of output: 713


🏁 Script executed:

# Search for any path canonicalization or validation before using paths from getBackupPath
rg "canonical|normalize|realpath" --type=cpp

Repository: chizy7/PinnacleMM

Length of output: 267


🏁 Script executed:

# Check if there's any input validation utility being used
cat core/utils/InputValidator.cpp | head -100

Repository: chizy7/PinnacleMM

Length of output: 3956


🏁 Script executed:

# Search for any tests or documentation about how bfs::path handles absolute paths
rg "bfs::path\|boost::filesystem" --type=cpp -A 2 | grep -E "(parent_path|absolute|canonical|/.*/)" | head -20

Repository: chizy7/PinnacleMM

Length of output: 43


🏁 Script executed:

# Check if there's anywhere in the codebase that applies path safety checks
rg "parent_path|starts_with|contains" --type=cpp | head -20

Repository: chizy7/PinnacleMM

Length of output: 1557


Path traversal vulnerability via unsanitized label in backup operations — potential arbitrary directory deletion.

getBackupPath(label) at line 589 directly concatenates the label parameter onto m_backupDirectory using bfs::path(m_backupDirectory) / label with no validation. This is exploitable:

  • If label is an absolute path like "/etc/cron.d", Boost.Filesystem's / operator replaces the entire left-hand path with the absolute right operand, allowing access outside the backup directory.
  • If label contains "../" sequences (e.g., "../../etc/cron.d"), path normalization resolves outside the intended backup directory.

All three backup operations pass unsanitized labels to getBackupPath:

  • deleteBackup calls bfs::remove_all(backupPath) — an attacker-controlled path can recursively delete arbitrary directories.
  • createBackup also calls bfs::remove_all() when overwriting existing backups.
  • restoreBackup calls bfs::copy_file() with computed paths — can read from/write to arbitrary locations.

Currently, labels are only provided via hardcoded system-generated values (main.cpp:624 uses "shutdown_" + timestamp, tests use fixed labels). However, the codebase exposes no validation despite providing InputValidator::validateString() with PATH_TRAVERSAL_PATTERNS in core/utils/InputValidator.cpp.

Sanitize backup labels immediately: Reject labels containing /, \, or ".." sequences. Alternatively, verify the resolved backup path remains within m_backupDirectory using path prefix checks.
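A minimal sketch of the simpler rejection check (the helper name isSafeBackupLabel is hypothetical; the project's existing InputValidator::validateString with PATH_TRAVERSAL_PATTERNS could be used instead):

#include <string>

// Hypothetical helper: reject any label that could escape the backup directory.
static bool isSafeBackupLabel(const std::string& label) {
  if (label.empty()) {
    return false;
  }
  return label.find('/') == std::string::npos &&
         label.find('\\') == std::string::npos &&
         label.find("..") == std::string::npos;
}

// Called at the top of deleteBackup/createBackup/restoreBackup before any
// filesystem operation:
//   if (!isSafeBackupLabel(label)) {
//     spdlog::error("DisasterRecovery: rejected unsafe backup label '{}'", label);
//     return false;
//   }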

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/DisasterRecovery.cpp` around lines 379 - 400, The
deleteBackup/createBackup/restoreBackup flow uses getBackupPath(label) which
joins m_backupDirectory with an attacker-controlled label; fix by
validating/sanitizing label before any filesystem ops: either reject labels
containing '/', '\\' or any ".." segments (use InputValidator::validateString
with PATH_TRAVERSAL_PATTERNS) OR compute the absolute/lexically_normalized path
from getBackupPath(label) and assert it has m_backupDirectory as a prefix
(compare normalized paths) before calling bfs::remove_all or bfs::copy_file;
apply the same check in deleteBackup, createBackup and restoreBackup to prevent
escaping the backup directory.

Comment thread core/risk/RiskManager.cpp
Comment on lines +248 to +250
// Update daily PnL -- we store the total PnL as the daily value here;
// callers can provide the session-relative PnL.
m_dailyPnL.store(newPnL, std::memory_order_release);

⚠️ Potential issue | 🟠 Major

m_dailyPnL is always identical to m_totalPnL, making daily loss tracking unreliable post-reset.

Line 228 stores newPnL into m_totalPnL; line 250 stores the same newPnL into m_dailyPnL. After resetDaily() zeros m_dailyPnL, the very next onPnLUpdate(pnl) call sets m_dailyPnL = pnl again — which, if pnl is a cumulative (session-total) value rather than a day-relative delta, means m_dailyPnL no longer reflects only today's P&L.

The auto-halt at line 270 and the pre-trade check at line 139 both key off m_dailyPnL / newPnL against dailyLossLimit. If newPnL is cumulative, both checks are testing total cumulative loss against a per-day limit.

This should be clarified or fixed: either (a) callers must always pass a session-relative daily PnL to onPnLUpdate (document this contract explicitly), or (b) separate m_totalPnL and m_dailyPnL so onPnLUpdate increments only m_totalPnL, while m_dailyPnL is computed as m_totalPnL − snapshotAtDayStart.
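A minimal sketch of option (b) (the member m_totalPnLAtDayStart follows the naming suggested in the review — snapshotAtDayStart — and does not exist in the current code):

#include <atomic>

std::atomic<double> m_totalPnL{0.0};
std::atomic<double> m_dailyPnL{0.0};
std::atomic<double> m_totalPnLAtDayStart{0.0};  // proposed new member

void onPnLUpdate(double newPnL) {
  m_totalPnL.store(newPnL, std::memory_order_release);
  // Daily P&L is the running total minus the snapshot taken at day start.
  m_dailyPnL.store(newPnL - m_totalPnLAtDayStart.load(std::memory_order_acquire),
                   std::memory_order_release);
}

void resetDaily() {
  // Snapshot the running total at the day boundary, then zero the daily figure.
  m_totalPnLAtDayStart.store(m_totalPnL.load(std::memory_order_acquire),
                             std::memory_order_release);
  m_dailyPnL.store(0.0, std::memory_order_release);
}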

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/RiskManager.cpp` around lines 248 - 250, m_dailyPnL is being set
equal to m_totalPnL in onPnLUpdate which makes daily loss tracking wrong after
reset; instead record a snapshot of total PnL at day start (e.g. add
m_totalPnLAtDayStart or m_totalPnLSnapshot, updated in resetDaily()) and change
onPnLUpdate to store newPnL into m_totalPnL then compute m_dailyPnL = newPnL -
m_totalPnLAtDayStart (using the same atomic memory order semantics as current
stores) so auto-halt and pre-trade checks compare a true day-relative P&L.

Comment thread core/risk/RiskManager.cpp
Comment on lines +364 to +384
void RiskManager::hedgeLoop() {
spdlog::info("Hedge loop started");

while (m_hedgeRunning.load(std::memory_order_acquire)) {
uint64_t intervalMs;
{
std::lock_guard<std::mutex> lock(m_stateMutex);
intervalMs = m_limits.hedgeIntervalMs;
}

std::this_thread::sleep_for(std::chrono::milliseconds(intervalMs));

if (!m_hedgeRunning.load(std::memory_order_acquire)) {
break;
}

evaluateHedge();
}

spdlog::info("Hedge loop stopped");
}

⚠️ Potential issue | 🟠 Major

hedgeLoop uses a non-interruptible sleep_for, stalling shutdown by up to intervalMs.

Line 374 sleeps the full intervalMs before rechecking m_hedgeRunning. When m_hedgeRunning is cleared in the destructor or initialize(), the hedge thread cannot be woken; the caller blocks in m_hedgeThread.join() for up to intervalMs. For production configurations with a long hedge interval this makes shutdown/re-initialization unresponsive.

Replace the bare sleep_for with a condition_variable::wait_for signalled on shutdown:

⚡ Proposed fix: interruptible sleep via `condition_variable`

Add to the header (alongside m_hedgeRunning, m_hedgeMutex):

+  std::condition_variable m_hedgeCV;

In RiskManager.cpp:

 // Signal hedge thread to stop
 m_hedgeRunning.store(false, std::memory_order_release);
+m_hedgeCV.notify_all();
 if (m_hedgeThread.joinable()) {
     m_hedgeThread.join();
 }
 void RiskManager::hedgeLoop() {
   spdlog::info("Hedge loop started");
   while (m_hedgeRunning.load(std::memory_order_acquire)) {
     uint64_t intervalMs;
     {
       std::lock_guard<std::mutex> lock(m_stateMutex);
       intervalMs = m_limits.hedgeIntervalMs;
     }
-    std::this_thread::sleep_for(std::chrono::milliseconds(intervalMs));
+    {
+      std::unique_lock<std::mutex> lk(m_hedgeMutex);
+      m_hedgeCV.wait_for(lk, std::chrono::milliseconds(intervalMs),
+                         [this] { return !m_hedgeRunning.load(std::memory_order_acquire); });
+    }
     if (!m_hedgeRunning.load(std::memory_order_acquire)) {
       break;
     }
     evaluateHedge();
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/RiskManager.cpp` around lines 364 - 384, hedgeLoop currently blocks
in std::this_thread::sleep_for for up to m_limits.hedgeIntervalMs and cannot be
interrupted by clearing m_hedgeRunning; replace that sleep with an interruptible
wait using a std::condition_variable (e.g. m_hedgeCv) and an associated mutex
(m_hedgeMutex): wait_for on m_hedgeCv for intervalMs with a predicate that
checks !m_hedgeRunning (or simply check return value) so the loop wakes
immediately when shutdown is requested; add m_hedgeCv and m_hedgeMutex to the
RiskManager header alongside m_hedgeRunning, and ensure the destructor /
initialize path that clears m_hedgeRunning calls m_hedgeCv.notify_all() (before
joining the thread) so the waiting hedgeLoop exits and join() returns promptly,
then continue to call evaluateHedge() only when still running.

Comment thread core/risk/RiskManager.cpp
Comment on lines +504 to +526
void RiskManager::fromJson(const nlohmann::json& j) {
m_position.store(j.value("current_position", 0.0), std::memory_order_relaxed);
m_totalPnL.store(j.value("total_pnl", 0.0), std::memory_order_relaxed);
m_peakPnL.store(j.value("peak_pnl", 0.0), std::memory_order_relaxed);
m_dailyPnL.store(j.value("daily_pnl", 0.0), std::memory_order_relaxed);
m_dailyVolume.store(j.value("daily_volume", 0.0), std::memory_order_relaxed);
m_netExposure.store(j.value("net_exposure", 0.0), std::memory_order_relaxed);
m_grossExposure.store(j.value("gross_exposure", 0.0),
std::memory_order_relaxed);
m_halted.store(j.value("is_halted", false), std::memory_order_relaxed);
m_ordersThisSecond.store(j.value("orders_this_second", 0u),
std::memory_order_relaxed);
m_currentSecond.store(j.value("current_second", uint64_t{0}),
std::memory_order_relaxed);

{
std::lock_guard<std::mutex> lock(m_stateMutex);
m_haltReason = j.value("halt_reason", std::string{});
m_dailyResetTime = j.value("daily_reset_time", uint64_t{0});
}

spdlog::info("RiskManager state restored from JSON");
}

⚠️ Potential issue | 🟡 Minor

fromJson() has no exception handling; a type-mismatched JSON value will propagate an unhandled nlohmann::json::type_error.

j.value<T>(key, default) throws nlohmann::json::type_error when the stored JSON value cannot be converted to T. For example, restoring state from a corrupted snapshot or a manually edited file will terminate the risk manager's initialization path ungracefully. Wrap the body in a try/catch or validate types before reading.

Additionally, j.value("orders_this_second", 0u) uses unsigned int as the default, while the target atomic is uint32_t; prefer an explicit uint32_t{0} cast for clarity and to avoid implicit narrowing on platforms where unsigned int is wider than 32 bits.

🛡️ Proposed fix
 void RiskManager::fromJson(const nlohmann::json& j) {
+  try {
     m_position.store(j.value("current_position", 0.0), std::memory_order_relaxed);
     m_totalPnL.store(j.value("total_pnl", 0.0), std::memory_order_relaxed);
     m_peakPnL.store(j.value("peak_pnl", 0.0), std::memory_order_relaxed);
     m_dailyPnL.store(j.value("daily_pnl", 0.0), std::memory_order_relaxed);
     m_dailyVolume.store(j.value("daily_volume", 0.0), std::memory_order_relaxed);
     m_netExposure.store(j.value("net_exposure", 0.0), std::memory_order_relaxed);
     m_grossExposure.store(j.value("gross_exposure", 0.0), std::memory_order_relaxed);
     m_halted.store(j.value("is_halted", false), std::memory_order_relaxed);
-    m_ordersThisSecond.store(j.value("orders_this_second", 0u),
+    m_ordersThisSecond.store(j.value("orders_this_second", uint32_t{0}),
                              std::memory_order_relaxed);
     m_currentSecond.store(j.value("current_second", uint64_t{0}),
                           std::memory_order_relaxed);
     {
       std::lock_guard<std::mutex> lock(m_stateMutex);
       m_haltReason = j.value("halt_reason", std::string{});
       m_dailyResetTime = j.value("daily_reset_time", uint64_t{0});
     }
     spdlog::info("RiskManager state restored from JSON");
+  } catch (const nlohmann::json::exception& e) {
+    spdlog::error("RiskManager::fromJson failed: {}", e.what());
+    throw;
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/RiskManager.cpp` around lines 504 - 526, fromJson currently can
throw nlohmann::json::type_error on type-mismatched fields and uses an unsigned
int literal for orders_this_second; wrap the body of RiskManager::fromJson in a
try/catch that catches std::exception (or nlohmann::json::type_error) and logs
the error via spdlog::error, then restore safe defaults for the atomics
(m_position, m_totalPnL, m_peakPnL, m_dailyPnL, m_dailyVolume, m_netExposure,
m_grossExposure, m_halted, m_ordersThisSecond, m_currentSecond) and guarded
members (m_haltReason, m_dailyResetTime) to ensure initialization continues;
also change the default for "orders_this_second" to an explicit uint32_t{0} and
for "current_second"/"daily_reset_time" to uint64_t{0} to avoid implicit
narrowing.

Comment thread core/risk/RiskManager.cpp
Comment on lines +549 to +568
void RiskManager::checkDailyReset() {
// Use system_clock so we can detect calendar-day boundaries
auto now = std::chrono::system_clock::now();
auto today = std::chrono::floor<std::chrono::days>(now);
uint64_t todayMs = static_cast<uint64_t>(
std::chrono::duration_cast<std::chrono::milliseconds>(
today.time_since_epoch())
.count());

uint64_t lastReset;
{
std::lock_guard<std::mutex> lock(m_stateMutex);
lastReset = m_dailyResetTime;
}

// If the last reset was before today's midnight, do a reset
if (lastReset < todayMs) {
resetDaily();
}
}

⚠️ Potential issue | 🔴 Critical

Race condition: two concurrent threads can both read lastReset < todayMs before either updates m_dailyResetTime, causing a double daily reset that silently erases fills.

The check and the update are not atomic:

  1. Thread A reads lastReset = T₀ (under lock), releases lock, evaluates T₀ < todayMs → true.
  2. Thread B reads lastReset = T₀ (same stale value), releases lock, evaluates T₀ < todayMs → true.
  3. Thread A calls resetDaily(): zeros m_dailyPnL, m_dailyVolume, m_ordersThisSecond, sets m_dailyResetTime = now.
  4. Fills occur between A's reset and B's entry into resetDaily() — they increment m_dailyVolume and m_dailyPnL.
  5. Thread B calls resetDaily(): zeros the counters again, erasing those fills from daily tracking.

With checkDailyReset() called on every fill (line 221), the race window around the UTC midnight crossing is non-trivial in a concurrent HFT system and can cause daily P&L and volume limits to be silently bypassed.

Fix: Claim the reset slot atomically by updating m_dailyResetTime inside the mutex before releasing it, so only one thread proceeds:

🔒 Proposed fix: atomic claim inside lock
 void RiskManager::checkDailyReset() {
   auto now = std::chrono::system_clock::now();
   auto today = std::chrono::floor<std::chrono::days>(now);
   uint64_t todayMs = static_cast<uint64_t>(
       std::chrono::duration_cast<std::chrono::milliseconds>(
           today.time_since_epoch())
           .count());

-  uint64_t lastReset;
-  {
+  bool shouldReset = false;
+  {
     std::lock_guard<std::mutex> lock(m_stateMutex);
-    lastReset = m_dailyResetTime;
-  }
-
-  // If the last reset was before today's midnight, do a reset
-  if (lastReset < todayMs) {
-    resetDaily();
+    if (m_dailyResetTime < todayMs) {
+      m_dailyResetTime = todayMs;  // claim the reset slot; only one thread wins
+      shouldReset = true;
+    }
   }
+
+  if (shouldReset) {
+    // Atomics can be reset outside the lock
+    m_dailyPnL.store(0.0, std::memory_order_release);
+    m_dailyVolume.store(0.0, std::memory_order_release);
+    m_ordersThisSecond.store(0, std::memory_order_release);
+    spdlog::info("Daily risk counters reset");
+    AUDIT_SYSTEM_EVENT("Daily risk counters reset", true);
+  }
 }

With this change resetDaily() is still available as an explicit public API, but checkDailyReset() no longer delegates to it — it directly updates the atomics after claiming the slot, preventing any double-reset.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@core/risk/RiskManager.cpp` around lines 549 - 568, checkDailyReset()
currently reads m_dailyResetTime under m_stateMutex then releases the lock
before calling resetDaily(), allowing two threads to both pass the check and
perform resets; fix by claiming the reset slot while holding m_stateMutex:
inside RiskManager::checkDailyReset(), after computing todayMs, acquire the
lock, if m_dailyResetTime < todayMs set m_dailyResetTime = todayMs (or now) and
then perform the daily-zeroing while still holding the mutex (either call
resetDaily() while holding m_stateMutex or inline the resets to m_dailyPnL,
m_dailyVolume, m_ordersThisSecond), then release the lock so no other thread can
race to zero counters for the same day.

@chizy7 chizy7 merged commit 5019642 into main Feb 20, 2026
15 checks passed
github-actions Bot pushed a commit that referenced this pull request Feb 20, 2026
# [1.5.0](v1.4.0...v1.5.0) (2026-02-20)

### Features

* implement risk management and (wip: production readiness)  ([#53](#53)) ([5019642](5019642))
@github-actions

🎉 This PR is included in version 1.5.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
