feat: implement risk management and (wip: production readiness) #53
Conversation
📝 Walkthrough

Adds a new risk management layer (RiskManager, CircuitBreaker, VaREngine, AlertManager, DisasterRecovery), risk configuration, Kubernetes deployment manifests, REST endpoints, extensive tests and benchmarks, and supporting build changes.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant RM as RiskManager
    participant CB as CircuitBreaker
    participant AM as AlertManager
    participant VE as VaREngine
    participant DR as DisasterRecovery
    participant Exchange as Exchange

    Client->>CB: publishPrice(price)
    CB->>CB: evaluateTriggers()
    alt breaker opens
        CB->>AM: raiseAlert(RAPID_PRICE_MOVE, CRITICAL, ...)
        CB-->>Client: trading disallowed
    end

    Client->>RM: checkOrder(side, price, qty)
    alt rejected
        RM-->>Client: rejection (limit/drawdown/halted)
    else approved
        RM-->>Client: APPROVED
        Client->>Exchange: placeOrder(...)
        Exchange-->>RM: onFill(...)
        RM->>RM: updateState()
        RM->>AM: raiseAlert(DRAWDOWN_BREACH, WARNING, ...)
        RM->>VE: addPnLUpdate(...)
        VE->>VE: recomputeVaR()
    end

    loop periodic
        RM->>DR: saveRiskState(snapshot)
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Actionable comments posted: 9
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
docs/PERFORMANCE_BENCHMARKS.md (1)
216-232: ⚠️ Potential issue | 🟡 Minor
Missing expected result ranges for risk benchmarks.

The "Expected Results Validation" section lists expected ranges for latency, throughput, order book, and routing benchmarks, but omits the newly added risk benchmarks (`risk_check_benchmark`). Adding expected ranges (e.g., pre-trade risk check: 500–1000ns, circuit breaker: <10ns) would make the validation section consistent with the new benchmark.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/PERFORMANCE_BENCHMARKS.md` around lines 216 - 232, The docs' "Expected Results Validation" is missing expected ranges for the new risk benchmarks; update the validation section to include ranges for risk_check_benchmark and its subchecks (e.g., add "Pre-trade risk check: 500–1000ns" and "Circuit breaker: <10ns" or equivalent agreed values), and ensure the new "Routing Benchmark (NEW)" additions are mirrored by corresponding entries in the Expected Results Validation; locate mentions of risk_check_benchmark, pre-trade risk check, and circuit breaker in the markdown and add the expected numeric ranges in the same format as the existing latency/throughput/order book entries.config/default_config.json (1)
17-22: ⚠️ Potential issue | 🟠 Major
Duplicate risk configuration keys across `risk` and `risk_management` blocks.

`risk.maxDrawdownPct` (5.0) and `risk_management.limits.max_drawdown_pct` (5.0) configure the same concept, as do `risk.maxTradingVolume` (100.0) and `risk_management.limits.max_daily_volume` (100.0). The codebase reads from both: `StrategyConfig` loads from the old `risk` block while `RiskConfig` loads from `risk_management`. Having two sources of truth for the same limits is error-prone — a future edit may update one but not the other, causing divergence.

Additionally, `stopLossPct` and `takeProfitPct` exist only in the old `risk` block; they should either be migrated to `risk_management` or explicitly documented as strategy-specific config.

Consider consolidating to a single config structure and ensuring all modules read from the same source.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@config/default_config.json` around lines 17 - 22, There are duplicate risk settings between the old "risk" block and the new "risk_management" block causing two sources of truth; consolidate by migrating/merging the shared keys (maxDrawdownPct/max_drawdown_pct and maxTradingVolume/max_daily_volume) into a single canonical block (preferably "risk_management.limits"), update StrategyConfig to read those values from RiskConfig (or have StrategyConfig consume RiskConfig) so only one source is used, remove the deprecated "risk" block from the default config, and either move stopLossPct and takeProfitPct into "risk_management" (e.g., limits.stop_loss_pct, limits.take_profit_pct) or clearly document them as strategy-specific and ensure only StrategyConfig reads them; also update any config schema/validation to reflect the single structure so consumers no longer read both.strategies/basic/BasicMarketMaker.cpp (2)
52-69: ⚠️ Potential issue | 🟡 Minor
Race between `m_strategyThread` launch and `m_isRunning` store.

`m_isRunning` is set to `true` on line 65, after the thread is already launched on line 62. If another thread calls `start()` between lines 62 and 65, the `m_isRunning` guard on line 54 won't prevent a second thread from being spawned. This is a pre-existing issue, but now that the function emits audit events it may be worth noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@strategies/basic/BasicMarketMaker.cpp` around lines 52 - 69, The start() method in BasicMarketMaker has a race where m_isRunning is set after launching m_strategyThread; fix by performing an atomic check-and-set on m_isRunning before spawning the thread (e.g., use compare_exchange_strong to transition m_isRunning from false to true), only proceed to store m_shouldStop=false, start m_strategyThread and emit AUDIT_SYSTEM_EVENT if the CAS succeeds, and if the CAS fails return false so a second thread cannot be spawned; ensure cleanup on failure (do not start the thread if you couldn't claim the running flag) and keep references to m_strategyThread and m_shouldStop consistent with this atomic claim.
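To make the suggested atomic check-and-set concrete, here is a minimal, self-contained sketch of the pattern; the class and member names are placeholders rather than the actual `BasicMarketMaker` API.

```cpp
#include <atomic>
#include <thread>

// Illustration only: claim the running flag before spawning the thread.
class Strategy {
public:
  bool start() {
    bool expected = false;
    // Only the caller that wins this exchange may launch the strategy thread.
    if (!m_isRunning.compare_exchange_strong(expected, true)) {
      return false;  // a concurrent start() already claimed the flag
    }
    m_shouldStop.store(false);
    m_thread = std::thread([this] { run(); });
    return true;
  }

  void stop() {
    m_shouldStop.store(true);
    if (m_thread.joinable()) m_thread.join();
    m_isRunning.store(false);
  }

private:
  void run() {
    while (!m_shouldStop.load()) { /* strategy loop */ }
  }

  std::atomic<bool> m_isRunning{false};
  std::atomic<bool> m_shouldStop{false};
  std::thread m_thread;
};
```

The key point is that only the caller that wins the `compare_exchange_strong` may launch the thread, so a concurrent `start()` returns `false` instead of spawning a second thread.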
553-569: ⚠️ Potential issue | 🟠 Major
`updateStatistics()` computes PnL but never calls `RiskManager::onPnLUpdate()` — drawdown and daily-loss limits won't trigger.

`updateStatistics` recalculates `estimatedPnL` every loop iteration and stores it, but the RiskManager is never informed. Without calling `onPnLUpdate`, the risk manager's drawdown tracking and daily-loss halting (tested in `RiskManagerTests.cpp` and verified to work) are effectively disabled in the live strategy.

Proposed fix

```diff
   // In a real system, we would track realized P&L from fills
   double estimatedPnL = unrealizedPnL;
   m_pnl.store(estimatedPnL, std::memory_order_relaxed);
+
+  // Notify risk manager so drawdown / daily-loss limits are enforced
+  risk::RiskManager::getInstance().onPnLUpdate(estimatedPnL);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@strategies/basic/BasicMarketMaker.cpp` around lines 553 - 569, The updateStatistics() method updates m_pnl but never notifies the risk manager; modify it to, after storing estimatedPnL (m_pnl.store(...)), check that m_riskManager is non-null and call m_riskManager->onPnLUpdate(estimatedPnL) so the RiskManager (and its drawdown/daily-loss logic) receives each PnL update; ensure the call is made outside the m_statsMutex lock to avoid potential deadlocks and keep the null-check (or use a safe pointer) to prevent crashes.
🟠 Major comments (14)
Dockerfile-124-125 (1)
124-125: ⚠️ Potential issue | 🟠 Major
Container runs as root; consider adding a non-root `USER`

The new runtime directories are created without explicit ownership or permission settings. More broadly, the final image has no `USER` instruction, so the application process runs as root. This violates the principle of least privilege and is flagged by standard container security scanners (e.g., Trivy, Hadolint).

🔒 Proposed hardening

```diff
+# Create non-root user
+RUN groupadd -r pinnaclemm && useradd -r -g pinnaclemm pinnaclemm
+
 # Create logs and data directories
-RUN mkdir -p /app/logs /app/data /app/data/backups
+RUN mkdir -p /app/logs /app/data /app/data/backups \
+    && chown -R pinnaclemm:pinnaclemm /app
+
+USER pinnaclemm
+
 # Default command
 ENTRYPOINT ["/app/pinnaclemm"]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 124 - 125, Create a non-root runtime user and set proper ownership/permissions for the created directories so the container does not run as root: after the RUN mkdir -p /app/logs /app/data /app/data/backups step, add commands to create a dedicated user/group (e.g., appuser/appgroup), chown the directories (/app/logs, /app/data, /app/data/backups) to that user/group and set restrictive permissions, and finally add a USER instruction to switch to that non-root user for the remaining image runtime; reference the mkdir-created paths and the Dockerfile USER instruction and the chosen username (e.g., appuser) when implementing.deploy/k8s/networkpolicy.yaml-15-23 (1)
15-23: ⚠️ Potential issue | 🟠 Major
Ingress rules with no `from` selector allow any pod in the cluster to reach the risk API.

In Kubernetes NetworkPolicy, a rule that specifies only `ports` without a `from` clause permits traffic from any source — any pod in any namespace (and any external address that can reach the pod network). For ports 8081 (risk state, VaR, circuit-breaker controls, alert history) and 8080 (live WebSocket dashboard), this is an overly permissive posture for a production financial system.

Add a `from` clause to restrict ingress to the namespaces that legitimately need access (e.g., the ingress controller namespace, monitoring namespace, or same-namespace pods only).

🛡️ Proposed fix — scope ingress to intended consumers

```diff
   ingress:
-    # Allow inbound traffic to WebSocket port
-    - ports:
-        - port: 8080
-          protocol: TCP
-    # Allow inbound traffic to REST API port
-    - ports:
-        - port: 8081
-          protocol: TCP
+    # Allow inbound traffic from ingress controller and monitoring namespace only
+    - from:
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: ingress-nginx # adjust to your ingress controller namespace
+        - namespaceSelector:
+            matchLabels:
+              kubernetes.io/metadata.name: monitoring # adjust to your monitoring namespace
+      ports:
+        - port: 8080
+          protocol: TCP
+        - port: 8081
+          protocol: TCP
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/k8s/networkpolicy.yaml` around lines 15 - 23, The NetworkPolicy ingress rules for ports 8080 and 8081 are currently missing a from: clause and thus allow traffic from any source; update the ingress block in networkpolicy.yaml to add a from: selector restricting sources (e.g., a namespaceSelector matching your ingress controller namespace, monitoring namespace, and/or a podSelector for same-namespace pods) for the entries that list port: 8080 and port: 8081, ensuring the policy only permits those intended consumers to reach the risk API/WebSocket endpoints; keep policyTypes: [ "Ingress" ] (or add it if missing) so the rule is enforced.core/risk/CircuitBreaker.cpp-130-141 (1)
130-141: ⚠️ Potential issue | 🟠 Major
`static` local `spreadSamples` is never reset on re-initialization.

`spreadSamples` is a function-`static` variable that monotonically increments. When `initialize()` is called again (e.g., during tests or runtime reconfiguration), `m_baselineInitialized` is reset to `false` but `spreadSamples` retains its old value (≥ 20). On the very next `onSpread` call after re-init, the `++spreadSamples >= 20` check passes immediately, setting `m_baselineInitialized = true` after only a single EMA update instead of the intended ~20-sample warm-up.

Proposed fix: use a member variable instead of static

In `CircuitBreaker.h`, add a member:

```cpp
size_t m_spreadSamples{0};
```

In `CircuitBreaker.cpp`, reset it in `initialize()`:

```diff
 {
   std::lock_guard<std::mutex> lock(m_baselineMutex);
   m_baselineSpread = 0.0;
   m_baselineVolume = 0.0;
   m_baselineInitialized = false;
+  m_spreadSamples = 0;
 }
```

And in `onSpread`, replace the static:

```diff
-  static size_t spreadSamples = 0;
-  ++spreadSamples;
-  if (spreadSamples >= 20) {
+  ++m_spreadSamples;
+  if (m_spreadSamples >= 20) {
```
Verify each finding against the current code and only fix it if needed. In `@core/risk/CircuitBreaker.cpp` around lines 130 - 141, The bug is that the function-static spreadSamples in onSpread() never resets across re-initializations so m_baselineInitialized gets re-enabled immediately; replace the static counter with a per-instance member (e.g., add size_t m_spreadSamples{0} to the CircuitBreaker class), reset m_spreadSamples to 0 in initialize(), and use/increment m_spreadSamples inside onSpread() instead of the static spreadSamples so the ~20-sample warm-up behaves correctly.core/risk/RiskManager.cpp-228-230 (1)
228-230: ⚠️ Potential issue | 🟠 Major
`onPnLUpdate` stores `newPnL` (total) as the daily PnL — likely incorrect.

Line 230 sets `m_dailyPnL` to the total PnL value passed in, not a session-relative delta. The comment on line 228 acknowledges this ambiguity. However, `checkOrder` (line 126) and `getDailyLossUtilization` both treat `m_dailyPnL` as the daily loss figure. If `newPnL` is a cumulative total (as the parameter name and `m_totalPnL` assignment on line 208 suggest), the daily loss limit check will fire against the wrong baseline.

Clarify the contract: either compute `dailyPnL = newPnL - pnlAtDayStart`, or require callers to pass the daily-scoped value.
Verify each finding against the current code and only fix it if needed. In `@core/risk/RiskManager.cpp` around lines 228 - 230, onPnLUpdate currently stores the total PnL into m_dailyPnL which breaks downstream logic (checkOrder and getDailyLossUtilization) that expect a session/daily delta; change onPnLUpdate to compute and store the session-relative daily PnL (e.g., dailyPnL = newPnL - pnlAtDayStart or newPnL - m_totalPnLPrevious) before calling m_dailyPnL.store, or alternatively change the API contract and validate callers to pass the delta; update references to m_totalPnL and m_dailyPnL in onPnLUpdate (and any reset logic that records pnlAtDayStart) so that m_dailyPnL represents the daily delta used by checkOrder and getDailyLossUtilization.core/risk/RiskManager.cpp-85-94 (1)
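A minimal sketch of the `dailyPnL = newPnL - pnlAtDayStart` contract discussed above, assuming `onPnLUpdate` receives the cumulative total; the tracker class and member names are illustrative, not the actual `RiskManager` fields.

```cpp
#include <atomic>

// Illustration of a session-relative daily PnL alongside the total PnL.
class DailyPnLTracker {
public:
  // Call once when the trading day rolls over.
  void startNewDay() { m_pnlAtDayStart.store(m_totalPnL.load()); }

  // newPnL is the cumulative (total) PnL reported by the strategy.
  void onPnLUpdate(double newPnL) {
    m_totalPnL.store(newPnL);
    m_dailyPnL.store(newPnL - m_pnlAtDayStart.load());
  }

  double dailyPnL() const { return m_dailyPnL.load(); }

private:
  std::atomic<double> m_totalPnL{0.0};
  std::atomic<double> m_pnlAtDayStart{0.0};
  std::atomic<double> m_dailyPnL{0.0};
};
```

With this split, the daily-loss limit check compares against the session delta rather than the lifetime total.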
85-94: ⚠️ Potential issue | 🟠 Major
Reading `m_limits` fields without synchronization on the hot path is formally UB.

Lines 88, 97–98, 108, 117, 125, 138, and 152–154 read members of `m_limits` (a non-atomic struct) while `updateLimits()` can write it concurrently under `m_stateMutex`. The comment acknowledges this as "best-effort," but concurrent read/write of non-atomic types is undefined behavior in C++ — not merely a torn read.

Since `RiskLimits` is a small POD struct, a pragmatic fix is to snapshot it at the top of `checkOrder` under the mutex (one lock acquisition per check) rather than per-field reads throughout.

Proposed fix — snapshot limits once

```diff
 RiskCheckResult RiskManager::checkOrder(OrderSide side, double price,
                                         double quantity,
                                         const std::string& symbol) {
+  // Snapshot limits once to avoid data-race UB on the hot path
+  RiskLimits limits;
+  {
+    std::lock_guard<std::mutex> lock(m_stateMutex);
+    limits = m_limits;
+  }
+
   // 1. Halted check
   if (m_halted.load(std::memory_order_acquire)) {
```

Then replace all `m_limits.XXX` references with `limits.XXX` in the method body.
Verify each finding against the current code and only fix it if needed. In `@core/risk/RiskManager.cpp` around lines 85 - 94, The method checkOrder reads non-atomic members of m_limits concurrently with updateLimits() which writes it under m_stateMutex, causing UB; fix by taking m_stateMutex at the start of checkOrder(), copy m_limits into a local RiskLimits limits snapshot, then release the mutex and replace all uses of m_limits.* in checkOrder() with limits.* so the hot path uses the stable local copy (refer to symbols: checkOrder, m_limits, updateLimits, m_stateMutex, RiskLimits).core/risk/AlertManager.h-102-103 (1)
102-103: ⚠️ Potential issue | 🟠 Major
Data race on `m_lastAlertTime` in `isThrottled`.

`isThrottled` (called from `raiseAlert` at line 33, before acquiring `m_alertsMutex`) performs a `find` + read on `m_lastAlertTime` without synchronization, while `raiseAlert` writes to this map under `m_alertsMutex` at line 50. Concurrent read of `std::unordered_map` during a write is undefined behavior in C++ — a rehash triggered by insertion can corrupt the iterator/bucket structure.

Options:

- Acquire `m_alertsMutex` in `isThrottled` (simple, acceptable since alert raising is not on the trading hot path).
- Use a fixed-size `std::array` indexed by `static_cast<int>(type)` for lock-free access with atomics (see the sketch after the AI prompt below).
Verify each finding against the current code and only fix it if needed. In `@core/risk/AlertManager.h` around lines 102 - 103, isThrottled currently reads m_lastAlertTime without synchronization while raiseAlert writes to it under m_alertsMutex, causing a data race; fix by synchronizing accesses: either acquire m_alertsMutex at the start of isThrottled (simplest) or replace m_lastAlertTime with a fixed-size container of atomics indexed by the alert type for lock-free reads/writes. Update isThrottled to use the chosen approach and ensure all write sites (e.g., raiseAlert) use the same synchronization mechanism, referencing m_lastAlertTime, isThrottled, raiseAlert, and m_alertsMutex to locate and modify the code.core/risk/VaREngine.cpp-246-253 (1)
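A sketch of the second option (fixed-size atomics indexed by alert type); the enum values, throttle window, and API are assumptions for illustration, not the real `AlertManager` definitions.

```cpp
#include <array>
#include <atomic>
#include <cstdint>

// Illustrative alert types; the real enum lives in AlertManager.h.
enum class AlertType { DRAWDOWN_BREACH, RAPID_PRICE_MOVE, VAR_BREACH, COUNT };

class AlertThrottle {
public:
  // Returns true (and records the time) if an alert of this type may fire now.
  bool tryAcquire(AlertType type, uint64_t nowMs, uint64_t throttleMs) {
    auto& slot = m_lastAlertMs[static_cast<size_t>(type)];
    uint64_t last = slot.load(std::memory_order_relaxed);
    if (nowMs - last < throttleMs) return false;  // still throttled
    // Only one racing caller wins the exchange and raises the alert.
    return slot.compare_exchange_strong(last, nowMs, std::memory_order_relaxed);
  }

private:
  std::array<std::atomic<uint64_t>, static_cast<size_t>(AlertType::COUNT)>
      m_lastAlertMs{};
};
```

Because the array is fixed-size and each slot is an atomic, there is no rehash and no shared map structure to corrupt.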
246-253: ⚠️ Potential issue | 🟠 Major
Monte Carlo simulations use identical random draws on every cycle.

`m_rng` is seeded once in the constructor and never advanced. Each call copies it via `std::mt19937 localRng(m_rng)`, so every invocation of `calculateMonteCarloVaR` starts from the same RNG state. While different `mean`/`stddev` values cause different output values, the underlying uniform variates are always identical — this eliminates the sampling diversity that Monte Carlo is supposed to provide and makes MC VaR a deterministic transform of parametric VaR.

Fix: make `m_rng` mutable and use it directly (advancing its state), or store and advance the `localRng` across iterations.

Proposed fix — advance RNG state across calls

In `VaREngine.h`, mark `m_rng` as `mutable`, and in the method:

```diff
-  // Use a local copy of the RNG to keep this method const-correct
-  std::mt19937 localRng(m_rng);
+  // Advance the engine's RNG so successive calls produce different draws
+  std::mt19937 localRng(m_rng);
+  m_rng = localRng; // will be updated after the loop below; see alternative
```

A cleaner approach: drop `const` from `calculateMonteCarloVaR` and use `m_rng` directly (guarded by the fact that only the background thread calls it), or keep a `mutable` RNG with a dedicated mutex.
Verify each finding against the current code and only fix it if needed. In `@core/risk/VaREngine.cpp` around lines 246 - 253, calculateMonteCarloVaR currently copies the RNG state each call via std::mt19937 localRng(m_rng), causing identical random draws every invocation; fix by advancing shared RNG state instead of reusing the same seed: make m_rng mutable (or remove const from calculateMonteCarloVaR) and use m_rng directly when sampling (or store a persistent localRng in the class and update it between calls), ensuring the RNG state (m_rng or the persistent localRng) is advanced on each draw rather than re-seeded from the same state.visualization/WebServer.h-236-242 (1)
236-242: ⚠️ Potential issue | 🟠 Major
`handleGetRiskVaR` is a stub — architectural issue prevents access to actual VaR data.

The implementation returns a static placeholder message instead of calling `VaREngine::toJson()`. However, the root cause is architectural: `VaREngine` lacks a singleton pattern and is instantiated locally in `main()` with no mechanism for `WebServer` to access it. Unlike `RiskManager`, `CircuitBreaker`, and `AlertManager`, which provide `getInstance()` accessors, `VaREngine` has no such pattern.

To fix this, either:

- Convert `VaREngine` to a singleton following the pattern of other risk components, or
- Pass the `VaREngine` instance to `WebServer` at initialization time (sketched after the AI prompt below).
Verify each finding against the current code and only fix it if needed. In `@visualization/WebServer.h` around lines 236 - 242, handleGetRiskVaR currently returns a placeholder because WebServer cannot access the real VaREngine; fix by wiring VaREngine into WebServer: either (A) make VaREngine follow the existing singleton pattern (add a static getInstance() method and internal static instance, mirroring RiskManager/CircuitBreaker/AlertManager) so handleGetRiskVaR can call VaREngine::getInstance()->toJson(), or (B) inject the VaREngine instance into WebServer (add a constructor or setter on WebServer to store a VaREngine* or reference and update main() to pass the locally-constructed VaREngine), then modify WebServer::handleGetRiskVaR to call the instance's toJson() and return that JSON instead of the static placeholder.core/risk/AlertManager.h-1-4 (1)
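A minimal sketch of the injection option; the setter name, the stub `VaREngine`, and the `nlohmann::json`-returning handler signature are assumptions for illustration only.

```cpp
#include <nlohmann/json.hpp>

// Stand-in for the real engine; only the toJson() surface matters here.
class VaREngine {
public:
  nlohmann::json toJson() const { return {{"historical_var_95", 0.0}}; }
};

class WebServer {
public:
  void setVaREngine(const VaREngine* engine) { m_varEngine = engine; }

  nlohmann::json handleGetRiskVaR() const {
    if (!m_varEngine) return {{"error", "VaR engine not attached"}};
    return m_varEngine->toJson();  // real data instead of the placeholder
  }

private:
  const VaREngine* m_varEngine{nullptr};
};

// Wiring in main():
//   VaREngine varEngine;        // constructed locally, as today
//   WebServer server;
//   server.setVaREngine(&varEngine);
```

Injection keeps `VaREngine` testable and avoids adding yet another global singleton.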
1-4: 🛠️ Refactor suggestion | 🟠 Major
Missing direct `#include <nlohmann/json.hpp>`.

This header uses `nlohmann::json` in the `Alert` struct (line 44) and in several method signatures (lines 57–60, 81–82), but relies on `RiskConfig.h` to transitively provide the include. If `RiskConfig.h` ever stops including it, this header will break. Add the include directly.

Proposed fix

```diff
 #include "../utils/TimeUtils.h"
 #include "RiskConfig.h"
+#include <nlohmann/json.hpp>
+
 #include <atomic>
```
Verify each finding against the current code and only fix it if needed. In `@core/risk/AlertManager.h` around lines 1 - 4, AlertManager.h currently uses nlohmann::json in the Alert struct and several method signatures but relies on a transitive include; add a direct include for <nlohmann/json.hpp> at the top of AlertManager.h so Alert (and methods that take/return nlohmann::json) compile independently of RiskConfig.h changes—update the header to `#include` <nlohmann/json.hpp> before any declarations that reference nlohmann::json (e.g., the Alert struct and public methods in AlertManager).docs/DISASTER_RECOVERY.md-264-268 (1)
264-268: ⚠️ Potential issue | 🟠 Major
`POST /api/risk/circuit-breaker/reset` is documented but not implemented.

The manual override command on line 267 sends a POST to `/api/risk/circuit-breaker/reset`, but only `GET /api/risk/circuit-breaker` is routed in `WebServer.cpp`. An operator using this command during an emergency will receive a 404.
Verify each finding against the current code and only fix it if needed. In `@docs/DISASTER_RECOVERY.md` around lines 264 - 268, The docs call out a POST endpoint that doesn't exist; add a POST route in WebServer.cpp alongside the existing GET /api/risk/circuit-breaker handler (where routes are registered, e.g., setupRoutes/registerRoutes) to handle POST /api/risk/circuit-breaker/reset, implement a handler function (e.g., handlePostCircuitBreakerReset) that invokes the circuit-breaker reset operation on the appropriate component (call the existing circuit breaker reset method such as circuitBreaker.reset() or riskManager.resetCircuitBreaker()) and return an appropriate HTTP 200/204 on success and 4xx/5xx on failure; wire that handler into the server/router registration so the described kubectl curl command no longer returns 404.docs/DISASTER_RECOVERY.md-117-133 (1)
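For illustration, a framework-agnostic sketch of what the missing reset handler could do. Route registration is omitted because it depends on how `WebServer.cpp` wires handlers, and the `reset()` method name is an assumption (the prompt above suggests `reset()` or `resetCircuitBreaker()`).

```cpp
#include <nlohmann/json.hpp>

// Hypothetical handler body for POST /api/risk/circuit-breaker/reset.
// The breaker is passed in so the assumed interface stays explicit.
template <typename Breaker>
nlohmann::json handlePostCircuitBreakerReset(Breaker& breaker) {
  breaker.reset();  // assumed manual-override entry point
  return {{"status", "ok"},
          {"trading_allowed", breaker.isTradingAllowed()}};
}
```

Returning the post-reset `trading_allowed` state gives the operator immediate confirmation that the override took effect.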
117-133: ⚠️ Potential issue | 🟠 Major
Documentation references multiple endpoints that are not implemented.
The manual position-adjustment procedure (Section 5, lines 117-129) references `/api/positions` and `/api/reconcile`, which do not exist in `visualization/WebServer.cpp`. An operator following this runbook would receive 404 errors.
- Line 235 references
/api/risk/statusbut the implemented endpoint is/api/risk/state- Line 267 references
POST /api/risk/circuit-breaker/resetbut only the GET variant exists- Line 10.2 references
/api/metricsfor Prometheus scraping, which is not implemented- The
/api/readyendpoint documentation (table, line 10.1) claims it checks "recovered state" and "exchange connectivity," but the actual implementation only verifies RiskManager halt state and CircuitBreaker trading allowed statusEither implement the missing endpoints or update the documentation to reflect what is actually available.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/DISASTER_RECOVERY.md` around lines 117 - 133, The docs list multiple HTTP endpoints that don't exist or differ from implementation; update either the docs or the server. Either implement the missing handlers in visualization/WebServer.cpp (add a GET /api/positions handler and a POST /api/reconcile handler, add /api/metrics for Prometheus, add POST /api/risk/circuit-breaker/reset in addition to the existing GET, and add an alias /api/risk/status that delegates to the existing /api/risk/state handler) or modify DISASTER_RECOVERY.md and the API table to accurately reflect the implemented handlers (e.g., change /api/risk/status → /api/risk/state, mark circuit-breaker reset as GET-only or add POST behavior, remove /api/positions and /api/reconcile or document manual reconciliation steps), and update the /api/ready description to match the readyEndpoint implementation in WebServer.cpp (only RiskManager halt and CircuitBreaker trading checks). Ensure changes reference the actual handler names in WebServer.cpp so documentation and code remain consistent.core/risk/DisasterRecovery.cpp-52-99 (1)
52-99: ⚠️ Potential issue | 🟠 Major
`saveRiskState` is not atomically saving both files — partial failure leaves inconsistent state.

If the risk-state rename on Line 73 succeeds but the strategy-state write/rename (Lines 80-91) fails, the system ends up with a new risk state paired with the old strategy state. For a disaster-recovery component, this inconsistency could cause incorrect state restoration.
Consider writing both temp files first, then renaming both in sequence, so that at least the risk state rename doesn't happen unless both temp files are fully written.
Proposed fix — write both temp files, then rename both
```diff
   try {
     std::string riskPath = getRiskStatePath();
     std::string riskTmpPath = riskPath + ".tmp";
+    std::string strategyPath = getStrategyStatePath();
+    std::string strategyTmpPath = strategyPath + ".tmp";
+
+    // Write both temp files first
     {
       std::ofstream ofs(riskTmpPath, std::ios::trunc);
       if (!ofs.is_open()) { ... return false; }
       ofs << riskState.dump(2);
       ofs.flush();
+      if (!ofs.good()) {
+        spdlog::error("DisasterRecovery: risk state write failed");
+        return false;
+      }
     }
-    bfs::rename(riskTmpPath, riskPath);
-
-    std::string strategyPath = getStrategyStatePath();
-    std::string strategyTmpPath = strategyPath + ".tmp";
-
     {
       std::ofstream ofs(strategyTmpPath, std::ios::trunc);
       ...
       ofs << strategyState.dump(2);
       ofs.flush();
+      if (!ofs.good()) {
+        spdlog::error("DisasterRecovery: strategy state write failed");
+        return false;
+      }
     }
+
+    // Rename both only after both temp files are written successfully
+    bfs::rename(riskTmpPath, riskPath);
     bfs::rename(strategyTmpPath, strategyPath);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/risk/DisasterRecovery.cpp` around lines 52 - 99, The current DisasterRecovery::saveRiskState writes and renames the risk temp file before writing/renaming the strategy temp file, allowing a partial commit; change the flow to write both temp files fully (riskTmpPath and strategyTmpPath) first (open, write, flush, close) and only after both writes succeed perform the two bfs::rename calls (or perform a safe two-phase commit sequence), ensuring that if either write fails no rename occurs and existing files remain unchanged; also preserve existing logging behavior (use the same spdlog error messages on write failures and the existing exception catch) so failures remain visible.deploy/k8s/deployment.yaml-31-48 (1)
31-48: ⚠️ Potential issue | 🟠 Major
Missing `securityContext` — container runs as root with full privileges.

Trivy (KSV-0014, KSV-0118) and Checkov (CKV_K8S_20, CKV_K8S_23) both flag this. For a production trading system handling exchange credentials, this is a meaningful security posture gap. The container should run as non-root with a read-only root filesystem, dropped capabilities, and no privilege escalation.

Proposed fix — add securityContext at both pod and container level

```diff
 spec:
   terminationGracePeriodSeconds: 60
+  securityContext:
+    runAsNonRoot: true
+    runAsUser: 1000
+    fsGroup: 1000
   containers:
     - name: pinnaclemm
       image: pinnaclemm:latest
       imagePullPolicy: IfNotPresent
+      securityContext:
+        allowPrivilegeEscalation: false
+        readOnlyRootFilesystem: true
+        capabilities:
+          drop:
+            - ALL
```

Note: With `readOnlyRootFilesystem: true`, ensure the writable mount points (`/data/pinnaclemm`, `/var/log/pinnaclemm`) cover all write paths. You may also need a writable `/tmp` via an `emptyDir` volume.
Verify each finding against the current code and only fix it if needed. In `@deploy/k8s/deployment.yaml` around lines 31 - 48, The deployment currently runs the container "pinnaclemm" as root; add both a pod-level securityContext and a container-level securityContext for the "pinnaclemm" container: set runAsNonRoot: true with runAsUser (e.g., 1000) and runAsGroup, set fsGroup at pod-level, enforce readOnlyRootFilesystem: true at the container level, disallow privilege escalation (allowPrivilegeEscalation: false), and drop all capabilities (capabilities: drop: ["ALL"]); also add a seccompProfile (runtime/default) at pod-level. Because readOnlyRootFilesystem prevents writes, add emptyDir volumes and mounts for any writable paths the app needs (e.g., /data/pinnaclemm, /var/log/pinnaclemm, /tmp) and reference those volumeMounts in the "pinnaclemm" container so the app retains required write access.main.cpp-236-244 (1)
236-244: ⚠️ Potential issue | 🟠 Major
Hardcoded relative paths for audit log, data directory, and backups ignore K8s environment variables.

The K8s manifest (deployment.yaml) sets `PINNACLE_DATA_DIR=/data/pinnaclemm` and `PINNACLE_LOG_DIR=/var/log/pinnaclemm`, but `main.cpp` uses hardcoded relative paths that are never read:

- Line 238: `"logs/audit.log"` for the audit logger
- Line 249: `"data"` for the persistence directory
- Line 306: `"data/backups"` for disaster recovery

In the K8s pod, these resolve to the container's working directory rather than the mounted volumes (/data/pinnaclemm PVC and /var/log/pinnaclemm emptyDir), causing logs and state data to be lost on pod restart.
Read the environment variables and construct absolute paths at initialization.
Also applies to: 249-249, 304-307
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@main.cpp` around lines 236 - 244, The code currently uses hardcoded relative paths (e.g., auditLogger.initialize("logs/audit.log")) which ignores the Kubernetes-set env vars; update initialization to read PINNACLE_LOG_DIR and PINNACLE_DATA_DIR via std::getenv (fallback to safe defaults), construct absolute paths using std::filesystem::path, and pass those paths to AuditLogger::initialize (replace "logs/audit.log"), the persistence directory creation (replace "data"), and the backups directory (replace "data/backups"); ensure you use the resolved log dir + "audit.log" for auditLogger.initialize, setCurrentSession remains the same, and create/ensure directories exist before use.
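A small sketch of the environment-variable resolution described above, keeping the current relative paths as fallbacks; the helper name is illustrative, and the startup wiring follows the calls mentioned in this comment.

```cpp
#include <cstdlib>
#include <filesystem>

// Resolve a directory from an environment variable, falling back to the
// existing relative path when the variable is unset or empty.
static std::filesystem::path dirFromEnv(const char* var, const char* fallback) {
  const char* value = std::getenv(var);
  return std::filesystem::path(value && *value ? value : fallback);
}

// At startup (sketch):
//   auto logDir  = dirFromEnv("PINNACLE_LOG_DIR", "logs");
//   auto dataDir = dirFromEnv("PINNACLE_DATA_DIR", "data");
//   std::filesystem::create_directories(dataDir / "backups");
//   auditLogger.initialize((logDir / "audit.log").string());
```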
🟡 Minor comments (17)
docs/RISK_MANAGEMENT.md-154-154 (1)
154-154: ⚠️ Potential issue | 🟡 Minor
Minor grammatical gap — missing subject.

"`isVaRBreached(portfolioValue)` can optionally be used as a pre-trade check" → "It can optionally be used as a pre-trade check."
Verify each finding against the current code and only fix it if needed. In `@docs/RISK_MANAGEMENT.md` at line 154, The sentence lacks a subject; change the line referencing the function name so it reads "It can optionally be used as a pre-trade check." — i.e., update the sentence containing `isVaRBreached(portfolioValue)` to prepend "It" so it becomes: "It can optionally be used as a pre-trade check."deploy/k8s/secret.yaml-1-6 (1)
1-6: ⚠️ Potential issue | 🟡 Minor
Add `deploy/k8s/secret.yaml` to `.gitignore` to prevent accidental credential commits.

The file comment warns against committing real credentials, but the file is tracked in git. A developer editing values in-place for local testing can accidentally stage and commit real credentials. The stronger pattern is:

- Rename this file to `secret.yaml.template` (tracked in git with placeholders).
- Add `deploy/k8s/secret.yaml` to `.gitignore` (the live file is never tracked).

Also, the personal attribution `# WARNING FROM CHIZY:` on line 1 should be `# WARNING:` or `# IMPORTANT:` for institutional consistency in shared infrastructure files.
Verify each finding against the current code and only fix it if needed. In `@deploy/k8s/secret.yaml` around lines 1 - 6, The tracked secret file contains a personal header "WARNING FROM CHIZY:" and placeholder credentials; rename the tracked file to secret.yaml.template (keep placeholders) and create an untracked live file named secret.yaml for local use, then add secret.yaml to .gitignore so live credentials are never committed; also replace the header line "WARNING FROM CHIZY:" with a neutral institutional header like "WARNING:" (or "IMPORTANT:") in the template.docs/RISK_MANAGEMENT.md-327-333 (1)
327-333: ⚠️ Potential issue | 🟡 Minor
Source Files table is missing the "Lines" column data (MD056).

The header declares three columns (`| File | Lines | Purpose |`) but every data row only has two cells, dropping the `Lines` column entirely. Most Markdown renderers will either misalign or elide the rightmost column.

📝 Proposed fix

```diff
-| File | Lines | Purpose |
+| File | Purpose |
 |---|---|---|
-| `core/risk/RiskConfig.h` | Config structs and JSON serialization |
-| `core/risk/RiskManager.h/.cpp` | Pre-trade checks, position tracking, auto-hedging |
-| `core/risk/CircuitBreaker.h/.cpp` | Market circuit breaker state machine |
-| `core/risk/VaREngine.h/.cpp` | Value at Risk with Monte Carlo |
-| `core/risk/AlertManager.h/.cpp` | Alert system with throttling |
-| `core/risk/DisasterRecovery.h/.cpp` | State persistence and backup management |
+| `core/risk/RiskConfig.h` | Config structs and JSON serialization |
+| `core/risk/RiskManager.h/.cpp` | Pre-trade checks, position tracking, auto-hedging |
+| `core/risk/CircuitBreaker.h/.cpp` | Market circuit breaker state machine |
+| `core/risk/VaREngine.h/.cpp` | Value at Risk with Monte Carlo |
+| `core/risk/AlertManager.h/.cpp` | Alert system with throttling |
+| `core/risk/DisasterRecovery.h/.cpp` | State persistence and backup management |
```

Alternatively, populate the `Lines` column with the actual line counts if known.
Verify each finding against the current code and only fix it if needed. In `@docs/RISK_MANAGEMENT.md` around lines 327 - 333, The markdown table in RISK_MANAGEMENT.md declares three columns in the header but each row only has two cells, which breaks table alignment; update the table header and each data row (the rows for core/risk/RiskConfig.h, RiskManager.h/.cpp, CircuitBreaker.h/.cpp, VaREngine.h/.cpp, AlertManager.h/.cpp, DisasterRecovery.h/.cpp) to include a third "Lines" cell—either populate with actual line counts or a placeholder like "-"—so every row matches the header and the table renders correctly.tests/performance/RiskCheckBenchmark.cpp-39-53 (1)
39-53: ⚠️ Potential issue | 🟡 Minor
Stale target comment: "< 100 ns" vs. documented 750 ns.

Line 42 states "Target: < 100 ns per call" but the documented benchmark result in `PERFORMANCE_BENCHMARKS.md` (line 116) shows `BM_RiskCheckOrder` at 750 ns. The target should be updated to reflect realistic expectations (~1 μs range given 8 sequential atomic loads).
Verify each finding against the current code and only fix it if needed. In `@tests/performance/RiskCheckBenchmark.cpp` around lines 39 - 53, Update the stale target comment in BM_RiskCheckOrder to reflect the realistic benchmark (~750 ns / ~1 μs) instead of "< 100 ns": edit the comment block above the BM_RiskCheckOrder function in tests/performance/RiskCheckBenchmark.cpp (referencing BM_RiskCheckOrder and RiskManager::checkOrder) to state the new target (e.g., "~750 ns (~1 μs) per call") and, if helpful, add a short note that the value comes from PERFORMANCE_BENCHMARKS.md measured results.docs/PERFORMANCE_BENCHMARKS.md-113-120 (1)
113-120: ⚠️ Potential issue | 🟡 Minor
Fenced code block missing language specifier.

The code block at line 113 lacks a language annotation (e.g., `` ```text ``), which triggers the MD040 markdownlint warning. All other benchmark output blocks in this file have the same pattern, but since the linter flags this one, consider adding a language tag for consistency.
Verify each finding against the current code and only fix it if needed. In `@docs/PERFORMANCE_BENCHMARKS.md` around lines 113 - 120, The fenced code block containing the benchmark output (the block with entries like BM_RiskCheckOrder, BM_CircuitBreakerCheck, BM_OnFill, BM_OnPnLUpdate) is missing a language specifier; update that fence to include a language tag (e.g., ```text or ```console) so it matches the other benchmark blocks and resolves the MD040 lint warning, ensuring the change is applied to the specific fenced block that wraps those benchmark lines.core/risk/RiskConfig.h-156-166 (1)
156-166: ⚠️ Potential issue | 🟡 Minor
`confidenceLevel95` and `confidenceLevel99` are not deserialized from JSON.

`VaRConfig` defines `confidenceLevel95` and `confidenceLevel99` fields (lines 74–75), but `fromJson` does not read them from the `"var"` section, and `toJson` (lines 221–226) does not write them. This means these parameters are not configurable via the JSON config file and will always use their defaults (0.95 and 0.99).

If this is intentional (confidence levels are fixed), consider removing the fields from `VaRConfig` or documenting the rationale. If they should be configurable, add them to both `fromJson` and `toJson`.

Proposed fix for fromJson

```diff
   config.var.varLimitPct = v.value("var_limit_pct", config.var.varLimitPct);
+  config.var.confidenceLevel95 =
+      v.value("confidence_level_95", config.var.confidenceLevel95);
+  config.var.confidenceLevel99 =
+      v.value("confidence_level_99", config.var.confidenceLevel99);
 }
```

Proposed fix for toJson

```diff
     {"var",
      {{"window_size", var.windowSize},
       {"simulation_count", var.simulationCount},
       {"horizon", var.horizon},
       {"update_interval_ms", var.updateIntervalMs},
-      {"var_limit_pct", var.varLimitPct}}},
+      {"var_limit_pct", var.varLimitPct},
+      {"confidence_level_95", var.confidenceLevel95},
+      {"confidence_level_99", var.confidenceLevel99}}},
```
Verify each finding against the current code and only fix it if needed. In `@core/risk/RiskConfig.h` around lines 156 - 166, VaRConfig's confidenceLevel95 and confidenceLevel99 are never deserialized/serialized; update the fromJson routine that reads the "var" section to set config.var.confidenceLevel95 and config.var.confidenceLevel99 (using v.value(..., currentValue)) and also update the toJson routine that emits the "var" section to write confidenceLevel95 and confidenceLevel99 from config.var into the JSON output; reference the VaRConfig struct and the existing config.var usage in fromJson/toJson to add these two fields so they become configurable.core/risk/VaREngine.h-62-65 (1)
62-65: ⚠️ Potential issue | 🟡 Minor
Remove unused `m_resultsMutex` — it is dead code.

The double-buffering pattern with the `m_activeBuffer` atomic provides complete synchronization for lock-free reads and writes. All methods accessing `m_results` (`getLatestResult`, `isVaRBreached`, `getCurrentVaR95Pct`, `getCurrentVaR99Pct`, `toJson`, and `calculationLoop`) use only the atomic variable with proper acquire/release memory ordering. The mutex is never referenced in the implementation and should be deleted.
Verify each finding against the current code and only fix it if needed. In `@core/risk/VaREngine.h` around lines 62 - 65, Remove the unused mutex member m_resultsMutex from the VaREngine class: it is dead code because the double-buffering uses the atomic m_activeBuffer with acquire/release semantics to synchronize access to m_results[2]; delete the declaration of std::mutex m_resultsMutex and any unused includes or references tied to it, and ensure no methods (getLatestResult, isVaRBreached, getCurrentVaR95Pct, getCurrentVaR99Pct, toJson, calculationLoop) reference m_resultsMutex to avoid unused-symbol or include warnings.core/risk/VaREngine.cpp-317-333 (1)
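For context, a minimal sketch of the double-buffer publication scheme the comment describes (single background writer, lock-free readers); the struct and member names are illustrative, not the exact `VaREngine` types.

```cpp
#include <atomic>
#include <cstddef>

struct VaRResult {
  double historicalVaR95{0.0};
  std::size_t sampleCount{0};
};

// Writer fills the inactive slot, then flips the index with release ordering;
// readers load the index with acquire ordering. Assumes a single writer
// thread (the calculation loop), which is why no mutex is required.
class ResultBuffer {
public:
  VaRResult getLatestResult() const {
    return m_results[m_activeBuffer.load(std::memory_order_acquire)];
  }

  void publish(const VaRResult& r) {
    const std::size_t next = 1 - m_activeBuffer.load(std::memory_order_relaxed);
    m_results[next] = r;                                     // write inactive slot
    m_activeBuffer.store(next, std::memory_order_release);   // flip
  }

private:
  VaRResult m_results[2];
  std::atomic<std::size_t> m_activeBuffer{0};
};
```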
317-333: ⚠️ Potential issue | 🟡 Minor
Comment/code mismatch: says "population" stddev, implements sample stddev (N−1).

Line 318 says "population" but line 332 divides by `data.size() - 1`, which is the Bessel-corrected sample standard deviation. For VaR estimation from a sample, N−1 is typically correct — update the comment.

Proposed fix

```diff
-// Helper: standard deviation (population)
+// Helper: standard deviation (sample, Bessel-corrected)
```
Verify each finding against the current code and only fix it if needed. In `@core/risk/VaREngine.cpp` around lines 317 - 333, The comment above VaREngine::calculateStdDev is incorrect: it says "population" but the implementation uses the sample (Bessel-corrected) standard deviation dividing by data.size() - 1; update the comment to state "sample standard deviation (N-1)" or otherwise make the wording consistent with the implementation so readers of calculateStdDev see it computes the sample stddev using N-1.tests/unit/VaREngineTests.cpp-164-168 (1)
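For reference, the sample (Bessel-corrected) standard deviation that the N−1 divisor implies, shown here as a small free function for illustration:

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Sample standard deviation: divide the squared deviations by N-1, not N.
double sampleStdDev(const std::vector<double>& data) {
  if (data.size() < 2) return 0.0;
  const double mean =
      std::accumulate(data.begin(), data.end(), 0.0) / data.size();
  double sumSq = 0.0;
  for (double x : data) sumSq += (x - mean) * (x - mean);
  return std::sqrt(sumSq / (data.size() - 1));  // Bessel correction
}
```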
164-168: ⚠️ Potential issue | 🟡 Minor
ToJson key assertion is overly permissive and may mask regressions.

The test accepts three different key names (`historicalVaR95`, `historical_var_95`, `var_95`). The actual implementation in `VaREngine::toJson()` uses `"historical_var_95"`. Pin the assertion to the real key so a rename is caught as a breaking change.

Proposed fix

```diff
-  EXPECT_TRUE(j.contains("historicalVaR95") ||
-              j.contains("historical_var_95") || j.contains("var_95"));
+  EXPECT_TRUE(j.contains("historical_var_95"));
```
Verify each finding against the current code and only fix it if needed. In `@tests/unit/VaREngineTests.cpp` around lines 164 - 168, The test's ToJson key assertion is too permissive; update the assertion in VaREngineTests (the EXPECT_TRUE block checking j.contains(...)) to require the exact key used by VaREngine::toJson(), "historical_var_95", instead of allowing "historicalVaR95" or "var_95"; keep the subsequent EXPECT_FALSE(j.dump().empty()) check unchanged so serialization is still validated.tests/unit/VaREngineTests.cpp-86-93 (1)
86-93: ⚠️ Potential issue | 🟡 Minor
Test silently passes when no calculation has run.

The `if (result.sampleCount > 0)` guard means that if the background thread doesn't complete a cycle within 300 ms (e.g., under CI load), the `EXPECT_NEAR` is skipped entirely and the test reports success with no actual assertion. The same pattern applies to `ParametricVaR` (line 110).

Consider calling `calculateAll()` directly (or exposing a blocking trigger) instead of relying on sleep timing, or use `ASSERT_GT(result.sampleCount, 0u)` before the value check so the test fails visibly rather than silently skipping the assertion.

Proposed fix

```diff
-  if (result.sampleCount > 0) {
-    EXPECT_NEAR(std::abs(result.historicalVaR95), 0.01645, 0.005);
-  }
+  ASSERT_GT(result.sampleCount, 0u) << "Background calculation did not run in time";
+  EXPECT_NEAR(std::abs(result.historicalVaR95), 0.01645, 0.005);
```
Verify each finding against the current code and only fix it if needed. In `@tests/unit/VaREngineTests.cpp` around lines 86 - 93, Test currently can silently pass because the `if (result.sampleCount > 0)` guard skips assertions when no background calculation finished; replace that non-failing guard with a hard check so the test fails visibly or trigger the calculation directly. Specifically, for the block using `engine.getLatestResult()` and `result.sampleCount`, either call the synchronous trigger `calculateAll()` (or expose/invoke the blocking trigger used by the engine) before fetching `getLatestResult()`, or replace `if (result.sampleCount > 0)` with `ASSERT_GT(result.sampleCount, 0u)` followed by the `EXPECT_NEAR` so the test fails when no samples exist; apply the same change to the `ParametricVaR` test at the analogous check.docs/DISASTER_RECOVERY.md-324-342 (1)
324-342: ⚠️ Potential issue | 🟡 Minor
Fenced code block missing language specifier.

The code block starting at line 324 has no language tag. Adding a specifier (e.g., `` ```text ``) satisfies markdownlint MD040 and enables syntax highlighting.
Verify each finding against the current code and only fix it if needed. In `@docs/DISASTER_RECOVERY.md` around lines 324 - 342, The fenced code block containing the event patterns (starting with "# Circuit breaker events" and lines like "circuit.*breaker.*tripped", "recoverState.*Success", "daily.*loss.*limit", "exchange.*disconnect") lacks a language specifier; update the opening fence from ``` to a labeled fence such as ```text (or ```yaml/```ini if preferred) so markdownlint MD040 is satisfied and syntax highlighting is enabled for that block.docs/DISASTER_RECOVERY.md-278-282 (1)
278-282: ⚠️ Potential issue | 🟡 Minor
Readiness probe description overstates what the implementation actually checks.

The doc states the readiness check verifies "recovered state, connected to the exchange, and is ready to accept traffic." The actual implementation (`WebServer.cpp` lines 1140-1142) only checks `!RiskManager.isHalted() && CircuitBreaker.isTradingAllowed()`. Exchange connectivity and recovery status are not verified. Update the description to match reality, or enhance the probe.
Verify each finding against the current code and only fix it if needed. In `@docs/DISASTER_RECOVERY.md` around lines 278 - 282, The readiness description is inaccurate: the GET /api/ready endpoint (documented in DISASTER_RECOVERY.md) currently only checks RiskManager.isHalted() and CircuitBreaker.isTradingAllowed() in WebServer.cpp, not exchange connectivity or recovery state; update the doc to reflect the actual checks (e.g., "returns 200 when the process is not halted and trading is allowed (uses RiskManager.isHalted() and CircuitBreaker.isTradingAllowed())") or alternatively modify the implementation in WebServer.cpp to include additional checks (add explicit Exchange.isConnected() and StateRecovery.isRecovered() checks alongside RiskManager.isHalted() and CircuitBreaker.isTradingAllowed()) so the description matches behavior.docs/DISASTER_RECOVERY.md-233-236 (1)
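If the probe is enhanced rather than the docs relaxed, a hedged sketch of the combined check might look like the following; `isConnected()` and `isRecovered()` are hypothetical accessors, while `isHalted()` and `isTradingAllowed()` are the calls the comment confirms exist.

```cpp
// Illustration only: aggregate readiness from several components, reporting
// the first failing reason so the probe response is diagnosable.
struct ReadinessStatus {
  bool ready;
  const char* reason;
};

template <typename Risk, typename Breaker, typename Exchange, typename Recovery>
ReadinessStatus checkReady(const Risk& risk, const Breaker& breaker,
                           const Exchange& exchange, const Recovery& recovery) {
  if (risk.isHalted()) return {false, "risk manager halted"};
  if (!breaker.isTradingAllowed()) return {false, "circuit breaker open"};
  if (!exchange.isConnected()) return {false, "exchange disconnected"};   // hypothetical
  if (!recovery.isRecovered()) return {false, "state not recovered"};     // hypothetical
  return {true, "ok"};
}
```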
233-236: ⚠️ Potential issue | 🟡 Minor
Endpoint path mismatch: doc says `/api/risk/status` but the code implements `/api/risk/state`.

Line 235 references `http://localhost:8081/api/risk/status`, but `WebServer.cpp` (line 870) routes `/api/risk/state`. This will 404 if followed literally.

Proposed fix

```diff
-  curl -s http://localhost:8081/api/risk/status | python3 -m json.tool
+  curl -s http://localhost:8081/api/risk/state | python3 -m json.tool
```
Verify each finding against the current code and only fix it if needed. In `@docs/DISASTER_RECOVERY.md` around lines 233 - 236, The docs call the endpoint /api/risk/status but the server registers /api/risk/state (see the route handler for GET "/api/risk/state" in WebServer.cpp), causing a 404; fix by updating the snippet in DISASTER_RECOVERY.md to use http://localhost:8081/api/risk/state (or, if you prefer the status name, rename the route in WebServer.cpp to "/api/risk/status" and adjust its handler accordingly) so the documentation and the WebServer route (GET "/api/risk/state") match.visualization/WebServer.cpp-1127-1138 (1)
1127-1138: ⚠️ Potential issue | 🟡 Minor
`uptime_ms` is not actual process uptime — it's `steady_clock` epoch millis.

`getCurrentMillis()` returns the time since the steady_clock epoch (typically system boot), not the duration since the process started. For a meaningful uptime field, subtract the process start time.

Sketch

Store the start time once (e.g., in `RestAPIServer::start()`) and subtract:

```diff
+// In RestAPIServer or a global
+static const uint64_t s_startMillis = utils::TimeUtils::getCurrentMillis();
+
 http::response<http::string_body> RestAPIServer::handleGetHealth() {
   json health = {{"status", "healthy"},
                  {"timestamp", utils::TimeUtils::getCurrentNanos()},
-                 {"uptime_ms", utils::TimeUtils::getCurrentMillis()}};
+                 {"uptime_ms", utils::TimeUtils::getCurrentMillis() - s_startMillis}};
```
Verify each finding against the current code and only fix it if needed. In `@visualization/WebServer.cpp` around lines 1127 - 1138, handleGetHealth currently uses utils::TimeUtils::getCurrentMillis() which is steady_clock epoch millis, not process uptime; fix by recording the process start time when RestAPIServer::start() runs (add a member like start_time_nanos or start_time_ms on RestAPIServer and set it in start()), then change RestAPIServer::handleGetHealth to compute uptime_ms as utils::TimeUtils::getCurrentMillis() (or getCurrentNanos()) minus the stored start time and emit that value for "uptime_ms" while keeping other fields (status, timestamp) unchanged; ensure types/units match (nanos vs millis) and adjust subtraction accordingly.docs/KUBERNETES_DEPLOYMENT.md-227-234 (1)
227-234: ⚠️ Potential issue | 🟡 Minor
Fenced code block missing a language specifier.

Static analysis (markdownlint MD040) flags this block. Adding a language identifier (e.g., `text` or `bash`) silences the warning and improves rendering.

Proposed fix

````diff
-```
+```text
 # Risk events
````
Verify each finding against the current code and only fix it if needed. In `@docs/KUBERNETES_DEPLOYMENT.md` around lines 227 - 234, The fenced code block under the "# Risk events" section in KUBERNETES_DEPLOYMENT.md lacks a language specifier, triggering markdownlint MD040; update the triple-backtick opener for that block (the block containing "# Risk events", "REJECTED_", "CLOSED -> OPEN", "Manual trip", "EMERGENCY SAVE", "halt") to include a language identifier such as text (e.g., change ``` to ```text) so the linter warning is silenced and rendering is improved.docs/KUBERNETES_DEPLOYMENT.md-139-143 (1)
139-143: ⚠️ Potential issue | 🟡 Minor
Health probe configuration values don't match `deployment.yaml`.

The documentation table lists liveness as `initialDelaySeconds: 30, periodSeconds: 10` and readiness as `initialDelaySeconds: 10, periodSeconds: 5`, but the actual manifest (`deploy/k8s/deployment.yaml` Lines 49-64) specifies the opposite: liveness `initialDelaySeconds: 15, periodSeconds: 30` and readiness `initialDelaySeconds: 5, periodSeconds: 10`.

Also, the endpoint column conflates the port with the path (e.g., `GET /api/health:8081`).

Proposed fix

```diff
-| **Liveness** | `GET /api/health:8081` | `initialDelaySeconds: 30`, `periodSeconds: 10` |
-| **Readiness** | `GET /api/ready:8081` | `initialDelaySeconds: 10`, `periodSeconds: 5` |
+| **Liveness** | `GET /api/health` (port 8081) | `initialDelaySeconds: 15`, `periodSeconds: 30` |
+| **Readiness** | `GET /api/ready` (port 8081) | `initialDelaySeconds: 5`, `periodSeconds: 10` |
```
Verify each finding against the current code and only fix it if needed. In `@docs/KUBERNETES_DEPLOYMENT.md` around lines 139 - 143, The table in docs/KUBERNETES_DEPLOYMENT.md is inconsistent with the manifest: update the Liveness and Readiness rows so the health probe names and their initialDelaySeconds/periodSeconds match the manifest values (set Liveness initialDelaySeconds: 15, periodSeconds: 30 and Readiness initialDelaySeconds: 5, periodSeconds: 10) and fix the Endpoint column to separate path and port (use `GET /api/health` and `GET /api/ready` in the Endpoint column and add a separate Port column or remove the :8081 suffix) so the docs correctly reflect the liveness/readiness probes and ports used by the deployment.core/risk/DisasterRecovery.cpp-535-589 (1)
535-589: ⚠️ Potential issue | 🟡 Minor
`emergencySave` does not verify write success after `flush()`.

Lines 557-558 and 572-573 write and flush but never check `ofs.good()` or `ofs.fail()`. A disk-full or I/O error would silently produce a truncated file, and `success` would remain `true`. This is especially important for the emergency-save path since it's the last chance to persist state before shutdown.

Proposed fix

```diff
   ofs << riskState.dump(2);
   ofs.flush();
+  if (!ofs.good()) {
+    spdlog::error("DisasterRecovery: emergency risk state write error");
+    success = false;
+  }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/risk/DisasterRecovery.cpp` around lines 535 - 589, The emergencySave function writes and flushes both risk and strategy state but never verifies the ofstream status afterward, so I/O errors (disk full, truncated writes) can be missed; in DisasterRecovery::emergencySave, after each ofs.flush() (the risk state write around getRiskStatePath() and the strategy state write around getStrategyStatePath()), check ofs.fail() or !ofs.good() and if true set success = false and log an spdlog::error with the path and errno/strerror detail (or e.what() if available) so the function accurately reports failure; apply this same verification pattern to both write blocks and ensure success is only true when both writes report good() afterwards.
Actionable comments posted: 9
🧹 Nitpick comments (3)
Dockerfile (1)
125-127: `/app/data/backups` is ephemeral without a `VOLUME` declaration or mounted persistent volume.

The `DisasterRecovery` component writes backups to `/app/data/backups`. Without a `VOLUME` instruction (or a Kubernetes `PersistentVolumeClaim` mount), these backups are discarded on every container restart, which defeats the purpose of disaster recovery. The same applies to `/app/logs` if log persistence across restarts is required.

📦 Suggested VOLUME declarations

```diff
 RUN mkdir -p /app/logs /app/data /app/data/backups
+
+VOLUME ["/app/logs", "/app/data"]
```
Consider running the application as a non-root user.
The runtime stage has no `USER` directive, so the market-maker process runs as `root` inside the container. This is a security risk, especially for a production financial application deployed to Kubernetes (K8s manifests are also introduced in this PR). Adding a dedicated non-root user is straightforward:

🔒 Suggested non-root user setup

```diff
 WORKDIR /app
+RUN groupadd -r pinnaclemm && useradd -r -g pinnaclemm pinnaclemm
 
 COPY --from=builder /app/build/pinnaclemm /app/
 COPY --from=builder /app/config/ /app/config/
 
 RUN mkdir -p /app/logs /app/data /app/data/backups
+RUN chown -R pinnaclemm:pinnaclemm /app
+USER pinnaclemm
```
Verify each finding against the current code and only fix it if needed. In `@Dockerfile` around lines 125 - 127, The Dockerfile creates /app/logs and /app/data/backups but does not declare them as persistent volumes and also does not drop root; add VOLUME declarations for "/app/data/backups" (and "/app/logs" if you need log persistence) so backups written by the DisasterRecovery component are persisted, and add a non-root user (via a dedicated user creation and a USER instruction) in the runtime stage so the market-maker process does not run as root; ensure ownership of the volume paths is transferred to that user (chown) before switching to USER.core/risk/DisasterRecovery.cpp (1)
1-8: Remove unused `#include "RiskManager.h"` at line 3.

The `RiskManager` symbol is never referenced in this translation unit.
Verify each finding against the current code and only fix it if needed. In `@core/risk/DisasterRecovery.cpp` around lines 1 - 8, The file includes an unused header "RiskManager.h" (the RiskManager symbol is never referenced in DisasterRecovery.cpp); remove the line `#include` "RiskManager.h" from DisasterRecovery.cpp to eliminate the unnecessary dependency and recompile to ensure no missing symbols.core/risk/RiskManager.cpp (1)
98-174: Lock-free reads of `m_limits` in `checkOrder()` are technically undefined behavior (data race with `updateLimits()`).

`m_limits` is protected by `m_stateMutex` on write (`updateLimits`, `initialize`) but is read throughout `checkOrder()` without any lock (lines 101, 110–111, 121, 130, 138, 151, 165–167). Concurrent mutation via `updateLimits()` creates a data race per the C++ memory model, which is UB regardless of the underlying platform's atomicity guarantees for `double`.

The comment at lines 98–100 documents the trade-off for the rate-limit field only. Consider replacing `m_limits` with a `std::atomic<std::shared_ptr<RiskLimits>>` (C++20) or a seqlock, or at minimum extending the "best-effort" comment to cover all limit fields in the hot path so the trade-off is explicit.
Verify each finding against the current code and only fix it if needed. In `@core/risk/RiskManager.cpp` around lines 98 - 174, The checkOrder() function currently performs lock-free reads of m_limits which is a data race; change m_limits to an atomic shared pointer (e.g., std::atomic<std::shared_ptr<RiskLimits>>) and update initialize() and updateLimits() to atomically store a new shared_ptr, then at the start of checkOrder() do a single atomic load of the shared_ptr into a local const auto limits = ... and replace all direct m_limits.X accesses with limits->X so checkOrder() reads a consistent snapshot without holding m_stateMutex; alternatively implement a seqlock around m_limits but be sure updateLimits()/initialize() write under the writer path and checkOrder() reads the snapshot atomically.
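A minimal sketch of the snapshot approach, using the C++11 atomic free functions for shared_ptr rather than the C++20 std::atomic<std::shared_ptr<T>> specialization; the type and member names below are illustrative, not the PR's:
#include <atomic>
#include <cstdint>
#include <memory>

struct LimitsSnapshot {  // illustrative stand-in for RiskLimits
  double maxOrderSize{0.0};
  double maxDailyVolume{0.0};
  uint32_t maxOrdersPerSecond{0};
};

class LimitsHolder {
public:
  void update(const LimitsSnapshot& next) {
    // Writers publish a fresh immutable snapshot atomically.
    std::atomic_store_explicit(&m_limits,
                               std::make_shared<const LimitsSnapshot>(next),
                               std::memory_order_release);
  }

  bool checkOrder(double quantity) const {
    // Hot path: one atomic load; every field read then comes from the same
    // consistent snapshot, so there is no data race with update().
    const auto limits =
        std::atomic_load_explicit(&m_limits, std::memory_order_acquire);
    return quantity <= limits->maxOrderSize;
  }

private:
  std::shared_ptr<const LimitsSnapshot> m_limits{
      std::make_shared<const LimitsSnapshot>()};
};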
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@core/risk/DisasterRecovery.cpp`:
- Around line 169-171: The current fixed absolute epsilon (epsilon = 1e-8) used
to set result.positionsMatch by comparing std::abs(result.discrepancy) < epsilon
is too fragile across position magnitudes; change the comparison in
DisasterRecovery.cpp to use a magnitude-aware tolerance instead (e.g., compute a
relative tolerance based on the larger of the two position magnitudes or a
configurable per-symbol threshold, with a small absolute floor to avoid zero):
calculate tolerance = max(relativeTolerance * max(abs(expectedPosition),
abs(actualPosition)), absoluteFloor) or fetch a symbol-specific tolerance, then
set result.positionsMatch = std::abs(result.discrepancy) < tolerance; keep
references to result.discrepancy and result.positionsMatch and add configuration
lookup or parameter for per-symbol thresholds if available.
- Around line 27-46: The initialize method in DisasterRecovery currently assigns
m_backupDirectory before attempting to create it, so on failure the object
appears configured; modify DisasterRecovery::initialize to either clear
m_backupDirectory (set to empty) or set a new member flag m_initialized=false
before returning from the catch block, and set m_initialized=true only after
successful directory creation and logging; additionally add checks at the start
of public operations such as DisasterRecovery::saveRiskState and
DisasterRecovery::createBackup to verify m_initialized (or non-empty
m_backupDirectory) and return/throw an error if not initialized. Ensure
references to m_backupDirectory and the new m_initialized flag are updated
throughout the class to prevent using an invalid path.
- Around line 246-251: The metadata write block that opens
std::ofstream(backupPath + "/backup_meta.json") silently ignores open/write
failures; if ofs.is_open() is false the function still returns success causing
listBackups() to mark the backup invalid. Update the write logic around the
ofstream in DisasterRecovery.cpp so that when opening the file fails or
writing/meta.dump() results in stream errors you log a clear error (including
backupPath) and return false (or propagate an error) instead of proceeding; also
check ofs.fail()/ofs.good() after the write and flush to ensure the write
succeeded before logging success or returning true. Ensure references to
backup_meta.json, ofs, backupPath and listBackups() are used to locate and
update the correct code paths.
- Around line 51-98: The saveRiskState implementation can leave on-disk state
inconsistent and orphaned .tmp files; modify saveRiskState so it writes both
temp files first, then perform the renames in a guarded two-phase step with
cleanup and rollback: write riskTmpPath and strategyTmpPath (as you already do),
then before renaming create backups of existing riskPath/strategyPath (e.g.,
riskPath + ".bak", strategyPath + ".bak") and rename originals to backups, then
rename both tmp files to the final paths, and if any rename fails catch the
exception and restore from backups (or remove newly moved files) and always
remove any leftover .tmp files and .bak files in all error paths; ensure the
bfs::rename calls and exception handlers reference
getRiskStatePath/getStrategyStatePath, riskTmpPath/strategyTmpPath, and perform
directory fsync / flush where available to reduce window, and log errors via
spdlog and return false on failure.
- Around line 379-400: The deleteBackup/createBackup/restoreBackup flow uses
getBackupPath(label) which joins m_backupDirectory with an attacker-controlled
label; fix by validating/sanitizing label before any filesystem ops: either
reject labels containing '/', '\\' or any ".." segments (use
InputValidator::validateString with PATH_TRAVERSAL_PATTERNS) OR compute the
absolute/lexically_normalized path from getBackupPath(label) and assert it has
m_backupDirectory as a prefix (compare normalized paths) before calling
bfs::remove_all or bfs::copy_file; apply the same check in deleteBackup,
createBackup and restoreBackup to prevent escaping the backup directory.
In `@core/risk/RiskManager.cpp`:
- Around line 248-250: m_dailyPnL is being set equal to m_totalPnL in
onPnLUpdate which makes daily loss tracking wrong after reset; instead record a
snapshot of total PnL at day start (e.g. add m_totalPnLAtDayStart or
m_totalPnLSnapshot, updated in resetDaily()) and change onPnLUpdate to store
newPnL into m_totalPnL then compute m_dailyPnL = newPnL - m_totalPnLAtDayStart
(using the same atomic memory order semantics as current stores) so auto-halt
and pre-trade checks compare a true day-relative P&L.
- Around line 504-526: fromJson currently can throw nlohmann::json::type_error
on type-mismatched fields and uses an unsigned int literal for
orders_this_second; wrap the body of RiskManager::fromJson in a try/catch that
catches std::exception (or nlohmann::json::type_error) and logs the error via
spdlog::error, then restore safe defaults for the atomics (m_position,
m_totalPnL, m_peakPnL, m_dailyPnL, m_dailyVolume, m_netExposure,
m_grossExposure, m_halted, m_ordersThisSecond, m_currentSecond) and guarded
members (m_haltReason, m_dailyResetTime) to ensure initialization continues;
also change the default for "orders_this_second" to an explicit uint32_t{0} and
for "current_second"/"daily_reset_time" to uint64_t{0} to avoid implicit
narrowing.
- Around line 549-568: checkDailyReset() currently reads m_dailyResetTime under
m_stateMutex then releases the lock before calling resetDaily(), allowing two
threads to both pass the check and perform resets; fix by claiming the reset
slot while holding m_stateMutex: inside RiskManager::checkDailyReset(), after
computing todayMs, acquire the lock, if m_dailyResetTime < todayMs set
m_dailyResetTime = todayMs (or now) and then perform the daily-zeroing while
still holding the mutex (either call resetDaily() while holding m_stateMutex or
inline the resets to m_dailyPnL, m_dailyVolume, m_ordersThisSecond), then
release the lock so no other thread can race to zero counters for the same day.
- Around line 364-384: hedgeLoop currently blocks in std::this_thread::sleep_for
for up to m_limits.hedgeIntervalMs and cannot be interrupted by clearing
m_hedgeRunning; replace that sleep with an interruptible wait using a
std::condition_variable (e.g. m_hedgeCv) and an associated mutex (m_hedgeMutex):
wait_for on m_hedgeCv for intervalMs with a predicate that checks
!m_hedgeRunning (or simply check return value) so the loop wakes immediately
when shutdown is requested; add m_hedgeCv and m_hedgeMutex to the RiskManager
header alongside m_hedgeRunning, and ensure the destructor / initialize path
that clears m_hedgeRunning calls m_hedgeCv.notify_all() (before joining the
thread) so the waiting hedgeLoop exits and join() returns promptly, then
continue to call evaluateHedge() only when still running.
---
Duplicate comments:
In `@Dockerfile`:
- Around line 131-133: Remove the redundant "|| exit 1" from the HEALTHCHECK
command: the HEALTHCHECK line using curl -f already returns a non-zero exit code
on HTTP errors, so edit the HEALTHCHECK declaration (the line containing
"HEALTHCHECK --interval=30s ... CMD curl -f http://localhost:8081/api/health ||
exit 1") to call curl -f without the trailing "|| exit 1".
---
Nitpick comments:
In `@core/risk/DisasterRecovery.cpp`:
- Around line 1-8: The file includes an unused header "RiskManager.h" (the
RiskManager symbol is never referenced in DisasterRecovery.cpp); remove the line
#include "RiskManager.h" from DisasterRecovery.cpp to eliminate the unnecessary
dependency and recompile to ensure no missing symbols.
In `@core/risk/RiskManager.cpp`:
- Around line 98-174: The checkOrder() function currently performs lock-free
reads of m_limits which is a data race; change m_limits to an atomic shared
pointer (e.g., std::atomic<std::shared_ptr<RiskLimits>>) and update initialize()
and updateLimits() to atomically store a new shared_ptr, then at the start of
checkOrder() do a single atomic load of the shared_ptr into a local const auto
limits = ... and replace all direct m_limits.X accesses with limits->X so
checkOrder() reads a consistent snapshot without holding m_stateMutex;
alternatively implement a seqlock around m_limits but be sure
updateLimits()/initialize() write under the writer path and checkOrder() reads
the snapshot atomically.
In `@Dockerfile`:
- Around line 125-127: The Dockerfile creates /app/logs and /app/data/backups
but does not declare them as persistent volumes and also does not drop root; add
VOLUME declarations for "/app/data/backups" (and "/app/logs" if you need log
persistence) so backups written by the DisasterRecovery component are persisted,
and add a non-root user (via a dedicated user creation and a USER instruction)
in the runtime stage so the market-maker process does not run as root; ensure
ownership of the volume paths is transferred to that user (chown) before
switching to USER.
void DisasterRecovery::initialize(const std::string& backupDirectory) {
  std::lock_guard<std::mutex> lock(m_mutex);
  m_backupDirectory = backupDirectory;

  try {
    if (!bfs::exists(m_backupDirectory)) {
      bfs::create_directories(m_backupDirectory);
      spdlog::info("DisasterRecovery: created backup directory: {}",
                   m_backupDirectory);
    }
  } catch (const bfs::filesystem_error& e) {
    spdlog::error("DisasterRecovery: failed to create backup directory {}: {}",
                  m_backupDirectory, e.what());
    return;
  }

  spdlog::info("DisasterRecovery initialized - backupDir={}",
               m_backupDirectory);
  AUDIT_SYSTEM_EVENT("DisasterRecovery initialized", true);
}
m_backupDirectory is set even when directory creation fails.
On the error path (line 40 return), m_backupDirectory has already been assigned at line 29. Every subsequent call (saveRiskState, createBackup, etc.) will silently proceed against a non-existent directory, turning what should be an init-time failure into a series of cryptic runtime errors.
Consider resetting m_backupDirectory to empty (or throwing) before returning, and guarding all operations with an m_initialized flag:
🛡️ Proposed fix
void DisasterRecovery::initialize(const std::string& backupDirectory) {
std::lock_guard<std::mutex> lock(m_mutex);
- m_backupDirectory = backupDirectory;
try {
+ m_backupDirectory = backupDirectory;
if (!bfs::exists(m_backupDirectory)) {
bfs::create_directories(m_backupDirectory);
spdlog::info("DisasterRecovery: created backup directory: {}",
m_backupDirectory);
}
} catch (const bfs::filesystem_error& e) {
spdlog::error("DisasterRecovery: failed to create backup directory {}: {}",
- m_backupDirectory, e.what());
+ backupDirectory, e.what());
+ m_backupDirectory.clear(); // leave object in a known unusable state
return;
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/risk/DisasterRecovery.cpp` around lines 27 - 46, The initialize method
in DisasterRecovery currently assigns m_backupDirectory before attempting to
create it, so on failure the object appears configured; modify
DisasterRecovery::initialize to either clear m_backupDirectory (set to empty) or
set a new member flag m_initialized=false before returning from the catch block,
and set m_initialized=true only after successful directory creation and logging;
additionally add checks at the start of public operations such as
DisasterRecovery::saveRiskState and DisasterRecovery::createBackup to verify
m_initialized (or non-empty m_backupDirectory) and return/throw an error if not
initialized. Ensure references to m_backupDirectory and the new m_initialized
flag are updated throughout the class to prevent using an invalid path.
bool DisasterRecovery::saveRiskState(const nlohmann::json& riskState,
                                     const nlohmann::json& strategyState) {
  std::lock_guard<std::mutex> lock(m_mutex);

  try {
    // Atomic write for risk state: write to temp file, then rename
    std::string riskPath = getRiskStatePath();
    std::string riskTmpPath = riskPath + ".tmp";

    {
      std::ofstream ofs(riskTmpPath, std::ios::trunc);
      if (!ofs.is_open()) {
        spdlog::error("DisasterRecovery: failed to open tmp file for risk "
                      "state: {}",
                      riskTmpPath);
        return false;
      }
      ofs << riskState.dump(2);
      ofs.flush();
    }

    bfs::rename(riskTmpPath, riskPath);

    // Atomic write for strategy state
    std::string strategyPath = getStrategyStatePath();
    std::string strategyTmpPath = strategyPath + ".tmp";

    {
      std::ofstream ofs(strategyTmpPath, std::ios::trunc);
      if (!ofs.is_open()) {
        spdlog::error("DisasterRecovery: failed to open tmp file for strategy "
                      "state: {}",
                      strategyTmpPath);
        return false;
      }
      ofs << strategyState.dump(2);
      ofs.flush();
    }

    bfs::rename(strategyTmpPath, strategyPath);

    spdlog::debug("DisasterRecovery: saved risk and strategy state");
    return true;
  } catch (const std::exception& e) {
    spdlog::error("DisasterRecovery: failed to save risk state: {}", e.what());
    return false;
  }
}
Two-file save is not atomic: state divergence and orphaned .tmp on partial failure.
Two separate issues:
- State divergence: if the risk-state rename succeeds (line 72) but the strategy-state write then fails (lines 80–84 or line 90 throws), the on-disk files are left inconsistent — risk state reflects the new snapshot, strategy state the old one. A restart will load mismatched states.
- Orphaned .tmp: on every error path, the partially-written strategy_state.json.tmp (and potentially risk_state.json.tmp) is never cleaned up.
🛡️ Proposed fix
+ // Stage both files before committing either (minimises divergence window)
std::string riskPath = getRiskStatePath();
std::string riskTmpPath = riskPath + ".tmp";
+ std::string strategyPath = getStrategyStatePath();
+ std::string strategyTmpPath = strategyPath + ".tmp";
+
+ // Helper lambda to remove tmp files on failure
+ auto cleanupTmps = [&]() {
+ bfs::remove(bfs::path(riskTmpPath));
+ bfs::remove(bfs::path(strategyTmpPath));
+ };
{
std::ofstream ofs(riskTmpPath, std::ios::trunc);
if (!ofs.is_open()) {
spdlog::error(...);
+ cleanupTmps();
return false;
}
ofs << riskState.dump(2);
ofs.flush();
}
- bfs::rename(riskTmpPath, riskPath);
- std::string strategyPath = getStrategyStatePath();
- std::string strategyTmpPath = strategyPath + ".tmp";
{
std::ofstream ofs(strategyTmpPath, std::ios::trunc);
if (!ofs.is_open()) {
spdlog::error(...);
+ cleanupTmps();
return false;
}
ofs << strategyState.dump(2);
ofs.flush();
}
+ // Both tmp files written successfully — commit atomically
+ bfs::rename(riskTmpPath, riskPath);
bfs::rename(strategyTmpPath, strategyPath);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/risk/DisasterRecovery.cpp` around lines 51 - 98, The saveRiskState
implementation can leave on-disk state inconsistent and orphaned .tmp files;
modify saveRiskState so it writes both temp files first, then perform the
renames in a guarded two-phase step with cleanup and rollback: write riskTmpPath
and strategyTmpPath (as you already do), then before renaming create backups of
existing riskPath/strategyPath (e.g., riskPath + ".bak", strategyPath + ".bak")
and rename originals to backups, then rename both tmp files to the final paths,
and if any rename fails catch the exception and restore from backups (or remove
newly moved files) and always remove any leftover .tmp files and .bak files in
all error paths; ensure the bfs::rename calls and exception handlers reference
getRiskStatePath/getStrategyStatePath, riskTmpPath/strategyTmpPath, and perform
directory fsync / flush where available to reduce window, and log errors via
spdlog and return false on failure.
// Use a small epsilon for floating-point comparison
constexpr double epsilon = 1e-8;
result.positionsMatch = std::abs(result.discrepancy) < epsilon;
Fixed absolute epsilon 1e-8 is fragile across position magnitudes.
For large notional positions (e.g. 10,000,000 USD lots expressed as a double), normal IEEE-754 rounding can produce discrepancies exceeding 1e-8, causing spurious mismatch alerts. Conversely, for some instruments a discrepancy of 1e-8 units is a genuine error worth flagging. A single hardcoded absolute threshold does not generalize across asset classes.
Consider a relative epsilon or a configurable per-symbol threshold:
🛡️ Suggested approach
- constexpr double epsilon = 1e-8;
- result.positionsMatch = std::abs(result.discrepancy) < epsilon;
+ // Use a relative epsilon: allow up to 1e-9 relative error, with an
+ // absolute floor of 1e-12 to handle near-zero positions.
+ const double scale = std::max(std::abs(localPosition),
+ std::abs(exchangePosition));
+ const double threshold = std::max(1e-12, scale * 1e-9);
+ result.positionsMatch = std::abs(result.discrepancy) < threshold;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/risk/DisasterRecovery.cpp` around lines 169 - 171, The current fixed
absolute epsilon (epsilon = 1e-8) used to set result.positionsMatch by comparing
std::abs(result.discrepancy) < epsilon is too fragile across position
magnitudes; change the comparison in DisasterRecovery.cpp to use a
magnitude-aware tolerance instead (e.g., compute a relative tolerance based on
the larger of the two position magnitudes or a configurable per-symbol
threshold, with a small absolute floor to avoid zero): calculate tolerance =
max(relativeTolerance * max(abs(expectedPosition), abs(actualPosition)),
absoluteFloor) or fetch a symbol-specific tolerance, then set
result.positionsMatch = std::abs(result.discrepancy) < tolerance; keep
references to result.discrepancy and result.positionsMatch and add configuration
lookup or parameter for per-symbol thresholds if available.
{
  std::ofstream ofs(backupPath + "/backup_meta.json");
  if (ofs.is_open()) {
    ofs << meta.dump(2);
  }
}
Silent metadata write failure leaves backup marked invalid in listBackups().
If ofs.is_open() is false at line 248, the metadata file is never written but the function still returns true and logs success. listBackups() will subsequently mark this backup as info.valid = false (line 356) because it cannot read the metadata, while the backup itself is structurally complete. The caller has no indication anything went wrong.
🛡️ Proposed fix
{
std::ofstream ofs(backupPath + "/backup_meta.json");
- if (ofs.is_open()) {
- ofs << meta.dump(2);
- }
+ if (!ofs.is_open()) {
+ spdlog::warn("DisasterRecovery: failed to write metadata for backup '{}'",
+ label);
+ } else {
+ ofs << meta.dump(2);
+ }
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/risk/DisasterRecovery.cpp` around lines 246 - 251, The metadata write
block that opens std::ofstream(backupPath + "/backup_meta.json") silently
ignores open/write failures; if ofs.is_open() is false the function still
returns success causing listBackups() to mark the backup invalid. Update the
write logic around the ofstream in DisasterRecovery.cpp so that when opening the
file fails or writing/meta.dump() results in stream errors you log a clear error
(including backupPath) and return false (or propagate an error) instead of
proceeding; also check ofs.fail()/ofs.good() after the write and flush to ensure
the write succeeded before logging success or returning true. Ensure references
to backup_meta.json, ofs, backupPath and listBackups() are used to locate and
update the correct code paths.
bool DisasterRecovery::deleteBackup(const std::string& label) {
  std::lock_guard<std::mutex> lock(m_mutex);

  try {
    std::string backupPath = getBackupPath(label);

    if (!bfs::exists(backupPath)) {
      spdlog::warn("DisasterRecovery: backup '{}' does not exist", label);
      return false;
    }

    bfs::remove_all(backupPath);

    spdlog::info("DisasterRecovery: backup '{}' deleted", label);
    AUDIT_SYSTEM_EVENT("Backup deleted: " + label, true);
    return true;
  } catch (const std::exception& e) {
    spdlog::error("DisasterRecovery: failed to delete backup '{}': {}", label,
                  e.what());
    return false;
  }
}
🧩 Analysis chain
🏁 Script executed:
# First, let's locate and examine the DisasterRecovery.cpp file
fd -e cpp -name "*DisasterRecovery*" | head -20
Length of output: 291

🏁 Script executed:
# Search for getBackupPath implementation
rg -n "getBackupPath" --type=cpp -A 3 -B 1
Length of output: 1811

🏁 Script executed:
# Check all backup-related functions and how they use labels
rg -n "createBackup|restoreBackup|deleteBackup|listBackups" --type=cpp -A 5 -B 1
Length of output: 5434

🏁 Script executed:
# Check for REST API endpoints related to backups
rg -n "backup|Backup" --type=cpp -i | grep -i "route\|endpoint\|handler\|post\|get\|rest" | head -30
Length of output: 1992

🏁 Script executed:
# Look for any existing validation of label parameter
rg -n "label.*find|label.*empty|label.*valid" --type=cpp -i
Length of output: 43

🏁 Script executed:
# Search for REST API server setup, routes, endpoints
rg -i "route|handler|endpoint|server|http|rest|api" --type=cpp | grep -i "backup\|createBackup\|restoreBackup\|deleteBackup" | head -30
Length of output: 1293

🏁 Script executed:
# Look for any Web framework or HTTP server code
fd -e cpp -e h | xargs grep -l "http\|request\|route\|handler" | head -20
Length of output: 671

🏁 Script executed:
# Search for all occurrences of createBackup, restoreBackup, deleteBackup to find all call sites
rg "createBackup|restoreBackup|deleteBackup" --type=cpp
Length of output: 973

🏁 Script executed:
# Check what m_backupDirectory is set to and if there's path canonicalization
rg -n "m_backupDirectory" --type=cpp -B 2 -A 2
Length of output: 5073

🏁 Script executed:
# Verify the actual behavior of Boost.Filesystem path concatenation with absolute paths
cat > /tmp/test_bfs.cpp << 'EOF'
#include <boost/filesystem.hpp>
#include <iostream>
namespace bfs = boost::filesystem;
int main() {
  // Test 1: Absolute path on right side
  bfs::path base("/data/backups");
  bfs::path abs_right("/etc/passwd");
  std::cout << "Test 1 - base / absolute:\n";
  std::cout << "  Result: " << (base / abs_right).string() << "\n";
  // Test 2: Relative path with ..
  bfs::path rel_right("../../etc/cron.d");
  std::cout << "Test 2 - base / relative_with_dotdot:\n";
  std::cout << "  Result: " << (base / rel_right).string() << "\n";
  return 0;
}
EOF
# Note: compilation not available, but we can reason about it from Boost documentation
Length of output: 43

🏁 Script executed:
# Search for any input validation or sanitization functions
rg "sanitize|validate|clean|escape" --type=cpp -i | head -20
Length of output: 1555

🏁 Script executed:
# Check RestAPI or handlers directory if it exists
fd -type d -name "*api*" -o -name "*rest*" -o -name "*handler*" -o -name "*endpoint*"
Length of output: 231

🏁 Script executed:
# Check WebServer.cpp for backup-related endpoints
cat core/risk/DisasterRecovery.cpp | head -50
Length of output: 1720

🏁 Script executed:
# Look at WebServer.cpp to see what endpoints it exposes
cat visualization/WebServer.cpp | grep -i "backup\|route\|get\|post" | head -40
Length of output: 2216

🏁 Script executed:
# Search for where the backup functions might be called from HTTP handlers
rg "createBackup|restoreBackup|deleteBackup" --type=cpp -B 5 -A 5
Length of output: 6409

🏁 Script executed:
# Check if there are any REST endpoints that might call backup functions indirectly
rg "disasterRecovery\." --type=cpp | head -30
Length of output: 237

🏁 Script executed:
# Verify the exact Boost.Filesystem behavior by checking its documentation context
rg "bfs::" --type=cpp | grep "/" | head -20
Length of output: 1643

🏁 Script executed:
# Check the complete createBackup, restoreBackup, and deleteBackup implementations
sed -n '192,263p' core/risk/DisasterRecovery.cpp
Length of output: 2652

🏁 Script executed:
# Check complete deleteBackup implementation
sed -n '379,400p' core/risk/DisasterRecovery.cpp
Length of output: 713

🏁 Script executed:
# Search for any path canonicalization or validation before using paths from getBackupPath
rg "canonical|normalize|realpath" --type=cpp
Length of output: 267

🏁 Script executed:
# Check if there's any input validation utility being used
cat core/utils/InputValidator.cpp | head -100
Length of output: 3956

🏁 Script executed:
# Search for any tests or documentation about how bfs::path handles absolute paths
rg "bfs::path\|boost::filesystem" --type=cpp -A 2 | grep -E "(parent_path|absolute|canonical|/.*/)" | head -20
Length of output: 43

🏁 Script executed:
# Check if there's anywhere in the codebase that applies path safety checks
rg "parent_path|starts_with|contains" --type=cpp | head -20
Length of output: 1557
Path traversal vulnerability via unsanitized label in backup operations — potential arbitrary directory deletion.
getBackupPath(label) at line 589 directly concatenates the label parameter onto m_backupDirectory using bfs::path(m_backupDirectory) / label with no validation. This is exploitable:
- If label is an absolute path like "/etc/cron.d", Boost.Filesystem's / operator replaces the entire left-hand path with the absolute right operand, allowing access outside the backup directory.
- If label contains "../" sequences (e.g., "../../etc/cron.d"), path normalization resolves outside the intended backup directory.
All three backup operations pass unsanitized labels to getBackupPath:
- deleteBackup calls bfs::remove_all(backupPath) — an attacker-controlled path can recursively delete arbitrary directories.
- createBackup also calls bfs::remove_all() when overwriting existing backups.
- restoreBackup calls bfs::copy_file() with computed paths — can read from/write to arbitrary locations.
Currently, labels are only provided via hardcoded system-generated values (main.cpp:624 uses "shutdown_" + timestamp, tests use fixed labels). However, the codebase exposes no validation despite providing InputValidator::validateString() with PATH_TRAVERSAL_PATTERNS in core/utils/InputValidator.cpp.
Sanitize backup labels immediately: Reject labels containing /, \, or ".." sequences. Alternatively, verify the resolved backup path remains within m_backupDirectory using path prefix checks.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/risk/DisasterRecovery.cpp` around lines 379 - 400, The
deleteBackup/createBackup/restoreBackup flow uses getBackupPath(label) which
joins m_backupDirectory with an attacker-controlled label; fix by
validating/sanitizing label before any filesystem ops: either reject labels
containing '/', '\\' or any ".." segments (use InputValidator::validateString
with PATH_TRAVERSAL_PATTERNS) OR compute the absolute/lexically_normalized path
from getBackupPath(label) and assert it has m_backupDirectory as a prefix
(compare normalized paths) before calling bfs::remove_all or bfs::copy_file;
apply the same check in deleteBackup, createBackup and restoreBackup to prevent
escaping the backup directory.
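A minimal sketch of the validation described above; the helper name isSafeBackupLabel is hypothetical and not part of this PR:
#include <string>

// Hypothetical helper (not in this PR): reject labels that could escape the
// backup directory before any filesystem operation runs.
bool isSafeBackupLabel(const std::string& label) {
  return !label.empty() && label.find('/') == std::string::npos &&
         label.find('\\') == std::string::npos &&
         label.find("..") == std::string::npos;
}

// Usage sketch inside deleteBackup/createBackup/restoreBackup:
//   if (!isSafeBackupLabel(label)) {
//     spdlog::error("DisasterRecovery: rejected unsafe backup label '{}'", label);
//     return false;
//   }
//   std::string backupPath = getBackupPath(label);
A belt-and-braces variant would additionally normalize the result of getBackupPath(label) and require m_backupDirectory as a path prefix, as the prompt above suggests.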
// Update daily PnL -- we store the total PnL as the daily value here;
// callers can provide the session-relative PnL.
m_dailyPnL.store(newPnL, std::memory_order_release);
m_dailyPnL is always identical to m_totalPnL, making daily loss tracking unreliable post-reset.
Line 228 stores newPnL into m_totalPnL; line 250 stores the same newPnL into m_dailyPnL. After resetDaily() zeros m_dailyPnL, the very next onPnLUpdate(pnl) call sets m_dailyPnL = pnl again — which, if pnl is a cumulative (session-total) value rather than a day-relative delta, means m_dailyPnL no longer reflects only today's P&L.
The auto-halt at line 270 and the pre-trade check at line 139 both key off m_dailyPnL / newPnL against dailyLossLimit. If newPnL is cumulative, both checks are testing total cumulative loss against a per-day limit.
This should be clarified or fixed: either (a) callers must always pass a session-relative daily PnL to onPnLUpdate (document this contract explicitly), or (b) separate m_totalPnL and m_dailyPnL so onPnLUpdate increments only m_totalPnL, while m_dailyPnL is computed as m_totalPnL − snapshotAtDayStart.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/risk/RiskManager.cpp` around lines 248 - 250, m_dailyPnL is being set
equal to m_totalPnL in onPnLUpdate which makes daily loss tracking wrong after
reset; instead record a snapshot of total PnL at day start (e.g. add
m_totalPnLAtDayStart or m_totalPnLSnapshot, updated in resetDaily()) and change
onPnLUpdate to store newPnL into m_totalPnL then compute m_dailyPnL = newPnL -
m_totalPnLAtDayStart (using the same atomic memory order semantics as current
stores) so auto-halt and pre-trade checks compare a true day-relative P&L.
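A minimal sketch of option (b) above, assuming onPnLUpdate receives a cumulative (session-total) value; the class and member names (notably m_totalPnLAtDayStart) are hypothetical, not the PR's:
#include <atomic>

class DailyPnLTrackerSketch {
public:
  void onPnLUpdate(double totalPnL) {
    m_totalPnL.store(totalPnL, std::memory_order_release);
    // Daily figure is the delta from the snapshot taken at the last daily reset.
    m_dailyPnL.store(
        totalPnL - m_totalPnLAtDayStart.load(std::memory_order_acquire),
        std::memory_order_release);
  }

  void resetDaily() {
    // New day: rebase the snapshot to the current cumulative PnL.
    m_totalPnLAtDayStart.store(m_totalPnL.load(std::memory_order_acquire),
                               std::memory_order_release);
    m_dailyPnL.store(0.0, std::memory_order_release);
  }

  double dailyPnL() const { return m_dailyPnL.load(std::memory_order_acquire); }

private:
  std::atomic<double> m_totalPnL{0.0};
  std::atomic<double> m_totalPnLAtDayStart{0.0};
  std::atomic<double> m_dailyPnL{0.0};
};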
void RiskManager::hedgeLoop() {
  spdlog::info("Hedge loop started");

  while (m_hedgeRunning.load(std::memory_order_acquire)) {
    uint64_t intervalMs;
    {
      std::lock_guard<std::mutex> lock(m_stateMutex);
      intervalMs = m_limits.hedgeIntervalMs;
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(intervalMs));

    if (!m_hedgeRunning.load(std::memory_order_acquire)) {
      break;
    }

    evaluateHedge();
  }

  spdlog::info("Hedge loop stopped");
}
hedgeLoop uses a non-interruptible sleep_for, stalling shutdown by up to intervalMs.
Line 374 sleeps the full intervalMs before rechecking m_hedgeRunning. When m_hedgeRunning is cleared in the destructor or initialize(), the hedge thread cannot be woken; the caller blocks in m_hedgeThread.join() for up to intervalMs. For production configurations with a long hedge interval this makes shutdown/re-initialization unresponsive.
Replace the bare sleep_for with a condition_variable::wait_for signalled on shutdown:
⚡ Proposed fix: interruptible sleep via `condition_variable`
Add to the header (alongside m_hedgeRunning, m_hedgeMutex):
+ std::condition_variable m_hedgeCV;

In RiskManager.cpp:
// Signal hedge thread to stop
m_hedgeRunning.store(false, std::memory_order_release);
+m_hedgeCV.notify_all();
if (m_hedgeThread.joinable()) {
m_hedgeThread.join();
}

void RiskManager::hedgeLoop() {
spdlog::info("Hedge loop started");
while (m_hedgeRunning.load(std::memory_order_acquire)) {
uint64_t intervalMs;
{
std::lock_guard<std::mutex> lock(m_stateMutex);
intervalMs = m_limits.hedgeIntervalMs;
}
- std::this_thread::sleep_for(std::chrono::milliseconds(intervalMs));
+ {
+ std::unique_lock<std::mutex> lk(m_hedgeMutex);
+ m_hedgeCV.wait_for(lk, std::chrono::milliseconds(intervalMs),
+ [this] { return !m_hedgeRunning.load(std::memory_order_acquire); });
+ }
if (!m_hedgeRunning.load(std::memory_order_acquire)) {
break;
}
evaluateHedge();
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/risk/RiskManager.cpp` around lines 364 - 384, hedgeLoop currently blocks
in std::this_thread::sleep_for for up to m_limits.hedgeIntervalMs and cannot be
interrupted by clearing m_hedgeRunning; replace that sleep with an interruptible
wait using a std::condition_variable (e.g. m_hedgeCv) and an associated mutex
(m_hedgeMutex): wait_for on m_hedgeCv for intervalMs with a predicate that
checks !m_hedgeRunning (or simply check return value) so the loop wakes
immediately when shutdown is requested; add m_hedgeCv and m_hedgeMutex to the
RiskManager header alongside m_hedgeRunning, and ensure the destructor /
initialize path that clears m_hedgeRunning calls m_hedgeCv.notify_all() (before
joining the thread) so the waiting hedgeLoop exits and join() returns promptly,
then continue to call evaluateHedge() only when still running.
void RiskManager::fromJson(const nlohmann::json& j) {
  m_position.store(j.value("current_position", 0.0), std::memory_order_relaxed);
  m_totalPnL.store(j.value("total_pnl", 0.0), std::memory_order_relaxed);
  m_peakPnL.store(j.value("peak_pnl", 0.0), std::memory_order_relaxed);
  m_dailyPnL.store(j.value("daily_pnl", 0.0), std::memory_order_relaxed);
  m_dailyVolume.store(j.value("daily_volume", 0.0), std::memory_order_relaxed);
  m_netExposure.store(j.value("net_exposure", 0.0), std::memory_order_relaxed);
  m_grossExposure.store(j.value("gross_exposure", 0.0),
                        std::memory_order_relaxed);
  m_halted.store(j.value("is_halted", false), std::memory_order_relaxed);
  m_ordersThisSecond.store(j.value("orders_this_second", 0u),
                           std::memory_order_relaxed);
  m_currentSecond.store(j.value("current_second", uint64_t{0}),
                        std::memory_order_relaxed);

  {
    std::lock_guard<std::mutex> lock(m_stateMutex);
    m_haltReason = j.value("halt_reason", std::string{});
    m_dailyResetTime = j.value("daily_reset_time", uint64_t{0});
  }

  spdlog::info("RiskManager state restored from JSON");
}
fromJson() has no exception handling; a type-mismatched JSON value will propagate an unhandled nlohmann::json::type_error.
j.value<T>(key, default) throws nlohmann::json::type_error when the stored JSON value cannot be converted to T. For example, restoring state from a corrupted snapshot or a manually edited file will terminate the risk manager's initialization path ungracefully. Wrap the body in a try/catch or validate types before reading.
Additionally, j.value("orders_this_second", 0u) uses unsigned int as the default, while the target atomic is uint32_t; prefer an explicit uint32_t{0} cast for clarity and to avoid implicit narrowing on platforms where unsigned int is wider than 32 bits.
🛡️ Proposed fix
void RiskManager::fromJson(const nlohmann::json& j) {
+ try {
m_position.store(j.value("current_position", 0.0), std::memory_order_relaxed);
m_totalPnL.store(j.value("total_pnl", 0.0), std::memory_order_relaxed);
m_peakPnL.store(j.value("peak_pnl", 0.0), std::memory_order_relaxed);
m_dailyPnL.store(j.value("daily_pnl", 0.0), std::memory_order_relaxed);
m_dailyVolume.store(j.value("daily_volume", 0.0), std::memory_order_relaxed);
m_netExposure.store(j.value("net_exposure", 0.0), std::memory_order_relaxed);
m_grossExposure.store(j.value("gross_exposure", 0.0), std::memory_order_relaxed);
m_halted.store(j.value("is_halted", false), std::memory_order_relaxed);
- m_ordersThisSecond.store(j.value("orders_this_second", 0u),
+ m_ordersThisSecond.store(j.value("orders_this_second", uint32_t{0}),
std::memory_order_relaxed);
m_currentSecond.store(j.value("current_second", uint64_t{0}),
std::memory_order_relaxed);
{
std::lock_guard<std::mutex> lock(m_stateMutex);
m_haltReason = j.value("halt_reason", std::string{});
m_dailyResetTime = j.value("daily_reset_time", uint64_t{0});
}
spdlog::info("RiskManager state restored from JSON");
+ } catch (const nlohmann::json::exception& e) {
+ spdlog::error("RiskManager::fromJson failed: {}", e.what());
+ throw;
+ }
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/risk/RiskManager.cpp` around lines 504 - 526, fromJson currently can
throw nlohmann::json::type_error on type-mismatched fields and uses an unsigned
int literal for orders_this_second; wrap the body of RiskManager::fromJson in a
try/catch that catches std::exception (or nlohmann::json::type_error) and logs
the error via spdlog::error, then restore safe defaults for the atomics
(m_position, m_totalPnL, m_peakPnL, m_dailyPnL, m_dailyVolume, m_netExposure,
m_grossExposure, m_halted, m_ordersThisSecond, m_currentSecond) and guarded
members (m_haltReason, m_dailyResetTime) to ensure initialization continues;
also change the default for "orders_this_second" to an explicit uint32_t{0} and
for "current_second"/"daily_reset_time" to uint64_t{0} to avoid implicit
narrowing.
void RiskManager::checkDailyReset() {
  // Use system_clock so we can detect calendar-day boundaries
  auto now = std::chrono::system_clock::now();
  auto today = std::chrono::floor<std::chrono::days>(now);
  uint64_t todayMs = static_cast<uint64_t>(
      std::chrono::duration_cast<std::chrono::milliseconds>(
          today.time_since_epoch())
          .count());

  uint64_t lastReset;
  {
    std::lock_guard<std::mutex> lock(m_stateMutex);
    lastReset = m_dailyResetTime;
  }

  // If the last reset was before today's midnight, do a reset
  if (lastReset < todayMs) {
    resetDaily();
  }
}
Race condition: two concurrent threads can both read lastReset < todayMs before either updates m_dailyResetTime, causing a double daily reset that silently erases fills.
The check and the update are not atomic:
- Thread A reads lastReset = T₀ (under lock), releases lock, evaluates T₀ < todayMs → true.
- Thread B reads lastReset = T₀ (same stale value), releases lock, evaluates T₀ < todayMs → true.
- Thread A calls resetDaily(): zeros m_dailyPnL, m_dailyVolume, m_ordersThisSecond, sets m_dailyResetTime = now.
- Fills occur between A's reset and B's entry into resetDaily() — they increment m_dailyVolume and m_dailyPnL.
- Thread B calls resetDaily(): zeros the counters again, erasing those fills from daily tracking.
With checkDailyReset() called on every fill (line 221), the race window around the UTC midnight crossing is non-trivial in a concurrent HFT system and can cause daily P&L and volume limits to be silently bypassed.
Fix: Claim the reset slot atomically by updating m_dailyResetTime inside the mutex before releasing it, so only one thread proceeds:
🔒 Proposed fix: atomic claim inside lock
void RiskManager::checkDailyReset() {
auto now = std::chrono::system_clock::now();
auto today = std::chrono::floor<std::chrono::days>(now);
uint64_t todayMs = static_cast<uint64_t>(
std::chrono::duration_cast<std::chrono::milliseconds>(
today.time_since_epoch())
.count());
- uint64_t lastReset;
- {
+ bool shouldReset = false;
+ {
std::lock_guard<std::mutex> lock(m_stateMutex);
- lastReset = m_dailyResetTime;
- }
-
- // If the last reset was before today's midnight, do a reset
- if (lastReset < todayMs) {
- resetDaily();
+ if (m_dailyResetTime < todayMs) {
+ m_dailyResetTime = todayMs; // claim the reset slot; only one thread wins
+ shouldReset = true;
+ }
}
+
+ if (shouldReset) {
+ // Atomics can be reset outside the lock
+ m_dailyPnL.store(0.0, std::memory_order_release);
+ m_dailyVolume.store(0.0, std::memory_order_release);
+ m_ordersThisSecond.store(0, std::memory_order_release);
+ spdlog::info("Daily risk counters reset");
+ AUDIT_SYSTEM_EVENT("Daily risk counters reset", true);
+ }
}With this change resetDaily() is still available as an explicit public API, but checkDailyReset() no longer delegates to it — it directly updates the atomics after claiming the slot, preventing any double-reset.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@core/risk/RiskManager.cpp` around lines 549 - 568, checkDailyReset()
currently reads m_dailyResetTime under m_stateMutex then releases the lock
before calling resetDaily(), allowing two threads to both pass the check and
perform resets; fix by claiming the reset slot while holding m_stateMutex:
inside RiskManager::checkDailyReset(), after computing todayMs, acquire the
lock, if m_dailyResetTime < todayMs set m_dailyResetTime = todayMs (or now) and
then perform the daily-zeroing while still holding the mutex (either call
resetDaily() while holding m_stateMutex or inline the resets to m_dailyPnL,
m_dailyVolume, m_ordersThisSecond), then release the lock so no other thread can
race to zero counters for the same day.
# [1.5.0](v1.4.0...v1.5.0) (2026-02-20)

### Features

* implement risk management and (wip: production readiness) ([#53](#53)) ([5019642](5019642))

🎉 This PR is included in version 1.5.0 🎉 The release is available on GitHub release. Your semantic-release bot 📦🚀
Summary
- core/risk/ module with 5 components: RiskManager, CircuitBreaker, VaREngine, AlertManager, and DisasterRecovery
- Kubernetes manifests (deploy/k8s/) for production deployment

Summary by CodeRabbit
New Features
Infrastructure
Documentation
Tests
Chores