feat: operator doppelgänger protection with slot-based detection #692
Add configuration options for operator doppelgänger protection:
- --operator-dg: Enable/disable the feature (default: true)
- --operator-dg-wait-epochs: Epochs to wait in monitor mode (default: 2)
- --operator-dg-fresh-k: Freshness threshold for twin detection (default: 3)

This implements the configuration layer for issue sigp#627, allowing operators to detect if their operator ID is already active elsewhere and shut down to prevent equivocation.

Related to sigp#627
Add a new service module for detecting operator doppelgängers:
- State machine with Monitor/Active modes
- Track recent max consensus height per committee
- Freshness threshold (K) to prevent false positives from replays
- Check for single-signer messages with own operator ID
- Comprehensive unit tests for state machine logic

The service detects if the operator's ID appears in fresh SSV messages during monitor mode, indicating another instance is already running.

Part of sigp#627
Integrates the operator doppelgänger protection with the message flow:
- Adds DoppelgangerConfig struct to message_receiver for checking messages
- NetworkMessageReceiver checks QBFT messages for doppelgänger detection
- Client initializes doppelgänger service when operator_dg is enabled
- Fatal shutdown triggered via TaskExecutor when twin operator detected
- Spawns background task to listen for shutdown signal
- Tests updated to use correct CommitteeId constructor ([u8; 32])
- Allows clippy::too_many_arguments for NetworkMessageReceiver::new

This implements the core detection and shutdown mechanism specified in sigp#627 (comment)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
default_value_t = 2,
requires = "operator_dg"
)]
pub operator_dg_wait_epochs: u64,
we are not actually waiting anywhere
it's a bit weird that when we use this, we need to set --operator-dg=true
It's still a heavily drafted PR; it is not ready for review
Follows codebase pattern of handling slot_clock.now() returning None explicitly rather than silently falling back to Epoch 0. The current epoch is now required as a parameter to the service constructor, following the pattern used by other services in the codebase.
Add explicit error handling when slot_clock.now() returns None. If we can't read the current slot, we can't reliably determine the epoch or update the mode, so we skip the doppelgänger check and log a warning.
Replace RwLock with Mutex for DoppelgangerState since all operations are fast (HashMap lookups/updates) and the RwLock complexity isn't justified.

Changes:
- Replace Arc<RwLock<DoppelgangerState>> with Arc<Mutex<DoppelgangerState>>
- Add update_and_check_freshness() method that atomically updates max height and checks freshness in one lock acquisition
- Make update_max_height() and is_fresh() private helper methods
- Simplify check_message() logic by removing drop/re-acquire pattern

This provides cleaner API surface with better separation of concerns:
- State handles data operations
- Service handles policy decisions
Stale messages during doppelgänger monitoring are expected (network delays, replays) and not actionable. Using debug level reduces noise while keeping the information available for troubleshooting.
The enabled check is already done in lib.rs - we only create the service if operator_dg is true and impostor mode is disabled. No need to store and check an enabled flag in the service itself.
Add high-value tests covering core check_message functionality:
- Twin detection with fresh single-signer messages
- Stale message filtering (beyond fresh_k window)
- Multi-signer aggregate message handling
- Different operator ID filtering
- Monitoring period transitions
- Freshness window boundaries
- Independent committee height tracking
- Height progression scenarios

Total: 14 service tests + 5 state tests = 19 passing tests
…ring wait

Addresses PR review feedback about waiting for monitoring period to complete before starting validator services.

Changes:
- Extract ~50 lines of initialization code into `initialize_operator_doppelganger()` helper
- Add `watch::Receiver<bool>` to broadcast monitoring status (similar to `is_synced` pattern)
- Add `spawn_monitor_task()` method to encapsulate background epoch monitoring task
- Add explicit `transition_to_active()` method for state transition
- Client now waits for monitoring completion before starting services (matches sync wait pattern)
- Remove `update_mode()` in favor of explicit `set_active()` call

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Improvements:
- Add error logging for watch channel send failures
- Remove unused parameters from DoppelgangerState::new()
- Document hardcoded 12-second sleep duration with note about network flexibility
- Change #[allow(dead_code)] to #[cfg(test)] for test-only methods
- Add #[must_use] annotations to methods returning important values
- Remove unused Epoch import
- Remove redundant test_monitoring_mode_transition test (duplicated by test_no_twin_after_monitoring_period_ends)

All 18 tests passing, no compiler warnings.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
The background monitoring task was using a hardcoded 12-second sleep interval. This change parameterizes the slot duration, deriving it from the network specification (spec.seconds_per_slot) to support different networks with varying slot durations.

Changes:
- Add slot_duration field to OperatorDoppelgangerService struct
- Update new() signature to accept slot_duration parameter
- Update spawn_monitor_task() to use self.slot_duration
- Update all call sites to pass Duration::from_secs(spec.seconds_per_slot)
- Update tests to pass slot_duration from TestingSlotClock

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
…ations

The `update_and_check_freshness()` method had a semantic problem: it updated the height tracker even for stale messages (likely replays), and it checked freshness against an already-updated state.

This refactoring separates the operations:
1. Check if the message is fresh (against current state)
2. Only update height tracking if the message is fresh

Benefits:
- Clearer code: each operation does one thing
- More correct: we only track fresh messages, not stale replays
- No confusion about order of operations
- Better separation of concerns

Changes:
- Remove `update_and_check_freshness()` method from DoppelgangerState
- Make `is_fresh()` and `update_max_height()` public
- Update `check_message()` to call methods separately
- Add documentation clarifying usage order

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
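The check-then-update order described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `CommitteeId` alias, the `check_then_update` helper, and the exact freshness rule (a message is fresh when it is at most `fresh_k` heights behind the recorded maximum, and an unseen committee treats everything as fresh) are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the committee identifier (the PR mentions a `[u8; 32]` constructor).
type CommitteeId = [u8; 32];

#[derive(Default)]
pub struct DoppelgangerState {
    /// Highest consensus height recorded per committee; only fresh messages feed it.
    max_height: HashMap<CommitteeId, u64>,
}

impl DoppelgangerState {
    /// Assumed rule: fresh if the height is at most `fresh_k` behind the recorded maximum.
    /// A committee with no recorded height treats every message as fresh.
    pub fn is_fresh(&self, committee: &CommitteeId, height: u64, fresh_k: u64) -> bool {
        match self.max_height.get(committee) {
            Some(&max) => height.saturating_add(fresh_k) >= max,
            None => true,
        }
    }

    /// Raise the recorded maximum for this committee; never lowers it.
    pub fn update_max_height(&mut self, committee: CommitteeId, height: u64) {
        let entry = self.max_height.entry(committee).or_insert(height);
        if height > *entry {
            *entry = height;
        }
    }
}

/// Hypothetical helper: step 1 checks freshness against the current state,
/// step 2 updates the tracker only when the message was fresh.
pub fn check_then_update(
    state: &mut DoppelgangerState,
    committee: CommitteeId,
    height: u64,
    fresh_k: u64,
) -> bool {
    let fresh = state.is_fresh(&committee, height, fresh_k);
    if fresh {
        state.update_max_height(committee, height);
    }
    fresh
}
```

Because stale messages never reach `update_max_height`, a replayed old height cannot influence later freshness decisions in this sketch.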
Removed three low-value tests that don't add meaningful coverage:

1. test_monitoring_state_persists
   - Had misleading comment about message checking
   - Didn't test actual behavior (background task not started)
   - Redundant with test_no_twin_after_monitoring_period_ends
2. test_different_fresh_k_values
   - Only tested constructor accepts different K values
   - No verification of behavioral differences
   - Already covered by test_service_creation
3. test_increasing_heights_all_fresh
   - Tested obvious behavior (sequential heights are fresh)
   - Fully covered by test_freshness_window_boundary
   - Redundant sanity check with no additional value

All remaining tests cover meaningful scenarios and edge cases.

Test count: 18 → 15 tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
The test_initial_state test is completely redundant with test_mode_transition, which already verifies the initial state as part of testing the state transition.

Before:
- test_initial_state: checks starts in Monitor mode
- test_mode_transition: checks starts in Monitor, then transitions

After:
- test_mode_transition: covers both (initial state + transition)

Test count: 15 → 14 tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
That's actually a great idea! I don't know why we didn't think about this before 😅
…detection

Refactors the operator doppelgänger protection system to use slot comparison instead of time-based grace period, improving startup time and reliability.

Changes:

State Management:
- Replace DoppelgangerState enum with AtomicBool for simpler state tracking
- Add startup_slot field to track service initialization time
- Remove grace period logic (saves 411 seconds at startup)

Detection Logic:
- Implement slot-based detection: msg_slot > startup_slot indicates twin
- Messages at or before startup_slot are ignored (our own old messages)
- Add extract_message_slot() helper supporting both QBFT and PartialSignatureMessages
- No race conditions possible since all outgoing messages blocked during monitoring

Benefits:
- Faster startup: eliminates 411-second grace period wait time
- More reliable: slot comparison is deterministic, unaffected by network delays
- Simpler code: fewer states, cleaner logic with lock-free atomic operations
- Better edge case handling: restart in same slot, network delays, clock skew

Testing:
- Replace time-based async tests with slot-based sync tests
- All 8 tests pass verifying correct twin detection behavior
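A rough sketch of the slot comparison described above, using only the standard library. The struct layout, the `u64` aliases for `Slot` and `OperatorId`, the `end_monitoring` method, and the simplified `is_doppelganger(signers, msg_slot)` signature are all hypothetical; the single-signer condition is carried over from the earlier commits and may not match the final code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical stand-ins for the real `types::Slot` and operator ID types.
type Slot = u64;
type OperatorId = u64;

pub struct OperatorDoppelgangerService {
    own_operator_id: OperatorId,
    /// Slot at which this instance started; anything after it cannot be ours
    /// while outgoing messages are blocked.
    startup_slot: Slot,
    /// True while the monitoring period is running.
    monitoring: AtomicBool,
}

impl OperatorDoppelgangerService {
    pub fn new(own_operator_id: OperatorId, startup_slot: Slot) -> Self {
        Self {
            own_operator_id,
            startup_slot,
            monitoring: AtomicBool::new(true),
        }
    }

    /// Hypothetical name: called once the monitoring period has elapsed.
    pub fn end_monitoring(&self) {
        self.monitoring.store(false, Ordering::Relaxed);
    }

    /// A twin is suspected only while monitoring, for a single-signer message
    /// carrying our own operator ID, at a slot strictly after startup.
    pub fn is_doppelganger(&self, signers: &[OperatorId], msg_slot: Slot) -> bool {
        self.monitoring.load(Ordering::Relaxed)
            && signers.len() == 1
            && signers[0] == self.own_operator_id
            && msg_slot > self.startup_slot
    }
}
```

Messages at or before `startup_slot` fall through as "our own old messages", which is why a restart within the same slot does not trigger a false positive in this sketch.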
e943688 to 3ea5c20
Claude finished @diegomrsantos's task

Code Review: Operator Doppelgänger Protection

Status: ✅ Approve - Implementation is production-ready

Key Finding: Slot-Based Detection Already Implemented

The recent discussion (Oct 29) between @dknopik and @diegomrsantos about switching to slot-based detection is great news: the code already implements this approach!

@dknopik's suggestion:

Current implementation does exactly this:

The PR description even explicitly states "Slot-Based Detection" as the solution approach.

Comprehensive Review

✅ 1. Architecture & Design

Strengths:

Design Quality: The slot-based approach is fundamentally superior to time-based alternatives because:

✅ 2. Security Analysis

Detection Logic

The security model is sound:

Why this works:

Edge Cases Handled:
Consideration: The default of 2 epochs (768 seconds, roughly 13 minutes on mainnet) assumes twins will be active during our monitoring window. If a twin is intermittently offline (crashed, network issue), it might not be detected.

Recommendation: Document this assumption. The 2-epoch default is reasonable for catching persistent twins. For higher confidence, operators can increase the monitoring period.

✅ 3. Performance

Strengths:
Overhead Analysis:

Impact During Monitoring:

This is an acceptable tradeoff for preventing equivocation.

✅ 4. Test Coverage

Test Quality

Comprehensive test suite with proper async timer testing:

Core Scenarios Covered:

Test Quality:

Missing Scenarios (non-critical):

Verdict: Test coverage is solid for core functionality.

✅ 5. Code Quality

Strengths:

Documentation Highlights:

✅ 6. Configuration & Integration

CLI Configuration

--operator-dg (default: true)
--operator-dg-wait-epochs (default: 2)

Help text clearly explains behavior during monitoring.

Integration Points:

Integration Quality: Clean, minimal coupling, well-separated concerns.

Minor Suggestions (Non-Blocking)

Final Verdict

Status: ✅ Approve - Production-ready implementation

This is a well-designed, thoroughly tested implementation of operator doppelgänger protection using slot-based detection.

Key Strengths:

Addressing Recent Discussion: The suggestion from @dknopik (Oct 29) to use slot-based detection is already implemented! The code captures

Minor polish opportunities (non-blocking):

The PR is ready to merge. Excellent work on the simplification and implementation! 🎉
Improves the operator doppelgänger API by accepting already-validated messages instead of re-decoding PartialSignatureMessages from bytes.

API Changes:
- check_message() now accepts &ValidatedSSVMessage instead of Option<&QbftMessage>
- is_doppelganger() updated with same signature change
- extract_message_slot() simplified to pattern match on ValidatedSSVMessage enum

Benefits:
- No redundant SSZ decoding (was decoding PartialSignatureMessages twice)
- Clearer API semantics: pass the validated message directly
- Type-safe: can't pass mismatched QBFT/partial signature data
- Simpler call site: just pass &ssv_message instead of pattern matching
- Better logging: can extract QBFT context from ValidatedSSVMessage when needed

Implementation:
- Added message_validator dependency to operator_doppelganger
- Updated all tests to use ValidatedSSVMessage
- Simplified message_receiver call site (removed pattern matching)
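To illustrate the "pattern match on the enum" idea, here is a small self-contained sketch. The variant names, the struct fields, and the assumption that a QBFT message's height can be read as its slot are all hypothetical; the real `ValidatedSSVMessage` lives in the `message_validator` crate and its shape is not shown in this PR page.

```rust
// Hypothetical stand-in for the real slot type.
type Slot = u64;

/// Hypothetical, heavily reduced message types for illustration only.
pub struct QbftMessage {
    pub height: u64,
}

pub struct PartialSignatureMessages {
    pub slot: Slot,
}

/// Assumed shape: one variant per already-decoded message kind.
pub enum ValidatedSSVMessage {
    QbftMessage(QbftMessage),
    PartialSignatureMessages(PartialSignatureMessages),
}

/// Read the slot straight from the decoded payload, with no second SSZ decode.
/// Assumption: the QBFT height is interpreted as the duty slot.
pub fn extract_message_slot(msg: &ValidatedSSVMessage) -> Slot {
    match msg {
        ValidatedSSVMessage::QbftMessage(m) => m.height,
        ValidatedSSVMessage::PartialSignatureMessages(m) => m.slot,
    }
}
```

Since the match is exhaustive, adding a new variant to the enum would force this helper to be updated at compile time, which is the type-safety benefit the commit message points to.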
The wrapper function added no value - it just called spawn_monitor_task() directly with no additional logic, logging, or abstraction benefit. Direct call is clearer and eliminates unnecessary indirection.
Replace spawned background task with blocking monitoring approach that returns immediately on twin detection or after configured period.

Key changes:
- Remove doppelgänger service from message_sender (duties won't start during monitoring, so outgoing messages are naturally blocked)
- Replace spawn_monitor_task() with monitor_blocking() that returns Result<(), String>
- Use tokio::select! to wait for timeout or twin detection signal
- Use watch channel for immediate return when twin detected
- Client propagates error with ? operator to shutdown if twin found
- Remove shutdown_sender dependency (client handles shutdown via Result)

Benefits:
- Simpler architecture (blocking instead of concurrent task spawning)
- Faster twin detection (returns immediately, not after full period)
- Cleaner error handling (idiomatic Result propagation)
- Type-safe shutdown (compiler ensures error handling)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
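The control flow of "return early on detection, otherwise return after the period" can be approximated without tokio. The sketch below deliberately swaps `tokio::select!` and the watch channel for a standard-library `mpsc` channel with `recv_timeout`, so it is a synchronous analogue rather than the PR's implementation; the error strings and the fail-closed handling of a dropped sender are assumptions.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Block for at most `period`, returning early with an error if a twin is reported.
/// Std-only analogue of the async version; `twin_detected` plays the role of the
/// detection signal.
pub fn monitor_blocking(twin_detected: &Receiver<()>, period: Duration) -> Result<(), String> {
    match twin_detected.recv_timeout(period) {
        // A signal arrived before the period ended: another instance is active.
        Ok(()) => Err("operator doppelgänger detected during monitoring".to_string()),
        // The full period elapsed with no signal: safe to start duties.
        Err(RecvTimeoutError::Timeout) => Ok(()),
        // Assumption: if the detector side is gone we cannot monitor, so fail closed.
        Err(RecvTimeoutError::Disconnected) => {
            Err("detection channel closed before monitoring finished".to_string())
        }
    }
}
```

A caller can then propagate the result with `?`, so a detected twin aborts startup through ordinary error handling instead of a separate shutdown channel.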
Refactored the notifier service to use a clean layered state architecture, fixing misleading logs during operator doppelgänger monitoring.

## Changes
- Introduced three independent state enums for separation of concerns:
  - `SyncState`: Syncing | Synced
  - `OperatorState`: NoOperator | OperatorPresent
  - `DoppelgangerState`: NotMonitoring | MonitoringForDoppelganger
- Each enum has a constructor method (`from_*`) encapsulating its logic
- Refactored notify() to use compositional state matching instead of a complex nested match on operator_id and sync status
- Added doppelganger_service parameter to spawn_notifier() and notify()

## Benefits
- **Clear logs during monitoring**: Shows "Monitoring for operator doppelgänger (duties paused)" instead of misleading "Awaiting activation"
- **Extensible**: Adding new state concerns doesn't multiply enum variants
- **Single responsibility**: Each enum handles exactly one concern
- **No privacy issues**: Don't store operator_id in enums, query when needed
… expect()

Store operator_id directly in OperatorState::OperatorPresent variant instead of relying on separate Option<OperatorId> variable. This allows match arms to destructure operator_id from the pattern, eliminating all unsafe .expect() calls.

Benefits:
- Type safety: relationship between operator presence and operator_id is enforced at compile time
- Cleaner code: idiomatic Rust pattern destructuring
- Safer: no panic-able .expect() calls
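The compositional matching with `operator_id` carried inside the variant might look roughly like this. The enum and variant names come from the commit messages; the constructor names (`from_synced`, `from_operator_id`, `from_monitoring`), the `status_line` function, and the mapping of state combinations to messages other than the two quoted log strings are assumptions.

```rust
// Hypothetical stand-in for the real operator ID type.
type OperatorId = u64;

pub enum SyncState {
    Syncing,
    Synced,
}

pub enum OperatorState {
    NoOperator,
    /// The ID lives in the variant, so match arms can destructure it without `.expect()`.
    OperatorPresent { operator_id: OperatorId },
}

pub enum DoppelgangerState {
    NotMonitoring,
    MonitoringForDoppelganger,
}

impl SyncState {
    pub fn from_synced(is_synced: bool) -> Self {
        if is_synced { SyncState::Synced } else { SyncState::Syncing }
    }
}

impl OperatorState {
    pub fn from_operator_id(id: Option<OperatorId>) -> Self {
        match id {
            Some(operator_id) => OperatorState::OperatorPresent { operator_id },
            None => OperatorState::NoOperator,
        }
    }
}

impl DoppelgangerState {
    pub fn from_monitoring(monitoring: bool) -> Self {
        if monitoring {
            DoppelgangerState::MonitoringForDoppelganger
        } else {
            DoppelgangerState::NotMonitoring
        }
    }
}

/// One flat match over the three independent concerns (assumed message mapping).
pub fn status_line(sync: SyncState, operator: OperatorState, dg: DoppelgangerState) -> String {
    match (sync, operator, dg) {
        (SyncState::Syncing, _, _) => "Syncing".to_string(),
        (SyncState::Synced, OperatorState::NoOperator, _) => "Awaiting activation".to_string(),
        (
            SyncState::Synced,
            OperatorState::OperatorPresent { operator_id },
            DoppelgangerState::MonitoringForDoppelganger,
        ) => format!(
            "Monitoring for operator doppelgänger (duties paused), operator_id={}",
            operator_id
        ),
        (
            SyncState::Synced,
            OperatorState::OperatorPresent { operator_id },
            DoppelgangerState::NotMonitoring,
        ) => format!("Operator active, operator_id={}", operator_id),
    }
}
```

Adding a fourth concern would mean one more tuple element rather than multiplying the variants of a single combined enum.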
Remove unnecessary intermediate variable and type conversion in doppelgänger service initialization. slot_clock.now() returns types::Slot which can be directly used as startup_slot.
ca02239 to 9b53677
Make operator doppelgänger protection opt-in rather than opt-out to avoid imposing a ~12 minute monitoring delay on all operators by default.

Rationale:
- QBFT's Byzantine tolerance already handles some operator duplication scenarios
- No clear path from operator duplication to validator slashing
- Provides operational discipline rather than security guarantees
- Monitoring delay may not be acceptable for all deployment scenarios
- Operators can enable when operational safety is prioritized

Changes:
- Set operator_dg default to false in Config::new()
- Set operator_dg CLI default_value_t to false
- Remove "Enabled by default" from help text
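The quoted delay follows from the default of 2 epochs under mainnet timing (32 slots per epoch, 12 seconds per slot): 2 × 32 × 12 = 768 seconds, about 12.8 minutes. A small worked sketch, where `monitoring_duration` is a hypothetical helper and not a function from the PR:

```rust
use std::time::Duration;

/// Hypothetical helper: total monitoring time for a given number of epochs.
pub fn monitoring_duration(
    wait_epochs: u64,
    slots_per_epoch: u64,
    seconds_per_slot: u64,
) -> Duration {
    Duration::from_secs(wait_epochs * slots_per_epoch * seconds_per_slot)
}
```

Networks with a different `seconds_per_slot` scale the delay proportionally, which is why the slot duration is taken from the network specification rather than hardcoded.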
dknopik
left a comment
LGTM, thanks!
…sages (#711)

Related to #692

Adds time-to-live validation for ValidatorRegistration and VoluntaryExit messages, which previously had no lateness checks. These messages now use the same 34-slot TTL as Committee and Aggregator roles.

This enables more resilient operator doppelgänger protection by:
- Preventing replay attacks from malicious nodes
- Bounding the acceptance window for stale messages
- Working seamlessly with the doppelgänger grace period

Co-Authored-By: diego <[email protected]>
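A slot-based TTL check of this kind can be sketched in a few lines. Only the 34-slot figure comes from the commit message; the constant name, the function, and the choice to treat a message exactly 34 slots old as still acceptable are assumptions, and the real validator may account for additional margins.

```rust
/// TTL in slots, as stated in the commit message; the name is hypothetical.
pub const LATE_MESSAGE_TTL_SLOTS: u64 = 34;

/// Assumed rule: a message is too late once the current slot is more than
/// `LATE_MESSAGE_TTL_SLOTS` past the message's slot.
pub fn is_too_late(msg_slot: u64, current_slot: u64) -> bool {
    current_slot > msg_slot.saturating_add(LATE_MESSAGE_TTL_SLOTS)
}
```

Bounding message age this way limits how old a replayed message can be and still be accepted by the validator.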
Issue Addressed
Closes #627
Proposed Changes
Implements Operator Doppelgänger Protection to detect when multiple instances of the same operator are running, preventing QBFT protocol violations.
Problem
If two operator instances run simultaneously:
Solution
Slot-Based Detection: Records the startup slot and monitors the network for messages with our operator ID that reference slots after startup.
Why Slot Comparison Works:
- slot ≤ startup_slot: Ignored (our own old messages)
- slot > startup_slot: Twin detected (we didn't send them)

State Management:
Twin Detection:
Benefits Over Time-Based Approaches
Configuration
This feature is opt-in to avoid imposing a monitoring delay on all operators:
- --operator-dg: Enable protection (default: false)
- --operator-dg-wait-epochs: Monitoring duration in epochs (default: 2, ~768 seconds)

Rationale for Opt-In:
Testing
8 comprehensive tests covering slot-based detection, state management, and edge cases. Manual testing verified all configuration scenarios work correctly.