@diegomrsantos
Member

@diegomrsantos diegomrsantos commented Oct 15, 2025

Issue Addressed

Closes #627

Proposed Changes

Implements Operator Doppelgänger Protection to detect when multiple instances of the same operator are running, preventing QBFT protocol violations.

Problem

If two operator instances run simultaneously:

  • Both participate in QBFT consensus
  • Both send messages with the same operator ID
  • QBFT receives conflicting votes from one operator ID
  • Protocol correctness breaks (QBFT expects one vote per operator per round)

Solution

Slot-Based Detection: Records the startup slot and monitors the network for messages with our operator ID that reference slots after startup.

| Phase | Outgoing Messages | Detection Logic |
|---|---|---|
| Monitoring | ❌ BLOCKED | Messages with slot > startup_slot → Twin detected |
| Completed | ✅ SENT | No longer checking |

Why Slot Comparison Works:

  • All outgoing messages blocked during monitoring
  • Messages with slot ≤ startup_slot: Ignored (our own old messages)
  • Messages with slot > startup_slot: Twin detected (we didn't send them)
  • No race conditions possible since we never compete with twins
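The comparison rule above can be sketched in a few lines of Rust. This is a minimal illustration, not the PR's code: the `SlotCheck` enum and `classify_by_slot` function are hypothetical names, and slots are plain `u64` values rather than the project's slot type. It assumes the message has already been matched to our operator ID.

```rust
/// Outcome of inspecting one incoming single-signer message that carries
/// our own operator ID while monitoring is active. (Hypothetical type.)
#[derive(Debug, PartialEq, Eq)]
pub enum SlotCheck {
    /// Slot is at or before startup: possibly our own pre-restart message.
    Ignored,
    /// Slot is after startup: we sent nothing, so another instance did.
    TwinDetected,
}

/// Hypothetical helper implementing the slot comparison described above.
pub fn classify_by_slot(msg_slot: u64, startup_slot: u64) -> SlotCheck {
    if msg_slot <= startup_slot {
        SlotCheck::Ignored
    } else {
        SlotCheck::TwinDetected
    }
}
```

A message for exactly the startup slot is ignored, which is what makes a restart within the same slot safe.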

State Management:

  • Simple AtomicBool for lock-free monitoring state
  • No grace period needed (slot comparison handles message filtering)
  • Detection active immediately on startup
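A lock-free monitoring flag of this kind might look like the sketch below. The `MonitorFlag` type and its method names are assumptions for illustration; only the use of `AtomicBool` and the "active immediately on startup" behavior come from the description.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the service's monitoring state.
pub struct MonitorFlag {
    monitoring: AtomicBool,
}

impl MonitorFlag {
    /// Detection is active from construction; there is no grace period.
    pub fn new() -> Self {
        Self {
            monitoring: AtomicBool::new(true),
        }
    }

    /// Cheap check on the message path: a single atomic load, no lock.
    pub fn is_monitoring(&self) -> bool {
        self.monitoring.load(Ordering::Acquire)
    }

    /// Called once when the monitoring period ends.
    pub fn end_monitoring(&self) {
        self.monitoring.store(false, Ordering::Release);
    }
}
```

Because the flag only ever moves from `true` to `false`, a single boolean is enough and no compare-and-swap is needed.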

Twin Detection:

  • Monitors single-signer QBFT and partial signature messages
  • Twin detected → logs detailed context → graceful shutdown
  • Handles edge cases: restart in same slot, network delays, clock skew

Benefits Over Time-Based Approaches

  • Faster Startup: No 411-second grace period wait
  • More Reliable: Slot comparison is deterministic, unaffected by network delays
  • Simpler: Fewer states, cleaner logic with lock-free operations

Configuration

This feature is opt-in to avoid imposing a monitoring delay on all operators:

  • --operator-dg: Enable protection (default: false)
  • --operator-dg-wait-epochs: Monitoring duration in epochs (default: 2, ~768 seconds)
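The ~768 second figure follows from mainnet timing (32 slots per epoch, 12 seconds per slot). A small sketch of that arithmetic, where `monitoring_duration` is a hypothetical helper and not a function from the PR:

```rust
use std::time::Duration;

/// Mainnet values; other networks may use different ones.
pub const SLOTS_PER_EPOCH: u64 = 32;
pub const SECONDS_PER_SLOT: u64 = 12;

/// Hypothetical helper: how long monitoring lasts for a given epoch count.
pub fn monitoring_duration(
    wait_epochs: u64,
    slots_per_epoch: u64,
    seconds_per_slot: u64,
) -> Duration {
    Duration::from_secs(wait_epochs * slots_per_epoch * seconds_per_slot)
}
```

With the default of 2 epochs this gives 2 × 32 × 12 = 768 seconds, a little under 13 minutes.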

Rationale for Opt-In:

  • QBFT's Byzantine tolerance already handles some operator duplication scenarios
  • No clear path from operator duplication to validator slashing (operators don't directly sign beacon chain messages)
  • Provides operational discipline rather than security guarantees
  • ~12 minute startup delay may not be acceptable for all deployment scenarios
  • Operators can enable when operational safety is prioritized

Testing

8 comprehensive tests covering slot-based detection, state management, and edge cases. Manual testing verified all configuration scenarios work correctly.

diegomrsantos and others added 4 commits October 15, 2025 22:18
Add configuration options for operator doppelgänger protection:
- --operator-dg: Enable/disable the feature (default: true)
- --operator-dg-wait-epochs: Epochs to wait in monitor mode (default: 2)
- --operator-dg-fresh-k: Freshness threshold for twin detection (default: 3)

This implements the configuration layer for issue sigp#627, allowing operators
to detect if their operator ID is already active elsewhere and shut down
to prevent equivocation.

Related to sigp#627

Add a new service module for detecting operator doppelgängers:
- State machine with Monitor/Active modes
- Track recent max consensus height per committee
- Freshness threshold (K) to prevent false positives from replays
- Check for single-signer messages with own operator ID
- Comprehensive unit tests for state machine logic

The service detects if the operator's ID appears in fresh SSV messages
during monitor mode, indicating another instance is already running.

Part of sigp#627

Integrates the operator doppelgänger protection with the message flow:
- Adds DoppelgangerConfig struct to message_receiver for checking messages
- NetworkMessageReceiver checks QBFT messages for doppelgänger detection
- Client initializes doppelgänger service when operator_dg is enabled
- Fatal shutdown triggered via TaskExecutor when twin operator detected
- Spawns background task to listen for shutdown signal
- Tests updated to use correct CommitteeId constructor ([u8; 32])
- Allows clippy::too_many_arguments for NetworkMessageReceiver::new

This implements the core detection and shutdown mechanism specified in
sigp#627 (comment)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

default_value_t = 2,
requires = "operator_dg"
)]
pub operator_dg_wait_epochs: u64,
Member

we are not actually waiting anywhere

Member Author

it's a bit weird that when we use this, we need to set --operator-dg=true

@diegomrsantos
Member Author

This PR is still very much a draft; it is not ready for review.

@diegomrsantos diegomrsantos self-assigned this Oct 16, 2025
diegomrsantos and others added 15 commits October 17, 2025 16:15
Follows codebase pattern of handling slot_clock.now() returning None
explicitly rather than silently falling back to Epoch 0. The current
epoch is now required as a parameter to the service constructor,
following the pattern used by other services in the codebase.

Add explicit error handling when slot_clock.now() returns None.
If we can't read the current slot, we can't reliably determine
the epoch or update the mode, so we skip the doppelgänger check
and log a warning.
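The explicit `None` handling described in this commit could look roughly like the following. Everything here is a simplified assumption: the `SlotClock` trait is a minimal stand-in for the real slot clock, `CheckOutcome` and `run_doppelganger_check` are hypothetical names, and the warning goes to stderr instead of the project's logging framework.

```rust
/// Hypothetical minimal clock interface; the real slot clock differs.
pub trait SlotClock {
    /// Current slot, or `None` if it cannot be determined.
    fn now(&self) -> Option<u64>;
}

/// Mainnet value, used here only to turn a slot into an epoch.
pub const SLOTS_PER_EPOCH: u64 = 32;

#[derive(Debug, PartialEq, Eq)]
pub enum CheckOutcome {
    /// The slot was unreadable, so the check was skipped.
    Skipped,
    /// The check ran against the given epoch.
    Ran { epoch: u64 },
}

/// Skips the check and warns instead of silently falling back to epoch 0.
pub fn run_doppelganger_check<C: SlotClock>(clock: &C) -> CheckOutcome {
    let Some(slot) = clock.now() else {
        eprintln!("warning: cannot read current slot; skipping doppelgänger check");
        return CheckOutcome::Skipped;
    };
    CheckOutcome::Ran {
        epoch: slot / SLOTS_PER_EPOCH,
    }
}

/// Test double that returns a fixed reading.
pub struct FixedClock(pub Option<u64>);

impl SlotClock for FixedClock {
    fn now(&self) -> Option<u64> {
        self.0
    }
}
```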
Replace RwLock with Mutex for DoppelgangerState since all operations
are fast (HashMap lookups/updates) and the RwLock complexity isn't
justified.

Changes:
- Replace Arc<RwLock<DoppelgangerState>> with Arc<Mutex<DoppelgangerState>>
- Add update_and_check_freshness() method that atomically updates max
  height and checks freshness in one lock acquisition
- Make update_max_height() and is_fresh() private helper methods
- Simplify check_message() logic by removing drop/re-acquire pattern

This provides cleaner API surface with better separation of concerns:
- State handles data operations
- Service handles policy decisions

Stale messages during doppelgänger monitoring are expected (network
delays, replays) and not actionable. Using debug level reduces noise
while keeping the information available for troubleshooting.

The enabled check is already done in lib.rs - we only create the service
if operator_dg is true and impostor mode is disabled. No need to store
and check an enabled flag in the service itself.

Add high-value tests covering core check_message functionality:
- Twin detection with fresh single-signer messages
- Stale message filtering (beyond fresh_k window)
- Multi-signer aggregate message handling
- Different operator ID filtering
- Monitoring period transitions
- Freshness window boundaries
- Independent committee height tracking
- Height progression scenarios

Total: 14 service tests + 5 state tests = 19 passing tests
…ring wait

Addresses PR review feedback about waiting for monitoring period to complete
before starting validator services.

Changes:
- Extract ~50 lines of initialization code into `initialize_operator_doppelganger()` helper
- Add `watch::Receiver<bool>` to broadcast monitoring status (similar to `is_synced` pattern)
- Add `spawn_monitor_task()` method to encapsulate background epoch monitoring task
- Add explicit `transition_to_active()` method for state transition
- Client now waits for monitoring completion before starting services (matches sync wait pattern)
- Remove `update_mode()` in favor of explicit `set_active()` call

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Improvements:
- Add error logging for watch channel send failures
- Remove unused parameters from DoppelgangerState::new()
- Document hardcoded 12-second sleep duration with note about network flexibility
- Change #[allow(dead_code)] to #[cfg(test)] for test-only methods
- Add #[must_use] annotations to methods returning important values
- Remove unused Epoch import
- Remove redundant test_monitoring_mode_transition test (duplicated by test_no_twin_after_monitoring_period_ends)

All 18 tests passing, no compiler warnings.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The background monitoring task was using a hardcoded 12-second sleep
interval. This change parameterizes the slot duration, deriving it from
the network specification (spec.seconds_per_slot) to support different
networks with varying slot durations.

Changes:
- Add slot_duration field to OperatorDoppelgangerService struct
- Update new() signature to accept slot_duration parameter
- Update spawn_monitor_task() to use self.slot_duration
- Update all call sites to pass Duration::from_secs(spec.seconds_per_slot)
- Update tests to pass slot_duration from TestingSlotClock

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ations

The `update_and_check_freshness()` method had a semantic problem: it updated
the height tracker even for stale messages (likely replays), and it checked
freshness against an already-updated state.

This refactoring separates the operations:
1. Check if the message is fresh (against current state)
2. Only update height tracking if the message is fresh

Benefits:
- Clearer code: each operation does one thing
- More correct: we only track fresh messages, not stale replays
- No confusion about order of operations
- Better separation of concerns

Changes:
- Remove `update_and_check_freshness()` method from DoppelgangerState
- Make `is_fresh()` and `update_max_height()` public
- Update `check_message()` to call methods separately
- Add documentation clarifying usage order
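The check-then-update ordering from this refactoring can be sketched as follows. The method names `is_fresh()`, `update_max_height()`, and `check_message()` come from the commit message; the signatures, the `[u8; 32]` committee key, and the freshness rule itself (a height is fresh if it is within `fresh_k` of the highest height seen, and the first message for a committee is always fresh) are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Simplified state: highest consensus height seen per committee.
#[derive(Default)]
pub struct DoppelgangerState {
    max_height: HashMap<[u8; 32], u64>,
}

impl DoppelgangerState {
    /// Assumed rule: fresh if within `fresh_k` heights of the current max.
    /// Reads the state as it is now; it does not modify anything.
    pub fn is_fresh(&self, committee: &[u8; 32], height: u64, fresh_k: u64) -> bool {
        match self.max_height.get(committee) {
            Some(&max) => height.saturating_add(fresh_k) >= max,
            None => true,
        }
    }

    /// Raises the tracked max height; never lowers it.
    pub fn update_max_height(&mut self, committee: [u8; 32], height: u64) {
        let entry = self.max_height.entry(committee).or_insert(height);
        if height > *entry {
            *entry = height;
        }
    }
}

/// Step 1: check freshness against the current state.
/// Step 2: update tracking only if the message was fresh.
pub fn check_message(
    state: &mut DoppelgangerState,
    committee: [u8; 32],
    height: u64,
    fresh_k: u64,
) -> bool {
    let fresh = state.is_fresh(&committee, height, fresh_k);
    if fresh {
        state.update_max_height(committee, height);
    }
    fresh
}
```

Keeping the two operations separate means a stale replay can never influence the tracked height, which was the semantic problem with the combined method.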

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Removed three low-value tests that don't add meaningful coverage:

1. test_monitoring_state_persists
   - Had misleading comment about message checking
   - Didn't test actual behavior (background task not started)
   - Redundant with test_no_twin_after_monitoring_period_ends

2. test_different_fresh_k_values
   - Only tested constructor accepts different K values
   - No verification of behavioral differences
   - Already covered by test_service_creation

3. test_increasing_heights_all_fresh
   - Tested obvious behavior (sequential heights are fresh)
   - Fully covered by test_freshness_window_boundary
   - Redundant sanity check with no additional value

All remaining tests cover meaningful scenarios and edge cases.
Test count: 18 → 15 tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The test_initial_state test is completely redundant with
test_mode_transition, which already verifies the initial state
as part of testing the state transition.

Before:
- test_initial_state: checks starts in Monitor mode
- test_mode_transition: checks starts in Monitor, then transitions

After:
- test_mode_transition: covers both (initial state + transition)

Test count: 15 → 14 tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@diegomrsantos diegomrsantos marked this pull request as ready for review October 21, 2025 19:35
@diegomrsantos diegomrsantos requested a review from dknopik October 21, 2025 19:35
@claude-code-actions-sigp

This comment was marked as outdated.

@diegomrsantos
Member Author

> An idea I just had: I think we can get rid of the grace period if we only consider messages created after we start. So basically, we get the current slot at the beginning, and then only crash if we get a message referencing a later slot

That's actually a great idea! I don't know why we didn't think about this before 😅

…detection

Refactors the operator doppelgänger protection system to use slot comparison
instead of time-based grace period, improving startup time and reliability.

Changes:

State Management:
- Replace DoppelgangerState enum with AtomicBool for simpler state tracking
- Add startup_slot field to track service initialization time
- Remove grace period logic (saves 411 seconds at startup)

Detection Logic:
- Implement slot-based detection: msg_slot > startup_slot indicates twin
- Messages at or before startup_slot are ignored (our own old messages)
- Add extract_message_slot() helper supporting both QBFT and PartialSignatureMessages
- No race conditions possible since all outgoing messages blocked during monitoring

Benefits:
- Faster startup: eliminates 411-second grace period wait time
- More reliable: slot comparison is deterministic, unaffected by network delays
- Simpler code: fewer states, cleaner logic with lock-free atomic operations
- Better edge case handling: restart in same slot, network delays, clock skew

Testing:
- Replace time-based async tests with slot-based sync tests
- All 8 tests pass verifying correct twin detection behavior
@diegomrsantos diegomrsantos force-pushed the feat/operator-doppelganger-protection branch from e943688 to 3ea5c20 Compare October 29, 2025 17:16
@diegomrsantos diegomrsantos marked this pull request as draft October 29, 2025 17:22
@diegomrsantos diegomrsantos marked this pull request as ready for review October 29, 2025 17:22
@claude-code-actions-sigp

claude-code-actions-sigp bot commented Oct 29, 2025

Claude finished @diegomrsantos's task


Code Review: Operator Doppelgänger Protection

Status: ✅ Approve - Implementation is production-ready


Key Finding: Slot-Based Detection Already Implemented

The recent discussion (Oct 29) between @dknopik and @diegomrsantos about switching to slot-based detection is great news: the code already implements this approach!

@dknopik's suggestion:

"I think we can get rid of the grace period if we only consider messages created after we start. So basically, we get the current slot at the beginning, and then only crash if we get a message referencing a later slot"

Current implementation does exactly this:

  1. Captures startup slot (client/src/lib.rs:441-445):

    let current_slot = slot_clock.now()?;
    let startup_slot = ssv_types::Slot::new(current_slot.as_u64());
  2. Slot-based detection (operator_doppelganger/src/service.rs:154-156):

    if msg_slot <= self.startup_slot {
        return false; // Ignore messages at or before startup
    }
  3. Detection criterion:

    • slot > startup_slot → Twin detected ✓
    • slot ≤ startup_slot → Ignored (our old messages) ✓

The PR description even explicitly states "Slot-Based Detection" as the solution approach.


Comprehensive Review

1. Architecture & Design

Strengths:

  • Elegant slot comparison - no time-based complexity, immune to clock skew and network delays
  • Lock-free monitoring check using AtomicBool for optimal performance
  • Clean state management with single monitoring flag
  • Deterministic behavior - slot numbers are consensus-guaranteed

Design Quality: The slot-based approach is fundamentally superior to time-based alternatives because:

  • No dependency on wall-clock time
  • No dependency on network delays
  • Simple to reason about
  • Mathematically sound

2. Security Analysis

Detection Logic (service.rs:136-191):

The security model is sound:

  1. Baseline: Startup slot captured from slot clock
  2. Outgoing blocked: Both sign_and_send() and send() check is_monitoring() (network.rs:59-64, 114-119)
  3. Detection: Single-signer message with our operator ID for slot > startup_slot
  4. Action: Immediate shutdown via ShutdownReason::Failure

Why this works:

  • We never send during monitoring → no race conditions with twins
  • Messages with slot > startup_slot MUST be from a twin (we didn't send them)
  • Messages with slot ≤ startup_slot are safely ignored

Edge Cases Handled:

  • ✅ Restart in same slot (messages for startup slot ignored)
  • ✅ Network delays (slot comparison is delay-independent)
  • ✅ Clock skew (slot numbers are consensus-based)
  • ✅ Multi-signer messages (aggregates correctly ignored)
  • ✅ Operator ID not yet available (graceful early return)
  • ✅ Corrupted messages (slot extraction failure handled)

Consideration:

⚠️ Monitoring Duration

The default of 2 epochs (~12.8 minutes on mainnet: 2 × 32 slots × 12 s = 768 s) assumes twins will be active during our monitoring window. If a twin is intermittently offline (crashed, network issue), it might not be detected.

Recommendation: Document this assumption. The 2-epoch default is reasonable for catching persistent twins. For higher confidence, operators can increase --operator-dg-wait-epochs.


3. Performance

Strengths:

  • Lock-free fast path: AtomicBool for monitoring state (no mutex contention)
  • Early returns: Multiple fast paths avoid unnecessary work
  • O(1) complexity: Single atomic load + slot comparison

Overhead Analysis:

  • During monitoring (~12.8 min at the 2-epoch default): Negligible atomic check per message
  • After monitoring: Same negligible atomic check forever (acceptable)
  • No lock contention: Optimal choice of AtomicBool over Mutex<bool>

Impact During Monitoring:

  • Incoming messages dropped (duties missed)
  • Outgoing messages blocked
  • Duration: 2 epochs (default) = ~12.8 minutes (768 seconds on mainnet)

This is an acceptable tradeoff for preventing equivocation.


4. Test Coverage

Test Quality (service.rs:224-486):

Comprehensive test suite with proper async timer testing:

Core Scenarios Covered:

  1. ✅ Twin detected (slot after startup) - line 358
  2. ✅ No twin (slot at startup boundary) - line 373
  3. ✅ No twin (slot before startup) - line 391
  4. ✅ Multi-signer filtering - line 409
  5. ✅ Different operator ID - line 430
  6. ✅ Monitoring timer expiration - line 447

Test Quality:

  • Uses #[tokio::test(start_paused = true)] for deterministic async time ✓
  • Helper functions for test message creation ✓
  • Independent, well-named tests ✓
  • Positive and negative cases covered ✓

Missing Scenarios (non-critical):

  • Partial signature messages (QBFT tested, both supported in detection)
  • Slot extraction failure path
  • Concurrent message checks

Verdict: Test coverage is solid for core functionality.


5. Code Quality

Strengths:

  • Exceptional documentation: Comments explain the "why" and edge cases (lines 117-135)
  • Proper error handling: Uses Option and early returns, no unwrap() in production code ✓
  • Idiomatic Rust: Good use of atomics, pattern matching, Option combinators
  • Clear separation: is_doppelganger() (detection) vs check_message() (detection + action)

Documentation Highlights:

  • Explains slot-based detection mechanism clearly
  • Covers edge cases explicitly (restart in same slot, network delays, clock skew)
  • Provides rationale for design decisions

6. Configuration & Integration

CLI Configuration (cli.rs:495-517):

--operator-dg (default: true)
--operator-dg-wait-epochs (default: 2)

Help text clearly explains behavior during monitoring.

Integration Points:

  1. Service Creation (lib.rs:439-456):

    • Created early, started after sync ✓
    • Only when enabled and not in impostor mode ✓
    • Captures startup slot correctly ✓
  2. Service Start (lib.rs:607-621):

    • Waits for sync completion ✓
    • Starts monitoring task ✓
    • Prevents startup crash (previous review concern) ✓
  3. Message Receiver (manager.rs:167-180):

    • Checks monitoring before processing ✓
    • Supports both QBFT and partial signatures ✓
  4. Message Sender (network.rs:56-64, 110-119):

    • Blocks both sign_and_send() and send()

Integration Quality: Clean, minimal coupling, well-separated concerns.


Minor Suggestions (Non-Blocking)

  1. Document monitoring duration rationale

    Add comment explaining why 2 epochs is the default. Consider: "2 epochs provides ~12.8 minutes to detect active twins while minimizing missed duties."

  2. Add observability metrics (optional but valuable)

    • Counter: messages dropped during monitoring
    • Gauge: monitoring state (0/1)
    • Timer: monitoring duration

    Helps operators understand operational impact.

  3. Consider tiered logging

    Current detection log (line 178-188) is excellent for debugging but detailed. Consider:

    • Simple ERROR: "Twin operator detected - shutting down to prevent equivocation"
    • Detailed log at DEBUG level for diagnostics
  4. Clarify requires attribute (cli.rs:516)

    Per @dknopik, requires doesn't hide options. If this is intentional (semantic documentation), add a comment explaining why.


Final Verdict

Status: ✅ Approve - Production-ready implementation

This is a well-designed, thoroughly tested implementation of operator doppelgänger protection using slot-based detection.

Key Strengths:

  • Slot-based detection is fundamentally sound (immune to timing issues)
  • Clean architecture with clear separation of concerns
  • Comprehensive test coverage with proper async testing
  • Excellent documentation explaining mechanism and edge cases
  • Lock-free performance optimization
  • Proper integration in both send and receive paths

Addressing Recent Discussion:

The suggestion from @dknopik (Oct 29) to use slot-based detection is already implemented! The code captures startup_slot at service creation and only flags messages with slot > startup_slot. No architectural changes needed.

Minor polish opportunities (non-blocking):

  • Document 2-epoch rationale
  • Add observability metrics
  • Consider tiered logging

The PR is ready to merge. Excellent work on the simplification and implementation! 🎉


diegomrsantos and others added 4 commits October 30, 2025 14:23
Improves the operator doppelgänger API by accepting already-validated messages
instead of re-decoding PartialSignatureMessages from bytes.

API Changes:
- check_message() now accepts &ValidatedSSVMessage instead of Option<&QbftMessage>
- is_doppelganger() updated with same signature change
- extract_message_slot() simplified to pattern match on ValidatedSSVMessage enum

Benefits:
- No redundant SSZ decoding (was decoding PartialSignatureMessages twice)
- Clearer API semantics: pass the validated message directly
- Type-safe: can't pass mismatched QBFT/partial signature data
- Simpler call site: just pass &ssv_message instead of pattern matching
- Better logging: can extract QBFT context from ValidatedSSVMessage when needed
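A rough sketch of what pattern matching on the validated message could look like. The types below are heavily simplified placeholders: the real `ValidatedSSVMessage`, `QbftMessage`, and `PartialSignatureMessages` have different fields and variant names, and treating the QBFT height as the slot is an assumption made for this example.

```rust
/// Placeholder for a decoded QBFT message (real type has many more fields).
pub struct QbftMessage {
    pub height: u64,
}

/// Placeholder for decoded partial signature messages.
pub struct PartialSignatureMessages {
    pub slot: u64,
}

/// Placeholder enum; variant names here are hypothetical.
pub enum ValidatedSSVMessage {
    Qbft(QbftMessage),
    PartialSignatures(PartialSignatureMessages),
}

/// One match, no re-decoding from bytes: the caller passes what the
/// validator already produced.
pub fn extract_message_slot(msg: &ValidatedSSVMessage) -> u64 {
    match msg {
        // Assumption: the QBFT height corresponds to the duty's slot.
        ValidatedSSVMessage::Qbft(m) => m.height,
        ValidatedSSVMessage::PartialSignatures(m) => m.slot,
    }
}

/// Applies the slot rule to an already-validated message.
pub fn is_after_startup(msg: &ValidatedSSVMessage, startup_slot: u64) -> bool {
    extract_message_slot(msg) > startup_slot
}
```

Because the enum is matched exhaustively, adding a new message kind later would fail to compile until the slot extraction is extended to cover it.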

Implementation:
- Added message_validator dependency to operator_doppelganger
- Updated all tests to use ValidatedSSVMessage
- Simplified message_receiver call site (removed pattern matching)

The wrapper function added no value - it just called spawn_monitor_task()
directly with no additional logic, logging, or abstraction benefit.

Direct call is clearer and eliminates unnecessary indirection.

Replace spawned background task with blocking monitoring approach that
returns immediately on twin detection or after configured period.

Key changes:
- Remove doppelgänger service from message_sender (duties won't start
  during monitoring, so outgoing messages are naturally blocked)
- Replace spawn_monitor_task() with monitor_blocking() that returns
  Result<(), String>
- Use tokio::select! to wait for timeout or twin detection signal
- Use watch channel for immediate return when twin detected
- Client propagates error with ? operator to shutdown if twin found
- Remove shutdown_sender dependency (client handles shutdown via Result)

Benefits:
- Simpler architecture (blocking instead of concurrent task spawning)
- Faster twin detection (returns immediately, not after full period)
- Cleaner error handling (idiomatic Result propagation)
- Type-safe shutdown (compiler ensures error handling)
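The "return on detection or on timeout" shape can be illustrated with the standard library alone. This is not the PR's implementation: the commit uses `tokio::select!` with a watch channel, while the sketch below swaps in `std::sync::mpsc::Receiver::recv_timeout` as a synchronous stand-in. The `monitor_blocking` name and `Result<(), String>` return type come from the commit message; the parameters and error strings are assumptions.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Blocks until a twin is reported on `twin_rx` or `period` elapses.
/// `recv_timeout` plays the role that `tokio::select!` over a timer and
/// a watch channel plays in the async version.
pub fn monitor_blocking(twin_rx: &Receiver<()>, period: Duration) -> Result<(), String> {
    match twin_rx.recv_timeout(period) {
        // A signal arrived: return immediately, without waiting out the period.
        Ok(()) => Err("operator doppelgänger detected".to_string()),
        // Full period elapsed with no signal: safe to start duties.
        Err(RecvTimeoutError::Timeout) => Ok(()),
        // Sketch-level choice: if the detector went away, fail closed.
        Err(RecvTimeoutError::Disconnected) => {
            Err("doppelgänger detector stopped unexpectedly".to_string())
        }
    }
}
```

A caller can then write `monitor_blocking(&rx, period)?` so that a detected twin propagates as an error and stops startup, matching the `?`-based shutdown described above.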

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@diegomrsantos diegomrsantos changed the title feat: Operator Doppelgänger Protection feat: operator doppelgänger protection with slot-based detection Nov 3, 2025
Refactored the notifier service to use a clean layered state architecture,
fixing misleading logs during operator doppelgänger monitoring.

## Changes

- Introduced three independent state enums for separation of concerns:
  - `SyncState`: Syncing | Synced
  - `OperatorState`: NoOperator | OperatorPresent
  - `DoppelgangerState`: NotMonitoring | MonitoringForDoppelganger

- Each enum has a constructor method (`from_*`) encapsulating its logic

- Refactored notify() to use compositional state matching instead of
  a complex nested match on operator_id and sync status

- Added doppelganger_service parameter to spawn_notifier() and notify()

## Benefits

- **Clear logs during monitoring**: Shows "Monitoring for operator
  doppelgänger (duties paused)" instead of misleading "Awaiting activation"

- **Extensible**: Adding new state concerns doesn't multiply enum variants

- **Single responsibility**: Each enum handles exactly one concern

- **No privacy issues**: Don't store operator_id in enums, query when needed
… expect()

Store operator_id directly in OperatorState::OperatorPresent variant instead
of relying on separate Option<OperatorId> variable. This allows match arms to
destructure operator_id from the pattern, eliminating all unsafe .expect()
calls.

Benefits:
- Type safety: relationship between operator presence and operator_id is
  enforced at compile time
- Cleaner code: idiomatic Rust pattern destructuring
- Safer: no panic-able .expect() calls
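The layered state design from the two notifier commits above might be sketched like this. The enum and variant names are taken from the commit messages; the `from_*` constructor names, the `OperatorId` alias, the `status_line` function, the match precedence, and every log string other than "Monitoring for operator doppelgänger (duties paused)" and "Awaiting activation" are assumptions for illustration.

```rust
/// Hypothetical alias; the real operator ID type differs.
pub type OperatorId = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncState {
    Syncing,
    Synced,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OperatorState {
    NoOperator,
    /// The ID lives in the variant, so match arms can destructure it.
    OperatorPresent(OperatorId),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DoppelgangerState {
    NotMonitoring,
    MonitoringForDoppelganger,
}

impl SyncState {
    pub fn from_synced(is_synced: bool) -> Self {
        if is_synced { Self::Synced } else { Self::Syncing }
    }
}

impl OperatorState {
    pub fn from_operator_id(id: Option<OperatorId>) -> Self {
        id.map_or(Self::NoOperator, Self::OperatorPresent)
    }
}

impl DoppelgangerState {
    pub fn from_monitoring(is_monitoring: bool) -> Self {
        if is_monitoring { Self::MonitoringForDoppelganger } else { Self::NotMonitoring }
    }
}

/// Compositional match over the three independent concerns.
pub fn status_line(sync: SyncState, op: OperatorState, dg: DoppelgangerState) -> String {
    match (sync, op, dg) {
        (SyncState::Syncing, _, _) => "Syncing".to_string(),
        (SyncState::Synced, OperatorState::NoOperator, _) => "Awaiting activation".to_string(),
        (
            SyncState::Synced,
            OperatorState::OperatorPresent(id),
            DoppelgangerState::MonitoringForDoppelganger,
        ) => format!("Monitoring for operator doppelgänger (duties paused), operator_id={id}"),
        (SyncState::Synced, OperatorState::OperatorPresent(id), DoppelgangerState::NotMonitoring) => {
            format!("Operator active, operator_id={id}")
        }
    }
}
```

Since `operator_id` is bound by the pattern itself, there is no `Option` to unwrap inside the match arms and therefore no `.expect()` call that could panic.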
Remove unnecessary intermediate variable and type conversion in doppelgänger
service initialization. slot_clock.now() returns types::Slot which can be
directly used as startup_slot.
@diegomrsantos diegomrsantos requested a review from dknopik November 5, 2025 12:58
@diegomrsantos diegomrsantos force-pushed the feat/operator-doppelganger-protection branch from ca02239 to 9b53677 Compare November 5, 2025 14:14
Make operator doppelgänger protection opt-in rather than opt-out to avoid
imposing a ~12 minute monitoring delay on all operators by default.

Rationale:
- QBFT's Byzantine tolerance already handles some operator duplication scenarios
- No clear path from operator duplication to validator slashing
- Provides operational discipline rather than security guarantees
- Monitoring delay may not be acceptable for all deployment scenarios
- Operators can enable when operational safety is prioritized

Changes:
- Set operator_dg default to false in Config::new()
- Set operator_dg CLI default_value_t to false
- Remove "Enabled by default" from help text
Member

@dknopik dknopik left a comment

LGTM, thanks!

@mergify mergify bot merged commit 24d0801 into sigp:unstable Nov 5, 2025
17 checks passed
mergify bot pushed a commit that referenced this pull request Nov 7, 2025
…sages (#711)

Related to #692


  Adds time-to-live validation for ValidatorRegistration and VoluntaryExit messages, which previously had no lateness checks. These messages now use the same 34-slot TTL as Committee and Aggregator roles.

This enables more resilient operator doppelgänger protection by:
- Preventing replay attacks from malicious nodes
- Bounding the acceptance window for stale messages
- Working seamlessly with the doppelgänger grace period
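As a rough illustration of a slot-based lateness check: the 34-slot TTL is the value named above, while the `is_too_late` function, the constant name, and the exact boundary rule (a message is rejected once the current slot is more than 34 slots past the message's slot) are assumptions, not the validator's actual logic.

```rust
/// TTL in slots, as stated in the commit message.
pub const MESSAGE_TTL_SLOTS: u64 = 34;

/// Assumed rule: a message is too late once more than `MESSAGE_TTL_SLOTS`
/// slots have passed since its slot. Messages from the future are not
/// considered late by this check.
pub fn is_too_late(msg_slot: u64, current_slot: u64) -> bool {
    current_slot.saturating_sub(msg_slot) > MESSAGE_TTL_SLOTS
}
```

Bounding the acceptance window this way limits how old a replayed message can be and still reach the doppelgänger check.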


Co-Authored-By: diego <[email protected]>