Skip to content

Commit 24d0801

Browse files
feat: operator doppelgänger protection with slot-based detection (#692)
Closes #627 Implements **Operator Doppelgänger Protection** to detect when multiple instances of the same operator are running, preventing QBFT protocol violations. ### Problem If two operator instances run simultaneously: - Both participate in QBFT consensus - Both send messages with the same operator ID - QBFT receives conflicting votes from one operator ID - Protocol correctness breaks (QBFT expects one vote per operator per round) ### Solution **Slot-Based Detection**: Records the startup slot and monitors the network for messages with our operator ID that reference slots after startup. | Phase | Outgoing Messages | Detection Logic | |------------|-------------------|----------------------------------------------------------| | Monitoring | ❌ BLOCKED | Messages with slot > startup_slot → Twin detected | | Completed | ✅ SENT | No longer checking | **Why Slot Comparison Works:** - All outgoing messages blocked during monitoring - Messages with `slot ≤ startup_slot`: Ignored (our own old messages) - Messages with `slot > startup_slot`: Twin detected (we didn't send them) - No race conditions possible since we never compete with twins **State Management:** - Simple AtomicBool for lock-free monitoring state - No grace period needed (slot comparison handles message filtering) - Detection active immediately on startup **Twin Detection:** - Monitors single-signer QBFT and partial signature messages - Twin detected → logs detailed context → graceful shutdown - Handles edge cases: restart in same slot, network delays, clock skew ### Benefits Over Time-Based Approaches - **Faster Startup**: No 411-second grace period wait - **More Reliable**: Slot comparison is deterministic, unaffected by network delays - **Simpler**: Fewer states, cleaner logic with lock-free operations ### Configuration This feature is **opt-in** to avoid imposing a monitoring delay on all operators: - `--operator-dg`: Enable protection (default: false) - `--operator-dg-wait-epochs`: Monitoring duration in epochs (default: 2, ~768 seconds) **Rationale for Opt-In:** - QBFT's Byzantine tolerance already handles some operator duplication scenarios - No clear path from operator duplication to validator slashing (operators don't directly sign beacon chain messages) - Provides operational discipline rather than security guarantees - ~12 minute startup delay may not be acceptable for all deployment scenarios - Operators can enable when operational safety is prioritized ### Testing 8 comprehensive tests covering slot-based detection, state management, and edge cases. Manual testing verified all configuration scenarios work correctly. Co-Authored-By: diego <[email protected]>
1 parent 2c714c2 commit 24d0801

File tree

16 files changed

+839
-45
lines changed

16 files changed

+839
-45
lines changed

CLAUDE.md

Lines changed: 36 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -225,6 +225,16 @@ When contributing to Anchor, follow these Rust best practices:
225225
6. **Simplicity First**: Always choose the simplest solution that elegantly solves the problem, follows existing patterns, maintains performance, and uses basic constructs over complex data structures
226226
7. **Check Requirements First**: Before implementing or creating anything (PRs, commits, code), always read and follow existing templates, guidelines, and requirements in the codebase
227227

228+
### Architecture and Design
229+
230+
1. **Question Intermediaries**: If data flows A → B with no transformation, question why A → intermediate → B exists. Each layer should provide clear value (logging, transformation, validation, etc.). Ask: "What problem does this solve that direct communication doesn't?"
231+
232+
2. **Separation Through Interfaces, Not Layers**: Clean boundaries come from well-defined APIs, not intermediary components. A component receiving a `Sender<Event>` achieves separation without needing forwarding tasks or wrapper channels.
233+
234+
3. **Simplification is Always Valid**: Refactoring working code for simplicity is encouraged. Question architectural decisions even after tests pass. Fewer lines and fewer components often indicates better design.
235+
236+
4. **Challenge Complexity**: Every abstraction should justify its existence. "We might need it later" or "it provides separation" aren't sufficient reasons. Complexity must solve specific, current problems.
237+
228238
### Specific Guidelines
229239

230240
1. **Naming**:
@@ -409,9 +419,34 @@ When writing PR descriptions, follow these guidelines for maintainable and revie
409419

410420
- **Keep "Proposed Changes" section high-level** - focus on what components were changed and why
411421
- **Avoid line-by-line documentation** - reviewers can see specific changes in the diff
412-
- **Use component-level summaries** rather than file-by-file breakdowns
422+
- **Use component-level summaries** rather than file-by-file breakdowns
413423
- **Emphasize the principles** being applied and operational impact
414424
- **Be concise but complete** - provide context without overwhelming detail
425+
- **Don't mention implementation details** - avoid specifying exact files, line numbers, or function names
426+
- **Don't state the obvious** - don't mention that tests pass (CI will verify this)
427+
- **Avoid redundancy** - don't repeat information already in the title or commit message
428+
- **Focus on the "why"** - explain the motivation and impact, not the mechanics
429+
430+
### Code Review Culture
431+
432+
Effective code reviews question "why" architectural decisions exist:
433+
434+
**Questions to Ask:**
435+
- "Why does this intermediary layer exist?"
436+
- "What problem does this abstraction solve?"
437+
- "Could components communicate directly?"
438+
- "Is this complexity providing clear value?"
439+
440+
**Encourage Simplification:**
441+
- Working code can still be improved
442+
- Refactoring for clarity is valuable
443+
- Fewer components usually means better architecture
444+
- Test passing ≠ design complete
445+
446+
**Balance:**
447+
- Question complexity, but respect existing patterns that solve real problems
448+
- Not every layer is unnecessary - some provide genuine value
449+
- Focus on "why" over "what"
415450

416451
## Development Tips
417452

Cargo.lock

Lines changed: 20 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@ members = [
2323
"anchor/message_sender",
2424
"anchor/message_validator",
2525
"anchor/network",
26+
"anchor/operator_doppelganger",
2627
"anchor/processor",
2728
"anchor/qbft_manager",
2829
"anchor/signature_collector",
@@ -54,6 +55,7 @@ message_receiver = { path = "anchor/message_receiver" }
5455
message_sender = { path = "anchor/message_sender" }
5556
message_validator = { path = "anchor/message_validator" }
5657
network = { path = "anchor/network" }
58+
operator_doppelganger = { path = "anchor/operator_doppelganger" }
5759
operator_key = { path = "anchor/common/operator_key" }
5860
processor = { path = "anchor/processor" }
5961
qbft = { path = "anchor/common/qbft" }

anchor/client/Cargo.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -32,6 +32,7 @@ multiaddr = { workspace = true }
3232
network = { workspace = true }
3333
network_utils = { workspace = true }
3434
openssl = { workspace = true }
35+
operator_doppelganger = { workspace = true }
3536
operator_key = { workspace = true }
3637
parking_lot = { workspace = true }
3738
processor = { workspace = true }

anchor/client/src/cli.rs

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -491,6 +491,32 @@ pub struct Node {
491491
#[clap(long, help = "Disables gossipsub topic scoring.", hide = true)]
492492
pub disable_gossipsub_topic_scoring: bool,
493493

494+
// Operator Doppelgänger Protection
495+
#[clap(
496+
long,
497+
help = "Enable operator doppelgänger protection. When enabled, the node blocks all \
498+
outgoing messages and monitors the network for messages signed with its operator ID \
499+
that reference slots after startup. Shuts down if a twin operator is detected \
500+
to prevent QBFT protocol violations.",
501+
display_order = 0,
502+
default_value_t = false,
503+
help_heading = FLAG_HEADER,
504+
action = ArgAction::Set
505+
)]
506+
pub operator_dg: bool,
507+
508+
#[clap(
509+
long,
510+
value_name = "EPOCHS",
511+
help = "Number of epochs to monitor for twin operators using slot-based detection. \
512+
During monitoring, outgoing messages remain blocked and the node checks incoming \
513+
messages for slots after startup to detect duplicate operator instances.",
514+
display_order = 0,
515+
default_value_t = 2,
516+
requires = "operator_dg"
517+
)]
518+
pub operator_dg_wait_epochs: u64,
519+
494520
#[clap(flatten)]
495521
pub logging_flags: FileLoggingFlags,
496522
}

anchor/client/src/config.rs

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -72,6 +72,10 @@ pub struct Config {
7272
pub prefer_builder_proposals: bool,
7373
/// Controls whether the latency measurement service is enabled
7474
pub disable_latency_measurement_service: bool,
75+
/// Enable operator doppelgänger protection (blocks messages and monitors for twins)
76+
pub operator_dg: bool,
77+
/// Number of epochs to monitor for twins after grace period
78+
pub operator_dg_wait_epochs: u64,
7579
}
7680

7781
impl Config {
@@ -115,6 +119,8 @@ impl Config {
115119
prefer_builder_proposals: false,
116120
gas_limit: 36_000_000,
117121
disable_latency_measurement_service: false,
122+
operator_dg: false,
123+
operator_dg_wait_epochs: 2,
118124
}
119125
}
120126
}
@@ -246,6 +252,10 @@ pub fn from_cli(cli_args: &Node, global_config: GlobalConfig) -> Result<Config,
246252
config.impostor = cli_args.impostor.map(OperatorId);
247253
config.disable_latency_measurement_service = cli_args.disable_latency_measurement_service;
248254

255+
// Operator doppelgänger protection
256+
config.operator_dg = cli_args.operator_dg;
257+
config.operator_dg_wait_epochs = cli_args.operator_dg_wait_epochs;
258+
249259
// Performance options
250260
if let Some(max_workers) = cli_args.max_workers {
251261
config.processor.max_workers = max_workers;

anchor/client/src/lib.rs

Lines changed: 42 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -36,6 +36,7 @@ use message_sender::{MessageSender, NetworkMessageSender, impostor::ImpostorMess
3636
use message_validator::Validator;
3737
use network::Network;
3838
use openssl::rsa::Rsa;
39+
use operator_doppelganger::OperatorDoppelgangerService;
3940
use parking_lot::RwLock;
4041
use qbft_manager::QbftManager;
4142
use sensitive_url::SensitiveUrl;
@@ -394,6 +395,10 @@ impl Client {
394395
"syncer",
395396
);
396397

398+
// Create operator ID wrapper that watches the database for our operator ID.
399+
// Follows the common pattern: pass OwnOperatorId to components, they call .get() only when
400+
// needed. This allows initialization before sync completes (which populates the ID from
401+
// chain).
397402
let operator_id = OwnOperatorId::new(database.watch());
398403

399404
// Network sender/receiver
@@ -419,15 +424,34 @@ impl Client {
419424
&executor,
420425
);
421426

427+
// Create operator doppelgänger protection if enabled (will be started after sync)
428+
let doppelganger_service = if config.operator_dg && config.impostor.is_none() {
429+
// Get current slot for slot-based detection baseline
430+
let startup_slot = slot_clock.now().ok_or_else(|| {
431+
"Failed to get current slot for doppelgänger protection".to_string()
432+
})?;
433+
434+
Some(Arc::new(OperatorDoppelgangerService::new(
435+
operator_id.clone(),
436+
startup_slot,
437+
E::slots_per_epoch(),
438+
Duration::from_secs(spec.seconds_per_slot),
439+
)))
440+
} else {
441+
None
442+
};
443+
422444
let message_sender: Arc<dyn MessageSender> = if config.impostor.is_none() {
423445
Arc::new(NetworkMessageSender::new(
424-
processor_senders.clone(),
425-
network_tx.clone(),
426-
key.clone(),
427-
operator_id.clone(),
428-
Some(message_validator.clone()),
429-
SUBNET_COUNT,
430-
is_synced.clone(),
446+
message_sender::NetworkMessageSenderConfig {
447+
processor: processor_senders.clone(),
448+
network_tx: network_tx.clone(),
449+
private_key: key.clone(),
450+
operator_id: operator_id.clone(),
451+
validator: Some(message_validator.clone()),
452+
subnet_count: SUBNET_COUNT,
453+
is_synced: is_synced.clone(),
454+
},
431455
)?)
432456
} else {
433457
Arc::new(ImpostorMessageSender::new(network_tx.clone(), SUBNET_COUNT))
@@ -474,6 +498,7 @@ impl Client {
474498
is_synced.clone(),
475499
outcome_tx,
476500
message_validator,
501+
doppelganger_service.clone(),
477502
);
478503

479504
// Start the p2p network
@@ -560,6 +585,7 @@ impl Client {
560585
duties_service.clone(),
561586
database.watch(),
562587
is_synced.clone(),
588+
doppelganger_service.clone(),
563589
executor.clone(),
564590
&spec,
565591
);
@@ -573,6 +599,15 @@ impl Client {
573599
.map_err(|_| "Sync watch channel closed")?;
574600
info!("Sync complete, starting services...");
575601

602+
// Block client startup during operator doppelgänger monitoring period (now that sync
603+
// is complete and operator ID available). During this period, incoming messages are
604+
// checked for twins and duties services won't start until monitoring completes.
605+
if let Some(service) = &doppelganger_service {
606+
service
607+
.monitor_blocking(config.operator_dg_wait_epochs)
608+
.await?;
609+
}
610+
576611
let mut block_service_builder = BlockServiceBuilder::new()
577612
.slot_clock(slot_clock.clone())
578613
.validator_store(validator_store.clone())

0 commit comments

Comments
 (0)