Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(peer-store): Remove addresses from peer store on dial failure #5926

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

dknopik
Copy link

@dknopik dknopik commented Mar 10, 2025

Description

When the MemoryStore receives a FromSwarm::DialFailure event, it will now modify the store appropriately:

  • LocalPeerId: Remove the peer from the store.
  • WrongPeerId: Remove the address for the dialed peer ID and readd it for the received peer ID.
  • Transport: Remove the failed addresses from the store.
  • other: no action.

Furthermore, we group some repeated event logic in a new function push_event_and_wake for internal use.

Follow-up to #5724

Notes & open questions

None.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

Copy link
Member

@jxs jxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Daniel! Overall LGTM, left a comment.

// Remove all attempted addresses.
let mut is_record_updated = false;
for (addr, _) in errors {
is_record_updated |= self.remove_address_silent(&peer, addr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we should remove the addresses here,
a Transport error may be for io reasons, which are not deterministic and may not be related with the address itself.
Therefore the addresses may still be correct right?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah. But it would also be unfortunate to have perpetually undialable peers in here, especially as we have no TTL mechanism any more, only one LRU cache per peer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good point.
But in identify we also remove all cached addresses upon a transport error, and as @dknopik already mentioned there is otherwise no way to remove an actually unreachable address.

Maybe we could differentiate between the explicitly added addresses and the ones that we learned through other behaviors with NewExternalAddrOfPeer? The former may be ones that are likely to be known and trusted, and thus could be persistent even after an TransportError, while the latter can contain unreachable addresses that are removed on a dial failure.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. But then I would allow the user to add addresses that are also removable. I'll push a draft.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I am not sure if I like it, it makes the API less clean IMO.

Is it really a problem if we remove a temporarily undialable address? For example, in my use-case (Anchor), such an address would be re-added by discovery at some point. So these addresses are not permanently lost.

Copy link
Contributor

@elenaf9 elenaf9 Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could send those through an event, but that would make my code more complex (but now that I think about it, combined with having more granular event variants as you suggested, it might make sense to switch to that anyway, making my point irrelevant 🤔).

I am not sure I understand that. The granular event types are just for the output events, how would that change anything about adding addresses?

But still, there might be some conceivable cases, where users want to add "non-trusted" addresses manually.

Well technically one could still do that through the normal Swarm::add_address, that maps to NewExternalAddrOfPeer 😄.

Which makes me think: we currently have two different ways of adding new addresses: Swarm::add_address and Behavior::update_peer.
In RequestResponse we deprecated the behavior-level methods for adding and removing addresses, in favor of doing that through the Swarm. If we want to be consistent, shouldn't we also drop the public methods on the peer_store::Behavior? Is there any reason why one would not want to just use Swarm::add_address?
For removing the address again we could add an analogous method Swarm::remove_address (sorry for opening yet another orthogonal discussion - can be a separate PR as well). Of course then we'd not have the "explicit/ permanent" addresses anymore.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was not aware of Swarm::add_address at all, good to know :) In that case my first paragraph makes no sense at all, so please disregard it.

Yeah, your proposal of making update_address non-pub in favour of events sounds good.

And about the question of removing undialable addresses - what do you think about making it configurable? Just adding a bool to the MemoryStore's Config? That does not add too much complexity, IMO. Alternatively, we could just close this and just add Swarm::remove_address, then the user can at least implement the logic for removing undialable addresses themselves. What do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @jxs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have some mechanism for removing undiable addresses, otherwise they could fill up the LRU cache and cause potentially correct addresses to be evicted (although, when we successfully established a connection the address is promoted to the head of the cache, so it would only concern addresses that we haven't dialed yet).

what do you think about making it configurable? Just adding a bool to the MemoryStore's Config?

Making it opt-out config sounds good to me.

add Swarm::remove_address, then the user can at least implement the logic for removing undialable addresses themselves

I would be in favor of that independently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about allowing users to "pin" specific addresses so they won't be removed by the cache?

@jxs jxs requested a review from elenaf9 March 11, 2025 16:14
Copy link
Contributor

@elenaf9 elenaf9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dknopik!

// Remove all attempted addresses.
let mut is_record_updated = false;
for (addr, _) in errors {
is_record_updated |= self.remove_address_silent(&peer, addr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm good point.
But in identify we also remove all cached addresses upon a transport error, and as @dknopik already mentioned there is otherwise no way to remove an actually unreachable address.

Maybe we could differentiate between the explicitly added addresses and the ones that we learned through other behaviors with NewExternalAddrOfPeer? The former may be ones that are likely to be known and trusted, and thus could be persistent even after an TransportError, while the latter can contain unreachable addresses that are removed on a dial failure.

is_record_updated |= self.remove_address_silent(&peer, addr);
}
if is_record_updated {
self.push_event_and_wake(crate::store::Event::RecordUpdated(peer));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Slightly orthogonal to this PR, but noticed it again while reviewing. Can be solved in a separate PR).

I find this even variant not very informative.
I've already raised it in #5724 (comment), but forgot about it again in later reviews. IMO the RecordUpdated variant should include further info about the info that was added.
For the memory store I think it would make sense to have different variants AddressAdded AddressRemove etc. Wdyt?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me.

elenaf9

This comment was marked as outdated.

Comment on lines +93 to 97
fn remove_address_silent(&mut self, peer: &PeerId, address: &Multiaddr, force: bool) -> bool {
self.records
.get_mut(peer)
.is_some_and(|r| r.remove_address(address))
.is_some_and(|r| r.remove_address(address, force))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove a peer from the hashmap if there are no more known addresses.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if and only if there is no custom data, or regardless?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you're right. Only if there is not custom data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants