Server considers incomplete data as complete initial sync #28

Closed
aarani opened this issue Feb 10, 2023 · 10 comments

@aarani

aarani commented Feb 10, 2023

According to the code, the server should wait to run the Snapshotter until it has caught up with gossip. Instead, it starts the Snapshotter moments after boot, after receiving only a small number of gossip messages, which creates incomplete/invalid snapshots.

Log:

Feb 09 13:19:27 rapid-gossip-sync-server[7199]: Starting gossip download
trimmed...
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: gossip count (iteration 3): 18 (delta: 13):
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: announcements: 7
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: mismatched scripts: 0
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: updates: 11
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: no HTLC max: 0
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: caught up with gossip!
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: Initial sync complete!
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: Initiating snapshotting service

I use a load balancer health check to make sure a snapshot exists before forwarding user requests to the server, but this behavior makes it impossible to know whether an RGS server is fully caught up and has ready-to-use snapshots. Is this intended behavior?

@TheBlueMatt
Contributor

Sadly there's no good way in the lightning protocol today to be confident we're done syncing aside from "just see if we aren't getting as many messages anymore". It looks like the heuristic was a little too aggressive on you here; it's possible your chain source is slow or the node you're fetching gossip from is overloaded.
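For illustration, a minimal sketch of the kind of delta-based "caught up" heuristic being described, not the server's actual code; the counts and the threshold value are assumptions:

```rust
// Sketch only: declare the initial sync "complete" once fewer than
// `delta_threshold` new gossip messages arrived since the previous poll.
// With a slow chain source, message processing is throttled, so the delta
// can dip below the threshold long before the full graph has been fetched.
fn caught_up(prev_count: usize, new_count: usize, delta_threshold: usize) -> bool {
    new_count.saturating_sub(prev_count) < delta_threshold
}

fn main() {
    // Using the numbers from the log above (18 messages total, delta of 13):
    // a threshold around 20 would already declare the sync complete.
    assert!(caught_up(5, 18, 20));
}
```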

@aarani
Author

aarani commented Apr 25, 2023

Sadly there's no good way in the lightning protocol today to be confident we're done syncing aside from "just see if we aren't getting as many messages anymore". It looks like the heuristic was a little too aggressive on you here; it's possible your chain source is slow or the node you're fetching gossip from is overloaded.

Unfortunately, the problem was indeed the slow chain source. We adjusted the "20" threshold, but still hit a problem where the peer stopped syncing with us after some time, which again caused incomplete data.

aarani closed this as not planned on Apr 25, 2023
@TheBlueMatt
Contributor

When you say "the peer stopped syncing with us after some time which again caused incomplete data.", what exactly do you mean?

We should always eventually complete sync; even if the RGS server misses some SCIDs the first time around, it should catch up eventually.

@aarani
Author

aarani commented Apr 29, 2023

When you say "the peer stopped syncing with us after some time which again caused incomplete data.", what exactly do you mean?

We should always eventually complete sync; even if the RGS server misses some SCIDs the first time around, it should catch up eventually.

Sorry, my reply should've been more comprehensive. I changed the threshold to 5 (I think); RGS would sync for a while, the peer would slow down, and RGS started snapshotting after around 40k channels. We waited days, but RGS with our chain source would never catch up to the roughly 70k channels that RGS with a normal chain source reaches.

I assumed that because we were consuming the messages too slowly, the peer stopped syncing with us.

@TheBlueMatt
Contributor

What was "your chain source", out of curiosity?

Indeed, if your chain source is really slow, it's possible we'll disconnect peers for ping timeouts before we finish the sync and won't restart sync when we reconnect. You should always continue to get live updates, though; a few restart cycles should fix it :)

@TheBlueMatt
Contributor

Also, did you try caching blocks in your chain source? That can alleviate a lot of pressure, e.g. using the wrapper at lightningdevkit/rust-lightning#2248
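As a rough illustration of the idea (this is not the wrapper from rust-lightning#2248, and `fetch_block_from_source` is a hypothetical stand-in for whatever backend call the chain source makes), a height-keyed cache in front of a slow chain source might look like:

```rust
use std::collections::HashMap;

// Sketch of block caching in front of a slow chain source. Many channel
// announcements reference the same recent blocks, so caching by height
// avoids re-fetching them from the backend for every validation request.
struct CachingChainSource {
    cache: HashMap<u32, Vec<u8>>, // block height -> serialized block
}

impl CachingChainSource {
    fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    fn block_at(&mut self, height: u32) -> &Vec<u8> {
        self.cache
            .entry(height)
            .or_insert_with(|| fetch_block_from_source(height))
    }
}

// Hypothetical slow backend call (RPC, REST, Electrum, ...).
fn fetch_block_from_source(height: u32) -> Vec<u8> {
    unimplemented!("query the chain backend for the block at `height`")
}
```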

@aarani
Author

aarani commented Apr 30, 2023

What was "your chain source", out of curiosity?

The chain source was an HTTP middleware that would receive a validation request from RGS and forward it to an Electrum server (there is some consensus logic in our client, so technically it would send it to multiple Electrum servers).

Also, did you try caching blocks in your chain source? That can alleviate a lot of pressure, e.g. using the wrapper at lightningdevkit/rust-lightning#2248

I have to look into that. We did the lookup per transaction using Electrum's id_from_pos and transaction.get methods, so we never download an entire block that we could cache.
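For reference, a minimal sketch of that per-transaction flow, assuming a hypothetical `ElectrumClient` whose methods correspond to the Electrum protocol's blockchain.transaction.id_from_pos and blockchain.transaction.get calls (this is not the actual middleware, just an illustration):

```rust
// Hypothetical Electrum client interface used only for this sketch.
trait ElectrumClient {
    fn id_from_pos(&self, height: u32, tx_pos: u32) -> String; // txid as hex
    fn transaction_get(&self, txid: &str) -> Vec<u8>;          // raw tx bytes
}

/// Resolve the funding transaction referenced by a BOLT 7 short_channel_id.
fn lookup_funding_tx(client: &dyn ElectrumClient, scid: u64) -> (Vec<u8>, u16) {
    // A short_channel_id packs block height, transaction index, and output index.
    let block_height = (scid >> 40) as u32;
    let tx_index = ((scid >> 16) & 0xFF_FFFF) as u32;
    let output_index = (scid & 0xFFFF) as u16;

    // Two round trips per channel announcement: one to map (height, position)
    // to a txid, one to fetch the transaction itself.
    let txid = client.id_from_pos(block_height, tx_index);
    let raw_tx = client.transaction_get(&txid);
    (raw_tx, output_index)
}
```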

@aarani
Author

aarani commented Apr 30, 2023

I know that memory consumption is the problem with a fully separate validation system, but it still makes me wonder whether doing that is cheaper and easier than maintaining a full node. Unfortunately, I don't really know Rust, so to test it I had to port RGS to my own main language (and, as expected, it chews memory).

@TheBlueMatt
Contributor

Ah, yeah, trying to validate gossip by requesting data from a remote server is going to be painfully slow; it wouldn't surprise me if that caused issues.
