Server considers incomplete data as complete initial sync #28

Closed
aarani opened this issue Feb 10, 2023 · 10 comments

@aarani

aarani commented Feb 10, 2023

According to the code, the server should wait to run the Snapshotter until it has caught up with gossip. Instead, it starts the Snapshotter moments after boot, after receiving only a small number of gossip messages, which creates incomplete/invalid snapshots.

Log:

Feb 09 13:19:27 rapid-gossip-sync-server[7199]: Starting gossip download
trimmed...
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: gossip count (iteration 3): 18 (delta: 13):
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: announcements: 7
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: mismatched scripts: 0
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: updates: 11
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: no HTLC max: 0
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: caught up with gossip!
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: Initial sync complete!
Feb 09 13:19:42 rapid-gossip-sync-server[7199]: Initiating snapshotting service

I use a load balancer health check to make sure a snapshot exists before forwarding user requests to the server, but this behavior makes it impossible to know whether an RGS server is fully caught up and has ready-to-use snapshots. Is this intended behavior?

@TheBlueMatt
Contributor

Sadly there's no good way in the lightning protocol today to be confident we're done syncing aside from "just see if we aren't getting as many messages anymore". It looks like the heuristic was a little too aggressive on you here; it's possible your chain source is slow or the node you're fetching gossip from is overloaded.
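For illustration, a minimal sketch of the kind of delta-based "caught up" heuristic being described, not the server's actual code; the counts and the threshold value are assumptions:

```rust
// Sketch only: declare the initial sync "complete" once fewer than
// `delta_threshold` new gossip messages arrived since the previous poll.
// With a slow chain source, message processing is throttled, so the delta
// can dip below the threshold long before the full graph has been fetched.
fn caught_up(prev_count: usize, new_count: usize, delta_threshold: usize) -> bool {
    new_count.saturating_sub(prev_count) < delta_threshold
}

fn main() {
    // Using the numbers from the log above (18 messages total, delta of 13):
    // a threshold around 20 would already declare the sync complete.
    assert!(caught_up(5, 18, 20));
}
```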

@aarani
Author

aarani commented Apr 25, 2023

Sadly there's no good way in the lightning protocol today to be confident we're done syncing aside from "just see if we aren't getting as many messages anymore". It looks like the heuristic was a little too aggressive on you here; it's possible your chain source is slow or the node you're fetching gossip from is overloaded.

Unfortunately, the problem was indeed the slow chain source. We adjusted the "20" threshold, but still hit a problem where the peer stopped syncing with us after some time, which again caused incomplete data.

aarani closed this as not planned on Apr 25, 2023
@TheBlueMatt
Contributor

When you say "the peer stopped syncing with us after some time which again caused incomplete data.", what exactly do you mean?

We should always eventually complete sync; even if the RGS server misses some SCIDs the first time around, it should catch up eventually.

@aarani
Author

aarani commented Apr 29, 2023

When you say "the peer stopped syncing with us after some time which again caused incomplete data.", what exactly do you mean?

We should always eventually complete sync; even if the RGS server misses some SCIDs the first time around, it should catch up eventually.

Sorry, my reply should've been more comprehensive. I changed the threshold to 5 (I think); RGS would sync for a while, the peer would slow down, and RGS started snapshotting after around 40k channels. We waited days, but RGS with our chain source would never catch up to the roughly 70k channels that RGS with a normal chain source reaches.

I assumed that because we were consuming the messages too slowly, the peer stopped syncing with us.

@TheBlueMatt
Contributor

What was "your chain source", out of curiosity?

Indeed, if your chain source is really slow, it's possible we'll disconnect peers for ping timeouts before we finish the sync and won't restart sync when we reconnect. You should always continue to get live updates, though; a few restart cycles should fix it :)

@TheBlueMatt
Contributor

Also, did you try caching blocks in your chain source? That can alleviate a lot of pressure, e.g. using the wrapper at lightningdevkit/rust-lightning#2248
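As a rough illustration of the idea (this is not the wrapper from rust-lightning#2248, and `fetch_block_from_source` is a hypothetical stand-in for whatever backend call the chain source makes), a height-keyed cache in front of a slow chain source might look like:

```rust
use std::collections::HashMap;

// Sketch of block caching in front of a slow chain source. Many channel
// announcements reference the same recent blocks, so caching by height
// avoids re-fetching them from the backend for every validation request.
struct CachingChainSource {
    cache: HashMap<u32, Vec<u8>>, // block height -> serialized block
}

impl CachingChainSource {
    fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    fn block_at(&mut self, height: u32) -> &Vec<u8> {
        self.cache
            .entry(height)
            .or_insert_with(|| fetch_block_from_source(height))
    }
}

// Hypothetical slow backend call (RPC, REST, Electrum, ...).
fn fetch_block_from_source(height: u32) -> Vec<u8> {
    unimplemented!("query the chain backend for the block at `height`")
}
```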

@aarani
Author

aarani commented Apr 30, 2023

What was "your chain source", out of curiosity?

The chain source was an HTTP middleware that would receive a validation request from RGS and forward it to an Electrum server (there is some consensus logic in our client, so technically it would send it to multiple Electrum servers).

Also, did you try caching blocks in your chain source? That can alleviate a lot of pressure, e.g. using the wrapper at lightningdevkit/rust-lightning#2248

I have to look into that. We did the lookup per transaction using Electrum's id_from_pos and transaction.get methods, so we never download an entire block that we could cache.
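For reference, a minimal sketch of that per-transaction flow, assuming a hypothetical `ElectrumClient` whose methods correspond to the Electrum protocol's blockchain.transaction.id_from_pos and blockchain.transaction.get calls (this is not the actual middleware, just an illustration):

```rust
// Hypothetical Electrum client interface used only for this sketch.
trait ElectrumClient {
    fn id_from_pos(&self, height: u32, tx_pos: u32) -> String; // txid as hex
    fn transaction_get(&self, txid: &str) -> Vec<u8>;          // raw tx bytes
}

/// Resolve the funding transaction referenced by a BOLT 7 short_channel_id.
fn lookup_funding_tx(client: &dyn ElectrumClient, scid: u64) -> (Vec<u8>, u16) {
    // A short_channel_id packs block height, transaction index, and output index.
    let block_height = (scid >> 40) as u32;
    let tx_index = ((scid >> 16) & 0xFF_FFFF) as u32;
    let output_index = (scid & 0xFFFF) as u16;

    // Two round trips per channel announcement: one to map (height, position)
    // to a txid, one to fetch the transaction itself.
    let txid = client.id_from_pos(block_height, tx_index);
    let raw_tx = client.transaction_get(&txid);
    (raw_tx, output_index)
}
```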

@aarani
Author

aarani commented Apr 30, 2023

I know that memory consumption is the problem with a fully separate validation system, but it still makes me wonder whether doing that is cheaper and easier than maintaining a full node. Unfortunately, I don't really know Rust, so to test it I had to port RGS to my own main language (and, as expected, it chews memory).

@TheBlueMatt
Contributor

Ah, yeah, trying to validate gossip by requesting data from a remote server is going to be painfully slow; it wouldn't surprise me if that caused issues.
