Server considers incomplete data as complete initial sync #28
Comments
Sadly there's no good way in the lightning protocol today to be confident we're done syncing, aside from "just see if we aren't getting as many messages anymore". It looks like the heuristic went a little too aggressive on you here; it's possible your chain source is slow or the node you're fetching gossip from is overloaded.
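For illustration, here is a minimal sketch of that kind of rate-based heuristic (this is not the actual RGS code; the struct name, the 10-second window, and the threshold are invented for the example):

```rust
use std::time::{Duration, Instant};

/// Illustrative "are we caught up?" monitor: declares the initial sync
/// complete once the gossip message rate drops below a threshold.
struct SyncMonitor {
    window_start: Instant,
    messages_in_window: u64,
    /// If fewer than this many messages arrive in a window, assume done.
    completion_threshold: u64,
}

impl SyncMonitor {
    fn new(completion_threshold: u64) -> Self {
        SyncMonitor {
            window_start: Instant::now(),
            messages_in_window: 0,
            completion_threshold,
        }
    }

    /// Call once per received gossip message.
    fn on_gossip_message(&mut self) {
        self.messages_in_window += 1;
    }

    /// Call periodically; returns true once a full window passes with
    /// fewer messages than the threshold, i.e. we *guess* sync is done.
    fn poll_caught_up(&mut self) -> bool {
        if self.window_start.elapsed() < Duration::from_secs(10) {
            return false;
        }
        let caught_up = self.messages_in_window < self.completion_threshold;
        self.window_start = Instant::now();
        self.messages_in_window = 0;
        caught_up
    }
}
```

The failure mode in this issue follows directly: a slow chain source throttles how fast messages are validated and consumed, so the observed rate can dip below the threshold long before the sync is actually complete.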
Unfortunately, the problem was indeed the slow chain source. We adjusted the "20" threshold but still ran into the problem.
When you say "the peer stopped syncing with us after some time which again caused incomplete data.", what exactly do you mean? We should always eventually complete the sync; even if the RGS server misses some gossip the first time around, it should catch up eventually.
Sorry, my reply should've been more comprehensive. I changed the threshold to 5 (I think); RGS would sync for a while, the peer would slow down, and RGS started snapshotting after around 40k channels. We waited days, but RGS with our chain source would never catch up to the roughly 70k channels that RGS with a normal chain source reaches.
I assumed that because we were consuming the messages too slowly, the peer stopped syncing with us.
What was "your chain source", out of curiosity? Indeed, if your chain source is really slow, it's possible we'll disconnect peers for ping timeouts before we finish the sync and won't restart sync when we reconnect. You should always continue to get live updates though, a few restart cycles should fix it :) |
Also, did you try caching blocks in your chain source? That can alleviate a lot of pressure, e.g. using the wrapper at lightningdevkit/rust-lightning#2248
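For reference, the idea behind such a wrapper, sketched against a hypothetical simplified `BlockSource` trait (the real trait in `lightning-block-sync` is async and has more methods):

```rust
use std::collections::HashMap;

type BlockHash = [u8; 32];
type Block = Vec<u8>; // placeholder for a deserialized block

/// Hypothetical, simplified stand-in for a block-fetching backend.
trait BlockSource {
    fn get_block(&mut self, hash: &BlockHash) -> std::io::Result<Block>;
}

/// Wraps an inner block source and memoizes fetched blocks so repeated
/// validations touching the same block don't re-query the slow backend.
struct CachingBlockSource<S: BlockSource> {
    inner: S,
    cache: HashMap<BlockHash, Block>,
}

impl<S: BlockSource> BlockSource for CachingBlockSource<S> {
    fn get_block(&mut self, hash: &BlockHash) -> std::io::Result<Block> {
        if let Some(block) = self.cache.get(hash) {
            return Ok(block.clone());
        }
        let block = self.inner.get_block(hash)?;
        self.cache.insert(*hash, block.clone());
        Ok(block)
    }
}
```

In practice the cache would be bounded; the memory-consumption concern raised later in this thread is exactly why.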
The chain source was an HTTP middleware that would receive a validation request from RGS and forward it to an Electrum server (there is some kind of consensus logic in our client, so technically it would send the request to multiple Electrum servers).
I'll have to look into that. We looked up transactions using Electrum's id_from_pos and transaction.get methods, so we never actually download an entire block that we could cache.
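The per-channel flow described here looks roughly like the following sketch, using a hypothetical client whose methods mirror the Electrum protocol's `blockchain.transaction.id_from_pos` and `blockchain.transaction.get`:

```rust
/// The three components encoded in a lightning short channel ID.
struct ShortChannelId {
    block_height: u32,
    tx_index: u32,
    output_index: u16,
}

/// Hypothetical client mirroring two Electrum protocol methods.
trait ElectrumApi {
    /// blockchain.transaction.id_from_pos
    fn txid_from_pos(&self, height: u32, tx_pos: u32) -> Result<String, String>;
    /// blockchain.transaction.get (raw transaction bytes)
    fn transaction_get(&self, txid: &str) -> Result<Vec<u8>, String>;
}

/// Fetches the funding transaction for an announced channel. Note the
/// two network round trips per channel; this is why a remote validation
/// backend is so much slower than a local node serving whole blocks.
fn fetch_funding_tx<C: ElectrumApi>(
    client: &C,
    scid: &ShortChannelId,
) -> Result<Vec<u8>, String> {
    let txid = client.txid_from_pos(scid.block_height, scid.tx_index)?;
    let raw_tx = client.transaction_get(&txid)?;
    // The caller would deserialize raw_tx and check that output
    // `scid.output_index` pays to the announced funding script.
    Ok(raw_tx)
}
```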
I know that memory consumption is the problem with a fully separate validation system, but it still makes me wonder whether doing that would be cheaper and easier than maintaining a full node. Unfortunately, I don't really know Rust, so to test it I had to port RGS to my main language (and, as expected, it chews through memory).
Ah, yea, trying to validate gossip by requesting data from a remote server is going to be painfully slow, wouldn't surprise me if that caused issues.
According to the code, the server should wait to run the Snapshotter until we're caught up with gossip, but instead it runs the Snapshotter just moments after bootup, after receiving only a small number of gossip messages, which creates incomplete/invalid snapshots.
Log:
I use a load balancer health check to make sure the snapshot exists before forwarding user requests to it, but this behavior makes it impossible to know whether an RGS server is fully caught up and has ready-to-use snapshots. Is this intended behavior?
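For context, a sketch of the kind of readiness probe described here; note it can only check that a snapshot file exists and is fresh, not that the gossip behind it was complete (the path and maximum age are made-up values):

```rust
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Returns true if the snapshot file exists and was modified recently.
fn snapshot_ready(path: &Path, max_age: Duration) -> bool {
    match std::fs::metadata(path).and_then(|m| m.modified()) {
        Ok(modified) => SystemTime::now()
            .duration_since(modified)
            .map(|age| age <= max_age)
            .unwrap_or(false),
        Err(_) => false,
    }
}

fn main() {
    // Hypothetical snapshot path; exit nonzero so the load balancer's
    // health check fails when no fresh snapshot is present.
    let ready = snapshot_ready(
        Path::new("/var/www/rgs/snapshot.bin"),
        Duration::from_secs(86_400),
    );
    std::process::exit(if ready { 0 } else { 1 });
}
```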