
Database corruption on service restart #3416

Closed
UglyDuskling opened this issue Jan 23, 2025 · 0 comments
Labels
fix:bug Something isn't working

Comments

@UglyDuskling

I discovered today that my Nocturne node has fallen out of sync and it has not recovered automatically. It appears that a rusk service restart caused the issue. Restarting it again does not fix it.

The error logs are filled with this error:

2025-01-20T18:33:39.272419Z ERROR node::chain: event="fsm::on_event failed" src="wire" err=Cannot accept txs: Inconsistent state verification data VerificationOutput { state_root: f10f20056e4cc9c470d6f63d86ae115ebfe8a96fd19fa964965324df73e29186, event_bloom: 00000000000000000000000000000000000000000000000000000400000000000000400000000000000000000000000000000000000000000000000000000000000000000000040000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000040000000000000000000000000000000000000000000000000000000000000000040000000000000000000000000000 }!!

Stack backtrace:
   0: anyhow::error::<impl anyhow::Error>::msg
   1: anyhow::__private::format_err
   2: rusk::node::vm::<impl node::vm::VMExecution for rusk::node::Rusk>::accept
   3: <node::database::rocksdb::Backend as node::database::DB>::update
   4: node::chain::acceptor::Acceptor<N,DB,VM>::try_accept_block::{{closure}}
   5: node::chain::fsm::insync::InSyncImpl<DB,VM,N>::on_block_event::{{closure}}
   6: node::chain::fsm::SimpleFSM<N,DB,VM>::on_block_event::{{closure}}
   7: <node::chain::ChainSrv<N,DB,VM> as node::LongLivedService<N,DB,VM>>::execute::{{closure}}
   8: node::Node<N,DB,VM>::spawn_all::{{closure}}::{{closure}}
   9: tokio::runtime::task::core::Core<T,S>::poll
  10: tokio::runtime::task::harness::Harness<T,S>::poll
  11: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  12: tokio::runtime::scheduler::multi_thread::worker::Context::run
  13: tokio::runtime::context::runtime::enter_runtime
  14: tokio::runtime::scheduler::multi_thread::worker::run
  15: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  16: tokio::runtime::task::core::Core<T,S>::poll
  17: tokio::runtime::task::harness::Harness<T,S>::poll
  18: tokio::runtime::blocking::pool::Inner::run
  19: std::sys_common::backtrace::__rust_begin_short_backtrace
  20: core::ops::function::FnOnce::call_once{{vtable.shim}}
  21: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/0f44eb32f1123ac93ab404d74c295263ce468343/library/alloc/src/boxed.rs:2007:9
  22: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/0f44eb32f1123ac93ab404d74c295263ce468343/library/alloc/src/boxed.rs:2007:9
  23: std::sys::unix::thread::Thread::new::thread_start
             at /rustc/0f44eb32f1123ac93ab404d74c295263ce468343/library/std/src/sys/unix/thread.rs:108:17
  24: <unknown>
  25: <unknown>

And loads of these:

2025-01-20T18:33:39.281331Z  WARN node::chain::acceptor: event="Consensus msg discarded" reason="too far in the future" topic=Ratification info=ConsensusHeader { prev_block_hash: "00c996948374bee1...7880b987095374a6", round: 242127, iteration: 0 } ray_id=""

Some timestamps:
Node started on the 20th at 18:33:37:

Using CONSENSUS_KEYS in /opt/dusk/conf/consensus.keys
2025-01-20T18:33:37.076153Z  INFO rusk: Using state from "/opt/dusk/rusk/state"

First error two seconds later, at 18:33:39:

2025-01-20T18:33:39.272419Z ERROR node::chain: event="fsm::on_event failed" src="wire" err=Cannot accept txs: Inconsistent state verification data VerificationOutput....

The service itself appears to have been gracefully restarted two minutes prior; it was not a machine reset. Running journalctl --since "2025-01-20" | grep rusk I get the following (the journal timestamps are GMT+11, so Jan 21 05:31:47 is equivalent to the 20th at 18:31:47):

Jan 21 05:31:47 computername systemd[1]: Stopping rusk.service - DUSK Rusk...
Jan 21 05:31:47 computername systemd[1]: rusk.service: Deactivated successfully.
Jan 21 05:31:47 computername systemd[1]: Stopped rusk.service - DUSK Rusk.
Jan 21 05:31:47 computername systemd[1]: rusk.service: Consumed 18h 53min 40.012s CPU time, 900.0M memory peak, 0B memory swap peak.
Jan 21 05:32:04 computername systemd[1]: Configuration file /etc/systemd/system/rusk.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Jan 21 05:32:04 computername systemd[1]: Configuration file /etc/systemd/system/rusk.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jan 21 05:32:04 computername systemd[1]: Configuration file /etc/systemd/system/rusk.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Jan 21 05:32:04 computername systemd[1]: Configuration file /etc/systemd/system/rusk.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jan 21 05:32:04 computername systemd[1]: Configuration file /etc/systemd/system/rusk.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Jan 21 05:32:04 computername systemd[1]: Configuration file /etc/systemd/system/rusk.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jan 21 05:33:19 computername systemd[1]: Configuration file /etc/systemd/system/rusk.service is marked executable. Please remove executable permission bits. Proceeding anyway.
Jan 21 05:33:19 computername systemd[1]: Configuration file /etc/systemd/system/rusk.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Jan 21 05:33:19 computername systemd[1]: Starting rusk.service - DUSK Rusk...
Jan 21 05:33:37 computername systemd[1]: Started rusk.service - DUSK Rusk.
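
For completeness, filtering by unit instead of grepping gives much the same picture (unit name taken from the lines above):

journalctl -u rusk.service --since "2025-01-20"    # messages from and about the rusk unit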

I don't expect state or database errors to happen from a graceful stop/start of the rusk service.

As of now I have preserved the logs and kept the state as-is, with the rusk service stopped, in case you need any of it.
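
If it helps, something along these lines is enough to snapshot both (the state path is the one from the startup log above; archive names are just examples):

sudo tar -czf ~/rusk-state-2025-01-20.tar.gz /opt/dusk/rusk/state                   # archive the untouched state directory
journalctl -u rusk.service --since "2025-01-20" > ~/rusk-journal-2025-01-20.log     # export the journal entries around the restart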

UglyDuskling added the fix:bug label on Jan 23, 2025