-
Notifications
You must be signed in to change notification settings - Fork 3
smd can destroy the entire mail spool #18
Description
I have seen at least three cases of smd attempting to destroy my entire mail spool. In the three cases, I have mostly recovered thanks to backups, notmuch, and the distributed nature of SMD/email. But it's nerve-wracking and, well, really, really bad.
I still don't quite understand what happens. I can't find it in the logs anymore, but I remember seeing something about smd tripping over an empty directory inside the .smd directory. I wrote about his on my blog here:
https://anarc.at/blog/2021-03-22-email-crash/
https://anarc.at/blog/2021-06-29-another-mail-crash/
I think the fundamental problem is that smd-pull and smd-push's locking mechanisms might have a bug. I was able to start both processes at once by furiously hitting "reload" in my email setup (essentially telling systemd to repeately run smd-pull.service), which eventually started at the same time as smd-push. Here's a log from systemd:
jun 28 16:30:04 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Succeeded.
jun 28 16:30:09 curie systemd[2845]: Finished pull emails with syncmaildir.
jun 28 16:30:09 curie systemd[2845]: Starting push emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Failed with result 'exit-code'.
jun 28 16:30:09 curie systemd[2845]: Failed to start pull emails with syncmaildir.
jun 28 16:30:22 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:22 curie smd-pull[391140]: Already running.
jun 28 16:30:22 curie smd-pull[391140]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
jun 28 16:30:22 curie smd-pull[391140]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 390135) run(rm /home/anarcat/.smd/lock))
jun 28 16:30:22 curie systemd[2845]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
jun 28 16:30:22 curie systemd[2845]: smd-pull.service: Failed with result 'exit-code'.
jun 28 16:30:22 curie systemd[2845]: Failed to start pull emails with syncmaildir.
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Main process exited, code=killed, status=15/TERM
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Failed with result 'signal'.
jun 28 16:31:14 curie systemd[2845]: Stopped push emails with syncmaildir.
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Consumed 6.600s CPU time.
jun 28 16:31:22 curie systemd[2845]: Starting push emails with syncmaildir...
jun 28 16:31:23 curie smd-push[391563]: Found lockfile of a dead instance. Ignored.
jun 28 16:31:30 curie smd-push[392598]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(216)
jun 28 16:31:30 curie systemd[2845]: smd-push.service: Succeeded.
jun 28 16:31:30 curie systemd[2845]: Finished push emails with syncmaildir.
jun 28 16:31:30 curie systemd[2845]: smd-push.service: Consumed 4.727s CPU time.
jun 28 16:31:38 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:32:22 curie smd-pull[393092]: default: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(51458), bytes-received(0), xdelta-received(247)
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Main process exited, code=killed, status=15/TERM
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Failed with result 'signal'.
jun 28 16:32:22 curie systemd[2845]: Stopped pull emails with syncmaildir.
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Consumed 12.632s CPU time.
The first thing to notice is that, at 16:30:09, two jobs are started in parallel. This is a bug in the systemd configuration, which shouldn't allow that in the first place. But considering that smd has its own locking primitives, I consider it's also a bug there. At 16:30:22, you see me triggering another run, and then smd-pull notices its other smd-pull process running and aborts. But it's still running because it's actively destroying all mail on the remote server!!
At 16:31:14 you can see I notice the problem and kill the process. And then at 16:32:22 you can see the timer i forgot to disable fired and destroyed thousands of emails locally.
Out fo 300k emails, I lost 220k in this case.
I'll send a PR fixing the systemd configuration, but I can't help but think there should be safeguards in SMD preventing it from destroying everything like this.