entry5 #5
ironicbadger
started this conversation in
General
Short/boost-like version:
My first self-built computer silently corrupted data on suspend! After hours of panicked troubleshooting, not knowing what I was losing, I replaced the SSD to no avail, then hacked together my own filesystem checksumming atop ext4. Eventually, two years later, the final firmware update for my motherboard fixed this, but by then I had committed to switching to BtrFS. BtrFS for life!
In detail:
In 2014, I built my first computer for my 2nd+ year at college. Running Ubuntu 14.04 LTS, it was fantastic overall, with no issues... except that some programs segfaulted after a few sleep/resume cycles. Sometimes a few of my files even got corrupted, and I had to restore them from Déjà Dup. And I had no idea WHICH of my files were corrupted, which terrified me.
Time to investigate!
I apt-install'd "debsums" to check for package corruption (on Ubuntu), which pointed out several problems. Reinstalling e.g. Thunderbird cleared that up. But the corruption kept coming back. I also apt-install'd the "cfv" utility and scripted it to generate checksums of my home dir, suspend the computer, then whenever it woke up it'd verify every file. Slow, but effective.
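For anyone wanting to try the same checksum-then-verify idea without cfv, here's a rough Python sketch. To be clear, this is not my original script: the function names (`sha256_of`, `build_manifest`, `find_mismatches`) and the choice of SHA-256 are just assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash one file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each regular file under root (by relative path) to its SHA-256."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and not p.is_symlink()
    }


def find_mismatches(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return files that are missing or whose hash differs from the manifest."""
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

You'd call `build_manifest(Path.home())` before suspending and `find_mismatches(Path.home(), manifest)` after waking up. Anything you (or a running program) legitimately changed in between will show up as a mismatch too, so it works best on files that are supposed to stay put.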
By now, I knew silent corruption was happening during suspend-resume. I disabled suspend and committed to always shutting down.
But I wasn't going to give up. Was it the cheaper Crucial SSD I bought..? I replaced it with a Samsung Pro SSD - and corruption still happened!
Eventually, two whole years later (2016-3-3), my motherboard manufacturer published the final firmware update - a "Beta BIOS" (not stable!) that claimed to "Fix memory compatibility"...
If you're curious: https://www.gigabyte.com/Motherboard/GA-Z97X-SLI-rev-10/support#support-dl-bios
I installed that update and threw together another script that automatically checksummed files, set an "rtcwake" alarm for 2 minutes later, suspended, verified the files on wake, and repeated, thoroughly exercising suspend/resume. And just like that, my silent data corruption nightmare was banished!
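That stress loop looks roughly like this in Python. Again, this is a sketch and not the script I actually ran: `snapshot`, `rtcwake_suspend`, and `stress_suspend` are made-up names, and I'm assuming `rtcwake -m mem -s 120` (from util-linux, needs root) for the suspend-and-wake step.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Callable


def snapshot(root: Path) -> dict[str, str]:
    """Hash every regular file under root (reads each file fully into memory)."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file() and not p.is_symlink()
    }


def rtcwake_suspend(seconds: int = 120) -> None:
    """Suspend to RAM and let the RTC alarm wake the machine (needs root)."""
    subprocess.run(["rtcwake", "-m", "mem", "-s", str(seconds)], check=True)


def stress_suspend(
    root: Path,
    cycles: int,
    suspend: Callable[[], None] = rtcwake_suspend,
) -> list[tuple[int, str]]:
    """Checksum, suspend, verify, repeat; return (cycle, file) for each mismatch."""
    failures = []
    for cycle in range(1, cycles + 1):
        before = snapshot(root)
        suspend()  # blocks until the machine has resumed
        after = snapshot(root)
        for rel in sorted(before):
            if after.get(rel) != before[rel]:
                failures.append((cycle, rel))
    return failures
```

Something like `stress_suspend(Path("/some/test/dir"), cycles=50)` run as root would exercise 50 suspend/resume cycles; an empty list at the end means no file changed across any of them. The `suspend` parameter is only there so the loop can be dry-run with a dummy function.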
Needless to say, after multiple tens of hours spent troubleshooting this, buying a whole replacement SSD (they weren't as cheap in 2015!), and messily hacking together my own automated filesystem verification...
I resolved to switch to BtrFS.
It didn't matter that folks claimed BtrFS lost them data. ext4 had lost ME data through lack of file checksums, and I wasn't going to put up with this any longer.
So, in the summer of 2016, I reinstalled Ubuntu from scratch, setting up both LUKS encryption and BtrFS as my only filesystem. And I've stuck with BtrFS since, with no data loss (not even with a "RAID1" multi-disk array on my 2016 micro server, built from used NAS drives donated by a friend's father).
Thankfully, I didn't lose much actual data, but I think I lost a lot of sleep over this silent corruption, especially with this being my first computer build... What are the odds of hitting a RAM incompatibility bug with an established RAM manufacturer (Crucial Ballistix Sport) and motherboard manufacturer (Gigabyte)?