Fix filesystem gets corrupted #1092
Conversation
It was noticed that when copying files to the flash, the filesystem eventually gets corrupted. After this finding, a stress test was implemented and additional traces were added to littlefs. With this testing, an off-by-one issue was noticed in a block boundary check. This commit fixes the off-by-one check; with it, all corruption traces disappeared and the filesystem no longer ends up unmountable (corrupted).
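Judging from the discussion below, the change tightens a boundary check roughly along these lines (a reconstruction for illustration, not the literal diff from this PR):

```diff
 // reconstruction of the described change (not the literal diff):
 // also treat a tag whose contents end exactly at the block
 // boundary as out of bounds
-        } else if (off + lfs_tag_dsize(tag) > lfs->cfg->block_size) {
+        } else if (off + lfs_tag_dsize(tag) >= lfs->cfg->block_size) {
             break;
         }
```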
Hi, I'm not contributing to littlefs, but what about this? If we connect that to the next parts of:

```c
} else if (off + lfs_tag_dsize(tag) > lfs->cfg->block_size) {
    break;
}
```

it should instead be:

```c
} else if (off > lfs->cfg->block_size) {
    break;
}
```

Correct me if I'm wrong though.
Our CI does not build the littlefs tests, and I did not think to build them locally, so I was not aware that the change impacted the tests.
Hi @erkkatalvitie, thanks for reporting this, sorry about the late response. I'm not sure this actually fixes an off-by-one. It's more likely a coincidence that this avoided filesystem corruption, maybe by silently dropping a problematic commit?

The code is a bit messy, but this check (lines 1187 to 1190 in 8ed63b2) is checking that a metadata tag is fully inside the block before calculating its checksum. It's perfectly valid for a tag's size to line up with the end of the block. In fact, this is how the checksum tag for the last commit in a block is written, due to padding for prog alignment. Which explains why the tests are failing: the last commit in a block ends up reported as corrupt. Hitting this does require filling up the metadata block, which is why you may not be seeing it in your testing. Unfortunately it will probably cause problems after many more metadata updates.

The above checks if a tag's contents go out of bounds, but not if the tag itself is out of bounds. That's actually handled elsewhere (lines 49 to 52 in 8ed63b2), which gets intercepted by the fetch loop (lines 1172 to 1175 in 8ed63b2), which only reports LFS_ERR_CORRUPT if no valid commits are found.
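To illustrate the intended semantics, a minimal sketch with simplified names (this is not the actual lfs.c code):

```c
#include <stdbool.h>
#include <stdint.h>

// Simplified illustration of the bounds semantics described above
// (names are made up; see lfs.c for the real check): a tag whose
// contents end exactly at the block boundary is valid, only a tag
// that extends past the boundary is out of bounds.
static bool tag_contents_in_bounds(uint32_t off, uint32_t dsize,
        uint32_t block_size) {
    // off + dsize == block_size is allowed: the checksum tag of the
    // last commit in a block often ends flush with the block end due
    // to padding for prog alignment
    return off + dsize <= block_size;
}
```

Under these semantics, the original `off + lfs_tag_dsize(tag) > lfs->cfg->block_size` break condition is just the complement of this check, and tightening it to `>=` misreports that flush-with-the-end checksum tag as corrupt.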
Hi @geky, thank you for your reply. After posting this pull request I noticed that this same kind of check is also used elsewhere in the littlefs code, so I started to suspect that it is not an unintentional slip, as it is repeated so many times.

As more detailed background: we first noticed the filesystem corruption in normal use when copying files to the device. I then created a script for stress testing that copies those files to the device repeatedly in a loop. I also added more traces to detect the point of failure (when the code returns LFS_ERR_CORRUPT). Traces started to show corruption almost right away when the testing began. In our device the filesystem goes into the unmountable (corrupted) state after 6 to 7 iterations.

First I investigated the flash drivers we are using but found no bug there. This does not mean there isn't any. However, we have multiple different flash devices, both NAND and NOR, and they all behave the same even with different device drivers. We have also implemented flash stress tests without a filesystem, and they perform well. This suggests that the devices are functioning properly electrically and that the device drivers are performing as expected, which led me to suspect the filesystem or our configuration of it.

With the change I presented in this pull request, the same stress test ran overnight until I stopped it; I don't recall the full count of iterations. There were no traces of corruption and the filesystem remained functional. Of course it is possible that the change just mitigates the issue without fixing the real root cause. But testing makes it clear that in our device/setup the corruption is easy to reproduce without the change, while with the change it seems really hard to reproduce; at least I haven't been able to. Disclaimer: the testing window with the change is still quite small.

I have not worked with littlefs before, nor very deeply with any other filesystem. In this context my experience comes more from the lower-level device drivers below the filesystem, which is likely why this line of code caught my eye. If we consider, for example, a NAND device with a block size of 2048 bytes, then when writing to block 0 one can write to addresses 0 through 2047, but not to address 2048, as it already belongs to block 1. This led me to think that the check in littlefs unintentionally allows access to the next block, the very thing it is trying to prevent. But after seeing more of the littlefs code, and after your reply, I think the definition of a block in the filesystem is likely a more abstract concept, and the legitimate block area is more of a design decision.
This is unfortunate, but I'm not aware of any known bugs that break the filesystem the way you are describing. Except maybe bit-errors introduced during prog/read, but given that you've tested on a range of hardware, that seems unlikely.

Is it possible to reproduce the bug in littlefs's test runner? This would remove hardware from the equation and allow others to debug what is going on. If you edit (or copy) one of the existing test definitions, for example littlefs/tests/test_files.toml (lines 2 to 24 in 8ed63b2), you can run it with make:

```
make test-runner -j
./scripts/test.py runners/test_runner test_files_simple
```

This currently assumes a Linux environment. I think it works in WSL on Windows?
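For reference, a minimal sketch of what such a test case might look like (the exact harness conventions are assumed here; see tests/test_files.toml for real examples):

```toml
# hedged sketch of a littlefs test case; the [cases.*] layout, the
# provided cfg variable, and the => assertion syntax are assumed from
# the test runner's conventions
[cases.test_files_stress_repro]
code = '''
    lfs_t lfs;
    lfs_format(&lfs, cfg) => 0;
    lfs_mount(&lfs, cfg) => 0;

    // repeatedly rewrite a file, loosely mimicking the
    // copy-in-a-loop stress test described above
    uint8_t buffer[256];
    memset(buffer, 'x', sizeof(buffer));
    for (int i = 0; i < 100; i++) {
        lfs_file_t file;
        lfs_file_open(&lfs, &file, "stress",
                LFS_O_WRONLY | LFS_O_CREAT | LFS_O_TRUNC) => 0;
        lfs_file_write(&lfs, &file, buffer, sizeof(buffer))
                => sizeof(buffer);
        lfs_file_close(&lfs, &file) => 0;
    }

    lfs_unmount(&lfs) => 0;
'''
```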
It does look suspicious.
Note that littlefs's blocks need to be a multiple (>=) of the physical block size. In your example, 2048 would be block 1, but littlefs can read multiple physical blocks as one logical block.
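For illustration, a hypothetical configuration along those lines (all driver names and values here are made up, not from this PR):

```c
#include <string.h>
#include "lfs.h"

// hypothetical flash driver hooks, stubbed out here; a real port
// would talk to the NAND hardware
static int user_read(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, void *buffer, lfs_size_t size) {
    memset(buffer, 0xff, size);  // stub: pretend-erased flash
    return 0;
}
static int user_prog(const struct lfs_config *c, lfs_block_t block,
        lfs_off_t off, const void *buffer, lfs_size_t size) { return 0; }
static int user_erase(const struct lfs_config *c, lfs_block_t block) { return 0; }
static int user_sync(const struct lfs_config *c) { return 0; }

// a logical block spanning two 2048-byte physical blocks: littlefs
// sees 4096-byte blocks, and the driver maps each onto a pair of
// physical blocks underneath
static const struct lfs_config cfg = {
    .read  = user_read,
    .prog  = user_prog,
    .erase = user_erase,
    .sync  = user_sync,

    .read_size      = 16,
    .prog_size      = 16,
    .block_size     = 4096,  // 2 x 2048-byte physical blocks
    .block_count    = 512,
    .cache_size     = 16,
    .lookahead_size = 16,
    .block_cycles   = 500,
};
```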
Sorry for the late response; I have been a bit busy with other tasks. We have received two new NAND devices since I created this pull request. The filesystem seems to be more stable with them, which would indicate that the issue is likely related to the hardware, the HW drivers, or our configuration of the filesystem.
This also hints in the same direction.
This is a good idea! I will look into this possibility if we can't resolve this otherwise. I have now implemented a new flash stress test that erase-write-read-verifies the whole device. I am planning to run this test multiple times on all devices, especially those that were not performing well with the filesystem. For now I will close this pull request. If needed, I can create an issue for investigating this matter. Thank you for your support @geky!
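A whole-device erase-write-read-verify pass like that might look roughly like this (a minimal sketch; flash_erase/flash_write/flash_read, BLOCK_SIZE, and BLOCK_COUNT are hypothetical stand-ins for the project's flash driver, not actual test code from this PR):

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// hypothetical device geometry and driver hooks
#define BLOCK_SIZE  2048
#define BLOCK_COUNT 1024

extern int flash_erase(uint32_t block);
extern int flash_write(uint32_t block, const uint8_t *buf, size_t len);
extern int flash_read(uint32_t block, uint8_t *buf, size_t len);

// one erase-write-read-verify pass over every block of the device
int stress_pass(uint32_t seed) {
    static uint8_t wbuf[BLOCK_SIZE], rbuf[BLOCK_SIZE];

    for (uint32_t b = 0; b < BLOCK_COUNT; b++) {
        // deterministic per-block pattern so corruption is attributable
        for (size_t i = 0; i < BLOCK_SIZE; i++) {
            wbuf[i] = (uint8_t)(seed ^ b ^ i);
        }

        if (flash_erase(b) != 0) { return -1; }
        if (flash_write(b, wbuf, BLOCK_SIZE) != 0) { return -1; }
        if (flash_read(b, rbuf, BLOCK_SIZE) != 0) { return -1; }

        if (memcmp(wbuf, rbuf, BLOCK_SIZE) != 0) {
            printf("verify failed at block %" PRIu32 "\n", b);
            return -1;
        }
    }
    return 0;
}
```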
A fix that worked for our project using LittleFS with STM32H7XX, when facing issues similar to #461.
Steps to reproduce in our project:
It was noticed that when copying files to the flash, the filesystem eventually gets corrupted.
After this finding, a stress test was implemented and additional traces were added to littlefs. With this testing, an off-by-one issue was noticed in a block boundary check.
This commit fixes the off-by-one check; with it, all corruption traces disappeared. The filesystem no longer ends up unmountable (corrupted) after this fix.