Fix filesystem gets corrupted #1092

Conversation

erkkatalvitie

A fix that worked for our project, which uses LittleFS on an STM32H7XX, when facing issues similar to #461.

Steps to reproduce in our project:

We noticed that when copying files to the flash, the filesystem eventually gets corrupted.

After this finding, a stress test was implemented and additional traces were added to littlefs. The testing revealed an off-by-one issue in a block boundary check.

This commit fixes the off-by-one check, and all corruption traces disappeared. The filesystem no longer ends up unmountable (corrupted) after this fix.

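The diff itself is not quoted in this thread; judging from the discussion below, the change presumably tightened the boundary comparison in lfs_dir_fetchmatch by one, roughly along these lines (a hypothetical reconstruction, not the actual commit):

    -            } else if (off + lfs_tag_dsize(tag) > lfs->cfg->block_size) {
    +            } else if (off + lfs_tag_dsize(tag) >= lfs->cfg->block_size) {
                     break;
                 }
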
@amlel-el-mahrouss

Hi, I'm not a littlefs contributor, but what about this?
https://github.com/littlefs-project/littlefs/blob/master/lfs.c#L1167

If we connect that to the next part of lfs_dir_fetchmatch:

            } else if (off + lfs_tag_dsize(tag) > lfs->cfg->block_size) {
                break;
            }

Shouldn't it instead be:

            } else if (off > lfs->cfg->block_size) {
                break;
            }

Correct me if I'm wrong, though.

@erkkatalvitie
Author

Our CI does not build the LittleFS tests, and I did not think to build them locally, so I was not aware that the change affected the tests.

@erkkatalvitie
Author

} else if (off > lfs->cfg->block_size) { would not take the lfs_tag_dsize(tag) into account at all.
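
To make that concrete with made-up numbers (purely illustrative, not from the PR), a tag whose attached data spills past the block is caught by the existing comparison but not by the suggested one:

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        uint32_t block_size = 4096;  // illustrative block size
        uint32_t off        = 4090;  // tag starts just before the end of the block
        uint32_t dsize      = 16;    // tag header plus its attached data

        // The existing check catches this: the tag's data would run past
        // the end of the block (4106 > 4096).
        assert(off + dsize > block_size);

        // The suggested check would not, because only the start offset is
        // compared against the block size (4090 > 4096 is false).
        assert(!(off > block_size));
        return 0;
    }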

@geky
Member

geky commented May 3, 2025

Hi @erkkatalvitie, thanks for reporting this, sorry about the late response.

I'm not sure this actually fixes an off-by-one. It's more likely a coincidence that this avoided filesystem corruption, maybe by silently dropping a problematic commit?

The code is a bit messy, but this check:

littlefs/lfs.c, lines 1187 to 1190 in 8ed63b2:

            // out of range?
            } else if (off + lfs_tag_dsize(tag) > lfs->cfg->block_size) {
                break;
            }

is checking that a metadata tag is fully inside the block before calculating its checksum. It's perfectly valid for a tag's size to line up with the end of the block. In fact, this is how the checksum tag for the last commit in a block is written, due to padding for prog alignment.
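
As a quick sketch of that boundary case (the sizes are illustrative, and the ">=" comparison is only a guess at what the PR changed, since the diff isn't quoted here):

    #include <assert.h>
    #include <stdint.h>

    int main(void) {
        uint32_t block_size = 4096;  // illustrative block size
        uint32_t off        = 4084;  // checksum tag written at the end of the block
        uint32_t dsize      = 12;    // tag header plus checksum plus padding

        // The tag occupies bytes 4084..4095, ending exactly at the block
        // boundary, which is valid data.
        assert(off + dsize == block_size);

        // The current check accepts it (4096 > 4096 is false) ...
        assert(!(off + dsize > block_size));

        // ... while a ">=" comparison would reject it, reporting the last
        // commit in a full block as corrupt.
        assert(off + dsize >= block_size);
        return 0;
    }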

Which explains why the tests are failing: the last commit in a block ends up reported as corrupt.

Though hitting this does require filling up the metadata block, which is why you may not be seeing this in your testing. Unfortunately it will probably cause problems after many more metadata updates.


The above checks if a tag's contents go out-of-bounds, but not if the tag itself is out-of-bounds. That's actually handled by lfs_bd_read, which has its own bounds checking:

littlefs/lfs.c, lines 49 to 52 in 8ed63b2:

    if (off+size > lfs->cfg->block_size
            || (lfs->block_count && block >= lfs->block_count)) {
        return LFS_ERR_CORRUPT;
    }

This gets intercepted by lfs_dir_fetchmatch:

littlefs/lfs.c, lines 1172 to 1175 in 8ed63b2:

            if (err == LFS_ERR_CORRUPT) {
                // can't continue?
                break;
            }

Which only reports LFS_ERR_CORRUPT if no valid commits are found.
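
A self-contained toy model of that behaviour (deliberately not littlefs code; the 16-byte tag size and the error value are arbitrary):

    #include <stdbool.h>
    #include <stdint.h>

    enum { TOY_OK = 0, TOY_ERR_CORRUPT = -84 };

    // Mirrors the lfs_bd_read bounds check: reading past the end of the
    // block fails with a "corrupt" error.
    static int toy_bd_read(uint32_t off, uint32_t size, uint32_t block_size) {
        if (off + size > block_size) {
            return TOY_ERR_CORRUPT;
        }
        return TOY_OK;
    }

    // Mirrors the interception in lfs_dir_fetchmatch: a corrupt read just
    // stops the scan, and the caller only sees an error if no valid commit
    // was found before that point.
    static int toy_fetch(uint32_t block_size) {
        bool found_valid_commit = false;
        for (uint32_t off = 0; ; off += 16) {   // pretend every tag is 16 bytes
            if (toy_bd_read(off, 16, block_size) == TOY_ERR_CORRUPT) {
                break;                          // can't continue? stop scanning
            }
            found_valid_commit = true;          // pretend the commit validates
        }
        return found_valid_commit ? TOY_OK : TOY_ERR_CORRUPT;
    }

    int main(void) {
        // A block with room for at least one tag scans fine; a block too
        // small for any tag reports corruption, since nothing valid was found.
        return (toy_fetch(4096) == TOY_OK && toy_fetch(8) == TOY_ERR_CORRUPT)
                ? 0 : 1;
    }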

@erkkatalvitie
Author

erkkatalvitie commented May 5, 2025

Hi @geky,

Thank you for your reply.

After posting this pull request I noticed that a similar check is also used elsewhere in the LittleFS code, so I started to suspect that it is not an unintentional slip, as it is repeated so many times.

As a bit more detailed background, we first noticed the corruption of the filesystem in normal use when copying files to the device. Then I created a script for stress testing that copies those files to the device repeatedly in a loop. I also added more traces to detect the point of failure (where the code returns LFS_ERR_CORRUPT). The traces started to show corruption almost immediately after the testing began. On our device the filesystem goes into an unmountable (corrupted) state after 6 to 7 iterations.
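
For reference, one iteration of the on-device side of that copy loop looks roughly like the sketch below. The file names, payload, and open flags are placeholders; only the lfs_* calls are the real API, and the traces simply flag any call that returns LFS_ERR_CORRUPT.

    #include <stdint.h>
    #include <stdio.h>
    #include "lfs.h"

    // Hypothetical payload standing in for the files being copied.
    extern const uint8_t image[];
    extern const lfs_size_t image_len;

    int stress_iteration(lfs_t *lfs, int i) {
        char path[32];
        snprintf(path, sizeof(path), "file%d.bin", i % 8);

        lfs_file_t file;
        int err = lfs_file_open(lfs, &file, path,
                LFS_O_WRONLY | LFS_O_CREAT | LFS_O_TRUNC);
        if (err) {
            return err;   // a trace here would catch LFS_ERR_CORRUPT
        }

        lfs_ssize_t res = lfs_file_write(lfs, &file, image, image_len);
        if (res < 0) {
            lfs_file_close(lfs, &file);
            return (int)res;
        }

        return lfs_file_close(lfs, &file);
    }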

First I investigated the flash drivers we are using but found no bug there. That does not mean there isn't one. However, we have multiple different flash devices, both NAND and NOR, and they all behave the same even with different device drivers. We have also implemented flash stress tests without a filesystem, and they perform well. This suggests that the devices are functioning properly electrically and that the device drivers are performing as expected, which led me to suspect the filesystem or our configuration of it.

With the change I presented in this pull request, the same stress test ran overnight until I stopped it; I don't recall the full number of iterations. There were no traces of corruption and the filesystem remained functional. Of course, it is possible that the change just mitigates the issue without fixing the real root cause. But testing makes it clear that on our device and setup the corruption is easy to reproduce without the change, while with the change it seems very hard to reproduce, at least I have not been able to. Disclaimer: the testing window with the change is still quite small.

I have not worked with LittleFS before, and not very deeply with any other filesystem either. In this context my experience comes more from the lower-level device drivers below the filesystem. Likely this is the reason why this line of code caught my eye. If we consider for example a NAND device with a block size of 2048 bytes, then when writing to block 0 one can write to addresses 0 through 2047 but not to address 2048, as it already belongs to block 1. This led me to think that the check in LittleFS unintentionally allows access to the next block, the very thing it is trying to prevent with the check. But after seeing more of the LittleFS code, and after your reply, I think that the definition of a block in the filesystem is likely a more abstract concept and the legitimate block area is more of a design decision.

@geky
Member

geky commented May 5, 2025

This is unfortunate, but I'm not aware of any known bugs that break the filesystem like how you are describing.

Except maybe bit-errors introduced during prog/read. But given that you've tested on a range of hardware, that seems unlikely.

Is it possible to reproduce the bug in littlefs's test runner? This would remove hardware from the equation and allow others to debug what is going on.

If you edit (or copy) one of the tests/test_*.toml files:

[cases.test_files_simple]
defines.INLINE_MAX = [0, -1, 8]
code = '''
    lfs_t lfs;
    lfs_format(&lfs, cfg) => 0;
    lfs_mount(&lfs, cfg) => 0;
    lfs_file_t file;
    lfs_file_open(&lfs, &file, "hello",
            LFS_O_WRONLY | LFS_O_CREAT | LFS_O_EXCL) => 0;
    lfs_size_t size = strlen("Hello World!")+1;
    uint8_t buffer[1024];
    strcpy((char*)buffer, "Hello World!");
    lfs_file_write(&lfs, &file, buffer, size) => size;
    lfs_file_close(&lfs, &file) => 0;
    lfs_unmount(&lfs) => 0;

    lfs_mount(&lfs, cfg) => 0;
    lfs_file_open(&lfs, &file, "hello", LFS_O_RDONLY) => 0;
    lfs_file_read(&lfs, &file, buffer, size) => size;
    assert(strcmp((char*)buffer, "Hello World!") == 0);
    lfs_file_close(&lfs, &file) => 0;
    lfs_unmount(&lfs) => 0;
'''

You can run it with make:

make test-runner -j
./scripts/test.py runners/test_runner test_files_simple

This currently assumes a Linux environment. I think it works in WSL on Windows?


Likely this is the reason why this line of code caught my eye.

It does look suspicious.

If we consider for example a NAND device with a block size of 2048 bytes, then when writing to block 0 one can write to addresses 0 through 2047 but not to address 2048, as it already belongs to block 1. This led me to think that the check in LittleFS unintentionally allows access to the next block, the very thing it is trying to prevent with the check. But after seeing more of the LittleFS code, and after your reply, I think that the definition of a block in the filesystem is likely a more abstract concept and the legitimate block area is more of a design decision.

Note that littlefs's block_size needs to be a multiple of, and at least as large as, the storage's erase size. On NAND this is very unlikely to be 2048; erase blocks are usually on the order of ~32*1024 to ~128*1024 bytes. Though it is true littlefs's block_size does not need to strictly match the storage.
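
For example, a configuration for a NAND part with 2048-byte pages and 128 KiB erase blocks might look roughly like this (the driver hooks and geometry here are hypothetical, not taken from this PR):

    #include "lfs.h"

    // Hypothetical driver hooks for the NAND part.
    extern int nand_read(const struct lfs_config *c, lfs_block_t block,
            lfs_off_t off, void *buffer, lfs_size_t size);
    extern int nand_prog(const struct lfs_config *c, lfs_block_t block,
            lfs_off_t off, const void *buffer, lfs_size_t size);
    extern int nand_erase(const struct lfs_config *c, lfs_block_t block);
    extern int nand_sync(const struct lfs_config *c);

    static const struct lfs_config cfg = {
        .read  = nand_read,
        .prog  = nand_prog,
        .erase = nand_erase,
        .sync  = nand_sync,

        .read_size      = 2048,        // NAND page size
        .prog_size      = 2048,        // program whole pages
        .block_size     = 128 * 1024,  // littlefs block == NAND erase block
        .block_count    = 1024,        // e.g. a 128 MiB device
        .cache_size     = 2048,
        .lookahead_size = 128,
        .block_cycles   = 500,
    };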

In your example, 2048 would be block 1, but lfs_dir_fetchmatch stops before that thanks to the check in lfs_bd_read.

littlefs can read multiple prog_size units at a time, but this is usually not an issue for hardware.

@erkkatalvitie
Author

Sorry for the late response; I have been a bit busy with other tasks.

We have received two new NAND devices since I created this pull request. The filesystem seems to be more stable with them, which would indicate that the issue is likely related either to the hardware, the HW drivers, or our configuration of the filesystem.

This is unfortunate, but I'm not aware of any known bugs that break the filesystem like how you are describing.

This also hints in the same direction.

Is it possible to reproduce the bug in littlefs's test runner? This would remove hardware from the equation and allow others to debug what is going on.

This is a good idea! I will look into this possibility if we can't resolve this.

I have now implemented a new flash stress test that erases, writes, reads back, and verifies the whole device. I am planning to run this test multiple times on all devices, especially those that were not performing well with the filesystem.
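
One pass of such a test looks roughly like the sketch below; the flash_* driver functions and the geometry are placeholders, not real driver code.

    #include <stdint.h>
    #include <string.h>

    // Hypothetical raw flash driver API (no filesystem involved).
    extern int flash_erase_block(uint32_t block);
    extern int flash_prog(uint32_t block, uint32_t off, const void *buf, uint32_t len);
    extern int flash_read(uint32_t block, uint32_t off, void *buf, uint32_t len);

    #define BLOCK_COUNT 1024u           // placeholder geometry
    #define BLOCK_SIZE  (128u * 1024u)
    #define CHUNK       2048u           // one page at a time

    // Erase, write, read back, and verify every block of the device.
    int flash_stress_pass(uint32_t pattern_seed) {
        static uint8_t wbuf[CHUNK], rbuf[CHUNK];

        for (uint32_t b = 0; b < BLOCK_COUNT; b++) {
            if (flash_erase_block(b)) return -1;

            for (uint32_t off = 0; off < BLOCK_SIZE; off += CHUNK) {
                for (uint32_t i = 0; i < CHUNK; i++) {
                    wbuf[i] = (uint8_t)(pattern_seed + b + off + i);
                }
                if (flash_prog(b, off, wbuf, CHUNK)) return -1;
                if (flash_read(b, off, rbuf, CHUNK)) return -1;
                if (memcmp(wbuf, rbuf, CHUNK) != 0) return -1;  // verify failure
            }
        }
        return 0;
    }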

For now I will close this pull request. If needed, I can create an issue for investigating this matter further.

Thank you for your support @geky!
