Hello,

We are experiencing issues with the ext4 file system in our automated tests.
Here is the required information:

[2.] Full description of the problem/report:

Sometimes, accessing a file on the ext4 file system fails with an error message
in the kernel log. So far, we have observed three different kinds of messages:

 - EXT4-fs error (device mmcblk1p2): ext4_lookup:1785: inode #10287: comm ostree: iget: checksum invalid
 - EXT4-fs error (device mmcblk0p3): __ext4_find_entry:1623: inode #258562: comm gst-launch-1.0: checksumming directory block 0
 - EXT4-fs error (device mmcblk0p3): ext4_validate_block_bitmap:390: comm fstrim: bg 16: bad block bitmap checksum
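
For triage, messages of this shape can be split into their fields
mechanically. The following is a minimal Python sketch and is not part of our
test suite. The regular expression is an assumption derived only from the
three messages quoted above, so other EXT4-fs error formats may not match, and
the names EXT4_ERROR, parse_ext4_errors and count_by_function are made up for
illustration.

```python
import re
from collections import Counter

# Pattern inferred from the three messages quoted in this report only.
# The "inode #N:" field is optional because the block bitmap message lacks it.
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<function>\w+):(?P<line>\d+): "
    r"(?:inode #(?P<inode>\d+): )?"
    r"comm (?P<comm>[^:\n]+): "
    r"(?P<message>.*)"
)

# The three messages from this report, one per line as they appear in dmesg.
SAMPLE = (
    "EXT4-fs error (device mmcblk1p2): ext4_lookup:1785: inode #10287: "
    "comm ostree: iget: checksum invalid\n"
    "EXT4-fs error (device mmcblk0p3): __ext4_find_entry:1623: inode #258562: "
    "comm gst-launch-1.0: checksumming directory block 0\n"
    "EXT4-fs error (device mmcblk0p3): ext4_validate_block_bitmap:390: "
    "comm fstrim: bg 16: bad block bitmap checksum\n"
)


def parse_ext4_errors(log_text):
    """Return one dict of fields per EXT4-fs error found in log_text."""
    return [match.groupdict() for match in EXT4_ERROR.finditer(log_text)]


def count_by_function(errors):
    """Count parsed errors by the kernel function that reported them."""
    return Counter(error["function"] for error in errors)


if __name__ == "__main__":
    for error in parse_ext4_errors(SAMPLE):
        print(error["device"], error["function"], error["comm"], error["message"])
```

Grouping by reporting function (or by comm) across many test runs would show
whether the second and third kinds of error keep occurring once the first is
fixed.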

The first issue was apparently fixed by patching our kernel with this patchset: 
https://lore.kernel.org/all/20210901020955.1657340-1-yi.zhang@huawei.com/

The second issue seems to happen with all kinds of programs. In this
instance, it was gstreamer opening a file. It can also happen when mkdir
creates a directory.

The third issue seems to only happen with fstrim.

The issue appears to be random: it cannot be reproduced easily, and we have no
procedure that triggers it reliably.

Each time a test suite is run, the image is freshly written to the device. The
same image, tested multiple times, will sometimes fail and sometimes not.

[3.] Keywords (i.e., modules, networking, kernel):
ext4, checksum

[4.] Kernel information
We use a modified version of the Debian kernel; the source code is here:
https://gitlab.apertis.org/pkg/linux

None of the patches modify the ext4 filesystem code.

[4.1.] Kernel version (from /proc/version):
It is hard to determine when the issue started appearing. Our best guess is
that it started when we upgraded from 5.15.1 to 5.15.22.
One version where this failed is:

Linux version 5.15.0-trunk-amd64 (debian-kernel@lists.debian.org) (gcc-10 
(Apertis 10.2.1-6+apertis6bv2023dev1b2) 10.2.1 20210110, GNU ld (GNU Binutils 
for Apertis) 2.35.2) #1 SMP Debian 5.15.22-0~apertis2 (2022-02-16)
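
Note that the release name in this string (5.15.0-trunk-amd64) differs from
the packaged kernel version (5.15.22-0~apertis2). A small Python sketch for
pulling both out of a /proc/version string follows. It assumes the
Debian-style layout shown above, where the package version follows the word
"Debian", and it takes the upstream version to be the part before the last
hyphen, as in Debian version numbering. The function name parse_proc_version
is hypothetical.

```python
import re

# The /proc/version string quoted above, joined into a single line.
SAMPLE_VERSION = (
    "Linux version 5.15.0-trunk-amd64 (debian-kernel@lists.debian.org) (gcc-10 "
    "(Apertis 10.2.1-6+apertis6bv2023dev1b2) 10.2.1 20210110, GNU ld (GNU Binutils "
    "for Apertis) 2.35.2) #1 SMP Debian 5.15.22-0~apertis2 (2022-02-16)"
)


def parse_proc_version(text):
    """Extract release name and package version from a /proc/version string.

    Assumes the Debian-style layout seen in this report; fields that are not
    found are returned as None.
    """
    # Collapse line wrapping so a wrapped copy of the string still parses.
    text = " ".join(text.split())
    release = re.search(r"Linux version (\S+)", text)
    package = re.search(r"Debian (\S+) \(", text)
    package_version = package.group(1) if package else None
    # Debian versions are upstream_version-debian_revision: cut at last hyphen.
    upstream = package_version.rsplit("-", 1)[0] if package_version else None
    return {
        "release": release.group(1) if release else None,
        "package_version": package_version,
        "upstream_version": upstream,
    }


if __name__ == "__main__":
    print(parse_proc_version(SAMPLE_VERSION))
```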

[4.2.] Kernel .config file:
See the attached config.txt file

[5.] Most recent kernel version which did not have the bug:
My guess is 5.15.1, but I cannot be sure of this.

[6.] Output of Oops.. message (if applicable) with symbolic information
     resolved (see Documentation/admin-guide/bug-hunting.rst)
N/A

[7.] A small shell script or example program which triggers the
     problem (if possible)
N/A, although the full output of one failing run can be checked here:
https://lava.collabora.co.uk/scheduler/job/5756873#L12901
(the link points to the line of the error)

[8.] Environment
We use two deployment types for our images: APT (classic Debian apt) and
OSTree. The issue seems to only happen with OSTree images.

Also, the issue has happened on multiple different boards, with multiple
architectures (amd64, armhf and arm64), so failing hardware is unlikely to be
at fault here.

[8.1.] Software (add the output of the ver_linux script here)
[8.2.] Processor information (from /proc/cpuinfo):
Not related

[8.3.] Module information (from /proc/modules):
See attached modules.txt file

[8.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
Happens on different HW

[8.5.] PCI information ('lspci -vvv' as root)
Not related

[8.6.] SCSI information (from /proc/scsi/scsi)
Not related

[8.7.] Other information that might be relevant to the problem
       (please look in /proc and include all information that you
       think to be relevant):
See the output of mount (amd64) in mount.txt
The issues can happen on the rootfs or the home partition.

[X.] Other notes, patches, fixes, workarounds:
Because I am not familiar with the internals of the ext4 file system and the 
issue is random and hard to reproduce, I am mainly asking for pointers or for 
patches in review to try. I can get more information as needed.

Regards,

Detlev.