Re: ext4 metadata corruption - snapshot related?

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jean-Louis Dupond <jean-louis@dupond.be>
To: linux-ext4@vger.kernel.org
Subject: Re: ext4 metadata corruption - snapshot related?
Date: Wed, 2 Jul 2025 15:43:25 +0200	[thread overview]
Message-ID: <7b9c7a42-de7b-4408-91a6-1c35e14cc380@dupond.be> (raw)
In-Reply-To: <e90d9c7f-adf8-453d-a3c2-f1d28ee9d9b3@dupond.be>

We updated a machine to a newer 6.15.2-1.el8.elrepo.x86_64 kernel, and 
the same? bug reoccurred after some time:

The error was the following:
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): 
ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 
44997932
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): 
ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 
44997932
Jul 02 11:03:35 xxxxx kernel: EXT4-fs error (device sdd1): 
ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 
44997932
Jul 02 11:04:03 xxxxx kernel: EXT4-fs error (device sdd1): 
ext4_lookup:1791: inode #44962812: comm imap: deleted inode referenced: 
44997932

Any idea's on how this could be debugged further?

Thanks
Jean-Louis

On 12/06/2025 16:43, Jean-Louis Dupond wrote:
> Hi,
>
> We have around 200 VM's running on qemu (on a AlmaLinux 9 based 
> hypervisor).
> All those VM's are migrated from physical machines recently.
>
> But when we enable backups on those VM's (which triggers snapshots), 
> we notice some weird/random ext4 corruption within the VM itself.
> The VM itself runs CloudLinux 8 (4.18.0-553.40.1.lve.el8.x86_64 kernel).
>
> This are some examples of corruption we see:
> 1)
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1036: 
> inode #19280823: comm lsphp: Directory block failed checksum
> kernel: EXT4-fs error (device sdc1): ext4_empty_dir:2801: inode 
> #19280823: comm lsphp: Directory block failed checksum
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1036: 
> inode #19280820: comm lsphp: Directory block failed checksum
> kernel: EXT4-fs error (device sdc1): ext4_empty_dir:2801: inode 
> #19280820: comm lsphp: Directory block failed checksum
>
> 2)
> kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode 
> #49419787: comm lsphp: deleted inode referenced: 49422454
> kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode 
> #49419787: comm lsphp: deleted inode referenced: 49422454
> kernel: EXT4-fs error (device sdc1): ext4_lookup:1645: inode 
> #49419787: comm lsphp: deleted inode referenced: 49422454
>
> 3)
> kernel: EXT4-fs error (device sdb1): ext4_validate_block_bitmap:384: 
> comm kworker/u240:3: bg 308: bad block bitmap checksum
> kernel: EXT4-fs (sdb1): Delayed block allocation failed for inode 
> 2513946 at logical offset 2 with max blocks 1 with error 74
> kernel: EXT4-fs (sdb1): This should not happen!! Data will be lost
> kernel: EXT4-fs (sdb1): Inode 2513946 (00000000265d63ca): 
> i_reserved_data_blocks (1) not cleared!
> kernel: EXT4-fs (sdb1): error count since last fsck: 1
> kernel: EXT4-fs (sdb1): initial error at time 1747923211: 
> ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdb1): last error at time 1747923211: 
> ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdb1): error count since last fsck: 1
> kernel: EXT4-fs (sdb1): initial error at time 1747923211: 
> ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdb1): last error at time 1747923211: 
> ext4_validate_block_bitmap:384
>
> 4)
> kernel: EXT4-fs (sdc1): error count since last fsck: 4
> kernel: EXT4-fs (sdc1): initial error at time 1746616017: 
> ext4_validate_block_bitmap:384
> kernel: EXT4-fs (sdc1): last error at time 1746621676: 
> ext4_mb_generate_buddy:808
>
>
> Now as a test we upgraded to some newer (backported) kernel, more 
> specificly: 5.14.0-284.1101
> And after doing some backups again, we had another error:
>
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: 
> inode #34752060: comm tar: Directory block failed checksum
> kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: 
> inode #34752232: comm tar: No space for directory leaf checksum. 
> Please run e2fsck -D.
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: 
> inode #34752232: comm tar: Directory block failed checksum
> kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: 
> inode #34752064: comm tar: No space for directory leaf checksum. 
> Please run e2fsck -D.
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: 
> inode #34752064: comm tar: Directory block failed checksum
> kernel: EXT4-fs warning (device sdc1): ext4_dirblock_csum_verify:405: 
> inode #34752167: comm tar: No space for directory leaf checksum. 
> Please run e2fsck -D.
> kernel: EXT4-fs error (device sdc1): htree_dirblock_to_tree:1073: 
> inode #34752167: comm tar: Directory block failed checksum
>
>
> So now we are wondering what could cause this corruption here.
> - We have more VM's on the same kind of setup, without seeing any 
> corruption. The only difference there is that the VM's are running 
> Debian, have smaller disks and not doing quota.
> - If we disable backups/snapshots, no corruption is observed
> - Even if we disable the qemu-guest-agent (so no fsfreeze is 
> executed), the corruption still occurs
>
> We (for now at least) only see the corruption on filesystems where 
> quota is enabled (both usrjquota and usrquota).
> The filesystems are between 600GB and 2TB.
> And today I noticed (as the filesystems are resized during setup), the 
> journal size is only 64M (could this potentially be an issue?).
>
> The big question in the whole story here is, could it be an in-guest 
> (ext4?) bug/issue? Or do we really need to look into the layer below 
> (aka qemu/hypervisor).
> Or if somebody has other idea's, feel free to share! Also additional 
> things that could help to troubleshoot the issue.
>
> Thanks
> Jean-Louis

next prev parent reply	other threads:[~2025-07-02 13:51 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-12 14:43 ext4 metadata corruption - snapshot related? Jean-Louis Dupond
2025-07-02 13:43 ` Jean-Louis Dupond [this message]
2025-07-02 14:37   ` Theodore Ts'o
2025-07-02 15:32     ` Jean-Louis Dupond

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7b9c7a42-de7b-4408-91a6-1c35e14cc380@dupond.be \
    --to=jean-louis@dupond.be \
    --cc=linux-ext4@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).