linux-ext4.vger.kernel.org archive mirror
* BUG: 6.10: ext4 mpage_process_page_bufs() BUG_ON triggers
@ 2024-09-04 18:47 Russell King (Oracle)
  2024-09-04 19:30 ` Russell King (Oracle)
  0 siblings, 1 reply; 6+ messages in thread
From: Russell King (Oracle) @ 2024-09-04 18:47 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger; +Cc: linux-ext4

With a 6.10 based kernel, no changes to filesystem/MM code, I'm
seeing a reliable BUG_ON() within minutes of booting on one of my
VMs. I don't have a complete oops dump, but this is what I do
have, cobbled together from what was logged by journald, and
what syslogd was able to splat on the terminals before the VM
died.

Sep 04 15:51:46 lists kernel: kernel BUG at fs/ext4/inode.c:1967!

[ 1346.494848] Call trace:
[ 1346.495409] [<c04b4f90>] (mpage_process_page_bufs) from [<c04b938c>] (mpage_prepare_extent_to_map+0x410/0x51c)
[ 1346.499202] [<c04b938c>] (mpage_prepare_extent_to_map) from [<c04bbc40>] (ext4_do_writepages+0x320/0xb94)
[ 1346.502113] [<c04bbc40>] (ext4_do_writepages) from [<c04bc5dc>] (ext4_writepages+0xc0/0x1b4)
[ 1346.504662] [<c04bc5dc>] (ext4_writepages) from [<c0361154>] (do_writepages+0x68/0x220)
[ 1346.506974] [<c0361154>] (do_writepages) from [<c0354868>] (filemap_fdatawrite_wbc+0x64/0x84)
[ 1346.509165] [<c0354868>] (filemap_fdatawrite_wbc) from [<c035706c>] (__filemap_fdatawrite_range+0x50/0x58)
[ 1346.511414] [<c035706c>] (__filemap_fdatawrite_range) from [<c035709c>] (filemap_flush+0x28/0x30)
[ 1346.513518] [<c035709c>] (filemap_flush) from [<c04a8834>] (ext4_release_file+0x70/0xac)
[ 1346.515312] [<c04a8834>] (ext4_release_file) from [<c03f8088>] (__fput+0xd4/0x2cc)
[ 1346.517219] [<c03f8088>] (__fput) from [<c03f3e64>] (sys_close+0x28/0x5c)
[ 1346.518720] [<c03f3e64>] (sys_close) from [<c0200060>] (ret_fast_syscall+0x0/0x5c)

From a quick look, I don't see any patches that touch fs/ext4/inode.c
that might address this.
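
If it helps anyone looking at this, the backtrace offsets should map
back to source lines with the kernel's scripts/faddr2line helper run
against the matching vmlinux (built with debug info, and CROSS_COMPILE
set for the arm build) - something like:

    ./scripts/faddr2line vmlinux mpage_prepare_extent_to_map+0x410/0x51c
    ./scripts/faddr2line vmlinux ext4_do_writepages+0x320/0xb94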

I'm not able to do any debugging, and from Friday, I suspect I won't
even be able to use a computer (due to operations on my eyes.)

-- 
*** please note that I probably will only be occasionally responsive
*** for an unknown period of time due to recent eye surgery making
*** reading quite difficult.

RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: BUG: 6.10: ext4 mpage_process_page_bufs() BUG_ON triggers
  2024-09-04 18:47 BUG: 6.10: ext4 mpage_process_page_bufs() BUG_ON triggers Russell King (Oracle)
@ 2024-09-04 19:30 ` Russell King (Oracle)
  2024-09-04 19:50   ` Russell King (Oracle)
  0 siblings, 1 reply; 6+ messages in thread
From: Russell King (Oracle) @ 2024-09-04 19:30 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger; +Cc: linux-ext4

On Wed, Sep 04, 2024 at 07:47:33PM +0100, Russell King (Oracle) wrote:
> With a 6.10 based kernel, no changes to filesystem/MM code, I'm
> seeing a reliable BUG_ON() within minutes of booting on one of my
> VMs. I don't have a complete oops dump, but this is what I do
> have, cobbled together from what was logged by journald, and
> what syslogd was able to splat on the terminals before the VM
> died.
> 
> Sep 04 15:51:46 lists kernel: kernel BUG at fs/ext4/inode.c:1967!
> 
> [ 1346.494848] Call trace:
> [ 1346.495409] [<c04b4f90>] (mpage_process_page_bufs) from [<c04b938c>] (mpage_prepare_extent_to_map+0x410/0x51c)
> [ 1346.499202] [<c04b938c>] (mpage_prepare_extent_to_map) from [<c04bbc40>] (ext4_do_writepages+0x320/0xb94)
> [ 1346.502113] [<c04bbc40>] (ext4_do_writepages) from [<c04bc5dc>] (ext4_writepages+0xc0/0x1b4)
> [ 1346.504662] [<c04bc5dc>] (ext4_writepages) from [<c0361154>] (do_writepages+0x68/0x220)
> [ 1346.506974] [<c0361154>] (do_writepages) from [<c0354868>] (filemap_fdatawrite_wbc+0x64/0x84)
> [ 1346.509165] [<c0354868>] (filemap_fdatawrite_wbc) from [<c035706c>] (__filemap_fdatawrite_range+0x50/0x58)
> [ 1346.511414] [<c035706c>] (__filemap_fdatawrite_range) from [<c035709c>] (filemap_flush+0x28/0x30)
> [ 1346.513518] [<c035709c>] (filemap_flush) from [<c04a8834>] (ext4_release_file+0x70/0xac)
> [ 1346.515312] [<c04a8834>] (ext4_release_file) from [<c03f8088>] (__fput+0xd4/0x2cc)
> [ 1346.517219] [<c03f8088>] (__fput) from [<c03f3e64>] (sys_close+0x28/0x5c)
> [ 1346.518720] [<c03f3e64>] (sys_close) from [<c0200060>] (ret_fast_syscall+0x0/0x5c)
> 
> From a quick look, I don't see any patches that touch fs/ext4/inode.c
> that might address this.
> 
> I'm not able to do any debugging, and from Friday, I suspect I won't
> even be able to use a computer (due to operations on my eyes.)

After rebooting the VM, the next oops was:

Sep 04 19:33:41 lists kernel: Unable to handle kernel paging request at virtual address 5ed304f3 when read
Sep 04 19:33:42 lists kernel: [5ed304f3] *pgd=80000040005003, *pmd=00000000
Sep 04 19:33:42 lists kernel: Internal error: Oops: 206 [#1] PREEMPT SMP ARM

 kernel:[  205.583038] Internal error: Oops: 206 [#1] PREEMPT SMP ARM
 kernel:[  205.630530] Process kworker/u4:2 (pid: 33, stack limit = 0xc68f8000)
...
 kernel:[  205.661017] Call trace:
 kernel:[  205.661997] [<c04d9060>] (ext4_finish_bio) from [<c04d931c>] (ext4_release_io_end+0x48/0xfc)
 kernel:[  205.664523] [<c04d931c>] (ext4_release_io_end) from [<c04d94d8>] (ext4_end_io_rsv_work+0x88/0x188)
 kernel:[  205.666628] [<c04d94d8>] (ext4_end_io_rsv_work) from [<c023f310>] (process_one_work+0x178/0x30c)
 kernel:[  205.669924] [<c023f310>] (process_one_work) from [<c023fe48>] (worker_thread+0x25c/0x438)
 kernel:[  205.671679] [<c023fe48>] (worker_thread) from [<c02480b0>] (kthread+0xfc/0x12c)
 kernel:[  205.673607] [<c02480b0>] (kthread) from [<c020015c>] (ret_from_fork+0x14/0x38)
 kernel:[  205.682719] Code: e1540005 0a00000d e5941008 e594201c (e5913000)

This corresponds with:

c04d9050:       e1540005        cmp     r4, r5
c04d9054:       0a00000d        beq     c04d9090 <ext4_finish_bio+0x208>
c04d9058:       e5941008        ldr     r1, [r4, #8]
c04d905c:       e594201c        ldr     r2, [r4, #28]
c04d9060:       e5913000        ldr     r3, [r1] ;<<<==== faulting instruction

This code is:

                /*
                 * We check all buffers in the folio under b_uptodate_lock
                 * to avoid races with other end io clearing async_write flags
                 */
                spin_lock_irqsave(&head->b_uptodate_lock, flags);
                do {
                        if (bh_offset(bh) < bio_start ||
                            bh_offset(bh) + bh->b_size > bio_end) {

where r4 is "bh", r4+8 is the location of the bh->b_page pointer.

static inline unsigned long bh_offset(const struct buffer_head *bh)
{
        return (unsigned long)(bh)->b_data & (page_size(bh->b_page) - 1);
}

static inline unsigned long compound_nr(struct page *page)
{
        struct folio *folio = (struct folio *)page;

        if (!test_bit(PG_head, &folio->flags))
                return 1;

where PG_head is bit 6. Thus, bh->b_page was corrupt.
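
(The structure offsets above are from my build's disassembly; for
anyone wanting to double-check them, pahole against a vmlinux with
debug info should show the same layout, e.g.:

    pahole -C buffer_head vmlinux
    pahole -C page vmlinux

which on this 32-bit build should put b_folio/b_page at offset 8 and
b_data at offset 28, matching the r4-relative loads above.)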

I've booted back into 6.7 on the offending VM, and it's stable, so it
appears to be a regression between 6.7 and 6.10.

-- 
*** please note that I probably will only be occasionally responsive
*** for an unknown period of time due to recent eye surgery making
*** reading quite difficult.

RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: BUG: 6.10: ext4 mpage_process_page_bufs() BUG_ON triggers
  2024-09-04 19:30 ` Russell King (Oracle)
@ 2024-09-04 19:50   ` Russell King (Oracle)
  2024-09-05 13:15     ` Theodore Ts'o
  0 siblings, 1 reply; 6+ messages in thread
From: Russell King (Oracle) @ 2024-09-04 19:50 UTC (permalink / raw)
  To: Theodore Ts'o, Andreas Dilger; +Cc: linux-ext4

On Wed, Sep 04, 2024 at 08:30:43PM +0100, Russell King (Oracle) wrote:
> On Wed, Sep 04, 2024 at 07:47:33PM +0100, Russell King (Oracle) wrote:
> > With a 6.10 based kernel, no changes to filesystem/MM code, I'm
> > seeing a reliable BUG_ON() within minutes of booting on one of my
> > VMs. I don't have a complete oops dump, but this is what I do
> > have, cobbled together from what was logged by journald, and
> > what syslogd was able to splat on the terminals before the VM
> > died.
> > 
> > Sep 04 15:51:46 lists kernel: kernel BUG at fs/ext4/inode.c:1967!
> > 
> > [ 1346.494848] Call trace:
> > [ 1346.495409] [<c04b4f90>] (mpage_process_page_bufs) from [<c04b938c>] (mpage_prepare_extent_to_map+0x410/0x51c)
> > [ 1346.499202] [<c04b938c>] (mpage_prepare_extent_to_map) from [<c04bbc40>] (ext4_do_writepages+0x320/0xb94)
> > [ 1346.502113] [<c04bbc40>] (ext4_do_writepages) from [<c04bc5dc>] (ext4_writepages+0xc0/0x1b4)
> > [ 1346.504662] [<c04bc5dc>] (ext4_writepages) from [<c0361154>] (do_writepages+0x68/0x220)
> > [ 1346.506974] [<c0361154>] (do_writepages) from [<c0354868>] (filemap_fdatawrite_wbc+0x64/0x84)
> > [ 1346.509165] [<c0354868>] (filemap_fdatawrite_wbc) from [<c035706c>] (__filemap_fdatawrite_range+0x50/0x58)
> > [ 1346.511414] [<c035706c>] (__filemap_fdatawrite_range) from [<c035709c>] (filemap_flush+0x28/0x30)
> > [ 1346.513518] [<c035709c>] (filemap_flush) from [<c04a8834>] (ext4_release_file+0x70/0xac)
> > [ 1346.515312] [<c04a8834>] (ext4_release_file) from [<c03f8088>] (__fput+0xd4/0x2cc)
> > [ 1346.517219] [<c03f8088>] (__fput) from [<c03f3e64>] (sys_close+0x28/0x5c)
> > [ 1346.518720] [<c03f3e64>] (sys_close) from [<c0200060>] (ret_fast_syscall+0x0/0x5c)
> > 
> > From a quick look, I don't see any patches that touch fs/ext4/inode.c
> > that might address this.
> > 
> > I'm not able to do any debugging, and from Friday, I suspect I won't
> > even be able to use a computer (due to operations on my eyes.)
> 
> After rebooting the VM, the next oops was:
> 
> Sep 04 19:33:41 lists kernel: Unable to handle kernel paging request at virtual address 5ed304f3 when read
> Sep 04 19:33:42 lists kernel: [5ed304f3] *pgd=80000040005003, *pmd=00000000
> Sep 04 19:33:42 lists kernel: Internal error: Oops: 206 [#1] PREEMPT SMP ARM
> 
>  kernel:[  205.583038] Internal error: Oops: 206 [#1] PREEMPT SMP ARM
>  kernel:[  205.630530] Process kworker/u4:2 (pid: 33, stack limit = 0xc68f8000)
> ...
>  kernel:[  205.661017] Call trace:
>  kernel:[  205.661997] [<c04d9060>] (ext4_finish_bio) from [<c04d931c>] (ext4_release_io_end+0x48/0xfc)
>  kernel:[  205.664523] [<c04d931c>] (ext4_release_io_end) from [<c04d94d8>] (ext4_end_io_rsv_work+0x88/0x188)
>  kernel:[  205.666628] [<c04d94d8>] (ext4_end_io_rsv_work) from [<c023f310>] (process_one_work+0x178/0x30c)
>  kernel:[  205.669924] [<c023f310>] (process_one_work) from [<c023fe48>] (worker_thread+0x25c/0x438)
>  kernel:[  205.671679] [<c023fe48>] (worker_thread) from [<c02480b0>] (kthread+0xfc/0x12c)
>  kernel:[  205.673607] [<c02480b0>] (kthread) from [<c020015c>] (ret_from_fork+0x14/0x38)
>  kernel:[  205.682719] Code: e1540005 0a00000d e5941008 e594201c (e5913000)
> 
> This corresponds with:
> 
> c04d9050:       e1540005        cmp     r4, r5
> c04d9054:       0a00000d        beq     c04d9090 <ext4_finish_bio+0x208>
> c04d9058:       e5941008        ldr     r1, [r4, #8]
> c04d905c:       e594201c        ldr     r2, [r4, #28]
> c04d9060:       e5913000        ldr     r3, [r1] ;<<<==== faulting instruction
> 
> This code is:
> 
>                 /*
>                  * We check all buffers in the folio under b_uptodate_lock
>                  * to avoid races with other end io clearing async_write flags
>                  */
>                 spin_lock_irqsave(&head->b_uptodate_lock, flags);
>                 do {
>                         if (bh_offset(bh) < bio_start ||
>                             bh_offset(bh) + bh->b_size > bio_end) {
> 
> where r4 is "bh", r4+8 is the location of the bh->b_page pointer.
> 
> static inline unsigned long bh_offset(const struct buffer_head *bh)
> {
>         return (unsigned long)(bh)->b_data & (page_size(bh->b_page) - 1);
> }
> 
> static inline unsigned long compound_nr(struct page *page)
> {
>         struct folio *folio = (struct folio *)page;
> 
>         if (!test_bit(PG_head, &folio->flags))
>                 return 1;
> 
> where PG_head is bit 6. Thus, bh->b_page was corrupt.
> 
> I've booted back into 6.7 on the offending VM, and it's stable, so it
> appears to be a regression between 6.7 and 6.10.

So.. I decided to stop the VM, and run e2fsck, first not allowing it
to fix stuff. It identified only one error:

e2fsck 1.46.2 (28-Feb-2021)
Pass 1: Checking inodes, blocks, and sizes
Inode 657, i_blocks is 8, should be 0.  Fix<y>? no
Inode 657 passes checks, but checksum does not match inode.  Fix<y>? no
Pass 2: Checking directory structure
ext2fs_read_inode: Inode checksum does not match inode while reading
inode 657 in process_bad_inode

/dev/loop0p1: ********** WARNING: Filesystem still has errors **********

e2fsck: aborted

/dev/loop0p1: ********** WARNING: Filesystem still has errors **********

I booted the VM again and set about tracking that down (debugfs refused
to do the inode number->name translation for some absurd reason...
surely the inode checksum doesn't _need_ to be correct to find which
directory entry references that inode number?)
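
(For reference, the incantation I'd normally expect to work for that
translation is debugfs's ncheck, with -c to open the filesystem in
"catastrophic" mode if it's unhappy about something - roughly:

    debugfs -c -R "ncheck 657" /dev/loop0p1

against the image, or the equivalent in the guest.)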

Interestingly, when I hit the bad inode:

vdir: cannot access '/var/lib/dpkg/info/libacl1:armhf.triggers': Bad message
vdir: '/var/lib/dpkg/info/libgdbm-compat4:armhf.list': Input/output error
vdir: '/var/lib/dpkg/info/libldap-2.5-0:armhf.shlibs': Input/output error

and any further accesses to the filesystem gave -EIO! As it's the root
filesystem, no further commands could be run, not even those which
I'd just run. The only solution was to destroy the VM via libvirt.
This, despite the filesystem being mounted:

/dev/vda1 on / type ext4 (rw,relatime,errors=remount-ro)

So why it started spitting out -EIO errors I've no idea. That seems
utter madness.
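
(For completeness: the error behaviour recorded in the superblock, as
opposed to the errors= mount option, can be checked with something
like:

    tune2fs -l /dev/vda1 | grep -i 'errors behavior'

though that doesn't explain reads suddenly returning -EIO either.)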

Anyway, I re-ran e2fsck, fixing the errors:

e2fsck 1.46.2 (28-Feb-2021)
/dev/loop0p1: recovering journal
Pass 1: Checking inodes, blocks, and sizes
Inode 657, i_blocks is 8, should be 0.  Fix<y>? yes
Pass 2: Checking directory structure
Inode 657 (/var/lib/dpkg/info/libacl1:armhf.triggers) has invalid mode
(0171223).
Clear<y>? yes
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -88667
Fix<y>? yes
Free blocks count wrong for group #2 (21314, counted=21315).
Fix<y>? yes
Free blocks count wrong (862994, counted=862964).
Fix<y>? yes
Free inodes count wrong (292360, counted=292352).
Fix<y>? yes

/dev/loop0p1: ***** FILE SYSTEM WAS MODIFIED *****
/dev/loop0p1: 38272/330624 files (0.4% non-contiguous), 459276/1322240 blocks

before restarting the VM back into 6.7.0.

-- 
*** please note that I probably will only be occasionally responsive
*** for an unknown period of time due to recent eye surgery making
*** reading quite difficult.

RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: BUG: 6.10: ext4 mpage_process_page_bufs() BUG_ON triggers
  2024-09-04 19:50   ` Russell King (Oracle)
@ 2024-09-05 13:15     ` Theodore Ts'o
  2024-10-11 17:11       ` Russell King (Oracle)
  0 siblings, 1 reply; 6+ messages in thread
From: Theodore Ts'o @ 2024-09-05 13:15 UTC (permalink / raw)
  To: Russell King (Oracle); +Cc: Andreas Dilger, linux-ext4

Hmm, so you can reliably reproduce this when you boot the VM into
6.10, but not when using 6.7?  And this is on an arm64 system,
correct?  What is the VM doing?  Does this show up when you boot it,
or when you kick off some kind of appliance?  What sort of devices are
you using with the VM, and what is the VMM that you are using (e.g.,
qemu, Oracle's Cloud, etc.)?

Is the VM image something you can share?  Or can you share the
metadata-only image using the e2image -Q command?
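
Something along the lines of:

    e2image -Q /dev/vdXX vda1.qcow2
    bzip2 -z vda1.qcow2

should do it - the image only contains metadata, so it compresses down
quite small (though I believe it still includes directory blocks, i.e.
filenames, if that matters to you).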

						- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: BUG: 6.10: ext4 mpage_process_page_bufs() BUG_ON triggers
  2024-09-05 13:15     ` Theodore Ts'o
@ 2024-10-11 17:11       ` Russell King (Oracle)
  2024-10-18  0:44         ` Theodore Ts'o
  0 siblings, 1 reply; 6+ messages in thread
From: Russell King (Oracle) @ 2024-10-11 17:11 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Andreas Dilger, linux-ext4

On Thu, Sep 05, 2024 at 09:15:42AM -0400, Theodore Ts'o wrote:
> Hmm, so you can reliably reproduce this when you boot the VM into
> 6.10, but not when using 6.7?  And this is on an arm64 system,
> correct?  What is the VM doing?  Does this show up when you boot it,
> or when you kick off some kind of appliance?  What sort of devices are
> you using with the VM, and what is the VMM that you are using (e.g.,
> qemu, Oracle's Cloud, etc.)?
> 
> Is the VM image something you can share?  Or can you share the
> metadata-only image using the e2image -Q command?

Hi Ted,

I've had another instance of it, discovered today. A VM that was
running fine yesterday was dead this morning - ssh closed any
attempt to connect within a few seconds. Trying to login through
the hypervisor console resulted in a response to typing and then
nothing further. Even sysrq was dead because the console had been
closed.

I dumped the VM memory to disk - it's only 128MB - and set about
extracting the kernel messages, finding these as the last
messages:

EXT4-fs error (device vda1): ext4_lookup:1854: inode #32801: comm mandb: iget: checksum invalid
Aborting journal on device vda1-8.8
EXT4-fs (vda1): Remounting filesystem read-only

So, we now have the cause and inode number, and the reason (invalid
checksum). I discovered today that ext4 records details of the error
on the filesystem, accessible using tune2fs:

First error time:         Fri Oct 11 07:18:38 2024
First error function:     ext4_lookup
First error line #:       1854
First error inode #:      32801
First error err:          EFSBADCRC
Last error time:          Fri Oct 11 07:18:38 2024
Last error function:      ext4_lookup
Last error line #:        1854
Last error inode #:       32801
Last error err:           EFSBADCRC
Checksum type:            crc32c
Checksum:                 0xb71f356b
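
(That record lives in the superblock, so for anyone wanting to pull the
same information out of an image, something like:

    tune2fs -l /dev/loop0p1 | grep -i error

or dumpe2fs -h will show it.)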

Running e2fsck -n on the filesystem reveals no checksum errors, only
two block/inode counts being wrong:

e2fsck 1.46.2 (28-Feb-2021)
Warning: skipping journal recovery because doing a read-only filesystem check.
/dev/loop0p1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (106574, counted=106578).
Fix? no

Free inodes count wrong (82348, counted=82353).
Fix? no

/dev/loop0p1: 48724/131072 files (1.2% non-contiguous), 417202/523776 blocks

So, I think we can conclude that the on-disk checksums are correct, and
apart from the filesystem still being mounted in the guest, albeit now
in a read-only errored state with an unrecovered journal, I think the
e2fsck result is healthy.

This reminded me of our previous thread where checksums were apparently
invalid, which came down to a compiler bug miscompiling the ext4
checksum code.

https://lore.kernel.org/all/20210105195004.GF1551@shell.armlinux.org.uk/

However, after having looked at the generated code, it isn't a repeat
of that (thankfully!). It does have the feel of something very, very
subtle though - a specific set of swiss cheese slices that come together
such that all the holes line up...

Let's have a look at the data for the inode on disk:

debugfs:  ncheck 32801
Inode   Pathname
32801   /usr/share/man/man3/tmpnam.3.gz
debugfs:  stat <32801>
Inode: 32801   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 1819410294    Version: 0x00000000:00000004
User:     0   Group:     0   Project:     0   Size: 1623
File ACL: 0
Links: 1   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x649205ad:2957c9bc -- Tue Jun 20 21:01:49 2023
 atime: 0x64920601:4d5bb184 -- Tue Jun 20 21:03:13 2023
 mtime: 0x644e45fe:00000000 -- Sun Apr 30 11:42:06 2023
crtime: 0x649205ac:7432ceb4 -- Tue Jun 20 21:01:48 2023
Size of extra inode fields: 32
Inode checksum: 0x569f27e5
EXTENTS:
(0):59627
debugfs:  imap <32801>
Inode 32801 is part of block group 4
        located at block 2339, offset 0x0000
debugfs:  id <32801>
0000  a481 0000 5706 0000 0106 9264 ad05 9264  ....W......d...d
0020  fe45 4e64 0000 0000 0000 0100 0800 0000  .ENd............
0040  0000 0800 0400 0000 0af3 0100 0400 0000  ................
0060  0000 0000 0000 0000 0100 0000 ebe8 0000  ................
0100  0000 0000 0000 0000 0000 0000 0000 0000  ................
*
0140  0000 0000 76ff 716c 0000 0000 0000 0000  ....v.ql........
0160  0000 0000 0000 0000 0000 0000 e527 0000  .............'..
0200  2000 9f56 bcc9 5729 0000 0000 84b1 5b4d   ..V..W)......[M
0220  ac05 9264 b4ce 3274 0000 0000 0000 0000  ...d..2t........
0240  0000 0000 0000 0000 0000 0000 0000 0000  ................
*

We can see the checksum in there, so let's go searching the
memory dump for it. The offsets below are hexdump -C offsets from
the start of memory; PAGE_OFFSET=0xc0000000. The only place the
bytes corresponding to the checksum (e5, 27, ... 57, 29) appear is:

07973000  a4 c7 04 12 9b 86 ca 80  e4 21 cd 00 4f 80 3d 88  |.........!..O.=.|
07973010  fe 45 4e 64 00 00 00 00  00 00 01 00 08 00 00 00  |.ENd............|
07973020  00 00 08 00 04 00 00 00  0a f3 01 00 04 00 00 00  |................|
07973030  00 00 00 00 00 00 00 00  01 00 00 00 eb e8 00 00  |................|
07973040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
07973060  00 00 00 00 76 ff 71 6c  00 00 00 00 00 00 00 00  |....v.ql........|
07973070  00 00 00 00 00 00 00 00  00 00 00 00 e5 27 00 00  |.............'..|
07973080  20 00 9f 56 bc c9 57 29  00 00 00 00 84 b1 5b 4d  | ..V..W)......[M|
07973090  ac 05 92 64 b4 ce 32 74  00 00 00 00 00 00 00 00  |...d..2t........|
079730a0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|

All other bytes correspond except for the first 16 bytes (well, the
first byte is correct but the others in that row are all different).
Note the block number for the data - 0xe8eb (59627).
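
(FWIW, that kind of raw byte-pattern search over the dump doesn't need
anything clever; GNU grep can do it directly on the binary, something
like:

    LC_ALL=C grep -oabP '\x20\x00\x9f\x56\xbc\xc9\x57\x29' vm.dump

where vm.dump stands in for whatever the memory dump file is called,
and the pattern is a distinctive run of bytes around the inode
checksum.)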

To make sure that there isn't a journal entry pending on disk that
might update the inode with what I found in memory, I naively
searched the journal blocks for the block number in the embedded
ext4 extent in the inode and found nothing. Either that isn't
sufficient, or there isn't anything outstanding. In any case, if
the timestamps had changed (bytes 8-15) then the nanosecond/epoch
parts that appear later in the ext4_inode structure should also be
different. It would be very lucky to only change the seconds part
of the timestamp!

We know that ext4_inode is looked up through a buffer head, and from
the imap command in debugfs, this inode is at block 2339 (0x923)
offset 0.

We find a pointer to 0xc7973000 only once in the entire memory dump
at offset 185af9c:

0185af80  19 00 00 00 80 af 85 c1  2c a4 fe c7 00 00 00 00  |........,.......|
0185af90  23 09 00 00 00 00 00 00  00 10 00 00 00 30 97 c7  |#............0..|
0185afa0  80 a0 80 c1 50 84 43 c0  00 00 00 00 ac af 85 c1  |....P.C.........|
0185afb0  ac af 85 c1 00 00 00 00  00 00 00 00 00 00 00 00  |................|

which appears to be the struct buffer_head.
b_state		= 0x00000019	BH_Mapped | BH_Req | BH_Uptodate
b_this_page	= 0xc185af80
b_page/b_folio	= 0xc7fea42c
b_blocknr	= 0x923 (2339) - as expected
b_size		= 0x1000 (4096) - page size
b_data		= 0xc7973000
b_bdev		= 0xc180a080
b_end_io	= 0xc0438450
b_private	= NULL
b_assoc_buffers.next = 0xc185afac \ empty
b_assoc_buffers.prev = 0xc185afac / list
b_count		= 0

Chasing this to the struct page (I don't think it's a folio because
it isn't large enough) at offset 7fea42c:

07fea420  ff ff ff ff 02 00 00 00  00 70 6f c2 24 42 00 14  |.........po.$B..|
07fea430  a0 a3 fe c7 74 a5 fe c7  20 a3 80 c1 23 09 00 00  |....t... ...#...|
07fea440  80 af 85 c1 ff ff ff ff  02 00 00 00 00 d0 d6 c2  |................|

flags		= 0x14004224, which I believe mean
		  PG_referenced | PG_lru | PG_workingset |
		  PG_private | LRU_REFS(2) | LRU_GEN(2)
lru.next	= 0xc7fea3a0
lru.prev	= 0xc7fea574
mapping		= 0xc180a320
index		= 0x923 (2339) - as expected
private		= 0xc185af80 - pointer to struct buffer_head above
_mapcount	= 0xffffffff
_refcount	= 2
memcg_data	= 0xc2d6d000

So, it looks like the page is still in use and not free. This is the
only place I can find a pointer to the struct buffer_head in the
memory dump, apart from buffer_head.b_this_page.

Now, given that the memory dump of the VM was created hours after
it crashed, I can't see the CPU cache being an issue, because if the
data was corrupted at the time ext4 came across the error, that data
would've been evicted hours ago.

I thought maybe something could be gleaned from the values in the
corruption, but it makes no sense to me - they don't seem to be valid
pointers:

corrupted	on-disk
0x1204c7a4	0x000081a4
0x80ca869b	0x00000657
0x00cd21e4	0x64920601
0x883d804f	0x649205ad

So I'm rather stumped at the moment. I'm now thinking that this isn't an
ext4 issue, but instead could be something really very subtle in the MM,
virtio, or even qemu. I don't see it being qemu, because 6.7 was fine.

This is the second 6.10 VM that has died on the machine - it's running
exactly the same kernel binary as the other VM that previously died.

I've saved the qcow of the filesystem, and also have the coredump of
the VM's memory in case it needs further investigation.

I'm about to throw 6.11 on at least some of these VMs (including the
two that failed) so the on-disk filesystem is going to be e2fscked
shortly. As I said above, I don't think this is an ext4 issue, but
something corrupting ext4 in-memory data structures. It could be
related to the VM having a relatively small amount of memory compared
to modern standards (maybe adding MM pressure to tickle a bug in
there). Or maybe we have another case of a tail-call optimisation
gone wrong that corrupts a pointer causing ext4 in-memory data to
be scribbled over. I'm grasping at straws at the moment though...

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: BUG: 6.10: ext4 mpage_process_page_bufs() BUG_ON triggers
  2024-10-11 17:11       ` Russell King (Oracle)
@ 2024-10-18  0:44         ` Theodore Ts'o
  0 siblings, 0 replies; 6+ messages in thread
From: Theodore Ts'o @ 2024-10-18  0:44 UTC (permalink / raw)
  To: Russell King (Oracle); +Cc: Andreas Dilger, linux-ext4

On Fri, Oct 11, 2024 at 06:11:17PM +0100, Russell King (Oracle) wrote:
> I'm about to throw 6.11 on at least some of these VMs (including the
> two that failed) so the on-disk filesystem is going to be e2fscked
> shortly. As I said above, I don't think this is an ext4 issue, but
> something corrupting ext4 in-memory data structures.

I agree; I wonder if something would show up if KASAN were enabled.
If it is some array overrun or some other wild pointer problem in some
random subsystem that is enabled on your system (but not mine, since I
use a very restricted kernel config to speed up development builds),
maybe it will turn up something suspicious?
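
Getting it enabled should just be a config change on top of what you
already have, something like:

    ./scripts/config -e KASAN -e KASAN_OUTLINE
    make olddefconfig

(assuming 32-bit ARM has everything KASAN needs there - outline mode
keeps the code size growth a bit more manageable).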

> It could be
> related to the VM having a relatively small amount of memory compared
> to modern standards (maybe adding MM pressure to tickle a bug in
> there).

Yeah, it would be interesting to add a test to xfstests which runs
fsstress and fio under memory pressure (where the memory shortage
comes from either memcg constraints or global memory availability).
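
Something along the lines of running fsstress inside a tight memcg,
e.g.:

    systemd-run --scope -p MemoryMax=96M \
        ltp/fsstress -d /mnt/scratch/stress -p 4 -n 100000

(paths and the limit picked arbitrarily here), with an fio job doing
buffered writes alongside it.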

> Or maybe we have another case of a tail-call optimisation
> gone wrong that corrupts a pointer causing ext4 in-memory data to
> be scribbled over. 

Maybe; I assume you've tried to see if it reproduces with different
compilers / compiler versions?

> I'm grasping at straws at the moment though...

Ditto.   :-)

					- Ted

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-10-18  0:44 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-04 18:47 BUG: 6.10: ext4 mpage_process_page_bufs() BUG_ON triggers Russell King (Oracle)
2024-09-04 19:30 ` Russell King (Oracle)
2024-09-04 19:50   ` Russell King (Oracle)
2024-09-05 13:15     ` Theodore Ts'o
2024-10-11 17:11       ` Russell King (Oracle)
2024-10-18  0:44         ` Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).