[BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)

linux-fsdevel.vger.kernel.org archive mirror

* [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
@ 2025-05-25  8:32 Al Viro
  2025-05-25 18:02 ` Al Viro
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Al Viro @ 2025-05-25  8:32 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel, Linus Torvalds

generic/127 with xfstests built on debian-testing (trixie) ends up with
assorted memory corruption; trace below is with CONFIG_DEBUG_PAGEALLOC and
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT and it looks like a double free
somewhere in iomap.  Unfortunately, commit in question is just making
xfs use the infrastructure built in earlier series - not that useful
for isolating the breakage.

[   22.001529] run fstests generic/127 at 2025-05-25 04:13:23
[   35.498573] BUG: Bad page state in process kworker/2:1  pfn:112ce9
[   35.499260] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e 9
[   35.499764] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)
[   35.500302] raw: 800000000000000e dead000000000100 dead000000000122 000000000
[   35.500786] raw: 000000000000003e 0000000000000000 00000000ffffffff 000000000
[   35.501248] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[   35.501624] Modules linked in: xfs autofs4 fuse nfsd auth_rpcgss nfs_acl nfs0
[   35.503209] CPU: 2 UID: 0 PID: 85 Comm: kworker/2:1 Not tainted 6.14.0-rc1+ 7
[   35.503211] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.164
[   35.503212] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]
[   35.503279] Call Trace:
[   35.503281]  <TASK>
[   35.503282]  dump_stack_lvl+0x4f/0x60
[   35.503296]  bad_page+0x6f/0x100
[   35.503300]  free_frozen_pages+0x303/0x550
[   35.503301]  iomap_finish_ioend+0xf6/0x380
[   35.503304]  iomap_finish_ioends+0x83/0xc0
[   35.503305]  xfs_end_ioend+0x64/0x140 [xfs]
[   35.503342]  xfs_end_io+0x93/0xc0 [xfs]
[   35.503378]  process_one_work+0x153/0x390
[   35.503382]  worker_thread+0x2ab/0x3b0

It's 4:30am here, so I'm going to leave attempts to actually debug that
thing until tomorrow; I do have a kvm where it's reliably reproduced
within a few minutes, so if anyone comes up with patches, I'll be able
to test them.

Breakage is still present in the current mainline ;-/

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-25  8:32 [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?) Al Viro
@ 2025-05-25 18:02 ` Al Viro
  2025-05-25 18:06 ` Al Viro
  2025-05-29  1:56 ` Darrick J. Wong
  2 siblings, 0 replies; 24+ messages in thread
From: Al Viro @ 2025-05-25 18:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel, Linus Torvalds

On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
> generic/127 with xfstests built on debian-testing (trixie) ends up with
> assorted memory corruption; trace below is with CONFIG_DEBUG_PAGEALLOC and
> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT and it looks like a double free
> somewhere in iomap.  Unfortunately, commit in question is just making
> xfs use the infrastructure built in earlier series - not that useful
> for isolating the breakage.

FWIW, the same breakage is reproduced within a couple of iterations of
./check generic/127 on debian-testing image with xfstests built fresh from
git and debian linux-image-6.15-rc7-amd64-unsigned_6.15~rc7-1~exp1_amd64.deb

IOW, it's not something exotic in .config here.  KVM setup is also not
unusual -

kvm \
	-boot order=c \
	-m 16384 \
	-netdev "tap,id=nic0,ifname=tap4,script=no,downscript=no" \
	-device "e1000,netdev=nic0" \
	-nographic \
	-smp 4 \
	-hdb /home/al/emu/ssd/image \
	trixie.img

with image partitioned into two 6G xfs filesystems, with

export TEST_DEV=/dev/sdb1
export TEST_DIR=/home/test
export SCRATCH_DEV=/dev/sdb2
export SCRATCH_MNT=/home/scratch

for local.config.  Bog-standard install, ext4 for everything on sda,
nothing fancy for storage setup - qemu defaults all way through.

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-25  8:32 [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?) Al Viro
  2025-05-25 18:02 ` Al Viro
@ 2025-05-25 18:06 ` Al Viro
  2025-05-25 19:12   ` Vlastimil Babka
  2025-05-29  1:56 ` Darrick J. Wong
  2 siblings, 1 reply; 24+ messages in thread
From: Al Viro @ 2025-05-25 18:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel, Linus Torvalds

On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:

> Breakage is still present in the current mainline ;-/

With CONFIG_DEBUG_VM on top of pagealloc debugging:

[ 1434.992817] run fstests generic/127 at 2025-05-25 11:46:11g
[ 1448.956242] BUG: Bad page state in process kworker/2:1  pfn:112cb0g
[ 1448.956846] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e pfn:0x112cb0g
[ 1448.957453] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
[ 1448.957863] raw: 800000000000000e dead000000000100 dead000000000122 0000000000000000g
[ 1448.958303] raw: 000000000000003e 0000000000000000 00000000ffffffff 0000000000000000g
[ 1448.958833] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) setg
[ 1448.959320] Modules linked in: xfs autofs4 fuse nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc loop ecryptfs 9pnet_virtio 9pnet netfs evdev pcspkr sg button ext4 jbd2 btrfs blake2b_generic xor zlib_deflate raid6_pq zstd_compress sr_mod cdrom ata_generic ata_piix psmouse serio_raw i2c_piix4 i2c_smbus libata e1000g
[ 1448.960874] CPU: 2 UID: 0 PID: 2614 Comm: kworker/2:1 Not tainted 6.14.0-rc1+ #78g
[ 1448.960878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014g
[ 1448.960879] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]g
[ 1448.960938] Call Trace:g
[ 1448.960939]  <TASK>g
[ 1448.960940]  dump_stack_lvl+0x4f/0x60g
[ 1448.960953]  bad_page+0x6f/0x100g
[ 1448.960957]  free_frozen_pages+0x471/0x640g
[ 1448.960958]  iomap_finish_ioend+0x196/0x3c0g
[ 1448.960963]  iomap_finish_ioends+0x83/0xc0g
[ 1448.960964]  xfs_end_ioend+0x64/0x140 [xfs]g
[ 1448.961003]  xfs_end_io+0x93/0xc0 [xfs]g
[ 1448.961036]  process_one_work+0x153/0x390g
[ 1448.961044]  worker_thread+0x2ab/0x3b0g
[ 1448.961045]  ? rescuer_thread+0x470/0x470g
[ 1448.961047]  kthread+0xf7/0x200g
[ 1448.961048]  ? kthread_use_mm+0xa0/0xa0g
[ 1448.961049]  ret_from_fork+0x2d/0x50g
[ 1448.961053]  ? kthread_use_mm+0xa0/0xa0g
[ 1448.961054]  ret_from_fork_asm+0x11/0x20g
[ 1448.961058]  </TASK>g
[ 1448.961155] Disabling lock debugging due to kernel taintg
[ 1448.969569] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e pfn:0x112cb0g
[ 1448.970023] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
[ 1448.970651] raw: 800000000000000e dead000000000100 dead000000000122 0000000000000000g
[ 1448.971222] raw: 000000000000003e 0000000000000000 00000000ffffffff 0000000000000000g
[ 1448.971812] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))g
[ 1448.972490] ------------[ cut here ]------------g
[ 1448.972841] kernel BUG at ./include/linux/mm.h:1455!g
[ 1448.973421] Oops: invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOCg
[ 1448.973853] CPU: 2 UID: 0 PID: 2614 Comm: kworker/2:1 Tainted: G    B              6.14.0-rc1+ #78g
[ 1448.974345] Tainted: [B]=BAD_PAGEg
[ 1448.974565] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014g
[ 1448.975074] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]g
[ 1448.975428] RIP: 0010:folio_end_writeback+0x155/0x180g
[ 1448.975731] Code: 13 40 0f 92 c5 e9 23 ff ff ff 48 c7 c6 00 d5 e7 81 48 89 df e8 0c 8a 03 00 0f 0b 48 c7 c6 d0 38 e5 81 48 89 df e8 fb 89 03 00 <0f> 0b 48 c7 c6 40 5b e5 81 48 89 df e8 ea 89 03 00 0f 0b 48 c7 c6g
[ 1448.976655] RSP: 0018:ffffc90001a53d68 EFLAGS: 00010286g
[ 1448.976953] RAX: 000000000000005c RBX: ffffea00044b2c00 RCX: 0000000000000000g
[ 1448.977331] RDX: 0000000000000001 RSI: ffffffff81e74e9e RDI: 00000000ffffffffg
[ 1448.977711] RBP: ffffea00044b2c40 R08: 0000000000004ffb R09: 00000000ffffefffg
[ 1448.978089] R10: 00000000ffffefff R11: ffffffff82043bc0 R12: 0000000000001000g
[ 1448.978464] R13: ffff888101ecb840 R14: 0000000000000000 R15: ffffea00044b2c00g
[ 1448.978844] FS:  0000000000000000(0000) GS:ffff88842dd00000(0000) knlGS:0000000000000000g
[ 1448.979289] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033g
[ 1448.979609] CR2: 00007fd3d42a2000 CR3: 0000000111543000 CR4: 00000000000006f0g
[ 1448.979989] Call Trace:g
[ 1448.980170]  <TASK>g
[ 1448.980336]  ? die+0x32/0x80g
[ 1448.980543]  ? do_trap+0xd5/0x100g
[ 1448.980767]  ? folio_end_writeback+0x155/0x180g
[ 1448.981033]  ? do_error_trap+0x65/0x80g
[ 1448.981270]  ? folio_end_writeback+0x155/0x180g
[ 1448.981536]  ? exc_invalid_op+0x4c/0x60g
[ 1448.981790]  ? folio_end_writeback+0x155/0x180g
[ 1448.982056]  ? asm_exc_invalid_op+0x16/0x20g
[ 1448.982315]  ? folio_end_writeback+0x155/0x180g
[ 1448.982580]  ? folio_end_writeback+0x155/0x180g
[ 1448.982846]  iomap_finish_ioend+0x196/0x3c0g
[ 1448.983108]  iomap_finish_ioends+0x55/0xc0g
[ 1448.983363]  xfs_end_ioend+0x64/0x140 [xfs]g
[ 1448.983663]  xfs_end_io+0x93/0xc0 [xfs]g
[ 1448.983937]  process_one_work+0x153/0x390g
[ 1448.984189]  worker_thread+0x2ab/0x3b0g
[ 1448.984427]  ? rescuer_thread+0x470/0x470g
[ 1448.984674]  kthread+0xf7/0x200g
[ 1448.984887]  ? kthread_use_mm+0xa0/0xa0g
[ 1448.985128]  ret_from_fork+0x2d/0x50g
[ 1448.985362]  ? kthread_use_mm+0xa0/0xa0g
[ 1448.985601]  ret_from_fork_asm+0x11/0x20g
[ 1448.985846]  </TASK>g
[ 1448.986017] Modules linked in: xfs autofs4 fuse nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc loop ecryptfs 9pnet_virtio 9pnet netfs evdev pcspkr sg button ext4 jbd2 btrfs blake2b_generic xor zlib_deflate raid6_pq zstd_compress sr_mod cdrom ata_generic ata_piix psmouse serio_raw i2c_piix4 i2c_smbus libata e1000g
[ 1448.987399] ---[ end trace 0000000000000000 ]---g
[ 1448.987896] RIP: 0010:folio_end_writeback+0x155/0x180g
[ 1448.988220] Code: 13 40 0f 92 c5 e9 23 ff ff ff 48 c7 c6 00 d5 e7 81 48 89 df e8 0c 8a 03 00 0f 0b 48 c7 c6 d0 38 e5 81 48 89 df e8 fb 89 03 00 <0f> 0b 48 c7 c6 40 5b e5 81 48 89 df e8 ea 89 03 00 0f 0b 48 c7 c6g
[ 1448.989246] RSP: 0018:ffffc90001a53d68 EFLAGS: 00010286g
[ 1448.992210] RAX: 000000000000005c RBX: ffffea00044b2c00 RCX: 0000000000000000g
[ 1448.992619] RDX: 0000000000000001 RSI: ffffffff81e74e9e RDI: 00000000ffffffffg
[ 1448.993010] RBP: ffffea00044b2c40 R08: 0000000000004ffb R09: 00000000ffffefffg
[ 1448.993577] R10: 00000000ffffefff R11: ffffffff82043bc0 R12: 0000000000001000g
[ 1448.994411] R13: ffff888101ecb840 R14: 0000000000000000 R15: ffffea00044b2c00g
[ 1448.994823] FS:  0000000000000000(0000) GS:ffff88842dd00000(0000) knlGS:0000000000000000g
[ 1448.995390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033g
[ 1448.995916] CR2: 00007fd3d42a2000 CR3: 0000000111543000 CR4: 00000000000006f0g
kvm: terminating on signal 15 from pid 32057 (killall)

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-25 18:06 ` Al Viro
@ 2025-05-25 19:12   ` Vlastimil Babka
  2025-05-25 20:32     ` Linus Torvalds
  2025-05-26 13:05     ` Jens Axboe
  0 siblings, 2 replies; 24+ messages in thread
From: Vlastimil Babka @ 2025-05-25 19:12 UTC (permalink / raw)
  To: Al Viro, Jens Axboe, Matthew Wilcox, Jan Kara
  Cc: Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel, Linus Torvalds

On 5/25/25 8:06 PM, Al Viro wrote:
> On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
> 
>> Breakage is still present in the current mainline ;-/
> 
> With CONFIG_DEBUG_VM on top of pagealloc debugging:
> 
> [ 1434.992817] run fstests generic/127 at 2025-05-25 11:46:11g
> [ 1448.956242] BUG: Bad page state in process kworker/2:1  pfn:112cb0g
> [ 1448.956846] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e pfn:0x112cb0g
> [ 1448.957453] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)g

It doesn't like the writeback flag.

> [ 1448.957863] raw: 800000000000000e dead000000000100 dead000000000122 0000000000000000g
> [ 1448.958303] raw: 000000000000003e 0000000000000000 00000000ffffffff 0000000000000000g
> [ 1448.958833] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) setg
> [ 1448.959320] Modules linked in: xfs autofs4 fuse nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc loop ecryptfs 9pnet_virtio 9pnet netfs evdev pcspkr sg button ext4 jbd2 btrfs blake2b_generic xor zlib_deflate raid6_pq zstd_compress sr_mod cdrom ata_generic ata_piix psmouse serio_raw i2c_piix4 i2c_smbus libata e1000g
> [ 1448.960874] CPU: 2 UID: 0 PID: 2614 Comm: kworker/2:1 Not tainted 6.14.0-rc1+ #78g
> [ 1448.960878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014g
> [ 1448.960879] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]g
> [ 1448.960938] Call Trace:g
> [ 1448.960939]  <TASK>g
> [ 1448.960940]  dump_stack_lvl+0x4f/0x60g
> [ 1448.960953]  bad_page+0x6f/0x100g
> [ 1448.960957]  free_frozen_pages+0x471/0x640g
> [ 1448.960958]  iomap_finish_ioend+0x196/0x3c0g
> [ 1448.960963]  iomap_finish_ioends+0x83/0xc0g
> [ 1448.960964]  xfs_end_ioend+0x64/0x140 [xfs]g
> [ 1448.961003]  xfs_end_io+0x93/0xc0 [xfs]g
> [ 1448.961036]  process_one_work+0x153/0x390g
> [ 1448.961044]  worker_thread+0x2ab/0x3b0g
> [ 1448.961045]  ? rescuer_thread+0x470/0x470g
> [ 1448.961047]  kthread+0xf7/0x200g
> [ 1448.961048]  ? kthread_use_mm+0xa0/0xa0g
> [ 1448.961049]  ret_from_fork+0x2d/0x50g
> [ 1448.961053]  ? kthread_use_mm+0xa0/0xa0g
> [ 1448.961054]  ret_from_fork_asm+0x11/0x20g
> [ 1448.961058]  </TASK>g
> [ 1448.961155] Disabling lock debugging due to kernel taintg
> [ 1448.969569] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e pfn:0x112cb0g

same pfn, same struct page

> [ 1448.970023] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
> [ 1448.970651] raw: 800000000000000e dead000000000100 dead000000000122 0000000000000000g
> [ 1448.971222] raw: 000000000000003e 0000000000000000 00000000ffffffff 0000000000000000g
> [ 1448.971812] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))g
> [ 1448.972490] ------------[ cut here ]------------g
> [ 1448.972841] kernel BUG at ./include/linux/mm.h:1455!g

this is folio_get() noticing refcount is 0, so a use-after free, because
we already tried to free the page above.
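
The condition printed in the dump can be checked by hand. A minimal sketch, assuming only the expression shown in the VM_BUG_ON_FOLIO line above; ref_is_suspect() is a hypothetical name for illustration, not a kernel function:

```c
#include <stdbool.h>

/*
 * Same arithmetic as the VM_BUG_ON_FOLIO() condition in the log.
 * The unsigned wrap makes the comparison true exactly for
 * ref in [-127, 0], i.e. a count that is zero or slightly negative.
 */
static bool ref_is_suspect(int ref)
{
	return (unsigned int)ref + 127u <= 127u;
}
```

With refcount:0, as in the page dump, the check fires; any positive count passes.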

I'm not familiar with this code too much, but I suspect problem was
introduced by commit fb7d3bc414939 ("mm/filemap: drop streaming/uncached
pages when writeback completes") and only (more) exposed here.

so in folio_end_writeback() we have
        if (__folio_end_writeback(folio))
                folio_wake_bit(folio, PG_writeback);

but calling the folio_end_dropbehind_write() doesn't depend on the
result of __folio_end_writeback()
this seems rather suspicious

I think if __folio_end_writeback() was true then PG_writeback would be
cleared and thus we'd not see the PAGE_FLAGS_CHECK_AT_FREE failure.
Instead we do a premature folio_end_dropbehind_write() dropping a page
ref and then the final folio_put() in folio_end_writeback() frees the
page and splats on the PG_writeback. Then the folio is processed again
in the following iteration of iomap_finish_ioend() and splats on the
refcount-already-zero.

So I think folio_end_dropbehind_write() should only be done when
__folio_end_writeback() was true. Most likely even the
folio_test_clear_dropbehind() should be tied to that, or we clear it too
early and then never act upon it later?
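
The ordering suspected above can be played through in a userspace toy model. This is only a sketch of the hypothesis as stated in this message, not the kernel code: every toy_* name is hypothetical, and the assumption that a false return from __folio_end_writeback() leaves PG_writeback set is the message's reading, which may not match the real helper.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct folio; not a kernel type. */
struct toy_folio {
	int refcount;     /* 1 == the page cache's reference */
	bool writeback;   /* models PG_writeback */
	bool dropbehind;  /* models the drop-behind marker */
	bool bad_free;    /* set if freed while writeback was still set */
};

static void toy_put(struct toy_folio *f)
{
	if (--f->refcount == 0 && f->writeback)
		f->bad_free = true;  /* analogue of the "Bad page state" splat */
}

/*
 * wb_ended models the helper's return value as the message reads it:
 * true means the writeback bit really got cleared.
 * gate selects the proposed behaviour (drop only when wb_ended).
 */
static void toy_end_writeback(struct toy_folio *f, bool wb_ended, bool gate)
{
	bool drop = false;

	f->refcount++;                  /* folio_get() analogue */
	if (!gate || wb_ended) {
		drop = f->dropbehind;
		f->dropbehind = false;  /* test-and-clear analogue */
	}
	if (wb_ended)
		f->writeback = false;
	if (drop)
		toy_put(f);             /* drop-behind releases the cache ref */
	toy_put(f);                     /* matches the get above */
}
```

In this model the ungated path with wb_ended == false ends with refcount 0 and the writeback bit still set, while the gated path keeps both the cache reference and the drop-behind marker for a later completion.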

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-25 19:12   ` Vlastimil Babka
@ 2025-05-25 20:32     ` Linus Torvalds
  2025-05-25 20:48       ` Matthew Wilcox
  2025-05-26 13:05     ` Jens Axboe
  1 sibling, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2025-05-25 20:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Al Viro, Jens Axboe, Matthew Wilcox, Jan Kara, Christoph Hellwig,
	Darrick J. Wong, Christian Brauner, linux-fsdevel

Well, this isn't great timing, since I was going to do 6.15 within the hour.

On Sun, 25 May 2025 at 12:11, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> I'm not familiar with this code too much, but I suspect problem was
> introduced by commit fb7d3bc414939 ("mm/filemap: drop streaming/uncached
> pages when writeback completes") and only (more) exposed here.

That bug goes back to 6.13 if so.

But yeah, maybe the drop-behind case never triggers in practice, and I
should just revert commit 974c5e6139db ("xfs: flag as supporting
FOP_DONTCACHE") for now.

That's kind of sad too, but at least that's new to 6.15 and we
wouldn't have a kernel release that triggers this issue.

I realize that Vlastimil had a suggested possible fix, but doing
_that_ kind of surgery at this point in the release isn't an option,
I'm afraid. And delaying 6.15 for this also seems a bit excessive - if
it turns out to be easy to fix, we can always just backport the fix
and undo the revert.

Sounds like a plan?

I'm somewhat surprised that this was only noticed now if it triggers
so easily for Al with xfstests on xfs. But better late than never, I
guess..

             Linus

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-25 20:32     ` Linus Torvalds
@ 2025-05-25 20:48       ` Matthew Wilcox
  2025-05-25 20:54         ` Linus Torvalds
  2025-05-25 21:49         ` Al Viro
  0 siblings, 2 replies; 24+ messages in thread
From: Matthew Wilcox @ 2025-05-25 20:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Vlastimil Babka, Al Viro, Jens Axboe, Jan Kara, Christoph Hellwig,
	Darrick J. Wong, Christian Brauner, linux-fsdevel

On Sun, May 25, 2025 at 01:32:33PM -0700, Linus Torvalds wrote:
> But yeah, maybe the drop-behind case never triggers in practice, and I
> should just revert commit 974c5e6139db ("xfs: flag as supporting
> FOP_DONTCACHE") for now.
> 
> That's kind of sad too, but at least that's new to 6.15 and we
> wouldn't have a kernel release that triggers this issue.
> 
> I realize that Vlastimil had a suggested possible fix, but doing
> _that_ kind of surgery at this point in the release isn't an option,
> I'm afraid. And delaying 6.15 for this also seems a bit excessive - if
> it turns out to be easy to fix, we can always just backport the fix
> and undo the revert.
> 
> Sounds like a plan?
> 
> I'm somewhat surprised that this was only noticed now if it triggers
> so easily for Al with xfstests on xfs. But better late than never, I
> guess..

I wonder if we shouldn't do ...

+++ b/include/linux/fs.h
@@ -3725,6 +3725,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
                        return -EOPNOTSUPP;
        }
        if (flags & RWF_DONTCACHE) {
+               /* Houston, we have a problem */
+               return -EOPNOTSUPP;
                /* file system must support it */
                if (!(ki->ki_filp->f_op->fop_flags & FOP_DONTCACHE))
                        return -EOPNOTSUPP;

in case some other filesystem adds support for it?  I don't see anything
in -next right now, but I see Darrick playing with it here for FUSE:
https://lore.kernel.org/all/174787195629.1483178.7917092102987513364.stgit@frogsfrogsfrogs/
Jeff playing with it for nfsd here:
https://lore.kernel.org/all/370dd4ae06d44f852342b7ee2b969fc544bd1213.camel@kernel.org/
Trond implementing it for NFS client here:
https://lore.kernel.org/all/cover.1745381692.git.trond.myklebust@hammerspace.com/

I thought I saw someone implement it for ext4, but perhaps I'm confused
with something else.  Anyway, some kind of not-xfs-specific patch is
appropriate here, I think?

Oh, and we're only just seeing it, I think, because you need to recompile
xfstests to test this functionality ... and I certainly don't re-pull
and re-compile xfstests on a regular basis; I just use the one I pulled
and compiled, um, months ago.

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-25 20:48       ` Matthew Wilcox
@ 2025-05-25 20:54         ` Linus Torvalds
  2025-05-25 21:49         ` Al Viro
  1 sibling, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2025-05-25 20:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Vlastimil Babka, Al Viro, Jens Axboe, Jan Kara, Christoph Hellwig,
	Darrick J. Wong, Christian Brauner, linux-fsdevel

On Sun, 25 May 2025 at 13:48, Matthew Wilcox <willy@infradead.org> wrote:
>
> I wonder if we shouldn't do ...
>
> +++ b/include/linux/fs.h
> @@ -3725,6 +3725,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
>                         return -EOPNOTSUPP;
>         }
>         if (flags & RWF_DONTCACHE) {
> +               /* Houston, we have a problem */
> +               return -EOPNOTSUPP;

Hmm. Your point about other filesystems is well taken.

I'd have preferred a revert as a "don't do anything new at this
point", but I guess disabling it at this point is probably the safer
option considering that this isn't a xfs issue.

> Oh, and we're only just seeing it, I think, because you need to recompile
> xfstests to test this functionality ...

Ahh, good. Well, not "good" exactly, but it certainly at least
explains the unlucky timing.

Thanks,

             Linus

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-25 20:48       ` Matthew Wilcox
  2025-05-25 20:54         ` Linus Torvalds
@ 2025-05-25 21:49         ` Al Viro
  2025-05-25 22:05           ` Linus Torvalds
  1 sibling, 1 reply; 24+ messages in thread
From: Al Viro @ 2025-05-25 21:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Vlastimil Babka, Jens Axboe, Jan Kara,
	Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel

On Sun, May 25, 2025 at 09:48:45PM +0100, Matthew Wilcox wrote:
> On Sun, May 25, 2025 at 01:32:33PM -0700, Linus Torvalds wrote:
> > But yeah, maybe the drop-behind case never triggers in practice, and I
> > should just revert commit 974c5e6139db ("xfs: flag as supporting
> > FOP_DONTCACHE") for now.
> > 
> > That's kind of sad too, but at least that's new to 6.15 and we
> > wouldn't have a kernel release that triggers this issue.
> > 
> > I realize that Vlastimil had a suggested possible fix, but doing
> > _that_ kind of surgery at this point in the release isn't an option,
> > I'm afraid. And delaying 6.15 for this also seems a bit excessive - if
> > it turns out to be easy to fix, we can always just backport the fix
> > and undo the revert.
> > 
> > Sounds like a plan?
> > 
> > I'm somewhat surprised that this was only noticed now if it triggers
> > so easily for Al with xfstests on xfs. But better late than never, I
> > guess..
> 
> I wonder if we shouldn't do ...
> 
> +++ b/include/linux/fs.h
> @@ -3725,6 +3725,8 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
>                         return -EOPNOTSUPP;
>         }
>         if (flags & RWF_DONTCACHE) {
> +               /* Houston, we have a problem */
> +               return -EOPNOTSUPP;
>                 /* file system must support it */
>                 if (!(ki->ki_filp->f_op->fop_flags & FOP_DONTCACHE))
>                         return -EOPNOTSUPP;
> 

Perhaps

-#define FOP_DONTCACHE           ((__force fop_flags_t)(1 << 7))
+#define FOP_DONTCACHE           0 // ((__force fop_flags_t)(1 << 7)) when shit gets fixed

instead?
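
Why a flag value of 0 switches the feature off everywhere can be seen from the shape of the check quoted above. A minimal userspace sketch, assuming illustrative constants; the toy_* and TOY_* names are hypothetical and this is not the kernel's kiocb_set_rw_flags():

```c
#include <errno.h>

/* Illustrative values only; none of these names are kernel API. */
#define TOY_RWF_DONTCACHE      0x80u
#define TOY_FOP_DONTCACHE_ON   (1u << 7)  /* flag as originally defined */
#define TOY_FOP_DONTCACHE_OFF  0u         /* flag defined as 0 */

/* Same shape as the quoted check: the file system must advertise support. */
static int toy_set_rw_flags(unsigned int fop_flags, unsigned int rw_flags,
			    unsigned int fop_dontcache)
{
	if (rw_flags & TOY_RWF_DONTCACHE) {
		if (!(fop_flags & fop_dontcache))
			return -EOPNOTSUPP;
	}
	return 0;
}
```

With the flag defined as 0, fop_flags & 0 is always 0, so every request carrying the dontcache bit is rejected no matter what the file system sets, while requests without it are unaffected.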

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-25 21:49         ` Al Viro
@ 2025-05-25 22:05           ` Linus Torvalds
  0 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2025-05-25 22:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Matthew Wilcox, Vlastimil Babka, Jens Axboe, Jan Kara,
	Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel

On Sun, 25 May 2025 at 14:49, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Perhaps
>
> -#define FOP_DONTCACHE           ((__force fop_flags_t)(1 << 7))
> +#define FOP_DONTCACHE           0 // ((__force fop_flags_t)(1 << 7)) when shit gets fixed
>
> instead?

Yeah, I think that ends up being prettier than an extra error return
in the middle of code.

Will do. Thanks for noticing this, even if the timing is awkward.

              Linus

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-25 19:12   ` Vlastimil Babka
  2025-05-25 20:32     ` Linus Torvalds
@ 2025-05-26 13:05     ` Jens Axboe
  2025-05-26 15:06       ` Jens Axboe
  1 sibling, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2025-05-26 13:05 UTC (permalink / raw)
  To: Vlastimil Babka, Al Viro, Matthew Wilcox, Jan Kara
  Cc: Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel, Linus Torvalds

On 5/25/25 1:12 PM, Vlastimil Babka wrote:
> On 5/25/25 8:06 PM, Al Viro wrote:
>> On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
>>
>>> Breakage is still present in the current mainline ;-/
>>
>> With CONFIG_DEBUG_VM on top of pagealloc debugging:
>>
>> [ 1434.992817] run fstests generic/127 at 2025-05-25 11:46:11g
>> [ 1448.956242] BUG: Bad page state in process kworker/2:1  pfn:112cb0g
>> [ 1448.956846] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e pfn:0x112cb0g
>> [ 1448.957453] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
> 
> It doesn't like the writeback flag.
> 
>> [ 1448.957863] raw: 800000000000000e dead000000000100 dead000000000122 0000000000000000g
>> [ 1448.958303] raw: 000000000000003e 0000000000000000 00000000ffffffff 0000000000000000g
>> [ 1448.958833] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) setg
>> [ 1448.959320] Modules linked in: xfs autofs4 fuse nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc loop ecryptfs 9pnet_virtio 9pnet netfs evdev pcspkr sg button ext4 jbd2 btrfs blake2b_generic xor zlib_deflate raid6_pq zstd_compress sr_mod cdrom ata_generic ata_piix psmouse serio_raw i2c_piix4 i2c_smbus libata e1000g
>> [ 1448.960874] CPU: 2 UID: 0 PID: 2614 Comm: kworker/2:1 Not tainted 6.14.0-rc1+ #78g
>> [ 1448.960878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014g
>> [ 1448.960879] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]g
>> [ 1448.960938] Call Trace:g
>> [ 1448.960939]  <TASK>g
>> [ 1448.960940]  dump_stack_lvl+0x4f/0x60g
>> [ 1448.960953]  bad_page+0x6f/0x100g
>> [ 1448.960957]  free_frozen_pages+0x471/0x640g
>> [ 1448.960958]  iomap_finish_ioend+0x196/0x3c0g
>> [ 1448.960963]  iomap_finish_ioends+0x83/0xc0g
>> [ 1448.960964]  xfs_end_ioend+0x64/0x140 [xfs]g
>> [ 1448.961003]  xfs_end_io+0x93/0xc0 [xfs]g
>> [ 1448.961036]  process_one_work+0x153/0x390g
>> [ 1448.961044]  worker_thread+0x2ab/0x3b0g
>> [ 1448.961045]  ? rescuer_thread+0x470/0x470g
>> [ 1448.961047]  kthread+0xf7/0x200g
>> [ 1448.961048]  ? kthread_use_mm+0xa0/0xa0g
>> [ 1448.961049]  ret_from_fork+0x2d/0x50g
>> [ 1448.961053]  ? kthread_use_mm+0xa0/0xa0g
>> [ 1448.961054]  ret_from_fork_asm+0x11/0x20g
>> [ 1448.961058]  </TASK>g
>> [ 1448.961155] Disabling lock debugging due to kernel taintg
>> [ 1448.969569] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e pfn:0x112cb0g
> 
> same pfn, same struct page
> 
>> [ 1448.970023] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
>> [ 1448.970651] raw: 800000000000000e dead000000000100 dead000000000122 0000000000000000g
>> [ 1448.971222] raw: 000000000000003e 0000000000000000 00000000ffffffff 0000000000000000g
>> [ 1448.971812] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))g
>> [ 1448.972490] ------------[ cut here ]------------g
>> [ 1448.972841] kernel BUG at ./include/linux/mm.h:1455!g
> 
> this is folio_get() noticing refcount is 0, so a use-after free, because
> we already tried to free the page above.
> 
> I'm not familiar with this code too much, but I suspect problem was
> introduced by commit fb7d3bc414939 ("mm/filemap: drop streaming/uncached
> pages when writeback completes") and only (more) exposed here.
> 
> so in folio_end_writeback() we have
>         if (__folio_end_writeback(folio))
>                 folio_wake_bit(folio, PG_writeback);
> 
> but calling the folio_end_dropbehind_write() doesn't depend on the
> result of __folio_end_writeback()
> this seems rather suspicious
> 
> I think if __folio_end_writeback() was true then PG_writeback would be
> cleared and thus we'd not see the PAGE_FLAGS_CHECK_AT_FREE failure.
> Instead we do a premature folio_end_dropbehind_write() dropping a page
> ref and then the final folio_put() in folio_end_writeback() frees the
> page and splats on the PG_writeback. Then the folio is processed again
> in the following iteration of iomap_finish_ioend() and splats on the
> refcount-already-zero.
> 
> So I think folio_end_dropbehind_write() should only be done when
> __folio_end_writeback() was true. Most likely even the
> folio_test_clear_dropbehind() should be tied to that, or we clear it too
> early and then never act upon it later?

Thanks for taking a look at this! I tried to reproduce this this morning
and failed miserably. I then injected a delay for the above case, and it
does indeed then trigger for me. So far, so good.

I agree with your analysis, we should only be doing the dropbehind for a
non-zero return from __folio_end_writeback(), and that includes the
test_and_clear to avoid dropping the drop-behind state. But we also need
to check/clear this state pre __folio_end_writeback(), which then puts
us in a spot where it needs to potentially be re-set. Which feels pretty
racy...
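
The clear-then-maybe-re-set pattern described here can be sketched with C11 atomics. This is a userspace illustration under stated assumptions, not the kernel's folio flag helpers; TOY_DROPBEHIND and both function names are hypothetical:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_DROPBEHIND 0x1u  /* hypothetical flag bit, not a kernel value */

/* Atomically clear the bit and report whether it was set beforehand. */
static bool toy_test_and_clear(atomic_uint *flags)
{
	return atomic_fetch_and(flags, ~TOY_DROPBEHIND) & TOY_DROPBEHIND;
}

/* Put the bit back when the caller finds it cleared it too early. */
static void toy_reset(atomic_uint *flags)
{
	atomic_fetch_or(flags, TOY_DROPBEHIND);
}
```

Each call is atomic on its own, but the pair is not: anything that inspects the flags between toy_test_and_clear() and toy_reset() sees the bit as clear, which is the window that makes the re-set approach look racy.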

I'll ponder this a bit. Good thing fsx got RWF_DONTCACHE support, or I
suspect this would've taken a while to run into.

-- 
Jens Axboe

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-26 13:05     ` Jens Axboe
@ 2025-05-26 15:06       ` Jens Axboe
  2025-05-26 15:31         ` Vlastimil Babka
  2025-05-26 17:38         ` Jens Axboe
  0 siblings, 2 replies; 24+ messages in thread
From: Jens Axboe @ 2025-05-26 15:06 UTC (permalink / raw)
  To: Vlastimil Babka, Al Viro, Matthew Wilcox, Jan Kara
  Cc: Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel, Linus Torvalds

On 5/26/25 7:05 AM, Jens Axboe wrote:
> On 5/25/25 1:12 PM, Vlastimil Babka wrote:
>> On 5/25/25 8:06 PM, Al Viro wrote:
>>> On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
>>>
>>>> Breakage is still present in the current mainline ;-/
>>>
>>> With CONFIG_DEBUG_VM on top of pagealloc debugging:
>>>
>>> [ 1434.992817] run fstests generic/127 at 2025-05-25 11:46:11g
>>> [ 1448.956242] BUG: Bad page state in process kworker/2:1  pfn:112cb0g
>>> [ 1448.956846] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e pfn:0x112cb0g
>>> [ 1448.957453] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
>>
>> It doesn't like the writeback flag.
>>
>>> [ 1448.957863] raw: 800000000000000e dead000000000100 dead000000000122 0000000000000000g
>>> [ 1448.958303] raw: 000000000000003e 0000000000000000 00000000ffffffff 0000000000000000g
>>> [ 1448.958833] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) setg
>>> [ 1448.959320] Modules linked in: xfs autofs4 fuse nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc loop ecryptfs 9pnet_virtio 9pnet netfs evdev pcspkr sg button ext4 jbd2 btrfs blake2b_generic xor zlib_deflate raid6_pq zstd_compress sr_mod cdrom ata_generic ata_piix psmouse serio_raw i2c_piix4 i2c_smbus libata e1000g
>>> [ 1448.960874] CPU: 2 UID: 0 PID: 2614 Comm: kworker/2:1 Not tainted 6.14.0-rc1+ #78g
>>> [ 1448.960878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014g
>>> [ 1448.960879] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]g
>>> [ 1448.960938] Call Trace:g
>>> [ 1448.960939]  <TASK>g
>>> [ 1448.960940]  dump_stack_lvl+0x4f/0x60g
>>> [ 1448.960953]  bad_page+0x6f/0x100g
>>> [ 1448.960957]  free_frozen_pages+0x471/0x640g
>>> [ 1448.960958]  iomap_finish_ioend+0x196/0x3c0g
>>> [ 1448.960963]  iomap_finish_ioends+0x83/0xc0g
>>> [ 1448.960964]  xfs_end_ioend+0x64/0x140 [xfs]g
>>> [ 1448.961003]  xfs_end_io+0x93/0xc0 [xfs]g
>>> [ 1448.961036]  process_one_work+0x153/0x390g
>>> [ 1448.961044]  worker_thread+0x2ab/0x3b0g
>>> [ 1448.961045]  ? rescuer_thread+0x470/0x470g
>>> [ 1448.961047]  kthread+0xf7/0x200g
>>> [ 1448.961048]  ? kthread_use_mm+0xa0/0xa0g
>>> [ 1448.961049]  ret_from_fork+0x2d/0x50g
>>> [ 1448.961053]  ? kthread_use_mm+0xa0/0xa0g
>>> [ 1448.961054]  ret_from_fork_asm+0x11/0x20g
>>> [ 1448.961058]  </TASK>g
>>> [ 1448.961155] Disabling lock debugging due to kernel taintg
>>> [ 1448.969569] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e pfn:0x112cb0g
>>
>> same pfn, same struct page
>>
>>> [ 1448.970023] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
>>> [ 1448.970651] raw: 800000000000000e dead000000000100 dead000000000122 0000000000000000g
>>> [ 1448.971222] raw: 000000000000003e 0000000000000000 00000000ffffffff 0000000000000000g
>>> [ 1448.971812] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))g
>>> [ 1448.972490] ------------[ cut here ]------------g
>>> [ 1448.972841] kernel BUG at ./include/linux/mm.h:1455!g
>>
>> this is folio_get() noticing refcount is 0, so a use-after free, because
>> we already tried to free the page above.
>>
>> I'm not familiar with this code too much, but I suspect problem was
>> introduced by commit fb7d3bc414939 ("mm/filemap: drop streaming/uncached
>> pages when writeback completes") and only (more) exposed here.
>>
>> so in folio_end_writeback() we have
>>         if (__folio_end_writeback(folio))
>>                 folio_wake_bit(folio, PG_writeback);
>>
>> but calling the folio_end_dropbehind_write() doesn't depend on the
>> result of __folio_end_writeback()
>> this seems rather suspicious
>>
>> I think if __folio_end_writeback() was true then PG_writeback would be
>> cleared and thus we'd not see the PAGE_FLAGS_CHECK_AT_FREE failure.
>> Instead we do a premature folio_end_dropbehind_write() dropping a page
>> ref and then the final folio_put() in folio_end_writeback() frees the
>> page and splats on the PG_writeback. Then the folio is processed again
>> in the following iteration of iomap_finish_ioend() and splats on the
>> refcount-already-zero.
>>
>> So I think folio_end_dropbehind_write() should only be done when
>> __folio_end_writeback() was true. Most likely even the
>> folio_test_clear_dropbehind() should be tied to that, or we clear it too
>> early and then never act upon it later?
> 
> Thanks for taking a look at this! I tried to reproduce this this morning
> and failed miserably. I then injected a delay for the above case, and it
> does indeed then trigger for me. So far, so good.
> 
> I agree with your analysis, we should only be doing the dropbehind for a
> non-zero return from __folio_end_writeback(), and that includes the
> test_and_clear to avoid dropping the drop-behind state. But we also need
> to check/clear this state pre __folio_end_writeback(), which then puts
> us in a spot where it needs to potentially be re-set. Which feels pretty
> racy...
> 
> I'll ponder this a bit. Good thing fsx got RWF_DONTCACHE support, or I
> suspect this would've taken a while to run into.

Took a closer look... I may be smoking something good here, but I don't
see what the __folio_end_writeback() return value has to do with this
at all. Regardless of what it returns, it should've cleared
PG_writeback, and in fact the only thing it returns is whether or not we
had anyone waiting on it. Which should have _zero_ bearing on whether or
not we can clear/invalidate the range.
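
To illustrate the point about the return value, here is a minimal
user-space sketch, not the kernel code. The bit names SK_WRITEBACK and
SK_WAITERS and the helper sk_end_writeback() are made up for this example
and only stand in for PG_writeback, PG_waiters and __folio_end_writeback().
It shows the writeback bit being cleared on every call, while the return
value only reports whether a waiter bit was set:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit positions, chosen for this sketch only. */
enum {
	SK_WRITEBACK = 1u << 0,	/* stand-in for PG_writeback */
	SK_WAITERS   = 1u << 1,	/* stand-in for PG_waiters */
};

/*
 * Clear SK_WRITEBACK and report whether a waiter was registered.
 * The flag is cleared unconditionally; only the return value varies.
 */
static bool sk_end_writeback(atomic_uint *flags)
{
	/* atomic_fetch_xor() returns the value held before the xor. */
	unsigned int old = atomic_fetch_xor(flags, SK_WRITEBACK);

	/* The caller must be ending a writeback that was in flight. */
	assert(old & SK_WRITEBACK);
	return (old & SK_WAITERS) != 0;
}
```

With only SK_WRITEBACK set the helper returns false, with SK_WAITERS also
set it returns true, and in both cases SK_WRITEBACK ends up clear.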

To me, this smells more like a race of some sort, between dirty and
invalidation. fsx does a lot of sub-page sized operations.

I'll poke a bit more...

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-26 15:06       ` Jens Axboe
@ 2025-05-26 15:31         ` Vlastimil Babka
  2025-05-26 15:58           ` Jens Axboe
  2025-05-26 17:38         ` Jens Axboe
  1 sibling, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2025-05-26 15:31 UTC (permalink / raw)
  To: Jens Axboe, Al Viro, Matthew Wilcox, Jan Kara
  Cc: Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel, Linus Torvalds

On 5/26/25 17:06, Jens Axboe wrote:
> On 5/26/25 7:05 AM, Jens Axboe wrote:
>> On 5/25/25 1:12 PM, Vlastimil Babka wrote:
>> 
>> Thanks for taking a look at this! I tried to reproduce this this morning
>> and failed miserably. I then injected a delay for the above case, and it
>> does indeed then trigger for me. So far, so good.
>> 
>> I agree with your analysis, we should only be doing the dropbehind for a
>> non-zero return from __folio_end_writeback(), and that includes the
>> test_and_clear to avoid dropping the drop-behind state. But we also need
>> to check/clear this state pre __folio_end_writeback(), which then puts
>> us in a spot where it needs to potentially be re-set. Which feels pretty
>> racy...
>> 
>> I'll ponder this a bit. Good thing fsx got RWF_DONTCACHE support, or I
>> suspect this would've taken a while to run into.
> 
> Took a closer look... I may be smoking something good here, but I don't
> see what the __folio_end_writeback() return value has to do with this
> at all. Regardless of what it returns, it should've cleared
> PG_writeback, and in fact the only thing it returns is whether or not we
> had anyone waiting on it. Which should have _zero_ bearing on whether or
> not we can clear/invalidate the range.

Yeah it's very much possible that I was wrong, folio_xor_flags_has_waiters()
looked a bit impenetrable to me, and it seemed like a simple explanation for
the splats. But as you had to add delays, this indeed smells like a race.

> To me, this smells more like a race of some sort, between dirty and
> invalidation. fsx does a lot of sub-page sized operations.
> 
> I'll poke a bit more...
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-26 15:31         ` Vlastimil Babka
@ 2025-05-26 15:58           ` Jens Axboe
  0 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2025-05-26 15:58 UTC (permalink / raw)
  To: Vlastimil Babka, Al Viro, Matthew Wilcox, Jan Kara
  Cc: Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel, Linus Torvalds

On 5/26/25 9:31 AM, Vlastimil Babka wrote:
> On 5/26/25 17:06, Jens Axboe wrote:
>> On 5/26/25 7:05 AM, Jens Axboe wrote:
>>> On 5/25/25 1:12 PM, Vlastimil Babka wrote:
>>>
>>> Thanks for taking a look at this! I tried to reproduce this this morning
>>> and failed miserably. I then injected a delay for the above case, and it
>>> does indeed then trigger for me. So far, so good.
>>>
>>> I agree with your analysis, we should only be doing the dropbehind for a
>>> non-zero return from __folio_end_writeback(), and that includes the
>>> test_and_clear to avoid dropping the drop-behind state. But we also need
>>> to check/clear this state pre __folio_end_writeback(), which then puts
>>> us in a spot where it needs to potentially be re-set. Which feels pretty
>>> racy...
>>>
>>> I'll ponder this a bit. Good thing fsx got RWF_DONTCACHE support, or I
>>> suspect this would've taken a while to run into.
>>
>> Took a closer look... I may be smoking something good here, but I don't
>> see what the __folio_end_writeback() return value has to do with this
>> at all. Regardless of what it returns, it should've cleared
>> PG_writeback, and in fact the only thing it returns is whether or not we
>> had anyone waiting on it. Which should have _zero_ bearing on whether or
>> not we can clear/invalidate the range.
> 
> Yeah it's very much possible that I was wrong, folio_xor_flags_has_waiters()
> looked a bit impenetrable to me, and it seemed like a simple explanation for
> the splats. But as you had to add delays, this indeed smells like a race.

Here's my delay trace fwiw, which is a bit different:

BUG: Bad page state in process fsx  pfn:4866b
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x25 pfn:0x4866b
flags: 0x3ffe0000000000a(uptodate|writeback|node=0|zone=0|lastcpupid=0x1fff)
raw: 03ffe0000000000a dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000025 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
Modules linked in:
CPU: 6 UID: 0 PID: 1853 Comm: fsx Not tainted 6.15.0-rc7-00144-gb1427432d3b6-dirty #1053 NONE 
Hardware name: linux,dummy-virt (DT)
Call trace:
 show_stack+0x1c/0x30 (C)
 dump_stack_lvl+0x58/0x78
 dump_stack+0x18/0x20
 bad_page+0x1a4/0x228
 free_unref_folios+0xc2c/0x1920
 folios_put_refs+0x354/0x5f0
 __folio_batch_release+0x98/0xd0
 writeback_iter+0x8f8/0xd00
 iomap_writepages+0x16e4/0x2090
 xfs_vm_writepages+0x200/0x2c0
 do_writepages+0x148/0x7c0
 filemap_fdatawrite_wbc+0xe0/0x138
 __filemap_fdatawrite_range+0xb0/0x100
 filemap_write_and_wait_range+0x68/0x100
 __generic_remap_file_range_prep+0x418/0x1090
 generic_remap_file_range_prep+0x18/0x80
 xfs_reflink_remap_prep+0x160/0x7d8
 xfs_file_remap_range+0x164/0xa90
 vfs_dedupe_file_range_one+0x398/0x4a0
 vfs_dedupe_file_range+0x410/0x648
 do_vfs_ioctl+0x13c4/0x1fc0
 __arm64_sys_ioctl+0xd8/0x188
 invoke_syscall.constprop.0+0x60/0x2a0
 el0_svc_common.constprop.0+0x148/0x240
 do_el0_svc+0x40/0x60
 el0_svc+0x34/0x70
 el0t_64_sync_handler+0x104/0x138
 el0t_64_sync+0x170/0x178
Disabling lock debugging due to kernel taint
page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x25 pfn:0x4866b
flags: 0x3ffe0000000000a(uptodate|writeback|node=0|zone=0|lastcpupid=0x1fff)
raw: 03ffe0000000000a dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000025 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))
------------[ cut here ]------------
kernel BUG at ./include/linux/mm.h:1543!
Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
Modules linked in:
CPU: 6 UID: 0 PID: 0 Comm: swapper/6 Tainted: G    B               6.15.0-rc7-00144-gb1427432d3b6-dirty #1053 NONE 
Tainted: [B]=BAD_PAGE
Hardware name: linux,dummy-virt (DT)
pstate: 614000c5 (nZCv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
pc : folio_end_writeback+0x470/0x560
lr : folio_end_writeback+0x470/0x560
sp : ffff8000859978f0
x29: ffff8000859978f0 x28: dfff800000000000 x27: fffffdffc0219ac0
x26: 0000000000000000 x25: ffff000005ed8138 x24: 0000000000000000
x23: 1fffffbff804335e x22: 0000000000000004 x21: 0000000000000001
x20: fffffdffc0219af4 x19: fffffdffc0219ac0 x18: 000000000000000f
x17: 635f6665725f6f69 x16: 6c6f662029746e69 x15: 0720072007200720
x14: 0720072007200720 x13: 0720072007200720 x12: ffff60001b67150b
x11: 1fffe0001b67150a x10: ffff60001b67150a x9 : dfff800000000000
x8 : 00009fffe498eaf6 x7 : ffff0000db38a853 x6 : 0000000000000001
x5 : 0000000000000001 x4 : 0000000000000000 x3 : 0000000000000000
x2 : 0000000000000000 x1 : ffff0000c1f98000 x0 : 000000000000005c
Call trace:
 folio_end_writeback+0x470/0x560 (P)
 iomap_finish_ioend_buffered+0x38c/0x9e0
 iomap_writepage_end_bio+0x80/0xc0
 bio_endio+0x4dc/0x678
 blk_mq_end_request_batch+0x2b4/0x10c0
 nvme_pci_complete_batch+0x338/0x518
 nvme_irq+0xd8/0xf0
 __handle_irq_event_percpu+0xdc/0x528
 handle_irq_event+0x174/0x3d8
 handle_fasteoi_irq+0x2cc/0xba0
 handle_irq_desc+0xb8/0x120
 generic_handle_domain_irq+0x20/0x30
 gic_handle_irq+0x50/0x140
 call_on_irq_stack+0x24/0x50
 do_interrupt_handler+0xe0/0x148
 el1_interrupt+0x30/0x50
 el1h_64_irq_handler+0x14/0x20
 el1h_64_irq+0x6c/0x70
 do_idle+0x244/0x4c8 (P)
 cpu_startup_entry+0x64/0x80
 secondary_start_kernel+0x1e4/0x240
 __secondary_switched+0x74/0x78
Code: 91190021 91218021 aa1303e0 94039279 (d4210000) 
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x0000,000000e0,0109a650,834e7607
Memory Limit: none
---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt ]---

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-26 15:06       ` Jens Axboe
  2025-05-26 15:31         ` Vlastimil Babka
@ 2025-05-26 17:38         ` Jens Axboe
  2025-05-26 23:56           ` Al Viro
  2025-05-27  0:51           ` Trond Myklebust
  1 sibling, 2 replies; 24+ messages in thread
From: Jens Axboe @ 2025-05-26 17:38 UTC (permalink / raw)
  To: Vlastimil Babka, Al Viro, Matthew Wilcox, Jan Kara
  Cc: Christoph Hellwig, Darrick J. Wong, Christian Brauner,
	linux-fsdevel, Linus Torvalds

On 5/26/25 9:06 AM, Jens Axboe wrote:
> On 5/26/25 7:05 AM, Jens Axboe wrote:
>> On 5/25/25 1:12 PM, Vlastimil Babka wrote:
>>> On 5/25/25 8:06 PM, Al Viro wrote:
>>>> On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
>>>>
>>>>> Breakage is still present in the current mainline ;-/
>>>>
>>>> With CONFIG_DEBUG_VM on top of pagealloc debugging:
>>>>
>>>> [ 1434.992817] run fstests generic/127 at 2025-05-25 11:46:11g
>>>> [ 1448.956242] BUG: Bad page state in process kworker/2:1  pfn:112cb0g
>>>> [ 1448.956846] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e pfn:0x112cb0g
>>>> [ 1448.957453] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
>>>
>>> It doesn't like the writeback flag.
>>>
>>>> [ 1448.957863] raw: 800000000000000e dead000000000100 dead000000000122 0000000000000000g
>>>> [ 1448.958303] raw: 000000000000003e 0000000000000000 00000000ffffffff 0000000000000000g
>>>> [ 1448.958833] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) setg
>>>> [ 1448.959320] Modules linked in: xfs autofs4 fuse nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc loop ecryptfs 9pnet_virtio 9pnet netfs evdev pcspkr sg button ext4 jbd2 btrfs blake2b_generic xor zlib_deflate raid6_pq zstd_compress sr_mod cdrom ata_generic ata_piix psmouse serio_raw i2c_piix4 i2c_smbus libata e1000g
>>>> [ 1448.960874] CPU: 2 UID: 0 PID: 2614 Comm: kworker/2:1 Not tainted 6.14.0-rc1+ #78g
>>>> [ 1448.960878] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014g
>>>> [ 1448.960879] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]g
>>>> [ 1448.960938] Call Trace:g
>>>> [ 1448.960939]  <TASK>g
>>>> [ 1448.960940]  dump_stack_lvl+0x4f/0x60g
>>>> [ 1448.960953]  bad_page+0x6f/0x100g
>>>> [ 1448.960957]  free_frozen_pages+0x471/0x640g
>>>> [ 1448.960958]  iomap_finish_ioend+0x196/0x3c0g
>>>> [ 1448.960963]  iomap_finish_ioends+0x83/0xc0g
>>>> [ 1448.960964]  xfs_end_ioend+0x64/0x140 [xfs]g
>>>> [ 1448.961003]  xfs_end_io+0x93/0xc0 [xfs]g
>>>> [ 1448.961036]  process_one_work+0x153/0x390g
>>>> [ 1448.961044]  worker_thread+0x2ab/0x3b0g
>>>> [ 1448.961045]  ? rescuer_thread+0x470/0x470g
>>>> [ 1448.961047]  kthread+0xf7/0x200g
>>>> [ 1448.961048]  ? kthread_use_mm+0xa0/0xa0g
>>>> [ 1448.961049]  ret_from_fork+0x2d/0x50g
>>>> [ 1448.961053]  ? kthread_use_mm+0xa0/0xa0g
>>>> [ 1448.961054]  ret_from_fork_asm+0x11/0x20g
>>>> [ 1448.961058]  </TASK>g
>>>> [ 1448.961155] Disabling lock debugging due to kernel taintg
>>>> [ 1448.969569] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e pfn:0x112cb0g
>>>
>>> same pfn, same struct page
>>>
>>>> [ 1448.970023] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
>>>> [ 1448.970651] raw: 800000000000000e dead000000000100 dead000000000122 0000000000000000g
>>>> [ 1448.971222] raw: 000000000000003e 0000000000000000 00000000ffffffff 0000000000000000g
>>>> [ 1448.971812] page dumped because: VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u <= 127u))g
>>>> [ 1448.972490] ------------[ cut here ]------------g
>>>> [ 1448.972841] kernel BUG at ./include/linux/mm.h:1455!g
>>>
>>> this is folio_get() noticing refcount is 0, so a use-after free, because
>>> we already tried to free the page above.
>>>
>>> I'm not familiar with this code too much, but I suspect problem was
>>> introduced by commit fb7d3bc414939 ("mm/filemap: drop streaming/uncached
>>> pages when writeback completes") and only (more) exposed here.
>>>
>>> so in folio_end_writeback() we have
>>>         if (__folio_end_writeback(folio))
>>>                 folio_wake_bit(folio, PG_writeback);
>>>
>>> but calling the folio_end_dropbehind_write() doesn't depend on the
>>> result of __folio_end_writeback()
>>> this seems rather suspicious
>>>
>>> I think if __folio_end_writeback() was true then PG_writeback would be
>>> cleared and thus we'd not see the PAGE_FLAGS_CHECK_AT_FREE failure.
>>> Instead we do a premature folio_end_dropbehind_write() dropping a page
>>> ref and then the final folio_put() in folio_end_writeback() frees the
>>> page and splats on the PG_writeback. Then the folio is processed again
>>> in the following iteration of iomap_finish_ioend() and splats on the
>>> refcount-already-zero.
>>>
>>> So I think folio_end_dropbehind_write() should only be done when
>>> __folio_end_writeback() was true. Most likely even the
>>> folio_test_clear_dropbehind() should be tied to that, or we clear it too
>>> early and then never act upon it later?
>>
>> Thanks for taking a look at this! I tried to reproduce this this morning
>> and failed miserably. I then injected a delay for the above case, and it
>> does indeed then trigger for me. So far, so good.
>>
>> I agree with your analysis, we should only be doing the dropbehind for a
>> non-zero return from __folio_end_writeback(), and that includes the
>> test_and_clear to avoid dropping the drop-behind state. But we also need
>> to check/clear this state pre __folio_end_writeback(), which then puts
>> us in a spot where it needs to potentially be re-set. Which feels pretty
>> racy...
>>
>> I'll ponder this a bit. Good thing fsx got RWF_DONTCACHE support, or I
>> suspect this would've taken a while to run into.
> 
> Took a closer look... I may be smoking something good here, but I don't
> see what the __folio_end_writeback() return value has to do with this
> at all. Regardless of what it returns, it should've cleared
> PG_writeback, and in fact the only thing it returns is whether or not we
> had anyone waiting on it. Which should have _zero_ bearing on whether or
> not we can clear/invalidate the range.
> 
> To me, this smells more like a race of some sort, between dirty and
> invalidation. fsx does a lot of sub-page sized operations.
> 
> I'll poke a bit more...

I _think_ we're racing with the same folio being marked for writeback
again. Al, can you try the below?


diff --git a/mm/filemap.c b/mm/filemap.c
index 7b90cbeb4a1a..e95b184a2459 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1604,7 +1604,7 @@ static void folio_end_dropbehind_write(struct folio *folio)
 	 * invalidation in that case.
 	 */
 	if (in_task() && folio_trylock(folio)) {
-		if (folio->mapping)
+		if (folio->mapping && !folio_test_writeback(folio))
 			folio_unmap_invalidate(folio->mapping, folio, 0);
 		folio_unlock(folio);
 	}
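
For readers who want to see the changed condition in isolation, here is a
small user-space model of it. It is a sketch under assumptions, not the
kernel implementation: struct sk_folio and sk_dropbehind() are hypothetical
names, and "invalidation" is reduced to setting a flag and clearing the
mapping pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the folio fields the check reads. */
struct sk_folio {
	void *mapping;		/* NULL once removed from the page cache */
	bool  writeback;	/* models PG_writeback */
	bool  invalidated;	/* set when this sketch "invalidates" the folio */
};

/*
 * Models the patched condition: invalidate only if the folio is still
 * in the page cache and no new writeback has started on it.
 * Returns true if the folio was invalidated.
 */
static bool sk_dropbehind(struct sk_folio *f)
{
	assert(f != NULL);
	if (f->mapping && !f->writeback) {
		f->invalidated = true;
		f->mapping = NULL;
		return true;
	}
	return false;
}
```

In this model a folio that has been put under writeback again is left
alone, which is the case the one-line change above is meant to cover.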


-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-26 17:38         ` Jens Axboe
@ 2025-05-26 23:56           ` Al Viro
  2025-05-27  0:58             ` Jens Axboe
  2025-05-27  0:51           ` Trond Myklebust
  1 sibling, 1 reply; 24+ messages in thread
From: Al Viro @ 2025-05-26 23:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vlastimil Babka, Matthew Wilcox, Jan Kara, Christoph Hellwig,
	Darrick J. Wong, Christian Brauner, linux-fsdevel, Linus Torvalds

On Mon, May 26, 2025 at 11:38:53AM -0600, Jens Axboe wrote:
> > I'll poke a bit more...
> 
> I _think_ we're racing with the same folio being marked for writeback
> again. Al, can you try the below?

It seems to survive on top of v6.15^^

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-26 17:38         ` Jens Axboe
  2025-05-26 23:56           ` Al Viro
@ 2025-05-27  0:51           ` Trond Myklebust
  2025-05-27  0:56             ` Jens Axboe
  1 sibling, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2025-05-27  0:51 UTC (permalink / raw)
  To: willy@infradead.org, jack@suse.cz, axboe@kernel.dk,
	viro@zeniv.linux.org.uk, vbabka@suse.cz
  Cc: hch@lst.de, djwong@kernel.org, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, torvalds@linux-foundation.org

On Mon, 2025-05-26 at 11:38 -0600, Jens Axboe wrote:
> On 5/26/25 9:06 AM, Jens Axboe wrote:
> > On 5/26/25 7:05 AM, Jens Axboe wrote:
> > > On 5/25/25 1:12 PM, Vlastimil Babka wrote:
> > > > On 5/25/25 8:06 PM, Al Viro wrote:
> > > > > On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
> > > > > 
> > > > > > Breakage is still present in the current mainline ;-/
> > > > > 
> > > > > With CONFIG_DEBUG_VM on top of pagealloc debugging:
> > > > > 
> > > > > [ 1434.992817] run fstests generic/127 at 2025-05-25
> > > > > 11:46:11g
> > > > > [ 1448.956242] BUG: Bad page state in process kworker/2:1 
> > > > > pfn:112cb0g
> > > > > [ 1448.956846] page: refcount:0 mapcount:0
> > > > > mapping:0000000000000000 index:0x3e pfn:0x112cb0g
> > > > > [ 1448.957453] flags:
> > > > > 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
> > > > 
> > > > It doesn't like the writeback flag.
> > > > 
> > > > > [ 1448.957863] raw: 800000000000000e dead000000000100
> > > > > dead000000000122 0000000000000000g
> > > > > [ 1448.958303] raw: 000000000000003e 0000000000000000
> > > > > 00000000ffffffff 0000000000000000g
> > > > > [ 1448.958833] page dumped because: PAGE_FLAGS_CHECK_AT_FREE
> > > > > flag(s) setg
> > > > > [ 1448.959320] Modules linked in: xfs autofs4 fuse nfsd
> > > > > auth_rpcgss nfs_acl nfs lockd grace sunrpc loop ecryptfs
> > > > > 9pnet_virtio 9pnet netfs evdev pcspkr sg button ext4 jbd2
> > > > > btrfs blake2b_generic xor zlib_deflate raid6_pq zstd_compress
> > > > > sr_mod cdrom ata_generic ata_piix psmouse serio_raw i2c_piix4
> > > > > i2c_smbus libata e1000g
> > > > > [ 1448.960874] CPU: 2 UID: 0 PID: 2614 Comm: kworker/2:1 Not
> > > > > tainted 6.14.0-rc1+ #78g
> > > > > [ 1448.960878] Hardware name: QEMU Standard PC (i440FX +
> > > > > PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014g
> > > > > [ 1448.960879] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]g
> > > > > [ 1448.960938] Call Trace:g
> > > > > [ 1448.960939]  <TASK>g
> > > > > [ 1448.960940]  dump_stack_lvl+0x4f/0x60g
> > > > > [ 1448.960953]  bad_page+0x6f/0x100g
> > > > > [ 1448.960957]  free_frozen_pages+0x471/0x640g
> > > > > [ 1448.960958]  iomap_finish_ioend+0x196/0x3c0g
> > > > > [ 1448.960963]  iomap_finish_ioends+0x83/0xc0g
> > > > > [ 1448.960964]  xfs_end_ioend+0x64/0x140 [xfs]g
> > > > > [ 1448.961003]  xfs_end_io+0x93/0xc0 [xfs]g
> > > > > [ 1448.961036]  process_one_work+0x153/0x390g
> > > > > [ 1448.961044]  worker_thread+0x2ab/0x3b0g
> > > > > [ 1448.961045]  ? rescuer_thread+0x470/0x470g
> > > > > [ 1448.961047]  kthread+0xf7/0x200g
> > > > > [ 1448.961048]  ? kthread_use_mm+0xa0/0xa0g
> > > > > [ 1448.961049]  ret_from_fork+0x2d/0x50g
> > > > > [ 1448.961053]  ? kthread_use_mm+0xa0/0xa0g
> > > > > [ 1448.961054]  ret_from_fork_asm+0x11/0x20g
> > > > > [ 1448.961058]  </TASK>g
> > > > > [ 1448.961155] Disabling lock debugging due to kernel taintg
> > > > > [ 1448.969569] page: refcount:0 mapcount:0
> > > > > mapping:0000000000000000 index:0x3e pfn:0x112cb0g
> > > > 
> > > > same pfn, same struct page
> > > > 
> > > > > [ 1448.970023] flags:
> > > > > 0x800000000000000e(referenced|uptodate|writeback|zone=2)g
> > > > > [ 1448.970651] raw: 800000000000000e dead000000000100
> > > > > dead000000000122 0000000000000000g
> > > > > [ 1448.971222] raw: 000000000000003e 0000000000000000
> > > > > 00000000ffffffff 0000000000000000g
> > > > > [ 1448.971812] page dumped because:
> > > > > VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u
> > > > > <= 127u))g
> > > > > [ 1448.972490] ------------[ cut here ]------------g
> > > > > [ 1448.972841] kernel BUG at ./include/linux/mm.h:1455!g
> > > > 
> > > > this is folio_get() noticing refcount is 0, so a use-after
> > > > free, because
> > > > we already tried to free the page above.
> > > > 
> > > > I'm not familiar with this code too much, but I suspect problem
> > > > was
> > > > introduced by commit fb7d3bc414939 ("mm/filemap: drop
> > > > streaming/uncached
> > > > pages when writeback completes") and only (more) exposed here.
> > > > 
> > > > so in folio_end_writeback() we have
> > > >         if (__folio_end_writeback(folio))
> > > >                 folio_wake_bit(folio, PG_writeback);
> > > > 
> > > > but calling the folio_end_dropbehind_write() doesn't depend on
> > > > the
> > > > result of __folio_end_writeback()
> > > > this seems rather suspicious
> > > > 
> > > > I think if __folio_end_writeback() was true then PG_writeback
> > > > would be
> > > > cleared and thus we'd not see the PAGE_FLAGS_CHECK_AT_FREE
> > > > failure.
> > > > Instead we do a premature folio_end_dropbehind_write() dropping
> > > > a page
> > > > ref and then the final folio_put() in folio_end_writeback()
> > > > frees the
> > > > page and splats on the PG_writeback. Then the folio is
> > > > processed again
> > > > in the following iteration of iomap_finish_ioend() and splats
> > > > on the
> > > > refcount-already-zero.
> > > > 
> > > > So I think folio_end_dropbehind_write() should only be done
> > > > when
> > > > __folio_end_writeback() was true. Most likely even the
> > > > folio_test_clear_dropbehind() should be tied to that, or we
> > > > clear it too
> > > > early and then never act upon it later?
> > > 
> > > Thanks for taking a look at this! I tried to reproduce this this
> > > morning
> > > and failed miserably. I then injected a delay for the above case,
> > > and it
> > > does indeed then trigger for me. So far, so good.
> > > 
> > > I agree with your analysis, we should only be doing the
> > > dropbehind for a
> > > non-zero return from __folio_end_writeback(), and that includes
> > > the
> > > test_and_clear to avoid dropping the drop-behind state. But we
> > > also need
> > > to check/clear this state pre __folio_end_writeback(), which then
> > > puts
> > > us in a spot where it needs to potentially be re-set. Which feels
> > > pretty
> > > racy...
> > > 
> > > I'll ponder this a bit. Good thing fsx got RWF_DONTCACHE support,
> > > or I
> > > suspect this would've taken a while to run into.
> > 
> > Took a closer look... I may be smoking something good here, but I
> > don't
> > see what the __folio_end_writeback() return value has to do with
> > this
> > at all. Regardless of what it returns, it should've cleared
> > PG_writeback, and in fact the only thing it returns is whether or
> > not we
> > had anyone waiting on it. Which should have _zero_ bearing on
> > whether or
> > not we can clear/invalidate the range.
> > 
> > To me, this smells more like a race of some sort, between dirty and
> > invalidation. fsx does a lot of sub-page sized operations.
> > 
> > I'll poke a bit more...
> 
> I _think_ we're racing with the same folio being marked for writeback
> again. Al, can you try the below?
> 
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7b90cbeb4a1a..e95b184a2459 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1604,7 +1604,7 @@ static void folio_end_dropbehind_write(struct
> folio *folio)
>  	 * invalidation in that case.
>  	 */
>  	if (in_task() && folio_trylock(folio)) {
> -		if (folio->mapping)
> +		if (folio->mapping && !folio_test_writeback(folio))
>  			folio_unmap_invalidate(folio->mapping,
> folio, 0);
>  		folio_unlock(folio);
>  	}
> 

I think we need to test for PG_dirty after retaking the folio lock as
well. Nothing stops a second thread from redirtying the page once the
folio lock is dropped, and while some filesystems may insist on waiting
for PG_writeback before allowing redirtying to complete, that still
ends up racing because folio_end_dropbehind_write() is called after the
call to __folio_end_writeback().

Note that the same set of races can happen in
filemap_end_dropbehind_read(), so we need the same set of checks after
taking the folio lock there too. The existing checks are insufficient,
since they only happen before taking the folio lock.
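
The "check again once the lock is held" pattern described above can be
sketched in user space as follows. This is only a toy model under stated
assumptions: struct sk_page, its fields and sk_try_drop() are hypothetical,
an atomic_flag stands in for the folio lock (trylock only), and the real
locking rules for dirtying a folio are not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page state for this sketch. */
struct sk_page {
	atomic_flag lock;	/* models the folio lock, trylock only */
	atomic_bool dirty;	/* models PG_dirty */
	atomic_bool writeback;	/* models PG_writeback */
	atomic_bool cached;	/* true while the page is in the cache */
};

/*
 * Drop the page from the cache only if, with the lock held, it is
 * neither dirty nor under writeback. Returns true if it was dropped.
 */
static bool sk_try_drop(struct sk_page *p)
{
	bool dropped = false;

	assert(p != NULL);
	if (atomic_flag_test_and_set(&p->lock))
		return false;	/* lock is held elsewhere, skip */

	/* State is examined only after the lock has been taken. */
	if (atomic_load(&p->cached) &&
	    !atomic_load(&p->dirty) &&
	    !atomic_load(&p->writeback)) {
		atomic_store(&p->cached, false);
		dropped = true;
	}

	atomic_flag_clear(&p->lock);
	return dropped;
}
```

A page that was redirtied or put back under writeback before the lock was
taken stays cached in this model, as does a page whose lock is busy.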

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-27  0:51           ` Trond Myklebust
@ 2025-05-27  0:56             ` Jens Axboe
  0 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2025-05-27  0:56 UTC (permalink / raw)
  To: Trond Myklebust, willy@infradead.org, jack@suse.cz,
	viro@zeniv.linux.org.uk, vbabka@suse.cz
  Cc: hch@lst.de, djwong@kernel.org, brauner@kernel.org,
	linux-fsdevel@vger.kernel.org, torvalds@linux-foundation.org

On 5/26/25 6:51 PM, Trond Myklebust wrote:
> On Mon, 2025-05-26 at 11:38 -0600, Jens Axboe wrote:
>> On 5/26/25 9:06 AM, Jens Axboe wrote:
>>> On 5/26/25 7:05 AM, Jens Axboe wrote:
>>>> On 5/25/25 1:12 PM, Vlastimil Babka wrote:
>>>>> On 5/25/25 8:06 PM, Al Viro wrote:
>>>>>> On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
>>>>>>
>>>>>>> Breakage is still present in the current mainline ;-/
>>>>>>
>>>>>> With CONFIG_DEBUG_VM on top of pagealloc debugging:
>>>>>>
>>>>>> [ 1434.992817] run fstests generic/127 at 2025-05-25
>>>>>> 11:46:11
>>>>>> [ 1448.956242] BUG: Bad page state in process kworker/2:1
>>>>>> pfn:112cb0
>>>>>> [ 1448.956846] page: refcount:0 mapcount:0
>>>>>> mapping:0000000000000000 index:0x3e pfn:0x112cb0
>>>>>> [ 1448.957453] flags:
>>>>>> 0x800000000000000e(referenced|uptodate|writeback|zone=2)
>>>>>
>>>>> It doesn't like the writeback flag.
>>>>>
>>>>>> [ 1448.957863] raw: 800000000000000e dead000000000100
>>>>>> dead000000000122 0000000000000000
>>>>>> [ 1448.958303] raw: 000000000000003e 0000000000000000
>>>>>> 00000000ffffffff 0000000000000000
>>>>>> [ 1448.958833] page dumped because: PAGE_FLAGS_CHECK_AT_FREE
>>>>>> flag(s) set
>>>>>> [ 1448.959320] Modules linked in: xfs autofs4 fuse nfsd
>>>>>> auth_rpcgss nfs_acl nfs lockd grace sunrpc loop ecryptfs
>>>>>> 9pnet_virtio 9pnet netfs evdev pcspkr sg button ext4 jbd2
>>>>>> btrfs blake2b_generic xor zlib_deflate raid6_pq zstd_compress
>>>>>> sr_mod cdrom ata_generic ata_piix psmouse serio_raw i2c_piix4
>>>>>> i2c_smbus libata e1000
>>>>>> [ 1448.960874] CPU: 2 UID: 0 PID: 2614 Comm: kworker/2:1 Not
>>>>>> tainted 6.14.0-rc1+ #78
>>>>>> [ 1448.960878] Hardware name: QEMU Standard PC (i440FX +
>>>>>> PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
>>>>>> [ 1448.960879] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]
>>>>>> [ 1448.960938] Call Trace:
>>>>>> [ 1448.960939]  <TASK>
>>>>>> [ 1448.960940]  dump_stack_lvl+0x4f/0x60
>>>>>> [ 1448.960953]  bad_page+0x6f/0x100
>>>>>> [ 1448.960957]  free_frozen_pages+0x471/0x640
>>>>>> [ 1448.960958]  iomap_finish_ioend+0x196/0x3c0
>>>>>> [ 1448.960963]  iomap_finish_ioends+0x83/0xc0
>>>>>> [ 1448.960964]  xfs_end_ioend+0x64/0x140 [xfs]
>>>>>> [ 1448.961003]  xfs_end_io+0x93/0xc0 [xfs]
>>>>>> [ 1448.961036]  process_one_work+0x153/0x390
>>>>>> [ 1448.961044]  worker_thread+0x2ab/0x3b0
>>>>>> [ 1448.961045]  ? rescuer_thread+0x470/0x470
>>>>>> [ 1448.961047]  kthread+0xf7/0x200
>>>>>> [ 1448.961048]  ? kthread_use_mm+0xa0/0xa0
>>>>>> [ 1448.961049]  ret_from_fork+0x2d/0x50
>>>>>> [ 1448.961053]  ? kthread_use_mm+0xa0/0xa0
>>>>>> [ 1448.961054]  ret_from_fork_asm+0x11/0x20
>>>>>> [ 1448.961058]  </TASK>
>>>>>> [ 1448.961155] Disabling lock debugging due to kernel taint
>>>>>> [ 1448.969569] page: refcount:0 mapcount:0
>>>>>> mapping:0000000000000000 index:0x3e pfn:0x112cb0
>>>>>
>>>>> same pfn, same struct page
>>>>>
>>>>>> [ 1448.970023] flags:
>>>>>> 0x800000000000000e(referenced|uptodate|writeback|zone=2)
>>>>>> [ 1448.970651] raw: 800000000000000e dead000000000100
>>>>>> dead000000000122 0000000000000000
>>>>>> [ 1448.971222] raw: 000000000000003e 0000000000000000
>>>>>> 00000000ffffffff 0000000000000000
>>>>>> [ 1448.971812] page dumped because:
>>>>>> VM_BUG_ON_FOLIO(((unsigned int) folio_ref_count(folio) + 127u
>>>>>> <= 127u))
>>>>>> [ 1448.972490] ------------[ cut here ]------------
>>>>>> [ 1448.972841] kernel BUG at ./include/linux/mm.h:1455!
>>>>>
>>>>> this is folio_get() noticing refcount is 0, so a use-after
>>>>> free, because
>>>>> we already tried to free the page above.
>>>>>
>>>>> I'm not familiar with this code too much, but I suspect problem
>>>>> was
>>>>> introduced by commit fb7d3bc414939 ("mm/filemap: drop
>>>>> streaming/uncached
>>>>> pages when writeback completes") and only (more) exposed here.
>>>>>
>>>>> so in folio_end_writeback() we have
>>>>>         if (__folio_end_writeback(folio))
>>>>>                 folio_wake_bit(folio, PG_writeback);
>>>>>
>>>>> but calling the folio_end_dropbehind_write() doesn't depend on
>>>>> the
>>>>> result of __folio_end_writeback()
>>>>> this seems rather suspicious
>>>>>
>>>>> I think if __folio_end_writeback() was true then PG_writeback
>>>>> would be
>>>>> cleared and thus we'd not see the PAGE_FLAGS_CHECK_AT_FREE
>>>>> failure.
>>>>> Instead we do a premature folio_end_dropbehind_write() dropping
>>>>> a page
>>>>> ref and then the final folio_put() in folio_end_writeback()
>>>>> frees the
>>>>> page and splats on the PG_writeback. Then the folio is
>>>>> processed again
>>>>> in the following iteration of iomap_finish_ioend() and splats
>>>>> on the
>>>>> refcount-already-zero.
>>>>>
>>>>> So I think folio_end_dropbehind_write() should only be done
>>>>> when
>>>>> __folio_end_writeback() was true. Most likely even the
>>>>> folio_test_clear_dropbehind() should be tied to that, or we
>>>>> clear it too
>>>>> early and then never act upon it later?
>>>>
>>>> Thanks for taking a look at this! I tried to reproduce this this
>>>> morning
>>>> and failed miserably. I then injected a delay for the above case,
>>>> and it
>>>> does indeed then trigger for me. So far, so good.
>>>>
>>>> I agree with your analysis, we should only be doing the
>>>> dropbehind for a
>>>> non-zero return from __folio_end_writeback(), and that includes
>>>> the
>>>> test_and_clear to avoid dropping the drop-behind state. But we
>>>> also need
>>>> to check/clear this state pre __folio_end_writeback(), which then
>>>> puts
>>>> us in a spot where it needs to potentially be re-set. Which feels
>>>> pretty
>>>> racy...
>>>>
>>>> I'll ponder this a bit. Good thing fsx got RWF_DONTCACHE support,
>>>> or I
>>>> suspect this would've taken a while to run into.
>>>
>>> Took a closer look... I may be smoking something good here, but I
>>> don't
>>> see what the __folio_end_writeback() return value has to do with
>>> this
>>> at all. Regardless of what it returns, it should've cleared
>>> PG_writeback, and in fact the only thing it returns is whether or
>>> not we
>>> had anyone waiting on it. Which should have _zero_ bearing on
>>> whether or
>>> not we can clear/invalidate the range.
>>>
>>> To me, this smells more like a race of some sort, between dirty and
>>> invalidation. fsx does a lot of sub-page sized operations.
>>>
>>> I'll poke a bit more...
>>
>> I _think_ we're racing with the same folio being marked for writeback
>> again. Al, can you try the below?
>>
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 7b90cbeb4a1a..e95b184a2459 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1604,7 +1604,7 @@ static void folio_end_dropbehind_write(struct
>> folio *folio)
>>  	 * invalidation in that case.
>>  	 */
>>  	if (in_task() && folio_trylock(folio)) {
>> -		if (folio->mapping)
>> +		if (folio->mapping && !folio_test_writeback(folio))
>>  			folio_unmap_invalidate(folio->mapping,
>> folio, 0);
>>  		folio_unlock(folio);
>>  	}
>>
> 
> I think we need to test for PG_dirty after retaking the folio lock as
> well. Nothing stops a second thread from redirtying the page once the
> folio lock is dropped, and while some filesystems may insist on waiting
> for PG_writeback before allowing redirtying to complete, that still
> ends up racing because folio_end_dropbehind_write() is called after the
> call to __folio_end_writeback().

Agree, local version actually has both as well.

> Note that the same set of races can happen in
> filemap_end_dropbehind_read(), so we need the same set of checks after
> taking the folio lock there too. The existing checks are insufficient,
> since they only happen before taking the folio lock.

Ah good catch. I'll send out the patch tomorrow.
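
A minimal userspace sketch of the recheck-under-lock pattern discussed
above, assuming a simplified single-threaded model rather than the
kernel's actual code. struct model_folio and the model_*() helpers are
hypothetical stand-ins invented for illustration; the real calls are
folio_trylock(), folio_test_dirty(), folio_test_writeback() and
folio_unmap_invalidate(). The only point is that dirty and writeback are
tested after the lock is taken, so a folio that was redirtied or
resubmitted in the meantime is left alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct folio; NOT the kernel type. */
struct model_folio {
	bool locked;       /* models the folio lock */
	bool dirty;        /* models PG_dirty */
	bool writeback;    /* models PG_writeback */
	bool has_mapping;  /* models folio->mapping != NULL */
};

/* Models folio_trylock(): succeeds only if nobody holds the lock. */
bool model_trylock(struct model_folio *f)
{
	if (f->locked)
		return false;
	f->locked = true;
	return true;
}

/* Models folio_unlock(); unlocking an unlocked folio is a bug. */
void model_unlock(struct model_folio *f)
{
	assert(f->locked);
	f->locked = false;
}

/*
 * Models the drop-behind step run after writeback completion.
 * Returns true if the folio was dropped from the (modelled) page cache.
 * All state checks happen with the lock held; checks made before the
 * lock was taken could be stale by now.
 */
bool model_end_dropbehind(struct model_folio *f)
{
	bool dropped = false;

	if (!model_trylock(f))
		return false;
	if (f->has_mapping && !f->dirty && !f->writeback) {
		/* stands in for folio_unmap_invalidate() */
		f->has_mapping = false;
		dropped = true;
	}
	model_unlock(f);
	return dropped;
}
```

In this model a clean, idle folio is dropped, while one that has
PG_writeback or PG_dirty set again by the time the lock is taken is
skipped and stays in the cache for the next completion to handle.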

-- 
Jens Axboe

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-26 23:56           ` Al Viro
@ 2025-05-27  0:58             ` Jens Axboe
  2025-05-27  1:24               ` Al Viro
  0 siblings, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2025-05-27  0:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Vlastimil Babka, Matthew Wilcox, Jan Kara, Christoph Hellwig,
	Darrick J. Wong, Christian Brauner, linux-fsdevel, Linus Torvalds

On 5/26/25 5:56 PM, Al Viro wrote:
> On Mon, May 26, 2025 at 11:38:53AM -0600, Jens Axboe wrote:
>>> I'll poke a bit more...
>>
>> I _think_ we're racing with the same folio being marked for writeback
>> again. Al, can you try the below?
> 
> It seems to survive on top of v6.15^^

Thanks for testing, Al! Assuming it goes without saying, but that's 6.15
with 478ad02d6844 reverted, right?

-- 
Jens Axboe

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-27  0:58             ` Jens Axboe
@ 2025-05-27  1:24               ` Al Viro
  2025-05-27  1:29                 ` Jens Axboe
  0 siblings, 1 reply; 24+ messages in thread
From: Al Viro @ 2025-05-27  1:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vlastimil Babka, Matthew Wilcox, Jan Kara, Christoph Hellwig,
	Darrick J. Wong, Christian Brauner, linux-fsdevel, Linus Torvalds

On Mon, May 26, 2025 at 06:58:47PM -0600, Jens Axboe wrote:
> On 5/26/25 5:56 PM, Al Viro wrote:
> > On Mon, May 26, 2025 at 11:38:53AM -0600, Jens Axboe wrote:
> >>> I'll poke a bit more...
> >>
> >> I _think_ we're racing with the same folio being marked for writeback
> >> again. Al, can you try the below?
> > 
> > It seems to survive on top of v6.15^^
> 
> Thanks for testing, Al! Assuming it goes without saying, but that's 6.15
> with 478ad02d6844 reverted, right?

That's 6.15 without two last commits - 478ad02d6844 and the version bump ;-)

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-27  1:24               ` Al Viro
@ 2025-05-27  1:29                 ` Jens Axboe
  0 siblings, 0 replies; 24+ messages in thread
From: Jens Axboe @ 2025-05-27  1:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Vlastimil Babka, Matthew Wilcox, Jan Kara, Christoph Hellwig,
	Darrick J. Wong, Christian Brauner, linux-fsdevel, Linus Torvalds

On 5/26/25 7:24 PM, Al Viro wrote:
> On Mon, May 26, 2025 at 06:58:47PM -0600, Jens Axboe wrote:
>> On 5/26/25 5:56 PM, Al Viro wrote:
>>> On Mon, May 26, 2025 at 11:38:53AM -0600, Jens Axboe wrote:
>>>>> I'll poke a bit more...
>>>>
>>>> I _think_ we're racing with the same folio being marked for writeback
>>>> again. Al, can you try the below?
>>>
>>> It seems to survive on top of v6.15^^
>>
>> Thanks for testing, Al! Assuming it goes without saying, but that's 6.15
>> with 478ad02d6844 reverted, right?
> 
> That's 6.15 without two last commits - 478ad02d6844 and the version bump ;-)

OK good, I would've been confused if not, but never hurts to confirm...

FWIW, have a branch here:

https://git.kernel.dk/cgit/linux/log/?h=dontcache

with the read/write side patches, and finally the revert as well.
There's a consolidation patch that can be done on top in terms of a
cleanup, but figured it was better to keep that separate from the actual
bug fix.

-- 
Jens Axboe

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-25  8:32 [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?) Al Viro
  2025-05-25 18:02 ` Al Viro
  2025-05-25 18:06 ` Al Viro
@ 2025-05-29  1:56 ` Darrick J. Wong
  2025-05-31  1:10   ` Darrick J. Wong
  2 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2025-05-29  1:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Christoph Hellwig, Christian Brauner, linux-fsdevel,
	Linus Torvalds

On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
> generic/127 with xfstests built on debian-testing (trixie) ends up with
> assorted memory corruption; trace below is with CONFIG_DEBUG_PAGEALLOC and
> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT and it looks like a double free
> somewhere in iomap.  Unfortunately, commit in question is just making
> xfs use the infrastructure built in earlier series - not that useful
> for isolating the breakage.
> 
> [   22.001529] run fstests generic/127 at 2025-05-25 04:13:23
> [   35.498573] BUG: Bad page state in process kworker/2:1  pfn:112ce9
> [   35.499260] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e 9
> [   35.499764] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)
> [   35.500302] raw: 800000000000000e dead000000000100 dead000000000122 000000000
> [   35.500786] raw: 000000000000003e 0000000000000000 00000000ffffffff 000000000
> [   35.501248] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> [   35.501624] Modules linked in: xfs autofs4 fuse nfsd auth_rpcgss nfs_acl nfs0
> [   35.503209] CPU: 2 UID: 0 PID: 85 Comm: kworker/2:1 Not tainted 6.14.0-rc1+ 7
> [   35.503211] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.164
> [   35.503212] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]
> [   35.503279] Call Trace:
> [   35.503281]  <TASK>
> [   35.503282]  dump_stack_lvl+0x4f/0x60
> [   35.503296]  bad_page+0x6f/0x100
> [   35.503300]  free_frozen_pages+0x303/0x550
> [   35.503301]  iomap_finish_ioend+0xf6/0x380
> [   35.503304]  iomap_finish_ioends+0x83/0xc0
> [   35.503305]  xfs_end_ioend+0x64/0x140 [xfs]
> [   35.503342]  xfs_end_io+0x93/0xc0 [xfs]
> [   35.503378]  process_one_work+0x153/0x390
> [   35.503382]  worker_thread+0x2ab/0x3b0
> 
> It's 4:30am here, so I'm going to leave attempts to actually debug that
> thing until tomorrow; I do have a kvm where it's reliably reproduced
> within a few minutes, so if anyone comes up with patches, I'll be able
> to test them.
> 
> Breakage is still present in the current mainline ;-/

Hey Al,

Well, this certainly looks like the same report I made a month ago.
I'll go run 6.15 final (with the #define RWF_DONTCACHE 0) overnight to
confirm if that makes my problem go away.  If these are one and the same
bug, then thank you for finding a better reproducer! :)

https://lore.kernel.org/linux-fsdevel/20250416180837.GN25675@frogsfrogsfrogs/

--D

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-29  1:56 ` Darrick J. Wong
@ 2025-05-31  1:10   ` Darrick J. Wong
  2025-05-31 21:00     ` Jens Axboe
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2025-05-31  1:10 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Christoph Hellwig, Christian Brauner, linux-fsdevel,
	Linus Torvalds

On Wed, May 28, 2025 at 06:56:37PM -0700, Darrick J. Wong wrote:
> On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
> > generic/127 with xfstests built on debian-testing (trixie) ends up with
> > assorted memory corruption; trace below is with CONFIG_DEBUG_PAGEALLOC and
> > CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT and it looks like a double free
> > somewhere in iomap.  Unfortunately, commit in question is just making
> > xfs use the infrastructure built in earlier series - not that useful
> > for isolating the breakage.
> > 
> > [   22.001529] run fstests generic/127 at 2025-05-25 04:13:23
> > [   35.498573] BUG: Bad page state in process kworker/2:1  pfn:112ce9
> > [   35.499260] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e 9
> > [   35.499764] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)
> > [   35.500302] raw: 800000000000000e dead000000000100 dead000000000122 000000000
> > [   35.500786] raw: 000000000000003e 0000000000000000 00000000ffffffff 000000000
> > [   35.501248] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
> > [   35.501624] Modules linked in: xfs autofs4 fuse nfsd auth_rpcgss nfs_acl nfs0
> > [   35.503209] CPU: 2 UID: 0 PID: 85 Comm: kworker/2:1 Not tainted 6.14.0-rc1+ 7
> > [   35.503211] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.164
> > [   35.503212] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]
> > [   35.503279] Call Trace:
> > [   35.503281]  <TASK>
> > [   35.503282]  dump_stack_lvl+0x4f/0x60
> > [   35.503296]  bad_page+0x6f/0x100
> > [   35.503300]  free_frozen_pages+0x303/0x550
> > [   35.503301]  iomap_finish_ioend+0xf6/0x380
> > [   35.503304]  iomap_finish_ioends+0x83/0xc0
> > [   35.503305]  xfs_end_ioend+0x64/0x140 [xfs]
> > [   35.503342]  xfs_end_io+0x93/0xc0 [xfs]
> > [   35.503378]  process_one_work+0x153/0x390
> > [   35.503382]  worker_thread+0x2ab/0x3b0
> > 
> > It's 4:30am here, so I'm going to leave attempts to actually debug that
> > thing until tomorrow; I do have a kvm where it's reliably reproduced
> > within a few minutes, so if anyone comes up with patches, I'll be able
> > to test them.
> > 
> > Breakage is still present in the current mainline ;-/
> 
> Hey Al,
> 
> Well, this certainly looks like the same report I made a month ago.
> I'll go run 6.15 final (with the #define RWF_DONTCACHE 0) overnight to
> confirm if that makes my problem go away.  If these are one and the same
> bug, then thank you for finding a better reproducer! :)
> 
> https://lore.kernel.org/linux-fsdevel/20250416180837.GN25675@frogsfrogsfrogs/

After a full QA run, 6.15 final passes fstests with flying colors.  So I
guess we now know the culprit.  Will test the new RWF_DONTCACHE fixes
whenever they appear in upstream.

--D

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-31  1:10   ` Darrick J. Wong
@ 2025-05-31 21:00     ` Jens Axboe
  2025-06-02  9:04       ` Christian Brauner
  0 siblings, 1 reply; 24+ messages in thread
From: Jens Axboe @ 2025-05-31 21:00 UTC (permalink / raw)
  To: Darrick J. Wong, Al Viro
  Cc: Christoph Hellwig, Christian Brauner, linux-fsdevel,
	Linus Torvalds

On 5/30/25 7:10 PM, Darrick J. Wong wrote:
> On Wed, May 28, 2025 at 06:56:37PM -0700, Darrick J. Wong wrote:
>> On Sun, May 25, 2025 at 09:32:09AM +0100, Al Viro wrote:
>>> generic/127 with xfstests built on debian-testing (trixie) ends up with
>>> assorted memory corruption; trace below is with CONFIG_DEBUG_PAGEALLOC and
>>> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT and it looks like a double free
>>> somewhere in iomap.  Unfortunately, commit in question is just making
>>> xfs use the infrastructure built in earlier series - not that useful
>>> for isolating the breakage.
>>>
>>> [   22.001529] run fstests generic/127 at 2025-05-25 04:13:23
>>> [   35.498573] BUG: Bad page state in process kworker/2:1  pfn:112ce9
>>> [   35.499260] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x3e 9
>>> [   35.499764] flags: 0x800000000000000e(referenced|uptodate|writeback|zone=2)
>>> [   35.500302] raw: 800000000000000e dead000000000100 dead000000000122 000000000
>>> [   35.500786] raw: 000000000000003e 0000000000000000 00000000ffffffff 000000000
>>> [   35.501248] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
>>> [   35.501624] Modules linked in: xfs autofs4 fuse nfsd auth_rpcgss nfs_acl nfs0
>>> [   35.503209] CPU: 2 UID: 0 PID: 85 Comm: kworker/2:1 Not tainted 6.14.0-rc1+ 7
>>> [   35.503211] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.164
>>> [   35.503212] Workqueue: xfs-conv/sdb1 xfs_end_io [xfs]
>>> [   35.503279] Call Trace:
>>> [   35.503281]  <TASK>
>>> [   35.503282]  dump_stack_lvl+0x4f/0x60
>>> [   35.503296]  bad_page+0x6f/0x100
>>> [   35.503300]  free_frozen_pages+0x303/0x550
>>> [   35.503301]  iomap_finish_ioend+0xf6/0x380
>>> [   35.503304]  iomap_finish_ioends+0x83/0xc0
>>> [   35.503305]  xfs_end_ioend+0x64/0x140 [xfs]
>>> [   35.503342]  xfs_end_io+0x93/0xc0 [xfs]
>>> [   35.503378]  process_one_work+0x153/0x390
>>> [   35.503382]  worker_thread+0x2ab/0x3b0
>>>
>>> It's 4:30am here, so I'm going to leave attempts to actually debug that
>>> thing until tomorrow; I do have a kvm where it's reliably reproduced
>>> within a few minutes, so if anyone comes up with patches, I'll be able
>>> to test them.
>>>
>>> Breakage is still present in the current mainline ;-/
>>
>> Hey Al,
>>
>> Well, this certainly looks like the same report I made a month ago.
>> I'll go run 6.15 final (with the #define RWF_DONTCACHE 0) overnight to
>> confirm if that makes my problem go away.  If these are one and the same
>> bug, then thank you for finding a better reproducer! :)
>>
>> https://lore.kernel.org/linux-fsdevel/20250416180837.GN25675@frogsfrogsfrogs/
> 
> After a full QA run, 6.15 final passes fstests with flying colors.  So I
> guess we now know the culprit.  Will test the new RWF_DONTCACHE fixes
> whenever they appear in upstream.

Please do! Unfortunately I never saw your original report as I wasn't
CC'ed on it, which I can't really fault anyone for as there was no
reason to suspect it so far.

-- 
Jens Axboe

* Re: [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?)
  2025-05-31 21:00     ` Jens Axboe
@ 2025-06-02  9:04       ` Christian Brauner
  0 siblings, 0 replies; 24+ messages in thread
From: Christian Brauner @ 2025-06-02  9:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Darrick J. Wong, Al Viro, Christoph Hellwig, linux-fsdevel,
	Linus Torvalds

> >> https://lore.kernel.org/linux-fsdevel/20250416180837.GN25675@frogsfrogsfrogs/
> > 
> > After a full QA run, 6.15 final passes fstests with flying colors.  So I
> > guess we now know the culprit.  Will test the new RWF_DONTCACHE fixes
> > whenever they appear in upstream.
> 
> Please do! Unfortunately I never saw your original report as I wasn't
> CC'ed on it, which I can't really fault anyone for as there was no
> reason to suspect it so far.

I've just sent the pull request with the fixes a minute ago.
Thanks for testing!

Thread overview: 24+ messages
2025-05-25  8:32 [BUG] regression from 974c5e6139db "xfs: flag as supporting FOP_DONTCACHE" (double free on page?) Al Viro
2025-05-25 18:02 ` Al Viro
2025-05-25 18:06 ` Al Viro
2025-05-25 19:12   ` Vlastimil Babka
2025-05-25 20:32     ` Linus Torvalds
2025-05-25 20:48       ` Matthew Wilcox
2025-05-25 20:54         ` Linus Torvalds
2025-05-25 21:49         ` Al Viro
2025-05-25 22:05           ` Linus Torvalds
2025-05-26 13:05     ` Jens Axboe
2025-05-26 15:06       ` Jens Axboe
2025-05-26 15:31         ` Vlastimil Babka
2025-05-26 15:58           ` Jens Axboe
2025-05-26 17:38         ` Jens Axboe
2025-05-26 23:56           ` Al Viro
2025-05-27  0:58             ` Jens Axboe
2025-05-27  1:24               ` Al Viro
2025-05-27  1:29                 ` Jens Axboe
2025-05-27  0:51           ` Trond Myklebust
2025-05-27  0:56             ` Jens Axboe
2025-05-29  1:56 ` Darrick J. Wong
2025-05-31  1:10   ` Darrick J. Wong
2025-05-31 21:00     ` Jens Axboe
2025-06-02  9:04       ` Christian Brauner