* fs-next-20251103 reclaim lockdep splat
@ 2025-11-05 20:21 Matthew Wilcox
2025-11-05 21:05 ` Dave Chinner
0 siblings, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2025-11-05 20:21 UTC (permalink / raw)
To: linux-xfs
In trying to bisect the earlier reported transaction assertion failure,
I hit this:
generic/476 run fstests generic/476 at 2025-11-05 20:16:46
XFS (vdb): Mounting V5 Filesystem 7f483353-a0f6-4710-9adc-4b72f25598f8
XFS (vdb): Ending clean mount
XFS (vdc): Mounting V5 Filesystem 47fa2f49-e8e1-4622-a62c-53ea07b3d714
XFS (vdc): Ending clean mount
======================================================
WARNING: possible circular locking dependency detected
6.18.0-rc4-ktest-00382-ga1e94de7fbd5 #116 Tainted: G W
------------------------------------------------------
kswapd0/111 is trying to acquire lock:
ffff888102462418 (&xfs_nondir_ilock_class){++++}-{4:4}, at: xfs_ilock+0x14b/0x2b0
but task is already holding lock:
ffffffff82710f00 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x3e0/0x780
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
fs_reclaim_acquire+0x67/0xa0
__kmalloc_cache_noprof+0x4d/0x500
iomap_fill_dirty_folios+0x6b/0xe0
xfs_buffered_write_iomap_begin+0xaee/0x1060
iomap_iter+0x1a1/0x4a0
iomap_zero_range+0xb0/0x420
xfs_zero_range+0x54/0x70
xfs_file_write_checks+0x21d/0x320
xfs_file_dio_write_unaligned+0x140/0x2b0
xfs_file_write_iter+0x22a/0x270
vfs_write+0x23f/0x540
ksys_write+0x6d/0x100
__x64_sys_write+0x1d/0x30
x64_sys_call+0x7d/0x1da0
do_syscall_64+0x6a/0x2e0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&xfs_nondir_ilock_class){++++}-{4:4}:
__lock_acquire+0x15be/0x27d0
lock_acquire+0xb2/0x290
down_write_nested+0x2a/0xb0
xfs_ilock+0x14b/0x2b0
xfs_icwalk_ag+0x517/0xaf0
xfs_icwalk+0x46/0x80
xfs_reclaim_inodes_nr+0x8c/0xb0
xfs_fs_free_cached_objects+0x1d/0x30
super_cache_scan+0x178/0x1d0
do_shrink_slab+0x16d/0x6a0
shrink_slab+0x4cf/0x8c0
shrink_node+0x334/0x870
balance_pgdat+0x35f/0x780
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(&xfs_nondir_ilock_class);
                               lock(fs_reclaim);
  lock(&xfs_nondir_ilock_class);
*** DEADLOCK ***
2 locks held by kswapd0/111:
#0: ffffffff82710f00 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x3e0/0x780
#1: ffff888132a800e0 (&type->s_umount_key#38){++++}-{4:4}, at: super_cache_scan+0x3d/0x1d0
stack backtrace:
CPU: 2 UID: 0 PID: 111 Comm: kswapd0 Tainted: G W 6.18.0-rc4-ktest-00382-ga1e94de7fbd5 #116 NONE
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x63/0x90
dump_stack+0x14/0x1a
print_circular_bug.cold+0x17e/0x1bb
check_noncircular+0x123/0x140
__lock_acquire+0x15be/0x27d0
lock_acquire+0xb2/0x290
? xfs_ilock+0x14b/0x2b0
? find_held_lock+0x31/0x90
? xfs_icwalk_ag+0x50a/0xaf0
down_write_nested+0x2a/0xb0
? xfs_ilock+0x14b/0x2b0
xfs_ilock+0x14b/0x2b0
xfs_icwalk_ag+0x517/0xaf0
? xfs_icwalk_ag+0x60/0xaf0
xfs_icwalk+0x46/0x80
xfs_reclaim_inodes_nr+0x8c/0xb0
xfs_fs_free_cached_objects+0x1d/0x30
super_cache_scan+0x178/0x1d0
do_shrink_slab+0x16d/0x6a0
shrink_slab+0x4cf/0x8c0
? shrink_slab+0x2f1/0x8c0
shrink_node+0x334/0x870
balance_pgdat+0x35f/0x780
kswapd+0x1cf/0x3b0
? __pfx_autoremove_wake_function+0x10/0x10
? __pfx_kswapd+0x10/0x10
kthread+0x100/0x220
? _raw_spin_unlock_irq+0x2b/0x40
? __pfx_kthread+0x10/0x10
ret_from_fork+0x18c/0x1d0
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
* Re: fs-next-20251103 reclaim lockdep splat
  2025-11-05 20:21 fs-next-20251103 reclaim lockdep splat Matthew Wilcox
@ 2025-11-05 21:05 ` Dave Chinner
  2025-11-06 12:20   ` Brian Foster
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2025-11-05 21:05 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: linux-xfs, bfoster

On Wed, Nov 05, 2025 at 08:21:51PM +0000, Matthew Wilcox wrote:
> In trying to bisect the earlier reported transaction assertion failure,
> I hit this:
>
> generic/476 run fstests generic/476 at 2025-11-05 20:16:46
....
> kswapd0/111 is trying to acquire lock:
> ffff888102462418 (&xfs_nondir_ilock_class){++++}-{4:4}, at: xfs_ilock+0x14b/0x2b0
>
> but task is already holding lock:
> ffffffff82710f00 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x3e0/0x780
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (fs_reclaim){+.+.}-{0:0}:
>        fs_reclaim_acquire+0x67/0xa0
>        __kmalloc_cache_noprof+0x4d/0x500
>        iomap_fill_dirty_folios+0x6b/0xe0
>        xfs_buffered_write_iomap_begin+0xaee/0x1060
>        iomap_iter+0x1a1/0x4a0
>        iomap_zero_range+0xb0/0x420
>        xfs_zero_range+0x54/0x70
>        xfs_file_write_checks+0x21d/0x320
>        xfs_file_dio_write_unaligned+0x140/0x2b0
>        xfs_file_write_iter+0x22a/0x270
>        vfs_write+0x23f/0x540
....
> -> #0 (&xfs_nondir_ilock_class){++++}-{4:4}:
....
>        xfs_ilock+0x14b/0x2b0
>        xfs_icwalk_ag+0x517/0xaf0
>        xfs_icwalk+0x46/0x80
>        xfs_reclaim_inodes_nr+0x8c/0xb0
>        xfs_fs_free_cached_objects+0x1d/0x30
>        super_cache_scan+0x178/0x1d0
>        do_shrink_slab+0x16d/0x6a0
>        shrink_slab+0x4cf/0x8c0
>        shrink_node+0x334/0x870
>        balance_pgdat+0x35f/0x780

As I said on #xfs: false positive on the inode lock.

Reclaim is running in GFP_KERNEL context, so it's allowed to lock
unreferenced inodes.

The inodes that the allocation context holds locked are referenced
inodes, so it cannot self-deadlock on the inode locks it holds
because reclaim does not access or lock referenced inodes.

That being said, looking at this patch:

https://lore.kernel.org/linux-xfs/20251003134642.604736-4-bfoster@redhat.com/

I think the allocation in iomap_fill_dirty_folios() should probably
be using mapping_gfp_constraint(mapping, GFP_KERNEL) rather than a
hard coded GFP_KERNEL allocation. This is deep in the buffered write
path; the xfs ILOCK is held when iomap_fill_dirty_folios() is called,
and it does folio lookups in that context.

Huh - that kinda feels like a lock order violation.
ILOCK is not supposed to be held when we do page cache operations, as
the lock order defined by writeback operations is folio lookup ->
folio lock -> ILOCK.

So maybe this is a problem here, but not the one lockdep flagged...

Brian?

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fs-next-20251103 reclaim lockdep splat
  2025-11-05 21:05 ` Dave Chinner
@ 2025-11-06 12:20   ` Brian Foster
  2025-11-10 22:02     ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Brian Foster @ 2025-11-06 12:20 UTC (permalink / raw)
To: Dave Chinner; +Cc: Matthew Wilcox, linux-xfs

On Thu, Nov 06, 2025 at 08:05:11AM +1100, Dave Chinner wrote:
> On Wed, Nov 05, 2025 at 08:21:51PM +0000, Matthew Wilcox wrote:
> > In trying to bisect the earlier reported transaction assertion failure,
> > I hit this:
> >
> > generic/476 run fstests generic/476 at 2025-11-05 20:16:46
....
> As I said on #xfs: false positive on the inode lock.
>
> Reclaim is running in GFP_KERNEL context, so it's allowed to lock
> unreferenced inodes.
>
> The inodes that the allocation context holds locked are referenced
> inodes, so it cannot self-deadlock on the inode locks it holds
> because reclaim does not access or lock referenced inodes.
>
> That being said, looking at this patch:
>
> https://lore.kernel.org/linux-xfs/20251003134642.604736-4-bfoster@redhat.com/
>
> I think the allocation in iomap_fill_dirty_folios() should probably
> be using mapping_gfp_constraint(mapping, GFP_KERNEL) rather than a
> hard coded GFP_KERNEL allocation. This is deep in the buffered write
> path; the xfs ILOCK is held when iomap_fill_dirty_folios() is called,
> and it does folio lookups in that context.
>

There's an outstanding patch to nuke this allocation completely:

https://lore.kernel.org/linux-fsdevel/20251016190303.53881-2-bfoster@redhat.com/

This was also problematic for the ext4 on iomap WIP, so combined with
the cleanup to use an iomap flag this seemed more elegant overall.

The patch series it's part of still needs work, but this one is just a
standalone cleanup. If I can get some acks on it I'm happy to repost it
separately to take this issue off the table..

> Huh - that kinda feels like a lock order violation. ILOCK is not
> supposed to be held when we do page cache operations as the lock
> order defined by writeback operations is folio lookup -> folio lock
> -> ILOCK.
>
> So maybe this is a problem here, but not the one lockdep flagged...
>

Yeah.. but filemap_get_folios_dirty() is somewhat advisory. It is
intended for use in this context, so it only trylocks folios; those it
cannot lock it just assumes are dirty or under writeback and includes
them in the batch for locking later (where later is defined as after
the iomap callback returns, where iomap typically does folio
lookup/lock for buffered writes).

Brian

> Brian?
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: fs-next-20251103 reclaim lockdep splat
  2025-11-06 12:20 ` Brian Foster
@ 2025-11-10 22:02   ` Dave Chinner
  2025-11-11 14:07     ` Brian Foster
  0 siblings, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2025-11-10 22:02 UTC (permalink / raw)
To: Brian Foster; +Cc: Matthew Wilcox, linux-xfs

On Thu, Nov 06, 2025 at 07:20:39AM -0500, Brian Foster wrote:
> On Thu, Nov 06, 2025 at 08:05:11AM +1100, Dave Chinner wrote:
> > On Wed, Nov 05, 2025 at 08:21:51PM +0000, Matthew Wilcox wrote:
> > > In trying to bisect the earlier reported transaction assertion failure,
> > > I hit this:
> > >
> > > generic/476 run fstests generic/476 at 2025-11-05 20:16:46
....
> > That being said, looking at this patch:
> >
> > https://lore.kernel.org/linux-xfs/20251003134642.604736-4-bfoster@redhat.com/
> >
> > I think the allocation in iomap_fill_dirty_folios() should probably
> > be using mapping_gfp_constraint(mapping, GFP_KERNEL) rather than a
> > hard coded GFP_KERNEL allocation.
....
> There's an outstanding patch to nuke this allocation completely:
>
> https://lore.kernel.org/linux-fsdevel/20251016190303.53881-2-bfoster@redhat.com/
>
> This was also problematic for the ext4 on iomap WIP, so combined with
> the cleanup to use an iomap flag this seemed more elegant overall.

Ok, that looks like a good way to get rid of the allocation, so
you can add

Acked-by: Dave Chinner <dchinner@redhat.com>

to it.
> The patch series it's part of still needs work, but this one is just a
> standalone cleanup. If I can get some acks on it I'm happy to repost it
> separately to take this issue off the table..
>
> > Huh - that kinda feels like a lock order violation. ILOCK is not
> > supposed to be held when we do page cache operations as the lock
> > order defined by writeback operations is folio lookup -> folio lock
> > -> ILOCK.
> >
> > So maybe this is a problem here, but not the one lockdep flagged...
>
> Yeah.. but filemap_get_folios_dirty() is somewhat advisory. It is
> intended for use in this context, so it only trylocks folios; those it
> cannot lock it just assumes are dirty or under writeback and includes
> them in the batch for locking later (where later is defined as after
> the iomap callback returns, where iomap typically does folio
> lookup/lock for buffered writes).

I guess I don't understand why this needs to be done under the
ILOCK. I've read the patches, and they don't explain to me why we
need to look up the pages under the ILOCK. The only constraint I see
is trimming the extent map to match the end of a full fbatch array,
but that doesn't seem like it can't be done by iomap itself.

Why do we need to do this dirty-folio batch lookup under the
ILOCK instead of as a loop in iomap_zero_range() that grabs an fbatch
for the unwritten mapping before calling iomap_zero_iter()
repeatedly until we either hit the end of the mapping or there are
no more dirty folios over the mapping?

-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: fs-next-20251103 reclaim lockdep splat
  2025-11-10 22:02 ` Dave Chinner
@ 2025-11-11 14:07   ` Brian Foster
  0 siblings, 0 replies; 5+ messages in thread
From: Brian Foster @ 2025-11-11 14:07 UTC (permalink / raw)
To: Dave Chinner; +Cc: Matthew Wilcox, linux-xfs

On Tue, Nov 11, 2025 at 09:02:32AM +1100, Dave Chinner wrote:
> On Thu, Nov 06, 2025 at 07:20:39AM -0500, Brian Foster wrote:
> > On Thu, Nov 06, 2025 at 08:05:11AM +1100, Dave Chinner wrote:
....
> > There's an outstanding patch to nuke this allocation completely:
> >
> > https://lore.kernel.org/linux-fsdevel/20251016190303.53881-2-bfoster@redhat.com/
> >
> > This was also problematic for the ext4 on iomap WIP, so combined with
> > the cleanup to use an iomap flag this seemed more elegant overall.
>
> Ok, that looks like a good way to get rid of the allocation, so
> you can add
>
> Acked-by: Dave Chinner <dchinner@redhat.com>
>
> to it.
>

Thanks. I'll repost that one standalone since the rest of that series
still needs some work.

....
> I guess I don't understand why this needs to be done under the
> ILOCK. I've read the patches, and they don't explain to me why we
> need to look up the pages under the ILOCK. The only constraint I see
> is trimming the extent map to match the end of a full fbatch array,
> but that doesn't seem like it can't be done by iomap itself.
>
> Why do we need to do this dirty-folio batch lookup under the
> ILOCK instead of as a loop in iomap_zero_range() that grabs an fbatch
> for the unwritten mapping before calling iomap_zero_iter()
> repeatedly until we either hit the end of the mapping or there are
> no more dirty folios over the mapping?
>

It doesn't need to be under ILOCK, just either under ILOCK or before
iomap_begin, to rule out problematic inconsistency (i.e. seeing a
clean folio && unwritten mapping and racing with writeback).
This was discussed at multiple points during review of the original
batch series. The longer term prospects for this are to potentially
use it for multiple different operations, and therefore lift it into
core iomap (not necessarily just zero range). The reason it's done
this way to start was mainly just to manage scope and complexity.

The use case is bounded, so it was easiest to work this in where it is
and defer complexity associated with broader use to later. That might
sound pedantic, but when you consider dependent changes like
incremental iter advancement, the impedance mismatch between some of
these changes and the ext4-on-iomap WIP, and the little issues that
fall out of that, this is all much easier to develop, support and
improve with as little unnecessary complexity as possible.

I think when some of the kinks the rest of that series is trying to
address are worked out, that might be a reasonable point to consider
such a lift.

Brian

> -Dave.
> --
> Dave Chinner
> david@fromorbit.com