* [PATCH 1/7] xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock
2025-06-25 22:48 Dave Chinner
@ 2025-06-25 22:48 ` Dave Chinner
2025-06-26 9:30 ` Carlos Maiolino
2025-07-03 7:27 ` Carlos Maiolino
2025-06-25 22:48 ` [PATCH 2/7] xfs: catch stale AGF/AGI metadata Dave Chinner
` (5 subsequent siblings)
6 siblings, 2 replies; 16+ messages in thread
From: Dave Chinner @ 2025-06-25 22:48 UTC (permalink / raw)
To: linux-xfs
From: Dave Chinner <dchinner@redhat.com>
Lock order of xfs_ifree_cluster() is cluster buffer -> try ILOCK
-> IFLUSHING, except for the last inode in the cluster that is
triggering the free. In that case, the lock order is ILOCK ->
cluster buffer -> IFLUSHING.
xfs_iflush_cluster() uses cluster buffer -> try ILOCK -> IFLUSHING,
so this can safely run concurrently with xfs_ifree_cluster().
xfs_inode_item_precommit() uses ILOCK -> cluster buffer, but this
cannot race with xfs_ifree_cluster() so being in a different order
will not trigger a deadlock.
xfs_reclaim_inode() during a filesystem shutdown uses ILOCK ->
IFLUSHING -> cluster buffer via xfs_iflush_shutdown_abort(), and
this deadlocks against xfs_ifree_cluster() like so:
sysrq: Show Blocked State
task:kworker/10:37 state:D stack:12560 pid:276182 tgid:276182 ppid:2 flags:0x00004000
Workqueue: xfs-inodegc/dm-3 xfs_inodegc_worker
Call Trace:
<TASK>
__schedule+0x650/0xb10
schedule+0x6d/0xf0
schedule_timeout+0x8b/0x180
schedule_timeout_uninterruptible+0x1e/0x30
xfs_ifree+0x326/0x730
xfs_inactive_ifree+0xcb/0x230
xfs_inactive+0x2c8/0x380
xfs_inodegc_worker+0xaa/0x180
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
task:fsync-tester state:D stack:12160 pid:2255943 tgid:2255943 ppid:3988702 flags:0x00004006
Call Trace:
<TASK>
__schedule+0x650/0xb10
schedule+0x6d/0xf0
schedule_timeout+0x31/0x180
__down_common+0xbe/0x1f0
__down+0x1d/0x30
down+0x48/0x50
xfs_buf_lock+0x3d/0xe0
xfs_iflush_shutdown_abort+0x51/0x1e0
xfs_icwalk_ag+0x386/0x690
xfs_reclaim_inodes_nr+0x114/0x160
xfs_fs_free_cached_objects+0x19/0x20
super_cache_scan+0x17b/0x1a0
do_shrink_slab+0x180/0x350
shrink_slab+0xf8/0x430
drop_slab+0x97/0xf0
drop_caches_sysctl_handler+0x59/0xc0
proc_sys_call_handler+0x189/0x280
proc_sys_write+0x13/0x20
vfs_write+0x33d/0x3f0
ksys_write+0x7c/0xf0
__x64_sys_write+0x1b/0x30
x64_sys_call+0x271d/0x2ee0
do_syscall_64+0x68/0x130
entry_SYSCALL_64_after_hwframe+0x76/0x7e
We can't change the lock order of xfs_ifree_cluster() - XFS_ISTALE
and XFS_IFLUSHING are serialised through to journal IO completion
by the cluster buffer lock being held.
There are quite a few asserts in the code that check that XFS_ISTALE
does not occur out of sync with buffer locking (e.g. in
xfs_iflush_cluster). There's also a dependency on the inode log item
being removed from the buffer before XFS_IFLUSHING is cleared, also
with asserts that trigger on this.
Further, we don't have a requirement for the inode to be locked when
completing or aborting inode flushing because all the inode state
updates are serialised by holding the cluster buffer lock across the
IO to completion.
We can't check for XFS_IRECLAIM in xfs_ifree_mark_inode_stale() and
skip the inode, because there is no guarantee that the inode will be
reclaimed. Hence it *must* be marked XFS_ISTALE regardless of
whether reclaim is preparing to free that inode. Similarly, we can't
check for IFLUSHING before locking the inode because that would
result in dirty inodes not being marked with ISTALE in the event of
racing with XFS_IRECLAIM.
Hence we have to address this issue from the xfs_reclaim_inode()
side. It is clear that we cannot hold the inode locked here when
calling xfs_iflush_shutdown_abort() because it is the inode->buffer
lock order that causes the deadlock against xfs_ifree_cluster().
Hence we need to drop the ILOCK before aborting the inode in the
shutdown case. Once we've aborted the inode, we can grab the ILOCK
again and then immediately reclaim it as it is now guaranteed to be
clean.
Note that dropping the ILOCK in xfs_reclaim_inode() means that it
can now be locked by xfs_ifree_mark_inode_stale() and seen whilst in
this state. This is safe because we have left the XFS_IFLUSHING flag
on the inode and so xfs_ifree_mark_inode_stale() will simply set
XFS_ISTALE and move to the next inode. An ASSERT check in this path
needs to be tweaked to take into account this new shutdown
interaction.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
fs/xfs/xfs_icache.c | 8 ++++++++
fs/xfs/xfs_inode.c | 2 +-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 726e29b837e6..bbc2f2973dcc 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -979,7 +979,15 @@ xfs_reclaim_inode(
*/
if (xlog_is_shutdown(ip->i_mount->m_log)) {
xfs_iunpin_wait(ip);
+ /*
+ * Avoid an ABBA deadlock on the inode cluster buffer vs
+ * concurrent xfs_ifree_cluster() trying to mark the inode
+ * stale. We don't need the inode locked to run the flush abort
+ * code, but the flush abort needs to lock the cluster buffer.
+ */
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
xfs_iflush_shutdown_abort(ip);
+ xfs_ilock(ip, XFS_ILOCK_EXCL);
goto reclaim;
}
if (xfs_ipincount(ip))
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ee3e0f284287..761a996a857c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1635,7 +1635,7 @@ xfs_ifree_mark_inode_stale(
iip = ip->i_itemp;
if (__xfs_iflags_test(ip, XFS_IFLUSHING)) {
ASSERT(!list_empty(&iip->ili_item.li_bio_list));
- ASSERT(iip->ili_last_fields);
+ ASSERT(iip->ili_last_fields || xlog_is_shutdown(mp->m_log));
goto out_iunlock;
}
--
2.45.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/7] xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock
2025-06-25 22:48 ` [PATCH 1/7] xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock Dave Chinner
@ 2025-06-26 9:30 ` Carlos Maiolino
2025-07-03 7:27 ` Carlos Maiolino
1 sibling, 0 replies; 16+ messages in thread
From: Carlos Maiolino @ 2025-06-26 9:30 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Jun 26, 2025 at 08:48:54AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
>
> [...]
>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH 1/7] xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock
2025-06-25 22:48 ` [PATCH 1/7] xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock Dave Chinner
2025-06-26 9:30 ` Carlos Maiolino
@ 2025-07-03 7:27 ` Carlos Maiolino
1 sibling, 0 replies; 16+ messages in thread
From: Carlos Maiolino @ 2025-07-03 7:27 UTC (permalink / raw)
To: linux-xfs, Dave Chinner
On Thu, 26 Jun 2025 08:48:54 +1000, Dave Chinner wrote:
> Lock order of xfs_ifree_cluster() is cluster buffer -> try ILOCK
> -> IFLUSHING, except for the last inode in the cluster that is
> triggering the free. In that case, the lock order is ILOCK ->
> cluster buffer -> IFLUSHING.
>
> xfs_iflush_cluster() uses cluster buffer -> try ILOCK -> IFLUSHING,
> so this can safely run concurrently with xfs_ifree_cluster().
>
> [...]
Applied to for-next, thanks!
[1/7] xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock
commit: 09234a632be42573d9743ac5ff6773622d233ad0
[2/7] xfs: catch stale AGF/AGI metadata
commit: db6a2274162de615ff74b927d38942fe3134d298
[3/7] xfs: avoid dquot buffer pin deadlock
commit: d62016b1a2df24c8608fe83cd3ae8090412881b3
[4/7] xfs: add tracepoints for stale pinned inode state debug
commit: fc48627b9c22f4d18651ca72ba171952d7a26004
[5/7] xfs: rearrange code in xfs_buf_item.c
commit: d2fe5c4c8d25999862d615f616aea7befdd62799
[6/7] xfs: factor out stale buffer item completion
commit: 816c330b605c3f4813c0dc0ab5af5cce17ff06b3
[7/7] xfs: fix unmount hang with unflushable inodes stuck in the AIL
commit: 7b5f775be14ac1532c049022feadcfe44769566d
Best regards,
--
Carlos Maiolino <cem@kernel.org>
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 2/7] xfs: catch stale AGF/AGI metadata
2025-06-25 22:48 Dave Chinner
2025-06-25 22:48 ` [PATCH 1/7] xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock Dave Chinner
@ 2025-06-25 22:48 ` Dave Chinner
2025-06-26 10:09 ` Carlos Maiolino
2025-06-25 22:48 ` [PATCH 3/7] xfs: avoid dquot buffer pin deadlock Dave Chinner
` (4 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2025-06-25 22:48 UTC (permalink / raw)
To: linux-xfs
From: Dave Chinner <dchinner@redhat.com>
There is a race condition that can trigger in dmflakey fstests that
can result in asserts in xfs_ialloc_read_agi() and
xfs_alloc_read_agf() firing. The asserts look like this:
XFS: Assertion failed: pag->pagf_freeblks == be32_to_cpu(agf->agf_freeblks), file: fs/xfs/libxfs/xfs_alloc.c, line: 3440
.....
Call Trace:
<TASK>
xfs_alloc_read_agf+0x2ad/0x3a0
xfs_alloc_fix_freelist+0x280/0x720
xfs_alloc_vextent_prepare_ag+0x42/0x120
xfs_alloc_vextent_iterate_ags+0x67/0x260
xfs_alloc_vextent_start_ag+0xe4/0x1c0
xfs_bmapi_allocate+0x6fe/0xc90
xfs_bmapi_convert_delalloc+0x338/0x560
xfs_map_blocks+0x354/0x580
iomap_writepages+0x52b/0xa70
xfs_vm_writepages+0xd7/0x100
do_writepages+0xe1/0x2c0
__writeback_single_inode+0x44/0x340
writeback_sb_inodes+0x2d0/0x570
__writeback_inodes_wb+0x9c/0xf0
wb_writeback+0x139/0x2d0
wb_workfn+0x23e/0x4c0
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30
I've seen the AGI variant from scrub running on the filesystem
after unmount failed due to systemd interference:
XFS: Assertion failed: pag->pagi_freecount == be32_to_cpu(agi->agi_freecount) || xfs_is_shutdown(pag->pag_mount), file: fs/xfs/libxfs/xfs_ialloc.c, line: 2804
.....
Call Trace:
<TASK>
xfs_ialloc_read_agi+0xee/0x150
xchk_perag_drain_and_lock+0x7d/0x240
xchk_ag_init+0x34/0x90
xchk_inode_xref+0x7b/0x220
xchk_inode+0x14d/0x180
xfs_scrub_metadata+0x2e2/0x510
xfs_ioc_scrub_metadata+0x62/0xb0
xfs_file_ioctl+0x446/0xbf0
__se_sys_ioctl+0x6f/0xc0
__x64_sys_ioctl+0x1d/0x30
x64_sys_call+0x1879/0x2ee0
do_syscall_64+0x68/0x130
? exc_page_fault+0x62/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Essentially, it is the same problem. When _flakey_drop_and_remount()
loads the drop-writes table, it makes all writes silently fail. The
writes are reported to the fs as completed successfully, but they are not
issued to the backing store. The filesystem sees the successful
write completion and marks the metadata buffer clean and removes it
from the AIL.
If this happens at the same time as memory pressure is occurring,
the now-clean AGF and/or AGI buffers can be reclaimed from memory.
Shortly afterwards, but before _flakey_drop_and_remount() runs
unmount, background writeback is kicked and it tries to allocate
blocks for the dirty pages in memory. This then tries to access the
AGF buffer we just turfed out of memory. It's not found, so it gets
read in from disk.
This is all fine, except for the fact that the last writeback of the
AGF did not actually reach disk. The AGF on disk is stale compared
to the in-memory state held by the perag, and so they don't match
and the assert fires.
Then other operations on that inode hang because the task was killed
whilst holding inode locks, e.g.:
Workqueue: xfs-conv/dm-12 xfs_end_io
Call Trace:
<TASK>
__schedule+0x650/0xb10
schedule+0x6d/0xf0
schedule_preempt_disabled+0x15/0x30
rwsem_down_write_slowpath+0x31a/0x5f0
down_write+0x43/0x60
xfs_ilock+0x1a8/0x210
xfs_trans_alloc_inode+0x9c/0x240
xfs_iomap_write_unwritten+0xe3/0x300
xfs_end_ioend+0x90/0x130
xfs_end_io+0xce/0x100
process_scheduled_works+0x1d4/0x400
worker_thread+0x234/0x2e0
kthread+0x147/0x170
ret_from_fork+0x3e/0x50
ret_from_fork_asm+0x1a/0x30
</TASK>
and it's all downhill from there.
Memory pressure is one way to trigger this, another is to run "echo
3 > /proc/sys/vm/drop_caches" randomly while tests are running.
Regardless of how it is triggered, this effectively takes down the
system once umount hangs because it's holding a sb->s_umount lock
exclusive and now every sync(1) call gets stuck on it.
Fix this by replacing the asserts with a corruption detection check
and a shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/libxfs/xfs_alloc.c | 41 ++++++++++++++++++++++++++++++--------
fs/xfs/libxfs/xfs_ialloc.c | 31 ++++++++++++++++++++++++----
2 files changed, 60 insertions(+), 12 deletions(-)
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 7839efe050bf..000cc7f4a3ce 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -3444,16 +3444,41 @@ xfs_alloc_read_agf(
set_bit(XFS_AGSTATE_AGF_INIT, &pag->pag_opstate);
}
+
#ifdef DEBUG
- else if (!xfs_is_shutdown(mp)) {
- ASSERT(pag->pagf_freeblks == be32_to_cpu(agf->agf_freeblks));
- ASSERT(pag->pagf_btreeblks == be32_to_cpu(agf->agf_btreeblks));
- ASSERT(pag->pagf_flcount == be32_to_cpu(agf->agf_flcount));
- ASSERT(pag->pagf_longest == be32_to_cpu(agf->agf_longest));
- ASSERT(pag->pagf_bno_level == be32_to_cpu(agf->agf_bno_level));
- ASSERT(pag->pagf_cnt_level == be32_to_cpu(agf->agf_cnt_level));
+ /*
+ * It's possible for the AGF to be out of sync if the block device is
+ * silently dropping writes. This can happen in fstests with dmflakey
+ * enabled, which allows the buffer to be cleaned and reclaimed by
+ * memory pressure and then re-read from disk here. We will get a
+ * stale version of the AGF from disk, and nothing good can happen from
+ * here. Hence if we detect this situation, immediately shut down the
+ * filesystem.
+ *
+ * This can also happen if we are already in the middle of a forced
+ * shutdown, so don't bother checking if we are already shut down.
+ */
+ if (!xfs_is_shutdown(pag_mount(pag))) {
+ bool ok = true;
+
+ ok &= pag->pagf_freeblks == be32_to_cpu(agf->agf_freeblks);
+ ok &= pag->pagf_btreeblks == be32_to_cpu(agf->agf_btreeblks);
+ ok &= pag->pagf_flcount == be32_to_cpu(agf->agf_flcount);
+ ok &= pag->pagf_longest == be32_to_cpu(agf->agf_longest);
+ ok &= pag->pagf_bno_level == be32_to_cpu(agf->agf_bno_level);
+ ok &= pag->pagf_cnt_level == be32_to_cpu(agf->agf_cnt_level);
+
+ if (XFS_IS_CORRUPT(pag_mount(pag), !ok)) {
+ xfs_ag_mark_sick(pag, XFS_SICK_AG_AGF);
+ xfs_trans_brelse(tp, agfbp);
+ xfs_force_shutdown(pag_mount(pag),
+ SHUTDOWN_CORRUPT_ONDISK);
+ return -EFSCORRUPTED;
+ }
}
-#endif
+#endif /* DEBUG */
+
if (agfbpp)
*agfbpp = agfbp;
else
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 0c47b5c6ca7d..750111634d9f 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2801,12 +2801,35 @@ xfs_ialloc_read_agi(
set_bit(XFS_AGSTATE_AGI_INIT, &pag->pag_opstate);
}
+#ifdef DEBUG
/*
- * It's possible for these to be out of sync if
- * we are in the middle of a forced shutdown.
+ * It's possible for the AGI to be out of sync if the block device is
+ * silently dropping writes. This can happen in fstests with dmflakey
+ * enabled, which allows the buffer to be cleaned and reclaimed by
+ * memory pressure and then re-read from disk here. We will get a
+ * stale version of the AGI from disk, and nothing good can happen from
+ * here. Hence if we detect this situation, immediately shut down the
+ * filesystem.
+ *
+ * This can also happen if we are already in the middle of a forced
+ * shutdown, so don't bother checking if we are already shut down.
*/
- ASSERT(pag->pagi_freecount == be32_to_cpu(agi->agi_freecount) ||
- xfs_is_shutdown(pag_mount(pag)));
+ if (!xfs_is_shutdown(pag_mount(pag))) {
+ bool ok = true;
+
+ ok &= pag->pagi_freecount == be32_to_cpu(agi->agi_freecount);
+ ok &= pag->pagi_count == be32_to_cpu(agi->agi_count);
+
+ if (XFS_IS_CORRUPT(pag_mount(pag), !ok)) {
+ xfs_ag_mark_sick(pag, XFS_SICK_AG_AGI);
+ xfs_trans_brelse(tp, agibp);
+ xfs_force_shutdown(pag_mount(pag),
+ SHUTDOWN_CORRUPT_ONDISK);
+ return -EFSCORRUPTED;
+ }
+ }
+#endif /* DEBUG */
+
if (agibpp)
*agibpp = agibp;
else
--
2.45.2
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 2/7] xfs: catch stale AGF/AGI metadata
2025-06-25 22:48 ` [PATCH 2/7] xfs: catch stale AGF/AGI metadata Dave Chinner
@ 2025-06-26 10:09 ` Carlos Maiolino
0 siblings, 0 replies; 16+ messages in thread
From: Carlos Maiolino @ 2025-06-26 10:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Jun 26, 2025 at 08:48:55AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> There is a race condition that can trigger in dmflakey fstests that
> can result in asserts in xfs_ialloc_read_agi() and
> xfs_alloc_read_agf() firing. The asserts look like this:
>
> [...]
> Essentially, it is the same problem. When _flakey_drop_and_remount()
> loads the drop-writes table, it makes all writes silently fail. The
> as reported to the fs as completed successfully, but they are not
I'll fix this typo with "The writes are reported to the fs..." during commit, if
that's ok with you.
> issued to the backing store. The filesystem sees the successful
> write completion and marks the metadata buffer clean and removes it
> from the AIL.
>
> [...]
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Out of curiosity: the AGF should already be initialized most of the time (same
for the pags), so wouldn't it be worthwhile to have those checks in production
rather than only under DEBUG?
Carlos
^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH 3/7] xfs: avoid dquot buffer pin deadlock
2025-06-25 22:48 Dave Chinner
2025-06-25 22:48 ` [PATCH 1/7] xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock Dave Chinner
2025-06-25 22:48 ` [PATCH 2/7] xfs: catch stale AGF/AGI metadata Dave Chinner
@ 2025-06-25 22:48 ` Dave Chinner
2025-06-26 11:29 ` Carlos Maiolino
2025-06-25 22:48 ` [PATCH 4/7] xfs: add tracepoints for stale pinned inode state debug Dave Chinner
` (3 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2025-06-25 22:48 UTC (permalink / raw)
To: linux-xfs
From: Dave Chinner <dchinner@redhat.com>
On shutdown when quotas are enabled, the shutdown can deadlock
trying to unpin the dquot buffer buf_log_item like so:
[ 3319.483590] task:kworker/20:0H state:D stack:14360 pid:1962230 tgid:1962230 ppid:2 task_flags:0x4208060 flags:0x00004000
[ 3319.493966] Workqueue: xfs-log/dm-6 xlog_ioend_work
[ 3319.498458] Call Trace:
[ 3319.500800] <TASK>
[ 3319.502809] __schedule+0x699/0xb70
[ 3319.512672] schedule+0x64/0xd0
[ 3319.515573] schedule_timeout+0x30/0xf0
[ 3319.528125] __down_common+0xc3/0x200
[ 3319.531488] __down+0x1d/0x30
[ 3319.534186] down+0x48/0x50
[ 3319.540501] xfs_buf_lock+0x3d/0xe0
[ 3319.543609] xfs_buf_item_unpin+0x85/0x1b0
[ 3319.547248] xlog_cil_committed+0x289/0x570
[ 3319.571411] xlog_cil_process_committed+0x6d/0x90
[ 3319.575590] xlog_state_shutdown_callbacks+0x52/0x110
[ 3319.580017] xlog_force_shutdown+0x169/0x1a0
[ 3319.583780] xlog_ioend_work+0x7c/0xb0
[ 3319.587049] process_scheduled_works+0x1d6/0x400
[ 3319.591127] worker_thread+0x202/0x2e0
[ 3319.594452] kthread+0x20c/0x240
The CIL push has seen the deadlock, so it has aborted the push and
is running CIL checkpoint completion to abort all the items in the
checkpoint. This calls ->iop_unpin(remove = true) to clean up the
log items in the checkpoint.
When a buffer log item is unpinned like this, it needs to lock the
buffer to run IO completion to correctly fail the buffer and run all
the required completions to fail attached log items as well. In this
case, the attempt to lock the buffer on unpin is hanging because the
buffer is already locked.
I suspected a leaked XFS_BLI_HOLD state because of the XFS_BLI_STALE
handling changes I was testing, so I went looking for pin events on
HOLD buffers and unpin events on locked buffers. That isolated this
one buffer with these two events:
xfs_buf_item_pin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 2 pincount 0 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags HOLD|DIRTY|LOGGED liflags DIRTY
....
xfs_buf_item_unpin: dev 251:6 daddr 0xa910 bbcount 0x2 hold 4 pincount 1 lock 0 flags DONE|KMEM recur 0 refcount 1 bliflags DIRTY liflags ABORTED
Firstly, bbcount = 0x2, which means it is not a single sector
structure. That rules out every xfs_trans_bhold() case except one:
dquot buffers.
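The deduction from the trace can be spelled out numerically. This is a
hypothetical helper for illustration only; the 512-byte basic block
size is the unit XFS uses for daddr/bbcount accounting:

```c
#include <assert.h>

/*
 * XFS measures buffer length in 512-byte "basic blocks" (bbcount).
 * A single-sector structure on this device would be bbcount 0x1
 * (512 bytes); the traced buffer had bbcount 0x2, i.e. 1024 bytes,
 * which matches a dquot chunk on a 1kB FSB filesystem rather than
 * any single-sector header.
 */
#define BBSIZE 512

static int bbcount_to_bytes(int bbcount)
{
	return bbcount * BBSIZE;
}
```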
Then hung task dumping gave this trace:
[ 3197.312078] task:fsync-tester state:D stack:12080 pid:2051125 tgid:2051125 ppid:1643233 task_flags:0x400000 flags:0x00004002
[ 3197.323007] Call Trace:
[ 3197.325581] <TASK>
[ 3197.327727] __schedule+0x699/0xb70
[ 3197.334582] schedule+0x64/0xd0
[ 3197.337672] schedule_timeout+0x30/0xf0
[ 3197.350139] wait_for_completion+0xbd/0x180
[ 3197.354235] __flush_workqueue+0xef/0x4e0
[ 3197.362229] xlog_cil_force_seq+0xa0/0x300
[ 3197.374447] xfs_log_force+0x77/0x230
[ 3197.378015] xfs_qm_dqunpin_wait+0x49/0xf0
[ 3197.382010] xfs_qm_dqflush+0x55/0x460
[ 3197.385663] xfs_qm_dquot_isolate+0x29e/0x4d0
[ 3197.389977] __list_lru_walk_one+0x141/0x220
[ 3197.398867] list_lru_walk_one+0x10/0x20
[ 3197.402713] xfs_qm_shrink_scan+0x6a/0x100
[ 3197.406699] do_shrink_slab+0x18a/0x350
[ 3197.410512] shrink_slab+0xf7/0x430
[ 3197.413967] drop_slab+0x97/0xf0
[ 3197.417121] drop_caches_sysctl_handler+0x59/0xc0
[ 3197.421654] proc_sys_call_handler+0x18b/0x280
[ 3197.426050] proc_sys_write+0x13/0x20
[ 3197.429750] vfs_write+0x2b8/0x3e0
[ 3197.438532] ksys_write+0x7e/0xf0
[ 3197.441742] __x64_sys_write+0x1b/0x30
[ 3197.445363] x64_sys_call+0x2c72/0x2f60
[ 3197.449044] do_syscall_64+0x6c/0x140
[ 3197.456341] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Yup, another test run by check-parallel is running drop_caches
concurrently and the dquot shrinker for the hung filesystem is
running. That's trying to flush a dirty dquot from reclaim context,
and it is waiting on a log force to complete. xfs_qm_dqflush() is called
with the dquot buffer held locked, and so we've called
xfs_log_force() with that buffer locked.
Now the log force is waiting for a workqueue flush to complete, and
that workqueue flush is waiting on CIL checkpoint processing to
finish.
The CIL checkpoint processing is aborting all the log items it has,
and that requires locking aborted buffers to cancel them.
Now, normally this isn't a problem if we are issuing a log force
to unpin an object, because the ->iop_unpin() method wakes pin
waiters first. That results in the pin waiter finishing off whatever
it was doing, dropping the lock and then xfs_buf_item_unpin() can
lock the buffer and fail it.
However, xfs_qm_dqflush() is waiting on the -dquot- unpin event, not
the dquot buffer unpin event, and so it never gets woken and so does
not drop the buffer lock.
Inodes do not have this problem, as they can only be written from
one spot (->iop_push) whilst dquots can be written from multiple
places (memory reclaim, ->iop_push, xfs_qm_dqpurge, and quotacheck).
The reason that the dquot buffer has an attached buffer log item is
that it has been recently allocated. Initialisation of the dquot
buffer logs the buffer directly, thereby pinning it in memory. We
then modify the dquot in a separate operation; if memory reclaim
races with a shutdown at that point, we trigger this deadlock.
check-parallel reproduces this reliably on 1kB FSB filesystems with
quota enabled because it does all of these things concurrently
without having to explicitly write tests to exercise these corner
case conditions.
xfs_qm_dquot_logitem_push() doesn't have this deadlock because it
checks whether the dquot is pinned before locking the dquot buffer,
and skips the dquot if it is pinned. This means the xfs_qm_dqunpin_wait()
log force in xfs_qm_dqflush() never triggers and we unlock the
buffer safely allowing a concurrent shutdown to fail the buffer
appropriately.
xfs_qm_dqpurge() could have this problem as it is called from
quotacheck and we might have allocated dquot buffers when recording
the quota updates. This can be fixed by calling
xfs_qm_dqunpin_wait() before we lock the dquot buffer. Because we
hold the dquot locked, nothing will be able to add to the pin count
between the unpin_wait and the dqflush callout, so this now makes
xfs_qm_dqpurge() safe against this race.
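The ordering the fix relies on can be sketched in userspace. This is a
hypothetical model, not the kernel code; the real helpers
(xfs_qm_dqunpin_wait(), xfs_dqflock()) sleep on a log force and a
completion rather than spinning:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a dquot's pin/flush state. */
struct dquot_model {
	atomic_int pincount;    /* outstanding log pins */
	atomic_bool flush_lock; /* held while the dquot is under writeback */
};

/* Model of xfs_qm_dqunpin_wait(): the kernel sleeps on a log force
 * here rather than spinning. */
static void dqunpin_wait(struct dquot_model *dq)
{
	while (atomic_load(&dq->pincount) > 0)
		;
}

/*
 * The fixed xfs_qm_dqpurge() ordering: drain the pin count before
 * taking the flush lock. Because the caller holds the dquot lock
 * across both steps, and pinning requires the dquot lock, the pin
 * count cannot rise in between, so the subsequent flush never has
 * to issue a log force with the dquot buffer locked.
 */
static bool purge_order(struct dquot_model *dq)
{
	dqunpin_wait(dq);
	return !atomic_exchange(&dq->flush_lock, true);
}
```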
xfs_qm_dquot_isolate() can also be fixed this same way but, quite
frankly, we shouldn't be doing IO in memory reclaim context. If the
dquot is pinned or dirty, simply rotate it and let memory reclaim
come back to it later, same as we do for inodes.
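The new reclaim behaviour boils down to a pure decision function. The
names here are hypothetical; the kernel version is the enum lru_status
logic in xfs_qm_dquot_isolate() shown in the diff below:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's enum lru_status. */
enum lru_status_sketch { LRU_REMOVED, LRU_ROTATE, LRU_SKIP };

/*
 * Decide what reclaim should do with a dquot without ever doing IO:
 * busy or freeing dquots are skipped, pinned or dirty dquots are
 * rotated to the LRU tail so writeback gets time to clean them, and
 * only clean, unpinned dquots are reclaimed.
 */
static enum lru_status_sketch isolate_decision(bool locked, bool freeing,
					       bool dirty, int pincount)
{
	if (!locked || freeing)
		return LRU_SKIP;
	if (dirty || pincount > 0)
		return LRU_ROTATE;
	return LRU_REMOVED;
}
```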
This then gets rid of the nasty issue in xfs_qm_flush_one() where
quotacheck writeback races with memory reclaim flushing the dquots.
We can lift xfs_qm_dqunpin_wait() up into this code and get rid of
the "can't get the dqflush lock" buffer write that cycled the dqflush
lock to enable the dquot to be flushed again. Instead, we check if
the dquot is pinned and return -EAGAIN so that the dquot walk will
revisit the dquot again later.
Finally, with xfs_qm_dqunpin_wait() lifted into all the callers,
we can remove it from the xfs_qm_dqflush() code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_buf.c | 38 --------------------
fs/xfs/xfs_buf.h | 1 -
fs/xfs/xfs_dquot.c | 4 +--
fs/xfs/xfs_qm.c | 86 ++++++++++------------------------------------
fs/xfs/xfs_trace.h | 1 -
5 files changed, 20 insertions(+), 110 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8af83bd161f9..ba5bd6031ece 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2082,44 +2082,6 @@ xfs_buf_delwri_submit(
return error;
}
-/*
- * Push a single buffer on a delwri queue.
- *
- * The purpose of this function is to submit a single buffer of a delwri queue
- * and return with the buffer still on the original queue.
- *
- * The buffer locking and queue management logic between _delwri_pushbuf() and
- * _delwri_queue() guarantee that the buffer cannot be queued to another list
- * before returning.
- */
-int
-xfs_buf_delwri_pushbuf(
- struct xfs_buf *bp,
- struct list_head *buffer_list)
-{
- int error;
-
- ASSERT(bp->b_flags & _XBF_DELWRI_Q);
-
- trace_xfs_buf_delwri_pushbuf(bp, _RET_IP_);
-
- xfs_buf_lock(bp);
- bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_ASYNC);
- bp->b_flags |= XBF_WRITE;
- xfs_buf_submit(bp);
-
- /*
- * The buffer is now locked, under I/O but still on the original delwri
- * queue. Wait for I/O completion, restore the DELWRI_Q flag and
- * return with the buffer unlocked and still on the original queue.
- */
- error = xfs_buf_iowait(bp);
- bp->b_flags |= _XBF_DELWRI_Q;
- xfs_buf_unlock(bp);
-
- return error;
-}
-
void xfs_buf_set_ref(struct xfs_buf *bp, int lru_ref)
{
/*
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 9d2ab567cf81..15fc56948346 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -326,7 +326,6 @@ extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
void xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *bl);
extern int xfs_buf_delwri_submit(struct list_head *);
extern int xfs_buf_delwri_submit_nowait(struct list_head *);
-extern int xfs_buf_delwri_pushbuf(struct xfs_buf *, struct list_head *);
static inline xfs_daddr_t xfs_buf_daddr(struct xfs_buf *bp)
{
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index b4e32f0860b7..0bd8022e47b4 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1398,11 +1398,9 @@ xfs_qm_dqflush(
ASSERT(XFS_DQ_IS_LOCKED(dqp));
ASSERT(!completion_done(&dqp->q_flush));
+ ASSERT(atomic_read(&dqp->q_pincount) == 0);
trace_xfs_dqflush(dqp);
-
- xfs_qm_dqunpin_wait(dqp);
-
fa = xfs_qm_dqflush_check(dqp);
if (fa) {
xfs_alert(mp, "corrupt dquot ID 0x%x in memory at %pS",
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 417439b58785..fa135ac26471 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -134,6 +134,7 @@ xfs_qm_dqpurge(
dqp->q_flags |= XFS_DQFLAG_FREEING;
+ xfs_qm_dqunpin_wait(dqp);
xfs_dqflock(dqp);
/*
@@ -465,6 +466,7 @@ xfs_qm_dquot_isolate(
struct xfs_dquot *dqp = container_of(item,
struct xfs_dquot, q_lru);
struct xfs_qm_isolate *isol = arg;
+ enum lru_status ret = LRU_SKIP;
if (!xfs_dqlock_nowait(dqp))
goto out_miss_busy;
@@ -477,6 +479,16 @@ xfs_qm_dquot_isolate(
if (dqp->q_flags & XFS_DQFLAG_FREEING)
goto out_miss_unlock;
+ /*
+ * If the dquot is pinned or dirty, rotate it to the end of the LRU to
+ * give some time for it to be cleaned before we try to isolate it
+ * again.
+ */
+ ret = LRU_ROTATE;
+ if (XFS_DQ_IS_DIRTY(dqp) || atomic_read(&dqp->q_pincount) > 0) {
+ goto out_miss_unlock;
+ }
+
/*
* This dquot has acquired a reference in the meantime remove it from
* the freelist and try again.
@@ -492,41 +504,14 @@ xfs_qm_dquot_isolate(
}
/*
- * If the dquot is dirty, flush it. If it's already being flushed, just
- * skip it so there is time for the IO to complete before we try to
- * reclaim it again on the next LRU pass.
+ * The dquot may still be under IO, in which case the flush lock will be
+ * held. If we can't get the flush lock now, just skip over the dquot as
+ * if it was dirty.
*/
if (!xfs_dqflock_nowait(dqp))
goto out_miss_unlock;
- if (XFS_DQ_IS_DIRTY(dqp)) {
- struct xfs_buf *bp = NULL;
- int error;
-
- trace_xfs_dqreclaim_dirty(dqp);
-
- /* we have to drop the LRU lock to flush the dquot */
- spin_unlock(&lru->lock);
-
- error = xfs_dquot_use_attached_buf(dqp, &bp);
- if (!bp || error == -EAGAIN) {
- xfs_dqfunlock(dqp);
- goto out_unlock_dirty;
- }
-
- /*
- * dqflush completes dqflock on error, and the delwri ioend
- * does it on success.
- */
- error = xfs_qm_dqflush(dqp, bp);
- if (error)
- goto out_unlock_dirty;
-
- xfs_buf_delwri_queue(bp, &isol->buffers);
- xfs_buf_relse(bp);
- goto out_unlock_dirty;
- }
-
+ ASSERT(!XFS_DQ_IS_DIRTY(dqp));
xfs_dquot_detach_buf(dqp);
xfs_dqfunlock(dqp);
@@ -548,13 +533,7 @@ xfs_qm_dquot_isolate(
out_miss_busy:
trace_xfs_dqreclaim_busy(dqp);
XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaim_misses);
- return LRU_SKIP;
-
-out_unlock_dirty:
- trace_xfs_dqreclaim_busy(dqp);
- XFS_STATS_INC(dqp->q_mount, xs_qm_dqreclaim_misses);
- xfs_dqunlock(dqp);
- return LRU_RETRY;
+ return ret;
}
static unsigned long
@@ -1486,7 +1465,6 @@ xfs_qm_flush_one(
struct xfs_dquot *dqp,
void *data)
{
- struct xfs_mount *mp = dqp->q_mount;
struct list_head *buffer_list = data;
struct xfs_buf *bp = NULL;
int error = 0;
@@ -1497,34 +1475,8 @@ xfs_qm_flush_one(
if (!XFS_DQ_IS_DIRTY(dqp))
goto out_unlock;
- /*
- * The only way the dquot is already flush locked by the time quotacheck
- * gets here is if reclaim flushed it before the dqadjust walk dirtied
- * it for the final time. Quotacheck collects all dquot bufs in the
- * local delwri queue before dquots are dirtied, so reclaim can't have
- * possibly queued it for I/O. The only way out is to push the buffer to
- * cycle the flush lock.
- */
- if (!xfs_dqflock_nowait(dqp)) {
- /* buf is pinned in-core by delwri list */
- error = xfs_buf_incore(mp->m_ddev_targp, dqp->q_blkno,
- mp->m_quotainfo->qi_dqchunklen, 0, &bp);
- if (error)
- goto out_unlock;
-
- if (!(bp->b_flags & _XBF_DELWRI_Q)) {
- error = -EAGAIN;
- xfs_buf_relse(bp);
- goto out_unlock;
- }
- xfs_buf_unlock(bp);
-
- xfs_buf_delwri_pushbuf(bp, buffer_list);
- xfs_buf_rele(bp);
-
- error = -EAGAIN;
- goto out_unlock;
- }
+ xfs_qm_dqunpin_wait(dqp);
+ xfs_dqflock(dqp);
error = xfs_dquot_use_attached_buf(dqp, &bp);
if (error)
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 01d284a1c759..9f0d6bc966b7 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -778,7 +778,6 @@ DEFINE_BUF_EVENT(xfs_buf_iowait_done);
DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
DEFINE_BUF_EVENT(xfs_buf_delwri_split);
-DEFINE_BUF_EVENT(xfs_buf_delwri_pushbuf);
DEFINE_BUF_EVENT(xfs_buf_get_uncached);
DEFINE_BUF_EVENT(xfs_buf_item_relse);
DEFINE_BUF_EVENT(xfs_buf_iodone_async);
--
2.45.2
* Re: [PATCH 3/7] xfs: avoid dquot buffer pin deadlock
2025-06-25 22:48 ` [PATCH 3/7] xfs: avoid dquot buffer pin deadlock Dave Chinner
@ 2025-06-26 11:29 ` Carlos Maiolino
0 siblings, 0 replies; 16+ messages in thread
From: Carlos Maiolino @ 2025-06-26 11:29 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Jun 26, 2025 at 08:48:56AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> On shutdown when quotas are enabled, the shutdown can deadlock
> trying to unpin the dquot buffer buf_log_item like so:
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
> [...]
* [PATCH 4/7] xfs: add tracepoints for stale pinned inode state debug
2025-06-25 22:48 Dave Chinner
` (2 preceding siblings ...)
2025-06-25 22:48 ` [PATCH 3/7] xfs: avoid dquot buffer pin deadlock Dave Chinner
@ 2025-06-25 22:48 ` Dave Chinner
2025-06-26 11:50 ` Carlos Maiolino
2025-06-25 22:48 ` [PATCH 5/7] xfs: rearrange code in xfs_buf_item.c Dave Chinner
` (2 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2025-06-25 22:48 UTC (permalink / raw)
To: linux-xfs
From: Dave Chinner <dchinner@redhat.com>
I needed more insight into how stale inodes were getting stuck on
the AIL after a forced shutdown when running fsstress. These are the
tracepoints I added for that purpose.
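As a userspace analogue (hypothetical names and simplified return
codes, not the kernel API), the instrumented push path amounts to
tagging every PINNED return with a trace event so stuck AIL items
become visible:

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's item push return codes. */
enum push_result { ITEM_SUCCESS, ITEM_PINNED, ITEM_FLUSHING };

static int pinned_trace_hits; /* counts trace event firings */

/* Analogue of trace_xfs_inode_push_pinned(): record why a push bailed. */
static void trace_push_pinned(void)
{
	pinned_trace_hits++;
}

/* Simplified mirror of the patched xfs_inode_item_push() logic:
 * emit the trace event on the pinned return path. */
static enum push_result inode_item_push(int ipincount, int buf_pinned,
					int flushing)
{
	if (ipincount > 0 || buf_pinned) {
		trace_push_pinned();
		return ITEM_PINNED;
	}
	if (flushing)
		return ITEM_FLUSHING;
	return ITEM_SUCCESS;
}
```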
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_inode_item.c | 5 ++++-
fs/xfs/xfs_log_cil.c | 4 +++-
fs/xfs/xfs_trace.h | 9 ++++++++-
fs/xfs/xfs_trans.c | 4 +++-
4 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index c6cb0b6b9e46..285e27ff89e2 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -758,11 +758,14 @@ xfs_inode_item_push(
* completed and items removed from the AIL before the next push
* attempt.
*/
+ trace_xfs_inode_push_stale(ip, _RET_IP_);
return XFS_ITEM_PINNED;
}
- if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp))
+ if (xfs_ipincount(ip) > 0 || xfs_buf_ispinned(bp)) {
+ trace_xfs_inode_push_pinned(ip, _RET_IP_);
return XFS_ITEM_PINNED;
+ }
if (xfs_iflags_test(ip, XFS_IFLUSHING))
return XFS_ITEM_FLUSHING;
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index f66d2d430e4f..a80cb6b9969a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -793,8 +793,10 @@ xlog_cil_ail_insert(
struct xfs_log_item *lip = lv->lv_item;
xfs_lsn_t item_lsn;
- if (aborted)
+ if (aborted) {
+ trace_xlog_ail_insert_abort(lip);
set_bit(XFS_LI_ABORTED, &lip->li_flags);
+ }
if (lip->li_ops->flags & XFS_ITEM_RELEASE_WHEN_COMMITTED) {
lip->li_ops->iop_release(lip);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 9f0d6bc966b7..ba45d801df1c 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1146,6 +1146,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
__field(xfs_ino_t, ino)
__field(int, count)
__field(int, pincount)
+ __field(unsigned long, iflags)
__field(unsigned long, caller_ip)
),
TP_fast_assign(
@@ -1153,13 +1154,15 @@ DECLARE_EVENT_CLASS(xfs_iref_class,
__entry->ino = ip->i_ino;
__entry->count = atomic_read(&VFS_I(ip)->i_count);
__entry->pincount = atomic_read(&ip->i_pincount);
+ __entry->iflags = ip->i_flags;
__entry->caller_ip = caller_ip;
),
- TP_printk("dev %d:%d ino 0x%llx count %d pincount %d caller %pS",
+ TP_printk("dev %d:%d ino 0x%llx count %d pincount %d iflags 0x%lx caller %pS",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino,
__entry->count,
__entry->pincount,
+ __entry->iflags,
(char *)__entry->caller_ip)
)
@@ -1249,6 +1252,8 @@ DEFINE_IREF_EVENT(xfs_irele);
DEFINE_IREF_EVENT(xfs_inode_pin);
DEFINE_IREF_EVENT(xfs_inode_unpin);
DEFINE_IREF_EVENT(xfs_inode_unpin_nowait);
+DEFINE_IREF_EVENT(xfs_inode_push_pinned);
+DEFINE_IREF_EVENT(xfs_inode_push_stale);
DECLARE_EVENT_CLASS(xfs_namespace_class,
TP_PROTO(struct xfs_inode *dp, const struct xfs_name *name),
@@ -1653,6 +1658,8 @@ DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_mark);
DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_skip);
DEFINE_LOG_ITEM_EVENT(xfs_cil_whiteout_unpin);
+DEFINE_LOG_ITEM_EVENT(xlog_ail_insert_abort);
+DEFINE_LOG_ITEM_EVENT(xfs_trans_free_abort);
DECLARE_EVENT_CLASS(xfs_ail_class,
TP_PROTO(struct xfs_log_item *lip, xfs_lsn_t old_lsn, xfs_lsn_t new_lsn),
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index c6657072361a..b4a07af513ba 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -742,8 +742,10 @@ xfs_trans_free_items(
list_for_each_entry_safe(lip, next, &tp->t_items, li_trans) {
xfs_trans_del_item(lip);
- if (abort)
+ if (abort) {
+ trace_xfs_trans_free_abort(lip);
set_bit(XFS_LI_ABORTED, &lip->li_flags);
+ }
if (lip->li_ops->iop_release)
lip->li_ops->iop_release(lip);
}
--
2.45.2
* Re: [PATCH 4/7] xfs: add tracepoints for stale pinned inode state debug
2025-06-25 22:48 ` [PATCH 4/7] xfs: add tracepoints for stale pinned inode state debug Dave Chinner
@ 2025-06-26 11:50 ` Carlos Maiolino
0 siblings, 0 replies; 16+ messages in thread
From: Carlos Maiolino @ 2025-06-26 11:50 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Jun 26, 2025 at 08:48:57AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> I needed more insight into how stale inodes were getting stuck on
> the AIL after a forced shutdown when running fsstress. These are the
> tracepoints I added for that purpose.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
I actually used something similar recently (although temporarily, with
trace_printk), so I can see how useful those can be.
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
* [PATCH 5/7] xfs: rearrange code in xfs_buf_item.c
2025-06-25 22:48 Dave Chinner
` (3 preceding siblings ...)
2025-06-25 22:48 ` [PATCH 4/7] xfs: add tracepoints for stale pinned inode state debug Dave Chinner
@ 2025-06-25 22:48 ` Dave Chinner
2025-06-26 11:57 ` Carlos Maiolino
2025-06-25 22:48 ` [PATCH 6/7] xfs: factor out stale buffer item completion Dave Chinner
2025-06-25 22:49 ` [PATCH 7/7] xfs: fix unmount hang with unflushable inodes stuck in the AIL Dave Chinner
6 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2025-06-25 22:48 UTC (permalink / raw)
To: linux-xfs
From: Dave Chinner <dchinner@redhat.com>
The code to initialise, release and free items is all the way down
the bottom of the file. Upcoming fixes need to use these functions
earlier in the file, so move them to the top.
There is one code change in this move - the parameter to
xfs_buf_item_relse() is changed from the xfs_buf to the
xfs_buf_log_item - the thing that the function is releasing.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_buf_item.c | 116 +++++++++++++++++++++---------------------
fs/xfs/xfs_buf_item.h | 1 -
2 files changed, 58 insertions(+), 59 deletions(-)
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 90139e0f3271..3e3c0f65a25c 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -32,6 +32,61 @@ static inline struct xfs_buf_log_item *BUF_ITEM(struct xfs_log_item *lip)
return container_of(lip, struct xfs_buf_log_item, bli_item);
}
+static void
+xfs_buf_item_get_format(
+ struct xfs_buf_log_item *bip,
+ int count)
+{
+ ASSERT(bip->bli_formats == NULL);
+ bip->bli_format_count = count;
+
+ if (count == 1) {
+ bip->bli_formats = &bip->__bli_format;
+ return;
+ }
+
+ bip->bli_formats = kzalloc(count * sizeof(struct xfs_buf_log_format),
+ GFP_KERNEL | __GFP_NOFAIL);
+}
+
+static void
+xfs_buf_item_free_format(
+ struct xfs_buf_log_item *bip)
+{
+ if (bip->bli_formats != &bip->__bli_format) {
+ kfree(bip->bli_formats);
+ bip->bli_formats = NULL;
+ }
+}
+
+static void
+xfs_buf_item_free(
+ struct xfs_buf_log_item *bip)
+{
+ xfs_buf_item_free_format(bip);
+ kvfree(bip->bli_item.li_lv_shadow);
+ kmem_cache_free(xfs_buf_item_cache, bip);
+}
+
+/*
+ * xfs_buf_item_relse() is called when the buf log item is no longer needed.
+ */
+static void
+xfs_buf_item_relse(
+ struct xfs_buf_log_item *bip)
+{
+ struct xfs_buf *bp = bip->bli_buf;
+
+ trace_xfs_buf_item_relse(bp, _RET_IP_);
+
+ ASSERT(!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags));
+ ASSERT(atomic_read(&bip->bli_refcount) == 0);
+
+ bp->b_log_item = NULL;
+ xfs_buf_rele(bp);
+ xfs_buf_item_free(bip);
+}
+
/* Is this log iovec plausibly large enough to contain the buffer log format? */
bool
xfs_buf_log_check_iovec(
@@ -468,7 +523,7 @@ xfs_buf_item_unpin(
ASSERT(list_empty(&bp->b_li_list));
} else {
xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
- xfs_buf_item_relse(bp);
+ xfs_buf_item_relse(bip);
ASSERT(bp->b_log_item == NULL);
}
xfs_buf_relse(bp);
@@ -578,7 +633,7 @@ xfs_buf_item_put(
*/
if (aborted)
xfs_trans_ail_delete(lip, 0);
- xfs_buf_item_relse(bip->bli_buf);
+ xfs_buf_item_relse(bip);
return true;
}
@@ -729,33 +784,6 @@ static const struct xfs_item_ops xfs_buf_item_ops = {
.iop_push = xfs_buf_item_push,
};
-STATIC void
-xfs_buf_item_get_format(
- struct xfs_buf_log_item *bip,
- int count)
-{
- ASSERT(bip->bli_formats == NULL);
- bip->bli_format_count = count;
-
- if (count == 1) {
- bip->bli_formats = &bip->__bli_format;
- return;
- }
-
- bip->bli_formats = kzalloc(count * sizeof(struct xfs_buf_log_format),
- GFP_KERNEL | __GFP_NOFAIL);
-}
-
-STATIC void
-xfs_buf_item_free_format(
- struct xfs_buf_log_item *bip)
-{
- if (bip->bli_formats != &bip->__bli_format) {
- kfree(bip->bli_formats);
- bip->bli_formats = NULL;
- }
-}
-
/*
* Allocate a new buf log item to go with the given buffer.
* Set the buffer's b_log_item field to point to the new
@@ -976,34 +1004,6 @@ xfs_buf_item_dirty_format(
return false;
}
-STATIC void
-xfs_buf_item_free(
- struct xfs_buf_log_item *bip)
-{
- xfs_buf_item_free_format(bip);
- kvfree(bip->bli_item.li_lv_shadow);
- kmem_cache_free(xfs_buf_item_cache, bip);
-}
-
-/*
- * xfs_buf_item_relse() is called when the buf log item is no longer needed.
- */
-void
-xfs_buf_item_relse(
- struct xfs_buf *bp)
-{
- struct xfs_buf_log_item *bip = bp->b_log_item;
-
- trace_xfs_buf_item_relse(bp, _RET_IP_);
- ASSERT(!test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags));
-
- if (atomic_read(&bip->bli_refcount))
- return;
- bp->b_log_item = NULL;
- xfs_buf_rele(bp);
- xfs_buf_item_free(bip);
-}
-
void
xfs_buf_item_done(
struct xfs_buf *bp)
@@ -1023,5 +1023,5 @@ xfs_buf_item_done(
xfs_trans_ail_delete(&bp->b_log_item->bli_item,
(bp->b_flags & _XBF_LOGRECOVERY) ? 0 :
SHUTDOWN_CORRUPT_INCORE);
- xfs_buf_item_relse(bp);
+ xfs_buf_item_relse(bp->b_log_item);
}
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index e10e324cd245..50dd79b59cf5 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -49,7 +49,6 @@ struct xfs_buf_log_item {
int xfs_buf_item_init(struct xfs_buf *, struct xfs_mount *);
void xfs_buf_item_done(struct xfs_buf *bp);
-void xfs_buf_item_relse(struct xfs_buf *);
bool xfs_buf_item_put(struct xfs_buf_log_item *);
void xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
bool xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
--
2.45.2
* Re: [PATCH 5/7] xfs: rearrange code in xfs_buf_item.c
2025-06-25 22:48 ` [PATCH 5/7] xfs: rearrange code in xfs_buf_item.c Dave Chinner
@ 2025-06-26 11:57 ` Carlos Maiolino
0 siblings, 0 replies; 16+ messages in thread
From: Carlos Maiolino @ 2025-06-26 11:57 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Jun 26, 2025 at 08:48:58AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The code to initialise, release and free items is all the way down
> the bottom of the file. Upcoming fixes need to these functions
> earlier in the file, so move them to the top.
>
> There is one code change in this move - the parameter to
> xfs_buf_item_relse() is changed from the xfs_buf to the
> xfs_buf_log_item - the thing that the function is releasing.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
* [PATCH 6/7] xfs: factor out stale buffer item completion
2025-06-25 22:48 Dave Chinner
` (4 preceding siblings ...)
2025-06-25 22:48 ` [PATCH 5/7] xfs: rearrange code in xfs_buf_item.c Dave Chinner
@ 2025-06-25 22:48 ` Dave Chinner
2025-06-26 11:59 ` Carlos Maiolino
2025-06-25 22:49 ` [PATCH 7/7] xfs: fix unmount hang with unflushable inodes stuck in the AIL Dave Chinner
6 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2025-06-25 22:48 UTC (permalink / raw)
To: linux-xfs
From: Dave Chinner <dchinner@redhat.com>
The stale buffer item completion handling is currently only done
from BLI unpinning. We need to perform this function from where-ever
the last reference to the BLI is dropped, so first we need to
factor this code out into a helper.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_buf_item.c | 60 ++++++++++++++++++++++++++-----------------
1 file changed, 37 insertions(+), 23 deletions(-)
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 3e3c0f65a25c..c95826863c82 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -444,6 +444,42 @@ xfs_buf_item_pin(
atomic_inc(&bip->bli_buf->b_pin_count);
}
+/*
+ * For a stale BLI, process all the necessary completions that must be
+ * performed when the final BLI reference goes away. The buffer will be
+ * referenced and locked here - we return to the caller with the buffer still
+ * referenced and locked for them to finalise processing of the buffer.
+ */
+static void
+xfs_buf_item_finish_stale(
+ struct xfs_buf_log_item *bip)
+{
+ struct xfs_buf *bp = bip->bli_buf;
+ struct xfs_log_item *lip = &bip->bli_item;
+
+ ASSERT(bip->bli_flags & XFS_BLI_STALE);
+ ASSERT(xfs_buf_islocked(bp));
+ ASSERT(bp->b_flags & XBF_STALE);
+ ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
+ ASSERT(list_empty(&lip->li_trans));
+ ASSERT(!bp->b_transp);
+
+ if (bip->bli_flags & XFS_BLI_STALE_INODE) {
+ xfs_buf_item_done(bp);
+ xfs_buf_inode_iodone(bp);
+ ASSERT(list_empty(&bp->b_li_list));
+ return;
+ }
+
+ /*
+ * We may or may not be on the AIL here, xfs_trans_ail_delete() will do
+ * the right thing regardless of the situation in which we are called.
+ */
+ xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
+ xfs_buf_item_relse(bip);
+ ASSERT(bp->b_log_item == NULL);
+}
+
/*
* This is called to unpin the buffer associated with the buf log item which was
* previously pinned with a call to xfs_buf_item_pin(). We enter this function
@@ -493,13 +529,6 @@ xfs_buf_item_unpin(
}
if (stale) {
- ASSERT(bip->bli_flags & XFS_BLI_STALE);
- ASSERT(xfs_buf_islocked(bp));
- ASSERT(bp->b_flags & XBF_STALE);
- ASSERT(bip->__bli_format.blf_flags & XFS_BLF_CANCEL);
- ASSERT(list_empty(&lip->li_trans));
- ASSERT(!bp->b_transp);
-
trace_xfs_buf_item_unpin_stale(bip);
/*
@@ -510,22 +539,7 @@ xfs_buf_item_unpin(
* processing is complete.
*/
xfs_buf_rele(bp);
-
- /*
- * If we get called here because of an IO error, we may or may
- * not have the item on the AIL. xfs_trans_ail_delete() will
- * take care of that situation. xfs_trans_ail_delete() drops
- * the AIL lock.
- */
- if (bip->bli_flags & XFS_BLI_STALE_INODE) {
- xfs_buf_item_done(bp);
- xfs_buf_inode_iodone(bp);
- ASSERT(list_empty(&bp->b_li_list));
- } else {
- xfs_trans_ail_delete(lip, SHUTDOWN_LOG_IO_ERROR);
- xfs_buf_item_relse(bip);
- ASSERT(bp->b_log_item == NULL);
- }
+ xfs_buf_item_finish_stale(bip);
xfs_buf_relse(bp);
return;
}
--
2.45.2
* Re: [PATCH 6/7] xfs: factor out stale buffer item completion
2025-06-25 22:48 ` [PATCH 6/7] xfs: factor out stale buffer item completion Dave Chinner
@ 2025-06-26 11:59 ` Carlos Maiolino
0 siblings, 0 replies; 16+ messages in thread
From: Carlos Maiolino @ 2025-06-26 11:59 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Jun 26, 2025 at 08:48:59AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The stale buffer item completion handling is currently only done
> from BLI unpinning. We need to perform this function from where-ever
> the last reference to the BLI is dropped, so first we need to
> factor this code out into a helper.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
* [PATCH 7/7] xfs: fix unmount hang with unflushable inodes stuck in the AIL
2025-06-25 22:48 Dave Chinner
` (5 preceding siblings ...)
2025-06-25 22:48 ` [PATCH 6/7] xfs: factor out stale buffer item completion Dave Chinner
@ 2025-06-25 22:49 ` Dave Chinner
2025-06-27 10:53 ` Carlos Maiolino
6 siblings, 1 reply; 16+ messages in thread
From: Dave Chinner @ 2025-06-25 22:49 UTC (permalink / raw)
To: linux-xfs
From: Dave Chinner <dchinner@redhat.com>
Unmount of a shutdown filesystem can hang with stale inode cluster
buffers in the AIL like so:
[95964.140623] Call Trace:
[95964.144641] __schedule+0x699/0xb70
[95964.154003] schedule+0x64/0xd0
[95964.156851] xfs_ail_push_all_sync+0x9b/0xf0
[95964.164816] xfs_unmount_flush_inodes+0x41/0x70
[95964.168698] xfs_unmountfs+0x7f/0x170
[95964.171846] xfs_fs_put_super+0x3b/0x90
[95964.175216] generic_shutdown_super+0x77/0x160
[95964.178060] kill_block_super+0x1b/0x40
[95964.180553] xfs_kill_sb+0x12/0x30
[95964.182796] deactivate_locked_super+0x38/0x100
[95964.185735] deactivate_super+0x41/0x50
[95964.188245] cleanup_mnt+0x9f/0x160
[95964.190519] __cleanup_mnt+0x12/0x20
[95964.192899] task_work_run+0x89/0xb0
[95964.195221] resume_user_mode_work+0x4f/0x60
[95964.197931] syscall_exit_to_user_mode+0x76/0xb0
[95964.201003] do_syscall_64+0x74/0x130
$ pstree -N mnt |grep umount
|-check-parallel---nsexec---run_test.sh---753---umount
It always seems to be generic/753 that triggers this, and repeating
a quick group test run triggers it every 10-15 iterations. Hence it
generally triggers once every 30-40 minutes of test time. Just
running generic/753 by itself or concurrently with a limited group
of tests doesn't reproduce this issue at all.
Tracing on a hung system shows the AIL repeating every 50ms a log
force followed by an attempt to push pinned, aborted inodes from the
AIL (trimmed for brevity):
xfs_log_force: lsn 0x1c caller xfsaild+0x18e
xfs_log_force: lsn 0x0 caller xlog_cil_flush+0xbd
xfs_log_force: lsn 0x1c caller xfs_log_force+0x77
xfs_ail_pinned: lip 0xffff88826014afa0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88814000a708 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88810b850c80 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88810b850af0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff888165cf0a28 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
xfs_ail_pinned: lip 0xffff88810b850bb8 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
....
The inode log items are marked as aborted, which means that either:
a) a transaction commit has occurred, seen an error or shutdown, and
called xfs_trans_free_items() to abort the items. This should happen
before any pinning of log items occurs.
or
b) a dirty transaction has been cancelled. This should also happen
before any pinning of log items occurs.
or
c) AIL insertion at journal IO completion is marked as aborted. In
this case, the log item is pinned by the CIL until journal IO
completes and hence needs to be unpinned. This is then done after
the ->iop_committed() callback is run, so the pin count should be
balanced correctly.
Yet none of these seemed to be occurring. Further tracing indicated
this:
d) Shutdown during CIL pushing resulting in log item completion
being called from checkpoint abort processing. Items are unpinned
and released without serialisation against each other, journal IO
completion or transaction commit completion.
In this case, we may still have a transaction commit in flight that
holds a reference to an xfs_buf_log_item (BLI) after CIL insertion.
e.g. a synchronous transaction will flush the CIL before the
transaction is torn down. The concurrent CIL push then aborts
insertion of it and drops the commit/AIL reference to the BLI. This can
leave the transaction commit context with the last reference to the
BLI which is dropped here:
xfs_trans_free_items()
->iop_release
xfs_buf_item_release
xfs_buf_item_put
if (XFS_LI_ABORTED)
xfs_trans_ail_delete
xfs_buf_item_relse()
Unlike the journal completion ->iop_unpin path, this path does not
run stale buffer completion processing when it drops the last
reference, hence leaving the stale inodes attached to the buffer
sitting in the AIL. There are no other references to those inodes, so
there is no other mechanism to remove them from the AIL. Hence
unmount hangs.
The buffer lock context for stale buffers is passed to the last BLI
reference. This is normally the last BLI unpin on journal IO
completion. The unpin then processes the stale buffer completion and
releases the buffer lock. However, if the final unpin from journal
IO completion (or CIL push abort) does not hold the last reference
to the BLI, there -must- still be a transaction context that
references the BLI, and so that context must perform the stale
buffer completion processing before the buffer is unlocked and the
BLI torn down.
The fix for this is to rework the xfs_buf_item_relse() path to run
stale buffer completion processing if it drops the last reference to
the BLI. We still hold the buffer locked, so the buffer owner and
lock context is the same as if we passed the BLI and buffer to the
->iop_unpin() context to finish stale processing on journal commit.
However, we have to be careful here. In a shutdown state, we can be
freeing dirty BLIs from xfs_buf_item_put() via xfs_trans_brelse()
and xfs_trans_bdetach(). The existing code handles this case by
considering shutdown state as "aborted", but in doing so
largely masks the failure to clean up stale BLI state from the
xfs_buf_item_relse() path. i.e. regardless of the shutdown state and
whether the item is in the AIL, we must finish the stale buffer
cleanup if we are dropping the last BLI reference from the
->iop_release path in transaction commit context.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/xfs_buf_item.c | 121 +++++++++++++++++++++++++++++-------------
fs/xfs/xfs_buf_item.h | 2 +-
2 files changed, 86 insertions(+), 37 deletions(-)
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index c95826863c82..7fc54725c5f6 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -612,43 +612,42 @@ xfs_buf_item_push(
* Drop the buffer log item refcount and take appropriate action. This helper
* determines whether the bli must be freed or not, since a decrement to zero
* does not necessarily mean the bli is unused.
- *
- * Return true if the bli is freed, false otherwise.
*/
-bool
+void
xfs_buf_item_put(
struct xfs_buf_log_item *bip)
{
- struct xfs_log_item *lip = &bip->bli_item;
- bool aborted;
- bool dirty;
+
+ ASSERT(xfs_buf_islocked(bip->bli_buf));
/* drop the bli ref and return if it wasn't the last one */
if (!atomic_dec_and_test(&bip->bli_refcount))
- return false;
+ return;
- /*
- * We dropped the last ref and must free the item if clean or aborted.
- * If the bli is dirty and non-aborted, the buffer was clean in the
- * transaction but still awaiting writeback from previous changes. In
- * that case, the bli is freed on buffer writeback completion.
- */
- aborted = test_bit(XFS_LI_ABORTED, &lip->li_flags) ||
- xlog_is_shutdown(lip->li_log);
- dirty = bip->bli_flags & XFS_BLI_DIRTY;
- if (dirty && !aborted)
- return false;
+ /* If the BLI is in the AIL, then it is still dirty and in use */
+ if (test_bit(XFS_LI_IN_AIL, &bip->bli_item.li_flags)) {
+ ASSERT(bip->bli_flags & XFS_BLI_DIRTY);
+ return;
+ }
/*
- * The bli is aborted or clean. An aborted item may be in the AIL
- * regardless of dirty state. For example, consider an aborted
- * transaction that invalidated a dirty bli and cleared the dirty
- * state.
+ * In shutdown conditions, we can be asked to free a dirty BLI that
+ * isn't in the AIL. This can occur due to a checkpoint aborting a BLI
+ * instead of inserting it into the AIL at checkpoint IO completion. If
+ * there's another bli reference (e.g. a btree cursor holds a clean
+ * reference) and it is released via xfs_trans_brelse(), we can get here
+ * with that aborted, dirty BLI. In this case, it is safe to free the
+ * dirty BLI immediately, as it is not in the AIL and there are no
+ * other references to it.
+ *
+ * We should never get here with a stale BLI via that path as
+ * xfs_trans_brelse() specifically holds onto stale buffers rather than
+ * releasing them.
*/
- if (aborted)
- xfs_trans_ail_delete(lip, 0);
+ ASSERT(!(bip->bli_flags & XFS_BLI_DIRTY) ||
+ test_bit(XFS_LI_ABORTED, &bip->bli_item.li_flags));
+ ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
xfs_buf_item_relse(bip);
- return true;
}
/*
@@ -669,6 +668,15 @@ xfs_buf_item_put(
* if necessary but do not unlock the buffer. This is for support of
* xfs_trans_bhold(). Make sure the XFS_BLI_HOLD field is cleared if we don't
* free the item.
+ *
+ * If the XFS_BLI_STALE flag is set, the last reference to the BLI *must*
+ * perform a completion abort of any objects attached to the buffer for IO
+ * tracking purposes. This generally only happens in shutdown situations,
+ * normally xfs_buf_item_unpin() will drop the last BLI reference and perform
+ * completion processing. However, because transaction completion can race with
+ * checkpoint completion during a shutdown, this release context may end up
+ * being the last active reference to the BLI and so needs to perform this
+ * cleanup.
*/
STATIC void
xfs_buf_item_release(
@@ -676,18 +684,19 @@ xfs_buf_item_release(
{
struct xfs_buf_log_item *bip = BUF_ITEM(lip);
struct xfs_buf *bp = bip->bli_buf;
- bool released;
bool hold = bip->bli_flags & XFS_BLI_HOLD;
bool stale = bip->bli_flags & XFS_BLI_STALE;
-#if defined(DEBUG) || defined(XFS_WARN)
- bool ordered = bip->bli_flags & XFS_BLI_ORDERED;
- bool dirty = bip->bli_flags & XFS_BLI_DIRTY;
bool aborted = test_bit(XFS_LI_ABORTED,
&lip->li_flags);
+ bool dirty = bip->bli_flags & XFS_BLI_DIRTY;
+#if defined(DEBUG) || defined(XFS_WARN)
+ bool ordered = bip->bli_flags & XFS_BLI_ORDERED;
#endif
trace_xfs_buf_item_release(bip);
+ ASSERT(xfs_buf_islocked(bp));
+
/*
* The bli dirty state should match whether the blf has logged segments
* except for ordered buffers, where only the bli should be dirty.
@@ -703,16 +712,56 @@ xfs_buf_item_release(
bp->b_transp = NULL;
bip->bli_flags &= ~(XFS_BLI_LOGGED | XFS_BLI_HOLD | XFS_BLI_ORDERED);
+ /* If there are other references, then we have nothing to do. */
+ if (!atomic_dec_and_test(&bip->bli_refcount))
+ goto out_release;
+
/*
- * Unref the item and unlock the buffer unless held or stale. Stale
- * buffers remain locked until final unpin unless the bli is freed by
- * the unref call. The latter implies shutdown because buffer
- * invalidation dirties the bli and transaction.
+ * Stale buffer completion frees the BLI, unlocks and releases the
+ * buffer. Neither the BLI or buffer are safe to reference after this
+ * call, so there's nothing more we need to do here.
+ *
+ * If we get here with a stale buffer and references to the BLI remain,
+ * we must not unlock the buffer as the last BLI reference owns lock
+ * context, not us.
*/
- released = xfs_buf_item_put(bip);
- if (hold || (stale && !released))
+ if (stale) {
+ xfs_buf_item_finish_stale(bip);
+ xfs_buf_relse(bp);
+ ASSERT(!hold);
+ return;
+ }
+
+ /*
+ * Dirty or clean, aborted items are done and need to be removed from
+ * the AIL and released. This frees the BLI, but leaves the buffer
+ * locked and referenced.
+ */
+ if (aborted || xlog_is_shutdown(lip->li_log)) {
+ ASSERT(list_empty(&bip->bli_buf->b_li_list));
+ xfs_buf_item_done(bp);
+ goto out_release;
+ }
+
+ /*
+ * Clean, unreferenced BLIs can be immediately freed, leaving the buffer
+ * locked and referenced.
+ *
+ * Dirty, unreferenced BLIs *must* be in the AIL awaiting writeback.
+ */
+ if (!dirty)
+ xfs_buf_item_relse(bip);
+ else
+ ASSERT(test_bit(XFS_LI_IN_AIL, &lip->li_flags));
+
+ /* Not safe to reference the BLI from here */
+out_release:
+ /*
+ * If we get here with a stale buffer, we must not unlock the
+ * buffer as the last BLI reference owns lock context, not us.
+ */
+ if (stale || hold)
return;
- ASSERT(!stale || aborted);
xfs_buf_relse(bp);
}
diff --git a/fs/xfs/xfs_buf_item.h b/fs/xfs/xfs_buf_item.h
index 50dd79b59cf5..416890b84f8c 100644
--- a/fs/xfs/xfs_buf_item.h
+++ b/fs/xfs/xfs_buf_item.h
@@ -49,7 +49,7 @@ struct xfs_buf_log_item {
int xfs_buf_item_init(struct xfs_buf *, struct xfs_mount *);
void xfs_buf_item_done(struct xfs_buf *bp);
-bool xfs_buf_item_put(struct xfs_buf_log_item *);
+void xfs_buf_item_put(struct xfs_buf_log_item *bip);
void xfs_buf_item_log(struct xfs_buf_log_item *, uint, uint);
bool xfs_buf_item_dirty_format(struct xfs_buf_log_item *);
void xfs_buf_inode_iodone(struct xfs_buf *);
--
2.45.2
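The reference-count handoff that the reworked release path implements can be modelled as a small standalone sketch. This is hypothetical C, not the XFS code: `struct bli_model`, `bli_model_release()` and the boolean fields are invented for illustration, and only the rules mirror the patch — the last reference to a stale BLI runs completion and releases the buffer, non-final references must never drop the lock of a stale buffer, and clean BLIs are freed immediately:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical model of the reworked BLI release path. Names are
 * invented; only the refcount/lock-handoff rules follow the patch.
 */
struct bli_model {
	atomic_int	refcount;	/* models bip->bli_refcount */
	bool		stale;		/* models XFS_BLI_STALE */
	bool		dirty;		/* models XFS_BLI_DIRTY */
	bool		buf_locked;	/* models holding the buffer lock */
	bool		freed;		/* BLI has been torn down */
};

static void bli_model_release(struct bli_model *b, bool hold)
{
	assert(b->buf_locked);	/* caller must own the buffer lock */

	if (atomic_fetch_sub(&b->refcount, 1) != 1) {
		/*
		 * Not the last reference. For stale buffers the last
		 * reference owns the lock context, so do not unlock.
		 */
		if (!b->stale && !hold)
			b->buf_locked = false;
		return;
	}
	if (b->stale) {
		/* stale completion frees the BLI and releases the buffer */
		b->freed = true;
		b->buf_locked = false;
		return;
	}
	if (!b->dirty)
		b->freed = true;	/* clean BLIs are freed immediately */
	if (!hold)
		b->buf_locked = false;
}

/*
 * Two contexts (e.g. transaction commit and journal IO completion)
 * each drop a reference; whichever drops the last one finishes the
 * stale buffer and releases the lock.
 */
static bool model_stale_handoff(void)
{
	struct bli_model b = {
		.stale = true, .dirty = true,
		.buf_locked = true, .freed = false,
	};
	atomic_init(&b.refcount, 2);

	bli_model_release(&b, false);	/* first drop: lock stays owned */
	if (!b.buf_locked || b.freed)
		return false;

	bli_model_release(&b, false);	/* last drop: completion runs */
	return b.freed && !b.buf_locked;
}
```

Under this model, dropping a non-final reference to a stale buffer leaves the lock owned by the remaining reference, matching the "last BLI reference owns lock context, not us" comments in the patch.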
* Re: [PATCH 7/7] xfs: fix unmount hang with unflushable inodes stuck in the AIL
2025-06-25 22:49 ` [PATCH 7/7] xfs: fix unmount hang with unflushable inodes stuck in the AIL Dave Chinner
@ 2025-06-27 10:53 ` Carlos Maiolino
0 siblings, 0 replies; 16+ messages in thread
From: Carlos Maiolino @ 2025-06-27 10:53 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Thu, Jun 26, 2025 at 08:49:00AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Unmount of a shutdown filesystem can hang with stale inode cluster
> buffers in the AIL like so:
>
> [95964.140623] Call Trace:
> [95964.144641] __schedule+0x699/0xb70
> [95964.154003] schedule+0x64/0xd0
> [95964.156851] xfs_ail_push_all_sync+0x9b/0xf0
> [95964.164816] xfs_unmount_flush_inodes+0x41/0x70
> [95964.168698] xfs_unmountfs+0x7f/0x170
> [95964.171846] xfs_fs_put_super+0x3b/0x90
> [95964.175216] generic_shutdown_super+0x77/0x160
> [95964.178060] kill_block_super+0x1b/0x40
> [95964.180553] xfs_kill_sb+0x12/0x30
> [95964.182796] deactivate_locked_super+0x38/0x100
> [95964.185735] deactivate_super+0x41/0x50
> [95964.188245] cleanup_mnt+0x9f/0x160
> [95964.190519] __cleanup_mnt+0x12/0x20
> [95964.192899] task_work_run+0x89/0xb0
> [95964.195221] resume_user_mode_work+0x4f/0x60
> [95964.197931] syscall_exit_to_user_mode+0x76/0xb0
> [95964.201003] do_syscall_64+0x74/0x130
>
> Tracing on a hung system shows the AIL repeating, every 50ms, a log
> force followed by an attempt to push pinned, aborted inodes from the
> AIL (trimmed for brevity):
>
> xfs_log_force: lsn 0x1c caller xfsaild+0x18e
> xfs_log_force: lsn 0x0 caller xlog_cil_flush+0xbd
> xfs_log_force: lsn 0x1c caller xfs_log_force+0x77
> xfs_ail_pinned: lip 0xffff88826014afa0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
> xfs_ail_pinned: lip 0xffff88814000a708 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
> xfs_ail_pinned: lip 0xffff88810b850c80 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
> xfs_ail_pinned: lip 0xffff88810b850af0 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
> xfs_ail_pinned: lip 0xffff888165cf0a28 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
> xfs_ail_pinned: lip 0xffff88810b850bb8 lsn 1/37472 type XFS_LI_INODE flags IN_AIL|ABORTED
> ....
>
> The inode log items are marked as aborted, which means that either:
>
> a) a transaction commit has occurred, seen an error or shutdown, and
> called xfs_trans_free_items() to abort the items. This should happen
> before any pinning of log items occurs.
>
> or
>
> b) a dirty transaction has been cancelled. This should also happen
> before any pinning of log items occurs.
>
> or
>
> c) AIL insertion at journal IO completion is marked as aborted. In
> this case, the log item is pinned by the CIL until journal IO
> completes and hence needs to be unpinned. This is then done after
> the ->iop_committed() callback is run, so the pin count should be
> balanced correctly.
>
> Yet none of these seemed to be occurring. Further tracing indicated
> this:
>
> d) Shutdown during CIL pushing resulting in log item completion
> being called from checkpoint abort processing. Items are unpinned
> and released without serialisation against each other, journal IO
> completion or transaction commit completion.
>
> In this case, we may still have a transaction commit in flight that
> holds a reference to a xfs_buf_log_item (BLI) after CIL insertion.
> e.g. a synchronous transaction will flush the CIL before the
> transaction is torn down. The concurrent CIL push then aborts
> insertion of it and drops the commit/AIL reference to the BLI. This can
> leave the transaction commit context with the last reference to the
> BLI which is dropped here:
>
> xfs_trans_free_items()
> ->iop_release
> xfs_buf_item_release
> xfs_buf_item_put
> if (XFS_LI_ABORTED)
> xfs_trans_ail_delete
> xfs_buf_item_relse()
>
> Unlike the journal completion ->iop_unpin path, this path does not
> run stale buffer completion processing when it drops the last
> reference, hence leaving the stale inodes attached to the buffer
> sitting in the AIL. There are no other references to those inodes, so
> there is no other mechanism to remove them from the AIL. Hence
> unmount hangs.
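To make the leak concrete, here is a toy model of the control flow described above. All names are invented (this is not the kernel code); only the behaviour mirrors the commit message — before the fix, stale completion ran only on the journal-unpin path, so a last reference dropped from transaction commit context left the attached inodes in the AIL:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model; names invented, control flow per the commit message. */
struct stale_item {
	int	refs;
	bool	stale;
	bool	inodes_in_ail;	/* stale inodes attached to the buffer */
};

/* old behaviour: only the journal unpin path ran stale completion */
static void old_drop_ref(struct stale_item *it, bool from_unpin)
{
	if (--it->refs)
		return;
	if (from_unpin && it->stale)
		it->inodes_in_ail = false;	/* completion removes them */
	/* transaction-commit path: BLI freed, AIL entries leaked */
}

/* fixed behaviour: the last reference always runs stale completion */
static void new_drop_ref(struct stale_item *it, bool from_unpin)
{
	(void)from_unpin;
	if (--it->refs)
		return;
	if (it->stale)
		it->inodes_in_ail = false;
}

static bool old_path_leaks(void)
{
	struct stale_item it = { .refs = 1, .stale = true,
				 .inodes_in_ail = true };
	old_drop_ref(&it, false);	/* last ref from transaction commit */
	return it.inodes_in_ail;	/* still true: unmount would hang */
}

static bool new_path_cleans(void)
{
	struct stale_item it = { .refs = 1, .stale = true,
				 .inodes_in_ail = true };
	new_drop_ref(&it, false);
	return !it.inodes_in_ail;
}
```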
>
> The buffer lock context for stale buffers is passed to the last BLI
> reference. This is normally the last BLI unpin on journal IO
> completion. The unpin then processes the stale buffer completion and
> releases the buffer lock. However, if the final unpin from journal
> IO completion (or CIL push abort) does not hold the last reference
> to the BLI, there -must- still be a transaction context that
> references the BLI, and so that context must perform the stale
> buffer completion processing before the buffer is unlocked and the
> BLI torn down.
>
> The fix for this is to rework the xfs_buf_item_relse() path to run
> stale buffer completion processing if it drops the last reference to
> the BLI. We still hold the buffer locked, so the buffer owner and
> lock context are the same as if we passed the BLI and buffer to the
> ->iop_unpin() context to finish stale processing on journal commit.
>
> However, we have to be careful here. In a shutdown state, we can be
> freeing dirty BLIs from xfs_buf_item_put() via xfs_trans_brelse()
> and xfs_trans_bdetach(). The existing code handles this case by
> considering shutdown state as "aborted", but in doing so
> largely masks the failure to clean up stale BLI state from the
> xfs_buf_item_relse() path. i.e. regardless of the shutdown state and
> whether the item is in the AIL, we must finish the stale buffer
> cleanup if we are dropping the last BLI reference from the
> ->iop_relse path in transaction commit context.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_buf_item.c | 121 +++++++++++++++++++++++++++++-------------
> fs/xfs/xfs_buf_item.h | 2 +-
> 2 files changed, 86 insertions(+), 37 deletions(-)
>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>