From: Dave Chinner <david@fromorbit.com>
To: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: linux-xfs <linux-xfs@vger.kernel.org>,
Joseph Qi <joseph.qi@linux.alibaba.com>
Subject: Re: [bug report][5.10] deadlock between xfs_create() and xfs_inactive()
Date: Fri, 7 Jul 2023 08:13:25 +1000
Message-ID: <ZKc8hfIfKw0L052X@dread.disaster.area>
In-Reply-To: <6fcbbb5a-6247-bab1-0515-359e663c587f@linux.alibaba.com>
On Thu, Jul 06, 2023 at 11:36:26AM +0800, Gao Xiang wrote:
> Hi folks,
>
> This is a report from our cloud online workloads, it could
> randomly happen about ~20days, and currently we have no idea
> how to reproduce with some artificial testcase reliably:
So much of this code has changed in current upstream kernels....
> The detail is as below:
>
>
> (Thread 1)
> already take AGF lock
> loop due to inode I_FREEING
>
> PID: 1894063 TASK: ffff954f494dc500 CPU: 5 COMMAND: postgres*
> #0 [ffffa141ca34f920] schedule at ffffffff9ca58505
> #1 [ffffa141ca34f9b0] schedule at ffffffff9ca5899c
> #2 [ffffa141ca34f9c0] schedule_timeout at ffffffff9ca5c027
> #3 [ffffa141ca34fa48] xfs_iget at ffffffffc1137b4f [xfs] xfs_iget_cache_hit -> igrab(inode)
> #4 [ffffa141ca34fb00] xfs_ialloc at ffffffffc1140ab5 [xfs]
> #5 [ffffa141ca34fb80] xfs_dir_ialloc at ffffffffc1142bfc [xfs]
> #6 [ffffa141ca34fc10] xfs_create at ffffffffc1142fc8 [xfs]
> #7 [ffffa141ca34fca0] xfs_generic_create at ffffffffc1140229 [xfs]
So how are we holding the AGF here?

I haven't looked at the 5.10 code yet, but the upstream code is
different; xfs_iget() is not called until xfs_dialloc() has
returned. In that case, if we just allocated an inode from the
inobt, then no blocks have been allocated and the AGF should not be
locked. If we had to allocate a new inode chunk, the transaction has
been rolled and the AGF gets unlocked - we only hold the AGI at that
point.

IIRC the locking is the same for the older kernels (i.e. the
two-phase allocation that holds the AGI locked), so it's not
entirely clear to me how the AGF is getting held locked here.

Ah.

I suspect this is a free inode btree update that uses the last free
inode in a chunk, so the chunk is being removed from the finobt and
that is freeing a finobt block (e.g. due to a leaf merge). Hence the
AGF gets locked for the block free, and the transaction does not
need to be rolled.

Hmmmmm. Didn't I just fix this problem? This just went into the
current 6.5-rc0 tree:
commit b742d7b4f0e03df25c2a772adcded35044b625ca
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Jun 28 11:04:32 2023 -0700

    xfs: use deferred frees for btree block freeing

    Btrees that aren't freespace management trees use the normal extent
    allocation and freeing routines for their blocks. Hence when a btree
    block is freed, a direct call to xfs_free_extent() is made and the
    extent is immediately freed. This puts the entire free space
    management btrees under this path, so we are stacking btrees on
    btrees in the call stack. The inobt, finobt and refcount btrees
    all do this.

    However, the bmap btree does not do this - it calls
    xfs_free_extent_later() to defer the extent free operation via an
    XEFI and hence it gets processed in deferred operation processing
    during the commit of the primary transaction (i.e. via intent
    chaining).

    We need to change xfs_free_extent() to behave in a non-blocking
    manner so that we can avoid deadlocks with busy extents near ENOSPC
    in transactions that free multiple extents. Inserting or removing a
    record from a btree can cause a multi-level tree merge operation and
    that will free multiple blocks from the btree in a single
    transaction. i.e. we can call xfs_free_extent() multiple times, and
    hence the btree manipulation transaction is vulnerable to this busy
    extent deadlock vector.

    To fix this, convert all the remaining callers of xfs_free_extent()
    to use xfs_free_extent_later() to queue XEFIs and hence defer
    processing of the extent frees to a context that can be safely
    restarted if a deadlock condition is detected.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
So this is probably not a problem on a current ToT....
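
For anyone following along, the difference between an immediate free
and a deferred free can be sketched with a toy model. This is plain
Python, not XFS code: ToyTransaction, free_extent_now(),
free_extent_later() and merge_btree_blocks() are made-up names that
only mimic the shape of xfs_free_extent() vs xfs_free_extent_later()
as described in the commit message above. There is no locking or
logging in it.

```python
from collections import deque


class ToyTransaction:
    """Toy model of a transaction with a deferred-work queue (not XFS code)."""

    def __init__(self):
        self.deferred = deque()  # queued "free this extent" intents
        self.freed = []          # extents actually freed, in order
        self.log = []            # record of what happened, for inspection

    def free_extent_now(self, extent):
        # Immediate free: the work happens inside the caller's context,
        # nested under whatever the caller is doing at the time.
        self.freed.append(extent)
        self.log.append(("immediate", extent))

    def free_extent_later(self, extent):
        # Deferred free: only record the intent; nothing is freed yet.
        self.deferred.append(extent)
        self.log.append(("queued", extent))

    def commit(self):
        # At commit time, finish each queued intent as its own step.
        while self.deferred:
            extent = self.deferred.popleft()
            self.freed.append(extent)
            self.log.append(("finished", extent))


def merge_btree_blocks(tx, blocks, deferred):
    """Pretend a multi-level merge frees several btree blocks."""
    for blk in blocks:
        if deferred:
            tx.free_extent_later(blk)
        else:
            tx.free_extent_now(blk)
```

With deferred=True nothing is freed until commit() runs, which is the
point at which the real code gets a context that can be restarted.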
> ...
>
> (Thread 2)
> already have inode I_FREEING
> want to take AGF lock
> PID: 202276 TASK: ffff954d142/0000 CPU: 2 COMMAND: postgres*
> #0 [ffffa141c12638d0] schedule at ffffffff9ca58505
> #1 [ffffa141c1263960] schedule at ffffffff9ca5899c
> #2 [ffffa141c1263970] schedule_timeout at ffffffff9ca5c0a9
> #3 [ffffa141c1263988] down at ffffffff9ca5aba5
> #4 [ffffa141c1263a58] down at ffffffff9c146d6b
> #5 [ffffa141c1263a70] xfs_buf_lock at ffffffffc112c3dc [xfs]
> #6 [ffffa141c1263a80] xfs_buf_find at ffffffffc112c83d [xfs]
> #7 [ffffa141c1263b18] xfs_buf_get_map at ffffffffc112cb3c [xfs]
> #8 [ffffa141c1263b70] xfs_buf_read_map at ffffffffc112d175 [xfs]
> #9 [ffffa141c1263bc8] xfs_trans_read_buf_map at ffffffffc116404a [xfs]
> #10 [ffffa141c1263c28] xfs_read_agf at ffffffffc10e1c44 [xfs]
> #11 [ffffa141c1263c80] xfs_alloc_read_agf at ffffffffc10e1d0a [xfs]
> #12 [ffffa141c1263cb0] xfs_agfl_free_finish_item at ffffffffc115a45a [xfs]
> #13 [ffffa141c1263d00] xfs_defer_finish_noroll at ffffffffc110257e [xfs]
> #14 [ffffa141c1263d68] xfs_trans_commit at ffffffffc1150581 [xfs]
> #15 [ffffa141c1263da8] xfs_inactive_free at ffffffffc1144084 [xfs]
> #16 [ffffa141c1263dd8] xfs_inactive at ffffffffc11441f2 [xfs]
> #17 [ffffa141c1263df0] xfs_fs_destroy_inode at ffffffffc114d489 [xfs]
> #18 [ffffa141c1263e10] destroy_inode at ffffffff9c3838a8
> #19 [ffffa141c1263e28] dentry_kill at ffffffff9c37f5d5
> #20 [ffffa141c1263e48] dput at ffffffff9c3800ab
> #21 [ffffa141c1263e70] do_renameat2 at ffffffff9c376a8b
> #22 [ffffa141c1263f38] sys_rename at ffffffff9c376cdc
> #23 [ffffa141c1263f40] do_syscall_64 at ffffffff9ca4a4c0
> #24 [ffffa141c1263f50] entry_SYSCALL_64_after_hwframe at ffffffff9cc00099
Ok, so rolling the transaction requires gaining the AGF lock again,
so we are effectively doing:

	lock AGI
	free inode
	  lock AGF
	    fixup freelist -> defers freeing because AGFL too big
	  free finobt block/inode chunk
	remove inode from unlinked list
	xfs_trans_commit()
	  logs EFI for AGFL blocks
	  rolls transaction
	    commits items to CIL
	    unlocks AGI -> allows allocation of inode again
	    unlocks AGF
	  finishes EFI
	    locks AGF
	      <blocks>

I think dropping and relocking the AGF after dropping the AGI is
fine - the AGI should be able to free/reallocate inodes in a chunk
immediately, and the reuse is only dependent on icache state (as is
happening here).
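
The two reported stacks can be written down as a wait-for graph to
make the cycle explicit. The sketch below is an illustration only: it
treats "inode stuck in I_FREEING" as if it were a resource held by the
inactivation thread, which is a modelling assumption, and the task and
resource names ("create", "inactive", "AGF", "inode:I_FREEING") are
labels I picked, not kernel identifiers.

```python
def find_wait_cycle(holds, waits_for):
    """Return a list of tasks forming a wait-for cycle, or None.

    holds:     dict task -> set of resources it currently holds
    waits_for: dict task -> the single resource it is blocked on (or None)
    """
    # Map each resource to the task that holds it.
    owner = {}
    for task, resources in holds.items():
        for res in resources:
            owner[res] = task

    for start in holds:
        seen = []
        task = start
        while task is not None and task not in seen:
            seen.append(task)
            wanted = waits_for.get(task)
            task = owner.get(wanted)  # None if nobody holds it
        if task is not None:
            # We walked back onto a task already on the path: a cycle.
            return seen[seen.index(task):]
    return None


# Thread 1: holds the AGF, loops in xfs_iget() until I_FREEING clears.
# Thread 2: owns the I_FREEING inode, blocks trying to relock the AGF.
holds = {
    "create": {"AGF"},
    "inactive": {"inode:I_FREEING"},
}
waits_for = {
    "create": "inode:I_FREEING",
    "inactive": "AGF",
}
cycle = find_wait_cycle(holds, waits_for)
```

If the create side did not hold the AGF while waiting on the inode,
the same function would return None for this graph.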
> I'm not sure if the mainline kernel still has the issue, but after some
> code review, I guess even after defer inactivation, such inodes pending
> for recycling still keep I_FREEING.
The inode will be (XFS_NEED_INACTIVE | XFS_INACTIVATING), so the
xfs_iget() code won't even be getting as far as calling igrab().
i.e. the VFS inode state is irrelevant with background inodegc...
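
As a rough sketch of that ordering: the check on the XFS inode flags
comes first, so the VFS state is never consulted for an inode that is
queued for or undergoing inactivation. The flag values and the
"retry"/"igrab" return strings below are placeholders for
illustration, and cache_hit_action() is a hypothetical helper, not
the real xfs_iget_cache_hit() logic.

```python
# Hypothetical flag values for illustration; the real kernel constants differ.
XFS_NEED_INACTIVE = 1 << 0
XFS_INACTIVATING = 1 << 1


def cache_hit_action(i_flags):
    """Decide what a toy cache-hit lookup does based only on XFS-side flags."""
    if i_flags & (XFS_NEED_INACTIVE | XFS_INACTIVATING):
        # Inode is queued for, or undergoing, background inactivation:
        # back off before the VFS inode state is ever looked at.
        return "retry"
    # Only inodes without those flags get as far as igrab().
    return "igrab"
```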
> IOWs, there are still some
> dependencies between inode i_state and AGF lock with different order so
> it might be racy. Since it's online workloads, it's hard to switch the
> production environment to the latest kernel.
We should not have any dependencies between inode state and the AGF
lock - the AGI lock should be all that inode allocation/freeing
depends on, and the AGI/AGF ordering dependencies should take care
of everything else.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com