public inbox for linux-xfs@vger.kernel.org
* [bug report][5.10] deadlock between xfs_create() and xfs_inactive()
@ 2023-07-06  3:36 Gao Xiang
  2023-07-06 22:13 ` Dave Chinner
  0 siblings, 1 reply; 3+ messages in thread
From: Gao Xiang @ 2023-07-06  3:36 UTC (permalink / raw)
  To: linux-xfs, Joseph Qi

Hi folks,

This is a report from our cloud online workloads; it can happen
randomly about every ~20 days, and we currently have no idea how to
reproduce it reliably with an artificial testcase:

The detail is as below:


(Thread 1)
already take AGF lock
loop due to inode I_FREEING

PID: 1894063 TASK: ffff954f494dc500 CPU: 5 COMMAND: postgres*
#0 [ffffa141ca34f920] __schedule at ffffffff9ca58505
#1 [ffffa141ca34f9b0] schedule at ffffffff9ca5899c
#2 [ffffa141ca34f9c0] schedule_timeout at ffffffff9ca5c027
#3 [ffffa141ca34fa48] xfs_iget at ffffffffc1137b4f [xfs]	<- xfs_iget_cache_hit -> igrab(inode)
#4 [ffffa141ca34fb00] xfs_ialloc at ffffffffc1140ab5 [xfs]
#5 [ffffa141ca34fb80] xfs_dir_ialloc at ffffffffc1142bfc [xfs]
#6 [ffffa141ca34fc10] xfs_create at ffffffffc1142fc8 [xfs]
#7 [ffffa141ca34fca0] xfs_generic_create at ffffffffc1140229 [xfs]
...

(Thread 2)
already have inode I_FREEING
want to take AGF lock
  
PID: 202276 TASK: ffff954d14270000 CPU: 2 COMMAND: postgres*
#0  [ffffa141c12638d0] __schedule at ffffffff9ca58505
#1  [ffffa141c1263960] schedule at ffffffff9ca5899c
#2  [ffffa141c1263970] schedule_timeout at ffffffff9ca5c0a9
#3  [ffffa141c1263988] __down at ffffffff9ca5aba5
#4  [ffffa141c1263a58] down at ffffffff9c146d6b
#5  [ffffa141c1263a70] xfs_buf_lock at ffffffffc112c3dc [xfs]
#6  [ffffa141c1263a80] xfs_buf_find at ffffffffc112c83d [xfs]
#7  [ffffa141c1263b18] xfs_buf_get_map at ffffffffc112cb3c [xfs]
#8  [ffffa141c1263b70] xfs_buf_read_map at ffffffffc112d175 [xfs]
#9  [ffffa141c1263bc8] xfs_trans_read_buf_map at ffffffffc116404a [xfs]
#10 [ffffa141c1263c28] xfs_read_agf at ffffffffc10e1c44 [xfs]
#11 [ffffa141c1263c80] xfs_alloc_read_agf at ffffffffc10e1d0a [xfs]
#12 [ffffa141c1263cb0] xfs_agfl_free_finish_item at ffffffffc115a45a [xfs]
#13 [ffffa141c1263d00] xfs_defer_finish_noroll at ffffffffc110257e [xfs]
#14 [ffffa141c1263d68] xfs_trans_commit at ffffffffc1150581 [xfs]
#15 [ffffa141c1263da8] xfs_inactive_ifree at ffffffffc1144084 [xfs]
#16 [ffffa141c1263dd8] xfs_inactive at ffffffffc11441f2 [xfs]
#17 [ffffa141c1263df0] xfs_fs_destroy_inode at ffffffffc114d489 [xfs]
#18 [ffffa141c1263e10] destroy_inode at ffffffff9c3838a8
#19 [ffffa141c1263e28] dentry_kill at ffffffff9c37f5d5
#20 [ffffa141c1263e48] dput at ffffffff9c3800ab
#21 [ffffa141c1263e70] do_renameat2 at ffffffff9c376a8b
#22 [ffffa141c1263f38] sys_rename at ffffffff9c376cdc
#23 [ffffa141c1263f40] do_syscall_64 at ffffffff9ca4a4c0
#24 [ffffa141c1263f50] entry_SYSCALL_64_after_hwframe at ffffffff9cc00099


I'm not sure whether the mainline kernel still has this issue, but after some
code review, I guess that even with deferred inactivation, inodes pending
recycling still keep I_FREEING set.  IOWs, there are still
dependencies between the inode i_state and the AGF lock taken in different
orders, so it might be racy.  Since these are online workloads, it's hard
to switch the production environment to the latest kernel.

Hopefully it helps.

Thanks,
Gao Xiang


* Re: [bug report][5.10] deadlock between xfs_create() and xfs_inactive()
  2023-07-06  3:36 [bug report][5.10] deadlock between xfs_create() and xfs_inactive() Gao Xiang
@ 2023-07-06 22:13 ` Dave Chinner
  2023-07-07  2:57   ` Gao Xiang
  0 siblings, 1 reply; 3+ messages in thread
From: Dave Chinner @ 2023-07-06 22:13 UTC (permalink / raw)
  To: Gao Xiang; +Cc: linux-xfs, Joseph Qi

On Thu, Jul 06, 2023 at 11:36:26AM +0800, Gao Xiang wrote:
> Hi folks,
> 
> This is a report from our cloud online workloads; it can happen
> randomly about every ~20 days, and we currently have no idea how to
> reproduce it reliably with an artificial testcase:

So much of this code has changed in current upstream kernels....

> The detail is as below:
> 
> 
> (Thread 1)
> already take AGF lock
> loop due to inode I_FREEING
> 
> PID: 1894063 TASK: ffff954f494dc500 CPU: 5 COMMAND: postgres*
> #0 [ffffa141ca34f920] __schedule at ffffffff9ca58505
> #1 [ffffa141ca34f9b0] schedule at ffffffff9ca5899c
> #2 [ffffa141ca34f9c0] schedule_timeout at ffffffff9ca5c027
> #3 [ffffa141ca34fa48] xfs_iget at ffffffffc1137b4f [xfs]	<- xfs_iget_cache_hit -> igrab(inode)
> #4 [ffffa141ca34fb00] xfs_ialloc at ffffffffc1140ab5 [xfs]
> #5 [ffffa141ca34fb80] xfs_dir_ialloc at ffffffffc1142bfc [xfs]
> #6 [ffffa141ca34fc10] xfs_create at ffffffffc1142fc8 [xfs]
> #7 [ffffa141ca34fca0] xfs_generic_create at ffffffffc1140229 [xfs]

So how are we holding the AGF here?

I haven't looked at the 5.10 code yet, but the upstream code is
different; xfs_iget() is not called until xfs_dialloc() has
returned. In that case, if we just allocated an inode from the
inobt, then no blocks have been allocated and the AGF should not be
locked. If we had to allocate a new inode chunk, the transaction has
been rolled and the AGF gets unlocked - we only hold the AGI at that
point.

IIRC the locking is the same for the older kernels (i.e. the
two-phase allocation that holds the AGI locked), so it's not
entirely clear to me how the AGF is getting held locked here.

Ah.

I suspect a free inode btree update consumed the last free inode
in a chunk, so the chunk is being removed from the finobt and that
is freeing a finobt block (e.g. due to a leaf merge), hence
resulting in the AGF getting locked for the block free without
needing the transaction to be rolled.

Hmmmmm. Didn't I just fix this problem? This just went into the
current 6.5-rc0 tree:

commit b742d7b4f0e03df25c2a772adcded35044b625ca
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Jun 28 11:04:32 2023 -0700

    xfs: use deferred frees for btree block freeing
    
    Btrees that aren't freespace management trees use the normal extent
    allocation and freeing routines for their blocks. Hence when a btree
    block is freed, a direct call to xfs_free_extent() is made and the
    extent is immediately freed. This puts the entire free space
    management btrees under this path, so we are stacking btrees on
    btrees in the call stack. The inobt, finobt and refcount btrees
    all do this.
    
    However, the bmap btree does not do this - it calls
    xfs_free_extent_later() to defer the extent free operation via an
    XEFI and hence it gets processed in deferred operation processing
    during the commit of the primary transaction (i.e. via intent
    chaining).
    
    We need to change xfs_free_extent() to behave in a non-blocking
    manner so that we can avoid deadlocks with busy extents near ENOSPC
    in transactions that free multiple extents. Inserting or removing a
    record from a btree can cause a multi-level tree merge operation and
    that will free multiple blocks from the btree in a single
    transaction. i.e. we can call xfs_free_extent() multiple times, and
    hence the btree manipulation transaction is vulnerable to this busy
    extent deadlock vector.
    
    To fix this, convert all the remaining callers of xfs_free_extent()
    to use xfs_free_extent_later() to queue XEFIs and hence defer
    processing of the extent frees to a context that can be safely
    restarted if a deadlock condition is detected.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

So this is probably not a problem on a current ToT....

> ...
> 
> (Thread 2)
> already have inode I_FREEING
> want to take AGF lock
> PID: 202276 TASK: ffff954d14270000 CPU: 2 COMMAND: postgres*
> #0  [ffffa141c12638d0] __schedule at ffffffff9ca58505
> #1  [ffffa141c1263960] schedule at ffffffff9ca5899c
> #2  [ffffa141c1263970] schedule_timeout at ffffffff9ca5c0a9
> #3  [ffffa141c1263988] __down at ffffffff9ca5aba5
> #4  [ffffa141c1263a58] down at ffffffff9c146d6b
> #5  [ffffa141c1263a70] xfs_buf_lock at ffffffffc112c3dc [xfs]
> #6  [ffffa141c1263a80] xfs_buf_find at ffffffffc112c83d [xfs]
> #7  [ffffa141c1263b18] xfs_buf_get_map at ffffffffc112cb3c [xfs]
> #8  [ffffa141c1263b70] xfs_buf_read_map at ffffffffc112d175 [xfs]
> #9  [ffffa141c1263bc8] xfs_trans_read_buf_map at ffffffffc116404a [xfs]
> #10 [ffffa141c1263c28] xfs_read_agf at ffffffffc10e1c44 [xfs]
> #11 [ffffa141c1263c80] xfs_alloc_read_agf at ffffffffc10e1d0a [xfs]
> #12 [ffffa141c1263cb0] xfs_agfl_free_finish_item at ffffffffc115a45a [xfs]
> #13 [ffffa141c1263d00] xfs_defer_finish_noroll at ffffffffc110257e [xfs]
> #14 [ffffa141c1263d68] xfs_trans_commit at ffffffffc1150581 [xfs]
> #15 [ffffa141c1263da8] xfs_inactive_ifree at ffffffffc1144084 [xfs]
> #16 [ffffa141c1263dd8] xfs_inactive at ffffffffc11441f2 [xfs]
> #17 [ffffa141c1263df0] xfs_fs_destroy_inode at ffffffffc114d489 [xfs]
> #18 [ffffa141c1263e10] destroy_inode at ffffffff9c3838a8
> #19 [ffffa141c1263e28] dentry_kill at ffffffff9c37f5d5
> #20 [ffffa141c1263e48] dput at ffffffff9c3800ab
> #21 [ffffa141c1263e70] do_renameat2 at ffffffff9c376a8b
> #22 [ffffa141c1263f38] sys_rename at ffffffff9c376cdc
> #23 [ffffa141c1263f40] do_syscall_64 at ffffffff9ca4a4c0
> #24 [ffffa141c1263f50] entry_SYSCALL_64_after_hwframe at ffffffff9cc00099

Ok, so rolling the transaction requires gaining the AGF lock again,
so we are effectively doing:

lock AGI
free inode
lock AGF 
fixup freelist -> defers freeing because AGFL too big
free finobt block/inode chunk
remove inode from unlinked list
xfs_trans_commit()
  logs EFI for AGFL blocks
  rolls transaction
    commits items to CIL
    unlocks AGI	-> allows allocation of inode again
    unlocks AGF
  finishes EFI
    locks AGF
      <blocks>

I think drop/relock AGF after dropping the AGI is fine - the AGI
should be able to free/reallocate inodes in a chunk immediately,
and the reuse is only dependent on icache state (as is happening
here).

> I'm not sure whether the mainline kernel still has this issue, but after some
> code review, I guess that even with deferred inactivation, inodes pending
> recycling still keep I_FREEING set.

The inode will be (XFS_NEED_INACTIVE | XFS_INACTIVATING), so the
xfs_iget() code won't even be getting as far as calling igrab().
i.e. the VFS inode state is irrelevant with background inodegc...

> IOWs, there are still
> dependencies between the inode i_state and the AGF lock taken in different
> orders, so it might be racy.  Since these are online workloads, it's hard
> to switch the production environment to the latest kernel.

We should not have any dependencies between inode state and the AGF
lock - the AGI lock should be all that inode allocation/freeing
depends on, and the AGI/AGF ordering dependencies should take care
of everything else.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [bug report][5.10] deadlock between xfs_create() and xfs_inactive()
  2023-07-06 22:13 ` Dave Chinner
@ 2023-07-07  2:57   ` Gao Xiang
  0 siblings, 0 replies; 3+ messages in thread
From: Gao Xiang @ 2023-07-07  2:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Joseph Qi

Hi Dave,

On 2023/7/7 06:13, Dave Chinner wrote:
> On Thu, Jul 06, 2023 at 11:36:26AM +0800, Gao Xiang wrote:
>> Hi folks,
>>
>> This is a report from our cloud online workloads; it can happen
>> randomly about every ~20 days, and we currently have no idea how to
>> reproduce it reliably with an artificial testcase:
> 
> So much of this code has changed in current upstream kernels....
> 
>> The detail is as below:
>>
>>
>> (Thread 1)
>> already take AGF lock
>> loop due to inode I_FREEING
>>
>> PID: 1894063 TASK: ffff954f494dc500 CPU: 5 COMMAND: postgres*
>> #0 [ffffa141ca34f920] __schedule at ffffffff9ca58505
>> #1 [ffffa141ca34f9b0] schedule at ffffffff9ca5899c
>> #2 [ffffa141ca34f9c0] schedule_timeout at ffffffff9ca5c027
>> #3 [ffffa141ca34fa48] xfs_iget at ffffffffc1137b4f [xfs]	<- xfs_iget_cache_hit -> igrab(inode)
>> #4 [ffffa141ca34fb00] xfs_ialloc at ffffffffc1140ab5 [xfs]
>> #5 [ffffa141ca34fb80] xfs_dir_ialloc at ffffffffc1142bfc [xfs]
>> #6 [ffffa141ca34fc10] xfs_create at ffffffffc1142fc8 [xfs]
>> #7 [ffffa141ca34fca0] xfs_generic_create at ffffffffc1140229 [xfs]
> 
> So how are we holding the AGF here?
> 
> I haven't looked at the 5.10 code yet, but the upstream code is
> different; xfs_iget() is not called until xfs_dialloc() has
> returned. In that case, if we just allocated an inode from the
> inobt, then no blocks have been allocated and the AGF should not be
> locked. If we had to allocate a new inode chunk, the transaction has
> been rolled and the AGF gets unlocked - we only hold the AGI at that
> point.
> 
> IIRC the locking is the same for the older kernels (i.e. the
> two-phase allocation that holds the AGI locked), so it's not
> entirely clear to me how the AGF is getting held locked here.
> 
> Ah.
> 
> I suspect a free inode btree update consumed the last free inode
> in a chunk, so the chunk is being removed from the finobt and that
> is freeing a finobt block (e.g. due to a leaf merge), hence
> resulting in the AGF getting locked for the block free without
> needing the transaction to be rolled.
> 
> Hmmmmm. Didn't I just fix this problem? This just went into the
> current 6.5-rc0 tree:
> 
> commit b742d7b4f0e03df25c2a772adcded35044b625ca
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Wed Jun 28 11:04:32 2023 -0700
> 
>      xfs: use deferred frees for btree block freeing
>      
>      Btrees that aren't freespace management trees use the normal extent
>      allocation and freeing routines for their blocks. Hence when a btree
>      block is freed, a direct call to xfs_free_extent() is made and the
>      extent is immediately freed. This puts the entire free space
>      management btrees under this path, so we are stacking btrees on
>      btrees in the call stack. The inobt, finobt and refcount btrees
>      all do this.
>      
>      However, the bmap btree does not do this - it calls
>      xfs_free_extent_later() to defer the extent free operation via an
>      XEFI and hence it gets processed in deferred operation processing
>      during the commit of the primary transaction (i.e. via intent
>      chaining).
>      
>      We need to change xfs_free_extent() to behave in a non-blocking
>      manner so that we can avoid deadlocks with busy extents near ENOSPC
>      in transactions that free multiple extents. Inserting or removing a
>      record from a btree can cause a multi-level tree merge operation and
>      that will free multiple blocks from the btree in a single
>      transaction. i.e. we can call xfs_free_extent() multiple times, and
>      hence the btree manipulation transaction is vulnerable to this busy
>      extent deadlock vector.
>      
>      To fix this, convert all the remaining callers of xfs_free_extent()
>      to use xfs_free_extent_later() to queue XEFIs and hence defer
>      processing of the extent frees to a context that can be safely
>      restarted if a deadlock condition is detected.
>      
>      Signed-off-by: Dave Chinner <dchinner@redhat.com>
>      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
>      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> 
> So this is probably not a problem on a current ToT....


Thanks for your reply and time on this issue.

Yes, if the transaction is rolled (due to a new inode chunk), we
only hold the AGI across the roll, so the AGF won't be taken even
on Linux 5.10.  The finobt is the piece I think is problematic
(our customer confirmed that finobt is enabled as well):

  - xfs_dialloc (5.10)
      - pag->pagi_freecount != 0     goto out_alloc;
      - xfs_dialloc_ag
         ...
         /*
          * Modify or remove the finobt record.
          */
         rec.ir_free &= ~XFS_INOBT_MASK(offset);
         rec.ir_freecount--;
         if (rec.ir_freecount)
                 error = xfs_inobt_update(cur, &rec);
         else
                 error = xfs_btree_delete(cur, &i);
         if (error)
                 goto error_cur;

    finobt deletion might trigger that.

I think we will eventually find time to look into backporting
these fixes (manpower is limited right now).

> 
>> ...
>>
>> (Thread 2)
>> already have inode I_FREEING
>> want to take AGF lock
>> PID: 202276 TASK: ffff954d14270000 CPU: 2 COMMAND: postgres*
>> #0  [ffffa141c12638d0] __schedule at ffffffff9ca58505
>> #1  [ffffa141c1263960] schedule at ffffffff9ca5899c
>> #2  [ffffa141c1263970] schedule_timeout at ffffffff9ca5c0a9
>> #3  [ffffa141c1263988] __down at ffffffff9ca5aba5
>> #4  [ffffa141c1263a58] down at ffffffff9c146d6b
>> #5  [ffffa141c1263a70] xfs_buf_lock at ffffffffc112c3dc [xfs]
>> #6  [ffffa141c1263a80] xfs_buf_find at ffffffffc112c83d [xfs]
>> #7  [ffffa141c1263b18] xfs_buf_get_map at ffffffffc112cb3c [xfs]
>> #8  [ffffa141c1263b70] xfs_buf_read_map at ffffffffc112d175 [xfs]
>> #9  [ffffa141c1263bc8] xfs_trans_read_buf_map at ffffffffc116404a [xfs]
>> #10 [ffffa141c1263c28] xfs_read_agf at ffffffffc10e1c44 [xfs]
>> #11 [ffffa141c1263c80] xfs_alloc_read_agf at ffffffffc10e1d0a [xfs]
>> #12 [ffffa141c1263cb0] xfs_agfl_free_finish_item at ffffffffc115a45a [xfs]
>> #13 [ffffa141c1263d00] xfs_defer_finish_noroll at ffffffffc110257e [xfs]
>> #14 [ffffa141c1263d68] xfs_trans_commit at ffffffffc1150581 [xfs]
>> #15 [ffffa141c1263da8] xfs_inactive_ifree at ffffffffc1144084 [xfs]
>> #16 [ffffa141c1263dd8] xfs_inactive at ffffffffc11441f2 [xfs]
>> #17 [ffffa141c1263df0] xfs_fs_destroy_inode at ffffffffc114d489 [xfs]
>> #18 [ffffa141c1263e10] destroy_inode at ffffffff9c3838a8
>> #19 [ffffa141c1263e28] dentry_kill at ffffffff9c37f5d5
>> #20 [ffffa141c1263e48] dput at ffffffff9c3800ab
>> #21 [ffffa141c1263e70] do_renameat2 at ffffffff9c376a8b
>> #22 [ffffa141c1263f38] sys_rename at ffffffff9c376cdc
>> #23 [ffffa141c1263f40] do_syscall_64 at ffffffff9ca4a4c0
>> #24 [ffffa141c1263f50] entry_SYSCALL_64_after_hwframe at ffffffff9cc00099
> 

..

>> IOWs, there are still
>> dependencies between the inode i_state and the AGF lock taken in different
>> orders, so it might be racy.  Since these are online workloads, it's hard
>> to switch the production environment to the latest kernel.
> 
> We should not have any dependencies between inode state and the AGF
> lock - the AGI lock should be all that inode allocation/freeing
> depends on, and the AGI/AGF ordering dependencies should take care
> of everything else.

Agreed, I just meant that inode eviction hard-codes an

    inode state  ----->   AGF    dependency

due to VFS internals (since we are in the inode eviction context), so
we need to avoid any potential opposite ordering (like AGF ----->
inode state or the like).

Thanks again for the help!

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.

