[PATCH 0/1] xfs: soft lockups for unmapping large no. of extents

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/1] xfs: soft lockups for unmapping large no. of extents
@ 2024-04-21  7:49 Ritesh Harjani (IBM)
  2024-04-21  7:49 ` [PATCH 1/1] xfs: Add cond_resched to xfs_defer_finish_noroll Ritesh Harjani (IBM)
  0 siblings, 1 reply; 4+ messages in thread
From: Ritesh Harjani (IBM) @ 2024-04-21  7:49 UTC (permalink / raw)
  To: linux-xfs
  Cc: Dave Chinner, Darrick J . Wong, Ojaswin Mujoo,
	Ritesh Harjani (IBM)

Hello,

In one of the testcases, parallel async dio writes to a file generates
large no. of extents (upto 2M or more), and then this file is cleaned up for
running other I/O tests. In the process of deleting this file, soft lockup
messages are observed. We believe this is happening due to kernel being busy
in unmapping/freeing those extents as part of the transaction processing.

This is similar observation with the same call stack which was also reported
here [1]. I also tried the qemu-img bench testcase shared in [1], and I was
able to reproduce the soft lockup with that on Power.

So as I understood from that discussion [1], that kernel is moving towards a new
preemption model, but IIUC, it is still an ongoing work.
Also IMHO, there are stable kernels which we still do support and such a fix
might still be necessary for them.

[1]: https://lore.kernel.org/all/20240110071347.3711925-1-wenjian1@xiaomi.com/

Ritesh Harjani (IBM) (1):
  xfs: Add cond_resched to xfs_defer_finish_noroll

 fs/xfs/libxfs/xfs_defer.c | 1 +
 1 file changed, 1 insertion(+)

--
2.44.0

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH 1/1] xfs: Add cond_resched to xfs_defer_finish_noroll
  2024-04-21  7:49 [PATCH 0/1] xfs: soft lockups for unmapping large no. of extents Ritesh Harjani (IBM)
@ 2024-04-21  7:49 ` Ritesh Harjani (IBM)
  2024-04-21 23:38   ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Ritesh Harjani (IBM) @ 2024-04-21  7:49 UTC (permalink / raw)
  To: linux-xfs
  Cc: Dave Chinner, Darrick J . Wong, Ojaswin Mujoo,
	Ritesh Harjani (IBM), stable

An async dio write to a sparse file can generate a lot of extents
and when we unlink this file (using rm), the kernel can be busy in umapping
and freeing those extents as part of transaction processing.
Add cond_resched() in xfs_defer_finish_noroll() to avoid soft lockups
messages. Here is a call trace of such soft lockup.

watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [rm:81335]
CPU: 1 PID: 81335 Comm: rm Kdump: loaded Tainted: G             L X    5.14.21-150500.53-default
NIP [c00800001b174768] xfs_extent_busy_trim+0xc0/0x2a0 [xfs]
LR [c00800001b1746f4] xfs_extent_busy_trim+0x4c/0x2a0 [xfs]
Call Trace:
 0xc0000000a8268340 (unreliable)
 xfs_alloc_compute_aligned+0x5c/0x150 [xfs]
 xfs_alloc_ag_vextent_size+0x1dc/0x8c0 [xfs]
 xfs_alloc_ag_vextent+0x17c/0x1c0 [xfs]
 xfs_alloc_fix_freelist+0x274/0x4b0 [xfs]
 xfs_free_extent_fix_freelist+0x84/0xe0 [xfs]
 __xfs_free_extent+0xa0/0x240 [xfs]
 xfs_trans_free_extent+0x6c/0x140 [xfs]
 xfs_defer_finish_noroll+0x2b0/0x650 [xfs]
 xfs_inactive_truncate+0xe8/0x140 [xfs]
 xfs_fs_destroy_inode+0xdc/0x320 [xfs]
 destroy_inode+0x6c/0xc0
 do_unlinkat+0x1fc/0x410
 sys_unlinkat+0x58/0xb0
 system_call_exception+0x15c/0x330
 system_call_common+0xec/0x250

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
cc: stable@vger.kernel.org
cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/xfs/libxfs/xfs_defer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index c13276095cc0..cb185b97447d 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -705,6 +705,7 @@ xfs_defer_finish_noroll(
 		error = xfs_defer_finish_one(*tp, dfp);
 		if (error && error != -EAGAIN)
 			goto out_shutdown;
+		cond_resched();
 	}

 	/* Requeue the paused items in the outgoing transaction. */
--
2.44.0

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/1] xfs: Add cond_resched to xfs_defer_finish_noroll
  2024-04-21  7:49 ` [PATCH 1/1] xfs: Add cond_resched to xfs_defer_finish_noroll Ritesh Harjani (IBM)
@ 2024-04-21 23:38   ` Dave Chinner
  2024-04-22 15:10     ` Ritesh Harjani
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2024-04-21 23:38 UTC (permalink / raw)
  To: Ritesh Harjani (IBM); +Cc: linux-xfs, Darrick J . Wong, Ojaswin Mujoo, stable

On Sun, Apr 21, 2024 at 01:19:44PM +0530, Ritesh Harjani (IBM) wrote:
> An async dio write to a sparse file can generate a lot of extents
> and when we unlink this file (using rm), the kernel can be busy in umapping
> and freeing those extents as part of transaction processing.
> Add cond_resched() in xfs_defer_finish_noroll() to avoid soft lockups
> messages. Here is a call trace of such soft lockup.
> 
> watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [rm:81335]
> CPU: 1 PID: 81335 Comm: rm Kdump: loaded Tainted: G             L X    5.14.21-150500.53-default

Can you reproduce this on a current TOT kernel? 5.14 is pretty old,
and this stack trace:

> NIP [c00800001b174768] xfs_extent_busy_trim+0xc0/0x2a0 [xfs]
> LR [c00800001b1746f4] xfs_extent_busy_trim+0x4c/0x2a0 [xfs]
> Call Trace:
>  0xc0000000a8268340 (unreliable)
>  xfs_alloc_compute_aligned+0x5c/0x150 [xfs]
>  xfs_alloc_ag_vextent_size+0x1dc/0x8c0 [xfs]
>  xfs_alloc_ag_vextent+0x17c/0x1c0 [xfs]
>  xfs_alloc_fix_freelist+0x274/0x4b0 [xfs]
>  xfs_free_extent_fix_freelist+0x84/0xe0 [xfs]
>  __xfs_free_extent+0xa0/0x240 [xfs]
>  xfs_trans_free_extent+0x6c/0x140 [xfs]
>  xfs_defer_finish_noroll+0x2b0/0x650 [xfs]
>  xfs_inactive_truncate+0xe8/0x140 [xfs]
>  xfs_fs_destroy_inode+0xdc/0x320 [xfs]
>  destroy_inode+0x6c/0xc0

.... doesn't exist anymore.

xfs_inactive_truncate() is now done from a
background inodegc thread, not directly in destroy_inode().

I also suspect that any sort of cond_resched() should be in the top
layer loop in xfs_bunmapi_range(), not hidden deep in the defer
code. The problem is the number of extents being processed without
yielding, not the time spent processing each individual deferred
work chain to free the extent. Hence the explicit rescheduling
should be at the top level loop where it can be easily explained and
understand, not hidden deep inside the defer chain mechanism....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH 1/1] xfs: Add cond_resched to xfs_defer_finish_noroll
  2024-04-21 23:38   ` Dave Chinner
@ 2024-04-22 15:10     ` Ritesh Harjani
  0 siblings, 0 replies; 4+ messages in thread
From: Ritesh Harjani @ 2024-04-22 15:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Darrick J . Wong, Ojaswin Mujoo, stable

Dave Chinner <david@fromorbit.com> writes:

> On Sun, Apr 21, 2024 at 01:19:44PM +0530, Ritesh Harjani (IBM) wrote:
>> An async dio write to a sparse file can generate a lot of extents
>> and when we unlink this file (using rm), the kernel can be busy in umapping
>> and freeing those extents as part of transaction processing.
>> Add cond_resched() in xfs_defer_finish_noroll() to avoid soft lockups
>> messages. Here is a call trace of such soft lockup.
>> 
>> watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [rm:81335]
>> CPU: 1 PID: 81335 Comm: rm Kdump: loaded Tainted: G             L X    5.14.21-150500.53-default
>
> Can you reproduce this on a current TOT kernel? 5.14 is pretty old,
> and this stack trace:
>

Yes, I was able to reproduce this on upstream kernel too -

    watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
    CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S  L   6.9.0-rc5-0-default #1
    Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
    NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
    LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
    Call Trace:
      xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
      xfs_alloc_compute_aligned+0x5c/0x144
      xfs_alloc_ag_vextent_size+0x238/0x8d4
      xfs_alloc_fix_freelist+0x540/0x694
      xfs_free_extent_fix_freelist+0x84/0xe0
      __xfs_free_extent+0x74/0x1ec
      xfs_extent_free_finish_item+0xcc/0x214
      xfs_defer_finish_one+0x194/0x388
      xfs_defer_finish_noroll+0x1b4/0x5c8
      xfs_defer_finish+0x2c/0xc4
      xfs_bunmapi_range+0xa4/0x100
      xfs_itruncate_extents_flags+0x1b8/0x2f4
      xfs_inactive_truncate+0xe0/0x124
      xfs_inactive+0x30c/0x3e0
      xfs_inodegc_worker+0x140/0x234
      process_scheduled_works+0x240/0x57c
      worker_thread+0x198/0x468
      kthread+0x138/0x140
      start_kernel_thread+0x14/0x18


>> NIP [c00800001b174768] xfs_extent_busy_trim+0xc0/0x2a0 [xfs]
>> LR [c00800001b1746f4] xfs_extent_busy_trim+0x4c/0x2a0 [xfs]
>> Call Trace:
>>  0xc0000000a8268340 (unreliable)
>>  xfs_alloc_compute_aligned+0x5c/0x150 [xfs]
>>  xfs_alloc_ag_vextent_size+0x1dc/0x8c0 [xfs]
>>  xfs_alloc_ag_vextent+0x17c/0x1c0 [xfs]
>>  xfs_alloc_fix_freelist+0x274/0x4b0 [xfs]
>>  xfs_free_extent_fix_freelist+0x84/0xe0 [xfs]
>>  __xfs_free_extent+0xa0/0x240 [xfs]
>>  xfs_trans_free_extent+0x6c/0x140 [xfs]
>>  xfs_defer_finish_noroll+0x2b0/0x650 [xfs]
>>  xfs_inactive_truncate+0xe8/0x140 [xfs]
>>  xfs_fs_destroy_inode+0xdc/0x320 [xfs]
>>  destroy_inode+0x6c/0xc0
>
> .... doesn't exist anymore.
>
> xfs_inactive_truncate() is now done from a
> background inodegc thread, not directly in destroy_inode().
>
> I also suspect that any sort of cond_resched() should be in the top
> layer loop in xfs_bunmapi_range(), not hidden deep in the defer
> code. The problem is the number of extents being processed without
> yielding, not the time spent processing each individual deferred
> work chain to free the extent. Hence the explicit rescheduling
> should be at the top level loop where it can be easily explained and
> understand, not hidden deep inside the defer chain mechanism....

Yes, sure. I will submit a v2 with this diff then (after I verify the
fix once on the setup, just for my sanity)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 656c95a22f2e..44d5381bc66f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6354,6 +6354,7 @@ xfs_bunmapi_range(
                error = xfs_defer_finish(tpp);
                if (error)
                        goto out;
+               cond_resched();
        }
 out:
        return error;


-ritesh

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2024-04-22 15:10 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-21  7:49 [PATCH 0/1] xfs: soft lockups for unmapping large no. of extents Ritesh Harjani (IBM)
2024-04-21  7:49 ` [PATCH 1/1] xfs: Add cond_resched to xfs_defer_finish_noroll Ritesh Harjani (IBM)
2024-04-21 23:38   ` Dave Chinner
2024-04-22 15:10     ` Ritesh Harjani

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox