[PATCH 0/2] Add cond_resched() in some place to avoid softlockup

public inbox for linux-kernel@vger.kernel.org

* [PATCH 0/2] Add cond_resched() in some place to avoid softlockup
@ 2026-02-05  8:26 alexjlzheng
  2026-02-05  8:26 ` [PATCH 1/2] xfs: take a breath in xlog_ioend_work() alexjlzheng
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: alexjlzheng @ 2026-02-05  8:26 UTC (permalink / raw)
  To: cem; +Cc: linux-xfs, linux-kernel, Jinliang Zheng

From: Jinliang Zheng <alexjlzheng@tencent.com>

We recently observed several XFS-related softlockups in non-preempt
kernels during stability testing, and we believe adding a few
cond_resched() calls would be beneficial.
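
For readers unfamiliar with the pattern, the following is a minimal
userspace sketch of a long-running loop with voluntary yield points. It
is not the kernel code: process_items(), struct item and the batch
parameter are hypothetical names used only for illustration, and
sched_yield() is a rough stand-in, since the kernel's cond_resched()
reschedules only when a reschedule is actually pending.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical work item; stands in for a log item or a buffer. */
struct item {
	unsigned long payload;
};

/*
 * Process n items, offering the CPU to other runnable tasks after every
 * `batch` items. Returns the number of yield points reached.
 *
 * Assumption: sched_yield() is only a userspace analogue. It always
 * enters the scheduler, whereas cond_resched() does so only if needed.
 */
size_t process_items(struct item *items, size_t n, size_t batch)
{
	size_t yields = 0;

	assert(batch > 0);
	for (size_t i = 0; i < n; i++) {
		items[i].payload += 1;	/* placeholder for the real work */
		if ((i + 1) % batch == 0) {
			sched_yield();
			yields++;
		}
	}
	return yields;
}
```

With 2500 items and a batch of 1000, the loop reaches two yield points.
The patches in this series call cond_resched() on every iteration rather
than once per batch; the batch parameter here is purely illustrative.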

Jinliang Zheng (2):
  xfs: take a breath in xlog_ioend_work()
  xfs: take a breath in xfsaild()

 fs/xfs/xfs_buf.c     | 2 ++
 fs/xfs/xfs_log_cil.c | 3 +++
 2 files changed, 5 insertions(+)

-- 
2.49.0



* [PATCH 1/2] xfs: take a breath in xlog_ioend_work()
  2026-02-05  8:26 [PATCH 0/2] Add cond_resched() in some place to avoid softlockup alexjlzheng
@ 2026-02-05  8:26 ` alexjlzheng
  2026-02-05 10:54   ` Dave Chinner
  2026-02-05  8:26 ` [PATCH 2/2] xfs: take a breath in xfsaild() alexjlzheng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: alexjlzheng @ 2026-02-05  8:26 UTC (permalink / raw)
  To: cem; +Cc: linux-xfs, linux-kernel, Jinliang Zheng

From: Jinliang Zheng <alexjlzheng@tencent.com>

The xlog_ioend_work() function contains several nested loops with
fairly complex operations, which may lead to:

  PID: 2604722  TASK: ffff88c08306b1c0  CPU: 263  COMMAND: "kworker/263:0H"
   #0 [ffffc9001cbf8d58] machine_kexec at ffffffff9d086081
   #1 [ffffc9001cbf8db8] __crash_kexec at ffffffff9d20817a
   #2 [ffffc9001cbf8e78] panic at ffffffff9d107d8f
   #3 [ffffc9001cbf8ef8] watchdog_timer_fn at ffffffff9d243511
   #4 [ffffc9001cbf8f28] __hrtimer_run_queues at ffffffff9d1e62ff
   #5 [ffffc9001cbf8f80] hrtimer_interrupt at ffffffff9d1e73d4
   #6 [ffffc9001cbf8fd8] __sysvec_apic_timer_interrupt at ffffffff9d07bb29
   #7 [ffffc9001cbf8ff0] sysvec_apic_timer_interrupt at ffffffff9dd689f9
  --- <IRQ stack> ---
   #8 [ffffc900460a7c28] asm_sysvec_apic_timer_interrupt at ffffffff9de00e86
      [exception RIP: slab_free_freelist_hook.constprop.0+107]
      RIP: ffffffff9d3ef74b  RSP: ffffc900460a7cd0  RFLAGS: 00000286
      RAX: ffff89ea4de06b00  RBX: ffff89ea4de06a00  RCX: ffff89ea4de06a00
      RDX: 0000000000000100  RSI: ffffc900460a7d28  RDI: ffff888100044c80
      RBP: ffff888100044c80   R8: 0000000000000000   R9: ffffffffc21e8500
      R10: ffff88c867e93200  R11: 0000000000000001  R12: ffff89ea4de06a00
      R13: ffffc900460a7d28  R14: ffff89ea4de06a00  R15: ffffc900460a7d30
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #9 [ffffc900460a7d18] __kmem_cache_free at ffffffff9d3f65a0
  #10 [ffffc900460a7d70] xlog_cil_committed at ffffffffc21e85af [xfs]
  #11 [ffffc900460a7da0] xlog_cil_process_committed at ffffffffc21e9747 [xfs]
  #12 [ffffc900460a7dd0] xlog_state_do_iclog_callbacks at ffffffffc21e41eb [xfs]
  #13 [ffffc900460a7e28] xlog_state_do_callback at ffffffffc21e436f [xfs]
  #14 [ffffc900460a7e50] xlog_ioend_work at ffffffffc21e6e1c [xfs]
  #15 [ffffc900460a7e70] process_one_work at ffffffff9d12de69
  #16 [ffffc900460a7ea8] worker_thread at ffffffff9d12e79b
  #17 [ffffc900460a7ef8] kthread at ffffffff9d1378fc
  #18 [ffffc900460a7f30] ret_from_fork at ffffffff9d042dd0
  #19 [ffffc900460a7f50] ret_from_fork_asm at ffffffff9d007e2b

This patch adds cond_resched() to avoid softlockups similar to the one
described above.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
 fs/xfs/xfs_log_cil.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 778ac47adb8c..c51c24f98acc 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -843,6 +843,8 @@ xlog_cil_ail_insert(
 					LOG_ITEM_BATCH_SIZE, ctx->start_lsn);
 			i = 0;
 		}
+
+		cond_resched();
 	}
 
 	/* make sure we insert the remainder! */
@@ -925,6 +927,7 @@ xlog_cil_process_committed(
 			struct xfs_cil_ctx, iclog_entry))) {
 		list_del(&ctx->iclog_entry);
 		xlog_cil_committed(ctx);
+		cond_resched();
 	}
 }
 
-- 
2.49.0



* Re: [PATCH 1/2] xfs: take a breath in xlog_ioend_work()
  2026-02-05  8:26 ` [PATCH 1/2] xfs: take a breath in xlog_ioend_work() alexjlzheng
@ 2026-02-05 10:54   ` Dave Chinner
  2026-02-05 12:49     ` Jinliang Zheng
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2026-02-05 10:54 UTC (permalink / raw)
  To: alexjlzheng; +Cc: cem, linux-xfs, linux-kernel, Jinliang Zheng

On Thu, Feb 05, 2026 at 04:26:20PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> The xlog_ioend_work() function contains several nested loops with
> fairly complex operations, which may lead to:
> 
>   PID: 2604722  TASK: ffff88c08306b1c0  CPU: 263  COMMAND: "kworker/263:0H"
>    #0 [ffffc9001cbf8d58] machine_kexec at ffffffff9d086081
>    #1 [ffffc9001cbf8db8] __crash_kexec at ffffffff9d20817a
>    #2 [ffffc9001cbf8e78] panic at ffffffff9d107d8f
>    #3 [ffffc9001cbf8ef8] watchdog_timer_fn at ffffffff9d243511
>    #4 [ffffc9001cbf8f28] __hrtimer_run_queues at ffffffff9d1e62ff
>    #5 [ffffc9001cbf8f80] hrtimer_interrupt at ffffffff9d1e73d4
>    #6 [ffffc9001cbf8fd8] __sysvec_apic_timer_interrupt at ffffffff9d07bb29
>    #7 [ffffc9001cbf8ff0] sysvec_apic_timer_interrupt at ffffffff9dd689f9
>   --- <IRQ stack> ---
>    #8 [ffffc900460a7c28] asm_sysvec_apic_timer_interrupt at ffffffff9de00e86
>       [exception RIP: slab_free_freelist_hook.constprop.0+107]
>       RIP: ffffffff9d3ef74b  RSP: ffffc900460a7cd0  RFLAGS: 00000286
>       RAX: ffff89ea4de06b00  RBX: ffff89ea4de06a00  RCX: ffff89ea4de06a00
>       RDX: 0000000000000100  RSI: ffffc900460a7d28  RDI: ffff888100044c80
>       RBP: ffff888100044c80   R8: 0000000000000000   R9: ffffffffc21e8500
>       R10: ffff88c867e93200  R11: 0000000000000001  R12: ffff89ea4de06a00
>       R13: ffffc900460a7d28  R14: ffff89ea4de06a00  R15: ffffc900460a7d30
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>    #9 [ffffc900460a7d18] __kmem_cache_free at ffffffff9d3f65a0
>   #10 [ffffc900460a7d70] xlog_cil_committed at ffffffffc21e85af [xfs]
>   #11 [ffffc900460a7da0] xlog_cil_process_committed at ffffffffc21e9747 [xfs]
>   #12 [ffffc900460a7dd0] xlog_state_do_iclog_callbacks at ffffffffc21e41eb [xfs]
>   #13 [ffffc900460a7e28] xlog_state_do_callback at ffffffffc21e436f [xfs]
>   #14 [ffffc900460a7e50] xlog_ioend_work at ffffffffc21e6e1c [xfs]
>   #15 [ffffc900460a7e70] process_one_work at ffffffff9d12de69
>   #16 [ffffc900460a7ea8] worker_thread at ffffffff9d12e79b
>   #17 [ffffc900460a7ef8] kthread at ffffffff9d1378fc
>   #18 [ffffc900460a7f30] ret_from_fork at ffffffff9d042dd0
>   #19 [ffffc900460a7f50] ret_from_fork_asm at ffffffff9d007e2b
> 
> This patch adds cond_resched() to avoid softlockups similar to the one
> described above.

You've elided the soft lockup messages that tell us how long this
task was holding the CPU. What is the soft lockup timer set to?
What workload causes this to happen? How do we reproduce it?

FWIW, yes, there might be several tens of thousands of objects to
move to the AIL in this journal IO completion path, but if this
takes more than a couple of hundred milliseconds of processing time
then there is something else going wrong....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] xfs: take a breath in xlog_ioend_work()
  2026-02-05 10:54   ` Dave Chinner
@ 2026-02-05 12:49     ` Jinliang Zheng
  2026-02-05 20:27       ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Jinliang Zheng @ 2026-02-05 12:49 UTC (permalink / raw)
  To: david; +Cc: alexjlzheng, alexjlzheng, cem, linux-kernel, linux-xfs

On Thu, 5 Feb 2026 21:54:49 +1100, david@fromorbit.com wrote:
> On Thu, Feb 05, 2026 at 04:26:20PM +0800, alexjlzheng@gmail.com wrote:
> > From: Jinliang Zheng <alexjlzheng@tencent.com>
> > 
> > The xlog_ioend_work() function contains several nested loops with
> > fairly complex operations, which may lead to:
> > 
> >   PID: 2604722  TASK: ffff88c08306b1c0  CPU: 263  COMMAND: "kworker/263:0H"
> >    #0 [ffffc9001cbf8d58] machine_kexec at ffffffff9d086081
> >    #1 [ffffc9001cbf8db8] __crash_kexec at ffffffff9d20817a
> >    #2 [ffffc9001cbf8e78] panic at ffffffff9d107d8f
> >    #3 [ffffc9001cbf8ef8] watchdog_timer_fn at ffffffff9d243511
> >    #4 [ffffc9001cbf8f28] __hrtimer_run_queues at ffffffff9d1e62ff
> >    #5 [ffffc9001cbf8f80] hrtimer_interrupt at ffffffff9d1e73d4
> >    #6 [ffffc9001cbf8fd8] __sysvec_apic_timer_interrupt at ffffffff9d07bb29
> >    #7 [ffffc9001cbf8ff0] sysvec_apic_timer_interrupt at ffffffff9dd689f9
> >   --- <IRQ stack> ---
> >    #8 [ffffc900460a7c28] asm_sysvec_apic_timer_interrupt at ffffffff9de00e86
> >       [exception RIP: slab_free_freelist_hook.constprop.0+107]
> >       RIP: ffffffff9d3ef74b  RSP: ffffc900460a7cd0  RFLAGS: 00000286
> >       RAX: ffff89ea4de06b00  RBX: ffff89ea4de06a00  RCX: ffff89ea4de06a00
> >       RDX: 0000000000000100  RSI: ffffc900460a7d28  RDI: ffff888100044c80
> >       RBP: ffff888100044c80   R8: 0000000000000000   R9: ffffffffc21e8500
> >       R10: ffff88c867e93200  R11: 0000000000000001  R12: ffff89ea4de06a00
> >       R13: ffffc900460a7d28  R14: ffff89ea4de06a00  R15: ffffc900460a7d30
> >       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> >    #9 [ffffc900460a7d18] __kmem_cache_free at ffffffff9d3f65a0
> >   #10 [ffffc900460a7d70] xlog_cil_committed at ffffffffc21e85af [xfs]
> >   #11 [ffffc900460a7da0] xlog_cil_process_committed at ffffffffc21e9747 [xfs]
> >   #12 [ffffc900460a7dd0] xlog_state_do_iclog_callbacks at ffffffffc21e41eb [xfs]
> >   #13 [ffffc900460a7e28] xlog_state_do_callback at ffffffffc21e436f [xfs]
> >   #14 [ffffc900460a7e50] xlog_ioend_work at ffffffffc21e6e1c [xfs]
> >   #15 [ffffc900460a7e70] process_one_work at ffffffff9d12de69
> >   #16 [ffffc900460a7ea8] worker_thread at ffffffff9d12e79b
> >   #17 [ffffc900460a7ef8] kthread at ffffffff9d1378fc
> >   #18 [ffffc900460a7f30] ret_from_fork at ffffffff9d042dd0
> >   #19 [ffffc900460a7f50] ret_from_fork_asm at ffffffff9d007e2b
> > 
> > This patch adds cond_resched() to avoid softlockups similar to the one
> > described above.
> 
> You've elided the soft lockup messages that tell us how long this
> task was holding the CPU. What is the soft lockup timer set to?
> What workload causes this to happen? How do we reproduce it?

Thanks for your reply. :)

The soft lockup timer is set to 20s, and the CPU was held for 22s.

The workload is a test suite combining stress-ng, LTP, and fio,
executed concurrently. I believe the issue reproduces only
intermittently.

thanks,
Jinliang Zheng. :)

> 
> FWIW, yes, there might be several tens of thousands of objects to
> move to the AIL in this journal IO completion path, but if this
> takes more than a couple of hundred milliseconds of processing time
> then there is something else going wrong....

Is it possible that the kernel’s CPU and memory were under high load,
causing each iteration of the loop to take more time and eventually
accumulating to over 20 seconds?

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 1/2] xfs: take a breath in xlog_ioend_work()
  2026-02-05 12:49     ` Jinliang Zheng
@ 2026-02-05 20:27       ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2026-02-05 20:27 UTC (permalink / raw)
  To: Jinliang Zheng; +Cc: alexjlzheng, cem, linux-kernel, linux-xfs

On Thu, Feb 05, 2026 at 08:49:38PM +0800, Jinliang Zheng wrote:
> On Thu, 5 Feb 2026 21:54:49 +1100, david@fromorbit.com wrote:
> > On Thu, Feb 05, 2026 at 04:26:20PM +0800, alexjlzheng@gmail.com wrote:
> > > From: Jinliang Zheng <alexjlzheng@tencent.com>
> > > 
> > > The xlog_ioend_work() function contains several nested loops with
> > > fairly complex operations, which may lead to:
> > > 
> > >   PID: 2604722  TASK: ffff88c08306b1c0  CPU: 263  COMMAND: "kworker/263:0H"
> > >    #0 [ffffc9001cbf8d58] machine_kexec at ffffffff9d086081
> > >    #1 [ffffc9001cbf8db8] __crash_kexec at ffffffff9d20817a
> > >    #2 [ffffc9001cbf8e78] panic at ffffffff9d107d8f
> > >    #3 [ffffc9001cbf8ef8] watchdog_timer_fn at ffffffff9d243511
> > >    #4 [ffffc9001cbf8f28] __hrtimer_run_queues at ffffffff9d1e62ff
> > >    #5 [ffffc9001cbf8f80] hrtimer_interrupt at ffffffff9d1e73d4
> > >    #6 [ffffc9001cbf8fd8] __sysvec_apic_timer_interrupt at ffffffff9d07bb29
> > >    #7 [ffffc9001cbf8ff0] sysvec_apic_timer_interrupt at ffffffff9dd689f9
> > >   --- <IRQ stack> ---
> > >    #8 [ffffc900460a7c28] asm_sysvec_apic_timer_interrupt at ffffffff9de00e86
> > >       [exception RIP: slab_free_freelist_hook.constprop.0+107]
> > >       RIP: ffffffff9d3ef74b  RSP: ffffc900460a7cd0  RFLAGS: 00000286
> > >       RAX: ffff89ea4de06b00  RBX: ffff89ea4de06a00  RCX: ffff89ea4de06a00
> > >       RDX: 0000000000000100  RSI: ffffc900460a7d28  RDI: ffff888100044c80
> > >       RBP: ffff888100044c80   R8: 0000000000000000   R9: ffffffffc21e8500
> > >       R10: ffff88c867e93200  R11: 0000000000000001  R12: ffff89ea4de06a00
> > >       R13: ffffc900460a7d28  R14: ffff89ea4de06a00  R15: ffffc900460a7d30
> > >       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > >    #9 [ffffc900460a7d18] __kmem_cache_free at ffffffff9d3f65a0
> > >   #10 [ffffc900460a7d70] xlog_cil_committed at ffffffffc21e85af [xfs]
> > >   #11 [ffffc900460a7da0] xlog_cil_process_committed at ffffffffc21e9747 [xfs]
> > >   #12 [ffffc900460a7dd0] xlog_state_do_iclog_callbacks at ffffffffc21e41eb [xfs]
> > >   #13 [ffffc900460a7e28] xlog_state_do_callback at ffffffffc21e436f [xfs]
> > >   #14 [ffffc900460a7e50] xlog_ioend_work at ffffffffc21e6e1c [xfs]
> > >   #15 [ffffc900460a7e70] process_one_work at ffffffff9d12de69
> > >   #16 [ffffc900460a7ea8] worker_thread at ffffffff9d12e79b
> > >   #17 [ffffc900460a7ef8] kthread at ffffffff9d1378fc
> > >   #18 [ffffc900460a7f30] ret_from_fork at ffffffff9d042dd0
> > >   #19 [ffffc900460a7f50] ret_from_fork_asm at ffffffff9d007e2b
> > > 
> > > This patch adds cond_resched() to avoid softlockups similar to the one
> > > described above.
> > 
> > You've elided the soft lockup messages that tell us how long this
> > task was holding the CPU. What is the soft lockup timer set to?
> > What workload causes this to happen? How do we reproduce it?
> 
> Thanks for your reply. :)
> 
> The soft lockup timer is set to 20s, and the CPU was held for 22s.

Yep, something else must be screwed up here - an iclog completion
should never have enough items attached to it that it takes this long
to process them without yielding the CPU.

We can only loop once around the iclogs in this path these days,
because the iclog we are completing runs xlog_ioend_work() with
the iclog->ic_sema held. This locks out new IO being issued on that
iclog whilst we are processing the completion, and hence the
iclogbuf ring will stall trying to write new items to this iclog.

Hence whilst we are processing an iclog completion, the entire
journal (and hence filesystem) can stall waiting for the completion
to finish and release the iclog->ic_sema.

This also means we can have, at most, 8 iclogs worth of journal
writes to complete in xlog_ioend_work(). That greatly limits the
number of items we are processing in the xlog_state_do_callback()
loops. Yes, it can be tens of thousands of items, but it is bound by
journal size and the checkpoint pipeline depth (maximum of 4
checkpoints in flight at once).

So I don't see how the number of items that we are asking to be
processed during journal completion, by itself, can cause such a
long processing time. We typically process upwards of a thousand
items per millisecond here...
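
To put rough numbers on this, here is a small illustrative sketch. It
is not kernel code: the helper names are invented for this example, and
the 50,000-item figure used below is an assumed stand-in for "tens of
thousands". The rate of 1000 items per millisecond and the 22s hold
time are the figures quoted in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Milliseconds needed to process `items` at `items_per_ms`, rounded up. */
uint64_t processing_time_ms(uint64_t items, uint64_t items_per_ms)
{
	assert(items_per_ms > 0);
	return (items + items_per_ms - 1) / items_per_ms;
}

/*
 * Items that would have to be processed back-to-back to keep a CPU
 * busy for `seconds` at a steady rate of `items_per_ms`.
 */
uint64_t items_for_duration(uint64_t seconds, uint64_t items_per_ms)
{
	return seconds * 1000 * items_per_ms;
}
```

Under these assumptions, processing_time_ms(50000, 1000) gives about
50 milliseconds, while items_for_duration(22, 1000) gives 22 million
items, roughly three orders of magnitude more than the item count
alone would explain.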

> The workload is a test suite combining stress-ng, LTP, and fio,
> executed concurrently. I believe the issue reproduces only
> intermittently.
> 
> thanks,
> Jinliang Zheng. :)
> 
> > 
> > FWIW, yes, there might be several tens of thousands of objects to
> > move to the AIL in this journal IO completion path, but if this
> > takes more than a couple of hundred milliseconds of processing time
> > then there is something else going wrong....
> 
> Is it possible that the kernel’s CPU and memory were under high load,
> causing each iteration of the loop to take more time and eventually
> accumulating to over 20 seconds?

That's a characteristic behaviour of catastrophic spin lock
contention...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* [PATCH 2/2] xfs: take a breath in xfsaild()
  2026-02-05  8:26 [PATCH 0/2] Add cond_resched() in some place to avoid softlockup alexjlzheng
  2026-02-05  8:26 ` [PATCH 1/2] xfs: take a breath in xlog_ioend_work() alexjlzheng
@ 2026-02-05  8:26 ` alexjlzheng
  2026-02-05 11:44   ` Dave Chinner
  2026-02-05 10:39 ` [PATCH 0/2] Add cond_resched() in some place to avoid softlockup Dave Chinner
  2026-02-13 21:38 ` Dave Chinner
  3 siblings, 1 reply; 11+ messages in thread
From: alexjlzheng @ 2026-02-05  8:26 UTC (permalink / raw)
  To: cem; +Cc: linux-xfs, linux-kernel, Jinliang Zheng

From: Jinliang Zheng <alexjlzheng@tencent.com>

We noticed a softlockup like:

  crash> bt
  PID: 5153     TASK: ffff8960a7ca0000  CPU: 115  COMMAND: "xfsaild/dm-4"
   #0 [ffffc9001b1d4d58] machine_kexec at ffffffff9b086081
   #1 [ffffc9001b1d4db8] __crash_kexec at ffffffff9b20817a
   #2 [ffffc9001b1d4e78] panic at ffffffff9b107d8f
   #3 [ffffc9001b1d4ef8] watchdog_timer_fn at ffffffff9b243511
   #4 [ffffc9001b1d4f28] __hrtimer_run_queues at ffffffff9b1e62ff
   #5 [ffffc9001b1d4f80] hrtimer_interrupt at ffffffff9b1e73d4
   #6 [ffffc9001b1d4fd8] __sysvec_apic_timer_interrupt at ffffffff9b07bb29
   #7 [ffffc9001b1d4ff0] sysvec_apic_timer_interrupt at ffffffff9bd689f9
  --- <IRQ stack> ---
   #8 [ffffc90031cd3a18] asm_sysvec_apic_timer_interrupt at ffffffff9be00e86
      [exception RIP: part_in_flight+47]
      RIP: ffffffff9b67960f  RSP: ffffc90031cd3ac8  RFLAGS: 00000282
      RAX: 00000000000000a9  RBX: 00000000000c4645  RCX: 00000000000000f5
      RDX: ffffe89fffa36fe0  RSI: 0000000000000180  RDI: ffffffff9d1ae260
      RBP: ffff898083d30000   R8: 00000000000000a8   R9: 0000000000000000
      R10: ffff89808277d800  R11: 0000000000001000  R12: 0000000101a7d5be
      R13: 0000000000000000  R14: 0000000000001001  R15: 0000000000001001
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #9 [ffffc90031cd3ad8] update_io_ticks at ffffffff9b6602e4
  #10 [ffffc90031cd3b00] bdev_start_io_acct at ffffffff9b66031b
  #11 [ffffc90031cd3b20] dm_io_acct at ffffffffc18d7f98 [dm_mod]
  #12 [ffffc90031cd3b50] dm_submit_bio_remap at ffffffffc18d8195 [dm_mod]
  #13 [ffffc90031cd3b70] dm_split_and_process_bio at ffffffffc18d9799 [dm_mod]
  #14 [ffffc90031cd3be0] dm_submit_bio at ffffffffc18d9b07 [dm_mod]
  #15 [ffffc90031cd3c20] __submit_bio at ffffffff9b65f61c
  #16 [ffffc90031cd3c38] __submit_bio_noacct at ffffffff9b65f73e
  #17 [ffffc90031cd3c80] xfs_buf_ioapply_map at ffffffffc23df4ea [xfs]
  #18 [ffffc90031cd3ce0] _xfs_buf_ioapply at ffffffffc23df64f [xfs]
  #19 [ffffc90031cd3d50] __xfs_buf_submit at ffffffffc23df7b8 [xfs]
  #20 [ffffc90031cd3d70] xfs_buf_delwri_submit_buffers at ffffffffc23dffbd [xfs]
  #21 [ffffc90031cd3df8] xfsaild_push at ffffffffc24268e5 [xfs]
  #22 [ffffc90031cd3eb8] xfsaild at ffffffffc2426f88 [xfs]
  #23 [ffffc90031cd3ef8] kthread at ffffffff9b1378fc
  #24 [ffffc90031cd3f30] ret_from_fork at ffffffff9b042dd0
  #25 [ffffc90031cd3f50] ret_from_fork_asm at ffffffff9b007e2b

This patch adds cond_resched() to avoid softlockups similar to the one
described above.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
 fs/xfs/xfs_buf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 47edf3041631..f1f8595d5e40 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -2026,6 +2026,8 @@ xfs_buf_delwri_submit_nowait(
 		bp->b_flags |= XBF_ASYNC;
 		xfs_buf_list_del(bp);
 		xfs_buf_submit(bp);
+
+		cond_resched();
 	}
 	blk_finish_plug(&plug);
 
-- 
2.49.0



* Re: [PATCH 2/2] xfs: take a breath in xfsaild()
  2026-02-05  8:26 ` [PATCH 2/2] xfs: take a breath in xfsaild() alexjlzheng
@ 2026-02-05 11:44   ` Dave Chinner
  2026-02-05 12:49     ` Jinliang Zheng
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2026-02-05 11:44 UTC (permalink / raw)
  To: alexjlzheng; +Cc: cem, linux-xfs, linux-kernel, Jinliang Zheng

On Thu, Feb 05, 2026 at 04:26:21PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> We noticed a softlockup like:
> 
>   crash> bt
>   PID: 5153     TASK: ffff8960a7ca0000  CPU: 115  COMMAND: "xfsaild/dm-4"
>    #0 [ffffc9001b1d4d58] machine_kexec at ffffffff9b086081
>    #1 [ffffc9001b1d4db8] __crash_kexec at ffffffff9b20817a
>    #2 [ffffc9001b1d4e78] panic at ffffffff9b107d8f
>    #3 [ffffc9001b1d4ef8] watchdog_timer_fn at ffffffff9b243511
>    #4 [ffffc9001b1d4f28] __hrtimer_run_queues at ffffffff9b1e62ff
>    #5 [ffffc9001b1d4f80] hrtimer_interrupt at ffffffff9b1e73d4
>    #6 [ffffc9001b1d4fd8] __sysvec_apic_timer_interrupt at ffffffff9b07bb29
>    #7 [ffffc9001b1d4ff0] sysvec_apic_timer_interrupt at ffffffff9bd689f9
>   --- <IRQ stack> ---
>    #8 [ffffc90031cd3a18] asm_sysvec_apic_timer_interrupt at ffffffff9be00e86
>       [exception RIP: part_in_flight+47]
>       RIP: ffffffff9b67960f  RSP: ffffc90031cd3ac8  RFLAGS: 00000282
>       RAX: 00000000000000a9  RBX: 00000000000c4645  RCX: 00000000000000f5
>       RDX: ffffe89fffa36fe0  RSI: 0000000000000180  RDI: ffffffff9d1ae260
>       RBP: ffff898083d30000   R8: 00000000000000a8   R9: 0000000000000000
>       R10: ffff89808277d800  R11: 0000000000001000  R12: 0000000101a7d5be
>       R13: 0000000000000000  R14: 0000000000001001  R15: 0000000000001001
>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>    #9 [ffffc90031cd3ad8] update_io_ticks at ffffffff9b6602e4
>   #10 [ffffc90031cd3b00] bdev_start_io_acct at ffffffff9b66031b
>   #11 [ffffc90031cd3b20] dm_io_acct at ffffffffc18d7f98 [dm_mod]
>   #12 [ffffc90031cd3b50] dm_submit_bio_remap at ffffffffc18d8195 [dm_mod]
>   #13 [ffffc90031cd3b70] dm_split_and_process_bio at ffffffffc18d9799 [dm_mod]
>   #14 [ffffc90031cd3be0] dm_submit_bio at ffffffffc18d9b07 [dm_mod]
>   #15 [ffffc90031cd3c20] __submit_bio at ffffffff9b65f61c
>   #16 [ffffc90031cd3c38] __submit_bio_noacct at ffffffff9b65f73e
>   #17 [ffffc90031cd3c80] xfs_buf_ioapply_map at ffffffffc23df4ea [xfs]

This isn't from a TOT kernel. xfs_buf_ioapply_map() went away a year
ago. What kernel is this occurring on?

>   #18 [ffffc90031cd3ce0] _xfs_buf_ioapply at ffffffffc23df64f [xfs]
>   #19 [ffffc90031cd3d50] __xfs_buf_submit at ffffffffc23df7b8 [xfs]
>   #20 [ffffc90031cd3d70] xfs_buf_delwri_submit_buffers at ffffffffc23dffbd [xfs]
>   #21 [ffffc90031cd3df8] xfsaild_push at ffffffffc24268e5 [xfs]
>   #22 [ffffc90031cd3eb8] xfsaild at ffffffffc2426f88 [xfs]
>   #23 [ffffc90031cd3ef8] kthread at ffffffff9b1378fc
>   #24 [ffffc90031cd3f30] ret_from_fork at ffffffff9b042dd0
>   #25 [ffffc90031cd3f50] ret_from_fork_asm at ffffffff9b007e2b
> 
> This patch adds cond_resched() to avoid softlockups similar to the one
> described above.

Again: how does this softlockup occur?

xfsaild_push() pushes at most 1000 items at a time for IO.  It would
have to be a fairly fast device not to block on the request queues
filling as we submit batches of 1000 buffers at a time.

Then the higher level AIL traversal loop would also have to be
making continuous progress without blocking. Hence it must not hit
the end of the AIL, nor ever hit pinned, stale, flushing or locked
items in the AIL for as long as it takes for the soft lockup timer
to fire.  This seems ... highly unlikely.

IOWs, if we are looping in this path without giving up the CPU for
seconds at a time, then it is not behaving as I'd expect it to
behave. We need to understand why this code is apparently behaving
in an unexpected way, not just silence the warning....

Can you please explain how the softlockup timer is being hit here so we
can try to understand the root cause of the problem? Workload,
hardware, filesystem config, storage stack, etc all matter here,
because they all play a part in these paths never blocking on
a lock, a full queue, a pinned buffer, etc, whilst processing
hundreds of thousands of dirty objects for IO.

At least, I'm assuming we're talking about hundreds of thousands of
objects, because I know the AIL can push a hundred thousand dirty
buffers to disk every second when it is close to being CPU bound. So
if it's not giving up the CPU for long enough to fire the soft
lockup timer, we must be talking about processing millions of
objects without blocking even once....
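
As a back-of-the-envelope sketch of that estimate (illustrative only,
not kernel code): the helper names below are invented, and the 20s
threshold is an assumption taken from the value reported elsewhere in
this thread. The push rate of 100,000 buffers per second and the batch
size of 1000 are the figures given above.

```c
#include <assert.h>
#include <stdint.h>

/* Objects pushed in `seconds` at a steady `objects_per_sec`. */
uint64_t objects_pushed(uint64_t objects_per_sec, uint64_t seconds)
{
	return objects_per_sec * seconds;
}

/* Number of batches of `batch_size` needed for `objects`, rounded up. */
uint64_t batches_needed(uint64_t objects, uint64_t batch_size)
{
	assert(batch_size > 0);
	return (objects + batch_size - 1) / batch_size;
}
```

With those inputs, objects_pushed(100000, 20) is 2 million objects, and
batches_needed(2000000, 1000) is 2000 batches. Each of those batches
would have to be submitted without blocking once for the timer to fire.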

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: take a breath in xfsaild()
  2026-02-05 11:44   ` Dave Chinner
@ 2026-02-05 12:49     ` Jinliang Zheng
  2026-02-05 21:17       ` Dave Chinner
  0 siblings, 1 reply; 11+ messages in thread
From: Jinliang Zheng @ 2026-02-05 12:49 UTC (permalink / raw)
  To: david; +Cc: alexjlzheng, alexjlzheng, cem, linux-kernel, linux-xfs

On Thu, 5 Feb 2026 22:44:51 +1100, david@fromorbit.com wrote:
> On Thu, Feb 05, 2026 at 04:26:21PM +0800, alexjlzheng@gmail.com wrote:
> > From: Jinliang Zheng <alexjlzheng@tencent.com>
> > 
> > We noticed a softlockup like:
> > 
> >   crash> bt
> >   PID: 5153     TASK: ffff8960a7ca0000  CPU: 115  COMMAND: "xfsaild/dm-4"
> >    #0 [ffffc9001b1d4d58] machine_kexec at ffffffff9b086081
> >    #1 [ffffc9001b1d4db8] __crash_kexec at ffffffff9b20817a
> >    #2 [ffffc9001b1d4e78] panic at ffffffff9b107d8f
> >    #3 [ffffc9001b1d4ef8] watchdog_timer_fn at ffffffff9b243511
> >    #4 [ffffc9001b1d4f28] __hrtimer_run_queues at ffffffff9b1e62ff
> >    #5 [ffffc9001b1d4f80] hrtimer_interrupt at ffffffff9b1e73d4
> >    #6 [ffffc9001b1d4fd8] __sysvec_apic_timer_interrupt at ffffffff9b07bb29
> >    #7 [ffffc9001b1d4ff0] sysvec_apic_timer_interrupt at ffffffff9bd689f9
> >   --- <IRQ stack> ---
> >    #8 [ffffc90031cd3a18] asm_sysvec_apic_timer_interrupt at ffffffff9be00e86
> >       [exception RIP: part_in_flight+47]
> >       RIP: ffffffff9b67960f  RSP: ffffc90031cd3ac8  RFLAGS: 00000282
> >       RAX: 00000000000000a9  RBX: 00000000000c4645  RCX: 00000000000000f5
> >       RDX: ffffe89fffa36fe0  RSI: 0000000000000180  RDI: ffffffff9d1ae260
> >       RBP: ffff898083d30000   R8: 00000000000000a8   R9: 0000000000000000
> >       R10: ffff89808277d800  R11: 0000000000001000  R12: 0000000101a7d5be
> >       R13: 0000000000000000  R14: 0000000000001001  R15: 0000000000001001
> >       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> >    #9 [ffffc90031cd3ad8] update_io_ticks at ffffffff9b6602e4
> >   #10 [ffffc90031cd3b00] bdev_start_io_acct at ffffffff9b66031b
> >   #11 [ffffc90031cd3b20] dm_io_acct at ffffffffc18d7f98 [dm_mod]
> >   #12 [ffffc90031cd3b50] dm_submit_bio_remap at ffffffffc18d8195 [dm_mod]
> >   #13 [ffffc90031cd3b70] dm_split_and_process_bio at ffffffffc18d9799 [dm_mod]
> >   #14 [ffffc90031cd3be0] dm_submit_bio at ffffffffc18d9b07 [dm_mod]
> >   #15 [ffffc90031cd3c20] __submit_bio at ffffffff9b65f61c
> >   #16 [ffffc90031cd3c38] __submit_bio_noacct at ffffffff9b65f73e
> >   #17 [ffffc90031cd3c80] xfs_buf_ioapply_map at ffffffffc23df4ea [xfs]
> 
> This isn't from a TOT kernel. xfs_buf_ioapply_map() went away a year
> ago. What kernel is this occurring on?

Thanks for your reply. :)

It's based on v6.6.

> 
> >   #18 [ffffc90031cd3ce0] _xfs_buf_ioapply at ffffffffc23df64f [xfs]
> >   #19 [ffffc90031cd3d50] __xfs_buf_submit at ffffffffc23df7b8 [xfs]
> >   #20 [ffffc90031cd3d70] xfs_buf_delwri_submit_buffers at ffffffffc23dffbd [xfs]
> >   #21 [ffffc90031cd3df8] xfsaild_push at ffffffffc24268e5 [xfs]
> >   #22 [ffffc90031cd3eb8] xfsaild at ffffffffc2426f88 [xfs]
> >   #23 [ffffc90031cd3ef8] kthread at ffffffff9b1378fc
> >   #24 [ffffc90031cd3f30] ret_from_fork at ffffffff9b042dd0
> >   #25 [ffffc90031cd3f50] ret_from_fork_asm at ffffffff9b007e2b
> > 
> > This patch adds cond_resched() to avoid softlockups similar to the one
> > described above.
> 
> Again: how does this softlockup occur?

[28089.641309] watchdog: BUG: soft lockup - CPU#115 stuck for 26s! [xfsaild/dm-4:5153] 

> 
> xfsaild_push() pushes at most 1000 items at a time for IO.  It would
> have to be a fairly fast device not to block on the request queues
> filling as we submit batches of 1000 buffers at a time.
> 
> Then the higher level AIL traversal loop would also have to be
> making continuous progress without blocking. Hence it must not hit
> the end of the AIL, nor ever hit pinned, stale, flushing or locked
> items in the AIL for as long as it takes for the soft lockup timer
> to fire.  This seems ... highly unlikely.
> 
> IOWs, if we are looping in this path without giving up the CPU for
> seconds at a time, then it is not behaving as I'd expect it to
> behave. We need to understand why this code is apparently behaving
> in an unexpected way, not just silence the warning....
> 
> Can you please explain how the softlockup timer is being hit here so we
> can try to understand the root cause of the problem? Workload,

Again, a testsuite combining stress-ng, LTP, and fio, executed concurrently.

> hardware, filesystem config, storage stack, etc all matter here,


================================= CPU ======================================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           45 bits physical, 48 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  384
On-line CPU(s) list:                     0-383

================================= MEM ======================================
[root@localhost ~]# free -h
               total        used        free      shared  buff/cache   available
Mem:           1.5Ti       479Gi       1.0Ti       2.0Gi       6.2Gi       1.0Ti
Swap:             0B          0B          0B


================================= XFS ======================================
[root@localhost ~]# xfs_info /dev/ts/home 
meta-data=/dev/mapper/ts-home    isize=512    agcount=4, agsize=45875200 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
data     =                       bsize=4096   blocks=183500800, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=89600, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


================================= IOS ======================================
sdb                                                                                                
├─sdb1        vfat        FAT32              EE62-ACF4                               591.3M     1% /boot/efi
├─sdb2        xfs                            95e0a12a-ed33-45b7-abe1-640c4e334468      1.6G    17% /boot
└─sdb3        LVM2_member LVM2 001           fSTN7x-BNcZ-w67a-gZBf-Vlps-7gtv-KAreS7                
  ├─ts-root   xfs                            f78c147c-86d7-4675-b9bd-ed0d512f37f4     84.5G    50% /
  └─ts-home   xfs                            01677289-32f2-41b0-ab59-3c5d14a1eefa    668.6G     4% /home

> because they all play a part in these paths never blocking on
> a lock, a full queue, a pinned buffer, etc, whilst processing
> hundreds of thousands of dirty objects for IO.

And, there's another softlockup, which is similar:

watchdog: BUG: soft lockup - CPU#342 stuck for 22s! [xfsaild/dm-4:5045]

  crash> bt
  PID: 5045     TASK: ffff89e0a0150000  CPU: 342  COMMAND: "xfsaild/dm-4"
   #0 [ffffc9001d98cd58] machine_kexec at ffffffffa8086081
   #1 [ffffc9001d98cdb8] __crash_kexec at ffffffffa820817a
   #2 [ffffc9001d98ce78] panic at ffffffffa8107d8f
   #3 [ffffc9001d98cef8] watchdog_timer_fn at ffffffffa8243511
   #4 [ffffc9001d98cf28] __hrtimer_run_queues at ffffffffa81e62ff
   #5 [ffffc9001d98cf80] hrtimer_interrupt at ffffffffa81e73d4
   #6 [ffffc9001d98cfd8] __sysvec_apic_timer_interrupt at ffffffffa807bb29
   #7 [ffffc9001d98cff0] sysvec_apic_timer_interrupt at ffffffffa8d689f9
  --- <IRQ stack> ---
   #8 [ffffc900351efa48] asm_sysvec_apic_timer_interrupt at ffffffffa8e00e86
      [exception RIP: kernel_fpu_begin_mask+66]
      RIP: ffffffffa8044f52  RSP: ffffc900351efaf8  RFLAGS: 00000206
      RAX: 0000000000000000  RBX: 0000000000000002  RCX: 0000000000000000
      RDX: 0000000000000000  RSI: ffff899f8f9ee000  RDI: ffff89e0a0150000
      RBP: 0000000000001000   R8: 0000000000000000   R9: 0000000000001000
      R10: 0000000000000000  R11: ffff899f8f9ee030  R12: ffffc900351efb38
      R13: ffff899f8f9ee000  R14: ffff88e084624158  R15: ffff88e083828000
      ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
   #9 [ffffc900351efb10] crc32c_pcl_intel_update at ffffffffc199b390 [crc32c_intel]
  #10 [ffffc900351efb30] crc32c at ffffffffc196d03f [libcrc32c]
  #11 [ffffc900351efcb8] xfs_dir3_data_write_verify at ffffffffc250cfd9 [xfs]
  #12 [ffffc900351efce0] _xfs_buf_ioapply at ffffffffc253d5cd [xfs]
  #13 [ffffc900351efd50] __xfs_buf_submit at ffffffffc253d7b8 [xfs]
  #14 [ffffc900351efd70] xfs_buf_delwri_submit_buffers at ffffffffc253dfbd [xfs]
  #15 [ffffc900351efdf8] xfsaild_push at ffffffffc25848e5 [xfs]
  #16 [ffffc900351efeb8] xfsaild at ffffffffc2584f88 [xfs]
  #17 [ffffc900351efef8] kthread at ffffffffa81378fc
  #18 [ffffc900351eff30] ret_from_fork at ffffffffa8042dd0
  #19 [ffffc900351eff50] ret_from_fork_asm at ffffffffa8007e2b

Thanks,
Jinliang Zheng. :)

> 
> At least, I'm assuming we're talking about hundreds of thousands of
> objects, because I know the AIL can push a hundred thousand dirty
> buffers to disk every second when it is close to being CPU bound. So
> if it's not giving up the CPU for long enough to fire the soft
> lockup timer, we must be talking about processing millions of
> objects without blocking even once....
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 2/2] xfs: take a breath in xfsaild()
  2026-02-05 12:49     ` Jinliang Zheng
@ 2026-02-05 21:17       ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2026-02-05 21:17 UTC (permalink / raw)
  To: Jinliang Zheng; +Cc: alexjlzheng, cem, linux-kernel, linux-xfs

On Thu, Feb 05, 2026 at 08:49:59PM +0800, Jinliang Zheng wrote:
> On Thu, 5 Feb 2026 22:44:51 +1100, david@fromorbit.com wrote:
> > On Thu, Feb 05, 2026 at 04:26:21PM +0800, alexjlzheng@gmail.com wrote:
> > > From: Jinliang Zheng <alexjlzheng@tencent.com>
> > > 
> > > We noticed a softlockup like:
> > > 
> > >   crash> bt
> > >   PID: 5153     TASK: ffff8960a7ca0000  CPU: 115  COMMAND: "xfsaild/dm-4"
> > >    #0 [ffffc9001b1d4d58] machine_kexec at ffffffff9b086081
> > >    #1 [ffffc9001b1d4db8] __crash_kexec at ffffffff9b20817a
> > >    #2 [ffffc9001b1d4e78] panic at ffffffff9b107d8f
> > >    #3 [ffffc9001b1d4ef8] watchdog_timer_fn at ffffffff9b243511
> > >    #4 [ffffc9001b1d4f28] __hrtimer_run_queues at ffffffff9b1e62ff
> > >    #5 [ffffc9001b1d4f80] hrtimer_interrupt at ffffffff9b1e73d4
> > >    #6 [ffffc9001b1d4fd8] __sysvec_apic_timer_interrupt at ffffffff9b07bb29
> > >    #7 [ffffc9001b1d4ff0] sysvec_apic_timer_interrupt at ffffffff9bd689f9
> > >   --- <IRQ stack> ---
> > >    #8 [ffffc90031cd3a18] asm_sysvec_apic_timer_interrupt at ffffffff9be00e86
> > >       [exception RIP: part_in_flight+47]
> > >       RIP: ffffffff9b67960f  RSP: ffffc90031cd3ac8  RFLAGS: 00000282
> > >       RAX: 00000000000000a9  RBX: 00000000000c4645  RCX: 00000000000000f5
> > >       RDX: ffffe89fffa36fe0  RSI: 0000000000000180  RDI: ffffffff9d1ae260
> > >       RBP: ffff898083d30000   R8: 00000000000000a8   R9: 0000000000000000
> > >       R10: ffff89808277d800  R11: 0000000000001000  R12: 0000000101a7d5be
> > >       R13: 0000000000000000  R14: 0000000000001001  R15: 0000000000001001
> > >       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > >    #9 [ffffc90031cd3ad8] update_io_ticks at ffffffff9b6602e4
> > >   #10 [ffffc90031cd3b00] bdev_start_io_acct at ffffffff9b66031b
> > >   #11 [ffffc90031cd3b20] dm_io_acct at ffffffffc18d7f98 [dm_mod]
> > >   #12 [ffffc90031cd3b50] dm_submit_bio_remap at ffffffffc18d8195 [dm_mod]
> > >   #13 [ffffc90031cd3b70] dm_split_and_process_bio at ffffffffc18d9799 [dm_mod]
> > >   #14 [ffffc90031cd3be0] dm_submit_bio at ffffffffc18d9b07 [dm_mod]
> > >   #15 [ffffc90031cd3c20] __submit_bio at ffffffff9b65f61c
> > >   #16 [ffffc90031cd3c38] __submit_bio_noacct at ffffffff9b65f73e
> > >   #17 [ffffc90031cd3c80] xfs_buf_ioapply_map at ffffffffc23df4ea [xfs]
> > 
> > This isn't from a TOT kernel. xfs_buf_ioapply_map() went away a year
> > ago. What kernel is this occurring on?
> 
> Thanks for your reply. :)
> 
> It's based on v6.6.

v6.6 was released in late 2023. I think we largely fixed this
problem with this series that was merged into 6.11 in mid 2024:

https://lore.kernel.org/linux-xfs/20220809230353.3353059-1-david@fromorbit.com/

In more detail...

> > Can you please explain how the softlockup timer is being hit here so we
> > can try to understand the root cause of the problem? Workload,
> 
> Again, a testsuite combining stress-ng, LTP, and fio, executed concurrently.
> 
> > hardware, filesystem config, storage stack, etc all matter here,
> 
> 
> ================================= CPU ======================================
> Architecture:                            x86_64
> CPU op-mode(s):                          32-bit, 64-bit
> Address sizes:                           45 bits physical, 48 bits virtual
> Byte Order:                              Little Endian
> CPU(s):                                  384

... 384 CPUs banging on a single filesystem....

> ================================= XFS ======================================
> [root@localhost ~]# xfs_info /dev/ts/home 
> meta-data=/dev/mapper/ts-home    isize=512    agcount=4, agsize=45875200 blks

... that has very limited parallelism, and ...

>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
> data     =                       bsize=4096   blocks=183500800, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=89600, version=2

... a relatively small log (350MB) compared to the size of the
system that is hammering on it.

i.e. This is exactly the sort of system architecture that will push
heaps of concurrency into the filesystem's transaction reservation
slow path and keep it there for long periods of time. Especially
under sustained, highly concurrent, modification heavy stress
workloads.

Exposing any kernel spin lock to unbound user controlled
concurrency will eventually result in a workload that causes
catastrophic spin lock contention breakdown. Then everything that
uses said lock will spend excessive amounts of time spinning and not
making progress.

This is one of the scalability problems the patchset I linked above
addressed. Prior to that patchset, the transaction reservation slow
path (the "journal full" path) exposed the AIL lock to unbound
userspace concurrency via the "update the AIL push target"
mechanism. Both journal IO completion and the xfsaild are heavy
users of the AIL lock, but don't normally contend with each other
because internal filesystem concurrency is tightly bounded. Once
userspace starts banging on it, however....

Silencing soft lockups with cond_resched() is almost never the right
thing to do - they are generally indicative of some other problem
occurring. We need to understand what that "some other problem" is
before we do anything else...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/2] Add cond_resched() in some place to avoid softlockup
  2026-02-05  8:26 [PATCH 0/2] Add cond_resched() in some place to avoid softlockup alexjlzheng
  2026-02-05  8:26 ` [PATCH 1/2] xfs: take a breath in xlog_ioend_work() alexjlzheng
  2026-02-05  8:26 ` [PATCH 2/2] xfs: take a breath in xfsaild() alexjlzheng
@ 2026-02-05 10:39 ` Dave Chinner
  2026-02-13 21:38 ` Dave Chinner
  3 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2026-02-05 10:39 UTC (permalink / raw)
  To: alexjlzheng; +Cc: cem, linux-xfs, linux-kernel, Jinliang Zheng

On Thu, Feb 05, 2026 at 04:26:19PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> We recently observed several XFS-related softlockups in non-preempt
> kernels during stability testing, and we believe adding a few
> cond_resched()calls would be beneficial.

I was under the impression that there was a general kernel-wide NAK
in place for adding new cond_resched() points to hack around
the CONFIG_PREEMPT_NONE problems once CONFIG_PREEMPT_LAZY was
introduced...

https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 0/2] Add cond_resched() in some place to avoid softlockup
  2026-02-05  8:26 [PATCH 0/2] Add cond_resched() in some place to avoid softlockup alexjlzheng
                   ` (2 preceding siblings ...)
  2026-02-05 10:39 ` [PATCH 0/2] Add cond_resched() in some place to avoid softlockup Dave Chinner
@ 2026-02-13 21:38 ` Dave Chinner
  3 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2026-02-13 21:38 UTC (permalink / raw)
  To: alexjlzheng; +Cc: cem, linux-xfs, linux-kernel, Jinliang Zheng

On Thu, Feb 05, 2026 at 04:26:19PM +0800, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> We recently observed several XFS-related softlockups in non-preempt
> kernels during stability testing, and we believe adding a few
> cond_resched()calls would be beneficial.
> 
> Jinliang Zheng (2):
>   xfs: take a breath in xlog_ioend_work()
>   xfs: take a breath in xfsaild()
> 
>  fs/xfs/xfs_buf.c     | 2 ++
>  fs/xfs/xfs_log_cil.c | 3 +++
>  2 files changed, 5 insertions(+)

To follow up on my comments about cond_resched(), commit
7dadeaa6e851 ("sched: Further restrict the preemption modes") was
just merged into 7.0. This means the only two supported preempt
modes for all the main architectures are PREEMPT_FULL and
PREEMPT_LAZY.

i.e. PREEMPT_NONE and PREEMPT_VOLUNTARY are essentially gone and
only remain on fringe architectures that do not support preemption
or have not yet been fully ported to support preemption.

Hence we should be starting to consider the removal of all the
cond_resched() points we have in the code, not adding more...

-Dave.
-- 
Dave Chinner
dgc@kernel.org


end of thread, other threads:[~2026-02-13 21:38 UTC | newest]

Thread overview: 11+ messages
2026-02-05  8:26 [PATCH 0/2] Add cond_resched() in some place to avoid softlockup alexjlzheng
2026-02-05  8:26 ` [PATCH 1/2] xfs: take a breath in xlog_ioend_work() alexjlzheng
2026-02-05 10:54   ` Dave Chinner
2026-02-05 12:49     ` Jinliang Zheng
2026-02-05 20:27       ` Dave Chinner
2026-02-05  8:26 ` [PATCH 2/2] xfs: take a breath in xfsaild() alexjlzheng
2026-02-05 11:44   ` Dave Chinner
2026-02-05 12:49     ` Jinliang Zheng
2026-02-05 21:17       ` Dave Chinner
2026-02-05 10:39 ` [PATCH 0/2] Add cond_resched() in some place to avoid softlockup Dave Chinner
2026-02-13 21:38 ` Dave Chinner
