public inbox for linux-kernel@vger.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Jinliang Zheng <alexjlzheng@gmail.com>
Cc: alexjlzheng@tencent.com, cem@kernel.org,
	linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [PATCH 2/2] xfs: take a breath in xfsaild()
Date: Fri, 6 Feb 2026 08:17:56 +1100	[thread overview]
Message-ID: <aYUJBChyWi3WdOIR@dread.disaster.area> (raw)
In-Reply-To: <20260205125000.2324010-1-alexjlzheng@tencent.com>

On Thu, Feb 05, 2026 at 08:49:59PM +0800, Jinliang Zheng wrote:
> On Thu, 5 Feb 2026 22:44:51 +1100, david@fromorbit.com wrote:
> > On Thu, Feb 05, 2026 at 04:26:21PM +0800, alexjlzheng@gmail.com wrote:
> > > From: Jinliang Zheng <alexjlzheng@tencent.com>
> > > 
> > > We noticed a softlockup like:
> > > 
> > >   crash> bt
> > >   PID: 5153     TASK: ffff8960a7ca0000  CPU: 115  COMMAND: "xfsaild/dm-4"
> > >    #0 [ffffc9001b1d4d58] machine_kexec at ffffffff9b086081
> > >    #1 [ffffc9001b1d4db8] __crash_kexec at ffffffff9b20817a
> > >    #2 [ffffc9001b1d4e78] panic at ffffffff9b107d8f
> > >    #3 [ffffc9001b1d4ef8] watchdog_timer_fn at ffffffff9b243511
> > >    #4 [ffffc9001b1d4f28] __hrtimer_run_queues at ffffffff9b1e62ff
> > >    #5 [ffffc9001b1d4f80] hrtimer_interrupt at ffffffff9b1e73d4
> > >    #6 [ffffc9001b1d4fd8] __sysvec_apic_timer_interrupt at ffffffff9b07bb29
> > >    #7 [ffffc9001b1d4ff0] sysvec_apic_timer_interrupt at ffffffff9bd689f9
> > >   --- <IRQ stack> ---
> > >    #8 [ffffc90031cd3a18] asm_sysvec_apic_timer_interrupt at ffffffff9be00e86
> > >       [exception RIP: part_in_flight+47]
> > >       RIP: ffffffff9b67960f  RSP: ffffc90031cd3ac8  RFLAGS: 00000282
> > >       RAX: 00000000000000a9  RBX: 00000000000c4645  RCX: 00000000000000f5
> > >       RDX: ffffe89fffa36fe0  RSI: 0000000000000180  RDI: ffffffff9d1ae260
> > >       RBP: ffff898083d30000   R8: 00000000000000a8   R9: 0000000000000000
> > >       R10: ffff89808277d800  R11: 0000000000001000  R12: 0000000101a7d5be
> > >       R13: 0000000000000000  R14: 0000000000001001  R15: 0000000000001001
> > >       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > >    #9 [ffffc90031cd3ad8] update_io_ticks at ffffffff9b6602e4
> > >   #10 [ffffc90031cd3b00] bdev_start_io_acct at ffffffff9b66031b
> > >   #11 [ffffc90031cd3b20] dm_io_acct at ffffffffc18d7f98 [dm_mod]
> > >   #12 [ffffc90031cd3b50] dm_submit_bio_remap at ffffffffc18d8195 [dm_mod]
> > >   #13 [ffffc90031cd3b70] dm_split_and_process_bio at ffffffffc18d9799 [dm_mod]
> > >   #14 [ffffc90031cd3be0] dm_submit_bio at ffffffffc18d9b07 [dm_mod]
> > >   #15 [ffffc90031cd3c20] __submit_bio at ffffffff9b65f61c
> > >   #16 [ffffc90031cd3c38] __submit_bio_noacct at ffffffff9b65f73e
> > >   #17 [ffffc90031cd3c80] xfs_buf_ioapply_map at ffffffffc23df4ea [xfs]
> > 
> > This isn't from a TOT kernel. xfs_buf_ioapply_map() went away a year
> > ago. What kernel is this occurring on?
> 
> Thanks for your reply. :)
> 
> It's based on v6.6.

v6.6 was released in late 2023. I think we largely fixed this
problem with this series that was merged into 6.11 in mid 2024:

https://lore.kernel.org/linux-xfs/20220809230353.3353059-1-david@fromorbit.com/

In more detail...

> > Can you please explain how the softlockup timer is being hit here so we
> > can try to understand the root cause of the problem? Workload,
> 
> Again, a testsuite combining stress-ng, LTP, and fio, executed concurrently.
> 
> > hardware, filesystem config, storage stack, etc all matter here,
> 
> 
> ================================= CPU ======================================
> Architecture:                            x86_64
> CPU op-mode(s):                          32-bit, 64-bit
> Address sizes:                           45 bits physical, 48 bits virtual
> Byte Order:                              Little Endian
> CPU(s):                                  384

... 384 CPUs banging on a single filesystem....

> ================================= XFS ======================================
> [root@localhost ~]# xfs_info /dev/ts/home 
> meta-data=/dev/mapper/ts-home    isize=512    agcount=4, agsize=45875200 blks

... that has very limited parallelism, and ...
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
> data     =                       bsize=4096   blocks=183500800, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=89600, version=2

... a relatively small log (350MB) compared to the size of the
system that is hammering on it.

i.e. This is exactly the sort of system architecture that will push
heaps of concurrency into the filesystem's transaction reservation
slow path and keep it there for long periods of time, especially
under sustained, highly concurrent, modification-heavy stress
workloads.

Exposing any kernel spin lock to unbound, user-controlled
concurrency will eventually result in a workload that causes
catastrophic spin lock contention breakdown. At that point,
everything that uses said lock spends excessive amounts of time
spinning rather than making progress.

This is one of the scalability problems the patchset I linked above
addressed. Prior to that patchset, the transaction reservation slow
path (the "journal full" path) exposed the AIL lock to unbound
userspace concurrency via the "update the AIL push target"
mechanism. Both journal IO completion and the xfsaild are heavy
users of the AIL lock, but don't normally contend with each other
because internal filesystem concurrency is tightly bounded. Once
userspace starts banging on it, however....

Silencing soft lockups with cond_resched() is almost never the right
thing to do - they are generally indicative of some other problem
occurring. We need to understand what that "some other problem" is
before we do anything else...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 11+ messages
2026-02-05  8:26 [PATCH 0/2] Add cond_resched() in some place to avoid softlockup alexjlzheng
2026-02-05  8:26 ` [PATCH 1/2] xfs: take a breath in xlog_ioend_work() alexjlzheng
2026-02-05 10:54   ` Dave Chinner
2026-02-05 12:49     ` Jinliang Zheng
2026-02-05 20:27       ` Dave Chinner
2026-02-05  8:26 ` [PATCH 2/2] xfs: take a breath in xfsaild() alexjlzheng
2026-02-05 11:44   ` Dave Chinner
2026-02-05 12:49     ` Jinliang Zheng
2026-02-05 21:17       ` Dave Chinner [this message]
2026-02-05 10:39 ` [PATCH 0/2] Add cond_resched() in some place to avoid softlockup Dave Chinner
2026-02-13 21:38 ` Dave Chinner
