* Too many xfs-conv kworker threads
From: Karim Manaouil @ 2025-11-25 19:49 UTC (permalink / raw)
To: Carlos Maiolino; +Cc: linux-xfs, linux-kernel, Karim Manaouil
Hi folks,
I have four NVMe SSDs on RAID0 with XFS and upstream Linux kernel 6.15
with commit id e5f0a698b34ed76002dc5cff3804a61c80233a7a. The setup can
achieve 25GB/s and more than 2M IOPS. The CPU is a dual-socket, 24-core
AMD EPYC 9224.
I am running thpchallenge-fio from mmtests (its purpose is described
here [1]). It's a fio job that inefficiently writes a large number of 64K
files. On a system with 128GiB of RAM, it could create up to 100K files.
A typical fio config looks like this:
[global]
direct=0
ioengine=sync
blocksize=4096
invalidate=0
fallocate=none
create_on_open=1
[writer]
nrfiles=785988
filesize=65536
readwrite=write
numjobs=4
filename_format=$jobnum/workfile.$filenum
I noticed that, at some point, top reports around 42650 sleeping tasks,
example:
Tasks: 42651 total, 1 running, 42650 sleeping, 0 stopped, 0 zombie
This is a test machine from a fresh boot running vanilla Debian.
After checking, it turned out to be a massive list of xfs-conv
kworkers. Something like this (truncated):
58214 ? I 0:00 [kworker/47:203-xfs-conv/md127]
58215 ? I 0:00 [kworker/47:204-xfs-conv/md127]
58216 ? I 0:00 [kworker/47:205-xfs-conv/md127]
58217 ? I 0:00 [kworker/47:206-xfs-conv/md127]
58219 ? I 0:00 [kworker/12:539-xfs-conv/md127]
58220 ? I 0:00 [kworker/12:540-xfs-conv/md127]
58221 ? I 0:00 [kworker/12:541-xfs-conv/md127]
58222 ? I 0:00 [kworker/12:542-xfs-conv/md127]
58223 ? I 0:00 [kworker/12:543-xfs-conv/md127]
58224 ? I 0:00 [kworker/12:544-xfs-conv/md127]
58225 ? I 0:00 [kworker/12:545-xfs-conv/md127]
58227 ? I 0:00 [kworker/38:155-xfs-conv/md127]
58228 ? I 0:00 [kworker/38:156-xfs-conv/md127]
58230 ? I 0:00 [kworker/38:158-xfs-conv/md127]
58233 ? I 0:00 [kworker/38:161-xfs-conv/md127]
58235 ? I 0:00 [kworker/8:537-xfs-conv/md127]
58237 ? I 0:00 [kworker/8:539-xfs-conv/md127]
58238 ? I 0:00 [kworker/8:540-xfs-conv/md127]
58239 ? I 0:00 [kworker/8:541-xfs-conv/md127]
58240 ? I 0:00 [kworker/8:542-xfs-conv/md127]
58241 ? I 0:00 [kworker/8:543-xfs-conv/md127]
It seems like the kernel is creating too many kworkers on each CPU.
I am not sure if this has any effect on performance, but potentially,
there is some scheduling overhead?!
Is this a healthy number of kworkers? Is this even needed, even if we
are writing to ~80k small files from four parallel threads?
[1] https://lwn.net/Articles/770235/
--
~karim
* Re: Too many xfs-conv kworker threads
From: Dave Chinner @ 2025-11-25 22:31 UTC (permalink / raw)
To: Karim Manaouil; +Cc: Carlos Maiolino, linux-xfs, linux-kernel
On Tue, Nov 25, 2025 at 07:49:42PM +0000, Karim Manaouil wrote:
> Hi folks,
>
> I have four NVMe SSDs on RAID0 with XFS and upstream Linux kernel 6.15
> with commit id e5f0a698b34ed76002dc5cff3804a61c80233a7a. The setup can
> achieve 25GB/s and more than 2M IOPS. The CPU is a dual-socket, 24-core
> AMD EPYC 9224.
What is the mkfs.xfs (or xfs_info) output for the filesystem on this
device?
> I am running thpchallenge-fio from mmtests (its purpose is described
> here [1]). It's a fio job that inefficiently writes a large number of 64K
> files. On a system with 128GiB of RAM, it could create up to 100K files.
> A typical fio config looks like this:
>
> [global]
> direct=0
> ioengine=sync
> blocksize=4096
> invalidate=0
> fallocate=none
> create_on_open=1
>
> [writer]
> nrfiles=785988
That's ~800k files, not 100k.
> filesize=65536
> readwrite=write
> numjobs=4
> filename_format=$jobnum/workfile.$filenum
>
> I noticed that, at some point, top reports around 42650 sleeping tasks,
> example:
>
> Tasks: 42651 total, 1 running, 42650 sleeping, 0 stopped, 0 zombie
>
> This is a test machine from a fresh boot running vanilla Debian.
>
> After checking, it turned out to be a massive list of xfs-conv
> kworkers. Something like this (truncated):
>
> 58214 ? I 0:00 [kworker/47:203-xfs-conv/md127]
> 58215 ? I 0:00 [kworker/47:204-xfs-conv/md127]
> 58216 ? I 0:00 [kworker/47:205-xfs-conv/md127]
> 58217 ? I 0:00 [kworker/47:206-xfs-conv/md127]
> 58219 ? I 0:00 [kworker/12:539-xfs-conv/md127]
> 58220 ? I 0:00 [kworker/12:540-xfs-conv/md127]
> 58221 ? I 0:00 [kworker/12:541-xfs-conv/md127]
> 58222 ? I 0:00 [kworker/12:542-xfs-conv/md127]
> 58223 ? I 0:00 [kworker/12:543-xfs-conv/md127]
> 58224 ? I 0:00 [kworker/12:544-xfs-conv/md127]
> 58225 ? I 0:00 [kworker/12:545-xfs-conv/md127]
> 58227 ? I 0:00 [kworker/38:155-xfs-conv/md127]
> 58228 ? I 0:00 [kworker/38:156-xfs-conv/md127]
> 58230 ? I 0:00 [kworker/38:158-xfs-conv/md127]
> 58233 ? I 0:00 [kworker/38:161-xfs-conv/md127]
> 58235 ? I 0:00 [kworker/8:537-xfs-conv/md127]
> 58237 ? I 0:00 [kworker/8:539-xfs-conv/md127]
> 58238 ? I 0:00 [kworker/8:540-xfs-conv/md127]
> 58239 ? I 0:00 [kworker/8:541-xfs-conv/md127]
> 58240 ? I 0:00 [kworker/8:542-xfs-conv/md127]
> 58241 ? I 0:00 [kworker/8:543-xfs-conv/md127]
>
> It seems like the kernel is creating too many kworkers on each CPU.
Or there are tens of thousands of small random write IOs
in flight at any given time and this process is serialising on
unwritten extent conversion during IO completion processing. i.e.
so many kworker threads indicates work processing is blocking, so
each new work queued gets a new kworker thread created to process
it.
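One quick way to confirm that is to sample the kernel stacks of a few of
the blocked workers while the test is running. A rough sketch (needs root;
/proc/<pid>/stack depends on kernel config, and the field positions assume
plain `ps ax` output like the listing above):

  # pick a few xfs-conv kworkers currently in uninterruptible sleep (D state)
  # and dump their kernel stacks to see what they are blocked on
  for pid in $(ps ax | awk '/\[kworker.*xfs-conv/ && $3 ~ /^D/ {print $1}' | head -5); do
      echo "== pid $pid =="
      cat /proc/$pid/stack
  done

If they are all sitting in the same wait (e.g. on log grant space or an AGF
buffer lock), that points straight at the serialisation point.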
I suspect that the filesystem is running out of journal space, and
then unwritten extent conversion transactions start lock-stepping
waiting for journal space. Hence the question about mkfs.xfs output.
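If it is log space, that should show up while the test runs as the log grant
heads sitting hard up against the tail. A minimal sketch, assuming the
per-filesystem sysfs log stats are exported for md127:

  # watch the log head/tail LSNs and the reserve/write grant heads
  watch -n 1 'grep . /sys/fs/xfs/md127/log/*'

along with `xfs_info <mountpoint>` for the configured log size.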
Also info about IO load (e.g. `iostat -dxm 5` output) whilst the test
is running would be useful, because even fast devices can end up
being really slow when something stupid is being done...
Kernel profiles across all CPUs whilst the workload is running and
in this state would be useful.
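Something along these lines is usually enough (just a sketch, adjust the
sample length to taste):

  # system-wide profile with call graphs for 30s while the fio job is running
  perf record -a -g -- sleep 30
  perf report --no-children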
> I am not sure if this has any effect on performance, but potentially,
> there is some scheduling overhead?!
It probably does, but a trainsmash of stalled in-progress work like
this is typically a symptom of some other misbehaviour occurring.
FWIW, for a workload intended to produce "inefficient write IO",
this sort of behaviour is definitely indicating something
"inefficient" is occurring during write IO. So, in the end, there is
a definite possibility that there may not actually be anything that
can be "fixed" here....
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Too many xfs-conv kworker threads
From: Karim Manaouil @ 2025-11-26 13:27 UTC (permalink / raw)
To: Dave Chinner; +Cc: Carlos Maiolino, linux-xfs, linux-kernel
Hi Dave,
Thanks for looking at this.
On Wed, Nov 26, 2025 at 09:31:59AM +1100, Dave Chinner wrote:
> On Tue, Nov 25, 2025 at 07:49:42PM +0000, Karim Manaouil wrote:
> > Hi folks,
> >
> > I have four NVMe SSDs on RAID0 with XFS and upstream Linux kernel 6.15
> > with commit id e5f0a698b34ed76002dc5cff3804a61c80233a7a. The setup can
> > achieve 25GB/s and more than 2M IOPS. The CPU is a dual-socket, 24-core
> > AMD EPYC 9224.
>
> What is the mkfs.xfs (or xfs_info) output for the filesystem on this
> device?
Here is xfs_info
meta-data=/dev/md127 isize=512 agcount=48, agsize=20346496 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=0 metadir=0
data = bsize=4096 blocks=976630272, imaxpct=5
= sunit=128 swidth=512 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=4096 blocks=476872, version=2
= sectsz=512 sunit=8 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
= rgcount=0 rgsize=0 extents
and mdadm --detail
/dev/md127:
Version : 1.2
Creation Time : Tue Sep 30 13:47:14 2025
Raid Level : raid0
Array Size : 3906521088 (3.64 TiB 4.00 TB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Tue Sep 30 13:47:14 2025
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Layout : original
Chunk Size : 512K
Consistency Policy : none
UUID : 11099fa7:7500e645:a3e71ba6:a23dda90
Events : 0
Number Major Minor RaidDevice State
0 259 3 0 active sync /dev/nvme2n1
1 259 6 1 active sync /dev/nvme3n1
2 259 5 2 active sync /dev/nvme4n1
3 259 4 3 active sync /dev/nvme5n1
> > I am running thpchallenge-fio from mmtests (its purpose is described
> > here [1]). It's a fio job that inefficiently writes a large number of 64K
> > files. On a system with 128GiB of RAM, it could create up to 100K files.
> > A typical fio config looks like this:
> >
> > [global]
> > direct=0
> > ioengine=sync
> > blocksize=4096
> > invalidate=0
> > fallocate=none
> > create_on_open=1
> >
> > [writer]
> > nrfiles=785988
>
> That's ~800k files, not 100k.
My bad, you're right. This was run with much more memory than 128GiB.
(The test scales the number of files with memory size).
> > filesize=65536
> > readwrite=write
> > numjobs=4
> > filename_format=$jobnum/workfile.$filenum
> >
> > I noticed that, at some point, top reports around 42650 sleeping tasks,
> > example:
> >
> > Tasks: 42651 total, 1 running, 42650 sleeping, 0 stopped, 0 zombie
> >
> > This is a test machine from a fresh boot running vanilla Debian.
> >
> > After checking, it turned out to be a massive list of xfs-conv
> > kworkers. Something like this (truncated):
> >
> > 58214 ? I 0:00 [kworker/47:203-xfs-conv/md127]
> > 58215 ? I 0:00 [kworker/47:204-xfs-conv/md127]
> > 58216 ? I 0:00 [kworker/47:205-xfs-conv/md127]
> > 58217 ? I 0:00 [kworker/47:206-xfs-conv/md127]
> > 58219 ? I 0:00 [kworker/12:539-xfs-conv/md127]
> > 58220 ? I 0:00 [kworker/12:540-xfs-conv/md127]
> > 58221 ? I 0:00 [kworker/12:541-xfs-conv/md127]
> > 58222 ? I 0:00 [kworker/12:542-xfs-conv/md127]
> > 58223 ? I 0:00 [kworker/12:543-xfs-conv/md127]
> > 58224 ? I 0:00 [kworker/12:544-xfs-conv/md127]
> > 58225 ? I 0:00 [kworker/12:545-xfs-conv/md127]
> > 58227 ? I 0:00 [kworker/38:155-xfs-conv/md127]
> > 58228 ? I 0:00 [kworker/38:156-xfs-conv/md127]
> > 58230 ? I 0:00 [kworker/38:158-xfs-conv/md127]
> > 58233 ? I 0:00 [kworker/38:161-xfs-conv/md127]
> > 58235 ? I 0:00 [kworker/8:537-xfs-conv/md127]
> > 58237 ? I 0:00 [kworker/8:539-xfs-conv/md127]
> > 58238 ? I 0:00 [kworker/8:540-xfs-conv/md127]
> > 58239 ? I 0:00 [kworker/8:541-xfs-conv/md127]
> > 58240 ? I 0:00 [kworker/8:542-xfs-conv/md127]
> > 58241 ? I 0:00 [kworker/8:543-xfs-conv/md127]
> >
> > It seems like the kernel is creating too many kworkers on each CPU.
>
> Or there are tens of thousands of small random write IOs
> in flight at any given time and this process is serialising on
> unwritten extent conversion during IO completion processing. i.e.
> so many kworker threads indicates work processing is blocking, so
> each new work queued gets a new kworker thread created to process
> it.
>
> I suspect that the filesystem is running out of journal space, and
> then unwritten extent conversion transactions start lock-stepping
> waiting for journal space. Hence the question about mkfs.xfs output.
>
> Also info about IO load (e.g. `iostat -dxm 5` output) whilst the test
> is running would be useful, because even fast devices can end up
> being really slow when something stupid is being done...
>
> Kernel profiles across all CPUs whilst the workload is running and
> in this state would be useful.
This is the last 20-30s of iostat -dxm 5 output during the test. It has been
consistently the same throughout the test at ~80-89% utilization.
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
md127 0.00 0.00 0.00 0.00 0.00 0.00 68713.80 1051.87 0.00 0.00 1.05 15.68 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 72.14 89.52
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
md127 0.00 0.00 0.00 0.00 0.00 0.00 66888.40 943.12 0.00 0.00 0.92 14.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 61.68 88.08
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
md127 0.00 0.00 0.00 0.00 0.00 0.00 68453.80 653.24 0.00 0.00 1.23 9.77 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 84.37 87.12
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
md127 0.00 0.00 0.00 0.00 0.00 0.00 82154.80 604.90 0.00 0.00 1.64 7.54 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 134.87 86.88
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
md127 1.20 0.00 0.00 0.00 2.00 4.00 70320.60 295.50 0.00 0.00 1.97 4.30 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 138.60 87.12
Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
md127 8.20 0.07 0.00 0.00 0.00 9.17 19574.60 84.99 0.00 0.00 2.27 4.45 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 44.48 24.96
In addition, I got the kernel profile with perf record -a -g.
Please find ~500 lines of perf report output at the end of this email.
I have also generated a flamegraph here to make life easier:
https://limewire.com/d/b5lJ1#ZigjlrS9mg
> > I am not sure if this has any effect on performance, but potentially,
> > there is some scheduling overhead?!
>
> It probably does, but a trainsmash of stalled in-progress work like
> this is typically a symptom of some other misbehaviour occurring.
>
> FWIW, for a workload intended to produce "inefficient write IO",
> this sort of behaviour is definitely indicating something
> "inefficient" is occurring during write IO. So, in the end, there is
> a definite possibility that there may not actually be anything that
> can be "fixed" here....
You're right, but having 45k kworker threads still looks questionable to me
even with the inefficiency in mind. Maybe it shouldn't be done that way. I'm
not sure, but just keeping the questions open...
========== Perf report output ================
47.99% 0.00% fio fio [.] main
5.49%
main
fio_backend
run_threads
thread_main
td_io_queue
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
vfs_write
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
1.19%
main
fio_backend
run_threads
thread_main
td_io_queue
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
fdget_pos
45.82% 0.00% fio fio [.] fio_backend
5.49%
fio_backend
run_threads
thread_main
td_io_queue
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
vfs_write
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
1.19%
fio_backend
run_threads
thread_main
td_io_queue
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
fdget_pos
45.66% 0.01% fio fio [.] run_threads
5.49%
run_threads
thread_main
td_io_queue
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
vfs_write
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
1.19%
run_threads
thread_main
td_io_queue
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
fdget_pos
43.21% 0.39% fio fio [.] thread_main
5.49%
thread_main
td_io_queue
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
vfs_write
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
1.19%
thread_main
td_io_queue
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
fdget_pos
39.26% 0.28% fio [kernel.kallsyms] [k] do_syscall_64
5.49%
do_syscall_64
ksys_write
vfs_write
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
1.17%
do_syscall_64
ksys_write
fdget_pos
39.11% 0.07% fio [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
5.49%
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
vfs_write
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
1.17%
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
fdget_pos
32.75% 0.00% fio libc.so.6 [.] 0x00007fa8884a49ee
5.49%
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
vfs_write
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
1.19%
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
fdget_pos
31.46% 0.25% fio fio [.] td_io_queue
5.49%
td_io_queue
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
vfs_write
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
1.19%
td_io_queue
0x7fa8884a49ee
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
fdget_pos
26.47% 0.04% fio [kernel.kallsyms] [k] ksys_write
5.49%
ksys_write
vfs_write
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
1.19%
ksys_write
fdget_pos
25.27% 0.96% fio [kernel.kallsyms] [k] vfs_write
5.49%
vfs_write
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
24.62% 0.11% fio [kernel.kallsyms] [k] xfs_file_buffered_write
5.49%
xfs_file_buffered_write
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
20.00% 0.00% kworker/u196:0- [kernel.kallsyms] [k] ret_from_fork_asm
20.00% 0.00% kworker/u196:0- [kernel.kallsyms] [k] ret_from_fork
20.00% 0.00% kworker/u196:0- [kernel.kallsyms] [k] kthread
20.00% 0.00% kworker/u196:0- [kernel.kallsyms] [k] worker_thread
20.00% 0.00% kworker/u196:0- [kernel.kallsyms] [k] process_one_work
19.98% 0.00% kworker/u196:0- [kernel.kallsyms] [k] wb_workfn
19.98% 0.00% kworker/u196:0- [kernel.kallsyms] [k] wb_writeback
19.74% 0.04% kworker/u196:0- [kernel.kallsyms] [k] writeback_sb_inodes
19.50% 0.03% kworker/u196:0- [kernel.kallsyms] [k] __writeback_single_inode
19.46% 0.12% kworker/u196:0- [kernel.kallsyms] [k] do_writepages
19.40% 0.10% kworker/u196:0- [kernel.kallsyms] [k] xfs_vm_writepages
19.25% 0.03% kworker/u196:0- [kernel.kallsyms] [k] iomap_writepages
16.24% 0.20% kworker/u196:0- [kernel.kallsyms] [k] iomap_writeback_folio
15.81% 0.00% kworker/u196:0- [kernel.kallsyms] [k] __writeback_inodes_wb
15.50% 0.06% kworker/u196:0- [kernel.kallsyms] [k] xfs_writeback_range
14.49% 0.02% kworker/u196:0- [kernel.kallsyms] [k] xfs_bmapi_convert_delalloc
14.48% 0.19% kworker/u196:0- [kernel.kallsyms] [k] xfs_bmapi_convert_one_delalloc
12.07% 0.40% fio [kernel.kallsyms] [k] iomap_file_buffered_write
11.82% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork_asm
11.81% 0.00% kthreadd [kernel.kallsyms] [k] ret_from_fork
11.76% 0.00% kthreadd [kernel.kallsyms] [k] kthread
11.74% 0.05% kthreadd [kernel.kallsyms] [k] worker_thread
11.52% 0.69% fio [kernel.kallsyms] [k] xfs_file_write_checks
5.49%
xfs_file_write_checks
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
11.45% 0.05% kthreadd [kernel.kallsyms] [k] process_one_work
11.22% 0.15% fio [kernel.kallsyms] [k] kiocb_modified
5.49%
kiocb_modified
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
11.03% 0.01% kworker/u196:0- [kernel.kallsyms] [k] xfs_trans_commit
10.63% 0.03% kthreadd [kernel.kallsyms] [k] xfs_end_io
10.54% 0.03% kthreadd [kernel.kallsyms] [k] xfs_end_ioend
10.21% 0.05% fio [kernel.kallsyms] [k] xfs_vn_update_time
5.53%
xfs_vn_update_time
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
9.66% 0.05% kworker/u196:0- [kernel.kallsyms] [k] xfs_defer_finish_noroll
8.88% 0.03% kthreadd [kernel.kallsyms] [k] xfs_iomap_write_unwritten
8.84% 0.34% fio fio [.] get_io_u
8.79% 0.19% fio [kernel.kallsyms] [k] __xfs_trans_commit
6.56%
__xfs_trans_commit
xlog_cil_commit
8.69% 0.03% fio [kernel.kallsyms] [k] xfs_trans_commit
6.55%
xfs_trans_commit
__xfs_trans_commit
xlog_cil_commit
7.97% 0.01% kthreadd [kernel.kallsyms] [k] xfs_trans_commit
7.28% 0.02% kworker/u196:0- [kernel.kallsyms] [k] xfs_defer_finish_one
7.05% 0.01% kworker/u196:0- [kernel.kallsyms] [k] xfs_rmap_update_finish_item
6.99% 2.00% fio [kernel.kallsyms] [k] xlog_cil_commit
4.99%
xlog_cil_commit
6.98% 0.01% kworker/u196:0- [kernel.kallsyms] [k] xfs_rmap_finish_one
6.75% 0.04% kthreadd [kernel.kallsyms] [k] xfs_defer_finish_noroll
6.67% 0.14% fio [kernel.kallsyms] [k] iomap_write_begin
6.53% 0.02% kworker/u196:0- [kernel.kallsyms] [k] xfs_rmap_map
6.36% 0.02% fio fio [.] td_io_open_file
6.13% 0.08% fio [kernel.kallsyms] [k] __filemap_get_folio
5.68% 0.04% kworker/u196:0- [kernel.kallsyms] [k] xfs_btree_insert
5.64% 0.07% kworker/u196:0- [kernel.kallsyms] [k] xfs_btree_insrec
5.13% 0.02% kthreadd [kernel.kallsyms] [k] xfs_defer_finish_one
4.89% 0.02% kworker/u196:0- [kernel.kallsyms] [k] xfs_btree_update_keys.isra.0
4.14%
xfs_btree_update_keys.isra.0
__xfs_btree_updkeys.isra.0
xfs_btree_get_leaf_keys
1.09%
xfs_btree_update_keys.isra.0
__xfs_btree_updkeys.isra.0
xfs_btree_get_leaf_keys
xfs_rmapbt_init_high_key_from_rec
4.86% 0.04% kworker/u196:0- [kernel.kallsyms] [k] __xfs_btree_updkeys.isra.0
4.14%
__xfs_btree_updkeys.isra.0
xfs_btree_get_leaf_keys
1.09%
__xfs_btree_updkeys.isra.0
xfs_btree_get_leaf_keys
xfs_rmapbt_init_high_key_from_rec
4.85% 0.01% kthreadd [kernel.kallsyms] [k] xfs_rmap_update_finish_item
4.77% 0.02% kthreadd [kernel.kallsyms] [k] xfs_rmap_finish_one
4.56% 0.01% fio [kernel.kallsyms] [k] __x64_sys_openat
4.56% 0.01% fio [kernel.kallsyms] [k] do_sys_openat2
4.38% 0.02% fio [kernel.kallsyms] [k] do_filp_open
4.37% 0.12% fio [kernel.kallsyms] [k] path_openat
4.22% 1.20% kworker/u196:0- [kernel.kallsyms] [k] xfs_btree_get_leaf_keys
3.01%
xfs_btree_get_leaf_keys
3.69% 0.06% kthreadd [kernel.kallsyms] [k] xfs_rmap_convert
3.42% 0.00% fio libc.so.6 [.] 0x00007efc15ea49ee
3.41% 0.05% kworker/u196:0- [kernel.kallsyms] [k] __xfs_trans_commit
3.28% 0.07% fio libc.so.6 [.] fstatat64
3.26% 0.01% fio [kernel.kallsyms] [k] xfs_generic_create
3.17% 0.01% fio [kernel.kallsyms] [k] xfs_create
3.12% 0.01% kworker/u196:0- [kernel.kallsyms] [k] xfs_btree_make_block_unfull
3.06% 0.03% kworker/u196:0- [kernel.kallsyms] [k] xfs_bmapi_allocate
3.04% 0.44% kworker/u196:0- [kernel.kallsyms] [k] xlog_cil_commit
3.03% 0.02% kworker/u196:0- [kernel.kallsyms] [k] xfs_btree_lshift
2.98% 0.82% fio [kernel.kallsyms] [k] filemap_add_folio
2.97% 0.77% fio [kernel.kallsyms] [k] iomap_iter
2.96% 0.01% fio [kernel.kallsyms] [k] ksys_read
2.95% 0.05% fio [kernel.kallsyms] [k] __do_sys_newfstatat
2.94% 0.04% fio [kernel.kallsyms] [k] vfs_read
2.87% 0.01% fio [kernel.kallsyms] [k] xfs_file_read_iter
2.86% 0.11% kworker/u196:0- [kernel.kallsyms] [k] xfs_bmap_btalloc
2.84% 0.01% fio [kernel.kallsyms] [k] xfs_file_buffered_read
2.73% 0.11% fio [kernel.kallsyms] [k] alloc_pages_mpol
2.72% 0.04% fio [kernel.kallsyms] [k] vfs_fstatat
2.70% 0.56% fio [kernel.kallsyms] [k] __alloc_frozen_pages_noprof
2.69% 0.04% fio [kernel.kallsyms] [k] filemap_read
2.65% 1.92% fio [vdso] [.] __vdso_clock_gettime
2.58% 0.65% fio [kernel.kallsyms] [k] get_page_from_freelist
2.58% 0.00% swapper [kernel.kallsyms] [k] common_startup_64
2.58% 0.01% swapper [kernel.kallsyms] [k] cpu_startup_entry
2.58% 0.03% swapper [kernel.kallsyms] [k] do_idle
2.50% 0.03% kthreadd [kernel.kallsyms] [k] __xfs_trans_commit
2.49% 0.05% fio [kernel.kallsyms] [k] folio_alloc_noprof
2.47% 0.05% kworker/u196:0- [kernel.kallsyms] [k] xfs_alloc_vextent_start_ag
2.42% 1.80% fio [kernel.kallsyms] [k] xfs_inode_item_format
2.40% 0.12% fio [kernel.kallsyms] [k] xfs_trans_alloc
2.39% 0.02% kworker/u196:0- [kernel.kallsyms] [k] xfs_alloc_vextent_iterate_ags.constprop.0
2.38% 0.28% fio [kernel.kallsyms] [k] _raw_spin_lock
2.09%
_raw_spin_lock
2.31% 0.05% fio [kernel.kallsyms] [k] vfs_statx
2.20% 0.00% swapper [kernel.kallsyms] [k] start_secondary
2.19% 0.32% kthreadd [kernel.kallsyms] [k] xlog_cil_commit
2.16% 0.01% kthreadd [kernel.kallsyms] [k] xfs_rmap_update
2.16% 0.01% kthreadd [kernel.kallsyms] [k] xfs_btree_update
2.14% 0.02% kworker/u196:0- [kernel.kallsyms] [k] xfs_defer_trans_roll
2.12% 0.01% kthreadd [kernel.kallsyms] [k] __xfs_btree_updkeys.isra.0
2.12% 0.01% kthreadd [kernel.kallsyms] [k] xfs_btree_update_keys.isra.0
2.10% 0.59% kthreadd [kernel.kallsyms] [k] xfs_btree_get_leaf_keys
2.08% 0.01% kworker/u196:0- [kernel.kallsyms] [k] xfs_trans_roll
2.08% 0.00% fio fio [.] parse_options
2.08% 0.00% fio fio [.] __parse_jobs_ini
2.05% 0.02% fio fio [.] add_job
2.02% 0.04% fio libc.so.6 [.] clock_gettime
1.76%
clock_gettime
__vdso_clock_gettime
1.96% 0.21% fio [kernel.kallsyms] [k] xfs_buffered_write_iomap_begin
1.92% 0.01% swapper [kernel.kallsyms] [k] cpuidle_enter
1.92% 0.03% swapper [kernel.kallsyms] [k] cpuidle_enter_state
1.84% 1.52% kworker/u196:0- [kernel.kallsyms] [k] xfs_btree_rec_offset
1.79% 1.23% fio [kernel.kallsyms] [k] memset
1.79% 0.33% fio [kernel.kallsyms] [k] filename_lookup
1.76% 1.07% kworker/u196:0- [kernel.kallsyms] [k] xfs_rmapbt_init_high_key_from_rec
1.74% 0.00% xfsaild/md127 [kernel.kallsyms] [k] ret_from_fork_asm
1.74% 0.00% xfsaild/md127 [kernel.kallsyms] [k] ret_from_fork
1.74% 0.00% xfsaild/md127 [kernel.kallsyms] [k] kthread
1.74% 0.11% xfsaild/md127 [kernel.kallsyms] [k] xfsaild
1.74% 0.32% fio [kernel.kallsyms] [k] xfs_log_reserve
1.70% 0.02% fio [kernel.kallsyms] [k] filemap_get_pages
1.66% 0.34% fio [kernel.kallsyms] [k] kmem_cache_alloc_noprof
1.64% 0.02% kthreadd [kernel.kallsyms] [k] iomap_finish_ioends
1.63% 0.03% kworker/u196:0- [kernel.kallsyms] [k] xfs_alloc_ag_vextent_near
1.63% 0.00% kswapd1 [kernel.kallsyms] [k] ret_from_fork_asm
1.63% 0.00% kswapd1 [kernel.kallsyms] [k] ret_from_fork
1.63% 0.00% kswapd1 [kernel.kallsyms] [k] kthread
1.63% 0.00% kswapd1 [kernel.kallsyms] [k] kswapd
1.63% 0.00% kswapd1 [kernel.kallsyms] [k] balance_pgdat
1.63% 0.08% kworker/u196:0- [kernel.kallsyms] [k] writeback_iter
1.62% 0.00% kswapd1 [kernel.kallsyms] [k] shrink_node
1.62% 0.00% kswapd1 [kernel.kallsyms] [k] shrink_one
1.60% 0.64% kworker/u196:0- [kernel.kallsyms] [k] xfs_rmapbt_cmp_two_keys
1.59% 0.04% kthreadd [kernel.kallsyms] [k] iomap_finish_ioend_buffered
1.55% 0.02% fio fio [.] generic_open_file
1.52% 0.01% fio fio [.] close_and_free_files
1.52% 0.00% kswapd1 [kernel.kallsyms] [k] try_to_shrink_lruvec
1.52% 0.00% kswapd1 [kernel.kallsyms] [k] evict_folios
1.46% 0.17% kworker/u196:0- [kernel.kallsyms] [k] xfs_trans_read_buf_map
1.46% 0.04% fio fio [.] put_file
1.46% 0.00% kworker/u196:0- [kernel.kallsyms] [k] iomap_ioend_writeback_submit
1.45% 0.06% kworker/u196:0- [kernel.kallsyms] [k] submit_bio_noacct_nocheck
1.44% 0.92% fio [kernel.kallsyms] [k] kmem_cache_free
1.44% 0.32% kworker/u196:0- [kernel.kallsyms] [k] xfs_btree_lookup
1.44% 0.00% kswapd0 [kernel.kallsyms] [k] ret_from_fork_asm
1.44% 0.00% kswapd0 [kernel.kallsyms] [k] ret_from_fork
1.44% 0.00% kswapd0 [kernel.kallsyms] [k] kthread
1.44% 0.00% kswapd0 [kernel.kallsyms] [k] kswapd
1.44% 0.00% kswapd0 [kernel.kallsyms] [k] balance_pgdat
1.44% 0.00% kswapd0 [kernel.kallsyms] [k] shrink_node
1.44% 0.00% kswapd0 [kernel.kallsyms] [k] shrink_one
1.43% 0.02% fio [kernel.kallsyms] [k] path_lookupat
1.42% 0.00% kswapd0 [kernel.kallsyms] [k] try_to_shrink_lruvec
1.42% 0.00% kswapd0 [kernel.kallsyms] [k] evict_folios
1.41% 0.03% kworker/u196:0- [kernel.kallsyms] [k] __submit_bio
1.40% 0.02% kthreadd [kernel.kallsyms] [k] xfs_defer_trans_roll
1.38% 0.24% kthreadd [kernel.kallsyms] [k] xfs_buf_get_map
1.36% 1.27% fio [kernel.kallsyms] [k] fault_in_readable
1.35% 0.38% kworker/u196:0- [kernel.kallsyms] [k] xfs_buf_get_map
1.35% 0.06% fio [kernel.kallsyms] [k] xfs_inode_item_precommit
1.34% 0.01% kthreadd [kernel.kallsyms] [k] xfs_trans_roll
1.33% 0.01% xfsaild/md127 [kernel.kallsyms] [k] xfs_inode_item_push
1.33% 0.08% kthreadd [kernel.kallsyms] [k] xfs_trans_read_buf_map
1.32% 1.00% fio [kernel.kallsyms] [k] fdget_pos
1.31% 0.06% xfsaild/md127 [kernel.kallsyms] [k] xfs_iflush_cluster
1.29% 0.06% fio [kernel.kallsyms] [k] walk_component
1.29% 0.24% fio [kernel.kallsyms] [k] xfs_iunlock
1.29% 0.00% fio fio [.] init_disk_util
1.26% 0.25% fio [kernel.kallsyms] [k] __mem_cgroup_charge
1.21% 0.17% fio [kernel.kallsyms] [k] __filemap_add_folio
1.19% 0.28% kworker/u196:0- [kernel.kallsyms] [k] kmem_cache_alloc_noprof
1.19% 0.10% kworker/u196:0- [kernel.kallsyms] [k] xfs_buf_read_map
1.18% 0.28% kthreadd [kernel.kallsyms] [k] xfs_btree_lookup
1.18% 0.83% fio [kernel.kallsyms] [k] clear_page_erms
1.17% 0.05% kthreadd [kernel.kallsyms] [k] xfs_buf_read_map
1.13% 0.21% kswapd1 [kernel.kallsyms] [k] shrink_folio_list
1.13% 0.14% fio [kernel.kallsyms] [k] balance_dirty_pages_ratelimited_flags
1.13% 0.20% fio fio [.] lookup_file_hash
1.12% 0.10% fio fio [.] io_queue_event
1.12% 0.37% fio [kernel.kallsyms] [k] link_path_walk
1.12% 0.03% kthreadd [kernel.kallsyms] [k] folio_end_writeback
1.11% 0.00% fio [kernel.kallsyms] [k] __x64_sys_close
1.10% 0.14% fio [kernel.kallsyms] [k] xfs_trans_read_buf_map
1.07% 0.01% fio [kernel.kallsyms] [k] __fput
1.04% 0.17% kswapd0 [kernel.kallsyms] [k] shrink_folio_list
1.03% 0.62% kworker/u196:0- [kernel.kallsyms] [k] memcpy
1.03% 0.00% fio [kernel.kallsyms] [k] force_page_cache_ra
1.02% 0.01% fio [kernel.kallsyms] [k] page_cache_ra_unbounded
1.02% 0.03% kthreadd [kernel.kallsyms] [k] xfs_free_extent_fix_freelist
1.02% 0.03% fio [kernel.kallsyms] [k] xlog_ticket_alloc
1.01% 0.00% fio [kernel.kallsyms] [k] xfs_dir_create_child
1.00% 0.03% kworker/u196:0- [kernel.kallsyms] [k] blk_mq_submit_bio
1.00% 0.02% fio [kernel.kallsyms] [k] iomap_write_end
--
~karim
* Re: Too many xfs-conv kworker threads
From: Dave Chinner @ 2025-11-26 22:33 UTC (permalink / raw)
To: Karim Manaouil; +Cc: Carlos Maiolino, linux-xfs, linux-kernel
On Wed, Nov 26, 2025 at 01:27:21PM +0000, Karim Manaouil wrote:
>
> Hi Dave,
>
> Thanks for looking at this.
>
> On Wed, Nov 26, 2025 at 09:31:59AM +1100, Dave Chinner wrote:
> > On Tue, Nov 25, 2025 at 07:49:42PM +0000, Karim Manaouil wrote:
> > > Hi folks,
> > >
> > > I have four NVMe SSDs on RAID0 with XFS and upstream Linux kernel 6.15
> > > with commit id e5f0a698b34ed76002dc5cff3804a61c80233a7a. The setup can
> > > achieve 25GB/s and more than 2M IOPS. The CPU is a dual-socket, 24-core
> > > AMD EPYC 9224.
> >
> > What is the mkfs.xfs (or xfs_info) output for the filesystem on this
> > device?
>
> Here is xfs_info
>
> meta-data=/dev/md127 isize=512 agcount=48, agsize=20346496 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=1, sparse=1, rmapbt=1
rmapbt is enabled. Important.
> This is the last 20-30s of iostat -dxm 5 output during the test. It has been
> consistently the same throughout the test at ~80-89% utilization.
>
> Device w/s wMB/s wrqm/s %wrqm w_await wareq-sz aqu-sz %util
> md127 68713.80 1051.87 0.00 0.00 1.05 15.68 72.14 89.52
> md127 66888.40 943.12 0.00 0.00 0.92 14.44 61.68 88.08
> md127 68453.80 653.24 0.00 0.00 1.23 9.77 84.37 87.12
> md127 82154.80 604.90 0.00 0.00 1.64 7.54 134.87 86.88
> md127 70320.60 295.50 0.00 0.00 1.97 4.30 138.60 87.12
> md127 19574.60 84.99 0.00 0.00 2.27 4.45 44.48 24.96
^^^^
And the average write IO size is between 4-16kB, and it's reaching
hundreds of IOs in flight at the block layer at once. So, yeah, the
stress test is definitely resulting in inefficient IO patterns as
intended.
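That average falls straight out of the columns above (wMB/s divided by w/s).
A quick check, where iostat.log is simply wherever the output above was
captured:

  # average write request size per md127 sample from the iostat -x output
  awk '/^md127/ {printf "%.1f KiB\n", $9 * 1024 / $8}' iostat.log

For the first sample that gives ~15.7 KiB, matching the wareq-sz column.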
As for the writeback IO rate, this is pretty typical for delayed
allocation - writeback is single threaded and can block. Best case
for delayed allocation is 100-120k allocations per second. Every IO
in your workload requires allocation, and it's running at about
70-80k allocations a second.
So, yeah, that seems a bit low, but not unexpectedly low.
> In addition, I got the kernel profile with perf record -a -g.
>
> Please find at the end of this email the output of (~500 lines of) perf report.
>
> I have also generated the flamegraph here to make life easy.
>
> https://limewire.com/d/b5lJ1#ZigjlrS9mg
The vast majority of IO completion work is updating the rmapbt
in xfs_rmap_convert(). There looks to be ~10x the CPU overhead in
updating the rmapbt (5%) vs the bmapbt (0.5%) during unwritten
extent conversion.
And I'd suggest that all the xfs-conv kworker threads are being
created because the rmapbt updates are contending on the AGF lock
to be able to perform the rmapbt update.
i.e. unwritten extent conversion bmbt updates are per-inode (no
global resources needed), whilst the rmapbt updates are per-AG.
Every file that is in the same AG will contend for the same AGF lock
to do rmap updates.
It will also contend with IO submission because it is doing
allocation and that requires holding the AGF locked.
IOWs, the contention point here is AGF locking for the rmapbt
updates during IO submission and IO completion. If you turn off
rmapbt it will go somewhat faster, but it won't magically run at
device speed because writeback is single threaded. I have some
ideas on how to reduce contention on the AGF for allocation and
rmapbt updates, but they are just ideas at this point.
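If you want to quantify the rmapbt contribution, the quick experiment is to
remake the test filesystem without it and re-run the same fio job. A sketch
(this destroys the filesystem, and assumes everything else stays at the same
mkfs defaults as before):

  umount /dev/md127
  mkfs.xfs -f -m rmapbt=0 /dev/md127

Then compare the xfs-conv kworker counts and the iostat numbers with the
rmapbt=1 run.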
> > > I am not sure if this has any effect on performance, but potentially,
> > > there is some scheduling overhead?!
> >
> > It probably does, but a trainsmash of stalled in-progress work like
> > this is typically a symptom of some other misbehaviour occurring.
> >
> > FWIW, for a workload intended to produce "inefficient write IO",
> > this sort of behaviour is definitely indicating something
> > "inefficient" is occurring during write IO. So, in the end, there is
> > a definite possibility that there may not actually be anything that
> > can be "fixed" here....
>
> You're right, but having 45k kworker threads still looks questionable to me
> even with the inefficiency in mind.
The explosion of kworker threads is a result of scheduler behaviour.
It moves the writeback thread around because it is unbound and
frequently blocks, whilst other kernel tasks that are bound to a
specific CPU (like xfs-conv processing) take scheduling priority.
It's not ideal behaviour in this particular corner case, but for a
stress test that is intended to create "inefficient IO patterns",
this is exactly the sort of behaviour it should be exercising.
Remember, this is an artificial stress test....
-Dave.
--
Dave Chinner
david@fromorbit.com