From: Breno Leitao <leitao@debian.org>
To: Tejun Heo <tj@kernel.org>, Lai Jiangshan <jiangshanlai@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org, puranjay@kernel.org,
linux-crypto@vger.kernel.org, linux-btrfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org,
Michael van der Westhuizen <rmikey@meta.com>,
kernel-team@meta.com, Chuck Lever <chuck.lever@oracle.com>,
Breno Leitao <leitao@debian.org>
Subject: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
Date: Thu, 12 Mar 2026 09:12:01 -0700
Message-ID: <20260312-workqueue_sharded-v1-0-2c43a7b861d0@debian.org>
TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity scope (WQ_AFFN_CACHE)
collapse to a single worker pool, causing heavy contention on the pool
spinlock (pool->lock). This series adds a new affinity scope
(WQ_AFFN_CACHE_SHARD) that caps each pool at wq_cache_shard_size CPUs
(default 8).
Problem
=======
Some modern systems have many CPUs sharing one LLC. Here are some examples I have
access to:
* NVIDIA Grace CPU: 72 real CPUs per LLC
* Intel(R) Xeon(R) Gold 6450C: 59 SMT threads per LLC
* Intel(R) Xeon(R) Platinum 8321HC: 51 SMT threads per LLC
On these systems, unbound workqueues default to the WQ_AFFN_CACHE
affinity scope, which results in a single pool for the whole system
when all CPUs share the same LLC, as on the systems above.
This causes contention on pool->lock, hurting I/O performance
(btrfs, writeback, etc.).
While profiling an I/O-intensive usercache workload at Meta, I found
significant contention in __queue_work(), making pool->lock one of the
top 5 contended locks.
Additionally, Chuck Lever recently reported this problem:
"For example, on a 12-core system with a single shared L3 cache running
NFS over RDMA with 12 fio jobs, perf shows approximately 39% of CPU
cycles spent in native_queued_spin_lock_slowpath, nearly all from
__queue_work() contending on the single pool lock.
On such systems WQ_AFFN_CACHE, WQ_AFFN_SMT, and WQ_AFFN_NUMA
scopes all collapse to a single pod."
Link: https://lore.kernel.org/all/20260203143744.16578-1-cel@kernel.org/
Solution
========
Tejun suggested solving this problem by creating an intermediate
affinity level (aka cache_shard), which shards WQ_AFFN_CACHE using a
heuristic, avoiding the collapse of all those affinity levels into a
single pod.
This series does exactly that: it creates an intermediate sharded cache
affinity scope and makes it the default.
Micro benchmark
===============
To test its benefit, I created a microbenchmark (part of this series)
that enqueues work (queue_work) in a loop and reports the latency.
Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread):
cpu 3248519 items/sec p50=10944 p90=11488 p95=11648 ns
smt 3362119 items/sec p50=10945 p90=11520 p95=11712 ns
cache_shard 3629098 items/sec p50=6080 p90=8896 p95=9728 ns (NEW) **
cache 708168 items/sec p50=44000 p90=47104 p95=47904 ns
numa 710559 items/sec p50=44096 p90=47265 p95=48064 ns
system 718370 items/sec p50=43104 p90=46432 p95=47264 ns
Same benchmark on the Intel(R) Xeon(R) Platinum 8321HC:
cpu 2831751 items/sec p50=3909 p90=9222 p95=11580 ns
smt 2810699 items/sec p50=2229 p90=4928 p95=5979 ns
cache_shard 1861028 items/sec p50=4874 p90=8423 p95=9415 ns (NEW)
cache 591001 items/sec p50=24901 p90=29865 p95=31169 ns
numa 590431 items/sec p50=24901 p90=29819 p95=31133 ns
system 591912 items/sec p50=25049 p90=29916 p95=31219 ns
(** It is still unclear why cache_shard is "better" than SMT on
Grace/ARM. The result is consistently reproducible, though; I am still
investigating.)
Block benchmark
===============
Host: Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz (16 Cores - 32 SMT)
To stress the workqueue, I ran fio on a dm-crypt device.
1) Create a plain dm-crypt device on top of NVMe
* cryptsetup creates an encrypted block device (/dev/mapper/crypt_nvme) on top
of a raw NVMe drive. All I/O to this device goes through kcryptd — dm-crypt's
workqueue that handles AES encryption/decryption of every data block.
# cryptsetup open --type plain -c aes-xts-plain64 -s 256 /dev/nvme0n1 crypt_nvme -d -
2) Run fio
* fio hammers the encrypted device with one job per CPU, each doing
128-deep 4K direct I/O for 10 seconds. This generates massive workqueue
pressure: every I/O completion triggers a kcryptd work item to encrypt or
decrypt data. (--rw is set to randread/randwrite/randrw per workload.)
# fio --name=crypt --filename=/dev/mapper/crypt_nvme \
--rw=randread --ioengine=io_uring --direct=1 \
--bs=4k --iodepth=128 \
--numjobs=$(nproc) --runtime=10 \
--time_based --group_reporting
Running this for ~3 hours:
┌────────────┬────────────────────────┬────────────────────────┬───────────┬────────┬─────────────────┐
│ Workload │ Avg cache │ Avg cache_shard │ Avg delta │ Stddev │ 2-sigma range │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randread │ 389 MiB/s (99.6k IOPS) │ 413 MiB/s (106k IOPS) │ +5.9% │ 3.3% │ -0.7% to +12.5% │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randwrite │ 622 MiB/s (159k IOPS) │ 614 MiB/s (157k IOPS) │ -1.3% │ 0.9% │ -3.1% to +0.5% │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randrw │ 240 MiB/s (61.4k IOPS) │ 250 MiB/s (64.1k IOPS) │ +4.3% │ 3.4% │ -2.5% to +11.1% │
└────────────┴────────────────────────┴────────────────────────┴───────────┴────────┴─────────────────┘
The same workloads repeated with buffered I/O (--direct=0):
┌───────────┬────────────────────────┬────────────────────────┬───────────┬────────┬────────────────┐
│ Workload │ Avg cache │ Avg cache_shard │ Avg delta │ Stddev │ 2-sigma range │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randread │ 559 MiB/s (143k IOPS) │ 577 MiB/s (148k IOPS) │ +3.1% │ 1.3% │ +0.5% to +5.7% │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randwrite │ 437 MiB/s (112k IOPS) │ 431 MiB/s (110k IOPS) │ -1.5% │ 1.0% │ -3.5% to +0.5% │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randrw │ 272 MiB/s (69.7k IOPS) │ 273 MiB/s (69.8k IOPS) │ +0.1% │ 1.5% │ -2.9% to +3.1% │
└───────────┴────────────────────────┴────────────────────────┴───────────┴────────┴────────────────┘
(The randwrite regression appears to be within the noise.)
Patchset organization
=====================
This series adds a new WQ_AFFN_CACHE_SHARD affinity scope that
subdivides each LLC into groups of at most wq_cache_shard_size CPUs
(default 8, tunable via boot parameter), providing an intermediate
option between per-LLC and per-SMT-core granularity.
Besides the feature itself, this series prepares the code for the new
cache_shard affinity, adds a stress-test module for workqueues, and
finally makes the new affinity scope the default.
On systems with 8 or fewer CPUs per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 CPUs.
---
Breno Leitao (5):
workqueue: fix parse_affn_scope() prefix matching bug
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
workqueue: add test_workqueue benchmark module
tools/workqueue: add CACHE_SHARD support to wq_dump.py
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 72 ++++++++++--
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 275 +++++++++++++++++++++++++++++++++++++++++++++
tools/workqueue/wq_dump.py | 3 +-
6 files changed, 352 insertions(+), 10 deletions(-)
---
base-commit: b29fb8829bff243512bb8c8908fd39406f9fd4c3
change-id: 20260309-workqueue_sharded-2327956e889b
Best regards,
--
Breno Leitao <leitao@debian.org>