* [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
@ 2026-03-12 16:12 Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
` (5 more replies)
0 siblings, 6 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).
Problem
=======
Some modern systems have many CPUs sharing one LLC. Here are some examples I have
access to:
* NVIDIA Grace CPU: 72 real CPUs per LLC
* Intel(R) Xeon(R) Gold 6450C: 59 SMT threads per LLC
* Intel(R) Xeon(R) Platinum 8321HC: 51 SMT threads per LLC
On these systems, the default unbound workqueue affinity
(WQ_AFFN_CACHE) results in a single worker pool for the whole system
whenever all CPUs share the same LLC, as on the systems above.
This causes contention on pool->lock, potentially affecting IO
performance (btrfs, writeback, etc.).
When profiling an IO-intensive usercache workload at Meta, I found
significant contention in __queue_work(), which made pool->lock one of
the top 5 contended locks.
Additionally, Chuck Lever recently reported this problem:
"For example, on a 12-core system with a single shared L3 cache running
NFS over RDMA with 12 fio jobs, perf shows approximately 39% of CPU
cycles spent in native_queued_spin_lock_slowpath, nearly all from
__queue_work() contending on the single pool lock.
On such systems WQ_AFFN_CACHE, WQ_AFFN_SMT, and WQ_AFFN_NUMA
scopes all collapse to a single pod."
Link: https://lore.kernel.org/all/20260203143744.16578-1-cel@kernel.org/
Solution
========
Tejun suggested solving this by creating an intermediate affinity
level (aka cache_shard), which shards WQ_AFFN_CACHE using a heuristic
so that these affinity scopes no longer collapse into a single pod.
This series does exactly that: it creates an intermediate sharded
cache affinity and makes it the default.
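As a rough userspace sketch of the heuristic (Python; the cap of 8 is
the series' default wq_cache_shard_size), sharding simply caps the pool
size, so a 72-CPU LLC splits into 9 pools instead of collapsing into 1:

```python
import math

def nr_cache_shards(llc_cpus, max_shard_size=8):
    """Number of worker pools a single LLC is split into.

    Mirrors the DIV_ROUND_UP(nr_cpus, wq_cache_shard_size) computation
    in the series: one pool per shard instead of one pool per LLC.
    """
    return math.ceil(llc_cpus / max_shard_size)

# NVIDIA Grace: 72 CPUs in one LLC -> 9 pools instead of 1
print(nr_cache_shards(72))  # 9
# An 8-CPU LLC stays a single shard, identical to WQ_AFFN_CACHE
print(nr_cache_shards(8))   # 1
```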
Micro benchmark
===============
To test its benefit, I created a microbenchmark (part of this series)
that enqueues work (queue_work) in a loop and reports the latency.
Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread):
cpu 3248519 items/sec p50=10944 p90=11488 p95=11648 ns
smt 3362119 items/sec p50=10945 p90=11520 p95=11712 ns
cache_shard 3629098 items/sec p50=6080 p90=8896 p95=9728 ns (NEW) **
cache 708168 items/sec p50=44000 p90=47104 p95=47904 ns
numa 710559 items/sec p50=44096 p90=47265 p95=48064 ns
system 718370 items/sec p50=43104 p90=46432 p95=47264 ns
Same benchmark on the Intel 8321HC:
cpu 2831751 items/sec p50=3909 p90=9222 p95=11580 ns
smt 2810699 items/sec p50=2229 p90=4928 p95=5979 ns
cache_shard 1861028 items/sec p50=4874 p90=8423 p95=9415 ns (NEW)
cache 591001 items/sec p50=24901 p90=29865 p95=31169 ns
numa 590431 items/sec p50=24901 p90=29819 p95=31133 ns
system 591912 items/sec p50=25049 p90=29916 p95=31219 ns
(** It is still unclear why cache_shard is "better" than SMT on
Grace/ARM. The result is consistently reproducible, though; still
investigating.)
Block benchmark
===============
Host: Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz (16 Cores - 32 SMT)
To stress the workqueue, I ran fio on a dm-crypt device.
1) Create a plain dm-crypt device on top of NVMe
* cryptsetup creates an encrypted block device (/dev/mapper/crypt_nvme) on top
of a raw NVMe drive. All I/O to this device goes through kcryptd — dm-crypt's
workqueue that handles AES encryption/decryption of every data block.
# cryptsetup open --type plain -c aes-xts-plain64 -s 256 /dev/nvme0n1 crypt_nvme -d -
2) Run fio
* fio hammers the encrypted device with one thread per CPU ($(nproc)
jobs), each doing 128-deep 4K _buffered_ I/O for 10 seconds. This
generates massive workqueue pressure: every I/O completion triggers a
kcryptd work item to encrypt or decrypt data.
# fio --filename=/dev/mapper/crypt_nvme \
--ioengine=io_uring --direct=0 \
--bs=4k --iodepth=128 \
--numjobs=$(nproc) --runtime=10 \
--time_based --group_reporting
Running this for ~3 hours:
┌────────────┬────────────────────────┬────────────────────────┬───────────┬────────┬─────────────────┐
│ Workload │ Avg cache │ Avg cache_shard │ Avg delta │ Stddev │ 2-sigma range │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randread │ 389 MiB/s (99.6k IOPS) │ 413 MiB/s (106k IOPS) │ +5.9% │ 3.3% │ -0.7% to +12.5% │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randwrite │ 622 MiB/s (159k IOPS) │ 614 MiB/s (157k IOPS) │ -1.3% │ 0.9% │ -3.1% to +0.5% │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randrw │ 240 MiB/s (61.4k IOPS) │ 250 MiB/s (64.1k IOPS) │ +4.3% │ 3.4% │ -2.5% to +11.1% │
└────────────┴────────────────────────┴────────────────────────┴───────────┴────────┴─────────────────┘
Same results for buffered IO:
┌───────────┬────────────────────────┬────────────────────────┬───────────┬────────┬────────────────┐
│ Workload │ Avg cache │ Avg cache_shard │ Avg delta │ Stddev │ 2-sigma range │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randread │ 559 MiB/s (143k IOPS) │ 577 MiB/s (148k IOPS) │ +3.1% │ 1.3% │ +0.5% to +5.7% │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randwrite │ 437 MiB/s (112k IOPS) │ 431 MiB/s (110k IOPS) │ -1.5% │ 1.0% │ -3.5% to +0.5% │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randrw │ 272 MiB/s (69.7k IOPS) │ 273 MiB/s (69.8k IOPS) │ +0.1% │ 1.5% │ -2.9% to +3.1% │
└───────────┴────────────────────────┴────────────────────────┴───────────┴────────┴────────────────┘
(The randwrite delta appears to be within the noise.)
Patchset organization
=====================
This series adds a new WQ_AFFN_CACHE_SHARD affinity scope that
subdivides each LLC into groups of at most wq_cache_shard_size CPUs
(default 8, tunable via boot parameter), providing an intermediate
option between per-LLC and per-SMT-core granularity.
In addition to the new scope itself, this patchset prepares the code
for the cache_shard affinity and adds a stress-test module for
workqueues. Finally, it makes the new affinity scope the default.
On systems with 8 or fewer CPUs per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 CPUs.
---
Breno Leitao (5):
workqueue: fix parse_affn_scope() prefix matching bug
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
workqueue: add test_workqueue benchmark module
tools/workqueue: add CACHE_SHARD support to wq_dump.py
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 72 ++++++++++--
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 275 +++++++++++++++++++++++++++++++++++++++++++++
tools/workqueue/wq_dump.py | 3 +-
6 files changed, 352 insertions(+), 10 deletions(-)
---
base-commit: b29fb8829bff243512bb8c8908fd39406f9fd4c3
change-id: 20260309-workqueue_sharded-2327956e889b
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
@ 2026-03-12 16:12 ` Breno Leitao
2026-03-13 17:41 ` Tejun Heo
2026-03-12 16:12 ` [PATCH RFC 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
` (4 subsequent siblings)
5 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
parse_affn_scope() uses strncasecmp() with the length of the candidate
name, which means it only checks if the input *starts with* a known
scope name.
Since an upcoming patch adds a "cache_shard" affinity scope, writing
"cache_shard" to a workqueue's affinity_scope sysfs attribute would
always match "cache" first, making it impossible to select
"cache_shard" via sysfs. This fix makes "cache" and "cache_shard"
distinguishable.
Fix by replacing the hand-rolled prefix matching loop with
sysfs_match_string(), which uses sysfs_streq() for exact matching
(modulo trailing newlines). Also add the missing const qualifier to
the wq_affn_names[] array declaration.
Note that sysfs_streq() is case-sensitive, unlike the previous
strncasecmp() approach. This is intentional and consistent with
how other sysfs attributes handle string matching in the kernel.
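The bug is easy to reproduce in userspace. This Python sketch
(hypothetical helper names; the scope list matches the series) mimics
the old strncasecmp()-with-candidate-length loop versus the new exact
match:

```python
SCOPES = ["default", "cpu", "smt", "cache", "cache_shard", "numa", "system"]

def parse_affn_scope_old(val):
    # Old behavior: prefix match using the candidate's own length,
    # like strncasecmp(val, name, strlen(name)).
    for i, name in enumerate(SCOPES):
        if val.lower().startswith(name.lower()):
            return i
    return -1  # -EINVAL

def parse_affn_scope_new(val):
    # New behavior: exact match modulo a trailing newline,
    # like sysfs_match_string()/sysfs_streq().
    val = val.rstrip("\n")
    for i, name in enumerate(SCOPES):
        if val == name:
            return i
    return -1

# "cache_shard" wrongly resolves to "cache" with prefix matching ...
print(SCOPES[parse_affn_scope_old("cache_shard")])    # cache
# ... but resolves correctly with exact matching.
print(SCOPES[parse_affn_scope_new("cache_shard\n")])  # cache_shard
```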
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeaec79bc09c4..028afc3d14e59 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -404,7 +404,7 @@ struct work_offq_data {
u32 flags;
};
-static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
+static const char * const wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_DFL] = "default",
[WQ_AFFN_CPU] = "cpu",
[WQ_AFFN_SMT] = "smt",
@@ -7063,13 +7063,7 @@ int workqueue_unbound_housekeeping_update(const struct cpumask *hk)
static int parse_affn_scope(const char *val)
{
- int i;
-
- for (i = 0; i < ARRAY_SIZE(wq_affn_names); i++) {
- if (!strncasecmp(val, wq_affn_names[i], strlen(wq_affn_names[i])))
- return i;
- }
- return -EINVAL;
+ return sysfs_match_string(wq_affn_names, val);
}
static int wq_affn_dfl_set(const char *val, const struct kernel_param *kp)
--
2.52.0
* [PATCH RFC 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
@ 2026-03-12 16:12 ` Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
` (3 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
On systems where many CPUs share one LLC, unbound workqueues using
WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock
contention on pool->lock. For example, Chuck Lever measured 39% of
cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3
NFS-over-RDMA system.
The existing affinity hierarchy (cpu, smt, cache, numa, system) offers
no intermediate option between per-LLC and per-SMT-core granularity.
Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at
most wq_cache_shard_size CPUs (default 8, tunable via boot parameter).
CPUs are distributed across shards as evenly as possible -- for example,
72 CPUs with max shard size 8 produces 9 shards of 8 each.
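The even-split arithmetic can be sketched in userspace Python
(positions within one LLC; defaults as in this patch):

```python
def cache_shard_id(pos, nr_cpus, max_shard_size=8):
    """Shard index for the CPU at position `pos` within its LLC.

    The first `remainder` shards get shard_size + 1 CPUs, the rest get
    shard_size, mirroring cpu_cache_shard_id() in this patch.
    """
    nr_shards = -(-nr_cpus // max_shard_size)  # DIV_ROUND_UP
    shard_size = nr_cpus // nr_shards
    remainder = nr_cpus % nr_shards
    if pos < remainder * (shard_size + 1):
        return pos // (shard_size + 1)
    return remainder + (pos - remainder * (shard_size + 1)) // shard_size

def shard_sizes(nr_cpus, max_shard_size=8):
    """Resulting shard sizes for an LLC with nr_cpus CPUs."""
    from collections import Counter
    c = Counter(cache_shard_id(p, nr_cpus, max_shard_size)
                for p in range(nr_cpus))
    return [c[i] for i in sorted(c)]

print(shard_sizes(20))  # [7, 7, 6] -- 20 CPUs, 3 shards
print(shard_sizes(72))  # [8, 8, 8, 8, 8, 8, 8, 8, 8]
```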
The implementation follows the same comparator pattern as other affinity
scopes: cpu_cache_shard_id() computes a per-CPU shard index on the fly
from the already-initialized WQ_AFFN_CACHE topology, and
cpus_share_cache_shard() is passed to init_pod_type().
Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread):
cpu 3433158 items/sec p50=16416 p90=17376 p95=17664 ns
smt 3449709 items/sec p50=16576 p90=17504 p95=17792 ns
cache_shard 2939917 items/sec p50=8192 p90=11488 p95=12512 ns
cache 602096 items/sec p50=53056 p90=56320 p95=57248 ns
numa 599090 items/sec p50=53152 p90=56448 p95=57376 ns
system 598865 items/sec p50=53184 p90=56481 p95=57408 ns
cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency
compared to cache scope on this 72-core single-LLC system.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 61 insertions(+)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a4749f56398fd..41c946109c7d0 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -133,6 +133,7 @@ enum wq_affn_scope {
WQ_AFFN_CPU, /* one pod per CPU */
WQ_AFFN_SMT, /* one pod poer SMT */
WQ_AFFN_CACHE, /* one pod per LLC */
+ WQ_AFFN_CACHE_SHARD, /* synthetic sub-LLC shards */
WQ_AFFN_NUMA, /* one pod per NUMA node */
WQ_AFFN_SYSTEM, /* one pod across the whole system */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 028afc3d14e59..6be884eb3450d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -409,6 +409,7 @@ static const char * const wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_CPU] = "cpu",
[WQ_AFFN_SMT] = "smt",
[WQ_AFFN_CACHE] = "cache",
+ [WQ_AFFN_CACHE_SHARD] = "cache_shard",
[WQ_AFFN_NUMA] = "numa",
[WQ_AFFN_SYSTEM] = "system",
};
@@ -431,6 +432,9 @@ module_param_named(cpu_intensive_warning_thresh, wq_cpu_intensive_warning_thresh
static bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
module_param_named(power_efficient, wq_power_efficient, bool, 0444);
+static unsigned int wq_cache_shard_size = 8;
+module_param_named(cache_shard_size, wq_cache_shard_size, uint, 0444);
+
static bool wq_online; /* can kworkers be created yet? */
static bool wq_topo_initialized __read_mostly = false;
@@ -8106,6 +8110,56 @@ static bool __init cpus_share_numa(int cpu0, int cpu1)
return cpu_to_node(cpu0) == cpu_to_node(cpu1);
}
+/**
+ * cpu_cache_shard_id - compute the shard index for a CPU within its LLC pod
+ * @cpu: the CPU to look up
+ *
+ * Returns a shard index that is unique within the CPU's LLC pod. CPUs in
+ * the same LLC are divided into shards no larger than wq_cache_shard_size,
+ * distributed as evenly as possible. E.g., 20 CPUs with max shard size 8
+ * gives 3 shards of 7+7+6.
+ */
+static int __init cpu_cache_shard_id(int cpu)
+{
+ struct wq_pod_type *cache_pt = &wq_pod_types[WQ_AFFN_CACHE];
+ const struct cpumask *pod_cpus;
+ int nr_cpus, nr_shards, shard_size, remainder, c;
+ int pos = 0;
+
+ /* CPUs in the same LLC as @cpu */
+ pod_cpus = cache_pt->pod_cpus[cache_pt->cpu_pod[cpu]];
+ /* Total number of CPUs sharing this LLC */
+ nr_cpus = cpumask_weight(pod_cpus);
+ /* Number of shards to split this LLC into */
+ nr_shards = DIV_ROUND_UP(nr_cpus, wq_cache_shard_size);
+ /* Minimum number of CPUs per shard */
+ shard_size = nr_cpus / nr_shards;
+ /* First @remainder shards get one extra CPU */
+ remainder = nr_cpus % nr_shards;
+
+ /* Find position of @cpu within its cache pod */
+ for_each_cpu(c, pod_cpus) {
+ if (c == cpu)
+ break;
+ pos++;
+ }
+
+ /*
+ * Map position to shard index. The first @remainder shards have
+ * (shard_size + 1) CPUs, the rest have @shard_size CPUs.
+ */
+ if (pos < remainder * (shard_size + 1))
+ return pos / (shard_size + 1);
+ return remainder + (pos - remainder * (shard_size + 1)) / shard_size;
+}
+
+static bool __init cpus_share_cache_shard(int cpu0, int cpu1)
+{
+ if (!cpus_share_cache(cpu0, cpu1))
+ return false;
+ return cpu_cache_shard_id(cpu0) == cpu_cache_shard_id(cpu1);
+}
+
/**
* workqueue_init_topology - initialize CPU pods for unbound workqueues
*
@@ -8118,9 +8172,15 @@ void __init workqueue_init_topology(void)
struct workqueue_struct *wq;
int cpu;
+ if (!wq_cache_shard_size) {
+ pr_warn("workqueue: cache_shard_size must be > 0, setting to 1\n");
+ wq_cache_shard_size = 1;
+ }
+
init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
+ init_pod_type(&wq_pod_types[WQ_AFFN_CACHE_SHARD], cpus_share_cache_shard);
init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
wq_topo_initialized = true;
--
2.52.0
* [PATCH RFC 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
@ 2026-03-12 16:12 ` Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 4/5] workqueue: add test_workqueue benchmark module Breno Leitao
` (2 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound
workqueues. On systems where many CPUs share one LLC, the previous
default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool,
causing heavy spinlock contention on pool->lock.
WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing
a better balance between locality and contention. Users can revert to
the previous behavior with workqueue.default_affinity_scope=cache.
On systems with 8 or fewer CPUs per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 CPUs.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6be884eb3450d..0d3bad2bfdaae 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -441,7 +441,7 @@ static bool wq_topo_initialized __read_mostly = false;
static struct kmem_cache *pwq_cache;
static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
-static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE;
+static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE_SHARD;
/* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
static struct workqueue_attrs *unbound_wq_update_pwq_attrs_buf;
--
2.52.0
* [PATCH RFC 4/5] workqueue: add test_workqueue benchmark module
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
` (2 preceding siblings ...)
2026-03-12 16:12 ` [PATCH RFC 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
@ 2026-03-12 16:12 ` Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 5/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
2026-03-13 17:57 ` [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Tejun Heo
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Add a kernel module that benchmarks queue_work() throughput on an
unbound workqueue to measure pool->lock contention under different
affinity scope configurations (cache vs cache_shard).
The module spawns N kthreads (default: num_online_cpus()), each bound
to a different CPU. All threads start simultaneously and queue work
items, measuring the latency of each queue_work() call. Results are
reported as p50/p90/p95 latencies for each affinity scope.
The affinity scope is switched between runs via the workqueue's sysfs
affinity_scope attribute (WQ_SYSFS), avoiding the need for any new
exported symbols.
The module is __init-only: its init function returns -EAGAIN so the
module never stays loaded, and the benchmark can be re-run with
another insmod.
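The module's percentile reporting is a simple sort-and-index; a Python
equivalent of that calculation (same p50/p90/p95 indexing as the
module's sorted all_latencies[] array):

```python
def report_percentiles(latencies_ns):
    """Sort merged per-thread latencies and pick percentiles by index,
    the same way the module indexes its sorted array."""
    s = sorted(latencies_ns)
    n = len(s)
    return {p: s[n * p // 100] for p in (50, 90, 95)}

# 100 fake samples 0..99 ns: p50 -> 50, p90 -> 90, p95 -> 95
print(report_percentiles(range(100)))
```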
Signed-off-by: Breno Leitao <leitao@debian.org>
---
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 275 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 286 insertions(+)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 93f356d2b3d95..38bee649697f3 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2628,6 +2628,16 @@ config TEST_VMALLOC
If unsure, say N.
+config TEST_WORKQUEUE
+ tristate "Test module for stress/performance analysis of workqueue"
+ default n
+ help
+ This builds the "test_workqueue" module for benchmarking
+ workqueue throughput under contention. Useful for evaluating
+ affinity scope changes (e.g., cache_shard vs cache).
+
+ If unsure, say N.
+
config TEST_BPF
tristate "Test BPF filter functionality"
depends on m && NET
diff --git a/lib/Makefile b/lib/Makefile
index 1b9ee167517f3..ea660cca04f40 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -79,6 +79,7 @@ UBSAN_SANITIZE_test_ubsan.o := y
obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
obj-$(CONFIG_TEST_LKM) += test_module.o
obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
+obj-$(CONFIG_TEST_WORKQUEUE) += test_workqueue.o
obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_keys.o
obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o
diff --git a/lib/test_workqueue.c b/lib/test_workqueue.c
new file mode 100644
index 0000000000000..82540e5536078
--- /dev/null
+++ b/lib/test_workqueue.c
@@ -0,0 +1,275 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test module for stress and performance analysis of workqueue.
+ *
+ * Benchmarks queue_work() throughput on an unbound workqueue to measure
+ * pool->lock contention under different affinity scope configurations
+ * (e.g., cache vs cache_shard).
+ *
+ * The affinity scope is changed between runs via the workqueue's sysfs
+ * affinity_scope attribute (WQ_SYSFS).
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ *
+ */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/workqueue.h>
+#include <linux/kthread.h>
+#include <linux/moduleparam.h>
+#include <linux/completion.h>
+#include <linux/atomic.h>
+#include <linux/slab.h>
+#include <linux/ktime.h>
+#include <linux/cpumask.h>
+#include <linux/sched.h>
+#include <linux/sort.h>
+#include <linux/fs.h>
+
+#define WQ_NAME "bench_wq"
+#define SCOPE_PATH "/sys/bus/workqueue/devices/" WQ_NAME "/affinity_scope"
+
+static int nr_threads;
+module_param(nr_threads, int, 0444);
+MODULE_PARM_DESC(nr_threads,
+ "Number of threads to spawn (default: 0 = num_online_cpus())");
+
+static int wq_items = 50000;
+module_param(wq_items, int, 0444);
+MODULE_PARM_DESC(wq_items,
+ "Number of work items each thread queues (default: 50000)");
+
+static struct workqueue_struct *bench_wq;
+static atomic_t completed;
+static atomic_t threads_done;
+static DECLARE_COMPLETION(start_comp);
+static DECLARE_COMPLETION(all_done_comp);
+
+struct thread_ctx {
+ struct work_struct work;
+ struct completion work_done;
+ u64 *latencies;
+ int cpu;
+ int items;
+};
+
+static void bench_work_fn(struct work_struct *work)
+{
+ struct thread_ctx *ctx = container_of(work, struct thread_ctx, work);
+
+ atomic_inc(&completed);
+ complete(&ctx->work_done);
+}
+
+static int bench_kthread_fn(void *data)
+{
+ struct thread_ctx *ctx = data;
+ ktime_t t_start, t_end;
+ int i;
+
+ /* Wait for all threads to be ready */
+ wait_for_completion(&start_comp);
+
+ for (i = 0; i < ctx->items; i++) {
+ reinit_completion(&ctx->work_done);
+ INIT_WORK(&ctx->work, bench_work_fn);
+
+ t_start = ktime_get();
+ queue_work(bench_wq, &ctx->work);
+ t_end = ktime_get();
+
+ ctx->latencies[i] = ktime_to_ns(ktime_sub(t_end, t_start));
+ wait_for_completion(&ctx->work_done);
+ }
+
+ if (atomic_dec_and_test(&threads_done))
+ complete(&all_done_comp);
+
+ return 0;
+}
+
+static int cmp_u64(const void *a, const void *b)
+{
+ u64 va = *(const u64 *)a;
+ u64 vb = *(const u64 *)b;
+
+ if (va < vb)
+ return -1;
+ if (va > vb)
+ return 1;
+ return 0;
+}
+
+static int __init set_affn_scope(const char *scope)
+{
+ struct file *f;
+ loff_t pos = 0;
+ ssize_t ret;
+
+ f = filp_open(SCOPE_PATH, O_WRONLY, 0);
+ if (IS_ERR(f)) {
+ pr_err("test_workqueue: open %s failed: %ld\n",
+ SCOPE_PATH, PTR_ERR(f));
+ return PTR_ERR(f);
+ }
+
+ ret = kernel_write(f, scope, strlen(scope), &pos);
+ filp_close(f, NULL);
+
+ if (ret < 0) {
+ pr_err("test_workqueue: write '%s' failed: %zd\n", scope, ret);
+ return ret;
+ }
+
+ return 0;
+}
+
+static int __init run_bench(int n_threads, const char *scope)
+{
+ struct thread_ctx *ctxs;
+ struct task_struct **tasks;
+ u64 *all_latencies;
+ unsigned long total_items;
+ ktime_t start, end;
+ s64 elapsed_us;
+ int cpu, i, j, ret;
+
+ ret = set_affn_scope(scope);
+ if (ret)
+ return ret;
+
+ ctxs = kcalloc(n_threads, sizeof(*ctxs), GFP_KERNEL);
+ if (!ctxs)
+ return -ENOMEM;
+
+ tasks = kcalloc(n_threads, sizeof(*tasks), GFP_KERNEL);
+ if (!tasks) {
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+
+ total_items = (unsigned long)n_threads * wq_items;
+ all_latencies = kvmalloc_array(total_items, sizeof(u64), GFP_KERNEL);
+ if (!all_latencies) {
+ kfree(tasks);
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+
+ /* Allocate per-thread latency arrays */
+ for (i = 0; i < n_threads; i++) {
+ ctxs[i].latencies = kvmalloc_array(wq_items, sizeof(u64),
+ GFP_KERNEL);
+ if (!ctxs[i].latencies) {
+ while (--i >= 0)
+ kvfree(ctxs[i].latencies);
+ kvfree(all_latencies);
+ kfree(tasks);
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+ }
+
+ atomic_set(&completed, 0);
+ atomic_set(&threads_done, n_threads);
+ reinit_completion(&all_done_comp);
+ reinit_completion(&start_comp);
+
+ /* Create kthreads, each bound to a different online CPU */
+ i = 0;
+ for_each_online_cpu(cpu) {
+ if (i >= n_threads)
+ break;
+
+ ctxs[i].cpu = cpu;
+ ctxs[i].items = wq_items;
+ init_completion(&ctxs[i].work_done);
+
+ tasks[i] = kthread_create(bench_kthread_fn, &ctxs[i],
+ "wq_bench/%d", cpu);
+ if (IS_ERR(tasks[i])) {
+ ret = PTR_ERR(tasks[i]);
+ pr_err("test_workqueue: failed to create kthread %d: %d\n",
+ i, ret);
+ while (--i >= 0)
+ kthread_stop(tasks[i]);
+ goto out_free;
+ }
+
+ kthread_bind(tasks[i], cpu);
+ wake_up_process(tasks[i]);
+ i++;
+ }
+
+ /* Start timing and release all threads */
+ start = ktime_get();
+ complete_all(&start_comp);
+
+ /* Wait for all threads to finish */
+ wait_for_completion(&all_done_comp);
+
+ /* Drain any remaining work */
+ flush_workqueue(bench_wq);
+
+ end = ktime_get();
+ elapsed_us = ktime_us_delta(end, start);
+
+ /* Merge all per-thread latencies and sort for percentile calculation */
+ j = 0;
+ for (i = 0; i < n_threads; i++) {
+ memcpy(&all_latencies[j], ctxs[i].latencies,
+ wq_items * sizeof(u64));
+ j += wq_items;
+ }
+
+ sort(all_latencies, total_items, sizeof(u64), cmp_u64, NULL);
+
+ pr_info("test_workqueue: %-12s %llu items/sec\tp50=%llu\tp90=%llu\tp95=%llu ns\n",
+ scope,
+ elapsed_us ? total_items * 1000000ULL / elapsed_us : 0,
+ all_latencies[total_items * 50 / 100],
+ all_latencies[total_items * 90 / 100],
+ all_latencies[total_items * 95 / 100]);
+
+ ret = 0;
+out_free:
+ for (i = 0; i < n_threads; i++)
+ kvfree(ctxs[i].latencies);
+ kvfree(all_latencies);
+ kfree(tasks);
+ kfree(ctxs);
+
+ return ret;
+}
+
+static const char * const bench_scopes[] = {
+ "cpu", "smt", "cache_shard", "cache", "numa", "system",
+};
+
+static int __init test_workqueue_init(void)
+{
+ int n_threads = min_t(int, nr_threads ?: num_online_cpus(), num_online_cpus());
+ int i;
+
+ bench_wq = alloc_workqueue(WQ_NAME, WQ_UNBOUND | WQ_SYSFS, 0);
+ if (!bench_wq)
+ return -ENOMEM;
+
+ pr_info("test_workqueue: running %d threads, %d items/thread\n",
+ n_threads, wq_items);
+
+ for (i = 0; i < ARRAY_SIZE(bench_scopes); i++)
+ run_bench(n_threads, bench_scopes[i]);
+
+ destroy_workqueue(bench_wq);
+
+ return -EAGAIN;
+}
+
+module_init(test_workqueue_init);
+MODULE_AUTHOR("Breno Leitao <leitao@debian.org>");
+MODULE_DESCRIPTION("Stress/performance benchmark for workqueue subsystem");
+MODULE_LICENSE("GPL");
--
2.52.0
* [PATCH RFC 5/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
` (3 preceding siblings ...)
2026-03-12 16:12 ` [PATCH RFC 4/5] workqueue: add test_workqueue benchmark module Breno Leitao
@ 2026-03-12 16:12 ` Breno Leitao
2026-03-13 17:57 ` [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Tejun Heo
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
The WQ_AFFN_CACHE_SHARD affinity scope was added to the kernel but
wq_dump.py was not updated to enumerate it. Add the missing constant
lookup and include it in the affinity scopes iteration so that drgn
output shows the CACHE_SHARD pod topology alongside the other scopes.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
tools/workqueue/wq_dump.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index d29b918306b48..06948ffcfc4b6 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -107,6 +107,7 @@ WQ_MEM_RECLAIM = prog['WQ_MEM_RECLAIM']
WQ_AFFN_CPU = prog['WQ_AFFN_CPU']
WQ_AFFN_SMT = prog['WQ_AFFN_SMT']
WQ_AFFN_CACHE = prog['WQ_AFFN_CACHE']
+WQ_AFFN_CACHE_SHARD = prog['WQ_AFFN_CACHE_SHARD']
WQ_AFFN_NUMA = prog['WQ_AFFN_NUMA']
WQ_AFFN_SYSTEM = prog['WQ_AFFN_SYSTEM']
@@ -138,7 +139,7 @@ def print_pod_type(pt):
print(f' [{cpu}]={pt.cpu_pod[cpu].value_()}', end='')
print('')
-for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
+for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_CACHE_SHARD, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
print('')
print(f'{wq_affn_names[affn].string_().decode().upper()}{" (default)" if affn == wq_affn_dfl else ""}')
print_pod_type(wq_pod_types[affn])
--
2.52.0
* Re: [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
@ 2026-03-13 17:41 ` Tejun Heo
0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2026-03-13 17:41 UTC (permalink / raw)
To: Breno Leitao
Cc: Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello,
Applied to wq/for-7.1.
Thanks.
--
tejun
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
` (4 preceding siblings ...)
2026-03-12 16:12 ` [PATCH RFC 5/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
@ 2026-03-13 17:57 ` Tejun Heo
2026-03-17 11:32 ` Breno Leitao
5 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2026-03-13 17:57 UTC (permalink / raw)
To: Breno Leitao
Cc: Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello,
Applied 1/5. Some comments on the rest:
- The sharding currently splits on CPU boundary, which can split SMT
siblings across different pods. The worse performance on Intel compared
to SMT scope may be indicating exactly this - HT siblings ending up in
different pods. It'd be better to shard on core boundary so that SMT
siblings always stay together.
- How was the default shard size of 8 picked? There's a tradeoff between
the number of kworkers created and locality. Can you also report the
number of kworkers for each configuration? And is there data on
different shard sizes? It'd be useful to see how the numbers change
across e.g. 4, 8, 16, 32.
- Can you also test on AMD machines? Their CCD topology (16 or 32
threads per LLC) would be a good data point.
Thanks.
--
tejun
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-13 17:57 ` [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Tejun Heo
@ 2026-03-17 11:32 ` Breno Leitao
2026-03-17 13:58 ` Chuck Lever
0 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-03-17 11:32 UTC (permalink / raw)
To: Tejun Heo
Cc: Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello Tejun,
On Fri, Mar 13, 2026 at 07:57:20AM -1000, Tejun Heo wrote:
> Hello,
>
> Applied 1/5. Some comments on the rest:
>
> - The sharding currently splits on CPU boundary, which can split SMT
> siblings across different pods. The worse performance on Intel compared
> to SMT scope may be indicating exactly this - HT siblings ending up in
> different pods. It'd be better to shard on core boundary so that SMT
> siblings always stay together.
Thank you for the insight. I'll modify the sharding to operate at the
core boundary rather than at the SMT/thread level to ensure sibling CPUs
remain in the same pod.
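As a rough sketch of what that grouping means (plain Python, not the kernel cpumask code; the 8-core SMT2 topology below is invented for illustration), core-boundary sharding packs whole sibling groups into shards:

```python
# Illustrative sketch only -- not the kernel implementation. Pack whole
# cores (SMT sibling groups) into shards so siblings are never split.

def shard_by_core(sibling_groups, shard_size):
    """Greedily pack complete sibling groups into shards of at most
    shard_size CPUs (a single group larger than shard_size still
    stays together, pushing that shard over the cap)."""
    shards, current = [], []
    for core in sibling_groups:
        if current and len(current) + len(core) > shard_size:
            shards.append(current)
            current = []
        current.extend(core)
    if current:
        shards.append(current)
    return shards

# Invented topology: 8 cores, SMT2, CPU c paired with CPU c + 8.
cores = [[c, c + 8] for c in range(8)]
print(shard_by_core(cores, shard_size=4))
# -> [[0, 8, 1, 9], [2, 10, 3, 11], [4, 12, 5, 13], [6, 14, 7, 15]]
```

Sharding at CPU granularity instead could place CPU 0 and its sibling CPU 8 in different pods, which is the failure mode described above.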
> - How was the default shard size of 8 picked? There's a tradeoff
> between the number of kworkers created and locality. Can you also
> report the number of kworkers for each configuration? And is there
> data on different shard sizes? It'd be useful to see how the numbers
> change across e.g. 4, 8, 16, 32.
The choice of 8 as the default shard size was somewhat arbitrary – it was
selected primarily to generate initial data points.
I'll run tests with different shard sizes and report the results.
I'm currently working on finding a suitable workload with minimal noise.
Testing on real NVMe devices shows significant jitter that makes analysis
difficult. I've also been experimenting with nullblk, but haven't had much
success yet.
If you have any suggestions for a reliable workload or benchmark, I'd
appreciate your input.
> - Can you also test on AMD machines? Their CCD topology (16 or 32
> threads per LLC) would be a good data point.
Absolutely, I'll test on AMD machines as well.
Thanks,
--breno
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-17 11:32 ` Breno Leitao
@ 2026-03-17 13:58 ` Chuck Lever
2026-03-18 17:51 ` Breno Leitao
0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2026-03-17 13:58 UTC (permalink / raw)
To: Breno Leitao, Tejun Heo
Cc: Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team
On 3/17/26 7:32 AM, Breno Leitao wrote:
> Hello Tejun,
>
> On Fri, Mar 13, 2026 at 07:57:20AM -1000, Tejun Heo wrote:
>> Hello,
>>
>> Applied 1/5. Some comments on the rest:
>>
>> - The sharding currently splits on CPU boundary, which can split SMT
>> siblings across different pods. The worse performance on Intel compared
>> to SMT scope may be indicating exactly this - HT siblings ending up in
>> different pods. It'd be better to shard on core boundary so that SMT
>> siblings always stay together.
>
> Thank you for the insight. I'll modify the sharding to operate at the
> core boundary rather than at the SMT/thread level to ensure sibling CPUs
> remain in the same pod.
>
>> - How was the default shard size of 8 picked? There's a tradeoff
>> between the number of kworkers created and locality. Can you also
>> report the number of kworkers for each configuration? And is there
>> data on different shard sizes? It'd be useful to see how the numbers
>> change across e.g. 4, 8, 16, 32.
>
> The choice of 8 as the default shard size was somewhat arbitrary – it was
> selected primarily to generate initial data points.
Perhaps instead of basing the sharding on a particular number of CPUs
per shard, why not cap the total number of shards? IIUC that is the main
concern about ballooning the number of kworker threads.
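A minimal sketch of the difference between the two parameterizations (illustrative numbers; a `max_shards` knob is a hypothetical alternative, not a parameter in the posted patches):

```python
import math

def cpus_per_shard(llc_cpus, max_shards):
    # Capping the shard count: pod size grows with the LLC size,
    # so the kworker count stays bounded per LLC.
    return math.ceil(llc_cpus / max_shards)

# A 72-CPU Grace LLC capped at 8 shards yields 9-CPU pods, while a
# 16-CPU EPYC CCD under the same cap shrinks to 2-CPU pods.
print(cpus_per_shard(72, 8), cpus_per_shard(16, 8))  # -> 9 2
```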
> I'll run tests with different shard sizes and report the results.
>
> I'm currently working on finding a suitable workload with minimal noise.
> Testing on real NVMe devices shows significant jitter that makes analysis
> difficult. I've also been experimenting with nullblk, but haven't had much
> success yet.
>
> If you have any suggestions for a reliable workload or benchmark, I'd
> appreciate your input.
>
>> - Can you also test on AMD machines? Their CCD topology (16 or 32
>> threads per LLC) would be a good data point.
>
> Absolutely, I'll test on AMD machines as well.
>
> Thanks,
> --breno
--
Chuck Lever
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-17 13:58 ` Chuck Lever
@ 2026-03-18 17:51 ` Breno Leitao
2026-03-18 23:00 ` Tejun Heo
0 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-03-18 17:51 UTC (permalink / raw)
To: Chuck Lever
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team
On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> On 3/17/26 7:32 AM, Breno Leitao wrote:
> >> - How was the default shard size of 8 picked? There's a tradeoff
> >> between the number of kworkers created and locality. Can you also
> >> report the number of kworkers for each configuration? And is there
> >> data on different shard sizes? It'd be useful to see how the numbers
> >> change across e.g. 4, 8, 16, 32.
> >
> > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > selected primarily to generate initial data points.
>
> Perhaps instead of basing the sharding on a particular number of CPUs
> per shard, why not cap the total number of shards? IIUC that is the main
> concern about ballooning the number of kworker threads.
That's a great suggestion. I'll send a v2 that implements this approach,
where the parameter specifies the number of shards rather than the number
of CPUs per shard.
Thanks for the feedback,
--breno
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-18 17:51 ` Breno Leitao
@ 2026-03-18 23:00 ` Tejun Heo
2026-03-19 14:02 ` Breno Leitao
0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2026-03-18 23:00 UTC (permalink / raw)
To: Breno Leitao
Cc: Chuck Lever, Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team
On Wed, Mar 18, 2026 at 10:51:15AM -0700, Breno Leitao wrote:
> On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> > On 3/17/26 7:32 AM, Breno Leitao wrote:
> > >> - How was the default shard size of 8 picked? There's a tradeoff
> > >> between the number of kworkers created and locality. Can you also
> > >> report the number of kworkers for each configuration? And is there
> > >> data on different shard sizes? It'd be useful to see how the numbers
> > >> change across e.g. 4, 8, 16, 32.
> > >
> > > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > > selected primarily to generate initial data points.
> >
> > Perhaps instead of basing the sharding on a particular number of CPUs
> > per shard, why not cap the total number of shards? IIUC that is the main
> > concern about ballooning the number of kworker threads.
>
> That's a great suggestion. I'll send a v2 that implements this approach,
> where the parameter specifies the number of shards rather than the number
> of CPUs per shard.
Would it make sense though? It feels really odd to define the maximum number
of shards when contention is primarily a function of the number of CPUs
banging on the same pool lock. Why would 32-CPU and 512-CPU systems have the
same number of shards?
Thanks.
--
tejun
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-18 23:00 ` Tejun Heo
@ 2026-03-19 14:02 ` Breno Leitao
0 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-19 14:02 UTC (permalink / raw)
To: Tejun Heo
Cc: Chuck Lever, Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team
On Wed, Mar 18, 2026 at 01:00:07PM -1000, Tejun Heo wrote:
> On Wed, Mar 18, 2026 at 10:51:15AM -0700, Breno Leitao wrote:
> > On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> > > On 3/17/26 7:32 AM, Breno Leitao wrote:
> > > >> - How was the default shard size of 8 picked? There's a tradeoff
> > > >> between the number of kworkers created and locality. Can you also
> > > >> report the number of kworkers for each configuration? And is there
> > > >> data on different shard sizes? It'd be useful to see how the numbers
> > > >> change across e.g. 4, 8, 16, 32.
> > > >
> > > > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > > > selected primarily to generate initial data points.
> > >
> > > Perhaps instead of basing the sharding on a particular number of CPUs
> > > per shard, why not cap the total number of shards? IIUC that is the main
> > > concern about ballooning the number of kworker threads.
> >
> > That's a great suggestion. I'll send a v2 that implements this approach,
> > where the parameter specifies the number of shards rather than the number
> > of CPUs per shard.
>
> Woudl it make sense tho? If feels really odd to define the maximum number of
> shards when contention is primarily a function of the number of CPUs banging
> on the same CPU. Why would 32 cpu and 512 cpu systems have the same number
> of shards?
The trade-off is that specifying the maximum number of shards makes it
clearer how many times the LLC is being sharded, which might be easier
to reason about, but it will have less impact on contention scaling as
you reported above.
I've collected some numbers sharding by shard count per LLC, and I will
switch back to the original approach (CPUs per shard) to gather comparison
data.
Current change:
https://github.com/leitao/linux/commit/bedaf9ebe9594320976dcbf0cb507ecf083097c0
Workload:
========
I've finally found a workload that exercises the workqueue sufficiently,
which allows me to obtain stable benchmark results.
This is what the script does:
- Sets up a local loopback NFS environment backed by an 8 GB tmpfs
(/tmp/nfsexport → /mnt/nfs)
- Iterates over six fio I/O engines: sync, psync, vsync, pvsync, pvsync2,
libaio
- For each engine, runs a 200-job, 512-byte block size fio benchmark (writes
then reads)
- Tests each workload under both cache and cache_shard workqueue affinity
scopes via /sys/module/workqueue/parameters/default_affinity_scope
- Prints a summary table with aggregate bandwidth (MB) per scope and the
percentage delta to show whether cache_shard helps or hurts
- Restores the affinity scope back to cache when done
The test I am running can be found at
https://github.com/leitao/debug/blob/main/workqueue_performance/test_affinity.sh
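The scope switching and the summary-table delta in the script boil down to something like the sketch below. Only the sysfs path is from this thread; `set_scope()` needs root and is illustrative, so the runnable part here is just the delta math:

```python
# The affinity scope is switched by writing the scope name to this
# module parameter (requires root), e.g.:
#   echo cache_shard > /sys/module/workqueue/parameters/default_affinity_scope
SCOPE_PARAM = "/sys/module/workqueue/parameters/default_affinity_scope"

def set_scope(scope):
    with open(SCOPE_PARAM, "w") as f:  # needs root; illustrative only
        f.write(scope)

def pct_delta(cache_mb, shard_mb):
    """Percentage delta; positive means cache_shard outperformed cache."""
    return 100.0 * (shard_mb - cache_mb) / cache_mb

# e.g. aggregate write bandwidth of 1000 MB under cache vs 1125 MB under
# cache_shard would correspond to the +12.5% in the 2-shard ARM row below.
print(round(pct_delta(1000.0, 1125.0), 1))  # -> 12.5
```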
Hosts:
======
* ARM (NVIDIA Grace - Neoverse V2 - single L3 domain: CPUs 0-71)
# cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u
0-71
* AMD (EPYC 9D64 88-Core Processor - 11 L3 domains, 8 cores / 16 SMT threads each)
# cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u
0-7,88-95
16-23,104-111
24-31,112-119
32-39,120-127
40-47,128-135
48-55,136-143
56-63,144-151
64-71,152-159
72-79,160-167
80-87,168-175
8-15,96-103
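For reference, sysfs `shared_cpu_list` strings like the ones above can be parsed and counted with a short script (a sketch; it assumes the standard comma-separated range format):

```python
def parse_cpu_list(s):
    """Parse a sysfs cpu list like '0-7,88-95' into a frozenset of CPUs."""
    cpus = set()
    for part in s.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return frozenset(cpus)

# Reconstruct the 11 unique L3 lists reported for the EPYC host above:
# CPUs c..c+7 plus their SMT siblings c+88..c+95 per CCD.
llcs = {parse_cpu_list(f"{c}-{c + 7},{c + 88}-{c + 95}")
        for c in range(0, 88, 8)}
print(len(llcs), len(next(iter(llcs))))  # -> 11 16
```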
Results
=======
TL;DR:
* ARM (single L3, 72 CPUs): cache_shard consistently improves write
throughput by +6 to +12% across all shard counts (2-32), with
the peak at 2 shards. Read impact is minimal (noise-level).
Shard=1 confirms no effect as expected.
* AMD (11 L3 domains, 16 CPUs each): cache_shard shows no meaningful
benefit at 1-4 shards (all within noise/stddev). At 8 shards it
regresses by ~4% for both reads and writes, likely due to loss of
data locality when sharding already-small 16-CPU cache domains
further.
Benchmark data:
===============
ARM:
┌────────┬───────────────────┬──────────────┬──────────────────┬─────────────┐
│ Shards │ Write Delta (avg) │ Write stddev │ Read Delta (avg) │ Read stddev │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 1 │ -0.2% │ ±1.0% │ +1.2% │ ±1.7% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 2 │ +12.5% │ ±1.3% │ -0.3% │ ±0.9% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 4 │ +8.7% │ ±0.9% │ +1.8% │ ±1.5% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 8 │ +11.4% │ ±1.8% │ +3.1% │ ±1.5% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 16 │ +7.8% │ ±1.3% │ +1.6% │ ±1.0% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 32 │ +6.1% │ ±0.6% │ +0.3% │ ±1.5% │
└────────┴───────────────────┴──────────────┴──────────────────┴─────────────┘
AMD:
┌────────┬───────────────────┬──────────────┬──────────────────┬─────────────┐
│ Shards │ Write Delta (avg) │ Write stddev │ Read Delta (avg) │ Read stddev │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 1 │ -0.2% │ ±1.2% │ +0.1% │ ±1.0% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 2 │ +0.7% │ ±1.4% │ +0.5% │ ±1.1% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 4 │ +0.8% │ ±1.1% │ +1.3% │ ±1.2% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 8 │ -4.0% │ ±1.3% │ -4.5% │ ±0.9% │
└────────┴───────────────────┴──────────────┴──────────────────┴─────────────┘
Microbenchmark results
======================
I've run the micro-benchmark from this patchset as well; here is the
results comparison:
* AMD (11 L3 domains, 16 CPUs each): cache_shard delivers +45-55%
throughput and 36-44% lower latency at 2-8 shards. The sweet spot is
4 shards (+55%, p50 cut nearly in half). Shard=1 confirms no effect.
Even though this host already has multiple L3 domains, each 16-CPU
domain still has enough contention to benefit from further splitting
(for the sake of this microbenchmark/stress test).
* ARM (single L3, 72 CPUs): the gains are dramatic, with 2x at 2 shards,
3.2x at 4 shards, and 4.4x at 8 shards. At 8 shards, cache_shard
(3.2M items/s) nearly matches cpu scope performance (3.7M), with p50
latency dropping from 43.5 us to 6.9 us.
The single monolithic L3 makes cache scope degenerate to a single
contended pool, so sharding has a massive effect.
AMD
┌────────┬─────────────────┬───────────────────────┬─────────────────┬───────────┬───────────┬───────────────────┐
│ Shards │ cache (items/s) │ cache_shard (items/s) │ Throughput gain │ cache p50 │ shard p50 │ Latency reduction │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 1 │ 2,660,103 │ 2,667,740 │ +0.3% │ 27.5 us │ 27.5 us │ 0% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 2 │ 2,619,884 │ 3,788,454 │ +44.6% │ 28.0 us │ 17.8 us │ -36% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 4 │ 2,506,185 │ 3,891,064 │ +55.3% │ 29.3 us │ 16.5 us │ -44% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 8 │ 2,628,321 │ 4,015,312 │ +52.8% │ 27.9 us │ 16.4 us │ -41% │
└────────┴─────────────────┴───────────────────────┴─────────────────┴───────────┴───────────┴───────────────────┘
Reference scopes (stable across shard counts): cpu ~6.2M items/s, smt ~4.0M, numa/system ~422K.
ARM
┌────────┬─────────────────┬───────────────────────┬─────────────────┬───────────┬───────────┬───────────────────┐
│ Shards │ cache (items/s) │ cache_shard (items/s) │ Throughput gain │ cache p50 │ shard p50 │ Latency reduction │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 2 │ 725,999 │ 1,516,967 │ +109% │ 43.8 us │ 19.6 us │ -55% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 4 │ 729,615 │ 2,347,335 │ +222% │ 43.6 us │ 11.0 us │ -75% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 8 │ 731,517 │ 3,230,168 │ +342% │ 43.5 us │ 6.9 us │ -84% │
└────────┴─────────────────┴───────────────────────┴─────────────────┴───────────┴───────────┴───────────────────┘
Next Steps:
* Revert the code to sharding by CPU count (instead of by shard count) and
report the results.
* Are there any other tests that would help?
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
2026-03-13 17:41 ` Tejun Heo
2026-03-12 16:12 ` [PATCH RFC 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 4/5] workqueue: add test_workqueue benchmark module Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 5/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
2026-03-13 17:57 ` [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Tejun Heo
2026-03-17 11:32 ` Breno Leitao
2026-03-17 13:58 ` Chuck Lever
2026-03-18 17:51 ` Breno Leitao
2026-03-18 23:00 ` Tejun Heo
2026-03-19 14:02 ` Breno Leitao