* [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
@ 2026-03-12 16:12 Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
` (5 more replies)
0 siblings, 6 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).
Problem
=======
Some modern systems have many CPUs sharing one LLC. Here are some examples I have
access to:
* NVIDIA Grace CPU: 72 real CPUs per LLC
* Intel(R) Xeon(R) Gold 6450C: 59 SMT threads per LLC
* Intel(R) Xeon(R) Platinum 8321HC: 51 SMT threads per LLC
On these systems, the default unbound workqueue affinity
(WQ_AFFN_CACHE) results in a single worker pool for the whole system
whenever all CPUs share the same LLC, as on the systems above.
This causes contention on pool->lock, potentially affecting IO
performance (btrfs, writeback, etc.).
When profiling an IO-intensive usercache workload at Meta, I found
significant contention in __queue_work(), which made pool->lock one of
the top 5 contended locks.
Additionally, Chuck Lever recently reported this problem:
"For example, on a 12-core system with a single shared L3 cache running
NFS over RDMA with 12 fio jobs, perf shows approximately 39% of CPU
cycles spent in native_queued_spin_lock_slowpath, nearly all from
__queue_work() contending on the single pool lock.
On such systems WQ_AFFN_CACHE, WQ_AFFN_SMT, and WQ_AFFN_NUMA
scopes all collapse to a single pod."
Link: https://lore.kernel.org/all/20260203143744.16578-1-cel@kernel.org/
Solution
========
Tejun suggested solving this by creating an intermediate affinity
level (aka cache_shard), which shards WQ_AFFN_CACHE using a heuristic
so that these affinity scopes no longer collapse into a single pod.
This series does exactly that: it creates an intermediate sharded
cache affinity and makes it the default.
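As a rough userspace sketch of the heuristic (Python; the cap of 8 is
the series' default wq_cache_shard_size), sharding simply caps the pool
size, so a 72-CPU LLC splits into 9 pools instead of collapsing into 1:

```python
import math

def nr_cache_shards(llc_cpus, max_shard_size=8):
    """Number of worker pools a single LLC is split into.

    Mirrors the DIV_ROUND_UP(nr_cpus, wq_cache_shard_size) computation
    in the series: one pool per shard instead of one pool per LLC.
    """
    return math.ceil(llc_cpus / max_shard_size)

# NVIDIA Grace: 72 CPUs in one LLC -> 9 pools instead of 1
print(nr_cache_shards(72))  # 9
# An 8-CPU LLC stays a single shard, identical to WQ_AFFN_CACHE
print(nr_cache_shards(8))   # 1
```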
Micro benchmark
===============
To test its benefit, I created a microbenchmark (part of this series)
that enqueues work (queue_work) in a loop and reports the latency.
Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread):
cpu 3248519 items/sec p50=10944 p90=11488 p95=11648 ns
smt 3362119 items/sec p50=10945 p90=11520 p95=11712 ns
cache_shard 3629098 items/sec p50=6080 p90=8896 p95=9728 ns (NEW) **
cache 708168 items/sec p50=44000 p90=47104 p95=47904 ns
numa 710559 items/sec p50=44096 p90=47265 p95=48064 ns
system 718370 items/sec p50=43104 p90=46432 p95=47264 ns
Same benchmark on the Intel 8321HC:
cpu 2831751 items/sec p50=3909 p90=9222 p95=11580 ns
smt 2810699 items/sec p50=2229 p90=4928 p95=5979 ns
cache_shard 1861028 items/sec p50=4874 p90=8423 p95=9415 ns (NEW)
cache 591001 items/sec p50=24901 p90=29865 p95=31169 ns
numa 590431 items/sec p50=24901 p90=29819 p95=31133 ns
system 591912 items/sec p50=25049 p90=29916 p95=31219 ns
(** It is still unclear why cache_shard is "better" than SMT on
Grace/ARM. The result is consistently reproducible, though; still
investigating.)
Block benchmark
===============
Host: Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz (16 Cores - 32 SMT)
To stress the workqueue, I ran fio on a dm-crypt device.
1) Create a plain dm-crypt device on top of NVMe
* cryptsetup creates an encrypted block device (/dev/mapper/crypt_nvme) on top
of a raw NVMe drive. All I/O to this device goes through kcryptd — dm-crypt's
workqueue that handles AES encryption/decryption of every data block.
# cryptsetup open --type plain -c aes-xts-plain64 -s 256 /dev/nvme0n1 crypt_nvme -d -
2) Run fio
* fio hammers the encrypted device with one thread per CPU ($(nproc)
jobs), each doing 128-deep 4K _buffered_ I/O for 10 seconds. This
generates massive workqueue pressure: every I/O completion triggers a
kcryptd work item to encrypt or decrypt data.
# fio --filename=/dev/mapper/crypt_nvme \
--ioengine=io_uring --direct=0 \
--bs=4k --iodepth=128 \
--numjobs=$(nproc) --runtime=10 \
--time_based --group_reporting
Running this for ~3 hours:
┌────────────┬────────────────────────┬────────────────────────┬───────────┬────────┬─────────────────┐
│ Workload │ Avg cache │ Avg cache_shard │ Avg delta │ Stddev │ 2-sigma range │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randread │ 389 MiB/s (99.6k IOPS) │ 413 MiB/s (106k IOPS) │ +5.9% │ 3.3% │ -0.7% to +12.5% │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randwrite │ 622 MiB/s (159k IOPS) │ 614 MiB/s (157k IOPS) │ -1.3% │ 0.9% │ -3.1% to +0.5% │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randrw │ 240 MiB/s (61.4k IOPS) │ 250 MiB/s (64.1k IOPS) │ +4.3% │ 3.4% │ -2.5% to +11.1% │
└────────────┴────────────────────────┴────────────────────────┴───────────┴────────┴─────────────────┘
Same results for buffered IO:
┌───────────┬────────────────────────┬────────────────────────┬───────────┬────────┬────────────────┐
│ Workload │ Avg cache │ Avg cache_shard │ Avg delta │ Stddev │ 2-sigma range │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randread │ 559 MiB/s (143k IOPS) │ 577 MiB/s (148k IOPS) │ +3.1% │ 1.3% │ +0.5% to +5.7% │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randwrite │ 437 MiB/s (112k IOPS) │ 431 MiB/s (110k IOPS) │ -1.5% │ 1.0% │ -3.5% to +0.5% │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randrw │ 272 MiB/s (69.7k IOPS) │ 273 MiB/s (69.8k IOPS) │ +0.1% │ 1.5% │ -2.9% to +3.1% │
└───────────┴────────────────────────┴────────────────────────┴───────────┴────────┴────────────────┘
(The randwrite delta appears to be within the noise.)
Patchset organization
=====================
This series adds a new WQ_AFFN_CACHE_SHARD affinity scope that
subdivides each LLC into groups of at most wq_cache_shard_size CPUs
(default 8, tunable via boot parameter), providing an intermediate
option between per-LLC and per-SMT-core granularity.
In addition to the new scope itself, this patchset prepares the code
for the cache_shard affinity and adds a stress-test module for
workqueues. Finally, it makes the new affinity scope the default.
On systems with 8 or fewer CPUs per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 CPUs.
---
Breno Leitao (5):
workqueue: fix parse_affn_scope() prefix matching bug
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
workqueue: add test_workqueue benchmark module
tools/workqueue: add CACHE_SHARD support to wq_dump.py
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 72 ++++++++++--
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 275 +++++++++++++++++++++++++++++++++++++++++++++
tools/workqueue/wq_dump.py | 3 +-
6 files changed, 352 insertions(+), 10 deletions(-)
---
base-commit: b29fb8829bff243512bb8c8908fd39406f9fd4c3
change-id: 20260309-workqueue_sharded-2327956e889b
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
@ 2026-03-12 16:12 ` Breno Leitao
2026-03-13 17:41 ` Tejun Heo
2026-03-12 16:12 ` [PATCH RFC 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
` (4 subsequent siblings)
5 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
parse_affn_scope() uses strncasecmp() with the length of the candidate
name, which means it only checks if the input *starts with* a known
scope name.
Since an upcoming patch adds a "cache_shard" affinity scope, writing
"cache_shard" to a workqueue's affinity_scope sysfs attribute would
always match "cache" first, making it impossible to select
"cache_shard" via sysfs. This fix makes "cache" and "cache_shard"
distinguishable.
Fix by replacing the hand-rolled prefix matching loop with
sysfs_match_string(), which uses sysfs_streq() for exact matching
(modulo trailing newlines). Also add the missing const qualifier to
the wq_affn_names[] array declaration.
Note that sysfs_streq() is case-sensitive, unlike the previous
strncasecmp() approach. This is intentional and consistent with
how other sysfs attributes handle string matching in the kernel.
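The bug is easy to reproduce in userspace. This Python sketch
(hypothetical helper names; the scope list matches the series) mimics
the old strncasecmp()-with-candidate-length loop versus the new exact
match:

```python
SCOPES = ["default", "cpu", "smt", "cache", "cache_shard", "numa", "system"]

def parse_affn_scope_old(val):
    # Old behavior: prefix match using the candidate's own length,
    # like strncasecmp(val, name, strlen(name)).
    for i, name in enumerate(SCOPES):
        if val.lower().startswith(name.lower()):
            return i
    return -1  # -EINVAL

def parse_affn_scope_new(val):
    # New behavior: exact match modulo a trailing newline,
    # like sysfs_match_string()/sysfs_streq().
    val = val.rstrip("\n")
    for i, name in enumerate(SCOPES):
        if val == name:
            return i
    return -1

# "cache_shard" wrongly resolves to "cache" with prefix matching ...
print(SCOPES[parse_affn_scope_old("cache_shard")])    # cache
# ... but resolves correctly with exact matching.
print(SCOPES[parse_affn_scope_new("cache_shard\n")])  # cache_shard
```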
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeaec79bc09c4..028afc3d14e59 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -404,7 +404,7 @@ struct work_offq_data {
u32 flags;
};
-static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
+static const char * const wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_DFL] = "default",
[WQ_AFFN_CPU] = "cpu",
[WQ_AFFN_SMT] = "smt",
@@ -7063,13 +7063,7 @@ int workqueue_unbound_housekeeping_update(const struct cpumask *hk)
static int parse_affn_scope(const char *val)
{
- int i;
-
- for (i = 0; i < ARRAY_SIZE(wq_affn_names); i++) {
- if (!strncasecmp(val, wq_affn_names[i], strlen(wq_affn_names[i])))
- return i;
- }
- return -EINVAL;
+ return sysfs_match_string(wq_affn_names, val);
}
static int wq_affn_dfl_set(const char *val, const struct kernel_param *kp)
--
2.52.0
* [PATCH RFC 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
@ 2026-03-12 16:12 ` Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
` (3 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
On systems where many CPUs share one LLC, unbound workqueues using
WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock
contention on pool->lock. For example, Chuck Lever measured 39% of
cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3
NFS-over-RDMA system.
The existing affinity hierarchy (cpu, smt, cache, numa, system) offers
no intermediate option between per-LLC and per-SMT-core granularity.
Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at
most wq_cache_shard_size CPUs (default 8, tunable via boot parameter).
CPUs are distributed across shards as evenly as possible -- for example,
72 CPUs with max shard size 8 produces 9 shards of 8 each.
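The even-split arithmetic can be sketched in userspace Python
(positions within one LLC; defaults as in this patch):

```python
def cache_shard_id(pos, nr_cpus, max_shard_size=8):
    """Shard index for the CPU at position `pos` within its LLC.

    The first `remainder` shards get shard_size + 1 CPUs, the rest get
    shard_size, mirroring cpu_cache_shard_id() in this patch.
    """
    nr_shards = -(-nr_cpus // max_shard_size)  # DIV_ROUND_UP
    shard_size = nr_cpus // nr_shards
    remainder = nr_cpus % nr_shards
    if pos < remainder * (shard_size + 1):
        return pos // (shard_size + 1)
    return remainder + (pos - remainder * (shard_size + 1)) // shard_size

def shard_sizes(nr_cpus, max_shard_size=8):
    """Resulting shard sizes for an LLC with nr_cpus CPUs."""
    from collections import Counter
    c = Counter(cache_shard_id(p, nr_cpus, max_shard_size)
                for p in range(nr_cpus))
    return [c[i] for i in sorted(c)]

print(shard_sizes(20))  # [7, 7, 6] -- 20 CPUs, 3 shards
print(shard_sizes(72))  # [8, 8, 8, 8, 8, 8, 8, 8, 8]
```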
The implementation follows the same comparator pattern as other affinity
scopes: cpu_cache_shard_id() computes a per-CPU shard index on the fly
from the already-initialized WQ_AFFN_CACHE topology, and
cpus_share_cache_shard() is passed to init_pod_type().
Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread):
cpu 3433158 items/sec p50=16416 p90=17376 p95=17664 ns
smt 3449709 items/sec p50=16576 p90=17504 p95=17792 ns
cache_shard 2939917 items/sec p50=8192 p90=11488 p95=12512 ns
cache 602096 items/sec p50=53056 p90=56320 p95=57248 ns
numa 599090 items/sec p50=53152 p90=56448 p95=57376 ns
system 598865 items/sec p50=53184 p90=56481 p95=57408 ns
cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency
compared to cache scope on this 72-core single-LLC system.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 61 insertions(+)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a4749f56398fd..41c946109c7d0 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -133,6 +133,7 @@ enum wq_affn_scope {
WQ_AFFN_CPU, /* one pod per CPU */
WQ_AFFN_SMT, /* one pod poer SMT */
WQ_AFFN_CACHE, /* one pod per LLC */
+ WQ_AFFN_CACHE_SHARD, /* synthetic sub-LLC shards */
WQ_AFFN_NUMA, /* one pod per NUMA node */
WQ_AFFN_SYSTEM, /* one pod across the whole system */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 028afc3d14e59..6be884eb3450d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -409,6 +409,7 @@ static const char * const wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_CPU] = "cpu",
[WQ_AFFN_SMT] = "smt",
[WQ_AFFN_CACHE] = "cache",
+ [WQ_AFFN_CACHE_SHARD] = "cache_shard",
[WQ_AFFN_NUMA] = "numa",
[WQ_AFFN_SYSTEM] = "system",
};
@@ -431,6 +432,9 @@ module_param_named(cpu_intensive_warning_thresh, wq_cpu_intensive_warning_thresh
static bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
module_param_named(power_efficient, wq_power_efficient, bool, 0444);
+static unsigned int wq_cache_shard_size = 8;
+module_param_named(cache_shard_size, wq_cache_shard_size, uint, 0444);
+
static bool wq_online; /* can kworkers be created yet? */
static bool wq_topo_initialized __read_mostly = false;
@@ -8106,6 +8110,56 @@ static bool __init cpus_share_numa(int cpu0, int cpu1)
return cpu_to_node(cpu0) == cpu_to_node(cpu1);
}
+/**
+ * cpu_cache_shard_id - compute the shard index for a CPU within its LLC pod
+ * @cpu: the CPU to look up
+ *
+ * Returns a shard index that is unique within the CPU's LLC pod. CPUs in
+ * the same LLC are divided into shards no larger than wq_cache_shard_size,
+ * distributed as evenly as possible. E.g., 20 CPUs with max shard size 8
+ * gives 3 shards of 7+7+6.
+ */
+static int __init cpu_cache_shard_id(int cpu)
+{
+ struct wq_pod_type *cache_pt = &wq_pod_types[WQ_AFFN_CACHE];
+ const struct cpumask *pod_cpus;
+ int nr_cpus, nr_shards, shard_size, remainder, c;
+ int pos = 0;
+
+ /* CPUs in the same LLC as @cpu */
+ pod_cpus = cache_pt->pod_cpus[cache_pt->cpu_pod[cpu]];
+ /* Total number of CPUs sharing this LLC */
+ nr_cpus = cpumask_weight(pod_cpus);
+ /* Number of shards to split this LLC into */
+ nr_shards = DIV_ROUND_UP(nr_cpus, wq_cache_shard_size);
+ /* Minimum number of CPUs per shard */
+ shard_size = nr_cpus / nr_shards;
+ /* First @remainder shards get one extra CPU */
+ remainder = nr_cpus % nr_shards;
+
+ /* Find position of @cpu within its cache pod */
+ for_each_cpu(c, pod_cpus) {
+ if (c == cpu)
+ break;
+ pos++;
+ }
+
+ /*
+ * Map position to shard index. The first @remainder shards have
+ * (shard_size + 1) CPUs, the rest have @shard_size CPUs.
+ */
+ if (pos < remainder * (shard_size + 1))
+ return pos / (shard_size + 1);
+ return remainder + (pos - remainder * (shard_size + 1)) / shard_size;
+}
+
+static bool __init cpus_share_cache_shard(int cpu0, int cpu1)
+{
+ if (!cpus_share_cache(cpu0, cpu1))
+ return false;
+ return cpu_cache_shard_id(cpu0) == cpu_cache_shard_id(cpu1);
+}
+
/**
* workqueue_init_topology - initialize CPU pods for unbound workqueues
*
@@ -8118,9 +8172,15 @@ void __init workqueue_init_topology(void)
struct workqueue_struct *wq;
int cpu;
+ if (!wq_cache_shard_size) {
+ pr_warn("workqueue: cache_shard_size must be > 0, setting to 1\n");
+ wq_cache_shard_size = 1;
+ }
+
init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
+ init_pod_type(&wq_pod_types[WQ_AFFN_CACHE_SHARD], cpus_share_cache_shard);
init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
wq_topo_initialized = true;
--
2.52.0
* [PATCH RFC 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
@ 2026-03-12 16:12 ` Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 4/5] workqueue: add test_workqueue benchmark module Breno Leitao
` (2 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound
workqueues. On systems where many CPUs share one LLC, the previous
default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool,
causing heavy spinlock contention on pool->lock.
WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing
a better balance between locality and contention. Users can revert to
the previous behavior with workqueue.default_affinity_scope=cache.
On systems with 8 or fewer CPUs per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 CPUs.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6be884eb3450d..0d3bad2bfdaae 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -441,7 +441,7 @@ static bool wq_topo_initialized __read_mostly = false;
static struct kmem_cache *pwq_cache;
static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
-static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE;
+static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE_SHARD;
/* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
static struct workqueue_attrs *unbound_wq_update_pwq_attrs_buf;
--
2.52.0
* [PATCH RFC 4/5] workqueue: add test_workqueue benchmark module
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
` (2 preceding siblings ...)
2026-03-12 16:12 ` [PATCH RFC 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
@ 2026-03-12 16:12 ` Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 5/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
2026-03-13 17:57 ` [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Tejun Heo
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Add a kernel module that benchmarks queue_work() throughput on an
unbound workqueue to measure pool->lock contention under different
affinity scope configurations (cache vs cache_shard).
The module spawns N kthreads (default: num_online_cpus()), each bound
to a different CPU. All threads start simultaneously and queue work
items, measuring the latency of each queue_work() call. Results are
reported as p50/p90/p95 latencies for each affinity scope.
The affinity scope is switched between runs via the workqueue's sysfs
affinity_scope attribute (WQ_SYSFS), avoiding the need for any new
exported symbols.
The module is __init-only: its init function returns -EAGAIN so the
module never stays loaded, and the benchmark can be re-run with
another insmod.
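The module's percentile reporting is a simple sort-and-index; a Python
equivalent of that calculation (same p50/p90/p95 indexing as the
module's sorted all_latencies[] array):

```python
def report_percentiles(latencies_ns):
    """Sort merged per-thread latencies and pick percentiles by index,
    the same way the module indexes its sorted array."""
    s = sorted(latencies_ns)
    n = len(s)
    return {p: s[n * p // 100] for p in (50, 90, 95)}

# 100 fake samples 0..99 ns: p50 -> 50, p90 -> 90, p95 -> 95
print(report_percentiles(range(100)))
```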
Signed-off-by: Breno Leitao <leitao@debian.org>
---
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 275 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 286 insertions(+)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 93f356d2b3d95..38bee649697f3 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2628,6 +2628,16 @@ config TEST_VMALLOC
If unsure, say N.
+config TEST_WORKQUEUE
+ tristate "Test module for stress/performance analysis of workqueue"
+ default n
+ help
+ This builds the "test_workqueue" module for benchmarking
+ workqueue throughput under contention. Useful for evaluating
+ affinity scope changes (e.g., cache_shard vs cache).
+
+ If unsure, say N.
+
config TEST_BPF
tristate "Test BPF filter functionality"
depends on m && NET
diff --git a/lib/Makefile b/lib/Makefile
index 1b9ee167517f3..ea660cca04f40 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -79,6 +79,7 @@ UBSAN_SANITIZE_test_ubsan.o := y
obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
obj-$(CONFIG_TEST_LKM) += test_module.o
obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
+obj-$(CONFIG_TEST_WORKQUEUE) += test_workqueue.o
obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_keys.o
obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o
diff --git a/lib/test_workqueue.c b/lib/test_workqueue.c
new file mode 100644
index 0000000000000..82540e5536078
--- /dev/null
+++ b/lib/test_workqueue.c
@@ -0,0 +1,275 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test module for stress and performance analysis of workqueue.
+ *
+ * Benchmarks queue_work() throughput on an unbound workqueue to measure
+ * pool->lock contention under different affinity scope configurations
+ * (e.g., cache vs cache_shard).
+ *
+ * The affinity scope is changed between runs via the workqueue's sysfs
+ * affinity_scope attribute (WQ_SYSFS).
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ *
+ */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/workqueue.h>
+#include <linux/kthread.h>
+#include <linux/moduleparam.h>
+#include <linux/completion.h>
+#include <linux/atomic.h>
+#include <linux/slab.h>
+#include <linux/ktime.h>
+#include <linux/cpumask.h>
+#include <linux/sched.h>
+#include <linux/sort.h>
+#include <linux/fs.h>
+
+#define WQ_NAME "bench_wq"
+#define SCOPE_PATH "/sys/bus/workqueue/devices/" WQ_NAME "/affinity_scope"
+
+static int nr_threads;
+module_param(nr_threads, int, 0444);
+MODULE_PARM_DESC(nr_threads,
+ "Number of threads to spawn (default: 0 = num_online_cpus())");
+
+static int wq_items = 50000;
+module_param(wq_items, int, 0444);
+MODULE_PARM_DESC(wq_items,
+ "Number of work items each thread queues (default: 50000)");
+
+static struct workqueue_struct *bench_wq;
+static atomic_t completed;
+static atomic_t threads_done;
+static DECLARE_COMPLETION(start_comp);
+static DECLARE_COMPLETION(all_done_comp);
+
+struct thread_ctx {
+ struct work_struct work;
+ struct completion work_done;
+ u64 *latencies;
+ int cpu;
+ int items;
+};
+
+static void bench_work_fn(struct work_struct *work)
+{
+ struct thread_ctx *ctx = container_of(work, struct thread_ctx, work);
+
+ atomic_inc(&completed);
+ complete(&ctx->work_done);
+}
+
+static int bench_kthread_fn(void *data)
+{
+ struct thread_ctx *ctx = data;
+ ktime_t t_start, t_end;
+ int i;
+
+ /* Wait for all threads to be ready */
+ wait_for_completion(&start_comp);
+
+ for (i = 0; i < ctx->items; i++) {
+ reinit_completion(&ctx->work_done);
+ INIT_WORK(&ctx->work, bench_work_fn);
+
+ t_start = ktime_get();
+ queue_work(bench_wq, &ctx->work);
+ t_end = ktime_get();
+
+ ctx->latencies[i] = ktime_to_ns(ktime_sub(t_end, t_start));
+ wait_for_completion(&ctx->work_done);
+ }
+
+ if (atomic_dec_and_test(&threads_done))
+ complete(&all_done_comp);
+
+ return 0;
+}
+
+static int cmp_u64(const void *a, const void *b)
+{
+ u64 va = *(const u64 *)a;
+ u64 vb = *(const u64 *)b;
+
+ if (va < vb)
+ return -1;
+ if (va > vb)
+ return 1;
+ return 0;
+}
+
+static int __init set_affn_scope(const char *scope)
+{
+ struct file *f;
+ loff_t pos = 0;
+ ssize_t ret;
+
+ f = filp_open(SCOPE_PATH, O_WRONLY, 0);
+ if (IS_ERR(f)) {
+ pr_err("test_workqueue: open %s failed: %ld\n",
+ SCOPE_PATH, PTR_ERR(f));
+ return PTR_ERR(f);
+ }
+
+ ret = kernel_write(f, scope, strlen(scope), &pos);
+ filp_close(f, NULL);
+
+ if (ret < 0) {
+ pr_err("test_workqueue: write '%s' failed: %zd\n", scope, ret);
+ return ret;
+ }
+
+ return 0;
+}
+
+static int __init run_bench(int n_threads, const char *scope)
+{
+ struct thread_ctx *ctxs;
+ struct task_struct **tasks;
+ u64 *all_latencies;
+ unsigned long total_items;
+ ktime_t start, end;
+ s64 elapsed_us;
+ int cpu, i, j, ret;
+
+ ret = set_affn_scope(scope);
+ if (ret)
+ return ret;
+
+ ctxs = kcalloc(n_threads, sizeof(*ctxs), GFP_KERNEL);
+ if (!ctxs)
+ return -ENOMEM;
+
+ tasks = kcalloc(n_threads, sizeof(*tasks), GFP_KERNEL);
+ if (!tasks) {
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+
+ total_items = (unsigned long)n_threads * wq_items;
+ all_latencies = kvmalloc_array(total_items, sizeof(u64), GFP_KERNEL);
+ if (!all_latencies) {
+ kfree(tasks);
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+
+ /* Allocate per-thread latency arrays */
+ for (i = 0; i < n_threads; i++) {
+ ctxs[i].latencies = kvmalloc_array(wq_items, sizeof(u64),
+ GFP_KERNEL);
+ if (!ctxs[i].latencies) {
+ while (--i >= 0)
+ kvfree(ctxs[i].latencies);
+ kvfree(all_latencies);
+ kfree(tasks);
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+ }
+
+ atomic_set(&completed, 0);
+ atomic_set(&threads_done, n_threads);
+ reinit_completion(&all_done_comp);
+ reinit_completion(&start_comp);
+
+ /* Create kthreads, each bound to a different online CPU */
+ i = 0;
+ for_each_online_cpu(cpu) {
+ if (i >= n_threads)
+ break;
+
+ ctxs[i].cpu = cpu;
+ ctxs[i].items = wq_items;
+ init_completion(&ctxs[i].work_done);
+
+ tasks[i] = kthread_create(bench_kthread_fn, &ctxs[i],
+ "wq_bench/%d", cpu);
+ if (IS_ERR(tasks[i])) {
+ ret = PTR_ERR(tasks[i]);
+ pr_err("test_workqueue: failed to create kthread %d: %d\n",
+ i, ret);
+ while (--i >= 0)
+ kthread_stop(tasks[i]);
+ goto out_free;
+ }
+
+ kthread_bind(tasks[i], cpu);
+ wake_up_process(tasks[i]);
+ i++;
+ }
+
+ /* Start timing and release all threads */
+ start = ktime_get();
+ complete_all(&start_comp);
+
+ /* Wait for all threads to finish */
+ wait_for_completion(&all_done_comp);
+
+ /* Drain any remaining work */
+ flush_workqueue(bench_wq);
+
+ end = ktime_get();
+ elapsed_us = ktime_us_delta(end, start);
+
+ /* Merge all per-thread latencies and sort for percentile calculation */
+ j = 0;
+ for (i = 0; i < n_threads; i++) {
+ memcpy(&all_latencies[j], ctxs[i].latencies,
+ wq_items * sizeof(u64));
+ j += wq_items;
+ }
+
+ sort(all_latencies, total_items, sizeof(u64), cmp_u64, NULL);
+
+ pr_info("test_workqueue: %-12s %llu items/sec\tp50=%llu\tp90=%llu\tp95=%llu ns\n",
+ scope,
+ elapsed_us ? total_items * 1000000ULL / elapsed_us : 0,
+ all_latencies[total_items * 50 / 100],
+ all_latencies[total_items * 90 / 100],
+ all_latencies[total_items * 95 / 100]);
+
+ ret = 0;
+out_free:
+ for (i = 0; i < n_threads; i++)
+ kvfree(ctxs[i].latencies);
+ kvfree(all_latencies);
+ kfree(tasks);
+ kfree(ctxs);
+
+ return ret;
+}
+
+static const char * const bench_scopes[] = {
+ "cpu", "smt", "cache_shard", "cache", "numa", "system",
+};
+
+static int __init test_workqueue_init(void)
+{
+ int n_threads = min_t(int, nr_threads ?: num_online_cpus(), num_online_cpus());
+ int i;
+
+ bench_wq = alloc_workqueue(WQ_NAME, WQ_UNBOUND | WQ_SYSFS, 0);
+ if (!bench_wq)
+ return -ENOMEM;
+
+ pr_info("test_workqueue: running %d threads, %d items/thread\n",
+ n_threads, wq_items);
+
+ for (i = 0; i < ARRAY_SIZE(bench_scopes); i++)
+ run_bench(n_threads, bench_scopes[i]);
+
+ destroy_workqueue(bench_wq);
+
+ return -EAGAIN;
+}
+
+module_init(test_workqueue_init);
+MODULE_AUTHOR("Breno Leitao <leitao@debian.org>");
+MODULE_DESCRIPTION("Stress/performance benchmark for workqueue subsystem");
+MODULE_LICENSE("GPL");
--
2.52.0
* [PATCH RFC 5/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
` (3 preceding siblings ...)
2026-03-12 16:12 ` [PATCH RFC 4/5] workqueue: add test_workqueue benchmark module Breno Leitao
@ 2026-03-12 16:12 ` Breno Leitao
2026-03-13 17:57 ` [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Tejun Heo
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-12 16:12 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
The WQ_AFFN_CACHE_SHARD affinity scope was added to the kernel but
wq_dump.py was not updated to enumerate it. Add the missing constant
lookup and include it in the affinity scopes iteration so that drgn
output shows the CACHE_SHARD pod topology alongside the other scopes.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
tools/workqueue/wq_dump.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index d29b918306b48..06948ffcfc4b6 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -107,6 +107,7 @@ WQ_MEM_RECLAIM = prog['WQ_MEM_RECLAIM']
WQ_AFFN_CPU = prog['WQ_AFFN_CPU']
WQ_AFFN_SMT = prog['WQ_AFFN_SMT']
WQ_AFFN_CACHE = prog['WQ_AFFN_CACHE']
+WQ_AFFN_CACHE_SHARD = prog['WQ_AFFN_CACHE_SHARD']
WQ_AFFN_NUMA = prog['WQ_AFFN_NUMA']
WQ_AFFN_SYSTEM = prog['WQ_AFFN_SYSTEM']
@@ -138,7 +139,7 @@ def print_pod_type(pt):
print(f' [{cpu}]={pt.cpu_pod[cpu].value_()}', end='')
print('')
-for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
+for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_CACHE_SHARD, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
print('')
print(f'{wq_affn_names[affn].string_().decode().upper()}{" (default)" if affn == wq_affn_dfl else ""}')
print_pod_type(wq_pod_types[affn])
--
2.52.0
* Re: [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
@ 2026-03-13 17:41 ` Tejun Heo
0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2026-03-13 17:41 UTC (permalink / raw)
To: Breno Leitao
Cc: Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello,
Applied to wq/for-7.1.
Thanks.
--
tejun
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
` (4 preceding siblings ...)
2026-03-12 16:12 ` [PATCH RFC 5/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
@ 2026-03-13 17:57 ` Tejun Heo
2026-03-17 11:32 ` Breno Leitao
5 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2026-03-13 17:57 UTC (permalink / raw)
To: Breno Leitao
Cc: Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello,
Applied 1/5. Some comments on the rest:
- The sharding currently splits on CPU boundary, which can split SMT
siblings across different pods. The worse performance on Intel compared
to SMT scope may be indicating exactly this - HT siblings ending up in
different pods. It'd be better to shard on core boundary so that SMT
siblings always stay together.
- How was the default shard size of 8 picked? There's a tradeoff between
the number of kworkers created and locality. Can you also report the
number of kworkers for each configuration? And is there data on
different shard sizes? It'd be useful to see how the numbers change
across e.g. 4, 8, 16, 32.
- Can you also test on AMD machines? Their CCD topology (16 or 32
threads per LLC) would be a good data point.
Thanks.
--
tejun
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-13 17:57 ` [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Tejun Heo
@ 2026-03-17 11:32 ` Breno Leitao
2026-03-17 13:58 ` Chuck Lever
0 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-03-17 11:32 UTC (permalink / raw)
To: Tejun Heo
Cc: Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello Tejun,
On Fri, Mar 13, 2026 at 07:57:20AM -1000, Tejun Heo wrote:
> Hello,
>
> Applied 1/5. Some comments on the rest:
>
> - The sharding currently splits on CPU boundary, which can split SMT
> siblings across different pods. The worse performance on Intel compared
> to SMT scope may be indicating exactly this - HT siblings ending up in
> different pods. It'd be better to shard on core boundary so that SMT
> siblings always stay together.
Thank you for the insight. I'll modify the sharding to operate at the
core boundary rather than at the SMT/thread level to ensure sibling CPUs
remain in the same pod.
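As a rough sketch of what that grouping means (plain Python, not the kernel cpumask code; the 8-core SMT2 topology below is invented for illustration), core-boundary sharding packs whole sibling groups into shards:

```python
# Illustrative sketch only -- not the kernel implementation. Pack whole
# cores (SMT sibling groups) into shards so siblings are never split.

def shard_by_core(sibling_groups, shard_size):
    """Greedily pack complete sibling groups into shards of at most
    shard_size CPUs (a single group larger than shard_size still
    stays together, pushing that shard over the cap)."""
    shards, current = [], []
    for core in sibling_groups:
        if current and len(current) + len(core) > shard_size:
            shards.append(current)
            current = []
        current.extend(core)
    if current:
        shards.append(current)
    return shards

# Invented topology: 8 cores, SMT2, CPU c paired with CPU c + 8.
cores = [[c, c + 8] for c in range(8)]
print(shard_by_core(cores, shard_size=4))
# -> [[0, 8, 1, 9], [2, 10, 3, 11], [4, 12, 5, 13], [6, 14, 7, 15]]
```

Sharding at CPU granularity instead could place CPU 0 and its sibling CPU 8 in different pods, which is the failure mode described above.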
> - How was the default shard size of 8 picked? There's a tradeoff
> between the number of kworkers created and locality. Can you also
> report the number of kworkers for each configuration? And is there
> data on different shard sizes? It'd be useful to see how the numbers
> change across e.g. 4, 8, 16, 32.
The choice of 8 as the default shard size was somewhat arbitrary – it was
selected primarily to generate initial data points.
I'll run tests with different shard sizes and report the results.
I'm currently working on finding a suitable workload with minimal noise.
Testing on real NVMe devices shows significant jitter that makes analysis
difficult. I've also been experimenting with nullblk, but haven't had much
success yet.
If you have any suggestions for a reliable workload or benchmark, I'd
appreciate your input.
> - Can you also test on AMD machines? Their CCD topology (16 or 32
> threads per LLC) would be a good data point.
Absolutely, I'll test on AMD machines as well.
Thanks,
--breno
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-17 11:32 ` Breno Leitao
@ 2026-03-17 13:58 ` Chuck Lever
2026-03-18 17:51 ` Breno Leitao
0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2026-03-17 13:58 UTC (permalink / raw)
To: Breno Leitao, Tejun Heo
Cc: Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team
On 3/17/26 7:32 AM, Breno Leitao wrote:
> Hello Tejun,
>
> On Fri, Mar 13, 2026 at 07:57:20AM -1000, Tejun Heo wrote:
>> Hello,
>>
>> Applied 1/5. Some comments on the rest:
>>
>> - The sharding currently splits on CPU boundary, which can split SMT
>> siblings across different pods. The worse performance on Intel compared
>> to SMT scope may be indicating exactly this - HT siblings ending up in
>> different pods. It'd be better to shard on core boundary so that SMT
>> siblings always stay together.
>
> Thank you for the insight. I'll modify the sharding to operate at the
> core boundary rather than at the SMT/thread level to ensure sibling CPUs
> remain in the same pod.
>
>> - How was the default shard size of 8 picked? There's a tradeoff
>> between the number of kworkers created and locality. Can you also
>> report the number of kworkers for each configuration? And is there
>> data on different shard sizes? It'd be useful to see how the numbers
>> change across e.g. 4, 8, 16, 32.
>
> The choice of 8 as the default shard size was somewhat arbitrary – it was
> selected primarily to generate initial data points.
Perhaps instead of basing the sharding on a particular number of CPUs
per shard, why not cap the total number of shards? IIUC that is the main
concern about ballooning the number of kworker threads.
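A minimal sketch of the difference between the two parameterizations (illustrative numbers; a `max_shards` knob is a hypothetical alternative, not a parameter in the posted patches):

```python
import math

def cpus_per_shard(llc_cpus, max_shards):
    # Capping the shard count: pod size grows with the LLC size,
    # so the kworker count stays bounded per LLC.
    return math.ceil(llc_cpus / max_shards)

# A 72-CPU Grace LLC capped at 8 shards yields 9-CPU pods, while a
# 16-CPU EPYC CCD under the same cap shrinks to 2-CPU pods.
print(cpus_per_shard(72, 8), cpus_per_shard(16, 8))  # -> 9 2
```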
> I'll run tests with different shard sizes and report the results.
>
> I'm currently working on finding a suitable workload with minimal noise.
> Testing on real NVMe devices shows significant jitter that makes analysis
> difficult. I've also been experimenting with nullblk, but haven't had much
> success yet.
>
> If you have any suggestions for a reliable workload or benchmark, I'd
> appreciate your input.
>
>> - Can you also test on AMD machines? Their CCD topology (16 or 32
>> threads per LLC) would be a good data point.
>
> Absolutely, I'll test on AMD machines as well.
>
> Thanks,
> --breno
--
Chuck Lever
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-17 13:58 ` Chuck Lever
@ 2026-03-18 17:51 ` Breno Leitao
2026-03-18 23:00 ` Tejun Heo
0 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-03-18 17:51 UTC (permalink / raw)
To: Chuck Lever
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team
On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> On 3/17/26 7:32 AM, Breno Leitao wrote:
> >> - How was the default shard size of 8 picked? There's a tradeoff
> >> between the number of kworkers created and locality. Can you also
> >> report the number of kworkers for each configuration? And is there
> >> data on different shard sizes? It'd be useful to see how the numbers
> >> change across e.g. 4, 8, 16, 32.
> >
> > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > selected primarily to generate initial data points.
>
> Perhaps instead of basing the sharding on a particular number of CPUs
> per shard, why not cap the total number of shards? IIUC that is the main
> concern about ballooning the number of kworker threads.
That's a great suggestion. I'll send a v2 that implements this approach,
where the parameter specifies the number of shards rather than the number
of CPUs per shard.
Thanks for the feedback,
--breno
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-18 17:51 ` Breno Leitao
@ 2026-03-18 23:00 ` Tejun Heo
2026-03-19 14:02 ` Breno Leitao
0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2026-03-18 23:00 UTC (permalink / raw)
To: Breno Leitao
Cc: Chuck Lever, Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team
On Wed, Mar 18, 2026 at 10:51:15AM -0700, Breno Leitao wrote:
> On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> > On 3/17/26 7:32 AM, Breno Leitao wrote:
> > >> - How was the default shard size of 8 picked? There's a tradeoff
> > >> between the number of kworkers created and locality. Can you also
> > >> report the number of kworkers for each configuration? And is there
> > >> data on different shard sizes? It'd be useful to see how the numbers
> > >> change across e.g. 4, 8, 16, 32.
> > >
> > > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > > selected primarily to generate initial data points.
> >
> > Perhaps instead of basing the sharding on a particular number of CPUs
> > per shard, why not cap the total number of shards? IIUC that is the main
> > concern about ballooning the number of kworker threads.
>
> That's a great suggestion. I'll send a v2 that implements this approach,
> where the parameter specifies the number of shards rather than the number
> of CPUs per shard.
Would it make sense though? It feels really odd to define the maximum number
of shards when contention is primarily a function of the number of CPUs
banging on the same pool lock. Why would 32-CPU and 512-CPU systems have the
same number of shards?
Thanks.
--
tejun
* Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-18 23:00 ` Tejun Heo
@ 2026-03-19 14:02 ` Breno Leitao
0 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-19 14:02 UTC (permalink / raw)
To: Tejun Heo
Cc: Chuck Lever, Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team
On Wed, Mar 18, 2026 at 01:00:07PM -1000, Tejun Heo wrote:
> On Wed, Mar 18, 2026 at 10:51:15AM -0700, Breno Leitao wrote:
> > On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> > > On 3/17/26 7:32 AM, Breno Leitao wrote:
> > > >> - How was the default shard size of 8 picked? There's a tradeoff
> > > >> between the number of kworkers created and locality. Can you also
> > > >> report the number of kworkers for each configuration? And is there
> > > >> data on different shard sizes? It'd be useful to see how the numbers
> > > >> change across e.g. 4, 8, 16, 32.
> > > >
> > > > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > > > selected primarily to generate initial data points.
> > >
> > > Perhaps instead of basing the sharding on a particular number of CPUs
> > > per shard, why not cap the total number of shards? IIUC that is the main
> > > concern about ballooning the number of kworker threads.
> >
> > That's a great suggestion. I'll send a v2 that implements this approach,
> > where the parameter specifies the number of shards rather than the number
> > of CPUs per shard.
>
> Woudl it make sense tho? If feels really odd to define the maximum number of
> shards when contention is primarily a function of the number of CPUs banging
> on the same CPU. Why would 32 cpu and 512 cpu systems have the same number
> of shards?
The trade-off is that specifying the maximum number of shards makes it
clearer how many times the LLC is being sharded, which might be easier
to reason about, but it will have less impact on contention scaling as
you reported above.
I've collected some numbers sharding by shard count per LLC, and I will
switch back to the original approach (CPUs per shard) to gather comparison
data.
Current change:
https://github.com/leitao/linux/commit/bedaf9ebe9594320976dcbf0cb507ecf083097c0
Workload:
========
I've finally found a workload that exercises the workqueue sufficiently,
which allows me to obtain stable benchmark results.
This is what the script does:
- Sets up a local loopback NFS environment backed by an 8 GB tmpfs
(/tmp/nfsexport → /mnt/nfs)
- Iterates over six fio I/O engines: sync, psync, vsync, pvsync, pvsync2,
libaio
- For each engine, runs a 200-job, 512-byte block size fio benchmark (writes
then reads)
- Tests each workload under both cache and cache_shard workqueue affinity
scopes via /sys/module/workqueue/parameters/default_affinity_scope
- Prints a summary table with aggregate bandwidth (MB) per scope and the
percentage delta to show whether cache_shard helps or hurts
- Restores the affinity scope back to cache when done
The test I am running can be found at
https://github.com/leitao/debug/blob/main/workqueue_performance/test_affinity.sh
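The scope switching and the summary-table delta in the script boil down to something like the sketch below. Only the sysfs path is from this thread; `set_scope()` needs root and is illustrative, so the runnable part here is just the delta math:

```python
# The affinity scope is switched by writing the scope name to this
# module parameter (requires root), e.g.:
#   echo cache_shard > /sys/module/workqueue/parameters/default_affinity_scope
SCOPE_PARAM = "/sys/module/workqueue/parameters/default_affinity_scope"

def set_scope(scope):
    with open(SCOPE_PARAM, "w") as f:  # needs root; illustrative only
        f.write(scope)

def pct_delta(cache_mb, shard_mb):
    """Percentage delta; positive means cache_shard outperformed cache."""
    return 100.0 * (shard_mb - cache_mb) / cache_mb

# e.g. aggregate write bandwidth of 1000 MB under cache vs 1125 MB under
# cache_shard would correspond to the +12.5% in the 2-shard ARM row below.
print(round(pct_delta(1000.0, 1125.0), 1))  # -> 12.5
```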
Hosts:
======
* ARM (NVIDIA Grace - Neoverse V2 - single L3 domain: CPUs 0-71)
# cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u
0-71
* AMD (EPYC 9D64 88-Core Processor - 11 L3 domains, 8 cores / 16 SMT threads each)
# cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u
0-7,88-95
16-23,104-111
24-31,112-119
32-39,120-127
40-47,128-135
48-55,136-143
56-63,144-151
64-71,152-159
72-79,160-167
80-87,168-175
8-15,96-103
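For reference, sysfs `shared_cpu_list` strings like the ones above can be parsed and counted with a short script (a sketch; it assumes the standard comma-separated range format):

```python
def parse_cpu_list(s):
    """Parse a sysfs cpu list like '0-7,88-95' into a frozenset of CPUs."""
    cpus = set()
    for part in s.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return frozenset(cpus)

# Reconstruct the 11 unique L3 lists reported for the EPYC host above:
# CPUs c..c+7 plus their SMT siblings c+88..c+95 per CCD.
llcs = {parse_cpu_list(f"{c}-{c + 7},{c + 88}-{c + 95}")
        for c in range(0, 88, 8)}
print(len(llcs), len(next(iter(llcs))))  # -> 11 16
```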
Results
=======
TL;DR:
* ARM (single L3, 72 CPUs): cache_shard consistently improves write
throughput by +6 to +12% across all shard counts (2-32), with
the peak at 2 shards. Read impact is minimal (noise-level).
Shard=1 confirms no effect as expected.
* AMD (11 L3 domains, 16 CPUs each): cache_shard shows no meaningful
benefit at 1-4 shards (all within noise/stddev). At 8 shards it
regresses by ~4% for both reads and writes, likely due to loss of
data locality when sharding already-small 16-CPU cache domains
further.
Benchmark data:
===============
ARM:
┌────────┬───────────────────┬──────────────┬──────────────────┬─────────────┐
│ Shards │ Write Delta (avg) │ Write stddev │ Read Delta (avg) │ Read stddev │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 1 │ -0.2% │ ±1.0% │ +1.2% │ ±1.7% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 2 │ +12.5% │ ±1.3% │ -0.3% │ ±0.9% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 4 │ +8.7% │ ±0.9% │ +1.8% │ ±1.5% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 8 │ +11.4% │ ±1.8% │ +3.1% │ ±1.5% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 16 │ +7.8% │ ±1.3% │ +1.6% │ ±1.0% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 32 │ +6.1% │ ±0.6% │ +0.3% │ ±1.5% │
└────────┴───────────────────┴──────────────┴──────────────────┴─────────────┘
AMD:
┌────────┬───────────────────┬──────────────┬──────────────────┬─────────────┐
│ Shards │ Write Delta (avg) │ Write stddev │ Read Delta (avg) │ Read stddev │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 1 │ -0.2% │ ±1.2% │ +0.1% │ ±1.0% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 2 │ +0.7% │ ±1.4% │ +0.5% │ ±1.1% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 4 │ +0.8% │ ±1.1% │ +1.3% │ ±1.2% │
├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
│ 8 │ -4.0% │ ±1.3% │ -4.5% │ ±0.9% │
└────────┴───────────────────┴──────────────┴──────────────────┴─────────────┘
Microbenchmark results
======================
I've run the micro-benchmark from this patchset as well; here is the
results comparison:
* AMD (11 L3 domains, 16 CPUs each): cache_shard delivers +45-55%
throughput and 36-44% lower latency at 2-8 shards. The sweet spot is
4 shards (+55%, p50 cut nearly in half). Shard=1 confirms no effect.
Even though this host already has multiple L3 domains, each 16-CPU
domain still has enough contention to benefit from further splitting
(for the sake of this microbenchmark/stress test).
* ARM (single L3, 72 CPUs): the gains are dramatic, with 2x at 2 shards,
3.2x at 4 shards, and 4.4x at 8 shards. At 8 shards, cache_shard
(3.2M items/s) nearly matches cpu scope performance (3.7M), with p50
latency dropping from 43.5 us to 6.9 us.
The single monolithic L3 makes cache scope degenerate to a single
contended pool, so sharding has a massive effect.
AMD
┌────────┬─────────────────┬───────────────────────┬─────────────────┬───────────┬───────────┬───────────────────┐
│ Shards │ cache (items/s) │ cache_shard (items/s) │ Throughput gain │ cache p50 │ shard p50 │ Latency reduction │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 1 │ 2,660,103 │ 2,667,740 │ +0.3% │ 27.5 us │ 27.5 us │ 0% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 2 │ 2,619,884 │ 3,788,454 │ +44.6% │ 28.0 us │ 17.8 us │ -36% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 4 │ 2,506,185 │ 3,891,064 │ +55.3% │ 29.3 us │ 16.5 us │ -44% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 8 │ 2,628,321 │ 4,015,312 │ +52.8% │ 27.9 us │ 16.4 us │ -41% │
└────────┴─────────────────┴───────────────────────┴─────────────────┴───────────┴───────────┴───────────────────┘
Reference scopes (stable across shard counts): cpu ~6.2M items/s, smt ~4.0M, numa/system ~422K.
ARM
┌────────┬─────────────────┬───────────────────────┬─────────────────┬───────────┬───────────┬───────────────────┐
│ Shards │ cache (items/s) │ cache_shard (items/s) │ Throughput gain │ cache p50 │ shard p50 │ Latency reduction │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 2 │ 725,999 │ 1,516,967 │ +109% │ 43.8 us │ 19.6 us │ -55% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 4 │ 729,615 │ 2,347,335 │ +222% │ 43.6 us │ 11.0 us │ -75% │
├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
│ 8 │ 731,517 │ 3,230,168 │ +342% │ 43.5 us │ 6.9 us │ -84% │
└────────┴─────────────────┴───────────────────────┴─────────────────┴───────────┴───────────┴───────────────────┘
Next Steps:
* Revert the code to sharding by CPU count (instead of by shard count) and
report the results.
* Are there any other tests that would help?
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
2026-03-13 17:41 ` Tejun Heo
2026-03-12 16:12 ` [PATCH RFC 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 4/5] workqueue: add test_workqueue benchmark module Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 5/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
2026-03-13 17:57 ` [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Tejun Heo
2026-03-17 11:32 ` Breno Leitao
2026-03-17 13:58 ` Chuck Lever
2026-03-18 17:51 ` Breno Leitao
2026-03-18 23:00 ` Tejun Heo
2026-03-19 14:02 ` Breno Leitao