* [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope
@ 2026-03-20 17:56 Breno Leitao
2026-03-20 17:56 ` [PATCH v2 1/5] workqueue: fix typo in WQ_AFFN_SMT comment Breno Leitao
` (5 more replies)
0 siblings, 6 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-20 17:56 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).
Changes from RFC:
* wq_cache_shard_size is in terms of cores (not vCPU). So,
wq_cache_shard_size=8 means the pool will have 8 cores and their siblings,
i.e. 16 threads/CPUs if SMT=2
* Got more data:
- AMD EPYC: All means are within ~1 stdev of zero. The deltas are
indistinguishable from noise, so shard scoping has no measurable
effect regardless of shard size. This is expected: this AMD EPYC has
11 L3 domains, so pool->lock contention was not a problem to begin with.
- ARM: A strong, consistent signal. At shard sizes 8 and 16 the mean
write improvement is ~7% with a relatively tight stdev (~1-2%),
meaning the gain is real and reproducible across all IO engines.
Even shard size 4 shows a solid +3.5% with the tightest stdev
(0.97%).
Reads: Small shard sizes (2, 4) show a slight regression of
~1.3–1.7% (low stdev, so consistent). Larger shard sizes (8, 16)
flip to a modest +1.4% gain, though shard_size=8 reads have high
variance (stdev 2.79%) driven by a single outlier (which appears to
be noise).
- Sweet spot: Shard sizes 8 to 16 offer the best overall profile
— highest write gain (6.95%) with the lowest write stdev (1.18%),
plus a consistent read gain (1.42%, stdev 0.70%) on ARM, and no
measurable impact on AMD EPYC.
* ARM (NVIDIA Grace - Neoverse V2 - single L3 domain: CPUs 0-71)
┌────────────┬────────────┬─────────────┬───────────┬────────────┐
│ Shard Size │ Write Mean │ Write StDev │ Read Mean │ Read StDev │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 2 │ +0.75% │ 1.32% │ -1.28% │ 0.45% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 4 │ +3.45% │ 0.97% │ -1.73% │ 0.52% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 8 │ +6.72% │ 1.97% │ +1.38% │ 2.79% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 16 │ +6.95% │ 1.18% │ +1.42% │ 0.70% │
└────────────┴────────────┴─────────────┴───────────┴────────────┘
* AMD (AMD EPYC 9D64 88-Core Processor - 11 L3 domains, 8 cores / 16 vCPUs each)
┌────────────┬────────────┬─────────────┬───────────┬────────────┐
│ Shard Size │ Write Mean │ Write StDev │ Read Mean │ Read StDev │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 2 │ +3.22% │ 1.90% │ -0.08% │ 0.72% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 4 │ +0.92% │ 1.59% │ +0.67% │ 2.33% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 8 │ +1.75% │ 1.47% │ -0.42% │ 0.72% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 16 │ +1.22% │ 1.72% │ +0.43% │ 1.32% │
└────────────┴────────────┴─────────────┴───────────┴────────────┘
---
Changes in v2:
- wq_cache_shard_size is in terms of cores (not vCPU)
- Link to v1: https://patch.msgid.link/20260312-workqueue_sharded-v1-0-2c43a7b861d0@debian.org
---
Breno Leitao (5):
workqueue: fix typo in WQ_AFFN_SMT comment
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
tools/workqueue: add CACHE_SHARD support to wq_dump.py
workqueue: add test_workqueue benchmark module
include/linux/workqueue.h | 3 +-
kernel/workqueue.c | 110 +++++++++++++++++-
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 277 +++++++++++++++++++++++++++++++++++++++++++++
tools/workqueue/wq_dump.py | 3 +-
6 files changed, 401 insertions(+), 3 deletions(-)
---
base-commit: 1adb306427e971ccac25b19410c9f068b92bd583
change-id: 20260309-workqueue_sharded-2327956e889b
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v2 1/5] workqueue: fix typo in WQ_AFFN_SMT comment
2026-03-20 17:56 [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Breno Leitao
@ 2026-03-20 17:56 ` Breno Leitao
2026-03-20 17:56 ` [PATCH v2 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
` (4 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-20 17:56 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Fix "poer" -> "per" in the WQ_AFFN_SMT enum comment.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
include/linux/workqueue.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a4749f56398fd..17543aec2a6e1 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -131,7 +131,7 @@ struct rcu_work {
enum wq_affn_scope {
WQ_AFFN_DFL, /* use system default */
WQ_AFFN_CPU, /* one pod per CPU */
- WQ_AFFN_SMT, /* one pod poer SMT */
+ WQ_AFFN_SMT, /* one pod per SMT */
WQ_AFFN_CACHE, /* one pod per LLC */
WQ_AFFN_NUMA, /* one pod per NUMA node */
WQ_AFFN_SYSTEM, /* one pod across the whole system */
--
2.52.0
* [PATCH v2 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-20 17:56 [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Breno Leitao
2026-03-20 17:56 ` [PATCH v2 1/5] workqueue: fix typo in WQ_AFFN_SMT comment Breno Leitao
@ 2026-03-20 17:56 ` Breno Leitao
2026-03-23 22:43 ` Tejun Heo
2026-03-20 17:56 ` [PATCH v2 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
` (3 subsequent siblings)
5 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-03-20 17:56 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
On systems where many CPUs share one LLC, unbound workqueues using
WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock
contention on pool->lock. For example, Chuck Lever measured 39% of
cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3
NFS-over-RDMA system.
The existing affinity hierarchy (cpu, smt, cache, numa, system) offers
no intermediate option between per-LLC and per-SMT-core granularity.
Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at
most wq_cache_shard_size cores (default 8, tunable via boot parameter).
Shards are always split on core (SMT group) boundaries so that
Hyper-Threading siblings are never placed in different pods. Cores are
distributed across shards as evenly as possible -- for example, 36 cores
in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7
cores.
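The even split described above can be sketched in Python (a hypothetical
shard_of() helper mirroring the kernel's DIV_ROUND_UP-based math, not the
in-tree code):

```python
def shard_of(core_pos, nr_cores, max_shard_size):
    """Map a core's position within its LLC to a shard index."""
    # ceil-divide, equivalent to the kernel's DIV_ROUND_UP()
    nr_shards = -(-nr_cores // max_shard_size)
    cores_per_shard, remainder = divmod(nr_cores, nr_shards)
    # The first `remainder` shards take one extra core each.
    boundary = remainder * (cores_per_shard + 1)
    if core_pos < boundary:
        return core_pos // (cores_per_shard + 1)
    return remainder + (core_pos - boundary) // cores_per_shard

# 36 cores, max shard size 8 -> 5 shards of 8+7+7+7+7 cores
counts = {}
for pos in range(36):
    counts[shard_of(pos, 36, 8)] = counts.get(shard_of(pos, 36, 8), 0) + 1
print(sorted(counts.values(), reverse=True))  # -> [8, 7, 7, 7, 7]
```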
The implementation follows the same comparator pattern as other affinity
scopes: cpu_cache_shard_id() computes a per-CPU shard index on the fly
from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology,
and cpus_share_cache_shard() is passed to init_pod_type().
A benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread)
shows cache_shard delivering ~5x the throughput and ~6.5x lower p50
latency compared to the cache scope.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 109 insertions(+)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 17543aec2a6e1..50bdb7e30d35f 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -133,6 +133,7 @@ enum wq_affn_scope {
WQ_AFFN_CPU, /* one pod per CPU */
WQ_AFFN_SMT, /* one pod per SMT */
WQ_AFFN_CACHE, /* one pod per LLC */
+ WQ_AFFN_CACHE_SHARD, /* synthetic sub-LLC shards */
WQ_AFFN_NUMA, /* one pod per NUMA node */
WQ_AFFN_SYSTEM, /* one pod across the whole system */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index a050971393f1f..ebbc7971b4fa6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -409,6 +409,7 @@ static const char * const wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_CPU] = "cpu",
[WQ_AFFN_SMT] = "smt",
[WQ_AFFN_CACHE] = "cache",
+ [WQ_AFFN_CACHE_SHARD] = "cache_shard",
[WQ_AFFN_NUMA] = "numa",
[WQ_AFFN_SYSTEM] = "system",
};
@@ -431,6 +432,9 @@ module_param_named(cpu_intensive_warning_thresh, wq_cpu_intensive_warning_thresh
static bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
module_param_named(power_efficient, wq_power_efficient, bool, 0444);
+static unsigned int wq_cache_shard_size = 8;
+module_param_named(cache_shard_size, wq_cache_shard_size, uint, 0444);
+
static bool wq_online; /* can kworkers be created yet? */
static bool wq_topo_initialized __read_mostly = false;
@@ -8107,6 +8111,104 @@ static bool __init cpus_share_numa(int cpu0, int cpu1)
return cpu_to_node(cpu0) == cpu_to_node(cpu1);
}
+/**
+ * llc_count_cores - count distinct cores (SMT groups) within a cpumask
+ * @pod_cpus: the cpumask to scan (typically an LLC pod)
+ * @smt_pt: the SMT pod type, used to identify sibling groups
+ *
+ * A core is represented by the lowest-numbered CPU in its SMT group. Returns
+ * the number of distinct cores found in @pod_cpus.
+ */
+static int __init llc_count_cores(const struct cpumask *pod_cpus,
+ struct wq_pod_type *smt_pt)
+{
+ const struct cpumask *smt_cpus;
+ int nr_cores = 0, c;
+
+ for_each_cpu(c, pod_cpus) {
+ smt_cpus = smt_pt->pod_cpus[smt_pt->cpu_pod[c]];
+ if (cpumask_first(smt_cpus) == c)
+ nr_cores++;
+ }
+
+ return nr_cores;
+}
+
+/**
+ * llc_cpu_core_pos - find a CPU's core position within a cpumask
+ * @cpu: the CPU to locate
+ * @pod_cpus: the cpumask to scan (typically an LLC pod)
+ * @smt_pt: the SMT pod type, used to identify sibling groups
+ *
+ * Returns the zero-based index of @cpu's core among the distinct cores in
+ * @pod_cpus, ordered by lowest CPU number in each SMT group.
+ */
+static int __init llc_cpu_core_pos(int cpu, const struct cpumask *pod_cpus,
+ struct wq_pod_type *smt_pt)
+{
+ const struct cpumask *smt_cpus;
+ int core_pos = 0, c;
+
+ for_each_cpu(c, pod_cpus) {
+ smt_cpus = smt_pt->pod_cpus[smt_pt->cpu_pod[c]];
+ if (cpumask_test_cpu(cpu, smt_cpus))
+ break;
+ if (cpumask_first(smt_cpus) == c)
+ core_pos++;
+ }
+
+ return core_pos;
+}
+
+/**
+ * cpu_cache_shard_id - compute the shard index for a CPU within its LLC pod
+ * @cpu: the CPU to look up
+ *
+ * Returns a shard index that is unique within the CPU's LLC pod. The LLC is
+ * divided into shards of at most wq_cache_shard_size cores, always split on
+ * core (SMT group) boundaries so that SMT siblings are never placed in
+ * different shards. Cores are distributed across shards as evenly as possible.
+ *
+ * Example: 36 cores with wq_cache_shard_size=8 gives 5 shards of
+ * 8+7+7+7+7 cores.
+ */
+static int __init cpu_cache_shard_id(int cpu)
+{
+ struct wq_pod_type *cache_pt = &wq_pod_types[WQ_AFFN_CACHE];
+ struct wq_pod_type *smt_pt = &wq_pod_types[WQ_AFFN_SMT];
+ const struct cpumask *pod_cpus;
+ int nr_cores, nr_shards, cores_per_shard, remainder, core_pos;
+
+ /* CPUs in the same LLC as @cpu */
+ pod_cpus = cache_pt->pod_cpus[cache_pt->cpu_pod[cpu]];
+ nr_cores = llc_count_cores(pod_cpus, smt_pt);
+
+ /* Compute number of shards from the max cores per shard */
+ nr_shards = DIV_ROUND_UP(nr_cores, wq_cache_shard_size);
+ /* Distribute cores as evenly as possible across shards */
+ cores_per_shard = nr_cores / nr_shards;
+ remainder = nr_cores % nr_shards;
+
+ core_pos = llc_cpu_core_pos(cpu, pod_cpus, smt_pt);
+
+ /*
+ * Map core position to shard index. The first @remainder shards have
+ * (cores_per_shard + 1) cores, the rest have @cores_per_shard cores.
+ */
+ if (core_pos < remainder * (cores_per_shard + 1))
+ return core_pos / (cores_per_shard + 1);
+
+ return remainder + (core_pos - remainder * (cores_per_shard + 1)) / cores_per_shard;
+}
+
+static bool __init cpus_share_cache_shard(int cpu0, int cpu1)
+{
+ if (!cpus_share_cache(cpu0, cpu1))
+ return false;
+
+ return cpu_cache_shard_id(cpu0) == cpu_cache_shard_id(cpu1);
+}
+
/**
* workqueue_init_topology - initialize CPU pods for unbound workqueues
*
@@ -8119,9 +8221,15 @@ void __init workqueue_init_topology(void)
struct workqueue_struct *wq;
int cpu;
+ if (!wq_cache_shard_size) {
+ pr_warn("workqueue: cache_shard_size must be > 0, setting to 1\n");
+ wq_cache_shard_size = 1;
+ }
+
init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
+ init_pod_type(&wq_pod_types[WQ_AFFN_CACHE_SHARD], cpus_share_cache_shard);
init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
wq_topo_initialized = true;
--
2.52.0
* [PATCH v2 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
2026-03-20 17:56 [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Breno Leitao
2026-03-20 17:56 ` [PATCH v2 1/5] workqueue: fix typo in WQ_AFFN_SMT comment Breno Leitao
2026-03-20 17:56 ` [PATCH v2 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
@ 2026-03-20 17:56 ` Breno Leitao
2026-03-20 17:56 ` [PATCH v2 4/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
` (2 subsequent siblings)
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-20 17:56 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound
workqueues. On systems where many CPUs share one LLC, the previous
default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool,
causing heavy spinlock contention on pool->lock.
WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing
a better balance between locality and contention. Users can revert to
the previous behavior with workqueue.default_affinity_scope=cache.
On systems with 8 or fewer cores per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 cores.
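The cutoff follows directly from the ceil-division used to size the
shards; a quick sketch (hypothetical helper, not the kernel code):

```python
import math

def nr_shards(nr_cores, max_shard_size=8):
    # Same rounding as the kernel's DIV_ROUND_UP()
    return math.ceil(nr_cores / max_shard_size)

# 8 or fewer cores per LLC: one shard, identical to the old cache scope
for cores in (4, 8, 10, 36, 72):
    print(cores, "->", nr_shards(cores))
```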
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ebbc7971b4fa6..6bc2e69dd5cc2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -441,7 +441,7 @@ static bool wq_topo_initialized __read_mostly = false;
static struct kmem_cache *pwq_cache;
static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
-static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE;
+static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE_SHARD;
/* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
static struct workqueue_attrs *unbound_wq_update_pwq_attrs_buf;
--
2.52.0
* [PATCH v2 4/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py
2026-03-20 17:56 [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Breno Leitao
` (2 preceding siblings ...)
2026-03-20 17:56 ` [PATCH v2 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
@ 2026-03-20 17:56 ` Breno Leitao
2026-03-20 17:56 ` [PATCH v2 5/5] workqueue: add test_workqueue benchmark module Breno Leitao
2026-03-23 14:11 ` [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Chuck Lever
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-20 17:56 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
The WQ_AFFN_CACHE_SHARD affinity scope was added to the kernel but
wq_dump.py was not updated to enumerate it. Add the missing constant
lookup and include it in the affinity scopes iteration so that drgn
output shows the CACHE_SHARD pod topology alongside the other scopes.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
tools/workqueue/wq_dump.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index d29b918306b48..06948ffcfc4b6 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -107,6 +107,7 @@ WQ_MEM_RECLAIM = prog['WQ_MEM_RECLAIM']
WQ_AFFN_CPU = prog['WQ_AFFN_CPU']
WQ_AFFN_SMT = prog['WQ_AFFN_SMT']
WQ_AFFN_CACHE = prog['WQ_AFFN_CACHE']
+WQ_AFFN_CACHE_SHARD = prog['WQ_AFFN_CACHE_SHARD']
WQ_AFFN_NUMA = prog['WQ_AFFN_NUMA']
WQ_AFFN_SYSTEM = prog['WQ_AFFN_SYSTEM']
@@ -138,7 +139,7 @@ def print_pod_type(pt):
print(f' [{cpu}]={pt.cpu_pod[cpu].value_()}', end='')
print('')
-for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
+for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_CACHE_SHARD, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
print('')
print(f'{wq_affn_names[affn].string_().decode().upper()}{" (default)" if affn == wq_affn_dfl else ""}')
print_pod_type(wq_pod_types[affn])
--
2.52.0
* [PATCH v2 5/5] workqueue: add test_workqueue benchmark module
2026-03-20 17:56 [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Breno Leitao
` (3 preceding siblings ...)
2026-03-20 17:56 ` [PATCH v2 4/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
@ 2026-03-20 17:56 ` Breno Leitao
2026-03-23 14:11 ` [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Chuck Lever
5 siblings, 0 replies; 13+ messages in thread
From: Breno Leitao @ 2026-03-20 17:56 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Add a kernel module that benchmarks queue_work() throughput on an
unbound workqueue to measure pool->lock contention under different
affinity scope configurations (cache vs cache_shard).
The module spawns N kthreads (default: num_online_cpus()), each bound
to a different CPU. All threads start simultaneously and queue work
items, measuring the latency of each queue_work() call. Results are
reported as p50/p90/p95 latencies for each affinity scope.
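The percentile reporting uses the usual sorted-array index convention;
sketched here in Python (a hypothetical helper, the module itself does
this in C over the merged latency array):

```python
def percentiles(latencies_ns):
    # Index convention matching the module: sorted[count * P / 100]
    s = sorted(latencies_ns)
    n = len(s)
    return {p: s[n * p // 100] for p in (50, 90, 95)}

print(percentiles(range(100)))  # -> {50: 50, 90: 90, 95: 95}
```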
The affinity scope is switched between runs via the workqueue's sysfs
affinity_scope attribute (WQ_SYSFS), avoiding the need for any new
exported symbols.
The module is __init-only and returns -EAGAIN so that it never stays
loaded, and can be re-run via insmod.
Example of the output:
running 50 threads, 50000 items/thread
cpu 6806017 items/sec p50=2574 p90=5068 p95=5818 ns
smt 6821040 items/sec p50=2624 p90=5168 p95=5949 ns
cache_shard 1633653 items/sec p50=5337 p90=9694 p95=11207 ns
cache 286069 items/sec p50=72509 p90=82304 p95=85009 ns
numa 319403 items/sec p50=63745 p90=73480 p95=76505 ns
system 308461 items/sec p50=66561 p90=75714 p95=78048 ns
Signed-off-by: Breno Leitao <leitao@debian.org>
---
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 277 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 288 insertions(+)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 93f356d2b3d95..38bee649697f3 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2628,6 +2628,16 @@ config TEST_VMALLOC
If unsure, say N.
+config TEST_WORKQUEUE
+ tristate "Test module for stress/performance analysis of workqueue"
+ default n
+ help
+ This builds the "test_workqueue" module for benchmarking
+ workqueue throughput under contention. Useful for evaluating
+ affinity scope changes (e.g., cache_shard vs cache).
+
+ If unsure, say N.
+
config TEST_BPF
tristate "Test BPF filter functionality"
depends on m && NET
diff --git a/lib/Makefile b/lib/Makefile
index 1b9ee167517f3..ea660cca04f40 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -79,6 +79,7 @@ UBSAN_SANITIZE_test_ubsan.o := y
obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
obj-$(CONFIG_TEST_LKM) += test_module.o
obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
+obj-$(CONFIG_TEST_WORKQUEUE) += test_workqueue.o
obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_keys.o
obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o
diff --git a/lib/test_workqueue.c b/lib/test_workqueue.c
new file mode 100644
index 0000000000000..a949af1d7e978
--- /dev/null
+++ b/lib/test_workqueue.c
@@ -0,0 +1,277 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test module for stress and performance analysis of workqueue.
+ *
+ * Benchmarks queue_work() throughput on an unbound workqueue to measure
+ * pool->lock contention under different affinity scope configurations
+ * (e.g., cache vs cache_shard).
+ *
+ * The affinity scope is changed between runs via the workqueue's sysfs
+ * affinity_scope attribute (WQ_SYSFS).
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ *
+ */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/workqueue.h>
+#include <linux/kthread.h>
+#include <linux/moduleparam.h>
+#include <linux/completion.h>
+#include <linux/atomic.h>
+#include <linux/slab.h>
+#include <linux/ktime.h>
+#include <linux/cpumask.h>
+#include <linux/sched.h>
+#include <linux/sort.h>
+#include <linux/fs.h>
+
+#define WQ_NAME "bench_wq"
+#define SCOPE_PATH "/sys/bus/workqueue/devices/" WQ_NAME "/affinity_scope"
+
+static int nr_threads;
+module_param(nr_threads, int, 0444);
+MODULE_PARM_DESC(nr_threads,
+ "Number of threads to spawn (default: 0 = num_online_cpus())");
+
+static int wq_items = 50000;
+module_param(wq_items, int, 0444);
+MODULE_PARM_DESC(wq_items,
+ "Number of work items each thread queues (default: 50000)");
+
+static struct workqueue_struct *bench_wq;
+static atomic_t threads_done;
+static DECLARE_COMPLETION(start_comp);
+static DECLARE_COMPLETION(all_done_comp);
+
+struct thread_ctx {
+ struct completion work_done;
+ struct work_struct work;
+ u64 *latencies;
+ int cpu;
+ int items;
+};
+
+static void bench_work_fn(struct work_struct *work)
+{
+ struct thread_ctx *ctx = container_of(work, struct thread_ctx, work);
+
+ complete(&ctx->work_done);
+}
+
+static int bench_kthread_fn(void *data)
+{
+ struct thread_ctx *ctx = data;
+ ktime_t t_start, t_end;
+ int i;
+
+ /* Wait for all threads to be ready */
+ wait_for_completion(&start_comp);
+
+ for (i = 0; i < ctx->items; i++) {
+ reinit_completion(&ctx->work_done);
+ INIT_WORK(&ctx->work, bench_work_fn);
+
+ t_start = ktime_get();
+ queue_work(bench_wq, &ctx->work);
+ t_end = ktime_get();
+
+ ctx->latencies[i] = ktime_to_ns(ktime_sub(t_end, t_start));
+ wait_for_completion(&ctx->work_done);
+ }
+
+ if (atomic_dec_and_test(&threads_done))
+ complete(&all_done_comp);
+
+ return 0;
+}
+
+static int cmp_u64(const void *a, const void *b)
+{
+ u64 va = *(const u64 *)a;
+ u64 vb = *(const u64 *)b;
+
+ if (va < vb)
+ return -1;
+ if (va > vb)
+ return 1;
+ return 0;
+}
+
+static int __init set_affn_scope(const char *scope)
+{
+ struct file *f;
+ loff_t pos = 0;
+ ssize_t ret;
+
+ f = filp_open(SCOPE_PATH, O_WRONLY, 0);
+ if (IS_ERR(f)) {
+ pr_err("test_workqueue: open %s failed: %ld\n",
+ SCOPE_PATH, PTR_ERR(f));
+ return PTR_ERR(f);
+ }
+
+ ret = kernel_write(f, scope, strlen(scope), &pos);
+ filp_close(f, NULL);
+
+ if (ret < 0) {
+ pr_err("test_workqueue: write '%s' failed: %zd\n", scope, ret);
+ return ret;
+ }
+
+ return 0;
+}
+
+static int __init run_bench(int n_threads, const char *scope, const char *label)
+{
+ struct task_struct **tasks;
+ unsigned long total_items;
+ struct thread_ctx *ctxs;
+ u64 *all_latencies;
+ ktime_t start, end;
+ int cpu, i, j, ret;
+ s64 elapsed_us;
+
+ ret = set_affn_scope(scope);
+ if (ret)
+ return ret;
+
+ ctxs = kcalloc(n_threads, sizeof(*ctxs), GFP_KERNEL);
+ if (!ctxs)
+ return -ENOMEM;
+
+ tasks = kcalloc(n_threads, sizeof(*tasks), GFP_KERNEL);
+ if (!tasks) {
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+
+ total_items = (unsigned long)n_threads * wq_items;
+ all_latencies = kvmalloc_array(total_items, sizeof(u64), GFP_KERNEL);
+ if (!all_latencies) {
+ kfree(tasks);
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+
+ /* Allocate per-thread latency arrays */
+ for (i = 0; i < n_threads; i++) {
+ ctxs[i].latencies = kvmalloc_array(wq_items, sizeof(u64),
+ GFP_KERNEL);
+ if (!ctxs[i].latencies) {
+ while (--i >= 0)
+ kvfree(ctxs[i].latencies);
+ kvfree(all_latencies);
+ kfree(tasks);
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+ }
+
+ atomic_set(&threads_done, n_threads);
+ reinit_completion(&all_done_comp);
+ reinit_completion(&start_comp);
+
+ /* Create kthreads, each bound to a different online CPU */
+ i = 0;
+ for_each_online_cpu(cpu) {
+ if (i >= n_threads)
+ break;
+
+ ctxs[i].cpu = cpu;
+ ctxs[i].items = wq_items;
+ init_completion(&ctxs[i].work_done);
+
+ tasks[i] = kthread_create(bench_kthread_fn, &ctxs[i],
+ "wq_bench/%d", cpu);
+ if (IS_ERR(tasks[i])) {
+ ret = PTR_ERR(tasks[i]);
+ pr_err("test_workqueue: failed to create kthread %d: %d\n",
+ i, ret);
+ while (--i >= 0)
+ kthread_stop(tasks[i]);
+ goto out_free;
+ }
+
+ kthread_bind(tasks[i], cpu);
+ wake_up_process(tasks[i]);
+ i++;
+ }
+
+ /* Start timing and release all threads */
+ start = ktime_get();
+ complete_all(&start_comp);
+
+ /* Wait for all threads to finish */
+ wait_for_completion(&all_done_comp);
+
+ /* Drain any remaining work */
+ flush_workqueue(bench_wq);
+
+ end = ktime_get();
+ elapsed_us = ktime_us_delta(end, start);
+
+ /* Merge all per-thread latencies and sort for percentile calculation */
+ j = 0;
+ for (i = 0; i < n_threads; i++) {
+ memcpy(&all_latencies[j], ctxs[i].latencies,
+ wq_items * sizeof(u64));
+ j += wq_items;
+ }
+
+ sort(all_latencies, total_items, sizeof(u64), cmp_u64, NULL);
+
+ pr_info("test_workqueue: %-16s %llu items/sec\tp50=%llu\tp90=%llu\tp95=%llu ns\n",
+ label,
+ elapsed_us ? total_items * 1000000ULL / elapsed_us : 0,
+ all_latencies[total_items * 50 / 100],
+ all_latencies[total_items * 90 / 100],
+ all_latencies[total_items * 95 / 100]);
+
+ ret = 0;
+out_free:
+ for (i = 0; i < n_threads; i++)
+ kvfree(ctxs[i].latencies);
+ kvfree(all_latencies);
+ kfree(tasks);
+ kfree(ctxs);
+
+ return ret;
+}
+
+static const char * const bench_scopes[] = {
+ "cpu", "smt", "cache", "cache_shard", "numa", "system",
+};
+
+static int __init test_workqueue_init(void)
+{
+ int n_threads = nr_threads ?: num_online_cpus();
+ int i;
+
+ if (wq_items <= 0) {
+ pr_err("test_workqueue: wq_items must be > 0\n");
+ return -EINVAL;
+ }
+
+ bench_wq = alloc_workqueue(WQ_NAME, WQ_UNBOUND | WQ_SYSFS, 0);
+ if (!bench_wq)
+ return -ENOMEM;
+
+ pr_info("test_workqueue: running %d threads, %d items/thread\n",
+ n_threads, wq_items);
+
+ for (i = 0; i < ARRAY_SIZE(bench_scopes); i++)
+ run_bench(n_threads, bench_scopes[i], bench_scopes[i]);
+
+ destroy_workqueue(bench_wq);
+
+ return -EAGAIN;
+}
+
+module_init(test_workqueue_init);
+MODULE_AUTHOR("Breno Leitao <leitao@debian.org>");
+MODULE_DESCRIPTION("Stress/performance benchmark for workqueue subsystem");
+MODULE_LICENSE("GPL");
--
2.52.0
* Re: [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope
2026-03-20 17:56 [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Breno Leitao
` (4 preceding siblings ...)
2026-03-20 17:56 ` [PATCH v2 5/5] workqueue: add test_workqueue benchmark module Breno Leitao
@ 2026-03-23 14:11 ` Chuck Lever
2026-03-23 15:10 ` Breno Leitao
5 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2026-03-23 14:11 UTC (permalink / raw)
To: Breno Leitao, Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
On Fri, Mar 20, 2026, at 1:56 PM, Breno Leitao wrote:
> TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
> unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
> to a single worker pool, causing heavy spinlock (pool->lock) contention.
> Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
> wq_cache_shard_size CPUs (default 8).
>
> Changes from RFC:
>
> * wq_cache_shard_size is in terms of cores (not vCPU). So,
> wq_cache_shard_size=8 means the pool will have 8 cores and their siblings,
> i.e. 16 threads/CPUs if SMT=2
My concern about the "cores per shard" approach is that it
improves the default situation for moderately-sized machines
little or not at all.
A machine with one L3 and 10 cores will go from 1 UNBOUND
pool to only 2. For virtual machines commonly deployed as
cloud instances, which are 2, 4, or 8 core systems (up to
16 threads), there will still be significant contention for
UNBOUND workers.
IOW, if you want good scaling, human intervention (via a
boot command-line option) is still needed.
--
Chuck Lever
* Re: [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope
2026-03-23 14:11 ` [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Chuck Lever
@ 2026-03-23 15:10 ` Breno Leitao
2026-03-23 15:28 ` Chuck Lever
0 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-03-23 15:10 UTC (permalink / raw)
To: Chuck Lever
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello Chuck,
On Mon, Mar 23, 2026 at 10:11:07AM -0400, Chuck Lever wrote:
> On Fri, Mar 20, 2026, at 1:56 PM, Breno Leitao wrote:
> > TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
> > unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
> > to a single worker pool, causing heavy spinlock (pool->lock) contention.
> > Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
> > wq_cache_shard_size CPUs (default 8).
> >
> > Changes from RFC:
> >
> > * wq_cache_shard_size is in terms of cores (not vCPU). So,
> > wq_cache_shard_size=8 means the pool will have 8 cores and their siblings,
> > i.e. 16 threads/CPUs if SMT=2
>
> My concern about the "cores per shard" approach is that it
> improves the default situation for moderately-sized machines
> little or not at all.
>
> A machine with one L3 and 10 cores will go from 1 UNBOUND
> pool to only 2. For virtual machines commonly deployed as
> cloud instances, which are 2, 4, or 8 core systems (up to
> 16 threads) there will still be significant contention for
> UNBOUND workers.
Could you clarify your concern? Are you suggesting the default value of
wq_cache_shard_size=8 is too high, or that the cores-per-shard approach
fundamentally doesn't scale well for moderately-sized systems?
Any approach—whether sharding by cores or by LLC—ultimately relies on
heuristics that may need tuning for specific workloads. The key difference
is where we draw the line. The current default of 8 cores prevents the
worst-case scenario: severe lock contention on large systems with 16+ CPUs
all hammering a single unbound workqueue.
For smaller systems (2-4 CPUs), contention is usually negligible
regardless of the approach. My perf lock contention measurements
consistently show minimal contention in that range.
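To make that line concrete, here is a small userspace sketch (not the
kernel code; the name is illustrative) of how a cores-per-shard cap turns
an LLC's core count into a pool count, using ceiling division so systems
at or below the cap keep a single pool:

```c
/*
 * Userspace sketch, not the kernel implementation: with a cores-per-shard
 * cap, an LLC with nr_cores cores is split into ceil(nr_cores / shard_size)
 * worker pools, so an LLC with at most shard_size cores keeps one pool.
 */
static int nr_pools(int nr_cores, int shard_size)
{
	return (nr_cores + shard_size - 1) / shard_size; /* ceiling division */
}
```

With the default of 8, a 4-core system keeps 1 pool, a 10-core LLC splits
into 2, and a 36-core LLC splits into 5.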
> IOW, if you want good scaling, human intervention (via a
> boot command-line option) is still needed.
I am not convinced. The wq_cache_shard_size approach creates multiple
pools on large systems while leaving small systems (<8 cores) unchanged.
This eliminates the pathological lock contention we're observing on
high-core-count machines without impacting smaller deployments.
In contrast, splitting pools per LLC would force fragmentation even on
systems that aren't experiencing contention, increasing the need for
manual tuning across a wider range of configurations.
Thanks for the review,
--breno
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope
2026-03-23 15:10 ` Breno Leitao
@ 2026-03-23 15:28 ` Chuck Lever
2026-03-23 16:26 ` Breno Leitao
0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2026-03-23 15:28 UTC (permalink / raw)
To: Breno Leitao
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
On 3/23/26 11:10 AM, Breno Leitao wrote:
> Hello Chuck,
>
> On Mon, Mar 23, 2026 at 10:11:07AM -0400, Chuck Lever wrote:
>> On Fri, Mar 20, 2026, at 1:56 PM, Breno Leitao wrote:
>>> TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
>>> unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
>>> to a single worker pool, causing heavy spinlock (pool->lock) contention.
>>> Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
>>> wq_cache_shard_size CPUs (default 8).
>>>
>>> Changes from RFC:
>>>
>>> * wq_cache_shard_size is in terms of cores (not vCPU). So,
>>> wq_cache_shard_size=8 means the pool will have 8 cores and their siblings,
>>> like 16 threads/CPUs if SMT=1
>>
>> My concern about the "cores per shard" approach is that it
>> improves the default situation for moderately-sized machines
>> little or not at all.
>>
>> A machine with one L3 and 10 cores will go from 1 UNBOUND
>> pool to only 2. For virtual machines commonly deployed as
>> cloud instances, which are 2, 4, or 8 core systems (up to
>> 16 threads) there will still be significant contention for
>> UNBOUND workers.
>
> Could you clarify your concern? Are you suggesting the default value of
> wq_cache_shard_size=8 is too high, or that the cores-per-shard approach
> fundamentally doesn't scale well for moderately-sized systems?
>
> Any approach—whether sharding by cores or by LLC—ultimately relies on
> heuristics that may need tuning for specific workloads. The key difference
> is where we draw the line. The current default of 8 cores prevents the
> worst-case scenario: severe lock contention on large systems with 16+ CPUs
> all hammering a single unbound workqueue.
An 8-core machine with 16 threads can handle quite a bit of I/O, but
with the proposed scheme it will still have a single UNBOUND pool.
For NFS workloads I commonly benchmark, splitting the UNBOUND pool
on such systems is a very clear win.
> For smaller systems (2-4 CPUs), contention is usually negligible
> regardless of the approach. My perf lock contention measurements
> consistently show minimal contention in that range.
>
>> IOW, if you want good scaling, human intervention (via a
>> boot command-line option) is still needed.
>
> I am not convinced. The wq_cache_shard_size approach creates multiple
> pools on large systems while leaving small systems (<8 cores) unchanged.
This is exactly my concern. Smaller systems /do/ experience measurable
contention in this area. I don't object to your series at all, it's
clean and well-motivated; but the cores-per-shard approach doesn't scale
down to very commonly deployed machine sizes.
We might also argue that the NFS client and other subsystems that make
significant use of UNBOUND workqueues in their I/O paths might be well
advised to modify their approach. (net/sunrpc/sched.c, hint hint)
> This eliminates the pathological lock contention we're observing on
> high-core-count machines without impacting smaller deployments.
>
> In contrast, splitting pools per LLC would force fragmentation even on
> systems that aren't experiencing contention, increasing the need for
> manual tuning across a wider range of configurations.
I claim that smaller deployments also need help. Further, I don't see
how UNBOUND pool fragmentation is a problem on such systems that needs
to be addressed (IMHO).
--
Chuck Lever
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope
2026-03-23 15:28 ` Chuck Lever
@ 2026-03-23 16:26 ` Breno Leitao
2026-03-23 18:04 ` Chuck Lever
0 siblings, 1 reply; 13+ messages in thread
From: Breno Leitao @ 2026-03-23 16:26 UTC (permalink / raw)
To: Chuck Lever
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello Chuck,
On Mon, Mar 23, 2026 at 11:28:49AM -0400, Chuck Lever wrote:
> On 3/23/26 11:10 AM, Breno Leitao wrote:
> >
> > I am not convinced. The wq_cache_shard_size approach creates multiple
> > pools on large systems while leaving small systems (<8 cores) unchanged.
>
> This is exactly my concern. Smaller systems /do/ experience measurable
> contention in this area. I don't object to your series at all, it's
> clean and well-motivated; but the cores-per-shard approach doesn't scale
> down to very commonly deployed machine sizes.
I don't see why the cores-per-shard approach wouldn't scale down
effectively.
The sharding mechanism itself is independent of whether we use
cores-per-shard or shards-per-LLC as the allocation strategy, correct?
Regardless of the approach, we retain full control over the granularity
of the shards.
> We might also argue that the NFS client and other subsystems that make
> significant use of UNBOUND workqueues in their I/O paths might be well
> advised to modify their approach. (net/sunrpc/sched.c, hint hint)
>
>
> > This eliminates the pathological lock contention we're observing on
> > high-core-count machines without impacting smaller deployments.
>
> > In contrast, splitting pools per LLC would force fragmentation even on
> > systems that aren't experiencing contention, increasing the need for
> > manual tuning across a wider range of configurations.
>
> I claim that smaller deployments also need help. Further, I don't see
> how UNBOUND pool fragmentation is a problem on such systems that needs
> to be addressed (IMHO).
Are you suggesting we should reduce the default value to something like
wq_cache_shard_size=2 instead of wq_cache_shard_size=8?
Thanks for the feedback,
--breno
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope
2026-03-23 16:26 ` Breno Leitao
@ 2026-03-23 18:04 ` Chuck Lever
2026-03-23 18:19 ` Tejun Heo
0 siblings, 1 reply; 13+ messages in thread
From: Chuck Lever @ 2026-03-23 18:04 UTC (permalink / raw)
To: Breno Leitao
Cc: Tejun Heo, Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
On 3/23/26 12:26 PM, Breno Leitao wrote:
> Hello Chuck,
>
> On Mon, Mar 23, 2026 at 11:28:49AM -0400, Chuck Lever wrote:
>> On 3/23/26 11:10 AM, Breno Leitao wrote:
>>>
>>> I am not convinced. The wq_cache_shard_size approach creates multiple
>>> pools on large systems while leaving small systems (<8 cores) unchanged.
>>
>> This is exactly my concern. Smaller systems /do/ experience measurable
>> contention in this area. I don't object to your series at all, it's
>> clean and well-motivated; but the cores-per-shard approach doesn't scale
>> down to very commonly deployed machine sizes.
>
> I don't see why the cores-per-shard approach wouldn't scale down
> effectively.
Sharding the UNBOUND pool is fine. But with a fixed cores-per-shard
ratio of 8, it doesn't scale down to smaller systems.
> The sharding mechanism itself is independent of whether we use
> cores-per-shard or shards-per-LLC as the allocation strategy, correct?
>
> Regardless of the approach, we retain full control over the granularity
> of the shards.
>
>> We might also argue that the NFS client and other subsystems that make
>> significant use of UNBOUND workqueues in their I/O paths might be well
>> advised to modify their approach. (net/sunrpc/sched.c, hint hint)
>>
>>
>>> This eliminates the pathological lock contention we're observing on
>>> high-core-count machines without impacting smaller deployments.
>>
>>> In contrast, splitting pools per LLC would force fragmentation even on
>>> systems that aren't experiencing contention, increasing the need for
>>> manual tuning across a wider range of configurations.
>>
>> I claim that smaller deployments also need help. Further, I don't see
>> how UNBOUND pool fragmentation is a problem on such systems that needs
>> to be addressed (IMHO).
>
> Are you suggesting we should reduce the default value to something like
> wq_cache_shard_size=2 instead of wq_cache_shard_size=8?
A shard size of 2 clearly won't scale properly to hundreds of cores. A
varying default cores-per-shard ratio would help scaling in both
directions, without having to manually tune.
--
Chuck Lever
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope
2026-03-23 18:04 ` Chuck Lever
@ 2026-03-23 18:19 ` Tejun Heo
0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2026-03-23 18:19 UTC (permalink / raw)
To: Chuck Lever
Cc: Breno Leitao, Lai Jiangshan, Andrew Morton, linux-kernel,
puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello,
On Mon, Mar 23, 2026 at 02:04:57PM -0400, Chuck Lever wrote:
> > I don't see why the cores-per-shard approach wouldn't scale down
> > effectively.
>
> Sharding the UNBOUND pool is fine. But with a fixed cores-per-shard
> ratio of 8, it doesn't scale down to smaller systems.
You aren't making a lot of sense. Contention is primarily a function of
the number of CPUs competing, not the inverse of how many cores are in the
LLC.
> A shard size of 2 clearly won't scale properly to hundreds of cores. A
> varying default cores-per-shard ratio would help scaling in both
> directions, without having to manually tune.
If your workload is bottlenecked on the pool lock on small machines, the
right course of action is either making the offending workqueue per-cpu or
configuring the unbound workqueue for that specific use case. That's why
it's programmatically configurable in the first place.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v2 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-03-20 17:56 ` [PATCH v2 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
@ 2026-03-23 22:43 ` Tejun Heo
0 siblings, 0 replies; 13+ messages in thread
From: Tejun Heo @ 2026-03-23 22:43 UTC (permalink / raw)
To: Breno Leitao
Cc: Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello,
On Fri, Mar 20, 2026 at 10:56:28AM -0700, Breno Leitao wrote:
> +/**
> + * llc_count_cores - count distinct cores (SMT groups) within a cpumask
> + * @pod_cpus: the cpumask to scan (typically an LLC pod)
> + * @smt_pt: the SMT pod type, used to identify sibling groups
> + *
> + * A core is represented by the lowest-numbered CPU in its SMT group. Returns
> + * the number of distinct cores found in @pod_cpus.
> + */
> +static int __init llc_count_cores(const struct cpumask *pod_cpus,
> + struct wq_pod_type *smt_pt)
> +{
> + const struct cpumask *smt_cpus;
> + int nr_cores = 0, c;
> +
> + for_each_cpu(c, pod_cpus) {
> + smt_cpus = smt_pt->pod_cpus[smt_pt->cpu_pod[c]];
> + if (cpumask_first(smt_cpus) == c)
> + nr_cores++;
> + }
> +
> + return nr_cores;
> +}
> +
> +/**
> + * llc_cpu_core_pos - find a CPU's core position within a cpumask
> + * @cpu: the CPU to locate
> + * @pod_cpus: the cpumask to scan (typically an LLC pod)
> + * @smt_pt: the SMT pod type, used to identify sibling groups
> + *
> + * Returns the zero-based index of @cpu's core among the distinct cores in
> + * @pod_cpus, ordered by lowest CPU number in each SMT group.
> + */
> +static int __init llc_cpu_core_pos(int cpu, const struct cpumask *pod_cpus,
> + struct wq_pod_type *smt_pt)
> +{
> + const struct cpumask *smt_cpus;
> + int core_pos = 0, c;
> +
> + for_each_cpu(c, pod_cpus) {
> + smt_cpus = smt_pt->pod_cpus[smt_pt->cpu_pod[c]];
> + if (cpumask_test_cpu(cpu, smt_cpus))
> + break;
> + if (cpumask_first(smt_cpus) == c)
> + core_pos++;
> + }
> +
> + return core_pos;
> +}
Can you do the above two in a separate pass and record the results and then
use that to implement cpu_cache_shard_id()? Doing all of it on the fly makes
it unnecessarily difficult to follow and init_pod_type() is already O(N^2)
and the above makes it O(N^4). Make the machine large enough and this may
become noticeable.
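One way to sketch the separate-pass idea (a hypothetical userspace model,
not the patch; array sizes and names are made up) is to walk the pod once,
assign each core a position as it is first encountered, and record every
CPU's core position, so the later shard lookup becomes O(1):

```c
#define MAX_CPUS 64

static int core_pos_of[MAX_CPUS];	/* CPU -> core position within its LLC */

/*
 * Single O(nr_cpus) pass: cpu_to_core[] maps each CPU to an opaque core id
 * (SMT siblings share one). Each core gets a position in order of first
 * appearance, and every CPU records its core's position.
 */
static void record_core_positions(const int *cpu_to_core, int nr_cpus)
{
	int core_index[MAX_CPUS];	/* core id -> assigned position */
	int nr_cores = 0, c;

	for (c = 0; c < nr_cpus; c++)
		core_index[c] = -1;

	for (c = 0; c < nr_cpus; c++) {
		if (core_index[cpu_to_core[c]] < 0)
			core_index[cpu_to_core[c]] = nr_cores++;
		core_pos_of[c] = core_index[cpu_to_core[c]];
	}
}

/* With positions precomputed, the shard id is a constant-time division. */
static int cache_shard_id(int cpu, int shard_size)
{
	return core_pos_of[cpu] / shard_size;
}
```

SMT siblings share a core position, so they always land in the same shard,
matching the split-on-core-boundaries rule from the patch description.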
> +/**
> + * cpu_cache_shard_id - compute the shard index for a CPU within its LLC pod
> + * @cpu: the CPU to look up
> + *
> + * Returns a shard index that is unique within the CPU's LLC pod. The LLC is
> + * divided into shards of at most wq_cache_shard_size cores, always split on
> + * core (SMT group) boundaries so that SMT siblings are never placed in
> + * different shards. Cores are distributed across shards as evenly as possible.
> + *
> + * Example: 36 cores with wq_cache_shard_size=8 gives 5 shards of
> + * 8+7+7+7+7 cores.
> + */
I always feel a bit uneasy about using a max number as the split point in
cases like this, because the reason you picked 8 as the default was that
testing showed shard sizes close to 8 seem to behave the best (or at least
acceptably in most cases). However, setting the max to 8 doesn't
necessarily keep you close to that. e.g. if there are 9 cores, you end up
with 5 and 4, even though 9 is a lot closer to the 8 that we picked as the
default. Can the sharding logic be updated to pick whatever sharding gets
the system closest to the configured target?
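One hedged sketch of that idea (illustrative only, not the posted code):
instead of capping each shard at the target and taking the ceiling, choose
between the floor and ceiling shard counts by which one's average shard
size lands closer to the target:

```c
/*
 * Illustrative sketch of "closest to the config target" sharding: pick the
 * shard count n, either floor or ceil of nr_cores/target, whose average
 * shard size nr_cores/n is nearest the target. 9 cores with target 8 then
 * stays one shard of 9 instead of splitting into 5+4.
 */
static int nearest_nr_shards(int nr_cores, int target)
{
	int lo = nr_cores / target;			/* fewer, larger shards */
	int hi = (nr_cores + target - 1) / target;	/* more, smaller shards */

	if (lo < 1)
		return 1;
	if (lo == hi)
		return lo;
	/*
	 * Compare |nr_cores/lo - target| against |nr_cores/hi - target|,
	 * cross-multiplied to stay in integer arithmetic.
	 */
	if ((nr_cores - target * lo) * hi <= (target * hi - nr_cores) * lo)
		return lo;
	return hi;
}
```

With target 8, 9 cores stay one shard of 9, while 36 cores still split
into 5 shards averaging 7.2 cores each.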
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2026-03-23 22:43 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-20 17:56 [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Breno Leitao
2026-03-20 17:56 ` [PATCH v2 1/5] workqueue: fix typo in WQ_AFFN_SMT comment Breno Leitao
2026-03-20 17:56 ` [PATCH v2 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-23 22:43 ` Tejun Heo
2026-03-20 17:56 ` [PATCH v2 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
2026-03-20 17:56 ` [PATCH v2 4/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
2026-03-20 17:56 ` [PATCH v2 5/5] workqueue: add test_workqueue benchmark module Breno Leitao
2026-03-23 14:11 ` [PATCH v2 0/5] workqueue: Introduce a sharded cache affinity scope Chuck Lever
2026-03-23 15:10 ` Breno Leitao
2026-03-23 15:28 ` Chuck Lever
2026-03-23 16:26 ` Breno Leitao
2026-03-23 18:04 ` Chuck Lever
2026-03-23 18:19 ` Tejun Heo