* [PATCH v3 0/6] workqueue: Introduce a sharded cache affinity scope
@ 2026-04-01 13:03 Breno Leitao
2026-04-01 13:03 ` [PATCH v3 1/6] workqueue: fix typo in WQ_AFFN_SMT comment Breno Leitao
` (6 more replies)
0 siblings, 7 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-01 13:03 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).
Changes from RFC:
* wq_cache_shard_size is in terms of cores (not vCPUs). So
wq_cache_shard_size=8 means the pool will cover 8 cores plus their SMT
siblings, i.e. 16 threads/CPUs with 2-way SMT
* Got more data:
- AMD EPYC: all means are within ~1 stdev of zero; the deltas are
indistinguishable from noise, so shard scoping has no measurable
effect regardless of shard size. This is expected, since the EPYC
already has 11 L3 domains and lock contention is not a problem there.
- ARM: a strong, consistent signal. At shard sizes 8 and 16 the mean
write improvement is ~7% with a relatively tight stdev (~1-2%),
meaning the gain is real and reproducible across all IO engines.
Even shard size 4 shows a solid +3.5% with the tightest stdev
(0.97%).
Reads: small shard sizes (2, 4) show a slight regression of
~1.3-1.7% (low stdev, so consistent). Larger shard sizes (8, 16)
flip to a modest +1.4% gain, though shard_size=8 reads have high
variance (stdev 2.79%) driven by a single outlier that appears to
be noise.
- Sweet spot: shard sizes 8 to 16 offer the best overall profile:
the highest write gain (6.95%) with the lowest write stdev (1.18%),
plus a consistent read gain (1.42%, stdev 0.70%), and no impact on
AMD/x86.
* ARM (NVIDIA Grace - Neoverse V2 - single L3 domain: CPUs 0-71)
┌────────────┬────────────┬─────────────┬───────────┬────────────┐
│ Shard Size │ Write Mean │ Write StDev │ Read Mean │ Read StDev │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 2 │ +0.75% │ 1.32% │ -1.28% │ 0.45% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 4 │ +3.45% │ 0.97% │ -1.73% │ 0.52% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 8 │ +6.72% │ 1.97% │ +1.38% │ 2.79% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 16 │ +6.95% │ 1.18% │ +1.42% │ 0.70% │
└────────────┴────────────┴─────────────┴───────────┴────────────┘
* x86 (AMD EPYC 9D64 88-Core Processor - 11 L3 domains, 8 Cores / 16 vCPUs each)
┌────────────┬────────────┬─────────────┬───────────┬────────────┐
│ Shard Size │ Write Mean │ Write StDev │ Read Mean │ Read StDev │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 2 │ +3.22% │ 1.90% │ -0.08% │ 0.72% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 4 │ +0.92% │ 1.59% │ +0.67% │ 2.33% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 8 │ +1.75% │ 1.47% │ -0.42% │ 0.72% │
├────────────┼────────────┼─────────────┼───────────┼────────────┤
│ 16 │ +1.22% │ 1.72% │ +0.43% │ 1.32% │
└────────────┴────────────┴─────────────┴───────────┴────────────┘
---
Changes in v3:
- Precomputed the shards to avoid exponential time when creating the
pool. (Tejun)
- Added documentation about the new cache-sharding affinity scope.
- Fixed a use-after-free on module unload (in the selftest).
- Link to v2: https://patch.msgid.link/20260320-workqueue_sharded-v2-0-8372930931af@debian.org
Changes in v2:
- wq_cache_shard_size is in terms of cores (not vCPU)
- Link to v1: https://patch.msgid.link/20260312-workqueue_sharded-v1-0-2c43a7b861d0@debian.org
---
Breno Leitao (6):
workqueue: fix typo in WQ_AFFN_SMT comment
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
tools/workqueue: add CACHE_SHARD support to wq_dump.py
workqueue: add test_workqueue benchmark module
docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
Documentation/admin-guide/kernel-parameters.txt | 3 +-
Documentation/core-api/workqueue.rst | 14 +-
include/linux/workqueue.h | 3 +-
kernel/workqueue.c | 185 ++++++++++++++-
lib/Kconfig.debug | 10 +
lib/Makefile | 1 +
lib/test_workqueue.c | 294 ++++++++++++++++++++++++
tools/workqueue/wq_dump.py | 3 +-
8 files changed, 505 insertions(+), 8 deletions(-)
---
base-commit: 0e4f8f1a3d081e834be5fd0a62bdb2554fadd307
change-id: 20260309-workqueue_sharded-2327956e889b
Best regards,
--
Breno Leitao <leitao@debian.org>
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v3 1/6] workqueue: fix typo in WQ_AFFN_SMT comment
2026-04-01 13:03 [PATCH v3 0/6] workqueue: Introduce a sharded cache affinity scope Breno Leitao
@ 2026-04-01 13:03 ` Breno Leitao
2026-04-01 13:03 ` [PATCH v3 2/6] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
` (5 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-01 13:03 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Fix "poer" -> "per" in the WQ_AFFN_SMT enum comment.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
include/linux/workqueue.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index a4749f56398f..17543aec2a6e 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -131,7 +131,7 @@ struct rcu_work {
enum wq_affn_scope {
WQ_AFFN_DFL, /* use system default */
WQ_AFFN_CPU, /* one pod per CPU */
- WQ_AFFN_SMT, /* one pod poer SMT */
+ WQ_AFFN_SMT, /* one pod per SMT */
WQ_AFFN_CACHE, /* one pod per LLC */
WQ_AFFN_NUMA, /* one pod per NUMA node */
WQ_AFFN_SYSTEM, /* one pod across the whole system */
--
2.52.0
* [PATCH v3 2/6] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
2026-04-01 13:03 [PATCH v3 0/6] workqueue: Introduce a sharded cache affinity scope Breno Leitao
2026-04-01 13:03 ` [PATCH v3 1/6] workqueue: fix typo in WQ_AFFN_SMT comment Breno Leitao
@ 2026-04-01 13:03 ` Breno Leitao
2026-04-01 13:03 ` [PATCH v3 3/6] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
` (4 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-01 13:03 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
On systems where many CPUs share one LLC, unbound workqueues using
WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock
contention on pool->lock. For example, Chuck Lever measured 39% of
cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3
NFS-over-RDMA system.
The existing affinity hierarchy (cpu, smt, cache, numa, system) offers
no intermediate option between per-LLC and per-SMT-core granularity.
Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at
most wq_cache_shard_size cores (default 8, tunable via boot parameter).
Shards are always split on core (SMT group) boundaries so that
Hyper-Threading siblings are never placed in different pods. Cores are
distributed across shards as evenly as possible -- for example, 36 cores
in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7
cores.
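The even-split arithmetic above can be sketched in a few lines of
userspace Python (illustrative only; shard_layout is a hypothetical
helper mirroring llc_calc_shard_layout() and DIV_ROUND_CLOSEST()):

```python
def shard_layout(nr_cores, max_shard_size=8):
    """Sketch of llc_calc_shard_layout(): split an LLC's cores into
    shards whose average size is as close as possible to
    max_shard_size, spreading the remainder over the first shards."""
    # DIV_ROUND_CLOSEST(nr_cores, max_shard_size), with at least 1 shard
    nr_shards = max(1, (nr_cores + max_shard_size // 2) // max_shard_size)
    cores_per_shard = nr_cores // nr_shards   # base ("default") shard size
    nr_large_shards = nr_cores % nr_shards    # shards that get one extra core
    return [cores_per_shard + 1 if i < nr_large_shards else cores_per_shard
            for i in range(nr_shards)]

# 36 cores in one LLC with max shard size 8 -> 5 shards of 8+7+7+7+7
print(shard_layout(36))  # [8, 7, 7, 7, 7]
```

With 8 or fewer cores per LLC this yields a single shard, i.e. the
same grouping as the plain per-LLC cache scope.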
The implementation follows the same comparator pattern as other affinity
scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array
from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology,
and cpus_share_cache_shard() is passed to init_pod_type().
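The comparator pattern itself can be modeled as a simple grouping loop
(a simplified sketch, not the kernel code: build_pods is a hypothetical
helper, and the share predicate stands in for callbacks such as
cpus_share_cache_shard()):

```python
def build_pods(cpus, share):
    """Simplified model of init_pod_type(): each CPU joins the first
    existing pod whose representative CPU it 'shares' the scope with,
    otherwise it starts a new pod."""
    pods = []
    for c in cpus:
        for pod in pods:
            if share(pod[0], c):   # compare against the pod's first CPU
                pod.append(c)
                break
        else:
            pods.append([c])
    return pods

# e.g. a predicate grouping CPUs 4-at-a-time produces two pods of 4
print(build_pods(range(8), lambda a, b: a // 4 == b // 4))
```

The shard comparator only has to answer "same LLC and same precomputed
shard id", which is why the per-LLC layout is computed up front.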
A benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread)
shows cache_shard delivering ~5x the throughput and ~6.5x lower p50
latency than the cache scope on this single-LLC system.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 183 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 184 insertions(+)
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 17543aec2a6e..50bdb7e30d35 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -133,6 +133,7 @@ enum wq_affn_scope {
WQ_AFFN_CPU, /* one pod per CPU */
WQ_AFFN_SMT, /* one pod per SMT */
WQ_AFFN_CACHE, /* one pod per LLC */
+ WQ_AFFN_CACHE_SHARD, /* synthetic sub-LLC shards */
WQ_AFFN_NUMA, /* one pod per NUMA node */
WQ_AFFN_SYSTEM, /* one pod across the whole system */
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b77119d71641..5b1d42115e20 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -130,6 +130,14 @@ enum wq_internal_consts {
WORKER_ID_LEN = 10 + WQ_NAME_LEN, /* "kworker/R-" + WQ_NAME_LEN */
};
+/* Layout of shards within one LLC pod */
+struct llc_shard_layout {
+ int nr_large_shards; /* number of large shards (cores_per_shard + 1) */
+ int cores_per_shard; /* base number of cores per default shard */
+ int nr_shards; /* total number of shards */
+ /* nr_default shards = (nr_shards - nr_large_shards) */
+};
+
/*
* We don't want to trap softirq for too long. See MAX_SOFTIRQ_TIME and
* MAX_SOFTIRQ_RESTART in kernel/softirq.c. These are macros because
@@ -409,6 +417,7 @@ static const char *wq_affn_names[WQ_AFFN_NR_TYPES] = {
[WQ_AFFN_CPU] = "cpu",
[WQ_AFFN_SMT] = "smt",
[WQ_AFFN_CACHE] = "cache",
+ [WQ_AFFN_CACHE_SHARD] = "cache_shard",
[WQ_AFFN_NUMA] = "numa",
[WQ_AFFN_SYSTEM] = "system",
};
@@ -431,6 +440,9 @@ module_param_named(cpu_intensive_warning_thresh, wq_cpu_intensive_warning_thresh
static bool wq_power_efficient = IS_ENABLED(CONFIG_WQ_POWER_EFFICIENT_DEFAULT);
module_param_named(power_efficient, wq_power_efficient, bool, 0444);
+static unsigned int wq_cache_shard_size = 8;
+module_param_named(cache_shard_size, wq_cache_shard_size, uint, 0444);
+
static bool wq_online; /* can kworkers be created yet? */
static bool wq_topo_initialized __read_mostly = false;
@@ -8113,6 +8125,175 @@ static bool __init cpus_share_numa(int cpu0, int cpu1)
return cpu_to_node(cpu0) == cpu_to_node(cpu1);
}
+/* Maps each CPU to its shard index within the LLC pod it belongs to */
+static int cpu_shard_id[NR_CPUS] __initdata;
+
+/**
+ * llc_count_cores - count distinct cores (SMT groups) within an LLC pod
+ * @pod_cpus: the cpumask of CPUs in the LLC pod
+ * @smt_pods: the SMT pod type, used to identify sibling groups
+ *
+ * A core is represented by the lowest-numbered CPU in its SMT group. Returns
+ * the number of distinct cores found in @pod_cpus.
+ */
+static int __init llc_count_cores(const struct cpumask *pod_cpus,
+ struct wq_pod_type *smt_pods)
+{
+ const struct cpumask *sibling_cpus;
+ int nr_cores = 0, c;
+
+ /*
+ * Count distinct cores by only counting the first CPU in each
+ * SMT sibling group.
+ */
+ for_each_cpu(c, pod_cpus) {
+ sibling_cpus = smt_pods->pod_cpus[smt_pods->cpu_pod[c]];
+ if (cpumask_first(sibling_cpus) == c)
+ nr_cores++;
+ }
+
+ return nr_cores;
+}
+
+/*
+ * llc_shard_size - number of cores in a given shard
+ *
+ * Cores are spread as evenly as possible. The first @nr_large_shards shards are
+ * "large shards" with (cores_per_shard + 1) cores; the rest are "default
+ * shards" with cores_per_shard cores.
+ */
+static int __init llc_shard_size(int shard_id, int cores_per_shard, int nr_large_shards)
+{
+ /* The first @nr_large_shards shards are large shards */
+ if (shard_id < nr_large_shards)
+ return cores_per_shard + 1;
+
+ /* The remaining shards are default shards */
+ return cores_per_shard;
+}
+
+/*
+ * llc_calc_shard_layout - compute the shard layout for an LLC pod
+ * @nr_cores: number of distinct cores in the LLC pod
+ *
+ * Chooses the number of shards that keeps average shard size closest to
+ * wq_cache_shard_size. Returns a struct describing the total number of shards,
+ * the base size of each, and how many are large shards.
+ */
+static struct llc_shard_layout __init llc_calc_shard_layout(int nr_cores)
+{
+ struct llc_shard_layout layout;
+
+ /* Ensure at least one shard; pick the count closest to the target size */
+ layout.nr_shards = max(1, DIV_ROUND_CLOSEST(nr_cores, wq_cache_shard_size));
+ layout.cores_per_shard = nr_cores / layout.nr_shards;
+ layout.nr_large_shards = nr_cores % layout.nr_shards;
+
+ return layout;
+}
+
+/*
+ * llc_shard_is_full - check whether a shard has reached its core capacity
+ * @cores_in_shard: number of cores already assigned to this shard
+ * @shard_id: index of the shard being checked
+ * @layout: the shard layout computed by llc_calc_shard_layout()
+ *
+ * Returns true if @cores_in_shard equals the expected size for @shard_id.
+ */
+static bool __init llc_shard_is_full(int cores_in_shard, int shard_id,
+ const struct llc_shard_layout *layout)
+{
+ return cores_in_shard == llc_shard_size(shard_id, layout->cores_per_shard,
+ layout->nr_large_shards);
+}
+
+/**
+ * llc_populate_cpu_shard_id - populate cpu_shard_id[] for each CPU in an LLC pod
+ * @pod_cpus: the cpumask of CPUs in the LLC pod
+ * @smt_pods: the SMT pod type, used to identify sibling groups
+ * @nr_cores: number of distinct cores in @pod_cpus (from llc_count_cores())
+ *
+ * Walks @pod_cpus in order. At each SMT group leader, advances to the next
+ * shard once the current shard is full. Results are written to cpu_shard_id[].
+ */
+static void __init llc_populate_cpu_shard_id(const struct cpumask *pod_cpus,
+ struct wq_pod_type *smt_pods,
+ int nr_cores)
+{
+ struct llc_shard_layout layout = llc_calc_shard_layout(nr_cores);
+ const struct cpumask *sibling_cpus;
+ /* Count the number of cores in the current shard_id */
+ int cores_in_shard = 0;
+ /* Cursor over the shards; goes from 0 to nr_shards - 1 */
+ int shard_id = 0;
+ int c;
+
+ /* Iterate over every CPU in this LLC pod and assign it a shard */
+ for_each_cpu(c, pod_cpus) {
+ sibling_cpus = smt_pods->pod_cpus[smt_pods->cpu_pod[c]];
+ if (cpumask_first(sibling_cpus) == c) {
+ /* This is the CPU leader for the siblings */
+ if (llc_shard_is_full(cores_in_shard, shard_id, &layout)) {
+ shard_id++;
+ cores_in_shard = 0;
+ }
+ cores_in_shard++;
+ cpu_shard_id[c] = shard_id;
+ } else {
+ /*
+ * The siblings' shard MUST be the same as the leader's:
+ * never split threads of the same core across shards.
+ */
+ cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
+ }
+ }
+
+ WARN_ON_ONCE(shard_id != (layout.nr_shards - 1));
+}
+
+/**
+ * precompute_cache_shard_ids - assign each CPU its shard index within its LLC
+ *
+ * Iterates over all LLC pods. For each pod, counts distinct cores then assigns
+ * shard indices to all CPUs in the pod. Must be called after WQ_AFFN_CACHE and
+ * WQ_AFFN_SMT have been initialized.
+ */
+static void __init precompute_cache_shard_ids(void)
+{
+ struct wq_pod_type *llc_pods = &wq_pod_types[WQ_AFFN_CACHE];
+ struct wq_pod_type *smt_pods = &wq_pod_types[WQ_AFFN_SMT];
+ const struct cpumask *cpus_sharing_llc;
+ int nr_cores;
+ int pod;
+
+ if (!wq_cache_shard_size) {
+ pr_warn("workqueue: cache_shard_size must be > 0, setting to 1\n");
+ wq_cache_shard_size = 1;
+ }
+
+ for (pod = 0; pod < llc_pods->nr_pods; pod++) {
+ cpus_sharing_llc = llc_pods->pod_cpus[pod];
+
+ /* Number of cores in this given LLC */
+ nr_cores = llc_count_cores(cpus_sharing_llc, smt_pods);
+ llc_populate_cpu_shard_id(cpus_sharing_llc, smt_pods, nr_cores);
+ }
+}
+
+/*
+ * cpus_share_cache_shard - test whether two CPUs belong to the same cache shard
+ *
+ * Two CPUs share a cache shard if they are in the same LLC and have the same
+ * shard index. Used as the pod affinity callback for WQ_AFFN_CACHE_SHARD.
+ */
+static bool __init cpus_share_cache_shard(int cpu0, int cpu1)
+{
+ if (!cpus_share_cache(cpu0, cpu1))
+ return false;
+
+ return cpu_shard_id[cpu0] == cpu_shard_id[cpu1];
+}
+
/**
* workqueue_init_topology - initialize CPU pods for unbound workqueues
*
@@ -8128,6 +8309,8 @@ void __init workqueue_init_topology(void)
init_pod_type(&wq_pod_types[WQ_AFFN_CPU], cpus_dont_share);
init_pod_type(&wq_pod_types[WQ_AFFN_SMT], cpus_share_smt);
init_pod_type(&wq_pod_types[WQ_AFFN_CACHE], cpus_share_cache);
+ precompute_cache_shard_ids();
+ init_pod_type(&wq_pod_types[WQ_AFFN_CACHE_SHARD], cpus_share_cache_shard);
init_pod_type(&wq_pod_types[WQ_AFFN_NUMA], cpus_share_numa);
wq_topo_initialized = true;
--
2.52.0
* [PATCH v3 3/6] workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
2026-04-01 13:03 [PATCH v3 0/6] workqueue: Introduce a sharded cache affinity scope Breno Leitao
2026-04-01 13:03 ` [PATCH v3 1/6] workqueue: fix typo in WQ_AFFN_SMT comment Breno Leitao
2026-04-01 13:03 ` [PATCH v3 2/6] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
@ 2026-04-01 13:03 ` Breno Leitao
2026-04-01 13:03 ` [PATCH v3 4/6] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
` (3 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-01 13:03 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound
workqueues. On systems where many CPUs share one LLC, the previous
default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool,
causing heavy spinlock contention on pool->lock.
WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing
a better balance between locality and contention. Users can revert to
the previous behavior with workqueue.default_affinity_scope=cache.
On systems with 8 or fewer cores per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 cores.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/workqueue.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5b1d42115e20..3b5b21136414 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -449,7 +449,7 @@ static bool wq_topo_initialized __read_mostly = false;
static struct kmem_cache *pwq_cache;
static struct wq_pod_type wq_pod_types[WQ_AFFN_NR_TYPES];
-static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE;
+static enum wq_affn_scope wq_affn_dfl = WQ_AFFN_CACHE_SHARD;
/* buf for wq_update_unbound_pod_attrs(), protected by CPU hotplug exclusion */
static struct workqueue_attrs *unbound_wq_update_pwq_attrs_buf;
--
2.52.0
* [PATCH v3 4/6] tools/workqueue: add CACHE_SHARD support to wq_dump.py
2026-04-01 13:03 [PATCH v3 0/6] workqueue: Introduce a sharded cache affinity scope Breno Leitao
` (2 preceding siblings ...)
2026-04-01 13:03 ` [PATCH v3 3/6] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
@ 2026-04-01 13:03 ` Breno Leitao
2026-04-01 13:03 ` [PATCH v3 5/6] workqueue: add test_workqueue benchmark module Breno Leitao
` (2 subsequent siblings)
6 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-01 13:03 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
The WQ_AFFN_CACHE_SHARD affinity scope was added to the kernel but
wq_dump.py was not updated to enumerate it. Add the missing constant
lookup and include it in the affinity scopes iteration so that drgn
output shows the CACHE_SHARD pod topology alongside the other scopes.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
tools/workqueue/wq_dump.py | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/workqueue/wq_dump.py b/tools/workqueue/wq_dump.py
index d29b918306b4..06948ffcfc4b 100644
--- a/tools/workqueue/wq_dump.py
+++ b/tools/workqueue/wq_dump.py
@@ -107,6 +107,7 @@ WQ_MEM_RECLAIM = prog['WQ_MEM_RECLAIM']
WQ_AFFN_CPU = prog['WQ_AFFN_CPU']
WQ_AFFN_SMT = prog['WQ_AFFN_SMT']
WQ_AFFN_CACHE = prog['WQ_AFFN_CACHE']
+WQ_AFFN_CACHE_SHARD = prog['WQ_AFFN_CACHE_SHARD']
WQ_AFFN_NUMA = prog['WQ_AFFN_NUMA']
WQ_AFFN_SYSTEM = prog['WQ_AFFN_SYSTEM']
@@ -138,7 +139,7 @@ def print_pod_type(pt):
print(f' [{cpu}]={pt.cpu_pod[cpu].value_()}', end='')
print('')
-for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
+for affn in [WQ_AFFN_CPU, WQ_AFFN_SMT, WQ_AFFN_CACHE, WQ_AFFN_CACHE_SHARD, WQ_AFFN_NUMA, WQ_AFFN_SYSTEM]:
print('')
print(f'{wq_affn_names[affn].string_().decode().upper()}{" (default)" if affn == wq_affn_dfl else ""}')
print_pod_type(wq_pod_types[affn])
--
2.52.0
* [PATCH v3 5/6] workqueue: add test_workqueue benchmark module
2026-04-01 13:03 [PATCH v3 0/6] workqueue: Introduce a sharded cache affinity scope Breno Leitao
` (3 preceding siblings ...)
2026-04-01 13:03 ` [PATCH v3 4/6] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
@ 2026-04-01 13:03 ` Breno Leitao
2026-04-01 13:03 ` [PATCH v3 6/6] docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-04-01 20:32 ` [PATCH v3 0/6] workqueue: Introduce a sharded cache " Tejun Heo
6 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-01 13:03 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Add a kernel module that benchmarks queue_work() throughput on an
unbound workqueue to measure pool->lock contention under different
affinity scope configurations (cache vs cache_shard).
The module spawns N kthreads (default: num_online_cpus()), each bound
to a different CPU. All threads start simultaneously and queue work
items, measuring the latency of each queue_work() call. Results are
reported as p50/p90/p95 latencies for each affinity scope.
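The percentile reporting is plain order statistics over the merged
samples; a short sketch of what run_bench() computes (percentiles is a
hypothetical helper, using the same total * P / 100 indexing as the
module):

```python
def percentiles(latencies_ns):
    """Merge-and-sort percentile extraction, as run_bench() does
    after copying the per-thread latency arrays together."""
    s = sorted(latencies_ns)
    n = len(s)
    return {p: s[n * p // 100] for p in (50, 90, 95)}

print(percentiles(range(1, 101)))  # {50: 51, 90: 91, 95: 96}
```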
The affinity scope is switched between runs via the workqueue's sysfs
affinity_scope attribute (WQ_SYSFS), avoiding the need for any new
exported symbols.
The module is __init-only: its init function returns -EAGAIN so it
never stays loaded, and the benchmark can be re-run with another
insmod.
Example output:
running 50 threads, 50000 items/thread
cpu 6806017 items/sec p50=2574 p90=5068 p95=5818 ns
smt 6821040 items/sec p50=2624 p90=5168 p95=5949 ns
cache_shard 1633653 items/sec p50=5337 p90=9694 p95=11207 ns
cache 286069 items/sec p50=72509 p90=82304 p95=85009 ns
numa 319403 items/sec p50=63745 p90=73480 p95=76505 ns
system 308461 items/sec p50=66561 p90=75714 p95=78048 ns
Signed-off-by: Breno Leitao <leitao@debian.org>
---
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 294 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 305 insertions(+)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 93f356d2b3d9..38bee649697f 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2628,6 +2628,16 @@ config TEST_VMALLOC
If unsure, say N.
+config TEST_WORKQUEUE
+ tristate "Test module for stress/performance analysis of workqueue"
+ default n
+ help
+ This builds the "test_workqueue" module for benchmarking
+ workqueue throughput under contention. Useful for evaluating
+ affinity scope changes (e.g., cache_shard vs cache).
+
+ If unsure, say N.
+
config TEST_BPF
tristate "Test BPF filter functionality"
depends on m && NET
diff --git a/lib/Makefile b/lib/Makefile
index 1b9ee167517f..ea660cca04f4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -79,6 +79,7 @@ UBSAN_SANITIZE_test_ubsan.o := y
obj-$(CONFIG_TEST_KSTRTOX) += test-kstrtox.o
obj-$(CONFIG_TEST_LKM) += test_module.o
obj-$(CONFIG_TEST_VMALLOC) += test_vmalloc.o
+obj-$(CONFIG_TEST_WORKQUEUE) += test_workqueue.o
obj-$(CONFIG_TEST_RHASHTABLE) += test_rhashtable.o
obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_keys.o
obj-$(CONFIG_TEST_STATIC_KEYS) += test_static_key_base.o
diff --git a/lib/test_workqueue.c b/lib/test_workqueue.c
new file mode 100644
index 000000000000..f2ae1ac4bd93
--- /dev/null
+++ b/lib/test_workqueue.c
@@ -0,0 +1,294 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Test module for stress and performance analysis of workqueue.
+ *
+ * Benchmarks queue_work() throughput on an unbound workqueue to measure
+ * pool->lock contention under different affinity scope configurations
+ * (e.g., cache vs cache_shard).
+ *
+ * The affinity scope is changed between runs via the workqueue's sysfs
+ * affinity_scope attribute (WQ_SYSFS).
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ *
+ */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/workqueue.h>
+#include <linux/kthread.h>
+#include <linux/moduleparam.h>
+#include <linux/completion.h>
+#include <linux/atomic.h>
+#include <linux/slab.h>
+#include <linux/ktime.h>
+#include <linux/cpumask.h>
+#include <linux/sched.h>
+#include <linux/sort.h>
+#include <linux/fs.h>
+
+#define WQ_NAME "bench_wq"
+#define SCOPE_PATH "/sys/bus/workqueue/devices/" WQ_NAME "/affinity_scope"
+
+static int nr_threads;
+module_param(nr_threads, int, 0444);
+MODULE_PARM_DESC(nr_threads,
+ "Number of threads to spawn (default: 0 = num_online_cpus())");
+
+static int wq_items = 50000;
+module_param(wq_items, int, 0444);
+MODULE_PARM_DESC(wq_items,
+ "Number of work items each thread queues (default: 50000)");
+
+static struct workqueue_struct *bench_wq;
+static atomic_t threads_done;
+static DECLARE_COMPLETION(start_comp);
+static DECLARE_COMPLETION(all_done_comp);
+
+struct thread_ctx {
+ struct completion work_done;
+ struct work_struct work;
+ u64 *latencies;
+ int cpu;
+ int items;
+};
+
+static void bench_work_fn(struct work_struct *work)
+{
+ struct thread_ctx *ctx = container_of(work, struct thread_ctx, work);
+
+ complete(&ctx->work_done);
+}
+
+static int bench_kthread_fn(void *data)
+{
+ struct thread_ctx *ctx = data;
+ ktime_t t_start, t_end;
+ int i;
+
+ /* Wait for all threads to be ready */
+ wait_for_completion(&start_comp);
+
+ if (kthread_should_stop())
+ return 0;
+
+ for (i = 0; i < ctx->items; i++) {
+ reinit_completion(&ctx->work_done);
+ INIT_WORK(&ctx->work, bench_work_fn);
+
+ t_start = ktime_get();
+ queue_work(bench_wq, &ctx->work);
+ t_end = ktime_get();
+
+ ctx->latencies[i] = ktime_to_ns(ktime_sub(t_end, t_start));
+ wait_for_completion(&ctx->work_done);
+ }
+
+ if (atomic_dec_and_test(&threads_done))
+ complete(&all_done_comp);
+
+ /*
+ * Wait for kthread_stop() so the module text isn't freed
+ * while we're still executing.
+ */
+ while (!kthread_should_stop())
+ schedule();
+
+ return 0;
+}
+
+static int cmp_u64(const void *a, const void *b)
+{
+ u64 va = *(const u64 *)a;
+ u64 vb = *(const u64 *)b;
+
+ if (va < vb)
+ return -1;
+ if (va > vb)
+ return 1;
+ return 0;
+}
+
+static int __init set_affn_scope(const char *scope)
+{
+ struct file *f;
+ loff_t pos = 0;
+ ssize_t ret;
+
+ f = filp_open(SCOPE_PATH, O_WRONLY, 0);
+ if (IS_ERR(f)) {
+ pr_err("test_workqueue: open %s failed: %ld\n",
+ SCOPE_PATH, PTR_ERR(f));
+ return PTR_ERR(f);
+ }
+
+ ret = kernel_write(f, scope, strlen(scope), &pos);
+ filp_close(f, NULL);
+
+ if (ret < 0) {
+ pr_err("test_workqueue: write '%s' failed: %zd\n", scope, ret);
+ return ret;
+ }
+
+ return 0;
+}
+
+static int __init run_bench(int n_threads, const char *scope, const char *label)
+{
+ struct task_struct **tasks;
+ unsigned long total_items;
+ struct thread_ctx *ctxs;
+ u64 *all_latencies;
+ ktime_t start, end;
+ int cpu, i, j, ret;
+ s64 elapsed_us;
+
+ ret = set_affn_scope(scope);
+ if (ret)
+ return ret;
+
+ ctxs = kcalloc(n_threads, sizeof(*ctxs), GFP_KERNEL);
+ if (!ctxs)
+ return -ENOMEM;
+
+ tasks = kcalloc(n_threads, sizeof(*tasks), GFP_KERNEL);
+ if (!tasks) {
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+
+ total_items = (unsigned long)n_threads * wq_items;
+ all_latencies = kvmalloc_array(total_items, sizeof(u64), GFP_KERNEL);
+ if (!all_latencies) {
+ kfree(tasks);
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+
+ /* Allocate per-thread latency arrays */
+ for (i = 0; i < n_threads; i++) {
+ ctxs[i].latencies = kvmalloc_array(wq_items, sizeof(u64),
+ GFP_KERNEL);
+ if (!ctxs[i].latencies) {
+ while (--i >= 0)
+ kvfree(ctxs[i].latencies);
+ kvfree(all_latencies);
+ kfree(tasks);
+ kfree(ctxs);
+ return -ENOMEM;
+ }
+ }
+
+ atomic_set(&threads_done, n_threads);
+ reinit_completion(&all_done_comp);
+ reinit_completion(&start_comp);
+
+ /* Create kthreads, each bound to a different online CPU */
+ i = 0;
+ for_each_online_cpu(cpu) {
+ if (i >= n_threads)
+ break;
+
+ ctxs[i].cpu = cpu;
+ ctxs[i].items = wq_items;
+ init_completion(&ctxs[i].work_done);
+
+ tasks[i] = kthread_create(bench_kthread_fn, &ctxs[i],
+ "wq_bench/%d", cpu);
+ if (IS_ERR(tasks[i])) {
+ ret = PTR_ERR(tasks[i]);
+ pr_err("test_workqueue: failed to create kthread %d: %d\n",
+ i, ret);
+ /* Unblock threads waiting on start_comp before stopping them */
+ complete_all(&start_comp);
+ while (--i >= 0)
+ kthread_stop(tasks[i]);
+ goto out_free;
+ }
+
+ kthread_bind(tasks[i], cpu);
+ wake_up_process(tasks[i]);
+ i++;
+ }
+
+ /* Start timing and release all threads */
+ start = ktime_get();
+ complete_all(&start_comp);
+
+ /* Wait for all threads to finish the benchmark */
+ wait_for_completion(&all_done_comp);
+
+ /* Drain any remaining work */
+ flush_workqueue(bench_wq);
+
+ /* Ensure all kthreads have fully exited before module memory is freed */
+ for (i = 0; i < n_threads; i++)
+ kthread_stop(tasks[i]);
+
+ end = ktime_get();
+ elapsed_us = ktime_us_delta(end, start);
+
+ /* Merge all per-thread latencies and sort for percentile calculation */
+ j = 0;
+ for (i = 0; i < n_threads; i++) {
+ memcpy(&all_latencies[j], ctxs[i].latencies,
+ wq_items * sizeof(u64));
+ j += wq_items;
+ }
+
+ sort(all_latencies, total_items, sizeof(u64), cmp_u64, NULL);
+
+ pr_info("test_workqueue: %-16s %llu items/sec\tp50=%llu\tp90=%llu\tp95=%llu ns\n",
+ label,
+ elapsed_us ? total_items * 1000000ULL / elapsed_us : 0,
+ all_latencies[total_items * 50 / 100],
+ all_latencies[total_items * 90 / 100],
+ all_latencies[total_items * 95 / 100]);
+
+ ret = 0;
+out_free:
+ for (i = 0; i < n_threads; i++)
+ kvfree(ctxs[i].latencies);
+ kvfree(all_latencies);
+ kfree(tasks);
+ kfree(ctxs);
+
+ return ret;
+}
+
+static const char * const bench_scopes[] = {
+ "cpu", "smt", "cache_shard", "cache", "numa", "system",
+};
+
+static int __init test_workqueue_init(void)
+{
+ int n_threads = min(nr_threads ?: num_online_cpus(), num_online_cpus());
+ int i;
+
+ if (wq_items <= 0) {
+ pr_err("test_workqueue: wq_items must be > 0\n");
+ return -EINVAL;
+ }
+
+ bench_wq = alloc_workqueue(WQ_NAME, WQ_UNBOUND | WQ_SYSFS, 0);
+ if (!bench_wq)
+ return -ENOMEM;
+
+ pr_info("test_workqueue: running %d threads, %d items/thread\n",
+ n_threads, wq_items);
+
+ for (i = 0; i < ARRAY_SIZE(bench_scopes); i++)
+ run_bench(n_threads, bench_scopes[i], bench_scopes[i]);
+
+ destroy_workqueue(bench_wq);
+
+ /* Return -EAGAIN so the module doesn't stay loaded after the benchmark */
+ return -EAGAIN;
+}
+
+module_init(test_workqueue_init);
+MODULE_AUTHOR("Breno Leitao <leitao@debian.org>");
+MODULE_DESCRIPTION("Stress/performance benchmark for workqueue subsystem");
+MODULE_LICENSE("GPL");
--
2.52.0
* [PATCH v3 6/6] docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
2026-04-01 13:03 [PATCH v3 0/6] workqueue: Introduce a sharded cache affinity scope Breno Leitao
` (4 preceding siblings ...)
2026-04-01 13:03 ` [PATCH v3 5/6] workqueue: add test_workqueue benchmark module Breno Leitao
@ 2026-04-01 13:03 ` Breno Leitao
2026-04-01 20:32 ` [PATCH v3 0/6] workqueue: Introduce a sharded cache " Tejun Heo
6 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-01 13:03 UTC (permalink / raw)
To: Tejun Heo, Lai Jiangshan, Andrew Morton
Cc: linux-kernel, puranjay, linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever,
Breno Leitao
Update kernel-parameters.txt and workqueue.rst to reflect the new
cache_shard affinity scope and the default change from cache to
cache_shard.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Documentation/admin-guide/kernel-parameters.txt | 3 ++-
Documentation/core-api/workqueue.rst | 14 ++++++++++----
2 files changed, 12 insertions(+), 5 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 03a550630644..b2558f76b7bd 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -8535,7 +8535,8 @@ Kernel parameters
workqueue.default_affinity_scope=
Select the default affinity scope to use for unbound
workqueues. Can be one of "cpu", "smt", "cache",
- "numa" and "system". Default is "cache". For more
+ "cache_shard", "numa" and "system". Default is
+ "cache_shard". For more
information, see the Affinity Scopes section in
Documentation/core-api/workqueue.rst.
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index 165ca73e8351..411e1b28b8de 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -378,9 +378,9 @@ Affinity Scopes
An unbound workqueue groups CPUs according to its affinity scope to improve
cache locality. For example, if a workqueue is using the default affinity
-scope of "cache", it will group CPUs according to last level cache
-boundaries. A work item queued on the workqueue will be assigned to a worker
-on one of the CPUs which share the last level cache with the issuing CPU.
+scope of "cache_shard", it will group CPUs into sub-LLC shards. A work item
+queued on the workqueue will be assigned to a worker on one of the CPUs
+within the same shard as the issuing CPU.
Once started, the worker may or may not be allowed to move outside the scope
depending on the ``affinity_strict`` setting of the scope.
@@ -402,7 +402,13 @@ Workqueue currently supports the following affinity scopes.
``cache``
CPUs are grouped according to cache boundaries. Which specific cache
boundary is used is determined by the arch code. L3 is used in a lot of
- cases. This is the default affinity scope.
+ cases.
+
+``cache_shard``
+ CPUs are grouped into sub-LLC shards of at most ``wq_cache_shard_size``
+ cores (default 8, tunable via the ``workqueue.cache_shard_size`` boot
+ parameter). Shards are always split on core (SMT group) boundaries.
+ This is the default affinity scope.
``numa``
CPUs are grouped according to NUMA boundaries.
--
2.52.0
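As a usage sketch of the documented knobs: the shard size and default scope can be set at boot, and for workqueues created with WQ_SYSFS the scope can be inspected and changed at runtime through the existing `affinity_scope` attribute (`<wq_name>` is a placeholder for an actual workqueue name):

```shell
# Boot parameters (illustrative values): cap shards at 4 cores, or opt
# back into whole-LLC grouping for all unbound workqueues.
#   workqueue.cache_shard_size=4
#   workqueue.default_affinity_scope=cache

# For a WQ_SYSFS workqueue, read the current scope and switch it.
cat /sys/devices/virtual/workqueue/<wq_name>/affinity_scope
echo cache_shard > /sys/devices/virtual/workqueue/<wq_name>/affinity_scope
```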
* Re: [PATCH v3 0/6] workqueue: Introduce a sharded cache affinity scope
2026-04-01 13:03 [PATCH v3 0/6] workqueue: Introduce a sharded cache affinity scope Breno Leitao
` (5 preceding siblings ...)
2026-04-01 13:03 ` [PATCH v3 6/6] docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
@ 2026-04-01 20:32 ` Tejun Heo
6 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-04-01 20:32 UTC (permalink / raw)
To: Breno Leitao
Cc: Lai Jiangshan, Andrew Morton, linux-kernel, puranjay,
linux-crypto, linux-btrfs, linux-fsdevel,
Michael van der Westhuizen, kernel-team, Chuck Lever
Hello,
Applied 1-6 to wq/for-7.1.
Thanks.
--
tejun
Thread overview: 8+ messages
2026-04-01 13:03 [PATCH v3 0/6] workqueue: Introduce a sharded cache affinity scope Breno Leitao
2026-04-01 13:03 ` [PATCH v3 1/6] workqueue: fix typo in WQ_AFFN_SMT comment Breno Leitao
2026-04-01 13:03 ` [PATCH v3 2/6] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-04-01 13:03 ` [PATCH v3 3/6] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
2026-04-01 13:03 ` [PATCH v3 4/6] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
2026-04-01 13:03 ` [PATCH v3 5/6] workqueue: add test_workqueue benchmark module Breno Leitao
2026-04-01 13:03 ` [PATCH v3 6/6] docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-04-01 20:32 ` [PATCH v3 0/6] workqueue: Introduce a sharded cache " Tejun Heo