public inbox for linux-btrfs@vger.kernel.org
From: Breno Leitao <leitao@debian.org>
To: Tejun Heo <tj@kernel.org>, Lai Jiangshan <jiangshanlai@gmail.com>,
	 Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org, puranjay@kernel.org,
	 linux-crypto@vger.kernel.org, linux-btrfs@vger.kernel.org,
	 linux-fsdevel@vger.kernel.org,
	Michael van der Westhuizen <rmikey@meta.com>,
	 kernel-team@meta.com, Chuck Lever <chuck.lever@oracle.com>,
	 Breno Leitao <leitao@debian.org>
Subject: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
Date: Thu, 12 Mar 2026 09:12:01 -0700	[thread overview]
Message-ID: <20260312-workqueue_sharded-v1-0-2c43a7b861d0@debian.org> (raw)

TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).

Problem
=======

Some modern systems have many CPUs sharing one LLC. Here are some
examples I have access to:

 * NVIDIA Grace CPU: 72 real CPUs per LLC
 * Intel(R) Xeon(R) Gold 6450C: 59 SMT threads per LLC
 * Intel(R) Xeon(R) Platinum 8321HC: 51 SMT threads per LLC

On these systems, the default unbound workqueue affinity
(WQ_AFFN_CACHE) results in just a single worker pool for the whole
system when all CPUs share one LLC, as on the systems above.

This causes contention on pool->lock, potentially affecting IO
performance (btrfs, writeback, etc.).

When profiling an IO-intensive workload at Meta, I found significant
contention in __queue_work(), with pool->lock showing up among the top
5 contended locks.

Additionally, Chuck Lever recently reported this problem:

	"For example, on a 12-core system with a single shared L3 cache running
	NFS over RDMA with 12 fio jobs, perf shows approximately 39% of CPU
	cycles spent in native_queued_spin_lock_slowpath, nearly all from
	__queue_work() contending on the single pool lock.

	On such systems WQ_AFFN_CACHE, WQ_AFFN_SMT, and WQ_AFFN_NUMA
	scopes all collapse to a single pod."

Link: https://lore.kernel.org/all/20260203143744.16578-1-cel@kernel.org/

Solution
========

Tejun suggested solving this problem by creating an intermediate
affinity level (aka cache_shard), which would shard the WQ_AFFN_CACHE
using a heuristic, avoiding collapsing all those affinity levels to
a single pod.

Solve this by creating an intermediate sharded cache affinity, and use
it as the default one.

Micro benchmark
===============

To test its benefit, I created a microbenchmark (part of this series)
that enqueues work (queue_work) in a loop and reports the latency.
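As a rough illustration of the methodology, here is a hypothetical
userspace analogue of that loop (the real benchmark is a kernel module
calling queue_work(); `bench` and its `enqueue` callback are
illustrative names, not the module's API):

```python
# Time each enqueue operation in a tight loop and report throughput
# plus p50/p90/p95 latency, mirroring the numbers shown below.
import time

def bench(enqueue, items=50_000):
    lat = []
    start = time.perf_counter_ns()
    for _ in range(items):
        t0 = time.perf_counter_ns()
        enqueue()                     # stands in for queue_work()
        lat.append(time.perf_counter_ns() - t0)
    elapsed = time.perf_counter_ns() - start
    lat.sort()
    pct = lambda p: lat[min(len(lat) - 1, int(p * len(lat)))]
    return {"items_per_sec": items * 1e9 / elapsed,
            "p50": pct(0.50), "p90": pct(0.90), "p95": pct(0.95)}
```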

  Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread):

    cpu          3248519 items/sec p50=10944    p90=11488    p95=11648 ns
    smt          3362119 items/sec p50=10945    p90=11520    p95=11712 ns
    cache_shard  3629098 items/sec p50=6080     p90=8896     p95=9728 ns (NEW) **
    cache        708168 items/sec  p50=44000    p90=47104    p95=47904 ns
    numa         710559 items/sec  p50=44096    p90=47265    p95=48064 ns
    system       718370 items/sec  p50=43104    p90=46432    p95=47264 ns

Same benchmark on the Intel Xeon Platinum 8321HC:

    cpu          2831751 items/sec p50=3909     p90=9222     p95=11580 ns
    smt          2810699 items/sec p50=2229     p90=4928     p95=5979 ns
    cache_shard  1861028 items/sec p50=4874     p90=8423     p95=9415 ns (NEW)
    cache        591001 items/sec  p50=24901    p90=29865    p95=31169 ns
    numa         590431 items/sec  p50=24901    p90=29819    p95=31133 ns
    system       591912 items/sec  p50=25049    p90=29916    p95=31219 ns

(** It is still unclear why cache_shard is "better" than SMT on
Grace/ARM. The result is consistently reproducible, though. Still
investigating.)

Block benchmark
===============

Host: Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz (16 Cores - 32 SMT)

To stress the workqueue, I ran fio on a dm-crypt device.

  1) Create a plain dm-crypt device on top of NVMe
   * cryptsetup creates an encrypted block device (/dev/mapper/crypt_nvme) on top
     of a raw NVMe drive. All I/O to this device goes through kcryptd — dm-crypt's
     workqueue that handles AES encryption/decryption of every data block.

   # cryptsetup open --type plain -c aes-xts-plain64 -s 256 /dev/nvme0n1 crypt_nvme -d -

  2) Run fio
   * fio hammers the encrypted device with one thread per CPU (32 on
     this host), each doing 128-deep 4K _buffered_ I/O for 10 seconds.
     This generates massive workqueue pressure: every I/O completion
     triggers a kcryptd work item to encrypt or decrypt data.

   # fio --name=crypt_bench --filename=/dev/mapper/crypt_nvme \
         --rw=randread \
         --ioengine=io_uring --direct=0 \
         --bs=4k --iodepth=128 \
         --numjobs=$(nproc) --runtime=10 \
         --time_based --group_reporting

   (fio requires a --name; --rw was set to randread, randwrite, or
   randrw to match the rows in the tables below.)

Running this for ~3 hours:

  ┌────────────┬────────────────────────┬────────────────────────┬───────────┬────────┬─────────────────┐
  │ Workload   │       Avg cache        │    Avg cache_shard     │ Avg delta │ Stddev │  2-sigma range  │
  ├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
  │ randread   │ 389 MiB/s (99.6k IOPS) │ 413 MiB/s (106k IOPS)  │ +5.9%     │ 3.3%   │ -0.7% to +12.5% │
  ├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
  │ randwrite  │ 622 MiB/s (159k IOPS)  │ 614 MiB/s (157k IOPS)  │ -1.3%     │ 0.9%   │ -3.1% to +0.5%  │
  ├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
  │ randrw     │ 240 MiB/s (61.4k IOPS) │ 250 MiB/s (64.1k IOPS) │ +4.3%     │ 3.4%   │ -2.5% to +11.1% │
  └────────────┴────────────────────────┴────────────────────────┴───────────┴────────┴─────────────────┘

The same benchmark, run with buffered IO:

  ┌───────────┬────────────────────────┬────────────────────────┬───────────┬────────┬────────────────┐
  │ Workload  │       Avg cache        │    Avg cache_shard     │ Avg delta │ Stddev │ 2-sigma range  │
  ├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
  │ randread  │ 559 MiB/s (143k IOPS)  │ 577 MiB/s (148k IOPS)  │ +3.1%     │ 1.3%   │ +0.5% to +5.7% │
  ├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
  │ randwrite │ 437 MiB/s (112k IOPS)  │ 431 MiB/s (110k IOPS)  │ -1.5%     │ 1.0%   │ -3.5% to +0.5% │
  ├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
  │ randrw    │ 272 MiB/s (69.7k IOPS) │ 273 MiB/s (69.8k IOPS) │ +0.1%     │ 1.5%   │ -2.9% to +3.1% │
  └───────────┴────────────────────────┴────────────────────────┴───────────┴────────┴────────────────┘

(The randwrite delta appears to be within the noise.)
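The "Avg delta", "Stddev", and "2-sigma range" columns above follow
the usual mean +/- 2*stddev computation over the per-run deltas. A
minimal sketch (the function name and example inputs are illustrative,
not taken from the series):

```python
# Compute the mean delta across runs and its 2-sigma range, as used
# in the result tables above. Input: per-run percentage deltas.
import statistics

def two_sigma_range(deltas_pct):
    mean = statistics.mean(deltas_pct)
    sd = statistics.stdev(deltas_pct)       # sample standard deviation
    return mean, sd, (mean - 2 * sd, mean + 2 * sd)

# Example with made-up per-run deltas of +4%, +6%, +8%:
mean, sd, (low, high) = two_sigma_range([4.0, 6.0, 8.0])
```

A range that straddles zero (as for randwrite above) means the
measured delta is not distinguishable from run-to-run noise.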

Patchset organization
=====================

This series adds a new WQ_AFFN_CACHE_SHARD affinity scope that
subdivides each LLC into groups of at most wq_cache_shard_size CPUs
(default 8, tunable via boot parameter), providing an intermediate
option between per-LLC and per-SMT-core granularity.

Besides the feature itself, this patchset also prepares the code for
the new cache_shard affinity and adds a stress-test/benchmark module
for workqueues. It then makes the new affinity scope the default.

On systems with 8 or fewer CPUs per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 CPUs.
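The partitioning described above can be sketched as follows (an
illustrative model only; the kernel code in kernel/workqueue.c
operates on cpumasks, and `cache_shards` is a hypothetical name):

```python
# Split one LLC's CPU list into shards of at most shard_size CPUs,
# each shard backing its own worker pool.
def cache_shards(llc_cpus, shard_size=8):
    return [llc_cpus[i:i + shard_size]
            for i in range(0, len(llc_cpus), shard_size)]

# A 72-CPU LLC (e.g. NVIDIA Grace) splits into 9 pools of 8 CPUs:
grace = cache_shards(list(range(72)))
# An 8-CPU LLC yields a single shard, behaving like WQ_AFFN_CACHE:
small = cache_shards(list(range(8)))
```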

---
Breno Leitao (5):
      workqueue: fix parse_affn_scope() prefix matching bug
      workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
      workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
      workqueue: add test_workqueue benchmark module
      tools/workqueue: add CACHE_SHARD support to wq_dump.py

 include/linux/workqueue.h  |   1 +
 kernel/workqueue.c         |  72 ++++++++++--
 lib/Kconfig.debug          |  10 ++
 lib/Makefile               |   1 +
 lib/test_workqueue.c       | 275 +++++++++++++++++++++++++++++++++++++++++++++
 tools/workqueue/wq_dump.py |   3 +-
 6 files changed, 352 insertions(+), 10 deletions(-)
---
base-commit: b29fb8829bff243512bb8c8908fd39406f9fd4c3
change-id: 20260309-workqueue_sharded-2327956e889b

Best regards,
--  
Breno Leitao <leitao@debian.org>


Thread overview: 13+ messages
2026-03-12 16:12 Breno Leitao [this message]
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
2026-03-13 17:41   ` Tejun Heo
2026-03-12 16:12 ` [PATCH RFC 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 4/5] workqueue: add test_workqueue benchmark module Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 5/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
2026-03-13 17:57 ` [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Tejun Heo
2026-03-17 11:32   ` Breno Leitao
2026-03-17 13:58     ` Chuck Lever
2026-03-18 17:51       ` Breno Leitao
2026-03-18 23:00         ` Tejun Heo
2026-03-19 14:02           ` Breno Leitao
