public inbox for linux-fsdevel@vger.kernel.org
From: Breno Leitao <leitao@debian.org>
To: Tejun Heo <tj@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>,
	 Lai Jiangshan <jiangshanlai@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 linux-kernel@vger.kernel.org, puranjay@kernel.org,
	linux-crypto@vger.kernel.org,  linux-btrfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	 Michael van der Westhuizen <rmikey@meta.com>,
	kernel-team@meta.com
Subject: Re: [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
Date: Thu, 19 Mar 2026 07:02:58 -0700	[thread overview]
Message-ID: <abv8CBeQMlB-Zb2z@gmail.com> (raw)
In-Reply-To: <absud4FKm-3Trvjj@slm.duckdns.org>

On Wed, Mar 18, 2026 at 01:00:07PM -1000, Tejun Heo wrote:
> On Wed, Mar 18, 2026 at 10:51:15AM -0700, Breno Leitao wrote:
> > On Tue, Mar 17, 2026 at 09:58:54AM -0400, Chuck Lever wrote:
> > > On 3/17/26 7:32 AM, Breno Leitao wrote:
> > > >> - How was the default shard size of 8 picked? There's a tradeoff
> > > >> between the number of kworkers created and locality. Can you also
> > > >> report the number of kworkers for each configuration? And is there
> > > >> data on different shard sizes? It'd be useful to see how the numbers
> > > >> change across e.g. 4, 8, 16, 32.
> > > >
> > > > The choice of 8 as the default shard size was somewhat arbitrary – it was
> > > > selected primarily to generate initial data points.
> > >
> > > Perhaps instead of basing the sharding on a particular number of CPUs
> > > per shard, why not cap the total number of shards? IIUC that is the main
> > > concern about ballooning the number of kworker threads.
> >
> > That's a great suggestion. I'll send a v2 that implements this approach,
> > where the parameter specifies the number of shards rather than the number
> > of CPUs per shard.
>
> Would it make sense though? It feels really odd to define the maximum number of
> shards when contention is primarily a function of the number of CPUs banging
> on the same pool. Why would 32 CPU and 512 CPU systems have the same number
> of shards?

The trade-off is that specifying the maximum number of shards makes it
clearer how many times each LLC is being split, which might be easier to
reason about, but, as you point out above, it doesn't keep per-shard
contention constant as the CPU count grows.

I've collected some numbers with a fixed number of shards per LLC, and I
will switch back to the original CPUs-per-shard approach to gather
comparison data.

Current change:
https://github.com/leitao/linux/commit/bedaf9ebe9594320976dcbf0cb507ecf083097c0
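To make the difference between the two parameterizations concrete, here is
some illustrative arithmetic (my own sketch, not the kernel code; the
numbers are hypothetical):

```shell
# Illustrative arithmetic only (not kernel code): how the two
# parameterizations behave as the machine grows.
for llc_cpus in 32 512; do
    # Fixed shard *size* (original approach): contention per shard is
    # constant, the number of shards (and kworker pools) grows.
    size=8
    echo "$llc_cpus CPUs, size $size -> $((llc_cpus / size)) shards"
    # Fixed shard *count* (current change): pool count is constant, but
    # the number of CPUs banging on each shard grows with the machine.
    count=8
    echo "$llc_cpus CPUs, count $count -> $((llc_cpus / count)) CPUs/shard"
done
```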


Workload:
========

I've finally found a workload that exercises the workqueue sufficiently,
which allows me to obtain stable benchmark results.

This is what I am doing:

  - Sets up a local loopback NFS environment backed by an 8 GB tmpfs
    (/tmp/nfsexport → /mnt/nfs)
  - Iterates over six fio I/O engines: sync, psync, vsync, pvsync, pvsync2,
    libaio
  - For each engine, runs a 200-job, 512-byte block size fio benchmark (writes
    then reads)
  - Tests each workload under both cache and cache_shard workqueue affinity
    scopes via /sys/module/workqueue/parameters/default_affinity_scope
  - Prints a summary table with aggregate bandwidth (MB) per scope and the
    percentage delta to show whether cache_shard helps or hurts
  - Restores the affinity scope back to cache when done
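The loop above can be sketched roughly as follows (paths and fio flags
here are illustrative, not the exact ones in my script; the real thing is
linked below):

```shell
#!/bin/sh
# Rough sketch of the benchmark loop. DRY_RUN is on by default so this
# only prints the commands; set DRY_RUN= (empty) to actually execute,
# which requires root, fio, and the NFS mount described above.
DRY_RUN=${DRY_RUN-1}
SCOPE_PARAM=/sys/module/workqueue/parameters/default_affinity_scope

run() {
    if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi
}

for engine in sync psync vsync pvsync pvsync2 libaio; do
    for scope in cache cache_shard; do
        run sh -c "echo $scope > $SCOPE_PARAM"
        run fio --name=bench --directory=/mnt/nfs --ioengine="$engine" \
                --rw=write --bs=512 --numjobs=200 --size=64m \
                --group_reporting
    done
done
run sh -c "echo cache > $SCOPE_PARAM"    # restore the default scope
```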

The test I am running can be found at
https://github.com/leitao/debug/blob/main/workqueue_performance/test_affinity.sh

Hosts:
======

 * ARM (NVIDIA Grace - Neoverse V2 - single L3 domain: CPUs 0-71)
	# cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u
	0-71

 * x86 (AMD EPYC 9D64 88-Core Processor - 11 L3 domains, 8 cores / 16 vCPUs each)
	# cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u
	0-7,88-95
	16-23,104-111
	24-31,112-119
	32-39,120-127
	40-47,128-135
	48-55,136-143
	56-63,144-151
	64-71,152-159
	72-79,160-167
	80-87,168-175
	8-15,96-103


Results
=======


TL;DR:

 * ARM (single L3, 72 CPUs): cache_shard consistently improves write
   throughput by +6 to +12% across all shard counts (2-32), with
   the peak at 2 shards. Read impact is minimal (noise-level).
   Shard=1 confirms no effect as expected.

 * x86 (11 L3 domains, 16 CPUs each): cache_shard shows no meaningful
   benefit at 1-4 shards (all within noise/stddev). At 8 shards it
   regresses by ~4% for both reads and writes, likely due to loss of
   data locality when sharding already-small 16-CPU cache domains
   further.

Benchmark data:
===============

ARM:

  ┌────────┬───────────────────┬──────────────┬──────────────────┬─────────────┐
  │ Shards │ Write Delta (avg) │ Write stddev │ Read Delta (avg) │ Read stddev │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      1 │             -0.2% │        ±1.0% │            +1.2% │       ±1.7% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      2 │            +12.5% │        ±1.3% │            -0.3% │       ±0.9% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      4 │             +8.7% │        ±0.9% │            +1.8% │       ±1.5% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      8 │            +11.4% │        ±1.8% │            +3.1% │       ±1.5% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │     16 │             +7.8% │        ±1.3% │            +1.6% │       ±1.0% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │     32 │             +6.1% │        ±0.6% │            +0.3% │       ±1.5% │
  └────────┴───────────────────┴──────────────┴──────────────────┴─────────────┘

x86:

  ┌────────┬───────────────────┬──────────────┬──────────────────┬─────────────┐
  │ Shards │ Write Delta (avg) │ Write stddev │ Read Delta (avg) │ Read stddev │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      1 │             -0.2% │        ±1.2% │            +0.1% │       ±1.0% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      2 │             +0.7% │        ±1.4% │            +0.5% │       ±1.1% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      4 │             +0.8% │        ±1.1% │            +1.3% │       ±1.2% │
  ├────────┼───────────────────┼──────────────┼──────────────────┼─────────────┤
  │      8 │             -4.0% │        ±1.3% │            -4.5% │       ±0.9% │
  └────────┴───────────────────┴──────────────┴──────────────────┴─────────────┘
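For clarity, each row above is the mean and standard deviation of the
per-run percentage deltas between the cache and cache_shard scopes;
something like this (the run values here are hypothetical):

```shell
# Hypothetical example of how a table row is derived: mean and
# (population) stddev of per-run percentage deltas.
printf '%s\n' 11.8 13.2 12.4 12.6 | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
        mean = sum / n
        sd = sqrt(sumsq / n - mean * mean)
        printf "+%.1f%% ±%.1f%%\n", mean, sd
    }'
# prints +12.5% ±0.5%
```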


Microbenchmark result
=====================

I've run the micro-benchmark from this patchset as well; here is the
results comparison:

 * x86 (11 L3 domains, 16 CPUs each): cache_shard delivers +45-55%
   throughput and 36-44% lower latency at 2-8 shards. The sweet spot is
   4 shards (+55%, p50 cut nearly in half). Shard=1 confirms no effect.

   Even though the x86 host already has multiple L3 domains, each 16-CPU
   domain still has enough contention to benefit from further splitting
   (at least for this microbenchmark/stress test).

 * ARM (single L3, 72 CPUs): The gains are dramatic: 2x at 2 shards,
   3.2x at 4 shards, and 4.4x at 8 shards. At 8 shards, cache_shard
   (3.2M items/s) nearly matches cpu scope performance (3.7M), with p50
   latency dropping from 43.5 us to 6.9 us.

   The single monolithic L3 makes cache scope degenerate to a single
   contended pool, so sharding has a massive effect.


x86

  ┌────────┬─────────────────┬───────────────────────┬─────────────────┬───────────┬───────────┬───────────────────┐
  │ Shards │ cache (items/s) │ cache_shard (items/s) │ Throughput gain │ cache p50 │ shard p50 │ Latency reduction │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      1 │       2,660,103 │             2,667,740 │           +0.3% │   27.5 us │   27.5 us │                0% │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      2 │       2,619,884 │             3,788,454 │          +44.6% │   28.0 us │   17.8 us │              -36% │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      4 │       2,506,185 │             3,891,064 │          +55.3% │   29.3 us │   16.5 us │              -44% │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      8 │       2,628,321 │             4,015,312 │          +52.8% │   27.9 us │   16.4 us │              -41% │
  └────────┴─────────────────┴───────────────────────┴─────────────────┴───────────┴───────────┴───────────────────┘

  Reference scopes (stable across shard counts): cpu ~6.2M items/s, smt ~4.0M, numa/system ~422K.

ARM

  ┌────────┬─────────────────┬───────────────────────┬─────────────────┬───────────┬───────────┬───────────────────┐
  │ Shards │ cache (items/s) │ cache_shard (items/s) │ Throughput gain │ cache p50 │ shard p50 │ Latency reduction │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      2 │         725,999 │             1,516,967 │           +109% │   43.8 us │   19.6 us │              -55% │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      4 │         729,615 │             2,347,335 │           +222% │   43.6 us │   11.0 us │              -75% │
  ├────────┼─────────────────┼───────────────────────┼─────────────────┼───────────┼───────────┼───────────────────┤
  │      8 │         731,517 │             3,230,168 │           +342% │   43.5 us │    6.9 us │              -84% │
  └────────┴─────────────────┴───────────────────────┴─────────────────┴───────────┴───────────┴───────────────────┘
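For reference, the "Throughput gain" column is just the relative delta
between the two scopes; e.g. reproducing the 4-shard row of the first
table above:

```shell
# "Throughput gain" as the relative delta between the two scopes,
# using the 4-shard row of the first microbenchmark table.
cache=2506185
shard=3891064
awk -v c="$cache" -v s="$shard" \
    'BEGIN { printf "+%.1f%%\n", (s - c) / c * 100 }'
# prints +55.3%
```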


Next Steps:
  * Revert the code to sharding by CPU count (instead of by shard count) and
    report the numbers again.
  * Is there any other test that would help?


Thread overview: 13+ messages
2026-03-12 16:12 [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 1/5] workqueue: fix parse_affn_scope() prefix matching bug Breno Leitao
2026-03-13 17:41   ` Tejun Heo
2026-03-12 16:12 ` [PATCH RFC 2/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 3/5] workqueue: set WQ_AFFN_CACHE_SHARD as the default " Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 4/5] workqueue: add test_workqueue benchmark module Breno Leitao
2026-03-12 16:12 ` [PATCH RFC 5/5] tools/workqueue: add CACHE_SHARD support to wq_dump.py Breno Leitao
2026-03-13 17:57 ` [PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope Tejun Heo
2026-03-17 11:32   ` Breno Leitao
2026-03-17 13:58     ` Chuck Lever
2026-03-18 17:51       ` Breno Leitao
2026-03-18 23:00         ` Tejun Heo
2026-03-19 14:02           ` Breno Leitao [this message]
