From: Leonardo Bras <leobras.c@gmail.com>
To: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Leonardo Bras <leobras.c@gmail.com>,
Michal Hocko <mhocko@suse.com>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, Johannes Weiner <hannes@cmpxchg.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Christoph Lameter <cl@linux.com>,
Pekka Enberg <penberg@kernel.org>,
David Rientjes <rientjes@google.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Vlastimil Babka <vbabka@suse.cz>,
Hyeonggon Yoo <42.hyeyoo@gmail.com>,
Leonardo Bras <leobras@redhat.com>,
Thomas Gleixner <tglx@linutronix.de>,
Waiman Long <longman@redhat.com>,
Boqun Feng <boqun.feng@gmail.com>,
Frederic Weisbecker <fweisbecker@suse.de>
Subject: Re: [PATCH 0/4] Introduce QPW for per-cpu operations
Date: Fri, 20 Feb 2026 19:38:04 -0300
Message-ID: <aZjiTM5v-AOsaq2y@WindFlash>
In-Reply-To: <aZiSHT5DwIZwc/cH@tpad>
On Fri, Feb 20, 2026 at 01:55:57PM -0300, Marcelo Tosatti wrote:
> On Fri, Feb 20, 2026 at 01:51:13PM -0300, Marcelo Tosatti wrote:
> > On Mon, Feb 16, 2026 at 12:00:55PM +0100, Michal Hocko wrote:
> > > On Sat 14-02-26 19:02:19, Leonardo Bras wrote:
> > > > On Wed, Feb 11, 2026 at 05:38:47PM +0100, Michal Hocko wrote:
> > > > > On Wed 11-02-26 09:01:12, Marcelo Tosatti wrote:
> > > > > > On Tue, Feb 10, 2026 at 03:01:10PM +0100, Michal Hocko wrote:
> > > > > [...]
> > > > > > > What about !PREEMPT_RT? We have people running isolated workloads and
> > > > > > > these sorts of pcp disruptions are really unwelcome as well. They do not
> > > > > > > have requirements as strong as RT workloads, but the underlying
> > > > > > > fundamental problem is the same. Frederic (now CCed) is working on
> > > > > > > moving those pcp book keeping activities to be executed on the return to
> > > > > > > userspace, which should take care of both RT and non-RT
> > > > > > > configurations AFAICS.
> > > > > >
> > > > > > Michal,
> > > > > >
> > > > > > For !PREEMPT_RT, _if_ you select CONFIG_QPW=y, then there is a kernel
> > > > > > boot option qpw=y/n, which controls whether the behaviour will be
> > > > > > similar to PREEMPT_RT (the per-cpu spinlock is taken instead of local_lock).
> > > > >
> > > > > My bad. I've misread the config space of this.
> > > > >
> > > > > > If CONFIG_QPW=n, or the kernel boot option qpw=n, then only local_lock
> > > > > > (and remote work via workqueue) is used.
> > > > > >
> > > > > > What "pcp book keeping activities" you refer to ? I don't see how
> > > > > > moving certain activities that happen under SLUB or LRU spinlocks
> > > > > > to happen before return to userspace changes things related
> > > > > > to avoidance of CPU interruption ?
> > > > >
> > > > > Essentially, delayed operations like pcp state flushing happen on return
> > > > > to userspace on isolated CPUs. No locking changes are required, as
> > > > > the work is still per-cpu.
> > > > >
> > > > > In other words, the approach Frederic is working on is not to change the
> > > > > locking of pcp delayed work, but instead to move that work to a well defined
> > > > > place - i.e. the return to userspace.
> > > > >
> > > > > Btw. have you measured the impact of preempt_disable -> spinlock on hot
> > > > > paths like SLUB sheaves?
> > > >
> > > > Hi Michal,
> > > >
> > > > I have done some study on this (which I presented on Plumbers 2023):
> > > > https://lpc.events/event/17/contributions/1484/
> > > >
> > > > Since these are per-cpu spinlocks, and the remote operations are not that
> > > > frequent by design in the current approach, we are not supposed to see
> > > > contention (I was not able to detect contention even after stress testing
> > > > for weeks), nor any relevant cacheline bouncing.
> > > >
> > > > That being said, on RT local_locks already map to per-cpu spinlocks, so the
> > > > only difference is for !RT, which, as you mention, does preempt_disable():
> > > >
> > > > The performance impact noticed was mostly due to jumping around in
> > > > executable code: inlining the spinlocks (test #2 in the presentation) took
> > > > care of most of the added cycles, leaving about 4-14 extra cycles per
> > > > lock/unlock pair. (Tested on memcg with a kmalloc test.)
> > > >
> > > > Yeah, as expected there are some extra cycles, since we are doing extra
> > > > atomic operations (even if on a local cacheline) in the !RT case, but this
> > > > could be enabled only if the user thinks this is an acceptable cost for
> > > > reducing interruptions.
> > > >
> > > > What do you think?
> > >
> > > The fact that the behavior is opt-in for !RT is certainly a plus. I also
> > > do not expect the overhead to be really big. To me, a much
> > > more important question is which of the two approaches is easier to
> > > maintain long term. The pcp work needs to be done one way or the other.
> > > Whether we want to tweak locking or do it at a very well defined time is
> > > the bigger question.
> >
> > Without patchset:
> > ================
> >
> > [ 1188.050725] kmalloc_bench: Avg cycles per kmalloc: 159
> >
> > With qpw patchset, CONFIG_QPW=n:
> > ================================
> >
> > [ 50.292190] kmalloc_bench: Avg cycles per kmalloc: 163
Weird... with CONFIG_QPW=n we should see no difference.
Oh, maybe the changes in the code, such as adding a new cpu parameter to
some functions, may have caused this.
(Oh, and there is the migrate_disable() as well.)
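To illustrate what I mean (a rough sketch with made-up names; I am guessing
at the exact call-site shape, this is not taken from the patchset): even with
CONFIG_QPW=n, a converted call site now looks roughly like

	int cpu;

	migrate_disable();
	cpu = smp_processor_id();
	qpw_lock(&example_pcp_lock, cpu);
	/* ... touch the per-cpu data ... */
	qpw_unlock(&example_pcp_lock, cpu);
	migrate_enable();

so the migrate_disable()/smp_processor_id() pair is extra work on the fast
path compared to a plain local_lock() section, even though qpw_lock() itself
still expands to local_lock() in this config.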
> >
> > With qpw patchset, CONFIG_QPW=y, qpw=0:
> > =======================================
> >
> > [ 29.872153] kmalloc_bench: Avg cycles per kmalloc: 170
> >
Humm, what changed here is basically from
+#define qpw_lock(lock, cpu) \
+ local_lock(lock)
to
+#define qpw_lock(lock, cpu) \
+ do { \
+ if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) \
+ spin_lock(per_cpu_ptr(lock.sl, cpu)); \
+ else \
+ local_lock(lock.ll); \
+ } while (0)
So only the cost of a static branch... maybe I did something wrong here
with the static_branch_maybe(), as any CPU branch predictor should make this
delta close to zero.
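Just to make sure we are talking about the same thing, here is a minimal
sketch of how I picture the qpw_sl key and the qpw= boot option being wired
together (the key name comes from the hunk above; the rest is my assumption,
not the actual patch):

	/* needs <linux/jump_label.h>, <linux/init.h>, <linux/string.h> */
	DEFINE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);

	static bool qpw_enabled = IS_ENABLED(CONFIG_QPW_DEFAULT);

	static int __init qpw_param(char *str)
	{
		/* qpw=y/n on the kernel command line */
		return kstrtobool(str, &qpw_enabled);
	}
	early_param("qpw", qpw_param);

	static int __init qpw_init(void)
	{
		/* flip the key once, before the hot paths get exercised */
		if (qpw_enabled)
			static_branch_enable(&qpw_sl);
		else
			static_branch_disable(&qpw_sl);
		return 0;
	}
	early_initcall(qpw_init);

If that is roughly what the patch does, the fast path should only see a
patched nop/jmp from the jump label, so the ~7 cycle delta between
CONFIG_QPW=n and CONFIG_QPW=y/qpw=0 above is surprising.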
> >
> > With qpw patchset, CONFIG_QPW=y, qpw=1:
> > ========================================
> >
> > [ 37.494687] kmalloc_bench: Avg cycles per kmalloc: 190
> >
20 cycles as the price of a local_lock -> spinlock change seems too much.
Taking into account the previous message, maybe we should work on making them
inlined spinlocks, if they aren't already.
(Yeah, I missed that verification :| )
> > With PREEMPT_RT enabled, qpw=0:
> > ===============================
> >
> > [ 65.163251] kmalloc_bench: Avg cycles per kmalloc: 181
> >
> > With PREEMPT_RT enabled, no patchset:
> > =====================================
> > [ 52.701639] kmalloc_bench: Avg cycles per kmalloc: 185
> >
Nice, having the QPW patch saved some cycles :)
> > With PREEMPT_RT enabled, qpw=1:
> > ==============================
> >
> > [ 35.103830] kmalloc_bench: Avg cycles per kmalloc: 196
>
This is odd, though. The spinlock is already there on PREEMPT_RT, so from
qpw=0 to qpw=1 there should be no performance change. Maybe local_lock does
some optimization on its spinlock?
> #include <linux/module.h>
> #include <linux/kernel.h>
> #include <linux/slab.h>
> #include <linux/timex.h>
> #include <linux/preempt.h>
> #include <linux/irqflags.h>
> #include <linux/vmalloc.h>
>
> MODULE_LICENSE("GPL");
> MODULE_AUTHOR("Gemini AI");
> MODULE_DESCRIPTION("A simple kmalloc performance benchmark");
>
> static int size = 64; // Default allocation size in bytes
> module_param(size, int, 0644);
>
> static int iterations = 1000000; // Default number of iterations
> module_param(iterations, int, 0644);
>
> static int __init kmalloc_bench_init(void) {
> void **ptrs;
> cycles_t start, end;
> uint64_t total_cycles;
> int i;
> pr_info("kmalloc_bench: Starting test (size=%d, iterations=%d)\n", size, iterations);
>
> // Allocate an array to store pointers to avoid immediate kfree-reuse optimization
> ptrs = vmalloc(sizeof(void *) * iterations);
> if (!ptrs) {
> pr_err("kmalloc_bench: Failed to allocate pointer array\n");
> return -ENOMEM;
> }
>
> preempt_disable();
> start = get_cycles();
>
> for (i = 0; i < iterations; i++) {
> ptrs[i] = kmalloc(size, GFP_ATOMIC);
> }
>
> end = get_cycles();
>
> total_cycles = end - start;
> preempt_enable();
>
> pr_info("kmalloc_bench: Total cycles for %d allocs: %llu\n", iterations, total_cycles);
> pr_info("kmalloc_bench: Avg cycles per kmalloc: %llu\n", total_cycles / iterations);
>
> // Cleanup
> for (i = 0; i < iterations; i++) {
> kfree(ptrs[i]);
> }
> vfree(ptrs);
>
> return 0;
> }
>
> static void __exit kmalloc_bench_exit(void) {
> pr_info("kmalloc_bench: Module unloaded\n");
> }
>
> module_init(kmalloc_bench_init);
> module_exit(kmalloc_bench_exit);
>
Nice!
Please collect min and max as well (something like the sketch below); maybe
that will give us some insight into what happened :)
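A rough sketch of what I mean, in place of the single start/end measurement
in kmalloc_bench_init() above (note that calling get_cycles() around each
allocation adds its own overhead, so the average will not match the batched
numbers exactly):

	cycles_t t0, t1, delta;
	cycles_t min_cycles = ~(cycles_t)0, max_cycles = 0;
	uint64_t total_cycles = 0;

	preempt_disable();
	for (i = 0; i < iterations; i++) {
		t0 = get_cycles();
		ptrs[i] = kmalloc(size, GFP_ATOMIC);
		t1 = get_cycles();

		delta = t1 - t0;
		total_cycles += delta;
		if (delta < min_cycles)
			min_cycles = delta;
		if (delta > max_cycles)
			max_cycles = delta;
	}
	preempt_enable();

	pr_info("kmalloc_bench: Avg/min/max cycles per kmalloc: %llu/%llu/%llu\n",
		total_cycles / iterations, (u64)min_cycles, (u64)max_cycles);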
What was the system you used for testing?
Thanks!
Leo