public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
@ 2026-03-02 15:49 Marcelo Tosatti
  2026-03-02 15:49 ` [PATCH v2 1/5] slab: distinguish lock and trylock for sheaf_flush_main() Marcelo Tosatti
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2026-03-02 15:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Frederic Weisbecker

The problem:
Some places in the kernel implement a parallel programming strategy
consisting of local_lock()s for most of the work, while rare remote
operations are scheduled on the target cpu. This keeps cache bouncing low,
since cachelines tend to stay mostly local, and avoids the cost of locks in
non-RT kernels, even though the few remote operations are expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with remote requests is
sure to introduce unexpected deadline misses.

The idea:
Currently, with PREEMPT_RT=y, local_lock()s become per-cpu spinlocks.
In this case, instead of scheduling work on a remote cpu, it should
be safe to grab that remote cpu's per-cpu spinlock and run the required
work locally. The major cost, un/locking in every local function,
is already paid on PREEMPT_RT.

Also, there is no need to worry about extra cache bouncing:
The cacheline invalidation already happens due to schedule_work_on().

This will avoid schedule_work_on(), and thus avoid scheduling-out an
RT workload.

Proposed solution:
A new interface called Queue PerCPU Work (QPW), which should replace
the workqueue in the use case mentioned above.

With CONFIG_QPW=n, this interface just wraps the current
local_lock + workqueue behavior, so no runtime change is expected.

With CONFIG_QPW=y and the qpw=1 kernel boot option,
queue_percpu_work_on(cpu, ...) will lock that cpu's per-cpu structure
and perform the work locally. This is possible because, in the
functions that may perform work on remote per-cpu structures, the
local_lock (which is already a this-cpu spinlock) is replaced by a
qpw_spinlock(), which is able to take the per-cpu spinlock of the cpu
passed as a parameter.

v1->v2:
- Introduce local_qpw_lock and unlock functions, and move preempt_disable/
  preempt_enable into them (Leonardo Bras). This reduces the performance
  overhead of the patch.
- Documentation and changelog typo fixes (Leonardo Bras).
- Fix places where preempt_disable/preempt_enable was not being
  correctly performed.
- Add performance measurements.

RFC->v1:

- Introduce CONFIG_QPW and qpw= kernel boot option to enable 
  remote spinlocking and execution even on !CONFIG_PREEMPT_RT
  kernels (Leonardo Bras).
- Move buffer_head draining to separate workqueue (Marcelo Tosatti).
- Convert mlock per-CPU page lists to QPW (Marcelo Tosatti).
- Drop memcontrol conversion (as isolated CPUs are not targets
  of queue_work_on anymore).
- Rebase SLUB against Vlastimil's slab/next.
- Add basic document for QPW (Waiman Long).

The performance numbers, as measured by the following test program,
are as follows:

Unpatched kernel:			166 cycles
Patched kernel, CONFIG_QPW=n:		166 cycles
Patched kernel, CONFIG_QPW=y, qpw=0:	168 cycles
Patched kernel, CONFIG_QPW=y, qpw=1:	192 cycles

kmalloc_bench.c:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/timex.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/vmalloc.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Gemini AI");
MODULE_DESCRIPTION("A simple kmalloc performance benchmark");

static int size = 64; // Default allocation size in bytes
module_param(size, int, 0644);

static int iterations = 9000000; // Default number of iterations
module_param(iterations, int, 0644);

static int __init kmalloc_bench_init(void) {
    void **ptrs;
    cycles_t start, end;
    u64 total_cycles;
    int i;
    pr_info("kmalloc_bench: Starting test (size=%d, iterations=%d)\n", size, iterations);

    // Allocate an array to store pointers to avoid immediate kfree-reuse optimization
    ptrs = vmalloc(sizeof(void *) * iterations);
    if (!ptrs) {
        pr_err("kmalloc_bench: Failed to allocate pointer array\n");
        return -ENOMEM;
    }

    preempt_disable();
    start = get_cycles();

    for (i = 0; i < iterations; i++) {
        ptrs[i] = kmalloc(size, GFP_ATOMIC);
    }

    end = get_cycles();

    total_cycles = end - start;
    preempt_enable();

    pr_info("kmalloc_bench: Total cycles for %d allocs: %llu\n", iterations, total_cycles);
    pr_info("kmalloc_bench: Avg cycles per kmalloc: %llu\n", total_cycles / iterations);

    // Cleanup
    for (i = 0; i < iterations; i++) {
        kfree(ptrs[i]);
    }
    vfree(ptrs);

    return 0;
}

static void __exit kmalloc_bench_exit(void) {
    pr_info("kmalloc_bench: Module unloaded\n");
}

module_init(kmalloc_bench_init);
module_exit(kmalloc_bench_exit);

The following testcase triggers lru_add_drain_all on an isolated CPU
(that does sys_write to a file before entering its realtime 
loop).

/* 
 * Simulates a low latency loop program that is interrupted
 * due to lru_add_drain_all. To trigger lru_add_drain_all, run:
 *
 * blockdev --flushbufs /dev/sdX
 *
 */ 
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdarg.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

int cpu;

static void *run(void *arg)
{
	pthread_t current_thread;
	cpu_set_t cpuset;
	int ret, nrloops = 0;
	struct sched_param sched_p;
	pid_t pid;
	int fd;
	char buf[] = "xxxxxxxxxxx";

	CPU_ZERO(&cpuset);
	CPU_SET(cpu, &cpuset);

	current_thread = pthread_self();
	ret = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
	if (ret) {
		/* pthread_* functions return the error, errno is not set */
		fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(ret));
		exit(1);
	}

	memset(&sched_p, 0, sizeof(struct sched_param));
	sched_p.sched_priority = 1;
	pid = gettid();
	ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
	if (ret) {
		perror("sched_setscheduler");
		exit(1);
	}

	fd = open("/tmp/tmpfile", O_RDWR|O_CREAT|O_TRUNC, 0644);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	ret = write(fd, buf, sizeof(buf));
	if (ret == -1) {
		perror("write");
		exit(1);
	}

	/* busy loop: simulates the latency-sensitive section */
	do {
		nrloops = nrloops + 2;
		nrloops--;
	} while (1);
}

int main(int argc, char *argv[])
{
	int ret;
	pthread_t thread;
	char *endptr, *str;
	struct sched_param sched_p;
	pid_t pid;

	if (argc != 2) {
		printf("usage: %s cpu-nr\n", argv[0]);
		printf("where CPU number is the CPU to pin thread to\n");
		exit(1);
	}
	str = argv[1];
	cpu = strtol(str, &endptr, 10);
	if (cpu < 0) {
		printf("strtol returns %d\n", cpu);
		exit(1);
	}
	printf("cpunr=%d\n", cpu);

	memset(&sched_p, 0, sizeof(struct sched_param));
	sched_p.sched_priority = 1;
	pid = getpid();
	ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
	if (ret) {
		perror("sched_setscheduler");
		exit(1);
	}

	pthread_create(&thread, NULL, run, NULL);

	sleep(5000);

	pthread_join(thread, NULL);
	return 0;
}








^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 1/5] slab: distinguish lock and trylock for sheaf_flush_main()
  2026-03-02 15:49 [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Marcelo Tosatti
@ 2026-03-02 15:49 ` Marcelo Tosatti
  2026-03-02 15:49 ` [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2026-03-02 15:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Frederic Weisbecker

From: Vlastimil Babka <vbabka@suse.cz>

sheaf_flush_main() can be called from __pcs_replace_full_main(), where
the trylock can in theory fail, and from pcs_flush_all(), where it is
not expected to fail, and it would actually be a problem if it failed
and left the main sheaf not flushed.

To make this explicit, split the function into sheaf_flush_main() (using
local_lock()) and sheaf_try_flush_main() (using local_trylock()) where
both call __sheaf_flush_main_batch() to flush a single batch of objects.
This will allow lockdep to verify our assumptions.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 47 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 37 insertions(+), 10 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 18c30872d196..12912b29f5bb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2844,19 +2844,19 @@ static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
  * object pointers are moved to a on-stack array under the lock. To bound the
  * stack usage, limit each batch to PCS_BATCH_MAX.
  *
- * returns true if at least partially flushed
+ * Must be called with s->cpu_sheaves->lock locked, returns with the lock
+ * unlocked.
+ *
+ * Returns how many objects are remaining to be flushed
  */
-static bool sheaf_flush_main(struct kmem_cache *s)
+static unsigned int __sheaf_flush_main_batch(struct kmem_cache *s)
 {
 	struct slub_percpu_sheaves *pcs;
 	unsigned int batch, remaining;
 	void *objects[PCS_BATCH_MAX];
 	struct slab_sheaf *sheaf;
-	bool ret = false;
 
-next_batch:
-	if (!local_trylock(&s->cpu_sheaves->lock))
-		return ret;
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 	sheaf = pcs->main;
@@ -2874,10 +2874,37 @@ static bool sheaf_flush_main(struct kmem_cache *s)
 
 	stat_add(s, SHEAF_FLUSH, batch);
 
-	ret = true;
+	return remaining;
+}
 
-	if (remaining)
-		goto next_batch;
+static void sheaf_flush_main(struct kmem_cache *s)
+{
+	unsigned int remaining;
+
+	do {
+		local_lock(&s->cpu_sheaves->lock);
+
+		remaining = __sheaf_flush_main_batch(s);
+
+	} while (remaining);
+}
+
+/*
+ * Returns true if the main sheaf was at least partially flushed.
+ */
+static bool sheaf_try_flush_main(struct kmem_cache *s)
+{
+	unsigned int remaining;
+	bool ret = false;
+
+	do {
+		if (!local_trylock(&s->cpu_sheaves->lock))
+			return ret;
+
+		ret = true;
+		remaining = __sheaf_flush_main_batch(s);
+
+	} while (remaining);
 
 	return ret;
 }
@@ -5685,7 +5712,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	if (put_fail)
 		 stat(s, BARN_PUT_FAIL);
 
-	if (!sheaf_flush_main(s))
+	if (!sheaf_try_flush_main(s))
 		return NULL;
 
 	if (!local_trylock(&s->cpu_sheaves->lock))

---
base-commit: 27125df9a5d3b4cfd03bce3a8ec405a368cc9aae
change-id: 20260211-b4-sheaf-flush-2eb99a9c8bfb

Best regards,
-- 
Vlastimil Babka <vbabka@suse.cz>







^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-02 15:49 [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Marcelo Tosatti
  2026-03-02 15:49 ` [PATCH v2 1/5] slab: distinguish lock and trylock for sheaf_flush_main() Marcelo Tosatti
@ 2026-03-02 15:49 ` Marcelo Tosatti
  2026-03-03 12:03   ` Vlastimil Babka (SUSE)
                     ` (2 more replies)
  2026-03-02 15:49 ` [PATCH v2 3/5] mm/swap: move bh draining into a separate workqueue Marcelo Tosatti
                   ` (5 subsequent siblings)
  7 siblings, 3 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2026-03-02 15:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Frederic Weisbecker, Marcelo Tosatti

Some places in the kernel implement a parallel programming strategy
consisting of local_lock()s for most of the work, while rare remote
operations are scheduled on the target cpu. This keeps cache bouncing low,
since cachelines tend to stay mostly local, and avoids the cost of locks in
non-RT kernels, even though the few remote operations are expensive due
to scheduling overhead.

On the other hand, for RT workloads this can represent a problem:
scheduling work on remote cpus that are executing low latency tasks
is undesired and can introduce unexpected deadline misses.

It's interesting, though, that local_lock()s in RT kernels become
spinlock()s. We can make use of those to avoid scheduling work on a
remote cpu, by directly updating another cpu's per-cpu structure while
holding its spinlock().

In order to do that, it's necessary to introduce a new set of functions to
make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
and also the corresponding queue_percpu_work_on() and flush_percpu_work()
helpers to run the remote work.

Users of non-RT kernels but with low latency requirements can select
similar functionality by using the CONFIG_QPW compile time option.

On CONFIG_QPW disabled kernels, no changes are expected, as every
one of the introduced helpers works exactly the same as the current
implementation:
qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
queue_percpu_work_on()  ->  queue_work_on()
flush_percpu_work()     ->  flush_work()

For QPW enabled kernels, though, qpw_{un,}lock*() will use the extra
cpu parameter to select the correct per-cpu structure to work on,
and acquire the spinlock for that cpu.

queue_percpu_work_on() will just call the requested function on the
current cpu, operating on another cpu's per-cpu object. Since the
local_lock()s become spinlock()s in QPW enabled kernels, we are
safe doing that.

flush_percpu_work() then becomes a no-op since no work is actually
scheduled on a remote cpu.

Some minimal code rework is needed in order to make this mechanism work:
the calls to local_{un,}lock*() in the functions that are currently
scheduled on remote cpus need to be replaced by qpw_{un,}lock*(), so in
QPW enabled kernels they can reference a different cpu. It's also
necessary to use a qpw_struct instead of a work_struct, but it just
contains a work_struct and, with CONFIG_QPW, the target cpu.

This should have almost no impact on non-CONFIG_QPW kernels: a few
this_cpu_ptr() calls become per_cpu_ptr(, smp_processor_id()).

On CONFIG_QPW kernels, this should avoid deadline misses by
removing scheduling noise.

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |   10 
 Documentation/locking/qpwlocks.rst              |   70 ++++++
 MAINTAINERS                                     |    7 
 include/linux/qpw.h                             |  256 ++++++++++++++++++++++++
 init/Kconfig                                    |   35 +++
 kernel/Makefile                                 |    2 
 kernel/qpw.c                                    |   26 ++
 7 files changed, 406 insertions(+)
 create mode 100644 include/linux/qpw.h
 create mode 100644 kernel/qpw.c

Index: linux/Documentation/admin-guide/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/admin-guide/kernel-parameters.txt
+++ linux/Documentation/admin-guide/kernel-parameters.txt
@@ -2840,6 +2840,16 @@ Kernel parameters
 
 			The format of <cpu-list> is described above.
 
+	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
+			and remote interference mechanism on a kernel built with
+			CONFIG_QPW.
+			Format: { "0" | "1" }
+			0 - local_lock() + queue_work_on(remote_cpu)
+			1 - spin_lock() for both local and remote operations
+
+			Selecting 1 may be interesting for systems that want
+			to avoid interruption & context switches from IPIs.
+
 	iucv=		[HW,NET]
 
 	ivrs_ioapic	[HW,X86-64]
Index: linux/MAINTAINERS
===================================================================
--- linux.orig/MAINTAINERS
+++ linux/MAINTAINERS
@@ -21553,6 +21553,13 @@ F:	Documentation/networking/device_drive
 F:	drivers/bus/fsl-mc/
 F:	include/uapi/linux/fsl_mc.h
 
+QPW
+M:	Leonardo Bras <leobras.c@gmail.com>
+S:	Supported
+F:	Documentation/locking/qpwlocks.rst
+F:	include/linux/qpw.h
+F:	kernel/qpw.c
+
 QT1010 MEDIA DRIVER
 L:	linux-media@vger.kernel.org
 S:	Orphan
Index: linux/include/linux/qpw.h
===================================================================
--- /dev/null
+++ linux/include/linux/qpw.h
@@ -0,0 +1,256 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_QPW_H
+#define _LINUX_QPW_H
+
+#include <linux/spinlock.h>
+#include <linux/local_lock.h>
+#include <linux/workqueue.h>
+
+#ifndef CONFIG_QPW
+
+typedef local_lock_t qpw_lock_t;
+typedef local_trylock_t qpw_trylock_t;
+
+struct qpw_struct {
+	struct work_struct work;
+};
+
+#define qpw_lock_init(lock)				\
+	local_lock_init(lock)
+
+#define qpw_trylock_init(lock)				\
+	local_trylock_init(lock)
+
+#define qpw_lock(lock, cpu)				\
+	local_lock(lock)
+
+#define local_qpw_lock(lock)				\
+	local_lock(lock)
+
+#define qpw_lock_irqsave(lock, flags, cpu)		\
+	local_lock_irqsave(lock, flags)
+
+#define local_qpw_lock_irqsave(lock, flags)		\
+	local_lock_irqsave(lock, flags)
+
+#define qpw_trylock(lock, cpu)				\
+	local_trylock(lock)
+
+#define local_qpw_trylock(lock)				\
+	local_trylock(lock)
+
+#define qpw_trylock_irqsave(lock, flags, cpu)		\
+	local_trylock_irqsave(lock, flags)
+
+#define qpw_unlock(lock, cpu)				\
+	local_unlock(lock)
+
+#define local_qpw_unlock(lock)				\
+	local_unlock(lock)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)		\
+	local_unlock_irqrestore(lock, flags)
+
+#define local_qpw_unlock_irqrestore(lock, flags)	\
+	local_unlock_irqrestore(lock, flags)
+
+#define qpw_lockdep_assert_held(lock)			\
+	lockdep_assert_held(lock)
+
+#define queue_percpu_work_on(c, wq, qpw)		\
+	queue_work_on(c, wq, &(qpw)->work)
+
+#define flush_percpu_work(qpw)				\
+	flush_work(&(qpw)->work)
+
+#define qpw_get_cpu(qpw)	smp_processor_id()
+
+#define qpw_is_cpu_remote(cpu)		(false)
+
+#define INIT_QPW(qpw, func, c)				\
+	INIT_WORK(&(qpw)->work, (func))
+
+#else /* CONFIG_QPW */
+
+DECLARE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
+
+typedef union {
+	spinlock_t sl;
+	local_lock_t ll;
+} qpw_lock_t;
+
+typedef union {
+	spinlock_t sl;
+	local_trylock_t ll;
+} qpw_trylock_t;
+
+struct qpw_struct {
+	struct work_struct work;
+	int cpu;
+};
+
+#define qpw_lock_init(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_init(lock.sl);					\
+		else									\
+			local_lock_init(lock.ll);					\
+	} while (0)
+
+#define qpw_trylock_init(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_init(lock.sl);					\
+		else									\
+			local_trylock_init(lock.ll);					\
+	} while (0)
+
+#define qpw_lock(lock, cpu)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock(per_cpu_ptr(lock.sl, cpu));				\
+		else									\
+			local_lock(lock.ll);						\
+	} while (0)
+
+#define local_qpw_lock(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			migrate_disable();						\
+			spin_lock(this_cpu_ptr(lock.sl));				\
+		} else									\
+			local_lock(lock.ll);						\
+	} while (0)
+
+#define qpw_lock_irqsave(lock, flags, cpu)						\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);		\
+		else									\
+			local_lock_irqsave(lock.ll, flags);				\
+	} while (0)
+
+#define local_qpw_lock_irqsave(lock, flags)						\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			migrate_disable();						\
+			spin_lock_irqsave(this_cpu_ptr(lock.sl), flags);		\
+		} else									\
+			local_lock_irqsave(lock.ll, flags);				\
+	} while (0)
+
+
+#define qpw_trylock(lock, cpu)								\
+	({										\
+		int t;									\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			t = spin_trylock(per_cpu_ptr(lock.sl, cpu));			\
+		else									\
+			t = local_trylock(lock.ll);					\
+		t;									\
+	})
+
+#define local_qpw_trylock(lock)								\
+	({										\
+		int t;									\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			migrate_disable();						\
+			t = spin_trylock(this_cpu_ptr(lock.sl));			\
+			if (!t)								\
+				migrate_enable();					\
+		} else									\
+			t = local_trylock(lock.ll);					\
+		t;									\
+	})
+
+#define qpw_trylock_irqsave(lock, flags, cpu)						\
+	({										\
+		int t;									\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags);	\
+		else									\
+			t = local_trylock_irqsave(lock.ll, flags);			\
+		t;									\
+	})
+
+#define qpw_unlock(lock, cpu)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			spin_unlock(per_cpu_ptr(lock.sl, cpu));				\
+		} else {								\
+			local_unlock(lock.ll);						\
+		}									\
+	} while (0)
+
+#define local_qpw_unlock(lock)								\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			spin_unlock(this_cpu_ptr(lock.sl));				\
+			migrate_enable();						\
+		} else {								\
+			local_unlock(lock.ll);						\
+		}									\
+	} while (0)
+
+#define qpw_unlock_irqrestore(lock, flags, cpu)						\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags);	\
+		else									\
+			local_unlock_irqrestore(lock.ll, flags);			\
+	} while (0)
+
+#define local_qpw_unlock_irqrestore(lock, flags)					\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			spin_unlock_irqrestore(this_cpu_ptr(lock.sl), flags);		\
+			migrate_enable();						\
+		} else									\
+			local_unlock_irqrestore(lock.ll, flags);			\
+	} while (0)
+
+#define qpw_lockdep_assert_held(lock)							\
+	do {										\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl))			\
+			lockdep_assert_held(this_cpu_ptr(lock.sl));			\
+		else									\
+			lockdep_assert_held(this_cpu_ptr(lock.ll));			\
+	} while (0)
+
+#define queue_percpu_work_on(c, wq, qpw)						\
+	do {										\
+		int __c = c;								\
+		struct qpw_struct *__qpw = (qpw);					\
+		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
+			WARN_ON((__c) != __qpw->cpu);					\
+			__qpw->work.func(&__qpw->work);					\
+		} else {								\
+			queue_work_on(__c, wq, &(__qpw)->work);				\
+		}									\
+	} while (0)
+
+/*
+ * Does nothing if QPW is set to use spinlock, as the task is already done at the
+ * time queue_percpu_work_on() returns.
+ */
+#define flush_percpu_work(qpw)								\
+	do {										\
+		struct qpw_struct *__qpw = (qpw);					\
+		if (!static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {		\
+			flush_work(&__qpw->work);					\
+		}									\
+	} while (0)
+
+#define qpw_get_cpu(w)			container_of((w), struct qpw_struct, work)->cpu
+
+#define qpw_is_cpu_remote(cpu)		((cpu) != smp_processor_id())
+
+#define INIT_QPW(qpw, func, c)								\
+	do {										\
+		struct qpw_struct *__qpw = (qpw);					\
+		INIT_WORK(&__qpw->work, (func));					\
+		__qpw->cpu = (c);							\
+	} while (0)
+
+#endif /* CONFIG_QPW */
+#endif /* _LINUX_QPW_H */
Index: linux/init/Kconfig
===================================================================
--- linux.orig/init/Kconfig
+++ linux/init/Kconfig
@@ -762,6 +762,41 @@ config CPU_ISOLATION
 
 	  Say Y if unsure.
 
+config QPW
+	bool "Queue per-CPU Work"
+	depends on SMP || COMPILE_TEST
+	default n
+	help
+	  Allow changing the behavior of per-CPU resource sharing, from the
+	  regular local_lock() + queue_work_on(remote_cpu) strategy to
+	  per-CPU spinlocks for both local and remote operations.
+
+	  This is useful to give the user the option of reducing IPIs to
+	  CPUs, and thus reduce interruptions and context switches. On the
+	  other hand, it increases generated code size and will use atomic
+	  operations if spinlocks are selected.
+
+	  If set, the default behavior from QPW_DEFAULT is used, unless the
+	  qpw boot parameter selects a different behavior.
+
+	  If unset, the local_lock() + queue_work_on() strategy is used,
+	  regardless of the boot parameter or QPW_DEFAULT.
+
+	  Say N if unsure.
+
+config QPW_DEFAULT
+	bool "Use per-CPU spinlocks by default"
+	depends on QPW
+	default n
+	help
+	  If set, will use per-CPU spinlocks as default behavior for per-CPU
+	  remote operations.
+
+	  If unset, will use local_lock() + queue_work_on(cpu) as default
+	  behavior for remote operations.
+
+	  Say N if unsure.
+
 source "kernel/rcu/Kconfig"
 
 config IKCONFIG
Index: linux/kernel/Makefile
===================================================================
--- linux.orig/kernel/Makefile
+++ linux/kernel/Makefile
@@ -142,6 +142,8 @@ obj-$(CONFIG_WATCH_QUEUE) += watch_queue
 obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
 obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
 
+obj-$(CONFIG_QPW) += qpw.o
+
 CFLAGS_kstack_erase.o += $(DISABLE_KSTACK_ERASE)
 CFLAGS_kstack_erase.o += $(call cc-option,-mgeneral-regs-only)
 obj-$(CONFIG_KSTACK_ERASE) += kstack_erase.o
Index: linux/kernel/qpw.c
===================================================================
--- /dev/null
+++ linux/kernel/qpw.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/export.h>
+#include <linux/sched.h>
+#include <linux/qpw.h>
+#include <linux/string.h>
+
+DEFINE_STATIC_KEY_MAYBE(CONFIG_QPW_DEFAULT, qpw_sl);
+EXPORT_SYMBOL(qpw_sl);
+
+static int __init qpw_setup(char *str)
+{
+	int opt;
+
+	if (!get_option(&str, &opt)) {
+		pr_warn("QPW: invalid qpw parameter: %s, ignoring.\n", str);
+		return 0;
+	}
+
+	if (opt)
+		static_branch_enable(&qpw_sl);
+	else
+		static_branch_disable(&qpw_sl);
+
+	return 1;
+}
+__setup("qpw=", qpw_setup);
Index: linux/Documentation/locking/qpwlocks.rst
===================================================================
--- /dev/null
+++ linux/Documentation/locking/qpwlocks.rst
@@ -0,0 +1,70 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========
+QPW locks
+=========
+
+Some places in the kernel implement a parallel programming strategy
+consisting of local_lock()s for most of the work, while rare remote
+operations are scheduled on the target cpu. This keeps cache bouncing low,
+since cachelines tend to stay mostly local, and avoids the cost of locks in
+non-RT kernels, even though the few remote operations are expensive due
+to scheduling overhead.
+
+On the other hand, for RT workloads this can represent a problem:
+scheduling work on remote cpus that are executing low latency tasks
+is undesired and can introduce unexpected deadline misses.
+
+QPW locks help convert sites that use local_locks (for cpu-local
+operations) and queue_work_on (for queueing work to be executed on
+the cpu that owns the lock) to a single QPW locking scheme.
+
+The lock is declared qpw_lock_t type.
+The lock is initialized with qpw_lock_init.
+The lock is locked with qpw_lock (takes a lock and cpu as a parameter).
+The lock is unlocked with qpw_unlock (takes a lock and cpu as a parameter).
+
+The qpw_lock_irqsave function disables interrupts and saves the interrupt
+state; it takes a lock, a flags argument and a cpu as parameters.
+
+For trylock variant, there is the qpw_trylock_t type, initialized with
+qpw_trylock_init. Then the corresponding qpw_trylock and
+qpw_trylock_irqsave.
+
+work_struct should be replaced by qpw_struct, which contains a cpu parameter
+(owner cpu of the lock), initialized by INIT_QPW.
+
+The queue work related functions (analogous to queue_work_on and flush_work) are:
+queue_percpu_work_on and flush_percpu_work.
+
+The behaviour of the QPW functions is as follows:
+
+* !CONFIG_QPW (or CONFIG_QPW and the qpw=0 kernel boot parameter):
+        - qpw_lock:                     local_lock
+        - qpw_lock_irqsave:             local_lock_irqsave
+        - qpw_trylock:                  local_trylock
+        - qpw_trylock_irqsave:          local_trylock_irqsave
+        - qpw_unlock:                   local_unlock
+        - queue_percpu_work_on:         queue_work_on
+        - flush_percpu_work:            flush_work
+
+* CONFIG_QPW (and CONFIG_QPW_DEFAULT=y or the qpw=1 kernel boot parameter):
+        - qpw_lock:                     spin_lock
+        - qpw_lock_irqsave:             spin_lock_irqsave
+        - qpw_trylock:                  spin_trylock
+        - qpw_trylock_irqsave:          spin_trylock_irqsave
+        - qpw_unlock:                   spin_unlock
+        - queue_percpu_work_on:         executes work function on caller cpu
+        - flush_percpu_work:            empty
+
+qpw_get_cpu(work_struct), to be called from within qpw work function,
+returns the target cpu.
+
+In addition to the locking functions above, there are the local locking
+functions (local_qpw_lock, local_qpw_trylock and local_qpw_unlock).
+These take no cpu parameter and disable preemption or migration as
+needed.
+
+They must only be used to access per-CPU data owned by the local CPU,
+never remotely.
+




^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 3/5] mm/swap: move bh draining into a separate workqueue
  2026-03-02 15:49 [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Marcelo Tosatti
  2026-03-02 15:49 ` [PATCH v2 1/5] slab: distinguish lock and trylock for sheaf_flush_main() Marcelo Tosatti
  2026-03-02 15:49 ` [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
@ 2026-03-02 15:49 ` Marcelo Tosatti
  2026-03-02 15:49 ` [PATCH v2 4/5] swap: apply new queue_percpu_work_on() interface Marcelo Tosatti
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2026-03-02 15:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Frederic Weisbecker, Marcelo Tosatti

Separate the bh draining (from the mm lru draining) into its own
workqueue, so that it is possible to switch the mm lru draining
to QPW.

Switching bh draining itself to QPW would require adding a spinlock
to the path that adds bhs to the percpu cache, which is a very hot
path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
 mm/swap.c |   52 +++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 37 insertions(+), 15 deletions(-)

Index: linux/mm/swap.c
===================================================================
--- linux.orig/mm/swap.c
+++ linux/mm/swap.c
@@ -745,12 +745,11 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_and_bh_lrus_drain(void)
+static void lru_add_mm_drain(void)
 {
 	local_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
 	local_unlock(&cpu_fbatches.lock);
-	invalidate_bh_lrus_cpu();
 	mlock_drain_local();
 }
 
@@ -769,10 +768,17 @@ static DEFINE_PER_CPU(struct work_struct
 
 static void lru_add_drain_per_cpu(struct work_struct *dummy)
 {
-	lru_add_and_bh_lrus_drain();
+	lru_add_mm_drain();
 }
 
-static bool cpu_needs_drain(unsigned int cpu)
+static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work);
+
+static void bh_add_drain_per_cpu(struct work_struct *dummy)
+{
+	invalidate_bh_lrus_cpu();
+}
+
+static bool cpu_needs_mm_drain(unsigned int cpu)
 {
 	struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
 
@@ -783,8 +789,12 @@ static bool cpu_needs_drain(unsigned int
 		folio_batch_count(&fbatches->lru_deactivate) ||
 		folio_batch_count(&fbatches->lru_lazyfree) ||
 		folio_batch_count(&fbatches->lru_activate) ||
-		need_mlock_drain(cpu) ||
-		has_bh_in_lru(cpu, NULL);
+		need_mlock_drain(cpu);
+}
+
+static bool cpu_needs_bh_drain(unsigned int cpu)
+{
+	return has_bh_in_lru(cpu, NULL);
 }
 
 /*
@@ -807,7 +817,7 @@ static inline void __lru_add_drain_all(b
 	 * each CPU.
 	 */
 	static unsigned int lru_drain_gen;
-	static struct cpumask has_work;
+	static struct cpumask has_mm_work, has_bh_work;
 	static DEFINE_MUTEX(lock);
 	unsigned cpu, this_gen;
 
@@ -870,20 +880,31 @@ static inline void __lru_add_drain_all(b
 	WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
 	smp_mb();
 
-	cpumask_clear(&has_work);
+	cpumask_clear(&has_mm_work);
+	cpumask_clear(&has_bh_work);
 	for_each_online_cpu(cpu) {
-		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
+		struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
+		struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
+
+		if (cpu_needs_mm_drain(cpu)) {
+			INIT_WORK(mm_work, lru_add_drain_per_cpu);
+			queue_work_on(cpu, mm_percpu_wq, mm_work);
+			__cpumask_set_cpu(cpu, &has_mm_work);
+		}
 
-		if (cpu_needs_drain(cpu)) {
-			INIT_WORK(work, lru_add_drain_per_cpu);
-			queue_work_on(cpu, mm_percpu_wq, work);
-			__cpumask_set_cpu(cpu, &has_work);
+		if (cpu_needs_bh_drain(cpu)) {
+			INIT_WORK(bh_work, bh_add_drain_per_cpu);
+			queue_work_on(cpu, mm_percpu_wq, bh_work);
+			__cpumask_set_cpu(cpu, &has_bh_work);
 		}
 	}
 
-	for_each_cpu(cpu, &has_work)
+	for_each_cpu(cpu, &has_mm_work)
 		flush_work(&per_cpu(lru_add_drain_work, cpu));
 
+	for_each_cpu(cpu, &has_bh_work)
+		flush_work(&per_cpu(bh_add_drain_work, cpu));
+
 done:
 	mutex_unlock(&lock);
 }
@@ -929,7 +950,8 @@ void lru_cache_disable(void)
 #ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
-	lru_add_and_bh_lrus_drain();
+	lru_add_mm_drain();
+	invalidate_bh_lrus_cpu();
 #endif
 }
 




^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 4/5] swap: apply new queue_percpu_work_on() interface
  2026-03-02 15:49 [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Marcelo Tosatti
                   ` (2 preceding siblings ...)
  2026-03-02 15:49 ` [PATCH v2 3/5] mm/swap: move bh draining into a separate workqueue Marcelo Tosatti
@ 2026-03-02 15:49 ` Marcelo Tosatti
  2026-03-02 15:49 ` [PATCH v2 5/5] slub: " Marcelo Tosatti
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2026-03-02 15:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Frederic Weisbecker, Marcelo Tosatti

Make use of the new qpw_{un,}lock*() and queue_percpu_work_on()
interface to improve performance & latency.

For functions that may be scheduled on a different cpu, replace
local_{un,}lock*() by qpw_{un,}lock*(), and replace schedule_work_on()
by queue_percpu_work_on(). Likewise, flush_work() is replaced by
flush_percpu_work().

The change requires allocating qpw_structs instead of work_structs,
and changing the parameters of a few functions to include the cpu
parameter.

This should bring no relevant performance impact on non-QPW kernels:
for functions that may be scheduled on a different cpu, the
local_*lock's this_cpu_ptr() simply becomes a
per_cpu_ptr(smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/internal.h   |    4 ++-
 mm/mlock.c      |   51 ++++++++++++++++++++++++++++++-----------
 mm/page_alloc.c |    2 -
 mm/swap.c       |   69 ++++++++++++++++++++++++++++++--------------------------
 4 files changed, 79 insertions(+), 47 deletions(-)

Index: linux/mm/mlock.c
===================================================================
--- linux.orig/mm/mlock.c
+++ linux/mm/mlock.c
@@ -25,17 +25,16 @@
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
 #include <linux/secretmem.h>
+#include <linux/qpw.h>
 
 #include "internal.h"
 
 struct mlock_fbatch {
-	local_lock_t lock;
+	qpw_lock_t lock;
 	struct folio_batch fbatch;
 };
 
-static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch) = {
-	.lock = INIT_LOCAL_LOCK(lock),
-};
+static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch);
 
 bool can_do_mlock(void)
 {
@@ -209,18 +208,29 @@ static void mlock_folio_batch(struct fol
 	folios_put(fbatch);
 }
 
+void mlock_drain_cpu(int cpu)
+{
+	struct folio_batch *fbatch;
+
+	qpw_lock(&mlock_fbatch.lock, cpu);
+	fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
+	if (folio_batch_count(fbatch))
+		mlock_folio_batch(fbatch);
+	qpw_unlock(&mlock_fbatch.lock, cpu);
+}
+
 void mlock_drain_local(void)
 {
 	struct folio_batch *fbatch;
 
-	local_lock(&mlock_fbatch.lock);
+	local_qpw_lock(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 	if (folio_batch_count(fbatch))
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	local_qpw_unlock(&mlock_fbatch.lock);
 }
 
-void mlock_drain_remote(int cpu)
+void mlock_drain_offline(int cpu)
 {
 	struct folio_batch *fbatch;
 
@@ -243,7 +253,7 @@ void mlock_folio(struct folio *folio)
 {
 	struct folio_batch *fbatch;
 
-	local_lock(&mlock_fbatch.lock);
+	local_qpw_lock(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 
 	if (!folio_test_set_mlocked(folio)) {
@@ -257,7 +267,7 @@ void mlock_folio(struct folio *folio)
 	if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
 	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	local_qpw_unlock(&mlock_fbatch.lock);
 }
 
 /**
@@ -269,7 +279,7 @@ void mlock_new_folio(struct folio *folio
 	struct folio_batch *fbatch;
 	int nr_pages = folio_nr_pages(folio);
 
-	local_lock(&mlock_fbatch.lock);
+	local_qpw_lock(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 	folio_set_mlocked(folio);
 
@@ -280,7 +290,7 @@ void mlock_new_folio(struct folio *folio
 	if (!folio_batch_add(fbatch, mlock_new(folio)) ||
 	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	local_qpw_unlock(&mlock_fbatch.lock);
 }
 
 /**
@@ -291,7 +301,7 @@ void munlock_folio(struct folio *folio)
 {
 	struct folio_batch *fbatch;
 
-	local_lock(&mlock_fbatch.lock);
+	local_qpw_lock(&mlock_fbatch.lock);
 	fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
 	/*
 	 * folio_test_clear_mlocked(folio) must be left to __munlock_folio(),
@@ -301,7 +311,7 @@ void munlock_folio(struct folio *folio)
 	if (!folio_batch_add(fbatch, folio) ||
 	    !folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		mlock_folio_batch(fbatch);
-	local_unlock(&mlock_fbatch.lock);
+	local_qpw_unlock(&mlock_fbatch.lock);
 }
 
 static inline unsigned int folio_mlock_step(struct folio *folio,
@@ -823,3 +833,18 @@ void user_shm_unlock(size_t size, struct
 	spin_unlock(&shmlock_user_lock);
 	put_ucounts(ucounts);
 }
+
+int __init mlock_init(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct mlock_fbatch *fbatch = &per_cpu(mlock_fbatch, cpu);
+
+		qpw_lock_init(&fbatch->lock);
+	}
+
+	return 0;
+}
+
+module_init(mlock_init);
Index: linux/mm/swap.c
===================================================================
--- linux.orig/mm/swap.c
+++ linux/mm/swap.c
@@ -35,7 +35,7 @@
 #include <linux/uio.h>
 #include <linux/hugetlb.h>
 #include <linux/page_idle.h>
-#include <linux/local_lock.h>
+#include <linux/qpw.h>
 #include <linux/buffer_head.h>
 
 #include "internal.h"
@@ -52,7 +52,7 @@ struct cpu_fbatches {
 	 * The following folio batches are grouped together because they are protected
 	 * by disabling preemption (and interrupts remain enabled).
 	 */
-	local_lock_t lock;
+	qpw_lock_t lock;
 	struct folio_batch lru_add;
 	struct folio_batch lru_deactivate_file;
 	struct folio_batch lru_deactivate;
@@ -61,14 +61,11 @@ struct cpu_fbatches {
 	struct folio_batch lru_activate;
 #endif
 	/* Protecting the following batches which require disabling interrupts */
-	local_lock_t lock_irq;
+	qpw_lock_t lock_irq;
 	struct folio_batch lru_move_tail;
 };
 
-static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
-	.lock = INIT_LOCAL_LOCK(lock),
-	.lock_irq = INIT_LOCAL_LOCK(lock_irq),
-};
+static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches);
 
 static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
 		unsigned long *flagsp)
@@ -187,18 +184,18 @@ static void __folio_batch_add_and_move(s
 	folio_get(folio);
 
 	if (disable_irq)
-		local_lock_irqsave(&cpu_fbatches.lock_irq, flags);
+		local_qpw_lock_irqsave(&cpu_fbatches.lock_irq, flags);
 	else
-		local_lock(&cpu_fbatches.lock);
+		local_qpw_lock(&cpu_fbatches.lock);
 
 	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
 			!folio_may_be_lru_cached(folio) || lru_cache_disabled())
 		folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
 
 	if (disable_irq)
-		local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
+		local_qpw_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
 	else
-		local_unlock(&cpu_fbatches.lock);
+		local_qpw_unlock(&cpu_fbatches.lock);
 }
 
 #define folio_batch_add_and_move(folio, op)		\
@@ -359,7 +356,7 @@ static void __lru_cache_activate_folio(s
 	struct folio_batch *fbatch;
 	int i;
 
-	local_lock(&cpu_fbatches.lock);
+	local_qpw_lock(&cpu_fbatches.lock);
 	fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
 
 	/*
@@ -381,7 +378,7 @@ static void __lru_cache_activate_folio(s
 		}
 	}
 
-	local_unlock(&cpu_fbatches.lock);
+	local_qpw_unlock(&cpu_fbatches.lock);
 }
 
 #ifdef CONFIG_LRU_GEN
@@ -653,9 +650,9 @@ void lru_add_drain_cpu(int cpu)
 		unsigned long flags;
 
 		/* No harm done if a racing interrupt already did this */
-		local_lock_irqsave(&cpu_fbatches.lock_irq, flags);
+		qpw_lock_irqsave(&cpu_fbatches.lock_irq, flags, cpu);
 		folio_batch_move_lru(fbatch, lru_move_tail);
-		local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
+		qpw_unlock_irqrestore(&cpu_fbatches.lock_irq, flags, cpu);
 	}
 
 	fbatch = &fbatches->lru_deactivate_file;
@@ -733,9 +730,9 @@ void folio_mark_lazyfree(struct folio *f
 
 void lru_add_drain(void)
 {
-	local_lock(&cpu_fbatches.lock);
+	local_qpw_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
-	local_unlock(&cpu_fbatches.lock);
+	local_qpw_unlock(&cpu_fbatches.lock);
 	mlock_drain_local();
 }
 
@@ -745,30 +742,30 @@ void lru_add_drain(void)
  * the same cpu. It shouldn't be a problem in !SMP case since
  * the core is only one and the locks will disable preemption.
  */
-static void lru_add_mm_drain(void)
+static void lru_add_mm_drain(int cpu)
 {
-	local_lock(&cpu_fbatches.lock);
-	lru_add_drain_cpu(smp_processor_id());
-	local_unlock(&cpu_fbatches.lock);
-	mlock_drain_local();
+	qpw_lock(&cpu_fbatches.lock, cpu);
+	lru_add_drain_cpu(cpu);
+	qpw_unlock(&cpu_fbatches.lock, cpu);
+	mlock_drain_cpu(cpu);
 }
 
 void lru_add_drain_cpu_zone(struct zone *zone)
 {
-	local_lock(&cpu_fbatches.lock);
+	local_qpw_lock(&cpu_fbatches.lock);
 	lru_add_drain_cpu(smp_processor_id());
 	drain_local_pages(zone);
-	local_unlock(&cpu_fbatches.lock);
+	local_qpw_unlock(&cpu_fbatches.lock);
 	mlock_drain_local();
 }
 
 #ifdef CONFIG_SMP
 
-static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
+static DEFINE_PER_CPU(struct qpw_struct, lru_add_drain_qpw);
 
-static void lru_add_drain_per_cpu(struct work_struct *dummy)
+static void lru_add_drain_per_cpu(struct work_struct *w)
 {
-	lru_add_mm_drain();
+	lru_add_mm_drain(qpw_get_cpu(w));
 }
 
 static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work);
@@ -883,12 +880,12 @@ static inline void __lru_add_drain_all(b
 	cpumask_clear(&has_mm_work);
 	cpumask_clear(&has_bh_work);
 	for_each_online_cpu(cpu) {
-		struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
+		struct qpw_struct *mm_qpw = &per_cpu(lru_add_drain_qpw, cpu);
 		struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
 
 		if (cpu_needs_mm_drain(cpu)) {
-			INIT_WORK(mm_work, lru_add_drain_per_cpu);
-			queue_work_on(cpu, mm_percpu_wq, mm_work);
+			INIT_QPW(mm_qpw, lru_add_drain_per_cpu, cpu);
+			queue_percpu_work_on(cpu, mm_percpu_wq, mm_qpw);
 			__cpumask_set_cpu(cpu, &has_mm_work);
 		}
 
@@ -900,7 +897,7 @@ static inline void __lru_add_drain_all(b
 	}
 
 	for_each_cpu(cpu, &has_mm_work)
-		flush_work(&per_cpu(lru_add_drain_work, cpu));
+		flush_percpu_work(&per_cpu(lru_add_drain_qpw, cpu));
 
 	for_each_cpu(cpu, &has_bh_work)
 		flush_work(&per_cpu(bh_add_drain_work, cpu));
@@ -950,7 +947,7 @@ void lru_cache_disable(void)
 #ifdef CONFIG_SMP
 	__lru_add_drain_all(true);
 #else
-	lru_add_mm_drain();
+	lru_add_mm_drain(smp_processor_id());
 	invalidate_bh_lrus_cpu();
 #endif
 }
@@ -1124,6 +1121,7 @@ static const struct ctl_table swap_sysct
 void __init swap_setup(void)
 {
 	unsigned long megs = PAGES_TO_MB(totalram_pages());
+	unsigned int cpu;
 
 	/* Use a smaller cluster for small-memory machines */
 	if (megs < 16)
@@ -1136,4 +1134,11 @@ void __init swap_setup(void)
 	 */
 
 	register_sysctl_init("vm", swap_sysctl_table);
+
+	for_each_possible_cpu(cpu) {
+		struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
+
+		qpw_lock_init(&fbatches->lock);
+		qpw_lock_init(&fbatches->lock_irq);
+	}
 }
Index: linux/mm/internal.h
===================================================================
--- linux.orig/mm/internal.h
+++ linux/mm/internal.h
@@ -1140,10 +1140,12 @@ static inline void munlock_vma_folio(str
 		munlock_folio(folio);
 }
 
+int __init mlock_init(void);
 void mlock_new_folio(struct folio *folio);
 bool need_mlock_drain(int cpu);
 void mlock_drain_local(void);
-void mlock_drain_remote(int cpu);
+void mlock_drain_cpu(int cpu);
+void mlock_drain_offline(int cpu);
 
 extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
 
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -6285,7 +6285,7 @@ static int page_alloc_cpu_dead(unsigned
 	struct zone *zone;
 
 	lru_add_drain_cpu(cpu);
-	mlock_drain_remote(cpu);
+	mlock_drain_offline(cpu);
 	drain_pages(cpu);
 
 	/*




^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 5/5] slub: apply new queue_percpu_work_on() interface
  2026-03-02 15:49 [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Marcelo Tosatti
                   ` (3 preceding siblings ...)
  2026-03-02 15:49 ` [PATCH v2 4/5] swap: apply new queue_percpu_work_on() interface Marcelo Tosatti
@ 2026-03-02 15:49 ` Marcelo Tosatti
  2026-03-03 11:15 ` [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Frederic Weisbecker
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2026-03-02 15:49 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Leonardo Bras, Thomas Gleixner, Waiman Long, Boqun Feng,
	Frederic Weisbecker, Marcelo Tosatti

Make use of the new qpw_{un,}lock*() and queue_percpu_work_on()
interface to improve performance & latency.

For functions that may be scheduled on a different cpu, replace
local_{un,}lock*() by qpw_{un,}lock*(), and replace schedule_work_on()
by queue_percpu_work_on(). Likewise, flush_work() is replaced by
flush_percpu_work().

This change requires allocating qpw_structs instead of work_structs,
and changing the parameters of a few functions to include the cpu
parameter.

This should bring no relevant performance impact on non-QPW kernels:
for functions that may be scheduled on a different cpu, the
local_*lock's this_cpu_ptr() simply becomes a
per_cpu_ptr(smp_processor_id()).

Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 mm/slub.c |  146 +++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 74 insertions(+), 72 deletions(-)

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -50,6 +50,7 @@
 #include <linux/irq_work.h>
 #include <linux/kprobes.h>
 #include <linux/debugfs.h>
+#include <linux/qpw.h>
 #include <trace/events/kmem.h>
 
 #include "internal.h"
@@ -129,7 +130,7 @@
  *   For debug caches, all allocations are forced to go through a list_lock
  *   protected region to serialize against concurrent validation.
  *
- *   cpu_sheaves->lock (local_trylock)
+ *   cpu_sheaves->lock (qpw_trylock)
  *
  *   This lock protects fastpath operations on the percpu sheaves. On !RT it
  *   only disables preemption and does no atomic operations. As long as the main
@@ -157,7 +158,7 @@
  *   Interrupts are disabled as part of list_lock or barn lock operations, or
  *   around the slab_lock operation, in order to make the slab allocator safe
  *   to use in the context of an irq.
- *   Preemption is disabled as part of local_trylock operations.
+ *   Preemption is disabled as part of qpw_trylock operations.
  *   kmalloc_nolock() and kfree_nolock() are safe in NMI context but see
  *   their limitations.
  *
@@ -418,7 +419,7 @@ struct slab_sheaf {
 };
 
 struct slub_percpu_sheaves {
-	local_trylock_t lock;
+	qpw_trylock_t lock;
 	struct slab_sheaf *main; /* never NULL when unlocked */
 	struct slab_sheaf *spare; /* empty or full, may be NULL */
 	struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
@@ -480,7 +481,7 @@ static nodemask_t slab_nodes;
 static struct workqueue_struct *flushwq;
 
 struct slub_flush_work {
-	struct work_struct work;
+	struct qpw_struct qpw;
 	struct kmem_cache *s;
 	bool skip;
 };
@@ -2849,16 +2850,14 @@ static void __kmem_cache_free_bulk(struc
  *
  * Returns how many objects are remaining to be flushed
  */
-static unsigned int __sheaf_flush_main_batch(struct kmem_cache *s)
+static unsigned int __sheaf_flush_main_batch(struct kmem_cache *s, int cpu)
 {
 	struct slub_percpu_sheaves *pcs;
 	unsigned int batch, remaining;
 	void *objects[PCS_BATCH_MAX];
 	struct slab_sheaf *sheaf;
 
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
-
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 	sheaf = pcs->main;
 
 	batch = min(PCS_BATCH_MAX, sheaf->size);
@@ -2868,7 +2867,7 @@ static unsigned int __sheaf_flush_main_b
 
 	remaining = sheaf->size;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
 
 	__kmem_cache_free_bulk(s, batch, &objects[0]);
 
@@ -2877,14 +2876,14 @@ static unsigned int __sheaf_flush_main_b
 	return remaining;
 }
 
-static void sheaf_flush_main(struct kmem_cache *s)
+static void sheaf_flush_main(struct kmem_cache *s, int cpu)
 {
 	unsigned int remaining;
 
 	do {
-		local_lock(&s->cpu_sheaves->lock);
+		qpw_lock(&s->cpu_sheaves->lock, cpu);
 
-		remaining = __sheaf_flush_main_batch(s);
+		remaining = __sheaf_flush_main_batch(s, cpu);
 
 	} while (remaining);
 }
@@ -2898,11 +2897,13 @@ static bool sheaf_try_flush_main(struct
 	bool ret = false;
 
 	do {
-		if (!local_trylock(&s->cpu_sheaves->lock))
+		if (!local_qpw_trylock(&s->cpu_sheaves->lock))
 			return ret;
 
 		ret = true;
-		remaining = __sheaf_flush_main_batch(s);
+
+		lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+		remaining = __sheaf_flush_main_batch(s, smp_processor_id());
 
 	} while (remaining);
 
@@ -2979,13 +2980,13 @@ static void rcu_free_sheaf_nobarn(struct
  * flushing operations are rare so let's keep it simple and flush to slabs
  * directly, skipping the barn
  */
-static void pcs_flush_all(struct kmem_cache *s)
+static void pcs_flush_all(struct kmem_cache *s, int cpu)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *spare, *rcu_free;
 
-	local_lock(&s->cpu_sheaves->lock);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	qpw_lock(&s->cpu_sheaves->lock, cpu);
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
 	spare = pcs->spare;
 	pcs->spare = NULL;
@@ -2993,7 +2994,7 @@ static void pcs_flush_all(struct kmem_ca
 	rcu_free = pcs->rcu_free;
 	pcs->rcu_free = NULL;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
 
 	if (spare) {
 		sheaf_flush_unused(s, spare);
@@ -3003,7 +3004,7 @@ static void pcs_flush_all(struct kmem_ca
 	if (rcu_free)
 		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
 
-	sheaf_flush_main(s);
+	sheaf_flush_main(s, cpu);
 }
 
 static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
@@ -3953,13 +3954,13 @@ static void flush_cpu_sheaves(struct wor
 {
 	struct kmem_cache *s;
 	struct slub_flush_work *sfw;
+	int cpu = qpw_get_cpu(w);
 
-	sfw = container_of(w, struct slub_flush_work, work);
-
+	sfw = &per_cpu(slub_flush, cpu);
 	s = sfw->s;
 
 	if (cache_has_sheaves(s))
-		pcs_flush_all(s);
+		pcs_flush_all(s, cpu);
 }
 
 static void flush_all_cpus_locked(struct kmem_cache *s)
@@ -3976,17 +3977,17 @@ static void flush_all_cpus_locked(struct
 			sfw->skip = true;
 			continue;
 		}
-		INIT_WORK(&sfw->work, flush_cpu_sheaves);
+		INIT_QPW(&sfw->qpw, flush_cpu_sheaves, cpu);
 		sfw->skip = false;
 		sfw->s = s;
-		queue_work_on(cpu, flushwq, &sfw->work);
+		queue_percpu_work_on(cpu, flushwq, &sfw->qpw);
 	}
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
 		if (sfw->skip)
 			continue;
-		flush_work(&sfw->work);
+		flush_percpu_work(&sfw->qpw);
 	}
 
 	mutex_unlock(&flush_lock);
@@ -4005,17 +4006,18 @@ static void flush_rcu_sheaf(struct work_
 	struct slab_sheaf *rcu_free;
 	struct slub_flush_work *sfw;
 	struct kmem_cache *s;
+	int cpu = qpw_get_cpu(w);
 
-	sfw = container_of(w, struct slub_flush_work, work);
+	sfw = &per_cpu(slub_flush, cpu);
 	s = sfw->s;
 
-	local_lock(&s->cpu_sheaves->lock);
-	pcs = this_cpu_ptr(s->cpu_sheaves);
+	qpw_lock(&s->cpu_sheaves->lock, cpu);
+	pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
 	rcu_free = pcs->rcu_free;
 	pcs->rcu_free = NULL;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	qpw_unlock(&s->cpu_sheaves->lock, cpu);
 
 	if (rcu_free)
 		call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
@@ -4040,14 +4042,14 @@ void flush_rcu_sheaves_on_cache(struct k
 		 * sure the __kfree_rcu_sheaf() finished its call_rcu()
 		 */
 
-		INIT_WORK(&sfw->work, flush_rcu_sheaf);
+		INIT_QPW(&sfw->qpw, flush_rcu_sheaf, cpu);
 		sfw->s = s;
-		queue_work_on(cpu, flushwq, &sfw->work);
+		queue_percpu_work_on(cpu, flushwq, &sfw->qpw);
 	}
 
 	for_each_online_cpu(cpu) {
 		sfw = &per_cpu(slub_flush, cpu);
-		flush_work(&sfw->work);
+		flush_percpu_work(&sfw->qpw);
 	}
 
 	mutex_unlock(&flush_lock);
@@ -4555,11 +4557,11 @@ __pcs_replace_empty_main(struct kmem_cac
 	struct node_barn *barn;
 	bool can_alloc;
 
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);
 
 	/* Bootstrap or debug cache, back off */
 	if (unlikely(!cache_has_sheaves(s))) {
-		local_unlock(&s->cpu_sheaves->lock);
+		local_qpw_unlock(&s->cpu_sheaves->lock);
 		return NULL;
 	}
 
@@ -4570,7 +4572,7 @@ __pcs_replace_empty_main(struct kmem_cac
 
 	barn = get_barn(s);
 	if (!barn) {
-		local_unlock(&s->cpu_sheaves->lock);
+		local_qpw_unlock(&s->cpu_sheaves->lock);
 		return NULL;
 	}
 
@@ -4596,7 +4598,7 @@ __pcs_replace_empty_main(struct kmem_cac
 		}
 	}
 
-	local_unlock(&s->cpu_sheaves->lock);
+	local_qpw_unlock(&s->cpu_sheaves->lock);
 
 	if (!can_alloc)
 		return NULL;
@@ -4622,7 +4624,7 @@ __pcs_replace_empty_main(struct kmem_cac
 	 * we can reach here only when gfpflags_allow_blocking
 	 * so this must not be an irq
 	 */
-	local_lock(&s->cpu_sheaves->lock);
+	local_qpw_lock(&s->cpu_sheaves->lock);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	/*
@@ -4699,7 +4701,7 @@ void *alloc_from_pcs(struct kmem_cache *
 		return NULL;
 	}
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!local_qpw_trylock(&s->cpu_sheaves->lock))
 		return NULL;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -4719,7 +4721,7 @@ void *alloc_from_pcs(struct kmem_cache *
 		 * the current allocation or previous freeing process.
 		 */
 		if (page_to_nid(virt_to_page(object)) != node) {
-			local_unlock(&s->cpu_sheaves->lock);
+			local_qpw_unlock(&s->cpu_sheaves->lock);
 			stat(s, ALLOC_NODE_MISMATCH);
 			return NULL;
 		}
@@ -4727,7 +4729,7 @@ void *alloc_from_pcs(struct kmem_cache *
 
 	pcs->main->size--;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	local_qpw_unlock(&s->cpu_sheaves->lock);
 
 	stat(s, ALLOC_FASTPATH);
 
@@ -4744,7 +4746,7 @@ unsigned int alloc_from_pcs_bulk(struct
 	unsigned int batch;
 
 next_batch:
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!local_qpw_trylock(&s->cpu_sheaves->lock))
 		return allocated;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -4755,7 +4757,7 @@ next_batch:
 		struct node_barn *barn;
 
 		if (unlikely(!cache_has_sheaves(s))) {
-			local_unlock(&s->cpu_sheaves->lock);
+			local_qpw_unlock(&s->cpu_sheaves->lock);
 			return allocated;
 		}
 
@@ -4766,7 +4768,7 @@ next_batch:
 
 		barn = get_barn(s);
 		if (!barn) {
-			local_unlock(&s->cpu_sheaves->lock);
+			local_qpw_unlock(&s->cpu_sheaves->lock);
 			return allocated;
 		}
 
@@ -4781,7 +4783,7 @@ next_batch:
 
 		stat(s, BARN_GET_FAIL);
 
-		local_unlock(&s->cpu_sheaves->lock);
+		local_qpw_unlock(&s->cpu_sheaves->lock);
 
 		/*
 		 * Once full sheaves in barn are depleted, let the bulk
@@ -4799,7 +4801,7 @@ do_alloc:
 	main->size -= batch;
 	memcpy(p, main->objects + main->size, batch * sizeof(void *));
 
-	local_unlock(&s->cpu_sheaves->lock);
+	local_qpw_unlock(&s->cpu_sheaves->lock);
 
 	stat_add(s, ALLOC_FASTPATH, batch);
 
@@ -4978,7 +4980,7 @@ kmem_cache_prefill_sheaf(struct kmem_cac
 		return sheaf;
 	}
 
-	local_lock(&s->cpu_sheaves->lock);
+	local_qpw_lock(&s->cpu_sheaves->lock);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (pcs->spare) {
@@ -4997,7 +4999,7 @@ kmem_cache_prefill_sheaf(struct kmem_cac
 			stat(s, BARN_GET_FAIL);
 	}
 
-	local_unlock(&s->cpu_sheaves->lock);
+	local_qpw_unlock(&s->cpu_sheaves->lock);
 
 
 	if (!sheaf)
@@ -5041,7 +5043,7 @@ void kmem_cache_return_sheaf(struct kmem
 		return;
 	}
 
-	local_lock(&s->cpu_sheaves->lock);
+	local_qpw_lock(&s->cpu_sheaves->lock);
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 	barn = get_barn(s);
 
@@ -5051,7 +5053,7 @@ void kmem_cache_return_sheaf(struct kmem
 		stat(s, SHEAF_RETURN_FAST);
 	}
 
-	local_unlock(&s->cpu_sheaves->lock);
+	local_qpw_unlock(&s->cpu_sheaves->lock);
 
 	if (!sheaf)
 		return;
@@ -5581,7 +5583,7 @@ static void __pcs_install_empty_sheaf(st
 		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty,
 		struct node_barn *barn)
 {
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);
 
 	/* This is what we expect to find if nobody interrupted us. */
 	if (likely(!pcs->spare)) {
@@ -5618,9 +5620,9 @@ static void __pcs_install_empty_sheaf(st
 /*
  * Replace the full main sheaf with a (at least partially) empty sheaf.
  *
- * Must be called with the cpu_sheaves local lock locked. If successful, returns
- * the pcs pointer and the local lock locked (possibly on a different cpu than
- * initially called). If not successful, returns NULL and the local lock
+ * Must be called with the cpu_sheaves qpw lock locked. If successful, returns
+ * the pcs pointer and the qpw lock locked (possibly on a different cpu than
+ * initially called). If not successful, returns NULL and the qpw lock
  * unlocked.
  */
 static struct slub_percpu_sheaves *
@@ -5632,17 +5634,17 @@ __pcs_replace_full_main(struct kmem_cach
 	bool put_fail;
 
 restart:
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	qpw_lockdep_assert_held(&s->cpu_sheaves->lock);
 
 	/* Bootstrap or debug cache, back off */
 	if (unlikely(!cache_has_sheaves(s))) {
-		local_unlock(&s->cpu_sheaves->lock);
+		local_qpw_unlock(&s->cpu_sheaves->lock);
 		return NULL;
 	}
 
 	barn = get_barn(s);
 	if (!barn) {
-		local_unlock(&s->cpu_sheaves->lock);
+		local_qpw_unlock(&s->cpu_sheaves->lock);
 		return NULL;
 	}
 
@@ -5679,7 +5681,7 @@ restart:
 		stat(s, BARN_PUT_FAIL);
 
 		pcs->spare = NULL;
-		local_unlock(&s->cpu_sheaves->lock);
+		local_qpw_unlock(&s->cpu_sheaves->lock);
 
 		sheaf_flush_unused(s, to_flush);
 		empty = to_flush;
@@ -5695,7 +5697,7 @@ restart:
 	put_fail = true;
 
 alloc_empty:
-	local_unlock(&s->cpu_sheaves->lock);
+	local_qpw_unlock(&s->cpu_sheaves->lock);
 
 	/*
 	 * alloc_empty_sheaf() doesn't support !allow_spin and it's
@@ -5715,7 +5717,7 @@ alloc_empty:
 	if (!sheaf_try_flush_main(s))
 		return NULL;
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!local_qpw_trylock(&s->cpu_sheaves->lock))
 		return NULL;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -5731,7 +5733,7 @@ alloc_empty:
 	return pcs;
 
 got_empty:
-	if (!local_trylock(&s->cpu_sheaves->lock)) {
+	if (!local_qpw_trylock(&s->cpu_sheaves->lock)) {
 		barn_put_empty_sheaf(barn, empty);
 		return NULL;
 	}
@@ -5751,7 +5753,7 @@ bool free_to_pcs(struct kmem_cache *s, v
 {
 	struct slub_percpu_sheaves *pcs;
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!local_qpw_trylock(&s->cpu_sheaves->lock))
 		return false;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -5765,7 +5767,7 @@ bool free_to_pcs(struct kmem_cache *s, v
 
 	pcs->main->objects[pcs->main->size++] = object;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	local_qpw_unlock(&s->cpu_sheaves->lock);
 
 	stat(s, FREE_FASTPATH);
 
@@ -5855,7 +5857,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache
 
 	lock_map_acquire_try(&kfree_rcu_sheaf_map);
 
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!local_qpw_trylock(&s->cpu_sheaves->lock))
 		goto fail;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -5867,7 +5869,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache
 
 		/* Bootstrap or debug cache, fall back */
 		if (unlikely(!cache_has_sheaves(s))) {
-			local_unlock(&s->cpu_sheaves->lock);
+			local_qpw_unlock(&s->cpu_sheaves->lock);
 			goto fail;
 		}
 
@@ -5879,7 +5881,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache
 
 		barn = get_barn(s);
 		if (!barn) {
-			local_unlock(&s->cpu_sheaves->lock);
+			local_qpw_unlock(&s->cpu_sheaves->lock);
 			goto fail;
 		}
 
@@ -5890,14 +5892,14 @@ bool __kfree_rcu_sheaf(struct kmem_cache
 			goto do_free;
 		}
 
-		local_unlock(&s->cpu_sheaves->lock);
+		local_qpw_unlock(&s->cpu_sheaves->lock);
 
 		empty = alloc_empty_sheaf(s, GFP_NOWAIT);
 
 		if (!empty)
 			goto fail;
 
-		if (!local_trylock(&s->cpu_sheaves->lock)) {
+		if (!local_qpw_trylock(&s->cpu_sheaves->lock)) {
 			barn_put_empty_sheaf(barn, empty);
 			goto fail;
 		}
@@ -5934,7 +5936,7 @@ do_free:
 	if (rcu_sheaf)
 		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
 
-	local_unlock(&s->cpu_sheaves->lock);
+	local_qpw_unlock(&s->cpu_sheaves->lock);
 
 	stat(s, FREE_RCU_SHEAF);
 	lock_map_release(&kfree_rcu_sheaf_map);
@@ -5990,7 +5992,7 @@ next_remote_batch:
 		goto flush_remote;
 
 next_batch:
-	if (!local_trylock(&s->cpu_sheaves->lock))
+	if (!local_qpw_trylock(&s->cpu_sheaves->lock))
 		goto fallback;
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
@@ -6033,7 +6035,7 @@ do_free:
 	memcpy(main->objects + main->size, p, batch * sizeof(void *));
 	main->size += batch;
 
-	local_unlock(&s->cpu_sheaves->lock);
+	local_qpw_unlock(&s->cpu_sheaves->lock);
 
 	stat_add(s, FREE_FASTPATH, batch);
 
@@ -6049,7 +6051,7 @@ do_free:
 	return;
 
 no_empty:
-	local_unlock(&s->cpu_sheaves->lock);
+	local_qpw_unlock(&s->cpu_sheaves->lock);
 
 	/*
 	 * if we depleted all empty sheaves in the barn or there are too
@@ -7454,7 +7456,7 @@ static int init_percpu_sheaves(struct km
 
 		pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
 
-		local_trylock_init(&pcs->lock);
+		qpw_trylock_init(&pcs->lock);
 
 		/*
 		 * Bootstrap sheaf has zero size so fast-path allocation fails.




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
  2026-03-02 15:49 [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Marcelo Tosatti
                   ` (4 preceding siblings ...)
  2026-03-02 15:49 ` [PATCH v2 5/5] slub: " Marcelo Tosatti
@ 2026-03-03 11:15 ` Frederic Weisbecker
  2026-03-08 18:02   ` Leonardo Bras
  2026-03-03 12:07 ` Vlastimil Babka (SUSE)
  2026-03-05 16:55 ` Frederic Weisbecker
  7 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2026-03-03 11:15 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feun

On Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting of local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.
> 
> The idea:
> Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> In this case, instead of scheduling work on a remote cpu, it should
> be safe to grab that remote cpu's per-cpu spinlock and run the required
> work locally. That major cost, which is un/locking in every local function,
> already happens in PREEMPT_RT.
> 
> Also, there is no need to worry about extra cache bouncing:
> The cacheline invalidation already happens due to schedule_work_on().
> 
> This will avoid schedule_work_on(), and thus avoid scheduling-out an
> RT workload.
> 
> Proposed solution:
> A new interface called Queue PerCPU Work (QPW), which should replace
> Work Queue in the above mentioned use case.
> 
> If CONFIG_QPW=n this interface just wraps the current
> local_locks + WorkQueue behavior, so no expected change in runtime.
> 
> If CONFIG_QPW=y, and qpw kernel boot option =1, 
> queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> and perform work on it locally. This is possible because on 
> functions that can be used for performing remote work on remote 
> per-cpu structures, the local_lock (which is already
> a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> is able to get the per_cpu spinlock() for the cpu passed as parameter.

Ok I'm slowly considering this as a more comfortable solution than the
flush before userspace. Despite it being perhaps a bit more complicated,
remote handling of housekeeping work is more surprise-free against all
the possible nohz_full usecases that we are having a hard time envisioning.

Reviewing this in more detail now.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs



* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-02 15:49 ` [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
@ 2026-03-03 12:03   ` Vlastimil Babka (SUSE)
  2026-03-03 16:02     ` Marcelo Tosatti
  2026-03-11  7:58   ` Vlastimil Babka (SUSE)
  2026-03-13 21:55   ` Frederic Weisbecker
  2 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-03 12:03 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feun, Frederic Weisbecker

On 3/2/26 16:49, Marcelo Tosatti wrote:
> +#define local_qpw_lock(lock)								\
> +	do {										\
> +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> +			migrate_disable();						\

Have you considered using migrate_disable() on PREEMPT_RT and
preempt_disable() on !PREEMPT_RT since it's cheaper? It's what the pcp
locking in mm/page_alloc.c does, for that reason. It should reduce the
overhead with qpw=1 on !PREEMPT_RT.

> +			spin_lock(this_cpu_ptr(lock.sl));				\
> +		} else									\
> +			local_lock(lock.ll);						\
> +	} while (0)
> +
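A userspace sketch of the suggested guard (hedged: qpw_task_pin()/qpw_task_unpin() are invented names, and CONFIG_PREEMPT_RT is modeled as a runtime flag where the real change would be compile-time):

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of the suggestion: pin with migrate_disable() only on
 * PREEMPT_RT (where the spinlock can sleep) and with the cheaper
 * preempt_disable() otherwise, mirroring the pcp locking in
 * mm/page_alloc.c. The counters stand in for the real primitives. */
static const bool preempt_rt = false;	/* stands in for CONFIG_PREEMPT_RT */
static int preempt_count;
static int migrate_disable_count;

static void preempt_disable(void)  { preempt_count++; }
static void preempt_enable(void)   { preempt_count--; }
static void migrate_disable(void)  { migrate_disable_count++; }
static void migrate_enable(void)   { migrate_disable_count--; }

/* what local_qpw_lock()'s guard could reduce to */
static void qpw_task_pin(void)
{
	if (preempt_rt)
		migrate_disable();	/* rt spinlock may sleep */
	else
		preempt_disable();	/* lock cannot sleep: cheaper pin suffices */
}

static void qpw_task_unpin(void)
{
	if (preempt_rt)
		migrate_enable();
	else
		preempt_enable();
}
```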



* Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
  2026-03-02 15:49 [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Marcelo Tosatti
                   ` (5 preceding siblings ...)
  2026-03-03 11:15 ` [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Frederic Weisbecker
@ 2026-03-03 12:07 ` Vlastimil Babka (SUSE)
  2026-03-05 16:55 ` Frederic Weisbecker
  7 siblings, 0 replies; 32+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-03 12:07 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feun, Frederic Weisbecker

On 3/2/26 16:49, Marcelo Tosatti wrote:
> Proposed solution:
> A new interface called Queue PerCPU Work (QPW), which should replace
> Work Queue in the above mentioned use case.
> 
> If CONFIG_QPW=n this interface just wraps the current
> local_locks + WorkQueue behavior, so no expected change in runtime.
> 
> If CONFIG_QPW=y, and qpw kernel boot option =1, 
> queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> and perform work on it locally. This is possible because on 
> functions that can be used for performing remote work on remote 
> per-cpu structures, the local_lock (which is already
> a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> is able to get the per_cpu spinlock() for the cpu passed as parameter.

A process thing: several patches have Leo's S-o-b: but not From:
You probably need his From: and your Co-developed-by: or some other variant,
see Documentation/process/submitting-patches.rst section "When to use
Acked-by:, Cc:, and Co-developed-by:"



* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-03 12:03   ` Vlastimil Babka (SUSE)
@ 2026-03-03 16:02     ` Marcelo Tosatti
  2026-03-08 18:00       ` Leonardo Bras
  0 siblings, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2026-03-03 16:02 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: linux-kernel, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner, Waiman Long,
	Boqun Feun, Frederic Weisbecker

On Tue, Mar 03, 2026 at 01:03:36PM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/2/26 16:49, Marcelo Tosatti wrote:
> > +#define local_qpw_lock(lock)								\
> > +	do {										\
> > +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> > +			migrate_disable();						\
> 
> Have you considered using migrate_disable() on PREEMPT_RT and
> preempt_disable() on !PREEMPT_RT since it's cheaper? It's what the pcp
> locking in mm/page_alloc.c does, for that reason. It should reduce the
> overhead with qpw=1 on !PREEMPT_RT.

migrate_disable:
Patched kernel, CONFIG_QPW=y, qpw=1:    192 cycles

preempt_disable:
[   65.497223] kmalloc_bench: Avg cycles per kmalloc: 184 cycles

I tried it before, but it was crashing for some reason which I didn't
look into (perhaps PREEMPT_RT was enabled).

Will change this for the next iteration, thanks.




* Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
  2026-03-02 15:49 [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Marcelo Tosatti
                   ` (6 preceding siblings ...)
  2026-03-03 12:07 ` Vlastimil Babka (SUSE)
@ 2026-03-05 16:55 ` Frederic Weisbecker
  2026-03-06  1:47   ` Marcelo Tosatti
  2026-03-10 17:12   ` Marcelo Tosatti
  7 siblings, 2 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2026-03-05 16:55 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feun

On Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting of local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.
> 
> The idea:
> Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> In this case, instead of scheduling work on a remote cpu, it should
> be safe to grab that remote cpu's per-cpu spinlock and run the required
> work locally. That major cost, which is un/locking in every local function,
> already happens in PREEMPT_RT.
> 
> Also, there is no need to worry about extra cache bouncing:
> The cacheline invalidation already happens due to schedule_work_on().
> 
> This will avoid schedule_work_on(), and thus avoid scheduling-out an
> RT workload.
> 
> Proposed solution:
> A new interface called Queue PerCPU Work (QPW), which should replace
> Work Queue in the above mentioned use case.
> 
> If CONFIG_QPW=n this interface just wraps the current
> local_locks + WorkQueue behavior, so no expected change in runtime.
> 
> If CONFIG_QPW=y, and qpw kernel boot option =1, 
> queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> and perform work on it locally. This is possible because on 
> functions that can be used for performing remote work on remote 
> per-cpu structures, the local_lock (which is already
> a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> is able to get the per_cpu spinlock() for the cpu passed as parameter.

So let me summarize the possible design solutions, on top of our discussions,
so we can compare:

1) Never queue remotely but always queue locally and execute on userspace
   return via task work.

   Pros:
         - Simple and easy to maintain.

   Cons:
         - Need a case by case handling.

	 - Might be suitable for full userspace applications but not for
           some HPC usecases. In an ideal world MPI is fully implemented in
           userspace but that doesn't appear to be the case.

2) Queue locally the workqueue right away or do it remotely (if it's
   really necessary) if the isolated CPU is in userspace, otherwise queue
   it for execution on return to kernel. The work will be handled by preemption
   to a worker or by a workqueue flush on return to userspace.

   Pros:
        - The local queue handling is simple.

   Cons:
        - The remote queue must synchronize with return to userspace and
	  eventually postpone to return to kernel if the target is in userspace.
	  Also it may need to differentiate IRQs and syscalls.

        - Therefore still involves some case-by-case handling eventually.
   
        - Flushing the global workqueues to avoid deadlocks is unadvised as shown
          in the comment above flush_scheduled_work(). It even triggers a
          warning. Significant efforts have been put to convert all the existing
	  users. It's not impossible to sell in our case because we shouldn't
	  hold a lock upon return to userspace. But that will restore a new
	  dangerous API.

        - Queueing the workqueue / flushing involves a context switch which
          induces more noise (eg: tick restart)
	  
        - As above, probably not suitable for HPC.

3) QPW: Handle the work remotely

   Pros:
        - Works on all cases, without any surprise.

   Cons:
        - Introduce new locking scheme to maintain and debug.

        - Needs case by case handling.

Thoughts?

-- 
Frederic Weisbecker
SUSE Labs



* Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
  2026-03-05 16:55 ` Frederic Weisbecker
@ 2026-03-06  1:47   ` Marcelo Tosatti
  2026-03-10 21:34     ` Frederic Weisbecker
  2026-03-10 17:12   ` Marcelo Tosatti
  1 sibling, 1 reply; 32+ messages in thread
From: Marcelo Tosatti @ 2026-03-06  1:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feun

On Thu, Mar 05, 2026 at 05:55:12PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti wrote:
> > The problem:
> > Some places in the kernel implement a parallel programming strategy
> > consisting of local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with remote requests is
> > sure to introduce unexpected deadline misses.
> > 
> > The idea:
> > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > In this case, instead of scheduling work on a remote cpu, it should
> > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > work locally. That major cost, which is un/locking in every local function,
> > already happens in PREEMPT_RT.
> > 
> > Also, there is no need to worry about extra cache bouncing:
> > The cacheline invalidation already happens due to schedule_work_on().
> > 
> > This will avoid schedule_work_on(), and thus avoid scheduling-out an
> > RT workload.
> > 
> > Proposed solution:
> > A new interface called Queue PerCPU Work (QPW), which should replace
> > Work Queue in the above mentioned use case.
> > 
> > If CONFIG_QPW=n this interface just wraps the current
> > local_locks + WorkQueue behavior, so no expected change in runtime.
> > 
> > If CONFIG_QPW=y, and qpw kernel boot option =1, 
> > queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> > and perform work on it locally. This is possible because on 
> > functions that can be used for performing remote work on remote 
> > per-cpu structures, the local_lock (which is already
> > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> 
> So let me summarize what are the possible design solutions, on top of our discussions,
> so we can compare:

I find this summary difficult to comprehend. The way I see it is:

A certain class of data structures can be manipulated only by each individual CPU (the
per-CPU caches), since they lack proper locks for such data to be
manipulated by remote CPUs.

There are certain operations which require such data to be manipulated,
therefore work is queued to execute on the owner CPUs.

> 
> 1) Never queue remotely but always queue locally and execute on userspace

When you say "queue locally", do you mean to queue the data structure 
manipulation to happen on return to userspace of the owner CPU ?

What if it does not return to userspace ? (or takes a long time to return 
to userspace?).

>    return via task work.
> 
>    Pros:
>          - Simple and easy to maintain.
> 
>    Cons:
>          - Need a case by case handling.
> 
> 	 - Might be suitable for full userspace applications but not for
>            some HPC usecases. In the best world MPI is fully implemented in
>            userspace but that doesn't appear to be the case.
> 
> 2) Queue locally the workqueue right away or do it remotely (if it's
>    really necessary) if the isolated CPU is in userspace, otherwise queue
>    it for execution on return to kernel. The work will be handled by preemption
>    to a worker or by a workqueue flush on return to userspace.
> 
>    Pros:
>         - The local queue handling is simple.
> 
>    Cons:
>         - The remote queue must synchronize with return to userspace and
> 	  eventually postpone to return to kernel if the target is in userspace.
> 	  Also it may need to differentiate IRQs and syscalls.
> 
>         - Therefore still involve some case by case handling eventually.
>    
>         - Flushing the global workqueues to avoid deadlocks is unadvised as shown
>           in the comment above flush_scheduled_work(). It even triggers a
>           warning. Significant efforts have been put to convert all the existing
> 	  users. It's not impossible to sell in our case because we shouldn't
> 	  hold a lock upon return to userspace. But that will restore a new
> 	  dangerous API.
> 
>         - Queueing the workqueue / flushing involves a context switch which
>           induces more noise (eg: tick restart)
> 	  
>         - As above, probably not suitable for HPC.
> 
> 3) QPW: Handle the work remotely
> 
>    Pros:
>         - Works on all cases, without any surprise.
> 
>    Cons:
>         - Introduce new locking scheme to maintain and debug.
> 
>         - Needs case by case handling.
> 
> Thoughts?
> 
> -- 
> Frederic Weisbecker
> SUSE Labs

It's hard for me to parse your concise summary (perhaps it could be more
verbose).

Anyway, one thought is to use some sort of SRCU-type protection on the
per-CPU caches.
But that adds cost as well (compared to non-SRCU), which seems
comparable to the cost of adding per-CPU spinlocks.




* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-03 16:02     ` Marcelo Tosatti
@ 2026-03-08 18:00       ` Leonardo Bras
  2026-03-09 10:14         ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 32+ messages in thread
From: Leonardo Bras @ 2026-03-08 18:00 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Leonardo Bras, Vlastimil Babka (SUSE), linux-kernel, linux-mm,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Hyeonggon Yoo, Thomas Gleixner,
	Waiman Long, Boqun Feun, Frederic Weisbecker

On Tue, Mar 03, 2026 at 01:02:13PM -0300, Marcelo Tosatti wrote:
> On Tue, Mar 03, 2026 at 01:03:36PM +0100, Vlastimil Babka (SUSE) wrote:
> > On 3/2/26 16:49, Marcelo Tosatti wrote:
> > > +#define local_qpw_lock(lock)								\
> > > +	do {										\
> > > +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> > > +			migrate_disable();						\
> > 
> > Have you considered using migrate_disable() on PREEMPT_RT and
> > preempt_disable() on !PREEMPT_RT since it's cheaper? It's what the pcp
> > locking in mm/page_alloc.c does, for that reason. It should reduce the
> > overhead with qpw=1 on !PREEMPT_RT.
> 
> migrate_disable:
> Patched kernel, CONFIG_QPW=y, qpw=1:    192 cycles
> 
> preempt_disable:
> [   65.497223] kmalloc_bench: Avg cycles per kmalloc: 184 cycles
> 
> I tried it before, but it was crashing for some reason which I didn't
> look into (perhaps PREEMPT_RT was enabled).
> 
> Will change this for the next iteration, thanks.
> 

Hi all,

That made me remember that rt spinlocks already use migrate_disable() and
non-rt spinlocks already have preempt_disable().

Maybe it's actually worth adding a local_spin_lock() in spinlock{,_rt}.c
which would get the per-cpu variable inside the preempt/migrate_disable
area, and making use of it in qpw code. That way we avoid nesting
migrate_disable or preempt_disable, further reducing the impact.

The alternative is to not have migrate/preempt disable here and to
trust the ones inside the locking primitives. There is a chance of
contention, but I don't remember being able to detect it.

Thanks!
Leo



* Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
  2026-03-03 11:15 ` [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Frederic Weisbecker
@ 2026-03-08 18:02   ` Leonardo Bras
  0 siblings, 0 replies; 32+ messages in thread
From: Leonardo Bras @ 2026-03-08 18:02 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Leonardo Bras, Marcelo Tosatti, linux-kernel, linux-mm,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Thomas Gleixner, Waiman Long, Boqun Feun

On Tue, Mar 03, 2026 at 12:15:53PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti wrote:
> > The problem:
> > Some places in the kernel implement a parallel programming strategy
> > consisting of local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with remote requests is
> > sure to introduce unexpected deadline misses.
> > 
> > The idea:
> > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > In this case, instead of scheduling work on a remote cpu, it should
> > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > work locally. That major cost, which is un/locking in every local function,
> > already happens in PREEMPT_RT.
> > 
> > Also, there is no need to worry about extra cache bouncing:
> > The cacheline invalidation already happens due to schedule_work_on().
> > 
> > This will avoid schedule_work_on(), and thus avoid scheduling-out an
> > RT workload.
> > 
> > Proposed solution:
> > A new interface called Queue PerCPU Work (QPW), which should replace
> > Work Queue in the above mentioned use case.
> > 
> > If CONFIG_QPW=n this interface just wraps the current
> > local_locks + WorkQueue behavior, so no expected change in runtime.
> > 
> > If CONFIG_QPW=y, and qpw kernel boot option =1, 
> > queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> > and perform work on it locally. This is possible because on 
> > functions that can be used for performing remote work on remote 
> > per-cpu structures, the local_lock (which is already
> > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> 
> Ok I'm slowly considering this as a more comfortable solution than the
> flush before userspace. Despite it being perhaps a bit more complicated,
> remote handling of housekeeping work is more surprise-free against all
> the possible nohz_full usecases that we are having a hard time envisioning.
> 
> Reviewing this in more detail now.

Awesome! Thanks!
Leo

> 
> Thanks.
> 
> -- 
> Frederic Weisbecker
> SUSE Labs



* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-08 18:00       ` Leonardo Bras
@ 2026-03-09 10:14         ` Vlastimil Babka (SUSE)
  2026-03-11  0:16           ` Leonardo Bras
  0 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-09 10:14 UTC (permalink / raw)
  To: Leonardo Bras, Marcelo Tosatti
  Cc: linux-kernel, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Hyeonggon Yoo, Thomas Gleixner, Waiman Long, Boqun Feun,
	Frederic Weisbecker

On 3/8/26 19:00, Leonardo Bras wrote:
> On Tue, Mar 03, 2026 at 01:02:13PM -0300, Marcelo Tosatti wrote:
>> On Tue, Mar 03, 2026 at 01:03:36PM +0100, Vlastimil Babka (SUSE) wrote:
>> > On 3/2/26 16:49, Marcelo Tosatti wrote:
>> > > +#define local_qpw_lock(lock)								\
>> > > +	do {										\
>> > > +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
>> > > +			migrate_disable();						\
>> > 
>> > Have you considered using migrate_disable() on PREEMPT_RT and
>> > preempt_disable() on !PREEMPT_RT since it's cheaper? It's what the pcp
>> > locking in mm/page_alloc.c does, for that reason. It should reduce the
>> > overhead with qpw=1 on !PREEMPT_RT.
>> 
>> migrate_disable:
>> Patched kernel, CONFIG_QPW=y, qpw=1:    192 cycles
>> 
>> preempt_disable:
>> [   65.497223] kmalloc_bench: Avg cycles per kmalloc: 184 cycles
>> 
>> I tried it before, but it was crashing for some reason which I didn't
>> look into (perhaps PREEMPT_RT was enabled).
>> 
>> Will change this for the next iteration, thanks.
>> 
> 
> Hi all,
> 
> That made me remember that rt spinlocks already use migrate_disable() and
> non-rt spinlocks already have preempt_disable().
> 
> Maybe it's actually worth adding a local_spin_lock() in spinlock{,_rt}.c
> which would get the per-cpu variable inside the preempt/migrate_disable
> area, and making use of it in qpw code. That way we avoid nesting
> migrate_disable or preempt_disable, further reducing the impact.

That would be nice indeed. But since the nested disable/enable cost should
be low, and the spinlock code is rather complicated, it might be tough to sell.
It would also be great to have those trylocks inline on all arches.

> The alternative is to not have migrate/preempt disable here and to
> trust the ones inside the locking primitives. There is a chance of
> contention, but I don't remember being able to detect it.

So then we could pick the lock on one cpu but then get migrated and actually
lock it on another cpu. Is contention the only possible downside of this, or
could it lead to subtle bugs depending on the particular user? The paths
that don't flush stuff on remote cpus but expect to work with the local
cpu's structure in a fastpath might get broken. I'd be wary of this.

> Thanks!
> Leo




* Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
  2026-03-05 16:55 ` Frederic Weisbecker
  2026-03-06  1:47   ` Marcelo Tosatti
@ 2026-03-10 17:12   ` Marcelo Tosatti
  2026-03-10 22:14     ` Frederic Weisbecker
                       ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2026-03-10 17:12 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feun

Hi Frederic,

On Thu, Mar 05, 2026 at 05:55:12PM +0100, Frederic Weisbecker wrote:
> On Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti wrote:
> > The problem:
> > Some places in the kernel implement a parallel programming strategy
> > consisting of local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem: getting
> > an important workload scheduled out to deal with remote requests is
> > sure to introduce unexpected deadline misses.
> > 
> > The idea:
> > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > In this case, instead of scheduling work on a remote cpu, it should
> > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > work locally. That major cost, which is un/locking in every local function,
> > already happens in PREEMPT_RT.
> > 
> > Also, there is no need to worry about extra cache bouncing:
> > The cacheline invalidation already happens due to schedule_work_on().
> > 
> > This will avoid schedule_work_on(), and thus avoid scheduling-out an
> > RT workload.
> > 
> > Proposed solution:
> > A new interface called Queue PerCPU Work (QPW), which should replace
> > Work Queue in the above mentioned use case.
> > 
> > If CONFIG_QPW=n this interface just wraps the current
> > local_locks + WorkQueue behavior, so no expected change in runtime.
> > 
> > If CONFIG_QPW=y, and qpw kernel boot option =1, 
> > queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> > and perform work on it locally. This is possible because on 
> > functions that can be used for performing remote work on remote 
> > per-cpu structures, the local_lock (which is already
> > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> 
> So let me summarize what are the possible design solutions, on top of our discussions,
> so we can compare:
> 
> 1) Never queue remotely but always queue locally and execute on userspace
>    return via task work.

How can you "queue locally" if the request is visible on a remote CPU?

That is, the event which triggers the manipulation of the data
structures (a manipulation which must be performed by the CPU that owns
them) occurs on a remote CPU.

This is confusing.

Can you also please give a practical example of such case ?

>    Pros:
>          - Simple and easy to maintain.
> 
>    Cons:
>          - Need a case by case handling.
> 
> 	 - Might be suitable for full userspace applications but not for
>            some HPC usecases. In the best world MPI is fully implemented in
>            userspace but that doesn't appear to be the case.
> 
> 2) Queue locally the workqueue right away

Again, the event which triggers the manipulation of data structures
by the owner CPU happens on a remote CPU. 
So how can you queue it locally ?

>    or do it remotely (if it's
>    really necessary) if the isolated CPU is in userspace, otherwise queue
>    it for execution on return to kernel. The work will be handled by preemption
>    to a worker or by a workqueue flush on return to userspace.
> 
>    Pros:
>         - The local queue handling is simple.
> 
>    Cons:
>         - The remote queue must synchronize with return to userspace and
> 	  eventually postpone to return to kernel if the target is in userspace.
> 	  Also it may need to differentiate IRQs and syscalls.
> 
>         - Therefore still involve some case by case handling eventually.
>    
>         - Flushing the global workqueues to avoid deadlocks is unadvised as shown
>           in the comment above flush_scheduled_work(). It even triggers a
>           warning. Significant efforts have been put to convert all the existing
> 	  users. It's not impossible to sell in our case because we shouldn't
> 	  hold a lock upon return to userspace. But that will restore a new
> 	  dangerous API.
> 
>         - Queueing the workqueue / flushing involves a context switch which
>           induce more noise (eg: tick restart)
> 	  
>         - As above, probably not suitable for HPC.
> 
> 3) QPW: Handle the work remotely
> 
>    Pros:
>         - Works on all cases, without any surprise.
> 
>    Cons:
>         - Introduce new locking scheme to maintain and debug.
> 
>         - Needs case by case handling.
> 
> Thoughts?

Can you please be more verbose, mindful of lesser cognitive powers ? :-) 

Note: i also dislike the added layers (and multiple cases) QPW adds.

But there is precedence with local locks...

Code would be less complex if spinlocks were added:

01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock

But people seem to reject that on the basis of performance
degradation.



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
  2026-03-06  1:47   ` Marcelo Tosatti
@ 2026-03-10 21:34     ` Frederic Weisbecker
  0 siblings, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2026-03-10 21:34 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feun

Le Thu, Mar 05, 2026 at 10:47:00PM -0300, Marcelo Tosatti a écrit :
> On Thu, Mar 05, 2026 at 05:55:12PM +0100, Frederic Weisbecker wrote:
> > So let me summarize what are the possible design solutions, on top of our discussions,
> > so we can compare:
> 
> I find this summary difficult to comprehend. The way i see it is:
> 
> A certain class of data structures can be manipulated only by each individual CPU (the
> per-CPU caches), since they lack proper locks for such data to be
> manipulated by remote CPUs.
> 
> There are certain operations which require such data to be manipulated,
> therefore work is queued to execute on the owner CPUs.

Right.

 
> > 
> > 1) Never queue remotely but always queue locally and execute on userspace
> 
> When you say "queue locally", do you mean to queue the data structure 
> manipulation to happen on return to userspace of the owner CPU ?

Yes.

> 
> What if it does not return to userspace ? (or takes a long time to return 
> to userspace?).

Indeed it's a bet that syscalls eventually return "soon enough" for correctness
to be maintained and that the CPU is not stuck on some kthread. But on isolation
workloads, those assumptions are usually true.

> 
> >    return via task work.
> > 
> >    Pros:
> >          - Simple and easy to maintain.
> > 
> >    Cons:
> >          - Need a case by case handling.
> > 
> > 	 - Might be suitable for full userspace applications but not for
> >            some HPC usecases. In the best world MPI is fully implemented in
> >            userspace but that doesn't appear to be the case.
> > 
> > 2) Queue locally the workqueue right away or do it remotely (if it's
> >    really necessary) if the isolated CPU is in userspace, otherwise queue
> >    it for execution on return to kernel. The work will be handled by preemption
> >    to a worker or by a workqueue flush on return to userspace.
> > 
> >    Pros:
> >         - The local queue handling is simple.
> > 
> >    Cons:
> >         - The remote queue must synchronize with return to userspace and
> > 	  eventually postpone to return to kernel if the target is in userspace.
> > 	  Also it may need to differentiate IRQs and syscalls.
> > 
> >         - Therefore still involve some case by case handling eventually.
> >    
> >         - Flushing the global workqueues to avoid deadlocks is unadvised as shown
> >           in the comment above flush_scheduled_work(). It even triggers a
> >           warning. Significant efforts have been put to convert all the existing
> > 	  users. It's not impossible to sell in our case because we shouldn't
> > 	  hold a lock upon return to userspace. But that will restore a new
> > 	  dangerous API.
> > 
> >         - Queueing the workqueue / flushing involves a context switch which
> >           induce more noise (eg: tick restart)
> > 	  
> >         - As above, probably not suitable for HPC.
> > 
> > 3) QPW: Handle the work remotely
> > 
> >    Pros:
> >         - Works on all cases, without any surprise.
> > 
> >    Cons:
> >         - Introduce new locking scheme to maintain and debug.
> > 
> >         - Needs case by case handling.
> > 
> > Thoughts?
> > 
> > -- 
> > Frederic Weisbecker
> > SUSE Labs
> 
> It's hard for me to parse your concise summary (perhaps it could be more
> verbose).
> 
> Anyway, one thought is to use some sort of SRCU type protection on the 
> per-CPU caches.
> But that adds cost as well (compared to non-SRCU), which then seems to
> have cost similar to adding per-CPU spinlocks.

Well, there is SRCU-fast now. Though do we care about optimizing housekeeping
performance on isolated workloads to the point of complicating things with a
weaker and trickier synchronization mechanism? Probably not. If we choose to
pick up your solution, I'm fine with spinlocks.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
  2026-03-10 17:12   ` Marcelo Tosatti
@ 2026-03-10 22:14     ` Frederic Weisbecker
  2026-03-11  1:18     ` Hillf Danton
  2026-03-11  7:54     ` Vlastimil Babka
  2 siblings, 0 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2026-03-10 22:14 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feun

Le Tue, Mar 10, 2026 at 02:12:03PM -0300, Marcelo Tosatti a écrit :
> Hi Frederic,
> 
> On Thu, Mar 05, 2026 at 05:55:12PM +0100, Frederic Weisbecker wrote:
> > Le Mon, Mar 02, 2026 at 12:49:45PM -0300, Marcelo Tosatti a écrit :
> > > The problem:
> > > Some places in the kernel implement a parallel programming strategy
> > > consisting on local_locks() for most of the work, and some rare remote
> > > operations are scheduled on target cpu. This keeps cache bouncing low since
> > > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > > kernels, even though the very few remote operations will be expensive due
> > > to scheduling overhead.
> > > 
> > > On the other hand, for RT workloads this can represent a problem: getting
> > > an important workload scheduled out to deal with remote requests is
> > > sure to introduce unexpected deadline misses.
> > > 
> > > The idea:
> > > Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
> > > In this case, instead of scheduling work on a remote cpu, it should
> > > be safe to grab that remote cpu's per-cpu spinlock and run the required
> > > work locally. That major cost, which is un/locking in every local function,
> > > already happens in PREEMPT_RT.
> > > 
> > > Also, there is no need to worry about extra cache bouncing:
> > > The cacheline invalidation already happens due to schedule_work_on().
> > > 
> > > This will avoid schedule_work_on(), and thus avoid scheduling-out an
> > > RT workload.
> > > 
> > > Proposed solution:
> > > A new interface called Queue PerCPU Work (QPW), which should replace
> > > Work Queue in the above mentioned use case.
> > > 
> > > If CONFIG_QPW=n this interface just wraps the current
> > > local_locks + WorkQueue behavior, so no expected change in runtime.
> > > 
> > > If CONFIG_QPW=y, and qpw kernel boot option =1, 
> > > queue_percpu_work_on(cpu,...) will lock that cpu's per-cpu structure
> > > and perform work on it locally. This is possible because on 
> > > functions that can be used for performing remote work on remote 
> > > per-cpu structures, the local_lock (which is already
> > > a this_cpu spinlock()), will be replaced by a qpw_spinlock(), which
> > > is able to get the per_cpu spinlock() for the cpu passed as parameter.
> > 
> > So let me summarize what are the possible design solutions, on top of our discussions,
> > so we can compare:
> > 
> > 1) Never queue remotely but always queue locally and execute on userspace
> >    return via task work.
> 
> How can you "queue locally" if the request is visible on a remote CPU?
> 
> That is, the event which triggers the manipulation of the data
> structures (a manipulation which must be performed by the CPU that owns
> them) occurs on a remote CPU.
> 
> This is confusing.
> 
> Can you also please give a practical example of such case ?

Right, so in the case of LRU batching, it consists in always queueing
locally as soon as there is something to do. Then no remote queueing
is necessary. Like here:

https://lwn.net/ml/all/20250703140717.25703-7-frederic@kernel.org/

> 
> >    Pros:
> >          - Simple and easy to maintain.
> > 
> >    Cons:
> >          - Need a case by case handling.
> > 
> > 	 - Might be suitable for full userspace applications but not for
> >            some HPC usecases. In the best world MPI is fully implemented in
> >            userspace but that doesn't appear to be the case.
> > 
> > 2) Queue locally the workqueue right away
> 
> Again, the event which triggers the manipulation of data structures
> by the owner CPU happens on a remote CPU. 
> So how can you queue it locally ?

So that would be the same as above, but instead of using task_work(), we
would force-queue a work item locally. It's more aggressive.

> 
> >    or do it remotely (if it's
> >    really necessary) if the isolated CPU is in userspace, otherwise queue
> >    it for execution on return to kernel. The work will be handled by preemption
> >    to a worker or by a workqueue flush on return to userspace.
> > 
> >    Pros:
> >         - The local queue handling is simple.
> > 
> >    Cons:
> >         - The remote queue must synchronize with return to userspace and
> > 	  eventually postpone to return to kernel if the target is in userspace.
> > 	  Also it may need to differentiate IRQs and syscalls.
> > 
> >         - Therefore still involve some case by case handling eventually.
> >    
> >         - Flushing the global workqueues to avoid deadlocks is unadvised as shown
> >           in the comment above flush_scheduled_work(). It even triggers a
> >           warning. Significant efforts have been put to convert all the existing
> > 	  users. It's not impossible to sell in our case because we shouldn't
> > 	  hold a lock upon return to userspace. But that will restore a new
> > 	  dangerous API.
> > 
> >         - Queueing the workqueue / flushing involves a context switch which
> >           induce more noise (eg: tick restart)
> > 	  
> >         - As above, probably not suitable for HPC.
> > 
> > 3) QPW: Handle the work remotely
> > 
> >    Pros:
> >         - Works on all cases, without any surprise.
> > 
> >    Cons:
> >         - Introduce new locking scheme to maintain and debug.
> > 
> >         - Needs case by case handling.
> > 
> > Thoughts?
> 
> Can you please be more verbose, mindful of lesser cognitive powers ? :-)

Arguably verbosity is not my most developed skill :o)

> 
> Note: i also dislike the added layers (and multiple cases) QPW adds.
> 
> But there is precedence with local locks...
> 
> Code would be less complex in case spinlocks were added:
> 
> 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock
> 
> But people seem to reject that on the basis of performance
> degradation.

And that makes sense. Anyway, we have lockdep to help.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-09 10:14         ` Vlastimil Babka (SUSE)
@ 2026-03-11  0:16           ` Leonardo Bras
  0 siblings, 0 replies; 32+ messages in thread
From: Leonardo Bras @ 2026-03-11  0:16 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Leonardo Bras, Marcelo Tosatti, linux-kernel, linux-mm,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Hyeonggon Yoo, Thomas Gleixner,
	Waiman Long, Boqun Feun, Frederic Weisbecker

On Mon, Mar 09, 2026 at 11:14:23AM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/8/26 19:00, Leonardo Bras wrote:
> > On Tue, Mar 03, 2026 at 01:02:13PM -0300, Marcelo Tosatti wrote:
> >> On Tue, Mar 03, 2026 at 01:03:36PM +0100, Vlastimil Babka (SUSE) wrote:
> >> > On 3/2/26 16:49, Marcelo Tosatti wrote:
> >> > > +#define local_qpw_lock(lock)								\
> >> > > +	do {										\
> >> > > +		if (static_branch_maybe(CONFIG_QPW_DEFAULT, &qpw_sl)) {			\
> >> > > +			migrate_disable();						\
> >> > 
> >> > Have you considered using migrate_disable() on PREEMPT_RT and
> >> > preempt_disable() on !PREEMPT_RT since it's cheaper? It's what the pcp
> >> > locking in mm/page_alloc.c does, for that reason. It should reduce the
> >> > overhead with qpw=1 on !PREEMPT_RT.
> >> 
> >> migrate_disable:
> >> Patched kernel, CONFIG_QPW=y, qpw=1:    192 cycles
> >> 
> >> preempt_disable:
> >> [   65.497223] kmalloc_bench: Avg cycles per kmalloc: 184 cycles
> >> 
> >> I tried it before, but it was crashing for some reason which i didnt
> >> look into (perhaps PREEMPT_RT was enabled).
> >> 
> >> Will change this for the next iteration, thanks.
> >> 
> > 
> > Hi all,
> > 
> > That made me remember that rt spinlock already uses migrate_disable and 
> > non-rt spinlocks already have preempt_disable()
> > 
> > Maybe it's actually worth adding a local_spin_lock() in spinlock{,_rt}.c 
> > which would get the per-cpu variable inside the preempt/migrate_disable 
> > area, and making use of it in qpw code. That way we avoid nesting 
> > migrate_disable or preempt_disable, further reducing impact.
> 
> That would be nice indeed. But since the nested disable/enable cost should
> be low, and the spinlock code rather complicated, it might be tough to sell.
> It would be also great to have those trylocks inline on all arches.

Fair enough.
I will take a look at the spinlock code later; maybe we can have one in qpw 
code that can be used internally without impacting other users.

> 
> > The alternative is to not have migrate/preempt disable here and actually 
> > trust the ones inside the locking primitives. There is a chance of 
> > contention, but I don't remember being able to detect it.
> 
> So then we could pick the lock on one cpu but then get migrated and actually
> lock it on another cpu. Is contention the only possible downside of this, or
> could it lead to subtle bugs depending on the particular user? The paths
> that don't flush stuff on remote cpus but expect working with the local
> cpu's structure in a fastpath might get broken. I'd be wary of this.

Yeah, that's right. Contention could be really bad for realtime, however 
rarely it may happen. 

And you are right about potential bugs: for functions that operate on 
local per-cpu data (this_cpu_read/write) it would be expensive to use 
per_cpu_read/write(), so IIRC Marcelo did not convert those in functions 
that always run on the local cpu. If the cpu migrates before getting the 
lock, we will safely operate remotely on that cpu's data, but any 
this_cpu_*() in the function will operate on the local cpu instead of 
the remote cpu.

So you and Marcelo are correct: we can't have migration/preemption 
happening during the routine, which means we need to disable them before 
we get the cpu.

Thanks!
Leo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
  2026-03-10 17:12   ` Marcelo Tosatti
  2026-03-10 22:14     ` Frederic Weisbecker
@ 2026-03-11  1:18     ` Hillf Danton
  2026-03-11  7:54     ` Vlastimil Babka
  2 siblings, 0 replies; 32+ messages in thread
From: Hillf Danton @ 2026-03-11  1:18 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Frederic Weisbecker, linux-kernel, linux-mm, Andrew Morton,
	Christoph Lameter, Vlastimil Babka

On Tue, 10 Mar 2026 14:12:03 -0300 Marcelo Tosatti wrote:
> Can you please be more verbose, mindful of lesser cognitive powers ? :-) 
> 
> Note: i also dislike the added layers (and multiple cases) QPW adds.
> 
> But there is precedence with local locks...
> 
> Code would be less complex in case spinlocks were added:
> 
> 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock
> 
> But people seem to reject that on the basis of performance degradation.
>
Given the pcp_spin_lock() cut in 0f21b911011f ("mm/page_alloc: simplify and cleanup
pcp locking"), the spinlock works because of trylock and fallback, so it is a special
case rather than a generic pattern to follow.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2)
  2026-03-10 17:12   ` Marcelo Tosatti
  2026-03-10 22:14     ` Frederic Weisbecker
  2026-03-11  1:18     ` Hillf Danton
@ 2026-03-11  7:54     ` Vlastimil Babka
  2 siblings, 0 replies; 32+ messages in thread
From: Vlastimil Babka @ 2026-03-11  7:54 UTC (permalink / raw)
  To: Marcelo Tosatti, Frederic Weisbecker
  Cc: linux-kernel, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner, Waiman Long,
	Boqun Feun, Mel Gorman

On 3/10/26 18:12, Marcelo Tosatti wrote:
> Hi Frederic,
> 
> On Thu, Mar 05, 2026 at 05:55:12PM +0100, Frederic Weisbecker wrote:
> 
> Can you please be more verbose, mindful of lesser cognitive powers ? :-) 
> 
> Note: i also dislike the added layers (and multiple cases) QPW adds.
> 
> But there is precedence with local locks...
> 
> Code would be less complex in case spinlocks were added:
> 
> 01b44456a7aa7c3b24fa9db7d1714b208b8ef3d8 mm/page_alloc: replace local_lock with normal spinlock
> 4b23a68f953628eb4e4b7fe1294ebf93d4b8ceee mm/page_alloc: protect PCP lists with a spinlock

Note that per the bf75f200569dd05ac2112797f44548beb6b4be26 changelog, it seems
this was all done for the same reasons as QPW. It's nice we got the
trylock-without-irqsave approach as a followup, but the cost of an (especially
non-inlined) spin_trylock is not great, given that we could now do the
trylock-without-irqsave cheaply with local_trylock.

So that to me suggests it could be worth to try convert pcplists to QPW if
it's agreed upon as the best way forward and is merged.

> But people seem to reject that on the basis of performance
> degradation.
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-02 15:49 ` [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
  2026-03-03 12:03   ` Vlastimil Babka (SUSE)
@ 2026-03-11  7:58   ` Vlastimil Babka (SUSE)
  2026-03-15 17:37     ` Leonardo Bras
  2026-03-13 21:55   ` Frederic Weisbecker
  2 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-11  7:58 UTC (permalink / raw)
  To: Marcelo Tosatti, linux-kernel, linux-mm
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Hyeonggon Yoo, Leonardo Bras,
	Thomas Gleixner, Waiman Long, Boqun Feun, Frederic Weisbecker

On 3/2/26 16:49, Marcelo Tosatti wrote:
> Index: linux/Documentation/admin-guide/kernel-parameters.txt
> ===================================================================
> --- linux.orig/Documentation/admin-guide/kernel-parameters.txt
> +++ linux/Documentation/admin-guide/kernel-parameters.txt
> @@ -2840,6 +2840,16 @@ Kernel parameters
>  
>  			The format of <cpu-list> is described above.
>  
> +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> +			and remote interference mechanism on a kernel built with
> +			CONFIG_QPW.
> +			Format: { "0" | "1" }
> +			0 - local_lock() + queue_work_on(remote_cpu)
> +			1 - spin_lock() for both local and remote operations
> +
> +			Selecting 1 may be interesting for systems that want
> +			to avoid interruption & context switches from IPIs.

Requiring a new boot option is always a nuisance. CPU isolation is
AFAIK difficult enough to set up already. Could the default be that qpw will
auto-enable if there are isolated cpus configured? The option could still be
useful for overriding that automatic decision to either 0 or 1 for testing
etc., but not required for the expected usecase?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-02 15:49 ` [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
  2026-03-03 12:03   ` Vlastimil Babka (SUSE)
  2026-03-11  7:58   ` Vlastimil Babka (SUSE)
@ 2026-03-13 21:55   ` Frederic Weisbecker
  2026-03-15 18:10     ` Leonardo Bras
  2 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2026-03-13 21:55 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: linux-kernel, linux-mm, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Vlastimil Babka, Hyeonggon Yoo, Leonardo Bras, Thomas Gleixner,
	Waiman Long, Boqun Feun

Le Mon, Mar 02, 2026 at 12:49:47PM -0300, Marcelo Tosatti a écrit :
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
> 
> On the other hand, for RT workloads this can represent a problem:
> scheduling work on remote cpu that are executing low latency tasks
> is undesired and can introduce unexpected deadline misses.
> 
> It's interesting, though, that local_lock()s in RT kernels become
> spinlock(). We can make use of those to avoid scheduling work on a remote
> cpu by directly updating another cpu's per_cpu structure, while holding
> it's spinlock().
> 
> In order to do that, it's necessary to introduce a new set of functions to
> make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
> and also the corresponding queue_percpu_work_on() and flush_percpu_work()
> helpers to run the remote work.
> 
> Users of non-RT kernels but with low latency requirements can select
> similar functionality by using the CONFIG_QPW compile time option.
> 
> On CONFIG_QPW disabled kernels, no changes are expected, as every
> one of the introduced helpers work the exactly same as the current
> implementation:
> qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)

I find this part of the semantics a bit weird. If we eventually queue
the work, why do we care about doing a local_lock() locally ?

> queue_percpu_work_on()  ->  queue_work_on()
> flush_percpu_work()     ->  flush_work()
> 
> @@ -2840,6 +2840,16 @@ Kernel parameters
>  
>  			The format of <cpu-list> is described above.
>  
> +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> +			and remote interference mechanism on a kernel built with
> +			CONFIG_QPW.
> +			Format: { "0" | "1" }
> +			0 - local_lock() + queue_work_on(remote_cpu)
> +			1 - spin_lock() for both local and remote operations
> +
> +			Selecting 1 may be interesting for systems that want
> +			to avoid interruption & context switches from IPIs.

Like Vlastimil suggested, it would be better to just have it off by default
and turn it on only if nohz_full= is passed. Then we can consider introducing
the parameter later if the need arises.

> +#define qpw_lock_init(lock)				\
> +	local_lock_init(lock)
> +
> +#define qpw_trylock_init(lock)				\
> +	local_trylock_init(lock)
> +
> +#define qpw_lock(lock, cpu)				\
> +	local_lock(lock)
> +
> +#define local_qpw_lock(lock)				\
> +	local_lock(lock)

It would be easier to grep if all the APIs start with qpw_* prefix.

qpw_local_lock() ?

> +
> +#define qpw_lock_irqsave(lock, flags, cpu)		\
> +	local_lock_irqsave(lock, flags)
> +
> +#define local_qpw_lock_irqsave(lock, flags)		\
> +	local_lock_irqsave(lock, flags)

ditto?

> +
> +#define qpw_trylock(lock, cpu)				\
> +	local_trylock(lock)
> +
> +#define local_qpw_trylock(lock)				\
> +	local_trylock(lock)

...

> +
> +#define qpw_trylock_irqsave(lock, flags, cpu)		\
> +	local_trylock_irqsave(lock, flags)
> +
> +#define qpw_unlock(lock, cpu)				\
> +	local_unlock(lock)
> +
> +#define local_qpw_unlock(lock)				\
> +	local_unlock(lock)

...

> +
> +#define qpw_unlock_irqrestore(lock, flags, cpu)		\
> +	local_unlock_irqrestore(lock, flags)
> +
> +#define local_qpw_unlock_irqrestore(lock, flags)	\
> +	local_unlock_irqrestore(lock, flags)

...

> +
> +#define qpw_lockdep_assert_held(lock)			\
> +	lockdep_assert_held(lock)
> +
> +#define queue_percpu_work_on(c, wq, qpw)		\
> +	queue_work_on(c, wq, &(qpw)->work)

qpw_queue_work_on() ?

Perhaps even better would be qpw_queue_work_for(), leaving some room for
mystery about where/how the work will be executed :-)

> +
> +#define flush_percpu_work(qpw)				\
> +	flush_work(&(qpw)->work)

qpw_flush_work() ?

> +
> +#define qpw_get_cpu(qpw)	smp_processor_id()
> +
> +#define qpw_is_cpu_remote(cpu)		(false)
> +
> +#define INIT_QPW(qpw, func, c)				\
> +	INIT_WORK(&(qpw)->work, (func))
> +
> @@ -762,6 +762,41 @@ config CPU_ISOLATION
>  
>  	  Say Y if unsure.
>  
> +config QPW
> +	bool "Queue per-CPU Work"
> +	depends on SMP || COMPILE_TEST
> +	default n
> +	help
> +	  Allow changing the behavior on per-CPU resource sharing with cache,
> +	  from the regular local_locks() + queue_work_on(remote_cpu) to using
> +	  per-CPU spinlocks on both local and remote operations.
> +
> +	  This is useful to give user the option on reducing IPIs to CPUs, and
> +	  thus reduce interruptions and context switches. On the other hand, it
> +	  increases generated code and will use atomic operations if spinlocks
> +	  are selected.
> +
> +	  If set, will use the default behavior set in QPW_DEFAULT unless boot
> +	  parameter qpw is passed with a different behavior.
> +
> +	  If unset, will use the local_lock() + queue_work_on() strategy,
> +	  regardless of the boot parameter or QPW_DEFAULT.
> +
> +	  Say N if unsure.

Perhaps that too should just be selected automatically by CONFIG_NO_HZ_FULL
and, if the need arises in the future, be made visible to the user?

Thanks.

-- 
Frederic Weisbecker
SUSE Labs


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-11  7:58   ` Vlastimil Babka (SUSE)
@ 2026-03-15 17:37     ` Leonardo Bras
  2026-03-16 10:55       ` Vlastimil Babka (SUSE)
  0 siblings, 1 reply; 32+ messages in thread
From: Leonardo Bras @ 2026-03-15 17:37 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Leonardo Bras, Marcelo Tosatti, linux-kernel, linux-mm,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Hyeonggon Yoo, Thomas Gleixner,
	Waiman Long, Boqun Feun, Frederic Weisbecker

On Wed, Mar 11, 2026 at 08:58:05AM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/2/26 16:49, Marcelo Tosatti wrote:
> > Index: linux/Documentation/admin-guide/kernel-parameters.txt
> > ===================================================================
> > --- linux.orig/Documentation/admin-guide/kernel-parameters.txt
> > +++ linux/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2840,6 +2840,16 @@ Kernel parameters
> >  
> >  			The format of <cpu-list> is described above.
> >  
> > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> > +			and remote interference mechanism on a kernel built with
> > +			CONFIG_QPW.
> > +			Format: { "0" | "1" }
> > +			0 - local_lock() + queue_work_on(remote_cpu)
> > +			1 - spin_lock() for both local and remote operations
> > +
> > +			Selecting 1 may be interesting for systems that want
> > +			to avoid interruption & context switches from IPIs.
> Requiring a new boot option is always a nuissance. The cpu isolation is
> AFAIK difficult enough to setup already. Could the default be that qpw will
> auto-enable if there are isolated cpus configured? The option could still be
> useful for overriding that automatic decision to both 0 and 1 for testing
> etc, but not requried for the expected usecase?


I think that's okay. Something like this?
(it should work for both nohz_full and isolcpus)

######
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 81bc8b329ef17..6c9052c28e3e4 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -170,20 +170,23 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
                for_each_set_bit(type, &iter_flags, HK_TYPE_MAX)
                        housekeeping_setup_type(type, housekeeping_staging);
        }
 
        if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping.flags & HK_FLAG_KERNEL_NOISE))
                tick_nohz_full_setup(non_housekeeping_mask);
 
        housekeeping.flags |= flags;
        err = 1;
 
+       if (IS_ENABLED(CONFIG_QPW_DEFAULT))
+               qpw_setup("1");
+
 free_housekeeping_staging:
        free_bootmem_cpumask_var(housekeeping_staging);
 free_non_housekeeping_mask:
        free_bootmem_cpumask_var(non_housekeeping_mask);
 
        return err;
 }
######

We would only have to make sure this runs before the cmdline parses qpw=, 
so the user could still disable qpw if wanted.

Would that work?

Thanks!
Leo

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-13 21:55   ` Frederic Weisbecker
@ 2026-03-15 18:10     ` Leonardo Bras
  2026-03-17 13:33       ` Frederic Weisbecker
  0 siblings, 1 reply; 32+ messages in thread
From: Leonardo Bras @ 2026-03-15 18:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Leonardo Bras, Marcelo Tosatti, linux-kernel, linux-mm,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Thomas Gleixner, Waiman Long, Boqun Feun

On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> Le Mon, Mar 02, 2026 at 12:49:47PM -0300, Marcelo Tosatti a écrit :
> > Some places in the kernel implement a parallel programming strategy
> > consisting of local_locks() for most of the work, and some rare remote
> > operations are scheduled on target cpu. This keeps cache bouncing low since
> > cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> > kernels, even though the very few remote operations will be expensive due
> > to scheduling overhead.
> > 
> > On the other hand, for RT workloads this can represent a problem:
> > scheduling work on remote cpus that are executing low latency tasks
> > is undesired and can introduce unexpected deadline misses.
> > 
> > It's interesting, though, that local_lock()s in RT kernels become
> > spinlock(). We can make use of those to avoid scheduling work on a remote
> > cpu by directly updating another cpu's per_cpu structure, while holding
> > its spinlock().
> > 
> > In order to do that, it's necessary to introduce a new set of functions to
> > make it possible to get another cpu's per-cpu "local" lock (qpw_{un,}lock*)
> > and also the corresponding queue_percpu_work_on() and flush_percpu_work()
> > helpers to run the remote work.
> > 
> > Users of non-RT kernels but with low latency requirements can select
> > similar functionality by using the CONFIG_QPW compile time option.
> > 
> > On CONFIG_QPW disabled kernels, no changes are expected, as each of
> > the introduced helpers works exactly the same as the current
> > implementation:
> > qpw_{un,}lock*()        ->  local_{un,}lock*() (ignores cpu parameter)
> 
> I find this part of the semantic a bit weird. If we eventually queue
> the work, why do we care about doing a local_lock() locally ?

(Sorry, not sure if I was able to understand the question.)

Local locks make sure a per-cpu procedure happens on the same CPU from 
start to end: on RT they use migrate_disable() plus per-cpu spinlocks, 
and on non-RT they use preempt_disable().

In most cases the work is done on the local cpu, and only a few 
procedures are queued remotely, such as remote cache draining.

Even with the new 'local_qpw_lock()', which is faster for cases we are 
sure are local, on qpw=0 we have to make qpw_lock() a local_lock as well: 
the cpu receiving the scheduled work needs to run it all without moving 
to a different cpu.

> 
> > queue_percpu_work_on()  ->  queue_work_on()
> > flush_percpu_work()     ->  flush_work()

btw Marcelo, I think we need to add local_qpw_lock() here as well, or 
change the first line to '{local_,}qpw_{un,}lock*()'

> > 
> > @@ -2840,6 +2840,16 @@ Kernel parameters
> >  
> >  			The format of <cpu-list> is described above.
> >  
> > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> > +			and remote interference mechanism on a kernel built with
> > +			CONFIG_QPW.
> > +			Format: { "0" | "1" }
> > +			0 - local_lock() + queue_work_on(remote_cpu)
> > +			1 - spin_lock() for both local and remote operations
> > +
> > +			Selecting 1 may be interesting for systems that want
> > +			to avoid interruption & context switches from IPIs.
> 
> Like Vlastimil suggested, it would be better to just have it off by default
> and turn it on only if nohz_full= is passed. Then we can consider introducing
> the parameter later if the need arise.

I agree with having it enabled with isolcpus/nohz_full, but I would 
recommend having this option anyway, as the user could then disable qpw 
if wanted, or enable it outside isolcpus scenarios for any reason.

> 
> > +#define qpw_lock_init(lock)				\
> > +	local_lock_init(lock)
> > +
> > +#define qpw_trylock_init(lock)				\
> > +	local_trylock_init(lock)
> > +
> > +#define qpw_lock(lock, cpu)				\
> > +	local_lock(lock)
> > +
> > +#define local_qpw_lock(lock)				\
> > +	local_lock(lock)
> 
> It would be easier to grep if all the APIs start with qpw_* prefix.
> 
> qpw_local_lock() ?

Sure, I'm not against the change.
We would also need to change all versions starting with local_.

> 
> > +
> > +#define qpw_lock_irqsave(lock, flags, cpu)		\
> > +	local_lock_irqsave(lock, flags)
> > +
> > +#define local_qpw_lock_irqsave(lock, flags)		\
> > +	local_lock_irqsave(lock, flags)
> 
> ditto?
> 
> > +
> > +#define qpw_trylock(lock, cpu)				\
> > +	local_trylock(lock)
> > +
> > +#define local_qpw_trylock(lock)				\
> > +	local_trylock(lock)
> 
> ...
> 
> > +
> > +#define qpw_trylock_irqsave(lock, flags, cpu)		\
> > +	local_trylock_irqsave(lock, flags)
> > +
> > +#define qpw_unlock(lock, cpu)				\
> > +	local_unlock(lock)
> > +
> > +#define local_qpw_unlock(lock)				\
> > +	local_unlock(lock)
> 
> ...
> 
> > +
> > +#define qpw_unlock_irqrestore(lock, flags, cpu)		\
> > +	local_unlock_irqrestore(lock, flags)
> > +
> > +#define local_qpw_unlock_irqrestore(lock, flags)	\
> > +	local_unlock_irqrestore(lock, flags)
> 
> ...
> 
> > +
> > +#define qpw_lockdep_assert_held(lock)			\
> > +	lockdep_assert_held(lock)
> > +
> > +#define queue_percpu_work_on(c, wq, qpw)		\
> > +	queue_work_on(c, wq, &(qpw)->work)
> 
> qpw_queue_work_on() ?
> 
> Perhaps even better would be qpw_queue_work_for(), leaving some room for
> mystery about where/how the work will be executed :-)
> 

QPW comes from Queue PerCPU Work
Having it called qpw_queue_work_{on,for}() would be repetitive
But having qpw_on() or qpw_for() would be misleading :) 

That's why I went with queue_percpu_work_on(), based on the naming of the 
original function (queue_work_on).

> > +
> > +#define flush_percpu_work(qpw)				\
> > +	flush_work(&(qpw)->work)
> 
> qpw_flush_work() ?

Same as above,
qpw_flush() ?

> 
> > +
> > +#define qpw_get_cpu(qpw)	smp_processor_id()
> > +
> > +#define qpw_is_cpu_remote(cpu)		(false)
> > +
> > +#define INIT_QPW(qpw, func, c)				\
> > +	INIT_WORK(&(qpw)->work, (func))
> > +
> > @@ -762,6 +762,41 @@ config CPU_ISOLATION
> >  
> >  	  Say Y if unsure.
> >  
> > +config QPW
> > +	bool "Queue per-CPU Work"
> > +	depends on SMP || COMPILE_TEST
> > +	default n
> > +	help
> > +	  Allow changing the behavior on per-CPU resource sharing with cache,
> > +	  from the regular local_locks() + queue_work_on(remote_cpu) to using
> > +	  per-CPU spinlocks on both local and remote operations.
> > +
> > +	  This is useful to give user the option on reducing IPIs to CPUs, and
> > +	  thus reduce interruptions and context switches. On the other hand, it
> > +	  increases generated code and will use atomic operations if spinlocks
> > +	  are selected.
> > +
> > +	  If set, will use the default behavior set in QPW_DEFAULT unless boot
> > +	  parameter qpw is passed with a different behavior.
> > +
> > +	  If unset, will use the local_lock() + queue_work_on() strategy,
> > +	  regardless of the boot parameter or QPW_DEFAULT.
> > +
> > +	  Say N if unsure.
> 
> Perhaps that too should just be selected automatically by CONFIG_NO_HZ_FULL and if
> the need arise in the future, make it visible to the user?
> 

I think it would be good to have this, and let whoever is building have the 
chance to disable QPW if it doesn't work well for their machines or 
workload, without having to add a new boot parameter just to keep their 
setup working as before after a kernel update.

But that is open to discussion :)

Thanks!
Leo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-15 17:37     ` Leonardo Bras
@ 2026-03-16 10:55       ` Vlastimil Babka (SUSE)
  2026-03-23  0:51         ` Leonardo Bras
  0 siblings, 1 reply; 32+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-03-16 10:55 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Marcelo Tosatti, linux-kernel, linux-mm, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Hyeonggon Yoo, Thomas Gleixner, Waiman Long,
	Boqun Feng, Frederic Weisbecker

On 3/15/26 18:37, Leonardo Bras wrote:
> On Wed, Mar 11, 2026 at 08:58:05AM +0100, Vlastimil Babka (SUSE) wrote:
>> On 3/2/26 16:49, Marcelo Tosatti wrote:
>> > Index: linux/Documentation/admin-guide/kernel-parameters.txt
>> > ===================================================================
>> > --- linux.orig/Documentation/admin-guide/kernel-parameters.txt
>> > +++ linux/Documentation/admin-guide/kernel-parameters.txt
>> > @@ -2840,6 +2840,16 @@ Kernel parameters
>> >  
>> >  			The format of <cpu-list> is described above.
>> >  
>> > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
>> > +			and remote interference mechanism on a kernel built with
>> > +			CONFIG_QPW.
>> > +			Format: { "0" | "1" }
>> > +			0 - local_lock() + queue_work_on(remote_cpu)
>> > +			1 - spin_lock() for both local and remote operations
>> > +
>> > +			Selecting 1 may be interesting for systems that want
>> > +			to avoid interruption & context switches from IPIs.
>> Requiring a new boot option is always a nuisance. The cpu isolation is
>> AFAIK difficult enough to setup already. Could the default be that qpw will
>> auto-enable if there are isolated cpus configured? The option could still be
>> useful for overriding that automatic decision to both 0 and 1 for testing
>> etc, but not required for the expected usecase?
> 
> 
> I think it's okay, as something like this?
> (should work for nohz_full and isolcpus)
> 
> ######
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 81bc8b329ef17..6c9052c28e3e4 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -170,20 +170,23 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
>                 for_each_set_bit(type, &iter_flags, HK_TYPE_MAX)
>                         housekeeping_setup_type(type, housekeeping_staging);
>         }
>  
>         if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping.flags & HK_FLAG_KERNEL_NOISE))
>                 tick_nohz_full_setup(non_housekeeping_mask);
>  
>         housekeeping.flags |= flags;
>         err = 1;
>  
> +       if (IS_ENABLED(CONFIG_QPW_DEFAULT))
> +               qpw_setup("1");
> +
>  free_housekeeping_staging:
>         free_bootmem_cpumask_var(housekeeping_staging);
>  free_non_housekeeping_mask:
>         free_bootmem_cpumask_var(non_housekeeping_mask);
>  
>         return err;
>  }
> ######
> 
> We would only have to be sure that this runs before cmdline parses qpw=?, 

I'm not sure it's possible to achieve this ordering with __setup calls,
unless one of them is early, and then it might be too early to do the
necessary action.

> so user could disable qpw if wanted.
> 
> Would that work?

The pattern I'm familiar with is collecting all related params via
early_param() setting some variables, and then an init call (not tied to any
of the param) looks at those variables and does whatever is necessary.

> Thanks!
> Leo
> 
> 
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-15 18:10     ` Leonardo Bras
@ 2026-03-17 13:33       ` Frederic Weisbecker
  2026-03-23  1:38         ` Leonardo Bras
  2026-03-23 14:36         ` Marcelo Tosatti
  0 siblings, 2 replies; 32+ messages in thread
From: Frederic Weisbecker @ 2026-03-17 13:33 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Marcelo Tosatti, linux-kernel, linux-mm, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Thomas Gleixner,
	Waiman Long, Boqun Feng

Le Sun, Mar 15, 2026 at 03:10:27PM -0300, Leonardo Bras a écrit :
> On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> > I find this part of the semantic a bit weird. If we eventually queue
> > the work, why do we care about doing a local_lock() locally ?
> 
> (Sorry, not sure if I was able to understand the question.)
> 
> Local locks make sure a per-cpu procedure happens on the same CPU from 
> start to end. Using migrate_disable & using per-cpu spinlocks on RT and 
> doing preempt_disable in non_RT.
> 
> Most of the cases happen to have the work done in the local cpu, and just 
> a few procedures happen to be queued remotely, such as remote cache 
> draining. 
> 
> Even with the new 'local_qpw_lock()' which is faster for cases we are sure 
> to have local usages, on qpw=0 we have to make qpw_lock() a local_lock as 
> well, as the cpu receiving the scheduled work needs to make sure to run it 
> all without moving to a different cpu.

But queue_work_on() already makes sure the work doesn't move to a different CPU
(provided hotplug is correctly handled for the work).

Looks like we are both confused, so let's take a practical example. Suppose
CPU 0 queues a work to CPU 1 which sets a per-cpu variable named A to the value
"1". We want to guarantee that further reads of that per-cpu value by CPU 1
see the new value. With qpw=1, it looks like this:

CPU 0                                               CPU 1
-----                                               -----

qpw_lock(CPU 1)
   spin_lock(&QPW_CPU1)
qpw_queue_for(write_A, 1)
    write_A()
       A1 = per_cpu_ptr(&A, 1)
       *A1 = 1
qpw_unlock(CPU 1)
    spin_unlock(&QPW_CPU1)
                                                   read_A()
                                                       qpw_lock(CPU 1)
                                                           spin_lock(&QPW_CPU1)
                                                       r0 = __this_cpu_read(&A)
                                                       qpw_unlock(CPU 1)
                                                           spin_unlock(&QPW_CPU1)
                                                   

CPU 0 took the spinlock while writing to A, so CPU 1 is guaranteed to further
observe the new value because it takes the same spinlock (r0 == 1)

Now look at the qpw=0 case:
                                  
CPU 0                                               CPU 1
-----                                               -----

qpw_lock(CPU 1)
   local_lock(&QPW_CPU0)
qpw_queue_for(write_A, 1)
    queue_work_on(write_A, CPU 1)
qpw_unlock(CPU 1)
    local_unlock(&QPW_CPU0)
                                                   // workqueue
                                                   write_A()
                                                       qpw_lock(CPU 1)
                                                           local_lock(&QPW_CPU1)
                                                       A1 = per_cpu_ptr(&A, 1)
                                                       *A1 = 1
                                                       qpw_unlock(CPU 1)
                                                           local_unlock(&QPW_CPU1)

                                                   read_A()
                                                       qpw_lock(CPU 1)
                                                           local_lock(&QPW_CPU1)
                                                       r0 = __this_cpu_read(&A)
                                                       qpw_unlock(CPU 1)
                                                           local_unlock(&QPW_CPU1)

Here CPU 0 queues the work on CPU 1 which writes and reads the new value
(r0 == 1). local_lock() / preempt_disable() makes sure the CPU doesn't change.

But what is the point in doing local_lock(&QPW_CPU0) on CPU 0 ?


> > > 
> > > @@ -2840,6 +2840,16 @@ Kernel parameters
> > >  
> > >  			The format of <cpu-list> is described above.
> > >  
> > > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> > > +			and remote interference mechanism on a kernel built with
> > > +			CONFIG_QPW.
> > > +			Format: { "0" | "1" }
> > > +			0 - local_lock() + queue_work_on(remote_cpu)
> > > +			1 - spin_lock() for both local and remote operations
> > > +
> > > +			Selecting 1 may be interesting for systems that want
> > > +			to avoid interruption & context switches from IPIs.
> > 
> > Like Vlastimil suggested, it would be better to just have it off by default
> > and turn it on only if nohz_full= is passed. Then we can consider introducing
> > the parameter later if the need arise.
> 
> I agree with having it enabled with isolcpus/nohz_full, but I would 
> recommend having this option anyway, as the user could disable qpw if 
> wanted, or enable outside isolcpu scenarios for any reason.

Do you know any such users? Or suspect a potential usecase? If not we can still
add that option later. It's probably better than sticking with a useless
parameter that we'll have to maintain forever.

> > > +#define qpw_lockdep_assert_held(lock)			\
> > > +	lockdep_assert_held(lock)
> > > +
> > > +#define queue_percpu_work_on(c, wq, qpw)		\
> > > +	queue_work_on(c, wq, &(qpw)->work)
> > 
> > qpw_queue_work_on() ?
> > 
> > Perhaps even better would be qpw_queue_work_for(), leaving some room for
> > mystery about where/how the work will be executed :-)
> > 
> 
> QPW comes from Queue PerCPU Work
> Having it called qpw_queue_work_{on,for}() would be repetitive

Well, qpw_ just becomes the name of the subsystem and its prefix for APIs.
For example qpw_lock() doesn't mean that we queue and lock, it only means we lock.

> But having qpw_on() or qpw_for() would be misleading :) 
> 
> That's why I went with queue_percpu_work_on() based on how we have the 
> original function (queue_work_on) being called.

That's much more misleading, as it doesn't refer to qpw at all, and it only
suggests that it's queueing a per-cpu workqueue.

> > Perhaps that too should just be selected automatically by CONFIG_NO_HZ_FULL and if
> > the need arise in the future, make it visible to the user?
> > 
> 
> I think it would be good to have this, and let whoever is building have the 
> chance to disable QPW if it doesn't work well for their machines or 
> workload, without having to add a new boot parameter to continue have 
> their stuff working as always after a kernel update.
> 
> But that is open to discussion :)

Ok I guess we can stick with the Kconfig at least in the beginning.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-16 10:55       ` Vlastimil Babka (SUSE)
@ 2026-03-23  0:51         ` Leonardo Bras
  0 siblings, 0 replies; 32+ messages in thread
From: Leonardo Bras @ 2026-03-23  0:51 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Leonardo Bras, Marcelo Tosatti, linux-kernel, linux-mm,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Hyeonggon Yoo, Thomas Gleixner,
	Waiman Long, Boqun Feun, Frederic Weisbecker

On Mon, Mar 16, 2026 at 11:55:46AM +0100, Vlastimil Babka (SUSE) wrote:
> On 3/15/26 18:37, Leonardo Bras wrote:
> > On Wed, Mar 11, 2026 at 08:58:05AM +0100, Vlastimil Babka (SUSE) wrote:
> >> On 3/2/26 16:49, Marcelo Tosatti wrote:
> >> > Index: linux/Documentation/admin-guide/kernel-parameters.txt
> >> > ===================================================================
> >> > --- linux.orig/Documentation/admin-guide/kernel-parameters.txt
> >> > +++ linux/Documentation/admin-guide/kernel-parameters.txt
> >> > @@ -2840,6 +2840,16 @@ Kernel parameters
> >> >  
> >> >  			The format of <cpu-list> is described above.
> >> >  
> >> > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> >> > +			and remote interference mechanism on a kernel built with
> >> > +			CONFIG_QPW.
> >> > +			Format: { "0" | "1" }
> >> > +			0 - local_lock() + queue_work_on(remote_cpu)
> >> > +			1 - spin_lock() for both local and remote operations
> >> > +
> >> > +			Selecting 1 may be interesting for systems that want
> >> > +			to avoid interruption & context switches from IPIs.
> >> Requiring a new boot option is always a nuisance. The cpu isolation is
> >> AFAIK difficult enough to setup already. Could the default be that qpw will
> >> auto-enable if there are isolated cpus configured? The option could still be
> >> useful for overriding that automatic decision to both 0 and 1 for testing
> >> etc, but not required for the expected usecase?
> > 
> > 
> > I think it's okay, as something like this?
> > (should work for nohz_full and isolcpus)
> > 
> > ######
> > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > index 81bc8b329ef17..6c9052c28e3e4 100644
> > --- a/kernel/sched/isolation.c
> > +++ b/kernel/sched/isolation.c
> > @@ -170,20 +170,23 @@ static int __init housekeeping_setup(char *str, unsigned long flags)
> >                 for_each_set_bit(type, &iter_flags, HK_TYPE_MAX)
> >                         housekeeping_setup_type(type, housekeeping_staging);
> >         }
> >  
> >         if ((flags & HK_FLAG_KERNEL_NOISE) && !(housekeeping.flags & HK_FLAG_KERNEL_NOISE))
> >                 tick_nohz_full_setup(non_housekeeping_mask);
> >  
> >         housekeeping.flags |= flags;
> >         err = 1;
> >  
> > +       if (IS_ENABLED(CONFIG_QPW_DEFAULT))
> > +               qpw_setup("1");
> > +
> >  free_housekeeping_staging:
> >         free_bootmem_cpumask_var(housekeeping_staging);
> >  free_non_housekeeping_mask:
> >         free_bootmem_cpumask_var(non_housekeeping_mask);
> >  
> >         return err;
> >  }
> > ######
> > 
> > We would only have to be sure that this runs before cmdline parses qpw=?, 
> 
> I'm not sure it's possible to achieve this ordering with __setup calls,
> unless one of them is early, and then it might be too early to do the
> necessary action.
> 
> > so user could disable qpw if wanted.
> > 
> > Would that work?
> 
> The pattern I'm familiar with is collecting all related params via
> early_param() setting some variables, and then an init call (not tied to any
> of the param) looks at those variables and does whatever is necessary.
> 
> > Thanks!
> > Leo
> > 
> > 
> > 
> 

Makes sense, will take a look at that approach.

Thanks!
Leo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-17 13:33       ` Frederic Weisbecker
@ 2026-03-23  1:38         ` Leonardo Bras
  2026-03-24 11:54           ` Frederic Weisbecker
  2026-03-23 14:36         ` Marcelo Tosatti
  1 sibling, 1 reply; 32+ messages in thread
From: Leonardo Bras @ 2026-03-23  1:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Leonardo Bras, Marcelo Tosatti, linux-kernel, linux-mm,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Thomas Gleixner, Waiman Long, Boqun Feng

On Tue, Mar 17, 2026 at 02:33:50PM +0100, Frederic Weisbecker wrote:
> Le Sun, Mar 15, 2026 at 03:10:27PM -0300, Leonardo Bras a écrit :
> > On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> > > I find this part of the semantic a bit weird. If we eventually queue
> > > the work, why do we care about doing a local_lock() locally ?
> > 
> > (Sorry, not sure if I was able to understand the question.)
> > 
> > Local locks make sure a per-cpu procedure happens on the same CPU from 
> > start to end. Using migrate_disable & using per-cpu spinlocks on RT and 
> > doing preempt_disable in non_RT.
> > 
> > Most of the cases happen to have the work done in the local cpu, and just 
> > a few procedures happen to be queued remotely, such as remote cache 
> > draining. 
> > 
> > Even with the new 'local_qpw_lock()' which is faster for cases we are sure 
> > to have local usages, on qpw=0 we have to make qpw_lock() a local_lock as 
> > well, as the cpu receiving the scheduled work needs to make sure to run it 
> > all without moving to a different cpu.
> 
> But queue_work_on() already makes sure the work doesn't move to a different CPU
> (provided hotplug is correctly handled for the work).
> 
> Looks like we are both confused, so let's take a practical example. Suppose
> CPU 0 queues a work to CPU 1 which sets a per-cpu variable named A to the value
> "1". We want to guarantee that further reads of that per-cpu value by CPU 1
> see the new value. With qpw=1, it looks like this:
> 
> CPU 0                                               CPU 1
> -----                                               -----
> 
> qpw_lock(CPU 1)
>    spin_lock(&QPW_CPU1)
> qpw_queue_for(write_A, 1)
>     write_A()
>        A1 = per_cpu_ptr(&A, 1)
>        *A1 = 1
> qpw_unlock(CPU 1)
>     spin_unlock(&QPW_CPU1)
>                                                    read_A()
>                                                        qpw_lock(CPU 1)
>                                                            spin_lock(&QPW_CPU1)
>                                                        r0 = __this_cpu_read(&A)
>                                                        qpw_unlock(CPU 1)
>                                                            spin_unlock(&QPW_CPU1)
>                                                    
> 
> CPU 0 took the spinlock while writing to A, so CPU 1 is guaranteed to further
> observe the new value because it takes the same spinlock (r0 == 1)
> 

Here, if we are on CPU 0, we should never take qpw_lock(CPU 1) unless we 
are inside queue_percpu_work_on().

Maybe I am not getting your use case :/

Also, I don't see a case where we would need to call 
queue_percpu_work_on() inside a qpw_lock(). That could be dangerous, as 
another cpu could be doing the same thing and cause a deadlock:

CPU 0 				CPU 1
qpw_lock(0)			qpw_lock(1)
...				...
queue_percpu_work_on()		queue_percpu_work_on()
	qpw_lock(1)			qpw_lock(0)


> Now look at the qpw=0 case:
>                                   
> CPU 0                                               CPU 1
> -----                                               -----
> 
> qpw_lock(CPU 1)
>    local_lock(&QPW_CPU0)
> qpw_queue_for(write_A, 1)
>     queue_work_on(write_A, CPU 1)
> qpw_unlock(CPU 1)
>     local_unlock(&QPW_CPU0)
>                                                    // workqueue
>                                                    write_A()
>                                                        qpw_lock(CPU 1)
>                                                            local_lock(&QPW_CPU1)
>                                                        A1 = per_cpu_ptr(&A, 1)
>                                                        *A1 = 1
>                                                        qpw_unlock(CPU 1)
>                                                            local_unlock(&QPW_CPU1)
> 
>                                                    read_A()
>                                                        qpw_lock(CPU 1)
>                                                            local_lock(&QPW_CPU1)
>                                                        r0 = __this_cpu_read(&A)
>                                                        qpw_unlock(CPU 1)
>                                                            local_unlock(&QPW_CPU1)
> 
> Here CPU 0 queues the work on CPU 1 which writes and reads the new value
> (r0 == 1). local_lock() / preempt_disable() makes sure the CPU doesn't change.
> 
> But what is the point in doing local_lock(&QPW_CPU0) on CPU 0 ?

I can't see a case where one would need to hold the qpw_lock while 
calling queue_percpu_work_on(). Holding the qpw_lock() (as with 
local_lock()) should be done only when working on data particular to that 
cpu's structures. Queueing work on another CPU while touching this cpu's 
data is unexpected to me.



> 
> 
> > > > 
> > > > @@ -2840,6 +2840,16 @@ Kernel parameters
> > > >  
> > > >  			The format of <cpu-list> is described above.
> > > >  
> > > > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> > > > +			and remote interference mechanism on a kernel built with
> > > > +			CONFIG_QPW.
> > > > +			Format: { "0" | "1" }
> > > > +			0 - local_lock() + queue_work_on(remote_cpu)
> > > > +			1 - spin_lock() for both local and remote operations
> > > > +
> > > > +			Selecting 1 may be interesting for systems that want
> > > > +			to avoid interruption & context switches from IPIs.
> > > 
> > > Like Vlastimil suggested, it would be better to just have it off by default
> > > and turn it on only if nohz_full= is passed. Then we can consider introducing
> > > the parameter later if the need arise.
> > 
> > I agree with having it enabled with isolcpus/nohz_full, but I would 
> > recommend having this option anyway, as the user could disable qpw if 
> > wanted, or enable outside isolcpu scenarios for any reason.
> 
> Do you know any such users? Or suspect a potential usecase? If not we can still
> add that option later. It's probably better than sticking with a useless
> parameter that we'll have to maintain forever.

Off the top of my head, I can only think of an HPC scenario where the 
user wants to use the regular/RT scheduler for many small workloads, but 
doesn't like the impact of IPIs in those cases. Systems that push memory 
to its limit would also benefit, for example if caches get drained 
remotely very often.

None of those will necessarily need or benefit from isolcpus, and they 
may want to just use the kernel scheduler policies.

> 
> > > > +#define qpw_lockdep_assert_held(lock)			\
> > > > +	lockdep_assert_held(lock)
> > > > +
> > > > +#define queue_percpu_work_on(c, wq, qpw)		\
> > > > +	queue_work_on(c, wq, &(qpw)->work)
> > > 
> > > qpw_queue_work_on() ?
> > > 
> > > Perhaps even better would be qpw_queue_work_for(), leaving some room for
> > > mystery about where/how the work will be executed :-)
> > > 
> > 
> > QPW comes from Queue PerCPU Work
> > Having it called qpw_queue_work_{on,for}() would be repetitive
> 
> Well, qpw_ just becomes the name of the subsystem and its prefix for APIs.
> For example qpw_lock() doesn't mean that we queue and lock, it only means we lock.
> 

Locks for queueing per-cpu work. :D

> > But having qpw_on() or qpw_for() would be misleading :) 
> > 
> > That's why I went with queue_percpu_work_on() based on how we have the 
> > original function (queue_work_on) being called.
> 
> That's much more misleading at it doesn't refer to qpw at all and it only
> suggest that it's a queueing a per-cpu workqueue.
> 

Hmm, maybe qpw_queue_for/on()?

Or maybe change the name of the API to pw:
pw_lock()/pw_unlock()
pw_queue()
pw_flush()

and so on?

That way it stays true to what it means :)


> > > Perhaps that too should just be selected automatically by CONFIG_NO_HZ_FULL and if
> > > the need arise in the future, make it visible to the user?
> > > 
> > 
> > I think it would be good to have this, and let whoever is building have the 
> > chance to disable QPW if it doesn't work well for their machines or 
> > workload, without having to add a new boot parameter to continue have 
> > their stuff working as always after a kernel update.
> > 
> > But that is open to discussion :)
> 
> Ok I guess we can stick with the Kconfig at least in the beginning.
> 
> Thanks.
> 
> -- 
> Frederic Weisbecker
> SUSE Labs


Thanks!
Leo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-17 13:33       ` Frederic Weisbecker
  2026-03-23  1:38         ` Leonardo Bras
@ 2026-03-23 14:36         ` Marcelo Tosatti
  1 sibling, 0 replies; 32+ messages in thread
From: Marcelo Tosatti @ 2026-03-23 14:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Leonardo Bras, linux-kernel, linux-mm, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Thomas Gleixner,
	Waiman Long, Boqun Feng

On Tue, Mar 17, 2026 at 02:33:50PM +0100, Frederic Weisbecker wrote:
> On Sun, Mar 15, 2026 at 03:10:27PM -0300, Leonardo Bras wrote:
> > On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> > > I find this part of the semantic a bit weird. If we eventually queue
> > > the work, why do we care about doing a local_lock() locally ?
> > 
> > (Sorry, not sure if I was able to understand the question.)
> > 
> > Local locks make sure a per-cpu procedure happens on the same CPU from 
> > start to end, by using migrate_disable() & per-cpu spinlocks on RT and 
> > preempt_disable() on non-RT.
> > 
> > Most of the cases happen to have the work done in the local cpu, and just 
> > a few procedures happen to be queued remotely, such as remote cache 
> > draining. 
> > 
> > Even with the new 'local_qpw_lock()' which is faster for cases we are sure 
> > to have local usages, on qpw=0 we have to make qpw_lock() a local_lock as 
> > well, as the cpu receiving the scheduled work needs to make sure to run it 
> > all without moving to a different cpu.
> 
> But queue_work_on() already makes sure the work doesn't move to a different CPU
> (provided hotplug is correctly handled for the work).

commit b01b2141999936ac3e4746b7f76c0f204ae4b445
Author: Ingo Molnar <mingo@kernel.org>
Date:   Wed May 27 22:11:15 2020 +0200

    mm/swap: Use local_lock for protection

    The various struct pagevec per CPU variables are protected by disabling
    either preemption or interrupts across the critical sections. Inside
    these sections spinlocks have to be acquired.

    These spinlocks are regular spinlock_t types which are converted to
    "sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
    locks cannot be acquired in preemption or interrupt disabled sections.

    local locks provide a trivial way to substitute preempt and interrupt
    disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps
    to preempt_disable() and local_lock_irq() to local_irq_disable().

    Create lru_rotate_pvecs containing the pagevec and the locallock.
    Create lru_pvecs containing the remaining pagevecs and the locallock.
    Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
    exporting the pvec structure.

    Change the relevant call sites to acquire these locks instead of using
    preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
    local_irq_save().

    There is neither a functional change nor a change in the generated
    binary code for non PREEMPT_RT enabled non-debug kernels.

    When lockdep is enabled local locks have lockdep maps embedded. These
    allow lockdep to validate the protections, i.e. inappropriate usage of a
    preemption only protected sections would result in a lockdep warning
    while the same problem would not be noticed with a plain
    preempt_disable() based protection.

    local locks also improve readability as they provide a named scope for
    the protections while preempt/interrupt disable are opaque scopeless.

    Finally local locks allow PREEMPT_RT to substitute them with real
    locking primitives to ensure the correctness of operation in a fully
    preemptible kernel.

    [ bigeasy: Adopted to use local_lock ]

> Looks like we are both confused, so let's take a practical example. Suppose
> CPU 0 queues a work to CPU 1 which sets a per-cpu variable named A to the value
> "1". We want to guarantee that further reads of that per-cpu value by CPU 1
> see the new value. With qpw=1, it looks like this:
> 
> CPU 0                                               CPU 1
> -----                                               -----
> 
> qpw_lock(CPU 1)
>    spin_lock(&QPW_CPU1)
> qpw_queue_for(write_A, 1)
>     write_A()
>        A1 = per_cpu_ptr(&A, 1)
>        *A1 = 1
> qpw_unlock(CPU 1)
>     spin_unlock(&QPW_CPU1)
>                                                    read_A()
>                                                        qpw_lock(CPU 1)
>                                                            spin_lock(&QPW_CPU1)
>                                                        r0 = __this_cpu_read(&A)
>                                                        qpw_unlock(CPU 1)
>                                                            spin_unlock(&QPW_CPU1)
>                                                    
> 
> CPU 0 took the spinlock while writing to A, so CPU 1 is guaranteed to further
> observe the new value because it takes the same spinlock (r0 == 1)
> 
> Now look at the qpw=0 case:
>                                   
> CPU 0                                               CPU 1
> -----                                               -----
> 
> qpw_lock(CPU 1)
>    local_lock(&QPW_CPU0)
> qpw_queue_for(write_A, 1)
>     queue_work_on(write_A, CPU 1)
> qpw_unlock(CPU 1)
>     local_unlock(&QPW_CPU0)
>                                                    // workqueue
>                                                    write_A()
>                                                        qpw_lock(CPU 1)
>                                                            local_lock(&QPW_CPU1)
>                                                        A1 = per_cpu_ptr(&A, 1)
>                                                        *A1 = 1
>                                                        qpw_unlock(CPU 1)
>                                                            local_unlock(&QPW_CPU1)
> 
>                                                    read_A()
>                                                        qpw_lock(CPU 1)
>                                                            local_lock(&QPW_CPU1)
>                                                        r0 = __this_cpu_read(&A)
>                                                        qpw_unlock(CPU 1)
>                                                            local_unlock(&QPW_CPU1)
> 
> Here CPU 0 queues the work on CPU 1 which writes and reads the new value
> (r0 == 1). local_lock() / preempt_disable() makes sure the CPU doesn't change.
> 
> But what is the point in doing local_lock(&QPW_CPU0) on CPU 0 ?

To protect certain structures that are protected by
preempt_disable (non-RT) and migrate_disable (RT).

> > > > 
> > > > @@ -2840,6 +2840,16 @@ Kernel parameters
> > > >  
> > > >  			The format of <cpu-list> is described above.
> > > >  
> > > > +	qpw=		[KNL,SMP] Select a behavior on per-CPU resource sharing
> > > > +			and remote interference mechanism on a kernel built with
> > > > +			CONFIG_QPW.
> > > > +			Format: { "0" | "1" }
> > > > +			0 - local_lock() + queue_work_on(remote_cpu)
> > > > +			1 - spin_lock() for both local and remote operations
> > > > +
> > > > +			Selecting 1 may be interesting for systems that want
> > > > +			to avoid interruption & context switches from IPIs.
> > > 
> > > Like Vlastimil suggested, it would be better to just have it off by default
> > > and turn it on only if nohz_full= is passed. Then we can consider introducing
> > > the parameter later if the need arises.
> > 
> > I agree with having it enabled with isolcpus/nohz_full, but I would 
> > recommend having this option anyway, as the user could disable qpw if 
> > wanted, or enable outside isolcpu scenarios for any reason.
> 
> Do you know any such users? Or suspect a potential usecase? If not we can still
> add that option later. It's probably better than sticking with a useless
> parameter that we'll have to maintain forever.

Someone that does not boot with isolcpus= but uses cgroups for CPU
isolation?

> > > > +#define qpw_lockdep_assert_held(lock)			\
> > > > +	lockdep_assert_held(lock)
> > > > +
> > > > +#define queue_percpu_work_on(c, wq, qpw)		\
> > > > +	queue_work_on(c, wq, &(qpw)->work)
> > > 
> > > qpw_queue_work_on() ?
> > > 
> > > Perhaps even better would be qpw_queue_work_for(), leaving some room for
> > > mystery about where/how the work will be executed :-)
> > > 
> > 
> > QPW comes from Queue PerCPU Work
> > Having it called qpw_queue_work_{on,for}() would be repetitive
> 
> Well, qpw_ just becomes the name of the subsystem and its prefix for APIs.
> For example qpw_lock() doesn't mean that we queue and lock, it only means we lock.
> 
> > But having qpw_on() or qpw_for() would be misleading :) 
> > 
> > That's why I went with queue_percpu_work_on() based on how we have the 
> > original function (queue_work_on) being called.
> 
> That's much more misleading as it doesn't refer to qpw at all and it only
> suggests that it's queueing onto a per-cpu workqueue.
> 
> > > Perhaps that too should just be selected automatically by CONFIG_NO_HZ_FULL and if
> > > the need arises in the future, make it visible to the user?
> > > 
> > 
> > I think it would be good to have this, and let whoever is building have the 
> > chance to disable QPW if it doesn't work well for their machines or 
> > workload, without having to add a new boot parameter to continue having 
> > their stuff working as always after a kernel update.
> > 
> > But that is open to discussion :)
> 
> Ok I guess we can stick with the Kconfig at least in the beginning.
> 
> Thanks.
> 
> -- 
> Frederic Weisbecker
> SUSE Labs
> 
> 




* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-23  1:38         ` Leonardo Bras
@ 2026-03-24 11:54           ` Frederic Weisbecker
  2026-03-24 22:06             ` Leonardo Bras
  0 siblings, 1 reply; 32+ messages in thread
From: Frederic Weisbecker @ 2026-03-24 11:54 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Marcelo Tosatti, linux-kernel, linux-mm, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo, Thomas Gleixner,
	Waiman Long, Boqun Feng

On Sun, Mar 22, 2026 at 10:38:56PM -0300, Leonardo Bras wrote:
> On Tue, Mar 17, 2026 at 02:33:50PM +0100, Frederic Weisbecker wrote:
> > Le Sun, Mar 15, 2026 at 03:10:27PM -0300, Leonardo Bras a écrit :
> > > On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> > > > I find this part of the semantic a bit weird. If we eventually queue
> > > > the work, why do we care about doing a local_lock() locally ?
> > > 
> > > (Sorry, not sure if I was able to understand the question.)
> > > 
> > > Local locks make sure a per-cpu procedure happens on the same CPU from 
> > > start to end, by using migrate_disable() & per-cpu spinlocks on RT and 
> > > preempt_disable() on non-RT.
> > > 
> > > Most of the cases happen to have the work done in the local cpu, and just 
> > > a few procedures happen to be queued remotely, such as remote cache 
> > > draining. 
> > > 
> > > Even with the new 'local_qpw_lock()' which is faster for cases we are sure 
> > > to have local usages, on qpw=0 we have to make qpw_lock() a local_lock as 
> > > well, as the cpu receiving the scheduled work needs to make sure to run it 
> > > all without moving to a different cpu.
> > 
> > But queue_work_on() already makes sure the work doesn't move to a different CPU
> > (provided hotplug is correctly handled for the work).
> > 
> > Looks like we are both confused, so let's take a practical example. Suppose
> > CPU 0 queues a work to CPU 1 which sets a per-cpu variable named A to the value
> > "1". We want to guarantee that further reads of that per-cpu value by CPU 1
> > see the new value. With qpw=1, it looks like this:
> > 
> > CPU 0                                               CPU 1
> > -----                                               -----
> > 
> > qpw_lock(CPU 1)
> >    spin_lock(&QPW_CPU1)
> > qpw_queue_for(write_A, 1)
> >     write_A()
> >        A1 = per_cpu_ptr(&A, 1)
> >        *A1 = 1
> > qpw_unlock(CPU 1)
> >     spin_unlock(&QPW_CPU1)
> >                                                    read_A()
> >                                                        qpw_lock(CPU 1)
> >                                                            spin_lock(&QPW_CPU1)
> >                                                        r0 = __this_cpu_read(&A)
> >                                                        qpw_unlock(CPU 1)
> >                                                            spin_unlock(&QPW_CPU1)
> >                                                    
> > 
> > CPU 0 took the spinlock while writing to A, so CPU 1 is guaranteed to further
> > observe the new value because it takes the same spinlock (r0 == 1)
> > 
> 
> Here, if we are in CPU0 we should never take the qpw_lock(CPU1) unless we 
> are inside queue_percpu_work_on().
> 
> Maybe I am not getting your use case :/
> 
> Also, I don't see a case where we would need to call 
> queue_percpu_work_on() inside a qpw_lock(). This could be dangerous as it 
> could be the case in another cpu and cause a deadlock:
> 
> CPU 0 				CPU 1
> qpw_lock(0)			qpw_lock(1)
> ...				...
> queue_percpu_work_on()		queue_percpu_work_on()
> 	qpw_lock(1)			qpw_lock(0)

Ok I just checked the practical usecase in the patchset and it was me not
getting your usecase. The qpw lock is used inside the work itself. And now
that makes sense.

> 
> 
> > Now look at the qpw=0 case:
> >                                   
> > CPU 0                                               CPU 1
> > -----                                               -----
> > 
> > qpw_lock(CPU 1)
> >    local_lock(&QPW_CPU0)
> > qpw_queue_for(write_A, 1)
> >     queue_work_on(write_A, CPU 1)
> > qpw_unlock(CPU 1)
> >     local_unlock(&QPW_CPU0)
> >                                                    // workqueue
> >                                                    write_A()
> >                                                        qpw_lock(CPU 1)
> >                                                            local_lock(&QPW_CPU1)
> >                                                        A1 = per_cpu_ptr(&A, 1)
> >                                                        *A1 = 1
> >                                                        qpw_unlock(CPU 1)
> >                                                            local_unlock(&QPW_CPU1)
> > 
> >                                                    read_A()
> >                                                        qpw_lock(CPU 1)
> >                                                            local_lock(&QPW_CPU1)
> >                                                        r0 = __this_cpu_read(&A)
> >                                                        qpw_unlock(CPU 1)
> >                                                            local_unlock(&QPW_CPU1)
> > 
> > Here CPU 0 queues the work on CPU 1 which writes and reads the new value
> > (r0 == 1). local_lock() / preempt_disable() makes sure the CPU doesn't change.
> > 
> > But what is the point in doing local_lock(&QPW_CPU0) on CPU 0 ?
> 
> I can't see the case where one would need to hold the qpw_lock while 
> calling queue_percpu_work_on(). Holding the qpw_lock() (as is the case of
> local_lock()) should be done only when one is working on data particular to 
> that cpu structures. Queuing work on other CPU while touching this cpu data 
> is unexpected to me.

Yep!

> > > > Like Vlastimil suggested, it would be better to just have it off by default
> > > > and turn it on only if nohz_full= is passed. Then we can consider introducing
> > > > the parameter later if the need arises.
> > > 
> > > I agree with having it enabled with isolcpus/nohz_full, but I would 
> > > recommend having this option anyway, as the user could disable qpw if 
> > > wanted, or enable outside isolcpu scenarios for any reason.
> > 
> > Do you know any such users? Or suspect a potential usecase? If not we can still
> > add that option later. It's probably better than sticking with a useless
> > parameter that we'll have to maintain forever.
> 
> Out of my head, I can think only on HPC scenario where user wants to make 
> use of the regular/RT scheduler for many small workloads, but doesn't like 
> the impact of IPI on those cases.

There are many more IPIs to care about then. I suspect the issue would be more
about the workqueue itself.

> Such systems that explore memory at its 
> limit will also benefit from those, for example, if cache gets drained 
> remotely very often.
> 
> None of those necessarily will need to or benefit from isolcpus, and may 
> want to just use the kernel scheduler policies.

This sounds like "just in case" usecases that could be dealt with later if
needed. But like Marcelo said, those who want to rely on cpuset isolated
partitions would need to enable that on boot.

> > > QPW comes from Queue PerCPU Work
> > > Having it called qpw_queue_work_{on,for}() would be repetitive
> > 
> > Well, qpw_ just becomes the name of the subsystem and its prefix for APIs.
> > For example qpw_lock() doesn't mean that we queue and lock, it only means we lock.
> > 
> 
> Locks for queue'ing per-cpu work. :D

Right!

> 
> > > But having qpw_on() or qpw_for() would be misleading :) 
> > > 
> > > That's why I went with queue_percpu_work_on() based on how we have the 
> > > original function (queue_work_on) being called.
> > 
> > That's much more misleading as it doesn't refer to qpw at all and it only
> > suggests that it's queueing onto a per-cpu workqueue.
> > 
> 
> Humm, maybe qpw_queue_for/on()?
> 
> > Or maybe change the name of the API to pw:
> pw_lock()/unlock
> pw_queue();
> pw_flush()
> 
> and so on?
> 
> > That way it stays true to what it means :)

Would better to keep the same prefix for all APIs :-)

-- 
Frederic Weisbecker
SUSE Labs



* Re: [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work
  2026-03-24 11:54           ` Frederic Weisbecker
@ 2026-03-24 22:06             ` Leonardo Bras
  0 siblings, 0 replies; 32+ messages in thread
From: Leonardo Bras @ 2026-03-24 22:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Leonardo Bras, Marcelo Tosatti, linux-kernel, linux-mm,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Vlastimil Babka, Hyeonggon Yoo,
	Thomas Gleixner, Waiman Long, Boqun Feng

On Tue, Mar 24, 2026 at 12:54:17PM +0100, Frederic Weisbecker wrote:
> On Sun, Mar 22, 2026 at 10:38:56PM -0300, Leonardo Bras wrote:
> > On Tue, Mar 17, 2026 at 02:33:50PM +0100, Frederic Weisbecker wrote:
> > On Sun, Mar 15, 2026 at 03:10:27PM -0300, Leonardo Bras wrote:
> > > > On Fri, Mar 13, 2026 at 10:55:47PM +0100, Frederic Weisbecker wrote:
> > > > > I find this part of the semantic a bit weird. If we eventually queue
> > > > > the work, why do we care about doing a local_lock() locally ?
> > > > 
> > > > (Sorry, not sure if I was able to understand the question.)
> > > > 
> > > > Local locks make sure a per-cpu procedure happens on the same CPU from 
> > > > start to end, by using migrate_disable() & per-cpu spinlocks on RT and 
> > > > preempt_disable() on non-RT.
> > > > 
> > > > Most of the cases happen to have the work done in the local cpu, and just 
> > > > a few procedures happen to be queued remotely, such as remote cache 
> > > > draining. 
> > > > 
> > > > Even with the new 'local_qpw_lock()' which is faster for cases we are sure 
> > > > to have local usages, on qpw=0 we have to make qpw_lock() a local_lock as 
> > > > well, as the cpu receiving the scheduled work needs to make sure to run it 
> > > > all without moving to a different cpu.
> > > 
> > > But queue_work_on() already makes sure the work doesn't move to a different CPU
> > > (provided hotplug is correctly handled for the work).
> > > 
> > > Looks like we are both confused, so let's take a practical example. Suppose
> > > CPU 0 queues a work to CPU 1 which sets a per-cpu variable named A to the value
> > > "1". We want to guarantee that further reads of that per-cpu value by CPU 1
> > > see the new value. With qpw=1, it looks like this:
> > > 
> > > CPU 0                                               CPU 1
> > > -----                                               -----
> > > 
> > > qpw_lock(CPU 1)
> > >    spin_lock(&QPW_CPU1)
> > > qpw_queue_for(write_A, 1)
> > >     write_A()
> > >        A1 = per_cpu_ptr(&A, 1)
> > >        *A1 = 1
> > > qpw_unlock(CPU 1)
> > >     spin_unlock(&QPW_CPU1)
> > >                                                    read_A()
> > >                                                        qpw_lock(CPU 1)
> > >                                                            spin_lock(&QPW_CPU1)
> > >                                                        r0 = __this_cpu_read(&A)
> > >                                                        qpw_unlock(CPU 1)
> > >                                                            spin_unlock(&QPW_CPU1)
> > >                                                    
> > > 
> > > CPU 0 took the spinlock while writing to A, so CPU 1 is guaranteed to further
> > > observe the new value because it takes the same spinlock (r0 == 1)
> > > 
> > 
> > Here, if we are in CPU0 we should never take the qpw_lock(CPU1) unless we 
> > are inside queue_percpu_work_on().
> > 
> > Maybe I am not getting your use case :/
> > 
> > Also, I don't see a case where we would need to call 
> > queue_percpu_work_on() inside a qpw_lock(). This could be dangerous as it 
> > could be the case in another cpu and cause a deadlock:
> > 
> > CPU 0 				CPU 1
> > qpw_lock(0)			qpw_lock(1)
> > ...				...
> > queue_percpu_work_on()		queue_percpu_work_on()
> > 	qpw_lock(1)			qpw_lock(0)
> 
> Ok I just checked the practical usecase in the patchset and it was me not
> getting your usecase. The qpw lock is used inside the work itself. And now
> that makes sense.
> 
> > 
> > 
> > > Now look at the qpw=0 case:
> > >                                   
> > > CPU 0                                               CPU 1
> > > -----                                               -----
> > > 
> > > qpw_lock(CPU 1)
> > >    local_lock(&QPW_CPU0)
> > > qpw_queue_for(write_A, 1)
> > >     queue_work_on(write_A, CPU 1)
> > > qpw_unlock(CPU 1)
> > >     local_unlock(&QPW_CPU0)
> > >                                                    // workqueue
> > >                                                    write_A()
> > >                                                        qpw_lock(CPU 1)
> > >                                                            local_lock(&QPW_CPU1)
> > >                                                        A1 = per_cpu_ptr(&A, 1)
> > >                                                        *A1 = 1
> > >                                                        qpw_unlock(CPU 1)
> > >                                                            local_unlock(&QPW_CPU1)
> > > 
> > >                                                    read_A()
> > >                                                        qpw_lock(CPU 1)
> > >                                                            local_lock(&QPW_CPU1)
> > >                                                        r0 = __this_cpu_read(&A)
> > >                                                        qpw_unlock(CPU 1)
> > >                                                            local_unlock(&QPW_CPU1)
> > > 
> > > Here CPU 0 queues the work on CPU 1 which writes and reads the new value
> > > (r0 == 1). local_lock() / preempt_disable() makes sure the CPU doesn't change.
> > > 
> > > But what is the point in doing local_lock(&QPW_CPU0) on CPU 0 ?
> > 
> > I can't see the case where one would need to hold the qpw_lock while 
> > calling queue_percpu_work_on(). Holding the qpw_lock() (as is the case of
> > local_lock()) should be done only when one is working on data particular to 
> > that cpu structures. Queuing work on other CPU while touching this cpu data 
> > is unexpected to me.
> 
> Yep!
> 
> > > > > Like Vlastimil suggested, it would be better to just have it off by default
> > > > > and turn it on only if nohz_full= is passed. Then we can consider introducing
> > > > > the parameter later if the need arises.
> > > > 
> > > > I agree with having it enabled with isolcpus/nohz_full, but I would 
> > > > recommend having this option anyway, as the user could disable qpw if 
> > > > wanted, or enable outside isolcpu scenarios for any reason.
> > > 
> > > Do you know any such users? Or suspect a potential usecase? If not we can still
> > > add that option later. It's probably better than sticking with a useless
> > > parameter that we'll have to maintain forever.
> > 
> > Out of my head, I can think only on HPC scenario where user wants to make 
> > use of the regular/RT scheduler for many small workloads, but doesn't like 
> > the impact of IPI on those cases.
> 
> There are many more IPIs to care about then. I suspect the issue would be more
> about the workqueue itself.

There are some mechanisms for workqueues to be offloaded to other CPUs if 
those are isolated; we could easily mimic that if wanted (or use isolcpus).

It's more about the locking strategies: some code uses local_lock + 
queue_work_on(), which is really effective in a lot of scenarios, but it 
relies on IPIs, which can be terrible in other scenarios.

QPW is about letting the user decide which locking strategy to use based on 
their workload :)
 
> > Such systems that explore memory at its 
> > limit will also benefit from those, for example, if cache gets drained 
> > remotely very often.
> > 
> > None of those necessarily will need to or benefit from isolcpus, and may 
> > want to just use the kernel scheduler policies.
> 
> This sounds like "just in case" usecases that could be dealt with later if
> needed. But like Marcelo said, those who want to rely on cpuset isolated
> partitions would need to enable that on boot.
> 

Agree, he could exemplify much better :)

> > > > QPW comes from Queue PerCPU Work
> > > > Having it called qpw_queue_work_{on,for}() would be repetitive
> > > 
> > > Well, qpw_ just becomes the name of the subsystem and its prefix for APIs.
> > > For example qpw_lock() doesn't mean that we queue and lock, it only means we lock.
> > > 
> > 
> > Locks for queue'ing per-cpu work. :D
> 
> Right!
> 
> > 
> > > > But having qpw_on() or qpw_for() would be misleading :) 
> > > > 
> > > > That's why I went with queue_percpu_work_on() based on how we have the 
> > > > original function (queue_work_on) being called.
> > > 
> > > That's much more misleading as it doesn't refer to qpw at all and it only
> > > suggests that it's queueing onto a per-cpu workqueue.
> > > 
> > 
> > Humm, maybe qpw_queue_for/on()?
> > 
> > Or maybe change the name of the API to pw:
> > pw_lock()/unlock
> > pw_queue();
> > pw_flush()
> > 
> > and so on?
> > 
> > That way it stays true to what it means :)
> 
> Would better to keep the same prefix for all APIs :-)
> 

Naming was always hard with this mechanism :D

Will try to come up with something meaningful and consistent across this and 
other APIs.

Thanks!
Leo



end of thread, other threads:[~2026-03-24 22:06 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-02 15:49 [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Marcelo Tosatti
2026-03-02 15:49 ` [PATCH v2 1/5] slab: distinguish lock and trylock for sheaf_flush_main() Marcelo Tosatti
2026-03-02 15:49 ` [PATCH v2 2/5] Introducing qpw_lock() and per-cpu queue & flush work Marcelo Tosatti
2026-03-03 12:03   ` Vlastimil Babka (SUSE)
2026-03-03 16:02     ` Marcelo Tosatti
2026-03-08 18:00       ` Leonardo Bras
2026-03-09 10:14         ` Vlastimil Babka (SUSE)
2026-03-11  0:16           ` Leonardo Bras
2026-03-11  7:58   ` Vlastimil Babka (SUSE)
2026-03-15 17:37     ` Leonardo Bras
2026-03-16 10:55       ` Vlastimil Babka (SUSE)
2026-03-23  0:51         ` Leonardo Bras
2026-03-13 21:55   ` Frederic Weisbecker
2026-03-15 18:10     ` Leonardo Bras
2026-03-17 13:33       ` Frederic Weisbecker
2026-03-23  1:38         ` Leonardo Bras
2026-03-24 11:54           ` Frederic Weisbecker
2026-03-24 22:06             ` Leonardo Bras
2026-03-23 14:36         ` Marcelo Tosatti
2026-03-02 15:49 ` [PATCH v2 3/5] mm/swap: move bh draining into a separate workqueue Marcelo Tosatti
2026-03-02 15:49 ` [PATCH v2 4/5] swap: apply new queue_percpu_work_on() interface Marcelo Tosatti
2026-03-02 15:49 ` [PATCH v2 5/5] slub: " Marcelo Tosatti
2026-03-03 11:15 ` [PATCH v2 0/5] Introduce QPW for per-cpu operations (v2) Frederic Weisbecker
2026-03-08 18:02   ` Leonardo Bras
2026-03-03 12:07 ` Vlastimil Babka (SUSE)
2026-03-05 16:55 ` Frederic Weisbecker
2026-03-06  1:47   ` Marcelo Tosatti
2026-03-10 21:34     ` Frederic Weisbecker
2026-03-10 17:12   ` Marcelo Tosatti
2026-03-10 22:14     ` Frederic Weisbecker
2026-03-11  1:18     ` Hillf Danton
2026-03-11  7:54     ` Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox