* [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW)
@ 2026-05-19 1:27 Leonardo Bras
2026-05-19 1:27 ` [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work Leonardo Bras
` (5 more replies)
0 siblings, 6 replies; 12+ messages in thread
From: Leonardo Bras @ 2026-05-19 1:27 UTC (permalink / raw)
To: Jonathan Corbet, Shuah Khan, Leonardo Bras, Peter Zijlstra,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Thomas Gleixner, Feng Tang, Dapeng Mi, Kees Cook,
Marco Elver, Jakub Kicinski, Li RongQing, Eric Biggers,
Paul E. McKenney, Nathan Chancellor, Miguel Ojeda, Nicolas Schier,
Thomas Weißschuh, Douglas Anderson, Gary Guo,
Christian Brauner, Pasha Tatashin, Masahiro Yamada, Coiby Xu,
Frederic Weisbecker
Cc: linux-doc, linux-kernel, linux-mm, linux-rt-devel
The problem:
Some places in the kernel implement a parallel programming strategy
consisting of local_locks() for most of the work, while some rare remote
operations are scheduled on the target cpu. This keeps cache bouncing low since
the cacheline tends to be mostly local, and avoids the cost of locks in non-RT
kernels, even though the very few remote operations will be expensive due
to scheduling overhead.
On the other hand, for RT workloads this can represent a problem: getting
an important workload scheduled out to deal with remote requests is
sure to introduce unexpected deadline misses.
The idea:
Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
In this case, instead of scheduling work on a remote cpu, it should
be safe to grab that remote cpu's per-cpu spinlock and run the required
work locally. The major cost, which is un/locking in every local function,
is already paid in PREEMPT_RT.
Also, there is no need to worry about extra cache bouncing:
The cacheline invalidation already happens due to schedule_work_on().
This will avoid schedule_work_on(), and thus avoid scheduling-out an
RT workload.
Proposed solution:
A new interface called Per-CPU Work (PW), which should replace
Work Queue in the above mentioned use case.
If CONFIG_PWLOCKS=n this interface just wraps the current
local_locks + WorkQueue behavior, so no runtime change is expected.
If CONFIG_PWLOCKS=y and the kernel boot option pwlocks=1 is given,
pw_queue_on(cpu,...) will lock that cpu's per-cpu structure
and perform the work on it locally.
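The two behaviors described above can be sketched as a small user-space model. This is only an illustration, not the kernel implementation: every name here (pcp_model, model_pw_queue_on(), model_pw_flush(), model_worker_run()) is hypothetical, the per-CPU spinlock is approximated with a C11 atomic_flag, and "the worker on the target cpu getting scheduled" is reduced to a plain function call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical stand-in for one cpu's per-CPU structure. */
struct pcp_model {
	atomic_flag lock;	/* models the per-CPU spinlock */
	int cached;		/* models items cached on this cpu */
	bool work_pending;	/* models a work item queued on this cpu */
};

static struct pcp_model pcp[MODEL_NR_CPUS];
static bool pw_spinlock_mode;	/* models booting with pwlocks=1 */

static void model_init(bool spinlock_mode)
{
	pw_spinlock_mode = spinlock_mode;
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		atomic_flag_clear(&pcp[cpu].lock);
		pcp[cpu].cached = 0;
		pcp[cpu].work_pending = false;
	}
}

static void model_lock(int cpu)
{
	while (atomic_flag_test_and_set_explicit(&pcp[cpu].lock,
						 memory_order_acquire))
		;	/* spin until the flag is released */
}

static void model_unlock(int cpu)
{
	atomic_flag_clear_explicit(&pcp[cpu].lock, memory_order_release);
}

/* The "work": drain the cache that belongs to @cpu. */
static void model_drain(int cpu)
{
	model_lock(cpu);
	pcp[cpu].cached = 0;
	model_unlock(cpu);
}

/* Models the worker on the target cpu finally being scheduled. */
static void model_worker_run(int cpu)
{
	if (pcp[cpu].work_pending) {
		model_drain(cpu);
		pcp[cpu].work_pending = false;
	}
}

static void model_pw_queue_on(int cpu)
{
	if (pw_spinlock_mode)
		model_drain(cpu);		/* runs on the caller */
	else
		pcp[cpu].work_pending = true;	/* target cpu runs it later */
}

/* In spinlock mode there is nothing left to wait for. */
static void model_pw_flush(int cpu)
{
	if (!pw_spinlock_mode)
		model_worker_run(cpu);	/* simplification of waiting for the worker */
}
```

In the spinlock mode the target cpu's data is already drained when model_pw_queue_on() returns, which is why the flush step has nothing to do; in the queueing mode the data stays untouched until the target cpu runs the work.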
v3->v4:
- Mechanism name changed from QPW to PW/PWLOCKS. Helper functions / API,
file names and config options renamed accordingly.
- All members of the Per-CPU Work API now start with the same prefix
(Frederic Weisbecker)
- Improved style a bit, reviewed documentation
v2->v3:
- Use preempt_disable/preempt_enable on !CONFIG_PREEMPT_RT (Vlastimil Babka).
- Improve documentation to include local_qpw_lock on operations table
(Leonardo Bras).
- Enable qpw=1 automatically if CPU isolation is enabled (Vlastimil Babka).
v1->v2:
- Introduce local_qpw_lock and unlock functions, move preempt_disable/
preempt_enable to it (Leonardo Bras). This reduces performance
overhead of the patch.
- Documentation and changelog typo fixes (Leonardo Bras).
- Fix places where preempt_disable/preempt_enable was not being
correctly performed.
- Add performance measurements.
RFC->v1:
- Introduce CONFIG_QPW and qpw= kernel boot option to enable
remote spinlocking and execution even on !CONFIG_PREEMPT_RT
kernels (Leonardo Bras).
- Move buffer_head draining to separate workqueue (Marcelo Tosatti).
- Convert mlock per-CPU page lists to QPW (Marcelo Tosatti).
- Drop memcontrol conversion (as isolated CPUs are not targets
of queue_work_on anymore).
- Rebase SLUB against Vlastimil's slab/next.
- Add basic document for QPW (Waiman Long).
The performance numbers, as measured by the following test program,
are as follows (v3, mechanics not changed since then):
CONFIG_PREEMPT_DYNAMIC=y
Unpatched kernel: 60 cycles
Patched kernel, CONFIG_QPW=n: 62 cycles
Patched kernel, CONFIG_QPW=y, qpw=0: 62 cycles
Patched kernel, CONFIG_QPW=y, qpw=1: 75 cycles
CONFIG_PREEMPT_RT:
Unpatched kernel: 95 cycles
Patched kernel, CONFIG_QPW=y, qpw=0: 99 cycles
Patched kernel, CONFIG_QPW=y, qpw=1: 97 cycles
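For reading the table above, the relative cost can be worked out from the cycle counts. The helper below is only a convenience sketch (overhead_pct() and print_overheads() are hypothetical names, not part of the series); the inputs are the per-kmalloc averages quoted above.

```c
#include <assert.h>
#include <stdio.h>

/* Relative overhead of @patched over @baseline, in percent. */
static double overhead_pct(double baseline, double patched)
{
	return (patched - baseline) * 100.0 / baseline;
}

/* Print the overhead of each patched configuration from the table. */
static void print_overheads(void)
{
	printf("PREEMPT_DYNAMIC, CONFIG_QPW=n:        %.1f%%\n",
	       overhead_pct(60, 62));
	printf("PREEMPT_DYNAMIC, CONFIG_QPW=y, qpw=0: %.1f%%\n",
	       overhead_pct(60, 62));
	printf("PREEMPT_DYNAMIC, CONFIG_QPW=y, qpw=1: %.1f%%\n",
	       overhead_pct(60, 75));
	printf("PREEMPT_RT, CONFIG_QPW=y, qpw=0:      %.1f%%\n",
	       overhead_pct(95, 99));
	printf("PREEMPT_RT, CONFIG_QPW=y, qpw=1:      %.1f%%\n",
	       overhead_pct(95, 97));
}
```

With these numbers, enabling the spinlock path costs 25% per kmalloc on the non-RT kernel (75 vs 60 cycles), while on PREEMPT_RT the difference is about 2% (97 vs 95 cycles), consistent with the locking cost already being paid there.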
kmalloc_bench.c:
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/timex.h>
#include <linux/preempt.h>
#include <linux/irqflags.h>
#include <linux/vmalloc.h>
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Gemini AI");
MODULE_DESCRIPTION("A simple kmalloc performance benchmark");
static int size = 64; // Default allocation size in bytes
module_param(size, int, 0644);
static int iterations = 9000000; // Default number of iterations
module_param(iterations, int, 0644);
static int __init kmalloc_bench_init(void) {
void **ptrs;
cycles_t start, end;
uint64_t total_cycles;
int i;
pr_info("kmalloc_bench: Starting test (size=%d, iterations=%d)\n", size, iterations);
// Allocate an array to store pointers to avoid immediate kfree-reuse optimization
ptrs = vmalloc(sizeof(void *) * iterations);
if (!ptrs) {
pr_err("kmalloc_bench: Failed to allocate pointer array\n");
return -ENOMEM;
}
preempt_disable();
start = get_cycles();
for (i = 0; i < iterations; i++) {
ptrs[i] = kmalloc(size, GFP_ATOMIC);
}
end = get_cycles();
total_cycles = end - start;
preempt_enable();
pr_info("kmalloc_bench: Total cycles for %d allocs: %llu\n", iterations, total_cycles);
pr_info("kmalloc_bench: Avg cycles per kmalloc: %llu\n", total_cycles / iterations);
// Cleanup
for (i = 0; i < iterations; i++) {
kfree(ptrs[i]);
}
vfree(ptrs);
return 0;
}
static void __exit kmalloc_bench_exit(void) {
pr_info("kmalloc_bench: Module unloaded\n");
}
module_init(kmalloc_bench_init);
module_exit(kmalloc_bench_exit);
The following testcase triggers lru_add_drain_all on an isolated CPU
(that does sys_write to a file before entering its realtime
loop).
/*
* Simulates a low latency loop program that is interrupted
* due to lru_add_drain_all. To trigger lru_add_drain_all, run:
*
* blockdev --flushbufs /dev/sdX
*
*/
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>
#include <stdarg.h>
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
int cpu;
static void *run(void *arg)
{
pthread_t current_thread;
cpu_set_t cpuset;
int ret, nrloops = 0;
struct sched_param sched_p;
pid_t pid;
int fd;
char buf[] = "xxxxxxxxxxx";
CPU_ZERO(&cpuset);
CPU_SET(cpu, &cpuset);
current_thread = pthread_self();
ret = pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
if (ret) {
perror("pthread_setaffinity_np failed\n");
exit(0);
}
memset(&sched_p, 0, sizeof(struct sched_param));
sched_p.sched_priority = 1;
pid = gettid();
ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
if (ret) {
perror("sched_setscheduler");
exit(0);
}
fd = open("/tmp/tmpfile", O_RDWR|O_CREAT|O_TRUNC, 0644);
if (fd == -1) {
perror("open");
exit(0);
}
ret = write(fd, buf, sizeof(buf));
if (ret == -1) {
perror("write");
exit(0);
}
do {
nrloops = nrloops+2;
nrloops--;
} while (1);
}
int main(int argc, char *argv[])
{
int fd, ret;
pthread_t thread;
long val;
char *endptr, *str;
struct sched_param sched_p;
pid_t pid;
if (argc != 2) {
printf("usage: %s cpu-nr\n", argv[0]);
printf("where CPU number is the CPU to pin thread to\n");
exit(0);
}
str = argv[1];
cpu = strtol(str, &endptr, 10);
if (cpu < 0) {
printf("strtol returns %d\n", cpu);
exit(0);
}
printf("cpunr=%d\n", cpu);
memset(&sched_p, 0, sizeof(struct sched_param));
sched_p.sched_priority = 1;
pid = getpid();
ret = sched_setscheduler(pid, SCHED_FIFO, &sched_p);
if (ret) {
perror("sched_setscheduler");
exit(0);
}
pthread_create(&thread, NULL, run, NULL);
sleep(5000);
pthread_join(thread, NULL);
}
Leonardo Bras (3):
Introducing pw_lock() and per-cpu queue & flush work
swap: apply new pw_queue_on() interface
slub: apply new pw_queue_on() interface
Marcelo Tosatti (1):
mm/swap: move bh draining into a separate workqueue
MAINTAINERS | 7 +
.../admin-guide/kernel-parameters.txt | 10 +
Documentation/locking/pwlocks.rst | 76 +++++
init/Kconfig | 35 +++
kernel/Makefile | 2 +
include/linux/pwlocks.h | 265 ++++++++++++++++++
mm/internal.h | 4 +-
kernel/pwlocks.c | 47 ++++
mm/mlock.c | 51 +++-
mm/page_alloc.c | 2 +-
mm/slub.c | 142 +++++-----
mm/swap.c | 109 ++++---
12 files changed, 624 insertions(+), 126 deletions(-)
create mode 100644 Documentation/locking/pwlocks.rst
create mode 100644 include/linux/pwlocks.h
create mode 100644 kernel/pwlocks.c
base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
--
2.54.0
* [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work
2026-05-19 1:27 [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW) Leonardo Bras
@ 2026-05-19 1:27 ` Leonardo Bras
2026-05-20 10:08 ` Frederic Weisbecker
2026-05-20 13:48 ` Sebastian Andrzej Siewior
2026-05-19 1:27 ` [PATCH v4 2/4] mm/swap: move bh draining into a separate workqueue Leonardo Bras
` (4 subsequent siblings)
5 siblings, 2 replies; 12+ messages in thread
From: Leonardo Bras @ 2026-05-19 1:27 UTC (permalink / raw)
To: Jonathan Corbet, Shuah Khan, Leonardo Bras, Peter Zijlstra,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, Frederic Weisbecker
Cc: linux-doc, linux-kernel, linux-mm, linux-rt-devel,
Marcelo Tosatti
Some places in the kernel implement a parallel programming strategy
consisting of local_locks() for most of the work, while some rare remote
operations are scheduled on the target cpu. This keeps cache bouncing low since
the cacheline tends to be mostly local, and avoids the cost of locks in non-RT
kernels, even though the very few remote operations will be expensive due
to scheduling overhead.
On the other hand, for RT workloads this can represent a problem:
scheduling work on remote cpus that are executing low latency tasks
is undesired and can introduce unexpected deadline misses.
It's interesting, though, that local_lock()s in RT kernels become
spinlock()s. We can make use of those to avoid scheduling work on a remote
cpu by directly updating another cpu's per_cpu structure, while holding
its spinlock().
In order to do that, it's necessary to introduce a new set of functions to
make it possible to get another cpu's per-cpu "local" lock (pw_{un,}lock*)
and also do the corresponding queueing (pw_queue_on()) and flushing
(pw_flush()) helpers to run the remote work.
Users of non-RT kernels with low latency requirements can select
similar functionality by using the CONFIG_PWLOCKS compile time option.
On CONFIG_PWLOCKS disabled kernels, no changes are expected, as every
one of the introduced helpers works exactly the same as the current
implementation:
pw_{un,}lock*() -> local_{un,}lock*() (ignores cpu parameter)
pw_queue_on() -> queue_work_on()
pw_flush() -> flush_work()
For PWLOCKS enabled kernels, though, pw_{un,}lock*() will use the extra
cpu parameter to select the correct per-cpu structure to work on,
and acquire the spinlock for that cpu.
pw_queue_on() will just call the requested function in the current
cpu, which will operate in another cpu's per-cpu object. Since the
local_locks() become spinlock()s in PWLOCKS enabled kernels, we are
safe doing that.
pw_flush() then becomes a no-op since no work is actually scheduled on a
remote cpu.
Some minimal code rework is needed in order to make this mechanism work:
The calls for local_{un,}lock*() on the functions that are currently
scheduled on remote cpus need to be replaced by pw_{un,}lock*(), so that on
PWLOCKS enabled kernels they can reference a different cpu. It's also
necessary to use a pw_struct instead of a work_struct, but it just
contains a work struct and, in CONFIG_PWLOCKS, the target cpu.
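How a work function can recover its target cpu from a pw_struct can be sketched in user space as follows. The struct layout and the container_of-based pw_get_cpu() mirror the CONFIG_PWLOCKS definitions in this patch, but work_struct is reduced to a bare function pointer, and init_pw(), sketch_pw_queue_on() and drain_fn() are hypothetical stand-ins for INIT_PW(), the spinlock-mode pw_queue_on() and a converted work function.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's work_struct. */
struct work_struct;
typedef void (*work_func_t)(struct work_struct *work);

struct work_struct {
	work_func_t func;
};

/* Same shape as the CONFIG_PWLOCKS variant of pw_struct. */
struct pw_struct {
	struct work_struct work;
	int cpu;	/* target cpu whose per-CPU data the work touches */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define pw_get_cpu(w) (container_of((w), struct pw_struct, work)->cpu)

/* Stand-in for INIT_PW(): record the function and the target cpu. */
static void init_pw(struct pw_struct *pw, work_func_t func, int cpu)
{
	pw->work.func = func;
	pw->cpu = cpu;
}

static int last_drained_cpu = -1;

/*
 * A converted work function: instead of assuming it runs on the target
 * cpu, it asks the pw_struct which cpu it is supposed to operate on.
 */
static void drain_fn(struct work_struct *w)
{
	last_drained_cpu = pw_get_cpu(w);
}

/* Spinlock-mode behavior: call the function directly on the caller. */
static void sketch_pw_queue_on(struct pw_struct *pw)
{
	pw->work.func(&pw->work);
}
```

Because the function may now run on a cpu other than the target, any smp_processor_id()-style assumption inside it has to be replaced by the cpu stored in the pw_struct.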
This should have almost no impact on non-CONFIG_PWLOCKS kernels: a few
this_cpu_ptr() will become per_cpu_ptr(,smp_processor_id()) on non-hotpath
functions.
On CONFIG_PWLOCKS kernels, this should avoid deadline misses by
removing scheduling noise.
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
MAINTAINERS | 7 +
.../admin-guide/kernel-parameters.txt | 10 +
Documentation/locking/pwlocks.rst | 76 +++++
init/Kconfig | 35 +++
kernel/Makefile | 2 +
include/linux/pwlocks.h | 265 ++++++++++++++++++
kernel/pwlocks.c | 47 ++++
7 files changed, 442 insertions(+)
create mode 100644 Documentation/locking/pwlocks.rst
create mode 100644 include/linux/pwlocks.h
create mode 100644 kernel/pwlocks.c
diff --git a/MAINTAINERS b/MAINTAINERS
index c2c6d79275c6..7102031207c9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21775,20 +21775,27 @@ QORIQ DPAA2 FSL-MC BUS DRIVER
M: Ioana Ciornei <ioana.ciornei@nxp.com>
L: linuxppc-dev@lists.ozlabs.org
L: linux-kernel@vger.kernel.org
S: Maintained
F: Documentation/ABI/stable/sysfs-bus-fsl-mc
F: Documentation/devicetree/bindings/misc/fsl,qoriq-mc.yaml
F: Documentation/networking/device_drivers/ethernet/freescale/dpaa2/overview.rst
F: drivers/bus/fsl-mc/
F: include/uapi/linux/fsl_mc.h
+PW Locks
+M: Leonardo Bras <leobras.c@gmail.com>
+S: Supported
+F: Documentation/locking/pwlocks.rst
+F: include/linux/pwlocks.h
+F: kernel/pwlocks.c
+
QT1010 MEDIA DRIVER
L: linux-media@vger.kernel.org
S: Orphan
W: https://linuxtv.org
Q: http://patchwork.linuxtv.org/project/linux-media/list/
F: drivers/media/tuners/qt1010*
QUALCOMM ATH12K WIRELESS DRIVER
M: Jeff Johnson <jjohnson@kernel.org>
L: linux-wireless@vger.kernel.org
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 4d0f545fb3ec..68c8a6f9d227 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2810,20 +2810,30 @@ Kernel parameters
If a queue's affinity mask contains only isolated
CPUs then this parameter has no effect on the
interrupt routing decision, though interrupts are
only delivered when tasks running on those
isolated CPUs submit IO. IO submitted on
housekeeping CPUs has no influence on those
queues.
The format of <cpu-list> is described above.
+ pwlocks= [KNL,SMP] Select a behavior on per-CPU resource sharing
+ and remote interference mechanism on a kernel built with
+ CONFIG_PWLOCKS.
+ Format: { "0" | "1" }
+ 0 - local_lock() + queue_work_on(remote_cpu)
+ 1 - spin_lock() for both local and remote operations
+
+ Selecting 1 may be interesting for systems that want
+ to avoid interruption & context switches from IPIs.
+
iucv= [HW,NET]
ivrs_ioapic [HW,X86-64]
Provide an override to the IOAPIC-ID<->DEVICE-ID
mapping provided in the IVRS ACPI table.
By default, PCI segment is 0, and can be omitted.
For example, to map IOAPIC-ID decimal 10 to
PCI segment 0x1 and PCI device 00:14.0,
write the parameter as:
diff --git a/Documentation/locking/pwlocks.rst b/Documentation/locking/pwlocks.rst
new file mode 100644
index 000000000000..09f4a5417bc1
--- /dev/null
+++ b/Documentation/locking/pwlocks.rst
@@ -0,0 +1,76 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+PW (Per-CPU Work) locks
+=======================
+
+Some places in the kernel implement a parallel programming strategy
+consisting of local_locks() for most of the work, while some rare remote
+operations are scheduled on target cpu. This keeps cache bouncing low since
+cacheline tends to be mostly local, and avoids the cost of locks in non-RT
+kernels, even though the very few remote operations will be expensive due
+to scheduling overhead.
+
+On the other hand, for RT workloads this can represent a problem:
+scheduling work on remote cpus that are executing low latency tasks
+is undesired and can introduce unexpected deadline misses.
+
+PW locks help to convert sites that use local_locks (for cpu local operations)
+and queue_work_on (for queueing work remotely, to be executed
+locally on the owner cpu of the lock) to spinlocks.
+
+The lock is declared with the pw_lock_t type.
+The lock is initialized with pw_lock_init.
+The lock is locked with pw_lock (takes a lock and cpu as a parameter).
+The lock is unlocked with pw_unlock (takes a lock and cpu as a parameter).
+
+The pw_lock_irqsave function disables interrupts and saves the current interrupt
+state; it takes a lock, flags and cpu as parameters.
+
+For trylock variant, there is the pw_trylock_t type, initialized with
+pw_trylock_init. Then the corresponding pw_trylock and pw_trylock_irqsave.
+
+work_struct should be replaced by pw_struct, which contains a cpu parameter
+(owner cpu of the lock), initialized by INIT_PW.
+
+The queue work related functions (analogous to queue_work_on and flush_work) are:
+pw_queue_on and pw_flush.
+
+The behaviour of the PW lock functions is as follows:
+
+* !CONFIG_PWLOCKS (or CONFIG_PWLOCKS and pwlocks=0 kernel boot parameter):
+ - pw_lock: local_lock
+ - pw_lock_irqsave: local_lock_irqsave
+ - pw_trylock: local_trylock
+ - pw_trylock_irqsave: local_trylock_irqsave
+ - pw_unlock: local_unlock
+ - pw_lock_local: local_lock
+ - pw_trylock_local: local_trylock
+ - pw_unlock_local: local_unlock
+ - pw_queue_on: queue_work_on
+ - pw_flush: flush_work
+
+* CONFIG_PWLOCKS (and CONFIG_PWLOCKS_DEFAULT=y or pwlocks=1 kernel boot parameter):
+ - pw_lock: spin_lock
+ - pw_lock_irqsave: spin_lock_irqsave
+ - pw_trylock: spin_trylock
+ - pw_trylock_irqsave: spin_trylock_irqsave
+ - pw_unlock: spin_unlock
+ - pw_lock_local: preempt_disable OR migrate_disable + spin_lock
+ - pw_trylock_local: preempt_disable OR migrate_disable + spin_trylock
+ - pw_unlock_local: preempt_enable OR migrate_enable + spin_unlock
+ - pw_queue_on: executes work function on caller cpu
+ - pw_flush: empty
+
+pw_get_cpu(work), to be called from within a per-cpu work function with its
+work_struct pointer, returns the target cpu.
+
+On the locking functions above, there are the local locking functions
+(pw_lock_local, pw_trylock_local and pw_unlock_local) that must only
+be used to access per-CPU data from the CPU that owns that data,
+and never remotely. They disable preemption/migration and don't require
+a cpu parameter, making them a replacement for local_lock functions that
+does not introduce overhead.
+
+These should only be used when accessing per-CPU data of the local CPU.
+
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308ae..3fb751dc4530 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -764,20 +764,55 @@ config CPU_ISOLATION
depends on SMP
default y
help
Make sure that CPUs running critical tasks are not disturbed by
any source of "noise" such as unbound workqueues, timers, kthreads...
Unbound jobs get offloaded to housekeeping CPUs. This is driven by
the "isolcpus=" boot parameter.
Say Y if unsure.
+config PWLOCKS
+ bool "Per-CPU Work locks"
+ depends on SMP || COMPILE_TEST
+ default n
+ help
+ Allow changing the behavior on per-CPU resource sharing with cache,
+ from the regular local_locks() + queue_work_on(remote_cpu) to using
+ per-CPU spinlocks on both local and remote operations.
+
+ This is useful to give user the option on reducing IPIs to CPUs, and
+ thus reduce interruptions and context switches. On the other hand, it
+ increases generated code and will use atomic operations if spinlocks
+ are selected.
+
+ If set, will use the default behavior set in PWLOCKS_DEFAULT unless boot
+ parameter pwlocks is passed with a different behavior.
+
+ If unset, will use the local_lock() + queue_work_on() strategy,
+ regardless of the boot parameter or PWLOCKS_DEFAULT.
+
+ Say N if unsure.
+
+config PWLOCKS_DEFAULT
+ bool "Use per-CPU spinlocks by default on PWLOCKS"
+ depends on PWLOCKS
+ default n
+ help
+ If set, will use per-CPU spinlocks as default behavior for per-CPU
+ remote operations.
+
+ If unset, will use local_lock() + queue_work_on(cpu) as default
+ behavior for remote operations.
+
+ Say N if unsure
+
source "kernel/rcu/Kconfig"
config IKCONFIG
tristate "Kernel .config support"
help
This option enables the complete Linux kernel ".config" file
contents to be saved in the kernel. It provides documentation
of which kernel options are used in a running kernel or in an
on-disk kernel. This information can be extracted from the kernel
image file with the script scripts/extract-ikconfig and used as
diff --git a/kernel/Makefile b/kernel/Makefile
index 6785982013dc..60ccad0699e7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -135,20 +135,22 @@ obj-$(CONFIG_JUMP_LABEL) += jump_label.o
obj-$(CONFIG_CONTEXT_TRACKING) += context_tracking.o
obj-$(CONFIG_TORTURE_TEST) += torture.o
obj-$(CONFIG_HAS_IOMEM) += iomem.o
obj-$(CONFIG_RSEQ) += rseq.o
obj-$(CONFIG_WATCH_QUEUE) += watch_queue.o
obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o
obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o
+obj-$(CONFIG_PWLOCKS) += pwlocks.o
+
CFLAGS_kstack_erase.o += $(DISABLE_KSTACK_ERASE)
CFLAGS_kstack_erase.o += $(call cc-option,-mgeneral-regs-only)
obj-$(CONFIG_KSTACK_ERASE) += kstack_erase.o
KASAN_SANITIZE_kstack_erase.o := n
KCSAN_SANITIZE_kstack_erase.o := n
KCOV_INSTRUMENT_kstack_erase.o := n
obj-$(CONFIG_SCF_TORTURE_TEST) += scftorture.o
$(obj)/configs.o: $(obj)/config_data.gz
diff --git a/include/linux/pwlocks.h b/include/linux/pwlocks.h
new file mode 100644
index 000000000000..3d79621655f9
--- /dev/null
+++ b/include/linux/pwlocks.h
@@ -0,0 +1,265 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PWLOCKS_H
+#define _LINUX_PWLOCKS_H
+
+#include <linux/spinlock.h>
+#include <linux/local_lock.h>
+#include <linux/workqueue.h>
+
+#ifndef CONFIG_PWLOCKS
+
+typedef local_lock_t pw_lock_t;
+typedef local_trylock_t pw_trylock_t;
+
+struct pw_struct {
+ struct work_struct work;
+};
+
+#define pw_lock_init(lock) \
+ local_lock_init(lock)
+
+#define pw_trylock_init(lock) \
+ local_trylock_init(lock)
+
+#define pw_lock(lock, cpu) \
+ local_lock(lock)
+
+#define pw_lock_local(lock) \
+ local_lock(lock)
+
+#define pw_lock_irqsave(lock, flags, cpu) \
+ local_lock_irqsave(lock, flags)
+
+#define pw_lock_local_irqsave(lock, flags) \
+ local_lock_irqsave(lock, flags)
+
+#define pw_trylock(lock, cpu) \
+ local_trylock(lock)
+
+#define pw_trylock_local(lock) \
+ local_trylock(lock)
+
+#define pw_trylock_irqsave(lock, flags, cpu) \
+ local_trylock_irqsave(lock, flags)
+
+#define pw_unlock(lock, cpu) \
+ local_unlock(lock)
+
+#define pw_unlock_local(lock) \
+ local_unlock(lock)
+
+#define pw_unlock_irqrestore(lock, flags, cpu) \
+ local_unlock_irqrestore(lock, flags)
+
+#define pw_unlock_local_irqrestore(lock, flags) \
+ local_unlock_irqrestore(lock, flags)
+
+#define pw_lockdep_assert_held(lock) \
+ lockdep_assert_held(lock)
+
+#define pw_queue_on(c, wq, pw) \
+ queue_work_on(c, wq, &(pw)->work)
+
+#define pw_flush(pw) \
+ flush_work(&(pw)->work)
+
+#define pw_get_cpu(pw) smp_processor_id()
+
+#define pw_is_cpu_remote(cpu) (false)
+
+#define INIT_PW(pw, func, c) \
+ INIT_WORK(&(pw)->work, (func))
+
+#else /* CONFIG_PWLOCKS */
+
+DECLARE_STATIC_KEY_MAYBE(CONFIG_PWLOCKS_DEFAULT, pw_sl);
+
+typedef union {
+ spinlock_t sl;
+ local_lock_t ll;
+} pw_lock_t;
+
+typedef union {
+ spinlock_t sl;
+ local_trylock_t ll;
+} pw_trylock_t;
+
+struct pw_struct {
+ struct work_struct work;
+ int cpu;
+};
+
+#ifdef CONFIG_PREEMPT_RT
+#define preempt_or_migrate_disable migrate_disable
+#define preempt_or_migrate_enable migrate_enable
+#else
+#define preempt_or_migrate_disable preempt_disable
+#define preempt_or_migrate_enable preempt_enable
+#endif
+
+#define pw_lock_init(lock) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
+ spin_lock_init(lock.sl); \
+ else \
+ local_lock_init(lock.ll); \
+} while (0)
+
+#define pw_trylock_init(lock) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
+ spin_lock_init(lock.sl); \
+ else \
+ local_trylock_init(lock.ll); \
+} while (0)
+
+#define pw_lock(lock, cpu) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
+ spin_lock(per_cpu_ptr(lock.sl, cpu)); \
+ else \
+ local_lock(lock.ll); \
+} while (0)
+
+#define pw_lock_local(lock) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
+ preempt_or_migrate_disable(); \
+ spin_lock(this_cpu_ptr(lock.sl)); \
+ } else { \
+ local_lock(lock.ll); \
+ } \
+} while (0)
+
+#define pw_lock_irqsave(lock, flags, cpu) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
+ spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags); \
+ else \
+ local_lock_irqsave(lock.ll, flags); \
+} while (0)
+
+#define pw_lock_local_irqsave(lock, flags) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
+ preempt_or_migrate_disable(); \
+ spin_lock_irqsave(this_cpu_ptr(lock.sl), flags); \
+ } else { \
+ local_lock_irqsave(lock.ll, flags); \
+ } \
+} while (0)
+
+#define pw_trylock(lock, cpu) \
+({ \
+ int t; \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
+ t = spin_trylock(per_cpu_ptr(lock.sl, cpu)); \
+ else \
+ t = local_trylock(lock.ll); \
+ t; \
+})
+
+#define pw_trylock_local(lock) \
+({ \
+ int t; \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
+ preempt_or_migrate_disable(); \
+ t = spin_trylock(this_cpu_ptr(lock.sl)); \
+ if (!t) \
+ preempt_or_migrate_enable(); \
+ } else { \
+ t = local_trylock(lock.ll); \
+ } \
+ t; \
+})
+
+#define pw_trylock_irqsave(lock, flags, cpu) \
+({ \
+ int t; \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
+ t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags); \
+ else \
+ t = local_trylock_irqsave(lock.ll, flags); \
+ t; \
+})
+
+#define pw_unlock(lock, cpu) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
+ spin_unlock(per_cpu_ptr(lock.sl, cpu)); \
+ else \
+ local_unlock(lock.ll); \
+} while (0)
+
+#define pw_unlock_local(lock) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
+ spin_unlock(this_cpu_ptr(lock.sl)); \
+ preempt_or_migrate_enable(); \
+ } else { \
+ local_unlock(lock.ll); \
+ } \
+} while (0)
+
+#define pw_unlock_irqrestore(lock, flags, cpu) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
+ spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags); \
+ else \
+ local_unlock_irqrestore(lock.ll, flags); \
+} while (0)
+
+#define pw_unlock_local_irqrestore(lock, flags) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
+ spin_unlock_irqrestore(this_cpu_ptr(lock.sl), flags); \
+ preempt_or_migrate_enable(); \
+ } else { \
+ local_unlock_irqrestore(lock.ll, flags); \
+ } \
+} while (0)
+
+#define pw_lockdep_assert_held(lock) \
+do { \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
+ lockdep_assert_held(this_cpu_ptr(lock.sl)); \
+ else \
+ lockdep_assert_held(this_cpu_ptr(lock.ll)); \
+} while (0)
+
+#define pw_queue_on(c, wq, pw) \
+do { \
+ int __c = c; \
+ struct pw_struct *__pw = (pw); \
+ if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
+ WARN_ON((__c) != __pw->cpu); \
+ __pw->work.func(&__pw->work); \
+ } else { \
+ queue_work_on(__c, wq, &(__pw)->work); \
+ } \
+} while (0)
+
+/*
+ * Does nothing if PWLOCKS is set to use spinlock, as the task is already done at the
+ * time pw_queue_on() returns.
+ */
+#define pw_flush(pw) \
+do { \
+ struct pw_struct *__pw = (pw); \
+ if (!static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
+ flush_work(&__pw->work); \
+} while (0)
+
+#define pw_get_cpu(w) container_of((w), struct pw_struct, work)->cpu
+
+#define pw_is_cpu_remote(cpu) ((cpu) != smp_processor_id())
+
+#define INIT_PW(pw, func, c) \
+do { \
+ struct pw_struct *__pw = (pw); \
+ INIT_WORK(&__pw->work, (func)); \
+ __pw->cpu = (c); \
+} while (0)
+
+#endif /* CONFIG_PWLOCKS */
+#endif /* _LINUX_PWLOCKS_H */
diff --git a/kernel/pwlocks.c b/kernel/pwlocks.c
new file mode 100644
index 000000000000..1ebf5cb979b9
--- /dev/null
+++ b/kernel/pwlocks.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/export.h>
+#include <linux/sched.h>
+#include <linux/pwlocks.h>
+#include <linux/string.h>
+#include <linux/sched/isolation.h>
+
+DEFINE_STATIC_KEY_MAYBE(CONFIG_PWLOCKS_DEFAULT, pw_sl);
+EXPORT_SYMBOL(pw_sl);
+
+static bool pwlocks_param_specified;
+
+static int __init pwlocks_setup(char *str)
+{
+ int opt;
+
+ if (!get_option(&str, &opt)) {
+ pr_warn("PWLOCKS: invalid pwlocks parameter: %s, ignoring.\n", str);
+ return 0;
+ }
+
+ if (opt)
+ static_branch_enable(&pw_sl);
+ else
+ static_branch_disable(&pw_sl);
+
+ pwlocks_param_specified = true;
+
+ return 1;
+}
+__setup("pwlocks=", pwlocks_setup);
+
+/*
+ * Enable PWLOCKS if CPUs want to avoid kernel noise.
+ */
+static int __init pwlocks_init(void)
+{
+ if (pwlocks_param_specified)
+ return 0;
+
+ if (housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
+ static_branch_enable(&pw_sl);
+
+ return 0;
+}
+
+late_initcall(pwlocks_init);
--
2.54.0
* [PATCH v4 2/4] mm/swap: move bh draining into a separate workqueue
2026-05-19 1:27 [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW) Leonardo Bras
2026-05-19 1:27 ` [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work Leonardo Bras
@ 2026-05-19 1:27 ` Leonardo Bras
2026-05-19 1:27 ` [PATCH v4 3/4] swap: apply new pw_queue_on() interface Leonardo Bras
` (3 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Leonardo Bras @ 2026-05-19 1:27 UTC (permalink / raw)
To: Jonathan Corbet, Shuah Khan, Leonardo Bras, Peter Zijlstra,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, Frederic Weisbecker
Cc: Marcelo Tosatti, linux-doc, linux-kernel, linux-mm,
linux-rt-devel
From: Marcelo Tosatti <mtosatti@redhat.com>
Move the bh draining into a separate workqueue
(from the mm lru draining), so that it's possible to switch
the mm lru draining to PW.
To switch bh draining to PW, it would be necessary to add
a spinlock to the addition of bhs to the percpu cache, and that is a
very hot path.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
---
mm/swap.c | 52 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 37 insertions(+), 15 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 5cc44f0de987..ed9b3d371547 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -744,60 +744,70 @@ void lru_add_drain(void)
local_unlock(&cpu_fbatches.lock);
mlock_drain_local();
}
/*
* It's called from per-cpu workqueue context in SMP case so
* lru_add_drain_cpu and invalidate_bh_lrus_cpu should run on
* the same cpu. It shouldn't be a problem in !SMP case since
* the core is only one and the locks will disable preemption.
*/
-static void lru_add_and_bh_lrus_drain(void)
+static void lru_add_mm_drain(void)
{
local_lock(&cpu_fbatches.lock);
lru_add_drain_cpu(smp_processor_id());
local_unlock(&cpu_fbatches.lock);
- invalidate_bh_lrus_cpu();
mlock_drain_local();
}
void lru_add_drain_cpu_zone(struct zone *zone)
{
local_lock(&cpu_fbatches.lock);
lru_add_drain_cpu(smp_processor_id());
drain_local_pages(zone);
local_unlock(&cpu_fbatches.lock);
mlock_drain_local();
}
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
static void lru_add_drain_per_cpu(struct work_struct *dummy)
{
- lru_add_and_bh_lrus_drain();
+ lru_add_mm_drain();
}
-static bool cpu_needs_drain(unsigned int cpu)
+static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work);
+
+static void bh_add_drain_per_cpu(struct work_struct *dummy)
+{
+ invalidate_bh_lrus_cpu();
+}
+
+static bool cpu_needs_mm_drain(unsigned int cpu)
{
struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
/* Check these in order of likelihood that they're not zero */
return folio_batch_count(&fbatches->lru_add) ||
folio_batch_count(&fbatches->lru_move_tail) ||
folio_batch_count(&fbatches->lru_deactivate_file) ||
folio_batch_count(&fbatches->lru_deactivate) ||
folio_batch_count(&fbatches->lru_lazyfree) ||
folio_batch_count(&fbatches->lru_activate) ||
- need_mlock_drain(cpu) ||
- has_bh_in_lru(cpu, NULL);
+ need_mlock_drain(cpu);
+}
+
+static bool cpu_needs_bh_drain(unsigned int cpu)
+{
+ return has_bh_in_lru(cpu, NULL);
}
/*
* Doesn't need any cpu hotplug locking because we do rely on per-cpu
* kworkers being shut down before our page_alloc_cpu_dead callback is
* executed on the offlined cpu.
* Calling this function with cpu hotplug locks held can actually lead
* to obscure indirect dependencies via WQ context.
*/
static inline void __lru_add_drain_all(bool force_all_cpus)
@@ -806,21 +816,21 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
* lru_drain_gen - Global pages generation number
*
* (A) Definition: global lru_drain_gen = x implies that all generations
* 0 < n <= x are already *scheduled* for draining.
*
* This is an optimization for the highly-contended use case where a
* user space workload keeps constantly generating a flow of pages for
* each CPU.
*/
static unsigned int lru_drain_gen;
- static struct cpumask has_work;
+ static struct cpumask has_mm_work, has_bh_work;
static DEFINE_MUTEX(lock);
unsigned cpu, this_gen;
/*
* Make sure nobody triggers this path before mm_percpu_wq is fully
* initialized.
*/
if (WARN_ON(!mm_percpu_wq))
return;
@@ -869,34 +879,45 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
* along, adds some pages to its per-cpu vectors, then calls
* lru_add_drain_all().
*
* If the paired barrier is done at any later step, e.g. after the
* loop, CPU #x will just exit at (C) and miss flushing out all of its
* added pages.
*/
WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
smp_mb();
- cpumask_clear(&has_work);
+ cpumask_clear(&has_mm_work);
+ cpumask_clear(&has_bh_work);
for_each_online_cpu(cpu) {
- struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);
+ struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
+ struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
- if (cpu_needs_drain(cpu)) {
- INIT_WORK(work, lru_add_drain_per_cpu);
- queue_work_on(cpu, mm_percpu_wq, work);
- __cpumask_set_cpu(cpu, &has_work);
+ if (cpu_needs_mm_drain(cpu)) {
+ INIT_WORK(mm_work, lru_add_drain_per_cpu);
+ queue_work_on(cpu, mm_percpu_wq, mm_work);
+ __cpumask_set_cpu(cpu, &has_mm_work);
+ }
+
+ if (cpu_needs_bh_drain(cpu)) {
+ INIT_WORK(bh_work, bh_add_drain_per_cpu);
+ queue_work_on(cpu, mm_percpu_wq, bh_work);
+ __cpumask_set_cpu(cpu, &has_bh_work);
}
}
- for_each_cpu(cpu, &has_work)
+ for_each_cpu(cpu, &has_mm_work)
flush_work(&per_cpu(lru_add_drain_work, cpu));
+ for_each_cpu(cpu, &has_bh_work)
+ flush_work(&per_cpu(bh_add_drain_work, cpu));
+
done:
mutex_unlock(&lock);
}
void lru_add_drain_all(void)
{
__lru_add_drain_all(false);
}
#else
void lru_add_drain_all(void)
@@ -928,21 +949,22 @@ void lru_cache_disable(void)
*
* Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
* preempt_disable() regions of code. So any CPU which sees
* lru_disable_count = 0 will have exited the critical
* section when synchronize_rcu() returns.
*/
synchronize_rcu_expedited();
#ifdef CONFIG_SMP
__lru_add_drain_all(true);
#else
- lru_add_and_bh_lrus_drain();
+ lru_add_mm_drain();
+ invalidate_bh_lrus_cpu();
#endif
}
/**
* folios_put_refs - Reduce the reference count on a batch of folios.
* @folios: The folios.
* @refs: The number of refs to subtract from each folio.
*
* Like folio_put(), but for a batch of folios. This is more efficient
* than writing the loop yourself as it will optimise the locks which need
--
2.54.0
* [PATCH v4 3/4] swap: apply new pw_queue_on() interface
2026-05-19 1:27 [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW) Leonardo Bras
2026-05-19 1:27 ` [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work Leonardo Bras
2026-05-19 1:27 ` [PATCH v4 2/4] mm/swap: move bh draining into a separate workqueue Leonardo Bras
@ 2026-05-19 1:27 ` Leonardo Bras
2026-05-20 15:07 ` Sebastian Andrzej Siewior
2026-05-19 1:27 ` [PATCH v4 4/4] slub: " Leonardo Bras
` (2 subsequent siblings)
5 siblings, 1 reply; 12+ messages in thread
From: Leonardo Bras @ 2026-05-19 1:27 UTC (permalink / raw)
To: Jonathan Corbet, Shuah Khan, Leonardo Bras, Peter Zijlstra,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, Frederic Weisbecker
Cc: linux-doc, linux-kernel, linux-mm, linux-rt-devel,
Marcelo Tosatti
Use the new pw_{un,}lock*() and pw_queue_on() interface to improve
performance & latency.
For functions that may be scheduled on a different cpu, replace
local_{un,}lock*() with pw_{un,}lock*(), and replace queue_work_on() with
pw_queue_on(). Likewise, replace flush_work() with pw_flush().
This change requires allocating pw_structs instead of work_structs,
and adding a cpu parameter to a few functions.
This should have no relevant performance impact on non-PWLOCKS kernels:
for functions that may be scheduled on a different cpu, the local_*lock's
this_cpu_ptr() becomes a per_cpu_ptr(smp_processor_id()).
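The local/remote distinction above can be modelled in userspace as follows. This is only a sketch under assumptions: it assumes that pw_lock(lock, cpu) takes the lock belonging to the given cpu, as its use in mlock_drain_cpu() suggests, and it uses pthread mutexes and a hypothetical `slot` array in place of real per-cpu data. None of the names below exist in the kernel.

```c
#include <assert.h>
#include <pthread.h>

#define MODEL_NR_SLOTS 4	/* hypothetical number of CPUs in this model */

/* One lock and one counter per "cpu"; stands in for a per-cpu struct. */
struct slot {
	pthread_mutex_t lock;
	int pending;
};

static struct slot slots[MODEL_NR_SLOTS];

/* Models the boot-time loop that calls pw_lock_init() for each cpu. */
static void slots_init(void)
{
	for (int i = 0; i < MODEL_NR_SLOTS; i++) {
		pthread_mutex_init(&slots[i].lock, NULL);
		slots[i].pending = 0;
	}
}

/* Fast path: the owner adds to its own slot (models pw_lock_local()). */
static void slot_add(int self, int n)
{
	pthread_mutex_lock(&slots[self].lock);
	slots[self].pending += n;
	pthread_mutex_unlock(&slots[self].lock);
}

/*
 * Rare path: any thread drains slot @cpu by taking that slot's lock
 * (models pw_lock(lock, cpu)) instead of asking the owner to do it.
 * Returns the amount drained.
 */
static int slot_drain(int cpu)
{
	int drained;

	pthread_mutex_lock(&slots[cpu].lock);
	drained = slots[cpu].pending;
	slots[cpu].pending = 0;
	pthread_mutex_unlock(&slots[cpu].lock);

	return drained;
}

/* Drain every slot from the calling thread; returns the total drained. */
static int slots_drain_all(void)
{
	int total = 0;

	for (int i = 0; i < MODEL_NR_SLOTS; i++)
		total += slot_drain(i);

	return total;
}
```

The point of passing the cpu explicitly is visible in slot_drain(): the same function serves both the owner and a remote caller, which is why several functions in this patch gain a cpu parameter.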
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
mm/internal.h | 4 ++-
mm/mlock.c | 51 ++++++++++++++++++++++++++----------
mm/page_alloc.c | 2 +-
mm/swap.c | 69 ++++++++++++++++++++++++++-----------------------
4 files changed, 79 insertions(+), 47 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 5a2ddcf68e0b..1ec9a11c373b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1209,24 +1209,26 @@ static inline void munlock_vma_folio(struct folio *folio,
* cause folio not fully mapped to VMA.
*
* But it's not easy to confirm that's the situation. So we
* always munlock the folio and page reclaim will correct it
* if it's wrong.
*/
if (unlikely(vma->vm_flags & VM_LOCKED))
munlock_folio(folio);
}
+int __init mlock_init(void);
void mlock_new_folio(struct folio *folio);
bool need_mlock_drain(int cpu);
void mlock_drain_local(void);
-void mlock_drain_remote(int cpu);
+void mlock_drain_cpu(int cpu);
+void mlock_drain_offline(int cpu);
extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
/**
* vma_address - Find the virtual address a page range is mapped at
* @vma: The vma which maps this object.
* @pgoff: The page offset within its object.
* @nr_pages: The number of pages to consider.
*
* If any page in this range is mapped by this VMA, return the first address
diff --git a/mm/mlock.c b/mm/mlock.c
index 8c227fefa2df..5d25bbbb09e9 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -18,31 +18,30 @@
#include <linux/mempolicy.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
#include <linux/export.h>
#include <linux/rmap.h>
#include <linux/mmzone.h>
#include <linux/hugetlb.h>
#include <linux/memcontrol.h>
#include <linux/mm_inline.h>
#include <linux/secretmem.h>
+#include <linux/pwlocks.h>
#include "internal.h"
struct mlock_fbatch {
- local_lock_t lock;
+ pw_lock_t lock;
struct folio_batch fbatch;
};
-static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch) = {
- .lock = INIT_LOCAL_LOCK(lock),
-};
+static DEFINE_PER_CPU(struct mlock_fbatch, mlock_fbatch);
bool can_do_mlock(void)
{
if (rlimit(RLIMIT_MEMLOCK) != 0)
return true;
if (capable(CAP_IPC_LOCK))
return true;
return false;
}
EXPORT_SYMBOL(can_do_mlock);
@@ -202,32 +201,43 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
lruvec = __mlock_new_folio(folio, lruvec);
else
lruvec = __munlock_folio(folio, lruvec);
}
if (lruvec)
lruvec_unlock_irq(lruvec);
folios_put(fbatch);
}
+void mlock_drain_cpu(int cpu)
+{
+ struct folio_batch *fbatch;
+
+ pw_lock(&mlock_fbatch.lock, cpu);
+ fbatch = per_cpu_ptr(&mlock_fbatch.fbatch, cpu);
+ if (folio_batch_count(fbatch))
+ mlock_folio_batch(fbatch);
+ pw_unlock(&mlock_fbatch.lock, cpu);
+}
+
void mlock_drain_local(void)
{
struct folio_batch *fbatch;
- local_lock(&mlock_fbatch.lock);
+ pw_lock_local(&mlock_fbatch.lock);
fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
if (folio_batch_count(fbatch))
mlock_folio_batch(fbatch);
- local_unlock(&mlock_fbatch.lock);
+ pw_unlock_local(&mlock_fbatch.lock);
}
-void mlock_drain_remote(int cpu)
+void mlock_drain_offline(int cpu)
{
struct folio_batch *fbatch;
WARN_ON_ONCE(cpu_online(cpu));
fbatch = &per_cpu(mlock_fbatch.fbatch, cpu);
if (folio_batch_count(fbatch))
mlock_folio_batch(fbatch);
}
bool need_mlock_drain(int cpu)
@@ -236,79 +246,79 @@ bool need_mlock_drain(int cpu)
}
/**
* mlock_folio - mlock a folio already on (or temporarily off) LRU
* @folio: folio to be mlocked.
*/
void mlock_folio(struct folio *folio)
{
struct folio_batch *fbatch;
- local_lock(&mlock_fbatch.lock);
+ pw_lock_local(&mlock_fbatch.lock);
fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
if (!folio_test_set_mlocked(folio)) {
int nr_pages = folio_nr_pages(folio);
zone_stat_mod_folio(folio, NR_MLOCK, nr_pages);
__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
}
folio_get(folio);
if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
!folio_may_be_lru_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
- local_unlock(&mlock_fbatch.lock);
+ pw_unlock_local(&mlock_fbatch.lock);
}
/**
* mlock_new_folio - mlock a newly allocated folio not yet on LRU
* @folio: folio to be mlocked, either normal or a THP head.
*/
void mlock_new_folio(struct folio *folio)
{
struct folio_batch *fbatch;
int nr_pages = folio_nr_pages(folio);
- local_lock(&mlock_fbatch.lock);
+ pw_lock_local(&mlock_fbatch.lock);
fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
folio_set_mlocked(folio);
zone_stat_mod_folio(folio, NR_MLOCK, nr_pages);
__count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
folio_get(folio);
if (!folio_batch_add(fbatch, mlock_new(folio)) ||
!folio_may_be_lru_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
- local_unlock(&mlock_fbatch.lock);
+ pw_unlock_local(&mlock_fbatch.lock);
}
/**
* munlock_folio - munlock a folio
* @folio: folio to be munlocked, either normal or a THP head.
*/
void munlock_folio(struct folio *folio)
{
struct folio_batch *fbatch;
- local_lock(&mlock_fbatch.lock);
+ pw_lock_local(&mlock_fbatch.lock);
fbatch = this_cpu_ptr(&mlock_fbatch.fbatch);
/*
* folio_test_clear_mlocked(folio) must be left to __munlock_folio(),
* which will check whether the folio is multiply mlocked.
*/
folio_get(folio);
if (!folio_batch_add(fbatch, folio) ||
!folio_may_be_lru_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
- local_unlock(&mlock_fbatch.lock);
+ pw_unlock_local(&mlock_fbatch.lock);
}
static inline unsigned int folio_mlock_step(struct folio *folio,
pte_t *pte, unsigned long addr, unsigned long end)
{
unsigned int count = (end - addr) >> PAGE_SHIFT;
pte_t ptent = ptep_get(pte);
if (!folio_test_large(folio))
return 1;
@@ -822,10 +832,25 @@ int user_shm_lock(size_t size, struct ucounts *ucounts)
return allowed;
}
void user_shm_unlock(size_t size, struct ucounts *ucounts)
{
spin_lock(&shmlock_user_lock);
dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, (size + PAGE_SIZE - 1) >> PAGE_SHIFT);
spin_unlock(&shmlock_user_lock);
put_ucounts(ucounts);
}
+
+int __init mlock_init(void)
+{
+ unsigned int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct mlock_fbatch *fbatch = &per_cpu(mlock_fbatch, cpu);
+
+ pw_lock_init(&fbatch->lock);
+ }
+
+ return 0;
+}
+
+module_init(mlock_init);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 227d58dc3de6..fa768f07f88a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6217,21 +6217,21 @@ void free_reserved_page(struct page *page)
__free_page(page);
adjust_managed_page_count(page, 1);
}
EXPORT_SYMBOL(free_reserved_page);
static int page_alloc_cpu_dead(unsigned int cpu)
{
struct zone *zone;
lru_add_drain_cpu(cpu);
- mlock_drain_remote(cpu);
+ mlock_drain_offline(cpu);
drain_pages(cpu);
/*
* Spill the event counters of the dead processor
* into the current processors event counters.
* This artificially elevates the count of the current
* processor.
*/
vm_events_fold_cpu(cpu);
diff --git a/mm/swap.c b/mm/swap.c
index ed9b3d371547..42f51bf4bb71 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -28,54 +28,51 @@
#include <linux/memremap.h>
#include <linux/percpu.h>
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/backing-dev.h>
#include <linux/memcontrol.h>
#include <linux/gfp.h>
#include <linux/uio.h>
#include <linux/hugetlb.h>
#include <linux/page_idle.h>
-#include <linux/local_lock.h>
+#include <linux/pwlocks.h>
#include <linux/buffer_head.h>
#include "internal.h"
#define CREATE_TRACE_POINTS
#include <trace/events/pagemap.h>
/* How many pages do we try to swap or page in/out together? As a power of 2 */
int page_cluster;
static const int page_cluster_max = 31;
struct cpu_fbatches {
/*
* The following folio batches are grouped together because they are protected
* by disabling preemption (and interrupts remain enabled).
*/
- local_lock_t lock;
+ pw_lock_t lock;
struct folio_batch lru_add;
struct folio_batch lru_deactivate_file;
struct folio_batch lru_deactivate;
struct folio_batch lru_lazyfree;
#ifdef CONFIG_SMP
struct folio_batch lru_activate;
#endif
/* Protecting the following batches which require disabling interrupts */
- local_lock_t lock_irq;
+ pw_lock_t lock_irq;
struct folio_batch lru_move_tail;
};
-static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = {
- .lock = INIT_LOCAL_LOCK(lock),
- .lock_irq = INIT_LOCAL_LOCK(lock_irq),
-};
+static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches);
static void __page_cache_release(struct folio *folio, struct lruvec **lruvecp,
unsigned long *flagsp)
{
if (folio_test_lru(folio)) {
folio_lruvec_relock_irqsave(folio, lruvecp, flagsp);
lruvec_del_folio(*lruvecp, folio);
__folio_clear_lru_flags(folio);
}
}
@@ -180,32 +177,32 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
}
static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
struct folio *folio, move_fn_t move_fn, bool disable_irq)
{
unsigned long flags;
folio_get(folio);
if (disable_irq)
- local_lock_irqsave(&cpu_fbatches.lock_irq, flags);
+ pw_lock_local_irqsave(&cpu_fbatches.lock_irq, flags);
else
- local_lock(&cpu_fbatches.lock);
+ pw_lock_local(&cpu_fbatches.lock);
if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
!folio_may_be_lru_cached(folio) || lru_cache_disabled())
folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
if (disable_irq)
- local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
+ pw_unlock_local_irqrestore(&cpu_fbatches.lock_irq, flags);
else
- local_unlock(&cpu_fbatches.lock);
+ pw_unlock_local(&cpu_fbatches.lock);
}
#define folio_batch_add_and_move(folio, op) \
__folio_batch_add_and_move( \
&cpu_fbatches.op, \
folio, \
op, \
offsetof(struct cpu_fbatches, op) >= \
offsetof(struct cpu_fbatches, lock_irq) \
)
@@ -356,21 +353,21 @@ void folio_activate(struct folio *folio)
lruvec_unlock_irq(lruvec);
folio_set_lru(folio);
}
#endif
static void __lru_cache_activate_folio(struct folio *folio)
{
struct folio_batch *fbatch;
int i;
- local_lock(&cpu_fbatches.lock);
+ pw_lock_local(&cpu_fbatches.lock);
fbatch = this_cpu_ptr(&cpu_fbatches.lru_add);
/*
* Search backwards on the optimistic assumption that the folio being
* activated has just been added to this batch. Note that only
* the local batch is examined as a !LRU folio could be in the
* process of being released, reclaimed, migrated or on a remote
* batch that is currently being drained. Furthermore, marking
* a remote batch's folio active potentially hits a race where
* a folio is marked active just after it is added to the inactive
@@ -378,21 +375,21 @@ static void __lru_cache_activate_folio(struct folio *folio)
*/
for (i = folio_batch_count(fbatch) - 1; i >= 0; i--) {
struct folio *batch_folio = fbatch->folios[i];
if (batch_folio == folio) {
folio_set_active(folio);
break;
}
}
- local_unlock(&cpu_fbatches.lock);
+ pw_unlock_local(&cpu_fbatches.lock);
}
#ifdef CONFIG_LRU_GEN
static void lru_gen_inc_refs(struct folio *folio)
{
unsigned long new_flags, old_flags = READ_ONCE(folio->flags.f);
if (folio_test_unevictable(folio))
return;
@@ -652,23 +649,23 @@ void lru_add_drain_cpu(int cpu)
if (folio_batch_count(fbatch))
folio_batch_move_lru(fbatch, lru_add);
fbatch = &fbatches->lru_move_tail;
/* Disabling interrupts below acts as a compiler barrier. */
if (data_race(folio_batch_count(fbatch))) {
unsigned long flags;
/* No harm done if a racing interrupt already did this */
- local_lock_irqsave(&cpu_fbatches.lock_irq, flags);
+ pw_lock_irqsave(&cpu_fbatches.lock_irq, flags, cpu);
folio_batch_move_lru(fbatch, lru_move_tail);
- local_unlock_irqrestore(&cpu_fbatches.lock_irq, flags);
+ pw_unlock_irqrestore(&cpu_fbatches.lock_irq, flags, cpu);
}
fbatch = &fbatches->lru_deactivate_file;
if (folio_batch_count(fbatch))
folio_batch_move_lru(fbatch, lru_deactivate_file);
fbatch = &fbatches->lru_deactivate;
if (folio_batch_count(fbatch))
folio_batch_move_lru(fbatch, lru_deactivate);
@@ -732,56 +729,56 @@ void folio_mark_lazyfree(struct folio *folio)
if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) ||
!folio_test_lru(folio) ||
folio_test_swapcache(folio) || folio_test_unevictable(folio))
return;
folio_batch_add_and_move(folio, lru_lazyfree);
}
void lru_add_drain(void)
{
- local_lock(&cpu_fbatches.lock);
+ pw_lock_local(&cpu_fbatches.lock);
lru_add_drain_cpu(smp_processor_id());
- local_unlock(&cpu_fbatches.lock);
+ pw_unlock_local(&cpu_fbatches.lock);
mlock_drain_local();
}
/*
* It's called from per-cpu workqueue context in SMP case so
* lru_add_drain_cpu and invalidate_bh_lrus_cpu should run on
* the same cpu. It shouldn't be a problem in !SMP case since
* the core is only one and the locks will disable preemption.
*/
-static void lru_add_mm_drain(void)
+static void lru_add_mm_drain(int cpu)
{
- local_lock(&cpu_fbatches.lock);
- lru_add_drain_cpu(smp_processor_id());
- local_unlock(&cpu_fbatches.lock);
- mlock_drain_local();
+ pw_lock(&cpu_fbatches.lock, cpu);
+ lru_add_drain_cpu(cpu);
+ pw_unlock(&cpu_fbatches.lock, cpu);
+ mlock_drain_cpu(cpu);
}
void lru_add_drain_cpu_zone(struct zone *zone)
{
- local_lock(&cpu_fbatches.lock);
+ pw_lock_local(&cpu_fbatches.lock);
lru_add_drain_cpu(smp_processor_id());
drain_local_pages(zone);
- local_unlock(&cpu_fbatches.lock);
+ pw_unlock_local(&cpu_fbatches.lock);
mlock_drain_local();
}
#ifdef CONFIG_SMP
-static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);
+static DEFINE_PER_CPU(struct pw_struct, lru_add_drain_pw);
-static void lru_add_drain_per_cpu(struct work_struct *dummy)
+static void lru_add_drain_per_cpu(struct work_struct *w)
{
- lru_add_mm_drain();
+ lru_add_mm_drain(pw_get_cpu(w));
}
static DEFINE_PER_CPU(struct work_struct, bh_add_drain_work);
static void bh_add_drain_per_cpu(struct work_struct *dummy)
{
invalidate_bh_lrus_cpu();
}
static bool cpu_needs_mm_drain(unsigned int cpu)
@@ -882,38 +879,38 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
* If the paired barrier is done at any later step, e.g. after the
* loop, CPU #x will just exit at (C) and miss flushing out all of its
* added pages.
*/
WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
smp_mb();
cpumask_clear(&has_mm_work);
cpumask_clear(&has_bh_work);
for_each_online_cpu(cpu) {
- struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
+ struct pw_struct *mm_pw = &per_cpu(lru_add_drain_pw, cpu);
struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
if (cpu_needs_mm_drain(cpu)) {
- INIT_WORK(mm_work, lru_add_drain_per_cpu);
- queue_work_on(cpu, mm_percpu_wq, mm_work);
+ INIT_PW(mm_pw, lru_add_drain_per_cpu, cpu);
+ pw_queue_on(cpu, mm_percpu_wq, mm_pw);
__cpumask_set_cpu(cpu, &has_mm_work);
}
if (cpu_needs_bh_drain(cpu)) {
INIT_WORK(bh_work, bh_add_drain_per_cpu);
queue_work_on(cpu, mm_percpu_wq, bh_work);
__cpumask_set_cpu(cpu, &has_bh_work);
}
}
for_each_cpu(cpu, &has_mm_work)
- flush_work(&per_cpu(lru_add_drain_work, cpu));
+ pw_flush(&per_cpu(lru_add_drain_pw, cpu));
for_each_cpu(cpu, &has_bh_work)
flush_work(&per_cpu(bh_add_drain_work, cpu));
done:
mutex_unlock(&lock);
}
void lru_add_drain_all(void)
{
@@ -949,21 +946,21 @@ void lru_cache_disable(void)
*
* Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
* preempt_disable() regions of code. So any CPU which sees
* lru_disable_count = 0 will have exited the critical
* section when synchronize_rcu() returns.
*/
synchronize_rcu_expedited();
#ifdef CONFIG_SMP
__lru_add_drain_all(true);
#else
- lru_add_mm_drain();
+ lru_add_mm_drain(smp_processor_id());
invalidate_bh_lrus_cpu();
#endif
}
/**
* folios_put_refs - Reduce the reference count on a batch of folios.
* @folios: The folios.
* @refs: The number of refs to subtract from each folio.
*
* Like folio_put(), but for a batch of folios. This is more efficient
@@ -1156,23 +1153,31 @@ static const struct ctl_table swap_sysctl_table[] = {
.extra2 = (void *)&page_cluster_max,
}
};
/*
* Perform any setup for the swap system
*/
void __init swap_setup(void)
{
unsigned long megs = PAGES_TO_MB(totalram_pages());
+ unsigned int cpu;
/* Use a smaller cluster for small-memory machines */
if (megs < 16)
page_cluster = 2;
else
page_cluster = 3;
/*
* Right now other parts of the system means that we
* _really_ don't want to cluster much more
*/
register_sysctl_init("vm", swap_sysctl_table);
+
+ for_each_possible_cpu(cpu) {
+ struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu);
+
+ pw_lock_init(&fbatches->lock);
+ pw_lock_init(&fbatches->lock_irq);
+ }
}
--
2.54.0
* [PATCH v4 4/4] slub: apply new pw_queue_on() interface
2026-05-19 1:27 [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW) Leonardo Bras
` (2 preceding siblings ...)
2026-05-19 1:27 ` [PATCH v4 3/4] swap: apply new pw_queue_on() interface Leonardo Bras
@ 2026-05-19 1:27 ` Leonardo Bras
2026-05-20 14:53 ` Sebastian Andrzej Siewior
2026-05-19 6:58 ` [syzbot ci] Re: Introduce Per-CPU Work helpers (was QPW) syzbot ci
2026-05-20 13:09 ` [PATCH v4 0/4] " Sebastian Andrzej Siewior
5 siblings, 1 reply; 12+ messages in thread
From: Leonardo Bras @ 2026-05-19 1:27 UTC (permalink / raw)
To: Jonathan Corbet, Shuah Khan, Leonardo Bras, Peter Zijlstra,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, Frederic Weisbecker
Cc: linux-doc, linux-kernel, linux-mm, linux-rt-devel,
Marcelo Tosatti
Use the new pw_{un,}lock*() and pw_queue_on() interface to improve
performance & latency.
For functions that may be scheduled on a different cpu, replace
local_{un,}lock*() with pw_{un,}lock*(), and replace schedule_work_on() with
pw_queue_on(). Likewise, replace flush_work() with pw_flush().
This change requires allocating pw_structs instead of work_structs,
and adding a cpu parameter to a few functions.
This should have no relevant performance impact on non-PWLOCKS kernels:
for functions that may be scheduled on a different cpu, the local_*lock's
this_cpu_ptr() becomes a per_cpu_ptr(smp_processor_id()).
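The sheaf flushing helpers converted below follow a "caller locks, helper unlocks" pattern that is easy to misread in diff form. A minimal userspace sketch of that pattern, with the cpu passed explicitly, could look like this. It is a model only: `sheaf_model`, `MODEL_BATCH_MAX` and the pthread mutex are hypothetical stand-ins for the per-cpu sheaf, PCS_BATCH_MAX and the pw lock, and "freeing" just counts objects.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MODEL_NR_CPUS	2	/* hypothetical CPU count */
#define MODEL_CAP	8	/* hypothetical sheaf capacity */
#define MODEL_BATCH_MAX	3	/* stands in for PCS_BATCH_MAX */

struct sheaf_model {
	pthread_mutex_t lock;
	unsigned int size;
	int objects[MODEL_CAP];
};

static struct sheaf_model sheaves[MODEL_NR_CPUS];

/* Initialise the locks and fill sheaf @cpu with @n dummy objects. */
static void sheaves_init(int cpu, unsigned int n)
{
	for (int i = 0; i < MODEL_NR_CPUS; i++) {
		pthread_mutex_init(&sheaves[i].lock, NULL);
		sheaves[i].size = 0;
	}
	for (unsigned int i = 0; i < n && i < MODEL_CAP; i++)
		sheaves[cpu].objects[sheaves[cpu].size++] = (int)i + 1;
}

/*
 * Must be called with sheaves[cpu].lock held; returns with it released.
 * Objects are copied to an on-stack array under the lock and "freed"
 * after the unlock, which bounds both lock hold time and stack usage.
 */
static unsigned int flush_main_batch(int cpu, unsigned int *freed)
{
	struct sheaf_model *s = &sheaves[cpu];
	int batch_objs[MODEL_BATCH_MAX];
	unsigned int batch, remaining;

	batch = s->size < MODEL_BATCH_MAX ? s->size : MODEL_BATCH_MAX;
	s->size -= batch;
	memcpy(batch_objs, s->objects + s->size, batch * sizeof(batch_objs[0]));
	remaining = s->size;

	pthread_mutex_unlock(&s->lock);

	/* Outside the lock: a real allocator would free batch_objs[] here. */
	*freed += batch;

	return remaining;
}

/* Flush sheaf @cpu completely; returns the number of lock round trips. */
static unsigned int flush_main(int cpu, unsigned int *freed)
{
	unsigned int remaining, rounds = 0;

	do {
		pthread_mutex_lock(&sheaves[cpu].lock);
		remaining = flush_main_batch(cpu, freed);
		rounds++;
	} while (remaining);

	return rounds;
}
```

With seven objects and a batch limit of three, the loop takes the lock three times (batches of 3, 3 and 1), and no object is freed while the lock is held.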
Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
---
mm/slub.c | 142 +++++++++++++++++++++++++++---------------------------
1 file changed, 72 insertions(+), 70 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 8f9004536729..a154d20e78f7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -43,20 +43,21 @@
#include <linux/prefetch.h>
#include <linux/memcontrol.h>
#include <linux/random.h>
#include <linux/prandom.h>
#include <kunit/test.h>
#include <kunit/test-bug.h>
#include <linux/sort.h>
#include <linux/irq_work.h>
#include <linux/kprobes.h>
#include <linux/debugfs.h>
+#include <linux/pwlocks.h>
#include <trace/events/kmem.h>
#include "internal.h"
/*
* Lock order:
* 0. cpu_hotplug_lock
* 1. slab_mutex (Global Mutex)
* 2a. kmem_cache->cpu_sheaves->lock (Local trylock)
* 2b. barn->lock (Spinlock)
@@ -122,21 +123,21 @@
* (Note that the total number of slabs is an atomic value that may be
* modified without taking the list lock).
*
* The list_lock is a centralized lock and thus we avoid taking it as
* much as possible. As long as SLUB does not have to handle partial
* slabs, operations can continue without any centralized lock.
*
* For debug caches, all allocations are forced to go through a list_lock
* protected region to serialize against concurrent validation.
*
- * cpu_sheaves->lock (local_trylock)
+ * cpu_sheaves->lock (pw_trylock)
*
* This lock protects fastpath operations on the percpu sheaves. On !RT it
* only disables preemption and does no atomic operations. As long as the main
* or spare sheaf can handle the allocation or free, there is no other
* overhead.
*
* barn->lock (spinlock)
*
* This lock protects the operations on per-NUMA-node barn. It can quickly
* serve an empty or full sheaf if available, and avoid more expensive refill
@@ -150,21 +151,21 @@
* cmpxchg_double this is done by a lockless update of slab's freelist and
* counters, otherwise slab_lock is taken. This only needs to take the
* list_lock if it's a first free to a full slab, or when a slab becomes empty
* after the free.
*
* irq, preemption, migration considerations
*
* Interrupts are disabled as part of list_lock or barn lock operations, or
* around the slab_lock operation, in order to make the slab allocator safe
* to use in the context of an irq.
- * Preemption is disabled as part of local_trylock operations.
+ * Preemption is disabled as part of pw_trylock operations.
* kmalloc_nolock() and kfree_nolock() are safe in NMI context but see
* their limitations.
*
* SLUB assigns two object arrays called sheaves for caching allocations and
* frees on each cpu, with a NUMA node shared barn for balancing between cpus.
* Allocations and frees are primarily served from these sheaves.
*
* Slabs with free elements are kept on a partial list and during regular
* operations no list for full slabs is used. If an object in a full slab is
* freed then the slab will show up again on the partial lists.
@@ -411,21 +412,21 @@ struct slab_sheaf {
bool pfmemalloc;
};
};
struct kmem_cache *cache;
unsigned int size;
int node; /* only used for rcu_sheaf */
void *objects[];
};
struct slub_percpu_sheaves {
- local_trylock_t lock;
+ pw_trylock_t lock;
struct slab_sheaf *main; /* never NULL when unlocked */
struct slab_sheaf *spare; /* empty or full, may be NULL */
struct slab_sheaf *rcu_free; /* for batching kfree_rcu() */
};
/*
* The slab lists for all objects.
*/
struct kmem_cache_node {
spinlock_t list_lock;
@@ -477,21 +478,21 @@ static nodemask_t slab_nodes;
* Corresponds to N_ONLINE nodes.
*/
static nodemask_t slab_barn_nodes;
/*
* Workqueue used for flushing cpu and kfree_rcu sheaves.
*/
static struct workqueue_struct *flushwq;
struct slub_flush_work {
- struct work_struct work;
+ struct pw_struct pw;
struct kmem_cache *s;
bool skip;
};
static DEFINE_MUTEX(flush_lock);
static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
/********************************************************************
* Core slab cache functions
*******************************************************************/
@@ -2838,74 +2839,74 @@ static void __kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
* Free all objects from the main sheaf. In order to perform
* __kmem_cache_free_bulk() outside of cpu_sheaves->lock, work in batches where
* object pointers are moved to a on-stack array under the lock. To bound the
* stack usage, limit each batch to PCS_BATCH_MAX.
*
* Must be called with s->cpu_sheaves->lock locked, returns with the lock
* unlocked.
*
* Returns how many objects are remaining to be flushed
*/
-static unsigned int __sheaf_flush_main_batch(struct kmem_cache *s)
+static unsigned int __sheaf_flush_main_batch(struct kmem_cache *s, int cpu)
{
struct slub_percpu_sheaves *pcs;
unsigned int batch, remaining;
void *objects[PCS_BATCH_MAX];
struct slab_sheaf *sheaf;
- lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
-
- pcs = this_cpu_ptr(s->cpu_sheaves);
+ pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
sheaf = pcs->main;
batch = min(PCS_BATCH_MAX, sheaf->size);
sheaf->size -= batch;
memcpy(objects, sheaf->objects + sheaf->size, batch * sizeof(void *));
remaining = sheaf->size;
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock(&s->cpu_sheaves->lock, cpu);
__kmem_cache_free_bulk(s, batch, &objects[0]);
stat_add(s, SHEAF_FLUSH, batch);
return remaining;
}
-static void sheaf_flush_main(struct kmem_cache *s)
+static void sheaf_flush_main(struct kmem_cache *s, int cpu)
{
unsigned int remaining;
do {
- local_lock(&s->cpu_sheaves->lock);
+ pw_lock(&s->cpu_sheaves->lock, cpu);
- remaining = __sheaf_flush_main_batch(s);
+ remaining = __sheaf_flush_main_batch(s, cpu);
} while (remaining);
}
/*
* Returns true if the main sheaf was at least partially flushed.
*/
static bool sheaf_try_flush_main(struct kmem_cache *s)
{
unsigned int remaining;
bool ret = false;
do {
- if (!local_trylock(&s->cpu_sheaves->lock))
+ if (!pw_trylock_local(&s->cpu_sheaves->lock))
return ret;
ret = true;
- remaining = __sheaf_flush_main_batch(s);
+
+ pw_lockdep_assert_held(&s->cpu_sheaves->lock);
+ remaining = __sheaf_flush_main_batch(s, smp_processor_id());
} while (remaining);
return ret;
}
/*
* Free all objects from a sheaf that's unused, i.e. not linked to any
* cpu_sheaves, so we need no locking and batching. The locking is also not
* necessary when flushing cpu's sheaves (both spare and main) during cpu
@@ -2968,45 +2969,45 @@ static void rcu_free_sheaf_nobarn(struct rcu_head *head)
/*
* Caller needs to make sure migration is disabled in order to fully flush
* single cpu's sheaves
*
* must not be called from an irq
*
* flushing operations are rare so let's keep it simple and flush to slabs
* directly, skipping the barn
*/
-static void pcs_flush_all(struct kmem_cache *s)
+static void pcs_flush_all(struct kmem_cache *s, int cpu)
{
struct slub_percpu_sheaves *pcs;
struct slab_sheaf *spare, *rcu_free;
- local_lock(&s->cpu_sheaves->lock);
- pcs = this_cpu_ptr(s->cpu_sheaves);
+ pw_lock(&s->cpu_sheaves->lock, cpu);
+ pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
spare = pcs->spare;
pcs->spare = NULL;
rcu_free = pcs->rcu_free;
pcs->rcu_free = NULL;
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock(&s->cpu_sheaves->lock, cpu);
if (spare) {
sheaf_flush_unused(s, spare);
free_empty_sheaf(s, spare);
}
if (rcu_free)
call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
- sheaf_flush_main(s);
+ sheaf_flush_main(s, cpu);
}
static void __pcs_flush_all_cpu(struct kmem_cache *s, unsigned int cpu)
{
struct slub_percpu_sheaves *pcs;
pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
/* The cpu is not executing anymore so we don't need pcs->lock */
sheaf_flush_unused(s, pcs->main);
@@ -3942,83 +3943,84 @@ static bool has_pcs_used(int cpu, struct kmem_cache *s)
/*
* Flush percpu sheaves
*
* Called from CPU work handler with migration disabled.
*/
static void flush_cpu_sheaves(struct work_struct *w)
{
struct kmem_cache *s;
struct slub_flush_work *sfw;
+ int cpu = pw_get_cpu(w);
- sfw = container_of(w, struct slub_flush_work, work);
-
+ sfw = &per_cpu(slub_flush, cpu);
s = sfw->s;
if (cache_has_sheaves(s))
- pcs_flush_all(s);
+ pcs_flush_all(s, cpu);
}
static void flush_all_cpus_locked(struct kmem_cache *s)
{
struct slub_flush_work *sfw;
unsigned int cpu;
lockdep_assert_cpus_held();
mutex_lock(&flush_lock);
for_each_online_cpu(cpu) {
sfw = &per_cpu(slub_flush, cpu);
if (!has_pcs_used(cpu, s)) {
sfw->skip = true;
continue;
}
- INIT_WORK(&sfw->work, flush_cpu_sheaves);
+ INIT_PW(&sfw->pw, flush_cpu_sheaves, cpu);
sfw->skip = false;
sfw->s = s;
- queue_work_on(cpu, flushwq, &sfw->work);
+ pw_queue_on(cpu, flushwq, &sfw->pw);
}
for_each_online_cpu(cpu) {
sfw = &per_cpu(slub_flush, cpu);
if (sfw->skip)
continue;
- flush_work(&sfw->work);
+ pw_flush(&sfw->pw);
}
mutex_unlock(&flush_lock);
}
static void flush_all(struct kmem_cache *s)
{
cpus_read_lock();
flush_all_cpus_locked(s);
cpus_read_unlock();
}
static void flush_rcu_sheaf(struct work_struct *w)
{
struct slub_percpu_sheaves *pcs;
struct slab_sheaf *rcu_free;
struct slub_flush_work *sfw;
struct kmem_cache *s;
+ int cpu = pw_get_cpu(w);
- sfw = container_of(w, struct slub_flush_work, work);
+ sfw = &per_cpu(slub_flush, cpu);
s = sfw->s;
- local_lock(&s->cpu_sheaves->lock);
- pcs = this_cpu_ptr(s->cpu_sheaves);
+ pw_lock(&s->cpu_sheaves->lock, cpu);
+ pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
rcu_free = pcs->rcu_free;
pcs->rcu_free = NULL;
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock(&s->cpu_sheaves->lock, cpu);
if (rcu_free)
call_rcu(&rcu_free->rcu_head, rcu_free_sheaf_nobarn);
}
/* needed for kvfree_rcu_barrier() */
void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
{
struct slub_flush_work *sfw;
@@ -4029,28 +4031,28 @@ void flush_rcu_sheaves_on_cache(struct kmem_cache *s)
for_each_online_cpu(cpu) {
sfw = &per_cpu(slub_flush, cpu);
/*
* we don't check if rcu_free sheaf exists - racing
* __kfree_rcu_sheaf() might have just removed it.
* by executing flush_rcu_sheaf() on the cpu we make
* sure the __kfree_rcu_sheaf() finished its call_rcu()
*/
- INIT_WORK(&sfw->work, flush_rcu_sheaf);
+ INIT_PW(&sfw->pw, flush_rcu_sheaf, cpu);
sfw->s = s;
- queue_work_on(cpu, flushwq, &sfw->work);
+ pw_queue_on(cpu, flushwq, &sfw->pw);
}
for_each_online_cpu(cpu) {
sfw = &per_cpu(slub_flush, cpu);
- flush_work(&sfw->work);
+ pw_flush(&sfw->pw);
}
mutex_unlock(&flush_lock);
}
void flush_all_rcu_sheaves(void)
{
struct kmem_cache *s;
cpus_read_lock();
@@ -4589,36 +4591,36 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
* unlocked.
*/
static struct slub_percpu_sheaves *
__pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs, gfp_t gfp)
{
struct slab_sheaf *empty = NULL;
struct slab_sheaf *full;
struct node_barn *barn;
bool allow_spin;
- lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+ pw_lockdep_assert_held(&s->cpu_sheaves->lock);
/* Bootstrap or debug cache, back off */
if (unlikely(!cache_has_sheaves(s))) {
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
return NULL;
}
if (pcs->spare && pcs->spare->size > 0) {
swap(pcs->main, pcs->spare);
return pcs;
}
barn = get_barn(s);
if (!barn) {
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
return NULL;
}
allow_spin = gfpflags_allow_spinning(gfp);
full = barn_replace_empty_sheaf(barn, pcs->main, allow_spin);
if (full) {
stat(s, BARN_GET);
pcs->main = full;
@@ -4629,21 +4631,21 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
if (allow_spin) {
if (pcs->spare) {
empty = pcs->spare;
pcs->spare = NULL;
} else {
empty = barn_get_empty_sheaf(barn, true);
}
}
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
pcs = NULL;
if (!allow_spin)
return NULL;
if (!empty) {
empty = alloc_empty_sheaf(s, gfp);
if (!empty)
return NULL;
}
@@ -4655,21 +4657,21 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
*/
sheaf_flush_unused(s, empty);
free_empty_sheaf(s, empty);
return NULL;
}
full = empty;
empty = NULL;
- if (!local_trylock(&s->cpu_sheaves->lock))
+ if (!pw_trylock_local(&s->cpu_sheaves->lock))
goto barn_put;
pcs = this_cpu_ptr(s->cpu_sheaves);
/*
* If we put any empty or full sheaf to the barn below, it's due to
* racing or being migrated to a different cpu. Breaching the barn's
* sheaf limits should be thus rare enough so just ignore them to
* simplify the recovery.
*/
@@ -4733,121 +4735,121 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
/*
* We assume the percpu sheaves contain only local objects although it's
* not completely guaranteed, so we verify later.
*/
if (unlikely(node_requested && node != numa_mem_id())) {
stat(s, ALLOC_NODE_MISMATCH);
return NULL;
}
- if (!local_trylock(&s->cpu_sheaves->lock))
+ if (!pw_trylock_local(&s->cpu_sheaves->lock))
return NULL;
pcs = this_cpu_ptr(s->cpu_sheaves);
if (unlikely(pcs->main->size == 0)) {
pcs = __pcs_replace_empty_main(s, pcs, gfp);
if (unlikely(!pcs))
return NULL;
}
object = pcs->main->objects[pcs->main->size - 1];
if (unlikely(node_requested)) {
/*
* Verify that the object was from the node we want. This could
* be false because of cpu migration during an unlocked part of
* the current allocation or previous freeing process.
*/
if (page_to_nid(virt_to_page(object)) != node) {
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
stat(s, ALLOC_NODE_MISMATCH);
return NULL;
}
}
pcs->main->size--;
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
stat(s, ALLOC_FASTPATH);
return object;
}
static __fastpath_inline
unsigned int alloc_from_pcs_bulk(struct kmem_cache *s, gfp_t gfp, size_t size,
void **p)
{
struct slub_percpu_sheaves *pcs;
struct slab_sheaf *main;
unsigned int allocated = 0;
unsigned int batch;
next_batch:
- if (!local_trylock(&s->cpu_sheaves->lock))
+ if (!pw_trylock_local(&s->cpu_sheaves->lock))
return allocated;
pcs = this_cpu_ptr(s->cpu_sheaves);
if (unlikely(pcs->main->size == 0)) {
struct slab_sheaf *full;
struct node_barn *barn;
if (unlikely(!cache_has_sheaves(s))) {
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
return allocated;
}
if (pcs->spare && pcs->spare->size > 0) {
swap(pcs->main, pcs->spare);
goto do_alloc;
}
barn = get_barn(s);
if (!barn) {
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
return allocated;
}
full = barn_replace_empty_sheaf(barn, pcs->main,
gfpflags_allow_spinning(gfp));
if (full) {
stat(s, BARN_GET);
pcs->main = full;
goto do_alloc;
}
stat(s, BARN_GET_FAIL);
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
/*
* Once full sheaves in barn are depleted, let the bulk
* allocation continue from slab pages, otherwise we would just
* be copying arrays of pointers twice.
*/
return allocated;
}
do_alloc:
main = pcs->main;
batch = min(size, main->size);
main->size -= batch;
memcpy(p, main->objects + main->size, batch * sizeof(void *));
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
stat_add(s, ALLOC_FASTPATH, batch);
allocated += batch;
if (batch < size) {
p += batch;
size -= batch;
goto next_batch;
}
@@ -5017,40 +5019,40 @@ kmem_cache_prefill_sheaf(struct kmem_cache *s, gfp_t gfp, unsigned int size)
&sheaf->objects[0])) {
kfree(sheaf);
return NULL;
}
sheaf->size = size;
return sheaf;
}
- local_lock(&s->cpu_sheaves->lock);
+ pw_lock_local(&s->cpu_sheaves->lock);
pcs = this_cpu_ptr(s->cpu_sheaves);
if (pcs->spare) {
sheaf = pcs->spare;
pcs->spare = NULL;
stat(s, SHEAF_PREFILL_FAST);
} else {
barn = get_barn(s);
stat(s, SHEAF_PREFILL_SLOW);
if (barn)
sheaf = barn_get_full_or_empty_sheaf(barn);
if (sheaf && sheaf->size)
stat(s, BARN_GET);
else
stat(s, BARN_GET_FAIL);
}
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
if (!sheaf)
sheaf = alloc_empty_sheaf(s, gfp);
if (sheaf) {
sheaf->capacity = s->sheaf_capacity;
sheaf->pfmemalloc = false;
if (sheaf->size < size &&
@@ -5080,31 +5082,31 @@ void kmem_cache_return_sheaf(struct kmem_cache *s, gfp_t gfp,
struct slub_percpu_sheaves *pcs;
struct node_barn *barn;
if (unlikely((sheaf->capacity != s->sheaf_capacity)
|| sheaf->pfmemalloc)) {
sheaf_flush_unused(s, sheaf);
kfree(sheaf);
return;
}
- local_lock(&s->cpu_sheaves->lock);
+ pw_lock_local(&s->cpu_sheaves->lock);
pcs = this_cpu_ptr(s->cpu_sheaves);
barn = get_barn(s);
if (!pcs->spare) {
pcs->spare = sheaf;
sheaf = NULL;
stat(s, SHEAF_RETURN_FAST);
}
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
if (!sheaf)
return;
stat(s, SHEAF_RETURN_SLOW);
/*
* If the barn has too many full sheaves or we fail to refill the sheaf,
* simply flush and free it.
*/
@@ -5627,21 +5629,21 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
* An alternative scenario that gets us here is when we fail
* barn_replace_full_sheaf(), because there's no empty sheaf available in the
* barn, so we had to allocate it by alloc_empty_sheaf(). But because we saw the
* limit on full sheaves was not exceeded, we assume it didn't change and just
* put the full sheaf there.
*/
static void __pcs_install_empty_sheaf(struct kmem_cache *s,
struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty,
struct node_barn *barn)
{
- lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+ pw_lockdep_assert_held(&s->cpu_sheaves->lock);
/* This is what we expect to find if nobody interrupted us. */
if (likely(!pcs->spare)) {
pcs->spare = pcs->main;
pcs->main = empty;
return;
}
/*
* Unlikely because if the main sheaf had space, we would have just
@@ -5678,31 +5680,31 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
*/
static struct slub_percpu_sheaves *
__pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
bool allow_spin)
{
struct slab_sheaf *empty;
struct node_barn *barn;
bool put_fail;
restart:
- lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+ pw_lockdep_assert_held(&s->cpu_sheaves->lock);
/* Bootstrap or debug cache, back off */
if (unlikely(!cache_has_sheaves(s))) {
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
return NULL;
}
barn = get_barn(s);
if (!barn) {
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
return NULL;
}
put_fail = false;
if (!pcs->spare) {
empty = barn_get_empty_sheaf(barn, allow_spin);
if (empty) {
pcs->spare = pcs->main;
pcs->main = empty;
@@ -5725,107 +5727,107 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
}
/* sheaf_flush_unused() doesn't support !allow_spin */
if (PTR_ERR(empty) == -E2BIG && allow_spin) {
/* Since we got here, spare exists and is full */
struct slab_sheaf *to_flush = pcs->spare;
stat(s, BARN_PUT_FAIL);
pcs->spare = NULL;
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
sheaf_flush_unused(s, to_flush);
empty = to_flush;
goto got_empty;
}
/*
* We could not replace full sheaf because barn had no empty
* sheaves. We can still allocate it and put the full sheaf in
* __pcs_install_empty_sheaf(), but if we fail to allocate it,
* make sure to count the fail.
*/
put_fail = true;
alloc_empty:
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
/*
* alloc_empty_sheaf() doesn't support !allow_spin and it's
* easier to fall back to freeing directly without sheaves
* than add the support (and to sheaf_flush_unused() above)
*/
if (!allow_spin)
return NULL;
empty = alloc_empty_sheaf(s, GFP_NOWAIT);
if (empty)
goto got_empty;
if (put_fail)
stat(s, BARN_PUT_FAIL);
if (!sheaf_try_flush_main(s))
return NULL;
- if (!local_trylock(&s->cpu_sheaves->lock))
+ if (!pw_trylock_local(&s->cpu_sheaves->lock))
return NULL;
pcs = this_cpu_ptr(s->cpu_sheaves);
/*
* we flushed the main sheaf so it should be empty now,
* but in case we got preempted or migrated, we need to
* check again
*/
if (pcs->main->size == s->sheaf_capacity)
goto restart;
return pcs;
got_empty:
- if (!local_trylock(&s->cpu_sheaves->lock)) {
+ if (!pw_trylock_local(&s->cpu_sheaves->lock)) {
barn_put_empty_sheaf(barn, empty);
return NULL;
}
pcs = this_cpu_ptr(s->cpu_sheaves);
__pcs_install_empty_sheaf(s, pcs, empty, barn);
return pcs;
}
/*
* Free an object to the percpu sheaves.
* The object is expected to have passed slab_free_hook() already.
*/
static __fastpath_inline
bool free_to_pcs(struct kmem_cache *s, void *object, bool allow_spin)
{
struct slub_percpu_sheaves *pcs;
- if (!local_trylock(&s->cpu_sheaves->lock))
+ if (!pw_trylock_local(&s->cpu_sheaves->lock))
return false;
pcs = this_cpu_ptr(s->cpu_sheaves);
if (unlikely(pcs->main->size == s->sheaf_capacity)) {
pcs = __pcs_replace_full_main(s, pcs, allow_spin);
if (unlikely(!pcs))
return false;
}
pcs->main->objects[pcs->main->size++] = object;
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
stat(s, FREE_FASTPATH);
return true;
}
static void rcu_free_sheaf(struct rcu_head *head)
{
struct slab_sheaf *sheaf;
struct node_barn *barn = NULL;
@@ -5898,63 +5900,63 @@ static DEFINE_WAIT_OVERRIDE_MAP(kfree_rcu_sheaf_map, LD_WAIT_CONFIG);
bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
{
struct slub_percpu_sheaves *pcs;
struct slab_sheaf *rcu_sheaf;
if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
return false;
lock_map_acquire_try(&kfree_rcu_sheaf_map);
- if (!local_trylock(&s->cpu_sheaves->lock))
+ if (!pw_trylock_local(&s->cpu_sheaves->lock))
goto fail;
pcs = this_cpu_ptr(s->cpu_sheaves);
if (unlikely(!pcs->rcu_free)) {
struct slab_sheaf *empty;
struct node_barn *barn;
/* Bootstrap or debug cache, fall back */
if (unlikely(!cache_has_sheaves(s))) {
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
goto fail;
}
if (pcs->spare && pcs->spare->size == 0) {
pcs->rcu_free = pcs->spare;
pcs->spare = NULL;
goto do_free;
}
barn = get_barn(s);
if (!barn) {
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
goto fail;
}
empty = barn_get_empty_sheaf(barn, true);
if (empty) {
pcs->rcu_free = empty;
goto do_free;
}
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
empty = alloc_empty_sheaf(s, GFP_NOWAIT);
if (!empty)
goto fail;
- if (!local_trylock(&s->cpu_sheaves->lock)) {
+ if (!pw_trylock_local(&s->cpu_sheaves->lock)) {
barn_put_empty_sheaf(barn, empty);
goto fail;
}
pcs = this_cpu_ptr(s->cpu_sheaves);
if (unlikely(pcs->rcu_free))
barn_put_empty_sheaf(barn, empty);
else
pcs->rcu_free = empty;
@@ -5971,27 +5973,27 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
rcu_sheaf->objects[rcu_sheaf->size++] = obj;
if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
rcu_sheaf = NULL;
} else {
pcs->rcu_free = NULL;
rcu_sheaf->node = numa_node_id();
}
/*
- * we flush before local_unlock to make sure a racing
+ * we flush before pw_unlock_local to make sure a racing
* flush_all_rcu_sheaves() doesn't miss this sheaf
*/
if (rcu_sheaf)
call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
stat(s, FREE_RCU_SHEAF);
lock_map_release(&kfree_rcu_sheaf_map);
return true;
fail:
stat(s, FREE_RCU_SHEAF_FAIL);
lock_map_release(&kfree_rcu_sheaf_map);
return false;
}
@@ -6082,21 +6084,21 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
continue;
}
i++;
}
if (!size)
goto flush_remote;
next_batch:
- if (!local_trylock(&s->cpu_sheaves->lock))
+ if (!pw_trylock_local(&s->cpu_sheaves->lock))
goto fallback;
pcs = this_cpu_ptr(s->cpu_sheaves);
if (likely(pcs->main->size < s->sheaf_capacity))
goto do_free;
barn = get_barn(s);
if (!barn)
goto no_empty;
@@ -6125,37 +6127,37 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
stat(s, BARN_PUT);
pcs->main = empty;
do_free:
main = pcs->main;
batch = min(size, s->sheaf_capacity - main->size);
memcpy(main->objects + main->size, p, batch * sizeof(void *));
main->size += batch;
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
stat_add(s, FREE_FASTPATH, batch);
if (batch < size) {
p += batch;
size -= batch;
goto next_batch;
}
if (remote_nr)
goto flush_remote;
return;
no_empty:
- local_unlock(&s->cpu_sheaves->lock);
+ pw_unlock_local(&s->cpu_sheaves->lock);
/*
* if we depleted all empty sheaves in the barn or there are too
* many full sheaves, free the rest to slab pages
*/
fallback:
__kmem_cache_free_bulk(s, size, p);
stat_add(s, FREE_SLOWPATH, size);
flush_remote:
@@ -7554,21 +7556,21 @@ static inline int alloc_kmem_cache_stats(struct kmem_cache *s)
static int init_percpu_sheaves(struct kmem_cache *s)
{
static struct slab_sheaf bootstrap_sheaf = {};
int cpu;
for_each_possible_cpu(cpu) {
struct slub_percpu_sheaves *pcs;
pcs = per_cpu_ptr(s->cpu_sheaves, cpu);
- local_trylock_init(&pcs->lock);
+ pw_trylock_init(&pcs->lock);
/*
* Bootstrap sheaf has zero size so fast-path allocation fails.
* It has also size == s->sheaf_capacity, so fast-path free
* fails. In the slow paths we recognize the situation by
* checking s->sheaf_capacity. This allows fast paths to assume
* s->cpu_sheaves and pcs->main always exists and are valid.
* It's also safe to share the single static bootstrap_sheaf
* with zero-sized objects array as it's never modified.
*
--
2.54.0
* [syzbot ci] Re: Introduce Per-CPU Work helpers (was QPW)
2026-05-19 1:27 [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW) Leonardo Bras
` (3 preceding siblings ...)
2026-05-19 1:27 ` [PATCH v4 4/4] slub: " Leonardo Bras
@ 2026-05-19 6:58 ` syzbot ci
2026-05-20 13:09 ` [PATCH v4 0/4] " Sebastian Andrzej Siewior
5 siblings, 0 replies; 12+ messages in thread
From: syzbot ci @ 2026-05-19 6:58 UTC (permalink / raw)
To: akpm, axelrasmussen, baohua, bhe, boqun, bp, brauner, chrisl, cl,
corbet, coxu, dapeng1.mi, david, dianders, ebiggers, elver,
feng.tang, frederic, gary, hannes, hao.li, harry, jackmanb, jannh,
kasong, kees, kuba, leobras.c, liam, linux-doc, linux-kernel,
linux-mm, linux-rt-devel, lirongqing, ljs, longman, masahiroy,
mhocko, mingo, mtosatti, nathan, nphamcs, nsc, ojeda,
pasha.tatashin, paulmck, peterz, pfalcato, qi.zheng, rdunlap
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v4] Introduce Per-CPU Work helpers (was QPW)
https://lore.kernel.org/all/20260519012754.240804-1-leobras.c@gmail.com
* [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work
* [PATCH v4 2/4] mm/swap: move bh draining into a separate workqueue
* [PATCH v4 3/4] swap: apply new pw_queue_on() interface
* [PATCH v4 4/4] slub: apply new pw_queue_on() interface
and found the following issue:
WARNING in __pcs_replace_empty_main
Full report is available here:
https://ci.syzbot.org/series/804f81bd-77b4-490e-bd57-6345ad2aa923
***
WARNING in __pcs_replace_empty_main
tree: drm-next
URL: https://gitlab.freedesktop.org/drm/kernel.git
base: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/3ea80958-13bd-49da-9c64-6deb788113f8/config
clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
Zone ranges:
DMA [mem 0x0000000000001000-0x0000000000ffffff]
DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
Normal [mem 0x0000000100000000-0x000000023fffffff]
Device empty
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x0000000000001000-0x000000000009efff]
node 0: [mem 0x0000000000100000-0x000000007ffdefff]
node 0: [mem 0x0000000100000000-0x0000000160000fff]
node 1: [mem 0x0000000160001000-0x000000023fffffff]
Initmem setup node 0 [mem 0x0000000000001000-0x0000000160000fff]
Initmem setup node 1 [mem 0x0000000160001000-0x000000023fffffff]
On node 0, zone DMA: 1 pages in unavailable ranges
On node 0, zone DMA: 97 pages in unavailable ranges
On node 0, zone Normal: 33 pages in unavailable ranges
setup_percpu: NR_CPUS:8 nr_cpumask_bits:2 nr_cpu_ids:2 nr_node_ids:2
percpu: Embedded 71 pages/cpu s250632 r8192 d31992 u2097152
kvm-guest: PV spinlocks disabled, no host support
Kernel command line: earlyprintk=serial net.ifnames=0 sysctl.kernel.hung_task_all_cpu_backtrace=1 ima_policy=tcb nf-conntrack-ftp.ports=20000 nf-conntrack-tftp.ports=20000 nf-conntrack-sip.ports=20000 nf-conntrack-irc.ports=20000 nf-conntrack-sane.ports=20000 binder.debug_mask=0 rcupdate.rcu_expedited=1 rcupdate.rcu_cpu_stall_cputime=1 no_hash_pointers page_owner=on sysctl.vm.nr_hugepages=4 sysctl.vm.nr_overcommit_hugepages=4 secretmem.enable=1 sysctl.max_rcu_stall_to_panic=1 msr.allow_writes=off coredump_filter=0xffff root=/dev/sda console=ttyS0 vsyscall=native numa=fake=2 kvm-intel.nested=1 spec_store_bypass_disable=prctl nopcid vivid.n_devs=64 vivid.multiplanar=1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2 netrom.nr_ndevs=32 rose.rose_ndevs=32 smp.csd_lock_timeout=100000 watchdog_thresh=55 workqueue.watchdog_thresh=140 sysctl.net.core.netdev_unregister_timeout_secs=140 dummy_hcd.num=32 max_loop=32 nbds_max=32 \
Kernel command line: comedi.comedi_num_legacy_minors=4 panic_on_warn=1 root=/dev/sda console=ttyS0 root=/dev/sda1
Unknown kernel command line parameters "nbds_max=32", will be passed to user space.
printk: log buffer data + meta data: 262144 + 917504 = 1179648 bytes
software IO TLB: area num 2.
Fallback order for Node 0: 0 1
Fallback order for Node 1: 1 0
Built 2 zonelists, mobility grouping on. Total pages: 1834877
Policy zone: Normal
mem auto-init: stack:all(zero), heap alloc:on, heap free:off
stackdepot: allocating hash table via alloc_large_system_hash
stackdepot hash table entries: 1048576 (order: 12, 16777216 bytes, linear)
stackdepot: allocating space for 8192 stack pools via memblock
**********************************************************
** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
** **
** This system shows unhashed kernel memory addresses **
** via the console, logs, and other interfaces. This **
** might reduce the security of your system. **
** **
** If you see this message and you are not debugging **
** the kernel, report this immediately to your system **
** administrator! **
** **
** Use hash_pointers=always to force this mode off **
** **
** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
**********************************************************
------------[ cut here ]------------
debug_locks && !(lock_is_held(&(&s->cpu_sheaves->lock)->dep_map) != 0)
WARNING: mm/slub.c:4601 at __pcs_replace_empty_main+0x51b/0x6e0, CPU#0: swapper/0
Modules linked in:
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted syzkaller #0 PREEMPT(undef)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__pcs_replace_empty_main+0x51b/0x6e0
Code: 48 85 f6 74 15 4c 89 ff 48 89 c6 e8 af 5e ff ff 4d 89 74 24 38 e9 36 fc ff ff 49 89 44 24 40 4d 89 74 24 38 e9 27 fc ff ff 90 <0f> 0b 90 83 7b 2c 00 0f 85 23 fb ff ff 48 8b 1b e8 20 cd 82 09 41
RSP: 0000:ffffffff8e607d58 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffffff91bb8398 RCX: 0000000000000002
RDX: 0000000000000cc0 RSI: ffffffff8e21ec94 RDI: ffffffff8c28b160
RBP: 0000000000000cc0 R08: 0000000000005e00 R09: 00000000477ac845
R10: 0000000047d13f7f R11: 000000002fa01ecd R12: ffff88812103f308
R13: 0000000000000000 R14: ffffffff91bb8398 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff88818dc8a000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000e74a000 CR4: 00000000000000b0
Call Trace:
<TASK>
kmem_cache_alloc_node_noprof+0x441/0x690
do_kmem_cache_create+0x172/0x620
create_boot_cache+0xbf/0x120
kmem_cache_init+0x11a/0x1e0
mm_core_init+0x7e/0xb0
start_kernel+0x15a/0x3e0
x86_64_start_reservations+0x24/0x30
x86_64_start_kernel+0x143/0x1c0
common_startup_64+0x13e/0x147
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).
The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.
* Re: [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work
2026-05-19 1:27 ` [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work Leonardo Bras
@ 2026-05-20 10:08 ` Frederic Weisbecker
2026-05-20 13:48 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2026-05-20 10:08 UTC (permalink / raw)
To: Leonardo Bras
Cc: Jonathan Corbet, Shuah Khan, Peter Zijlstra, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, linux-doc, linux-kernel, linux-mm,
linux-rt-devel, Marcelo Tosatti
On Mon, May 18, 2026 at 10:27:47PM -0300, Leonardo Bras wrote:
> Some places in the kernel implement a parallel programming strategy
> consisting of local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
>
> On the other hand, for RT workloads this can represent a problem:
> scheduling work on remote cpus that are executing low latency tasks
> is undesirable and can introduce unexpected deadline misses.
>
> It's interesting, though, that local_lock()s in RT kernels become
> spinlock(). We can make use of those to avoid scheduling work on a remote
> cpu by directly updating another cpu's per_cpu structure, while holding
> its spinlock().
>
> In order to do that, it's necessary to introduce a new set of functions to
> make it possible to get another cpu's per-cpu "local" lock (pw_{un,}lock*)
> and also do the corresponding queueing (pw_queue_on()) and flushing
> (pw_flush()) helpers to run the remote work.
>
> Users of non-RT kernels but with low latency requirements can select
> similar functionality by using the CONFIG_PWLOCKS compile time option.
>
> On CONFIG_PWLOCKS disabled kernels, no changes are expected, as every
> one of the introduced helpers works exactly the same as the current
> implementation:
> pw_{un,}lock*() -> local_{un,}lock*() (ignores cpu parameter)
> pw_queue_on() -> queue_work_on()
> pw_flush() -> flush_work()
>
> For PWLOCKS enabled kernels, though, pw_{un,}lock*() will use the extra
> cpu parameter to select the correct per-cpu structure to work on,
> and acquire the spinlock for that cpu.
>
> pw_queue_on() will just call the requested function in the current
> cpu, which will operate in another cpu's per-cpu object. Since the
> local_locks() become spinlock()s in PWLOCKS enabled kernels, we are
> safe doing that.
>
> pw_flush() then becomes a no-op since no work is actually scheduled on a
> remote cpu.
>
> Some minimal code rework is needed in order to make this mechanism work:
> The calls for local_{un,}lock*() on the functions that are currently
> scheduled on remote cpus need to be replaced by pw_{un,}lock_*(), so that on
> PWLOCKS enabled kernels they can reference a different cpu. It's also
> necessary to use a pw_struct instead of a work_struct, but it just
> contains a work struct and, in CONFIG_PWLOCKS, the target cpu.
>
> This should have almost no impact on non-CONFIG_PWLOCKS kernels: a few
> this_cpu_ptr() calls will become per_cpu_ptr(,smp_processor_id()) on
> non-hotpath functions.
>
> On CONFIG_PWLOCKS kernels, this should avoid deadline misses by
> removing scheduling noise.
>
> Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
I like it! Just a few observations:
> +#ifndef CONFIG_PWLOCKS
> +
> +typedef local_lock_t pw_lock_t;
> +typedef local_trylock_t pw_trylock_t;
> +
> +struct pw_struct {
> + struct work_struct work;
> +};
> +
> +#define pw_lock_init(lock) \
> + local_lock_init(lock)
> +
> +#define pw_trylock_init(lock) \
> + local_trylock_init(lock)
> +
> +#define pw_lock(lock, cpu) \
> + local_lock(lock)
For debugging purposes, it would be nice to ensure that in this off case
(CONFIG_PWLOCKS disabled), cpu is indeed the local one. Basically all the
non-local functions, those that take a cpu, should verify:
lockdep_assert(cpu == smp_processor_id())
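
A minimal userspace sketch of what such a check could look like in the
!CONFIG_PWLOCKS variant of pw_lock(). Every name ending in _stub or _sketch is
a hypothetical stand-in invented for illustration; the kernel's real
smp_processor_id(), lockdep_assert() and local_lock() are only mimicked here,
and the stub counts failed checks instead of emitting a WARN.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins; not kernel APIs. */
static int fake_current_cpu;	/* the CPU we pretend to be running on */
static int pw_check_failures;	/* number of failed debug checks */

static int smp_processor_id_stub(void)
{
	return fake_current_cpu;
}

/* Mimics lockdep_assert(): records a failure instead of WARNing. */
#define lockdep_assert_stub(cond)					\
do {									\
	if (!(cond))							\
		pw_check_failures++;					\
} while (0)

/* Mimics local_lock(): only counts acquisitions. */
static void local_lock_stub(int *lock)
{
	(*lock)++;
}

/*
 * Sketch of the !CONFIG_PWLOCKS pw_lock(): cpu is not used to pick a
 * lock, but it is checked against the local CPU before locking.
 */
#define pw_lock_sketch(lock, cpu)					\
do {									\
	lockdep_assert_stub((cpu) == smp_processor_id_stub());		\
	local_lock_stub(lock);						\
} while (0)
```

With this shape, a caller that passes a remote cpu on a kernel where the cpu
argument is otherwise ignored would be flagged at the call site rather than
silently operating on the local lock.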
> +
> +#define pw_lock_local(lock) \
> + local_lock(lock)
> +
> +#define pw_lock_irqsave(lock, flags, cpu) \
> + local_lock_irqsave(lock, flags)
> +
> +#define pw_lock_local_irqsave(lock, flags) \
> + local_lock_irqsave(lock, flags)
> +
> +#define pw_trylock(lock, cpu) \
> + local_trylock(lock)
> +
> +#define pw_trylock_local(lock) \
> + local_trylock(lock)
> +
> +#define pw_trylock_irqsave(lock, flags, cpu) \
> + local_trylock_irqsave(lock, flags)
> +
> +#define pw_unlock(lock, cpu) \
> + local_unlock(lock)
> +
> +#define pw_unlock_local(lock) \
> + local_unlock(lock)
> +
> +#define pw_unlock_irqrestore(lock, flags, cpu) \
> + local_unlock_irqrestore(lock, flags)
> +
> +#define pw_unlock_local_irqrestore(lock, flags) \
> + local_unlock_irqrestore(lock, flags)
> +
> +#define pw_lockdep_assert_held(lock) \
> + lockdep_assert_held(lock)
> +
> +#define pw_queue_on(c, wq, pw) \
> + queue_work_on(c, wq, &(pw)->work)
> +
> +#define pw_flush(pw) \
> + flush_work(&(pw)->work)
> +
> +#define pw_get_cpu(pw) smp_processor_id()
> +
> +#define pw_is_cpu_remote(cpu) (false)
> +
> +#define INIT_PW(pw, func, c) \
> + INIT_WORK(&(pw)->work, (func))
> +
> +#else /* CONFIG_PWLOCKS */
> +
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_PWLOCKS_DEFAULT, pw_sl);
> +
> +typedef union {
> + spinlock_t sl;
> + local_lock_t ll;
> +} pw_lock_t;
> +
> +typedef union {
> + spinlock_t sl;
> + local_trylock_t ll;
> +} pw_trylock_t;
> +
> +struct pw_struct {
> + struct work_struct work;
> + int cpu;
> +};
> +
> +#ifdef CONFIG_PREEMPT_RT
> +#define preempt_or_migrate_disable migrate_disable
> +#define preempt_or_migrate_enable migrate_enable
> +#else
> +#define preempt_or_migrate_disable preempt_disable
> +#define preempt_or_migrate_enable preempt_enable
This can be no-op in !CONFIG_PREEMPT_RT because non-rt spinlocks
disable preemption already.
> +#endif
> +
> +#define pw_lock_init(lock) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_lock_init(lock.sl); \
> + else \
> + local_lock_init(lock.ll); \
> +} while (0)
It looks like all these macros could be inline functions.
> +
> +#define pw_trylock_init(lock) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_lock_init(lock.sl); \
> + else \
> + local_trylock_init(lock.ll); \
> +} while (0)
> +
> +#define pw_lock(lock, cpu) \
And those could have the same local CPU debug check.
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_lock(per_cpu_ptr(lock.sl, cpu)); \
> + else \
> + local_lock(lock.ll); \
> +} while (0)
> +
> +#define pw_lock_local(lock) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + preempt_or_migrate_disable(); \
> + spin_lock(this_cpu_ptr(lock.sl)); \
> + } else { \
> + local_lock(lock.ll); \
> + } \
> +} while (0)
> +
> +#define pw_lock_irqsave(lock, flags, cpu) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags); \
> + else \
> + local_lock_irqsave(lock.ll, flags); \
> +} while (0)
> +
> +#define pw_lock_local_irqsave(lock, flags) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + preempt_or_migrate_disable(); \
> + spin_lock_irqsave(this_cpu_ptr(lock.sl), flags); \
> + } else { \
> + local_lock_irqsave(lock.ll, flags); \
> + } \
> +} while (0)
> +
> +#define pw_trylock(lock, cpu) \
> +({ \
> + int t; \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + t = spin_trylock(per_cpu_ptr(lock.sl, cpu)); \
> + else \
> + t = local_trylock(lock.ll); \
> + t; \
> +})
> +
> +#define pw_trylock_local(lock) \
> +({ \
> + int t; \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + preempt_or_migrate_disable(); \
> + t = spin_trylock(this_cpu_ptr(lock.sl)); \
> + if (!t) \
> + preempt_or_migrate_enable(); \
This is duplicating the RT logic in local_lock_internal.h and it would be
tempting to propose a spin_local_lock_t that both pw and RT local_lock could
rely upon. But I'm afraid that would create a less readable result:
- we would need to check CONFIG_PREEMPT_RT there before doing the
migrate_disable/enable
- RT local locks don't take the lock on IRQ/NMI, which is fine as pw is not
expected to be used on the non-threaded parts of IRQs nor NMIs. Still, that's
one more conditional to add there.
- we'll need to differentiate local/remote operations.
Well, let's stick to what you did for now (Peter might have a different opinion though).
> + } else { \
> + t = local_trylock(lock.ll); \
> + } \
> + t; \
> +})
> +
> +#define pw_trylock_irqsave(lock, flags, cpu) \
> +({ \
> + int t; \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags); \
> + else \
> + t = local_trylock_irqsave(lock.ll, flags); \
> + t; \
> +})
> +
> +#define pw_unlock(lock, cpu) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_unlock(per_cpu_ptr(lock.sl, cpu)); \
> + else \
> + local_unlock(lock.ll); \
> +} while (0)
> +
> +#define pw_unlock_local(lock) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + spin_unlock(this_cpu_ptr(lock.sl)); \
> + preempt_or_migrate_enable(); \
> + } else { \
> + local_unlock(lock.ll); \
> + } \
> +} while (0)
> +
> +#define pw_unlock_irqrestore(lock, flags, cpu) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags); \
> + else \
> + local_unlock_irqrestore(lock.ll, flags); \
> +} while (0)
> +
> +#define pw_unlock_local_irqrestore(lock, flags) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + spin_unlock_irqrestore(this_cpu_ptr(lock.sl), flags); \
> + preempt_or_migrate_enable(); \
> + } else { \
> + local_unlock_irqrestore(lock.ll, flags); \
> + } \
> +} while (0)
> +
> +#define pw_lockdep_assert_held(lock) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + lockdep_assert_held(this_cpu_ptr(lock.sl)); \
> + else \
> + lockdep_assert_held(this_cpu_ptr(lock.ll)); \
> +} while (0)
> +
> +#define pw_queue_on(c, wq, pw) \
> +do { \
> + int __c = c; \
> + struct pw_struct *__pw = (pw); \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + WARN_ON((__c) != __pw->cpu); \
> + __pw->work.func(&__pw->work); \
> + } else { \
> + queue_work_on(__c, wq, &(__pw)->work); \
> + } \
> +} while (0)
> +
> +/*
> + * Does nothing if PWLOCKS is set to use spinlock, as the task is already done at the
> + * time pw_queue_on() returns.
> + */
> +#define pw_flush(pw) \
> +do { \
> + struct pw_struct *__pw = (pw); \
> + if (!static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + flush_work(&__pw->work); \
> +} while (0)
> +
> +#define pw_get_cpu(w) container_of((w), struct pw_struct, work)->cpu
> +
> +#define pw_is_cpu_remote(cpu) ((cpu) != smp_processor_id())
> +
> +#define INIT_PW(pw, func, c) \
> +do { \
> + struct pw_struct *__pw = (pw); \
> + INIT_WORK(&__pw->work, (func)); \
> + __pw->cpu = (c); \
> +} while (0)
> +
> +#endif /* CONFIG_PWLOCKS */
> +#endif /* LINUX_PWLOCKS_H */
> diff --git a/kernel/pwlocks.c b/kernel/pwlocks.c
> new file mode 100644
> index 000000000000..1ebf5cb979b9
> --- /dev/null
> +++ b/kernel/pwlocks.c
> @@ -0,0 +1,47 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/export.h"
> +#include <linux/sched.h>
> +#include <linux/pwlocks.h>
> +#include <linux/string.h>
> +#include <linux/sched/isolation.h>
> +
> +DEFINE_STATIC_KEY_MAYBE(CONFIG_PWLOCKS_DEFAULT, pw_sl);
> +EXPORT_SYMBOL(pw_sl);
> +
> +static bool pwlocks_param_specified;
> +
> +static int __init pwlocks_setup(char *str)
> +{
> + int opt;
> +
> + if (!get_option(&str, &opt)) {
> + pr_warn("PWLOCKS: invalid pwlocks parameter: %s, ignoring.\n", str);
> + return 0;
> + }
> +
> + if (opt)
> + static_branch_enable(&pw_sl);
> + else
> + static_branch_disable(&pw_sl);
> +
> + pwlocks_param_specified = true;
> +
> + return 1;
> +}
> +__setup("pwlocks=", pwlocks_setup);
> +
> +/*
> + * Enable PWLOCKS if CPUs want to avoid kernel noise.
> + */
> +static int __init pwlocks_init(void)
> +{
> + if (pwlocks_param_specified)
> + return 0;
> +
> + if (housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
> + static_branch_enable(&pw_sl);
> +
> + return 0;
> +}
> +
> +late_initcall(pwlocks_init);
That should be a pre-SMP initcall. Otherwise you risk some asymmetric calls.
Thanks.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW)
2026-05-19 1:27 [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW) Leonardo Bras
` (4 preceding siblings ...)
2026-05-19 6:58 ` [syzbot ci] Re: Introduce Per-CPU Work helpers (was QPW) syzbot ci
@ 2026-05-20 13:09 ` Sebastian Andrzej Siewior
5 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-05-20 13:09 UTC (permalink / raw)
To: Leonardo Bras
Cc: Jonathan Corbet, Shuah Khan, Peter Zijlstra, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Thomas Gleixner, Feng Tang, Dapeng Mi, Kees Cook,
Marco Elver, Jakub Kicinski, Li RongQing, Eric Biggers,
Paul E. McKenney, Nathan Chancellor, Miguel Ojeda, Nicolas Schier,
Thomas Weißschuh, Douglas Anderson, Gary Guo,
Christian Brauner, Pasha Tatashin, Masahiro Yamada, Coiby Xu,
Frederic Weisbecker, linux-doc, linux-kernel, linux-mm,
linux-rt-devel
On 2026-05-18 22:27:46 [-0300], Leonardo Bras wrote:
> The problem:
> Some places in the kernel implement a parallel programming strategy
> consisting on local_locks() for most of the work, and some rare remote
> operations are scheduled on target cpu. This keeps cache bouncing low since
> cacheline tends to be mostly local, and avoids the cost of locks in non-RT
> kernels, even though the very few remote operations will be expensive due
> to scheduling overhead.
>
> On the other hand, for RT workloads this can represent a problem: getting
> an important workload scheduled out to deal with remote requests is
> sure to introduce unexpected deadline misses.
>
> The idea:
> Currently with PREEMPT_RT=y, local_locks() become per-cpu spinlocks.
It does not become a _spin_lock because it does not spin. It sleeps.
> In this case, instead of scheduling work on a remote cpu, it should
> be safe to grab that remote cpu's per-cpu spinlock and run the required
> work locally. That major cost, which is un/locking in every local function,
> already happens in PREEMPT_RT.
We did have this before, but only in the RT tree. The naming was a bit messy
because it started with local_ but then operated on a remote CPU.
The main issue was the different code path, which led to a few deadlocks
back then.
By the time local_lock_t went upstream, the cross-CPU locking had been
removed. As far as I remember, the cross-CPU users which scheduled
work on a remote CPU and annoyed the NOHZ folks were replaced.
> Also, there is no need to worry about extra cache bouncing:
> The cacheline invalidation already happens due to schedule_work_on().
>
> This will avoid schedule_work_on(), and thus avoid scheduling-out an
> RT workload.
>
> Proposed solution:
> A new interface called PerCPU Work (PW), which should replace
> Work Queue in the above mentioned use case.
>
> If CONFIG_PWLOCKS=n this interfaces just wraps the current
> local_locks + WorkQueue behavior, so no expected change in runtime.
>
> If CONFIG_PWLOCKS=y, and kernel boot option pwlocks=1,
> pw_queue_on(cpu,...) will lock that cpu's per-cpu structure
> and perform work on it locally.
>
Sebastian
* Re: [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work
2026-05-19 1:27 ` [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work Leonardo Bras
2026-05-20 10:08 ` Frederic Weisbecker
@ 2026-05-20 13:48 ` Sebastian Andrzej Siewior
2026-05-20 14:47 ` Frederic Weisbecker
1 sibling, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-05-20 13:48 UTC (permalink / raw)
To: Leonardo Bras
Cc: Jonathan Corbet, Shuah Khan, Peter Zijlstra, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, Frederic Weisbecker, linux-doc, linux-kernel,
linux-mm, linux-rt-devel, Marcelo Tosatti
On 2026-05-18 22:27:47 [-0300], Leonardo Bras wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 4d0f545fb3ec..68c8a6f9d227 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2810,20 +2810,30 @@ Kernel parameters
> If a queue's affinity mask contains only isolated
> CPUs then this parameter has no effect on the
> interrupt routing decision, though interrupts are
> only delivered when tasks running on those
> isolated CPUs submit IO. IO submitted on
> housekeeping CPUs has no influence on those
> queues.
>
> The format of <cpu-list> is described above.
>
> + pwlocks= [KNL,SMP] Select a behavior on per-CPU resource sharing
> + and remote interference mechanism on a kernel built with
> + CONFIG_PWLOCKS.
> + Format: { "0" | "1" }
> + 0 - local_lock() + queue_work_on(remote_cpu)
> + 1 - spin_lock() for both local and remote operations
> +
> + Selecting 1 may be interesting for systems that want
> + to avoid interruption & context switches from IPIs.
> +
This documentation is meant for an administrator/user of the system, who
should not be exposed to the underlying kernel technique.
It does not explain the users or the outcome, so it reads like best hope.
> iucv= [HW,NET]
>
> ivrs_ioapic [HW,X86-64]
> Provide an override to the IOAPIC-ID<->DEVICE-ID
> mapping provided in the IVRS ACPI table.
> By default, PCI segment is 0, and can be omitted.
>
> For example, to map IOAPIC-ID decimal 10 to
> PCI segment 0x1 and PCI device 00:14.0,
> write the parameter as:
> diff --git a/Documentation/locking/pwlocks.rst b/Documentation/locking/pwlocks.rst
> new file mode 100644
> index 000000000000..09f4a5417bc1
> --- /dev/null
> +++ b/Documentation/locking/pwlocks.rst
> @@ -0,0 +1,76 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=========
> +PW (Per-CPU Work) locks
> +=========
> +
> +Some places in the kernel implement a parallel programming strategy
> +consisting on local_locks() for most of the work, and some rare remote
> +operations are scheduled on target cpu. This keeps cache bouncing low since
> +cacheline tends to be mostly local, and avoids the cost of locks in non-RT
PREEMPT_RT can be spelled out if you mean it so it is not confused with
other meanings of the two letters.
> +kernels, even though the very few remote operations will be expensive due
> +to scheduling overhead.
> +
> +On the other hand, for RT workloads this can represent a problem:
> +scheduling work on remote cpu that are executing low latency tasks
> +is undesired and can introduce unexpected deadline misses.
> +
> +PW locks help to convert sites that use local_locks (for cpu local operations)
> +and queue_work_on (for queueing work remotely, to be executed
> +locally on the owner cpu of the lock) to a spinlocks.
not spinlocks.
> +
> +The lock is declared pw_lock_t type.
> +The lock is initialized with pw_lock_init.
> +The lock is locked with pw_lock (takes a lock and cpu as a parameter).
> +The lock is unlocked with pw_unlock (takes a lock and cpu as a parameter).
If it is a function, it should end with ()
> +The pw_lock_irqsave function disables interrupts and saves current interrupt state,
> +cpu as a parameter.
CPU.
> +For trylock variant, there is the pw_trylock_t type, initialized with
> +pw_trylock_init. Then the corresponding pw_trylock and pw_trylock_irqsave.
> +
> +work_struct should be replaced by pw_struct, which contains a cpu parameter
> +(owner cpu of the lock), initialized by INIT_PW.
> +
> +The queue work related functions (analogous to queue_work_on and flush_work) are:
> +pw_queue_on and pw_flush.
> +
> +The behaviour of the PW lock functions is as follows:
> +
> +* !CONFIG_PWLOCKS (or CONFIG_PWLOCKS and pwlocks=off kernel boot parameter):
> + - pw_lock: local_lock
> + - pw_lock_irqsave: local_lock_irqsave
> + - pw_trylock: local_trylock
> + - pw_trylock_irqsave: local_trylock_irqsave
> + - pw_unlock: local_unlock
> + - pw_lock_local: local_lock
> + - pw_trylock_local: local_trylock
> + - pw_unlock_local: local_unlock
> + - pw_queue_on: queue_work_on
> + - pw_flush: flush_work
> +
> +* CONFIG_PWLOCKS (and CONFIG_PWLOCKS_DEFAULT=y or pwlocks=on kernel boot parameter),
> + - pw_lock: spin_lock
> + - pw_lock_irqsave: spin_lock_irqsave
> + - pw_trylock: spin_trylock
> + - pw_trylock_irqsave: spin_trylock_irqsave
> + - pw_unlock: spin_unlock
> + - pw_lock_local: preempt_disable OR migrate_disable + spin_lock
> + - pw_trylock_local: preempt_disable OR migrate_disable + spin_trylock
> + - pw_unlock_local: preempt_enable OR migrate_enable + spin_unlock
> + - pw_queue_on: executes work function on caller cpu
> + - pw_flush: empty
> +
> +pw_get_cpu(work_struct), to be called from within per-cpu work function,
> +returns the target cpu.
> +
> +On the locking functions above, there are the local locking functions
> +(pw_lock_local, pw_trylock_local and pw_unlock_local) that must only
> +be used to access per-CPU data from the CPU that owns that data,
> +and never remotely. They disable preemption/migration and don't require
> +a cpu parameter, making them a replacement for local_lock functions that
> +does not introduce overhead.
Why do you need either the one or the other? My only guess is that
migrate_disable() is sufficient but you prefer preempt_disable() on
!PREEMPT_RT because it is cheaper.
> +These should only be used when accessing per-CPU data of the local CPU.
> +
> diff --git a/init/Kconfig b/init/Kconfig
> index 2937c4d308ae..3fb751dc4530 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -764,20 +764,55 @@ config CPU_ISOLATION
> depends on SMP
> default y
> help
> Make sure that CPUs running critical tasks are not disturbed by
> any source of "noise" such as unbound workqueues, timers, kthreads...
> Unbound jobs get offloaded to housekeeping CPUs. This is driven by
> the "isolcpus=" boot parameter.
>
> Say Y if unsure.
>
> +config PWLOCKS
> + bool "Per-CPU Work locks"
> + depends on SMP || COMPILE_TEST
> + default n
> + help
> + Allow changing the behavior on per-CPU resource sharing with cache,
> + from the regular local_locks() + queue_work_on(remote_cpu) to using
> + per-CPU spinlocks on both local and remote operations.
> +
> + This is useful to give user the option on reducing IPIs to CPUs, and
> + thus reduce interruptions and context switches. On the other hand, it
> + increases generated code and will use atomic operations if spinlocks
> + are selected.
I think the goal is to avoid scheduling a task on a remote CPU to get
something done.
> +
> + If set, will use the default behavior set in PWLOCKS_DEFAULT unless boot
> + parameter pwlocks is passed with a different behavior.
> +
> + If unset, will use the local_lock() + queue_work_on() strategy,
> + regardless of the boot parameter or PWLOCKS_DEFAULT.
This sounds like it affects the greater kernel.
> + Say N if unsure.
> +
> +config PWLOCKS_DEFAULT
> + bool "Use per-CPU spinlocks by default on PWLOCKS"
> + depends on PWLOCKS
> + default n
n is default.
> + help
> + If set, will use per-CPU spinlocks as default behavior for per-CPU
> + remote operations.
> +
> + If unset, will use local_lock() + queue_work_on(cpu) as default
> + behavior for remote operations.
> +
> + Say N if unsure
> +
> source "kernel/rcu/Kconfig"
>
> config IKCONFIG
> tristate "Kernel .config support"
> help
> This option enables the complete Linux kernel ".config" file
> contents to be saved in the kernel. It provides documentation
> of which kernel options are used in a running kernel or in an
> on-disk kernel. This information can be extracted from the kernel
> image file with the script scripts/extract-ikconfig and used as
> diff --git a/include/linux/pwlocks.h b/include/linux/pwlocks.h
> new file mode 100644
> index 000000000000..3d79621655f9
> --- /dev/null
> +++ b/include/linux/pwlocks.h
> @@ -0,0 +1,265 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_PWLOCKS_H
> +#define _LINUX_PWLOCKS_H
> +
> +#include "linux/spinlock.h"
> +#include "linux/local_lock.h"
> +#include "linux/workqueue.h"
> +
> +#ifndef CONFIG_PWLOCKS
> +
> +typedef local_lock_t pw_lock_t;
> +typedef local_trylock_t pw_trylock_t;
> +
> +struct pw_struct {
> + struct work_struct work;
> +};
> +
> +#define pw_lock_init(lock) \
> + local_lock_init(lock)
> +
> +#define pw_trylock_init(lock) \
> + local_trylock_init(lock)
> +
> +#define pw_lock(lock, cpu) \
> + local_lock(lock)
> +
> +#define pw_lock_local(lock) \
> + local_lock(lock)
> +
> +#define pw_lock_irqsave(lock, flags, cpu) \
> + local_lock_irqsave(lock, flags)
The part where you have a `cpu' argument which is not used is entirely
confusing.
> +
> +#define pw_lock_local_irqsave(lock, flags) \
> + local_lock_irqsave(lock, flags)
> +
> +#define pw_trylock(lock, cpu) \
> + local_trylock(lock)
> +
> +#define pw_trylock_local(lock) \
> + local_trylock(lock)
> +
> +#define pw_trylock_irqsave(lock, flags, cpu) \
> + local_trylock_irqsave(lock, flags)
> +
> +#define pw_unlock(lock, cpu) \
> + local_unlock(lock)
> +
> +#define pw_unlock_local(lock) \
> + local_unlock(lock)
> +
> +#define pw_unlock_irqrestore(lock, flags, cpu) \
> + local_unlock_irqrestore(lock, flags)
> +
> +#define pw_unlock_local_irqrestore(lock, flags) \
> + local_unlock_irqrestore(lock, flags)
> +
> +#define pw_lockdep_assert_held(lock) \
> + lockdep_assert_held(lock)
> +
> +#define pw_queue_on(c, wq, pw) \
> + queue_work_on(c, wq, &(pw)->work)
> +
> +#define pw_flush(pw) \
> + flush_work(&(pw)->work)
> +
> +#define pw_get_cpu(pw) smp_processor_id()
> +
> +#define pw_is_cpu_remote(cpu) (false)
> +
> +#define INIT_PW(pw, func, c) \
> + INIT_WORK(&(pw)->work, (func))
> +
> +#else /* CONFIG_PWLOCKS */
> +
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_PWLOCKS_DEFAULT, pw_sl);
> +
> +typedef union {
> + spinlock_t sl;
> + local_lock_t ll;
> +} pw_lock_t;
> +
> +typedef union {
> + spinlock_t sl;
> + local_trylock_t ll;
> +} pw_trylock_t;
Why do you use local_trylock_t ? Its use case is different compared to
local_lock_t. _IF_ you are fine with local_trylock_t then you should be
able to deal with a per-CPU spinlock_t and none of this should be
needed.
> +struct pw_struct {
> + struct work_struct work;
> + int cpu;
> +};
> +
> +#ifdef CONFIG_PREEMPT_RT
> +#define preempt_or_migrate_disable migrate_disable
> +#define preempt_or_migrate_enable migrate_enable
> +#else
> +#define preempt_or_migrate_disable preempt_disable
> +#define preempt_or_migrate_enable preempt_enable
> +#endif
if then () but this looks terrible.
> +
> +#define pw_lock_init(lock) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_lock_init(lock.sl); \
> + else \
> + local_lock_init(lock.ll); \
> +} while (0)
> +
> +#define pw_trylock_init(lock) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_lock_init(lock.sl); \
> + else \
> + local_trylock_init(lock.ll); \
> +} while (0)
> +
> +#define pw_lock(lock, cpu) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_lock(per_cpu_ptr(lock.sl, cpu)); \
> + else \
> + local_lock(lock.ll); \
> +} while (0)
> +
> +#define pw_lock_local(lock) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + preempt_or_migrate_disable(); \
> + spin_lock(this_cpu_ptr(lock.sl)); \
> + } else { \
> + local_lock(lock.ll); \
> + } \
> +} while (0)
> +
> +#define pw_lock_irqsave(lock, flags, cpu) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_lock_irqsave(per_cpu_ptr(lock.sl, cpu), flags); \
> + else \
> + local_lock_irqsave(lock.ll, flags); \
> +} while (0)
> +
> +#define pw_lock_local_irqsave(lock, flags) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + preempt_or_migrate_disable(); \
> + spin_lock_irqsave(this_cpu_ptr(lock.sl), flags); \
> + } else { \
> + local_lock_irqsave(lock.ll, flags); \
> + } \
> +} while (0)
> +
> +#define pw_trylock(lock, cpu) \
> +({ \
> + int t; \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + t = spin_trylock(per_cpu_ptr(lock.sl, cpu)); \
> + else \
> + t = local_trylock(lock.ll); \
> + t; \
> +})
> +
> +#define pw_trylock_local(lock) \
> +({ \
> + int t; \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + preempt_or_migrate_disable(); \
> + t = spin_trylock(this_cpu_ptr(lock.sl)); \
> + if (!t) \
> + preempt_or_migrate_enable(); \
> + } else { \
> + t = local_trylock(lock.ll); \
> + } \
> + t; \
> +})
> +
> +#define pw_trylock_irqsave(lock, flags, cpu) \
> +({ \
> + int t; \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + t = spin_trylock_irqsave(per_cpu_ptr(lock.sl, cpu), flags); \
> + else \
> + t = local_trylock_irqsave(lock.ll, flags); \
> + t; \
> +})
> +
> +#define pw_unlock(lock, cpu) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_unlock(per_cpu_ptr(lock.sl, cpu)); \
> + else \
> + local_unlock(lock.ll); \
> +} while (0)
> +
> +#define pw_unlock_local(lock) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + spin_unlock(this_cpu_ptr(lock.sl)); \
> + preempt_or_migrate_enable(); \
> + } else { \
> + local_unlock(lock.ll); \
> + } \
> +} while (0)
> +
> +#define pw_unlock_irqrestore(lock, flags, cpu) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + spin_unlock_irqrestore(per_cpu_ptr(lock.sl, cpu), flags); \
> + else \
> + local_unlock_irqrestore(lock.ll, flags); \
> +} while (0)
> +
> +#define pw_unlock_local_irqrestore(lock, flags) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + spin_unlock_irqrestore(this_cpu_ptr(lock.sl), flags); \
> + preempt_or_migrate_enable(); \
> + } else { \
> + local_unlock_irqrestore(lock.ll, flags); \
> + } \
> +} while (0)
> +
> +#define pw_lockdep_assert_held(lock) \
> +do { \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + lockdep_assert_held(this_cpu_ptr(lock.sl)); \
> + else \
> + lockdep_assert_held(this_cpu_ptr(lock.ll)); \
> +} while (0)
> +
> +#define pw_queue_on(c, wq, pw) \
> +do { \
> + int __c = c; \
> + struct pw_struct *__pw = (pw); \
> + if (static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) { \
> + WARN_ON((__c) != __pw->cpu); \
> + __pw->work.func(&__pw->work); \
> + } else { \
> + queue_work_on(__c, wq, &(__pw)->work); \
> + } \
> +} while (0)
> +
> +/*
> + * Does nothing if PWLOCKS is set to use spinlock, as the task is already done at the
> + * time pw_queue_on() returns.
> + */
> +#define pw_flush(pw) \
> +do { \
> + struct pw_struct *__pw = (pw); \
> + if (!static_branch_maybe(CONFIG_PWLOCKS_DEFAULT, &pw_sl)) \
> + flush_work(&__pw->work); \
> +} while (0)
I don't think this should be a collection of macros. Either proper
functions or static inline _if_ this is performance critical for some
reason.
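A minimal user-space sketch of what the typed static inline form could look
like, for illustration only. A plain bool stands in for the pw_sl static key,
and every fake_* name is a hypothetical stand-in rather than anything in the
patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins: a plain bool models the pw_sl static key. */
static bool fake_pw_sl_enabled;
static int fake_queued;    /* counts calls that would reach queue_work_on() */
static int fake_flushed;   /* counts calls that would reach flush_work() */

struct fake_work {
	void (*func)(struct fake_work *work);
};

struct fake_pw {
	struct fake_work work;
	int cpu;
};

/* Typed arguments: passing anything but a struct fake_pw * fails to compile. */
static inline void fake_pw_queue_on(int cpu, struct fake_pw *pw)
{
	if (fake_pw_sl_enabled) {
		assert(cpu == pw->cpu);     /* models the WARN_ON() in the patch */
		pw->work.func(&pw->work);   /* run in the caller's context */
	} else {
		fake_queued++;              /* would be queue_work_on() */
	}
}

static inline void fake_pw_flush(struct fake_pw *pw)
{
	(void)pw;
	if (!fake_pw_sl_enabled)
		fake_flushed++;             /* would be flush_work() */
}

/* Example work function: recovers the container the way pw_get_cpu() does. */
static int fake_last_cpu = -1;

static void fake_drain(struct fake_work *work)
{
	struct fake_pw *pw =
		(struct fake_pw *)((char *)work - offsetof(struct fake_pw, work));

	fake_last_cpu = pw->cpu;
}
```

Compared with the macro form, the arguments are evaluated exactly once and
type-checked by the compiler, so the __c and __pw temporaries are no longer
needed.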
> +
> +#define pw_get_cpu(w) container_of((w), struct pw_struct, work)->cpu
> +
> +#define pw_is_cpu_remote(cpu) ((cpu) != smp_processor_id())
> +
> +#define INIT_PW(pw, func, c) \
> +do { \
> + struct pw_struct *__pw = (pw); \
> + INIT_WORK(&__pw->work, (func)); \
> + __pw->cpu = (c); \
> +} while (0)
> +
> +#endif /* CONFIG_PWLOCKS */
> +#endif /* LINUX_PWLOCKS_H */
> diff --git a/kernel/pwlocks.c b/kernel/pwlocks.c
> new file mode 100644
> index 000000000000..1ebf5cb979b9
> --- /dev/null
> +++ b/kernel/pwlocks.c
> @@ -0,0 +1,47 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "linux/export.h"
> +#include <linux/sched.h>
> +#include <linux/pwlocks.h>
> +#include <linux/string.h>
> +#include <linux/sched/isolation.h>
> +
> +DEFINE_STATIC_KEY_MAYBE(CONFIG_PWLOCKS_DEFAULT, pw_sl);
> +EXPORT_SYMBOL(pw_sl);
> +
> +static bool pwlocks_param_specified;
> +
> +static int __init pwlocks_setup(char *str)
> +{
> + int opt;
> +
> + if (!get_option(&str, &opt)) {
> + pr_warn("PWLOCKS: invalid pwlocks parameter: %s, ignoring.\n", str);
> + return 0;
> + }
> +
> + if (opt)
> + static_branch_enable(&pw_sl);
> + else
> + static_branch_disable(&pw_sl);
> +
> + pwlocks_param_specified = true;
> +
> + return 1;
> +}
> +__setup("pwlocks=", pwlocks_setup);
> +
> +/*
> + * Enable PWLOCKS if CPUs want to avoid kernel noise.
> + */
> +static int __init pwlocks_init(void)
> +{
> + if (pwlocks_param_specified)
> + return 0;
> +
> + if (housekeeping_enabled(HK_TYPE_KERNEL_NOISE))
> + static_branch_enable(&pw_sl);
How likely is it that you had users before late_initcall()? Also
can it happen that one of them uses one function to lock and the other
unlock in this brief window? There is no check if this was used before
static_branch usage.
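The window can be modelled in user space as follows. This is only a sketch
under stated assumptions: a bool stands in for the pw_sl static key, and two
separate counters replace the union members so that the imbalance becomes
visible. All fake_* names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: a bool stands in for the pw_sl static key. */
static bool fake_pw_sl_enabled;

struct fake_pw_lock {
	int spin_held;    /* hold count of the modelled spinlock member */
	int local_held;   /* hold count of the modelled local_lock member */
};

static void fake_pw_lock(struct fake_pw_lock *l)
{
	if (fake_pw_sl_enabled)
		l->spin_held++;
	else
		l->local_held++;
}

static void fake_pw_unlock(struct fake_pw_lock *l)
{
	if (fake_pw_sl_enabled)
		l->spin_held--;
	else
		l->local_held--;
}

/* Models pwlocks_init() flipping the key from an initcall. */
static void fake_pwlocks_init(void)
{
	fake_pw_sl_enabled = true;
}
```

If the key flips between fake_pw_lock() and fake_pw_unlock(), the local_lock
side is never released and the spinlock side is released without having been
taken.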
> +
> + return 0;
> +}
> +
> +late_initcall(pwlocks_init);
Sebastian
* Re: [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work
2026-05-20 13:48 ` Sebastian Andrzej Siewior
@ 2026-05-20 14:47 ` Frederic Weisbecker
0 siblings, 0 replies; 12+ messages in thread
From: Frederic Weisbecker @ 2026-05-20 14:47 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Leonardo Bras, Jonathan Corbet, Shuah Khan, Peter Zijlstra,
Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, linux-doc, linux-kernel, linux-mm,
linux-rt-devel, Marcelo Tosatti
Le Wed, May 20, 2026 at 03:48:32PM +0200, Sebastian Andrzej Siewior a écrit :
> How likely is it that you had users before late_initcall()? Also
> can it happen that one of them uses one function to lock and the other
> unlock in this brief window? There is no check if this was used before
> static_branch usage.
Or let alone initialization on the wrong member of the union.
--
Frederic Weisbecker
SUSE Labs
* Re: [PATCH v4 4/4] slub: apply new pw_queue_on() interface
2026-05-19 1:27 ` [PATCH v4 4/4] slub: " Leonardo Bras
@ 2026-05-20 14:53 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-05-20 14:53 UTC (permalink / raw)
To: Leonardo Bras
Cc: Jonathan Corbet, Shuah Khan, Peter Zijlstra, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, Frederic Weisbecker, linux-doc, linux-kernel,
linux-mm, linux-rt-devel, Marcelo Tosatti
On 2026-05-18 22:27:50 [-0300], Leonardo Bras wrote:
> @@ -4733,121 +4735,121 @@ void *alloc_from_pcs(struct kmem_cache *s, gfp_t gfp, int node)
>
> /*
> * We assume the percpu sheaves contain only local objects although it's
> * not completely guaranteed, so we verify later.
> */
> if (unlikely(node_requested && node != numa_mem_id())) {
> stat(s, ALLOC_NODE_MISMATCH);
> return NULL;
> }
>
> - if (!local_trylock(&s->cpu_sheaves->lock))
> + if (!pw_trylock_local(&s->cpu_sheaves->lock))
> return NULL;
alloc_from_pcs() can be called from kmalloc_nolock() / NMI context.
I don't remember exactly why local_trylock_t was introduced here instead
of a per-CPU spinlock_t. But there should be nothing wrong with a
trylock on it from NMI, as you do here.
One thing worth noting: on !PREEMPT_RT, spin_trylock() always succeeds
on UP. kmalloc_nolock() checks for that; I am not sure about other callers.
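
The UP point can be illustrated with a small userspace model. This is only a
sketch under assumptions: both helpers below are hypothetical stand-ins, not
kernel code. The first mimics a trylock that compiles away and always reports
success; the second tracks ownership in a flag, so a nested attempt (for
example from NMI) can fail.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of a trylock that compiles away, in the way the
 * mail describes spin_trylock() on UP with !PREEMPT_RT.
 */
static bool model_up_spin_trylock(int *lock)
{
	(void)lock;	/* no lock word is examined */
	return true;	/* always reports success, even when nested */
}

/* Hypothetical model of a trylock that records ownership in a flag. */
static bool model_flag_trylock(int *acquired)
{
	if (*acquired)
		return false;	/* already held: the nested attempt fails */
	*acquired = 1;
	return true;
}

static void model_flag_unlock(int *acquired)
{
	*acquired = 0;
}
```

Calling model_up_spin_trylock() twice in a row succeeds both times, while the
second model_flag_trylock() fails until model_flag_unlock() runs. A caller
that relies on trylock failure to detect re-entry therefore needs its own
check in the first case.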
Sebastian
* Re: [PATCH v4 3/4] swap: apply new pw_queue_on() interface
2026-05-19 1:27 ` [PATCH v4 3/4] swap: apply new pw_queue_on() interface Leonardo Bras
@ 2026-05-20 15:07 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-05-20 15:07 UTC (permalink / raw)
To: Leonardo Bras
Cc: Jonathan Corbet, Shuah Khan, Peter Zijlstra, Ingo Molnar,
Will Deacon, Boqun Feng, Waiman Long, Andrew Morton,
David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Jann Horn, Pedro Falcato, Brendan Jackman, Johannes Weiner,
Zi Yan, Harry Yoo, Hao Li, Christoph Lameter, David Rientjes,
Roman Gushchin, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Qi Zheng, Shakeel Butt,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Borislav Petkov (AMD),
Randy Dunlap, Feng Tang, Dapeng Mi, Kees Cook, Marco Elver,
Jakub Kicinski, Li RongQing, Eric Biggers, Paul E. McKenney,
Nathan Chancellor, Nicolas Schier, Miguel Ojeda,
Thomas Weißschuh, Thomas Gleixner, Douglas Anderson,
Gary Guo, Christian Brauner, Pasha Tatashin, Coiby Xu,
Masahiro Yamada, Frederic Weisbecker, linux-doc, linux-kernel,
linux-mm, linux-rt-devel, Marcelo Tosatti
On 2026-05-18 22:27:49 [-0300], Leonardo Bras wrote:
after digesting the slub patch,
> @@ -882,38 +879,38 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
> * If the paired barrier is done at any later step, e.g. after the
> * loop, CPU #x will just exit at (C) and miss flushing out all of its
> * added pages.
> */
> WRITE_ONCE(lru_drain_gen, lru_drain_gen + 1);
> smp_mb();
>
> cpumask_clear(&has_mm_work);
> cpumask_clear(&has_bh_work);
> for_each_online_cpu(cpu) {
> - struct work_struct *mm_work = &per_cpu(lru_add_drain_work, cpu);
> + struct pw_struct *mm_pw = &per_cpu(lru_add_drain_pw, cpu);
> struct work_struct *bh_work = &per_cpu(bh_add_drain_work, cpu);
>
> if (cpu_needs_mm_drain(cpu)) {
> - INIT_WORK(mm_work, lru_add_drain_per_cpu);
> - queue_work_on(cpu, mm_percpu_wq, mm_work);
> + INIT_PW(mm_pw, lru_add_drain_per_cpu, cpu);
> + pw_queue_on(cpu, mm_percpu_wq, mm_pw);
> __cpumask_set_cpu(cpu, &has_mm_work);
> }
>
> if (cpu_needs_bh_drain(cpu)) {
> INIT_WORK(bh_work, bh_add_drain_per_cpu);
> queue_work_on(cpu, mm_percpu_wq, bh_work);
> __cpumask_set_cpu(cpu, &has_bh_work);
> }
> }
>
> for_each_cpu(cpu, &has_mm_work)
> - flush_work(&per_cpu(lru_add_drain_work, cpu));
> + pw_flush(&per_cpu(lru_add_drain_pw, cpu));
>
> for_each_cpu(cpu, &has_bh_work)
> flush_work(&per_cpu(bh_add_drain_work, cpu));
Why do we have two iterations here? Is it just a proof of concept that
is not complete yet? I am curious why it is okay/needed to "remove" the
one workqueue but not the other. Maybe one of them is less of a bother
than the other.
But essentially we can't use a spinlock_t here because, given the
hotpath nature of the code, it would kill performance. So instead we do
it anyway, but behind a switch, so that the only ones who pay for it are
those who do not want workqueue interruptions on a NOHZ full system,
right?
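
A rough userspace sketch of that trade-off, under stated assumptions: the
names (model_cpu, model_remote_mode, model_pw_queue_on) are hypothetical, the
int lock_held only stands in for a per-CPU spinlock, and nothing here is
concurrent. It shows the two behaviours the switch selects between, not the
real pw_queue_on() implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4

/* Hypothetical per-CPU state; not taken from the patch. */
struct model_cpu {
	int lock_held;	/* stands in for the per-CPU spinlock */
	int pending;	/* work items deferred to this CPU */
	int drained;	/* how often the drain callback ran */
};

static struct model_cpu model_cpus[MODEL_NR_CPUS];

/* The "switch": true means run remotely under the target's lock. */
static bool model_remote_mode;

static void model_drain(struct model_cpu *c)
{
	c->drained++;
}

/* Returns true if the caller ran the work itself, false if deferred. */
static bool model_pw_queue_on(int cpu)
{
	struct model_cpu *c = &model_cpus[cpu];

	if (model_remote_mode) {
		c->lock_held = 1;	/* take the target CPU's lock */
		model_drain(c);		/* run on behalf of that CPU */
		c->lock_held = 0;
		return true;
	}
	c->pending++;			/* target CPU must run it later */
	return false;
}

/* What the target CPU does when it is interrupted to run the work. */
static void model_run_pending(int cpu)
{
	struct model_cpu *c = &model_cpus[cpu];

	while (c->pending > 0) {
		c->pending--;
		model_drain(c);
	}
}
```

With model_remote_mode off, the target CPU is the one that ends up in
model_run_pending(), which models the workqueue interruption. With it on,
the caller does the work, at the price of every local user taking a real
lock.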
I thought that this improved since commit
ff042f4a9b050 ("mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu")
Did it get worse or was it not entirely gone?
> done:
> mutex_unlock(&lock);
> }
>
> void lru_add_drain_all(void)
> {
Sebastian
end of thread, other threads:[~2026-05-20 15:07 UTC | newest]
Thread overview: 12+ messages
2026-05-19 1:27 [PATCH v4 0/4] Introduce Per-CPU Work helpers (was QPW) Leonardo Bras
2026-05-19 1:27 ` [PATCH v4 1/4] Introducing pw_lock() and per-cpu queue & flush work Leonardo Bras
2026-05-20 10:08 ` Frederic Weisbecker
2026-05-20 13:48 ` Sebastian Andrzej Siewior
2026-05-20 14:47 ` Frederic Weisbecker
2026-05-19 1:27 ` [PATCH v4 2/4] mm/swap: move bh draining into a separate workqueue Leonardo Bras
2026-05-19 1:27 ` [PATCH v4 3/4] swap: apply new pw_queue_on() interface Leonardo Bras
2026-05-20 15:07 ` Sebastian Andrzej Siewior
2026-05-19 1:27 ` [PATCH v4 4/4] slub: " Leonardo Bras
2026-05-20 14:53 ` Sebastian Andrzej Siewior
2026-05-19 6:58 ` [syzbot ci] Re: Introduce Per-CPU Work helpers (was QPW) syzbot ci
2026-05-20 13:09 ` [PATCH v4 0/4] " Sebastian Andrzej Siewior