* [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage
@ 2026-06-15 11:05 Harry Yoo (Oracle)
  2026-06-15 11:05 ` [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available Harry Yoo (Oracle)
                   ` (10 more replies)
  0 siblings, 11 replies; 23+ messages in thread
From: Harry Yoo (Oracle) @ 2026-06-15 11:05 UTC
  To: Vlastimil Babka, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov,
	Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf

This is not the best time to post a series, but I did not want to delay
it for too long. No pressure ;)  This is intended to be queued for
review and testing after the merge window closes.

This series is based on next-20260612, and is also available on
git.kernel.org [3].

To RCU folks: It would be great if you could kindly take a quick look at
patch 4 and either ack or nack the patch ;)

To BPF folks: Ulad asked for workloads to measure the performance
of kfree_rcu_nolock(). I have focused on correctness so far and have
not spent much effort on performance. It would be nice if BPF folks
could help evaluate it on their relevant workloads.

To PREEMPT_RT folks: The most relevant part is allowing
kfree_rcu_sheaf() on PREEMPT_RT (patch 6). It acquires locks only via
local_trylock() or spin_trylock_irqsave() to avoid sleeping within a
raw spinlock. When trylock or unlock is unsafe, kmalloc_nolock() always
fails.

Changes since RFC v2
====================

Reduced complexity and intrusiveness (Uladzislau Rezki)
-------------------------------------------------------

While discussing concerns about the complexity of adding allow_spin
handling with Ulad (Thanks!), I realized that adding complexity to the
kvfree_rcu batching is not strictly necessary: only slab objects need to
be batched, they are already batched by rcu sheaves, and slab already
supports unknown context. So it is enough to implement only a minimal
fallback for the sheaves path.

I tried to avoid making intrusive changes to the existing kvfree_rcu
path as much as possible. struct rcu_ptr is renamed to kfree_rcu_head
following Vlastimil's suggestion, and it is used only in the
kfree_rcu_nolock() path for now.

As a result, the complexity is significantly reduced and the series
became much less intrusive. This is also reflected well in the diffstat
below.

RFC v2 diffstat:
  8 files changed, 514 insertions(+), 163 deletions(-)

v3 diffstat:
  6 files changed, 370 insertions(+), 105 deletions(-)

v3 diffstat (slub_kunit improvements - patch 1, 2, 9 excluded):
  5 files changed, 199 insertions(+), 66 deletions(-)

kfree_rcu_sheaf() PREEMPT_RT support (Vlastimil Babka)
------------------------------------------------------

As suggested by Vlastimil (Thanks!), kfree_rcu_sheaf() can now be used
on PREEMPT_RT as well, by always assuming allow_spin is false on
PREEMPT_RT.

slub_kunit enhancements
-----------------------

- Currently the test is skipped when there is no hardware PMU. This can
  happen on machines without a PMU, or in virtualized environments
  (e.g., automated testing or virtme). Implement a fallback based on SW
  perf events so that the test can still run in such environments, even
  though the coverage is slightly smaller.

- While testing on PREEMPT_RT, I found that kmalloc_nolock() fails every
  time, so the fallback path is not properly tested. This is a limitation
  of perf events: the handler is called in NMI (HW perf events) or
  interrupt context (SW perf events), where kmalloc_nolock() cannot
  succeed.

  slub_kunit now registers a kprobe pre-handler at the points in the slab
  allocator where lockdep_assert_held() is invoked. The pre-handler calls
  kmalloc_nolock() and friends, to improve coverage on PREEMPT_RT instead
  of relying on perf events.

One thing that needs to be further explored
-------------------------------------------

The global deferred_free_by_rcu list for the fallback (introduced by
patch 8) should probably be per-CPU [5].

Actual Cover Letter
===================

This series improves kmalloc_nolock() and kfree_nolock() coverage
in slub_kunit (patch 1 and 2) and introduces kfree_rcu_nolock() for
an unknown context as suggested by Alexei Starovoitov.

Unknown context means the caller does not know whether spinning on a lock
is safe (e.g., a BPF program attached to an arbitrary kernel function or
in NMI context).

The slab allocator already supports unknown context via kmalloc_nolock()
and kfree_nolock(), but the slab allocator does not support freeing
objects by RCU in unknown context.

It is not ideal to have completely separate batching for unknown context
because the worst-case scenario, where spinning on a lock would lead to
a deadlock, is very rare. In most cases it is safe to use the existing
mechanism (kfree_rcu_sheaf()).

Since most of the slab allocator already supports unknown context
and sheaves support batching kvfree_rcu() calls for slab objects,
implement kfree_rcu_nolock() with minimal changes by teaching
kfree_rcu_sheaf() how to support unknown context and making
it a little bit harder to allocate an empty sheaf, instead of making
intrusive changes to the existing kvfree_rcu batching logic.

kfree_rcu_nolock() tries to free the object to the rcu sheaf if
trylock succeeds. Once the rcu sheaf becomes full, it is submitted to
RCU via call_rcu() if spinning is allowed or IRQs are enabled (to avoid
calling call_rcu() in the middle of call_rcu()). Otherwise, call_rcu()
is deferred via irq work.
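
As a rough illustration only (not code from this series), the choice
made when the rcu sheaf becomes full can be modeled in plain userspace
C. The enum, the function names and the irqs_enabled parameter are
hypothetical; in the kernel the inputs would come from allow_spin and
irqs_disabled().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names: how a full rcu sheaf is handed over to RCU. */
enum submit_path {
	SUBMIT_CALL_RCU,	/* call call_rcu() directly */
	SUBMIT_IRQ_WORK,	/* defer call_rcu() via irq work */
};

/*
 * Model of the rule described above: call_rcu() is used directly when
 * spinning is allowed or IRQs are enabled, otherwise it is deferred.
 */
static enum submit_path full_sheaf_submit_path(bool allow_spin,
					       bool irqs_enabled)
{
	if (allow_spin || irqs_enabled)
		return SUBMIT_CALL_RCU;
	return SUBMIT_IRQ_WORK;
}

/* Walk through all four input combinations. */
static void submit_path_selftest(void)
{
	assert(full_sheaf_submit_path(true, true) == SUBMIT_CALL_RCU);
	assert(full_sheaf_submit_path(true, false) == SUBMIT_CALL_RCU);
	assert(full_sheaf_submit_path(false, true) == SUBMIT_CALL_RCU);
	assert(full_sheaf_submit_path(false, false) == SUBMIT_IRQ_WORK);
}
```

Only the combination of an unknown context and disabled IRQs takes the
deferred path in this model.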

In unknown context, when there is no sheaf available, kfree_rcu_sheaf()
falls back to defer_kfree_rcu(), which inserts the object into a global
lockless list [5]. Those objects are freed after synchronize_rcu() in
a workqueue.
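
For readers unfamiliar with lockless lists, here is a minimal userspace
sketch of the push and take-all pattern using C11 atomics. It is not
the kernel's implementation: all names are hypothetical, the grace
period wait is only marked by a comment, and the drain just counts
nodes instead of freeing them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type; a real object would embed such a link. */
struct deferred_node {
	struct deferred_node *next;
};

/* Single global list head, initially empty (NULL). */
static _Atomic(struct deferred_node *) deferred_head;

/* Push one node without taking a lock; retries if the head changed. */
static void deferred_push(struct deferred_node *node)
{
	struct deferred_node *old;

	assert(node != NULL);
	old = atomic_load(&deferred_head);
	do {
		node->next = old;
	} while (!atomic_compare_exchange_weak(&deferred_head, &old, node));
}

/* Detach the whole chain at once; the caller then owns it exclusively. */
static struct deferred_node *deferred_take_all(void)
{
	return atomic_exchange(&deferred_head, (struct deferred_node *)NULL);
}

/* Worker side: take everything and walk it, returning the node count. */
static size_t deferred_drain(void)
{
	struct deferred_node *node = deferred_take_all();
	size_t count = 0;

	/* A kernel worker would wait for a grace period before this loop. */
	while (node) {
		struct deferred_node *next = node->next;

		count++;	/* stand-in for freeing the object */
		node = next;
	}
	return count;
}
```

Because the producer side is a single compare-and-exchange loop, it
never spins on a lock held by the context it interrupted.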

Unlike kfree_rcu(), only the 2-argument variant is supported.
This is because the last resort of the 1-arg variant is
synchronize_rcu(), which cannot be used in an unknown context.

As suggested by Alexei Starovoitov, kfree_rcu_nolock() can be used with
struct kfree_rcu_head (8 bytes), which is smaller than struct rcu_head
(16 bytes).
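
The size difference can be sketched in userspace C. The first struct
mirrors the two-pointer layout of struct rcu_head; the second assumes
that struct kfree_rcu_head holds a single link pointer, which is an
assumption made here for illustration and may not match the actual
definition in this series. On a 64-bit build the sizes are 16 and 8
bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Two pointers: a link and a callback, like struct rcu_head. */
struct model_rcu_head {
	struct model_rcu_head *next;
	void (*func)(struct model_rcu_head *head);
};

/* Assumption: a single link pointer; the real layout may differ. */
struct model_kfree_rcu_head {
	struct model_kfree_rcu_head *next;
};

static_assert(sizeof(struct model_rcu_head) == 2 * sizeof(void *),
	      "model_rcu_head is two pointers");
static_assert(sizeof(struct model_kfree_rcu_head) == sizeof(void *),
	      "model_kfree_rcu_head is one pointer");

/* Bytes saved per object by embedding the smaller head. */
static size_t head_bytes_saved(void)
{
	return sizeof(struct model_rcu_head) -
	       sizeof(struct model_kfree_rcu_head);
}
```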

For more background and future plans, please see [4].

[1] RFC v1: https://lore.kernel.org/linux-mm/20260206093410.160622-1-harry.yoo@oracle.com

[2] RFC v2: https://lore.kernel.org/linux-mm/20260416091022.36823-1-harry@kernel.org

[3] https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kfree_rcu_nolock-v3r3

[4] kmalloc_nolock() follow-ups, including kfree_rcu_nolock(),
    https://lore.kernel.org/linux-mm/esepccfhqg7m6jo76ns2znj2cnuaepx2xvw5zaygtwohq4psma@563ypprp6rr3

[5] However, we should probably make the list percpu because,
    unlike RFC v2, it can be triggered more frequently under memory
    pressure.

    https://lore.kernel.org/linux-mm/805c33d7-3a7b-470c-bd9d-065717a3e3e2@paulmck-laptop

Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
Harry Yoo (Oracle) (9):
      slub_kunit: fall back to SW perf events when HW PMU is not available
      mm/slab, slub_kunit: register kprobe to trigger _nolock APIs
      mm/slab: handle the !allow_spin case in kfree_rcu_sheaf()
      mm/slab: use call_rcu() in unknown context if irqs are enabled
      mm/slab: extend deferred free mechanism to handle rcu sheaves
      mm/slab: allow kfree_rcu_sheaf() on PREEMPT_RT
      mm/slab: introduce kfree_rcu_nolock()
      mm/slab: introduce struct kfree_rcu_head and use in kfree_rcu_nolock()
      slub_kunit: extend the test for kfree_rcu_nolock()

 include/linux/rcupdate.h |  12 +++
 include/linux/types.h    |   4 +
 lib/tests/slub_kunit.c   | 174 ++++++++++++++++++++++++++++------
 mm/slab.h                |   5 +-
 mm/slab_common.c         |  38 ++++++--
 mm/slub.c                | 242 ++++++++++++++++++++++++++++++++++-------------
 6 files changed, 370 insertions(+), 105 deletions(-)
---
base-commit: c425609d6ac4012c8bbf01ec2e10e801b1923a7b
change-id: 20260615-kfree_rcu_nolock-e5502555992f

Best regards,
-- 
Harry Yoo (Oracle) <harry@kernel.org>


* [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
@ 2026-06-15 11:05 ` Harry Yoo (Oracle)
  2026-06-15 11:14   ` sashiko-bot
                     ` (2 more replies)
  2026-06-15 11:05 ` [PATCH for-next v3 2/9] mm/slab, slub_kunit: register kprobe to trigger _nolock APIs Harry Yoo (Oracle)
                   ` (9 subsequent siblings)
  10 siblings, 3 replies; 23+ messages in thread
From: Harry Yoo (Oracle) @ 2026-06-15 11:05 UTC

slub_kunit uses hardware perf events to invoke _nolock() APIs from NMI
context. However, creating a HW perf event fails when there is no
hardware PMU, or when running in a virtualized environment without a
virtual PMU configured.

Since tests are often performed in virtualized environments, fall back
to a SW perf event so that the test can still run, even if it cannot
cover IRQ-disabled sections.

For the sheaves path, IRQs are not disabled until the barn is accessed.
So some coverage is still expected.

Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 lib/tests/slub_kunit.c | 70 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 55 insertions(+), 15 deletions(-)

diff --git a/lib/tests/slub_kunit.c b/lib/tests/slub_kunit.c
index fa6d31dbca16..11255fc8eb78 100644
--- a/lib/tests/slub_kunit.c
+++ b/lib/tests/slub_kunit.c
@@ -303,6 +303,7 @@ struct test_nolock_context {
 	int alloc_ok;
 	int alloc_fail;
 	struct perf_event *event;
+	bool is_perf_type_hw;
 };
 
 static struct perf_event_attr hw_attr = {
@@ -315,9 +316,19 @@ static struct perf_event_attr hw_attr = {
 	.sample_freq = 100000,
 };
 
-static void overflow_handler_test_kmalloc_kfree_nolock(struct perf_event *event,
-						       struct perf_sample_data *data,
-						       struct pt_regs *regs)
+/* Fallback when hardware perf event is not available */
+static struct perf_event_attr sw_attr = {
+	.type = PERF_TYPE_SOFTWARE,
+	.config = PERF_COUNT_SW_CPU_CLOCK,
+	.size = sizeof(struct perf_event_attr),
+	.disabled = 1,
+	.freq = 1,
+	.sample_freq = 100000,
+};
+
+static void overflow_handler_test_nolock(struct perf_event *event,
+					 struct perf_sample_data *data,
+					 struct pt_regs *regs)
 {
 	void *objp;
 	gfp_t gfp;
@@ -336,20 +347,53 @@ static void overflow_handler_test_kmalloc_kfree_nolock(struct perf_event *event,
 	ctx->callback_count++;
 }
 
+static bool enable_perf_events(struct test_nolock_context *ctx)
+{
+	struct perf_event *event;
+
+	ctx->is_perf_type_hw = true;
+	event = perf_event_create_kernel_counter(&hw_attr, -1, current,
+						 overflow_handler_test_nolock,
+						 ctx);
+
+	if (!IS_ERR(event))
+		goto out;
+
+	ctx->is_perf_type_hw = false;
+	event = perf_event_create_kernel_counter(&sw_attr, -1, current,
+						 overflow_handler_test_nolock,
+						 ctx);
+	if (!IS_ERR(event))
+		goto out;
+
+	return false;
+out:
+	ctx->event = event;
+	perf_event_enable(ctx->event);
+	return true;
+}
+
+static void disable_perf_events(struct test_nolock_context *ctx)
+{
+	kunit_info(ctx->test, "%s perf events: callback_count: %d, alloc_ok: %d, alloc_fail: %d\n",
+		   ctx->is_perf_type_hw ? "HW" : "SW",
+		   ctx->callback_count, ctx->alloc_ok, ctx->alloc_fail);
+
+	perf_event_disable(ctx->event);
+	perf_event_release_kernel(ctx->event);
+}
+
 static void test_kmalloc_kfree_nolock(struct kunit *test)
 {
 	int i, j;
 	struct test_nolock_context ctx = { .test = test };
-	struct perf_event *event;
 	bool alloc_fail = false;
+	bool perf_events_enabled;
 
-	event = perf_event_create_kernel_counter(&hw_attr, -1, current,
-						 overflow_handler_test_kmalloc_kfree_nolock,
-						 &ctx);
-	if (IS_ERR(event))
+	perf_events_enabled = enable_perf_events(&ctx);
+	if (!perf_events_enabled)
 		kunit_skip(test, "Failed to create perf event");
-	ctx.event = event;
-	perf_event_enable(ctx.event);
+
 	for (i = 0; i < NR_ITERATIONS; i++) {
 		for (j = 0; j < NR_OBJECTS; j++) {
 			gfp_t gfp = (i % 2) ? GFP_KERNEL : GFP_KERNEL_ACCOUNT;
@@ -368,11 +412,7 @@ static void test_kmalloc_kfree_nolock(struct kunit *test)
 	}
 
 cleanup:
-	perf_event_disable(ctx.event);
-	perf_event_release_kernel(ctx.event);
-
-	kunit_info(test, "callback_count: %d, alloc_ok: %d, alloc_fail: %d\n",
-		   ctx.callback_count, ctx.alloc_ok, ctx.alloc_fail);
+	disable_perf_events(&ctx);
 
 	if (alloc_fail)
 		kunit_skip(test, "Allocation failed");

-- 
2.53.0




* [PATCH for-next v3 2/9] mm/slab, slub_kunit: register kprobe to trigger _nolock APIs
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
  2026-06-15 11:05 ` [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available Harry Yoo (Oracle)
@ 2026-06-15 11:05 ` Harry Yoo (Oracle)
  2026-06-15 11:25   ` sashiko-bot
  2026-06-15 20:04   ` Alexei Starovoitov
  2026-06-15 11:05 ` [PATCH for-next v3 3/9] mm/slab: handle the !allow_spin case in kfree_rcu_sheaf() Harry Yoo (Oracle)
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 23+ messages in thread
From: Harry Yoo (Oracle) @ 2026-06-15 11:05 UTC

Since kmalloc_nolock() always fails in NMI and hardirq contexts on
PREEMPT_RT, slub_kunit cannot properly test _nolock() APIs.

Register a kprobe pre-handler to invoke kmalloc_nolock() and
kfree_nolock() in the middle of the slab allocator. However, do not
register the handler on UP kernels [1].

To attach the pre-handler while s->cpu_sheaves->lock or n->list_lock
is held, add a wrapper function for lockdep_assert_held() that calls
a no-op function slab_attach_kprobe_locked() on debug builds. When
neither CONFIG_PROVE_LOCKING nor CONFIG_DEBUG_VM is selected, the
function is optimized away and register_kprobe() fails.

The function calls barrier() to prevent the compiler from optimizing
away its callsites. Otherwise, the compiler may conclude that the
function has no side effects and remove its callsites.

Link: https://lore.kernel.org/linux-mm/20260427-nolock-api-fix-v2-0-a6b83a92d9a4@kernel.org [1]
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 lib/tests/slub_kunit.c | 82 +++++++++++++++++++++++++++++++++++++++++++-------
 mm/slub.c              | 36 ++++++++++++++++------
 2 files changed, 98 insertions(+), 20 deletions(-)

diff --git a/lib/tests/slub_kunit.c b/lib/tests/slub_kunit.c
index 11255fc8eb78..01d808cb77fa 100644
--- a/lib/tests/slub_kunit.c
+++ b/lib/tests/slub_kunit.c
@@ -8,6 +8,7 @@
 #include <linux/rcupdate.h>
 #include <linux/delay.h>
 #include <linux/perf_event.h>
+#include <linux/kprobes.h>
 #include "../mm/slab.h"
 
 static struct kunit_resource resource;
@@ -292,7 +293,8 @@ static void test_krealloc_redzone_zeroing(struct kunit *test)
 	kmem_cache_destroy(s);
 }
 
-#ifdef CONFIG_PERF_EVENTS
+#if defined(CONFIG_PERF_EVENTS) || (defined(CONFIG_KPROBES) && defined(CONFIG_SMP))
+#define SLUB_KUNIT_TEST_KMALLOC_KFREE_NOLOCK
 #define NR_ITERATIONS 1000
 #define NR_OBJECTS 1000
 static void *objects[NR_OBJECTS];
@@ -302,10 +304,16 @@ struct test_nolock_context {
 	int callback_count;
 	int alloc_ok;
 	int alloc_fail;
+#ifdef CONFIG_PERF_EVENTS
 	struct perf_event *event;
 	bool is_perf_type_hw;
+#endif
+#ifdef CONFIG_KPROBES
+	struct kprobe kprobe;
+#endif
 };
 
+#ifdef CONFIG_PERF_EVENTS
 static struct perf_event_attr hw_attr = {
 	.type = PERF_TYPE_HARDWARE,
 	.config = PERF_COUNT_HW_CPU_CYCLES,
@@ -326,13 +334,10 @@ static struct perf_event_attr sw_attr = {
 	.sample_freq = 100000,
 };
 
-static void overflow_handler_test_nolock(struct perf_event *event,
-					 struct perf_sample_data *data,
-					 struct pt_regs *regs)
+static void test_nolock(struct test_nolock_context *ctx)
 {
 	void *objp;
 	gfp_t gfp;
-	struct test_nolock_context *ctx = event->overflow_handler_context;
 
 	/* __GFP_ACCOUNT to test kmalloc_nolock() in alloc_slab_obj_exts() */
 	gfp = (ctx->callback_count % 2) ? 0 : __GFP_ACCOUNT;
@@ -347,6 +352,15 @@ static void overflow_handler_test_nolock(struct perf_event *event,
 	ctx->callback_count++;
 }
 
+static void overflow_handler_test_nolock(struct perf_event *event,
+					 struct perf_sample_data *data,
+					 struct pt_regs *regs)
+{
+	struct test_nolock_context *ctx = event->overflow_handler_context;
+
+	test_nolock(ctx);
+}
+
 static bool enable_perf_events(struct test_nolock_context *ctx)
 {
 	struct perf_event *event;
@@ -382,17 +396,60 @@ static void disable_perf_events(struct test_nolock_context *ctx)
 	perf_event_disable(ctx->event);
 	perf_event_release_kernel(ctx->event);
 }
+#else
+static bool enable_perf_events(struct test_nolock_context *ctx) { return false; }
+static void disable_perf_events(struct test_nolock_context *ctx) { }
+#endif
+
+#if defined(CONFIG_KPROBES) && defined(CONFIG_SMP)
+static int slab_kprobe_pre_handler(struct kprobe *p, struct pt_regs *regs)
+{
+	struct test_nolock_context *ctx;
+
+	ctx = container_of(p, struct test_nolock_context, kprobe);
+	test_nolock(ctx);
+	return 0;
+}
+
+static bool register_slab_kprobes(struct test_nolock_context *ctx)
+{
+	ctx->kprobe.symbol_name = "slab_attach_kprobe_locked";
+	ctx->kprobe.pre_handler = slab_kprobe_pre_handler;
+
+	if (register_kprobe(&ctx->kprobe))
+		return false;
+	return true;
+}
+
+static void unregister_slab_kprobes(struct test_nolock_context *ctx)
+{
+	kunit_info(ctx->test, "kprobes: callback_count: %d, alloc_ok: %d, alloc_fail: %d\n",
+		   ctx->callback_count, ctx->alloc_ok, ctx->alloc_fail);
+	unregister_kprobe(&ctx->kprobe);
+}
+#else
+static bool register_slab_kprobes(struct test_nolock_context *ctx) { return false; }
+static void unregister_slab_kprobes(struct test_nolock_context *ctx) { }
+#endif
 
 static void test_kmalloc_kfree_nolock(struct kunit *test)
 {
 	int i, j;
-	struct test_nolock_context ctx = { .test = test };
+	struct test_nolock_context perf_ctx = { .test = test };
+	struct test_nolock_context kprobe_ctx = { .test = test };
 	bool alloc_fail = false;
 	bool perf_events_enabled;
+	bool slab_kprobes_enabled;
 
-	perf_events_enabled = enable_perf_events(&ctx);
-	if (!perf_events_enabled)
-		kunit_skip(test, "Failed to create perf event");
+	perf_events_enabled = enable_perf_events(&perf_ctx);
+	slab_kprobes_enabled = register_slab_kprobes(&kprobe_ctx);
+
+	if (!perf_events_enabled && !slab_kprobes_enabled)
+		kunit_skip(test, "Failed to enable perf event and kprobe, skipping");
+	else if (!perf_events_enabled)
+		kunit_info(test, "Failed to create perf event");
+	if (!slab_kprobes_enabled)
+		kunit_info(test, "Failed to register kprobe pre-handler");
 
 	for (i = 0; i < NR_ITERATIONS; i++) {
 		for (j = 0; j < NR_OBJECTS; j++) {
@@ -412,7 +469,10 @@ static void test_kmalloc_kfree_nolock(struct kunit *test)
 	}
 
 cleanup:
-	disable_perf_events(&ctx);
+	if (perf_events_enabled)
+		disable_perf_events(&perf_ctx);
+	if (slab_kprobes_enabled)
+		unregister_slab_kprobes(&kprobe_ctx);
 
 	if (alloc_fail)
 		kunit_skip(test, "Allocation failed");
@@ -444,7 +504,7 @@ static struct kunit_case test_cases[] = {
 	KUNIT_CASE(test_kfree_rcu_wq_destroy),
 	KUNIT_CASE(test_leak_destroy),
 	KUNIT_CASE(test_krealloc_redzone_zeroing),
-#ifdef CONFIG_PERF_EVENTS
+#ifdef SLUB_KUNIT_TEST_KMALLOC_KFREE_NOLOCK
 	KUNIT_CASE_SLOW(test_kmalloc_kfree_nolock),
 #endif
 	{}
diff --git a/mm/slub.c b/mm/slub.c
index 813fb863254d..87ca154ccd80 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -908,6 +908,24 @@ static inline unsigned int obj_exts_offset_in_object(struct kmem_cache *s)
 }
 #endif
 
+/*
+ * A no-op function used to attach kprobe handlers in slub_kunit tests.
+ * The barrier is needed to prevent the compiler from optimizing out callsites.
+ */
+#if defined(CONFIG_DEBUG_VM) || defined(CONFIG_PROVE_LOCKING)
+static noinline void slab_attach_kprobe_locked(void)
+{
+	barrier();
+}
+#else
+static inline void slab_attach_kprobe_locked(void) { }
+#endif
+
+#define slab_lockdep_assert_held(lock) do {	\
+	lockdep_assert_held(lock);		\
+	slab_attach_kprobe_locked();	\
+} while (0)
+
 #ifdef CONFIG_SLUB_DEBUG
 
 /*
@@ -1665,7 +1683,7 @@ static void add_full(struct kmem_cache *s,
 	if (!(s->flags & SLAB_STORE_USER))
 		return;
 
-	lockdep_assert_held(&n->list_lock);
+	slab_lockdep_assert_held(&n->list_lock);
 	list_add(&slab->slab_list, &n->full);
 }
 
@@ -1674,7 +1692,7 @@ static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct
 	if (!(s->flags & SLAB_STORE_USER))
 		return;
 
-	lockdep_assert_held(&n->list_lock);
+	slab_lockdep_assert_held(&n->list_lock);
 	list_del(&slab->slab_list);
 }
 
@@ -2866,7 +2884,7 @@ static unsigned int __sheaf_flush_main_batch(struct kmem_cache *s)
 	void *objects[PCS_BATCH_MAX];
 	struct slab_sheaf *sheaf;
 
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	slab_lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 	sheaf = pcs->main;
@@ -3545,7 +3563,7 @@ __add_partial(struct kmem_cache_node *n, struct slab *slab, enum add_mode mode)
 static inline void add_partial(struct kmem_cache_node *n,
 				struct slab *slab, enum add_mode mode)
 {
-	lockdep_assert_held(&n->list_lock);
+	slab_lockdep_assert_held(&n->list_lock);
 	__add_partial(n, slab, mode);
 }
 
@@ -3559,7 +3577,7 @@ static inline void clear_node_partial_state(struct kmem_cache_node *n,
 static inline void remove_partial(struct kmem_cache_node *n,
 					struct slab *slab)
 {
-	lockdep_assert_held(&n->list_lock);
+	slab_lockdep_assert_held(&n->list_lock);
 	list_del(&slab->slab_list);
 	clear_node_partial_state(n, slab);
 }
@@ -3575,7 +3593,7 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
 {
 	void *object;
 
-	lockdep_assert_held(&n->list_lock);
+	slab_lockdep_assert_held(&n->list_lock);
 
 #ifdef CONFIG_SLUB_DEBUG
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
@@ -4646,7 +4664,7 @@ __pcs_replace_empty_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	struct node_barn *barn;
 	bool allow_spin;
 
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	slab_lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
 	/* Bootstrap or debug cache, back off */
 	if (unlikely(!cache_has_sheaves(s))) {
@@ -5786,7 +5804,7 @@ static void __pcs_install_empty_sheaf(struct kmem_cache *s,
 		struct slub_percpu_sheaves *pcs, struct slab_sheaf *empty,
 		struct node_barn *barn)
 {
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	slab_lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
 	/* This is what we expect to find if nobody interrupted us. */
 	if (likely(!pcs->spare)) {
@@ -5837,7 +5855,7 @@ __pcs_replace_full_main(struct kmem_cache *s, struct slub_percpu_sheaves *pcs,
 	bool put_fail;
 
 restart:
-	lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
+	slab_lockdep_assert_held(this_cpu_ptr(&s->cpu_sheaves->lock));
 
 	/* Bootstrap or debug cache, back off */
 	if (unlikely(!cache_has_sheaves(s))) {

-- 
2.53.0




* [PATCH for-next v3 3/9] mm/slab: handle the !allow_spin case in kfree_rcu_sheaf()
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
  2026-06-15 11:05 ` [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available Harry Yoo (Oracle)
  2026-06-15 11:05 ` [PATCH for-next v3 2/9] mm/slab, slub_kunit: register kprobe to trigger _nolock APIs Harry Yoo (Oracle)
@ 2026-06-15 11:05 ` Harry Yoo (Oracle)
  2026-06-15 11:24   ` sashiko-bot
  2026-06-15 11:05 ` [PATCH for-next v3 4/9] mm/slab: use call_rcu() in unknown context if irqs are enabled Harry Yoo (Oracle)
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Harry Yoo (Oracle) @ 2026-06-15 11:05 UTC

Teach kfree_rcu_sheaf() how to handle the !allow_spin case. Try to get
an empty sheaf from pcs->spare or the barn even when spinning is not
allowed. Unlike __pcs_replace_full_main(), try harder to allocate
an empty sheaf because the fallback path will be more expensive than
kfree_nolock().

When trylock fails or the kernel observes non-NULL pcs->rcu_free after
lock acquisition, free the sheaf instead of putting it into the barn.
This case is rare, so it is not worth complicating the code for it.

Since call_rcu() cannot be called in an unknown context,
kfree_rcu_sheaf() fails when the rcu sheaf becomes full.

Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 mm/slab.h        |  2 +-
 mm/slab_common.c |  2 +-
 mm/slub.c        | 39 ++++++++++++++++++++++++++++++---------
 3 files changed, 32 insertions(+), 11 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 509f330654b8..b1bd33a16544 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -429,7 +429,7 @@ static inline bool is_kmalloc_normal(struct kmem_cache *s)
 	return !(s->flags & (SLAB_CACHE_DMA|SLAB_ACCOUNT|SLAB_RECLAIM_ACCOUNT));
 }
 
-bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj);
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin);
 void flush_all_rcu_sheaves(void);
 void flush_rcu_sheaves_on_cache(struct kmem_cache *s);
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b6426d7ceec9..bc1a8ec938d9 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1605,7 +1605,7 @@ static bool kfree_rcu_sheaf(void *obj)
 
 	s = slab->slab_cache;
 	if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id()))
-		return __kfree_rcu_sheaf(s, obj);
+		return __kfree_rcu_sheaf(s, obj, /* allow_spin = */ true);
 
 	return false;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 87ca154ccd80..b0d38d515386 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2815,7 +2815,8 @@ static inline struct slab_sheaf *alloc_empty_sheaf(struct kmem_cache *s,
 	return __alloc_empty_sheaf(s, gfp, alloc_flags, s->sheaf_capacity);
 }
 
-static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
+static void __free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf,
+			       bool allow_spin)
 {
 	/*
 	 * If the sheaf was created with SLAB_ALLOC_NO_RECURSE flag then its
@@ -2827,11 +2828,20 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
 		mark_obj_codetag_empty(sheaf);
 
 	VM_WARN_ON_ONCE(sheaf->size > 0);
-	kfree(sheaf);
+
+	if (likely(allow_spin))
+		kfree(sheaf);
+	else
+		kfree_nolock(sheaf);
 
 	stat(s, SHEAF_FREE);
 }
 
+static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
+{
+	__free_empty_sheaf(s, sheaf, /* allow_spin = */ true);
+}
+
 static unsigned int
 refill_objects(struct kmem_cache *s, void **p, gfp_t gfp, unsigned int min,
 	       unsigned int max);
@@ -3132,7 +3142,6 @@ static struct slab_sheaf *barn_get_empty_sheaf(struct node_barn *barn,
  * intended action due to a race or cpu migration. Thus they do not check the
  * empty or full sheaf limits for simplicity.
  */
-
 static void barn_put_empty_sheaf(struct node_barn *barn, struct slab_sheaf *sheaf)
 {
 	unsigned long flags;
@@ -6065,7 +6074,7 @@ static void rcu_free_sheaf(struct rcu_head *head)
  */
 static DEFINE_WAIT_OVERRIDE_MAP(kfree_rcu_sheaf_map, LD_WAIT_CONFIG);
 
-bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
+bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
 {
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *rcu_sheaf;
@@ -6081,9 +6090,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 	pcs = this_cpu_ptr(s->cpu_sheaves);
 
 	if (unlikely(!pcs->rcu_free)) {
-
 		struct slab_sheaf *empty;
 		struct node_barn *barn;
+		unsigned int alloc_flags = SLAB_ALLOC_DEFAULT;
+		gfp_t gfp = GFP_NOWAIT;
 
 		/* Bootstrap or debug cache, fall back */
 		if (unlikely(!cache_has_sheaves(s))) {
@@ -6103,7 +6113,7 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 			goto fail;
 		}
 
-		empty = barn_get_empty_sheaf(barn, true);
+		empty = barn_get_empty_sheaf(barn, allow_spin);
 
 		if (empty) {
 			pcs->rcu_free = empty;
@@ -6112,20 +6122,25 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 
 		local_unlock(&s->cpu_sheaves->lock);
 
-		empty = alloc_empty_sheaf(s, GFP_NOWAIT, SLAB_ALLOC_DEFAULT);
+		if (unlikely(!allow_spin)) {
+			alloc_flags = SLAB_ALLOC_TRYLOCK;
+			gfp = 0;
+		}
+
+		empty = alloc_empty_sheaf(s, gfp, alloc_flags);
 
 		if (!empty)
 			goto fail;
 
 		if (!local_trylock(&s->cpu_sheaves->lock)) {
-			barn_put_empty_sheaf(barn, empty);
+			__free_empty_sheaf(s, empty, allow_spin);
 			goto fail;
 		}
 
 		pcs = this_cpu_ptr(s->cpu_sheaves);
 
 		if (unlikely(pcs->rcu_free))
-			barn_put_empty_sheaf(barn, empty);
+			__free_empty_sheaf(s, empty, allow_spin);
 		else
 			pcs->rcu_free = empty;
 	}
@@ -6143,6 +6158,12 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
 	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
 		rcu_sheaf = NULL;
 	} else {
+		if (unlikely(!allow_spin)) {
+			/* call_rcu() cannot be called in an unknown context */
+			rcu_sheaf->size--;
+			local_unlock(&s->cpu_sheaves->lock);
+			goto fail;
+		}
 		pcs->rcu_free = NULL;
 		rcu_sheaf->node = numa_node_id();
 	}

-- 
2.53.0




* [PATCH for-next v3 4/9] mm/slab: use call_rcu() in unknown context if irqs are enabled
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
                   ` (2 preceding siblings ...)
  2026-06-15 11:05 ` [PATCH for-next v3 3/9] mm/slab: handle the !allow_spin case in kfree_rcu_sheaf() Harry Yoo (Oracle)
@ 2026-06-15 11:05 ` Harry Yoo (Oracle)
  2026-06-15 11:25   ` sashiko-bot
  2026-06-15 11:05 ` [PATCH for-next v3 5/9] mm/slab: extend deferred free mechanism to handle rcu sheaves Harry Yoo (Oracle)
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Harry Yoo (Oracle) @ 2026-06-15 11:05 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov,
	Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf

call_rcu() disables IRQs with local_irq_save() to protect its per-cpu
data structures. Therefore, if IRQs are enabled on entry, the caller
cannot have interrupted that critical section, and reentering
call_rcu() cannot corrupt those data structures. So fall back to the
deferred path only when !allow_spin && irqs_disabled().

The RCU subsystem does not guarantee this contractually, and this
optimization relies on RCU's implementation details. Ideally, it should
be removed once call_rcu_nolock() is supported by the RCU subsystem.
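
As a rough illustration of the condition above, here is a minimal
userspace sketch of the decision. The enum and function names are
hypothetical and exist only for this example; the real check lives in
__kfree_rcu_sheaf() and reads irqs_disabled() directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; a userspace model, not the mm/slub.c code. */
enum rcu_submit_path {
	SUBMIT_CALL_RCU,	/* safe to invoke call_rcu() directly */
	SUBMIT_FALLBACK,	/* must take the fallback path instead */
};

/*
 * Only the combination "unknown context" (!allow_spin) and
 * "IRQs disabled" is treated as possibly nested inside call_rcu().
 */
static enum rcu_submit_path pick_submit_path(bool allow_spin,
					     bool irqs_are_disabled)
{
	if (!allow_spin && irqs_are_disabled)
		return SUBMIT_FALLBACK;
	return SUBMIT_CALL_RCU;
}
```

Only one of the four input combinations selects the fallback, which is
the narrowing this patch makes compared to checking !allow_spin alone.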

Link: https://lore.kernel.org/linux-mm/CAADnVQKRVD5ZSnEKbZZU7w86gHbGHUug2pvzpgZTngNS+fg4rw@mail.gmail.com
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 mm/slub.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index b0d38d515386..6a3552b70683 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6158,8 +6158,8 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
 	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
 		rcu_sheaf = NULL;
 	} else {
-		if (unlikely(!allow_spin)) {
-			/* call_rcu() cannot be called in an unknown context */
+		/* call_rcu() disables IRQs to protect percpu data structures */
+		if (unlikely(!allow_spin && irqs_disabled())) {
 			rcu_sheaf->size--;
 			local_unlock(&s->cpu_sheaves->lock);
 			goto fail;

-- 
2.53.0




* [PATCH for-next v3 5/9] mm/slab: extend deferred free mechanism to handle rcu sheaves
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
                   ` (3 preceding siblings ...)
  2026-06-15 11:05 ` [PATCH for-next v3 4/9] mm/slab: use call_rcu() in unknown context if irqs are enabled Harry Yoo (Oracle)
@ 2026-06-15 11:05 ` Harry Yoo (Oracle)
  2026-06-15 11:24   ` sashiko-bot
  2026-06-15 11:06 ` [PATCH for-next v3 6/9] mm/slab: allow kfree_rcu_sheaf() on PREEMPT_RT Harry Yoo (Oracle)
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Harry Yoo (Oracle) @ 2026-06-15 11:05 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov,
	Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf

__kfree_rcu_sheaf() cannot invoke call_rcu() when spinning is not
allowed and IRQs are disabled. To relax the limitation, extend the
deferred free fallback so that a full rcu sheaf can be submitted to
call_rcu() via the existing IRQ work.

Since the deferred mechanism now handles more than deferred freeing of
objects, rename the struct to deferred_percpu_work and adjust the
related names accordingly.

When a sheaf is queued on an IRQ work, it is detached from
pcs->rcu_free but call_rcu() is not invoked until the irq_work runs.
To keep the kvfree_rcu barrier's promise, call irq_work_sync() on each
CPU before calling rcu_barrier().
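
The queue-then-drain pattern described above can be sketched in
userspace with C11 atomics. This is a simplified model under stated
assumptions: all model_* names are hypothetical, the push helper only
mimics the documented llist_add() behaviour of returning true when the
list was empty before the add, and counters stand in for
irq_work_queue() and call_rcu().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for llist_node / llist_head. */
struct model_node { struct model_node *next; };
struct model_list { _Atomic(struct model_node *) first; };

/* Lock-free push; returns true if the list was empty before the add. */
static bool model_push(struct model_list *l, struct model_node *n)
{
	struct model_node *old = atomic_load(&l->first);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&l->first, &old, n));

	return old == NULL;
}

/* Detach the whole list in one step, like llist_del_all(). */
static struct model_node *model_take_all(struct model_list *l)
{
	return atomic_exchange(&l->first, NULL);
}

struct model_deferred {
	struct model_list rcu_sheaves;
	int work_queued;	/* counts modelled irq_work_queue() calls */
	int submitted;		/* counts modelled call_rcu() calls */
};

/* Producer side: only the first add on an empty list queues the work. */
static void model_defer_sheaf(struct model_deferred *d, struct model_node *n)
{
	if (model_push(&d->rcu_sheaves, n))
		d->work_queued++;
}

/* Consumer side: drain everything and "submit" each sheaf. */
static void model_run_work(struct model_deferred *d)
{
	struct model_node *pos = model_take_all(&d->rcu_sheaves);
	struct model_node *next;

	for (; pos; pos = next) {
		next = pos->next;	/* read before the node is reused */
		d->submitted++;
	}
}
```

In this model a sheaf sitting on the list has not been "submitted" yet,
which is why a barrier has to wait for the work to run before it waits
for the submitted callbacks.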

While at it, remove the TODO item about a per-cache version of
__kvfree_rcu_barrier(), as there is apparently no simple and effective
way to implement it.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 mm/slab.h        |  2 +-
 mm/slab_common.c |  7 ++---
 mm/slub.c        | 79 ++++++++++++++++++++++++++++++++++----------------------
 3 files changed, 51 insertions(+), 37 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index b1bd33a16544..961581e35ec8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -744,7 +744,7 @@ void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
 void __check_heap_object(const void *ptr, unsigned long n,
 			 const struct slab *slab, bool to_user);
 
-void defer_free_barrier(void);
+void deferred_work_barrier(void);
 
 static inline bool slub_debug_orig_size(struct kmem_cache *s)
 {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index bc1a8ec938d9..55546b8385ff 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -551,7 +551,7 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	}
 
 	/* Wait for deferred work from kmalloc/kfree_nolock() */
-	defer_free_barrier();
+	deferred_work_barrier();
 
 	cpus_read_lock();
 	mutex_lock(&slab_mutex);
@@ -2113,13 +2113,10 @@ void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
 		cpus_read_lock();
 		flush_rcu_sheaves_on_cache(s);
 		cpus_read_unlock();
+		deferred_work_barrier();
 		rcu_barrier();
 	}
 
-	/*
-	 * TODO: Introduce a version of __kvfree_rcu_barrier() that works
-	 * on a specific slab cache.
-	 */
 	__kvfree_rcu_barrier();
 }
 EXPORT_SYMBOL_GPL(kvfree_rcu_barrier_on_cache);
diff --git a/mm/slub.c b/mm/slub.c
index 6a3552b70683..ba593c1c53d5 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -418,6 +418,8 @@ struct slab_sheaf {
 	union {
 		struct rcu_head rcu_head;
 		struct list_head barn_list;
+		/* only used to defer call_rcu() in unknown context */
+		struct llist_node llnode;
 		/* only used for prefilled sheafs */
 		struct {
 			unsigned int capacity;
@@ -4071,6 +4073,20 @@ static void flush_all(struct kmem_cache *s)
 	cpus_read_unlock();
 }
 
+struct deferred_percpu_work {
+	struct llist_head objects;
+	struct llist_head rcu_sheaves;
+	struct irq_work work;
+};
+
+static void deferred_percpu_work_fn(struct irq_work *work);
+
+static DEFINE_PER_CPU(struct deferred_percpu_work, deferred_percpu_work) = {
+	.objects = LLIST_HEAD_INIT(objects),
+	.rcu_sheaves = LLIST_HEAD_INIT(rcu_sheaves),
+	.work = IRQ_WORK_INIT(deferred_percpu_work_fn),
+};
+
 static void flush_rcu_sheaf(struct work_struct *w)
 {
 	struct slub_percpu_sheaves *pcs;
@@ -4142,6 +4158,7 @@ void flush_all_rcu_sheaves(void)
 	mutex_unlock(&slab_mutex);
 	cpus_read_unlock();
 
+	deferred_work_barrier();
 	rcu_barrier();
 }
 
@@ -6158,12 +6175,6 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
 	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
 		rcu_sheaf = NULL;
 	} else {
-		/* call_rcu() disables IRQs to protect percpu data structures */
-		if (unlikely(!allow_spin && irqs_disabled())) {
-			rcu_sheaf->size--;
-			local_unlock(&s->cpu_sheaves->lock);
-			goto fail;
-		}
 		pcs->rcu_free = NULL;
 		rcu_sheaf->node = numa_node_id();
 	}
@@ -6172,8 +6183,18 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
 	 * we flush before local_unlock to make sure a racing
 	 * flush_all_rcu_sheaves() doesn't miss this sheaf
 	 */
-	if (rcu_sheaf)
-		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
+	if (rcu_sheaf) {
+		/* call_rcu() disables IRQs to protect percpu data structures */
+		if (unlikely(!allow_spin && irqs_disabled())) {
+			struct deferred_percpu_work *dpw;
+
+			dpw = this_cpu_ptr(&deferred_percpu_work);
+			if (llist_add(&rcu_sheaf->llnode, &dpw->rcu_sheaves))
+				irq_work_queue(&dpw->work);
+		} else {
+			call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
+		}
+	}
 
 	local_unlock(&s->cpu_sheaves->lock);
 
@@ -6360,31 +6381,20 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 	}
 }
 
-struct defer_free {
-	struct llist_head objects;
-	struct irq_work work;
-};
-
-static void free_deferred_objects(struct irq_work *work);
-
-static DEFINE_PER_CPU(struct defer_free, defer_free_objects) = {
-	.objects = LLIST_HEAD_INIT(objects),
-	.work = IRQ_WORK_INIT(free_deferred_objects),
-};
-
 /*
  * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
  * to take sleeping spin_locks from __slab_free().
  * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
  */
-static void free_deferred_objects(struct irq_work *work)
+static void deferred_percpu_work_fn(struct irq_work *work)
 {
-	struct defer_free *df = container_of(work, struct defer_free, work);
-	struct llist_head *objs = &df->objects;
+	struct deferred_percpu_work *dpw;
+	struct llist_head *objs, *rcu_sheaves;
 	struct llist_node *llnode, *pos, *t;
 
-	if (llist_empty(objs))
-		return;
+	dpw = container_of(work, struct deferred_percpu_work, work);
+	rcu_sheaves = &dpw->rcu_sheaves;
+	objs = &dpw->objects;
 
 	llnode = llist_del_all(objs);
 	llist_for_each_safe(pos, t, llnode) {
@@ -6408,27 +6418,34 @@ static void free_deferred_objects(struct irq_work *work)
 		__slab_free(s, slab, x, x, 1, _THIS_IP_);
 		stat(s, FREE_SLOWPATH);
 	}
+
+	llnode = llist_del_all(rcu_sheaves);
+	llist_for_each_safe(pos, t, llnode) {
+		struct slab_sheaf *rcu_sheaf = llist_entry(pos, struct slab_sheaf, llnode);
+
+		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
+	}
 }
 
 static void defer_free(struct kmem_cache *s, void *head)
 {
-	struct defer_free *df;
+	struct deferred_percpu_work *dpw;
 
 	guard(preempt)();
 
 	head = kasan_reset_tag(head);
 
-	df = this_cpu_ptr(&defer_free_objects);
-	if (llist_add(head + s->offset, &df->objects))
-		irq_work_queue(&df->work);
+	dpw = this_cpu_ptr(&deferred_percpu_work);
+	if (llist_add(head + s->offset, &dpw->objects))
+		irq_work_queue(&dpw->work);
 }
 
-void defer_free_barrier(void)
+void deferred_work_barrier(void)
 {
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
+		irq_work_sync(&per_cpu_ptr(&deferred_percpu_work, cpu)->work);
 }
 
 static __fastpath_inline

-- 
2.53.0




* [PATCH for-next v3 6/9] mm/slab: allow kfree_rcu_sheaf() on PREEMPT_RT
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
                   ` (4 preceding siblings ...)
  2026-06-15 11:05 ` [PATCH for-next v3 5/9] mm/slab: extend deferred free mechanism to handle rcu sheaves Harry Yoo (Oracle)
@ 2026-06-15 11:06 ` Harry Yoo (Oracle)
  2026-06-15 11:19   ` sashiko-bot
  2026-06-15 11:06 ` [PATCH for-next v3 7/9] mm/slab: introduce kfree_rcu_nolock() Harry Yoo (Oracle)
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Harry Yoo (Oracle) @ 2026-06-15 11:06 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov,
	Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf

As suggested by Vlastimil Babka, kfree_rcu_sheaf() can be used on
PREEMPT_RT if we always assume that spinning is not allowed there.
This is because local_trylock() and the trylock variants of spinlock_t
are safe to use as long as the kernel does not spin and the context is
neither NMI nor hardirq.
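
The "try once, never spin" idea can be illustrated with a small
userspace analogy. This is not local_trylock() or spin_trylock(); the
lock type and function names below are hypothetical, and an
atomic_flag stands in for the lock word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock that offers only a non-blocking acquire. */
struct try_only_lock { atomic_flag held; };

static bool try_acquire(struct try_only_lock *l)
{
	/* A single attempt; the caller decides what to do on failure. */
	return !atomic_flag_test_and_set_explicit(&l->held,
						  memory_order_acquire);
}

static void release(struct try_only_lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Fast path that gives up instead of waiting: on contention it
 * returns false and the caller takes a slower fallback.
 */
static bool free_via_fast_path(struct try_only_lock *l, int *count)
{
	if (!try_acquire(l))
		return false;
	(*count)++;
	release(l);
	return true;
}
```

Because the fast path never waits for the lock holder, it cannot
block behind a lock that the interrupted context already holds.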

Now that __kfree_rcu_sheaf() knows how to handle allow_spin = false,
relax the limitation and try the sheaves path on PREEMPT_RT as well.

Keep the lockdep map on non-RT kernels, but do not use it on
PREEMPT_RT, to avoid suppressing valid lockdep warnings.

Link: https://lore.kernel.org/linux-mm/6811cc17-8ee4-48c8-8cbf-6bf4d9f98162@kernel.org
Suggested-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 mm/slab_common.c | 11 +++++++++--
 mm/slub.c        | 17 ++++++++++-------
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 55546b8385ff..807924a94fb0 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1595,6 +1595,13 @@ static bool kfree_rcu_sheaf(void *obj)
 {
 	struct kmem_cache *s;
 	struct slab *slab;
+	bool allow_spin;
+
+	/*
+	 * It is not safe to spin on PREEMPT_RT because the kernel might be
+	 * holding a raw spinlock and slab acquires sleeping locks.
+	 */
+	allow_spin = !IS_ENABLED(CONFIG_PREEMPT_RT);
 
 	if (is_vmalloc_addr(obj))
 		return false;
@@ -1605,7 +1612,7 @@ static bool kfree_rcu_sheaf(void *obj)
 
 	s = slab->slab_cache;
 	if (likely(!IS_ENABLED(CONFIG_NUMA) || slab_nid(slab) == numa_mem_id()))
-		return __kfree_rcu_sheaf(s, obj, /* allow_spin = */ true);
+		return __kfree_rcu_sheaf(s, obj, allow_spin);
 
 	return false;
 }
@@ -1954,7 +1961,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
 	if (!head)
 		might_sleep();
 
-	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && kfree_rcu_sheaf(ptr))
+	if (kfree_rcu_sheaf(ptr))
 		return;
 
 	// Queue the object but don't yet schedule the batch.
diff --git a/mm/slub.c b/mm/slub.c
index ba593c1c53d5..4850629774b2 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -6082,12 +6082,13 @@ static void rcu_free_sheaf(struct rcu_head *head)
  * kvfree_call_rcu() can be called while holding a raw_spinlock_t. Since
  * __kfree_rcu_sheaf() may acquire a spinlock_t (sleeping lock on PREEMPT_RT),
  * this would violate lock nesting rules. Therefore, kvfree_call_rcu() avoids
- * this problem by bypassing the sheaves layer entirely on PREEMPT_RT.
+ * this problem by passing allow_spin = false on PREEMPT_RT.
  *
  * However, lockdep still complains that it is invalid to acquire spinlock_t
  * while holding raw_spinlock_t, even on !PREEMPT_RT where spinlock_t is a
  * spinning lock. Tell lockdep that acquiring spinlock_t is valid here
- * by temporarily raising the wait-type to LD_WAIT_CONFIG.
+ * by temporarily raising the wait-type to LD_WAIT_CONFIG. Skip the lockdep map
+ * on PREEMPT_RT to avoid suppressing valid lockdep warnings.
  */
 static DEFINE_WAIT_OVERRIDE_MAP(kfree_rcu_sheaf_map, LD_WAIT_CONFIG);
 
@@ -6096,10 +6097,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
 	struct slub_percpu_sheaves *pcs;
 	struct slab_sheaf *rcu_sheaf;
 
-	if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
-		return false;
+	VM_WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT) && allow_spin);
 
-	lock_map_acquire_try(&kfree_rcu_sheaf_map);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		lock_map_acquire_try(&kfree_rcu_sheaf_map);
 
 	if (!local_trylock(&s->cpu_sheaves->lock))
 		goto fail;
@@ -6199,12 +6200,14 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
 	local_unlock(&s->cpu_sheaves->lock);
 
 	stat(s, FREE_RCU_SHEAF);
-	lock_map_release(&kfree_rcu_sheaf_map);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		lock_map_release(&kfree_rcu_sheaf_map);
 	return true;
 
 fail:
 	stat(s, FREE_RCU_SHEAF_FAIL);
-	lock_map_release(&kfree_rcu_sheaf_map);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		lock_map_release(&kfree_rcu_sheaf_map);
 	return false;
 }
 

-- 
2.53.0




* [PATCH for-next v3 7/9] mm/slab: introduce kfree_rcu_nolock()
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
                   ` (5 preceding siblings ...)
  2026-06-15 11:06 ` [PATCH for-next v3 6/9] mm/slab: allow kfree_rcu_sheaf() on PREEMPT_RT Harry Yoo (Oracle)
@ 2026-06-15 11:06 ` Harry Yoo (Oracle)
  2026-06-15 11:22   ` sashiko-bot
  2026-06-15 11:06 ` [PATCH for-next v3 8/9] mm/slab: introduce struct kfree_rcu_head and use in kfree_rcu_nolock() Harry Yoo (Oracle)
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Harry Yoo (Oracle) @ 2026-06-15 11:06 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov,
	Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf

Currently, k[v]free_rcu() cannot be called in an unknown context,
since it could deadlock when called in the middle of k[v]free_rcu().

Make users' lives easier by introducing the kfree_rcu_nolock() variant,
now that kfree_rcu_sheaf() is available on PREEMPT_RT and
__kfree_rcu_sheaf() handles unknown contexts.

Unlike k[v]free_rcu(), kfree_rcu_nolock() does not fall back to
the kvfree_rcu batching when the sheaves path fails, and falls back to
defer_kfree_rcu() instead. In most cases, the sheaves path is expected
to succeed and it's unnecessary to add complexity to the existing
kvfree_rcu batching.
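
For readers unfamiliar with the two-argument form, here is a userspace
model of how such a macro hands both the embedded head and the object
pointer to a helper. Everything named model_* is hypothetical, the
helper merely records its arguments instead of freeing anything, and
__typeof__ is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the embedded head and an object type. */
struct model_head { struct model_head *next; };
struct model_item {
	int payload;
	struct model_head rh;
};

static struct model_head *last_head;
static void *last_ptr;

/* Records its arguments; a real implementation would queue the free. */
static void model_record_free(struct model_head *head, void *ptr)
{
	last_head = head;
	last_ptr = ptr;
}

/* Two-argument form: object pointer plus the name of the head field. */
#define model_free_later(ptr, field)					\
do {									\
	__typeof__(ptr) ___p = (ptr);					\
									\
	if (___p) {							\
		_Static_assert(offsetof(__typeof__(*(ptr)), field) < 4096, \
			       "head offset too large");		\
		model_record_free(&___p->field, (void *)___p);		\
	}								\
} while (0)
```

As in the macro added by this patch, a NULL pointer is silently
ignored and the offset check happens at compile time.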

Since defer_kfree_rcu() can be called on caches without sheaves, move
deferred_work_barrier() and rcu_barrier() outside the branch in
kvfree_rcu_barrier_on_cache().

Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 include/linux/rcupdate.h | 12 ++++++++++++
 mm/slab.h                |  1 +
 mm/slab_common.c         | 22 ++++++++++++++++++++--
 mm/slub.c                | 23 ++++++++++++++++++++++-
 4 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5e95acc33989..3025249bfcb5 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1099,6 +1099,7 @@ static inline void rcu_read_unlock_migrate(void)
  * In mm/slab_common.c, no suitable header to include here.
  */
 void kvfree_call_rcu(struct rcu_head *head, void *ptr);
+void kfree_call_rcu_nolock(struct rcu_head *head, void *ptr);
 
 /*
  * The BUILD_BUG_ON() makes sure the rcu_head offset can be handled. See the
@@ -1122,6 +1123,17 @@ do {								\
 		kvfree_call_rcu(NULL, (void *) (___p));		\
 } while (0)
 
+/* kfree_rcu_nolock() supports 2-arg variant only */
+#define kfree_rcu_nolock(ptr, krhf)					\
+do {									\
+	typeof (ptr) ___p = (ptr);					\
+									\
+	if (___p) {							\
+		BUILD_BUG_ON(offsetof(typeof(*(ptr)), krhf) >= 4096);	\
+		kfree_call_rcu_nolock(&((___p)->krhf), (void *) (___p));\
+	}								\
+} while (0)
+
 /*
  * Place this after a lock-acquisition primitive to guarantee that
  * an UNLOCK+LOCK pair acts as a full barrier.  This guarantee applies
diff --git a/mm/slab.h b/mm/slab.h
index 961581e35ec8..a493c5201e96 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -745,6 +745,7 @@ void __check_heap_object(const void *ptr, unsigned long n,
 			 const struct slab *slab, bool to_user);
 
 void deferred_work_barrier(void);
+void defer_kfree_rcu(struct rcu_head *head);
 
 static inline bool slub_debug_orig_size(struct kmem_cache *s)
 {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 807924a94fb0..5a39e6225160 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1263,6 +1263,23 @@ EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
 EXPORT_TRACEPOINT_SYMBOL(kfree);
 EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
 
+void kfree_call_rcu_nolock(struct rcu_head *head, void *ptr)
+{
+	struct slab *slab;
+	struct kmem_cache *s;
+
+	VM_WARN_ON_ONCE(is_vmalloc_addr(ptr) || !virt_to_slab(ptr));
+
+	slab = virt_to_slab(ptr);
+	s = slab->slab_cache;
+
+	if (__kfree_rcu_sheaf(s, ptr, /* allow_spin = */ false))
+		return;
+
+	defer_kfree_rcu(head);
+}
+EXPORT_SYMBOL_GPL(kfree_call_rcu_nolock);
+
 #ifndef CONFIG_KVFREE_RCU_BATCHED
 
 void kvfree_call_rcu(struct rcu_head *head, void *ptr)
@@ -2120,10 +2137,11 @@ void kvfree_rcu_barrier_on_cache(struct kmem_cache *s)
 		cpus_read_lock();
 		flush_rcu_sheaves_on_cache(s);
 		cpus_read_unlock();
-		deferred_work_barrier();
-		rcu_barrier();
 	}
 
+	/* kfree_rcu_nolock() might have deferred frees even without sheaves */
+	deferred_work_barrier();
+	rcu_barrier();
 	__kvfree_rcu_barrier();
 }
 EXPORT_SYMBOL_GPL(kvfree_rcu_barrier_on_cache);
diff --git a/mm/slub.c b/mm/slub.c
index 4850629774b2..19018a979445 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4075,6 +4075,7 @@ static void flush_all(struct kmem_cache *s)
 
 struct deferred_percpu_work {
 	struct llist_head objects;
+	struct llist_head objects_by_rcu;
 	struct llist_head rcu_sheaves;
 	struct irq_work work;
 };
@@ -4083,6 +4084,7 @@ static void deferred_percpu_work_fn(struct irq_work *work);
 
 static DEFINE_PER_CPU(struct deferred_percpu_work, deferred_percpu_work) = {
 	.objects = LLIST_HEAD_INIT(objects),
+	.objects_by_rcu = LLIST_HEAD_INIT(objects_by_rcu),
 	.rcu_sheaves = LLIST_HEAD_INIT(rcu_sheaves),
 	.work = IRQ_WORK_INIT(deferred_percpu_work_fn),
 };
@@ -6392,12 +6394,13 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 static void deferred_percpu_work_fn(struct irq_work *work)
 {
 	struct deferred_percpu_work *dpw;
-	struct llist_head *objs, *rcu_sheaves;
+	struct llist_head *objs, *objs_by_rcu, *rcu_sheaves;
 	struct llist_node *llnode, *pos, *t;
 
 	dpw = container_of(work, struct deferred_percpu_work, work);
 	rcu_sheaves = &dpw->rcu_sheaves;
 	objs = &dpw->objects;
+	objs_by_rcu = &dpw->objects_by_rcu;
 
 	llnode = llist_del_all(objs);
 	llist_for_each_safe(pos, t, llnode) {
@@ -6428,6 +6431,13 @@ static void deferred_percpu_work_fn(struct irq_work *work)
 
 		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
 	}
+
+	llnode = llist_del_all(objs_by_rcu);
+	llist_for_each_safe(pos, t, llnode) {
+		struct rcu_head *head = (struct rcu_head *)pos;
+
+		call_rcu(head, kvfree_rcu_cb);
+	}
 }
 
 static void defer_free(struct kmem_cache *s, void *head)
@@ -6443,6 +6453,17 @@ static void defer_free(struct kmem_cache *s, void *head)
 		irq_work_queue(&dpw->work);
 }
 
+void defer_kfree_rcu(struct rcu_head *head)
+{
+	struct deferred_percpu_work *dpw;
+
+	guard(preempt)();
+
+	dpw = this_cpu_ptr(&deferred_percpu_work);
+	if (llist_add((struct llist_node *)head, &dpw->objects_by_rcu))
+		irq_work_queue(&dpw->work);
+}
+
 void deferred_work_barrier(void)
 {
 	int cpu;

-- 
2.53.0




* [PATCH for-next v3 8/9] mm/slab: introduce struct kfree_rcu_head and use in kfree_rcu_nolock()
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
                   ` (6 preceding siblings ...)
  2026-06-15 11:06 ` [PATCH for-next v3 7/9] mm/slab: introduce kfree_rcu_nolock() Harry Yoo (Oracle)
@ 2026-06-15 11:06 ` Harry Yoo (Oracle)
  2026-06-15 11:22   ` sashiko-bot
  2026-06-15 11:06 ` [PATCH for-next v3 9/9] slub_kunit: extend the test for kfree_rcu_nolock() Harry Yoo (Oracle)
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 23+ messages in thread
From: Harry Yoo (Oracle) @ 2026-06-15 11:06 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov,
	Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf

struct rcu_head is overkill for kfree_rcu() because the callback
function is always kfree() and thus there is no need for a function
pointer.

It is enough to implement a linked list using a single pointer.
Introduce a new struct called kfree_rcu_head (the name was suggested
by Vlastimil Babka), which is similar to struct rcu_head but is only
a single pointer. Use it only in kfree_rcu_nolock() for now.

It is intended that in most cases, kfree_rcu_nolock() goes through
the sheaves path and the field is not used at all. However, in the
worst case where trylock fails or no empty sheaves are available,
maintain a global list of objects that will later be freed after
synchronize_rcu() in a workqueue. Since the fallback is intended to be
the last resort, it focuses on minimizing complexity and memory usage.

To avoid crashing the kernel in the theoretical case where the fallback
path is taken before kmem_cache_init_late(), do not queue the work until
the workqueue is actually allocated.
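
A minimal userspace model of that early-boot gating is sketched below.
All names are hypothetical, the worker runs synchronously, and plain
counters replace the global llist, the irq_work, synchronize_rcu() and
kfree(); it only shows why entries queued early are not lost.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; not the kernel data structures. */
struct model_state {
	bool wq_ready;	/* stands in for deferred_wq != NULL */
	int pending;	/* objects waiting on the global list */
	int freed;	/* objects the worker has freed */
};

/* Worker: take everything pending and free it. */
static void model_worker(struct model_state *m)
{
	m->freed += m->pending;
	m->pending = 0;
}

/* irq_work handler: do nothing until the workqueue exists. */
static void model_irq_work_fn(struct model_state *m)
{
	if (!m->wq_ready)
		return;
	model_worker(m);
}

/* Producer: the first entry on an empty list triggers the irq_work. */
static void model_defer_obj(struct model_state *m)
{
	if (m->pending++ == 0)
		model_irq_work_fn(m);
}

/* Late init: mark the workqueue ready and kick the worker once. */
static void model_init_late(struct model_state *m)
{
	m->wq_ready = true;
	model_worker(m);
}
```

The unconditional kick in the late-init step is what drains objects
that were deferred while the handler was still returning early.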

Factor out the logic that calculates the object start address from
kvfree_rcu_cb() into object_start_address() to avoid duplicating it.
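
The core arithmetic of recovering an object's start from a pointer
into its middle can be sketched as follows. This is a simplified
userspace version with a hypothetical function name: it uses a plain
division for the index and ignores the KFENCE case and the left red
zone fixup that the kernel helper also handles.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Given the base of a slab-like array of equally sized objects and a
 * pointer somewhere inside one object, return that object's start.
 */
static void *model_object_start(void *slab_base, size_t obj_size,
				void *interior)
{
	size_t off = (size_t)((char *)interior - (char *)slab_base);
	size_t idx = off / obj_size;	/* index of the containing object */

	return (char *)slab_base + idx * obj_size;
}
```

This is why a single embedded pointer is enough for the fallback: the
list node's own address identifies the object it lives in.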

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 include/linux/rcupdate.h |  2 +-
 include/linux/types.h    |  4 +++
 mm/slab.h                |  2 +-
 mm/slab_common.c         |  2 +-
 mm/slub.c                | 92 ++++++++++++++++++++++++++++++++----------------
 5 files changed, 69 insertions(+), 33 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 3025249bfcb5..83e66585001f 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -1099,7 +1099,7 @@ static inline void rcu_read_unlock_migrate(void)
  * In mm/slab_common.c, no suitable header to include here.
  */
 void kvfree_call_rcu(struct rcu_head *head, void *ptr);
-void kfree_call_rcu_nolock(struct rcu_head *head, void *ptr);
+void kfree_call_rcu_nolock(struct kfree_rcu_head *head, void *ptr);
 
 /*
  * The BUILD_BUG_ON() makes sure the rcu_head offset can be handled. See the
diff --git a/include/linux/types.h b/include/linux/types.h
index 93166b0b0617..695f2a71411e 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -255,6 +255,10 @@ struct callback_head {
 } __attribute__((aligned(sizeof(void *))));
 #define rcu_head callback_head
 
+struct kfree_rcu_head {
+	struct kfree_rcu_head *next;
+};
+
 typedef void (*rcu_callback_t)(struct rcu_head *head);
 typedef void (*call_rcu_func_t)(struct rcu_head *head, rcu_callback_t func);
 
diff --git a/mm/slab.h b/mm/slab.h
index a493c5201e96..19a2a819fd13 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -745,7 +745,7 @@ void __check_heap_object(const void *ptr, unsigned long n,
 			 const struct slab *slab, bool to_user);
 
 void deferred_work_barrier(void);
-void defer_kfree_rcu(struct rcu_head *head);
+void defer_kfree_rcu(struct kfree_rcu_head *head);
 
 static inline bool slub_debug_orig_size(struct kmem_cache *s)
 {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 5a39e6225160..6b80b516c93f 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1263,7 +1263,7 @@ EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
 EXPORT_TRACEPOINT_SYMBOL(kfree);
 EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
 
-void kfree_call_rcu_nolock(struct rcu_head *head, void *ptr)
+void kfree_call_rcu_nolock(struct kfree_rcu_head *head, void *ptr)
 {
 	struct slab *slab;
 	struct kmem_cache *s;
diff --git a/mm/slub.c b/mm/slub.c
index 19018a979445..853430c7fbe0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4075,20 +4075,25 @@ static void flush_all(struct kmem_cache *s)
 
 struct deferred_percpu_work {
 	struct llist_head objects;
-	struct llist_head objects_by_rcu;
 	struct llist_head rcu_sheaves;
 	struct irq_work work;
 };
 
 static void deferred_percpu_work_fn(struct irq_work *work);
+static void deferred_irq_work_fn(struct irq_work *work);
+static void deferred_work_fn(struct work_struct *w);
 
 static DEFINE_PER_CPU(struct deferred_percpu_work, deferred_percpu_work) = {
 	.objects = LLIST_HEAD_INIT(objects),
-	.objects_by_rcu = LLIST_HEAD_INIT(objects_by_rcu),
 	.rcu_sheaves = LLIST_HEAD_INIT(rcu_sheaves),
 	.work = IRQ_WORK_INIT(deferred_percpu_work_fn),
 };
 
+static LLIST_HEAD(deferred_free_by_rcu);
+static struct workqueue_struct *deferred_wq;
+static DEFINE_IRQ_WORK(deferred_irq_work, deferred_irq_work_fn);
+static DECLARE_WORK(deferred_work, deferred_work_fn);
+
 static void flush_rcu_sheaf(struct work_struct *w)
 {
 	struct slub_percpu_sheaves *pcs;
@@ -6394,13 +6399,12 @@ static void free_to_pcs_bulk(struct kmem_cache *s, size_t size, void **p)
 static void deferred_percpu_work_fn(struct irq_work *work)
 {
 	struct deferred_percpu_work *dpw;
-	struct llist_head *objs, *objs_by_rcu, *rcu_sheaves;
+	struct llist_head *objs, *rcu_sheaves;
 	struct llist_node *llnode, *pos, *t;
 
 	dpw = container_of(work, struct deferred_percpu_work, work);
 	rcu_sheaves = &dpw->rcu_sheaves;
 	objs = &dpw->objects;
-	objs_by_rcu = &dpw->objects_by_rcu;
 
 	llnode = llist_del_all(objs);
 	llist_for_each_safe(pos, t, llnode) {
@@ -6431,12 +6435,50 @@ static void deferred_percpu_work_fn(struct irq_work *work)
 
 		call_rcu(&rcu_sheaf->rcu_head, rcu_free_sheaf);
 	}
+}
+
+static void deferred_irq_work_fn(struct irq_work *work)
+{
+	if (!deferred_wq)
+		return;
+
+	queue_work(deferred_wq, &deferred_work);
+}
+
+static inline void *object_start_address(void *ptr)
+{
+	void *obj;
+	struct slab *slab = virt_to_slab(ptr);
+	struct kmem_cache *s = slab->slab_cache;
+
+	VM_WARN_ON_ONCE(is_vmalloc_addr(ptr) || !slab);
+
+	if (is_kfence_address(ptr)) {
+		obj = kfence_object_start(ptr);
+	} else {
+		unsigned int idx = __obj_to_index(s, slab_address(slab), ptr);
+
+		obj = slab_address(slab) + s->size * idx;
+		obj = fixup_red_left(s, obj);
+	}
+
+	return obj;
+}
+
+static void deferred_work_fn(struct work_struct *w)
+{
+	struct llist_node *llnode, *pos, *t;
+
+	llnode = llist_del_all(&deferred_free_by_rcu);
+	if (!llnode)
+		return;
+
+	synchronize_rcu();
 
-	llnode = llist_del_all(objs_by_rcu);
 	llist_for_each_safe(pos, t, llnode) {
-		struct rcu_head *head = (struct rcu_head *)pos;
+		void *obj = object_start_address(pos);
 
-		call_rcu(head, kvfree_rcu_cb);
+		kfree(obj);
 	}
 }
 
@@ -6453,15 +6495,10 @@ static void defer_free(struct kmem_cache *s, void *head)
 		irq_work_queue(&dpw->work);
 }
 
-void defer_kfree_rcu(struct rcu_head *head)
+void defer_kfree_rcu(struct kfree_rcu_head *head)
 {
-	struct deferred_percpu_work *dpw;
-
-	guard(preempt)();
-
-	dpw = this_cpu_ptr(&deferred_percpu_work);
-	if (llist_add((struct llist_node *)head, &dpw->objects_by_rcu))
-		irq_work_queue(&dpw->work);
+	if (llist_add((struct llist_node *)head, &deferred_free_by_rcu))
+		irq_work_queue(&deferred_irq_work);
 }
 
 void deferred_work_barrier(void)
@@ -6470,6 +6507,9 @@ void deferred_work_barrier(void)
 
 	for_each_possible_cpu(cpu)
 		irq_work_sync(&per_cpu_ptr(&deferred_percpu_work, cpu)->work);
+
+	irq_work_sync(&deferred_irq_work);
+	flush_work(&deferred_work);
 }
 
 static __fastpath_inline
@@ -6731,8 +6771,6 @@ void kvfree_rcu_cb(struct rcu_head *head)
 	void *obj = head;
 	struct page *page;
 	struct slab *slab;
-	struct kmem_cache *s;
-	void *slab_addr;
 
 	if (is_vmalloc_addr(obj)) {
 		obj = (void *) PAGE_ALIGN_DOWN((unsigned long)obj);
@@ -6752,19 +6790,8 @@ void kvfree_rcu_cb(struct rcu_head *head)
 		return;
 	}
 
-	s = slab->slab_cache;
-	slab_addr = slab_address(slab);
-
-	if (is_kfence_address(obj)) {
-		obj = kfence_object_start(obj);
-	} else {
-		unsigned int idx = __obj_to_index(s, slab_addr, obj);
-
-		obj = slab_addr + s->size * idx;
-		obj = fixup_red_left(s, obj);
-	}
-
-	slab_free(s, slab, obj, _RET_IP_);
+	obj = object_start_address(obj);
+	slab_free(slab->slab_cache, slab, obj, _RET_IP_);
 }
 
 /**
@@ -8698,6 +8725,11 @@ void __init kmem_cache_init_late(void)
 	flushwq = alloc_workqueue("slub_flushwq", WQ_MEM_RECLAIM | WQ_PERCPU,
 				  0);
 	WARN_ON(!flushwq);
+
+	deferred_wq = alloc_workqueue("slab_deferred_wq",
+				      WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
+	WARN_ON(!deferred_wq);
+	queue_work(deferred_wq, &deferred_work);
 #ifdef CONFIG_SLAB_FREELIST_RANDOM
 	prandom_init_once(&slab_rnd_state);
 #endif

-- 
2.53.0



* [PATCH for-next v3 9/9] slub_kunit: extend the test for kfree_rcu_nolock()
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
                   ` (7 preceding siblings ...)
  2026-06-15 11:06 ` [PATCH for-next v3 8/9] mm/slab: introduce struct kfree_rcu_head and use in kfree_rcu_nolock() Harry Yoo (Oracle)
@ 2026-06-15 11:06 ` Harry Yoo (Oracle)
  2026-06-15 11:43 ` [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo
  2026-06-15 20:28 ` Alexei Starovoitov
  10 siblings, 0 replies; 23+ messages in thread
From: Harry Yoo (Oracle) @ 2026-06-15 11:06 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov,
	Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf

Extend the slub_kunit test to exercise kfree_rcu_nolock(). When
slub_kunit is not built-in, free every other object with kfree_rcu()
or kfree_rcu_nolock() instead of kfree() or kfree_nolock().

Rename the test case, since it now covers more _nolock() APIs.

Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
---
 lib/tests/slub_kunit.c | 38 ++++++++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 12 deletions(-)

diff --git a/lib/tests/slub_kunit.c b/lib/tests/slub_kunit.c
index 01d808cb77fa..6ed4de88bf0b 100644
--- a/lib/tests/slub_kunit.c
+++ b/lib/tests/slub_kunit.c
@@ -162,7 +162,10 @@ static void test_kmalloc_redzone_access(struct kunit *test)
 }
 
 struct test_kfree_rcu_struct {
-	struct rcu_head rcu;
+	union {
+		struct rcu_head rcu;
+		struct kfree_rcu_head krcu;
+	};
 };
 
 static void test_kfree_rcu(struct kunit *test)
@@ -294,10 +297,10 @@ static void test_krealloc_redzone_zeroing(struct kunit *test)
 }
 
 #if defined(CONFIG_PERF_EVENTS) || (defined(CONFIG_KPROBES) && defined(CONFIG_SMP))
-#define SLUB_KUNIT_TEST_KMALLOC_KFREE_NOLOCK
+#define SLUB_KUNIT_TEST_KMALLOC_NOLOCK_AND_FRIENDS
 #define NR_ITERATIONS 1000
 #define NR_OBJECTS 1000
-static void *objects[NR_OBJECTS];
+static struct test_kfree_rcu_struct *objects[NR_OBJECTS];
 
 struct test_nolock_context {
 	struct kunit *test;
@@ -336,19 +339,24 @@ static struct perf_event_attr sw_attr = {
 
 static void test_nolock(struct test_nolock_context *ctx)
 {
-	void *objp;
+	struct test_kfree_rcu_struct *objp;
 	gfp_t gfp;
+	bool can_use_kfree_rcu = !IS_BUILTIN(CONFIG_SLUB_KUNIT_TEST);
 
 	/* __GFP_ACCOUNT to test kmalloc_nolock() in alloc_slab_obj_exts() */
 	gfp = (ctx->callback_count % 2) ? 0 : __GFP_ACCOUNT;
-	objp = kmalloc_nolock(64, gfp, NUMA_NO_NODE);
+	objp = kmalloc_nolock(sizeof(*objp), gfp, NUMA_NO_NODE);
 
 	if (objp)
 		ctx->alloc_ok++;
 	else
 		ctx->alloc_fail++;
 
-	kfree_nolock(objp);
+	if (can_use_kfree_rcu && (ctx->callback_count % 2))
+		kfree_rcu_nolock(objp, krcu);
+	else
+		kfree_nolock(objp);
+
 	ctx->callback_count++;
 }
 
@@ -432,12 +440,13 @@ static bool register_slab_kprobes(struct test_nolock_context *ctx) { return fals
 static void unregister_slab_kprobes(struct test_nolock_context *ctx) { }
 #endif
 
-static void test_kmalloc_kfree_nolock(struct kunit *test)
+static void test_kmalloc_nolock_and_friends(struct kunit *test)
 {
 	int i, j;
 	struct test_nolock_context perf_ctx = { .test = test };
 	struct test_nolock_context kprobe_ctx = { .test = test };
 	bool alloc_fail = false;
+	bool can_use_kfree_rcu = !IS_BUILTIN(CONFIG_SLUB_KUNIT_TEST);
 	bool perf_events_enabled;
 	bool slab_kprobes_enabled;
 
@@ -455,7 +464,7 @@ static void test_kmalloc_kfree_nolock(struct kunit *test)
 		for (j = 0; j < NR_OBJECTS; j++) {
 			gfp_t gfp = (i % 2) ? GFP_KERNEL : GFP_KERNEL_ACCOUNT;
 
-			objects[j] = kmalloc(64, gfp);
+			objects[j] = kmalloc_obj(*objects[j], gfp);
 			if (!objects[j]) {
 				j--;
 				while (j >= 0)
@@ -464,8 +473,13 @@ static void test_kmalloc_kfree_nolock(struct kunit *test)
 				goto cleanup;
 			}
 		}
-		for (j = 0; j < NR_OBJECTS; j++)
-			kfree(objects[j]);
+
+		for (j = 0; j < NR_OBJECTS; j++) {
+			if (can_use_kfree_rcu && (i % 2))
+				kfree_rcu(objects[j], rcu);
+			else
+				kfree(objects[j]);
+		}
 	}
 
 cleanup:
@@ -504,8 +518,8 @@ static struct kunit_case test_cases[] = {
 	KUNIT_CASE(test_kfree_rcu_wq_destroy),
 	KUNIT_CASE(test_leak_destroy),
 	KUNIT_CASE(test_krealloc_redzone_zeroing),
-#ifdef SLUB_KUNIT_TEST_KMALLOC_KFREE_NOLOCK
-	KUNIT_CASE_SLOW(test_kmalloc_kfree_nolock),
+#ifdef SLUB_KUNIT_TEST_KMALLOC_NOLOCK_AND_FRIENDS
+	KUNIT_CASE_SLOW(test_kmalloc_nolock_and_friends),
 #endif
 	{}
 };

-- 
2.53.0



* Re: [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available
  2026-06-15 11:05 ` [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available Harry Yoo (Oracle)
@ 2026-06-15 11:14   ` sashiko-bot
  2026-06-15 12:58   ` Harry Yoo
  2026-06-15 20:01   ` Alexei Starovoitov
  2 siblings, 0 replies; 23+ messages in thread
From: sashiko-bot @ 2026-06-15 11:14 UTC (permalink / raw)
  Cc: linux-rt-devel, bpf, linux-kernel

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] The test cleanup function reads and logs test context variables before disabling the asynchronous perf event, introducing a data race and causing the reported counts to be inaccurate.
--

commit 4fccf2c0f65ae8a2788eab369ec35d223d9c897d
Author: Harry Yoo (Oracle) <harry@kernel.org>

slub_kunit: fall back to SW perf events when HW PMU is not available

This commit introduces a fallback mechanism in the slub_kunit tests to use
software perf events when a hardware PMU is unavailable. This ensures the
tests can still execute in virtualized environments that lack virtual PMUs.

> diff --git a/lib/tests/slub_kunit.c b/lib/tests/slub_kunit.c
> --- a/lib/tests/slub_kunit.c
> +++ b/lib/tests/slub_kunit.c
[ ... ]
> +static void disable_perf_events(struct test_nolock_context *ctx)
> +{
> +	kunit_info(ctx->test, "%s perf events: callback_count: %d, alloc_ok: %d, alloc_fail: %d\n",
> +		   ctx->is_perf_type_hw ? "HW" : "SW",
> +		   ctx->callback_count, ctx->alloc_ok, ctx->alloc_fail);
> +
> +	perf_event_disable(ctx->event);

[Severity: Medium]
Could reading the counters before disabling the perf event result in missed
counts?

Since the perf event continues to fire asynchronously in NMI or IRQ context
until perf_event_disable() is called, the variables might be incremented
after kunit_info() has already read them. 

Would it be safer to disable the perf event before reading and logging the
results, restoring the original ordering from the previous cleanup block?
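
The ordering concern above can be sketched in user space. This is only
a toy model, not the kernel code: struct counters, fire() and
disable_then_read() are hypothetical names, and the enabled flag merely
stands in for what perf_event_disable() achieves.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the test context; not kernel code. */
struct counters {
	bool enabled;		/* models the perf event being armed */
	int callback_count;
};

/* Models one asynchronous overflow callback. */
static void fire(struct counters *c)
{
	if (c->enabled)
		c->callback_count++;
}

/* Stop the event source first, then take the snapshot to be logged. */
static int disable_then_read(struct counters *c)
{
	c->enabled = false;		/* plays the role of perf_event_disable() */
	return c->callback_count;	/* stable: no callback can change it now */
}
```

With this ordering, a callback that fires after the snapshot cannot make
the logged value stale, because the source is already disabled.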

> +	perf_event_release_kernel(ctx->event);
> +}
[ ... ]
> @@ -368,11 +412,7 @@ static void test_kmalloc_kfree_nolock(struct kunit *test)
>  	}
>  
>  cleanup:
> -	perf_event_disable(ctx.event);
> -	perf_event_release_kernel(ctx.event);
> -
> -	kunit_info(test, "callback_count: %d, alloc_ok: %d, alloc_fail: %d\n",
> -		   ctx.callback_count, ctx.alloc_ok, ctx.alloc_fail);
> +	disable_perf_events(&ctx);
>  
>  	if (alloc_fail)
>  		kunit_skip(test, "Allocation failed");

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb@kernel.org?part=1


* Re: [PATCH for-next v3 6/9] mm/slab: allow kfree_rcu_sheaf() on PREEMPT_RT
  2026-06-15 11:06 ` [PATCH for-next v3 6/9] mm/slab: allow kfree_rcu_sheaf() on PREEMPT_RT Harry Yoo (Oracle)
@ 2026-06-15 11:19   ` sashiko-bot
  0 siblings, 0 replies; 23+ messages in thread
From: sashiko-bot @ 2026-06-15 11:19 UTC (permalink / raw)
  Cc: linux-rt-devel, linux-kernel, bpf

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Calling kfree_rcu_sheaf() on PREEMPT_RT from kvfree_call_rcu() causes an 'Invalid wait context' lockdep splat and potential deadlocks due to acquiring a sleeping lock from raw_spinlock, hardirq, or NMI contexts.
--

commit 67f0bb194ef2843a6fa5bfccfbb53884ff1a37b2
Author: Harry Yoo (Oracle) <harry@kernel.org>

mm/slab: allow kfree_rcu_sheaf() on PREEMPT_RT

This commit allows kfree_rcu_sheaf() to be used on kernels with
CONFIG_PREEMPT_RT by assuming that spinning is not allowed. It removes
the unconditional bailout and relaxes the limitation to try the sheaves
path.

Link: https://lore.kernel.org/linux-mm/6811cc17-8ee4-48c8-8cbf-6bf4d9f98162@kernel.org

> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 55546b8385ff3..807924a94fb08 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
[ ... ]
> @@ -1954,7 +1961,7 @@ void kvfree_call_rcu(struct rcu_head *head, void *ptr)
>  	if (!head)
>  		might_sleep();
>  
> -	if (!IS_ENABLED(CONFIG_PREEMPT_RT) && kfree_rcu_sheaf(ptr))
> +	if (kfree_rcu_sheaf(ptr))
>  		return;
>  
>  	// Queue the object but don't yet schedule the batch.

[ ... ]
> diff --git a/mm/slub.c b/mm/slub.c
> index ba593c1c53d50..4850629774b2a 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
[ ... ]
> @@ -6096,10 +6097,10 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
>  	struct slub_percpu_sheaves *pcs;
>  	struct slab_sheaf *rcu_sheaf;
>  
> -	if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
> -		return false;
> +	VM_WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT) && allow_spin);
>  
> -	lock_map_acquire_try(&kfree_rcu_sheaf_map);
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		lock_map_acquire_try(&kfree_rcu_sheaf_map);
>  
>  	if (!local_trylock(&s->cpu_sheaves->lock))

[Severity: High]
Does acquiring this local lock with CONFIG_PREEMPT_RT enabled from
kvfree_call_rcu() trigger an invalid wait context lockdep splat?

Since kvfree_call_rcu() can be called from hardirq or nmi context, or
while holding a raw_spinlock_t (like pi_lock), the caller is in a raw
spin wait context. With CONFIG_PREEMPT_RT, local_lock maps to a
spinlock_t (rt_mutex), which is a sleeping lock.

Even though this uses a trylock and allow_spin is false, acquiring a
sleeping lock under a raw spinlock or in hardirq context violates
lockdep's wait context rules because lockdep evaluates the declared wait
types of the locks, not the runtime behavior.

Additionally, could the rt_mutex trylock implementation internally attempt
to acquire the task's pi_lock, leading to a potential deadlock?
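
The wait-context rule described above can be illustrated with a toy
model. This is a simplified sketch under stated assumptions, not
lockdep's implementation: the enum values and wait_context_ok() are
hypothetical, and the only point is that the check looks at declared
wait types and takes no "trylock" argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified wait types, ordered from most to least restrictive context. */
enum wait_type {
	WAIT_SPIN = 1,	/* raw_spinlock_t, hardirq: must never sleep */
	WAIT_CONFIG,	/* spinlock_t, local_lock: sleeps on PREEMPT_RT */
	WAIT_SLEEP,	/* mutex and other sleeping locks */
};

/*
 * Toy version of the check: a lock may be taken only if its declared wait
 * type is no more permissive than the innermost context already held.
 * Note that there is no "is this a trylock?" parameter.
 */
static bool wait_context_ok(enum wait_type held_inner, enum wait_type next)
{
	return next <= held_inner;
}
```

In this model, taking a WAIT_CONFIG lock while the innermost context is
WAIT_SPIN is rejected, which corresponds to the local_trylock() under
raw_spinlock_t case raised in the review.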

>  		goto fail;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb@kernel.org?part=6


* Re: [PATCH for-next v3 8/9] mm/slab: introduce struct kfree_rcu_head and use in kfree_rcu_nolock()
  2026-06-15 11:06 ` [PATCH for-next v3 8/9] mm/slab: introduce struct kfree_rcu_head and use in kfree_rcu_nolock() Harry Yoo (Oracle)
@ 2026-06-15 11:22   ` sashiko-bot
  0 siblings, 0 replies; 23+ messages in thread
From: sashiko-bot @ 2026-06-15 11:22 UTC (permalink / raw)
  Cc: bpf, linux-rt-devel, linux-kernel

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Unbounded iteration over deferred RCU objects in the workqueue lacks `cond_resched()`, risking soft lockups.
--

commit 300f5714ccdf41bb1175c5f608a81c4624b37553
Author: Harry Yoo (Oracle) <harry@kernel.org>

mm/slab: introduce struct kfree_rcu_head and use in kfree_rcu_nolock()

This commit introduces struct kfree_rcu_head to reduce memory overhead for
kfree_rcu_nolock(). It adds a global fallback list and a deferred workqueue
to handle cases where trylock fails or no empty sheaves are available.

> diff --git a/mm/slub.c b/mm/slub.c
> index 19018a979445e..853430c7fbe06 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
[ ... ]
> @@ -6431,12 +6435,50 @@ static void deferred_percpu_work_fn(struct irq_work *work)
[ ... ]
> +static void deferred_work_fn(struct work_struct *w)
> +{
> +	struct llist_node *llnode, *pos, *t;
> +
> +	llnode = llist_del_all(&deferred_free_by_rcu);
> +	if (!llnode)
> +		return;
> +
> +	synchronize_rcu();
>  
> -	llnode = llist_del_all(objs_by_rcu);
>  	llist_for_each_safe(pos, t, llnode) {
> -		struct rcu_head *head = (struct rcu_head *)pos;
> +		void *obj = object_start_address(pos);
>  
> -		call_rcu(head, kvfree_rcu_cb);
> +		kfree(obj);
>  	}
>  }

[Severity: Medium]
Could this loop cause soft lockups on kernels without preemption?

Since deferred_free_by_rcu is a global list aggregating objects from all
CPUs when the fallback path is taken, it could potentially grow very large.
Executing kfree() on an unbounded number of objects in a single continuous
loop without yielding the CPU might trigger soft lockup warnings.

Would it be safer to add cond_resched() inside this loop to prevent hogging
the CPU?
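
As an illustration of the suggestion, here is a user-space sketch of a
drain loop that yields after each object. It is not the patch's code:
struct node and free_all_with_yield() are hypothetical, and
sched_yield() only stands in for cond_resched().

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>
#include <stdlib.h>

struct node {
	struct node *next;
};

/* Free every node on the list, yielding the CPU after each one. */
static size_t free_all_with_yield(struct node *head)
{
	size_t freed = 0;

	while (head) {
		struct node *next = head->next;	/* read before freeing */

		free(head);
		freed++;
		sched_yield();	/* user-space stand-in for cond_resched() */
		head = next;
	}
	return freed;
}
```

The next pointer is saved before the node is freed, which mirrors what
llist_for_each_safe() does in the quoted loop.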

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb@kernel.org?part=8


* Re: [PATCH for-next v3 7/9] mm/slab: introduce kfree_rcu_nolock()
  2026-06-15 11:06 ` [PATCH for-next v3 7/9] mm/slab: introduce kfree_rcu_nolock() Harry Yoo (Oracle)
@ 2026-06-15 11:22   ` sashiko-bot
  0 siblings, 0 replies; 23+ messages in thread
From: sashiko-bot @ 2026-06-15 11:22 UTC (permalink / raw)
  Cc: linux-rt-devel, bpf, linux-kernel

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] kfree_call_rcu_nolock() crashes when freeing large kmalloc objects or vmalloc addresses due to an unconditional dereference of `slab->slab_cache`.
- [High] Calling `__kfree_rcu_sheaf()` on PREEMPT_RT triggers an 'Invalid wait context' lockdep warning because it attempts a `local_trylock()` under a `raw_spinlock_t`.
--

commit 8bccb182210df2707b079473bfa04aa463cafa86
Author: Harry Yoo (Oracle) <harry@kernel.org>

mm/slab: introduce kfree_rcu_nolock()

This commit introduces kfree_rcu_nolock(), a variant of kfree_rcu() designed
to be safely called from unknown contexts without falling back to batched
processing. It primarily tries to use the sheaf path and falls back to
defer_kfree_rcu() if that fails.

> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 807924a94fb08..5a39e62251609 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1263,6 +1263,23 @@ EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
>  EXPORT_TRACEPOINT_SYMBOL(kfree);
>  EXPORT_TRACEPOINT_SYMBOL(kmem_cache_free);
>  
> +void kfree_call_rcu_nolock(struct rcu_head *head, void *ptr)
> +{
> +	struct slab *slab;
> +	struct kmem_cache *s;
> +
> +	VM_WARN_ON_ONCE(is_vmalloc_addr(ptr) || !virt_to_slab(ptr));
> +
> +	slab = virt_to_slab(ptr);
> +	s = slab->slab_cache;

[Severity: Critical]
Does this crash when freeing large kmalloc objects or vmalloc addresses?

If ptr is a large kmalloc object (which bypasses SLUB) or a vmalloc address,
virt_to_slab(ptr) returns NULL. The code issues a VM_WARN_ON_ONCE but
proceeds to blindly dereference slab->slab_cache, causing a deterministic
NULL pointer dereference.

Standard kfree_rcu() and kvfree_rcu() support these types of addresses, so
callers might pass them here expecting them to be handled safely. Should we
return early or handle the fallback path directly if slab is NULL?
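
One possible shape of the early return, sketched in user space. The
types and can_try_sheaf() below are hypothetical stand-ins, not the
kernel's struct slab or struct kmem_cache; the slab argument models the
result of virt_to_slab(), which may be NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct kmem_cache and struct slab. */
struct cache {
	int id;
};

struct slab {
	struct cache *slab_cache;
};

/*
 * Returns true and stores the cache in *out if the sheaf path can be
 * attempted; returns false if the caller should take the fallback path.
 */
static bool can_try_sheaf(struct slab *slab, struct cache **out)
{
	if (!slab)	/* not slab-backed: bail out before any dereference */
		return false;

	*out = slab->slab_cache;
	return true;
}
```

Whether the NULL case should warn, fall back, or be rejected outright is
a design decision for the series; the sketch only shows that the check
has to come before the dereference.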

> +
> +	if (__kfree_rcu_sheaf(s, ptr, /* allow_spin = */ false))
> +		return;

[Severity: High]
Can this trigger an 'Invalid wait context' lockdep warning on PREEMPT_RT?

The purpose of kfree_call_rcu_nolock() is to handle unknown contexts,
including hardirq or raw spinlock contexts. However, this unconditionally
calls __kfree_rcu_sheaf(), which acquires a lock via
local_trylock(&s->cpu_sheaves->lock).

On PREEMPT_RT, local_lock maps to a sleepable spinlock_t (RT mutex).
Acquiring a sleepable lock inside a raw spinlock context causes a lockdep
splat because CONFIG_PROVE_RAW_LOCK_NESTING checks declared wait-types even
for trylocks.

> +
> +	defer_kfree_rcu(head);
> +}
> +EXPORT_SYMBOL_GPL(kfree_call_rcu_nolock);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb@kernel.org?part=7


* Re: [PATCH for-next v3 5/9] mm/slab: extend deferred free mechanism to handle rcu sheaves
  2026-06-15 11:05 ` [PATCH for-next v3 5/9] mm/slab: extend deferred free mechanism to handle rcu sheaves Harry Yoo (Oracle)
@ 2026-06-15 11:24   ` sashiko-bot
  0 siblings, 0 replies; 23+ messages in thread
From: sashiko-bot @ 2026-06-15 11:24 UTC (permalink / raw)
  Cc: linux-rt-devel, linux-kernel, bpf

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [High] Acquiring local_trylock inside __kfree_rcu_sheaf() violates lock nesting rules on PREEMPT_RT, triggering a lockdep "Invalid wait context" warning.

Pre-existing issues:
- [High] Concurrent lockless calls to deferred_work_barrier() cause an rcuwait race on PREEMPT_RT, leading to permanent task hangs.
--

commit b0c903d6396ede7c44ac45135ec7adc7d173391f
Author: Harry Yoo (Oracle) <harry@kernel.org>
Subject: mm/slab: extend deferred free mechanism to handle rcu sheaves

This patch extends the deferred free fallback mechanism so that a full RCU
sheaf can be submitted to call_rcu() via the existing IRQ work when spinning
is not allowed and IRQs are disabled. It also renames the defer_free structs
to deferred_percpu_work and adds deferred_work_barrier() calls to keep the
kvfree_rcu barrier's promise.

> diff --git a/mm/slub.c b/mm/slub.c
> index 6a3552b70683d..ba593c1c53d50 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
[ ... ]
> @@ -6158,12 +6175,6 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
>  	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
>  		rcu_sheaf = NULL;
>  	} else {

[Severity: High]
Does acquiring local_trylock inside __kfree_rcu_sheaf() violate lock nesting
rules on PREEMPT_RT?

The kvfree_call_rcu() API is frequently invoked from hardirq context or while
holding a raw_spinlock_t.  This function attempts to acquire a local_trylock:

    if (!local_trylock(&s->cpu_sheaves->lock))

On PREEMPT_RT kernels, local_trylock_t maps to a sleepable spinlock_t.
Attempting to acquire a sleepable lock (even via a trylock operation) while in
a raw_spinlock or hardirq context violates the lock nesting rules.
CONFIG_PROVE_RAW_LOCK_NESTING will catch this and trigger a lockdep warning
for an invalid wait context.

> -		/* call_rcu() disables IRQs to protect percpu data structures */
> -		if (unlikely(!allow_spin && irqs_disabled())) {
> -			rcu_sheaf->size--;
> -			local_unlock(&s->cpu_sheaves->lock);
> -			goto fail;
> -		}
>  		pcs->rcu_free = NULL;
>  		rcu_sheaf->node = numa_node_id();
>  	}
[ ... ]
> @@ -6408,27 +6418,34 @@ static void free_deferred_objects(struct irq_work *work)
[ ... ]
>  static void defer_free(struct kmem_cache *s, void *head)
>  {
> -	struct defer_free *df;
> +	struct deferred_percpu_work *dpw;
>  
>  	guard(preempt)();
>  
>  	head = kasan_reset_tag(head);
>  
> -	df = this_cpu_ptr(&defer_free_objects);
> -	if (llist_add(head + s->offset, &df->objects))
> -		irq_work_queue(&df->work);
> +	dpw = this_cpu_ptr(&deferred_percpu_work);
> +	if (llist_add(head + s->offset, &dpw->objects))
> +		irq_work_queue(&dpw->work);
>  }
>  
> -void defer_free_barrier(void)
> +void deferred_work_barrier(void)
>  {
>  	int cpu;
>  
>  	for_each_possible_cpu(cpu)
> -		irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);

[Severity: High]
This is a pre-existing issue, but can concurrent lockless calls to
deferred_work_barrier() cause an rcuwait race on PREEMPT_RT, leading to
permanent task hangs?

The function iterates over all CPUs, invoking irq_work_sync() on each CPU's
deferred work object. On PREEMPT_RT, irq_work_sync() relies on
rcuwait_wait_event() to block until completion, and the rcuwait
synchronization primitive strictly allows only one waiter at a time.

Because deferred_work_barrier() is called without any global serialization
(for instance, in kmem_cache_destroy() and kvfree_rcu_barrier_on_cache(),
and now in flush_all_rcu_sheaves()), multiple threads can enter
irq_work_sync() for the same work object concurrently.

This overwrites the waiter task pointer, meaning only one task will be woken
up when the work completes, leaving the other tasks hanging permanently in an
uninterruptible sleep.

> +		irq_work_sync(&per_cpu_ptr(&deferred_percpu_work, cpu)->work);
>  }
>  
>  static __fastpath_inline

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb@kernel.org?part=5


* Re: [PATCH for-next v3 3/9] mm/slab: handle the !allow_spin case in kfree_rcu_sheaf()
  2026-06-15 11:05 ` [PATCH for-next v3 3/9] mm/slab: handle the !allow_spin case in kfree_rcu_sheaf() Harry Yoo (Oracle)
@ 2026-06-15 11:24   ` sashiko-bot
  0 siblings, 0 replies; 23+ messages in thread
From: sashiko-bot @ 2026-06-15 11:24 UTC (permalink / raw)
  Cc: bpf, linux-rt-devel, linux-kernel

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

New issues:
- [High] Kmemleak tracking symmetry violation for empty sheaves allocated with !allow_spin.

Pre-existing issues:
- [High] Deadlock under scheduler locks due to GFP_NOWAIT allocation in __kfree_rcu_sheaf().
--

commit 0ddecba01f906bc55d1c11049f09813fcc05200e
Author: Harry Yoo (Oracle) <harry@kernel.org>

mm/slab: handle the !allow_spin case in kfree_rcu_sheaf()

This commit updates kfree_rcu_sheaf() to handle cases where spinning is not
allowed, by allocating and freeing empty sheaves without taking locks. It also
modifies the allocation flags and fallback paths depending on the allow_spin
context.

> diff --git a/mm/slub.c b/mm/slub.c
> index 87ca154ccd80f..b0d38d5153865 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
[ ... ]
> @@ -2827,11 +2828,20 @@ static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
>  		mark_obj_codetag_empty(sheaf);
>  
>  	VM_WARN_ON_ONCE(sheaf->size > 0);
> -	kfree(sheaf);
> +
> +	if (likely(allow_spin))
> +		kfree(sheaf);
> +	else
> +		kfree_nolock(sheaf);
>  
>  	stat(s, SHEAF_FREE);
>  }
>  
> +static void free_empty_sheaf(struct kmem_cache *s, struct slab_sheaf *sheaf)
> +{
> +	__free_empty_sheaf(s, sheaf, /* allow_spin = */ true);
> +}

[Severity: High]
Does hardcoding allow_spin to true here cause a kmemleak tracking symmetry
violation for sheaves allocated with allow_spin = false?

In __kfree_rcu_sheaf(), if allow_spin is false, the empty sheaf is allocated
using SLAB_ALLOC_TRYLOCK and gfp = 0. Because __GFP_RECLAIM bits are missing,
SLUB skips kmemleak registration for the object.

When this sheaf becomes full and is processed by RCU, rcu_free_sheaf() handles
the cleanup and delegates to free_empty_sheaf(), which now unconditionally
calls kfree(sheaf) because allow_spin is hardcoded to true. Calling kfree()
on an object that wasn't registered with kmemleak triggers kmemleak warnings.

[ ... ]
> @@ -6112,20 +6122,25 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj)
>  
>  		local_unlock(&s->cpu_sheaves->lock);
>  
> -		empty = alloc_empty_sheaf(s, GFP_NOWAIT, SLAB_ALLOC_DEFAULT);
> +		if (unlikely(!allow_spin)) {
> +			alloc_flags = SLAB_ALLOC_TRYLOCK;
> +			gfp = 0;
> +		}
> +
> +		empty = alloc_empty_sheaf(s, gfp, alloc_flags);

[Severity: High]
This is a pre-existing issue, but does the use of GFP_NOWAIT when allow_spin
is true risk a deadlock?

Since GFP_NOWAIT contains the __GFP_KSWAPD_RECLAIM flag, the allocation can
trigger wakeup_kswapd(). This enters the scheduler and attempts to acquire
scheduler locks.

If kvfree_call_rcu() is invoked while a scheduler lock is already held,
wouldn't this cause a lock recursion deadlock?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb@kernel.org?part=3


* Re: [PATCH for-next v3 4/9] mm/slab: use call_rcu() in unknown context if irqs are enabled
  2026-06-15 11:05 ` [PATCH for-next v3 4/9] mm/slab: use call_rcu() in unknown context if irqs are enabled Harry Yoo (Oracle)
@ 2026-06-15 11:25   ` sashiko-bot
  0 siblings, 0 replies; 23+ messages in thread
From: sashiko-bot @ 2026-06-15 11:25 UTC (permalink / raw)
  Cc: bpf, linux-kernel, linux-rt-devel

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Acquiring a local_trylock inside a raw_spinlock_t wait context triggers a lockdep Invalid wait context warning.
--

commit 05f00697f28bd84542538d9e89bbc00e3f7a427a
Author: Harry Yoo (Oracle) <harry@kernel.org>

mm/slab: use call_rcu() in unknown context if irqs are enabled

This commit modifies the fallback logic for deferred RCU freeing in the
slab allocator. It now uses call_rcu() even in an unknown context as long
as interrupts are enabled.

Link: https://lore.kernel.org/linux-mm/CAADnVQKRVD5ZSnEKbZZU7w86gHbGHUug2pvzpgZTngNS+fg4rw@mail.gmail.com

> diff --git a/mm/slub.c b/mm/slub.c
> index b0d38d5153865..6a3552b70683d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -6158,8 +6158,8 @@ bool __kfree_rcu_sheaf(struct kmem_cache *s, void *obj, bool allow_spin)
>  	if (likely(rcu_sheaf->size < s->sheaf_capacity)) {
>  		rcu_sheaf = NULL;
>  	} else {
> -		if (unlikely(!allow_spin)) {
> -			/* call_rcu() cannot be called in an unknown context */
> +		/* call_rcu() disables IRQs to protect percpu data structures */

[Severity: High]
This isn't a bug introduced by this patch, but since you are modifying
__kfree_rcu_sheaf() here, can this function trigger an invalid wait context
warning in lockdep?

According to the RCU guidelines, kvfree_call_rcu() is frequently called
while holding a raw_spinlock_t (such as pi_lock in kernel/sched/core.c)
and from hardirq context:

mm/slab_common.c:kvfree_call_rcu() {
    ...
    if (kfree_rcu_sheaf(ptr))
        return;
    ...
}

This calls into __kfree_rcu_sheaf(), which unconditionally attempts to
acquire a local_lock earlier in the function:

mm/slub.c:__kfree_rcu_sheaf() {
    ...
    if (!local_trylock(&s->cpu_sheaves->lock))
        goto fail;
    ...
}

Because local_lock has a lower lockdep wait type than raw_spinlock_t,
does this wait context inversion violate nesting rules and trigger
warnings? 

Guarding lock_map_acquire_try with !IS_ENABLED(CONFIG_PREEMPT_RT) does not
prevent this because local_trylock internally checks wait contexts
regardless. Could this lead to panics on systems with panic_on_warn enabled?

> +		if (unlikely(!allow_spin && irqs_disabled())) {
>  			rcu_sheaf->size--;
>  			local_unlock(&s->cpu_sheaves->lock);
>  			goto fail;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb@kernel.org?part=4


* Re: [PATCH for-next v3 2/9] mm/slab, slub_kunit: register kprobe to trigger _nolock APIs
  2026-06-15 11:05 ` [PATCH for-next v3 2/9] mm/slab, slub_kunit: register kprobe to trigger _nolock APIs Harry Yoo (Oracle)
@ 2026-06-15 11:25   ` sashiko-bot
  2026-06-15 20:04   ` Alexei Starovoitov
  1 sibling, 0 replies; 23+ messages in thread
From: sashiko-bot @ 2026-06-15 11:25 UTC (permalink / raw)
  Cc: linux-rt-devel, linux-kernel, bpf

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] Global kprobe execution introduces a data race on shared KUnit context variables, which will trigger KCSAN warnings.
- [Low] The `test_nolock()` function is conditionally compiled only when `CONFIG_PERF_EVENTS` is enabled, but it is called unconditionally by `slab_kprobe_pre_handler()` when `CONFIG_KPROBES` and `CONFIG_SMP` are enabled.
--

commit d19f8bc8e3ea86ee712ab0dede8b68fd72d0fbcf
Author: Harry Yoo (Oracle) <harry@kernel.org>

mm/slab, slub_kunit: register kprobe to trigger _nolock APIs

This patch modifies the slub_kunit test to attach a kprobe pre-handler
to test the kmalloc_nolock() and kfree_nolock() APIs. It introduces a
wrapper function slab_attach_kprobe_locked() that is invoked when
slab_lockdep_assert_held() is called.

Link: https://lore.kernel.org/linux-mm/20260427-nolock-api-fix-v2-0-a6b83a92d9a4@kernel.org [1]

> diff --git a/lib/tests/slub_kunit.c b/lib/tests/slub_kunit.c
> index 11255fc8eb781..01d808cb77faf 100644
> --- a/lib/tests/slub_kunit.c
> +++ b/lib/tests/slub_kunit.c
[ ... ]
> @@ -326,13 +334,10 @@ static struct perf_event_attr sw_attr = {
>  	.sample_freq = 100000,
>  };
>  
> -static void overflow_handler_test_nolock(struct perf_event *event,
> -					 struct perf_sample_data *data,
> -					 struct pt_regs *regs)
> +static void test_nolock(struct test_nolock_context *ctx)
>  {
>  	void *objp;
>  	gfp_t gfp;
> -	struct test_nolock_context *ctx = event->overflow_handler_context;
>  
>  	/* __GFP_ACCOUNT to test kmalloc_nolock() in alloc_slab_obj_exts() */
>  	gfp = (ctx->callback_count % 2) ? 0 : __GFP_ACCOUNT;
> @@ -347,6 +352,15 @@ static void overflow_handler_test_nolock(struct perf_event *event,
>  	ctx->callback_count++;
>  }

[ ... ]

> +#if defined(CONFIG_KPROBES) && defined(CONFIG_SMP)
> +static int slab_kprobe_pre_handler(struct kprobe *p, struct pt_regs *regs)
> +{
> +	struct test_nolock_context *ctx;
> +
> +	ctx = container_of(p, struct test_nolock_context, kprobe);
> +	test_nolock(ctx);

[Severity: Low]
If the kernel is built with CONFIG_KPROBES=y and CONFIG_SMP=y, but
CONFIG_PERF_EVENTS=n, will this result in a build failure?

The test_nolock() function is defined inside an ifdef CONFIG_PERF_EVENTS
block earlier in the file, but it is called here unconditionally when
kprobes are enabled.

[Severity: Medium]
Because slab_attach_kprobe_locked() is a global function called by SLUB
internals system-wide, could multiple CPUs hit this kprobe concurrently
during the test?

If other CPUs perform slab operations, they will trigger
slab_kprobe_pre_handler() and concurrently pass the shared kprobe_ctx to
test_nolock(). This would cause multiple threads to modify shared integers
like ctx->callback_count++, ctx->alloc_ok++, and ctx->alloc_fail++ without
atomics or locks, potentially leading to lost updates and KCSAN warnings.
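
A minimal user-space sketch of race-free counters, using C11 atomics.
This is an assumption about one possible fix, not what the series does:
struct nolock_stats and record_attempt() are hypothetical, and kernel
code would more likely use atomic_t with atomic_inc() than
<stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counterpart of the shared test context counters. */
struct nolock_stats {
	atomic_int callback_count;
	atomic_int alloc_ok;
	atomic_int alloc_fail;
};

/* Record one allocation attempt; safe to call from several threads. */
static void record_attempt(struct nolock_stats *st, bool ok)
{
	if (ok)
		atomic_fetch_add(&st->alloc_ok, 1);
	else
		atomic_fetch_add(&st->alloc_fail, 1);
	atomic_fetch_add(&st->callback_count, 1);
}
```

Each increment is a single atomic read-modify-write, so concurrent
callers cannot lose updates, although the three counters are still not
updated as one unit.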

> +	return 0;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260615-kfree_rcu_nolock-v3-0-70a54f3775bb@kernel.org?part=2


* Re: [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
                   ` (8 preceding siblings ...)
  2026-06-15 11:06 ` [PATCH for-next v3 9/9] slub_kunit: extend the test for kfree_rcu_nolock() Harry Yoo (Oracle)
@ 2026-06-15 11:43 ` Harry Yoo
  2026-06-15 20:28 ` Alexei Starovoitov
  10 siblings, 0 replies; 23+ messages in thread
From: Harry Yoo @ 2026-06-15 11:43 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov,
	Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf



I'm now investigating and fixing two pre-existing kmalloc_nolock() bugs
and (hopefully) plan to post the fixes later this week.

I will later rebase this series onto Vlastimil's slab_alloc_flags v3
and the kmalloc_nolock() fixes, and address the comments from human
reviewers and sashiko.

But this should be enough for review.

Thanks!

On 6/15/26 8:05 PM, Harry Yoo (Oracle) wrote:
> Not the best time to post a series, but didn't want to delay posting
> the series for too long. no pressures ;)  This is aimed to be queued
> for review and testing after the merge window closes.
> 
> This series is based on next-20260612, and is also available on
> git.kernel.org [3].
> 
> To RCU folks: It would be great if you could kindly take a quick look at
> patch 4 and either ack or nack the patch ;)
> 
> To BPF folks: Ulad asked to share workloads to measure performance
> of kfree_rcu_nolock(). Unfortunately, I focused more on correctness
> and have not spent much effort on that. It would be nice if BPF folks
> could help evaluate it on their relevant workloads.
> 
> To PREEMPT_RT folks: The most relevant part is allowing
> kfree_rcu_sheaf() on PREEMPT_RT (patch 6). It carefully avoids sleeping
> by acquiring the locks via local_trylock() or spin_trylock_irqsave()
> to avoid sleeping within a raw spinlock. When trylock or unlock is
> unsafe, kmalloc_nolock() always fails.
> 
> Changes since RFC v2
> ====================
> 
> Reduced complexity and intrusiveness (Uladzislau Rezki)
> -------------------------------------------------------
> 
> While discussing concerns about the complexity of adding allow_spin
> handling with Ulad (Thanks!), I realized that adding complexity to the
> kvfree_rcu batching is not strictly necessary: only slab objects need to
> be batched, they are already batched by rcu sheaves, and slab already
> supports unknown context. So it is enough to implement only a minimal
> fallback for the sheaves path.
> 
> I tried to avoid making intrusive changes to the existing kvfree_rcu
> path as much as possible. struct rcu_ptr is renamed to kfree_rcu_head
> following Vlastimil's suggestion, and it is used only in the
> kfree_rcu_nolock() path for now.
> 
> As a result, the complexity is significantly reduced and the series
> became much less intrusive. This is also reflected well in the diffstat
> below.
> 
> RFC v2 diffstat:
>   8 files changed, 514 insertions(+), 163 deletions(-)
> 
> v3 diffstat:
>   6 files changed, 370 insertions(+), 105 deletions(-)
> 
> v3 diffstat (slub_kunit improvements - patch 1, 2, 9 excluded):
>   5 files changed, 199 insertions(+), 66 deletions(-)
> 
> kfree_rcu_sheaf() PREEMPT_RT support (Vlastimil Babka)
> ------------------------------------------------------
> 
> As suggested by Vlastimil (Thanks!), kfree_rcu_sheaf() can now be used
> on PREEMPT_RT as well, by always assuming allow_spin is false on
> PREEMPT_RT.
> 
> slub_kunit enhancements
> -----------------------
> 
> - Currently the test is skipped when there is no hardware PMU. This can
>   happen on machines without a PMU, or in virtualized environments
>   (e.g., automated testing or virtme). Implement a fallback based on SW
>   perf events so that the test can still run in such environments, even
>   though the coverage is slightly smaller.
> 
> - While testing on PREEMPT_RT, I found that kmalloc_nolock() fails every
>   time, so the fallback path is not properly tested. This is a limitation
>   of perf events: the handler is called in NMI (HW perf events) or
>   interrupt context (SW perf events), where kmalloc_nolock() cannot
>   succeed.
> 
>   slub_kunit now registers a kprobe pre-handler at the points in the slab
>   allocator where lockdep_assert_held() is invoked. The pre-handler calls
>   kmalloc_nolock() and friends, to improve coverage on PREEMPT_RT instead
>   of relying on perf events.
> 
> One thing that needs to be further explored
> -------------------------------------------
> 
> The global deferred_free_by_rcu (introduced by patch 8) list for the
> fallback should probably be per-CPU [5].
> 
> Actual Cover Letter
> ===================
> 
> This series improves kmalloc_nolock() and kfree_nolock() coverage
> in slub_kunit (patch 1 and 2) and introduces kfree_rcu_nolock() for
> an unknown context as suggested by Alexei Starovoitov.
> 
> Unknown context means the caller does not know whether spinning on a lock
> is safe (e.g., a BPF program attached to an arbitrary kernel function or
> in NMI context).
> 
> The slab allocator already supports unknown context via kmalloc_nolock()
> and kfree_nolock(), but the slab allocator does not support freeing
> objects by RCU in unknown context.
> 
> It is not ideal to have completely separate batching for unknown context
> because the worst scenario where spinning on a lock would lead to
> deadlock is very rare, and in most cases, it is safe to use the
> existing mechanism (kfree_rcu_sheaf()).
> 
> Since most of the slab allocator already supports unknown context
> and sheaves support batching kvfree_rcu() calls for slab objects,
> implement kfree_rcu_nolock() with minimal changes by teaching
> kfree_rcu_sheaf() how to support unknown context and making
> it a little bit harder to allocate an empty sheaf, instead of making
> intrusive changes to the existing kvfree_rcu batching logic.
> 
> kfree_rcu_nolock() tries to free the object to the rcu sheaf if
> trylock succeeds. Once the rcu sheaf becomes full, it is submitted to
> RCU via call_rcu() if spinning is allowed or IRQs are enabled (to avoid
> calling call_rcu() in the middle of call_rcu()). Otherwise, call_rcu()
> is deferred via irq work.
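
The submission rule described in the paragraph above can be written down
as a small predicate. This is only a userspace sketch of the decision as
the cover letter states it, not code from the series: `enum submit_path`
and `pick_submit_path()` are hypothetical names, and the two booleans
stand in for allow_spin and the current IRQ state.

```c
#include <stdbool.h>

/* Hypothetical names, used only for this illustration. */
enum submit_path {
	SUBMIT_CALL_RCU,	/* hand the full rcu sheaf to call_rcu() directly */
	SUBMIT_IRQ_WORK,	/* defer the call_rcu() invocation via irq work */
};

/*
 * Model of the rule from the cover letter: call_rcu() is used directly
 * when spinning is allowed or IRQs are enabled; otherwise the caller
 * might be in the middle of call_rcu() itself, so the submission is
 * deferred.
 */
static enum submit_path pick_submit_path(bool allow_spin, bool irqs_enabled)
{
	if (allow_spin || irqs_enabled)
		return SUBMIT_CALL_RCU;
	return SUBMIT_IRQ_WORK;
}
```

Only the combination "unknown context with IRQs disabled" takes the
deferred path in this model.
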
> 
> In unknown context, when there is no sheaf available, kfree_rcu_sheaf()
> falls back to defer_kfree_rcu(), which inserts the object to a global
> lockless list [5] and those objects are freed after synchronize_rcu() in
> a workqueue.
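
The trylock-then-fallback flow described above can be modelled in
userspace with C11 atomics. This is a rough sketch under stated
assumptions, not the slab implementation: `struct sheaf_model`,
`free_rcu_nolock_model()`, `deferred_push()` and `deferred_take_all()`
are hypothetical names, an atomic_flag stands in for the trylock, and a
full sheaf is treated the same as "no sheaf available" to keep the model
small. The deferred list mirrors the idea of a global lockless list that
a worker drains later.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object with an embedded link for the deferred list. */
struct obj {
	struct obj *next;
	int id;
};

#define SHEAF_CAP 4

/* Hypothetical stand-in for an rcu sheaf protected by a trylock. */
struct sheaf_model {
	atomic_flag lock;
	size_t size;
	struct obj *objs[SHEAF_CAP];
};

/* Global lockless list used as the last-resort fallback. */
static _Atomic(struct obj *) deferred_head;

/* Lock-free push: retry the compare-and-swap until the head is updated. */
static void deferred_push(struct obj *o)
{
	struct obj *old;

	assert(o != NULL);
	old = atomic_load_explicit(&deferred_head, memory_order_relaxed);
	do {
		o->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&deferred_head, &old, o,
							memory_order_release,
							memory_order_relaxed));
}

/* Detach the whole list at once, as a worker would before freeing it. */
static struct obj *deferred_take_all(void)
{
	return atomic_exchange_explicit(&deferred_head, NULL,
					memory_order_acquire);
}

/*
 * Returns true if the object was batched in the sheaf, false if it had
 * to go to the deferred list. The function never spins on the lock.
 */
static bool free_rcu_nolock_model(struct sheaf_model *s, struct obj *o)
{
	if (!atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire)) {
		bool queued = false;

		if (s->size < SHEAF_CAP) {
			s->objs[s->size++] = o;
			queued = true;
		}
		atomic_flag_clear_explicit(&s->lock, memory_order_release);
		if (queued)
			return true;
	}
	/* Trylock failed or no room: fall back without taking any lock. */
	deferred_push(o);
	return false;
}
```

In this model the caller never blocks: it either wins the trylock and
batches the object, or pushes it to the lockless list and returns.
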
> 
> Unlike kfree_rcu(), only the 2-argument variant is supported.
> This is because the last resort of the 1-arg variant is
> synchronize_rcu(), which cannot be used in an unknown context.
> 
> As suggested by Alexei Starovoitov, kfree_rcu_nolock() can be used with
> struct kfree_rcu_head (8 bytes), which is smaller than struct rcu_head
> (16 bytes).
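
To make the size difference concrete, here is a userspace sketch.
`struct rcu_head_model` copies the two-pointer shape of the kernel's
struct rcu_head (a next pointer plus a callback pointer). The layout of
`struct kfree_rcu_head_model` is an assumption made for illustration (a
single link pointer, which matches the 8 bytes quoted above on 64-bit);
the real definition is in patch 8. The element types and
`bytes_saved_per_object()` are hypothetical.

```c
#include <stddef.h>

/* Same shape as the kernel's struct rcu_head: two pointers. */
struct rcu_head_model {
	struct rcu_head_model *next;
	void (*func)(struct rcu_head_model *head);
};

/* ASSUMPTION: a single link pointer; see patch 8 for the real layout. */
struct kfree_rcu_head_model {
	struct kfree_rcu_head_model *next;
};

/* Two hypothetical element types that differ only in the embedded head. */
struct elem_with_rcu_head {
	long key;
	struct rcu_head_model rcu;
};

struct elem_with_kfree_rcu_head {
	long key;
	struct kfree_rcu_head_model rcu;
};

/* Per-object saving from embedding the smaller head. */
static size_t bytes_saved_per_object(void)
{
	return sizeof(struct elem_with_rcu_head) -
	       sizeof(struct elem_with_kfree_rcu_head);
}
```

On a typical 64-bit build this model saves one pointer (8 bytes) per
object, which adds up for small, frequently allocated objects.
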
> 
> For more background and future plans, please see [4].
> 
> [1] RFC v1: https://lore.kernel.org/linux-mm/20260206093410.160622-1-harry.yoo@oracle.com
> 
> [2] RFC v2: https://lore.kernel.org/linux-mm/20260416091022.36823-1-harry@kernel.org
> 
> [3] https://git.kernel.org/pub/scm/linux/kernel/git/harry/linux.git/log/?h=kfree_rcu_nolock-v3r3
> 
> [4] kmalloc_nolock() follow-ups, including kfree_rcu_nolock(),
>     https://lore.kernel.org/linux-mm/esepccfhqg7m6jo76ns2znj2cnuaepx2xvw5zaygtwohq4psma@563ypprp6rr3
> 
> [5] However, we should probably make the list percpu because,
>     unlike RFC v2, it can be triggered more frequently under memory
>     pressure.
> 
>     https://lore.kernel.org/linux-mm/805c33d7-3a7b-470c-bd9d-065717a3e3e2@paulmck-laptop
> 
> Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
> ---
> Harry Yoo (Oracle) (9):
>       slub_kunit: fall back to SW perf events when HW PMU is not available
>       mm/slab, slub_kunit: register kprobe to trigger _nolock APIs
>       mm/slab: handle the !allow_spin case in kfree_rcu_sheaf()
>       mm/slab: use call_rcu() in unknown context if irqs are enabled
>       mm/slab: extend deferred free mechanism to handle rcu sheaves
>       mm/slab: allow kfree_rcu_sheaf() on PREEMPT_RT
>       mm/slab: introduce kfree_rcu_nolock()
>       mm/slab: introduce struct kfree_rcu_head and use in kfree_rcu_nolock()
>       slub_kunit: extend the test for kfree_rcu_nolock()
> 
>  include/linux/rcupdate.h |  12 +++
>  include/linux/types.h    |   4 +
>  lib/tests/slub_kunit.c   | 174 ++++++++++++++++++++++++++++------
>  mm/slab.h                |   5 +-
>  mm/slab_common.c         |  38 ++++++--
>  mm/slub.c                | 242 ++++++++++++++++++++++++++++++++++-------------
>  6 files changed, 370 insertions(+), 105 deletions(-)
> ---
> base-commit: c425609d6ac4012c8bbf01ec2e10e801b1923a7b
> change-id: 20260615-kfree_rcu_nolock-e5502555992f
> 
> Best regards,

-- 
Cheers,
Harry / Hyeonggon



* Re: [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available
  2026-06-15 11:05 ` [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available Harry Yoo (Oracle)
  2026-06-15 11:14   ` sashiko-bot
@ 2026-06-15 12:58   ` Harry Yoo
  2026-06-15 20:01   ` Alexei Starovoitov
  2 siblings, 0 replies; 23+ messages in thread
From: Harry Yoo @ 2026-06-15 12:58 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Hao Li, Christoph Lameter,
	David Rientjes, Roman Gushchin, Alexei Starovoitov,
	Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf





On 6/15/26 8:05 PM, Harry Yoo (Oracle) wrote:
> slub_kunit uses hardware perf events to invoke _nolock() APIs from NMI
> context. However, creating a HW perf event fails when there is no
> hardware PMU, or when running in a virtualized environment without a
> virtual PMU configured.
> 
> Since tests are often performed in virtualized environments, fall back
> to a SW perf event so that the test can still run, even if it cannot
> cover IRQ-disabled sections.
> 
> For the sheaves path, IRQs are not disabled until the barn is accessed.
> So some coverage is still expected.
> 
> Signed-off-by: Harry Yoo (Oracle) <harry@kernel.org>
> ---
>  lib/tests/slub_kunit.c | 70 +++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 55 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/tests/slub_kunit.c b/lib/tests/slub_kunit.c
> index fa6d31dbca16..11255fc8eb78 100644
> --- a/lib/tests/slub_kunit.c
> +++ b/lib/tests/slub_kunit.c
> @@ -303,6 +303,7 @@ struct test_nolock_context {
>  	int alloc_ok;
>  	int alloc_fail;
>  	struct perf_event *event;
> +	bool is_perf_type_hw;
>  };

nit: should have checked struct perf_event.type instead of introducing
a new field. Will fix.

-- 
Cheers,
Harry / Hyeonggon




* Re: [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available
  2026-06-15 11:05 ` [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available Harry Yoo (Oracle)
  2026-06-15 11:14   ` sashiko-bot
  2026-06-15 12:58   ` Harry Yoo
@ 2026-06-15 20:01   ` Alexei Starovoitov
  2 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2026-06-15 20:01 UTC (permalink / raw)
  To: Harry Yoo (Oracle), Vlastimil Babka, Andrew Morton, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Alexei Starovoitov, Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf

On Mon Jun 15, 2026 at 4:05 AM PDT, Harry Yoo (Oracle) wrote:
> slub_kunit uses hardware perf events to invoke _nolock() APIs from NMI
> context. However, creating a HW perf event fails when there is no
> hardware PMU, or when running in a virtualized environment without a
> virtual PMU configured.
>
> Since tests are often performed in virtualized environments, fall back
> to a SW perf event so that the test can still run, even if it cannot
> cover IRQ-disabled sections.

Frankly, I wouldn't bother with this fallback, since it completely misses
the IRQ-disabled sections, which are the ones to test with *_nolock().
VMs need to be configured properly. Even nested VMs support
PERF_COUNT_HW_CPU_CYCLES.

Also, at the end of overflow_handler_test_nolock()
I would add
event->hw.interrupts = 0;
to make sure it actually stress tests things.
Without it, NMIs do not fire quickly enough.



* Re: [PATCH for-next v3 2/9] mm/slab, slub_kunit: register kprobe to trigger _nolock APIs
  2026-06-15 11:05 ` [PATCH for-next v3 2/9] mm/slab, slub_kunit: register kprobe to trigger _nolock APIs Harry Yoo (Oracle)
  2026-06-15 11:25   ` sashiko-bot
@ 2026-06-15 20:04   ` Alexei Starovoitov
  1 sibling, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2026-06-15 20:04 UTC (permalink / raw)
  To: Harry Yoo (Oracle), Vlastimil Babka, Andrew Morton, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Alexei Starovoitov, Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf

On Mon Jun 15, 2026 at 4:05 AM PDT, Harry Yoo (Oracle) wrote:
>  
> +/*
> + * A no-op function used to attach kprobe handlers in slub_kunit tests.
> + * The barrier is needed to prevent the compiler from optimizing out callsites.
> + */
> +#if defined(CONFIG_DEBUG_VM) || defined(CONFIG_PROVE_LOCKING)
> +static noinline void slab_attach_kprobe_locked(void)
> +{
> +	barrier();
> +}
> +#else
> +static inline void slab_attach_kprobe_locked(void) { }
> +#endif
> +
> +#define slab_lockdep_assert_held(lock) do {	\
> +	lockdep_assert_held(lock);		\
> +	slab_attach_kprobe_locked();	\
> +} while (0)

Great idea. That should cover all kinds of cases.


* Re: [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage
  2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
                   ` (9 preceding siblings ...)
  2026-06-15 11:43 ` [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo
@ 2026-06-15 20:28 ` Alexei Starovoitov
  10 siblings, 0 replies; 23+ messages in thread
From: Alexei Starovoitov @ 2026-06-15 20:28 UTC (permalink / raw)
  To: Harry Yoo (Oracle), Vlastimil Babka, Andrew Morton, Hao Li,
	Christoph Lameter, David Rientjes, Roman Gushchin,
	Alexei Starovoitov, Andrii Nakryiko, Puranjay Mohan, Amery Hung,
	Sebastian Andrzej Siewior, Clark Williams, Steven Rostedt,
	Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
	Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
	Mathieu Desnoyers, Lai Jiangshan, Zqiang, Pedro Falcato,
	Suren Baghdasaryan
  Cc: linux-mm, linux-kernel, linux-rt-devel, rcu, bpf

On Mon Jun 15, 2026 at 4:05 AM PDT, Harry Yoo (Oracle) wrote:
> Not the best time to post a series, but didn't want to delay posting
> the series for too long. no pressures ;)  This is aimed to be queued
> for review and testing after the merge window closes.
>
> This series is based on next-20260612, and is also available on
> git.kernel.org [3].
>
> To RCU folks: It would be great if you could kindly take a quick look at
> patch 4 and either ack or nack the patch ;)
>
> To BPF folks: Ulad asked to share workloads to measure performance
> of kfree_rcu_nolock(). Unfortunately, I focused more on correctness
> and have not spent much effort on that. It would be nice if BPF folks
> could help evaluate it on their relevant workloads.

kfree_rcu_nolock() needs to replace bpf_mem_alloc, which is the backbone
of bpf maps and bpf local storage.
So all of the selftests/bpf/benchs/run_bench_*.sh
will exercise it one way or the other once the replacement is complete.
In other words, performance is absolutely critical.

>
> To PREEMPT_RT folks: The most relevant part is allowing
> kfree_rcu_sheaf() on PREEMPT_RT (patch 6). It carefully avoids sleeping
> by acquiring the locks via local_trylock() or spin_trylock_irqsave()
> to avoid sleeping within a raw spinlock. When trylock or unlock is
> unsafe, kmalloc_nolock() always fails.
>
> Changes since RFC v2
> ====================
>
> Reduced complexity and intrusiveness (Uladzislau Rezki)
> -------------------------------------------------------
>
> While discussing concerns about the complexity of adding allow_spin
> handling with Ulad (Thanks!), I realized that adding complexity to the
> kvfree_rcu batching is not strictly necessary: only slab objects need to
> be batched, they are already batched by rcu sheaves, and slab already
> supports unknown context. So it is enough to implement only a minimal
> fallback for the sheaves path.
>
> I tried to avoid making intrusive changes to the existing kvfree_rcu
> path as much as possible. struct rcu_ptr is renamed to kfree_rcu_head
> following Vlastimil's suggestion, and it is used only in the
> kfree_rcu_nolock() path for now.
>
> As a result, the complexity is significantly reduced and the series
> became much less intrusive. This is also reflected well in the diffstat
> below.

Overall looks good to me.
btw sashiko was confused in a few cases.
Not everything that it flags needs a fix. Sometimes it's not an issue at all.
It only sounds like one.



end of thread, other threads:[~2026-06-15 20:28 UTC | newest]

Thread overview: 23+ messages
2026-06-15 11:05 [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo (Oracle)
2026-06-15 11:05 ` [PATCH for-next v3 1/9] slub_kunit: fall back to SW perf events when HW PMU is not available Harry Yoo (Oracle)
2026-06-15 11:14   ` sashiko-bot
2026-06-15 12:58   ` Harry Yoo
2026-06-15 20:01   ` Alexei Starovoitov
2026-06-15 11:05 ` [PATCH for-next v3 2/9] mm/slab, slub_kunit: register kprobe to trigger _nolock APIs Harry Yoo (Oracle)
2026-06-15 11:25   ` sashiko-bot
2026-06-15 20:04   ` Alexei Starovoitov
2026-06-15 11:05 ` [PATCH for-next v3 3/9] mm/slab: handle the !allow_spin case in kfree_rcu_sheaf() Harry Yoo (Oracle)
2026-06-15 11:24   ` sashiko-bot
2026-06-15 11:05 ` [PATCH for-next v3 4/9] mm/slab: use call_rcu() in unknown context if irqs are enabled Harry Yoo (Oracle)
2026-06-15 11:25   ` sashiko-bot
2026-06-15 11:05 ` [PATCH for-next v3 5/9] mm/slab: extend deferred free mechanism to handle rcu sheaves Harry Yoo (Oracle)
2026-06-15 11:24   ` sashiko-bot
2026-06-15 11:06 ` [PATCH for-next v3 6/9] mm/slab: allow kfree_rcu_sheaf() on PREEMPT_RT Harry Yoo (Oracle)
2026-06-15 11:19   ` sashiko-bot
2026-06-15 11:06 ` [PATCH for-next v3 7/9] mm/slab: introduce kfree_rcu_nolock() Harry Yoo (Oracle)
2026-06-15 11:22   ` sashiko-bot
2026-06-15 11:06 ` [PATCH for-next v3 8/9] mm/slab: introduce struct kfree_rcu_head and use in kfree_rcu_nolock() Harry Yoo (Oracle)
2026-06-15 11:22   ` sashiko-bot
2026-06-15 11:06 ` [PATCH for-next v3 9/9] slub_kunit: extend the test for kfree_rcu_nolock() Harry Yoo (Oracle)
2026-06-15 11:43 ` [PATCH for-next v3 0/9] mm/slab: introduce kfree_rcu_nolock() and improve slub_kunit coverage Harry Yoo
2026-06-15 20:28 ` Alexei Starovoitov
