* [PATCH v2 0/6] slab: Re-entrant kmalloc_nolock()
@ 2025-07-09 1:52 Alexei Starovoitov
2025-07-09 1:52 ` [PATCH v2 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
` (5 more replies)
0 siblings, 6 replies; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-09 1:52 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
v1->v2:
Added more comments for this non-trivial logic and addressed earlier comments.
In particular:
- Introduce alloc_frozen_pages_nolock() to avoid refcnt race
- alloc_pages_nolock() defaults to GFP_COMP
- Support SLUB_TINY
- Added more variants to the stress tester and discovered that kfree_nolock()
can OOM, because the deferred per-slab llist won't be serviced if
kfree_nolock() gets unlucky long enough. Scrapped the previous approach and
switched to a global per-cpu llist with immediate irq_work_queue() to
process all object sizes.
- Reentrant kmalloc cannot deactivate_slab(). In v1 the node hint was
downgraded to NUMA_NO_NODE before calling slab_alloc(). Realized that's not
good enough: there are odd cases that can trigger deactivation. Rewrote
this part.
- Struggled with SLAB_NO_CMPXCHG. Thankfully Harry had a great suggestion:
https://lore.kernel.org/bpf/aFvfr1KiNrLofavW@hyeyoo/
which was adopted. So slab_debug works now.
- In v1 I had to s/local_lock_irqsave/local_lock_irqsave_check/ in a bunch
of places in mm/slub.c to avoid lockdep false positives.
Came up with a much cleaner approach to silence invalid lockdep reports
without sacrificing lockdep coverage. See local_lock_lockdep_start/end().
v1:
https://lore.kernel.org/bpf/20250501032718.65476-1-alexei.starovoitov@gmail.com/
Alexei Starovoitov (6):
locking/local_lock: Expose dep_map in local_trylock_t.
locking/local_lock: Introduce local_lock_is_locked().
locking/local_lock: Introduce local_lock_lockdep_start/end()
mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
mm: Introduce alloc_frozen_pages_nolock()
slab: Introduce kmalloc_nolock() and kfree_nolock().
include/linux/gfp.h | 2 +-
include/linux/kasan.h | 13 +-
include/linux/local_lock.h | 17 ++
include/linux/local_lock_internal.h | 16 +-
include/linux/lockdep_types.h | 4 +-
include/linux/rtmutex.h | 9 +
include/linux/slab.h | 4 +
kernel/bpf/syscall.c | 2 +-
kernel/locking/lockdep.c | 4 +
kernel/locking/rtmutex_common.h | 9 -
mm/internal.h | 4 +
mm/kasan/common.c | 5 +-
mm/page_alloc.c | 54 +++--
mm/slub.c | 330 +++++++++++++++++++++++++---
14 files changed, 402 insertions(+), 71 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH v2 1/6] locking/local_lock: Expose dep_map in local_trylock_t.
2025-07-09 1:52 [PATCH v2 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
@ 2025-07-09 1:52 ` Alexei Starovoitov
2025-07-11 8:02 ` Sebastian Andrzej Siewior
2025-07-09 1:52 ` [PATCH v2 2/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
` (4 subsequent siblings)
5 siblings, 1 reply; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-09 1:52 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
The lockdep_is_held() macro assumes that "struct lockdep_map dep_map;"
is a top-level field of any lock that participates in lockdep.
Make it so for local_trylock_t.
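For reference, lockdep_is_held() boils down to dereferencing ->dep_map on
whatever lock it is given, roughly:
    #define lockdep_is_held(lock) lock_is_held(&(lock)->dep_map)
so the dep_map of local_trylock_t must be reachable the same way as it is
for local_lock_t.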
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/local_lock_internal.h | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8d5ac16a9b17..85c2e1b1af6b 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -17,7 +17,10 @@ typedef struct {
/* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
typedef struct {
- local_lock_t llock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+ struct task_struct *owner;
+#endif
u8 acquired;
} local_trylock_t;
@@ -31,7 +34,7 @@ typedef struct {
.owner = NULL,
# define LOCAL_TRYLOCK_DEBUG_INIT(lockname) \
- .llock = { LOCAL_LOCK_DEBUG_INIT((lockname).llock) },
+ LOCAL_LOCK_DEBUG_INIT(lockname)
static inline void local_lock_acquire(local_lock_t *l)
{
@@ -81,7 +84,7 @@ do { \
local_lock_debug_init(lock); \
} while (0)
-#define __local_trylock_init(lock) __local_lock_init(lock.llock)
+#define __local_trylock_init(lock) __local_lock_init((local_lock_t *)lock)
#define __spinlock_nested_bh_init(lock) \
do { \
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 2/6] locking/local_lock: Introduce local_lock_is_locked().
2025-07-09 1:52 [PATCH v2 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
2025-07-09 1:52 ` [PATCH v2 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
@ 2025-07-09 1:52 ` Alexei Starovoitov
2025-07-11 7:52 ` Sebastian Andrzej Siewior
2025-07-09 1:53 ` [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end() Alexei Starovoitov
` (3 subsequent siblings)
5 siblings, 1 reply; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-09 1:52 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
Introduce local_lock_is_locked() that returns true when the given
local_lock is locked by the current cpu (in !PREEMPT_RT) or
by the current task (in PREEMPT_RT).
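A minimal usage sketch (the real user is the slab code later in this
series; preemption or migration must be disabled by the caller):
    if (local_lock_is_locked(&s->cpu_slab->lock)) {
            /*
             * This cpu (or this task on PREEMPT_RT) already holds the
             * lock, so re-entering the lock-protected path is not safe.
             */
            return ERR_PTR(-EBUSY);
    }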
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/local_lock.h | 2 ++
include/linux/local_lock_internal.h | 7 +++++++
include/linux/rtmutex.h | 9 +++++++++
kernel/locking/rtmutex_common.h | 9 ---------
4 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 16a2ee4f8310..092ce89b162a 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -66,6 +66,8 @@
*/
#define local_trylock(lock) __local_trylock(lock)
+#define local_lock_is_locked(lock) __local_lock_is_locked(lock)
+
/**
* local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
* interrupts if acquired
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 85c2e1b1af6b..db61409f040c 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -165,6 +165,9 @@ do { \
!!tl; \
})
+/* preemption or migration must be disabled before calling __local_lock_is_locked */
+#define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
+
#define __local_lock_release(lock) \
do { \
local_trylock_t *tl; \
@@ -285,4 +288,8 @@ do { \
__local_trylock(lock); \
})
+/* migration must be disabled before calling __local_lock_is_locked */
+#define __local_lock_is_locked(__lock) \
+ (rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)
+
#endif /* CONFIG_PREEMPT_RT */
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 7d049883a08a..a8f2fe154487 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -44,6 +44,15 @@ static inline bool rt_mutex_base_is_locked(struct rt_mutex_base *lock)
return READ_ONCE(lock->owner) != NULL;
}
+#define RT_MUTEX_HAS_WAITERS 1UL
+
+static inline struct task_struct *rt_mutex_owner(struct rt_mutex_base *lock)
+{
+ unsigned long owner = (unsigned long) READ_ONCE(lock->owner);
+
+ return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
+}
+
extern void rt_mutex_base_init(struct rt_mutex_base *rtb);
/**
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 78dd3d8c6554..cf6ddd1b23a2 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -153,15 +153,6 @@ static inline struct rt_mutex_waiter *task_top_pi_waiter(struct task_struct *p)
pi_tree.entry);
}
-#define RT_MUTEX_HAS_WAITERS 1UL
-
-static inline struct task_struct *rt_mutex_owner(struct rt_mutex_base *lock)
-{
- unsigned long owner = (unsigned long) READ_ONCE(lock->owner);
-
- return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
-}
-
/*
* Constants for rt mutex functions which have a selectable deadlock
* detection.
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-09 1:52 [PATCH v2 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
2025-07-09 1:52 ` [PATCH v2 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
2025-07-09 1:52 ` [PATCH v2 2/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
@ 2025-07-09 1:53 ` Alexei Starovoitov
2025-07-11 7:50 ` Sebastian Andrzej Siewior
2025-07-09 1:53 ` [PATCH v2 4/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
` (2 subsequent siblings)
5 siblings, 1 reply; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-09 1:53 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
Introduce the local_lock_lockdep_start/end() pair to teach lockdep
about a region of execution where the per-cpu local_lock is known not
to be taken, so lockdep should treat such local_lock() calls as
"trylock" to avoid multiple false positives:
- lockdep doesn't like when the same lock is taken in normal and
in NMI context
- lockdep cannot recognize that the local_locks protecting different
kmalloc buckets are different local_locks and are not taken together
This pair of lockdep aids is used by slab in the following way:
if (local_lock_is_locked(&s->cpu_slab->lock))
goto out;
local_lock_lockdep_start(&s->cpu_slab->lock);
p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
local_lock_lockdep_end(&s->cpu_slab->lock);
where ___slab_alloc() calls
local_lock_irqsave(&s->cpu_slab->lock, ...) many times,
and none of them can deadlock since this lock is not taken.
In other words, this lockdep aid avoids the following false positives:
page_alloc_kthr/1965 is trying to acquire lock:
ffff8881f6ebe0f0 ((local_lock_t *)&c->lock){-.-.}-{3:3}, at: ___slab_alloc+0x9a9/0x1ab0
but task is already holding lock:
ffff8881f6ebd490 ((local_lock_t *)&c->lock){-.-.}-{3:3}, at: ___slab_alloc+0xc7/0x1ab0
other info that might help us debug this:
 Possible unsafe locking scenario:
       CPU0
       ----
  lock((local_lock_t *)&c->lock);
  lock((local_lock_t *)&c->lock);
and
inconsistent {INITIAL USE} -> {IN-NMI} usage.
       CPU0
       ----
  lock(per_cpu_ptr(&lock));
  <Interrupt>
    lock(per_cpu_ptr(&lock));
Note that this lockdep aid does _not_ reduce lockdep coverage for local_locks.
If the code guarded by local_lock_lockdep_start/end() attempts to
deadlock via local_lock(&lockA); local_lock(&lockA); the
__local_lock_acquire() logic will yell with lockdep_assert().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/local_lock.h | 15 +++++++++++++++
include/linux/lockdep_types.h | 4 +++-
kernel/locking/lockdep.c | 4 ++++
3 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 092ce89b162a..04de6ae9e5f0 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -81,6 +81,21 @@
#define local_trylock_irqsave(lock, flags) \
__local_trylock_irqsave(lock, flags)
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#define local_lock_lockdep_start(lock) \
+ do { \
+ lockdep_assert(!__local_lock_is_locked(lock)); \
+ this_cpu_ptr(lock)->dep_map.flags = LOCAL_LOCK_UNLOCKED;\
+ } while (0)
+
+#define local_lock_lockdep_end(lock) \
+ do { this_cpu_ptr(lock)->dep_map.flags = 0; } while (0)
+
+#else
+#define local_lock_lockdep_start(lock) /**/
+#define local_lock_lockdep_end(lock) /**/
+#endif
+
DEFINE_GUARD(local_lock, local_lock_t __percpu*,
local_lock(_T),
local_unlock(_T))
diff --git a/include/linux/lockdep_types.h b/include/linux/lockdep_types.h
index 9f361d3ab9d9..6c580081ace3 100644
--- a/include/linux/lockdep_types.h
+++ b/include/linux/lockdep_types.h
@@ -190,13 +190,15 @@ struct lockdep_map {
u8 wait_type_outer; /* can be taken in this context */
u8 wait_type_inner; /* presents this context */
u8 lock_type;
- /* u8 hole; */
+ u8 flags;
#ifdef CONFIG_LOCK_STAT
int cpu;
unsigned long ip;
#endif
};
+#define LOCAL_LOCK_UNLOCKED 1
+
struct pin_cookie { unsigned int val; };
#define MAX_LOCKDEP_KEYS_BITS 13
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index dd2bbf73718b..461f70f4ca28 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -4963,6 +4963,7 @@ void lockdep_init_map_type(struct lockdep_map *lock, const char *name,
lock->wait_type_outer = outer;
lock->wait_type_inner = inner;
lock->lock_type = lock_type;
+ lock->flags = 0;
/*
* No key, no joy, we need to hash something.
@@ -5844,6 +5845,9 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
*/
kasan_check_byte(lock);
+ if (unlikely(lock->flags == LOCAL_LOCK_UNLOCKED))
+ trylock = 1;
+
if (unlikely(!lockdep_enabled())) {
/* XXX allow trylock from NMI ?!? */
if (lockdep_nmi() && !trylock) {
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 4/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
2025-07-09 1:52 [PATCH v2 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
` (2 preceding siblings ...)
2025-07-09 1:53 ` [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end() Alexei Starovoitov
@ 2025-07-09 1:53 ` Alexei Starovoitov
2025-07-09 14:20 ` Vlastimil Babka
2025-07-09 1:53 ` [PATCH v2 5/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
2025-07-09 1:53 ` [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
5 siblings, 1 reply; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-09 1:53 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
Change alloc_pages_nolock() to default to __GFP_COMP when allocating
pages, since the upcoming reentrant alloc_slab_page() needs __GFP_COMP.
Also allow the __GFP_ACCOUNT flag to be specified,
since the BPF infra needs __GFP_ACCOUNT.
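For illustration, a sketch of call sites after this change (the bpf hunk
below is the real user):
    /* accounted allocation, safe from any context */
    page = alloc_pages_nolock(__GFP_ACCOUNT, nid, 0);
    /* unaccounted allocation: pass 0 for gfp_flags */
    page = alloc_pages_nolock(0, nid, 0);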
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/gfp.h | 2 +-
kernel/bpf/syscall.c | 2 +-
mm/page_alloc.c | 10 ++++++----
3 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5ebf26fcdcfa..0ceb4e09306c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -354,7 +354,7 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
}
#define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
-struct page *alloc_pages_nolock_noprof(int nid, unsigned int order);
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order);
#define alloc_pages_nolock(...) alloc_hooks(alloc_pages_nolock_noprof(__VA_ARGS__))
extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index dd5304c6ac3c..eb9b6c4c10e9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -579,7 +579,7 @@ static bool can_alloc_pages(void)
static struct page *__bpf_alloc_page(int nid)
{
if (!can_alloc_pages())
- return alloc_pages_nolock(nid, 0);
+ return alloc_pages_nolock(__GFP_ACCOUNT, nid, 0);
return alloc_pages_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78ddf1d43c6c..148945f0b667 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7506,6 +7506,7 @@ static bool __free_unaccepted(struct page *page)
/**
* alloc_pages_nolock - opportunistic reentrant allocation from any context
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
* @nid: node to allocate from
* @order: allocation order size
*
@@ -7519,7 +7520,7 @@ static bool __free_unaccepted(struct page *page)
* Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
* It means ENOMEM. There is no reason to call it again and expect !NULL.
*/
-struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
{
/*
* Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
@@ -7541,12 +7542,13 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
* specify it here to highlight that alloc_pages_nolock()
* doesn't want to deplete reserves.
*/
- gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
- | __GFP_ACCOUNT;
+ gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
+ | gfp_flags;
unsigned int alloc_flags = ALLOC_TRYLOCK;
struct alloc_context ac = { };
struct page *page;
+ VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
/*
* In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
* unsafe in NMI. If spin_trylock() is called from hard IRQ the current
@@ -7584,7 +7586,7 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
if (page)
set_page_refcounted(page);
- if (memcg_kmem_online() && page &&
+ if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) &&
unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
free_pages_nolock(page, order);
page = NULL;
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 5/6] mm: Introduce alloc_frozen_pages_nolock()
2025-07-09 1:52 [PATCH v2 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
` (3 preceding siblings ...)
2025-07-09 1:53 ` [PATCH v2 4/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
@ 2025-07-09 1:53 ` Alexei Starovoitov
2025-07-09 14:21 ` Vlastimil Babka
2025-07-09 1:53 ` [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
5 siblings, 1 reply; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-09 1:53 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock()
to be used by alloc_slab_page().
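The frozen variant is for callers that manage the page refcount
themselves; a sketch of how the reentrant alloc_slab_page() in the next
patch uses it:
    folio = (struct folio *)alloc_frozen_pages_nolock(0 /* __GFP_COMP is implied */,
                                                      node, order);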
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
mm/internal.h | 4 ++++
mm/page_alloc.c | 48 +++++++++++++++++++++++++++---------------------
2 files changed, 31 insertions(+), 21 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 5b0f71e5434b..ea85cf703331 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -837,6 +837,10 @@ static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int ord
#define alloc_frozen_pages(...) \
alloc_hooks(alloc_frozen_pages_noprof(__VA_ARGS__))
+struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order);
+#define alloc_frozen_pages_nolock(...) \
+ alloc_hooks(alloc_frozen_pages_nolock_noprof(__VA_ARGS__))
+
extern void zone_pcp_reset(struct zone *zone);
extern void zone_pcp_disable(struct zone *zone);
extern void zone_pcp_enable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 148945f0b667..11a184bab03c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7504,23 +7504,7 @@ static bool __free_unaccepted(struct page *page)
#endif /* CONFIG_UNACCEPTED_MEMORY */
-/**
- * alloc_pages_nolock - opportunistic reentrant allocation from any context
- * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
- * @nid: node to allocate from
- * @order: allocation order size
- *
- * Allocates pages of a given order from the given node. This is safe to
- * call from any context (from atomic, NMI, and also reentrant
- * allocator -> tracepoint -> alloc_pages_nolock_noprof).
- * Allocation is best effort and to be expected to fail easily so nobody should
- * rely on the success. Failures are not reported via warn_alloc().
- * See always fail conditions below.
- *
- * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
- * It means ENOMEM. There is no reason to call it again and expect !NULL.
- */
-struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
+struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
{
/*
* Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
@@ -7583,16 +7567,38 @@ struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int or
/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
- if (page)
- set_page_refcounted(page);
-
if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) &&
unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
- free_pages_nolock(page, order);
+ __free_frozen_pages(page, order, FPI_TRYLOCK);
page = NULL;
}
trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
kmsan_alloc_page(page, order, alloc_gfp);
return page;
}
+/**
+ * alloc_pages_nolock - opportunistic reentrant allocation from any context
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
+ * @nid: node to allocate from
+ * @order: allocation order size
+ *
+ * Allocates pages of a given order from the given node. This is safe to
+ * call from any context (from atomic, NMI, and also reentrant
+ * allocator -> tracepoint -> alloc_pages_nolock_noprof).
+ * Allocation is best effort and to be expected to fail easily so nobody should
+ * rely on the success. Failures are not reported via warn_alloc().
+ * See always fail conditions below.
+ *
+ * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
+ * It means ENOMEM. There is no reason to call it again and expect !NULL.
+ */
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
+{
+ struct page *page;
+
+ page = alloc_frozen_pages_nolock_noprof(gfp_flags, nid, order);
+ if (page)
+ set_page_refcounted(page);
+ return page;
+}
EXPORT_SYMBOL_GPL(alloc_pages_nolock_noprof);
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-09 1:52 [PATCH v2 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
` (4 preceding siblings ...)
2025-07-09 1:53 ` [PATCH v2 5/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
@ 2025-07-09 1:53 ` Alexei Starovoitov
2025-07-10 9:36 ` Vlastimil Babka
` (2 more replies)
5 siblings, 3 replies; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-09 1:53 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
kmalloc_nolock() relies on the ability of local_lock to detect the
situation when it's locked.
In !PREEMPT_RT local_lock_is_locked() is true only when an NMI happened
in the irq-saved region that protects _that specific_ per-cpu
kmem_cache_cpu. In that case retry the operation in a different kmalloc
bucket. The second attempt will likely succeed, since this cpu locked a
different kmem_cache_cpu.
Similarly, in PREEMPT_RT local_lock_is_locked() returns true when the
per-cpu rt_spin_lock is locked by the current task. In this case
re-entrance into the same kmalloc bucket is unsafe, and kmalloc_nolock()
tries a different bucket that is most likely not locked by the current
task. Though it may be locked by a different task, it's safe to
rt_spin_lock() on it.
Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
immediately if called from hard irq or NMI in PREEMPT_RT.
kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
and in_nmi(), or in PREEMPT_RT.
The SLUB_TINY config doesn't use local_lock_is_locked() and relies on
spin_trylock_irqsave(&n->list_lock) to allocate, while kfree_nolock()
always defers to irq_work.
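The resulting allocation flow, as a simplified sketch of the code added
below (error handling, SLUB_TINY and post-alloc hooks omitted):
    retry:
        s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
        ret = ERR_PTR(-EBUSY);
        if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
                ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
        if (PTR_ERR(ret) == -EBUSY && can_retry) {
                size = s->object_size + 1; /* pick the next kmalloc bucket */
                can_retry = false;
                goto retry;
        }
        return IS_ERR(ret) ? NULL : ret;
kfree_nolock() similarly either frees via the lockless fast path or,
when the per-cpu lock is busy (or on PREEMPT_RT/SLUB_TINY), pushes the
object onto a per-cpu llist that is drained from irq_work.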
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/kasan.h | 13 +-
include/linux/slab.h | 4 +
mm/kasan/common.c | 5 +-
mm/slub.c | 330 ++++++++++++++++++++++++++++++++++++++----
4 files changed, 319 insertions(+), 33 deletions(-)
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 890011071f2b..acdc8cb0152e 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -200,7 +200,7 @@ static __always_inline bool kasan_slab_pre_free(struct kmem_cache *s,
}
bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
- bool still_accessible);
+ bool still_accessible, bool no_quarantine);
/**
* kasan_slab_free - Poison, initialize, and quarantine a slab object.
* @object: Object to be freed.
@@ -226,11 +226,13 @@ bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
* @Return true if KASAN took ownership of the object; false otherwise.
*/
static __always_inline bool kasan_slab_free(struct kmem_cache *s,
- void *object, bool init,
- bool still_accessible)
+ void *object, bool init,
+ bool still_accessible,
+ bool no_quarantine)
{
if (kasan_enabled())
- return __kasan_slab_free(s, object, init, still_accessible);
+ return __kasan_slab_free(s, object, init, still_accessible,
+ no_quarantine);
return false;
}
@@ -427,7 +429,8 @@ static inline bool kasan_slab_pre_free(struct kmem_cache *s, void *object)
}
static inline bool kasan_slab_free(struct kmem_cache *s, void *object,
- bool init, bool still_accessible)
+ bool init, bool still_accessible,
+ bool no_quarantine)
{
return false;
}
diff --git a/include/linux/slab.h b/include/linux/slab.h
index d5a8ab98035c..743f6d196d57 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -470,6 +470,7 @@ void * __must_check krealloc_noprof(const void *objp, size_t new_size,
#define krealloc(...) alloc_hooks(krealloc_noprof(__VA_ARGS__))
void kfree(const void *objp);
+void kfree_nolock(const void *objp);
void kfree_sensitive(const void *objp);
size_t __ksize(const void *objp);
@@ -910,6 +911,9 @@ static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t f
}
#define kmalloc(...) alloc_hooks(kmalloc_noprof(__VA_ARGS__))
+void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node);
+#define kmalloc_nolock(...) alloc_hooks(kmalloc_nolock_noprof(__VA_ARGS__))
+
#define kmem_buckets_alloc(_b, _size, _flags) \
alloc_hooks(__kmalloc_node_noprof(PASS_BUCKET_PARAMS(_size, _b), _flags, NUMA_NO_NODE))
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index ed4873e18c75..67042e07baee 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -256,13 +256,16 @@ bool __kasan_slab_pre_free(struct kmem_cache *cache, void *object,
}
bool __kasan_slab_free(struct kmem_cache *cache, void *object, bool init,
- bool still_accessible)
+ bool still_accessible, bool no_quarantine)
{
if (!kasan_arch_is_ready() || is_kfence_address(object))
return false;
poison_slab_object(cache, object, init, still_accessible);
+ if (no_quarantine)
+ return false;
+
/*
* If the object is put into quarantine, do not let slab put the object
* onto the freelist for now. The object's metadata is kept until the
diff --git a/mm/slub.c b/mm/slub.c
index c4b64821e680..f0844b44ee09 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -44,6 +44,7 @@
#include <kunit/test.h>
#include <kunit/test-bug.h>
#include <linux/sort.h>
+#include <linux/irq_work.h>
#include <linux/debugfs.h>
#include <trace/events/kmem.h>
@@ -393,7 +394,7 @@ struct kmem_cache_cpu {
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct slab *partial; /* Partially allocated slabs */
#endif
- local_lock_t lock; /* Protects the fields above */
+ local_trylock_t lock; /* Protects the fields above */
#ifdef CONFIG_SLUB_STATS
unsigned int stat[NR_SLUB_STAT_ITEMS];
#endif
@@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
gfp_t gfp, bool new_slab)
{
+ bool allow_spin = gfpflags_allow_spinning(gfp);
unsigned int objects = objs_per_slab(s, slab);
unsigned long new_exts;
unsigned long old_exts;
@@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
gfp &= ~OBJCGS_CLEAR_MASK;
/* Prevent recursive extension vector allocation */
gfp |= __GFP_NO_OBJ_EXT;
- vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
- slab_nid(slab));
+ if (unlikely(!allow_spin)) {
+ size_t sz = objects * sizeof(struct slabobj_ext);
+
+ vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
+ } else {
+ vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
+ slab_nid(slab));
+ }
if (!vec) {
/* Mark vectors which failed to allocate */
if (new_slab)
@@ -2021,7 +2029,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
* objcg vector should be reused.
*/
mark_objexts_empty(vec);
- kfree(vec);
+ if (unlikely(!allow_spin))
+ kfree_nolock(vec);
+ else
+ kfree(vec);
return 0;
}
@@ -2379,7 +2390,7 @@ bool slab_free_hook(struct kmem_cache *s, void *x, bool init,
}
/* KASAN might put x into memory quarantine, delaying its reuse. */
- return !kasan_slab_free(s, x, init, still_accessible);
+ return !kasan_slab_free(s, x, init, still_accessible, false);
}
static __fastpath_inline
@@ -2442,13 +2453,17 @@ static void *setup_object(struct kmem_cache *s, void *object)
* Slab allocation and freeing
*/
static inline struct slab *alloc_slab_page(gfp_t flags, int node,
- struct kmem_cache_order_objects oo)
+ struct kmem_cache_order_objects oo,
+ bool allow_spin)
{
struct folio *folio;
struct slab *slab;
unsigned int order = oo_order(oo);
- if (node == NUMA_NO_NODE)
+ if (unlikely(!allow_spin)) {
+ folio = (struct folio *)alloc_frozen_pages_nolock(0/* __GFP_COMP is implied */,
+ node, order);
+ } else if (node == NUMA_NO_NODE)
folio = (struct folio *)alloc_frozen_pages(flags, order);
else
folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
@@ -2598,6 +2613,7 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
+ bool allow_spin = gfpflags_allow_spinning(flags);
struct slab *slab;
struct kmem_cache_order_objects oo = s->oo;
gfp_t alloc_gfp;
@@ -2617,7 +2633,11 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_RECLAIM;
- slab = alloc_slab_page(alloc_gfp, node, oo);
+ /*
+ * __GFP_RECLAIM could be cleared on the first allocation attempt,
+ * so pass allow_spin flag directly.
+ */
+ slab = alloc_slab_page(alloc_gfp, node, oo, allow_spin);
if (unlikely(!slab)) {
oo = s->min;
alloc_gfp = flags;
@@ -2625,7 +2645,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
* Allocation may have failed due to fragmentation.
* Try a lower order alloc if possible
*/
- slab = alloc_slab_page(alloc_gfp, node, oo);
+ slab = alloc_slab_page(alloc_gfp, node, oo, allow_spin);
if (unlikely(!slab))
return NULL;
stat(s, ORDER_FALLBACK);
@@ -2803,8 +2823,8 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
* allocated slab. Allocate a single object instead of whole freelist
* and put the slab to the partial (or full) list.
*/
-static void *alloc_single_from_new_slab(struct kmem_cache *s,
- struct slab *slab, int orig_size)
+static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
+ int orig_size, gfp_t gfpflags)
{
int nid = slab_nid(slab);
struct kmem_cache_node *n = get_node(s, nid);
@@ -2824,7 +2844,10 @@ static void *alloc_single_from_new_slab(struct kmem_cache *s,
*/
return NULL;
- spin_lock_irqsave(&n->list_lock, flags);
+ if (gfpflags_allow_spinning(gfpflags))
+ spin_lock_irqsave(&n->list_lock, flags);
+ else if (!spin_trylock_irqsave(&n->list_lock, flags))
+ return NULL;
if (slab->inuse == slab->objects)
add_full(s, n, slab);
@@ -2865,7 +2888,10 @@ static struct slab *get_partial_node(struct kmem_cache *s,
if (!n || !n->nr_partial)
return NULL;
- spin_lock_irqsave(&n->list_lock, flags);
+ if (gfpflags_allow_spinning(pc->flags))
+ spin_lock_irqsave(&n->list_lock, flags);
+ else if (!spin_trylock_irqsave(&n->list_lock, flags))
+ return NULL;
list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
if (!pfmemalloc_match(slab, pc->flags))
continue;
@@ -3056,7 +3082,7 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
for_each_possible_cpu(cpu) {
c = per_cpu_ptr(s->cpu_slab, cpu);
- local_lock_init(&c->lock);
+ local_trylock_init(&c->lock);
c->tid = init_tid(cpu);
}
}
@@ -3690,6 +3716,7 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
{
+ bool allow_spin = gfpflags_allow_spinning(gfpflags);
void *freelist;
struct slab *slab;
unsigned long flags;
@@ -3717,7 +3744,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* same as above but node_match() being false already
* implies node != NUMA_NO_NODE
*/
- if (!node_isset(node, slab_nodes)) {
+ if (!node_isset(node, slab_nodes) ||
+ !allow_spin) {
+ /*
+ * Reentrant slub cannot take locks necessary
+ * to deactivate_slab, hence downgrade to any node
+ */
node = NUMA_NO_NODE;
} else {
stat(s, ALLOC_NODE_MISMATCH);
@@ -3730,7 +3762,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* PFMEMALLOC but right now, we are losing the pfmemalloc
* information when the page leaves the per-cpu allocator
*/
- if (unlikely(!pfmemalloc_match(slab, gfpflags)))
+ if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin))
goto deactivate_slab;
/* must check again c->slab in case we got preempted and it changed */
@@ -3803,7 +3835,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
slub_set_percpu_partial(c, slab);
if (likely(node_match(slab, node) &&
- pfmemalloc_match(slab, gfpflags))) {
+ pfmemalloc_match(slab, gfpflags)) ||
+ /*
+ * Reentrant slub cannot take locks necessary
+ * for __put_partials(), hence downgrade to any node
+ */
+ !allow_spin) {
c->slab = slab;
freelist = get_freelist(s, slab);
VM_BUG_ON(!freelist);
@@ -3833,8 +3870,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* allocating new page from other nodes
*/
if (unlikely(node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
- && try_thisnode))
- pc.flags = GFP_NOWAIT | __GFP_THISNODE;
+ && try_thisnode)) {
+ if (unlikely(!allow_spin))
+ /* Do not upgrade gfp to NOWAIT from more restrictive mode */
+ pc.flags = gfpflags | __GFP_THISNODE;
+ else
+ pc.flags = GFP_NOWAIT | __GFP_THISNODE;
+ }
pc.orig_size = orig_size;
slab = get_partial(s, node, &pc);
@@ -3873,7 +3915,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
stat(s, ALLOC_SLAB);
if (kmem_cache_debug(s)) {
- freelist = alloc_single_from_new_slab(s, slab, orig_size);
+ freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
if (unlikely(!freelist))
goto new_objects;
@@ -3895,7 +3937,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
inc_slabs_node(s, slab_nid(slab), slab->objects);
- if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
+ if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin)) {
/*
* For !pfmemalloc_match() case we don't load freelist so that
* we don't make further mismatched allocations easier.
@@ -3911,6 +3953,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
void *flush_freelist = c->freelist;
struct slab *flush_slab = c->slab;
+ if (unlikely(!allow_spin))
+ /*
+ * Reentrant slub cannot take locks
+ * necessary for deactivate_slab()
+ */
+ return NULL;
c->slab = NULL;
c->freelist = NULL;
c->tid = next_tid(c->tid);
@@ -3946,8 +3994,23 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
c = slub_get_cpu_ptr(s->cpu_slab);
#endif
-
- p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
+ if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
+ if (local_lock_is_locked(&s->cpu_slab->lock)) {
+ /*
+ * EBUSY is an internal signal to kmalloc_nolock() to
+ * retry a different bucket. It's not propagated
+ * to the caller.
+ */
+ p = ERR_PTR(-EBUSY);
+ goto out;
+ }
+ local_lock_lockdep_start(&s->cpu_slab->lock);
+ p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
+ local_lock_lockdep_end(&s->cpu_slab->lock);
+ } else {
+ p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
+ }
+out:
#ifdef CONFIG_PREEMPT_COUNT
slub_put_cpu_ptr(s->cpu_slab);
#endif
@@ -4071,7 +4134,7 @@ static void *__slab_alloc_node(struct kmem_cache *s,
return NULL;
}
- object = alloc_single_from_new_slab(s, slab, orig_size);
+ object = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
return object;
}
@@ -4150,8 +4213,9 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
if (p[i] && init && (!kasan_init ||
!kasan_has_integrated_init()))
memset(p[i], 0, zero_size);
- kmemleak_alloc_recursive(p[i], s->object_size, 1,
- s->flags, init_flags);
+ if (gfpflags_allow_spinning(flags))
+ kmemleak_alloc_recursive(p[i], s->object_size, 1,
+ s->flags, init_flags);
kmsan_slab_alloc(s, p[i], init_flags);
alloc_tagging_slab_alloc_hook(s, p[i], flags);
}
@@ -4342,6 +4406,94 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
}
EXPORT_SYMBOL(__kmalloc_noprof);
+/**
+ * kmalloc_nolock - Allocate an object of given size from any context.
+ * @size: size to allocate
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
+ * @node: node number of the target node.
+ *
+ * Return: pointer to the new object or NULL in case of error.
+ * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
+ * There is no reason to call it again and expect !NULL.
+ */
+void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
+{
+ gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
+ struct kmem_cache *s;
+ bool can_retry = true;
+ void *ret = ERR_PTR(-EBUSY);
+
+ VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
+
+ if (unlikely(!size))
+ return ZERO_SIZE_PTR;
+
+ if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
+ /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
+ return NULL;
+retry:
+ if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
+ return NULL;
+ s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
+
+ if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
+ /*
+ * kmalloc_nolock() is not supported on architectures that
+ * don't implement cmpxchg16b, but debug caches don't use
+ * per-cpu slab and per-cpu partial slabs. They rely on
+ * kmem_cache_node->list_lock, so kmalloc_nolock() can
+ * attempt to allocate from debug caches by
+ * spin_trylock_irqsave(&n->list_lock, ...)
+ */
+ return NULL;
+
+ /*
+ * Do not call slab_alloc_node(), since trylock mode isn't
+ * compatible with slab_pre_alloc_hook/should_failslab and
+ * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
+ * and slab_post_alloc_hook() directly.
+ *
+ * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
+ * in irq saved region. It assumes that the same cpu will not
+ * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
+ * Therefore use in_nmi() to check whether particular bucket is in
+ * irq protected section.
+ *
+ * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
+ * this cpu was interrupted somewhere inside ___slab_alloc() after
+ * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
+ * In this case fast path with __update_cpu_freelist_fast() is not safe.
+ */
+#ifndef CONFIG_SLUB_TINY
+ if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
+#endif
+ ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
+
+ if (PTR_ERR(ret) == -EBUSY) {
+ if (can_retry) {
+ /* pick the next kmalloc bucket */
+ size = s->object_size + 1;
+ /*
+ * Another alternative is to
+ * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
+ * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
+ * to retry from bucket of the same size.
+ */
+ can_retry = false;
+ goto retry;
+ }
+ ret = NULL;
+ }
+
+ maybe_wipe_obj_freeptr(s, ret);
+ slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
+ slab_want_init_on_alloc(alloc_gfp, s), size);
+
+ ret = kasan_kmalloc(s, ret, size, alloc_gfp);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kmalloc_nolock_noprof);
+
void *__kmalloc_node_track_caller_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags,
int node, unsigned long caller)
{
@@ -4555,6 +4707,53 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
discard_slab(s, slab);
}
+static DEFINE_PER_CPU(struct llist_head, defer_free_objects);
+static DEFINE_PER_CPU(struct irq_work, defer_free_work);
+
+static void free_deferred_objects(struct irq_work *work)
+{
+ struct llist_head *llhead = this_cpu_ptr(&defer_free_objects);
+ struct llist_node *llnode, *pos, *t;
+
+ if (llist_empty(llhead))
+ return;
+
+ llnode = llist_del_all(llhead);
+ llist_for_each_safe(pos, t, llnode) {
+ struct kmem_cache *s;
+ struct slab *slab;
+ void *x = pos;
+
+ slab = virt_to_slab(x);
+ s = slab->slab_cache;
+
+ /*
+ * memcg, kasan_slab_pre are already done for 'x'.
+ * The only thing left is kasan_poison.
+ */
+ kasan_slab_free(s, x, false, false, true);
+ __slab_free(s, slab, x, x, 1, _THIS_IP_);
+ }
+}
+
+static int __init init_defer_work(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ init_llist_head(per_cpu_ptr(&defer_free_objects, cpu));
+ init_irq_work(per_cpu_ptr(&defer_free_work, cpu),
+ free_deferred_objects);
+ }
+ return 0;
+}
+late_initcall(init_defer_work);
+
+static void defer_free(void *head)
+{
+ if (llist_add(head, this_cpu_ptr(&defer_free_objects)))
+ irq_work_queue(this_cpu_ptr(&defer_free_work));
+}
#ifndef CONFIG_SLUB_TINY
/*
* Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -4593,10 +4792,31 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
barrier();
if (unlikely(slab != c->slab)) {
- __slab_free(s, slab, head, tail, cnt, addr);
+ /* cnt == 0 signals that it's called from kfree_nolock() */
+ if (unlikely(!cnt)) {
+ /*
+ * __slab_free() can locklessly cmpxchg16 into a slab,
+ * but then it might need to take spin_lock or local_lock
+ * in put_cpu_partial() for further processing.
+ * Avoid the complexity and simply add to a deferred list.
+ */
+ defer_free(head);
+ } else {
+ __slab_free(s, slab, head, tail, cnt, addr);
+ }
return;
}
+ if (unlikely(!cnt)) {
+ if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
+ local_lock_is_locked(&s->cpu_slab->lock)) {
+ defer_free(head);
+ return;
+ }
+ cnt = 1;
+ kasan_slab_free(s, head, false, false, /* skip quarantine */true);
+ }
+
if (USE_LOCKLESS_FAST_PATH()) {
freelist = READ_ONCE(c->freelist);
@@ -4844,6 +5064,62 @@ void kfree(const void *object)
}
EXPORT_SYMBOL(kfree);
+/*
+ * Can be called while holding raw_spin_lock or from IRQ and NMI,
+ * but only for objects allocated by kmalloc_nolock(),
+ * since some debug checks (like kmemleak and kfence) were
+ * skipped on allocation. large_kmalloc is not supported either.
+ */
+void kfree_nolock(const void *object)
+{
+ struct folio *folio;
+ struct slab *slab;
+ struct kmem_cache *s;
+ void *x = (void *)object;
+
+ if (unlikely(ZERO_OR_NULL_PTR(object)))
+ return;
+
+ folio = virt_to_folio(object);
+ if (unlikely(!folio_test_slab(folio))) {
+ WARN(1, "Buggy usage of kfree_nolock");
+ return;
+ }
+
+ slab = folio_slab(folio);
+ s = slab->slab_cache;
+
+ memcg_slab_free_hook(s, slab, &x, 1);
+ alloc_tagging_slab_free_hook(s, slab, &x, 1);
+ /*
+ * Unlike slab_free() do NOT call the following:
+ * kmemleak_free_recursive(x, s->flags);
+ * debug_check_no_locks_freed(x, s->object_size);
+ * debug_check_no_obj_freed(x, s->object_size);
+ * __kcsan_check_access(x, s->object_size, ..);
+ * kfence_free(x);
+ * since they take spinlocks.
+ */
+ kmsan_slab_free(s, x);
+ /*
+ * If KASAN finds a kernel bug it will do kasan_report_invalid_free()
+ * which will call raw_spin_lock_irqsave() which is technically
+ * unsafe from NMI, but take chance and report kernel bug.
+ * The sequence of
+ * kasan_report_invalid_free() -> raw_spin_lock_irqsave() -> NMI
+ * -> kfree_nolock() -> kasan_report_invalid_free() on the same CPU
+ * is double buggy and deserves to deadlock.
+ */
+ if (kasan_slab_pre_free(s, x))
+ return;
+#ifndef CONFIG_SLUB_TINY
+ do_slab_free(s, slab, x, x, 0, _RET_IP_);
+#else
+ defer_free(x);
+#endif
+}
+EXPORT_SYMBOL_GPL(kfree_nolock);
+
static __always_inline __realloc_size(2) void *
__do_krealloc(const void *p, size_t new_size, gfp_t flags)
{
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 4/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
2025-07-09 1:53 ` [PATCH v2 4/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
@ 2025-07-09 14:20 ` Vlastimil Babka
0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-09 14:20 UTC (permalink / raw)
To: Alexei Starovoitov, bpf, linux-mm
Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
peterz, rostedt, hannes
On 7/9/25 03:53, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Change alloc_pages_nolock() to default to __GFP_COMP when allocating
> pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP.
> Also allow __GFP_ACCOUNT flag to be specified,
> since BPF infra needs __GFP_ACCOUNT.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 5/6] mm: Introduce alloc_frozen_pages_nolock()
2025-07-09 1:53 ` [PATCH v2 5/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
@ 2025-07-09 14:21 ` Vlastimil Babka
0 siblings, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-09 14:21 UTC (permalink / raw)
To: Alexei Starovoitov, bpf, linux-mm
Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
peterz, rostedt, hannes
On 7/9/25 03:53, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock()
> to be used by alloc_slab_page().
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-09 1:53 ` [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
@ 2025-07-10 9:36 ` Vlastimil Babka
2025-07-10 10:21 ` Harry Yoo
2025-07-10 19:21 ` Alexei Starovoitov
2025-07-11 7:26 ` Sebastian Andrzej Siewior
2025-07-11 7:36 ` Harry Yoo
2 siblings, 2 replies; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-10 9:36 UTC (permalink / raw)
To: Alexei Starovoitov, bpf, linux-mm
Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
peterz, rostedt, hannes
On 7/9/25 03:53, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> kmalloc_nolock() relies on ability of local_lock to detect the situation
> when it's locked.
> In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> In that case retry the operation in a different kmalloc bucket.
> The second attempt will likely succeed, since this cpu locked
> different kmem_cache_cpu.
>
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> a different bucket that is most likely not locked by the current
> task. Though it may be locked by a different task it's safe to
> rt_spin_lock() on it.
>
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
>
> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and in_nmi() or in PREEMPT_RT.
>
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
> always defers to irq_work.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> static __fastpath_inline
> @@ -2442,13 +2453,17 @@ static void *setup_object(struct kmem_cache *s, void *object)
> * Slab allocation and freeing
> */
> static inline struct slab *alloc_slab_page(gfp_t flags, int node,
> - struct kmem_cache_order_objects oo)
> + struct kmem_cache_order_objects oo,
> + bool allow_spin)
> {
> struct folio *folio;
> struct slab *slab;
> unsigned int order = oo_order(oo);
>
> - if (node == NUMA_NO_NODE)
> + if (unlikely(!allow_spin)) {
> + folio = (struct folio *)alloc_frozen_pages_nolock(0/* __GFP_COMP is implied */,
> + node, order);
> + } else if (node == NUMA_NO_NODE)
> folio = (struct folio *)alloc_frozen_pages(flags, order);
> else
> folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
Nit: should use { } either for everything or nothing (seems your new branch
would work without them)
> stat(s, ALLOC_NODE_MISMATCH);
> @@ -3730,7 +3762,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> * PFMEMALLOC but right now, we are losing the pfmemalloc
> * information when the page leaves the per-cpu allocator
> */
> - if (unlikely(!pfmemalloc_match(slab, gfpflags)))
> + if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin))
> goto deactivate_slab;
>
> /* must check again c->slab in case we got preempted and it changed */
> @@ -3803,7 +3835,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> slub_set_percpu_partial(c, slab);
>
> if (likely(node_match(slab, node) &&
> - pfmemalloc_match(slab, gfpflags))) {
> + pfmemalloc_match(slab, gfpflags)) ||
> + /*
> + * Reentrant slub cannot take locks necessary
> + * for __put_partials(), hence downgrade to any node
> + */
> + !allow_spin) {
Uh, this seems rather ugly; I'd move the comment above everything. Also it's
not a "downgrade" as when you assign NUMA_NO_NODE earlier; I'd say "ignore the
preference".
Note that it would be bad to ignore it with __GFP_THISNODE, but that's not
allowed for kmalloc_nolock(), so that's fine.
> @@ -3911,6 +3953,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> void *flush_freelist = c->freelist;
> struct slab *flush_slab = c->slab;
>
> + if (unlikely(!allow_spin))
> + /*
> + * Reentrant slub cannot take locks
> + * necessary for deactivate_slab()
> + */
> + return NULL;
Hm but this is leaking the slab we allocated and have in the "slab"
variable, we need to free it back in that case.
> c->slab = NULL;
> c->freelist = NULL;
> c->tid = next_tid(c->tid);
> @@ -4593,10 +4792,31 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> barrier();
>
> if (unlikely(slab != c->slab)) {
> - __slab_free(s, slab, head, tail, cnt, addr);
> + /* cnt == 0 signals that it's called from kfree_nolock() */
> + if (unlikely(!cnt)) {
> + /*
> + * __slab_free() can locklessly cmpxchg16 into a slab,
> + * but then it might need to take spin_lock or local_lock
> + * in put_cpu_partial() for further processing.
> + * Avoid the complexity and simply add to a deferred list.
> + */
> + defer_free(head);
> + } else {
> + __slab_free(s, slab, head, tail, cnt, addr);
> + }
> return;
> }
>
> + if (unlikely(!cnt)) {
> + if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> + local_lock_is_locked(&s->cpu_slab->lock)) {
> + defer_free(head);
> + return;
> + }
> + cnt = 1;
Hmm we might end up doing a "goto redo" later and then do the wrong thing above?
> + kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> + }
> +
> if (USE_LOCKLESS_FAST_PATH()) {
> freelist = READ_ONCE(c->freelist);
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-10 9:36 ` Vlastimil Babka
@ 2025-07-10 10:21 ` Harry Yoo
2025-07-10 15:05 ` Vlastimil Babka
2025-07-10 19:21 ` Alexei Starovoitov
1 sibling, 1 reply; 38+ messages in thread
From: Harry Yoo @ 2025-07-10 10:21 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Alexei Starovoitov, bpf, linux-mm, shakeel.butt, mhocko, bigeasy,
andrii, memxor, akpm, peterz, rostedt, hannes
On Thu, Jul 10, 2025 at 11:36:02AM +0200, Vlastimil Babka wrote:
> On 7/9/25 03:53, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > kmalloc_nolock() relies on ability of local_lock to detect the situation
> > when it's locked.
> > In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> > irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> > In that case retry the operation in a different kmalloc bucket.
> > The second attempt will likely succeed, since this cpu locked
> > different kmem_cache_cpu.
> >
> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> > into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> > a different bucket that is most likely not locked by the current
> > task. Though it may be locked by a different task it's safe to
> > rt_spin_lock() on it.
> >
> > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > immediately if called from hard irq or NMI in PREEMPT_RT.
> >
> > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > and in_nmi() or in PREEMPT_RT.
> >
> > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
> > always defers to irq_work.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > @@ -3911,6 +3953,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > void *flush_freelist = c->freelist;
> > struct slab *flush_slab = c->slab;
> >
> > + if (unlikely(!allow_spin))
> > + /*
> > + * Reentrant slub cannot take locks
> > + * necessary for deactivate_slab()
> > + */
> > + return NULL;
>
> Hm but this is leaking the slab we allocated and have in the "slab"
> variable, we need to free it back in that case.
But it might be a partial slab taken from the list?
Then we need to trylock n->list_lock and if that fails, oh...
> > c->slab = NULL;
> > c->freelist = NULL;
> > c->tid = next_tid(c->tid);
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-10 10:21 ` Harry Yoo
@ 2025-07-10 15:05 ` Vlastimil Babka
2025-07-10 19:13 ` Alexei Starovoitov
0 siblings, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-10 15:05 UTC (permalink / raw)
To: Harry Yoo
Cc: Alexei Starovoitov, bpf, linux-mm, shakeel.butt, mhocko, bigeasy,
andrii, memxor, akpm, peterz, rostedt, hannes
On 7/10/25 12:21, Harry Yoo wrote:
> On Thu, Jul 10, 2025 at 11:36:02AM +0200, Vlastimil Babka wrote:
>> On 7/9/25 03:53, Alexei Starovoitov wrote:
>>
>> Hm but this is leaking the slab we allocated and have in the "slab"
>> variable, we need to free it back in that case.
>
> But it might be a partial slab taken from the list?
True.
> Then we need to trylock n->list_lock and if that fails, oh...
So... since we succeeded taking it from the list and thus the spin_trylock,
it means it's safe to spin_lock() n->list_lock again - we might be waiting on
another cpu to unlock it, but we know we didn't NMI on our own cpu holding the
lock, right? But we'd probably need to convince lockdep about this somehow,
and also remember if we allocated a new slab or took one from the partial
list... or just deal with this unlikely situation in another irq work :/
>> > c->slab = NULL;
>> > c->freelist = NULL;
>> > c->tid = next_tid(c->tid);
>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-10 15:05 ` Vlastimil Babka
@ 2025-07-10 19:13 ` Alexei Starovoitov
2025-07-11 6:06 ` Harry Yoo
2025-07-11 10:30 ` Vlastimil Babka
0 siblings, 2 replies; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-10 19:13 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, bpf, linux-mm, Shakeel Butt, Michal Hocko,
Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Thu, Jul 10, 2025 at 8:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/10/25 12:21, Harry Yoo wrote:
> > On Thu, Jul 10, 2025 at 11:36:02AM +0200, Vlastimil Babka wrote:
> >> On 7/9/25 03:53, Alexei Starovoitov wrote:
> >>
> >> Hm but this is leaking the slab we allocated and have in the "slab"
> >> variable, we need to free it back in that case.
ohh. sorry for the silly mistake.
Re-reading the diff again I realized that I made a similar mistake in
alloc_single_from_new_slab().
It has this bit:
if (!alloc_debug_processing(...))
return NULL;
so I assumed that doing:
if (!spin_trylock_irqsave(&n->list_lock,..))
return NULL;
is ok too. Now I see that !alloc_debug is purposefully leaking memory.
Should we add:
@@ -2841,6 +2841,7 @@ static void *alloc_single_from_new_slab(struct
kmem_cache *s, struct slab *slab,
* It's not really expected that this would fail on a
* freshly allocated slab, but a concurrent memory
* corruption in theory could cause that.
+ * Leak newly allocated slab.
*/
return NULL;
so the next person doesn't make the same mistake?
Also help me understand...
slab->objects is never equal to 1, right?
/proc/slabinfo agrees, but I cannot decipher it through slab init code.
Logically it makes sense.
If that's the case why alloc_single_from_new_slab()
has this part:
if (slab->inuse == slab->objects)
add_full(s, n, slab);
else
add_partial(n, slab, DEACTIVATE_TO_HEAD);
Shouldn't it call add_partial() only,
since slab->inuse == 1 and slab->objects != 1?
> >
> > But it might be a partial slab taken from the list?
>
> True.
>
> > Then we need to trylock n->list_lock and if that fails, oh...
>
> So... since we succeeded taking it from the list and thus the spin_trylock,
> it means it's safe to spinlock n->list_lock again - we might be waiting on
> other cpu to unlock it but we know we didn't NMI on our own cpu having the
> lock, right? But we'd probably need to convince lockdep about this somehow,
> and also remember if we allocated a new slab or taken on from the partial
> list... or just deal with this unlikely situation in another irq work :/
irq_work might be the least mind bending.
Good point about partial vs new slab.
For partial we can indeed proceed with deactivate_slab() and if
I'm reading the code correctly, it won't have new.inuse == 0,
so it won't go to discard_slab() (which won't be safe in this path)
But teaching lockdep that below bit in deactivate_slab() is safe:
} else if (new.freelist) {
spin_lock_irqsave(&n->list_lock, flags);
add_partial(n, slab, tail);
is a challenge.
Since defer_free_work is there, I'm leaning towards reusing it for
deactivate_slab too. It will process
static DEFINE_PER_CPU(struct llist_head, defer_free_objects);
and
static DEFINE_PER_CPU(struct llist_head, defer_deactivate_slabs);
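Roughly like this (untested sketch; assumes struct slab can stash an
llist_node plus the flushed freelist pointer, e.g. in the slab_list union):

static void defer_deactivate_slab(struct slab *slab, void *flush_freelist)
{
	/* llnode and flush_freelist are assumed new fields in struct slab */
	slab->flush_freelist = flush_freelist;
	if (llist_add(&slab->llnode, this_cpu_ptr(&defer_deactivate_slabs)))
		irq_work_queue(this_cpu_ptr(&defer_free_work));
}

and the irq_work handler would additionally do:

	llnode = llist_del_all(this_cpu_ptr(&defer_deactivate_slabs));
	llist_for_each_safe(pos, t, llnode) {
		struct slab *slab = container_of(pos, struct slab, llnode);

		/* takes n->list_lock and may discard the slab, safe in irq_work */
		deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
	}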
Shouldn't be too ugly. Better ideas?
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-10 9:36 ` Vlastimil Babka
2025-07-10 10:21 ` Harry Yoo
@ 2025-07-10 19:21 ` Alexei Starovoitov
1 sibling, 0 replies; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-10 19:21 UTC (permalink / raw)
To: Vlastimil Babka
Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Thu, Jul 10, 2025 at 2:36 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> > + if (unlikely(!allow_spin)) {
> > + folio = (struct folio *)alloc_frozen_pages_nolock(0/* __GFP_COMP is implied */,
> > + node, order);
> > + } else if (node == NUMA_NO_NODE)
> > folio = (struct folio *)alloc_frozen_pages(flags, order);
> > else
> > folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
>
> Nit: should use { } either for everything or nothing (seems your new branch
> would work without them)
leftover from v1. will fix.
> > stat(s, ALLOC_NODE_MISMATCH);
> > @@ -3730,7 +3762,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > * PFMEMALLOC but right now, we are losing the pfmemalloc
> > * information when the page leaves the per-cpu allocator
> > */
> > - if (unlikely(!pfmemalloc_match(slab, gfpflags)))
> > + if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin))
> > goto deactivate_slab;
> >
> > /* must check again c->slab in case we got preempted and it changed */
> > @@ -3803,7 +3835,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > slub_set_percpu_partial(c, slab);
> >
> > if (likely(node_match(slab, node) &&
> > - pfmemalloc_match(slab, gfpflags))) {
> > + pfmemalloc_match(slab, gfpflags)) ||
> > + /*
> > + * Reentrant slub cannot take locks necessary
> > + * for __put_partials(), hence downgrade to any node
> > + */
> > + !allow_spin) {
>
> Uh this seems rather ugly, I'd move the comment above everything. Also it's
> not "downgrade" as when you assign NUMA_NO_NODE earlier, I'd say "ignore the
> preference".
> Note that it would be bad to ignore with __GFP_THISNODE but then it's not
> allowed for kmalloc_nolock() so that's fine.
Yes. All correct. Will reword.
> > @@ -3911,6 +3953,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > void *flush_freelist = c->freelist;
> > struct slab *flush_slab = c->slab;
> >
> > + if (unlikely(!allow_spin))
> > + /*
> > + * Reentrant slub cannot take locks
> > + * necessary for deactivate_slab()
> > + */
> > + return NULL;
>
> Hm but this is leaking the slab we allocated and have in the "slab"
> variable, we need to free it back in that case.
>
> > c->slab = NULL;
> > c->freelist = NULL;
> > c->tid = next_tid(c->tid);
>
> > @@ -4593,10 +4792,31 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> > barrier();
> >
> > if (unlikely(slab != c->slab)) {
> > - __slab_free(s, slab, head, tail, cnt, addr);
> > + /* cnt == 0 signals that it's called from kfree_nolock() */
> > + if (unlikely(!cnt)) {
> > + /*
> > + * __slab_free() can locklessly cmpxchg16 into a slab,
> > + * but then it might need to take spin_lock or local_lock
> > + * in put_cpu_partial() for further processing.
> > + * Avoid the complexity and simply add to a deferred list.
> > + */
> > + defer_free(head);
> > + } else {
> > + __slab_free(s, slab, head, tail, cnt, addr);
> > + }
> > return;
> > }
> >
> > + if (unlikely(!cnt)) {
> > + if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> > + local_lock_is_locked(&s->cpu_slab->lock)) {
> > + defer_free(head);
> > + return;
> > + }
> > + cnt = 1;
>
> Hmm we might end up doing a "goto redo" later and then do the wrong thing above?
Great catch. Will fix. That's two serious bugs.
That's my penalty for reviewing other people's code 99% of the time
and having little time to code myself.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-10 19:13 ` Alexei Starovoitov
@ 2025-07-11 6:06 ` Harry Yoo
2025-07-11 10:30 ` Vlastimil Babka
1 sibling, 0 replies; 38+ messages in thread
From: Harry Yoo @ 2025-07-11 6:06 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Vlastimil Babka, bpf, linux-mm, Shakeel Butt, Michal Hocko,
Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Thu, Jul 10, 2025 at 12:13:20PM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 10, 2025 at 8:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > On 7/10/25 12:21, Harry Yoo wrote:
> > > On Thu, Jul 10, 2025 at 11:36:02AM +0200, Vlastimil Babka wrote:
> > >> On 7/9/25 03:53, Alexei Starovoitov wrote:
> > >>
> > >> Hm but this is leaking the slab we allocated and have in the "slab"
> > >> variable, we need to free it back in that case.
>
> ohh. sorry for the silly mistake.
> Re-reading the diff again I realized that I made a similar mistake in
> alloc_single_from_new_slab().
> It has this bit:
> if (!alloc_debug_processing(...))
> return NULL;
Yeah but we purposefully leak slabs if !alloc_debug_processing(),
the same in alloc_single_from_partial().
> so I assumed that doing:
> if (!spin_trylock_irqsave(&n->list_lock,..))
> return NULL;
>
> is ok too. Now I see that !alloc_debug is purposefully leaking memory.
>
> Should we add:
> @@ -2841,6 +2841,7 @@ static void *alloc_single_from_new_slab(struct
> kmem_cache *s, struct slab *slab,
> * It's not really expected that this would fail on a
> * freshly allocated slab, but a concurrent memory
> * corruption in theory could cause that.
> + * Leak newly allocated slab.
> */
> return NULL;
>
> so the next person doesn't make the same mistake?
Looks fine. Probably add a comment to alloc_single_from_partial() as well?
> Also help me understand...
> slab->objects is never equal to 1, right?
No. For example, if you allocate a 4k object and SLUB allocates a slab with
oo_order(s->min), slab->objects will be 1.
/proc/slabinfo only prints oo_order(s->oo), not oo_order(s->min).
> /proc/slabinfo agrees, but I cannot decipher it through slab init code.
> Logically it makes sense.
I think the reason why there is no <objsperslab> == 1 in your
/proc/slabinfo is that calculate_order() tries to choose a higher order
for slabs (based on the number of CPUs) to reduce lock contention.
But nothing prevents slab->objects from being 1.
> If that's the case why alloc_single_from_new_slab()
> has this part:
> if (slab->inuse == slab->objects)
> add_full(s, n, slab);
> else
> add_partial(n, slab, DEACTIVATE_TO_HEAD);
>
> Shouldn't it call add_partial() only ?
> since slab->inuse == 1 and slab->objects != 1
...and that means we need to handle slab->inuse == slab->objects.
> > > But it might be a partial slab taken from the list?
> >
> > True.
> >
> > > Then we need to trylock n->list_lock and if that fails, oh...
> >
> > So... since we succeeded taking it from the list and thus the spin_trylock,
> > it means it's safe to spinlock n->list_lock again - we might be waiting on
> > other cpu to unlock it but we know we didn't NMI on our own cpu having the
> > lock, right? But we'd probably need to convince lockdep about this somehow,
> > and also remember if we allocated a new slab or taken on from the partial
> > list... or just deal with this unlikely situation in another irq work :/
>
> irq_work might be the least mind bending.
> Good point about partial vs new slab.
> For partial we can indeed proceed with deactivate_slab() and if
> I'm reading the code correctly, it won't have new.inuse == 0,
> so it won't go to discard_slab() (which won't be safe in this path)
> But teaching lockdep that below bit in deactivate_slab() is safe:
> } else if (new.freelist) {
> spin_lock_irqsave(&n->list_lock, flags);
> add_partial(n, slab, tail);
> is a challenge.
>
> Since defer_free_work is there, I'm leaning to reuse it for
> deactive_slab too. It will process
> static DEFINE_PER_CPU(struct llist_head, defer_free_objects);
> and
> static DEFINE_PER_CPU(struct llist_head, defer_deactivate_slabs);
+1 for another irq work for slab deactivation
it should be rare anyway...
> Shouldn't be too ugly. Better ideas?
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-09 1:53 ` [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
2025-07-10 9:36 ` Vlastimil Babka
@ 2025-07-11 7:26 ` Sebastian Andrzej Siewior
2025-07-11 7:36 ` Harry Yoo
2 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-11 7:26 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, linux-mm, vbabka, harry.yoo, shakeel.butt, mhocko, andrii,
memxor, akpm, peterz, rostedt, hannes
On 2025-07-08 18:53:03 [-0700], Alexei Starovoitov wrote:
> @@ -4555,6 +4707,53 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> discard_slab(s, slab);
> }
>
> +static DEFINE_PER_CPU(struct llist_head, defer_free_objects);
> +static DEFINE_PER_CPU(struct irq_work, defer_free_work);
static DEFINE_PER_CPU(struct llist_head, defer_free_objects) = LLIST_HEAD_INIT(defer_free_objects);
static DEFINE_PER_CPU(struct irq_work, defer_free_work) = IRQ_WORK_INIT(free_deferred_objects);
would allow you to avoid init_defer_work().
> +static void free_deferred_objects(struct irq_work *work)
> +{
> + struct llist_head *llhead = this_cpu_ptr(&defer_free_objects);
> + struct llist_node *llnode, *pos, *t;
…
> +}
> +
> +static int __init init_defer_work(void)
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + init_llist_head(per_cpu_ptr(&defer_free_objects, cpu));
> + init_irq_work(per_cpu_ptr(&defer_free_work, cpu),
> + free_deferred_objects);
> + }
> + return 0;
> +}
> +late_initcall(init_defer_work);
> +
> +static void defer_free(void *head)
> +{
> + if (llist_add(head, this_cpu_ptr(&defer_free_objects)))
> + irq_work_queue(this_cpu_ptr(&defer_free_work));
If you group &defer_free_objects and &defer_free_work into a struct you
could avoid using this_cpu_ptr twice.
Having both in one struct would allow you to use container_of() in
free_deferred_objects() to get the free list.
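I.e. something like this (untested; "defer_free_data" is just an
illustrative name):

struct defer_free {
	struct llist_head objects;
	struct irq_work work;
};

static void free_deferred_objects(struct irq_work *work);

static DEFINE_PER_CPU(struct defer_free, defer_free_data) = {
	.objects = LLIST_HEAD_INIT(defer_free_data.objects),
	.work	 = IRQ_WORK_INIT(free_deferred_objects),
};

static void defer_free(void *head)
{
	struct defer_free *df = this_cpu_ptr(&defer_free_data);

	if (llist_add(head, &df->objects))
		irq_work_queue(&df->work);
}

static void free_deferred_objects(struct irq_work *work)
{
	struct defer_free *df = container_of(work, struct defer_free, work);

	/* drain df->objects exactly as before */
}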
> +}
…
> @@ -4844,6 +5064,62 @@ void kfree(const void *object)
> }
> EXPORT_SYMBOL(kfree);
>
> +/*
> + * Can be called while holding raw_spin_lock or from IRQ and NMI,
raw_spinlock_t
> + * but only for objects allocated by kmalloc_nolock(),
> + * since some debug checks (like kmemleak and kfence) were
> + * skipped on allocation. large_kmalloc is not supported either.
> + */
> +void kfree_nolock(const void *object)
> +{
…
Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-09 1:53 ` [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
2025-07-10 9:36 ` Vlastimil Babka
2025-07-11 7:26 ` Sebastian Andrzej Siewior
@ 2025-07-11 7:36 ` Harry Yoo
2025-07-11 7:40 ` Harry Yoo
2025-07-11 10:48 ` Vlastimil Babka
2 siblings, 2 replies; 38+ messages in thread
From: Harry Yoo @ 2025-07-11 7:36 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, linux-mm, vbabka, shakeel.butt, mhocko, bigeasy, andrii,
memxor, akpm, peterz, rostedt, hannes
On Tue, Jul 08, 2025 at 06:53:03PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> kmalloc_nolock() relies on ability of local_lock to detect the situation
> when it's locked.
> In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> In that case retry the operation in a different kmalloc bucket.
> The second attempt will likely succeed, since this cpu locked
> different kmem_cache_cpu.
>
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> a different bucket that is most likely is not locked by the current
> task. Though it may be locked by a different task it's safe to
> rt_spin_lock() on it.
>
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
>
> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and in_nmi() or in PREEMPT_RT.
Does that mean in mm/Kconfig SLUB now needs to select IRQ_WORK?
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
> always defers to irq_work.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/kasan.h | 13 +-
> include/linux/slab.h | 4 +
> mm/kasan/common.c | 5 +-
> mm/slub.c | 330 ++++++++++++++++++++++++++++++++++++++----
> 4 files changed, 319 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index 890011071f2b..acdc8cb0152e 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -200,7 +200,7 @@ static __always_inline bool kasan_slab_pre_free(struct kmem_cache *s,
> }
>
> bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
> - bool still_accessible);
> + bool still_accessible, bool no_quarantine);
> /**
> * kasan_slab_free - Poison, initialize, and quarantine a slab object.
> * @object: Object to be freed.
> @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
> int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> gfp_t gfp, bool new_slab)
> {
> + bool allow_spin = gfpflags_allow_spinning(gfp);
> unsigned int objects = objs_per_slab(s, slab);
> unsigned long new_exts;
> unsigned long old_exts;
> @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> gfp &= ~OBJCGS_CLEAR_MASK;
> /* Prevent recursive extension vector allocation */
> gfp |= __GFP_NO_OBJ_EXT;
> - vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> - slab_nid(slab));
> + if (unlikely(!allow_spin)) {
> + size_t sz = objects * sizeof(struct slabobj_ext);
> +
> + vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
Missing memset()?
as there is no kcalloc_nolock()...
> + } else {
> + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> + slab_nid(slab));
> + }
> if (!vec) {
> /* Mark vectors which failed to allocate */
> if (new_slab)
> @@ -3911,6 +3953,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> void *flush_freelist = c->freelist;
> struct slab *flush_slab = c->slab;
>
> + if (unlikely(!allow_spin))
> + /*
> + * Reentrant slub cannot take locks
> + * necessary for deactivate_slab()
> + */
> + return NULL;
> c->slab = NULL;
> c->freelist = NULL;
> c->tid = next_tid(c->tid);
> @@ -4555,6 +4707,53 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> discard_slab(s, slab);
> }
>
> +static DEFINE_PER_CPU(struct llist_head, defer_free_objects);
> +static DEFINE_PER_CPU(struct irq_work, defer_free_work);
> +
> +static void free_deferred_objects(struct irq_work *work)
> +{
> + struct llist_head *llhead = this_cpu_ptr(&defer_free_objects);
> + struct llist_node *llnode, *pos, *t;
> +
> + if (llist_empty(llhead))
> + return;
> +
> + llnode = llist_del_all(llhead);
> + llist_for_each_safe(pos, t, llnode) {
> + struct kmem_cache *s;
> + struct slab *slab;
> + void *x = pos;
> +
> + slab = virt_to_slab(x);
> + s = slab->slab_cache;
> +
> + /*
> + * memcg, kasan_slab_pre are already done for 'x'.
> + * The only thing left is kasan_poison.
> + */
> + kasan_slab_free(s, x, false, false, true);
> + __slab_free(s, slab, x, x, 1, _THIS_IP_);
> + }
> +}
> +
> +static void defer_free(void *head)
> +{
> + if (llist_add(head, this_cpu_ptr(&defer_free_objects)))
> + irq_work_queue(this_cpu_ptr(&defer_free_work));
By adding it to the lockless list, you are overwriting the first word of a
freed object, and that's not always safe.
Looking at calculate_sizes():
if (((flags & SLAB_TYPESAFE_BY_RCU) && !args->use_freeptr_offset) ||
(flags & SLAB_POISON) || s->ctor ||
((flags & SLAB_RED_ZONE) &&
(s->object_size < sizeof(void *) || slub_debug_orig_size(s)))) {
/*
* Relocate free pointer after the object if it is not
* permitted to overwrite the first word of the object on
* kmem_cache_free.
*
* This is the case if we do RCU, have a constructor or
* destructor, are poisoning the objects, or are
* redzoning an object smaller than sizeof(void *) or are
* redzoning an object with slub_debug_orig_size() enabled,
* in which case the right redzone may be extended.
*
* The assumption that s->offset >= s->inuse means free
* pointer is outside of the object is used in the
* freeptr_outside_object() function. If that is no
* longer true, the function needs to be modified.
*/
s->offset = size;
size += sizeof(void *);
Only sizeof(void *) bytes from object + s->offset is always safe to overwrite.
So either 1) teach defer_free() that it needs to use s->offset for each
object, instead of zero (and that the list can have objects from
different caches), or 2) introduce per-cache per-CPU lockless lists?
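For 1) a minimal sketch (untested, handler side shown as a fragment):

static void defer_free(struct kmem_cache *s, void *head)
{
	/*
	 * object + s->offset is the freepointer word, the only part of
	 * the object that is always safe to overwrite with a llist_node.
	 */
	if (llist_add(head + s->offset, this_cpu_ptr(&defer_free_objects)))
		irq_work_queue(this_cpu_ptr(&defer_free_work));
}

and free_deferred_objects() would undo it per node:

	struct slab *slab = virt_to_slab(pos);
	struct kmem_cache *s = slab->slab_cache;
	void *x = (void *)pos - s->offset;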
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-11 7:36 ` Harry Yoo
@ 2025-07-11 7:40 ` Harry Yoo
2025-07-11 10:48 ` Vlastimil Babka
1 sibling, 0 replies; 38+ messages in thread
From: Harry Yoo @ 2025-07-11 7:40 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, linux-mm, vbabka, shakeel.butt, mhocko, bigeasy, andrii,
memxor, akpm, peterz, rostedt, hannes
On Fri, Jul 11, 2025 at 04:36:30PM +0900, Harry Yoo wrote:
> On Tue, Jul 08, 2025 at 06:53:03PM -0700, Alexei Starovoitov wrote:
> > @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > gfp &= ~OBJCGS_CLEAR_MASK;
> > /* Prevent recursive extension vector allocation */
> > gfp |= __GFP_NO_OBJ_EXT;
> > - vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > - slab_nid(slab));
> > + if (unlikely(!allow_spin)) {
> > + size_t sz = objects * sizeof(struct slabobj_ext);
> > +
> > + vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
>
> Missing memset()?
> as there is no kcalloc_nolock()...
Oops, there's __GFP_ZERO. nevermind.
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-09 1:53 ` [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end() Alexei Starovoitov
@ 2025-07-11 7:50 ` Sebastian Andrzej Siewior
2025-07-11 9:55 ` Vlastimil Babka
0 siblings, 1 reply; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-11 7:50 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, linux-mm, vbabka, harry.yoo, shakeel.butt, mhocko, andrii,
memxor, akpm, peterz, rostedt, hannes
On 2025-07-08 18:53:00 [-0700], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce local_lock_lockdep_start/end() pair to teach lockdep
> about a region of execution where per-cpu local_lock is not taken
> and lockdep should consider such local_lock() as "trylock" to
> avoid multiple false-positives:
> - lockdep doesn't like when the same lock is taken in normal and
> in NMI context
> - lockdep cannot recognize that local_locks that protect kmalloc
> buckets are different local_locks and not taken together
>
> This pair of lockdep aid is used by slab in the following way:
>
> if (local_lock_is_locked(&s->cpu_slab->lock))
> goto out;
> local_lock_lockdep_start(&s->cpu_slab->lock);
> p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> local_lock_lockdep_end(&s->cpu_slab->lock);
>
> Where ___slab_alloc() is calling
> local_lock_irqsave(&s->cpu_slab->lock, ...) many times,
> and all of them will not deadlock since this lock is not taken.
So you prefer this instead of using a trylock variant in ___slab_alloc()
which would simply return in case the trylock fails?
Having the local_lock_is_locked() is still good to avoid the lock
failure if it can be detected early. I am just not sure if the extra
lockdep override is really needed.
…
> --- a/include/linux/local_lock.h
> +++ b/include/linux/local_lock.h
> @@ -81,6 +81,21 @@
> #define local_trylock_irqsave(lock, flags) \
> __local_trylock_irqsave(lock, flags)
>
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +#define local_lock_lockdep_start(lock) \
> + do { \
> + lockdep_assert(!__local_lock_is_locked(lock)); \
> + this_cpu_ptr(lock)->dep_map.flags = LOCAL_LOCK_UNLOCKED;\
> + } while (0)
> +
> +#define local_lock_lockdep_end(lock) \
> + do { this_cpu_ptr(lock)->dep_map.flags = 0; } while (0)
> +
> +#else
> +#define local_lock_lockdep_start(lock) /**/
> +#define local_lock_lockdep_end(lock) /**/
Why the /**/?
…
> index 9f361d3ab9d9..6c580081ace3 100644
> --- a/include/linux/lockdep_types.h
> +++ b/include/linux/lockdep_types.h
> @@ -190,13 +190,15 @@ struct lockdep_map {
> u8 wait_type_outer; /* can be taken in this context */
> u8 wait_type_inner; /* presents this context */
> u8 lock_type;
> - /* u8 hole; */
> + u8 flags;
> #ifdef CONFIG_LOCK_STAT
> int cpu;
> unsigned long ip;
> #endif
> };
>
> +#define LOCAL_LOCK_UNLOCKED 1
Maybe DEPMAP_FLAG_LL_UNLOCKED so it is kind of obvious where it belongs.
Maybe use "u8 local_lock_unlocked:1;" instead of the flags + define. A
bitfield is even used for held_lock below, so it is not a new concept in
lockdep. It would narrow down the usage.
> struct pin_cookie { unsigned int val; };
>
Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 2/6] locking/local_lock: Introduce local_lock_is_locked().
2025-07-09 1:52 ` [PATCH v2 2/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
@ 2025-07-11 7:52 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-11 7:52 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, linux-mm, vbabka, harry.yoo, shakeel.butt, mhocko, andrii,
memxor, akpm, peterz, rostedt, hannes
On 2025-07-08 18:52:59 [-0700], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> Introduce local_lock_is_locked() that returns true when
> given local_lock is locked by current cpu (in !PREEMPT_RT) or
> by current task (in PREEMPT_RT).
+ The goal is to detect a deadlock by the caller.
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 1/6] locking/local_lock: Expose dep_map in local_trylock_t.
2025-07-09 1:52 ` [PATCH v2 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
@ 2025-07-11 8:02 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-11 8:02 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, linux-mm, vbabka, harry.yoo, shakeel.butt, mhocko, andrii,
memxor, akpm, peterz, rostedt, hannes
On 2025-07-08 18:52:58 [-0700], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> lockdep_is_held() macro assumes that "struct lockdep_map dep_map;"
> is a top level field of any lock that participates in LOCKDEP.
> Make it so for local_trylock_t.
I am not very pleased with this. I would prefer one struct given that
the functions below are shared and the _Generic construct is not as I
would have preferred it. Anyway. I meant to redo it but didn't get very
far.
Since it is reasonable for now:
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-11 7:50 ` Sebastian Andrzej Siewior
@ 2025-07-11 9:55 ` Vlastimil Babka
2025-07-11 15:17 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-11 9:55 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Alexei Starovoitov
Cc: bpf, linux-mm, harry.yoo, shakeel.butt, mhocko, andrii, memxor,
akpm, peterz, rostedt, hannes
On 7/11/25 09:50, Sebastian Andrzej Siewior wrote:
> On 2025-07-08 18:53:00 [-0700], Alexei Starovoitov wrote:
>> From: Alexei Starovoitov <ast@kernel.org>
>>
>> Introduce local_lock_lockdep_start/end() pair to teach lockdep
>> about a region of execution where per-cpu local_lock is not taken
>> and lockdep should consider such local_lock() as "trylock" to
>> avoid multiple false-positives:
>> - lockdep doesn't like when the same lock is taken in normal and
>> in NMI context
>> - lockdep cannot recognize that local_locks that protect kmalloc
>> buckets are different local_locks and not taken together
>>
>> This pair of lockdep aid is used by slab in the following way:
>>
>> if (local_lock_is_locked(&s->cpu_slab->lock))
>> goto out;
>> local_lock_lockdep_start(&s->cpu_slab->lock);
>> p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
>> local_lock_lockdep_end(&s->cpu_slab->lock);
>>
>> Where ___slab_alloc() is calling
>> local_lock_irqsave(&s->cpu_slab->lock, ...) many times,
>> and all of them will not deadlock since this lock is not taken.
>
> So you prefer this instead of using a trylock variant in ___slab_alloc()
> which would simply return in case the trylock fails?
The code isn't always in a position to "simply return". On !RT I think we
can at least assume that if we succeeded once, it means we're not an irq/nmi
interrupting a locked context, so we'll succeed in the following attempts too.
On RT IIUC the lock might be taken by someone else, so a trylock might fail
(even if it should also mean we're in a context that can do a non-try lock).
> Having the local_lock_is_locked() is still good to avoid the lock
> failure if it can be detected early. I am just not sure if the extra
> lockdep override is really needed.
>
> …
>> --- a/include/linux/local_lock.h
>> +++ b/include/linux/local_lock.h
>> @@ -81,6 +81,21 @@
>> #define local_trylock_irqsave(lock, flags) \
>> __local_trylock_irqsave(lock, flags)
>>
>> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
>> +#define local_lock_lockdep_start(lock) \
>> + do { \
>> + lockdep_assert(!__local_lock_is_locked(lock)); \
>> + this_cpu_ptr(lock)->dep_map.flags = LOCAL_LOCK_UNLOCKED;\
>> + } while (0)
>> +
>> +#define local_lock_lockdep_end(lock) \
>> + do { this_cpu_ptr(lock)->dep_map.flags = 0; } while (0)
>> +
>> +#else
>> +#define local_lock_lockdep_start(lock) /**/
>> +#define local_lock_lockdep_end(lock) /**/
>
> Why the /**/?
>
> …
>> index 9f361d3ab9d9..6c580081ace3 100644
>> --- a/include/linux/lockdep_types.h
>> +++ b/include/linux/lockdep_types.h
>> @@ -190,13 +190,15 @@ struct lockdep_map {
>> u8 wait_type_outer; /* can be taken in this context */
>> u8 wait_type_inner; /* presents this context */
>> u8 lock_type;
>> - /* u8 hole; */
>> + u8 flags;
>> #ifdef CONFIG_LOCK_STAT
>> int cpu;
>> unsigned long ip;
>> #endif
>> };
>>
>> +#define LOCAL_LOCK_UNLOCKED 1
>
> Maybe DEPMAP_FLAG_LL_UNLOCKED so it is kind of obvious where it belongs
> to. Maybe use "u8 local_lock_unlocked:1;" instead the flags + define. It
> is even used for held_lock below so it is not a new concept with
> lockdep. It would narrow down the usage.
>
>> struct pin_cookie { unsigned int val; };
>>
>
> Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-10 19:13 ` Alexei Starovoitov
2025-07-11 6:06 ` Harry Yoo
@ 2025-07-11 10:30 ` Vlastimil Babka
2025-07-12 1:55 ` Alexei Starovoitov
1 sibling, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-11 10:30 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Harry Yoo, bpf, linux-mm, Shakeel Butt, Michal Hocko,
Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On 7/10/25 21:13, Alexei Starovoitov wrote:
> On Thu, Jul 10, 2025 at 8:05 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 7/10/25 12:21, Harry Yoo wrote:
>> > On Thu, Jul 10, 2025 at 11:36:02AM +0200, Vlastimil Babka wrote:
>> >> On 7/9/25 03:53, Alexei Starovoitov wrote:
>> >>
>> >> Hm but this is leaking the slab we allocated and have in the "slab"
>> >> variable, we need to free it back in that case.
>
> ohh. sorry for the silly mistake.
> Re-reading the diff again I realized that I made a similar mistake in
> alloc_single_from_new_slab().
> It has this bit:
> if (!alloc_debug_processing(...))
> return NULL;
>
> so I assumed that doing:
> if (!spin_trylock_irqsave(&n->list_lock,..))
> return NULL;
>
> is ok too. Now I see that !alloc_debug is purposefully leaking memory.
>
> Should we add:
> @@ -2841,6 +2841,7 @@ static void *alloc_single_from_new_slab(struct
> kmem_cache *s, struct slab *slab,
> * It's not really expected that this would fail on a
> * freshly allocated slab, but a concurrent memory
> * corruption in theory could cause that.
> + * Leak newly allocated slab.
> */
> return NULL;
>
> so the next person doesn't make the same mistake?
Sure, thanks.
>> So... since we succeeded taking it from the list and thus the spin_trylock,
>> it means it's safe to spinlock n->list_lock again - we might be waiting on
>> other cpu to unlock it but we know we didn't NMI on our own cpu having the
>> lock, right? But we'd probably need to convince lockdep about this somehow,
>> and also remember if we allocated a new slab or taken on from the partial
>> list... or just deal with this unlikely situation in another irq work :/
>
> irq_work might be the least mind bending.
> Good point about partial vs new slab.
> For partial we can indeed proceed with deactivate_slab() and if
> I'm reading the code correctly, it won't have new.inuse == 0,
Hm I think due to other cpus freeing objects concurrently it might end up
with 0.
> so it won't go to discard_slab() (which won't be safe in this path)
> But teaching lockdep that below bit in deactivate_slab() is safe:
> } else if (new.freelist) {
> spin_lock_irqsave(&n->list_lock, flags);
> add_partial(n, slab, tail);
> is a challenge.
>
> Since defer_free_work is there, I'm leaning to reuse it for
> deactive_slab too. It will process
> static DEFINE_PER_CPU(struct llist_head, defer_free_objects);
> and
> static DEFINE_PER_CPU(struct llist_head, defer_deactivate_slabs);
Should work. Also deactivate_slab() should be the correct operation for both
a slab from partial list and a newly allocated one.
But oops, where do we store all the parameters for deactivate_slab()? We can
probably reuse the union with "struct list_head slab_list" for queueing.
The kmem_cache pointer can simply be taken from struct slab, it's already there.
But the separate flush_freelist pointer? Maybe take advantage of list_head
being two pointers and struct llist_node just one pointer, so what we need
will still fit?
Otherwise we could do the first two phases of deactivate_slab() immediately
and only defer the third phase where the freelists are already merged and
there's no freelist pointer to handle anymore. But if it's not necessary,
let's not complicate.
Also should kmem_cache_destroy() path now get a barrier to flush all pending
irq_work? Does it exist?
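If not, something like this (sketch, name made up) driven from
kmem_cache_destroy() should do, since irq_work_sync() exists:

void defer_free_barrier(void)
{
	int cpu;

	for_each_possible_cpu(cpu)
		irq_work_sync(per_cpu_ptr(&defer_free_work, cpu));
}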
> Shouldn't be too ugly. Better ideas?
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-11 7:36 ` Harry Yoo
2025-07-11 7:40 ` Harry Yoo
@ 2025-07-11 10:48 ` Vlastimil Babka
1 sibling, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-11 10:48 UTC (permalink / raw)
To: Harry Yoo, Alexei Starovoitov
Cc: bpf, linux-mm, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
On 7/11/25 09:36, Harry Yoo wrote:
> On Tue, Jul 08, 2025 at 06:53:03PM -0700, Alexei Starovoitov wrote:
>
> By adding it to the lockless list, it's overwriting freed objects,
> and it's not always safe.
>
> Looking at calculate_sizes():
>
> if (((flags & SLAB_TYPESAFE_BY_RCU) && !args->use_freeptr_offset) ||
> (flags & SLAB_POISON) || s->ctor ||
> ((flags & SLAB_RED_ZONE) &&
> (s->object_size < sizeof(void *) || slub_debug_orig_size(s)))) {
> /*
> * Relocate free pointer after the object if it is not
> * permitted to overwrite the first word of the object on
> * kmem_cache_free.
> *
> * This is the case if we do RCU, have a constructor or
> * destructor, are poisoning the objects, or are
> * redzoning an object smaller than sizeof(void *) or are
> * redzoning an object with slub_debug_orig_size() enabled,
> * in which case the right redzone may be extended.
> *
> * The assumption that s->offset >= s->inuse means free
> * pointer is outside of the object is used in the
> * freeptr_outside_object() function. If that is no
> * longer true, the function needs to be modified.
> */
> s->offset = size;
> size += sizeof(void *);
>
> Only sizeof(void *) bytes from object + s->offset is always safe to overwrite.
Great point! Agreed.
> So either 1) teach defer_free() that it needs to use s->offset for each
> object, instead of zero (and that the list can have objects from
> different caches), or 2) introduce per-cache per-CPU lockless lists?
1) should be feasible. s->offset should be available when queueing the
object, and in free_deferred_objects() you already obtain "s" too so it's
trivial to subtract s->offset back. Thus 2) is unnecessary overhead.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-11 9:55 ` Vlastimil Babka
@ 2025-07-11 15:17 ` Sebastian Andrzej Siewior
2025-07-11 15:23 ` Vlastimil Babka
2025-07-12 2:19 ` Alexei Starovoitov
0 siblings, 2 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-11 15:17 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Alexei Starovoitov, bpf, linux-mm, harry.yoo, shakeel.butt,
mhocko, andrii, memxor, akpm, peterz, rostedt, hannes
On 2025-07-11 11:55:22 [+0200], Vlastimil Babka wrote:
> On 7/11/25 09:50, Sebastian Andrzej Siewior wrote:
> > On 2025-07-08 18:53:00 [-0700], Alexei Starovoitov wrote:
> >> From: Alexei Starovoitov <ast@kernel.org>
> >>
> >> Introduce local_lock_lockdep_start/end() pair to teach lockdep
> >> about a region of execution where per-cpu local_lock is not taken
> >> and lockdep should consider such local_lock() as "trylock" to
> >> avoid multiple false-positives:
> >> - lockdep doesn't like when the same lock is taken in normal and
> >> in NMI context
> >> - lockdep cannot recognize that local_locks that protect kmalloc
> >> buckets are different local_locks and not taken together
> >>
> >> This pair of lockdep aid is used by slab in the following way:
> >>
> >> if (local_lock_is_locked(&s->cpu_slab->lock))
> >> goto out;
> >> local_lock_lockdep_start(&s->cpu_slab->lock);
> >> p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> >> local_lock_lockdep_end(&s->cpu_slab->lock);
> >>
> >> Where ___slab_alloc() is calling
> >> local_lock_irqsave(&s->cpu_slab->lock, ...) many times,
> >> and all of them will not deadlock since this lock is not taken.
> >
> > So you prefer this instead of using a trylock variant in ___slab_alloc()
> > which would simply return in case the trylock fails?
>
> The code isn't always in a position to "simply return". On !RT I think we
> can at least assume that if we succeeded once, it means we're not a irq/nmi
> interrupting a locked context so we'll succeed the following attempts too.
> On RT IIUC the lock might be taken by someone else, so a trylock might fail
> (even if it should also mean we're in a context that can do a non-try lock).
There is this parent check. If the parent check "allows" the allocation
then on !RT the trylock should always succeed. So the "return empty-handed"
path would be there but should not actually happen.
On RT this is different so local_lock_is_locked() will return false but
the trylock might fail if the lock is acquired by another task.
But then with this change we do trylock from lockdep's point of view
while in reality we do the full locking including possible context
switch.
That is why I don't like the part where we trick lockdep.
If we have the parent check, we could trylock for !RT and do a normal lock
for RT, which is what we actually do.
If there is no parent check then we could do "normal lock" on both
sides.
Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-11 15:17 ` Sebastian Andrzej Siewior
@ 2025-07-11 15:23 ` Vlastimil Babka
2025-07-12 2:19 ` Alexei Starovoitov
1 sibling, 0 replies; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-11 15:23 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Alexei Starovoitov, bpf, linux-mm, harry.yoo, shakeel.butt,
mhocko, andrii, memxor, akpm, peterz, rostedt, hannes
On 7/11/25 17:17, Sebastian Andrzej Siewior wrote:
> On 2025-07-11 11:55:22 [+0200], Vlastimil Babka wrote:
>> On 7/11/25 09:50, Sebastian Andrzej Siewior wrote:
>> > On 2025-07-08 18:53:00 [-0700], Alexei Starovoitov wrote:
>> >> From: Alexei Starovoitov <ast@kernel.org>
>> >>
>> >> Introduce local_lock_lockdep_start/end() pair to teach lockdep
>> >> about a region of execution where per-cpu local_lock is not taken
>> >> and lockdep should consider such local_lock() as "trylock" to
>> >> avoid multiple false-positives:
>> >> - lockdep doesn't like when the same lock is taken in normal and
>> >> in NMI context
>> >> - lockdep cannot recognize that local_locks that protect kmalloc
>> >> buckets are different local_locks and not taken together
>> >>
>> >> This pair of lockdep aid is used by slab in the following way:
>> >>
>> >> if (local_lock_is_locked(&s->cpu_slab->lock))
>> >> goto out;
>> >> local_lock_lockdep_start(&s->cpu_slab->lock);
>> >> p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
>> >> local_lock_lockdep_end(&s->cpu_slab->lock);
>> >>
>> >> Where ___slab_alloc() is calling
>> >> local_lock_irqsave(&s->cpu_slab->lock, ...) many times,
>> >> and all of them will not deadlock since this lock is not taken.
>> >
>> > So you prefer this instead of using a trylock variant in ___slab_alloc()
>> > which would simply return in case the trylock fails?
>>
>> The code isn't always in a position to "simply return". On !RT I think we
>> can at least assume that if we succeeded once, it means we're not a irq/nmi
>> interrupting a locked context so we'll succeed the following attempts too.
>> On RT IIUC the lock might be taken by someone else, so a trylock might fail
>> (even if it should also mean we're in a context that can do a non-try lock).
>
> There is this parent check. If the parent check "allows" the allocation
> then on !RT the trylock should always succeed. So the return "empty
> handed" would be there but should not happen kind of thing.
>
> On RT this is different so local_lock_is_locked() will return false but
> the trylock might fail if the lock is acquired by another task.
>
> But then with this change we do trylock from lockdep's point of view
> while in reality we do the full locking including possible context
> switch.
>
> That is why I don't like the part where we trick lockdep.
>
> If we the parent check we could trylock for !RT and normal lock for RT
> what we actual do.
> If there is no parent check then we could do "normal lock" on both
> sides.
So you mean the approach of v1 with local_lock_irqsave_check()?
https://lore.kernel.org/bpf/20250501032718.65476-5-alexei.starovoitov@gmail.com/#t
> Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-11 10:30 ` Vlastimil Babka
@ 2025-07-12 1:55 ` Alexei Starovoitov
0 siblings, 0 replies; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-12 1:55 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Harry Yoo, bpf, linux-mm, Shakeel Butt, Michal Hocko,
Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Fri, Jul 11, 2025 at 12:30:19PM +0200, Vlastimil Babka wrote:
> > and
> > static DEFINE_PER_CPU(struct llist_head, defer_deactivate_slabs);
>
> Should work. Also deactivate_slab() should be the correct operation for both
> a slab from partial list and a newly allocated one.
> But oops, where do we store all the parameters for deactivate_slab() We can
> probably reuse the union with "struct list_head slab_list" for queueing.
> kmem_cache pointer can be simply taken from struct slab, it's already tehre.
> But the separate flush_freelist pointer? Maybe take advantage of list_head
> being two pointers and struct llist_node just one pointer, so what we need
> will still fit?
>
> Otherwise we could do the first two phases of deactivate_slab() immediately
> and only defer the third phase where the freelists are already merged and
> there's no freelist pointer to handle anymore. But if it's not necessary,
> let's not complicate.
>
> Also should kmem_cache_destroy() path now get a barrier to flush all pending
> irq_work? Does it exist?
Thanks a lot everyone for great feedback.
Here is what I have so far that addresses the comments.
The only thing I struggle with is how to properly test the
"if (unlikely(c->slab))"
condition in retry_load_slab.
I couldn't trigger it no matter what I tried.
So I manually unit-tested defer_deactivate_slab() bits with hacks.
Will fold and respin next week.
--
From 7efd089831b1e1968f12c7c4e058375bd126f9f6 Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast@kernel.org>
Date: Fri, 11 Jul 2025 16:56:12 -0700
Subject: [PATCH slab] slab: fixes
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
mm/Kconfig | 1 +
mm/slab.h | 6 +++
mm/slab_common.c | 3 ++
mm/slub.c | 112 ++++++++++++++++++++++++++++++-----------------
4 files changed, 83 insertions(+), 39 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 0287e8d94aea..331a14d678b3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -206,6 +206,7 @@ menu "Slab allocator options"
config SLUB
def_bool y
+ select IRQ_WORK
config KVFREE_RCU_BATCHED
def_bool y
diff --git a/mm/slab.h b/mm/slab.h
index 05a21dc796e0..65f4616b41de 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -57,6 +57,10 @@ struct slab {
struct {
union {
struct list_head slab_list;
+ struct { /* For deferred deactivate_slab() */
+ struct llist_node llnode;
+ void *flush_freelist;
+ };
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct {
struct slab *next;
@@ -680,6 +684,8 @@ void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
void __check_heap_object(const void *ptr, unsigned long n,
const struct slab *slab, bool to_user);
+void defer_free_barrier(void);
+
static inline bool slub_debug_orig_size(struct kmem_cache *s)
{
return (kmem_cache_debug_flags(s, SLAB_STORE_USER) &&
diff --git a/mm/slab_common.c b/mm/slab_common.c
index bfe7c40eeee1..937af8ab2501 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -507,6 +507,9 @@ void kmem_cache_destroy(struct kmem_cache *s)
rcu_barrier();
}
+ /* Wait for deferred work from kmalloc/kfree_nolock() */
+ defer_free_barrier();
+
cpus_read_lock();
mutex_lock(&slab_mutex);
diff --git a/mm/slub.c b/mm/slub.c
index dbecfd412e41..dc889cc59809 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2460,10 +2460,10 @@ static inline struct slab *alloc_slab_page(gfp_t flags, int node,
struct slab *slab;
unsigned int order = oo_order(oo);
- if (unlikely(!allow_spin)) {
+ if (unlikely(!allow_spin))
folio = (struct folio *)alloc_frozen_pages_nolock(0/* __GFP_COMP is implied */,
node, order);
- } else if (node == NUMA_NO_NODE)
+ else if (node == NUMA_NO_NODE)
folio = (struct folio *)alloc_frozen_pages(flags, order);
else
folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
@@ -3694,6 +3694,8 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
return freelist;
}
+static void defer_deactivate_slab(struct slab *slab);
+
/*
* Slow path. The lockless freelist is empty or we need to perform
* debugging duties.
@@ -3742,14 +3744,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(!node_match(slab, node))) {
/*
* same as above but node_match() being false already
- * implies node != NUMA_NO_NODE
+ * implies node != NUMA_NO_NODE.
+ * Reentrant slub cannot take locks necessary to
+ * deactivate_slab, hence ignore node preference.
+ * kmalloc_nolock() doesn't allow __GFP_THISNODE.
*/
if (!node_isset(node, slab_nodes) ||
!allow_spin) {
- /*
- * Reentrant slub cannot take locks necessary
- * to deactivate_slab, hence downgrade to any node
- */
node = NUMA_NO_NODE;
} else {
stat(s, ALLOC_NODE_MISMATCH);
@@ -3953,19 +3954,19 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
void *flush_freelist = c->freelist;
struct slab *flush_slab = c->slab;
- if (unlikely(!allow_spin))
- /*
- * Reentrant slub cannot take locks
- * necessary for deactivate_slab()
- */
- return NULL;
c->slab = NULL;
c->freelist = NULL;
c->tid = next_tid(c->tid);
local_unlock_irqrestore(&s->cpu_slab->lock, flags);
- deactivate_slab(s, flush_slab, flush_freelist);
+ if (unlikely(!allow_spin)) {
+ /* Reentrant slub cannot take locks, defer */
+ flush_slab->flush_freelist = flush_freelist;
+ defer_deactivate_slab(flush_slab);
+ } else {
+ deactivate_slab(s, flush_slab, flush_freelist);
+ }
stat(s, CPUSLAB_FLUSH);
@@ -4707,18 +4708,36 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
discard_slab(s, slab);
}
-static DEFINE_PER_CPU(struct llist_head, defer_free_objects);
-static DEFINE_PER_CPU(struct irq_work, defer_free_work);
+struct defer_free {
+ struct llist_head objects;
+ struct llist_head slabs;
+ struct irq_work work;
+};
+
+static void free_deferred_objects(struct irq_work *work);
+static DEFINE_PER_CPU(struct defer_free, defer_free_objects) = {
+ .objects = LLIST_HEAD_INIT(objects),
+ .slabs = LLIST_HEAD_INIT(slabs),
+ .work = IRQ_WORK_INIT(free_deferred_objects),
+};
+
+/*
+ * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
+ * to take sleeping spin_locks from __slab_free() and deactivate_slab().
+ * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
+ */
static void free_deferred_objects(struct irq_work *work)
{
- struct llist_head *llhead = this_cpu_ptr(&defer_free_objects);
+ struct defer_free *df = container_of(work, struct defer_free, work);
+ struct llist_head *objs = &df->objects;
+ struct llist_head *slabs = &df->slabs;
struct llist_node *llnode, *pos, *t;
- if (llist_empty(llhead))
+ if (llist_empty(objs) && llist_empty(slabs))
return;
- llnode = llist_del_all(llhead);
+ llnode = llist_del_all(objs);
llist_for_each_safe(pos, t, llnode) {
struct kmem_cache *s;
struct slab *slab;
@@ -4727,6 +4746,7 @@ static void free_deferred_objects(struct irq_work *work)
slab = virt_to_slab(x);
s = slab->slab_cache;
+ x -= s->offset;
/*
* memcg, kasan_slab_pre are already done for 'x'.
* The only thing left is kasan_poison.
@@ -4734,26 +4754,39 @@ static void free_deferred_objects(struct irq_work *work)
kasan_slab_free(s, x, false, false, true);
__slab_free(s, slab, x, x, 1, _THIS_IP_);
}
+
+ llnode = llist_del_all(slabs);
+ llist_for_each_safe(pos, t, llnode) {
+ struct slab *slab = container_of(pos, struct slab, llnode);
+
+ deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
+ }
}
-static int __init init_defer_work(void)
+static void defer_free(struct kmem_cache *s, void *head)
{
- int cpu;
+ struct defer_free *df = this_cpu_ptr(&defer_free_objects);
- for_each_possible_cpu(cpu) {
- init_llist_head(per_cpu_ptr(&defer_free_objects, cpu));
- init_irq_work(per_cpu_ptr(&defer_free_work, cpu),
- free_deferred_objects);
- }
- return 0;
+ if (llist_add(head + s->offset, &df->objects))
+ irq_work_queue(&df->work);
}
-late_initcall(init_defer_work);
-static void defer_free(void *head)
+static void defer_deactivate_slab(struct slab *slab)
{
- if (llist_add(head, this_cpu_ptr(&defer_free_objects)))
- irq_work_queue(this_cpu_ptr(&defer_free_work));
+ struct defer_free *df = this_cpu_ptr(&defer_free_objects);
+
+ if (llist_add(&slab->llnode, &df->slabs))
+ irq_work_queue(&df->work);
+}
+
+void defer_free_barrier(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
}
+
#ifndef CONFIG_SLUB_TINY
/*
* Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -4774,6 +4807,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
struct slab *slab, void *head, void *tail,
int cnt, unsigned long addr)
{
+ /* cnt == 0 signals that it's called from kfree_nolock() */
+ bool allow_spin = cnt;
struct kmem_cache_cpu *c;
unsigned long tid;
void **freelist;
@@ -4792,28 +4827,27 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
barrier();
if (unlikely(slab != c->slab)) {
- /* cnt == 0 signals that it's called from kfree_nolock() */
- if (unlikely(!cnt)) {
+ if (unlikely(!allow_spin)) {
/*
* __slab_free() can locklessly cmpxchg16 into a slab,
* but then it might need to take spin_lock or local_lock
* in put_cpu_partial() for further processing.
* Avoid the complexity and simply add to a deferred list.
*/
- defer_free(head);
+ defer_free(s, head);
} else {
__slab_free(s, slab, head, tail, cnt, addr);
}
return;
}
- if (unlikely(!cnt)) {
+ if (unlikely(!allow_spin)) {
if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
local_lock_is_locked(&s->cpu_slab->lock)) {
- defer_free(head);
+ defer_free(s, head);
return;
}
- cnt = 1;
+ cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
kasan_slab_free(s, head, false, false, /* skip quarantine */true);
}
@@ -5065,7 +5099,7 @@ void kfree(const void *object)
EXPORT_SYMBOL(kfree);
/*
- * Can be called while holding raw_spin_lock or from IRQ and NMI,
+ * Can be called while holding raw_spinlock_t or from IRQ and NMI,
* but only for objects allocated by kmalloc_nolock(),
* since some debug checks (like kmemleak and kfence) were
* skipped on allocation. large_kmalloc is not supported either.
@@ -5115,7 +5149,7 @@ void kfree_nolock(const void *object)
#ifndef CONFIG_SLUB_TINY
do_slab_free(s, slab, x, x, 0, _RET_IP_);
#else
- defer_free(x);
+ defer_free(s, x);
#endif
}
EXPORT_SYMBOL_GPL(kfree_nolock);
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-11 15:17 ` Sebastian Andrzej Siewior
2025-07-11 15:23 ` Vlastimil Babka
@ 2025-07-12 2:19 ` Alexei Starovoitov
2025-07-14 11:06 ` Sebastian Andrzej Siewior
1 sibling, 1 reply; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-12 2:19 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Vlastimil Babka, bpf, linux-mm, Harry Yoo, Shakeel Butt,
Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Fri, Jul 11, 2025 at 8:17 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2025-07-11 11:55:22 [+0200], Vlastimil Babka wrote:
> > On 7/11/25 09:50, Sebastian Andrzej Siewior wrote:
> > > On 2025-07-08 18:53:00 [-0700], Alexei Starovoitov wrote:
> > >> From: Alexei Starovoitov <ast@kernel.org>
> > >>
> > >> Introduce local_lock_lockdep_start/end() pair to teach lockdep
> > >> about a region of execution where per-cpu local_lock is not taken
> > >> and lockdep should consider such local_lock() as "trylock" to
> > >> avoid multiple false-positives:
> > >> - lockdep doesn't like when the same lock is taken in normal and
> > >> in NMI context
> > >> - lockdep cannot recognize that local_locks that protect kmalloc
> > >> buckets are different local_locks and not taken together
> > >>
> > >> This pair of lockdep aid is used by slab in the following way:
> > >>
> > >> if (local_lock_is_locked(&s->cpu_slab->lock))
> > >> goto out;
> > >> local_lock_lockdep_start(&s->cpu_slab->lock);
> > >> p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> > >> local_lock_lockdep_end(&s->cpu_slab->lock);
> > >>
> > >> Where ___slab_alloc() is calling
> > >> local_lock_irqsave(&s->cpu_slab->lock, ...) many times,
> > >> and all of them will not deadlock since this lock is not taken.
> > >
> > > So you prefer this instead of using a trylock variant in ___slab_alloc()
> > > which would simply return in case the trylock fails?
> >
> > The code isn't always in a position to "simply return". On !RT I think we
> > can at least assume that if we succeeded once, it means we're not a irq/nmi
> > interrupting a locked context so we'll succeed the following attempts too.
> > On RT IIUC the lock might be taken by someone else, so a trylock might fail
> > (even if it should also mean we're in a context that can do a non-try lock).
>
> There is this parent check. If the parent check "allows" the allocation
> then on !RT the trylock should always succeed. So the return "empty
> handed" would be there but should not happen kind of thing.
So you're proposing to replace four local_lock_irqsave() in ___slab_alloc()
with if (!local_trylock_irqsave()) return NULL;
and a nasty comment that it shouldn't happen because we did
local_lock_is_locked() in the caller?
But for RT it will pessimize kmalloc_nolock() chances. More below:
> On RT this is different so local_lock_is_locked() will return false but
> the trylock might fail if the lock is acquired by another task.
Exactly and that's what we need to avoid.
Sleeping in rt_spin_lock() is fine here, since the current task
doesn't hold this per-cpu local_lock.
But there is no such lockdep concept.
Hence the need for local_lock_lockdep_start(), which is purely a
lockdep aid and doesn't affect the locking logic or checks.
> But then with this change we do trylock from lockdep's point of view
> while in reality we do the full locking including possible context
> switch.
correct. In RT it's better to have a full rt_spin_lock.
> That is why I don't like the part where we trick lockdep.
yes. we do trick lockdep. I don't see an alternative.
lockdep doesn't understand this part either:
"inconsistent {INITIAL USE} -> {IN-NMI} usage"
So it has to be tricked regardless.
> If we the parent check we could trylock for !RT and normal lock for RT
> what we actual do.
How would you do a normal rt_spin_lock()?
lockdep will yell for the two reasons listed in the commit log of this patch.
> If there is no parent check then we could do "normal lock" on both
> sides.
How would ___slab_alloc() know whether there was a parent check or not?
imo keeping local_lock_irqsave() as-is is cleaner,
since if there is no parent check lockdep will rightfully complain.
One can argue that local_lock_is_locked() and local_lock_lockdep_start()
should be paired together, and that's what I had in v1, but they're
really different things: local_lock_is_locked() is a true run-time
check regardless of lockdep, while the other is a lockdep-specific band-aid.
Keeping them next to each other in __slab_alloc() looks cleaner.
Maybe a bigger comment is necessary.
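E.g. something along these lines in __slab_alloc(), matching the hunk in
patch 6/6 (the comment wording below is only a draft):

	if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
		if (local_lock_is_locked(&s->cpu_slab->lock)) {
			/*
			 * Run-time check: this CPU already holds the
			 * cpu_slab lock, so re-entering ___slab_alloc()
			 * could deadlock. EBUSY is an internal signal to
			 * kmalloc_nolock() to retry a different bucket.
			 */
			p = ERR_PTR(-EBUSY);
			goto out;
		}
		/*
		 * The check above guarantees that all
		 * local_lock_irqsave(&s->cpu_slab->lock, ...) calls inside
		 * ___slab_alloc() cannot deadlock, but lockdep cannot see
		 * that. Treat them as trylocks for this region only, so that
		 * neither the NMI re-entrancy case nor the different-bucket
		 * case is reported as a deadlock.
		 */
		local_lock_lockdep_start(&s->cpu_slab->lock);
		p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
		local_lock_lockdep_end(&s->cpu_slab->lock);
	} else {
		p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
	}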
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-12 2:19 ` Alexei Starovoitov
@ 2025-07-14 11:06 ` Sebastian Andrzej Siewior
2025-07-14 15:35 ` Vlastimil Babka
2025-07-14 17:52 ` Alexei Starovoitov
0 siblings, 2 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-14 11:06 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Vlastimil Babka, bpf, linux-mm, Harry Yoo, Shakeel Butt,
Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On 2025-07-11 19:19:26 [-0700], Alexei Starovoitov wrote:
> > If there is no parent check then we could do "normal lock" on both
> > sides.
>
> How would ___slab_alloc() know whether there was a parent check or not?
>
> imo keeping local_lock_irqsave() as-is is cleaner,
> since if there is no parent check lockdep will rightfully complain.
what about this:
diff --git a/mm/slub.c b/mm/slub.c
index 7e2ffe1d46c6c..3520d1c25c205 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3693,6 +3693,34 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
return freelist;
}
+static void local_lock_cpu_slab(struct kmem_cache *s, const gfp_t gfp_flags,
+ unsigned long *flags)
+{
+ bool allow_spin = gfpflags_allow_spinning(gfp_flags);
+
+ /*
+ * ___slab_alloc()'s caller is supposed to check whether kmem_cache::kmem_cache_cpu::lock
+ * can be acquired without a deadlock before invoking the function.
+ *
+ * On PREEMPT_RT an invocation is not possible from IRQ-off or
+ * preempt-disabled context. The lock will always be acquired and, if
+ * needed, it will block and sleep until the lock is available.
+ *
+ * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
+ * is always acquired with interrupts disabled, meaning it is always
+ * possible to acquire it.
+ * In NMI context it is necessary to check whether the lock is already
+ * acquired. If it is not, it is safe to acquire it here. The trylock
+ * semantic is used to tell lockdep that we don't spin. The BUG_ON() will
+ * not trigger as long as acquiring the lock is safe.
+ *
+ */
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT) && !allow_spin)
+ BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
+ else
+ local_lock_irqsave(&s->cpu_slab->lock, *flags);
+}
+
/*
* Slow path. The lockless freelist is empty or we need to perform
* debugging duties.
@@ -3765,7 +3793,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto deactivate_slab;
/* must check again c->slab in case we got preempted and it changed */
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock_cpu_slab(s, gfpflags, &flags);
+
if (unlikely(slab != c->slab)) {
local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_slab;
@@ -3803,7 +3832,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
deactivate_slab:
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock_cpu_slab(s, gfpflags, &flags);
if (slab != c->slab) {
local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_slab;
@@ -3819,7 +3848,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
#ifdef CONFIG_SLUB_CPU_PARTIAL
while (slub_percpu_partial(c)) {
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock_cpu_slab(s, gfpflags, &flags);
if (unlikely(c->slab)) {
local_unlock_irqrestore(&s->cpu_slab->lock, flags);
goto reread_slab;
@@ -3947,7 +3976,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
retry_load_slab:
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock_cpu_slab(s, gfpflags, &flags);
if (unlikely(c->slab)) {
void *flush_freelist = c->freelist;
struct slab *flush_slab = c->slab;
@@ -4003,12 +4032,8 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
p = ERR_PTR(-EBUSY);
goto out;
}
- local_lock_lockdep_start(&s->cpu_slab->lock);
- p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
- local_lock_lockdep_end(&s->cpu_slab->lock);
- } else {
- p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
}
+ p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
out:
#ifdef CONFIG_PREEMPT_COUNT
slub_put_cpu_ptr(s->cpu_slab);
Sebastian
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-14 11:06 ` Sebastian Andrzej Siewior
@ 2025-07-14 15:35 ` Vlastimil Babka
2025-07-14 15:54 ` Sebastian Andrzej Siewior
2025-07-14 17:52 ` Alexei Starovoitov
1 sibling, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-14 15:35 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, Alexei Starovoitov
Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Steven Rostedt, Johannes Weiner
On 7/14/25 13:06, Sebastian Andrzej Siewior wrote:
> On 2025-07-11 19:19:26 [-0700], Alexei Starovoitov wrote:
>> > If there is no parent check then we could do "normal lock" on both
>> > sides.
>>
>> How would ___slab_alloc() know whether there was a parent check or not?
>>
>> imo keeping local_lock_irqsave() as-is is cleaner,
>> since if there is no parent check lockdep will rightfully complain.
>
> what about this:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 7e2ffe1d46c6c..3520d1c25c205 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3693,6 +3693,34 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
> return freelist;
> }
>
> +static void local_lock_cpu_slab(struct kmem_cache *s, const gfp_t gfp_flags,
> + unsigned long *flags)
> +{
> + bool allow_spin = gfpflags_allow_spinning(gfp_flags);
> +
> + /*
> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> + * can be acquired without a deadlock before invoking the function.
> + *
> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> + * disabled context. The lock will always be acquired and if needed it
> + * block and sleep until the lock is available.
> + *
> + * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
> + * is always acquired with disabled interrupts meaning it is always
> + * possible to it.
> + * In NMI context it is needed to check if the lock is acquired. If it is not,
> + * it is safe to acquire it. The trylock semantic is used to tell lockdep
> + * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
> + * the lock.
> + *
> + */
> + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && !allow_spin)
> + BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
> + else
> + local_lock_irqsave(&s->cpu_slab->lock, *flags);
If we go with this, then I think the better approach would be simply:
if (unlikely(!local_trylock_irqsave(&s->cpu_slab->lock, *flags))
local_lock_irqsave(&s->cpu_slab->lock, *flags);
- no branches before the likely to succeed local_trylock_irqsave()
- the unlikely local_lock_irqsave() fallback exists to handle the PREEMPT_RT
case / provide lockdep checks in case of screwing up
- we don't really need to evaluate allow_spin or add BUG_ON() (which is
actively disallowed to add these days anyway) - if we screw up, either
lockdep will splat, or we deadlock
Also I'm thinking on !PREEMPT_RT && !LOCKDEP we don't even need the fallback
local_lock_irqsave part? The trylock is supposed to always succeed, right?
Either we allow spinning and that means we're not under kmalloc_nolock() and
should not be interrupting the locked section (as before this series). Or
it's the opposite and then the earlier local_lock_is_locked() check should
have prevented us from going here. So I guess we could just trylock without
checking the return value - any screw up should blow up quickly even without
the BUG_ON().
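To be concrete, with the trylock-first variant the whole helper would reduce
to something like this (just a sketch):

	static void local_lock_cpu_slab(struct kmem_cache *s, unsigned long *flags)
	{
		/*
		 * Expected to succeed on the spot on !PREEMPT_RT thanks to the
		 * caller's local_lock_is_locked() check; the fallback exists for
		 * PREEMPT_RT and so lockdep can catch a missing parent check.
		 */
		if (unlikely(!local_trylock_irqsave(&s->cpu_slab->lock, *flags)))
			local_lock_irqsave(&s->cpu_slab->lock, *flags);
	}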
> +}
> +
> /*
> * Slow path. The lockless freelist is empty or we need to perform
> * debugging duties.
> @@ -3765,7 +3793,8 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> goto deactivate_slab;
>
> /* must check again c->slab in case we got preempted and it changed */
> - local_lock_irqsave(&s->cpu_slab->lock, flags);
> + local_lock_cpu_slab(s, gfpflags, &flags);
> +
> if (unlikely(slab != c->slab)) {
> local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> goto reread_slab;
> @@ -3803,7 +3832,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>
> deactivate_slab:
>
> - local_lock_irqsave(&s->cpu_slab->lock, flags);
> + local_lock_cpu_slab(s, gfpflags, &flags);
> if (slab != c->slab) {
> local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> goto reread_slab;
> @@ -3819,7 +3848,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>
> #ifdef CONFIG_SLUB_CPU_PARTIAL
> while (slub_percpu_partial(c)) {
> - local_lock_irqsave(&s->cpu_slab->lock, flags);
> + local_lock_cpu_slab(s, gfpflags, &flags);
> if (unlikely(c->slab)) {
> local_unlock_irqrestore(&s->cpu_slab->lock, flags);
> goto reread_slab;
> @@ -3947,7 +3976,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>
> retry_load_slab:
>
> - local_lock_irqsave(&s->cpu_slab->lock, flags);
> + local_lock_cpu_slab(s, gfpflags, &flags);
> if (unlikely(c->slab)) {
> void *flush_freelist = c->freelist;
> struct slab *flush_slab = c->slab;
> @@ -4003,12 +4032,8 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> p = ERR_PTR(-EBUSY);
> goto out;
> }
> - local_lock_lockdep_start(&s->cpu_slab->lock);
> - p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> - local_lock_lockdep_end(&s->cpu_slab->lock);
> - } else {
> - p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> }
> + p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> out:
> #ifdef CONFIG_PREEMPT_COUNT
> slub_put_cpu_ptr(s->cpu_slab);
>
>
> Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-14 15:35 ` Vlastimil Babka
@ 2025-07-14 15:54 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 38+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-07-14 15:54 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Alexei Starovoitov, bpf, linux-mm, Harry Yoo, Shakeel Butt,
Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On 2025-07-14 17:35:52 [+0200], Vlastimil Babka wrote:
> If we go with this, then I think the better approach would be simply:
>
> if (unlikely(!local_trylock_irqsave(&s->cpu_slab->lock, *flags))
> local_lock_irqsave(&s->cpu_slab->lock, *flags);
>
> - no branches before the likely to succeed local_trylock_irqsave()
> - the unlikely local_lock_irqsave() fallback exists to handle the PREEMPT_RT
> case / provide lockdep checks in case of screwing up
> - we don't really need to evaluate allow_spin or add BUG_ON() (which is
> actively disallowed to add these days anyway) - if we screw up, either
> lockdep will splat, or we deadlock
Some people added BUG_ON() in cases where a warning would be more
applicable and recovery would still be possible. I don't see how to
recover from this (unless you want to return NULL), plus it should not
happen.
The only downside would be that you don't evaluate the spinning part, but
that only matters on RT since !RT should always succeed. So why not.
> Also I'm thinking on !PREEMPT_RT && !LOCKDEP we don't even need the fallback
> local_lock_irqsave part? The trylock is supposed to always succeed, right?
> Either we allow spinning and that means we're not under kmalloc_nolock() and
> should not be interrupting the locked section (as before this series). Or
> it's the opposite and then the earlier local_lock_is_locked() check should
> have prevented us from going here. So I guess we could just trylock without
> checking the return value - any screw up should blow up quickly even without
> the BUG_ON().
As explained above, under normal circumstances the trylock will always
succeed on !RT. But ignoring the return value here does not feel right,
and the API might get extended to warn if the return value is ignored.
Sebastian
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-14 11:06 ` Sebastian Andrzej Siewior
2025-07-14 15:35 ` Vlastimil Babka
@ 2025-07-14 17:52 ` Alexei Starovoitov
2025-07-14 18:33 ` Vlastimil Babka
1 sibling, 1 reply; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-14 17:52 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Vlastimil Babka, bpf, linux-mm, Harry Yoo, Shakeel Butt,
Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Mon, Jul 14, 2025 at 4:06 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2025-07-11 19:19:26 [-0700], Alexei Starovoitov wrote:
> > > If there is no parent check then we could do "normal lock" on both
> > > sides.
> >
> > How would ___slab_alloc() know whether there was a parent check or not?
> >
> > imo keeping local_lock_irqsave() as-is is cleaner,
> > since if there is no parent check lockdep will rightfully complain.
>
> what about this:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 7e2ffe1d46c6c..3520d1c25c205 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3693,6 +3693,34 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
> return freelist;
> }
>
> +static void local_lock_cpu_slab(struct kmem_cache *s, const gfp_t gfp_flags,
> + unsigned long *flags)
> +{
> + bool allow_spin = gfpflags_allow_spinning(gfp_flags);
> +
> + /*
> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> + * can be acquired without a deadlock before invoking the function.
> + *
> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> + * disabled context. The lock will always be acquired and if needed it
> + * block and sleep until the lock is available.
> + *
> + * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
> + * is always acquired with disabled interrupts meaning it is always
> + * possible to it.
> + * In NMI context it is needed to check if the lock is acquired. If it is not,
> + * it is safe to acquire it. The trylock semantic is used to tell lockdep
> + * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
> + * the lock.
> + *
> + */
> + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && !allow_spin)
> + BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
> + else
> + local_lock_irqsave(&s->cpu_slab->lock, *flags);
> +}
the patch misses these two:
diff --git a/mm/slub.c b/mm/slub.c
index 36779519b02c..2f30b85fbf68 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3260,7 +3260,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
unsigned long flags;
int slabs = 0;
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock_cpu_slab(s, 0, &flags);
oldslab = this_cpu_read(s->cpu_slab->partial);
@@ -4889,8 +4889,9 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
goto redo;
}
} else {
+ unsigned long flags;
/* Update the free list under the local lock */
- local_lock(&s->cpu_slab->lock);
+ local_lock_cpu_slab(s, 0, &flags);
c = this_cpu_ptr(s->cpu_slab);
if (unlikely(slab != c->slab)) {
local_unlock(&s->cpu_slab->lock);
I realized that the latter one was missing local_lock_lockdep_start/end()
in my patch as well, but that's secondary.
So with the above it works on !RT,
but on RT lockdep complains, as I explained earlier.
With your patch and the above hunks applied, here is the full lockdep splat:
[ 39.819636] ============================================
[ 39.819638] WARNING: possible recursive locking detected
[ 39.819641] 6.16.0-rc5-00342-gc8aca7837440-dirty #54 Tainted: G O
[ 39.819645] --------------------------------------------
[ 39.819646] page_alloc_kthr/2306 is trying to acquire lock:
[ 39.819650] ff110001f5cbea88 ((&c->lock)){+.+.}-{3:3}, at:
___slab_alloc+0xb7/0xec0
[ 39.819667]
[ 39.819667] but task is already holding lock:
[ 39.819668] ff110001f5cbfe88 ((&c->lock)){+.+.}-{3:3}, at:
___slab_alloc+0xb7/0xec0
[ 39.819677]
[ 39.819677] other info that might help us debug this:
[ 39.819678] Possible unsafe locking scenario:
[ 39.819678]
[ 39.819679] CPU0
[ 39.819680] ----
[ 39.819681] lock((&c->lock));
[ 39.819684] lock((&c->lock));
[ 39.819687]
[ 39.819687] *** DEADLOCK ***
[ 39.819687]
[ 39.819687] May be due to missing lock nesting notation
[ 39.819687]
[ 39.819689] 2 locks held by page_alloc_kthr/2306:
[ 39.819691] #0: ff110001f5cbfe88 ((&c->lock)){+.+.}-{3:3}, at:
___slab_alloc+0xb7/0xec0
[ 39.819700] #1: ffffffff8588f3a0 (rcu_read_lock){....}-{1:3}, at:
rt_spin_lock+0x197/0x250
[ 39.819710]
[ 39.819710] stack backtrace:
[ 39.819714] CPU: 1 UID: 0 PID: 2306 Comm: page_alloc_kthr Tainted:
G O 6.16.0-rc5-00342-gc8aca7837440-dirty #54
PREEMPT_RT
[ 39.819721] Tainted: [O]=OOT_MODULE
[ 39.819723] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 39.819726] Call Trace:
[ 39.819729] <TASK>
[ 39.819734] dump_stack_lvl+0x5b/0x80
[ 39.819740] print_deadlock_bug.cold+0xbd/0xca
[ 39.819747] __lock_acquire+0x12ad/0x2590
[ 39.819753] ? __lock_acquire+0x42b/0x2590
[ 39.819758] lock_acquire+0x133/0x2d0
[ 39.819763] ? ___slab_alloc+0xb7/0xec0
[ 39.819769] ? try_to_take_rt_mutex+0x624/0xfc0
[ 39.819773] ? __lock_acquire+0x42b/0x2590
[ 39.819778] rt_spin_lock+0x6f/0x250
[ 39.819783] ? ___slab_alloc+0xb7/0xec0
[ 39.819788] ? rtlock_slowlock_locked+0x5c60/0x5c60
[ 39.819792] ? rtlock_slowlock_locked+0xc3/0x5c60
[ 39.819798] ___slab_alloc+0xb7/0xec0
[ 39.819803] ? __lock_acquire+0x42b/0x2590
[ 39.819809] ? my_debug_callback+0x20e/0x390 [bpf_testmod]
[ 39.819826] ? __lock_acquire+0x42b/0x2590
[ 39.819830] ? rt_read_unlock+0x2f0/0x2f0
[ 39.819835] ? my_debug_callback+0x20e/0x390 [bpf_testmod]
[ 39.819844] ? kmalloc_nolock_noprof+0x15a/0x430
[ 39.819849] kmalloc_nolock_noprof+0x15a/0x430
[ 39.819857] my_debug_callback+0x20e/0x390 [bpf_testmod]
[ 39.819867] ? page_alloc_kthread+0x320/0x320 [bpf_testmod]
[ 39.819875] ? lock_is_held_type+0x85/0xe0
[ 39.819881] ___slab_alloc+0x256/0xec0
[ 39.819898] ? lock_acquire+0x133/0x2d0
[ 39.819927] ? __kmalloc_cache_noprof+0xd6/0x3b0
[ 39.819932] __kmalloc_cache_noprof+0xd6/0x3b0
As I said earlier lockdep _has_ to be tricked.
We cannot unconditionally call local_lock_irqsave() on RT.
lockdep doesn't understand per-cpu local_lock.
And it doesn't understand this "if !locked_by_current_task -> go and lock"
concept.
lockdep has to be taught about safe lock region (call it tricking
lockdep, but it has to be an external signal to lockdep).
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-14 17:52 ` Alexei Starovoitov
@ 2025-07-14 18:33 ` Vlastimil Babka
2025-07-14 18:46 ` Alexei Starovoitov
0 siblings, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-14 18:33 UTC (permalink / raw)
To: Alexei Starovoitov, Sebastian Andrzej Siewior
Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
Peter Zijlstra, Steven Rostedt, Johannes Weiner
On 7/14/25 19:52, Alexei Starovoitov wrote:
> On Mon, Jul 14, 2025 at 4:06 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>>
>> On 2025-07-11 19:19:26 [-0700], Alexei Starovoitov wrote:
>> > > If there is no parent check then we could do "normal lock" on both
>> > > sides.
>> >
>> > How would ___slab_alloc() know whether there was a parent check or not?
>> >
>> > imo keeping local_lock_irqsave() as-is is cleaner,
>> > since if there is no parent check lockdep will rightfully complain.
>>
>> what about this:
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 7e2ffe1d46c6c..3520d1c25c205 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3693,6 +3693,34 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
>> return freelist;
>> }
>>
>> +static void local_lock_cpu_slab(struct kmem_cache *s, const gfp_t gfp_flags,
>> + unsigned long *flags)
>> +{
>> + bool allow_spin = gfpflags_allow_spinning(gfp_flags);
>> +
>> + /*
>> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
>> + * can be acquired without a deadlock before invoking the function.
>> + *
>> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
>> + * disabled context. The lock will always be acquired and if needed it
>> + * block and sleep until the lock is available.
>> + *
>> + * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
>> + * is always acquired with disabled interrupts meaning it is always
>> + * possible to it.
>> + * In NMI context it is needed to check if the lock is acquired. If it is not,
>> + * it is safe to acquire it. The trylock semantic is used to tell lockdep
>> + * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
>> + * the lock.
>> + *
>> + */
>> + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && !allow_spin)
>> + BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
>> + else
>> + local_lock_irqsave(&s->cpu_slab->lock, *flags);
>> +}
>
> the patch misses these two:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 36779519b02c..2f30b85fbf68 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3260,7 +3260,7 @@ static void put_cpu_partial(struct kmem_cache
> *s, struct slab *slab, int drain)
> unsigned long flags;
> int slabs = 0;
>
> - local_lock_irqsave(&s->cpu_slab->lock, flags);
> + local_lock_cpu_slab(s, 0, &flags);
>
> oldslab = this_cpu_read(s->cpu_slab->partial);
>
> @@ -4889,8 +4889,9 @@ static __always_inline void do_slab_free(struct
> kmem_cache *s,
> goto redo;
> }
> } else {
> + long flags;
> /* Update the free list under the local lock */
> - local_lock(&s->cpu_slab->lock);
> + local_lock_cpu_slab(s, 0, &flags);
> c = this_cpu_ptr(s->cpu_slab);
> if (unlikely(slab != c->slab)) {
> local_unlock(&s->cpu_slab->lock);
>
> I realized that the latter one was missing local_lock_lockdep_start/end()
> in my patch as well, but that's secondary.
>
> So with above it works on !RT,
> but on RT lockdep complains as I explained earlier.
>
> With yours and above hunks applied here is full lockdep splat:
>
> [ 39.819636] ============================================
> [ 39.819638] WARNING: possible recursive locking detected
> [ 39.819641] 6.16.0-rc5-00342-gc8aca7837440-dirty #54 Tainted: G O
> [ 39.819645] --------------------------------------------
> [ 39.819646] page_alloc_kthr/2306 is trying to acquire lock:
> [ 39.819650] ff110001f5cbea88 ((&c->lock)){+.+.}-{3:3}, at:
> ___slab_alloc+0xb7/0xec0
> [ 39.819667]
> [ 39.819667] but task is already holding lock:
> [ 39.819668] ff110001f5cbfe88 ((&c->lock)){+.+.}-{3:3}, at:
> ___slab_alloc+0xb7/0xec0
> [ 39.819677]
> [ 39.819677] other info that might help us debug this:
> [ 39.819678] Possible unsafe locking scenario:
> [ 39.819678]
> [ 39.819679] CPU0
> [ 39.819680] ----
> [ 39.819681] lock((&c->lock));
> [ 39.819684] lock((&c->lock));
> [ 39.819687]
> [ 39.819687] *** DEADLOCK ***
> [ 39.819687]
> [ 39.819687] May be due to missing lock nesting notation
> [ 39.819687]
> [ 39.819689] 2 locks held by page_alloc_kthr/2306:
> [ 39.819691] #0: ff110001f5cbfe88 ((&c->lock)){+.+.}-{3:3}, at:
> ___slab_alloc+0xb7/0xec0
> [ 39.819700] #1: ffffffff8588f3a0 (rcu_read_lock){....}-{1:3}, at:
> rt_spin_lock+0x197/0x250
> [ 39.819710]
> [ 39.819710] stack backtrace:
> [ 39.819714] CPU: 1 UID: 0 PID: 2306 Comm: page_alloc_kthr Tainted:
> G O 6.16.0-rc5-00342-gc8aca7837440-dirty #54
> PREEMPT_RT
> [ 39.819721] Tainted: [O]=OOT_MODULE
> [ 39.819723] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [ 39.819726] Call Trace:
> [ 39.819729] <TASK>
> [ 39.819734] dump_stack_lvl+0x5b/0x80
> [ 39.819740] print_deadlock_bug.cold+0xbd/0xca
> [ 39.819747] __lock_acquire+0x12ad/0x2590
> [ 39.819753] ? __lock_acquire+0x42b/0x2590
> [ 39.819758] lock_acquire+0x133/0x2d0
> [ 39.819763] ? ___slab_alloc+0xb7/0xec0
> [ 39.819769] ? try_to_take_rt_mutex+0x624/0xfc0
> [ 39.819773] ? __lock_acquire+0x42b/0x2590
> [ 39.819778] rt_spin_lock+0x6f/0x250
But why are we here in ___slab_alloc, trying to take the lock...
> [ 39.819783] ? ___slab_alloc+0xb7/0xec0
> [ 39.819788] ? rtlock_slowlock_locked+0x5c60/0x5c60
> [ 39.819792] ? rtlock_slowlock_locked+0xc3/0x5c60
> [ 39.819798] ___slab_alloc+0xb7/0xec0
> [ 39.819803] ? __lock_acquire+0x42b/0x2590
> [ 39.819809] ? my_debug_callback+0x20e/0x390 [bpf_testmod]
> [ 39.819826] ? __lock_acquire+0x42b/0x2590
> [ 39.819830] ? rt_read_unlock+0x2f0/0x2f0
> [ 39.819835] ? my_debug_callback+0x20e/0x390 [bpf_testmod]
> [ 39.819844] ? kmalloc_nolock_noprof+0x15a/0x430
> [ 39.819849] kmalloc_nolock_noprof+0x15a/0x430
When, per patch 6/6, __slab_alloc() should have bailed out via
if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
+ if (local_lock_is_locked(&s->cpu_slab->lock)) {
+ /*
+ * EBUSY is an internal signal to kmalloc_nolock() to
+ * retry a different bucket. It's not propagated
+ * to the caller.
+ */
+ p = ERR_PTR(-EBUSY);
+ goto out;
+ }
So it doesn't seem to me like a lack of lockdep tricking, but rather that
we reached something we should not have, because the avoidance based on
local_lock_is_locked() above didn't work properly? At least if I read the
splat and backtrace correctly, it doesn't suggest a theoretical
scenario but that we really tried to lock something we already had locked.
> [ 39.819857] my_debug_callback+0x20e/0x390 [bpf_testmod]
What exactly did you instrument here?
> [ 39.819867] ? page_alloc_kthread+0x320/0x320 [bpf_testmod]
> [ 39.819875] ? lock_is_held_type+0x85/0xe0
> [ 39.819881] ___slab_alloc+0x256/0xec0
And here we took the lock originally?
> [ 39.819898] ? lock_acquire+0x133/0x2d0
> [ 39.819927] ? __kmalloc_cache_noprof+0xd6/0x3b0
> [ 39.819932] __kmalloc_cache_noprof+0xd6/0x3b0
>
> As I said earlier lockdep _has_ to be tricked.
> We cannot unconditionally call local_lock_irqsave() on RT.
> lockdep doesn't understand per-cpu local_lock.
> And it doesn't understand this "if !locked_by_current_task -> go and lock"
> concept.
> lockdep has to be taught about safe lock region (call it tricking
> lockdep, but it has to be an external signal to lockdep).
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-14 18:33 ` Vlastimil Babka
@ 2025-07-14 18:46 ` Alexei Starovoitov
2025-07-15 6:56 ` Vlastimil Babka
0 siblings, 1 reply; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-14 18:46 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Sebastian Andrzej Siewior, bpf, linux-mm, Harry Yoo, Shakeel Butt,
Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Mon, Jul 14, 2025 at 11:33 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/14/25 19:52, Alexei Starovoitov wrote:
> > On Mon, Jul 14, 2025 at 4:06 AM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> >>
> >> On 2025-07-11 19:19:26 [-0700], Alexei Starovoitov wrote:
> >> > > If there is no parent check then we could do "normal lock" on both
> >> > > sides.
> >> >
> >> > How would ___slab_alloc() know whether there was a parent check or not?
> >> >
> >> > imo keeping local_lock_irqsave() as-is is cleaner,
> >> > since if there is no parent check lockdep will rightfully complain.
> >>
> >> what about this:
> >>
> >> diff --git a/mm/slub.c b/mm/slub.c
> >> index 7e2ffe1d46c6c..3520d1c25c205 100644
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -3693,6 +3693,34 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
> >> return freelist;
> >> }
> >>
> >> +static void local_lock_cpu_slab(struct kmem_cache *s, const gfp_t gfp_flags,
> >> + unsigned long *flags)
> >> +{
> >> + bool allow_spin = gfpflags_allow_spinning(gfp_flags);
> >> +
> >> + /*
> >> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> >> + * can be acquired without a deadlock before invoking the function.
> >> + *
> >> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> >> + * disabled context. The lock will always be acquired and if needed it
> >> + * block and sleep until the lock is available.
> >> + *
> >> + * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
> >> + * is always acquired with disabled interrupts meaning it is always
> >> + * possible to it.
> >> + * In NMI context it is needed to check if the lock is acquired. If it is not,
> >> + * it is safe to acquire it. The trylock semantic is used to tell lockdep
> >> + * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
> >> + * the lock.
> >> + *
> >> + */
> >> + if (!IS_ENABLED(CONFIG_PREEMPT_RT) && !allow_spin)
> >> + BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
> >> + else
> >> + local_lock_irqsave(&s->cpu_slab->lock, *flags);
> >> +}
> >
> > the patch misses these two:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 36779519b02c..2f30b85fbf68 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3260,7 +3260,7 @@ static void put_cpu_partial(struct kmem_cache
> > *s, struct slab *slab, int drain)
> > unsigned long flags;
> > int slabs = 0;
> >
> > - local_lock_irqsave(&s->cpu_slab->lock, flags);
> > + local_lock_cpu_slab(s, 0, &flags);
> >
> > oldslab = this_cpu_read(s->cpu_slab->partial);
> >
> > @@ -4889,8 +4889,9 @@ static __always_inline void do_slab_free(struct
> > kmem_cache *s,
> > goto redo;
> > }
> > } else {
> > + long flags;
> > /* Update the free list under the local lock */
> > - local_lock(&s->cpu_slab->lock);
> > + local_lock_cpu_slab(s, 0, &flags);
> > c = this_cpu_ptr(s->cpu_slab);
> > if (unlikely(slab != c->slab)) {
> > local_unlock(&s->cpu_slab->lock);
> >
> > I realized that the latter one was missing local_lock_lockdep_start/end()
> > in my patch as well, but that's secondary.
> >
> > So with above it works on !RT,
> > but on RT lockdep complains as I explained earlier.
> >
> > With yours and above hunks applied here is full lockdep splat:
> >
> > [ 39.819636] ============================================
> > [ 39.819638] WARNING: possible recursive locking detected
> > [ 39.819641] 6.16.0-rc5-00342-gc8aca7837440-dirty #54 Tainted: G O
> > [ 39.819645] --------------------------------------------
> > [ 39.819646] page_alloc_kthr/2306 is trying to acquire lock:
> > [ 39.819650] ff110001f5cbea88 ((&c->lock)){+.+.}-{3:3}, at:
> > ___slab_alloc+0xb7/0xec0
> > [ 39.819667]
> > [ 39.819667] but task is already holding lock:
> > [ 39.819668] ff110001f5cbfe88 ((&c->lock)){+.+.}-{3:3}, at:
> > ___slab_alloc+0xb7/0xec0
> > [ 39.819677]
> > [ 39.819677] other info that might help us debug this:
> > [ 39.819678] Possible unsafe locking scenario:
> > [ 39.819678]
> > [ 39.819679] CPU0
> > [ 39.819680] ----
> > [ 39.819681] lock((&c->lock));
> > [ 39.819684] lock((&c->lock));
> > [ 39.819687]
> > [ 39.819687] *** DEADLOCK ***
> > [ 39.819687]
> > [ 39.819687] May be due to missing lock nesting notation
> > [ 39.819687]
> > [ 39.819689] 2 locks held by page_alloc_kthr/2306:
> > [ 39.819691] #0: ff110001f5cbfe88 ((&c->lock)){+.+.}-{3:3}, at:
> > ___slab_alloc+0xb7/0xec0
> > [ 39.819700] #1: ffffffff8588f3a0 (rcu_read_lock){....}-{1:3}, at:
> > rt_spin_lock+0x197/0x250
> > [ 39.819710]
> > [ 39.819710] stack backtrace:
> > [ 39.819714] CPU: 1 UID: 0 PID: 2306 Comm: page_alloc_kthr Tainted:
> > G O 6.16.0-rc5-00342-gc8aca7837440-dirty #54
> > PREEMPT_RT
> > [ 39.819721] Tainted: [O]=OOT_MODULE
> > [ 39.819723] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > [ 39.819726] Call Trace:
> > [ 39.819729] <TASK>
> > [ 39.819734] dump_stack_lvl+0x5b/0x80
> > [ 39.819740] print_deadlock_bug.cold+0xbd/0xca
> > [ 39.819747] __lock_acquire+0x12ad/0x2590
> > [ 39.819753] ? __lock_acquire+0x42b/0x2590
> > [ 39.819758] lock_acquire+0x133/0x2d0
> > [ 39.819763] ? ___slab_alloc+0xb7/0xec0
> > [ 39.819769] ? try_to_take_rt_mutex+0x624/0xfc0
> > [ 39.819773] ? __lock_acquire+0x42b/0x2590
> > [ 39.819778] rt_spin_lock+0x6f/0x250
>
> But why are we here in ___slab_alloc, trying to take the lock...
>
> > [ 39.819783] ? ___slab_alloc+0xb7/0xec0
> > [ 39.819788] ? rtlock_slowlock_locked+0x5c60/0x5c60
> > [ 39.819792] ? rtlock_slowlock_locked+0xc3/0x5c60
> > [ 39.819798] ___slab_alloc+0xb7/0xec0
> > [ 39.819803] ? __lock_acquire+0x42b/0x2590
> > [ 39.819809] ? my_debug_callback+0x20e/0x390 [bpf_testmod]
> > [ 39.819826] ? __lock_acquire+0x42b/0x2590
> > [ 39.819830] ? rt_read_unlock+0x2f0/0x2f0
> > [ 39.819835] ? my_debug_callback+0x20e/0x390 [bpf_testmod]
> > [ 39.819844] ? kmalloc_nolock_noprof+0x15a/0x430
> > [ 39.819849] kmalloc_nolock_noprof+0x15a/0x430
>
> When in patch 6/6 __slab_alloc() we should have bailed out via
>
> if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
> + if (local_lock_is_locked(&s->cpu_slab->lock)) {
> + /*
> + * EBUSY is an internal signal to kmalloc_nolock() to
> + * retry a different bucket. It's not propagated
> + * to the caller.
> + */
> + p = ERR_PTR(-EBUSY);
> + goto out;
> + }
>
> So it doesn't seem to me as a lack of lockdep tricking, but we reached
> something we should not have because the avoidance based on
> local_lock_is_locked() above didn't work properly? At least if I read the
> splat and backtrace properly, it doesn't seem to suggest a theoretical
> scenario but that we really tried to lock something we already had locked.
It's not theoretical. Such slab re-entrance can happen with
a tracepoint:
slab -> some tracepoint -> bpf -> slab
I simulate it with a stress test:
+extern void (*debug_callback)(void);
+#define local_unlock_irqrestore(lock, flags) \
+ do { \
+ if (debug_callback) debug_callback(); \
+ __local_unlock_irqrestore(lock, flags); \
+ } while (0)
and debug_callback() calls kmalloc_nolock(random_size) without any bpf
to simplify testing.
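my_debug_callback() itself is nothing fancy, roughly (a sketch assuming the
kmalloc_nolock(size, gfp_flags, node) signature from patch 6/6; the size
choice and the recursion guard are made up here for illustration):

	static DEFINE_PER_CPU(int, dbg_active);

	static void my_debug_callback(void)
	{
		size_t size = 8 << (get_random_u32() % 8); /* 8..1024 bytes */
		void *p;

		/* avoid unbounded re-entry: the alloc/free path unlocks local_locks too */
		if (this_cpu_inc_return(dbg_active) == 1) {
			p = kmalloc_nolock(size, 0, NUMA_NO_NODE);
			if (p)
				kfree_nolock(p);
		}
		this_cpu_dec(dbg_active);
	}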
> > [ 39.819857] my_debug_callback+0x20e/0x390 [bpf_testmod]
>
> What exactly did you instrument here?
>
> > [ 39.819867] ? page_alloc_kthread+0x320/0x320 [bpf_testmod]
> > [ 39.819875] ? lock_is_held_type+0x85/0xe0
> > [ 39.819881] ___slab_alloc+0x256/0xec0
>
> And here we took the lock originally?
yes, but they are truly different local_locks of different
kmalloc buckets, and local_lock_is_locked() is working.
See in the splat:
> > [ 39.819646] page_alloc_kthr/2306 is trying to acquire lock:
> > [ 39.819650] ff110001f5cbea88 ((&c->lock)){+.+.}-{3:3}, at:
> > ___slab_alloc+0xb7/0xec0
> > [ 39.819667]
> > [ 39.819667] but task is already holding lock:
> > [ 39.819668] ff110001f5cbfe88 ((&c->lock)){+.+.}-{3:3}, at:
> > ___slab_alloc+0xb7/0xec0
the addresses of the locks are different and they're different
kmalloc buckets, but lockdep cannot understand this without
explicit local_lock_lockdep_start().
The same thing I'm trying to explain in the commit log.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-14 18:46 ` Alexei Starovoitov
@ 2025-07-15 6:56 ` Vlastimil Babka
2025-07-15 17:29 ` Alexei Starovoitov
0 siblings, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-15 6:56 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Sebastian Andrzej Siewior, bpf, linux-mm, Harry Yoo, Shakeel Butt,
Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On 7/14/25 20:46, Alexei Starovoitov wrote:
> On Mon, Jul 14, 2025 at 11:33 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> When in patch 6/6 __slab_alloc() we should have bailed out via
>>
>> if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
>> + if (local_lock_is_locked(&s->cpu_slab->lock)) {
>> + /*
>> + * EBUSY is an internal signal to kmalloc_nolock() to
>> + * retry a different bucket. It's not propagated
>> + * to the caller.
>> + */
>> + p = ERR_PTR(-EBUSY);
>> + goto out;
>> + }
>>
>> So it doesn't seem to me as a lack of lockdep tricking, but we reached
>> something we should not have because the avoidance based on
>> local_lock_is_locked() above didn't work properly? At least if I read the
>> splat and backtrace properly, it doesn't seem to suggest a theoretical
>> scenario but that we really tried to lock something we already had locked.
>
> It's not theoretical. Such slab re-entrance can happen with
> a tracepoint:
> slab -> some tracepoint -> bpf -> slab
>
> I simulate it with a stress test:
> +extern void (*debug_callback)(void);
> +#define local_unlock_irqrestore(lock, flags) \
> + do { \
> + if (debug_callback) debug_callback(); \
> + __local_unlock_irqrestore(lock, flags); \
> + } while (0)
>
> and debug_callback() calls kmalloc_nolock(random_size) without any bpf
> to simplify testing.
>
>> > [ 39.819857] my_debug_callback+0x20e/0x390 [bpf_testmod]
>>
>> What exactly did you instrument here?
>>
>> > [ 39.819867] ? page_alloc_kthread+0x320/0x320 [bpf_testmod]
>> > [ 39.819875] ? lock_is_held_type+0x85/0xe0
>> > [ 39.819881] ___slab_alloc+0x256/0xec0
>>
>> And here we took the lock originally?
>
> yes, but they are truly different local_locks of different
> kmalloc buckets, and local_lock_is_locked() is working.
>
> See in the splat:
>
>> > [ 39.819646] page_alloc_kthr/2306 is trying to acquire lock:
>> > [ 39.819650] ff110001f5cbea88 ((&c->lock)){+.+.}-{3:3}, at:
>> > ___slab_alloc+0xb7/0xec0
>> > [ 39.819667]
>> > [ 39.819667] but task is already holding lock:
>> > [ 39.819668] ff110001f5cbfe88 ((&c->lock)){+.+.}-{3:3}, at:
>> > ___slab_alloc+0xb7/0xec0
>
> the addresses of the locks are different and they're different
> kmalloc buckets, but lockdep cannot understand this without
> explicit local_lock_lockdep_start().
> The same thing I'm trying to explain in the commit log.
Thanks for the explanation and sorry for being so dense.
Maybe lockdep's lock classes can be used here somehow instead of having to
teach lockdep completely new tricks, but I don't know enough about those to
know for sure.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-15 6:56 ` Vlastimil Babka
@ 2025-07-15 17:29 ` Alexei Starovoitov
2025-07-15 17:48 ` Vlastimil Babka
0 siblings, 1 reply; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-15 17:29 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Sebastian Andrzej Siewior, bpf, linux-mm, Harry Yoo, Shakeel Butt,
Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Tue, Jul 15, 2025 at 08:56:21AM +0200, Vlastimil Babka wrote:
> > the addresses of the locks are different and they're different
> > kmalloc buckets, but lockdep cannot understand this without
> > explicit local_lock_lockdep_start().
> > The same thing I'm trying to explain in the commit log.
>
> Thanks for the explanation and sorry for being so dense.
> Maybe lockdep's lock classes can be used here somehow instead of having to
> teach lockdep completely new tricks, but I don't know enough about those to
> know for sure.
I tried that with a separate lock_key for each local_trylock_t
and it sort of works for 16 CPUs, but doesn't scale
when the number of CPUs is large.
There is no good way to pick LOCKDEP_BITS.
It can be made optional on PREEMPT_RT only
and used only for the kmalloc buckets that kmalloc_nolock() is using,
but it still feels less clean than
local_lock_lockdep_start/end()
since it makes lockdep work harder.
Better ideas?
From da2b3bac08950929da105836fbff7e2ea4ecbc0e Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast@kernel.org>
Date: Tue, 15 Jul 2025 10:16:42 -0700
Subject: [PATCH] lockdep
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
lib/Kconfig.debug | 2 +-
mm/slub.c | 13 +++++++++++++
2 files changed, 14 insertions(+), 1 deletion(-)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ebe33181b6e6..94c07b84ecd0 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1526,7 +1526,7 @@ config LOCKDEP_BITS
int "Size for MAX_LOCKDEP_ENTRIES (as Nth power of 2)"
depends on LOCKDEP && !LOCKDEP_SMALL
range 10 24
- default 15
+ default 16
help
Try increasing this value if you hit "BUG: MAX_LOCKDEP_ENTRIES too low!" message.
diff --git a/mm/slub.c b/mm/slub.c
index 2f30b85fbf68..2ae6bf3ebcd0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -395,6 +395,7 @@ struct kmem_cache_cpu {
struct slab *partial; /* Partially allocated slabs */
#endif
local_trylock_t lock; /* Protects the fields above */
+ struct lock_class_key lock_key;
#ifdef CONFIG_SLUB_STATS
unsigned int stat[NR_SLUB_STAT_ITEMS];
#endif
@@ -3083,6 +3084,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
for_each_possible_cpu(cpu) {
c = per_cpu_ptr(s->cpu_slab, cpu);
local_trylock_init(&c->lock);
+ lockdep_register_key(&c->lock_key);
+ lockdep_set_class(&c->lock, &c->lock_key);
c->tid = init_tid(cpu);
}
}
@@ -5953,6 +5956,16 @@ void __kmem_cache_release(struct kmem_cache *s)
{
cache_random_seq_destroy(s);
#ifndef CONFIG_SLUB_TINY
+ {
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab,
+ cpu);
+
+ lockdep_unregister_key(&c->lock_key);
+ }
+ }
free_percpu(s->cpu_slab);
#endif
free_kmem_cache_nodes(s);
--
2.47.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-15 17:29 ` Alexei Starovoitov
@ 2025-07-15 17:48 ` Vlastimil Babka
2025-07-15 21:00 ` Alexei Starovoitov
0 siblings, 1 reply; 38+ messages in thread
From: Vlastimil Babka @ 2025-07-15 17:48 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Sebastian Andrzej Siewior, bpf, linux-mm, Harry Yoo, Shakeel Butt,
Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On 7/15/25 19:29, Alexei Starovoitov wrote:
> On Tue, Jul 15, 2025 at 08:56:21AM +0200, Vlastimil Babka wrote:
>> > the addresses of the locks are different and they're different
>> > kmalloc buckets, but lockdep cannot understand this without
>> > explicit local_lock_lockdep_start().
>> > The same thing I'm trying to explain in the commit log.
>>
>> Thanks for the explanation and sorry for being so dense.
>> Maybe lockdep's lock classes can be used here somehow instead of having to
>> teach lockdep completely new tricks, but I don't know enough about those to
>> know for sure.
>
> I tried that with a separate lock_key for each local_trylock_t
> and it's sort-of kinda works for 16 cpus, but doesn't scale
> when number of cpus is large.
> There is no good way to pick LOCKDEP_BITS.
>
> It can be made optional on PREEMP_RT only
> and used for kmalloc buckets only that kmalloc_nolock() is using,
> but still feels less clean than
> local_lock_lockdep_start/end()
> since it makes lockdep work harder.
>
> Better ideas?
I was thinking something like a class for normal context and a different
class for kmalloc_nolock() context (determined by allow_spinning) but it's
quite vague as I don't understand lockdep enough.
> From da2b3bac08950929da105836fbff7e2ea4ecbc0e Mon Sep 17 00:00:00 2001
> From: Alexei Starovoitov <ast@kernel.org>
> Date: Tue, 15 Jul 2025 10:16:42 -0700
> Subject: [PATCH] lockdep
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> lib/Kconfig.debug | 2 +-
> mm/slub.c | 13 +++++++++++++
> 2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index ebe33181b6e6..94c07b84ecd0 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1526,7 +1526,7 @@ config LOCKDEP_BITS
> int "Size for MAX_LOCKDEP_ENTRIES (as Nth power of 2)"
> depends on LOCKDEP && !LOCKDEP_SMALL
> range 10 24
> - default 15
> + default 16
> help
> Try increasing this value if you hit "BUG: MAX_LOCKDEP_ENTRIES too low!" message.
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 2f30b85fbf68..2ae6bf3ebcd0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -395,6 +395,7 @@ struct kmem_cache_cpu {
> struct slab *partial; /* Partially allocated slabs */
> #endif
> local_trylock_t lock; /* Protects the fields above */
> + struct lock_class_key lock_key;
> #ifdef CONFIG_SLUB_STATS
> unsigned int stat[NR_SLUB_STAT_ITEMS];
> #endif
> @@ -3083,6 +3084,8 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
> for_each_possible_cpu(cpu) {
> c = per_cpu_ptr(s->cpu_slab, cpu);
> local_trylock_init(&c->lock);
> + lockdep_register_key(&c->lock_key);
> + lockdep_set_class(&c->lock, &c->lock_key);
> c->tid = init_tid(cpu);
> }
> }
> @@ -5953,6 +5956,16 @@ void __kmem_cache_release(struct kmem_cache *s)
> {
> cache_random_seq_destroy(s);
> #ifndef CONFIG_SLUB_TINY
> + {
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> + struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab,
> + cpu);
> +
> + lockdep_unregister_key(&c->lock_key);
> + }
> + }
> free_percpu(s->cpu_slab);
> #endif
> free_kmem_cache_nodes(s);
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end()
2025-07-15 17:48 ` Vlastimil Babka
@ 2025-07-15 21:00 ` Alexei Starovoitov
0 siblings, 0 replies; 38+ messages in thread
From: Alexei Starovoitov @ 2025-07-15 21:00 UTC (permalink / raw)
To: Vlastimil Babka
Cc: Sebastian Andrzej Siewior, bpf, linux-mm, Harry Yoo, Shakeel Butt,
Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Tue, Jul 15, 2025 at 07:48:39PM +0200, Vlastimil Babka wrote:
> On 7/15/25 19:29, Alexei Starovoitov wrote:
> > On Tue, Jul 15, 2025 at 08:56:21AM +0200, Vlastimil Babka wrote:
> >> > the addresses of the locks are different and they're different
> >> > kmalloc buckets, but lockdep cannot understand this without
> >> > explicit local_lock_lockdep_start().
> >> > The same thing I'm trying to explain in the commit log.
> >>
> >> Thanks for the explanation and sorry for being so dense.
> >> Maybe lockdep's lock classes can be used here somehow instead of having to
> >> teach lockdep completely new tricks, but I don't know enough about those to
> >> know for sure.
> >
> > I tried that with a separate lock_key for each local_trylock_t
> > and it's sort-of kinda works for 16 cpus, but doesn't scale
> > when number of cpus is large.
> > There is no good way to pick LOCKDEP_BITS.
> >
> > It can be made optional on PREEMP_RT only
> > and used for kmalloc buckets only that kmalloc_nolock() is using,
> > but still feels less clean than
> > local_lock_lockdep_start/end()
> > since it makes lockdep work harder.
> >
> > Better ideas?
>
> I was thinking something like a class for normal context and a different
> class for kmalloc_nolock() context (determined by allow_spinning) but it's
> quite vague as I don't understand lockdep enough.
lockdep is not context sensitive. Any lock can either be 'lock'ed or 'trylock'ed.
One cannot tell lockdep to use a different class when a lock is 'lock'ed
from a different context (like special gfpflags),
but good news... it turns out lockdep understands LD_LOCK_PERCPU, which was
added specifically for local_lock.
So there is no need to register a lock_key for each cpu.
The following diff addresses the false positive on PREEMPT_RT.
This is definitely cleaner than the local_lock_lockdep_start/end() trickery.
I'm thinking of wrapping it in #ifdef CONFIG_PREEMPT_RT and adding it to the respin.
As long as we stick to local_trylock_irqsave() for !PREEMPT_RT
and local_lock_irqsave() for PREEMPT_RT all cases should be covered.
For !PREEMPT_RT there will be no issue with the
"inconsistent {INITIAL USE} -> {IN-NMI} usage"
case, since we will be doing local_trylock_irqsave().
And for PREEMPT_RT there is no NMI or hardirq in this path.
Meaning that what you were suggesting earlier:
> if (unlikely(!local_trylock_irqsave(&s->cpu_slab->lock, *flags))
> local_lock_irqsave(&s->cpu_slab->lock, *flags);
isn't an option.
For !RT we can only use local_trylock_irqsave() without a fallback,
otherwise we will be back to square one re: lockdep false positives.
So I'll be going with Sebastian's approach:
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT) && !allow_spin)
+ BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
+ else
+ local_lock_irqsave(&s->cpu_slab->lock, *flags);
From a51dd40cadc5fbe0a2dfa9d8fb493cdc14ab2e9f Mon Sep 17 00:00:00 2001
From: Alexei Starovoitov <ast@kernel.org>
Date: Tue, 15 Jul 2025 10:16:42 -0700
Subject: [PATCH v2] lockdep
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/slab.h | 1 +
 mm/slub.c | 5 +++++
 2 files changed, 6 insertions(+)
diff --git a/mm/slab.h b/mm/slab.h
index 65f4616b41de..165737accb20 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -262,6 +262,7 @@ struct kmem_cache_order_objects {
struct kmem_cache {
#ifndef CONFIG_SLUB_TINY
struct kmem_cache_cpu __percpu *cpu_slab;
+ struct lock_class_key lock_key;
#endif
/* Used for retrieving partial slabs, etc. */
slab_flags_t flags;
diff --git a/mm/slub.c b/mm/slub.c
index 2f30b85fbf68..ca7f6a3d5db4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3080,9 +3080,13 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
int cpu;
struct kmem_cache_cpu *c;
+ if (!init_section_contains(s, 1))
+ /* register lockdep key for non-boot kmem caches */
+ lockdep_register_key(&s->lock_key);
for_each_possible_cpu(cpu) {
c = per_cpu_ptr(s->cpu_slab, cpu);
local_trylock_init(&c->lock);
+ lockdep_set_class(&c->lock, &s->lock_key);
c->tid = init_tid(cpu);
}
}
@@ -5953,6 +5957,7 @@ void __kmem_cache_release(struct kmem_cache *s)
{
cache_random_seq_destroy(s);
#ifndef CONFIG_SLUB_TINY
+ lockdep_unregister_key(&s->lock_key);
free_percpu(s->cpu_slab);
#endif
free_kmem_cache_nodes(s);
--
2.47.1
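The #ifdef CONFIG_PREEMPT_RT wrapping mentioned above would sit on top of this
diff, roughly (untested sketch; only the ifdef placement is new, the calls are
taken from the diff):

#ifdef CONFIG_PREEMPT_RT
	if (!init_section_contains(s, 1))
		/* register lockdep key for non-boot kmem caches */
		lockdep_register_key(&s->lock_key);
#endif
	for_each_possible_cpu(cpu) {
		c = per_cpu_ptr(s->cpu_slab, cpu);
		local_trylock_init(&c->lock);
#ifdef CONFIG_PREEMPT_RT
		lockdep_set_class(&c->lock, &s->lock_key);
#endif
		c->tid = init_tid(cpu);
	}

plus the same guard around the lock_key field in struct kmem_cache and around
lockdep_unregister_key(&s->lock_key) in __kmem_cache_release().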
^ permalink raw reply related [flat|nested] 38+ messages in thread
end of thread
Thread overview: 38+ messages
2025-07-09 1:52 [PATCH v2 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
2025-07-09 1:52 ` [PATCH v2 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
2025-07-11 8:02 ` Sebastian Andrzej Siewior
2025-07-09 1:52 ` [PATCH v2 2/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
2025-07-11 7:52 ` Sebastian Andrzej Siewior
2025-07-09 1:53 ` [PATCH v2 3/6] locking/local_lock: Introduce local_lock_lockdep_start/end() Alexei Starovoitov
2025-07-11 7:50 ` Sebastian Andrzej Siewior
2025-07-11 9:55 ` Vlastimil Babka
2025-07-11 15:17 ` Sebastian Andrzej Siewior
2025-07-11 15:23 ` Vlastimil Babka
2025-07-12 2:19 ` Alexei Starovoitov
2025-07-14 11:06 ` Sebastian Andrzej Siewior
2025-07-14 15:35 ` Vlastimil Babka
2025-07-14 15:54 ` Sebastian Andrzej Siewior
2025-07-14 17:52 ` Alexei Starovoitov
2025-07-14 18:33 ` Vlastimil Babka
2025-07-14 18:46 ` Alexei Starovoitov
2025-07-15 6:56 ` Vlastimil Babka
2025-07-15 17:29 ` Alexei Starovoitov
2025-07-15 17:48 ` Vlastimil Babka
2025-07-15 21:00 ` Alexei Starovoitov
2025-07-09 1:53 ` [PATCH v2 4/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
2025-07-09 14:20 ` Vlastimil Babka
2025-07-09 1:53 ` [PATCH v2 5/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
2025-07-09 14:21 ` Vlastimil Babka
2025-07-09 1:53 ` [PATCH v2 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
2025-07-10 9:36 ` Vlastimil Babka
2025-07-10 10:21 ` Harry Yoo
2025-07-10 15:05 ` Vlastimil Babka
2025-07-10 19:13 ` Alexei Starovoitov
2025-07-11 6:06 ` Harry Yoo
2025-07-11 10:30 ` Vlastimil Babka
2025-07-12 1:55 ` Alexei Starovoitov
2025-07-10 19:21 ` Alexei Starovoitov
2025-07-11 7:26 ` Sebastian Andrzej Siewior
2025-07-11 7:36 ` Harry Yoo
2025-07-11 7:40 ` Harry Yoo
2025-07-11 10:48 ` Vlastimil Babka