[PATCH v3 0/6] slab: Re-entrant kmalloc

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/6] slab: Re-entrant kmalloc_nolock()
@ 2025-07-16  2:29 Alexei Starovoitov
  2025-07-16  2:29 ` [PATCH v3 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-16  2:29 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes

From: Alexei Starovoitov <ast@kernel.org>

v2->v3:
- Adopted Sebastian's local_lock_cpu_slab(), but dropped gfpflags
  to avoid extra branch for performance reasons,
  and added local_unlock_cpu_slab() for symmetry.
- Dropped local_lock_lockdep_start/end() pair and switched to
  per kmem_cache lockdep class on PREEMPT_RT to silence false positive
  when the same cpu/task acquires two local_lock-s.
- Refactorred defer_free per Sebastian's suggestion
- Fixed slab leak when it needs to be deactivated via irq_work and llist
  as Vlastimil proposed. Including defer_free_barrier().
- Use kmem_cache->offset for llist_node pointer when linking objects
  instead of zero offset, since whole object could be used for slabs
  with ctors and other cases.
- Fixed "cnt = 1; goto redo;" issue.
- Fixed slab leak in alloc_single_from_new_slab().
- Retested with slab_debug, RT, !RT, lockdep, kasan, slab_tiny
- Added acks to patches 1-4 that should be good to go.

v2:
https://lore.kernel.org/bpf/20250709015303.8107-1-alexei.starovoitov@gmail.com/

v1->v2:
Added more comments for this non-trivial logic and addressed earlier comments.
In particular:
- Introduce alloc_frozen_pages_nolock() to avoid refcnt race
- alloc_pages_nolock() defaults to GFP_COMP
- Support SLUB_TINY
- Added more variants to stress tester to discover that kfree_nolock() can
  OOM, because deferred per-slab llist won't be serviced if kfree_nolock()
  gets unlucky long enough. Scraped previous approach and switched to
  global per-cpu llist with immediate irq_work_queue() to process all
  object sizes.
- Reentrant kmalloc cannot deactivate_slab(). In v1 the node hint was
  downgraded to NUMA_NO_NODE before calling slab_alloc(). Realized it's not
  good enough. There are odd cases that can trigger deactivate. Rewrote
  this part.
- Struggled with SLAB_NO_CMPXCHG. Thankfully Harry had a great suggestion:
  https://lore.kernel.org/bpf/aFvfr1KiNrLofavW@hyeyoo/
  which was adopted. So slab_debug works now.
- In v1 I had to s/local_lock_irqsave/local_lock_irqsave_check/ in a bunch
  of places in mm/slub.c to avoid lockdep false positives.
  Came up with much cleaner approach to silence invalid lockdep reports
  without sacrificing lockdep coverage. See local_lock_lockdep_start/end().

v1:
https://lore.kernel.org/bpf/20250501032718.65476-1-alexei.starovoitov@gmail.com/

Alexei Starovoitov (6):
  locking/local_lock: Expose dep_map in local_trylock_t.
  locking/local_lock: Introduce local_lock_is_locked().
  mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
  mm: Introduce alloc_frozen_pages_nolock()
  slab: Introduce kmalloc_nolock() and kfree_nolock().
  slab: Make slub local_trylock_t more precise for LOCKDEP

 include/linux/gfp.h                 |   2 +-
 include/linux/kasan.h               |  13 +-
 include/linux/local_lock.h          |   2 +
 include/linux/local_lock_internal.h |  16 +-
 include/linux/rtmutex.h             |   9 +
 include/linux/slab.h                |   4 +
 kernel/bpf/syscall.c                |   2 +-
 kernel/locking/rtmutex_common.h     |   9 -
 mm/Kconfig                          |   1 +
 mm/internal.h                       |   4 +
 mm/kasan/common.c                   |   5 +-
 mm/page_alloc.c                     |  54 ++--
 mm/slab.h                           |   7 +
 mm/slab_common.c                    |   3 +
 mm/slub.c                           | 471 +++++++++++++++++++++++++---
 15 files changed, 513 insertions(+), 89 deletions(-)

-- 
2.47.1



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 1/6] locking/local_lock: Expose dep_map in local_trylock_t.
  2025-07-16  2:29 [PATCH v3 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
@ 2025-07-16  2:29 ` Alexei Starovoitov
  2025-07-16  2:29 ` [PATCH v3 2/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-16  2:29 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes

From: Alexei Starovoitov <ast@kernel.org>

lockdep_is_held() macro assumes that "struct lockdep_map dep_map;"
is a top level field of any lock that participates in LOCKDEP.
Make it so for local_trylock_t.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/local_lock_internal.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8d5ac16a9b17..85c2e1b1af6b 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -17,7 +17,10 @@ typedef struct {
 
 /* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
 typedef struct {
-	local_lock_t	llock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+	struct task_struct	*owner;
+#endif
 	u8		acquired;
 } local_trylock_t;
 
@@ -31,7 +34,7 @@ typedef struct {
 	.owner = NULL,
 
 # define LOCAL_TRYLOCK_DEBUG_INIT(lockname)		\
-	.llock = { LOCAL_LOCK_DEBUG_INIT((lockname).llock) },
+	LOCAL_LOCK_DEBUG_INIT(lockname)
 
 static inline void local_lock_acquire(local_lock_t *l)
 {
@@ -81,7 +84,7 @@ do {								\
 	local_lock_debug_init(lock);				\
 } while (0)
 
-#define __local_trylock_init(lock) __local_lock_init(lock.llock)
+#define __local_trylock_init(lock) __local_lock_init((local_lock_t *)lock)
 
 #define __spinlock_nested_bh_init(lock)				\
 do {								\
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 2/6] locking/local_lock: Introduce local_lock_is_locked().
  2025-07-16  2:29 [PATCH v3 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
  2025-07-16  2:29 ` [PATCH v3 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
@ 2025-07-16  2:29 ` Alexei Starovoitov
  2025-07-16  2:29 ` [PATCH v3 3/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-16  2:29 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes

From: Alexei Starovoitov <ast@kernel.org>

Introduce local_lock_is_locked() that returns true when
given local_lock is locked by current cpu (in !PREEMPT_RT) or
by current task (in PREEMPT_RT).
The goal is to detect a deadlock by the caller.

Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/local_lock.h          | 2 ++
 include/linux/local_lock_internal.h | 7 +++++++
 include/linux/rtmutex.h             | 9 +++++++++
 kernel/locking/rtmutex_common.h     | 9 ---------
 4 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 16a2ee4f8310..092ce89b162a 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -66,6 +66,8 @@
  */
 #define local_trylock(lock)		__local_trylock(lock)
 
+#define local_lock_is_locked(lock)	__local_lock_is_locked(lock)
+
 /**
  * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
  *			   interrupts if acquired
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 85c2e1b1af6b..db61409f040c 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -165,6 +165,9 @@ do {								\
 		!!tl;						\
 	})
 
+/* preemption or migration must be disabled before calling __local_lock_is_locked */
+#define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
+
 #define __local_lock_release(lock)					\
 	do {								\
 		local_trylock_t *tl;					\
@@ -285,4 +288,8 @@ do {								\
 		__local_trylock(lock);				\
 	})
 
+/* migration must be disabled before calling __local_lock_is_locked */
+#define __local_lock_is_locked(__lock)					\
+	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)
+
 #endif /* CONFIG_PREEMPT_RT */
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 7d049883a08a..a8f2fe154487 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -44,6 +44,15 @@ static inline bool rt_mutex_base_is_locked(struct rt_mutex_base *lock)
 	return READ_ONCE(lock->owner) != NULL;
 }
 
+#define RT_MUTEX_HAS_WAITERS	1UL
+
+static inline struct task_struct *rt_mutex_owner(struct rt_mutex_base *lock)
+{
+	unsigned long owner = (unsigned long) READ_ONCE(lock->owner);
+
+	return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
+}
+
 extern void rt_mutex_base_init(struct rt_mutex_base *rtb);
 
 /**
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 78dd3d8c6554..cf6ddd1b23a2 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -153,15 +153,6 @@ static inline struct rt_mutex_waiter *task_top_pi_waiter(struct task_struct *p)
 			pi_tree.entry);
 }
 
-#define RT_MUTEX_HAS_WAITERS	1UL
-
-static inline struct task_struct *rt_mutex_owner(struct rt_mutex_base *lock)
-{
-	unsigned long owner = (unsigned long) READ_ONCE(lock->owner);
-
-	return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
-}
-
 /*
  * Constants for rt mutex functions which have a selectable deadlock
  * detection.
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 3/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
  2025-07-16  2:29 [PATCH v3 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
  2025-07-16  2:29 ` [PATCH v3 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
  2025-07-16  2:29 ` [PATCH v3 2/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
@ 2025-07-16  2:29 ` Alexei Starovoitov
  2025-07-16  2:29 ` [PATCH v3 4/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-16  2:29 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes

From: Alexei Starovoitov <ast@kernel.org>

Change alloc_pages_nolock() to default to __GFP_COMP when allocating
pages, since upcoming reentrant alloc_slab_page() needs __GFP_COMP.
Also allow __GFP_ACCOUNT flag to be specified,
since BPF infra needs __GFP_ACCOUNT.

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h  |  2 +-
 kernel/bpf/syscall.c |  2 +-
 mm/page_alloc.c      | 10 ++++++----
 3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5ebf26fcdcfa..0ceb4e09306c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -354,7 +354,7 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)			alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
-struct page *alloc_pages_nolock_noprof(int nid, unsigned int order);
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order);
 #define alloc_pages_nolock(...)			alloc_hooks(alloc_pages_nolock_noprof(__VA_ARGS__))
 
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index dd5304c6ac3c..eb9b6c4c10e9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -579,7 +579,7 @@ static bool can_alloc_pages(void)
 static struct page *__bpf_alloc_page(int nid)
 {
 	if (!can_alloc_pages())
-		return alloc_pages_nolock(nid, 0);
+		return alloc_pages_nolock(__GFP_ACCOUNT, nid, 0);
 
 	return alloc_pages_node(nid,
 				GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78ddf1d43c6c..148945f0b667 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7506,6 +7506,7 @@ static bool __free_unaccepted(struct page *page)
 
 /**
  * alloc_pages_nolock - opportunistic reentrant allocation from any context
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
  * @nid: node to allocate from
  * @order: allocation order size
  *
@@ -7519,7 +7520,7 @@ static bool __free_unaccepted(struct page *page)
  * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
  * It means ENOMEM. There is no reason to call it again and expect !NULL.
  */
-struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
 {
 	/*
 	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
@@ -7541,12 +7542,13 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
 	 * specify it here to highlight that alloc_pages_nolock()
 	 * doesn't want to deplete reserves.
 	 */
-	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
-			| __GFP_ACCOUNT;
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
+			| gfp_flags;
 	unsigned int alloc_flags = ALLOC_TRYLOCK;
 	struct alloc_context ac = { };
 	struct page *page;
 
+	VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
 	/*
 	 * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
 	 * unsafe in NMI. If spin_trylock() is called from hard IRQ the current
@@ -7584,7 +7586,7 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
 	if (page)
 		set_page_refcounted(page);
 
-	if (memcg_kmem_online() && page &&
+	if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) &&
 	    unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
 		free_pages_nolock(page, order);
 		page = NULL;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 4/6] mm: Introduce alloc_frozen_pages_nolock()
  2025-07-16  2:29 [PATCH v3 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2025-07-16  2:29 ` [PATCH v3 3/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
@ 2025-07-16  2:29 ` Alexei Starovoitov
  2025-07-16  2:29 ` [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
  2025-07-16  2:29 ` [PATCH v3 6/6] slab: Make slub local_trylock_t more precise for LOCKDEP Alexei Starovoitov
  5 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-16  2:29 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes

From: Alexei Starovoitov <ast@kernel.org>

Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock()
to be used by alloc_slab_page().

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/internal.h   |  4 ++++
 mm/page_alloc.c | 48 +++++++++++++++++++++++++++---------------------
 2 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 5b0f71e5434b..ea85cf703331 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -837,6 +837,10 @@ static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int ord
 #define alloc_frozen_pages(...) \
 	alloc_hooks(alloc_frozen_pages_noprof(__VA_ARGS__))
 
+struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order);
+#define alloc_frozen_pages_nolock(...) \
+	alloc_hooks(alloc_frozen_pages_nolock_noprof(__VA_ARGS__))
+
 extern void zone_pcp_reset(struct zone *zone);
 extern void zone_pcp_disable(struct zone *zone);
 extern void zone_pcp_enable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 148945f0b667..11a184bab03c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7504,23 +7504,7 @@ static bool __free_unaccepted(struct page *page)
 
 #endif /* CONFIG_UNACCEPTED_MEMORY */
 
-/**
- * alloc_pages_nolock - opportunistic reentrant allocation from any context
- * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
- * @nid: node to allocate from
- * @order: allocation order size
- *
- * Allocates pages of a given order from the given node. This is safe to
- * call from any context (from atomic, NMI, and also reentrant
- * allocator -> tracepoint -> alloc_pages_nolock_noprof).
- * Allocation is best effort and to be expected to fail easily so nobody should
- * rely on the success. Failures are not reported via warn_alloc().
- * See always fail conditions below.
- *
- * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
- * It means ENOMEM. There is no reason to call it again and expect !NULL.
- */
-struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
+struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
 {
 	/*
 	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
@@ -7583,16 +7567,38 @@ struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int or
 
 	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
 
-	if (page)
-		set_page_refcounted(page);
-
 	if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) &&
 	    unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
-		free_pages_nolock(page, order);
+		__free_frozen_pages(page, order, FPI_TRYLOCK);
 		page = NULL;
 	}
 	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
 	kmsan_alloc_page(page, order, alloc_gfp);
 	return page;
 }
+/**
+ * alloc_pages_nolock - opportunistic reentrant allocation from any context
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
+ * @nid: node to allocate from
+ * @order: allocation order size
+ *
+ * Allocates pages of a given order from the given node. This is safe to
+ * call from any context (from atomic, NMI, and also reentrant
+ * allocator -> tracepoint -> alloc_pages_nolock_noprof).
+ * Allocation is best effort and to be expected to fail easily so nobody should
+ * rely on the success. Failures are not reported via warn_alloc().
+ * See always fail conditions below.
+ *
+ * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
+ * It means ENOMEM. There is no reason to call it again and expect !NULL.
+ */
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
+{
+	struct page *page;
+
+	page = alloc_frozen_pages_nolock_noprof(gfp_flags, nid, order);
+	if (page)
+		set_page_refcounted(page);
+	return page;
+}
 EXPORT_SYMBOL_GPL(alloc_pages_nolock_noprof);
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-07-16  2:29 [PATCH v3 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2025-07-16  2:29 ` [PATCH v3 4/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
@ 2025-07-16  2:29 ` Alexei Starovoitov
  2025-07-16 10:58   ` Vlastimil Babka
  2025-08-25  4:45   ` Harry Yoo
  2025-07-16  2:29 ` [PATCH v3 6/6] slab: Make slub local_trylock_t more precise for LOCKDEP Alexei Starovoitov
  5 siblings, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-16  2:29 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes

From: Alexei Starovoitov <ast@kernel.org>

kmalloc_nolock() relies on ability of local_lock to detect the situation
when it's locked.
In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
In that case retry the operation in a different kmalloc bucket.
The second attempt will likely succeed, since this cpu locked
different kmem_cache_cpu.

Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
per-cpu rt_spin_lock is locked by current task. In this case re-entrance
into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
a different bucket that is most likely is not locked by the current
task. Though it may be locked by a different task it's safe to
rt_spin_lock() on it.

Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
immediately if called from hard irq or NMI in PREEMPT_RT.

kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
and in_nmi() or in PREEMPT_RT.

SLUB_TINY config doesn't use local_lock_is_locked() and relies on
spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
always defers to irq_work.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/kasan.h |  13 +-
 include/linux/slab.h  |   4 +
 mm/Kconfig            |   1 +
 mm/kasan/common.c     |   5 +-
 mm/slab.h             |   6 +
 mm/slab_common.c      |   3 +
 mm/slub.c             | 454 +++++++++++++++++++++++++++++++++++++-----
 7 files changed, 434 insertions(+), 52 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 890011071f2b..acdc8cb0152e 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -200,7 +200,7 @@ static __always_inline bool kasan_slab_pre_free(struct kmem_cache *s,
 }
 
 bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
-		       bool still_accessible);
+		       bool still_accessible, bool no_quarantine);
 /**
  * kasan_slab_free - Poison, initialize, and quarantine a slab object.
  * @object: Object to be freed.
@@ -226,11 +226,13 @@ bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
  * @Return true if KASAN took ownership of the object; false otherwise.
  */
 static __always_inline bool kasan_slab_free(struct kmem_cache *s,
-						void *object, bool init,
-						bool still_accessible)
+					    void *object, bool init,
+					    bool still_accessible,
+					    bool no_quarantine)
 {
 	if (kasan_enabled())
-		return __kasan_slab_free(s, object, init, still_accessible);
+		return __kasan_slab_free(s, object, init, still_accessible,
+					 no_quarantine);
 	return false;
 }
 
@@ -427,7 +429,8 @@ static inline bool kasan_slab_pre_free(struct kmem_cache *s, void *object)
 }
 
 static inline bool kasan_slab_free(struct kmem_cache *s, void *object,
-				   bool init, bool still_accessible)
+				   bool init, bool still_accessible,
+				   bool no_quarantine)
 {
 	return false;
 }
diff --git a/include/linux/slab.h b/include/linux/slab.h
index d5a8ab98035c..743f6d196d57 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -470,6 +470,7 @@ void * __must_check krealloc_noprof(const void *objp, size_t new_size,
 #define krealloc(...)				alloc_hooks(krealloc_noprof(__VA_ARGS__))
 
 void kfree(const void *objp);
+void kfree_nolock(const void *objp);
 void kfree_sensitive(const void *objp);
 size_t __ksize(const void *objp);
 
@@ -910,6 +911,9 @@ static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t f
 }
 #define kmalloc(...)				alloc_hooks(kmalloc_noprof(__VA_ARGS__))
 
+void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node);
+#define kmalloc_nolock(...)			alloc_hooks(kmalloc_nolock_noprof(__VA_ARGS__))
+
 #define kmem_buckets_alloc(_b, _size, _flags)	\
 	alloc_hooks(__kmalloc_node_noprof(PASS_BUCKET_PARAMS(_size, _b), _flags, NUMA_NO_NODE))
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 0287e8d94aea..331a14d678b3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -206,6 +206,7 @@ menu "Slab allocator options"
 
 config SLUB
 	def_bool y
+	select IRQ_WORK
 
 config KVFREE_RCU_BATCHED
 	def_bool y
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index ed4873e18c75..67042e07baee 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -256,13 +256,16 @@ bool __kasan_slab_pre_free(struct kmem_cache *cache, void *object,
 }
 
 bool __kasan_slab_free(struct kmem_cache *cache, void *object, bool init,
-		       bool still_accessible)
+		       bool still_accessible, bool no_quarantine)
 {
 	if (!kasan_arch_is_ready() || is_kfence_address(object))
 		return false;
 
 	poison_slab_object(cache, object, init, still_accessible);
 
+	if (no_quarantine)
+		return false;
+
 	/*
 	 * If the object is put into quarantine, do not let slab put the object
 	 * onto the freelist for now. The object's metadata is kept until the
diff --git a/mm/slab.h b/mm/slab.h
index 05a21dc796e0..65f4616b41de 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -57,6 +57,10 @@ struct slab {
 		struct {
 			union {
 				struct list_head slab_list;
+				struct { /* For deferred deactivate_slab() */
+					struct llist_node llnode;
+					void *flush_freelist;
+				};
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 				struct {
 					struct slab *next;
@@ -680,6 +684,8 @@ void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
 void __check_heap_object(const void *ptr, unsigned long n,
 			 const struct slab *slab, bool to_user);
 
+void defer_free_barrier(void);
+
 static inline bool slub_debug_orig_size(struct kmem_cache *s)
 {
 	return (kmem_cache_debug_flags(s, SLAB_STORE_USER) &&
diff --git a/mm/slab_common.c b/mm/slab_common.c
index bfe7c40eeee1..937af8ab2501 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -507,6 +507,9 @@ void kmem_cache_destroy(struct kmem_cache *s)
 		rcu_barrier();
 	}
 
+	/* Wait for deferred work from kmalloc/kfree_nolock() */
+	defer_free_barrier();
+
 	cpus_read_lock();
 	mutex_lock(&slab_mutex);
 
diff --git a/mm/slub.c b/mm/slub.c
index c4b64821e680..c92703d367d7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -44,6 +44,7 @@
 #include <kunit/test.h>
 #include <kunit/test-bug.h>
 #include <linux/sort.h>
+#include <linux/irq_work.h>
 
 #include <linux/debugfs.h>
 #include <trace/events/kmem.h>
@@ -393,7 +394,7 @@ struct kmem_cache_cpu {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct slab *partial;	/* Partially allocated slabs */
 #endif
-	local_lock_t lock;	/* Protects the fields above */
+	local_trylock_t lock;	/* Protects the fields above */
 #ifdef CONFIG_SLUB_STATS
 	unsigned int stat[NR_SLUB_STAT_ITEMS];
 #endif
@@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
 int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 		        gfp_t gfp, bool new_slab)
 {
+	bool allow_spin = gfpflags_allow_spinning(gfp);
 	unsigned int objects = objs_per_slab(s, slab);
 	unsigned long new_exts;
 	unsigned long old_exts;
@@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 	gfp &= ~OBJCGS_CLEAR_MASK;
 	/* Prevent recursive extension vector allocation */
 	gfp |= __GFP_NO_OBJ_EXT;
-	vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
-			   slab_nid(slab));
+	if (unlikely(!allow_spin)) {
+		size_t sz = objects * sizeof(struct slabobj_ext);
+
+		vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
+	} else {
+		vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
+				   slab_nid(slab));
+	}
 	if (!vec) {
 		/* Mark vectors which failed to allocate */
 		if (new_slab)
@@ -2021,7 +2029,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 		 * objcg vector should be reused.
 		 */
 		mark_objexts_empty(vec);
-		kfree(vec);
+		if (unlikely(!allow_spin))
+			kfree_nolock(vec);
+		else
+			kfree(vec);
 		return 0;
 	}
 
@@ -2379,7 +2390,7 @@ bool slab_free_hook(struct kmem_cache *s, void *x, bool init,
 
 	}
 	/* KASAN might put x into memory quarantine, delaying its reuse. */
-	return !kasan_slab_free(s, x, init, still_accessible);
+	return !kasan_slab_free(s, x, init, still_accessible, false);
 }
 
 static __fastpath_inline
@@ -2442,13 +2453,17 @@ static void *setup_object(struct kmem_cache *s, void *object)
  * Slab allocation and freeing
  */
 static inline struct slab *alloc_slab_page(gfp_t flags, int node,
-		struct kmem_cache_order_objects oo)
+					   struct kmem_cache_order_objects oo,
+					   bool allow_spin)
 {
 	struct folio *folio;
 	struct slab *slab;
 	unsigned int order = oo_order(oo);
 
-	if (node == NUMA_NO_NODE)
+	if (unlikely(!allow_spin))
+		folio = (struct folio *)alloc_frozen_pages_nolock(0/* __GFP_COMP is implied */,
+								  node, order);
+	else if (node == NUMA_NO_NODE)
 		folio = (struct folio *)alloc_frozen_pages(flags, order);
 	else
 		folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
@@ -2598,6 +2613,7 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
 
 static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
+	bool allow_spin = gfpflags_allow_spinning(flags);
 	struct slab *slab;
 	struct kmem_cache_order_objects oo = s->oo;
 	gfp_t alloc_gfp;
@@ -2617,7 +2633,11 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
 		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_RECLAIM;
 
-	slab = alloc_slab_page(alloc_gfp, node, oo);
+	/*
+	 * __GFP_RECLAIM could be cleared on the first allocation attempt,
+	 * so pass allow_spin flag directly.
+	 */
+	slab = alloc_slab_page(alloc_gfp, node, oo, allow_spin);
 	if (unlikely(!slab)) {
 		oo = s->min;
 		alloc_gfp = flags;
@@ -2625,7 +2645,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 		 * Allocation may have failed due to fragmentation.
 		 * Try a lower order alloc if possible
 		 */
-		slab = alloc_slab_page(alloc_gfp, node, oo);
+		slab = alloc_slab_page(alloc_gfp, node, oo, allow_spin);
 		if (unlikely(!slab))
 			return NULL;
 		stat(s, ORDER_FALLBACK);
@@ -2798,33 +2818,48 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
 	return object;
 }
 
+static void defer_deactivate_slab(struct slab *slab);
+
 /*
  * Called only for kmem_cache_debug() caches to allocate from a freshly
  * allocated slab. Allocate a single object instead of whole freelist
  * and put the slab to the partial (or full) list.
  */
-static void *alloc_single_from_new_slab(struct kmem_cache *s,
-					struct slab *slab, int orig_size)
+static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
+					int orig_size, gfp_t gfpflags)
 {
+	bool allow_spin = gfpflags_allow_spinning(gfpflags);
 	int nid = slab_nid(slab);
 	struct kmem_cache_node *n = get_node(s, nid);
 	unsigned long flags;
 	void *object;
 
+	if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
+		/* Unlucky, discard newly allocated slab */
+		slab->flush_freelist = NULL;
+		slab->frozen = 1;
+		defer_deactivate_slab(slab);
+		return NULL;
+	}
 
 	object = slab->freelist;
 	slab->freelist = get_freepointer(s, object);
 	slab->inuse = 1;
 
-	if (!alloc_debug_processing(s, slab, object, orig_size))
+	if (!alloc_debug_processing(s, slab, object, orig_size)) {
 		/*
 		 * It's not really expected that this would fail on a
 		 * freshly allocated slab, but a concurrent memory
 		 * corruption in theory could cause that.
+		 * Leak memory of allocated slab.
 		 */
+		if (!allow_spin)
+			spin_unlock_irqrestore(&n->list_lock, flags);
 		return NULL;
+	}
 
-	spin_lock_irqsave(&n->list_lock, flags);
+	if (allow_spin)
+		spin_lock_irqsave(&n->list_lock, flags);
 
 	if (slab->inuse == slab->objects)
 		add_full(s, n, slab);
@@ -2865,7 +2900,10 @@ static struct slab *get_partial_node(struct kmem_cache *s,
 	if (!n || !n->nr_partial)
 		return NULL;
 
-	spin_lock_irqsave(&n->list_lock, flags);
+	if (gfpflags_allow_spinning(pc->flags))
+		spin_lock_irqsave(&n->list_lock, flags);
+	else if (!spin_trylock_irqsave(&n->list_lock, flags))
+		return NULL;
 	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
 		if (!pfmemalloc_match(slab, pc->flags))
 			continue;
@@ -3056,7 +3094,7 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
 
 	for_each_possible_cpu(cpu) {
 		c = per_cpu_ptr(s->cpu_slab, cpu);
-		local_lock_init(&c->lock);
+		local_trylock_init(&c->lock);
 		c->tid = init_tid(cpu);
 	}
 }
@@ -3189,6 +3227,36 @@ static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
 	}
 }
 
+static inline void local_lock_cpu_slab(struct kmem_cache *s, unsigned long *flags)
+{
+	/*
+	 * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
+	 * can be acquired without a deadlock before invoking the function.
+	 *
+	 * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
+	 * disabled context. The lock will always be acquired and if needed it
+	 * block and sleep until the lock is available.
+	 *
+	 * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
+	 * is always acquired with disabled interrupts meaning it is always
+	 * possible to it.
+	 * In NMI context it is needed to check if the lock is acquired. If it is not,
+	 * it is safe to acquire it. The trylock semantic is used to tell lockdep
+	 * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
+	 * the lock.
+	 *
+	 */
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
+	else
+		local_lock_irqsave(&s->cpu_slab->lock, *flags);
+}
+
+static inline void local_unlock_cpu_slab(struct kmem_cache *s, unsigned long *flags)
+{
+	local_unlock_irqrestore(&s->cpu_slab->lock, *flags);
+}
+
 /*
  * Put all the cpu partial slabs to the node partial list.
  */
@@ -3231,7 +3299,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 	unsigned long flags;
 	int slabs = 0;
 
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	local_lock_cpu_slab(s, &flags);
 
 	oldslab = this_cpu_read(s->cpu_slab->partial);
 
@@ -3256,7 +3324,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 
 	this_cpu_write(s->cpu_slab->partial, slab);
 
-	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+	local_unlock_cpu_slab(s, &flags);
 
 	if (slab_to_put) {
 		__put_partials(s, slab_to_put);
@@ -3690,6 +3758,7 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			  unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
 {
+	bool allow_spin = gfpflags_allow_spinning(gfpflags);
 	void *freelist;
 	struct slab *slab;
 	unsigned long flags;
@@ -3715,9 +3784,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	if (unlikely(!node_match(slab, node))) {
 		/*
 		 * same as above but node_match() being false already
-		 * implies node != NUMA_NO_NODE
+		 * implies node != NUMA_NO_NODE.
+		 * Reentrant slub cannot take locks necessary to
+		 * deactivate_slab, hence ignore node preference.
+		 * kmalloc_nolock() doesn't allow __GFP_THISNODE.
 		 */
-		if (!node_isset(node, slab_nodes)) {
+		if (!node_isset(node, slab_nodes) ||
+		    !allow_spin) {
 			node = NUMA_NO_NODE;
 		} else {
 			stat(s, ALLOC_NODE_MISMATCH);
@@ -3730,13 +3803,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	 * PFMEMALLOC but right now, we are losing the pfmemalloc
 	 * information when the page leaves the per-cpu allocator
 	 */
-	if (unlikely(!pfmemalloc_match(slab, gfpflags)))
+	if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin))
 		goto deactivate_slab;
 
 	/* must check again c->slab in case we got preempted and it changed */
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	local_lock_cpu_slab(s, &flags);
+
 	if (unlikely(slab != c->slab)) {
-		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+		local_unlock_cpu_slab(s, &flags);
 		goto reread_slab;
 	}
 	freelist = c->freelist;
@@ -3748,7 +3822,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	if (!freelist) {
 		c->slab = NULL;
 		c->tid = next_tid(c->tid);
-		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+		local_unlock_cpu_slab(s, &flags);
 		stat(s, DEACTIVATE_BYPASS);
 		goto new_slab;
 	}
@@ -3767,34 +3841,34 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	VM_BUG_ON(!c->slab->frozen);
 	c->freelist = get_freepointer(s, freelist);
 	c->tid = next_tid(c->tid);
-	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+	local_unlock_cpu_slab(s, &flags);
 	return freelist;
 
 deactivate_slab:
 
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	local_lock_cpu_slab(s, &flags);
 	if (slab != c->slab) {
-		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+		local_unlock_cpu_slab(s, &flags);
 		goto reread_slab;
 	}
 	freelist = c->freelist;
 	c->slab = NULL;
 	c->freelist = NULL;
 	c->tid = next_tid(c->tid);
-	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+	local_unlock_cpu_slab(s, &flags);
 	deactivate_slab(s, slab, freelist);
 
 new_slab:
 
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	while (slub_percpu_partial(c)) {
-		local_lock_irqsave(&s->cpu_slab->lock, flags);
+		local_lock_cpu_slab(s, &flags);
 		if (unlikely(c->slab)) {
-			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+			local_unlock_cpu_slab(s, &flags);
 			goto reread_slab;
 		}
 		if (unlikely(!slub_percpu_partial(c))) {
-			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+			local_unlock_cpu_slab(s, &flags);
 			/* we were preempted and partial list got empty */
 			goto new_objects;
 		}
@@ -3802,8 +3876,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		slab = slub_percpu_partial(c);
 		slub_set_percpu_partial(c, slab);
 
+		/*
+		 * Reentrant slub cannot take locks necessary for
+		 * __put_partials(), hence ignore node preference.
+		 * kmalloc_nolock() doesn't allow __GFP_THISNODE.
+		 */
 		if (likely(node_match(slab, node) &&
-			   pfmemalloc_match(slab, gfpflags))) {
+			   pfmemalloc_match(slab, gfpflags)) ||
+		    !allow_spin) {
 			c->slab = slab;
 			freelist = get_freelist(s, slab);
 			VM_BUG_ON(!freelist);
@@ -3811,7 +3891,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			goto load_freelist;
 		}
 
-		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+		local_unlock_cpu_slab(s, &flags);
 
 		slab->next = NULL;
 		__put_partials(s, slab);
@@ -3833,8 +3913,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	 *    allocating new page from other nodes
 	 */
 	if (unlikely(node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
-		     && try_thisnode))
-		pc.flags = GFP_NOWAIT | __GFP_THISNODE;
+		     && try_thisnode)) {
+		if (unlikely(!allow_spin))
+			/* Do not upgrade gfp to NOWAIT from more restrictive mode */
+			pc.flags = gfpflags | __GFP_THISNODE;
+		else
+			pc.flags = GFP_NOWAIT | __GFP_THISNODE;
+	}
 
 	pc.orig_size = orig_size;
 	slab = get_partial(s, node, &pc);
@@ -3873,7 +3958,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	stat(s, ALLOC_SLAB);
 
 	if (kmem_cache_debug(s)) {
-		freelist = alloc_single_from_new_slab(s, slab, orig_size);
+		freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
 
 		if (unlikely(!freelist))
 			goto new_objects;
@@ -3895,7 +3980,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 	inc_slabs_node(s, slab_nid(slab), slab->objects);
 
-	if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
+	if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin)) {
 		/*
 		 * For !pfmemalloc_match() case we don't load freelist so that
 		 * we don't make further mismatched allocations easier.
@@ -3906,7 +3991,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 retry_load_slab:
 
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	local_lock_cpu_slab(s, &flags);
 	if (unlikely(c->slab)) {
 		void *flush_freelist = c->freelist;
 		struct slab *flush_slab = c->slab;
@@ -3915,9 +4000,15 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		c->freelist = NULL;
 		c->tid = next_tid(c->tid);
 
-		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+		local_unlock_cpu_slab(s, &flags);
 
-		deactivate_slab(s, flush_slab, flush_freelist);
+		if (unlikely(!allow_spin)) {
+			/* Reentrant slub cannot take locks, defer */
+			flush_slab->flush_freelist = flush_freelist;
+			defer_deactivate_slab(flush_slab);
+		} else {
+			deactivate_slab(s, flush_slab, flush_freelist);
+		}
 
 		stat(s, CPUSLAB_FLUSH);
 
@@ -3946,8 +4037,19 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	 */
 	c = slub_get_cpu_ptr(s->cpu_slab);
 #endif
-
+	if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
+		if (local_lock_is_locked(&s->cpu_slab->lock)) {
+			/*
+			 * EBUSY is an internal signal to kmalloc_nolock() to
+			 * retry a different bucket. It's not propagated
+			 * to the caller.
+			 */
+			p = ERR_PTR(-EBUSY);
+			goto out;
+		}
+	}
 	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
+out:
 #ifdef CONFIG_PREEMPT_COUNT
 	slub_put_cpu_ptr(s->cpu_slab);
 #endif
@@ -4071,7 +4173,7 @@ static void *__slab_alloc_node(struct kmem_cache *s,
 		return NULL;
 	}
 
-	object = alloc_single_from_new_slab(s, slab, orig_size);
+	object = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
 
 	return object;
 }
@@ -4150,8 +4252,9 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 		if (p[i] && init && (!kasan_init ||
 				     !kasan_has_integrated_init()))
 			memset(p[i], 0, zero_size);
-		kmemleak_alloc_recursive(p[i], s->object_size, 1,
-					 s->flags, init_flags);
+		if (gfpflags_allow_spinning(flags))
+			kmemleak_alloc_recursive(p[i], s->object_size, 1,
+						 s->flags, init_flags);
 		kmsan_slab_alloc(s, p[i], init_flags);
 		alloc_tagging_slab_alloc_hook(s, p[i], flags);
 	}
@@ -4342,6 +4445,94 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
 }
 EXPORT_SYMBOL(__kmalloc_noprof);
 
+/**
+ * kmalloc_nolock - Allocate an object of given size from any context.
+ * @size: size to allocate
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
+ * @node: node number of the target node.
+ *
+ * Return: pointer to the new object or NULL in case of error.
+ * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
+ * There is no reason to call it again and expect !NULL.
+ */
+void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
+{
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
+	struct kmem_cache *s;
+	bool can_retry = true;
+	void *ret = ERR_PTR(-EBUSY);
+
+	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
+
+	if (unlikely(!size))
+		return ZERO_SIZE_PTR;
+
+	if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
+		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
+		return NULL;
+retry:
+	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
+		return NULL;
+	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
+
+	if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
+		/*
+		 * kmalloc_nolock() is not supported on architectures that
+		 * don't implement cmpxchg16b, but debug caches don't use
+		 * per-cpu slab and per-cpu partial slabs. They rely on
+		 * kmem_cache_node->list_lock, so kmalloc_nolock() can
+		 * attempt to allocate from debug caches by
+		 * spin_trylock_irqsave(&n->list_lock, ...)
+		 */
+		return NULL;
+
+	/*
+	 * Do not call slab_alloc_node(), since trylock mode isn't
+	 * compatible with slab_pre_alloc_hook/should_failslab and
+	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
+	 * and slab_post_alloc_hook() directly.
+	 *
+	 * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
+	 * in irq saved region. It assumes that the same cpu will not
+	 * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
+	 * Therefore use in_nmi() to check whether particular bucket is in
+	 * irq protected section.
+	 *
+	 * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
+	 * this cpu was interrupted somewhere inside ___slab_alloc() after
+	 * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
+	 * In this case fast path with __update_cpu_freelist_fast() is not safe.
+	 */
+#ifndef CONFIG_SLUB_TINY
+	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
+#endif
+		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
+
+	if (PTR_ERR(ret) == -EBUSY) {
+		if (can_retry) {
+			/* pick the next kmalloc bucket */
+			size = s->object_size + 1;
+			/*
+			 * Another alternative is to
+			 * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
+			 * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
+			 * to retry from bucket of the same size.
+			 */
+			can_retry = false;
+			goto retry;
+		}
+		ret = NULL;
+	}
+
+	maybe_wipe_obj_freeptr(s, ret);
+	slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
+			     slab_want_init_on_alloc(alloc_gfp, s), size);
+
+	ret = kasan_kmalloc(s, ret, size, alloc_gfp);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kmalloc_nolock_noprof);
+
 void *__kmalloc_node_track_caller_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags,
 					 int node, unsigned long caller)
 {
@@ -4555,6 +4746,97 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 	discard_slab(s, slab);
 }
 
+struct defer_free {
+	struct llist_head objects;
+	struct llist_head slabs;
+	struct irq_work work;
+};
+
+static void free_deferred_objects(struct irq_work *work);
+
+static DEFINE_PER_CPU(struct defer_free, defer_free_objects) = {
+	.objects = LLIST_HEAD_INIT(objects),
+	.slabs = LLIST_HEAD_INIT(slabs),
+	.work = IRQ_WORK_INIT(free_deferred_objects),
+};
+
+/*
+ * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
+ * to take sleeping spin_locks from __slab_free() and deactivate_slab().
+ * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
+ */
+static void free_deferred_objects(struct irq_work *work)
+{
+	struct defer_free *df = container_of(work, struct defer_free, work);
+	struct llist_head *objs = &df->objects;
+	struct llist_head *slabs = &df->slabs;
+	struct llist_node *llnode, *pos, *t;
+
+	if (llist_empty(objs) && llist_empty(slabs))
+		return;
+
+	llnode = llist_del_all(objs);
+	llist_for_each_safe(pos, t, llnode) {
+		struct kmem_cache *s;
+		struct slab *slab;
+		void *x = pos;
+
+		slab = virt_to_slab(x);
+		s = slab->slab_cache;
+
+		/*
+		 * We used freepointer in 'x' to link 'x' into df->objects.
+		 * Clear it to NULL to avoid false positive detection
+		 * of "Freepointer corruption".
+		 */
+		*(void **)x = NULL;
+
+		/* Point 'x' back to the beginning of allocated object */
+		x -= s->offset;
+		/*
+		 * memcg, kasan_slab_pre are already done for 'x'.
+		 * The only thing left is kasan_poison.
+		 */
+		kasan_slab_free(s, x, false, false, true);
+		__slab_free(s, slab, x, x, 1, _THIS_IP_);
+	}
+
+	llnode = llist_del_all(slabs);
+	llist_for_each_safe(pos, t, llnode) {
+		struct slab *slab = container_of(pos, struct slab, llnode);
+
+#ifdef CONFIG_SLUB_TINY
+		discard_slab(slab->slab_cache, slab);
+#else
+		deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
+#endif
+	}
+}
+
+static void defer_free(struct kmem_cache *s, void *head)
+{
+	struct defer_free *df = this_cpu_ptr(&defer_free_objects);
+
+	if (llist_add(head + s->offset, &df->objects))
+		irq_work_queue(&df->work);
+}
+
+static void defer_deactivate_slab(struct slab *slab)
+{
+	struct defer_free *df = this_cpu_ptr(&defer_free_objects);
+
+	if (llist_add(&slab->llnode, &df->slabs))
+		irq_work_queue(&df->work);
+}
+
+void defer_free_barrier(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
+}
+
 #ifndef CONFIG_SLUB_TINY
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -4575,6 +4857,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 				struct slab *slab, void *head, void *tail,
 				int cnt, unsigned long addr)
 {
+	/* cnt == 0 signals that it's called from kfree_nolock() */
+	bool allow_spin = cnt;
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
 	void **freelist;
@@ -4593,10 +4877,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	barrier();
 
 	if (unlikely(slab != c->slab)) {
-		__slab_free(s, slab, head, tail, cnt, addr);
+		if (unlikely(!allow_spin)) {
+			/*
+			 * __slab_free() can locklessly cmpxchg16 into a slab,
+			 * but then it might need to take spin_lock or local_lock
+			 * in put_cpu_partial() for further processing.
+			 * Avoid the complexity and simply add to a deferred list.
+			 */
+			defer_free(s, head);
+		} else {
+			__slab_free(s, slab, head, tail, cnt, addr);
+		}
 		return;
 	}
 
+	if (unlikely(!allow_spin)) {
+		if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
+		    local_lock_is_locked(&s->cpu_slab->lock)) {
+			defer_free(s, head);
+			return;
+		}
+		cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
+		kasan_slab_free(s, head, false, false, /* skip quarantine */true);
+	}
+
 	if (USE_LOCKLESS_FAST_PATH()) {
 		freelist = READ_ONCE(c->freelist);
 
@@ -4607,8 +4911,10 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 			goto redo;
 		}
 	} else {
+		__maybe_unused long flags = 0;
+
 		/* Update the free list under the local lock */
-		local_lock(&s->cpu_slab->lock);
+		local_lock_cpu_slab(s, &flags);
 		c = this_cpu_ptr(s->cpu_slab);
 		if (unlikely(slab != c->slab)) {
 			local_unlock(&s->cpu_slab->lock);
@@ -4621,7 +4927,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 		c->freelist = head;
 		c->tid = next_tid(tid);
 
-		local_unlock(&s->cpu_slab->lock);
+		local_unlock_cpu_slab(s, &flags);
 	}
 	stat_add(s, FREE_FASTPATH, cnt);
 }
@@ -4844,6 +5150,62 @@ void kfree(const void *object)
 }
 EXPORT_SYMBOL(kfree);
 
+/*
+ * Can be called while holding raw_spinlock_t or from IRQ and NMI,
+ * but only for objects allocated by kmalloc_nolock(),
+ * since some debug checks (like kmemleak and kfence) were
+ * skipped on allocation. large_kmalloc is not supported either.
+ */
+void kfree_nolock(const void *object)
+{
+	struct folio *folio;
+	struct slab *slab;
+	struct kmem_cache *s;
+	void *x = (void *)object;
+
+	if (unlikely(ZERO_OR_NULL_PTR(object)))
+		return;
+
+	folio = virt_to_folio(object);
+	if (unlikely(!folio_test_slab(folio))) {
+		WARN(1, "Buggy usage of kfree_nolock");
+		return;
+	}
+
+	slab = folio_slab(folio);
+	s = slab->slab_cache;
+
+	memcg_slab_free_hook(s, slab, &x, 1);
+	alloc_tagging_slab_free_hook(s, slab, &x, 1);
+	/*
+	 * Unlike slab_free() do NOT call the following:
+	 * kmemleak_free_recursive(x, s->flags);
+	 * debug_check_no_locks_freed(x, s->object_size);
+	 * debug_check_no_obj_freed(x, s->object_size);
+	 * __kcsan_check_access(x, s->object_size, ..);
+	 * kfence_free(x);
+	 * since they take spinlocks.
+	 */
+	kmsan_slab_free(s, x);
+	/*
+	 * If KASAN finds a kernel bug it will do kasan_report_invalid_free()
+	 * which will call raw_spin_lock_irqsave() which is technically
+	 * unsafe from NMI, but take chance and report kernel bug.
+	 * The sequence of
+	 * kasan_report_invalid_free() -> raw_spin_lock_irqsave() -> NMI
+	 *  -> kfree_nolock() -> kasan_report_invalid_free() on the same CPU
+	 * is double buggy and deserves to deadlock.
+	 */
+	if (kasan_slab_pre_free(s, x))
+		return;
+#ifndef CONFIG_SLUB_TINY
+	do_slab_free(s, slab, x, x, 0, _RET_IP_);
+#else
+	defer_free(s, x);
+#endif
+}
+EXPORT_SYMBOL_GPL(kfree_nolock);
+
 static __always_inline __realloc_size(2) void *
 __do_krealloc(const void *p, size_t new_size, gfp_t flags)
 {
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 6/6] slab: Make slub local_trylock_t more precise for LOCKDEP
  2025-07-16  2:29 [PATCH v3 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2025-07-16  2:29 ` [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
@ 2025-07-16  2:29 ` Alexei Starovoitov
  2025-07-16 13:35   ` Vlastimil Babka
  5 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-16  2:29 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes

From: Alexei Starovoitov <ast@kernel.org>

Since kmalloc_nolock() can be called from any context
the ___slab_alloc() can acquire local_trylock_t (which is rt_spin_lock
in PREEMPT_RT) and attempt to acquire a different local_trylock_t
while in the same task context.

The calling sequence might look like:
kmalloc() -> tracepoint -> bpf -> kmalloc_nolock()

or more precisely:
__lock_acquire+0x12ad/0x2590
lock_acquire+0x133/0x2d0
rt_spin_lock+0x6f/0x250
___slab_alloc+0xb7/0xec0
kmalloc_nolock_noprof+0x15a/0x430
my_debug_callback+0x20e/0x390 [testmod]
___slab_alloc+0x256/0xec0
__kmalloc_cache_noprof+0xd6/0x3b0

Make LOCKDEP understand that local_trylock_t-s protect
different kmem_caches. In order to do that add lock_class_key
for each kmem_cache and use that key in local_trylock_t.

This stack trace is possible on both PREEMPT_RT and !PREEMPT_RT,
but teach lockdep about it only for PREEMP_RT, since
in !PREEMPT_RT the code is using local_trylock_irqsave() only.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/slab.h |  1 +
 mm/slub.c | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/mm/slab.h b/mm/slab.h
index 65f4616b41de..165737accb20 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -262,6 +262,7 @@ struct kmem_cache_order_objects {
 struct kmem_cache {
 #ifndef CONFIG_SLUB_TINY
 	struct kmem_cache_cpu __percpu *cpu_slab;
+	struct lock_class_key lock_key;
 #endif
 	/* Used for retrieving partial slabs, etc. */
 	slab_flags_t flags;
diff --git a/mm/slub.c b/mm/slub.c
index c92703d367d7..526296778247 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3089,12 +3089,26 @@ static inline void note_cmpxchg_failure(const char *n,
 
 static void init_kmem_cache_cpus(struct kmem_cache *s)
 {
+#ifdef CONFIG_PREEMPT_RT
+	/* Register lockdep key for non-boot kmem caches */
+	bool finegrain_lockdep = !init_section_contains(s, 1);
+#else
+	/*
+	 * Don't bother with different lockdep classes for each
+	 * kmem_cache, since we only use local_trylock_irqsave().
+	 */
+	bool finegrain_lockdep = false;
+#endif
 	int cpu;
 	struct kmem_cache_cpu *c;
 
+	if (finegrain_lockdep)
+		lockdep_register_key(&s->lock_key);
 	for_each_possible_cpu(cpu) {
 		c = per_cpu_ptr(s->cpu_slab, cpu);
 		local_trylock_init(&c->lock);
+		if (finegrain_lockdep)
+			lockdep_set_class(&c->lock, &s->lock_key);
 		c->tid = init_tid(cpu);
 	}
 }
@@ -5976,6 +5990,9 @@ void __kmem_cache_release(struct kmem_cache *s)
 {
 	cache_random_seq_destroy(s);
 #ifndef CONFIG_SLUB_TINY
+#ifdef CONFIG_PREEMPT_RT
+	lockdep_unregister_key(&s->lock_key);
+#endif
 	free_percpu(s->cpu_slab);
 #endif
 	free_kmem_cache_nodes(s);
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-07-16  2:29 ` [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
@ 2025-07-16 10:58   ` Vlastimil Babka
  2025-07-17  2:50     ` Alexei Starovoitov
  2025-08-25  4:45   ` Harry Yoo
  1 sibling, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2025-07-16 10:58 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf, linux-mm
  Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
	peterz, rostedt, hannes

On 7/16/25 04:29, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> kmalloc_nolock() relies on ability of local_lock to detect the situation

					^ local_trylock_t perhaps?

> when it's locked.
> In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.

It can be also true when you call it from the bpf hook, no?

> In that case retry the operation in a different kmalloc bucket.
> The second attempt will likely succeed, since this cpu locked
> different kmem_cache_cpu.
> 
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> a different bucket that is most likely is not locked by the current
> task. Though it may be locked by a different task it's safe to
> rt_spin_lock() on it.
> 
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
> 
> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and in_nmi() or in PREEMPT_RT.
> 
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
> always defers to irq_work.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Haven't seen an obvious bug now but will ponder it some more. Meanwhile some
nits and maybe one bit more serious concern.

> +static inline void local_lock_cpu_slab(struct kmem_cache *s, unsigned long *flags)
> +{
> +	/*
> +	 * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> +	 * can be acquired without a deadlock before invoking the function.
> +	 *
> +	 * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> +	 * disabled context. The lock will always be acquired and if needed it
> +	 * block and sleep until the lock is available.
> +	 *
> +	 * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
> +	 * is always acquired with disabled interrupts meaning it is always
> +	 * possible to it.
> +	 * In NMI context it is needed to check if the lock is acquired. If it is not,

This also could mention the bpf instrumentation context?

> +	 * it is safe to acquire it. The trylock semantic is used to tell lockdep
> +	 * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
> +	 * the lock.
> +	 *
> +	 */
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> +		BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));

Linus might still spot the BUG_ON() and complain, lockdep_assert() would be
safer maybe :)
Or just use local_lock_irqsave() with !CONFIG_LOCKDEP as well.

Nit: maybe could be a #define to avoid the unusual need for "&flags" instead
of "flags" when calling.

> +/**
> + * kmalloc_nolock - Allocate an object of given size from any context.
> + * @size: size to allocate
> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> + * @node: node number of the target node.
> + *
> + * Return: pointer to the new object or NULL in case of error.
> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> + * There is no reason to call it again and expect !NULL.
> + */
> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> +{
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> +	struct kmem_cache *s;
> +	bool can_retry = true;
> +	void *ret = ERR_PTR(-EBUSY);
> +
> +	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> +
> +	if (unlikely(!size))
> +		return ZERO_SIZE_PTR;
> +
> +	if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))

Nit: maybe just due explicit PREEMPT_RT checks when the code isn't about
lockless fastpaths,

> +		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> +		return NULL;
> +retry:
> +	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> +		return NULL;
> +	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> +
> +	if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
> +		/*
> +		 * kmalloc_nolock() is not supported on architectures that
> +		 * don't implement cmpxchg16b, but debug caches don't use
> +		 * per-cpu slab and per-cpu partial slabs. They rely on
> +		 * kmem_cache_node->list_lock, so kmalloc_nolock() can
> +		 * attempt to allocate from debug caches by
> +		 * spin_trylock_irqsave(&n->list_lock, ...)
> +		 */
> +		return NULL;
> +
> +	/*
> +	 * Do not call slab_alloc_node(), since trylock mode isn't
> +	 * compatible with slab_pre_alloc_hook/should_failslab and
> +	 * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
> +	 * and slab_post_alloc_hook() directly.
> +	 *
> +	 * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> +	 * in irq saved region. It assumes that the same cpu will not
> +	 * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> +	 * Therefore use in_nmi() to check whether particular bucket is in
> +	 * irq protected section.
> +	 *
> +	 * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
> +	 * this cpu was interrupted somewhere inside ___slab_alloc() after
> +	 * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
> +	 * In this case fast path with __update_cpu_freelist_fast() is not safe.
> +	 */
> +#ifndef CONFIG_SLUB_TINY
> +	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> +#endif
> +		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);

Nit: use IS_DEFINED(CONFIG_SLUB_TINY) to make this look better?

> +static void defer_deactivate_slab(struct slab *slab)
> +{

Nit: for more consistency this could thake the freelist argument and assign
it here, and not in the caller.

> +	struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> +
> +	if (llist_add(&slab->llnode, &df->slabs))
> +		irq_work_queue(&df->work);
> +}
> +
> +void defer_free_barrier(void)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
> +}
> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> @@ -4575,6 +4857,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  				struct slab *slab, void *head, void *tail,
>  				int cnt, unsigned long addr)
>  {
> +	/* cnt == 0 signals that it's called from kfree_nolock() */
> +	bool allow_spin = cnt;
>  	struct kmem_cache_cpu *c;
>  	unsigned long tid;
>  	void **freelist;
> @@ -4593,10 +4877,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  	barrier();
>  
>  	if (unlikely(slab != c->slab)) {
> -		__slab_free(s, slab, head, tail, cnt, addr);
> +		if (unlikely(!allow_spin)) {
> +			/*
> +			 * __slab_free() can locklessly cmpxchg16 into a slab,
> +			 * but then it might need to take spin_lock or local_lock
> +			 * in put_cpu_partial() for further processing.
> +			 * Avoid the complexity and simply add to a deferred list.
> +			 */
> +			defer_free(s, head);
> +		} else {
> +			__slab_free(s, slab, head, tail, cnt, addr);
> +		}
>  		return;
>  	}
>  
> +	if (unlikely(!allow_spin)) {
> +		if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&

Same nit about USE_LOCKLESS_FAST_PATH

> +		    local_lock_is_locked(&s->cpu_slab->lock)) {
> +			defer_free(s, head);
> +			return;
> +		}
> +		cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> +		kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> +	}
> +
>  	if (USE_LOCKLESS_FAST_PATH()) {
>  		freelist = READ_ONCE(c->freelist);
>  
> @@ -4607,8 +4911,10 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  			goto redo;
>  		}
>  	} else {
> +		__maybe_unused long flags = 0;
> +
>  		/* Update the free list under the local lock */
> -		local_lock(&s->cpu_slab->lock);
> +		local_lock_cpu_slab(s, &flags);
>  		c = this_cpu_ptr(s->cpu_slab);
>  		if (unlikely(slab != c->slab)) {
>  			local_unlock(&s->cpu_slab->lock);
> @@ -4621,7 +4927,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  		c->freelist = head;
>  		c->tid = next_tid(tid);
>  
> -		local_unlock(&s->cpu_slab->lock);
> +		local_unlock_cpu_slab(s, &flags);
>  	}
>  	stat_add(s, FREE_FASTPATH, cnt);
>  }
> @@ -4844,6 +5150,62 @@ void kfree(const void *object)
>  }
>  EXPORT_SYMBOL(kfree);
>  
> +/*
> + * Can be called while holding raw_spinlock_t or from IRQ and NMI,
> + * but only for objects allocated by kmalloc_nolock(),
> + * since some debug checks (like kmemleak and kfence) were
> + * skipped on allocation. large_kmalloc is not supported either.
> + */
> +void kfree_nolock(const void *object)
> +{
> +	struct folio *folio;
> +	struct slab *slab;
> +	struct kmem_cache *s;
> +	void *x = (void *)object;
> +
> +	if (unlikely(ZERO_OR_NULL_PTR(object)))
> +		return;
> +
> +	folio = virt_to_folio(object);
> +	if (unlikely(!folio_test_slab(folio))) {
> +		WARN(1, "Buggy usage of kfree_nolock");
> +		return;
> +	}
> +
> +	slab = folio_slab(folio);
> +	s = slab->slab_cache;
> +
> +	memcg_slab_free_hook(s, slab, &x, 1);
> +	alloc_tagging_slab_free_hook(s, slab, &x, 1);
> +	/*
> +	 * Unlike slab_free() do NOT call the following:
> +	 * kmemleak_free_recursive(x, s->flags);
> +	 * debug_check_no_locks_freed(x, s->object_size);
> +	 * debug_check_no_obj_freed(x, s->object_size);
> +	 * __kcsan_check_access(x, s->object_size, ..);
> +	 * kfence_free(x);
> +	 * since they take spinlocks.
> +	 */

So here's the bigger concern. What if someone allocates with regular
kmalloc() so that the debugging stuff is performed as usual, and then tries
to use kfree_nolock() whre we skip it? You might not be planning such usage,
but later someone can realize that only their freeing context is limited,
finds out kfree_nolock() exists and tries to use it?

Can we document this strongly enough? Or even enforce it somehow? Or when
any of these kinds of debugs above are enabled, we play it safe and use
defer_free()?

> +	kmsan_slab_free(s, x);
> +	/*
> +	 * If KASAN finds a kernel bug it will do kasan_report_invalid_free()
> +	 * which will call raw_spin_lock_irqsave() which is technically
> +	 * unsafe from NMI, but take chance and report kernel bug.
> +	 * The sequence of
> +	 * kasan_report_invalid_free() -> raw_spin_lock_irqsave() -> NMI
> +	 *  -> kfree_nolock() -> kasan_report_invalid_free() on the same CPU
> +	 * is double buggy and deserves to deadlock.
> +	 */
> +	if (kasan_slab_pre_free(s, x))
> +		return;
> +#ifndef CONFIG_SLUB_TINY
> +	do_slab_free(s, slab, x, x, 0, _RET_IP_);
> +#else
> +	defer_free(s, x);
> +#endif
> +}
> +EXPORT_SYMBOL_GPL(kfree_nolock);
> +
>  static __always_inline __realloc_size(2) void *
>  __do_krealloc(const void *p, size_t new_size, gfp_t flags)
>  {



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/6] slab: Make slub local_trylock_t more precise for LOCKDEP
  2025-07-16  2:29 ` [PATCH v3 6/6] slab: Make slub local_trylock_t more precise for LOCKDEP Alexei Starovoitov
@ 2025-07-16 13:35   ` Vlastimil Babka
  2025-07-17  3:32     ` Alexei Starovoitov
  0 siblings, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2025-07-16 13:35 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf, linux-mm
  Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
	peterz, rostedt, hannes

On 7/16/25 04:29, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Since kmalloc_nolock() can be called from any context
> the ___slab_alloc() can acquire local_trylock_t (which is rt_spin_lock
> in PREEMPT_RT) and attempt to acquire a different local_trylock_t
> while in the same task context.
> 
> The calling sequence might look like:
> kmalloc() -> tracepoint -> bpf -> kmalloc_nolock()
> 
> or more precisely:
> __lock_acquire+0x12ad/0x2590
> lock_acquire+0x133/0x2d0
> rt_spin_lock+0x6f/0x250
> ___slab_alloc+0xb7/0xec0
> kmalloc_nolock_noprof+0x15a/0x430
> my_debug_callback+0x20e/0x390 [testmod]
> ___slab_alloc+0x256/0xec0
> __kmalloc_cache_noprof+0xd6/0x3b0
> 
> Make LOCKDEP understand that local_trylock_t-s protect
> different kmem_caches. In order to do that add lock_class_key
> for each kmem_cache and use that key in local_trylock_t.
> 
> This stack trace is possible on both PREEMPT_RT and !PREEMPT_RT,
> but teach lockdep about it only for PREEMP_RT, since
> in !PREEMPT_RT the code is using local_trylock_irqsave() only.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Should we switch the order of patches 5 and 6 or is it sufficient there are
no callers of kmalloc_nolock() yet?

> ---
>  mm/slab.h |  1 +
>  mm/slub.c | 17 +++++++++++++++++
>  2 files changed, 18 insertions(+)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 65f4616b41de..165737accb20 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -262,6 +262,7 @@ struct kmem_cache_order_objects {
>  struct kmem_cache {
>  #ifndef CONFIG_SLUB_TINY
>  	struct kmem_cache_cpu __percpu *cpu_slab;
> +	struct lock_class_key lock_key;

I see " * The class key takes no space if lockdep is disabled:", ok good.

>  #endif
>  	/* Used for retrieving partial slabs, etc. */
>  	slab_flags_t flags;
> diff --git a/mm/slub.c b/mm/slub.c
> index c92703d367d7..526296778247 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3089,12 +3089,26 @@ static inline void note_cmpxchg_failure(const char *n,
>  
>  static void init_kmem_cache_cpus(struct kmem_cache *s)
>  {
> +#ifdef CONFIG_PREEMPT_RT
> +	/* Register lockdep key for non-boot kmem caches */
> +	bool finegrain_lockdep = !init_section_contains(s, 1);

I guess it's to avoid the "if (WARN_ON_ONCE(static_obj(key)))"
if it means the two bootstrap caches get a different class just by being
static, then I guess it works.

> +#else
> +	/*
> +	 * Don't bother with different lockdep classes for each
> +	 * kmem_cache, since we only use local_trylock_irqsave().
> +	 */
> +	bool finegrain_lockdep = false;
> +#endif
>  	int cpu;
>  	struct kmem_cache_cpu *c;
>  
> +	if (finegrain_lockdep)
> +		lockdep_register_key(&s->lock_key);
>  	for_each_possible_cpu(cpu) {
>  		c = per_cpu_ptr(s->cpu_slab, cpu);
>  		local_trylock_init(&c->lock);
> +		if (finegrain_lockdep)
> +			lockdep_set_class(&c->lock, &s->lock_key);
>  		c->tid = init_tid(cpu);
>  	}
>  }
> @@ -5976,6 +5990,9 @@ void __kmem_cache_release(struct kmem_cache *s)
>  {
>  	cache_random_seq_destroy(s);
>  #ifndef CONFIG_SLUB_TINY
> +#ifdef CONFIG_PREEMPT_RT
> +	lockdep_unregister_key(&s->lock_key);
> +#endif
>  	free_percpu(s->cpu_slab);
>  #endif
>  	free_kmem_cache_nodes(s);



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-07-16 10:58   ` Vlastimil Babka
@ 2025-07-17  2:50     ` Alexei Starovoitov
  2025-07-17  9:18       ` Vlastimil Babka
  2025-07-18  0:09       ` Alexei Starovoitov
  0 siblings, 2 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-17  2:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner

On Wed, Jul 16, 2025 at 3:58 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/16/25 04:29, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > kmalloc_nolock() relies on ability of local_lock to detect the situation
>
>                                         ^ local_trylock_t perhaps?
>
> > when it's locked.
> > In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> > irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
>
> It can be also true when you call it from the bpf hook, no?

Correct. Technically one can disasm ___slab_alloc(),
find some instruction after pushfq;cli,
add kprobe there, attach bpf prog to kprobe, and call kmalloc_nolock
(eventually, when bpf infra switches to kmalloc_nolock).
I wouldn't call it malicious, but if somebody does that
they are looking for trouble. Even syzbot doesn't do such things.

> > In that case retry the operation in a different kmalloc bucket.
> > The second attempt will likely succeed, since this cpu locked
> > different kmem_cache_cpu.
> >
> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> > into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> > a different bucket that is most likely is not locked by the current
> > task. Though it may be locked by a different task it's safe to
> > rt_spin_lock() on it.
> >
> > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > immediately if called from hard irq or NMI in PREEMPT_RT.
> >
> > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > and in_nmi() or in PREEMPT_RT.
> >
> > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
> > always defers to irq_work.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Haven't seen an obvious bug now but will ponder it some more. Meanwhile some
> nits and maybe one bit more serious concern.
>
> > +static inline void local_lock_cpu_slab(struct kmem_cache *s, unsigned long *flags)
> > +{
> > +     /*
> > +      * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> > +      * can be acquired without a deadlock before invoking the function.
> > +      *
> > +      * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> > +      * disabled context. The lock will always be acquired and if needed it
> > +      * block and sleep until the lock is available.
> > +      *
> > +      * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
> > +      * is always acquired with disabled interrupts meaning it is always
> > +      * possible to it.
> > +      * In NMI context it is needed to check if the lock is acquired. If it is not,
>
> This also could mention the bpf instrumentation context?

Ok.

> > +      * it is safe to acquire it. The trylock semantic is used to tell lockdep
> > +      * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
> > +      * the lock.
> > +      *
> > +      */
> > +     if (!IS_ENABLED(CONFIG_PREEMPT_RT))
> > +             BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
>
> Linus might still spot the BUG_ON() and complain, lockdep_assert() would be
> safer maybe :)
> Or just use local_lock_irqsave() with !CONFIG_LOCKDEP as well.

Fair enough. Let's save one branch in the critical path.

> Nit: maybe could be a #define to avoid the unusual need for "&flags" instead
> of "flags" when calling.

When "bool allow_spin" was there in Sebastian's version it definitely
looked cleaner as a proper function,
but now, if (!IS_ENABLED(CONFIG_PREEMPT_RT)) can be
#ifdef CONFIG_PREEMPT_RT
and the comment will look normal (without ugly backslashes)
So yeah. I'll convert it to macro.

> > +/**
> > + * kmalloc_nolock - Allocate an object of given size from any context.
> > + * @size: size to allocate
> > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> > + * @node: node number of the target node.
> > + *
> > + * Return: pointer to the new object or NULL in case of error.
> > + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> > + * There is no reason to call it again and expect !NULL.
> > + */
> > +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> > +{
> > +     gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> > +     struct kmem_cache *s;
> > +     bool can_retry = true;
> > +     void *ret = ERR_PTR(-EBUSY);
> > +
> > +     VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> > +
> > +     if (unlikely(!size))
> > +             return ZERO_SIZE_PTR;
> > +
> > +     if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
>
> Nit: maybe just due explicit PREEMPT_RT checks when the code isn't about
> lockless fastpaths,

True. I wasn't sure what's better.
do_slab_free() does
if (USE_LOCKLESS_FAST_PATH())
but it's really meant to be PREEMPT_RT,
since 'else' part doesn't make sense otherwise.
It's doing local_lock() without _irqsave() which is
inconsistent with everything else and looks broken
when one doesn't have knowledge of local_lock_internal.h
This patch fixes this part:
-               local_lock(&s->cpu_slab->lock);
+               local_lock_cpu_slab(s, &flags);


Here, in this hunk, if (IS_ENABLED(CONFIG_PREEMPT_RT) might
look better indeed.

> > +             /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> > +             return NULL;
> > +retry:
> > +     if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> > +             return NULL;
> > +     s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> > +
> > +     if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
> > +             /*
> > +              * kmalloc_nolock() is not supported on architectures that
> > +              * don't implement cmpxchg16b, but debug caches don't use
> > +              * per-cpu slab and per-cpu partial slabs. They rely on
> > +              * kmem_cache_node->list_lock, so kmalloc_nolock() can
> > +              * attempt to allocate from debug caches by
> > +              * spin_trylock_irqsave(&n->list_lock, ...)
> > +              */
> > +             return NULL;
> > +
> > +     /*
> > +      * Do not call slab_alloc_node(), since trylock mode isn't
> > +      * compatible with slab_pre_alloc_hook/should_failslab and
> > +      * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
> > +      * and slab_post_alloc_hook() directly.
> > +      *
> > +      * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> > +      * in irq saved region. It assumes that the same cpu will not
> > +      * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> > +      * Therefore use in_nmi() to check whether particular bucket is in
> > +      * irq protected section.
> > +      *
> > +      * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
> > +      * this cpu was interrupted somewhere inside ___slab_alloc() after
> > +      * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
> > +      * In this case fast path with __update_cpu_freelist_fast() is not safe.
> > +      */
> > +#ifndef CONFIG_SLUB_TINY
> > +     if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> > +#endif
> > +             ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
>
> Nit: use IS_DEFINED(CONFIG_SLUB_TINY) to make this look better?

ok.

> > +static void defer_deactivate_slab(struct slab *slab)
> > +{
>
> Nit: for more consistency this could thake the freelist argument and assign
> it here, and not in the caller.

ok.

> > +     struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> > +
> > +     if (llist_add(&slab->llnode, &df->slabs))
> > +             irq_work_queue(&df->work);
> > +}
> > +
> > +void defer_free_barrier(void)
> > +{
> > +     int cpu;
> > +
> > +     for_each_possible_cpu(cpu)
> > +             irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
> > +}
> > +
> >  #ifndef CONFIG_SLUB_TINY
> >  /*
> >   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> > @@ -4575,6 +4857,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >                               struct slab *slab, void *head, void *tail,
> >                               int cnt, unsigned long addr)
> >  {
> > +     /* cnt == 0 signals that it's called from kfree_nolock() */
> > +     bool allow_spin = cnt;
> >       struct kmem_cache_cpu *c;
> >       unsigned long tid;
> >       void **freelist;
> > @@ -4593,10 +4877,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >       barrier();
> >
> >       if (unlikely(slab != c->slab)) {
> > -             __slab_free(s, slab, head, tail, cnt, addr);
> > +             if (unlikely(!allow_spin)) {
> > +                     /*
> > +                      * __slab_free() can locklessly cmpxchg16 into a slab,
> > +                      * but then it might need to take spin_lock or local_lock
> > +                      * in put_cpu_partial() for further processing.
> > +                      * Avoid the complexity and simply add to a deferred list.
> > +                      */
> > +                     defer_free(s, head);
> > +             } else {
> > +                     __slab_free(s, slab, head, tail, cnt, addr);
> > +             }
> >               return;
> >       }
> >
> > +     if (unlikely(!allow_spin)) {
> > +             if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
>
> Same nit about USE_LOCKLESS_FAST_PATH

Here, I have to disagree unless we fix the couple lines below as well.

> > +                 local_lock_is_locked(&s->cpu_slab->lock)) {
> > +                     defer_free(s, head);
> > +                     return;
> > +             }
> > +             cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> > +             kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> > +     }
> > +
> >       if (USE_LOCKLESS_FAST_PATH()) {
> >               freelist = READ_ONCE(c->freelist);
> >
> > @@ -4607,8 +4911,10 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >                       goto redo;
> >               }
> >       } else {
> > +             __maybe_unused long flags = 0;
> > +
> >               /* Update the free list under the local lock */
> > -             local_lock(&s->cpu_slab->lock);
> > +             local_lock_cpu_slab(s, &flags);
> >               c = this_cpu_ptr(s->cpu_slab);
> >               if (unlikely(slab != c->slab)) {
> >                       local_unlock(&s->cpu_slab->lock);
> > @@ -4621,7 +4927,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >               c->freelist = head;
> >               c->tid = next_tid(tid);
> >
> > -             local_unlock(&s->cpu_slab->lock);
> > +             local_unlock_cpu_slab(s, &flags);
> >       }
> >       stat_add(s, FREE_FASTPATH, cnt);
> >  }
> > @@ -4844,6 +5150,62 @@ void kfree(const void *object)
> >  }
> >  EXPORT_SYMBOL(kfree);
> >
> > +/*
> > + * Can be called while holding raw_spinlock_t or from IRQ and NMI,
> > + * but only for objects allocated by kmalloc_nolock(),
> > + * since some debug checks (like kmemleak and kfence) were
> > + * skipped on allocation. large_kmalloc is not supported either.
> > + */
> > +void kfree_nolock(const void *object)
> > +{
> > +     struct folio *folio;
> > +     struct slab *slab;
> > +     struct kmem_cache *s;
> > +     void *x = (void *)object;
> > +
> > +     if (unlikely(ZERO_OR_NULL_PTR(object)))
> > +             return;
> > +
> > +     folio = virt_to_folio(object);
> > +     if (unlikely(!folio_test_slab(folio))) {
> > +             WARN(1, "Buggy usage of kfree_nolock");
> > +             return;
> > +     }
> > +
> > +     slab = folio_slab(folio);
> > +     s = slab->slab_cache;
> > +
> > +     memcg_slab_free_hook(s, slab, &x, 1);
> > +     alloc_tagging_slab_free_hook(s, slab, &x, 1);
> > +     /*
> > +      * Unlike slab_free() do NOT call the following:
> > +      * kmemleak_free_recursive(x, s->flags);
> > +      * debug_check_no_locks_freed(x, s->object_size);
> > +      * debug_check_no_obj_freed(x, s->object_size);
> > +      * __kcsan_check_access(x, s->object_size, ..);
> > +      * kfence_free(x);
> > +      * since they take spinlocks.
> > +      */
>
> So here's the bigger concern. What if someone allocates with regular
> kmalloc() so that the debugging stuff is performed as usual, and then tries
> to use kfree_nolock() whre we skip it? You might not be planning such usage,
> but later someone can realize that only their freeing context is limited,
> finds out kfree_nolock() exists and tries to use it?
>
> Can we document this strongly enough? Or even enforce it somehow? Or when
> any of these kinds of debugs above are enabled, we play it safe and use
> defer_free()?

Let's break it one by one.
1.
kmemleak_free_recursive() will miss an object that was recorded
during normal kmalloc() and that's indeed problematic.

2.
debug_check_no_locks_freed() and
debug_check_no_obj_freed()
are somewhat harmless.
We miss checks, but it's not breaking the corresponding features.

3.
__kcsan_check_access() doesn't take locks, but its stack is
so deep and looks to be recursive that I doubt it's safe from
any context.

4.
kfence_free() looks like an existing quirk.
I'm not sure why it's there in the slab free path :)
kfence comment says:
 * KFENCE objects live in a separate page range and are not to be intermixed
 * with regular heap objects (e.g. KFENCE objects must never be added to the
 * allocator freelists). Failing to do so may and will result in heap
 * corruptions, therefore is_kfence_address() must be used to check whether
 * an object requires specific handling.

so it should always be a nop for slab.
I removed the call for peace of mind.

So imo only 1 is dodgy. We can add:
if (!(flags & SLAB_NOLEAKTRACE) && kmemleak_free_enabled)
  defer_free(..);

but it's ugly too.

My preference is to add a comment saying that only objects
allocated by kmalloc_nolock() should be freed by kfree_nolock().


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/6] slab: Make slub local_trylock_t more precise for LOCKDEP
  2025-07-16 13:35   ` Vlastimil Babka
@ 2025-07-17  3:32     ` Alexei Starovoitov
  2025-07-17  9:29       ` Vlastimil Babka
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-17  3:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner

On Wed, Jul 16, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 7/16/25 04:29, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Since kmalloc_nolock() can be called from any context
> > the ___slab_alloc() can acquire local_trylock_t (which is rt_spin_lock
> > in PREEMPT_RT) and attempt to acquire a different local_trylock_t
> > while in the same task context.
> >
> > The calling sequence might look like:
> > kmalloc() -> tracepoint -> bpf -> kmalloc_nolock()
> >
> > or more precisely:
> > __lock_acquire+0x12ad/0x2590
> > lock_acquire+0x133/0x2d0
> > rt_spin_lock+0x6f/0x250
> > ___slab_alloc+0xb7/0xec0
> > kmalloc_nolock_noprof+0x15a/0x430
> > my_debug_callback+0x20e/0x390 [testmod]
> > ___slab_alloc+0x256/0xec0
> > __kmalloc_cache_noprof+0xd6/0x3b0
> >
> > Make LOCKDEP understand that local_trylock_t-s protect
> > different kmem_caches. In order to do that add lock_class_key
> > for each kmem_cache and use that key in local_trylock_t.
> >
> > This stack trace is possible on both PREEMPT_RT and !PREEMPT_RT,
> > but teach lockdep about it only for PREEMP_RT, since
> > in !PREEMPT_RT the code is using local_trylock_irqsave() only.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Should we switch the order of patches 5 and 6 or is it sufficient there are
> no callers of kmalloc_nolock() yet?

I can switch the order. I don't think it makes much difference.

> > ---
> >  mm/slab.h |  1 +
> >  mm/slub.c | 17 +++++++++++++++++
> >  2 files changed, 18 insertions(+)
> >
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 65f4616b41de..165737accb20 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -262,6 +262,7 @@ struct kmem_cache_order_objects {
> >  struct kmem_cache {
> >  #ifndef CONFIG_SLUB_TINY
> >       struct kmem_cache_cpu __percpu *cpu_slab;
> > +     struct lock_class_key lock_key;
>
> I see " * The class key takes no space if lockdep is disabled:", ok good.

yes.
I was considering guarding it with
#ifdef CONFIG_PREEMPT_RT
but then all accesses to ->lock_key would need to be wrapped
with #ifdef as well. So init_kmem_cache_cpus() will have
three #ifdef-s instead of one.
Hence the above lock_key; is unconditional,
though it's a bit odd that it's completely unused on !RT
(only the compiler touches it).

> >  #endif
> >       /* Used for retrieving partial slabs, etc. */
> >       slab_flags_t flags;
> > diff --git a/mm/slub.c b/mm/slub.c
> > index c92703d367d7..526296778247 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3089,12 +3089,26 @@ static inline void note_cmpxchg_failure(const char *n,
> >
> >  static void init_kmem_cache_cpus(struct kmem_cache *s)
> >  {
> > +#ifdef CONFIG_PREEMPT_RT
> > +     /* Register lockdep key for non-boot kmem caches */
> > +     bool finegrain_lockdep = !init_section_contains(s, 1);
>
> I guess it's to avoid the "if (WARN_ON_ONCE(static_obj(key)))"
> if it means the two bootstrap caches get a different class just by being
> static, then I guess it works.

Yes. Not pretty. The alternative is to pass a flag
through a bunch of functions all the way from kmem_cache_init.
Another alternative is to
bool finegrain_lockdep = s != boot_kmem_cache_node && s != boot_kmem_cache.
Both are imo uglier.
init_section_contains() is more precise and matches static_obj().

Checking for SLAB_NO_OBJ_EXT isn't an option. Since it's conditional.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-07-17  2:50     ` Alexei Starovoitov
@ 2025-07-17  9:18       ` Vlastimil Babka
  2025-07-17 16:23         ` Alexei Starovoitov
  2025-07-18  0:09       ` Alexei Starovoitov
  1 sibling, 1 reply; 18+ messages in thread
From: Vlastimil Babka @ 2025-07-17  9:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner

On 7/17/25 04:50, Alexei Starovoitov wrote:
> On Wed, Jul 16, 2025 at 3:58 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> > In that case retry the operation in a different kmalloc bucket.
>> > The second attempt will likely succeed, since this cpu locked
>> > different kmem_cache_cpu.
>> >
>> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
>> > per-cpu rt_spin_lock is locked by current task. In this case re-entrance
>> > into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
>> > a different bucket that is most likely is not locked by the current
>> > task. Though it may be locked by a different task it's safe to
>> > rt_spin_lock() on it.
>> >
>> > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
>> > immediately if called from hard irq or NMI in PREEMPT_RT.
>> >
>> > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
>> > and in_nmi() or in PREEMPT_RT.
>> >
>> > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
>> > spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
>> > always defers to irq_work.
>> >
>> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>
>> Haven't seen an obvious bug now but will ponder it some more. Meanwhile some
>> nits and maybe one bit more serious concern.
>>
>> > +static inline void local_lock_cpu_slab(struct kmem_cache *s, unsigned long *flags)
>> > +{
>> > +     /*
>> > +      * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
>> > +      * can be acquired without a deadlock before invoking the function.
>> > +      *
>> > +      * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
>> > +      * disabled context. The lock will always be acquired and if needed it
>> > +      * block and sleep until the lock is available.
>> > +      *
>> > +      * On !PREEMPT_RT allocations from any context but NMI are safe. The lock
>> > +      * is always acquired with disabled interrupts meaning it is always
>> > +      * possible to it.
>> > +      * In NMI context it is needed to check if the lock is acquired. If it is not,
>>
>> This also could mention the bpf instrumentation context?
> 
> Ok.
> 
>> > +      * it is safe to acquire it. The trylock semantic is used to tell lockdep
>> > +      * that we don't spin. The BUG_ON() will not trigger if it is safe to acquire
>> > +      * the lock.
>> > +      *
>> > +      */
>> > +     if (!IS_ENABLED(CONFIG_PREEMPT_RT))
>> > +             BUG_ON(!local_trylock_irqsave(&s->cpu_slab->lock, *flags));
>>
>> Linus might still spot the BUG_ON() and complain, lockdep_assert() would be
>> safer maybe :)
>> Or just use local_lock_irqsave() with !CONFIG_LOCKDEP as well.
> 
> Fair enough. Let's save one branch in the critical path.
> 
>> Nit: maybe could be a #define to avoid the unusual need for "&flags" instead
>> of "flags" when calling.
> 
> When "bool allow_spin" was there in Sebastian's version it definitely
> looked cleaner as a proper function,
> but now, if (!IS_ENABLED(CONFIG_PREEMPT_RT)) can be
> #ifdef CONFIG_PREEMPT_RT
> and the comment will look normal (without ugly backslashes)
> So yeah. I'll convert it to macro.

To clarify, the ideal I think would be e.g.

#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)

local_lock_irqsave();

#else

lockdep_assert(local_trylock_irqsave());

#endif

This should mean that without lockdep we just trust the code to be correct
(kmalloc_nolock() using local_lock_held() properly before calling here, and
kmalloc() callers not being in an unsupported context, as before this
series) with no checking on both RT and !RT.

With lockdep, on RT lockdep does its checking in local_lock_irqsave()
normally. On !RT we need to use trylock to avoid false positives in nmi, but
lockdep_assert() will catch a bug still in case of a true positive.

At least I hope I got it right...

>> > +             /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
>> > +             return NULL;
>> > +retry:
>> > +     if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
>> > +             return NULL;
>> > +     s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
>> > +
>> > +     if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
>> > +             /*
>> > +              * kmalloc_nolock() is not supported on architectures that
>> > +              * don't implement cmpxchg16b, but debug caches don't use
>> > +              * per-cpu slab and per-cpu partial slabs. They rely on
>> > +              * kmem_cache_node->list_lock, so kmalloc_nolock() can
>> > +              * attempt to allocate from debug caches by
>> > +              * spin_trylock_irqsave(&n->list_lock, ...)
>> > +              */
>> > +             return NULL;
>> > +
>> > +     /*
>> > +      * Do not call slab_alloc_node(), since trylock mode isn't
>> > +      * compatible with slab_pre_alloc_hook/should_failslab and
>> > +      * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
>> > +      * and slab_post_alloc_hook() directly.
>> > +      *
>> > +      * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
>> > +      * in irq saved region. It assumes that the same cpu will not
>> > +      * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
>> > +      * Therefore use in_nmi() to check whether particular bucket is in
>> > +      * irq protected section.
>> > +      *
>> > +      * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
>> > +      * this cpu was interrupted somewhere inside ___slab_alloc() after
>> > +      * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
>> > +      * In this case fast path with __update_cpu_freelist_fast() is not safe.
>> > +      */
>> > +#ifndef CONFIG_SLUB_TINY
>> > +     if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
>> > +#endif
>> > +             ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
>>
>> Nit: use IS_DEFINED(CONFIG_SLUB_TINY) to make this look better?
> 
> ok.
> 
>> > +static void defer_deactivate_slab(struct slab *slab)
>> > +{
>>
>> Nit: for more consistency this could thake the freelist argument and assign
>> it here, and not in the caller.
> 
> ok.
> 
>> > +     struct defer_free *df = this_cpu_ptr(&defer_free_objects);
>> > +
>> > +     if (llist_add(&slab->llnode, &df->slabs))
>> > +             irq_work_queue(&df->work);
>> > +}
>> > +
>> > +void defer_free_barrier(void)
>> > +{
>> > +     int cpu;
>> > +
>> > +     for_each_possible_cpu(cpu)
>> > +             irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
>> > +}
>> > +
>> >  #ifndef CONFIG_SLUB_TINY
>> >  /*
>> >   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
>> > @@ -4575,6 +4857,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>> >                               struct slab *slab, void *head, void *tail,
>> >                               int cnt, unsigned long addr)
>> >  {
>> > +     /* cnt == 0 signals that it's called from kfree_nolock() */
>> > +     bool allow_spin = cnt;
>> >       struct kmem_cache_cpu *c;
>> >       unsigned long tid;
>> >       void **freelist;
>> > @@ -4593,10 +4877,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>> >       barrier();
>> >
>> >       if (unlikely(slab != c->slab)) {
>> > -             __slab_free(s, slab, head, tail, cnt, addr);
>> > +             if (unlikely(!allow_spin)) {
>> > +                     /*
>> > +                      * __slab_free() can locklessly cmpxchg16 into a slab,
>> > +                      * but then it might need to take spin_lock or local_lock
>> > +                      * in put_cpu_partial() for further processing.
>> > +                      * Avoid the complexity and simply add to a deferred list.
>> > +                      */
>> > +                     defer_free(s, head);
>> > +             } else {
>> > +                     __slab_free(s, slab, head, tail, cnt, addr);
>> > +             }
>> >               return;
>> >       }
>> >
>> > +     if (unlikely(!allow_spin)) {
>> > +             if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
>>
>> Same nit about USE_LOCKLESS_FAST_PATH
> 
> Here, I have to disagree unless we fix the couple lines below as well.

Ah you're right, we can leave it.

>> So here's the bigger concern. What if someone allocates with regular
>> kmalloc() so that the debugging stuff is performed as usual, and then tries
>> to use kfree_nolock() whre we skip it? You might not be planning such usage,
>> but later someone can realize that only their freeing context is limited,
>> finds out kfree_nolock() exists and tries to use it?
>>
>> Can we document this strongly enough? Or even enforce it somehow? Or when
>> any of these kinds of debugs above are enabled, we play it safe and use
>> defer_free()?
> 
> Let's break it one by one.
> 1.
> kmemleak_free_recursive() will miss an object that was recorded
> during normal kmalloc() and that's indeed problematic.
> 
> 2.
> debug_check_no_locks_freed() and
> debug_check_no_obj_freed()
> are somewhat harmless.
> We miss checks, but it's not breaking the corresponding features.
> 
> 3.
> __kcsan_check_access() doesn't take locks, but its stack is
> so deep and looks to be recursive that I doubt it's safe from
> any context.

But it's also the case of "miss check but not break anything" right?

> 4.
> kfence_free() looks like an existing quirk.
> I'm not sure why it's there in the slab free path :)
> kfence comment says:
>  * KFENCE objects live in a separate page range and are not to be intermixed
>  * with regular heap objects (e.g. KFENCE objects must never be added to the
>  * allocator freelists). Failing to do so may and will result in heap
>  * corruptions, therefore is_kfence_address() must be used to check whether
>  * an object requires specific handling.
> 
> so it should always be a nop for slab.

Well the point of kfence is that it can inject its objects into slab
allocations (so that they are with some probability checked for buffer
overflows, UAF etc), so the slab freeing must then recognize and handle
them. It also wants to be low overhead for production use, so the freeing
isn't doing "if (is_kfence_address())" upfront but deeper.

So we could detect them and defer.

> I removed the call for peace of mind.
> 
> So imo only 1 is dodgy. We can add:
> if (!(flags & SLAB_NOLEAKTRACE) && kmemleak_free_enabled)
>   defer_free(..);
> 
> but it's ugly too.

Hm yeah looks like kmemleak isn't one of those debugging functionalities
that can be built-in but no overhead unless enabled on boot, using static keys.

> My preference is to add a comment saying that only objects
> allocated by kmalloc_nolock() should be freed by kfree_nolock().

We could go with that until someone has a case for changing this, and then
handle kmemleak and kfence with defer_free()...


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 6/6] slab: Make slub local_trylock_t more precise for LOCKDEP
  2025-07-17  3:32     ` Alexei Starovoitov
@ 2025-07-17  9:29       ` Vlastimil Babka
  0 siblings, 0 replies; 18+ messages in thread
From: Vlastimil Babka @ 2025-07-17  9:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner

On 7/17/25 05:32, Alexei Starovoitov wrote:
> On Wed, Jul 16, 2025 at 6:35 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>> >  #endif
>> >       /* Used for retrieving partial slabs, etc. */
>> >       slab_flags_t flags;
>> > diff --git a/mm/slub.c b/mm/slub.c
>> > index c92703d367d7..526296778247 100644
>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -3089,12 +3089,26 @@ static inline void note_cmpxchg_failure(const char *n,
>> >
>> >  static void init_kmem_cache_cpus(struct kmem_cache *s)
>> >  {
>> > +#ifdef CONFIG_PREEMPT_RT
>> > +     /* Register lockdep key for non-boot kmem caches */
>> > +     bool finegrain_lockdep = !init_section_contains(s, 1);
>>
>> I guess it's to avoid the "if (WARN_ON_ONCE(static_obj(key)))"
>> if it means the two bootstrap caches get a different class just by being
>> static, then I guess it works.
> 
> Yes. Not pretty. The alternative is to pass a flag
> through a bunch of functions all the way from kmem_cache_init.
> Another alternative is to
> bool finegrain_lockdep = s != boot_kmem_cache_node && s != boot_kmem_cache.
> Both are imo uglier.
> init_section_contains() is more precise and matches static_obj().

We also have the slab_state variable to handle bootstrap, but it would need
to gain a new state I think. Currently it's updated to "slab_state =
PARTIAL;" right between creating kmem_cache_node and kmem_cache, and the
next update "slab_state = UP;" is only after all the kmalloc caches. IIUC
we'd need to add a new one right after creating "kmem_cache".

I guess up to you, the init_section_contains() seems working too.

> Checking for SLAB_NO_OBJ_EXT isn't an option. Since it's conditional.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-07-17  9:18       ` Vlastimil Babka
@ 2025-07-17 16:23         ` Alexei Starovoitov
  0 siblings, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-17 16:23 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner

On Thu, Jul 17, 2025 at 2:18 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> >
> > When "bool allow_spin" was there in Sebastian's version it definitely
> > looked cleaner as a proper function,
> > but now, if (!IS_ENABLED(CONFIG_PREEMPT_RT)) can be
> > #ifdef CONFIG_PREEMPT_RT
> > and the comment will look normal (without ugly backslashes)
> > So yeah. I'll convert it to macro.
>
> To clarify, the ideal I think would be e.g.
>
> #if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
>
> local_lock_irqsave();
>
> #else
>
> lockdep_assert(local_trylock_irqsave());
>
> #endif
>
> This should mean that without lockdep we just trust the code to be correct
> (kmalloc_nolock() using local_lock_held() properly before calling here, and
> kmalloc() callers not being in an unsupported context, as before this
> series) with no checking on both RT and !RT.
>
> With lockdep, on RT lockdep does its checking in local_lock_irqsave()
> normally. On !RT we need to use trylock to avoid false positives in nmi, but
> lockdep_assert() will catch a bug still in case of a true positive.
>
> At least I hope I got it right...

Yes. Exactly what I had in mind.

> > My preference is to add a comment saying that only objects
> > allocated by kmalloc_nolock() should be freed by kfree_nolock().
>
> We could go with that until someone has a case for changing this, and then
> handle kmemleak and kfence with defer_free()...

Good. Will go with warning/comment for now.

At least from bpf infra pov the context where free-ing is done
is better controlled than allocation.
Free is often in call_rcu() or call_rcu_tasks_trace() callback.
Whereas the context of kmalloc_nolock() is impossible to predict
when it comes to tracing bpf progs.
So the issues from kmalloc_nolock() -> kfree() transition are more
likely to occur, but bpf progs won't be using these primitives
directly, of course. The existing layers of protection
like bpf_obj_new()/bpf_obj_drop() will continue to be the interface.

Fun fact... initially I was planning to implement kmalloc_nolock()
only :) since bpf use case of freeing is after rcu and rcu_tasks_trace GP,
but once I realized that slab is already recursive and kmalloc_nolock()
has to be able to free deep inside, I figured I had to implement
kfree_nolock() in the same patch.
I'm talking about
alloc_slab_obj_exts() -> vec = kmalloc_nolock() -> kfree_nolock(vec) path.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-07-17  2:50     ` Alexei Starovoitov
  2025-07-17  9:18       ` Vlastimil Babka
@ 2025-07-18  0:09       ` Alexei Starovoitov
  1 sibling, 0 replies; 18+ messages in thread
From: Alexei Starovoitov @ 2025-07-18  0:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner

On Wed, Jul 16, 2025 at 7:50 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> > > +#ifndef CONFIG_SLUB_TINY
> > > +     if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> > > +#endif
> > > +             ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
> >
> > Nit: use IS_DEFINED(CONFIG_SLUB_TINY) to make this look better?
>
> ok.

Will take it back. That doesn't work since s->cpu_slab doesn't exist
with SLUB_TINY.
So I kept the ifdef.
Addressed the rest of comments and will send v4 soon.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-07-16  2:29 ` [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
  2025-07-16 10:58   ` Vlastimil Babka
@ 2025-08-25  4:45   ` Harry Yoo
  2025-08-27  2:31     ` Alexei Starovoitov
  1 sibling, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2025-08-25  4:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, vbabka, shakeel.butt, mhocko, bigeasy, andrii,
	memxor, akpm, peterz, rostedt, hannes

On Tue, Jul 15, 2025 at 07:29:49PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> kmalloc_nolock() relies on ability of local_lock to detect the situation
> when it's locked.
> In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> In that case retry the operation in a different kmalloc bucket.
> The second attempt will likely succeed, since this cpu locked
> different kmem_cache_cpu.
> 
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> a different bucket that is most likely is not locked by the current
> task. Though it may be locked by a different task it's safe to
> rt_spin_lock() on it.
> 
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
> 
> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and in_nmi() or in PREEMPT_RT.
> 
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
> always defers to irq_work.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/kasan.h |  13 +-
>  include/linux/slab.h  |   4 +
>  mm/Kconfig            |   1 +
>  mm/kasan/common.c     |   5 +-
>  mm/slab.h             |   6 +
>  mm/slab_common.c      |   3 +
>  mm/slub.c             | 454 +++++++++++++++++++++++++++++++++++++-----
>  7 files changed, 434 insertions(+), 52 deletions(-)

> +static void defer_free(struct kmem_cache *s, void *head)
> +{
> +	struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> +
> +	if (llist_add(head + s->offset, &df->objects))
> +		irq_work_queue(&df->work);
> +}
> +
> +static void defer_deactivate_slab(struct slab *slab)
> +{
> +	struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> +
> +	if (llist_add(&slab->llnode, &df->slabs))
> +		irq_work_queue(&df->work);
> +}
> +
> +void defer_free_barrier(void)
> +{
> +	int cpu;
> +
> +	for_each_possible_cpu(cpu)
> +		irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
> +}

I think it should also initiate deferred frees, if kfree_nolock() freed
the last object in some CPUs?

> +
>  #ifndef CONFIG_SLUB_TINY
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-08-25  4:45   ` Harry Yoo
@ 2025-08-27  2:31     ` Alexei Starovoitov
  2025-08-27  5:13       ` Harry Yoo
  0 siblings, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2025-08-27  2:31 UTC (permalink / raw)
  To: Harry Yoo
  Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner

On Sun, Aug 24, 2025 at 9:46 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Jul 15, 2025 at 07:29:49PM -0700, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > kmalloc_nolock() relies on ability of local_lock to detect the situation
> > when it's locked.
> > In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> > irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> > In that case retry the operation in a different kmalloc bucket.
> > The second attempt will likely succeed, since this cpu locked
> > different kmem_cache_cpu.
> >
> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> > into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> > a different bucket that is most likely is not locked by the current
> > task. Though it may be locked by a different task it's safe to
> > rt_spin_lock() on it.
> >
> > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > immediately if called from hard irq or NMI in PREEMPT_RT.
> >
> > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > and in_nmi() or in PREEMPT_RT.
> >
> > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
> > always defers to irq_work.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  include/linux/kasan.h |  13 +-
> >  include/linux/slab.h  |   4 +
> >  mm/Kconfig            |   1 +
> >  mm/kasan/common.c     |   5 +-
> >  mm/slab.h             |   6 +
> >  mm/slab_common.c      |   3 +
> >  mm/slub.c             | 454 +++++++++++++++++++++++++++++++++++++-----
> >  7 files changed, 434 insertions(+), 52 deletions(-)
>
> > +static void defer_free(struct kmem_cache *s, void *head)
> > +{
> > +     struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> > +
> > +     if (llist_add(head + s->offset, &df->objects))
> > +             irq_work_queue(&df->work);
> > +}
> > +
> > +static void defer_deactivate_slab(struct slab *slab)
> > +{
> > +     struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> > +
> > +     if (llist_add(&slab->llnode, &df->slabs))
> > +             irq_work_queue(&df->work);
> > +}
> > +
> > +void defer_free_barrier(void)
> > +{
> > +     int cpu;
> > +
> > +     for_each_possible_cpu(cpu)
> > +             irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
> > +}
>
> I think it should also initiate deferred frees, if kfree_nolock() freed
> the last object in some CPUs?

I don't understand the question. "the last object in some CPU" ?
Are you asking about the need of defer_free_barrier() ?

PS
I just got back from 2+ week PTO. Going through backlog.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-08-27  2:31     ` Alexei Starovoitov
@ 2025-08-27  5:13       ` Harry Yoo
  0 siblings, 0 replies; 18+ messages in thread
From: Harry Yoo @ 2025-08-27  5:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner

On Tue, Aug 26, 2025 at 07:31:34PM -0700, Alexei Starovoitov wrote:
> On Sun, Aug 24, 2025 at 9:46 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Tue, Jul 15, 2025 at 07:29:49PM -0700, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > kmalloc_nolock() relies on ability of local_lock to detect the situation
> > > when it's locked.
> > > In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> > > irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> > > In that case retry the operation in a different kmalloc bucket.
> > > The second attempt will likely succeed, since this cpu locked
> > > different kmem_cache_cpu.
> > >
> > > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > > per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> > > into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> > > a different bucket that is most likely is not locked by the current
> > > task. Though it may be locked by a different task it's safe to
> > > rt_spin_lock() on it.
> > >
> > > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > > immediately if called from hard irq or NMI in PREEMPT_RT.
> > >
> > > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > > and in_nmi() or in PREEMPT_RT.
> > >
> > > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > > spin_trylock_irqsave(&n->list_lock) to allocate while kfree_nolock()
> > > always defers to irq_work.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> > >  include/linux/kasan.h |  13 +-
> > >  include/linux/slab.h  |   4 +
> > >  mm/Kconfig            |   1 +
> > >  mm/kasan/common.c     |   5 +-
> > >  mm/slab.h             |   6 +
> > >  mm/slab_common.c      |   3 +
> > >  mm/slub.c             | 454 +++++++++++++++++++++++++++++++++++++-----
> > >  7 files changed, 434 insertions(+), 52 deletions(-)
> >
> > > +static void defer_free(struct kmem_cache *s, void *head)
> > > +{
> > > +     struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> > > +
> > > +     if (llist_add(head + s->offset, &df->objects))
> > > +             irq_work_queue(&df->work);
> > > +}
> > > +
> > > +static void defer_deactivate_slab(struct slab *slab)
> > > +{
> > > +     struct defer_free *df = this_cpu_ptr(&defer_free_objects);
> > > +
> > > +     if (llist_add(&slab->llnode, &df->slabs))
> > > +             irq_work_queue(&df->work);
> > > +}
> > > +
> > > +void defer_free_barrier(void)
> > > +{
> > > +     int cpu;
> > > +
> > > +     for_each_possible_cpu(cpu)
> > > +             irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
> > > +}
> >
> > I think it should also initiate deferred frees, if kfree_nolock() freed
> > the last object in some CPUs?
> 
> I don't understand the question. "the last object in some CPU" ?
> Are you asking about the need of defer_free_barrier() ?

My bad. It slipped my mind.

I thought objects freed via kfree_nolock() are not freed before
a following kfree(), but since we've switched to IRQ work,
that's not the case anymore.

> PS
> I just got back from 2+ week PTO. Going through backlog.

Hope you enjoyed your PTO!

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-08-27  5:14 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-16  2:29 [PATCH v3 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
2025-07-16  2:29 ` [PATCH v3 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
2025-07-16  2:29 ` [PATCH v3 2/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
2025-07-16  2:29 ` [PATCH v3 3/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
2025-07-16  2:29 ` [PATCH v3 4/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
2025-07-16  2:29 ` [PATCH v3 5/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
2025-07-16 10:58   ` Vlastimil Babka
2025-07-17  2:50     ` Alexei Starovoitov
2025-07-17  9:18       ` Vlastimil Babka
2025-07-17 16:23         ` Alexei Starovoitov
2025-07-18  0:09       ` Alexei Starovoitov
2025-08-25  4:45   ` Harry Yoo
2025-08-27  2:31     ` Alexei Starovoitov
2025-08-27  5:13       ` Harry Yoo
2025-07-16  2:29 ` [PATCH v3 6/6] slab: Make slub local_trylock_t more precise for LOCKDEP Alexei Starovoitov
2025-07-16 13:35   ` Vlastimil Babka
2025-07-17  3:32     ` Alexei Starovoitov
2025-07-17  9:29       ` Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).