* [PATCH v4 0/6] slab: Re-entrant kmalloc_nolock()
@ 2025-07-18 2:16 Alexei Starovoitov
2025-07-18 2:16 ` [PATCH v4 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
` (5 more replies)
0 siblings, 6 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2025-07-18 2:16 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
v3->v4:
- Converted local_lock_cpu_slab() to macro
- Reordered patches 5 and 6
- Emphasized that kfree_nolock() shouldn't be used on kmalloc()-ed objects
- Addressed other comments and improved commit logs
- Fixed build issues reported by bots
v3:
https://lore.kernel.org/bpf/20250716022950.69330-1-alexei.starovoitov@gmail.com/
v2->v3:
- Adopted Sebastian's local_lock_cpu_slab(), but dropped gfpflags
to avoid extra branch for performance reasons,
and added local_unlock_cpu_slab() for symmetry.
- Dropped the local_lock_lockdep_start/end() pair and switched to
a per-kmem_cache lockdep class on PREEMPT_RT to silence a false positive
when the same cpu/task acquires two local_locks.
- Refactored defer_free per Sebastian's suggestion
- Fixed slab leak when it needs to be deactivated via irq_work and llist
as Vlastimil proposed. Including defer_free_barrier().
- Use kmem_cache->offset for the llist_node pointer when linking objects
instead of zero offset, since the whole object could be in use for slabs
with ctors and in other cases.
- Fixed "cnt = 1; goto redo;" issue.
- Fixed slab leak in alloc_single_from_new_slab().
- Retested with slab_debug, RT, !RT, lockdep, kasan, slab_tiny
- Added acks to patches 1-4 that should be good to go.
v2:
https://lore.kernel.org/bpf/20250709015303.8107-1-alexei.starovoitov@gmail.com/
v1->v2:
Added more comments for this non-trivial logic and addressed earlier comments.
In particular:
- Introduce alloc_frozen_pages_nolock() to avoid refcnt race
- alloc_pages_nolock() defaults to GFP_COMP
- Support SLUB_TINY
- Added more variants to the stress tester, which uncovered that kfree_nolock()
can OOM, because the deferred per-slab llist won't be serviced if
kfree_nolock() stays unlucky long enough. Scrapped the previous approach and
switched to a global per-cpu llist with immediate irq_work_queue() to process
all object sizes.
- Reentrant kmalloc cannot deactivate_slab(). In v1 the node hint was
downgraded to NUMA_NO_NODE before calling slab_alloc(). Realized it's not
good enough: there are odd cases that can still trigger deactivation.
Rewrote this part.
- Struggled with SLAB_NO_CMPXCHG. Thankfully Harry had a great suggestion:
https://lore.kernel.org/bpf/aFvfr1KiNrLofavW@hyeyoo/
which was adopted. So slab_debug works now.
- In v1 I had to s/local_lock_irqsave/local_lock_irqsave_check/ in a bunch
of places in mm/slub.c to avoid lockdep false positives.
Came up with a much cleaner approach to silence invalid lockdep reports
without sacrificing lockdep coverage. See local_lock_lockdep_start/end().
v1:
https://lore.kernel.org/bpf/20250501032718.65476-1-alexei.starovoitov@gmail.com/
Alexei Starovoitov (6):
locking/local_lock: Expose dep_map in local_trylock_t.
locking/local_lock: Introduce local_lock_is_locked().
mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
mm: Introduce alloc_frozen_pages_nolock()
slab: Make slub local_(try)lock more precise for LOCKDEP
slab: Introduce kmalloc_nolock() and kfree_nolock().
include/linux/gfp.h | 2 +-
include/linux/kasan.h | 13 +-
include/linux/local_lock.h | 2 +
include/linux/local_lock_internal.h | 16 +-
include/linux/rtmutex.h | 10 +
include/linux/slab.h | 4 +
kernel/bpf/syscall.c | 2 +-
kernel/locking/rtmutex_common.h | 9 -
mm/Kconfig | 1 +
mm/internal.h | 4 +
mm/kasan/common.c | 5 +-
mm/page_alloc.c | 54 ++--
mm/slab.h | 7 +
mm/slab_common.c | 3 +
mm/slub.c | 486 +++++++++++++++++++++++++---
15 files changed, 528 insertions(+), 90 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v4 1/6] locking/local_lock: Expose dep_map in local_trylock_t.
2025-07-18 2:16 [PATCH v4 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
@ 2025-07-18 2:16 ` Alexei Starovoitov
2025-07-18 2:16 ` [PATCH v4 2/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
` (4 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2025-07-18 2:16 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
The lockdep_is_held() macro assumes that "struct lockdep_map dep_map;"
is a top-level field of any lock that participates in LOCKDEP.
Make it so for local_trylock_t.
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
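For context, the constraint comes from how the lockdep helpers are written.
Roughly (a simplified sketch, not the verbatim definitions from
include/linux/lockdep.h):

/* Simplified sketch of the lockdep side: */
#define lockdep_is_held(lock)		lock_is_held(&(lock)->dep_map)
#define lockdep_assert_held(lock)	WARN_ON(debug_locks && !lockdep_is_held(lock))

/*
 * Something like
 *	lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
 * can therefore only work if local_trylock_t exposes dep_map as a
 * top-level member instead of burying it inside an embedded local_lock_t,
 * which is what the hunk below does.
 */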
---
include/linux/local_lock_internal.h | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8d5ac16a9b17..85c2e1b1af6b 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -17,7 +17,10 @@ typedef struct {
/* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
typedef struct {
- local_lock_t llock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+ struct task_struct *owner;
+#endif
u8 acquired;
} local_trylock_t;
@@ -31,7 +34,7 @@ typedef struct {
.owner = NULL,
# define LOCAL_TRYLOCK_DEBUG_INIT(lockname) \
- .llock = { LOCAL_LOCK_DEBUG_INIT((lockname).llock) },
+ LOCAL_LOCK_DEBUG_INIT(lockname)
static inline void local_lock_acquire(local_lock_t *l)
{
@@ -81,7 +84,7 @@ do { \
local_lock_debug_init(lock); \
} while (0)
-#define __local_trylock_init(lock) __local_lock_init(lock.llock)
+#define __local_trylock_init(lock) __local_lock_init((local_lock_t *)lock)
#define __spinlock_nested_bh_init(lock) \
do { \
--
2.47.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v4 2/6] locking/local_lock: Introduce local_lock_is_locked().
2025-07-18 2:16 [PATCH v4 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
2025-07-18 2:16 ` [PATCH v4 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
@ 2025-07-18 2:16 ` Alexei Starovoitov
2025-07-18 2:16 ` [PATCH v4 3/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
` (3 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2025-07-18 2:16 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
Introduce local_lock_is_locked() that returns true when the
given local_lock is locked by the current CPU (in !PREEMPT_RT) or
by the current task (in PREEMPT_RT).
The goal is to let the caller detect a potential deadlock.
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
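For illustration, a minimal sketch of the intended use (hypothetical names,
not part of the patch; it mirrors how kmalloc_nolock() uses the helper later
in the series). A re-entrant path first checks whether the per-CPU lock is
already held by this CPU (or, on PREEMPT_RT, by this task) and backs off
instead of deadlocking:

static DEFINE_PER_CPU(local_trylock_t, my_lock) = INIT_LOCAL_TRYLOCK(my_lock);

/*
 * Hypothetical helper, called with preemption/migration already disabled
 * (e.g. from a tracing handler), as __local_lock_is_locked() requires.
 */
static bool try_touch_pcpu_state(void)
{
	unsigned long flags;

	if (local_lock_is_locked(&my_lock))
		return false;	/* would deadlock; let the caller fall back */

	local_lock_irqsave(&my_lock, flags);
	/* ... manipulate per-CPU data protected by my_lock ... */
	local_unlock_irqrestore(&my_lock, flags);
	return true;
}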
---
include/linux/local_lock.h | 2 ++
include/linux/local_lock_internal.h | 7 +++++++
include/linux/rtmutex.h | 10 ++++++++++
kernel/locking/rtmutex_common.h | 9 ---------
4 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 16a2ee4f8310..092ce89b162a 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -66,6 +66,8 @@
*/
#define local_trylock(lock) __local_trylock(lock)
+#define local_lock_is_locked(lock) __local_lock_is_locked(lock)
+
/**
* local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
* interrupts if acquired
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 85c2e1b1af6b..db61409f040c 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -165,6 +165,9 @@ do { \
!!tl; \
})
+/* preemption or migration must be disabled before calling __local_lock_is_locked */
+#define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
+
#define __local_lock_release(lock) \
do { \
local_trylock_t *tl; \
@@ -285,4 +288,8 @@ do { \
__local_trylock(lock); \
})
+/* migration must be disabled before calling __local_lock_is_locked */
+#define __local_lock_is_locked(__lock) \
+ (rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)
+
#endif /* CONFIG_PREEMPT_RT */
diff --git a/include/linux/rtmutex.h b/include/linux/rtmutex.h
index 7d049883a08a..98391601fe94 100644
--- a/include/linux/rtmutex.h
+++ b/include/linux/rtmutex.h
@@ -44,6 +44,16 @@ static inline bool rt_mutex_base_is_locked(struct rt_mutex_base *lock)
return READ_ONCE(lock->owner) != NULL;
}
+#ifdef CONFIG_RT_MUTEXES
+#define RT_MUTEX_HAS_WAITERS 1UL
+
+static inline struct task_struct *rt_mutex_owner(struct rt_mutex_base *lock)
+{
+ unsigned long owner = (unsigned long) READ_ONCE(lock->owner);
+
+ return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
+}
+#endif
extern void rt_mutex_base_init(struct rt_mutex_base *rtb);
/**
diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h
index 78dd3d8c6554..cf6ddd1b23a2 100644
--- a/kernel/locking/rtmutex_common.h
+++ b/kernel/locking/rtmutex_common.h
@@ -153,15 +153,6 @@ static inline struct rt_mutex_waiter *task_top_pi_waiter(struct task_struct *p)
pi_tree.entry);
}
-#define RT_MUTEX_HAS_WAITERS 1UL
-
-static inline struct task_struct *rt_mutex_owner(struct rt_mutex_base *lock)
-{
- unsigned long owner = (unsigned long) READ_ONCE(lock->owner);
-
- return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
-}
-
/*
* Constants for rt mutex functions which have a selectable deadlock
* detection.
--
2.47.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v4 3/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock().
2025-07-18 2:16 [PATCH v4 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
2025-07-18 2:16 ` [PATCH v4 1/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
2025-07-18 2:16 ` [PATCH v4 2/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
@ 2025-07-18 2:16 ` Alexei Starovoitov
2025-07-18 2:16 ` [PATCH v4 4/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
` (2 subsequent siblings)
5 siblings, 0 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2025-07-18 2:16 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
Change alloc_pages_nolock() to default to __GFP_COMP when allocating
pages, since the upcoming reentrant alloc_slab_page() needs __GFP_COMP.
Also allow the __GFP_ACCOUNT flag to be specified,
since the BPF infra needs __GFP_ACCOUNT.
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
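As a usage note, the calling convention after this patch looks roughly like
the sketch below (grab_accounted_page() is a made-up helper;
__bpf_alloc_page() in the hunk below is the real caller being converted):

/* Hypothetical caller sketch. */
static struct page *grab_accounted_page(int nid)
{
	/*
	 * Only __GFP_ACCOUNT may be passed in; __GFP_ZERO and (after this
	 * patch) __GFP_COMP are always applied internally.
	 */
	struct page *page = alloc_pages_nolock(__GFP_ACCOUNT, nid, 0);

	/* NULL means ENOMEM, not EBUSY/EAGAIN; retrying won't help. */
	return page;
}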
---
include/linux/gfp.h | 2 +-
kernel/bpf/syscall.c | 2 +-
mm/page_alloc.c | 10 ++++++----
3 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5ebf26fcdcfa..0ceb4e09306c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -354,7 +354,7 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
}
#define alloc_page_vma(...) alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
-struct page *alloc_pages_nolock_noprof(int nid, unsigned int order);
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order);
#define alloc_pages_nolock(...) alloc_hooks(alloc_pages_nolock_noprof(__VA_ARGS__))
extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index dd5304c6ac3c..eb9b6c4c10e9 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -579,7 +579,7 @@ static bool can_alloc_pages(void)
static struct page *__bpf_alloc_page(int nid)
{
if (!can_alloc_pages())
- return alloc_pages_nolock(nid, 0);
+ return alloc_pages_nolock(__GFP_ACCOUNT, nid, 0);
return alloc_pages_node(nid,
GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78ddf1d43c6c..148945f0b667 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7506,6 +7506,7 @@ static bool __free_unaccepted(struct page *page)
/**
* alloc_pages_nolock - opportunistic reentrant allocation from any context
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
* @nid: node to allocate from
* @order: allocation order size
*
@@ -7519,7 +7520,7 @@ static bool __free_unaccepted(struct page *page)
* Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
* It means ENOMEM. There is no reason to call it again and expect !NULL.
*/
-struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
{
/*
* Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
@@ -7541,12 +7542,13 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
* specify it here to highlight that alloc_pages_nolock()
* doesn't want to deplete reserves.
*/
- gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
- | __GFP_ACCOUNT;
+ gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC | __GFP_COMP
+ | gfp_flags;
unsigned int alloc_flags = ALLOC_TRYLOCK;
struct alloc_context ac = { };
struct page *page;
+ VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);
/*
* In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
* unsafe in NMI. If spin_trylock() is called from hard IRQ the current
@@ -7584,7 +7586,7 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
if (page)
set_page_refcounted(page);
- if (memcg_kmem_online() && page &&
+ if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) &&
unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
free_pages_nolock(page, order);
page = NULL;
--
2.47.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v4 4/6] mm: Introduce alloc_frozen_pages_nolock()
2025-07-18 2:16 [PATCH v4 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
` (2 preceding siblings ...)
2025-07-18 2:16 ` [PATCH v4 3/6] mm: Allow GFP_ACCOUNT to be used in alloc_pages_nolock() Alexei Starovoitov
@ 2025-07-18 2:16 ` Alexei Starovoitov
2025-07-18 2:16 ` [PATCH v4 5/6] slab: Make slub local_(try)lock more precise for LOCKDEP Alexei Starovoitov
2025-07-18 2:16 ` [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
5 siblings, 0 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2025-07-18 2:16 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
Split alloc_pages_nolock() and introduce alloc_frozen_pages_nolock()
to be used by alloc_slab_page().
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
mm/internal.h | 4 ++++
mm/page_alloc.c | 48 +++++++++++++++++++++++++++---------------------
2 files changed, 31 insertions(+), 21 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 5b0f71e5434b..ea85cf703331 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -837,6 +837,10 @@ static inline struct page *alloc_frozen_pages_noprof(gfp_t gfp, unsigned int ord
#define alloc_frozen_pages(...) \
alloc_hooks(alloc_frozen_pages_noprof(__VA_ARGS__))
+struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order);
+#define alloc_frozen_pages_nolock(...) \
+ alloc_hooks(alloc_frozen_pages_nolock_noprof(__VA_ARGS__))
+
extern void zone_pcp_reset(struct zone *zone);
extern void zone_pcp_disable(struct zone *zone);
extern void zone_pcp_enable(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 148945f0b667..11a184bab03c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7504,23 +7504,7 @@ static bool __free_unaccepted(struct page *page)
#endif /* CONFIG_UNACCEPTED_MEMORY */
-/**
- * alloc_pages_nolock - opportunistic reentrant allocation from any context
- * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
- * @nid: node to allocate from
- * @order: allocation order size
- *
- * Allocates pages of a given order from the given node. This is safe to
- * call from any context (from atomic, NMI, and also reentrant
- * allocator -> tracepoint -> alloc_pages_nolock_noprof).
- * Allocation is best effort and to be expected to fail easily so nobody should
- * rely on the success. Failures are not reported via warn_alloc().
- * See always fail conditions below.
- *
- * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
- * It means ENOMEM. There is no reason to call it again and expect !NULL.
- */
-struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
+struct page *alloc_frozen_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
{
/*
* Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
@@ -7583,16 +7567,38 @@ struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int or
/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
- if (page)
- set_page_refcounted(page);
-
if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) &&
unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
- free_pages_nolock(page, order);
+ __free_frozen_pages(page, order, FPI_TRYLOCK);
page = NULL;
}
trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
kmsan_alloc_page(page, order, alloc_gfp);
return page;
}
+/**
+ * alloc_pages_nolock - opportunistic reentrant allocation from any context
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT allowed.
+ * @nid: node to allocate from
+ * @order: allocation order size
+ *
+ * Allocates pages of a given order from the given node. This is safe to
+ * call from any context (from atomic, NMI, and also reentrant
+ * allocator -> tracepoint -> alloc_pages_nolock_noprof).
+ * Allocation is best effort and to be expected to fail easily so nobody should
+ * rely on the success. Failures are not reported via warn_alloc().
+ * See always fail conditions below.
+ *
+ * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
+ * It means ENOMEM. There is no reason to call it again and expect !NULL.
+ */
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
+{
+ struct page *page;
+
+ page = alloc_frozen_pages_nolock_noprof(gfp_flags, nid, order);
+ if (page)
+ set_page_refcounted(page);
+ return page;
+}
EXPORT_SYMBOL_GPL(alloc_pages_nolock_noprof);
--
2.47.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v4 5/6] slab: Make slub local_(try)lock more precise for LOCKDEP
2025-07-18 2:16 [PATCH v4 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
` (3 preceding siblings ...)
2025-07-18 2:16 ` [PATCH v4 4/6] mm: Introduce alloc_frozen_pages_nolock() Alexei Starovoitov
@ 2025-07-18 2:16 ` Alexei Starovoitov
2025-07-18 2:16 ` [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
5 siblings, 0 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2025-07-18 2:16 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
kmalloc_nolock() can be called from any context, so ___slab_alloc()
can acquire a local_trylock_t (which is an rt_spin_lock in PREEMPT_RT)
and then attempt to acquire a different local_trylock_t while in the
same task context.
The calling sequence might look like:
kmalloc() -> tracepoint -> bpf -> kmalloc_nolock()
or more precisely:
__lock_acquire+0x12ad/0x2590
lock_acquire+0x133/0x2d0
rt_spin_lock+0x6f/0x250
___slab_alloc+0xb7/0xec0
kmalloc_nolock_noprof+0x15a/0x430
my_debug_callback+0x20e/0x390 [testmod]
___slab_alloc+0x256/0xec0
__kmalloc_cache_noprof+0xd6/0x3b0
Make LOCKDEP understand that local_trylock_t-s protect
different kmem_caches. In order to do that, add a lock_class_key
for each kmem_cache and use that key in its local_trylock_t.
This stack trace is possible on both PREEMPT_RT and !PREEMPT_RT,
but teach lockdep about it only on PREEMPT_RT, since
on !PREEMPT_RT the ___slab_alloc() code uses
local_trylock_irqsave() when lockdep is on.
Note, this patch applies this logic to local_lock_t
while the next one converts it to local_trylock_t.
Both are mapped to rt_spin_lock in PREEMPT_RT.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
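The approach below is the standard lockdep technique for giving dynamically
created objects their own lock classes; a generic sketch with made-up names
(unrelated to slab) looks like this:

struct my_obj {
	spinlock_t lock;
	struct lock_class_key key;	/* one lock class per instance */
};

static struct my_obj *my_obj_create(void)
{
	struct my_obj *o = kzalloc(sizeof(*o), GFP_KERNEL);

	if (!o)
		return NULL;
	spin_lock_init(&o->lock);
	/* The key must live in dynamic memory, hence the init_section check below. */
	lockdep_register_key(&o->key);
	lockdep_set_class(&o->lock, &o->key);
	return o;
}

static void my_obj_destroy(struct my_obj *o)
{
	lockdep_unregister_key(&o->key);	/* before freeing the memory */
	kfree(o);
}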
---
mm/slab.h | 1 +
mm/slub.c | 20 ++++++++++++++++++++
2 files changed, 21 insertions(+)
diff --git a/mm/slab.h b/mm/slab.h
index 05a21dc796e0..4f4dfc3d239c 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -258,6 +258,7 @@ struct kmem_cache_order_objects {
struct kmem_cache {
#ifndef CONFIG_SLUB_TINY
struct kmem_cache_cpu __percpu *cpu_slab;
+ struct lock_class_key lock_key;
#endif
/* Used for retrieving partial slabs, etc. */
slab_flags_t flags;
diff --git a/mm/slub.c b/mm/slub.c
index c4b64821e680..54444bce218e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3051,12 +3051,29 @@ static inline void note_cmpxchg_failure(const char *n,
static void init_kmem_cache_cpus(struct kmem_cache *s)
{
+#ifdef CONFIG_PREEMPT_RT
+ /*
+ * Register lockdep key for non-boot kmem caches to avoid
+ * WARN_ON_ONCE(static_obj(key))) in lockdep_register_key()
+ */
+ bool finegrain_lockdep = !init_section_contains(s, 1);
+#else
+ /*
+ * Don't bother with different lockdep classes for each
+ * kmem_cache, since we only use local_trylock_irqsave().
+ */
+ bool finegrain_lockdep = false;
+#endif
int cpu;
struct kmem_cache_cpu *c;
+ if (finegrain_lockdep)
+ lockdep_register_key(&s->lock_key);
for_each_possible_cpu(cpu) {
c = per_cpu_ptr(s->cpu_slab, cpu);
local_lock_init(&c->lock);
+ if (finegrain_lockdep)
+ lockdep_set_class(&c->lock, &s->lock_key);
c->tid = init_tid(cpu);
}
}
@@ -5614,6 +5631,9 @@ void __kmem_cache_release(struct kmem_cache *s)
{
cache_random_seq_destroy(s);
#ifndef CONFIG_SLUB_TINY
+#ifdef CONFIG_PREEMPT_RT
+ lockdep_unregister_key(&s->lock_key);
+#endif
free_percpu(s->cpu_slab);
#endif
free_kmem_cache_nodes(s);
--
2.47.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-18 2:16 [PATCH v4 0/6] slab: Re-entrant kmalloc_nolock() Alexei Starovoitov
` (4 preceding siblings ...)
2025-07-18 2:16 ` [PATCH v4 5/6] slab: Make slub local_(try)lock more precise for LOCKDEP Alexei Starovoitov
@ 2025-07-18 2:16 ` Alexei Starovoitov
2025-07-22 15:52 ` Harry Yoo
5 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2025-07-18 2:16 UTC (permalink / raw)
To: bpf, linux-mm
Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
akpm, peterz, rostedt, hannes
From: Alexei Starovoitov <ast@kernel.org>
kmalloc_nolock() relies on ability of local_trylock_t to detect
the situation when per-cpu kmem_cache is locked.
In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
disables IRQs and marks s->cpu_slab->lock as acquired.
local_lock_is_locked(&s->cpu_slab->lock) returns true when
slab is in the middle of manipulating per-cpu cache
of that specific kmem_cache.
kmalloc_nolock() can be called from any context and can re-enter
into ___slab_alloc():
kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
kmalloc_nolock() -> ___slab_alloc(cache_B)
or
kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
kmalloc_nolock() -> ___slab_alloc(cache_B)
Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
can be acquired without a deadlock before invoking the function.
If that specific per-cpu kmem_cache is busy, kmalloc_nolock()
retries in a different kmalloc bucket. The second attempt will
likely succeed, since this cpu has locked a different kmem_cache.
Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
the per-cpu rt_spin_lock is locked by the current _task_. In this case
re-entrance into the same kmalloc bucket is unsafe, and
kmalloc_nolock() tries a different bucket that is most likely
not locked by the current task. Though it may be locked by a
different task, it's safe to rt_spin_lock() and sleep on it.
Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
immediately if called from hard irq or NMI in PREEMPT_RT.
kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
and (in_nmi() or in PREEMPT_RT).
SLUB_TINY config doesn't use local_lock_is_locked() and relies on
spin_trylock_irqsave(&n->list_lock) to allocate,
while kfree_nolock() always defers to irq_work.
Note, kfree_nolock() must be called _only_ for objects allocated
with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
will miss kmemleak/kfence book keeping and will cause false positives.
large_kmalloc is not supported by either kmalloc_nolock()
or kfree_nolock().
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
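For illustration, a hedged sketch of the kind of caller this enables
(hypothetical tracing callback, names made up; constraints follow the text
above). It may fire while the current CPU/task is already inside
___slab_alloc(), so it has to use the _nolock pair and treat NULL as plain
ENOMEM:

struct trace_rec {
	u64 ts;
};

static void my_tracepoint_handler(void *data)
{
	/* Only __GFP_ACCOUNT and __GFP_ZERO may be passed. */
	struct trace_rec *rec = kmalloc_nolock(sizeof(*rec), __GFP_ZERO,
					       NUMA_NO_NODE);

	if (!rec)
		return;		/* ENOMEM, not "locked, try again" */

	rec->ts = ktime_get_mono_fast_ns();
	/* ... hand rec off somewhere lockless ... */

	/* Objects from kmalloc_nolock() are freed with kfree_nolock(). */
	kfree_nolock(rec);
}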
---
include/linux/kasan.h | 13 +-
include/linux/slab.h | 4 +
mm/Kconfig | 1 +
mm/kasan/common.c | 5 +-
mm/slab.h | 6 +
mm/slab_common.c | 3 +
mm/slub.c | 466 +++++++++++++++++++++++++++++++++++++-----
7 files changed, 445 insertions(+), 53 deletions(-)
diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 890011071f2b..acdc8cb0152e 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -200,7 +200,7 @@ static __always_inline bool kasan_slab_pre_free(struct kmem_cache *s,
}
bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
- bool still_accessible);
+ bool still_accessible, bool no_quarantine);
/**
* kasan_slab_free - Poison, initialize, and quarantine a slab object.
* @object: Object to be freed.
@@ -226,11 +226,13 @@ bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
* @Return true if KASAN took ownership of the object; false otherwise.
*/
static __always_inline bool kasan_slab_free(struct kmem_cache *s,
- void *object, bool init,
- bool still_accessible)
+ void *object, bool init,
+ bool still_accessible,
+ bool no_quarantine)
{
if (kasan_enabled())
- return __kasan_slab_free(s, object, init, still_accessible);
+ return __kasan_slab_free(s, object, init, still_accessible,
+ no_quarantine);
return false;
}
@@ -427,7 +429,8 @@ static inline bool kasan_slab_pre_free(struct kmem_cache *s, void *object)
}
static inline bool kasan_slab_free(struct kmem_cache *s, void *object,
- bool init, bool still_accessible)
+ bool init, bool still_accessible,
+ bool no_quarantine)
{
return false;
}
diff --git a/include/linux/slab.h b/include/linux/slab.h
index d5a8ab98035c..743f6d196d57 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -470,6 +470,7 @@ void * __must_check krealloc_noprof(const void *objp, size_t new_size,
#define krealloc(...) alloc_hooks(krealloc_noprof(__VA_ARGS__))
void kfree(const void *objp);
+void kfree_nolock(const void *objp);
void kfree_sensitive(const void *objp);
size_t __ksize(const void *objp);
@@ -910,6 +911,9 @@ static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t f
}
#define kmalloc(...) alloc_hooks(kmalloc_noprof(__VA_ARGS__))
+void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node);
+#define kmalloc_nolock(...) alloc_hooks(kmalloc_nolock_noprof(__VA_ARGS__))
+
#define kmem_buckets_alloc(_b, _size, _flags) \
alloc_hooks(__kmalloc_node_noprof(PASS_BUCKET_PARAMS(_size, _b), _flags, NUMA_NO_NODE))
diff --git a/mm/Kconfig b/mm/Kconfig
index 0287e8d94aea..331a14d678b3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -206,6 +206,7 @@ menu "Slab allocator options"
config SLUB
def_bool y
+ select IRQ_WORK
config KVFREE_RCU_BATCHED
def_bool y
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index ed4873e18c75..67042e07baee 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -256,13 +256,16 @@ bool __kasan_slab_pre_free(struct kmem_cache *cache, void *object,
}
bool __kasan_slab_free(struct kmem_cache *cache, void *object, bool init,
- bool still_accessible)
+ bool still_accessible, bool no_quarantine)
{
if (!kasan_arch_is_ready() || is_kfence_address(object))
return false;
poison_slab_object(cache, object, init, still_accessible);
+ if (no_quarantine)
+ return false;
+
/*
* If the object is put into quarantine, do not let slab put the object
* onto the freelist for now. The object's metadata is kept until the
diff --git a/mm/slab.h b/mm/slab.h
index 4f4dfc3d239c..165737accb20 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -57,6 +57,10 @@ struct slab {
struct {
union {
struct list_head slab_list;
+ struct { /* For deferred deactivate_slab() */
+ struct llist_node llnode;
+ void *flush_freelist;
+ };
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct {
struct slab *next;
@@ -681,6 +685,8 @@ void __kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct slab *slab)
void __check_heap_object(const void *ptr, unsigned long n,
const struct slab *slab, bool to_user);
+void defer_free_barrier(void);
+
static inline bool slub_debug_orig_size(struct kmem_cache *s)
{
return (kmem_cache_debug_flags(s, SLAB_STORE_USER) &&
diff --git a/mm/slab_common.c b/mm/slab_common.c
index bfe7c40eeee1..937af8ab2501 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -507,6 +507,9 @@ void kmem_cache_destroy(struct kmem_cache *s)
rcu_barrier();
}
+ /* Wait for deferred work from kmalloc/kfree_nolock() */
+ defer_free_barrier();
+
cpus_read_lock();
mutex_lock(&slab_mutex);
diff --git a/mm/slub.c b/mm/slub.c
index 54444bce218e..7de6da4ee46d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -44,6 +44,7 @@
#include <kunit/test.h>
#include <kunit/test-bug.h>
#include <linux/sort.h>
+#include <linux/irq_work.h>
#include <linux/debugfs.h>
#include <trace/events/kmem.h>
@@ -393,7 +394,7 @@ struct kmem_cache_cpu {
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct slab *partial; /* Partially allocated slabs */
#endif
- local_lock_t lock; /* Protects the fields above */
+ local_trylock_t lock; /* Protects the fields above */
#ifdef CONFIG_SLUB_STATS
unsigned int stat[NR_SLUB_STAT_ITEMS];
#endif
@@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
gfp_t gfp, bool new_slab)
{
+ bool allow_spin = gfpflags_allow_spinning(gfp);
unsigned int objects = objs_per_slab(s, slab);
unsigned long new_exts;
unsigned long old_exts;
@@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
gfp &= ~OBJCGS_CLEAR_MASK;
/* Prevent recursive extension vector allocation */
gfp |= __GFP_NO_OBJ_EXT;
- vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
- slab_nid(slab));
+ if (unlikely(!allow_spin)) {
+ size_t sz = objects * sizeof(struct slabobj_ext);
+
+ vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
+ } else {
+ vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
+ slab_nid(slab));
+ }
if (!vec) {
/* Mark vectors which failed to allocate */
if (new_slab)
@@ -2021,7 +2029,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
* objcg vector should be reused.
*/
mark_objexts_empty(vec);
- kfree(vec);
+ if (unlikely(!allow_spin))
+ kfree_nolock(vec);
+ else
+ kfree(vec);
return 0;
}
@@ -2379,7 +2390,7 @@ bool slab_free_hook(struct kmem_cache *s, void *x, bool init,
}
/* KASAN might put x into memory quarantine, delaying its reuse. */
- return !kasan_slab_free(s, x, init, still_accessible);
+ return !kasan_slab_free(s, x, init, still_accessible, false);
}
static __fastpath_inline
@@ -2442,13 +2453,17 @@ static void *setup_object(struct kmem_cache *s, void *object)
* Slab allocation and freeing
*/
static inline struct slab *alloc_slab_page(gfp_t flags, int node,
- struct kmem_cache_order_objects oo)
+ struct kmem_cache_order_objects oo,
+ bool allow_spin)
{
struct folio *folio;
struct slab *slab;
unsigned int order = oo_order(oo);
- if (node == NUMA_NO_NODE)
+ if (unlikely(!allow_spin))
+ folio = (struct folio *)alloc_frozen_pages_nolock(0/* __GFP_COMP is implied */,
+ node, order);
+ else if (node == NUMA_NO_NODE)
folio = (struct folio *)alloc_frozen_pages(flags, order);
else
folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
@@ -2598,6 +2613,7 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
+ bool allow_spin = gfpflags_allow_spinning(flags);
struct slab *slab;
struct kmem_cache_order_objects oo = s->oo;
gfp_t alloc_gfp;
@@ -2617,7 +2633,11 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_RECLAIM;
- slab = alloc_slab_page(alloc_gfp, node, oo);
+ /*
+ * __GFP_RECLAIM could be cleared on the first allocation attempt,
+ * so pass allow_spin flag directly.
+ */
+ slab = alloc_slab_page(alloc_gfp, node, oo, allow_spin);
if (unlikely(!slab)) {
oo = s->min;
alloc_gfp = flags;
@@ -2625,7 +2645,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
* Allocation may have failed due to fragmentation.
* Try a lower order alloc if possible
*/
- slab = alloc_slab_page(alloc_gfp, node, oo);
+ slab = alloc_slab_page(alloc_gfp, node, oo, allow_spin);
if (unlikely(!slab))
return NULL;
stat(s, ORDER_FALLBACK);
@@ -2798,33 +2818,47 @@ static void *alloc_single_from_partial(struct kmem_cache *s,
return object;
}
+static void defer_deactivate_slab(struct slab *slab, void *flush_freelist);
+
/*
* Called only for kmem_cache_debug() caches to allocate from a freshly
* allocated slab. Allocate a single object instead of whole freelist
* and put the slab to the partial (or full) list.
*/
-static void *alloc_single_from_new_slab(struct kmem_cache *s,
- struct slab *slab, int orig_size)
+static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
+ int orig_size, gfp_t gfpflags)
{
+ bool allow_spin = gfpflags_allow_spinning(gfpflags);
int nid = slab_nid(slab);
struct kmem_cache_node *n = get_node(s, nid);
unsigned long flags;
void *object;
+ if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
+ /* Unlucky, discard newly allocated slab */
+ slab->frozen = 1;
+ defer_deactivate_slab(slab, NULL);
+ return NULL;
+ }
object = slab->freelist;
slab->freelist = get_freepointer(s, object);
slab->inuse = 1;
- if (!alloc_debug_processing(s, slab, object, orig_size))
+ if (!alloc_debug_processing(s, slab, object, orig_size)) {
/*
* It's not really expected that this would fail on a
* freshly allocated slab, but a concurrent memory
* corruption in theory could cause that.
+ * Leak memory of allocated slab.
*/
+ if (!allow_spin)
+ spin_unlock_irqrestore(&n->list_lock, flags);
return NULL;
+ }
- spin_lock_irqsave(&n->list_lock, flags);
+ if (allow_spin)
+ spin_lock_irqsave(&n->list_lock, flags);
if (slab->inuse == slab->objects)
add_full(s, n, slab);
@@ -2865,7 +2899,10 @@ static struct slab *get_partial_node(struct kmem_cache *s,
if (!n || !n->nr_partial)
return NULL;
- spin_lock_irqsave(&n->list_lock, flags);
+ if (gfpflags_allow_spinning(pc->flags))
+ spin_lock_irqsave(&n->list_lock, flags);
+ else if (!spin_trylock_irqsave(&n->list_lock, flags))
+ return NULL;
list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
if (!pfmemalloc_match(slab, pc->flags))
continue;
@@ -3071,7 +3108,7 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
lockdep_register_key(&s->lock_key);
for_each_possible_cpu(cpu) {
c = per_cpu_ptr(s->cpu_slab, cpu);
- local_lock_init(&c->lock);
+ local_trylock_init(&c->lock);
if (finegrain_lockdep)
lockdep_set_class(&c->lock, &s->lock_key);
c->tid = init_tid(cpu);
@@ -3164,6 +3201,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
}
}
+/*
+ * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
+ * can be acquired without a deadlock before invoking the function.
+ *
+ * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
+ * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
+ * and kmalloc() is not used in an unsupported context.
+ *
+ * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
+ * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
+ * lockdep_assert() will catch a bug in case:
+ * #1
+ * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
+ * or
+ * #2
+ * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
+ *
+ * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
+ * disabled context. The lock will always be acquired and if needed it
+ * will block and sleep until the lock is available.
+ * #1 is possible in !PREEMPT_RT only.
+ * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
+ * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
+ * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
+ *
+ * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
+ */
+#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
+#define local_lock_cpu_slab(s, flags) \
+ local_lock_irqsave(&(s)->cpu_slab->lock, flags)
+#else
+#define local_lock_cpu_slab(s, flags) \
+ lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
+#endif
+
+#define local_unlock_cpu_slab(s, flags) \
+ local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
+
#ifdef CONFIG_SLUB_CPU_PARTIAL
static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
{
@@ -3248,7 +3323,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
unsigned long flags;
int slabs = 0;
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock_cpu_slab(s, flags);
oldslab = this_cpu_read(s->cpu_slab->partial);
@@ -3273,7 +3348,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
this_cpu_write(s->cpu_slab->partial, slab);
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock_cpu_slab(s, flags);
if (slab_to_put) {
__put_partials(s, slab_to_put);
@@ -3707,6 +3782,7 @@ static inline void *freeze_slab(struct kmem_cache *s, struct slab *slab)
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
unsigned long addr, struct kmem_cache_cpu *c, unsigned int orig_size)
{
+ bool allow_spin = gfpflags_allow_spinning(gfpflags);
void *freelist;
struct slab *slab;
unsigned long flags;
@@ -3732,9 +3808,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (unlikely(!node_match(slab, node))) {
/*
* same as above but node_match() being false already
- * implies node != NUMA_NO_NODE
+ * implies node != NUMA_NO_NODE.
+ * Reentrant slub cannot take locks necessary to
+ * deactivate_slab, hence ignore node preference.
+ * kmalloc_nolock() doesn't allow __GFP_THISNODE.
*/
- if (!node_isset(node, slab_nodes)) {
+ if (!node_isset(node, slab_nodes) ||
+ !allow_spin) {
node = NUMA_NO_NODE;
} else {
stat(s, ALLOC_NODE_MISMATCH);
@@ -3747,13 +3827,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* PFMEMALLOC but right now, we are losing the pfmemalloc
* information when the page leaves the per-cpu allocator
*/
- if (unlikely(!pfmemalloc_match(slab, gfpflags)))
+ if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin))
goto deactivate_slab;
/* must check again c->slab in case we got preempted and it changed */
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock_cpu_slab(s, flags);
+
if (unlikely(slab != c->slab)) {
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock_cpu_slab(s, flags);
goto reread_slab;
}
freelist = c->freelist;
@@ -3765,7 +3846,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
if (!freelist) {
c->slab = NULL;
c->tid = next_tid(c->tid);
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock_cpu_slab(s, flags);
stat(s, DEACTIVATE_BYPASS);
goto new_slab;
}
@@ -3784,34 +3865,34 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
VM_BUG_ON(!c->slab->frozen);
c->freelist = get_freepointer(s, freelist);
c->tid = next_tid(c->tid);
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock_cpu_slab(s, flags);
return freelist;
deactivate_slab:
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock_cpu_slab(s, flags);
if (slab != c->slab) {
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock_cpu_slab(s, flags);
goto reread_slab;
}
freelist = c->freelist;
c->slab = NULL;
c->freelist = NULL;
c->tid = next_tid(c->tid);
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock_cpu_slab(s, flags);
deactivate_slab(s, slab, freelist);
new_slab:
#ifdef CONFIG_SLUB_CPU_PARTIAL
while (slub_percpu_partial(c)) {
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock_cpu_slab(s, flags);
if (unlikely(c->slab)) {
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock_cpu_slab(s, flags);
goto reread_slab;
}
if (unlikely(!slub_percpu_partial(c))) {
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock_cpu_slab(s, flags);
/* we were preempted and partial list got empty */
goto new_objects;
}
@@ -3819,8 +3900,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
slab = slub_percpu_partial(c);
slub_set_percpu_partial(c, slab);
+ /*
+ * Reentrant slub cannot take locks necessary for
+ * __put_partials(), hence ignore node preference.
+ * kmalloc_nolock() doesn't allow __GFP_THISNODE.
+ */
if (likely(node_match(slab, node) &&
- pfmemalloc_match(slab, gfpflags))) {
+ pfmemalloc_match(slab, gfpflags)) ||
+ !allow_spin) {
c->slab = slab;
freelist = get_freelist(s, slab);
VM_BUG_ON(!freelist);
@@ -3828,7 +3915,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
goto load_freelist;
}
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock_cpu_slab(s, flags);
slab->next = NULL;
__put_partials(s, slab);
@@ -3850,8 +3937,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
* allocating new page from other nodes
*/
if (unlikely(node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
- && try_thisnode))
- pc.flags = GFP_NOWAIT | __GFP_THISNODE;
+ && try_thisnode)) {
+ if (unlikely(!allow_spin))
+ /* Do not upgrade gfp to NOWAIT from more restrictive mode */
+ pc.flags = gfpflags | __GFP_THISNODE;
+ else
+ pc.flags = GFP_NOWAIT | __GFP_THISNODE;
+ }
pc.orig_size = orig_size;
slab = get_partial(s, node, &pc);
@@ -3890,7 +3982,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
stat(s, ALLOC_SLAB);
if (kmem_cache_debug(s)) {
- freelist = alloc_single_from_new_slab(s, slab, orig_size);
+ freelist = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
if (unlikely(!freelist))
goto new_objects;
@@ -3912,7 +4004,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
inc_slabs_node(s, slab_nid(slab), slab->objects);
- if (unlikely(!pfmemalloc_match(slab, gfpflags))) {
+ if (unlikely(!pfmemalloc_match(slab, gfpflags) && allow_spin)) {
/*
* For !pfmemalloc_match() case we don't load freelist so that
* we don't make further mismatched allocations easier.
@@ -3923,7 +4015,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
retry_load_slab:
- local_lock_irqsave(&s->cpu_slab->lock, flags);
+ local_lock_cpu_slab(s, flags);
if (unlikely(c->slab)) {
void *flush_freelist = c->freelist;
struct slab *flush_slab = c->slab;
@@ -3932,9 +4024,14 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
c->freelist = NULL;
c->tid = next_tid(c->tid);
- local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+ local_unlock_cpu_slab(s, flags);
- deactivate_slab(s, flush_slab, flush_freelist);
+ if (unlikely(!allow_spin)) {
+ /* Reentrant slub cannot take locks, defer */
+ defer_deactivate_slab(flush_slab, flush_freelist);
+ } else {
+ deactivate_slab(s, flush_slab, flush_freelist);
+ }
stat(s, CPUSLAB_FLUSH);
@@ -3963,8 +4060,19 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
*/
c = slub_get_cpu_ptr(s->cpu_slab);
#endif
-
+ if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
+ if (local_lock_is_locked(&s->cpu_slab->lock)) {
+ /*
+ * EBUSY is an internal signal to kmalloc_nolock() to
+ * retry a different bucket. It's not propagated
+ * to the caller.
+ */
+ p = ERR_PTR(-EBUSY);
+ goto out;
+ }
+ }
p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
+out:
#ifdef CONFIG_PREEMPT_COUNT
slub_put_cpu_ptr(s->cpu_slab);
#endif
@@ -4088,7 +4196,7 @@ static void *__slab_alloc_node(struct kmem_cache *s,
return NULL;
}
- object = alloc_single_from_new_slab(s, slab, orig_size);
+ object = alloc_single_from_new_slab(s, slab, orig_size, gfpflags);
return object;
}
@@ -4167,8 +4275,9 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
if (p[i] && init && (!kasan_init ||
!kasan_has_integrated_init()))
memset(p[i], 0, zero_size);
- kmemleak_alloc_recursive(p[i], s->object_size, 1,
- s->flags, init_flags);
+ if (gfpflags_allow_spinning(flags))
+ kmemleak_alloc_recursive(p[i], s->object_size, 1,
+ s->flags, init_flags);
kmsan_slab_alloc(s, p[i], init_flags);
alloc_tagging_slab_alloc_hook(s, p[i], flags);
}
@@ -4359,6 +4468,94 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
}
EXPORT_SYMBOL(__kmalloc_noprof);
+/**
+ * kmalloc_nolock - Allocate an object of given size from any context.
+ * @size: size to allocate
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
+ * @node: node number of the target node.
+ *
+ * Return: pointer to the new object or NULL in case of error.
+ * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
+ * There is no reason to call it again and expect !NULL.
+ */
+void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
+{
+ gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
+ struct kmem_cache *s;
+ bool can_retry = true;
+ void *ret = ERR_PTR(-EBUSY);
+
+ VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
+
+ if (unlikely(!size))
+ return ZERO_SIZE_PTR;
+
+ if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+ /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
+ return NULL;
+retry:
+ if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
+ return NULL;
+ s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
+
+ if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
+ /*
+ * kmalloc_nolock() is not supported on architectures that
+ * don't implement cmpxchg16b, but debug caches don't use
+ * per-cpu slab and per-cpu partial slabs. They rely on
+ * kmem_cache_node->list_lock, so kmalloc_nolock() can
+ * attempt to allocate from debug caches by
+ * spin_trylock_irqsave(&n->list_lock, ...)
+ */
+ return NULL;
+
+ /*
+ * Do not call slab_alloc_node(), since trylock mode isn't
+ * compatible with slab_pre_alloc_hook/should_failslab and
+ * kfence_alloc. Hence call __slab_alloc_node() (at most twice)
+ * and slab_post_alloc_hook() directly.
+ *
+ * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
+ * in irq saved region. It assumes that the same cpu will not
+ * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
+ * Therefore use in_nmi() to check whether particular bucket is in
+ * irq protected section.
+ *
+ * If in_nmi() && local_lock_is_locked(s->cpu_slab) then it means that
+ * this cpu was interrupted somewhere inside ___slab_alloc() after
+ * it did local_lock_irqsave(&s->cpu_slab->lock, flags).
+ * In this case fast path with __update_cpu_freelist_fast() is not safe.
+ */
+#ifndef CONFIG_SLUB_TINY
+ if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
+#endif
+ ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
+
+ if (PTR_ERR(ret) == -EBUSY) {
+ if (can_retry) {
+ /* pick the next kmalloc bucket */
+ size = s->object_size + 1;
+ /*
+ * Another alternative is to
+ * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
+ * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
+ * to retry from bucket of the same size.
+ */
+ can_retry = false;
+ goto retry;
+ }
+ ret = NULL;
+ }
+
+ maybe_wipe_obj_freeptr(s, ret);
+ slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
+ slab_want_init_on_alloc(alloc_gfp, s), size);
+
+ ret = kasan_kmalloc(s, ret, size, alloc_gfp);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kmalloc_nolock_noprof);
+
void *__kmalloc_node_track_caller_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags,
int node, unsigned long caller)
{
@@ -4572,6 +4769,98 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
discard_slab(s, slab);
}
+struct defer_free {
+ struct llist_head objects;
+ struct llist_head slabs;
+ struct irq_work work;
+};
+
+static void free_deferred_objects(struct irq_work *work);
+
+static DEFINE_PER_CPU(struct defer_free, defer_free_objects) = {
+ .objects = LLIST_HEAD_INIT(objects),
+ .slabs = LLIST_HEAD_INIT(slabs),
+ .work = IRQ_WORK_INIT(free_deferred_objects),
+};
+
+/*
+ * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
+ * to take sleeping spin_locks from __slab_free() and deactivate_slab().
+ * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
+ */
+static void free_deferred_objects(struct irq_work *work)
+{
+ struct defer_free *df = container_of(work, struct defer_free, work);
+ struct llist_head *objs = &df->objects;
+ struct llist_head *slabs = &df->slabs;
+ struct llist_node *llnode, *pos, *t;
+
+ if (llist_empty(objs) && llist_empty(slabs))
+ return;
+
+ llnode = llist_del_all(objs);
+ llist_for_each_safe(pos, t, llnode) {
+ struct kmem_cache *s;
+ struct slab *slab;
+ void *x = pos;
+
+ slab = virt_to_slab(x);
+ s = slab->slab_cache;
+
+ /*
+ * We used freepointer in 'x' to link 'x' into df->objects.
+ * Clear it to NULL to avoid false positive detection
+ * of "Freepointer corruption".
+ */
+ *(void **)x = NULL;
+
+ /* Point 'x' back to the beginning of allocated object */
+ x -= s->offset;
+ /*
+ * memcg, kasan_slab_pre are already done for 'x'.
+ * The only thing left is kasan_poison.
+ */
+ kasan_slab_free(s, x, false, false, true);
+ __slab_free(s, slab, x, x, 1, _THIS_IP_);
+ }
+
+ llnode = llist_del_all(slabs);
+ llist_for_each_safe(pos, t, llnode) {
+ struct slab *slab = container_of(pos, struct slab, llnode);
+
+#ifdef CONFIG_SLUB_TINY
+ discard_slab(slab->slab_cache, slab);
+#else
+ deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
+#endif
+ }
+}
+
+static void defer_free(struct kmem_cache *s, void *head)
+{
+ struct defer_free *df = this_cpu_ptr(&defer_free_objects);
+
+ if (llist_add(head + s->offset, &df->objects))
+ irq_work_queue(&df->work);
+}
+
+static void defer_deactivate_slab(struct slab *slab, void *flush_freelist)
+{
+ struct defer_free *df = this_cpu_ptr(&defer_free_objects);
+
+ slab->flush_freelist = flush_freelist;
+ if (llist_add(&slab->llnode, &df->slabs))
+ irq_work_queue(&df->work);
+}
+
+void defer_free_barrier(void)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ irq_work_sync(&per_cpu_ptr(&defer_free_objects, cpu)->work);
+}
+
#ifndef CONFIG_SLUB_TINY
/*
* Fastpath with forced inlining to produce a kfree and kmem_cache_free that
@@ -4592,6 +4881,8 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
struct slab *slab, void *head, void *tail,
int cnt, unsigned long addr)
{
+ /* cnt == 0 signals that it's called from kfree_nolock() */
+ bool allow_spin = cnt;
struct kmem_cache_cpu *c;
unsigned long tid;
void **freelist;
@@ -4610,10 +4901,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
barrier();
if (unlikely(slab != c->slab)) {
- __slab_free(s, slab, head, tail, cnt, addr);
+ if (unlikely(!allow_spin)) {
+ /*
+ * __slab_free() can locklessly cmpxchg16 into a slab,
+ * but then it might need to take spin_lock or local_lock
+ * in put_cpu_partial() for further processing.
+ * Avoid the complexity and simply add to a deferred list.
+ */
+ defer_free(s, head);
+ } else {
+ __slab_free(s, slab, head, tail, cnt, addr);
+ }
return;
}
+ if (unlikely(!allow_spin)) {
+ if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
+ local_lock_is_locked(&s->cpu_slab->lock)) {
+ defer_free(s, head);
+ return;
+ }
+ cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
+ kasan_slab_free(s, head, false, false, /* skip quarantine */true);
+ }
+
if (USE_LOCKLESS_FAST_PATH()) {
freelist = READ_ONCE(c->freelist);
@@ -4624,11 +4935,13 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
goto redo;
}
} else {
+ __maybe_unused unsigned long flags = 0;
+
/* Update the free list under the local lock */
- local_lock(&s->cpu_slab->lock);
+ local_lock_cpu_slab(s, flags);
c = this_cpu_ptr(s->cpu_slab);
if (unlikely(slab != c->slab)) {
- local_unlock(&s->cpu_slab->lock);
+ local_unlock_cpu_slab(s, flags);
goto redo;
}
tid = c->tid;
@@ -4638,7 +4951,7 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
c->freelist = head;
c->tid = next_tid(tid);
- local_unlock(&s->cpu_slab->lock);
+ local_unlock_cpu_slab(s, flags);
}
stat_add(s, FREE_FASTPATH, cnt);
}
@@ -4861,6 +5174,65 @@ void kfree(const void *object)
}
EXPORT_SYMBOL(kfree);
+/*
+ * Can be called while holding raw_spinlock_t or from IRQ and NMI,
+ * but ONLY for objects allocated by kmalloc_nolock().
+ * Debug checks (like kmemleak and kfence) were skipped on allocation,
+ * hence
+ * obj = kmalloc(); kfree_nolock(obj);
+ * will miss kmemleak/kfence book keeping and will cause false positives.
+ * large_kmalloc is not supported either.
+ */
+void kfree_nolock(const void *object)
+{
+ struct folio *folio;
+ struct slab *slab;
+ struct kmem_cache *s;
+ void *x = (void *)object;
+
+ if (unlikely(ZERO_OR_NULL_PTR(object)))
+ return;
+
+ folio = virt_to_folio(object);
+ if (unlikely(!folio_test_slab(folio))) {
+ WARN_ONCE(1, "large_kmalloc is not supported by kfree_nolock()");
+ return;
+ }
+
+ slab = folio_slab(folio);
+ s = slab->slab_cache;
+
+ memcg_slab_free_hook(s, slab, &x, 1);
+ alloc_tagging_slab_free_hook(s, slab, &x, 1);
+ /*
+ * Unlike slab_free() do NOT call the following:
+ * kmemleak_free_recursive(x, s->flags);
+ * debug_check_no_locks_freed(x, s->object_size);
+ * debug_check_no_obj_freed(x, s->object_size);
+ * __kcsan_check_access(x, s->object_size, ..);
+ * kfence_free(x);
+ * since they take spinlocks or not safe from any context.
+ */
+ kmsan_slab_free(s, x);
+ /*
+ * If KASAN finds a kernel bug it will do kasan_report_invalid_free()
+ * which will call raw_spin_lock_irqsave() which is technically
+ * unsafe from NMI, but take chance and report kernel bug.
+ * The sequence of
+ * kasan_report_invalid_free() -> raw_spin_lock_irqsave() -> NMI
+ * -> kfree_nolock() -> kasan_report_invalid_free() on the same CPU
+ * is double buggy and deserves to deadlock.
+ */
+ if (kasan_slab_pre_free(s, x))
+ return;
+#ifndef CONFIG_SLUB_TINY
+ do_slab_free(s, slab, x, x, 0, _RET_IP_);
+#else
+ defer_free(s, x);
+#endif
+}
+EXPORT_SYMBOL_GPL(kfree_nolock);
+
static __always_inline __realloc_size(2) void *
__do_krealloc(const void *p, size_t new_size, gfp_t flags)
{
--
2.47.1
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-18 2:16 ` [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
@ 2025-07-22 15:52 ` Harry Yoo
2025-08-06 2:40 ` Alexei Starovoitov
0 siblings, 1 reply; 11+ messages in thread
From: Harry Yoo @ 2025-07-22 15:52 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, linux-mm, vbabka, shakeel.butt, mhocko, bigeasy, andrii,
memxor, akpm, peterz, rostedt, hannes
On Thu, Jul 17, 2025 at 07:16:46PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
>
> kmalloc_nolock() relies on ability of local_trylock_t to detect
> the situation when per-cpu kmem_cache is locked.
I think kmalloc_nolock() should be kmalloc_node_nolock() because
it has a `node` parameter?
  # Don't specify NUMA node      # Specify NUMA node
  kmalloc(size, gfp)             kmalloc_node(size, gfp, node)
  kmalloc_nolock(size, gfp)      kmalloc_node_nolock(size, gfp, node)
...just like kmalloc() and kmalloc_node()?
> In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
> disables IRQs and marks s->cpu_slab->lock as acquired.
> local_lock_is_locked(&s->cpu_slab->lock) returns true when
> slab is in the middle of manipulating per-cpu cache
> of that specific kmem_cache.
>
> kmalloc_nolock() can be called from any context and can re-enter
> into ___slab_alloc():
> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
> kmalloc_nolock() -> ___slab_alloc(cache_B)
> or
> kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
> kmalloc_nolock() -> ___slab_alloc(cache_B)
> Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
> can be acquired without a deadlock before invoking the function.
> If that specific per-cpu kmem_cache is busy the kmalloc_nolock()
> retries in a different kmalloc bucket. The second attempt will
> likely succeed, since this cpu locked different kmem_cache.
>
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current _task_. In this case
> re-entrance into the same kmalloc bucket is unsafe, and
> kmalloc_nolock() tries a different bucket that is most likely is
> not locked by the current task. Though it may be locked by a
> different task it's safe to rt_spin_lock() and sleep on it.
>
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
A question; I was confused for a while thinking "If it can't be called
from NMI and hard irq on PREEMPT_RT, why it can't just spin?"
And I guess it's because even in process context, when kmalloc_nolock()
is called by bpf, it can be called by the task that is holding the local lock
and thus spinning is not allowed. Is that correct?
> kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> and (in_nmi() or in PREEMPT_RT).
>
> SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> spin_trylock_irqsave(&n->list_lock) to allocate,
> while kfree_nolock() always defers to irq_work.
>
> Note, kfree_nolock() must be called _only_ for objects allocated
> with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
> were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
> will miss kmemleak/kfence book keeping and will cause false positives.
> large_kmalloc is not supported by either kmalloc_nolock()
> or kfree_nolock().
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/kasan.h | 13 +-
> include/linux/slab.h | 4 +
> mm/Kconfig | 1 +
> mm/kasan/common.c | 5 +-
> mm/slab.h | 6 +
> mm/slab_common.c | 3 +
> mm/slub.c | 466 +++++++++++++++++++++++++++++++++++++-----
> 7 files changed, 445 insertions(+), 53 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 54444bce218e..7de6da4ee46d 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
> int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> gfp_t gfp, bool new_slab)
> {
> + bool allow_spin = gfpflags_allow_spinning(gfp);
> unsigned int objects = objs_per_slab(s, slab);
> unsigned long new_exts;
> unsigned long old_exts;
> @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> gfp &= ~OBJCGS_CLEAR_MASK;
> /* Prevent recursive extension vector allocation */
> gfp |= __GFP_NO_OBJ_EXT;
> - vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> - slab_nid(slab));
> + if (unlikely(!allow_spin)) {
> + size_t sz = objects * sizeof(struct slabobj_ext);
> +
> + vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
In free_slab_obj_exts(), how do you know slabobj_ext is allocated via
kmalloc_nolock() or kcalloc_node()?
I was going to say "add a new flag to enum objext_flags",
but all lower 3 bits of slab->obj_exts pointer are already in use? oh...
Maybe need a magic trick to add one more flag,
like always align the size with 16?
In practice that should not lead to increase in memory consumption
anyway because most of the kmalloc-* sizes are already at least
16 bytes aligned.
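Roughly what I mean (names and masks below are hypothetical, just to
illustrate the bit layout):

	/*
	 * slab->obj_exts packs the slabobj_ext vector pointer together with
	 * flag bits in its low bits. If the vector were guaranteed to be at
	 * least 16-byte aligned, bit 3 of the pointer would always be zero
	 * and could carry one more flag, e.g. "allocated via kmalloc_nolock()".
	 */
	#define OBJEXTS_NOLOCK_ALLOC	(1UL << 3)		/* hypothetical new flag */
	#define OBJEXTS_ALL_FLAGS_MASK	((1UL << 4) - 1)	/* hypothetical: 4 low flag bits */

	static inline struct slabobj_ext *obj_exts_vec(unsigned long obj_exts)
	{
		return (struct slabobj_ext *)(obj_exts & ~OBJEXTS_ALL_FLAGS_MASK);
	}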
> + } else {
> + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> + slab_nid(slab));
> + }
> if (!vec) {
> /* Mark vectors which failed to allocate */
> if (new_slab)
> +static void defer_deactivate_slab(struct slab *slab, void *flush_freelist);
> +
> /*
> * Called only for kmem_cache_debug() caches to allocate from a freshly
> * allocated slab. Allocate a single object instead of whole freelist
> * and put the slab to the partial (or full) list.
> */
> -static void *alloc_single_from_new_slab(struct kmem_cache *s,
> - struct slab *slab, int orig_size)
> +static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> + int orig_size, gfp_t gfpflags)
> {
> + bool allow_spin = gfpflags_allow_spinning(gfpflags);
> int nid = slab_nid(slab);
> struct kmem_cache_node *n = get_node(s, nid);
> unsigned long flags;
> void *object;
>
> + if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
I think alloc_debug_processing() doesn't have to be called under
n->list_lock here because it is a new slab?
That means the code can be something like:
/* allocate one object from slab */
object = slab->freelist;
slab->freelist = get_freepointer(s, object);
slab->inuse = 1;
/* Leak slab if debug checks fails */
if (!alloc_debug_processing())
return NULL;
/* add slab to per-node partial list */
if (allow_spin) {
spin_lock_irqsave();
} else if (!spin_trylock_irqsave()) {
slab->frozen = 1;
defer_deactivate_slab();
}
> + /* Unlucky, discard newly allocated slab */
> + slab->frozen = 1;
> + defer_deactivate_slab(slab, NULL);
> + return NULL;
> + }
>
> object = slab->freelist;
> slab->freelist = get_freepointer(s, object);
> slab->inuse = 1;
>
> - if (!alloc_debug_processing(s, slab, object, orig_size))
> + if (!alloc_debug_processing(s, slab, object, orig_size)) {
> /*
> * It's not really expected that this would fail on a
> * freshly allocated slab, but a concurrent memory
> * corruption in theory could cause that.
> + * Leak memory of allocated slab.
> */
> + if (!allow_spin)
> + spin_unlock_irqrestore(&n->list_lock, flags);
> return NULL;
> + }
>
> - spin_lock_irqsave(&n->list_lock, flags);
> + if (allow_spin)
> + spin_lock_irqsave(&n->list_lock, flags);
>
> if (slab->inuse == slab->objects)
> add_full(s, n, slab);
> @@ -3164,6 +3201,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> }
> }
>
> +/*
> + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> + * can be acquired without a deadlock before invoking the function.
> + *
> + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
> + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
> + * and kmalloc() is not used in an unsupported context.
> + *
> + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
> + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
> + * lockdep_assert() will catch a bug in case:
> + * #1
> + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
> + * or
> + * #2
> + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
> + *
> + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> + * disabled context. The lock will always be acquired and if needed it
> + * block and sleep until the lock is available.
> + * #1 is possible in !PREEMP_RT only.
s/PREEMP_RT/PREEMPT_RT/
> + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
> + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
> + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
> + *
> + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
> + */
> +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
> +#define local_lock_cpu_slab(s, flags) \
> + local_lock_irqsave(&(s)->cpu_slab->lock, flags)
> +#else
> +#define local_lock_cpu_slab(s, flags) \
> + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
> +#endif
> +
> +#define local_unlock_cpu_slab(s, flags) \
> + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
> +
> #ifdef CONFIG_SLUB_CPU_PARTIAL
> static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
> {
> @@ -3732,9 +3808,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> if (unlikely(!node_match(slab, node))) {
> /*
> * same as above but node_match() being false already
> - * implies node != NUMA_NO_NODE
> + * implies node != NUMA_NO_NODE.
> + * Reentrant slub cannot take locks necessary to
> + * deactivate_slab, hence ignore node preference.
Now that we have defer_deactivate_slab(), we need to either update the
code or comment?
1. Deactivate slabs when node / pfmemalloc mismatches
or 2. Update comments to explain why it's still undesirable
> + * kmalloc_nolock() doesn't allow __GFP_THISNODE.
> */
> - if (!node_isset(node, slab_nodes)) {
> + if (!node_isset(node, slab_nodes) ||
> + !allow_spin) {
> node = NUMA_NO_NODE;
> } else {
> stat(s, ALLOC_NODE_MISMATCH);
> @@ -4572,6 +4769,98 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> discard_slab(s, slab);
> }
>
> +/*
> + * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
> + * to take sleeping spin_locks from __slab_free() and deactivate_slab().
> + * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
> + */
> +static void free_deferred_objects(struct irq_work *work)
> +{
> + struct defer_free *df = container_of(work, struct defer_free, work);
> + struct llist_head *objs = &df->objects;
> + struct llist_head *slabs = &df->slabs;
> + struct llist_node *llnode, *pos, *t;
> +
> + if (llist_empty(objs) && llist_empty(slabs))
> + return;
> +
> + llnode = llist_del_all(objs);
> + llist_for_each_safe(pos, t, llnode) {
> + struct kmem_cache *s;
> + struct slab *slab;
> + void *x = pos;
> +
> + slab = virt_to_slab(x);
> + s = slab->slab_cache;
> +
> + /*
> + * We used freepointer in 'x' to link 'x' into df->objects.
> + * Clear it to NULL to avoid false positive detection
> + * of "Freepointer corruption".
> + */
> + *(void **)x = NULL;
> +
> + /* Point 'x' back to the beginning of allocated object */
> + x -= s->offset;
> + /*
> + * memcg, kasan_slab_pre are already done for 'x'.
> + * The only thing left is kasan_poison.
> + */
> + kasan_slab_free(s, x, false, false, true);
> + __slab_free(s, slab, x, x, 1, _THIS_IP_);
> + }
> +
> + llnode = llist_del_all(slabs);
> + llist_for_each_safe(pos, t, llnode) {
> + struct slab *slab = container_of(pos, struct slab, llnode);
> +
> +#ifdef CONFIG_SLUB_TINY
> + discard_slab(slab->slab_cache, slab);
...and with my comment on alloc_single_from_new_slab(),
The slab may not be empty anymore?
> +#else
> + deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
> +#endif
> + }
> +}
> @@ -4610,10 +4901,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> barrier();
>
> if (unlikely(slab != c->slab)) {
> - __slab_free(s, slab, head, tail, cnt, addr);
> + if (unlikely(!allow_spin)) {
> + /*
> + * __slab_free() can locklessly cmpxchg16 into a slab,
> + * but then it might need to take spin_lock or local_lock
> + * in put_cpu_partial() for further processing.
> + * Avoid the complexity and simply add to a deferred list.
> + */
> + defer_free(s, head);
> + } else {
> + __slab_free(s, slab, head, tail, cnt, addr);
> + }
> return;
> }
>
> + if (unlikely(!allow_spin)) {
> + if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> + local_lock_is_locked(&s->cpu_slab->lock)) {
> + defer_free(s, head);
> + return;
> + }
> + cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> + kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> + }
I'm not sure what prevents below from happening
1. slab == c->slab && !allow_spin -> call kasan_slab_free()
2. preempted by something and resume
3. after acquiring local_lock, slab != c->slab, release local_lock, goto redo
4. !allow_spin, so defer_free() will call kasan_slab_free() again later
Perhaps kasan_slab_free() should be called before do_slab_free()
just like normal free path and do not call kasan_slab_free() in deferred
freeing (then you may need to disable KASAN while accessing the deferred
list)?
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-07-22 15:52 ` Harry Yoo
@ 2025-08-06 2:40 ` Alexei Starovoitov
2025-08-12 15:11 ` Harry Yoo
0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2025-08-06 2:40 UTC (permalink / raw)
To: Harry Yoo
Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko,
Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Tue, Jul 22, 2025 at 8:52 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
Sorry for the delay. PTO plus merge window.
> On Thu, Jul 17, 2025 at 07:16:46PM -0700, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > kmalloc_nolock() relies on ability of local_trylock_t to detect
> > the situation when per-cpu kmem_cache is locked.
>
> I think kmalloc_nolock() should be kmalloc_node_nolock() because
> it has `node` parameter?
>
>   # Don't specify NUMA node        # Specify NUMA node
>   kmalloc(size, gfp)               kmalloc_node(size, gfp, node)
>   kmalloc_nolock(size, gfp)        kmalloc_node_nolock(size, gfp, node)
>
> ...just like kmalloc() and kmalloc_node()?
I think this is a wrong pattern to follow.
All this "_node" suffix/rename was done long ago to avoid massive
search-and-replace when NUMA was introduced. Now NUMA is mandatory.
The user of API should think what numa_id to use and if none
they should explicitly say NUMA_NO_NODE.
Hiding behind macros is not a good api.
I hate the "_node_align" suffixes too. It's a silly convention.
Nothing in the kernel follows such an outdated naming scheme.
mm folks should follow what the rest of the kernel does
instead of following a pattern from 20 years ago.
> > In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
> > disables IRQs and marks s->cpu_slab->lock as acquired.
> > local_lock_is_locked(&s->cpu_slab->lock) returns true when
> > slab is in the middle of manipulating per-cpu cache
> > of that specific kmem_cache.
> >
> > kmalloc_nolock() can be called from any context and can re-enter
> > into ___slab_alloc():
> > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
> > kmalloc_nolock() -> ___slab_alloc(cache_B)
> > or
> > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
> > kmalloc_nolock() -> ___slab_alloc(cache_B)
>
> > Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
> > can be acquired without a deadlock before invoking the function.
> > If that specific per-cpu kmem_cache is busy the kmalloc_nolock()
> > retries in a different kmalloc bucket. The second attempt will
> > likely succeed, since this cpu locked different kmem_cache.
> >
> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > per-cpu rt_spin_lock is locked by current _task_. In this case
> > re-entrance into the same kmalloc bucket is unsafe, and
> > kmalloc_nolock() tries a different bucket that is most likely is
> > not locked by the current task. Though it may be locked by a
> > different task it's safe to rt_spin_lock() and sleep on it.
> >
> > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > immediately if called from hard irq or NMI in PREEMPT_RT.
>
> A question; I was confused for a while thinking "If it can't be called
> from NMI and hard irq on PREEMPT_RT, why it can't just spin?"
It's not safe due to PI issues in RT.
Steven and Sebastian explained it earlier:
https://lore.kernel.org/bpf/20241213124411.105d0f33@gandalf.local.home/
I don't think I can copy paste the multi page explanation in
commit log or into comments.
So "not safe in NMI or hard irq on RT" is the summary.
Happy to add a few words, but don't know what exactly to say.
If Steven/Sebastian can provide a paragraph I can add it.
> And I guess it's because even in process context, when kmalloc_nolock()
> is called by bpf, it can be called by the task that is holding the local lock
> and thus spinning is not allowed. Is that correct?
>
> > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > and (in_nmi() or in PREEMPT_RT).
> >
> > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > spin_trylock_irqsave(&n->list_lock) to allocate,
> > while kfree_nolock() always defers to irq_work.
> >
> > Note, kfree_nolock() must be called _only_ for objects allocated
> > with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
> > were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
> > will miss kmemleak/kfence book keeping and will cause false positives.
> > large_kmalloc is not supported by either kmalloc_nolock()
> > or kfree_nolock().
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> > include/linux/kasan.h | 13 +-
> > include/linux/slab.h | 4 +
> > mm/Kconfig | 1 +
> > mm/kasan/common.c | 5 +-
> > mm/slab.h | 6 +
> > mm/slab_common.c | 3 +
> > mm/slub.c | 466 +++++++++++++++++++++++++++++++++++++-----
> > 7 files changed, 445 insertions(+), 53 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 54444bce218e..7de6da4ee46d 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
> > int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > gfp_t gfp, bool new_slab)
> > {
> > + bool allow_spin = gfpflags_allow_spinning(gfp);
> > unsigned int objects = objs_per_slab(s, slab);
> > unsigned long new_exts;
> > unsigned long old_exts;
> > @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > gfp &= ~OBJCGS_CLEAR_MASK;
> > /* Prevent recursive extension vector allocation */
> > gfp |= __GFP_NO_OBJ_EXT;
> > - vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > - slab_nid(slab));
> > + if (unlikely(!allow_spin)) {
> > + size_t sz = objects * sizeof(struct slabobj_ext);
> > +
> > + vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
>
> In free_slab_obj_exts(), how do you know slabobj_ext is allocated via
> kmalloc_nolock() or kcalloc_node()?
Technically kmalloc_nolock()->kfree() isn't as bad as
kmalloc()->kfree_nolock(), since kmemleak/kfence can ignore
debug free-ing action without matching alloc side,
but you're right it's better to avoid it.
> I was going to say "add a new flag to enum objext_flags",
> but all lower 3 bits of slab->obj_exts pointer are already in use? oh...
>
> Maybe need a magic trick to add one more flag,
> like always align the size with 16?
>
> In practice that should not lead to increase in memory consumption
> anyway because most of the kmalloc-* sizes are already at least
> 16 bytes aligned.
Yes. That's an option, but I think we can do better.
OBJEXTS_ALLOC_FAIL doesn't need to consume the bit.
Here are two patches that fix this issue:
Subject: [PATCH 1/2] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL
Since the combination of valid upper bits in slab->obj_exts with
OBJEXTS_ALLOC_FAIL bit can never happen,
use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel
instead of (1ull << 2) to free up bit 2.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/memcontrol.h | 4 +++-
mm/slub.c | 2 +-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 785173aa0739..daa78665f850 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -341,17 +341,19 @@ enum page_memcg_data_flags {
__NR_MEMCG_DATA_FLAGS = (1UL << 2),
};
+#define __OBJEXTS_ALLOC_FAIL MEMCG_DATA_OBJEXTS
#define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS
#else /* CONFIG_MEMCG */
+#define __OBJEXTS_ALLOC_FAIL (1UL << 0)
#define __FIRST_OBJEXT_FLAG (1UL << 0)
#endif /* CONFIG_MEMCG */
enum objext_flags {
/* slabobj_ext vector failed to allocate */
- OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
+ OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
/* the next bit after the last actual flag */
__NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
};
diff --git a/mm/slub.c b/mm/slub.c
index bd4bf2613e7a..16e53bfb310e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1950,7 +1950,7 @@ static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
* objects with no tag reference. Mark all references in this
* vector as empty to avoid warnings later on.
*/
- if (obj_exts & OBJEXTS_ALLOC_FAIL) {
+ if (obj_exts == OBJEXTS_ALLOC_FAIL) {
unsigned int i;
for (i = 0; i < objects; i++)
--
2.47.3
Subject: [PATCH 2/2] slab: Use kfree_nolock() to free obj_exts
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/memcontrol.h | 1 +
mm/slub.c | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index daa78665f850..2e6c33fdd9c5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -354,6 +354,7 @@ enum page_memcg_data_flags {
enum objext_flags {
/* slabobj_ext vector failed to allocate */
OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
+ OBJEXTS_NOSPIN_ALLOC = __FIRST_OBJEXT_FLAG,
/* the next bit after the last actual flag */
__NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
};
diff --git a/mm/slub.c b/mm/slub.c
index 16e53bfb310e..417d647f1f02 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2009,6 +2009,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
}
new_exts = (unsigned long)vec;
+ if (unlikely(!allow_spin))
+ new_exts |= OBJEXTS_NOSPIN_ALLOC;
#ifdef CONFIG_MEMCG
new_exts |= MEMCG_DATA_OBJEXTS;
#endif
@@ -2056,7 +2058,10 @@ static inline void free_slab_obj_exts(struct slab *slab)
* the extension for obj_exts is expected to be NULL.
*/
mark_objexts_empty(obj_exts);
- kfree(obj_exts);
+ if (unlikely(READ_ONCE(slab->obj_exts) & OBJEXTS_NOSPIN_ALLOC))
+ kfree_nolock(obj_exts);
+ else
+ kfree(obj_exts);
slab->obj_exts = 0;
}
--
2.47.3
> > + } else {
> > + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > + slab_nid(slab));
> > + }
> > if (!vec) {
> > /* Mark vectors which failed to allocate */
> > if (new_slab)
> > +static void defer_deactivate_slab(struct slab *slab, void *flush_freelist);
> > +
> > /*
> > * Called only for kmem_cache_debug() caches to allocate from a freshly
> > * allocated slab. Allocate a single object instead of whole freelist
> > * and put the slab to the partial (or full) list.
> > */
> > -static void *alloc_single_from_new_slab(struct kmem_cache *s,
> > - struct slab *slab, int orig_size)
> > +static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > + int orig_size, gfp_t gfpflags)
> > {
> > + bool allow_spin = gfpflags_allow_spinning(gfpflags);
> > int nid = slab_nid(slab);
> > struct kmem_cache_node *n = get_node(s, nid);
> > unsigned long flags;
> > void *object;
> >
> > + if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
>
> I think alloc_debug_processing() doesn't have to be called under
> n->list_lock here because it is a new slab?
>
> That means the code can be something like:
>
> /* allocate one object from slab */
> object = slab->freelist;
> slab->freelist = get_freepointer(s, object);
> slab->inuse = 1;
>
> /* Leak slab if debug checks fails */
> if (!alloc_debug_processing())
> return NULL;
>
> /* add slab to per-node partial list */
> if (allow_spin) {
> spin_lock_irqsave();
> } else if (!spin_trylock_irqsave()) {
> slab->frozen = 1;
> defer_deactivate_slab();
> }
No. That doesn't work. I implemented it this way
before reverting back to spin_trylock_irqsave() in the beginning.
The problem is alloc_debug_processing() will likely succeed
and undoing it is pretty complex.
So it's better to "!allow_spin && !spin_trylock_irqsave()"
before doing expensive and hard to undo alloc_debug_processing().
>
> > + /* Unlucky, discard newly allocated slab */
> > + slab->frozen = 1;
> > + defer_deactivate_slab(slab, NULL);
> > + return NULL;
> > + }
> >
> > object = slab->freelist;
> > slab->freelist = get_freepointer(s, object);
> > slab->inuse = 1;
> >
> > - if (!alloc_debug_processing(s, slab, object, orig_size))
> > + if (!alloc_debug_processing(s, slab, object, orig_size)) {
> > /*
> > * It's not really expected that this would fail on a
> > * freshly allocated slab, but a concurrent memory
> > * corruption in theory could cause that.
> > + * Leak memory of allocated slab.
> > */
> > + if (!allow_spin)
> > + spin_unlock_irqrestore(&n->list_lock, flags);
> > return NULL;
> > + }
> >
> > - spin_lock_irqsave(&n->list_lock, flags);
> > + if (allow_spin)
> > + spin_lock_irqsave(&n->list_lock, flags);
> >
> > if (slab->inuse == slab->objects)
> > add_full(s, n, slab);
> > @@ -3164,6 +3201,44 @@ static void deactivate_slab(struct kmem_cache *s, struct slab *slab,
> > }
> > }
> >
> > +/*
> > + * ___slab_alloc()'s caller is supposed to check if kmem_cache::kmem_cache_cpu::lock
> > + * can be acquired without a deadlock before invoking the function.
> > + *
> > + * Without LOCKDEP we trust the code to be correct. kmalloc_nolock() is
> > + * using local_lock_is_locked() properly before calling local_lock_cpu_slab(),
> > + * and kmalloc() is not used in an unsupported context.
> > + *
> > + * With LOCKDEP, on PREEMPT_RT lockdep does its checking in local_lock_irqsave().
> > + * On !PREEMPT_RT we use trylock to avoid false positives in NMI, but
> > + * lockdep_assert() will catch a bug in case:
> > + * #1
> > + * kmalloc() -> ___slab_alloc() -> irqsave -> NMI -> bpf -> kmalloc_nolock()
> > + * or
> > + * #2
> > + * kmalloc() -> ___slab_alloc() -> irqsave -> tracepoint/kprobe -> bpf -> kmalloc_nolock()
> > + *
> > + * On PREEMPT_RT an invocation is not possible from IRQ-off or preempt
> > + * disabled context. The lock will always be acquired and if needed it
> > + * block and sleep until the lock is available.
> > + * #1 is possible in !PREEMP_RT only.
>
> s/PREEMP_RT/PREEMPT_RT/
ok.
> > + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
> > + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
> > + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
> > + *
> > + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
> > + */
> > +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
> > +#define local_lock_cpu_slab(s, flags) \
> > + local_lock_irqsave(&(s)->cpu_slab->lock, flags)
> > +#else
> > +#define local_lock_cpu_slab(s, flags) \
> > + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
> > +#endif
> > +
> > +#define local_unlock_cpu_slab(s, flags) \
> > + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
> > +
> > #ifdef CONFIG_SLUB_CPU_PARTIAL
> > static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
> > {
> > @@ -3732,9 +3808,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > if (unlikely(!node_match(slab, node))) {
> > /*
> > * same as above but node_match() being false already
> > - * implies node != NUMA_NO_NODE
> > + * implies node != NUMA_NO_NODE.
> > + * Reentrant slub cannot take locks necessary to
> > + * deactivate_slab, hence ignore node preference.
>
> Now that we have defer_deactivate_slab(), we need to either update the
> code or comment?
>
> 1. Deactivate slabs when node / pfmemalloc mismatches
> or 2. Update comments to explain why it's still undesirable
Well, defer_deactivate_slab() is a heavy hammer.
In !SLUB_TINY it pretty much never happens.
This bit:
retry_load_slab:
local_lock_cpu_slab(s, flags);
if (unlikely(c->slab)) {
is very rare. I couldn't trigger it at all in my stress test.
But in this hunk the node mismatch is not rare, so ignoring node preference
for kmalloc_nolock() is a much better trade off.
I'll add a comment that defer_deactivate_slab() is undesired here.
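Something along these lines, perhaps (rough draft of the comment):

	/*
	 * same as above but node_match() being false already
	 * implies node != NUMA_NO_NODE.
	 * Reentrant slub cannot take locks necessary to
	 * deactivate_slab(). Deferring the deactivation via irq_work
	 * would work, but node mismatches are not rare in this path,
	 * so deferring a whole slab for each one is too heavy a hammer.
	 * Simply ignore the node preference instead.
	 * kmalloc_nolock() doesn't allow __GFP_THISNODE.
	 */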
> > + * kmalloc_nolock() doesn't allow __GFP_THISNODE.
> > */
> > - if (!node_isset(node, slab_nodes)) {
> > + if (!node_isset(node, slab_nodes) ||
> > + !allow_spin) {
> > node = NUMA_NO_NODE;
> > } else {
> > stat(s, ALLOC_NODE_MISMATCH);
>
> > @@ -4572,6 +4769,98 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> > discard_slab(s, slab);
> > }
> >
> > +/*
> > + * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
> > + * to take sleeping spin_locks from __slab_free() and deactivate_slab().
> > + * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
> > + */
> > +static void free_deferred_objects(struct irq_work *work)
> > +{
> > + struct defer_free *df = container_of(work, struct defer_free, work);
> > + struct llist_head *objs = &df->objects;
> > + struct llist_head *slabs = &df->slabs;
> > + struct llist_node *llnode, *pos, *t;
> > +
> > + if (llist_empty(objs) && llist_empty(slabs))
> > + return;
> > +
> > + llnode = llist_del_all(objs);
> > + llist_for_each_safe(pos, t, llnode) {
> > + struct kmem_cache *s;
> > + struct slab *slab;
> > + void *x = pos;
> > +
> > + slab = virt_to_slab(x);
> > + s = slab->slab_cache;
> > +
> > + /*
> > + * We used freepointer in 'x' to link 'x' into df->objects.
> > + * Clear it to NULL to avoid false positive detection
> > + * of "Freepointer corruption".
> > + */
> > + *(void **)x = NULL;
> > +
> > + /* Point 'x' back to the beginning of allocated object */
> > + x -= s->offset;
> > + /*
> > + * memcg, kasan_slab_pre are already done for 'x'.
> > + * The only thing left is kasan_poison.
> > + */
> > + kasan_slab_free(s, x, false, false, true);
> > + __slab_free(s, slab, x, x, 1, _THIS_IP_);
> > + }
> > +
> > + llnode = llist_del_all(slabs);
> > + llist_for_each_safe(pos, t, llnode) {
> > + struct slab *slab = container_of(pos, struct slab, llnode);
> > +
> > +#ifdef CONFIG_SLUB_TINY
> > + discard_slab(slab->slab_cache, slab);
>
> ...and with my comment on alloc_single_from_new_slab(),
> The slab may not be empty anymore?
Exactly.
That's another problem with your suggestion in alloc_single_from_new_slab().
That's why I did it as:
if (!allow_spin && !spin_trylock_irqsave(...)
and I still believe it's the right call.
The slab is empty here, so discard_slab() is appropriate.
>
> > +#else
> > + deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
> > +#endif
> > + }
> > +}
> > @@ -4610,10 +4901,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> > barrier();
> >
> > if (unlikely(slab != c->slab)) {
> > - __slab_free(s, slab, head, tail, cnt, addr);
> > + if (unlikely(!allow_spin)) {
> > + /*
> > + * __slab_free() can locklessly cmpxchg16 into a slab,
> > + * but then it might need to take spin_lock or local_lock
> > + * in put_cpu_partial() for further processing.
> > + * Avoid the complexity and simply add to a deferred list.
> > + */
> > + defer_free(s, head);
> > + } else {
> > + __slab_free(s, slab, head, tail, cnt, addr);
> > + }
> > return;
> > }
> >
> > + if (unlikely(!allow_spin)) {
> > + if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> > + local_lock_is_locked(&s->cpu_slab->lock)) {
> > + defer_free(s, head);
> > + return;
> > + }
> > + cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> > + kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> > + }
>
> I'm not sure what prevents below from happening
>
> 1. slab == c->slab && !allow_spin -> call kasan_slab_free()
> 2. preempted by something and resume
> 3. after acquiring local_lock, slab != c->slab, release local_lock, goto redo
> 4. !allow_spin, so defer_free() will call kasan_slab_free() again later
yes. it's possible and it's ok.
kasan_slab_free(, no_quarantine == true)
is poison_slab_object() only and one can do it many times.
> Perhaps kasan_slab_free() should be called before do_slab_free()
> just like normal free path and do not call kasan_slab_free() in deferred
> freeing (then you may need to disable KASAN while accessing the deferred
> list)?
I tried that too and didn't like it.
After kasan_slab_free() the object cannot be put on the llist easily.
One needs to do a kasan_reset_tag() dance which uglifies the code a lot.
Double kasan poison in a rare case is fine. There is no harm.
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-08-06 2:40 ` Alexei Starovoitov
@ 2025-08-12 15:11 ` Harry Yoo
2025-08-12 17:08 ` Harry Yoo
0 siblings, 1 reply; 11+ messages in thread
From: Harry Yoo @ 2025-08-12 15:11 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko,
Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Tue, Aug 05, 2025 at 07:40:31PM -0700, Alexei Starovoitov wrote:
> On Tue, Jul 22, 2025 at 8:52 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
>
> Sorry for the delay. PTO plus merge window.
No problem! Hope you enjoyed PTO :)
Sorry for the delay as well..
> > On Thu, Jul 17, 2025 at 07:16:46PM -0700, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > kmalloc_nolock() relies on ability of local_trylock_t to detect
> > > the situation when per-cpu kmem_cache is locked.
> >
> > I think kmalloc_nolock() should be kmalloc_node_nolock() because
> > it has `node` parameter?
> >
> >   # Don't specify NUMA node        # Specify NUMA node
> >   kmalloc(size, gfp)               kmalloc_node(size, gfp, node)
> >   kmalloc_nolock(size, gfp)        kmalloc_node_nolock(size, gfp, node)
> >
> > ...just like kmalloc() and kmalloc_node()?
>
> I think this is a wrong pattern to follow.
> All this "_node" suffix/rename was done long ago to avoid massive
> search-and-replace when NUMA was introduced. Now NUMA is mandatory.
> The user of API should think what numa_id to use and if none
> they should explicitly say NUMA_NO_NODE.
You're probably right, no strong opinion from me.
> Hiding behind macros is not a good api.
> I hate the "_node_align" suffixes too. It's a silly convention.
> Nothing in the kernel follows such an outdated naming scheme.
> mm folks should follow what the rest of the kernel does
> instead of following a pattern from 20 years ago.
That's a new scheme from a very recent patch series that did not land
mainline yet
https://lore.kernel.org/linux-mm/20250806124034.1724515-1-vitaly.wool@konsulko.se
> > > In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
> > > disables IRQs and marks s->cpu_slab->lock as acquired.
> > > local_lock_is_locked(&s->cpu_slab->lock) returns true when
> > > slab is in the middle of manipulating per-cpu cache
> > > of that specific kmem_cache.
> > >
> > > kmalloc_nolock() can be called from any context and can re-enter
> > > into ___slab_alloc():
> > > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
> > > kmalloc_nolock() -> ___slab_alloc(cache_B)
> > > or
> > > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
> > > kmalloc_nolock() -> ___slab_alloc(cache_B)
> >
> > > Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
> > > can be acquired without a deadlock before invoking the function.
> > > If that specific per-cpu kmem_cache is busy the kmalloc_nolock()
> > > retries in a different kmalloc bucket. The second attempt will
> > > likely succeed, since this cpu locked different kmem_cache.
> > >
> > > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > > per-cpu rt_spin_lock is locked by current _task_. In this case
> > > re-entrance into the same kmalloc bucket is unsafe, and
> > > kmalloc_nolock() tries a different bucket that is most likely is
> > > not locked by the current task. Though it may be locked by a
> > > different task it's safe to rt_spin_lock() and sleep on it.
> > >
> > > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > > immediately if called from hard irq or NMI in PREEMPT_RT.
> >
> > A question; I was confused for a while thinking "If it can't be called
> > from NMI and hard irq on PREEMPT_RT, why it can't just spin?"
>
> It's not safe due to PI issues in RT.
> Steven and Sebastian explained it earlier:
> https://lore.kernel.org/bpf/20241213124411.105d0f33@gandalf.local.home/
Uh, I was totally missing that point. Thanks for pointing it out!
> I don't think I can copy paste the multi page explanation in
> commit log or into comments.
> So "not safe in NMI or hard irq on RT" is the summary.
> Happy to add a few words, but don't know what exactly to say.
> If Steven/Sebastian can provide a paragraph I can add it.
>
> > And I guess it's because even in process context, when kmalloc_nolock()
> > is called by bpf, it can be called by the task that is holding the local lock
> > and thus spinning is not allowed. Is that correct?
> >
> > > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > > and (in_nmi() or in PREEMPT_RT).
> > >
> > > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > > spin_trylock_irqsave(&n->list_lock) to allocate,
> > > while kfree_nolock() always defers to irq_work.
> > >
> > > Note, kfree_nolock() must be called _only_ for objects allocated
> > > with kmalloc_nolock(). Debug checks (like kmemleak and kfence)
> > > were skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
> > > will miss kmemleak/kfence book keeping and will cause false positives.
> > > large_kmalloc is not supported by either kmalloc_nolock()
> > > or kfree_nolock().
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > ---
> > > include/linux/kasan.h | 13 +-
> > > include/linux/slab.h | 4 +
> > > mm/Kconfig | 1 +
> > > mm/kasan/common.c | 5 +-
> > > mm/slab.h | 6 +
> > > mm/slab_common.c | 3 +
> > > mm/slub.c | 466 +++++++++++++++++++++++++++++++++++++-----
> > > 7 files changed, 445 insertions(+), 53 deletions(-)
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 54444bce218e..7de6da4ee46d 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
> > > int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > > gfp_t gfp, bool new_slab)
> > > {
> > > + bool allow_spin = gfpflags_allow_spinning(gfp);
> > > unsigned int objects = objs_per_slab(s, slab);
> > > unsigned long new_exts;
> > > unsigned long old_exts;
> > > @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > > gfp &= ~OBJCGS_CLEAR_MASK;
> > > /* Prevent recursive extension vector allocation */
> > > gfp |= __GFP_NO_OBJ_EXT;
> > > - vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > > - slab_nid(slab));
> > > + if (unlikely(!allow_spin)) {
> > > + size_t sz = objects * sizeof(struct slabobj_ext);
> > > +
> > > + vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
> >
> > In free_slab_obj_exts(), how do you know slabobj_ext is allocated via
> > kmalloc_nolock() or kcalloc_node()?
>
> Technically kmalloc_nolock()->kfree() isn't as bad as
> kmalloc()->kfree_nolock(), since kmemleak/kfence can ignore
> debug free-ing action without matching alloc side,
> but you're right it's better to avoid it.
>
> > I was going to say "add a new flag to enum objext_flags",
> > but all lower 3 bits of slab->obj_exts pointer are already in use? oh...
> >
> > Maybe need a magic trick to add one more flag,
> > like always align the size with 16?
> >
> > In practice that should not lead to increase in memory consumption
> > anyway because most of the kmalloc-* sizes are already at least
> > 16 bytes aligned.
>
> Yes. That's an option, but I think we can do better.
> OBJEXTS_ALLOC_FAIL doesn't need to consume the bit.
> Here are two patches that fix this issue:
>
> Subject: [PATCH 1/2] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL
>
> Since the combination of valid upper bits in slab->obj_exts with
> OBJEXTS_ALLOC_FAIL bit can never happen,
> use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel
> instead of (1ull << 2) to free up bit 2.
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
This will work, but it would be helpful to add a comment clarifying that
when bit 0 is set with valid upper bits, it indicates
MEMCG_DATA_OBJEXTS, but when the upper bits are all zero, it indicates
OBJEXTS_ALLOC_FAIL.
When someone looks at the code without checking the history, it might not
be obvious at first glance.
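Maybe something like this next to the define (rough wording):

	/*
	 * With CONFIG_MEMCG, bit 0 is MEMCG_DATA_OBJEXTS and the upper bits
	 * hold the slabobj_ext vector pointer. A valid vector pointer always
	 * has non-zero upper bits, so the value "bit 0 set, all upper bits
	 * zero" can never occur naturally and is reused as the
	 * OBJEXTS_ALLOC_FAIL sentinel without consuming an extra flag bit.
	 */
	#define __OBJEXTS_ALLOC_FAIL	MEMCG_DATA_OBJEXTS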
> include/linux/memcontrol.h | 4 +++-
> mm/slub.c | 2 +-
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 785173aa0739..daa78665f850 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -341,17 +341,19 @@ enum page_memcg_data_flags {
> __NR_MEMCG_DATA_FLAGS = (1UL << 2),
> };
>
> +#define __OBJEXTS_ALLOC_FAIL MEMCG_DATA_OBJEXTS
> #define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS
>
> #else /* CONFIG_MEMCG */
>
> +#define __OBJEXTS_ALLOC_FAIL (1UL << 0)
> #define __FIRST_OBJEXT_FLAG (1UL << 0)
>
> #endif /* CONFIG_MEMCG */
>
> enum objext_flags {
> /* slabobj_ext vector failed to allocate */
> - OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
> + OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
> /* the next bit after the last actual flag */
> __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
> };
> diff --git a/mm/slub.c b/mm/slub.c
> index bd4bf2613e7a..16e53bfb310e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1950,7 +1950,7 @@ static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
> * objects with no tag reference. Mark all references in this
> * vector as empty to avoid warnings later on.
> */
> - if (obj_exts & OBJEXTS_ALLOC_FAIL) {
> + if (obj_exts == OBJEXTS_ALLOC_FAIL) {
> unsigned int i;
>
> for (i = 0; i < objects; i++)
> --
> 2.47.3
>
> Subject: [PATCH 2/2] slab: Use kfree_nolock() to free obj_exts
>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
> include/linux/memcontrol.h | 1 +
> mm/slub.c | 7 ++++++-
> 2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index daa78665f850..2e6c33fdd9c5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -354,6 +354,7 @@ enum page_memcg_data_flags {
> enum objext_flags {
> /* slabobj_ext vector failed to allocate */
> OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
/* slabobj_ext vector allocated with kmalloc_nolock() */ ?
> + OBJEXTS_NOSPIN_ALLOC = __FIRST_OBJEXT_FLAG,
> /* the next bit after the last actual flag */
> __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
> };
> diff --git a/mm/slub.c b/mm/slub.c
> index 16e53bfb310e..417d647f1f02 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2009,6 +2009,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> }
>
> new_exts = (unsigned long)vec;
> + if (unlikely(!allow_spin))
> + new_exts |= OBJEXTS_NOSPIN_ALLOC;
> #ifdef CONFIG_MEMCG
> new_exts |= MEMCG_DATA_OBJEXTS;
> #endif
> @@ -2056,7 +2058,10 @@ static inline void free_slab_obj_exts(struct slab *slab)
> * the extension for obj_exts is expected to be NULL.
> */
> mark_objexts_empty(obj_exts);
> - kfree(obj_exts);
> + if (unlikely(READ_ONCE(slab->obj_exts) & OBJEXTS_NOSPIN_ALLOC))
> + kfree_nolock(obj_exts);
> + else
> + kfree(obj_exts);
> slab->obj_exts = 0;
> }
>
> --
> 2.47.3
Otherwise looks fine to me.
> > > + } else {
> > > + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > > + slab_nid(slab));
> > > + }
> > > if (!vec) {
> > > /* Mark vectors which failed to allocate */
> > > if (new_slab)
> > > +static void defer_deactivate_slab(struct slab *slab, void *flush_freelist);
> > > +
> > > /*
> > > * Called only for kmem_cache_debug() caches to allocate from a freshly
> > > * allocated slab. Allocate a single object instead of whole freelist
> > > * and put the slab to the partial (or full) list.
> > > */
> > > -static void *alloc_single_from_new_slab(struct kmem_cache *s,
> > > - struct slab *slab, int orig_size)
> > > +static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > > + int orig_size, gfp_t gfpflags)
> > > {
> > > + bool allow_spin = gfpflags_allow_spinning(gfpflags);
> > > int nid = slab_nid(slab);
> > > struct kmem_cache_node *n = get_node(s, nid);
> > > unsigned long flags;
> > > void *object;
> > >
> > > + if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
> >
> > I think alloc_debug_processing() doesn't have to be called under
> > n->list_lock here because it is a new slab?
> >
> > That means the code can be something like:
> >
> > /* allocate one object from slab */
> > object = slab->freelist;
> > slab->freelist = get_freepointer(s, object);
> > slab->inuse = 1;
> >
> > /* Leak slab if debug checks fails */
> > if (!alloc_debug_processing())
> > return NULL;
> >
> > /* add slab to per-node partial list */
> > if (allow_spin) {
> > spin_lock_irqsave();
> > } else if (!spin_trylock_irqsave()) {
> > slab->frozen = 1;
> > defer_deactivate_slab();
> > }
>
> No. That doesn't work. I implemented it this way
> before reverting back to spin_trylock_irqsave() in the beginning.
> The problem is alloc_debug_processing() will likely succeed
> and undoing it is pretty complex.
> So it's better to "!allow_spin && !spin_trylock_irqsave()"
> before doing expensive and hard to undo alloc_debug_processing().
Gotcha, that makes sense!
> > > + /* Unlucky, discard newly allocated slab */
> > > + slab->frozen = 1;
> > > + defer_deactivate_slab(slab, NULL);
> > > + return NULL;
> > > + }
> > >
> > > object = slab->freelist;
> > > slab->freelist = get_freepointer(s, object);
> > > slab->inuse = 1;
> > >
> > > - if (!alloc_debug_processing(s, slab, object, orig_size))
> > > + if (!alloc_debug_processing(s, slab, object, orig_size)) {
> > > /*
> > > * It's not really expected that this would fail on a
> > > * freshly allocated slab, but a concurrent memory
> > > * corruption in theory could cause that.
> > > + * Leak memory of allocated slab.
> > > */
> > > + if (!allow_spin)
> > > + spin_unlock_irqrestore(&n->list_lock, flags);
> > > return NULL;
> > > + }
> > >
> > > - spin_lock_irqsave(&n->list_lock, flags);
> > > + if (allow_spin)
> > > + spin_lock_irqsave(&n->list_lock, flags);
> > >
> > > if (slab->inuse == slab->objects)
> > > add_full(s, n, slab);
> > > + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
> > > + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
> > > + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
> > > + *
> > > + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
> > > + */
> > > +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
> > > +#define local_lock_cpu_slab(s, flags) \
> > > + local_lock_irqsave(&(s)->cpu_slab->lock, flags)
> > > +#else
> > > +#define local_lock_cpu_slab(s, flags) \
> > > + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
> > > +#endif
> > > +
> > > +#define local_unlock_cpu_slab(s, flags) \
> > > + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
> > > +
> > > #ifdef CONFIG_SLUB_CPU_PARTIAL
> > > static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
> > > {
> > > @@ -3732,9 +3808,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > if (unlikely(!node_match(slab, node))) {
> > > /*
> > > * same as above but node_match() being false already
> > > - * implies node != NUMA_NO_NODE
> > > + * implies node != NUMA_NO_NODE.
> > > + * Reentrant slub cannot take locks necessary to
> > > + * deactivate_slab, hence ignore node preference.
> >
> > Now that we have defer_deactivate_slab(), we need to either update the
> > code or comment?
> >
> > 1. Deactivate slabs when node / pfmemalloc mismatches
> > or 2. Update comments to explain why it's still undesirable
>
> Well, defer_deactivate_slab() is a heavy hammer.
> In !SLUB_TINY it pretty much never happens.
>
> This bit:
>
> retry_load_slab:
>
> local_lock_cpu_slab(s, flags);
> if (unlikely(c->slab)) {
>
> is very rare. I couldn't trigger it at all in my stress test.
>
> But in this hunk the node mismatch is not rare, so ignoring node preference
> for kmalloc_nolock() is a much better trade off.
> I'll add a comment that defer_deactivate_slab() is undesired here.
Wait, does that mean kmalloc_nolock() has a `node` parameter that is always
ignored? Why not use NUMA_NO_NODE then, instead of adding a special case
for !allow_spin?
> > > + * kmalloc_nolock() doesn't allow __GFP_THISNODE.
> > > */
> > > - if (!node_isset(node, slab_nodes)) {
> > > + if (!node_isset(node, slab_nodes) ||
> > > + !allow_spin) {
> > > node = NUMA_NO_NODE;
> > > } else {
> > > stat(s, ALLOC_NODE_MISMATCH);
> >
> > > @@ -4572,6 +4769,98 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> > > discard_slab(s, slab);
> > > }
> > >
> > > +/*
> > > + * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
> > > + * to take sleeping spin_locks from __slab_free() and deactivate_slab().
> > > + * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
> > > + */
> > > +static void free_deferred_objects(struct irq_work *work)
> > > +{
> > > + struct defer_free *df = container_of(work, struct defer_free, work);
> > > + struct llist_head *objs = &df->objects;
> > > + struct llist_head *slabs = &df->slabs;
> > > + struct llist_node *llnode, *pos, *t;
> > > +
> > > + if (llist_empty(objs) && llist_empty(slabs))
> > > + return;
> > > +
> > > + llnode = llist_del_all(objs);
> > > + llist_for_each_safe(pos, t, llnode) {
> > > + struct kmem_cache *s;
> > > + struct slab *slab;
> > > + void *x = pos;
> > > +
> > > + slab = virt_to_slab(x);
> > > + s = slab->slab_cache;
> > > +
> > > + /*
> > > + * We used freepointer in 'x' to link 'x' into df->objects.
> > > + * Clear it to NULL to avoid false positive detection
> > > + * of "Freepointer corruption".
> > > + */
> > > + *(void **)x = NULL;
> > > +
> > > + /* Point 'x' back to the beginning of allocated object */
> > > + x -= s->offset;
> > > + /*
> > > + * memcg, kasan_slab_pre are already done for 'x'.
> > > + * The only thing left is kasan_poison.
> > > + */
> > > + kasan_slab_free(s, x, false, false, true);
> > > + __slab_free(s, slab, x, x, 1, _THIS_IP_);
> > > + }
> > > +
> > > + llnode = llist_del_all(slabs);
> > > + llist_for_each_safe(pos, t, llnode) {
> > > + struct slab *slab = container_of(pos, struct slab, llnode);
> > > +
> > > +#ifdef CONFIG_SLUB_TINY
> > > + discard_slab(slab->slab_cache, slab);
> >
> > ...and with my comment on alloc_single_from_new_slab(),
> > The slab may not be empty anymore?
>
> Exactly.
> That's another problem with your suggestion in alloc_single_from_new_slab().
> That's why I did it as:
> if (!allow_spin && !spin_trylock_irqsave(...)
>
> and I still believe it's the right call.
Yeah, I think it's fine.
> The slab is empty here, so discard_slab() is appropriate.
>
> > > +#else
> > > + deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
> > > +#endif
> > > + }
> > > +}
> > > @@ -4610,10 +4901,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> > > barrier();
> > >
> > > if (unlikely(slab != c->slab)) {
> > > - __slab_free(s, slab, head, tail, cnt, addr);
> > > + if (unlikely(!allow_spin)) {
> > > + /*
> > > + * __slab_free() can locklessly cmpxchg16 into a slab,
> > > + * but then it might need to take spin_lock or local_lock
> > > + * in put_cpu_partial() for further processing.
> > > + * Avoid the complexity and simply add to a deferred list.
> > > + */
> > > + defer_free(s, head);
> > > + } else {
> > > + __slab_free(s, slab, head, tail, cnt, addr);
> > > + }
> > > return;
> > > }
> > >
> > > + if (unlikely(!allow_spin)) {
> > > + if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> > > + local_lock_is_locked(&s->cpu_slab->lock)) {
> > > + defer_free(s, head);
> > > + return;
> > > + }
> > > + cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> > > + kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> > > + }
> >
> > I'm not sure what prevents below from happening
> >
> > 1. slab == c->slab && !allow_spin -> call kasan_slab_free()
> > 2. preempted by something and resume
> > 3. after acquiring local_lock, slab != c->slab, release local_lock, goto redo
> > 4. !allow_spin, so defer_free() will call kasan_slab_free() again later
>
> yes. it's possible and it's ok.
> kasan_slab_free(, no_quarantine == true)
> is poison_slab_object() only and one can do it many times.
>
> > Perhaps kasan_slab_free() should be called before do_slab_free()
> > just like normal free path and do not call kasan_slab_free() in deferred
> > freeing (then you may need to disable KASAN while accessing the deferred
> > list)?
>
> I tried that too and didn't like it.
> After kasan_slab_free() the object cannot be put on the llist easily.
> One needs to do a kasan_reset_tag() dance which uglifies the code a lot.
> Double kasan poison in a rare case is fine. There is no harm.
Okay, but we're now depending on kasan_slab_free() being safe to be
called multiple times on the same object. That would be better
documented in kasan_slab_free().
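e.g. a short note along these lines (rough wording, exact placement up to you):

	/*
	 * Note: when quarantine is skipped this boils down to poisoning the
	 * object, which is idempotent. SLUB's deferred-free path relies on
	 * that and may call it a second time for an object that was already
	 * poisoned in do_slab_free().
	 */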
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v4 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
2025-08-12 15:11 ` Harry Yoo
@ 2025-08-12 17:08 ` Harry Yoo
0 siblings, 0 replies; 11+ messages in thread
From: Harry Yoo @ 2025-08-12 17:08 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko,
Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner
On Wed, Aug 13, 2025 at 12:11:42AM +0900, Harry Yoo wrote:
> On Tue, Aug 05, 2025 at 07:40:31PM -0700, Alexei Starovoitov wrote:
> > On Tue, Jul 22, 2025 at 8:52 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >
> >
> > Sorry for the delay. PTO plus merge window.
>
> No problem! Hope you enjoyed PTO :)
> Sorry for the delay as well..
>
> > > On Thu, Jul 17, 2025 at 07:16:46PM -0700, Alexei Starovoitov wrote:
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > >
> > > > kmalloc_nolock() relies on ability of local_trylock_t to detect
> > > > the situation when per-cpu kmem_cache is locked.
> > >
> > > I think kmalloc_nolock() should be kmalloc_node_nolock() because
> > > it has `node` parameter?
> > >
> > >   # Don't specify NUMA node        # Specify NUMA node
> > >   kmalloc(size, gfp)               kmalloc_node(size, gfp, node)
> > >   kmalloc_nolock(size, gfp)        kmalloc_node_nolock(size, gfp, node)
> > >
> > > ...just like kmalloc() and kmalloc_node()?
> >
> > I think this is a wrong pattern to follow.
> > All this "_node" suffix/rename was done long ago to avoid massive
> > search-and-replace when NUMA was introduced. Now NUMA is mandatory.
> > The user of API should think what numa_id to use and if none
> > they should explicitly say NUMA_NO_NODE.
>
> You're probably right, no strong opinion from me.
>
> > Hiding behind macros is not a good api.
> > I hate the "_node_align" suffixes too. It's a silly convention.
> > Nothing in the kernel follows such an outdated naming scheme.
> > mm folks should follow what the rest of the kernel does
> > instead of following a pattern from 20 years ago.
>
> That's a new scheme from a very recent patch series that did not land
> mainline yet
> https://lore.kernel.org/linux-mm/20250806124034.1724515-1-vitaly.wool@konsulko.se
>
> > > > In !PREEMPT_RT local_(try)lock_irqsave(&s->cpu_slab->lock, flags)
> > > > disables IRQs and marks s->cpu_slab->lock as acquired.
> > > > local_lock_is_locked(&s->cpu_slab->lock) returns true when
> > > > slab is in the middle of manipulating per-cpu cache
> > > > of that specific kmem_cache.
> > > >
> > > > kmalloc_nolock() can be called from any context and can re-enter
> > > > into ___slab_alloc():
> > > > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> NMI -> bpf ->
> > > > kmalloc_nolock() -> ___slab_alloc(cache_B)
> > > > or
> > > > kmalloc() -> ___slab_alloc(cache_A) -> irqsave -> tracepoint/kprobe -> bpf ->
> > > > kmalloc_nolock() -> ___slab_alloc(cache_B)
> > >
> > > > Hence the caller of ___slab_alloc() checks if &s->cpu_slab->lock
> > > > can be acquired without a deadlock before invoking the function.
> > > > If that specific per-cpu kmem_cache is busy the kmalloc_nolock()
> > > > retries in a different kmalloc bucket. The second attempt will
> > > > likely succeed, since this cpu locked different kmem_cache.
> > > >
> > > > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > > > per-cpu rt_spin_lock is locked by current _task_. In this case
> > > > re-entrance into the same kmalloc bucket is unsafe, and
> > > > kmalloc_nolock() tries a different bucket that is most likely is
> > > > not locked by the current task. Though it may be locked by a
> > > > different task it's safe to rt_spin_lock() and sleep on it.
> > > >
> > > > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > > > immediately if called from hard irq or NMI in PREEMPT_RT.
> > >
> > > A question; I was confused for a while thinking "If it can't be called
> > > from NMI and hard irq on PREEMPT_RT, why it can't just spin?"
> >
> > It's not safe due to PI issues in RT.
> > Steven and Sebastian explained it earlier:
> > https://lore.kernel.org/bpf/20241213124411.105d0f33@gandalf.local.home/
>
> Uh, I was totally missing that point. Thanks for pointing it out!
>
> > I don't think I can copy-paste the multi-page explanation into the
> > commit log or into comments.
> > So "not safe in NMI or hard irq on RT" is the summary.
> > Happy to add a few words, but don't know what exactly to say.
> > If Steven/Sebastian can provide a paragraph I can add it.
> >
> > > And I guess it's because even in process context, when kmalloc_nolock()
> > > is called by bpf, it can be called by the task that is holding the local lock
> > > and thus spinning is not allowed. Is that correct?
> > >
> > > > kfree_nolock() defers freeing to irq_work when local_lock_is_locked()
> > > > and (in_nmi() or in PREEMPT_RT).
> > > >
> > > > SLUB_TINY config doesn't use local_lock_is_locked() and relies on
> > > > spin_trylock_irqsave(&n->list_lock) to allocate,
> > > > while kfree_nolock() always defers to irq_work.
> > > >
> > > > Note, kfree_nolock() must be called _only_ for objects allocated
> > > > with kmalloc_nolock(). Debug checks (like kmemleak and kfence) are
> > > > skipped on allocation, hence obj = kmalloc(); kfree_nolock(obj);
> > > > will miss kmemleak/kfence bookkeeping and cause false positives.
> > > > large_kmalloc is not supported by either kmalloc_nolock()
> > > > or kfree_nolock().
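(In other words, the only supported pairing is roughly the following sketch;
struct foo is just a placeholder type:)

  /* Any context; may fail, so the NULL check is mandatory. */
  struct foo *p = kmalloc_nolock(sizeof(*p), __GFP_ZERO, NUMA_NO_NODE);

  if (p) {
          /* ... use p ... */
          kfree_nolock(p);        /* only for kmalloc_nolock()-ed objects */
  }
  /* Not valid: p = kmalloc(...); kfree_nolock(p); skips kmemleak/kfence bookkeeping */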
> > > >
> > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > ---
> > > > include/linux/kasan.h | 13 +-
> > > > include/linux/slab.h | 4 +
> > > > mm/Kconfig | 1 +
> > > > mm/kasan/common.c | 5 +-
> > > > mm/slab.h | 6 +
> > > > mm/slab_common.c | 3 +
> > > > mm/slub.c | 466 +++++++++++++++++++++++++++++++++++++-----
> > > > 7 files changed, 445 insertions(+), 53 deletions(-)
> > > >
> > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > index 54444bce218e..7de6da4ee46d 100644
> > > > --- a/mm/slub.c
> > > > +++ b/mm/slub.c
> > > > @@ -1982,6 +1983,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
> > > > int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > > > gfp_t gfp, bool new_slab)
> > > > {
> > > > + bool allow_spin = gfpflags_allow_spinning(gfp);
> > > > unsigned int objects = objs_per_slab(s, slab);
> > > > unsigned long new_exts;
> > > > unsigned long old_exts;
> > > > @@ -1990,8 +1992,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > > > gfp &= ~OBJCGS_CLEAR_MASK;
> > > > /* Prevent recursive extension vector allocation */
> > > > gfp |= __GFP_NO_OBJ_EXT;
> > > > - vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > > > - slab_nid(slab));
> > > > + if (unlikely(!allow_spin)) {
> > > > + size_t sz = objects * sizeof(struct slabobj_ext);
> > > > +
> > > > + vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
> > >
> > > In free_slab_obj_exts(), how do you know whether slabobj_ext was
> > > allocated via kmalloc_nolock() or kcalloc_node()?
> >
> > Technically kmalloc_nolock()->kfree() isn't as bad as
> > kmalloc()->kfree_nolock(), since kmemleak/kfence can ignore a debug
> > freeing action without a matching alloc side, but you're right that
> > it's better to avoid it.
> >
> > > I was going to say "add a new flag to enum objext_flags",
> > > but all of the lower 3 bits of the slab->obj_exts pointer are already in use? oh...
> > >
> > > Maybe we need a magic trick to add one more flag,
> > > like always aligning the size to 16?
> > >
> > > In practice that should not lead to an increase in memory consumption
> > > anyway, because most of the kmalloc-* sizes are already at least
> > > 16-byte aligned.
> >
> > Yes. That's an option, but I think we can do better.
> > OBJEXTS_ALLOC_FAIL doesn't need to consume the bit.
> > Here are two patches that fix this issue:
> >
> > Subject: [PATCH 1/2] slab: Reuse first bit for OBJEXTS_ALLOC_FAIL
> >
> > Since the combination of valid upper bits in slab->obj_exts with the
> > OBJEXTS_ALLOC_FAIL bit can never happen,
> > use OBJEXTS_ALLOC_FAIL == (1ull << 0) as a magic sentinel
> > instead of (1ull << 2) to free up bit 2.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
>
> This will work, but it would be helpful to add a comment clarifying that
> when bit 0 is set with valid upper bits, it indicates
> MEMCG_DATA_OBJEXTS, but when the upper bits are all zero, it indicates
> OBJEXTS_ALLOC_FAIL.
>
> When someone looks at the code without checking the history, it might not
> be obvious at first glance.
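Something along these lines next to the flag definitions would probably be
enough (the wording below is only a suggestion):

  /*
   * Bit 0 of slab->obj_exts is overloaded: together with a valid
   * (non-zero) vector pointer in the upper bits it means
   * MEMCG_DATA_OBJEXTS; with all upper bits zero the word as a whole
   * equals OBJEXTS_ALLOC_FAIL and records a failed slabobj_ext
   * allocation. The two cases cannot collide because a valid vector
   * pointer is never NULL.
   */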
>
> > include/linux/memcontrol.h | 4 +++-
> > mm/slub.c | 2 +-
> > 2 files changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 785173aa0739..daa78665f850 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -341,17 +341,19 @@ enum page_memcg_data_flags {
> > __NR_MEMCG_DATA_FLAGS = (1UL << 2),
> > };
> >
> > +#define __OBJEXTS_ALLOC_FAIL MEMCG_DATA_OBJEXTS
> > #define __FIRST_OBJEXT_FLAG __NR_MEMCG_DATA_FLAGS
> >
> > #else /* CONFIG_MEMCG */
> >
> > +#define __OBJEXTS_ALLOC_FAIL (1UL << 0)
> > #define __FIRST_OBJEXT_FLAG (1UL << 0)
> >
> > #endif /* CONFIG_MEMCG */
> >
> > enum objext_flags {
> > /* slabobj_ext vector failed to allocate */
> > - OBJEXTS_ALLOC_FAIL = __FIRST_OBJEXT_FLAG,
> > + OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
> > /* the next bit after the last actual flag */
> > __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
> > };
> > diff --git a/mm/slub.c b/mm/slub.c
> > index bd4bf2613e7a..16e53bfb310e 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1950,7 +1950,7 @@ static inline void handle_failed_objexts_alloc(unsigned long obj_exts,
> > * objects with no tag reference. Mark all references in this
> > * vector as empty to avoid warnings later on.
> > */
> > - if (obj_exts & OBJEXTS_ALLOC_FAIL) {
> > + if (obj_exts == OBJEXTS_ALLOC_FAIL) {
> > unsigned int i;
> >
> > for (i = 0; i < objects; i++)
> > --
> > 2.47.3
> >
> > Subject: [PATCH 2/2] slab: Use kfree_nolock() to free obj_exts
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> > include/linux/memcontrol.h | 1 +
> > mm/slub.c | 7 ++++++-
> > 2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index daa78665f850..2e6c33fdd9c5 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -354,6 +354,7 @@ enum page_memcg_data_flags {
> > enum objext_flags {
> > /* slabobj_ext vector failed to allocate */
> > OBJEXTS_ALLOC_FAIL = __OBJEXTS_ALLOC_FAIL,
>
> /* slabobj_ext vector allocated with kmalloc_nolock() */ ?
>
> > + OBJEXTS_NOSPIN_ALLOC = __FIRST_OBJEXT_FLAG,
> > /* the next bit after the last actual flag */
> > __NR_OBJEXTS_FLAGS = (__FIRST_OBJEXT_FLAG << 1),
> > };
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 16e53bfb310e..417d647f1f02 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2009,6 +2009,8 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
> > }
> >
> > new_exts = (unsigned long)vec;
> > + if (unlikely(!allow_spin))
> > + new_exts |= OBJEXTS_NOSPIN_ALLOC;
> > #ifdef CONFIG_MEMCG
> > new_exts |= MEMCG_DATA_OBJEXTS;
> > #endif
> > @@ -2056,7 +2058,10 @@ static inline void free_slab_obj_exts(struct slab *slab)
> > * the extension for obj_exts is expected to be NULL.
> > */
> > mark_objexts_empty(obj_exts);
> > - kfree(obj_exts);
> > + if (unlikely(READ_ONCE(slab->obj_exts) & OBJEXTS_NOSPIN_ALLOC))
> > + kfree_nolock(obj_exts);
> > + else
> > + kfree(obj_exts);
> > slab->obj_exts = 0;
> > }
> >
> > --
> > 2.47.3
>
> Otherwise looks fine to me.
>
> > > > + } else {
> > > > + vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
> > > > + slab_nid(slab));
> > > > + }
> > > > if (!vec) {
> > > > /* Mark vectors which failed to allocate */
> > > > if (new_slab)
> > > > +static void defer_deactivate_slab(struct slab *slab, void *flush_freelist);
> > > > +
> > > > /*
> > > > * Called only for kmem_cache_debug() caches to allocate from a freshly
> > > > * allocated slab. Allocate a single object instead of whole freelist
> > > > * and put the slab to the partial (or full) list.
> > > > */
> > > > -static void *alloc_single_from_new_slab(struct kmem_cache *s,
> > > > - struct slab *slab, int orig_size)
> > > > +static void *alloc_single_from_new_slab(struct kmem_cache *s, struct slab *slab,
> > > > + int orig_size, gfp_t gfpflags)
> > > > {
> > > > + bool allow_spin = gfpflags_allow_spinning(gfpflags);
> > > > int nid = slab_nid(slab);
> > > > struct kmem_cache_node *n = get_node(s, nid);
> > > > unsigned long flags;
> > > > void *object;
> > > >
> > > > + if (!allow_spin && !spin_trylock_irqsave(&n->list_lock, flags)) {
> > >
> > > I think alloc_debug_processing() doesn't have to be called under
> > > n->list_lock here because it is a new slab?
> > >
> > > That means the code can be something like:
> > >
> > > /* allocate one object from slab */
> > > object = slab->freelist;
> > > slab->freelist = get_freepointer(s, object);
> > > slab->inuse = 1;
> > >
> > > /* Leak slab if debug checks fails */
> > > if (!alloc_debug_processing())
> > > return NULL;
> > >
> > > /* add slab to per-node partial list */
> > > if (allow_spin) {
> > > spin_lock_irqsave();
> > > } else if (!spin_trylock_irqsave()) {
> > > slab->frozen = 1;
> > > defer_deactivate_slab();
> > > }
> >
> > No, that doesn't work. I implemented it that way at first, before
> > reverting to spin_trylock_irqsave(). The problem is that
> > alloc_debug_processing() will likely succeed, and undoing it is
> > pretty complex. So it's better to do "!allow_spin && !spin_trylock_irqsave()"
> > before the expensive and hard-to-undo alloc_debug_processing().
>
> Gotcha, that makes sense!
>
> > > > + /* Unlucky, discard newly allocated slab */
> > > > + slab->frozen = 1;
> > > > + defer_deactivate_slab(slab, NULL);
> > > > + return NULL;
> > > > + }
> > > >
> > > > object = slab->freelist;
> > > > slab->freelist = get_freepointer(s, object);
> > > > slab->inuse = 1;
> > > >
> > > > - if (!alloc_debug_processing(s, slab, object, orig_size))
> > > > + if (!alloc_debug_processing(s, slab, object, orig_size)) {
> > > > /*
> > > > * It's not really expected that this would fail on a
> > > > * freshly allocated slab, but a concurrent memory
> > > > * corruption in theory could cause that.
> > > > + * Leak memory of allocated slab.
> > > > */
> > > > + if (!allow_spin)
> > > > + spin_unlock_irqrestore(&n->list_lock, flags);
> > > > return NULL;
> > > > + }
> > > >
> > > > - spin_lock_irqsave(&n->list_lock, flags);
> > > > + if (allow_spin)
> > > > + spin_lock_irqsave(&n->list_lock, flags);
> > > >
> > > > if (slab->inuse == slab->objects)
> > > > add_full(s, n, slab);
> > > > + * #2 is possible in both with a twist that irqsave is replaced with rt_spinlock:
> > > > + * kmalloc() -> ___slab_alloc() -> rt_spin_lock(kmem_cache_A) ->
> > > > + * tracepoint/kprobe -> bpf -> kmalloc_nolock() -> rt_spin_lock(kmem_cache_B)
> > > > + *
> > > > + * local_lock_is_locked() prevents the case kmem_cache_A == kmem_cache_B
> > > > + */
> > > > +#if defined(CONFIG_PREEMPT_RT) || !defined(CONFIG_LOCKDEP)
> > > > +#define local_lock_cpu_slab(s, flags) \
> > > > + local_lock_irqsave(&(s)->cpu_slab->lock, flags)
> > > > +#else
> > > > +#define local_lock_cpu_slab(s, flags) \
> > > > + lockdep_assert(local_trylock_irqsave(&(s)->cpu_slab->lock, flags))
> > > > +#endif
> > > > +
> > > > +#define local_unlock_cpu_slab(s, flags) \
> > > > + local_unlock_irqrestore(&(s)->cpu_slab->lock, flags)
> > > > +
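(As I read it, a re-entrant caller is expected to use these macros roughly
as in the sketch below; the surrounding function and the freelist
manipulation are elided and hypothetical:)

  if (!local_lock_is_locked(&s->cpu_slab->lock)) {
          unsigned long flags;

          /*
           * The lock is known to be free on this cpu/task, so on
           * LOCKDEP builds the trylock inside local_lock_cpu_slab()
           * must succeed and the assert documents that invariant.
           */
          local_lock_cpu_slab(s, flags);
          /* ... manipulate c->slab / c->freelist ... */
          local_unlock_cpu_slab(s, flags);
  } else {
          /* fall back, e.g. to a different kmalloc bucket */
  }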
> > > > #ifdef CONFIG_SLUB_CPU_PARTIAL
> > > > static void __put_partials(struct kmem_cache *s, struct slab *partial_slab)
> > > > {
> > > > @@ -3732,9 +3808,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > > if (unlikely(!node_match(slab, node))) {
> > > > /*
> > > > * same as above but node_match() being false already
> > > > - * implies node != NUMA_NO_NODE
> > > > + * implies node != NUMA_NO_NODE.
> > > > + * Reentrant slub cannot take locks necessary to
> > > > + * deactivate_slab, hence ignore node preference.
> > >
> > > Now that we have defer_deactivate_slab(), we need to either update the
> > > code or comment?
> > >
> > > 1. Deactivate slabs when node / pfmemalloc mismatches
> > > or 2. Update comments to explain why it's still undesirable
> >
> > Well, defer_deactivate_slab() is a heavy hammer.
> > In !SLUB_TINY it pretty much never happens.
> >
> > This bit:
> >
> > retry_load_slab:
> >
> > local_lock_cpu_slab(s, flags);
> > if (unlikely(c->slab)) {
> >
> > is very rare. I couldn't trigger it at all in my stress test.
> >
> > But in this hunk the node mismatch is not rare, so ignoring the node
> > preference for kmalloc_nolock() is a much better trade-off.
But users would have requested that specific node instead of
NUMA_NO_NODE because (at least) they think it's worth it
(e.g., allocating kernel data structures tied to a specific node).
I don't understand why kmalloc()/kmem_cache_alloc() try harder
(by deactivating the cpu slab) to respect the node parameter,
but kmalloc_nolock() does not.
> > I'll add a comment that defer_deactivate_slab() is undesired here.
>
> Wait, does that mean kmalloc_nolock() has a `node` parameter that is
> always ignored?
Ah, it's not _always_ ignored. At least it tries to grab/allocate slabs
from the specified node in the slowpath. But still, users can't reliably
get objects from the specified node? (Even if there's no shortage on
that NUMA node.)
> Why not use NUMA_NO_NODE then instead of adding a special case
> for !allow_spin?
> > > > + * kmalloc_nolock() doesn't allow __GFP_THISNODE.
> > > > */
> > > > - if (!node_isset(node, slab_nodes)) {
> > > > + if (!node_isset(node, slab_nodes) ||
> > > > + !allow_spin) {
> > > > node = NUMA_NO_NODE;
> > > > } else {
> > > > stat(s, ALLOC_NODE_MISMATCH);
> > >
> > > > @@ -4572,6 +4769,98 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
> > > > discard_slab(s, slab);
> > > > }
> > > >
> > > > +/*
> > > > + * In PREEMPT_RT irq_work runs in per-cpu kthread, so it's safe
> > > > + * to take sleeping spin_locks from __slab_free() and deactivate_slab().
> > > > + * In !PREEMPT_RT irq_work will run after local_unlock_irqrestore().
> > > > + */
> > > > +static void free_deferred_objects(struct irq_work *work)
> > > > +{
> > > > + struct defer_free *df = container_of(work, struct defer_free, work);
> > > > + struct llist_head *objs = &df->objects;
> > > > + struct llist_head *slabs = &df->slabs;
> > > > + struct llist_node *llnode, *pos, *t;
> > > > +
> > > > + if (llist_empty(objs) && llist_empty(slabs))
> > > > + return;
> > > > +
> > > > + llnode = llist_del_all(objs);
> > > > + llist_for_each_safe(pos, t, llnode) {
> > > > + struct kmem_cache *s;
> > > > + struct slab *slab;
> > > > + void *x = pos;
> > > > +
> > > > + slab = virt_to_slab(x);
> > > > + s = slab->slab_cache;
> > > > +
> > > > + /*
> > > > + * We used freepointer in 'x' to link 'x' into df->objects.
> > > > + * Clear it to NULL to avoid false positive detection
> > > > + * of "Freepointer corruption".
> > > > + */
> > > > + *(void **)x = NULL;
> > > > +
> > > > + /* Point 'x' back to the beginning of allocated object */
> > > > + x -= s->offset;
> > > > + /*
> > > > + * memcg, kasan_slab_pre are already done for 'x'.
> > > > + * The only thing left is kasan_poison.
> > > > + */
> > > > + kasan_slab_free(s, x, false, false, true);
> > > > + __slab_free(s, slab, x, x, 1, _THIS_IP_);
> > > > + }
> > > > +
> > > > + llnode = llist_del_all(slabs);
> > > > + llist_for_each_safe(pos, t, llnode) {
> > > > + struct slab *slab = container_of(pos, struct slab, llnode);
> > > > +
> > > > +#ifdef CONFIG_SLUB_TINY
> > > > + discard_slab(slab->slab_cache, slab);
> > >
> > > ...and with my comment on alloc_single_from_new_slab(),
> > > the slab may not be empty anymore?
> >
> > Exactly.
> > That's another problem with your suggestion in alloc_single_from_new_slab().
> > That's why I did it as:
> > if (!allow_spin && !spin_trylock_irqsave(...))
> >
> > and I still believe it's the right call.
>
> Yeah, I think it's fine.
>
> > The slab is empty here, so discard_slab() is appropriate.
> >
> > > > +#else
> > > > + deactivate_slab(slab->slab_cache, slab, slab->flush_freelist);
> > > > +#endif
> > > > + }
> > > > +}
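(For reference, a sketch of what the enqueue side presumably looks like,
inferred from the consumer above; the per-cpu variable name is made up and
the real defer_free() in the patch may differ in detail:)

  static void defer_free_sketch(struct kmem_cache *s, void *head)
  {
          struct defer_free *df = this_cpu_ptr(&deferred_free);   /* hypothetical name */

          /*
           * Reuse the object's freepointer slot (at s->offset) as the
           * llist_node, matching the 'x -= s->offset' adjustment done
           * by free_deferred_objects() when draining the list.
           */
          llist_add((struct llist_node *)(head + s->offset), &df->objects);
          irq_work_queue(&df->work);
  }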
> > > > @@ -4610,10 +4901,30 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> > > > barrier();
> > > >
> > > > if (unlikely(slab != c->slab)) {
> > > > - __slab_free(s, slab, head, tail, cnt, addr);
> > > > + if (unlikely(!allow_spin)) {
> > > > + /*
> > > > + * __slab_free() can locklessly cmpxchg16 into a slab,
> > > > + * but then it might need to take spin_lock or local_lock
> > > > + * in put_cpu_partial() for further processing.
> > > > + * Avoid the complexity and simply add to a deferred list.
> > > > + */
> > > > + defer_free(s, head);
> > > > + } else {
> > > > + __slab_free(s, slab, head, tail, cnt, addr);
> > > > + }
> > > > return;
> > > > }
> > > >
> > > > + if (unlikely(!allow_spin)) {
> > > > + if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> > > > + local_lock_is_locked(&s->cpu_slab->lock)) {
> > > > + defer_free(s, head);
> > > > + return;
> > > > + }
> > > > + cnt = 1; /* restore cnt. kfree_nolock() frees one object at a time */
> > > > + kasan_slab_free(s, head, false, false, /* skip quarantine */true);
> > > > + }
> > >
> > > I'm not sure what prevents the following from happening:
> > >
> > > 1. slab == c->slab && !allow_spin -> call kasan_slab_free()
> > > 2. preempted by something and resume
> > > 3. after acquiring local_lock, slab != c->slab, release local_lock, goto redo
> > > 4. !allow_spin, so defer_free() will call kasan_slab_free() again later
> >
> > Yes, it's possible and it's OK.
> > kasan_slab_free() with no_quarantine == true
> > is only poison_slab_object(), and one can do that many times.
> >
> > > Perhaps kasan_slab_free() should be called before do_slab_free(),
> > > just like the normal free path, and kasan_slab_free() should not be
> > > called in deferred freeing (then you may need to disable KASAN while
> > > accessing the deferred list)?
> >
> > I tried that too and didn't like it.
> > After kasan_slab_free() the object cannot be put on the llist easily.
> > One needs to do a kasan_reset_tag() dance, which uglifies the code a lot.
> > A double kasan poison in a rare case is fine; there is no harm.
>
> Okay, but we're now depending on kasan_slab_free() being safe to be
> called multiple times on the same object. That would be better
> documented in kasan_slab_free().
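A short comment near the poisoning path would capture it, something like
(suggested wording only):

  /*
   * When quarantine is skipped this only (re)poisons the object's
   * shadow/tags, so calling it more than once on the same object
   * (e.g. once from do_slab_free() and again from deferred freeing)
   * is harmless.
   */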
>
> --
> Cheers,
> Harry / Hyeonggon
--
Cheers,
Harry / Hyeonggon