linux-mm.kvack.org archive mirror
* [PATCH 0/6] mm: Reentrant kmalloc
@ 2025-05-01  3:27 Alexei Starovoitov
  2025-05-01  3:27 ` [PATCH 1/6] mm: Rename try_alloc_pages() to alloc_pages_nolock() Alexei Starovoitov
                   ` (5 more replies)
  0 siblings, 6 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-01  3:27 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes, willy

From: Alexei Starovoitov <ast@kernel.org>

As promised during LSFMM, here is a reentrant kmalloc.
See the individual patches for details.

Alexei Starovoitov (6):
  mm: Rename try_alloc_pages() to alloc_pages_nolock()
  locking/local_lock: Expose dep_map in local_trylock_t.
  locking/local_lock: Introduce local_lock_is_locked().
  locking/local_lock: Introduce local_lock_irqsave_check()
  mm: Allow GFP_ACCOUNT and GFP_COMP to be used in alloc_pages_nolock().
  slab: Introduce kmalloc_nolock() and kfree_nolock().

 include/linux/gfp.h                 |   8 +-
 include/linux/kasan.h               |  13 +-
 include/linux/local_lock.h          |  15 ++
 include/linux/local_lock_internal.h |  36 +++-
 include/linux/slab.h                |   4 +
 kernel/bpf/syscall.c                |   2 +-
 mm/kasan/common.c                   |   5 +-
 mm/memcontrol.c                     |  60 +++++-
 mm/page_alloc.c                     |  21 ++-
 mm/page_owner.c                     |   2 +-
 mm/slab.h                           |   1 +
 mm/slub.c                           | 280 +++++++++++++++++++++++++---
 12 files changed, 396 insertions(+), 51 deletions(-)

-- 
2.47.1




* [PATCH 1/6] mm: Rename try_alloc_pages() to alloc_pages_nolock()
  2025-05-01  3:27 [PATCH 0/6] mm: Reentrant kmalloc Alexei Starovoitov
@ 2025-05-01  3:27 ` Alexei Starovoitov
  2025-05-06  8:26   ` Vlastimil Babka
  2025-05-01  3:27 ` [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-01  3:27 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes, willy

From: Alexei Starovoitov <ast@kernel.org>

The "try_" prefix is confusing, since it made people believe
that try_alloc_pages() is analogous to spin_trylock() and
NULL return means EAGAIN. This is not the case. If it returns
NULL there is no reason to call it again. It will most likely
return NULL again. Hence rename it to alloc_pages_nolock()
to make it symmetrical to free_pages_nolock() and document that
NULL means ENOMEM.
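
A rough caller-side sketch of the intended contract (illustrative
only, not part of this patch):

	struct page *page = alloc_pages_nolock(NUMA_NO_NODE, 0);

	if (!page)		/* NULL means ENOMEM, not EAGAIN */
		return -ENOMEM;	/* retrying will not help */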

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h  |  8 ++++----
 kernel/bpf/syscall.c |  2 +-
 mm/page_alloc.c      | 15 ++++++++-------
 mm/page_owner.c      |  2 +-
 4 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c9fa6309c903..be160e8d8bcb 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -45,13 +45,13 @@ static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
 	 * !__GFP_DIRECT_RECLAIM -> direct claim is not allowed.
 	 * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
 	 * All GFP_* flags including GFP_NOWAIT use one or both flags.
-	 * try_alloc_pages() is the only API that doesn't specify either flag.
+	 * alloc_pages_nolock() is the only API that doesn't specify either flag.
 	 *
 	 * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
 	 * those are guaranteed to never block on a sleeping lock.
 	 * Here we are enforcing that the allocation doesn't ever spin
 	 * on any locks (i.e. only trylocks). There is no high level
-	 * GFP_$FOO flag for this use in try_alloc_pages() as the
+	 * GFP_$FOO flag for this use in alloc_pages_nolock() as the
 	 * regular page allocator doesn't fully support this
 	 * allocation mode.
 	 */
@@ -354,8 +354,8 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)			alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
-struct page *try_alloc_pages_noprof(int nid, unsigned int order);
-#define try_alloc_pages(...)			alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+struct page *alloc_pages_nolock_noprof(int nid, unsigned int order);
+#define alloc_pages_nolock(...)			alloc_hooks(alloc_pages_nolock_noprof(__VA_ARGS__))
 
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
 #define __get_free_pages(...)			alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9794446bc8c6..d0ddba2a952b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -578,7 +578,7 @@ static bool can_alloc_pages(void)
 static struct page *__bpf_alloc_page(int nid)
 {
 	if (!can_alloc_pages())
-		return try_alloc_pages(nid, 0);
+		return alloc_pages_nolock(nid, 0);
 
 	return alloc_pages_node(nid,
 				GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e9322052df53..1d77a07b0659 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5080,7 +5080,7 @@ EXPORT_SYMBOL(__free_pages);
 
 /*
  * Can be called while holding raw_spin_lock or from IRQ and NMI for any
- * page type (not only those that came from try_alloc_pages)
+ * page type (not only those that came from alloc_pages_nolock)
  */
 void free_pages_nolock(struct page *page, unsigned int order)
 {
@@ -7378,20 +7378,21 @@ static bool __free_unaccepted(struct page *page)
 #endif /* CONFIG_UNACCEPTED_MEMORY */
 
 /**
- * try_alloc_pages - opportunistic reentrant allocation from any context
+ * alloc_pages_nolock - opportunistic reentrant allocation from any context
  * @nid: node to allocate from
  * @order: allocation order size
  *
  * Allocates pages of a given order from the given node. This is safe to
  * call from any context (from atomic, NMI, and also reentrant
- * allocator -> tracepoint -> try_alloc_pages_noprof).
+ * allocator -> tracepoint -> alloc_pages_nolock_noprof).
  * Allocation is best effort and to be expected to fail easily so nobody should
  * rely on the success. Failures are not reported via warn_alloc().
  * See always fail conditions below.
  *
- * Return: allocated page or NULL on failure.
+ * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
+ * It means ENOMEM. There is no reason to call it again and expect !NULL.
  */
-struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
 {
 	/*
 	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
@@ -7400,7 +7401,7 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
 	 *
 	 * These two are the conditions for gfpflags_allow_spinning() being true.
 	 *
-	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+	 * Specify __GFP_NOWARN since failing alloc_pages_nolock() is not a reason
 	 * to warn. Also warn would trigger printk() which is unsafe from
 	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
 	 * since the running context is unknown.
@@ -7410,7 +7411,7 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
 	 * BPF use cases.
 	 *
 	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
-	 * specify it here to highlight that try_alloc_pages()
+	 * specify it here to highlight that alloc_pages_nolock()
 	 * doesn't want to deplete reserves.
 	 */
 	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
diff --git a/mm/page_owner.c b/mm/page_owner.c
index cc4a6916eec6..9928c9ac8c31 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -302,7 +302,7 @@ void __reset_page_owner(struct page *page, unsigned short order)
 	/*
 	 * Do not specify GFP_NOWAIT to make gfpflags_allow_spinning() == false
 	 * to prevent issues in stack_depot_save().
-	 * This is similar to try_alloc_pages() gfp flags, but only used
+	 * This is similar to alloc_pages_nolock() gfp flags, but only used
 	 * to signal stack_depot to avoid spin_locks.
 	 */
 	handle = save_stack(__GFP_NOWARN);
-- 
2.47.1




* [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t.
  2025-05-01  3:27 [PATCH 0/6] mm: Reentrant kmalloc Alexei Starovoitov
  2025-05-01  3:27 ` [PATCH 1/6] mm: Rename try_alloc_pages() to alloc_pages_nolock() Alexei Starovoitov
@ 2025-05-01  3:27 ` Alexei Starovoitov
  2025-05-06 12:56   ` Vlastimil Babka
  2025-05-12 13:26   ` Sebastian Andrzej Siewior
  2025-05-01  3:27 ` [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-01  3:27 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes, willy

From: Alexei Starovoitov <ast@kernel.org>

The lockdep_is_held() macro assumes that "struct lockdep_map dep_map;"
is a top-level field of any lock that participates in LOCKDEP.
Make it so for local_trylock_t.
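
For reference, lockdep_is_held() expands roughly to the following
(simplified from include/linux/lockdep.h, CONFIG_LOCKDEP case), which
is why dep_map must be directly accessible in local_trylock_t:

	#define lockdep_is_held(lock)	lock_is_held(&(lock)->dep_map)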

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/local_lock_internal.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index bf2bf40d7b18..29df45f95843 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -17,7 +17,10 @@ typedef struct {
 
 /* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
 typedef struct {
-	local_lock_t	llock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+	struct task_struct	*owner;
+#endif
 	u8		acquired;
 } local_trylock_t;
 
@@ -31,7 +34,7 @@ typedef struct {
 	.owner = NULL,
 
 # define LOCAL_TRYLOCK_DEBUG_INIT(lockname)		\
-	.llock = { LOCAL_LOCK_DEBUG_INIT((lockname).llock) },
+	LOCAL_LOCK_DEBUG_INIT(lockname)
 
 static inline void local_lock_acquire(local_lock_t *l)
 {
@@ -81,7 +84,7 @@ do {								\
 	local_lock_debug_init(lock);				\
 } while (0)
 
-#define __local_trylock_init(lock) __local_lock_init(lock.llock)
+#define __local_trylock_init(lock) __local_lock_init((local_lock_t *)lock)
 
 #define __spinlock_nested_bh_init(lock)				\
 do {								\
-- 
2.47.1




* [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked().
  2025-05-01  3:27 [PATCH 0/6] mm: Reentrant kmalloc Alexei Starovoitov
  2025-05-01  3:27 ` [PATCH 1/6] mm: Rename try_alloc_pages() to alloc_pages_nolock() Alexei Starovoitov
  2025-05-01  3:27 ` [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
@ 2025-05-01  3:27 ` Alexei Starovoitov
  2025-05-06 12:59   ` Vlastimil Babka
  2025-05-12 14:56   ` Sebastian Andrzej Siewior
  2025-05-01  3:27 ` [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check() Alexei Starovoitov
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-01  3:27 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes, willy

From: Alexei Starovoitov <ast@kernel.org>

Introduce local_lock_is_locked() that returns true when the given
local_lock is locked by the current cpu (in !PREEMPT_RT) or by the
current task (in PREEMPT_RT).
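
A hedged sketch of the intended usage pattern (caller side, with
preemption or migration already disabled; names borrowed from the
slub usage later in the series):

	if (local_lock_is_locked(&s->cpu_slab->lock)) {
		/* this cpu (or this task on PREEMPT_RT) holds it: back off */
		return ERR_PTR(-EBUSY);
	}
	/* safe to go ahead and acquire the lock */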

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/local_lock.h          | 2 ++
 include/linux/local_lock_internal.h | 8 ++++++++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 16a2ee4f8310..092ce89b162a 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -66,6 +66,8 @@
  */
 #define local_trylock(lock)		__local_trylock(lock)
 
+#define local_lock_is_locked(lock)	__local_lock_is_locked(lock)
+
 /**
  * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
  *			   interrupts if acquired
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 29df45f95843..263723a45ecd 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -165,6 +165,9 @@ do {								\
 		!!tl;						\
 	})
 
+/* preemption or migration must be disabled before calling __local_lock_is_locked */
+#define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
+
 #define __local_lock_release(lock)					\
 	do {								\
 		local_trylock_t *tl;					\
@@ -285,4 +288,9 @@ do {								\
 		__local_trylock(lock);				\
 	})
 
+/* migration must be disabled before calling __local_lock_is_locked */
+#include "../../kernel/locking/rtmutex_common.h"
+#define __local_lock_is_locked(__lock)					\
+	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)
+
 #endif /* CONFIG_PREEMPT_RT */
-- 
2.47.1




* [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check()
  2025-05-01  3:27 [PATCH 0/6] mm: Reentrant kmalloc Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2025-05-01  3:27 ` [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
@ 2025-05-01  3:27 ` Alexei Starovoitov
  2025-05-07 13:02   ` Vlastimil Babka
  2025-05-12 14:03   ` Sebastian Andrzej Siewior
  2025-05-01  3:27 ` [PATCH 5/6] mm: Allow GFP_ACCOUNT and GFP_COMP to be used in alloc_pages_nolock() Alexei Starovoitov
  2025-05-01  3:27 ` [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
  5 siblings, 2 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-01  3:27 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes, willy

From: Alexei Starovoitov <ast@kernel.org>

Introduce local_lock_irqsave_check() to check that a local_lock is
not taken recursively.
In !PREEMPT_RT local_lock_irqsave() disables IRQs, but re-entrance
is still possible either from NMI or from a strategically placed
kprobe. The code should call local_lock_is_locked() before
proceeding to acquire a local_lock. Such a local_lock_is_locked()
check might happen much earlier in the call graph, and there could
be a lot of code between local_lock_is_locked() and
local_lock_irqsave_check().

Without CONFIG_DEBUG_LOCK_ALLOC, local_lock_irqsave_check()
is equivalent to local_lock_irqsave().
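
A hedged sketch of the intended check-then-lock pattern (names
borrowed from the slub usage later in the series):

	if (local_lock_is_locked(&s->cpu_slab->lock))
		return ERR_PTR(-EBUSY);	/* re-entrancy detected, bail out */
	...
	/* much later in the call graph */
	local_lock_irqsave_check(&s->cpu_slab->lock, flags);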

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/local_lock.h          | 13 +++++++++++++
 include/linux/local_lock_internal.h | 19 +++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 092ce89b162a..0d6efb0fdd15 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -81,6 +81,19 @@
 #define local_trylock_irqsave(lock, flags)			\
 	__local_trylock_irqsave(lock, flags)
 
+/**
+ * local_lock_irqsave_check - Acquire a per CPU local lock, save and disable
+ *			      interrupts
+ * @lock:	The lock variable
+ * @flags:	Storage for interrupt flags
+ *
+ * This function checks that local_lock is not taken recursively.
+ * In !PREEMPT_RT re-entrance is possible either from NMI or kprobe.
+ * In PREEMPT_RT it checks that current task is not holding it.
+ */
+#define local_lock_irqsave_check(lock, flags)			\
+	__local_lock_irqsave_check(lock, flags)
+
 DEFINE_GUARD(local_lock, local_lock_t __percpu*,
 	     local_lock(_T),
 	     local_unlock(_T))
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 263723a45ecd..7c4cc002bc68 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -168,6 +168,15 @@ do {								\
 /* preemption or migration must be disabled before calling __local_lock_is_locked */
 #define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
 
+#define __local_lock_irqsave_check(lock, flags)					\
+	do {									\
+		if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC) &&			\
+		    (!__local_lock_is_locked(lock) || in_nmi()))		\
+			WARN_ON_ONCE(!__local_trylock_irqsave(lock, flags));	\
+		else								\
+			__local_lock_irqsave(lock, flags);			\
+	} while (0)
+
 #define __local_lock_release(lock)					\
 	do {								\
 		local_trylock_t *tl;					\
@@ -293,4 +302,14 @@ do {								\
 #define __local_lock_is_locked(__lock)					\
 	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)
 
+#define __local_lock_irqsave_check(lock, flags)				\
+	do {								\
+		typecheck(unsigned long, flags);			\
+		flags = 0;						\
+		migrate_disable();					\
+		if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC))		\
+			WARN_ON_ONCE(__local_lock_is_locked(lock));	\
+		spin_lock(this_cpu_ptr((lock)));			\
+	} while (0)
+
 #endif /* CONFIG_PREEMPT_RT */
-- 
2.47.1




* [PATCH 5/6] mm: Allow GFP_ACCOUNT and GFP_COMP to be used in alloc_pages_nolock().
  2025-05-01  3:27 [PATCH 0/6] mm: Reentrant kmalloc Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2025-05-01  3:27 ` [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check() Alexei Starovoitov
@ 2025-05-01  3:27 ` Alexei Starovoitov
  2025-05-06  8:55   ` Vlastimil Babka
  2025-05-01  3:27 ` [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
  5 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-01  3:27 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes, willy

From: Alexei Starovoitov <ast@kernel.org>

Allow the __GFP_ACCOUNT and __GFP_COMP flags to be specified when
calling alloc_pages_nolock(), since the upcoming reentrant
alloc_slab_page() needs to allocate __GFP_COMP pages, while the BPF
infra needs __GFP_ACCOUNT.
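
With the new signature the callers pass only the subset they need,
roughly (sketch based on the hunks below):

	/* BPF infra: charge the page to memcg */
	page = alloc_pages_nolock(__GFP_ACCOUNT, nid, 0);

	/* reentrant slab allocation: compound page, no memcg charge here */
	page = alloc_pages_nolock(__GFP_COMP, node, order);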

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h  | 2 +-
 kernel/bpf/syscall.c | 2 +-
 mm/page_alloc.c      | 8 +++++---
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index be160e8d8bcb..9afbe5b3aef6 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -354,7 +354,7 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)			alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
-struct page *alloc_pages_nolock_noprof(int nid, unsigned int order);
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order);
 #define alloc_pages_nolock(...)			alloc_hooks(alloc_pages_nolock_noprof(__VA_ARGS__))
 
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d0ddba2a952b..83af8fa9db3f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -578,7 +578,7 @@ static bool can_alloc_pages(void)
 static struct page *__bpf_alloc_page(int nid)
 {
 	if (!can_alloc_pages())
-		return alloc_pages_nolock(nid, 0);
+		return alloc_pages_nolock(__GFP_ACCOUNT, nid, 0);
 
 	return alloc_pages_node(nid,
 				GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d77a07b0659..303df205ca7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7379,6 +7379,7 @@ static bool __free_unaccepted(struct page *page)
 
 /**
  * alloc_pages_nolock - opportunistic reentrant allocation from any context
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_COMP allowed.
  * @nid: node to allocate from
  * @order: allocation order size
  *
@@ -7392,7 +7393,7 @@ static bool __free_unaccepted(struct page *page)
  * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
  * It means ENOMEM. There is no reason to call it again and expect !NULL.
  */
-struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
+struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
 {
 	/*
 	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
@@ -7415,11 +7416,12 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
 	 * doesn't want to deplete reserves.
 	 */
 	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
-			| __GFP_ACCOUNT;
+			| gfp_flags;
 	unsigned int alloc_flags = ALLOC_TRYLOCK;
 	struct alloc_context ac = { };
 	struct page *page;
 
+	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_COMP));
 	/*
 	 * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
 	 * unsafe in NMI. If spin_trylock() is called from hard IRQ the current
@@ -7462,7 +7464,7 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
 	if (page)
 		set_page_refcounted(page);
 
-	if (memcg_kmem_online() && page &&
+	if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) &&
 	    unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
 		free_pages_nolock(page, order);
 		page = NULL;
-- 
2.47.1




* [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-01  3:27 [PATCH 0/6] mm: Reentrant kmalloc Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2025-05-01  3:27 ` [PATCH 5/6] mm: Allow GFP_ACCOUNT and GFP_COMP to be used in alloc_pages_nolock() Alexei Starovoitov
@ 2025-05-01  3:27 ` Alexei Starovoitov
  2025-05-05 18:46   ` Shakeel Butt
                     ` (2 more replies)
  5 siblings, 3 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-01  3:27 UTC (permalink / raw)
  To: bpf, linux-mm
  Cc: vbabka, harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes, willy

From: Alexei Starovoitov <ast@kernel.org>

kmalloc_nolock() relies on the ability of local_lock to detect the
situation when it is already locked.
In !PREEMPT_RT local_lock_is_locked() is true only when an NMI
happened in the irq-saved region that protects _that specific_
per-cpu kmem_cache_cpu. In that case retry the operation in a
different kmalloc bucket. The second attempt will likely succeed,
since this cpu has locked a different kmem_cache_cpu.
When local_lock_is_locked() sees a locked memcg_stock.stock_lock,
fall back to atomic operations.

Similarly, in PREEMPT_RT local_lock_is_locked() returns true when the
per-cpu rt_spin_lock is locked by the current task. In this case
re-entrance into the same kmalloc bucket is unsafe, and
kmalloc_nolock() tries a different bucket that is most likely not
locked by the current task. Though it may be locked by a different
task, it is safe to rt_spin_lock() on it.

Similar to alloc_pages_nolock(), kmalloc_nolock() returns NULL
immediately if called from hard irq or NMI in PREEMPT_RT.
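
A minimal caller-side sketch (illustrative only; the size, flags and
node are examples, not requirements):

	void *obj = kmalloc_nolock(64, __GFP_ZERO, NUMA_NO_NODE);

	if (!obj)
		return -ENOMEM;	/* NULL means ENOMEM; retrying will not help */
	...
	kfree_nolock(obj);	/* only for objects from kmalloc_nolock() */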

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/kasan.h |  13 +-
 include/linux/slab.h  |   4 +
 mm/kasan/common.c     |   5 +-
 mm/memcontrol.c       |  60 ++++++++-
 mm/slab.h             |   1 +
 mm/slub.c             | 280 ++++++++++++++++++++++++++++++++++++++----
 6 files changed, 330 insertions(+), 33 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 890011071f2b..acdc8cb0152e 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -200,7 +200,7 @@ static __always_inline bool kasan_slab_pre_free(struct kmem_cache *s,
 }
 
 bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
-		       bool still_accessible);
+		       bool still_accessible, bool no_quarantine);
 /**
  * kasan_slab_free - Poison, initialize, and quarantine a slab object.
  * @object: Object to be freed.
@@ -226,11 +226,13 @@ bool __kasan_slab_free(struct kmem_cache *s, void *object, bool init,
  * @Return true if KASAN took ownership of the object; false otherwise.
  */
 static __always_inline bool kasan_slab_free(struct kmem_cache *s,
-						void *object, bool init,
-						bool still_accessible)
+					    void *object, bool init,
+					    bool still_accessible,
+					    bool no_quarantine)
 {
 	if (kasan_enabled())
-		return __kasan_slab_free(s, object, init, still_accessible);
+		return __kasan_slab_free(s, object, init, still_accessible,
+					 no_quarantine);
 	return false;
 }
 
@@ -427,7 +429,8 @@ static inline bool kasan_slab_pre_free(struct kmem_cache *s, void *object)
 }
 
 static inline bool kasan_slab_free(struct kmem_cache *s, void *object,
-				   bool init, bool still_accessible)
+				   bool init, bool still_accessible,
+				   bool no_quarantine)
 {
 	return false;
 }
diff --git a/include/linux/slab.h b/include/linux/slab.h
index d5a8ab98035c..743f6d196d57 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -470,6 +470,7 @@ void * __must_check krealloc_noprof(const void *objp, size_t new_size,
 #define krealloc(...)				alloc_hooks(krealloc_noprof(__VA_ARGS__))
 
 void kfree(const void *objp);
+void kfree_nolock(const void *objp);
 void kfree_sensitive(const void *objp);
 size_t __ksize(const void *objp);
 
@@ -910,6 +911,9 @@ static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t f
 }
 #define kmalloc(...)				alloc_hooks(kmalloc_noprof(__VA_ARGS__))
 
+void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node);
+#define kmalloc_nolock(...)			alloc_hooks(kmalloc_nolock_noprof(__VA_ARGS__))
+
 #define kmem_buckets_alloc(_b, _size, _flags)	\
 	alloc_hooks(__kmalloc_node_noprof(PASS_BUCKET_PARAMS(_size, _b), _flags, NUMA_NO_NODE))
 
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index ed4873e18c75..67042e07baee 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -256,13 +256,16 @@ bool __kasan_slab_pre_free(struct kmem_cache *cache, void *object,
 }
 
 bool __kasan_slab_free(struct kmem_cache *cache, void *object, bool init,
-		       bool still_accessible)
+		       bool still_accessible, bool no_quarantine)
 {
 	if (!kasan_arch_is_ready() || is_kfence_address(object))
 		return false;
 
 	poison_slab_object(cache, object, init, still_accessible);
 
+	if (no_quarantine)
+		return false;
+
 	/*
 	 * If the object is put into quarantine, do not let slab put the object
 	 * onto the freelist for now. The object's metadata is kept until the
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b877287aeb11..1fe23b2fe03d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -595,7 +595,13 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	if (!val)
 		return;
 
-	cgroup_rstat_updated(memcg->css.cgroup, cpu);
+	/*
+	 * If called from NMI via kmalloc_nolock -> memcg_slab_post_alloc_hook
+	 * -> obj_cgroup_charge -> mod_memcg_state,
+	 * then delay the update.
+	 */
+	if (!in_nmi())
+		cgroup_rstat_updated(memcg->css.cgroup, cpu);
 	statc = this_cpu_ptr(memcg->vmstats_percpu);
 	for (; statc; statc = statc->parent) {
 		/*
@@ -2895,7 +2901,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 	unsigned long flags;
 	bool ret = false;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
@@ -2995,7 +3001,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 	unsigned long flags;
 	unsigned int nr_pages = 0;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
@@ -3088,6 +3094,27 @@ static inline size_t obj_full_size(struct kmem_cache *s)
 	return s->size + sizeof(struct obj_cgroup *);
 }
 
+/*
+ * Try subtract from nr_charged_bytes without making it negative
+ */
+static bool obj_cgroup_charge_atomic(struct obj_cgroup *objcg, gfp_t flags, size_t sz)
+{
+	size_t old = atomic_read(&objcg->nr_charged_bytes);
+	u32 nr_pages = sz >> PAGE_SHIFT;
+	u32 nr_bytes = sz & (PAGE_SIZE - 1);
+
+	if ((ssize_t)(old - sz) >= 0 &&
+	    atomic_cmpxchg(&objcg->nr_charged_bytes, old, old - sz) == old)
+		return true;
+
+	nr_pages++;
+	if (obj_cgroup_charge_pages(objcg, flags, nr_pages))
+		return false;
+
+	atomic_add(PAGE_SIZE - nr_bytes, &objcg->nr_charged_bytes);
+	return true;
+}
+
 bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 				  gfp_t flags, size_t size, void **p)
 {
@@ -3128,6 +3155,21 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 			return false;
 	}
 
+	if (!gfpflags_allow_spinning(flags)) {
+		if (local_lock_is_locked(&memcg_stock.stock_lock)) {
+			/*
+			 * Cannot use
+			 * lockdep_assert_held(this_cpu_ptr(&memcg_stock.stock_lock));
+			 * since lockdep might not have been informed yet
+			 * of lock acquisition.
+			 */
+			return obj_cgroup_charge_atomic(objcg, flags,
+							size * obj_full_size(s));
+		} else {
+			lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
+		}
+	}
+
 	for (i = 0; i < size; i++) {
 		slab = virt_to_slab(p[i]);
 
@@ -3162,8 +3204,12 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 			    void **p, int objects, struct slabobj_ext *obj_exts)
 {
+	bool lock_held = local_lock_is_locked(&memcg_stock.stock_lock);
 	size_t obj_size = obj_full_size(s);
 
+	if (likely(!lock_held))
+		lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
+
 	for (int i = 0; i < objects; i++) {
 		struct obj_cgroup *objcg;
 		unsigned int off;
@@ -3174,8 +3220,12 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
 			continue;
 
 		obj_exts[off].objcg = NULL;
-		refill_obj_stock(objcg, obj_size, true, -obj_size,
-				 slab_pgdat(slab), cache_vmstat_idx(s));
+		if (unlikely(lock_held)) {
+			atomic_add(obj_size, &objcg->nr_charged_bytes);
+		} else {
+			refill_obj_stock(objcg, obj_size, true, -obj_size,
+					 slab_pgdat(slab), cache_vmstat_idx(s));
+		}
 		obj_cgroup_put(objcg);
 	}
 }
diff --git a/mm/slab.h b/mm/slab.h
index 05a21dc796e0..1688749d2995 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -273,6 +273,7 @@ struct kmem_cache {
 	unsigned int cpu_partial_slabs;
 #endif
 	struct kmem_cache_order_objects oo;
+	struct llist_head defer_free_objects;
 
 	/* Allocation and freeing of slabs */
 	struct kmem_cache_order_objects min;
diff --git a/mm/slub.c b/mm/slub.c
index dc9e729e1d26..307ea0135b92 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -392,7 +392,7 @@ struct kmem_cache_cpu {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct slab *partial;	/* Partially allocated slabs */
 #endif
-	local_lock_t lock;	/* Protects the fields above */
+	local_trylock_t lock;	/* Protects the fields above */
 #ifdef CONFIG_SLUB_STATS
 	unsigned int stat[NR_SLUB_STAT_ITEMS];
 #endif
@@ -1981,6 +1981,7 @@ static inline void init_slab_obj_exts(struct slab *slab)
 int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 		        gfp_t gfp, bool new_slab)
 {
+	bool allow_spin = gfpflags_allow_spinning(gfp);
 	unsigned int objects = objs_per_slab(s, slab);
 	unsigned long new_exts;
 	unsigned long old_exts;
@@ -1989,8 +1990,14 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 	gfp &= ~OBJCGS_CLEAR_MASK;
 	/* Prevent recursive extension vector allocation */
 	gfp |= __GFP_NO_OBJ_EXT;
-	vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
-			   slab_nid(slab));
+	if (unlikely(!allow_spin)) {
+		size_t sz = objects * sizeof(struct slabobj_ext);
+
+		vec = kmalloc_nolock(sz, __GFP_ZERO, slab_nid(slab));
+	} else {
+		vec = kcalloc_node(objects, sizeof(struct slabobj_ext), gfp,
+				   slab_nid(slab));
+	}
 	if (!vec) {
 		/* Mark vectors which failed to allocate */
 		if (new_slab)
@@ -2020,7 +2027,10 @@ int alloc_slab_obj_exts(struct slab *slab, struct kmem_cache *s,
 		 * objcg vector should be reused.
 		 */
 		mark_objexts_empty(vec);
-		kfree(vec);
+		if (unlikely(!allow_spin))
+			kfree_nolock(vec);
+		else
+			kfree(vec);
 		return 0;
 	}
 
@@ -2395,7 +2405,7 @@ bool slab_free_hook(struct kmem_cache *s, void *x, bool init,
 
 	}
 	/* KASAN might put x into memory quarantine, delaying its reuse. */
-	return !kasan_slab_free(s, x, init, still_accessible);
+	return !kasan_slab_free(s, x, init, still_accessible, false);
 }
 
 static __fastpath_inline
@@ -2458,13 +2468,21 @@ static void *setup_object(struct kmem_cache *s, void *object)
  * Slab allocation and freeing
  */
 static inline struct slab *alloc_slab_page(gfp_t flags, int node,
-		struct kmem_cache_order_objects oo)
+					   struct kmem_cache_order_objects oo,
+					   bool allow_spin)
 {
 	struct folio *folio;
 	struct slab *slab;
 	unsigned int order = oo_order(oo);
 
-	if (node == NUMA_NO_NODE)
+	if (unlikely(!allow_spin)) {
+		struct page *p = alloc_pages_nolock(__GFP_COMP, node, order);
+
+		if (p)
+			/* Make the page frozen. Drop refcnt to zero. */
+			put_page_testzero(p);
+		folio = (struct folio *)p;
+	} else if (node == NUMA_NO_NODE)
 		folio = (struct folio *)alloc_frozen_pages(flags, order);
 	else
 		folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
@@ -2610,6 +2628,7 @@ static __always_inline void unaccount_slab(struct slab *slab, int order,
 
 static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
+	bool allow_spin = gfpflags_allow_spinning(flags);
 	struct slab *slab;
 	struct kmem_cache_order_objects oo = s->oo;
 	gfp_t alloc_gfp;
@@ -2629,7 +2648,11 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
 		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_RECLAIM;
 
-	slab = alloc_slab_page(alloc_gfp, node, oo);
+	/*
+	 * __GFP_RECLAIM could be cleared on the first allocation attempt,
+	 * so pass allow_spin flag directly.
+	 */
+	slab = alloc_slab_page(alloc_gfp, node, oo, allow_spin);
 	if (unlikely(!slab)) {
 		oo = s->min;
 		alloc_gfp = flags;
@@ -2637,7 +2660,7 @@ static struct slab *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 		 * Allocation may have failed due to fragmentation.
 		 * Try a lower order alloc if possible
 		 */
-		slab = alloc_slab_page(alloc_gfp, node, oo);
+		slab = alloc_slab_page(alloc_gfp, node, oo, allow_spin);
 		if (unlikely(!slab))
 			return NULL;
 		stat(s, ORDER_FALLBACK);
@@ -2877,7 +2900,10 @@ static struct slab *get_partial_node(struct kmem_cache *s,
 	if (!n || !n->nr_partial)
 		return NULL;
 
-	spin_lock_irqsave(&n->list_lock, flags);
+	if (gfpflags_allow_spinning(pc->flags))
+		spin_lock_irqsave(&n->list_lock, flags);
+	else if (!spin_trylock_irqsave(&n->list_lock, flags))
+		return NULL;
 	list_for_each_entry_safe(slab, slab2, &n->partial, slab_list) {
 		if (!pfmemalloc_match(slab, pc->flags))
 			continue;
@@ -3068,7 +3094,7 @@ static void init_kmem_cache_cpus(struct kmem_cache *s)
 
 	for_each_possible_cpu(cpu) {
 		c = per_cpu_ptr(s->cpu_slab, cpu);
-		local_lock_init(&c->lock);
+		local_trylock_init(&c->lock);
 		c->tid = init_tid(cpu);
 	}
 }
@@ -3243,7 +3269,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct slab *slab, int drain)
 	unsigned long flags;
 	int slabs = 0;
 
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	local_lock_irqsave_check(&s->cpu_slab->lock, flags);
 
 	oldslab = this_cpu_read(s->cpu_slab->partial);
 
@@ -3746,7 +3772,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto deactivate_slab;
 
 	/* must check again c->slab in case we got preempted and it changed */
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	local_lock_irqsave_check(&s->cpu_slab->lock, flags);
 	if (unlikely(slab != c->slab)) {
 		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		goto reread_slab;
@@ -3784,7 +3810,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 deactivate_slab:
 
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	local_lock_irqsave_check(&s->cpu_slab->lock, flags);
 	if (slab != c->slab) {
 		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		goto reread_slab;
@@ -3800,7 +3826,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	while (slub_percpu_partial(c)) {
-		local_lock_irqsave(&s->cpu_slab->lock, flags);
+		local_lock_irqsave_check(&s->cpu_slab->lock, flags);
 		if (unlikely(c->slab)) {
 			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 			goto reread_slab;
@@ -3845,8 +3871,13 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	 *    allocating new page from other nodes
 	 */
 	if (unlikely(node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE)
-		     && try_thisnode))
-		pc.flags = GFP_NOWAIT | __GFP_THISNODE;
+		     && try_thisnode)) {
+		if (unlikely(!gfpflags_allow_spinning(gfpflags)))
+			/* Do not upgrade gfp to NOWAIT from more restrictive mode */
+			pc.flags = gfpflags | __GFP_THISNODE;
+		else
+			pc.flags = GFP_NOWAIT | __GFP_THISNODE;
+	}
 
 	pc.orig_size = orig_size;
 	slab = get_partial(s, node, &pc);
@@ -3918,7 +3949,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 
 retry_load_slab:
 
-	local_lock_irqsave(&s->cpu_slab->lock, flags);
+	local_lock_irqsave_check(&s->cpu_slab->lock, flags);
 	if (unlikely(c->slab)) {
 		void *flush_freelist = c->freelist;
 		struct slab *flush_slab = c->slab;
@@ -3958,8 +3989,28 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 	 */
 	c = slub_get_cpu_ptr(s->cpu_slab);
 #endif
+	if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
+		struct slab *slab;
+
+		slab = c->slab;
+		if (slab && !node_match(slab, node))
+			/* In trylock mode numa node is a hint */
+			node = NUMA_NO_NODE;
+
+		if (!local_lock_is_locked(&s->cpu_slab->lock)) {
+			lockdep_assert_not_held(this_cpu_ptr(&s->cpu_slab->lock));
+		} else {
+			/*
+			 * EBUSY is an internal signal to kmalloc_nolock() to
+			 * retry a different bucket. It's not propagated further.
+			 */
+			p = ERR_PTR(-EBUSY);
+			goto out;
+		}
+	}
 
 	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
+out:
 #ifdef CONFIG_PREEMPT_COUNT
 	slub_put_cpu_ptr(s->cpu_slab);
 #endif
@@ -4162,8 +4213,9 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
 		if (p[i] && init && (!kasan_init ||
 				     !kasan_has_integrated_init()))
 			memset(p[i], 0, zero_size);
-		kmemleak_alloc_recursive(p[i], s->object_size, 1,
-					 s->flags, init_flags);
+		if (gfpflags_allow_spinning(flags))
+			kmemleak_alloc_recursive(p[i], s->object_size, 1,
+						 s->flags, init_flags);
 		kmsan_slab_alloc(s, p[i], init_flags);
 		alloc_tagging_slab_alloc_hook(s, p[i], flags);
 	}
@@ -4354,6 +4406,88 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
 }
 EXPORT_SYMBOL(__kmalloc_noprof);
 
+/**
+ * kmalloc_nolock - Allocate an object of given size from any context.
+ * @size: size to allocate
+ * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
+ * @node: node number of the target node.
+ *
+ * Return: pointer to the new object or NULL in case of error.
+ * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
+ * There is no reason to call it again and expect !NULL.
+ */
+void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
+{
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
+	struct kmem_cache *s;
+	bool can_retry = true;
+	void *ret = ERR_PTR(-EBUSY);
+
+	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
+
+	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
+		return NULL;
+	if (unlikely(!size))
+		return ZERO_SIZE_PTR;
+
+	if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
+		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
+		return NULL;
+retry:
+	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
+
+	if (!(s->flags & __CMPXCHG_DOUBLE))
+		/*
+		 * kmalloc_nolock() is not supported on architectures that
+		 * don't implement cmpxchg16b.
+		 */
+		return NULL;
+
+	/*
+	 * Do not call slab_alloc_node(), since trylock mode isn't
+	 * compatible with slab_pre_alloc_hook/should_failslab and
+	 * kfence_alloc.
+	 *
+	 * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
+	 * in irq saved region. It assumes that the same cpu will not
+	 * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
+	 * Therefore use in_nmi() to check whether particular bucket is in
+	 * irq protected section.
+	 */
+	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
+		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
+
+	if (PTR_ERR(ret) == -EBUSY) {
+		if (can_retry) {
+			/* pick the next kmalloc bucket */
+			size = s->object_size + 1;
+			/*
+			 * Another alternative is to
+			 * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
+			 * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
+			 * to retry from bucket of the same size.
+			 */
+			can_retry = false;
+			goto retry;
+		}
+		ret = NULL;
+	}
+
+	maybe_wipe_obj_freeptr(s, ret);
+	/*
+	 * Make sure memcg_stock.stock_lock doesn't change cpu
+	 * when memcg layers access it.
+	 */
+	slub_get_cpu_ptr(s->cpu_slab);
+	slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
+			     slab_want_init_on_alloc(alloc_gfp, s), size);
+	slub_put_cpu_ptr(s->cpu_slab);
+
+	ret = kasan_kmalloc(s, ret, size, alloc_gfp);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kmalloc_nolock_noprof);
+
 void *__kmalloc_node_track_caller_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags,
 					 int node, unsigned long caller)
 {
@@ -4511,7 +4645,6 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 		"__slab_free"));
 
 	if (likely(!n)) {
-
 		if (likely(was_frozen)) {
 			/*
 			 * The list lock was not taken therefore no list
@@ -4568,6 +4701,30 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
 }
 
 #ifndef CONFIG_SLUB_TINY
+static void free_deferred_objects(struct llist_head *llhead, unsigned long addr)
+{
+	struct llist_node *llnode, *pos, *t;
+
+	if (likely(llist_empty(llhead)))
+		return;
+
+	llnode = llist_del_all(llhead);
+	llist_for_each_safe(pos, t, llnode) {
+		struct kmem_cache *s;
+		struct slab *slab;
+		void *x = pos;
+
+		slab = virt_to_slab(x);
+		s = slab->slab_cache;
+
+		/*
+		 * memcg, kasan_slab_pre are already done for 'x'.
+		 * The only thing left is kasan_poison.
+		 */
+		kasan_slab_free(s, x, false, false, true);
+		__slab_free(s, slab, x, x, 1, addr);
+	}
+}
 /*
  * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
  * can perform fastpath freeing without additional function calls.
@@ -4605,10 +4762,36 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
 	barrier();
 
 	if (unlikely(slab != c->slab)) {
-		__slab_free(s, slab, head, tail, cnt, addr);
+		/* cnt == 0 signals that it's called from kfree_nolock() */
+		if (unlikely(!cnt)) {
+			/*
+			 * Use llist in cache_node ?
+			 * struct kmem_cache_node *n = get_node(s, slab_nid(slab));
+			 */
+			/*
+			 * __slab_free() can locklessly cmpxchg16 into a slab,
+			 * but then it might need to take spin_lock or local_lock
+			 * in put_cpu_partial() for further processing.
+			 * Avoid the complexity and simply add to a deferred list.
+			 */
+			llist_add(head, &s->defer_free_objects);
+		} else {
+			free_deferred_objects(&s->defer_free_objects, addr);
+			__slab_free(s, slab, head, tail, cnt, addr);
+		}
 		return;
 	}
 
+	if (unlikely(!cnt)) {
+		if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
+		    local_lock_is_locked(&s->cpu_slab->lock)) {
+			llist_add(head, &s->defer_free_objects);
+			return;
+		}
+		cnt = 1;
+		kasan_slab_free(s, head, false, false, true);
+	}
+
 	if (USE_LOCKLESS_FAST_PATH()) {
 		freelist = READ_ONCE(c->freelist);
 
@@ -4856,6 +5039,58 @@ void kfree(const void *object)
 }
 EXPORT_SYMBOL(kfree);
 
+/*
+ * Can be called while holding raw_spin_lock or from IRQ and NMI,
+ * but only for objects allocated by kmalloc_nolock(),
+ * since some debug checks (like kmemleak and kfence) were
+ * skipped on allocation. large_kmalloc is not supported either.
+ */
+void kfree_nolock(const void *object)
+{
+	struct folio *folio;
+	struct slab *slab;
+	struct kmem_cache *s;
+	void *x = (void *)object;
+
+	if (unlikely(ZERO_OR_NULL_PTR(object)))
+		return;
+
+	folio = virt_to_folio(object);
+	if (unlikely(!folio_test_slab(folio))) {
+		WARN(1, "Buggy usage of kfree_nolock");
+		return;
+	}
+
+	slab = folio_slab(folio);
+	s = slab->slab_cache;
+
+	memcg_slab_free_hook(s, slab, &x, 1);
+	alloc_tagging_slab_free_hook(s, slab, &x, 1);
+	/*
+	 * Unlike slab_free() do NOT call the following:
+	 * kmemleak_free_recursive(x, s->flags);
+	 * debug_check_no_locks_freed(x, s->object_size);
+	 * debug_check_no_obj_freed(x, s->object_size);
+	 * __kcsan_check_access(x, s->object_size, ..);
+	 * kfence_free(x);
+	 * since they take spinlocks.
+	 */
+	kmsan_slab_free(s, x);
+	/*
+	 * If KASAN finds a kernel bug it will do kasan_report_invalid_free()
+	 * which will call raw_spin_lock_irqsave() which is technically
+	 * unsafe from NMI, but take chance and report kernel bug.
+	 * The sequence of
+	 * kasan_report_invalid_free() -> raw_spin_lock_irqsave() -> NMI
+	 *  -> kfree_nolock() -> kasan_report_invalid_free() on the same CPU
+	 * is double buggy and deserves to deadlock.
+	 */
+	if (kasan_slab_pre_free(s, x))
+		return;
+	do_slab_free(s, slab, x, x, 0, _RET_IP_);
+}
+EXPORT_SYMBOL_GPL(kfree_nolock);
+
 static __always_inline __realloc_size(2) void *
 __do_krealloc(const void *p, size_t new_size, gfp_t flags)
 {
@@ -6423,6 +6658,7 @@ int do_kmem_cache_create(struct kmem_cache *s, const char *name,
 	s->useroffset = args->useroffset;
 	s->usersize = args->usersize;
 #endif
+	init_llist_head(&s->defer_free_objects);
 
 	if (!calculate_sizes(args, s))
 		goto out;
-- 
2.47.1




* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-01  3:27 ` [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
@ 2025-05-05 18:46   ` Shakeel Butt
  2025-05-06  0:49     ` Alexei Starovoitov
  2025-05-06 12:01   ` Vlastimil Babka
  2025-05-09  1:03   ` Harry Yoo
  2 siblings, 1 reply; 41+ messages in thread
From: Shakeel Butt @ 2025-05-05 18:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, vbabka, harry.yoo, mhocko, bigeasy, andrii, memxor,
	akpm, peterz, rostedt, hannes, willy

On Wed, Apr 30, 2025 at 08:27:18PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -595,7 +595,13 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  	if (!val)
>  		return;
>  
> -	cgroup_rstat_updated(memcg->css.cgroup, cpu);
> +	/*
> +	 * If called from NMI via kmalloc_nolock -> memcg_slab_post_alloc_hook
> +	 * -> obj_cgroup_charge -> mod_memcg_state,
> +	 * then delay the update.
> +	 */
> +	if (!in_nmi())
> +		cgroup_rstat_updated(memcg->css.cgroup, cpu);

I don't think we can just ignore cgroup_rstat_updated() for NMI as there
is a chance (though very small) that we will lose these stats updates.

In addition, memcg_rstat_updated() itself is not reentrant safe, along
with a couple of functions leading to it like __mod_memcg_lruvec_state().

>  	statc = this_cpu_ptr(memcg->vmstats_percpu);
>  	for (; statc; statc = statc->parent) {
>  		/*
> @@ -2895,7 +2901,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>  	unsigned long flags;
>  	bool ret = false;
>  
> -	local_lock_irqsave(&memcg_stock.stock_lock, flags);
> +	local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
>  
>  	stock = this_cpu_ptr(&memcg_stock);
>  	if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
> @@ -2995,7 +3001,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
>  	unsigned long flags;
>  	unsigned int nr_pages = 0;
>  
> -	local_lock_irqsave(&memcg_stock.stock_lock, flags);
> +	local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
>  
>  	stock = this_cpu_ptr(&memcg_stock);
>  	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> @@ -3088,6 +3094,27 @@ static inline size_t obj_full_size(struct kmem_cache *s)
>  	return s->size + sizeof(struct obj_cgroup *);
>  }
>  
> +/*
> + * Try subtract from nr_charged_bytes without making it negative
> + */
> +static bool obj_cgroup_charge_atomic(struct obj_cgroup *objcg, gfp_t flags, size_t sz)
> +{
> +	size_t old = atomic_read(&objcg->nr_charged_bytes);
> +	u32 nr_pages = sz >> PAGE_SHIFT;
> +	u32 nr_bytes = sz & (PAGE_SIZE - 1);
> +
> +	if ((ssize_t)(old - sz) >= 0 &&
> +	    atomic_cmpxchg(&objcg->nr_charged_bytes, old, old - sz) == old)
> +		return true;
> +
> +	nr_pages++;
> +	if (obj_cgroup_charge_pages(objcg, flags, nr_pages))
> +		return false;
> +
> +	atomic_add(PAGE_SIZE - nr_bytes, &objcg->nr_charged_bytes);
> +	return true;
> +}
> +
>  bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>  				  gfp_t flags, size_t size, void **p)
>  {
> @@ -3128,6 +3155,21 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>  			return false;
>  	}
>  
> +	if (!gfpflags_allow_spinning(flags)) {
> +		if (local_lock_is_locked(&memcg_stock.stock_lock)) {
> +			/*
> +			 * Cannot use
> +			 * lockdep_assert_held(this_cpu_ptr(&memcg_stock.stock_lock));
> +			 * since lockdep might not have been informed yet
> +			 * of lock acquisition.
> +			 */
> +			return obj_cgroup_charge_atomic(objcg, flags,
> +							size * obj_full_size(s));

We can not just ignore the stat updates here.

> +		} else {
> +			lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
> +		}
> +	}
> +
>  	for (i = 0; i < size; i++) {
>  		slab = virt_to_slab(p[i]);
>  
> @@ -3162,8 +3204,12 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>  void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
>  			    void **p, int objects, struct slabobj_ext *obj_exts)
>  {
> +	bool lock_held = local_lock_is_locked(&memcg_stock.stock_lock);
>  	size_t obj_size = obj_full_size(s);
>  
> +	if (likely(!lock_held))
> +		lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
> +
>  	for (int i = 0; i < objects; i++) {
>  		struct obj_cgroup *objcg;
>  		unsigned int off;
> @@ -3174,8 +3220,12 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
>  			continue;
>  
>  		obj_exts[off].objcg = NULL;
> -		refill_obj_stock(objcg, obj_size, true, -obj_size,
> -				 slab_pgdat(slab), cache_vmstat_idx(s));
> +		if (unlikely(lock_held)) {
> +			atomic_add(obj_size, &objcg->nr_charged_bytes);

objcg->nr_charged_bytes is stats ignorant and the relevant stats need to
be updated before putting stuff into it.

> +		} else {
> +			refill_obj_stock(objcg, obj_size, true, -obj_size,
> +					 slab_pgdat(slab), cache_vmstat_idx(s));
> +		}
>  		obj_cgroup_put(objcg);
>  	}
>  }

I am actually working on making this whole call chain (i.e.
kmalloc/kmem_cache_alloc to memcg [un]charging) reentrant/nmi safe.



* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-05 18:46   ` Shakeel Butt
@ 2025-05-06  0:49     ` Alexei Starovoitov
  2025-05-06  1:24       ` Shakeel Butt
  0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-06  0:49 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Mon, May 5, 2025 at 11:46 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Apr 30, 2025 at 08:27:18PM -0700, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -595,7 +595,13 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >       if (!val)
> >               return;
> >
> > -     cgroup_rstat_updated(memcg->css.cgroup, cpu);
> > +     /*
> > +      * If called from NMI via kmalloc_nolock -> memcg_slab_post_alloc_hook
> > +      * -> obj_cgroup_charge -> mod_memcg_state,
> > +      * then delay the update.
> > +      */
> > +     if (!in_nmi())
> > +             cgroup_rstat_updated(memcg->css.cgroup, cpu);
>
> I don't think we can just ignore cgroup_rstat_updated() for NMI as there
> is a chance (though very small) that we will lose these stats updates.

I'm failing to understand why it's an issue.
Not doing cgroup_rstat_updated() can only cause the updated_next link
to stay NULL when it should be set, but that should be harmless, and
no different from the racy check that the code is already doing:
if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
  return;

Imagine it was !NULL: the code would return, but then preemption
happens, something clears it to NULL, and here we're skipping a set
of updated_next.

> In addition, memcg_rstat_updated() itself is not reentrant safe, along
> with a couple of functions leading to it like __mod_memcg_lruvec_state().

Sure. __mod_memcg_lruvec_state() is not reentrant,
but it's not an issue for kmalloc_nolock(), since objcg/memcg
charge/uncharge from slub does not call it (as far as I can tell).

>
> >       statc = this_cpu_ptr(memcg->vmstats_percpu);
> >       for (; statc; statc = statc->parent) {
> >               /*
> > @@ -2895,7 +2901,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> >       unsigned long flags;
> >       bool ret = false;
> >
> > -     local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > +     local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
> >
> >       stock = this_cpu_ptr(&memcg_stock);
> >       if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
> > @@ -2995,7 +3001,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> >       unsigned long flags;
> >       unsigned int nr_pages = 0;
> >
> > -     local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > +     local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
> >
> >       stock = this_cpu_ptr(&memcg_stock);
> >       if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> > @@ -3088,6 +3094,27 @@ static inline size_t obj_full_size(struct kmem_cache *s)
> >       return s->size + sizeof(struct obj_cgroup *);
> >  }
> >
> > +/*
> > + * Try subtract from nr_charged_bytes without making it negative
> > + */
> > +static bool obj_cgroup_charge_atomic(struct obj_cgroup *objcg, gfp_t flags, size_t sz)
> > +{
> > +     size_t old = atomic_read(&objcg->nr_charged_bytes);
> > +     u32 nr_pages = sz >> PAGE_SHIFT;
> > +     u32 nr_bytes = sz & (PAGE_SIZE - 1);
> > +
> > +     if ((ssize_t)(old - sz) >= 0 &&
> > +         atomic_cmpxchg(&objcg->nr_charged_bytes, old, old - sz) == old)
> > +             return true;
> > +
> > +     nr_pages++;
> > +     if (obj_cgroup_charge_pages(objcg, flags, nr_pages))
> > +             return false;
> > +
> > +     atomic_add(PAGE_SIZE - nr_bytes, &objcg->nr_charged_bytes);
> > +     return true;
> > +}
> > +
> >  bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> >                                 gfp_t flags, size_t size, void **p)
> >  {
> > @@ -3128,6 +3155,21 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> >                       return false;
> >       }
> >
> > +     if (!gfpflags_allow_spinning(flags)) {
> > +             if (local_lock_is_locked(&memcg_stock.stock_lock)) {
> > +                     /*
> > +                      * Cannot use
> > +                      * lockdep_assert_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > +                      * since lockdep might not have been informed yet
> > +                      * of lock acquisition.
> > +                      */
> > +                     return obj_cgroup_charge_atomic(objcg, flags,
> > +                                                     size * obj_full_size(s));
>
> We can not just ignore the stat updates here.
>
> > +             } else {
> > +                     lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > +             }
> > +     }
> > +
> >       for (i = 0; i < size; i++) {
> >               slab = virt_to_slab(p[i]);
> >
> > @@ -3162,8 +3204,12 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> >  void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> >                           void **p, int objects, struct slabobj_ext *obj_exts)
> >  {
> > +     bool lock_held = local_lock_is_locked(&memcg_stock.stock_lock);
> >       size_t obj_size = obj_full_size(s);
> >
> > +     if (likely(!lock_held))
> > +             lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > +
> >       for (int i = 0; i < objects; i++) {
> >               struct obj_cgroup *objcg;
> >               unsigned int off;
> > @@ -3174,8 +3220,12 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> >                       continue;
> >
> >               obj_exts[off].objcg = NULL;
> > -             refill_obj_stock(objcg, obj_size, true, -obj_size,
> > -                              slab_pgdat(slab), cache_vmstat_idx(s));
> > +             if (unlikely(lock_held)) {
> > +                     atomic_add(obj_size, &objcg->nr_charged_bytes);
>
> objcg->nr_charged_bytes is stats ignorant and the relevant stats need to
> be updated before putting stuff into it.

I'm not following.
It's functionally equivalent to refill_obj_stock() without
__account_obj_stock().
And the stats are not ignored.
The next __memcg_slab_free_hook() from good context will update
them. It's only a tiny delay in update.
I don't see why it's an issue.

> > +             } else {
> > +                     refill_obj_stock(objcg, obj_size, true, -obj_size,
> > +                                      slab_pgdat(slab), cache_vmstat_idx(s));
> > +             }
> >               obj_cgroup_put(objcg);
> >       }
> >  }
>
> I am actually working on making this whole call chain (i.e.
> kmalloc/kmem_cache_alloc to memcg [un]charging) reentrant/nmi safe.

Thank you for working on it!
You mean this set:
https://lore.kernel.org/all/20250429061211.1295443-1-shakeel.butt@linux.dev/
?
It's making css_rstat_updated() re-entrant,
which is a renamed/reworked version of memcg_rstat_updated().
That's good, but not enough from the slub pov.
It removes the need for the first hunk in this patch from mm/memcontrol.c
+ if (!in_nmi())
+               cgroup_rstat_updated(...);

but hunks in __memcg_slab_post_alloc_hook() and __memcg_slab_free_hook()
are still needed.
And I think the obj_cgroup_charge_atomic() approach in this patch is correct.
The delay in rstat update seems fine.
Please help me understand what I'm missing.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-06  0:49     ` Alexei Starovoitov
@ 2025-05-06  1:24       ` Shakeel Butt
  2025-05-06  1:51         ` Alexei Starovoitov
  0 siblings, 1 reply; 41+ messages in thread
From: Shakeel Butt @ 2025-05-06  1:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Mon, May 05, 2025 at 05:49:47PM -0700, Alexei Starovoitov wrote:
> On Mon, May 5, 2025 at 11:46 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > On Wed, Apr 30, 2025 at 08:27:18PM -0700, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -595,7 +595,13 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> > >       if (!val)
> > >               return;
> > >
> > > -     cgroup_rstat_updated(memcg->css.cgroup, cpu);
> > > +     /*
> > > +      * If called from NMI via kmalloc_nolock -> memcg_slab_post_alloc_hook
> > > +      * -> obj_cgroup_charge -> mod_memcg_state,
> > > +      * then delay the update.
> > > +      */
> > > +     if (!in_nmi())
> > > +             cgroup_rstat_updated(memcg->css.cgroup, cpu);
> >
> > I don't think we can just ignore cgroup_rstat_updated() for nmi as there
> > is a chance (though very small) that we will lose these stats updates.
> 
> I'm failing to understand why it's an issue.
> Not doing cgroup_rstat_updated() can only cause updated_next link
> to stay NULL when it should be set,
> but it should be harmless, and no different from the racy check
> that the code is already doing:
> if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
>   return;
> 
> Imagine it was !NULL, the code would return,
> but then preemption, something clears it to NULL,
> and here we're skipping a set of updated_next.

cgroup_rstat_updated() puts the given cgroup, whose stats were modified, on
the per-cpu update tree, which the read side will later flush to get the
up-to-date stats. Not putting it on the update tree will cause the read side
to skip flushing the stats cached on that cpu. There is a possibility
that someone else in a non-nmi context may put that cgroup on that cpu's
update tree, but there is no guarantee.
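
For reference, the part that gets skipped is roughly (simplified):

	/* cgroup_rstat_updated(), simplified: */
	if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
		return;		/* already on this cpu's update tree */
	/*
	 * Otherwise link cgrp (and its ancestors) into the per-cpu update
	 * tree so that a later flush on this cpu sees the cached deltas.
	 */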

> 
> > In addition, memcg_rstat_updated() itself is not reentrant safe along
> > with couple of functions leading to it like __mod_memcg_lruvec_state().
> 
> Sure. __mod_memcg_lruvec_state() is not reentrant,
> but it's not an issue for kmalloc_nolock(), since objcg/memcg
> charge/uncharge from slub is not calling it (as far as I can tell).

Without this patch:

__memcg_slab_post_alloc_hook() -> obj_cgroup_charge_account() ->
consume_obj_stock() -> __account_obj_stock() ->
__mod_objcg_mlstate() -> __mod_memcg_lruvec_state()

With this patch:
__memcg_slab_post_alloc_hook() -> obj_cgroup_charge_atomic() ->
obj_cgroup_charge_pages() -> mod_memcg_state() -> __mod_memcg_state()

Other than __mod_memcg_state() not being reentrant safe, we will be
missing the NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B updates
after the patch.

> 
> >
> > >       statc = this_cpu_ptr(memcg->vmstats_percpu);
> > >       for (; statc; statc = statc->parent) {
> > >               /*
> > > @@ -2895,7 +2901,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> > >       unsigned long flags;
> > >       bool ret = false;
> > >
> > > -     local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > > +     local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
> > >
> > >       stock = this_cpu_ptr(&memcg_stock);
> > >       if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
> > > @@ -2995,7 +3001,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
> > >       unsigned long flags;
> > >       unsigned int nr_pages = 0;
> > >
> > > -     local_lock_irqsave(&memcg_stock.stock_lock, flags);
> > > +     local_lock_irqsave_check(&memcg_stock.stock_lock, flags);
> > >
> > >       stock = this_cpu_ptr(&memcg_stock);
> > >       if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
> > > @@ -3088,6 +3094,27 @@ static inline size_t obj_full_size(struct kmem_cache *s)
> > >       return s->size + sizeof(struct obj_cgroup *);
> > >  }
> > >
> > > +/*
> > > + * Try subtract from nr_charged_bytes without making it negative
> > > + */
> > > +static bool obj_cgroup_charge_atomic(struct obj_cgroup *objcg, gfp_t flags, size_t sz)
> > > +{
> > > +     size_t old = atomic_read(&objcg->nr_charged_bytes);
> > > +     u32 nr_pages = sz >> PAGE_SHIFT;
> > > +     u32 nr_bytes = sz & (PAGE_SIZE - 1);
> > > +
> > > +     if ((ssize_t)(old - sz) >= 0 &&
> > > +         atomic_cmpxchg(&objcg->nr_charged_bytes, old, old - sz) == old)
> > > +             return true;
> > > +
> > > +     nr_pages++;
> > > +     if (obj_cgroup_charge_pages(objcg, flags, nr_pages))
> > > +             return false;
> > > +
> > > +     atomic_add(PAGE_SIZE - nr_bytes, &objcg->nr_charged_bytes);
> > > +     return true;
> > > +}
> > > +
> > >  bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> > >                                 gfp_t flags, size_t size, void **p)
> > >  {
> > > @@ -3128,6 +3155,21 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> > >                       return false;
> > >       }
> > >
> > > +     if (!gfpflags_allow_spinning(flags)) {
> > > +             if (local_lock_is_locked(&memcg_stock.stock_lock)) {
> > > +                     /*
> > > +                      * Cannot use
> > > +                      * lockdep_assert_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > > +                      * since lockdep might not have been informed yet
> > > +                      * of lock acquisition.
> > > +                      */
> > > +                     return obj_cgroup_charge_atomic(objcg, flags,
> > > +                                                     size * obj_full_size(s));
> >
> > We can not just ignore the stat updates here.
> >
> > > +             } else {
> > > +                     lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > > +             }
> > > +     }
> > > +
> > >       for (i = 0; i < size; i++) {
> > >               slab = virt_to_slab(p[i]);
> > >
> > > @@ -3162,8 +3204,12 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> > >  void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> > >                           void **p, int objects, struct slabobj_ext *obj_exts)
> > >  {
> > > +     bool lock_held = local_lock_is_locked(&memcg_stock.stock_lock);
> > >       size_t obj_size = obj_full_size(s);
> > >
> > > +     if (likely(!lock_held))
> > > +             lockdep_assert_not_held(this_cpu_ptr(&memcg_stock.stock_lock));
> > > +
> > >       for (int i = 0; i < objects; i++) {
> > >               struct obj_cgroup *objcg;
> > >               unsigned int off;
> > > @@ -3174,8 +3220,12 @@ void __memcg_slab_free_hook(struct kmem_cache *s, struct slab *slab,
> > >                       continue;
> > >
> > >               obj_exts[off].objcg = NULL;
> > > -             refill_obj_stock(objcg, obj_size, true, -obj_size,
> > > -                              slab_pgdat(slab), cache_vmstat_idx(s));
> > > +             if (unlikely(lock_held)) {
> > > +                     atomic_add(obj_size, &objcg->nr_charged_bytes);
> >
> > objcg->nr_charged_bytes is stats ignorant and the relevant stats need to
> > be updated before putting stuff into it.
> 
> I'm not following.
> It's functionally equivalent to refill_obj_stock() without
> __account_obj_stock().
> And the stats are not ignored.
> The next __memcg_slab_free_hook() from good context will update
> them. It's only a tiny delay in update.
> I don't see why it's an issue.

For the slab object of size obj_size which is being freed here, we need
to update NR_SLAB_RECLAIMABLE_B or NR_SLAB_UNRECLAIMABLE_B stat for the
corresponding objcg by the amount of obj_size. If we don't call
__account_obj_stock() here we will lose the context and information to
update these stats later.

> 
> > > +             } else {
> > > +                     refill_obj_stock(objcg, obj_size, true, -obj_size,
> > > +                                      slab_pgdat(slab), cache_vmstat_idx(s));
> > > +             }
> > >               obj_cgroup_put(objcg);
> > >       }
> > >  }
> >
> > I am actually working on making this whole call chain (i.e.
> > kmalloc/kmem_cache_alloc to memcg [un]charging) reentrant/nmi safe.
> 
> Thank you for working on it!
> You mean this set:
> https://lore.kernel.org/all/20250429061211.1295443-1-shakeel.butt@linux.dev/
> ?
> it's making css_rstat_updated() re-entrant,
> which is renamed/reworked version of memcg_rstat_updated().
> That's good, but not enough from slub pov.
> It removes the need for the first hunk in this patch from mm/memcontrol.c
> + if (!in_nmi())
> +               cgroup_rstat_updated(...);
> 
> but hunks in __memcg_slab_post_alloc_hook() and __memcg_slab_free_hook()
> are still needed.
> And I think the obj_cgroup_charge_atomic() approach in this patch is correct.
> The delay in rstat update seems fine.
> Please help me understand what I'm missing.
> 

The css_rstat_updated() is the new name of cgroup_rstat_updated() and it
is only a piece of the puzzle. My plan is to make the memcg stats reentrant,
which would allow calling __account_obj_stock() (or whatever its new name
will be) in nmi context, and then consume_obj_stock() and refill_obj_stock()
would work very similarly to consume_stock() and refill_stock().

Please give me a couple of days and I can share the full RFC of the memcg
side.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-06  1:24       ` Shakeel Butt
@ 2025-05-06  1:51         ` Alexei Starovoitov
  2025-05-06 18:05           ` Shakeel Butt
  0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-06  1:51 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Mon, May 5, 2025 at 6:25 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> > > >               obj_exts[off].objcg = NULL;
> > > > -             refill_obj_stock(objcg, obj_size, true, -obj_size,
> > > > -                              slab_pgdat(slab), cache_vmstat_idx(s));
> > > > +             if (unlikely(lock_held)) {
> > > > +                     atomic_add(obj_size, &objcg->nr_charged_bytes);
> > >
> > > objcg->nr_charged_bytes is stats ignorant and the relevant stats need to
> > > be updated before putting stuff into it.
> >
> > I'm not following.
> > It's functionally equivalent to refill_obj_stock() without
> > __account_obj_stock().
> > And the stats are not ignored.
> > The next __memcg_slab_free_hook() from good context will update
> > them. It's only a tiny delay in update.
> > I don't see why it's an issue.
>
> For the slab object of size obj_size which is being freed here, we need
> to update NR_SLAB_RECLAIMABLE_B or NR_SLAB_UNRECLAIMABLE_B stat for the
> corresponding objcg by the amount of obj_size. If we don't call
> __account_obj_stock() here we will loose the context and information to
> update these stats later.

Lose context?
pgdat has to match objcg, so I don't think we lose anything.
Later refill_obj_stock() will take objcg->nr_charged_bytes and
apply it to the appropriate NR_SLAB_[UN]RECLAIMABLE_B stat.
The patch introduces a delay in updating the stats.
I still think it's correct, but I agree it's harder to reason about,
since the stats may not reflect the real numbers at that very instant.
But seeing how the rstat flush is periodic and racy too, I don't see
how it's a problem in practice.

>
> The css_rstat_updated() is the new name of cgroup_rstat_updated() and it
> is only a piece of the puzzle. My plan is to make the memcg stats reentrant,
> which would allow calling __account_obj_stock() (or whatever its new name
> will be) in nmi context, and then consume_obj_stock() and refill_obj_stock()
> would work very similarly to consume_stock() and refill_stock().
>
> Please give me a couple of days and I can share the full RFC of the memcg
> side.

Sure thing. I don't insist on this particular way of charge/uncharge.
Sounds like I should drop mm/memcontrol.c bits, since your approach
will address the same re-entrance issue but in a more synchronous
(easier to reason about) way. Thanks!


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] mm: Rename try_alloc_pages() to alloc_pages_nolock()
  2025-05-01  3:27 ` [PATCH 1/6] mm: Rename try_alloc_pages() to alloc_pages_nolock() Alexei Starovoitov
@ 2025-05-06  8:26   ` Vlastimil Babka
  2025-05-07  1:24     ` Alexei Starovoitov
  0 siblings, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-06  8:26 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf, linux-mm
  Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
	peterz, rostedt, hannes, willy

On 5/1/25 05:27, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> The "try_" prefix is confusing, since it made people believe
> that try_alloc_pages() is analogous to spin_trylock() and
> NULL return means EAGAIN. This is not the case. If it returns
> NULL there is no reason to call it again. It will most likely
> return NULL again. Hence rename it to alloc_pages_nolock()
> to make it symmetrical to free_pages_nolock() and document that
> NULL means ENOMEM.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> @@ -7378,20 +7378,21 @@ static bool __free_unaccepted(struct page *page)
>  #endif /* CONFIG_UNACCEPTED_MEMORY */
>  
>  /**
> - * try_alloc_pages - opportunistic reentrant allocation from any context
> + * alloc_pages_nolock - opportunistic reentrant allocation from any context
>   * @nid: node to allocate from
>   * @order: allocation order size
>   *
>   * Allocates pages of a given order from the given node. This is safe to
>   * call from any context (from atomic, NMI, and also reentrant
> - * allocator -> tracepoint -> try_alloc_pages_noprof).
> + * allocator -> tracepoint -> alloc_pages_nolock_noprof).
>   * Allocation is best effort and to be expected to fail easily so nobody should
>   * rely on the success. Failures are not reported via warn_alloc().
>   * See always fail conditions below.
>   *
> - * Return: allocated page or NULL on failure.
> + * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
> + * It means ENOMEM. There is no reason to call it again and expect !NULL.

Should we explain that the "ENOMEM" doesn't necessarily mean the system is
out of memory, but also that the calling context might be simply unlucky
(preempted someone with the lock) and retrying in the same context can't
help it?



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: Allow GFP_ACCOUNT and GFP_COMP to be used in alloc_pages_nolock().
  2025-05-01  3:27 ` [PATCH 5/6] mm: Allow GFP_ACCOUNT and GFP_COMP to be used in alloc_pages_nolock() Alexei Starovoitov
@ 2025-05-06  8:55   ` Vlastimil Babka
  2025-05-07  1:33     ` Alexei Starovoitov
  0 siblings, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-06  8:55 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf, linux-mm
  Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
	peterz, rostedt, hannes, willy

On 5/1/25 05:27, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Allow __GFP_ACCOUNT and __GFP_COMP flags to be specified when calling
> alloc_pages_nolock(), since upcoming reentrant alloc_slab_page() needs
> to allocate __GFP_COMP pages while BPF infra needs __GFP_ACCOUNT.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

I would rather see __GFP_COMP implied as we don't need any new users of
high-order allocations that are not compound, and alloc_pages_nolock() is a
new API.

IIRC it would have been more important in some early version where
free_pages_nolock() was different; now it just calls into ___free_pages(),
which has the tricky code for freeing those non-compound allocations.

But still, I'd really recommend that approach. Note __GFP_COMP is simply
ignored for order-0 so no need to filter it out.

> ---
>  include/linux/gfp.h  | 2 +-
>  kernel/bpf/syscall.c | 2 +-
>  mm/page_alloc.c      | 8 +++++---
>  3 files changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index be160e8d8bcb..9afbe5b3aef6 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -354,7 +354,7 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
>  }
>  #define alloc_page_vma(...)			alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
>  
> -struct page *alloc_pages_nolock_noprof(int nid, unsigned int order);
> +struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order);
>  #define alloc_pages_nolock(...)			alloc_hooks(alloc_pages_nolock_noprof(__VA_ARGS__))
>  
>  extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d0ddba2a952b..83af8fa9db3f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -578,7 +578,7 @@ static bool can_alloc_pages(void)
>  static struct page *__bpf_alloc_page(int nid)
>  {
>  	if (!can_alloc_pages())
> -		return alloc_pages_nolock(nid, 0);
> +		return alloc_pages_nolock(__GFP_ACCOUNT, nid, 0);
>  
>  	return alloc_pages_node(nid,
>  				GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1d77a07b0659..303df205ca7d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7379,6 +7379,7 @@ static bool __free_unaccepted(struct page *page)
>  
>  /**
>   * alloc_pages_nolock - opportunistic reentrant allocation from any context
> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_COMP allowed.
>   * @nid: node to allocate from
>   * @order: allocation order size
>   *
> @@ -7392,7 +7393,7 @@ static bool __free_unaccepted(struct page *page)
>   * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
>   * It means ENOMEM. There is no reason to call it again and expect !NULL.
>   */
> -struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
> +struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
>  {
>  	/*
>  	 * Do not specify __GFP_DIRECT_RECLAIM, since direct claim is not allowed.
> @@ -7415,11 +7416,12 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
>  	 * doesn't want to deplete reserves.
>  	 */
>  	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
> -			| __GFP_ACCOUNT;
> +			| gfp_flags;
>  	unsigned int alloc_flags = ALLOC_TRYLOCK;
>  	struct alloc_context ac = { };
>  	struct page *page;
>  
> +	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_COMP));
>  	/*
>  	 * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
>  	 * unsafe in NMI. If spin_trylock() is called from hard IRQ the current
> @@ -7462,7 +7464,7 @@ struct page *alloc_pages_nolock_noprof(int nid, unsigned int order)
>  	if (page)
>  		set_page_refcounted(page);
>  
> -	if (memcg_kmem_online() && page &&
> +	if (memcg_kmem_online() && page && (gfp_flags & __GFP_ACCOUNT) &&
>  	    unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
>  		free_pages_nolock(page, order);
>  		page = NULL;



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-01  3:27 ` [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
  2025-05-05 18:46   ` Shakeel Butt
@ 2025-05-06 12:01   ` Vlastimil Babka
  2025-05-07  0:31     ` Harry Yoo
  2025-05-07  2:20     ` Alexei Starovoitov
  2025-05-09  1:03   ` Harry Yoo
  2 siblings, 2 replies; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-06 12:01 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf, linux-mm
  Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
	peterz, rostedt, hannes, willy

On 5/1/25 05:27, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> kmalloc_nolock() relies on ability of local_lock to detect the situation
> when it's locked.
> In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> In that case retry the operation in a different kmalloc bucket.
> The second attempt will likely succeed, since this cpu locked
> different kmem_cache_cpu.
> When local_lock_is_locked() sees locked memcg_stock.stock_lock
> fallback to atomic operations.
> 
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> a different bucket that is most likely not locked by current
> task. Though it may be locked by a different task it's safe to
> rt_spin_lock() on it.
> 
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

In general I'd prefer if we could avoid local_lock_is_locked() usage outside
of debugging code. It just feels hacky given we have local_trylock()
operations. But I can see how this makes things simpler so it's probably
acceptable.

> @@ -2458,13 +2468,21 @@ static void *setup_object(struct kmem_cache *s, void *object)
>   * Slab allocation and freeing
>   */
>  static inline struct slab *alloc_slab_page(gfp_t flags, int node,
> -		struct kmem_cache_order_objects oo)
> +					   struct kmem_cache_order_objects oo,
> +					   bool allow_spin)
>  {
>  	struct folio *folio;
>  	struct slab *slab;
>  	unsigned int order = oo_order(oo);
>  
> -	if (node == NUMA_NO_NODE)
> +	if (unlikely(!allow_spin)) {
> +		struct page *p = alloc_pages_nolock(__GFP_COMP, node, order);
> +
> +		if (p)
> +			/* Make the page frozen. Drop refcnt to zero. */
> +			put_page_testzero(p);

This is dangerous. Once we create a refcounted (non-frozen) page, someone
else (a pfn scanner like compaction) can do a get_page_unless_zero(), so the
refcount becomes 2, then we decrement the refcount here to 1, the pfn
scanner realizes it's not a page it can work with, does put_page() and frees
it under us.

The solution is to split out alloc_frozen_pages_nolock() to use from here,
and make alloc_pages_nolock() use it too and then set refcounted.
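
E.g. a rough sketch of that split (the frozen variant's name and prototype
are just my guess, untested):

struct page *alloc_pages_nolock_noprof(gfp_t gfp_flags, int nid, unsigned int order)
{
	/*
	 * Hypothetical split: the frozen variant does all the work and
	 * returns a refcount-0 page; the existing API only adds the ref,
	 * so a pfn scanner never sees a transient extra reference.
	 */
	struct page *page = alloc_frozen_pages_nolock_noprof(gfp_flags, nid, order);

	if (page)
		set_page_refcounted(page);
	return page;
}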

> +		folio = (struct folio *)p;
> +	} else if (node == NUMA_NO_NODE)
>  		folio = (struct folio *)alloc_frozen_pages(flags, order);
>  	else
>  		folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);

<snip>

> @@ -3958,8 +3989,28 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	 */
>  	c = slub_get_cpu_ptr(s->cpu_slab);
>  #endif
> +	if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
> +		struct slab *slab;
> +
> +		slab = c->slab;
> +		if (slab && !node_match(slab, node))
> +			/* In trylock mode numa node is a hint */
> +			node = NUMA_NO_NODE;
> +
> +		if (!local_lock_is_locked(&s->cpu_slab->lock)) {
> +			lockdep_assert_not_held(this_cpu_ptr(&s->cpu_slab->lock));
> +		} else {
> +			/*
> +			 * EBUSY is an internal signal to kmalloc_nolock() to
> +			 * retry a different bucket. It's not propagated further.
> +			 */
> +			p = ERR_PTR(-EBUSY);
> +			goto out;

Am I right in my reasoning as follows?

- If we're on RT and "in_nmi() || in_hardirq()" is true then
kmalloc_nolock_noprof() would return NULL immediately and we never reach
this code

- local_lock_is_locked() on RT tests if the current process is the lock
owner. This means (in absence of double locking bugs) that we locked it as
task (or hardirq) and now we're either in_hardirq() (doesn't change current
AFAIK?) preempting task, or in_nmi() preempting task or hardirq.

- so local_lock_is_locked() will never be true here on RT

> +		}
> +	}
>  
>  	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> +out:
>  #ifdef CONFIG_PREEMPT_COUNT
>  	slub_put_cpu_ptr(s->cpu_slab);
>  #endif
> @@ -4162,8 +4213,9 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
>  		if (p[i] && init && (!kasan_init ||
>  				     !kasan_has_integrated_init()))
>  			memset(p[i], 0, zero_size);
> -		kmemleak_alloc_recursive(p[i], s->object_size, 1,
> -					 s->flags, init_flags);
> +		if (gfpflags_allow_spinning(flags))
> +			kmemleak_alloc_recursive(p[i], s->object_size, 1,
> +						 s->flags, init_flags);
>  		kmsan_slab_alloc(s, p[i], init_flags);
>  		alloc_tagging_slab_alloc_hook(s, p[i], flags);
>  	}
> @@ -4354,6 +4406,88 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
>  }
>  EXPORT_SYMBOL(__kmalloc_noprof);
>  
> +/**
> + * kmalloc_nolock - Allocate an object of given size from any context.
> + * @size: size to allocate
> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> + * @node: node number of the target node.
> + *
> + * Return: pointer to the new object or NULL in case of error.
> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> + * There is no reason to call it again and expect !NULL.
> + */
> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> +{
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> +	struct kmem_cache *s;
> +	bool can_retry = true;
> +	void *ret = ERR_PTR(-EBUSY);
> +
> +	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> +
> +	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> +		return NULL;
> +	if (unlikely(!size))
> +		return ZERO_SIZE_PTR;
> +
> +	if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
> +		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> +		return NULL;
> +retry:
> +	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);

The idea of retrying on a different bucket is based on wrong assumptions and
thus won't work as you expect. kmalloc_slab() doesn't select buckets truly
randomly, but deterministically via hashing from a random per-boot seed and
the _RET_IP_, as the security hardening goal is to make different kmalloc()
callsites get different caches with high probability.

And I also wouldn't recommend changing this for the kmalloc_nolock_noprof()
case, as that could make the hardening weaker, and it would also not help
for kernels that don't have it enabled, anyway.
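
A simplified model of that selection (not the exact kernel code, the helper
name is made up):

static unsigned int pick_kmalloc_bucket(unsigned long caller_ret_ip)
{
	/*
	 * random_kmalloc_seed is fixed at boot, so the bucket depends only
	 * on the callsite: a retry from the same _RET_IP_ with the same
	 * size lands in the same cache every time.
	 */
	return hash_64(caller_ret_ip ^ random_kmalloc_seed,
		       ilog2(RANDOM_KMALLOC_CACHES_NR + 1));
}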

> +
> +	if (!(s->flags & __CMPXCHG_DOUBLE))
> +		/*
> +		 * kmalloc_nolock() is not supported on architectures that
> +		 * don't implement cmpxchg16b.
> +		 */
> +		return NULL;
> +
> +	/*
> +	 * Do not call slab_alloc_node(), since trylock mode isn't
> +	 * compatible with slab_pre_alloc_hook/should_failslab and
> +	 * kfence_alloc.
> +	 *
> +	 * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> +	 * in irq saved region. It assumes that the same cpu will not
> +	 * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> +	 * Therefore use in_nmi() to check whether particular bucket is in
> +	 * irq protected section.
> +	 */
> +	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> +		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);

Hm, this is somewhat subtle. We're testing the local lock without having the
cpu explicitly pinned. But the test only happens in_nmi(), which implicitly
is a context that won't migrate, so it should work I think, but maybe it
should be more explicit in the comment?

<snip>

>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
>   * can perform fastpath freeing without additional function calls.
> @@ -4605,10 +4762,36 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>  	barrier();
>  
>  	if (unlikely(slab != c->slab)) {

Note this unlikely() is actually a lie. It's actually unlikely that the free
will happen on the same cpu and with the same slab still being c->slab,
unless it's a free that shortly follows a temporary object allocation.

> -		__slab_free(s, slab, head, tail, cnt, addr);
> +		/* cnt == 0 signals that it's called from kfree_nolock() */
> +		if (unlikely(!cnt)) {
> +			/*
> +			 * Use llist in cache_node ?
> +			 * struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> +			 */
> +			/*
> +			 * __slab_free() can locklessly cmpxchg16 into a slab,
> +			 * but then it might need to take spin_lock or local_lock
> +			 * in put_cpu_partial() for further processing.
> +			 * Avoid the complexity and simply add to a deferred list.
> +			 */
> +			llist_add(head, &s->defer_free_objects);
> +		} else {
> +			free_deferred_objects(&s->defer_free_objects, addr);

So I'm a bit wary that this is actually rather a fast path that might
contend on defer_free_objects from all cpus.

I'm wondering if we could make the list part of kmem_cache_cpu to distribute
it, and hook the flushing e.g. into places where we do deactivate_slab(),
which should be a much slower path, and also into free_to_partial_list() to
handle SLUB_TINY / caches with debugging enabled.

> +			__slab_free(s, slab, head, tail, cnt, addr);
> +		}
>  		return;
>  	}
>  
> +	if (unlikely(!cnt)) {
> +		if ((in_nmi() || !USE_LOCKLESS_FAST_PATH()) &&
> +		    local_lock_is_locked(&s->cpu_slab->lock)) {
> +			llist_add(head, &s->defer_free_objects);
> +			return;
> +		}
> +		cnt = 1;
> +		kasan_slab_free(s, head, false, false, true);
> +	}
> +
>  	if (USE_LOCKLESS_FAST_PATH()) {
>  		freelist = READ_ONCE(c->freelist);
>  


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t.
  2025-05-01  3:27 ` [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
@ 2025-05-06 12:56   ` Vlastimil Babka
  2025-05-06 14:55     ` Vlastimil Babka
  2025-05-12 13:26   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-06 12:56 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf, linux-mm
  Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
	peterz, rostedt, hannes, willy

On 5/1/25 05:27, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> lockdep_is_held() macro assumes that "struct lockdep_map dep_map;"
> is a top level field of any lock that participates in LOCKDEP.
> Make it so for local_trylock_t.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/local_lock_internal.h | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> index bf2bf40d7b18..29df45f95843 100644
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -17,7 +17,10 @@ typedef struct {
>  
>  /* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
>  typedef struct {
> -	local_lock_t	llock;
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +	struct lockdep_map	dep_map;
> +	struct task_struct	*owner;
> +#endif
>  	u8		acquired;
>  } local_trylock_t;
>  
> @@ -31,7 +34,7 @@ typedef struct {
>  	.owner = NULL,
>  
>  # define LOCAL_TRYLOCK_DEBUG_INIT(lockname)		\
> -	.llock = { LOCAL_LOCK_DEBUG_INIT((lockname).llock) },
> +	LOCAL_LOCK_DEBUG_INIT(lockname)
>  
>  static inline void local_lock_acquire(local_lock_t *l)
>  {
> @@ -81,7 +84,7 @@ do {								\
>  	local_lock_debug_init(lock);				\
>  } while (0)
>  
> -#define __local_trylock_init(lock) __local_lock_init(lock.llock)
> +#define __local_trylock_init(lock) __local_lock_init((local_lock_t *)lock)

This cast seems unnecessary. Better not hide mistakes when using the
local_trylock_init() macro.

>  
>  #define __spinlock_nested_bh_init(lock)				\
>  do {								\



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked().
  2025-05-01  3:27 ` [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
@ 2025-05-06 12:59   ` Vlastimil Babka
  2025-05-07  1:28     ` Alexei Starovoitov
  2025-05-12 14:56   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-06 12:59 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf, linux-mm
  Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
	peterz, rostedt, hannes, willy

On 5/1/25 05:27, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce local_lock_is_locked() that returns true when
> given local_lock is locked by current cpu (in !PREEMPT_RT) or
> by current task (in PREEMPT_RT).
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

On !RT this only works for local_trylock_t, which is fine, but maybe make it
part of the name then? local_trylock_is_locked()?

> ---
>  include/linux/local_lock.h          | 2 ++
>  include/linux/local_lock_internal.h | 8 ++++++++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
> index 16a2ee4f8310..092ce89b162a 100644
> --- a/include/linux/local_lock.h
> +++ b/include/linux/local_lock.h
> @@ -66,6 +66,8 @@
>   */
>  #define local_trylock(lock)		__local_trylock(lock)
>  
> +#define local_lock_is_locked(lock)	__local_lock_is_locked(lock)
> +
>  /**
>   * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
>   *			   interrupts if acquired
> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> index 29df45f95843..263723a45ecd 100644
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -165,6 +165,9 @@ do {								\
>  		!!tl;						\
>  	})
>  
> +/* preemption or migration must be disabled before calling __local_lock_is_locked */
> +#define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
> +
>  #define __local_lock_release(lock)					\
>  	do {								\
>  		local_trylock_t *tl;					\
> @@ -285,4 +288,9 @@ do {								\
>  		__local_trylock(lock);				\
>  	})
>  
> +/* migration must be disabled before calling __local_lock_is_locked */
> +#include "../../kernel/locking/rtmutex_common.h"
> +#define __local_lock_is_locked(__lock)					\
> +	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)
> +
>  #endif /* CONFIG_PREEMPT_RT */



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t.
  2025-05-06 12:56   ` Vlastimil Babka
@ 2025-05-06 14:55     ` Vlastimil Babka
  2025-05-07  1:25       ` Alexei Starovoitov
  0 siblings, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-06 14:55 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf, linux-mm
  Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
	peterz, rostedt, hannes, willy

On 5/6/25 14:56, Vlastimil Babka wrote:
> On 5/1/25 05:27, Alexei Starovoitov wrote:
>> @@ -81,7 +84,7 @@ do {								\
>>  	local_lock_debug_init(lock);				\
>>  } while (0)
>>  
>> -#define __local_trylock_init(lock) __local_lock_init(lock.llock)
>> +#define __local_trylock_init(lock) __local_lock_init((local_lock_t *)lock)
> 
> This cast seems unnecessary. Better not hide mistakes when using the
> local_trylock_init() macro.

Nevermind, tested the wrong config, it's necessary. Sigh.

>>  
>>  #define __spinlock_nested_bh_init(lock)				\
>>  do {								\
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-06  1:51         ` Alexei Starovoitov
@ 2025-05-06 18:05           ` Shakeel Butt
  0 siblings, 0 replies; 41+ messages in thread
From: Shakeel Butt @ 2025-05-06 18:05 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, Vlastimil Babka, Harry Yoo, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Mon, May 05, 2025 at 06:51:28PM -0700, Alexei Starovoitov wrote:
> On Mon, May 5, 2025 at 6:25 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > > > >               obj_exts[off].objcg = NULL;
> > > > > -             refill_obj_stock(objcg, obj_size, true, -obj_size,
> > > > > -                              slab_pgdat(slab), cache_vmstat_idx(s));
> > > > > +             if (unlikely(lock_held)) {
> > > > > +                     atomic_add(obj_size, &objcg->nr_charged_bytes);
> > > >
> > > > objcg->nr_charged_bytes is stats ignorant and the relevant stats need to
> > > > be updated before putting stuff into it.
> > >
> > > I'm not following.
> > > It's functionally equivalent to refill_obj_stock() without
> > > __account_obj_stock().
> > > And the stats are not ignored.
> > > The next __memcg_slab_free_hook() from good context will update
> > > them. It's only a tiny delay in update.
> > > I don't see why it's an issue.
> >
> > For the slab object of size obj_size which is being freed here, we need
> > to update NR_SLAB_RECLAIMABLE_B or NR_SLAB_UNRECLAIMABLE_B stat for the
> > corresponding objcg by the amount of obj_size. If we don't call
> > __account_obj_stock() here we will lose the context and information to
> > update these stats later.
> 
> Lose context?
> pgdat has to match objcg, so I don't think we lose anything.
> Later refill_obj_stock() will take objcg->nr_charged_bytes and
> apply to appropriate NR_SLAB_[UN]RECLAIMABLE_B.

refill_obj_stock() will not apply objcg->nr_charged_bytes to
NR_SLAB_[UN]RECLAIMABLE_B. This is confusing because the per-cpu objcg stock
has two distinct functions, i.e. (1) a byte-level charge cache and (2) a slab
stats cache, though both of them happen to share the objcg (this will change
when I add multi-objcg support).

Let me explain this with an example. Let's say that in task context,
Job-A frees a charged slab object on CPU-i, and also assume CPU-i's
memcg_stock has Job-A's objcg cached. Next, let's assume that while CPU-i
was in refill_obj_stock(), it got interrupted by an NMI, and within the NMI
a slab object charged to Job-B got freed.

In the above scenario and with this patch, the kernel will put the
obj_size of Job-B's freed slab object into Job-B's
objcg->nr_charged_bytes and just return. Please note that CPU-i still
has Job-A's objcg (charge and stats) cached. When CPU-i's memcg_stock
gets drained later, it will update the stats of Job-A which were cached on
it. We have lost the stat updates for Job-B's freed slab object.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-06 12:01   ` Vlastimil Babka
@ 2025-05-07  0:31     ` Harry Yoo
  2025-05-07  2:23       ` Alexei Starovoitov
  2025-05-07  8:38       ` Vlastimil Babka
  2025-05-07  2:20     ` Alexei Starovoitov
  1 sibling, 2 replies; 41+ messages in thread
From: Harry Yoo @ 2025-05-07  0:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexei Starovoitov, bpf, linux-mm, shakeel.butt, mhocko, bigeasy,
	andrii, memxor, akpm, peterz, rostedt, hannes, willy

On Tue, May 06, 2025 at 02:01:48PM +0200, Vlastimil Babka wrote:
> On 5/1/25 05:27, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
> > kmalloc_nolock() relies on ability of local_lock to detect the situation
> > when it's locked.
> > In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> > irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> > In that case retry the operation in a different kmalloc bucket.
> > The second attempt will likely succeed, since this cpu locked
> > different kmem_cache_cpu.
> > When local_lock_is_locked() sees locked memcg_stock.stock_lock
> > fallback to atomic operations.
> > 
> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> > into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> > a different bucket that is most likely not locked by current
> > task. Though it may be locked by a different task it's safe to
> > rt_spin_lock() on it.
> > 
> > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > immediately if called from hard irq or NMI in PREEMPT_RT.
> > 
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> 

... snip ...

> > @@ -4354,6 +4406,88 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
> >  }
> >  EXPORT_SYMBOL(__kmalloc_noprof);
> >  
> > +/**
> > + * kmalloc_nolock - Allocate an object of given size from any context.
> > + * @size: size to allocate
> > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> > + * @node: node number of the target node.
> > + *
> > + * Return: pointer to the new object or NULL in case of error.
> > + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> > + * There is no reason to call it again and expect !NULL.
> > + */
> > +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> > +{
> > +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> > +	struct kmem_cache *s;
> > +	bool can_retry = true;
> > +	void *ret = ERR_PTR(-EBUSY);
> > +
> > +	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> > +
> > +	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> > +		return NULL;
> > +	if (unlikely(!size))
> > +		return ZERO_SIZE_PTR;
> > +
> > +	if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
> > +		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> > +		return NULL;
> > +retry:
> > +	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> 
> The idea of retrying on different bucket is based on wrong assumptions and
> thus won't work as you expect. kmalloc_slab() doesn't select buckets truly
> randomly, but deterministically via hashing from a random per-boot seed and
> the _RET_IP_, as the security hardening goal is to make different kmalloc()
> callsites get different caches with high probability.

It's not retrying with the same size, so I don't think it's relying on any
assumption about random kmalloc caches. (yeah, it wastes some memory if
allocated from the next size bucket)

	if (PTR_ERR(ret) == -EBUSY) {
		if (can_retry) {
			/* pick the next kmalloc bucket */
			size = s->object_size + 1;
			/*
			 * Another alternative is to
			 * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
			 * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
			 * to retry from bucket of the same size.
			 */
			can_retry = false;
			goto retry;
		}
		ret = NULL;
	}

By the way, it doesn't check whether a kmalloc cache that can serve
(s->object_size + 1) allocations actually exists, which is not the case
when s is already the largest kmalloc cache?
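
Something along these lines before the retry would cover it (untested
sketch on top of the hunk quoted above):

	if (PTR_ERR(ret) == -EBUSY) {
		if (can_retry && s->object_size < KMALLOC_MAX_CACHE_SIZE) {
			/* pick the next kmalloc bucket, if one exists */
			size = s->object_size + 1;
			can_retry = false;
			goto retry;
		}
		ret = NULL;
	}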

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 1/6] mm: Rename try_alloc_pages() to alloc_pages_nolock()
  2025-05-06  8:26   ` Vlastimil Babka
@ 2025-05-07  1:24     ` Alexei Starovoitov
  0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-07  1:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Tue, May 6, 2025 at 1:26 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 5/1/25 05:27, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > The "try_" prefix is confusing, since it made people believe
> > that try_alloc_pages() is analogous to spin_trylock() and
> > NULL return means EAGAIN. This is not the case. If it returns
> > NULL there is no reason to call it again. It will most likely
> > return NULL again. Hence rename it to alloc_pages_nolock()
> > to make it symmetrical to free_pages_nolock() and document that
> > NULL means ENOMEM.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>
> > @@ -7378,20 +7378,21 @@ static bool __free_unaccepted(struct page *page)
> >  #endif /* CONFIG_UNACCEPTED_MEMORY */
> >
> >  /**
> > - * try_alloc_pages - opportunistic reentrant allocation from any context
> > + * alloc_pages_nolock - opportunistic reentrant allocation from any context
> >   * @nid: node to allocate from
> >   * @order: allocation order size
> >   *
> >   * Allocates pages of a given order from the given node. This is safe to
> >   * call from any context (from atomic, NMI, and also reentrant
> > - * allocator -> tracepoint -> try_alloc_pages_noprof).
> > + * allocator -> tracepoint -> alloc_pages_nolock_noprof).
> >   * Allocation is best effort and to be expected to fail easily so nobody should
> >   * rely on the success. Failures are not reported via warn_alloc().
> >   * See always fail conditions below.
> >   *
> > - * Return: allocated page or NULL on failure.
> > + * Return: allocated page or NULL on failure. NULL does not mean EBUSY or EAGAIN.
> > + * It means ENOMEM. There is no reason to call it again and expect !NULL.
>
> Should we explain that the "ENOMEM" doesn't necessarily mean the system is
> out of memory, but also that the calling context might be simply unlucky
> (preempted someone with the lock) and retrying in the same context can't
> help it?

Technically correct, but it opens the door for "retry" thinking:
"I called it and got unlucky, maybe I should retry once.. I promise
I won't loop forever".
So I really think the doc should say "ENOMEM. no reason to retry" like above.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t.
  2025-05-06 14:55     ` Vlastimil Babka
@ 2025-05-07  1:25       ` Alexei Starovoitov
  0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-07  1:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Tue, May 6, 2025 at 7:55 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 5/6/25 14:56, Vlastimil Babka wrote:
> > On 5/1/25 05:27, Alexei Starovoitov wrote:
> >> @@ -81,7 +84,7 @@ do {                                                               \
> >>      local_lock_debug_init(lock);                            \
> >>  } while (0)
> >>
> >> -#define __local_trylock_init(lock) __local_lock_init(lock.llock)
> >> +#define __local_trylock_init(lock) __local_lock_init((local_lock_t *)lock)
> >
> > This cast seems unnecessary. Better not hide mistakes when using the
> > local_trylock_init() macro.
>
> Nevermind, tested the wrong config, it's necessary. Sigh.

Yep. lockdep assumption is deep.
I see no other way of doing it.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked().
  2025-05-06 12:59   ` Vlastimil Babka
@ 2025-05-07  1:28     ` Alexei Starovoitov
  0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-07  1:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Tue, May 6, 2025 at 5:59 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 5/1/25 05:27, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Introduce local_lock_is_locked() that returns true when
> > given local_lock is locked by current cpu (in !PREEMPT_RT) or
> > by current task (in PREEMPT_RT).
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> On !RT this only works for local_trylock_t, which is fine, but maybe make it
> part of the name then? local_trylock_is_locked()?

Prior to the _Generic() conversion it would have made sense,
but now it would be inconsistent.
The type name is not part of the helper name:
it's local_lock, local_unlock, local_lock_is_locked.
local_trylock() might not even be used at all.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 5/6] mm: Allow GFP_ACCOUNT and GFP_COMP to be used in alloc_pages_nolock().
  2025-05-06  8:55   ` Vlastimil Babka
@ 2025-05-07  1:33     ` Alexei Starovoitov
  0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-07  1:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Tue, May 6, 2025 at 1:55 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 5/1/25 05:27, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > Allow __GFP_ACCOUNT and __GFP_COMP flags to be specified when calling
> > alloc_pages_nolock(), since upcoming reentrant alloc_slab_page() needs
> > to allocate __GFP_COMP pages while BPF infra needs __GFP_ACCOUNT.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> I would rather see __GFP_COMP implied as we don't need any new users of
> high-order allocations that are not compound, and alloc_pages_nolock() is a
> new API.
>
> IIRC it would be more important in some early version where
> free_pages_nolock() was different, now it just calls to ___free_pages()
> which has the tricky code for freeing those non-compound allocations.
>
> But still, I'd really recommend that approach. Note __GFP_COMP is simply
> ignored for order-0 so no need to filter it out.

All makes sense to me.
Will make __GFP_COMP the default for alloc_pages_nolock().
From the bpf side there is no use case for order > 0.
I'm adding support for __GFP_COMP and order > 0 only
to call it from alloc_slab_page().
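
I.e. roughly this (sketch of the adjusted hunk, untested):

	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
			| __GFP_COMP | gfp_flags;

	VM_WARN_ON_ONCE(gfp_flags & ~__GFP_ACCOUNT);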


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-06 12:01   ` Vlastimil Babka
  2025-05-07  0:31     ` Harry Yoo
@ 2025-05-07  2:20     ` Alexei Starovoitov
  2025-05-07 10:44       ` Vlastimil Babka
  1 sibling, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-07  2:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Tue, May 6, 2025 at 5:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 5/1/25 05:27, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > kmalloc_nolock() relies on ability of local_lock to detect the situation
> > when it's locked.
> > In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> > irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> > In that case retry the operation in a different kmalloc bucket.
> > The second attempt will likely succeed, since this cpu locked
> > different kmem_cache_cpu.
> > When local_lock_is_locked() sees locked memcg_stock.stock_lock
> > fallback to atomic operations.
> >
> > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> > into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> > a different bucket that is most likely not locked by current
> > task. Though it may be locked by a different task it's safe to
> > rt_spin_lock() on it.
> >
> > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > immediately if called from hard irq or NMI in PREEMPT_RT.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> In general I'd prefer if we could avoid local_lock_is_locked() usage outside
> of debugging code. It just feels hacky given we have local_trylock()
> operations. But I can see how this makes things simpler so it's probably
> acceptable.

local_lock_is_locked() is not for debugging.
It's gating further calls into slub internals.
If a particular bucket is locked the logic will use a different one.
There is no local_trylock() at all here.
In that sense it's very different from alloc_pages_nolock().
There we trylock first and, if not successful, go for plan B.
For kmalloc_nolock() we first check local_lock_is_locked();
if it's not locked we proceed and do
local_lock_irqsave_check() instead of local_lock_irqsave().
Both are unconditional and exactly the same without
CONFIG_DEBUG_LOCK_ALLOC.
The extra checks are there in the _check() version for debugging,
since local_lock_is_locked() is called much earlier in the call chain
and far from local_lock_irqsave(), so it's not trivial to see just by
reading the code.
If local_lock_is_locked() says that it's locked
we go for a different bucket which is pretty much guaranteed to
be unlocked.
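
Side by side, illustrative only (not the exact code in either patch):

	/* alloc_pages_nolock() style: trylock, fall back to plan B */
	if (!spin_trylock_irqsave(&zone->lock, flags))
		return NULL;

	/* kmalloc_nolock() style: ask first, then lock unconditionally */
	if (local_lock_is_locked(&s->cpu_slab->lock))
		return ERR_PTR(-EBUSY);	/* caller retries another bucket */
	/* ... */
	local_lock_irqsave_check(&s->cpu_slab->lock, flags);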

>
> > @@ -2458,13 +2468,21 @@ static void *setup_object(struct kmem_cache *s, void *object)
> >   * Slab allocation and freeing
> >   */
> >  static inline struct slab *alloc_slab_page(gfp_t flags, int node,
> > -             struct kmem_cache_order_objects oo)
> > +                                        struct kmem_cache_order_objects oo,
> > +                                        bool allow_spin)
> >  {
> >       struct folio *folio;
> >       struct slab *slab;
> >       unsigned int order = oo_order(oo);
> >
> > -     if (node == NUMA_NO_NODE)
> > +     if (unlikely(!allow_spin)) {
> > +             struct page *p = alloc_pages_nolock(__GFP_COMP, node, order);
> > +
> > +             if (p)
> > +                     /* Make the page frozen. Drop refcnt to zero. */
> > +                     put_page_testzero(p);
>
> This is dangerous. Once we create a refcounted (non-frozen) page, someone
> else (a pfn scanner like compaction) can do a get_page_unless_zero(), so the
> refcount becomes 2, then we decrement the refcount here to 1, the pfn
> scanner realizes it's not a page it can work with, do put_page() and frees
> it under us.

Something like isolate_migratepages_block() does that?
ok. good to know.

> The solution is to split out alloc_frozen_pages_nolock() to use from here,
> and make alloc_pages_nolock() use it too and then set refcounted.

understood.

> > +             folio = (struct folio *)p;
> > +     } else if (node == NUMA_NO_NODE)
> >               folio = (struct folio *)alloc_frozen_pages(flags, order);
> >       else
> >               folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
>
> <snip>
>
> > @@ -3958,8 +3989,28 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >        */
> >       c = slub_get_cpu_ptr(s->cpu_slab);
> >  #endif
> > +     if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
> > +             struct slab *slab;
> > +
> > +             slab = c->slab;
> > +             if (slab && !node_match(slab, node))
> > +                     /* In trylock mode numa node is a hint */
> > +                     node = NUMA_NO_NODE;
> > +
> > +             if (!local_lock_is_locked(&s->cpu_slab->lock)) {
> > +                     lockdep_assert_not_held(this_cpu_ptr(&s->cpu_slab->lock));
> > +             } else {
> > +                     /*
> > +                      * EBUSY is an internal signal to kmalloc_nolock() to
> > +                      * retry a different bucket. It's not propagated further.
> > +                      */
> > +                     p = ERR_PTR(-EBUSY);
> > +                     goto out;
>
> Am I right in my reasoning as follows?
>
> - If we're on RT and "in_nmi() || in_hardirq()" is true then
> kmalloc_nolock_noprof() would return NULL immediately and we never reach
> this code

correct.

> - local_lock_is_locked() on RT tests if the current process is the lock
> owner. This means (in absence of double locking bugs) that we locked it as
> task (or hardirq) and now we're either in_hardirq() (doesn't change current
> AFAIK?) preempting task, or in_nmi() preempting task or hardirq.

not quite.
There could be re-entrance due to kprobe/fentry/tracepoint.
Like trace_contention_begin().
The code is still preemptable.

> - so local_lock_is_locked() will never be true here on RT

hehe :)

To have good coverage I fuzz test this patch set with:

+extern void (*debug_callback)(void);
+#define local_unlock_irqrestore(lock, flags)			\
+	do {							\
+		if (debug_callback) debug_callback();		\
+		__local_unlock_irqrestore(lock, flags);		\
+	} while (0)

and randomly re-enter everywhere from debug_callback().
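
For illustration, a hypothetical callback wired into that hook could
look roughly like this (the names and the recursion guard are made up
for the sketch, not part of the patch):

static DEFINE_PER_CPU(int, fuzz_active);

static void reenter_slab(void)
{
	void *p;

	/* don't recurse through our own unlock path */
	if (this_cpu_read(fuzz_active))
		return;
	/* re-enter only occasionally */
	if (get_random_u32_below(16))
		return;

	this_cpu_write(fuzz_active, 1);
	p = kmalloc_nolock(32, 0, NUMA_NO_NODE);
	if (p)
		kfree_nolock(p);
	this_cpu_write(fuzz_active, 0);
}

/* installed early during boot:  debug_callback = reenter_slab; */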

> > +             }
> > +     }
> >
> >       p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> > +out:
> >  #ifdef CONFIG_PREEMPT_COUNT
> >       slub_put_cpu_ptr(s->cpu_slab);
> >  #endif
> > @@ -4162,8 +4213,9 @@ bool slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> >               if (p[i] && init && (!kasan_init ||
> >                                    !kasan_has_integrated_init()))
> >                       memset(p[i], 0, zero_size);
> > -             kmemleak_alloc_recursive(p[i], s->object_size, 1,
> > -                                      s->flags, init_flags);
> > +             if (gfpflags_allow_spinning(flags))
> > +                     kmemleak_alloc_recursive(p[i], s->object_size, 1,
> > +                                              s->flags, init_flags);
> >               kmsan_slab_alloc(s, p[i], init_flags);
> >               alloc_tagging_slab_alloc_hook(s, p[i], flags);
> >       }
> > @@ -4354,6 +4406,88 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
> >  }
> >  EXPORT_SYMBOL(__kmalloc_noprof);
> >
> > +/**
> > + * kmalloc_nolock - Allocate an object of given size from any context.
> > + * @size: size to allocate
> > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> > + * @node: node number of the target node.
> > + *
> > + * Return: pointer to the new object or NULL in case of error.
> > + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> > + * There is no reason to call it again and expect !NULL.
> > + */
> > +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> > +{
> > +     gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> > +     struct kmem_cache *s;
> > +     bool can_retry = true;
> > +     void *ret = ERR_PTR(-EBUSY);
> > +
> > +     VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> > +
> > +     if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> > +             return NULL;
> > +     if (unlikely(!size))
> > +             return ZERO_SIZE_PTR;
> > +
> > +     if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
> > +             /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> > +             return NULL;
> > +retry:
> > +     s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
>
> The idea of retrying on different bucket is based on wrong assumptions and
> thus won't work as you expect. kmalloc_slab() doesn't select buckets truly
> randomly, but deterministically via hashing from a random per-boot seed and
> the _RET_IP_, as the security hardening goal is to make different kmalloc()
> callsites get different caches with high probability.

Nothing here relies on randomness.
As Harry pointed out in the other reply,
there is one retry from a different bucket.
Everything is deterministic.

> And I wouldn't also recommend changing this for kmalloc_nolock_noprof() case
> as that could make the hardening weaker, and also not help for kernels that
> don't have it enabled, anyway.

This patch doesn't affect hardening.
If RANDOM_KMALLOC_CACHES is enabled it will affect
all callers of kmalloc_slab(), normal kmalloc and this kmalloc_nolock.
Protection is not weakened.

>
> > +
> > +     if (!(s->flags & __CMPXCHG_DOUBLE))
> > +             /*
> > +              * kmalloc_nolock() is not supported on architectures that
> > +              * don't implement cmpxchg16b.
> > +              */
> > +             return NULL;
> > +
> > +     /*
> > +      * Do not call slab_alloc_node(), since trylock mode isn't
> > +      * compatible with slab_pre_alloc_hook/should_failslab and
> > +      * kfence_alloc.
> > +      *
> > +      * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> > +      * in irq saved region. It assumes that the same cpu will not
> > +      * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> > +      * Therefore use in_nmi() to check whether particular bucket is in
> > +      * irq protected section.
> > +      */
> > +     if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> > +             ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
>
> Hm this is somewhat subtle. We're testing the local lock without having the
> cpu explicitly pinned. But the test only happens in_nmi() which implicitly
> is a context that won't migrate, so should work I think, but maybe should be
> more explicit in the comment?

Ok. I'll expand the comment right above this 'if'.

>
> <snip>
>
> >  /*
> >   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
> >   * can perform fastpath freeing without additional function calls.
> > @@ -4605,10 +4762,36 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
> >       barrier();
> >
> >       if (unlikely(slab != c->slab)) {
>
> Note this unlikely() is actually a lie. It's actually unlikely that the free
> will happen on the same cpu and with the same slab still being c->slab,
> unless it's a free following shortly a temporary object allocation.

I didn't change it, since you would have called it
an unrelated change in the patch :)
I can prepare a separate single line patch to remove unlikely() here,
but it's a micro optimization unrelated to this set.

> > -             __slab_free(s, slab, head, tail, cnt, addr);
> > +             /* cnt == 0 signals that it's called from kfree_nolock() */
> > +             if (unlikely(!cnt)) {
> > +                     /*
> > +                      * Use llist in cache_node ?
> > +                      * struct kmem_cache_node *n = get_node(s, slab_nid(slab));
> > +                      */
> > +                     /*
> > +                      * __slab_free() can locklessly cmpxchg16 into a slab,
> > +                      * but then it might need to take spin_lock or local_lock
> > +                      * in put_cpu_partial() for further processing.
> > +                      * Avoid the complexity and simply add to a deferred list.
> > +                      */
> > +                     llist_add(head, &s->defer_free_objects);
> > +             } else {
> > +                     free_deferred_objects(&s->defer_free_objects, addr);
>
> So I'm a bit vary that this is actually rather a fast path that might
> contend on the defer_free_objects from all cpus.

Well, in my current stress test I could only get this list
to contain a single digit number of objects.

> I'm wondering if we could make the list part of kmem_cache_cpu to distribute
> it,

doable, but kmem_cache_cpu *c = raw_cpu_ptr(s->cpu_slab);
runs in preemptible context, so there is a risk that
llist_add(.., &c->defer_free_objects);
will be accessing the per-cpu memory of another cpu.
llist_add() will still work correctly, but a cache line bounce is possible.
In kmem_cache I placed defer_free_objects after cpu_partial and oo,
so it should be cache hot.

> and hook the flushing e.g. to places where we do deactivate_slab() which
> should be much slower path,

I don't follow the idea.
If we don't process kmem_cache_cpu *c right here in do_slab_free(),
this llist will get large.
So we have to process it here, but if we do, what's the point
of an extra flush in deactivate_slab()?
Especially with an extra for_each_cpu() loop to reach every kmem_cache_cpu?

> and also free_to_partial_list() to handle
> SLUB_TINY/caches with debugging enabled.

SLUB_TINY... ohh. I didn't try it. Will fix.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-07  0:31     ` Harry Yoo
@ 2025-05-07  2:23       ` Alexei Starovoitov
  2025-05-07  8:38       ` Vlastimil Babka
  1 sibling, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-07  2:23 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Vlastimil Babka, bpf, linux-mm, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Tue, May 6, 2025 at 5:31 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, May 06, 2025 at 02:01:48PM +0200, Vlastimil Babka wrote:
> > On 5/1/25 05:27, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > >
> > > kmalloc_nolock() relies on ability of local_lock to detect the situation
> > > when it's locked.
> > > In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> > > irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> > > In that case retry the operation in a different kmalloc bucket.
> > > The second attempt will likely succeed, since this cpu locked
> > > different kmem_cache_cpu.
> > > When local_lock_is_locked() sees that memcg_stock.stock_lock is locked,
> > > fall back to atomic operations.
> > >
> > > Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> > > per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> > > into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> > > a different bucket that is most likely not locked by the current
> > > task. Though it may be locked by a different task, it's safe to
> > > rt_spin_lock() on it.
> > >
> > > Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> > > immediately if called from hard irq or NMI in PREEMPT_RT.
> > >
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> >
>
> ... snip ...
>
> > > @@ -4354,6 +4406,88 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
> > >  }
> > >  EXPORT_SYMBOL(__kmalloc_noprof);
> > >
> > > +/**
> > > + * kmalloc_nolock - Allocate an object of given size from any context.
> > > + * @size: size to allocate
> > > + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> > > + * @node: node number of the target node.
> > > + *
> > > + * Return: pointer to the new object or NULL in case of error.
> > > + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> > > + * There is no reason to call it again and expect !NULL.
> > > + */
> > > +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> > > +{
> > > +   gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> > > +   struct kmem_cache *s;
> > > +   bool can_retry = true;
> > > +   void *ret = ERR_PTR(-EBUSY);
> > > +
> > > +   VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> > > +
> > > +   if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> > > +           return NULL;
> > > +   if (unlikely(!size))
> > > +           return ZERO_SIZE_PTR;
> > > +
> > > +   if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
> > > +           /* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> > > +           return NULL;
> > > +retry:
> > > +   s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> >
> > The idea of retrying on different bucket is based on wrong assumptions and
> > thus won't work as you expect. kmalloc_slab() doesn't select buckets truly
> > randomly, but deterministically via hashing from a random per-boot seed and
> > the _RET_IP_, as the security hardening goal is to make different kmalloc()
> > callsites get different caches with high probability.
>
> It's not retrying with the same size, so I don't think it's relying on any
> assumption about random kmalloc caches. (yeah, it wastes some memory if
> allocated from the next size bucket)
>
>         if (PTR_ERR(ret) == -EBUSY) {
>                 if (can_retry) {
>                         /* pick the next kmalloc bucket */
>                         size = s->object_size + 1;
>                         /*
>                          * Another alternative is to
>                          * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
>                          * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
>                          * to retry from bucket of the same size.
>                          */
>                         can_retry = false;
>                         goto retry;
>                 }
>                 ret = NULL;
>         }
>
> By the way, it doesn't check if a kmalloc cache that can serve
> (s->object_size + 1) allocations actually exists, which is not true for
> the largest kmalloc cache?

Good catch.
I need to add a check for s->object_size + 1 < KMALLOC_MAX_CACHE_SIZE.
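
Something along these lines, i.e. bail out instead of retrying when we
are already at the largest kmalloc cache (sketch only; whether it ends
up as < or <= depends on how kmalloc_slab() rounds up):

	if (PTR_ERR(ret) == -EBUSY) {
		if (can_retry && s->object_size + 1 <= KMALLOC_MAX_CACHE_SIZE) {
			/* pick the next kmalloc bucket */
			size = s->object_size + 1;
			can_retry = false;
			goto retry;
		}
		ret = NULL;
	}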


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-07  0:31     ` Harry Yoo
  2025-05-07  2:23       ` Alexei Starovoitov
@ 2025-05-07  8:38       ` Vlastimil Babka
  1 sibling, 0 replies; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-07  8:38 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Alexei Starovoitov, bpf, linux-mm, shakeel.butt, mhocko, bigeasy,
	andrii, memxor, akpm, peterz, rostedt, hannes, willy

On 5/7/25 2:31 AM, Harry Yoo wrote:
> On Tue, May 06, 2025 at 02:01:48PM +0200, Vlastimil Babka wrote:
>>> @@ -4354,6 +4406,88 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
>>>  }
>>>  EXPORT_SYMBOL(__kmalloc_noprof);
>>>  
>>> +/**
>>> + * kmalloc_nolock - Allocate an object of given size from any context.
>>> + * @size: size to allocate
>>> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
>>> + * @node: node number of the target node.
>>> + *
>>> + * Return: pointer to the new object or NULL in case of error.
>>> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
>>> + * There is no reason to call it again and expect !NULL.
>>> + */
>>> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
>>> +{
>>> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
>>> +	struct kmem_cache *s;
>>> +	bool can_retry = true;
>>> +	void *ret = ERR_PTR(-EBUSY);
>>> +
>>> +	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
>>> +
>>> +	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
>>> +		return NULL;
>>> +	if (unlikely(!size))
>>> +		return ZERO_SIZE_PTR;
>>> +
>>> +	if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
>>> +		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
>>> +		return NULL;
>>> +retry:
>>> +	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
>>
>> The idea of retrying on different bucket is based on wrong assumptions and
>> thus won't work as you expect. kmalloc_slab() doesn't select buckets truly
>> randomly, but deterministically via hashing from a random per-boot seed and
>> the _RET_IP_, as the security hardening goal is to make different kmalloc()
>> callsites get different caches with high probability.
> 
> It's not retrying with the same size, so I don't think it's relying on any
> assumption about random kmalloc caches. (yeah, it wastes some memory if
> allocated from the next size bucket)

Yeah sorry about that, a big oversight on my side. I saw the word bucket
and made the wrong conclusion without checking the code more closely.
It's because I wasn't used to seeing it until Kees introduced the
kmem_buckets typedef. But I realize that's a set of differently sized
buckets, and bucket is just one cache of the set. Yeah retrying with a
larger size is a smart trick.

> 	if (PTR_ERR(ret) == -EBUSY) {
> 		if (can_retry) {
> 			/* pick the next kmalloc bucket */
> 			size = s->object_size + 1;
> 			/*
> 			 * Another alternative is to
> 			 * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
> 			 * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
> 			 * to retry from bucket of the same size.
> 			 */
> 			can_retry = false;
> 			goto retry;
> 		}
> 		ret = NULL;
> 	}
> 
> By the way, it doesn't check if a kmalloc cache that can serve
> (s->object_size + 1) allocations actually exists, which is not true for
> the largest kmalloc cache?
> 



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-07  2:20     ` Alexei Starovoitov
@ 2025-05-07 10:44       ` Vlastimil Babka
  0 siblings, 0 replies; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-07 10:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On 5/7/25 4:20 AM, Alexei Starovoitov wrote:
> On Tue, May 6, 2025 at 5:01 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 5/1/25 05:27, Alexei Starovoitov wrote:
>>> From: Alexei Starovoitov <ast@kernel.org>
>>>
>>> kmalloc_nolock() relies on ability of local_lock to detect the situation
>>> when it's locked.
>>> In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
>>> irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
>>> In that case retry the operation in a different kmalloc bucket.
>>> The second attempt will likely succeed, since this cpu locked
>>> different kmem_cache_cpu.
>>> When local_lock_is_locked() sees that memcg_stock.stock_lock is locked,
>>> fall back to atomic operations.
>>>
>>> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
>>> per-cpu rt_spin_lock is locked by current task. In this case re-entrance
>>> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
>>> a different bucket that is most likely not locked by the current
>>> task. Though it may be locked by a different task, it's safe to
>>> rt_spin_lock() on it.
>>>
>>> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
>>> immediately if called from hard irq or NMI in PREEMPT_RT.
>>>
>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>
>> In general I'd prefer if we could avoid local_lock_is_locked() usage outside
>> of debugging code. It just feels hacky given we have local_trylock()
>> operations. But I can see how this makes things simpler so it's probably
>> acceptable.
> 
> local_lock_is_locked() is not for debugging.
> It's gating further calls into slub internals.
> If a particular bucket is locked the logic will use a different one.
> There is no local_trylock() at all here.

It could be, but I can see how it would complicate things. Not worth it,
especially in case we manage to replace the current percpu scheme with
sheaves later.

> In that sense it's very different from alloc_pages_nolock().
> There we trylock first and if not successful go for plan B.
> For kmalloc_nolock() we first check whether local_lock_is_locked(),
> if not then proceed and do
> local_lock_irqsave_check() instead of local_lock_irqsave().
> Both are unconditional and exactly the same without
> CONFIG_DEBUG_LOCK_ALLOC.

Right.

> Extra checks are there in _check() version for debugging,
> since local_lock_is_locked() is called much earlier in the call chain
> and far from local_lock_irqsave. So not trivial to see by just
> code reading.

Right, we rely on the implication that once we find
local_lock_is_locked() is false, it cannot become suddenly locked later
even if we re-enable preemption in the meanwhile.

> If local_lock_is_locked() says that it's locked
> we go for a different bucket which is pretty much guaranteed to
> be unlocked.

OK.

>>> +             folio = (struct folio *)p;
>>> +     } else if (node == NUMA_NO_NODE)
>>>               folio = (struct folio *)alloc_frozen_pages(flags, order);
>>>       else
>>>               folio = (struct folio *)__alloc_frozen_pages(flags, order, node, NULL);
>>
>> <snip>
>>
>>> @@ -3958,8 +3989,28 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>        */
>>>       c = slub_get_cpu_ptr(s->cpu_slab);
>>>  #endif
>>> +     if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
>>> +             struct slab *slab;
>>> +
>>> +             slab = c->slab;
>>> +             if (slab && !node_match(slab, node))
>>> +                     /* In trylock mode numa node is a hint */
>>> +                     node = NUMA_NO_NODE;
>>> +
>>> +             if (!local_lock_is_locked(&s->cpu_slab->lock)) {
>>> +                     lockdep_assert_not_held(this_cpu_ptr(&s->cpu_slab->lock));
>>> +             } else {
>>> +                     /*
>>> +                      * EBUSY is an internal signal to kmalloc_nolock() to
>>> +                      * retry a different bucket. It's not propagated further.
>>> +                      */
>>> +                     p = ERR_PTR(-EBUSY);
>>> +                     goto out;
>>
>> Am I right in my reasoning as follows?
>>
>> - If we're on RT and "in_nmi() || in_hardirq()" is true then
>> kmalloc_nolock_noprof() would return NULL immediately and we never reach
>> this code
> 
> correct.
> 
>> - local_lock_is_locked() on RT tests if the current process is the lock
>> owner. This means (in absence of double locking bugs) that we locked it as
>> task (or hardirq) and now we're either in_hardirq() (doesn't change current
>> AFAIK?) preempting task, or in_nmi() preempting task or hardirq.
> 
> not quite.
> There could be re-entrance due to kprobe/fentry/tracepoint.
> Like trace_contention_begin().
> The code is still preemptable.

Hm right. Glad that I asked then and thanks for making me realize.

>> - so local_lock_is_locked() will never be true here on RT
> 
> hehe :)
> 
> To have good coverage I fuzz test this patch set with:
> 
> +extern void (*debug_callback)(void);
> +#define local_unlock_irqrestore(lock, flags)			\
> +	do {							\
> +		if (debug_callback) debug_callback();		\
> +		__local_unlock_irqrestore(lock, flags);		\
> +	} while (0)
> 
> and randomly re-enter everywhere from debug_callback().

Oh cool :)

>>
>> <snip>
>>
>>>  /*
>>>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
>>>   * can perform fastpath freeing without additional function calls.
>>> @@ -4605,10 +4762,36 @@ static __always_inline void do_slab_free(struct kmem_cache *s,
>>>       barrier();
>>>
>>>       if (unlikely(slab != c->slab)) {
>>
>> Note this unlikely() is actually a lie. It's actually unlikely that the free
>> will happen on the same cpu and with the same slab still being c->slab,
>> unless it's a free following shortly a temporary object allocation.
> 
> I didn't change it, since you would have called it
> an unrelated change in the patch :)

Right, I'm not suggesting changing it here and now, just that it might
be misleading by implying this is a slow path where we're fine doing
expensive things - I'm arguing we should be careful.

> I can prepare a separate single line patch to remove unlikely() here,
> but it's a micro optimization unrelated to this set.
> 
>>> -             __slab_free(s, slab, head, tail, cnt, addr);
>>> +             /* cnt == 0 signals that it's called from kfree_nolock() */
>>> +             if (unlikely(!cnt)) {
>>> +                     /*
>>> +                      * Use llist in cache_node ?
>>> +                      * struct kmem_cache_node *n = get_node(s, slab_nid(slab));
>>> +                      */
>>> +                     /*
>>> +                      * __slab_free() can locklessly cmpxchg16 into a slab,
>>> +                      * but then it might need to take spin_lock or local_lock
>>> +                      * in put_cpu_partial() for further processing.
>>> +                      * Avoid the complexity and simply add to a deferred list.
>>> +                      */
>>> +                     llist_add(head, &s->defer_free_objects);
>>> +             } else {
>>> +                     free_deferred_objects(&s->defer_free_objects, addr);
>>
>> So I'm a bit vary that this is actually rather a fast path that might
>> contend on the defer_free_objects from all cpus.
> 
> Well, in my current stress test I could only get this list
> to contain a single digit number of objects.

My worry isn't that the list would get long, but that we'd be checking it
on almost any kfree() (one that's not lucky enough to have slab == c->slab)
from all cpus. And every kfree_nolock() will have to make the cache line of
s->defer_free_objects exclusive to that cpu in the llist_add(), and then
all other cpus doing kfree() will have to refetch it while making it
shared again...

>> I'm wondering if we could make the list part of kmem_cache_cpu to distribute
>> it,
> 
> doable, but kmem_cache_cpu *c = raw_cpu_ptr(s->cpu_slab);
> is preemptable, so there is a risk that
> llist_add(.. , &c->defer_free_objects);
> will be accessing per-cpu memory of another cpu.
> llist_add() will work correctly, but cache line bounce is possible.

The cache line bounce due to occasional preemption should be much more
rare than on the single s->defer_free_objects cache line as described above.

> In kmem_cache I placed defer_free_objects after cpu_partial and oo,
> so it should be cache hot.

OTOH it would make the bouncing worse?

>> and hook the flushing e.g. to places where we do deactivate_slab() which
>> should be much slower path,
> 
> I don't follow the idea.
> If we don't process kmem_cache_cpu *c right here in do_slab_free()
> this llist will get large.

I'd hope deactivate_slab() should be frequent enough; it mainly means the
local cpu's slab was exhausted. If there are many frees there are
probably many allocations too.

But ok, maybe having the llist local to the cpu would be sufficient to
make it feasible to check it on every kfree() cheaply enough.
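
A rough sketch of that direction, purely to illustrate the idea (the
field move and the helper name are hypothetical, not part of the
posted patch):

/* defer_free_objects moved from struct kmem_cache into kmem_cache_cpu */
static void defer_free(struct kmem_cache *s, struct llist_node *node)
{
	/*
	 * raw_cpu_ptr() because this may run preemptibly; landing on a
	 * stale cpu only risks a cache line bounce, llist_add() itself
	 * stays correct.
	 */
	struct kmem_cache_cpu *c = raw_cpu_ptr(s->cpu_slab);

	llist_add(node, &c->defer_free_objects);
}

/* flushed from deactivate_slab() / free_to_partial_list() as suggested */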



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check()
  2025-05-01  3:27 ` [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check() Alexei Starovoitov
@ 2025-05-07 13:02   ` Vlastimil Babka
  2025-05-12 14:03   ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-07 13:02 UTC (permalink / raw)
  To: Alexei Starovoitov, bpf, linux-mm
  Cc: harry.yoo, shakeel.butt, mhocko, bigeasy, andrii, memxor, akpm,
	peterz, rostedt, hannes, willy

On 5/1/25 5:27 AM, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Introduce local_lock_irqsave_check() to check that local_lock is
> not taken recursively.
> In !PREEMPT_RT local_lock_irqsave() disables IRQ, but
> re-entrance is possible either from NMI or strategically placed
> kprobe. The code should call local_lock_is_locked() before proceeding
> to acquire a local_lock. Such local_lock_is_locked() might be called
> earlier in the call graph and there could be a lot of code
> between local_lock_is_locked() and local_lock_irqsave_check().
> 
> Without CONFIG_DEBUG_LOCK_ALLOC the local_lock_irqsave_check()
> is equivalent to local_lock_irqsave().
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>

While I agree with the principle, what I think is less ideal:

- it's an opt-in new API local_lock_irqsave_check(), so it requires changing
the callers that want the check enabled, even though it's
controlled by a debug config. We could just do the check in every
local_lock_*() operation? Perhaps it would be checking something that
can't ever trigger for instances that never use local_lock_is_locked()
(or local_trylock()) to determine the code flow. But maybe we can be
surprised, and the cost of the check everywhere is fine to pay with a
debug option.

Yes the check only supports local_trylock_t (on !RT) but we could handle
that with _Generic(), or maybe even turn local_lock's to full
local_trylock's to include the acquired field, when the debug option is
enabled?

- CONFIG_DEBUG_LOCK_ALLOC seems like the wrong config given its
name+description; isn't there something more fitting in the lock-related
debugging ecosystem?

- shouldn't lockdep just handle this already because this is about not
locking something that's already locked by us?

- a question below for the implementation:

> ---
>  include/linux/local_lock.h          | 13 +++++++++++++
>  include/linux/local_lock_internal.h | 19 +++++++++++++++++++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
> index 092ce89b162a..0d6efb0fdd15 100644
> --- a/include/linux/local_lock.h
> +++ b/include/linux/local_lock.h
> @@ -81,6 +81,19 @@
>  #define local_trylock_irqsave(lock, flags)			\
>  	__local_trylock_irqsave(lock, flags)
>  
> +/**
> + * local_lock_irqsave_check - Acquire a per CPU local lock, save and disable
> + *			      interrupts
> + * @lock:	The lock variable
> + * @flags:	Storage for interrupt flags
> + *
> + * This function checks that local_lock is not taken recursively.
> + * In !PREEMPT_RT re-entrance is possible either from NMI or kprobe.
> + * In PREEMPT_RT it checks that current task is not holding it.
> + */
> +#define local_lock_irqsave_check(lock, flags)			\
> +	__local_lock_irqsave_check(lock, flags)
> +
>  DEFINE_GUARD(local_lock, local_lock_t __percpu*,
>  	     local_lock(_T),
>  	     local_unlock(_T))
> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> index 263723a45ecd..7c4cc002bc68 100644
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -168,6 +168,15 @@ do {								\
>  /* preemption or migration must be disabled before calling __local_lock_is_locked */
>  #define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
>  
> +#define __local_lock_irqsave_check(lock, flags)					\
> +	do {									\
> +		if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC) &&			\
> +		    (!__local_lock_is_locked(lock) || in_nmi()))		\
> +			WARN_ON_ONCE(!__local_trylock_irqsave(lock, flags));	\

I'm wondering about the conditions here. If local_lock_is_locked() is
true and we're not in nmi, we just do nothing here, but that means this
just silently ignores the situation where we would lock in the task and
then try locking again in irq?
Shouldn't we just always trylock and warn if it fails? (but back to my
lockdep point this might be just duplicating what it already does?)

> +		else								\
> +			__local_lock_irqsave(lock, flags);			\
> +	} while (0)
> +
>  #define __local_lock_release(lock)					\
>  	do {								\
>  		local_trylock_t *tl;					\
> @@ -293,4 +302,14 @@ do {								\
>  #define __local_lock_is_locked(__lock)					\
>  	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)
>  
> +#define __local_lock_irqsave_check(lock, flags)				\
> +	do {								\
> +		typecheck(unsigned long, flags);			\
> +		flags = 0;						\
> +		migrate_disable();					\
> +		if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC))		\
> +			WARN_ON_ONCE(__local_lock_is_locked(lock));	\
> +		spin_lock(this_cpu_ptr((lock)));			\
> +	} while (0)
> +
>  #endif /* CONFIG_PREEMPT_RT */



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-01  3:27 ` [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
  2025-05-05 18:46   ` Shakeel Butt
  2025-05-06 12:01   ` Vlastimil Babka
@ 2025-05-09  1:03   ` Harry Yoo
  2025-06-24 17:13     ` SLAB_NO_CMPXCHG was:: " Alexei Starovoitov
  2 siblings, 1 reply; 41+ messages in thread
From: Harry Yoo @ 2025-05-09  1:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, vbabka, shakeel.butt, mhocko, bigeasy, andrii,
	memxor, akpm, peterz, rostedt, hannes, willy

On Wed, Apr 30, 2025 at 08:27:18PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> kmalloc_nolock() relies on ability of local_lock to detect the situation
> when it's locked.
> In !PREEMPT_RT local_lock_is_locked() is true only when NMI happened in
> irq saved region that protects _that specific_ per-cpu kmem_cache_cpu.
> In that case retry the operation in a different kmalloc bucket.
> The second attempt will likely succeed, since this cpu locked
> different kmem_cache_cpu.
> When local_lock_is_locked() sees that memcg_stock.stock_lock is locked,
> fall back to atomic operations.
> 
> Similarly, in PREEMPT_RT local_lock_is_locked() returns true when
> per-cpu rt_spin_lock is locked by current task. In this case re-entrance
> into the same kmalloc bucket is unsafe, and kmalloc_nolock() tries
> a different bucket that is most likely not locked by the current
> task. Though it may be locked by a different task, it's safe to
> rt_spin_lock() on it.
> 
> Similar to alloc_pages_nolock() the kmalloc_nolock() returns NULL
> immediately if called from hard irq or NMI in PREEMPT_RT.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/kasan.h |  13 +-
>  include/linux/slab.h  |   4 +
>  mm/kasan/common.c     |   5 +-
>  mm/memcontrol.c       |  60 ++++++++-
>  mm/slab.h             |   1 +
>  mm/slub.c             | 280 ++++++++++++++++++++++++++++++++++++++----
>  6 files changed, 330 insertions(+), 33 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index d5a8ab98035c..743f6d196d57 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -470,6 +470,7 @@ void * __must_check krealloc_noprof(const void *objp, size_t new_size,
>  #define krealloc(...)				alloc_hooks(krealloc_noprof(__VA_ARGS__))
>  
>  void kfree(const void *objp);
> +void kfree_nolock(const void *objp);
>  void kfree_sensitive(const void *objp);
>  size_t __ksize(const void *objp);
>  
> @@ -910,6 +911,9 @@ static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t f
>  }
>  #define kmalloc(...)				alloc_hooks(kmalloc_noprof(__VA_ARGS__))
>  
> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node);
> +#define kmalloc_nolock(...)			alloc_hooks(kmalloc_nolock_noprof(__VA_ARGS__))

As it takes node parameter, it should be kmalloc_node_nolock() instead?

> diff --git a/mm/slab.h b/mm/slab.h
> index 05a21dc796e0..1688749d2995 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -273,6 +273,7 @@ struct kmem_cache {
>  	unsigned int cpu_partial_slabs;
>  #endif
>  	struct kmem_cache_order_objects oo;
> +	struct llist_head defer_free_objects;
>  
>  	/* Allocation and freeing of slabs */
>  	struct kmem_cache_order_objects min;
> diff --git a/mm/slub.c b/mm/slub.c
> index dc9e729e1d26..307ea0135b92 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3918,7 +3949,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  
>  retry_load_slab:
>  
> -	local_lock_irqsave(&s->cpu_slab->lock, flags);
> +	local_lock_irqsave_check(&s->cpu_slab->lock, flags);
>  	if (unlikely(c->slab)) {
>  		void *flush_freelist = c->freelist;
>  		struct slab *flush_slab = c->slab;
> @@ -3958,8 +3989,28 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  	 */
>  	c = slub_get_cpu_ptr(s->cpu_slab);
>  #endif
> +	if (unlikely(!gfpflags_allow_spinning(gfpflags))) {
> +		struct slab *slab;
> +
> +		slab = c->slab;
> +		if (slab && !node_match(slab, node))
> +			/* In trylock mode numa node is a hint */
> +			node = NUMA_NO_NODE;

This logic can be moved to ___slab_alloc() as the code to ignore
node constraint (on some conditions) is already there?
> +
> +		if (!local_lock_is_locked(&s->cpu_slab->lock)) {
> +			lockdep_assert_not_held(this_cpu_ptr(&s->cpu_slab->lock));
> +		} else {
> +			/*
> +			 * EBUSY is an internal signal to kmalloc_nolock() to
> +			 * retry a different bucket. It's not propagated further.
> +			 */
> +			p = ERR_PTR(-EBUSY);
> +			goto out;
> +		}
> +	}
>  	p = ___slab_alloc(s, gfpflags, node, addr, c, orig_size);
> +out:
>  #ifdef CONFIG_PREEMPT_COUNT
>  	slub_put_cpu_ptr(s->cpu_slab);
>  #endif
> @@ -4354,6 +4406,88 @@ void *__kmalloc_noprof(size_t size, gfp_t flags)
>  }
>  EXPORT_SYMBOL(__kmalloc_noprof);
>  
> +/**
> + * kmalloc_nolock - Allocate an object of given size from any context.
> + * @size: size to allocate
> + * @gfp_flags: GFP flags. Only __GFP_ACCOUNT, __GFP_ZERO allowed.
> + * @node: node number of the target node.
> + *
> + * Return: pointer to the new object or NULL in case of error.
> + * NULL does not mean EBUSY or EAGAIN. It means ENOMEM.
> + * There is no reason to call it again and expect !NULL.
> + */
> +void *kmalloc_nolock_noprof(size_t size, gfp_t gfp_flags, int node)
> +{
> +	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_NOMEMALLOC | gfp_flags;
> +	struct kmem_cache *s;
> +	bool can_retry = true;
> +	void *ret = ERR_PTR(-EBUSY);
> +
> +	VM_WARN_ON_ONCE(gfp_flags & ~(__GFP_ACCOUNT | __GFP_ZERO));
> +
> +	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
> +		return NULL;
> +	if (unlikely(!size))
> +		return ZERO_SIZE_PTR;
> +
> +	if (!USE_LOCKLESS_FAST_PATH() && (in_nmi() || in_hardirq()))
> +		/* kmalloc_nolock() in PREEMPT_RT is not supported from irq */
> +		return NULL;
> +retry:
> +	s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> +
> +	if (!(s->flags & __CMPXCHG_DOUBLE))
> +		/*
> +		 * kmalloc_nolock() is not supported on architectures that
> +		 * don't implement cmpxchg16b.
> +		 */
> +		return NULL;

Hmm when someone uses slab debugging flags (e.g., passing boot
parameter slab_debug=FPZ as a hardening option on production [1], or
just for debugging), __CMPXCHG_DOUBLE is not set even when the arch
supports it.

Is it okay to fail all kmalloc_nolock() calls in such cases?

[1] https://lore.kernel.org/linux-mm/20250421165508.make.689-kees@kernel.org/

> +
> +	/*
> +	 * Do not call slab_alloc_node(), since trylock mode isn't
> +	 * compatible with slab_pre_alloc_hook/should_failslab and
> +	 * kfence_alloc.
> +	 *
> +	 * In !PREEMPT_RT ___slab_alloc() manipulates (freelist,tid) pair
> +	 * in irq saved region. It assumes that the same cpu will not
> +	 * __update_cpu_freelist_fast() into the same (freelist,tid) pair.
> +	 * Therefore use in_nmi() to check whether particular bucket is in
> +	 * irq protected section.
> +	 */
> +	if (!in_nmi() || !local_lock_is_locked(&s->cpu_slab->lock))
> +		ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
> +
> +	if (PTR_ERR(ret) == -EBUSY) {
> +		if (can_retry) {
> +			/* pick the next kmalloc bucket */
> +			size = s->object_size + 1;
> +			/*
> +			 * Another alternative is to
> +			 * if (memcg) alloc_gfp &= ~__GFP_ACCOUNT;
> +			 * else if (!memcg) alloc_gfp |= __GFP_ACCOUNT;
> +			 * to retry from bucket of the same size.
> +			 */
> +			can_retry = false;
> +			goto retry;
> +		}
> +		ret = NULL;
> +	}
> +
> +	maybe_wipe_obj_freeptr(s, ret);
> +	/*
> +	 * Make sure memcg_stock.stock_lock doesn't change cpu
> +	 * when memcg layers access it.
> +	 */
> +	slub_get_cpu_ptr(s->cpu_slab);
> +	slab_post_alloc_hook(s, NULL, alloc_gfp, 1, &ret,
> +			     slab_want_init_on_alloc(alloc_gfp, s), size);
> +	slub_put_cpu_ptr(s->cpu_slab);

Should this migration prevention really be in slab, not in memcg?

> +	ret = kasan_kmalloc(s, ret, size, alloc_gfp);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(kmalloc_nolock_noprof);
> +
>  void *__kmalloc_node_track_caller_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags,
>  					 int node, unsigned long caller)
>  {
> @@ -4568,6 +4701,30 @@ static void __slab_free(struct kmem_cache *s, struct slab *slab,
>  }
>  
>  #ifndef CONFIG_SLUB_TINY
> +static void free_deferred_objects(struct llist_head *llhead, unsigned long addr)
> +{

If SLUB_TINY is enabled and thus it does not perform deferred free,
what happens?

> +	struct llist_node *llnode, *pos, *t;
> +
> +	if (likely(llist_empty(llhead)))
> +		return;
> +
> +	llnode = llist_del_all(llhead);
> +	llist_for_each_safe(pos, t, llnode) {
> +		struct kmem_cache *s;
> +		struct slab *slab;
> +		void *x = pos;
> +
> +		slab = virt_to_slab(x);
> +		s = slab->slab_cache;
> +
> +		/*
> +		 * memcg, kasan_slab_pre are already done for 'x'.
> +		 * The only thing left is kasan_poison.
> +		 */
> +		kasan_slab_free(s, x, false, false, true);
> +		__slab_free(s, slab, x, x, 1, addr);
> +	}
> +}
>  /*
>   * Fastpath with forced inlining to produce a kfree and kmem_cache_free that
>   * can perform fastpath freeing without additional function calls.

--
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t.
  2025-05-01  3:27 ` [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
  2025-05-06 12:56   ` Vlastimil Babka
@ 2025-05-12 13:26   ` Sebastian Andrzej Siewior
  2025-05-12 16:46     ` Alexei Starovoitov
  1 sibling, 1 reply; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-12 13:26 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, vbabka, harry.yoo, shakeel.butt, mhocko, andrii,
	memxor, akpm, peterz, rostedt, hannes, willy

On 2025-04-30 20:27:14 [-0700], Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> lockdep_is_held() macro assumes that "struct lockdep_map dep_map;"
> is a top level field of any lock that participates in LOCKDEP.
> Make it so for local_trylock_t.
> 
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> ---
>  include/linux/local_lock_internal.h | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> index bf2bf40d7b18..29df45f95843 100644
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -17,7 +17,10 @@ typedef struct {
>  
>  /* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
>  typedef struct {
> -	local_lock_t	llock;
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +	struct lockdep_map	dep_map;
> +	struct task_struct	*owner;
> +#endif
>  	u8		acquired;
>  } local_trylock_t;

So this trick should make it work. I am not sure it is worth it. It
would avoid the cast down the road…

diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -17,10 +17,17 @@ typedef struct {
 
 /* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
 typedef struct {
+	union	{
+		local_lock_t		llock;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-	struct lockdep_map	dep_map;
-	struct task_struct	*owner;
+# define LOCK_PAD_SIZE (offsetof(local_lock_t, dep_map))
+		struct {
+			u8 __padding[LOCK_PAD_SIZE];
+			struct lockdep_map	dep_map;
+		};
+#undef LOCK_PAD_SIZE
 #endif
+	};
 	u8		acquired;
 } local_trylock_t;
 
@@ -34,7 +41,7 @@ typedef struct {
 	.owner = NULL,
 
 # define LOCAL_TRYLOCK_DEBUG_INIT(lockname)		\
-	LOCAL_LOCK_DEBUG_INIT(lockname)
+	.llock	= { LOCAL_LOCK_DEBUG_INIT(llock.lockname) }
 
 static inline void local_lock_acquire(local_lock_t *l)
 {


Sebastian


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check()
  2025-05-01  3:27 ` [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check() Alexei Starovoitov
  2025-05-07 13:02   ` Vlastimil Babka
@ 2025-05-12 14:03   ` Sebastian Andrzej Siewior
  2025-05-12 17:16     ` Alexei Starovoitov
  1 sibling, 1 reply; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-12 14:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, vbabka, harry.yoo, shakeel.butt, mhocko, andrii,
	memxor, akpm, peterz, rostedt, hannes, willy

On 2025-04-30 20:27:16 [-0700], Alexei Starovoitov wrote:
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -168,6 +168,15 @@ do {								\
>  /* preemption or migration must be disabled before calling __local_lock_is_locked */
>  #define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
>  
> +#define __local_lock_irqsave_check(lock, flags)					\
> +	do {									\
> +		if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC) &&			\
> +		    (!__local_lock_is_locked(lock) || in_nmi()))		\
> +			WARN_ON_ONCE(!__local_trylock_irqsave(lock, flags));	\
> +		else								\
> +			__local_lock_irqsave(lock, flags);			\
> +	} while (0)
> +

Hmm. If I see this right in SLUB then this is called from preemptible
context. Therefore the this_cpu_ptr() from __local_lock_is_locked()
should trigger a warning here.

This check variant provides only additional debugging and otherwise
behaves as local_lock_irqsave(). Therefore the in_nmi() case should return
immediately with a WARN_ON() regardless of whether the lock is available,
because the non-try variant should never be invoked from an NMI.

This looks like additional debug infrastructure that should be part of
local_lock_irqsave() itself, maybe hidden behind a debug switch which is
for now forced by one of the current users. I don't think we need an extra
interface for this. It might be restricted to the trylock_test variant
if it makes sense, but I wouldn't introduce an extra function for that.

>  #define __local_lock_release(lock)					\
>  	do {								\
>  		local_trylock_t *tl;					\

Sebastian


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked().
  2025-05-01  3:27 ` [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
  2025-05-06 12:59   ` Vlastimil Babka
@ 2025-05-12 14:56   ` Sebastian Andrzej Siewior
  2025-05-12 15:01     ` Vlastimil Babka
  1 sibling, 1 reply; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-12 14:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, vbabka, harry.yoo, shakeel.butt, mhocko, andrii,
	memxor, akpm, peterz, rostedt, hannes, willy

On 2025-04-30 20:27:15 [-0700], Alexei Starovoitov wrote:
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -285,4 +288,9 @@ do {								\
>  		__local_trylock(lock);				\
>  	})
>  
> +/* migration must be disabled before calling __local_lock_is_locked */
> +#include "../../kernel/locking/rtmutex_common.h"
> +#define __local_lock_is_locked(__lock)					\
> +	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)

So I've been looking at whether we really need rt_mutex_owner() or whether
rt_mutex_base_is_locked() could do the job. Judging from the slub free
case, rt_mutex_base_is_locked() would be just fine. The alloc case,
on the other hand, probably not so much. Then again, since we don't
accept allocations from hardirq or NMI, the "caller == owner" case should
never be observed. Unless something is buggy while debugging, and that
should then also be observed by lockdep. Right?

If there is another case where recursion can be observed and needs to be
addressed, I would prefer to move the function (+define) to
include/linux/rtmutex.h instead of doing this "../../" include.
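
For reference, the two checks being compared boil down to roughly this
(sketch, using the same per-cpu pointer as the patch):

	/* "someone holds it" - enough if we only want to dodge contention */
	rt_mutex_base_is_locked(&this_cpu_ptr(__lock)->lock)

	/* "we hold it" (what the patch uses) - detects recursion by the
	 * current task, e.g. via kprobe/tracing re-entrance
	 */
	rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current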

>  #endif /* CONFIG_PREEMPT_RT */

Sebastian


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked().
  2025-05-12 14:56   ` Sebastian Andrzej Siewior
@ 2025-05-12 15:01     ` Vlastimil Babka
  2025-05-12 15:23       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-12 15:01 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Alexei Starovoitov
  Cc: bpf, linux-mm, harry.yoo, shakeel.butt, mhocko, andrii, memxor,
	akpm, peterz, rostedt, hannes, willy

On 5/12/25 16:56, Sebastian Andrzej Siewior wrote:
> On 2025-04-30 20:27:15 [-0700], Alexei Starovoitov wrote:
>> --- a/include/linux/local_lock_internal.h
>> +++ b/include/linux/local_lock_internal.h
>> @@ -285,4 +288,9 @@ do {								\
>>  		__local_trylock(lock);				\
>>  	})
>>  
>> +/* migration must be disabled before calling __local_lock_is_locked */
>> +#include "../../kernel/locking/rtmutex_common.h"
>> +#define __local_lock_is_locked(__lock)					\
>> +	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)
> 
> So I've been looking if we really need rt_mutex_owner() or if
> rt_mutex_base_is_locked() could do the job. Judging from the slub-free
> case, the rt_mutex_base_is_locked() would be just fine. The alloc case
> on the other hand probably not so much. On the other hand since we don't
> accept allocations from hardirq or NMI the "caller == owner" case should
> never be observed. Unless buggy & debugging and this should then also be
> observed by lockdep. Right?

AFAIU my same line of thought was debunked by Alexei here:

https://lore.kernel.org/all/CAADnVQLO9YX2_0wEZshHbwXoJY2-wv3OgVGvN-hgf6mK0_ipxw@mail.gmail.com/

e.g. you could have the lock and then due to kprobe or tracing in the slab
allocator code re-enter it.

> If there is another case where recursion can be observed and need to be
> addressed I would prefer to move the function (+define) to
> include/linux/rtmutex.h. instead of doing this "../../ include".
> 
>>  #endif /* CONFIG_PREEMPT_RT */
> 
> Sebastian



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked().
  2025-05-12 15:01     ` Vlastimil Babka
@ 2025-05-12 15:23       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 41+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-12 15:23 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexei Starovoitov, bpf, linux-mm, harry.yoo, shakeel.butt,
	mhocko, andrii, memxor, akpm, peterz, rostedt, hannes, willy

On 2025-05-12 17:01:55 [+0200], Vlastimil Babka wrote:
> On 5/12/25 16:56, Sebastian Andrzej Siewior wrote:
> > On 2025-04-30 20:27:15 [-0700], Alexei Starovoitov wrote:
> >> --- a/include/linux/local_lock_internal.h
> >> +++ b/include/linux/local_lock_internal.h
> >> @@ -285,4 +288,9 @@ do {								\
> >>  		__local_trylock(lock);				\
> >>  	})
> >>  
> >> +/* migration must be disabled before calling __local_lock_is_locked */
> >> +#include "../../kernel/locking/rtmutex_common.h"
> >> +#define __local_lock_is_locked(__lock)					\
> >> +	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)
> > 
> > So I've been looking if we really need rt_mutex_owner() or if
> > rt_mutex_base_is_locked() could do the job. Judging from the slub-free
> > case, the rt_mutex_base_is_locked() would be just fine. The alloc case
> > on the other hand probably not so much. On the other hand since we don't
> > accept allocations from hardirq or NMI the "caller == owner" case should
> > never be observed. Unless buggy & debugging and this should then also be
> > observed by lockdep. Right?
> 
> AFAIU my same line of thought was debunked by Alexei here:
> 
> https://lore.kernel.org/all/CAADnVQLO9YX2_0wEZshHbwXoJY2-wv3OgVGvN-hgf6mK0_ipxw@mail.gmail.com/
> 
> e.g. you could have the lock and then due to kprobe or tracing in the slab
> allocator code re-enter it.

Okay. So I assumed that re-entrance is not a thing here on PREEMPT_RT, but
thank you for correcting me.
There is a difference between the case where the lock is locked and you
try a different one, if possible, just to avoid the contention (otherwise
falling back to the contended case if there is no other way), and the
other case, which is avoiding recursive locking.

> > If there is another case where recursion can be observed and need to be
> > addressed I would prefer to move the function (+define) to
> > include/linux/rtmutex.h. instead of doing this "../../ include".
> > 
> >>  #endif /* CONFIG_PREEMPT_RT */

Sebastian


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t.
  2025-05-12 13:26   ` Sebastian Andrzej Siewior
@ 2025-05-12 16:46     ` Alexei Starovoitov
  0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-12 16:46 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, linux-mm, Vlastimil Babka, Harry Yoo, Shakeel Butt,
	Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Mon, May 12, 2025 at 6:26 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2025-04-30 20:27:14 [-0700], Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> >
> > lockdep_is_held() macro assumes that "struct lockdep_map dep_map;"
> > is a top level field of any lock that participates in LOCKDEP.
> > Make it so for local_trylock_t.
> >
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > ---
> >  include/linux/local_lock_internal.h | 9 ++++++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> > index bf2bf40d7b18..29df45f95843 100644
> > --- a/include/linux/local_lock_internal.h
> > +++ b/include/linux/local_lock_internal.h
> > @@ -17,7 +17,10 @@ typedef struct {
> >
> >  /* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
> >  typedef struct {
> > -     local_lock_t    llock;
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> > +     struct lockdep_map      dep_map;
> > +     struct task_struct      *owner;
> > +#endif
> >       u8              acquired;
> >  } local_trylock_t;
>
> So this trick should make it work. I am not sure it is worth it. It
> would avoid the cast down the road…
>
> diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
> --- a/include/linux/local_lock_internal.h
> +++ b/include/linux/local_lock_internal.h
> @@ -17,10 +17,17 @@ typedef struct {
>
>  /* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
>  typedef struct {
> +       union   {
> +               local_lock_t            llock;
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> -       struct lockdep_map      dep_map;
> -       struct task_struct      *owner;
> +# define LOCK_PAD_SIZE (offsetof(local_lock_t, dep_map))
> +               struct {
> +                       u8 __padding[LOCK_PAD_SIZE];
> +                       struct lockdep_map      dep_map;
> +               };
> +#undef LOCK_PAD_SIZE

I don't like it. It obfuscates the layout.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check()
  2025-05-12 14:03   ` Sebastian Andrzej Siewior
@ 2025-05-12 17:16     ` Alexei Starovoitov
  2025-05-13  6:58       ` Vlastimil Babka
  0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-12 17:16 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, linux-mm, Vlastimil Babka, Harry Yoo, Shakeel Butt,
	Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Mon, May 12, 2025 at 7:04 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2025-04-30 20:27:16 [-0700], Alexei Starovoitov wrote:
> > --- a/include/linux/local_lock_internal.h
> > +++ b/include/linux/local_lock_internal.h
> > @@ -168,6 +168,15 @@ do {                                                             \
> >  /* preemption or migration must be disabled before calling __local_lock_is_locked */
> >  #define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
> >
> > +#define __local_lock_irqsave_check(lock, flags)                                      \
> > +     do {                                                                    \
> > +             if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC) &&                      \
> > +                 (!__local_lock_is_locked(lock) || in_nmi()))                \
> > +                     WARN_ON_ONCE(!__local_trylock_irqsave(lock, flags));    \
> > +             else                                                            \
> > +                     __local_lock_irqsave(lock, flags);                      \
> > +     } while (0)
> > +
>
> Hmm. If I see this right in SLUB then this is called from preemptible
> context. Therefore the this_cpu_ptr() from __local_lock_is_locked()
> should trigger a warning here.

When called from preemptible context, migration is disabled. So no warning.

> This check variant provides only additional debugging and otherwise
> behaves as local_lock_irqsave(). Therefore the in_nmi() should return
> immediately with a WARN_ON() regardless if the lock is available or not
> because the non-try variant should never be invoked from an NMI.

The non-try variant can be invoked from NMI, because the earlier
__local_lock_is_locked() check tells us that the lock is not locked,
so taking it is safe.
That's the main challenge here:
the local_lock_irqsave_check() macro has to fight lockdep.

> This looks like additional debug infrastructure that should be part of
> local_lock_irqsave() itself,

The pattern of

if (!__local_lock_is_locked(lock)) {
   .. lots of code..
   local_lock_irqsave(lock);

is foreign to lockdep.

Since it can be called from NMI, lockdep just hates it:

[ 1021.956825] inconsistent {INITIAL USE} -> {IN-NMI} usage.
...
[ 1021.956888]   lock(per_cpu_ptr(&lock));
[ 1021.956890]   <Interrupt>
[ 1021.956891]     lock(per_cpu_ptr(&lock));
..

and technically lockdep is correct.
For any normal lock it's a deadlock waiting to happen,
but not here.

Even without NMI, lockdep doesn't like it:
[   14.627331] page_alloc_kthr/1965 is trying to acquire lock:
[   14.627331] ffff8881f6ebe0f0 ((local_lock_t
*)&c->lock){-.-.}-{3:3}, at: ___slab_alloc+0x9a9/0x1ab0
[   14.627331]
[   14.627331] but task is already holding lock:
[   14.627331] ffff8881f6ebd490 ((local_lock_t
*)&c->lock){-.-.}-{3:3}, at: ___slab_alloc+0xc7/0x1ab0
[   14.627331]
[   14.627331] other info that might help us debug this:
[   14.627331]  Possible unsafe locking scenario:
[   14.627331]
[   14.627331]        CPU0
[   14.627331]        ----
[   14.627331]   lock((local_lock_t *)&c->lock);
[   14.627331]   lock((local_lock_t *)&c->lock);

When slub is holding lock ...bd490 we detect it with
__local_lock_is_locked(),
then we check that lock ..be0f0 is not locked,
and proceed to acquire it, but
lockdep will show the above splat.

So local_lock_irqsave_check() is a workaround to avoid
these two false positives from lockdep.

Your and Vlastimil's observation is correct: ideally
local_lock_irqsave() should just handle it,
but I don't see how to do that.
How can lockdep understand the if (!locked()) lock() pattern?
Such usage is correct only for a per-cpu local lock, and only when
migration is disabled from the check to the acquire.
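
To make that concrete, here is a minimal sketch of the pattern
(function and field names are made up for illustration; only the
check-then-acquire ordering with migration disabled matters):

static bool bucket_do_work(local_trylock_t __percpu *lock)
{
	unsigned long flags;
	bool done = false;

	/* keep this CPU's lock instance stable across the check */
	migrate_disable();
	if (!local_lock_is_locked(lock)) {
		/* safe: this cpu/task is not already inside the lock */
		local_lock_irqsave(lock, flags);
		/* ... critical section ... */
		local_unlock_irqrestore(lock, flags);
		done = true;
	}
	migrate_enable();
	return done;
}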

Hence the macro is doing:
if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC) &&
   (!__local_lock_is_locked(lock) || in_nmi()))
         WARN_ON_ONCE(!__local_trylock_irqsave(lock, flags));

The in_nmi() part is a workaround for the first lockdep splat
and __local_lock_is_locked() is a workaround for the second one,
even though the caller already did the __local_lock_is_locked() check.

In your other email you wonder whether
rt_mutex_base_is_locked() should be enough.
It's not.
We need to check:
#define __local_lock_is_locked(__lock) \
	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)

Because the following sequence is normal in PREEMPT_RT:
kmalloc
  local_lock_irqsave(lock_A)
     preemption
        kmalloc_nolock
           if (is_locked(lock_A) == true)
               retry:  is_locked(lock_B) == false
                         local_lock_irqsave_check(lock_B)

while lock_B could be locked on another CPU by a different task.
So we cannot trylock(lock_B) here.
Hence in PREEMPT_RT
__local_lock_irqsave_check() is doing:
WARN_ON_ONCE(__local_lock_is_locked(lock));
spin_lock(this_cpu_ptr((lock)));
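
Putting the PREEMPT_RT pieces together, a sketch would look roughly
like this (assumed shape based on the description above, not the exact
patch hunk; the typecheck()/flags lines merely follow the existing RT
__local_lock_irqsave() convention):

#ifdef CONFIG_PREEMPT_RT
/* "locked" here means "locked by the current task" */
#define __local_lock_is_locked(__lock)					\
	(rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current)

#define __local_lock_irqsave_check(lock, flags)				\
	do {								\
		typecheck(unsigned long, flags);			\
		flags = 0;						\
		/* must not be held by this task already */		\
		WARN_ON_ONCE(__local_lock_is_locked(lock));		\
		spin_lock(this_cpu_ptr((lock)));			\
	} while (0)
#endif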


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check()
  2025-05-12 17:16     ` Alexei Starovoitov
@ 2025-05-13  6:58       ` Vlastimil Babka
  2025-05-13 21:55         ` Alexei Starovoitov
  0 siblings, 1 reply; 41+ messages in thread
From: Vlastimil Babka @ 2025-05-13  6:58 UTC (permalink / raw)
  To: Alexei Starovoitov, Sebastian Andrzej Siewior
  Cc: bpf, linux-mm, Harry Yoo, Shakeel Butt, Michal Hocko,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Steven Rostedt, Johannes Weiner, Matthew Wilcox

On 5/12/25 19:16, Alexei Starovoitov wrote:
> On Mon, May 12, 2025 at 7:04 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>>
>> On 2025-04-30 20:27:16 [-0700], Alexei Starovoitov wrote:
>> > --- a/include/linux/local_lock_internal.h
>> > +++ b/include/linux/local_lock_internal.h
>> > @@ -168,6 +168,15 @@ do {                                                             \
>> >  /* preemption or migration must be disabled before calling __local_lock_is_locked */
>> >  #define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
>> >
>> > +#define __local_lock_irqsave_check(lock, flags)                                      \
>> > +     do {                                                                    \
>> > +             if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC) &&                      \
>> > +                 (!__local_lock_is_locked(lock) || in_nmi()))                \
>> > +                     WARN_ON_ONCE(!__local_trylock_irqsave(lock, flags));    \
>> > +             else                                                            \
>> > +                     __local_lock_irqsave(lock, flags);                      \
>> > +     } while (0)
>> > +
>>
>> Hmm. If I see this right in SLUB then this is called from preemptible
>> context. Therefore the this_cpu_ptr() from __local_lock_is_locked()
>> should trigger a warning here.
> 
> When preemptible the migration is disabled. So no warning.
> 
>> This check variant provides only additional debugging and otherwise
>> behaves as local_lock_irqsave(). Therefore the in_nmi() should return
>> immediately with a WARN_ON() regardless if the lock is available or not
>> because the non-try variant should never be invoked from an NMI.
> 
> non-try variant can be invoked from NMI, because the earlier
> __local_lock_is_locked() check tells us that the lock is not locked.
> And it's safe to do.
> And that's the main challenge here.
> local_lock_irqsave_check() macro fights lockdep here.
> 
>> This looks like additional debug infrastructure that should be part of
>> local_lock_irqsave() itself,
> 
> The pattern of
> 
> if (!__local_lock_is_locked(lock)) {
>    .. lots of code..
>    local_lock_irqsave(lock);
> 
> is foreign to lockdep.
> 
> Since it can be called from NMI the lockdep just hates it:
> 
> [ 1021.956825] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> ...
> [ 1021.956888]   lock(per_cpu_ptr(&lock));
> [ 1021.956890]   <Interrupt>
> [ 1021.956891]     lock(per_cpu_ptr(&lock));
> ..
> 
> and technically lockdep is correct.
> For any normal lock it's a deadlock waiting to happen,
> but not here.
> 
> Even without NMI the lockdep doesn't like it:
> [   14.627331] page_alloc_kthr/1965 is trying to acquire lock:
> [   14.627331] ffff8881f6ebe0f0 ((local_lock_t
> *)&c->lock){-.-.}-{3:3}, at: ___slab_alloc+0x9a9/0x1ab0
> [   14.627331]
> [   14.627331] but task is already holding lock:
> [   14.627331] ffff8881f6ebd490 ((local_lock_t
> *)&c->lock){-.-.}-{3:3}, at: ___slab_alloc+0xc7/0x1ab0
> [   14.627331]
> [   14.627331] other info that might help us debug this:
> [   14.627331]  Possible unsafe locking scenario:
> [   14.627331]
> [   14.627331]        CPU0
> [   14.627331]        ----
> [   14.627331]   lock((local_lock_t *)&c->lock);
> [   14.627331]   lock((local_lock_t *)&c->lock);
> 
> When slub is holding lock ...bd490 we detect it with
> __local_lock_is_locked(),
> then we check that lock ..be0f0 is not locked,
> and proceed to acquire it, but
> lockdep will show the above splat.
> 
> So local_lock_irqsave_check() is a workaround to avoid
> these two false positives from lockdep.
> 
> Yours and Vlastimil's observation is correct, that ideally
> local_lock_irqsave() should just handle it,
> but I don't see how to do it.
> How can lockdep understand the if (!locked()) lock() pattern ?
> Such usage is correct only for per-cpu local lock when migration
> is disabled from check to acquire.

Thanks, I think I finally understand the issue and why a _check variant is
necessary. As a general note, since this is so tricky, having more details in
comments and commit messages can't hurt, so we can understand it sooner :)

Again this would be all simpler if we could just use trylock instead of
_check(), but then we need to handle the fallbacks. And AFAIU on RT trylock
can fail "spuriously", i.e. when we don't really preempt ourselves, as we
discussed in that memcg thread.

> Hence the macro is doing:
> if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC) &&
>    (!__local_lock_is_locked(lock) || in_nmi()))
>          WARN_ON_ONCE(!__local_trylock_irqsave(lock, flags));
> 
> in_nmi() part is a workaround for the first lockdep splat
> and __local_lock_is_locked() is a workaround for 2nd lockdep splat,
> though the code did __local_lock_is_locked() check already.

So here's where it would be useful to have that info in a comment.
However, I wonder about it: the code uses __local_trylock_irqsave(), so
lockdep should see it as an opportunistic attempt and not splat, since the
trylock alone should avoid deadlock - if not, we might have a bug in
the lockdep bits of trylock.
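
For reference, the lockdep side of a trylock boils down to roughly the
following (a paraphrase assuming CONFIG_DEBUG_LOCK_ALLOC; the real code
goes through lock_map_acquire_try() and a couple of wrapper layers):

static inline void local_trylock_acquire_sketch(local_trylock_t *tl)
{
	/*
	 * trylock=1: an opportunistic acquire that cannot block, so
	 * lockdep is not supposed to build a deadlock scenario from it.
	 */
	lock_acquire(&tl->dep_map, 0, 1, 0, 1, NULL, _THIS_IP_);
}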

> In your other email you wonder whether
> rt_mutex_base_is_locked() should be enough.
> It's not.
> We need to check:
> __local_lock_is_locked(__lock) \
> rt_mutex_owner(&this_cpu_ptr(__lock)->lock) == current
> 
> Because the following sequence is normal in PREEMP_RT:
> kmalloc
>   local_lock_irqsave(lock_A)
>      preemption
>         kmalloc_nolock
>            if (is_locked(lock_A) == true)
>                retry:  is_locked(lock_B) == false
>                          local_lock_irqsave_check(lock_B)
> 
> while lock_B could be locked on another CPU by a different task.
> So we cannot trylock(lock_B) here.
> Hence in PREEMPT_RT
> __local_lock_irqsave_check() is doing:
> WARN_ON_ONCE(__local_lock_is_locked(lock));
> spin_lock(this_cpu_ptr((lock)));



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check()
  2025-05-13  6:58       ` Vlastimil Babka
@ 2025-05-13 21:55         ` Alexei Starovoitov
  0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-05-13 21:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Sebastian Andrzej Siewior, bpf, linux-mm, Harry Yoo, Shakeel Butt,
	Michal Hocko, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Tue, May 13, 2025 at 08:58:43AM +0200, Vlastimil Babka wrote:
> On 5/12/25 19:16, Alexei Starovoitov wrote:
> > On Mon, May 12, 2025 at 7:04 AM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> >>
> >> On 2025-04-30 20:27:16 [-0700], Alexei Starovoitov wrote:
> >> > --- a/include/linux/local_lock_internal.h
> >> > +++ b/include/linux/local_lock_internal.h
> >> > @@ -168,6 +168,15 @@ do {                                                             \
> >> >  /* preemption or migration must be disabled before calling __local_lock_is_locked */
> >> >  #define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
> >> >
> >> > +#define __local_lock_irqsave_check(lock, flags)                                      \
> >> > +     do {                                                                    \
> >> > +             if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC) &&                      \
> >> > +                 (!__local_lock_is_locked(lock) || in_nmi()))                \
> >> > +                     WARN_ON_ONCE(!__local_trylock_irqsave(lock, flags));    \
> >> > +             else                                                            \
> >> > +                     __local_lock_irqsave(lock, flags);                      \
> >> > +     } while (0)
> >> > +
> >>
> >> Hmm. If I see this right in SLUB then this is called from preemptible
> >> context. Therefore the this_cpu_ptr() from __local_lock_is_locked()
> >> should trigger a warning here.
> > 
> > When preemptible the migration is disabled. So no warning.
> > 
> >> This check variant provides only additional debugging and otherwise
> >> behaves as local_lock_irqsave(). Therefore the in_nmi() should return
> >> immediately with a WARN_ON() regardless if the lock is available or not
> >> because the non-try variant should never be invoked from an NMI.
> > 
> > non-try variant can be invoked from NMI, because the earlier
> > __local_lock_is_locked() check tells us that the lock is not locked.
> > And it's safe to do.
> > And that's the main challenge here.
> > local_lock_irqsave_check() macro fights lockdep here.
> > 
> >> This looks like additional debug infrastructure that should be part of
> >> local_lock_irqsave() itself,
> > 
> > The pattern of
> > 
> > if (!__local_lock_is_locked(lock)) {
> >    .. lots of code..
> >    local_lock_irqsave(lock);
> > 
> > is foreign to lockdep.
> > 
> > Since it can be called from NMI the lockdep just hates it:
> > 
> > [ 1021.956825] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> > ...
> > [ 1021.956888]   lock(per_cpu_ptr(&lock));
> > [ 1021.956890]   <Interrupt>
> > [ 1021.956891]     lock(per_cpu_ptr(&lock));
> > ..
> > 
> > and technically lockdep is correct.
> > For any normal lock it's a deadlock waiting to happen,
> > but not here.
> > 
> > Even without NMI the lockdep doesn't like it:
> > [   14.627331] page_alloc_kthr/1965 is trying to acquire lock:
> > [   14.627331] ffff8881f6ebe0f0 ((local_lock_t
> > *)&c->lock){-.-.}-{3:3}, at: ___slab_alloc+0x9a9/0x1ab0
> > [   14.627331]
> > [   14.627331] but task is already holding lock:
> > [   14.627331] ffff8881f6ebd490 ((local_lock_t
> > *)&c->lock){-.-.}-{3:3}, at: ___slab_alloc+0xc7/0x1ab0
> > [   14.627331]
> > [   14.627331] other info that might help us debug this:
> > [   14.627331]  Possible unsafe locking scenario:
> > [   14.627331]
> > [   14.627331]        CPU0
> > [   14.627331]        ----
> > [   14.627331]   lock((local_lock_t *)&c->lock);
> > [   14.627331]   lock((local_lock_t *)&c->lock);
> > 
> > When slub is holding lock ...bd490 we detect it with
> > __local_lock_is_locked(),
> > then we check that lock ..be0f0 is not locked,
> > and proceed to acquire it, but
> > lockdep will show the above splat.
> > 
> > So local_lock_irqsave_check() is a workaround to avoid
> > these two false positives from lockdep.
> > 
> > Yours and Vlastimil's observation is correct, that ideally
> > local_lock_irqsave() should just handle it,
> > but I don't see how to do it.
> > How can lockdep understand the if (!locked()) lock() pattern ?
> > Such usage is correct only for per-cpu local lock when migration
> > is disabled from check to acquire.
> 
> Thanks, I think I finally understand the issue and why a _check variant is
> necessary. As a general note as this is so tricky, having more details in
> comments and commit messages can't hurt so we can understand it sooner :)
> 
> Again this would be all simpler if we could just use trylock instead of
> _check(), but then we need to handle the fallbacks. And AFAIU on RT trylock
> can fail "spuriously", i.e. when we don't really preempt ourselves, as we
> discussed in that memcg thread.
> 
> > Hence the macro is doing:
> > if (IS_ENABLED(CONFIG_DEBUG_LOCK_ALLOC) &&
> >    (!__local_lock_is_locked(lock) || in_nmi()))
> >          WARN_ON_ONCE(!__local_trylock_irqsave(lock, flags));
> > 
> > in_nmi() part is a workaround for the first lockdep splat
> > and __local_lock_is_locked() is a workaround for 2nd lockdep splat,
> > though the code did __local_lock_is_locked() check already.
> 
> So here's where this would be useful to have that info in a comment.
> However, I wonder about it, as the code uses __local_trylock_irqsave(), so
> lockdep should see it as an opportunistic attempt and not splat as that
> trylock alone should be avoiding deadlock - if not we might have a bug in
> the lockdep bits of trylock.

Point taken. The comments need to be more detailed.

I've been thinking of a way to avoid local_lock_irqsave_check() and
came up with the following:

diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 94be15d574ad..58ac29f4ba9b 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -79,7 +79,7 @@ do {                                                          \
                                                                \
        debug_check_no_locks_freed((void *)lock, sizeof(*lock));\
        lockdep_init_map_type(&(lock)->dep_map, #lock, &__key,  \
-                             0, LD_WAIT_CONFIG, LD_WAIT_INV,   \
+                             1, LD_WAIT_CONFIG, LD_WAIT_INV,   \
                              LD_LOCK_PERCPU);                  \
        local_lock_debug_init(lock);                            \
 } while (0)
@@ -166,11 +166,21 @@ do {                                                              \
        })

 /* preemption or migration must be disabled before calling __local_lock_is_locked */
-#define __local_lock_is_locked(lock) READ_ONCE(this_cpu_ptr(lock)->acquired)
+#define __local_lock_is_locked(lock)                                   \
+       ({                                                              \
+               bool ret = READ_ONCE(this_cpu_ptr(lock)->acquired);     \
+                                                                       \
+               if (!ret)                                               \
+                       this_cpu_ptr(lock)->dep_map.flags = LOCAL_LOCK_UNLOCKED;\
+               ret; \
+       })
+
+#define __local_lock_flags_clear(lock) \
+       do { this_cpu_ptr(lock)->dep_map.flags = 0; } while (0)

It would need to be wrapped into macros for !LOCKDEP, of course.
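
For illustration, the !CONFIG_DEBUG_LOCK_ALLOC side could stay trivial
(hypothetical wrappers following the naming above; without lockdep
there are no dep_map flags to manage):

#ifndef CONFIG_DEBUG_LOCK_ALLOC
# define __local_lock_is_locked(lock)		\
	READ_ONCE(this_cpu_ptr(lock)->acquired)
# define __local_lock_flags_clear(lock)		do { } while (0)
#endif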

diff --git a/include/linux/lockdep_types.h b/include/linux/lockdep_types.h
index 9f361d3ab9d9..6c580081ace3 100644
--- a/include/linux/lockdep_types.h
+++ b/include/linux/lockdep_types.h
@@ -190,13 +190,15 @@ struct lockdep_map {
        u8                              wait_type_outer; /* can be taken in this context */
        u8                              wait_type_inner; /* presents this context */
        u8                              lock_type;
-       /* u8                           hole; */
+       u8                              flags;
 #ifdef CONFIG_LOCK_STAT
        int                             cpu;
        unsigned long                   ip;
 #endif
 };

+#define LOCAL_LOCK_UNLOCKED            1

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 58d78a33ac65..0eadee339e1f 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -4961,6 +4961,7 @@ void lockdep_init_map_type(struct lockdep_map *lock, const char *name,
        lock->wait_type_outer = outer;
        lock->wait_type_inner = inner;
        lock->lock_type = lock_type;
+       lock->flags = 0;

        /*
         * No key, no joy, we need to hash something.
@@ -5101,6 +5102,9 @@ static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
                lockevent_inc(lockdep_nocheck);
        }

+       if (unlikely(lock->flags == LOCAL_LOCK_UNLOCKED))
+               subclass++;
+
        if (subclass < NR_LOCKDEP_CACHING_CLASSES)
                class = lock->class_cache[subclass];
        /*


and the usage from slub/memcg looks like this:

if (!local_lock_is_locked(&s->cpu_slab->lock)) {
        ret = __slab_alloc_node(s, alloc_gfp, node, _RET_IP_, size);
        __local_lock_flags_clear(&s->cpu_slab->lock);
}

With that, all normal local_lock_irqsave() calls automagically work.

At a high level the idea is to tell lockdep: "trust me, I know what I'm doing".
Since it's a per-cpu local lock, the workaround tells lockdep to treat
such a local_lock as nested, so lockdep allows a second local_lock
while the same cpu (in !RT) or task (in RT) is holding another local_lock.
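
(For comparison, lockdep already supports this kind of "trust me"
nesting explicitly via subclasses, e.g. spin_lock_nested(); a minimal
illustration with made-up types, not part of the proposal:)

struct bucket { spinlock_t lock; };

static void move_obj(struct bucket *from, struct bucket *to)
{
	spin_lock(&from->lock);
	/* same lock class; tell lockdep this nesting is intentional */
	spin_lock_nested(&to->lock, SINGLE_DEPTH_NESTING);
	/* ... transfer the object ... */
	spin_unlock(&to->lock);
	spin_unlock(&from->lock);
}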

It addresses the 2nd false positive above:
[   14.627331]   lock((local_lock_t *)&c->lock);
[   14.627331]   lock((local_lock_t *)&c->lock);

but doesn't address the first false positive of:
[ 1021.956825] inconsistent {INITIAL USE} -> {IN-NMI} usage.

We can silence lockdep for this lock with:
@@ -5839,6 +5840,9 @@ void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
         */
        kasan_check_byte(lock);

+       if (unlikely(lock->flags == LOCAL_LOCK_UNLOCKED))
+               trylock = 1;
+
        if (unlikely(!lockdep_enabled())) {

Then all lockdep false positives are gone.
In other words the pair:
  local_lock_is_locked(&local_lock);
  __local_lock_flags_clear(&local_lock);

guards the region where local_lock can be taken multiple
times on that cpu/task from any context, including NMI.
We know that the task won't migrate, so multiple lock/unlock
of an unlocked lock is safe.

I think this is a lesser-evil hack/workaround compared to local_lock_irqsave_check().
It gives a clean start/end scope for such usage of local_lock.

Thoughts?


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* SLAB_NO_CMPXCHG was:: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-05-09  1:03   ` Harry Yoo
@ 2025-06-24 17:13     ` Alexei Starovoitov
  2025-06-25 11:38       ` Harry Yoo
  0 siblings, 1 reply; 41+ messages in thread
From: Alexei Starovoitov @ 2025-06-24 17:13 UTC (permalink / raw)
  To: Harry Yoo
  Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Thu, May 8, 2025 at 6:03 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > +     s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> > +
> > +     if (!(s->flags & __CMPXCHG_DOUBLE))
> > +             /*
> > +              * kmalloc_nolock() is not supported on architectures that
> > +              * don't implement cmpxchg16b.
> > +              */
> > +             return NULL;
>
> Hmm when someone uses slab debugging flags (e.g., passing boot
> parameter slab_debug=FPZ as a hardening option on production [1], or
> just for debugging), __CMPXCHG_DOUBLE is not set even when the arch
> supports it.
>
> Is it okay to fail all kmalloc_nolock() calls in such cases?

I studied the code and the git history.
Looks like slub doesn't have to disable cmpxchg mode when slab_debug is on.
The commit 41bec7c33f37 ("mm/slub: remove slab_lock() usage for debug
operations")
removed slab_lock from debug validation checks.
So right now slab_lock() only serializes slab->freelist/counters updates.
It's still necessary on architectures that don't have cmpxchg, but that's it.
Only __update_freelist_slow() is using it.
folio_lock() is the same lock, but PG_locked is there for different
reasons. 10+ years ago the PG_locked bit was reused for slab serialization.
Today it's unnecessary to do so. Abusing the same bit for
freelist/counters updates doesn't provide any better slab debugging.
The comment next to SLAB_NO_CMPXCHG is obsolete as well.
It's been there since the days when slab_lock() was taken during
consistency checks. Removing the misuse of PG_locked is a good thing on
its own.
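
For readers following along, the remaining slab_lock() user looks
roughly like this (paraphrased from mm/slub.c; minor details may differ):

static inline bool __update_freelist_slow(struct slab *slab,
		void *freelist_old, unsigned long counters_old,
		void *freelist_new, unsigned long counters_new)
{
	bool ret = false;

	/* bit_spin_lock(PG_locked) under the hood */
	slab_lock(slab);
	if (slab->freelist == freelist_old &&
	    slab->counters == counters_old) {
		slab->freelist = freelist_new;
		slab->counters = counters_new;
		ret = true;
	}
	slab_unlock(slab);

	return ret;
}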

I think the following diff is appropriate:

diff --git a/mm/slub.c b/mm/slub.c
index 044e43ee3373..9d615cfd1b6f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -286,14 +286,6 @@ static inline bool
kmem_cache_has_cpu_partial(struct kmem_cache *s)
 #define DEBUG_DEFAULT_FLAGS (SLAB_CONSISTENCY_CHECKS | SLAB_RED_ZONE | \
                                SLAB_POISON | SLAB_STORE_USER)

-/*
- * These debug flags cannot use CMPXCHG because there might be consistency
- * issues when checking or reading debug information
- */
-#define SLAB_NO_CMPXCHG (SLAB_CONSISTENCY_CHECKS | SLAB_STORE_USER | \
-                               SLAB_TRACE)
-
-
 /*
  * Debugging flags that require metadata to be stored in the slab.  These get
  * disabled when slab_debug=O is used and a cache's min order increases with
@@ -6654,7 +6646,7 @@ int do_kmem_cache_create(struct kmem_cache *s,
const char *name,
        }

 #ifdef system_has_freelist_aba
-       if (system_has_freelist_aba() && !(s->flags & SLAB_NO_CMPXCHG)) {
+       if (system_has_freelist_aba()) {
                /* Enable fast mode */
                s->flags |= __CMPXCHG_DOUBLE;
        }

It survived my stress tests.
Thoughts?


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: SLAB_NO_CMPXCHG was:: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-06-24 17:13     ` SLAB_NO_CMPXCHG was:: " Alexei Starovoitov
@ 2025-06-25 11:38       ` Harry Yoo
  2025-06-26 20:03         ` Alexei Starovoitov
  0 siblings, 1 reply; 41+ messages in thread
From: Harry Yoo @ 2025-06-25 11:38 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Tue, Jun 24, 2025 at 10:13:49AM -0700, Alexei Starovoitov wrote:
> On Thu, May 8, 2025 at 6:03 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > +     s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> > > +
> > > +     if (!(s->flags & __CMPXCHG_DOUBLE))
> > > +             /*
> > > +              * kmalloc_nolock() is not supported on architectures that
> > > +              * don't implement cmpxchg16b.
> > > +              */
> > > +             return NULL;
> >
> > Hmm when someone uses slab debugging flags (e.g., passing boot
> > parameter slab_debug=FPZ as a hardening option on production [1], or
> > just for debugging), __CMPXCHG_DOUBLE is not set even when the arch
> > supports it.
> >
> > Is it okay to fail all kmalloc_nolock() calls in such cases?
> 
> I studied the code and the git history.
> Looks like slub doesn't have to disable cmpxchg mode when slab_debug is on.

A slight correction: debug caches do not use cmpxchg mode at all by
design. If a future change enables cmpxchg mode for them, it will cause
the same consistency issue.

> The commit 41bec7c33f37 ("mm/slub: remove slab_lock() usage for debug
> operations")
> removed slab_lock from debug validation checks.

An excellent point!

Yes, SLUB does not maintain the cpu slab and percpu partial slabs for
debug caches. Alloc/free is done under n->list_lock, so there is no need
for cmpxchg double or slab_lock() at all :)

> So right now slab_lock() only serializes slab->freelist/counter update.
> It's still necessary on arch-s that don't have cmpxchg, but that's it.
> Only __update_freelist_slow() is using it.

Yes.

> The comment next to SLAB_NO_CMPXCHG is obsolete as well.
> It's been there since days that slab_lock() was taken during
> consistency checks.

Yes.

> I think the following diff is appropriate:
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 044e43ee3373..9d615cfd1b6f 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -286,14 +286,6 @@ static inline bool
> kmem_cache_has_cpu_partial(struct kmem_cache *s)
>  #define DEBUG_DEFAULT_FLAGS (SLAB_CONSISTENCY_CHECKS | SLAB_RED_ZONE | \
>                                 SLAB_POISON | SLAB_STORE_USER)
> 
> -/*
> - * These debug flags cannot use CMPXCHG because there might be consistency
> - * issues when checking or reading debug information
> - */
> -#define SLAB_NO_CMPXCHG (SLAB_CONSISTENCY_CHECKS | SLAB_STORE_USER | \
> -                               SLAB_TRACE)
> -
> -
>
>  /*
>   * Debugging flags that require metadata to be stored in the slab.  These get
>   * disabled when slab_debug=O is used and a cache's min order increases with
> @@ -6654,7 +6646,7 @@ int do_kmem_cache_create(struct kmem_cache *s,
> const char *name,
>         }
> 
>  #ifdef system_has_freelist_aba
> -       if (system_has_freelist_aba() && !(s->flags & SLAB_NO_CMPXCHG)) {
> +       if (system_has_freelist_aba()) {
>                 /* Enable fast mode */
>                 s->flags |= __CMPXCHG_DOUBLE;
>         }
> 
> It survived my stress tests.
> Thoughts?

Perhaps it's better to change the condition in
kmalloc_nolock_noprof() from:

    if (!(s->flags & __CMPXCHG_DOUBLE))
        return NULL;

to:

    if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
        return NULL;

Because debug caches do not use cmpxchg double (and that's why
it survived your test), it is not accurate to set __CMPXCHG_DOUBLE.

And it'll get into trouble anyway if debug caches ever start using cmpxchg double.
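
(For reference, kmem_cache_debug() boils down to roughly the following;
a paraphrase of mm/slub.c, not the literal definition:)

static inline bool kmem_cache_debug(struct kmem_cache *s)
{
	return IS_ENABLED(CONFIG_SLUB_DEBUG) &&
	       (s->flags & SLAB_DEBUG_FLAGS);
}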

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: SLAB_NO_CMPXCHG was:: [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock().
  2025-06-25 11:38       ` Harry Yoo
@ 2025-06-26 20:03         ` Alexei Starovoitov
  0 siblings, 0 replies; 41+ messages in thread
From: Alexei Starovoitov @ 2025-06-26 20:03 UTC (permalink / raw)
  To: Harry Yoo
  Cc: bpf, linux-mm, Vlastimil Babka, Shakeel Butt, Michal Hocko,
	Sebastian Sewior, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Andrew Morton, Peter Zijlstra, Steven Rostedt, Johannes Weiner,
	Matthew Wilcox

On Wed, Jun 25, 2025 at 4:38 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Jun 24, 2025 at 10:13:49AM -0700, Alexei Starovoitov wrote:
> > On Thu, May 8, 2025 at 6:03 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > +     s = kmalloc_slab(size, NULL, alloc_gfp, _RET_IP_);
> > > > +
> > > > +     if (!(s->flags & __CMPXCHG_DOUBLE))
> > > > +             /*
> > > > +              * kmalloc_nolock() is not supported on architectures that
> > > > +              * don't implement cmpxchg16b.
> > > > +              */
> > > > +             return NULL;
> > >
> > > Hmm when someone uses slab debugging flags (e.g., passing boot
> > > parameter slab_debug=FPZ as a hardening option on production [1], or
> > > just for debugging), __CMPXCHG_DOUBLE is not set even when the arch
> > > supports it.
> > >
> > > Is it okay to fail all kmalloc_nolock() calls in such cases?
> >
> > I studied the code and the git history.
> > Looks like slub doesn't have to disable cmpxchg mode when slab_debug is on.
>
> A slight correction; Debug caches do not use cmpxchg mode at all by
> design. If a future change enables cmpxchg mode for them, it will cause
> the same consistency issue.
>
> > The commit 41bec7c33f37 ("mm/slub: remove slab_lock() usage for debug
> > operations")
> > removed slab_lock from debug validation checks.
>
> An excellent point!
>
> Yes, SLUB does not maintain cpu slab and percpu partial slabs on
> debug caches.

Ohh. That was a crucial detail I was missing.
Now I see that 90% of ___slab_alloc() logic is not executed
for debug slabs :)

> Alloc/free is done under n->list_lock, so no need for
> cmpxchg double and slab_lock() at all :)
>
> > So right now slab_lock() only serializes slab->freelist/counter update.
> > It's still necessary on arch-s that don't have cmpxchg, but that's it.
> > Only __update_freelist_slow() is using it.
>
> Yes.
>
> > The comment next to SLAB_NO_CMPXCHG is obsolete as well.
> > It's been there since days that slab_lock() was taken during
> > consistency checks.
>
> Yes.
>
> > I think the following diff is appropriate:
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 044e43ee3373..9d615cfd1b6f 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -286,14 +286,6 @@ static inline bool
> > kmem_cache_has_cpu_partial(struct kmem_cache *s)
> >  #define DEBUG_DEFAULT_FLAGS (SLAB_CONSISTENCY_CHECKS | SLAB_RED_ZONE | \
> >                                 SLAB_POISON | SLAB_STORE_USER)
> >
> > -/*
> > - * These debug flags cannot use CMPXCHG because there might be consistency
> > - * issues when checking or reading debug information
> > - */
> > -#define SLAB_NO_CMPXCHG (SLAB_CONSISTENCY_CHECKS | SLAB_STORE_USER | \
> > -                               SLAB_TRACE)
> > -
> > -
> >
> >  /*
> >   * Debugging flags that require metadata to be stored in the slab.  These get
> >   * disabled when slab_debug=O is used and a cache's min order increases with
> > @@ -6654,7 +6646,7 @@ int do_kmem_cache_create(struct kmem_cache *s,
> > const char *name,
> >         }
> >
> >  #ifdef system_has_freelist_aba
> > -       if (system_has_freelist_aba() && !(s->flags & SLAB_NO_CMPXCHG)) {
> > +       if (system_has_freelist_aba()) {
> >                 /* Enable fast mode */
> >                 s->flags |= __CMPXCHG_DOUBLE;
> >         }
> >
> > It survived my stress tests.
> > Thoughts?
>
> Perhaps it's better to change the condition in
> kmalloc_nolock_noprof() from;
>
>     if (!(s->flags & __CMPXCHG_DOUBLE))
>         return NULL;
>
> to;
>
>     if (!(s->flags & __CMPXCHG_DOUBLE) && !kmem_cache_debug(s))
>         return NULL;
>
> Because debug caches do not use cmpxchg double (and that's why
> it survived your test), it is not accurate to set __CMPXCHG_DOUBLE.
>
> And it'll get into trouble anyway if debug caches use cmpxchg double.

All makes sense. I tested this suggestion and it indeed behaves as
you described :) It's quite a relief, since I was also worried
that the __CMPXCHG_DOUBLE limitation of kmalloc_nolock() might hurt
the slab debug cases.

Only the slub_tiny case is left to do before I can post v2.


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2025-06-26 20:04 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-01  3:27 [PATCH 0/6] mm: Reentrant kmalloc Alexei Starovoitov
2025-05-01  3:27 ` [PATCH 1/6] mm: Rename try_alloc_pages() to alloc_pages_nolock() Alexei Starovoitov
2025-05-06  8:26   ` Vlastimil Babka
2025-05-07  1:24     ` Alexei Starovoitov
2025-05-01  3:27 ` [PATCH 2/6] locking/local_lock: Expose dep_map in local_trylock_t Alexei Starovoitov
2025-05-06 12:56   ` Vlastimil Babka
2025-05-06 14:55     ` Vlastimil Babka
2025-05-07  1:25       ` Alexei Starovoitov
2025-05-12 13:26   ` Sebastian Andrzej Siewior
2025-05-12 16:46     ` Alexei Starovoitov
2025-05-01  3:27 ` [PATCH 3/6] locking/local_lock: Introduce local_lock_is_locked() Alexei Starovoitov
2025-05-06 12:59   ` Vlastimil Babka
2025-05-07  1:28     ` Alexei Starovoitov
2025-05-12 14:56   ` Sebastian Andrzej Siewior
2025-05-12 15:01     ` Vlastimil Babka
2025-05-12 15:23       ` Sebastian Andrzej Siewior
2025-05-01  3:27 ` [PATCH 4/6] locking/local_lock: Introduce local_lock_irqsave_check() Alexei Starovoitov
2025-05-07 13:02   ` Vlastimil Babka
2025-05-12 14:03   ` Sebastian Andrzej Siewior
2025-05-12 17:16     ` Alexei Starovoitov
2025-05-13  6:58       ` Vlastimil Babka
2025-05-13 21:55         ` Alexei Starovoitov
2025-05-01  3:27 ` [PATCH 5/6] mm: Allow GFP_ACCOUNT and GFP_COMP to be used in alloc_pages_nolock() Alexei Starovoitov
2025-05-06  8:55   ` Vlastimil Babka
2025-05-07  1:33     ` Alexei Starovoitov
2025-05-01  3:27 ` [PATCH 6/6] slab: Introduce kmalloc_nolock() and kfree_nolock() Alexei Starovoitov
2025-05-05 18:46   ` Shakeel Butt
2025-05-06  0:49     ` Alexei Starovoitov
2025-05-06  1:24       ` Shakeel Butt
2025-05-06  1:51         ` Alexei Starovoitov
2025-05-06 18:05           ` Shakeel Butt
2025-05-06 12:01   ` Vlastimil Babka
2025-05-07  0:31     ` Harry Yoo
2025-05-07  2:23       ` Alexei Starovoitov
2025-05-07  8:38       ` Vlastimil Babka
2025-05-07  2:20     ` Alexei Starovoitov
2025-05-07 10:44       ` Vlastimil Babka
2025-05-09  1:03   ` Harry Yoo
2025-06-24 17:13     ` SLAB_NO_CMPXCHG was:: " Alexei Starovoitov
2025-06-25 11:38       ` Harry Yoo
2025-06-26 20:03         ` Alexei Starovoitov
