* [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages()
@ 2025-02-22  2:44 Alexei Starovoitov
  2025-02-22  2:44 ` [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t Alexei Starovoitov
                   ` (7 more replies)
  0 siblings, 8 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-02-22  2:44 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Hi All,

The main motivation is to make alloc page and slab reentrant and
remove bpf_mem_alloc.

v8->v9:
- Squash Vlastimil's fix/feature for localtry_trylock, and
  update the commit log as suggested by Sebastian.
- Drop _noprof suffix in try_alloc_pages kdoc
- rebase

v8:
https://lore.kernel.org/bpf/20250213033556.9534-1-alexei.starovoitov@gmail.com/

v7->v8:
- rebase: s/free_unref_page/free_frozen_page/

v6->v7:
- Took Sebastian's patch for localtry_lock_t as-is with minor
  addition of local_trylock_acquire() for proper LOCKDEP.
  Kept his authorship.
- Adjusted patch 4 to use it. The rest is unchanged.

v6:
https://lore.kernel.org/bpf/20250124035655.78899-1-alexei.starovoitov@gmail.com/

v5->v6:
- Addressed comments from Sebastian, Vlastimil
- New approach for local_lock_t in patch 3. Instead of unconditionally
  increasing local_lock_t size to 4 bytes, introduce local_trylock_t
  and use _Generic() tricks to manipulate the active field.
- Addressed stackdepot reentrance issues: the alloc part in patch 1 and
  the free part in patch 2.
- Inlined mem_cgroup_cancel_charge() in patch 4 since this helper
  is being removed.
- Added Acks.
- Dropped failslab, kfence, kmemleak patch.
- Improved bpf_map_alloc_pages() in patch 6 a bit to demo intended usage.
  It will be refactored further.
- Considered using __GFP_COMP in try_alloc_pages to simplify
  free_pages_nolock a bit, but then decided to make it work
  for all types of pages, since free_pages_nolock() is used by
  stackdepot and currently it's using non-compound order 2.
  I felt it's best to leave it as-is and make free_pages_nolock()
  support all pages.

v5:
https://lore.kernel.org/all/20250115021746.34691-1-alexei.starovoitov@gmail.com/

v4->v5:
- Fixed patch 1 and 4 commit logs and comments per Michal's suggestions.
  Added Acks.
- Added patch 6 to make failslab, kfence, kmemleak compliant
  with trylock mode. It's a prerequisite for reentrant slab patches.

v4:
https://lore.kernel.org/bpf/20250114021922.92609-1-alexei.starovoitov@gmail.com/

v3->v4:
Addressed feedback from Michal and Shakeel:
- GFP_TRYLOCK flag is gone. gfpflags_allow_spinning() is used instead.
- Improved comments and commit logs.

v3:
https://lore.kernel.org/bpf/20241218030720.1602449-1-alexei.starovoitov@gmail.com/

v2->v3:
To address the issues spotted by Sebastian, Vlastimil, Steven:
- Made GFP_TRYLOCK internal to mm/internal.h
  try_alloc_pages() and free_pages_nolock() are the only interfaces.
- Since spin_trylock() is not safe in RT from hard IRQ and NMI,
  disable such usage in lock_trylock and in try_alloc_pages().
  In such cases free_pages_nolock() falls back to llist right away.
- Process trylock_free_pages llist when preemptible.
- Check for things like unaccepted memory and order <= 3 early.
- Don't call into __alloc_pages_slowpath() at all.
- Inspired by Vlastimil's struct local_tryirq_lock, adopted the idea in
  local_lock_t. The extra 4 bytes in !RT local_lock_t shouldn't
  affect any of the current local_lock_t users. This is patch 3.
- Tested with bpf selftests in RT and !RT and realized how much
  more work is necessary on the bpf side to play nice with RT.
  The urgency of this work got higher. The alternative is to
  convert bpf bits left and right to bpf_mem_alloc.

v2:
https://lore.kernel.org/bpf/20241210023936.46871-1-alexei.starovoitov@gmail.com/

v1->v2:
- fixed buggy try_alloc_pages_noprof() in PREEMPT_RT. Thanks Peter.
- optimized all paths by doing spin_trylock_irqsave() first
  and only then checking for gfp_flags & __GFP_TRYLOCK,
  then spin_lock_irqsave() if it's a regular mode.
  So the new gfp flag does not add performance overhead.
- patches 2-5 are new. They introduce lockless and/or trylock free_pages_nolock()
  and memcg support. So it's in usable shape for bpf in patch 6.

v1:
https://lore.kernel.org/bpf/20241116014854.55141-1-alexei.starovoitov@gmail.com/

Alexei Starovoitov (5):
  mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  mm, bpf: Introduce free_pages_nolock()
  memcg: Use trylock to access memcg stock_lock.
  mm, bpf: Use memcg in try_alloc_pages().
  bpf: Use try_alloc_pages() to allocate pages for bpf needs.

Sebastian Andrzej Siewior (1):
  locking/local_lock: Introduce localtry_lock_t

 include/linux/bpf.h                 |   2 +-
 include/linux/gfp.h                 |  23 ++++
 include/linux/local_lock.h          |  70 ++++++++++
 include/linux/local_lock_internal.h | 146 ++++++++++++++++++++
 include/linux/mm_types.h            |   4 +
 include/linux/mmzone.h              |   3 +
 kernel/bpf/arena.c                  |   5 +-
 kernel/bpf/syscall.c                |  23 +++-
 lib/stackdepot.c                    |  10 +-
 mm/internal.h                       |   1 +
 mm/memcontrol.c                     |  52 +++++---
 mm/page_alloc.c                     | 200 ++++++++++++++++++++++++++--
 mm/page_owner.c                     |   8 +-
 13 files changed, 506 insertions(+), 41 deletions(-)

-- 
2.43.5



* [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t
  2025-02-22  2:44 [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
@ 2025-02-22  2:44 ` Alexei Starovoitov
  2025-03-11 15:44   ` Mateusz Guzik
  2025-02-22  2:44 ` [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-02-22  2:44 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

In !PREEMPT_RT local_lock_irqsave() disables interrupts to protect the
critical section, but it doesn't prevent NMIs, so fully reentrant code
cannot use local_lock_irqsave() for exclusive access.

Introduce localtry_lock_t and localtry_lock_irqsave(), which disables
interrupts and sets acquired=1, so localtry_trylock_irqsave()
from NMI attempting to acquire the same lock will return false.

In PREEMPT_RT local_lock_irqsave() maps to a preemptible spin_lock().
Map localtry_trylock_irqsave() to a preemptible spin_trylock().
When in hard IRQ or NMI return false right away, since
spin_trylock() is not safe there due to explicit locking in the
underlying rt_spin_trylock() implementation. Removing this explicit
locking and attempting only "trylock" is undesirable due to PI
implications.

Note there is no need to use local_inc for the acquired variable,
since it's a percpu variable with strict nesting scopes.
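
For readers less familiar with the local_lock API, a minimal usage sketch
(illustrative only, not part of the patch; the struct, variable and
function names are invented): regular context takes the lock
unconditionally, while NMI/any-context code uses the trylock variant and
must handle failure (on PREEMPT_RT the trylock always fails from NMI and
hard IRQ, as noted above).

struct pcp_counter {
	localtry_lock_t	lock;
	u64		value;
};
static DEFINE_PER_CPU(struct pcp_counter, pcp_counter) = {
	.lock = INIT_LOCALTRY_LOCK(lock),
};

/* process/softirq context: always succeeds */
static void counter_add(u64 v)
{
	unsigned long flags;

	localtry_lock_irqsave(&pcp_counter.lock, flags);
	this_cpu_add(pcp_counter.value, v);
	localtry_unlock_irqrestore(&pcp_counter.lock, flags);
}

/* any context, including NMI: may fail, the caller must cope */
static bool counter_try_add(u64 v)
{
	unsigned long flags;

	if (!localtry_trylock_irqsave(&pcp_counter.lock, flags))
		return false;
	this_cpu_add(pcp_counter.value, v);
	localtry_unlock_irqrestore(&pcp_counter.lock, flags);
	return true;
}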

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/local_lock.h          |  70 +++++++++++++
 include/linux/local_lock_internal.h | 146 ++++++++++++++++++++++++++++
 2 files changed, 216 insertions(+)

diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 091dc0b6bdfb..1a0bc35839e3 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -51,6 +51,76 @@
 #define local_unlock_irqrestore(lock, flags)			\
 	__local_unlock_irqrestore(lock, flags)
 
+/**
+ * localtry_lock_init - Runtime initialize a lock instance
+ */
+#define localtry_lock_init(lock)		__localtry_lock_init(lock)
+
+/**
+ * localtry_lock - Acquire a per CPU local lock
+ * @lock:	The lock variable
+ */
+#define localtry_lock(lock)		__localtry_lock(lock)
+
+/**
+ * localtry_lock_irq - Acquire a per CPU local lock and disable interrupts
+ * @lock:	The lock variable
+ */
+#define localtry_lock_irq(lock)		__localtry_lock_irq(lock)
+
+/**
+ * localtry_lock_irqsave - Acquire a per CPU local lock, save and disable
+ *			 interrupts
+ * @lock:	The lock variable
+ * @flags:	Storage for interrupt flags
+ */
+#define localtry_lock_irqsave(lock, flags)				\
+	__localtry_lock_irqsave(lock, flags)
+
+/**
+ * localtry_trylock - Try to acquire a per CPU local lock.
+ * @lock:	The lock variable
+ *
+ * The function can be used in any context such as NMI or HARDIRQ. Due to
+ * locking constraints it will _always_ fail to acquire the lock in NMI or
+ * HARDIRQ context on PREEMPT_RT.
+ */
+#define localtry_trylock(lock)		__localtry_trylock(lock)
+
+/**
+ * localtry_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
+ *			      interrupts if acquired
+ * @lock:	The lock variable
+ * @flags:	Storage for interrupt flags
+ *
+ * The function can be used in any context such as NMI or HARDIRQ. Due to
+ * locking constraints it will _always_ fail to acquire the lock in NMI or
+ * HARDIRQ context on PREEMPT_RT.
+ */
+#define localtry_trylock_irqsave(lock, flags)				\
+	__localtry_trylock_irqsave(lock, flags)
+
+/**
+ * localtry_unlock - Release a per CPU local lock
+ * @lock:	The lock variable
+ */
+#define localtry_unlock(lock)		__localtry_unlock(lock)
+
+/**
+ * localtry_unlock_irq - Release a per CPU local lock and enable interrupts
+ * @lock:	The lock variable
+ */
+#define localtry_unlock_irq(lock)		__localtry_unlock_irq(lock)
+
+/**
+ * localtry_unlock_irqrestore - Release a per CPU local lock and restore
+ *			      interrupt flags
+ * @lock:	The lock variable
+ * @flags:      Interrupt flags to restore
+ */
+#define localtry_unlock_irqrestore(lock, flags)			\
+	__localtry_unlock_irqrestore(lock, flags)
+
 DEFINE_GUARD(local_lock, local_lock_t __percpu*,
 	     local_lock(_T),
 	     local_unlock(_T))
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 8dd71fbbb6d2..67bd13d142fa 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -15,6 +15,11 @@ typedef struct {
 #endif
 } local_lock_t;
 
+typedef struct {
+	local_lock_t	llock;
+	unsigned int	acquired;
+} localtry_lock_t;
+
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 # define LOCAL_LOCK_DEBUG_INIT(lockname)		\
 	.dep_map = {					\
@@ -31,6 +36,13 @@ static inline void local_lock_acquire(local_lock_t *l)
 	l->owner = current;
 }
 
+static inline void local_trylock_acquire(local_lock_t *l)
+{
+	lock_map_acquire_try(&l->dep_map);
+	DEBUG_LOCKS_WARN_ON(l->owner);
+	l->owner = current;
+}
+
 static inline void local_lock_release(local_lock_t *l)
 {
 	DEBUG_LOCKS_WARN_ON(l->owner != current);
@@ -45,11 +57,13 @@ static inline void local_lock_debug_init(local_lock_t *l)
 #else /* CONFIG_DEBUG_LOCK_ALLOC */
 # define LOCAL_LOCK_DEBUG_INIT(lockname)
 static inline void local_lock_acquire(local_lock_t *l) { }
+static inline void local_trylock_acquire(local_lock_t *l) { }
 static inline void local_lock_release(local_lock_t *l) { }
 static inline void local_lock_debug_init(local_lock_t *l) { }
 #endif /* !CONFIG_DEBUG_LOCK_ALLOC */
 
 #define INIT_LOCAL_LOCK(lockname)	{ LOCAL_LOCK_DEBUG_INIT(lockname) }
+#define INIT_LOCALTRY_LOCK(lockname)	{ .llock = { LOCAL_LOCK_DEBUG_INIT(lockname.llock) }}
 
 #define __local_lock_init(lock)					\
 do {								\
@@ -118,6 +132,104 @@ do {								\
 #define __local_unlock_nested_bh(lock)				\
 	local_lock_release(this_cpu_ptr(lock))
 
+/* localtry_lock_t variants */
+
+#define __localtry_lock_init(lock)				\
+do {								\
+	__local_lock_init(&(lock)->llock);			\
+	WRITE_ONCE((lock)->acquired, 0);			\
+} while (0)
+
+#define __localtry_lock(lock)					\
+	do {							\
+		localtry_lock_t *lt;				\
+		preempt_disable();				\
+		lt = this_cpu_ptr(lock);			\
+		local_lock_acquire(&lt->llock);			\
+		WRITE_ONCE(lt->acquired, 1);			\
+	} while (0)
+
+#define __localtry_lock_irq(lock)				\
+	do {							\
+		localtry_lock_t *lt;				\
+		local_irq_disable();				\
+		lt = this_cpu_ptr(lock);			\
+		local_lock_acquire(&lt->llock);			\
+		WRITE_ONCE(lt->acquired, 1);			\
+	} while (0)
+
+#define __localtry_lock_irqsave(lock, flags)			\
+	do {							\
+		localtry_lock_t *lt;				\
+		local_irq_save(flags);				\
+		lt = this_cpu_ptr(lock);			\
+		local_lock_acquire(&lt->llock);			\
+		WRITE_ONCE(lt->acquired, 1);			\
+	} while (0)
+
+#define __localtry_trylock(lock)				\
+	({							\
+		localtry_lock_t *lt;				\
+		bool _ret;					\
+								\
+		preempt_disable();				\
+		lt = this_cpu_ptr(lock);			\
+		if (!READ_ONCE(lt->acquired)) {			\
+			WRITE_ONCE(lt->acquired, 1);		\
+			local_trylock_acquire(&lt->llock);	\
+			_ret = true;				\
+		} else {					\
+			_ret = false;				\
+			preempt_enable();			\
+		}						\
+		_ret;						\
+	})
+
+#define __localtry_trylock_irqsave(lock, flags)			\
+	({							\
+		localtry_lock_t *lt;				\
+		bool _ret;					\
+								\
+		local_irq_save(flags);				\
+		lt = this_cpu_ptr(lock);			\
+		if (!READ_ONCE(lt->acquired)) {			\
+			WRITE_ONCE(lt->acquired, 1);		\
+			local_trylock_acquire(&lt->llock);	\
+			_ret = true;				\
+		} else {					\
+			_ret = false;				\
+			local_irq_restore(flags);		\
+		}						\
+		_ret;						\
+	})
+
+#define __localtry_unlock(lock)					\
+	do {							\
+		localtry_lock_t *lt;				\
+		lt = this_cpu_ptr(lock);			\
+		WRITE_ONCE(lt->acquired, 0);			\
+		local_lock_release(&lt->llock);			\
+		preempt_enable();				\
+	} while (0)
+
+#define __localtry_unlock_irq(lock)				\
+	do {							\
+		localtry_lock_t *lt;				\
+		lt = this_cpu_ptr(lock);			\
+		WRITE_ONCE(lt->acquired, 0);			\
+		local_lock_release(&lt->llock);			\
+		local_irq_enable();				\
+	} while (0)
+
+#define __localtry_unlock_irqrestore(lock, flags)		\
+	do {							\
+		localtry_lock_t *lt;				\
+		lt = this_cpu_ptr(lock);			\
+		WRITE_ONCE(lt->acquired, 0);			\
+		local_lock_release(&lt->llock);			\
+		local_irq_restore(flags);			\
+	} while (0)
+
 #else /* !CONFIG_PREEMPT_RT */
 
 /*
@@ -125,8 +237,10 @@ do {								\
  * critical section while staying preemptible.
  */
 typedef spinlock_t local_lock_t;
+typedef spinlock_t localtry_lock_t;
 
 #define INIT_LOCAL_LOCK(lockname) __LOCAL_SPIN_LOCK_UNLOCKED((lockname))
+#define INIT_LOCALTRY_LOCK(lockname) INIT_LOCAL_LOCK(lockname)
 
 #define __local_lock_init(l)					\
 	do {							\
@@ -169,4 +283,36 @@ do {								\
 	spin_unlock(this_cpu_ptr((lock)));			\
 } while (0)
 
+/* localtry_lock_t variants */
+
+#define __localtry_lock_init(lock)			__local_lock_init(lock)
+#define __localtry_lock(lock)				__local_lock(lock)
+#define __localtry_lock_irq(lock)			__local_lock(lock)
+#define __localtry_lock_irqsave(lock, flags)		__local_lock_irqsave(lock, flags)
+#define __localtry_unlock(lock)				__local_unlock(lock)
+#define __localtry_unlock_irq(lock)			__local_unlock(lock)
+#define __localtry_unlock_irqrestore(lock, flags)	__local_unlock_irqrestore(lock, flags)
+
+#define __localtry_trylock(lock)				\
+	({							\
+		int __locked;					\
+								\
+		if (in_nmi() | in_hardirq()) {			\
+			__locked = 0;				\
+		} else {					\
+			migrate_disable();			\
+			__locked = spin_trylock(this_cpu_ptr((lock)));	\
+			if (!__locked)				\
+				migrate_enable();		\
+		}						\
+		__locked;					\
+	})
+
+#define __localtry_trylock_irqsave(lock, flags)			\
+	({							\
+		typecheck(unsigned long, flags);		\
+		flags = 0;					\
+		__localtry_trylock(lock);			\
+	})
+
 #endif /* CONFIG_PREEMPT_RT */
-- 
2.43.5



* [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-02-22  2:44 [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
  2025-02-22  2:44 ` [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t Alexei Starovoitov
@ 2025-02-22  2:44 ` Alexei Starovoitov
  2025-03-11  2:04   ` Andrew Morton
  2025-02-22  2:44 ` [PATCH bpf-next v9 3/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-02-22  2:44 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Tracing BPF programs execute from tracepoints and kprobes where
running context is unknown, but they need to request additional
memory. The prior workarounds were using pre-allocated memory and
BPF specific freelists to satisfy such allocation requests.
Instead, introduce a gfpflags_allow_spinning() condition that signals
to the allocator that the running context is unknown.
Then rely on the percpu free list of pages to allocate a page.
try_alloc_pages() -> get_page_from_freelist() -> rmqueue() ->
rmqueue_pcplist() will spin_trylock to grab the page from the percpu
free list. If that fails (due to re-entrancy or the list being empty),
rmqueue_bulk()/rmqueue_buddy() will attempt to spin_trylock
zone->lock and grab the page from there.
spin_trylock() is not safe in PREEMPT_RT when in NMI or hard IRQ;
bail out early in such a case.

The support for gfpflags_allow_spinning() mode for free_page and memcg
comes in the next patches.

This is a first step towards supporting BPF requirements in SLUB
and getting rid of bpf_mem_alloc.
That goal was discussed at LSFMM: https://lwn.net/Articles/974138/
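
As a caller-side illustration (a sketch only; the wrapper name is
invented and not part of the patch), the allocation is strictly best
effort, so every call site needs its own fallback:

/* Safe from NMI, hard IRQ, or under a raw spinlock. */
static struct page *grab_page_any_context(int nid)
{
	/*
	 * Best effort: returns NULL on any lock contention or
	 * unsupported context instead of spinning or entering the
	 * slowpath.
	 */
	struct page *page = try_alloc_pages(nid, 0);

	if (!page)
		return NULL;	/* the caller falls back or drops the request */

	/* The page is zeroed: __GFP_ZERO is set internally. */
	return page;
}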

Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h |  22 ++++++++++
 lib/stackdepot.c    |   5 ++-
 mm/internal.h       |   1 +
 mm/page_alloc.c     | 104 ++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 127 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6bb1a5a7a4ae..5d9ee78c74e4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,6 +39,25 @@ static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
 }
 
+static inline bool gfpflags_allow_spinning(const gfp_t gfp_flags)
+{
+	/*
+	 * !__GFP_DIRECT_RECLAIM -> direct reclaim is not allowed.
+	 * !__GFP_KSWAPD_RECLAIM -> it's not safe to wake up kswapd.
+	 * All GFP_* flags including GFP_NOWAIT use one or both flags.
+	 * try_alloc_pages() is the only API that doesn't specify either flag.
+	 *
+	 * This is stronger than GFP_NOWAIT or GFP_ATOMIC because
+	 * those are guaranteed to never block on a sleeping lock.
+	 * Here we are enforcing that the allocation doesn't ever spin
+	 * on any locks (i.e. only trylocks). There is no high level
+	 * GFP_$FOO flag for this use in try_alloc_pages() as the
+	 * regular page allocator doesn't fully support this
+	 * allocation mode.
+	 */
+	return !(gfp_flags & __GFP_RECLAIM);
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else
@@ -335,6 +354,9 @@ static inline struct page *alloc_page_vma_noprof(gfp_t gfp,
 }
 #define alloc_page_vma(...)			alloc_hooks(alloc_page_vma_noprof(__VA_ARGS__))
 
+struct page *try_alloc_pages_noprof(int nid, unsigned int order);
+#define try_alloc_pages(...)			alloc_hooks(try_alloc_pages_noprof(__VA_ARGS__))
+
 extern unsigned long get_free_pages_noprof(gfp_t gfp_mask, unsigned int order);
 #define __get_free_pages(...)			alloc_hooks(get_free_pages_noprof(__VA_ARGS__))
 
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 245d5b416699..377194969e61 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -591,7 +591,8 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 	depot_stack_handle_t handle = 0;
 	struct page *page = NULL;
 	void *prealloc = NULL;
-	bool can_alloc = depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC;
+	bool allow_spin = gfpflags_allow_spinning(alloc_flags);
+	bool can_alloc = (depot_flags & STACK_DEPOT_FLAG_CAN_ALLOC) && allow_spin;
 	unsigned long flags;
 	u32 hash;
 
@@ -630,7 +631,7 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 			prealloc = page_address(page);
 	}
 
-	if (in_nmi()) {
+	if (in_nmi() || !allow_spin) {
 		/* We can never allocate in NMI context. */
 		WARN_ON_ONCE(can_alloc);
 		/* Best effort; bail if we fail to take the lock. */
diff --git a/mm/internal.h b/mm/internal.h
index 109ef30fee11..10a8b4b3b86e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1187,6 +1187,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 #define ALLOC_NOFRAGMENT	  0x0
 #endif
 #define ALLOC_HIGHATOMIC	0x200 /* Allows access to MIGRATE_HIGHATOMIC */
+#define ALLOC_TRYLOCK		0x400 /* Only use spin_trylock in allocation path */
 #define ALLOC_KSWAPD		0x800 /* allow waking of kswapd, __GFP_KSWAPD_RECLAIM set */
 
 /* Flags that allow allocations below the min watermark. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 579789600a3c..1f2a4e1c70ae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2307,7 +2307,11 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 	unsigned long flags;
 	int i;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+			return 0;
+		spin_lock_irqsave(&zone->lock, flags);
+	}
 	for (i = 0; i < count; ++i) {
 		struct page *page = __rmqueue(zone, order, migratetype,
 								alloc_flags);
@@ -2907,7 +2911,11 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 
 	do {
 		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
+		if (!spin_trylock_irqsave(&zone->lock, flags)) {
+			if (unlikely(alloc_flags & ALLOC_TRYLOCK))
+				return NULL;
+			spin_lock_irqsave(&zone->lock, flags);
+		}
 		if (alloc_flags & ALLOC_HIGHATOMIC)
 			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
 		if (!page) {
@@ -4511,7 +4519,12 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 
 	might_alloc(gfp_mask);
 
-	if (should_fail_alloc_page(gfp_mask, order))
+	/*
+	 * Don't invoke should_fail logic, since it may call
+	 * get_random_u32() and printk() which need to spin_lock.
+	 */
+	if (!(*alloc_flags & ALLOC_TRYLOCK) &&
+	    should_fail_alloc_page(gfp_mask, order))
 		return false;
 
 	*alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, *alloc_flags);
@@ -7071,3 +7084,88 @@ static bool __free_unaccepted(struct page *page)
 }
 
 #endif /* CONFIG_UNACCEPTED_MEMORY */
+
+/**
+ * try_alloc_pages - opportunistic reentrant allocation from any context
+ * @nid: node to allocate from
+ * @order: allocation order size
+ *
+ * Allocates pages of a given order from the given node. This is safe to
+ * call from any context (from atomic, NMI, and also reentrant
+ * allocator -> tracepoint -> try_alloc_pages_noprof).
+ * Allocation is best effort and expected to fail easily, so nobody should
+ * rely on its success. Failures are not reported via warn_alloc().
+ * See the always-fail conditions below.
+ *
+ * Return: allocated page or NULL on failure.
+ */
+struct page *try_alloc_pages_noprof(int nid, unsigned int order)
+{
+	/*
+	 * Do not specify __GFP_DIRECT_RECLAIM, since direct reclaim is not allowed.
+	 * Do not specify __GFP_KSWAPD_RECLAIM either, since wake up of kswapd
+	 * is not safe in arbitrary context.
+	 *
+	 * These two are the conditions for gfpflags_allow_spinning() being true.
+	 *
+	 * Specify __GFP_NOWARN since failing try_alloc_pages() is not a reason
+	 * to warn. Also warn would trigger printk() which is unsafe from
+	 * various contexts. We cannot use printk_deferred_enter() to mitigate,
+	 * since the running context is unknown.
+	 *
+	 * Specify __GFP_ZERO to make sure that call to kmsan_alloc_page() below
+	 * is safe in any context. Also zeroing the page is mandatory for
+	 * BPF use cases.
+	 *
+	 * Though __GFP_NOMEMALLOC is not checked in the code path below,
+	 * specify it here to highlight that try_alloc_pages()
+	 * doesn't want to deplete reserves.
+	 */
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+	unsigned int alloc_flags = ALLOC_TRYLOCK;
+	struct alloc_context ac = { };
+	struct page *page;
+
+	/*
+	 * In PREEMPT_RT spin_trylock() will call raw_spin_lock() which is
+	 * unsafe in NMI. If spin_trylock() is called from hard IRQ the current
+	 * task may be waiting for one rt_spin_lock, but rt_spin_trylock() will
+	 * mark the task as the owner of another rt_spin_lock which will
+	 * confuse PI logic, so return immediately if called from hard IRQ or
+	 * NMI.
+	 *
+	 * Note, irqs_disabled() case is ok. This function can be called
+	 * from raw_spin_lock_irqsave region.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) && (in_nmi() || in_hardirq()))
+		return NULL;
+	if (!pcp_allowed_order(order))
+		return NULL;
+
+#ifdef CONFIG_UNACCEPTED_MEMORY
+	/* Bailout, since try_to_accept_memory_one() needs to take a lock */
+	if (has_unaccepted_memory())
+		return NULL;
+#endif
+	/* Bailout, since _deferred_grow_zone() needs to take a lock */
+	if (deferred_pages_enabled())
+		return NULL;
+
+	if (nid == NUMA_NO_NODE)
+		nid = numa_node_id();
+
+	prepare_alloc_pages(alloc_gfp, order, nid, NULL, &ac,
+			    &alloc_gfp, &alloc_flags);
+
+	/*
+	 * Best effort allocation from percpu free list.
+	 * If it's empty attempt to spin_trylock zone->lock.
+	 */
+	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
+
+	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
+
+	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
+	kmsan_alloc_page(page, order, alloc_gfp);
+	return page;
+}
-- 
2.43.5



* [PATCH bpf-next v9 3/6] mm, bpf: Introduce free_pages_nolock()
  2025-02-22  2:44 [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
  2025-02-22  2:44 ` [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t Alexei Starovoitov
  2025-02-22  2:44 ` [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
@ 2025-02-22  2:44 ` Alexei Starovoitov
  2025-02-22  2:44 ` [PATCH bpf-next v9 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-02-22  2:44 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Introduce free_pages_nolock() that can free pages without taking locks.
It relies on trylock and can be called from any context.
Since spin_trylock() cannot be used in PREEMPT_RT from hard IRQ or NMI
it uses lockless link list to stash the pages which will be freed
by subsequent free_pages() from good context.

Do not use llist unconditionally. BPF maps continuously
allocate/free, so we cannot unconditionally delay the freeing to
llist. When the memory becomes free, make it available to the
kernel and BPF users right away if possible, and fall back to
llist as the last resort.
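
A caller-side sketch (illustrative only; the function name is invented)
pairing the two new helpers from a context where regular free_pages()
would be unsafe:

static void use_scratch_page(int nid)
{
	struct page *page = try_alloc_pages(nid, 0);

	if (!page)
		return;		/* best effort, no fallback here */

	/* ... use page_address(page) as temporary scratch memory ... */

	/*
	 * Trylock-only free: the page either goes to the pcp/buddy lists
	 * directly or is stashed on zone->trylock_free_pages and drained
	 * by a later free that takes zone->lock.
	 */
	free_pages_nolock(page, 0);
}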

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/gfp.h      |  1 +
 include/linux/mm_types.h |  4 ++
 include/linux/mmzone.h   |  3 ++
 lib/stackdepot.c         |  5 ++-
 mm/page_alloc.c          | 90 +++++++++++++++++++++++++++++++++++-----
 mm/page_owner.c          |  8 +++-
 6 files changed, 98 insertions(+), 13 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5d9ee78c74e4..ceb226c2e25c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -379,6 +379,7 @@ __meminit void *alloc_pages_exact_nid_noprof(int nid, size_t size, gfp_t gfp_mas
 	__get_free_pages((gfp_mask) | GFP_DMA, (order))
 
 extern void __free_pages(struct page *page, unsigned int order);
+extern void free_pages_nolock(struct page *page, unsigned int order);
 extern void free_pages(unsigned long addr, unsigned int order);
 
 #define __free_page(page) __free_pages((page), 0)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b27db7f9496..483aa90242cd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,10 @@ struct page {
 				/* Or, free page */
 				struct list_head buddy_list;
 				struct list_head pcp_list;
+				struct {
+					struct llist_node pcp_llist;
+					unsigned int order;
+				};
 			};
 			/* See page-flags.h for PAGE_MAPPING_FLAGS */
 			struct address_space *mapping;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9540b41894da..e16939553930 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -972,6 +972,9 @@ struct zone {
 	/* Primarily protects free_area */
 	spinlock_t		lock;
 
+	/* Pages to be freed when next trylock succeeds */
+	struct llist_head	trylock_free_pages;
+
 	/* Write-intensive fields used by compaction and vmstats. */
 	CACHELINE_PADDING(_pad2_);
 
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index 377194969e61..73d7b50924ef 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -672,7 +672,10 @@ depot_stack_handle_t stack_depot_save_flags(unsigned long *entries,
 exit:
 	if (prealloc) {
 		/* Stack depot didn't use this memory, free it. */
-		free_pages((unsigned long)prealloc, DEPOT_POOL_ORDER);
+		if (!allow_spin)
+			free_pages_nolock(virt_to_page(prealloc), DEPOT_POOL_ORDER);
+		else
+			free_pages((unsigned long)prealloc, DEPOT_POOL_ORDER);
 	}
 	if (found)
 		handle = found->handle.handle;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1f2a4e1c70ae..79b39ad4bb1b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -88,6 +88,9 @@ typedef int __bitwise fpi_t;
  */
 #define FPI_TO_TAIL		((__force fpi_t)BIT(1))
 
+/* Free the page without taking locks. Rely on trylock only. */
+#define FPI_TRYLOCK		((__force fpi_t)BIT(2))
+
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
 #define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -1249,13 +1252,44 @@ static void split_large_buddy(struct zone *zone, struct page *page,
 	} while (1);
 }
 
+static void add_page_to_zone_llist(struct zone *zone, struct page *page,
+				   unsigned int order)
+{
+	/* Remember the order */
+	page->order = order;
+	/* Add the page to the free list */
+	llist_add(&page->pcp_llist, &zone->trylock_free_pages);
+}
+
 static void free_one_page(struct zone *zone, struct page *page,
 			  unsigned long pfn, unsigned int order,
 			  fpi_t fpi_flags)
 {
+	struct llist_head *llhead;
 	unsigned long flags;
 
-	spin_lock_irqsave(&zone->lock, flags);
+	if (!spin_trylock_irqsave(&zone->lock, flags)) {
+		if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+			add_page_to_zone_llist(zone, page, order);
+			return;
+		}
+		spin_lock_irqsave(&zone->lock, flags);
+	}
+
+	/* The lock succeeded. Process deferred pages. */
+	llhead = &zone->trylock_free_pages;
+	if (unlikely(!llist_empty(llhead) && !(fpi_flags & FPI_TRYLOCK))) {
+		struct llist_node *llnode;
+		struct page *p, *tmp;
+
+		llnode = llist_del_all(llhead);
+		llist_for_each_entry_safe(p, tmp, llnode, pcp_llist) {
+			unsigned int p_order = p->order;
+
+			split_large_buddy(zone, p, page_to_pfn(p), p_order, fpi_flags);
+			__count_vm_events(PGFREE, 1 << p_order);
+		}
+	}
 	split_large_buddy(zone, page, pfn, order, fpi_flags);
 	spin_unlock_irqrestore(&zone->lock, flags);
 
@@ -2599,7 +2633,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 
 static void free_frozen_page_commit(struct zone *zone,
 		struct per_cpu_pages *pcp, struct page *page, int migratetype,
-		unsigned int order)
+		unsigned int order, fpi_t fpi_flags)
 {
 	int high, batch;
 	int pindex;
@@ -2634,6 +2668,14 @@ static void free_frozen_page_commit(struct zone *zone,
 	}
 	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
 		pcp->free_count += (1 << order);
+
+	if (unlikely(fpi_flags & FPI_TRYLOCK)) {
+		/*
+		 * Do not attempt to take a zone lock. Let pcp->count get
+		 * over high mark temporarily.
+		 */
+		return;
+	}
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2648,7 +2690,8 @@ static void free_frozen_page_commit(struct zone *zone,
 /*
  * Free a pcp page
  */
-void free_frozen_pages(struct page *page, unsigned int order)
+static void __free_frozen_pages(struct page *page, unsigned int order,
+				fpi_t fpi_flags)
 {
 	unsigned long __maybe_unused UP_flags;
 	struct per_cpu_pages *pcp;
@@ -2657,7 +2700,7 @@ void free_frozen_pages(struct page *page, unsigned int order)
 	int migratetype;
 
 	if (!pcp_allowed_order(order)) {
-		__free_pages_ok(page, order, FPI_NONE);
+		__free_pages_ok(page, order, fpi_flags);
 		return;
 	}
 
@@ -2675,23 +2718,33 @@ void free_frozen_pages(struct page *page, unsigned int order)
 	migratetype = get_pfnblock_migratetype(page, pfn);
 	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
 		if (unlikely(is_migrate_isolate(migratetype))) {
-			free_one_page(zone, page, pfn, order, FPI_NONE);
+			free_one_page(zone, page, pfn, order, fpi_flags);
 			return;
 		}
 		migratetype = MIGRATE_MOVABLE;
 	}
 
+	if (unlikely((fpi_flags & FPI_TRYLOCK) && IS_ENABLED(CONFIG_PREEMPT_RT)
+		     && (in_nmi() || in_hardirq()))) {
+		add_page_to_zone_llist(zone, page, order);
+		return;
+	}
 	pcp_trylock_prepare(UP_flags);
 	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
 	if (pcp) {
-		free_frozen_page_commit(zone, pcp, page, migratetype, order);
+		free_frozen_page_commit(zone, pcp, page, migratetype, order, fpi_flags);
 		pcp_spin_unlock(pcp);
 	} else {
-		free_one_page(zone, page, pfn, order, FPI_NONE);
+		free_one_page(zone, page, pfn, order, fpi_flags);
 	}
 	pcp_trylock_finish(UP_flags);
 }
 
+void free_frozen_pages(struct page *page, unsigned int order)
+{
+	__free_frozen_pages(page, order, FPI_NONE);
+}
+
 /*
  * Free a batch of folios
  */
@@ -2780,7 +2833,7 @@ void free_unref_folios(struct folio_batch *folios)
 
 		trace_mm_page_free_batched(&folio->page);
 		free_frozen_page_commit(zone, pcp, &folio->page, migratetype,
-				order);
+					order, FPI_NONE);
 	}
 
 	if (pcp) {
@@ -4841,22 +4894,37 @@ EXPORT_SYMBOL(get_zeroed_page_noprof);
  * Context: May be called in interrupt context or while holding a normal
  * spinlock, but not in NMI context or while holding a raw spinlock.
  */
-void __free_pages(struct page *page, unsigned int order)
+static void ___free_pages(struct page *page, unsigned int order,
+			  fpi_t fpi_flags)
 {
 	/* get PageHead before we drop reference */
 	int head = PageHead(page);
 	struct alloc_tag *tag = pgalloc_tag_get(page);
 
 	if (put_page_testzero(page))
-		free_frozen_pages(page, order);
+		__free_frozen_pages(page, order, fpi_flags);
 	else if (!head) {
 		pgalloc_tag_sub_pages(tag, (1 << order) - 1);
 		while (order-- > 0)
-			free_frozen_pages(page + (1 << order), order);
+			__free_frozen_pages(page + (1 << order), order,
+					    fpi_flags);
 	}
 }
+void __free_pages(struct page *page, unsigned int order)
+{
+	___free_pages(page, order, FPI_NONE);
+}
 EXPORT_SYMBOL(__free_pages);
 
+/*
+ * Can be called while holding raw_spin_lock or from IRQ and NMI for any
+ * page type (not only those that came from try_alloc_pages)
+ */
+void free_pages_nolock(struct page *page, unsigned int order)
+{
+	___free_pages(page, order, FPI_TRYLOCK);
+}
+
 void free_pages(unsigned long addr, unsigned int order)
 {
 	if (addr != 0) {
diff --git a/mm/page_owner.c b/mm/page_owner.c
index 2d6360eaccbb..90e31d0e3ed7 100644
--- a/mm/page_owner.c
+++ b/mm/page_owner.c
@@ -294,7 +294,13 @@ void __reset_page_owner(struct page *page, unsigned short order)
 	page_owner = get_page_owner(page_ext);
 	alloc_handle = page_owner->handle;
 
-	handle = save_stack(GFP_NOWAIT | __GFP_NOWARN);
+	/*
+	 * Do not specify GFP_NOWAIT to make gfpflags_allow_spinning() == false
+	 * to prevent issues in stack_depot_save().
+	 * This is similar to try_alloc_pages() gfp flags, but only used
+	 * to signal stack_depot to avoid spin_locks.
+	 */
+	handle = save_stack(__GFP_NOWARN);
 	__update_page_owner_free_handle(page_ext, handle, order, current->pid,
 					current->tgid, free_ts_nsec);
 	page_ext_put(page_ext);
-- 
2.43.5



* [PATCH bpf-next v9 4/6] memcg: Use trylock to access memcg stock_lock.
  2025-02-22  2:44 [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (2 preceding siblings ...)
  2025-02-22  2:44 ` [PATCH bpf-next v9 3/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
@ 2025-02-22  2:44 ` Alexei Starovoitov
  2025-02-22  2:44 ` [PATCH bpf-next v9 5/6] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-02-22  2:44 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Teach memcg to operate under trylock conditions when spinning locks
cannot be used.

localtry_trylock might fail, and this leads to a charge cache bypass
if the calling context doesn't allow spinning (gfpflags_allow_spinning).
In those cases charge the memcg counter directly and fail early if
that is not possible. This might cause a premature charge failure,
but it allows opportunistic charging that is safe from the
try_alloc_pages path.
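
Condensed from the consume_stock() change in the diff below, the shape of
the trylock-with-fallback pattern is:

	if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
		if (!gfpflags_allow_spinning(gfp_mask))
			return ret;	/* opportunistic caller: bypass the per-CPU cache */
		/* regular caller: spin as before */
		localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
	}
	/* ... fast path against the per-CPU stock ... */
	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);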

Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/memcontrol.c | 52 ++++++++++++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4de6acb9b8ec..97a7307d4932 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1739,7 +1739,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 }
 
 struct memcg_stock_pcp {
-	local_lock_t stock_lock;
+	localtry_lock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
 
@@ -1754,7 +1754,7 @@ struct memcg_stock_pcp {
 #define FLUSHING_CACHED_CHARGE	0
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
-	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
+	.stock_lock = INIT_LOCALTRY_LOCK(stock_lock),
 };
 static DEFINE_MUTEX(percpu_charge_mutex);
 
@@ -1773,7 +1773,8 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
  *
  * returns true if successful, false otherwise.
  */
-static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+			  gfp_t gfp_mask)
 {
 	struct memcg_stock_pcp *stock;
 	unsigned int stock_pages;
@@ -1783,7 +1784,11 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+		if (!gfpflags_allow_spinning(gfp_mask))
+			return ret;
+		localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
+	}
 
 	stock = this_cpu_ptr(&memcg_stock);
 	stock_pages = READ_ONCE(stock->nr_pages);
@@ -1792,7 +1797,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 		ret = true;
 	}
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 
 	return ret;
 }
@@ -1831,14 +1836,14 @@ static void drain_local_stock(struct work_struct *dummy)
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	old = drain_obj_stock(stock);
 	drain_stock(stock);
 	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 	obj_cgroup_put(old);
 }
 
@@ -1868,9 +1873,20 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	unsigned long flags;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+		/*
+		 * In case of unlikely failure to lock percpu stock_lock
+		 * uncharge memcg directly.
+		 */
+		if (mem_cgroup_is_root(memcg))
+			return;
+		page_counter_uncharge(&memcg->memory, nr_pages);
+		if (do_memsw_account())
+			page_counter_uncharge(&memcg->memsw, nr_pages);
+		return;
+	}
 	__refill_stock(memcg, nr_pages);
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
 
 /*
@@ -2213,9 +2229,13 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long pflags;
 
 retry:
-	if (consume_stock(memcg, nr_pages))
+	if (consume_stock(memcg, nr_pages, gfp_mask))
 		return 0;
 
+	if (!gfpflags_allow_spinning(gfp_mask))
+		/* Avoid the refill and flush of the older stock */
+		batch = nr_pages;
+
 	if (!do_memsw_account() ||
 	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
 		if (page_counter_try_charge(&memcg->memory, batch, &counter))
@@ -2699,7 +2719,7 @@ static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	unsigned long flags;
 	int *bytes;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
 	stock = this_cpu_ptr(&memcg_stock);
 
 	/*
@@ -2752,7 +2772,7 @@ static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	if (nr)
 		__mod_objcg_mlstate(objcg, pgdat, idx, nr);
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 	obj_cgroup_put(old);
 }
 
@@ -2762,7 +2782,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 	unsigned long flags;
 	bool ret = false;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
@@ -2770,7 +2790,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 		ret = true;
 	}
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 
 	return ret;
 }
@@ -2862,7 +2882,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 	unsigned long flags;
 	unsigned int nr_pages = 0;
 
-	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
@@ -2880,7 +2900,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+	localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 	obj_cgroup_put(old);
 
 	if (nr_pages)
-- 
2.43.5



* [PATCH bpf-next v9 5/6] mm, bpf: Use memcg in try_alloc_pages().
  2025-02-22  2:44 [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (3 preceding siblings ...)
  2025-02-22  2:44 ` [PATCH bpf-next v9 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
@ 2025-02-22  2:44 ` Alexei Starovoitov
  2025-02-22  2:44 ` [PATCH bpf-next v9 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-02-22  2:44 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Unconditionally use __GFP_ACCOUNT in try_alloc_pages().
The caller is responsible for setting up memcg correctly.
All BPF memory accounting is memcg based.
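
Concretely, "the caller is responsible" means wrapping the allocation in
set_active_memcg(), as the BPF side does in the next patch. A minimal
sketch (memcg and nid are whatever the caller already holds; bpf uses the
map's memcg):

	struct mem_cgroup *old_memcg;
	struct page *page;

	old_memcg = set_active_memcg(memcg);	/* memcg chosen by the caller */
	page = try_alloc_pages(nid, 0);		/* charged via __GFP_ACCOUNT */
	set_active_memcg(old_memcg);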

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 mm/page_alloc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 79b39ad4bb1b..4ad0ba7801a8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7189,7 +7189,8 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
 	 * specify it here to highlight that try_alloc_pages()
 	 * doesn't want to deplete reserves.
 	 */
-	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC;
+	gfp_t alloc_gfp = __GFP_NOWARN | __GFP_ZERO | __GFP_NOMEMALLOC
+			| __GFP_ACCOUNT;
 	unsigned int alloc_flags = ALLOC_TRYLOCK;
 	struct alloc_context ac = { };
 	struct page *page;
@@ -7233,6 +7234,11 @@ struct page *try_alloc_pages_noprof(int nid, unsigned int order)
 
 	/* Unlike regular alloc_pages() there is no __alloc_pages_slowpath(). */
 
+	if (memcg_kmem_online() && page &&
+	    unlikely(__memcg_kmem_charge_page(page, alloc_gfp, order) != 0)) {
+		free_pages_nolock(page, order);
+		page = NULL;
+	}
 	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
 	kmsan_alloc_page(page, order, alloc_gfp);
 	return page;
-- 
2.43.5



* [PATCH bpf-next v9 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs.
  2025-02-22  2:44 [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (4 preceding siblings ...)
  2025-02-22  2:44 ` [PATCH bpf-next v9 5/6] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
@ 2025-02-22  2:44 ` Alexei Starovoitov
  2025-02-26  3:19 ` [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
  2025-02-27 17:50 ` patchwork-bot+netdevbpf
  7 siblings, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-02-22  2:44 UTC (permalink / raw)
  To: bpf
  Cc: andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

From: Alexei Starovoitov <ast@kernel.org>

Use try_alloc_pages() and free_pages_nolock() for BPF needs
when the context doesn't allow using normal alloc_pages().
This is a prerequisite for further work.
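
The "context doesn't allow" decision is made by a small helper in the
diff below; condensed, the policy is:

	/*
	 * Use the regular allocator only when preemptible, with IRQs
	 * enabled, and not on PREEMPT_RT; everything else goes through
	 * the trylock-only try_alloc_pages().
	 */
	if (preempt_count() == 0 && !irqs_disabled() &&
	    !IS_ENABLED(CONFIG_PREEMPT_RT))
		page = alloc_pages_node(nid, GFP_KERNEL | __GFP_ZERO |
					__GFP_ACCOUNT | __GFP_NOWARN, 0);
	else
		page = try_alloc_pages(nid, 0);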

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
 include/linux/bpf.h  |  2 +-
 kernel/bpf/arena.c   |  5 ++---
 kernel/bpf/syscall.c | 23 ++++++++++++++++++++---
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 15164787ce7f..aec102868b93 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2354,7 +2354,7 @@ int  generic_map_delete_batch(struct bpf_map *map,
 struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
 struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
 
-int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
+int bpf_map_alloc_pages(const struct bpf_map *map, int nid,
 			unsigned long nr_pages, struct page **page_array);
 #ifdef CONFIG_MEMCG
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 647b709d7d77..0d56cea71602 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -287,7 +287,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGSEGV;
 
 	/* Account into memcg of the process that created bpf_arena */
-	ret = bpf_map_alloc_pages(map, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE, 1, &page);
+	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
 		return VM_FAULT_SIGSEGV;
@@ -465,8 +465,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (ret)
 		goto out_free_pages;
 
-	ret = bpf_map_alloc_pages(&arena->map, GFP_KERNEL | __GFP_ZERO,
-				  node_id, page_cnt, pages);
+	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
 	if (ret)
 		goto out;
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index dbd89c13dd32..28680896c6a4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -569,7 +569,24 @@ static void bpf_map_release_memcg(struct bpf_map *map)
 }
 #endif
 
-int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
+static bool can_alloc_pages(void)
+{
+	return preempt_count() == 0 && !irqs_disabled() &&
+		!IS_ENABLED(CONFIG_PREEMPT_RT);
+}
+
+static struct page *__bpf_alloc_page(int nid)
+{
+	if (!can_alloc_pages())
+		return try_alloc_pages(nid, 0);
+
+	return alloc_pages_node(nid,
+				GFP_KERNEL | __GFP_ZERO | __GFP_ACCOUNT
+				| __GFP_NOWARN,
+				0);
+}
+
+int bpf_map_alloc_pages(const struct bpf_map *map, int nid,
 			unsigned long nr_pages, struct page **pages)
 {
 	unsigned long i, j;
@@ -582,14 +599,14 @@ int bpf_map_alloc_pages(const struct bpf_map *map, gfp_t gfp, int nid,
 	old_memcg = set_active_memcg(memcg);
 #endif
 	for (i = 0; i < nr_pages; i++) {
-		pg = alloc_pages_node(nid, gfp | __GFP_ACCOUNT, 0);
+		pg = __bpf_alloc_page(nid);
 
 		if (pg) {
 			pages[i] = pg;
 			continue;
 		}
 		for (j = 0; j < i; j++)
-			__free_page(pages[j]);
+			free_pages_nolock(pages[j], 0);
 		ret = -ENOMEM;
 		break;
 	}
-- 
2.43.5



* Re: [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages()
  2025-02-22  2:44 [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (5 preceding siblings ...)
  2025-02-22  2:44 ` [PATCH bpf-next v9 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
@ 2025-02-26  3:19 ` Alexei Starovoitov
  2025-02-27 17:50 ` patchwork-bot+netdevbpf
  7 siblings, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-02-26  3:19 UTC (permalink / raw)
  To: bpf
  Cc: Andrii Nakryiko, Kumar Kartikeya Dwivedi, Andrew Morton,
	Peter Zijlstra, Vlastimil Babka, Sebastian Sewior, Steven Rostedt,
	Hou Tao, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Fri, Feb 21, 2025 at 6:44 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> From: Alexei Starovoitov <ast@kernel.org>
>
> Hi All,
>
> The main motivation is to make alloc page and slab reentrant and
> remove bpf_mem_alloc.
>
> v8->v9:
> - Squash Vlastimil's fix/feature for localtry_trylock, and
>   update the commit log as suggested by Sebastian.
> - Drop _noprof suffix in try_alloc_pages kdoc
> - rebase

Looks like there are no more comments.
I'm going to apply to bpf-next soon.
There is always room for followups if necessary.


* Re: [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages()
  2025-02-22  2:44 [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
                   ` (6 preceding siblings ...)
  2025-02-26  3:19 ` [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
@ 2025-02-27 17:50 ` patchwork-bot+netdevbpf
  7 siblings, 0 replies; 30+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-02-27 17:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
	houtao1, hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj,
	linux-mm, kernel-team

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Fri, 21 Feb 2025 18:44:21 -0800 you wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> Hi All,
> 
> The main motivation is to make alloc page and slab reentrant and
> remove bpf_mem_alloc.
> 
> [...]

Here is the summary with links:
  - [bpf-next,v9,1/6] locking/local_lock: Introduce localtry_lock_t
    https://git.kernel.org/bpf/bpf-next/c/0aaddfb06882
  - [bpf-next,v9,2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
    https://git.kernel.org/bpf/bpf-next/c/97769a53f117
  - [bpf-next,v9,3/6] mm, bpf: Introduce free_pages_nolock()
    https://git.kernel.org/bpf/bpf-next/c/8c57b687e833
  - [bpf-next,v9,4/6] memcg: Use trylock to access memcg stock_lock.
    https://git.kernel.org/bpf/bpf-next/c/01d37228d331
  - [bpf-next,v9,5/6] mm, bpf: Use memcg in try_alloc_pages().
    https://git.kernel.org/bpf/bpf-next/c/e8d78dbd0199
  - [bpf-next,v9,6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs.
    https://git.kernel.org/bpf/bpf-next/c/c9eb8102e21e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-02-22  2:44 ` [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
@ 2025-03-11  2:04   ` Andrew Morton
  2025-03-11 13:32     ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Andrew Morton @ 2025-03-11  2:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, peterz, vbabka, bigeasy, rostedt, houtao1,
	hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj, linux-mm,
	kernel-team

On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> Tracing BPF programs execute from tracepoints and kprobes where
> running context is unknown, but they need to request additional
> memory. The prior workarounds were using pre-allocated memory and
> BPF specific freelists to satisfy such allocation requests.

The "prior workarounds" sound entirely appropriate.  Because the
performance and maintainability of Linux's page allocator is about
1,000,040 times more important than relieving BPF of having to carry a
"workaround".

IOW, I don't see any way in which this patchset even remotely meets any
sane cost/benefit comparison.

Feel free to explain why I am wrong, in great detail?


* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-11  2:04   ` Andrew Morton
@ 2025-03-11 13:32     ` Alexei Starovoitov
  2025-03-11 18:04       ` Mateusz Guzik
  2025-03-12 10:00       ` Vlastimil Babka
  0 siblings, 2 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-03-11 13:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Peter Zijlstra,
	Vlastimil Babka, Sebastian Sewior, Steven Rostedt, Hou Tao,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team

On Tue, Mar 11, 2025 at 3:04 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> > Tracing BPF programs execute from tracepoints and kprobes where
> > running context is unknown, but they need to request additional
> > memory. The prior workarounds were using pre-allocated memory and
> > BPF specific freelists to satisfy such allocation requests.
>
> The "prior workarounds" sound entirely appropriate.  Because the
> performance and maintainability of Linux's page allocator is about
> 1,000,040 times more important than relieving BPF of having to carry a
> "workaround".

Please explain where performance and maintainability are affected?

As far as motivation, if I recall correctly, you were present in
the room when Vlastimil presented the next steps for SLUB at
LSFMM back in May of last year.
A link to a memory refresher is in the commit log:
https://lwn.net/Articles/974138/

Back then he talked about a bunch of reasons including better
maintainability of the kernel overall, but what stood out to me
as the main reason to use SLUB for bpf, objpool, mempool,
and networking needs is prevention of memory waste.
All these wrappers of slub pin memory that should be shared.
bpf, objpool, mempools should be good citizens of the kernel
instead of stealing the memory. That's the core job of the
kernel. To share resources. Memory is one such resource.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t
  2025-02-22  2:44 ` [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t Alexei Starovoitov
@ 2025-03-11 15:44   ` Mateusz Guzik
  2025-03-11 16:20     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 30+ messages in thread
From: Mateusz Guzik @ 2025-03-11 15:44 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, andrii, memxor, akpm, peterz, vbabka, bigeasy, rostedt,
	houtao1, hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj,
	linux-mm, kernel-team

On Fri, Feb 21, 2025 at 06:44:22PM -0800, Alexei Starovoitov wrote:
> +#define __localtry_lock(lock)					\
> +	do {							\
> +		localtry_lock_t *lt;				\
> +		preempt_disable();				\
> +		lt = this_cpu_ptr(lock);			\
> +		local_lock_acquire(&lt->llock);			\
> +		WRITE_ONCE(lt->acquired, 1);			\
> +	} while (0)

I think these need compiler barriers.

I checked with gcc docs (https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html)
and found this as confirmation:
> Accesses to non-volatile objects are not ordered with respect to volatile accesses.

Unless the Linux kernel is built with some magic to render this moot(?).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t
  2025-03-11 15:44   ` Mateusz Guzik
@ 2025-03-11 16:20     ` Sebastian Andrzej Siewior
  2025-03-11 16:31       ` Mateusz Guzik
  0 siblings, 1 reply; 30+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-11 16:20 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexei Starovoitov, bpf, andrii, memxor, akpm, peterz, vbabka,
	rostedt, houtao1, hannes, shakeel.butt, mhocko, willy, tglx,
	jannh, tj, linux-mm, kernel-team

On 2025-03-11 16:44:30 [+0100], Mateusz Guzik wrote:
> On Fri, Feb 21, 2025 at 06:44:22PM -0800, Alexei Starovoitov wrote:
> > +#define __localtry_lock(lock)					\
> > +	do {							\
> > +		localtry_lock_t *lt;				\
> > +		preempt_disable();				\
> > +		lt = this_cpu_ptr(lock);			\
> > +		local_lock_acquire(&lt->llock);			\
> > +		WRITE_ONCE(lt->acquired, 1);			\
> > +	} while (0)
> 
> I think these need compiler barriers.
> 
> I checked with gcc docs (https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html)
> and found this as confirmation:
> > Accesses to non-volatile objects are not ordered with respect to volatile accesses.
> 
> Unless the Linux kernel is built with some magic to render this moot(?).

You say we need a barrier() after the WRITE_ONCE()? If so, we need it in
the whole file…

Sebastian

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t
  2025-03-11 16:20     ` Sebastian Andrzej Siewior
@ 2025-03-11 16:31       ` Mateusz Guzik
  2025-03-11 20:21         ` Vlastimil Babka
  0 siblings, 1 reply; 30+ messages in thread
From: Mateusz Guzik @ 2025-03-11 16:31 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, bpf, andrii, memxor, akpm, peterz, vbabka,
	rostedt, houtao1, hannes, shakeel.butt, mhocko, willy, tglx,
	jannh, tj, linux-mm, kernel-team

On Tue, Mar 11, 2025 at 5:21 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> On 2025-03-11 16:44:30 [+0100], Mateusz Guzik wrote:
> > On Fri, Feb 21, 2025 at 06:44:22PM -0800, Alexei Starovoitov wrote:
> > > +#define __localtry_lock(lock)                                      \
> > > +   do {                                                    \
> > > +           localtry_lock_t *lt;                            \
> > > +           preempt_disable();                              \
> > > +           lt = this_cpu_ptr(lock);                        \
> > > +           local_lock_acquire(&lt->llock);                 \
> > > +           WRITE_ONCE(lt->acquired, 1);                    \
> > > +   } while (0)
> >
> > I think these need compiler barriers.
> >
> > I checked with gcc docs (https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html)
> > and found this as confirmation:
> > > Accesses to non-volatile objects are not ordered with respect to volatile accesses.
> >
> > Unless the Linux kernel is built with some magic to render this moot(?).
>
> You say we need a barrier() after the WRITE_ONCE()? If so, we need it in
> the whole file…
>

I see the original local_lock machinery on the stock kernel works fine
as it expands to the preempt pair which has the appropriate fences. If
debug is added, the "locking" remains unaffected, but the debug state
might be bogus when looked at from the "wrong" context and adding the
compiler fences would trivially sort it out. I don't think it's a big
deal for *their* case, but patching that up should not raise any
eyebrows and may prevent eyebrows from going up later.

The machinery added in this patch does need the addition for
correctness in the base operation though.
-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-11 13:32     ` Alexei Starovoitov
@ 2025-03-11 18:04       ` Mateusz Guzik
  2025-03-12  9:45         ` Steven Rostedt
  2025-03-15  0:34         ` Alexei Starovoitov
  2025-03-12 10:00       ` Vlastimil Babka
  1 sibling, 2 replies; 30+ messages in thread
From: Mateusz Guzik @ 2025-03-11 18:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Andrew Morton, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Peter Zijlstra, Vlastimil Babka, Sebastian Sewior, Steven Rostedt,
	Hou Tao, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Tue, Mar 11, 2025 at 02:32:24PM +0100, Alexei Starovoitov wrote:
> On Tue, Mar 11, 2025 at 3:04 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > > Tracing BPF programs execute from tracepoints and kprobes where
> > > running context is unknown, but they need to request additional
> > > memory. The prior workarounds were using pre-allocated memory and
> > > BPF specific freelists to satisfy such allocation requests.
> >
> > The "prior workarounds" sound entirely appropriate.  Because the
> > performance and maintainability of Linux's page allocator is about
> > 1,000,040 times more important than relieving BPF of having to carry a
> > "workaround".
> 
> Please explain where performance and maintainability is affected?
> 

I have some related questions below. Note I'm a bystander, not claiming
to have any (N)ACK power.

A small bit before that:
       if (!spin_trylock_irqsave(&zone->lock, flags)) {
	       if (unlikely(alloc_flags & ALLOC_TRYLOCK))
		       return NULL;
	       spin_lock_irqsave(&zone->lock, flags);
       }

This is going to perform worse when contested due to an extra access to
the lock. I presume it was done this way to avoid suffering another
branch, with the assumption the trylock is normally going to succeed.

It so happens I'm looking at parallel exec on the side and while currently
there are bigger fish to fry, contention on zone->lock is very much a
factor. The majority of it comes from RCU freeing (in free_pcppages_bulk()),
but I also see several rmqueue calls below. As they trylock, they are
going to make it more expensive for free_pcppages_bulk() to even get the
lock.

So this *is* contested, but at the moment is largely overshadowed by
bigger problems (which someone(tm) hopefully will sort out sooner than
later).

So should this land, I expect someone is going to hoist the trylock at
some point in the future. If it was my patch I would just do it now, but
I understand this may result in new people showing up and complaining.
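
Concretely, hoisting it would look something like this (just a sketch of
the shape based on the hunk quoted above, not a tested patch; the reply
downthread spells out the same alternative):

       if (unlikely(alloc_flags & ALLOC_TRYLOCK)) {
               /* opportunistic callers must not spin on zone->lock */
               if (!spin_trylock_irqsave(&zone->lock, flags))
                       return NULL;
       } else {
               /* regular callers take the lock without the extra trylock */
               spin_lock_irqsave(&zone->lock, flags);
       }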

> As far as motivation, if I recall correctly, you were present in
> the room when Vlastimil presented the next steps for SLUB at
> LSFMM back in May of last year.
> A link to memory refresher is in the commit log:
> https://lwn.net/Articles/974138/
> 
> Back then he talked about a bunch of reasons including better
> maintainability of the kernel overall, but what stood out to me
> as the main reason to use SLUB for bpf, objpool, mempool,
> and networking needs is prevention of memory waste.
> All these wrappers of slub pin memory that should be shared.
> bpf, objpool, mempools should be good citizens of the kernel
> instead of stealing the memory. That's the core job of the
> kernel. To share resources. Memory is one such resource.
> 

I suspect the worry is that the added behavior may complicate things
down the road (or even prevent some optimizations) -- there is another
context to worry about.

I think it would help to outline why these are doing any memory
allocation from something like NMI to begin with. Perhaps you could have
carved out a very small piece of memory as a reserve just for that? It
would be refilled as needed from process context.

A general remark is that support for an arbitrary running context in
core primitives artificially limits what can be done to optimize them
for their most common users. IMO the sheaves patchset is a little bit of
an admission of that (also see below). It may be that the get-pages
routine will get there.

If non-task memory allocs got beaten to the curb, or at least got heavily
limited, then a small allocator just for that purpose would do the
trick and the two variants would likely be simpler than one thing which
supports everyone. This patchset is a step in the opposite direction,
but it may be there is a good reason.

To my understanding ebpf can be used to run "real" code to do something
or "merely" collect data. I presume the former case is already running
from a context where it can allocate memory no problem.

For the latter I presume ebpf has conflicting goals: 1. not disrupt the
workload under observation (cpu and ram) -- to that end small memory
usage limits are in place. otherwise a carelessly written aggregation
can OOM the box (e.g., say someone wants to know which files get opened
the most and aggregates on names, while a malicious user opens and
unlinks autogenerated names, endlessly growing the list if you let it)
2. actually log stuff even if resources are scarce. to that end I
would expect that a small area is pre-allocated and periodically drained

Which for me further puts a question mark on general alloc from the NMI
context.

All that said, the cover letter:
> The main motivation is to make alloc page and slab reentrant and
> remove bpf_mem_alloc.

does not justify why ebpf performs allocations in a manner which warrants
any of this, which I suspect is what Andrew asked about.

I never put any effort into ebpf -- it may be all the above makes
excellent sense. But then you need to make a case to the people
maintaining the code.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t
  2025-03-11 16:31       ` Mateusz Guzik
@ 2025-03-11 20:21         ` Vlastimil Babka
  2025-03-11 22:24           ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Vlastimil Babka @ 2025-03-11 20:21 UTC (permalink / raw)
  To: Mateusz Guzik, Sebastian Andrzej Siewior
  Cc: Alexei Starovoitov, bpf, andrii, memxor, akpm, peterz, rostedt,
	houtao1, hannes, shakeel.butt, mhocko, willy, tglx, jannh, tj,
	linux-mm, kernel-team

On 3/11/25 17:31, Mateusz Guzik wrote:
> On Tue, Mar 11, 2025 at 5:21 PM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
>>
>> On 2025-03-11 16:44:30 [+0100], Mateusz Guzik wrote:
>> > On Fri, Feb 21, 2025 at 06:44:22PM -0800, Alexei Starovoitov wrote:
>> > > +#define __localtry_lock(lock)                                      \
>> > > +   do {                                                    \
>> > > +           localtry_lock_t *lt;                            \
>> > > +           preempt_disable();                              \
>> > > +           lt = this_cpu_ptr(lock);                        \
>> > > +           local_lock_acquire(&lt->llock);                 \
>> > > +           WRITE_ONCE(lt->acquired, 1);                    \
>> > > +   } while (0)
>> >
>> > I think these need compiler barriers.
>> >
>> > I checked with gcc docs (https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html)
>> > and found this as confirmation:
>> > > Accesses to non-volatile objects are not ordered with respect to volatile accesses.
>> >
>> > Unless the Linux kernel is built with some magic to render this moot(?).
>>
>> You say we need a barrier() after the WRITE_ONCE()? If so, we need it in
>> the whole file…
>>
> 
> I see the original local_lock machinery on the stock kernel works fine
> as it expands to the preempt pair which has the appropriate fences. If
> debug is added, the "locking" remains unaffected, but the debug state
> might be bogus when looked at from the "wrong" context and adding the
> compiler fences would trivially sort it out. I don't think it's a big
> deal for *their* case, but patching that up should not raise any
> eyebrows and may prevent eyebrows from going up later.
> 
> The machinery added in this patch does need the addition for
> correctness in the base operation though.

Yeah my version of this kind of lock in sheaves code had those barrier()'s,
IIRC after you or Jann told me. It's needed so that the *compiler* does not
e.g. reorder a write to the protected data to happen before the
WRITE_ONCE(lt->acquired, 1) (or after the WRITE_ONCE(lt->acquired, 0) in
unlock).
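
IOW something like this placement (just a sketch of what's being
suggested here; the unlock side is assumed to mirror the quoted lock
macro, and this is not what the patch currently has):

#define __localtry_lock(lock)                                   \
        do {                                                    \
                localtry_lock_t *lt;                            \
                preempt_disable();                              \
                lt = this_cpu_ptr(lock);                        \
                local_lock_acquire(&lt->llock);                 \
                WRITE_ONCE(lt->acquired, 1);                    \
                /* keep writes to protected data after the flag is set */ \
                barrier();                                      \
        } while (0)

#define __localtry_unlock(lock)                                 \
        do {                                                    \
                localtry_lock_t *lt;                            \
                lt = this_cpu_ptr(lock);                        \
                /* keep writes to protected data before the flag is cleared */ \
                barrier();                                      \
                WRITE_ONCE(lt->acquired, 0);                    \
                local_lock_release(&lt->llock);                 \
                preempt_enable();                               \
        } while (0)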

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t
  2025-03-11 20:21         ` Vlastimil Babka
@ 2025-03-11 22:24           ` Alexei Starovoitov
  2025-03-12  8:29             ` Vlastimil Babka
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-03-11 22:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mateusz Guzik, Sebastian Andrzej Siewior, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
	Tejun Heo, linux-mm, Kernel Team

On Tue, Mar 11, 2025 at 9:21 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/11/25 17:31, Mateusz Guzik wrote:
> > On Tue, Mar 11, 2025 at 5:21 PM Sebastian Andrzej Siewior
> > <bigeasy@linutronix.de> wrote:
> >>
> >> On 2025-03-11 16:44:30 [+0100], Mateusz Guzik wrote:
> >> > On Fri, Feb 21, 2025 at 06:44:22PM -0800, Alexei Starovoitov wrote:
> >> > > +#define __localtry_lock(lock)                                      \
> >> > > +   do {                                                    \
> >> > > +           localtry_lock_t *lt;                            \
> >> > > +           preempt_disable();                              \
> >> > > +           lt = this_cpu_ptr(lock);                        \
> >> > > +           local_lock_acquire(&lt->llock);                 \
> >> > > +           WRITE_ONCE(lt->acquired, 1);                    \
> >> > > +   } while (0)
> >> >
> >> > I think these need compiler barriers.
> >> >
> >> > I checked with gcc docs (https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html)
> >> > and found this as confirmation:
> >> > > Accesses to non-volatile objects are not ordered with respect to volatile accesses.
> >> >
> >> > Unless the Linux kernel is built with some magic to render this moot(?).
> >>
> >> You say we need a barrier() after the WRITE_ONCE()? If so, we need it in
> >> the whole file…
> >>
> >
> > I see the original local_lock machinery on the stock kernel works fine
> > as it expands to the preempt pair which has the appropriate fences. If
> > debug is added, the "locking" remains unaffected, but the debug state
> > might be bogus when looked at from the "wrong" context and adding the
> > compiler fences would trivially sort it out. I don't think it's a big
> > deal for *their* case, but patching that up should not raise any
> > eyebrows and may prevent eyebrows from going up later.
> >
> > The machinery added in this patch does need the addition for
> > correctness in the base operation though.
>
> Yeah my version of this kind of lock in sheaves code had those barrier()'s,
> IIRC after you or Jann told me. It's needed so that the *compiler* does not
> e.g. reorder a write to the protected data to happen before the
> WRITE_ONCE(lt->acquired, 1) (or after the WRITE_ONCE(lt->acquired, 0) in
> unlock).

I think you all are missing a fine print in gcc doc:
"Unless...can be aliased".
The kernel is compiled with -fno-strict-aliasing.
No need for barrier()s here.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t
  2025-03-11 22:24           ` Alexei Starovoitov
@ 2025-03-12  8:29             ` Vlastimil Babka
  2025-03-14 21:05               ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Vlastimil Babka @ 2025-03-12  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mateusz Guzik, Sebastian Andrzej Siewior, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
	Tejun Heo, linux-mm, Kernel Team

On 3/11/25 23:24, Alexei Starovoitov wrote:
> On Tue, Mar 11, 2025 at 9:21 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> On 3/11/25 17:31, Mateusz Guzik wrote:
>> > On Tue, Mar 11, 2025 at 5:21 PM Sebastian Andrzej Siewior
>> > <bigeasy@linutronix.de> wrote:
>> >>
>> >> On 2025-03-11 16:44:30 [+0100], Mateusz Guzik wrote:
>> >> > On Fri, Feb 21, 2025 at 06:44:22PM -0800, Alexei Starovoitov wrote:
>> >> > > +#define __localtry_lock(lock)                                      \
>> >> > > +   do {                                                    \
>> >> > > +           localtry_lock_t *lt;                            \
>> >> > > +           preempt_disable();                              \
>> >> > > +           lt = this_cpu_ptr(lock);                        \
>> >> > > +           local_lock_acquire(&lt->llock);                 \
>> >> > > +           WRITE_ONCE(lt->acquired, 1);                    \
>> >> > > +   } while (0)
>> >> >
>> >> > I think these need compiler barriers.
>> >> >
>> >> > I checked with gcc docs (https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html)
>> >> > and found this as confirmation:
>> >> > > Accesses to non-volatile objects are not ordered with respect to volatile accesses.
>> >> >
>> >> > Unless the Linux kernel is built with some magic to render this moot(?).
>> >>
>> >> You say we need a barrier() after the WRITE_ONCE()? If so, we need it in
>> >> the whole file…
>> >>
>> >
>> > I see the original local_lock machinery on the stock kernel works fine
>> > as it expands to the preempt pair which has the appropriate fences. If
>> > debug is added, the "locking" remains unaffected, but the debug state
>> > might be bogus when looked at from the "wrong" context and adding the
>> > compiler fences would trivially sort it out. I don't think it's a big
>> > deal for *their* case, but patching that up should not raise any
>> > eyebrows and may prevent eyebrows from going up later.
>> >
>> > The machinery added in this patch does need the addition for
>> > correctness in the base operation though.
>>
>> Yeah my version of this kind of lock in sheaves code had those barrier()'s,
>> IIRC after you or Jann told me. It's needed so that the *compiler* does not
>> e.g. reorder a write to the protected data to happen before the
>> WRITE_ONCE(lt->acquired, 1) (or after the WRITE_ONCE(lt->acquired, 0) in
>> unlock).
> 
> I think you all are missing a fine print in gcc doc:
> "Unless...can be aliased".
> The kernel is compiled with -fno-strict-aliasing.
> No need for barrier()s here.

Note I know next to nothing about these things, but I see here [1]:

"Whether GCC actually performs type-based aliasing analysis depends on the
details of the code. GCC has other ways to determine (in some cases) whether
objects alias, and if it gets a reliable answer that way, it won’t fall back
on type-based heuristics. [...] You can turn off type-based aliasing
analysis by giving GCC the option -fno-strict-aliasing."

I'd read that as -fno-strict-aliasing only disables TBAA, but that does not
necessarily mean anything can be assumed to be aliased with anything?
And if we e.g. have a pointer to memcg_stock_pcp through which we access the
stock_lock and the other (protected) fields and that pointer doesn't change
in between, I imagine gcc can reliably determine these can't alias?

[1]
https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Aliasing-Type-Rules.html

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-11 18:04       ` Mateusz Guzik
@ 2025-03-12  9:45         ` Steven Rostedt
  2025-03-15  0:34         ` Alexei Starovoitov
  1 sibling, 0 replies; 30+ messages in thread
From: Steven Rostedt @ 2025-03-12  9:45 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexei Starovoitov, Andrew Morton, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Peter Zijlstra, Vlastimil Babka,
	Sebastian Sewior, Hou Tao, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
	Tejun Heo, linux-mm, Kernel Team

On Tue, 11 Mar 2025 19:04:47 +0100
Mateusz Guzik <mjguzik@gmail.com> wrote:

> A small bit before that:
>        if (!spin_trylock_irqsave(&zone->lock, flags)) {
> 	       if (unlikely(alloc_flags & ALLOC_TRYLOCK))
> 		       return NULL;
> 	       spin_lock_irqsave(&zone->lock, flags);
>        }
> 
> This is going to perform worse when contested due to an extra access to
> the lock. I presume it was done this way to avoid suffering another
> branch, with the assumption the trylock is normally going to succeed.

What extra access? If a spinlock were to fail, it keeps checking the
lock until it's released. If anything, this may actually help with
performance when contended.

Now, there are some implementations of spinlocks where, on failure to
secure the lock, some magic is done to spin on another bit instead of
the lock to prevent cache bouncing (as locks usually live on the same
cache line as the data they protect). When the owner releases the lock,
it will also have to tell the spinners that the lock is free again.

But this extra trylock is not going to show up outside the noise.

-- Steve

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-11 13:32     ` Alexei Starovoitov
  2025-03-11 18:04       ` Mateusz Guzik
@ 2025-03-12 10:00       ` Vlastimil Babka
  2025-03-12 19:06         ` Shakeel Butt
  2025-03-15  0:51         ` Alexei Starovoitov
  1 sibling, 2 replies; 30+ messages in thread
From: Vlastimil Babka @ 2025-03-12 10:00 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrew Morton
  Cc: bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi, Peter Zijlstra,
	Sebastian Sewior, Steven Rostedt, Hou Tao, Johannes Weiner,
	Shakeel Butt, Michal Hocko, Matthew Wilcox, Thomas Gleixner,
	Jann Horn, Tejun Heo, linux-mm, Kernel Team

On 3/11/25 14:32, Alexei Starovoitov wrote:
> On Tue, Mar 11, 2025 at 3:04 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>
>> > Tracing BPF programs execute from tracepoints and kprobes where
>> > running context is unknown, but they need to request additional
>> > memory. The prior workarounds were using pre-allocated memory and
>> > BPF specific freelists to satisfy such allocation requests.
>>
>> The "prior workarounds" sound entirely appropriate.  Because the
>> performance and maintainability of Linux's page allocator is about
>> 1,000,040 times more important than relieving BPF of having to carry a
>> "workaround".
> 
> Please explain where performance and maintainability is affected?
> 
> As far as motivation, if I recall correctly, you were present in
> the room when Vlastimil presented the next steps for SLUB at
> LSFMM back in May of last year.
> A link to memory refresher is in the commit log:
> https://lwn.net/Articles/974138/
> 
> Back then he talked about a bunch of reasons including better
> maintainability of the kernel overall, but what stood out to me
> as the main reason to use SLUB for bpf, objpool, mempool,
> and networking needs is prevention of memory waste.
> All these wrappers of slub pin memory that should be shared.
> bpf, objpool, mempools should be good citizens of the kernel
> instead of stealing the memory. That's the core job of the
> kernel. To share resources. Memory is one such resource.

Yes. Although at that time I've envisioned there would still be some
reserved objects set aside for these purposes. The difference would be they
would be under control of the allocator and not in multiple caches outside
of it.

But if we can achieve the same without such reserved objects, I think it's
even better. Performance and maintainability doesn't need to necessarily
suffer. Maybe it can even improve in the process. E.g. if we build upon
patches 1+4 and switch memcg stock locking to the non-irqsave variant, we
should avoid some overhead there (something similar was tried there in the
past but reverted when making it RT compatible).

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-12 10:00       ` Vlastimil Babka
@ 2025-03-12 19:06         ` Shakeel Butt
  2025-03-13  8:44           ` Michal Hocko
  2025-03-15  0:51         ` Alexei Starovoitov
  1 sibling, 1 reply; 30+ messages in thread
From: Shakeel Butt @ 2025-03-12 19:06 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Alexei Starovoitov, Andrew Morton, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Peter Zijlstra, Sebastian Sewior,
	Steven Rostedt, Hou Tao, Johannes Weiner, Michal Hocko,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Wed, Mar 12, 2025 at 11:00:20AM +0100, Vlastimil Babka wrote:
[...]
> 
> But if we can achieve the same without such reserved objects, I think it's
> even better. Performance and maintainability doesn't need to necessarily
> suffer. Maybe it can even improve in the process. E.g. if we build upon
> patches 1+4 and switch memcg stock locking to the non-irqsave variant, we
> should avoid some overhead there (something similar was tried there in the
> past but reverted when making it RT compatible).

In hindsight that revert was the bad decision. We accepted so much
complexity in memcg code for RT without questioning about a real world
use-case. Are there really RT users who want memcg or are using memcg? I
can not think of some RT user fine with memcg limits enforcement
(reclaim and throttling).

I am on the path to bypass per-cpu memcg stocks for RT kernels. The
stats would still need to be handled carefully but I don't see any reason to keep
the complexity in memcg stocks for RT. IMO RT should prefer
predictability over performance, so bypassing memcg stocks should be
fine.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-12 19:06         ` Shakeel Butt
@ 2025-03-13  8:44           ` Michal Hocko
  2025-03-13 14:21             ` Vlastimil Babka
  0 siblings, 1 reply; 30+ messages in thread
From: Michal Hocko @ 2025-03-13  8:44 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Vlastimil Babka, Alexei Starovoitov, Andrew Morton, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Peter Zijlstra,
	Sebastian Sewior, Steven Rostedt, Hou Tao, Johannes Weiner,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Wed 12-03-25 12:06:10, Shakeel Butt wrote:
> On Wed, Mar 12, 2025 at 11:00:20AM +0100, Vlastimil Babka wrote:
> [...]
> > 
> > But if we can achieve the same without such reserved objects, I think it's
> > even better. Performance and maintainability doesn't need to necessarily
> > suffer. Maybe it can even improve in the process. E.g. if we build upon
> > patches 1+4 and switch memcg stock locking to the non-irqsave variant, we
> > should avoid some overhead there (something similar was tried there in the
> > past but reverted when making it RT compatible).
> 
> In hindsight that revert was the bad decision. We accepted so much
> complexity in memcg code for RT without questioning about a real world
> use-case. Are there really RT users who want memcg or are using memcg? I
> can not think of some RT user fine with memcg limits enforcement
> (reclaim and throttling).

I do not think that there is any reasonable RT workload that would use
memcg limits or other memcg features. On the other hand it is not
unusual to have RT and non-RT workloads mixed on the same machine. They
usually use some sort of CPU isolation to prevent from CPU contention
but that doesn't help much if there are other resources they need to
contend for (like shared locks). 

> I am on the path to bypass per-cpu memcg stocks for RT kernels.

That would cause regressions for non-RT tasks running on PREEMPT_RT
kernels, right?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-13  8:44           ` Michal Hocko
@ 2025-03-13 14:21             ` Vlastimil Babka
  2025-03-13 16:02               ` Shakeel Butt
  2025-03-14 10:16               ` Michal Hocko
  0 siblings, 2 replies; 30+ messages in thread
From: Vlastimil Babka @ 2025-03-13 14:21 UTC (permalink / raw)
  To: Michal Hocko, Shakeel Butt
  Cc: Alexei Starovoitov, Andrew Morton, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Peter Zijlstra, Sebastian Sewior,
	Steven Rostedt, Hou Tao, Johannes Weiner, Matthew Wilcox,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team

On 3/13/25 09:44, Michal Hocko wrote:
> On Wed 12-03-25 12:06:10, Shakeel Butt wrote:
>> On Wed, Mar 12, 2025 at 11:00:20AM +0100, Vlastimil Babka wrote:
>> [...]
>> > 
>> > But if we can achieve the same without such reserved objects, I think it's
>> > even better. Performance and maintainability doesn't need to necessarily
>> > suffer. Maybe it can even improve in the process. E.g. if we build upon
>> > patches 1+4 and switch memcg stock locking to the non-irqsave variant, we
>> > should avoid some overhead there (something similar was tried there in the
>> > past but reverted when making it RT compatible).
>> 
>> In hindsight that revert was the bad decision. We accepted so much
>> complexity in memcg code for RT without questioning about a real world
>> use-case. Are there really RT users who want memcg or are using memcg? I
>> can not think of some RT user fine with memcg limits enforcement
>> (reclaim and throttling).
> 
> I do not think that there is any reasonable RT workload that would use
> memcg limits or other memcg features. On the other hand it is not
> unusual to have RT and non-RT workloads mixed on the same machine. They
> usually use some sort of CPU isolation to prevent from CPU contention
> but that doesn't help much if there are other resources they need to
> contend for (like shared locks). 
> 
>> I am on the path to bypass per-cpu memcg stocks for RT kernels.
> 
> That would cause regressions for non-RT tasks running on PREEMPT_RT
> kernels, right?

For the context, this is about commit 559271146efc ("mm/memcg: optimize user
context object stock access")

reverted in fead2b869764 ("mm/memcg: revert ("mm/memcg: optimize user
context object stock access")")

I think at this point we don't have to recreate the full approach of the
first commit and introduce separate in_task() and in-interrupt stocks again.

The localtry_lock itself should make it possible to avoid the
irqsave/restore overhead (which was the main performance benefit of
559271146efc [1]) and only end up bypassing the stock when an allocation
from irq context actually interrupts an allocation from task context - which
would be very rare. And it should be already RT compatible. Let me see how
hard it would be on top of patch 4/6 "memcg: Use trylock to access memcg
stock_lock" to switch to the variant without _irqsave...

[1] the revert cites benchmarks that irqsave/restore can be actually cheaper
than preempt disable/enable, but I believe those were flawed
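
Roughly, I mean something like the following shape for the fast path (a
sketch with assumed names - localtry_trylock()/localtry_unlock() and a
simplified struct memcg_stock_pcp, lock initialization omitted - not the
actual mm/memcontrol.c code):

struct memcg_stock_pcp {
	localtry_lock_t stock_lock;
	struct mem_cgroup *cached;
	unsigned int nr_pages;
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);

static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct memcg_stock_pcp *stock;
	bool ret = false;

	/*
	 * No irqsave/restore: the trylock can only fail when an allocation
	 * from irq/NMI context interrupts one from task context on this
	 * CPU, which should be very rare; the caller then just falls back
	 * to the slow path.
	 */
	if (!localtry_trylock(&memcg_stock.stock_lock))
		return ret;

	stock = this_cpu_ptr(&memcg_stock);
	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
		stock->nr_pages -= nr_pages;
		ret = true;
	}

	localtry_unlock(&memcg_stock.stock_lock);
	return ret;
}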

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-13 14:21             ` Vlastimil Babka
@ 2025-03-13 16:02               ` Shakeel Butt
  2025-03-14 10:16               ` Michal Hocko
  1 sibling, 0 replies; 30+ messages in thread
From: Shakeel Butt @ 2025-03-13 16:02 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Alexei Starovoitov, Andrew Morton, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Peter Zijlstra,
	Sebastian Sewior, Steven Rostedt, Hou Tao, Johannes Weiner,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Thu, Mar 13, 2025 at 03:21:48PM +0100, Vlastimil Babka wrote:
> On 3/13/25 09:44, Michal Hocko wrote:
> > On Wed 12-03-25 12:06:10, Shakeel Butt wrote:
> >> On Wed, Mar 12, 2025 at 11:00:20AM +0100, Vlastimil Babka wrote:
> >> [...]
> >> > 
> >> > But if we can achieve the same without such reserved objects, I think it's
> >> > even better. Performance and maintainability doesn't need to necessarily
> >> > suffer. Maybe it can even improve in the process. E.g. if we build upon
> >> > patches 1+4 and switch memcg stock locking to the non-irqsave variant, we
> >> > should avoid some overhead there (something similar was tried there in the
> >> > past but reverted when making it RT compatible).
> >> 
> >> In hindsight that revert was the bad decision. We accepted so much
> >> complexity in memcg code for RT without questioning about a real world
> >> use-case. Are there really RT users who want memcg or are using memcg? I
> >> can not think of some RT user fine with memcg limits enforcement
> >> (reclaim and throttling).
> > 
> > I do not think that there is any reasonable RT workload that would use
> > memcg limits or other memcg features. On the other hand it is not
> > unusual to have RT and non-RT workloads mixed on the same machine. They
> > usually use some sort of CPU isolation to prevent from CPU contention
> > but that doesn't help much if there are other resources they need to
> > contend for (like shared locks). 
> > 
> >> I am on the path to bypass per-cpu memcg stocks for RT kernels.
> > 
> > That would cause regressions for non-RT tasks running on PREEMPT_RT
> > kernels, right?

I would say it is more predictable for RT-kernel users, but anyway I am in
the process of prototyping and will share what it looks like.

> 
> For the context, this is about commit 559271146efc ("mm/memcg: optimize user
> context object stock access")
> 
> reverted in fead2b869764 ("mm/memcg: revert ("mm/memcg: optimize user
> context object stock access")")
> 
> I think at this point we don't have to recreate the full approach of the
> first commit and introduce separate in_task() and in-interrupt stocks again.
> 
> The localtry_lock itself should make it possible to avoid the
> irqsave/restore overhead (which was the main performance benefit of
> 559271146efc [1]) and only end up bypassing the stock when an allocation
> from irq context actually interrupts an allocation from task context - which
> would be very rare. And it should be already RT compatible. Let me see how
> hard it would be on top of patch 4/6 "memcg: Use trylock to access memcg
> stock_lock" to switch to the variant without _irqsave...

I am already changing stuff in this area, and I will give this idea a
try as well.

> 
> [1] the revert cites benchmarks that irqsave/restore can be actually cheaper
> than preempt disable/enable, but I believe those were flawed

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-13 14:21             ` Vlastimil Babka
  2025-03-13 16:02               ` Shakeel Butt
@ 2025-03-14 10:16               ` Michal Hocko
  1 sibling, 0 replies; 30+ messages in thread
From: Michal Hocko @ 2025-03-14 10:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Shakeel Butt, Alexei Starovoitov, Andrew Morton, bpf,
	Andrii Nakryiko, Kumar Kartikeya Dwivedi, Peter Zijlstra,
	Sebastian Sewior, Steven Rostedt, Hou Tao, Johannes Weiner,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Thu 13-03-25 15:21:48, Vlastimil Babka wrote:
> On 3/13/25 09:44, Michal Hocko wrote:
> > On Wed 12-03-25 12:06:10, Shakeel Butt wrote:
> >> On Wed, Mar 12, 2025 at 11:00:20AM +0100, Vlastimil Babka wrote:
> >> [...]
> >> > 
> >> > But if we can achieve the same without such reserved objects, I think it's
> >> > even better. Performance and maintainability doesn't need to necessarily
> >> > suffer. Maybe it can even improve in the process. E.g. if we build upon
> >> > patches 1+4 and switch memcg stock locking to the non-irqsave variant, we
> >> > should avoid some overhead there (something similar was tried there in the
> >> > past but reverted when making it RT compatible).
> >> 
> >> In hindsight that revert was the bad decision. We accepted so much
> >> complexity in memcg code for RT without questioning about a real world
> >> use-case. Are there really RT users who want memcg or are using memcg? I
> >> can not think of some RT user fine with memcg limits enforcement
> >> (reclaim and throttling).
> > 
> > I do not think that there is any reasonable RT workload that would use
> > memcg limits or other memcg features. On the other hand it is not
> > unusual to have RT and non-RT workloads mixed on the same machine. They
> > usually use some sort of CPU isolation to prevent from CPU contention
> > but that doesn't help much if there are other resources they need to
> > contend for (like shared locks). 
> > 
> >> I am on the path to bypass per-cpu memcg stocks for RT kernels.
> > 
> > That would cause regressions for non-RT tasks running on PREEMPT_RT
> > kernels, right?
> 
> For the context, this is about commit 559271146efc ("mm/memcg: optimize user
> context object stock access")
> 
> reverted in fead2b869764 ("mm/memcg: revert ("mm/memcg: optimize user
> context object stock access")")
> 
> I think at this point we don't have to recreate the full approach of the
> first commit and introduce separate in_task() and in-interrupt stocks again.
> 
> The localtry_lock itself should make it possible to avoid the
> irqsave/restore overhead (which was the main performance benefit of
> 559271146efc [1]) and only end up bypassing the stock when an allocation
> from irq context actually interrupts an allocation from task context - which
> would be very rare. And it should be already RT compatible. Let me see how
> hard it would be on top of patch 4/6 "memcg: Use trylock to access memcg
> stock_lock" to switch to the variant without _irqsave...

makes sense.

> [1] the revert cites benchmarks that irqsave/restore can be actually cheaper
> than preempt disable/enable, but I believe those were flawed

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t
  2025-03-12  8:29             ` Vlastimil Babka
@ 2025-03-14 21:05               ` Alexei Starovoitov
  2025-03-14 21:08                 ` Vlastimil Babka
  0 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2025-03-14 21:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mateusz Guzik, Sebastian Andrzej Siewior, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
	Tejun Heo, linux-mm, Kernel Team

On Wed, Mar 12, 2025 at 1:29 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/11/25 23:24, Alexei Starovoitov wrote:
> > On Tue, Mar 11, 2025 at 9:21 PM Vlastimil Babka <vbabka@suse.cz> wrote:
> >>
> >> On 3/11/25 17:31, Mateusz Guzik wrote:
> >> > On Tue, Mar 11, 2025 at 5:21 PM Sebastian Andrzej Siewior
> >> > <bigeasy@linutronix.de> wrote:
> >> >>
> >> >> On 2025-03-11 16:44:30 [+0100], Mateusz Guzik wrote:
> >> >> > On Fri, Feb 21, 2025 at 06:44:22PM -0800, Alexei Starovoitov wrote:
> >> >> > > +#define __localtry_lock(lock)                                      \
> >> >> > > +   do {                                                    \
> >> >> > > +           localtry_lock_t *lt;                            \
> >> >> > > +           preempt_disable();                              \
> >> >> > > +           lt = this_cpu_ptr(lock);                        \
> >> >> > > +           local_lock_acquire(&lt->llock);                 \
> >> >> > > +           WRITE_ONCE(lt->acquired, 1);                    \
> >> >> > > +   } while (0)
> >> >> >
> >> >> > I think these need compiler barriers.
> >> >> >
> >> >> > I checked with gcc docs (https://gcc.gnu.org/onlinedocs/gcc/Volatiles.html)
> >> >> > and found this as confirmation:
> >> >> > > Accesses to non-volatile objects are not ordered with respect to volatile accesses.
> >> >> >
> >> >> > Unless the Linux kernel is built with some magic to render this moot(?).
> >> >>
> >> >> You say we need a barrier() after the WRITE_ONCE()? If so, we need it in
> >> >> the whole file…
> >> >>
> >> >
> >> > I see the original local_lock machinery on the stock kernel works fine
> >> > as it expands to the preempt pair which has the appropriate fences. If
> >> > debug is added, the "locking" remains unaffected, but the debug state
> >> > might be bogus when looked at from the "wrong" context and adding the
> >> > compiler fences would trivially sort it out. I don't think it's a big
> >> > deal for *their* case, but patching that up should not raise any
> >> > eyebrows and may prevent eyebrows from going up later.
> >> >
> >> > The machinery added in this patch does need the addition for
> >> > correctness in the base operation though.
> >>
> >> Yeah my version of this kind of lock in sheaves code had those barrier()'s,
> >> IIRC after you or Jann told me. It's needed so that the *compiler* does not
> >> e.g. reorder a write to the protected data to happen before the
> >> WRITE_ONCE(lt->acquired, 1) (or after the WRITE_ONCE(lt->acquired, 0) in
> >> unlock).
> >
> > I think you all are missing a fine print in gcc doc:
> > "Unless...can be aliased".
> > The kernel is compiled with -fno-strict-aliasing.
> > No need for barrier()s here.
>
> Note I know next to nothing about these things, but I see here [1]:
>
> "Whether GCC actually performs type-based aliasing analysis depends on the
> details of the code. GCC has other ways to determine (in some cases) whether
> objects alias, and if it gets a reliable answer that way, it won’t fall back
> on type-based heuristics. [...] You can turn off type-based aliasing
> analysis by giving GCC the option -fno-strict-aliasing."
>
> I'd read that as -fno-strict-aliasing only disables TBAA, but that does not
> necessarily mean anything can be assumed to be aliased with anything?

That's correct.

> And if we e.g. have a pointer to memcg_stock_pcp through which we access the
> stock_lock and the other (protected) fields and that pointer doesn't change
> in between, I imagine gcc can reliably determine these can't alias?

Though my last gcc commit was very long ago here is a simple example
where compiler can reorder/combine stores:
struct s {
   short a, b;
} *p;
p->a = 1;
p->b = 2;
The compiler can keep them as-is, combine or reorder even with
-fno-strict-aliasing, because it can determine that a and b don't alias.

But after re-reading gcc doc on volatiles again it's clear that
extra barriers are not necessary.
The main part:
"The minimum requirement is that at a sequence point all previous
accesses to volatile objects have stabilized"

So anything after WRITE_ONCE(lt->acquired, 1); will not be hoisted up
and that's what we care about here.
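
If anyone wants to double check, here is a hypothetical standalone snippet
(not kernel code) that can be built with -O2 to inspect the assembly: the
two plain stores may well be merged, but the thing to look at is whether
they move across the volatile stores that stand in for WRITE_ONCE():

struct s {
	short a, b;
};

void update(struct s *p, volatile int *acquired)
{
	*acquired = 1;	/* stands in for WRITE_ONCE(lt->acquired, 1) */
	p->a = 1;	/* "protected" data written under the flag */
	p->b = 2;
	*acquired = 0;	/* stands in for WRITE_ONCE(lt->acquired, 0) */
}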

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t
  2025-03-14 21:05               ` Alexei Starovoitov
@ 2025-03-14 21:08                 ` Vlastimil Babka
  2025-03-14 21:18                   ` Alexei Starovoitov
  0 siblings, 1 reply; 30+ messages in thread
From: Vlastimil Babka @ 2025-03-14 21:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Mateusz Guzik, Sebastian Andrzej Siewior, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
	Tejun Heo, linux-mm, Kernel Team

On 3/14/25 22:05, Alexei Starovoitov wrote:
> On Wed, Mar 12, 2025 at 1:29 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> 
> That's correct.
> 
>> And if we e.g. have a pointer to memcg_stock_pcp through which we access the
>> stock_lock and the other (protected) fields and that pointer doesn't change
>> in between, I imagine gcc can reliably determine these can't alias?
> 
> Though my last gcc commit was very long ago here is a simple example
> where compiler can reorder/combine stores:
> struct s {
>    short a, b;
> } *p;
> p->a = 1;
> p->b = 2;
> The compiler can keep them as-is, combine or reorder even with
> -fno-strict-aliasing, because it can determine that a and b don't alias.
> 
> But after re-reading gcc doc on volatiles again it's clear that
> extra barriers are not necessary.
> The main part:
> "The minimum requirement is that at a sequence point all previous
> accesses to volatile objects have stabilized"
> 
> So anything after WRITE_ONCE(lt->acquired, 1); will not be hoisted up
> and that's what we care about here.

OK, is there similar guarantee for the unlock side? No write will be moved
after WRITE_ONCE(lt->acquired, 0); there?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t
  2025-03-14 21:08                 ` Vlastimil Babka
@ 2025-03-14 21:18                   ` Alexei Starovoitov
  0 siblings, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-03-14 21:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mateusz Guzik, Sebastian Andrzej Siewior, bpf, Andrii Nakryiko,
	Kumar Kartikeya Dwivedi, Andrew Morton, Peter Zijlstra,
	Steven Rostedt, Hou Tao, Johannes Weiner, Shakeel Butt,
	Michal Hocko, Matthew Wilcox, Thomas Gleixner, Jann Horn,
	Tejun Heo, linux-mm, Kernel Team

On Fri, Mar 14, 2025 at 2:08 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/14/25 22:05, Alexei Starovoitov wrote:
> > On Wed, Mar 12, 2025 at 1:29 AM Vlastimil Babka <vbabka@suse.cz> wrote:
> >
> > That's correct.
> >
> >> And if we e.g. have a pointer to memcg_stock_pcp through which we access the
> >> stock_lock and the other (protected) fields and that pointer doesn't change
> >> in between, I imagine gcc can reliably determine these can't alias?
> >
> > Though my last gcc commit was very long ago here is a simple example
> > where compiler can reorder/combine stores:
> > struct s {
> >    short a, b;
> > } *p;
> > p->a = 1;
> > p->b = 2;
> > The compiler can keep them as-is, combine or reorder even with
> > -fno-strict-aliasing, because it can determine that a and b don't alias.
> >
> > But after re-reading gcc doc on volatiles again it's clear that
> > extra barriers are not necessary.
> > The main part:
> > "The minimum requirement is that at a sequence point all previous
> > accesses to volatile objects have stabilized"
> >
> > So anything after WRITE_ONCE(lt->acquired, 1); will not be hoisted up
> > and that's what we care about here.
>
> OK, is there similar guarantee for the unlock side? No write will be moved
> after WRITE_ONCE(lt->acquired, 0); there?

Yes, because the first line:

lt = this_cpu_ptr(lock);
WRITE_ONCE(lt->acquired, 0);

this_cpu_ptr is pretty much a black box for the compiler.
'lt' can alias with anything before it.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-11 18:04       ` Mateusz Guzik
  2025-03-12  9:45         ` Steven Rostedt
@ 2025-03-15  0:34         ` Alexei Starovoitov
  1 sibling, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-03-15  0:34 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Andrew Morton, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Peter Zijlstra, Vlastimil Babka, Sebastian Sewior, Steven Rostedt,
	Hou Tao, Johannes Weiner, Shakeel Butt, Michal Hocko,
	Matthew Wilcox, Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm,
	Kernel Team

On Tue, Mar 11, 2025 at 11:04 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Tue, Mar 11, 2025 at 02:32:24PM +0100, Alexei Starovoitov wrote:
> > On Tue, Mar 11, 2025 at 3:04 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> > >
> > > > Tracing BPF programs execute from tracepoints and kprobes where
> > > > running context is unknown, but they need to request additional
> > > > memory. The prior workarounds were using pre-allocated memory and
> > > > BPF specific freelists to satisfy such allocation requests.
> > >
> > > The "prior workarounds" sound entirely appropriate.  Because the
> > > performance and maintainability of Linux's page allocator is about
> > > 1,000,040 times more important than relieving BPF of having to carry a
> > > "workaround".
> >
> > Please explain where performance and maintainability is affected?
> >
>
> I have some related questions below. Note I'm a bystander, not claiming
> to have any (N)ACK power.
>
> A small bit before that:
>        if (!spin_trylock_irqsave(&zone->lock, flags)) {
>                if (unlikely(alloc_flags & ALLOC_TRYLOCK))
>                        return NULL;
>                spin_lock_irqsave(&zone->lock, flags);
>        }
>
> This is going to perform worse when contested due to an extra access to
> the lock. I presume it was done this way to avoid suffering another
> branch, with the assumption the trylock is normally going to succeed.

Steven already explained that the extra trylock is noise from a performance pov.
Also notice that it's indeed not written as
if (unlikely(alloc_flags & ALLOC_TRYLOCK))
    spin_trylock_irqsave(...);
else
    spin_lock_irqsave(...);

because this way it will suffer an extra branch.
Even with this extra branch it would have been in the noise
considering all the prior branches rmqueue*() does.
With unconditional trylock first the performance noise is even less
noticeable.

> I think it would help to outline why these are doing any memory
> allocation from something like NMI to begin with. Perhaps you could have
> carved out a very small piece of memory as a reserve just for that? It
> would be refilled as needed from process context.

It's not about NMI. It's about an unknown context.
bpf and other subsystems will not be calling it from NMI on purpose.
It can happen by accident.
See netpoll_send_udp().
It's using alloc_skb(GFP_ATOMIC) which is the same as GFP_WISH_ME_LUCK.
netpoll is called from an arbitrary context where accessing slab is not ok.

> If non-task memory allocs got beaten to the curb, or at least got heavily
> limited, then a small allocator just for that purpose would do the
> trick and the two variants would likely be simpler than one thing which
> supports everyone.

Try diff stat of this patch vs diff of bpf_mem_alloc and other hacks
to see what it takes to build "small allocator".
Also notice how "small allocator" doesn't work for netpoll.
It needs an skb in an unknown context and currently pretends
that GFP_ATOMIC isn't going to crash.
Any context kmalloc will finally fix it.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation
  2025-03-12 10:00       ` Vlastimil Babka
  2025-03-12 19:06         ` Shakeel Butt
@ 2025-03-15  0:51         ` Alexei Starovoitov
  1 sibling, 0 replies; 30+ messages in thread
From: Alexei Starovoitov @ 2025-03-15  0:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, bpf, Andrii Nakryiko, Kumar Kartikeya Dwivedi,
	Peter Zijlstra, Sebastian Sewior, Steven Rostedt, Hou Tao,
	Johannes Weiner, Shakeel Butt, Michal Hocko, Matthew Wilcox,
	Thomas Gleixner, Jann Horn, Tejun Heo, linux-mm, Kernel Team

On Wed, Mar 12, 2025 at 3:00 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 3/11/25 14:32, Alexei Starovoitov wrote:
> > On Tue, Mar 11, 2025 at 3:04 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> On Fri, 21 Feb 2025 18:44:23 -0800 Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>
> >> > Tracing BPF programs execute from tracepoints and kprobes where
> >> > running context is unknown, but they need to request additional
> >> > memory. The prior workarounds were using pre-allocated memory and
> >> > BPF specific freelists to satisfy such allocation requests.
> >>
> >> The "prior workarounds" sound entirely appropriate.  Because the
> >> performance and maintainability of Linux's page allocator is about
> >> 1,000,040 times more important than relieving BPF of having to carry a
> >> "workaround".
> >
> > Please explain where performance and maintainability is affected?
> >
> > As far as motivation, if I recall correctly, you were present in
> > the room when Vlastimil presented the next steps for SLUB at
> > LSFMM back in May of last year.
> > A link to memory refresher is in the commit log:
> > https://lwn.net/Articles/974138/
> >
> > Back then he talked about a bunch of reasons including better
> > maintainability of the kernel overall, but what stood out to me
> > as the main reason to use SLUB for bpf, objpool, mempool,
> > and networking needs is prevention of memory waste.
> > All these wrappers of slub pin memory that should be shared.
> > bpf, objpool, mempools should be good citizens of the kernel
> > instead of stealing the memory. That's the core job of the
> > kernel. To share resources. Memory is one such resource.
>
> Yes. Although at that time I've envisioned there would still be some
> reserved objects set aside for these purposes. The difference would be they
> would be under control of the allocator and not in multiple caches outside
> of it.

Yes. Exactly. So far it looks like we don't have to add a pool of
reserved objects. percpu caches are such reserve pools already.
In the worst case it will be one global reserve shared by everyone
under mm control. shrinker may be an option too.
All that complexity looks very unlikely at this point.
I'm certainly more optimistic now than when we started on this path
back in November :)

> But if we can achieve the same without such reserved objects, I think it's
> even better. Performance and maintainability doesn't need to necessarily
> suffer. Maybe it can even improve in the process. E.g. if we build upon
> patches 1+4 and switch memcg stock locking to the non-irqsave variant, we
> should avoid some overhead there (something similar was tried there in the
> past but reverted when making it RT compatible).

Sounds like Shakeel is starting to experiment in this area which is great.
Performance improvements in memcg are certainly very welcomed.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread (newest message: 2025-03-15  0:51 UTC)

Thread overview: 30+ messages
2025-02-22  2:44 [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
2025-02-22  2:44 ` [PATCH bpf-next v9 1/6] locking/local_lock: Introduce localtry_lock_t Alexei Starovoitov
2025-03-11 15:44   ` Mateusz Guzik
2025-03-11 16:20     ` Sebastian Andrzej Siewior
2025-03-11 16:31       ` Mateusz Guzik
2025-03-11 20:21         ` Vlastimil Babka
2025-03-11 22:24           ` Alexei Starovoitov
2025-03-12  8:29             ` Vlastimil Babka
2025-03-14 21:05               ` Alexei Starovoitov
2025-03-14 21:08                 ` Vlastimil Babka
2025-03-14 21:18                   ` Alexei Starovoitov
2025-02-22  2:44 ` [PATCH bpf-next v9 2/6] mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation Alexei Starovoitov
2025-03-11  2:04   ` Andrew Morton
2025-03-11 13:32     ` Alexei Starovoitov
2025-03-11 18:04       ` Mateusz Guzik
2025-03-12  9:45         ` Steven Rostedt
2025-03-15  0:34         ` Alexei Starovoitov
2025-03-12 10:00       ` Vlastimil Babka
2025-03-12 19:06         ` Shakeel Butt
2025-03-13  8:44           ` Michal Hocko
2025-03-13 14:21             ` Vlastimil Babka
2025-03-13 16:02               ` Shakeel Butt
2025-03-14 10:16               ` Michal Hocko
2025-03-15  0:51         ` Alexei Starovoitov
2025-02-22  2:44 ` [PATCH bpf-next v9 3/6] mm, bpf: Introduce free_pages_nolock() Alexei Starovoitov
2025-02-22  2:44 ` [PATCH bpf-next v9 4/6] memcg: Use trylock to access memcg stock_lock Alexei Starovoitov
2025-02-22  2:44 ` [PATCH bpf-next v9 5/6] mm, bpf: Use memcg in try_alloc_pages() Alexei Starovoitov
2025-02-22  2:44 ` [PATCH bpf-next v9 6/6] bpf: Use try_alloc_pages() to allocate pages for bpf needs Alexei Starovoitov
2025-02-26  3:19 ` [PATCH bpf-next v9 0/6] bpf, mm: Introduce try_alloc_pages() Alexei Starovoitov
2025-02-27 17:50 ` patchwork-bot+netdevbpf
