BPF List
* [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()
@ 2025-09-05  6:19 Peilin Ye
  2025-09-05  6:28 ` Peilin Ye
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Peilin Ye @ 2025-09-05  6:19 UTC (permalink / raw)
  To: bpf
  Cc: Peilin Ye, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden

Calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various
locking issues; see the following stack trace (edited for style) as one
example:

...
 [10.011566]  do_raw_spin_lock.cold
 [10.011570]  try_to_wake_up             (5) double-acquiring the same
 [10.011575]  kick_pool                      rq_lock, causing a hardlockup
 [10.011579]  __queue_work
 [10.011582]  queue_work_on
 [10.011585]  kernfs_notify
 [10.011589]  cgroup_file_notify
 [10.011593]  try_charge_memcg           (4) memcg accounting raises an
 [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
 [10.011599]  obj_cgroup_charge_account
 [10.011600]  __memcg_slab_post_alloc_hook
 [10.011603]  __kmalloc_node_noprof
...
 [10.011611]  bpf_map_kmalloc_node
 [10.011612]  __bpf_async_init
 [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
 [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
 [10.011619]  bpf__sched_ext_ops_runnable
 [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
 [10.011622]  enqueue_task
 [10.011626]  ttwu_do_activate
 [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
...

The above was reproduced on bpf-next (b338cf849ec8) by modifying
./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
ops.runnable(), and hacking [1] the memcg accounting code a bit to make
it (much more likely to) raise an MEMCG_MAX event from a
bpf_timer_init() call.

We have also run into other similar variants both internally (without
applying the [1] hack) and on bpf-next, including:

 * run_timer_softirq() -> cgroup_file_notify()
   (grabs cgroup_file_kn_lock) -> try_to_wake_up() ->
   BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
   try_charge_memcg() raises MEMCG_MAX ->
   cgroup_file_notify() (tries to grab cgroup_file_kn_lock again)

 * __queue_work() (grabs worker_pool::lock) -> try_to_wake_up() ->
   BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
   try_charge_memcg() raises MEMCG_MAX -> cgroup_file_notify() ->
   __queue_work() (tries to grab the same worker_pool::lock)
 ...

As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
bpf_hrtimer and bpf_work, to skip memcg accounting.

Tested with vmtest.sh (llvm-18, x86-64):

 $ ./test_progs -a '*timer*' -a '*wq*'
...
 Summary: 7/12 PASSED, 0 SKIPPED, 0 FAILED

[1] Making a bpf_timer_init() call (much more likely to) raise an
MEMCG_MAX event (gist-only, for brevity):

kernel/bpf/helpers.c:__bpf_async_init():
          /* allocate hrtimer via map_kmalloc to use memcg accounting */
 -        cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node);
 +        cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_HACK,
 +                                  map->numa_node);

mm/memcontrol.c:try_charge_memcg():
          if (!do_memsw_account() ||
 -            page_counter_try_charge(&memcg->memsw, batch, &counter)) {
 -                if (page_counter_try_charge(&memcg->memory, batch, &counter))
 +            page_counter_try_charge_hack(&memcg->memsw, batch, &counter,
 +                                         gfp_mask & __GFP_HACK)) {
 +                if (page_counter_try_charge_hack(&memcg->memory, batch,
 +                                                 &counter,
 +                                                 gfp_mask & __GFP_HACK))
                          goto done_restock;

mm/page_counter.c:page_counter_try_charge():
 -bool page_counter_try_charge(struct page_counter *counter,
 -                             unsigned long nr_pages,
 -                             struct page_counter **fail)
 +bool page_counter_try_charge_hack(struct page_counter *counter,
 +                                  unsigned long nr_pages,
 +                                  struct page_counter **fail, bool hack)
 {
...
 -                if (new > c->max) {
 +                if (hack || new > c->max) {     // goto failed;
                          atomic_long_sub(nr_pages, &c->usage);
                          /*

Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.")
Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
---
 kernel/bpf/helpers.c | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index e3a2662f4e33..f7f3c6fb59ee 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1085,10 +1085,7 @@ struct bpf_async_cb {
 	struct bpf_prog *prog;
 	void __rcu *callback_fn;
 	void *value;
-	union {
-		struct rcu_head rcu;
-		struct work_struct delete_work;
-	};
+	struct work_struct delete_work;
 	u64 flags;
 };
 
@@ -1221,7 +1218,7 @@ static void bpf_wq_delete_work(struct work_struct *work)
 
 	cancel_work_sync(&w->work);
 
-	kfree_rcu(w, cb.rcu);
+	bpf_mem_free_rcu(&bpf_global_ma, w);
 }
 
 static void bpf_timer_delete_work(struct work_struct *work)
@@ -1230,13 +1227,13 @@ static void bpf_timer_delete_work(struct work_struct *work)
 
 	/* Cancel the timer and wait for callback to complete if it was running.
 	 * If hrtimer_cancel() can be safely called it's safe to call
-	 * kfree_rcu(t) right after for both preallocated and non-preallocated
+	 * bpf_mem_free_rcu(t) right after for both preallocated and non-preallocated
 	 * maps.  The async->cb = NULL was already done and no code path can see
 	 * address 't' anymore. Timer if armed for existing bpf_hrtimer before
 	 * bpf_timer_cancel_and_free will have been cancelled.
 	 */
 	hrtimer_cancel(&t->timer);
-	kfree_rcu(t, cb.rcu);
+	bpf_mem_free_rcu(&bpf_global_ma, t);
 }
 
 static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u64 flags,
@@ -1270,8 +1267,7 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u
 		goto out;
 	}
 
-	/* allocate hrtimer via map_kmalloc to use memcg accounting */
-	cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node);
+	cb = bpf_mem_alloc(&bpf_global_ma, size);
 	if (!cb) {
 		ret = -ENOMEM;
 		goto out;
@@ -1567,7 +1563,7 @@ void bpf_timer_cancel_and_free(void *val)
 	 * callback_fn. In such case we don't call hrtimer_cancel() (since it
 	 * will deadlock) and don't call hrtimer_try_to_cancel() (since it will
 	 * just return -1). Though callback_fn is still running on this cpu it's
-	 * safe to do kfree(t) because bpf_timer_cb() read everything it needed
+	 * safe to free 't' because bpf_timer_cb() read everything it needed
 	 * from 't'. The bpf subprog callback_fn won't be able to access 't',
 	 * since async->cb = NULL was already done. The timer will be
 	 * effectively cancelled because bpf_timer_cb() will return
@@ -1577,7 +1573,7 @@ void bpf_timer_cancel_and_free(void *val)
 	 * timer _before_ calling us, such that failing to cancel it here will
 	 * cause it to possibly use struct hrtimer after freeing bpf_hrtimer.
 	 * Therefore, we _need_ to cancel any outstanding timers before we do
-	 * kfree_rcu, even though no more timers can be armed.
+	 * bpf_mem_free_rcu(), even though no more timers can be armed.
 	 *
 	 * Moreover, we need to schedule work even if timer does not belong to
 	 * the calling callback_fn, as on two different CPUs, we can end up in a
@@ -1604,7 +1600,7 @@ void bpf_timer_cancel_and_free(void *val)
 		 * completion.
 		 */
 		if (hrtimer_try_to_cancel(&t->timer) >= 0)
-			kfree_rcu(t, cb.rcu);
+			bpf_mem_free_rcu(&bpf_global_ma, t);
 		else
 			queue_work(system_unbound_wq, &t->cb.delete_work);
 	} else {
-- 
2.51.0.355.g5224444f11-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()
  2025-09-05  6:19 [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init() Peilin Ye
@ 2025-09-05  6:28 ` Peilin Ye
  2025-09-05 15:18 ` Alexei Starovoitov
  2025-09-05 23:51 ` Peilin Ye
  2 siblings, 0 replies; 10+ messages in thread
From: Peilin Ye @ 2025-09-05  6:28 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden

Hi all,

On Fri, Sep 05, 2025 at 06:19:17AM +0000, Peilin Ye wrote:
> The above was reproduced on bpf-next (b338cf849ec8) by modifying
> ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
> ops.runnable(), and hacking [1] the memcg accounting code a bit to make
> it (much more likely to) raise an MEMCG_MAX event from a
> bpf_timer_init() call.

FWIW, below are changes I made to scx_flatcg.bpf.c to reproduce the
hardlockup.  Please let me know if there's more info that I can provide.

Thanks,
Peilin Ye

--- a/tools/sched_ext/scx_flatcg.bpf.c
+++ b/tools/sched_ext/scx_flatcg.bpf.c
@@ -504,8 +504,31 @@ static void update_active_weight_sums(struct cgroup *cgrp, bool runnable)
 		cgrp_refresh_hweight(cgrp, cgc);
 }

+struct __bpf_timer {
+	struct bpf_timer timer;
+};
+#define NUM_BPF_TIMERS	10
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, NUM_BPF_TIMERS);
+	__type(key, u32);
+	__type(value, struct __bpf_timer);
+} timer_map SEC(".maps");
+int count = 0;
+
 void BPF_STRUCT_OPS(fcg_runnable, struct task_struct *p, u64 enq_flags)
 {
+	if (count < NUM_BPF_TIMERS) {
+		struct bpf_timer *timer;
+		u32 key = count;
+
+		timer = bpf_map_lookup_elem(&timer_map, &key);
+		if (!timer)
+			return;
+		bpf_timer_init(timer, &timer_map, CLOCK_MONOTONIC);
+		count++;
+	}
+
 	struct cgroup *cgrp;

 	cgrp = __COMPAT_scx_bpf_task_cgroup(p);



* Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()
  2025-09-05  6:19 [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init() Peilin Ye
  2025-09-05  6:28 ` Peilin Ye
@ 2025-09-05 15:18 ` Alexei Starovoitov
  2025-09-05 17:31   ` Shakeel Butt
  2025-09-05 23:51 ` Peilin Ye
  2 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2025-09-05 15:18 UTC (permalink / raw)
  To: Peilin Ye, Shakeel Butt, Johannes Weiner, Tejun Heo
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden

On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@google.com> wrote:
>
> Calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various
> locking issues; see the following stack trace (edited for style) as one
> example:
>
> ...
>  [10.011566]  do_raw_spin_lock.cold
>  [10.011570]  try_to_wake_up             (5) double-acquiring the same
>  [10.011575]  kick_pool                      rq_lock, causing a hardlockup
>  [10.011579]  __queue_work
>  [10.011582]  queue_work_on
>  [10.011585]  kernfs_notify
>  [10.011589]  cgroup_file_notify
>  [10.011593]  try_charge_memcg           (4) memcg accounting raises an
>  [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
>  [10.011599]  obj_cgroup_charge_account
>  [10.011600]  __memcg_slab_post_alloc_hook
>  [10.011603]  __kmalloc_node_noprof
> ...
>  [10.011611]  bpf_map_kmalloc_node
>  [10.011612]  __bpf_async_init
>  [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
>  [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
>  [10.011619]  bpf__sched_ext_ops_runnable
>  [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
>  [10.011622]  enqueue_task
>  [10.011626]  ttwu_do_activate
>  [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
> ...
>
> The above was reproduced on bpf-next (b338cf849ec8) by modifying
> ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
> ops.runnable(), and hacking [1] the memcg accounting code a bit to make
> it (much more likely to) raise an MEMCG_MAX event from a
> bpf_timer_init() call.
>
> We have also run into other similar variants both internally (without
> applying the [1] hack) and on bpf-next, including:
>
>  * run_timer_softirq() -> cgroup_file_notify()
>    (grabs cgroup_file_kn_lock) -> try_to_wake_up() ->
>    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
>    try_charge_memcg() raises MEMCG_MAX ->
>    cgroup_file_notify() (tries to grab cgroup_file_kn_lock again)
>
>  * __queue_work() (grabs worker_pool::lock) -> try_to_wake_up() ->
>    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
>    try_charge_memcg() raises MEMCG_MAX -> cgroup_file_notify() ->
>    __queue_work() (tries to grab the same worker_pool::lock)
>  ...
>
> As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> bpf_hrtimer and bpf_work, to skip memcg accounting.

This is a short term workaround that we shouldn't take.
Long term bpf_mem_alloc() will use kmalloc_nolock() and
memcg accounting that was already made to work from any context
except that the path of memcg_memory_event() wasn't converted.

Shakeel,

Any suggestions how memcg_memory_event()->cgroup_file_notify()
can be fixed?
Can we just trylock and skip the event?


* Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()
  2025-09-05 15:18 ` Alexei Starovoitov
@ 2025-09-05 17:31   ` Shakeel Butt
  2025-09-05 18:23     ` Peilin Ye
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Shakeel Butt @ 2025-09-05 17:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peilin Ye, Johannes Weiner, Tejun Heo, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden, linux-mm

On Fri, Sep 05, 2025 at 08:18:25AM -0700, Alexei Starovoitov wrote:
> On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@google.com> wrote:
> >
> > Calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various
> > locking issues; see the following stack trace (edited for style) as one
> > example:
> >
> > ...
> >  [10.011566]  do_raw_spin_lock.cold
> >  [10.011570]  try_to_wake_up             (5) double-acquiring the same
> >  [10.011575]  kick_pool                      rq_lock, causing a hardlockup
> >  [10.011579]  __queue_work
> >  [10.011582]  queue_work_on
> >  [10.011585]  kernfs_notify
> >  [10.011589]  cgroup_file_notify
> >  [10.011593]  try_charge_memcg           (4) memcg accounting raises an
> >  [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
> >  [10.011599]  obj_cgroup_charge_account
> >  [10.011600]  __memcg_slab_post_alloc_hook
> >  [10.011603]  __kmalloc_node_noprof
> > ...
> >  [10.011611]  bpf_map_kmalloc_node
> >  [10.011612]  __bpf_async_init
> >  [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
> >  [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
> >  [10.011619]  bpf__sched_ext_ops_runnable
> >  [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
> >  [10.011622]  enqueue_task
> >  [10.011626]  ttwu_do_activate
> >  [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
> > ...
> >
> > The above was reproduced on bpf-next (b338cf849ec8) by modifying
> > ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
> > ops.runnable(), and hacking [1] the memcg accounting code a bit to make
> > it (much more likely to) raise an MEMCG_MAX event from a
> > bpf_timer_init() call.
> >
> > We have also run into other similar variants both internally (without
> > applying the [1] hack) and on bpf-next, including:
> >
> >  * run_timer_softirq() -> cgroup_file_notify()
> >    (grabs cgroup_file_kn_lock) -> try_to_wake_up() ->
> >    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
> >    try_charge_memcg() raises MEMCG_MAX ->
> >    cgroup_file_notify() (tries to grab cgroup_file_kn_lock again)
> >
> >  * __queue_work() (grabs worker_pool::lock) -> try_to_wake_up() ->
> >    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
> >    try_charge_memcg() raises MEMCG_MAX -> cgroup_file_notify() ->
> >    __queue_work() (tries to grab the same worker_pool::lock)
> >  ...
> >
> > As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> > bpf_hrtimer and bpf_work, to skip memcg accounting.
> 
> This is a short term workaround that we shouldn't take.
> Long term bpf_mem_alloc() will use kmalloc_nolock() and
> memcg accounting that was already made to work from any context
> except that the path of memcg_memory_event() wasn't converted.
> 
> Shakeel,
> 
> Any suggestions how memcg_memory_event()->cgroup_file_notify()
> can be fixed?
> Can we just trylock and skip the event?

Will !gfpflags_allow_spinning(gfp_mask) be able to detect such call
chains? If yes, then we can change memcg_memory_event() to skip calls to
cgroup_file_notify() if spinning is not allowed.


* Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()
  2025-09-05 17:31   ` Shakeel Butt
@ 2025-09-05 18:23     ` Peilin Ye
  2025-09-05 19:14     ` Peilin Ye
  2025-09-05 19:48     ` Shakeel Butt
  2 siblings, 0 replies; 10+ messages in thread
From: Peilin Ye @ 2025-09-05 18:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Johannes Weiner, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden, linux-mm

On Fri, Sep 05, 2025 at 10:31:07AM -0700, Shakeel Butt wrote:
> On Fri, Sep 05, 2025 at 08:18:25AM -0700, Alexei Starovoitov wrote:
> > On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@google.com> wrote:
> > > As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> > > bpf_hrtimer and bpf_work, to skip memcg accounting.
> > 
> > This is a short term workaround that we shouldn't take.
> > Long term bpf_mem_alloc() will use kmalloc_nolock() and
> > memcg accounting that was already made to work from any context
> > except that the path of memcg_memory_event() wasn't converted.
> > 
> > Shakeel,
> > 
> > Any suggestions how memcg_memory_event()->cgroup_file_notify()
> > can be fixed?
> > Can we just trylock and skip the event?
> 
> Will !gfpflags_allow_spinning(gfp_mask) be able to detect such call
> chains? If yes, then we can change memcg_memory_event() to skip calls to
> cgroup_file_notify() if spinning is not allowed.

Thanks!  I'll try this.

Peilin Ye



* Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()
  2025-09-05 17:31   ` Shakeel Butt
  2025-09-05 18:23     ` Peilin Ye
@ 2025-09-05 19:14     ` Peilin Ye
  2025-09-05 20:32       ` Shakeel Butt
  2025-09-05 19:48     ` Shakeel Butt
  2 siblings, 1 reply; 10+ messages in thread
From: Peilin Ye @ 2025-09-05 19:14 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Johannes Weiner, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden, linux-mm

On Fri, Sep 05, 2025 at 10:31:07AM -0700, Shakeel Butt wrote:
> On Fri, Sep 05, 2025 at 08:18:25AM -0700, Alexei Starovoitov wrote:
> > On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@google.com> wrote:
> > > As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> > > bpf_hrtimer and bpf_work, to skip memcg accounting.
> > 
> > This is a short term workaround that we shouldn't take.
> > Long term bpf_mem_alloc() will use kmalloc_nolock() and
> > memcg accounting that was already made to work from any context
> > except that the path of memcg_memory_event() wasn't converted.
> > 
> > Shakeel,
> > 
> > Any suggestions how memcg_memory_event()->cgroup_file_notify()
> > can be fixed?
> > Can we just trylock and skip the event?
> 
> Will !gfpflags_allow_spinning(gfp_mask) be able to detect such call
> chains? If yes, then we can change memcg_memory_event() to skip calls to
> cgroup_file_notify() if spinning is not allowed.

I tried the below diff, but unfortunately __bpf_async_init() calls
bpf_map_kmalloc_node() with GFP_ATOMIC, so gfpflags_allow_spinning()
would return true.  I'll try the trylock-and-skip approach.

Thanks,
Peilin Ye

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2350,11 +2350,12 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
        if (unlikely(task_in_memcg_oom(current)))
                goto nomem;
 
-       if (!gfpflags_allow_blocking(gfp_mask))
+       if (!gfpflags_allow_blocking(gfp_mask)) {
                goto nomem;
-
-       memcg_memory_event(mem_over_limit, MEMCG_MAX);
-       raised_max_event = true;
+       } else if (gfpflags_allow_spinning(gfp_mask)) {
+               memcg_memory_event(mem_over_limit, MEMCG_MAX);
+               raised_max_event = true;
+       }
 
        psi_memstall_enter(&pflags);
        nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
@@ -2419,7 +2420,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
         * If the allocation has to be enforced, don't forget to raise
         * a MEMCG_MAX event.
         */
-       if (!raised_max_event)
+       if (!raised_max_event && gfpflags_allow_spinning(gfp_mask))
                memcg_memory_event(mem_over_limit, MEMCG_MAX);
 
        /*



* Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()
  2025-09-05 17:31   ` Shakeel Butt
  2025-09-05 18:23     ` Peilin Ye
  2025-09-05 19:14     ` Peilin Ye
@ 2025-09-05 19:48     ` Shakeel Butt
  2025-09-05 20:31       ` Peilin Ye
  2 siblings, 1 reply; 10+ messages in thread
From: Shakeel Butt @ 2025-09-05 19:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Peilin Ye, Johannes Weiner, Tejun Heo, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden, linux-mm

On Fri, Sep 05, 2025 at 10:31:07AM -0700, Shakeel Butt wrote:
> On Fri, Sep 05, 2025 at 08:18:25AM -0700, Alexei Starovoitov wrote:
> > On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@google.com> wrote:
> > >
> > > Calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various
> > > locking issues; see the following stack trace (edited for style) as one
> > > example:
> > >
> > > ...
> > >  [10.011566]  do_raw_spin_lock.cold
> > >  [10.011570]  try_to_wake_up             (5) double-acquiring the same
> > >  [10.011575]  kick_pool                      rq_lock, causing a hardlockup
> > >  [10.011579]  __queue_work
> > >  [10.011582]  queue_work_on
> > >  [10.011585]  kernfs_notify
> > >  [10.011589]  cgroup_file_notify
> > >  [10.011593]  try_charge_memcg           (4) memcg accounting raises an
> > >  [10.011597]  obj_cgroup_charge_pages        MEMCG_MAX event
> > >  [10.011599]  obj_cgroup_charge_account
> > >  [10.011600]  __memcg_slab_post_alloc_hook
> > >  [10.011603]  __kmalloc_node_noprof
> > > ...
> > >  [10.011611]  bpf_map_kmalloc_node
> > >  [10.011612]  __bpf_async_init
> > >  [10.011615]  bpf_timer_init             (3) BPF calls bpf_timer_init()
> > >  [10.011617]  bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable
> > >  [10.011619]  bpf__sched_ext_ops_runnable
> > >  [10.011620]  enqueue_task_scx           (2) BPF runs with rq_lock held
> > >  [10.011622]  enqueue_task
> > >  [10.011626]  ttwu_do_activate
> > >  [10.011629]  sched_ttwu_pending         (1) grabs rq_lock
> > > ...
> > >
> > > The above was reproduced on bpf-next (b338cf849ec8) by modifying
> > > ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during
> > > ops.runnable(), and hacking [1] the memcg accounting code a bit to make
> > > it (much more likely to) raise an MEMCG_MAX event from a
> > > bpf_timer_init() call.
> > >
> > > We have also run into other similar variants both internally (without
> > > applying the [1] hack) and on bpf-next, including:
> > >
> > >  * run_timer_softirq() -> cgroup_file_notify()
> > >    (grabs cgroup_file_kn_lock) -> try_to_wake_up() ->
> > >    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
> > >    try_charge_memcg() raises MEMCG_MAX ->
> > >    cgroup_file_notify() (tries to grab cgroup_file_kn_lock again)
> > >
> > >  * __queue_work() (grabs worker_pool::lock) -> try_to_wake_up() ->
> > >    BPF calls bpf_timer_init() -> bpf_map_kmalloc_node() ->
> > >    try_charge_memcg() raises MEMCG_MAX -> cgroup_file_notify() ->
> > >    __queue_work() (tries to grab the same worker_pool::lock)
> > >  ...
> > >
> > > As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> > > bpf_hrtimer and bpf_work, to skip memcg accounting.
> > 
> > This is a short term workaround that we shouldn't take.
> > Long term bpf_mem_alloc() will use kmalloc_nolock() and
> > memcg accounting that was already made to work from any context
> > except that the path of memcg_memory_event() wasn't converted.
> > 
> > Shakeel,
> > 
> > Any suggestions how memcg_memory_event()->cgroup_file_notify()
> > can be fixed?
> > Can we just trylock and skip the event?
> 
> Will !gfpflags_allow_spinning(gfp_mask) be able to detect such call
> chains? If yes, then we can change memcg_memory_event() to skip calls to
> cgroup_file_notify() if spinning is not allowed.

Along with using __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init(),
we need the following patch:


From 32d310208ebe28eac562552849355c33ae4320f1 Mon Sep 17 00:00:00 2001
From: Shakeel Butt <shakeel.butt@linux.dev>
Date: Fri, 5 Sep 2025 11:15:17 -0700
Subject: [PATCH] memcg: skip cgroup_file_notify if spinning is not allowed

Generally memcg charging is allowed from all the contexts including NMI
where even spinning on spinlock can cause locking issues. However one
call chain was missed during the addition of memcg charging from any
context support. That is try_charge_memcg() -> memcg_memory_event() ->
cgroup_file_notify().

The possible function call tree under cgroup_file_notify() can acquire
many different spin locks in spinning mode. Some of them are
cgroup_file_kn_lock, kernfs_notify_lock, pool_workqeue's lock. So, let's
just skip cgroup_file_notify() from memcg charging if the context does
not allow spinning.

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 23 ++++++++++++++++-------
 mm/memcontrol.c            |  7 ++++---
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9dc5b52672a6..054fa34c936a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -993,22 +993,25 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
 	count_memcg_events_mm(mm, idx, 1);
 }
 
-static inline void memcg_memory_event(struct mem_cgroup *memcg,
-				      enum memcg_memory_event event)
+static inline void __memcg_memory_event(struct mem_cgroup *memcg,
+					enum memcg_memory_event event,
+					bool allow_spinning)
 {
 	bool swap_event = event == MEMCG_SWAP_HIGH || event == MEMCG_SWAP_MAX ||
 			  event == MEMCG_SWAP_FAIL;
 
 	atomic_long_inc(&memcg->memory_events_local[event]);
-	if (!swap_event)
+	if (!swap_event && allow_spinning)
 		cgroup_file_notify(&memcg->events_local_file);
 
 	do {
 		atomic_long_inc(&memcg->memory_events[event]);
-		if (swap_event)
-			cgroup_file_notify(&memcg->swap_events_file);
-		else
-			cgroup_file_notify(&memcg->events_file);
+		if (allow_spinning) {
+			if (swap_event)
+				cgroup_file_notify(&memcg->swap_events_file);
+			else
+				cgroup_file_notify(&memcg->events_file);
+		}
 
 		if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 			break;
@@ -1018,6 +1021,12 @@ static inline void memcg_memory_event(struct mem_cgroup *memcg,
 		 !mem_cgroup_is_root(memcg));
 }
 
+static inline void memcg_memory_event(struct mem_cgroup *memcg,
+				      enum memcg_memory_event event)
+{
+	__memcg_memory_event(memcg, event, true);
+}
+
 static inline void memcg_memory_event_mm(struct mm_struct *mm,
 					 enum memcg_memory_event event)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 257d2c76b730..dd5cd9d352f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2306,12 +2306,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool drained = false;
 	bool raised_max_event = false;
 	unsigned long pflags;
+	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
 
 retry:
 	if (consume_stock(memcg, nr_pages))
 		return 0;
 
-	if (!gfpflags_allow_spinning(gfp_mask))
+	if (!allow_spinning)
 		/* Avoid the refill and flush of the older stock */
 		batch = nr_pages;
 
@@ -2347,7 +2348,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!gfpflags_allow_blocking(gfp_mask))
 		goto nomem;
 
-	memcg_memory_event(mem_over_limit, MEMCG_MAX);
+	__memcg_memory_event(mem_over_limit, MEMCG_MAX, allow_spinning);
 	raised_max_event = true;
 
 	psi_memstall_enter(&pflags);
@@ -2414,7 +2415,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * a MEMCG_MAX event.
 	 */
 	if (!raised_max_event)
-		memcg_memory_event(mem_over_limit, MEMCG_MAX);
+		__memcg_memory_event(mem_over_limit, MEMCG_MAX, allow_spinning);
 
 	/*
 	 * The allocation either can't fail or will lead to more memory
-- 
2.47.3



* Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()
  2025-09-05 19:48     ` Shakeel Butt
@ 2025-09-05 20:31       ` Peilin Ye
  0 siblings, 0 replies; 10+ messages in thread
From: Peilin Ye @ 2025-09-05 20:31 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Johannes Weiner, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden, linux-mm

On Fri, Sep 05, 2025 at 12:48:11PM -0700, Shakeel Butt wrote:
> On Fri, Sep 05, 2025 at 10:31:07AM -0700, Shakeel Butt wrote:
> > On Fri, Sep 05, 2025 at 08:18:25AM -0700, Alexei Starovoitov wrote:
> > > On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@google.com> wrote:
> > > > As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> > > > bpf_hrtimer and bpf_work, to skip memcg accounting.
> > > 
> > > This is a short term workaround that we shouldn't take.
> > > Long term bpf_mem_alloc() will use kmalloc_nolock() and
> > > memcg accounting that was already made to work from any context
> > > except that the path of memcg_memory_event() wasn't converted.
> > > 
> > > Shakeel,
> > > 
> > > Any suggestions how memcg_memory_event()->cgroup_file_notify()
> > > can be fixed?
> > > Can we just trylock and skip the event?
> > 
> > Will !gfpflags_allow_spinning(gfp_mask) be able to detect such call
> > chains? If yes, then we can change memcg_memory_event() to skip calls to
> > cgroup_file_notify() if spinning is not allowed.
> 
> Along with using __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init()

Ah, I see, thanks for the suggestion - I'll send a separate patch to do
this.

Peilin Ye

* Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()
  2025-09-05 19:14     ` Peilin Ye
@ 2025-09-05 20:32       ` Shakeel Butt
  0 siblings, 0 replies; 10+ messages in thread
From: Shakeel Butt @ 2025-09-05 20:32 UTC (permalink / raw)
  To: Peilin Ye
  Cc: Alexei Starovoitov, Johannes Weiner, Tejun Heo, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden, linux-mm

On Fri, Sep 05, 2025 at 07:14:57PM +0000, Peilin Ye wrote:
> On Fri, Sep 05, 2025 at 10:31:07AM -0700, Shakeel Butt wrote:
> > On Fri, Sep 05, 2025 at 08:18:25AM -0700, Alexei Starovoitov wrote:
> > > On Thu, Sep 4, 2025 at 11:20 PM Peilin Ye <yepeilin@google.com> wrote:
> > > > As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> > > > bpf_hrtimer and bpf_work, to skip memcg accounting.
> > > 
> > > This is a short term workaround that we shouldn't take.
> > > Long term bpf_mem_alloc() will use kmalloc_nolock() and
> > > memcg accounting that was already made to work from any context
> > > except that the path of memcg_memory_event() wasn't converted.
> > > 
> > > Shakeel,
> > > 
> > > Any suggestions how memcg_memory_event()->cgroup_file_notify()
> > > can be fixed?
> > > Can we just trylock and skip the event?
> > 
> > Will !gfpflags_allow_spinning(gfp_mask) be able to detect such call
> > chains? If yes, then we can change memcg_memory_event() to skip calls to
> > cgroup_file_notify() if spinning is not allowed.
> 
> I tried the below diff, but unfortunately __bpf_async_init() calls
> bpf_map_kmalloc_node() with GFP_ATOMIC, so gfpflags_allow_spinning()
> would return true.  I'll try the trylock-and-skip approach.

The trylock approach will not work, as there are multiple locks in this
codepath. Better to use __GFP_HIGH instead of GFP_ATOMIC in
__bpf_async_init().

* Re: [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init()
  2025-09-05  6:19 [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init() Peilin Ye
  2025-09-05  6:28 ` Peilin Ye
  2025-09-05 15:18 ` Alexei Starovoitov
@ 2025-09-05 23:51 ` Peilin Ye
  2 siblings, 0 replies; 10+ messages in thread
From: Peilin Ye @ 2025-09-05 23:51 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Kumar Kartikeya Dwivedi, Josh Don, Barret Rhoden

Future readers,

On Fri, Sep 05, 2025 at 06:19:17AM +0000, Peilin Ye wrote:
> As pointed out by Kumar, we can use bpf_mem_alloc() and friends for
> bpf_hrtimer and bpf_work, to skip memcg accounting.

This submission has been superseded by:

[PATCH bpf] bpf/helpers: Use __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init()
https://lore.kernel.org/all/20250905234547.862249-1-yepeilin@google.com/

Thanks,
Peilin Ye

end of thread, other threads:[~2025-09-05 23:51 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-05  6:19 [PATCH bpf] bpf/helpers: Skip memcg accounting in __bpf_async_init() Peilin Ye
2025-09-05  6:28 ` Peilin Ye
2025-09-05 15:18 ` Alexei Starovoitov
2025-09-05 17:31   ` Shakeel Butt
2025-09-05 18:23     ` Peilin Ye
2025-09-05 19:14     ` Peilin Ye
2025-09-05 20:32       ` Shakeel Butt
2025-09-05 19:48     ` Shakeel Butt
2025-09-05 20:31       ` Peilin Ye
2025-09-05 23:51 ` Peilin Ye
