From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pf1-f202.google.com (mail-pf1-f202.google.com [209.85.210.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 69BDA28C87C for ; Tue, 9 Sep 2025 09:52:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.202 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757411570; cv=none; b=DwVROQ3YVOOa0/GtMiP1mN+nb/18YywuemZ6XZfiI6xs9X9h7W5/c7ZadqGNtKhqMYWgmKAkWcwrqHOyTHpKT9owSRSWZzKdIrhhRjzTZECcHQr2Wej/NKNdzDpg2wBnjs7wHs8DhpMDAHgjoC2ToPA2Idrd786DFQcDRAYAHNw= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757411570; c=relaxed/simple; bh=DuDv3ua4IwKyYMWhr+S8CTNVoVIcjA8BB/4W4Mu6E1g=; h=Date:Mime-Version:Message-ID:Subject:From:To:Cc:Content-Type; b=UBQdolEBc0tA6H0TBciFucw7RJQ/atS9t4FhdKVuJp9iFqwS4ckBtChk/KPUDnTDVi2NNDBm9dhBJW8SKZRcXR+FSkIqu5grNVf+QILjeaQsXV+7OS45+aBo9qaXEpskvi6YT0wO88vNFwkI/oNTNRA+4AukQDWTz3RL32q6v50= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--yepeilin.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=RUXt1rvE; arc=none smtp.client-ip=209.85.210.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--yepeilin.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="RUXt1rvE" Received: by mail-pf1-f202.google.com with SMTP id d2e1a72fcca58-7724877cd7cso6216438b3a.1 for ; Tue, 09 Sep 2025 02:52:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1757411567; x=1758016367; darn=vger.kernel.org; h=cc:to:from:subject:message-id:mime-version:date:from:to:cc:subject :date:message-id:reply-to; bh=qOFVmhbI856obbOouijqOO2ps+J/JpM5ycRu9RASoGg=; b=RUXt1rvEoZi+wbOs8CNxEMwi2II4gSE5HVQeyGy1iyrtCQ0GbsNUlfPmutLAAKOj8I Y9dxjoB/vcyk2P8l/bMoLoKPjVsDnHnFBVa/joUBLR0jAB3S1cleajY3YLDpgDj5coQy Cz/r5T18ei7YyloIYKkCNnxTTkej4u0ZtnXbQGi9yh1B1ZHMirHsYacGoAwqTwA1tecC 0f7pghkaxuRQ09xdKD6DVrh+GOBnD8OR2782uFCJMVJqhACx243T4U2tT6MinmuCBqpO sQtqfqMY6P3Jhv3LVbONnFH7lMM6BqnewRKPO8+xzh4uu7/6PK5wMEvYsYSpKm4DPlvc h0wg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1757411567; x=1758016367; h=cc:to:from:subject:message-id:mime-version:date:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=qOFVmhbI856obbOouijqOO2ps+J/JpM5ycRu9RASoGg=; b=uxjXL7OWzv3DIrZi+HgGXOCPa4T6K4TQv2muPF3n8o9xU0AzbofgRfr/pIK5BVBb6M /MOC5DhJ0PD1bOQ4vYpzi+HCPev6XG0PN3hM5P3y47IeiZKP1pMsK7qYCiA3EOHG+xlY FgISh3Wxz7fbTNZOvRvw0LY6llHG67ucpoEIvWzLdelP1w9SiLFwIh73kKNpoSq+tBMz 906V6ZcbaQ8UgYkOSEp+LiucTDLS1hLVU+4evNUOG1XN2FjyLrtBKITO+9ZXWzG1PAhi uICKkL1Ul+IPZ5eJKyFbnNNHIBHj5vYUH4XFZPC1XFcs4veb0KZVHpqcUGmGtRR4m+6T AlCg== X-Gm-Message-State: AOJu0YyU28ilKNylelpqGltOZ7AhY9mF2vTWlZJj4DFdmpy1lMEJGVEw xwh4RXkeSTquDCQCAN5AR12cijL4+JBUlVdChDdS/n0CUWYwZYj/Zx3G7RJZYqe8z5j2dx5cxzc cJACm6WOInx68rycpivtjbMjyHmE9OzRP4rJLzen6Yi7vIJKELMjmPYK6uAF5lGv4ixWk8PS5Ay zRjET12YllfYwwvP7i9g0dcv/rxXG7tzXr//eH6fw1I0U= X-Google-Smtp-Source: AGHT+IFqmQ+9VVjPtSUUrkCUHF3CruI5aNTXXYKoJuAYl4F0Pt7URhDMSiYRtDdjo9N8yw+OspG5qKJ1hqcjvg== X-Received: from pfwy4.prod.google.com ([2002:a05:6a00:1c84:b0:772:630e:8fd4]) (user=yepeilin job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a00:983:b0:772:46b5:cc90 with SMTP id d2e1a72fcca58-7742de71235mr15612483b3a.32.1757411567046; Tue, 09 Sep 2025 02:52:47 -0700 (PDT) Date: Tue, 9 Sep 2025 09:52:20 +0000 Precedence: bulk X-Mailing-List: bpf@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 X-Mailer: git-send-email 2.51.0.384.g4c02a37b29-goog Message-ID: <20250909095222.2121438-1-yepeilin@google.com> Subject: [PATCH bpf v2] bpf/helpers: Use __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init() From: Peilin Ye To: bpf@vger.kernel.org, Alexei Starovoitov , Shakeel Butt Cc: Peilin Ye , Johannes Weiner , Tejun Heo , Roman Gushchin , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Eduard Zingerman , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , Kumar Kartikeya Dwivedi , Josh Don , Barret Rhoden , linux-mm@kvack.org, Leon Hwang Content-Type: text/plain; charset="UTF-8" Currently, calling bpf_map_kmalloc_node() from __bpf_async_init() can cause various locking issues; see the following stack trace (edited for style) as one example: ... [10.011566] do_raw_spin_lock.cold [10.011570] try_to_wake_up (5) double-acquiring the same [10.011575] kick_pool rq_lock, causing a hardlockup [10.011579] __queue_work [10.011582] queue_work_on [10.011585] kernfs_notify [10.011589] cgroup_file_notify [10.011593] try_charge_memcg (4) memcg accounting raises an [10.011597] obj_cgroup_charge_pages MEMCG_MAX event [10.011599] obj_cgroup_charge_account [10.011600] __memcg_slab_post_alloc_hook [10.011603] __kmalloc_node_noprof ... [10.011611] bpf_map_kmalloc_node [10.011612] __bpf_async_init [10.011615] bpf_timer_init (3) BPF calls bpf_timer_init() [10.011617] bpf_prog_xxxxxxxxxxxxxxxx_fcg_runnable [10.011619] bpf__sched_ext_ops_runnable [10.011620] enqueue_task_scx (2) BPF runs with rq_lock held [10.011622] enqueue_task [10.011626] ttwu_do_activate [10.011629] sched_ttwu_pending (1) grabs rq_lock ... The above was reproduced on bpf-next (b338cf849ec8) by modifying ./tools/sched_ext/scx_flatcg.bpf.c to call bpf_timer_init() during ops.runnable(), and hacking [1] the memcg accounting code a bit to make a bpf_timer_init() call much more likely to raise an MEMCG_MAX event. We have also run into other similar variants (both internally and on bpf-next), including double-acquiring cgroup_file_kn_lock, the same worker_pool::lock, etc. As suggested by Shakeel, fix this by using __GFP_HIGH instead of GFP_ATOMIC in __bpf_async_init(), so that e.g. if try_charge_memcg() raises an MEMCG_MAX event, we call __memcg_memory_event() with @allow_spinning=false and avoid calling cgroup_file_notify() there. Depends on mm patch "memcg: skip cgroup_file_notify if spinning is not allowed". Tested with vmtest.sh (llvm-18, x86-64): $ ./test_progs -a '*timer*' -a '*wq*' ... Summary: 7/12 PASSED, 0 SKIPPED, 0 FAILED [1] Making bpf_timer_init() much more likely to raise an MEMCG_MAX event (gist-only, for brevity): kernel/bpf/helpers.c:__bpf_async_init(): - cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node); + cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC | __GFP_HACK, + map->numa_node); mm/memcontrol.c:try_charge_memcg(): if (!do_memsw_account() || - page_counter_try_charge(&memcg->memsw, batch, &counter)) { - if (page_counter_try_charge(&memcg->memory, batch, &counter)) + page_counter_try_charge_hack(&memcg->memsw, batch, &counter, + gfp_mask & __GFP_HACK)) { + if (page_counter_try_charge_hack(&memcg->memory, batch, + &counter, + gfp_mask & __GFP_HACK)) goto done_restock; mm/page_counter.c:page_counter_try_charge(): -bool page_counter_try_charge(struct page_counter *counter, - unsigned long nr_pages, - struct page_counter **fail) +bool page_counter_try_charge_hack(struct page_counter *counter, + unsigned long nr_pages, + struct page_counter **fail, bool hack) { ... - if (new > c->max) { + if (hack || new > c->max) { // goto failed; atomic_long_sub(nr_pages, &c->usage); Fixes: b00628b1c7d5 ("bpf: Introduce bpf timers.") Suggested-by: Shakeel Butt Signed-off-by: Peilin Ye --- v1: https://lore.kernel.org/bpf/20250905234547.862249-1-yepeilin@google.com/ change since v1: - simplify comment, and mention kmalloc_nolock() (Shakeel) kernel/bpf/helpers.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index b9b0c5fe33f6..8af62cb243d9 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1274,8 +1274,11 @@ static int __bpf_async_init(struct bpf_async_kern *async, struct bpf_map *map, u goto out; } - /* allocate hrtimer via map_kmalloc to use memcg accounting */ - cb = bpf_map_kmalloc_node(map, size, GFP_ATOMIC, map->numa_node); + /* Allocate via bpf_map_kmalloc_node() for memcg accounting. Until + * kmalloc_nolock() is available, avoid locking issues by using + * __GFP_HIGH (GFP_ATOMIC & ~__GFP_RECLAIM). + */ + cb = bpf_map_kmalloc_node(map, size, __GFP_HIGH, map->numa_node); if (!cb) { ret = -ENOMEM; goto out; -- 2.51.0.384.g4c02a37b29-goog