From: Amery Hung <ameryhung@gmail.com>
To: bpf@vger.kernel.org
Cc: netdev@vger.kernel.org, alexei.starovoitov@gmail.com,
andrii@kernel.org, daniel@iogearbox.net, memxor@gmail.com,
kpsingh@kernel.org, martin.lau@kernel.org,
yonghong.song@linux.dev, song@kernel.org, haoluo@google.com,
kernel-team@meta.com
Subject: Re: [RFC PATCH bpf-next v1 00/11] Remove task and cgroup local
Date: Tue, 29 Jul 2025 11:28:48 -0700 [thread overview]
Message-ID: <CAMB2axM2qgsstBc+efRN4CU0V1-dMa7DemPkTHZ1Fhdku7o2HQ@mail.gmail.com> (raw)
In-Reply-To: <20250729182550.185356-1-ameryhung@gmail.com>
The full title is "Remove task and cgroup local storage percpu counters".
On Tue, Jul 29, 2025 at 11:25 AM Amery Hung <ameryhung@gmail.com> wrote:
>
> * Motivation *
>
> The goal of this patchset is to make bpf syscalls and helpers updating
> task and cgroup local storage more robust by removing percpu counters
> in them. Task local storage and cgroup storage each employs a percpu
> counter to prevent deadlock caused by recursion. Since the underlying
> bpf local storage takes spinlocks in various operations, bpf programs
> running recursively may try to take a spinlock which is already taken.
> For example, when a tracing bpf program called recursively during
> bpf_task_storage_get(..., F_CREATE) tries to call
> bpf_task_storage_get(..., F_CREATE) again, it will cause AA deadlock
> if the percpu variable is not in place.
>
> However, the percpu counter can also cause bpf syscalls or helpers to
> fail spuriously whenever another thread happens to be updating the
> local storage or the local storage map at the same time. Ideally, the
> two threads could take turns acquiring the locks and perform their
> jobs. However, due to the percpu counter, the syscalls and helpers
> return -EBUSY even when neither runs recursively inside the other; all
> it takes is for the two threads to run on the same CPU. This happened
> when BPF-CI ran the task local data selftest: since CI runs the test
> on a VM with 2 CPUs, bpf_task_storage_get(..., F_CREATE) can easily
> fail.
>
> The failure mode is bad for users, who need to add retry logic in user
> space or in bpf programs to work around it. Even with retries, there
> is no guaranteed upper bound on the number of attempts before a call
> succeeds. Therefore, this patchset removes the percpu counter to make
> the related bpf syscalls and helpers more reliable, while still
> ensuring that recursion deadlock cannot happen, with the help of
> resilient queued spinlock (rqspinlock).
>
>
> * Implementation *
>
> To remove the percpu counter without introducing deadlock,
> bpf_local_storage is refactored by changing the locks from raw_spin_lock
> to rqspinlock, which prevents deadlock with deadlock detection and a
> timeout mechanism.
>
> There are two locks in bpf_local_storage due to the ownership model
> illustrated in the figure below. A map value, which consists of a
> pointer to the map and the data, is a bpf_local_storage_data (sdata)
> stored in a bpf_local_storage_elem (selem). A selem belongs to a
> bpf_local_storage and a bpf_local_storage_map at the same time.
> bpf_local_storage::lock (local_storage->lock for short) protects the
> list in a bpf_local_storage, and bpf_local_storage_map_bucket::lock
> (b->lock) protects the hash bucket in a bpf_local_storage_map.
>
>
> task_struct
> ┌ task1 ───────┐ bpf_local_storage
> │ *bpf_storage │---->┌─────────┐
> └──────────────┘<----│ *owner │ bpf_local_storage_elem
> │ *cache[16] (selem) selem
> │ *smap │ ┌──────────┐ ┌──────────┐
> │ list │------->│ snode │<------->│ snode │
> │ lock │ ┌---->│ map_node │<--┐ ┌-->│ map_node │
> └─────────┘ │ │ sdata = │ │ │ │ sdata = │
> task_struct │ │ {&mapA,} │ │ │ │ {&mapB,} │
> ┌ task2 ───────┐ bpf_local_storage └──────────┘ │ │ └──────────┘
> │ *bpf_storage │---->┌─────────┐ │ │ │
> └──────────────┘<----│ *owner │ │ │ │
> │ *cache[16] │ selem │ │ selem
> │ *smap │ │ ┌──────────┐ │ │ ┌──────────┐
> │ list │--│---->│ snode │<--│-│-->│ snode │
> │ lock │ │ ┌-->│ map_node │ └-│-->│ map_node │
> └─────────┘ │ │ │ sdata = │ │ │ sdata = │
> bpf_local_storage_map │ │ │ {&mapB,} │ │ │ {&mapA,} │
> (smap) │ │ └──────────┘ │ └──────────┘
> ┌ mapA ───────┐ │ │ │
> │ bpf_map map │ bpf_local_storage_map_bucket │
> │ *buckets │---->┌ b[0] ┐ │ │ │
> └─────────────┘ │ list │------┘ │ │
> │ lock │ │ │
> └──────┘ │ │
> smap ... │ │
> ┌ mapB ───────┐ │ │
> │ bpf_map map │ bpf_local_storage_map_bucket │
> │ *buckets │---->┌ b[0] ┐ │ │
> └─────────────┘ │ list │--------┘ │
> │ lock │ │
> └──────┘ │
> ┌ b[1] ┐ │
> │ list │-----------------------------┘
> │ lock │
> └──────┘
> ...
>
>
> The refactoring is divided into three steps.
>
> First, in patches 1-4, the local storage helpers that take locks are
> converted to failable. The functions are changed from returning void
> to returning an int error code, with the return value temporarily set
> to 0. In callers where a helper cannot be allowed to fail in the
> middle of an update, the helper is open coded. In callers that are not
> allowed to fail at all (i.e., bpf_local_storage_destroy() and
> bpf_local_storage_map_free()), we make sure the functions cannot be
> called recursively and cause deadlock, and assert that the return
> value is 0.
>
> Then, in patch 5, the locks are changed to rqspinlock, and the error
> returned from raw_res_spin_lock_irqsave() is propagated to the
> syscalls and helpers.
>
> Finally, in patches 6-8, the percpu counters in task, cgroup, and the
> generic bpf local storage are removed.
>
> Question:
>
> - In bpf_local_storage_destroy() and bpf_local_storage_map_free(),
> which are not allowed to fail, I assert that the lock acquisition
> always succeeds, based on the facts that 1) these paths cannot run
> recursively and cause an AA deadlock, and 2) local_storage->lock and
> b->lock are always acquired in the same order. However, I also notice
> that rqspinlock has a timeout fallback. Is this assertion an okay
> thing to do?
>
>
> * Test *
>
> Task and cgroup local storage selftests already cover deadlock caused
> by recursion. Patch 9 updates the expected results of the task local
> storage selftests, as the task local storage bpf helpers can now run
> on the same CPU without causing deadlock.
>
> * Patch organization *
>
> E(exposed) L(local storage lock) B(bucket lock)
> EL __bpf_local_storage_insert_cache (skip cache update)
> ELB bpf_local_storage_destroy (cannot recur)
> ELB bpf_local_storage_map_free (cannot recur)
> ELB bpf_selem_unlink --> Patch 4
> E B bpf_selem_link_map --> Patch 2
> B bpf_selem_unlink_map --> Patch 1
> L bpf_selem_unlink_storage --> Patch 3
>
> During the refactoring, to make sure all exposed functions
> handle the error returned by raw_res_spin_lock_irqsave(),
> __must_check is added locally to catch all callers.
>
> Patch 1-4
> Convert local storage helpers to failable, or open-code
> the helpers
>
> Patch 5
> Change local_storage->lock and b->lock from
> raw_spin_lock to rqspinlock
>
> Patch 6
> Remove percpu counter in task local storage and remove
> bpf_task_storage_{get,delete}_recur()
>
> Patch 7
> Remove percpu counter in cgroup local storage
>
> Patch 8
> Remove unused percpu counter from bpf_local_storage_map_free
>
> Patch 9
> Update task local storage recursion test
>
> Patch 10
> Remove task local storage stress test
>
> Patch 11
> Update btf_dump to use another percpu variable
>
> ----
>
> Amery Hung (11):
> bpf: Convert bpf_selem_unlink_map to failable
> bpf: Convert bpf_selem_link_map to failable
> bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink
> bpf: Convert bpf_selem_unlink to failable
> bpf: Change local_storage->lock and b->lock to rqspinlock
> bpf: Remove task local storage percpu counter
> bpf: Remove cgroup local storage percpu counter
> bpf: Remove unused percpu counter from bpf_local_storage_map_free
> selftests/bpf: Update task_local_storage/recursion test
> selftests/bpf: Remove test_task_storage_map_stress_lookup
> selftests/bpf: Choose another percpu variable in bpf for btf_dump test
>
> include/linux/bpf_local_storage.h | 14 +-
> kernel/bpf/bpf_cgrp_storage.c | 60 +-----
> kernel/bpf/bpf_inode_storage.c | 6 +-
> kernel/bpf/bpf_local_storage.c | 202 ++++++++++++------
> kernel/bpf/bpf_task_storage.c | 153 ++-----------
> kernel/bpf/helpers.c | 4 -
> net/core/bpf_sk_storage.c | 10 +-
> .../bpf/map_tests/task_storage_map.c | 128 -----------
> .../selftests/bpf/prog_tests/btf_dump.c | 4 +-
> .../bpf/prog_tests/task_local_storage.c | 8 +-
> .../bpf/progs/read_bpf_task_storage_busy.c | 38 ----
> 11 files changed, 184 insertions(+), 443 deletions(-)
> delete mode 100644 tools/testing/selftests/bpf/map_tests/task_storage_map.c
> delete mode 100644 tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
>
> --
> 2.47.3
>
Thread overview: 19+ messages
2025-07-29 18:25 [RFC PATCH bpf-next v1 00/11] Remove task and cgroup local Amery Hung
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 01/11] bpf: Convert bpf_selem_unlink_map to failable Amery Hung
2025-08-01 1:05 ` Martin KaFai Lau
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 02/11] bpf: Convert bpf_selem_link_map " Amery Hung
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 03/11] bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink Amery Hung
2025-08-02 0:58 ` Martin KaFai Lau
2025-08-05 16:25 ` Amery Hung
2025-08-05 23:10 ` Martin KaFai Lau
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 04/11] bpf: Convert bpf_selem_unlink to failable Amery Hung
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 05/11] bpf: Change local_storage->lock and b->lock to rqspinlock Amery Hung
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 06/11] bpf: Remove task local storage percpu counter Amery Hung
2025-08-02 1:15 ` Martin KaFai Lau
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 07/11] bpf: Remove cgroup " Amery Hung
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 08/11] bpf: Remove unused percpu counter from bpf_local_storage_map_free Amery Hung
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 09/11] selftests/bpf: Update task_local_storage/recursion test Amery Hung
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 10/11] selftests/bpf: Remove test_task_storage_map_stress_lookup Amery Hung
2025-07-29 18:25 ` [RFC PATCH bpf-next v1 11/11] selftests/bpf: Choose another percpu variable in bpf for btf_dump test Amery Hung
2025-07-29 18:28 ` Amery Hung [this message]
2025-08-02 1:46 ` [RFC PATCH bpf-next v1 00/11] Remove task and cgroup local Martin KaFai Lau