* [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters
@ 2025-10-02 22:53 Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 01/12] bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage Amery Hung
` (11 more replies)
0 siblings, 12 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
v1 -> v2
- Rebase to bpf-next
- Select bucket based on local_storage instead of selem (Martin)
- Simplify bpf_selem_unlink (Martin)
- Change handling of rqspinlock errors in bpf_local_storage_destroy()
and bpf_local_storage_map_free(). Retry instead of WARN_ON.
Link: https://lore.kernel.org/bpf/20250729182550.185356-1-ameryhung@gmail.com/
* Motivation *
The goal of this patchset is to make bpf syscalls and helpers updating
task and cgroup local storage more robust by removing percpu counters
in them. Task local storage and cgroup storage each employ a percpu
counter to prevent deadlock caused by recursion. Since the underlying
bpf local storage takes spinlocks in various operations, bpf programs
running recursively may try to take a spinlock that is already taken.
For example, if a tracing bpf program invoked recursively during
bpf_task_storage_get(..., F_CREATE) tries to call
bpf_task_storage_get(..., F_CREATE) again, it will cause an AA deadlock
unless the percpu counter is in place.
However, the percpu counter can also cause bpf syscalls or helpers to
fail spuriously as soon as another thread is updating the local storage
or the local storage map. Ideally, the two threads could have taken
turns acquiring the locks and performed their jobs respectively.
However, due to the percpu counter, the syscalls and helpers can return
-EBUSY even when neither of them runs recursively inside the other. All
it takes for this to happen is for the two threads to run on the same
CPU. This happened when BPF-CI ran the selftest of task local data.
Since CI runs the test on a VM with 2 CPUs, bpf_task_storage_get(...,
F_CREATE) can easily fail.
This failure mode is not good for users as they need to add retry logic
in user space or bpf programs to avoid it. Even with retry, there is no
guaranteed upper bound on the number of retries needed for the call to
succeed. Therefore, this patchset seeks to remove the percpu counter
and make the related bpf syscalls and helpers more reliable, while
still making sure recursion deadlock cannot happen, with the help of
the resilient queued spinlock (rqspinlock).
* Implementation *
To remove the percpu counter without introducing deadlock,
bpf_local_storage is refactored by changing the locks from raw_spin_lock
to rqspinlock, which prevents deadlock through deadlock detection and a
timeout mechanism.
The refactor basically replaces the locks with rqspinlock and propagates
errors returned by the locking function to BPF helpers or syscalls.
In the two lock-acquiring functions that cannot fail,
bpf_local_storage_destroy() and bpf_local_storage_map_free()
(i.e., the local storage is being freed by the owning subsystem or the
map is being freed), we simply retry. Since freeing of task, cgroup,
inode and sock local storage and freeing of BPF maps are done in
deferred callbacks, they should be able to make forward progress
eventually (i.e., they cannot deadlock).
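Concretely, the two non-failable paths just spin on the failable
primitives. Below is a simplified sketch of what patches 2, 5 and 6 end
up doing in bpf_local_storage_destroy() and bpf_local_storage_map_free()
(error paths and list management trimmed):

    /* bpf_local_storage_destroy(): retry the lock and each unlink */
    while (raw_res_spin_lock_irqsave(&local_storage->lock, flags));
    hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
            while (bpf_selem_unlink_map(selem));
            ...
    }
    raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);

    /* bpf_local_storage_map_free(): retry unlinking each selem */
    while (bpf_selem_unlink(selem, true));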
For readers not familiar with local storage, the last section briefly
describes the locks and the structure of local storage. It also shows
the abbreviations used in the rest of this letter.
* Handling rqspinlock return in destroy() and map_free() *
The main concern of this patchset is handling errors returned from
rqspinlock in bpf_local_storage_destroy() and
bpf_local_storage_map_free(), where the functions cannot fail. First,
we explain the rationale behind the current approach. Then, we explain
why the approach taken in v1 is not correct. Finally, we list one other
approach to handle errors returned from rqspinlock.
bpf_local_storage_destroy() and bpf_local_storage_map_free() are not
allowed to fail as they are responsible for cleaning up the local
storage and the BPF map. In the unlikely event that lock acquisition
returns an error, we simply retry since both functions should be able
to make forward progress eventually. Neither function can deadlock with
itself or with the other since both are called from deferred callbacks
and preemption is disabled in the critical section. They cannot run
recursively while the locks are already held, hence AA deadlock cannot
happen. In addition, since we always follow the same locking order
(i.e., local_storage->lock before bucket->lock), ABBA deadlock cannot
happen either. The sketch below shows where the two locks are taken.
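For reference, every path that takes both locks (update, unlink and
destroy) follows this order. A trimmed view of the
bpf_local_storage_update() hunk in patch 6:

    /* local_storage->lock is always taken first ... */
    err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
    if (err)
            return ERR_PTR(err);
    ...
    /* ... and the bucket lock second, for the bucket selected by the
     * local_storage pointer (patch 1).
     */
    b = select_bucket(smap, local_storage);
    err = raw_res_spin_lock_irqsave(&b->lock, b_flags);
    if (err)
            goto unlock;    /* release local_storage->lock and return */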
Another way to handle the return, as taken in v1, is to assert that
lock acquisition will always succeed, which unfortunately is not true.
Here we look at how raw_res_spin_lock_irqsave() in destroy() and
map_free() can fail.
1) Deadlock
While AA and ABBA deadlock cannot happen in destroy() and map_free(),
rqspinlock in these functions can return -EDEADLOCK.
However, when a thread waiting for the lock detects a deadlock, it may
signal the other waiters with -EDEADLOCK. Here is one example:
(ls: local_storage, b: bucket)
CPU1                                        CPU2
----                                        ----
bpf_prog1
-> bpf_task_local_storage_get(F_CREATE)
   -> lock(ls1->lock)
   -> lock(b1->lock)
bpf_prog2 hooked to NMI
-> bpf_task_local_storage_get(F_CREATE)
   -> lock(ls2->lock)
   -> lock(b1->lock) return -EDEADLOCK
                                            __put_task_struct_rcu_cb()
                                            -> bpf_local_storage_destroy()
                                               -> lock(ls3->lock)
                                               -> lock(b1->lock) return -EDEADLOCK
                                                  as signaled by the head waiter
2) Timeout
While very unlikely, a timeout can theoretically happen. If a local
storage contains lots of selems, a significant amount of time can be
spent in bpf_local_storage_destroy(), causing rqspinlock to return
-ETIMEDOUT in map_free().
CPU1                                        CPU2
----                                        ----
bpf_local_storage_destroy()
-> lock(ls->lock)
-> for selem in list
     lock(b->lock)
   (hold the lock for a long time)
                                            bpf_local_storage_map_free()
                                            -> for selem in smap,
                                                 lock(ls->lock) return -ETIMEDOUT
There are also other approaches to handle the return of
raw_res_spin_lock_irqsave() at the cost of adding more complexity to
local storage. In general, we can try to mark the selem in the local
storage or map as freed and reclaim/reuse the memory later, but I hope
retry is enough and we don't need to go there.
* Patchset organization *
The refactoring is organized into four steps.
First, in patch 1, we change the pointer used to select the bucket from
the selem to the local_storage it belongs to. This reduces the number of
buckets that need to be locked during an update from 2 to 1.
Then, in patches 2-5, local storage functions that take locks are
converted to failable. The functions are changed from returning void to
returning an int error code, with the return value temporarily set to 0.
In callers where the helpers cannot fail in the middle of an update,
the helpers are open-coded. In callers that are not allowed to fail
(i.e., bpf_local_storage_destroy() and bpf_local_storage_map_free()),
we retry.
Then, in patch 6, the locks are changed to rqspinlock, and the error
returned from raw_res_spin_lock_irqsave() is propagated to the syscalls
and helpers (see the sketch at the end of this section).
Finally, in patch 7-8, the percpu counters in task and cgroup local
storage are removed.
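For illustration, the failable conversion plus error propagation in
patches 2-6 boils down to the following shape, shown here for
bpf_selem_unlink_map() (trimmed from the actual hunks):

    static int bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
    {
            ...
            b = select_bucket(smap, local_storage);
            err = raw_res_spin_lock_irqsave(&b->lock, flags);
            if (err)
                    return err;   /* propagated up to the syscall/helper */

            if (likely(selem_linked_to_map(selem)))
                    hlist_del_init_rcu(&selem->map_node);
            raw_res_spin_unlock_irqrestore(&b->lock, flags);

            return 0;
    }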
* Test *
Task and cgroup local storage selftests already cover deadlock caused
by recursion. Patch 10 updates the expected results of the task local
storage selftests since task local storage bpf helpers can now run on
the same CPU without causing deadlock.
* Appendix: local storage internal *
There are two locks in bpf_local_storage due to the ownership model
illustrated in the figure below. A map value, which consists of a
pointer to the map and the data, is a bpf_local_storage_data (sdata)
stored in a bpf_local_storage_elem (selem). A selem belongs to a
bpf_local_storage and a bpf_local_storage_map at the same time.
bpf_local_storage::lock (local_storage->lock for short) protects the
list in a bpf_local_storage, and bpf_local_storage_map_bucket::lock
(b->lock) protects the hash bucket in a bpf_local_storage_map.
task_struct
┌ task1 ───────┐ bpf_local_storage
│ *bpf_storage │---->┌─────────┐
└──────────────┘<----│ *owner │ bpf_local_storage_elem
│ *cache[16] (selem) selem
│ *smap │ ┌──────────┐ ┌──────────┐
│ list │------->│ snode │<------->│ snode │
│ lock │ ┌---->│ map_node │<--┐ ┌-->│ map_node │
└─────────┘ │ │ sdata = │ │ │ │ sdata = │
task_struct │ │ {&mapA,} │ │ │ │ {&mapB,} │
┌ task2 ───────┐ bpf_local_storage └──────────┘ │ │ └──────────┘
│ *bpf_storage │---->┌─────────┐ │ │ │
└──────────────┘<----│ *owner │ │ │ │
│ *cache[16] │ selem │ │ selem
│ *smap │ │ ┌──────────┐ │ │ ┌──────────┐
│ list │--│---->│ snode │<--│-│-->│ snode │
│ lock │ │ ┌-->│ map_node │ └-│-->│ map_node │
└─────────┘ │ │ │ sdata = │ │ │ sdata = │
bpf_local_storage_map │ │ │ {&mapB,} │ │ │ {&mapA,} │
(smap) │ │ └──────────┘ │ └──────────┘
┌ mapA ───────┐ │ │ │
│ bpf_map map │ bpf_local_storage_map_bucket │
│ *buckets │---->┌ b[0] ┐ │ │ │
└─────────────┘ │ list │------┘ │ │
│ lock │ │ │
└──────┘ │ │
smap ... │ │
┌ mapB ───────┐ │ │
│ bpf_map map │ bpf_local_storage_map_bucket │
│ *buckets │---->┌ b[0] ┐ │ │
└─────────────┘ │ list │--------┘ │
│ lock │ │
└──────┘ │
┌ b[1] ┐ │
│ list │-----------------------------┘
│ lock │
└──────┘
...
----
Amery Hung (12):
bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
bpf: Convert bpf_selem_unlink_map to failable
bpf: Convert bpf_selem_link_map to failable
bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink
bpf: Convert bpf_selem_unlink to failable
bpf: Change local_storage->lock and b->lock to rqspinlock
bpf: Remove task local storage percpu counter
bpf: Remove cgroup local storage percpu counter
bpf: Remove unused percpu counter from bpf_local_storage_map_free
selftests/bpf: Update task_local_storage/recursion test
selftests/bpf: Remove test_task_storage_map_stress_lookup
selftests/bpf: Choose another percpu variable in bpf for btf_dump test
include/linux/bpf_local_storage.h | 14 +-
kernel/bpf/bpf_cgrp_storage.c | 62 +-----
kernel/bpf/bpf_inode_storage.c | 6 +-
kernel/bpf/bpf_local_storage.c | 200 +++++++++++-------
kernel/bpf/bpf_task_storage.c | 154 ++------------
kernel/bpf/helpers.c | 4 -
net/core/bpf_sk_storage.c | 10 +-
.../bpf/map_tests/task_storage_map.c | 128 -----------
.../selftests/bpf/prog_tests/btf_dump.c | 4 +-
.../bpf/prog_tests/task_local_storage.c | 8 +-
.../bpf/progs/read_bpf_task_storage_busy.c | 38 ----
11 files changed, 177 insertions(+), 451 deletions(-)
delete mode 100644 tools/testing/selftests/bpf/map_tests/task_storage_map.c
delete mode 100644 tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
--
2.47.3
* [RFC PATCH bpf-next v2 01/12] bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 02/12] bpf: Convert bpf_selem_unlink_map to failable Amery Hung
` (10 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
A later bpf_local_storage refactor will acquire all locks before
performing any update. To reduce the number of locks that need to be
taken in bpf_local_storage_update(), determine the bucket based on the
local_storage a selem belongs to instead of the selem pointer.
Currently, when a new selem needs to be created to replace the old selem
in bpf_local_storage_update(), the locks of both buckets need to be
acquired to prevent racing. This can be simplified if the two selems
belong to the same bucket so that only one bucket needs to be locked.
Therefore, instead of hashing the selem pointer, hash the local_storage
pointer the selem belongs to.
This is safe since a selem is always linked to the local_storage before
being linked to the map, and unlinked from the local_storage after being
unlinked from the map.
Performance-wise, this is slightly better as an update now only requires
locking one bucket. It should not change the level of contention on a
bucket, as the local_storage pointers of the selems in a map are just as
unique as the selem pointers.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
kernel/bpf/bpf_local_storage.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index b931fbceb54d..e4a7cd33b455 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -19,9 +19,9 @@
static struct bpf_local_storage_map_bucket *
select_bucket(struct bpf_local_storage_map *smap,
- struct bpf_local_storage_elem *selem)
+ struct bpf_local_storage *local_storage)
{
- return &smap->buckets[hash_ptr(selem, smap->bucket_log)];
+ return &smap->buckets[hash_ptr(local_storage, smap->bucket_log)];
}
static int mem_charge(struct bpf_local_storage_map *smap, void *owner, u32 size)
@@ -411,6 +411,7 @@ void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
static void bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
{
+ struct bpf_local_storage *local_storage;
struct bpf_local_storage_map *smap;
struct bpf_local_storage_map_bucket *b;
unsigned long flags;
@@ -419,8 +420,10 @@ static void bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
/* selem has already be unlinked from smap */
return;
+ local_storage = rcu_dereference_check(selem->local_storage,
+ bpf_rcu_lock_held());
smap = rcu_dereference_check(SDATA(selem)->smap, bpf_rcu_lock_held());
- b = select_bucket(smap, selem);
+ b = select_bucket(smap, local_storage);
raw_spin_lock_irqsave(&b->lock, flags);
if (likely(selem_linked_to_map(selem)))
hlist_del_init_rcu(&selem->map_node);
@@ -430,9 +433,13 @@ static void bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
void bpf_selem_link_map(struct bpf_local_storage_map *smap,
struct bpf_local_storage_elem *selem)
{
- struct bpf_local_storage_map_bucket *b = select_bucket(smap, selem);
+ struct bpf_local_storage *local_storage;
+ struct bpf_local_storage_map_bucket *b;
unsigned long flags;
+ local_storage = rcu_dereference_check(selem->local_storage,
+ bpf_rcu_lock_held());
+ b = select_bucket(smap, local_storage);
raw_spin_lock_irqsave(&b->lock, flags);
RCU_INIT_POINTER(SDATA(selem)->smap, smap);
hlist_add_head_rcu(&selem->map_node, &b->list);
--
2.47.3
* [RFC PATCH bpf-next v2 02/12] bpf: Convert bpf_selem_unlink_map to failable
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 01/12] bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 03/12] bpf: Convert bpf_selem_link_map " Amery Hung
` (9 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_unlink_map() to failable. It still always succeeds and
returns 0 for now.
Since some operations updating local storage cannot fail in the middle,
open-code bpf_selem_unlink_map() to take b->lock before the operation.
There are two such locations:
- bpf_local_storage_alloc()
The first selem will be unlinked from the smap if the cmpxchg of
owner_storage_ptr fails, and that unlink must not fail. Therefore, hold
b->lock from linking the selem until the allocation completes. Helpers
that assume b->lock is held by the callers are introduced:
bpf_selem_link_map_nolock() and bpf_selem_unlink_map_nolock().
- bpf_local_storage_update()
The three-step update process: link_map(new_selem),
link_storage(new_selem), and unlink_map(old_selem) should not fail in
the middle.
In bpf_selem_unlink(), bpf_selem_unlink_map() and
bpf_selem_unlink_storage() should all succeed or fail as a whole
instead of failing in the middle. So, return if unlink_map() fails.
In bpf_local_storage_destroy(), since it cannot deadlock with itself or
with bpf_local_storage_map_free(), which the function might be racing
with, retry if bpf_selem_unlink_map() fails due to rqspinlock returning
errors.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
kernel/bpf/bpf_local_storage.c | 54 ++++++++++++++++++++++++++++------
1 file changed, 45 insertions(+), 9 deletions(-)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index e4a7cd33b455..cbccf6b77f10 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -409,7 +409,7 @@ void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
hlist_add_head_rcu(&selem->snode, &local_storage->list);
}
-static void bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
+static int bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
{
struct bpf_local_storage *local_storage;
struct bpf_local_storage_map *smap;
@@ -418,7 +418,7 @@ static void bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
if (unlikely(!selem_linked_to_map_lockless(selem)))
/* selem has already be unlinked from smap */
- return;
+ return 0;
local_storage = rcu_dereference_check(selem->local_storage,
bpf_rcu_lock_held());
@@ -428,6 +428,14 @@ static void bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
if (likely(selem_linked_to_map(selem)))
hlist_del_init_rcu(&selem->map_node);
raw_spin_unlock_irqrestore(&b->lock, flags);
+
+ return 0;
+}
+
+static void bpf_selem_unlink_map_nolock(struct bpf_local_storage_elem *selem)
+{
+ if (likely(selem_linked_to_map(selem)))
+ hlist_del_init_rcu(&selem->map_node);
}
void bpf_selem_link_map(struct bpf_local_storage_map *smap,
@@ -446,13 +454,26 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
raw_spin_unlock_irqrestore(&b->lock, flags);
}
+static void bpf_selem_link_map_nolock(struct bpf_local_storage_map *smap,
+ struct bpf_local_storage_elem *selem,
+ struct bpf_local_storage_map_bucket *b)
+{
+ RCU_INIT_POINTER(SDATA(selem)->smap, smap);
+ hlist_add_head_rcu(&selem->map_node, &b->list);
+}
+
void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
{
+ int err;
+
/* Always unlink from map before unlinking from local_storage
* because selem will be freed after successfully unlinked from
* the local_storage.
*/
- bpf_selem_unlink_map(selem);
+ err = bpf_selem_unlink_map(selem);
+ if (err)
+ return;
+
bpf_selem_unlink_storage(selem, reuse_now);
}
@@ -494,6 +515,8 @@ int bpf_local_storage_alloc(void *owner,
{
struct bpf_local_storage *prev_storage, *storage;
struct bpf_local_storage **owner_storage_ptr;
+ struct bpf_local_storage_map_bucket *b;
+ unsigned long flags;
int err;
err = mem_charge(smap, owner, sizeof(*storage));
@@ -516,7 +539,10 @@ int bpf_local_storage_alloc(void *owner,
storage->owner = owner;
bpf_selem_link_storage_nolock(storage, first_selem);
- bpf_selem_link_map(smap, first_selem);
+
+ b = select_bucket(smap, storage);
+ raw_spin_lock_irqsave(&b->lock, flags);
+ bpf_selem_link_map_nolock(smap, first_selem, b);
owner_storage_ptr =
(struct bpf_local_storage **)owner_storage(smap, owner);
@@ -532,7 +558,8 @@ int bpf_local_storage_alloc(void *owner,
*/
prev_storage = cmpxchg(owner_storage_ptr, NULL, storage);
if (unlikely(prev_storage)) {
- bpf_selem_unlink_map(first_selem);
+ bpf_selem_unlink_map_nolock(first_selem);
+ raw_spin_unlock_irqrestore(&b->lock, flags);
err = -EAGAIN;
goto uncharge;
@@ -546,6 +573,7 @@ int bpf_local_storage_alloc(void *owner,
* bucket->list under rcu_read_lock().
*/
}
+ raw_spin_unlock_irqrestore(&b->lock, flags);
return 0;
@@ -567,8 +595,9 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
struct bpf_local_storage_data *old_sdata = NULL;
struct bpf_local_storage_elem *alloc_selem, *selem = NULL;
struct bpf_local_storage *local_storage;
+ struct bpf_local_storage_map_bucket *b;
HLIST_HEAD(old_selem_free_list);
- unsigned long flags;
+ unsigned long flags, b_flags;
int err;
/* BPF_EXIST and BPF_NOEXIST cannot be both set */
@@ -652,20 +681,26 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
goto unlock;
}
+ b = select_bucket(smap, local_storage);
+
+ raw_spin_lock_irqsave(&b->lock, b_flags);
+
alloc_selem = NULL;
/* First, link the new selem to the map */
- bpf_selem_link_map(smap, selem);
+ bpf_selem_link_map_nolock(smap, selem, b);
/* Second, link (and publish) the new selem to local_storage */
bpf_selem_link_storage_nolock(local_storage, selem);
/* Third, remove old selem, SELEM(old_sdata) */
if (old_sdata) {
- bpf_selem_unlink_map(SELEM(old_sdata));
+ bpf_selem_unlink_map_nolock(SELEM(old_sdata));
bpf_selem_unlink_storage_nolock(local_storage, SELEM(old_sdata),
true, &old_selem_free_list);
}
+ raw_spin_unlock_irqrestore(&b->lock, b_flags);
+
unlock:
raw_spin_unlock_irqrestore(&local_storage->lock, flags);
bpf_selem_free_list(&old_selem_free_list, false);
@@ -743,6 +778,7 @@ void bpf_local_storage_destroy(struct bpf_local_storage *local_storage)
HLIST_HEAD(free_selem_list);
struct hlist_node *n;
unsigned long flags;
+ int err;
storage_smap = rcu_dereference_check(local_storage->smap, bpf_rcu_lock_held());
bpf_ma = check_storage_bpf_ma(local_storage, storage_smap, NULL);
@@ -761,7 +797,7 @@ void bpf_local_storage_destroy(struct bpf_local_storage *local_storage)
/* Always unlink from map before unlinking from
* local_storage.
*/
- bpf_selem_unlink_map(selem);
+ while (bpf_selem_unlink_map(selem));
/* If local_storage list has only one element, the
* bpf_selem_unlink_storage_nolock() will return true.
* Otherwise, it will return false. The current loop iteration
--
2.47.3
* [RFC PATCH bpf-next v2 03/12] bpf: Convert bpf_selem_link_map to failable
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 01/12] bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 02/12] bpf: Convert bpf_selem_unlink_map to failable Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 04/12] bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink Amery Hung
` (8 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
To prepare for changing bpf_local_storage_map_bucket::lock to rqspinlock,
convert bpf_selem_link_map() to failable. It still always succeeds and
returns 0 until the change happens. No functional change.
__must_check is added to the function declaration locally to make sure
all the callers are accounted for during the conversion.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
include/linux/bpf_local_storage.h | 4 ++--
kernel/bpf/bpf_local_storage.c | 6 ++++--
net/core/bpf_sk_storage.c | 4 +++-
3 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index ab7244d8108f..dc56fa459ac9 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -182,8 +182,8 @@ void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now);
-void bpf_selem_link_map(struct bpf_local_storage_map *smap,
- struct bpf_local_storage_elem *selem);
+int bpf_selem_link_map(struct bpf_local_storage_map *smap,
+ struct bpf_local_storage_elem *selem);
struct bpf_local_storage_elem *
bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner, void *value,
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index cbccf6b77f10..682409fb22a2 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -438,8 +438,8 @@ static void bpf_selem_unlink_map_nolock(struct bpf_local_storage_elem *selem)
hlist_del_init_rcu(&selem->map_node);
}
-void bpf_selem_link_map(struct bpf_local_storage_map *smap,
- struct bpf_local_storage_elem *selem)
+int bpf_selem_link_map(struct bpf_local_storage_map *smap,
+ struct bpf_local_storage_elem *selem)
{
struct bpf_local_storage *local_storage;
struct bpf_local_storage_map_bucket *b;
@@ -452,6 +452,8 @@ void bpf_selem_link_map(struct bpf_local_storage_map *smap,
RCU_INIT_POINTER(SDATA(selem)->smap, smap);
hlist_add_head_rcu(&selem->map_node, &b->list);
raw_spin_unlock_irqrestore(&b->lock, flags);
+
+ return 0;
}
static void bpf_selem_link_map_nolock(struct bpf_local_storage_map *smap,
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 2e538399757f..fac5cf385785 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -194,7 +194,9 @@ int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
}
if (new_sk_storage) {
- bpf_selem_link_map(smap, copy_selem);
+ ret = bpf_selem_link_map(smap, copy_selem);
+ if (ret)
+ goto out;
bpf_selem_link_storage_nolock(new_sk_storage, copy_selem);
} else {
ret = bpf_local_storage_alloc(newsk, smap, copy_selem, GFP_ATOMIC);
--
2.47.3
* [RFC PATCH bpf-next v2 04/12] bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
` (2 preceding siblings ...)
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 03/12] bpf: Convert bpf_selem_link_map " Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 05/12] bpf: Convert bpf_selem_unlink to failable Amery Hung
` (7 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
To prepare for changing bpf_local_storage::lock to rqspinlock, open code
bpf_selem_unlink_storage() in the only caller, bpf_selem_unlink(), since
unlink_map and unlink_storage must be done together after all the
necessary locks are acquired.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
kernel/bpf/bpf_local_storage.c | 71 ++++++++++++++++------------------
1 file changed, 33 insertions(+), 38 deletions(-)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 682409fb22a2..9c2b041ae9ca 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -371,37 +371,6 @@ static bool check_storage_bpf_ma(struct bpf_local_storage *local_storage,
return selem_smap->bpf_ma;
}
-static void bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem,
- bool reuse_now)
-{
- struct bpf_local_storage_map *storage_smap;
- struct bpf_local_storage *local_storage;
- bool bpf_ma, free_local_storage = false;
- HLIST_HEAD(selem_free_list);
- unsigned long flags;
-
- if (unlikely(!selem_linked_to_storage_lockless(selem)))
- /* selem has already been unlinked from sk */
- return;
-
- local_storage = rcu_dereference_check(selem->local_storage,
- bpf_rcu_lock_held());
- storage_smap = rcu_dereference_check(local_storage->smap,
- bpf_rcu_lock_held());
- bpf_ma = check_storage_bpf_ma(local_storage, storage_smap, selem);
-
- raw_spin_lock_irqsave(&local_storage->lock, flags);
- if (likely(selem_linked_to_storage(selem)))
- free_local_storage = bpf_selem_unlink_storage_nolock(
- local_storage, selem, true, &selem_free_list);
- raw_spin_unlock_irqrestore(&local_storage->lock, flags);
-
- bpf_selem_free_list(&selem_free_list, reuse_now);
-
- if (free_local_storage)
- bpf_local_storage_free(local_storage, storage_smap, bpf_ma, reuse_now);
-}
-
void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
struct bpf_local_storage_elem *selem)
{
@@ -466,17 +435,43 @@ static void bpf_selem_link_map_nolock(struct bpf_local_storage_map *smap,
void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
{
+ struct bpf_local_storage_map *storage_smap;
+ struct bpf_local_storage *local_storage;
+ bool bpf_ma, free_local_storage = false;
+ HLIST_HEAD(selem_free_list);
+ unsigned long flags;
int err;
- /* Always unlink from map before unlinking from local_storage
- * because selem will be freed after successfully unlinked from
- * the local_storage.
- */
- err = bpf_selem_unlink_map(selem);
- if (err)
+ if (unlikely(!selem_linked_to_storage_lockless(selem)))
+ /* selem has already been unlinked from sk */
return;
- bpf_selem_unlink_storage(selem, reuse_now);
+ local_storage = rcu_dereference_check(selem->local_storage,
+ bpf_rcu_lock_held());
+ storage_smap = rcu_dereference_check(local_storage->smap,
+ bpf_rcu_lock_held());
+ bpf_ma = check_storage_bpf_ma(local_storage, storage_smap, selem);
+
+ raw_spin_lock_irqsave(&local_storage->lock, flags);
+ if (likely(selem_linked_to_storage(selem))) {
+ /* Always unlink from map before unlinking from local_storage
+ * because selem will be freed after successfully unlinked from
+ * the local_storage.
+ */
+ err = bpf_selem_unlink_map(selem);
+ if (err)
+ goto out;
+
+ free_local_storage = bpf_selem_unlink_storage_nolock(
+ local_storage, selem, true, &selem_free_list);
+ }
+out:
+ raw_spin_unlock_irqrestore(&local_storage->lock, flags);
+
+ bpf_selem_free_list(&selem_free_list, reuse_now);
+
+ if (free_local_storage)
+ bpf_local_storage_free(local_storage, storage_smap, bpf_ma, reuse_now);
}
void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
--
2.47.3
* [RFC PATCH bpf-next v2 05/12] bpf: Convert bpf_selem_unlink to failable
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
` (3 preceding siblings ...)
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 04/12] bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 06/12] bpf: Change local_storage->lock and b->lock to rqspinlock Amery Hung
` (6 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
To prepare changing both bpf_local_storage_map_bucket::lock and
bpf_local_storage::lock to rqspinlock, convert bpf_selem_unlink() to
failable. It still always succeeds and returns 0 until the change
happens. No functional change.
For bpf_local_storage_map_free(), since it cannot deadlock with itself
or with bpf_local_storage_destroy(), which the function might be racing
with, retry if bpf_selem_unlink() fails due to rqspinlock returning
errors.
__must_check is added to the function declaration locally to make sure
all callers are accounted for during the conversion.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
include/linux/bpf_local_storage.h | 2 +-
kernel/bpf/bpf_cgrp_storage.c | 3 +--
kernel/bpf/bpf_inode_storage.c | 4 +---
kernel/bpf/bpf_local_storage.c | 6 ++++--
kernel/bpf/bpf_task_storage.c | 4 +---
net/core/bpf_sk_storage.c | 4 +---
6 files changed, 9 insertions(+), 14 deletions(-)
diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index dc56fa459ac9..26b7f53dad33 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -180,7 +180,7 @@ int bpf_local_storage_map_check_btf(const struct bpf_map *map,
void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage,
struct bpf_local_storage_elem *selem);
-void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now);
+int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now);
int bpf_selem_link_map(struct bpf_local_storage_map *smap,
struct bpf_local_storage_elem *selem);
diff --git a/kernel/bpf/bpf_cgrp_storage.c b/kernel/bpf/bpf_cgrp_storage.c
index 0687a760974a..8fef24fcac68 100644
--- a/kernel/bpf/bpf_cgrp_storage.c
+++ b/kernel/bpf/bpf_cgrp_storage.c
@@ -118,8 +118,7 @@ static int cgroup_storage_delete(struct cgroup *cgroup, struct bpf_map *map)
if (!sdata)
return -ENOENT;
- bpf_selem_unlink(SELEM(sdata), false);
- return 0;
+ return bpf_selem_unlink(SELEM(sdata), false);
}
static long bpf_cgrp_storage_delete_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index e54cce2b9175..cedc99184dad 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -110,9 +110,7 @@ static int inode_storage_delete(struct inode *inode, struct bpf_map *map)
if (!sdata)
return -ENOENT;
- bpf_selem_unlink(SELEM(sdata), false);
-
- return 0;
+ return bpf_selem_unlink(SELEM(sdata), false);
}
static long bpf_fd_inode_storage_delete_elem(struct bpf_map *map, void *key)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 9c2b041ae9ca..e0e405060e3c 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -433,7 +433,7 @@ static void bpf_selem_link_map_nolock(struct bpf_local_storage_map *smap,
hlist_add_head_rcu(&selem->map_node, &b->list);
}
-void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
+int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
{
struct bpf_local_storage_map *storage_smap;
struct bpf_local_storage *local_storage;
@@ -472,6 +472,8 @@ void bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
if (free_local_storage)
bpf_local_storage_free(local_storage, storage_smap, bpf_ma, reuse_now);
+
+ return 0;
}
void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
@@ -930,7 +932,7 @@ void bpf_local_storage_map_free(struct bpf_map *map,
struct bpf_local_storage_elem, map_node))) {
if (busy_counter)
this_cpu_inc(*busy_counter);
- bpf_selem_unlink(selem, true);
+ while (bpf_selem_unlink(selem, true));
if (busy_counter)
this_cpu_dec(*busy_counter);
cond_resched_rcu();
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index a1dc1bf0848a..ab902364ac23 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -167,9 +167,7 @@ static int task_storage_delete(struct task_struct *task, struct bpf_map *map,
if (!nobusy)
return -EBUSY;
- bpf_selem_unlink(SELEM(sdata), false);
-
- return 0;
+ return bpf_selem_unlink(SELEM(sdata), false);
}
static long bpf_pid_task_storage_delete_elem(struct bpf_map *map, void *key)
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index fac5cf385785..7b3d44667cee 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -40,9 +40,7 @@ static int bpf_sk_storage_del(struct sock *sk, struct bpf_map *map)
if (!sdata)
return -ENOENT;
- bpf_selem_unlink(SELEM(sdata), false);
-
- return 0;
+ return bpf_selem_unlink(SELEM(sdata), false);
}
/* Called by __sk_destruct() & bpf_sk_storage_clone() */
--
2.47.3
* [RFC PATCH bpf-next v2 06/12] bpf: Change local_storage->lock and b->lock to rqspinlock
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
` (4 preceding siblings ...)
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 05/12] bpf: Convert bpf_selem_unlink to failable Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 23:37 ` Alexei Starovoitov
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 07/12] bpf: Remove task local storage percpu counter Amery Hung
` (5 subsequent siblings)
11 siblings, 1 reply; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
Change bpf_local_storage::lock and bpf_local_storage_map_bucket::lock
from raw_spin_lock to rqspinlock.
Finally, propagate errors from raw_res_spin_lock_irqsave() to the
syscall return or the BPF helper return.
In bpf_local_storage_destroy(), since it cannot deadlock with itself or
with bpf_local_storage_map_free(), which the function might be racing
with, retry if bpf_selem_unlink_map() fails due to rqspinlock returning
errors.
In __bpf_local_storage_insert_cache(), instead of handling the error,
skip updating the cache.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
include/linux/bpf_local_storage.h | 5 ++-
kernel/bpf/bpf_local_storage.c | 65 ++++++++++++++++++++-----------
2 files changed, 46 insertions(+), 24 deletions(-)
diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 26b7f53dad33..2a0aae5168fa 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -15,6 +15,7 @@
#include <linux/types.h>
#include <linux/bpf_mem_alloc.h>
#include <uapi/linux/btf.h>
+#include <asm/rqspinlock.h>
#define BPF_LOCAL_STORAGE_CACHE_SIZE 16
@@ -23,7 +24,7 @@
rcu_read_lock_bh_held())
struct bpf_local_storage_map_bucket {
struct hlist_head list;
- raw_spinlock_t lock;
+ rqspinlock_t lock;
};
/* Thp map is not the primary owner of a bpf_local_storage_elem.
@@ -99,7 +100,7 @@ struct bpf_local_storage {
* bpf_local_storage_elem.
*/
struct rcu_head rcu;
- raw_spinlock_t lock; /* Protect adding/removing from the "list" */
+ rqspinlock_t lock; /* Protect adding/removing from the "list" */
};
/* U16_MAX is much more than enough for sk local storage
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index e0e405060e3c..572956e2a72d 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -384,6 +384,7 @@ static int bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
struct bpf_local_storage_map *smap;
struct bpf_local_storage_map_bucket *b;
unsigned long flags;
+ int err;
if (unlikely(!selem_linked_to_map_lockless(selem)))
/* selem has already be unlinked from smap */
@@ -393,10 +394,13 @@ static int bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
bpf_rcu_lock_held());
smap = rcu_dereference_check(SDATA(selem)->smap, bpf_rcu_lock_held());
b = select_bucket(smap, local_storage);
- raw_spin_lock_irqsave(&b->lock, flags);
+ err = raw_res_spin_lock_irqsave(&b->lock, flags);
+ if (err)
+ return err;
+
if (likely(selem_linked_to_map(selem)))
hlist_del_init_rcu(&selem->map_node);
- raw_spin_unlock_irqrestore(&b->lock, flags);
+ raw_res_spin_unlock_irqrestore(&b->lock, flags);
return 0;
}
@@ -413,14 +417,18 @@ int bpf_selem_link_map(struct bpf_local_storage_map *smap,
struct bpf_local_storage *local_storage;
struct bpf_local_storage_map_bucket *b;
unsigned long flags;
+ int err;
local_storage = rcu_dereference_check(selem->local_storage,
bpf_rcu_lock_held());
b = select_bucket(smap, local_storage);
- raw_spin_lock_irqsave(&b->lock, flags);
+ err = raw_res_spin_lock_irqsave(&b->lock, flags);
+ if (err)
+ return err;
+
RCU_INIT_POINTER(SDATA(selem)->smap, smap);
hlist_add_head_rcu(&selem->map_node, &b->list);
- raw_spin_unlock_irqrestore(&b->lock, flags);
+ raw_res_spin_unlock_irqrestore(&b->lock, flags);
return 0;
}
@@ -444,7 +452,7 @@ int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
if (unlikely(!selem_linked_to_storage_lockless(selem)))
/* selem has already been unlinked from sk */
- return;
+ return 0;
local_storage = rcu_dereference_check(selem->local_storage,
bpf_rcu_lock_held());
@@ -452,7 +460,10 @@ int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
bpf_rcu_lock_held());
bpf_ma = check_storage_bpf_ma(local_storage, storage_smap, selem);
- raw_spin_lock_irqsave(&local_storage->lock, flags);
+ err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
+ if (err)
+ return err;
+
if (likely(selem_linked_to_storage(selem))) {
/* Always unlink from map before unlinking from local_storage
* because selem will be freed after successfully unlinked from
@@ -466,14 +477,14 @@ int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
local_storage, selem, true, &selem_free_list);
}
out:
- raw_spin_unlock_irqrestore(&local_storage->lock, flags);
+ raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
bpf_selem_free_list(&selem_free_list, reuse_now);
if (free_local_storage)
bpf_local_storage_free(local_storage, storage_smap, bpf_ma, reuse_now);
- return 0;
+ return err;
}
void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
@@ -481,16 +492,20 @@ void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
struct bpf_local_storage_elem *selem)
{
unsigned long flags;
+ int err;
/* spinlock is needed to avoid racing with the
* parallel delete. Otherwise, publishing an already
* deleted sdata to the cache will become a use-after-free
* problem in the next bpf_local_storage_lookup().
*/
- raw_spin_lock_irqsave(&local_storage->lock, flags);
+ err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
+ if (err)
+ return;
+
if (selem_linked_to_storage(selem))
rcu_assign_pointer(local_storage->cache[smap->cache_idx], SDATA(selem));
- raw_spin_unlock_irqrestore(&local_storage->lock, flags);
+ raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
}
static int check_flags(const struct bpf_local_storage_data *old_sdata,
@@ -534,13 +549,16 @@ int bpf_local_storage_alloc(void *owner,
RCU_INIT_POINTER(storage->smap, smap);
INIT_HLIST_HEAD(&storage->list);
- raw_spin_lock_init(&storage->lock);
+ raw_res_spin_lock_init(&storage->lock);
storage->owner = owner;
bpf_selem_link_storage_nolock(storage, first_selem);
b = select_bucket(smap, storage);
- raw_spin_lock_irqsave(&b->lock, flags);
+ err = raw_res_spin_lock_irqsave(&b->lock, flags);
+ if (err)
+ goto uncharge;
+
bpf_selem_link_map_nolock(smap, first_selem, b);
owner_storage_ptr =
@@ -558,7 +576,7 @@ int bpf_local_storage_alloc(void *owner,
prev_storage = cmpxchg(owner_storage_ptr, NULL, storage);
if (unlikely(prev_storage)) {
bpf_selem_unlink_map_nolock(first_selem);
- raw_spin_unlock_irqrestore(&b->lock, flags);
+ raw_res_spin_unlock_irqrestore(&b->lock, flags);
err = -EAGAIN;
goto uncharge;
@@ -572,7 +590,7 @@ int bpf_local_storage_alloc(void *owner,
* bucket->list under rcu_read_lock().
*/
}
- raw_spin_unlock_irqrestore(&b->lock, flags);
+ raw_res_spin_unlock_irqrestore(&b->lock, flags);
return 0;
@@ -655,7 +673,9 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
if (!alloc_selem)
return ERR_PTR(-ENOMEM);
- raw_spin_lock_irqsave(&local_storage->lock, flags);
+ err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
+ if (err)
+ return ERR_PTR(err);
/* Recheck local_storage->list under local_storage->lock */
if (unlikely(hlist_empty(&local_storage->list))) {
@@ -682,7 +702,9 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
b = select_bucket(smap, local_storage);
- raw_spin_lock_irqsave(&b->lock, b_flags);
+ err = raw_res_spin_lock_irqsave(&b->lock, b_flags);
+ if (err)
+ goto unlock;
alloc_selem = NULL;
/* First, link the new selem to the map */
@@ -698,10 +720,9 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
true, &old_selem_free_list);
}
- raw_spin_unlock_irqrestore(&b->lock, b_flags);
-
+ raw_res_spin_unlock_irqrestore(&b->lock, b_flags);
unlock:
- raw_spin_unlock_irqrestore(&local_storage->lock, flags);
+ raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
bpf_selem_free_list(&old_selem_free_list, false);
if (alloc_selem) {
mem_uncharge(smap, owner, smap->elem_size);
@@ -791,7 +812,7 @@ void bpf_local_storage_destroy(struct bpf_local_storage *local_storage)
* when unlinking elem from the local_storage->list and
* the map's bucket->list.
*/
- raw_spin_lock_irqsave(&local_storage->lock, flags);
+ while (raw_res_spin_lock_irqsave(&local_storage->lock, flags));
hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) {
/* Always unlink from map before unlinking from
* local_storage.
@@ -806,7 +827,7 @@ void bpf_local_storage_destroy(struct bpf_local_storage *local_storage)
free_storage = bpf_selem_unlink_storage_nolock(
local_storage, selem, true, &free_selem_list);
}
- raw_spin_unlock_irqrestore(&local_storage->lock, flags);
+ raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
bpf_selem_free_list(&free_selem_list, true);
@@ -863,7 +884,7 @@ bpf_local_storage_map_alloc(union bpf_attr *attr,
for (i = 0; i < nbuckets; i++) {
INIT_HLIST_HEAD(&smap->buckets[i].list);
- raw_spin_lock_init(&smap->buckets[i].lock);
+ raw_res_spin_lock_init(&smap->buckets[i].lock);
}
smap->elem_size = offsetof(struct bpf_local_storage_elem,
--
2.47.3
* [RFC PATCH bpf-next v2 07/12] bpf: Remove task local storage percpu counter
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
` (5 preceding siblings ...)
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 06/12] bpf: Change local_storage->lock and b->lock to rqspinlock Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 08/12] bpf: Remove cgroup " Amery Hung
` (4 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
The percpu counter in task local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Since the percpu counter is removed, merge bpf_task_storage_get()
and bpf_task_storage_get_recur() back together. This allows the bpf
syscalls and helpers to run concurrently on the same CPU, removing the
spurious -EBUSY error. bpf_task_storage_get(..., F_CREATE) will now
always succeed given enough free memory, unless it is called
recursively.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
kernel/bpf/bpf_task_storage.c | 150 ++++------------------------------
kernel/bpf/helpers.c | 4 -
2 files changed, 18 insertions(+), 136 deletions(-)
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index ab902364ac23..dd858226ada2 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -20,29 +20,6 @@
DEFINE_BPF_STORAGE_CACHE(task_cache);
-static DEFINE_PER_CPU(int, bpf_task_storage_busy);
-
-static void bpf_task_storage_lock(void)
-{
- cant_migrate();
- this_cpu_inc(bpf_task_storage_busy);
-}
-
-static void bpf_task_storage_unlock(void)
-{
- this_cpu_dec(bpf_task_storage_busy);
-}
-
-static bool bpf_task_storage_trylock(void)
-{
- cant_migrate();
- if (unlikely(this_cpu_inc_return(bpf_task_storage_busy) != 1)) {
- this_cpu_dec(bpf_task_storage_busy);
- return false;
- }
- return true;
-}
-
static struct bpf_local_storage __rcu **task_storage_ptr(void *owner)
{
struct task_struct *task = owner;
@@ -70,17 +47,15 @@ void bpf_task_storage_free(struct task_struct *task)
{
struct bpf_local_storage *local_storage;
- rcu_read_lock_dont_migrate();
+ rcu_read_lock();
local_storage = rcu_dereference(task->bpf_storage);
if (!local_storage)
goto out;
- bpf_task_storage_lock();
bpf_local_storage_destroy(local_storage);
- bpf_task_storage_unlock();
out:
- rcu_read_unlock_migrate();
+ rcu_read_unlock();
}
static void *bpf_pid_task_storage_lookup_elem(struct bpf_map *map, void *key)
@@ -106,9 +81,7 @@ static void *bpf_pid_task_storage_lookup_elem(struct bpf_map *map, void *key)
goto out;
}
- bpf_task_storage_lock();
sdata = task_storage_lookup(task, map, true);
- bpf_task_storage_unlock();
put_pid(pid);
return sdata ? sdata->data : NULL;
out:
@@ -143,11 +116,9 @@ static long bpf_pid_task_storage_update_elem(struct bpf_map *map, void *key,
goto out;
}
- bpf_task_storage_lock();
sdata = bpf_local_storage_update(
task, (struct bpf_local_storage_map *)map, value, map_flags,
true, GFP_ATOMIC);
- bpf_task_storage_unlock();
err = PTR_ERR_OR_ZERO(sdata);
out:
@@ -155,8 +126,7 @@ static long bpf_pid_task_storage_update_elem(struct bpf_map *map, void *key,
return err;
}
-static int task_storage_delete(struct task_struct *task, struct bpf_map *map,
- bool nobusy)
+static int task_storage_delete(struct task_struct *task, struct bpf_map *map)
{
struct bpf_local_storage_data *sdata;
@@ -164,9 +134,6 @@ static int task_storage_delete(struct task_struct *task, struct bpf_map *map,
if (!sdata)
return -ENOENT;
- if (!nobusy)
- return -EBUSY;
-
return bpf_selem_unlink(SELEM(sdata), false);
}
@@ -192,111 +159,50 @@ static long bpf_pid_task_storage_delete_elem(struct bpf_map *map, void *key)
goto out;
}
- bpf_task_storage_lock();
- err = task_storage_delete(task, map, true);
- bpf_task_storage_unlock();
+ err = task_storage_delete(task, map);
out:
put_pid(pid);
return err;
}
-/* Called by bpf_task_storage_get*() helpers */
-static void *__bpf_task_storage_get(struct bpf_map *map,
- struct task_struct *task, void *value,
- u64 flags, gfp_t gfp_flags, bool nobusy)
+/* *gfp_flags* is a hidden argument provided by the verifier */
+BPF_CALL_5(bpf_task_storage_get, struct bpf_map *, map, struct task_struct *,
+ task, void *, value, u64, flags, gfp_t, gfp_flags)
{
struct bpf_local_storage_data *sdata;
- sdata = task_storage_lookup(task, map, nobusy);
+ WARN_ON_ONCE(!bpf_rcu_lock_held());
+ if (flags & ~BPF_LOCAL_STORAGE_GET_F_CREATE || !task)
+ return (unsigned long)NULL;
+
+ sdata = task_storage_lookup(task, map, true);
if (sdata)
- return sdata->data;
+ return (unsigned long)sdata->data;
/* only allocate new storage, when the task is refcounted */
if (refcount_read(&task->usage) &&
- (flags & BPF_LOCAL_STORAGE_GET_F_CREATE) && nobusy) {
+ (flags & BPF_LOCAL_STORAGE_GET_F_CREATE)) {
sdata = bpf_local_storage_update(
task, (struct bpf_local_storage_map *)map, value,
BPF_NOEXIST, false, gfp_flags);
- return IS_ERR(sdata) ? NULL : sdata->data;
+ return IS_ERR(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
}
- return NULL;
-}
-
-/* *gfp_flags* is a hidden argument provided by the verifier */
-BPF_CALL_5(bpf_task_storage_get_recur, struct bpf_map *, map, struct task_struct *,
- task, void *, value, u64, flags, gfp_t, gfp_flags)
-{
- bool nobusy;
- void *data;
-
- WARN_ON_ONCE(!bpf_rcu_lock_held());
- if (flags & ~BPF_LOCAL_STORAGE_GET_F_CREATE || !task)
- return (unsigned long)NULL;
-
- nobusy = bpf_task_storage_trylock();
- data = __bpf_task_storage_get(map, task, value, flags,
- gfp_flags, nobusy);
- if (nobusy)
- bpf_task_storage_unlock();
- return (unsigned long)data;
-}
-
-/* *gfp_flags* is a hidden argument provided by the verifier */
-BPF_CALL_5(bpf_task_storage_get, struct bpf_map *, map, struct task_struct *,
- task, void *, value, u64, flags, gfp_t, gfp_flags)
-{
- void *data;
-
- WARN_ON_ONCE(!bpf_rcu_lock_held());
- if (flags & ~BPF_LOCAL_STORAGE_GET_F_CREATE || !task)
- return (unsigned long)NULL;
-
- bpf_task_storage_lock();
- data = __bpf_task_storage_get(map, task, value, flags,
- gfp_flags, true);
- bpf_task_storage_unlock();
- return (unsigned long)data;
-}
-
-BPF_CALL_2(bpf_task_storage_delete_recur, struct bpf_map *, map, struct task_struct *,
- task)
-{
- bool nobusy;
- int ret;
-
- WARN_ON_ONCE(!bpf_rcu_lock_held());
- if (!task)
- return -EINVAL;
-
- nobusy = bpf_task_storage_trylock();
- /* This helper must only be called from places where the lifetime of the task
- * is guaranteed. Either by being refcounted or by being protected
- * by an RCU read-side critical section.
- */
- ret = task_storage_delete(task, map, nobusy);
- if (nobusy)
- bpf_task_storage_unlock();
- return ret;
+ return (unsigned long)NULL;
}
BPF_CALL_2(bpf_task_storage_delete, struct bpf_map *, map, struct task_struct *,
task)
{
- int ret;
-
WARN_ON_ONCE(!bpf_rcu_lock_held());
if (!task)
return -EINVAL;
- bpf_task_storage_lock();
/* This helper must only be called from places where the lifetime of the task
* is guaranteed. Either by being refcounted or by being protected
* by an RCU read-side critical section.
*/
- ret = task_storage_delete(task, map, true);
- bpf_task_storage_unlock();
- return ret;
+ return task_storage_delete(task, map);
}
static int notsupp_get_next_key(struct bpf_map *map, void *key, void *next_key)
@@ -311,7 +217,7 @@ static struct bpf_map *task_storage_map_alloc(union bpf_attr *attr)
static void task_storage_map_free(struct bpf_map *map)
{
- bpf_local_storage_map_free(map, &task_cache, &bpf_task_storage_busy);
+ bpf_local_storage_map_free(map, &task_cache, NULL);
}
BTF_ID_LIST_GLOBAL_SINGLE(bpf_local_storage_map_btf_id, struct, bpf_local_storage_map)
@@ -330,17 +236,6 @@ const struct bpf_map_ops task_storage_map_ops = {
.map_owner_storage_ptr = task_storage_ptr,
};
-const struct bpf_func_proto bpf_task_storage_get_recur_proto = {
- .func = bpf_task_storage_get_recur,
- .gpl_only = false,
- .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL,
- .arg1_type = ARG_CONST_MAP_PTR,
- .arg2_type = ARG_PTR_TO_BTF_ID_OR_NULL,
- .arg2_btf_id = &btf_tracing_ids[BTF_TRACING_TYPE_TASK],
- .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL,
- .arg4_type = ARG_ANYTHING,
-};
-
const struct bpf_func_proto bpf_task_storage_get_proto = {
.func = bpf_task_storage_get,
.gpl_only = false,
@@ -352,15 +247,6 @@ const struct bpf_func_proto bpf_task_storage_get_proto = {
.arg4_type = ARG_ANYTHING,
};
-const struct bpf_func_proto bpf_task_storage_delete_recur_proto = {
- .func = bpf_task_storage_delete_recur,
- .gpl_only = false,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_CONST_MAP_PTR,
- .arg2_type = ARG_PTR_TO_BTF_ID_OR_NULL,
- .arg2_btf_id = &btf_tracing_ids[BTF_TRACING_TYPE_TASK],
-};
-
const struct bpf_func_proto bpf_task_storage_delete_proto = {
.func = bpf_task_storage_delete,
.gpl_only = false,
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index c9fab9a356df..1d83f29a8986 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2040,12 +2040,8 @@ bpf_base_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_get_cgroup_classid_curr_proto;
#endif
case BPF_FUNC_task_storage_get:
- if (bpf_prog_check_recur(prog))
- return &bpf_task_storage_get_recur_proto;
return &bpf_task_storage_get_proto;
case BPF_FUNC_task_storage_delete:
- if (bpf_prog_check_recur(prog))
- return &bpf_task_storage_delete_recur_proto;
return &bpf_task_storage_delete_proto;
default:
break;
--
2.47.3
* [RFC PATCH bpf-next v2 08/12] bpf: Remove cgroup local storage percpu counter
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
` (6 preceding siblings ...)
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 07/12] bpf: Remove task local storage percpu counter Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 09/12] bpf: Remove unused percpu counter from bpf_local_storage_map_free Amery Hung
` (3 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
The percpu counter in cgroup local storage is no longer needed as the
underlying bpf_local_storage can now handle deadlock with the help of
rqspinlock. Remove the percpu counter and related migrate_{disable,
enable}.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
kernel/bpf/bpf_cgrp_storage.c | 59 +++++------------------------------
1 file changed, 8 insertions(+), 51 deletions(-)
diff --git a/kernel/bpf/bpf_cgrp_storage.c b/kernel/bpf/bpf_cgrp_storage.c
index 8fef24fcac68..4f9cfa032870 100644
--- a/kernel/bpf/bpf_cgrp_storage.c
+++ b/kernel/bpf/bpf_cgrp_storage.c
@@ -11,29 +11,6 @@
DEFINE_BPF_STORAGE_CACHE(cgroup_cache);
-static DEFINE_PER_CPU(int, bpf_cgrp_storage_busy);
-
-static void bpf_cgrp_storage_lock(void)
-{
- cant_migrate();
- this_cpu_inc(bpf_cgrp_storage_busy);
-}
-
-static void bpf_cgrp_storage_unlock(void)
-{
- this_cpu_dec(bpf_cgrp_storage_busy);
-}
-
-static bool bpf_cgrp_storage_trylock(void)
-{
- cant_migrate();
- if (unlikely(this_cpu_inc_return(bpf_cgrp_storage_busy) != 1)) {
- this_cpu_dec(bpf_cgrp_storage_busy);
- return false;
- }
- return true;
-}
-
static struct bpf_local_storage __rcu **cgroup_storage_ptr(void *owner)
{
struct cgroup *cg = owner;
@@ -45,16 +22,14 @@ void bpf_cgrp_storage_free(struct cgroup *cgroup)
{
struct bpf_local_storage *local_storage;
- rcu_read_lock_dont_migrate();
+ rcu_read_lock();
local_storage = rcu_dereference(cgroup->bpf_cgrp_storage);
if (!local_storage)
goto out;
- bpf_cgrp_storage_lock();
bpf_local_storage_destroy(local_storage);
- bpf_cgrp_storage_unlock();
out:
- rcu_read_unlock_migrate();
+ rcu_read_unlock();
}
static struct bpf_local_storage_data *
@@ -83,9 +58,7 @@ static void *bpf_cgrp_storage_lookup_elem(struct bpf_map *map, void *key)
if (IS_ERR(cgroup))
return ERR_CAST(cgroup);
- bpf_cgrp_storage_lock();
sdata = cgroup_storage_lookup(cgroup, map, true);
- bpf_cgrp_storage_unlock();
cgroup_put(cgroup);
return sdata ? sdata->data : NULL;
}
@@ -102,10 +75,8 @@ static long bpf_cgrp_storage_update_elem(struct bpf_map *map, void *key,
if (IS_ERR(cgroup))
return PTR_ERR(cgroup);
- bpf_cgrp_storage_lock();
sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
value, map_flags, false, GFP_ATOMIC);
- bpf_cgrp_storage_unlock();
cgroup_put(cgroup);
return PTR_ERR_OR_ZERO(sdata);
}
@@ -131,9 +102,7 @@ static long bpf_cgrp_storage_delete_elem(struct bpf_map *map, void *key)
if (IS_ERR(cgroup))
return PTR_ERR(cgroup);
- bpf_cgrp_storage_lock();
err = cgroup_storage_delete(cgroup, map);
- bpf_cgrp_storage_unlock();
cgroup_put(cgroup);
return err;
}
@@ -150,7 +119,7 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
static void cgroup_storage_map_free(struct bpf_map *map)
{
- bpf_local_storage_map_free(map, &cgroup_cache, &bpf_cgrp_storage_busy);
+ bpf_local_storage_map_free(map, &cgroup_cache, NULL);
}
/* *gfp_flags* is a hidden argument provided by the verifier */
@@ -158,7 +127,6 @@ BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
void *, value, u64, flags, gfp_t, gfp_flags)
{
struct bpf_local_storage_data *sdata;
- bool nobusy;
WARN_ON_ONCE(!bpf_rcu_lock_held());
if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE))
@@ -167,38 +135,27 @@ BPF_CALL_5(bpf_cgrp_storage_get, struct bpf_map *, map, struct cgroup *, cgroup,
if (!cgroup)
return (unsigned long)NULL;
- nobusy = bpf_cgrp_storage_trylock();
-
- sdata = cgroup_storage_lookup(cgroup, map, nobusy);
+ sdata = cgroup_storage_lookup(cgroup, map, NULL);
if (sdata)
- goto unlock;
+ goto out;
/* only allocate new storage, when the cgroup is refcounted */
if (!percpu_ref_is_dying(&cgroup->self.refcnt) &&
- (flags & BPF_LOCAL_STORAGE_GET_F_CREATE) && nobusy)
+ (flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
sdata = bpf_local_storage_update(cgroup, (struct bpf_local_storage_map *)map,
value, BPF_NOEXIST, false, gfp_flags);
-unlock:
- if (nobusy)
- bpf_cgrp_storage_unlock();
+out:
return IS_ERR_OR_NULL(sdata) ? (unsigned long)NULL : (unsigned long)sdata->data;
}
BPF_CALL_2(bpf_cgrp_storage_delete, struct bpf_map *, map, struct cgroup *, cgroup)
{
- int ret;
-
WARN_ON_ONCE(!bpf_rcu_lock_held());
if (!cgroup)
return -EINVAL;
- if (!bpf_cgrp_storage_trylock())
- return -EBUSY;
-
- ret = cgroup_storage_delete(cgroup, map);
- bpf_cgrp_storage_unlock();
- return ret;
+ return cgroup_storage_delete(cgroup, map);
}
const struct bpf_map_ops cgrp_storage_map_ops = {
--
2.47.3
* [RFC PATCH bpf-next v2 09/12] bpf: Remove unused percpu counter from bpf_local_storage_map_free
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
` (7 preceding siblings ...)
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 08/12] bpf: Remove cgroup " Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 10/12] selftests/bpf: Update task_local_storage/recursion test Amery Hung
` (2 subsequent siblings)
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
Percpu locks have been removed from cgroup and task local storage. Now
that no local storage type uses a percpu variable as a lock to prevent
recursion, there is no need to pass one to bpf_local_storage_map_free().
Remove the argument from the function.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
include/linux/bpf_local_storage.h | 3 +--
kernel/bpf/bpf_cgrp_storage.c | 2 +-
kernel/bpf/bpf_inode_storage.c | 2 +-
kernel/bpf/bpf_local_storage.c | 7 +------
kernel/bpf/bpf_task_storage.c | 2 +-
net/core/bpf_sk_storage.c | 2 +-
6 files changed, 6 insertions(+), 12 deletions(-)
diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 2a0aae5168fa..5888e012dfe3 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -170,8 +170,7 @@ bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
void bpf_local_storage_destroy(struct bpf_local_storage *local_storage);
void bpf_local_storage_map_free(struct bpf_map *map,
- struct bpf_local_storage_cache *cache,
- int __percpu *busy_counter);
+ struct bpf_local_storage_cache *cache);
int bpf_local_storage_map_check_btf(const struct bpf_map *map,
const struct btf *btf,
diff --git a/kernel/bpf/bpf_cgrp_storage.c b/kernel/bpf/bpf_cgrp_storage.c
index 4f9cfa032870..a57abb2956d5 100644
--- a/kernel/bpf/bpf_cgrp_storage.c
+++ b/kernel/bpf/bpf_cgrp_storage.c
@@ -119,7 +119,7 @@ static struct bpf_map *cgroup_storage_map_alloc(union bpf_attr *attr)
static void cgroup_storage_map_free(struct bpf_map *map)
{
- bpf_local_storage_map_free(map, &cgroup_cache, NULL);
+ bpf_local_storage_map_free(map, &cgroup_cache);
}
/* *gfp_flags* is a hidden argument provided by the verifier */
diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
index cedc99184dad..470f4b02c79e 100644
--- a/kernel/bpf/bpf_inode_storage.c
+++ b/kernel/bpf/bpf_inode_storage.c
@@ -184,7 +184,7 @@ static struct bpf_map *inode_storage_map_alloc(union bpf_attr *attr)
static void inode_storage_map_free(struct bpf_map *map)
{
- bpf_local_storage_map_free(map, &inode_cache, NULL);
+ bpf_local_storage_map_free(map, &inode_cache);
}
const struct bpf_map_ops inode_storage_map_ops = {
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 572956e2a72d..3ce4dd7e7fc6 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -917,8 +917,7 @@ bpf_local_storage_map_alloc(union bpf_attr *attr,
}
void bpf_local_storage_map_free(struct bpf_map *map,
- struct bpf_local_storage_cache *cache,
- int __percpu *busy_counter)
+ struct bpf_local_storage_cache *cache)
{
struct bpf_local_storage_map_bucket *b;
struct bpf_local_storage_elem *selem;
@@ -951,11 +950,7 @@ void bpf_local_storage_map_free(struct bpf_map *map,
while ((selem = hlist_entry_safe(
rcu_dereference_raw(hlist_first_rcu(&b->list)),
struct bpf_local_storage_elem, map_node))) {
- if (busy_counter)
- this_cpu_inc(*busy_counter);
while (bpf_selem_unlink(selem, true));
- if (busy_counter)
- this_cpu_dec(*busy_counter);
cond_resched_rcu();
}
rcu_read_unlock();
diff --git a/kernel/bpf/bpf_task_storage.c b/kernel/bpf/bpf_task_storage.c
index dd858226ada2..4d53aebe6784 100644
--- a/kernel/bpf/bpf_task_storage.c
+++ b/kernel/bpf/bpf_task_storage.c
@@ -217,7 +217,7 @@ static struct bpf_map *task_storage_map_alloc(union bpf_attr *attr)
static void task_storage_map_free(struct bpf_map *map)
{
- bpf_local_storage_map_free(map, &task_cache, NULL);
+ bpf_local_storage_map_free(map, &task_cache);
}
BTF_ID_LIST_GLOBAL_SINGLE(bpf_local_storage_map_btf_id, struct, bpf_local_storage_map)
diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c
index 7b3d44667cee..7037b841cf11 100644
--- a/net/core/bpf_sk_storage.c
+++ b/net/core/bpf_sk_storage.c
@@ -62,7 +62,7 @@ void bpf_sk_storage_free(struct sock *sk)
static void bpf_sk_storage_map_free(struct bpf_map *map)
{
- bpf_local_storage_map_free(map, &sk_cache, NULL);
+ bpf_local_storage_map_free(map, &sk_cache);
}
static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr)
--
2.47.3
* [RFC PATCH bpf-next v2 10/12] selftests/bpf: Update task_local_storage/recursion test
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
` (8 preceding siblings ...)
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 09/12] bpf: Remove unused percpu counter from bpf_local_storage_map_free Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 11/12] selftests/bpf: Remove test_task_storage_map_stress_lookup Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 12/12] selftests/bpf: Choose another percpu variable in bpf for btf_dump test Amery Hung
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
Update the expected results of the selftest, as the recursion restrictions
on task local storage syscalls and helpers have been relaxed. Now that the
percpu counter is removed, the task local storage helpers
bpf_task_storage_get() and bpf_task_storage_delete() can run on the same
CPU at the same time unless they would actually deadlock.
Note that, since there is no longer a percpu counter preventing recursion
in the task local storage helpers, the bpf trampoline now catches the
recursion of on_update, reported via recursion_misses (see the short
sketch after the trace below for how the test reads this counter).
on_enter: tp_btf/sys_enter
on_update: fentry/bpf_local_storage_update
Old behavior:
____________

on_enter
  bpf_task_storage_get(&map_a)
    bpf_task_storage_trylock succeed
    bpf_local_storage_update(&map_a)
      on_update
        bpf_task_storage_get(&map_a)
          bpf_task_storage_trylock fail
          return NULL
        bpf_task_storage_get(&map_b)
          bpf_task_storage_trylock fail
          return NULL
      create and return map_a::ptr
  map_a::ptr = 200
  bpf_task_storage_get(&map_b)
    bpf_task_storage_trylock succeed
    bpf_local_storage_update(&map_b)
      on_update
        bpf_task_storage_get(&map_a)
          bpf_task_storage_trylock fail
          lockless lookup succeed
          return map_a::ptr
        map_a::ptr += 1 (201)
        bpf_task_storage_delete(&map_a)
          bpf_task_storage_trylock fail
          return -EBUSY
        nr_del_errs++ (1)
        bpf_task_storage_get(&map_b)
          bpf_task_storage_trylock fail
          return NULL
      create and return map_b::ptr
  map_b::ptr = 100

New behavior:
____________

on_enter
  bpf_task_storage_get(&map_a)
    bpf_local_storage_update(&map_a)
      on_update
        bpf_task_storage_get(&map_a)
          on_update::misses++ (1)
          create and return map_a::ptr
        map_a::ptr += 1 (1)
        bpf_task_storage_delete(&map_a)
          return 0
        bpf_task_storage_get(&map_b)
          on_update::misses++ (2)
          create and return map_b::ptr
        map_b::ptr += 1 (1)
      create and return map_a::ptr
  map_a::ptr = 200
  bpf_task_storage_get(&map_b)
    lockless lookup succeed
    return map_b::ptr

Expected result:
                                 Old behavior    New behavior
  map_a::ptr                     201             200
  map_b::ptr                     100             1
  nr_del_errs                    1               0
  on_update::recursion_misses    0               2
  on_enter::recursion_misses     0               0
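For reference, this is roughly how the test reads the recursion_misses
counter checked above (a short sketch following the pattern in the hunk
below; error handling trimmed):

	struct bpf_prog_info info = {};
	__u32 info_len = sizeof(info);
	int prog_fd, err;

	prog_fd = bpf_program__fd(skel->progs.on_update);
	err = bpf_prog_get_info_by_fd(prog_fd, &info, &info_len);
	if (!err)
		/* bumped by the bpf trampoline each time on_update would recurse */
		printf("on_update recursion_misses = %llu\n",
		       (unsigned long long)info.recursion_misses);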
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
.../testing/selftests/bpf/prog_tests/task_local_storage.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
index 42e822ea352f..559727b05e08 100644
--- a/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
+++ b/tools/testing/selftests/bpf/prog_tests/task_local_storage.c
@@ -117,19 +117,19 @@ static void test_recursion(void)
map_fd = bpf_map__fd(skel->maps.map_a);
err = bpf_map_lookup_elem(map_fd, &task_fd, &value);
ASSERT_OK(err, "lookup map_a");
- ASSERT_EQ(value, 201, "map_a value");
- ASSERT_EQ(skel->bss->nr_del_errs, 1, "bpf_task_storage_delete busy");
+ ASSERT_EQ(value, 200, "map_a value");
+ ASSERT_EQ(skel->bss->nr_del_errs, 0, "bpf_task_storage_delete busy");
map_fd = bpf_map__fd(skel->maps.map_b);
err = bpf_map_lookup_elem(map_fd, &task_fd, &value);
ASSERT_OK(err, "lookup map_b");
- ASSERT_EQ(value, 100, "map_b value");
+ ASSERT_EQ(value, 1, "map_b value");
prog_fd = bpf_program__fd(skel->progs.on_update);
memset(&info, 0, sizeof(info));
err = bpf_prog_get_info_by_fd(prog_fd, &info, &info_len);
ASSERT_OK(err, "get prog info");
- ASSERT_EQ(info.recursion_misses, 0, "on_update prog recursion");
+ ASSERT_EQ(info.recursion_misses, 2, "on_update prog recursion");
prog_fd = bpf_program__fd(skel->progs.on_enter);
memset(&info, 0, sizeof(info));
--
2.47.3
* [RFC PATCH bpf-next v2 11/12] selftests/bpf: Remove test_task_storage_map_stress_lookup
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
` (9 preceding siblings ...)
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 10/12] selftests/bpf: Update task_local_storage/recursion test Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 12/12] selftests/bpf: Choose another percpu variable in bpf for btf_dump test Amery Hung
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
Remove the test in test_maps that checks whether updating the percpu
counter in the task local storage map is preemption safe, as the percpu
counter has now been removed.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
.../bpf/map_tests/task_storage_map.c | 128 ------------------
.../bpf/progs/read_bpf_task_storage_busy.c | 38 ------
2 files changed, 166 deletions(-)
delete mode 100644 tools/testing/selftests/bpf/map_tests/task_storage_map.c
delete mode 100644 tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
diff --git a/tools/testing/selftests/bpf/map_tests/task_storage_map.c b/tools/testing/selftests/bpf/map_tests/task_storage_map.c
deleted file mode 100644
index a4121d2248ac..000000000000
--- a/tools/testing/selftests/bpf/map_tests/task_storage_map.c
+++ /dev/null
@@ -1,128 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
-#define _GNU_SOURCE
-#include <sched.h>
-#include <unistd.h>
-#include <stdlib.h>
-#include <stdbool.h>
-#include <errno.h>
-#include <string.h>
-#include <pthread.h>
-
-#include <bpf/bpf.h>
-#include <bpf/libbpf.h>
-
-#include "bpf_util.h"
-#include "test_maps.h"
-#include "task_local_storage_helpers.h"
-#include "read_bpf_task_storage_busy.skel.h"
-
-struct lookup_ctx {
- bool start;
- bool stop;
- int pid_fd;
- int map_fd;
- int loop;
-};
-
-static void *lookup_fn(void *arg)
-{
- struct lookup_ctx *ctx = arg;
- long value;
- int i = 0;
-
- while (!ctx->start)
- usleep(1);
-
- while (!ctx->stop && i++ < ctx->loop)
- bpf_map_lookup_elem(ctx->map_fd, &ctx->pid_fd, &value);
- return NULL;
-}
-
-static void abort_lookup(struct lookup_ctx *ctx, pthread_t *tids, unsigned int nr)
-{
- unsigned int i;
-
- ctx->stop = true;
- ctx->start = true;
- for (i = 0; i < nr; i++)
- pthread_join(tids[i], NULL);
-}
-
-void test_task_storage_map_stress_lookup(void)
-{
-#define MAX_NR_THREAD 4096
- unsigned int i, nr = 256, loop = 8192, cpu = 0;
- struct read_bpf_task_storage_busy *skel;
- pthread_t tids[MAX_NR_THREAD];
- struct lookup_ctx ctx;
- cpu_set_t old, new;
- const char *cfg;
- int err;
-
- cfg = getenv("TASK_STORAGE_MAP_NR_THREAD");
- if (cfg) {
- nr = atoi(cfg);
- if (nr > MAX_NR_THREAD)
- nr = MAX_NR_THREAD;
- }
- cfg = getenv("TASK_STORAGE_MAP_NR_LOOP");
- if (cfg)
- loop = atoi(cfg);
- cfg = getenv("TASK_STORAGE_MAP_PIN_CPU");
- if (cfg)
- cpu = atoi(cfg);
-
- skel = read_bpf_task_storage_busy__open_and_load();
- err = libbpf_get_error(skel);
- CHECK(err, "open_and_load", "error %d\n", err);
-
- /* Only for a fully preemptible kernel */
- if (!skel->kconfig->CONFIG_PREEMPTION) {
- printf("%s SKIP (no CONFIG_PREEMPTION)\n", __func__);
- read_bpf_task_storage_busy__destroy(skel);
- skips++;
- return;
- }
-
- /* Save the old affinity setting */
- sched_getaffinity(getpid(), sizeof(old), &old);
-
- /* Pinned on a specific CPU */
- CPU_ZERO(&new);
- CPU_SET(cpu, &new);
- sched_setaffinity(getpid(), sizeof(new), &new);
-
- ctx.start = false;
- ctx.stop = false;
- ctx.pid_fd = sys_pidfd_open(getpid(), 0);
- ctx.map_fd = bpf_map__fd(skel->maps.task);
- ctx.loop = loop;
- for (i = 0; i < nr; i++) {
- err = pthread_create(&tids[i], NULL, lookup_fn, &ctx);
- if (err) {
- abort_lookup(&ctx, tids, i);
- CHECK(err, "pthread_create", "error %d\n", err);
- goto out;
- }
- }
-
- ctx.start = true;
- for (i = 0; i < nr; i++)
- pthread_join(tids[i], NULL);
-
- skel->bss->pid = getpid();
- err = read_bpf_task_storage_busy__attach(skel);
- CHECK(err, "attach", "error %d\n", err);
-
- /* Trigger program */
- sys_gettid();
- skel->bss->pid = 0;
-
- CHECK(skel->bss->busy != 0, "bad bpf_task_storage_busy", "got %d\n", skel->bss->busy);
-out:
- read_bpf_task_storage_busy__destroy(skel);
- /* Restore affinity setting */
- sched_setaffinity(getpid(), sizeof(old), &old);
- printf("%s:PASS\n", __func__);
-}
diff --git a/tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c b/tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
deleted file mode 100644
index 69da05bb6c63..000000000000
--- a/tools/testing/selftests/bpf/progs/read_bpf_task_storage_busy.c
+++ /dev/null
@@ -1,38 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* Copyright (C) 2022. Huawei Technologies Co., Ltd */
-#include "vmlinux.h"
-#include <bpf/bpf_helpers.h>
-#include <bpf/bpf_tracing.h>
-
-extern bool CONFIG_PREEMPTION __kconfig __weak;
-extern const int bpf_task_storage_busy __ksym;
-
-char _license[] SEC("license") = "GPL";
-
-int pid = 0;
-int busy = 0;
-
-struct {
- __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
- __uint(map_flags, BPF_F_NO_PREALLOC);
- __type(key, int);
- __type(value, long);
-} task SEC(".maps");
-
-SEC("raw_tp/sys_enter")
-int BPF_PROG(read_bpf_task_storage_busy)
-{
- int *value;
-
- if (!CONFIG_PREEMPTION)
- return 0;
-
- if (bpf_get_current_pid_tgid() >> 32 != pid)
- return 0;
-
- value = bpf_this_cpu_ptr(&bpf_task_storage_busy);
- if (value)
- busy = *value;
-
- return 0;
-}
--
2.47.3
* [RFC PATCH bpf-next v2 12/12] selftests/bpf: Choose another percpu variable in bpf for btf_dump test
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
` (10 preceding siblings ...)
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 11/12] selftests/bpf: Remove test_task_storage_map_stress_lookup Amery Hung
@ 2025-10-02 22:53 ` Amery Hung
11 siblings, 0 replies; 17+ messages in thread
From: Amery Hung @ 2025-10-02 22:53 UTC (permalink / raw)
To: bpf
Cc: netdev, alexei.starovoitov, andrii, daniel, memxor, martin.lau,
kpsingh, yonghong.song, song, haoluo, ameryhung, kernel-team
bpf_cgrp_storage_busy has been removed, so use bpf_bprintf_nest_level
instead. This percpu variable also lives in the bpf subsystem, so if it
is ever removed in the future, BPF-CI will again catch this type of
CI-breaking change.
Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
tools/testing/selftests/bpf/prog_tests/btf_dump.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/btf_dump.c b/tools/testing/selftests/bpf/prog_tests/btf_dump.c
index 10cba526d3e6..f1642794f70e 100644
--- a/tools/testing/selftests/bpf/prog_tests/btf_dump.c
+++ b/tools/testing/selftests/bpf/prog_tests/btf_dump.c
@@ -875,8 +875,8 @@ static void test_btf_dump_var_data(struct btf *btf, struct btf_dump *d,
TEST_BTF_DUMP_VAR(btf, d, NULL, str, "cpu_number", int, BTF_F_COMPACT,
"int cpu_number = (int)100", 100);
#endif
- TEST_BTF_DUMP_VAR(btf, d, NULL, str, "bpf_cgrp_storage_busy", int, BTF_F_COMPACT,
- "static int bpf_cgrp_storage_busy = (int)2", 2);
+ TEST_BTF_DUMP_VAR(btf, d, NULL, str, "bpf_bprintf_nest_level", int, BTF_F_COMPACT,
+ "static int bpf_bprintf_nest_level = (int)2", 2);
}
struct btf_dump_string_ctx {
--
2.47.3
* Re: [RFC PATCH bpf-next v2 06/12] bpf: Change local_storage->lock and b->lock to rqspinlock
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 06/12] bpf: Change local_storage->lock and b->lock to rqspinlock Amery Hung
@ 2025-10-02 23:37 ` Alexei Starovoitov
2025-10-03 22:03 ` Amery Hung
0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2025-10-02 23:37 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, Network Development, Andrii Nakryiko, Daniel Borkmann,
Kumar Kartikeya Dwivedi, Martin KaFai Lau, KP Singh,
Yonghong Song, Song Liu, Hao Luo, Kernel Team
On Thu, Oct 2, 2025 at 3:54 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> bpf_selem_free_list(&old_selem_free_list, false);
> if (alloc_selem) {
> mem_uncharge(smap, owner, smap->elem_size);
> @@ -791,7 +812,7 @@ void bpf_local_storage_destroy(struct bpf_local_storage *local_storage)
> * when unlinking elem from the local_storage->list and
> * the map's bucket->list.
> */
> - raw_spin_lock_irqsave(&local_storage->lock, flags);
> + while (raw_res_spin_lock_irqsave(&local_storage->lock, flags));
This pattern and the other while(foo) loops don't make sense to me.
res_spin_lock will fail only on deadlock or timeout.
We should not spin, since retry will likely produce the same
result. So the above pattern just enters into infinite spin.
If it should never fail in practice then pr_warn_once and goto out
leaking memory. Better yet defer to irq_work and cleanup there.
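For concreteness, a minimal sketch of the warn-and-bail option (assuming
the raw_res_spin_lock_irqsave()/raw_res_spin_unlock_irqrestore() API used
elsewhere in this series; this is not what the series currently does):

void bpf_local_storage_destroy(struct bpf_local_storage *local_storage)
{
	unsigned long flags;

	if (raw_res_spin_lock_irqsave(&local_storage->lock, flags)) {
		/* Unexpected; leak the storage instead of retrying forever. */
		pr_warn_once("bpf_local_storage: failed to lock local_storage on destroy\n");
		return;
	}

	/* ... unlink and free selems as before ... */

	raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
}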
* Re: [RFC PATCH bpf-next v2 06/12] bpf: Change local_storage->lock and b->lock to rqspinlock
2025-10-02 23:37 ` Alexei Starovoitov
@ 2025-10-03 22:03 ` Amery Hung
2025-10-06 17:58 ` Alexei Starovoitov
0 siblings, 1 reply; 17+ messages in thread
From: Amery Hung @ 2025-10-03 22:03 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Network Development, Andrii Nakryiko, Daniel Borkmann,
Kumar Kartikeya Dwivedi, Martin KaFai Lau, KP Singh,
Yonghong Song, Song Liu, Hao Luo, Kernel Team
On Thu, Oct 2, 2025 at 4:37 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Oct 2, 2025 at 3:54 PM Amery Hung <ameryhung@gmail.com> wrote:
> >
> > bpf_selem_free_list(&old_selem_free_list, false);
> > if (alloc_selem) {
> > mem_uncharge(smap, owner, smap->elem_size);
> > @@ -791,7 +812,7 @@ void bpf_local_storage_destroy(struct bpf_local_storage *local_storage)
> > * when unlinking elem from the local_storage->list and
> > * the map's bucket->list.
> > */
> > - raw_spin_lock_irqsave(&local_storage->lock, flags);
> > + while (raw_res_spin_lock_irqsave(&local_storage->lock, flags));
>
> This pattern and the other while(foo) loops don't make sense to me.
> res_spin_lock will fail only on deadlock or timeout.
> We should not spin, since retry will likely produce the same
> result. So the above pattern just enters into infinite spin.
I only spin in destroy() and map_free(), which cannot deadlock with
themselves or each other. However, IIUC, a head waiter that detects
deadlock will cause other queued waiters to also return -EDEADLK. I
think they should be able to make progress with a retry. Or better,
rqspinlock could avoid forcing queued waiters to exit the queue when the
failure is a detected deadlock rather than a timeout.
>
> If it should never fail in practice then pr_warn_once and goto out
> leaking memory. Better yet defer to irq_work and cleanup there.
Hmm, both functions are already called in some deferred callbacks.
Even if we defer the cleanup again, they still need to grab locks and
still might fail, no?
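As a sketch only, one way to keep the retry while still letting a genuine
deadlock or timeout surface would be to bound it (the bound and the
warning below are illustrative, not part of this series):

	unsigned long flags;
	int i, err = 0;

	for (i = 0; i < 16; i++) {	/* arbitrary illustrative bound */
		err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
		if (!err)
			break;
	}
	if (err) {
		/* give up: leak rather than spin forever */
		pr_warn_once("bpf_local_storage: destroy failed to take the lock (%d)\n", err);
		return;
	}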
* Re: [RFC PATCH bpf-next v2 06/12] bpf: Change local_storage->lock and b->lock to rqspinlock
2025-10-03 22:03 ` Amery Hung
@ 2025-10-06 17:58 ` Alexei Starovoitov
2025-10-06 20:19 ` Martin KaFai Lau
0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2025-10-06 17:58 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, Network Development, Andrii Nakryiko, Daniel Borkmann,
Kumar Kartikeya Dwivedi, Martin KaFai Lau, KP Singh,
Yonghong Song, Song Liu, Hao Luo, Kernel Team
On Fri, Oct 3, 2025 at 3:03 PM Amery Hung <ameryhung@gmail.com> wrote:
>
> On Thu, Oct 2, 2025 at 4:37 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Oct 2, 2025 at 3:54 PM Amery Hung <ameryhung@gmail.com> wrote:
> > >
> > > bpf_selem_free_list(&old_selem_free_list, false);
> > > if (alloc_selem) {
> > > mem_uncharge(smap, owner, smap->elem_size);
> > > @@ -791,7 +812,7 @@ void bpf_local_storage_destroy(struct bpf_local_storage *local_storage)
> > > * when unlinking elem from the local_storage->list and
> > > * the map's bucket->list.
> > > */
> > > - raw_spin_lock_irqsave(&local_storage->lock, flags);
> > > + while (raw_res_spin_lock_irqsave(&local_storage->lock, flags));
> >
> > This pattern and the other while(foo) loops don't make sense to me.
> > res_spin_lock will fail only on deadlock or timeout.
> > We should not spin, since retry will likely produce the same
> > result. So the above pattern just enters into infinite spin.
>
> I only spin in destroy() and map_free(), which cannot deadlock with
> themselves or each other. However, IIUC, a head waiter that detects
> deadlock will cause other queued waiters to also return -EDEADLK. I
> think they should be able to make progress with a retry.
If it's in map_free() path then why are we taking the lock at all?
There are supposed to be no active users of it.
If there are users and we actually need that lock then the deadlock
is possible and retrying will deadlock the same way.
I feel AI explained it better:
"
raw_res_spin_lock_irqsave() can return -ETIMEDOUT (after 250ms) or
-EDEADLK. Both are non-zero, so the while() loop continues. The commit
message says "it cannot deadlock with itself or
bpf_local_storage_map_free", but:
1. If -ETIMEDOUT is returned because the lock holder is taking too long,
retrying immediately won't help. The timeout means progress isn't
being made, and spinning in a retry loop without any backoff or
limit will prevent other work from proceeding.
2. If -EDEADLK is returned, it means the deadlock detector found a
cycle. Retrying immediately without any state change won't break the
deadlock cycle.
"
> Or better, rqspinlock could avoid forcing queued waiters to exit the
> queue when the failure is a detected deadlock rather than a timeout.
If a deadlock is detected, it's the same issue for all waiters.
I don't see the difference between timeout and deadlock.
Both are in the "do-not-retry" category.
Both mean that there is a bug somewhere.
> >
> > If it should never fail in practice then pr_warn_once and goto out
> > leaking memory. Better yet defer to irq_work and cleanup there.
>
> Hmm, both functions are already called in some deferred callbacks.
> Even if we defer the cleanup again, they still need to grab locks and
> still might fail, no?
If it's a map destroy path and we waited for RCU GP, there shouldn't be
a need to take a lock.
The css_free_rwork_fn() -> bpf_cgrp_storage_free() path
is currently implemented as if it were similar to
bpf_cgrp_storage_delete(), which needs a lock.
But bpf_cgrp_storage_free() doesn't have to take it.
In css_free_rwork_fn() no prog or user space
should see the 'cgrp' pointer, since we're about to kfree(cgrp).
I could certainly be missing something.
* Re: [RFC PATCH bpf-next v2 06/12] bpf: Change local_storage->lock and b->lock to rqspinlock
2025-10-06 17:58 ` Alexei Starovoitov
@ 2025-10-06 20:19 ` Martin KaFai Lau
0 siblings, 0 replies; 17+ messages in thread
From: Martin KaFai Lau @ 2025-10-06 20:19 UTC (permalink / raw)
To: Alexei Starovoitov, Amery Hung
Cc: bpf, Network Development, Andrii Nakryiko, Daniel Borkmann,
Kumar Kartikeya Dwivedi, Martin KaFai Lau, KP Singh,
Yonghong Song, Song Liu, Hao Luo, Kernel Team
On 10/6/25 10:58 AM, Alexei Starovoitov wrote:
> On Fri, Oct 3, 2025 at 3:03 PM Amery Hung <ameryhung@gmail.com> wrote:
>>
>> On Thu, Oct 2, 2025 at 4:37 PM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>>
>>> On Thu, Oct 2, 2025 at 3:54 PM Amery Hung <ameryhung@gmail.com> wrote:
>>>>
>>>> bpf_selem_free_list(&old_selem_free_list, false);
>>>> if (alloc_selem) {
>>>> mem_uncharge(smap, owner, smap->elem_size);
>>>> @@ -791,7 +812,7 @@ void bpf_local_storage_destroy(struct bpf_local_storage *local_storage)
>>>> * when unlinking elem from the local_storage->list and
>>>> * the map's bucket->list.
>>>> */
>>>> - raw_spin_lock_irqsave(&local_storage->lock, flags);
>>>> + while (raw_res_spin_lock_irqsave(&local_storage->lock, flags));
>>>
>>> This pattern and the other while(foo) loops don't make sense to me.
>>> res_spin_lock will fail only on deadlock or timeout.
>>> We should not spin, since retry will likely produce the same
>>> result. So the above pattern just enters into infinite spin.
>>
>> I only spin in destroy() and map_free(), which cannot deadlock with
>> themselves or each other. However, IIUC, a head waiter that detects
>> deadlock will cause other queued waiters to also return -EDEADLK. I
>> think they should be able to make progress with a retry.
>
> If it's in map_free() path then why are we taking the lock at all?
> There are supposed to be no active users of it.
There is no user from the syscall or from the bpf prog.
There are still kernel users, bpf_cgrp_storage_free() and bpf_sk_storage_free(),
that can race with map_free().
> If there are users and we actually need that lock then the deadlock
> is possible and retrying will deadlock the same way.
> I feel AI explained it better:
> "
> raw_res_spin_lock_irqsave() can return -ETIMEDOUT (after 250ms) or
> -EDEADLK. Both are non-zero, so the while() loop continues. The commit
> message says "it cannot deadlock with itself or
> bpf_local_storage_map_free", but:
>
> 1. If -ETIMEDOUT is returned because the lock holder is taking too long,
> retrying immediately won't help. The timeout means progress isn't
> being made, and spinning in a retry loop without any backoff or
> limit will prevent other work from proceeding.
>
> 2. If -EDEADLK is returned, it means the deadlock detector found a
> cycle. Retrying immediately without any state change won't break the
> deadlock cycle.
> "
>
>> Or better, rqspinlock could avoid forcing queued waiters to exit the
>> queue when the failure is a detected deadlock rather than a timeout.
>
> If a deadlock is detected, it's the same issue for all waiters.
> I don't see the difference between timeout and deadlock.
> Both are in the "do-not-retry" category.
> Both mean that there is a bug somewhere.
bpf_cgrp_storage_free() and map_free() are the only remaining kernel users
of the locks, so no deadlock is expected unless there is a bug. The busy percpu
counter is currently not used in either of them anyway. Theoretically,
res_spin_lock (like the current regular spin_lock) should never fail here in
bpf_cgrp_storage_free() and map_free(). If res_spin_lock returns an error, there
is a bug somewhere.
>
>>>
>>> If it should never fail in practice then pr_warn_once and goto out
>>> leaking memory. Better yet defer to irq_work and cleanup there.
Not sure how to handle the bug. Yeah, maybe just leak it and then splat.
I think deferring it would still need to take the lock.
>> Hmm, both functions are already called in some deferred callbacks.
>> Even if we defer the cleanup again, they still need to grab locks and
>> still might fail, no?
>
> If it's a map destroy path and we waited for RCU GP, there shouldn't be
> a need to take a lock.
> The css_free_rwork_fn() -> bpf_cgrp_storage_free() path
> is currently implemented as if it were similar to
> bpf_cgrp_storage_delete(), which needs a lock.
> But bpf_cgrp_storage_free() doesn't have to take it.
> In css_free_rwork_fn() no prog or user space
> should see the 'cgrp' pointer, since we're about to kfree(cgrp).
> I could certainly be missing something.
Thread overview: 17+ messages
2025-10-02 22:53 [RFC PATCH bpf-next v2 00/12] Remove task and cgroup local storage percpu counters Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 01/12] bpf: Select bpf_local_storage_map_bucket based on bpf_local_storage Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 02/12] bpf: Convert bpf_selem_unlink_map to failable Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 03/12] bpf: Convert bpf_selem_link_map " Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 04/12] bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 05/12] bpf: Convert bpf_selem_unlink to failable Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 06/12] bpf: Change local_storage->lock and b->lock to rqspinlock Amery Hung
2025-10-02 23:37 ` Alexei Starovoitov
2025-10-03 22:03 ` Amery Hung
2025-10-06 17:58 ` Alexei Starovoitov
2025-10-06 20:19 ` Martin KaFai Lau
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 07/12] bpf: Remove task local storage percpu counter Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 08/12] bpf: Remove cgroup " Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 09/12] bpf: Remove unused percpu counter from bpf_local_storage_map_free Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 10/12] selftests/bpf: Update task_local_storage/recursion test Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 11/12] selftests/bpf: Remove test_task_storage_map_stress_lookup Amery Hung
2025-10-02 22:53 ` [RFC PATCH bpf-next v2 12/12] selftests/bpf: Choose another percpu variable in bpf for btf_dump test Amery Hung