From: Martin KaFai Lau <martin.lau@linux.dev>
To: Amery Hung <ameryhung@gmail.com>
Cc: netdev@vger.kernel.org, alexei.starovoitov@gmail.com,
andrii@kernel.org, daniel@iogearbox.net, memxor@gmail.com,
martin.lau@kernel.org, kpsingh@kernel.org,
yonghong.song@linux.dev, song@kernel.org, haoluo@google.com,
kernel-team@meta.com, bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next v3 10/16] bpf: Support lockless unlink when freeing map or local storage
Date: Fri, 9 Jan 2026 12:16:02 -0800 [thread overview]
Message-ID: <337d8ebe-d3d4-4818-92d8-4937da835843@linux.dev> (raw)
In-Reply-To: <20251218175628.1460321-11-ameryhung@gmail.com>
On 12/18/25 9:56 AM, Amery Hung wrote:
> Introduce bpf_selem_unlink_lockless() to properly handle errors returned
> from rqspinlock in bpf_local_storage_map_free() and
> bpf_local_storage_destroy() where the operation must succeeds.
>
> The idea of bpf_selem_unlink_lockless() is to allow an selem to be
> partially linked and use refcount to determine when and who can free the
> selem. An selem initially is fully linked to a map and a local storage
> and therefore selem->link_cnt is set to 2. Under normal circumstances,
> bpf_selem_unlink_lockless() will be able to grab locks and unlink
> an selem from map and local storage in sequeunce, just like
> bpf_selem_unlink(), and then add it to a local tofree list provide by
> the caller. However, if any of the lock attempts fails, it will
> only clear SDATA(selem)->smap or selem->local_storage depending on the
> caller and decrement link_cnt to signal that the corresponding data
> structure holding a reference to the selem is gone. Then, only when both
> map and local storage are gone, an selem can be free by the last caller
> that turns link_cnt to 0.
>
> To make sure bpf_obj_free_fields() is done only once and when map is
> still present, it is called when unlinking an selem from b->list under
> b->lock.
>
> To make sure uncharging memory is only done once and when owner is still
> present, only unlink selem from local_storage->list in
> bpf_local_storage_destroy() and return the amount of memory to uncharge
> to the caller (i.e., owner) since the map associated with an selem may
> already be gone and map->ops->map_local_storage_uncharge can no longer
> be referenced.
>
> Finally, access of selem, SDATA(selem)->smap and selem->local_storage
> are racy. Callers will protect these fields with RCU.
>
> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
> Signed-off-by: Amery Hung <ameryhung@gmail.com>
> ---
> include/linux/bpf_local_storage.h | 2 +-
> kernel/bpf/bpf_local_storage.c | 77 +++++++++++++++++++++++++++++--
> 2 files changed, 74 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 20918c31b7e5..1fd908c44fb6 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -80,9 +80,9 @@ struct bpf_local_storage_elem {
> * after raw_spin_unlock
> */
> };
> + atomic_t link_cnt;
> u16 size;
> bool use_kmalloc_nolock;
> - /* 4 bytes hole */
> /* The data is stored in another cacheline to minimize
> * the number of cachelines access during a cache hit.
> */
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 62201552dca6..4c682d5aef7f 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -97,6 +97,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> if (swap_uptrs)
> bpf_obj_swap_uptrs(smap->map.record, SDATA(selem)->data, value);
> }
> + atomic_set(&selem->link_cnt, 2);
> selem->size = smap->elem_size;
> selem->use_kmalloc_nolock = smap->use_kmalloc_nolock;
> return selem;
> @@ -200,9 +201,11 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
> /* The bpf_local_storage_map_free will wait for rcu_barrier */
> smap = rcu_dereference_check(SDATA(selem)->smap, 1);
>
> - migrate_disable();
> - bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> - migrate_enable();
> + if (smap) {
> + migrate_disable();
> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> + migrate_enable();
> + }
> kfree_nolock(selem);
> }
>
> @@ -227,7 +230,8 @@ void bpf_selem_free(struct bpf_local_storage_elem *selem,
> * is only supported in task local storage, where
> * smap->use_kmalloc_nolock == true.
> */
> - bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> + if (smap)
> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> __bpf_selem_free(selem, reuse_now);
> return;
> }
> @@ -419,6 +423,71 @@ int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
> return err;
> }
>
> +/* Callers of bpf_selem_unlink_lockless() */
> +#define BPF_LOCAL_STORAGE_MAP_FREE 0
> +#define BPF_LOCAL_STORAGE_DESTROY 1
> +
> +/*
> + * Unlink an selem from map and local storage with lockless fallback if callers
> + * are racing or rqspinlock returns error. It should only be called by
> + * bpf_local_storage_destroy() or bpf_local_storage_map_free().
> + */
> +static void bpf_selem_unlink_lockless(struct bpf_local_storage_elem *selem,
> + struct hlist_head *to_free, int caller)
> +{
> + struct bpf_local_storage *local_storage;
> + struct bpf_local_storage_map_bucket *b;
> + struct bpf_local_storage_map *smap;
> + unsigned long flags;
> + int err, unlink = 0;
> +
> + local_storage = rcu_dereference_check(selem->local_storage, bpf_rcu_lock_held());
> + smap = rcu_dereference_check(SDATA(selem)->smap, bpf_rcu_lock_held());
> +
> + /*
> + * Free special fields immediately as SDATA(selem)->smap will be cleared.
> + * No BPF program should be reading the selem.
> + */
> + if (smap) {
> + b = select_bucket(smap, selem);
> + err = raw_res_spin_lock_irqsave(&b->lock, flags);
> + if (!err) {
> + if (likely(selem_linked_to_map(selem))) {
> + hlist_del_init_rcu(&selem->map_node);
> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
> + RCU_INIT_POINTER(SDATA(selem)->smap, NULL);
> + unlink++;
> + }
> + raw_res_spin_unlock_irqrestore(&b->lock, flags);
> + } else if (caller == BPF_LOCAL_STORAGE_MAP_FREE) {
> + RCU_INIT_POINTER(SDATA(selem)->smap, NULL);
I suspect I am missing something obvious, so it will be faster to ask here.
I could see why init NULL can work if it could assume the map_free
caller always succeeds in the b->lock, the "if (!err)" path above.
If the b->lock did fail here for the map_free caller, it reset
SDATA(selem)->smap to NULL here. Who can do the bpf_obj_free_fields() in
the future?
> + }
> + }
> +
> + /*
> + * Only let destroy() unlink from local_storage->list and do mem_uncharge
> + * as owner is guaranteed to be valid in destroy().
> + */
> + if (local_storage && caller == BPF_LOCAL_STORAGE_DESTROY) {
If I read here correctly, only bpf_local_storage_destroy() can do the
bpf_selem_free(). For example, if a bpf_sk_storage_map is going away,
the selem (which is memcg charged) will stay in the sk until the sk is
closed?
> + err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
> + if (!err) {
> + hlist_del_init_rcu(&selem->snode);
> + unlink++;
> + raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
> + }
> + RCU_INIT_POINTER(selem->local_storage, NULL);
> + }
> +
> + /*
> + * Normally, an selem can be unlink under local_storage->lock and b->lock, and
> + * then added to a local to_free list. However, if destroy() and map_free() are
> + * racing or rqspinlock returns errors in unlikely situations (unlink != 2), free
> + * the selem only after both map_free() and destroy() drop the refcnt.
> + */
> + if (unlink == 2 || atomic_dec_and_test(&selem->link_cnt))
> + hlist_add_head(&selem->free_node, to_free);
> +}
> +
> void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
> struct bpf_local_storage_map *smap,
> struct bpf_local_storage_elem *selem)
next prev parent reply other threads:[~2026-01-09 20:16 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-18 17:56 [PATCH bpf-next v3 00/16] Remove task and cgroup local storage percpu counters Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 01/16] bpf: Convert bpf_selem_unlink_map to failable Amery Hung
2025-12-18 18:27 ` bot+bpf-ci
2026-01-08 20:40 ` Martin KaFai Lau
2026-01-08 20:29 ` Martin KaFai Lau
2026-01-09 18:39 ` Amery Hung
2026-01-09 21:53 ` Martin KaFai Lau
2026-01-12 17:47 ` Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 02/16] bpf: Convert bpf_selem_link_map " Amery Hung
2025-12-18 18:19 ` bot+bpf-ci
2025-12-18 17:56 ` [PATCH bpf-next v3 03/16] bpf: Open code bpf_selem_unlink_storage in bpf_selem_unlink Amery Hung
2026-01-09 17:42 ` Martin KaFai Lau
2026-01-09 18:49 ` Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 04/16] bpf: Convert bpf_selem_unlink to failable Amery Hung
2025-12-18 18:27 ` bot+bpf-ci
2026-01-09 18:16 ` Martin KaFai Lau
2026-01-09 18:49 ` Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 05/16] bpf: Change local_storage->lock and b->lock to rqspinlock Amery Hung
2025-12-18 18:27 ` bot+bpf-ci
2025-12-18 17:56 ` [PATCH bpf-next v3 06/16] bpf: Remove task local storage percpu counter Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 07/16] bpf: Remove cgroup " Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 08/16] bpf: Remove unused percpu counter from bpf_local_storage_map_free Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 09/16] bpf: Save memory allocation method and size in bpf_local_storage_elem Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 10/16] bpf: Support lockless unlink when freeing map or local storage Amery Hung
2026-01-09 20:16 ` Martin KaFai Lau [this message]
2026-01-09 20:47 ` Amery Hung
2026-01-09 21:38 ` Martin KaFai Lau
2026-01-12 22:38 ` Amery Hung
2026-01-13 0:15 ` Martin KaFai Lau
2026-01-12 15:36 ` Kumar Kartikeya Dwivedi
2026-01-12 15:49 ` Kumar Kartikeya Dwivedi
2026-01-12 21:17 ` Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 11/16] bpf: Switch to bpf_selem_unlink_lockless in bpf_local_storage_{map_free, destroy} Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 12/16] selftests/bpf: Update sk_storage_omem_uncharge test Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 13/16] selftests/bpf: Update task_local_storage/recursion test Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 14/16] selftests/bpf: Update task_local_storage/task_storage_nodeadlock test Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 15/16] selftests/bpf: Remove test_task_storage_map_stress_lookup Amery Hung
2025-12-18 17:56 ` [PATCH bpf-next v3 16/16] selftests/bpf: Choose another percpu variable in bpf for btf_dump test Amery Hung
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=337d8ebe-d3d4-4818-92d8-4937da835843@linux.dev \
--to=martin.lau@linux.dev \
--cc=alexei.starovoitov@gmail.com \
--cc=ameryhung@gmail.com \
--cc=andrii@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=haoluo@google.com \
--cc=kernel-team@meta.com \
--cc=kpsingh@kernel.org \
--cc=martin.lau@kernel.org \
--cc=memxor@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=song@kernel.org \
--cc=yonghong.song@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox