From: Martin KaFai Lau <martin.lau@linux.dev>
To: Amery Hung <ameryhung@gmail.com>
Cc: netdev@vger.kernel.org, alexei.starovoitov@gmail.com,
andrii@kernel.org, daniel@iogearbox.net, memxor@gmail.com,
martin.lau@kernel.org, kpsingh@kernel.org,
yonghong.song@linux.dev, song@kernel.org, haoluo@google.com,
kernel-team@meta.com, bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next v3 10/16] bpf: Support lockless unlink when freeing map or local storage
Date: Fri, 9 Jan 2026 13:38:15 -0800
Message-ID: <36b3dc2d-b850-491f-bfc5-3581d5de7b82@linux.dev>
In-Reply-To: <CAMB2axNcc5yJwhXjcEcQJuLxrjB7MgVK6XXpKqO9EiFPtQH6bQ@mail.gmail.com>
On 1/9/26 12:47 PM, Amery Hung wrote:
> On Fri, Jan 9, 2026 at 12:16 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>> On 12/18/25 9:56 AM, Amery Hung wrote:
>>> Introduce bpf_selem_unlink_lockless() to properly handle errors returned
>>> from rqspinlock in bpf_local_storage_map_free() and
>>> bpf_local_storage_destroy(), where the operation must succeed.
>>>
>>> The idea of bpf_selem_unlink_lockless() is to allow an selem to be
>>> partially linked and use refcount to determine when and who can free the
>>> selem. An selem initially is fully linked to a map and a local storage
>>> and therefore selem->link_cnt is set to 2. Under normal circumstances,
>>> bpf_selem_unlink_lockless() will be able to grab locks and unlink
>>> an selem from map and local storage in sequence, just like
>>> bpf_selem_unlink(), and then add it to a local tofree list provided by
>>> the caller. However, if any of the lock attempts fails, it will
>>> only clear SDATA(selem)->smap or selem->local_storage depending on the
>>> caller and decrement link_cnt to signal that the corresponding data
>>> structure holding a reference to the selem is gone. Then, only when both
>>> map and local storage are gone, an selem can be freed by the last caller
>>> that brings link_cnt to 0.
>>>
>>> To make sure bpf_obj_free_fields() is done only once and when map is
>>> still present, it is called when unlinking an selem from b->list under
>>> b->lock.
>>>
>>> To make sure memory is uncharged only once and while the owner is still
>>> present, only unlink the selem from local_storage->list in
>>> bpf_local_storage_destroy() and return the amount of memory to uncharge
>>> to the caller (i.e., the owner), since the map associated with an selem
>>> may already be gone and map->ops->map_local_storage_uncharge can no
>>> longer be referenced.
>>>
>>> Finally, accesses of selem, SDATA(selem)->smap and selem->local_storage
>>> are racy. Callers will protect these fields with RCU.
>>>
>>> Co-developed-by: Martin KaFai Lau <martin.lau@kernel.org>
>>> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
>>> Signed-off-by: Amery Hung <ameryhung@gmail.com>
>>> ---
>>> include/linux/bpf_local_storage.h | 2 +-
>>> kernel/bpf/bpf_local_storage.c | 77 +++++++++++++++++++++++++++++--
>>> 2 files changed, 74 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
>>> index 20918c31b7e5..1fd908c44fb6 100644
>>> --- a/include/linux/bpf_local_storage.h
>>> +++ b/include/linux/bpf_local_storage.h
>>> @@ -80,9 +80,9 @@ struct bpf_local_storage_elem {
>>> * after raw_spin_unlock
>>> */
>>> };
>>> + atomic_t link_cnt;
>>> u16 size;
>>> bool use_kmalloc_nolock;
>>> - /* 4 bytes hole */
>>> /* The data is stored in another cacheline to minimize
>>> * the number of cachelines access during a cache hit.
>>> */
>>> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
>>> index 62201552dca6..4c682d5aef7f 100644
>>> --- a/kernel/bpf/bpf_local_storage.c
>>> +++ b/kernel/bpf/bpf_local_storage.c
>>> @@ -97,6 +97,7 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
>>> if (swap_uptrs)
>>> bpf_obj_swap_uptrs(smap->map.record, SDATA(selem)->data, value);
>>> }
>>> + atomic_set(&selem->link_cnt, 2);
>>> selem->size = smap->elem_size;
>>> selem->use_kmalloc_nolock = smap->use_kmalloc_nolock;
>>> return selem;
>>> @@ -200,9 +201,11 @@ static void bpf_selem_free_rcu(struct rcu_head *rcu)
>>> /* The bpf_local_storage_map_free will wait for rcu_barrier */
>>> smap = rcu_dereference_check(SDATA(selem)->smap, 1);
>>>
>>> - migrate_disable();
>>> - bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
>>> - migrate_enable();
>>> + if (smap) {
>>> + migrate_disable();
>>> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
>>> + migrate_enable();
>>> + }
>>> kfree_nolock(selem);
>>> }
>>>
>>> @@ -227,7 +230,8 @@ void bpf_selem_free(struct bpf_local_storage_elem *selem,
>>> * is only supported in task local storage, where
>>> * smap->use_kmalloc_nolock == true.
>>> */
>>> - bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
>>> + if (smap)
>>> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
>>> __bpf_selem_free(selem, reuse_now);
>>> return;
>>> }
>>> @@ -419,6 +423,71 @@ int bpf_selem_unlink(struct bpf_local_storage_elem *selem, bool reuse_now)
>>> return err;
>>> }
>>>
>>> +/* Callers of bpf_selem_unlink_lockless() */
>>> +#define BPF_LOCAL_STORAGE_MAP_FREE 0
>>> +#define BPF_LOCAL_STORAGE_DESTROY 1
>>> +
>>> +/*
>>> + * Unlink an selem from map and local storage with lockless fallback if callers
>>> + * are racing or rqspinlock returns an error. It should only be called by
>>> + * bpf_local_storage_destroy() or bpf_local_storage_map_free().
>>> + */
>>> +static void bpf_selem_unlink_lockless(struct bpf_local_storage_elem *selem,
>>> + struct hlist_head *to_free, int caller)
>>> +{
>>> + struct bpf_local_storage *local_storage;
>>> + struct bpf_local_storage_map_bucket *b;
>>> + struct bpf_local_storage_map *smap;
>>> + unsigned long flags;
>>> + int err, unlink = 0;
>>> +
>>> + local_storage = rcu_dereference_check(selem->local_storage, bpf_rcu_lock_held());
>>> + smap = rcu_dereference_check(SDATA(selem)->smap, bpf_rcu_lock_held());
>>> +
>>> + /*
>>> + * Free special fields immediately as SDATA(selem)->smap will be cleared.
>>> + * No BPF program should be reading the selem.
>>> + */
>>> + if (smap) {
>>> + b = select_bucket(smap, selem);
>>> + err = raw_res_spin_lock_irqsave(&b->lock, flags);
>>> + if (!err) {
>>> + if (likely(selem_linked_to_map(selem))) {
>>> + hlist_del_init_rcu(&selem->map_node);
>>> + bpf_obj_free_fields(smap->map.record, SDATA(selem)->data);
>>> + RCU_INIT_POINTER(SDATA(selem)->smap, NULL);
>>> + unlink++;
>>> + }
>>> + raw_res_spin_unlock_irqrestore(&b->lock, flags);
>>> + } else if (caller == BPF_LOCAL_STORAGE_MAP_FREE) {
>>> + RCU_INIT_POINTER(SDATA(selem)->smap, NULL);
>>
>> I suspect I am missing something obvious, so it will be faster to ask here.
>>
>> I could see why the NULL init can work if it can assume the map_free
>> caller always succeeds in taking the b->lock (the "if (!err)" path above).
>>
>> If the b->lock did fail here for the map_free caller, it resets
>> SDATA(selem)->smap to NULL. Who can do the bpf_obj_free_fields() in
>> the future?
>
> I think this can only mean destroy() is holding the lock, and
> destroy() should do bpf_selem_unlink_map_nolock(). Though I am not
hmm... instead of bpf_selem_unlink_map_nolock(), do you mean the "if
(!err)" path in this bpf_selem_unlink_lockless() function? Or are we
talking about different things?
[ btw, a nit, I think it can use a better function name instead of
"lockless". This function still takes the lock if it can. ]
> sure how destroy() can hold b->lock in a way that causes map_free() to
> fail acquiring b->lock.
I recall ETIMEDOUT was mentioned as the likely (only?) case here.
Assume the b->lock did fail in map_free here; there can be more than one
selem using the same b->lock. Is it always true that the selem that
failed at the b->lock in map_free() here must be racing with the very
same selem in destroy()?
>
>>
>>> + }
>>> + }
>>> +
>>> + /*
>>> + * Only let destroy() unlink from local_storage->list and do mem_uncharge
>>> + * as owner is guaranteed to be valid in destroy().
>>> + */
>>> + if (local_storage && caller == BPF_LOCAL_STORAGE_DESTROY) {
>>
>> If I read here correctly, only bpf_local_storage_destroy() can do the
>> bpf_selem_free(). For example, if a bpf_sk_storage_map is going away,
>> the selem (which is memcg charged) will stay in the sk until the sk is
>> closed?
>
> You read it correctly and Yes there will be stale elements in
> local_storage->list.
>
> I would hope the unlink from local_storage part is doable from
> map_free() and destroy(), but it is hard to guarantee (1) mem_uncharge
> is done only once (2) while the owner is still valid in a lockless
> way.
This needs to be addressed. It cannot leave the selem lingering. At
least the selem should be freed in the common case (i.e., success in
both locks). A lingering selem is ok in case of lock failure. Many sk
can be long-lived connections. User space may want to recreate the map,
and it will be limited by the memcg.
>
>>
>>> + err = raw_res_spin_lock_irqsave(&local_storage->lock, flags);
>>> + if (!err) {
>>> + hlist_del_init_rcu(&selem->snode);
>>> + unlink++;
>>> + raw_res_spin_unlock_irqrestore(&local_storage->lock, flags);
>>> + }
>>> + RCU_INIT_POINTER(selem->local_storage, NULL);
>>> + }
>>> +
>>> + /*
>>> + * Normally, an selem can be unlinked under local_storage->lock and b->lock, and
>>> + * then added to a local to_free list. However, if destroy() and map_free() are
>>> + * racing or rqspinlock returns errors in unlikely situations (unlink != 2), free
>>> + * the selem only after both map_free() and destroy() drop the refcnt.
>>> + */
>>> + if (unlink == 2 || atomic_dec_and_test(&selem->link_cnt))
>>> + hlist_add_head(&selem->free_node, to_free);
>>> +}
>>> +
>>> void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
>>> struct bpf_local_storage_map *smap,
>>> struct bpf_local_storage_elem *selem)
>>
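The unlink == 2 / link_cnt interplay discussed in this thread can be condensed into a small userspace sketch. All names here (selem_sketch, selem_sketch_unlink, the `unlinked` parameter) are hypothetical, not the kernel implementation: an element starts with link_cnt == 2 (reachable from both the map and the local storage); a caller that unlinks both sides under their locks frees immediately, otherwise each caller drops one reference and the one that brings link_cnt to zero frees.

```c
/* Hedged sketch of the link_cnt fallback from the patch under review.
 * Userspace only; names are invented, not kernel APIs.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct selem_sketch {
    atomic_int link_cnt; /* 2: linked to both map and local storage */
};

static struct selem_sketch *selem_sketch_alloc(void)
{
    struct selem_sketch *e = malloc(sizeof(*e));
    if (!e)
        return NULL;
    atomic_init(&e->link_cnt, 2); /* mirrors atomic_set(&selem->link_cnt, 2) */
    return e;
}

/* unlinked: how many sides (map bucket list, local_storage list) this
 * caller managed to unlink under their locks. Returns true when this
 * caller is responsible for freeing, like the patch's
 * "unlink == 2 || atomic_dec_and_test(&selem->link_cnt)". */
static bool selem_sketch_unlink(struct selem_sketch *e, int unlinked)
{
    if (unlinked == 2)
        return true; /* fast path: fully unlinked, free right away */
    /* fallback: drop this caller's reference; the last one frees.
     * atomic_fetch_sub returns the old value, so old == 1 means we
     * just brought the count to zero. */
    return atomic_fetch_sub(&e->link_cnt, 1) == 1;
}
```

In the fast path no reference is dropped, which is safe in the patch because a fully unlinked selem is off both lists and no other caller can reach it; the sketch inherits that assumption.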