From: Hou Tao <houtao@huaweicloud.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: "Toke Høiland-Jørgensen" <toke@kernel.org>,
bpf <bpf@vger.kernel.org>,
rcu@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
"Alexei Starovoitov" <ast@kernel.org>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"Andrii Nakryiko" <andrii@kernel.org>,
"Martin KaFai Lau" <martin.lau@linux.dev>,
"Eduard Zingerman" <eddyz87@gmail.com>,
"Song Liu" <song@kernel.org>,
"Yonghong Song" <yonghong.song@linux.dev>,
"John Fastabend" <john.fastabend@gmail.com>,
"KP Singh" <kpsingh@kernel.org>,
"Stanislav Fomichev" <sdf@fomichev.me>,
"Hao Luo" <haoluo@google.com>, "Jiri Olsa" <jolsa@kernel.org>,
"Paul E . McKenney" <paulmck@kernel.org>,
"Cody Haas" <chaas@riotgames.com>,
"Hou Tao" <hotforest@gmail.com>
Subject: Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
Date: Wed, 26 Feb 2025 12:05:33 +0800 [thread overview]
Message-ID: <6a84a878-0728-0a19-73d2-b5871e10e120@huaweicloud.com> (raw)
In-Reply-To: <CAADnVQKD94q-G4N=w9PJU+k6gPhM8GmUYcyfj=33B_mKX6Qbjw@mail.gmail.com>
Hi,
On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi Toke,
>>
>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>
>>>> +cc Cody Haas
>>>>
>>>> Sorry for the resend. I sent the reply in the HTML format.
>>>>
>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>>>> Currently, the update of existing element in hash map involves two
>>>>> steps:
>>>>> 1) insert the new element at the head of the hash list
>>>>> 2) remove the old element
>>>>>
>>>>> It is possible that the concurrent lookup operation may fail to find
>>>>> either the old element or the new element if the lookup operation starts
>>>>> before the addition and continues after the removal.
>>>>>
>>>>> Therefore, replacing the two-step update with an atomic update. After
>>>>> the change, the update will be atomic in the perspective of the lookup
>>>>> operation: it will either find the old element or the new element.
> I'm missing the point.
> This "atomic" replacement doesn't really solve anything.
> lookup will see one element.
> That element could be deleted by another thread.
> bucket lock and either two step update or single step
> don't change anything from the pov of bpf prog doing lookup.
The point is that overwriting an existed element may lead to concurrent
lookups return ENOENT as demonstrated by the added selftest and the
patch tried to "fix" that. However, it seems using
hlist_nulls_replace_rcu() for the overwriting update is still not
enough. Because when the lookup procedure found the old element, the old
element may be reusing, the comparison of the map key may fail, and the
lookup procedure may still return ENOENT.
>
>>>>> Signed-off-by: Hou Tao <hotforest@gmail.com>
>>>>> ---
>>>>> kernel/bpf/hashtab.c | 14 ++++++++------
>>>>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>> index 4a9eeb7aef85..a28b11ce74c6 100644
>>>>> --- a/kernel/bpf/hashtab.c
>>>>> +++ b/kernel/bpf/hashtab.c
>>>>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>>>>> goto err;
>>>>> }
>>>>>
>>>>> - /* add new element to the head of the list, so that
>>>>> - * concurrent search will find it before old elem
>>>>> - */
>>>>> - hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>>>> - if (l_old) {
>>>>> - hlist_nulls_del_rcu(&l_old->hash_node);
>>>>> + if (!l_old) {
>>>>> + hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>>>> + } else {
>>>>> + /* Replace the old element atomically, so that
>>>>> + * concurrent search will find either the new element or
>>>>> + * the old element.
>>>>> + */
>>>>> + hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>>>>>
>>>>> /* l_old has already been stashed in htab->extra_elems, free
>>>>> * its special fields before it is available for reuse. Also
>>>>>
>>>> After thinking about it the second time, the atomic list replacement on
>>>> the update side is enough to make lookup operation always find the
>>>> existing element. However, due to the immediate reuse, the lookup may
>>>> find an unexpected value. Maybe we should disable the immediate reuse
>>>> for specific map (e.g., htab of maps).
>>> Hmm, in an RCU-protected data structure, reusing the memory before an
>>> RCU grace period has elapsed is just as wrong as freeing it, isn't it?
>>> I.e., the reuse logic should have some kind of call_rcu redirection to
>>> be completely correct?
>> Not for all cases. There is SLAB_TYPESAFE_BY_RCU-typed slab. For hash
>> map, the reuse is also tricky (e.g., the goto again case in
>> lookup_nulls_elem_raw), however it can not prevent the lookup procedure
>> from returning unexpected value. I had post a patch set [1] to "fix"
>> that, but Alexei said it is "a known quirk". Here I am not sure about
>> whether it is reasonable to disable the reuse for htab of maps only. I
>> will post a v2 for the patch set.
>>
>> [1]:
>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> yes. we still have to keep prealloc as default for now :(
> Eventually bpf_mem_alloc is replaced with fully re-entrant
> and safe kmalloc, then we can do fully re-entrant and safe
> kfree_rcu. Then we can talk about closing this quirk.
> Until then the prog has to deal with immediate reuse.
> That was the case for a decade already.
I see. In v2 I will fallback to the original idea: adding a standalone
update procedure for htab of maps in which it will atomically overwrite
the map_ptr just like array of maps does.
next prev parent reply other threads:[~2025-02-26 4:05 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-04 8:28 [PATCH bpf-next 0/3] bpf: Overwrite the htab element atomically Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 1/3] rculist: add hlist_nulls_replace_rcu() helper Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically Hou Tao
2025-02-05 1:38 ` [RESEND] " Hou Tao
2025-02-06 15:05 ` Toke Høiland-Jørgensen
2025-02-08 10:16 ` Hou Tao
2025-02-26 3:24 ` Alexei Starovoitov
2025-02-26 4:05 ` Hou Tao [this message]
2025-02-26 5:42 ` Alexei Starovoitov
2025-02-26 23:17 ` Zvi Effron
2025-02-27 1:48 ` Hou Tao
2025-02-27 1:59 ` Alexei Starovoitov
2025-02-27 2:43 ` Hou Tao
2025-02-27 3:17 ` Alexei Starovoitov
2025-02-27 4:08 ` Hou Tao
2025-03-06 10:22 ` Nick Zavaritsky
2025-02-04 8:28 ` [PATCH bpf-next 3/3] selftests/bpf: Add test case for atomic htab update Hou Tao
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=6a84a878-0728-0a19-73d2-b5871e10e120@huaweicloud.com \
--to=houtao@huaweicloud.com \
--cc=alexei.starovoitov@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=chaas@riotgames.com \
--cc=daniel@iogearbox.net \
--cc=eddyz87@gmail.com \
--cc=haoluo@google.com \
--cc=hotforest@gmail.com \
--cc=john.fastabend@gmail.com \
--cc=jolsa@kernel.org \
--cc=kpsingh@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=martin.lau@linux.dev \
--cc=paulmck@kernel.org \
--cc=rcu@vger.kernel.org \
--cc=sdf@fomichev.me \
--cc=song@kernel.org \
--cc=toke@kernel.org \
--cc=yonghong.song@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).