* [PATCH bpf-next 0/3] bpf: Overwrite the htab element atomically
@ 2025-02-04 8:28 Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 1/3] rculist: add hlist_nulls_replace_rcu() helper Hou Tao
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Hou Tao @ 2025-02-04 8:28 UTC (permalink / raw)
To: bpf, rcu
Cc: linux-kernel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Paul E . McKenney, houtao1, hotforest
Hi,
The motivation for the patch set comes from the question raised by Cody
Haas [1]. He asked whether or not the update of htab of map is atomic in
the perspective of lookup operation. Currently, the update is not atomic
because the overwrite of existing element happens in a two-steps way,
but the support of atomic update is feasible. Initiallly, I only plan to
support atomic update for htab of map because array of map has already
supported that. Afterwards, I think it may be reasonable to support
atomic update for all kinds of hash map. However, for the BPF_F_LOCK
case, although the update is protected by a spin-lock, the update is
still not atomic in the perspective of the lookup operation.
Please see individual patches for details. Comments are always welcome.
---
[1] : https://lore.kernel.org/xdp-newbies/07a365d8-2e66-2899-4298-b8b158a928fa@huaweicloud.com/T/#m06fcd687c6cfdbd0f9b643b227e69b479fc8c2f6
Hou Tao (3):
rculist: add hlist_nulls_replace_rcu() helper
bpf: Overwrite the element in hash map atomically
selftests/bpf: Add test case for atomic htab update
include/linux/rculist_nulls.h | 42 ++++++
kernel/bpf/hashtab.c | 14 +-
.../selftests/bpf/prog_tests/htab_lookup.c | 130 ++++++++++++++++++
.../testing/selftests/bpf/progs/htab_lookup.c | 13 ++
4 files changed, 193 insertions(+), 6 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/htab_lookup.c
create mode 100644 tools/testing/selftests/bpf/progs/htab_lookup.c
--
2.48.1
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH bpf-next 1/3] rculist: add hlist_nulls_replace_rcu() helper
2025-02-04 8:28 [PATCH bpf-next 0/3] bpf: Overwrite the htab element atomically Hou Tao
@ 2025-02-04 8:28 ` Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 3/3] selftests/bpf: Add test case for atomic htab update Hou Tao
2 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2025-02-04 8:28 UTC (permalink / raw)
To: bpf, rcu
Cc: linux-kernel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Paul E . McKenney, houtao1, hotforest
Add hlist_nulls_replace_rcu() to replace an existing element in the hash
list. For the concurrent list traversal, the replace is atomic, it will
find either the old element or the new element.
Signed-off-by: Hou Tao <hotforest@gmail.com>
---
include/linux/rculist_nulls.h | 42 +++++++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)
diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
index 89186c499dd4..795071fda6ad 100644
--- a/include/linux/rculist_nulls.h
+++ b/include/linux/rculist_nulls.h
@@ -52,6 +52,14 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
#define hlist_nulls_next_rcu(node) \
(*((struct hlist_nulls_node __rcu __force **)&(node)->next))
+
+/**
+ * hlist_nulls_pprev_rcu - returns the element of the list before @node.
+ * @node: element of the list.
+ */
+#define hlist_nulls_pprev_rcu(node) \
+ (*((struct hlist_nulls_node __rcu __force **)(node)->pprev))
+
/**
* hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
* @n: the element to delete from the hash list.
@@ -145,6 +153,40 @@ static inline void hlist_nulls_add_tail_rcu(struct hlist_nulls_node *n,
}
}
+/**
+ * hlist_nulls_replace_rcu - replace an element in hash list
+ * @n: new element to add
+ * @o: old element to replace
+ *
+ * Description:
+ * Replace an existing element in a hash list with a new one,
+ * while permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary
+ * (such as holding appropriate locks) to avoid racing
+ * with another list-mutation primitive, such as hlist_nulls_add_head_rcu()
+ * or hlist_nulls_del_rcu(), running on this same list.
+ * However, it is perfectly legal to run concurrently with
+ * the _rcu list-traversal primitives, such as
+ * hlist_nulls_for_each_entry_rcu(), used to prevent memory-consistency
+ * problems on Alpha CPUs. Regardless of the type of CPU, the
+ * list-traversal primitive must be guarded by rcu_read_lock().
+ */
+static inline void hlist_nulls_replace_rcu(struct hlist_nulls_node *n,
+ struct hlist_nulls_node *o)
+{
+ struct hlist_nulls_node *next = o->next;
+ struct hlist_nulls_node **pprev = o->pprev;
+
+ WRITE_ONCE(n->next, next);
+ WRITE_ONCE(n->pprev, pprev);
+ rcu_assign_pointer(hlist_nulls_pprev_rcu(o), n);
+
+ if (!is_a_nulls(next))
+ WRITE_ONCE(next->pprev, &n->next);
+ WRITE_ONCE(o->pprev, LIST_POISON2);
+}
+
/* after that hlist_nulls_del will work */
static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
{
--
2.48.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-04 8:28 [PATCH bpf-next 0/3] bpf: Overwrite the htab element atomically Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 1/3] rculist: add hlist_nulls_replace_rcu() helper Hou Tao
@ 2025-02-04 8:28 ` Hou Tao
2025-02-05 1:38 ` [RESEND] " Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 3/3] selftests/bpf: Add test case for atomic htab update Hou Tao
2 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2025-02-04 8:28 UTC (permalink / raw)
To: bpf, rcu
Cc: linux-kernel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Paul E . McKenney, houtao1, hotforest
Currently, the update of existing element in hash map involves two
steps:
1) insert the new element at the head of the hash list
2) remove the old element
It is possible that the concurrent lookup operation may fail to find
either the old element or the new element if the lookup operation starts
before the addition and continues after the removal.
Therefore, replacing the two-step update with an atomic update. After
the change, the update will be atomic in the perspective of the lookup
operation: it will either find the old element or the new element.
Signed-off-by: Hou Tao <hotforest@gmail.com>
---
kernel/bpf/hashtab.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 4a9eeb7aef85..a28b11ce74c6 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
goto err;
}
- /* add new element to the head of the list, so that
- * concurrent search will find it before old elem
- */
- hlist_nulls_add_head_rcu(&l_new->hash_node, head);
- if (l_old) {
- hlist_nulls_del_rcu(&l_old->hash_node);
+ if (!l_old) {
+ hlist_nulls_add_head_rcu(&l_new->hash_node, head);
+ } else {
+ /* Replace the old element atomically, so that
+ * concurrent search will find either the new element or
+ * the old element.
+ */
+ hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
/* l_old has already been stashed in htab->extra_elems, free
* its special fields before it is available for reuse. Also
--
2.48.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH bpf-next 3/3] selftests/bpf: Add test case for atomic htab update
2025-02-04 8:28 [PATCH bpf-next 0/3] bpf: Overwrite the htab element atomically Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 1/3] rculist: add hlist_nulls_replace_rcu() helper Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically Hou Tao
@ 2025-02-04 8:28 ` Hou Tao
2 siblings, 0 replies; 17+ messages in thread
From: Hou Tao @ 2025-02-04 8:28 UTC (permalink / raw)
To: bpf, rcu
Cc: linux-kernel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Paul E . McKenney, houtao1, hotforest
Add a test case to verify the atomic update of existing element in hash
map. The test proceeds in three steps:
1) fill the map with keys in the range [0, 63]
2) create 8 threads to lookup these keys concurrently
3) create 2 threads to overwrite these keys concurrently
Without atomic update support, the lookup operation may return -ENOENT
error and the test will fail. After the atomic-update change, the lookup
operation will always return 0 and the test will succeed.
Signed-off-by: Hou Tao <hotforest@gmail.com>
---
.../selftests/bpf/prog_tests/htab_lookup.c | 130 ++++++++++++++++++
.../testing/selftests/bpf/progs/htab_lookup.c | 13 ++
2 files changed, 143 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/htab_lookup.c
create mode 100644 tools/testing/selftests/bpf/progs/htab_lookup.c
diff --git a/tools/testing/selftests/bpf/prog_tests/htab_lookup.c b/tools/testing/selftests/bpf/prog_tests/htab_lookup.c
new file mode 100644
index 000000000000..ef9036827439
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/htab_lookup.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <stdbool.h>
+#include <test_progs.h>
+#include "htab_lookup.skel.h"
+
+struct htab_op_ctx {
+ int fd;
+ int loop;
+ unsigned int entries;
+ bool stop;
+};
+
+static void *htab_lookup_fn(void *arg)
+{
+ struct htab_op_ctx *ctx = arg;
+ int i = 0;
+
+ while (i++ < ctx->loop && !ctx->stop) {
+ unsigned int j;
+
+ for (j = 0; j < ctx->entries; j++) {
+ unsigned long key = j, value;
+ int err;
+
+ err = bpf_map_lookup_elem(ctx->fd, &key, &value);
+ if (err) {
+ ctx->stop = true;
+ return (void *)(long)err;
+ }
+ }
+ }
+
+ return NULL;
+}
+
+static void *htab_update_fn(void *arg)
+{
+ struct htab_op_ctx *ctx = arg;
+ int i = 0;
+
+ while (i++ < ctx->loop && !ctx->stop) {
+ unsigned int j;
+
+ for (j = 0; j < ctx->entries; j++) {
+ unsigned long key = j, value = j;
+ int err;
+
+ err = bpf_map_update_elem(ctx->fd, &key, &value, BPF_EXIST);
+ if (err) {
+ if (err == -ENOMEM)
+ continue;
+ ctx->stop = true;
+ return (void *)(long)err;
+ }
+ }
+ }
+
+ return NULL;
+}
+
+static int setup_htab(int fd, unsigned int entries)
+{
+ unsigned int i;
+
+ for (i = 0; i < entries; i++) {
+ unsigned long key = i, value = i;
+ int err;
+
+ err = bpf_map_update_elem(fd, &key, &value, 0);
+ if (!ASSERT_OK(err, "init update"))
+ return -1;
+ }
+
+ return 0;
+}
+
+void test_htab_lookup(void)
+{
+ unsigned int i, wr_nr = 2, rd_nr = 8;
+ pthread_t tids[wr_nr + rd_nr];
+ struct htab_lookup *skel;
+ struct htab_op_ctx ctx;
+ int err;
+
+ skel = htab_lookup__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "htab_lookup__open_and_load"))
+ return;
+
+ ctx.fd = bpf_map__fd(skel->maps.htab);
+ ctx.loop = 50;
+ ctx.stop = false;
+ ctx.entries = 64;
+
+ err = setup_htab(ctx.fd, ctx.entries);
+ if (err)
+ goto destroy;
+
+ memset(tids, 0, sizeof(tids));
+ for (i = 0; i < wr_nr; i++) {
+ err = pthread_create(&tids[i], NULL, htab_update_fn, &ctx);
+ if (!ASSERT_OK(err, "pthread_create")) {
+ ctx.stop = true;
+ goto reap;
+ }
+ }
+ for (i = 0; i < rd_nr; i++) {
+ err = pthread_create(&tids[i + wr_nr], NULL, htab_lookup_fn, &ctx);
+ if (!ASSERT_OK(err, "pthread_create")) {
+ ctx.stop = true;
+ goto reap;
+ }
+ }
+
+reap:
+ for (i = 0; i < wr_nr + rd_nr; i++) {
+ void *ret = NULL;
+ char desc[32];
+
+ if (!tids[i])
+ continue;
+
+ snprintf(desc, sizeof(desc), "thread %u", i + 1);
+ err = pthread_join(tids[i], &ret);
+ ASSERT_OK(err, desc);
+ ASSERT_EQ(ret, NULL, desc);
+ }
+destroy:
+ htab_lookup__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/htab_lookup.c b/tools/testing/selftests/bpf/progs/htab_lookup.c
new file mode 100644
index 000000000000..baa30fa5b84f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/htab_lookup.c
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 64);
+ __type(key, unsigned long);
+ __type(value, unsigned long);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+} htab SEC(".maps");
--
2.48.1
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-04 8:28 ` [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically Hou Tao
@ 2025-02-05 1:38 ` Hou Tao
2025-02-06 15:05 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2025-02-05 1:38 UTC (permalink / raw)
To: Hou Tao, bpf, rcu
Cc: linux-kernel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Paul E . McKenney
+cc Cody Haas
Sorry for the resend. I sent the reply in the HTML format.
On 2/4/2025 4:28 PM, Hou Tao wrote:
> Currently, the update of existing element in hash map involves two
> steps:
> 1) insert the new element at the head of the hash list
> 2) remove the old element
>
> It is possible that the concurrent lookup operation may fail to find
> either the old element or the new element if the lookup operation starts
> before the addition and continues after the removal.
>
> Therefore, replacing the two-step update with an atomic update. After
> the change, the update will be atomic in the perspective of the lookup
> operation: it will either find the old element or the new element.
>
> Signed-off-by: Hou Tao <hotforest@gmail.com>
> ---
> kernel/bpf/hashtab.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 4a9eeb7aef85..a28b11ce74c6 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
> goto err;
> }
>
> - /* add new element to the head of the list, so that
> - * concurrent search will find it before old elem
> - */
> - hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> - if (l_old) {
> - hlist_nulls_del_rcu(&l_old->hash_node);
> + if (!l_old) {
> + hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> + } else {
> + /* Replace the old element atomically, so that
> + * concurrent search will find either the new element or
> + * the old element.
> + */
> + hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>
> /* l_old has already been stashed in htab->extra_elems, free
> * its special fields before it is available for reuse. Also
>
After thinking about it the second time, the atomic list replacement on
the update side is enough to make lookup operation always find the
existing element. However, due to the immediate reuse, the lookup may
find an unexpected value. Maybe we should disable the immediate reuse
for specific map (e.g., htab of maps).
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-05 1:38 ` [RESEND] " Hou Tao
@ 2025-02-06 15:05 ` Toke Høiland-Jørgensen
2025-02-08 10:16 ` Hou Tao
0 siblings, 1 reply; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-02-06 15:05 UTC (permalink / raw)
To: Hou Tao, Hou Tao, bpf, rcu
Cc: linux-kernel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Paul E . McKenney
Hou Tao <houtao@huaweicloud.com> writes:
> +cc Cody Haas
>
> Sorry for the resend. I sent the reply in the HTML format.
>
> On 2/4/2025 4:28 PM, Hou Tao wrote:
>> Currently, the update of existing element in hash map involves two
>> steps:
>> 1) insert the new element at the head of the hash list
>> 2) remove the old element
>>
>> It is possible that the concurrent lookup operation may fail to find
>> either the old element or the new element if the lookup operation starts
>> before the addition and continues after the removal.
>>
>> Therefore, replacing the two-step update with an atomic update. After
>> the change, the update will be atomic in the perspective of the lookup
>> operation: it will either find the old element or the new element.
>>
>> Signed-off-by: Hou Tao <hotforest@gmail.com>
>> ---
>> kernel/bpf/hashtab.c | 14 ++++++++------
>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>> index 4a9eeb7aef85..a28b11ce74c6 100644
>> --- a/kernel/bpf/hashtab.c
>> +++ b/kernel/bpf/hashtab.c
>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>> goto err;
>> }
>>
>> - /* add new element to the head of the list, so that
>> - * concurrent search will find it before old elem
>> - */
>> - hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>> - if (l_old) {
>> - hlist_nulls_del_rcu(&l_old->hash_node);
>> + if (!l_old) {
>> + hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>> + } else {
>> + /* Replace the old element atomically, so that
>> + * concurrent search will find either the new element or
>> + * the old element.
>> + */
>> + hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>>
>> /* l_old has already been stashed in htab->extra_elems, free
>> * its special fields before it is available for reuse. Also
>>
>
> After thinking about it the second time, the atomic list replacement on
> the update side is enough to make lookup operation always find the
> existing element. However, due to the immediate reuse, the lookup may
> find an unexpected value. Maybe we should disable the immediate reuse
> for specific map (e.g., htab of maps).
Hmm, in an RCU-protected data structure, reusing the memory before an
RCU grace period has elapsed is just as wrong as freeing it, isn't it?
I.e., the reuse logic should have some kind of call_rcu redirection to
be completely correct?
-Toke
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-06 15:05 ` Toke Høiland-Jørgensen
@ 2025-02-08 10:16 ` Hou Tao
2025-02-26 3:24 ` Alexei Starovoitov
0 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2025-02-08 10:16 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, bpf, rcu
Cc: linux-kernel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Paul E . McKenney, Cody Haas, Hou Tao
Hi Toke,
On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
> Hou Tao <houtao@huaweicloud.com> writes:
>
>> +cc Cody Haas
>>
>> Sorry for the resend. I sent the reply in the HTML format.
>>
>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>> Currently, the update of existing element in hash map involves two
>>> steps:
>>> 1) insert the new element at the head of the hash list
>>> 2) remove the old element
>>>
>>> It is possible that the concurrent lookup operation may fail to find
>>> either the old element or the new element if the lookup operation starts
>>> before the addition and continues after the removal.
>>>
>>> Therefore, replacing the two-step update with an atomic update. After
>>> the change, the update will be atomic in the perspective of the lookup
>>> operation: it will either find the old element or the new element.
>>>
>>> Signed-off-by: Hou Tao <hotforest@gmail.com>
>>> ---
>>> kernel/bpf/hashtab.c | 14 ++++++++------
>>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>> index 4a9eeb7aef85..a28b11ce74c6 100644
>>> --- a/kernel/bpf/hashtab.c
>>> +++ b/kernel/bpf/hashtab.c
>>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>>> goto err;
>>> }
>>>
>>> - /* add new element to the head of the list, so that
>>> - * concurrent search will find it before old elem
>>> - */
>>> - hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>> - if (l_old) {
>>> - hlist_nulls_del_rcu(&l_old->hash_node);
>>> + if (!l_old) {
>>> + hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>> + } else {
>>> + /* Replace the old element atomically, so that
>>> + * concurrent search will find either the new element or
>>> + * the old element.
>>> + */
>>> + hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>>>
>>> /* l_old has already been stashed in htab->extra_elems, free
>>> * its special fields before it is available for reuse. Also
>>>
>> After thinking about it the second time, the atomic list replacement on
>> the update side is enough to make lookup operation always find the
>> existing element. However, due to the immediate reuse, the lookup may
>> find an unexpected value. Maybe we should disable the immediate reuse
>> for specific map (e.g., htab of maps).
> Hmm, in an RCU-protected data structure, reusing the memory before an
> RCU grace period has elapsed is just as wrong as freeing it, isn't it?
> I.e., the reuse logic should have some kind of call_rcu redirection to
> be completely correct?
Not for all cases. There is SLAB_TYPESAFE_BY_RCU-typed slab. For hash
map, the reuse is also tricky (e.g., the goto again case in
lookup_nulls_elem_raw), however it can not prevent the lookup procedure
from returning unexpected value. I had post a patch set [1] to "fix"
that, but Alexei said it is "a known quirk". Here I am not sure about
whether it is reasonable to disable the reuse for htab of maps only. I
will post a v2 for the patch set.
[1]:
https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> -Toke
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-08 10:16 ` Hou Tao
@ 2025-02-26 3:24 ` Alexei Starovoitov
2025-02-26 4:05 ` Hou Tao
0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2025-02-26 3:24 UTC (permalink / raw)
To: Hou Tao
Cc: Toke Høiland-Jørgensen, bpf, rcu, LKML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Paul E . McKenney, Cody Haas, Hou Tao
On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi Toke,
>
> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
> > Hou Tao <houtao@huaweicloud.com> writes:
> >
> >> +cc Cody Haas
> >>
> >> Sorry for the resend. I sent the reply in the HTML format.
> >>
> >> On 2/4/2025 4:28 PM, Hou Tao wrote:
> >>> Currently, the update of existing element in hash map involves two
> >>> steps:
> >>> 1) insert the new element at the head of the hash list
> >>> 2) remove the old element
> >>>
> >>> It is possible that the concurrent lookup operation may fail to find
> >>> either the old element or the new element if the lookup operation starts
> >>> before the addition and continues after the removal.
> >>>
> >>> Therefore, replacing the two-step update with an atomic update. After
> >>> the change, the update will be atomic in the perspective of the lookup
> >>> operation: it will either find the old element or the new element.
I'm missing the point.
This "atomic" replacement doesn't really solve anything.
lookup will see one element.
That element could be deleted by another thread.
bucket lock and either two step update or single step
don't change anything from the pov of bpf prog doing lookup.
> >>>
> >>> Signed-off-by: Hou Tao <hotforest@gmail.com>
> >>> ---
> >>> kernel/bpf/hashtab.c | 14 ++++++++------
> >>> 1 file changed, 8 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> >>> index 4a9eeb7aef85..a28b11ce74c6 100644
> >>> --- a/kernel/bpf/hashtab.c
> >>> +++ b/kernel/bpf/hashtab.c
> >>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
> >>> goto err;
> >>> }
> >>>
> >>> - /* add new element to the head of the list, so that
> >>> - * concurrent search will find it before old elem
> >>> - */
> >>> - hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> >>> - if (l_old) {
> >>> - hlist_nulls_del_rcu(&l_old->hash_node);
> >>> + if (!l_old) {
> >>> + hlist_nulls_add_head_rcu(&l_new->hash_node, head);
> >>> + } else {
> >>> + /* Replace the old element atomically, so that
> >>> + * concurrent search will find either the new element or
> >>> + * the old element.
> >>> + */
> >>> + hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
> >>>
> >>> /* l_old has already been stashed in htab->extra_elems, free
> >>> * its special fields before it is available for reuse. Also
> >>>
> >> After thinking about it the second time, the atomic list replacement on
> >> the update side is enough to make lookup operation always find the
> >> existing element. However, due to the immediate reuse, the lookup may
> >> find an unexpected value. Maybe we should disable the immediate reuse
> >> for specific map (e.g., htab of maps).
> > Hmm, in an RCU-protected data structure, reusing the memory before an
> > RCU grace period has elapsed is just as wrong as freeing it, isn't it?
> > I.e., the reuse logic should have some kind of call_rcu redirection to
> > be completely correct?
>
> Not for all cases. There is SLAB_TYPESAFE_BY_RCU-typed slab. For hash
> map, the reuse is also tricky (e.g., the goto again case in
> lookup_nulls_elem_raw), however it can not prevent the lookup procedure
> from returning unexpected value. I had post a patch set [1] to "fix"
> that, but Alexei said it is "a known quirk". Here I am not sure about
> whether it is reasonable to disable the reuse for htab of maps only. I
> will post a v2 for the patch set.
>
> [1]:
> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
yes. we still have to keep prealloc as default for now :(
Eventually bpf_mem_alloc is replaced with fully re-entrant
and safe kmalloc, then we can do fully re-entrant and safe
kfree_rcu. Then we can talk about closing this quirk.
Until then the prog has to deal with immediate reuse.
That was the case for a decade already.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-26 3:24 ` Alexei Starovoitov
@ 2025-02-26 4:05 ` Hou Tao
2025-02-26 5:42 ` Alexei Starovoitov
0 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2025-02-26 4:05 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Toke Høiland-Jørgensen, bpf, rcu, LKML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Paul E . McKenney, Cody Haas, Hou Tao
Hi,
On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi Toke,
>>
>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>
>>>> +cc Cody Haas
>>>>
>>>> Sorry for the resend. I sent the reply in the HTML format.
>>>>
>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>>>> Currently, the update of existing element in hash map involves two
>>>>> steps:
>>>>> 1) insert the new element at the head of the hash list
>>>>> 2) remove the old element
>>>>>
>>>>> It is possible that the concurrent lookup operation may fail to find
>>>>> either the old element or the new element if the lookup operation starts
>>>>> before the addition and continues after the removal.
>>>>>
>>>>> Therefore, replacing the two-step update with an atomic update. After
>>>>> the change, the update will be atomic in the perspective of the lookup
>>>>> operation: it will either find the old element or the new element.
> I'm missing the point.
> This "atomic" replacement doesn't really solve anything.
> lookup will see one element.
> That element could be deleted by another thread.
> bucket lock and either two step update or single step
> don't change anything from the pov of bpf prog doing lookup.
The point is that overwriting an existed element may lead to concurrent
lookups return ENOENT as demonstrated by the added selftest and the
patch tried to "fix" that. However, it seems using
hlist_nulls_replace_rcu() for the overwriting update is still not
enough. Because when the lookup procedure found the old element, the old
element may be reusing, the comparison of the map key may fail, and the
lookup procedure may still return ENOENT.
>
>>>>> Signed-off-by: Hou Tao <hotforest@gmail.com>
>>>>> ---
>>>>> kernel/bpf/hashtab.c | 14 ++++++++------
>>>>> 1 file changed, 8 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
>>>>> index 4a9eeb7aef85..a28b11ce74c6 100644
>>>>> --- a/kernel/bpf/hashtab.c
>>>>> +++ b/kernel/bpf/hashtab.c
>>>>> @@ -1179,12 +1179,14 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>>>>> goto err;
>>>>> }
>>>>>
>>>>> - /* add new element to the head of the list, so that
>>>>> - * concurrent search will find it before old elem
>>>>> - */
>>>>> - hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>>>> - if (l_old) {
>>>>> - hlist_nulls_del_rcu(&l_old->hash_node);
>>>>> + if (!l_old) {
>>>>> + hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>>>>> + } else {
>>>>> + /* Replace the old element atomically, so that
>>>>> + * concurrent search will find either the new element or
>>>>> + * the old element.
>>>>> + */
>>>>> + hlist_nulls_replace_rcu(&l_new->hash_node, &l_old->hash_node);
>>>>>
>>>>> /* l_old has already been stashed in htab->extra_elems, free
>>>>> * its special fields before it is available for reuse. Also
>>>>>
>>>> After thinking about it the second time, the atomic list replacement on
>>>> the update side is enough to make lookup operation always find the
>>>> existing element. However, due to the immediate reuse, the lookup may
>>>> find an unexpected value. Maybe we should disable the immediate reuse
>>>> for specific map (e.g., htab of maps).
>>> Hmm, in an RCU-protected data structure, reusing the memory before an
>>> RCU grace period has elapsed is just as wrong as freeing it, isn't it?
>>> I.e., the reuse logic should have some kind of call_rcu redirection to
>>> be completely correct?
>> Not for all cases. There is SLAB_TYPESAFE_BY_RCU-typed slab. For hash
>> map, the reuse is also tricky (e.g., the goto again case in
>> lookup_nulls_elem_raw), however it can not prevent the lookup procedure
>> from returning unexpected value. I had post a patch set [1] to "fix"
>> that, but Alexei said it is "a known quirk". Here I am not sure about
>> whether it is reasonable to disable the reuse for htab of maps only. I
>> will post a v2 for the patch set.
>>
>> [1]:
>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> yes. we still have to keep prealloc as default for now :(
> Eventually bpf_mem_alloc is replaced with fully re-entrant
> and safe kmalloc, then we can do fully re-entrant and safe
> kfree_rcu. Then we can talk about closing this quirk.
> Until then the prog has to deal with immediate reuse.
> That was the case for a decade already.
I see. In v2 I will fallback to the original idea: adding a standalone
update procedure for htab of maps in which it will atomically overwrite
the map_ptr just like array of maps does.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-26 4:05 ` Hou Tao
@ 2025-02-26 5:42 ` Alexei Starovoitov
2025-02-26 23:17 ` Zvi Effron
0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2025-02-26 5:42 UTC (permalink / raw)
To: Hou Tao
Cc: Toke Høiland-Jørgensen, bpf, rcu, LKML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Paul E . McKenney, Cody Haas, Hou Tao
On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> > On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi Toke,
> >>
> >> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
> >>> Hou Tao <houtao@huaweicloud.com> writes:
> >>>
> >>>> +cc Cody Haas
> >>>>
> >>>> Sorry for the resend. I sent the reply in the HTML format.
> >>>>
> >>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
> >>>>> Currently, the update of existing element in hash map involves two
> >>>>> steps:
> >>>>> 1) insert the new element at the head of the hash list
> >>>>> 2) remove the old element
> >>>>>
> >>>>> It is possible that the concurrent lookup operation may fail to find
> >>>>> either the old element or the new element if the lookup operation starts
> >>>>> before the addition and continues after the removal.
> >>>>>
> >>>>> Therefore, replacing the two-step update with an atomic update. After
> >>>>> the change, the update will be atomic in the perspective of the lookup
> >>>>> operation: it will either find the old element or the new element.
> > I'm missing the point.
> > This "atomic" replacement doesn't really solve anything.
> > lookup will see one element.
> > That element could be deleted by another thread.
> > bucket lock and either two step update or single step
> > don't change anything from the pov of bpf prog doing lookup.
>
> The point is that overwriting an existed element may lead to concurrent
> lookups return ENOENT as demonstrated by the added selftest and the
> patch tried to "fix" that. However, it seems using
> hlist_nulls_replace_rcu() for the overwriting update is still not
> enough. Because when the lookup procedure found the old element, the old
> element may be reusing, the comparison of the map key may fail, and the
> lookup procedure may still return ENOENT.
you mean l_old == l_new ? I don't think it's possible
within htab_map_update_elem(),
but htab_map_delete_elem() doing hlist_nulls_del_rcu()
then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
and just deleted elem is reused to be inserted
into another bucket.
I'm not sure whether this new hlist_nulls_replace_rcu()
primitive works with nulls logic.
So back to the problem statement..
Are you saying that adding new to a head while lookup is in the middle
is causing it to miss an element that
is supposed to be updated assuming atomicity of the update?
While now update_elem() is more like a sequence of delete + insert?
Hmm.
> I see. In v2 I will fallback to the original idea: adding a standalone
> update procedure for htab of maps in which it will atomically overwrite
> the map_ptr just like array of maps does.
hold on. is this only for hash-of-maps ?
How that non-atomic update manifested in real production?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-26 5:42 ` Alexei Starovoitov
@ 2025-02-26 23:17 ` Zvi Effron
2025-02-27 1:48 ` Hou Tao
0 siblings, 1 reply; 17+ messages in thread
From: Zvi Effron @ 2025-02-26 23:17 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Hou Tao, Toke Høiland-Jørgensen, bpf, rcu, LKML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Paul E . McKenney, Cody Haas, Hou Tao
On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> > > On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > >> Hi Toke,
> > >>
> > >> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
> > >>> Hou Tao <houtao@huaweicloud.com> writes:
> > >>>
> > >>>> +cc Cody Haas
> > >>>>
> > >>>> Sorry for the resend. I sent the reply in the HTML format.
> > >>>>
> > >>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
> > >>>>> Currently, the update of existing element in hash map involves two
> > >>>>> steps:
> > >>>>> 1) insert the new element at the head of the hash list
> > >>>>> 2) remove the old element
> > >>>>>
> > >>>>> It is possible that the concurrent lookup operation may fail to find
> > >>>>> either the old element or the new element if the lookup operation starts
> > >>>>> before the addition and continues after the removal.
> > >>>>>
> > >>>>> Therefore, replacing the two-step update with an atomic update. After
> > >>>>> the change, the update will be atomic in the perspective of the lookup
> > >>>>> operation: it will either find the old element or the new element.
> > > I'm missing the point.
> > > This "atomic" replacement doesn't really solve anything.
> > > lookup will see one element.
> > > That element could be deleted by another thread.
> > > bucket lock and either two step update or single step
> > > don't change anything from the pov of bpf prog doing lookup.
> >
> > The point is that overwriting an existed element may lead to concurrent
> > lookups return ENOENT as demonstrated by the added selftest and the
> > patch tried to "fix" that. However, it seems using
> > hlist_nulls_replace_rcu() for the overwriting update is still not
> > enough. Because when the lookup procedure found the old element, the old
> > element may be reusing, the comparison of the map key may fail, and the
> > lookup procedure may still return ENOENT.
>
> you mean l_old == l_new ? I don't think it's possible
> within htab_map_update_elem(),
> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
> and just deleted elem is reused to be inserted
> into another bucket.
>
> I'm not sure whether this new hlist_nulls_replace_rcu()
> primitive works with nulls logic.
>
> So back to the problem statement..
> Are you saying that adding new to a head while lookup is in the middle
> is causing it to miss an element that
> is supposed to be updated assuming atomicity of the update?
> While now update_elem() is more like a sequence of delete + insert?
>
> Hmm.
Yes, exactly that. Because update_elem is actually a delete + insert (actually
an insert + delete, I think?), it is possible for a concurrent lookup to see no
element instead of either the old or new value.
>
> > I see. In v2 I will fallback to the original idea: adding a standalone
> > update procedure for htab of maps in which it will atomically overwrite
> > the map_ptr just like array of maps does.
>
> hold on. is this only for hash-of-maps ?
I believe this was also replicated for hash as well as hash-of-maps. Cody can
confirm, or use the reproducer he has to test that case.
> How that non-atomic update manifested in real production?
>
See [1] (in the cover letter for this series, also replicated below).
[1] : https://lore.kernel.org/xdp-newbies/07a365d8-2e66-2899-4298-b8b158a928fa@huaweicloud.com/T/#m06fcd687c6cfdbd0f9b643b227e69b479fc8c2f6
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-26 23:17 ` Zvi Effron
@ 2025-02-27 1:48 ` Hou Tao
2025-02-27 1:59 ` Alexei Starovoitov
0 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2025-02-27 1:48 UTC (permalink / raw)
To: Zvi Effron, Alexei Starovoitov
Cc: Toke Høiland-Jørgensen, bpf, rcu, LKML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Paul E . McKenney, Cody Haas, Hou Tao
Hi,
On 2/27/2025 7:17 AM, Zvi Effron wrote:
> On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>
>>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
>>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hi Toke,
>>>>>
>>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>
>>>>>>> +cc Cody Haas
>>>>>>>
>>>>>>> Sorry for the resend. I sent the reply in the HTML format.
>>>>>>>
>>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>>>>>>> Currently, the update of existing element in hash map involves two
>>>>>>>> steps:
>>>>>>>> 1) insert the new element at the head of the hash list
>>>>>>>> 2) remove the old element
>>>>>>>>
>>>>>>>> It is possible that the concurrent lookup operation may fail to find
>>>>>>>> either the old element or the new element if the lookup operation starts
>>>>>>>> before the addition and continues after the removal.
>>>>>>>>
>>>>>>>> Therefore, replacing the two-step update with an atomic update. After
>>>>>>>> the change, the update will be atomic in the perspective of the lookup
>>>>>>>> operation: it will either find the old element or the new element.
>>>> I'm missing the point.
>>>> This "atomic" replacement doesn't really solve anything.
>>>> lookup will see one element.
>>>> That element could be deleted by another thread.
>>>> bucket lock and either two step update or single step
>>>> don't change anything from the pov of bpf prog doing lookup.
>>> The point is that overwriting an existed element may lead to concurrent
>>> lookups return ENOENT as demonstrated by the added selftest and the
>>> patch tried to "fix" that. However, it seems using
>>> hlist_nulls_replace_rcu() for the overwriting update is still not
>>> enough. Because when the lookup procedure found the old element, the old
>>> element may be reusing, the comparison of the map key may fail, and the
>>> lookup procedure may still return ENOENT.
>> you mean l_old == l_new ? I don't think it's possible
>> within htab_map_update_elem(),
>> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
>> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
>> and just deleted elem is reused to be inserted
>> into another bucket.
No. I mean the following procedure in which the lookup procedure finds
the old element and tries to match its key, one update procedure has
already deleted the old element, and another update procedure is reusing
the old element:
lookup procedure A
A: find the old element (instead of the new old)
update procedure B
B: delete the old element
update procedure C on the same CPU:
C: reuse the old element (overwrite its key and insert in
the same bucket)
A: the key is mismatched and return -ENOENT.
It can be reproduced by increasing ctx.loop in the selftest.
>>
>> I'm not sure whether this new hlist_nulls_replace_rcu()
>> primitive works with nulls logic.
>>
>> So back to the problem statement..
>> Are you saying that adding new to a head while lookup is in the middle
>> is causing it to miss an element that
>> is supposed to be updated assuming atomicity of the update?
>> While now update_elem() is more like a sequence of delete + insert?
>>
>> Hmm.
> Yes, exactly that. Because update_elem is actually a delete + insert (actually
> an insert + delete, I think?), it is possible for a concurrent lookup to see no
> element instead of either the old or new value.
Yep.
>
>>> I see. In v2 I will fallback to the original idea: adding a standalone
>>> update procedure for htab of maps in which it will atomically overwrite
>>> the map_ptr just like array of maps does.
>> hold on. is this only for hash-of-maps ?
> I believe this was also replicated for hash as well as hash-of-maps. Cody can
> confirm, or use the reproducer he has to test that case.
The fix for hash-of-maps will be much simpler and the returned map
pointer will be correct. For other types of hash map, beside switching
to hlist_nulls_replace_rcu(), I think we also need to detect the reuse
of element during the loop: if the element has been reused, the lookup
should restart from the head again. However, even withing the above two
fixes, the value returned by lookup procedure for other types of hash
map may still be incorrect (as shown long time ago [1]), so I think
maybe for other types of map, the atomic update doesn't matter too much.
[1]:
https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
>
>> How that non-atomic update manifested in real production?
>>
> See [1] (in the cover letter for this series, also replicated below).
>
> [1] : https://lore.kernel.org/xdp-newbies/07a365d8-2e66-2899-4298-b8b158a928fa@huaweicloud.com/T/#m06fcd687c6cfdbd0f9b643b227e69b479fc8c2f6
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-27 1:48 ` Hou Tao
@ 2025-02-27 1:59 ` Alexei Starovoitov
2025-02-27 2:43 ` Hou Tao
0 siblings, 1 reply; 17+ messages in thread
From: Alexei Starovoitov @ 2025-02-27 1:59 UTC (permalink / raw)
To: Hou Tao
Cc: Zvi Effron, Toke Høiland-Jørgensen, bpf, rcu, LKML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Paul E . McKenney, Cody Haas, Hou Tao
On Wed, Feb 26, 2025 at 5:48 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/27/2025 7:17 AM, Zvi Effron wrote:
> > On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>> Hi,
> >>>
> >>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
> >>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>>> Hi Toke,
> >>>>>
> >>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
> >>>>>> Hou Tao <houtao@huaweicloud.com> writes:
> >>>>>>
> >>>>>>> +cc Cody Haas
> >>>>>>>
> >>>>>>> Sorry for the resend. I sent the reply in the HTML format.
> >>>>>>>
> >>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
> >>>>>>>> Currently, the update of existing element in hash map involves two
> >>>>>>>> steps:
> >>>>>>>> 1) insert the new element at the head of the hash list
> >>>>>>>> 2) remove the old element
> >>>>>>>>
> >>>>>>>> It is possible that the concurrent lookup operation may fail to find
> >>>>>>>> either the old element or the new element if the lookup operation starts
> >>>>>>>> before the addition and continues after the removal.
> >>>>>>>>
> >>>>>>>> Therefore, replacing the two-step update with an atomic update. After
> >>>>>>>> the change, the update will be atomic in the perspective of the lookup
> >>>>>>>> operation: it will either find the old element or the new element.
> >>>> I'm missing the point.
> >>>> This "atomic" replacement doesn't really solve anything.
> >>>> lookup will see one element.
> >>>> That element could be deleted by another thread.
> >>>> bucket lock and either two step update or single step
> >>>> don't change anything from the pov of bpf prog doing lookup.
> >>> The point is that overwriting an existed element may lead to concurrent
> >>> lookups return ENOENT as demonstrated by the added selftest and the
> >>> patch tried to "fix" that. However, it seems using
> >>> hlist_nulls_replace_rcu() for the overwriting update is still not
> >>> enough. Because when the lookup procedure found the old element, the old
> >>> element may be reusing, the comparison of the map key may fail, and the
> >>> lookup procedure may still return ENOENT.
> >> you mean l_old == l_new ? I don't think it's possible
> >> within htab_map_update_elem(),
> >> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
> >> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
> >> and just deleted elem is reused to be inserted
> >> into another bucket.
>
> No. I mean the following procedure in which the lookup procedure finds
> the old element and tries to match its key, one update procedure has
> already deleted the old element, and another update procedure is reusing
> the old element:
>
> lookup procedure A
> A: find the old element (instead of the new old)
>
> update procedure B
> B: delete the old element
> update procedure C on the same CPU:
> C: reuse the old element (overwrite its key and insert in
> the same bucket)
>
> A: the key is mismatched and return -ENOENT.
This is fine. It's just normal reuse.
Orthogonal to 'update as insert+delete' issue.
> It can be reproduced by increasing ctx.loop in the selftest.
> >>
> >> I'm not sure whether this new hlist_nulls_replace_rcu()
> >> primitive works with nulls logic.
> >>
> >> So back to the problem statement..
> >> Are you saying that adding new to a head while lookup is in the middle
> >> is causing it to miss an element that
> >> is supposed to be updated assuming atomicity of the update?
> >> While now update_elem() is more like a sequence of delete + insert?
> >>
> >> Hmm.
> > Yes, exactly that. Because update_elem is actually a delete + insert (actually
> > an insert + delete, I think?), it is possible for a concurrent lookup to see no
> > element instead of either the old or new value.
>
> Yep.
> >
> >>> I see. In v2 I will fallback to the original idea: adding a standalone
> >>> update procedure for htab of maps in which it will atomically overwrite
> >>> the map_ptr just like array of maps does.
> >> hold on. is this only for hash-of-maps ?
> > I believe this was also replicated for hash as well as hash-of-maps. Cody can
> > confirm, or use the reproducer he has to test that case.
>
> The fix for hash-of-maps will be much simpler and the returned map
> pointer will be correct. For other types of hash map, beside switching
> to hlist_nulls_replace_rcu(),
It's been a long time since I looked into rcu_nulls details.
Pls help me understand that this new replace_rcu_nulls()
is correct from nulls pov,
If it is then this patch set may be the right answer to non-atomic update.
And for the future, please please focus on "why" part in
the cover letter and commit logs instead of "what".
Since the only thing I got from the log was:
"Currently, the update is not atomic
because the overwrite of existing element happens in a two-steps way,
but the support of atomic update is feasible".
"is feasible" doesn't explain "why".
Link to xdp-newbie question is nice for additional context,
but reviewers should not need to go and read some thread somewhere
to understand "why" part.
All of it should be in the commit log.
> map may still be incorrect (as shown long time ago [1]), so I think
> maybe for other types of map, the atomic update doesn't matter too much.
>
> [1]:
> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
A thread from 3 years ago ?! Sorry, it's not helpful to ask
people to page-in such an old context with lots of follow ups
that may or may not be relevant today.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-27 1:59 ` Alexei Starovoitov
@ 2025-02-27 2:43 ` Hou Tao
2025-02-27 3:17 ` Alexei Starovoitov
0 siblings, 1 reply; 17+ messages in thread
From: Hou Tao @ 2025-02-27 2:43 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Zvi Effron, Toke Høiland-Jørgensen, bpf, rcu, LKML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Paul E . McKenney, Cody Haas, Hou Tao
Hi,
On 2/27/2025 9:59 AM, Alexei Starovoitov wrote:
> On Wed, Feb 26, 2025 at 5:48 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 2/27/2025 7:17 AM, Zvi Effron wrote:
>>> On Tue, Feb 25, 2025 at 9:42 PM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> On Tue, Feb 25, 2025 at 8:05 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> Hi,
>>>>>
>>>>> On 2/26/2025 11:24 AM, Alexei Starovoitov wrote:
>>>>>> On Sat, Feb 8, 2025 at 2:17 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>>> Hi Toke,
>>>>>>>
>>>>>>> On 2/6/2025 11:05 PM, Toke Høiland-Jørgensen wrote:
>>>>>>>> Hou Tao <houtao@huaweicloud.com> writes:
>>>>>>>>
>>>>>>>>> +cc Cody Haas
>>>>>>>>>
>>>>>>>>> Sorry for the resend. I sent the reply in the HTML format.
>>>>>>>>>
>>>>>>>>> On 2/4/2025 4:28 PM, Hou Tao wrote:
>>>>>>>>>> Currently, the update of existing element in hash map involves two
>>>>>>>>>> steps:
>>>>>>>>>> 1) insert the new element at the head of the hash list
>>>>>>>>>> 2) remove the old element
>>>>>>>>>>
>>>>>>>>>> It is possible that the concurrent lookup operation may fail to find
>>>>>>>>>> either the old element or the new element if the lookup operation starts
>>>>>>>>>> before the addition and continues after the removal.
>>>>>>>>>>
>>>>>>>>>> Therefore, replacing the two-step update with an atomic update. After
>>>>>>>>>> the change, the update will be atomic in the perspective of the lookup
>>>>>>>>>> operation: it will either find the old element or the new element.
>>>>>> I'm missing the point.
>>>>>> This "atomic" replacement doesn't really solve anything.
>>>>>> lookup will see one element.
>>>>>> That element could be deleted by another thread.
>>>>>> bucket lock and either two step update or single step
>>>>>> don't change anything from the pov of bpf prog doing lookup.
>>>>> The point is that overwriting an existed element may lead to concurrent
>>>>> lookups return ENOENT as demonstrated by the added selftest and the
>>>>> patch tried to "fix" that. However, it seems using
>>>>> hlist_nulls_replace_rcu() for the overwriting update is still not
>>>>> enough. Because when the lookup procedure found the old element, the old
>>>>> element may be reusing, the comparison of the map key may fail, and the
>>>>> lookup procedure may still return ENOENT.
>>>> you mean l_old == l_new ? I don't think it's possible
>>>> within htab_map_update_elem(),
>>>> but htab_map_delete_elem() doing hlist_nulls_del_rcu()
>>>> then free_htab_elem, htab_map_update_elem, alloc, hlist_nulls_add_head_rcu
>>>> and just deleted elem is reused to be inserted
>>>> into another bucket.
>> No. I mean the following procedure in which the lookup procedure finds
>> the old element and tries to match its key, one update procedure has
>> already deleted the old element, and another update procedure is reusing
>> the old element:
>>
>> lookup procedure A
>> A: find the old element (instead of the new old)
>>
>> update procedure B
>> B: delete the old element
>> update procedure C on the same CPU:
>> C: reuse the old element (overwrite its key and insert in
>> the same bucket)
>>
>> A: the key is mismatched and return -ENOENT.
> This is fine. It's just normal reuse.
> Orthogonal to 'update as insert+delete' issue.
OK. However, it will break the lookup procedure because it expects it
will return an valid result instead of -ENOENT.
>
>> It can be reproduced by increasing ctx.loop in the selftest.
>>>> I'm not sure whether this new hlist_nulls_replace_rcu()
>>>> primitive works with nulls logic.
>>>>
>>>> So back to the problem statement..
>>>> Are you saying that adding new to a head while lookup is in the middle
>>>> is causing it to miss an element that
>>>> is supposed to be updated assuming atomicity of the update?
>>>> While now update_elem() is more like a sequence of delete + insert?
>>>>
>>>> Hmm.
>>> Yes, exactly that. Because update_elem is actually a delete + insert (actually
>>> an insert + delete, I think?), it is possible for a concurrent lookup to see no
>>> element instead of either the old or new value.
>> Yep.
>>>>> I see. In v2 I will fallback to the original idea: adding a standalone
>>>>> update procedure for htab of maps in which it will atomically overwrite
>>>>> the map_ptr just like array of maps does.
>>>> hold on. is this only for hash-of-maps ?
>>> I believe this was also replicated for hash as well as hash-of-maps. Cody can
>>> confirm, or use the reproducer he has to test that case.
>> The fix for hash-of-maps will be much simpler and the returned map
>> pointer will be correct. For other types of hash map, beside switching
>> to hlist_nulls_replace_rcu(),
> It's been a long time since I looked into rcu_nulls details.
> Pls help me understand that this new replace_rcu_nulls()
> is correct from nulls pov,
> If it is then this patch set may be the right answer to non-atomic update.
If I understand correctly, only the manipulations of ->first pointer and
->next pointer need to take care of nulls pointer.
>
> And for the future, please please focus on "why" part in
> the cover letter and commit logs instead of "what".
>
> Since the only thing I got from the log was:
> "Currently, the update is not atomic
> because the overwrite of existing element happens in a two-steps way,
> but the support of atomic update is feasible".
>
> "is feasible" doesn't explain "why".
>
> Link to xdp-newbie question is nice for additional context,
> but reviewers should not need to go and read some thread somewhere
> to understand "why" part.
> All of it should be in the commit log.
OK. My original thought is that is a reported problem, so an extra link
will be enough. Will try to add more context next time.
>
>> map may still be incorrect (as shown long time ago [1]), so I think
>> maybe for other types of map, the atomic update doesn't matter too much.
>>
>> [1]:
>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> A thread from 3 years ago ?! Sorry, it's not helpful to ask
> people to page-in such an old context with lots of follow ups
> that may or may not be relevant today.
Let me reuse part of the diagram above to explain how does the lookup
procedure may return a incorrect value:
lookup procedure A
A: find the old element (instead of the new element)
update procedure B
B: delete the old element
update procedure C on the same CPU:
A: the key is matched and return the value in the element
C: reuse the old element (overwrite its key and value)
A: read the value (it is incorrect, because it has been reused and
overwritten)
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-27 2:43 ` Hou Tao
@ 2025-02-27 3:17 ` Alexei Starovoitov
2025-02-27 4:08 ` Hou Tao
2025-03-06 10:22 ` Nick Zavaritsky
0 siblings, 2 replies; 17+ messages in thread
From: Alexei Starovoitov @ 2025-02-27 3:17 UTC (permalink / raw)
To: Hou Tao
Cc: Zvi Effron, Toke Høiland-Jørgensen, bpf, rcu, LKML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Paul E . McKenney, Cody Haas, Hou Tao
On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> >>
> >> lookup procedure A
> >> A: find the old element (instead of the new old)
> >>
> >> update procedure B
> >> B: delete the old element
> >> update procedure C on the same CPU:
> >> C: reuse the old element (overwrite its key and insert in
> >> the same bucket)
> >>
> >> A: the key is mismatched and return -ENOENT.
> > This is fine. It's just normal reuse.
> > Orthogonal to 'update as insert+delete' issue.
>
> OK. However, it will break the lookup procedure because it expects it
> will return an valid result instead of -ENOENT.
What do you mean 'breaks the lookup' ?
lookup_elem_raw() matches hash, and then it memcmp(key),
if the element is reused anything can happen.
Either it succeeds in memcmp() and returns an elem,
or miscompares in memcmp().
Both are expected, because elems are reused in place.
And this behavior is expected and not-broken,
because bpf prog that does lookup on one cpu and deletes
the same element on the other cpu is asking for trouble.
bpf infra guarantees the safety of the kernel.
It doesn't guarantee that bpf progs are sane.
> > It's been a long time since I looked into rcu_nulls details.
> > Pls help me understand that this new replace_rcu_nulls()
> > is correct from nulls pov,
> > If it is then this patch set may be the right answer to non-atomic update.
>
> If I understand correctly, only the manipulations of ->first pointer and
> ->next pointer need to take care of nulls pointer.
hmm. I feel we're still talking past each other.
See if (get_nulls_value() == ...) in lookup_nulls_elem_raw().
It's there because of reuse. The lookup can start in one bucket
and finish in another.
> >
> > And for the future, please please focus on "why" part in
> > the cover letter and commit logs instead of "what".
> >
> > Since the only thing I got from the log was:
> > "Currently, the update is not atomic
> > because the overwrite of existing element happens in a two-steps way,
> > but the support of atomic update is feasible".
> >
> > "is feasible" doesn't explain "why".
> >
> > Link to xdp-newbie question is nice for additional context,
> > but reviewers should not need to go and read some thread somewhere
> > to understand "why" part.
> > All of it should be in the commit log.
>
> OK. My original thought is that is a reported problem, so an extra link
> will be enough. Will try to add more context next time.
> >
> >> map may still be incorrect (as shown long time ago [1]), so I think
> >> maybe for other types of map, the atomic update doesn't matter too much.
> >>
> >> [1]:
> >> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
> > A thread from 3 years ago ?! Sorry, it's not helpful to ask
> > people to page-in such an old context with lots of follow ups
> > that may or may not be relevant today.
> Let me reuse part of the diagram above to explain how does the lookup
> procedure may return a incorrect value:
>
> lookup procedure A
> A: find the old element (instead of the new element)
>
>
> update procedure B
> B: delete the old element
> update procedure C on the same CPU:
>
>
> A: the key is matched and return the value in the element
>
> C: reuse the old element (overwrite its key and value)
>
> A: read the value (it is incorrect, because it has been reused and
> overwritten)
... and it's fine. It's by design. It's an element reuse behavior.
Long ago hashmap had two modes: prealloc (default) and
NO_PREALLOC (call_rcu + kfree)
The call_rcu part was there to make things safe.
The memory cannot be kfree-ed to the kernel until RCU GP.
With bpf_mem_alloc hashmap elements are freed back to bpf_ma
right away. Hashmap is doing bpf_mem_cache_free()
(instead of bpf_mem_cache_free_rcu()) because users need speed.
So since 2022 both prealloc and no_prealloc reuse elements.
We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP
that will use _rcu() flavor of freeing into bpf_ma,
but it has to have a strong reason.
And soon as we add it the default with prealloc would need
to use call_rcu() too, right?
and that becomes nightmare, since bpf prog can easily DoS the system.
Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern.
Unlike new things like bpf_obj_new/obj_drop the hashmap
is unpriv, so concerns are drastically different.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-27 3:17 ` Alexei Starovoitov
@ 2025-02-27 4:08 ` Hou Tao
2025-03-06 10:22 ` Nick Zavaritsky
1 sibling, 0 replies; 17+ messages in thread
From: Hou Tao @ 2025-02-27 4:08 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Zvi Effron, Toke Høiland-Jørgensen, bpf, rcu, LKML,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Paul E . McKenney, Cody Haas, Hou Tao
Hi,
On 2/27/2025 11:17 AM, Alexei Starovoitov wrote:
> On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> lookup procedure A
>>>> A: find the old element (instead of the new old)
>>>>
>>>> update procedure B
>>>> B: delete the old element
>>>> update procedure C on the same CPU:
>>>> C: reuse the old element (overwrite its key and insert in
>>>> the same bucket)
>>>>
>>>> A: the key is mismatched and return -ENOENT.
>>> This is fine. It's just normal reuse.
>>> Orthogonal to 'update as insert+delete' issue.
>> OK. However, it will break the lookup procedure because it expects it
>> will return an valid result instead of -ENOENT.
> What do you mean 'breaks the lookup' ?
> lookup_elem_raw() matches hash, and then it memcmp(key),
> if the element is reused anything can happen.
> Either it succeeds in memcmp() and returns an elem,
> or miscompares in memcmp().
> Both are expected, because elems are reused in place.
>
> And this behavior is expected and not-broken,
> because bpf prog that does lookup on one cpu and deletes
> the same element on the other cpu is asking for trouble.
It seems I mislead you in the above example. It is also possible for
doing lookup in one CPU and updating the same element in other CPU. So
does such concurrence also look insane ?
lookup procedure A
A: find the old element (instead of the new old)
update procedure B
B: overwrite the same element and free the old element
update procedure C on the same CPU:
C: reuse the old element (overwrite its key and insert in
the same bucket)
A: the key is mismatched and return -ENOENT.
> bpf infra guarantees the safety of the kernel.
> It doesn't guarantee that bpf progs are sane.
>
>>> It's been a long time since I looked into rcu_nulls details.
>>> Pls help me understand that this new replace_rcu_nulls()
>>> is correct from nulls pov,
>>> If it is then this patch set may be the right answer to non-atomic update.
>> If I understand correctly, only the manipulations of ->first pointer and
>> ->next pointer need to take care of nulls pointer.
> hmm. I feel we're still talking past each other.
> See if (get_nulls_value() == ...) in lookup_nulls_elem_raw().
> It's there because of reuse. The lookup can start in one bucket
> and finish in another.
Yes. I noticed that. However, it happens when the deleted element is
reused and inserted to a different bucket, right ? For
replace_rcu_nulls(), the reuse of old element is possible only after
replace_rcu_nulls() completes, so I think it is irrelevant.
>If it is then this patch set may be the right answer to non-atomic update.
I also didn't follow that. Due to the immediate reuse, the lookup
procedure may still return -ENOENT when there is concurrent overwriting
of the same element as show above, so I think it only reduce the
possibility. I still prefer to implement a separated update procedure of
htab of maps to fix the atomic update problem completely for htab of
maps first.
>
>>> And for the future, please please focus on "why" part in
>>> the cover letter and commit logs instead of "what".
>>>
>>> Since the only thing I got from the log was:
>>> "Currently, the update is not atomic
>>> because the overwrite of existing element happens in a two-steps way,
>>> but the support of atomic update is feasible".
>>>
>>> "is feasible" doesn't explain "why".
>>>
>>> Link to xdp-newbie question is nice for additional context,
>>> but reviewers should not need to go and read some thread somewhere
>>> to understand "why" part.
>>> All of it should be in the commit log.
>> OK. My original thought is that is a reported problem, so an extra link
>> will be enough. Will try to add more context next time.
>>>> map may still be incorrect (as shown long time ago [1]), so I think
>>>> maybe for other types of map, the atomic update doesn't matter too much.
>>>>
>>>> [1]:
>>>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
>>> A thread from 3 years ago ?! Sorry, it's not helpful to ask
>>> people to page-in such an old context with lots of follow ups
>>> that may or may not be relevant today.
>> Let me reuse part of the diagram above to explain how does the lookup
>> procedure may return a incorrect value:
>>
>> lookup procedure A
>> A: find the old element (instead of the new element)
>>
>>
>> update procedure B
>> B: delete the old element
>> update procedure C on the same CPU:
>>
>>
>> A: the key is matched and return the value in the element
>>
>> C: reuse the old element (overwrite its key and value)
>>
>> A: read the value (it is incorrect, because it has been reused and
>> overwritten)
> ... and it's fine. It's by design. It's an element reuse behavior.
>
> Long ago hashmap had two modes: prealloc (default) and
> NO_PREALLOC (call_rcu + kfree)
>
> The call_rcu part was there to make things safe.
> The memory cannot be kfree-ed to the kernel until RCU GP.
> With bpf_mem_alloc hashmap elements are freed back to bpf_ma
> right away. Hashmap is doing bpf_mem_cache_free()
> (instead of bpf_mem_cache_free_rcu()) because users need speed.
> So since 2022 both prealloc and no_prealloc reuse elements.
> We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP
> that will use _rcu() flavor of freeing into bpf_ma,
> but it has to have a strong reason.
> And soon as we add it the default with prealloc would need
> to use call_rcu() too, right?
> and that becomes nightmare, since bpf prog can easily DoS the system.
> Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern.
> Unlike new things like bpf_obj_new/obj_drop the hashmap
> is unpriv, so concerns are drastically different.
>
> .
I see. Thanks for the detailed explanation. And that is part of the
reason why I proposed adding a seq-counter in the htab-element and
checking the seq-counter during lookup, so at least the lookup will not
return -ENOENT for the concurrent overwriting procedure.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RESEND] [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically
2025-02-27 3:17 ` Alexei Starovoitov
2025-02-27 4:08 ` Hou Tao
@ 2025-03-06 10:22 ` Nick Zavaritsky
1 sibling, 0 replies; 17+ messages in thread
From: Nick Zavaritsky @ 2025-03-06 10:22 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Hou Tao, Zvi Effron, Toke Høiland-Jørgensen, bpf, rcu,
LKML, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Paul E . McKenney, Cody Haas, Hou Tao, Charalampos Stylianopoulos
> On 27. Feb 2025, at 04:17, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Feb 26, 2025 at 6:43 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>
>>>>
>>>> lookup procedure A
>>>> A: find the old element (instead of the new old)
>>>>
>>>> update procedure B
>>>> B: delete the old element
>>>> update procedure C on the same CPU:
>>>> C: reuse the old element (overwrite its key and insert in
>>>> the same bucket)
>>>>
>>>> A: the key is mismatched and return -ENOENT.
>>> This is fine. It's just normal reuse.
>>> Orthogonal to 'update as insert+delete' issue.
>>
>> OK. However, it will break the lookup procedure because it expects it
>> will return an valid result instead of -ENOENT.
>
> What do you mean 'breaks the lookup' ?
> lookup_elem_raw() matches hash, and then it memcmp(key),
> if the element is reused anything can happen.
> Either it succeeds in memcmp() and returns an elem,
> or miscompares in memcmp().
> Both are expected, because elems are reused in place.
>
> And this behavior is expected and not-broken,
> because bpf prog that does lookup on one cpu and deletes
> the same element on the other cpu is asking for trouble.
> bpf infra guarantees the safety of the kernel.
> It doesn't guarantee that bpf progs are sane.
>
>>> It's been a long time since I looked into rcu_nulls details.
>>> Pls help me understand that this new replace_rcu_nulls()
>>> is correct from nulls pov,
>>> If it is then this patch set may be the right answer to non-atomic update.
>>
>> If I understand correctly, only the manipulations of ->first pointer and
>> ->next pointer need to take care of nulls pointer.
>
> hmm. I feel we're still talking past each other.
> See if (get_nulls_value() == ...) in lookup_nulls_elem_raw().
> It's there because of reuse. The lookup can start in one bucket
> and finish in another.
>
>>>
>>> And for the future, please please focus on "why" part in
>>> the cover letter and commit logs instead of "what".
>>>
>>> Since the only thing I got from the log was:
>>> "Currently, the update is not atomic
>>> because the overwrite of existing element happens in a two-steps way,
>>> but the support of atomic update is feasible".
>>>
>>> "is feasible" doesn't explain "why".
>>>
>>> Link to xdp-newbie question is nice for additional context,
>>> but reviewers should not need to go and read some thread somewhere
>>> to understand "why" part.
>>> All of it should be in the commit log.
>>
>> OK. My original thought is that is a reported problem, so an extra link
>> will be enough. Will try to add more context next time.
>>>
>>>> map may still be incorrect (as shown long time ago [1]), so I think
>>>> maybe for other types of map, the atomic update doesn't matter too much.
>>>>
>>>> [1]:
>>>> https://lore.kernel.org/bpf/20221230041151.1231169-1-houtao@huaweicloud.com/
>>> A thread from 3 years ago ?! Sorry, it's not helpful to ask
>>> people to page-in such an old context with lots of follow ups
>>> that may or may not be relevant today.
>> Let me reuse part of the diagram above to explain how does the lookup
>> procedure may return a incorrect value:
>>
>> lookup procedure A
>> A: find the old element (instead of the new element)
>>
>>
>> update procedure B
>> B: delete the old element
>> update procedure C on the same CPU:
>>
>>
>> A: the key is matched and return the value in the element
>>
>> C: reuse the old element (overwrite its key and value)
>>
>> A: read the value (it is incorrect, because it has been reused and
>> overwritten)
>
> ... and it's fine. It's by design. It's an element reuse behavior.
>
> Long ago hashmap had two modes: prealloc (default) and
> NO_PREALLOC (call_rcu + kfree)
>
> The call_rcu part was there to make things safe.
> The memory cannot be kfree-ed to the kernel until RCU GP.
> With bpf_mem_alloc hashmap elements are freed back to bpf_ma
> right away. Hashmap is doing bpf_mem_cache_free()
We (emnify.com) missed this change and kept writing code with an
assumption that NO_PREALLOC implies rcu.
Is there something we can do as of today to reduce the likelihood of an
item getting reused immediately? We are concerned with lookups yielding
bogus results when racing with updates. Worse, a program could corrupt
an unrelated item when writing via a pointer obtained from lookup.
You wrote that "lookup on one cpu and deletes the same element on the
other cpu is asking for trouble.” It puzzles me since user space
updating a map while (say) TC program is consulting the map to make a
routing decision look like a supported and widespread use case.
For us, implications vary in severity, e.g.:
- 1-in-1e? packet mis-delivered (bogus lookup: LOW)
- a tenant getting billed for a packet of another tenant delivered via
satellite and costing USD 0.10 (writing into unrelated item: LOW)
- a network flow state corrupted (writing into unrelated item: MEDIUM)
We need to audit our code to ensure that e.g. a flow state getting
corrupted self-corrects and the damage doesn’t spread.
It would be nice if we as eBPF users could decide whether we wish to
live dangerously or prefer to trade speed for safety, on a case-by-case
basis.
> (instead of bpf_mem_cache_free_rcu()) because users need speed.
> So since 2022 both prealloc and no_prealloc reuse elements.
> We can consider a new flag for the hash map like F_REUSE_AFTER_RCU_GP
> that will use _rcu() flavor of freeing into bpf_ma,
> but it has to have a strong reason.
> And soon as we add it the default with prealloc would need
> to use call_rcu() too, right?
> and that becomes nightmare, since bpf prog can easily DoS the system.
> Even if we use bpf_mem_cache_free_rcu() only, the DoS is a concern.
> Unlike new things like bpf_obj_new/obj_drop the hashmap
> is unpriv, so concerns are drastically different.
Would it be an option to gate F_REUSE_AFTER_RCU under CAP_BPF?
It looks like sleepable programs would need to bpf_rcu_read_lock
explicitly to reap the benefits, but why not.
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-03-06 10:23 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-04 8:28 [PATCH bpf-next 0/3] bpf: Overwrite the htab element atomically Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 1/3] rculist: add hlist_nulls_replace_rcu() helper Hou Tao
2025-02-04 8:28 ` [PATCH bpf-next 2/3] bpf: Overwrite the element in hash map atomically Hou Tao
2025-02-05 1:38 ` [RESEND] " Hou Tao
2025-02-06 15:05 ` Toke Høiland-Jørgensen
2025-02-08 10:16 ` Hou Tao
2025-02-26 3:24 ` Alexei Starovoitov
2025-02-26 4:05 ` Hou Tao
2025-02-26 5:42 ` Alexei Starovoitov
2025-02-26 23:17 ` Zvi Effron
2025-02-27 1:48 ` Hou Tao
2025-02-27 1:59 ` Alexei Starovoitov
2025-02-27 2:43 ` Hou Tao
2025-02-27 3:17 ` Alexei Starovoitov
2025-02-27 4:08 ` Hou Tao
2025-03-06 10:22 ` Nick Zavaritsky
2025-02-04 8:28 ` [PATCH bpf-next 3/3] selftests/bpf: Add test case for atomic htab update Hou Tao
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).