* [PATCH 1/2] bpf: Use copy_map_value_locked() in alloc_htab_elem() for BPF_F_LOCK
2026-04-01 13:50 [PATCH 0/2] bpf: Fix torn writes in non-prealloc htab with BPF_F_LOCK Mykyta Yatsenko
@ 2026-04-01 13:50 ` Mykyta Yatsenko
2026-04-01 13:50 ` [PATCH 2/2] selftests/bpf: Add torn write detection test for htab BPF_F_LOCK Mykyta Yatsenko
` (2 subsequent siblings)
3 siblings, 0 replies; 7+ messages in thread
From: Mykyta Yatsenko @ 2026-04-01 13:50 UTC (permalink / raw)
To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
Cc: Aaron Esau, Mykyta Yatsenko
From: Mykyta Yatsenko <yatsenko@meta.com>
When a BPF_F_LOCK update races with a concurrent delete, the freed
element can be immediately recycled by alloc_htab_elem(). The fast path
in htab_map_update_elem() performs a lockless lookup and then calls
copy_map_value_locked() under the element's spin_lock. If
alloc_htab_elem() recycles the same memory, it overwrites the value
with plain copy_map_value(), without taking the spin_lock, causing
torn writes.
Use copy_map_value_locked() when BPF_F_LOCK is set so the new element's
value is written under the embedded spin_lock, serializing against any
stale lock holders.
Fixes: 96049f3afd50 ("bpf: introduce BPF_F_LOCK flag")
Reported-by: Aaron Esau <aaron1esau@gmail.com>
Closes: https://lore.kernel.org/all/CADucPGRvSRpkneb94dPP08YkOHgNgBnskTK6myUag_Mkjimihg@mail.gmail.com/
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
kernel/bpf/hashtab.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index bc6bc8bb871d..f7ac1ec7be8b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1138,6 +1138,10 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
} else if (fd_htab_map_needs_adjust(htab)) {
size = round_up(size, 8);
memcpy(htab_elem_value(l_new, key_size), value, size);
+ } else if (map_flags & BPF_F_LOCK) {
+ copy_map_value_locked(&htab->map,
+ htab_elem_value(l_new, key_size),
+ value, false);
} else {
copy_map_value(&htab->map, htab_elem_value(l_new, key_size), value);
}
--
2.52.0
* [PATCH 2/2] selftests/bpf: Add torn write detection test for htab BPF_F_LOCK
From: Mykyta Yatsenko @ 2026-04-01 13:50 UTC (permalink / raw)
To: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor
Cc: Aaron Esau, Mykyta Yatsenko
From: Mykyta Yatsenko <yatsenko@meta.com>
Add a consistency subtest to htab_reuse that detects torn writes
caused by the BPF_F_LOCK lockless update racing with element
reallocation in alloc_htab_elem().
The test uses three thread roles started simultaneously via a pipe:
- locked updaters: BPF_F_LOCK|BPF_EXIST in-place updates
- delete+update workers: delete then BPF_ANY|BPF_F_LOCK insert
- locked readers: BPF_F_LOCK lookup checking value consistency
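For reference, the pipe start gate that releases all three thread roles at once can be sketched in plain pthreads like this (hypothetical names, not the selftest code itself):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <unistd.h>

/* Start gate: every worker blocks in read() on the pipe's read end.
 * When the parent close()s the write end, all pending reads return 0
 * (EOF) at the same time, so the workers start together instead of
 * staggering as pthread_create() returns. */
static int start_fd;
static volatile bool started[3];

static void *worker(void *arg)
{
	long id = (long)arg;
	char buf;

	read(start_fd, &buf, 1);	/* blocks until the write end closes */
	started[id] = true;
	return NULL;
}
```

This is the same pattern the selftest uses: create the pipe, hand pipefd[0] to every thread, then close(pipefd[1]) once all threads are spawned.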
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
---
.../testing/selftests/bpf/prog_tests/htab_reuse.c | 169 ++++++++++++++++++++-
tools/testing/selftests/bpf/progs/htab_reuse.c | 16 ++
2 files changed, 184 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/htab_reuse.c b/tools/testing/selftests/bpf/prog_tests/htab_reuse.c
index a742dd994d60..d7c3df165adc 100644
--- a/tools/testing/selftests/bpf/prog_tests/htab_reuse.c
+++ b/tools/testing/selftests/bpf/prog_tests/htab_reuse.c
@@ -59,7 +59,7 @@ static void *htab_update_fn(void *arg)
return NULL;
}
-void test_htab_reuse(void)
+static void test_htab_reuse_basic(void)
{
unsigned int i, wr_nr = 1, rd_nr = 4;
pthread_t tids[wr_nr + rd_nr];
@@ -99,3 +99,170 @@ void test_htab_reuse(void)
}
htab_reuse__destroy(skel);
}
+
+/*
+ * Writes consistency test for BPF_F_LOCK update
+ *
+ * The race:
+ * 1. Thread A: BPF_F_LOCK|BPF_EXIST update
+ * 2. Thread B: delete element then update it with BPF_ANY
+ */
+
+struct htab_val_large {
+ struct bpf_spin_lock lock;
+ __u32 seq;
+ __u64 data[256];
+};
+
+struct consistency_ctx {
+ int fd;
+ int start_fd;
+ int loop;
+ volatile bool torn_write;
+};
+
+static void wait_for_start(int fd)
+{
+ char buf;
+
+ read(fd, &buf, 1);
+}
+
+static void *locked_update_fn(void *arg)
+{
+ struct consistency_ctx *ctx = arg;
+ struct htab_val_large value;
+ unsigned int key = 1;
+ int i;
+
+ memset(&value, 0xAA, sizeof(value));
+ wait_for_start(ctx->start_fd);
+
+ for (i = 0; i < ctx->loop; i++) {
+ value.seq = i;
+ bpf_map_update_elem(ctx->fd, &key, &value,
+ BPF_F_LOCK | BPF_EXIST);
+ }
+
+ return NULL;
+}
+
+/* Delete + update: removes the element then re-creates it with BPF_ANY. */
+static void *delete_update_fn(void *arg)
+{
+ struct consistency_ctx *ctx = arg;
+ struct htab_val_large value;
+ unsigned int key = 1;
+ int i;
+
+ memset(&value, 0xBB, sizeof(value));
+
+ wait_for_start(ctx->start_fd);
+
+ for (i = 0; i < ctx->loop; i++) {
+ value.seq = i;
+ bpf_map_delete_elem(ctx->fd, &key);
+ bpf_map_update_elem(ctx->fd, &key, &value, BPF_ANY | BPF_F_LOCK);
+ }
+
+ return NULL;
+}
+
+static void *locked_lookup_fn(void *arg)
+{
+ struct consistency_ctx *ctx = arg;
+ struct htab_val_large value;
+ unsigned int key = 1;
+ int i, j;
+
+ wait_for_start(ctx->start_fd);
+
+ for (i = 0; i < ctx->loop && !ctx->torn_write; i++) {
+ if (bpf_map_lookup_elem_flags(ctx->fd, &key, &value, BPF_F_LOCK))
+ continue;
+
+ for (j = 0; j < 256; j++) {
+ if (value.data[j] != value.data[0]) {
+ ctx->torn_write = true;
+ return NULL;
+ }
+ }
+ }
+
+ return NULL;
+}
+
+static void test_htab_reuse_consistency(void)
+{
+ int threads_total = 6, threads = 2;
+ pthread_t tids[threads_total];
+ struct consistency_ctx ctx;
+ struct htab_val_large seed;
+ struct htab_reuse *skel;
+ unsigned int key = 1, i;
+ int pipefd[2];
+ int err;
+
+ skel = htab_reuse__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "htab_reuse__open_and_load"))
+ return;
+
+ if (!ASSERT_OK(pipe(pipefd), "pipe"))
+ goto out;
+
+ ctx.fd = bpf_map__fd(skel->maps.htab_lock_consistency);
+ ctx.start_fd = pipefd[0];
+ ctx.loop = 100000;
+ ctx.torn_write = false;
+
+ /* Seed the element so locked updaters have something to find */
+ memset(&seed, 0xBB, sizeof(seed));
+ err = bpf_map_update_elem(ctx.fd, &key, &seed, BPF_ANY);
+ if (!ASSERT_OK(err, "seed_element"))
+ goto close_pipe;
+
+ memset(tids, 0, sizeof(tids));
+ for (i = 0; i < threads; i++) {
+ err = pthread_create(&tids[i], NULL, locked_update_fn, &ctx);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto stop;
+ }
+ for (i = 0; i < threads; i++) {
+ err = pthread_create(&tids[threads + i], NULL, delete_update_fn, &ctx);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto stop;
+ }
+ for (i = 0; i < threads; i++) {
+ err = pthread_create(&tids[threads * 2 + i], NULL, locked_lookup_fn, &ctx);
+ if (!ASSERT_OK(err, "pthread_create"))
+ goto stop;
+ }
+
+ /* Release all threads simultaneously */
+ close(pipefd[1]);
+ pipefd[1] = -1;
+
+stop:
+ for (i = 0; i < threads_total; i++) {
+ if (!tids[i])
+ continue;
+ pthread_join(tids[i], NULL);
+ }
+
+ ASSERT_FALSE(ctx.torn_write, "no torn writes detected");
+
+close_pipe:
+ if (pipefd[1] >= 0)
+ close(pipefd[1]);
+ close(pipefd[0]);
+out:
+ htab_reuse__destroy(skel);
+}
+
+void test_htab_reuse(void)
+{
+ if (test__start_subtest("basic"))
+ test_htab_reuse_basic();
+ if (test__start_subtest("consistency"))
+ test_htab_reuse_consistency();
+}
diff --git a/tools/testing/selftests/bpf/progs/htab_reuse.c b/tools/testing/selftests/bpf/progs/htab_reuse.c
index 7f7368cb3095..1c7fa7ee45ee 100644
--- a/tools/testing/selftests/bpf/progs/htab_reuse.c
+++ b/tools/testing/selftests/bpf/progs/htab_reuse.c
@@ -17,3 +17,19 @@ struct {
__type(value, struct htab_val);
__uint(map_flags, BPF_F_NO_PREALLOC);
} htab SEC(".maps");
+
+#define HTAB_NDATA 256
+
+struct htab_val_large {
+ struct bpf_spin_lock lock;
+ __u32 seq;
+ __u64 data[HTAB_NDATA];
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __uint(max_entries, 8);
+ __type(key, unsigned int);
+ __type(value, struct htab_val_large);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+} htab_lock_consistency SEC(".maps");
--
2.52.0
* Re: [PATCH 0/2] bpf: Fix torn writes in non-prealloc htab with BPF_F_LOCK
From: Kumar Kartikeya Dwivedi @ 2026-04-01 15:21 UTC (permalink / raw)
To: Mykyta Yatsenko
Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Aaron Esau,
Mykyta Yatsenko
On Wed, 1 Apr 2026 at 15:51, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>
> A torn write issue was reported in htab_map_update_elem() with
> BPF_F_LOCK on hash maps. The BPF_F_LOCK fast path performs
> a lockless lookup and copies the value under the element's embedded
> spin_lock. A concurrent delete can free the element via
> bpf_mem_cache_free(), which allows immediate reuse. When
> alloc_htab_elem() recycles the same memory, it writes the value with
> plain copy_map_value() without taking the spin_lock, racing with the
> stale lock holder and producing torn writes.
>
> Patch 1 fixes alloc_htab_elem() to use copy_map_value_locked() when
> BPF_F_LOCK is set.
>
> Patch 2 adds a selftest that reliably detects the torn writes on an
> unpatched kernel.
>
> Reported-by: Aaron Esau <aaron1esau@gmail.com>
> Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
> ---
The fix looks correct, but I think the test needs more work.
After reverting the fix, the test continues to pass for me the
majority of the time, so it's not deterministic. Any such test is
essentially best-effort: when a trial doesn't produce a fault, we
cannot tell whether the fix is correct or whether we simply failed to
reproduce the race.
To be fair, I am not sure we can construct a deterministic test
without controlling thread schedules, so it might not be worth it and
we can continue with the best-effort test you have. It definitely has
some value, since I don't observe any failures on repeated runs but
can observe some without the fix.
One way I considered strengthening the current test is to pin threads
to two CPUs. One thread will constantly lookup and verify that all
bytes are the same. This will use BPF_F_LOCK, ensuring serialization.
On the other CPU we constantly do delete + update with BPF_F_LOCK, and
keep changing the byte sequence we update. So aaaa -> bbbb -> cccc,
etc. Set max_entries = 1 so that we ensure the same element keeps
getting recycled constantly. I didn't try this but it might improve
the chances of success, esp. if the map value is big, such that the
window of reading and writing on either side can be long. We might
also want to do repeat = 1000 or more and then handle this sequence
copy in a tight loop inside the kernel in the program itself. This
ensures that precious time on either side is not lost bouncing back
and forth from user space to the kernel due to bpf_prog_test_run().
Thoughts?
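(The pinning step could look roughly like this; untested userspace sketch, and pin_self_to_cpu is a made-up helper name:)

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU so the locked reader and the
 * delete+update worker each monopolize one core, keeping both sides
 * running back-to-back and widening the race window. */
static int pin_self_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```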
> [...]
>
* Re: [PATCH 0/2] bpf: Fix torn writes in non-prealloc htab with BPF_F_LOCK
From: Mykyta Yatsenko @ 2026-04-01 15:33 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Aaron Esau,
Mykyta Yatsenko
On 4/1/26 4:21 PM, Kumar Kartikeya Dwivedi wrote:
> On Wed, 1 Apr 2026 at 15:51, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>>
>> [...]
>
> The fix looks correct, but I think the test needs more work.
>
> After reverting the fix, the test continues to pass for me the
> majority of the time, so it's not deterministic. Any such test is
> essentially best-effort: when a trial doesn't produce a fault, we
> cannot tell whether the fix is correct or whether we simply failed to
> reproduce the race.
>
> To be fair, I am not sure we can construct a deterministic test
> without controlling thread schedules, so it might not be worth it and
> we can continue with the best-effort test you have. It definitely has
> some value, since I don't observe any failures on repeated runs but
> can observe some without the fix.
>
> One way I considered strengthening the current test is to pin threads
> to two CPUs. One thread will constantly lookup and verify that all
> bytes are the same. This will use BPF_F_LOCK, ensuring serialization.
> On the other CPU we constantly do delete + update with BPF_F_LOCK, and
> keep changing the byte sequence we update. So aaaa -> bbbb -> cccc,
> etc. Set max_entries = 1 so that we ensure the same element keeps
> getting recycled constantly. I didn't try this but it might improve
> the chances of success, esp. if the map value is big, such that the
> window of reading and writing on either side can be long. We might
> also want to do repeat = 1000 or more and then handle this sequence
> copy in a tight loop inside the kernel in the program itself. This
> ensures that precious time on either side is not lost bouncing back
> and forth from user space to the kernel due to bpf_prog_test_run().
>
> Thoughts?
>
Yes, the test does not fail 100% of the time on the baseline, but this
particular implementation fails most of the time for me (8-9 out of
10). It passes consistently with the fix.
Yes, pinning threads and making the map value bigger would probably
increase the failure rate on the baseline. Thanks for checking the test
on your setup and for the suggestions; I'm happy to implement them in
the next version.
>> [...]
>>
* Re: [PATCH 0/2] bpf: Fix torn writes in non-prealloc htab with BPF_F_LOCK
From: Kumar Kartikeya Dwivedi @ 2026-04-01 15:42 UTC (permalink / raw)
To: Mykyta Yatsenko
Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, Aaron Esau,
Mykyta Yatsenko
On Wed, 1 Apr 2026 at 17:33, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
>
> On 4/1/26 4:21 PM, Kumar Kartikeya Dwivedi wrote:
> > On Wed, 1 Apr 2026 at 15:51, Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> wrote:
> >>
> >> [...]
> >
> > [...]
> >
>
> Yes, the test does not fail 100% of the time on the baseline, but this
> particular implementation fails most of the time for me (8-9 out of
> 10). It passes consistently with the fix.
>
> Yes, pinning threads and making the map value bigger would probably
> increase the failure rate on the baseline. Thanks for checking the test
> on your setup and for the suggestions; I'm happy to implement them in
> the next version.
>
Maybe try pinning and a bigger map value, but if that doesn't help
much, I think the current thing is fine to go ahead with. The fix
looks correct and it wouldn't make sense to waste more time on it.
> >> [...]
> >>
>
* Re: [PATCH 0/2] bpf: Fix torn writes in non-prealloc htab with BPF_F_LOCK
From: patchwork-bot+netdevbpf @ 2026-04-06 1:50 UTC (permalink / raw)
To: Mykyta Yatsenko
Cc: bpf, ast, andrii, daniel, kafai, kernel-team, eddyz87, memxor,
aaron1esau, yatsenko
Hello:
This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:
On Wed, 01 Apr 2026 06:50:35 -0700 you wrote:
> A torn write issue was reported in htab_map_update_elem() with
> BPF_F_LOCK on hash maps. The BPF_F_LOCK fast path performs
> a lockless lookup and copies the value under the element's embedded
> spin_lock. A concurrent delete can free the element via
> bpf_mem_cache_free(), which allows immediate reuse. When
> alloc_htab_elem() recycles the same memory, it writes the value with
> plain copy_map_value() without taking the spin_lock, racing with the
> stale lock holder and producing torn writes.
>
> [...]
Here is the summary with links:
- [1/2] bpf: Use copy_map_value_locked() in alloc_htab_elem() for BPF_F_LOCK
https://git.kernel.org/bpf/bpf-next/c/07738bc566c3
- [2/2] selftests/bpf: Add torn write detection test for htab BPF_F_LOCK
https://git.kernel.org/bpf/bpf-next/c/f64eb44ce906
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html