* [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps
@ 2025-10-26 15:39 Leon Hwang
2025-10-26 15:39 ` [PATCH bpf v3 1/4] bpf: Free special fields when update [lru_,]percpu_hash maps Leon Hwang
` (4 more replies)
0 siblings, 5 replies; 20+ messages in thread
From: Leon Hwang @ 2025-10-26 15:39 UTC (permalink / raw)
To: bpf
Cc: ast, andrii, daniel, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, memxor, linux-kernel,
kernel-patches-bot, Leon Hwang
In the discussion thread
"[PATCH bpf-next v9 0/7] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps"[1],
it was pointed out that missing calls to bpf_obj_free_fields() could
lead to memory leaks.
A selftest was added to confirm that this is indeed a real issue: the
refcount of a BPF_KPTR_REF field is not decremented when
bpf_obj_free_fields() is missing after copy_map_value[,_long]().
Further inspection of copy_map_value[,_long]() call sites revealed two
locations affected by this issue:
1. pcpu_copy_value()
2. htab_map_update_elem() when used with BPF_F_LOCK
Similar issues occur when updating local storage maps.
This series fixes the issues by properly calling bpf_obj_free_fields()
(or check_and_free_fields()) after copy_map_value[,_long]() and adds
selftests to verify the fix.
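The fix pattern is the same at every affected call site: release the old
special fields right after the in-place copy. A minimal sketch ('dst'
stands in for the in-place or per-cpu destination value; the exact call
sites are in the individual patches):

	/* copy_map_value() copies around the special fields, so the old
	 * kptr is still stored in dst after the copy; release it explicitly.
	 */
	copy_map_value(&htab->map, dst, value);
	bpf_obj_free_fields(htab->map.record, dst);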
Changes:
v2 -> v3:
* Free special fields when updating local storage maps without BPF_F_LOCK.
* Add test to verify decrementing refcount when updating cgroup local
storage maps without BPF_F_LOCK.
* Address review from AI bot:
* Slow path with BPF_F_LOCK (around line 642-646) in
'bpf_local_storage.c'.
* https://lore.kernel.org/bpf/20251020164608.20536-1-leon.hwang@linux.dev/
v1 -> v2:
* Add test to verify decrementing refcount when updating cgroup local
storage maps with BPF_F_LOCK.
* Address review from AI bot:
* Fast path without bucket lock (around line 610) in
'bpf_local_storage.c'.
* https://lore.kernel.org/bpf/20251016145801.47552-1-leon.hwang@linux.dev/
Leon Hwang (4):
bpf: Free special fields when update [lru_,]percpu_hash maps
bpf: Free special fields when update hash maps with BPF_F_LOCK
bpf: Free special fields when update local storage maps
selftests/bpf: Add tests to verify freeing the special fields when
update hash and local storage maps
kernel/bpf/bpf_local_storage.c | 3 +
kernel/bpf/hashtab.c | 4 +
.../bpf/prog_tests/refcounted_kptr.c | 178 +++++++++++++++++-
.../selftests/bpf/progs/refcounted_kptr.c | 160 ++++++++++++++++
4 files changed, 344 insertions(+), 1 deletion(-)
--
2.51.0
* [PATCH bpf v3 1/4] bpf: Free special fields when update [lru_,]percpu_hash maps
2025-10-26 15:39 [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps Leon Hwang
@ 2025-10-26 15:39 ` Leon Hwang
2025-10-28 18:03 ` Andrii Nakryiko
2025-10-26 15:39 ` [PATCH bpf v3 2/4] bpf: Free special fields when update hash maps with BPF_F_LOCK Leon Hwang
` (3 subsequent siblings)
4 siblings, 1 reply; 20+ messages in thread
From: Leon Hwang @ 2025-10-26 15:39 UTC (permalink / raw)
To: bpf
Cc: ast, andrii, daniel, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, memxor, linux-kernel,
kernel-patches-bot, Leon Hwang
As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing
calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the
memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the
map gets freed.
Fix this by calling 'bpf_obj_free_fields()' after
'copy_map_value[,_long]()' in 'pcpu_copy_value()'.
Fixes: 65334e64a493 ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap")
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
kernel/bpf/hashtab.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index c2fcd0cd51e51..26308adc9ccb3 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -950,12 +950,14 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
if (!onallcpus) {
/* copy true value_size bytes */
copy_map_value(&htab->map, this_cpu_ptr(pptr), value);
+ bpf_obj_free_fields(htab->map.record, this_cpu_ptr(pptr));
} else {
u32 size = round_up(htab->map.value_size, 8);
int off = 0, cpu;
for_each_possible_cpu(cpu) {
copy_map_value_long(&htab->map, per_cpu_ptr(pptr, cpu), value + off);
+ bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
off += size;
}
}
--
2.51.0
* [PATCH bpf v3 2/4] bpf: Free special fields when update hash maps with BPF_F_LOCK
2025-10-26 15:39 [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps Leon Hwang
2025-10-26 15:39 ` [PATCH bpf v3 1/4] bpf: Free special fields when update [lru_,]percpu_hash maps Leon Hwang
@ 2025-10-26 15:39 ` Leon Hwang
2025-10-26 15:39 ` [PATCH bpf v3 3/4] bpf: Free special fields when update local storage maps Leon Hwang
` (2 subsequent siblings)
4 siblings, 0 replies; 20+ messages in thread
From: Leon Hwang @ 2025-10-26 15:39 UTC (permalink / raw)
To: bpf
Cc: ast, andrii, daniel, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, memxor, linux-kernel,
kernel-patches-bot, Leon Hwang
When updating hash maps with BPF_F_LOCK, the special fields were not
freed after being replaced. This could cause memory referenced by
BPF_KPTR_{REF,PERCPU} fields to be held until the map gets freed.
Fix this by calling 'check_and_free_fields()' after
'copy_map_value_locked()' to properly release the old fields.
Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
kernel/bpf/hashtab.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 26308adc9ccb3..65009ea3e9379 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1124,6 +1124,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
copy_map_value_locked(map,
htab_elem_value(l_old, key_size),
value, false);
+ check_and_free_fields(htab, l_old);
return 0;
}
/* fall through, grab the bucket lock and lookup again.
@@ -1152,6 +1153,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
copy_map_value_locked(map,
htab_elem_value(l_old, key_size),
value, false);
+ check_and_free_fields(htab, l_old);
ret = 0;
goto err;
}
--
2.51.0
* [PATCH bpf v3 3/4] bpf: Free special fields when update local storage maps
2025-10-26 15:39 [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps Leon Hwang
2025-10-26 15:39 ` [PATCH bpf v3 1/4] bpf: Free special fields when update [lru_,]percpu_hash maps Leon Hwang
2025-10-26 15:39 ` [PATCH bpf v3 2/4] bpf: Free special fields when update hash maps with BPF_F_LOCK Leon Hwang
@ 2025-10-26 15:39 ` Leon Hwang
2025-10-27 15:44 ` Amery Hung
2025-10-26 15:40 ` [PATCH bpf v3 4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and " Leon Hwang
2025-10-28 18:10 ` [PATCH bpf v3 0/4] bpf: Free " patchwork-bot+netdevbpf
4 siblings, 1 reply; 20+ messages in thread
From: Leon Hwang @ 2025-10-26 15:39 UTC (permalink / raw)
To: bpf
Cc: ast, andrii, daniel, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, memxor, linux-kernel,
kernel-patches-bot, Leon Hwang
When updating local storage maps with BPF_F_LOCK on the fast path, the
special fields were not freed after being replaced. This could cause
memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the
map gets freed.
Similarly, on the other path, the old sdata's special fields were never
freed regardless of whether BPF_F_LOCK was used, causing the same issue.
Fix this by calling 'bpf_obj_free_fields()' after
'copy_map_value_locked()' to properly release the old fields.
Fixes: 9db44fdd8105 ("bpf: Support kptrs in local storage maps")
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
kernel/bpf/bpf_local_storage.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index b931fbceb54da..8e3aea4e07c50 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -609,6 +609,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
copy_map_value_locked(&smap->map, old_sdata->data,
value, false);
+ bpf_obj_free_fields(smap->map.record, old_sdata->data);
return old_sdata;
}
}
@@ -641,6 +642,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
if (old_sdata && (map_flags & BPF_F_LOCK)) {
copy_map_value_locked(&smap->map, old_sdata->data, value,
false);
+ bpf_obj_free_fields(smap->map.record, old_sdata->data);
selem = SELEM(old_sdata);
goto unlock;
}
@@ -654,6 +656,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
/* Third, remove old selem, SELEM(old_sdata) */
if (old_sdata) {
+ bpf_obj_free_fields(smap->map.record, old_sdata->data);
bpf_selem_unlink_map(SELEM(old_sdata));
bpf_selem_unlink_storage_nolock(local_storage, SELEM(old_sdata),
true, &old_selem_free_list);
--
2.51.0
* [PATCH bpf v3 4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and local storage maps
2025-10-26 15:39 [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps Leon Hwang
` (2 preceding siblings ...)
2025-10-26 15:39 ` [PATCH bpf v3 3/4] bpf: Free special fields when update local storage maps Leon Hwang
@ 2025-10-26 15:40 ` Leon Hwang
2025-10-27 16:34 ` Amery Hung
2025-10-28 18:10 ` [PATCH bpf v3 0/4] bpf: Free " patchwork-bot+netdevbpf
4 siblings, 1 reply; 20+ messages in thread
From: Leon Hwang @ 2025-10-26 15:40 UTC (permalink / raw)
To: bpf
Cc: ast, andrii, daniel, martin.lau, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, memxor, linux-kernel,
kernel-patches-bot, Leon Hwang
Add tests to verify that updating hash and local storage maps decrements
refcount when BPF_KPTR_REF objects are involved.
The tests perform the following steps:
1. Call update_elem() to insert an initial value.
2. Use bpf_refcount_acquire() to increment the refcount.
3. Store the node pointer in the map value.
4. Add the node to a linked list.
5. Probe-read the refcount and verify it is *2*.
6. Call update_elem() again to trigger refcount decrement.
7. Probe-read the refcount and verify it is *1*.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
.../bpf/prog_tests/refcounted_kptr.c | 178 +++++++++++++++++-
.../selftests/bpf/progs/refcounted_kptr.c | 160 ++++++++++++++++
2 files changed, 337 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c b/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
index d6bd5e16e6372..0a60330a1f4b3 100644
--- a/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
@@ -3,7 +3,7 @@
#include <test_progs.h>
#include <network_helpers.h>
-
+#include "cgroup_helpers.h"
#include "refcounted_kptr.skel.h"
#include "refcounted_kptr_fail.skel.h"
@@ -44,3 +44,179 @@ void test_refcounted_kptr_wrong_owner(void)
ASSERT_OK(opts.retval, "rbtree_wrong_owner_remove_fail_a2 retval");
refcounted_kptr__destroy(skel);
}
+
+static void test_refcnt_leak(void *values, size_t values_sz, u64 flags, struct bpf_map *map,
+ struct bpf_program *prog_leak, struct bpf_program *prog_check)
+{
+ int ret, fd, key = 0;
+ LIBBPF_OPTS(bpf_test_run_opts, opts,
+ .data_in = &pkt_v4,
+ .data_size_in = sizeof(pkt_v4),
+ .repeat = 1,
+ );
+
+ ret = bpf_map__update_elem(map, &key, sizeof(key), values, values_sz, flags);
+ if (!ASSERT_OK(ret, "bpf_map__update_elem init"))
+ return;
+
+ fd = bpf_program__fd(prog_leak);
+ ret = bpf_prog_test_run_opts(fd, &opts);
+ if (!ASSERT_OK(ret, "test_run_opts"))
+ return;
+ if (!ASSERT_EQ(opts.retval, 2, "retval refcount"))
+ return;
+
+ ret = bpf_map__update_elem(map, &key, sizeof(key), values, values_sz, flags);
+ if (!ASSERT_OK(ret, "bpf_map__update_elem dec refcount"))
+ return;
+
+ fd = bpf_program__fd(prog_check);
+ ret = bpf_prog_test_run_opts(fd, &opts);
+ ASSERT_OK(ret, "test_run_opts");
+ ASSERT_EQ(opts.retval, 1, "retval");
+}
+
+static void test_percpu_hash_refcount_leak(void)
+{
+ struct refcounted_kptr *skel;
+ size_t values_sz;
+ u64 *values;
+ int cpu_nr;
+
+ cpu_nr = libbpf_num_possible_cpus();
+ if (!ASSERT_GT(cpu_nr, 0, "libbpf_num_possible_cpus"))
+ return;
+
+ values = calloc(cpu_nr, sizeof(u64));
+ if (!ASSERT_OK_PTR(values, "calloc values"))
+ return;
+
+ skel = refcounted_kptr__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "refcounted_kptr__open_and_load")) {
+ free(values);
+ return;
+ }
+
+ values_sz = cpu_nr * sizeof(u64);
+ memset(values, 0, values_sz);
+
+ test_refcnt_leak(values, values_sz, 0, skel->maps.pcpu_hash,
+ skel->progs.pcpu_hash_refcount_leak,
+ skel->progs.check_pcpu_hash_refcount);
+
+ refcounted_kptr__destroy(skel);
+ free(values);
+}
+
+struct lock_map_value {
+ u64 kptr;
+ struct bpf_spin_lock lock;
+ int value;
+};
+
+static void test_hash_lock_refcount_leak(void)
+{
+ struct lock_map_value value = {};
+ struct refcounted_kptr *skel;
+
+ skel = refcounted_kptr__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "refcounted_kptr__open_and_load"))
+ return;
+
+ test_refcnt_leak(&value, sizeof(value), BPF_F_LOCK, skel->maps.lock_hash,
+ skel->progs.hash_lock_refcount_leak,
+ skel->progs.check_hash_lock_refcount);
+
+ refcounted_kptr__destroy(skel);
+}
+
+static void test_cgrp_storage_refcount_leak(u64 flags)
+{
+ int server_fd = -1, client_fd = -1;
+ struct lock_map_value value = {};
+ struct refcounted_kptr *skel;
+ struct bpf_link *link;
+ struct bpf_map *map;
+ int cgroup, err;
+
+ cgroup = test__join_cgroup("/cg_refcount_leak");
+ if (!ASSERT_GE(cgroup, 0, "test__join_cgroup"))
+ return;
+
+ skel = refcounted_kptr__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "refcounted_kptr__open_and_load"))
+ goto out;
+
+ link = bpf_program__attach_cgroup(skel->progs.cgroup_storage_refcount_leak, cgroup);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach_cgroup"))
+ goto out;
+ skel->links.cgroup_storage_refcount_leak = link;
+
+ server_fd = start_server(AF_INET6, SOCK_STREAM, "::1", 0, 0);
+ if (!ASSERT_GE(server_fd, 0, "start_server"))
+ goto out;
+
+ client_fd = connect_to_fd(server_fd, 0);
+ if (!ASSERT_GE(client_fd, 0, "connect_to_fd"))
+ goto out;
+
+ map = skel->maps.cgrp_strg;
+ err = bpf_map__lookup_elem(map, &cgroup, sizeof(cgroup), &value, sizeof(value), flags);
+ if (!ASSERT_OK(err, "bpf_map__lookup_elem"))
+ goto out;
+
+ ASSERT_EQ(value.value, 2, "refcount");
+
+ err = bpf_map__update_elem(map, &cgroup, sizeof(cgroup), &value, sizeof(value), flags);
+ if (!ASSERT_OK(err, "bpf_map__update_elem"))
+ goto out;
+
+ err = bpf_link__detach(skel->links.cgroup_storage_refcount_leak);
+ if (!ASSERT_OK(err, "bpf_link__detach"))
+ goto out;
+
+ link = bpf_program__attach(skel->progs.check_cgroup_storage_refcount);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
+ goto out;
+ skel->links.check_cgroup_storage_refcount = link;
+
+ close(client_fd);
+ client_fd = connect_to_fd(server_fd, 0);
+ if (!ASSERT_GE(client_fd, 0, "connect_to_fd"))
+ goto out;
+
+ err = bpf_map__lookup_elem(map, &cgroup, sizeof(cgroup), &value, sizeof(value), flags);
+ if (!ASSERT_OK(err, "bpf_map__lookup_elem"))
+ goto out;
+
+ ASSERT_EQ(value.value, 1, "refcount");
+out:
+ close(cgroup);
+ refcounted_kptr__destroy(skel);
+ if (client_fd >= 0)
+ close(client_fd);
+ if (server_fd >= 0)
+ close(server_fd);
+}
+
+static void test_cgroup_storage_refcount_leak(void)
+{
+ test_cgrp_storage_refcount_leak(0);
+}
+
+static void test_cgroup_storage_lock_refcount_leak(void)
+{
+ test_cgrp_storage_refcount_leak(BPF_F_LOCK);
+}
+
+void test_kptr_refcount_leak(void)
+{
+ if (test__start_subtest("percpu_hash_refcount_leak"))
+ test_percpu_hash_refcount_leak();
+ if (test__start_subtest("hash_lock_refcount_leak"))
+ test_hash_lock_refcount_leak();
+ if (test__start_subtest("cgroup_storage_refcount_leak"))
+ test_cgroup_storage_refcount_leak();
+ if (test__start_subtest("cgroup_storage_lock_refcount_leak"))
+ test_cgroup_storage_lock_refcount_leak();
+}
diff --git a/tools/testing/selftests/bpf/progs/refcounted_kptr.c b/tools/testing/selftests/bpf/progs/refcounted_kptr.c
index 893a4fdb4b6e9..09efae9537c9b 100644
--- a/tools/testing/selftests/bpf/progs/refcounted_kptr.c
+++ b/tools/testing/selftests/bpf/progs/refcounted_kptr.c
@@ -7,6 +7,7 @@
#include <bpf/bpf_core_read.h>
#include "bpf_misc.h"
#include "bpf_experimental.h"
+#include "bpf_tracing_net.h"
extern void bpf_rcu_read_lock(void) __ksym;
extern void bpf_rcu_read_unlock(void) __ksym;
@@ -568,4 +569,163 @@ int BPF_PROG(rbtree_sleepable_rcu_no_explicit_rcu_lock,
return 0;
}
+private(leak) u64 ref;
+
+static u32 probe_read_refcount(void)
+{
+ u32 refcnt;
+
+ bpf_probe_read_kernel(&refcnt, sizeof(refcnt), (void *) ref);
+ return refcnt;
+}
+
+static int __insert_in_list(struct bpf_list_head *head, struct bpf_spin_lock *lock,
+ struct node_data __kptr **node)
+{
+ struct node_data *n, *m;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return -1;
+
+ m = bpf_refcount_acquire(n);
+ n = bpf_kptr_xchg(node, n);
+ if (n) {
+ bpf_obj_drop(n);
+ bpf_obj_drop(m);
+ return -2;
+ }
+
+ bpf_spin_lock(lock);
+ bpf_list_push_front(head, &m->l);
+ ref = (u64)(void *) &m->ref;
+ bpf_spin_unlock(lock);
+ return probe_read_refcount();
+}
+
+static void *__lookup_map(void *map)
+{
+ int key = 0;
+
+ return bpf_map_lookup_elem(map, &key);
+}
+
+struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
+ __type(key, int);
+ __type(value, struct map_value);
+ __uint(max_entries, 1);
+} pcpu_hash SEC(".maps");
+
+SEC("tc")
+int pcpu_hash_refcount_leak(void *ctx)
+{
+ struct map_value *v;
+
+ v = __lookup_map(&pcpu_hash);
+ if (!v)
+ return 0;
+
+ return __insert_in_list(&head, &lock, &v->node);
+}
+
+SEC("tc")
+int check_pcpu_hash_refcount(void *ctx)
+{
+ return probe_read_refcount();
+}
+
+struct lock_map_value {
+ struct node_data __kptr *node;
+ struct bpf_spin_lock lock;
+ int value;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, int);
+ __type(value, struct lock_map_value);
+ __uint(max_entries, 1);
+} lock_hash SEC(".maps");
+
+SEC("tc")
+int hash_lock_refcount_leak(void *ctx)
+{
+ struct lock_map_value *v;
+
+ v = __lookup_map(&lock_hash);
+ if (!v)
+ return 0;
+
+ bpf_spin_lock(&v->lock);
+ v->value = 42;
+ bpf_spin_unlock(&v->lock);
+ return __insert_in_list(&head, &lock, &v->node);
+}
+
+SEC("tc")
+int check_hash_lock_refcount(void *ctx)
+{
+ return probe_read_refcount();
+}
+
+struct {
+ __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct lock_map_value);
+} cgrp_strg SEC(".maps");
+
+SEC("cgroup/connect6")
+int cgroup_storage_refcount_leak(struct bpf_sock_addr *ctx)
+{
+ struct lock_map_value *v;
+ struct tcp_sock *tsk;
+ struct bpf_sock *sk;
+ u32 refcnt;
+
+ if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
+ return 1;
+
+ sk = ctx->sk;
+ if (!sk)
+ return 1;
+
+ tsk = bpf_skc_to_tcp_sock(sk);
+ if (!tsk)
+ return 1;
+
+ v = bpf_cgrp_storage_get(&cgrp_strg, tsk->inet_conn.icsk_inet.sk.sk_cgrp_data.cgroup, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (!v)
+ return 1;
+
+ refcnt = __insert_in_list(&head, &lock, &v->node);
+ bpf_spin_lock(&v->lock);
+ v->value = refcnt;
+ bpf_spin_unlock(&v->lock);
+ return 1;
+}
+
+SEC("fexit/inet_stream_connect")
+int BPF_PROG(check_cgroup_storage_refcount, struct socket *sock, struct sockaddr *uaddr, int addr_len,
+ int flags)
+{
+ struct lock_map_value *v;
+ u32 refcnt;
+
+ if (uaddr->sa_family != AF_INET6)
+ return 0;
+
+ v = bpf_cgrp_storage_get(&cgrp_strg, sock->sk->sk_cgrp_data.cgroup, 0, 0);
+ if (!v)
+ return 0;
+
+ refcnt = probe_read_refcount();
+ bpf_spin_lock(&v->lock);
+ v->value = refcnt;
+ bpf_spin_unlock(&v->lock);
+ return 0;
+}
+
char _license[] SEC("license") = "GPL";
--
2.51.0
* Re: [PATCH bpf v3 3/4] bpf: Free special fields when update local storage maps
2025-10-26 15:39 ` [PATCH bpf v3 3/4] bpf: Free special fields when update local storage maps Leon Hwang
@ 2025-10-27 15:44 ` Amery Hung
2025-10-27 16:15 ` Leon Hwang
0 siblings, 1 reply; 20+ messages in thread
From: Amery Hung @ 2025-10-27 15:44 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
On Sun, Oct 26, 2025 at 8:41 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> When updating local storage maps with BPF_F_LOCK on the fast path, the
> special fields were not freed after being replaced. This could cause
> memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the
> map gets freed.
>
> Similarly, on the other path, the old sdata's special fields were never
> freed regardless of whether BPF_F_LOCK was used, causing the same issue.
>
> Fix this by calling 'bpf_obj_free_fields()' after
> 'copy_map_value_locked()' to properly release the old fields.
>
> Fixes: 9db44fdd8105 ("bpf: Support kptrs in local storage maps")
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
> kernel/bpf/bpf_local_storage.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index b931fbceb54da..8e3aea4e07c50 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -609,6 +609,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
> copy_map_value_locked(&smap->map, old_sdata->data,
> value, false);
> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
[ ... ]
> return old_sdata;
> }
> }
> @@ -641,6 +642,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> if (old_sdata && (map_flags & BPF_F_LOCK)) {
> copy_map_value_locked(&smap->map, old_sdata->data, value,
> false);
> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
The one above and this make sense. Thanks for fixing it.
> selem = SELEM(old_sdata);
> goto unlock;
> }
> @@ -654,6 +656,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>
> /* Third, remove old selem, SELEM(old_sdata) */
> if (old_sdata) {
> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
Is this really needed? bpf_selem_free_list() later should free special
fields in this selem.
> bpf_selem_unlink_map(SELEM(old_sdata));
> bpf_selem_unlink_storage_nolock(local_storage, SELEM(old_sdata),
> true, &old_selem_free_list);
> --
> 2.51.0
>
>
* Re: [PATCH bpf v3 3/4] bpf: Free special fields when update local storage maps
2025-10-27 15:44 ` Amery Hung
@ 2025-10-27 16:15 ` Leon Hwang
2025-10-27 17:04 ` Amery Hung
0 siblings, 1 reply; 20+ messages in thread
From: Leon Hwang @ 2025-10-27 16:15 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
Hi Amery,
On 2025/10/27 23:44, Amery Hung wrote:
> On Sun, Oct 26, 2025 at 8:41 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>> When updating local storage maps with BPF_F_LOCK on the fast path, the
>> special fields were not freed after being replaced. This could cause
>> memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the
>> map gets freed.
>>
>> Similarly, on the other path, the old sdata's special fields were never
>> freed regardless of whether BPF_F_LOCK was used, causing the same issue.
>>
>> Fix this by calling 'bpf_obj_free_fields()' after
>> 'copy_map_value_locked()' to properly release the old fields.
>>
>> Fixes: 9db44fdd8105 ("bpf: Support kptrs in local storage maps")
>> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
>> ---
>> kernel/bpf/bpf_local_storage.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
>> index b931fbceb54da..8e3aea4e07c50 100644
>> --- a/kernel/bpf/bpf_local_storage.c
>> +++ b/kernel/bpf/bpf_local_storage.c
>> @@ -609,6 +609,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>> if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
>> copy_map_value_locked(&smap->map, old_sdata->data,
>> value, false);
>> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
>
> [ ... ]
>
>> return old_sdata;
>> }
>> }
>> @@ -641,6 +642,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>> if (old_sdata && (map_flags & BPF_F_LOCK)) {
>> copy_map_value_locked(&smap->map, old_sdata->data, value,
>> false);
>> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
>
> The one above and this make sense. Thanks for fixing it.
>
Thanks for your review.
>> selem = SELEM(old_sdata);
>> goto unlock;
>> }
>> @@ -654,6 +656,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>>
>> /* Third, remove old selem, SELEM(old_sdata) */
>> if (old_sdata) {
>> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
>
> Is this really needed? bpf_selem_free_list() later should free special
> fields in this selem.
>
Yes, it’s needed. The new selftest confirms that the special fields are
not freed when updating a local storage map.
Also, bpf_selem_unlink_storage_nolock() doesn’t invoke
bpf_selem_free_list(), unlike bpf_selem_unlink_storage(). So we need to
call bpf_obj_free_fields() here explicitly to free those fields.
Thanks,
Leon
[...]
* Re: [PATCH bpf v3 4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and local storage maps
2025-10-26 15:40 ` [PATCH bpf v3 4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and " Leon Hwang
@ 2025-10-27 16:34 ` Amery Hung
2025-10-29 14:58 ` Leon Hwang
0 siblings, 1 reply; 20+ messages in thread
From: Amery Hung @ 2025-10-27 16:34 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
On Sun, Oct 26, 2025 at 8:42 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> Add tests to verify that updating hash and local storage maps decrements
> refcount when BPF_KPTR_REF objects are involved.
>
> The tests perform the following steps:
>
> 1. Call update_elem() to insert an initial value.
> 2. Use bpf_refcount_acquire() to increment the refcount.
> 3. Store the node pointer in the map value.
> 4. Add the node to a linked list.
> 5. Probe-read the refcount and verify it is *2*.
> 6. Call update_elem() again to trigger refcount decrement.
> 7. Probe-read the refcount and verify it is *1*.
>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
> .../bpf/prog_tests/refcounted_kptr.c | 178 +++++++++++++++++-
> .../selftests/bpf/progs/refcounted_kptr.c | 160 ++++++++++++++++
> 2 files changed, 337 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c b/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
> index d6bd5e16e6372..0a60330a1f4b3 100644
> --- a/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
> +++ b/tools/testing/selftests/bpf/prog_tests/refcounted_kptr.c
> @@ -3,7 +3,7 @@
>
> #include <test_progs.h>
> #include <network_helpers.h>
> -
> +#include "cgroup_helpers.h"
> #include "refcounted_kptr.skel.h"
> #include "refcounted_kptr_fail.skel.h"
>
> @@ -44,3 +44,179 @@ void test_refcounted_kptr_wrong_owner(void)
> ASSERT_OK(opts.retval, "rbtree_wrong_owner_remove_fail_a2 retval");
> refcounted_kptr__destroy(skel);
> }
> +
> +static void test_refcnt_leak(void *values, size_t values_sz, u64 flags, struct bpf_map *map,
> + struct bpf_program *prog_leak, struct bpf_program *prog_check)
> +{
> + int ret, fd, key = 0;
> + LIBBPF_OPTS(bpf_test_run_opts, opts,
> + .data_in = &pkt_v4,
> + .data_size_in = sizeof(pkt_v4),
> + .repeat = 1,
> + );
> +
> + ret = bpf_map__update_elem(map, &key, sizeof(key), values, values_sz, flags);
> + if (!ASSERT_OK(ret, "bpf_map__update_elem init"))
> + return;
> +
> + fd = bpf_program__fd(prog_leak);
> + ret = bpf_prog_test_run_opts(fd, &opts);
> + if (!ASSERT_OK(ret, "test_run_opts"))
> + return;
> + if (!ASSERT_EQ(opts.retval, 2, "retval refcount"))
> + return;
> +
> + ret = bpf_map__update_elem(map, &key, sizeof(key), values, values_sz, flags);
> + if (!ASSERT_OK(ret, "bpf_map__update_elem dec refcount"))
> + return;
> +
> + fd = bpf_program__fd(prog_check);
> + ret = bpf_prog_test_run_opts(fd, &opts);
> + ASSERT_OK(ret, "test_run_opts");
> + ASSERT_EQ(opts.retval, 1, "retval");
> +}
Just use syscall BPF programs for the different subtests, and you can
share this test_refcnt_leak() across them.
It also saves you some code setting up bpf_test_run_opts. You can just
call bpf_prog_test_run_opts(prog_fd, NULL) as you don't pass any input
from ctx.
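E.g. something along these lines in test_refcnt_leak() (untested sketch):

	/* SEC("syscall") programs take no packet input, so opts can be NULL */
	ret = bpf_prog_test_run_opts(bpf_program__fd(prog_leak), NULL);
	if (!ASSERT_OK(ret, "run prog_leak"))
		return;
	ret = bpf_prog_test_run_opts(bpf_program__fd(prog_check), NULL);
	ASSERT_OK(ret, "run prog_check");
	/* with no opts, the observed refcount would then be reported back via
	 * a global in the skeleton rather than opts.retval
	 */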
> +
> +static void test_percpu_hash_refcount_leak(void)
> +{
> + struct refcounted_kptr *skel;
> + size_t values_sz;
> + u64 *values;
> + int cpu_nr;
> +
> + cpu_nr = libbpf_num_possible_cpus();
> + if (!ASSERT_GT(cpu_nr, 0, "libbpf_num_possible_cpus"))
> + return;
> +
> + values = calloc(cpu_nr, sizeof(u64));
> + if (!ASSERT_OK_PTR(values, "calloc values"))
> + return;
> +
> + skel = refcounted_kptr__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "refcounted_kptr__open_and_load")) {
> + free(values);
> + return;
> + }
> +
> + values_sz = cpu_nr * sizeof(u64);
> + memset(values, 0, values_sz);
> +
> + test_refcnt_leak(values, values_sz, 0, skel->maps.pcpu_hash,
> + skel->progs.pcpu_hash_refcount_leak,
> + skel->progs.check_pcpu_hash_refcount);
> +
> + refcounted_kptr__destroy(skel);
> + free(values);
> +}
> +
> +struct lock_map_value {
> + u64 kptr;
> + struct bpf_spin_lock lock;
> + int value;
> +};
> +
> +static void test_hash_lock_refcount_leak(void)
> +{
> + struct lock_map_value value = {};
> + struct refcounted_kptr *skel;
> +
> + skel = refcounted_kptr__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "refcounted_kptr__open_and_load"))
> + return;
> +
> + test_refcnt_leak(&value, sizeof(value), BPF_F_LOCK, skel->maps.lock_hash,
> + skel->progs.hash_lock_refcount_leak,
> + skel->progs.check_hash_lock_refcount);
> +
> + refcounted_kptr__destroy(skel);
> +}
> +
> +static void test_cgrp_storage_refcount_leak(u64 flags)
> +{
> + int server_fd = -1, client_fd = -1;
> + struct lock_map_value value = {};
> + struct refcounted_kptr *skel;
> + struct bpf_link *link;
> + struct bpf_map *map;
> + int cgroup, err;
> +
> + cgroup = test__join_cgroup("/cg_refcount_leak");
> + if (!ASSERT_GE(cgroup, 0, "test__join_cgroup"))
> + return;
> +
> + skel = refcounted_kptr__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "refcounted_kptr__open_and_load"))
> + goto out;
> +
> + link = bpf_program__attach_cgroup(skel->progs.cgroup_storage_refcount_leak, cgroup);
> + if (!ASSERT_OK_PTR(link, "bpf_program__attach_cgroup"))
> + goto out;
> + skel->links.cgroup_storage_refcount_leak = link;
> +
> + server_fd = start_server(AF_INET6, SOCK_STREAM, "::1", 0, 0);
> + if (!ASSERT_GE(server_fd, 0, "start_server"))
> + goto out;
> +
> + client_fd = connect_to_fd(server_fd, 0);
> + if (!ASSERT_GE(client_fd, 0, "connect_to_fd"))
> + goto out;
> +
> + map = skel->maps.cgrp_strg;
> + err = bpf_map__lookup_elem(map, &cgroup, sizeof(cgroup), &value, sizeof(value), flags);
> + if (!ASSERT_OK(err, "bpf_map__lookup_elem"))
> + goto out;
> +
> + ASSERT_EQ(value.value, 2, "refcount");
> +
> + err = bpf_map__update_elem(map, &cgroup, sizeof(cgroup), &value, sizeof(value), flags);
> + if (!ASSERT_OK(err, "bpf_map__update_elem"))
> + goto out;
> +
> + err = bpf_link__detach(skel->links.cgroup_storage_refcount_leak);
> + if (!ASSERT_OK(err, "bpf_link__detach"))
> + goto out;
> +
> + link = bpf_program__attach(skel->progs.check_cgroup_storage_refcount);
> + if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
> + goto out;
> + skel->links.check_cgroup_storage_refcount = link;
> +
> + close(client_fd);
> + client_fd = connect_to_fd(server_fd, 0);
> + if (!ASSERT_GE(client_fd, 0, "connect_to_fd"))
> + goto out;
> +
> + err = bpf_map__lookup_elem(map, &cgroup, sizeof(cgroup), &value, sizeof(value), flags);
> + if (!ASSERT_OK(err, "bpf_map__lookup_elem"))
> + goto out;
> +
> + ASSERT_EQ(value.value, 1, "refcount");
> +out:
> + close(cgroup);
> + refcounted_kptr__destroy(skel);
> + if (client_fd >= 0)
> + close(client_fd);
> + if (server_fd >= 0)
> + close(server_fd);
> +}
Then you won't need to set up a server and connection just to
read/write cgroup local storage. Just call test_refcnt_leak(), which
runs the two BPF syscall programs for cgroup local storage.
> +
> +static void test_cgroup_storage_refcount_leak(void)
> +{
> + test_cgrp_storage_refcount_leak(0);
> +}
> +
> +static void test_cgroup_storage_lock_refcount_leak(void)
> +{
> + test_cgrp_storage_refcount_leak(BPF_F_LOCK);
> +}
> +
> +void test_kptr_refcount_leak(void)
> +{
> + if (test__start_subtest("percpu_hash_refcount_leak"))
> + test_percpu_hash_refcount_leak();
> + if (test__start_subtest("hash_lock_refcount_leak"))
> + test_hash_lock_refcount_leak();
> + if (test__start_subtest("cgroup_storage_refcount_leak"))
> + test_cgroup_storage_refcount_leak();
> + if (test__start_subtest("cgroup_storage_lock_refcount_leak"))
> + test_cgroup_storage_lock_refcount_leak();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/refcounted_kptr.c b/tools/testing/selftests/bpf/progs/refcounted_kptr.c
> index 893a4fdb4b6e9..09efae9537c9b 100644
> --- a/tools/testing/selftests/bpf/progs/refcounted_kptr.c
> +++ b/tools/testing/selftests/bpf/progs/refcounted_kptr.c
> @@ -7,6 +7,7 @@
> #include <bpf/bpf_core_read.h>
> #include "bpf_misc.h"
> #include "bpf_experimental.h"
> +#include "bpf_tracing_net.h"
>
> extern void bpf_rcu_read_lock(void) __ksym;
> extern void bpf_rcu_read_unlock(void) __ksym;
> @@ -568,4 +569,163 @@ int BPF_PROG(rbtree_sleepable_rcu_no_explicit_rcu_lock,
> return 0;
> }
>
> +private(leak) u64 ref;
> +
> +static u32 probe_read_refcount(void)
> +{
> + u32 refcnt;
> +
> + bpf_probe_read_kernel(&refcnt, sizeof(refcnt), (void *) ref);
> + return refcnt;
> +}
> +
> +static int __insert_in_list(struct bpf_list_head *head, struct bpf_spin_lock *lock,
> + struct node_data __kptr **node)
> +{
> + struct node_data *n, *m;
> +
> + n = bpf_obj_new(typeof(*n));
> + if (!n)
> + return -1;
> +
> + m = bpf_refcount_acquire(n);
> + n = bpf_kptr_xchg(node, n);
> + if (n) {
> + bpf_obj_drop(n);
> + bpf_obj_drop(m);
> + return -2;
> + }
> +
> + bpf_spin_lock(lock);
> + bpf_list_push_front(head, &m->l);
> + ref = (u64)(void *) &m->ref;
> + bpf_spin_unlock(lock);
> + return probe_read_refcount();
> +}
> +
> +static void *__lookup_map(void *map)
> +{
> + int key = 0;
> +
> + return bpf_map_lookup_elem(map, &key);
> +}
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
> + __type(key, int);
> + __type(value, struct map_value);
> + __uint(max_entries, 1);
> +} pcpu_hash SEC(".maps");
> +
> +SEC("tc")
> +int pcpu_hash_refcount_leak(void *ctx)
> +{
> + struct map_value *v;
> +
> + v = __lookup_map(&pcpu_hash);
> + if (!v)
> + return 0;
> +
> + return __insert_in_list(&head, &lock, &v->node);
> +}
> +
> +SEC("tc")
> +int check_pcpu_hash_refcount(void *ctx)
> +{
> + return probe_read_refcount();
> +}
> +
> +struct lock_map_value {
> + struct node_data __kptr *node;
> + struct bpf_spin_lock lock;
> + int value;
> +};
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_HASH);
> + __type(key, int);
> + __type(value, struct lock_map_value);
> + __uint(max_entries, 1);
> +} lock_hash SEC(".maps");
> +
> +SEC("tc")
> +int hash_lock_refcount_leak(void *ctx)
> +{
> + struct lock_map_value *v;
> +
> + v = __lookup_map(&lock_hash);
> + if (!v)
> + return 0;
> +
> + bpf_spin_lock(&v->lock);
> + v->value = 42;
> + bpf_spin_unlock(&v->lock);
> + return __insert_in_list(&head, &lock, &v->node);
> +}
> +
> +SEC("tc")
> +int check_hash_lock_refcount(void *ctx)
> +{
> + return probe_read_refcount();
> +}
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
> + __uint(map_flags, BPF_F_NO_PREALLOC);
> + __type(key, int);
> + __type(value, struct lock_map_value);
> +} cgrp_strg SEC(".maps");
> +
> +SEC("cgroup/connect6")
> +int cgroup_storage_refcount_leak(struct bpf_sock_addr *ctx)
> +{
> + struct lock_map_value *v;
> + struct tcp_sock *tsk;
> + struct bpf_sock *sk;
> + u32 refcnt;
> +
> + if (ctx->family != AF_INET6 || ctx->user_family != AF_INET6)
> + return 1;
> +
> + sk = ctx->sk;
> + if (!sk)
> + return 1;
> +
> + tsk = bpf_skc_to_tcp_sock(sk);
> + if (!tsk)
> + return 1;
> +
> + v = bpf_cgrp_storage_get(&cgrp_strg, tsk->inet_conn.icsk_inet.sk.sk_cgrp_data.cgroup, 0,
> + BPF_LOCAL_STORAGE_GET_F_CREATE);
> + if (!v)
> + return 1;
> +
> + refcnt = __insert_in_list(&head, &lock, &v->node);
> + bpf_spin_lock(&v->lock);
> + v->value = refcnt;
> + bpf_spin_unlock(&v->lock);
> + return 1;
> +}
And in a syscall BPF program, you can simply get the cgroup through the
current task:

SEC("syscall")
int syscall_prog(void *ctx)
{
	struct lock_map_value *v;
	struct task_struct *task = bpf_get_current_task_btf();

	v = bpf_cgrp_storage_get(&cgrp_strg, task->cgroups->dfl_cgrp, 0,
				 BPF_LOCAL_STORAGE_GET_F_CREATE);
	...
}
> +
> +SEC("fexit/inet_stream_connect")
> +int BPF_PROG(check_cgroup_storage_refcount, struct socket *sock, struct sockaddr *uaddr, int addr_len,
> + int flags)
> +{
> + struct lock_map_value *v;
> + u32 refcnt;
> +
> + if (uaddr->sa_family != AF_INET6)
> + return 0;
> +
> + v = bpf_cgrp_storage_get(&cgrp_strg, sock->sk->sk_cgrp_data.cgroup, 0, 0);
> + if (!v)
> + return 0;
> +
> + refcnt = probe_read_refcount();
> + bpf_spin_lock(&v->lock);
> + v->value = refcnt;
> + bpf_spin_unlock(&v->lock);
> + return 0;
> +}
> +
> char _license[] SEC("license") = "GPL";
> --
> 2.51.0
>
>
* Re: [PATCH bpf v3 3/4] bpf: Free special fields when update local storage maps
2025-10-27 16:15 ` Leon Hwang
@ 2025-10-27 17:04 ` Amery Hung
2025-10-28 14:48 ` Leon Hwang
0 siblings, 1 reply; 20+ messages in thread
From: Amery Hung @ 2025-10-27 17:04 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
On Mon, Oct 27, 2025 at 9:15 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> Hi Amery,
>
> On 2025/10/27 23:44, Amery Hung wrote:
> > On Sun, Oct 26, 2025 at 8:41 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
> >> When updating local storage maps with BPF_F_LOCK on the fast path, the
> >> special fields were not freed after being replaced. This could cause
> >> memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the
> >> map gets freed.
> >>
> >> Similarly, on the other path, the old sdata's special fields were never
> >> freed regardless of whether BPF_F_LOCK was used, causing the same issue.
> >>
> >> Fix this by calling 'bpf_obj_free_fields()' after
> >> 'copy_map_value_locked()' to properly release the old fields.
> >>
> >> Fixes: 9db44fdd8105 ("bpf: Support kptrs in local storage maps")
> >> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> >> ---
> >> kernel/bpf/bpf_local_storage.c | 3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> >> index b931fbceb54da..8e3aea4e07c50 100644
> >> --- a/kernel/bpf/bpf_local_storage.c
> >> +++ b/kernel/bpf/bpf_local_storage.c
> >> @@ -609,6 +609,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> >> if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
> >> copy_map_value_locked(&smap->map, old_sdata->data,
> >> value, false);
> >> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
> >
> > [ ... ]
> >
> >> return old_sdata;
> >> }
> >> }
> >> @@ -641,6 +642,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> >> if (old_sdata && (map_flags & BPF_F_LOCK)) {
> >> copy_map_value_locked(&smap->map, old_sdata->data, value,
> >> false);
> >> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
> >
> > The one above and this make sense. Thanks for fixing it.
> >
>
> Thanks for your review.
>
> >> selem = SELEM(old_sdata);
> >> goto unlock;
> >> }
> >> @@ -654,6 +656,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> >>
> >> /* Third, remove old selem, SELEM(old_sdata) */
> >> if (old_sdata) {
> >> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
> >
> > Is this really needed? bpf_selem_free_list() later should free special
> > fields in this selem.
> >
>
> Yes, it’s needed. The new selftest confirms that the special fields are
> not freed when updating a local storage map.
>
Hmmm. I don't think so.
> Also, bpf_selem_unlink_storage_nolock() doesn’t invoke
> bpf_selem_free_list(), unlike bpf_selem_unlink_storage(). So we need to
> call bpf_obj_free_fields() here explicitly to free those fields.
>
bpf_selem_unlink_storage_nolock() unlinks the old selem and adds it to
old_selem_free_list. Later, bpf_selem_free_list() will call
bpf_selem_free() to free selem in bpf_selem_free_list, which should
also free special fields in the selem.
The selftests may have checked the refcount before a task-trace RCU
grace period elapsed and mistaken that for a leak. I added a 300ms delay
before the checking program runs, and the test did not detect any leak
even without this specific bpf_obj_free_fields().
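I.e. a delay along these lines in the selftest, right before the refcount
is checked (illustrative):

	/* let the task-trace RCU grace period elapse so the old selem
	 * (and its special fields) is actually freed before the check
	 */
	usleep(300 * 1000);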
> Thanks,
> Leon
>
> [...]
>
>
* Re: [PATCH bpf v3 3/4] bpf: Free special fields when update local storage maps
2025-10-27 17:04 ` Amery Hung
@ 2025-10-28 14:48 ` Leon Hwang
2025-10-28 18:03 ` Andrii Nakryiko
0 siblings, 1 reply; 20+ messages in thread
From: Leon Hwang @ 2025-10-28 14:48 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
On 2025/10/28 01:04, Amery Hung wrote:
> On Mon, Oct 27, 2025 at 9:15 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>> Hi Amery,
>>
>> On 2025/10/27 23:44, Amery Hung wrote:
>>> On Sun, Oct 26, 2025 at 8:41 AM Leon Hwang <leon.hwang@linux.dev> wrote:
[...]
>>>> selem = SELEM(old_sdata);
>>>> goto unlock;
>>>> }
>>>> @@ -654,6 +656,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
>>>>
>>>> /* Third, remove old selem, SELEM(old_sdata) */
>>>> if (old_sdata) {
>>>> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
>>>
>>> Is this really needed? bpf_selem_free_list() later should free special
>>> fields in this selem.
>>>
>>
>> Yes, it’s needed. The new selftest confirms that the special fields are
>> not freed when updating a local storage map.
>>
>
> Hmmm. I don't think so.
>
>> Also, bpf_selem_unlink_storage_nolock() doesn’t invoke
>> bpf_selem_free_list(), unlike bpf_selem_unlink_storage(). So we need to
>> call bpf_obj_free_fields() here explicitly to free those fields.
>>
>
> bpf_selem_unlink_storage_nolock() unlinks the old selem and adds it to
> old_selem_free_list. Later, bpf_selem_free_list() will call
> bpf_selem_free() to free selem in bpf_selem_free_list, which should
> also free special fields in the selem.
>
> The selftests may have checked the refcount before a task-trace RCU
> grace period elapsed and mistaken that for a leak. I added a 300ms delay
> before the checking program runs, and the test did not detect any leak
> even without this specific bpf_obj_free_fields().
Yeah, you're right. Thanks for the clear explanation.
I also verified it by adding a 300ms delay.
So this bpf_obj_free_fields() call isn't needed — I'll drop it in the
next revision.
Thanks,
Leon
* Re: [PATCH bpf v3 3/4] bpf: Free special fields when update local storage maps
2025-10-28 14:48 ` Leon Hwang
@ 2025-10-28 18:03 ` Andrii Nakryiko
0 siblings, 0 replies; 20+ messages in thread
From: Andrii Nakryiko @ 2025-10-28 18:03 UTC (permalink / raw)
To: Leon Hwang
Cc: Amery Hung, bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
On Tue, Oct 28, 2025 at 7:48 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 2025/10/28 01:04, Amery Hung wrote:
> > On Mon, Oct 27, 2025 at 9:15 AM Leon Hwang <leon.hwang@linux.dev> wrote:
> >>
> >> Hi Amery,
> >>
> >> On 2025/10/27 23:44, Amery Hung wrote:
> >>> On Sun, Oct 26, 2025 at 8:41 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> [...]
>
> >>>> selem = SELEM(old_sdata);
> >>>> goto unlock;
> >>>> }
> >>>> @@ -654,6 +656,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
> >>>>
> >>>> /* Third, remove old selem, SELEM(old_sdata) */
> >>>> if (old_sdata) {
> >>>> + bpf_obj_free_fields(smap->map.record, old_sdata->data);
> >>>
> >>> Is this really needed? bpf_selem_free_list() later should free special
> >>> fields in this selem.
> >>>
> >>
> >> Yes, it’s needed. The new selftest confirms that the special fields are
> >> not freed when updating a local storage map.
> >>
> >
> > Hmmm. I don't think so.
> >
> >> Also, bpf_selem_unlink_storage_nolock() doesn’t invoke
> >> bpf_selem_free_list(), unlike bpf_selem_unlink_storage(). So we need to
> >> call bpf_obj_free_fields() here explicitly to free those fields.
> >>
> >
> > bpf_selem_unlink_storage_nolock() unlinks the old selem and adds it to
> > old_selem_free_list. Later, bpf_selem_free_list() will call
> > bpf_selem_free() to free selem in bpf_selem_free_list, which should
> > also free special fields in the selem.
> >
> > The selftests may have checked the refcount before a task-trace RCU
> > grace period elapsed and mistaken that for a leak. I added a 300ms delay
> > before the checking program runs, and the test did not detect any leak
> > even without this specific bpf_obj_free_fields().
>
> Yeah, you're right. Thanks for the clear explanation.
>
> I also verified it by adding a 300ms delay.
>
> So this bpf_obj_free_fields() call isn't needed — I'll drop it in the
> next revision.
I've dropped it while applying, no need to resend.
>
> Thanks,
> Leon
>
* Re: [PATCH bpf v3 1/4] bpf: Free special fields when update [lru_,]percpu_hash maps
2025-10-26 15:39 ` [PATCH bpf v3 1/4] bpf: Free special fields when update [lru_,]percpu_hash maps Leon Hwang
@ 2025-10-28 18:03 ` Andrii Nakryiko
0 siblings, 0 replies; 20+ messages in thread
From: Andrii Nakryiko @ 2025-10-28 18:03 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
On Sun, Oct 26, 2025 at 8:40 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> As [lru_,]percpu_hash maps support BPF_KPTR_{REF,PERCPU}, missing
> calls to 'bpf_obj_free_fields()' in 'pcpu_copy_value()' could cause the
> memory referenced by BPF_KPTR_{REF,PERCPU} fields to be held until the
> map gets freed.
>
> Fix this by calling 'bpf_obj_free_fields()' after
> 'copy_map_value[,_long]()' in 'pcpu_copy_value()'.
>
> Fixes: 65334e64a493 ("bpf: Support kptrs in percpu hashmap and percpu LRU hashmap")
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
> kernel/bpf/hashtab.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index c2fcd0cd51e51..26308adc9ccb3 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -950,12 +950,14 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
> if (!onallcpus) {
> /* copy true value_size bytes */
> copy_map_value(&htab->map, this_cpu_ptr(pptr), value);
> + bpf_obj_free_fields(htab->map.record, this_cpu_ptr(pptr));
It would make sense to assign the this_cpu_ptr() result to a local
variable and reuse it between copy_map_value() and bpf_obj_free_fields().
Consider that for a follow-up.
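I.e. something along these lines (sketch of the suggested follow-up):

	if (!onallcpus) {
		void *ptr = this_cpu_ptr(pptr);

		/* copy true value_size bytes */
		copy_map_value(&htab->map, ptr, value);
		bpf_obj_free_fields(htab->map.record, ptr);
	}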
> } else {
> u32 size = round_up(htab->map.value_size, 8);
> int off = 0, cpu;
>
> for_each_possible_cpu(cpu) {
> copy_map_value_long(&htab->map, per_cpu_ptr(pptr, cpu), value + off);
> + bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
> off += size;
> }
> }
> --
> 2.51.0
>
* Re: [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps
2025-10-26 15:39 [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps Leon Hwang
` (3 preceding siblings ...)
2025-10-26 15:40 ` [PATCH bpf v3 4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and " Leon Hwang
@ 2025-10-28 18:10 ` patchwork-bot+netdevbpf
2025-10-28 20:22 ` Andrii Nakryiko
4 siblings, 1 reply; 20+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-10-28 18:10 UTC (permalink / raw)
To: Leon Hwang
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
Hello:
This series was applied to bpf/bpf-next.git (master)
by Andrii Nakryiko <andrii@kernel.org>:
On Sun, 26 Oct 2025 23:39:56 +0800 you wrote:
> In the discussion thread
> "[PATCH bpf-next v9 0/7] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps"[1],
> it was pointed out that missing calls to bpf_obj_free_fields() could
> lead to memory leaks.
>
> A selftest was added to confirm that this is indeed a real issue - the
> refcount of BPF_KPTR_REF field is not decremented when
> bpf_obj_free_fields() is missing after copy_map_value[,_long]().
>
> [...]
Here is the summary with links:
- [bpf,v3,1/4] bpf: Free special fields when update [lru_,]percpu_hash maps
https://git.kernel.org/bpf/bpf-next/c/f6de8d643ff1
- [bpf,v3,2/4] bpf: Free special fields when update hash maps with BPF_F_LOCK
https://git.kernel.org/bpf/bpf-next/c/c7fcb7972196
- [bpf,v3,3/4] bpf: Free special fields when update local storage maps
(no matching commit)
- [bpf,v3,4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and local storage maps
https://git.kernel.org/bpf/bpf-next/c/d5a7e7af14cc
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps
2025-10-28 18:10 ` [PATCH bpf v3 0/4] bpf: Free " patchwork-bot+netdevbpf
@ 2025-10-28 20:22 ` Andrii Nakryiko
2025-10-29 6:49 ` Leon Hwang
0 siblings, 1 reply; 20+ messages in thread
From: Andrii Nakryiko @ 2025-10-28 20:22 UTC (permalink / raw)
To: patchwork-bot+netdevbpf
Cc: Leon Hwang, bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
On Tue, Oct 28, 2025 at 11:10 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
>
> Hello:
>
> This series was applied to bpf/bpf-next.git (master)
> by Andrii Nakryiko <andrii@kernel.org>:
>
> On Sun, 26 Oct 2025 23:39:56 +0800 you wrote:
> > In the discussion thread
> > "[PATCH bpf-next v9 0/7] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps"[1],
> > it was pointed out that missing calls to bpf_obj_free_fields() could
> > lead to memory leaks.
> >
> > A selftest was added to confirm that this is indeed a real issue - the
> > refcount of BPF_KPTR_REF field is not decremented when
> > bpf_obj_free_fields() is missing after copy_map_value[,_long]().
> >
> > [...]
>
> Here is the summary with links:
> - [bpf,v3,1/4] bpf: Free special fields when update [lru_,]percpu_hash maps
> https://git.kernel.org/bpf/bpf-next/c/f6de8d643ff1
> - [bpf,v3,2/4] bpf: Free special fields when update hash maps with BPF_F_LOCK
> https://git.kernel.org/bpf/bpf-next/c/c7fcb7972196
> - [bpf,v3,3/4] bpf: Free special fields when update local storage maps
> (no matching commit)
> - [bpf,v3,4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and local storage maps
> https://git.kernel.org/bpf/bpf-next/c/d5a7e7af14cc
>
Ok, I had to drop this from bpf-next after all. First,
kptr_refcount_leak/cgroup_storage_refcount_leak needs to be adjusted
due to that one-line removal in patch 3.
But what's worse, we started getting a deadlock warning when running one
of the tests, see [0]:
[ 418.260323] bpf_testmod: oh no, recursing into test_1, recursion_misses 1
[ 424.982201]
[ 424.982207] ================================
[ 424.982216] WARNING: inconsistent lock state
[ 424.982219] 6.18.0-rc1-gbb1b9387787c-dirty #1 Tainted: G W OE
[ 424.982221] --------------------------------
[ 424.982223] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 424.982225] new_name/11207 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ 424.982229] ffffe8ffffd9c000 (&loc_l->lock){....}-{2:2}, at:
bpf_lru_pop_free+0x2c6/0x1a50
[ 424.982244] {INITIAL USE} state was registered at:
[ 424.982246] lock_acquire+0x154/0x2d0
[ 424.982252] _raw_spin_lock_irqsave+0x39/0x60
[ 424.982259] bpf_lru_pop_free+0x2c6/0x1a50
[ 424.982262] htab_lru_map_update_elem+0x17e/0xa90
[ 424.982266] bpf_map_update_value+0x5aa/0x1230
[ 424.982272] __sys_bpf+0x33b4/0x4ef0
[ 424.982275] __x64_sys_bpf+0x78/0xe0
[ 424.982278] do_syscall_64+0x6a/0x2f0
[ 424.982282] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 424.982287] irq event stamp: 236
[ 424.982288] hardirqs last enabled at (235): [<ffffffff959e4e70>]
do_syscall_64+0x30/0x2f0
[ 424.982292] hardirqs last disabled at (236): [<ffffffff959e65df>]
exc_nmi+0x7f/0x110
[ 424.982296] softirqs last enabled at (0): [<ffffffff933fe7cf>]
copy_process+0x1c3f/0x6ab0
[ 424.982302] softirqs last disabled at (0): [<0000000000000000>] 0x0
[ 424.982305]
[ 424.982305] other info that might help us debug this:
[ 424.982306] Possible unsafe locking scenario:
[ 424.982306]
[ 424.982307] CPU0
[ 424.982308] ----
[ 424.982309] lock(&loc_l->lock);
[ 424.982311] <Interrupt>
[ 424.982312] lock(&loc_l->lock);
[ 424.982314]
[ 424.982314] *** DEADLOCK ***
[ 424.982314]
[ 424.982315] no locks held by new_name/11207.
[ 424.982317]
[ 424.982317] stack backtrace:
[ 424.982326] CPU: 1 UID: 0 PID: 11207 Comm: new_name Tainted: G
W OE 6.18.0-rc1-gbb1b9387787c-dirty #1 PREEMPT(full)
[ 424.982332] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 424.982334] Hardware name: QEMU Ubuntu 25.04 PC (i440FX + PIIX,
1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 424.982337] Call Trace:
[ 424.982340] <NMI>
[ 424.982342] dump_stack_lvl+0x5d/0x80
[ 424.982356] print_usage_bug.part.0+0x22b/0x2c0
[ 424.982360] lock_acquire+0x278/0x2d0
[ 424.982364] ? __irq_work_queue_local+0x133/0x360
[ 424.982371] ? bpf_lru_pop_free+0x2c6/0x1a50
[ 424.982375] _raw_spin_lock_irqsave+0x39/0x60
[ 424.982379] ? bpf_lru_pop_free+0x2c6/0x1a50
[ 424.982382] bpf_lru_pop_free+0x2c6/0x1a50
[ 424.982387] ? arch_irq_work_raise+0x3f/0x60
[ 424.982394] ? __pfx___irq_work_queue_local+0x10/0x10
[ 424.982399] htab_lru_map_update_elem+0x17e/0xa90
[ 424.982405] ? __pfx_htab_lru_map_update_elem+0x10/0x10
[ 424.982408] ? __kasan_check_byte+0x16/0x60
[ 424.982414] ? __htab_map_lookup_elem+0x95/0x220
[ 424.982420] bpf_prog_2c77131b3c031599_oncpu_lru_map+0xe4/0x168
[ 424.982423] __perf_event_overflow+0x8e8/0xea0
[ 424.982430] ? __pfx___perf_event_overflow+0x10/0x10
[ 424.982436] handle_pmi_common+0x3fe/0x810
[ 424.982441] ? __pfx_handle_pmi_common+0x10/0x10
[ 424.982452] ? __pfx_intel_bts_interrupt+0x10/0x10
[ 424.982458] intel_pmu_handle_irq+0x1c5/0x5d0
[ 424.982461] ? lock_acquire+0x1ef/0x2d0
[ 424.982465] ? nmi_handle.part.0+0x2f/0x380
[ 424.982469] perf_event_nmi_handler+0x3e/0x70
[ 424.982476] nmi_handle.part.0+0x13f/0x380
[ 424.982480] ? trace_rcu_watching+0x105/0x170
[ 424.982486] default_do_nmi+0x3b/0x110
[ 424.982490] ? irqentry_nmi_enter+0x6f/0x80
[ 424.982493] exc_nmi+0xe3/0x110
[ 424.982497] end_repeat_nmi+0xf/0x53
[ 424.982502] RIP: 0010:fput_close_sync+0x56/0x1a0
[ 424.982509] Code: 48 89 e5 48 c7 04 24 b3 8a b5 41 48 c7 44 24 08
5c a2 3e 96 48 c1 ed 03 48 c7 44 24 10 10 a7 e0 93 42 c7 44 2d 00 f1
f1 f1 f1 <42> c7 44 2d 04 00 f3 f3 f3 65 48 8b 05 91 98 56 04 48 89 44
24 58
[ 424.982513] RSP: 0018:ffffc900099d7e88 EFLAGS: 00000a06
[ 424.982517] RAX: 0000000000000000 RBX: ffff888109fb48c0 RCX:
0000000000000000
[ 424.982520] RDX: 1ffff110099572bb RSI: 0000000000000008 RDI:
ffff888109fb4a20
[ 424.982522] RBP: 1ffff9200133afd1 R08: ffff888109fb48c0 R09:
ffff888109278b40
[ 424.982524] R10: ffff888109fb4920 R11: 0000000000000000 R12:
0000000000000003
[ 424.982526] R13: dffffc0000000000 R14: 0000000000000003 R15:
0000000000000000
[ 424.982532] ? fput_close_sync+0x56/0x1a0
[ 424.982537] ? fput_close_sync+0x56/0x1a0
[ 424.982541] </NMI>
[ 424.982542] <TASK>
[ 424.982544] ? __pfx_fput_close_sync+0x10/0x10
[ 424.982548] ? do_raw_spin_unlock+0x59/0x250
[ 424.982553] __x64_sys_close+0x7d/0xd0
[ 424.982559] do_syscall_64+0x6a/0x2f0
[ 424.982563] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 424.982566] RIP: 0033:0x7faae0f88fe2
[ 424.982569] Code: 08 0f 85 71 3a ff ff 49 89 fb 48 89 f0 48 89 d7
48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24
08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00
00 66
[ 424.982571] RSP: 002b:00007ffe58ee5b08 EFLAGS: 00000246 ORIG_RAX:
0000000000000003
[ 424.982574] RAX: ffffffffffffffda RBX: 00007faae0a6cb00 RCX:
00007faae0f88fe2
[ 424.982577] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
0000000000000072
[ 424.982579] RBP: 00007ffe58ee5b30 R08: 0000000000000000 R09:
0000000000000000
[ 424.982581] R10: 0000000000000000 R11: 0000000000000246 R12:
0000000000000008
[ 424.982583] R13: 0000000000000000 R14: 0000556f5e250c90 R15:
00007faae11e9000
[ 424.982588] </TASK>
[ 424.982606] perf: interrupt took too long (14417 > 12551),
lowering kernel.perf_event_max_sample_rate to 13000
We'll need to figure this out first.
[0] https://github.com/kernel-patches/bpf/actions/runs/18884827710/job/53898669092
> You are awesome, thank you!
> --
> Deet-doot-dot, I am a bot.
> https://korg.docs.kernel.org/patchwork/pwbot.html
>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps
2025-10-28 20:22 ` Andrii Nakryiko
@ 2025-10-29 6:49 ` Leon Hwang
2025-10-29 6:57 ` Menglong Dong
2025-10-29 16:38 ` Alexei Starovoitov
0 siblings, 2 replies; 20+ messages in thread
From: Leon Hwang @ 2025-10-29 6:49 UTC (permalink / raw)
To: Andrii Nakryiko, patchwork-bot+netdevbpf, Menglong Dong
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
On 29/10/25 04:22, Andrii Nakryiko wrote:
> On Tue, Oct 28, 2025 at 11:10 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
>>
>> Hello:
>>
>> This series was applied to bpf/bpf-next.git (master)
>> by Andrii Nakryiko <andrii@kernel.org>:
>>
>> On Sun, 26 Oct 2025 23:39:56 +0800 you wrote:
>>> In the discussion thread
>>> "[PATCH bpf-next v9 0/7] bpf: Introduce BPF_F_CPU and BPF_F_ALL_CPUS flags for percpu maps"[1],
>>> it was pointed out that missing calls to bpf_obj_free_fields() could
>>> lead to memory leaks.
>>>
>>> A selftest was added to confirm that this is indeed a real issue - the
>>> refcount of BPF_KPTR_REF field is not decremented when
>>> bpf_obj_free_fields() is missing after copy_map_value[,_long]().
>>>
>>> [...]
>>
>> Here is the summary with links:
>> - [bpf,v3,1/4] bpf: Free special fields when update [lru_,]percpu_hash maps
>> https://git.kernel.org/bpf/bpf-next/c/f6de8d643ff1
>> - [bpf,v3,2/4] bpf: Free special fields when update hash maps with BPF_F_LOCK
>> https://git.kernel.org/bpf/bpf-next/c/c7fcb7972196
>> - [bpf,v3,3/4] bpf: Free special fields when update local storage maps
>> (no matching commit)
>> - [bpf,v3,4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and local storage maps
>> https://git.kernel.org/bpf/bpf-next/c/d5a7e7af14cc
>>
>
> Ok, I had to drop this from bpf-next after all. First,
> kptr_refcount_leak/cgroup_storage_refcount_leak needs to be adjusted
> due to that one line removal in patch 3.
Ack.
>
> But what's worse, we started getting a deadlock warning when running one
> of the tests, see [0]:
>
Oops.
> [ 418.260323] bpf_testmod: oh no, recursing into test_1, recursion_misses 1
> [ 424.982201]
> [ 424.982207] ================================
> [ 424.982216] WARNING: inconsistent lock state
> [ 424.982219] 6.18.0-rc1-gbb1b9387787c-dirty #1 Tainted: G W OE
> [ 424.982221] --------------------------------
> [ 424.982223] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> [ 424.982225] new_name/11207 [HC1[1]:SC0[0]:HE0:SE1] takes:
> [ 424.982229] ffffe8ffffd9c000 (&loc_l->lock){....}-{2:2}, at:
> bpf_lru_pop_free+0x2c6/0x1a50
> [ 424.982244] {INITIAL USE} state was registered at:
> [ 424.982246] lock_acquire+0x154/0x2d0
> [ 424.982252] _raw_spin_lock_irqsave+0x39/0x60
> [ 424.982259] bpf_lru_pop_free+0x2c6/0x1a50
> [ 424.982262] htab_lru_map_update_elem+0x17e/0xa90
> [ 424.982266] bpf_map_update_value+0x5aa/0x1230
> [ 424.982272] __sys_bpf+0x33b4/0x4ef0
> [ 424.982275] __x64_sys_bpf+0x78/0xe0
> [ 424.982278] do_syscall_64+0x6a/0x2f0
> [ 424.982282] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 424.982287] irq event stamp: 236
> [ 424.982288] hardirqs last enabled at (235): [<ffffffff959e4e70>]
> do_syscall_64+0x30/0x2f0
> [ 424.982292] hardirqs last disabled at (236): [<ffffffff959e65df>]
> exc_nmi+0x7f/0x110
> [ 424.982296] softirqs last enabled at (0): [<ffffffff933fe7cf>]
> copy_process+0x1c3f/0x6ab0
> [ 424.982302] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [ 424.982305]
> [ 424.982305] other info that might help us debug this:
> [ 424.982306] Possible unsafe locking scenario:
> [ 424.982306]
> [ 424.982307] CPU0
> [ 424.982308] ----
> [ 424.982309] lock(&loc_l->lock);
> [ 424.982311] <Interrupt>
> [ 424.982312] lock(&loc_l->lock);
> [ 424.982314]
> [ 424.982314] *** DEADLOCK ***
> [ 424.982314]
> [ 424.982315] no locks held by new_name/11207.
> [ 424.982317]
> [ 424.982317] stack backtrace:
> [ 424.982326] CPU: 1 UID: 0 PID: 11207 Comm: new_name Tainted: G
> W OE 6.18.0-rc1-gbb1b9387787c-dirty #1 PREEMPT(full)
> [ 424.982332] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
> [ 424.982334] Hardware name: QEMU Ubuntu 25.04 PC (i440FX + PIIX,
> 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 424.982337] Call Trace:
> [ 424.982340] <NMI>
> [ 424.982342] dump_stack_lvl+0x5d/0x80
> [ 424.982356] print_usage_bug.part.0+0x22b/0x2c0
> [ 424.982360] lock_acquire+0x278/0x2d0
> [ 424.982364] ? __irq_work_queue_local+0x133/0x360
> [ 424.982371] ? bpf_lru_pop_free+0x2c6/0x1a50
> [ 424.982375] _raw_spin_lock_irqsave+0x39/0x60
> [ 424.982379] ? bpf_lru_pop_free+0x2c6/0x1a50
> [ 424.982382] bpf_lru_pop_free+0x2c6/0x1a50
Right, this is the classic NMI vs spinlock deadlock:
Process Context (CPU 0) NMI Context (CPU 0)
======================= ===================
syscall()
|
+-> htab_lru_map_update_elem()
|
+-> bpf_lru_pop_free()
|
+-> spin_lock_irqsave(&lock)
| +-------------------+
| | LOCK ACQUIRED [Y] |
| | IRQs DISABLED |
| +-------------------+
|
+-> [Critical Section]
| |
| | Working with LRU...
| |
| | +-----------------------+
| |<---------------------+ ! NMI FIRES! |
| | +-----------------------+
| | | (IRQs disabled but |
| | | NMI ignores that!) |
| | +-----------------------+
| | |
| | [INTERRUPTED] |
| | [Context saved] |
| | v
| | perf_event_nmi_handler()
| | |
| | +-> BPF program
| | |
| | +-> htab_lru_map_
| | | update_elem()
| | |
| | +-> bpf_lru_pop_
| | | free()
| | |
| | +-> spin_lock_
| | | irqsave(&lock)
| | | +------------+
| | | | TRIES TO |
| | | | ACQUIRE |
| | | | SAME LOCK! |
| | | +------------+
| | | |
| | | v
| | | +------------+
| |<--------------------------------+---+ ! DEADLOCK |
| | | +------------+
| | | | Lock held |
| | Still holding lock... | | by process |
| | Waiting for NMI to finish ---+ | | context |
| | | | | |
| | | | | NMI waits |
| | | | | for same |
| | | | | lock |
| | | | +------------+
| | | | |
| | | | v
| | | | [Spin forever]
| | | | |
| | | +--------+
| | | (Circular wait)
| | |
| | +-> SYSTEM HUNG
| |
| +-> [Never reached]
|
+-> spin_unlock_irqrestore(&lock)
[Never reached]
+---------------------------------------------------------------------+
| DEADLOCK SUMMARY |
+---------------------------------------------------------------------+
| |
| Process Context: Holds &loc_l->lock, waiting for NMI to finish |
| |
| NMI Context: Trying to acquire &loc_l->lock |
| (same lock, same CPU) |
| |
| Result: Both contexts wait for each other = DEADLOCK |
| |
+---------------------------------------------------------------------+
We can fix this by converting the raw_spinlock_t to a trylock-based
approach, similar to the fix for ringbuf in
commit a650d38915c1 ("bpf: Convert ringbuf map to rqspinlock").

In bpf_common_lru_pop_free(), we could use:

	if (raw_res_spin_lock_irqsave(&loc_l->lock, flags))
		return NULL;

so that a re-entrant acquisition (e.g. from NMI context) fails fast
instead of spinning on a lock it can never take.
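For completeness, an untested sketch of that direction (struct/field
names taken from a quick look at bpf_lru_list.[ch]; the other lock
sites there would need the same treatment, plus
raw_res_spin_lock_init() at setup time):

	/* struct bpf_lru_locallist */
	-	raw_spinlock_t lock;
	+	rqspinlock_t lock;

	/* bpf_common_lru_pop_free() */
	-	raw_spin_lock_irqsave(&loc_l->lock, flags);
	+	if (raw_res_spin_lock_irqsave(&loc_l->lock, flags))
	+		return NULL;
		...
	-	raw_spin_unlock_irqrestore(&loc_l->lock, flags);
	+	raw_res_spin_unlock_irqrestore(&loc_l->lock, flags);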
However, this deadlock is pre-existing and not introduced by this
series. It's better to send a separate fix for this deadlock.
Hi Menglong, could you follow up on the deadlock fix?
Thanks,
Leon
> [ 424.982387] ? arch_irq_work_raise+0x3f/0x60
> [ 424.982394] ? __pfx___irq_work_queue_local+0x10/0x10
> [ 424.982399] htab_lru_map_update_elem+0x17e/0xa90
> [ 424.982405] ? __pfx_htab_lru_map_update_elem+0x10/0x10
> [ 424.982408] ? __kasan_check_byte+0x16/0x60
> [ 424.982414] ? __htab_map_lookup_elem+0x95/0x220
> [ 424.982420] bpf_prog_2c77131b3c031599_oncpu_lru_map+0xe4/0x168
> [ 424.982423] __perf_event_overflow+0x8e8/0xea0
> [ 424.982430] ? __pfx___perf_event_overflow+0x10/0x10
> [ 424.982436] handle_pmi_common+0x3fe/0x810
> [ 424.982441] ? __pfx_handle_pmi_common+0x10/0x10
> [ 424.982452] ? __pfx_intel_bts_interrupt+0x10/0x10
> [ 424.982458] intel_pmu_handle_irq+0x1c5/0x5d0
> [ 424.982461] ? lock_acquire+0x1ef/0x2d0
> [ 424.982465] ? nmi_handle.part.0+0x2f/0x380
> [ 424.982469] perf_event_nmi_handler+0x3e/0x70
> [ 424.982476] nmi_handle.part.0+0x13f/0x380
> [ 424.982480] ? trace_rcu_watching+0x105/0x170
> [ 424.982486] default_do_nmi+0x3b/0x110
> [ 424.982490] ? irqentry_nmi_enter+0x6f/0x80
> [ 424.982493] exc_nmi+0xe3/0x110
> [ 424.982497] end_repeat_nmi+0xf/0x53
> [ 424.982502] RIP: 0010:fput_close_sync+0x56/0x1a0
> [ 424.982509] Code: 48 89 e5 48 c7 04 24 b3 8a b5 41 48 c7 44 24 08
> 5c a2 3e 96 48 c1 ed 03 48 c7 44 24 10 10 a7 e0 93 42 c7 44 2d 00 f1
> f1 f1 f1 <42> c7 44 2d 04 00 f3 f3 f3 65 48 8b 05 91 98 56 04 48 89 44
> 24 58
> [ 424.982513] RSP: 0018:ffffc900099d7e88 EFLAGS: 00000a06
> [ 424.982517] RAX: 0000000000000000 RBX: ffff888109fb48c0 RCX:
> 0000000000000000
> [ 424.982520] RDX: 1ffff110099572bb RSI: 0000000000000008 RDI:
> ffff888109fb4a20
> [ 424.982522] RBP: 1ffff9200133afd1 R08: ffff888109fb48c0 R09:
> ffff888109278b40
> [ 424.982524] R10: ffff888109fb4920 R11: 0000000000000000 R12:
> 0000000000000003
> [ 424.982526] R13: dffffc0000000000 R14: 0000000000000003 R15:
> 0000000000000000
> [ 424.982532] ? fput_close_sync+0x56/0x1a0
> [ 424.982537] ? fput_close_sync+0x56/0x1a0
> [ 424.982541] </NMI>
> [ 424.982542] <TASK>
> [ 424.982544] ? __pfx_fput_close_sync+0x10/0x10
> [ 424.982548] ? do_raw_spin_unlock+0x59/0x250
> [ 424.982553] __x64_sys_close+0x7d/0xd0
> [ 424.982559] do_syscall_64+0x6a/0x2f0
> [ 424.982563] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 424.982566] RIP: 0033:0x7faae0f88fe2
> [ 424.982569] Code: 08 0f 85 71 3a ff ff 49 89 fb 48 89 f0 48 89 d7
> 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24
> 08 0f 05 <c3> 66 2e 0f 1f 84 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00
> 00 66
> [ 424.982571] RSP: 002b:00007ffe58ee5b08 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000003
> [ 424.982574] RAX: ffffffffffffffda RBX: 00007faae0a6cb00 RCX:
> 00007faae0f88fe2
> [ 424.982577] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> 0000000000000072
> [ 424.982579] RBP: 00007ffe58ee5b30 R08: 0000000000000000 R09:
> 0000000000000000
> [ 424.982581] R10: 0000000000000000 R11: 0000000000000246 R12:
> 0000000000000008
> [ 424.982583] R13: 0000000000000000 R14: 0000556f5e250c90 R15:
> 00007faae11e9000
> [ 424.982588] </TASK>
> [ 424.982606] perf: interrupt took too long (14417 > 12551),
> lowering kernel.perf_event_max_sample_rate to 13000
>
> We'll need to figure this out first.
>
> [0] https://github.com/kernel-patches/bpf/actions/runs/18884827710/job/53898669092
>
>> You are awesome, thank you!
>> --
>> Deet-doot-dot, I am a bot.
>> https://korg.docs.kernel.org/patchwork/pwbot.html
>>
>>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps
2025-10-29 6:49 ` Leon Hwang
@ 2025-10-29 6:57 ` Menglong Dong
2025-10-29 16:38 ` Alexei Starovoitov
1 sibling, 0 replies; 20+ messages in thread
From: Menglong Dong @ 2025-10-29 6:57 UTC (permalink / raw)
To: Leon Hwang
Cc: Andrii Nakryiko, patchwork-bot+netdevbpf, bpf, ast, andrii,
daniel, martin.lau, eddyz87, song, yonghong.song, john.fastabend,
kpsingh, sdf, haoluo, jolsa, memxor, linux-kernel,
kernel-patches-bot
On Wed, Oct 29, 2025 at 2:50 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 29/10/25 04:22, Andrii Nakryiko wrote:
> > On Tue, Oct 28, 2025 at 11:10 AM <patchwork-bot+netdevbpf@kernel.org> wrote:
[......]
> > [ 424.982379] ? bpf_lru_pop_free+0x2c6/0x1a50
> > [ 424.982382] bpf_lru_pop_free+0x2c6/0x1a50
>
> Right, this is the classic NMI vs spinlock deadlock:
>
> Process Context (CPU 0) NMI Context (CPU 0)
> ======================= ===================
>
> syscall()
> |
> +-> htab_lru_map_update_elem()
> |
> +-> bpf_lru_pop_free()
> |
> +-> spin_lock_irqsave(&lock)
> | +-------------------+
> | | LOCK ACQUIRED [Y] |
> | | IRQs DISABLED |
> | +-------------------+
> |
> +-> [Critical Section]
> | |
> | | Working with LRU...
> | |
> | | +-----------------------+
> | |<---------------------+ ! NMI FIRES! |
> | | +-----------------------+
> | | | (IRQs disabled but |
> | | | NMI ignores that!) |
> | | +-----------------------+
> | | |
> | | [INTERRUPTED] |
> | | [Context saved] |
> | | v
> | | perf_event_nmi_handler()
> | | |
> | | +-> BPF program
> | | |
> | | +-> htab_lru_map_
> | | | update_elem()
> | | |
> | | +-> bpf_lru_pop_
> | | | free()
> | | |
> | | +-> spin_lock_
> | | | irqsave(&lock)
> | | | +------------+
> | | | | TRIES TO |
> | | | | ACQUIRE |
> | | | | SAME LOCK! |
> | | | +------------+
> | | | |
> | | | v
> | | | +------------+
> | |<--------------------------------+---+ ! DEADLOCK |
> | | | +------------+
> | | | | Lock held |
> | | Still holding lock... | | by process |
> | | Waiting for NMI to finish ---+ | | context |
> | | | | | |
> | | | | | NMI waits |
> | | | | | for same |
> | | | | | lock |
> | | | | +------------+
> | | | | |
> | | | | v
> | | | | [Spin forever]
> | | | | |
> | | | +--------+
> | | | (Circular wait)
> | | |
> | | +-> SYSTEM HUNG
> | |
> | +-> [Never reached]
> |
> +-> spin_unlock_irqrestore(&lock)
> [Never reached]
>
>
> +---------------------------------------------------------------------+
> | DEADLOCK SUMMARY |
> +---------------------------------------------------------------------+
> | |
> | Process Context: Holds &loc_l->lock, waiting for NMI to finish |
> | |
> | NMI Context: Trying to acquire &loc_l->lock |
> | (same lock, same CPU) |
> | |
> | Result: Both contexts wait for each other = DEADLOCK |
> | |
> +---------------------------------------------------------------------+
>
> We can fix this by converting the raw_spinlock_t to a trylock-based
> approach, similar to the fix for ringbuf in
> commit a650d38915c1 ("bpf: Convert ringbuf map to rqspinlock").
Nice shot!
>
> In bpf_common_lru_pop_free(), we could use:
>
> 	if (raw_res_spin_lock_irqsave(&loc_l->lock, flags))
> 		return NULL;
>
> However, this deadlock is pre-existing and not introduced by this
> series. It's better to send a separate fix for this deadlock.
>
> Hi Menglong, could you follow up on the deadlock fix?
Yeah, with pleasure. As far as I can see, this is not the only place
that can cause such a deadlock and should be converted to rqspinlock.
I'll follow up on those as well.
Thanks!
Menglong Dong
>
> Thanks,
> Leon
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH bpf v3 4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and local storage maps
2025-10-27 16:34 ` Amery Hung
@ 2025-10-29 14:58 ` Leon Hwang
2025-10-30 14:29 ` Leon Hwang
0 siblings, 1 reply; 20+ messages in thread
From: Leon Hwang @ 2025-10-29 14:58 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
On 2025/10/28 00:34, Amery Hung wrote:
> On Sun, Oct 26, 2025 at 8:42 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>> Add tests to verify that updating hash and local storage maps decrements
>> refcount when BPF_KPTR_REF objects are involved.
>>
>> The tests perform the following steps:
>>
>> 1. Call update_elem() to insert an initial value.
>> 2. Use bpf_refcount_acquire() to increment the refcount.
>> 3. Store the node pointer in the map value.
>> 4. Add the node to a linked list.
>> 5. Probe-read the refcount and verify it is *2*.
>> 6. Call update_elem() again to trigger refcount decrement.
>> 7. Probe-read the refcount and verify it is *1*.
>>
>> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
>> ---
>> .../bpf/prog_tests/refcounted_kptr.c | 178 +++++++++++++++++-
>> .../selftests/bpf/progs/refcounted_kptr.c | 160 ++++++++++++++++
>> 2 files changed, 337 insertions(+), 1 deletion(-)
>>
[...]
>> @@ -44,3 +44,179 @@ void test_refcounted_kptr_wrong_owner(void)
>> ASSERT_OK(opts.retval, "rbtree_wrong_owner_remove_fail_a2 retval");
>> refcounted_kptr__destroy(skel);
>> }
>> +
>> +static void test_refcnt_leak(void *values, size_t values_sz, u64 flags, struct bpf_map *map,
>> + struct bpf_program *prog_leak, struct bpf_program *prog_check)
>> +{
[...]
>> +}
>
> Just use syscall BPF programs across different subtests, and you can
> share this test_refcnt_leak() across subtests.
>
> It also saves you some code setting up bpf_test_run_opts. You can just
> call bpf_prog_test_run_opts(prog_fd, NULL) as you don't pass any input
> from ctx.
>
>> +
>> +static void test_percpu_hash_refcount_leak(void)
>> +{
[...]
>> +out:
>> + close(cgroup);
>> + refcounted_kptr__destroy(skel);
>> + if (client_fd >= 0)
>> + close(client_fd);
>> + if (server_fd >= 0)
>> + close(server_fd);
>> +}
>
> Then, you won't need to set up server, connection.... just to
> read/write cgroup local storage. Just call test_refcnt_leak() that
> runs the two BPF syscall programs for cgroup local storage.
>
[...]
>
>
> And in syscall BPF program, you can simply get the cgroup through the
> current task
>
> SEC("syscall")
> int syscall_prog(void *ctx)
> {
> struct task_struct *task = bpf_get_current_task_btf();
>
> v = bpf_cgrp_storage_get(&cgrp_strg, task->cgroups->dfl_cgrp, 0,
> BPF_LOCAL_STORAGE_GET_F_CREATE);
> ...
> }
>
Hi Amery,
Thanks for the suggestion.
I tried your approach, but the verifier rejected it with the following
error:
0: R1=ctx() R10=fp0
; task = bpf_get_current_task_btf(); @ refcounted_kptr.c:686
0: (85) call bpf_get_current_task_btf#158 ; R0=trusted_ptr_task_struct()
; v = bpf_cgrp_storage_get(&cgrp_strg, task->cgroups->dfl_cgrp, 0, @
refcounted_kptr.c:687
1: (79) r1 = *(u64 *)(r0 +4856) ; R0=trusted_ptr_task_struct()
R1=untrusted_ptr_css_set()
2: (79) r2 = *(u64 *)(r1 +96) ; R1=untrusted_ptr_css_set()
R2=untrusted_ptr_cgroup()
3: (18) r1 = 0xffffa1b442a4b800 ; R1=map_ptr(map=cgrp_strg,ks=4,vs=16)
5: (b7) r3 = 0 ; R3=0
6: (b7) r4 = 1 ; R4=1
7: (85) call bpf_cgrp_storage_get#210
R2 type=untrusted_ptr_ expected=ptr_, trusted_ptr_, rcu_ptr_
processed 7 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0
The issue is that dereferencing task->cgroups->dfl_cgrp results in an
untrusted pointer, which bpf_cgrp_storage_get() doesn't accept in
syscall programs. I will investigate the issue later.
However, while searching for 'task->cgroups->dfl_cgrp' usage in the
selftests, I found that bpf_cgrp_storage_get() works fine in fentry
programs. This approach also has the benefit of avoiding the cgroup
setup and client/server infrastructure.
I'll respin the selftest using fentry programs instead.
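Roughly what I have in mind (just a sketch; the map, value type and
attach point are placeholders, not the final selftest):

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	char _license[] SEC("license") = "GPL";

	struct {
		__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, long);
	} cgrp_strg SEC(".maps");

	SEC("fentry/bpf_fentry_test1")
	int BPF_PROG(update_cgrp_storage)
	{
		struct task_struct *task = bpf_get_current_task_btf();

		/* non-sleepable context, so the task->cgroups->dfl_cgrp
		 * dereference is treated as RCU-protected by the verifier
		 */
		bpf_cgrp_storage_get(&cgrp_strg, task->cgroups->dfl_cgrp, 0,
				     BPF_LOCAL_STORAGE_GET_F_CREATE);
		return 0;
	}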
Thanks,
Leon
[...]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps
2025-10-29 6:49 ` Leon Hwang
2025-10-29 6:57 ` Menglong Dong
@ 2025-10-29 16:38 ` Alexei Starovoitov
2025-10-30 5:37 ` Leon Hwang
1 sibling, 1 reply; 20+ messages in thread
From: Alexei Starovoitov @ 2025-10-29 16:38 UTC (permalink / raw)
To: Leon Hwang
Cc: Andrii Nakryiko, patchwork-bot+netdevbpf, Menglong Dong, bpf,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Kumar Kartikeya Dwivedi, LKML, kernel-patches-bot
On Tue, Oct 28, 2025 at 11:50 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
> Right, this is the classic NMI vs spinlock deadlock:
Leon,
please stop copy pasting what AI told you.
I'd rather see a human with typos and grammar mistakes.
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps
2025-10-29 16:38 ` Alexei Starovoitov
@ 2025-10-30 5:37 ` Leon Hwang
0 siblings, 0 replies; 20+ messages in thread
From: Leon Hwang @ 2025-10-30 5:37 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andrii Nakryiko, patchwork-bot+netdevbpf, Menglong Dong, bpf,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Kumar Kartikeya Dwivedi, LKML, kernel-patches-bot
On 30/10/25 00:38, Alexei Starovoitov wrote:
> On Tue, Oct 28, 2025 at 11:50 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>>
>> Right, this is the classic NMI vs spinlock deadlock:
>
> Leon,
>
> please stop copy pasting what AI told you.
> I'd rather see a human with typos and grammar mistakes.
Got it.
Thanks,
Leon
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH bpf v3 4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and local storage maps
2025-10-29 14:58 ` Leon Hwang
@ 2025-10-30 14:29 ` Leon Hwang
0 siblings, 0 replies; 20+ messages in thread
From: Leon Hwang @ 2025-10-30 14:29 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, ast, andrii, daniel, martin.lau, eddyz87, song,
yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa,
memxor, linux-kernel, kernel-patches-bot
On 2025/10/29 22:58, Leon Hwang wrote:
>
>
> On 2025/10/28 00:34, Amery Hung wrote:
>> On Sun, Oct 26, 2025 at 8:42 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>>
>>> Add tests to verify that updating hash and local storage maps decrements
>>> refcount when BPF_KPTR_REF objects are involved.
>>>
>>> The tests perform the following steps:
>>>
>>> 1. Call update_elem() to insert an initial value.
>>> 2. Use bpf_refcount_acquire() to increment the refcount.
>>> 3. Store the node pointer in the map value.
>>> 4. Add the node to a linked list.
>>> 5. Probe-read the refcount and verify it is *2*.
>>> 6. Call update_elem() again to trigger refcount decrement.
>>> 7. Probe-read the refcount and verify it is *1*.
>>>
>>> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
>>> ---
>>> .../bpf/prog_tests/refcounted_kptr.c | 178 +++++++++++++++++-
>>> .../selftests/bpf/progs/refcounted_kptr.c | 160 ++++++++++++++++
>>> 2 files changed, 337 insertions(+), 1 deletion(-)
>>>
>
> [...]
>
>>> @@ -44,3 +44,179 @@ void test_refcounted_kptr_wrong_owner(void)
>>> ASSERT_OK(opts.retval, "rbtree_wrong_owner_remove_fail_a2 retval");
>>> refcounted_kptr__destroy(skel);
>>> }
>>> +
>>> +static void test_refcnt_leak(void *values, size_t values_sz, u64 flags, struct bpf_map *map,
>>> + struct bpf_program *prog_leak, struct bpf_program *prog_check)
>>> +{
>
> [...]
>
>>> +}
>>
>> Just use syscall BPF programs across different subtests, and you can
>> share this test_refcnt_leak() across subtests.
>>
>> It also saves you some code setting up bpf_test_run_opts. You can just
>> call bpf_prog_test_run_opts(prog_fd, NULL) as you don't pass any input
>> from ctx.
>>
>>> +
>>> +static void test_percpu_hash_refcount_leak(void)
>>> +{
>
> [...]
>
>>> +out:
>>> + close(cgroup);
>>> + refcounted_kptr__destroy(skel);
>>> + if (client_fd >= 0)
>>> + close(client_fd);
>>> + if (server_fd >= 0)
>>> + close(server_fd);
>>> +}
>>
>> Then, you won't need to set up server, connection.... just to
>> read/write cgroup local storage. Just call test_refcnt_leak() that
>> runs the two BPF syscall programs for cgroup local storage.
>>
>
> [...]
>
>>
>>
>> And in syscall BPF program, you can simply get the cgroup through the
>> current task
>>
>> SEC("syscall")
>> int syscall_prog(void *ctx)
>> {
>> struct task_struct *task = bpf_get_current_task_btf();
>>
>> v = bpf_cgrp_storage_get(&cgrp_strg, task->cgroups->dfl_cgrp, 0,
>> BPF_LOCAL_STORAGE_GET_F_CREATE);
>> ...
>> }
>>
>
> Hi Amery,
>
> Thanks for the suggestion.
>
> I tried your approach, but the verifier rejected it with the following
> error:
>
> 0: R1=ctx() R10=fp0
> ; task = bpf_get_current_task_btf(); @ refcounted_kptr.c:686
> 0: (85) call bpf_get_current_task_btf#158 ; R0=trusted_ptr_task_struct()
> ; v = bpf_cgrp_storage_get(&cgrp_strg, task->cgroups->dfl_cgrp, 0, @
> refcounted_kptr.c:687
> 1: (79) r1 = *(u64 *)(r0 +4856) ; R0=trusted_ptr_task_struct()
> R1=untrusted_ptr_css_set()
> 2: (79) r2 = *(u64 *)(r1 +96) ; R1=untrusted_ptr_css_set()
> R2=untrusted_ptr_cgroup()
> 3: (18) r1 = 0xffffa1b442a4b800 ; R1=map_ptr(map=cgrp_strg,ks=4,vs=16)
> 5: (b7) r3 = 0 ; R3=0
> 6: (b7) r4 = 1 ; R4=1
> 7: (85) call bpf_cgrp_storage_get#210
> R2 type=untrusted_ptr_ expected=ptr_, trusted_ptr_, rcu_ptr_
> processed 7 insns (limit 1000000) max_states_per_insn 0 total_states 0
> peak_states 0 mark_read 0
>
After analyzing the verifier log (with a bit of AI help), I found that
'task->cgroups->dfl_cgrp' wasn't protected by an RCU read lock.
According to verifier.c::btf_ld_kptr_type(), this pointer must be
accessed either within an RCU critical section or in a non-sleepable
program.
Adding RCU protection fixed the issue:
	bpf_rcu_read_lock();
	v = bpf_cgrp_storage_get(&cgrp_strg, task->cgroups->dfl_cgrp, 0,
				 BPF_LOCAL_STORAGE_GET_F_CREATE);
	bpf_rcu_read_unlock();
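The syscall program then ends up looking roughly like this (a sketch;
cgrp_strg is declared as in your earlier snippet and the value type is
a placeholder):

	void bpf_rcu_read_lock(void) __ksym;
	void bpf_rcu_read_unlock(void) __ksym;

	SEC("syscall")
	int update_cgrp_storage(void *ctx)
	{
		struct task_struct *task = bpf_get_current_task_btf();
		long *v;
		int ret = 0;

		bpf_rcu_read_lock();
		/* inside the RCU section task->cgroups->dfl_cgrp is rcu_ptr_,
		 * which bpf_cgrp_storage_get() accepts
		 */
		v = bpf_cgrp_storage_get(&cgrp_strg, task->cgroups->dfl_cgrp, 0,
					 BPF_LOCAL_STORAGE_GET_F_CREATE);
		if (!v)
			ret = 1;
		bpf_rcu_read_unlock();

		return ret;
	}

Userspace then only needs bpf_prog_test_run_opts(prog_fd, NULL) to run
it, as you suggested.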
With this change, test_refcnt_leak() can now be used across all subtests.
Thanks again for the helpful suggestion.
Thanks,
Leon
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread

Thread overview: 20+ messages
2025-10-26 15:39 [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps Leon Hwang
2025-10-26 15:39 ` [PATCH bpf v3 1/4] bpf: Free special fields when update [lru_,]percpu_hash maps Leon Hwang
2025-10-28 18:03 ` Andrii Nakryiko
2025-10-26 15:39 ` [PATCH bpf v3 2/4] bpf: Free special fields when update hash maps with BPF_F_LOCK Leon Hwang
2025-10-26 15:39 ` [PATCH bpf v3 3/4] bpf: Free special fields when update local storage maps Leon Hwang
2025-10-27 15:44 ` Amery Hung
2025-10-27 16:15 ` Leon Hwang
2025-10-27 17:04 ` Amery Hung
2025-10-28 14:48 ` Leon Hwang
2025-10-28 18:03 ` Andrii Nakryiko
2025-10-26 15:40 ` [PATCH bpf v3 4/4] selftests/bpf: Add tests to verify freeing the special fields when update hash and " Leon Hwang
2025-10-27 16:34 ` Amery Hung
2025-10-29 14:58 ` Leon Hwang
2025-10-30 14:29 ` Leon Hwang
2025-10-28 18:10 ` [PATCH bpf v3 0/4] bpf: Free " patchwork-bot+netdevbpf
2025-10-28 20:22 ` Andrii Nakryiko
2025-10-29 6:49 ` Leon Hwang
2025-10-29 6:57 ` Menglong Dong
2025-10-29 16:38 ` Alexei Starovoitov
2025-10-30 5:37 ` Leon Hwang