public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf-next 0/3] Fix lockdep warning for htab of map
@ 2024-11-06  6:35 Hou Tao
  2024-11-06  6:35 ` [PATCH bpf-next 1/3] bpf: Call free_htab_elem() after htab_unlock_bucket() Hou Tao
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Hou Tao @ 2024-11-06  6:35 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Sebastian Andrzej Siewior, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Hi,

The patch set fixes a lockdep warning for htab of map. The
warning is found when running test_maps. The warning occurs when
htab_put_fd_value() attempts to acquire map_idr_lock to free the map id
of the inner map while already holding the bucket lock (raw_spinlock_t).

The fix moves the invocation of free_htab_elem() after
htab_unlock_bucket() and adds a test case to verify the solution. Please
see the individual patches for details. Comments are always welcome.

Hou Tao (3):
  bpf: Call free_htab_elem() after htab_unlock_bucket()
  selftests/bpf: Move ENOTSUPP to bpf_util.h
  selftests/bpf: Test the update operations for htab of maps

 kernel/bpf/hashtab.c                          |  56 +++++---
 tools/testing/selftests/bpf/bpf_util.h        |   4 +
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     |   4 -
 .../selftests/bpf/prog_tests/lsm_cgroup.c     |   4 -
 .../selftests/bpf/prog_tests/map_in_map.c     | 132 +++++++++++++++++-
 .../selftests/bpf/prog_tests/sock_addr.c      |   4 -
 .../selftests/bpf/progs/update_map_in_htab.c  |  30 ++++
 tools/testing/selftests/bpf/test_maps.c       |   4 -
 tools/testing/selftests/bpf/test_verifier.c   |   4 -
 9 files changed, 204 insertions(+), 38 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/update_map_in_htab.c

-- 
2.29.2


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH bpf-next 1/3] bpf: Call free_htab_elem() after htab_unlock_bucket()
  2024-11-06  6:35 [PATCH bpf-next 0/3] Fix lockdep warning for htab of map Hou Tao
@ 2024-11-06  6:35 ` Hou Tao
  2024-11-06  9:53   ` Hou Tao
  2024-11-06  6:35 ` [PATCH bpf-next 2/3] selftests/bpf: Move ENOTSUPP to bpf_util.h Hou Tao
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Hou Tao @ 2024-11-06  6:35 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Sebastian Andrzej Siewior, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

For htab of maps, when an inner map is removed from the htab, the htab
may hold the last reference to the inner map. bpf_map_fd_put_ptr() will
invoke bpf_map_free_id() to free the id of the removed inner map.
However, bpf_map_fd_put_ptr() is invoked while holding the bucket lock
(raw_spinlock_t), and bpf_map_free_id() attempts to acquire map_idr_lock
(spinlock_t), triggering the following lockdep warning:

  =============================
  [ BUG: Invalid wait context ]
  6.11.0-rc4+ #49 Not tainted
  -----------------------------
  test_maps/4881 is trying to lock:
  ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70
  other info that might help us debug this:
  context-{5:5}
  2 locks held by test_maps/4881:
   #0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270
   #1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80
  stack backtrace:
  CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x6e/0xb0
   dump_stack+0x10/0x20
   __lock_acquire+0x73e/0x36c0
   lock_acquire+0x182/0x450
   _raw_spin_lock_irqsave+0x43/0x70
   bpf_map_free_id.part.0+0x21/0x70
   bpf_map_put+0xcf/0x110
   bpf_map_fd_put_ptr+0x9a/0xb0
   free_htab_elem+0x69/0xe0
   htab_map_update_elem+0x50f/0xa80
   bpf_fd_htab_map_update_elem+0x131/0x270
   htab_map_update_elem+0x50f/0xa80
   bpf_fd_htab_map_update_elem+0x131/0x270
   bpf_map_update_value+0x266/0x380
   __sys_bpf+0x21bb/0x36b0
   __x64_sys_bpf+0x45/0x60
   x64_sys_call+0x1b2a/0x20d0
   do_syscall_64+0x5d/0x100
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

One way to fix the lockdep warning is to use raw_spinlock_t for
map_idr_lock as well. However, bpf_map_alloc_id() invokes
idr_alloc_cyclic() after acquiring map_idr_lock, so the change would
trigger a similar lockdep warning, because the slab's lock
(s->cpu_slab->lock) is still a spinlock_t.

Instead of changing map_idr_lock's type, fix the issue by invoking
htab_put_fd_value() after htab_unlock_bucket(). However, only deferring
the invocation of htab_put_fd_value() is not enough, because the old map
pointers in an htab of maps cannot be saved during batched deletion.
Therefore, also defer the invocation of free_htab_elem(), so the
to-be-freed elements can be linked together, just as the lru map does.

There are four callers for ->map_fd_put_ptr:

(1) alloc_htab_elem() (through htab_put_fd_value())
It invokes ->map_fd_put_ptr() under a raw_spinlock_t. The invocation of
htab_put_fd_value() cannot simply be moved after htab_unlock_bucket(),
because the old element has already been stashed in htab->extra_elems.
It may be reused immediately after htab_unlock_bucket(), so invoking
htab_put_fd_value() after the unlock may incorrectly release a
newly-added element. Therefore, save the map pointer of the old element
for htab of maps before unlocking the bucket, and release the map
pointer after the unlock. Besides the map pointer, handle the special
fields in the old element the same way.

(2) free_htab_elem() (through htab_put_fd_value())
Its caller includes __htab_map_lookup_and_delete_elem(),
htab_map_delete_elem() and __htab_map_lookup_and_delete_batch().

For htab_map_delete_elem(), simply invoke free_htab_elem() after
htab_unlock_bucket(). For __htab_map_lookup_and_delete_batch(), just as
the lru map does, link the to-be-freed elements into a node_to_free list
and invoke free_htab_elem() for these elements after the unlock. It is
safe to reuse batch_flink as the link for node_to_free, because these
elements have already been removed from the hash list.

Because htab of maps doesn't support the lookup_and_delete operation,
__htab_map_lookup_and_delete_elem() doesn't have the problem, so it is
kept as-is.

(3) fd_htab_map_free()
It invokes ->map_fd_put_ptr() without holding a raw_spinlock_t.

(4) bpf_fd_htab_map_update_elem()
It invokes ->map_fd_put_ptr() without holding a raw_spinlock_t.

After moving free_htab_elem() out of the bucket lock's scope, use
pcpu_freelist_push() instead of __pcpu_freelist_push() to disable irqs
before freeing the elements, and protect the invocations of
bpf_mem_cache_free() with a migrate_{disable|enable} pair.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 56 ++++++++++++++++++++++++++++++--------------
 1 file changed, 39 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index b14b87463ee0..3ec941a0ea41 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -896,9 +896,12 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
 	check_and_free_fields(htab, l);
+
+	migrate_disable();
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
 	bpf_mem_cache_free(&htab->ma, l);
+	migrate_enable();
 }
 
 static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
@@ -948,7 +951,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
 	if (htab_is_prealloc(htab)) {
 		bpf_map_dec_elem_count(&htab->map);
 		check_and_free_fields(htab, l);
-		__pcpu_freelist_push(&htab->freelist, &l->fnode);
+		pcpu_freelist_push(&htab->freelist, &l->fnode);
 	} else {
 		dec_elem_count(htab);
 		htab_elem_free(htab, l);
@@ -1018,7 +1021,6 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 			 */
 			pl_new = this_cpu_ptr(htab->extra_elems);
 			l_new = *pl_new;
-			htab_put_fd_value(htab, old_elem);
 			*pl_new = old_elem;
 		} else {
 			struct pcpu_freelist_node *l;
@@ -1105,6 +1107,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	struct htab_elem *l_new = NULL, *l_old;
 	struct hlist_nulls_head *head;
 	unsigned long flags;
+	void *old_map_ptr;
 	struct bucket *b;
 	u32 key_size, hash;
 	int ret;
@@ -1183,12 +1186,27 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
 	if (l_old) {
 		hlist_nulls_del_rcu(&l_old->hash_node);
+
+		/* l_old has already been stashed in htab->extra_elems, free
+		 * its special fields before it is available for reuse. Also
+		 * save the old map pointer in htab of maps before unlock
+		 * and release it after unlock.
+		 */
+		old_map_ptr = NULL;
+		if (htab_is_prealloc(htab)) {
+			if (map->ops->map_fd_put_ptr)
+				old_map_ptr = fd_htab_map_get_ptr(map, l_old);
+			check_and_free_fields(htab, l_old);
+		}
+	}
+	htab_unlock_bucket(htab, b, hash, flags);
+	if (l_old) {
+		if (old_map_ptr)
+			map->ops->map_fd_put_ptr(map, old_map_ptr, true);
 		if (!htab_is_prealloc(htab))
 			free_htab_elem(htab, l_old);
-		else
-			check_and_free_fields(htab, l_old);
 	}
-	ret = 0;
+	return 0;
 err:
 	htab_unlock_bucket(htab, b, hash, flags);
 	return ret;
@@ -1432,15 +1450,15 @@ static long htab_map_delete_elem(struct bpf_map *map, void *key)
 		return ret;
 
 	l = lookup_elem_raw(head, hash, key, key_size);
-
-	if (l) {
+	if (l)
 		hlist_nulls_del_rcu(&l->hash_node);
-		free_htab_elem(htab, l);
-	} else {
+	else
 		ret = -ENOENT;
-	}
 
 	htab_unlock_bucket(htab, b, hash, flags);
+
+	if (l)
+		free_htab_elem(htab, l);
 	return ret;
 }
 
@@ -1853,13 +1871,14 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 			 * may cause deadlock. See comments in function
 			 * prealloc_lru_pop(). Let us do bpf_lru_push_free()
 			 * after releasing the bucket lock.
+			 *
+			 * For htab of maps, htab_put_fd_value() in
+			 * free_htab_elem() may acquire a spinlock with bucket
+			 * lock being held and it violates the lock rule, so
+			 * invoke free_htab_elem() after unlock as well.
 			 */
-			if (is_lru_map) {
-				l->batch_flink = node_to_free;
-				node_to_free = l;
-			} else {
-				free_htab_elem(htab, l);
-			}
+			l->batch_flink = node_to_free;
+			node_to_free = l;
 		}
 		dst_key += key_size;
 		dst_val += value_size;
@@ -1871,7 +1890,10 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 	while (node_to_free) {
 		l = node_to_free;
 		node_to_free = node_to_free->batch_flink;
-		htab_lru_push_free(htab, l);
+		if (is_lru_map)
+			htab_lru_push_free(htab, l);
+		else
+			free_htab_elem(htab, l);
 	}
 
 next_batch:
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH bpf-next 2/3] selftests/bpf: Move ENOTSUPP to bpf_util.h
  2024-11-06  6:35 [PATCH bpf-next 0/3] Fix lockdep warning for htab of map Hou Tao
  2024-11-06  6:35 ` [PATCH bpf-next 1/3] bpf: Call free_htab_elem() after htab_unlock_bucket() Hou Tao
@ 2024-11-06  6:35 ` Hou Tao
  2024-11-06  6:35 ` [PATCH bpf-next 3/3] selftests/bpf: Test the update operations for htab of maps Hou Tao
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Hou Tao @ 2024-11-06  6:35 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Sebastian Andrzej Siewior, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Move the definition of ENOTSUPP into bpf_util.h to remove the
duplicated definitions in multiple files.
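
For context, the guard being consolidated is the usual pattern for a
kernel-internal errno value that libc's <errno.h> does not provide; 524
is ENOTSUPP in the kernel's include/linux/errno.h. A minimal sketch of
the centralized define:

```c
#include <errno.h>
#include <assert.h>

/* ENOTSUPP (524) is kernel-internal, so user space must define it
 * itself; the #ifndef guard keeps the define harmless if some header
 * ever starts providing it. */
#ifndef ENOTSUPP
#define ENOTSUPP 524
#endif
```

After this patch, tests include bpf_util.h instead of each carrying its
own copy of the guard.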

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/bpf_util.h              | 4 ++++
 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c | 4 ----
 tools/testing/selftests/bpf/prog_tests/lsm_cgroup.c | 4 ----
 tools/testing/selftests/bpf/prog_tests/sock_addr.c  | 4 ----
 tools/testing/selftests/bpf/test_maps.c             | 4 ----
 tools/testing/selftests/bpf/test_verifier.c         | 4 ----
 6 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/tools/testing/selftests/bpf/bpf_util.h b/tools/testing/selftests/bpf/bpf_util.h
index 10587a29b967..8cba5821a1d5 100644
--- a/tools/testing/selftests/bpf/bpf_util.h
+++ b/tools/testing/selftests/bpf/bpf_util.h
@@ -59,4 +59,8 @@ static inline void bpf_strlcpy(char *dst, const char *src, size_t sz)
 	(offsetof(TYPE, MEMBER)	+ sizeof_field(TYPE, MEMBER))
 #endif
 
+#ifndef ENOTSUPP
+#define ENOTSUPP 524
+#endif
+
 #endif /* __BPF_UTIL__ */
diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
index 409a06975823..b7d1b52309d0 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
@@ -16,10 +16,6 @@
 #include "tcp_ca_kfunc.skel.h"
 #include "bpf_cc_cubic.skel.h"
 
-#ifndef ENOTSUPP
-#define ENOTSUPP 524
-#endif
-
 static const unsigned int total_bytes = 10 * 1024 * 1024;
 static int expected_stg = 0xeB9F;
 
diff --git a/tools/testing/selftests/bpf/prog_tests/lsm_cgroup.c b/tools/testing/selftests/bpf/prog_tests/lsm_cgroup.c
index 130a3b21e467..6df25de8f080 100644
--- a/tools/testing/selftests/bpf/prog_tests/lsm_cgroup.c
+++ b/tools/testing/selftests/bpf/prog_tests/lsm_cgroup.c
@@ -10,10 +10,6 @@
 #include "cgroup_helpers.h"
 #include "network_helpers.h"
 
-#ifndef ENOTSUPP
-#define ENOTSUPP 524
-#endif
-
 static struct btf *btf;
 
 static __u32 query_prog_cnt(int cgroup_fd, const char *attach_func)
diff --git a/tools/testing/selftests/bpf/prog_tests/sock_addr.c b/tools/testing/selftests/bpf/prog_tests/sock_addr.c
index a6ee7f8d4f79..b2efabbed220 100644
--- a/tools/testing/selftests/bpf/prog_tests/sock_addr.c
+++ b/tools/testing/selftests/bpf/prog_tests/sock_addr.c
@@ -23,10 +23,6 @@
 #include "getpeername_unix_prog.skel.h"
 #include "network_helpers.h"
 
-#ifndef ENOTSUPP
-# define ENOTSUPP 524
-#endif
-
 #define TEST_NS                 "sock_addr"
 #define TEST_IF_PREFIX          "test_sock_addr"
 #define TEST_IPV4               "127.0.0.4"
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 905d5981ace1..8b40e9496af1 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -26,10 +26,6 @@
 #include "test_maps.h"
 #include "testing_helpers.h"
 
-#ifndef ENOTSUPP
-#define ENOTSUPP 524
-#endif
-
 int skips;
 
 static struct bpf_map_create_opts map_opts = { .sz = sizeof(map_opts) };
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 610392dfc4fb..447b68509d76 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -42,10 +42,6 @@
 #include "../../../include/linux/filter.h"
 #include "testing_helpers.h"
 
-#ifndef ENOTSUPP
-#define ENOTSUPP 524
-#endif
-
 #define MAX_INSNS	BPF_MAXINSNS
 #define MAX_EXPECTED_INSNS	32
 #define MAX_UNEXPECTED_INSNS	32
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH bpf-next 3/3] selftests/bpf: Test the update operations for htab of maps
  2024-11-06  6:35 [PATCH bpf-next 0/3] Fix lockdep warning for htab of map Hou Tao
  2024-11-06  6:35 ` [PATCH bpf-next 1/3] bpf: Call free_htab_elem() after htab_unlock_bucket() Hou Tao
  2024-11-06  6:35 ` [PATCH bpf-next 2/3] selftests/bpf: Move ENOTSUPP to bpf_util.h Hou Tao
@ 2024-11-06  6:35 ` Hou Tao
  2024-11-06  8:45 ` [PATCH bpf-next 0/3] Fix lockdep warning for htab of map Sebastian Andrzej Siewior
  2024-11-08 19:50 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 14+ messages in thread
From: Hou Tao @ 2024-11-06  6:35 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Sebastian Andrzej Siewior, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Add test cases to verify that the following four update operations on
htab of maps don't trigger the lockdep warning:

(1) add then delete
(2) add, overwrite, then delete
(3) add, then lookup_and_delete
(4) add two elements, then lookup_and_delete_batch

Test cases are added for pre-allocated and non-preallocated htab of maps
respectively.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../selftests/bpf/prog_tests/map_in_map.c     | 132 +++++++++++++++++-
 .../selftests/bpf/progs/update_map_in_htab.c  |  30 ++++
 2 files changed, 161 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/update_map_in_htab.c

diff --git a/tools/testing/selftests/bpf/prog_tests/map_in_map.c b/tools/testing/selftests/bpf/prog_tests/map_in_map.c
index d2a10eb4e5b5..286a9fb469e2 100644
--- a/tools/testing/selftests/bpf/prog_tests/map_in_map.c
+++ b/tools/testing/selftests/bpf/prog_tests/map_in_map.c
@@ -5,7 +5,9 @@
 #include <sys/syscall.h>
 #include <test_progs.h>
 #include <bpf/btf.h>
+
 #include "access_map_in_map.skel.h"
+#include "update_map_in_htab.skel.h"
 
 struct thread_ctx {
 	pthread_barrier_t barrier;
@@ -127,6 +129,131 @@ static void test_map_in_map_access(const char *prog_name, const char *map_name)
 	access_map_in_map__destroy(skel);
 }
 
+static void add_del_fd_htab(int outer_fd)
+{
+	int inner_fd, err;
+	int key = 1;
+
+	inner_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "arr1", 4, 4, 1, NULL);
+	if (!ASSERT_OK_FD(inner_fd, "inner1"))
+		return;
+	err = bpf_map_update_elem(outer_fd, &key, &inner_fd, BPF_NOEXIST);
+	close(inner_fd);
+	if (!ASSERT_OK(err, "add"))
+		return;
+
+	/* Delete */
+	err = bpf_map_delete_elem(outer_fd, &key);
+	ASSERT_OK(err, "del");
+}
+
+static void overwrite_fd_htab(int outer_fd)
+{
+	int inner_fd, err;
+	int key = 1;
+
+	inner_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "arr1", 4, 4, 1, NULL);
+	if (!ASSERT_OK_FD(inner_fd, "inner1"))
+		return;
+	err = bpf_map_update_elem(outer_fd, &key, &inner_fd, BPF_NOEXIST);
+	close(inner_fd);
+	if (!ASSERT_OK(err, "add"))
+		return;
+
+	/* Overwrite */
+	inner_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "arr2", 4, 4, 1, NULL);
+	if (!ASSERT_OK_FD(inner_fd, "inner2"))
+		goto out;
+	err = bpf_map_update_elem(outer_fd, &key, &inner_fd, BPF_EXIST);
+	close(inner_fd);
+	if (!ASSERT_OK(err, "overwrite"))
+		goto out;
+
+	err = bpf_map_delete_elem(outer_fd, &key);
+	ASSERT_OK(err, "del");
+	return;
+out:
+	bpf_map_delete_elem(outer_fd, &key);
+}
+
+static void lookup_delete_fd_htab(int outer_fd)
+{
+	int key = 1, value;
+	int inner_fd, err;
+
+	inner_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "arr1", 4, 4, 1, NULL);
+	if (!ASSERT_OK_FD(inner_fd, "inner1"))
+		return;
+	err = bpf_map_update_elem(outer_fd, &key, &inner_fd, BPF_NOEXIST);
+	close(inner_fd);
+	if (!ASSERT_OK(err, "add"))
+		return;
+
+	/* lookup_and_delete is not supported for htab of maps */
+	err = bpf_map_lookup_and_delete_elem(outer_fd, &key, &value);
+	ASSERT_EQ(err, -ENOTSUPP, "lookup_del");
+
+	err = bpf_map_delete_elem(outer_fd, &key);
+	ASSERT_OK(err, "del");
+}
+
+static void batched_lookup_delete_fd_htab(int outer_fd)
+{
+	int keys[2] = {1, 2}, values[2];
+	unsigned int cnt, batch;
+	int inner_fd, err;
+
+	inner_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "arr1", 4, 4, 1, NULL);
+	if (!ASSERT_OK_FD(inner_fd, "inner1"))
+		return;
+
+	err = bpf_map_update_elem(outer_fd, &keys[0], &inner_fd, BPF_NOEXIST);
+	close(inner_fd);
+	if (!ASSERT_OK(err, "add1"))
+		return;
+
+	inner_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "arr2", 4, 4, 1, NULL);
+	if (!ASSERT_OK_FD(inner_fd, "inner2"))
+		goto out;
+	err = bpf_map_update_elem(outer_fd, &keys[1], &inner_fd, BPF_NOEXIST);
+	close(inner_fd);
+	if (!ASSERT_OK(err, "add2"))
+		goto out;
+
+	/* batched lookup_and_delete */
+	cnt = ARRAY_SIZE(keys);
+	err = bpf_map_lookup_and_delete_batch(outer_fd, NULL, &batch, keys, values, &cnt, NULL);
+	ASSERT_TRUE((!err || err == -ENOENT), "delete_batch ret");
+	ASSERT_EQ(cnt, ARRAY_SIZE(keys), "delete_batch cnt");
+
+out:
+	bpf_map_delete_elem(outer_fd, &keys[0]);
+}
+
+static void test_update_map_in_htab(bool preallocate)
+{
+	struct update_map_in_htab *skel;
+	int err, fd;
+
+	skel = update_map_in_htab__open();
+	if (!ASSERT_OK_PTR(skel, "open"))
+		return;
+
+	err = update_map_in_htab__load(skel);
+	if (!ASSERT_OK(err, "load"))
+		goto out;
+
+	fd = preallocate ? bpf_map__fd(skel->maps.outer_htab_map) :
+			   bpf_map__fd(skel->maps.outer_alloc_htab_map);
+
+	add_del_fd_htab(fd);
+	overwrite_fd_htab(fd);
+	lookup_delete_fd_htab(fd);
+	batched_lookup_delete_fd_htab(fd);
+out:
+	update_map_in_htab__destroy(skel);
+}
+
 void test_map_in_map(void)
 {
 	if (test__start_subtest("acc_map_in_array"))
@@ -137,5 +264,8 @@ void test_map_in_map(void)
 		test_map_in_map_access("access_map_in_htab", "outer_htab_map");
 	if (test__start_subtest("sleepable_acc_map_in_htab"))
 		test_map_in_map_access("sleepable_access_map_in_htab", "outer_htab_map");
+	if (test__start_subtest("update_map_in_htab"))
+		test_update_map_in_htab(true);
+	if (test__start_subtest("update_map_in_alloc_htab"))
+		test_update_map_in_htab(false);
 }
-
diff --git a/tools/testing/selftests/bpf/progs/update_map_in_htab.c b/tools/testing/selftests/bpf/progs/update_map_in_htab.c
new file mode 100644
index 000000000000..c2066247cd9c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/update_map_in_htab.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2024. Huawei Technologies Co., Ltd */
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+struct inner_map_type {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+	__uint(value_size, 4);
+	__uint(max_entries, 1);
+} inner_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
+	__type(key, int);
+	__type(value, int);
+	__uint(max_entries, 2);
+	__array(values, struct inner_map_type);
+} outer_htab_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, int);
+	__uint(max_entries, 2);
+	__array(values, struct inner_map_type);
+} outer_alloc_htab_map SEC(".maps");
+
+char _license[] SEC("license") = "GPL";
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 0/3] Fix lockdep warning for htab of map
  2024-11-06  6:35 [PATCH bpf-next 0/3] Fix lockdep warning for htab of map Hou Tao
                   ` (2 preceding siblings ...)
  2024-11-06  6:35 ` [PATCH bpf-next 3/3] selftests/bpf: Test the update operations for htab of maps Hou Tao
@ 2024-11-06  8:45 ` Sebastian Andrzej Siewior
  2024-11-06  9:48   ` Hou Tao
  2024-11-08 19:50 ` patchwork-bot+netdevbpf
  4 siblings, 1 reply; 14+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-06  8:45 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, houtao1, xukuohai

On 2024-11-06 14:35:39 [+0800], Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Hi,
Hi Hou,

> The patch set fixes a lockdep warning for htab of map. The
> warning is found when running test_maps. The warning occurs when
> htab_put_fd_value() attempts to acquire map_idr_lock to free the map id
> of the inner map while already holding the bucket lock (raw_spinlock_t).
> 
> The fix moves the invocation of free_htab_elem() after
> htab_unlock_bucket() and adds a test case to verify the solution. Please
> see the individual patches for details. Comments are always welcome.

Thank you.

Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

I've seen that you didn't move check_and_free_fields() out of the
bucket-locked section. The BPF_TIMER type does hrtimer_cancel() if the
timer happens to run on a remote CPU. On PREEMPT_RT this will acquire a
sleeping lock, which is problematic due to the raw_spinlock_t.
Would it be okay to clean up the timer unconditionally via the
workqueue?

Sebastian

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 0/3] Fix lockdep warning for htab of map
  2024-11-06  8:45 ` [PATCH bpf-next 0/3] Fix lockdep warning for htab of map Sebastian Andrzej Siewior
@ 2024-11-06  9:48   ` Hou Tao
  2024-11-06 10:05     ` Sebastian Andrzej Siewior
  2024-11-08 19:52     ` Alexei Starovoitov
  0 siblings, 2 replies; 14+ messages in thread
From: Hou Tao @ 2024-11-06  9:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, houtao1, xukuohai

Hi,

On 11/6/2024 4:45 PM, Sebastian Andrzej Siewior wrote:
> On 2024-11-06 14:35:39 [+0800], Hou Tao wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Hi,
> Hi Hou,
>
>> The patch set fixes a lockdep warning for htab of map. The
>> warning is found when running test_maps. The warning occurs when
>> htab_put_fd_value() attempts to acquire map_idr_lock to free the map id
>> of the inner map while already holding the bucket lock (raw_spinlock_t).
>>
>> The fix moves the invocation of free_htab_elem() after
>> htab_unlock_bucket() and adds a test case to verify the solution. Please
>> see the individual patches for details. Comments are always welcome.
> Thank you.
>
> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> I've seen that you didn't move check_and_free_fields() out of the bucket
> locked section. Type BPF_TIMER does hrtimer_cancel() if the timer
> happens to run on a remote CPU. On PREEMPT_RT this will acquire a
> sleeping lock which is problematic due to the raw_spinlock_t.
> Would it be okay, to cleanup the timer unconditionally via the
> workqueue?

Yes. The patch set still invokes check_and_free_fields() under the
bucket lock when updating an existing element in a pre-allocated htab. I
missed the hrtimer case. For the sleeping lock, you mean the
cpu_base->softirq_expiry_lock in hrtimer_cancel_waiting_running(),
right? Instead of cancelling the timer in a workqueue, maybe we could
temporarily save the old value under the bucket lock and try to free it
outside of the bucket lock, or temporarily disable the extra_elems logic
for this case?
> Sebastian


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Call free_htab_elem() after htab_unlock_bucket()
  2024-11-06  6:35 ` [PATCH bpf-next 1/3] bpf: Call free_htab_elem() after htab_unlock_bucket() Hou Tao
@ 2024-11-06  9:53   ` Hou Tao
  2024-11-08 19:55     ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Hou Tao @ 2024-11-06  9:53 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Sebastian Andrzej Siewior, xukuohai,
	houtao1@huawei.com



On 11/6/2024 2:35 PM, Hou Tao wrote:
> From: Hou Tao <houtao1@huawei.com>
>
> For htab of maps, when the map is removed from the htab, it may hold the
> last reference of the map. bpf_map_fd_put_ptr() will invoke
> bpf_map_free_id() to free the id of the removed map element. However,
> bpf_map_fd_put_ptr() is invoked while holding a bucket lock
> (raw_spinlock_t), and bpf_map_free_id() attempts to acquire map_idr_lock
> (spinlock_t), triggering the following lockdep warning:
>
>   =============================
>   [ BUG: Invalid wait context ]
>   6.11.0-rc4+ #49 Not tainted
>   -----------------------------
>   test_maps/4881 is trying to lock:
>   ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70
>   other info that might help us debug this:
>   context-{5:5}
>   2 locks held by test_maps/4881:
>    #0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270
>    #1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80
>   stack backtrace:
>   CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
>   Call Trace:
>    <TASK>
>    dump_stack_lvl+0x6e/0xb0
>    dump_stack+0x10/0x20
>    __lock_acquire+0x73e/0x36c0
>    lock_acquire+0x182/0x450
>    _raw_spin_lock_irqsave+0x43/0x70
>    bpf_map_free_id.part.0+0x21/0x70
>    bpf_map_put+0xcf/0x110
>    bpf_map_fd_put_ptr+0x9a/0xb0
>    free_htab_elem+0x69/0xe0
>    htab_map_update_elem+0x50f/0xa80
>    bpf_fd_htab_map_update_elem+0x131/0x270
>    htab_map_update_elem+0x50f/0xa80
>    bpf_fd_htab_map_update_elem+0x131/0x270
>    bpf_map_update_value+0x266/0x380
>    __sys_bpf+0x21bb/0x36b0
>    __x64_sys_bpf+0x45/0x60
>    x64_sys_call+0x1b2a/0x20d0
>    do_syscall_64+0x5d/0x100
>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> One way to fix the lockdep warning is using raw_spinlock_t for
> map_idr_lock as well. However, bpf_map_alloc_id() invokes
> idr_alloc_cyclic() after acquiring map_idr_lock, it will trigger a
> similar lockdep warning because the slab's lock (s->cpu_slab->lock) is
> still a spinlock.

Is it OK to move the call to bpf_map_free_id() from bpf_map_put() to
bpf_map_free_deferred()? It could fix the lockdep warning for htab of
maps. Its downside is that freeing of the map id will be delayed, but I
think it will not have a visible effect on the user: because the refcnt
is already 0, trying to get the map fd through the map id will return -ENOENT.
> Instead of changing map_idr_lock's type, fix the issue by invoking
> htab_put_fd_value() after htab_unlock_bucket(). However, only deferring
> the invocation of htab_put_fd_value() is not enough, because the old map
> pointers in htab of maps can not be saved during batched deletion.
> Therefore, also defer the invocation of free_htab_elem(), so these
> to-be-freed elements could be linked together similar to lru map.
>
> There are four callers for ->map_fd_put_ptr:
>
> (1) alloc_htab_elem() (through htab_put_fd_value())
> It invokes ->map_fd_put_ptr() under a raw_spinlock_t. The invocation of
> htab_put_fd_value() can not simply move after htab_unlock_bucket(),
> because the old element has already been stashed in htab->extra_elems.
> It may be reused immediately after htab_unlock_bucket() and the
> invocation of htab_put_fd_value() after htab_unlock_bucket() may release
> the newly-added element incorrectly. Therefore, saving the map pointer
> of the old element for htab of maps before unlocking the bucket and
> releasing the map_ptr after unlock. Beside the map pointer in the old
> element, should do the same thing for the special fields in the old
> element as well.
>
> (2) free_htab_elem() (through htab_put_fd_value())
> Its caller includes __htab_map_lookup_and_delete_elem(),
> htab_map_delete_elem() and __htab_map_lookup_and_delete_batch().
>
> For htab_map_delete_elem(), simply invoke free_htab_elem() after
> htab_unlock_bucket(). For __htab_map_lookup_and_delete_batch(), just
> like lru map, linking the to-be-freed element into node_to_free list
> and invoking free_htab_elem() for these element after unlock. It is safe
> to reuse batch_flink as the link for node_to_free, because these
> elements have been removed from the hash llist.
>
> Because htab of maps doesn't support the lookup_and_delete operation,
> __htab_map_lookup_and_delete_elem() doesn't have the problem, so it is
> kept as-is.
>
> (3) fd_htab_map_free()
> It invokes ->map_fd_put_ptr without raw_spinlock_t.
>
> (4) bpf_fd_htab_map_update_elem()
> It invokes ->map_fd_put_ptr without raw_spinlock_t.
>
> After moving free_htab_elem() outside the htab bucket lock scope, use
> pcpu_freelist_push() instead of __pcpu_freelist_push() to disable
> irqs before freeing elements, and protect the invocations of
> bpf_mem_cache_free() with a migrate_{disable|enable} pair.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  kernel/bpf/hashtab.c | 56 ++++++++++++++++++++++++++++++--------------
>  1 file changed, 39 insertions(+), 17 deletions(-)
>
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index b14b87463ee0..3ec941a0ea41 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -896,9 +896,12 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
>  static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
>  {
>  	check_and_free_fields(htab, l);
> +
> +	migrate_disable();
>  	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
>  		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
>  	bpf_mem_cache_free(&htab->ma, l);
> +	migrate_enable();
>  }
>  
>  static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
> @@ -948,7 +951,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
>  	if (htab_is_prealloc(htab)) {
>  		bpf_map_dec_elem_count(&htab->map);
>  		check_and_free_fields(htab, l);
> -		__pcpu_freelist_push(&htab->freelist, &l->fnode);
> +		pcpu_freelist_push(&htab->freelist, &l->fnode);
>  	} else {
>  		dec_elem_count(htab);
>  		htab_elem_free(htab, l);
> @@ -1018,7 +1021,6 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
>  			 */
>  			pl_new = this_cpu_ptr(htab->extra_elems);
>  			l_new = *pl_new;
> -			htab_put_fd_value(htab, old_elem);
>  			*pl_new = old_elem;
>  		} else {
>  			struct pcpu_freelist_node *l;
> @@ -1105,6 +1107,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	struct htab_elem *l_new = NULL, *l_old;
>  	struct hlist_nulls_head *head;
>  	unsigned long flags;
> +	void *old_map_ptr;
>  	struct bucket *b;
>  	u32 key_size, hash;
>  	int ret;
> @@ -1183,12 +1186,27 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	hlist_nulls_add_head_rcu(&l_new->hash_node, head);
>  	if (l_old) {
>  		hlist_nulls_del_rcu(&l_old->hash_node);
> +
> +		/* l_old has already been stashed in htab->extra_elems, free
> +		 * its special fields before it is available for reuse. Also
> +		 * save the old map pointer in htab of maps before unlock
> +		 * and release it after unlock.
> +		 */
> +		old_map_ptr = NULL;
> +		if (htab_is_prealloc(htab)) {
> +			if (map->ops->map_fd_put_ptr)
> +				old_map_ptr = fd_htab_map_get_ptr(map, l_old);
> +			check_and_free_fields(htab, l_old);
> +		}
> +	}
> +	htab_unlock_bucket(htab, b, hash, flags);
> +	if (l_old) {
> +		if (old_map_ptr)
> +			map->ops->map_fd_put_ptr(map, old_map_ptr, true);
>  		if (!htab_is_prealloc(htab))
>  			free_htab_elem(htab, l_old);
> -		else
> -			check_and_free_fields(htab, l_old);
>  	}
> -	ret = 0;
> +	return 0;
>  err:
>  	htab_unlock_bucket(htab, b, hash, flags);
>  	return ret;
> @@ -1432,15 +1450,15 @@ static long htab_map_delete_elem(struct bpf_map *map, void *key)
>  		return ret;
>  
>  	l = lookup_elem_raw(head, hash, key, key_size);
> -
> -	if (l) {
> +	if (l)
>  		hlist_nulls_del_rcu(&l->hash_node);
> -		free_htab_elem(htab, l);
> -	} else {
> +	else
>  		ret = -ENOENT;
> -	}
>  
>  	htab_unlock_bucket(htab, b, hash, flags);
> +
> +	if (l)
> +		free_htab_elem(htab, l);
>  	return ret;
>  }
>  
> @@ -1853,13 +1871,14 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>  			 * may cause deadlock. See comments in function
>  			 * prealloc_lru_pop(). Let us do bpf_lru_push_free()
>  			 * after releasing the bucket lock.
> +			 *
> +			 * For htab of maps, htab_put_fd_value() in
> +			 * free_htab_elem() may acquire a spinlock with bucket
> +			 * lock being held and it violates the lock rule, so
> +			 * invoke free_htab_elem() after unlock as well.
>  			 */
> -			if (is_lru_map) {
> -				l->batch_flink = node_to_free;
> -				node_to_free = l;
> -			} else {
> -				free_htab_elem(htab, l);
> -			}
> +			l->batch_flink = node_to_free;
> +			node_to_free = l;
>  		}
>  		dst_key += key_size;
>  		dst_val += value_size;
> @@ -1871,7 +1890,10 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
>  	while (node_to_free) {
>  		l = node_to_free;
>  		node_to_free = node_to_free->batch_flink;
> -		htab_lru_push_free(htab, l);
> +		if (is_lru_map)
> +			htab_lru_push_free(htab, l);
> +		else
> +			free_htab_elem(htab, l);
>  	}
>  
>  next_batch:


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 0/3] Fix lockdep warning for htab of map
  2024-11-06  9:48   ` Hou Tao
@ 2024-11-06 10:05     ` Sebastian Andrzej Siewior
  2024-11-08 19:52     ` Alexei Starovoitov
  1 sibling, 0 replies; 14+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-06 10:05 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, houtao1, xukuohai

On 2024-11-06 17:48:59 [+0800], Hou Tao wrote:
> Hi,
Hi,

> Yes. The patch set still invokes check_and_free_fields() under the
> bucket lock when updating an existing element in a pre-allocated htab. I
> missed the hrtimer case. For the sleeping lock, you mean the
> cpu_base->softirq_expiry_lock in hrtimer_cancel_waiting_running(), right

Yes.

> ? Instead of cancelling the timer in a workqueue, maybe we could save
> the old value temporarily under the bucket lock and try to free it
> outside of the bucket lock, or disable the extra_elems logic
> temporarily for this case?

Well, it is up to you. Either:

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 1a43d06eab286..b077af12fc9b4 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1593,10 +1593,7 @@ void bpf_timer_cancel_and_free(void *val)
 	 * To avoid these issues, punt to workqueue context when we are in a
 	 * timer callback.
 	 */
-	if (this_cpu_read(hrtimer_running))
-		queue_work(system_unbound_wq, &t->cb.delete_work);
-	else
-		bpf_timer_delete_work(&t->cb.delete_work);
+	queue_work(system_unbound_wq, &t->cb.delete_work);
 }
 
 /* This function is called by map_delete/update_elem for individual element and

Or something smarter where you cancel the timer outside of the bucket
lock.

Sebastian

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 0/3] Fix lockdep warning for htab of map
  2024-11-06  6:35 [PATCH bpf-next 0/3] Fix lockdep warning for htab of map Hou Tao
                   ` (3 preceding siblings ...)
  2024-11-06  8:45 ` [PATCH bpf-next 0/3] Fix lockdep warning for htab of map Sebastian Andrzej Siewior
@ 2024-11-08 19:50 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 14+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-11-08 19:50 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, martin.lau, alexei.starovoitov, andrii, eddyz87, song,
	haoluo, yonghong.song, daniel, kpsingh, sdf, jolsa,
	john.fastabend, bigeasy, houtao1, xukuohai

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Wed,  6 Nov 2024 14:35:39 +0800 you wrote:
> From: Hou Tao <houtao1@huawei.com>
> 
> Hi,
> 
> The patch set fixes a lockdep warning for htab of map. The
> warning is found when running test_maps. The warning occurs when
> htab_put_fd_value() attempts to acquire map_idr_lock to free the map id
> of the inner map while already holding the bucket lock (raw_spinlock_t).
> 
> [...]

Here is the summary with links:
  - [bpf-next,1/3] bpf: Call free_htab_elem() after htab_unlock_bucket()
    https://git.kernel.org/bpf/bpf-next/c/81eef03cbf70
  - [bpf-next,2/3] selftests/bpf: Move ENOTSUPP from bpf_util.h
    https://git.kernel.org/bpf/bpf-next/c/7d5e83b4d41b
  - [bpf-next,3/3] selftests/bpf: Test the update operations for htab of maps
    https://git.kernel.org/bpf/bpf-next/c/ff605ec69735

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 0/3] Fix lockdep warning for htab of map
  2024-11-06  9:48   ` Hou Tao
  2024-11-06 10:05     ` Sebastian Andrzej Siewior
@ 2024-11-08 19:52     ` Alexei Starovoitov
  2024-11-09  1:48       ` Hou Tao
  1 sibling, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2024-11-08 19:52 UTC (permalink / raw)
  To: Hou Tao
  Cc: Sebastian Andrzej Siewior, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Hou Tao, Xu Kuohai

On Wed, Nov 6, 2024 at 1:49 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 11/6/2024 4:45 PM, Sebastian Andrzej Siewior wrote:
> > On 2024-11-06 14:35:39 [+0800], Hou Tao wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> Hi,
> > Hi Hou,
> >
> >> The patch set fixes a lockdep warning for htab of map. The
> >> warning is found when running test_maps. The warning occurs when
> >> htab_put_fd_value() attempts to acquire map_idr_lock to free the map id
> >> of the inner map while already holding the bucket lock (raw_spinlock_t).
> >>
> >> The fix moves the invocation of free_htab_elem() after
> >> htab_unlock_bucket() and adds a test case to verify the solution. Please
> >> see the individual patches for details. Comments are always welcome.

The fix makes sense.
I manually resolved the merge conflict and applied it.

> > Thank you.
> >
> > Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> >
> > I've seen that you didn't move check_and_free_fields() out of the bucket
> > locked section. Type BPF_TIMER does hrtimer_cancel() if the timer
> > happens to run on a remote CPU. On PREEMPT_RT this will acquire a
> > sleeping lock which is problematic due to the raw_spinlock_t.
> > Would it be okay, to cleanup the timer unconditionally via the
> > workqueue?
>
> Yes. The patch set still invokes check_and_free_fields() under the
> bucket lock when updating an existing element in a pre-allocated htab. I
> missed the hrtimer case. For the sleeping lock, you mean the
> cpu_base->softirq_expiry_lock in hrtimer_cancel_waiting_running(), right
> ? Instead of cancelling the timer in a workqueue, maybe we could save
> the old value temporarily under the bucket lock and try to free it
> outside of the bucket lock, or disable the extra_elems logic
> temporarily for this case?

We definitely need to avoid spamming wq when cancelling timers.
wq may not be able to handle the volume.
Moving it outside of bucket lock is certainly better.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Call free_htab_elem() after htab_unlock_bucket()
  2024-11-06  9:53   ` Hou Tao
@ 2024-11-08 19:55     ` Alexei Starovoitov
  2024-11-09  1:29       ` Hou Tao
  0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2024-11-08 19:55 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Sebastian Andrzej Siewior, Xu Kuohai, houtao1@huawei.com

On Wed, Nov 6, 2024 at 1:53 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
>
>
> On 11/6/2024 2:35 PM, Hou Tao wrote:
> > From: Hou Tao <houtao1@huawei.com>
> >
> > For htab of maps, when the map is removed from the htab, it may hold the
> > last reference of the map. bpf_map_fd_put_ptr() will invoke
> > bpf_map_free_id() to free the id of the removed map element. However,
> > bpf_map_fd_put_ptr() is invoked while holding a bucket lock
> > (raw_spin_lock_t), and bpf_map_free_id() attempts to acquire map_idr_lock
> > (spinlock_t), triggering the following lockdep warning:
> >
> >   =============================
> >   [ BUG: Invalid wait context ]
> >   6.11.0-rc4+ #49 Not tainted
> >   -----------------------------
> >   test_maps/4881 is trying to lock:
> >   ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70
> >   other info that might help us debug this:
> >   context-{5:5}
> >   2 locks held by test_maps/4881:
> >    #0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270
> >    #1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80
> >   stack backtrace:
> >   CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49
> >   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
> >   Call Trace:
> >    <TASK>
> >    dump_stack_lvl+0x6e/0xb0
> >    dump_stack+0x10/0x20
> >    __lock_acquire+0x73e/0x36c0
> >    lock_acquire+0x182/0x450
> >    _raw_spin_lock_irqsave+0x43/0x70
> >    bpf_map_free_id.part.0+0x21/0x70
> >    bpf_map_put+0xcf/0x110
> >    bpf_map_fd_put_ptr+0x9a/0xb0
> >    free_htab_elem+0x69/0xe0
> >    htab_map_update_elem+0x50f/0xa80
> >    bpf_fd_htab_map_update_elem+0x131/0x270
> >    htab_map_update_elem+0x50f/0xa80
> >    bpf_fd_htab_map_update_elem+0x131/0x270
> >    bpf_map_update_value+0x266/0x380
> >    __sys_bpf+0x21bb/0x36b0
> >    __x64_sys_bpf+0x45/0x60
> >    x64_sys_call+0x1b2a/0x20d0
> >    do_syscall_64+0x5d/0x100
> >    entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > One way to fix the lockdep warning is using raw_spinlock_t for
> > map_idr_lock as well. However, bpf_map_alloc_id() invokes
> > idr_alloc_cyclic() after acquiring map_idr_lock, it will trigger a
> > similar lockdep warning because the slab's lock (s->cpu_slab->lock) is
> > still a spinlock.
>
> Is it OK to move the call to bpf_map_free_id() from bpf_map_put() to
> bpf_map_free_deferred()? It could fix the lockdep warning for htab of
> maps. Its downside is that freeing of the map id will be delayed, but I
> think it will not have a visible effect on the user: since the refcnt
> is already 0, trying to get the map fd through the map id will return -ENOENT.

I've applied the current patch, since doing the free outside of the
bucket lock is a good thing to do anyway.
With that, there is no need to move bpf_map_free_id(), right?
At least the offload case relies on the id being removed immediately.
I don't remember what else might care.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 1/3] bpf: Call free_htab_elem() after htab_unlock_bucket()
  2024-11-08 19:55     ` Alexei Starovoitov
@ 2024-11-09  1:29       ` Hou Tao
  0 siblings, 0 replies; 14+ messages in thread
From: Hou Tao @ 2024-11-09  1:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend,
	Sebastian Andrzej Siewior, Xu Kuohai, houtao1@huawei.com

Hi,

On 11/9/2024 3:55 AM, Alexei Starovoitov wrote:
> On Wed, Nov 6, 2024 at 1:53 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>
>>
>> On 11/6/2024 2:35 PM, Hou Tao wrote:
>>> From: Hou Tao <houtao1@huawei.com>
>>>
>>> For htab of maps, when the map is removed from the htab, it may hold the
>>> last reference of the map. bpf_map_fd_put_ptr() will invoke
>>> bpf_map_free_id() to free the id of the removed map element. However,
>>> bpf_map_fd_put_ptr() is invoked while holding a bucket lock
>>> (raw_spin_lock_t), and bpf_map_free_id() attempts to acquire map_idr_lock
>>> (spinlock_t), triggering the following lockdep warning:
>>>
>>>   =============================
>>>   [ BUG: Invalid wait context ]
>>>   6.11.0-rc4+ #49 Not tainted
>>>   -----------------------------
>>>   test_maps/4881 is trying to lock:
>>>   ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70
>>>   other info that might help us debug this:
>>>   context-{5:5}
>>>   2 locks held by test_maps/4881:
>>>    #0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270
>>>    #1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80
>>>   stack backtrace:
>>>   CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49
>>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
>>>   Call Trace:
>>>    <TASK>
>>>    dump_stack_lvl+0x6e/0xb0
>>>    dump_stack+0x10/0x20
>>>    __lock_acquire+0x73e/0x36c0
>>>    lock_acquire+0x182/0x450
>>>    _raw_spin_lock_irqsave+0x43/0x70
>>>    bpf_map_free_id.part.0+0x21/0x70
>>>    bpf_map_put+0xcf/0x110
>>>    bpf_map_fd_put_ptr+0x9a/0xb0
>>>    free_htab_elem+0x69/0xe0
>>>    htab_map_update_elem+0x50f/0xa80
>>>    bpf_fd_htab_map_update_elem+0x131/0x270
>>>    htab_map_update_elem+0x50f/0xa80
>>>    bpf_fd_htab_map_update_elem+0x131/0x270
>>>    bpf_map_update_value+0x266/0x380
>>>    __sys_bpf+0x21bb/0x36b0
>>>    __x64_sys_bpf+0x45/0x60
>>>    x64_sys_call+0x1b2a/0x20d0
>>>    do_syscall_64+0x5d/0x100
>>>    entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> One way to fix the lockdep warning is using raw_spinlock_t for
>>> map_idr_lock as well. However, bpf_map_alloc_id() invokes
>>> idr_alloc_cyclic() after acquiring map_idr_lock, it will trigger a
>>> similar lockdep warning because the slab's lock (s->cpu_slab->lock) is
>>> still a spinlock.
>> Is it OK to move the call to bpf_map_free_id() from bpf_map_put() to
>> bpf_map_free_deferred()? It could fix the lockdep warning for htab of
>> maps. Its downside is that freeing of the map id will be delayed, but I
>> think it will not have a visible effect on the user: since the refcnt
>> is already 0, trying to get the map fd through the map id will return -ENOENT.
> I've applied the current patch, since doing free outside of bucket lock
> is a good thing to do anyway.
> With that no need to move bpf_map_free_id(), right?
> At least offload case relies on id being removed immediately.
> I don't remember what else might care.
> .

Yes. There is no need to move bpf_map_free_id() after applying the
patch. Thanks for the information about the offload map. According to
the implementation of __bpf_map_offload_destroy(), it seems that it may
need to call bpf_map_free_id() twice, so for now bpf_map_free_id() frees
the id immediately and sets map->id to 0.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 0/3] Fix lockdep warning for htab of map
  2024-11-08 19:52     ` Alexei Starovoitov
@ 2024-11-09  1:48       ` Hou Tao
  2024-11-09  1:55         ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Hou Tao @ 2024-11-09  1:48 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Sebastian Andrzej Siewior, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Hou Tao, Xu Kuohai

Hi,

On 11/9/2024 3:52 AM, Alexei Starovoitov wrote:
> On Wed, Nov 6, 2024 at 1:49 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 11/6/2024 4:45 PM, Sebastian Andrzej Siewior wrote:
>>> On 2024-11-06 14:35:39 [+0800], Hou Tao wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> Hi,
>>> Hi Hou,
>>>
>>>> The patch set fixes a lockdep warning for htab of map. The
>>>> warning is found when running test_maps. The warning occurs when
>>>> htab_put_fd_value() attempts to acquire map_idr_lock to free the map id
>>>> of the inner map while already holding the bucket lock (raw_spinlock_t).
>>>>
>>>> The fix moves the invocation of free_htab_elem() after
>>>> htab_unlock_bucket() and adds a test case to verify the solution. Please
>>>> see the individual patches for details. Comments are always welcome.
> The fix makes sense.
> I manually resolved merge conflict and applied.

Thanks for manually resolving the conflict. However, the patch set
doesn't move all free operations outside of the lock scope (e.g., for
bpf_map_lookup_and_delete(), which htab of maps doesn't support). I
could post another patch set to do that.
>
>>> Thank you.
>>>
>>> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>>>
>>> I've seen that you didn't move check_and_free_fields() out of the bucket
>>> locked section. Type BPF_TIMER does hrtimer_cancel() if the timer
>>> happens to run on a remote CPU. On PREEMPT_RT this will acquire a
>>> sleeping lock which is problematic due to the raw_spinlock_t.
>>> Would it be okay, to cleanup the timer unconditionally via the
>>> workqueue?
>> Yes. The patch set still invokes check_and_free_fields() under the
>> bucket lock when updating an existing element in a pre-allocated htab. I
>> missed the hrtimer case. For the sleeping lock, you mean the
>> cpu_base->softirq_expiry_lock in hrtimer_cancel_waiting_running(), right
>> ? Instead of cancelling the timer in a workqueue, maybe we could save
>> the old value temporarily under the bucket lock and try to free it
>> outside of the bucket lock, or disable the extra_elems logic
>> temporarily for this case?
> We definitely need to avoid spamming wq when cancelling timers.
> wq may not be able to handle the volume.
> Moving it outside of bucket lock is certainly better.
> .

OK. I will see whether there is a better way to handle the
timer-cancelling case.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH bpf-next 0/3] Fix lockdep warning for htab of map
  2024-11-09  1:48       ` Hou Tao
@ 2024-11-09  1:55         ` Alexei Starovoitov
  0 siblings, 0 replies; 14+ messages in thread
From: Alexei Starovoitov @ 2024-11-09  1:55 UTC (permalink / raw)
  To: Hou Tao, Kumar Kartikeya Dwivedi
  Cc: Sebastian Andrzej Siewior, bpf, Martin KaFai Lau, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Hou Tao, Xu Kuohai

On Fri, Nov 8, 2024 at 5:48 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 11/9/2024 3:52 AM, Alexei Starovoitov wrote:
> > On Wed, Nov 6, 2024 at 1:49 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 11/6/2024 4:45 PM, Sebastian Andrzej Siewior wrote:
> >>> On 2024-11-06 14:35:39 [+0800], Hou Tao wrote:
> >>>> From: Hou Tao <houtao1@huawei.com>
> >>>>
> >>>> Hi,
> >>> Hi Hou,
> >>>
> >>>> The patch set fixes a lockdep warning for htab of map. The
> >>>> warning is found when running test_maps. The warning occurs when
> >>>> htab_put_fd_value() attempts to acquire map_idr_lock to free the map id
> >>>> of the inner map while already holding the bucket lock (raw_spinlock_t).
> >>>>
> >>>> The fix moves the invocation of free_htab_elem() after
> >>>> htab_unlock_bucket() and adds a test case to verify the solution. Please
> >>>> see the individual patches for details. Comments are always welcome.
> > The fix makes sense.
> > I manually resolved merge conflict and applied.
>
> Thanks for the manually conflict resolving. However, the patch set
> doesn't move all free operations outside of lock scope (e.g., for
> bpf_map_lookup_and_delete()), because htab of maps doesn't support it. I
> could post another patch set to do that.

Pls do. It's easier to review incrementally.
For the last year we've been working on "resilient spin lock" and
hope to post patches in a couple weeks.
In those patches we will be replacing htab bucket lock with this new lock,
so moving other locks out of htab lock region fits very well.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-11-09  1:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-06  6:35 [PATCH bpf-next 0/3] Fix lockdep warning for htab of map Hou Tao
2024-11-06  6:35 ` [PATCH bpf-next 1/3] bpf: Call free_htab_elem() after htab_unlock_bucket() Hou Tao
2024-11-06  9:53   ` Hou Tao
2024-11-08 19:55     ` Alexei Starovoitov
2024-11-09  1:29       ` Hou Tao
2024-11-06  6:35 ` [PATCH bpf-next 2/3] selftests/bpf: Move ENOTSUPP from bpf_util.h Hou Tao
2024-11-06  6:35 ` [PATCH bpf-next 3/3] selftests/bpf: Test the update operations for htab of maps Hou Tao
2024-11-06  8:45 ` [PATCH bpf-next 0/3] Fix lockdep warning for htab of map Sebastian Andrzej Siewior
2024-11-06  9:48   ` Hou Tao
2024-11-06 10:05     ` Sebastian Andrzej Siewior
2024-11-08 19:52     ` Alexei Starovoitov
2024-11-09  1:48       ` Hou Tao
2024-11-09  1:55         ` Alexei Starovoitov
2024-11-08 19:50 ` patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox