bpf.vger.kernel.org archive mirror
* [PATCH bpf-next v2 00/20] Support dynptr key for hash map
@ 2025-01-25 11:10 Hou Tao
  2025-01-25 11:10 ` [PATCH bpf-next v2 01/20] bpf: Add two helpers to facilitate the parsing of bpf_dynptr Hou Tao
                   ` (19 more replies)
  0 siblings, 20 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Hi,

The patch set aims to add the basic dynptr key support for hash map as
discussed in [1]. The main motivation is to fully utilize the BTF info
of the map key and to support variable-length keys (e.g., strings or
arbitrary byte streams) for bpf maps. The patch set uses bpf_dynptr to
represent the variable-length parts in the map key, and the total number
of variable-length parts in the map key is limited to 1 for now. Due to
a limitation in the bpf memory allocator, the max size of a dynptr in
the map key is limited to 4088 bytes. Besides the variable-length parts
(dynptr parts), fixed-size parts in the map key are still allowed, so
all of the following map key definitions are valid:

	struct bpf_dynptr;

	struct map_key_1 {
		struct bpf_dynptr name;
	};
	struct map_key_2 {
		int pid;
		struct bpf_dynptr name;
	};
	struct map_key_3 {
		struct map_key_2 f1;
		unsigned long when;
	};

The patch set supports lookup, update and delete operations on a normal
hash map with a dynptr key, from both bpf programs and the bpf syscall.
It also supports lookup_and_delete and get_next_key operations on a
dynptr map key through the bpf syscall.

However, the following operations are not yet fully supported on a hash
map with a dynptr key:

1) batched map operation through bpf syscall
2) the memory accounting for dynptr (aka .htab_map_mem_usage)
3) btf print for the dynptr in map key
4) bpftool support
5) the iteration of elements through bpf program

When a bpf program iterates over the elements of a hash map with a
dynptr key (e.g., through the bpf_for_each_map_elem() helper or a map
element iterator), the dynptr in the map key receives no special
treatment yet and is only exposed as a read-only 16-byte buffer.

The patch set is structured as follows:

Patch #1~#2 introduce BPF_DYNPTR in btf_field_type and parse the
bpf_dynptr in the map key.

Patch #3~#7 remove the need to specify BPF_F_DYNPTR_IN_KEY explicitly,
introduce an internal BPF_INT_F_DYNPTR_IN_KEY map flag, set the internal
flag when there is any bpf_dynptr in the map key btf, and verify that
the value of map_extra is valid when the flag is set.

Patch #8~#9 refactor check_stack_range_initialized() and support
dynptr-keyed map in verifier.

Patch #10~#12 introduce bpf_dynptr_user and support its use in the bpf
syscall for map lookup, lookup_and_delete, update, delete and
get_next_key operations.

Patch #13~#17 update the lookup, lookup_and_delete, update, delete and
get_next_key callbacks accordingly to support the dynptr-keyed hash map.

Patch #18~#19 add positive and negative test cases for hash map with
dynptr key support.

Patch #20 adds the benchmark to compare the lookup and update
performance between normal hash map and dynptr-keyed hash map.

Patch set v2 mainly addresses the suggestions and comments on v1. The
main changes include:
1) remove the need to set BPF_F_DYNPTR_IN_KEY flag explicitly
2) remove bpf_dynptr_user helpers from libbpf
3) support dynptr-keyed map in verifier in a less-intrusive way
4) add always_inline for lookup_{nulls_elem|elem}_raw to alleviate
   the performance degradation

The performance results in v2 are almost the same as in v1. When the max
length of the string is greater than 256, the lookup performance of the
dynptr hash map is better than that of the normal hash map. When the max
length is greater than 512, the update performance of the dynptr hash
map is better than that of the normal hash map. The memory consumption
of the hash map with a dynptr key is also smaller than that of the
normal hash map.

a) lookup operation

max_entries = 8K (randomly generated data set)
| max length of desc | normal hash-map    | dynptr hash-map   |
| ---                |  ---               | ---               |
|  64                | 12.0 M/s (1.7 MB)  | 8.3 M/s (1.4 MB)  |
| 128                |  6.4 M/s (2.2 MB)  | 6.6 M/s (1.7 MB)  |
| 256                |  3.7 M/s (4.2 MB)  | 4.8 M/s (2.3 MB)  |
| 512                |  2.1 M/s (8.2 MB)  | 3.1 M/s (3.8 MB)  |
| 1024               |  1.1 M/s (16 MB)   | 1.9 M/s (6.5 MB)  |
| 2048               |  0.6 M/s (32 MB)   | 1.1 M/s (12 MB)   |
| 4096               |  0.3 M/s (64 MB)   | 0.6 M/s (22 MB)   |

| string in file     | normal hash-map    | dynptr hash-map   |
| ---                |  ---               | ---               |
| kallsyms           |  7.7 M/s (29 MB)   | 7.3 M/s (22 MB)   |
| string in BTF      |  8.0 M/s (22 MB)   | 7.3 M/s (16 MB)   |
| alexa top 1M sites |  3.9 M/s (191 MB)  | 3.7 M/s (138 MB)  |

b) update and delete operation

max_entries = 8K (randomly generated data set)
| max length of desc | normal hash-map    | dynptr hash-map   |
| ---                |  ---               | ---               |
|  64                |  5.0 M/s           | 3.6 M/s           |
| 128                |  3.8 M/s           | 3.4 M/s           |
| 256                |  2.7 M/s           | 2.7 M/s           |
| 512                |  1.7 M/s           | 2.1 M/s           |
| 1024               |  0.9 M/s           | 1.5 M/s           |
| 2048               |  0.5 M/s           | 0.9 M/s           |
| 4096               |  0.3 M/s           | 0.5 M/s           |

| strings in file    | normal hash-map    | dynptr hash-map   |
| ---                |  ---               | ---               |
| kallsyms           |  3.9 M/s           | 2.9 M/s           |
| strings in BTF     |  4.1 M/s           | 3.3 M/s           |
| alexa top 1M sites |  2.7 M/s           | 2.5 M/s           |

As usual, comments and suggestions are always welcome.

PS: I will soon start my long Chinese Lunar New Year holiday, so my
replies may be a bit slow.

---

Change Log:
v2:
  * remove the need to set BPF_F_DYNPTR_IN_KEY flag explicitly
  * remove bpf_dynptr_user helpers from libbpf
  * support dynptr-keyed map in verifier in a less-intrusive way
  * handle the return value of kvmemdup_bpfptr() correctly
  * add necessary comments for ->record and ->key_record
  * use __bpf_md_ptr to define the data field of bpf_dynptr_user
  * add always_inline for lookup_{nulls_elem|elem}_raw
  * add benchmark patch for dynptr-keyed hash map

v1: https://lore.kernel.org/bpf/20241008091501.8302-1-houtao@huaweicloud.com/

[1]: https://lore.kernel.org/bpf/CAADnVQJWaBRB=P-ZNkppwm=0tZaT3qP8PKLLJ2S5SSA2-S8mxg@mail.gmail.com/

Hou Tao (20):
  bpf: Add two helpers to facilitate the parsing of bpf_dynptr
  bpf: Parse bpf_dynptr in map key
  bpf: Factor out get_map_btf() helper
  bpf: Move the initialization of btf before ->map_alloc_check
  bpf: Introduce an internal map flag BPF_INT_F_DYNPTR_IN_KEY
  bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  bpf: Use map_extra to indicate the max data size of dynptrs in map key
  bpf: Split check_stack_range_initialized() into small functions
  bpf: Support map key with dynptr in verifier
  bpf: Introduce bpf_dynptr_user
  bpf: Handle bpf_dynptr_user in bpf syscall when it is used as input
  bpf: Handle bpf_dynptr_user in bpf syscall when it is used as output
  bpf: Support basic operations for dynptr key in hash map
  bpf: Export bpf_dynptr_set_size
  bpf: Support get_next_key operation for dynptr key in hash map
  bpf: Disable unsupported operations for map with dynptr key
  bpf: Enable BPF_INT_F_DYNPTR_IN_KEY for hash map
  selftests/bpf: Add bpf_dynptr_user_init() helper
  selftests/bpf: Add test cases for hash map with dynptr key
  selftests/bpf: Add benchmark for dynptr key support in hash map

 include/linux/bpf.h                           |  40 +-
 include/linux/btf.h                           |   2 +
 include/uapi/linux/bpf.h                      |   6 +
 kernel/bpf/btf.c                              |  46 +-
 kernel/bpf/hashtab.c                          | 319 ++++++++-
 kernel/bpf/helpers.c                          |   2 +-
 kernel/bpf/map_in_map.c                       |  21 +-
 kernel/bpf/syscall.c                          | 363 +++++++++--
 kernel/bpf/verifier.c                         | 373 ++++++++---
 tools/include/uapi/linux/bpf.h                |   6 +
 tools/testing/selftests/bpf/Makefile          |   2 +
 tools/testing/selftests/bpf/bench.c           |  10 +
 .../selftests/bpf/benchs/bench_dynptr_key.c   | 612 ++++++++++++++++++
 .../bpf/benchs/run_bench_dynptr_key.sh        |  51 ++
 tools/testing/selftests/bpf/bpf_util.h        |   9 +
 .../bpf/prog_tests/htab_dynkey_test.c         | 427 ++++++++++++
 .../selftests/bpf/progs/cpumask_common.h      |   2 +-
 .../selftests/bpf/progs/dynptr_key_bench.c    | 250 +++++++
 .../bpf/progs/htab_dynkey_test_failure.c      | 216 +++++++
 .../bpf/progs/htab_dynkey_test_success.c      | 383 +++++++++++
 20 files changed, 2946 insertions(+), 194 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_dynptr_key.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_dynptr_key.sh
 create mode 100644 tools/testing/selftests/bpf/prog_tests/htab_dynkey_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/dynptr_key_bench.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_dynkey_test_failure.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_dynkey_test_success.c

-- 
2.29.2


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 01/20] bpf: Add two helpers to facilitate the parsing of bpf_dynptr
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
@ 2025-01-25 11:10 ` Hou Tao
  2025-02-04 23:17   ` Alexei Starovoitov
  2025-01-25 11:10 ` [PATCH bpf-next v2 02/20] bpf: Parse bpf_dynptr in map key Hou Tao
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Add BPF_DYNPTR to btf_field_type to support bpf_dynptr in map key. The
parsing of bpf_dynptr in btf will be done in the following patch; this
patch only adds two helpers: btf_new_bpf_dynptr_record(), which creates
a btf record that only includes a bpf_dynptr, and btf_type_is_dynptr(),
which checks whether a btf_type is a bpf_dynptr or not.

With the introduction of BPF_DYNPTR, BTF_FIELDS_MAX is changed from 11
to 13; therefore, update the hard-coded number in the cpumask test as
well.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h                           |  5 ++-
 include/linux/btf.h                           |  2 +
 kernel/bpf/btf.c                              | 42 ++++++++++++++++---
 .../selftests/bpf/progs/cpumask_common.h      |  2 +-
 4 files changed, 43 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index feda0ce90f5a3..0ee14ae30100f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -184,8 +184,8 @@ struct bpf_map_ops {
 };
 
 enum {
-	/* Support at most 11 fields in a BTF type */
-	BTF_FIELDS_MAX	   = 11,
+	/* Support at most 13 fields in a BTF type */
+	BTF_FIELDS_MAX	   = 13,
 };
 
 enum btf_field_type {
@@ -204,6 +204,7 @@ enum btf_field_type {
 	BPF_REFCOUNT   = (1 << 9),
 	BPF_WORKQUEUE  = (1 << 10),
 	BPF_UPTR       = (1 << 11),
+	BPF_DYNPTR     = (1 << 12),
 };
 
 typedef void (*btf_dtor_kfunc_t)(void *);
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 2a08a2b55592e..ee1488494c73d 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -223,8 +223,10 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
 			   u32 expected_offset, u32 expected_size);
 struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
 				    u32 field_mask, u32 value_size);
+struct btf_record *btf_new_bpf_dynptr_record(void);
 int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec);
 bool btf_type_is_void(const struct btf_type *t);
+bool btf_type_is_dynptr(const struct btf *btf, const struct btf_type *t);
 s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
 s32 bpf_find_btf_id(const char *name, u32 kind, struct btf **btf_p);
 const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 8396ce1d0fba3..b316631b614fa 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3925,6 +3925,16 @@ static int btf_field_cmp(const void *_a, const void *_b, const void *priv)
 	return 0;
 }
 
+static void btf_init_record(struct btf_record *record)
+{
+	record->cnt = 0;
+	record->field_mask = 0;
+	record->spin_lock_off = -EINVAL;
+	record->timer_off = -EINVAL;
+	record->wq_off = -EINVAL;
+	record->refcount_off = -EINVAL;
+}
+
 struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
 				    u32 field_mask, u32 value_size)
 {
@@ -3943,14 +3953,11 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 	/* This needs to be kzalloc to zero out padding and unused fields, see
 	 * comment in btf_record_equal.
 	 */
-	rec = kzalloc(offsetof(struct btf_record, fields[cnt]), GFP_KERNEL | __GFP_NOWARN);
+	rec = kzalloc(struct_size(rec, fields, cnt), GFP_KERNEL | __GFP_NOWARN);
 	if (!rec)
 		return ERR_PTR(-ENOMEM);
 
-	rec->spin_lock_off = -EINVAL;
-	rec->timer_off = -EINVAL;
-	rec->wq_off = -EINVAL;
-	rec->refcount_off = -EINVAL;
+	btf_init_record(rec);
 	for (i = 0; i < cnt; i++) {
 		field_type_size = btf_field_type_size(info_arr[i].type);
 		if (info_arr[i].off + field_type_size > value_size) {
@@ -4041,6 +4048,25 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 	return ERR_PTR(ret);
 }
 
+struct btf_record *btf_new_bpf_dynptr_record(void)
+{
+	struct btf_record *record;
+
+	record = kzalloc(struct_size(record, fields, 1), GFP_KERNEL | __GFP_NOWARN);
+	if (!record)
+		return ERR_PTR(-ENOMEM);
+
+	btf_init_record(record);
+
+	record->cnt = 1;
+	record->field_mask = BPF_DYNPTR;
+	record->fields[0].offset = 0;
+	record->fields[0].size = sizeof(struct bpf_dynptr);
+	record->fields[0].type = BPF_DYNPTR;
+
+	return record;
+}
+
 int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
 {
 	int i;
@@ -7439,6 +7465,12 @@ static bool btf_is_dynptr_ptr(const struct btf *btf, const struct btf_type *t)
 	return false;
 }
 
+bool btf_type_is_dynptr(const struct btf *btf, const struct btf_type *t)
+{
+	return __btf_type_is_struct(t) && t->size == sizeof(struct bpf_dynptr) &&
+	       !strcmp(__btf_name_by_offset(btf, t->name_off), "bpf_dynptr");
+}
+
 struct bpf_cand_cache {
 	const char *name;
 	u32 name_len;
diff --git a/tools/testing/selftests/bpf/progs/cpumask_common.h b/tools/testing/selftests/bpf/progs/cpumask_common.h
index 4ece7873ba609..afbf2e99b1bb8 100644
--- a/tools/testing/selftests/bpf/progs/cpumask_common.h
+++ b/tools/testing/selftests/bpf/progs/cpumask_common.h
@@ -10,7 +10,7 @@
 /* Should use BTF_FIELDS_MAX, but it is not always available in vmlinux.h,
  * so use the hard-coded number as a workaround.
  */
-#define CPUMASK_KPTR_FIELDS_MAX 11
+#define CPUMASK_KPTR_FIELDS_MAX 13
 
 int err;
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 02/20] bpf: Parse bpf_dynptr in map key
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
  2025-01-25 11:10 ` [PATCH bpf-next v2 01/20] bpf: Add two helpers to facilitate the parsing of bpf_dynptr Hou Tao
@ 2025-01-25 11:10 ` Hou Tao
  2025-02-13 17:59   ` Alexei Starovoitov
  2025-01-25 11:10 ` [PATCH bpf-next v2 03/20] bpf: Factor out get_map_btf() helper Hou Tao
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

To support variable-length keys or strings in a map key, use bpf_dynptr
to represent these variable-length objects and save the bpf_dynptr
fields in the map key. As shown in the example below, a map key with an
integer and a string is defined:

	struct pid_name {
		int pid;
		struct bpf_dynptr name;
	};

The bpf_dynptr in the map key could also be contained indirectly in a
struct as shown below:

	struct pid_name_time {
		struct pid_name process;
		unsigned long long time;
	};

If the whole map key is a bpf_dynptr, the map key could be defined
either as a struct wrapping a single bpf_dynptr or directly as a
bpf_dynptr:

	struct map_key {
		struct bpf_dynptr name;
	};

The bpf program could use bpf_dynptr_init() to initialize the dynptr
part in the map key, and the userspace application will use
bpf_dynptr_user_init() or a similar API to initialize the dynptr. Just
like kptrs in the map value, the bpf_dynptr field in the map key could
also be defined in a nested struct which is contained in the map key
struct.
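
For illustration only, a hypothetical userspace counterpart could look
like the sketch below. The layout of struct bpf_dynptr_user and the
bpf_dynptr_user_init() helper come from later patches in this series
(the helper is added to the selftests in patch 18), so the field names
and the helper signature used here are assumptions:

	#include <string.h>
	#include <bpf/bpf.h>

	/* hypothetical userspace mirror of struct pid_name: the dynptr
	 * field is replaced by struct bpf_dynptr_user when talking to
	 * the bpf syscall
	 */
	struct pid_name_user {
		int pid;
		struct bpf_dynptr_user name;
	};

	static int lookup_by_pid_name(int map_fd, int pid, const char *name,
				      long long *value)
	{
		struct pid_name_user key = { .pid = pid };

		/* assumed signature of the selftest helper added later */
		bpf_dynptr_user_init((void *)name, strlen(name) + 1, &key.name);
		return bpf_map_lookup_elem(map_fd, &key, value);
	}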

The patch updates map_create() accordingly to parse these bpf_dynptr
fields in the map key, just like it does for other special fields in the
map value. To enable bpf_dynptr support in the map key, the map_type
should be BPF_MAP_TYPE_HASH. For now, the max number of bpf_dynptr in a
map key is limited to 1; the limitation can be relaxed later.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h     | 14 ++++++++++++++
 kernel/bpf/btf.c        |  4 ++++
 kernel/bpf/map_in_map.c | 21 +++++++++++++++++----
 kernel/bpf/syscall.c    | 41 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0ee14ae30100f..ed58d5dd6b34b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -271,7 +271,14 @@ struct bpf_map {
 	u64 map_extra; /* any per-map-type extra fields */
 	u32 map_flags;
 	u32 id;
+	/* BTF record for special fields in map value. bpf_dynptr is disallowed
+	 * at present.
+	 */
 	struct btf_record *record;
+	/* BTF record for special fields in map key. Only bpf_dynptr is allowed
+	 * at present.
+	 */
+	struct btf_record *key_record;
 	int numa_node;
 	u32 btf_key_type_id;
 	u32 btf_value_type_id;
@@ -336,6 +343,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
 		return "bpf_rb_node";
 	case BPF_REFCOUNT:
 		return "bpf_refcount";
+	case BPF_DYNPTR:
+		return "bpf_dynptr";
 	default:
 		WARN_ON_ONCE(1);
 		return "unknown";
@@ -366,6 +375,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
 		return sizeof(struct bpf_rb_node);
 	case BPF_REFCOUNT:
 		return sizeof(struct bpf_refcount);
+	case BPF_DYNPTR:
+		return sizeof(struct bpf_dynptr);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
@@ -396,6 +407,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
 		return __alignof__(struct bpf_rb_node);
 	case BPF_REFCOUNT:
 		return __alignof__(struct bpf_refcount);
+	case BPF_DYNPTR:
+		return __alignof__(struct bpf_dynptr);
 	default:
 		WARN_ON_ONCE(1);
 		return 0;
@@ -426,6 +439,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
 	case BPF_KPTR_REF:
 	case BPF_KPTR_PERCPU:
 	case BPF_UPTR:
+	case BPF_DYNPTR:
 		break;
 	default:
 		WARN_ON_ONCE(1);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index b316631b614fa..0ce5180e024a3 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3500,6 +3500,7 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_
 	field_mask_test_name(BPF_RB_ROOT,   "bpf_rb_root");
 	field_mask_test_name(BPF_RB_NODE,   "bpf_rb_node");
 	field_mask_test_name(BPF_REFCOUNT,  "bpf_refcount");
+	field_mask_test_name(BPF_DYNPTR,    "bpf_dynptr");
 
 	/* Only return BPF_KPTR when all other types with matchable names fail */
 	if (field_mask & (BPF_KPTR | BPF_UPTR) && !__btf_type_is_struct(var_type)) {
@@ -3538,6 +3539,7 @@ static int btf_repeat_fields(struct btf_field_info *info, int info_cnt,
 		case BPF_UPTR:
 		case BPF_LIST_HEAD:
 		case BPF_RB_ROOT:
+		case BPF_DYNPTR:
 			break;
 		default:
 			return -EINVAL;
@@ -3660,6 +3662,7 @@ static int btf_find_field_one(const struct btf *btf,
 	case BPF_LIST_NODE:
 	case BPF_RB_NODE:
 	case BPF_REFCOUNT:
+	case BPF_DYNPTR:
 		ret = btf_find_struct(btf, var_type, off, sz, field_type,
 				      info_cnt ? &info[0] : &tmp);
 		if (ret < 0)
@@ -4017,6 +4020,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
 			break;
 		case BPF_LIST_NODE:
 		case BPF_RB_NODE:
+		case BPF_DYNPTR:
 			break;
 		default:
 			ret = -EFAULT;
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 645bd30bc9a9d..564ebcc857564 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -12,6 +12,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 	struct bpf_map *inner_map, *inner_map_meta;
 	u32 inner_map_meta_size;
 	CLASS(fd, f)(inner_map_ufd);
+	int ret;
 
 	inner_map = __bpf_map_get(f);
 	if (IS_ERR(inner_map))
@@ -45,10 +46,15 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 		 * invalid/empty/valid, but ERR_PTR in case of errors. During
 		 * equality NULL or IS_ERR is equivalent.
 		 */
-		struct bpf_map *ret = ERR_CAST(inner_map_meta->record);
-		kfree(inner_map_meta);
-		return ret;
+		ret = PTR_ERR(inner_map_meta->record);
+		goto free_meta;
 	}
+	inner_map_meta->key_record = btf_record_dup(inner_map->key_record);
+	if (IS_ERR(inner_map_meta->key_record)) {
+		ret = PTR_ERR(inner_map_meta->key_record);
+		goto free_record;
+	}
+
 	/* Note: We must use the same BTF, as we also used btf_record_dup above
 	 * which relies on BTF being same for both maps, as some members like
 	 * record->fields.list_head have pointers like value_rec pointing into
@@ -71,6 +77,12 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
 		inner_map_meta->bypass_spec_v1 = inner_map->bypass_spec_v1;
 	}
 	return inner_map_meta;
+
+free_record:
+	btf_record_free(inner_map_meta->record);
+free_meta:
+	kfree(inner_map_meta);
+	return ERR_PTR(ret);
 }
 
 void bpf_map_meta_free(struct bpf_map *map_meta)
@@ -88,7 +100,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
 		meta0->key_size == meta1->key_size &&
 		meta0->value_size == meta1->value_size &&
 		meta0->map_flags == meta1->map_flags &&
-		btf_record_equal(meta0->record, meta1->record);
+		btf_record_equal(meta0->record, meta1->record) &&
+		btf_record_equal(meta0->key_record, meta1->key_record);
 }
 
 void *bpf_map_fd_get_ptr(struct bpf_map *map,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0daf098e32074..6e14208cca813 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -651,6 +651,7 @@ void btf_record_free(struct btf_record *rec)
 		case BPF_TIMER:
 		case BPF_REFCOUNT:
 		case BPF_WORKQUEUE:
+		case BPF_DYNPTR:
 			/* Nothing to release */
 			break;
 		default:
@@ -664,7 +665,9 @@ void btf_record_free(struct btf_record *rec)
 void bpf_map_free_record(struct bpf_map *map)
 {
 	btf_record_free(map->record);
+	btf_record_free(map->key_record);
 	map->record = NULL;
+	map->key_record = NULL;
 }
 
 struct btf_record *btf_record_dup(const struct btf_record *rec)
@@ -703,6 +706,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
 		case BPF_TIMER:
 		case BPF_REFCOUNT:
 		case BPF_WORKQUEUE:
+		case BPF_DYNPTR:
 			/* Nothing to acquire */
 			break;
 		default:
@@ -821,6 +825,8 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 		case BPF_RB_NODE:
 		case BPF_REFCOUNT:
 			break;
+		case BPF_DYNPTR:
+			break;
 		default:
 			WARN_ON_ONCE(1);
 			continue;
@@ -830,6 +836,7 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
 
 static void bpf_map_free(struct bpf_map *map)
 {
+	struct btf_record *key_rec = map->key_record;
 	struct btf_record *rec = map->record;
 	struct btf *btf = map->btf;
 
@@ -850,6 +857,7 @@ static void bpf_map_free(struct bpf_map *map)
 	 * eventually calls bpf_map_free_meta, since inner_map_meta is only a
 	 * template bpf_map struct used during verification.
 	 */
+	btf_record_free(key_rec);
 	btf_record_free(rec);
 	/* Delay freeing of btf for maps, as map_free callback may need
 	 * struct_meta info which will be freed with btf_put().
@@ -1180,6 +1188,8 @@ int map_check_no_btf(const struct bpf_map *map,
 	return -ENOTSUPP;
 }
 
+#define MAX_DYNPTR_CNT_IN_MAP_KEY 1
+
 static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 			 const struct btf *btf, u32 btf_key_id, u32 btf_value_id)
 {
@@ -1202,6 +1212,37 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
 	if (!value_type || value_size != map->value_size)
 		return -EINVAL;
 
+	/* Key BTF type can't be data section */
+	if (btf_type_is_dynptr(btf, key_type))
+		map->key_record = btf_new_bpf_dynptr_record();
+	else if (__btf_type_is_struct(key_type))
+		map->key_record = btf_parse_fields(btf, key_type, BPF_DYNPTR, map->key_size);
+	else
+		map->key_record = NULL;
+	if (!IS_ERR_OR_NULL(map->key_record)) {
+		if (map->key_record->cnt > MAX_DYNPTR_CNT_IN_MAP_KEY) {
+			ret = -E2BIG;
+			goto free_map_tab;
+		}
+		if (map->map_type != BPF_MAP_TYPE_HASH) {
+			ret = -EOPNOTSUPP;
+			goto free_map_tab;
+		}
+		if (!bpf_token_capable(token, CAP_BPF)) {
+			ret = -EPERM;
+			goto free_map_tab;
+		}
+		/* Disallow key with dynptr for special map */
+		if (map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG)) {
+			ret = -EACCES;
+			goto free_map_tab;
+		}
+	} else if (IS_ERR(map->key_record)) {
+		/* Return an error early even the bpf program doesn't use it */
+		ret = PTR_ERR(map->key_record);
+		goto free_map_tab;
+	}
+
 	map->record = btf_parse_fields(btf, value_type,
 				       BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
 				       BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR,
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 03/20] bpf: Factor out get_map_btf() helper
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
  2025-01-25 11:10 ` [PATCH bpf-next v2 01/20] bpf: Add two helpers to facilitate the parsing of bpf_dynptr Hou Tao
  2025-01-25 11:10 ` [PATCH bpf-next v2 02/20] bpf: Parse bpf_dynptr in map key Hou Tao
@ 2025-01-25 11:10 ` Hou Tao
  2025-01-25 11:10 ` [PATCH bpf-next v2 04/20] bpf: Move the initialization of btf before ->map_alloc_check Hou Tao
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

get_map_btf() gets the btf from the given btf fd and ensures that the
btf is not a kernel btf.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/syscall.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 6e14208cca813..ba2df15ae0f1f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1345,6 +1345,21 @@ static bool bpf_net_capable(void)
 	return capable(CAP_NET_ADMIN) || capable(CAP_SYS_ADMIN);
 }
 
+static struct btf *get_map_btf(int btf_fd)
+{
+	struct btf *btf = btf_get_by_fd(btf_fd);
+
+	if (IS_ERR(btf))
+		return btf;
+
+	if (btf_is_kernel(btf)) {
+		btf_put(btf);
+		return ERR_PTR(-EACCES);
+	}
+
+	return btf;
+}
+
 #define BPF_MAP_CREATE_LAST_FIELD map_token_fd
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 04/20] bpf: Move the initialization of btf before ->map_alloc_check
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (2 preceding siblings ...)
  2025-01-25 11:10 ` [PATCH bpf-next v2 03/20] bpf: Factor out get_map_btf() helper Hou Tao
@ 2025-01-25 11:10 ` Hou Tao
  2025-01-25 11:10 ` [PATCH bpf-next v2 05/20] bpf: Introduce an internal map flag BPF_INT_F_DYNPTR_IN_KEY Hou Tao
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Currently, map_create() calls ->map_alloc_check() and ->map_alloc()
first, then it initializes the map btf. In order to support dynptr in
the map key, map_create() needs to check whether there is a bpf_dynptr
in the map key btf type and pass the information to ->map_alloc_check()
and ->map_alloc().

However, the case where btf_vmlinux_value_type_id > 0 needs special
handling. The reason is that the probe of the struct_ops map in libbpf
doesn't pass a valid btf_fd to the map_create syscall, and it expects
->map_alloc() to be invoked before the initialization of the map btf. If
the initialization of the map btf happens before ->map_alloc(), the
probe of struct_ops will fail. To prevent breaking an old libbpf on a
new kernel, the patch only moves the initialization of btf before
->map_alloc_check() for the non-struct-ops map case.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/syscall.c | 91 ++++++++++++++++++++++++++------------------
 1 file changed, 55 insertions(+), 36 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ba2df15ae0f1f..d57bfb30463fa 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1368,6 +1368,7 @@ static int map_create(union bpf_attr *attr)
 	struct bpf_token *token = NULL;
 	int numa_node = bpf_map_attr_numa_node(attr);
 	u32 map_type = attr->map_type;
+	struct btf *btf = NULL;
 	struct bpf_map *map;
 	bool token_flag;
 	int f_flags;
@@ -1391,43 +1392,63 @@ static int map_create(union bpf_attr *attr)
 		return -EINVAL;
 	}
 
+	if (attr->btf_key_type_id || attr->btf_value_type_id) {
+		btf = get_map_btf(attr->btf_fd);
+		if (IS_ERR(btf))
+			return PTR_ERR(btf);
+	}
+
 	if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
 	    attr->map_type != BPF_MAP_TYPE_ARENA &&
-	    attr->map_extra != 0)
-		return -EINVAL;
+	    attr->map_extra != 0) {
+		err = -EINVAL;
+		goto put_btf;
+	}
 
 	f_flags = bpf_get_file_flag(attr->map_flags);
-	if (f_flags < 0)
-		return f_flags;
+	if (f_flags < 0) {
+		err = f_flags;
+		goto put_btf;
+	}
 
 	if (numa_node != NUMA_NO_NODE &&
 	    ((unsigned int)numa_node >= nr_node_ids ||
-	     !node_online(numa_node)))
-		return -EINVAL;
+	     !node_online(numa_node))) {
+		err = -EINVAL;
+		goto put_btf;
+	}
 
 	/* find map type and init map: hashtable vs rbtree vs bloom vs ... */
 	map_type = attr->map_type;
-	if (map_type >= ARRAY_SIZE(bpf_map_types))
-		return -EINVAL;
+	if (map_type >= ARRAY_SIZE(bpf_map_types)) {
+		err = -EINVAL;
+		goto put_btf;
+	}
 	map_type = array_index_nospec(map_type, ARRAY_SIZE(bpf_map_types));
 	ops = bpf_map_types[map_type];
-	if (!ops)
-		return -EINVAL;
+	if (!ops) {
+		err = -EINVAL;
+		goto put_btf;
+	}
 
 	if (ops->map_alloc_check) {
 		err = ops->map_alloc_check(attr);
 		if (err)
-			return err;
+			goto put_btf;
 	}
 	if (attr->map_ifindex)
 		ops = &bpf_map_offload_ops;
-	if (!ops->map_mem_usage)
-		return -EINVAL;
+	if (!ops->map_mem_usage) {
+		err = -EINVAL;
+		goto put_btf;
+	}
 
 	if (token_flag) {
 		token = bpf_token_get_from_fd(attr->map_token_fd);
-		if (IS_ERR(token))
-			return PTR_ERR(token);
+		if (IS_ERR(token)) {
+			err = PTR_ERR(token);
+			goto put_btf;
+		}
 
 		/* if current token doesn't grant map creation permissions,
 		 * then we can't use this token, so ignore it and rely on
@@ -1517,30 +1538,27 @@ static int map_create(union bpf_attr *attr)
 	mutex_init(&map->freeze_mutex);
 	spin_lock_init(&map->owner.lock);
 
-	if (attr->btf_key_type_id || attr->btf_value_type_id ||
-	    /* Even the map's value is a kernel's struct,
-	     * the bpf_prog.o must have BTF to begin with
-	     * to figure out the corresponding kernel's
-	     * counter part.  Thus, attr->btf_fd has
-	     * to be valid also.
-	     */
-	    attr->btf_vmlinux_value_type_id) {
-		struct btf *btf;
-
-		btf = btf_get_by_fd(attr->btf_fd);
-		if (IS_ERR(btf)) {
-			err = PTR_ERR(btf);
-			goto free_map;
-		}
-		if (btf_is_kernel(btf)) {
-			btf_put(btf);
-			err = -EACCES;
-			goto free_map;
+	/* Even the struct_ops map's value is a kernel's struct,
+	 * the bpf_prog.o must have BTF to begin with
+	 * to figure out the corresponding kernel's
+	 * counter part.  Thus, attr->btf_fd has
+	 * to be valid also.
+	 */
+	if (btf || attr->btf_vmlinux_value_type_id) {
+		if (!btf) {
+			btf = get_map_btf(attr->btf_fd);
+			if (IS_ERR(btf)) {
+				err = PTR_ERR(btf);
+				btf = NULL;
+				goto free_map;
+			}
 		}
+
 		map->btf = btf;
+		btf = NULL;
 
 		if (attr->btf_value_type_id) {
-			err = map_check_btf(map, token, btf, attr->btf_key_type_id,
+			err = map_check_btf(map, token, map->btf, attr->btf_key_type_id,
 					    attr->btf_value_type_id);
 			if (err)
 				goto free_map;
@@ -1572,7 +1590,6 @@ static int map_create(union bpf_attr *attr)
 		 * have refcnt-ed it through BPF_MAP_GET_FD_BY_ID.
 		 */
 		bpf_map_put_with_uref(map);
-		return err;
 	}
 
 	return err;
@@ -1583,6 +1600,8 @@ static int map_create(union bpf_attr *attr)
 	bpf_map_free(map);
 put_token:
 	bpf_token_put(token);
+put_btf:
+	btf_put(btf);
 	return err;
 }
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 05/20] bpf: Introduce an internal map flag BPF_INT_F_DYNPTR_IN_KEY
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (3 preceding siblings ...)
  2025-01-25 11:10 ` [PATCH bpf-next v2 04/20] bpf: Move the initialization of btf before ->map_alloc_check Hou Tao
@ 2025-01-25 11:10 ` Hou Tao
  2025-01-25 11:10 ` [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally Hou Tao
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Introduce an internal map flag BPF_INT_F_DYNPTR_IN_KEY to support dynptr
in map key. Add the corresponding helper bpf_map_has_dynptr_key() to
check whether the support of dynptr-key is enabled.

The reason for an internal map flag is twofold:

1) the user doesn't need to set the map flag explicitly
map_create() will use the presence of bpf_dynptr in the map key as an
indicator for enabling the dynptr key.

2) avoid adding new arguments for ->map_alloc_check() and ->map_alloc()
map_create() needs to pass the supported status of the dynptr key to
->map_alloc_check (e.g., to check the maximum length of the dynptr data
size) and ->map_alloc (e.g., to check whether a dynptr key fits the
current map type). Adding new arguments for these callbacks to achieve
that would introduce too much churn.

Therefore, the patch uses the topmost bit of map_flags as the internal
map flag. map_create() checks at the beginning that the internal flag is
not set by userspace, and bpf_map_get_info_by_fd() clears the internal
flag before returning the map flags to userspace.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h  | 17 +++++++++++++++++
 kernel/bpf/syscall.c |  4 +++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ed58d5dd6b34b..ee02a5d313c56 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -258,6 +258,14 @@ struct bpf_list_node_kern {
 	void *owner;
 } __attribute__((aligned(8)));
 
+/* Internal map flags */
+enum {
+	/* map key supports bpf_dynptr */
+	BPF_INT_F_DYNPTR_IN_KEY = (1U << 31),
+};
+
+#define BPF_INT_F_MASK (1U << 31)
+
 struct bpf_map {
 	const struct bpf_map_ops *ops;
 	struct bpf_map *inner_map_meta;
@@ -269,6 +277,10 @@ struct bpf_map {
 	u32 value_size;
 	u32 max_entries;
 	u64 map_extra; /* any per-map-type extra fields */
+	/* The topmost bit of map_flags is used as an internal map flag
+	 * (aka BPF_INT_F_DYNPTR_IN_KEY) and it can't be set through bpf
+	 * syscall.
+	 */
 	u32 map_flags;
 	u32 id;
 	/* BTF record for special fields in map value. bpf_dynptr is disallowed
@@ -317,6 +329,11 @@ struct bpf_map {
 	s64 __percpu *elem_count;
 };
 
+static inline bool bpf_map_has_dynptr_key(const struct bpf_map *map)
+{
+	return map->map_flags & BPF_INT_F_DYNPTR_IN_KEY;
+}
+
 static inline const char *btf_field_type_name(enum btf_field_type type)
 {
 	switch (type) {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d57bfb30463fa..07c67ad1a6a07 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1378,6 +1378,8 @@ static int map_create(union bpf_attr *attr)
 	if (err)
 		return -EINVAL;
 
+	if (attr->map_flags & BPF_INT_F_MASK)
+		return -EINVAL;
 	/* check BPF_F_TOKEN_FD flag, remember if it's set, and then clear it
 	 * to avoid per-map type checks tripping on unknown flag
 	 */
@@ -5057,7 +5059,7 @@ static int bpf_map_get_info_by_fd(struct file *file,
 	info.key_size = map->key_size;
 	info.value_size = map->value_size;
 	info.max_entries = map->max_entries;
-	info.map_flags = map->map_flags;
+	info.map_flags = map->map_flags & ~BPF_INT_F_MASK;
 	info.map_extra = map->map_extra;
 	memcpy(info.name, map->name, sizeof(map->name));
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (4 preceding siblings ...)
  2025-01-25 11:10 ` [PATCH bpf-next v2 05/20] bpf: Introduce an internal map flag BPF_INT_F_DYNPTR_IN_KEY Hou Tao
@ 2025-01-25 11:10 ` Hou Tao
  2025-02-13 23:56   ` Alexei Starovoitov
  2025-01-25 11:10 ` [PATCH bpf-next v2 07/20] bpf: Use map_extra to indicate the max data size of dynptrs in map key Hou Tao
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

When there is a bpf_dynptr field in the map key btf type, or the map key
btf type itself is bpf_dynptr, set BPF_INT_F_DYNPTR_IN_KEY in map_flags.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/syscall.c | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 07c67ad1a6a07..46b96d062d2db 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1360,6 +1360,34 @@ static struct btf *get_map_btf(int btf_fd)
 	return btf;
 }
 
+static int map_has_dynptr_in_key_type(struct btf *btf, u32 btf_key_id, u32 key_size)
+{
+	const struct btf_type *type;
+	struct btf_record *record;
+	u32 btf_key_size;
+
+	if (!btf_key_id)
+		return 0;
+
+	type = btf_type_id_size(btf, &btf_key_id, &btf_key_size);
+	if (!type || btf_key_size != key_size)
+		return -EINVAL;
+
+	/* For dynptr key, key BTF type must be struct */
+	if (!__btf_type_is_struct(type))
+		return 0;
+
+	if (btf_type_is_dynptr(btf, type))
+		return 1;
+
+	record = btf_parse_fields(btf, type, BPF_DYNPTR, key_size);
+	if (IS_ERR(record))
+		return PTR_ERR(record);
+
+	btf_record_free(record);
+	return !!record;
+}
+
 #define BPF_MAP_CREATE_LAST_FIELD map_token_fd
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
@@ -1398,6 +1426,14 @@ static int map_create(union bpf_attr *attr)
 		btf = get_map_btf(attr->btf_fd);
 		if (IS_ERR(btf))
 			return PTR_ERR(btf);
+
+		err = map_has_dynptr_in_key_type(btf, attr->btf_key_type_id, attr->key_size);
+		if (err < 0)
+			goto put_btf;
+		if (err > 0) {
+			attr->map_flags |= BPF_INT_F_DYNPTR_IN_KEY;
+			err = 0;
+		}
 	}
 
 	if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 07/20] bpf: Use map_extra to indicate the max data size of dynptrs in map key
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (5 preceding siblings ...)
  2025-01-25 11:10 ` [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally Hou Tao
@ 2025-01-25 11:10 ` Hou Tao
  2025-02-13 18:02   ` Alexei Starovoitov
  2025-01-25 11:10 ` [PATCH bpf-next v2 08/20] bpf: Split check_stack_range_initialized() into small functions Hou Tao
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

A map with dynptr key support needs to use map_extra to specify the
maximum data length of these dynptrs. The implementation of the map will
check during map creation that map_extra does not exceed the limitation
imposed by the memory allocator. It may also use map_extra to optimize
the memory allocation for the dynptr.
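
For example, a dynptr-keyed hash map defined in a bpf program could pass
the limit through the existing map_extra field of the map definition.
This is only a sketch, assuming the usual libbpf __uint(map_extra, ...)
convention applies to dynptr-keyed hash maps as well:

	struct map_key {
		struct bpf_dynptr name;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 8192);
		__uint(map_extra, 4088);	/* max data size of any dynptr in the key */
		__type(key, struct map_key);
		__type(value, __u64);
	} str_htab SEC(".maps");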

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/syscall.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 46b96d062d2db..79459b218109e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1438,6 +1438,7 @@ static int map_create(union bpf_attr *attr)
 
 	if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
 	    attr->map_type != BPF_MAP_TYPE_ARENA &&
+	    !(attr->map_flags & BPF_INT_F_DYNPTR_IN_KEY) &&
 	    attr->map_extra != 0) {
 		err = -EINVAL;
 		goto put_btf;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 08/20] bpf: Split check_stack_range_initialized() into small functions
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (6 preceding siblings ...)
  2025-01-25 11:10 ` [PATCH bpf-next v2 07/20] bpf: Use map_extra to indicate the max data size of dynptrs in map key Hou Tao
@ 2025-01-25 11:10 ` Hou Tao
  2025-01-25 11:10 ` [PATCH bpf-next v2 09/20] bpf: Support map key with dynptr in verifier Hou Tao
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

This is a preparatory patch for supporting a map key with bpf_dynptr in
the verifier. The patch splits check_stack_range_initialized() into
multiple small functions, and the following patch will reuse these
functions to check whether the access to a stack range which contains a
bpf_dynptr is valid or not.

Besides the splitting of check_stack_range_initialized(), the patch also
renames it to check_stack_range_access() to better reflect its purpose,
because the function also allows an uninitialized stack range.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/verifier.c | 209 ++++++++++++++++++++++++------------------
 1 file changed, 121 insertions(+), 88 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 74525392714e2..290b9b93017c0 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -791,7 +791,7 @@ static void invalidate_dynptr(struct bpf_verifier_env *env, struct bpf_func_stat
 	 * While we don't allow reading STACK_INVALID, it is still possible to
 	 * do <8 byte writes marking some but not all slots as STACK_MISC. Then,
 	 * helpers or insns can do partial read of that part without failing,
-	 * but check_stack_range_initialized, check_stack_read_var_off, and
+	 * but check_stack_range_access, check_stack_read_var_off, and
 	 * check_stack_read_fixed_off will do mark_reg_read for all 8-bytes of
 	 * the slot conservatively. Hence we need to prevent those liveness
 	 * marking walks.
@@ -5301,11 +5301,11 @@ enum bpf_access_src {
 	ACCESS_HELPER = 2,  /* the access is performed by a helper */
 };
 
-static int check_stack_range_initialized(struct bpf_verifier_env *env,
-					 int regno, int off, int access_size,
-					 bool zero_size_allowed,
-					 enum bpf_access_type type,
-					 struct bpf_call_arg_meta *meta);
+static int check_stack_range_access(struct bpf_verifier_env *env,
+				    int regno, int off, int access_size,
+				    bool zero_size_allowed,
+				    enum bpf_access_type type,
+				    struct bpf_call_arg_meta *meta);
 
 static struct bpf_reg_state *reg_state(struct bpf_verifier_env *env, int regno)
 {
@@ -5336,8 +5336,8 @@ static int check_stack_read_var_off(struct bpf_verifier_env *env,
 
 	/* Note that we pass a NULL meta, so raw access will not be permitted.
 	 */
-	err = check_stack_range_initialized(env, ptr_regno, off, size,
-					    false, BPF_READ, NULL);
+	err = check_stack_range_access(env, ptr_regno, off, size,
+				       false, BPF_READ, NULL);
 	if (err)
 		return err;
 
@@ -7625,44 +7625,13 @@ static int check_atomic(struct bpf_verifier_env *env, int insn_idx, struct bpf_i
 	return 0;
 }
 
-/* When register 'regno' is used to read the stack (either directly or through
- * a helper function) make sure that it's within stack boundary and, depending
- * on the access type and privileges, that all elements of the stack are
- * initialized.
- *
- * 'off' includes 'regno->off', but not its dynamic part (if any).
- *
- * All registers that have been spilled on the stack in the slots within the
- * read offsets are marked as read.
- */
-static int check_stack_range_initialized(
-		struct bpf_verifier_env *env, int regno, int off,
-		int access_size, bool zero_size_allowed,
-		enum bpf_access_type type, struct bpf_call_arg_meta *meta)
+static int get_stack_access_range(struct bpf_verifier_env *env, int regno, int off,
+				  int *min_off, int *max_off)
 {
 	struct bpf_reg_state *reg = reg_state(env, regno);
-	struct bpf_func_state *state = func(env, reg);
-	int err, min_off, max_off, i, j, slot, spi;
-	/* Some accesses can write anything into the stack, others are
-	 * read-only.
-	 */
-	bool clobber = false;
-
-	if (access_size == 0 && !zero_size_allowed) {
-		verbose(env, "invalid zero-sized read\n");
-		return -EACCES;
-	}
-
-	if (type == BPF_WRITE)
-		clobber = true;
-
-	err = check_stack_access_within_bounds(env, regno, off, access_size, type);
-	if (err)
-		return err;
-
 
 	if (tnum_is_const(reg->var_off)) {
-		min_off = max_off = reg->var_off.value + off;
+		*min_off = *max_off = reg->var_off.value + off;
 	} else {
 		/* Variable offset is prohibited for unprivileged mode for
 		 * simplicity since it requires corresponding support in
@@ -7677,49 +7646,76 @@ static int check_stack_range_initialized(
 				regno, tn_buf);
 			return -EACCES;
 		}
-		/* Only initialized buffer on stack is allowed to be accessed
-		 * with variable offset. With uninitialized buffer it's hard to
-		 * guarantee that whole memory is marked as initialized on
-		 * helper return since specific bounds are unknown what may
-		 * cause uninitialized stack leaking.
-		 */
-		if (meta && meta->raw_mode)
-			meta = NULL;
 
-		min_off = reg->smin_value + off;
-		max_off = reg->smax_value + off;
+		*min_off = reg->smin_value + off;
+		*max_off = reg->smax_value + off;
 	}
 
-	if (meta && meta->raw_mode) {
-		/* Ensure we won't be overwriting dynptrs when simulating byte
-		 * by byte access in check_helper_call using meta.access_size.
-		 * This would be a problem if we have a helper in the future
-		 * which takes:
-		 *
-		 *	helper(uninit_mem, len, dynptr)
-		 *
-		 * Now, uninint_mem may overlap with dynptr pointer. Hence, it
-		 * may end up writing to dynptr itself when touching memory from
-		 * arg 1. This can be relaxed on a case by case basis for known
-		 * safe cases, but reject due to the possibilitiy of aliasing by
-		 * default.
-		 */
-		for (i = min_off; i < max_off + access_size; i++) {
-			int stack_off = -i - 1;
+	return 0;
+}
 
-			spi = __get_spi(i);
-			/* raw_mode may write past allocated_stack */
-			if (state->allocated_stack <= stack_off)
-				continue;
-			if (state->stack[spi].slot_type[stack_off % BPF_REG_SIZE] == STACK_DYNPTR) {
-				verbose(env, "potential write to dynptr at off=%d disallowed\n", i);
-				return -EACCES;
-			}
-		}
-		meta->access_size = access_size;
-		meta->regno = regno;
+static int allow_uninitialized_stack_range(struct bpf_verifier_env *env, int regno,
+					   int min_off, int max_off, int access_size,
+					   struct bpf_call_arg_meta *meta)
+{
+	struct bpf_reg_state *reg = reg_state(env, regno);
+	struct bpf_func_state *state = func(env, reg);
+	int i, stack_off, spi;
+
+	/* Disallow uninitialized buffer on stack */
+	if (!meta || !meta->raw_mode)
+		return 0;
+
+	/* Only initialized buffer on stack is allowed to be accessed
+	 * with variable offset. With uninitialized buffer it's hard to
+	 * guarantee that whole memory is marked as initialized on
+	 * helper return since specific bounds are unknown what may
+	 * cause uninitialized stack leaking.
+	 */
+	if (!tnum_is_const(reg->var_off))
 		return 0;
+
+	/* Ensure we won't be overwriting dynptrs when simulating byte
+	 * by byte access in check_helper_call using meta.access_size.
+	 * This would be a problem if we have a helper in the future
+	 * which takes:
+	 *
+	 *	helper(uninit_mem, len, dynptr)
+	 *
+	 * Now, uninint_mem may overlap with dynptr pointer. Hence, it
+	 * may end up writing to dynptr itself when touching memory from
+	 * arg 1. This can be relaxed on a case by case basis for known
+	 * safe cases, but reject due to the possibilitiy of aliasing by
+	 * default.
+	 */
+	for (i = min_off; i < max_off + access_size; i++) {
+		stack_off = -i - 1;
+		spi = __get_spi(i);
+		/* raw_mode may write past allocated_stack */
+		if (state->allocated_stack <= stack_off)
+			continue;
+		if (state->stack[spi].slot_type[stack_off % BPF_REG_SIZE] == STACK_DYNPTR) {
+			verbose(env, "potential write to dynptr at off=%d disallowed\n", i);
+			return -EACCES;
+		}
 	}
+	meta->access_size = access_size;
+	meta->regno = regno;
+
+	return 1;
+}
+
+static int check_stack_range_initialized(struct bpf_verifier_env *env, int regno,
+					 int min_off, int max_off, int access_size,
+					 enum bpf_access_type type)
+{
+	struct bpf_reg_state *reg = reg_state(env, regno);
+	struct bpf_func_state *state = func(env, reg);
+	int i, j, slot, spi;
+	/* Some accesses can write anything into the stack, others are
+	 * read-only.
+	 */
+	bool clobber = type == BPF_WRITE;
 
 	for (i = min_off; i < max_off + access_size; i++) {
 		u8 *stype;
@@ -7768,19 +7764,58 @@ static int check_stack_range_initialized(
 mark:
 		/* reading any byte out of 8-byte 'spill_slot' will cause
 		 * the whole slot to be marked as 'read'
-		 */
-		mark_reg_read(env, &state->stack[spi].spilled_ptr,
-			      state->stack[spi].spilled_ptr.parent,
-			      REG_LIVE_READ64);
-		/* We do not set REG_LIVE_WRITTEN for stack slot, as we can not
+		 *
+		 * We do not set REG_LIVE_WRITTEN for stack slot, as we can not
 		 * be sure that whether stack slot is written to or not. Hence,
 		 * we must still conservatively propagate reads upwards even if
 		 * helper may write to the entire memory range.
 		 */
+		mark_reg_read(env, &state->stack[spi].spilled_ptr,
+			      state->stack[spi].spilled_ptr.parent,
+			      REG_LIVE_READ64);
 	}
+
 	return 0;
 }
 
+/* When register 'regno' is used to read the stack (either directly or through
+ * a helper function) make sure that it's within stack boundary and, depending
+ * on the access type and privileges, that all elements of the stack are
+ * initialized.
+ *
+ * 'off' includes 'regno->off', but not its dynamic part (if any).
+ *
+ * All registers that have been spilled on the stack in the slots within the
+ * read offsets are marked as read.
+ */
+static int check_stack_range_access(struct bpf_verifier_env *env, int regno, int off,
+				    int access_size, bool zero_size_allowed,
+				    enum bpf_access_type type, struct bpf_call_arg_meta *meta)
+{
+	int err, min_off, max_off;
+
+	if (access_size == 0 && !zero_size_allowed) {
+		verbose(env, "invalid zero-sized read\n");
+		return -EACCES;
+	}
+
+	err = check_stack_access_within_bounds(env, regno, off, access_size, type);
+	if (err)
+		return err;
+
+	err = get_stack_access_range(env, regno, off, &min_off, &max_off);
+	if (err)
+		return err;
+
+	err = allow_uninitialized_stack_range(env, regno, min_off, max_off, access_size, meta);
+	if (err < 0)
+		return err;
+	if (err > 0)
+		return 0;
+
+	return check_stack_range_initialized(env, regno, min_off, max_off, access_size, type);
+}
+
 static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 				   int access_size, enum bpf_access_type access_type,
 				   bool zero_size_allowed,
@@ -7834,10 +7869,8 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
 					   access_size, zero_size_allowed,
 					   max_access);
 	case PTR_TO_STACK:
-		return check_stack_range_initialized(
-				env,
-				regno, reg->off, access_size,
-				zero_size_allowed, access_type, meta);
+		return check_stack_range_access(env, regno, reg->off, access_size,
+						zero_size_allowed, access_type, meta);
 	case PTR_TO_BTF_ID:
 		return check_ptr_to_btf_access(env, regs, regno, reg->off,
 					       access_size, BPF_READ, -1);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 09/20] bpf: Support map key with dynptr in verifier
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (7 preceding siblings ...)
  2025-01-25 11:10 ` [PATCH bpf-next v2 08/20] bpf: Split check_stack_range_initialized() into small functions Hou Tao
@ 2025-01-25 11:10 ` Hou Tao
  2025-01-25 11:10 ` [PATCH bpf-next v2 10/20] bpf: Introduce bpf_dynptr_user Hou Tao
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

The patch basically does the following three things to enable dynptr key
for bpf map:

1) Only allow PTR_TO_STACK typed register for dynptr key
The main reason is that bpf_dynptr can only be defined in the stack, so
for dynptr key only PTR_TO_STACK typed register is allowed. bpf_dynptr
could also be represented by CONST_PTR_TO_DYNPTR typed register (e.g.,
in callback func or subprog), but it is not supported now.

2) Only allow fixed-offset for PTR_TO_STACK register
Variable-offset for PTR_TO_STACK typed register is disallowed, because
it is impossible to check whether or not the stack access is aligned
with BPF_REG_SIZE and is matched with the location of dynptr or
non-dynptr part in the map key.

3) Check that the layout of the stack content matches the btf_record
First check that the start offset of the stack access is aligned with
BPF_REG_SIZE, then check that the offset and size of the
dynptr/non-dynptr parts in the stack range are consistent with the
btf_record of the map key.
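
For reference, a minimal bpf-side sketch is shown below. It is
illustrative only: the map, section and buffer names are made up, and
the BPF_F_NO_PREALLOC / map_extra settings follow the constraints
described elsewhere in this series. The point is that the whole key,
including the bpf_dynptr, lives on the program stack at a fixed,
BPF_REG_SIZE-aligned offset:

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	struct map_key {
		int pid;
		struct bpf_dynptr name;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__uint(max_entries, 16);
		__uint(map_flags, BPF_F_NO_PREALLOC);	/* no pre-allocation for dynptr keys */
		__uint(map_extra, 64);			/* max size of the dynptr data */
		__type(key, struct map_key);
		__type(value, long);
	} htab SEC(".maps");

	char name_buf[8] = "name";	/* global buffer backing the dynptr */

	SEC("tp/syscalls/sys_enter_getpgid")
	int lookup_with_dynptr_key(void *ctx)
	{
		struct map_key key = {};	/* PTR_TO_STACK with a fixed offset */
		long *value;

		key.pid = 100;
		/* marks the stack slots of key.name as STACK_DYNPTR */
		bpf_dynptr_from_mem(name_buf, sizeof(name_buf), 0, &key.name);

		value = bpf_map_lookup_elem(&htab, &key);
		if (value)
			bpf_printk("value %ld", *value);
		return 0;
	}

	char _license[] SEC("license") = "GPL";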

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/verifier.c | 170 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 164 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 290b9b93017c0..4e8531f246e8a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7705,9 +7705,90 @@ static int allow_uninitialized_stack_range(struct bpf_verifier_env *env, int reg
 	return 1;
 }
 
+struct dynptr_key_state {
+	const struct btf_record *rec;
+	const struct btf_field *cur_dynptr;
+	bool valid_dynptr_id;
+	int cur_dynptr_id;
+};
+
+static int init_dynptr_key_state(struct bpf_verifier_env *env, const struct btf_record *rec,
+				 struct dynptr_key_state *state)
+{
+	unsigned int i;
+
+	/* Find the first dynptr in the dynptr-key */
+	for (i = 0; i < rec->cnt; i++) {
+		if (rec->fields[i].type == BPF_DYNPTR)
+			break;
+	}
+	if (i >= rec->cnt) {
+		verbose(env, "verifier bug: dynptr not found\n");
+		return -EFAULT;
+	}
+
+	state->rec = rec;
+	state->cur_dynptr = &rec->fields[i];
+	state->valid_dynptr_id = false;
+
+	return 0;
+}
+
+static int check_dynptr_key_access(struct bpf_verifier_env *env, struct dynptr_key_state *state,
+				   struct bpf_reg_state *reg, u8 stype, int offset)
+{
+	const struct btf_field *dynptr = state->cur_dynptr;
+
+	/* Non-dynptr part before a dynptr or non-dynptr part after
+	 * the last dynptr.
+	 */
+	if (offset < dynptr->offset || offset >= dynptr->offset + dynptr->size) {
+		if (stype == STACK_DYNPTR) {
+			verbose(env,
+				"dynptr-key expects non-dynptr at offset %d cur_dynptr_offset %u\n",
+				offset, dynptr->offset);
+			return -EACCES;
+		}
+	} else {
+		if (stype != STACK_DYNPTR) {
+			verbose(env,
+				"dynptr-key expects dynptr at offset %d cur_dynptr_offset %u\n",
+				offset, dynptr->offset);
+			return -EACCES;
+		}
+
+		/* A dynptr is composed of parts from two dynptrs */
+		if (state->valid_dynptr_id && reg->id != state->cur_dynptr_id) {
+			verbose(env, "malformed dynptr-key at offset %d cur_dynptr_offset %u\n",
+				offset, dynptr->offset);
+			return -EACCES;
+		}
+		if (!state->valid_dynptr_id) {
+			state->valid_dynptr_id = true;
+			state->cur_dynptr_id = reg->id;
+		}
+
+		if (offset == dynptr->offset + dynptr->size - 1) {
+			const struct btf_record *rec = state->rec;
+			unsigned int i;
+
+			for (i = dynptr - rec->fields + 1; i < rec->cnt; i++) {
+				if (rec->fields[i].type == BPF_DYNPTR) {
+					state->cur_dynptr = &rec->fields[i];
+					state->valid_dynptr_id = false;
+					break;
+				}
+			}
+		}
+	}
+
+	return 0;
+}
+
 static int check_stack_range_initialized(struct bpf_verifier_env *env, int regno,
 					 int min_off, int max_off, int access_size,
-					 enum bpf_access_type type)
+					 enum bpf_access_type type,
+					 struct dynptr_key_state *dynkey)
 {
 	struct bpf_reg_state *reg = reg_state(env, regno);
 	struct bpf_func_state *state = func(env, reg);
@@ -7730,6 +7811,8 @@ static int check_stack_range_initialized(struct bpf_verifier_env *env, int regno
 		stype = &state->stack[spi].slot_type[slot % BPF_REG_SIZE];
 		if (*stype == STACK_MISC)
 			goto mark;
+		if (dynkey && *stype == STACK_DYNPTR)
+			goto mark;
 		if ((*stype == STACK_ZERO) ||
 		    (*stype == STACK_INVALID && env->allow_uninit_stack)) {
 			if (clobber) {
@@ -7762,6 +7845,15 @@ static int check_stack_range_initialized(struct bpf_verifier_env *env, int regno
 		}
 		return -EACCES;
 mark:
+		if (dynkey) {
+			int err = check_dynptr_key_access(env, dynkey,
+							  &state->stack[spi].spilled_ptr,
+							  *stype, i - min_off);
+
+			if (err)
+				return err;
+		}
+
 		/* reading any byte out of 8-byte 'spill_slot' will cause
 		 * the whole slot to be marked as 'read'
 		 *
@@ -7813,7 +7905,60 @@ static int check_stack_range_access(struct bpf_verifier_env *env, int regno, int
 	if (err > 0)
 		return 0;
 
-	return check_stack_range_initialized(env, regno, min_off, max_off, access_size, type);
+	return check_stack_range_initialized(env, regno, min_off, max_off, access_size, type, NULL);
+}
+
+static int check_dynkey_stack_access_offset(struct bpf_verifier_env *env, int regno, int off)
+{
+	struct bpf_reg_state *reg = reg_state(env, regno);
+
+	if (!tnum_is_const(reg->var_off)) {
+		verbose(env, "R%d variable offset prohibited for dynptr-key\n", regno);
+		return -EACCES;
+	}
+
+	off = reg->var_off.value + off;
+	if (off % BPF_REG_SIZE) {
+		verbose(env, "R%d misaligned offset %d for dynptr-key\n", regno, off);
+		return -EACCES;
+	}
+
+	return 0;
+}
+
+/* It is almost the same as check_stack_range_access(), except the following
+ * things:
+ * (1) no need to check whether access_size is zero (due to non-zero key_size)
+ * (2) disallow uninitialized stack range
+ * (3) need BPF_REG_SIZE-aligned access with fixed-size offset
+ * (4) need to check whether the layout of bpf_dynptr part and non-bpf_dynptr
+ *     part in the stack range is the same as the layout of dynptr key
+ */
+static int check_dynkey_stack_range_access(struct bpf_verifier_env *env, int regno, int off,
+					   int access_size, struct bpf_call_arg_meta *meta)
+{
+	enum bpf_access_type type = BPF_READ;
+	struct dynptr_key_state dynkey;
+	int err, min_off, max_off;
+
+	err = check_stack_access_within_bounds(env, regno, off, access_size, type);
+	if (err)
+		return err;
+
+	err = check_dynkey_stack_access_offset(env, regno, off);
+	if (err)
+		return err;
+
+	err = get_stack_access_range(env, regno, off, &min_off, &max_off);
+	if (err)
+		return err;
+
+	err = init_dynptr_key_state(env, meta->map_ptr->key_record, &dynkey);
+	if (err)
+		return err;
+
+	return check_stack_range_initialized(env, regno, min_off, max_off, access_size, type,
+					     &dynkey);
 }
 
 static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
@@ -9383,13 +9528,26 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			verbose(env, "invalid map_ptr to access map->key\n");
 			return -EACCES;
 		}
+
 		key_size = meta->map_ptr->key_size;
-		err = check_helper_mem_access(env, regno, key_size, BPF_READ, false, NULL);
+		/* Only allow PTR_TO_STACK for dynptr-key */
+		if (bpf_map_has_dynptr_key(meta->map_ptr)) {
+			if (base_type(reg->type) != PTR_TO_STACK) {
+				verbose(env, "map dynptr-key requires stack ptr but got %s\n",
+					reg_type_str(env, reg->type));
+				return -EACCES;
+			}
+			err = check_dynkey_stack_range_access(env, regno, reg->off, key_size, meta);
+		} else {
+			err = check_helper_mem_access(env, regno, key_size, BPF_READ, false, NULL);
+			if (!err) {
+				meta->const_map_key = get_constant_map_key(env, reg, key_size);
+				if (meta->const_map_key < 0 && meta->const_map_key != -EOPNOTSUPP)
+					err = meta->const_map_key;
+			}
+		}
 		if (err)
 			return err;
-		meta->const_map_key = get_constant_map_key(env, reg, key_size);
-		if (meta->const_map_key < 0 && meta->const_map_key != -EOPNOTSUPP)
-			return meta->const_map_key;
 		break;
 	case ARG_PTR_TO_MAP_VALUE:
 		if (type_may_be_null(arg_type) && register_is_null(reg))
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 10/20] bpf: Introduce bpf_dynptr_user
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (8 preceding siblings ...)
  2025-01-25 11:10 ` [PATCH bpf-next v2 09/20] bpf: Support map key with dynptr in verifier Hou Tao
@ 2025-01-25 11:10 ` Hou Tao
  2025-02-14  0:13   ` Alexei Starovoitov
  2025-01-25 11:11 ` [PATCH bpf-next v2 11/20] bpf: Handle bpf_dynptr_user in bpf syscall when it is used as input Hou Tao
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:10 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

For a bpf map with dynptr key support, the userspace application will
use bpf_dynptr_user to represent the bpf_dynptr in the map key and pass
it to the bpf syscall. The bpf syscall will copy from bpf_dynptr_user
to construct a corresponding bpf_dynptr_kern object when the map key is
an input argument, and copy to bpf_dynptr_user from a bpf_dynptr_kern
object when the map key is an output argument.

For now the size of bpf_dynptr_user must be the same as that of
bpf_dynptr, but the last u32 field is not used, so make it a reserved
field.
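
As an illustrative sketch of the intended userspace usage (the map fd,
key layout and value type below are assumptions, not part of this
patch), the user-side key embeds a bpf_dynptr_user where the bpf-side
key embeds a bpf_dynptr:

	#include <stdio.h>
	#include <linux/bpf.h>
	#include <bpf/bpf.h>

	struct map_key {
		int pid;
		struct bpf_dynptr_user name;	/* same slot as struct bpf_dynptr in the bpf-side key */
	};

	/* sketch: map_fd refers to an already created dynptr-keyed hash map */
	static int lookup_by_name(int map_fd)
	{
		char name[] = "task-name";
		struct map_key key = {};
		long value;

		key.pid = 100;
		key.name.data = name;
		key.name.size = sizeof(name);	/* non-zero, no larger than map_extra */
		key.name.reserved = 0;		/* must be zero */

		if (bpf_map_lookup_elem(map_fd, &key, &value))
			return -1;
		printf("value: %ld\n", value);
		return 0;
	}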

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/uapi/linux/bpf.h       | 6 ++++++
 tools/include/uapi/linux/bpf.h | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2acf9b3363717..7d96685513c55 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7335,6 +7335,12 @@ struct bpf_dynptr {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
 
+struct bpf_dynptr_user {
+	__bpf_md_ptr(void *, data);
+	__u32 size;
+	__u32 reserved;
+} __attribute__((aligned(8)));
+
 struct bpf_list_head {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 2acf9b3363717..7d96685513c55 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -7335,6 +7335,12 @@ struct bpf_dynptr {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
 
+struct bpf_dynptr_user {
+	__bpf_md_ptr(void *, data);
+	__u32 size;
+	__u32 reserved;
+} __attribute__((aligned(8)));
+
 struct bpf_list_head {
 	__u64 __opaque[2];
 } __attribute__((aligned(8)));
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 11/20] bpf: Handle bpf_dynptr_user in bpf syscall when it is used as input
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (9 preceding siblings ...)
  2025-01-25 11:10 ` [PATCH bpf-next v2 10/20] bpf: Introduce bpf_dynptr_user Hou Tao
@ 2025-01-25 11:11 ` Hou Tao
  2025-01-25 11:11 ` [PATCH bpf-next v2 12/20] bpf: Handle bpf_dynptr_user in bpf syscall when it is used as output Hou Tao
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Introduce the bpf_copy_from_dynptr_ukey() helper to handle a map key
containing bpf_dynptr when the key is used in map lookup, update,
delete and get_next_key operations.

The helper places the variable-length data of all bpf_dynptr_user
objects at the end of the map key to simplify the allocation and
freeing of map keys with dynptrs.
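
As an illustration (the key definition and data length are examples,
not mandated by this patch), for a key such as
struct { int pid; struct bpf_dynptr name; } with 5 bytes of dynptr
data, the buffer constructed by the helper would roughly look like:

	/*
	 *   offset  0.. 3: pid
	 *   offset  4.. 7: padding
	 *   offset  8..23: bpf_dynptr_kern for 'name'; it overwrites the
	 *                  bpf_dynptr_user copied from userspace and its
	 *                  data pointer refers to offset 24
	 *   offset 24..28: "hello", the variable-length data appended
	 *                  after the first map->key_size bytes
	 */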

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/syscall.c | 98 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 87 insertions(+), 11 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 79459b218109e..1f0684ba0a204 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1711,10 +1711,83 @@ int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
 	return -ENOTSUPP;
 }
 
-static void *__bpf_copy_key(void __user *ukey, u64 key_size)
+static void *bpf_copy_from_dynptr_ukey(const struct bpf_map *map, bpfptr_t ukey)
 {
-	if (key_size)
-		return vmemdup_user(ukey, key_size);
+	const struct btf_record *record;
+	const struct btf_field *field;
+	struct bpf_dynptr_user *uptr;
+	struct bpf_dynptr_kern *kptr;
+	void *key, *new_key, *kdata;
+	unsigned int key_size, size;
+	bpfptr_t udata;
+	unsigned int i;
+	int err;
+
+	key_size = map->key_size;
+	key = kvmemdup_bpfptr(ukey, key_size);
+	if (IS_ERR(key))
+		return ERR_CAST(key);
+
+	size = key_size;
+	record = map->key_record;
+	for (i = 0; i < record->cnt; i++) {
+		field = &record->fields[i];
+		if (field->type != BPF_DYNPTR)
+			continue;
+		uptr = key + field->offset;
+		if (!uptr->size || uptr->size > map->map_extra || uptr->reserved) {
+			err = -EINVAL;
+			goto free_key;
+		}
+
+		size += uptr->size;
+		/* Overflow ? */
+		if (size < uptr->size) {
+			err = -E2BIG;
+			goto free_key;
+		}
+	}
+
+	/* Place all dynptrs' data in the end of the key */
+	new_key = kvrealloc(key, size, GFP_USER | __GFP_NOWARN);
+	if (!new_key) {
+		err = -ENOMEM;
+		goto free_key;
+	}
+
+	key = new_key;
+	kdata = key + key_size;
+	for (i = 0; i < record->cnt; i++) {
+		field = &record->fields[i];
+		if (field->type != BPF_DYNPTR)
+			continue;
+
+		uptr = key + field->offset;
+		size = uptr->size;
+		udata = make_bpfptr((u64)(uintptr_t)uptr->data, bpfptr_is_kernel(ukey));
+		if (copy_from_bpfptr(kdata, udata, size)) {
+			err = -EFAULT;
+			goto free_key;
+		}
+		kptr = (struct bpf_dynptr_kern *)uptr;
+		bpf_dynptr_init(kptr, kdata, BPF_DYNPTR_TYPE_LOCAL, 0, size);
+		kdata += size;
+	}
+
+	return key;
+
+free_key:
+	kvfree(key);
+	return ERR_PTR(err);
+}
+
+static void *__bpf_copy_key(const struct bpf_map *map, void __user *ukey)
+{
+	if (bpf_map_has_dynptr_key(map))
+		return bpf_copy_from_dynptr_ukey(map, USER_BPFPTR(ukey));
+
+	if (map->key_size)
+		return vmemdup_user(ukey, map->key_size);
 
 	if (ukey)
 		return ERR_PTR(-EINVAL);
@@ -1722,10 +1795,13 @@ static void *__bpf_copy_key(void __user *ukey, u64 key_size)
 	return NULL;
 }
 
-static void *___bpf_copy_key(bpfptr_t ukey, u64 key_size)
+static void *___bpf_copy_key(const struct bpf_map *map, bpfptr_t ukey)
 {
-	if (key_size)
-		return kvmemdup_bpfptr(ukey, key_size);
+	if (bpf_map_has_dynptr_key(map))
+		return bpf_copy_from_dynptr_ukey(map, ukey);
+
+	if (map->key_size)
+		return kvmemdup_bpfptr(ukey, map->key_size);
 
 	if (!bpfptr_is_null(ukey))
 		return ERR_PTR(-EINVAL);
@@ -1762,7 +1838,7 @@ static int map_lookup_elem(union bpf_attr *attr)
 	    !btf_record_has_field(map->record, BPF_SPIN_LOCK))
 		return -EINVAL;
 
-	key = __bpf_copy_key(ukey, map->key_size);
+	key = __bpf_copy_key(map, ukey);
 	if (IS_ERR(key))
 		return PTR_ERR(key);
 
@@ -1829,7 +1905,7 @@ static int map_update_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
 
-	key = ___bpf_copy_key(ukey, map->key_size);
+	key = ___bpf_copy_key(map, ukey);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -1876,7 +1952,7 @@ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
 
-	key = ___bpf_copy_key(ukey, map->key_size);
+	key = ___bpf_copy_key(map, ukey);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
@@ -1928,7 +2004,7 @@ static int map_get_next_key(union bpf_attr *attr)
 		return -EPERM;
 
 	if (ukey) {
-		key = __bpf_copy_key(ukey, map->key_size);
+		key = __bpf_copy_key(map, ukey);
 		if (IS_ERR(key))
 			return PTR_ERR(key);
 	} else {
@@ -2225,7 +2301,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
 		goto err_put;
 	}
 
-	key = __bpf_copy_key(ukey, map->key_size);
+	key = __bpf_copy_key(map, ukey);
 	if (IS_ERR(key)) {
 		err = PTR_ERR(key);
 		goto err_put;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 12/20] bpf: Handle bpf_dynptr_user in bpf syscall when it is used as output
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (10 preceding siblings ...)
  2025-01-25 11:11 ` [PATCH bpf-next v2 11/20] bpf: Handle bpf_dynptr_user in bpf syscall when it is used as input Hou Tao
@ 2025-01-25 11:11 ` Hou Tao
  2025-01-25 11:11 ` [PATCH bpf-next v2 13/20] bpf: Support basic operations for dynptr key in hash map Hou Tao
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

For the get_next_key operation, unext_key is used as an output
argument. When there is a dynptr in the map key, unext_key is also used
as an input argument, because the userspace application needs to
pre-allocate a buffer for each variable-length part of the map key and
save the length and address of each buffer in a bpf_dynptr_user object.

To support the get_next_key op for a map with dynptr key,
map_get_next_key() first calls bpf_copy_from_dynptr_ukey() to construct
a map key in which each bpf_dynptr_kern object has the same size as the
corresponding bpf_dynptr_user object. It then calls ->map_get_next_key()
to get the next key, and finally calls bpf_copy_to_dynptr_ukey() to copy
both the non-dynptr parts and the dynptr parts of the map key to
unext_key.
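
A hedged userspace sketch of this calling convention (map_fd, the key
layout and the buffer size are assumptions for illustration): the
caller pre-allocates the output buffer for each variable-length part
before calling bpf_map_get_next_key():

	#include <stdio.h>
	#include <linux/bpf.h>
	#include <bpf/bpf.h>

	struct map_key {
		int pid;
		struct bpf_dynptr_user name;
	};

	/* sketch: fetch the first key of a dynptr-keyed hash map */
	static int get_first_key(int map_fd)
	{
		char buf[64];			/* pre-allocated output buffer */
		struct map_key next = {};

		next.name.data = buf;
		next.name.size = sizeof(buf);	/* buffer capacity, at most map_extra */
		next.name.reserved = 0;

		if (bpf_map_get_next_key(map_fd, NULL, &next))
			return -1;

		/* on success the kernel updates next.name.size to the actual
		 * length of the data copied into buf
		 */
		printf("pid %d, name length %u\n", next.pid, next.name.size);
		return 0;
	}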

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/syscall.c | 89 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 74 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1f0684ba0a204..dc29fa897855c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1711,7 +1711,7 @@ int __weak bpf_stackmap_copy(struct bpf_map *map, void *key, void *value)
 	return -ENOTSUPP;
 }
 
-static void *bpf_copy_from_dynptr_ukey(const struct bpf_map *map, bpfptr_t ukey)
+static void *bpf_copy_from_dynptr_ukey(const struct bpf_map *map, bpfptr_t ukey, bool copy_data)
 {
 	const struct btf_record *record;
 	const struct btf_field *field;
@@ -1719,7 +1719,6 @@ static void *bpf_copy_from_dynptr_ukey(const struct bpf_map *map, bpfptr_t ukey)
 	struct bpf_dynptr_kern *kptr;
 	void *key, *new_key, *kdata;
 	unsigned int key_size, size;
-	bpfptr_t udata;
 	unsigned int i;
 	int err;
 
@@ -1734,6 +1733,7 @@ static void *bpf_copy_from_dynptr_ukey(const struct bpf_map *map, bpfptr_t ukey)
 		field = &record->fields[i];
 		if (field->type != BPF_DYNPTR)
 			continue;
+
 		uptr = key + field->offset;
 		if (!uptr->size || uptr->size > map->map_extra || uptr->reserved) {
 			err = -EINVAL;
@@ -1764,10 +1764,14 @@ static void *bpf_copy_from_dynptr_ukey(const struct bpf_map *map, bpfptr_t ukey)
 
 		uptr = key + field->offset;
 		size = uptr->size;
-		udata = make_bpfptr((u64)(uintptr_t)uptr->data, bpfptr_is_kernel(ukey));
-		if (copy_from_bpfptr(kdata, udata, size)) {
-			err = -EFAULT;
-			goto free_key;
+		if (copy_data) {
+			bpfptr_t udata = make_bpfptr((u64)(uintptr_t)uptr->data,
+						     bpfptr_is_kernel(ukey));
+
+			if (copy_from_bpfptr(kdata, udata, size)) {
+				err = -EFAULT;
+				goto free_key;
+			}
 		}
 		kptr = (struct bpf_dynptr_kern *)uptr;
 		bpf_dynptr_init(kptr, kdata, BPF_DYNPTR_TYPE_LOCAL, 0, size);
@@ -1784,7 +1788,7 @@ static void *bpf_copy_from_dynptr_ukey(const struct bpf_map *map, bpfptr_t ukey)
 static void *__bpf_copy_key(const struct bpf_map *map, void __user *ukey)
 {
 	if (bpf_map_has_dynptr_key(map))
-		return bpf_copy_from_dynptr_ukey(map, USER_BPFPTR(ukey));
+		return bpf_copy_from_dynptr_ukey(map, USER_BPFPTR(ukey), true);
 
 	if (map->key_size)
 		return vmemdup_user(ukey, map->key_size);
@@ -1798,7 +1802,7 @@ static void *__bpf_copy_key(const struct bpf_map *map, void __user *ukey)
 static void *___bpf_copy_key(const struct bpf_map *map, bpfptr_t ukey)
 {
 	if (bpf_map_has_dynptr_key(map))
-		return bpf_copy_from_dynptr_ukey(map, ukey);
+		return bpf_copy_from_dynptr_ukey(map, ukey, true);
 
 	if (map->key_size)
 		return kvmemdup_bpfptr(ukey, map->key_size);
@@ -1809,6 +1813,51 @@ static void *___bpf_copy_key(const struct bpf_map *map, bpfptr_t ukey)
 	return NULL;
 }
 
+static int bpf_copy_to_dynptr_ukey(const struct bpf_map *map,
+				   void __user *ukey, void *key)
+{
+	struct bpf_dynptr_user __user *uptr;
+	struct bpf_dynptr_kern *kptr;
+	struct btf_record *record;
+	unsigned int i, offset;
+
+	offset = 0;
+	record = map->key_record;
+	for (i = 0; i < record->cnt; i++) {
+		struct btf_field *field;
+		unsigned int size;
+		void *udata;
+
+		field = &record->fields[i];
+		if (field->type != BPF_DYNPTR)
+			continue;
+
+		/* Any no-dynptr part before the dynptr ? */
+		if (offset < field->offset &&
+		    copy_to_user(ukey + offset, key + offset, field->offset - offset))
+			return -EFAULT;
+
+		/* dynptr part */
+		uptr = ukey + field->offset;
+		if (copy_from_user(&udata, &uptr->data, sizeof(udata)))
+			return -EFAULT;
+
+		kptr = key + field->offset;
+		size = __bpf_dynptr_size(kptr);
+		if (copy_to_user((void __user *)udata, __bpf_dynptr_data(kptr, size), size) ||
+		    put_user(size, &uptr->size) || put_user(0, &uptr->reserved))
+			return -EFAULT;
+
+		offset = field->offset + field->size;
+	}
+
+	if (offset < map->key_size &&
+	    copy_to_user(ukey + offset, key + offset, map->key_size - offset))
+		return -EFAULT;
+
+	return 0;
+}
+
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD flags
 
@@ -2011,10 +2060,19 @@ static int map_get_next_key(union bpf_attr *attr)
 		key = NULL;
 	}
 
-	err = -ENOMEM;
-	next_key = kvmalloc(map->key_size, GFP_USER);
-	if (!next_key)
+	if (bpf_map_has_dynptr_key(map))
+		next_key = bpf_copy_from_dynptr_ukey(map, USER_BPFPTR(unext_key), false);
+	else
+		next_key = kvmalloc(map->key_size, GFP_USER);
+	if (IS_ERR_OR_NULL(next_key)) {
+		if (!next_key) {
+			err = -ENOMEM;
+		} else {
+			err = PTR_ERR(next_key);
+			next_key = NULL;
+		}
 		goto free_key;
+	}
 
 	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_get_next_key(map, key, next_key);
@@ -2028,12 +2086,13 @@ static int map_get_next_key(union bpf_attr *attr)
 	if (err)
 		goto free_next_key;
 
-	err = -EFAULT;
-	if (copy_to_user(unext_key, next_key, map->key_size) != 0)
+	if (bpf_map_has_dynptr_key(map))
+		err = bpf_copy_to_dynptr_ukey(map, unext_key, next_key);
+	else
+		err = copy_to_user(unext_key, next_key, map->key_size) ? -EFAULT : 0;
+	if (err)
 		goto free_next_key;
 
-	err = 0;
-
 free_next_key:
 	kvfree(next_key);
 free_key:
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 13/20] bpf: Support basic operations for dynptr key in hash map
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (11 preceding siblings ...)
  2025-01-25 11:11 ` [PATCH bpf-next v2 12/20] bpf: Handle bpf_dynptr_user in bpf syscall when it is used as output Hou Tao
@ 2025-01-25 11:11 ` Hou Tao
  2025-01-25 11:11 ` [PATCH bpf-next v2 14/20] bpf: Export bpf_dynptr_set_size Hou Tao
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

The patch supports lookup, update, delete and lookup_delete operations
for hash maps with dynptr key. There are two major differences between
the implementation of a normal hash map and a dynptr-keyed hash map:

1) dynptr-keyed hash map doesn't support pre-allocation.
The reason is that the dynptr data in the map key is allocated
dynamically through the bpf mem allocator. The length limitation for
these dynptrs is 4088 bytes now. Because these dynptrs are allocated
dynamically, the memory consumption will be smaller compared with a
normal hash map when there are big differences between the lengths of
these dynptrs.

2) the freed element in a dynptr-keyed map will not be reused immediately
For a normal hash map, a freed element may be reused immediately by a
newly-added element, so a lookup may return an incorrect result due to
element deletion and element reuse. However, a dynptr-keyed map cannot
do that: there are pointers (dynptrs) in the map key and the updates of
these dynptrs are not atomic, because both the address and the length
of the dynptr are updated. If the element is reused immediately, the
access of the dynptr in the freed element may incur an invalid memory
access due to the mismatch between the address and the size of the
dynptr, so reuse the freed element only after one RCU grace period.

Besides the differences above, the dynptr-keyed hash map also needs to
handle a possibly-nullified dynptr in the map key.

After adding dynptr key support to the hash map, the performance of the
lookup and update/delete operations in map_perf_test degrades a lot.
Marking lookup_nulls_elem_raw() and lookup_elem_raw() as always_inline
narrows the gap from 21%/7% to 4%/2%. Therefore, the patch also adds
always_inline for these two hot functions. The following lines show the
detailed performance numbers:

before patch:
0:hash_map_perf kmalloc 693450 events per sec
0:hash_lookup 89366531 lookups per sec

after patch (without always_inline):
0:hash_map_perf kmalloc 650396 events per sec
0:hash_lookup 73961003 lookups per sec

after patch:
0:hash_map_perf kmalloc 665317 events per sec
0:hash_lookup 87842644 lookups per sec

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 288 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 259 insertions(+), 29 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index bb64eb83ec608..f3ec2b32b59b8 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -88,6 +88,7 @@ struct bpf_htab {
 	struct bpf_map map;
 	struct bpf_mem_alloc ma;
 	struct bpf_mem_alloc pcpu_ma;
+	struct bpf_mem_alloc dynptr_ma;
 	struct bucket *buckets;
 	void *elems;
 	union {
@@ -425,6 +426,7 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
 	bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
 	bool zero_seed = (attr->map_flags & BPF_F_ZERO_SEED);
+	bool dynptr_in_key = (attr->map_flags & BPF_INT_F_DYNPTR_IN_KEY);
 	int numa_node = bpf_map_attr_numa_node(attr);
 
 	BUILD_BUG_ON(offsetof(struct htab_elem, fnode.next) !=
@@ -438,6 +440,14 @@ static int htab_map_alloc_check(union bpf_attr *attr)
 	    !bpf_map_flags_access_ok(attr->map_flags))
 		return -EINVAL;
 
+	if (dynptr_in_key) {
+		if (percpu || lru || prealloc || !attr->map_extra)
+			return -EINVAL;
+		if ((attr->map_extra >> 32) || bpf_dynptr_check_size(attr->map_extra) ||
+		    bpf_mem_alloc_check_size(percpu, attr->map_extra))
+			return -E2BIG;
+	}
+
 	if (!lru && percpu_lru)
 		return -EINVAL;
 
@@ -482,6 +492,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	 */
 	bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
 	bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
+	bool dynptr_in_key = (attr->map_flags & BPF_INT_F_DYNPTR_IN_KEY);
 	struct bpf_htab *htab;
 	int err, i;
 
@@ -598,6 +609,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 			if (err)
 				goto free_map_locked;
 		}
+		if (dynptr_in_key) {
+			err = bpf_mem_alloc_init(&htab->dynptr_ma, 0, false);
+			if (err)
+				goto free_map_locked;
+		}
 	}
 
 	return &htab->map;
@@ -610,6 +626,7 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
 		free_percpu(htab->map_locked[i]);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->dynptr_ma);
 	bpf_mem_alloc_destroy(&htab->pcpu_ma);
 	bpf_mem_alloc_destroy(&htab->ma);
 free_elem_count:
@@ -620,13 +637,55 @@ static struct bpf_map *htab_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
+static inline u32 __htab_map_hash(const void *key, u32 key_len, u32 hashrnd)
 {
 	if (likely(key_len % 4 == 0))
 		return jhash2(key, key_len / 4, hashrnd);
 	return jhash(key, key_len, hashrnd);
 }
 
+static u32 htab_map_dynptr_hash(const void *key, u32 key_len, u32 hashrnd,
+				const struct btf_record *rec)
+{
+	unsigned int i, cnt = rec->cnt;
+	unsigned int hash = hashrnd;
+	unsigned int offset = 0;
+
+	for (i = 0; i < cnt; i++) {
+		const struct btf_field *field = &rec->fields[i];
+		const struct bpf_dynptr_kern *kptr;
+		unsigned int len;
+
+		if (field->type != BPF_DYNPTR)
+			continue;
+
+		/* non-dynptr part ? */
+		if (offset < field->offset)
+			hash = jhash(key + offset, field->offset - offset, hash);
+
+		/* Skip nullified dynptr */
+		kptr = key + field->offset;
+		if (kptr->data) {
+			len = __bpf_dynptr_size(kptr);
+			hash = jhash(__bpf_dynptr_data(kptr, len), len, hash);
+		}
+		offset = field->offset + field->size;
+	}
+
+	if (offset < key_len)
+		hash = jhash(key + offset, key_len - offset, hash);
+
+	return hash;
+}
+
+static inline u32 htab_map_hash(const void *key, u32 key_len, u32 hashrnd,
+				const struct btf_record *rec)
+{
+	if (likely(!rec))
+		return __htab_map_hash(key, key_len, hashrnd);
+	return htab_map_dynptr_hash(key, key_len, hashrnd, rec);
+}
+
 static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
 {
 	return &htab->buckets[hash & (htab->n_buckets - 1)];
@@ -637,15 +696,68 @@ static inline struct hlist_nulls_head *select_bucket(struct bpf_htab *htab, u32
 	return &__select_bucket(htab, hash)->head;
 }
 
+static bool is_same_dynptr_key(const void *key, const void *tgt, unsigned int key_size,
+			       const struct btf_record *rec)
+{
+	unsigned int i, cnt = rec->cnt;
+	unsigned int offset = 0;
+
+	for (i = 0; i < cnt; i++) {
+		const struct btf_field *field = &rec->fields[i];
+		const struct bpf_dynptr_kern *kptr, *tgt_kptr;
+		const void *data, *tgt_data;
+		unsigned int len;
+
+		if (field->type != BPF_DYNPTR)
+			continue;
+
+		if (offset < field->offset &&
+		    memcmp(key + offset, tgt + offset, field->offset - offset))
+			return false;
+
+		/*
+		 * For a nullified dynptr in the target key, __bpf_dynptr_size()
+		 * will return 0, and there will be no match for the target key.
+		 */
+		kptr = key + field->offset;
+		tgt_kptr = tgt + field->offset;
+		len = __bpf_dynptr_size(kptr);
+		if (len != __bpf_dynptr_size(tgt_kptr))
+			return false;
+
+		data = __bpf_dynptr_data(kptr, len);
+		tgt_data = __bpf_dynptr_data(tgt_kptr, len);
+		if (memcmp(data, tgt_data, len))
+			return false;
+
+		offset = field->offset + field->size;
+	}
+
+	if (offset < key_size &&
+	    memcmp(key + offset, tgt + offset, key_size - offset))
+		return false;
+
+	return true;
+}
+
+static inline bool htab_is_same_key(const void *key, const void *tgt, unsigned int key_size,
+				    const struct btf_record *rec)
+{
+	if (likely(!rec))
+		return !memcmp(key, tgt, key_size);
+	return is_same_dynptr_key(key, tgt, key_size, rec);
+}
+
 /* this lookup function can only be called with bucket lock taken */
-static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, u32 hash,
-					 void *key, u32 key_size)
+static __always_inline struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, u32 hash,
+							 void *key, u32 key_size,
+							 const struct btf_record *record)
 {
 	struct hlist_nulls_node *n;
 	struct htab_elem *l;
 
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+		if (l->hash == hash && htab_is_same_key(l->key, key, key_size, record))
 			return l;
 
 	return NULL;
@@ -655,16 +767,17 @@ static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, u32 hash
  * the unlikely event when elements moved from one bucket into another
  * while link list is being walked
  */
-static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
-					       u32 hash, void *key,
-					       u32 key_size, u32 n_buckets)
+static __always_inline struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
+							       u32 hash, void *key,
+							       u32 key_size, u32 n_buckets,
+							       const struct btf_record *record)
 {
 	struct hlist_nulls_node *n;
 	struct htab_elem *l;
 
 again:
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+		if (l->hash == hash && htab_is_same_key(l->key, key, key_size, record))
 			return l;
 
 	if (unlikely(get_nulls_value(n) != (hash & (n_buckets - 1))))
@@ -681,6 +794,7 @@ static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
 static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	const struct btf_record *record;
 	struct hlist_nulls_head *head;
 	struct htab_elem *l;
 	u32 hash, key_size;
@@ -689,12 +803,13 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key)
 		     !rcu_read_lock_bh_held());
 
 	key_size = map->key_size;
+	record = map->key_record;
 
-	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	hash = htab_map_hash(key, key_size, htab->hashrnd, record);
 
 	head = select_bucket(htab, hash);
 
-	l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets);
+	l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets, record);
 
 	return l;
 }
@@ -784,6 +899,26 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
 	return insn - insn_buf;
 }
 
+static void htab_free_dynptr_key(struct bpf_htab *htab, void *key)
+{
+	const struct btf_record *record = htab->map.key_record;
+	unsigned int i, cnt = record->cnt;
+
+	for (i = 0; i < cnt; i++) {
+		const struct btf_field *field = &record->fields[i];
+		struct bpf_dynptr_kern *kptr;
+
+		if (field->type != BPF_DYNPTR)
+			continue;
+
+		/* It may be accessed concurrently, so don't overwrite
+		 * the kptr.
+		 */
+		kptr = key + field->offset;
+		bpf_mem_free_rcu(&htab->dynptr_ma, kptr->data);
+	}
+}
+
 static void check_and_free_fields(struct bpf_htab *htab,
 				  struct htab_elem *elem)
 {
@@ -835,6 +970,68 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 	return l == tgt_l;
 }
 
+static int htab_copy_dynptr_key(struct bpf_htab *htab, void *dst_key, const void *key, u32 key_size)
+{
+	const struct btf_record *rec = htab->map.key_record;
+	struct bpf_dynptr_kern *dst_kptr;
+	const struct btf_field *field;
+	unsigned int i, cnt, offset;
+	int err;
+
+	offset = 0;
+	cnt = rec->cnt;
+	for (i = 0; i < cnt; i++) {
+		const struct bpf_dynptr_kern *kptr;
+		unsigned int len;
+		const void *data;
+		void *dst_data;
+
+		field = &rec->fields[i];
+		if (field->type != BPF_DYNPTR)
+			continue;
+
+		if (offset < field->offset)
+			memcpy(dst_key + offset, key + offset, field->offset - offset);
+
+		/* Doesn't support nullified dynptr in map key */
+		kptr = key + field->offset;
+		if (!kptr->data) {
+			err = -EINVAL;
+			goto out;
+		}
+		len = __bpf_dynptr_size(kptr);
+		data = __bpf_dynptr_data(kptr, len);
+
+		dst_data = bpf_mem_alloc(&htab->dynptr_ma, len);
+		if (!dst_data) {
+			err = -ENOMEM;
+			goto out;
+		}
+
+		memcpy(dst_data, data, len);
+		dst_kptr = dst_key + field->offset;
+		bpf_dynptr_init(dst_kptr, dst_data, BPF_DYNPTR_TYPE_LOCAL, 0, len);
+
+		offset = field->offset + field->size;
+	}
+
+	if (offset < key_size)
+		memcpy(dst_key + offset, key + offset, key_size - offset);
+
+	return 0;
+
+out:
+	for (; i > 0; i--) {
+		field = &rec->fields[i - 1];
+		if (field->type != BPF_DYNPTR)
+			continue;
+
+		dst_kptr = dst_key + field->offset;
+		bpf_mem_free(&htab->dynptr_ma, dst_kptr->data);
+	}
+	return err;
+}
+
 /* Called from syscall */
 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 {
@@ -851,12 +1048,12 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	if (!key)
 		goto find_first_elem;
 
-	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	hash = htab_map_hash(key, key_size, htab->hashrnd, NULL);
 
 	head = select_bucket(htab, hash);
 
 	/* lookup the key */
-	l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets);
+	l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets, NULL);
 
 	if (!l)
 		goto find_first_elem;
@@ -896,11 +1093,27 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 
 static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
 {
+	bool dynptr_in_key = bpf_map_has_dynptr_key(&htab->map);
+
+	if (dynptr_in_key)
+		htab_free_dynptr_key(htab, l->key);
+
 	check_and_free_fields(htab, l);
 
 	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
 		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
-	bpf_mem_cache_free(&htab->ma, l);
+
+	/*
+	 * For dynptr key, the update of dynptr in the key is not atomic:
+	 * both the pointer and the size are updated. If the element is reused
+	 * immediately, the access of the dynptr key during lookup procedure may
+	 * incur invalid memory access due to mismatch between the size and the
+	 * data pointer, so reuse the element after one RCU GP.
+	 */
+	if (dynptr_in_key)
+		bpf_mem_cache_free_rcu(&htab->ma, l);
+	else
+		bpf_mem_cache_free(&htab->ma, l);
 }
 
 static void htab_put_fd_value(struct bpf_htab *htab, struct htab_elem *l)
@@ -1047,7 +1260,19 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 		}
 	}
 
-	memcpy(l_new->key, key, key_size);
+	if (bpf_map_has_dynptr_key(&htab->map)) {
+		int copy_err;
+
+		copy_err = htab_copy_dynptr_key(htab, l_new->key, key, key_size);
+		if (copy_err) {
+			bpf_mem_cache_free(&htab->ma, l_new);
+			l_new = ERR_PTR(copy_err);
+			goto dec_count;
+		}
+	} else {
+		memcpy(l_new->key, key, key_size);
+	}
+
 	if (percpu) {
 		if (prealloc) {
 			pptr = htab_elem_get_ptr(l_new, key_size);
@@ -1103,6 +1328,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 				 u64 map_flags)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	const struct btf_record *key_record = map->key_record;
 	struct htab_elem *l_new = NULL, *l_old;
 	struct hlist_nulls_head *head;
 	unsigned long flags;
@@ -1120,7 +1346,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	hash = htab_map_hash(key, key_size, htab->hashrnd, key_record);
 
 	b = __select_bucket(htab, hash);
 	head = &b->head;
@@ -1130,7 +1356,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 			return -EINVAL;
 		/* find an element without taking the bucket lock */
 		l_old = lookup_nulls_elem_raw(head, hash, key, key_size,
-					      htab->n_buckets);
+					      htab->n_buckets, key_record);
 		ret = check_flags(htab, l_old, map_flags);
 		if (ret)
 			return ret;
@@ -1151,7 +1377,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
 	if (ret)
 		return ret;
 
-	l_old = lookup_elem_raw(head, hash, key, key_size);
+	l_old = lookup_elem_raw(head, hash, key, key_size, key_record);
 
 	ret = check_flags(htab, l_old, map_flags);
 	if (ret)
@@ -1238,7 +1464,7 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	hash = __htab_map_hash(key, key_size, htab->hashrnd);
 
 	b = __select_bucket(htab, hash);
 	head = &b->head;
@@ -1258,7 +1484,7 @@ static long htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value
 	if (ret)
 		goto err_lock_bucket;
 
-	l_old = lookup_elem_raw(head, hash, key, key_size);
+	l_old = lookup_elem_raw(head, hash, key, key_size, NULL);
 
 	ret = check_flags(htab, l_old, map_flags);
 	if (ret)
@@ -1307,7 +1533,7 @@ static long __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	hash = __htab_map_hash(key, key_size, htab->hashrnd);
 
 	b = __select_bucket(htab, hash);
 	head = &b->head;
@@ -1316,7 +1542,7 @@ static long __htab_percpu_map_update_elem(struct bpf_map *map, void *key,
 	if (ret)
 		return ret;
 
-	l_old = lookup_elem_raw(head, hash, key, key_size);
+	l_old = lookup_elem_raw(head, hash, key, key_size, NULL);
 
 	ret = check_flags(htab, l_old, map_flags);
 	if (ret)
@@ -1362,7 +1588,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	hash = htab_map_hash(key, key_size, htab->hashrnd, NULL);
 
 	b = __select_bucket(htab, hash);
 	head = &b->head;
@@ -1382,7 +1608,7 @@ static long __htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 	if (ret)
 		goto err_lock_bucket;
 
-	l_old = lookup_elem_raw(head, hash, key, key_size);
+	l_old = lookup_elem_raw(head, hash, key, key_size, NULL);
 
 	ret = check_flags(htab, l_old, map_flags);
 	if (ret)
@@ -1428,6 +1654,7 @@ static long htab_lru_percpu_map_update_elem(struct bpf_map *map, void *key,
 static long htab_map_delete_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	const struct btf_record *key_record = map->key_record;
 	struct hlist_nulls_head *head;
 	struct bucket *b;
 	struct htab_elem *l;
@@ -1440,7 +1667,7 @@ static long htab_map_delete_elem(struct bpf_map *map, void *key)
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	hash = htab_map_hash(key, key_size, htab->hashrnd, key_record);
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
@@ -1448,7 +1675,7 @@ static long htab_map_delete_elem(struct bpf_map *map, void *key)
 	if (ret)
 		return ret;
 
-	l = lookup_elem_raw(head, hash, key, key_size);
+	l = lookup_elem_raw(head, hash, key, key_size, key_record);
 	if (l)
 		hlist_nulls_del_rcu(&l->hash_node);
 	else
@@ -1476,7 +1703,7 @@ static long htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 
 	key_size = map->key_size;
 
-	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	hash = __htab_map_hash(key, key_size, htab->hashrnd);
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
@@ -1484,7 +1711,7 @@ static long htab_lru_map_delete_elem(struct bpf_map *map, void *key)
 	if (ret)
 		return ret;
 
-	l = lookup_elem_raw(head, hash, key, key_size);
+	l = lookup_elem_raw(head, hash, key, key_size, NULL);
 
 	if (l)
 		hlist_nulls_del_rcu(&l->hash_node);
@@ -1581,6 +1808,7 @@ static void htab_map_free(struct bpf_map *map)
 	bpf_map_free_elem_count(map);
 	free_percpu(htab->extra_elems);
 	bpf_map_area_free(htab->buckets);
+	bpf_mem_alloc_destroy(&htab->dynptr_ma);
 	bpf_mem_alloc_destroy(&htab->pcpu_ma);
 	bpf_mem_alloc_destroy(&htab->ma);
 	if (htab->use_percpu_counter)
@@ -1617,6 +1845,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
 					     bool is_percpu, u64 flags)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	const struct btf_record *key_record;
 	struct hlist_nulls_head *head;
 	unsigned long bflags;
 	struct htab_elem *l;
@@ -1625,8 +1854,9 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
 	int ret;
 
 	key_size = map->key_size;
+	key_record = map->key_record;
 
-	hash = htab_map_hash(key, key_size, htab->hashrnd);
+	hash = htab_map_hash(key, key_size, htab->hashrnd, key_record);
 	b = __select_bucket(htab, hash);
 	head = &b->head;
 
@@ -1634,7 +1864,7 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
 	if (ret)
 		return ret;
 
-	l = lookup_elem_raw(head, hash, key, key_size);
+	l = lookup_elem_raw(head, hash, key, key_size, key_record);
 	if (!l) {
 		ret = -ENOENT;
 		goto out_unlock;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 14/20] bpf: Export bpf_dynptr_set_size
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (12 preceding siblings ...)
  2025-01-25 11:11 ` [PATCH bpf-next v2 13/20] bpf: Support basic operations for dynptr key in hash map Hou Tao
@ 2025-01-25 11:11 ` Hou Tao
  2025-01-25 11:11 ` [PATCH bpf-next v2 15/20] bpf: Support get_next_key operation for dynptr key in hash map Hou Tao
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

It will be used by the following patch to shrink the size of the dynptr
when the actual data length is smaller than the size of the dynptr
during the map_get_next_key operation.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h  | 1 +
 kernel/bpf/helpers.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ee02a5d313c56..a7dcdbd8c2824 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1351,6 +1351,7 @@ enum bpf_dynptr_type {
 };
 
 int bpf_dynptr_check_size(u32 size);
+void bpf_dynptr_set_size(struct bpf_dynptr_kern *ptr, u32 new_size);
 u32 __bpf_dynptr_size(const struct bpf_dynptr_kern *ptr);
 const void *__bpf_dynptr_data(const struct bpf_dynptr_kern *ptr, u32 len);
 void *__bpf_dynptr_data_rw(const struct bpf_dynptr_kern *ptr, u32 len);
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 9fc35656d3e68..a045c131dcefe 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1723,7 +1723,7 @@ u32 __bpf_dynptr_size(const struct bpf_dynptr_kern *ptr)
 	return ptr->size & DYNPTR_SIZE_MASK;
 }
 
-static void bpf_dynptr_set_size(struct bpf_dynptr_kern *ptr, u32 new_size)
+void bpf_dynptr_set_size(struct bpf_dynptr_kern *ptr, u32 new_size)
 {
 	u32 metadata = ptr->size & ~DYNPTR_SIZE_MASK;
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 15/20] bpf: Support get_next_key operation for dynptr key in hash map
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (13 preceding siblings ...)
  2025-01-25 11:11 ` [PATCH bpf-next v2 14/20] bpf: Export bpf_dynptr_set_size Hou Tao
@ 2025-01-25 11:11 ` Hou Tao
  2025-01-25 11:11 ` [PATCH bpf-next v2 16/20] bpf: Disable unsupported operations for map with dynptr key Hou Tao
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

It first passes the key_record to htab_map_hash() and
lookup_nulls_elem_raw() to find the target key, then it uses the
htab_copy_dynptr_key() helper to copy from the target key to the next
key used for output.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 55 ++++++++++++++++++++++++++++++--------------
 1 file changed, 38 insertions(+), 17 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index f3ec2b32b59b8..74962a461d091 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -970,7 +970,8 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
 	return l == tgt_l;
 }
 
-static int htab_copy_dynptr_key(struct bpf_htab *htab, void *dst_key, const void *key, u32 key_size)
+static int htab_copy_dynptr_key(struct bpf_htab *htab, void *dst_key, const void *key, u32 key_size,
+				bool copy_in)
 {
 	const struct btf_record *rec = htab->map.key_record;
 	struct bpf_dynptr_kern *dst_kptr;
@@ -995,22 +996,32 @@ static int htab_copy_dynptr_key(struct bpf_htab *htab, void *dst_key, const void
 
 		/* Doesn't support nullified dynptr in map key */
 		kptr = key + field->offset;
-		if (!kptr->data) {
+		if (copy_in && !kptr->data) {
 			err = -EINVAL;
 			goto out;
 		}
 		len = __bpf_dynptr_size(kptr);
 		data = __bpf_dynptr_data(kptr, len);
 
-		dst_data = bpf_mem_alloc(&htab->dynptr_ma, len);
-		if (!dst_data) {
-			err = -ENOMEM;
-			goto out;
-		}
+		dst_kptr = dst_key + field->offset;
+		if (copy_in) {
+			dst_data = bpf_mem_alloc(&htab->dynptr_ma, len);
+			if (!dst_data) {
+				err = -ENOMEM;
+				goto out;
+			}
+			bpf_dynptr_init(dst_kptr, dst_data, BPF_DYNPTR_TYPE_LOCAL, 0, len);
+		} else {
+			dst_data = __bpf_dynptr_data_rw(dst_kptr, len);
+			if (!dst_data) {
+				err = -ENOSPC;
+				goto out;
+			}
 
+			if (__bpf_dynptr_size(dst_kptr) > len)
+				bpf_dynptr_set_size(dst_kptr, len);
+		}
 		memcpy(dst_data, data, len);
-		dst_kptr = dst_key + field->offset;
-		bpf_dynptr_init(dst_kptr, dst_data, BPF_DYNPTR_TYPE_LOCAL, 0, len);
 
 		offset = field->offset + field->size;
 	}
@@ -1021,7 +1032,7 @@ static int htab_copy_dynptr_key(struct bpf_htab *htab, void *dst_key, const void
 	return 0;
 
 out:
-	for (; i > 0; i--) {
+	for (; i > 0 && copy_in; i--) {
 		field = &rec->fields[i - 1];
 		if (field->type != BPF_DYNPTR)
 			continue;
@@ -1032,10 +1043,22 @@ static int htab_copy_dynptr_key(struct bpf_htab *htab, void *dst_key, const void
 	return err;
 }
 
+static inline int htab_copy_next_key(struct bpf_htab *htab, void *next_key, const void *key,
+				     u32 key_size)
+{
+	if (!bpf_map_has_dynptr_key(&htab->map)) {
+		memcpy(next_key, key, key_size);
+		return 0;
+	}
+
+	return htab_copy_dynptr_key(htab, next_key, key, key_size, false);
+}
+
 /* Called from syscall */
 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 {
 	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	const struct btf_record *key_record = map->key_record;
 	struct hlist_nulls_head *head;
 	struct htab_elem *l, *next_l;
 	u32 hash, key_size;
@@ -1048,12 +1071,12 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	if (!key)
 		goto find_first_elem;
 
-	hash = htab_map_hash(key, key_size, htab->hashrnd, NULL);
+	hash = htab_map_hash(key, key_size, htab->hashrnd, key_record);
 
 	head = select_bucket(htab, hash);
 
 	/* lookup the key */
-	l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets, NULL);
+	l = lookup_nulls_elem_raw(head, hash, key, key_size, htab->n_buckets, key_record);
 
 	if (!l)
 		goto find_first_elem;
@@ -1064,8 +1087,7 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 
 	if (next_l) {
 		/* if next elem in this hash list is non-zero, just return it */
-		memcpy(next_key, next_l->key, key_size);
-		return 0;
+		return htab_copy_next_key(htab, next_key, next_l->key, key_size);
 	}
 
 	/* no more elements in this hash list, go to the next bucket */
@@ -1082,8 +1104,7 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 					  struct htab_elem, hash_node);
 		if (next_l) {
 			/* if it's not empty, just return it */
-			memcpy(next_key, next_l->key, key_size);
-			return 0;
+			return htab_copy_next_key(htab, next_key, next_l->key, key_size);
 		}
 	}
 
@@ -1263,7 +1284,7 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
 	if (bpf_map_has_dynptr_key(&htab->map)) {
 		int copy_err;
 
-		copy_err = htab_copy_dynptr_key(htab, l_new->key, key, key_size);
+		copy_err = htab_copy_dynptr_key(htab, l_new->key, key, key_size, true);
 		if (copy_err) {
 			bpf_mem_cache_free(&htab->ma, l_new);
 			l_new = ERR_PTR(copy_err);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 16/20] bpf: Disable unsupported operations for map with dynptr key
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (14 preceding siblings ...)
  2025-01-25 11:11 ` [PATCH bpf-next v2 15/20] bpf: Support get_next_key operation for dynptr key in hash map Hou Tao
@ 2025-01-25 11:11 ` Hou Tao
  2025-01-25 11:11 ` [PATCH bpf-next v2 17/20] bpf: Enable BPF_INT_F_DYNPTR_IN_KEY for hash map Hou Tao
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Neither batched map operations nor dumping the map content through
bpffs is supported for maps with dynptr keys, so disable these
operations for now.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 include/linux/bpf.h  | 3 ++-
 kernel/bpf/syscall.c | 4 ++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a7dcdbd8c2824..194f3d4c1b0d0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -635,7 +635,8 @@ static inline bool bpf_map_offload_neutral(const struct bpf_map *map)
 static inline bool bpf_map_support_seq_show(const struct bpf_map *map)
 {
 	return (map->btf_value_type_id || map->btf_vmlinux_value_type_id) &&
-		map->ops->map_seq_show_elem;
+		map->ops->map_seq_show_elem &&
+		!bpf_map_has_dynptr_key(map);
 }
 
 int map_check_no_btf(const struct bpf_map *map,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index dc29fa897855c..0f102142cc0db 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -5542,6 +5542,10 @@ static int bpf_map_do_batch(const union bpf_attr *attr,
 		err = -EPERM;
 		goto err_put;
 	}
+	if (bpf_map_has_dynptr_key(map)) {
+		err = -EOPNOTSUPP;
+		goto err_put;
+	}
 
 	if (cmd == BPF_MAP_LOOKUP_BATCH)
 		BPF_DO_BATCH(map->ops->map_lookup_batch, map, attr, uattr);
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 17/20] bpf: Enable BPF_INT_F_DYNPTR_IN_KEY for hash map
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (15 preceding siblings ...)
  2025-01-25 11:11 ` [PATCH bpf-next v2 16/20] bpf: Disable unsupported operations for map with dynptr key Hou Tao
@ 2025-01-25 11:11 ` Hou Tao
  2025-01-25 11:11 ` [PATCH bpf-next v2 18/20] selftests/bpf: Add bpf_dynptr_user_init() helper Hou Tao
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Add BPF_INT_F_DYNPTR_IN_KEY to HTAB_CREATE_FLAG_MASK to support the
creation of hash maps with dynptr keys.

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 kernel/bpf/hashtab.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 74962a461d091..79aa97ad2c903 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -19,7 +19,7 @@
 
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
-	 BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
+	 BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED | BPF_INT_F_DYNPTR_IN_KEY)
 
 #define BATCH_OPS(_name)			\
 	.map_lookup_batch =			\
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 18/20] selftests/bpf: Add bpf_dynptr_user_init() helper
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (16 preceding siblings ...)
  2025-01-25 11:11 ` [PATCH bpf-next v2 17/20] bpf: Enable BPF_INT_F_DYNPTR_IN_KEY for hash map Hou Tao
@ 2025-01-25 11:11 ` Hou Tao
  2025-01-25 11:11 ` [PATCH bpf-next v2 19/20] selftests/bpf: Add test cases for hash map with dynptr key Hou Tao
  2025-01-25 11:11 ` [PATCH bpf-next v2 20/20] selftests/bpf: Add benchmark for dynptr key support in hash map Hou Tao
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Add bpf_dynptr_user_init() to initialize a bpf_dynptr_user object. It
will be used by test_progs and bench. Users can dereference the
{data|size} fields directly to get the address and length of the dynptr
object.
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/bpf_util.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_util.h b/tools/testing/selftests/bpf/bpf_util.h
index 5f6963a320d73..8ad7e97006c75 100644
--- a/tools/testing/selftests/bpf/bpf_util.h
+++ b/tools/testing/selftests/bpf/bpf_util.h
@@ -71,4 +71,13 @@ static inline void bpf_strlcpy(char *dst, const char *src, size_t sz)
 #define ENOTSUPP 524
 #endif
 
+/* sys_bpf() will check the validity of data and size */
+static inline void bpf_dynptr_user_init(void *data, __u32 size,
+					struct bpf_dynptr_user *dynptr)
+{
+	dynptr->data = data;
+	dynptr->size = size;
+	dynptr->reserved = 0;
+}
+
 #endif /* __BPF_UTIL__ */
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 19/20] selftests/bpf: Add test cases for hash map with dynptr key
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (17 preceding siblings ...)
  2025-01-25 11:11 ` [PATCH bpf-next v2 18/20] selftests/bpf: Add bpf_dynptr_user_init() helper Hou Tao
@ 2025-01-25 11:11 ` Hou Tao
  2025-01-25 11:11 ` [PATCH bpf-next v2 20/20] selftests/bpf: Add benchmark for dynptr key support in hash map Hou Tao
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

Add three positive test cases to test the basic operations on the
dynptr-keyed hash map. The basic operations include lookup, update,
delete and get_next_key, and they are exercised both through the bpf
syscall and through bpf programs. The three test cases use different
map keys: the first one uses both a plain bpf_dynptr and a struct with
only a bpf_dynptr as the map key, the second one uses a struct with an
integer and a bpf_dynptr as the map key, and the last one uses a
struct with a bpf_dynptr nested in another struct as the map key.

Also add multiple negative test cases for the dynptr-keyed hash map.
These test cases mainly check whether the layout of dynptr and
non-dynptr fields on the stack matches the definition of
map->key_record.
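
For the positive cases, the key passed to the map helpers has to carry
an initialized dynptr on the stack at the offset recorded in
map->key_record, roughly as in this sketch (excerpted from the pattern
used by the success tests below):

	/* dynptr_buf and systemd_name are defined in the success test below */
	struct mixed_dynptr_key key;
	unsigned long *value;

	__builtin_memset(&key, 0, sizeof(key));
	key.id = 1000;
	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(systemd_name), 0, &key.name);
	value = bpf_map_lookup_elem(htab, &key);

The negative cases cover the opposite situations, e.g. an uninitialized
or invalid dynptr at a dynptr offset, a dynptr at a non-dynptr offset,
or a misaligned or variable-offset key pointer.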

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 .../bpf/prog_tests/htab_dynkey_test.c         | 427 ++++++++++++++++++
 .../bpf/progs/htab_dynkey_test_failure.c      | 216 +++++++++
 .../bpf/progs/htab_dynkey_test_success.c      | 383 ++++++++++++++++
 3 files changed, 1026 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/htab_dynkey_test.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_dynkey_test_failure.c
 create mode 100644 tools/testing/selftests/bpf/progs/htab_dynkey_test_success.c

diff --git a/tools/testing/selftests/bpf/prog_tests/htab_dynkey_test.c b/tools/testing/selftests/bpf/prog_tests/htab_dynkey_test.c
new file mode 100644
index 0000000000000..b1f86642e89c1
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/htab_dynkey_test.c
@@ -0,0 +1,427 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2025. Huawei Technologies Co., Ltd */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <test_progs.h>
+
+#include "htab_dynkey_test_success.skel.h"
+#include "htab_dynkey_test_failure.skel.h"
+
+struct id_dname_key {
+	int id;
+	struct bpf_dynptr_user name;
+};
+
+struct dname_key {
+	struct bpf_dynptr_user name;
+};
+
+struct nested_dynptr_key {
+	unsigned long f_1;
+	struct id_dname_key f_2;
+	unsigned long f_3;
+};
+
+static char *name_list[] = {
+	"systemd",
+	"[rcu_sched]",
+	"[kworker/42:0H-events_highpri]",
+	"[ksoftirqd/58]",
+	"[rcu_tasks_trace]",
+};
+
+#define INIT_VALUE 100
+#define INIT_ID 1000
+
+static void setup_pure_dynptr_key_map(int fd)
+{
+	struct bpf_dynptr_user key, _cur_key, _next_key;
+	struct bpf_dynptr_user *cur_key, *next_key;
+	bool marked[ARRAY_SIZE(name_list)];
+	unsigned int i, next_idx, size;
+	unsigned long value, got;
+	char name[2][64];
+	char msg[64];
+	void *data;
+	int err;
+
+	/* lookup non-existent keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u bad lookup", i);
+		/* Use strdup() to ensure that the content pointed to by the
+		 * dynptr is used for the lookup instead of the pointer inside
+		 * the dynptr. sys_bpf() will handle the NULL case properly.
+		 */
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key);
+		err = bpf_map_lookup_elem(fd, &key, &value);
+		ASSERT_EQ(err, -ENOENT, msg);
+		free(data);
+	}
+
+	/* update keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u insert", i);
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key);
+		value = INIT_VALUE + i;
+		err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+		ASSERT_OK(err, msg);
+		free(data);
+	}
+
+	/* lookup existent keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u lookup", i);
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key);
+		got = 0;
+		err = bpf_map_lookup_elem(fd, &key, &got);
+		ASSERT_OK(err, msg);
+		free(data);
+
+		value = INIT_VALUE + i;
+		ASSERT_EQ(got, value, msg);
+	}
+
+	/* delete keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u delete", i);
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key);
+		err = bpf_map_delete_elem(fd, &key);
+		ASSERT_OK(err, msg);
+		free(data);
+	}
+
+	/* re-insert keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u re-insert", i);
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key);
+		value = 0;
+		err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+		ASSERT_OK(err, msg);
+		free(data);
+	}
+
+	/* overwrite keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u overwrite", i);
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key);
+		value = INIT_VALUE + i;
+		err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+		ASSERT_OK(err, msg);
+		free(data);
+	}
+
+	/* get_next keys */
+	next_idx = 0;
+	cur_key = NULL;
+	next_key = &_next_key;
+	memset(&marked, 0, sizeof(marked));
+	while (true) {
+		bpf_dynptr_user_init(name[next_idx], sizeof(name[next_idx]), next_key);
+		err = bpf_map_get_next_key(fd, cur_key, next_key);
+		if (err) {
+			ASSERT_EQ(err, -ENOENT, "get_next_key");
+			break;
+		}
+
+		size = next_key->size;
+		data = next_key->data;
+		for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+			if (size == strlen(name_list[i]) + 1 &&
+			    !memcmp(name_list[i], data, size)) {
+				ASSERT_FALSE(marked[i], name_list[i]);
+				marked[i] = true;
+				break;
+			}
+		}
+		ASSERT_EQ(next_key->reserved, 0, "reserved");
+
+		if (!cur_key)
+			cur_key = &_cur_key;
+		*cur_key = *next_key;
+		next_idx ^= 1;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(marked); i++)
+		ASSERT_TRUE(marked[i], name_list[i]);
+
+	/* lookup_and_delete all elements except the first one */
+	for (i = 1; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u lookup_delete", i);
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key);
+		got = 0;
+		err = bpf_map_lookup_and_delete_elem(fd, &key, &got);
+		ASSERT_OK(err, msg);
+		free(data);
+
+		value = INIT_VALUE + i;
+		ASSERT_EQ(got, value, msg);
+	}
+
+	/* get the key after the first element */
+	cur_key = &_cur_key;
+	strncpy(name[0], name_list[0], sizeof(name[0]) - 1);
+	name[0][sizeof(name[0]) - 1] = 0;
+	bpf_dynptr_user_init(name[0], strlen(name[0]) + 1, cur_key);
+
+	next_key = &_next_key;
+	bpf_dynptr_user_init(name[1], sizeof(name[1]), next_key);
+	err = bpf_map_get_next_key(fd, cur_key, next_key);
+	ASSERT_EQ(err, -ENOENT, "get_last");
+}
+
+static void setup_mixed_dynptr_key_map(int fd)
+{
+	struct id_dname_key key, _cur_key, _next_key;
+	struct id_dname_key *cur_key, *next_key;
+	bool marked[ARRAY_SIZE(name_list)];
+	unsigned int i, next_idx, size;
+	unsigned long value;
+	char name[2][64];
+	char msg[64];
+	void *data;
+	int err;
+
+	/* Zero the hole */
+	memset(&key, 0, sizeof(key));
+
+	/* lookup non-existent keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u bad lookup", i);
+		key.id = INIT_ID + i;
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key.name);
+		err = bpf_map_lookup_elem(fd, &key, &value);
+		ASSERT_EQ(err, -ENOENT, msg);
+		free(data);
+	}
+
+	/* update keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u insert", i);
+		key.id = INIT_ID + i;
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key.name);
+		value = INIT_VALUE + i;
+		err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+		ASSERT_OK(err, msg);
+		free(data);
+	}
+
+	/* lookup existent keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		unsigned long got = 0;
+
+		snprintf(msg, sizeof(msg), "#%u lookup", i);
+		key.id = INIT_ID + i;
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key.name);
+		err = bpf_map_lookup_elem(fd, &key, &got);
+		ASSERT_OK(err, msg);
+		free(data);
+
+		value = INIT_VALUE + i;
+		ASSERT_EQ(got, value, msg);
+	}
+
+	/* delete keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u delete", i);
+		key.id = INIT_ID + i;
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key.name);
+		err = bpf_map_delete_elem(fd, &key);
+		ASSERT_OK(err, msg);
+		free(data);
+	}
+
+	/* re-insert keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u re-insert", i);
+		key.id = INIT_ID + i;
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key.name);
+		value = 0;
+		err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+		ASSERT_OK(err, msg);
+		free(data);
+	}
+
+	/* overwrite keys */
+	for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+		snprintf(msg, sizeof(msg), "#%u overwrite", i);
+		key.id = INIT_ID + i;
+		data = strdup(name_list[i]);
+		bpf_dynptr_user_init(data, strlen(name_list[i]) + 1, &key.name);
+		value = INIT_VALUE + i;
+		err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+		ASSERT_OK(err, msg);
+		free(data);
+	}
+
+	/* get_next keys */
+	next_idx = 0;
+	cur_key = NULL;
+	next_key = &_next_key;
+	memset(&marked, 0, sizeof(marked));
+	while (true) {
+		bpf_dynptr_user_init(name[next_idx], sizeof(name[next_idx]), &next_key->name);
+		err = bpf_map_get_next_key(fd, cur_key, next_key);
+		if (err) {
+			ASSERT_EQ(err, -ENOENT, "last get_next");
+			break;
+		}
+
+		size = next_key->name.size;
+		data = next_key->name.data;
+		for (i = 0; i < ARRAY_SIZE(name_list); i++) {
+			if (size == strlen(name_list[i]) + 1 &&
+			    !memcmp(name_list[i], data, size)) {
+				ASSERT_FALSE(marked[i], name_list[i]);
+				ASSERT_EQ(next_key->id, INIT_ID + i, name_list[i]);
+				marked[i] = true;
+				break;
+			}
+		}
+		ASSERT_EQ(next_key->name.reserved, 0, "reserved");
+
+		if (!cur_key)
+			cur_key = &_cur_key;
+		*cur_key = *next_key;
+		next_idx ^= 1;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(marked); i++)
+		ASSERT_TRUE(marked[i], name_list[i]);
+}
+
+static void setup_nested_dynptr_key_map(int fd)
+{
+	struct nested_dynptr_key key, cur_key, next_key;
+	unsigned long value;
+	unsigned int size;
+	char name[2][64];
+	void *data;
+	int err;
+
+	/* Zero the hole */
+	memset(&key, 0, sizeof(key));
+
+	key.f_1 = 1;
+	key.f_2.id = 2;
+	key.f_3 = 3;
+
+	/* lookup a non-existent key */
+	data = strdup(name_list[0]);
+	bpf_dynptr_user_init(data, strlen(name_list[0]) + 1, &key.f_2.name);
+	err = bpf_map_lookup_elem(fd, &key, &value);
+	ASSERT_EQ(err, -ENOENT, "lookup");
+
+	/* update key */
+	value = INIT_VALUE;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+	ASSERT_OK(err, "update");
+	free(data);
+
+	/* lookup key */
+	data = strdup(name_list[0]);
+	bpf_dynptr_user_init(data, strlen(name_list[0]) + 1, &key.f_2.name);
+	err = bpf_map_lookup_elem(fd, &key, &value);
+	ASSERT_OK(err, "lookup");
+	ASSERT_EQ(value, INIT_VALUE, "lookup");
+
+	/* delete key */
+	err = bpf_map_delete_elem(fd, &key);
+	ASSERT_OK(err, "delete");
+	free(data);
+
+	/* re-insert keys */
+	bpf_dynptr_user_init(name_list[0], strlen(name_list[0]) + 1, &key.f_2.name);
+	value = 0;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
+	ASSERT_OK(err, "re-insert");
+
+	/* overwrite keys */
+	data = strdup(name_list[0]);
+	bpf_dynptr_user_init(data, strlen(name_list[0]) + 1, &key.f_2.name);
+	value = INIT_VALUE;
+	err = bpf_map_update_elem(fd, &key, &value, BPF_EXIST);
+	ASSERT_OK(err, "overwrite");
+	free(data);
+
+	/* get_next_key */
+	bpf_dynptr_user_init(name[0], sizeof(name[0]), &next_key.f_2.name);
+	err = bpf_map_get_next_key(fd, NULL, &next_key);
+	ASSERT_OK(err, "first get_next");
+
+	ASSERT_EQ(next_key.f_1, 1, "f_1");
+
+	ASSERT_EQ(next_key.f_2.id, 2, "f_2 id");
+	size = next_key.f_2.name.size;
+	data = next_key.f_2.name.data;
+	if (ASSERT_EQ(size, strlen(name_list[0]) + 1, "f_2 size"))
+		ASSERT_TRUE(!memcmp(name_list[0], data, size), "f_2 data");
+	ASSERT_EQ(next_key.f_2.name.reserved, 0, "f_2 reserved");
+
+	ASSERT_EQ(next_key.f_3, 3, "f_3");
+
+	cur_key = next_key;
+	bpf_dynptr_user_init(name[1], sizeof(name[1]), &next_key.f_2.name);
+	err = bpf_map_get_next_key(fd, &cur_key, &next_key);
+	ASSERT_EQ(err, -ENOENT, "last get_next_key");
+}
+
+static void test_htab_dynptr_key(bool pure, bool nested)
+{
+	struct htab_dynkey_test_success *skel;
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+	struct bpf_program *prog;
+	int err;
+
+	skel = htab_dynkey_test_success__open();
+	if (!ASSERT_OK_PTR(skel, "open()"))
+		return;
+
+	prog = pure ? skel->progs.pure_dynptr_key :
+	       (nested ? skel->progs.nested_dynptr_key : skel->progs.mixed_dynptr_key);
+	bpf_program__set_autoload(prog, true);
+
+	err = htab_dynkey_test_success__load(skel);
+	if (!ASSERT_OK(err, "load()"))
+		goto out;
+
+	if (pure) {
+		setup_pure_dynptr_key_map(bpf_map__fd(skel->maps.htab_1));
+		setup_pure_dynptr_key_map(bpf_map__fd(skel->maps.htab_2));
+	} else if (nested) {
+		setup_nested_dynptr_key_map(bpf_map__fd(skel->maps.htab_4));
+	} else {
+		setup_mixed_dynptr_key_map(bpf_map__fd(skel->maps.htab_3));
+	}
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(prog), &opts);
+	ASSERT_OK(err, "run");
+	ASSERT_EQ(opts.retval, 0, "retval");
+out:
+	htab_dynkey_test_success__destroy(skel);
+}
+
+void test_htab_dynkey_test(void)
+{
+	if (test__start_subtest("pure_dynptr_key"))
+		test_htab_dynptr_key(true, false);
+	if (test__start_subtest("mixed_dynptr_key"))
+		test_htab_dynptr_key(false, false);
+	if (test__start_subtest("nested_dynptr_key"))
+		test_htab_dynptr_key(false, true);
+
+	RUN_TESTS(htab_dynkey_test_failure);
+}
diff --git a/tools/testing/selftests/bpf/progs/htab_dynkey_test_failure.c b/tools/testing/selftests/bpf/progs/htab_dynkey_test_failure.c
new file mode 100644
index 0000000000000..2899f1041624b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/htab_dynkey_test_failure.c
@@ -0,0 +1,216 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2025. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct bpf_map;
+
+struct id_dname_key {
+	int id;
+	struct bpf_dynptr name;
+};
+
+struct dname_id_key {
+	struct bpf_dynptr name;
+	int id;
+};
+
+struct id_name_key {
+	int id;
+	char name[20];
+};
+
+struct dname_key {
+	struct bpf_dynptr name;
+};
+
+struct dname_dname_key {
+	struct bpf_dynptr name_1;
+	struct bpf_dynptr name_2;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, struct id_dname_key);
+	__type(value, unsigned long);
+	__uint(map_extra, 1024);
+} htab_1 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, struct dname_key);
+	__type(value, unsigned long);
+	__uint(map_extra, 1024);
+} htab_2 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, struct bpf_dynptr);
+	__type(value, unsigned long);
+	__uint(map_extra, 1024);
+} htab_3 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 4096);
+} ringbuf SEC(".maps");
+
+char dynptr_buf[32] = {};
+
+/* uninitialized dynptr */
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+__failure __msg("dynptr-key expects dynptr at offset 8")
+int BPF_PROG(uninit_dynptr)
+{
+	struct id_dname_key key;
+
+	key.id = 100;
+	bpf_map_lookup_elem(&htab_1, &key);
+
+	return 0;
+}
+
+/* invalid dynptr */
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+__failure __msg("dynptr-key expects dynptr at offset 8")
+int BPF_PROG(invalid_dynptr)
+{
+	struct id_dname_key key;
+
+	key.id = 100;
+	bpf_ringbuf_reserve_dynptr(&ringbuf, 10, 0, &key.name);
+	bpf_ringbuf_discard_dynptr(&key.name, 0);
+	bpf_map_lookup_elem(&htab_1, &key);
+
+	return 0;
+}
+
+/* expect no-dynptr got dynptr */
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+__failure __msg("dynptr-key expects non-dynptr at offset 0")
+int BPF_PROG(invalid_non_dynptr)
+{
+	struct dname_id_key key;
+
+	__builtin_memcpy(dynptr_buf, "test", 4);
+	bpf_dynptr_from_mem(dynptr_buf, 4, 0, &key.name);
+	key.id = 100;
+	bpf_map_lookup_elem(&htab_1, &key);
+
+	return 0;
+}
+
+/* expect dynptr get non-dynptr */
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+__failure __msg("dynptr-key expects dynptr at offset 8")
+int BPF_PROG(no_dynptr)
+{
+	struct id_name_key key;
+
+	key.id = 100;
+	__builtin_memset(key.name, 0, sizeof(key.name));
+	__builtin_memcpy(key.name, "test", 4);
+	bpf_map_lookup_elem(&htab_1, &key);
+
+	return 0;
+}
+
+/* malformed */
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+__failure __msg("malformed dynptr-key at offset 8")
+int BPF_PROG(malformed_dynptr)
+{
+	struct dname_dname_key key;
+
+	bpf_dynptr_from_mem(dynptr_buf, 4, 0, &key.name_1);
+	bpf_dynptr_from_mem(dynptr_buf, 4, 0, &key.name_2);
+
+	bpf_map_lookup_elem(&htab_2, (void *)&key + 8);
+
+	return 0;
+}
+
+/* misaligned */
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+__failure __msg("R2 misaligned offset -28 for dynptr-key")
+int BPF_PROG(misaligned_dynptr)
+{
+	struct dname_dname_key key;
+
+	bpf_map_lookup_elem(&htab_1, (char *)&key + 4);
+
+	return 0;
+}
+
+/* variable offset */
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+__failure __msg("R2 variable offset prohibited for dynptr-key")
+int BPF_PROG(variable_offset_dynptr)
+{
+	struct bpf_dynptr dynptr_1;
+	struct bpf_dynptr dynptr_2;
+	char *key;
+
+	bpf_dynptr_from_mem(dynptr_buf, 4, 0, &dynptr_1);
+	bpf_dynptr_from_mem(dynptr_buf, 4, 0, &dynptr_2);
+
+	key = (char *)&dynptr_2;
+	key = key + (bpf_get_prandom_u32() & 1) * 16;
+
+	bpf_map_lookup_elem(&htab_2, key);
+
+	return 0;
+}
+
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+__failure __msg("map dynptr-key requires stack ptr but got map_value")
+int BPF_PROG(map_value_as_key)
+{
+	bpf_map_lookup_elem(&htab_1, dynptr_buf);
+
+	return 0;
+}
+
+static int lookup_htab(struct bpf_map *map, struct id_dname_key *key, void *value, void *data)
+{
+	bpf_map_lookup_elem(&htab_1, key);
+	return 0;
+}
+
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+__failure __msg("map dynptr-key requires stack ptr but got map_key")
+int BPF_PROG(map_key_as_key)
+{
+	bpf_for_each_map_elem(&htab_1, lookup_htab, NULL, 0);
+	return 0;
+}
+
+__noinline __weak int subprog_lookup_htab(struct bpf_dynptr *dynptr)
+{
+	bpf_map_lookup_elem(&htab_3, dynptr);
+	return 0;
+}
+
+SEC("fentry/" SYS_PREFIX "sys_nanosleep")
+__failure __msg("R2 type=dynptr_ptr expected=")
+int BPF_PROG(subprog_dynptr)
+{
+	struct bpf_dynptr dynptr;
+
+	bpf_dynptr_from_mem(dynptr_buf, 4, 0, &dynptr);
+	subprog_lookup_htab(&dynptr);
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/htab_dynkey_test_success.c b/tools/testing/selftests/bpf/progs/htab_dynkey_test_success.c
new file mode 100644
index 0000000000000..ff37f22f07da4
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/htab_dynkey_test_success.c
@@ -0,0 +1,383 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2025. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <errno.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+struct pure_dynptr_key {
+	struct bpf_dynptr name;
+};
+
+struct mixed_dynptr_key {
+	int id;
+	struct bpf_dynptr name;
+};
+
+struct nested_dynptr_key {
+	unsigned long f_1;
+	struct mixed_dynptr_key f_2;
+	unsigned long f_3;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, struct bpf_dynptr);
+	__type(value, unsigned long);
+	__uint(map_extra, 1024);
+} htab_1 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, struct pure_dynptr_key);
+	__type(value, unsigned long);
+	__uint(map_extra, 1024);
+} htab_2 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, struct mixed_dynptr_key);
+	__type(value, unsigned long);
+	__uint(map_extra, 1024);
+} htab_3 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, 10);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, struct nested_dynptr_key);
+	__type(value, unsigned long);
+	__uint(map_extra, 1024);
+} htab_4 SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 4096);
+} ringbuf SEC(".maps");
+
+char dynptr_buf[2][32] = {{}, {}};
+
+static const char systemd_name[] = "systemd";
+static const char udevd_name[] = "udevd";
+static const char rcu_sched_name[] = "[rcu_sched]";
+
+struct bpf_map;
+
+static int test_pure_dynptr_key_htab(struct bpf_map *htab)
+{
+	unsigned long new_value, *value;
+	struct bpf_dynptr key;
+	int err = 0;
+
+	/* Lookup an existent key */
+	__builtin_memcpy(dynptr_buf[0], systemd_name, sizeof(systemd_name));
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(systemd_name), 0, &key);
+	value = bpf_map_lookup_elem(htab, &key);
+	if (!value) {
+		err = 1;
+		goto out;
+	}
+	if (*value != 100) {
+		err = 2;
+		goto out;
+	}
+
+	/* Look up a non-existent key */
+	__builtin_memcpy(dynptr_buf[0], udevd_name, sizeof(udevd_name));
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(udevd_name), 0, &key);
+	value = bpf_map_lookup_elem(htab, &key);
+	if (value) {
+		err = 3;
+		goto out;
+	}
+
+	/* Insert a new key */
+	new_value = 42;
+	err = bpf_map_update_elem(htab, &key, &new_value, BPF_NOEXIST);
+	if (err) {
+		err = 4;
+		goto out;
+	}
+
+	/* Insert an existent key */
+	bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(udevd_name), 0, &key);
+	err = bpf_dynptr_write(&key, 0, (void *)udevd_name, sizeof(udevd_name), 0);
+	if (err) {
+		bpf_ringbuf_discard_dynptr(&key, 0);
+		err = 5;
+		goto out;
+	}
+
+	err = bpf_map_update_elem(htab, &key, &new_value, BPF_NOEXIST);
+	bpf_ringbuf_discard_dynptr(&key, 0);
+	if (err != -EEXIST) {
+		err = 6;
+		goto out;
+	}
+
+	/* Lookup it again */
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(udevd_name), 0, &key);
+	value = bpf_map_lookup_elem(htab, &key);
+	if (!value) {
+		err = 7;
+		goto out;
+	}
+	if (*value != 42) {
+		err = 8;
+		goto out;
+	}
+
+	/* Delete then lookup it */
+	bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(udevd_name), 0, &key);
+	err = bpf_dynptr_write(&key, 0, (void *)udevd_name, sizeof(udevd_name), 0);
+	if (err) {
+		bpf_ringbuf_discard_dynptr(&key, 0);
+		err = 9;
+		goto out;
+	}
+	err = bpf_map_delete_elem(htab, &key);
+	bpf_ringbuf_discard_dynptr(&key, 0);
+	if (err) {
+		err = 10;
+		goto out;
+	}
+
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(udevd_name), 0, &key);
+	value = bpf_map_lookup_elem(htab, &key);
+	if (value) {
+		err = 10;
+		goto out;
+	}
+out:
+	return err;
+}
+
+static int test_mixed_dynptr_key_htab(struct bpf_map *htab)
+{
+	unsigned long new_value, *value;
+	char udevd_name[] = "udevd";
+	struct mixed_dynptr_key key;
+	int err = 0;
+
+	__builtin_memset(&key, 0, sizeof(key));
+	key.id = 1000;
+
+	/* Lookup an existent key */
+	__builtin_memcpy(dynptr_buf[0], systemd_name, sizeof(systemd_name));
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(systemd_name), 0, &key.name);
+	value = bpf_map_lookup_elem(htab, &key);
+	if (!value) {
+		err = 1;
+		goto out;
+	}
+	if (*value != 100) {
+		err = 2;
+		goto out;
+	}
+
+	/* Look up a non-existent key */
+	__builtin_memcpy(dynptr_buf[0], udevd_name, sizeof(udevd_name));
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(udevd_name), 0, &key.name);
+	value = bpf_map_lookup_elem(htab, &key);
+	if (value) {
+		err = 3;
+		goto out;
+	}
+
+	/* Insert a new key */
+	new_value = 42;
+	err = bpf_map_update_elem(htab, &key, &new_value, BPF_NOEXIST);
+	if (err) {
+		err = 4;
+		goto out;
+	}
+
+	/* Insert an existent key */
+	bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(udevd_name), 0, &key.name);
+	err = bpf_dynptr_write(&key.name, 0, (void *)udevd_name, sizeof(udevd_name), 0);
+	if (err) {
+		bpf_ringbuf_discard_dynptr(&key.name, 0);
+		err = 5;
+		goto out;
+	}
+
+	err = bpf_map_update_elem(htab, &key, &new_value, BPF_NOEXIST);
+	bpf_ringbuf_discard_dynptr(&key.name, 0);
+	if (err != -EEXIST) {
+		err = 6;
+		goto out;
+	}
+
+	/* Lookup it again */
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(udevd_name), 0, &key.name);
+	value = bpf_map_lookup_elem(htab, &key);
+	if (!value) {
+		err = 7;
+		goto out;
+	}
+	if (*value != 42) {
+		err = 8;
+		goto out;
+	}
+
+	/* Delete then lookup it */
+	bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(udevd_name), 0, &key.name);
+	err = bpf_dynptr_write(&key.name, 0, (void *)udevd_name, sizeof(udevd_name), 0);
+	if (err) {
+		bpf_ringbuf_discard_dynptr(&key.name, 0);
+		err = 9;
+		goto out;
+	}
+	err = bpf_map_delete_elem(htab, &key);
+	bpf_ringbuf_discard_dynptr(&key.name, 0);
+	if (err) {
+		err = 10;
+		goto out;
+	}
+
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(udevd_name), 0, &key.name);
+	value = bpf_map_lookup_elem(htab, &key);
+	if (value) {
+		err = 10;
+		goto out;
+	}
+out:
+	return err;
+}
+
+static int test_nested_dynptr_key_htab(struct bpf_map *htab)
+{
+	unsigned long new_value, *value;
+	struct nested_dynptr_key key;
+	int err = 0;
+
+	__builtin_memset(&key, 0, sizeof(key));
+	key.f_1 = 1;
+	key.f_2.id = 2;
+	key.f_3 = 3;
+
+	/* Lookup an existent key */
+	__builtin_memcpy(dynptr_buf[0], systemd_name, sizeof(systemd_name));
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(systemd_name), 0, &key.f_2.name);
+	value = bpf_map_lookup_elem(htab, &key);
+	if (!value) {
+		err = 1;
+		goto out;
+	}
+	if (*value != 100) {
+		err = 2;
+		goto out;
+	}
+
+	/* Look up a non-existent key */
+	__builtin_memcpy(dynptr_buf[0], rcu_sched_name, sizeof(rcu_sched_name));
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(rcu_sched_name), 0, &key.f_2.name);
+	value = bpf_map_lookup_elem(htab, &key);
+	if (value) {
+		err = 3;
+		goto out;
+	}
+
+	/* Insert a new key */
+	new_value = 42;
+	err = bpf_map_update_elem(htab, &key, &new_value, BPF_NOEXIST);
+	if (err) {
+		err = 4;
+		goto out;
+	}
+
+	/* Insert an existent key */
+	bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(rcu_sched_name), 0, &key.f_2.name);
+	err = bpf_dynptr_write(&key.f_2.name, 0, (void *)rcu_sched_name, sizeof(rcu_sched_name), 0);
+	if (err) {
+		bpf_ringbuf_discard_dynptr(&key.f_2.name, 0);
+		err = 5;
+		goto out;
+	}
+	err = bpf_map_update_elem(htab, &key, &new_value, BPF_NOEXIST);
+	bpf_ringbuf_discard_dynptr(&key.f_2.name, 0);
+	if (err != -EEXIST) {
+		err = 6;
+		goto out;
+	}
+
+	/* Lookup a non-existent key */
+	bpf_dynptr_from_mem(dynptr_buf[0], sizeof(rcu_sched_name), 0, &key.f_2.name);
+	key.f_3 = 0;
+	value = bpf_map_lookup_elem(htab, &key);
+	if (value) {
+		err = 7;
+		goto out;
+	}
+
+	/* Lookup an existent key */
+	key.f_3 = 3;
+	value = bpf_map_lookup_elem(htab, &key);
+	if (!value) {
+		err = 8;
+		goto out;
+	}
+	if (*value != 42) {
+		err = 9;
+		goto out;
+	}
+
+	/* Delete the newly-inserted key */
+	bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(systemd_name), 0, &key.f_2.name);
+	err = bpf_dynptr_write(&key.f_2.name, 0, (void *)systemd_name, sizeof(systemd_name), 0);
+	if (err) {
+		bpf_ringbuf_discard_dynptr(&key.f_2.name, 0);
+		err = 10;
+		goto out;
+	}
+	err = bpf_map_delete_elem(htab, &key);
+	if (err) {
+		bpf_ringbuf_discard_dynptr(&key.f_2.name, 0);
+		err = 11;
+		goto out;
+	}
+
+	/* Lookup it again */
+	value = bpf_map_lookup_elem(htab, &key);
+	bpf_ringbuf_discard_dynptr(&key.f_2.name, 0);
+	if (value) {
+		err = 12;
+		goto out;
+	}
+out:
+	return err;
+}
+
+SEC("?raw_tp")
+int BPF_PROG(pure_dynptr_key)
+{
+	int err;
+
+	err = test_pure_dynptr_key_htab((struct bpf_map *)&htab_1);
+	err |= test_pure_dynptr_key_htab((struct bpf_map *)&htab_2) << 8;
+
+	return err;
+}
+
+SEC("?raw_tp")
+int BPF_PROG(mixed_dynptr_key)
+{
+	return test_mixed_dynptr_key_htab((struct bpf_map *)&htab_3);
+}
+
+SEC("?raw_tp")
+int BPF_PROG(nested_dynptr_key)
+{
+	return test_nested_dynptr_key_htab((struct bpf_map *)&htab_4);
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH bpf-next v2 20/20] selftests/bpf: Add benchmark for dynptr key support in hash map
  2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
                   ` (18 preceding siblings ...)
  2025-01-25 11:11 ` [PATCH bpf-next v2 19/20] selftests/bpf: Add test cases for hash map with dynptr key Hou Tao
@ 2025-01-25 11:11 ` Hou Tao
  19 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-01-25 11:11 UTC (permalink / raw)
  To: bpf
  Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
	Eduard Zingerman, Song Liu, Hao Luo, Yonghong Song,
	Daniel Borkmann, KP Singh, Stanislav Fomichev, Jiri Olsa,
	John Fastabend, Dan Carpenter, houtao1, xukuohai

From: Hou Tao <houtao1@huawei.com>

The patch adds a benchmark test to compare the lookup and update/delete
performance between a normal hash map and a dynptr-keyed hash map. It
also compares the memory usage of the two maps after both have been
filled up.

The benchmark simulates the case where the map key is composed of an
8-byte integer and a variable-size string. For now, the integer just
stores the length of the string. The strings are randomly generated by
default, but they can also be supplied from an external file (e.g., the
output of awk '{print $3}' /proc/kallsyms).

The key definitions for dynptr-keyed and normal hash map are defined as
shown below:

struct dynptr_key {
	__u64 cookie;
	struct bpf_dynptr desc;
}

struct norm_key {
	__u64 cookie;
	char desc[MAX_STR_SIZE];
};

The lookup or update procedure first looks up an array to get the key
for the hash map. The value returned from the array has the same layout
as the norm_key definition. For the normal hash map, the returned value
is used to manipulate the hash map directly. For the dynptr-keyed hash
map, a bpf_dynptr object is constructed from the returned value (the
cookie holds the string length), and the resulting key is then passed
to the dynptr-keyed hash map. Because the lookup procedure is lockless,
each producer in the lookup test looks up the whole hash map. Update
and deletion take a lock, however, so each producer in the update test
only operates on a different part of the hash map.
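
In the BPF program, the dynptr-keyed path thus boils down to something
like the following sketch (mirroring lookup_dynkey_htab() in
dynptr_key_bench.c below; "value" is the struct var_size_key entry
fetched from the array map):

	/* "value" is the var_size_key entry fetched from the array map */
	struct dynptr_key key;
	__u32 *index;

	key.cookie = value->len;
	bpf_dynptr_from_mem(value->data, value->len, 0, &key.desc);
	index = bpf_map_lookup_elem(&dynkey_htab, &key);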

The following are the benchmark results when running the benchmark in
an 8-CPU VM:

(1) Randomly generate 128K strings (max_size=256, entries=128K)

ENTRIES=131072 ./benchs/run_bench_dynptr_key.sh

normal hash map
===============
htab-lookup-p1-131072 2.977 ± 0.017M/s (drops 0.006 ± 0.000M/s, mem 64.984 MiB)
htab-lookup-p2-131072 6.033 ± 0.048M/s (drops 0.015 ± 0.000M/s, mem 64.966 MiB)
htab-lookup-p4-131072 11.612 ± 0.063M/s (drops 0.026 ± 0.000M/s, mem 64.984 MiB)
htab-lookup-p8-131072 22.918 ± 0.315M/s (drops 0.055 ± 0.001M/s, mem 64.966 MiB)
htab-update-p1-131072 2.121 ± 0.014M/s (drops 0.000 ± 0.000M/s, mem 64.986 MiB)
htab-update-p2-131072 4.138 ± 0.047M/s (drops 0.000 ± 0.000M/s, mem 64.986 MiB)
htab-update-p4-131072 7.378 ± 0.078M/s (drops 0.000 ± 0.000M/s, mem 64.986 MiB)
htab-update-p8-131072 13.774 ± 0.129M/s (drops 0.000 ± 0.000M/s, mem 64.986 MiB)

dynptr-keyed hash map
=====================
htab-lookup-p1-131072 3.891 ± 0.008M/s (drops 0.009 ± 0.000M/s, mem 34.908 MiB)
htab-lookup-p2-131072 7.467 ± 0.054M/s (drops 0.016 ± 0.000M/s, mem 34.925 MiB)
htab-lookup-p4-131072 15.151 ± 0.054M/s (drops 0.030 ± 0.000M/s, mem 34.992 MiB)
htab-lookup-p8-131072 29.461 ± 0.448M/s (drops 0.076 ± 0.001M/s, mem 34.910 MiB)
htab-update-p1-131072 2.085 ± 0.124M/s (drops 0.000 ± 0.000M/s, mem 34.888 MiB)
htab-update-p2-131072 3.278 ± 0.068M/s (drops 0.000 ± 0.000M/s, mem 34.888 MiB)
htab-update-p4-131072 6.840 ± 0.100M/s (drops 0.000 ± 0.000M/s, mem 35.023 MiB)
htab-update-p8-131072 11.837 ± 0.190M/s (drops 0.000 ± 0.000M/s, mem 34.941 MiB)

(2) Use strings in /proc/kallsyms (max_size=82, entries=150K)

STR_FILE=kallsyms.txt ./benchs/run_bench_dynptr_key.sh

normal hash map
===============
htab-lookup-p1-kallsyms.txt 7.201 ± 0.080M/s (drops 0.482 ± 0.005M/s, mem 26.384 MiB)
htab-lookup-p2-kallsyms.txt 14.217 ± 0.114M/s (drops 0.951 ± 0.008M/s, mem 26.384 MiB)
htab-lookup-p4-kallsyms.txt 29.293 ± 0.141M/s (drops 1.959 ± 0.010M/s, mem 26.384 MiB)
htab-lookup-p8-kallsyms.txt 58.406 ± 0.384M/s (drops 3.906 ± 0.026M/s, mem 26.384 MiB)
htab-update-p1-kallsyms.txt 3.864 ± 0.036M/s (drops 0.000 ± 0.000M/s, mem 26.387 MiB)
htab-update-p2-kallsyms.txt 5.757 ± 0.078M/s (drops 0.000 ± 0.000M/s, mem 26.387 MiB)
htab-update-p4-kallsyms.txt 10.195 ± 0.655M/s (drops 0.000 ± 0.000M/s, mem 26.387 MiB)
htab-update-p8-kallsyms.txt 18.203 ± 0.165M/s (drops 0.000 ± 0.000M/s, mem 26.387 MiB)

dynptr-keyed hash map
=====================
htab-lookup-p1-kallsyms.txt 7.223 ± 0.007M/s (drops 0.483 ± 0.003M/s, mem 20.993 MiB)
htab-lookup-p2-kallsyms.txt 14.350 ± 0.035M/s (drops 0.960 ± 0.004M/s, mem 20.968 MiB)
htab-lookup-p4-kallsyms.txt 29.317 ± 0.153M/s (drops 1.960 ± 0.013M/s, mem 20.963 MiB)
htab-lookup-p8-kallsyms.txt 58.787 ± 0.662M/s (drops 3.931 ± 0.047M/s, mem 21.018 MiB)
htab-update-p1-kallsyms.txt 2.503 ± 0.124M/s (drops 0.000 ± 0.000M/s, mem 20.972 MiB)
htab-update-p2-kallsyms.txt 4.622 ± 0.422M/s (drops 0.000 ± 0.000M/s, mem 21.104 MiB)
htab-update-p4-kallsyms.txt 8.374 ± 0.149M/s (drops 0.000 ± 0.000M/s, mem 21.027 MiB)
htab-update-p8-kallsyms.txt 14.608 ± 0.319M/s (drops 0.000 ± 0.000M/s, mem 21.027 MiB)

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 tools/testing/selftests/bpf/Makefile          |   2 +
 tools/testing/selftests/bpf/bench.c           |  10 +
 .../selftests/bpf/benchs/bench_dynptr_key.c   | 612 ++++++++++++++++++
 .../bpf/benchs/run_bench_dynptr_key.sh        |  51 ++
 .../selftests/bpf/progs/dynptr_key_bench.c    | 250 +++++++
 5 files changed, 925 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/benchs/bench_dynptr_key.c
 create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_dynptr_key.sh
 create mode 100644 tools/testing/selftests/bpf/progs/dynptr_key_bench.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 8e719170272ad..c9f7b91d18603 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -811,6 +811,7 @@ $(OUTPUT)/bench_local_storage_create.o: $(OUTPUT)/bench_local_storage_create.ske
 $(OUTPUT)/bench_bpf_hashmap_lookup.o: $(OUTPUT)/bpf_hashmap_lookup.skel.h
 $(OUTPUT)/bench_htab_mem.o: $(OUTPUT)/htab_mem_bench.skel.h
 $(OUTPUT)/bench_bpf_crypto.o: $(OUTPUT)/crypto_bench.skel.h
+$(OUTPUT)/bench_dynptr_key.o: $(OUTPUT)/dynptr_key_bench.skel.h
 $(OUTPUT)/bench.o: bench.h testing_helpers.h $(BPFOBJ)
 $(OUTPUT)/bench: LDLIBS += -lm
 $(OUTPUT)/bench: $(OUTPUT)/bench.o \
@@ -831,6 +832,7 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
 		 $(OUTPUT)/bench_local_storage_create.o \
 		 $(OUTPUT)/bench_htab_mem.o \
 		 $(OUTPUT)/bench_bpf_crypto.o \
+		 $(OUTPUT)/bench_dynptr_key.o \
 		 #
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c
index 1bd403a5ef7b3..b13271600bc02 100644
--- a/tools/testing/selftests/bpf/bench.c
+++ b/tools/testing/selftests/bpf/bench.c
@@ -283,6 +283,7 @@ extern struct argp bench_local_storage_create_argp;
 extern struct argp bench_htab_mem_argp;
 extern struct argp bench_trigger_batch_argp;
 extern struct argp bench_crypto_argp;
+extern struct argp bench_dynptr_key_argp;
 
 static const struct argp_child bench_parsers[] = {
 	{ &bench_ringbufs_argp, 0, "Ring buffers benchmark", 0 },
@@ -297,6 +298,7 @@ static const struct argp_child bench_parsers[] = {
 	{ &bench_htab_mem_argp, 0, "hash map memory benchmark", 0 },
 	{ &bench_trigger_batch_argp, 0, "BPF triggering benchmark", 0 },
 	{ &bench_crypto_argp, 0, "bpf crypto benchmark", 0 },
+	{ &bench_dynptr_key_argp, 0, "dynptr key benchmark", 0 },
 	{},
 };
 
@@ -549,6 +551,10 @@ extern const struct bench bench_local_storage_create;
 extern const struct bench bench_htab_mem;
 extern const struct bench bench_crypto_encrypt;
 extern const struct bench bench_crypto_decrypt;
+extern const struct bench bench_norm_htab_lookup;
+extern const struct bench bench_dynkey_htab_lookup;
+extern const struct bench bench_norm_htab_update;
+extern const struct bench bench_dynkey_htab_update;
 
 static const struct bench *benchs[] = {
 	&bench_count_global,
@@ -609,6 +615,10 @@ static const struct bench *benchs[] = {
 	&bench_htab_mem,
 	&bench_crypto_encrypt,
 	&bench_crypto_decrypt,
+	&bench_norm_htab_lookup,
+	&bench_dynkey_htab_lookup,
+	&bench_norm_htab_update,
+	&bench_dynkey_htab_update,
 };
 
 static void find_benchmark(void)
diff --git a/tools/testing/selftests/bpf/benchs/bench_dynptr_key.c b/tools/testing/selftests/bpf/benchs/bench_dynptr_key.c
new file mode 100644
index 0000000000000..713f00cdaac69
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/bench_dynptr_key.c
@@ -0,0 +1,612 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2025. Huawei Technologies Co., Ltd */
+#include <argp.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include "bench.h"
+#include "bpf_util.h"
+#include "cgroup_helpers.h"
+
+#include "dynptr_key_bench.skel.h"
+
+enum {
+	NORM_HTAB = 0,
+	DYNPTR_KEY_HTAB,
+};
+
+static struct dynptr_key_ctx {
+	struct dynptr_key_bench *skel;
+	int cgrp_dfd;
+	u64 map_slab_mem;
+} ctx;
+
+static struct {
+	const char *file;
+	__u32 entries;
+	__u32 max_size;
+} args = {
+	.max_size = 256,
+};
+
+struct run_stat {
+	__u64 stats[2];
+};
+
+struct dynkey_key {
+	/* prevent unnecessary hole */
+	__u64 cookie;
+	struct bpf_dynptr_user desc;
+};
+
+struct var_size_str {
+	/* the same size as cookie */
+	__u64 len;
+	unsigned char data[];
+};
+
+enum {
+	ARG_DATA_FILE = 11001,
+	ARG_DATA_ENTRIES = 11002,
+	ARG_MAX_SIZE = 11003,
+};
+
+static const struct argp_option opts[] = {
+	{ "file", ARG_DATA_FILE, "DATA-FILE", 0, "Set data file" },
+	{ "entries", ARG_DATA_ENTRIES, "DATA-ENTRIES", 0, "Set data entries" },
+	{ "max_size", ARG_MAX_SIZE, "MAX-SIZE", 0, "Set data max size" },
+	{},
+};
+
+static error_t dynptr_key_parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case ARG_DATA_FILE:
+		args.file = strdup(arg);
+		if (!args.file) {
+			fprintf(stderr, "no mem for file name\n");
+			argp_usage(state);
+		}
+		break;
+	case ARG_DATA_ENTRIES:
+		args.entries = strtoul(arg, NULL, 10);
+		break;
+	case ARG_MAX_SIZE:
+		args.max_size = strtoul(arg, NULL, 10);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+
+	return 0;
+}
+
+const struct argp bench_dynptr_key_argp = {
+	.options = opts,
+	.parser = dynptr_key_parse_arg,
+};
+
+static int count_nr_item(const char *name, char *buf, size_t size, unsigned int *nr_items)
+{
+	unsigned int i = 0;
+	FILE *file;
+	int err;
+
+	file = fopen(name, "rb");
+	if (!file) {
+		fprintf(stderr, "open %s err %s\n", name, strerror(errno));
+		return -1;
+	}
+
+	err = 0;
+	while (true) {
+		unsigned int len;
+		char *got;
+
+		got = fgets(buf, size, file);
+		if (!got) {
+			if (!feof(file)) {
+				fprintf(stderr, "read file %s error\n", name);
+				err = -1;
+			}
+			break;
+		}
+
+		len = strlen(got);
+		if (len && got[len - 1] == '\n') {
+			got[len - 1] = 0;
+			len -= 1;
+		}
+		if (!len)
+			continue;
+		i++;
+	}
+	fclose(file);
+
+	if (!err)
+		*nr_items = i;
+
+	return err;
+}
+
+static int parse_data_set(const char *name, struct var_size_str ***set, unsigned int *nr,
+			  unsigned int *max_len)
+{
+#define FILE_DATA_MAX_SIZE 4095
+	unsigned int i, nr_items, item_max_len;
+	char line[FILE_DATA_MAX_SIZE + 1];
+	struct var_size_str **items;
+	struct var_size_str *cur;
+	int err = 0;
+	FILE *file;
+	char *got;
+
+	if (count_nr_item(name, line, sizeof(line), &nr_items))
+		return -1;
+	if (!nr_items) {
+		fprintf(stderr, "empty file ?\n");
+		return -1;
+	}
+	fprintf(stdout, "%u items in %s\n", nr_items, name);
+
+	file = fopen(name, "rb");
+	if (!file) {
+		fprintf(stderr, "open %s err %s\n", name, strerror(errno));
+		return -1;
+	}
+
+	items = (struct var_size_str **)calloc(nr_items, sizeof(*items) + FILE_DATA_MAX_SIZE);
+	if (!items) {
+		fprintf(stderr, "no mem for items\n");
+		err = -1;
+		goto out;
+	}
+
+	i = 0;
+	item_max_len = 0;
+	cur = (void *)items + sizeof(*items) * nr_items;
+	while (true) {
+		unsigned int len;
+
+		got = fgets(line, sizeof(line), file);
+		if (!got) {
+			if (!feof(file)) {
+				fprintf(stderr, "read file %s error\n", name);
+				err = -1;
+			}
+			break;
+		}
+
+		len = strlen(got);
+		if (len && got[len - 1] == '\n') {
+			got[len - 1] = 0;
+			len -= 1;
+		}
+		if (!len)
+			continue;
+
+		if (i >= nr_items) {
+			fprintf(stderr, "too many lines in %s\n", name);
+			break;
+		}
+
+		if (len > item_max_len)
+			item_max_len = len;
+		cur->len = len;
+		memcpy(cur->data, got, len);
+		items[i++] = cur;
+		cur = (void *)cur + FILE_DATA_MAX_SIZE;
+	}
+
+	if (!err) {
+		if (i != nr_items)
+			fprintf(stdout, "fewer lines than expected in %s (exp %u got %u)\n", name, nr_items, i);
+		*nr = i;
+		*set = items;
+		*max_len = item_max_len;
+	} else {
+		free(items);
+	}
+
+out:
+	fclose(file);
+	return err;
+}
+
+static int gen_data_set(unsigned int max_size, struct var_size_str ***set, unsigned int *nr,
+			unsigned int *max_len)
+{
+#define GEN_DATA_MAX_SIZE 4088
+	struct var_size_str **items;
+	size_t ptr_size, data_size;
+	struct var_size_str *cur;
+	unsigned int i, nr_items;
+	size_t left;
+	ssize_t got;
+	int err = 0;
+	void *dst;
+
+	ptr_size = *nr * sizeof(*items);
+	data_size = *nr * (sizeof(*cur) + max_size);
+	items = (struct var_size_str **)malloc(ptr_size + data_size);
+	if (!items) {
+		fprintf(stderr, "no mem for items\n");
+		err = -1;
+		goto out;
+	}
+
+	cur = (void *)items + ptr_size;
+	dst = cur;
+	left = data_size;
+	while (left > 0) {
+		got = syscall(__NR_getrandom, dst, left, 0);
+		if (got <= 0) {
+			fprintf(stderr, "getrandom error %s got %zd\n", strerror(errno), got);
+			err = -1;
+			goto out;
+		}
+		left -= got;
+		dst += got;
+	}
+
+	nr_items = 0;
+	for (i = 0; i < *nr; i++) {
+		cur->len &= (max_size - 1);
+		cur->len += 1;
+		if (cur->len > GEN_DATA_MAX_SIZE)
+			cur->len = GEN_DATA_MAX_SIZE;
+		items[nr_items++] = cur;
+		memset(cur->data + cur->len, 0, max_size - cur->len);
+		cur = (void *)cur + (sizeof(*cur) + max_size);
+	}
+	if (!nr_items) {
+		fprintf(stderr, "no valid key in random data\n");
+		err = -1;
+		goto out;
+	}
+	fprintf(stdout, "generate %u random keys\n", nr_items);
+
+	*nr = nr_items;
+	*set = items;
+	*max_len = max_size <= GEN_DATA_MAX_SIZE ? max_size : GEN_DATA_MAX_SIZE;
+out:
+	if (err && items)
+		free(items);
+	return err;
+}
+
+static inline bool is_pow_of_2(size_t x)
+{
+	return x && (x & (x - 1)) == 0;
+}
+
+static void dynptr_key_validate(void)
+{
+	if (env.consumer_cnt != 0) {
+		fprintf(stderr, "dynptr_key benchmark doesn't support consumer!\n");
+		exit(1);
+	}
+
+	if (!args.file && !args.entries) {
+		fprintf(stderr, "must specify entries when using a randomly generated data set\n");
+		exit(1);
+	}
+
+	if (args.file && access(args.file, R_OK)) {
+		fprintf(stderr, "data file is not accessible\n");
+		exit(1);
+	}
+
+	if (args.entries && !is_pow_of_2(args.max_size)) {
+		fprintf(stderr, "invalid max size %u (should be power-of-two)\n", args.max_size);
+		exit(1);
+	}
+}
+
+static void dynptr_key_init_map_opts(struct dynptr_key_bench *skel, unsigned int data_size,
+				     unsigned int nr)
+{
+	/* The value will be used as the key for hash map */
+	bpf_map__set_value_size(skel->maps.array,
+				offsetof(struct dynkey_key, desc) + data_size);
+	bpf_map__set_max_entries(skel->maps.array, nr);
+
+	bpf_map__set_key_size(skel->maps.htab, offsetof(struct dynkey_key, desc) + data_size);
+	bpf_map__set_max_entries(skel->maps.htab, nr);
+
+	bpf_map__set_map_extra(skel->maps.dynkey_htab, data_size);
+	bpf_map__set_max_entries(skel->maps.dynkey_htab, nr);
+}
+
+static void dynptr_key_setup_key_map(struct bpf_map *map, struct var_size_str **set,
+				     unsigned int nr)
+{
+	int fd = bpf_map__fd(map);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		void *value;
+		int err;
+
+		value = (void *)set[i];
+		err = bpf_map_update_elem(fd, &i, value, 0);
+		if (err) {
+			fprintf(stderr, "add #%u key (%s) on %s error %d\n",
+				i, set[i]->data, bpf_map__name(map), err);
+			exit(1);
+		}
+	}
+}
+
+static u64 dynptr_key_get_slab_mem(int dfd)
+{
+	const char *magic = "slab ";
+	const char *name = "memory.stat";
+	int fd;
+	ssize_t nr;
+	char buf[4096];
+	char *from;
+
+	fd = openat(dfd, name, 0);
+	if (fd < 0) {
+		fprintf(stdout, "no %s (cgroup v1 ?)\n", name);
+		return 0;
+	}
+
+	nr = read(fd, buf, sizeof(buf));
+	if (nr <= 0) {
+		fprintf(stderr, "empty %s ?\n", name);
+		exit(1);
+	}
+	buf[nr - 1] = 0;
+
+	close(fd);
+
+	from = strstr(buf, magic);
+	if (!from) {
+		fprintf(stderr, "no slab in %s\n", name);
+		exit(1);
+	}
+
+	return strtoull(from + strlen(magic), NULL, 10);
+}
+
+static void dynptr_key_setup_lookup_map(struct bpf_map *map, unsigned int map_type,
+					struct var_size_str **set, unsigned int nr)
+{
+	int fd = bpf_map__fd(map);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		struct dynkey_key dynkey;
+		void *key;
+		int err;
+
+		if (map_type == NORM_HTAB) {
+			key = set[i];
+		} else {
+			dynkey.cookie = set[i]->len;
+			bpf_dynptr_user_init(set[i]->data, set[i]->len, &dynkey.desc);
+			key = &dynkey;
+		}
+		/* May have duplicated keys */
+		err = bpf_map_update_elem(fd, key, &i, 0);
+		if (err) {
+			fprintf(stderr, "add #%u key (%s) on %s error %d\n",
+				i, set[i]->data, bpf_map__name(map), err);
+			exit(1);
+		}
+	}
+}
+
+static void dump_data_set_metric(struct var_size_str **set, unsigned int nr)
+{
+	double mean = 0.0, stddev = 0.0;
+	unsigned int max = 0;
+	unsigned int i;
+
+	for (i = 0; i < nr; i++) {
+		if (set[i]->len > max)
+			max = set[i]->len;
+		mean += set[i]->len / (0.0 + nr);
+	}
+
+	if (nr > 1)  {
+		for (i = 0; i < nr; i++)
+			stddev += (mean - set[i]->len) * (mean - set[i]->len) / (nr - 1.0);
+		stddev = sqrt(stddev);
+	}
+
+	fprintf(stdout, "str length: max %u mean %.0f stdev %.0f\n", max, mean, stddev);
+}
+
+static void dynptr_key_setup(unsigned int map_type, const char *prog_name)
+{
+	struct var_size_str **set = NULL;
+	struct dynptr_key_bench *skel;
+	unsigned int nr = 0, max_len = 0;
+	struct bpf_program *prog;
+	struct bpf_link *link;
+	struct bpf_map *map;
+	u64 before, after;
+	int dfd;
+	int err;
+
+	if (!args.file) {
+		nr = args.entries;
+		err = gen_data_set(args.max_size, &set, &nr, &max_len);
+	} else {
+		err = parse_data_set(args.file, &set, &nr, &max_len);
+	}
+	if (err < 0)
+		exit(1);
+
+	if (args.entries && args.entries < nr)
+		nr = args.entries;
+
+	dump_data_set_metric(set, nr);
+
+	dfd = cgroup_setup_and_join("/dynptr_key");
+	if (dfd < 0) {
+		fprintf(stderr, "failed to setup cgroup env\n");
+		goto free_str_set;
+	}
+
+	setup_libbpf();
+
+	before = dynptr_key_get_slab_mem(dfd);
+
+	skel = dynptr_key_bench__open();
+	if (!skel) {
+		fprintf(stderr, "failed to open skeleton\n");
+		goto leave_cgroup;
+	}
+
+	dynptr_key_init_map_opts(skel, max_len, nr);
+
+	skel->rodata->max_dynkey_size = max_len;
+	skel->bss->update_nr = nr;
+	skel->bss->update_chunk = nr / env.producer_cnt;
+
+	prog = bpf_object__find_program_by_name(skel->obj, prog_name);
+	if (!prog) {
+		fprintf(stderr, "no such prog %s\n", prog_name);
+		goto destroy_skel;
+	}
+	bpf_program__set_autoload(prog, true);
+
+	err = dynptr_key_bench__load(skel);
+	if (err) {
+		fprintf(stderr, "failed to load skeleton\n");
+		goto destroy_skel;
+	}
+
+	dynptr_key_setup_key_map(skel->maps.array, set, nr);
+
+	map = (map_type == NORM_HTAB) ? skel->maps.htab : skel->maps.dynkey_htab;
+	dynptr_key_setup_lookup_map(map, map_type, set, nr);
+
+	after = dynptr_key_get_slab_mem(dfd);
+
+	link = bpf_program__attach(prog);
+	if (!link) {
+		fprintf(stderr, "failed to attach %s\n", prog_name);
+		goto destroy_skel;
+	}
+
+	ctx.skel = skel;
+	ctx.cgrp_dfd = dfd;
+	ctx.map_slab_mem = after - before;
+	free(set);
+	return;
+
+destroy_skel:
+	dynptr_key_bench__destroy(skel);
+leave_cgroup:
+	close(dfd);
+	cleanup_cgroup_environment();
+free_str_set:
+	free(set);
+	exit(1);
+}
+
+static void dynkey_htab_lookup_setup(void)
+{
+	dynptr_key_setup(DYNPTR_KEY_HTAB, "dynkey_htab_lookup");
+}
+
+static void norm_htab_lookup_setup(void)
+{
+	dynptr_key_setup(NORM_HTAB, "htab_lookup");
+}
+
+static void dynkey_htab_update_setup(void)
+{
+	dynptr_key_setup(DYNPTR_KEY_HTAB, "dynkey_htab_update");
+}
+
+static void norm_htab_update_setup(void)
+{
+	dynptr_key_setup(NORM_HTAB, "htab_update");
+}
+
+static void *dynptr_key_producer(void *ctx)
+{
+	while (true)
+		(void)syscall(__NR_getpgid);
+	return NULL;
+}
+
+static void dynptr_key_measure(struct bench_res *res)
+{
+	static __u64 last_hits, last_drops;
+	__u64 total_hits = 0, total_drops = 0;
+	unsigned int i, nr_cpus;
+
+	nr_cpus = bpf_num_possible_cpus();
+	for (i = 0; i < nr_cpus; i++) {
+		struct run_stat *s = (void *)&ctx.skel->bss->percpu_stats[i & 255];
+
+		total_hits += s->stats[0];
+		total_drops += s->stats[1];
+	}
+
+	res->hits = total_hits - last_hits;
+	res->drops = total_drops - last_drops;
+
+	last_hits = total_hits;
+	last_drops = total_drops;
+}
+
+static void dynptr_key_report_final(struct bench_res res[], int res_cnt)
+{
+	close(ctx.cgrp_dfd);
+	cleanup_cgroup_environment();
+
+	fprintf(stdout, "Slab: %.3f MiB\n", (float)ctx.map_slab_mem / 1024 / 1024);
+	hits_drops_report_final(res, res_cnt);
+}
+
+const struct bench bench_dynkey_htab_lookup = {
+	.name = "dynkey-htab-lookup",
+	.argp = &bench_dynptr_key_argp,
+	.validate = dynptr_key_validate,
+	.setup = dynkey_htab_lookup_setup,
+	.producer_thread = dynptr_key_producer,
+	.measure = dynptr_key_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = dynptr_key_report_final,
+};
+
+const struct bench bench_norm_htab_lookup = {
+	.name = "norm-htab-lookup",
+	.argp = &bench_dynptr_key_argp,
+	.validate = dynptr_key_validate,
+	.setup = norm_htab_lookup_setup,
+	.producer_thread = dynptr_key_producer,
+	.measure = dynptr_key_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = dynptr_key_report_final,
+};
+
+const struct bench bench_dynkey_htab_update = {
+	.name = "dynkey-htab-update",
+	.argp = &bench_dynptr_key_argp,
+	.validate = dynptr_key_validate,
+	.setup = dynkey_htab_update_setup,
+	.producer_thread = dynptr_key_producer,
+	.measure = dynptr_key_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = dynptr_key_report_final,
+};
+
+const struct bench bench_norm_htab_update = {
+	.name = "norm-htab-update",
+	.argp = &bench_dynptr_key_argp,
+	.validate = dynptr_key_validate,
+	.setup = norm_htab_update_setup,
+	.producer_thread = dynptr_key_producer,
+	.measure = dynptr_key_measure,
+	.report_progress = hits_drops_report_progress,
+	.report_final = dynptr_key_report_final,
+};
diff --git a/tools/testing/selftests/bpf/benchs/run_bench_dynptr_key.sh b/tools/testing/selftests/bpf/benchs/run_bench_dynptr_key.sh
new file mode 100755
index 0000000000000..ec074ce55a363
--- /dev/null
+++ b/tools/testing/selftests/bpf/benchs/run_bench_dynptr_key.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./benchs/run_common.sh
+
+set -eufo pipefail
+
+prod_list=${PROD_LIST:-"1 2 4 8"}
+entries=${ENTRIES:-8192}
+max_size=${MAX_SIZE:-256}
+str_file=${STR_FILE:-}
+
+summarize_rate_and_mem()
+{
+	local bench="$1"
+	local mem=$(echo $2 | grep Slab: | \
+		sed -E "s/.*Slab:\s+([0-9]+\.[0-9]+ MiB).*/\1/")
+	local summary=$(echo $2 | tail -n1)
+
+	printf "%-20s %s (drops %s, mem %s)\n" "$bench" "$(hits $summary)" \
+		"$(drops $summary)" "$mem"
+}
+
+htab_bench()
+{
+	local opts="--entries ${entries} --max_size ${max_size}"
+	local desc="${entries}"
+	local name
+	local prod
+
+	if test -n "${str_file}" && test -f "${str_file}"
+	then
+		opts="--file ${str_file}"
+		desc="${str_file}"
+	fi
+
+	for name in htab-lookup htab-update
+	do
+		for prod in ${prod_list}
+		do
+			summarize_rate_and_mem "${name}-p${prod}-${desc}" \
+				"$($RUN_BENCH -p${prod} ${1}-${name} ${opts})"
+		done
+	done
+}
+
+header "normal hash map"
+htab_bench norm
+
+header "dynptr-keyed hash map"
+htab_bench dynkey
diff --git a/tools/testing/selftests/bpf/progs/dynptr_key_bench.c b/tools/testing/selftests/bpf/progs/dynptr_key_bench.c
new file mode 100644
index 0000000000000..2f3dea926776b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/dynptr_key_bench.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2025. Huawei Technologies Co., Ltd */
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <linux/errno.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+struct bpf_map;
+
+struct dynkey_key {
+	/* Use 8 bytes to prevent unnecessary hole */
+	__u64 cookie;
+	struct bpf_dynptr desc;
+};
+
+struct var_size_key {
+	__u64 len;
+	unsigned char data[];
+};
+
+/* Its value will be used as the key of the hash map. The size of the value
+ * is fixed; however, the first 8 bytes denote the length of the valid data
+ * in the value.
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(key_size, 4);
+} array SEC(".maps");
+
+/* key_size will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(value_size, 4);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} htab SEC(".maps");
+
+/* map_extra will be set by benchmark */
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, struct dynkey_key);
+	__type(value, unsigned int);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+} dynkey_htab SEC(".maps");
+
+char _license[] SEC("license") = "GPL";
+
+struct {
+	__u64 stats[2];
+} __attribute__((__aligned__(256))) percpu_stats[256];
+
+struct update_ctx {
+	unsigned int max;
+	unsigned int from;
+};
+
+volatile const unsigned int max_dynkey_size;
+unsigned int update_nr;
+unsigned int update_chunk;
+
+static __always_inline void update_stats(int idx)
+{
+	__u32 cpu = bpf_get_smp_processor_id();
+
+	percpu_stats[cpu & 255].stats[idx]++;
+}
+
+static int lookup_htab(struct bpf_map *map, __u32 *key, void *value, void *data)
+{
+	__u32 *index;
+
+	index = bpf_map_lookup_elem(&htab, value);
+	if (index && *index == *key)
+		update_stats(0);
+	else
+		update_stats(1);
+	return 0;
+}
+
+static int lookup_dynkey_htab(struct bpf_map *map, __u32 *key, void *value, void *data)
+{
+	struct var_size_key *var_size_key = value;
+	struct dynkey_key dynkey;
+	__u32 *index;
+	__u64 len;
+
+	len = var_size_key->len;
+	if (len > max_dynkey_size)
+		return 0;
+
+	dynkey.cookie = len;
+	bpf_dynptr_from_mem(var_size_key->data, len, 0, &dynkey.desc);
+	index = bpf_map_lookup_elem(&dynkey_htab, &dynkey);
+	if (index && *index == *key)
+		update_stats(0);
+	else
+		update_stats(1);
+	return 0;
+}
+
+static int update_htab_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	void *value;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&array, &update->from);
+	if (!value)
+		return 1;
+
+	err = bpf_map_update_elem(&htab, value, &update->from, 0);
+	if (!err)
+		update_stats(0);
+	else
+		update_stats(1);
+	update->from++;
+
+	return 0;
+}
+
+static int delete_htab_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	void *value;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&array, &update->from);
+	if (!value)
+		return 1;
+
+	err = bpf_map_delete_elem(&htab, value);
+	if (!err)
+		update_stats(0);
+	update->from++;
+
+	return 0;
+}
+
+static int update_dynkey_htab_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	struct var_size_key *value;
+	struct dynkey_key dynkey;
+	__u64 len;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&array, &update->from);
+	if (!value)
+		return 1;
+	len = value->len;
+	if (len > max_dynkey_size)
+		return 1;
+
+	dynkey.cookie = len;
+	bpf_dynptr_from_mem(value->data, len, 0, &dynkey.desc);
+	err = bpf_map_update_elem(&dynkey_htab, &dynkey, &update->from, 0);
+	if (!err)
+		update_stats(0);
+	else
+		update_stats(1);
+	update->from++;
+
+	return 0;
+}
+
+static int delete_dynkey_htab_loop(unsigned int i, void *ctx)
+{
+	struct update_ctx *update = ctx;
+	struct var_size_key *value;
+	struct dynkey_key dynkey;
+	__u64 len;
+	int err;
+
+	if (update->from >= update->max)
+		update->from = 0;
+	value = bpf_map_lookup_elem(&array, &update->from);
+	if (!value)
+		return 1;
+	len = value->len;
+	if (len > max_dynkey_size)
+		return 1;
+
+	dynkey.cookie = len;
+	bpf_dynptr_from_mem(value->data, len, 0, &dynkey.desc);
+	err = bpf_map_delete_elem(&dynkey_htab, &dynkey);
+	if (!err)
+		update_stats(0);
+	update->from++;
+
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int htab_lookup(void *ctx)
+{
+	bpf_for_each_map_elem(&array, lookup_htab, NULL, 0);
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int dynkey_htab_lookup(void *ctx)
+{
+	bpf_for_each_map_elem(&array, lookup_dynkey_htab, NULL, 0);
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int htab_update(void *ctx)
+{
+	unsigned int index = bpf_get_smp_processor_id() * update_chunk;
+	struct update_ctx update;
+
+	update.max = update_nr;
+	if (update.max && index >= update.max)
+		index %= update.max;
+
+	/* Only operate part of keys according to cpu id */
+	update.from = index;
+	bpf_loop(update_chunk, update_htab_loop, &update, 0);
+
+	update.from = index;
+	bpf_loop(update_chunk, delete_htab_loop, &update, 0);
+
+	return 0;
+}
+
+SEC("?tp/syscalls/sys_enter_getpgid")
+int dynkey_htab_update(void *ctx)
+{
+	unsigned int index = bpf_get_smp_processor_id() * update_chunk;
+	struct update_ctx update;
+
+	update.max = update_nr;
+	if (update.max && index >= update.max)
+		index %= update.max;
+
+	/* Only operate part of keys according to cpu id */
+	update.from = index;
+	bpf_loop(update_chunk, update_dynkey_htab_loop, &update, 0);
+
+	update.from = index;
+	bpf_loop(update_chunk, delete_dynkey_htab_loop, &update, 0);
+
+	return 0;
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 01/20] bpf: Add two helpers to facilitate the parsing of bpf_dynptr
  2025-01-25 11:10 ` [PATCH bpf-next v2 01/20] bpf: Add two helpers to facilitate the parsing of bpf_dynptr Hou Tao
@ 2025-02-04 23:17   ` Alexei Starovoitov
  2025-02-05  1:33     ` Hou Tao
  0 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2025-02-04 23:17 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

On Sat, Jan 25, 2025 at 10:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> Add BPF_DYNPTR in btf_field_type to support bpf_dynptr in map key. The
> parsing of bpf_dynptr in btf will be done in the following patch, and
> the patch only adds two helpers: btf_new_bpf_dynptr_record() creates an
> btf record which only includes a bpf_dynptr and btf_type_is_dynptr()
> checks whether the btf_type is a bpf_dynptr or not.
>
> With the introduction of BPF_DYNPTR, BTF_FIELDS_MAX is changed from 11
> to 13, therefore, update the hard-coded number in cpumask test as well.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  include/linux/bpf.h                           |  5 ++-
>  include/linux/btf.h                           |  2 +
>  kernel/bpf/btf.c                              | 42 ++++++++++++++++---
>  .../selftests/bpf/progs/cpumask_common.h      |  2 +-
>  4 files changed, 43 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index feda0ce90f5a3..0ee14ae30100f 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -184,8 +184,8 @@ struct bpf_map_ops {
>  };
>
>  enum {
> -       /* Support at most 11 fields in a BTF type */
> -       BTF_FIELDS_MAX     = 11,
> +       /* Support at most 13 fields in a BTF type */
> +       BTF_FIELDS_MAX     = 13,

BTF_FIELDS_MAX doesn't need to be incremented when btf_field_type
learns about a new type.
The number of fields per map value is independent
of the number of types that the verifier recognizes.
The patch that incremented it last time slipped through
by accident.
Do you really need to increase it?
If so, why 13 and not 32 ?

pw-bot: cr

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 01/20] bpf: Add two helpers to facilitate the parsing of bpf_dynptr
  2025-02-04 23:17   ` Alexei Starovoitov
@ 2025-02-05  1:33     ` Hou Tao
  0 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-02-05  1:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

Hi,

On 2/5/2025 7:17 AM, Alexei Starovoitov wrote:
> On Sat, Jan 25, 2025 at 10:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> Add BPF_DYNPTR in btf_field_type to support bpf_dynptr in map key. The
>> parsing of bpf_dynptr in btf will be done in the following patch, and
>> the patch only adds two helpers: btf_new_bpf_dynptr_record() creates an
>> btf record which only includes a bpf_dynptr and btf_type_is_dynptr()
>> checks whether the btf_type is a bpf_dynptr or not.
>>
>> With the introduction of BPF_DYNPTR, BTF_FIELDS_MAX is changed from 11
>> to 13, therefore, update the hard-coded number in cpumask test as well.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  include/linux/bpf.h                           |  5 ++-
>>  include/linux/btf.h                           |  2 +
>>  kernel/bpf/btf.c                              | 42 ++++++++++++++++---
>>  .../selftests/bpf/progs/cpumask_common.h      |  2 +-
>>  4 files changed, 43 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index feda0ce90f5a3..0ee14ae30100f 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -184,8 +184,8 @@ struct bpf_map_ops {
>>  };
>>
>>  enum {
>> -       /* Support at most 11 fields in a BTF type */
>> -       BTF_FIELDS_MAX     = 11,
>> +       /* Support at most 13 fields in a BTF type */
>> +       BTF_FIELDS_MAX     = 13,
> BTF_FIELDS_MAX doesn't need to be incremented when btf_field_type
> learns about a new type.
> The number of fields per map value is independent
> of the number of types that the verifier recognizes.
> The patch that incremented it last time slipped through
> by accident.
> Do you really need to increase it?
> If so, why 13 and not 32 ?

I see. There is no need to increase it for the current patch set. The
original idea was that the parsing needs to support all of these special
fields in the map key/value, but it is unnecessary. Will remove it.
> pw-bot: cr


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 02/20] bpf: Parse bpf_dynptr in map key
  2025-01-25 11:10 ` [PATCH bpf-next v2 02/20] bpf: Parse bpf_dynptr in map key Hou Tao
@ 2025-02-13 17:59   ` Alexei Starovoitov
  2025-02-14  4:04     ` Hou Tao
  0 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2025-02-13 17:59 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> To support variable-length key or strings in map key, use bpf_dynptr to
> represent these variable-length objects and save these bpf_dynptr
> fields in the map key. As shown in the examples below, a map key with an
> integer and a string is defined:
>
>         struct pid_name {
>                 int pid;
>                 struct bpf_dynptr name;
>         };
>
> The bpf_dynptr in the map key could also be contained indirectly in a
> struct as shown below:
>
>         struct pid_name_time {
>                 struct pid_name process;
>                 unsigned long long time;
>         };
>
> If the whole map key is a bpf_dynptr, the map could be defined as a
> struct or directly using bpf_dynptr as the map key:
>
>         struct map_key {
>                 struct bpf_dynptr name;
>         };
>
> The bpf program could use bpf_dynptr_init() to initialize the dynptr
> part in the map key, and the userspace application will use
> bpf_dynptr_user_init() or similar API to initialize the dynptr. Just
> like kptrs in map value, the bpf_dynptr field in the map key could also
> be defined in a nested struct which is contained in the map key struct.
>
> The patch updates map_create() accordingly to parse these bpf_dynptr
> fields in map key, just like it does for other special fields in map
> value. To enable bpf_dynptr support in map key, the map_type should be
> BPF_MAP_TYPE_HASH. For now, the max number of bpf_dynptr in a map key
> is limited as 1 and the limitation can be relaxed later.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  include/linux/bpf.h     | 14 ++++++++++++++
>  kernel/bpf/btf.c        |  4 ++++
>  kernel/bpf/map_in_map.c | 21 +++++++++++++++++----
>  kernel/bpf/syscall.c    | 41 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 76 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 0ee14ae30100f..ed58d5dd6b34b 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -271,7 +271,14 @@ struct bpf_map {
>         u64 map_extra; /* any per-map-type extra fields */
>         u32 map_flags;
>         u32 id;
> +       /* BTF record for special fields in map value. bpf_dynptr is disallowed
> +        * at present.
> +        */

Maybe drop 'at present' to fit on one line.
I would also capitalize Value to make the difference more obvious...

>         struct btf_record *record;
> +       /* BTF record for special fields in map key. Only bpf_dynptr is allowed
> +        * at present.

...with this line. Key.

> +        */
> +       struct btf_record *key_record;
>         int numa_node;
>         u32 btf_key_type_id;
>         u32 btf_value_type_id;
> @@ -336,6 +343,8 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
>                 return "bpf_rb_node";
>         case BPF_REFCOUNT:
>                 return "bpf_refcount";
> +       case BPF_DYNPTR:
> +               return "bpf_dynptr";
>         default:
>                 WARN_ON_ONCE(1);
>                 return "unknown";
> @@ -366,6 +375,8 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
>                 return sizeof(struct bpf_rb_node);
>         case BPF_REFCOUNT:
>                 return sizeof(struct bpf_refcount);
> +       case BPF_DYNPTR:
> +               return sizeof(struct bpf_dynptr);
>         default:
>                 WARN_ON_ONCE(1);
>                 return 0;
> @@ -396,6 +407,8 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
>                 return __alignof__(struct bpf_rb_node);
>         case BPF_REFCOUNT:
>                 return __alignof__(struct bpf_refcount);
> +       case BPF_DYNPTR:
> +               return __alignof__(struct bpf_dynptr);
>         default:
>                 WARN_ON_ONCE(1);
>                 return 0;
> @@ -426,6 +439,7 @@ static inline void bpf_obj_init_field(const struct btf_field *field, void *addr)
>         case BPF_KPTR_REF:
>         case BPF_KPTR_PERCPU:
>         case BPF_UPTR:
> +       case BPF_DYNPTR:
>                 break;
>         default:
>                 WARN_ON_ONCE(1);
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index b316631b614fa..0ce5180e024a3 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3500,6 +3500,7 @@ static int btf_get_field_type(const struct btf *btf, const struct btf_type *var_
>         field_mask_test_name(BPF_RB_ROOT,   "bpf_rb_root");
>         field_mask_test_name(BPF_RB_NODE,   "bpf_rb_node");
>         field_mask_test_name(BPF_REFCOUNT,  "bpf_refcount");
> +       field_mask_test_name(BPF_DYNPTR,    "bpf_dynptr");
>
>         /* Only return BPF_KPTR when all other types with matchable names fail */
>         if (field_mask & (BPF_KPTR | BPF_UPTR) && !__btf_type_is_struct(var_type)) {
> @@ -3538,6 +3539,7 @@ static int btf_repeat_fields(struct btf_field_info *info, int info_cnt,
>                 case BPF_UPTR:
>                 case BPF_LIST_HEAD:
>                 case BPF_RB_ROOT:
> +               case BPF_DYNPTR:
>                         break;
>                 default:
>                         return -EINVAL;
> @@ -3660,6 +3662,7 @@ static int btf_find_field_one(const struct btf *btf,
>         case BPF_LIST_NODE:
>         case BPF_RB_NODE:
>         case BPF_REFCOUNT:
> +       case BPF_DYNPTR:
>                 ret = btf_find_struct(btf, var_type, off, sz, field_type,
>                                       info_cnt ? &info[0] : &tmp);
>                 if (ret < 0)
> @@ -4017,6 +4020,7 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
>                         break;
>                 case BPF_LIST_NODE:
>                 case BPF_RB_NODE:
> +               case BPF_DYNPTR:
>                         break;
>                 default:
>                         ret = -EFAULT;
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index 645bd30bc9a9d..564ebcc857564 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -12,6 +12,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>         struct bpf_map *inner_map, *inner_map_meta;
>         u32 inner_map_meta_size;
>         CLASS(fd, f)(inner_map_ufd);
> +       int ret;
>
>         inner_map = __bpf_map_get(f);
>         if (IS_ERR(inner_map))
> @@ -45,10 +46,15 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>                  * invalid/empty/valid, but ERR_PTR in case of errors. During
>                  * equality NULL or IS_ERR is equivalent.
>                  */
> -               struct bpf_map *ret = ERR_CAST(inner_map_meta->record);
> -               kfree(inner_map_meta);
> -               return ret;
> +               ret = PTR_ERR(inner_map_meta->record);
> +               goto free_meta;
>         }
> +       inner_map_meta->key_record = btf_record_dup(inner_map->key_record);
> +       if (IS_ERR(inner_map_meta->key_record)) {
> +               ret = PTR_ERR(inner_map_meta->key_record);
> +               goto free_record;
> +       }
> +
>         /* Note: We must use the same BTF, as we also used btf_record_dup above
>          * which relies on BTF being same for both maps, as some members like
>          * record->fields.list_head have pointers like value_rec pointing into
> @@ -71,6 +77,12 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>                 inner_map_meta->bypass_spec_v1 = inner_map->bypass_spec_v1;
>         }
>         return inner_map_meta;
> +
> +free_record:
> +       btf_record_free(inner_map_meta->record);
> +free_meta:
> +       kfree(inner_map_meta);
> +       return ERR_PTR(ret);
>  }
>
>  void bpf_map_meta_free(struct bpf_map *map_meta)
> @@ -88,7 +100,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
>                 meta0->key_size == meta1->key_size &&
>                 meta0->value_size == meta1->value_size &&
>                 meta0->map_flags == meta1->map_flags &&
> -               btf_record_equal(meta0->record, meta1->record);
> +               btf_record_equal(meta0->record, meta1->record) &&
> +               btf_record_equal(meta0->key_record, meta1->key_record);
>  }
>
>  void *bpf_map_fd_get_ptr(struct bpf_map *map,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0daf098e32074..6e14208cca813 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -651,6 +651,7 @@ void btf_record_free(struct btf_record *rec)
>                 case BPF_TIMER:
>                 case BPF_REFCOUNT:
>                 case BPF_WORKQUEUE:
> +               case BPF_DYNPTR:
>                         /* Nothing to release */
>                         break;
>                 default:
> @@ -664,7 +665,9 @@ void btf_record_free(struct btf_record *rec)
>  void bpf_map_free_record(struct bpf_map *map)
>  {
>         btf_record_free(map->record);
> +       btf_record_free(map->key_record);
>         map->record = NULL;
> +       map->key_record = NULL;
>  }
>
>  struct btf_record *btf_record_dup(const struct btf_record *rec)
> @@ -703,6 +706,7 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
>                 case BPF_TIMER:
>                 case BPF_REFCOUNT:
>                 case BPF_WORKQUEUE:
> +               case BPF_DYNPTR:
>                         /* Nothing to acquire */
>                         break;
>                 default:
> @@ -821,6 +825,8 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>                 case BPF_RB_NODE:
>                 case BPF_REFCOUNT:
>                         break;
> +               case BPF_DYNPTR:
> +                       break;
>                 default:
>                         WARN_ON_ONCE(1);
>                         continue;
> @@ -830,6 +836,7 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>
>  static void bpf_map_free(struct bpf_map *map)
>  {
> +       struct btf_record *key_rec = map->key_record;
>         struct btf_record *rec = map->record;
>         struct btf *btf = map->btf;
>
> @@ -850,6 +857,7 @@ static void bpf_map_free(struct bpf_map *map)
>          * eventually calls bpf_map_free_meta, since inner_map_meta is only a
>          * template bpf_map struct used during verification.
>          */
> +       btf_record_free(key_rec);
>         btf_record_free(rec);
>         /* Delay freeing of btf for maps, as map_free callback may need
>          * struct_meta info which will be freed with btf_put().
> @@ -1180,6 +1188,8 @@ int map_check_no_btf(const struct bpf_map *map,
>         return -ENOTSUPP;
>  }
>
> +#define MAX_DYNPTR_CNT_IN_MAP_KEY 1

I remember we discussed allowing 2 dynptr-s in a key.
And in patch 11 you already do:
+       record = map->key_record;
+       for (i = 0; i < record->cnt; i++) {

so the support for multiple dynptr-s is almost there?

> +
>  static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
>                          const struct btf *btf, u32 btf_key_id, u32 btf_value_id)
>  {
> @@ -1202,6 +1212,37 @@ static int map_check_btf(struct bpf_map *map, struct bpf_token *token,
>         if (!value_type || value_size != map->value_size)
>                 return -EINVAL;
>
> +       /* Key BTF type can't be data section */
> +       if (btf_type_is_dynptr(btf, key_type))
> +               map->key_record = btf_new_bpf_dynptr_record();
> +       else if (__btf_type_is_struct(key_type))
> +               map->key_record = btf_parse_fields(btf, key_type, BPF_DYNPTR, map->key_size);
> +       else
> +               map->key_record = NULL;
> +       if (!IS_ERR_OR_NULL(map->key_record)) {
> +               if (map->key_record->cnt > MAX_DYNPTR_CNT_IN_MAP_KEY) {
> +                       ret = -E2BIG;
> +                       goto free_map_tab;
> +               }
> +               if (map->map_type != BPF_MAP_TYPE_HASH) {
> +                       ret = -EOPNOTSUPP;
> +                       goto free_map_tab;
> +               }
> +               if (!bpf_token_capable(token, CAP_BPF)) {
> +                       ret = -EPERM;
> +                       goto free_map_tab;
> +               }
> +               /* Disallow key with dynptr for special map */
> +               if (map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG)) {
> +                       ret = -EACCES;
> +                       goto free_map_tab;
> +               }
> +       } else if (IS_ERR(map->key_record)) {
> +               /* Return an error early even the bpf program doesn't use it */
> +               ret = PTR_ERR(map->key_record);
> +               goto free_map_tab;
> +       }
> +
>         map->record = btf_parse_fields(btf, value_type,
>                                        BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
>                                        BPF_RB_ROOT | BPF_REFCOUNT | BPF_WORKQUEUE | BPF_UPTR,
> --
> 2.29.2
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 07/20] bpf: Use map_extra to indicate the max data size of dynptrs in map key
  2025-01-25 11:10 ` [PATCH bpf-next v2 07/20] bpf: Use map_extra to indicate the max data size of dynptrs in map key Hou Tao
@ 2025-02-13 18:02   ` Alexei Starovoitov
  2025-02-14  6:13     ` Hou Tao
  0 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2025-02-13 18:02 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> For map with dynptr key support, it needs to use map_extra to specify
> the maximum data length of these dynptrs. The implementation of the map
> will check whether map_extra is smaller than the limitation imposed by
> memory allocation during map creation. It may also use map_extra to
> optimize the memory allocation for dynptr.

Why limit it?
The only piece of code I could find is:

uptr->size > map->map_extra

and it doesn't look necessary.
Let it consume whatever necessary ?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  2025-01-25 11:10 ` [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally Hou Tao
@ 2025-02-13 23:56   ` Alexei Starovoitov
  2025-02-14  4:12     ` Hou Tao
  0 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2025-02-13 23:56 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> When there is bpf_dynptr field in the map key btf type or the map key
> btf type is bpf_dyntr, set BPF_INT_F_DYNPTR_IN_KEY in map_flags.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  kernel/bpf/syscall.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 07c67ad1a6a07..46b96d062d2db 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1360,6 +1360,34 @@ static struct btf *get_map_btf(int btf_fd)
>         return btf;
>  }
>
> +static int map_has_dynptr_in_key_type(struct btf *btf, u32 btf_key_id, u32 key_size)
> +{
> +       const struct btf_type *type;
> +       struct btf_record *record;
> +       u32 btf_key_size;
> +
> +       if (!btf_key_id)
> +               return 0;
> +
> +       type = btf_type_id_size(btf, &btf_key_id, &btf_key_size);
> +       if (!type || btf_key_size != key_size)
> +               return -EINVAL;
> +
> +       /* For dynptr key, key BTF type must be struct */
> +       if (!__btf_type_is_struct(type))
> +               return 0;
> +
> +       if (btf_type_is_dynptr(btf, type))
> +               return 1;
> +
> +       record = btf_parse_fields(btf, type, BPF_DYNPTR, key_size);
> +       if (IS_ERR(record))
> +               return PTR_ERR(record);
> +
> +       btf_record_free(record);
> +       return !!record;
> +}
> +
>  #define BPF_MAP_CREATE_LAST_FIELD map_token_fd
>  /* called via syscall */
>  static int map_create(union bpf_attr *attr)
> @@ -1398,6 +1426,14 @@ static int map_create(union bpf_attr *attr)
>                 btf = get_map_btf(attr->btf_fd);
>                 if (IS_ERR(btf))
>                         return PTR_ERR(btf);
> +
> +               err = map_has_dynptr_in_key_type(btf, attr->btf_key_type_id, attr->key_size);
> +               if (err < 0)
> +                       goto put_btf;
> +               if (err > 0) {
> +                       attr->map_flags |= BPF_INT_F_DYNPTR_IN_KEY;

I don't like this inband signaling in the uapi field.
The whole refactoring in patch 4 to do patch 6 and
subsequent bpf_map_has_dynptr_key() in various places
feels like reinventing the wheel.

We already have map_check_btf() mechanism that works for
existing special fields inside BTF.
Please use it.

map_has_dynptr_in_key_type() can be done in map_check_btf()
after map is created, no ?
Then when it passes map->map_type check set a bool inside
struct bpf_map, so that bpf_map_has_dynptr_key() can be fast
in the critical path of hashtab.
Or better yet use:
static inline bool bpf_map_has_dynptr_key(const struct bpf_map *map)
{
  /* key_record is not NULL when the map key contains bpf_dynptr_user */
  return !!map->key_record;
}
since htab_map_hash() has to read key_record anyway,
hence better D$ access.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 10/20] bpf: Introduce bpf_dynptr_user
  2025-01-25 11:10 ` [PATCH bpf-next v2 10/20] bpf: Introduce bpf_dynptr_user Hou Tao
@ 2025-02-14  0:13   ` Alexei Starovoitov
  2025-02-14  7:03     ` Hou Tao
  0 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2025-02-14  0:13 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> From: Hou Tao <houtao1@huawei.com>
>
> For bpf map with dynptr key support, the userspace application will use
> bpf_dynptr_user to represent the bpf_dynptr in the map key and pass it
> to bpf syscall. The bpf syscall will copy from bpf_dynptr_user to
> construct a corresponding bpf_dynptr_kern object when the map key is an
> input argument, and copy to bpf_dynptr_user from a bpf_dynptr_kern
> object when the map key is an output argument.
>
> For now the size of bpf_dynptr_user must be the same as bpf_dynptr, but
> the last u32 field is not used, so make it a reserved field.
>
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
>  include/uapi/linux/bpf.h       | 6 ++++++
>  tools/include/uapi/linux/bpf.h | 6 ++++++
>  2 files changed, 12 insertions(+)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2acf9b3363717..7d96685513c55 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7335,6 +7335,12 @@ struct bpf_dynptr {
>         __u64 __opaque[2];
>  } __attribute__((aligned(8)));
>
> +struct bpf_dynptr_user {
> +       __bpf_md_ptr(void *, data);
> +       __u32 size;
> +       __u32 reserved;
> +} __attribute__((aligned(8)));

Pls add a comment explaining that bpf_dynptr_user is for user space only
and bpf progs should continue using bpf_dynptr.
Maybe give an example that, to use bpf_dynptr in a map key,
the bpf prog should write:

+struct mixed_dynptr_key {
+ int id;
+ struct bpf_dynptr name;
+};

while to access that map the user space should write:

+struct id_dname_key {
+ int id;
+ struct bpf_dynptr_user name;
+};

tbh the api is kinda ugly, since in the past we always had user space
and bpf prog reuse the same struct names.
Here the top struct names have to be different,
but have to have the same layout.

Maybe let's try the following:
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fff6cdb8d11a..55d225961dbf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -7335,7 +7335,14 @@ struct bpf_wq {
 } __attribute__((aligned(8)));

 struct bpf_dynptr {
+       union {
        __u64 __opaque[2];
+       struct {
+               __bpf_md_ptr(void *, data);
+               __u32 size;
+               __u32 reserved;
+       };
+       };
 } __attribute__((aligned(8)));

Then bpf prog and user space can use the same key type.
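
Purely as an illustration of that suggestion (not code from the series),
user space could then fill the key through the {data,size} members while
BPF programs keep treating the field as opaque. A minimal, hypothetical
sketch, assuming the union above is applied and using libbpf's low-level
bpf_map_lookup_elem() wrapper (struct and helper names are made up):

	#include <string.h>
	#include <linux/bpf.h>  /* struct bpf_dynptr with the proposed union */
	#include <bpf/bpf.h>    /* bpf_map_lookup_elem() syscall wrapper */

	struct pid_name {
		int pid;
		struct bpf_dynptr name;
	};

	static int lookup_by_pid_name(int map_fd, int pid, char *name, __u32 *value)
	{
		struct pid_name key;

		memset(&key, 0, sizeof(key));     /* clears padding and 'reserved' */
		key.pid = pid;
		key.name.data = name;             /* caller-owned buffer */
		key.name.size = strlen(name) + 1;
		return bpf_map_lookup_elem(map_fd, &key, value);
	}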

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 02/20] bpf: Parse bpf_dynptr in map key
  2025-02-13 17:59   ` Alexei Starovoitov
@ 2025-02-14  4:04     ` Hou Tao
  0 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-02-14  4:04 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

Hi,

On 2/14/2025 1:59 AM, Alexei Starovoitov wrote:
> On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> To support variable-length key or strings in map key, use bpf_dynptr to
>> represent these variable-length objects and save these bpf_dynptr
>> fields in the map key. As shown in the examples below, a map key with an
>> integer and a string is defined:

SNIP
>> @@ -271,7 +271,14 @@ struct bpf_map {
>>         u64 map_extra; /* any per-map-type extra fields */
>>         u32 map_flags;
>>         u32 id;
>> +       /* BTF record for special fields in map value. bpf_dynptr is disallowed
>> +        * at present.
>> +        */
> Maybe drop 'at present' to fit on one line.
> I would also capitalize Value to make the difference more obvious...

Will do.
>
>>         struct btf_record *record;
>> +       /* BTF record for special fields in map key. Only bpf_dynptr is allowed
>> +        * at present.
> ...with this line. Key.

Will do.
>
>> +

SNIP
>> +       btf_record_free(key_rec);
>>         btf_record_free(rec);
>>         /* Delay freeing of btf for maps, as map_free callback may need
>>          * struct_meta info which will be freed with btf_put().
>> @@ -1180,6 +1188,8 @@ int map_check_no_btf(const struct bpf_map *map,
>>         return -ENOTSUPP;
>>  }
>>
>> +#define MAX_DYNPTR_CNT_IN_MAP_KEY 1
> I remember we discussed allowing 2 dynptr-s in a key.
> And in patch 11 you already do:
> +       record = map->key_record;
> +       for (i = 0; i < record->cnt; i++) {
>
> so the support for multiple dynptr-s is almost there?

I misunderstood the discussion. However, the change is simple. It only
needs to change from 1 to 2 because the following patches have already
supported multiple dynptrs.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  2025-02-13 23:56   ` Alexei Starovoitov
@ 2025-02-14  4:12     ` Hou Tao
  2025-02-14  4:17       ` Alexei Starovoitov
  0 siblings, 1 reply; 39+ messages in thread
From: Hou Tao @ 2025-02-14  4:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

Hi,

On 2/14/2025 7:56 AM, Alexei Starovoitov wrote:
> On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> When there is bpf_dynptr field in the map key btf type or the map key
>> btf type is bpf_dyntr, set BPF_INT_F_DYNPTR_IN_KEY in map_flags.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  kernel/bpf/syscall.c | 36 ++++++++++++++++++++++++++++++++++++
>>  1 file changed, 36 insertions(+)
>>
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 07c67ad1a6a07..46b96d062d2db 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -1360,6 +1360,34 @@ static struct btf *get_map_btf(int btf_fd)
>>         return btf;
>>  }
>>
>> +static int map_has_dynptr_in_key_type(struct btf *btf, u32 btf_key_id, u32 key_size)
>> +{
>> +       const struct btf_type *type;
>> +       struct btf_record *record;
>> +       u32 btf_key_size;
>> +
>> +       if (!btf_key_id)
>> +               return 0;
>> +
>> +       type = btf_type_id_size(btf, &btf_key_id, &btf_key_size);
>> +       if (!type || btf_key_size != key_size)
>> +               return -EINVAL;
>> +
>> +       /* For dynptr key, key BTF type must be struct */
>> +       if (!__btf_type_is_struct(type))
>> +               return 0;
>> +
>> +       if (btf_type_is_dynptr(btf, type))
>> +               return 1;
>> +
>> +       record = btf_parse_fields(btf, type, BPF_DYNPTR, key_size);
>> +       if (IS_ERR(record))
>> +               return PTR_ERR(record);
>> +
>> +       btf_record_free(record);
>> +       return !!record;
>> +}
>> +
>>  #define BPF_MAP_CREATE_LAST_FIELD map_token_fd
>>  /* called via syscall */
>>  static int map_create(union bpf_attr *attr)
>> @@ -1398,6 +1426,14 @@ static int map_create(union bpf_attr *attr)
>>                 btf = get_map_btf(attr->btf_fd);
>>                 if (IS_ERR(btf))
>>                         return PTR_ERR(btf);
>> +
>> +               err = map_has_dynptr_in_key_type(btf, attr->btf_key_type_id, attr->key_size);
>> +               if (err < 0)
>> +                       goto put_btf;
>> +               if (err > 0) {
>> +                       attr->map_flags |= BPF_INT_F_DYNPTR_IN_KEY;
> I don't like this inband signaling in the uapi field.
> The whole refactoring in patch 4 to do patch 6 and
> subsequent bpf_map_has_dynptr_key() in various places
> feels like reinventing the wheel.
>
> We already have map_check_btf() mechanism that works for
> existing special fields inside BTF.
> Please use it.

Yes. However map->key_record is only available after the map is created,
but the creation of hash map needs to check it before the map is
created. Instead of using an internal flag, how about adding extra
argument for both ->map_alloc_check() and ->map_alloc() as proposed in
the commit message of the previous patch ?
>
> map_has_dynptr_in_key_type() can be done in map_check_btf()
> after map is created, no ?

No. both ->map_alloc_check() and ->map_alloc() need to know whether
dynptr is enabled (as explained in the previous commit message). Both of
these functions are called before the map is created.
> Then when it passes map->map_type check set a bool inside
> struct bpf_map, so that bpf_map_has_dynptr_key() can be fast
> in the critical path of hashtab.
> Or better yet use:
> static inline bool bpf_map_has_dynptr_key(const struct bpf_map *map)
> {
>   /* key_record is not NULL when the map key contains bpf_dynptr_user */
>   return !!map->key_record;
> }
> since htab_map_hash() has to read key_record anyway,
> hence better D$ access.
> .


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  2025-02-14  4:12     ` Hou Tao
@ 2025-02-14  4:17       ` Alexei Starovoitov
  2025-02-14  6:49         ` Hou Tao
  0 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2025-02-14  4:17 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

On Thu, Feb 13, 2025 at 8:12 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/14/2025 7:56 AM, Alexei Starovoitov wrote:
> > On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> When there is bpf_dynptr field in the map key btf type or the map key
> >> btf type is bpf_dyntr, set BPF_INT_F_DYNPTR_IN_KEY in map_flags.
> >>
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> >>  kernel/bpf/syscall.c | 36 ++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 36 insertions(+)
> >>
> >> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >> index 07c67ad1a6a07..46b96d062d2db 100644
> >> --- a/kernel/bpf/syscall.c
> >> +++ b/kernel/bpf/syscall.c
> >> @@ -1360,6 +1360,34 @@ static struct btf *get_map_btf(int btf_fd)
> >>         return btf;
> >>  }
> >>
> >> +static int map_has_dynptr_in_key_type(struct btf *btf, u32 btf_key_id, u32 key_size)
> >> +{
> >> +       const struct btf_type *type;
> >> +       struct btf_record *record;
> >> +       u32 btf_key_size;
> >> +
> >> +       if (!btf_key_id)
> >> +               return 0;
> >> +
> >> +       type = btf_type_id_size(btf, &btf_key_id, &btf_key_size);
> >> +       if (!type || btf_key_size != key_size)
> >> +               return -EINVAL;
> >> +
> >> +       /* For dynptr key, key BTF type must be struct */
> >> +       if (!__btf_type_is_struct(type))
> >> +               return 0;
> >> +
> >> +       if (btf_type_is_dynptr(btf, type))
> >> +               return 1;
> >> +
> >> +       record = btf_parse_fields(btf, type, BPF_DYNPTR, key_size);
> >> +       if (IS_ERR(record))
> >> +               return PTR_ERR(record);
> >> +
> >> +       btf_record_free(record);
> >> +       return !!record;
> >> +}
> >> +
> >>  #define BPF_MAP_CREATE_LAST_FIELD map_token_fd
> >>  /* called via syscall */
> >>  static int map_create(union bpf_attr *attr)
> >> @@ -1398,6 +1426,14 @@ static int map_create(union bpf_attr *attr)
> >>                 btf = get_map_btf(attr->btf_fd);
> >>                 if (IS_ERR(btf))
> >>                         return PTR_ERR(btf);
> >> +
> >> +               err = map_has_dynptr_in_key_type(btf, attr->btf_key_type_id, attr->key_size);
> >> +               if (err < 0)
> >> +                       goto put_btf;
> >> +               if (err > 0) {
> >> +                       attr->map_flags |= BPF_INT_F_DYNPTR_IN_KEY;
> > I don't like this inband signaling in the uapi field.
> > The whole refactoring in patch 4 to do patch 6 and
> > subsequent bpf_map_has_dynptr_key() in various places
> > feels like reinventing the wheel.
> >
> > We already have map_check_btf() mechanism that works for
> > existing special fields inside BTF.
> > Please use it.
>
> Yes. However map->key_record is only available after the map is created,
> but the creation of hash map needs to check it before the map is
> created. Instead of using an internal flag, how about adding extra
> argument for both ->map_alloc_check() and ->map_alloc() as proposed in
> the commit message of the previous patch ?
> >
> > map_has_dynptr_in_key_type() can be done in map_check_btf()
> > after map is created, no ?
>
> No. both ->map_alloc_check() and ->map_alloc() need to know whether
> dynptr is enabled (as explained in the previous commit message). Both of
> these functions are called before the map is created.

Is that the explanation?
"
The reason for an internal map flag is twofolds:
1) user doesn't need to set the map flag explicitly
map_create() will use the presence of bpf_dynptr in map key as an
indicator of enabling dynptr key.
2) avoid adding new arguments for ->map_alloc_check() and ->map_alloc()
map_create() needs to pass the supported status of dynptr key to
->map_alloc_check (e.g., check the maximum length of dynptr data size)
and ->map_alloc (e.g., check whether dynptr key fits current map type).
Adding new arguments for these callbacks to achieve that will introduce
too much churns.

Therefore, the patch uses the topmost bit of map_flags as the internal
map flag. map_create() checks whether the internal flag is set in the
beginning and bpf_map_get_info_by_fd() clears the internal flag before
returns the map flags to userspace.
"

As commented in the other patch map_extra can be dropped (I hope).
When it's gone, the map can be destroyed after creation in map_check_btf().
What am I missing?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 07/20] bpf: Use map_extra to indicate the max data size of dynptrs in map key
  2025-02-13 18:02   ` Alexei Starovoitov
@ 2025-02-14  6:13     ` Hou Tao
  2025-02-14 15:57       ` Alexei Starovoitov
  0 siblings, 1 reply; 39+ messages in thread
From: Hou Tao @ 2025-02-14  6:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

Hi,

On 2/14/2025 2:02 AM, Alexei Starovoitov wrote:
> On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> For map with dynptr key support, it needs to use map_extra to specify
>> the maximum data length of these dynptrs. The implementation of the map
>> will check whether map_extra is smaller than the limitation imposed by
>> memory allocation during map creation. It may also use map_extra to
>> optimize the memory allocation for dynptr.
> Why limit it?
> The only piece of code I could find is:
>
> uptr->size > map->map_extra
>
> and it doesn't look necessary.
> Let it consume whatever necessary ?
> .

It will be usable when trying to iterate keys through ->get_next_key()
in the kernel (e.g., to support map_seq_show_elem in v3), because for now
the data memory for the dynptr in the map key is allocated by the caller
(because the callee holds an RCU read lock). If the max length of dynptr
data is unknown, map_iter_alloc()/map_seq_next() may need some logic to
probe the max length of dynptr data during the traversal of keys. Will
check whether or not it is feasible in v3.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  2025-02-14  4:17       ` Alexei Starovoitov
@ 2025-02-14  6:49         ` Hou Tao
  2025-02-14  7:25           ` Hou Tao
  0 siblings, 1 reply; 39+ messages in thread
From: Hou Tao @ 2025-02-14  6:49 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

Hi,

On 2/14/2025 12:17 PM, Alexei Starovoitov wrote:
> On Thu, Feb 13, 2025 at 8:12 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 2/14/2025 7:56 AM, Alexei Starovoitov wrote:
>>> On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> From: Hou Tao <houtao1@huawei.com>
>>>>
>>>> When there is bpf_dynptr field in the map key btf type or the map key
>>>> btf type is bpf_dyntr, set BPF_INT_F_DYNPTR_IN_KEY in map_flags.
>>>>
>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>>> ---
>>>>  kernel/bpf/syscall.c | 36 ++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 36 insertions(+)
>>>>
>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>> index 07c67ad1a6a07..46b96d062d2db 100644
>>>> --- a/kernel/bpf/syscall.c
>>>> +++ b/kernel/bpf/syscall.c
>>>> @@ -1360,6 +1360,34 @@ static struct btf *get_map_btf(int btf_fd)
>>>>         return btf;
>>>>  }
>>>>

SNIP
>>>>  #define BPF_MAP_CREATE_LAST_FIELD map_token_fd
>>>>  /* called via syscall */
>>>>  static int map_create(union bpf_attr *attr)
>>>> @@ -1398,6 +1426,14 @@ static int map_create(union bpf_attr *attr)
>>>>                 btf = get_map_btf(attr->btf_fd);
>>>>                 if (IS_ERR(btf))
>>>>                         return PTR_ERR(btf);
>>>> +
>>>> +               err = map_has_dynptr_in_key_type(btf, attr->btf_key_type_id, attr->key_size);
>>>> +               if (err < 0)
>>>> +                       goto put_btf;
>>>> +               if (err > 0) {
>>>> +                       attr->map_flags |= BPF_INT_F_DYNPTR_IN_KEY;
>>> I don't like this inband signaling in the uapi field.
>>> The whole refactoring in patch 4 to do patch 6 and
>>> subsequent bpf_map_has_dynptr_key() in various places
>>> feels like reinventing the wheel.
>>>
>>> We already have map_check_btf() mechanism that works for
>>> existing special fields inside BTF.
>>> Please use it.
>> Yes. However map->key_record is only available after the map is created,
>> but the creation of hash map needs to check it before the map is
>> created. Instead of using an internal flag, how about adding extra
>> argument for both ->map_alloc_check() and ->map_alloc() as proposed in
>> the commit message of the previous patch ?
>>> map_has_dynptr_in_key_type() can be done in map_check_btf()
>>> after map is created, no ?
>> No. both ->map_alloc_check() and ->map_alloc() need to know whether
>> dynptr is enabled (as explained in the previous commit message). Both of
>> these functions are called before the map is created.
> Is that the explanation?
> "
> The reason for an internal map flag is twofolds:
> 1) user doesn't need to set the map flag explicitly
> map_create() will use the presence of bpf_dynptr in map key as an
> indicator of enabling dynptr key.
> 2) avoid adding new arguments for ->map_alloc_check() and ->map_alloc()
> map_create() needs to pass the supported status of dynptr key to
> ->map_alloc_check (e.g., check the maximum length of dynptr data size)
> and ->map_alloc (e.g., check whether dynptr key fits current map type).
> Adding new arguments for these callbacks to achieve that will introduce
> too much churns.
>
> Therefore, the patch uses the topmost bit of map_flags as the internal
> map flag. map_create() checks whether the internal flag is set in the
> beginning and bpf_map_get_info_by_fd() clears the internal flag before
> returns the map flags to userspace.
> "
>
> As commented in the other patch map_extra can be dropped (I hope).
> When it's gone, the map can be destroyed after creation in map_check_btf().
> What am I missing?

If I understand correctly, you are suggesting to replace
(map->map_flags & BPF_INT_F_DYNPTR_IN_KEY) with !!map->key_record, right
? And you also don't want to move map_check_btf() before the invocation
of ->map_alloc_check() and ->map_alloc(), right ? However, beside the
checking of map_extra, ->map_alloc_check() also needs to know whether
the dynptr-typed key is suitable for current hash map type or map flags.
->map_alloc() also needs to allocate a bpf mem allocator for the dynptr
key. So are you proposing the following steps for creating a dynkey hash
map:

1) ->map_alloc_check()
no change

2) ->map_alloc()
allocate bpf mem allocator for dynptr unconditionally

3) map_check_btf()
invokes a new map callback (e.g., ->map_alloc_post_check()) to check
whether the created map is mismatched with the dynptr key and destroy it
if it is.

?



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 10/20] bpf: Introduce bpf_dynptr_user
  2025-02-14  0:13   ` Alexei Starovoitov
@ 2025-02-14  7:03     ` Hou Tao
  0 siblings, 0 replies; 39+ messages in thread
From: Hou Tao @ 2025-02-14  7:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

Hi,

On 2/14/2025 8:13 AM, Alexei Starovoitov wrote:
> On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> From: Hou Tao <houtao1@huawei.com>
>>
>> For bpf map with dynptr key support, the userspace application will use
>> bpf_dynptr_user to represent the bpf_dynptr in the map key and pass it
>> to bpf syscall. The bpf syscall will copy from bpf_dynptr_user to
>> construct a corresponding bpf_dynptr_kern object when the map key is an
>> input argument, and copy to bpf_dynptr_user from a bpf_dynptr_kern
>> object when the map key is an output argument.
>>
>> For now the size of bpf_dynptr_user must be the same as bpf_dynptr, but
>> the last u32 field is not used, so make it a reserved field.
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>>  include/uapi/linux/bpf.h       | 6 ++++++
>>  tools/include/uapi/linux/bpf.h | 6 ++++++
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 2acf9b3363717..7d96685513c55 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -7335,6 +7335,12 @@ struct bpf_dynptr {
>>         __u64 __opaque[2];
>>  } __attribute__((aligned(8)));
>>
>> +struct bpf_dynptr_user {
>> +       __bpf_md_ptr(void *, data);
>> +       __u32 size;
>> +       __u32 reserved;
>> +} __attribute__((aligned(8)));
> Pls add a comment explaining that bpf_dynptr_user is for user space only
> and bpf progs should continue using bpf_dynptr.
> May be give an example that to use bpf_dynptr in map key
> the bpf prog should write:
>
> +struct mixed_dynptr_key {
> + int id;
> + struct bpf_dynptr name;
> +};
>
> while to access that map the user space should write:
>
> +struct id_dname_key {
> + int id;
> + struct bpf_dynptr_user name;
> +};

Will add comments for the {data,size} tuple case.
>
> tbh the api is kinda ugly, since in the past we always had user space
> and bpf prog reuse the same struct names.
> Here the top struct names have to be different,
> but have to have the same layout.
>
> Maybe let's try the following:
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index fff6cdb8d11a..55d225961dbf 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -7335,7 +7335,14 @@ struct bpf_wq {
>  } __attribute__((aligned(8)));
>
>  struct bpf_dynptr {
> +       union {
>         __u64 __opaque[2];
> +       struct {
> +               __bpf_md_ptr(void *, data);
> +               __u32 size;
> +               __u32 reserved;
> +       };
> +       };
>  } __attribute__((aligned(8)));
>
> Then bpf prog and user space can use the same key type.
> .

It seems a bit strange to combine these two structs, because a bpf prog
uses bpf_dynptr as an opaque object, but a user space application uses
bpf_dynptr as a {data,size} tuple. However, I don't have a better idea.
Will switch to that in v3.
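
On the program side the field stays opaque and is initialized with
bpf_dynptr_from_mem(), as the dynptr_key_bench.c program earlier in the
thread already does. A minimal sketch of that view, assuming the usual
vmlinux.h/bpf_helpers.h includes (struct and helper names are made up):

	struct pid_name {
		int pid;
		struct bpf_dynptr name;
	};

	/* build the key from a caller-supplied buffer of 'len' bytes */
	static __always_inline __u32 *lookup_by_pid_name(void *map, int pid,
							 void *buf, __u32 len)
	{
		/* zero-init so the padding between 'pid' and 'name' is defined */
		struct pid_name key = {};

		key.pid = pid;
		bpf_dynptr_from_mem(buf, len, 0, &key.name);
		return bpf_map_lookup_elem(map, &key);
	}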



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  2025-02-14  6:49         ` Hou Tao
@ 2025-02-14  7:25           ` Hou Tao
  2025-02-14 17:30             ` Alexei Starovoitov
  0 siblings, 1 reply; 39+ messages in thread
From: Hou Tao @ 2025-02-14  7:25 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

Hi,

On 2/14/2025 2:49 PM, Hou Tao wrote:
> Hi,
>
> On 2/14/2025 12:17 PM, Alexei Starovoitov wrote:
>> On Thu, Feb 13, 2025 at 8:12 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>
>>> On 2/14/2025 7:56 AM, Alexei Starovoitov wrote:
>>>> On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>> From: Hou Tao <houtao1@huawei.com>
>>>>>
>>>>> When there is bpf_dynptr field in the map key btf type or the map key
>>>>> btf type is bpf_dyntr, set BPF_INT_F_DYNPTR_IN_KEY in map_flags.
>>>>>
>>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>>>>> ---
>>>>>  kernel/bpf/syscall.c | 36 ++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 36 insertions(+)
>>>>>
>>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>>>> index 07c67ad1a6a07..46b96d062d2db 100644
>>>>> --- a/kernel/bpf/syscall.c
>>>>> +++ b/kernel/bpf/syscall.c
>>>>> @@ -1360,6 +1360,34 @@ static struct btf *get_map_btf(int btf_fd)
>>>>>         return btf;
>>>>>  }
>>>>>
> SNIP
>>>>>  #define BPF_MAP_CREATE_LAST_FIELD map_token_fd
>>>>>  /* called via syscall */
>>>>>  static int map_create(union bpf_attr *attr)
>>>>> @@ -1398,6 +1426,14 @@ static int map_create(union bpf_attr *attr)
>>>>>                 btf = get_map_btf(attr->btf_fd);
>>>>>                 if (IS_ERR(btf))
>>>>>                         return PTR_ERR(btf);
>>>>> +
>>>>> +               err = map_has_dynptr_in_key_type(btf, attr->btf_key_type_id, attr->key_size);
>>>>> +               if (err < 0)
>>>>> +                       goto put_btf;
>>>>> +               if (err > 0) {
>>>>> +                       attr->map_flags |= BPF_INT_F_DYNPTR_IN_KEY;
>>>> I don't like this inband signaling in the uapi field.
>>>> The whole refactoring in patch 4 to do patch 6 and
>>>> subsequent bpf_map_has_dynptr_key() in various places
>>>> feels like reinventing the wheel.
>>>>
>>>> We already have map_check_btf() mechanism that works for
>>>> existing special fields inside BTF.
>>>> Please use it.
>>> Yes. However map->key_record is only available after the map is created,
>>> but the creation of hash map needs to check it before the map is
>>> created. Instead of using an internal flag, how about adding extra
>>> argument for both ->map_alloc_check() and ->map_alloc() as proposed in
>>> the commit message of the previous patch ?
>>>> map_has_dynptr_in_key_type() can be done in map_check_btf()
>>>> after map is created, no ?
>>> No. both ->map_alloc_check() and ->map_alloc() need to know whether
>>> dynptr is enabled (as explained in the previous commit message). Both of
>>> these functions are called before the map is created.
>> Is that the explanation?
>> "
>> The reason for an internal map flag is twofolds:
>> 1) user doesn't need to set the map flag explicitly
>> map_create() will use the presence of bpf_dynptr in map key as an
>> indicator of enabling dynptr key.
>> 2) avoid adding new arguments for ->map_alloc_check() and ->map_alloc()
>> map_create() needs to pass the supported status of dynptr key to
>> ->map_alloc_check (e.g., check the maximum length of dynptr data size)
>> and ->map_alloc (e.g., check whether dynptr key fits current map type).
>> Adding new arguments for these callbacks to achieve that will introduce
>> too much churns.
>>
>> Therefore, the patch uses the topmost bit of map_flags as the internal
>> map flag. map_create() checks whether the internal flag is set in the
>> beginning and bpf_map_get_info_by_fd() clears the internal flag before
>> returns the map flags to userspace.
>> "
>>
>> As commented in the other patch map_extra can be dropped (I hope).
>> When it's gone, the map can be destroyed after creation in map_check_btf().
>> What am I missing?
> If I understand correctly, you are suggesting to replace
> (map->map_flags & BPF_INT_F_DYNPTR_IN_KEY) with !!map->key_record, right
> ? And you also don't want to move map_check_btf() before the invocation
> of ->map_alloc_check() and ->map_alloc(), right ? However, beside the
> checking of map_extra, ->map_alloc_check() also needs to know whether
> the dynptr-typed key is suitable for current hash map type or map flags.
> ->map_alloc() also needs to allocate a bpf mem allocator for the dynptr
> key. So are you proposing the following steps for creating a dynkey hash
> map:
>
> 1) ->map_alloc_check()
> no change
>
> 2) ->map_alloc()
> allocate bpf mem allocator for dynptr unconditionally
>
> 3) map_check_btf()
> invokes a new map callback (e.g., ->map_alloc_post_check()) to check
> whether the created map is mismatched with the dynptr key and destroy it
> if it is.

Sorry, I misread the code, so the third step is:

3) ->map_check_btf()

In the ->map_check_btf() callback, check whether the created map is
mismatched with the dynptr key. If it is, let map_create() destroy the map.
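
For concreteness, a rough sketch of what that check could look like in the
hash map's existing ->map_check_btf callback, mirroring only the conditions
already used in patch 2 (illustrative, not code from the series);
map_create() already tears the map down when this callback returns an error:

	static int htab_map_check_btf(const struct bpf_map *map, const struct btf *btf,
				      const struct btf_type *key_type,
				      const struct btf_type *value_type)
	{
		/* ... existing key/value type checks ... */

		if (map->key_record) {
			/* dynptr keys are only supported by the plain hash map */
			if (map->map_type != BPF_MAP_TYPE_HASH)
				return -EOPNOTSUPP;
			if (map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG))
				return -EACCES;
		}
		return 0;
	}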
>
>
>


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 07/20] bpf: Use map_extra to indicate the max data size of dynptrs in map key
  2025-02-14  6:13     ` Hou Tao
@ 2025-02-14 15:57       ` Alexei Starovoitov
  0 siblings, 0 replies; 39+ messages in thread
From: Alexei Starovoitov @ 2025-02-14 15:57 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

On Thu, Feb 13, 2025 at 10:13 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/14/2025 2:02 AM, Alexei Starovoitov wrote:
> > On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> From: Hou Tao <houtao1@huawei.com>
> >>
> >> For a map with dynptr key support, it needs to use map_extra to specify
> >> the maximum data length of these dynptrs. The implementation of the map
> >> will check whether map_extra is smaller than the limitation imposed by
> >> memory allocation during map creation. It may also use map_extra to
> >> optimize the memory allocation for dynptr.
> > Why limit it?
> > The only piece of code I could find is:
> >
> > uptr->size > map->map_extra
> >
> > and it doesn't look necessary.
> > Let it consume whatever is necessary?
> > .
>
> It will be useful when trying to iterate keys through ->get_next_key()
> in the kernel (e.g., to support map_seq_show_elem in v3), because for now
> the data memory for the dynptr in the map key is allocated by the caller
> (the callee holds an RCU read lock). If the max length of the dynptr
> data is unknown, map_iter_alloc()/map_seq_next() may need some logic to
> probe the max length of the dynptr data during the traversal of keys. I will
> check whether or not it is feasible in v3.

It doesn't have to be:
next_key = kvmalloc(map->key_size, GFP_USER);
the internal interface can be different.
It's not a good idea to impose uapi restrictions because of
implementation details.
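
For example, one possible shape of such an internal interface (purely a
sketch; the op below does not exist) lets the map report how much dynptr
data the next key needs, so an in-kernel iterator can allocate outside the
RCU read-side section and retry instead of relying on a uapi map_extra cap:

	/* Sketch only: map_get_next_key_sized() is a hypothetical op, not
	 * part of struct bpf_map_ops.  It fills *size with the dynptr data
	 * length it needs when the provided buffer is too small.
	 */
	static void *iter_next_key(struct bpf_map *map, void *key)
	{
		u32 size = 64;		/* initial guess for the dynptr data */
		void *buf;
		int err;

		for (;;) {
			buf = kvmalloc(map->key_size + size, GFP_USER);
			if (!buf)
				return ERR_PTR(-ENOMEM);
			err = map->ops->map_get_next_key_sized(map, key, buf, &size);
			if (err != -ENOSPC)
				break;
			kvfree(buf);	/* too small: retry with the reported size */
		}
		if (err) {
			kvfree(buf);
			return ERR_PTR(err);
		}
		return buf;
	}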

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  2025-02-14  7:25           ` Hou Tao
@ 2025-02-14 17:30             ` Alexei Starovoitov
  2025-02-27 21:10               ` Alexei Starovoitov
  0 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2025-02-14 17:30 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

On Thu, Feb 13, 2025 at 11:25 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/14/2025 2:49 PM, Hou Tao wrote:
> > Hi,
> >
> > On 2/14/2025 12:17 PM, Alexei Starovoitov wrote:
> >> On Thu, Feb 13, 2025 at 8:12 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>> Hi,
> >>>
> >>> On 2/14/2025 7:56 AM, Alexei Starovoitov wrote:
> >>>> On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>>> From: Hou Tao <houtao1@huawei.com>
> >>>>>
> >>>>> When there is a bpf_dynptr field in the map key btf type or the map key
> >>>>> btf type is bpf_dynptr, set BPF_INT_F_DYNPTR_IN_KEY in map_flags.
> >>>>>
> >>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >>>>> ---
> >>>>>  kernel/bpf/syscall.c | 36 ++++++++++++++++++++++++++++++++++++
> >>>>>  1 file changed, 36 insertions(+)
> >>>>>
> >>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> >>>>> index 07c67ad1a6a07..46b96d062d2db 100644
> >>>>> --- a/kernel/bpf/syscall.c
> >>>>> +++ b/kernel/bpf/syscall.c
> >>>>> @@ -1360,6 +1360,34 @@ static struct btf *get_map_btf(int btf_fd)
> >>>>>         return btf;
> >>>>>  }
> >>>>>
> > SNIP
> >>>>>  #define BPF_MAP_CREATE_LAST_FIELD map_token_fd
> >>>>>  /* called via syscall */
> >>>>>  static int map_create(union bpf_attr *attr)
> >>>>> @@ -1398,6 +1426,14 @@ static int map_create(union bpf_attr *attr)
> >>>>>                 btf = get_map_btf(attr->btf_fd);
> >>>>>                 if (IS_ERR(btf))
> >>>>>                         return PTR_ERR(btf);
> >>>>> +
> >>>>> +               err = map_has_dynptr_in_key_type(btf, attr->btf_key_type_id, attr->key_size);
> >>>>> +               if (err < 0)
> >>>>> +                       goto put_btf;
> >>>>> +               if (err > 0) {
> >>>>> +                       attr->map_flags |= BPF_INT_F_DYNPTR_IN_KEY;
> >>>> I don't like this inband signaling in the uapi field.
> >>>> The whole refactoring in patch 4 to do patch 6 and
> >>>> subsequent bpf_map_has_dynptr_key() in various places
> >>>> feels like reinventing the wheel.
> >>>>
> >>>> We already have map_check_btf() mechanism that works for
> >>>> existing special fields inside BTF.
> >>>> Please use it.
> >>> Yes. However, map->key_record is only available after the map is created,
> >>> but the creation of the hash map needs to check it before the map is
> >>> created. Instead of using an internal flag, how about adding an extra
> >>> argument for both ->map_alloc_check() and ->map_alloc(), as proposed in
> >>> the commit message of the previous patch?
> >>>> map_has_dynptr_in_key_type() can be done in map_check_btf()
> >>>> after the map is created, no?
> >>> No. Both ->map_alloc_check() and ->map_alloc() need to know whether
> >>> dynptr is enabled (as explained in the previous commit message). Both of
> >>> these functions are called before the map is created.
> >> Is that the explanation?
> >> "
> >> The reason for an internal map flag is twofold:
> >> 1) user doesn't need to set the map flag explicitly
> >> map_create() will use the presence of bpf_dynptr in map key as an
> >> indicator of enabling dynptr key.
> >> 2) avoid adding new arguments for ->map_alloc_check() and ->map_alloc()
> >> map_create() needs to pass the supported status of dynptr key to
> >> ->map_alloc_check (e.g., check the maximum length of dynptr data size)
> >> and ->map_alloc (e.g., check whether dynptr key fits current map type).
> >> Adding new arguments for these callbacks to achieve that will introduce
> >> too much churn.
> >>
> >> Therefore, the patch uses the topmost bit of map_flags as the internal
> >> map flag. map_create() checks whether the internal flag is set in the
> >> beginning and bpf_map_get_info_by_fd() clears the internal flag before
> >> returning the map flags to userspace.
> >> "
> >>
> >> As commented in the other patch map_extra can be dropped (I hope).
> >> When it's gone, the map can be destroyed after creation in map_check_btf().
> >> What am I missing?
> > If I understand correctly, you are suggesting to replace
> > (map->map_flags & BPF_INT_F_DYNPTR_IN_KEY) with !!map->key_record, right?
> > And you also don't want to move map_check_btf() before the invocation
> > of ->map_alloc_check() and ->map_alloc(), right? However, besides the
> > checking of map_extra, ->map_alloc_check() also needs to know whether
> > the dynptr-typed key is suitable for the current hash map type or map flags.
> > ->map_alloc() also needs to allocate a bpf mem allocator for the dynptr
> > key. So are you proposing the following steps for creating a dynkey hash
> > map:
> >
> > 1) ->map_alloc_check()
> > no change
> >
> > 2) ->map_alloc()
> > allocate bpf mem allocator for dynptr unconditionally
> >
> > 3) map_check_btf()
> > invokes a new map callback (e.g., ->map_alloc_post_check()) to check
> > whether the created map is mismatched with the dynptr key and destroy it
> > if it is.
>
> Sorry, I misread the code, so the third step is:
>
> 3) ->map_check_btf()
>
> In the ->map_check_btf() callback, check whether the created map is
> mismatched with the dynptr key. If it is, let map_create() destroy the map.

map_check_btf() itself can have the code to filter out unsupported maps
like it does already:
                        case BPF_WORKQUEUE:
                                if (map->map_type != BPF_MAP_TYPE_HASH &&
                                    map->map_type != BPF_MAP_TYPE_LRU_HASH &&
                                    map->map_type != BPF_MAP_TYPE_ARRAY) {
                                        ret = -EOPNOTSUPP;

I don't mind moving map_check_btf() before ->map_alloc_check()
since it doesn't really need the 'map' pointer.
I objected to a partial move where btf_get_by_fd() is done early
while the rest is done after map allocation.
Either all map types do map_check_btf() before alloc or
all map types do it after.

If we move map_check_btf() before alloc
then the final map->ops->map_check_btf() should probably
stay after alloc.
Otherwise this is too much churn.

So I think it's better to try to keep the whole map_check_btf() after
as it is right now.
I don't see yet why dynptr-in-key has to have it before.
So far the map_extra limitation was the only special condition,
but even if we have to keep it (which I doubt), it can be done in
map->ops->map_check_btf().
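
Concretely, the same pattern could cover a dynptr-typed key. In the sketch
below, BPF_DYNPTR as a btf_field_type comes from this series, while the
key_record walk and the hash-map-only restriction are only assumptions for
illustration:

	/* Sketch: reject a dynptr-typed key for unsupported map types, in
	 * the same way map_check_btf() already filters BPF_WORKQUEUE above.
	 * btf_record_has_field() is an existing helper; BPF_DYNPTR as a
	 * field type is introduced by this series.
	 */
	if (btf_record_has_field(map->key_record, BPF_DYNPTR) &&
	    map->map_type != BPF_MAP_TYPE_HASH) {
		ret = -EOPNOTSUPP;
		goto free_map_tab;
	}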

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  2025-02-14 17:30             ` Alexei Starovoitov
@ 2025-02-27 21:10               ` Alexei Starovoitov
  2025-02-28  0:58                 ` Hou Tao
  0 siblings, 1 reply; 39+ messages in thread
From: Alexei Starovoitov @ 2025-02-27 21:10 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

On Fri, Feb 14, 2025 at 9:30 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Feb 13, 2025 at 11:25 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >
> > Hi,
> >
> > On 2/14/2025 2:49 PM, Hou Tao wrote:
> > > Hi,
> > >
> > > On 2/14/2025 12:17 PM, Alexei Starovoitov wrote:
> > >> On Thu, Feb 13, 2025 at 8:12 PM Hou Tao <houtao@huaweicloud.com> wrote:
> > >>> Hi,
> > >>>
> > >>> On 2/14/2025 7:56 AM, Alexei Starovoitov wrote:
> > >>>> On Sat, Jan 25, 2025 at 2:59 AM Hou Tao <houtao@huaweicloud.com> wrote:
> > >>>>> From: Hou Tao <houtao1@huawei.com>
> > >>>>>
> > >>>>> When there is a bpf_dynptr field in the map key btf type or the map key
> > >>>>> btf type is bpf_dynptr, set BPF_INT_F_DYNPTR_IN_KEY in map_flags.
> > >>>>>
> > >>>>> Signed-off-by: Hou Tao <houtao1@huawei.com>
> > >>>>> ---
> > >>>>>  kernel/bpf/syscall.c | 36 ++++++++++++++++++++++++++++++++++++
> > >>>>>  1 file changed, 36 insertions(+)
> > >>>>>
> > >>>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > >>>>> index 07c67ad1a6a07..46b96d062d2db 100644
> > >>>>> --- a/kernel/bpf/syscall.c
> > >>>>> +++ b/kernel/bpf/syscall.c
> > >>>>> @@ -1360,6 +1360,34 @@ static struct btf *get_map_btf(int btf_fd)
> > >>>>>         return btf;
> > >>>>>  }
> > >>>>>
> > > SNIP
> > >>>>>  #define BPF_MAP_CREATE_LAST_FIELD map_token_fd
> > >>>>>  /* called via syscall */
> > >>>>>  static int map_create(union bpf_attr *attr)
> > >>>>> @@ -1398,6 +1426,14 @@ static int map_create(union bpf_attr *attr)
> > >>>>>                 btf = get_map_btf(attr->btf_fd);
> > >>>>>                 if (IS_ERR(btf))
> > >>>>>                         return PTR_ERR(btf);
> > >>>>> +
> > >>>>> +               err = map_has_dynptr_in_key_type(btf, attr->btf_key_type_id, attr->key_size);
> > >>>>> +               if (err < 0)
> > >>>>> +                       goto put_btf;
> > >>>>> +               if (err > 0) {
> > >>>>> +                       attr->map_flags |= BPF_INT_F_DYNPTR_IN_KEY;
> > >>>> I don't like this inband signaling in the uapi field.
> > >>>> The whole refactoring in patch 4 to do patch 6 and
> > >>>> subsequent bpf_map_has_dynptr_key() in various places
> > >>>> feels like reinventing the wheel.
> > >>>>
> > >>>> We already have map_check_btf() mechanism that works for
> > >>>> existing special fields inside BTF.
> > >>>> Please use it.
> > >>> Yes. However, map->key_record is only available after the map is created,
> > >>> but the creation of the hash map needs to check it before the map is
> > >>> created. Instead of using an internal flag, how about adding an extra
> > >>> argument for both ->map_alloc_check() and ->map_alloc(), as proposed in
> > >>> the commit message of the previous patch?
> > >>>> map_has_dynptr_in_key_type() can be done in map_check_btf()
> > >>>> after the map is created, no?
> > >>> No. Both ->map_alloc_check() and ->map_alloc() need to know whether
> > >>> dynptr is enabled (as explained in the previous commit message). Both of
> > >>> these functions are called before the map is created.
> > >> Is that the explanation?
> > >> "
> > >> The reason for an internal map flag is twofold:
> > >> 1) user doesn't need to set the map flag explicitly
> > >> map_create() will use the presence of bpf_dynptr in map key as an
> > >> indicator of enabling dynptr key.
> > >> 2) avoid adding new arguments for ->map_alloc_check() and ->map_alloc()
> > >> map_create() needs to pass the supported status of dynptr key to
> > >> ->map_alloc_check (e.g., check the maximum length of dynptr data size)
> > >> and ->map_alloc (e.g., check whether dynptr key fits current map type).
> > >> Adding new arguments for these callbacks to achieve that will introduce
> > >> too much churn.
> > >>
> > >> Therefore, the patch uses the topmost bit of map_flags as the internal
> > >> map flag. map_create() checks whether the internal flag is set in the
> > >> beginning and bpf_map_get_info_by_fd() clears the internal flag before
> > >> returning the map flags to userspace.
> > >> "
> > >>
> > >> As commented in the other patch map_extra can be dropped (I hope).
> > >> When it's gone, the map can be destroyed after creation in map_check_btf().
> > >> What am I missing?
> > > If I understand correctly, you are suggesting to replace
> > > (map->map_flags & BPF_INT_F_DYNPTR_IN_KEY) with !!map->key_record, right?
> > > And you also don't want to move map_check_btf() before the invocation
> > > of ->map_alloc_check() and ->map_alloc(), right? However, besides the
> > > checking of map_extra, ->map_alloc_check() also needs to know whether
> > > the dynptr-typed key is suitable for the current hash map type or map flags.
> > > ->map_alloc() also needs to allocate a bpf mem allocator for the dynptr
> > > key. So are you proposing the following steps for creating a dynkey hash
> > > map:
> > >
> > > 1) ->map_alloc_check()
> > > no change
> > >
> > > 2) ->map_alloc()
> > > allocate bpf mem allocator for dynptr unconditionally
> > >
> > > 3) map_check_btf()
> > > invokes a new map callback (e.g., ->map_alloc_post_check()) to check
> > > whether the created map is mismatched with the dynptr key and destroy it
> > > if it is.
> >
> > Sorry, I misread the code, so the third step is:
> >
> > 3) ->map_check_btf()
> >
> > In the ->map_check_btf() callback, check whether the created map is
> > mismatched with the dynptr key. If it is, let map_create() destroy the map.
>
> map_check_btf() itself can have the code to filter out unsupported maps
> like it does already:
>                         case BPF_WORKQUEUE:
>                                 if (map->map_type != BPF_MAP_TYPE_HASH &&
>                                     map->map_type != BPF_MAP_TYPE_LRU_HASH &&
>                                     map->map_type != BPF_MAP_TYPE_ARRAY) {
>                                         ret = -EOPNOTSUPP;
>
> I don't mind moving map_check_btf() before ->map_alloc_check()
> since it doesn't really need the 'map' pointer.
> I objected to a partial move where btf_get_by_fd() is done early
> while the rest is done after map allocation.
> Either all map types do map_check_btf() before alloc or
> all map types do it after.
>
> If we move map_check_btf() before alloc
> then the final map->ops->map_check_btf() should probably
> stay after alloc.
> Otherwise this is too much churn.
>
> So I think it's better to try to keep the whole map_check_btf() after
> as it is right now.
> I don't see yet why dynptr-in-key has to have it before.
> So far the map_extra limitation was the only special condition,
> but even if we have to keep it (which I doubt), it can be done in
> map->ops->map_check_btf().

Any update on this ?
Two weeks have passed.
iirc above was the only thing left to resolve.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  2025-02-27 21:10               ` Alexei Starovoitov
@ 2025-02-28  0:58                 ` Hou Tao
  2025-02-28  2:43                   ` Alexei Starovoitov
  0 siblings, 1 reply; 39+ messages in thread
From: Hou Tao @ 2025-02-28  0:58 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

Hi,

On 2/28/2025 5:10 AM, Alexei Starovoitov wrote:
> On Fri, Feb 14, 2025 at 9:30 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> On Thu, Feb 13, 2025 at 11:25 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>> Hi,
>>>

SNIP
>>>
>>> 3) ->map_check_btf()
>>>
>>> In the ->map_check_btf() callback, check whether the created map is
>>> mismatched with the dynptr key. If it is, let map_create() destroy the map.
>> map_check_btf() itself can have the code to filter out unsupported maps
>> like it does already:
>>                         case BPF_WORKQUEUE:
>>                                 if (map->map_type != BPF_MAP_TYPE_HASH &&
>>                                     map->map_type != BPF_MAP_TYPE_LRU_HASH &&
>>                                     map->map_type != BPF_MAP_TYPE_ARRAY) {
>>                                         ret = -EOPNOTSUPP;
>>
>> I don't mind moving map_check_btf() before ->map_alloc_check()
>> since it doesn't really need the 'map' pointer.
>> I objected to a partial move where btf_get_by_fd() is done early
>> while the rest is done after map allocation.
>> Either all map types do map_check_btf() before alloc or
>> all map types do it after.
>>
>> If we move map_check_btf() before alloc
>> then the final map->ops->map_check_btf() should probably
>> stay after alloc.
>> Otherwise this is too much churn.
>>
>> So I think it's better to try to keep the whole map_check_btf() after
>> as it is right now.
>> I don't see yet why dynptr-in-key has to have it before.
>> So far the map_extra limitation was the only special condition,
>> but even if we have to keep it (which I doubt), it can be done in
>> map->ops->map_check_btf().
> Any update on this ?
> Two weeks have passed.
> iirc above was the only thing left to resolve.
Er, I started adding bpffs seq-file and batched operation support
recently. I need to ask whether it is OK to complete the todo items
listed below in a follow-up patch set. As noted in the cover letter,
the following things are not supported yet:

1) batched map operation through bpf syscall
2) the memory accounting for dynptr (aka .htab_map_mem_usage)
3) btf print for the dynptr in map key
4) bpftool support
5) the iteration of elements through bpf program


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally
  2025-02-28  0:58                 ` Hou Tao
@ 2025-02-28  2:43                   ` Alexei Starovoitov
  0 siblings, 0 replies; 39+ messages in thread
From: Alexei Starovoitov @ 2025-02-28  2:43 UTC (permalink / raw)
  To: Hou Tao
  Cc: bpf, Martin KaFai Lau, Andrii Nakryiko, Eduard Zingerman,
	Song Liu, Hao Luo, Yonghong Song, Daniel Borkmann, KP Singh,
	Stanislav Fomichev, Jiri Olsa, John Fastabend, Dan Carpenter,
	Hou Tao, Xu Kuohai

On Thu, Feb 27, 2025 at 5:16 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/28/2025 5:10 AM, Alexei Starovoitov wrote:
> > On Fri, Feb 14, 2025 at 9:30 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Thu, Feb 13, 2025 at 11:25 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>> Hi,
> >>>
>
> SNIP
> >>>
> >>> 3) ->map_check_btf()
> >>>
> >>> In the ->map_check_btf() callback, check whether the created map is
> >>> mismatched with the dynptr key. If it is, let map_create() destroy the map.
> >> map_check_btf() itself can have the code to filter out unsupported maps
> >> like it does already:
> >>                         case BPF_WORKQUEUE:
> >>                                 if (map->map_type != BPF_MAP_TYPE_HASH &&
> >>                                     map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> >>                                     map->map_type != BPF_MAP_TYPE_ARRAY) {
> >>                                         ret = -EOPNOTSUPP;
> >>
> >> I don't mind moving map_check_btf() before ->map_alloc_check()
> >> since it doesn't really need the 'map' pointer.
> >> I objected to a partial move where btf_get_by_fd() is done early
> >> while the rest is done after map allocation.
> >> Either all map types do map_check_btf() before alloc or
> >> all map types do it after.
> >>
> >> If we move map_check_btf() before alloc
> >> then the final map->ops->map_check_btf() should probably
> >> stay after alloc.
> >> Otherwise this is too much churn.
> >>
> >> So I think it's better to try to keep the whole map_check_btf() after
> >> as it is right now.
> >> I don't see yet why dynptr-in-key has to have it before.
> >> So far the map_extra limitation was the only special condition,
> >> but even if we have to keep it (which I doubt), it can be done in
> >> map->ops->map_check_btf().
> > Any update on this ?
> > Two weeks have passed.
> > iirc above was the only thing left to resolve.
> Er, I started adding bpffs seq-file and batched operation support
> recently. I need to ask whether it is OK to complete the todo items
> listed below in a follow-up patch set. As noted in the cover letter,
> the following things are not supported yet:
>
> 1) batched map operation through bpf syscall
> 2) the memory accounting for dynptr (aka .htab_map_mem_usage)
> 3) btf print for the dynptr in map key
> 4) bpftool support
> 5) the iteration of elements through bpf program

All these things would be nice to add, but the patch set
needs to stay reviewable.
It's 20 patches already, which is too many.
v1 was done many months ago, and it is not only the complexity of the
feature that makes it slow to land, but the size of the set too.
Keep it small. Incremental work is preferred.
Better to land the core feature first and then gradually
add 5, 2, 3 and 4.
Batched ops, aka 1, can be delayed for some time.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2025-02-28  2:44 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-25 11:10 [PATCH bpf-next v2 00/20] Support dynptr key for hash map Hou Tao
2025-01-25 11:10 ` [PATCH bpf-next v2 01/20] bpf: Add two helpers to facilitate the parsing of bpf_dynptr Hou Tao
2025-02-04 23:17   ` Alexei Starovoitov
2025-02-05  1:33     ` Hou Tao
2025-01-25 11:10 ` [PATCH bpf-next v2 02/20] bpf: Parse bpf_dynptr in map key Hou Tao
2025-02-13 17:59   ` Alexei Starovoitov
2025-02-14  4:04     ` Hou Tao
2025-01-25 11:10 ` [PATCH bpf-next v2 03/20] bpf: Factor out get_map_btf() helper Hou Tao
2025-01-25 11:10 ` [PATCH bpf-next v2 04/20] bpf: Move the initialization of btf before ->map_alloc_check Hou Tao
2025-01-25 11:10 ` [PATCH bpf-next v2 05/20] bpf: Introduce an internal map flag BPF_INT_F_DYNPTR_IN_KEY Hou Tao
2025-01-25 11:10 ` [PATCH bpf-next v2 06/20] bpf: Set BPF_INT_F_DYNPTR_IN_KEY conditionally Hou Tao
2025-02-13 23:56   ` Alexei Starovoitov
2025-02-14  4:12     ` Hou Tao
2025-02-14  4:17       ` Alexei Starovoitov
2025-02-14  6:49         ` Hou Tao
2025-02-14  7:25           ` Hou Tao
2025-02-14 17:30             ` Alexei Starovoitov
2025-02-27 21:10               ` Alexei Starovoitov
2025-02-28  0:58                 ` Hou Tao
2025-02-28  2:43                   ` Alexei Starovoitov
2025-01-25 11:10 ` [PATCH bpf-next v2 07/20] bpf: Use map_extra to indicate the max data size of dynptrs in map key Hou Tao
2025-02-13 18:02   ` Alexei Starovoitov
2025-02-14  6:13     ` Hou Tao
2025-02-14 15:57       ` Alexei Starovoitov
2025-01-25 11:10 ` [PATCH bpf-next v2 08/20] bpf: Split check_stack_range_initialized() into small functions Hou Tao
2025-01-25 11:10 ` [PATCH bpf-next v2 09/20] bpf: Support map key with dynptr in verifier Hou Tao
2025-01-25 11:10 ` [PATCH bpf-next v2 10/20] bpf: Introduce bpf_dynptr_user Hou Tao
2025-02-14  0:13   ` Alexei Starovoitov
2025-02-14  7:03     ` Hou Tao
2025-01-25 11:11 ` [PATCH bpf-next v2 11/20] bpf: Handle bpf_dynptr_user in bpf syscall when it is used as input Hou Tao
2025-01-25 11:11 ` [PATCH bpf-next v2 12/20] bpf: Handle bpf_dynptr_user in bpf syscall when it is used as output Hou Tao
2025-01-25 11:11 ` [PATCH bpf-next v2 13/20] bpf: Support basic operations for dynptr key in hash map Hou Tao
2025-01-25 11:11 ` [PATCH bpf-next v2 14/20] bpf: Export bpf_dynptr_set_size Hou Tao
2025-01-25 11:11 ` [PATCH bpf-next v2 15/20] bpf: Support get_next_key operation for dynptr key in hash map Hou Tao
2025-01-25 11:11 ` [PATCH bpf-next v2 16/20] bpf: Disable unsupported operations for map with dynptr key Hou Tao
2025-01-25 11:11 ` [PATCH bpf-next v2 17/20] bpf: Enable BPF_INT_F_DYNPTR_IN_KEY for hash map Hou Tao
2025-01-25 11:11 ` [PATCH bpf-next v2 18/20] selftests/bpf: Add bpf_dynptr_user_init() helper Hou Tao
2025-01-25 11:11 ` [PATCH bpf-next v2 19/20] selftests/bpf: Add test cases for hash map with dynptr key Hou Tao
2025-01-25 11:11 ` [PATCH bpf-next v2 20/20] selftests/bpf: Add benchmark for dynptr key support in hash map Hou Tao
