* [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists
@ 2022-09-04 20:41 Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 01/32] bpf: Add copy_map_value_long to copy to remote percpu memory Kumar Kartikeya Dwivedi
` (31 more replies)
0 siblings, 32 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
WARNING: This is an RFC. WARN_ON_ONCE is sprinkled around the code liberally
(useful while working on this stuff). I'll do a thorough pass and clean all
that up before sending out the non-RFC v1.
TODO before non-RFC v1:
- A lot more corner case tests, failure tests, more tests for the new local
kptr support. I did test the basic stuff (which the verifier complained
about when writing linked_list.c).
- More tests for kptr support in new map types.
- More self review.
--
This series introduces user defined BPF objects through the idea of local
kptrs. These are kptrs (strongly typed pointers) that refer to objects of a
user defined type, hence called "local" kptrs. This allows BPF programs to
allocate their own objects, build their own object hierarchies, and use the
basic building blocks provided by the BPF runtime to flexibly build their own
data structures.
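To make this concrete, here is a minimal sketch of what allocating and freeing
such an object can look like from BPF C. It assumes the kfunc declarations from
the bpf_experimental.h header added later in this series; the exact
bpf_kptr_alloc()/bpf_kptr_free() argument lists shown here are illustrative,
not the final API, and the usual vmlinux.h/bpf_helpers.h includes are omitted:

struct foo {
        int data;
        long counter;
};

SEC("tc")
int alloc_and_free(struct __sk_buff *ctx)
{
        struct foo *f;

        /* Returns a referenced local kptr (or NULL) owned by the program. */
        f = bpf_kptr_alloc(sizeof(struct foo), 0);
        if (!f)
                return 0;
        f->data = 42;
        /* Single ownership: either free it, or move it into a map/list. */
        bpf_kptr_free(f);
        return 0;
}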
Then, we introduce support for single ownership BPF linked lists, which can be
put inside BPF maps or local kptrs, and hold such allocated local kptrs as
elements. The list works as an intrusive collection, which is done to allow
making local kptrs part of multiple data structures at the same time in the
future.
The eventual goal of this and future patches is to allow one to do some limited
form of kernel-style programming in BPF C, and allow programmers to flexibly
build their own complex data structures out of basic building blocks.
The key difference will be that such programs are verified to be safe, preserve
runtime integrity of the system, and are proven to be bug-free as far as the
invariants of BPF-specific APIs are concerned.
One immediate use case for the entire infrastructure this series introduces
will be managing percpu NMI safe linked lists inside BPF programs.
The other use case this will serve in the near future will be linking kernel
structures like XDP frames and sk_buffs directly into user data structures
(rbtree, pifomap, etc.) for packet queueing. This will follow the single
ownership concept included in this series.
The user has complete control of the internal locking, and hence also the
batching of operations for each critical section.
Eventually, with some more support in future patches, users will be able to
write a fully concurrent, RCU protected hash table, using BPF_MAP_TYPE_ARRAY
for the buckets and embedding BPF linked lists in them. All of this will be
possible in safe BPF C, whose runtime safety will be proven by the BPF
verifier.
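As a rough sketch of where this is headed (bpf_list_add() below is a
placeholder name, not the API added later in this series, and the alloc
signature is illustrative), a "bucket" can simply be a map value embedding its
own lock and list head, with the program batching several insertions in one
critical section:

struct elem {
        int key;
        struct bpf_list_node node;
};

struct bucket {
        struct bpf_spin_lock lock;
        struct bpf_list_head head;      /* BTF tags tie this to struct elem::node */
};

struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 16);
        __type(key, int);
        __type(value, struct bucket);
} buckets SEC(".maps");

SEC("tc")
int insert_two(struct __sk_buff *ctx)
{
        struct elem *a, *b;
        struct bucket *bkt;
        int idx = 0;

        bkt = bpf_map_lookup_elem(&buckets, &idx);
        if (!bkt)
                return 0;
        a = bpf_kptr_alloc(sizeof(*a), 0);      /* illustrative signature */
        b = bpf_kptr_alloc(sizeof(*b), 0);
        if (!a || !b)
                goto out;
        /* One critical section, two insertions: the program picks the locking
         * granularity and how much work to batch under the lock.
         */
        bpf_spin_lock(&bkt->lock);
        bpf_list_add(&a->node, &bkt->head);     /* placeholder kfunc name */
        bpf_list_add(&b->node, &bkt->head);
        bpf_spin_unlock(&bkt->lock);
        return 0;
out:
        if (a)
                bpf_kptr_free(a);
        if (b)
                bpf_kptr_free(b);
        return 0;
}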
The features, core infrastructure, and other improvements in this set are:
- Allow storing kptrs in local storage and percpu maps.
- Local kptrs - User defined kernel objects.
- bpf_kptr_alloc, bpf_kptr_free to allocate and free them.
- BPF memory object model, similar to what the C and C++ abstract machines have:
the verifier now reasons about an object's lifetime, i.e. the concepts of
object lifetime, visibility, construction, and destruction are reified.
The separation of storage and object lifetime is understood by the verifier.
- Single ownership BPF linked lists.
- Support for them in BPF maps.
- Support for them in local kptrs.
- Global spin locks.
- Spin locks inside local kptrs.
- Allow storing local kptrs in all BPF maps with support for kernel kptrs.
Some other notable things:
- Completely static verification of locking.
- Kfunc argument handling has been completely reworked.
- Argument rewriting support for kfuncs.
Now we can also support inlining a block of BPF insns for certain kfuncs.
- Iteration over all registers in verifier state has a new lambda based
iterator (which can be nifty or crazy, depending on your love for GNU C).
- Search pruning now understands non-size precise registers.
- A new bpf_experimental.h header as a dumping ground for these APIs.
Any functionality exposed in this series is **NOT** part of UAPI. It is only
available through the use of kfuncs, and the structs that can be added to a map
value may also change their size or name in the future. Hence, every feature in
this series must be considered **EXPERIMENTAL**.
Next steps:
-----------
* NMI safe percpu single ownership linked lists (using local_t protection).
- This enables the open coded freelist use case
* Lockless linked lists.
* Allow RCU protected local kptrs. This then allows RCU protected list lookups,
since spinlock protection for readers does not scale.
* Introduce explicit RCU read sections (using kfuncs).
* Introduce bpf_refcount for local kptrs, shared ownership.
* Introduce shared ownership linked lists.
* Documentation.
Notes:
------
* Delyan's work to expose Alexei's BPF memory allocator as a global allocator
is still needed before this can be merged. For now, direct kmalloc and
kfree are used.
Links:
------
* Dave's BPF RB-Tree RFC series
v1 (Discussion thread)
https://lore.kernel.org/bpf/20220722183438.3319790-1-davemarchevsky@fb.com
v2 (With support for static locks)
https://lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
* BPF Linked Lists Discussion
https://lore.kernel.org/bpf/CAP01T74U30+yeBHEgmgzTJ-XYxZ0zj71kqCDJtTH9YQNfTK+Xw@mail.gmail.com
* BPF Memory Allocator from Alexei
https://lore.kernel.org/bpf/20220902211058.60789-1-alexei.starovoitov@gmail.com
* BPF Memory Allocator UAPI Discussion
https://lore.kernel.org/bpf/d3f76b27f4e55ec9e400ae8dcaecbb702a4932e8.camel@fb.com
Daniel Xu (1):
bpf: Remove duplicate PTR_TO_BTF_ID RO check
Dave Marchevsky (1):
libbpf: Add support for private BSS map section
Kumar Kartikeya Dwivedi (30):
bpf: Add copy_map_value_long to copy to remote percpu memory
bpf: Support kptrs in percpu arraymap
bpf: Add zero_map_value to zero map value with special fields
bpf: Support kptrs in percpu hashmap and percpu LRU hashmap
bpf: Support kptrs in local storage maps
bpf: Annotate data races in bpf_local_storage
bpf: Allow specifying volatile type modifier for kptrs
bpf: Add comment about kptr's PTR_TO_MAP_VALUE handling
bpf: Rewrite kfunc argument handling
bpf: Drop kfunc support from btf_check_func_arg_match
bpf: Support constant scalar arguments for kfuncs
bpf: Teach verifier about non-size constant arguments
bpf: Introduce bpf_list_head support for BPF maps
bpf: Introduce bpf_kptr_alloc helper
bpf: Add helper macro bpf_expr_for_each_reg_in_vstate
bpf: Introduce BPF memory object model
bpf: Support bpf_list_node in local kptrs
bpf: Support bpf_spin_lock in local kptrs
bpf: Support bpf_list_head in local kptrs
bpf: Introduce bpf_kptr_free helper
bpf: Allow locking bpf_spin_lock global variables
bpf: Bump BTF_KFUNC_SET_MAX_CNT
bpf: Add single ownership BPF linked list API
bpf: Permit NULL checking pointer with non-zero fixed offset
bpf: Allow storing local kptrs in BPF maps
bpf: Wire up freeing of bpf_list_heads in maps
bpf: Add destructor for bpf_list_head in local kptr
selftests/bpf: Add BTF tag macros for local kptrs, BPF linked lists
selftests/bpf: Add BPF linked list API tests
selftests/bpf: Add referenced local kptr tests
Documentation/bpf/kfuncs.rst | 30 +
include/linux/bpf.h | 177 +-
include/linux/bpf_local_storage.h | 2 +-
include/linux/bpf_verifier.h | 77 +-
include/linux/btf.h | 76 +-
include/linux/poison.h | 3 +
kernel/bpf/arraymap.c | 43 +-
kernel/bpf/bpf_local_storage.c | 53 +-
kernel/bpf/btf.c | 727 +++---
kernel/bpf/hashtab.c | 91 +-
kernel/bpf/helpers.c | 137 +-
kernel/bpf/map_in_map.c | 5 +-
kernel/bpf/syscall.c | 231 +-
kernel/bpf/verifier.c | 2084 ++++++++++++++---
net/bpf/bpf_dummy_struct_ops.c | 5 +-
net/ipv4/bpf_tcp_ca.c | 5 +-
tools/lib/bpf/libbpf.c | 65 +-
.../testing/selftests/bpf/bpf_experimental.h | 120 +
.../selftests/bpf/prog_tests/linked_list.c | 88 +
.../selftests/bpf/prog_tests/map_kptr.c | 2 +-
.../testing/selftests/bpf/progs/linked_list.c | 347 +++
tools/testing/selftests/bpf/progs/map_kptr.c | 38 +
tools/testing/selftests/bpf/verifier/calls.c | 2 +-
23 files changed, 3626 insertions(+), 782 deletions(-)
create mode 100644 tools/testing/selftests/bpf/bpf_experimental.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/linked_list.c
create mode 100644 tools/testing/selftests/bpf/progs/linked_list.c
--
2.34.1
* [PATCH RFC bpf-next v1 01/32] bpf: Add copy_map_value_long to copy to remote percpu memory
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 02/32] bpf: Support kptrs in percpu arraymap Kumar Kartikeya Dwivedi
` (30 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
bpf_long_memcpy is used while copying to remote percpu regions from the BPF
syscall and helpers, so that the copy is atomic at word size granularity.
This might not be possible when copying a map value hosting kptrs from or to
percpu maps, as the alignment or size of the disjoint regions may not be a
multiple of the word size.
Hence, to avoid complicating the copy loop, we only use bpf_long_memcpy when
special fields are not present, and otherwise use normal memcpy to copy the
disjoint regions.
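To illustrate with an example layout (field names are made up; __kptr is the
BTF type tag macro used by the selftests), the regions that copy_map_value()
actually copies are no longer guaranteed to be word-sized or word-aligned once
special fields are embedded:

struct value {
        struct bpf_spin_lock lock;      /* 4 bytes at offset 0, skipped by the copy   */
        int a;                          /* copied region [4, 8): starts at a non-8-byte boundary */
        struct task_struct __kptr *p;   /* 8 bytes at offset 8, skipped by the copy   */
        int b;                          /* tail region [16, value_size)               */
};

map->off_arr records the offsets and sizes of the skipped fields, and
__copy_map_value() then simply does one memcpy() per region between them.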
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 52 ++++++++++++++++++++++++++++-----------------
1 file changed, 33 insertions(+), 19 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9c1674973e03..a6a0c0025b46 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -280,14 +280,33 @@ static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
}
}
-/* copy everything but bpf_spin_lock and bpf_timer. There could be one of each. */
-static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
+/* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
+ * forced to use 'long' read/writes to try to atomically copy long counters.
+ * Best-effort only. No barriers here, since it _will_ race with concurrent
+ * updates from BPF programs. Called from bpf syscall and mostly used with
+ * size 8 or 16 bytes, so ask compiler to inline it.
+ */
+static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
+{
+ const long *lsrc = src;
+ long *ldst = dst;
+
+ size /= sizeof(long);
+ while (size--)
+ *ldst++ = *lsrc++;
+}
+
+/* copy everything but bpf_spin_lock, bpf_timer, and kptrs. There could be one of each. */
+static inline void __copy_map_value(struct bpf_map *map, void *dst, void *src, bool long_memcpy)
{
u32 curr_off = 0;
int i;
if (likely(!map->off_arr)) {
- memcpy(dst, src, map->value_size);
+ if (long_memcpy)
+ bpf_long_memcpy(dst, src, round_up(map->value_size, 8));
+ else
+ memcpy(dst, src, map->value_size);
return;
}
@@ -299,6 +318,17 @@ static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
}
memcpy(dst + curr_off, src + curr_off, map->value_size - curr_off);
}
+
+static inline void copy_map_value(struct bpf_map *map, void *dst, void *src)
+{
+ __copy_map_value(map, dst, src, false);
+}
+
+static inline void copy_map_value_long(struct bpf_map *map, void *dst, void *src)
+{
+ __copy_map_value(map, dst, src, true);
+}
+
void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
bool lock_src);
void bpf_timer_cancel_and_free(void *timer);
@@ -1823,22 +1853,6 @@ int bpf_get_file_flag(int flags);
int bpf_check_uarg_tail_zero(bpfptr_t uaddr, size_t expected_size,
size_t actual_size);
-/* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
- * forced to use 'long' read/writes to try to atomically copy long counters.
- * Best-effort only. No barriers here, since it _will_ race with concurrent
- * updates from BPF programs. Called from bpf syscall and mostly used with
- * size 8 or 16 bytes, so ask compiler to inline it.
- */
-static inline void bpf_long_memcpy(void *dst, const void *src, u32 size)
-{
- const long *lsrc = src;
- long *ldst = dst;
-
- size /= sizeof(long);
- while (size--)
- *ldst++ = *lsrc++;
-}
-
/* verify correctness of eBPF program */
int bpf_check(struct bpf_prog **fp, union bpf_attr *attr, bpfptr_t uattr);
--
2.34.1
* [PATCH RFC bpf-next v1 02/32] bpf: Support kptrs in percpu arraymap
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 01/32] bpf: Add copy_map_value_long to copy to remote percpu memory Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 03/32] bpf: Add zero_map_value to zero map value with special fields Kumar Kartikeya Dwivedi
` (29 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Enable support for kptrs in percpu BPF arraymap by wiring up the freeing
of these kptrs from percpu map elements.
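For context, this is the kind of usage the change enables, sketched with the
test kfuncs and the __kptr_ref BTF type tag macro from the selftests (kfunc
extern declarations and includes omitted):

struct map_value {
        struct prog_test_ref_kfunc __kptr_ref *ptr;
};

struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct map_value);
} pcpu_map SEC(".maps");

SEC("tc")
int store_kptr(struct __sk_buff *ctx)
{
        struct prog_test_ref_kfunc *p, *old;
        struct map_value *v;
        unsigned long sp = 0;
        int key = 0;

        v = bpf_map_lookup_elem(&pcpu_map, &key);       /* this CPU's copy */
        if (!v)
                return 0;
        p = bpf_kfunc_call_test_acquire(&sp);
        if (!p)
                return 0;
        old = bpf_kptr_xchg(&v->ptr, p);                /* the map now owns p */
        if (old)
                bpf_kfunc_call_test_release(old);
        return 0;
}

Any kptr still stored in an element when the map is destroyed is released by
the map_free path wired up here.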
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/arraymap.c | 33 ++++++++++++++++++++++++---------
kernel/bpf/syscall.c | 3 ++-
2 files changed, 26 insertions(+), 10 deletions(-)
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 624527401d4d..832b2659e96e 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -279,7 +279,8 @@ int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value)
rcu_read_lock();
pptr = array->pptrs[index & array->index_mask];
for_each_possible_cpu(cpu) {
- bpf_long_memcpy(value + off, per_cpu_ptr(pptr, cpu), size);
+ copy_map_value_long(map, value + off, per_cpu_ptr(pptr, cpu));
+ check_and_init_map_value(map, value + off);
off += size;
}
rcu_read_unlock();
@@ -338,8 +339,9 @@ static int array_map_update_elem(struct bpf_map *map, void *key, void *value,
return -EINVAL;
if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
- memcpy(this_cpu_ptr(array->pptrs[index & array->index_mask]),
- value, map->value_size);
+ val = this_cpu_ptr(array->pptrs[index & array->index_mask]);
+ copy_map_value(map, val, value);
+ check_and_free_fields(array, val);
} else {
val = array->value +
(u64)array->elem_size * (index & array->index_mask);
@@ -383,7 +385,8 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
rcu_read_lock();
pptr = array->pptrs[index & array->index_mask];
for_each_possible_cpu(cpu) {
- bpf_long_memcpy(per_cpu_ptr(pptr, cpu), value + off, size);
+ copy_map_value_long(map, per_cpu_ptr(pptr, cpu), value + off);
+ check_and_free_fields(array, per_cpu_ptr(pptr, cpu));
off += size;
}
rcu_read_unlock();
@@ -421,8 +424,20 @@ static void array_map_free(struct bpf_map *map)
int i;
if (map_value_has_kptrs(map)) {
- for (i = 0; i < array->map.max_entries; i++)
- bpf_map_free_kptrs(map, array_map_elem_ptr(array, i));
+ if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
+ for (i = 0; i < array->map.max_entries; i++) {
+ void __percpu *pptr = array->pptrs[i & array->index_mask];
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ bpf_map_free_kptrs(map, per_cpu_ptr(pptr, cpu));
+ cond_resched();
+ }
+ }
+ } else {
+ for (i = 0; i < array->map.max_entries; i++)
+ bpf_map_free_kptrs(map, array_map_elem_ptr(array, i));
+ }
bpf_map_free_kptr_off_tab(map);
}
@@ -608,9 +623,9 @@ static int __bpf_array_map_seq_show(struct seq_file *seq, void *v)
pptr = v;
size = array->elem_size;
for_each_possible_cpu(cpu) {
- bpf_long_memcpy(info->percpu_value_buf + off,
- per_cpu_ptr(pptr, cpu),
- size);
+ copy_map_value_long(map, info->percpu_value_buf + off,
+ per_cpu_ptr(pptr, cpu));
+ check_and_init_map_value(map, info->percpu_value_buf + off);
off += size;
}
ctx.value = info->percpu_value_buf;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4e9d4622aef7..723699263a62 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1046,7 +1046,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
}
if (map->map_type != BPF_MAP_TYPE_HASH &&
map->map_type != BPF_MAP_TYPE_LRU_HASH &&
- map->map_type != BPF_MAP_TYPE_ARRAY) {
+ map->map_type != BPF_MAP_TYPE_ARRAY &&
+ map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY) {
ret = -EOPNOTSUPP;
goto free_map_tab;
}
--
2.34.1
* [PATCH RFC bpf-next v1 03/32] bpf: Add zero_map_value to zero map value with special fields
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 01/32] bpf: Add copy_map_value_long to copy to remote percpu memory Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 02/32] bpf: Support kptrs in percpu arraymap Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 04/32] bpf: Support kptrs in percpu hashmap and percpu LRU hashmap Kumar Kartikeya Dwivedi
` (28 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
We need this helper to skip over special fields (bpf_spin_lock,
bpf_timer, kptrs) while zeroing a map value. Use the same logic as
copy_map_value, but use memset instead of memcpy.
Currently, the code zeroing map value memory does not have to deal with
special fields, hence this is a prerequisite for introducing such
support.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a6a0c0025b46..cdc0a8c1b1d1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -329,6 +329,25 @@ static inline void copy_map_value_long(struct bpf_map *map, void *dst, void *src
__copy_map_value(map, dst, src, true);
}
+static inline void zero_map_value(struct bpf_map *map, void *dst)
+{
+ u32 curr_off = 0;
+ int i;
+
+ if (likely(!map->off_arr)) {
+ memset(dst, 0, map->value_size);
+ return;
+ }
+
+ for (i = 0; i < map->off_arr->cnt; i++) {
+ u32 next_off = map->off_arr->field_off[i];
+
+ memset(dst + curr_off, 0, next_off - curr_off);
+ curr_off += map->off_arr->field_sz[i];
+ }
+ memset(dst + curr_off, 0, map->value_size - curr_off);
+}
+
void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
bool lock_src);
void bpf_timer_cancel_and_free(void *timer);
--
2.34.1
* [PATCH RFC bpf-next v1 04/32] bpf: Support kptrs in percpu hashmap and percpu LRU hashmap
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (2 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 03/32] bpf: Add zero_map_value to zero map value with special fields Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 05/32] bpf: Support kptrs in local storage maps Kumar Kartikeya Dwivedi
` (27 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Enable support for kptrs in percpu BPF hashmap and percpu BPF LRU
hashmap by wiring up the freeing of these kptrs from percpu map
elements.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/hashtab.c | 70 ++++++++++++++++++++++++++++----------------
kernel/bpf/syscall.c | 2 ++
2 files changed, 46 insertions(+), 26 deletions(-)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index eb1263f03e9b..bb3f8a63c221 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -282,8 +282,18 @@ static void htab_free_prealloced_kptrs(struct bpf_htab *htab)
struct htab_elem *elem;
elem = get_htab_elem(htab, i);
- bpf_map_free_kptrs(&htab->map, elem->key + round_up(htab->map.key_size, 8));
- cond_resched();
+ if (htab_is_percpu(htab)) {
+ void __percpu *pptr = htab_elem_get_ptr(elem, htab->map.key_size);
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ bpf_map_free_kptrs(&htab->map, per_cpu_ptr(pptr, cpu));
+ cond_resched();
+ }
+ } else {
+ bpf_map_free_kptrs(&htab->map, elem->key + round_up(htab->map.key_size, 8));
+ cond_resched();
+ }
}
}
@@ -761,8 +771,17 @@ static void check_and_free_fields(struct bpf_htab *htab,
if (map_value_has_timer(&htab->map))
bpf_timer_cancel_and_free(map_value + htab->map.timer_off);
- if (map_value_has_kptrs(&htab->map))
- bpf_map_free_kptrs(&htab->map, map_value);
+ if (map_value_has_kptrs(&htab->map)) {
+ if (htab_is_percpu(htab)) {
+ void __percpu *pptr = htab_elem_get_ptr(elem, htab->map.key_size);
+ int cpu;
+
+ for_each_possible_cpu(cpu)
+ bpf_map_free_kptrs(&htab->map, per_cpu_ptr(pptr, cpu));
+ } else {
+ bpf_map_free_kptrs(&htab->map, map_value);
+ }
+ }
}
/* It is called from the bpf_lru_list when the LRU needs to delete
@@ -859,9 +878,9 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
{
+ check_and_free_fields(htab, l);
if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
free_percpu(htab_elem_get_ptr(l, htab->map.key_size));
- check_and_free_fields(htab, l);
kfree(l);
}
@@ -903,14 +922,13 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
{
if (!onallcpus) {
/* copy true value_size bytes */
- memcpy(this_cpu_ptr(pptr), value, htab->map.value_size);
+ copy_map_value(&htab->map, this_cpu_ptr(pptr), value);
} else {
u32 size = round_up(htab->map.value_size, 8);
int off = 0, cpu;
for_each_possible_cpu(cpu) {
- bpf_long_memcpy(per_cpu_ptr(pptr, cpu),
- value + off, size);
+ copy_map_value_long(&htab->map, per_cpu_ptr(pptr, cpu), value + off);
off += size;
}
}
@@ -926,16 +944,16 @@ static void pcpu_init_value(struct bpf_htab *htab, void __percpu *pptr,
* (onallcpus=false always when coming from bpf prog).
*/
if (htab_is_prealloc(htab) && !onallcpus) {
- u32 size = round_up(htab->map.value_size, 8);
int current_cpu = raw_smp_processor_id();
int cpu;
for_each_possible_cpu(cpu) {
- if (cpu == current_cpu)
- bpf_long_memcpy(per_cpu_ptr(pptr, cpu), value,
- size);
- else
- memset(per_cpu_ptr(pptr, cpu), 0, size);
+ if (cpu == current_cpu) {
+ copy_map_value_long(&htab->map, per_cpu_ptr(pptr, cpu), value);
+ } else {
+ /* Since elem is preallocated, we cannot touch special fields */
+ zero_map_value(&htab->map, per_cpu_ptr(pptr, cpu));
+ }
}
} else {
pcpu_copy_value(htab, pptr, value, onallcpus);
@@ -993,8 +1011,9 @@ static struct htab_elem *alloc_htab_elem(struct bpf_htab *htab, void *key,
l_new = ERR_PTR(-ENOMEM);
goto dec_count;
}
- check_and_init_map_value(&htab->map,
- l_new->key + round_up(key_size, 8));
+
+ if (!percpu)
+ check_and_init_map_value(&htab->map, l_new->key + round_up(key_size, 8));
}
memcpy(l_new->key, key, key_size);
@@ -1562,9 +1581,8 @@ static int __htab_map_lookup_and_delete_elem(struct bpf_map *map, void *key,
pptr = htab_elem_get_ptr(l, key_size);
for_each_possible_cpu(cpu) {
- bpf_long_memcpy(value + off,
- per_cpu_ptr(pptr, cpu),
- roundup_value_size);
+ copy_map_value_long(&htab->map, value + off, per_cpu_ptr(pptr, cpu));
+ check_and_init_map_value(&htab->map, value + off);
off += roundup_value_size;
}
} else {
@@ -1758,8 +1776,8 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
pptr = htab_elem_get_ptr(l, map->key_size);
for_each_possible_cpu(cpu) {
- bpf_long_memcpy(dst_val + off,
- per_cpu_ptr(pptr, cpu), size);
+ copy_map_value_long(&htab->map, dst_val + off, per_cpu_ptr(pptr, cpu));
+ check_and_init_map_value(&htab->map, dst_val + off);
off += size;
}
} else {
@@ -2031,9 +2049,9 @@ static int __bpf_hash_map_seq_show(struct seq_file *seq, struct htab_elem *elem)
roundup_value_size = round_up(map->value_size, 8);
pptr = htab_elem_get_ptr(elem, map->key_size);
for_each_possible_cpu(cpu) {
- bpf_long_memcpy(info->percpu_value_buf + off,
- per_cpu_ptr(pptr, cpu),
- roundup_value_size);
+ copy_map_value_long(map, info->percpu_value_buf + off,
+ per_cpu_ptr(pptr, cpu));
+ check_and_init_map_value(map, info->percpu_value_buf + off);
off += roundup_value_size;
}
ctx.value = info->percpu_value_buf;
@@ -2277,8 +2295,8 @@ int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value)
*/
pptr = htab_elem_get_ptr(l, map->key_size);
for_each_possible_cpu(cpu) {
- bpf_long_memcpy(value + off,
- per_cpu_ptr(pptr, cpu), size);
+ copy_map_value_long(map, value + off, per_cpu_ptr(pptr, cpu));
+ check_and_init_map_value(map, value + off);
off += size;
}
ret = 0;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 723699263a62..3214bab5b462 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1045,7 +1045,9 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
goto free_map_tab;
}
if (map->map_type != BPF_MAP_TYPE_HASH &&
+ map->map_type != BPF_MAP_TYPE_PERCPU_HASH &&
map->map_type != BPF_MAP_TYPE_LRU_HASH &&
+ map->map_type != BPF_MAP_TYPE_LRU_PERCPU_HASH &&
map->map_type != BPF_MAP_TYPE_ARRAY &&
map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY) {
ret = -EOPNOTSUPP;
--
2.34.1
* [PATCH RFC bpf-next v1 05/32] bpf: Support kptrs in local storage maps
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (3 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 04/32] bpf: Support kptrs in percpu hashmap and percpu LRU hashmap Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-07 19:00 ` Alexei Starovoitov
2022-09-09 5:27 ` Martin KaFai Lau
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 06/32] bpf: Annotate data races in bpf_local_storage Kumar Kartikeya Dwivedi
` (26 subsequent siblings)
31 siblings, 2 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, KP Singh, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Dave Marchevsky, Delyan Kratunov
Enable support for kptrs in local storage maps by wiring up the freeing
of these kptrs from map value.
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: KP Singh <kpsingh@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf_local_storage.h | 2 +-
kernel/bpf/bpf_local_storage.c | 33 +++++++++++++++++++++++++++----
kernel/bpf/syscall.c | 5 ++++-
kernel/bpf/verifier.c | 9 ++++++---
4 files changed, 40 insertions(+), 9 deletions(-)
diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
index 7ea18d4da84b..6786d00f004e 100644
--- a/include/linux/bpf_local_storage.h
+++ b/include/linux/bpf_local_storage.h
@@ -74,7 +74,7 @@ struct bpf_local_storage_elem {
struct hlist_node snode; /* Linked to bpf_local_storage */
struct bpf_local_storage __rcu *local_storage;
struct rcu_head rcu;
- /* 8 bytes hole */
+ struct bpf_map *map; /* Only set for bpf_selem_free_rcu */
/* The data is stored in another cacheline to minimize
* the number of cachelines access during a cache hit.
*/
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 802fc15b0d73..4a725379d761 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -74,7 +74,8 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
gfp_flags | __GFP_NOWARN);
if (selem) {
if (value)
- memcpy(SDATA(selem)->data, value, smap->map.value_size);
+ copy_map_value(&smap->map, SDATA(selem)->data, value);
+ /* No call to check_and_init_map_value as memory is zero init */
return selem;
}
@@ -92,12 +93,27 @@ void bpf_local_storage_free_rcu(struct rcu_head *rcu)
kfree_rcu(local_storage, rcu);
}
+static void check_and_free_fields(struct bpf_local_storage_elem *selem)
+{
+ if (map_value_has_kptrs(selem->map))
+ bpf_map_free_kptrs(selem->map, SDATA(selem));
+}
+
static void bpf_selem_free_rcu(struct rcu_head *rcu)
{
struct bpf_local_storage_elem *selem;
selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
- kfree_rcu(selem, rcu);
+ check_and_free_fields(selem);
+ kfree(selem);
+}
+
+static void bpf_selem_free_tasks_trace_rcu(struct rcu_head *rcu)
+{
+ struct bpf_local_storage_elem *selem;
+
+ selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
+ call_rcu(&selem->rcu, bpf_selem_free_rcu);
}
/* local_storage->lock must be held and selem->local_storage == local_storage.
@@ -150,10 +166,11 @@ bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
SDATA(selem))
RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
+ selem->map = &smap->map;
if (use_trace_rcu)
- call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
+ call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
else
- kfree_rcu(selem, rcu);
+ call_rcu(&selem->rcu, bpf_selem_free_rcu);
return free_local_storage;
}
@@ -581,6 +598,14 @@ void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
*/
synchronize_rcu();
+ /* When local storage map has kptrs, the call_rcu callback accesses
+ * kptr_off_tab, hence we need the bpf_selem_free_rcu callbacks to
+ * finish before we free it.
+ */
+ if (map_value_has_kptrs(&smap->map)) {
+ rcu_barrier();
+ bpf_map_free_kptr_off_tab(&smap->map);
+ }
kvfree(smap->buckets);
bpf_map_area_free(smap);
}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 3214bab5b462..0311acca19f6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1049,7 +1049,10 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
map->map_type != BPF_MAP_TYPE_LRU_HASH &&
map->map_type != BPF_MAP_TYPE_LRU_PERCPU_HASH &&
map->map_type != BPF_MAP_TYPE_ARRAY &&
- map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY) {
+ map->map_type != BPF_MAP_TYPE_PERCPU_ARRAY &&
+ map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
+ map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
+ map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
ret = -EOPNOTSUPP;
goto free_map_tab;
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0194a36d0b36..b7bf68f3b2ec 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6276,17 +6276,20 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
break;
case BPF_MAP_TYPE_SK_STORAGE:
if (func_id != BPF_FUNC_sk_storage_get &&
- func_id != BPF_FUNC_sk_storage_delete)
+ func_id != BPF_FUNC_sk_storage_delete &&
+ func_id != BPF_FUNC_kptr_xchg)
goto error;
break;
case BPF_MAP_TYPE_INODE_STORAGE:
if (func_id != BPF_FUNC_inode_storage_get &&
- func_id != BPF_FUNC_inode_storage_delete)
+ func_id != BPF_FUNC_inode_storage_delete &&
+ func_id != BPF_FUNC_kptr_xchg)
goto error;
break;
case BPF_MAP_TYPE_TASK_STORAGE:
if (func_id != BPF_FUNC_task_storage_get &&
- func_id != BPF_FUNC_task_storage_delete)
+ func_id != BPF_FUNC_task_storage_delete &&
+ func_id != BPF_FUNC_kptr_xchg)
goto error;
break;
case BPF_MAP_TYPE_BLOOM_FILTER:
--
2.34.1
* [PATCH RFC bpf-next v1 06/32] bpf: Annotate data races in bpf_local_storage
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (4 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 05/32] bpf: Support kptrs in local storage maps Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 07/32] bpf: Allow specifying volatile type modifier for kptrs Kumar Kartikeya Dwivedi
` (25 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, KP Singh, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Dave Marchevsky, Delyan Kratunov
There are a few cases where an hlist_node is checked to be unhashed without
holding the lock protecting its modification. In such cases, one must use
hlist_unhashed_lockless to avoid load tearing and KCSAN reports. Fix
this by using the lockless variant in the places not protected by the lock.
Since this is not prompted by any actual KCSAN report but only by code
review, I have not included a Fixes tag.
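For reference, the only difference between the two helpers in
include/linux/list.h is the READ_ONCE(), which is what prevents the load from
being torn:

static inline int hlist_unhashed(const struct hlist_node *h)
{
        return !h->pprev;
}

/* Lockless variant, usable without holding the lock that serializes writers
 * to the node.
 */
static inline int hlist_unhashed_lockless(const struct hlist_node *h)
{
        return !READ_ONCE(h->pprev);
}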
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: KP Singh <kpsingh@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/bpf_local_storage.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 4a725379d761..58cb0c179097 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -51,11 +51,21 @@ owner_storage(struct bpf_local_storage_map *smap, void *owner)
return map->ops->map_owner_storage_ptr(owner);
}
+static bool selem_linked_to_storage_lockless(const struct bpf_local_storage_elem *selem)
+{
+ return !hlist_unhashed_lockless(&selem->snode);
+}
+
static bool selem_linked_to_storage(const struct bpf_local_storage_elem *selem)
{
return !hlist_unhashed(&selem->snode);
}
+static bool selem_linked_to_map_lockless(const struct bpf_local_storage_elem *selem)
+{
+ return !hlist_unhashed_lockless(&selem->map_node);
+}
+
static bool selem_linked_to_map(const struct bpf_local_storage_elem *selem)
{
return !hlist_unhashed(&selem->map_node);
@@ -182,7 +192,7 @@ static void __bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem,
bool free_local_storage = false;
unsigned long flags;
- if (unlikely(!selem_linked_to_storage(selem)))
+ if (unlikely(!selem_linked_to_storage_lockless(selem)))
/* selem has already been unlinked from sk */
return;
@@ -216,7 +226,7 @@ void bpf_selem_unlink_map(struct bpf_local_storage_elem *selem)
struct bpf_local_storage_map_bucket *b;
unsigned long flags;
- if (unlikely(!selem_linked_to_map(selem)))
+ if (unlikely(!selem_linked_to_map_lockless(selem)))
/* selem has already be unlinked from smap */
return;
@@ -427,7 +437,7 @@ bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap,
err = check_flags(old_sdata, map_flags);
if (err)
return ERR_PTR(err);
- if (old_sdata && selem_linked_to_storage(SELEM(old_sdata))) {
+ if (old_sdata && selem_linked_to_storage_lockless(SELEM(old_sdata))) {
copy_map_value_locked(&smap->map, old_sdata->data,
value, false);
return old_sdata;
--
2.34.1
* [PATCH RFC bpf-next v1 07/32] bpf: Allow specifying volatile type modifier for kptrs
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (5 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 06/32] bpf: Annotate data races in bpf_local_storage Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 08/32] bpf: Add comment about kptr's PTR_TO_MAP_VALUE handling Kumar Kartikeya Dwivedi
` (24 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
This is useful in particular to mark the pointer as volatile, so that the
compiler treats each load and store to the field as a volatile access.
The alternative is having to define and use READ_ONCE and WRITE_ONCE in
the BPF program.
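As an example of the declaration this permits (a sketch; __kptr is the BTF type
tag macro used by the selftests, and the qualifier applies to the pointer
itself):

struct map_value {
        /* Without 'volatile', the program would need READ_ONCE()/WRITE_ONCE()
         * around every access to this field for the same effect.
         */
        struct prog_test_ref_kfunc __kptr *volatile ptr;
};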
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/btf.h | 5 +++++
kernel/bpf/btf.c | 3 +++
2 files changed, 8 insertions(+)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index ad93c2d9cc1c..b3d47e9b9d5c 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -279,6 +279,11 @@ static inline bool btf_type_is_typedef(const struct btf_type *t)
return BTF_INFO_KIND(t->info) == BTF_KIND_TYPEDEF;
}
+static inline bool btf_type_is_volatile(const struct btf_type *t)
+{
+ return BTF_INFO_KIND(t->info) == BTF_KIND_VOLATILE;
+}
+
static inline bool btf_type_is_func(const struct btf_type *t)
{
return BTF_INFO_KIND(t->info) == BTF_KIND_FUNC;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 903719b89238..5e860f76595c 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3225,6 +3225,9 @@ static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
enum bpf_kptr_type type;
u32 res_id;
+ /* Permit modifiers on the pointer itself */
+ if (btf_type_is_volatile(t))
+ t = btf_type_by_id(btf, t->type);
/* For PTR, sz is always == 8 */
if (!btf_type_is_ptr(t))
return BTF_FIELD_IGNORE;
--
2.34.1
* [PATCH RFC bpf-next v1 08/32] bpf: Add comment about kptr's PTR_TO_MAP_VALUE handling
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (6 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 07/32] bpf: Allow specifying volatile type modifier for kptrs Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 09/32] bpf: Rewrite kfunc argument handling Kumar Kartikeya Dwivedi
` (23 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
In both process_kptr_func and kptr_get handling for kfuncs, we expect
PTR_TO_MAP_VALUE with a constant var_off and optionally fixed off, which
in turn points to the kptr in the map value. We know that if we find
such offset in the kptr_off_tab it will be < value_size.
Hence, we skip checking the memory region access. Once establishing that
it is a kptr we also don't need to check whether the map value pointer
touches any other special fields for [ptr, ptr+8) region we are about to
access.
Finally, for check_map_access_type, we already ensure that neither the
BPF_F_RDONLY_PROG nor the BPF_F_WRONLY_PROG flag can be set for a map
containing kptrs. Hence, checking that is also not required.
Encode all these implicit assumptions as comments where such checks are
made, so that any future changes to these take the kptr related
invariants into consideration, and avoid introducing bugs accidentally.
All this information was also clarified in the commit adding kptr
support, 61df10c7799e ("bpf: Allow storing unreferenced kptr in map").
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/verifier.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b7bf68f3b2ec..0c19a98c748d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5196,6 +5196,11 @@ static int check_helper_mem_access(struct bpf_verifier_env *env, int regno,
return check_mem_region_access(env, regno, reg->off, access_size,
reg->map_ptr->key_size, false);
case PTR_TO_MAP_VALUE:
+ /* process_kptr_func and kptr_get assume only map_access_type
+ * and special field access is checked for PTR_TO_MAP_VALUE,
+ * apart from verifying memory region access, hence they must be
+ * revisited when that assumption changes here.
+ */
if (check_map_access_type(env, regno, reg->off, access_size,
meta && meta->raw_mode ? BPF_WRITE :
BPF_READ))
--
2.34.1
* [PATCH RFC bpf-next v1 09/32] bpf: Rewrite kfunc argument handling
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (7 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 08/32] bpf: Add comment about kptr's PTR_TO_MAP_VALUE handling Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 10/32] bpf: Drop kfunc support from btf_check_func_arg_match Kumar Kartikeya Dwivedi
` (22 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
As we continue to add more features, argument types, kfunc flags, and
different extensions to kfuncs, the code to verify the correctness of
the kfunc prototype wrt the passed-in registers has become ad-hoc and
ugly to read.
To make life easier, and make a very clear split between different
stages of argument processing, move all the code into verifier.c and
refactor into easier to read helpers and functions.
This also makes it easier to share code between kfunc argument processing and
the rest of the verifier. This will be more and more useful in later patches,
as we are now moving to implement very core BPF helpers as kfuncs, to
keep them experimental before baking them into UAPI.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/btf.h | 22 +
kernel/bpf/btf.c | 12 +-
kernel/bpf/verifier.c | 426 ++++++++++++++++++-
tools/testing/selftests/bpf/verifier/calls.c | 2 +-
4 files changed, 438 insertions(+), 24 deletions(-)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index b3d47e9b9d5c..8062f9da7c40 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -314,6 +314,16 @@ static inline bool btf_type_is_struct(const struct btf_type *t)
return kind == BTF_KIND_STRUCT || kind == BTF_KIND_UNION;
}
+static inline bool __btf_type_is_struct(const struct btf_type *t)
+{
+ return BTF_INFO_KIND(t->info) == BTF_KIND_STRUCT;
+}
+
+static inline bool btf_type_is_array(const struct btf_type *t)
+{
+ return BTF_INFO_KIND(t->info) == BTF_KIND_ARRAY;
+}
+
static inline u16 btf_type_vlen(const struct btf_type *t)
{
return BTF_INFO_VLEN(t->info);
@@ -400,6 +410,7 @@ static inline struct btf_param *btf_params(const struct btf_type *t)
#ifdef CONFIG_BPF_SYSCALL
struct bpf_prog;
+struct bpf_verifier_log;
const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
const char *btf_name_by_offset(const struct btf *btf, u32 offset);
@@ -413,6 +424,10 @@ int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id);
int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_cnt,
struct module *owner);
+const struct btf_member *
+btf_get_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf,
+ const struct btf_type *t, enum bpf_prog_type prog_type,
+ int arg);
#else
static inline const struct btf_type *btf_type_by_id(const struct btf *btf,
u32 type_id)
@@ -444,6 +459,13 @@ static inline int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dt
{
return 0;
}
+static inline const struct btf_member *
+btf_get_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf,
+ const struct btf_type *t, enum bpf_prog_type prog_type,
+ int arg)
+{
+ return NULL;
+}
#endif
#endif
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 5e860f76595c..0ad809a3055d 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -477,16 +477,6 @@ static bool btf_type_nosize_or_null(const struct btf_type *t)
return !t || btf_type_nosize(t);
}
-static bool __btf_type_is_struct(const struct btf_type *t)
-{
- return BTF_INFO_KIND(t->info) == BTF_KIND_STRUCT;
-}
-
-static bool btf_type_is_array(const struct btf_type *t)
-{
- return BTF_INFO_KIND(t->info) == BTF_KIND_ARRAY;
-}
-
static bool btf_type_is_datasec(const struct btf_type *t)
{
return BTF_INFO_KIND(t->info) == BTF_KIND_DATASEC;
@@ -5089,7 +5079,7 @@ static u8 bpf_ctx_convert_map[] = {
#undef BPF_MAP_TYPE
#undef BPF_LINK_TYPE
-static const struct btf_member *
+const struct btf_member *
btf_get_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf,
const struct btf_type *t, enum bpf_prog_type prog_type,
int arg)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0c19a98c748d..663c91020f82 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7578,6 +7578,401 @@ static void mark_btf_func_reg_size(struct bpf_verifier_env *env, u32 regno,
}
}
+struct bpf_kfunc_arg_meta {
+ /* In parameters */
+ struct btf *btf;
+ u32 func_id;
+ u32 kfunc_flags;
+ const struct btf_type *func_proto;
+ const char *func_name;
+ /* Out parameters */
+ u32 ref_obj_id;
+ u8 release_regno;
+};
+
+static bool is_kfunc_acquire(struct bpf_kfunc_arg_meta *meta)
+{
+ return meta->kfunc_flags & KF_ACQUIRE;
+}
+
+static bool is_kfunc_ret_null(struct bpf_kfunc_arg_meta *meta)
+{
+ return meta->kfunc_flags & KF_RET_NULL;
+}
+
+static bool is_kfunc_release(struct bpf_kfunc_arg_meta *meta)
+{
+ return meta->kfunc_flags & KF_RELEASE;
+}
+
+static bool is_kfunc_trusted_args(struct bpf_kfunc_arg_meta *meta)
+{
+ return meta->kfunc_flags & KF_TRUSTED_ARGS;
+}
+
+static bool is_kfunc_sleepable(struct bpf_kfunc_arg_meta *meta)
+{
+ return meta->kfunc_flags & KF_SLEEPABLE;
+}
+
+static bool is_kfunc_destructive(struct bpf_kfunc_arg_meta *meta)
+{
+ return meta->kfunc_flags & KF_DESTRUCTIVE;
+}
+
+static bool is_kfunc_arg_kptr_get(struct bpf_kfunc_arg_meta *meta, int arg)
+{
+ return arg == 0 && (meta->kfunc_flags & KF_KPTR_GET);
+}
+
+static bool is_kfunc_arg_mem_size(const struct btf *btf,
+ const struct btf_param *arg,
+ const struct bpf_reg_state *reg)
+{
+ int len, sfx_len = sizeof("__sz") - 1;
+ const struct btf_type *t;
+ const char *param_name;
+
+ t = btf_type_skip_modifiers(btf, arg->type, NULL);
+ if (!btf_type_is_scalar(t) || reg->type != SCALAR_VALUE)
+ return false;
+
+ /* In the future, this can be ported to use BTF tagging */
+ param_name = btf_name_by_offset(btf, arg->name_off);
+ if (str_is_empty(param_name))
+ return false;
+ len = strlen(param_name);
+ if (len < sfx_len)
+ return false;
+ param_name += len - sfx_len;
+ if (strncmp(param_name, "__sz", sfx_len))
+ return false;
+
+ return true;
+}
+
+/* Returns true if struct is composed of scalars, 4 levels of nesting allowed */
+static bool __btf_type_is_scalar_struct(struct bpf_verifier_env *env,
+ const struct btf *btf,
+ const struct btf_type *t, int rec)
+{
+ const struct btf_type *member_type;
+ const struct btf_member *member;
+ u32 i;
+
+ if (!btf_type_is_struct(t))
+ return false;
+
+ for_each_member(i, t, member) {
+ const struct btf_array *array;
+
+ member_type = btf_type_skip_modifiers(btf, member->type, NULL);
+ if (btf_type_is_struct(member_type)) {
+ if (rec >= 3) {
+ verbose(env, "max struct nesting depth exceeded\n");
+ return false;
+ }
+ if (!__btf_type_is_scalar_struct(env, btf, member_type, rec + 1))
+ return false;
+ continue;
+ }
+ if (btf_type_is_array(member_type)) {
+ array = btf_array(member_type);
+ if (!array->nelems)
+ return false;
+ member_type = btf_type_skip_modifiers(btf, array->type, NULL);
+ if (!btf_type_is_scalar(member_type))
+ return false;
+ continue;
+ }
+ if (!btf_type_is_scalar(member_type))
+ return false;
+ }
+ return true;
+}
+
+
+static u32 *reg2btf_ids[__BPF_REG_TYPE_MAX] = {
+#ifdef CONFIG_NET
+ [PTR_TO_SOCKET] = &btf_sock_ids[BTF_SOCK_TYPE_SOCK],
+ [PTR_TO_SOCK_COMMON] = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],
+ [PTR_TO_TCP_SOCK] = &btf_sock_ids[BTF_SOCK_TYPE_TCP],
+#endif
+};
+
+enum kfunc_ptr_arg_types {
+ KF_ARG_PTR_TO_CTX,
+ KF_ARG_PTR_TO_BTF_ID, /* Also covers reg2btf_ids conversions */
+ KF_ARG_PTR_TO_KPTR_STRONG, /* PTR_TO_KPTR but type specific */
+ KF_ARG_PTR_TO_MEM,
+ KF_ARG_PTR_TO_MEM_SIZE, /* Size derived from next argument, skip it */
+};
+
+enum kfunc_ptr_arg_types get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
+ struct bpf_kfunc_arg_meta *meta,
+ const struct btf_type *t,
+ const struct btf_type *ref_t,
+ const char *ref_tname,
+ const struct btf_param *args,
+ int argno, int nargs)
+{
+ struct bpf_reg_state *regs = cur_regs(env);
+ bool arg_mem_size = false;
+ u32 regno = argno + 1;
+ struct bpf_reg_state *reg = &regs[regno];
+
+ /* In this function, we verify the kfunc's BTF as per the argument type,
+ * leaving the rest of the verification with respect to the register
+ * type to our caller. When a set of conditions hold in the BTF type of
+ * arguments, we resolve it to a known kfunc_ptr_arg_types enum
+ * constant.
+ */
+ if (is_kfunc_arg_kptr_get(meta, argno)) {
+ if (!btf_type_is_ptr(ref_t)) {
+ verbose(env, "arg#0 BTF type must be a double pointer for kptr_get kfunc\n");
+ return -EINVAL;
+ }
+ ref_t = btf_type_by_id(meta->btf, ref_t->type);
+ ref_tname = btf_name_by_offset(meta->btf, ref_t->name_off);
+ if (!btf_type_is_struct(ref_t)) {
+ verbose(env, "kernel function %s args#0 pointer type %s %s is not supported\n",
+ meta->func_name, btf_type_str(ref_t), ref_tname);
+ return -EINVAL;
+ }
+ return KF_ARG_PTR_TO_KPTR_STRONG;
+ }
+
+ if (btf_get_prog_ctx_type(&env->log, meta->btf, t, resolve_prog_type(env->prog), argno))
+ return KF_ARG_PTR_TO_CTX;
+
+ if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
+ if (!btf_type_is_struct(ref_t)) {
+ verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
+ meta->func_name, argno, btf_type_str(ref_t), ref_tname);
+ return -EINVAL;
+ }
+ return KF_ARG_PTR_TO_BTF_ID;
+ }
+
+ if (argno + 1 < nargs && is_kfunc_arg_mem_size(meta->btf, &args[argno + 1], ®s[regno + 1]))
+ arg_mem_size = true;
+
+ /* Permit pointer to mem, but only when argument
+ * type is pointer to scalar, or struct composed
+ * (recursively) of scalars.
+ * When arg_mem_size is true, the pointer can be
+ * void *.
+ */
+ if (!btf_type_is_scalar(ref_t) && !__btf_type_is_scalar_struct(env, meta->btf, ref_t, 0) &&
+ (arg_mem_size ? !btf_type_is_void(ref_t) : 1)) {
+ verbose(env, "arg#%d pointer type %s %s must point to %sscalar, or struct with scalar\n",
+ argno, btf_type_str(ref_t), ref_tname, arg_mem_size ? "void, " : "");
+ return -EINVAL;
+ }
+ return arg_mem_size ? KF_ARG_PTR_TO_MEM_SIZE : KF_ARG_PTR_TO_MEM;
+}
+
+static int process_kf_arg_ptr_to_btf_id(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg,
+ const struct btf_type *ref_t,
+ const char *ref_tname, u32 ref_id,
+ struct bpf_kfunc_arg_meta *meta,
+ int argno)
+{
+ const struct btf_type *reg_ref_t;
+ bool strict_type_match = false;
+ const struct btf *reg_btf;
+ const char *reg_ref_tname;
+ u32 reg_ref_id;
+
+ if (reg->type == PTR_TO_BTF_ID) {
+ reg_btf = reg->btf;
+ reg_ref_id = reg->btf_id;
+ } else {
+ reg_btf = btf_vmlinux;
+ reg_ref_id = *reg2btf_ids[base_type(reg->type)];
+ }
+
+ if (is_kfunc_trusted_args(meta) || (is_kfunc_release(meta) && reg->ref_obj_id))
+ strict_type_match = true;
+
+ reg_ref_t = btf_type_skip_modifiers(reg_btf, reg_ref_id, &reg_ref_id);
+ reg_ref_tname = btf_name_by_offset(reg_btf, reg_ref_t->name_off);
+ if (!btf_struct_ids_match(&env->log, reg_btf, reg_ref_id, reg->off, meta->btf, ref_id, strict_type_match)) {
+ verbose(env, "kernel function %s args#%d expected pointer to %s %s but R%d has a pointer to %s %s\n",
+ meta->func_name, argno, btf_type_str(ref_t), ref_tname, argno + 1,
+ btf_type_str(reg_ref_t), reg_ref_tname);
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int process_kf_arg_ptr_to_kptr_strong(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg,
+ const struct btf_type *ref_t,
+ const char *ref_tname,
+ struct bpf_kfunc_arg_meta *meta,
+ int argno)
+{
+ struct bpf_map_value_off_desc *off_desc;
+
+ /* check_func_arg_reg_off allows var_off for
+ * PTR_TO_MAP_VALUE, but we need fixed offset to find
+ * off_desc.
+ */
+ if (!tnum_is_const(reg->var_off)) {
+ verbose(env, "arg#0 must have constant offset\n");
+ return -EINVAL;
+ }
+
+ off_desc = bpf_map_kptr_off_contains(reg->map_ptr, reg->off + reg->var_off.value);
+ if (!off_desc || off_desc->type != BPF_KPTR_REF) {
+ verbose(env, "arg#0 no referenced kptr at map value offset=%llu\n",
+ reg->off + reg->var_off.value);
+ return -EINVAL;
+ }
+
+ if (!btf_struct_ids_match(&env->log, meta->btf, ref_t->type, 0, off_desc->kptr.btf,
+ off_desc->kptr.btf_id, true)) {
+ verbose(env, "kernel function %s args#%d expected pointer to %s %s\n",
+ meta->func_name, argno, btf_type_str(ref_t), ref_tname);
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_arg_meta *meta)
+{
+ const char *func_name = meta->func_name, *ref_tname;
+ const struct btf *btf = meta->btf;
+ const struct btf_param *args;
+ u32 i, nargs;
+ int ret;
+
+ args = (const struct btf_param *)(meta->func_proto + 1);
+ nargs = btf_type_vlen(meta->func_proto);
+ if (nargs > MAX_BPF_FUNC_REG_ARGS) {
+ verbose(env, "Function %s has %d > %d args\n", func_name, nargs,
+ MAX_BPF_FUNC_REG_ARGS);
+ return -EINVAL;
+ }
+
+ if (is_kfunc_sleepable(meta) && !env->prog->aux->sleepable) {
+ verbose(env, "program must be sleepable to call sleepable kfunc %s\n", func_name);
+ return -EINVAL;
+ }
+
+ /* Check that BTF function arguments match actual types that the
+ * verifier sees.
+ */
+ for (i = 0; i < nargs; i++) {
+ struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[i + 1];
+ const struct btf_type *t, *ref_t, *resolve_ret;
+ enum bpf_arg_type arg_type = ARG_DONTCARE;
+ u32 regno = i + 1, ref_id, type_size;
+
+ t = btf_type_skip_modifiers(btf, args[i].type, NULL);
+ if (btf_type_is_scalar(t)) {
+ if (reg->type == SCALAR_VALUE)
+ continue;
+ verbose(env, "R%d is not a scalar\n", regno);
+ return -EINVAL;
+ }
+
+ if (!btf_type_is_ptr(t)) {
+ verbose(env, "Unrecognized arg#%d type %s\n", i, btf_type_str(t));
+ return -EINVAL;
+ }
+
+ /* Check if argument must be a referenced pointer, args + i has
+ * been verified to be a pointer (after skipping modifiers).
+ */
+ if (is_kfunc_trusted_args(meta) && !reg->ref_obj_id) {
+ verbose(env, "R%d must be referenced\n", regno);
+ return -EINVAL;
+ }
+
+ if (reg->ref_obj_id) {
+ if (is_kfunc_release(meta) && meta->ref_obj_id) {
+ verbose(env, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n",
+ regno, reg->ref_obj_id,
+ meta->ref_obj_id);
+ return -EFAULT;
+ }
+ meta->ref_obj_id = reg->ref_obj_id;
+ if (is_kfunc_release(meta))
+ meta->release_regno = regno;
+ }
+
+ /* Trusted args have the same offset checks as release arguments */
+ if (is_kfunc_trusted_args(meta) || (is_kfunc_release(meta) && reg->ref_obj_id))
+ arg_type |= OBJ_RELEASE;
+ ret = check_func_arg_reg_off(env, reg, regno, arg_type);
+ if (ret < 0)
+ return ret;
+
+ ref_t = btf_type_skip_modifiers(btf, t->type, &ref_id);
+ ref_tname = btf_name_by_offset(btf, ref_t->name_off);
+
+ switch (get_kfunc_ptr_arg_type(env, meta, t, ref_t, ref_tname, args, i, nargs)) {
+ case KF_ARG_PTR_TO_CTX:
+ if (reg->type != PTR_TO_CTX) {
+ verbose(env, "arg#%d expected pointer to ctx, but got %s\n", i, btf_type_str(t));
+ return -EINVAL;
+ }
+ break;
+ case KF_ARG_PTR_TO_BTF_ID:
+ /* Only base_type is checked, further checks are done here */
+ if (reg->type != PTR_TO_BTF_ID &&
+ (!reg2btf_ids[base_type(reg->type)] || type_flag(reg->type))) {
+ verbose(env, "arg#%d expected pointer to btf or socket\n", i);
+ return -EINVAL;
+ }
+ ret = process_kf_arg_ptr_to_btf_id(env, reg, ref_t, ref_tname, ref_id, meta, i);
+ if (ret < 0)
+ return ret;
+ break;
+ case KF_ARG_PTR_TO_KPTR_STRONG:
+ if (reg->type != PTR_TO_MAP_VALUE) {
+ verbose(env, "arg#0 expected pointer to map value\n");
+ return -EINVAL;
+ }
+ ret = process_kf_arg_ptr_to_kptr_strong(env, reg, ref_t, ref_tname, meta, i);
+ if (ret < 0)
+ return ret;
+ break;
+ case KF_ARG_PTR_TO_MEM:
+ resolve_ret = btf_resolve_size(btf, ref_t, &type_size);
+ if (IS_ERR(resolve_ret)) {
+ verbose(env, "arg#%d reference type('%s %s') size cannot be determined: %ld\n",
+ i, btf_type_str(ref_t), ref_tname, PTR_ERR(resolve_ret));
+ return -EINVAL;
+ }
+ ret = check_mem_reg(env, reg, regno, type_size);
+ if (ret < 0)
+ return ret;
+ break;
+ case KF_ARG_PTR_TO_MEM_SIZE:
+ ret = check_kfunc_mem_size_reg(env, &regs[regno + 1], regno + 1);
+ if (ret < 0) {
+ verbose(env, "arg#%d arg#%d memory, len pair leads to invalid memory access\n", i, i + 1);
+ return ret;
+ }
+ /* Skip next '__sz' argument */
+ i++;
+ break;
+ }
+ }
+
+ if (is_kfunc_release(meta) && !meta->release_regno) {
+ verbose(env, "release kernel function %s expects refcounted PTR_TO_BTF_ID\n",
+ func_name);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
int *insn_idx_p)
{
@@ -7586,10 +7981,10 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
const char *func_name, *ptr_type_name;
u32 i, nargs, func_id, ptr_type_id;
int err, insn_idx = *insn_idx_p;
+ struct bpf_kfunc_arg_meta meta;
const struct btf_param *args;
struct btf *desc_btf;
u32 *kfunc_flags;
- bool acq;
/* skip for now, but return error when we find this in fixup_kfunc_call */
if (!insn->imm)
@@ -7610,22 +8005,29 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
func_name);
return -EACCES;
}
- if (*kfunc_flags & KF_DESTRUCTIVE && !capable(CAP_SYS_BOOT)) {
- verbose(env, "destructive kfunc calls require CAP_SYS_BOOT capabilities\n");
+
+ /* Prepare kfunc call metadata */
+ memset(&meta, 0, sizeof(meta));
+ meta.btf = desc_btf;
+ meta.func_id = func_id;
+ meta.kfunc_flags = *kfunc_flags;
+ meta.func_proto = func_proto;
+ meta.func_name = func_name;
+
+ if (is_kfunc_destructive(&meta) && !capable(CAP_SYS_BOOT)) {
+ verbose(env, "destructive kfunc calls require CAP_SYS_BOOT capability\n");
return -EACCES;
}
- acq = *kfunc_flags & KF_ACQUIRE;
-
/* Check the arguments */
- err = btf_check_kfunc_arg_match(env, desc_btf, func_id, regs, *kfunc_flags);
+ err = check_kfunc_args(env, &meta);
if (err < 0)
return err;
/* In case of release function, we get register number of refcounted
- * PTR_TO_BTF_ID back from btf_check_kfunc_arg_match, do the release now
+ * PTR_TO_BTF_ID in bpf_kfunc_arg_meta, do the release now.
*/
- if (err) {
- err = release_reference(env, regs[err].ref_obj_id);
+ if (meta.release_regno) {
+ err = release_reference(env, regs[meta.release_regno].ref_obj_id);
if (err) {
verbose(env, "kfunc %s#%d reference has not been acquired before\n",
func_name, func_id);
@@ -7639,7 +8041,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
/* Check return type */
t = btf_type_skip_modifiers(desc_btf, func_proto->type, NULL);
- if (acq && !btf_type_is_ptr(t)) {
+ if (is_kfunc_acquire(&meta) && !btf_type_is_ptr(t)) {
verbose(env, "acquire kernel function does not return PTR_TO_BTF_ID\n");
return -EINVAL;
}
@@ -7662,13 +8064,13 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
regs[BPF_REG_0].btf = desc_btf;
regs[BPF_REG_0].type = PTR_TO_BTF_ID;
regs[BPF_REG_0].btf_id = ptr_type_id;
- if (*kfunc_flags & KF_RET_NULL) {
+ if (is_kfunc_ret_null(&meta)) {
regs[BPF_REG_0].type |= PTR_MAYBE_NULL;
/* For mark_ptr_or_null_reg, see 93c230e3f5bd6 */
regs[BPF_REG_0].id = ++env->id_gen;
}
mark_btf_func_reg_size(env, BPF_REG_0, sizeof(void *));
- if (acq) {
+ if (is_kfunc_acquire(&meta)) {
int id = acquire_reference_state(env, insn_idx);
if (id < 0)
diff --git a/tools/testing/selftests/bpf/verifier/calls.c b/tools/testing/selftests/bpf/verifier/calls.c
index 3fb4f69b1962..4604d0b1fb04 100644
--- a/tools/testing/selftests/bpf/verifier/calls.c
+++ b/tools/testing/selftests/bpf/verifier/calls.c
@@ -109,7 +109,7 @@
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = REJECT,
- .errstr = "arg#0 pointer type STRUCT prog_test_ref_kfunc must point",
+ .errstr = "arg#0 expected pointer to btf or socket",
.fixup_kfunc_btf_id = {
{ "bpf_kfunc_call_test_acquire", 3 },
{ "bpf_kfunc_call_test_release", 5 },
--
2.34.1
* [PATCH RFC bpf-next v1 10/32] bpf: Drop kfunc support from btf_check_func_arg_match
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (8 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 09/32] bpf: Rewrite kfunc argument handling Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 11/32] bpf: Support constant scalar arguments for kfuncs Kumar Kartikeya Dwivedi
` (21 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Remove all kfunc related bits from btf_check_func_arg_match, now that
its users have been converted to the refactored kfunc argument handling.
This is split into a separate commit to aid review, so that what has
been preserved from the removed bits can be compared easily instead of
mixing removed hunks with the previous patch.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 4 -
include/linux/bpf_verifier.h | 2 -
kernel/bpf/btf.c | 262 ++---------------------------------
kernel/bpf/verifier.c | 4 +-
4 files changed, 10 insertions(+), 262 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cdc0a8c1b1d1..d4e6bf789c02 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1976,10 +1976,6 @@ int btf_distill_func_proto(struct bpf_verifier_log *log,
struct bpf_reg_state;
int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog,
struct bpf_reg_state *regs);
-int btf_check_kfunc_arg_match(struct bpf_verifier_env *env,
- const struct btf *btf, u32 func_id,
- struct bpf_reg_state *regs,
- u32 kfunc_flags);
int btf_prepare_func_args(struct bpf_verifier_env *env, int subprog,
struct bpf_reg_state *reg);
int btf_check_type_match(struct bpf_verifier_log *log, const struct bpf_prog *prog,
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 1fdddbf3546b..b4a11ff56054 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -567,8 +567,6 @@ int check_ptr_off_reg(struct bpf_verifier_env *env,
int check_func_arg_reg_off(struct bpf_verifier_env *env,
const struct bpf_reg_state *reg, int regno,
enum bpf_arg_type arg_type);
-int check_kfunc_mem_size_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
- u32 regno);
int check_mem_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
u32 regno, u32 mem_size);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 0ad809a3055d..6740c3ade8f1 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -6085,96 +6085,18 @@ int btf_check_type_match(struct bpf_verifier_log *log, const struct bpf_prog *pr
return btf_check_func_type_match(log, btf1, t1, btf2, t2);
}
-static u32 *reg2btf_ids[__BPF_REG_TYPE_MAX] = {
-#ifdef CONFIG_NET
- [PTR_TO_SOCKET] = &btf_sock_ids[BTF_SOCK_TYPE_SOCK],
- [PTR_TO_SOCK_COMMON] = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],
- [PTR_TO_TCP_SOCK] = &btf_sock_ids[BTF_SOCK_TYPE_TCP],
-#endif
-};
-
-/* Returns true if struct is composed of scalars, 4 levels of nesting allowed */
-static bool __btf_type_is_scalar_struct(struct bpf_verifier_log *log,
- const struct btf *btf,
- const struct btf_type *t, int rec)
-{
- const struct btf_type *member_type;
- const struct btf_member *member;
- u32 i;
-
- if (!btf_type_is_struct(t))
- return false;
-
- for_each_member(i, t, member) {
- const struct btf_array *array;
-
- member_type = btf_type_skip_modifiers(btf, member->type, NULL);
- if (btf_type_is_struct(member_type)) {
- if (rec >= 3) {
- bpf_log(log, "max struct nesting depth exceeded\n");
- return false;
- }
- if (!__btf_type_is_scalar_struct(log, btf, member_type, rec + 1))
- return false;
- continue;
- }
- if (btf_type_is_array(member_type)) {
- array = btf_type_array(member_type);
- if (!array->nelems)
- return false;
- member_type = btf_type_skip_modifiers(btf, array->type, NULL);
- if (!btf_type_is_scalar(member_type))
- return false;
- continue;
- }
- if (!btf_type_is_scalar(member_type))
- return false;
- }
- return true;
-}
-
-static bool is_kfunc_arg_mem_size(const struct btf *btf,
- const struct btf_param *arg,
- const struct bpf_reg_state *reg)
-{
- int len, sfx_len = sizeof("__sz") - 1;
- const struct btf_type *t;
- const char *param_name;
-
- t = btf_type_skip_modifiers(btf, arg->type, NULL);
- if (!btf_type_is_scalar(t) || reg->type != SCALAR_VALUE)
- return false;
-
- /* In the future, this can be ported to use BTF tagging */
- param_name = btf_name_by_offset(btf, arg->name_off);
- if (str_is_empty(param_name))
- return false;
- len = strlen(param_name);
- if (len < sfx_len)
- return false;
- param_name += len - sfx_len;
- if (strncmp(param_name, "__sz", sfx_len))
- return false;
-
- return true;
-}
-
static int btf_check_func_arg_match(struct bpf_verifier_env *env,
const struct btf *btf, u32 func_id,
struct bpf_reg_state *regs,
- bool ptr_to_mem_ok,
- u32 kfunc_flags)
+ bool ptr_to_mem_ok)
{
enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
- bool rel = false, kptr_get = false, trusted_arg = false;
- bool sleepable = false;
struct bpf_verifier_log *log = &env->log;
- u32 i, nargs, ref_id, ref_obj_id = 0;
- bool is_kfunc = btf_is_kernel(btf);
const char *func_name, *ref_tname;
const struct btf_type *t, *ref_t;
const struct btf_param *args;
- int ref_regno = 0, ret;
+ u32 i, nargs, ref_id;
+ int ret;
t = btf_type_by_id(btf, func_id);
if (!t || !btf_type_is_func(t)) {
@@ -6200,14 +6122,6 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
return -EINVAL;
}
- if (is_kfunc) {
- /* Only kfunc can be release func */
- rel = kfunc_flags & KF_RELEASE;
- kptr_get = kfunc_flags & KF_KPTR_GET;
- trusted_arg = kfunc_flags & KF_TRUSTED_ARGS;
- sleepable = kfunc_flags & KF_SLEEPABLE;
- }
-
/* check that BTF function arguments match actual types that the
* verifier sees.
*/
@@ -6230,70 +6144,14 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
return -EINVAL;
}
- /* Check if argument must be a referenced pointer, args + i has
- * been verified to be a pointer (after skipping modifiers).
- */
- if (is_kfunc && trusted_arg && !reg->ref_obj_id) {
- bpf_log(log, "R%d must be referenced\n", regno);
- return -EINVAL;
- }
-
ref_t = btf_type_skip_modifiers(btf, t->type, &ref_id);
ref_tname = btf_name_by_offset(btf, ref_t->name_off);
- /* Trusted args have the same offset checks as release arguments */
- if (trusted_arg || (rel && reg->ref_obj_id))
- arg_type |= OBJ_RELEASE;
ret = check_func_arg_reg_off(env, reg, regno, arg_type);
if (ret < 0)
return ret;
- /* kptr_get is only true for kfunc */
- if (i == 0 && kptr_get) {
- struct bpf_map_value_off_desc *off_desc;
-
- if (reg->type != PTR_TO_MAP_VALUE) {
- bpf_log(log, "arg#0 expected pointer to map value\n");
- return -EINVAL;
- }
-
- /* check_func_arg_reg_off allows var_off for
- * PTR_TO_MAP_VALUE, but we need fixed offset to find
- * off_desc.
- */
- if (!tnum_is_const(reg->var_off)) {
- bpf_log(log, "arg#0 must have constant offset\n");
- return -EINVAL;
- }
-
- off_desc = bpf_map_kptr_off_contains(reg->map_ptr, reg->off + reg->var_off.value);
- if (!off_desc || off_desc->type != BPF_KPTR_REF) {
- bpf_log(log, "arg#0 no referenced kptr at map value offset=%llu\n",
- reg->off + reg->var_off.value);
- return -EINVAL;
- }
-
- if (!btf_type_is_ptr(ref_t)) {
- bpf_log(log, "arg#0 BTF type must be a double pointer\n");
- return -EINVAL;
- }
-
- ref_t = btf_type_skip_modifiers(btf, ref_t->type, &ref_id);
- ref_tname = btf_name_by_offset(btf, ref_t->name_off);
-
- if (!btf_type_is_struct(ref_t)) {
- bpf_log(log, "kernel function %s args#%d pointer type %s %s is not supported\n",
- func_name, i, btf_type_str(ref_t), ref_tname);
- return -EINVAL;
- }
- if (!btf_struct_ids_match(log, btf, ref_id, 0, off_desc->kptr.btf,
- off_desc->kptr.btf_id, true)) {
- bpf_log(log, "kernel function %s args#%d expected pointer to %s %s\n",
- func_name, i, btf_type_str(ref_t), ref_tname);
- return -EINVAL;
- }
- /* rest of the arguments can be anything, like normal kfunc */
- } else if (btf_get_prog_ctx_type(log, btf, t, prog_type, i)) {
+ if (btf_get_prog_ctx_type(log, btf, t, prog_type, i)) {
/* If function expects ctx type in BTF check that caller
* is passing PTR_TO_CTX.
*/
@@ -6303,86 +6161,10 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
i, btf_type_str(t));
return -EINVAL;
}
- } else if (is_kfunc && (reg->type == PTR_TO_BTF_ID ||
- (reg2btf_ids[base_type(reg->type)] && !type_flag(reg->type)))) {
- const struct btf_type *reg_ref_t;
- const struct btf *reg_btf;
- const char *reg_ref_tname;
- u32 reg_ref_id;
-
- if (!btf_type_is_struct(ref_t)) {
- bpf_log(log, "kernel function %s args#%d pointer type %s %s is not supported\n",
- func_name, i, btf_type_str(ref_t),
- ref_tname);
- return -EINVAL;
- }
-
- if (reg->type == PTR_TO_BTF_ID) {
- reg_btf = reg->btf;
- reg_ref_id = reg->btf_id;
- /* Ensure only one argument is referenced PTR_TO_BTF_ID */
- if (reg->ref_obj_id) {
- if (ref_obj_id) {
- bpf_log(log, "verifier internal error: more than one arg with ref_obj_id R%d %u %u\n",
- regno, reg->ref_obj_id, ref_obj_id);
- return -EFAULT;
- }
- ref_regno = regno;
- ref_obj_id = reg->ref_obj_id;
- }
- } else {
- reg_btf = btf_vmlinux;
- reg_ref_id = *reg2btf_ids[base_type(reg->type)];
- }
-
- reg_ref_t = btf_type_skip_modifiers(reg_btf, reg_ref_id,
- &reg_ref_id);
- reg_ref_tname = btf_name_by_offset(reg_btf,
- reg_ref_t->name_off);
- if (!btf_struct_ids_match(log, reg_btf, reg_ref_id,
- reg->off, btf, ref_id,
- trusted_arg || (rel && reg->ref_obj_id))) {
- bpf_log(log, "kernel function %s args#%d expected pointer to %s %s but R%d has a pointer to %s %s\n",
- func_name, i,
- btf_type_str(ref_t), ref_tname,
- regno, btf_type_str(reg_ref_t),
- reg_ref_tname);
- return -EINVAL;
- }
} else if (ptr_to_mem_ok) {
const struct btf_type *resolve_ret;
u32 type_size;
- if (is_kfunc) {
- bool arg_mem_size = i + 1 < nargs && is_kfunc_arg_mem_size(btf, &args[i + 1], &regs[regno + 1]);
-
- /* Permit pointer to mem, but only when argument
- * type is pointer to scalar, or struct composed
- * (recursively) of scalars.
- * When arg_mem_size is true, the pointer can be
- * void *.
- */
- if (!btf_type_is_scalar(ref_t) &&
- !__btf_type_is_scalar_struct(log, btf, ref_t, 0) &&
- (arg_mem_size ? !btf_type_is_void(ref_t) : 1)) {
- bpf_log(log,
- "arg#%d pointer type %s %s must point to %sscalar, or struct with scalar\n",
- i, btf_type_str(ref_t), ref_tname, arg_mem_size ? "void, " : "");
- return -EINVAL;
- }
-
- /* Check for mem, len pair */
- if (arg_mem_size) {
- if (check_kfunc_mem_size_reg(env, &regs[regno + 1], regno + 1)) {
- bpf_log(log, "arg#%d arg#%d memory, len pair leads to invalid memory access\n",
- i, i + 1);
- return -EINVAL;
- }
- i++;
- continue;
- }
- }
-
resolve_ret = btf_resolve_size(btf, ref_t, &type_size);
if (IS_ERR(resolve_ret)) {
bpf_log(log,
@@ -6395,33 +6177,13 @@ static int btf_check_func_arg_match(struct bpf_verifier_env *env,
if (check_mem_reg(env, reg, regno, type_size))
return -EINVAL;
} else {
- bpf_log(log, "reg type unsupported for arg#%d %sfunction %s#%d\n", i,
- is_kfunc ? "kernel " : "", func_name, func_id);
+ bpf_log(log, "reg type unsupported for arg#%d function %s#%d\n", i,
+ func_name, func_id);
return -EINVAL;
}
}
- /* Either both are set, or neither */
- WARN_ON_ONCE((ref_obj_id && !ref_regno) || (!ref_obj_id && ref_regno));
- /* We already made sure ref_obj_id is set only for one argument. We do
- * allow (!rel && ref_obj_id), so that passing such referenced
- * PTR_TO_BTF_ID to other kfuncs works. Note that rel is only true when
- * is_kfunc is true.
- */
- if (rel && !ref_obj_id) {
- bpf_log(log, "release kernel function %s expects refcounted PTR_TO_BTF_ID\n",
- func_name);
- return -EINVAL;
- }
-
- if (sleepable && !env->prog->aux->sleepable) {
- bpf_log(log, "kernel function %s is sleepable but the program is not\n",
- func_name);
- return -EINVAL;
- }
-
- /* returns argument register number > 0 in case of reference release kfunc */
- return rel ? ref_regno : 0;
+ return 0;
}
/* Compare BTF of a function with given bpf_reg_state.
@@ -6451,7 +6213,7 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog,
return -EINVAL;
is_global = prog->aux->func_info_aux[subprog].linkage == BTF_FUNC_GLOBAL;
- err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global, 0);
+ err = btf_check_func_arg_match(env, btf, btf_id, regs, is_global);
/* Compiler optimizations can remove arguments from static functions
* or mismatched type can be passed into a global function.
@@ -6462,14 +6224,6 @@ int btf_check_subprog_arg_match(struct bpf_verifier_env *env, int subprog,
return err;
}
-int btf_check_kfunc_arg_match(struct bpf_verifier_env *env,
- const struct btf *btf, u32 func_id,
- struct bpf_reg_state *regs,
- u32 kfunc_flags)
-{
- return btf_check_func_arg_match(env, btf, func_id, regs, true, kfunc_flags);
-}
-
/* Convert BTF of a function into bpf_reg_state if possible
* Returns:
* EFAULT - there is a verifier bug. Abort verification.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 663c91020f82..96fab14eb94e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5338,8 +5338,8 @@ int check_mem_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
return err;
}
-int check_kfunc_mem_size_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
- u32 regno)
+static int check_kfunc_mem_size_reg(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
+ u32 regno)
{
struct bpf_reg_state *mem_reg = &cur_regs(env)[regno - 1];
bool may_be_null = type_may_be_null(mem_reg->type);
--
2.34.1
* [PATCH RFC bpf-next v1 11/32] bpf: Support constant scalar arguments for kfuncs
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (9 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 10/32] bpf: Drop kfunc support from btf_check_func_arg_match Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 12/32] bpf: Teach verifier about non-size constant arguments Kumar Kartikeya Dwivedi
` (20 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Allow passing known constant scalars as arguments to kfuncs that do not
represent a size parameter. This makes the verifier's search pruning
more conservative for such kfunc calls: arguments with distinct constant
values are never considered equivalent.
We will then use this support to expose a global bpf_kptr_alloc
function which takes the local type ID in program BTF and returns a
PTR_TO_BTF_ID to the local type. These will be called local kptrs, and
allow programs to allocate their own objects.
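As a rough BPF-side sketch of how such an argument would be passed (the
extern declaration mirrors the bpf_mem_alloc example used in the
kfuncs.rst hunk below; it is an illustration, not the exact interface
added later in the series):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_core_read.h>

  struct foo {
      int a;
      int b;
  };

  /* The __k suffix on the parameter name is what makes the verifier
   * demand a known constant for this argument.
   */
  extern void *bpf_mem_alloc(__u32 local_type_id__k) __ksym;

  SEC("tc")
  int alloc_example(struct __sk_buff *ctx)
  {
      /* bpf_core_type_id_local() folds to a constant at load time, so
       * the verifier sees a precise scalar and treats calls with
       * different type IDs as distinct states. Freeing the object is
       * omitted here for brevity.
       */
      struct foo *f = bpf_mem_alloc(bpf_core_type_id_local(struct foo));

      return f ? 0 : 1;
  }

  char _license[] SEC("license") = "GPL";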
However, this is still not completely safe, as the mark_chain_precision
logic is buggy without more work when the constant argument is not a
size but still needs precise marker propagation for pruning checks. The
next patch fixes this problem.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
Documentation/bpf/kfuncs.rst | 30 ++++++++++++++++
kernel/bpf/verifier.c | 67 +++++++++++++++++++++++++++---------
2 files changed, 80 insertions(+), 17 deletions(-)
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
index 781731749e55..31625393204a 100644
--- a/Documentation/bpf/kfuncs.rst
+++ b/Documentation/bpf/kfuncs.rst
@@ -72,6 +72,36 @@ argument as its size. By default, without __sz annotation, the size of the type
of the pointer is used. Without __sz annotation, a kfunc cannot accept a void
pointer.
+2.2.1 __k Annotation
+--------------------
+
+This annotation is only understood for scalar arguments, where it indicates that
+the verifier must check that the scalar argument is a known constant which does
+not represent a size parameter. This distinction is important: when the scalar
+argument does not represent a size parameter, the verifier is more conservative in
+state search pruning and does not consider two arguments equivalent for safety
+purposes if the already verified value was within range of the new one.
+
+This assumption holds well for sizes (as memory accessed within smaller bounds
+in the old verified state will also work for bigger bounds in the current, to-be-explored
+state), but not for other constant arguments where each carries a distinct
+semantic effect.
+
+An example is given below::
+
+ void *bpf_mem_alloc(u32 local_type_id__k)
+ {
+ ...
+ }
+
+Here, bpf_mem_alloc uses the local_type_id argument to find out the size of that
+type ID in the program's BTF and returns a sized pointer to it. Each type ID will
+have a distinct size, hence it is crucial to treat each such call as distinct
+when values don't match.
+
+Hence, whenever a kfunc accepts a constant scalar argument that is not a size
+parameter, the __k suffix must be used.
+
.. _BPF_kfunc_nodef:
2.3 Using an existing kernel function
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 96fab14eb94e..b28e88d6fabd 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7588,6 +7588,10 @@ struct bpf_kfunc_arg_meta {
/* Out parameters */
u32 ref_obj_id;
u8 release_regno;
+ struct {
+ u64 value;
+ bool found;
+ } arg_constant;
};
static bool is_kfunc_acquire(struct bpf_kfunc_arg_meta *meta)
@@ -7625,30 +7629,40 @@ static bool is_kfunc_arg_kptr_get(struct bpf_kfunc_arg_meta *meta, int arg)
return arg == 0 && (meta->kfunc_flags & KF_KPTR_GET);
}
-static bool is_kfunc_arg_mem_size(const struct btf *btf,
- const struct btf_param *arg,
- const struct bpf_reg_state *reg)
+static bool __kfunc_param_match_suffix(const struct btf *btf,
+ const struct btf_param *arg,
+ const char *suffix)
{
- int len, sfx_len = sizeof("__sz") - 1;
- const struct btf_type *t;
+ int suffix_len = strlen(suffix), len;
const char *param_name;
- t = btf_type_skip_modifiers(btf, arg->type, NULL);
- if (!btf_type_is_scalar(t) || reg->type != SCALAR_VALUE)
- return false;
-
/* In the future, this can be ported to use BTF tagging */
param_name = btf_name_by_offset(btf, arg->name_off);
if (str_is_empty(param_name))
return false;
len = strlen(param_name);
- if (len < sfx_len)
+ if (len < suffix_len)
return false;
- param_name += len - sfx_len;
- if (strncmp(param_name, "__sz", sfx_len))
+ param_name += len - suffix_len;
+ return !strncmp(param_name, suffix, suffix_len);
+}
+
+static bool is_kfunc_arg_mem_size(const struct btf *btf,
+ const struct btf_param *arg,
+ const struct bpf_reg_state *reg)
+{
+ const struct btf_type *t;
+
+ t = btf_type_skip_modifiers(btf, arg->type, NULL);
+ if (!btf_type_is_scalar(t) || reg->type != SCALAR_VALUE)
return false;
- return true;
+ return __kfunc_param_match_suffix(btf, arg, "__sz");
+}
+
+static bool is_kfunc_arg_sfx_constant(const struct btf *btf, const struct btf_param *arg)
+{
+ return __kfunc_param_match_suffix(btf, arg, "__k");
}
/* Returns true if struct is composed of scalars, 4 levels of nesting allowed */
@@ -7873,10 +7887,29 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_arg_m
t = btf_type_skip_modifiers(btf, args[i].type, NULL);
if (btf_type_is_scalar(t)) {
- if (reg->type == SCALAR_VALUE)
- continue;
- verbose(env, "R%d is not a scalar\n", regno);
- return -EINVAL;
+ if (reg->type != SCALAR_VALUE) {
+ verbose(env, "R%d is not a scalar\n", regno);
+ return -EINVAL;
+ }
+ if (is_kfunc_arg_sfx_constant(meta->btf, &args[i])) {
+ /* kfunc is already bpf_capable() only, no need
+ * to check it here.
+ */
+ if (meta->arg_constant.found) {
+ verbose(env, "verifier internal error: only one constant argument permitted\n");
+ return -EFAULT;
+ }
+ if (!tnum_is_const(reg->var_off)) {
+ verbose(env, "R%d must be a known constant\n", regno);
+ return -EINVAL;
+ }
+ ret = mark_chain_precision(env, regno);
+ if (ret < 0)
+ return ret;
+ meta->arg_constant.found = true;
+ meta->arg_constant.value = reg->var_off.value;
+ }
+ continue;
}
if (!btf_type_is_ptr(t)) {
--
2.34.1
* [PATCH RFC bpf-next v1 12/32] bpf: Teach verifier about non-size constant arguments
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (10 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 11/32] bpf: Support constant scalar arguments for kfuncs Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-07 22:11 ` Alexei Starovoitov
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 13/32] bpf: Introduce bpf_list_head support for BPF maps Kumar Kartikeya Dwivedi
` (19 subsequent siblings)
31 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Currently, the verifier has support for various arguments that either
describe the size of the memory being passed in to a helper, or describe
the size of the memory being returned. When a constant is passed in like
this, it is assumed for the purposes of precision tracking that if the
value in the already explored safe state is within the value in the
current state, it is fine to prune the search.
While this holds well for size arguments, arguments where each value may
denote a distinct meaning and must be verified separately need more work.
The search can only be pruned if both are constant values and both are
equal. In all other cases, it would be incorrect to treat those two
precise registers as equivalent if the new value satisfies the old one
(i.e. old <= cur).
Hence, make the register precision marker tri-state. There are now three
values that reg->precise takes: NOT_PRECISE, PRECISE, PRECISE_ABSOLUTE.
Both PRECISE and PRECISE_ABSOLUTE are 'true' values. PRECISE_ABSOLUTE
affects how regsafe decides whether both registers are equivalent for
the purposes of verifier state equivalence. When it sees that one
register has reg->precise == PRECISE_ABSOLUTE, unless both are absolute,
it will return false. When both are, it returns true only when both are
const and both have the same value. Otherwise, for the PRECISE case it
falls back to the default check present today (i.e. treating the value
as a size).
This is required as a future patch introduces a BPF memory allocator
interface, where we take the program BTF's type ID as an argument. Each
distinct type ID may result in the returned pointer obtaining a
different size, hence precision tracking is needed, and pruning cannot
just happen when the old value is within the current value. It must only
happen when the type ID is equal. The type ID will always correspond to
prog->aux->btf, hence an actual type match is not required.
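As a condensed sketch of the resulting comparison rule (a paraphrase of
the regsafe() change in the diff below, with a made-up struct reg
standing in for bpf_reg_state, not verbatim kernel code):

  #include <stdbool.h>

  enum precise_kind { NOT_PRECISE, PRECISE, PRECISE_ABSOLUTE };

  struct reg {
      enum precise_kind precise;
      bool is_const;                 /* tnum_is_const() in the real code */
      unsigned long long value;      /* var_off.value when constant */
      unsigned long long umin, umax;
  };

  /* Can the already-verified scalar (old) let us prune the new state (cur)? */
  static bool scalar_prunable(const struct reg *old, const struct reg *cur)
  {
      if (old->precise == PRECISE_ABSOLUTE || cur->precise == PRECISE_ABSOLUTE)
          /* non-size constants: only an exact, known-constant match is safe */
          return old->precise == cur->precise &&
                 old->is_const && cur->is_const &&
                 old->value == cur->value;
      /* size-like precision: the new range lying inside the old one is enough */
      return cur->umin >= old->umin && cur->umax <= old->umax;
  }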
Finally, change mark_chain_precision to mark_chain_precision_absolute
for kfuncs constant non-size scalar arguments (tagged with __k suffix).
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf_verifier.h | 8 +++-
kernel/bpf/verifier.c | 93 ++++++++++++++++++++++++++----------
2 files changed, 76 insertions(+), 25 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index b4a11ff56054..c4d21568d192 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -43,6 +43,12 @@ enum bpf_reg_liveness {
REG_LIVE_DONE = 0x8, /* liveness won't be updating this register anymore */
};
+enum bpf_reg_precise {
+ NOT_PRECISE,
+ PRECISE,
+ PRECISE_ABSOLUTE,
+};
+
struct bpf_reg_state {
/* Ordering of fields matters. See states_equal() */
enum bpf_reg_type type;
@@ -180,7 +186,7 @@ struct bpf_reg_state {
s32 subreg_def;
enum bpf_reg_liveness live;
/* if (!precise && SCALAR_VALUE) min/max/tnum don't affect safety */
- bool precise;
+ enum bpf_reg_precise precise;
};
enum bpf_stack_slot_type {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b28e88d6fabd..571790ac58d4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -838,7 +838,7 @@ static void print_verifier_state(struct bpf_verifier_env *env,
print_liveness(env, reg->live);
verbose(env, "=");
if (t == SCALAR_VALUE && reg->precise)
- verbose(env, "P");
+ verbose(env, reg->precise == PRECISE_ABSOLUTE ? "PA" : "P");
if ((t == SCALAR_VALUE || t == PTR_TO_STACK) &&
tnum_is_const(reg->var_off)) {
/* reg->off should be 0 for SCALAR_VALUE */
@@ -935,7 +935,7 @@ static void print_verifier_state(struct bpf_verifier_env *env,
t = reg->type;
verbose(env, "=%s", t == SCALAR_VALUE ? "" : reg_type_str(env, t));
if (t == SCALAR_VALUE && reg->precise)
- verbose(env, "P");
+ verbose(env, reg->precise == PRECISE_ABSOLUTE ? "PA" : "P");
if (t == SCALAR_VALUE && tnum_is_const(reg->var_off))
verbose(env, "%lld", reg->var_off.value + reg->off);
} else {
@@ -1668,7 +1668,17 @@ static void __mark_reg_unknown(const struct bpf_verifier_env *env,
reg->type = SCALAR_VALUE;
reg->var_off = tnum_unknown;
reg->frameno = 0;
- reg->precise = env->subprog_cnt > 1 || !env->bpf_capable;
+ /* Helpers requiring PRECISE_ABSOLUTE for constant arguments cannot be
+ * called from programs without CAP_BPF. This is because we don't
+ * propagate precision markers for when CAP_BPF is missing. If we
+ * allowed calling such helpers in those programs, the default would
+ * have to be PRECISE_ABSOLUTE for them, which would be too aggressive.
+ *
+ * We still propagate PRECISE_ABSOLUTE when subprog_cnt > 1, hence
+ * those cases would still override the default PRECISE value when
+ * we propagate the precision markers.
+ */
+ reg->precise = (env->subprog_cnt > 1 || !env->bpf_capable) ? PRECISE : NOT_PRECISE;
__mark_reg_unbounded(reg);
}
@@ -2717,7 +2727,8 @@ static int backtrack_insn(struct bpf_verifier_env *env, int idx,
* For now backtracking falls back into conservative marking.
*/
static void mark_all_scalars_precise(struct bpf_verifier_env *env,
- struct bpf_verifier_state *st)
+ struct bpf_verifier_state *st,
+ bool absolute)
{
struct bpf_func_state *func;
struct bpf_reg_state *reg;
@@ -2733,7 +2744,7 @@ static void mark_all_scalars_precise(struct bpf_verifier_env *env,
reg = &func->regs[j];
if (reg->type != SCALAR_VALUE)
continue;
- reg->precise = true;
+ reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
}
for (j = 0; j < func->allocated_stack / BPF_REG_SIZE; j++) {
if (!is_spilled_reg(&func->stack[j]))
@@ -2741,13 +2752,13 @@ static void mark_all_scalars_precise(struct bpf_verifier_env *env,
reg = &func->stack[j].spilled_ptr;
if (reg->type != SCALAR_VALUE)
continue;
- reg->precise = true;
+ reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
}
}
}
static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
- int spi)
+ int spi, bool absolute)
{
struct bpf_verifier_state *st = env->cur_state;
int first_idx = st->first_insn_idx;
@@ -2774,7 +2785,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
new_marks = true;
else
reg_mask = 0;
- reg->precise = true;
+ reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
}
while (spi >= 0) {
@@ -2791,7 +2802,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
new_marks = true;
else
stack_mask = 0;
- reg->precise = true;
+ reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
break;
}
@@ -2813,7 +2824,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
err = backtrack_insn(env, i, &reg_mask, &stack_mask);
}
if (err == -ENOTSUPP) {
- mark_all_scalars_precise(env, st);
+ mark_all_scalars_precise(env, st, absolute);
return 0;
} else if (err) {
return err;
@@ -2854,7 +2865,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
}
if (!reg->precise)
new_marks = true;
- reg->precise = true;
+ reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
}
bitmap_from_u64(mask, stack_mask);
@@ -2873,7 +2884,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
* fp-8 and it's "unallocated" stack space.
* In such case fallback to conservative.
*/
- mark_all_scalars_precise(env, st);
+ mark_all_scalars_precise(env, st, absolute);
return 0;
}
@@ -2888,7 +2899,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
}
if (!reg->precise)
new_marks = true;
- reg->precise = true;
+ reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
}
if (env->log.level & BPF_LOG_LEVEL2) {
verbose(env, "parent %s regs=%x stack=%llx marks:",
@@ -2910,12 +2921,24 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
static int mark_chain_precision(struct bpf_verifier_env *env, int regno)
{
- return __mark_chain_precision(env, regno, -1);
+ return __mark_chain_precision(env, regno, -1, false);
+}
+
+static int mark_chain_precision_absolute(struct bpf_verifier_env *env, int regno)
+{
+ WARN_ON_ONCE(!env->bpf_capable);
+ return __mark_chain_precision(env, regno, -1, true);
}
static int mark_chain_precision_stack(struct bpf_verifier_env *env, int spi)
{
- return __mark_chain_precision(env, -1, spi);
+ return __mark_chain_precision(env, -1, spi, false);
+}
+
+static int mark_chain_precision_absolute_stack(struct bpf_verifier_env *env, int spi)
+{
+ WARN_ON_ONCE(!env->bpf_capable);
+ return __mark_chain_precision(env, -1, spi, true);
}
static bool is_spillable_regtype(enum bpf_reg_type type)
@@ -3253,7 +3276,7 @@ static void mark_reg_stack_read(struct bpf_verifier_env *env,
* backtracking. Any register that contributed
* to const 0 was marked precise before spill.
*/
- state->regs[dst_regno].precise = true;
+ state->regs[dst_regno].precise = PRECISE;
} else {
/* have read misc data from the stack */
mark_reg_unknown(env, state->regs, dst_regno);
@@ -7903,7 +7926,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_arg_m
verbose(env, "R%d must be a known constant\n", regno);
return -EINVAL;
}
- ret = mark_chain_precision(env, regno);
+ ret = mark_chain_precision_absolute(env, regno);
if (ret < 0)
return ret;
meta->arg_constant.found = true;
@@ -11899,9 +11922,23 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
if (rcur->type == SCALAR_VALUE) {
if (!rold->precise && !rcur->precise)
return true;
- /* new val must satisfy old val knowledge */
- return range_within(rold, rcur) &&
- tnum_in(rold->var_off, rcur->var_off);
+ /* We can only determine safety when type of precision
+ * needed is same. For absolute, we must compare actual
+ * value, otherwise old being within the current value
+ * suffices.
+ */
+ if (rold->precise == PRECISE_ABSOLUTE || rcur->precise == PRECISE_ABSOLUTE) {
+ /* Both should be PRECISE_ABSOLUTE for a comparison */
+ if (rold->precise != rcur->precise)
+ return false;
+ if (!tnum_is_const(rold->var_off) || !tnum_is_const(rcur->var_off))
+ return false;
+ return rold->var_off.value == rcur->var_off.value;
+ } else {
+ /* new val must satisfy old val knowledge */
+ return range_within(rold, rcur) &&
+ tnum_in(rold->var_off, rcur->var_off);
+ }
} else {
/* We're trying to use a pointer in place of a scalar.
* Even if the scalar was unbounded, this could lead to
@@ -12229,8 +12266,12 @@ static int propagate_precision(struct bpf_verifier_env *env,
!state_reg->precise)
continue;
if (env->log.level & BPF_LOG_LEVEL2)
- verbose(env, "propagating r%d\n", i);
- err = mark_chain_precision(env, i);
+ verbose(env, "propagating %sr%d\n",
+ state_reg->precise == PRECISE_ABSOLUTE ? "abs " : "", i);
+ if (state_reg->precise == PRECISE_ABSOLUTE)
+ err = mark_chain_precision_absolute(env, i);
+ else
+ err = mark_chain_precision(env, i);
if (err < 0)
return err;
}
@@ -12243,9 +12284,13 @@ static int propagate_precision(struct bpf_verifier_env *env,
!state_reg->precise)
continue;
if (env->log.level & BPF_LOG_LEVEL2)
- verbose(env, "propagating fp%d\n",
+ verbose(env, "propagating %sfp%d\n",
+ state_reg->precise == PRECISE_ABSOLUTE ? "abs " : "",
(-i - 1) * BPF_REG_SIZE);
- err = mark_chain_precision_stack(env, i);
+ if (state_reg->precise == PRECISE_ABSOLUTE)
+ err = mark_chain_precision_absolute_stack(env, i);
+ else
+ err = mark_chain_precision_stack(env, i);
if (err < 0)
return err;
}
--
2.34.1
* [PATCH RFC bpf-next v1 13/32] bpf: Introduce bpf_list_head support for BPF maps
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (11 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 12/32] bpf: Teach verifier about non-size constant arguments Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-07 22:46 ` Alexei Starovoitov
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 14/32] bpf: Introduce bpf_kptr_alloc helper Kumar Kartikeya Dwivedi
` (18 subsequent siblings)
31 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Add the basic support on the map side to parse, recognize, verify, and
build a metadata table for a new special field of type struct
bpf_list_head. To parameterize the bpf_list_head for a certain value
type and the list_node member it will accept in that value type, we use
BTF declaration tags.
The definition of bpf_list_head in a map value will be done as follows:
struct foo {
int data;
struct bpf_list_node node;
};
struct map_value {
struct bpf_list_head list __contains(struct, foo, node);
};
Then, the bpf_list_head only allows adding to the list using the
bpf_list_node 'node' for the type struct foo.
The 'contains' annotation is a BTF declaration tag composed of four
parts, "contains:kind:name:node", where the kind and name are used to
look up the type in the map BTF. The node part gives the name of the
member in that type which has the type struct bpf_list_node and is
actually used for linking into the linked list.
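A minimal sketch of how such an annotation and map could be spelled in
BPF C (the __contains macro and the struct bpf_list_head/bpf_list_node
stand-ins below are assumptions for illustration; the series ships its
own definitions in tools/testing/selftests/bpf/bpf_experimental.h):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Stand-ins so the sketch is self-contained; the real definitions
   * come from the bpf_experimental.h header added by this patch.
   */
  struct bpf_list_head { __u64 __opaque[2]; } __attribute__((aligned(8)));
  struct bpf_list_node { __u64 __opaque[2]; } __attribute__((aligned(8)));

  /* One plausible spelling of the "contains:kind:name:node" tag. */
  #define __contains(kind, name, node) \
      __attribute__((btf_decl_tag("contains:" #kind ":" #name ":" #node)))

  struct foo {
      int data;
      struct bpf_list_node node;
  };

  struct map_value {
      struct bpf_list_head list __contains(struct, foo, node);
  };

  struct {
      __uint(type, BPF_MAP_TYPE_ARRAY);
      __uint(max_entries, 1);
      __type(key, int);
      __type(value, struct map_value);
  } array_of_lists SEC(".maps");

With this in place, the map's BTF carries the tag string that
btf_find_decl_tag_value() and btf_find_list_head() parse in the hunk
below.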
This allows building intrusive linked lists in BPF, using container_of
to obtain a pointer to the entry, while being completely type safe from the
perspective of the verifier. The verifier knows exactly the type of the
nodes, and knows that list helpers return that type at some fixed offset
where the bpf_list_node member used for this list exists. The verifier
also uses this information to disallow adding types that are not
accepted by a certain list.
For now, no elements can be added to such lists. Support for that is
coming in future patches, hence draining and freeing of items is left
out for now; only freeing of the list_head_off_tab is done, since it is
still built and populated when bpf_list_head is specified in the map
value.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 64 +++++--
include/linux/btf.h | 2 +
kernel/bpf/arraymap.c | 2 +
kernel/bpf/bpf_local_storage.c | 1 +
kernel/bpf/btf.c | 173 +++++++++++++++++-
kernel/bpf/hashtab.c | 1 +
kernel/bpf/map_in_map.c | 5 +-
kernel/bpf/syscall.c | 131 +++++++++++--
kernel/bpf/verifier.c | 21 +++
.../testing/selftests/bpf/bpf_experimental.h | 21 +++
10 files changed, 378 insertions(+), 43 deletions(-)
create mode 100644 tools/testing/selftests/bpf/bpf_experimental.h
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d4e6bf789c02..35c2e9caeb98 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -28,6 +28,9 @@
#include <linux/btf.h>
#include <linux/rcupdate_trace.h>
+/* Experimental BPF APIs header for type definitions */
+#include "../../../tools/testing/selftests/bpf/bpf_experimental.h"
+
struct bpf_verifier_env;
struct bpf_verifier_log;
struct perf_event;
@@ -164,27 +167,40 @@ struct bpf_map_ops {
};
enum {
- /* Support at most 8 pointers in a BPF map value */
- BPF_MAP_VALUE_OFF_MAX = 8,
- BPF_MAP_OFF_ARR_MAX = BPF_MAP_VALUE_OFF_MAX +
- 1 + /* for bpf_spin_lock */
- 1, /* for bpf_timer */
-};
-
-enum bpf_kptr_type {
+ /* Support at most 8 offsets in a table */
+ BPF_MAP_VALUE_OFF_MAX = 8,
+ /* Support at most 8 pointer in a BPF map value */
+ BPF_MAP_VALUE_KPTR_MAX = BPF_MAP_VALUE_OFF_MAX,
+ /* Support at most 8 list_head in a BPF map value */
+ BPF_MAP_VALUE_LIST_HEAD_MAX = BPF_MAP_VALUE_OFF_MAX,
+ BPF_MAP_OFF_ARR_MAX = BPF_MAP_VALUE_KPTR_MAX +
+ BPF_MAP_VALUE_LIST_HEAD_MAX +
+ 1 + /* for bpf_spin_lock */
+ 1, /* for bpf_timer */
+};
+
+enum bpf_off_type {
BPF_KPTR_UNREF,
BPF_KPTR_REF,
+ BPF_LIST_HEAD,
};
struct bpf_map_value_off_desc {
u32 offset;
- enum bpf_kptr_type type;
- struct {
- struct btf *btf;
- struct module *module;
- btf_dtor_kfunc_t dtor;
- u32 btf_id;
- } kptr;
+ enum bpf_off_type type;
+ union {
+ struct {
+ struct btf *btf;
+ struct module *module;
+ btf_dtor_kfunc_t dtor;
+ u32 btf_id;
+ } kptr; /* for BPF_KPTR_{UNREF,REF} */
+ struct {
+ struct btf *btf;
+ u32 value_type_id;
+ u32 list_node_off;
+ } list_head; /* for BPF_LIST_HEAD */
+ };
};
struct bpf_map_value_off {
@@ -215,6 +231,7 @@ struct bpf_map {
u32 map_flags;
int spin_lock_off; /* >=0 valid offset, <0 error */
struct bpf_map_value_off *kptr_off_tab;
+ struct bpf_map_value_off *list_head_off_tab;
int timer_off; /* >=0 valid offset, <0 error */
u32 id;
int numa_node;
@@ -265,6 +282,11 @@ static inline bool map_value_has_kptrs(const struct bpf_map *map)
return !IS_ERR_OR_NULL(map->kptr_off_tab);
}
+static inline bool map_value_has_list_heads(const struct bpf_map *map)
+{
+ return !IS_ERR_OR_NULL(map->list_head_off_tab);
+}
+
static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
{
if (unlikely(map_value_has_spin_lock(map)))
@@ -278,6 +300,13 @@ static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
for (i = 0; i < tab->nr_off; i++)
*(u64 *)(dst + tab->off[i].offset) = 0;
}
+ if (unlikely(map_value_has_list_heads(map))) {
+ struct bpf_map_value_off *tab = map->list_head_off_tab;
+ int i;
+
+ for (i = 0; i < tab->nr_off; i++)
+ memset(dst + tab->off[i].offset, 0, sizeof(struct list_head));
+ }
}
/* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
@@ -1676,6 +1705,11 @@ struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
void bpf_map_free_kptrs(struct bpf_map *map, void *map_value);
+struct bpf_map_value_off_desc *bpf_map_list_head_off_contains(struct bpf_map *map, u32 offset);
+void bpf_map_free_list_head_off_tab(struct bpf_map *map);
+struct bpf_map_value_off *bpf_map_copy_list_head_off_tab(const struct bpf_map *map);
+bool bpf_map_equal_list_head_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
+
struct bpf_map *bpf_map_get(u32 ufd);
struct bpf_map *bpf_map_get_with_uref(u32 ufd);
struct bpf_map *__bpf_map_get(struct fd f);
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 8062f9da7c40..9b62b8b2117e 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -156,6 +156,8 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
int btf_find_timer(const struct btf *btf, const struct btf_type *t);
struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
const struct btf_type *t);
+struct bpf_map_value_off *btf_parse_list_heads(struct btf *btf,
+ const struct btf_type *t);
bool btf_type_is_void(const struct btf_type *t);
s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 832b2659e96e..c7263ee3a35f 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -423,6 +423,8 @@ static void array_map_free(struct bpf_map *map)
struct bpf_array *array = container_of(map, struct bpf_array, map);
int i;
+ bpf_map_free_list_head_off_tab(map);
+
if (map_value_has_kptrs(map)) {
if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
for (i = 0; i < array->map.max_entries; i++) {
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index 58cb0c179097..b5ccd76026b6 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -616,6 +616,7 @@ void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
rcu_barrier();
bpf_map_free_kptr_off_tab(&smap->map);
}
+ bpf_map_free_list_head_off_tab(&smap->map);
kvfree(smap->buckets);
bpf_map_area_free(smap);
}
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 6740c3ade8f1..0fb045be3837 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3185,6 +3185,7 @@ enum btf_field_type {
BTF_FIELD_SPIN_LOCK,
BTF_FIELD_TIMER,
BTF_FIELD_KPTR,
+ BTF_FIELD_LIST_HEAD,
};
enum {
@@ -3193,9 +3194,17 @@ enum {
};
struct btf_field_info {
- u32 type_id;
u32 off;
- enum bpf_kptr_type type;
+ union {
+ struct {
+ u32 type_id;
+ enum bpf_off_type type;
+ } kptr;
+ struct {
+ u32 value_type_id;
+ const char *node_name;
+ } list_head;
+ };
};
static int btf_find_struct(const struct btf *btf, const struct btf_type *t,
@@ -3212,7 +3221,7 @@ static int btf_find_struct(const struct btf *btf, const struct btf_type *t,
static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
u32 off, int sz, struct btf_field_info *info)
{
- enum bpf_kptr_type type;
+ enum bpf_off_type type;
u32 res_id;
/* Permit modifiers on the pointer itself */
@@ -3241,9 +3250,71 @@ static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
if (!__btf_type_is_struct(t))
return -EINVAL;
- info->type_id = res_id;
info->off = off;
- info->type = type;
+ info->kptr.type_id = res_id;
+ info->kptr.type = type;
+ return BTF_FIELD_FOUND;
+}
+
+static const char *btf_find_decl_tag_value(const struct btf *btf,
+ const struct btf_type *pt,
+ int comp_idx, const char *tag_key)
+{
+ int i;
+
+ for (i = 1; i < btf_nr_types(btf); i++) {
+ const struct btf_type *t = btf_type_by_id(btf, i);
+ int len = strlen(tag_key);
+
+ if (!btf_type_is_decl_tag(t))
+ continue;
+ /* TODO: Instead of btf_type pt, it would be much better if we had BTF
+ * ID of the map value type. This would avoid btf_type_by_id call here.
+ */
+ if (pt != btf_type_by_id(btf, t->type) ||
+ btf_type_decl_tag(t)->component_idx != comp_idx)
+ continue;
+ if (strncmp(__btf_name_by_offset(btf, t->name_off), tag_key, len))
+ continue;
+ return __btf_name_by_offset(btf, t->name_off) + len;
+ }
+ return NULL;
+}
+
+static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
+ int comp_idx, const struct btf_type *t,
+ u32 off, int sz, struct btf_field_info *info)
+{
+ const char *value_type;
+ const char *list_node;
+ s32 id;
+
+ if (!__btf_type_is_struct(t))
+ return BTF_FIELD_IGNORE;
+ if (t->size != sz)
+ return BTF_FIELD_IGNORE;
+ value_type = btf_find_decl_tag_value(btf, pt, comp_idx, "contains:");
+ if (!value_type)
+ return -EINVAL;
+ if (strncmp(value_type, "struct:", sizeof("struct:") - 1))
+ return -EINVAL;
+ value_type += sizeof("struct:") - 1;
+ list_node = strstr(value_type, ":");
+ if (!list_node)
+ return -EINVAL;
+ value_type = kstrndup(value_type, list_node - value_type, GFP_ATOMIC);
+ if (!value_type)
+ return -ENOMEM;
+ id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT);
+ kfree(value_type);
+ if (id < 0)
+ return id;
+ list_node++;
+ if (str_is_empty(list_node))
+ return -EINVAL;
+ info->off = off;
+ info->list_head.value_type_id = id;
+ info->list_head.node_name = list_node;
return BTF_FIELD_FOUND;
}
@@ -3286,6 +3357,12 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
if (ret < 0)
return ret;
break;
+ case BTF_FIELD_LIST_HEAD:
+ ret = btf_find_list_head(btf, t, i, member_type, off, sz,
+ idx < info_cnt ? &info[idx] : &tmp);
+ if (ret < 0)
+ return ret;
+ break;
default:
return -EFAULT;
}
@@ -3336,6 +3413,12 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
if (ret < 0)
return ret;
break;
+ case BTF_FIELD_LIST_HEAD:
+ ret = btf_find_list_head(btf, var, -1, var_type, off, sz,
+ idx < info_cnt ? &info[idx] : &tmp);
+ if (ret < 0)
+ return ret;
+ break;
default:
return -EFAULT;
}
@@ -3372,6 +3455,11 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
sz = sizeof(u64);
align = 8;
break;
+ case BTF_FIELD_LIST_HEAD:
+ name = "bpf_list_head";
+ sz = sizeof(struct bpf_list_head);
+ align = __alignof__(struct bpf_list_head);
+ break;
default:
return -EFAULT;
}
@@ -3440,7 +3528,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
/* Find type in map BTF, and use it to look up the matching type
* in vmlinux or module BTFs, by name and kind.
*/
- t = btf_type_by_id(btf, info_arr[i].type_id);
+ t = btf_type_by_id(btf, info_arr[i].kptr.type_id);
id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
&kernel_btf);
if (id < 0) {
@@ -3451,7 +3539,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
/* Find and stash the function pointer for the destruction function that
* needs to be eventually invoked from the map free path.
*/
- if (info_arr[i].type == BPF_KPTR_REF) {
+ if (info_arr[i].kptr.type == BPF_KPTR_REF) {
const struct btf_type *dtor_func;
const char *dtor_func_name;
unsigned long addr;
@@ -3494,7 +3582,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
}
tab->off[i].offset = info_arr[i].off;
- tab->off[i].type = info_arr[i].type;
+ tab->off[i].type = info_arr[i].kptr.type;
tab->off[i].kptr.btf_id = id;
tab->off[i].kptr.btf = kernel_btf;
tab->off[i].kptr.module = mod;
@@ -3515,6 +3603,75 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
return ERR_PTR(ret);
}
+struct bpf_map_value_off *btf_parse_list_heads(struct btf *btf, const struct btf_type *t)
+{
+ struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
+ struct bpf_map_value_off *tab;
+ int ret, i, nr_off;
+
+ ret = btf_find_field(btf, t, BTF_FIELD_LIST_HEAD, info_arr, ARRAY_SIZE(info_arr));
+ if (ret < 0)
+ return ERR_PTR(ret);
+ if (!ret)
+ return NULL;
+
+ nr_off = ret;
+ tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
+ if (!tab)
+ return ERR_PTR(-ENOMEM);
+
+ for (i = 0; i < nr_off; i++) {
+ const struct btf_type *t, *n = NULL;
+ const struct btf_member *member;
+ u32 offset;
+ int j;
+
+ t = btf_type_by_id(btf, info_arr[i].list_head.value_type_id);
+ /* We've already checked that value_type_id is a struct type. We
+ * just need to figure out the offset of the list_node, and
+ * verify its type.
+ */
+ ret = -EINVAL;
+ for_each_member(j, t, member) {
+ if (strcmp(info_arr[i].list_head.node_name, __btf_name_by_offset(btf, member->name_off)))
+ continue;
+ /* Invalid BTF, two members with same name */
+ if (n) {
+ /* We also need to btf_put for the current iteration! */
+ i++;
+ goto end;
+ }
+ n = btf_type_by_id(btf, member->type);
+ if (!__btf_type_is_struct(n))
+ goto end;
+ if (strcmp("bpf_list_node", __btf_name_by_offset(btf, n->name_off)))
+ goto end;
+ offset = __btf_member_bit_offset(n, member);
+ if (offset % 8)
+ goto end;
+ offset /= 8;
+ if (offset % __alignof__(struct bpf_list_node))
+ goto end;
+
+ tab->off[i].offset = info_arr[i].off;
+ tab->off[i].type = BPF_LIST_HEAD;
+ btf_get(btf);
+ tab->off[i].list_head.btf = btf;
+ tab->off[i].list_head.value_type_id = info_arr[i].list_head.value_type_id;
+ tab->off[i].list_head.list_node_off = offset;
+ }
+ if (!n)
+ goto end;
+ }
+ tab->nr_off = nr_off;
+ return tab;
+end:
+ while (i--)
+ btf_put(tab->off[i].list_head.btf);
+ kfree(tab);
+ return ERR_PTR(ret);
+}
+
static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
u32 type_id, void *data, u8 bits_offset,
struct btf_show *show)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index bb3f8a63c221..270e0ecf4ba3 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1518,6 +1518,7 @@ static void htab_map_free(struct bpf_map *map)
prealloc_destroy(htab);
}
+ bpf_map_free_list_head_off_tab(map);
bpf_map_free_kptr_off_tab(map);
free_percpu(htab->extra_elems);
bpf_map_area_free(htab->buckets);
diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
index 135205d0d560..ced2559129ab 100644
--- a/kernel/bpf/map_in_map.c
+++ b/kernel/bpf/map_in_map.c
@@ -53,6 +53,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
inner_map_meta->timer_off = inner_map->timer_off;
inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
+ inner_map_meta->list_head_off_tab = bpf_map_copy_list_head_off_tab(inner_map);
if (inner_map->btf) {
btf_get(inner_map->btf);
inner_map_meta->btf = inner_map->btf;
@@ -72,6 +73,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
void bpf_map_meta_free(struct bpf_map *map_meta)
{
+ bpf_map_free_list_head_off_tab(map_meta);
bpf_map_free_kptr_off_tab(map_meta);
btf_put(map_meta->btf);
kfree(map_meta);
@@ -86,7 +88,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
meta0->value_size == meta1->value_size &&
meta0->timer_off == meta1->timer_off &&
meta0->map_flags == meta1->map_flags &&
- bpf_map_equal_kptr_off_tab(meta0, meta1);
+ bpf_map_equal_kptr_off_tab(meta0, meta1) &&
+ bpf_map_equal_list_head_off_tab(meta0, meta1);
}
void *bpf_map_fd_get_ptr(struct bpf_map *map,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 0311acca19f6..e1749e0d2143 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -495,7 +495,7 @@ static void bpf_map_release_memcg(struct bpf_map *map)
}
#endif
-static int bpf_map_kptr_off_cmp(const void *a, const void *b)
+static int bpf_map_off_cmp(const void *a, const void *b)
{
const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
@@ -506,18 +506,22 @@ static int bpf_map_kptr_off_cmp(const void *a, const void *b)
return 0;
}
-struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
+static struct bpf_map_value_off_desc *
+__bpf_map_off_contains(struct bpf_map_value_off *off_tab, u32 offset)
{
/* Since members are iterated in btf_find_field in increasing order,
- * offsets appended to kptr_off_tab are in increasing order, so we can
+ * offsets appended to an off_tab are in increasing order, so we can
* do bsearch to find exact match.
*/
- struct bpf_map_value_off *tab;
+ return bsearch(&offset, off_tab->off, off_tab->nr_off, sizeof(off_tab->off[0]),
+ bpf_map_off_cmp);
+}
+struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
+{
if (!map_value_has_kptrs(map))
return NULL;
- tab = map->kptr_off_tab;
- return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
+ return __bpf_map_off_contains(map->kptr_off_tab, offset);
}
void bpf_map_free_kptr_off_tab(struct bpf_map *map)
@@ -563,15 +567,15 @@ struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
return new_tab;
}
-bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
+static bool __bpf_map_equal_off_tab(const struct bpf_map_value_off *tab_a,
+ const struct bpf_map_value_off *tab_b,
+ bool has_a, bool has_b)
{
- struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
- bool a_has_kptr = map_value_has_kptrs(map_a), b_has_kptr = map_value_has_kptrs(map_b);
int size;
- if (!a_has_kptr && !b_has_kptr)
+ if (!has_a && !has_b)
return true;
- if (a_has_kptr != b_has_kptr)
+ if (has_a != has_b)
return false;
if (tab_a->nr_off != tab_b->nr_off)
return false;
@@ -579,6 +583,13 @@ bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_ma
return !memcmp(tab_a, tab_b, size);
}
+bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
+{
+ return __bpf_map_equal_off_tab(map_a->kptr_off_tab, map_b->kptr_off_tab,
+ map_value_has_kptrs(map_a),
+ map_value_has_kptrs(map_b));
+}
+
/* Caller must ensure map_value_has_kptrs is true. Note that this function can
* be called on a map value while the map_value is visible to BPF programs, as
* it ensures the correct synchronization, and we already enforce the same using
@@ -606,6 +617,50 @@ void bpf_map_free_kptrs(struct bpf_map *map, void *map_value)
}
}
+struct bpf_map_value_off_desc *bpf_map_list_head_off_contains(struct bpf_map *map, u32 offset)
+{
+ if (!map_value_has_list_heads(map))
+ return NULL;
+ return __bpf_map_off_contains(map->list_head_off_tab, offset);
+}
+
+void bpf_map_free_list_head_off_tab(struct bpf_map *map)
+{
+ struct bpf_map_value_off *tab = map->list_head_off_tab;
+ int i;
+
+ if (!map_value_has_list_heads(map))
+ return;
+ for (i = 0; i < tab->nr_off; i++)
+ btf_put(tab->off[i].list_head.btf);
+ kfree(tab);
+ map->list_head_off_tab = NULL;
+}
+
+struct bpf_map_value_off *bpf_map_copy_list_head_off_tab(const struct bpf_map *map)
+{
+ struct bpf_map_value_off *tab = map->list_head_off_tab, *new_tab;
+ int size, i;
+
+ if (!map_value_has_list_heads(map))
+ return ERR_PTR(-ENOENT);
+ size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
+ new_tab = kmemdup(tab, size, GFP_KERNEL | __GFP_NOWARN);
+ if (!new_tab)
+ return ERR_PTR(-ENOMEM);
+ /* Do a deep copy of the list_head_off_tab */
+ for (i = 0; i < tab->nr_off; i++)
+ btf_get(tab->off[i].list_head.btf);
+ return new_tab;
+}
+
+bool bpf_map_equal_list_head_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
+{
+ return __bpf_map_equal_off_tab(map_a->list_head_off_tab, map_b->list_head_off_tab,
+ map_value_has_list_heads(map_a),
+ map_value_has_list_heads(map_b));
+}
+
/* called from workqueue */
static void bpf_map_free_deferred(struct work_struct *work)
{
@@ -776,7 +831,8 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
int err;
if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
- map_value_has_timer(map) || map_value_has_kptrs(map))
+ map_value_has_timer(map) || map_value_has_kptrs(map) ||
+ map_value_has_list_heads(map))
return -ENOTSUPP;
if (!(vma->vm_flags & VM_SHARED))
@@ -931,13 +987,14 @@ static void map_off_arr_swap(void *_a, void *_b, int size, const void *priv)
static int bpf_map_alloc_off_arr(struct bpf_map *map)
{
+ bool has_list_heads = map_value_has_list_heads(map);
bool has_spin_lock = map_value_has_spin_lock(map);
bool has_timer = map_value_has_timer(map);
bool has_kptrs = map_value_has_kptrs(map);
struct bpf_map_off_arr *off_arr;
u32 i;
- if (!has_spin_lock && !has_timer && !has_kptrs) {
+ if (!has_spin_lock && !has_timer && !has_kptrs && !has_list_heads) {
map->off_arr = NULL;
return 0;
}
@@ -973,6 +1030,17 @@ static int bpf_map_alloc_off_arr(struct bpf_map *map)
}
off_arr->cnt += tab->nr_off;
}
+ if (has_list_heads) {
+ struct bpf_map_value_off *tab = map->list_head_off_tab;
+ u32 *off = &off_arr->field_off[off_arr->cnt];
+ u8 *sz = &off_arr->field_sz[off_arr->cnt];
+
+ for (i = 0; i < tab->nr_off; i++) {
+ *off++ = tab->off[i].offset;
+ *sz++ = sizeof(struct bpf_list_head);
+ }
+ off_arr->cnt += tab->nr_off;
+ }
if (off_arr->cnt == 1)
return 0;
@@ -1038,11 +1106,11 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
if (map_value_has_kptrs(map)) {
if (!bpf_capable()) {
ret = -EPERM;
- goto free_map_tab;
+ goto free_map_kptr_tab;
}
if (map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG)) {
ret = -EACCES;
- goto free_map_tab;
+ goto free_map_kptr_tab;
}
if (map->map_type != BPF_MAP_TYPE_HASH &&
map->map_type != BPF_MAP_TYPE_PERCPU_HASH &&
@@ -1054,18 +1122,42 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
ret = -EOPNOTSUPP;
- goto free_map_tab;
+ goto free_map_kptr_tab;
+ }
+ }
+
+ /* We need to take ref on the BTF, so pass it as non-const */
+ map->list_head_off_tab = btf_parse_list_heads((struct btf *)btf, value_type);
+ if (map_value_has_list_heads(map)) {
+ if (!bpf_capable()) {
+ ret = -EACCES;
+ goto free_map_list_head_tab;
+ }
+ if (map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG)) {
+ ret = -EACCES;
+ goto free_map_list_head_tab;
+ }
+ if (map->map_type != BPF_MAP_TYPE_HASH &&
+ map->map_type != BPF_MAP_TYPE_LRU_HASH &&
+ map->map_type != BPF_MAP_TYPE_ARRAY &&
+ map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
+ map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
+ map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
+ ret = -EOPNOTSUPP;
+ goto free_map_list_head_tab;
}
}
if (map->ops->map_check_btf) {
ret = map->ops->map_check_btf(map, btf, key_type, value_type);
if (ret < 0)
- goto free_map_tab;
+ goto free_map_list_head_tab;
}
return ret;
-free_map_tab:
+free_map_list_head_tab:
+ bpf_map_free_list_head_off_tab(map);
+free_map_kptr_tab:
bpf_map_free_kptr_off_tab(map);
return ret;
}
@@ -1889,7 +1981,8 @@ static int map_freeze(const union bpf_attr *attr)
return PTR_ERR(map);
if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
- map_value_has_timer(map) || map_value_has_kptrs(map)) {
+ map_value_has_timer(map) || map_value_has_kptrs(map) ||
+ map_value_has_list_heads(map)) {
fdput(f);
return -ENOTSUPP;
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 571790ac58d4..ab91e5ca7e41 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3879,6 +3879,20 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
}
}
}
+ if (map_value_has_list_heads(map)) {
+ struct bpf_map_value_off *tab = map->list_head_off_tab;
+ int i;
+
+ for (i = 0; i < tab->nr_off; i++) {
+ u32 p = tab->off[i].offset;
+
+ if (reg->smin_value + off < p + sizeof(struct bpf_list_head) &&
+ p < reg->umax_value + off + size) {
+ verbose(env, "bpf_list_head cannot be accessed directly by load/store\n");
+ return -EACCES;
+ }
+ }
+ }
return err;
}
@@ -13165,6 +13179,13 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
}
}
+ if (map_value_has_list_heads(map)) {
+ if (is_tracing_prog_type(prog_type)) {
+ verbose(env, "tracing progs cannot use bpf_list_head yet\n");
+ return -EINVAL;
+ }
+ }
+
if ((bpf_prog_is_dev_bound(prog->aux) || bpf_map_is_dev_bound(map)) &&
!bpf_offload_prog_map_match(prog, map)) {
verbose(env, "offload device mismatch between prog and map\n");
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
new file mode 100644
index 000000000000..ea1b3b1839d1
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -0,0 +1,21 @@
+#ifndef __KERNEL__
+
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+#else
+
+struct bpf_list_head {
+ __u64 __a;
+ __u64 __b;
+} __attribute__((aligned(8)));
+
+struct bpf_list_node {
+ __u64 __a;
+ __u64 __b;
+} __attribute__((aligned(8)));
+
+#endif
+
+#ifndef __KERNEL__
+#endif
--
2.34.1
* [PATCH RFC bpf-next v1 14/32] bpf: Introduce bpf_kptr_alloc helper
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (12 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 13/32] bpf: Introduce bpf_list_head support for BPF maps Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-07 23:30 ` Alexei Starovoitov
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 15/32] bpf: Add helper macro bpf_expr_for_each_reg_in_vstate Kumar Kartikeya Dwivedi
` (17 subsequent siblings)
31 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
To allocate local kptrs of types defined in program BTF instead of
kernel BTF, introduce bpf_kptr_alloc, a new helper that takes the local
type's BTF ID and returns a pointer to a new object of that type. The
size is automatically inferred from the type ID by the BPF verifier, so
the user only passes the BTF ID and flags, if any. For now, no flags
are supported.
First, we use the new constant argument type support for kfuncs, which
enforces that the argument is a constant. We need to know the local
type's BTF ID statically to enforce safety properties for the
allocation. Next, we remember this ID and dynamically assign the return
type. During that phase, we also query the actual size of the structure
being allocated, and whether it is a struct type. If so, we stash the
size for the do_misc_fixups phase, where we rewrite the first argument
to be the size instead of the local type's BTF ID, which we can then
pass on to the kernel allocator.
This needs some additional support for kfuncs, as we were not doing
argument rewrites for them. The fixup has been moved inside
fixup_kfunc_call itself to avoid polluting the huge do_misc_fixups,
and the delta, prog, and insn pointers are recalculated based on
whether any instructions were patched.
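As a rough illustration of the net effect of this rewrite (struct foo
is a hypothetical local type, not part of this patch):

	/* What the BPF program writes: */
	p = bpf_kptr_alloc(bpf_core_type_id_local(struct foo), 0);

	/* What effectively runs after the verifier patches R1 to the
	 * type's size:
	 */
	p = bpf_kptr_alloc(sizeof(struct foo), 0);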
The returned pointer needs to be handled specially as well. Normally,
only struct pointers may be returned; a new internal kfunc flag
__KF_RET_DYN_BTF indicates that the BTF is ascertained dynamically from
the arguments, hence the return type is forced to be void * instead.
For now, bpf_kptr_alloc is the only user of this support.
Hence, allocations using bpf_kptr_alloc are type safe. Later patches
will introduce constructor and destructor support for local kptrs
allocated from this helper. This would allow embedding kernel objects
like bpf_spin_lock, bpf_list_node, bpf_list_head inside a local kptr
allocation, and ensuring they are correctly initialized before use.
A new type flag is associated with PTR_TO_BTF_ID returned from
bpf_kptr_alloc: MEM_TYPE_LOCAL. This indicates that the type of the
memory is a local type coming from the program's BTF.
The btf_struct_access mechanism is tuned to allow BPF_WRITE access to
these allocated objects, so that programs can store data in them as
usual. When following a pointer type inside such a PTR_TO_BTF_ID,
WALK_PTR sets the destination register to a scalar instead; it would
not be safe to recognize pointer types in local types. This can be
changed in the future if embedding kptrs inside such local kptrs is
allowed.
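A minimal usage sketch in BPF C (struct foo, the program type, and its
body are illustrative assumptions; bpf_kptr_free is the freeing
counterpart introduced elsewhere in this series):

	struct foo {
		int data;
		struct task_struct *t;
	};

	SEC("tc")
	int kptr_alloc_example(struct __sk_buff *ctx)
	{
		struct foo *f;

		/* local_type_id must be a known constant, flags must be 0 */
		f = bpf_kptr_alloc(bpf_core_type_id_local(struct foo), 0);
		if (!f)
			return 0;
		f->data = 42;	/* writes into the local kptr are allowed */
		/* reading f->t yields a scalar, not a trusted pointer */
		bpf_kptr_free(f);
		return 0;
	}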
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 12 +-
include/linux/bpf_verifier.h | 1 +
include/linux/btf.h | 3 +
kernel/bpf/btf.c | 8 +-
kernel/bpf/helpers.c | 17 ++
kernel/bpf/verifier.c | 156 +++++++++++++++---
net/bpf/bpf_dummy_struct_ops.c | 5 +-
net/ipv4/bpf_tcp_ca.c | 5 +-
.../testing/selftests/bpf/bpf_experimental.h | 14 ++
9 files changed, 191 insertions(+), 30 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 35c2e9caeb98..5c8bfb0eba17 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -486,6 +486,12 @@ enum bpf_type_flag {
/* Size is known at compile time. */
MEM_FIXED_SIZE = BIT(10 + BPF_BASE_TYPE_BITS),
+ /* MEM is of a type from program BTF, not kernel BTF. This is used to
+ * tag PTR_TO_BTF_ID allocated using bpf_kptr_alloc, since they have
+ * entirely different semantics.
+ */
+ MEM_TYPE_LOCAL = BIT(11 + BPF_BASE_TYPE_BITS),
+
__BPF_TYPE_FLAG_MAX,
__BPF_TYPE_LAST_FLAG = __BPF_TYPE_FLAG_MAX - 1,
};
@@ -757,7 +763,8 @@ struct bpf_verifier_ops {
const struct btf *btf,
const struct btf_type *t, int off, int size,
enum bpf_access_type atype,
- u32 *next_btf_id, enum bpf_type_flag *flag);
+ u32 *next_btf_id, enum bpf_type_flag *flag,
+ bool local_type);
};
struct bpf_prog_offload_ops {
@@ -1995,7 +2002,8 @@ static inline bool bpf_tracing_btf_ctx_access(int off, int size,
int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
const struct btf_type *t, int off, int size,
enum bpf_access_type atype,
- u32 *next_btf_id, enum bpf_type_flag *flag);
+ u32 *next_btf_id, enum bpf_type_flag *flag,
+ bool local_type);
bool btf_struct_ids_match(struct bpf_verifier_log *log,
const struct btf *btf, u32 id, int off,
const struct btf *need_btf, u32 need_type_id,
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index c4d21568d192..c6d550978d63 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -403,6 +403,7 @@ struct bpf_insn_aux_data {
*/
struct bpf_loop_inline_state loop_inline_state;
};
+ u64 kptr_alloc_size; /* used to store size of local kptr allocation */
u64 map_key_state; /* constant (32 bit) key tracking for maps */
int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
u32 seen; /* this insn was processed by the verifier at env->pass_cnt */
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 9b62b8b2117e..fc35c932e89e 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -52,6 +52,9 @@
#define KF_SLEEPABLE (1 << 5) /* kfunc may sleep */
#define KF_DESTRUCTIVE (1 << 6) /* kfunc performs destructive actions */
+/* Internal kfunc flags, not meant for general use */
+#define __KF_RET_DYN_BTF (1 << 7) /* kfunc returns dynamically ascertained PTR_TO_BTF_ID */
+
struct btf;
struct btf_member;
struct btf_type;
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 0fb045be3837..17977e0f4e09 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5919,7 +5919,8 @@ static int btf_struct_walk(struct bpf_verifier_log *log, const struct btf *btf,
int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
const struct btf_type *t, int off, int size,
enum bpf_access_type atype __maybe_unused,
- u32 *next_btf_id, enum bpf_type_flag *flag)
+ u32 *next_btf_id, enum bpf_type_flag *flag,
+ bool local_type)
{
enum bpf_type_flag tmp_flag = 0;
int err;
@@ -5930,6 +5931,11 @@ int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
switch (err) {
case WALK_PTR:
+ /* For local types, the destination register cannot
+ * become a pointer again.
+ */
+ if (local_type)
+ return SCALAR_VALUE;
/* If we found the pointer or scalar on t+off,
* we're done.
*/
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index fc08035f14ed..d417aa4f0b22 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1696,10 +1696,27 @@ bpf_base_func_proto(enum bpf_func_id func_id)
}
}
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+ "Global functions as their definitions will be in vmlinux BTF");
+
+void *bpf_kptr_alloc(u64 local_type_id__k, u64 flags)
+{
+ /* Verifier patches local_type_id__k to size */
+ u64 size = local_type_id__k;
+
+ if (flags)
+ return NULL;
+ return kmalloc(size, GFP_ATOMIC);
+}
+
+__diag_pop();
+
BTF_SET8_START(tracing_btf_ids)
#ifdef CONFIG_KEXEC_CORE
BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
#endif
+BTF_ID_FLAGS(func, bpf_kptr_alloc, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
BTF_SET8_END(tracing_btf_ids)
static const struct btf_kfunc_id_set tracing_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ab91e5ca7e41..8f28aa7f1e8d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -472,6 +472,11 @@ static bool type_may_be_null(u32 type)
return type & PTR_MAYBE_NULL;
}
+static bool type_is_local(u32 type)
+{
+ return type & MEM_TYPE_LOCAL;
+}
+
static bool is_acquire_function(enum bpf_func_id func_id,
const struct bpf_map *map)
{
@@ -4556,17 +4561,22 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
return -EACCES;
}
- if (env->ops->btf_struct_access) {
+ /* For allocated PTR_TO_BTF_ID pointing to a local type, we cannot do
+ * btf_struct_access callback.
+ */
+ if (env->ops->btf_struct_access && !type_is_local(reg->type)) {
ret = env->ops->btf_struct_access(&env->log, reg->btf, t,
- off, size, atype, &btf_id, &flag);
+ off, size, atype, &btf_id, &flag,
+ false);
} else {
- if (atype != BPF_READ) {
+ /* It is allowed to write to pointer to a local type */
+ if (atype != BPF_READ && !type_is_local(reg->type)) {
verbose(env, "only read is supported\n");
return -EACCES;
}
ret = btf_struct_access(&env->log, reg->btf, t, off, size,
- atype, &btf_id, &flag);
+ atype, &btf_id, &flag, type_is_local(reg->type));
}
if (ret < 0)
@@ -4630,7 +4640,7 @@ static int check_ptr_to_map_access(struct bpf_verifier_env *env,
return -EACCES;
}
- ret = btf_struct_access(&env->log, btf_vmlinux, t, off, size, atype, &btf_id, &flag);
+ ret = btf_struct_access(&env->log, btf_vmlinux, t, off, size, atype, &btf_id, &flag, false);
if (ret < 0)
return ret;
@@ -7661,6 +7671,11 @@ static bool is_kfunc_destructive(struct bpf_kfunc_arg_meta *meta)
return meta->kfunc_flags & KF_DESTRUCTIVE;
}
+static bool __is_kfunc_ret_dyn_btf(struct bpf_kfunc_arg_meta *meta)
+{
+ return meta->kfunc_flags & __KF_RET_DYN_BTF;
+}
+
static bool is_kfunc_arg_kptr_get(struct bpf_kfunc_arg_meta *meta, int arg)
{
return arg == 0 && (meta->kfunc_flags & KF_KPTR_GET);
@@ -7751,6 +7766,24 @@ static u32 *reg2btf_ids[__BPF_REG_TYPE_MAX] = {
#endif
};
+BTF_ID_LIST(special_kfuncs)
+BTF_ID(func, bpf_kptr_alloc)
+
+enum bpf_special_kfuncs {
+ KF_SPECIAL_bpf_kptr_alloc,
+ KF_SPECIAL_MAX,
+};
+
+static bool __is_kfunc_special(const struct btf *btf, u32 func_id, unsigned int kf_sp)
+{
+ if (btf != btf_vmlinux || kf_sp >= KF_SPECIAL_MAX)
+ return false;
+ return func_id == special_kfuncs[kf_sp];
+}
+
+#define is_kfunc_special(btf, func_id, func_name) \
+ __is_kfunc_special(btf, func_id, KF_SPECIAL_##func_name)
+
enum kfunc_ptr_arg_types {
KF_ARG_PTR_TO_CTX,
KF_ARG_PTR_TO_BTF_ID, /* Also covers reg2btf_ids conversions */
@@ -8120,20 +8153,55 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
mark_reg_unknown(env, regs, BPF_REG_0);
mark_btf_func_reg_size(env, BPF_REG_0, t->size);
} else if (btf_type_is_ptr(t)) {
- ptr_type = btf_type_skip_modifiers(desc_btf, t->type,
- &ptr_type_id);
- if (!btf_type_is_struct(ptr_type)) {
- ptr_type_name = btf_name_by_offset(desc_btf,
- ptr_type->name_off);
- verbose(env, "kernel function %s returns pointer type %s %s is not supported\n",
- func_name, btf_type_str(ptr_type),
- ptr_type_name);
- return -EINVAL;
- }
+ struct btf *ret_btf;
+ u32 ret_btf_id;
+
+ ptr_type = btf_type_skip_modifiers(desc_btf, t->type, &ptr_type_id);
mark_reg_known_zero(env, regs, BPF_REG_0);
- regs[BPF_REG_0].btf = desc_btf;
regs[BPF_REG_0].type = PTR_TO_BTF_ID;
- regs[BPF_REG_0].btf_id = ptr_type_id;
+
+ if (__is_kfunc_ret_dyn_btf(&meta)) {
+ const struct btf_type *ret_t;
+
+ /* Currently, only bpf_kptr_alloc needs special handling */
+ if (!is_kfunc_special(meta.btf, meta.func_id, bpf_kptr_alloc) ||
+ !meta.arg_constant.found || !btf_type_is_void(ptr_type)) {
+ verbose(env, "verifier internal error: misconfigured kfunc\n");
+ return -EFAULT;
+ }
+
+ if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
+ verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
+ return -EINVAL;
+ }
+
+ ret_btf = env->prog->aux->btf;
+ ret_btf_id = meta.arg_constant.value;
+
+ ret_t = btf_type_by_id(ret_btf, ret_btf_id);
+ if (!ret_t || !__btf_type_is_struct(ret_t)) {
+ verbose(env, "local type ID %d passed to bpf_kptr_alloc does not refer to struct\n",
+ ret_btf_id);
+ return -EINVAL;
+ }
+ /* Remember this so that we can rewrite R1 as size in fixup_kfunc_call */
+ env->insn_aux_data[insn_idx].kptr_alloc_size = ret_t->size;
+ /* For now, since we hardcode prog->btf, also hardcode
+ * setting of this flag.
+ */
+ regs[BPF_REG_0].type |= MEM_TYPE_LOCAL;
+ } else {
+ if (!btf_type_is_struct(ptr_type)) {
+ ptr_type_name = btf_name_by_offset(desc_btf, ptr_type->name_off);
+ verbose(env, "kernel function %s returns pointer type %s %s is not supported\n",
+ func_name, btf_type_str(ptr_type), ptr_type_name);
+ return -EINVAL;
+ }
+ ret_btf = desc_btf;
+ ret_btf_id = ptr_type_id;
+ }
+ regs[BPF_REG_0].btf = ret_btf;
+ regs[BPF_REG_0].btf_id = ret_btf_id;
if (is_kfunc_ret_null(&meta)) {
regs[BPF_REG_0].type |= PTR_MAYBE_NULL;
/* For mark_ptr_or_null_reg, see 93c230e3f5bd6 */
@@ -14371,8 +14439,43 @@ static int fixup_call_args(struct bpf_verifier_env *env)
return err;
}
+static int do_kfunc_fixups(struct bpf_verifier_env *env, struct bpf_insn *insn,
+ s32 imm, int insn_idx, int delta)
+{
+ struct bpf_insn insn_buf[16];
+ struct bpf_prog *new_prog;
+ int cnt;
+
+ /* No need to lookup btf, only vmlinux kfuncs are supported for special
+ * kfuncs handling. Hence when insn->off is zero, check if it is a
+ * special kfunc by hardcoding btf as btf_vmlinux.
+ */
+ if (!insn->off && is_kfunc_special(btf_vmlinux, insn->imm, bpf_kptr_alloc)) {
+ u64 local_type_size = env->insn_aux_data[insn_idx + delta].kptr_alloc_size;
+
+ insn_buf[0] = BPF_MOV64_IMM(BPF_REG_1, local_type_size);
+ insn_buf[1] = *insn;
+ cnt = 2;
+
+ new_prog = bpf_patch_insn_data(env, insn_idx + delta, insn_buf, cnt);
+ if (!new_prog)
+ return -ENOMEM;
+
+ delta += cnt - 1;
+ insn = new_prog->insnsi + insn_idx + delta;
+ goto patch_call_imm;
+ }
+
+ insn->imm = imm;
+ return 0;
+patch_call_imm:
+ insn->imm = imm;
+ return cnt - 1;
+}
+
static int fixup_kfunc_call(struct bpf_verifier_env *env,
- struct bpf_insn *insn)
+ struct bpf_insn *insn,
+ int insn_idx, int delta)
{
const struct bpf_kfunc_desc *desc;
@@ -14391,9 +14494,7 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env,
return -EFAULT;
}
- insn->imm = desc->imm;
-
- return 0;
+ return do_kfunc_fixups(env, insn, desc->imm, insn_idx, delta);
}
/* Do various post-verification rewrites in a single program pass.
@@ -14534,9 +14635,18 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
if (insn->src_reg == BPF_PSEUDO_CALL)
continue;
if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
- ret = fixup_kfunc_call(env, insn);
- if (ret)
+ ret = fixup_kfunc_call(env, insn, i, delta);
+ if (ret < 0)
return ret;
+ /* If ret > 0, fixup_kfunc_call did some instruction
+ * rewrites. Increment delta, reload prog and insn,
+ * env->prog is already set by it to the new_prog.
+ */
+ if (ret) {
+ delta += ret;
+ prog = env->prog;
+ insn = prog->insnsi + i + delta;
+ }
continue;
}
diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c
index e78dadfc5829..fa572714c6f6 100644
--- a/net/bpf/bpf_dummy_struct_ops.c
+++ b/net/bpf/bpf_dummy_struct_ops.c
@@ -160,7 +160,8 @@ static int bpf_dummy_ops_btf_struct_access(struct bpf_verifier_log *log,
const struct btf_type *t, int off,
int size, enum bpf_access_type atype,
u32 *next_btf_id,
- enum bpf_type_flag *flag)
+ enum bpf_type_flag *flag,
+ bool local_type)
{
const struct btf_type *state;
s32 type_id;
@@ -178,7 +179,7 @@ static int bpf_dummy_ops_btf_struct_access(struct bpf_verifier_log *log,
}
err = btf_struct_access(log, btf, t, off, size, atype, next_btf_id,
- flag);
+ flag, false);
if (err < 0)
return err;
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 85a9e500c42d..869b6266833c 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -73,13 +73,14 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
const struct btf_type *t, int off,
int size, enum bpf_access_type atype,
u32 *next_btf_id,
- enum bpf_type_flag *flag)
+ enum bpf_type_flag *flag,
+ bool local_type)
{
size_t end;
if (atype == BPF_READ)
return btf_struct_access(log, btf, t, off, size, atype, next_btf_id,
- flag);
+ flag, false);
if (t != tcp_sock_type) {
bpf_log(log, "only read is supported\n");
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index ea1b3b1839d1..bddd77093d1e 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -18,4 +18,18 @@ struct bpf_list_node {
#endif
#ifndef __KERNEL__
+
+/* Description
+ * Allocates a local kptr of type represented by 'local_type_id' in program
+ * BTF. User may use the bpf_core_type_id_local macro to pass the type ID
+ * of a struct in program BTF.
+ *
+ * The 'local_type_id' parameter must be a known constant.
+ * The 'flags' parameter must be 0.
+ * Returns
+ * A local kptr corresponding to passed in 'local_type_id', or NULL on
+ * failure.
+ */
+void *bpf_kptr_alloc(__u64 local_type_id, __u64 flags) __ksym;
+
#endif
--
2.34.1
* [PATCH RFC bpf-next v1 15/32] bpf: Add helper macro bpf_expr_for_each_reg_in_vstate
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (13 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 14/32] bpf: Introduce bpf_kptr_alloc helper Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-07 23:48 ` Alexei Starovoitov
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model Kumar Kartikeya Dwivedi
` (16 subsequent siblings)
31 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
For a lot of use cases in future patches, we will want to modify the
state of registers that belong to the same 'group' (e.g. the same
ref_obj_id). This won't be limited to releasing reference state; it
also includes setting a type flag dynamically based on certain actions,
etc.
Hence, we need a way to easily pass a callback to a function that
iterates over all registers of the current bpf_verifier_state, in all
frames up to (and including) the curframe.
While in C++ we could easily use a lambda to pass state and the
callback together, sadly we aren't using C++ in the kernel. The next
best thing, to avoid defining a function for each case, is statement
expressions in GNU C. The kernel already uses them heavily, hence they
can be passed to the macro in the style of a lambda. The statement
expression is then substituted into the for loop bodies.
The variables __state and __reg are set to the current bpf_func_state
and register for each invocation of the expression inside the passed
in verifier state.
Then, convert mark_ptr_or_null_regs, clear_all_pkt_pointers,
release_reference, find_good_pkt_pointers, find_equal_scalars to
use bpf_expr_for_each_reg_in_vstate.
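For illustration, a minimal sketch of the resulting usage pattern,
mirroring the release_reference conversion in this patch; the body
passed as the last argument is expanded inside the iteration loops,
much like a lambda would be:

	struct bpf_func_state *state;
	struct bpf_reg_state *reg;

	bpf_expr_for_each_reg_in_vstate(env->cur_state, state, reg, ({
		if (reg->ref_obj_id == ref_obj_id)
			__mark_reg_unknown(env, reg);
	}));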
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf_verifier.h | 21 ++++++
kernel/bpf/verifier.c | 135 ++++++++---------------------------
2 files changed, 49 insertions(+), 107 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index c6d550978d63..73d9443d0074 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -354,6 +354,27 @@ struct bpf_verifier_state {
iter < frame->allocated_stack / BPF_REG_SIZE; \
iter++, reg = bpf_get_spilled_reg(iter, frame))
+/* Invoke __expr over registers in __vst, setting __state and __reg */
+#define bpf_expr_for_each_reg_in_vstate(__vst, __state, __reg, __expr) \
+ ({ \
+ struct bpf_verifier_state *___vstate = __vst; \
+ int ___i, ___j; \
+ for (___i = 0; ___i <= ___vstate->curframe; ___i++) { \
+ struct bpf_reg_state *___regs; \
+ __state = ___vstate->frame[___i]; \
+ ___regs = __state->regs; \
+ for (___j = 0; ___j < MAX_BPF_REG; ___j++) { \
+ __reg = &___regs[___j]; \
+ (void)(__expr); \
+ } \
+ bpf_for_each_spilled_reg(___j, __state, __reg) { \
+ if (!__reg) \
+ continue; \
+ (void)(__expr); \
+ } \
+ } \
+ })
+
/* linked list of verifier states used to prune search */
struct bpf_verifier_state_list {
struct bpf_verifier_state state;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 8f28aa7f1e8d..817131537adb 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6546,31 +6546,15 @@ static int check_func_proto(const struct bpf_func_proto *fn, int func_id)
/* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
* are now invalid, so turn them into unknown SCALAR_VALUE.
*/
-static void __clear_all_pkt_pointers(struct bpf_verifier_env *env,
- struct bpf_func_state *state)
+static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
{
- struct bpf_reg_state *regs = state->regs, *reg;
- int i;
-
- for (i = 0; i < MAX_BPF_REG; i++)
- if (reg_is_pkt_pointer_any(&regs[i]))
- mark_reg_unknown(env, regs, i);
+ struct bpf_func_state *state;
+ struct bpf_reg_state *reg;
- bpf_for_each_spilled_reg(i, state, reg) {
- if (!reg)
- continue;
+ bpf_expr_for_each_reg_in_vstate(env->cur_state, state, reg, ({
if (reg_is_pkt_pointer_any(reg))
__mark_reg_unknown(env, reg);
- }
-}
-
-static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
-{
- struct bpf_verifier_state *vstate = env->cur_state;
- int i;
-
- for (i = 0; i <= vstate->curframe; i++)
- __clear_all_pkt_pointers(env, vstate->frame[i]);
+ }));
}
enum {
@@ -6599,41 +6583,24 @@ static void mark_pkt_end(struct bpf_verifier_state *vstate, int regn, bool range
reg->range = AT_PKT_END;
}
-static void release_reg_references(struct bpf_verifier_env *env,
- struct bpf_func_state *state,
- int ref_obj_id)
-{
- struct bpf_reg_state *regs = state->regs, *reg;
- int i;
-
- for (i = 0; i < MAX_BPF_REG; i++)
- if (regs[i].ref_obj_id == ref_obj_id)
- mark_reg_unknown(env, regs, i);
-
- bpf_for_each_spilled_reg(i, state, reg) {
- if (!reg)
- continue;
- if (reg->ref_obj_id == ref_obj_id)
- __mark_reg_unknown(env, reg);
- }
-}
-
/* The pointer with the specified id has released its reference to kernel
* resources. Identify all copies of the same pointer and clear the reference.
*/
static int release_reference(struct bpf_verifier_env *env,
int ref_obj_id)
{
- struct bpf_verifier_state *vstate = env->cur_state;
+ struct bpf_func_state *state;
+ struct bpf_reg_state *reg;
int err;
- int i;
err = release_reference_state(cur_func(env), ref_obj_id);
if (err)
return err;
- for (i = 0; i <= vstate->curframe; i++)
- release_reg_references(env, vstate->frame[i], ref_obj_id);
+ bpf_expr_for_each_reg_in_vstate(env->cur_state, state, reg, ({
+ if (reg->ref_obj_id == ref_obj_id)
+ __mark_reg_unknown(env, reg);
+ }));
return 0;
}
@@ -9844,34 +9811,14 @@ static int check_alu_op(struct bpf_verifier_env *env, struct bpf_insn *insn)
return 0;
}
-static void __find_good_pkt_pointers(struct bpf_func_state *state,
- struct bpf_reg_state *dst_reg,
- enum bpf_reg_type type, int new_range)
-{
- struct bpf_reg_state *reg;
- int i;
-
- for (i = 0; i < MAX_BPF_REG; i++) {
- reg = &state->regs[i];
- if (reg->type == type && reg->id == dst_reg->id)
- /* keep the maximum range already checked */
- reg->range = max(reg->range, new_range);
- }
-
- bpf_for_each_spilled_reg(i, state, reg) {
- if (!reg)
- continue;
- if (reg->type == type && reg->id == dst_reg->id)
- reg->range = max(reg->range, new_range);
- }
-}
-
static void find_good_pkt_pointers(struct bpf_verifier_state *vstate,
struct bpf_reg_state *dst_reg,
enum bpf_reg_type type,
bool range_right_open)
{
- int new_range, i;
+ struct bpf_func_state *state;
+ struct bpf_reg_state *reg;
+ int new_range;
if (dst_reg->off < 0 ||
(dst_reg->off == 0 && range_right_open))
@@ -9936,9 +9883,11 @@ static void find_good_pkt_pointers(struct bpf_verifier_state *vstate,
* the range won't allow anything.
* dst_reg->off is known < MAX_PACKET_OFF, therefore it fits in a u16.
*/
- for (i = 0; i <= vstate->curframe; i++)
- __find_good_pkt_pointers(vstate->frame[i], dst_reg, type,
- new_range);
+ bpf_expr_for_each_reg_in_vstate(vstate, state, reg, ({
+ if (reg->type == type && reg->id == dst_reg->id)
+ /* keep the maximum range already checked */
+ reg->range = max(reg->range, new_range);
+ }));
}
static int is_branch32_taken(struct bpf_reg_state *reg, u32 val, u8 opcode)
@@ -10427,7 +10376,7 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
if (!reg_may_point_to_spin_lock(reg)) {
/* For not-NULL ptr, reg->ref_obj_id will be reset
- * in release_reg_references().
+ * in release_reference().
*
* reg->id is still used by spin_lock ptr. Other
* than spin_lock ptr type, reg->id can be reset.
@@ -10437,22 +10386,6 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
}
}
-static void __mark_ptr_or_null_regs(struct bpf_func_state *state, u32 id,
- bool is_null)
-{
- struct bpf_reg_state *reg;
- int i;
-
- for (i = 0; i < MAX_BPF_REG; i++)
- mark_ptr_or_null_reg(state, &state->regs[i], id, is_null);
-
- bpf_for_each_spilled_reg(i, state, reg) {
- if (!reg)
- continue;
- mark_ptr_or_null_reg(state, reg, id, is_null);
- }
-}
-
/* The logic is similar to find_good_pkt_pointers(), both could eventually
* be folded together at some point.
*/
@@ -10460,10 +10393,9 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
bool is_null)
{
struct bpf_func_state *state = vstate->frame[vstate->curframe];
- struct bpf_reg_state *regs = state->regs;
+ struct bpf_reg_state *regs = state->regs, *reg;
u32 ref_obj_id = regs[regno].ref_obj_id;
u32 id = regs[regno].id;
- int i;
if (ref_obj_id && ref_obj_id == id && is_null)
/* regs[regno] is in the " == NULL" branch.
@@ -10472,8 +10404,9 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
*/
WARN_ON_ONCE(release_reference_state(state, id));
- for (i = 0; i <= vstate->curframe; i++)
- __mark_ptr_or_null_regs(vstate->frame[i], id, is_null);
+ bpf_expr_for_each_reg_in_vstate(vstate, state, reg, ({
+ mark_ptr_or_null_reg(state, reg, id, is_null);
+ }));
}
static bool try_match_pkt_pointers(const struct bpf_insn *insn,
@@ -10586,23 +10519,11 @@ static void find_equal_scalars(struct bpf_verifier_state *vstate,
{
struct bpf_func_state *state;
struct bpf_reg_state *reg;
- int i, j;
- for (i = 0; i <= vstate->curframe; i++) {
- state = vstate->frame[i];
- for (j = 0; j < MAX_BPF_REG; j++) {
- reg = &state->regs[j];
- if (reg->type == SCALAR_VALUE && reg->id == known_reg->id)
- *reg = *known_reg;
- }
-
- bpf_for_each_spilled_reg(j, state, reg) {
- if (!reg)
- continue;
- if (reg->type == SCALAR_VALUE && reg->id == known_reg->id)
- *reg = *known_reg;
- }
- }
+ bpf_expr_for_each_reg_in_vstate(vstate, state, reg, ({
+ if (reg->type == SCALAR_VALUE && reg->id == known_reg->id)
+ *reg = *known_reg;
+ }));
}
static int check_cond_jmp_op(struct bpf_verifier_env *env,
--
2.34.1
* [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (14 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 15/32] bpf: Add helper macro bpf_expr_for_each_reg_in_vstate Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-08 0:34 ` Alexei Starovoitov
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 17/32] bpf: Support bpf_list_node in local kptrs Kumar Kartikeya Dwivedi
` (15 subsequent siblings)
31 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Add the concept of a memory object model to the BPF verifier.
What this means is that there are now some types that are not just
plain old data, but require explicit action after they are allocated
in storage, before their lifetime is considered to have started and
before they are allowed to escape the program. The verifier tracks the
state of such fields during the various phases of the object lifetime,
where it can be sure about certain invariants.
Some inspiration is taken from existing memory object and lifetime
models in C and C++, which have stood the test of time. See [0], [1],
[2] for more information and some similarities. In the future, the
separation of storage and object lifetime may be made more stark by
allowing the effective type of storage allocated for a local kptr to
be changed. For now, that has been left out. It is only possible when
the verifier understands when the program has exclusive access to the
storage, and when the object it is hosting is no longer accessible to
other CPUs.
This can be useful to maintain size-class based freelists inside BPF
programs and reuse storage of the same size for different types. This
would only be safe to allow if the verifier can ensure that while the
storage lifetime has not ended, the object lifetime for the current
type has. This necessitates separating the two and accommodating a
simple model to track object lifetime (composed recursively of more
objects whose lifetimes are individually tracked).
Every time a BPF program allocates such a non-trivial type, it must
call a set of constructors on the object to fully begin its lifetime
before it can make use of the pointer to this type. If the program
does not do so, the verifier will complain and the program will fail
to load.
Similarly, when ending the lifetime of such types, it is required to
fully destruct the object using a series of destructors, one for each
non-trivial member, before finally freeing the storage the object is
occupying.
During both the construction and destruction phases, only one program
can own and access such an object, hence there is no need for any
explicit synchronization. The single ownership of such objects makes
it easy for the verifier to enforce safety around the beginning and
end of the lifetime without resorting to dynamic checks.
When there are multiple fields needing construction or destruction,
the program must call their constructors in ascending order of field
offset.
For example, consider the following type (support for such fields will
be added in subsequent patches):
struct data {
struct bpf_spin_lock lock;
struct bpf_list_head list __contains(struct, foo, node);
int data;
};
struct data *d = bpf_kptr_alloc(...);
if (!d) { ... }
Now, the type of d would be PTR_TO_BTF_ID | MEM_TYPE_LOCAL |
OBJ_CONSTRUCTING, as it needs two constructor calls (for lock and head),
before it can be considered fully initialized and alive.
Hence, we must do (in order of field offsets):
bpf_spin_lock_init(&d->lock);
bpf_list_head_init(&d->list);
Once the final constructor call that is required for the type is made,
in this case bpf_list_head_init, the verifier will unmark the
OBJ_CONSTRUCTING flag. Now, the type is PTR_TO_BTF_ID | MEM_TYPE_LOCAL,
so the pointer can be used anywhere these local kptrs are allowed, and
it can also escape the program.
The verifier ensures that the pointer can only be made visible once the
construction of the object is complete.
Likewise, once the first call to destroy the non-trivial field with
the greatest offset is made, the verifier marks the pointer as
OBJ_DESTRUCTING, ensuring that the destruction is taken to its
conclusion, and rendering the object unusable except for destruction
and the consequent freeing of the storage it is occupying.
Construction is always done in ascending order of field offsets, and
destruction is done in descending order of field offsets.
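For the struct data example above, the corresponding destruction
sequence would look roughly as follows (the destructor kfunc names are
placeholders, as the actual destructors are introduced in later
patches):

	/* descending order of field offsets */
	bpf_list_head_fini(&d->list);	/* placeholder destructor name */
	bpf_spin_lock_fini(&d->lock);	/* placeholder destructor name */
	bpf_kptr_free(d);		/* finally, free the storage */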
[0]: https://eel.is/c++draft/basic.life
"C++: Memory and Objects"
[1]: https://en.cppreference.com/w/cpp/language/lifetime
"C++: Object Lifetime"
[2]: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2318r1.pdf
"A Provenance-aware Memory Object Model for C"
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 15 ++
include/linux/bpf_verifier.h | 37 ++++-
kernel/bpf/verifier.c | 286 +++++++++++++++++++++++++++++++++++
3 files changed, 336 insertions(+), 2 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5c8bfb0eba17..910aa891b97a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -492,6 +492,21 @@ enum bpf_type_flag {
*/
MEM_TYPE_LOCAL = BIT(11 + BPF_BASE_TYPE_BITS),
+ /* This is applied to PTR_TO_BTF_ID pointing to object of a local type
+ * (also called local kptr) whose lifetime start needs explicit
+ * constructor calls in the BPF program before it can be considered
+ * fully initialized and ready for use, escape the program, etc.
+ */
+ OBJ_CONSTRUCTING = BIT(12 + BPF_BASE_TYPE_BITS),
+
+ /* This is applied to PTR_TO_BTF_ID pointing to object of a local type
+ * (also called local kptr) whose lifetime has ended officially and it
+ * needs destructor calls to be invoked in the BPF program that has
+ * final ownership of its storage before it can be released back to the
+ * memory allocator.
+ */
+ OBJ_DESTRUCTING = BIT(13 + BPF_BASE_TYPE_BITS),
+
__BPF_TYPE_FLAG_MAX,
__BPF_TYPE_LAST_FLAG = __BPF_TYPE_FLAG_MAX - 1,
};
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 73d9443d0074..2a9dcefca3b6 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -49,6 +49,15 @@ enum bpf_reg_precise {
PRECISE_ABSOLUTE,
};
+enum {
+ FIELD_STATE_UNKNOWN = 0,
+ FIELD_STATE_CONSTRUCTED = 1,
+ FIELD_STATE_DESTRUCTED = 2,
+ /* We only have room for one more state */
+ FIELD_STATE_MAX,
+};
+static_assert(FIELD_STATE_MAX <= (1 << 2));
+
struct bpf_reg_state {
/* Ordering of fields matters. See states_equal() */
enum bpf_reg_type type;
@@ -74,6 +83,17 @@ struct bpf_reg_state {
struct {
struct btf *btf;
u32 btf_id;
+ /* In case of PTR_TO_BTF_ID to a local type, sometimes
+ * it may embed some special kernel types that we need
+ * to track the state of.
+ * To save space, we use 2 bits per field for state
+ * tracking, and so have room for 16 fields. The special
+ * field with lowest offset takes first two bits,
+ * special field with second lowest offset takes next
+ * two bits, and so on. The mapping can be determined
+ * each time we encounter the type.
+ */
+ u32 states;
};
u32 mem_size; /* for PTR_TO_MEM | PTR_TO_MEM_OR_NULL */
@@ -92,8 +112,8 @@ struct bpf_reg_state {
/* Max size from any of the above. */
struct {
- unsigned long raw1;
- unsigned long raw2;
+ u64 raw1;
+ u64 raw2;
} raw;
u32 subprogno; /* for PTR_TO_FUNC */
@@ -645,4 +665,17 @@ static inline enum bpf_prog_type resolve_prog_type(struct bpf_prog *prog)
prog->aux->dst_prog->type : prog->type;
}
+static inline int local_kptr_get_state(struct bpf_reg_state *reg, u8 index)
+{
+ WARN_ON_ONCE(index >= 16);
+ return (reg->states >> (2 * index)) & 0x3;
+}
+
+static inline void local_kptr_set_state(struct bpf_reg_state *reg, u8 index, u32 state)
+{
+ WARN_ON_ONCE(state >= FIELD_STATE_MAX);
+ reg->states &= ~(0x3UL << (2 * index));
+ reg->states |= (state << (2 * index));
+}
+
#endif /* _LINUX_BPF_VERIFIER_H */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 817131537adb..64cceb7d2f20 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -585,6 +585,10 @@ static const char *reg_type_str(struct bpf_verifier_env *env,
strncpy(prefix, "percpu_", 32);
if (type & PTR_UNTRUSTED)
strncpy(prefix, "untrusted_", 32);
+ if (type & OBJ_CONSTRUCTING)
+ strncpy(prefix, "constructing_", 32);
+ if (type & OBJ_DESTRUCTING)
+ strncpy(prefix, "destructing_", 32);
snprintf(env->type_str_buf, TYPE_STR_BUF_LEN, "%s%s%s",
prefix, str[base_type(type)], postfix);
@@ -5861,6 +5865,9 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
* fixed offset.
*/
case PTR_TO_BTF_ID:
+ case PTR_TO_BTF_ID | MEM_TYPE_LOCAL:
+ case PTR_TO_BTF_ID | MEM_TYPE_LOCAL | OBJ_CONSTRUCTING:
+ case PTR_TO_BTF_ID | MEM_TYPE_LOCAL | OBJ_DESTRUCTING:
/* When referenced PTR_TO_BTF_ID is passed to release function,
* it's fixed offset must be 0. In the other cases, fixed offset
* can be non-zero.
@@ -7684,6 +7691,19 @@ static bool is_kfunc_arg_sfx_constant(const struct btf *btf, const struct btf_pa
return __kfunc_param_match_suffix(btf, arg, "__k");
}
+static bool
+is_kfunc_arg_sfx_constructing_local_kptr(const struct btf *btf,
+ const struct btf_param *arg)
+{
+ return __kfunc_param_match_suffix(btf, arg, "__clkptr");
+}
+
+static bool is_kfunc_arg_sfx_destructing_local_kptr(const struct btf *btf,
+ const struct btf_param *arg)
+{
+ return __kfunc_param_match_suffix(btf, arg, "__dlkptr");
+}
+
/* Returns true if struct is composed of scalars, 4 levels of nesting allowed */
static bool __btf_type_is_scalar_struct(struct bpf_verifier_env *env,
const struct btf *btf,
@@ -7755,6 +7775,8 @@ enum kfunc_ptr_arg_types {
KF_ARG_PTR_TO_CTX,
KF_ARG_PTR_TO_BTF_ID, /* Also covers reg2btf_ids conversions */
KF_ARG_PTR_TO_KPTR_STRONG, /* PTR_TO_KPTR but type specific */
+ KF_ARG_CONSTRUCTING_LOCAL_KPTR,
+ KF_ARG_DESTRUCTING_LOCAL_KPTR,
KF_ARG_PTR_TO_MEM,
KF_ARG_PTR_TO_MEM_SIZE, /* Size derived from next argument, skip it */
};
@@ -7778,6 +7800,12 @@ enum kfunc_ptr_arg_types get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
* arguments, we resolve it to a known kfunc_ptr_arg_types enum
* constant.
*/
+ if (is_kfunc_arg_sfx_constructing_local_kptr(meta->btf, &args[argno]))
+ return KF_ARG_CONSTRUCTING_LOCAL_KPTR;
+
+ if (is_kfunc_arg_sfx_destructing_local_kptr(meta->btf, &args[argno]))
+ return KF_ARG_DESTRUCTING_LOCAL_KPTR;
+
if (is_kfunc_arg_kptr_get(meta, argno)) {
if (!btf_type_is_ptr(ref_t)) {
verbose(env, "arg#0 BTF type must be a double pointer for kptr_get kfunc\n");
@@ -7892,6 +7920,241 @@ static int process_kf_arg_ptr_to_kptr_strong(struct bpf_verifier_env *env,
return 0;
}
+struct local_type_field {
+ enum {
+ FIELD_MAX,
+ } type;
+ enum bpf_special_kfuncs ctor_kfunc;
+ enum bpf_special_kfuncs dtor_kfunc;
+ const char *name;
+ u32 offset;
+ bool needs_destruction;
+};
+
+static int local_type_field_cmp(const void *a, const void *b)
+{
+ const struct local_type_field *fa = a, *fb = b;
+
+ if (fa->offset < fb->offset)
+ return -1;
+ else if (fa->offset > fb->offset)
+ return 1;
+ return 0;
+}
+
+static int find_local_type_fields(const struct btf *btf, u32 btf_id, struct local_type_field *fields)
+{
+ /* XXX: Fill the fields when support is added */
+ sort(fields, FIELD_MAX, sizeof(fields[0]), local_type_field_cmp, NULL);
+ return FIELD_MAX;
+}
+
+static int
+process_kf_arg_constructing_local_kptr(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg,
+ struct bpf_kfunc_arg_meta *meta)
+{
+ struct local_type_field fields[FIELD_MAX];
+ struct bpf_func_state *fstate;
+ struct bpf_reg_state *ireg;
+ int ret, i, cnt;
+
+ ret = find_local_type_fields(reg->btf, reg->btf_id, fields);
+ if (ret < 0) {
+ verbose(env, "verifier internal error: bad field specification in local type\n");
+ return -EFAULT;
+ }
+
+ cnt = ret;
+ for (i = 0; i < cnt; i++) {
+ int j;
+
+ if (fields[i].offset != reg->off)
+ continue;
+
+ switch (local_kptr_get_state(reg, i)) {
+ case FIELD_STATE_CONSTRUCTED:
+ verbose(env, "'%s' field at offset %d has already been constructed\n",
+ fields[i].name, fields[i].offset);
+ return -EINVAL;
+ case FIELD_STATE_UNKNOWN:
+ break;
+ case FIELD_STATE_DESTRUCTED:
+ WARN_ON_ONCE(1);
+ fallthrough;
+ default:
+ verbose(env, "verifier internal error: unknown field state\n");
+ return -EFAULT;
+ }
+
+ /* Make sure everything coming before us has been constructed */
+ for (j = 0; j < i; j++) {
+ if (local_kptr_get_state(reg, j) != FIELD_STATE_CONSTRUCTED) {
+ verbose(env, "'%s' field at offset %d must be constructed before this field\n",
+ fields[j].name, fields[j].offset);
+ return -EINVAL;
+ }
+ }
+
+ /* Since we always ensure everything before us is constructed,
+ * fields after us will be in unknown state, so we do not need
+ * to check them.
+ */
+ if (!__is_kfunc_special(meta->btf, meta->func_id, fields[i].ctor_kfunc)) {
+ verbose(env, "incorrect constructor function for '%s' field\n",
+ fields[i].name);
+ return -EINVAL;
+ }
+
+ /* The constructor is the right one, everything before us is
+ * also constructed, so we can mark this field as constructed.
+ */
+ bpf_expr_for_each_reg_in_vstate(env->cur_state, fstate, ireg, ({
+ if (ireg->ref_obj_id == reg->ref_obj_id)
+ local_kptr_set_state(ireg, i, FIELD_STATE_CONSTRUCTED);
+ }));
+
+ /* If we are the final field needing construction, move the
+ * object from constructing to constructed state as a whole.
+ */
+ if (i + 1 == cnt) {
+ bpf_expr_for_each_reg_in_vstate(env->cur_state, fstate, ireg, ({
+ if (ireg->ref_obj_id == reg->ref_obj_id) {
+ ireg->type &= ~OBJ_CONSTRUCTING;
+ /* clear states to make it usable for tracking states of fields
+ * after construction.
+ */
+ ireg->states = 0;
+ }
+ }));
+ }
+ return 0;
+ }
+ verbose(env, "no constructible field at offset: %d\n", reg->off);
+ return -EINVAL;
+}
+
+static int
+process_kf_arg_destructing_local_kptr(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg,
+ struct bpf_kfunc_arg_meta *meta)
+{
+ struct local_type_field fields[FIELD_MAX];
+ struct bpf_func_state *fstate;
+ struct bpf_reg_state *ireg;
+ int ret, i, cnt;
+
+ ret = find_local_type_fields(reg->btf, reg->btf_id, fields);
+ if (ret < 0) {
+ verbose(env, "verifier internal error: bad field specification in local type\n");
+ return -EFAULT;
+ }
+
+ cnt = ret;
+ /* If this is a normal reg transitioning to destructing phase,
+ * mark state for all fields as constructed, to begin tracking
+ * them during destruction.
+ */
+ if (reg->type == (PTR_TO_BTF_ID | MEM_TYPE_LOCAL)) {
+ bpf_expr_for_each_reg_in_vstate(env->cur_state, fstate, ireg, ({
+ if (ireg->ref_obj_id != reg->ref_obj_id)
+ continue;
+ for (i = 0; i < cnt; i++)
+ local_kptr_set_state(ireg, i, FIELD_STATE_CONSTRUCTED);
+ }));
+ }
+
+ for (i = 0; i < cnt; i++) {
+ bool mark_dtor = false, unmark_ctor = false;
+ int j;
+
+ if (fields[i].offset != reg->off)
+ continue;
+
+ switch (local_kptr_get_state(reg, i)) {
+ case FIELD_STATE_UNKNOWN:
+ verbose(env, "'%s' field at offset %d has not been constructed\n",
+ fields[i].name, fields[i].offset);
+ return -EINVAL;
+ case FIELD_STATE_DESTRUCTED:
+ verbose(env, "'%s' field at offset %d has already been destructed\n",
+ fields[i].name, fields[i].offset);
+ return -EINVAL;
+ case FIELD_STATE_CONSTRUCTED:
+ break;
+ default:
+ verbose(env, "verifier internal error: unknown field state\n");
+ return -EFAULT;
+ }
+
+ /* Ensure all fields after us have been destructed */
+ for (j = i + 1; j < cnt; j++) {
+ if (!fields[j].needs_destruction)
+ continue;
+ /* For normal case, every field is constructed, so we
+ * must check destruction order. If we see constructed
+ * after us that needs destruction, we catch out of
+ * order destructor call.
+ *
+ * For constructing kptr being destructed, later fields
+ * may be in unknown state. It is fine to not destruct
+ * them, as we are unwinding construction.
+ *
+ * For already destructing kptr, we can only see
+ * destructed or unknown for later fields, never
+ * constructed.
+ */
+ if (local_kptr_get_state(reg, j) == FIELD_STATE_CONSTRUCTED) {
+ verbose(env, "'%s' field at offset %d must be destructed before this field\n",
+ fields[j].name, fields[j].offset);
+ return -EINVAL;
+ }
+ }
+
+ /* Everything before us must be constructed */
+ for (j = i - 1; j >= 0; j--) {
+ if (local_kptr_get_state(reg, j) != FIELD_STATE_CONSTRUCTED) {
+ verbose(env, "invalid state of '%s' field at offset %d\n",
+ fields[j].name, fields[j].offset);
+ return -EINVAL;
+ }
+ }
+
+ if (!__is_kfunc_special(meta->btf, meta->func_id, fields[i].dtor_kfunc)) {
+ verbose(env, "incorrect destructor function for '%s' field\n",
+ fields[i].name);
+ return -EINVAL;
+ }
+
+ if (reg->type != (PTR_TO_BTF_ID | MEM_TYPE_LOCAL | OBJ_DESTRUCTING)) {
+ mark_dtor = true;
+ if (reg->type & OBJ_CONSTRUCTING)
+ unmark_ctor = true;
+ }
+
+ /* The destructor is the right one, everything after us is
+ * also destructed, so we can mark this field as destructed.
+ */
+ bpf_expr_for_each_reg_in_vstate(env->cur_state, fstate, ireg, ({
+ if (ireg->ref_obj_id != reg->ref_obj_id)
+ continue;
+ local_kptr_set_state(ireg, i, FIELD_STATE_DESTRUCTED);
+ /* If mark_dtor is true, this is either a normal or
+ * constructing kptr entering the destructing phase. If
+ * it is constructing kptr, we also need to unmark
+ * OBJ_CONSTRUCTING flag.
+ */
+ if (unmark_ctor)
+ ireg->type &= ~OBJ_CONSTRUCTING;
+ if (mark_dtor)
+ ireg->type |= OBJ_DESTRUCTING;
+ }));
+ return 0;
+ }
+ verbose(env, "no destructible field at offset: %d\n", reg->off);
+ return -EINVAL;
+}
+
static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_arg_meta *meta)
{
const char *func_name = meta->func_name, *ref_tname;
@@ -8011,6 +8274,25 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_arg_m
if (ret < 0)
return ret;
break;
+ case KF_ARG_CONSTRUCTING_LOCAL_KPTR:
+ if (reg->type != (PTR_TO_BTF_ID | MEM_TYPE_LOCAL | OBJ_CONSTRUCTING)) {
+ verbose(env, "arg#%d expected pointer to constructing local kptr\n", i);
+ return -EINVAL;
+ }
+ ret = process_kf_arg_constructing_local_kptr(env, reg, meta);
+ if (ret < 0)
+ return ret;
+ break;
+ case KF_ARG_DESTRUCTING_LOCAL_KPTR:
+ if (base_type(reg->type) != PTR_TO_BTF_ID ||
+ (type_flag(reg->type) & ~(MEM_TYPE_LOCAL | OBJ_CONSTRUCTING | OBJ_DESTRUCTING))) {
+ verbose(env, "arg#%d expected pointer to normal, constructing, or destructing local kptr\n", i);
+ return -EINVAL;
+ }
+ ret = process_kf_arg_destructing_local_kptr(env, reg, meta);
+ if (ret < 0)
+ return ret;
+ break;
case KF_ARG_PTR_TO_MEM:
resolve_ret = btf_resolve_size(btf, ref_t, &type_size);
if (IS_ERR(resolve_ret)) {
@@ -8157,6 +8439,10 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
* setting of this flag.
*/
regs[BPF_REG_0].type |= MEM_TYPE_LOCAL;
+ /* TODO: Recognize special fields in local type and
+ * force their construction before pointer escapes by
+ * setting OBJ_CONSTRUCTING.
+ */
} else {
if (!btf_type_is_struct(ptr_type)) {
ptr_type_name = btf_name_by_offset(desc_btf, ptr_type->name_off);
--
2.34.1
* [PATCH RFC bpf-next v1 17/32] bpf: Support bpf_list_node in local kptrs
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (15 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 18/32] bpf: Support bpf_spin_lock " Kumar Kartikeya Dwivedi
` (14 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
To allow a user to link their kptr-allocated node into a linked list,
we must have a linked list node type that is recognized by the
verifier as fit for this purpose. Its name and offset will be matched
against the specification on the bpf_list_head it is being added to.
This allows precise verification and type safety in BPF programs.
Since bpf_list_node does not itself correspond to a local type, but is
embedded in one (i.e. a type present in program BTF, not kernel BTF),
we need to specially tag such a field so that the verifier knows that
it is a special kernel object whose invariants must hold while the
kptr allocation is in use. For instance, reading and writing are
allowed at all other offsets in the kptr allocation, but access to
this special field is rejected.
To do so, it needs to be tagged using a "kernel" BTF declaration tag,
like so:
struct item {
int data;
struct bpf_list_node node __kernel;
};
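A sketch of how the __kernel tag could be defined for use in program
BTF (an assumption based on the "kernel" declaration tag this patch
looks for; the selftest header in this series carries the actual
definition):

	#ifndef __kernel
	#define __kernel __attribute__((btf_decl_tag("kernel")))
	#endif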
In future commits, more objects (such as kptrs inside kptrs, spin_lock,
even bpf_list_head) will be allowed in kptr allocation. But those need
more plumbing before it can all be made safe.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/btf.h | 15 ++++
kernel/bpf/btf.c | 86 ++++++++++++++++---
kernel/bpf/helpers.c | 8 ++
kernel/bpf/verifier.c | 46 ++++++++--
.../testing/selftests/bpf/bpf_experimental.h | 9 ++
5 files changed, 146 insertions(+), 18 deletions(-)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index fc35c932e89e..062bc45e1cc9 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -433,6 +433,10 @@ const struct btf_member *
btf_get_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf,
const struct btf_type *t, enum bpf_prog_type prog_type,
int arg);
+int btf_local_type_has_bpf_list_node(const struct btf *btf,
+ const struct btf_type *t, u32 *offsetp);
+bool btf_local_type_has_special_fields(const struct btf *btf,
+ const struct btf_type *t);
#else
static inline const struct btf_type *btf_type_by_id(const struct btf *btf,
u32 type_id)
@@ -471,6 +475,17 @@ btf_get_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf,
{
return NULL;
}
+static inline int btf_local_type_has_bpf_list_node(const struct btf *btf,
+ const struct btf_type *t,
+ u32 *offsetp)
+{
+ return -ENOENT;
+}
+static inline bool btf_local_type_has_special_fields(const struct btf *btf,
+ const struct btf_type *t)
+{
+ return false;
+}
#endif
#endif
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 17977e0f4e09..d8bc4752204c 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3186,6 +3186,7 @@ enum btf_field_type {
BTF_FIELD_TIMER,
BTF_FIELD_KPTR,
BTF_FIELD_LIST_HEAD,
+ BTF_FIELD_LIST_NODE,
};
enum {
@@ -3319,8 +3320,8 @@ static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
}
static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t,
- const char *name, int sz, int align,
- enum btf_field_type field_type,
+ const char *name, const char *decl_tag, int sz,
+ int align, enum btf_field_type field_type,
struct btf_field_info *info, int info_cnt)
{
const struct btf_member *member;
@@ -3334,6 +3335,8 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
if (name && strcmp(__btf_name_by_offset(btf, member_type->name_off), name))
continue;
+ if (decl_tag && !btf_find_decl_tag_value(btf, t, i, decl_tag))
+ continue;
off = __btf_member_bit_offset(t, member);
if (off % 8)
@@ -3346,6 +3349,7 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
switch (field_type) {
case BTF_FIELD_SPIN_LOCK:
case BTF_FIELD_TIMER:
+ case BTF_FIELD_LIST_NODE:
ret = btf_find_struct(btf, member_type, off, sz,
idx < info_cnt ? &info[idx] : &tmp);
if (ret < 0)
@@ -3377,8 +3381,8 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
}
static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
- const char *name, int sz, int align,
- enum btf_field_type field_type,
+ const char *name, const char *decl_tag, int sz,
+ int align, enum btf_field_type field_type,
struct btf_field_info *info, int info_cnt)
{
const struct btf_var_secinfo *vsi;
@@ -3394,6 +3398,8 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
if (name && strcmp(__btf_name_by_offset(btf, var_type->name_off), name))
continue;
+ if (decl_tag && !btf_find_decl_tag_value(btf, t, i, decl_tag))
+ continue;
if (vsi->size != sz)
continue;
if (off % align)
@@ -3402,6 +3408,7 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
switch (field_type) {
case BTF_FIELD_SPIN_LOCK:
case BTF_FIELD_TIMER:
+ case BTF_FIELD_LIST_NODE:
ret = btf_find_struct(btf, var_type, off, sz,
idx < info_cnt ? &info[idx] : &tmp);
if (ret < 0)
@@ -3433,7 +3440,7 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
}
static int btf_find_field(const struct btf *btf, const struct btf_type *t,
- enum btf_field_type field_type,
+ enum btf_field_type field_type, const char *decl_tag,
struct btf_field_info *info, int info_cnt)
{
const char *name;
@@ -3460,14 +3467,19 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
sz = sizeof(struct bpf_list_head);
align = __alignof__(struct bpf_list_head);
break;
+ case BTF_FIELD_LIST_NODE:
+ name = "bpf_list_node";
+ sz = sizeof(struct bpf_list_node);
+ align = __alignof__(struct bpf_list_node);
+ break;
default:
return -EFAULT;
}
if (__btf_type_is_struct(t))
- return btf_find_struct_field(btf, t, name, sz, align, field_type, info, info_cnt);
+ return btf_find_struct_field(btf, t, name, decl_tag, sz, align, field_type, info, info_cnt);
else if (btf_type_is_datasec(t))
- return btf_find_datasec_var(btf, t, name, sz, align, field_type, info, info_cnt);
+ return btf_find_datasec_var(btf, t, name, decl_tag, sz, align, field_type, info, info_cnt);
return -EINVAL;
}
@@ -3480,7 +3492,7 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t)
struct btf_field_info info;
int ret;
- ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, &info, 1);
+ ret = btf_find_field(btf, t, BTF_FIELD_SPIN_LOCK, NULL, &info, 1);
if (ret < 0)
return ret;
if (!ret)
@@ -3493,7 +3505,7 @@ int btf_find_timer(const struct btf *btf, const struct btf_type *t)
struct btf_field_info info;
int ret;
- ret = btf_find_field(btf, t, BTF_FIELD_TIMER, &info, 1);
+ ret = btf_find_field(btf, t, BTF_FIELD_TIMER, NULL, &info, 1);
if (ret < 0)
return ret;
if (!ret)
@@ -3510,7 +3522,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
struct module *mod = NULL;
int ret, i, nr_off;
- ret = btf_find_field(btf, t, BTF_FIELD_KPTR, info_arr, ARRAY_SIZE(info_arr));
+ ret = btf_find_field(btf, t, BTF_FIELD_KPTR, NULL, info_arr, ARRAY_SIZE(info_arr));
if (ret < 0)
return ERR_PTR(ret);
if (!ret)
@@ -3609,7 +3621,7 @@ struct bpf_map_value_off *btf_parse_list_heads(struct btf *btf, const struct btf
struct bpf_map_value_off *tab;
int ret, i, nr_off;
- ret = btf_find_field(btf, t, BTF_FIELD_LIST_HEAD, info_arr, ARRAY_SIZE(info_arr));
+ ret = btf_find_field(btf, t, BTF_FIELD_LIST_HEAD, NULL, info_arr, ARRAY_SIZE(info_arr));
if (ret < 0)
return ERR_PTR(ret);
if (!ret)
@@ -5916,6 +5928,37 @@ static int btf_struct_walk(struct bpf_verifier_log *log, const struct btf *btf,
return -EINVAL;
}
+static int btf_find_local_type_field(const struct btf *btf,
+ const struct btf_type *t,
+ enum btf_field_type type,
+ u32 *offsetp)
+{
+ struct btf_field_info info;
+ int ret;
+
+ /* These are invariants that must hold if this is a local type */
+ WARN_ON_ONCE(btf_is_kernel(btf) || !__btf_type_is_struct(t));
+ ret = btf_find_field(btf, t, type, "kernel", &info, 1);
+ if (ret < 0)
+ return ret;
+ if (!ret)
+ return 0;
+ if (offsetp)
+ *offsetp = info.off;
+ return ret;
+}
+
+int btf_local_type_has_bpf_list_node(const struct btf *btf,
+ const struct btf_type *t, u32 *offsetp)
+{
+ return btf_find_local_type_field(btf, t, BTF_FIELD_LIST_NODE, offsetp);
+}
+
+bool btf_local_type_has_special_fields(const struct btf *btf, const struct btf_type *t)
+{
+ return btf_local_type_has_bpf_list_node(btf, t, NULL) == 1;
+}
+
int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
const struct btf_type *t, int off, int size,
enum bpf_access_type atype __maybe_unused,
@@ -5926,6 +5969,27 @@ int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
int err;
u32 id;
+ if (local_type) {
+ u32 offset;
+
+#define PREVENT_DIRECT_WRITE(field) \
+ err = btf_local_type_has_##field(btf, t, &offset); \
+ if (err < 0) { \
+ bpf_log(log, "incorrect " #field " specification in local type\n"); \
+ return err; \
+ } \
+ if (err) { \
+ if (off < offset + sizeof(struct field) && offset < off + size) { \
+ bpf_log(log, "direct access to " #field " is disallowed\n"); \
+ return -EACCES; \
+ } \
+ }
+ PREVENT_DIRECT_WRITE(bpf_list_node);
+
+#undef PREVENT_DIRECT_WRITE
+ err = 0;
+ }
+
do {
err = btf_struct_walk(log, btf, t, off, size, &id, &tmp_flag);
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index d417aa4f0b22..0bb11d8bcaca 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1710,6 +1710,13 @@ void *bpf_kptr_alloc(u64 local_type_id__k, u64 flags)
return kmalloc(size, GFP_ATOMIC);
}
+void bpf_list_node_init(struct bpf_list_node *node__clkptr)
+{
+ BUILD_BUG_ON(sizeof(struct bpf_list_node) != sizeof(struct list_head));
+ BUILD_BUG_ON(__alignof__(struct bpf_list_node) != __alignof__(struct list_head));
+ INIT_LIST_HEAD((struct list_head *)node__clkptr);
+}
+
__diag_pop();
BTF_SET8_START(tracing_btf_ids)
@@ -1717,6 +1724,7 @@ BTF_SET8_START(tracing_btf_ids)
BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
#endif
BTF_ID_FLAGS(func, bpf_kptr_alloc, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
+BTF_ID_FLAGS(func, bpf_list_node_init)
BTF_SET8_END(tracing_btf_ids)
static const struct btf_kfunc_id_set tracing_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 64cceb7d2f20..1108b6200501 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7755,10 +7755,14 @@ static u32 *reg2btf_ids[__BPF_REG_TYPE_MAX] = {
BTF_ID_LIST(special_kfuncs)
BTF_ID(func, bpf_kptr_alloc)
+BTF_ID(func, bpf_list_node_init)
+BTF_ID(struct, btf) /* empty entry */
enum bpf_special_kfuncs {
KF_SPECIAL_bpf_kptr_alloc,
- KF_SPECIAL_MAX,
+ KF_SPECIAL_bpf_list_node_init,
+ KF_SPECIAL_bpf_empty,
+ KF_SPECIAL_MAX = KF_SPECIAL_bpf_empty,
};
static bool __is_kfunc_special(const struct btf *btf, u32 func_id, unsigned int kf_sp)
@@ -7922,6 +7926,7 @@ static int process_kf_arg_ptr_to_kptr_strong(struct bpf_verifier_env *env,
struct local_type_field {
enum {
+ FIELD_bpf_list_node,
FIELD_MAX,
} type;
enum bpf_special_kfuncs ctor_kfunc;
@@ -7944,9 +7949,34 @@ static int local_type_field_cmp(const void *a, const void *b)
static int find_local_type_fields(const struct btf *btf, u32 btf_id, struct local_type_field *fields)
{
- /* XXX: Fill the fields when support is added */
- sort(fields, FIELD_MAX, sizeof(fields[0]), local_type_field_cmp, NULL);
- return FIELD_MAX;
+ const struct btf_type *t;
+ int cnt = 0, ret;
+ u32 offset;
+
+ t = btf_type_by_id(btf, btf_id);
+ if (!t)
+ return -ENOENT;
+
+#define FILL_LOCAL_TYPE_FIELD(ftype, ctor, dtor, nd) \
+ ret = btf_local_type_has_##ftype(btf, t, &offset); \
+ if (ret < 0) \
+ return ret; \
+ if (ret) { \
+ fields[cnt].type = FIELD_##ftype; \
+ fields[cnt].ctor_kfunc = KF_SPECIAL_##ctor; \
+ fields[cnt].dtor_kfunc = KF_SPECIAL_##dtor; \
+ fields[cnt].name = #ftype; \
+ fields[cnt].offset = offset; \
+ fields[cnt].needs_destruction = nd; \
+ cnt++; \
+ }
+
+ FILL_LOCAL_TYPE_FIELD(bpf_list_node, bpf_list_node_init, bpf_empty, false);
+
+#undef FILL_LOCAL_TYPE_FIELD
+
+ sort(fields, cnt, sizeof(fields[0]), local_type_field_cmp, NULL);
+ return cnt;
}
static int
@@ -8439,10 +8469,12 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
* setting of this flag.
*/
regs[BPF_REG_0].type |= MEM_TYPE_LOCAL;
- /* TODO: Recognize special fields in local type aand
- * force their construction before pointer escapes by
- * setting OBJ_CONSTRUCTING.
+ /* Recognize special fields in local type and force
+ * their construction before pointer escapes by setting
+ * OBJ_CONSTRUCTING.
*/
+ if (btf_local_type_has_special_fields(ret_btf, ret_t))
+ regs[BPF_REG_0].type |= OBJ_CONSTRUCTING;
} else {
if (!btf_type_is_struct(ptr_type)) {
ptr_type_name = btf_name_by_offset(desc_btf, ptr_type->name_off);
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index bddd77093d1e..c3c5442742dc 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -32,4 +32,13 @@ struct bpf_list_node {
*/
void *bpf_kptr_alloc(__u64 local_type_id, __u64 flags) __ksym;
+/* Description
+ * Initialize bpf_list_node field in a local kptr. This kfunc has
+ * constructor semantics, and thus can only be called on a local kptr in
+ * 'constructing' phase.
+ * Returns
+ * Void.
+ */
+void bpf_list_node_init(struct bpf_list_node *node) __ksym;
+
#endif
--
2.34.1
* [PATCH RFC bpf-next v1 18/32] bpf: Support bpf_spin_lock in local kptrs
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (16 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 17/32] bpf: Support bpf_list_node in local kptrs Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-08 0:35 ` Alexei Starovoitov
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 19/32] bpf: Support bpf_list_head " Kumar Kartikeya Dwivedi
` (13 subsequent siblings)
31 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
To allow users to lock and protect data in their local kptr allocation,
add support for embedding a bpf_spin_lock as a member. This follows how
bpf_list_node is already supported as a member, by suitably tagging the
field. We will later use bpf_spin_lock to allow implementing map-in-map
style intrusive collections, while still keeping the lock and the linked
list together in one single allocation. Only one bpf_spin_lock is
allowed, and when a register of type PTR_TO_BTF_ID | MEM_TYPE_LOCAL has
one, it preserves its reg->id during mark_ptr_or_null_reg marking, so
that the verifier can figure out the pairing of bpf_spin_lock and
bpf_spin_unlock calls.
The existing process_spin_lock is refactored to support such spin locks
in allocated objects. The restriction of holding only a single spin lock
at any point in time still applies, as it is needed for deadlock safety.
The tagging works similarly to bpf_list_node: the field needs to be
tagged using a "kernel" BTF declaration tag, like so:
struct item {
int data;
struct bpf_spin_lock lock __kernel;
};
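A minimal usage sketch (not taken from this patch; bpf_core_type_id_local()
is the usual libbpf macro for obtaining the local BTF type id, and error
handling is reduced to a bare return):

struct item *it;

it = bpf_kptr_alloc(bpf_core_type_id_local(struct item), 0);
if (!it)
	return 0;
/* Constructor kfunc: only callable while 'it' is still in the
 * constructing phase, before the pointer escapes.
 */
bpf_spin_lock_init(&it->lock);

bpf_spin_lock(&it->lock);
it->data = 42;
bpf_spin_unlock(&it->lock);
/* 'it' must eventually be stored or freed (see later patches) */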
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/btf.h | 8 ++
include/linux/poison.h | 3 +
kernel/bpf/btf.c | 10 +-
kernel/bpf/helpers.c | 14 ++-
kernel/bpf/verifier.c | 104 ++++++++++++++----
.../testing/selftests/bpf/bpf_experimental.h | 9 ++
6 files changed, 121 insertions(+), 27 deletions(-)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 062bc45e1cc9..d99cad21e6d9 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -435,6 +435,8 @@ btf_get_prog_ctx_type(struct bpf_verifier_log *log, const struct btf *btf,
int arg);
int btf_local_type_has_bpf_list_node(const struct btf *btf,
const struct btf_type *t, u32 *offsetp);
+int btf_local_type_has_bpf_spin_lock(const struct btf *btf,
+ const struct btf_type *t, u32 *offsetp);
bool btf_local_type_has_special_fields(const struct btf *btf,
const struct btf_type *t);
#else
@@ -481,6 +483,12 @@ static inline int btf_local_type_has_bpf_list_node(const struct btf *btf,
{
return -ENOENT;
}
+static inline int btf_local_type_has_bpf_spin_lock(const struct btf *btf,
+ const struct btf_type *t,
+ u32 *offsetp)
+{
+ return -ENOENT;
+}
static inline bool btf_local_type_has_special_fields(const struct btf *btf,
const struct btf_type *t)
{
diff --git a/include/linux/poison.h b/include/linux/poison.h
index d62ef5a6b4e9..753e00b81acf 100644
--- a/include/linux/poison.h
+++ b/include/linux/poison.h
@@ -81,4 +81,7 @@
/********** net/core/page_pool.c **********/
#define PP_SIGNATURE (0x40 + POISON_POINTER_DELTA)
+/********** kernel/bpf/helpers.c **********/
+#define BPF_PTR_POISON ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
+
#endif
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index d8bc4752204c..63193c324898 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5954,9 +5954,16 @@ int btf_local_type_has_bpf_list_node(const struct btf *btf,
return btf_find_local_type_field(btf, t, BTF_FIELD_LIST_NODE, offsetp);
}
+int btf_local_type_has_bpf_spin_lock(const struct btf *btf,
+ const struct btf_type *t, u32 *offsetp)
+{
+ return btf_find_local_type_field(btf, t, BTF_FIELD_SPIN_LOCK, offsetp);
+}
+
bool btf_local_type_has_special_fields(const struct btf *btf, const struct btf_type *t)
{
- return btf_local_type_has_bpf_list_node(btf, t, NULL) == 1;
+ return btf_local_type_has_bpf_list_node(btf, t, NULL) == 1 ||
+ btf_local_type_has_bpf_spin_lock(btf, t, NULL) == 1;
}
int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
@@ -5985,6 +5992,7 @@ int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
} \
}
PREVENT_DIRECT_WRITE(bpf_list_node);
+ PREVENT_DIRECT_WRITE(bpf_spin_lock);
#undef PREVENT_DIRECT_WRITE
err = 0;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 0bb11d8bcaca..94a23a544aee 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -335,6 +335,7 @@ const struct bpf_func_proto bpf_spin_lock_proto = {
.gpl_only = false,
.ret_type = RET_VOID,
.arg1_type = ARG_PTR_TO_SPIN_LOCK,
+ .arg1_btf_id = BPF_PTR_POISON,
};
static inline void __bpf_spin_unlock_irqrestore(struct bpf_spin_lock *lock)
@@ -357,6 +358,7 @@ const struct bpf_func_proto bpf_spin_unlock_proto = {
.gpl_only = false,
.ret_type = RET_VOID,
.arg1_type = ARG_PTR_TO_SPIN_LOCK,
+ .arg1_btf_id = BPF_PTR_POISON,
};
void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
@@ -1375,10 +1377,10 @@ BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
return xchg(kptr, (unsigned long)ptr);
}
-/* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg()
- * helper is determined dynamically by the verifier.
+/* Unlike other PTR_TO_BTF_ID helpers the btf_id in bpf_kptr_xchg() helper is
+ * determined dynamically by the verifier. Hence, BPF_PTR_POISON is used as the
+ * placeholder pointer.
*/
-#define BPF_PTR_POISON ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
static const struct bpf_func_proto bpf_kptr_xchg_proto = {
.func = bpf_kptr_xchg,
@@ -1717,6 +1719,11 @@ void bpf_list_node_init(struct bpf_list_node *node__clkptr)
INIT_LIST_HEAD((struct list_head *)node__clkptr);
}
+void bpf_spin_lock_init(struct bpf_spin_lock *lock__clkptr)
+{
+ memset(lock__clkptr, 0, sizeof(*lock__clkptr));
+}
+
__diag_pop();
BTF_SET8_START(tracing_btf_ids)
@@ -1725,6 +1732,7 @@ BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
#endif
BTF_ID_FLAGS(func, bpf_kptr_alloc, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
BTF_ID_FLAGS(func, bpf_list_node_init)
+BTF_ID_FLAGS(func, bpf_spin_lock_init)
BTF_SET8_END(tracing_btf_ids)
static const struct btf_kfunc_id_set tracing_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1108b6200501..130a4f0550f5 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -451,8 +451,17 @@ static bool reg_type_not_null(enum bpf_reg_type type)
static bool reg_may_point_to_spin_lock(const struct bpf_reg_state *reg)
{
- return reg->type == PTR_TO_MAP_VALUE &&
- map_value_has_spin_lock(reg->map_ptr);
+ if (reg->type == PTR_TO_MAP_VALUE)
+ return map_value_has_spin_lock(reg->map_ptr);
+ if (reg->type == (PTR_TO_BTF_ID | MEM_TYPE_LOCAL | OBJ_CONSTRUCTING)) {
+ const struct btf_type *t;
+
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ if (!t)
+ return false;
+ return btf_local_type_has_bpf_spin_lock(reg->btf, t, NULL) == 1;
+ }
+ return false;
}
static bool reg_type_may_be_refcounted_or_null(enum bpf_reg_type type)
@@ -5442,8 +5451,11 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
struct bpf_reg_state *regs = cur_regs(env), *reg = ®s[regno];
struct bpf_verifier_state *cur = env->cur_state;
bool is_const = tnum_is_const(reg->var_off);
- struct bpf_map *map = reg->map_ptr;
u64 val = reg->var_off.value;
+ struct bpf_map *map = NULL;
+ struct btf *btf = NULL;
+ bool has_spin_lock;
+ int spin_lock_off;
if (!is_const) {
verbose(env,
@@ -5451,28 +5463,42 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
regno);
return -EINVAL;
}
- if (!map->btf) {
- verbose(env,
- "map '%s' has to have BTF in order to use bpf_spin_lock\n",
- map->name);
- return -EINVAL;
- }
- if (!map_value_has_spin_lock(map)) {
- if (map->spin_lock_off == -E2BIG)
+ if (reg->type == PTR_TO_MAP_VALUE) {
+ map = reg->map_ptr;
+ if (!map->btf) {
verbose(env,
- "map '%s' has more than one 'struct bpf_spin_lock'\n",
+ "map '%s' has to have BTF in order to use bpf_spin_lock\n",
map->name);
- else if (map->spin_lock_off == -ENOENT)
+ return -EINVAL;
+ }
+ has_spin_lock = map_value_has_spin_lock(map);
+ spin_lock_off = map->spin_lock_off;
+ } else {
+ int ret;
+
+ btf = reg->btf;
+ WARN_ON_ONCE(reg->var_off.value);
+ ret = btf_local_type_has_bpf_spin_lock(reg->btf, btf_type_by_id(reg->btf, reg->btf_id), &spin_lock_off);
+ if (ret <= 0)
+ spin_lock_off = ret;
+ has_spin_lock = ret > 0;
+ }
+ if (!has_spin_lock) {
+ if (spin_lock_off == -E2BIG)
verbose(env,
- "map '%s' doesn't have 'struct bpf_spin_lock'\n",
- map->name);
+ "%s '%s' has more than one 'struct bpf_spin_lock'\n",
+ map ? "map" : "local", map ? map->name : "kptr");
+ else if (spin_lock_off == -ENOENT)
+ verbose(env,
+ "%s '%s' doesn't have 'struct bpf_spin_lock'\n",
+ map ? "map" : "local", map ? map->name : "kptr");
else
verbose(env,
- "map '%s' is not a struct type or bpf_spin_lock is mangled\n",
- map->name);
+ "%s '%s' is not a struct type or bpf_spin_lock is mangled\n",
+ map ? "map" : "local", map ? map->name : "kptr");
return -EINVAL;
}
- if (map->spin_lock_off != val + reg->off) {
+ if (spin_lock_off != val + reg->off) {
verbose(env, "off %lld doesn't point to 'struct bpf_spin_lock'\n",
val + reg->off);
return -EINVAL;
@@ -5709,13 +5735,19 @@ static const struct bpf_reg_types int_ptr_types = {
},
};
+static const struct bpf_reg_types spin_lock_types = {
+ .types = {
+ PTR_TO_MAP_VALUE,
+ PTR_TO_BTF_ID | MEM_TYPE_LOCAL,
+ },
+};
+
static const struct bpf_reg_types fullsock_types = { .types = { PTR_TO_SOCKET } };
static const struct bpf_reg_types scalar_types = { .types = { SCALAR_VALUE } };
static const struct bpf_reg_types context_types = { .types = { PTR_TO_CTX } };
static const struct bpf_reg_types alloc_mem_types = { .types = { PTR_TO_MEM | MEM_ALLOC } };
static const struct bpf_reg_types const_map_ptr_types = { .types = { CONST_PTR_TO_MAP } };
static const struct bpf_reg_types btf_ptr_types = { .types = { PTR_TO_BTF_ID } };
-static const struct bpf_reg_types spin_lock_types = { .types = { PTR_TO_MAP_VALUE } };
static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { PTR_TO_BTF_ID | MEM_PERCPU } };
static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC } };
static const struct bpf_reg_types stack_ptr_types = { .types = { PTR_TO_STACK } };
@@ -5806,6 +5838,11 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
bool strict_type_match = arg_type_is_release(arg_type) &&
meta->func_id != BPF_FUNC_sk_release;
+ if (type_is_local(reg->type) &&
+ WARN_ON_ONCE(meta->func_id != BPF_FUNC_spin_lock &&
+ meta->func_id != BPF_FUNC_spin_unlock))
+ return -EFAULT;
+
if (!arg_btf_id) {
if (!compatible->btf_id) {
verbose(env, "verifier internal error: missing arg compatible BTF ID\n");
@@ -5814,7 +5851,20 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
arg_btf_id = compatible->btf_id;
}
- if (meta->func_id == BPF_FUNC_kptr_xchg) {
+ if (meta->func_id == BPF_FUNC_spin_lock || meta->func_id == BPF_FUNC_spin_unlock) {
+ u32 offset;
+ int ret;
+
+ if (WARN_ON_ONCE(!type_is_local(reg->type)))
+ return -EFAULT;
+ ret = btf_local_type_has_bpf_spin_lock(reg->btf,
+ btf_type_by_id(reg->btf, reg->btf_id),
+ &offset);
+ if (ret <= 0 || reg->off != offset) {
+ verbose(env, "no bpf_spin_lock field at offset=%d\n", reg->off);
+ return -EACCES;
+ }
+ } else if (meta->func_id == BPF_FUNC_kptr_xchg) {
if (map_kptr_match_type(env, meta->kptr_off_desc, reg, regno))
return -EACCES;
} else if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
@@ -5943,7 +5993,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
goto skip_type_check;
/* arg_btf_id and arg_size are in a union. */
- if (base_type(arg_type) == ARG_PTR_TO_BTF_ID)
+ if (base_type(arg_type) == ARG_PTR_TO_BTF_ID ||
+ base_type(arg_type) == ARG_PTR_TO_SPIN_LOCK)
arg_btf_id = fn->arg_btf_id[arg];
err = check_reg_type(env, regno, arg_type, arg_btf_id, meta);
@@ -6530,8 +6581,11 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
int i;
for (i = 0; i < ARRAY_SIZE(fn->arg_type); i++) {
- if (base_type(fn->arg_type[i]) == ARG_PTR_TO_BTF_ID && !fn->arg_btf_id[i])
- return false;
+ if (base_type(fn->arg_type[i]) == ARG_PTR_TO_BTF_ID)
+ return !!fn->arg_btf_id[i];
+
+ if (base_type(fn->arg_type[i]) == ARG_PTR_TO_SPIN_LOCK)
+ return fn->arg_btf_id[i] == BPF_PTR_POISON;
if (base_type(fn->arg_type[i]) != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i] &&
/* arg_btf_id and arg_size are in a union. */
@@ -7756,11 +7810,13 @@ static u32 *reg2btf_ids[__BPF_REG_TYPE_MAX] = {
BTF_ID_LIST(special_kfuncs)
BTF_ID(func, bpf_kptr_alloc)
BTF_ID(func, bpf_list_node_init)
+BTF_ID(func, bpf_spin_lock_init)
BTF_ID(struct, btf) /* empty entry */
enum bpf_special_kfuncs {
KF_SPECIAL_bpf_kptr_alloc,
KF_SPECIAL_bpf_list_node_init,
+ KF_SPECIAL_bpf_spin_lock_init,
KF_SPECIAL_bpf_empty,
KF_SPECIAL_MAX = KF_SPECIAL_bpf_empty,
};
@@ -7927,6 +7983,7 @@ static int process_kf_arg_ptr_to_kptr_strong(struct bpf_verifier_env *env,
struct local_type_field {
enum {
FIELD_bpf_list_node,
+ FIELD_bpf_spin_lock,
FIELD_MAX,
} type;
enum bpf_special_kfuncs ctor_kfunc;
@@ -7972,6 +8029,7 @@ static int find_local_type_fields(const struct btf *btf, u32 btf_id, struct loca
}
FILL_LOCAL_TYPE_FIELD(bpf_list_node, bpf_list_node_init, bpf_empty, false);
+ FILL_LOCAL_TYPE_FIELD(bpf_spin_lock, bpf_spin_lock_init, bpf_empty, false);
#undef FILL_LOCAL_TYPE_FIELD
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index c3c5442742dc..8b1cdfb2f6bc 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -41,4 +41,13 @@ void *bpf_kptr_alloc(__u64 local_type_id, __u64 flags) __ksym;
*/
void bpf_list_node_init(struct bpf_list_node *node) __ksym;
+/* Description
+ * Initialize bpf_spin_lock field in a local kptr. This kfunc has
+ * constructor semantics, and thus can only be called on a local kptr in
+ * 'constructing' phase.
+ * Returns
+ * Void.
+ */
+void bpf_spin_lock_init(struct bpf_spin_lock *node) __ksym;
+
#endif
--
2.34.1
* [PATCH RFC bpf-next v1 19/32] bpf: Support bpf_list_head in local kptrs
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (17 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 18/32] bpf: Support bpf_spin_lock " Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 20/32] bpf: Introduce bpf_kptr_free helper Kumar Kartikeya Dwivedi
` (12 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
To support the map-in-map style use case, allow embedding a
bpf_list_head inside an allocated kptr representing a local type. This
is a field that actually needs explicit action while destructing the
object, i.e. popping off all the nodes owned by the bpf_list_head and
then freeing each one of them. Hence, the destruction state needs to be
tracked while we are in that phase, and we also need to reject freeing
as long as the embedded list_head has not been destructed.
For now, needs_destruction is false. A future patch will flip it to true
once adding items to such a list_head is supported.
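A rough sketch of such a local type and its construction (the BTF
annotation naming the contained value type and its bpf_list_node member
follows the map-side convention of this series and is elided here; 'b'
is assumed to be a freshly allocated kptr of this type):

struct bucket {
	struct bpf_spin_lock lock __kernel;
	struct bpf_list_head head __kernel; /* contained type annotation elided */
};

/* Constructors, called while 'b' is still in the constructing phase */
bpf_spin_lock_init(&b->lock);
bpf_list_head_init(&b->head);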
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/btf.h | 8 ++
kernel/bpf/btf.c | 89 ++++++++++++++++---
kernel/bpf/helpers.c | 8 ++
kernel/bpf/verifier.c | 4 +
.../testing/selftests/bpf/bpf_experimental.h | 9 ++
5 files changed, 106 insertions(+), 12 deletions(-)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index d99cad21e6d9..42c7f0283887 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -437,6 +437,8 @@ int btf_local_type_has_bpf_list_node(const struct btf *btf,
const struct btf_type *t, u32 *offsetp);
int btf_local_type_has_bpf_spin_lock(const struct btf *btf,
const struct btf_type *t, u32 *offsetp);
+int btf_local_type_has_bpf_list_head(const struct btf *btf,
+ const struct btf_type *t, u32 *offsetp);
bool btf_local_type_has_special_fields(const struct btf *btf,
const struct btf_type *t);
#else
@@ -489,6 +491,12 @@ static inline int btf_local_type_has_bpf_spin_lock(const struct btf *btf,
{
return -ENOENT;
}
+static inline int btf_local_type_has_bpf_list_head(const struct btf *btf,
+ const struct btf_type *t,
+ u32 *offsetp)
+{
+ return -ENOENT;
+}
static inline bool btf_local_type_has_special_fields(const struct btf *btf,
const struct btf_type *t)
{
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 63193c324898..c8d4513cc73e 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3185,7 +3185,8 @@ enum btf_field_type {
BTF_FIELD_SPIN_LOCK,
BTF_FIELD_TIMER,
BTF_FIELD_KPTR,
- BTF_FIELD_LIST_HEAD,
+ BTF_FIELD_LIST_HEAD_MAP,
+ BTF_FIELD_LIST_HEAD_KPTR,
BTF_FIELD_LIST_NODE,
};
@@ -3204,6 +3205,7 @@ struct btf_field_info {
struct {
u32 value_type_id;
const char *node_name;
+ enum btf_field_type type;
} list_head;
};
};
@@ -3282,9 +3284,11 @@ static const char *btf_find_decl_tag_value(const struct btf *btf,
return NULL;
}
-static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
- int comp_idx, const struct btf_type *t,
- u32 off, int sz, struct btf_field_info *info)
+static int btf_find_list_head(const struct btf *btf,
+ enum btf_field_type field_type,
+ const struct btf_type *pt, int comp_idx,
+ const struct btf_type *t, u32 off, int sz,
+ struct btf_field_info *info)
{
const char *value_type;
const char *list_node;
@@ -3316,6 +3320,7 @@ static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
info->off = off;
info->list_head.value_type_id = id;
info->list_head.node_name = list_node;
+ info->list_head.type = field_type;
return BTF_FIELD_FOUND;
}
@@ -3361,8 +3366,9 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
if (ret < 0)
return ret;
break;
- case BTF_FIELD_LIST_HEAD:
- ret = btf_find_list_head(btf, t, i, member_type, off, sz,
+ case BTF_FIELD_LIST_HEAD_MAP:
+ case BTF_FIELD_LIST_HEAD_KPTR:
+ ret = btf_find_list_head(btf, field_type, t, i, member_type, off, sz,
idx < info_cnt ? &info[idx] : &tmp);
if (ret < 0)
return ret;
@@ -3420,8 +3426,9 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
if (ret < 0)
return ret;
break;
- case BTF_FIELD_LIST_HEAD:
- ret = btf_find_list_head(btf, var, -1, var_type, off, sz,
+ case BTF_FIELD_LIST_HEAD_MAP:
+ case BTF_FIELD_LIST_HEAD_KPTR:
+ ret = btf_find_list_head(btf, field_type, var, -1, var_type, off, sz,
idx < info_cnt ? &info[idx] : &tmp);
if (ret < 0)
return ret;
@@ -3462,7 +3469,8 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
sz = sizeof(u64);
align = 8;
break;
- case BTF_FIELD_LIST_HEAD:
+ case BTF_FIELD_LIST_HEAD_MAP:
+ case BTF_FIELD_LIST_HEAD_KPTR:
name = "bpf_list_head";
sz = sizeof(struct bpf_list_head);
align = __alignof__(struct bpf_list_head);
@@ -3615,13 +3623,53 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
return ERR_PTR(ret);
}
+static bool list_head_value_ok(const struct btf *btf, const struct btf_type *pt,
+ const struct btf_type *vt,
+ enum btf_field_type type)
+{
+ struct btf_field_info info;
+ int ret;
+
+ /* This is the value type of either map or kptr list_head. For map
+ * list_head, we allow the value_type to have another bpf_list_head, but
+ * for kptr list_head, we cannot allow another level of list_head.
+ *
+ * Also, in the map case, we must catch the case where the value_type's
+ * list_head encodes the map_value as its own value_type.
+ *
+ * Essentially, we want only two levels for map, one level for kptr, and
+ * no cycles at all in the type graph.
+ */
+ WARN_ON_ONCE(btf_is_kernel(btf) || !__btf_type_is_struct(vt));
+ ret = btf_find_field(btf, vt, type, "kernel", &info, 1);
+ if (ret < 0)
+ return false;
+ /* For map or kptr, if value doesn't have list_head, it's ok! */
+ if (!ret)
+ return true;
+ if (ret) {
+ /* For kptr, we don't allow list_head in the value type. */
+ if (type == BTF_FIELD_LIST_HEAD_KPTR)
+ return false;
+ /* The map's list_head's value has another list head. We now
+ * need to ensure it doesn't refer to map value type itself,
+ * creating a cycle.
+ */
+ vt = btf_type_by_id(btf, info.list_head.value_type_id);
+ if (vt == pt)
+ return false;
+ }
+ return true;
+}
+
struct bpf_map_value_off *btf_parse_list_heads(struct btf *btf, const struct btf_type *t)
{
struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
struct bpf_map_value_off *tab;
+ const struct btf_type *pt = t;
int ret, i, nr_off;
- ret = btf_find_field(btf, t, BTF_FIELD_LIST_HEAD, NULL, info_arr, ARRAY_SIZE(info_arr));
+ ret = btf_find_field(btf, t, BTF_FIELD_LIST_HEAD_MAP, NULL, info_arr, ARRAY_SIZE(info_arr));
if (ret < 0)
return ERR_PTR(ret);
if (!ret)
@@ -3644,6 +3692,8 @@ struct bpf_map_value_off *btf_parse_list_heads(struct btf *btf, const struct btf
* verify its type.
*/
ret = -EINVAL;
+ if (!list_head_value_ok(btf, pt, t, BTF_FIELD_LIST_HEAD_MAP))
+ goto end;
for_each_member(j, t, member) {
if (strcmp(info_arr[i].list_head.node_name, __btf_name_by_offset(btf, member->name_off)))
continue;
@@ -5937,12 +5987,19 @@ static int btf_find_local_type_field(const struct btf *btf,
int ret;
/* These are invariants that must hold if this is a local type */
- WARN_ON_ONCE(btf_is_kernel(btf) || !__btf_type_is_struct(t));
+ WARN_ON_ONCE(btf_is_kernel(btf) || !__btf_type_is_struct(t) || type == BTF_FIELD_LIST_HEAD_MAP);
ret = btf_find_field(btf, t, type, "kernel", &info, 1);
if (ret < 0)
return ret;
if (!ret)
return 0;
+ /* A validation step needs to be done for bpf_list_head in local kptrs */
+ if (type == BTF_FIELD_LIST_HEAD_KPTR) {
+ const struct btf_type *vt = btf_type_by_id(btf, info.list_head.value_type_id);
+
+ if (!list_head_value_ok(btf, t, vt, type))
+ return -EINVAL;
+ }
if (offsetp)
*offsetp = info.off;
return ret;
@@ -5960,10 +6017,17 @@ int btf_local_type_has_bpf_spin_lock(const struct btf *btf,
return btf_find_local_type_field(btf, t, BTF_FIELD_SPIN_LOCK, offsetp);
}
+int btf_local_type_has_bpf_list_head(const struct btf *btf,
+ const struct btf_type *t, u32 *offsetp)
+{
+ return btf_find_local_type_field(btf, t, BTF_FIELD_LIST_HEAD_KPTR, offsetp);
+}
+
bool btf_local_type_has_special_fields(const struct btf *btf, const struct btf_type *t)
{
return btf_local_type_has_bpf_list_node(btf, t, NULL) == 1 ||
- btf_local_type_has_bpf_spin_lock(btf, t, NULL) == 1;
+ btf_local_type_has_bpf_spin_lock(btf, t, NULL) == 1 ||
+ btf_local_type_has_bpf_list_head(btf, t, NULL) == 1;
}
int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
@@ -5993,6 +6057,7 @@ int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
}
PREVENT_DIRECT_WRITE(bpf_list_node);
PREVENT_DIRECT_WRITE(bpf_spin_lock);
+ PREVENT_DIRECT_WRITE(bpf_list_head);
#undef PREVENT_DIRECT_WRITE
err = 0;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 94a23a544aee..8eee0793c7f1 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1724,6 +1724,13 @@ void bpf_spin_lock_init(struct bpf_spin_lock *lock__clkptr)
memset(lock__clkptr, 0, sizeof(*lock__clkptr));
}
+void bpf_list_head_init(struct bpf_list_head *head__clkptr)
+{
+ BUILD_BUG_ON(sizeof(struct bpf_list_head) != sizeof(struct list_head));
+ BUILD_BUG_ON(__alignof__(struct bpf_list_head) != __alignof__(struct list_head));
+ INIT_LIST_HEAD((struct list_head *)head__clkptr);
+}
+
__diag_pop();
BTF_SET8_START(tracing_btf_ids)
@@ -1733,6 +1740,7 @@ BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
BTF_ID_FLAGS(func, bpf_kptr_alloc, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
BTF_ID_FLAGS(func, bpf_list_node_init)
BTF_ID_FLAGS(func, bpf_spin_lock_init)
+BTF_ID_FLAGS(func, bpf_list_head_init)
BTF_SET8_END(tracing_btf_ids)
static const struct btf_kfunc_id_set tracing_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 130a4f0550f5..a5aa5de4b246 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7811,12 +7811,14 @@ BTF_ID_LIST(special_kfuncs)
BTF_ID(func, bpf_kptr_alloc)
BTF_ID(func, bpf_list_node_init)
BTF_ID(func, bpf_spin_lock_init)
+BTF_ID(func, bpf_list_head_init)
BTF_ID(struct, btf) /* empty entry */
enum bpf_special_kfuncs {
KF_SPECIAL_bpf_kptr_alloc,
KF_SPECIAL_bpf_list_node_init,
KF_SPECIAL_bpf_spin_lock_init,
+ KF_SPECIAL_bpf_list_head_init,
KF_SPECIAL_bpf_empty,
KF_SPECIAL_MAX = KF_SPECIAL_bpf_empty,
};
@@ -7984,6 +7986,7 @@ struct local_type_field {
enum {
FIELD_bpf_list_node,
FIELD_bpf_spin_lock,
+ FIELD_bpf_list_head,
FIELD_MAX,
} type;
enum bpf_special_kfuncs ctor_kfunc;
@@ -8030,6 +8033,7 @@ static int find_local_type_fields(const struct btf *btf, u32 btf_id, struct loca
FILL_LOCAL_TYPE_FIELD(bpf_list_node, bpf_list_node_init, bpf_empty, false);
FILL_LOCAL_TYPE_FIELD(bpf_spin_lock, bpf_spin_lock_init, bpf_empty, false);
+ FILL_LOCAL_TYPE_FIELD(bpf_list_head, bpf_list_head_init, bpf_empty, false);
#undef FILL_LOCAL_TYPE_FIELD
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 8b1cdfb2f6bc..f0b6e92c6908 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -50,4 +50,13 @@ void bpf_list_node_init(struct bpf_list_node *node) __ksym;
*/
void bpf_spin_lock_init(struct bpf_spin_lock *node) __ksym;
+/* Description
+ * Initialize bpf_list_head field in a local kptr. This kfunc has
+ * constructor semantics, and thus can only be called on a local kptr in
+ * 'constructing' phase.
+ * Returns
+ * Void.
+ */
+void bpf_list_head_init(struct bpf_list_head *node) __ksym;
+
#endif
--
2.34.1
* [PATCH RFC bpf-next v1 20/32] bpf: Introduce bpf_kptr_free helper
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (18 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 19/32] bpf: Support bpf_list_head " Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables Kumar Kartikeya Dwivedi
` (11 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
After ensuring that the verifier can recognize normal vs destructing
objects, add bpf_kptr_free support, and then verify whether a normal
object can be freed directly, or whether it needs destruction. If it is
already in the destructing phase, ensure all fields have been destructed.
Having this state in the verifier considerably simplifies how we release
resources for kptrs to local types with arbitrary fields. The verifier
just needs to ensure that destruction happens while the program has
ownership of the object, and then it can release the storage using
bpf_kptr_free.
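For a local type whose special fields need no destruction (e.g. only a
bpf_spin_lock or bpf_list_node), the expected flow is roughly the
following sketch (bpf_core_type_id_local() assumed as before):

struct item *it;

it = bpf_kptr_alloc(bpf_core_type_id_local(struct item), 0);
if (!it)
	return 0;
bpf_spin_lock_init(&it->lock);	/* construct special fields */
it->data = 42;
/* Nothing in struct item needs destruction, so the storage can be
 * released directly.
 */
bpf_kptr_free(it);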
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/helpers.c | 6 ++++
kernel/bpf/verifier.c | 29 +++++++++++++++++++
.../testing/selftests/bpf/bpf_experimental.h | 8 +++++
3 files changed, 43 insertions(+)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 8eee0793c7f1..4a6fffe401ae 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1731,6 +1731,11 @@ void bpf_list_head_init(struct bpf_list_head *head__clkptr)
INIT_LIST_HEAD((struct list_head *)head__clkptr);
}
+void bpf_kptr_free(void *p__dlkptr)
+{
+ kfree(p__dlkptr);
+}
+
__diag_pop();
BTF_SET8_START(tracing_btf_ids)
@@ -1741,6 +1746,7 @@ BTF_ID_FLAGS(func, bpf_kptr_alloc, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
BTF_ID_FLAGS(func, bpf_list_node_init)
BTF_ID_FLAGS(func, bpf_spin_lock_init)
BTF_ID_FLAGS(func, bpf_list_head_init)
+BTF_ID_FLAGS(func, bpf_kptr_free, KF_RELEASE)
BTF_SET8_END(tracing_btf_ids)
static const struct btf_kfunc_id_set tracing_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a5aa5de4b246..b1754fd69f7d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7812,6 +7812,7 @@ BTF_ID(func, bpf_kptr_alloc)
BTF_ID(func, bpf_list_node_init)
BTF_ID(func, bpf_spin_lock_init)
BTF_ID(func, bpf_list_head_init)
+BTF_ID(func, bpf_kptr_free)
BTF_ID(struct, btf) /* empty entry */
enum bpf_special_kfuncs {
@@ -7819,6 +7820,7 @@ enum bpf_special_kfuncs {
KF_SPECIAL_bpf_list_node_init,
KF_SPECIAL_bpf_spin_lock_init,
KF_SPECIAL_bpf_list_head_init,
+ KF_SPECIAL_bpf_kptr_free,
KF_SPECIAL_bpf_empty,
KF_SPECIAL_MAX = KF_SPECIAL_bpf_empty,
};
@@ -8156,6 +8158,33 @@ process_kf_arg_destructing_local_kptr(struct bpf_verifier_env *env,
}));
}
+ /* Handle bpf_kptr_free */
+ if (is_kfunc_special(meta->btf, meta->func_id, bpf_kptr_free)) {
+ for (i = cnt - 1; i >= 0; i--) {
+ if (!fields[i].needs_destruction)
+ continue;
+ /* If a field needs destruction, it must be in
+ * destructed state when calling bpf_kptr_free.
+ */
+ switch (local_kptr_get_state(reg, i)) {
+ case FIELD_STATE_CONSTRUCTED:
+ verbose(env, "'%s' field needs to be destructed before bpf_kptr_free\n",
+ fields[i].name);
+ return -EINVAL;
+ case FIELD_STATE_DESTRUCTED:
+ break;
+ case FIELD_STATE_UNKNOWN:
+ if (reg->type & OBJ_CONSTRUCTING)
+ break;
+ fallthrough;
+ default:
+ verbose(env, "verifier internal error: unknown field state\n");
+ return -EFAULT;
+ }
+ }
+ return 0;
+ }
+
for (i = 0; i < cnt; i++) {
bool mark_dtor = false, unmark_ctor = false;
int j;
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index f0b6e92c6908..595e99d5cbc2 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -59,4 +59,12 @@ void bpf_spin_lock_init(struct bpf_spin_lock *node) __ksym;
*/
void bpf_list_head_init(struct bpf_list_head *node) __ksym;
+/* Description
+ * Free a local kptr. All fields of local kptr that require destruction
+ * need to be in destructed state before this call is made.
+ * Returns
+ * Void.
+ */
+void bpf_kptr_free(void *kptr) __ksym;
+
#endif
--
2.34.1
* [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (19 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 20/32] bpf: Introduce bpf_kptr_free helper Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-08 0:27 ` Alexei Starovoitov
2022-09-09 8:13 ` Dave Marchevsky
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 22/32] bpf: Bump BTF_KFUNC_SET_MAX_CNT Kumar Kartikeya Dwivedi
` (10 subsequent siblings)
31 siblings, 2 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Global variables reside in maps accessible using direct_value_addr
callbacks, so giving each load instruction's rewrite a unique reg->id
prevents us from holding locks which are global.
This is not great, so refactor the active_spin_lock into two separate
fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
enough to cover global variables, map lookups, and local kptr registers
at the same time.
Held vs. not held is indicated by active_spin_lock_ptr, which stores
the reg->map_ptr or reg->btf pointer of the register used to lock the
spin lock. The active_spin_lock_id also needs to be compared to ensure
that bpf_spin_unlock is for the same register.
Next, pseudo load instructions are not given a unique reg->id, as they
are doing a lookup for the same map value (max_entries is never greater
than 1).
Essentially, we consider that the tuple of (active_spin_lock_ptr,
active_spin_lock_id) will always be unique for any kind of argument to
bpf_spin_{lock,unlock}.
Note that this can be extended in the future to also remember the
offset used for locking, so that we can support multiple bpf_spin_lock
fields in the same allocation.
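A sketch of the pattern this enables (global variables are backed by the
internal .bss/.data array map with max_entries == 1, accessed through
direct_value_addr rewrites; the program type is chosen arbitrarily):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

struct bpf_spin_lock glock;
int counter;

SEC("tc")
int use_global_lock(struct __sk_buff *ctx)
{
	bpf_spin_lock(&glock);
	counter++;
	bpf_spin_unlock(&glock);
	return 0;
}

char _license[] SEC("license") = "GPL";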
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf_verifier.h | 3 ++-
kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++-----------
2 files changed, 29 insertions(+), 13 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 2a9dcefca3b6..00c21ad6f61c 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -348,7 +348,8 @@ struct bpf_verifier_state {
u32 branches;
u32 insn_idx;
u32 curframe;
- u32 active_spin_lock;
+ void *active_spin_lock_ptr;
+ u32 active_spin_lock_id;
bool speculative;
/* first and last insn idx of this verifier state */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b1754fd69f7d..ed19e4036b0a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1202,7 +1202,8 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
}
dst_state->speculative = src->speculative;
dst_state->curframe = src->curframe;
- dst_state->active_spin_lock = src->active_spin_lock;
+ dst_state->active_spin_lock_ptr = src->active_spin_lock_ptr;
+ dst_state->active_spin_lock_id = src->active_spin_lock_id;
dst_state->branches = src->branches;
dst_state->parent = src->parent;
dst_state->first_insn_idx = src->first_insn_idx;
@@ -5504,22 +5505,35 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
return -EINVAL;
}
if (is_lock) {
- if (cur->active_spin_lock) {
+ if (cur->active_spin_lock_ptr) {
verbose(env,
"Locking two bpf_spin_locks are not allowed\n");
return -EINVAL;
}
- cur->active_spin_lock = reg->id;
+ if (map)
+ cur->active_spin_lock_ptr = map;
+ else
+ cur->active_spin_lock_ptr = btf;
+ cur->active_spin_lock_id = reg->id;
} else {
- if (!cur->active_spin_lock) {
+ void *ptr;
+
+ if (map)
+ ptr = map;
+ else
+ ptr = btf;
+
+ if (!cur->active_spin_lock_ptr) {
verbose(env, "bpf_spin_unlock without taking a lock\n");
return -EINVAL;
}
- if (cur->active_spin_lock != reg->id) {
+ if (cur->active_spin_lock_ptr != ptr ||
+ cur->active_spin_lock_id != reg->id) {
verbose(env, "bpf_spin_unlock of different lock\n");
return -EINVAL;
}
- cur->active_spin_lock = 0;
+ cur->active_spin_lock_ptr = NULL;
+ cur->active_spin_lock_id = 0;
}
return 0;
}
@@ -11207,8 +11221,8 @@ static int check_ld_imm(struct bpf_verifier_env *env, struct bpf_insn *insn)
insn->src_reg == BPF_PSEUDO_MAP_IDX_VALUE) {
dst_reg->type = PTR_TO_MAP_VALUE;
dst_reg->off = aux->map_off;
- if (map_value_has_spin_lock(map))
- dst_reg->id = ++env->id_gen;
+ WARN_ON_ONCE(map->max_entries != 1);
+ /* We want reg->id to be same (0) as map_value is not distinct */
} else if (insn->src_reg == BPF_PSEUDO_MAP_FD ||
insn->src_reg == BPF_PSEUDO_MAP_IDX) {
dst_reg->type = CONST_PTR_TO_MAP;
@@ -11286,7 +11300,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
return err;
}
- if (env->cur_state->active_spin_lock) {
+ if (env->cur_state->active_spin_lock_ptr) {
verbose(env, "BPF_LD_[ABS|IND] cannot be used inside bpf_spin_lock-ed region\n");
return -EINVAL;
}
@@ -12566,7 +12580,8 @@ static bool states_equal(struct bpf_verifier_env *env,
if (old->speculative && !cur->speculative)
return false;
- if (old->active_spin_lock != cur->active_spin_lock)
+ if (old->active_spin_lock_ptr != cur->active_spin_lock_ptr ||
+ old->active_spin_lock_id != cur->active_spin_lock_id)
return false;
/* for states to be equal callsites have to be the same
@@ -13213,7 +13228,7 @@ static int do_check(struct bpf_verifier_env *env)
return -EINVAL;
}
- if (env->cur_state->active_spin_lock &&
+ if (env->cur_state->active_spin_lock_ptr &&
(insn->src_reg == BPF_PSEUDO_CALL ||
insn->imm != BPF_FUNC_spin_unlock)) {
verbose(env, "function calls are not allowed while holding a lock\n");
@@ -13250,7 +13265,7 @@ static int do_check(struct bpf_verifier_env *env)
return -EINVAL;
}
- if (env->cur_state->active_spin_lock) {
+ if (env->cur_state->active_spin_lock_ptr) {
verbose(env, "bpf_spin_unlock is missing\n");
return -EINVAL;
}
--
2.34.1
* [PATCH RFC bpf-next v1 22/32] bpf: Bump BTF_KFUNC_SET_MAX_CNT
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (20 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 23/32] bpf: Add single ownership BPF linked list API Kumar Kartikeya Dwivedi
` (9 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
The current limit on the number of kfuncs in a btf_id_set8 will no
longer suffice, especially keeping in mind the future patches in this
series, so bump it to 64.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/btf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index c8d4513cc73e..439c980419b9 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -208,7 +208,7 @@ enum btf_kfunc_hook {
};
enum {
- BTF_KFUNC_SET_MAX_CNT = 32,
+ BTF_KFUNC_SET_MAX_CNT = 64,
BTF_DTOR_KFUNC_MAX_CNT = 256,
};
--
2.34.1
* [PATCH RFC bpf-next v1 23/32] bpf: Add single ownership BPF linked list API
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (21 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 22/32] bpf: Bump BTF_KFUNC_SET_MAX_CNT Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 24/32] bpf: Permit NULL checking pointer with non-zero fixed offset Kumar Kartikeya Dwivedi
` (8 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Add a linked list API for use in BPF programs, which expects protection
from a bpf_spin_lock in the same allocation as the bpf_list_head. Future
patches will extend the same infrastructure to have different flavors
with varying protection domains and visibility (e.g. a percpu variant
with local_t protection, usable in NMI progs).
The following functions are added to kick things off:
bpf_list_add
bpf_list_add_tail
bpf_list_del
bpf_list_pop_front
bpf_list_pop_back
The lock protecting the bpf_list_head needs to be taken for all
operations.
Once a node has been added to the list, its pointer changes to
PTR_UNTRUSTED. However, it is only released once the lock protecting the
list is unlocked. For such local kptrs with PTR_UNTRUSTED set but an
active ref_obj_id, it is still permitted to read from and write to them
as long as the lock is held. However, they cannot be deleted using
bpf_list_del directly after addition. bpf_list_del will only be
permitted inside for_each helpers for lists, which will be added in
later patches. This is unlikely to be a problem, as deleting right after
addition in the same lock section is quite uncommon.
For now, bpf_list_del is hence unusable unless a for_each helper is
added, but it is still necessary to ensure that it works correctly in
the presence of the rest of the API, so it has been included with this
change.
bpf_list_pop_front and bpf_list_pop_back delete the first or last item
of the list respectively, and return a pointer to the element at the
list_node offset. The user can then use a container_of style macro to
get the actual entry type. The verifier, however, statically knows the
actual type, so the safety properties are still preserved.
With these additions, programs can now manage their own linked lists and
store their objects in them.
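A rough sketch of the intended usage, assuming a struct item with a
"kernel" tagged bpf_list_node member 'node', an allocation 'b' holding
the constructed bpf_list_head 'head' and its protecting bpf_spin_lock
'lock', and a container_of style macro as used in the selftests:

struct item *it;
struct bpf_list_node *n;

it = bpf_kptr_alloc(bpf_core_type_id_local(struct item), 0);
if (!it)
	return 0;
bpf_list_node_init(&it->node);

bpf_spin_lock(&b->lock);
bpf_list_add(&it->node, &b->head); /* 'it' turns untrusted, ref dropped at unlock */
n = bpf_list_pop_front(&b->head);  /* owned reference to popped node, or NULL */
bpf_spin_unlock(&b->lock);

if (n) {
	it = container_of(n, struct item, node);
	/* 'it' is owned again; it must eventually be freed or re-added */
}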
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf_verifier.h | 5 +
include/linux/btf.h | 11 +
kernel/bpf/btf.c | 47 +-
kernel/bpf/helpers.c | 55 ++
kernel/bpf/verifier.c | 489 ++++++++++++++++--
.../testing/selftests/bpf/bpf_experimental.h | 35 ++
6 files changed, 599 insertions(+), 43 deletions(-)
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 00c21ad6f61c..3cce796c4d76 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -249,6 +249,11 @@ struct bpf_reference_state {
* exiting a callback function.
*/
int callback_ref;
+ /* Mark the reference state to release the registers sharing the same id
+ * on bpf_spin_unlock (for nodes that we will lose ownership to but are
+ * safe to access inside the critical section).
+ */
+ bool release_on_unlock;
};
/* state of the program:
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 42c7f0283887..bd57a9cae12c 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -437,6 +437,9 @@ int btf_local_type_has_bpf_list_node(const struct btf *btf,
const struct btf_type *t, u32 *offsetp);
int btf_local_type_has_bpf_spin_lock(const struct btf *btf,
const struct btf_type *t, u32 *offsetp);
+int __btf_local_type_has_bpf_list_head(const struct btf *btf,
+ const struct btf_type *t, u32 *offsetp,
+ u32 *value_type_idp, u32 *list_node_offp);
int btf_local_type_has_bpf_list_head(const struct btf *btf,
const struct btf_type *t, u32 *offsetp);
bool btf_local_type_has_special_fields(const struct btf *btf,
@@ -491,6 +494,14 @@ static inline int btf_local_type_has_bpf_spin_lock(const struct btf *btf,
{
return -ENOENT;
}
+static inline int __btf_local_type_has_bpf_list_head(const struct btf *btf,
+ const struct btf_type *t,
+ u32 *offsetp,
+ u32 *value_type_idp,
+ u32 *list_node_offp)
+{
+ return -ENOENT;
+}
static inline int btf_local_type_has_bpf_list_head(const struct btf *btf,
const struct btf_type *t,
u32 *offsetp)
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 439c980419b9..e2ac088cb64f 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -5981,7 +5981,8 @@ static int btf_struct_walk(struct bpf_verifier_log *log, const struct btf *btf,
static int btf_find_local_type_field(const struct btf *btf,
const struct btf_type *t,
enum btf_field_type type,
- u32 *offsetp)
+ u32 *offsetp, u32 *value_type_idp,
+ u32 *list_node_offp)
{
struct btf_field_info info;
int ret;
@@ -5996,9 +5997,40 @@ static int btf_find_local_type_field(const struct btf *btf,
/* A validation step needs to be done for bpf_list_head in local kptrs */
if (type == BTF_FIELD_LIST_HEAD_KPTR) {
const struct btf_type *vt = btf_type_by_id(btf, info.list_head.value_type_id);
+ const struct btf_type *n = NULL;
+ const struct btf_member *member;
+ u32 offset;
+ int i;
if (!list_head_value_ok(btf, t, vt, type))
return -EINVAL;
+ for_each_member(i, vt, member) {
+ if (strcmp(info.list_head.node_name, __btf_name_by_offset(btf, member->name_off)))
+ continue;
+ /* Invalid BTF, two members with same name */
+ if (n)
+ return -EINVAL;
+ n = btf_type_by_id(btf, member->type);
+ if (!__btf_type_is_struct(n))
+ return -EINVAL;
+ if (strcmp("bpf_list_node", __btf_name_by_offset(btf, n->name_off)))
+ return -EINVAL;
+ offset = __btf_member_bit_offset(n, member);
+ if (offset % 8)
+ return -EINVAL;
+ offset /= 8;
+ if (offset % __alignof__(struct bpf_list_node))
+ return -EINVAL;
+ if (value_type_idp)
+ *value_type_idp = info.list_head.value_type_id;
+ if (list_node_offp)
+ *list_node_offp = offset;
+ }
+ /* Could not find bpf_list_node */
+ if (!n)
+ return -ENOENT;
+ } else if (value_type_idp || list_node_offp) {
+ return -EFAULT;
}
if (offsetp)
*offsetp = info.off;
@@ -6008,19 +6040,26 @@ static int btf_find_local_type_field(const struct btf *btf,
int btf_local_type_has_bpf_list_node(const struct btf *btf,
const struct btf_type *t, u32 *offsetp)
{
- return btf_find_local_type_field(btf, t, BTF_FIELD_LIST_NODE, offsetp);
+ return btf_find_local_type_field(btf, t, BTF_FIELD_LIST_NODE, offsetp, NULL, NULL);
}
int btf_local_type_has_bpf_spin_lock(const struct btf *btf,
const struct btf_type *t, u32 *offsetp)
{
- return btf_find_local_type_field(btf, t, BTF_FIELD_SPIN_LOCK, offsetp);
+ return btf_find_local_type_field(btf, t, BTF_FIELD_SPIN_LOCK, offsetp, NULL, NULL);
+}
+
+int __btf_local_type_has_bpf_list_head(const struct btf *btf,
+ const struct btf_type *t, u32 *offsetp,
+ u32 *value_type_idp, u32 *list_node_offp)
+{
+ return btf_find_local_type_field(btf, t, BTF_FIELD_LIST_HEAD_KPTR, offsetp, value_type_idp, list_node_offp);
}
int btf_local_type_has_bpf_list_head(const struct btf *btf,
const struct btf_type *t, u32 *offsetp)
{
- return btf_find_local_type_field(btf, t, BTF_FIELD_LIST_HEAD_KPTR, offsetp);
+ return __btf_local_type_has_bpf_list_head(btf, t, offsetp, NULL, NULL);
}
bool btf_local_type_has_special_fields(const struct btf *btf, const struct btf_type *t)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 4a6fffe401ae..9d5709441800 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1736,6 +1736,56 @@ void bpf_kptr_free(void *p__dlkptr)
kfree(p__dlkptr);
}
+static bool __always_inline __bpf_list_head_init_zeroed(struct bpf_list_head *h)
+{
+ struct list_head *head = (struct list_head *)h;
+
+ if (unlikely(!head->next)) {
+ INIT_LIST_HEAD(head);
+ return true;
+ }
+ return false;
+}
+
+void bpf_list_add(struct bpf_list_node *node, struct bpf_list_head *head)
+{
+ __bpf_list_head_init_zeroed(head);
+ list_add((struct list_head *)node, (struct list_head *)head);
+}
+
+void bpf_list_add_tail(struct bpf_list_node *node, struct bpf_list_head *head)
+{
+ __bpf_list_head_init_zeroed(head);
+ list_add_tail((struct list_head *)node, (struct list_head *)head);
+}
+
+void bpf_list_del(struct bpf_list_node *node)
+{
+ list_del_init((struct list_head *)node);
+}
+
+struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head)
+{
+ struct list_head *node, *list = (struct list_head *)head;
+
+ if (__bpf_list_head_init_zeroed(head) || list_empty(list))
+ return NULL;
+ node = list->next;
+ list_del_init(node);
+ return (struct bpf_list_node *)node;
+}
+
+struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
+{
+ struct list_head *node, *list = (struct list_head *)head;
+
+ if (__bpf_list_head_init_zeroed(head) || list_empty(list))
+ return NULL;
+ node = list->prev;
+ list_del_init(node);
+ return (struct bpf_list_node *)node;
+}
+
__diag_pop();
BTF_SET8_START(tracing_btf_ids)
@@ -1747,6 +1797,11 @@ BTF_ID_FLAGS(func, bpf_list_node_init)
BTF_ID_FLAGS(func, bpf_spin_lock_init)
BTF_ID_FLAGS(func, bpf_list_head_init)
BTF_ID_FLAGS(func, bpf_kptr_free, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_list_add)
+BTF_ID_FLAGS(func, bpf_list_add_tail)
+BTF_ID_FLAGS(func, bpf_list_del)
+BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
+BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
BTF_SET8_END(tracing_btf_ids)
static const struct btf_kfunc_id_set tracing_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index ed19e4036b0a..dcbeb503c25c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4584,9 +4584,19 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
false);
} else {
/* It is allowed to write to pointer to a local type */
- if (atype != BPF_READ && !type_is_local(reg->type)) {
- verbose(env, "only read is supported\n");
- return -EACCES;
+ if (atype != BPF_READ) {
+ /* When a local kptr is marked untrusted, but has an
+ * active ref_obj_id, it means that it is untrusted only
+ * for passing to helpers, but not for reads and writes.
+ *
+ * For local kptr loaded from maps, PTR_UNTRUSTED would
+ * be set but without an active ref_obj_id, which means
+ * writing won't be permitted.
+ */
+ if (!type_is_local(reg->type) || !reg->ref_obj_id) {
+ verbose(env, "only read is supported\n");
+ return -EACCES;
+ }
}
ret = btf_struct_access(&env->log, reg->btf, t, off, size,
@@ -5516,7 +5526,9 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
cur->active_spin_lock_ptr = btf;
cur->active_spin_lock_id = reg->id;
} else {
+ struct bpf_func_state *fstate = cur_func(env);
void *ptr;
+ int i;
if (map)
ptr = map;
@@ -5534,6 +5546,17 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
}
cur->active_spin_lock_ptr = NULL;
cur->active_spin_lock_id = 0;
+
+ /* Now, whichever registers are waiting to expire after the
+ * critical section ends, kill them.
+ */
+ for (i = 0; i < fstate->acquired_refs; i++) {
+ /* WARN because this reference state cannot be freed
+ * before this point.
+ */
+ if (fstate->refs[i].release_on_unlock)
+ WARN_ON_ONCE(release_reference(env, fstate->refs[i].id));
+ }
}
return 0;
}
@@ -7681,6 +7704,11 @@ struct bpf_kfunc_arg_meta {
u64 value;
bool found;
} arg_constant;
+ struct {
+ struct btf *btf;
+ u32 type_id;
+ u32 off;
+ } list_node;
};
static bool is_kfunc_acquire(struct bpf_kfunc_arg_meta *meta)
@@ -7772,6 +7800,30 @@ static bool is_kfunc_arg_sfx_destructing_local_kptr(const struct btf *btf,
return __kfunc_param_match_suffix(btf, arg, "__dlkptr");
}
+BTF_ID_LIST(list_struct_ids)
+BTF_ID(struct, bpf_list_head)
+BTF_ID(struct, bpf_list_node)
+
+static bool __is_kfunc_arg_list_struct(const struct btf *btf, const struct btf_param *arg, u32 btf_id)
+{
+ const struct btf_type *t;
+
+ t = btf_type_by_id(btf, arg->type);
+ if (!t || !btf_type_is_ptr(t))
+ return false;
+ return t->type == btf_id;
+}
+
+static bool is_kfunc_arg_list_head(const struct btf *btf, const struct btf_param *arg)
+{
+ return __is_kfunc_arg_list_struct(btf, arg, list_struct_ids[0]);
+}
+
+static bool is_kfunc_arg_list_node(const struct btf *btf, const struct btf_param *arg)
+{
+ return __is_kfunc_arg_list_struct(btf, arg, list_struct_ids[1]);
+}
+
/* Returns true if struct is composed of scalars, 4 levels of nesting allowed */
static bool __btf_type_is_scalar_struct(struct bpf_verifier_env *env,
const struct btf *btf,
@@ -7827,6 +7879,11 @@ BTF_ID(func, bpf_list_node_init)
BTF_ID(func, bpf_spin_lock_init)
BTF_ID(func, bpf_list_head_init)
BTF_ID(func, bpf_kptr_free)
+BTF_ID(func, bpf_list_add)
+BTF_ID(func, bpf_list_add_tail)
+BTF_ID(func, bpf_list_del)
+BTF_ID(func, bpf_list_pop_front)
+BTF_ID(func, bpf_list_pop_back)
BTF_ID(struct, btf) /* empty entry */
enum bpf_special_kfuncs {
@@ -7835,6 +7892,11 @@ enum bpf_special_kfuncs {
KF_SPECIAL_bpf_spin_lock_init,
KF_SPECIAL_bpf_list_head_init,
KF_SPECIAL_bpf_kptr_free,
+ KF_SPECIAL_bpf_list_add,
+ KF_SPECIAL_bpf_list_add_tail,
+ KF_SPECIAL_bpf_list_del,
+ KF_SPECIAL_bpf_list_pop_front,
+ KF_SPECIAL_bpf_list_pop_back,
KF_SPECIAL_bpf_empty,
KF_SPECIAL_MAX = KF_SPECIAL_bpf_empty,
};
@@ -7846,8 +7908,18 @@ static bool __is_kfunc_special(const struct btf *btf, u32 func_id, unsigned int
return func_id == special_kfuncs[kf_sp];
}
+static bool __is_kfunc_insn_special(struct bpf_insn *insn, unsigned int kf_sp)
+{
+ /* insn->off == 0 means btf_vmlinux */
+ if (insn->off || kf_sp >= KF_SPECIAL_MAX)
+ return false;
+ return insn->imm == special_kfuncs[kf_sp];
+}
+
#define is_kfunc_special(btf, func_id, func_name) \
__is_kfunc_special(btf, func_id, KF_SPECIAL_##func_name)
+#define is_kfunc_insn_special(insn, func_name) \
+ __is_kfunc_insn_special(insn, KF_SPECIAL_##func_name)
enum kfunc_ptr_arg_types {
KF_ARG_PTR_TO_CTX,
@@ -7855,6 +7927,8 @@ enum kfunc_ptr_arg_types {
KF_ARG_PTR_TO_KPTR_STRONG, /* PTR_TO_KPTR but type specific */
KF_ARG_CONSTRUCTING_LOCAL_KPTR,
KF_ARG_DESTRUCTING_LOCAL_KPTR,
+ KF_ARG_PTR_TO_LIST_HEAD,
+ KF_ARG_PTR_TO_LIST_NODE,
KF_ARG_PTR_TO_MEM,
KF_ARG_PTR_TO_MEM_SIZE, /* Size derived from next argument, skip it */
};
@@ -7902,6 +7976,12 @@ enum kfunc_ptr_arg_types get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
if (btf_get_prog_ctx_type(&env->log, meta->btf, t, resolve_prog_type(env->prog), argno))
return KF_ARG_PTR_TO_CTX;
+ if (is_kfunc_arg_list_head(meta->btf, &args[argno]))
+ return KF_ARG_PTR_TO_LIST_HEAD;
+
+ if (is_kfunc_arg_list_node(meta->btf, &args[argno]))
+ return KF_ARG_PTR_TO_LIST_NODE;
+
if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
if (!btf_type_is_struct(ref_t)) {
verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
@@ -8049,7 +8129,7 @@ static int find_local_type_fields(const struct btf *btf, u32 btf_id, struct loca
FILL_LOCAL_TYPE_FIELD(bpf_list_node, bpf_list_node_init, bpf_empty, false);
FILL_LOCAL_TYPE_FIELD(bpf_spin_lock, bpf_spin_lock_init, bpf_empty, false);
- FILL_LOCAL_TYPE_FIELD(bpf_list_head, bpf_list_head_init, bpf_empty, false);
+ FILL_LOCAL_TYPE_FIELD(bpf_list_head, bpf_list_head_init, bpf_empty, true);
#undef FILL_LOCAL_TYPE_FIELD
@@ -8290,6 +8370,298 @@ process_kf_arg_destructing_local_kptr(struct bpf_verifier_env *env,
return -EINVAL;
}
+static int __reg_release_on_unlock(struct bpf_verifier_env *env, struct bpf_reg_state *reg, bool set)
+{
+ struct bpf_func_state *state = cur_func(env);
+ u32 ref_obj_id = reg->ref_obj_id;
+ int i;
+
+ /* bpf_spin_lock only allows calling list_add and list_del, no BPF
+ * subprogs, no global functions, so this acquired refs state is the
+ * same one we will use to find registers to kill on bpf_spin_unlock.
+ */
+ WARN_ON_ONCE(!ref_obj_id);
+ for (i = 0; i < state->acquired_refs; i++) {
+ if (state->refs[i].id == ref_obj_id) {
+ if (!set)
+ return state->refs[i].release_on_unlock;
+ WARN_ON_ONCE(state->refs[i].release_on_unlock);
+ state->refs[i].release_on_unlock = true;
+ /* Now mark everyone sharing same ref_obj_id as untrusted */
+ bpf_expr_for_each_reg_in_vstate(env->cur_state, state, reg, ({
+ if (reg->ref_obj_id == ref_obj_id)
+ reg->type |= PTR_UNTRUSTED;
+ }));
+ return 0;
+ }
+ }
+ verbose(env, "verifier internal error: ref state missing for ref_obj_id\n");
+ return -EFAULT;
+}
+
+static bool reg_get_release_on_unlock(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
+{
+ return __reg_release_on_unlock(env, reg, false);
+}
+
+static bool reg_mark_release_on_unlock(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
+{
+ return __reg_release_on_unlock(env, reg, true);
+}
+
+static int __process_list_kfunc_head(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg, int argno,
+ const struct btf **btfp, u32 *val_type_id,
+ u32 *node_off)
+{
+ int ret;
+
+ if (reg->type == PTR_TO_MAP_VALUE) {
+ struct bpf_map_value_off_desc *off_desc;
+ struct bpf_map *map_ptr = reg->map_ptr;
+ u32 list_head_off;
+
+ if (!tnum_is_const(reg->var_off)) {
+ verbose(env,
+ "R%d doesn't have constant offset. bpf_list_head has to be at the constant offset\n",
+ argno + 1);
+ return -EINVAL;
+ }
+ if (!map_ptr->btf) {
+ verbose(env, "map '%s' has to have BTF in order to use bpf_list_add{,_tail}\n",
+ map_ptr->name);
+ return -EINVAL;
+ }
+ if (!map_value_has_list_heads(map_ptr)) {
+ ret = PTR_ERR_OR_ZERO(map_ptr->list_head_off_tab);
+ if (ret == -E2BIG)
+ verbose(env, "map '%s' has more than %d bpf_list_head\n", map_ptr->name,
+ BPF_MAP_VALUE_OFF_MAX);
+ else if (ret == -EEXIST)
+ verbose(env, "map '%s' has repeating BTF tags\n", map_ptr->name);
+ else
+ verbose(env, "map '%s' has no valid bpf_list_head\n", map_ptr->name);
+ return -EINVAL;
+ }
+
+ list_head_off = reg->off + reg->var_off.value;
+ off_desc = bpf_map_list_head_off_contains(map_ptr, list_head_off);
+ if (!off_desc) {
+ verbose(env, "off=%d doesn't point to bpf_list_head\n", list_head_off);
+ return -EACCES;
+ }
+
+ /* Now, we found the bpf_list_head, verify locking and element type */
+ *btfp = off_desc->list_head.btf;
+ *val_type_id = off_desc->list_head.value_type_id;
+ *node_off = off_desc->list_head.list_node_off;
+ } else /* PTR_TO_BTF_ID | MEM_TYPE_LOCAL */ {
+ u32 value_type_id, list_node_off;
+ const struct btf_type *t;
+ u32 offset;
+
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ if (!t)
+ return -EFAULT;
+ ret = __btf_local_type_has_bpf_list_head(reg->btf, t, &offset, &value_type_id, &list_node_off);
+ /* Already guaranteed by check_func_arg_ref_off that var_off is not set */
+ if (ret <= 0 || reg->off != offset) {
+ verbose(env, "no bpf_list_head field found at offset=%d\n", reg->off);
+ return ret ?: -EINVAL;
+ }
+
+ /* Now, we found the bpf_list_head, verify locking and element type */
+ *btfp = reg->btf;
+ *val_type_id = value_type_id;
+ *node_off = list_node_off;
+ }
+
+ return 0;
+}
+
+static int process_list_add_kfunc(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg,
+ struct bpf_kfunc_arg_meta *meta,
+ int argno)
+{
+ u32 list_head_value_type_id, list_head_node_off;
+ const struct btf *list_head_btf;
+ bool is_list_head = !!argno;
+ void *ptr;
+ int ret;
+
+ if (is_list_head) {
+ ret = __process_list_kfunc_head(env, reg, argno, &list_head_btf,
+ &list_head_value_type_id, &list_head_node_off);
+ if (ret < 0)
+ return ret;
+ } else {
+ const struct btf_type *t;
+ u32 offset;
+
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ if (!t)
+ return -EFAULT;
+ ret = btf_local_type_has_bpf_list_node(reg->btf, t, &offset);
+ /* Already guaranteed by check_func_arg_ref_off that var_off is not set */
+ if (ret <= 0 || reg->off != offset) {
+ verbose(env, "no %s field found at offset=%d\n",
+ is_list_head ? "bpf_list_head" : "bpf_list_node", reg->off);
+ return ret ?: -EINVAL;
+ }
+
+ /* Save info for use in verification of next argument bpf_list_head */
+ if (WARN_ON_ONCE(reg_get_release_on_unlock(env, reg))) {
+ verbose(env, "bpf_list_node has already been added to a list\n");
+ return -EINVAL;
+ }
+ meta->list_node.btf = reg->btf;
+ meta->list_node.type_id = reg->btf_id;
+ meta->list_node.off = reg->off;
+ /* The node will be released once we unlock bpf_list_head, until
+ * then we have the option of accessing it, but cannot pass it
+ * further to any other helpers or kfuncs.
+ */
+ reg_mark_release_on_unlock(env, reg);
+ return 0;
+ }
+
+ /* Locking safety */
+ if (!env->cur_state->active_spin_lock_ptr) {
+ verbose(env, "cannot add node to bpf_list_head without holding its lock\n");
+ return -EINVAL;
+ }
+
+ if (reg->type == PTR_TO_MAP_VALUE)
+ ptr = reg->map_ptr;
+ else
+ ptr = reg->btf;
+ if (env->cur_state->active_spin_lock_ptr != ptr ||
+ env->cur_state->active_spin_lock_id != reg->id) {
+ verbose(env, "incorrect bpf_spin_lock held for bpf_list_head\n");
+ return -EINVAL;
+ }
+
+ /* Type match */
+ if (meta->list_node.off != list_head_node_off) {
+ verbose(env, "arg list_node off=%d does not match bpf_list_head value's list_node off=%d\n",
+ meta->list_node.off, list_head_node_off);
+ return -EINVAL;
+ }
+ if (!btf_struct_ids_match(&env->log, meta->list_node.btf, meta->list_node.type_id,
+ 0, list_head_btf, list_head_value_type_id, true)) {
+ verbose(env, "bpf_list_node type does not match bpf_list_head value type\n");
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int process_list_del_kfunc(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg,
+ struct bpf_kfunc_arg_meta *meta)
+{
+ const struct btf_type *t;
+ int ret, offset;
+
+ t = btf_type_by_id(reg->btf, reg->btf_id);
+ if (!t)
+ return -EFAULT;
+ ret = btf_local_type_has_bpf_list_node(reg->btf, t, &offset);
+ /* Already guaranteed by check_func_arg_ref_off that var_off is not set */
+ if (ret <= 0 || reg->off != offset) {
+ verbose(env, "no bpf_list_node field found at offset=%d\n", reg->off);
+ return ret ?: -EINVAL;
+ }
+ if (!reg_get_release_on_unlock(env, reg)) {
+ verbose(env, "cannot remove bpf_list_node which is not part of a list\n");
+ return -EINVAL;
+ }
+ /* ... and inserted ones are marked as PTR_UNTRUSTED, so they won't be
+ * seen by us.
+ *
+ * It won't be safe to allow bpf_list_del since we can also do
+ * bpf_list_pop_front or bpf_list_pop_back, so that node can potentially
+ * be deleted twice.
+ *
+ * Regardless, just deleting again after adding is useless, so we're
+ * fine! You have much better pop_front/pop_back available anyway.
+ *
+ * One of the safe contexts would be allowing it in list_for_each
+ * helper, but that is still unimplemented so far, hence leave out
+ * handling that case for now.
+ *
+ * For for_each case, the reg->off should equal list_node_off of the
+ * list we are iterating, which we know statically.
+ *
+ * XXX: Allow when invoking inside for_each helper.
+ */
+ return 0;
+}
+
+static int process_list_pop_kfunc(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg,
+ struct bpf_kfunc_arg_meta *meta,
+ int argno)
+{
+ u32 value_type_id, list_node_off;
+ const struct btf *btf;
+ void *ptr;
+ int ret;
+
+ ret = __process_list_kfunc_head(env, reg, argno, &btf, &value_type_id, &list_node_off);
+ if (ret < 0)
+ return ret;
+ meta->list_node.btf = (struct btf *)btf;
+ meta->list_node.type_id = value_type_id;
+ meta->list_node.off = list_node_off;
+
+ /* Locking safety */
+ if (!env->cur_state->active_spin_lock_ptr) {
+ verbose(env, "cannot add node to bpf_list_head without holding its lock\n");
+ return -EINVAL;
+ }
+
+ if (reg->type == PTR_TO_MAP_VALUE)
+ ptr = reg->map_ptr;
+ else
+ ptr = reg->btf;
+ if (env->cur_state->active_spin_lock_ptr != ptr ||
+ env->cur_state->active_spin_lock_id != reg->id) {
+ verbose(env, "incorrect bpf_spin_lock held for bpf_list_head\n");
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int process_kf_arg_ptr_to_list_head(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg,
+ struct bpf_kfunc_arg_meta *meta,
+ int argno)
+{
+ if ((is_kfunc_special(meta->btf, meta->func_id, bpf_list_add) ||
+ is_kfunc_special(meta->btf, meta->func_id, bpf_list_add_tail)) && argno == 1)
+ return process_list_add_kfunc(env, reg, meta, argno);
+ if ((is_kfunc_special(meta->btf, meta->func_id, bpf_list_pop_front) ||
+ is_kfunc_special(meta->btf, meta->func_id, bpf_list_pop_back)) && argno == 0)
+ return process_list_pop_kfunc(env, reg, meta, argno);
+ verbose(env, "verifier internal error: incorrect bpf_list_head argument\n");
+ return -EFAULT;
+}
+
+static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg,
+ struct bpf_kfunc_arg_meta *meta,
+ int argno)
+{
+ if ((is_kfunc_special(meta->btf, meta->func_id, bpf_list_add) ||
+ is_kfunc_special(meta->btf, meta->func_id, bpf_list_add_tail)) && argno == 0)
+ return process_list_add_kfunc(env, reg, meta, argno);
+ else if (is_kfunc_special(meta->btf, meta->func_id, bpf_list_del) && argno == 0)
+ return process_list_del_kfunc(env, reg, meta);
+ verbose(env, "verifier internal error: incorrect bpf_list_node argument\n");
+ return -EFAULT;
+}
+
static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_arg_meta *meta)
{
const char *func_name = meta->func_name, *ref_tname;
@@ -8428,6 +8800,25 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_arg_m
if (ret < 0)
return ret;
break;
+ case KF_ARG_PTR_TO_LIST_HEAD:
+ if (reg->type != PTR_TO_MAP_VALUE &&
+ reg->type != (PTR_TO_BTF_ID | MEM_TYPE_LOCAL)) {
+ verbose(env, "arg#%d expected pointer to map value or local kptr\n", i);
+ return -EINVAL;
+ }
+ ret = process_kf_arg_ptr_to_list_head(env, reg, meta, i);
+ if (ret < 0)
+ return ret;
+ break;
+ case KF_ARG_PTR_TO_LIST_NODE:
+ if (reg->type != (PTR_TO_BTF_ID | MEM_TYPE_LOCAL)) {
+ verbose(env, "arg#%d expected pointer to local kptr\n", i);
+ return -EINVAL;
+ }
+ ret = process_kf_arg_ptr_to_list_node(env, reg, meta, i);
+ if (ret < 0)
+ return ret;
+ break;
case KF_ARG_PTR_TO_MEM:
resolve_ret = btf_resolve_size(btf, ref_t, &type_size);
if (IS_ERR(resolve_ret)) {
@@ -8545,41 +8936,50 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
regs[BPF_REG_0].type = PTR_TO_BTF_ID;
if (__is_kfunc_ret_dyn_btf(&meta)) {
- const struct btf_type *ret_t;
+ if (is_kfunc_special(meta.btf, meta.func_id, bpf_kptr_alloc)) {
+ const struct btf_type *ret_t;
- /* Currently, only bpf_kptr_alloc needs special handling */
- if (!is_kfunc_special(meta.btf, meta.func_id, bpf_kptr_alloc) ||
- !meta.arg_constant.found || !btf_type_is_void(ptr_type)) {
- verbose(env, "verifier internal error: misconfigured kfunc\n");
- return -EFAULT;
- }
+ if (!meta.arg_constant.found || !btf_type_is_void(ptr_type)) {
+ verbose(env, "verifier internal error: misconfigured kfunc\n");
+ return -EFAULT;
+ }
- if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
- verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
- return -EINVAL;
- }
+ if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
+ verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
+ return -EINVAL;
+ }
- ret_btf = env->prog->aux->btf;
- ret_btf_id = meta.arg_constant.value;
+ ret_btf = env->prog->aux->btf;
+ ret_btf_id = meta.arg_constant.value;
- ret_t = btf_type_by_id(ret_btf, ret_btf_id);
- if (!ret_t || !__btf_type_is_struct(ret_t)) {
- verbose(env, "local type ID %d passed to bpf_kptr_alloc does not refer to struct\n",
- ret_btf_id);
- return -EINVAL;
+ ret_t = btf_type_by_id(ret_btf, ret_btf_id);
+ if (!ret_t || !__btf_type_is_struct(ret_t)) {
+ verbose(env, "local type ID %d passed to bpf_kptr_alloc does not refer to struct\n",
+ ret_btf_id);
+ return -EINVAL;
+ }
+ /* Remember this so that we can rewrite R1 as size in fixup_kfunc_call */
+ env->insn_aux_data[insn_idx].kptr_alloc_size = ret_t->size;
+ /* For now, since we hardcode prog->btf, also hardcode
+ * setting of this flag.
+ */
+ regs[BPF_REG_0].type |= MEM_TYPE_LOCAL;
+ /* Recognize special fields in local type and force
+ * their construction before pointer escapes by setting
+ * OBJ_CONSTRUCTING.
+ */
+ if (btf_local_type_has_special_fields(ret_btf, ret_t))
+ regs[BPF_REG_0].type |= OBJ_CONSTRUCTING;
+ } else if (is_kfunc_special(meta.btf, meta.func_id, bpf_list_pop_front) ||
+ is_kfunc_special(meta.btf, meta.func_id, bpf_list_pop_back)) {
+ ret_btf = meta.list_node.btf;
+ ret_btf_id = meta.list_node.type_id;
+ regs[BPF_REG_0].off = meta.list_node.off;
+ regs[BPF_REG_0].type |= MEM_TYPE_LOCAL;
+ } else {
+ verbose(env, "verifier internal error: missing __KF_RET_DYN_BTF handling\n");
+ return -EFAULT;
}
- /* Remember this so that we can rewrite R1 as size in fixup_kfunc_call */
- env->insn_aux_data[insn_idx].kptr_alloc_size = ret_t->size;
- /* For now, since we hardcode prog->btf, also hardcode
- * setting of this flag.
- */
- regs[BPF_REG_0].type |= MEM_TYPE_LOCAL;
- /* Recognize special fields in local type and force
- * their construction before pointer escapes by setting
- * OBJ_CONSTRUCTING.
- */
- if (btf_local_type_has_special_fields(ret_btf, ret_t))
- regs[BPF_REG_0].type |= OBJ_CONSTRUCTING;
} else {
if (!btf_type_is_struct(ptr_type)) {
ptr_type_name = btf_name_by_offset(desc_btf, ptr_type->name_off);
@@ -13228,11 +13628,22 @@ static int do_check(struct bpf_verifier_env *env)
return -EINVAL;
}
- if (env->cur_state->active_spin_lock_ptr &&
- (insn->src_reg == BPF_PSEUDO_CALL ||
- insn->imm != BPF_FUNC_spin_unlock)) {
- verbose(env, "function calls are not allowed while holding a lock\n");
- return -EINVAL;
+ if (env->cur_state->active_spin_lock_ptr) {
+ /* Only a limited set of functions may be called here:
+ * the stable bpf_spin_unlock helper and the
+ * unstable bpf_list_* kfuncs.
+ */
+ if ((insn->src_reg == BPF_REG_0 && insn->imm != BPF_FUNC_spin_unlock) ||
+ (insn->src_reg == BPF_PSEUDO_CALL) ||
+ (insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
+ !is_kfunc_insn_special(insn, bpf_list_add) &&
+ !is_kfunc_insn_special(insn, bpf_list_add_tail) &&
+ !is_kfunc_insn_special(insn, bpf_list_del) &&
+ !is_kfunc_insn_special(insn, bpf_list_pop_front) &&
+ !is_kfunc_insn_special(insn, bpf_list_pop_back))) {
+ verbose(env, "function calls are not allowed while holding a lock\n");
+ return -EINVAL;
+ }
}
if (insn->src_reg == BPF_PSEUDO_CALL)
err = check_func_call(env, insn, &env->insn_idx);
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 595e99d5cbc2..a8f7a5af8ee3 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -67,4 +67,39 @@ void bpf_list_head_init(struct bpf_list_head *node) __ksym;
*/
void bpf_kptr_free(void *kptr) __ksym;
+/* Description
+ * Add a new entry to the head of a BPF linked list.
+ * Returns
+ * Void.
+ */
+void bpf_list_add(struct bpf_list_node *node, struct bpf_list_head *head) __ksym;
+
+/* Description
+ * Add a new entry to the tail of a BPF linked list.
+ * Returns
+ * Void.
+ */
+void bpf_list_add_tail(struct bpf_list_node *node, struct bpf_list_head *head) __ksym;
+
+/* Description
+ * Remove an entry already part of a BPF linked list.
+ * Returns
+ * Void.
+ */
+void bpf_list_del(struct bpf_list_node *node) __ksym;
+
+/* Description
+ * Remove the first entry of a BPF linked list.
+ * Returns
+ * Pointer to bpf_list_node of deleted entry, or NULL if list is empty.
+ */
+struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym;
+
+/* Description
+ * Remove the last entry of a BPF linked list.
+ * Returns
+ * Pointer to bpf_list_node of deleted entry, or NULL if list is empty.
+ */
+struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym;
+
#endif
--
2.34.1
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH RFC bpf-next v1 24/32] bpf: Permit NULL checking pointer with non-zero fixed offset
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (22 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 23/32] bpf: Add single ownership BPF linked list API Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 25/32] bpf: Allow storing local kptrs in BPF maps Kumar Kartikeya Dwivedi
` (7 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Pointer arithmetic on a pointer that may be NULL is already rejected by
the verifier, hence make an exception for local kptrs while still
keeping the warning for other unintended cases that might creep in.
The bpf_list_pop_{front,back} kfuncs return a local kptr whose fixed
offset points to the bpf_list_node field. The user is then supposed to
NULL check it and obtain the pointer to the entry using container_of.
The current restrictions trigger a warning when doing that NULL check.
Revisiting the reason, the warning is meant as an assertion, and it does
catch this case as intended. Good job, verifier!
Under no other circumstances can reg->off be non-zero for a register
that has the PTR_MAYBE_NULL type flag set, hence keep warning for those.
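For reference, the pattern that currently hits the warning looks roughly
like this in BPF C (a minimal sketch; the element type and field names
are made up for illustration, and the matching bpf_spin_lock is assumed
to be held around the list operation):

	struct elem {
		int data;
		struct bpf_list_node node;
	};

	struct bpf_list_node *n;
	struct elem *e;

	n = bpf_list_pop_front(&head);
	if (!n) /* NULL check on a pointer whose fixed off points at 'node' */
		return 0;
	e = container_of(n, struct elem, node);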
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/verifier.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index dcbeb503c25c..5e0044796671 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11173,15 +11173,20 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
{
if (type_may_be_null(reg->type) && reg->id == id &&
!WARN_ON_ONCE(!reg->id)) {
- if (WARN_ON_ONCE(reg->smin_value || reg->smax_value ||
- !tnum_equals_const(reg->var_off, 0) ||
- reg->off)) {
+ if (reg->smin_value || reg->smax_value || !tnum_equals_const(reg->var_off, 0) || reg->off) {
/* Old offset (both fixed and variable parts) should
* have been known-zero, because we don't allow pointer
* arithmetic on pointers that might be NULL. If we
* see this happening, don't convert the register.
+ *
+ * But in some cases, some helpers that return local
+ * kptrs advance offset for the returned pointer.
+ * In those cases, it is fine to expect to see reg->off.
*/
- return;
+ if (WARN_ON_ONCE(reg->type != (PTR_TO_BTF_ID | MEM_TYPE_LOCAL | PTR_MAYBE_NULL)))
+ return;
+ if (WARN_ON_ONCE(reg->smin_value || reg->smax_value || !tnum_equals_const(reg->var_off, 0)))
+ return;
}
if (is_null) {
reg->type = SCALAR_VALUE;
--
2.34.1
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH RFC bpf-next v1 25/32] bpf: Allow storing local kptrs in BPF maps
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (23 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 24/32] bpf: Permit NULL checking pointer with non-zero fixed offset Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 26/32] bpf: Wire up freeing of bpf_list_heads in maps Kumar Kartikeya Dwivedi
` (6 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Enable users to store local kptrs allocated from bpf_kptr_alloc into BPF
maps. These are always referenced, hence they can only be stored in
fields with the kptr_ref type tag.
However, compared to normal kptrs, local kptrs point to types in program
BTF. To tell the verifier which BTF to search for the target type of the
pointer in the map value, introduce a new "local" type tag that can be
combined with the kptr_ref type tag.
When both are combined, we do not search for the type in kernel BTF, and
instead stash a reference to the map BTF. When a program uses this map,
the map BTF must match the program BTF for it to be able to interact
with this local kptr in the map value.
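As an illustration, a map value carrying such a field would be declared
roughly as follows in BPF C (a sketch only; the type names are made up,
and the tag spelling via btf_type_tag attributes is assumed to match how
the existing kptr/kptr_ref tags are declared):

	struct foo {
		int data;
		long counter;
	};

	struct map_value {
		struct foo __attribute__((btf_type_tag("local")))
			   __attribute__((btf_type_tag("kptr_ref"))) *f;
	};

With both tags present, the verifier resolves struct foo against the map
BTF (which must match the program BTF) instead of vmlinux or module BTF.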
Later, more complex type matching (field by field) can be introduced to
allow the case where program BTF differs from map BTF, but for now that
is not done.
Note that for these local kptr fields, bpf_kptr_xchg will set the type
flag 'MEM_TYPE_LOCAL' so that the type information is preserved when
moving the value from program to map and vice versa.
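For example, moving an allocated object into such a field and retrieving
the old value would look roughly like this (sketch only, reusing the
illustrative types above; v points to the looked-up map value, f is a
fully constructed local kptr allocated earlier with bpf_kptr_alloc, and
struct foo is assumed to have no special fields needing destruction):

	struct foo *old;

	old = bpf_kptr_xchg(&v->f, f);
	if (old)
		bpf_kptr_free(old);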
Only fully constructed local kptrs can be moved into maps, i.e. escaping
the program into the map is not possible in the constructing or
destructing phase. We may allow allocated but not yet initialized local
kptrs in maps in the future. This would require that all fields are
either in the unknown or destructed state (i.e. the state right after
allocation or right after destruction of the last field needing
destruction). It would allow holding ownership of the storage while the
object in it has been destroyed, which in turn would allow reusing the
same storage for multiple types that fit in it.
A later commit will wire up freeing of these local kptr fields from the
map value.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 7 +++-
include/linux/btf.h | 2 +-
kernel/bpf/btf.c | 75 ++++++++++++++++++++++++++++---------------
kernel/bpf/helpers.c | 4 +--
kernel/bpf/syscall.c | 40 +++++++++++++++++++++--
kernel/bpf/verifier.c | 63 ++++++++++++++++++++++++++----------
6 files changed, 142 insertions(+), 49 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 910aa891b97a..3353c47fefa9 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -182,7 +182,9 @@ enum {
enum bpf_off_type {
BPF_KPTR_UNREF,
BPF_KPTR_REF,
+ BPF_LOCAL_KPTR_REF,
BPF_LIST_HEAD,
+ BPF_OFF_TYPE_MAX,
};
struct bpf_map_value_off_desc {
@@ -546,6 +548,7 @@ enum bpf_arg_type {
ARG_PTR_TO_LONG, /* pointer to long */
ARG_PTR_TO_SOCKET, /* pointer to bpf_sock (fullsock) */
ARG_PTR_TO_BTF_ID, /* pointer to in-kernel struct */
+ ARG_PTR_TO_DYN_BTF_ID, /* pointer to in-kernel or local struct */
ARG_PTR_TO_ALLOC_MEM, /* pointer to dynamically allocated memory */
ARG_CONST_ALLOC_SIZE_OR_ZERO, /* number of allocated bytes requested */
ARG_PTR_TO_BTF_ID_SOCK_COMMON, /* pointer to in-kernel sock_common or bpf-mirrored bpf_sock */
@@ -565,7 +568,7 @@ enum bpf_arg_type {
ARG_PTR_TO_SOCKET_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_SOCKET,
ARG_PTR_TO_ALLOC_MEM_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_ALLOC_MEM,
ARG_PTR_TO_STACK_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_STACK,
- ARG_PTR_TO_BTF_ID_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_BTF_ID,
+ ARG_PTR_TO_DYN_BTF_ID_OR_NULL = PTR_MAYBE_NULL | ARG_PTR_TO_DYN_BTF_ID,
/* pointer to memory does not need to be initialized, helper function must fill
* all bytes or clear them in error case.
*/
@@ -591,6 +594,7 @@ enum bpf_return_type {
RET_PTR_TO_ALLOC_MEM, /* returns a pointer to dynamically allocated memory */
RET_PTR_TO_MEM_OR_BTF_ID, /* returns a pointer to a valid memory or a btf_id */
RET_PTR_TO_BTF_ID, /* returns a pointer to a btf_id */
+ RET_PTR_TO_DYN_BTF_ID, /* returns a pointer to a btf_id determined dynamically */
__BPF_RET_TYPE_MAX,
/* Extended ret_types. */
@@ -601,6 +605,7 @@ enum bpf_return_type {
RET_PTR_TO_ALLOC_MEM_OR_NULL = PTR_MAYBE_NULL | MEM_ALLOC | RET_PTR_TO_ALLOC_MEM,
RET_PTR_TO_DYNPTR_MEM_OR_NULL = PTR_MAYBE_NULL | RET_PTR_TO_ALLOC_MEM,
RET_PTR_TO_BTF_ID_OR_NULL = PTR_MAYBE_NULL | RET_PTR_TO_BTF_ID,
+ RET_PTR_TO_DYN_BTF_ID_OR_NULL = PTR_MAYBE_NULL | RET_PTR_TO_DYN_BTF_ID,
/* This must be the last entry. Its purpose is to ensure the enum is
* wide enough to hold the higher bits reserved for bpf_type_flag.
diff --git a/include/linux/btf.h b/include/linux/btf.h
index bd57a9cae12c..b7d704f730c2 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -157,7 +157,7 @@ bool btf_member_is_reg_int(const struct btf *btf, const struct btf_type *s,
u32 expected_offset, u32 expected_size);
int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
int btf_find_timer(const struct btf *btf, const struct btf_type *t);
-struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
+struct bpf_map_value_off *btf_parse_kptrs(struct btf *btf,
const struct btf_type *t);
struct bpf_map_value_off *btf_parse_list_heads(struct btf *btf,
const struct btf_type *t);
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index e2ac088cb64f..54267b52ff0c 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3224,8 +3224,10 @@ static int btf_find_struct(const struct btf *btf, const struct btf_type *t,
static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
u32 off, int sz, struct btf_field_info *info)
{
- enum bpf_off_type type;
+ enum bpf_off_type type = BPF_OFF_TYPE_MAX;
+ bool local = false;
u32 res_id;
+ int i;
/* Permit modifiers on the pointer itself */
if (btf_type_is_volatile(t))
@@ -3237,16 +3239,29 @@ static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
if (!btf_type_is_type_tag(t))
return BTF_FIELD_IGNORE;
- /* Reject extra tags */
- if (btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
- return -EINVAL;
- if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off)))
- type = BPF_KPTR_UNREF;
- else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off)))
- type = BPF_KPTR_REF;
- else
- return -EINVAL;
-
+ /* Maximum two type tags supported */
+ for (i = 0; i < 2; i++) {
+ if (!strcmp("kptr", __btf_name_by_offset(btf, t->name_off))) {
+ type = BPF_KPTR_UNREF;
+ } else if (!strcmp("kptr_ref", __btf_name_by_offset(btf, t->name_off))) {
+ type = BPF_KPTR_REF;
+ } else if (!strcmp("local", __btf_name_by_offset(btf, t->name_off))) {
+ local = true;
+ } else {
+ return -EINVAL;
+ }
+ if (!btf_type_is_type_tag(btf_type_by_id(btf, t->type)))
+ break;
+ /* Reject extra tags */
+ if (i == 1)
+ return -EINVAL;
+ t = btf_type_by_id(btf, t->type);
+ }
+ if (local) {
+ if (type == BPF_KPTR_UNREF)
+ return -EINVAL;
+ type = BPF_LOCAL_KPTR_REF;
+ }
/* Get the base type */
t = btf_type_skip_modifiers(btf, t->type, &res_id);
/* Only pointer to struct is allowed */
@@ -3521,12 +3536,12 @@ int btf_find_timer(const struct btf *btf, const struct btf_type *t)
return info.off;
}
-struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
+struct bpf_map_value_off *btf_parse_kptrs(struct btf *btf,
const struct btf_type *t)
{
struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
struct bpf_map_value_off *tab;
- struct btf *kernel_btf = NULL;
+ struct btf *kptr_btf = NULL;
struct module *mod = NULL;
int ret, i, nr_off;
@@ -3547,13 +3562,21 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
/* Find type in map BTF, and use it to look up the matching type
* in vmlinux or module BTFs, by name and kind.
+ * For local kptrs, stash reference to map BTF and type ID same
+ * as in info_arr.
*/
- t = btf_type_by_id(btf, info_arr[i].kptr.type_id);
- id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
- &kernel_btf);
- if (id < 0) {
- ret = id;
- goto end;
+ if (info_arr[i].kptr.type == BPF_LOCAL_KPTR_REF) {
+ kptr_btf = btf;
+ btf_get(kptr_btf);
+ id = info_arr[i].kptr.type_id;
+ } else {
+ t = btf_type_by_id(btf, info_arr[i].kptr.type_id);
+ id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
+ &kptr_btf);
+ if (id < 0) {
+ ret = id;
+ goto end;
+ }
}
/* Find and stash the function pointer for the destruction function that
@@ -3569,20 +3592,20 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
* can be used as a referenced pointer and be stored in a map at
* the same time.
*/
- dtor_btf_id = btf_find_dtor_kfunc(kernel_btf, id);
+ dtor_btf_id = btf_find_dtor_kfunc(kptr_btf, id);
if (dtor_btf_id < 0) {
ret = dtor_btf_id;
goto end_btf;
}
- dtor_func = btf_type_by_id(kernel_btf, dtor_btf_id);
+ dtor_func = btf_type_by_id(kptr_btf, dtor_btf_id);
if (!dtor_func) {
ret = -ENOENT;
goto end_btf;
}
- if (btf_is_module(kernel_btf)) {
- mod = btf_try_get_module(kernel_btf);
+ if (btf_is_module(kptr_btf)) {
+ mod = btf_try_get_module(kptr_btf);
if (!mod) {
ret = -ENXIO;
goto end_btf;
@@ -3592,7 +3615,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
/* We already verified dtor_func to be btf_type_is_func
* in register_btf_id_dtor_kfuncs.
*/
- dtor_func_name = __btf_name_by_offset(kernel_btf, dtor_func->name_off);
+ dtor_func_name = __btf_name_by_offset(kptr_btf, dtor_func->name_off);
addr = kallsyms_lookup_name(dtor_func_name);
if (!addr) {
ret = -EINVAL;
@@ -3604,7 +3627,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
tab->off[i].offset = info_arr[i].off;
tab->off[i].type = info_arr[i].kptr.type;
tab->off[i].kptr.btf_id = id;
- tab->off[i].kptr.btf = kernel_btf;
+ tab->off[i].kptr.btf = kptr_btf;
tab->off[i].kptr.module = mod;
}
tab->nr_off = nr_off;
@@ -3612,7 +3635,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
end_mod:
module_put(mod);
end_btf:
- btf_put(kernel_btf);
+ btf_put(kptr_btf);
end:
while (i--) {
btf_put(tab->off[i].kptr.btf);
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 9d5709441800..168460a03ec3 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1385,10 +1385,10 @@ BPF_CALL_2(bpf_kptr_xchg, void *, map_value, void *, ptr)
static const struct bpf_func_proto bpf_kptr_xchg_proto = {
.func = bpf_kptr_xchg,
.gpl_only = false,
- .ret_type = RET_PTR_TO_BTF_ID_OR_NULL,
+ .ret_type = RET_PTR_TO_DYN_BTF_ID_OR_NULL,
.ret_btf_id = BPF_PTR_POISON,
.arg1_type = ARG_PTR_TO_KPTR,
- .arg2_type = ARG_PTR_TO_BTF_ID_OR_NULL | OBJ_RELEASE,
+ .arg2_type = ARG_PTR_TO_DYN_BTF_ID_OR_NULL | OBJ_RELEASE,
.arg2_btf_id = BPF_PTR_POISON,
};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index e1749e0d2143..1af9a7cba08c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -590,6 +590,39 @@ bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_ma
map_value_has_kptrs(map_b));
}
+static void bpf_free_local_kptr(const struct btf *btf, u32 btf_id, void *kptr)
+{
+ struct list_head *list, *olist;
+ u32 offset, list_node_off;
+ const struct btf_type *t;
+ void *entry;
+ int ret;
+
+ if (!kptr)
+ return;
+ /* We must free bpf_list_head in local kptr */
+ t = btf_type_by_id(btf, btf_id);
+ /* TODO: We should just populate this info once in struct btf, and then
+ * do quick lookups into it. Instead of offset, table would be keyed by
+ * btf_id.
+ */
+ ret = __btf_local_type_has_bpf_list_head(btf, t, &offset, NULL, &list_node_off);
+ if (ret <= 0)
+ goto free_kptr;
+ /* List elements for bpf_list_head in local kptr cannot have
+ * bpf_list_head again. Hence, just iterate and kfree them.
+ */
+ olist = list = kptr + offset;
+ list = list->next;
+ while (list != olist) {
+ entry = list - list_node_off;
+ list = list->next;
+ kfree(entry);
+ }
+free_kptr:
+ kfree(kptr);
+}
+
/* Caller must ensure map_value_has_kptrs is true. Note that this function can
* be called on a map value while the map_value is visible to BPF programs, as
* it ensures the correct synchronization, and we already enforce the same using
@@ -613,7 +646,10 @@ void bpf_map_free_kptrs(struct bpf_map *map, void *map_value)
continue;
}
old_ptr = xchg(btf_id_ptr, 0);
- off_desc->kptr.dtor((void *)old_ptr);
+ if (off_desc->type == BPF_LOCAL_KPTR_REF)
+ bpf_free_local_kptr(off_desc->kptr.btf, off_desc->kptr.btf_id, (void *)old_ptr);
+ else
+ off_desc->kptr.dtor((void *)old_ptr);
}
}
@@ -1102,7 +1138,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
return -EOPNOTSUPP;
}
- map->kptr_off_tab = btf_parse_kptrs(btf, value_type);
+ map->kptr_off_tab = btf_parse_kptrs((struct btf *)btf, value_type);
if (map_value_has_kptrs(map)) {
if (!bpf_capable()) {
ret = -EPERM;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5e0044796671..d2c4ffc80f4d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3713,13 +3713,18 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
/* Only unreferenced case accepts untrusted pointers */
if (off_desc->type == BPF_KPTR_UNREF)
perm_flags |= PTR_UNTRUSTED;
+ else if (off_desc->type == BPF_LOCAL_KPTR_REF)
+ perm_flags |= MEM_TYPE_LOCAL;
if (base_type(reg->type) != PTR_TO_BTF_ID || (type_flag(reg->type) & ~perm_flags))
goto bad_type;
- if (!btf_is_kernel(reg->btf)) {
+ if (off_desc->type != BPF_LOCAL_KPTR_REF && !btf_is_kernel(reg->btf)) {
verbose(env, "R%d must point to kernel BTF\n", regno);
return -EINVAL;
+ } else if (off_desc->type == BPF_LOCAL_KPTR_REF && btf_is_kernel(reg->btf)) {
+ verbose(env, "R%d must point to program BTF\n", regno);
+ return -EINVAL;
}
/* We need to verify reg->type and reg->btf, before accessing reg->btf */
reg_name = kernel_type_name(reg->btf, reg->btf_id);
@@ -3759,7 +3764,8 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
*/
if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, reg->off,
off_desc->kptr.btf, off_desc->kptr.btf_id,
- off_desc->type == BPF_KPTR_REF))
+ off_desc->type == BPF_KPTR_REF ||
+ off_desc->type == BPF_LOCAL_KPTR_REF))
goto bad_type;
return 0;
bad_type:
@@ -3797,18 +3803,21 @@ static int check_map_kptr_access(struct bpf_verifier_env *env, u32 regno,
/* We only allow loading referenced kptr, since it will be marked as
* untrusted, similar to unreferenced kptr.
*/
- if (class != BPF_LDX && off_desc->type == BPF_KPTR_REF) {
+ if (class != BPF_LDX &&
+ (off_desc->type == BPF_KPTR_REF || off_desc->type == BPF_LOCAL_KPTR_REF)) {
verbose(env, "store to referenced kptr disallowed\n");
return -EACCES;
}
if (class == BPF_LDX) {
+ int local = (off_desc->type == BPF_LOCAL_KPTR_REF) ? MEM_TYPE_LOCAL : 0;
+
val_reg = reg_state(env, value_regno);
/* We can simply mark the value_regno receiving the pointer
* value from map as PTR_TO_BTF_ID, with the correct type.
*/
mark_btf_ld_reg(env, cur_regs(env), value_regno, PTR_TO_BTF_ID, off_desc->kptr.btf,
- off_desc->kptr.btf_id, PTR_MAYBE_NULL | PTR_UNTRUSTED);
+ off_desc->kptr.btf_id, local | PTR_MAYBE_NULL | PTR_UNTRUSTED);
/* For mark_ptr_or_null_reg */
val_reg->id = ++env->id_gen;
} else if (class == BPF_STX) {
@@ -5648,7 +5657,8 @@ static int process_kptr_func(struct bpf_verifier_env *env, int regno,
verbose(env, "off=%d doesn't point to kptr\n", kptr_off);
return -EACCES;
}
- if (off_desc->type != BPF_KPTR_REF) {
+ if (off_desc->type != BPF_KPTR_REF &&
+ off_desc->type != BPF_LOCAL_KPTR_REF) {
verbose(env, "off=%d kptr isn't referenced kptr\n", kptr_off);
return -EACCES;
}
@@ -5779,6 +5789,13 @@ static const struct bpf_reg_types spin_lock_types = {
},
};
+static const struct bpf_reg_types dyn_btf_ptr_types = {
+ .types = {
+ PTR_TO_BTF_ID,
+ PTR_TO_BTF_ID | MEM_TYPE_LOCAL,
+ },
+};
+
static const struct bpf_reg_types fullsock_types = { .types = { PTR_TO_SOCKET } };
static const struct bpf_reg_types scalar_types = { .types = { SCALAR_VALUE } };
static const struct bpf_reg_types context_types = { .types = { PTR_TO_CTX } };
@@ -5806,6 +5823,7 @@ static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
#endif
[ARG_PTR_TO_SOCKET] = &fullsock_types,
[ARG_PTR_TO_BTF_ID] = &btf_ptr_types,
+ [ARG_PTR_TO_DYN_BTF_ID] = &dyn_btf_ptr_types,
[ARG_PTR_TO_SPIN_LOCK] = &spin_lock_types,
[ARG_PTR_TO_MEM] = &mem_types,
[ARG_PTR_TO_ALLOC_MEM] = &alloc_mem_types,
@@ -5867,7 +5885,8 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
return -EACCES;
found:
- if (reg->type == PTR_TO_BTF_ID) {
+ if (reg->type == PTR_TO_BTF_ID ||
+ reg->type == (PTR_TO_BTF_ID | MEM_TYPE_LOCAL)) {
/* For bpf_sk_release, it needs to match against first member
* 'struct sock_common', hence make an exception for it. This
* allows bpf_sk_release to work for multiple socket types.
@@ -5877,7 +5896,8 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
if (type_is_local(reg->type) &&
WARN_ON_ONCE(meta->func_id != BPF_FUNC_spin_lock &&
- meta->func_id != BPF_FUNC_spin_unlock))
+ meta->func_id != BPF_FUNC_spin_unlock &&
+ meta->func_id != BPF_FUNC_kptr_xchg))
return -EFAULT;
if (!arg_btf_id) {
@@ -6031,6 +6051,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
/* arg_btf_id and arg_size are in a union. */
if (base_type(arg_type) == ARG_PTR_TO_BTF_ID ||
+ base_type(arg_type) == ARG_PTR_TO_DYN_BTF_ID ||
base_type(arg_type) == ARG_PTR_TO_SPIN_LOCK)
arg_btf_id = fn->arg_btf_id[arg];
@@ -6621,7 +6642,8 @@ static bool check_btf_id_ok(const struct bpf_func_proto *fn)
if (base_type(fn->arg_type[i]) == ARG_PTR_TO_BTF_ID)
return !!fn->arg_btf_id[i];
- if (base_type(fn->arg_type[i]) == ARG_PTR_TO_SPIN_LOCK)
+ if (base_type(fn->arg_type[i]) == ARG_PTR_TO_DYN_BTF_ID ||
+ base_type(fn->arg_type[i]) == ARG_PTR_TO_SPIN_LOCK)
return fn->arg_btf_id[i] == BPF_PTR_POISON;
if (base_type(fn->arg_type[i]) != ARG_PTR_TO_BTF_ID && fn->arg_btf_id[i] &&
@@ -7575,28 +7597,33 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
}
case RET_PTR_TO_BTF_ID:
{
- struct btf *ret_btf;
int ret_btf_id;
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].type = PTR_TO_BTF_ID | ret_flag;
- if (func_id == BPF_FUNC_kptr_xchg) {
- ret_btf = meta.kptr_off_desc->kptr.btf;
- ret_btf_id = meta.kptr_off_desc->kptr.btf_id;
- } else {
- ret_btf = btf_vmlinux;
- ret_btf_id = *fn->ret_btf_id;
- }
+ ret_btf_id = *fn->ret_btf_id;
if (ret_btf_id == 0) {
verbose(env, "invalid return type %u of func %s#%d\n",
base_type(ret_type), func_id_name(func_id),
func_id);
return -EINVAL;
}
- regs[BPF_REG_0].btf = ret_btf;
+ regs[BPF_REG_0].btf = btf_vmlinux;
regs[BPF_REG_0].btf_id = ret_btf_id;
break;
}
+ case RET_PTR_TO_DYN_BTF_ID:
+ if (func_id != BPF_FUNC_kptr_xchg) {
+ verbose(env, "verifier internal error: incorrect use of RET_PTR_TO_DYN_BTF_ID\n");
+ return -EFAULT;
+ }
+ mark_reg_known_zero(env, regs, BPF_REG_0);
+ regs[BPF_REG_0].type = PTR_TO_BTF_ID | ret_flag;
+ if (meta.kptr_off_desc->type == BPF_LOCAL_KPTR_REF)
+ regs[BPF_REG_0].type |= MEM_TYPE_LOCAL;
+ regs[BPF_REG_0].btf = meta.kptr_off_desc->kptr.btf;
+ regs[BPF_REG_0].btf_id = meta.kptr_off_desc->kptr.btf_id;
+ break;
default:
verbose(env, "unknown return type %u of func %s#%d\n",
base_type(ret_type), func_id_name(func_id), func_id);
@@ -14832,6 +14859,8 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
break;
case PTR_TO_BTF_ID:
case PTR_TO_BTF_ID | PTR_UNTRUSTED:
+ /* Only untrusted local kptrs need probe_mem conversions for loads */
+ case PTR_TO_BTF_ID | MEM_TYPE_LOCAL | PTR_UNTRUSTED:
if (type == BPF_READ) {
insn->code = BPF_LDX | BPF_PROBE_MEM |
BPF_SIZE((insn)->code);
--
2.34.1
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH RFC bpf-next v1 26/32] bpf: Wire up freeing of bpf_list_heads in maps
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (24 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 25/32] bpf: Allow storing local kptrs in BPF maps Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 27/32] bpf: Add destructor for bpf_list_head in local kptr Kumar Kartikeya Dwivedi
` (5 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Until now, bpf_list_heads in maps were not being freed. Wire up the code
needed to release them when found in the map value.
This will also handle freeing local kptrs with a bpf_list_head in them.
Note that a bpf_list_head in a map value requires appropriate locking
during the draining operation.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 3 +++
kernel/bpf/arraymap.c | 8 +++++++
kernel/bpf/bpf_local_storage.c | 11 ++++++----
kernel/bpf/hashtab.c | 22 +++++++++++++++++++
kernel/bpf/helpers.c | 14 ++++++++++++
kernel/bpf/syscall.c | 39 ++++++++++++++++++++++++++++++++++
6 files changed, 93 insertions(+), 4 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3353c47fefa9..ad18408ba442 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -381,6 +381,8 @@ static inline void zero_map_value(struct bpf_map *map, void *dst)
void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
bool lock_src);
+void bpf_map_value_lock(struct bpf_map *map, void *map_value);
+void bpf_map_value_unlock(struct bpf_map *map, void *map_value);
void bpf_timer_cancel_and_free(void *timer);
int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
@@ -1736,6 +1738,7 @@ struct bpf_map_value_off_desc *bpf_map_list_head_off_contains(struct bpf_map *ma
void bpf_map_free_list_head_off_tab(struct bpf_map *map);
struct bpf_map_value_off *bpf_map_copy_list_head_off_tab(const struct bpf_map *map);
bool bpf_map_equal_list_head_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
+void bpf_map_free_list_heads(struct bpf_map *map, void *map_value);
struct bpf_map *bpf_map_get(u32 ufd);
struct bpf_map *bpf_map_get_with_uref(u32 ufd);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index c7263ee3a35f..5412fa66d659 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -312,6 +312,8 @@ static void check_and_free_fields(struct bpf_array *arr, void *val)
bpf_timer_cancel_and_free(val + arr->map.timer_off);
if (map_value_has_kptrs(&arr->map))
bpf_map_free_kptrs(&arr->map, val);
+ if (map_value_has_list_heads(&arr->map))
+ bpf_map_free_list_heads(&arr->map, val);
}
/* Called from syscall or from eBPF program */
@@ -443,6 +445,12 @@ static void array_map_free(struct bpf_map *map)
bpf_map_free_kptr_off_tab(map);
}
+ if (map_value_has_list_heads(map)) {
+ for (i = 0; i < array->map.max_entries; i++)
+ bpf_map_free_list_heads(map, array_map_elem_ptr(array, i));
+ bpf_map_free_list_head_off_tab(map);
+ }
+
if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY)
bpf_array_free_percpu(array);
diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
index b5ccd76026b6..e89c6aa5d782 100644
--- a/kernel/bpf/bpf_local_storage.c
+++ b/kernel/bpf/bpf_local_storage.c
@@ -107,6 +107,8 @@ static void check_and_free_fields(struct bpf_local_storage_elem *selem)
{
if (map_value_has_kptrs(selem->map))
bpf_map_free_kptrs(selem->map, SDATA(selem));
+ if (map_value_has_list_heads(selem->map))
+ bpf_map_free_list_heads(selem->map, SDATA(selem));
}
static void bpf_selem_free_rcu(struct rcu_head *rcu)
@@ -608,13 +610,14 @@ void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
*/
synchronize_rcu();
- /* When local storage map has kptrs, the call_rcu callback accesses
- * kptr_off_tab, hence we need the bpf_selem_free_rcu callbacks to
- * finish before we free it.
+ /* When local storage map has kptrs or bpf_list_heads, the call_rcu
+ * callback accesses kptr_off_tab or list_head_off_tab, hence we need
+ * the bpf_selem_free_rcu callbacks to finish before we free it.
*/
- if (map_value_has_kptrs(&smap->map)) {
+ if (map_value_has_kptrs(&smap->map) || map_value_has_list_heads(&smap->map)) {
rcu_barrier();
bpf_map_free_kptr_off_tab(&smap->map);
+ bpf_map_free_list_head_off_tab(&smap->map);
}
bpf_map_free_list_head_off_tab(&smap->map);
kvfree(smap->buckets);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 270e0ecf4ba3..bd1637fa7e3b 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -297,6 +297,25 @@ static void htab_free_prealloced_kptrs(struct bpf_htab *htab)
}
}
+static void htab_free_prealloced_list_heads(struct bpf_htab *htab)
+{
+ u32 num_entries = htab->map.max_entries;
+ int i;
+
+ if (!map_value_has_list_heads(&htab->map))
+ return;
+ if (htab_has_extra_elems(htab))
+ num_entries += num_possible_cpus();
+
+ for (i = 0; i < num_entries; i++) {
+ struct htab_elem *elem;
+
+ elem = get_htab_elem(htab, i);
+ bpf_map_free_list_heads(&htab->map, elem->key + round_up(htab->map.key_size, 8));
+ cond_resched();
+ }
+}
+
static void htab_free_elems(struct bpf_htab *htab)
{
int i;
@@ -782,6 +801,8 @@ static void check_and_free_fields(struct bpf_htab *htab,
bpf_map_free_kptrs(&htab->map, map_value);
}
}
+ if (map_value_has_list_heads(&htab->map))
+ bpf_map_free_list_heads(&htab->map, map_value);
}
/* It is called from the bpf_lru_list when the LRU needs to delete
@@ -1514,6 +1535,7 @@ static void htab_map_free(struct bpf_map *map)
if (!htab_is_prealloc(htab)) {
delete_all_elements(htab);
} else {
+ htab_free_prealloced_list_heads(htab);
htab_free_prealloced_kptrs(htab);
prealloc_destroy(htab);
}
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 168460a03ec3..832dd57ae608 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -377,6 +377,20 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
preempt_enable();
}
+void bpf_map_value_lock(struct bpf_map *map, void *map_value)
+{
+ WARN_ON_ONCE(map->spin_lock_off < 0);
+ preempt_disable();
+ __bpf_spin_lock_irqsave(map_value + map->spin_lock_off);
+}
+
+void bpf_map_value_unlock(struct bpf_map *map, void *map_value)
+{
+ WARN_ON_ONCE(map->spin_lock_off < 0);
+ __bpf_spin_unlock_irqrestore(map_value + map->spin_lock_off);
+ preempt_enable();
+}
+
BPF_CALL_0(bpf_jiffies64)
{
return get_jiffies_64();
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 1af9a7cba08c..f1e244b03382 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -600,6 +600,13 @@ static void bpf_free_local_kptr(const struct btf *btf, u32 btf_id, void *kptr)
if (!kptr)
return;
+ /* There is no requirement to lock the bpf_spin_lock protecting
+ * bpf_list_head in local kptr, as these are single ownership,
+ * so if we have access to the kptr through xchg, we own it.
+ *
+ * If iterating elements of bpf_list_head in map value we are
+ * already holding the lock for it.
+ */
/* We must free bpf_list_head in local kptr */
t = btf_type_by_id(btf, btf_id);
/* TODO: We should just populate this info once in struct btf, and then
@@ -697,6 +704,38 @@ bool bpf_map_equal_list_head_off_tab(const struct bpf_map *map_a, const struct b
map_value_has_list_heads(map_b));
}
+void bpf_map_free_list_heads(struct bpf_map *map, void *map_value)
+{
+ struct bpf_map_value_off *tab = map->list_head_off_tab;
+ int i;
+
+ /* TODO: Should we error when bpf_list_head is alone in map value,
+ * during BTF parsing, instead of ignoring it?
+ */
+ if (map->spin_lock_off < 0)
+ return;
+
+ bpf_map_value_lock(map, map_value);
+ for (i = 0; i < tab->nr_off; i++) {
+ struct bpf_map_value_off_desc *off_desc = &tab->off[i];
+ struct list_head *list, *olist;
+ void *entry;
+
+ olist = list = map_value + off_desc->offset;
+ list = list->next;
+ if (!list)
+ goto init;
+ while (list != olist) {
+ entry = list - off_desc->list_head.list_node_off;
+ list = list->next;
+ bpf_free_local_kptr(off_desc->list_head.btf, off_desc->list_head.value_type_id, entry);
+ }
+ init:
+ INIT_LIST_HEAD(olist);
+ }
+ bpf_map_value_unlock(map, map_value);
+}
+
/* called from workqueue */
static void bpf_map_free_deferred(struct work_struct *work)
{
--
2.34.1
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH RFC bpf-next v1 27/32] bpf: Add destructor for bpf_list_head in local kptr
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (25 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 26/32] bpf: Wire up freeing of bpf_list_heads in maps Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 28/32] bpf: Remove duplicate PTR_TO_BTF_ID RO check Kumar Kartikeya Dwivedi
` (4 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Refactor bpf_free_local_kptr's bpf_list_head handling logic to introduce
a destructor for the bpf_list_head inside local kptrs. The first
argument is a pointer to the bpf_list_head inside the local kptr, while
the second argument is the offset of the bpf_list_node in the list
head's value type.
It would be possible to take only one argument and pass the offset as a
'hidden' argument from the verifier side, but unlike helpers, which
always take 5 arguments at the C ABI level, kfuncs are more strongly
checked against their prototype in kernel BTF. So hidden arguments are
more work to support.
Secondly, it would again require rewriting arguments using
bpf_patch_insn_data, which is expensive and slow. Hence, just force the
user to pass the offset, but check from the verifier side that it is the
right one, which turns out to be much easier.
Of course, this is a little inconvenient; we can explore improving it
later.
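As an illustration, the expected call pattern from BPF C looks roughly
like this (a sketch; the BTF annotations tying the bpf_list_head to its
element type are omitted, and all type and field names are made up):

	struct elem {
		int data;
		struct bpf_list_node node;
	};

	struct foo {
		struct bpf_spin_lock lock;
		struct bpf_list_head head; /* holds struct elem entries */
	};

	/* f is a local kptr of type struct foo owned by the program */
	bpf_list_head_fini(&f->head, offsetof(struct elem, node));
	bpf_kptr_free(f);

The second argument must be a known constant equal to the bpf_list_node
offset in the element type; the verifier compares it against the offset
it derived from BTF and rejects the program otherwise.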
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 1 +
kernel/bpf/helpers.c | 6 +++
kernel/bpf/syscall.c | 39 ++++++++++++-------
kernel/bpf/verifier.c | 26 ++++++++++++-
.../testing/selftests/bpf/bpf_experimental.h | 11 ++++++
5 files changed, 69 insertions(+), 14 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ad18408ba442..9279e453528c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1733,6 +1733,7 @@ void bpf_map_free_kptr_off_tab(struct bpf_map *map);
struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
void bpf_map_free_kptrs(struct bpf_map *map, void *map_value);
+void bpf_free_local_kptr_list_head(struct list_head *list, u32 list_node_off);
struct bpf_map_value_off_desc *bpf_map_list_head_off_contains(struct bpf_map *map, u32 offset);
void bpf_map_free_list_head_off_tab(struct bpf_map *map);
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 832dd57ae608..030c35bf030d 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1800,6 +1800,11 @@ struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
return (struct bpf_list_node *)node;
}
+void bpf_list_head_fini(struct bpf_list_head *head__dlkptr, u64 node_off__k)
+{
+ bpf_free_local_kptr_list_head((struct list_head *)head__dlkptr, node_off__k);
+}
+
__diag_pop();
BTF_SET8_START(tracing_btf_ids)
@@ -1816,6 +1821,7 @@ BTF_ID_FLAGS(func, bpf_list_add_tail)
BTF_ID_FLAGS(func, bpf_list_del)
BTF_ID_FLAGS(func, bpf_list_pop_front, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
BTF_ID_FLAGS(func, bpf_list_pop_back, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
+BTF_ID_FLAGS(func, bpf_list_head_fini)
BTF_SET8_END(tracing_btf_ids)
static const struct btf_kfunc_id_set tracing_kfunc_set = {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f1e244b03382..feaf4351345b 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -590,12 +590,31 @@ bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_ma
map_value_has_kptrs(map_b));
}
+void bpf_free_local_kptr_list_head(struct list_head *list, u32 list_node_off)
+{
+ struct list_head *olist;
+ void *entry;
+
+ /* List elements for bpf_list_head in local kptr cannot have
+ * bpf_list_head again. Hence, just iterate and kfree them.
+ */
+ olist = list;
+ list = list->next;
+ if (!list)
+ goto init;
+ while (list != olist) {
+ entry = list - list_node_off;
+ list = list->next;
+ kfree(entry);
+ }
+init:
+ INIT_LIST_HEAD(olist);
+}
+
static void bpf_free_local_kptr(const struct btf *btf, u32 btf_id, void *kptr)
{
- struct list_head *list, *olist;
- u32 offset, list_node_off;
+ u32 list_head_off, list_node_off;
const struct btf_type *t;
- void *entry;
int ret;
if (!kptr)
@@ -613,19 +632,13 @@ static void bpf_free_local_kptr(const struct btf *btf, u32 btf_id, void *kptr)
* do quick lookups into it. Instead of offset, table would be keyed by
* btf_id.
*/
- ret = __btf_local_type_has_bpf_list_head(btf, t, &offset, NULL, &list_node_off);
+ ret = __btf_local_type_has_bpf_list_head(btf, t, &list_head_off, NULL, &list_node_off);
if (ret <= 0)
goto free_kptr;
/* List elements for bpf_list_head in local kptr cannot have
- * bpf_list_head again. Hence, just iterate and kfree them.
- */
- olist = list = kptr + offset;
- list = list->next;
- while (list != olist) {
- entry = list - list_node_off;
- list = list->next;
- kfree(entry);
- }
+ * bpf_list_head again. Hence, just iterate and kfree them.
+ */
+ bpf_free_local_kptr_list_head(kptr + list_head_off, list_node_off);
free_kptr:
kfree(kptr);
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d2c4ffc80f4d..b795fe9a88da 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7911,6 +7911,7 @@ BTF_ID(func, bpf_list_add_tail)
BTF_ID(func, bpf_list_del)
BTF_ID(func, bpf_list_pop_front)
BTF_ID(func, bpf_list_pop_back)
+BTF_ID(func, bpf_list_head_fini)
BTF_ID(struct, btf) /* empty entry */
enum bpf_special_kfuncs {
@@ -7924,6 +7925,7 @@ enum bpf_special_kfuncs {
KF_SPECIAL_bpf_list_del,
KF_SPECIAL_bpf_list_pop_front,
KF_SPECIAL_bpf_list_pop_back,
+ KF_SPECIAL_bpf_list_head_fini,
KF_SPECIAL_bpf_empty,
KF_SPECIAL_MAX = KF_SPECIAL_bpf_empty,
};
@@ -8156,7 +8158,7 @@ static int find_local_type_fields(const struct btf *btf, u32 btf_id, struct loca
FILL_LOCAL_TYPE_FIELD(bpf_list_node, bpf_list_node_init, bpf_empty, false);
FILL_LOCAL_TYPE_FIELD(bpf_spin_lock, bpf_spin_lock_init, bpf_empty, false);
- FILL_LOCAL_TYPE_FIELD(bpf_list_head, bpf_list_head_init, bpf_empty, true);
+ FILL_LOCAL_TYPE_FIELD(bpf_list_head, bpf_list_head_init, bpf_list_head_fini, true);
#undef FILL_LOCAL_TYPE_FIELD
@@ -8391,6 +8393,19 @@ process_kf_arg_destructing_local_kptr(struct bpf_verifier_env *env,
if (mark_dtor)
ireg->type |= OBJ_DESTRUCTING;
}));
+
+ /* Stash the list_node offset in value type of the
+ * bpf_list_head, so that offset of node in next argument can be
+ * checked for bpf_list_head_fini.
+ */
+ if (fields[i].type == FIELD_bpf_list_head) {
+ ret = __btf_local_type_has_bpf_list_head(reg->btf, btf_type_by_id(reg->btf, reg->btf_id),
+ NULL, NULL, &meta->list_node.off);
+ if (ret <= 0) {
+ verbose(env, "verifier internal error: bpf_list_head not found\n");
+ return -EFAULT;
+ }
+ }
return 0;
}
verbose(env, "no destructible field at offset: %d\n", reg->off);
@@ -8875,6 +8890,15 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_arg_m
return -EINVAL;
}
+ /* Special semantic checks for some functions */
+ if (is_kfunc_special(meta->btf, meta->func_id, bpf_list_head_fini)) {
+ if (!meta->arg_constant.found || meta->list_node.off != meta->arg_constant.value) {
+ verbose(env, "arg#1 to bpf_list_head_fini must be constant %d\n",
+ meta->list_node.off);
+ return -EINVAL;
+ }
+ }
+
return 0;
}
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index a8f7a5af8ee3..60fe48df4f68 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -102,4 +102,15 @@ struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ksym;
*/
struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym;
+/* Description
+ * Destroy the bpf_list_head field in a local kptr. This kfunc has destructor
+ * semantics, and marks the local kptr as destructing if it isn't already.
+ *
+ * Note that value_node_offset is the offset of the bpf_list_node inside the
+ * value type of the local kptr's bpf_list_head. It must be a known constant.
+ * Returns
+ * Void.
+ */
+void bpf_list_head_fini(struct bpf_list_head *node, u64 value_node_offset) __ksym;
+
#endif
--
2.34.1
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH RFC bpf-next v1 28/32] bpf: Remove duplicate PTR_TO_BTF_ID RO check
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (26 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 27/32] bpf: Add destructor for bpf_list_head in local kptr Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 29/32] libbpf: Add support for private BSS map section Kumar Kartikeya Dwivedi
` (3 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Daniel Xu, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Dave Marchevsky, Delyan Kratunov
From: Daniel Xu <dxu@dxuuu.xyz>
Since commit 27ae7997a661 ("bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS")
there has existed bpf_verifier_ops:btf_struct_access. When
btf_struct_access is _unset_ for a prog type, the verifier runs the
default implementation, which is to enforce read only:
if (env->ops->btf_struct_access) {
[...]
} else {
if (atype != BPF_READ) {
verbose(env, "only read is supported\n");
return -EACCES;
}
[...]
}
When btf_struct_access is _set_, the expectation is that
btf_struct_access has full control over accesses, including if writes
are allowed.
Rather than carve out an exception for each prog type that may write to
BTF ptrs, delete the redundant check and give full control to
btf_struct_access.
[
Kartikeya: We also need to remove this check, as we are enabling
writes to local kptrs, which are a special type of PTR_TO_BTF_ID
pointing to a btf_id in program BTF.
Note that PROBE_MEM conversions are only needed when such a local
kptr is marked with PTR_UNTRUSTED.
There are two cases when that is so. One is when a node is marked for
expiry at the end of the critical section; it is marked as PTR_UNTRUSTED
but with a non-zero ref_obj_id. This means that writing to it is still
permitted, as is reading, and technically the PROBE_MEM load
conversion is not needed. It is just used to prevent passing this
local kptr elsewhere.
The second case is loading a referenced local kptr from a map. In this
case the pointer may well be invalid by the time we access it. Hence,
writing to it is disallowed but reading isn't. Here, the PROBE_MEM
conversion is crucial.
We could discern between the ref_obj_id set vs unset cases, but for now
that is left out of the current series.
]
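A rough illustration of the second case, using the lref_ptr field added to
the map_kptr selftest later in this series:

  struct foo *p = v->lref_ptr;  /* referenced local kptr loaded from a map */

  if (p && p->data > 100)       /* reads are fine; the load becomes a PROBE_MEM load */
          return;
  /* p->data = 0; would be rejected here: writes through this untrusted
   * pointer are disallowed.
   */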
Cc: Martin KaFai Lau <kafai@fb.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
[ Kartikeya: Expanded commit message ]
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/verifier.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index b795fe9a88da..2897f780e8be 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14889,9 +14889,6 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
insn->code = BPF_LDX | BPF_PROBE_MEM |
BPF_SIZE((insn)->code);
env->prog->aux->num_exentries++;
- } else if (resolve_prog_type(env->prog) != BPF_PROG_TYPE_STRUCT_OPS) {
- verbose(env, "Writes through BTF pointers are not allowed\n");
- return -EINVAL;
}
continue;
default:
--
2.34.1
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH RFC bpf-next v1 29/32] libbpf: Add support for private BSS map section
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (27 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 28/32] bpf: Remove duplicate PTR_TO_BTF_ID RO check Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 30/32] selftests/bpf: Add BTF tag macros for local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (2 subsequent siblings)
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Dave Marchevsky, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Delyan Kratunov
From: Dave Marchevsky <davemarchevsky@fb.com>
Currently libbpf does not allow declaration of a struct bpf_spin_lock in
global scope. Attempting to do so results in a "failed to re-mmap" error,
as the .bss arraymap containing the spinlock is not allowed to be mmap'd.
This patch adds support for a .bss.private section. The maps contained
in this section will not be mmap'd into userspace by libbpf, nor will
they be exposed via the bpftool-generated skeleton.
The intent here is to allow a more natural programming pattern for
global-scope spinlocks, which will be used by the rbtree locking mechanism
in further patches in this series.
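For example, the linked list selftests later in this series declare their
global lock and list heads as:

  struct bpf_spin_lock glock SEC(".bss.private");
  struct bpf_list_head ghead __contains(struct, foo, node) SEC(".bss.private");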
Notes:
* Initially I called the section .bss.no_mmap, but the broader
'private' term better indicates that skeleton shouldn't expose these
maps at all, IMO.
* bpftool/gen.c's is_internal_mmapable_map function checks whether the
map flags have BPF_F_MMAPABLE, so no bpftool changes were necessary
to remove .bss.private maps from skeleton
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
tools/lib/bpf/libbpf.c | 65 ++++++++++++++++++++++++++++--------------
1 file changed, 44 insertions(+), 21 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 3ad139285fad..17989dd49179 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -464,6 +464,7 @@ struct bpf_struct_ops {
#define KCONFIG_SEC ".kconfig"
#define KSYMS_SEC ".ksyms"
#define STRUCT_OPS_SEC ".struct_ops"
+#define BSS_SEC_PRIVATE ".bss.private"
enum libbpf_map_type {
LIBBPF_MAP_UNSPEC,
@@ -577,6 +578,7 @@ enum sec_type {
SEC_BSS,
SEC_DATA,
SEC_RODATA,
+ SEC_BSS_PRIVATE,
};
struct elf_sec_desc {
@@ -1581,7 +1583,8 @@ bpf_map_find_btf_info(struct bpf_object *obj, struct bpf_map *map);
static int
bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
- const char *real_name, int sec_idx, void *data, size_t data_sz)
+ const char *real_name, int sec_idx, void *data,
+ size_t data_sz, bool do_mmap)
{
struct bpf_map_def *def;
struct bpf_map *map;
@@ -1609,27 +1612,31 @@ bpf_object__init_internal_map(struct bpf_object *obj, enum libbpf_map_type type,
def->max_entries = 1;
def->map_flags = type == LIBBPF_MAP_RODATA || type == LIBBPF_MAP_KCONFIG
? BPF_F_RDONLY_PROG : 0;
- def->map_flags |= BPF_F_MMAPABLE;
+ if (do_mmap)
+ def->map_flags |= BPF_F_MMAPABLE;
pr_debug("map '%s' (global data): at sec_idx %d, offset %zu, flags %x.\n",
map->name, map->sec_idx, map->sec_offset, def->map_flags);
- map->mmaped = mmap(NULL, bpf_map_mmap_sz(map), PROT_READ | PROT_WRITE,
- MAP_SHARED | MAP_ANONYMOUS, -1, 0);
- if (map->mmaped == MAP_FAILED) {
- err = -errno;
- map->mmaped = NULL;
- pr_warn("failed to alloc map '%s' content buffer: %d\n",
- map->name, err);
- zfree(&map->real_name);
- zfree(&map->name);
- return err;
+ map->mmaped = NULL;
+ if (do_mmap) {
+ map->mmaped = mmap(NULL, bpf_map_mmap_sz(map), PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ if (map->mmaped == MAP_FAILED) {
+ err = -errno;
+ map->mmaped = NULL;
+ pr_warn("failed to alloc map '%s' content buffer: %d\n",
+ map->name, err);
+ zfree(&map->real_name);
+ zfree(&map->name);
+ return err;
+ }
}
/* failures are fine because of maps like .rodata.str1.1 */
(void) bpf_map_find_btf_info(obj, map);
- if (data)
+ if (do_mmap && data)
memcpy(map->mmaped, data, data_sz);
pr_debug("map %td is \"%s\"\n", map - obj->maps, map->name);
@@ -1641,12 +1648,14 @@ static int bpf_object__init_global_data_maps(struct bpf_object *obj)
struct elf_sec_desc *sec_desc;
const char *sec_name;
int err = 0, sec_idx;
+ bool do_mmap;
/*
* Populate obj->maps with libbpf internal maps.
*/
for (sec_idx = 1; sec_idx < obj->efile.sec_cnt; sec_idx++) {
sec_desc = &obj->efile.secs[sec_idx];
+ do_mmap = true;
/* Skip recognized sections with size 0. */
if (!sec_desc->data || sec_desc->data->d_size == 0)
@@ -1658,7 +1667,8 @@ static int bpf_object__init_global_data_maps(struct bpf_object *obj)
err = bpf_object__init_internal_map(obj, LIBBPF_MAP_DATA,
sec_name, sec_idx,
sec_desc->data->d_buf,
- sec_desc->data->d_size);
+ sec_desc->data->d_size,
+ do_mmap);
break;
case SEC_RODATA:
obj->has_rodata = true;
@@ -1666,14 +1676,18 @@ static int bpf_object__init_global_data_maps(struct bpf_object *obj)
err = bpf_object__init_internal_map(obj, LIBBPF_MAP_RODATA,
sec_name, sec_idx,
sec_desc->data->d_buf,
- sec_desc->data->d_size);
+ sec_desc->data->d_size,
+ do_mmap);
break;
+ case SEC_BSS_PRIVATE:
+ do_mmap = false;
case SEC_BSS:
sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, sec_idx));
err = bpf_object__init_internal_map(obj, LIBBPF_MAP_BSS,
sec_name, sec_idx,
NULL,
- sec_desc->data->d_size);
+ sec_desc->data->d_size,
+ do_mmap);
break;
default:
/* skip */
@@ -1987,7 +2001,7 @@ static int bpf_object__init_kconfig_map(struct bpf_object *obj)
map_sz = last_ext->kcfg.data_off + last_ext->kcfg.sz;
err = bpf_object__init_internal_map(obj, LIBBPF_MAP_KCONFIG,
".kconfig", obj->efile.symbols_shndx,
- NULL, map_sz);
+ NULL, map_sz, true);
if (err)
return err;
@@ -3431,6 +3445,10 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
sec_desc->sec_type = SEC_BSS;
sec_desc->shdr = sh;
sec_desc->data = data;
+ } else if (sh->sh_type == SHT_NOBITS && strcmp(name, BSS_SEC_PRIVATE) == 0) {
+ sec_desc->sec_type = SEC_BSS_PRIVATE;
+ sec_desc->shdr = sh;
+ sec_desc->data = data;
} else {
pr_info("elf: skipping section(%d) %s (size %zu)\n", idx, name,
(size_t)sh->sh_size);
@@ -3893,6 +3911,7 @@ static bool bpf_object__shndx_is_data(const struct bpf_object *obj,
case SEC_BSS:
case SEC_DATA:
case SEC_RODATA:
+ case SEC_BSS_PRIVATE:
return true;
default:
return false;
@@ -3912,6 +3931,7 @@ bpf_object__section_to_libbpf_map_type(const struct bpf_object *obj, int shndx)
return LIBBPF_MAP_KCONFIG;
switch (obj->efile.secs[shndx].sec_type) {
+ case SEC_BSS_PRIVATE:
case SEC_BSS:
return LIBBPF_MAP_BSS;
case SEC_DATA:
@@ -4901,16 +4921,19 @@ bpf_object__populate_internal_map(struct bpf_object *obj, struct bpf_map *map)
{
enum libbpf_map_type map_type = map->libbpf_type;
char *cp, errmsg[STRERR_BUFSIZE];
- int err, zero = 0;
+ int err = 0, zero = 0;
if (obj->gen_loader) {
- bpf_gen__map_update_elem(obj->gen_loader, map - obj->maps,
- map->mmaped, map->def.value_size);
+ if (map->mmaped)
+ bpf_gen__map_update_elem(obj->gen_loader, map - obj->maps,
+ map->mmaped, map->def.value_size);
if (map_type == LIBBPF_MAP_RODATA || map_type == LIBBPF_MAP_KCONFIG)
bpf_gen__map_freeze(obj->gen_loader, map - obj->maps);
return 0;
}
- err = bpf_map_update_elem(map->fd, &zero, map->mmaped, 0);
+
+ if (map->mmaped)
+ err = bpf_map_update_elem(map->fd, &zero, map->mmaped, 0);
if (err) {
err = -errno;
cp = libbpf_strerror_r(err, errmsg, sizeof(errmsg));
--
2.34.1
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH RFC bpf-next v1 30/32] selftests/bpf: Add BTF tag macros for local kptrs, BPF linked lists
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (28 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 29/32] libbpf: Add support for private BSS map section Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 31/32] selftests/bpf: Add BPF linked list API tests Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 32/32] selftests/bpf: Add referenced local kptr tests Kumar Kartikeya Dwivedi
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Since this is an experimental API, none of these tags are exposed in
libbpf; they are hidden in the bpf_experimental.h header in the selftests
directory. Once enough field experience has been gained with this API,
it can graduate to stable and be baked into UAPI.
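A short sketch of how these tags end up being used (type and field names
mirror the selftests in the following patches; illustrative only):

  /* __kernel marks kernel-defined types embedded in program BTF,
   * __contains ties a bpf_list_head to its value type and list_node member:
   */
  struct foo {
          struct bpf_list_node node __kernel;
          int data;
  };

  struct map_value {
          struct bpf_list_head head __contains(struct, foo, node);
          /* __local marks a referenced kptr pointing to a type in program BTF */
          struct foo __kptr_ref __local *lref_ptr;
  };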
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
tools/testing/selftests/bpf/bpf_experimental.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 60fe48df4f68..21f12c510db4 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -3,6 +3,10 @@
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_helpers.h>
+#define __contains(kind, name, node) __attribute__((btf_decl_tag("contains:" #kind ":" #name ":" #node)))
+#define __kernel __attribute__((btf_decl_tag("kernel")))
+#define __local __attribute__((btf_type_tag("local")))
+
#else
struct bpf_list_head {
--
2.34.1
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH RFC bpf-next v1 31/32] selftests/bpf: Add BPF linked list API tests
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (29 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 30/32] selftests/bpf: Add BTF tag macros for local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 32/32] selftests/bpf: Add referenced local kptr tests Kumar Kartikeya Dwivedi
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Include various tests covering the success and failure cases. Also, run
the success cases at runtime to verify correctness of linked list
manipulation routines, in addition to ensuring successful verification.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/helpers.c | 5 +-
.../selftests/bpf/prog_tests/linked_list.c | 88 +++++
.../testing/selftests/bpf/progs/linked_list.c | 347 ++++++++++++++++++
3 files changed, 439 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/linked_list.c
create mode 100644 tools/testing/selftests/bpf/progs/linked_list.c
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 030c35bf030d..928466f83cca 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1831,7 +1831,10 @@ static const struct btf_kfunc_id_set tracing_kfunc_set = {
static int __init kfunc_init(void)
{
- return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &tracing_kfunc_set);
+ int ret;
+
+ ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &tracing_kfunc_set);
+ return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &tracing_kfunc_set);
}
late_initcall(kfunc_init);
diff --git a/tools/testing/selftests/bpf/prog_tests/linked_list.c b/tools/testing/selftests/bpf/prog_tests/linked_list.c
new file mode 100644
index 000000000000..2dc695fb05b3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/linked_list.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+
+#define __KERNEL__
+#include "bpf_experimental.h"
+#undef __KERNEL__
+
+#include "linked_list.skel.h"
+
+static char log_buf[1024 * 1024];
+
+static struct {
+ const char *prog_name;
+ const char *err_msg;
+} linked_list_fail_tests = {
+};
+
+static void test_linked_list_success(void)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts,
+ .data_in = &pkt_v4,
+ .data_size_in = sizeof(pkt_v4),
+ .repeat = 1,
+ );
+ struct linked_list *skel;
+ int key = 0, ret;
+ char buf[32];
+
+ (void)log_buf;
+ (void)&linked_list_fail_tests;
+
+ skel = linked_list__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "linked_list__open_and_load"))
+ return;
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.map_list_push_pop), &opts);
+ ASSERT_OK(ret, "map_list_push_pop");
+ ASSERT_OK(opts.retval, "map_list_push_pop retval");
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.global_list_push_pop), &opts);
+ ASSERT_OK(ret, "global_list_push_pop");
+ ASSERT_OK(opts.retval, "global_list_push_pop retval");
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.global_list_push_pop_unclean), &opts);
+ ASSERT_OK(ret, "global_list_push_pop_unclean");
+ ASSERT_OK(opts.retval, "global_list_push_pop_unclean retval");
+
+ ASSERT_OK(bpf_map_update_elem(bpf_map__fd(skel->maps.bss_private), &key, buf, 0),
+ "check_and_free_fields");
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.map_list_push_pop_multiple), &opts);
+ ASSERT_OK(ret, "map_list_push_pop_multiple");
+ ASSERT_OK(opts.retval, "map_list_push_pop_multiple retval");
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.global_list_push_pop_multiple), &opts);
+ ASSERT_OK(ret, "global_list_push_pop_multiple");
+ ASSERT_OK(opts.retval, "global_list_push_pop_multiple retval");
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.global_list_push_pop_multiple_unclean), &opts);
+ ASSERT_OK(ret, "global_list_push_pop_multiple_unclean");
+ ASSERT_OK(opts.retval, "global_list_push_pop_multiple_unclean retval");
+
+ ASSERT_OK(bpf_map_update_elem(bpf_map__fd(skel->maps.bss_private), &key, buf, 0),
+ "check_and_free_fields");
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.map_list_in_list), &opts);
+ ASSERT_OK(ret, "map_list_in_list");
+ ASSERT_OK(opts.retval, "map_list_in_list retval");
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.global_list_in_list), &opts);
+ ASSERT_OK(ret, "global_list_in_list");
+ ASSERT_OK(opts.retval, "global_list_in_list retval");
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.global_list_in_list_unclean), &opts);
+ ASSERT_OK(ret, "global_list_in_list_unclean");
+ ASSERT_OK(opts.retval, "global_list_in_list_unclean retval");
+
+ ASSERT_OK(bpf_map_update_elem(bpf_map__fd(skel->maps.bss_private), &key, buf, 0),
+ "check_and_free_fields");
+
+ linked_list__destroy(skel);
+}
+
+void test_linked_list(void)
+{
+ test_linked_list_success();
+}
diff --git a/tools/testing/selftests/bpf/progs/linked_list.c b/tools/testing/selftests/bpf/progs/linked_list.c
new file mode 100644
index 000000000000..0fff427a23a5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/linked_list.c
@@ -0,0 +1,347 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+struct bar {
+ struct bpf_list_node node __kernel;
+ int data;
+};
+
+struct foo {
+ struct bpf_list_node node __kernel;
+ struct bpf_list_head head __kernel __contains(struct, bar, node);
+ struct bpf_spin_lock lock __kernel;
+ int data;
+};
+
+struct map_value {
+ struct bpf_list_head head __contains(struct, foo, node);
+ struct bpf_spin_lock lock;
+ int data;
+};
+
+struct array_map {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct map_value);
+ __uint(max_entries, 1);
+} array_map SEC(".maps");
+
+struct bpf_spin_lock glock SEC(".bss.private");
+struct bpf_list_head ghead __contains(struct, foo, node) SEC(".bss.private");
+struct bpf_list_head gghead __contains(struct, foo, node) SEC(".bss.private");
+
+static struct foo *foo_alloc(void)
+{
+ struct foo *f;
+
+ f = bpf_kptr_alloc(bpf_core_type_id_local(struct foo), 0);
+ if (!f)
+ return NULL;
+ bpf_list_node_init(&f->node);
+ bpf_list_head_init(&f->head);
+ bpf_spin_lock_init(&f->lock);
+ return f;
+}
+
+static void foo_free(struct foo *f)
+{
+ if (!f)
+ return;
+ bpf_list_head_fini(&f->head, offsetof(struct bar, node));
+ bpf_kptr_free(f);
+}
+
+static __always_inline int list_push_pop(void *lock, void *head, bool leave_in_map)
+{
+ struct bpf_list_node *n;
+ struct foo *f;
+
+ f = foo_alloc();
+ if (!f)
+ return 2;
+
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_front(head);
+ bpf_spin_unlock(lock);
+ if (n) {
+ foo_free(container_of(n, struct foo, node));
+ foo_free(f);
+ return 3;
+ }
+
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_back(head);
+ bpf_spin_unlock(lock);
+ if (n) {
+ foo_free(container_of(n, struct foo, node));
+ foo_free(f);
+ return 4;
+ }
+
+
+ bpf_spin_lock(lock);
+ bpf_list_add(&f->node, head);
+ f->data = 42;
+ bpf_spin_unlock(lock);
+ if (leave_in_map)
+ return 0;
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_back(head);
+ bpf_spin_unlock(lock);
+ if (!n)
+ return 5;
+ f = container_of(n, struct foo, node);
+ if (f->data != 42) {
+ foo_free(f);
+ return 6;
+ }
+
+ bpf_spin_lock(lock);
+ bpf_list_add(&f->node, head);
+ f->data = 13;
+ bpf_spin_unlock(lock);
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_front(head);
+ bpf_spin_unlock(lock);
+ if (!n)
+ return 7;
+ f = container_of(n, struct foo, node);
+ if (f->data != 13) {
+ foo_free(f);
+ return 8;
+ }
+ foo_free(f);
+
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_front(head);
+ bpf_spin_unlock(lock);
+ if (n) {
+ foo_free(container_of(n, struct foo, node));
+ return 9;
+ }
+
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_back(head);
+ bpf_spin_unlock(lock);
+ if (n) {
+ foo_free(container_of(n, struct foo, node));
+ return 10;
+ }
+ return 0;
+}
+
+
+static __always_inline int list_push_pop_multiple(void *lock, void *head, bool leave_in_map)
+{
+ struct bpf_list_node *n;
+ struct foo *f[8], *pf;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(f); i++) {
+ f[i] = foo_alloc();
+ if (!f[i])
+ return 2;
+ f[i]->data = i;
+ bpf_spin_lock(lock);
+ bpf_list_add(&f[i]->node, head);
+ bpf_spin_unlock(lock);
+ }
+
+ for (i = 0; i < ARRAY_SIZE(f); i++) {
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_front(head);
+ bpf_spin_unlock(lock);
+ if (!n)
+ return 3;
+ pf = container_of(n, struct foo, node);
+ if (pf->data != (ARRAY_SIZE(f) - i - 1)) {
+ foo_free(pf);
+ return 4;
+ }
+ bpf_spin_lock(lock);
+ bpf_list_add_tail(&pf->node, head);
+ bpf_spin_unlock(lock);
+ }
+
+ if (leave_in_map)
+ return 0;
+
+ for (i = 0; i < ARRAY_SIZE(f); i++) {
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_back(head);
+ bpf_spin_unlock(lock);
+ if (!n)
+ return 5;
+ pf = container_of(n, struct foo, node);
+ if (pf->data != i) {
+ foo_free(pf);
+ return 6;
+ }
+ foo_free(pf);
+ }
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_back(head);
+ bpf_spin_unlock(lock);
+ if (n) {
+ foo_free(container_of(n, struct foo, node));
+ return 7;
+ }
+
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_front(head);
+ bpf_spin_unlock(lock);
+ if (n) {
+ foo_free(container_of(n, struct foo, node));
+ return 8;
+ }
+ return 0;
+}
+
+static __always_inline int list_in_list(void *lock, void *head, bool leave_in_map)
+{
+ struct bpf_list_node *n;
+ struct bar *ba[8], *b;
+ struct foo *f;
+ int i;
+
+ f = foo_alloc();
+ if (!f)
+ return 2;
+ for (i = 0; i < ARRAY_SIZE(ba); i++) {
+ b = bpf_kptr_alloc(bpf_core_type_id_local(struct bar), 0);
+ if (!b) {
+ foo_free(f);
+ return 3;
+ }
+ bpf_list_node_init(&b->node);
+ b->data = i;
+ bpf_spin_lock(&f->lock);
+ bpf_list_add_tail(&b->node, &f->head);
+ bpf_spin_unlock(&f->lock);
+ }
+
+ bpf_spin_lock(lock);
+ bpf_list_add(&f->node, head);
+ f->data = 42;
+ bpf_spin_unlock(lock);
+
+ if (leave_in_map)
+ return 0;
+
+ bpf_spin_lock(lock);
+ n = bpf_list_pop_front(head);
+ bpf_spin_unlock(lock);
+ if (!n)
+ return 4;
+ f = container_of(n, struct foo, node);
+ if (f->data != 42) {
+ foo_free(f);
+ return 5;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(ba); i++) {
+ bpf_spin_lock(&f->lock);
+ n = bpf_list_pop_front(&f->head);
+ bpf_spin_unlock(&f->lock);
+ if (!n) {
+ foo_free(f);
+ return 6;
+ }
+ b = container_of(n, struct bar, node);
+ if (b->data != i) {
+ foo_free(f);
+ bpf_kptr_free(b);
+ return 7;
+ }
+ bpf_kptr_free(b);
+ }
+ bpf_spin_lock(&f->lock);
+ n = bpf_list_pop_front(&f->head);
+ bpf_spin_unlock(&f->lock);
+ if (n) {
+ foo_free(f);
+ bpf_kptr_free(container_of(n, struct bar, node));
+ return 8;
+ }
+ foo_free(f);
+ return 0;
+}
+
+SEC("tc")
+int map_list_push_pop(void *ctx)
+{
+ struct map_value *v;
+
+ v = bpf_map_lookup_elem(&array_map, &(int){0});
+ if (!v)
+ return 1;
+ return list_push_pop(&v->lock, &v->head, false);
+}
+
+SEC("tc")
+int global_list_push_pop(void *ctx)
+{
+ return list_push_pop(&glock, &ghead, false);
+}
+
+SEC("tc")
+int global_list_push_pop_unclean(void *ctx)
+{
+ return list_push_pop(&glock, &gghead, true);
+}
+
+SEC("tc")
+int map_list_push_pop_multiple(void *ctx)
+{
+ struct map_value *v;
+
+ v = bpf_map_lookup_elem(&array_map, &(int){0});
+ if (!v)
+ return 1;
+ return list_push_pop_multiple(&v->lock, &v->head, false);
+}
+
+SEC("tc")
+int global_list_push_pop_multiple(void *ctx)
+{
+ return list_push_pop_multiple(&glock, &ghead, false);
+}
+
+SEC("tc")
+int global_list_push_pop_multiple_unclean(void *ctx)
+{
+ return list_push_pop_multiple(&glock, &gghead, true);
+}
+
+SEC("tc")
+int map_list_in_list(void *ctx)
+{
+ struct map_value *v;
+
+ v = bpf_map_lookup_elem(&array_map, &(int){0});
+ if (!v)
+ return 1;
+ return list_in_list(&v->lock, &v->head, false);
+}
+
+SEC("tc")
+int global_list_in_list(void *ctx)
+{
+ return list_in_list(&glock, &ghead, false);
+}
+
+SEC("tc")
+int global_list_in_list_unclean(void *ctx)
+{
+ return list_in_list(&glock, &gghead, true);
+}
+
+char _license[] SEC("license") = "GPL";
--
2.34.1
^ permalink raw reply related [flat|nested] 82+ messages in thread
* [PATCH RFC bpf-next v1 32/32] selftests/bpf: Add referenced local kptr tests
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
` (30 preceding siblings ...)
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 31/32] selftests/bpf: Add BPF linked list API tests Kumar Kartikeya Dwivedi
@ 2022-09-04 20:41 ` Kumar Kartikeya Dwivedi
31 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-04 20:41 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
Add some cases where success and failure at verification time are
tested.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
.../selftests/bpf/prog_tests/map_kptr.c | 2 +-
tools/testing/selftests/bpf/progs/map_kptr.c | 38 +++++++++++++++++++
2 files changed, 39 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/map_kptr.c b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
index fdcea7a61491..f2608a3e4e0d 100644
--- a/tools/testing/selftests/bpf/prog_tests/map_kptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/map_kptr.c
@@ -91,7 +91,7 @@ static void test_map_kptr_success(bool test_run)
);
struct map_kptr *skel;
int key = 0, ret;
- char buf[16];
+ char buf[24];
skel = map_kptr__open_and_load();
if (!ASSERT_OK_PTR(skel, "map_kptr__open_and_load"))
diff --git a/tools/testing/selftests/bpf/progs/map_kptr.c b/tools/testing/selftests/bpf/progs/map_kptr.c
index eb8217803493..30c981be008b 100644
--- a/tools/testing/selftests/bpf/progs/map_kptr.c
+++ b/tools/testing/selftests/bpf/progs/map_kptr.c
@@ -2,10 +2,17 @@
#include <vmlinux.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+
+struct foo {
+ int data;
+};
struct map_value {
struct prog_test_ref_kfunc __kptr *unref_ptr;
struct prog_test_ref_kfunc __kptr_ref *ref_ptr;
+ struct foo __kptr_ref __local *lref_ptr;
};
struct array_map {
@@ -130,11 +137,42 @@ static void test_kptr_get(struct map_value *v)
bpf_kfunc_call_test_release(p);
}
+static void test_local_kptr_ref(struct map_value *v)
+{
+ struct foo *p;
+
+ p = v->lref_ptr;
+ if (!p)
+ return;
+ if (p->data > 100)
+ return;
+ /* store NULL */
+ p = bpf_kptr_xchg(&v->lref_ptr, NULL);
+ if (!p)
+ return;
+ if (p->data > 100) {
+ p->data = 0;
+ bpf_kptr_free(p);
+ return;
+ }
+ bpf_kptr_free(p);
+
+ p = bpf_kptr_alloc(bpf_core_type_id_local(struct foo), 0);
+ if (!p)
+ return;
+ /* store ptr_ */
+ p = bpf_kptr_xchg(&v->lref_ptr, p);
+ if (!p)
+ return;
+ bpf_kptr_free(p);
+}
+
static void test_kptr(struct map_value *v)
{
test_kptr_unref(v);
test_kptr_ref(v);
test_kptr_get(v);
+ test_local_kptr_ref(v);
}
SEC("tc")
--
2.34.1
^ permalink raw reply related [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 05/32] bpf: Support kptrs in local storage maps
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 05/32] bpf: Support kptrs in local storage maps Kumar Kartikeya Dwivedi
@ 2022-09-07 19:00 ` Alexei Starovoitov
2022-09-08 2:47 ` Kumar Kartikeya Dwivedi
2022-09-09 5:27 ` Martin KaFai Lau
1 sibling, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-07 19:00 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Martin KaFai Lau, KP Singh, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Dave Marchevsky,
Delyan Kratunov
On Sun, Sep 04, 2022 at 10:41:18PM +0200, Kumar Kartikeya Dwivedi wrote:
> Enable support for kptrs in local storage maps by wiring up the freeing
> of these kptrs from map value.
>
> Cc: Martin KaFai Lau <kafai@fb.com>
> Cc: KP Singh <kpsingh@kernel.org>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/linux/bpf_local_storage.h | 2 +-
> kernel/bpf/bpf_local_storage.c | 33 +++++++++++++++++++++++++++----
> kernel/bpf/syscall.c | 5 ++++-
> kernel/bpf/verifier.c | 9 ++++++---
> 4 files changed, 40 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 7ea18d4da84b..6786d00f004e 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -74,7 +74,7 @@ struct bpf_local_storage_elem {
> struct hlist_node snode; /* Linked to bpf_local_storage */
> struct bpf_local_storage __rcu *local_storage;
> struct rcu_head rcu;
> - /* 8 bytes hole */
> + struct bpf_map *map; /* Only set for bpf_selem_free_rcu */
> /* The data is stored in another cacheline to minimize
> * the number of cachelines access during a cache hit.
> */
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 802fc15b0d73..4a725379d761 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -74,7 +74,8 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> gfp_flags | __GFP_NOWARN);
> if (selem) {
> if (value)
> - memcpy(SDATA(selem)->data, value, smap->map.value_size);
> + copy_map_value(&smap->map, SDATA(selem)->data, value);
> + /* No call to check_and_init_map_value as memory is zero init */
> return selem;
> }
>
> @@ -92,12 +93,27 @@ void bpf_local_storage_free_rcu(struct rcu_head *rcu)
> kfree_rcu(local_storage, rcu);
> }
>
> +static void check_and_free_fields(struct bpf_local_storage_elem *selem)
> +{
> + if (map_value_has_kptrs(selem->map))
> + bpf_map_free_kptrs(selem->map, SDATA(selem));
> +}
> +
> static void bpf_selem_free_rcu(struct rcu_head *rcu)
> {
> struct bpf_local_storage_elem *selem;
>
> selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> - kfree_rcu(selem, rcu);
> + check_and_free_fields(selem);
> + kfree(selem);
> +}
> +
> +static void bpf_selem_free_tasks_trace_rcu(struct rcu_head *rcu)
> +{
> + struct bpf_local_storage_elem *selem;
> +
> + selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> + call_rcu(&selem->rcu, bpf_selem_free_rcu);
> }
>
> /* local_storage->lock must be held and selem->local_storage == local_storage.
> @@ -150,10 +166,11 @@ bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
> SDATA(selem))
> RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
>
> + selem->map = &smap->map;
> if (use_trace_rcu)
> - call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
> + call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
> else
> - kfree_rcu(selem, rcu);
> + call_rcu(&selem->rcu, bpf_selem_free_rcu);
>
> return free_local_storage;
> }
> @@ -581,6 +598,14 @@ void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
> */
> synchronize_rcu();
>
> + /* When local storage map has kptrs, the call_rcu callback accesses
> + * kptr_off_tab, hence we need the bpf_selem_free_rcu callbacks to
> + * finish before we free it.
> + */
> + if (map_value_has_kptrs(&smap->map)) {
> + rcu_barrier();
> + bpf_map_free_kptr_off_tab(&smap->map);
probably needs conditional rcu_barrier_tasks_trace before rcu_barrier?
With or without it, there will be a significant delay in map freeing.
Maybe we should generalize the destroy_mem_alloc trick?
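i.e. roughly (sketch only; making the tasks-trace barrier conditional on
whether that grace period was actually used is left out here):

  if (map_value_has_kptrs(&smap->map)) {
          rcu_barrier_tasks_trace();  /* flush bpf_selem_free_tasks_trace_rcu callbacks */
          rcu_barrier();              /* then the chained bpf_selem_free_rcu callbacks */
          bpf_map_free_kptr_off_tab(&smap->map);
  }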
Patch 4 needs rebase. Applied patches 1-3.
The first 5 look great to me.
Pls follow up with kptr specific tests.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 12/32] bpf: Teach verifier about non-size constant arguments
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 12/32] bpf: Teach verifier about non-size constant arguments Kumar Kartikeya Dwivedi
@ 2022-09-07 22:11 ` Alexei Starovoitov
2022-09-08 2:49 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-07 22:11 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Sun, Sep 04, 2022 at 10:41:25PM +0200, Kumar Kartikeya Dwivedi wrote:
> Currently, the verifier has support for various arguments that either
> describe the size of the memory being passed in to a helper, or describe
> the size of the memory being returned. When a constant is passed in like
> this, it is assumed for the purposes of precision tracking that if the
> value in the already explored safe state is within the value in current
> state, it would fine to prune the search.
>
> While this holds well for size arguments, arguments where each value may
> denote a distinct meaning and needs to be verified separately needs more
> work. Search can only be pruned if both are constant values and both are
> equal. In all other cases, it would be incorrect to treat those two
> precise registers as equivalent if the new value satisfies the old one
> (i.e. old <= cur).
>
> Hence, make the register precision marker tri-state. There are now three
> values that reg->precise takes: NOT_PRECISE, PRECISE, PRECISE_ABSOLUTE.
>
> Both PRECISE and PRECISE_ABSOLUTE are 'true' values. PRECISE_ABSOLUTE
> affects how regsafe decides whether both registers are equivalent for
> the purposes of verifier state equivalence. When it sees that one
> register has reg->precise == PRECISE_ABSOLUTE, unless both are absolute,
> it will return false. When both are, it returns true only when both are
> const and both have the same value. Otherwise, for PRECISE case it falls
> back to the default check that is present now (i.e. thinking that we're
> talking about sizes).
>
> This is required as a future patch introduces a BPF memory allocator
> interface, where we take the program BTF's type ID as an argument. Each
> distinct type ID may result in the returned pointer obtaining a
> different size, hence precision tracking is needed, and pruning cannot
> just happen when the old value is within the current value. It must only
> happen when the type ID is equal. The type ID will always correspond to
> prog->aux->btf hence actual type match is not required.
>
> Finally, change mark_chain_precision to mark_chain_precision_absolute
> for kfuncs constant non-size scalar arguments (tagged with __k suffix).
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/linux/bpf_verifier.h | 8 +++-
> kernel/bpf/verifier.c | 93 ++++++++++++++++++++++++++----------
> 2 files changed, 76 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index b4a11ff56054..c4d21568d192 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -43,6 +43,12 @@ enum bpf_reg_liveness {
> REG_LIVE_DONE = 0x8, /* liveness won't be updating this register anymore */
> };
>
> +enum bpf_reg_precise {
> + NOT_PRECISE,
> + PRECISE,
> + PRECISE_ABSOLUTE,
> +};
Can we make it less verbose?
NOT_PRECISE,
PRECISE,
EXACT
> +
> struct bpf_reg_state {
> /* Ordering of fields matters. See states_equal() */
> enum bpf_reg_type type;
> @@ -180,7 +186,7 @@ struct bpf_reg_state {
> s32 subreg_def;
> enum bpf_reg_liveness live;
> /* if (!precise && SCALAR_VALUE) min/max/tnum don't affect safety */
> - bool precise;
> + enum bpf_reg_precise precise;
Have been thinking whether
bool precise;
bool exact;
would be better,
but doesn't look like it.
> };
>
> enum bpf_stack_slot_type {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index b28e88d6fabd..571790ac58d4 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -838,7 +838,7 @@ static void print_verifier_state(struct bpf_verifier_env *env,
> print_liveness(env, reg->live);
> verbose(env, "=");
> if (t == SCALAR_VALUE && reg->precise)
> - verbose(env, "P");
> + verbose(env, reg->precise == PRECISE_ABSOLUTE ? "PA" : "P");
and here it would be just 'E'
> if ((t == SCALAR_VALUE || t == PTR_TO_STACK) &&
> tnum_is_const(reg->var_off)) {
> /* reg->off should be 0 for SCALAR_VALUE */
> @@ -935,7 +935,7 @@ static void print_verifier_state(struct bpf_verifier_env *env,
> t = reg->type;
> verbose(env, "=%s", t == SCALAR_VALUE ? "" : reg_type_str(env, t));
> if (t == SCALAR_VALUE && reg->precise)
> - verbose(env, "P");
> + verbose(env, reg->precise == PRECISE_ABSOLUTE ? "PA" : "P");
> if (t == SCALAR_VALUE && tnum_is_const(reg->var_off))
> verbose(env, "%lld", reg->var_off.value + reg->off);
> } else {
> @@ -1668,7 +1668,17 @@ static void __mark_reg_unknown(const struct bpf_verifier_env *env,
> reg->type = SCALAR_VALUE;
> reg->var_off = tnum_unknown;
> reg->frameno = 0;
> - reg->precise = env->subprog_cnt > 1 || !env->bpf_capable;
> + /* Helpers requiring PRECISE_ABSOLUTE for constant arguments cannot be
> + * called from programs without CAP_BPF. This is because we don't
> + * propagate precision markers for when CAP_BPF is missing. If we
> + * allowed calling such heleprs in those programs, the default would
> + * have to be PRECISE_ABSOLUTE for them, which would be too aggresive.
> + *
> + * We still propagate PRECISE_ABSOLUTE when subprog_cnt > 1, hence
> + * those cases would still override the default PRECISE value when
> + * we propagate the precision markers.
> + */
> + reg->precise = (env->subprog_cnt > 1 || !env->bpf_capable) ? PRECISE : NOT_PRECISE;
> __mark_reg_unbounded(reg);
> }
>
> @@ -2717,7 +2727,8 @@ static int backtrack_insn(struct bpf_verifier_env *env, int idx,
> * For now backtracking falls back into conservative marking.
> */
> static void mark_all_scalars_precise(struct bpf_verifier_env *env,
> - struct bpf_verifier_state *st)
> + struct bpf_verifier_state *st,
> + bool absolute)
> {
> struct bpf_func_state *func;
> struct bpf_reg_state *reg;
> @@ -2733,7 +2744,7 @@ static void mark_all_scalars_precise(struct bpf_verifier_env *env,
> reg = &func->regs[j];
> if (reg->type != SCALAR_VALUE)
> continue;
> - reg->precise = true;
> + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> }
> for (j = 0; j < func->allocated_stack / BPF_REG_SIZE; j++) {
> if (!is_spilled_reg(&func->stack[j]))
> @@ -2741,13 +2752,13 @@ static void mark_all_scalars_precise(struct bpf_verifier_env *env,
> reg = &func->stack[j].spilled_ptr;
> if (reg->type != SCALAR_VALUE)
> continue;
> - reg->precise = true;
> + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> }
> }
> }
>
> static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> - int spi)
> + int spi, bool absolute)
instead of bool pls pass enum bpf_reg_precise
> {
> struct bpf_verifier_state *st = env->cur_state;
> int first_idx = st->first_insn_idx;
> @@ -2774,7 +2785,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> new_marks = true;
> else
> reg_mask = 0;
> - reg->precise = true;
> + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> }
>
> while (spi >= 0) {
> @@ -2791,7 +2802,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> new_marks = true;
> else
> stack_mask = 0;
> - reg->precise = true;
> + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> break;
> }
>
> @@ -2813,7 +2824,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> err = backtrack_insn(env, i, ®_mask, &stack_mask);
> }
> if (err == -ENOTSUPP) {
> - mark_all_scalars_precise(env, st);
> + mark_all_scalars_precise(env, st, absolute);
> return 0;
> } else if (err) {
> return err;
> @@ -2854,7 +2865,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> }
> if (!reg->precise)
> new_marks = true;
> - reg->precise = true;
> + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> }
>
> bitmap_from_u64(mask, stack_mask);
> @@ -2873,7 +2884,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> * fp-8 and it's "unallocated" stack space.
> * In such case fallback to conservative.
> */
> - mark_all_scalars_precise(env, st);
> + mark_all_scalars_precise(env, st, absolute);
> return 0;
> }
>
> @@ -2888,7 +2899,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> }
> if (!reg->precise)
> new_marks = true;
> - reg->precise = true;
> + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> }
> if (env->log.level & BPF_LOG_LEVEL2) {
> verbose(env, "parent %s regs=%x stack=%llx marks:",
> @@ -2910,12 +2921,24 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
>
> static int mark_chain_precision(struct bpf_verifier_env *env, int regno)
> {
> - return __mark_chain_precision(env, regno, -1);
> + return __mark_chain_precision(env, regno, -1, false);
> +}
> +
> +static int mark_chain_precision_absolute(struct bpf_verifier_env *env, int regno)
> +{
> + WARN_ON_ONCE(!env->bpf_capable);
> + return __mark_chain_precision(env, regno, -1, true);
> }
>
> static int mark_chain_precision_stack(struct bpf_verifier_env *env, int spi)
> {
> - return __mark_chain_precision(env, -1, spi);
> + return __mark_chain_precision(env, -1, spi, false);
> +}
No need to fork the functions so much.
Just add enum bpf_reg_precise to existing two functions.
> +
> +static int mark_chain_precision_absolute_stack(struct bpf_verifier_env *env, int spi)
> +{
> + WARN_ON_ONCE(!env->bpf_capable);
> + return __mark_chain_precision(env, -1, spi, true);
> }
>
> static bool is_spillable_regtype(enum bpf_reg_type type)
> @@ -3253,7 +3276,7 @@ static void mark_reg_stack_read(struct bpf_verifier_env *env,
> * backtracking. Any register that contributed
> * to const 0 was marked precise before spill.
> */
> - state->regs[dst_regno].precise = true;
> + state->regs[dst_regno].precise = PRECISE;
> } else {
> /* have read misc data from the stack */
> mark_reg_unknown(env, state->regs, dst_regno);
> @@ -7903,7 +7926,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_arg_m
> verbose(env, "R%d must be a known constant\n", regno);
> return -EINVAL;
> }
> - ret = mark_chain_precision(env, regno);
> + ret = mark_chain_precision_absolute(env, regno);
> if (ret < 0)
> return ret;
> meta->arg_constant.found = true;
> @@ -11899,9 +11922,23 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
> if (rcur->type == SCALAR_VALUE) {
> if (!rold->precise && !rcur->precise)
> return true;
> - /* new val must satisfy old val knowledge */
> - return range_within(rold, rcur) &&
> - tnum_in(rold->var_off, rcur->var_off);
> + /* We can only determine safety when type of precision
> + * needed is same. For absolute, we must compare actual
> + * value, otherwise old being within the current value
> + * suffices.
> + */
> + if (rold->precise == PRECISE_ABSOLUTE || rcur->precise == PRECISE_ABSOLUTE) {
> + /* Both should be PRECISE_ABSOLUTE for a comparison */
> + if (rold->precise != rcur->precise)
> + return false;
> + if (!tnum_is_const(rold->var_off) || !tnum_is_const(rcur->var_off))
> + return false;
> + return rold->var_off.value == rcur->var_off.value;
Probably better to do
if (rold->precise == EXACT || rcur->precise == EXACT)
return false;
because
if (equal)
return true;
should have already happened if they were an exact match.
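i.e. a minimal sketch of the scalar branch, assuming the earlier
'if (equal) return true;' path already handles identical exact registers:

  if (rold->precise == EXACT || rcur->precise == EXACT)
          return false;
  if (!rold->precise && !rcur->precise)
          return true;
  /* new val must satisfy old val knowledge */
  return range_within(rold, rcur) &&
         tnum_in(rold->var_off, rcur->var_off);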
> + } else {
> + /* new val must satisfy old val knowledge */
> + return range_within(rold, rcur) &&
> + tnum_in(rold->var_off, rcur->var_off);
> + }
> } else {
> /* We're trying to use a pointer in place of a scalar.
> * Even if the scalar was unbounded, this could lead to
> @@ -12229,8 +12266,12 @@ static int propagate_precision(struct bpf_verifier_env *env,
> !state_reg->precise)
> continue;
> if (env->log.level & BPF_LOG_LEVEL2)
> - verbose(env, "propagating r%d\n", i);
> - err = mark_chain_precision(env, i);
> + verbose(env, "propagating %sr%d\n",
> + state_reg->precise == PRECISE_ABSOLUTE ? "abs " : "", i);
> + if (state_reg->precise == PRECISE_ABSOLUTE)
> + err = mark_chain_precision_absolute(env, i);
> + else
> + err = mark_chain_precision(env, i);
> if (err < 0)
> return err;
> }
> @@ -12243,9 +12284,13 @@ static int propagate_precision(struct bpf_verifier_env *env,
> !state_reg->precise)
> continue;
> if (env->log.level & BPF_LOG_LEVEL2)
> - verbose(env, "propagating fp%d\n",
> + verbose(env, "propagating %sfp%d\n",
> + state_reg->precise == PRECISE_ABSOLUTE ? "abs " : "",
> (-i - 1) * BPF_REG_SIZE);
> - err = mark_chain_precision_stack(env, i);
> + if (state_reg->precise == PRECISE_ABSOLUTE)
> + err = mark_chain_precision_absolute_stack(env, i);
> + else
> + err = mark_chain_precision_stack(env, i);
> if (err < 0)
> return err;
> }
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 13/32] bpf: Introduce bpf_list_head support for BPF maps
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 13/32] bpf: Introduce bpf_list_head support for BPF maps Kumar Kartikeya Dwivedi
@ 2022-09-07 22:46 ` Alexei Starovoitov
2022-09-08 2:58 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-07 22:46 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Sun, Sep 04, 2022 at 10:41:26PM +0200, Kumar Kartikeya Dwivedi wrote:
> Add the basic support on the map side to parse, recognize, verify, and
> build metadata table for a new special field of the type struct
> bpf_list_head. To parameterize the bpf_list_head for a certain value
> type and the list_node member it will accept in that value type, we use
> BTF declaration tags.
>
> The definition of bpf_list_head in a map value will be done as follows:
>
> struct foo {
> int data;
> struct bpf_list_node list;
> };
>
> struct map_value {
> struct bpf_list_head list __contains(struct, foo, node);
> };
kptrs are only for structs.
So I would drop the explicit 1st argument, which is going to be 'struct'
for the foreseeable future, and leave it as:
struct bpf_list_head list __contains(foo, node);
There is a typo s/list;/node;/ in struct foo, right?
> Then, the bpf_list_head only allows adding to the list using the
> bpf_list_node 'list' for the type struct foo.
>
> The 'contains' annotation is a BTF declaration tag composed of four
> parts, "contains:kind:name:node" where the kind and name is then used to
> look up the type in the map BTF. The node defines name of the member in
> this type that has the type struct bpf_list_node, which is actually used
> for linking into the linked list.
>
> This allows building intrusive linked lists in BPF, using container_of
> to obtain pointer to entry, while being completely type safe from the
> perspective of the verifier. The verifier knows exactly the type of the
> nodes, and knows that list helpers return that type at some fixed offset
> where the bpf_list_node member used for this list exists. The verifier
> also uses this information to disallow adding types that are not
> accepted by a certain list.
>
> For now, no elements can be added to such lists. Support for that is
> coming in future patches, hence draining and freeing items is left out
> for now, and just freeing the list_head_off_tab is done, since it is
> still built and populated when bpf_list_head is specified in the map
> value.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/linux/bpf.h | 64 +++++--
> include/linux/btf.h | 2 +
> kernel/bpf/arraymap.c | 2 +
> kernel/bpf/bpf_local_storage.c | 1 +
> kernel/bpf/btf.c | 173 +++++++++++++++++-
> kernel/bpf/hashtab.c | 1 +
> kernel/bpf/map_in_map.c | 5 +-
> kernel/bpf/syscall.c | 131 +++++++++++--
> kernel/bpf/verifier.c | 21 +++
> .../testing/selftests/bpf/bpf_experimental.h | 21 +++
> 10 files changed, 378 insertions(+), 43 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/bpf_experimental.h
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index d4e6bf789c02..35c2e9caeb98 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -28,6 +28,9 @@
> #include <linux/btf.h>
> #include <linux/rcupdate_trace.h>
>
> +/* Experimental BPF APIs header for type definitions */
> +#include "../../../tools/testing/selftests/bpf/bpf_experimental.h"
> +
> struct bpf_verifier_env;
> struct bpf_verifier_log;
> struct perf_event;
> @@ -164,27 +167,40 @@ struct bpf_map_ops {
> };
>
> enum {
> - /* Support at most 8 pointers in a BPF map value */
> - BPF_MAP_VALUE_OFF_MAX = 8,
> - BPF_MAP_OFF_ARR_MAX = BPF_MAP_VALUE_OFF_MAX +
> - 1 + /* for bpf_spin_lock */
> - 1, /* for bpf_timer */
> -};
> -
> -enum bpf_kptr_type {
> + /* Support at most 8 offsets in a table */
> + BPF_MAP_VALUE_OFF_MAX = 8,
> + /* Support at most 8 pointer in a BPF map value */
> + BPF_MAP_VALUE_KPTR_MAX = BPF_MAP_VALUE_OFF_MAX,
> + /* Support at most 8 list_head in a BPF map value */
> + BPF_MAP_VALUE_LIST_HEAD_MAX = BPF_MAP_VALUE_OFF_MAX,
> + BPF_MAP_OFF_ARR_MAX = BPF_MAP_VALUE_KPTR_MAX +
> + BPF_MAP_VALUE_LIST_HEAD_MAX +
> + 1 + /* for bpf_spin_lock */
> + 1, /* for bpf_timer */
> +};
> +
> +enum bpf_off_type {
> BPF_KPTR_UNREF,
> BPF_KPTR_REF,
> + BPF_LIST_HEAD,
> };
>
> struct bpf_map_value_off_desc {
> u32 offset;
> - enum bpf_kptr_type type;
> - struct {
> - struct btf *btf;
> - struct module *module;
> - btf_dtor_kfunc_t dtor;
> - u32 btf_id;
> - } kptr;
> + enum bpf_off_type type;
> + union {
> + struct {
> + struct btf *btf;
> + struct module *module;
> + btf_dtor_kfunc_t dtor;
> + u32 btf_id;
> + } kptr; /* for BPF_KPTR_{UNREF,REF} */
> + struct {
> + struct btf *btf;
> + u32 value_type_id;
> + u32 list_node_off;
> + } list_head; /* for BPF_LIST_HEAD */
> + };
> };
>
> struct bpf_map_value_off {
> @@ -215,6 +231,7 @@ struct bpf_map {
> u32 map_flags;
> int spin_lock_off; /* >=0 valid offset, <0 error */
> struct bpf_map_value_off *kptr_off_tab;
> + struct bpf_map_value_off *list_head_off_tab;
The union in bpf_map_value_off_desc prompts the question
why separate array is needed.
Sorting gets uglier.
> int timer_off; /* >=0 valid offset, <0 error */
> u32 id;
> int numa_node;
> @@ -265,6 +282,11 @@ static inline bool map_value_has_kptrs(const struct bpf_map *map)
> return !IS_ERR_OR_NULL(map->kptr_off_tab);
> }
>
> +static inline bool map_value_has_list_heads(const struct bpf_map *map)
> +{
> + return !IS_ERR_OR_NULL(map->list_head_off_tab);
> +}
> +
> static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> {
> if (unlikely(map_value_has_spin_lock(map)))
> @@ -278,6 +300,13 @@ static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> for (i = 0; i < tab->nr_off; i++)
> *(u64 *)(dst + tab->off[i].offset) = 0;
> }
> + if (unlikely(map_value_has_list_heads(map))) {
> + struct bpf_map_value_off *tab = map->list_head_off_tab;
> + int i;
> +
> + for (i = 0; i < tab->nr_off; i++)
> + memset(dst + tab->off[i].offset, 0, sizeof(struct list_head));
> + }
Do we really need to distinguish map_value_has_kptrs vs map_value_has_list_heads ?
Can they be generalized?
rb_root will be next.
That would be yet another array and more 'if's everywhere?
And then another special pseudo-map type that will cause a bunch of copy-paste again?
Maybe it's inevitable.
Trying to brainstorm.
> }
>
> /* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
> @@ -1676,6 +1705,11 @@ struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> void bpf_map_free_kptrs(struct bpf_map *map, void *map_value);
>
> +struct bpf_map_value_off_desc *bpf_map_list_head_off_contains(struct bpf_map *map, u32 offset);
> +void bpf_map_free_list_head_off_tab(struct bpf_map *map);
> +struct bpf_map_value_off *bpf_map_copy_list_head_off_tab(const struct bpf_map *map);
> +bool bpf_map_equal_list_head_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> +
> struct bpf_map *bpf_map_get(u32 ufd);
> struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> struct bpf_map *__bpf_map_get(struct fd f);
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index 8062f9da7c40..9b62b8b2117e 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -156,6 +156,8 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> const struct btf_type *t);
> +struct bpf_map_value_off *btf_parse_list_heads(struct btf *btf,
> + const struct btf_type *t);
> bool btf_type_is_void(const struct btf_type *t);
> s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index 832b2659e96e..c7263ee3a35f 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -423,6 +423,8 @@ static void array_map_free(struct bpf_map *map)
> struct bpf_array *array = container_of(map, struct bpf_array, map);
> int i;
>
> + bpf_map_free_list_head_off_tab(map);
> +
> if (map_value_has_kptrs(map)) {
> if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
> for (i = 0; i < array->map.max_entries; i++) {
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 58cb0c179097..b5ccd76026b6 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -616,6 +616,7 @@ void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
> rcu_barrier();
> bpf_map_free_kptr_off_tab(&smap->map);
> }
> + bpf_map_free_list_head_off_tab(&smap->map);
> kvfree(smap->buckets);
> bpf_map_area_free(smap);
> }
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 6740c3ade8f1..0fb045be3837 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3185,6 +3185,7 @@ enum btf_field_type {
> BTF_FIELD_SPIN_LOCK,
> BTF_FIELD_TIMER,
> BTF_FIELD_KPTR,
> + BTF_FIELD_LIST_HEAD,
> };
>
> enum {
> @@ -3193,9 +3194,17 @@ enum {
> };
>
> struct btf_field_info {
> - u32 type_id;
> u32 off;
> - enum bpf_kptr_type type;
> + union {
> + struct {
> + u32 type_id;
> + enum bpf_off_type type;
> + } kptr;
> + struct {
> + u32 value_type_id;
> + const char *node_name;
> + } list_head;
> + };
> };
>
> static int btf_find_struct(const struct btf *btf, const struct btf_type *t,
> @@ -3212,7 +3221,7 @@ static int btf_find_struct(const struct btf *btf, const struct btf_type *t,
> static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
> u32 off, int sz, struct btf_field_info *info)
> {
> - enum bpf_kptr_type type;
> + enum bpf_off_type type;
> u32 res_id;
>
> /* Permit modifiers on the pointer itself */
> @@ -3241,9 +3250,71 @@ static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
> if (!__btf_type_is_struct(t))
> return -EINVAL;
>
> - info->type_id = res_id;
> info->off = off;
> - info->type = type;
> + info->kptr.type_id = res_id;
> + info->kptr.type = type;
> + return BTF_FIELD_FOUND;
> +}
> +
> +static const char *btf_find_decl_tag_value(const struct btf *btf,
> + const struct btf_type *pt,
> + int comp_idx, const char *tag_key)
> +{
> + int i;
> +
> + for (i = 1; i < btf_nr_types(btf); i++) {
> + const struct btf_type *t = btf_type_by_id(btf, i);
> + int len = strlen(tag_key);
> +
> + if (!btf_type_is_decl_tag(t))
> + continue;
> + /* TODO: Instead of btf_type pt, it would be much better if we had BTF
> + * ID of the map value type. This would avoid btf_type_by_id call here.
> + */
> + if (pt != btf_type_by_id(btf, t->type) ||
> + btf_type_decl_tag(t)->component_idx != comp_idx)
> + continue;
> + if (strncmp(__btf_name_by_offset(btf, t->name_off), tag_key, len))
> + continue;
> + return __btf_name_by_offset(btf, t->name_off) + len;
> + }
> + return NULL;
> +}
> +
> +static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
> + int comp_idx, const struct btf_type *t,
> + u32 off, int sz, struct btf_field_info *info)
> +{
> + const char *value_type;
> + const char *list_node;
> + s32 id;
> +
> + if (!__btf_type_is_struct(t))
> + return BTF_FIELD_IGNORE;
> + if (t->size != sz)
> + return BTF_FIELD_IGNORE;
> + value_type = btf_find_decl_tag_value(btf, pt, comp_idx, "contains:");
> + if (!value_type)
> + return -EINVAL;
> + if (strncmp(value_type, "struct:", sizeof("struct:") - 1))
> + return -EINVAL;
> + value_type += sizeof("struct:") - 1;
> + list_node = strstr(value_type, ":");
> + if (!list_node)
> + return -EINVAL;
> + value_type = kstrndup(value_type, list_node - value_type, GFP_ATOMIC);
> + if (!value_type)
> + return -ENOMEM;
> + id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT);
> + kfree(value_type);
> + if (id < 0)
> + return id;
> + list_node++;
> + if (str_is_empty(list_node))
> + return -EINVAL;
> + info->off = off;
> + info->list_head.value_type_id = id;
> + info->list_head.node_name = list_node;
> return BTF_FIELD_FOUND;
> }
>
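For context: the decl tag being parsed above would come from BPF C roughly
like this (sketch; the macro shape follows the __contains(struct, foo, node)
example used later in the series and is illustrative, not final):

    #define __contains(kind, name, node) \
        __attribute__((btf_decl_tag("contains:" #kind ":" #name ":" #node)))

    struct foo {
        int data;
        struct bpf_list_node node;
    };

    struct map_value {
        struct bpf_spin_lock lock;
        /* emits BTF_KIND_DECL_TAG "contains:struct:foo:node" for this member */
        struct bpf_list_head head __contains(struct, foo, node);
    };
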
> @@ -3286,6 +3357,12 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
> if (ret < 0)
> return ret;
> break;
> + case BTF_FIELD_LIST_HEAD:
> + ret = btf_find_list_head(btf, t, i, member_type, off, sz,
> + idx < info_cnt ? &info[idx] : &tmp);
> + if (ret < 0)
> + return ret;
> + break;
> default:
> return -EFAULT;
> }
> @@ -3336,6 +3413,12 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> if (ret < 0)
> return ret;
> break;
> + case BTF_FIELD_LIST_HEAD:
> + ret = btf_find_list_head(btf, var, -1, var_type, off, sz,
> + idx < info_cnt ? &info[idx] : &tmp);
> + if (ret < 0)
> + return ret;
> + break;
> default:
> return -EFAULT;
> }
> @@ -3372,6 +3455,11 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> sz = sizeof(u64);
> align = 8;
> break;
> + case BTF_FIELD_LIST_HEAD:
> + name = "bpf_list_head";
> + sz = sizeof(struct bpf_list_head);
> + align = __alignof__(struct bpf_list_head);
> + break;
> default:
> return -EFAULT;
> }
> @@ -3440,7 +3528,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> /* Find type in map BTF, and use it to look up the matching type
> * in vmlinux or module BTFs, by name and kind.
> */
> - t = btf_type_by_id(btf, info_arr[i].type_id);
> + t = btf_type_by_id(btf, info_arr[i].kptr.type_id);
> id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
> &kernel_btf);
> if (id < 0) {
> @@ -3451,7 +3539,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> /* Find and stash the function pointer for the destruction function that
> * needs to be eventually invoked from the map free path.
> */
> - if (info_arr[i].type == BPF_KPTR_REF) {
> + if (info_arr[i].kptr.type == BPF_KPTR_REF) {
> const struct btf_type *dtor_func;
> const char *dtor_func_name;
> unsigned long addr;
> @@ -3494,7 +3582,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> }
>
> tab->off[i].offset = info_arr[i].off;
> - tab->off[i].type = info_arr[i].type;
> + tab->off[i].type = info_arr[i].kptr.type;
> tab->off[i].kptr.btf_id = id;
> tab->off[i].kptr.btf = kernel_btf;
> tab->off[i].kptr.module = mod;
> @@ -3515,6 +3603,75 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> return ERR_PTR(ret);
> }
>
> +struct bpf_map_value_off *btf_parse_list_heads(struct btf *btf, const struct btf_type *t)
> +{
> + struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
> + struct bpf_map_value_off *tab;
> + int ret, i, nr_off;
> +
> + ret = btf_find_field(btf, t, BTF_FIELD_LIST_HEAD, info_arr, ARRAY_SIZE(info_arr));
E.g., search for both LIST_HEAD and KPTR here in a single pass to know the total size up front.
> + if (ret < 0)
> + return ERR_PTR(ret);
> + if (!ret)
> + return NULL;
> +
> + nr_off = ret;
> + tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
> + if (!tab)
> + return ERR_PTR(-ENOMEM);
> +
> + for (i = 0; i < nr_off; i++) {
> + const struct btf_type *t, *n = NULL;
> + const struct btf_member *member;
> + u32 offset;
> + int j;
and here we could process both, since btf_field_info carries the field type.
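i.e. something along these lines (rough sketch; assumes btf_field_info grows
a common 'type' member):

    for (i = 0; i < nr_off; i++) {
        switch (info_arr[i].type) {
        case BPF_KPTR_UNREF:
        case BPF_KPTR_REF:
            /* resolve kernel BTF ID, dtor, module ref, as btf_parse_kptrs does */
            break;
        case BPF_LIST_HEAD:
            /* resolve value type and bpf_list_node offset, as below */
            break;
        default:
            ret = -EFAULT;
            goto end;
        }
    }
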
> +
> + t = btf_type_by_id(btf, info_arr[i].list_head.value_type_id);
> + /* We've already checked that value_type_id is a struct type. We
> + * just need to figure out the offset of the list_node, and
> + * verify its type.
> + */
> + ret = -EINVAL;
> + for_each_member(j, t, member) {
> + if (strcmp(info_arr[i].list_head.node_name, __btf_name_by_offset(btf, member->name_off)))
> + continue;
> + /* Invalid BTF, two members with same name */
> + if (n) {
> + /* We also need to btf_put for the current iteration! */
> + i++;
> + goto end;
> + }
> + n = btf_type_by_id(btf, member->type);
> + if (!__btf_type_is_struct(n))
> + goto end;
> + if (strcmp("bpf_list_node", __btf_name_by_offset(btf, n->name_off)))
> + goto end;
> + offset = __btf_member_bit_offset(n, member);
> + if (offset % 8)
> + goto end;
> + offset /= 8;
> + if (offset % __alignof__(struct bpf_list_node))
> + goto end;
> +
> + tab->off[i].offset = info_arr[i].off;
> + tab->off[i].type = BPF_LIST_HEAD;
> + btf_get(btf);
Do we need to btf_get here? The btf should be pinned already and is not going to be
released until the prog ends.
> + tab->off[i].list_head.btf = btf;
> + tab->off[i].list_head.value_type_id = info_arr[i].list_head.value_type_id;
> + tab->off[i].list_head.list_node_off = offset;
> + }
> + if (!n)
> + goto end;
> + }
> + tab->nr_off = nr_off;
> + return tab;
> +end:
> + while (i--)
> + btf_put(tab->off[i].list_head.btf);
> + kfree(tab);
> + return ERR_PTR(ret);
> +}
> +
> static void __btf_struct_show(const struct btf *btf, const struct btf_type *t,
> u32 type_id, void *data, u8 bits_offset,
> struct btf_show *show)
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index bb3f8a63c221..270e0ecf4ba3 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -1518,6 +1518,7 @@ static void htab_map_free(struct bpf_map *map)
> prealloc_destroy(htab);
> }
>
> + bpf_map_free_list_head_off_tab(map);
> bpf_map_free_kptr_off_tab(map);
> free_percpu(htab->extra_elems);
> bpf_map_area_free(htab->buckets);
> diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> index 135205d0d560..ced2559129ab 100644
> --- a/kernel/bpf/map_in_map.c
> +++ b/kernel/bpf/map_in_map.c
> @@ -53,6 +53,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> inner_map_meta->spin_lock_off = inner_map->spin_lock_off;
> inner_map_meta->timer_off = inner_map->timer_off;
> inner_map_meta->kptr_off_tab = bpf_map_copy_kptr_off_tab(inner_map);
> + inner_map_meta->list_head_off_tab = bpf_map_copy_list_head_off_tab(inner_map);
> if (inner_map->btf) {
> btf_get(inner_map->btf);
> inner_map_meta->btf = inner_map->btf;
> @@ -72,6 +73,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
>
> void bpf_map_meta_free(struct bpf_map *map_meta)
> {
> + bpf_map_free_list_head_off_tab(map_meta);
> bpf_map_free_kptr_off_tab(map_meta);
> btf_put(map_meta->btf);
> kfree(map_meta);
> @@ -86,7 +88,8 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0,
> meta0->value_size == meta1->value_size &&
> meta0->timer_off == meta1->timer_off &&
> meta0->map_flags == meta1->map_flags &&
> - bpf_map_equal_kptr_off_tab(meta0, meta1);
> + bpf_map_equal_kptr_off_tab(meta0, meta1) &&
> + bpf_map_equal_list_head_off_tab(meta0, meta1);
> }
>
> void *bpf_map_fd_get_ptr(struct bpf_map *map,
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 0311acca19f6..e1749e0d2143 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -495,7 +495,7 @@ static void bpf_map_release_memcg(struct bpf_map *map)
> }
> #endif
>
> -static int bpf_map_kptr_off_cmp(const void *a, const void *b)
> +static int bpf_map_off_cmp(const void *a, const void *b)
> {
> const struct bpf_map_value_off_desc *off_desc1 = a, *off_desc2 = b;
>
> @@ -506,18 +506,22 @@ static int bpf_map_kptr_off_cmp(const void *a, const void *b)
> return 0;
> }
>
> -struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
> +static struct bpf_map_value_off_desc *
> +__bpf_map_off_contains(struct bpf_map_value_off *off_tab, u32 offset)
> {
> /* Since members are iterated in btf_find_field in increasing order,
> - * offsets appended to kptr_off_tab are in increasing order, so we can
> + * offsets appended to an off_tab are in increasing order, so we can
> * do bsearch to find exact match.
> */
> - struct bpf_map_value_off *tab;
> + return bsearch(&offset, off_tab->off, off_tab->nr_off, sizeof(off_tab->off[0]),
> + bpf_map_off_cmp);
> +}
>
> +struct bpf_map_value_off_desc *bpf_map_kptr_off_contains(struct bpf_map *map, u32 offset)
> +{
> if (!map_value_has_kptrs(map))
> return NULL;
> - tab = map->kptr_off_tab;
> - return bsearch(&offset, tab->off, tab->nr_off, sizeof(tab->off[0]), bpf_map_kptr_off_cmp);
> + return __bpf_map_off_contains(map->kptr_off_tab, offset);
> }
>
> void bpf_map_free_kptr_off_tab(struct bpf_map *map)
> @@ -563,15 +567,15 @@ struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map)
> return new_tab;
> }
>
> -bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> +static bool __bpf_map_equal_off_tab(const struct bpf_map_value_off *tab_a,
> + const struct bpf_map_value_off *tab_b,
> + bool has_a, bool has_b)
> {
> - struct bpf_map_value_off *tab_a = map_a->kptr_off_tab, *tab_b = map_b->kptr_off_tab;
> - bool a_has_kptr = map_value_has_kptrs(map_a), b_has_kptr = map_value_has_kptrs(map_b);
> int size;
>
> - if (!a_has_kptr && !b_has_kptr)
> + if (!has_a && !has_b)
> return true;
> - if (a_has_kptr != b_has_kptr)
> + if (has_a != has_b)
> return false;
> if (tab_a->nr_off != tab_b->nr_off)
> return false;
> @@ -579,6 +583,13 @@ bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_ma
> return !memcmp(tab_a, tab_b, size);
> }
>
> +bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> +{
> + return __bpf_map_equal_off_tab(map_a->kptr_off_tab, map_b->kptr_off_tab,
> + map_value_has_kptrs(map_a),
> + map_value_has_kptrs(map_b));
> +}
> +
> /* Caller must ensure map_value_has_kptrs is true. Note that this function can
> * be called on a map value while the map_value is visible to BPF programs, as
> * it ensures the correct synchronization, and we already enforce the same using
> @@ -606,6 +617,50 @@ void bpf_map_free_kptrs(struct bpf_map *map, void *map_value)
> }
> }
>
> +struct bpf_map_value_off_desc *bpf_map_list_head_off_contains(struct bpf_map *map, u32 offset)
> +{
> + if (!map_value_has_list_heads(map))
> + return NULL;
> + return __bpf_map_off_contains(map->list_head_off_tab, offset);
> +}
> +
> +void bpf_map_free_list_head_off_tab(struct bpf_map *map)
> +{
> + struct bpf_map_value_off *tab = map->list_head_off_tab;
> + int i;
> +
> + if (!map_value_has_list_heads(map))
> + return;
> + for (i = 0; i < tab->nr_off; i++)
> + btf_put(tab->off[i].list_head.btf);
> + kfree(tab);
> + map->list_head_off_tab = NULL;
> +}
> +
> +struct bpf_map_value_off *bpf_map_copy_list_head_off_tab(const struct bpf_map *map)
> +{
> + struct bpf_map_value_off *tab = map->list_head_off_tab, *new_tab;
> + int size, i;
> +
> + if (!map_value_has_list_heads(map))
> + return ERR_PTR(-ENOENT);
> + size = offsetof(struct bpf_map_value_off, off[tab->nr_off]);
> + new_tab = kmemdup(tab, size, GFP_KERNEL | __GFP_NOWARN);
> + if (!new_tab)
> + return ERR_PTR(-ENOMEM);
> + /* Do a deep copy of the list_head_off_tab */
> + for (i = 0; i < tab->nr_off; i++)
> + btf_get(tab->off[i].list_head.btf);
> + return new_tab;
> +}
> +
> +bool bpf_map_equal_list_head_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b)
> +{
> + return __bpf_map_equal_off_tab(map_a->list_head_off_tab, map_b->list_head_off_tab,
> + map_value_has_list_heads(map_a),
> + map_value_has_list_heads(map_b));
> +}
> +
> /* called from workqueue */
> static void bpf_map_free_deferred(struct work_struct *work)
> {
> @@ -776,7 +831,8 @@ static int bpf_map_mmap(struct file *filp, struct vm_area_struct *vma)
> int err;
>
> if (!map->ops->map_mmap || map_value_has_spin_lock(map) ||
> - map_value_has_timer(map) || map_value_has_kptrs(map))
> + map_value_has_timer(map) || map_value_has_kptrs(map) ||
> + map_value_has_list_heads(map))
> return -ENOTSUPP;
>
> if (!(vma->vm_flags & VM_SHARED))
> @@ -931,13 +987,14 @@ static void map_off_arr_swap(void *_a, void *_b, int size, const void *priv)
>
> static int bpf_map_alloc_off_arr(struct bpf_map *map)
> {
> + bool has_list_heads = map_value_has_list_heads(map);
> bool has_spin_lock = map_value_has_spin_lock(map);
> bool has_timer = map_value_has_timer(map);
> bool has_kptrs = map_value_has_kptrs(map);
> struct bpf_map_off_arr *off_arr;
> u32 i;
>
> - if (!has_spin_lock && !has_timer && !has_kptrs) {
> + if (!has_spin_lock && !has_timer && !has_kptrs && !has_list_heads) {
> map->off_arr = NULL;
> return 0;
> }
> @@ -973,6 +1030,17 @@ static int bpf_map_alloc_off_arr(struct bpf_map *map)
> }
> off_arr->cnt += tab->nr_off;
> }
> + if (has_list_heads) {
> + struct bpf_map_value_off *tab = map->list_head_off_tab;
> + u32 *off = &off_arr->field_off[off_arr->cnt];
> + u8 *sz = &off_arr->field_sz[off_arr->cnt];
> +
> + for (i = 0; i < tab->nr_off; i++) {
> + *off++ = tab->off[i].offset;
> + *sz++ = sizeof(struct bpf_list_head);
> + }
> + off_arr->cnt += tab->nr_off;
> + }
>
> if (off_arr->cnt == 1)
> return 0;
> @@ -1038,11 +1106,11 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> if (map_value_has_kptrs(map)) {
> if (!bpf_capable()) {
> ret = -EPERM;
> - goto free_map_tab;
> + goto free_map_kptr_tab;
> }
> if (map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG)) {
> ret = -EACCES;
> - goto free_map_tab;
> + goto free_map_kptr_tab;
> }
> if (map->map_type != BPF_MAP_TYPE_HASH &&
> map->map_type != BPF_MAP_TYPE_PERCPU_HASH &&
> @@ -1054,18 +1122,42 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
> map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
> ret = -EOPNOTSUPP;
> - goto free_map_tab;
> + goto free_map_kptr_tab;
> + }
> + }
> +
> + /* We need to take ref on the BTF, so pass it as non-const */
> + map->list_head_off_tab = btf_parse_list_heads((struct btf *)btf, value_type);
> + if (map_value_has_list_heads(map)) {
> + if (!bpf_capable()) {
> + ret = -EACCES;
> + goto free_map_list_head_tab;
> + }
> + if (map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG)) {
> + ret = -EACCES;
> + goto free_map_list_head_tab;
> + }
> + if (map->map_type != BPF_MAP_TYPE_HASH &&
> + map->map_type != BPF_MAP_TYPE_LRU_HASH &&
> + map->map_type != BPF_MAP_TYPE_ARRAY &&
> + map->map_type != BPF_MAP_TYPE_SK_STORAGE &&
> + map->map_type != BPF_MAP_TYPE_INODE_STORAGE &&
> + map->map_type != BPF_MAP_TYPE_TASK_STORAGE) {
> + ret = -EOPNOTSUPP;
> + goto free_map_list_head_tab;
> }
> }
>
> if (map->ops->map_check_btf) {
> ret = map->ops->map_check_btf(map, btf, key_type, value_type);
> if (ret < 0)
> - goto free_map_tab;
> + goto free_map_list_head_tab;
> }
>
> return ret;
> -free_map_tab:
> +free_map_list_head_tab:
> + bpf_map_free_list_head_off_tab(map);
> +free_map_kptr_tab:
> bpf_map_free_kptr_off_tab(map);
> return ret;
> }
> @@ -1889,7 +1981,8 @@ static int map_freeze(const union bpf_attr *attr)
> return PTR_ERR(map);
>
> if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS ||
> - map_value_has_timer(map) || map_value_has_kptrs(map)) {
> + map_value_has_timer(map) || map_value_has_kptrs(map) ||
> + map_value_has_list_heads(map)) {
> fdput(f);
> return -ENOTSUPP;
> }
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 571790ac58d4..ab91e5ca7e41 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3879,6 +3879,20 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno,
> }
> }
> }
> + if (map_value_has_list_heads(map)) {
> + struct bpf_map_value_off *tab = map->list_head_off_tab;
> + int i;
> +
> + for (i = 0; i < tab->nr_off; i++) {
> + u32 p = tab->off[i].offset;
> +
> + if (reg->smin_value + off < p + sizeof(struct bpf_list_head) &&
> + p < reg->umax_value + off + size) {
> + verbose(env, "bpf_list_head cannot be accessed directly by load/store\n");
> + return -EACCES;
> + }
> + }
> + }
> return err;
> }
>
> @@ -13165,6 +13179,13 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
> }
> }
>
> + if (map_value_has_list_heads(map)) {
> + if (is_tracing_prog_type(prog_type)) {
> + verbose(env, "tracing progs cannot use bpf_list_head yet\n");
> + return -EINVAL;
> + }
> + }
> +
> if ((bpf_prog_is_dev_bound(prog->aux) || bpf_map_is_dev_bound(map)) &&
> !bpf_offload_prog_map_match(prog, map)) {
> verbose(env, "offload device mismatch between prog and map\n");
> diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> new file mode 100644
> index 000000000000..ea1b3b1839d1
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> @@ -0,0 +1,21 @@
> +#ifndef __KERNEL__
> +
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_helpers.h>
> +
> +#else
> +
> +struct bpf_list_head {
> + __u64 __a;
> + __u64 __b;
> +} __attribute__((aligned(8)));
> +
> +struct bpf_list_node {
> + __u64 __a;
> + __u64 __b;
> +} __attribute__((aligned(8)));
> +
> +#endif
> +
> +#ifndef __KERNEL__
> +#endif
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 14/32] bpf: Introduce bpf_kptr_alloc helper
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 14/32] bpf: Introduce bpf_kptr_alloc helper Kumar Kartikeya Dwivedi
@ 2022-09-07 23:30 ` Alexei Starovoitov
2022-09-08 3:01 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-07 23:30 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Sun, Sep 04, 2022 at 10:41:27PM +0200, Kumar Kartikeya Dwivedi wrote:
> To allocate local kptr of types pointing into program BTF instead of
> kernel BTF, bpf_kptr_alloc is a new helper that takes the local type's
> BTF ID and returns a pointer to it. The size is automatically inferred
> from the type ID by the BPF verifier, so the user only passes the BTF ID and
> flags, if any. For now, no flags are supported.
>
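In program terms this amounts to roughly the following (sketch; the struct,
section name and includes are made up for illustration, the selftests added
later in the series are authoritative):

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>     /* bpf_core_type_id_local() */
    #include "bpf_experimental.h"      /* declares bpf_kptr_alloc() */

    struct foo {
        int data;
    };

    SEC("tc")
    int alloc_example(struct __sk_buff *ctx)
    {
        struct foo *f;

        /* local_type_id must be a known constant at verification time */
        f = bpf_kptr_alloc(bpf_core_type_id_local(struct foo), 0);
        if (!f)
            return 0;
        f->data = 42;
        /* f is an acquired reference; it must eventually be freed or stored */
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";
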
> First, we use the new constant argument type support for kfuncs that
> enforces argument is a constant. We need to know the local type's BTF ID
> statically to enforce safety properties for the allocation. Next, we
> remember this and dynamically assign the return type. During that phase,
> we also query the actual size of the structure being allocated, and
> whether it is a struct type. If so, we stash the actual size for
> do_misc_fixups phase where we rewrite the first argument to be size
> instead of local type's BTF ID, which we can then pass on to the kernel
> allocator.
>
> This needs some additional support for kfuncs as we were not doing
> argument rewrites for them. The fixup has been moved inside
> fixup_kfunc_call itself to avoid polluting the huge do_misc_fixups,
> and delta, prog, and insn pointers are recalculated based on if any
> instructions were patched.
>
> The returned pointer needs to be handled specially as well. While
> normally, only struct pointers may be returned, a new internal kfunc
> flag __KF_RET_DYN_BTF is used to indicate the BTF is ascertained from
> arguments dynamically, hence it is now forced to be void * instead.
> For now, bpf_kptr_alloc is the only user of this support.
>
> Hence, allocations using bpf_kptr_alloc are type safe. Later patches
> will introduce constructor and destructor support to local kptrs
> allocated from this helper. This would allow embedding kernel objects
> like bpf_spin_lock, bpf_list_node, bpf_list_head inside a local kptr
> allocation, and ensuring they are correctly initialized before use.
>
> A new type flag is associated with PTR_TO_BTF_ID returned from
> bpf_kptr_alloc: MEM_TYPE_LOCAL. This indicates that the type of the
> memory is of a local type coming from program's BTF.
>
> The btf_struct_access mechanism is tuned to allow BPF_WRITE access to
> these allocated objects, so that programs can store data as usual in
> them. On following a pointer type inside such PTR_TO_BTF_ID, WALK_PTR
> sets the destination register as scalar instead. It would not be safe to
> recognize pointer types in local types. This can be changed in the
> future if it is allowed to embed kptrs inside such local kptrs.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/linux/bpf.h | 12 +-
> include/linux/bpf_verifier.h | 1 +
> include/linux/btf.h | 3 +
> kernel/bpf/btf.c | 8 +-
> kernel/bpf/helpers.c | 17 ++
> kernel/bpf/verifier.c | 156 +++++++++++++++---
> net/bpf/bpf_dummy_struct_ops.c | 5 +-
> net/ipv4/bpf_tcp_ca.c | 5 +-
> .../testing/selftests/bpf/bpf_experimental.h | 14 ++
> 9 files changed, 191 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 35c2e9caeb98..5c8bfb0eba17 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -486,6 +486,12 @@ enum bpf_type_flag {
> /* Size is known at compile time. */
> MEM_FIXED_SIZE = BIT(10 + BPF_BASE_TYPE_BITS),
>
> + /* MEM is of a type from program BTF, not kernel BTF. This is used to
> + * tag PTR_TO_BTF_ID allocated using bpf_kptr_alloc, since they have
> + * entirely different semantics.
> + */
> + MEM_TYPE_LOCAL = BIT(11 + BPF_BASE_TYPE_BITS),
> +
> __BPF_TYPE_FLAG_MAX,
> __BPF_TYPE_LAST_FLAG = __BPF_TYPE_FLAG_MAX - 1,
> };
> @@ -757,7 +763,8 @@ struct bpf_verifier_ops {
> const struct btf *btf,
> const struct btf_type *t, int off, int size,
> enum bpf_access_type atype,
> - u32 *next_btf_id, enum bpf_type_flag *flag);
> + u32 *next_btf_id, enum bpf_type_flag *flag,
> + bool local_type);
> };
>
> struct bpf_prog_offload_ops {
> @@ -1995,7 +2002,8 @@ static inline bool bpf_tracing_btf_ctx_access(int off, int size,
> int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
> const struct btf_type *t, int off, int size,
> enum bpf_access_type atype,
> - u32 *next_btf_id, enum bpf_type_flag *flag);
> + u32 *next_btf_id, enum bpf_type_flag *flag,
> + bool local_type);
> bool btf_struct_ids_match(struct bpf_verifier_log *log,
> const struct btf *btf, u32 id, int off,
> const struct btf *need_btf, u32 need_type_id,
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index c4d21568d192..c6d550978d63 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -403,6 +403,7 @@ struct bpf_insn_aux_data {
> */
> struct bpf_loop_inline_state loop_inline_state;
> };
> + u64 kptr_alloc_size; /* used to store size of local kptr allocation */
> u64 map_key_state; /* constant (32 bit) key tracking for maps */
> int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
> u32 seen; /* this insn was processed by the verifier at env->pass_cnt */
> diff --git a/include/linux/btf.h b/include/linux/btf.h
> index 9b62b8b2117e..fc35c932e89e 100644
> --- a/include/linux/btf.h
> +++ b/include/linux/btf.h
> @@ -52,6 +52,9 @@
> #define KF_SLEEPABLE (1 << 5) /* kfunc may sleep */
> #define KF_DESTRUCTIVE (1 << 6) /* kfunc performs destructive actions */
>
> +/* Internal kfunc flags, not meant for general use */
> +#define __KF_RET_DYN_BTF (1 << 7) /* kfunc returns dynamically ascertained PTR_TO_BTF_ID */
Is there going to be another func that returns a similar dynamic type?
We already have one such func, kptr_xchg. I don't see why we need this flag.
We can just compare func_id-s.
In this patch it will be just func_id == kfunc_ids[KF_kptr_alloc];
when more kfuncs become alloc-like we will just add a few ||.
> +
> struct btf;
> struct btf_member;
> struct btf_type;
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index 0fb045be3837..17977e0f4e09 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -5919,7 +5919,8 @@ static int btf_struct_walk(struct bpf_verifier_log *log, const struct btf *btf,
> int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
> const struct btf_type *t, int off, int size,
> enum bpf_access_type atype __maybe_unused,
> - u32 *next_btf_id, enum bpf_type_flag *flag)
> + u32 *next_btf_id, enum bpf_type_flag *flag,
> + bool local_type)
> {
> enum bpf_type_flag tmp_flag = 0;
> int err;
> @@ -5930,6 +5931,11 @@ int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
>
> switch (err) {
> case WALK_PTR:
> + /* For local types, the destination register cannot
> + * become a pointer again.
> + */
> + if (local_type)
> + return SCALAR_VALUE;
> /* If we found the pointer or scalar on t+off,
> * we're done.
> */
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index fc08035f14ed..d417aa4f0b22 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1696,10 +1696,27 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> }
> }
>
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> + "Global functions as their definitions will be in vmlinux BTF");
> +
> +void *bpf_kptr_alloc(u64 local_type_id__k, u64 flags)
> +{
> + /* Verifier patches local_type_id__k to size */
> + u64 size = local_type_id__k;
> +
> + if (flags)
> + return NULL;
> + return kmalloc(size, GFP_ATOMIC);
> +}
> +
> +__diag_pop();
> +
> BTF_SET8_START(tracing_btf_ids)
> #ifdef CONFIG_KEXEC_CORE
> BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
> #endif
> +BTF_ID_FLAGS(func, bpf_kptr_alloc, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
> BTF_SET8_END(tracing_btf_ids)
>
> static const struct btf_kfunc_id_set tracing_kfunc_set = {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ab91e5ca7e41..8f28aa7f1e8d 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -472,6 +472,11 @@ static bool type_may_be_null(u32 type)
> return type & PTR_MAYBE_NULL;
> }
>
> +static bool type_is_local(u32 type)
> +{
> + return type & MEM_TYPE_LOCAL;
> +}
> +
> static bool is_acquire_function(enum bpf_func_id func_id,
> const struct bpf_map *map)
> {
> @@ -4556,17 +4561,22 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
> return -EACCES;
> }
>
> - if (env->ops->btf_struct_access) {
> + /* For allocated PTR_TO_BTF_ID pointing to a local type, we cannot do
> + * btf_struct_access callback.
> + */
> + if (env->ops->btf_struct_access && !type_is_local(reg->type)) {
> ret = env->ops->btf_struct_access(&env->log, reg->btf, t,
> - off, size, atype, &btf_id, &flag);
> + off, size, atype, &btf_id, &flag,
> + false);
> } else {
> - if (atype != BPF_READ) {
> + /* It is allowed to write to pointer to a local type */
> + if (atype != BPF_READ && !type_is_local(reg->type)) {
> verbose(env, "only read is supported\n");
> return -EACCES;
> }
>
> ret = btf_struct_access(&env->log, reg->btf, t, off, size,
> - atype, &btf_id, &flag);
> + atype, &btf_id, &flag, type_is_local(reg->type));
imo it's cleaner to pass 'reg' instead of 'reg->btf',
so we don't have to pass another boolean.
And check type_is_local(reg) inside btf_struct_access().
> }
>
> if (ret < 0)
> @@ -4630,7 +4640,7 @@ static int check_ptr_to_map_access(struct bpf_verifier_env *env,
> return -EACCES;
> }
>
> - ret = btf_struct_access(&env->log, btf_vmlinux, t, off, size, atype, &btf_id, &flag);
> + ret = btf_struct_access(&env->log, btf_vmlinux, t, off, size, atype, &btf_id, &flag, false);
> if (ret < 0)
> return ret;
>
> @@ -7661,6 +7671,11 @@ static bool is_kfunc_destructive(struct bpf_kfunc_arg_meta *meta)
> return meta->kfunc_flags & KF_DESTRUCTIVE;
> }
>
> +static bool __is_kfunc_ret_dyn_btf(struct bpf_kfunc_arg_meta *meta)
> +{
> + return meta->kfunc_flags & __KF_RET_DYN_BTF;
> +}
> +
> static bool is_kfunc_arg_kptr_get(struct bpf_kfunc_arg_meta *meta, int arg)
> {
> return arg == 0 && (meta->kfunc_flags & KF_KPTR_GET);
> @@ -7751,6 +7766,24 @@ static u32 *reg2btf_ids[__BPF_REG_TYPE_MAX] = {
> #endif
> };
>
> +BTF_ID_LIST(special_kfuncs)
> +BTF_ID(func, bpf_kptr_alloc)
> +
> +enum bpf_special_kfuncs {
> + KF_SPECIAL_bpf_kptr_alloc,
> + KF_SPECIAL_MAX,
> +};
> +
> +static bool __is_kfunc_special(const struct btf *btf, u32 func_id, unsigned int kf_sp)
> +{
> + if (btf != btf_vmlinux || kf_sp >= KF_SPECIAL_MAX)
> + return false;
> + return func_id == special_kfuncs[kf_sp];
> +}
> +
> +#define is_kfunc_special(btf, func_id, func_name) \
> + __is_kfunc_special(btf, func_id, KF_SPECIAL_##func_name)
This looks like reinventing the wheel.
I'd think something similar to btf_tracing_ids[BTF_TRACING_TYPE_VMA] would work just as well.
It's less magic: no need for the above macro,
and the btf != btf_vmlinux check should really be explicit in the code
and done early and once.
> +
> enum kfunc_ptr_arg_types {
> KF_ARG_PTR_TO_CTX,
> KF_ARG_PTR_TO_BTF_ID, /* Also covers reg2btf_ids conversions */
> @@ -8120,20 +8153,55 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> mark_reg_unknown(env, regs, BPF_REG_0);
> mark_btf_func_reg_size(env, BPF_REG_0, t->size);
> } else if (btf_type_is_ptr(t)) {
> - ptr_type = btf_type_skip_modifiers(desc_btf, t->type,
> - &ptr_type_id);
> - if (!btf_type_is_struct(ptr_type)) {
> - ptr_type_name = btf_name_by_offset(desc_btf,
> - ptr_type->name_off);
> - verbose(env, "kernel function %s returns pointer type %s %s is not supported\n",
> - func_name, btf_type_str(ptr_type),
> - ptr_type_name);
> - return -EINVAL;
> - }
> + struct btf *ret_btf;
> + u32 ret_btf_id;
> +
> + ptr_type = btf_type_skip_modifiers(desc_btf, t->type, &ptr_type_id);
> mark_reg_known_zero(env, regs, BPF_REG_0);
> - regs[BPF_REG_0].btf = desc_btf;
> regs[BPF_REG_0].type = PTR_TO_BTF_ID;
> - regs[BPF_REG_0].btf_id = ptr_type_id;
> +
> + if (__is_kfunc_ret_dyn_btf(&meta)) {
just check meta.func_id == kfunc_ids[KF_kptr_alloc] instead?
> + const struct btf_type *ret_t;
> +
> + /* Currently, only bpf_kptr_alloc needs special handling */
> + if (!is_kfunc_special(meta.btf, meta.func_id, bpf_kptr_alloc) ||
same thing.
> + !meta.arg_constant.found || !btf_type_is_void(ptr_type)) {
> + verbose(env, "verifier internal error: misconfigured kfunc\n");
> + return -EFAULT;
> + }
> +
> + if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
> + verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
> + return -EINVAL;
> + }
> +
> + ret_btf = env->prog->aux->btf;
> + ret_btf_id = meta.arg_constant.value;
> +
> + ret_t = btf_type_by_id(ret_btf, ret_btf_id);
> + if (!ret_t || !__btf_type_is_struct(ret_t)) {
> + verbose(env, "local type ID %d passed to bpf_kptr_alloc does not refer to struct\n",
> + ret_btf_id);
> + return -EINVAL;
> + }
> + /* Remember this so that we can rewrite R1 as size in fixup_kfunc_call */
> + env->insn_aux_data[insn_idx].kptr_alloc_size = ret_t->size;
> + /* For now, since we hardcode prog->btf, also hardcode
> + * setting of this flag.
> + */
> + regs[BPF_REG_0].type |= MEM_TYPE_LOCAL;
> + } else {
> + if (!btf_type_is_struct(ptr_type)) {
> + ptr_type_name = btf_name_by_offset(desc_btf, ptr_type->name_off);
> + verbose(env, "kernel function %s returns pointer type %s %s is not supported\n",
> + func_name, btf_type_str(ptr_type), ptr_type_name);
> + return -EINVAL;
> + }
> + ret_btf = desc_btf;
> + ret_btf_id = ptr_type_id;
> + }
> + regs[BPF_REG_0].btf = ret_btf;
> + regs[BPF_REG_0].btf_id = ret_btf_id;
> if (is_kfunc_ret_null(&meta)) {
> regs[BPF_REG_0].type |= PTR_MAYBE_NULL;
> /* For mark_ptr_or_null_reg, see 93c230e3f5bd6 */
> @@ -14371,8 +14439,43 @@ static int fixup_call_args(struct bpf_verifier_env *env)
> return err;
> }
>
> +static int do_kfunc_fixups(struct bpf_verifier_env *env, struct bpf_insn *insn,
> + s32 imm, int insn_idx, int delta)
> +{
> + struct bpf_insn insn_buf[16];
> + struct bpf_prog *new_prog;
> + int cnt;
> +
> + /* No need to lookup btf, only vmlinux kfuncs are supported for special
> + * kfuncs handling. Hence when insn->off is zero, check if it is a
> + * special kfunc by hardcoding btf as btf_vmlinux.
> + */
> + if (!insn->off && is_kfunc_special(btf_vmlinux, insn->imm, bpf_kptr_alloc)) {
> + u64 local_type_size = env->insn_aux_data[insn_idx + delta].kptr_alloc_size;
> +
> + insn_buf[0] = BPF_MOV64_IMM(BPF_REG_1, local_type_size);
> + insn_buf[1] = *insn;
> + cnt = 2;
> +
> + new_prog = bpf_patch_insn_data(env, insn_idx + delta, insn_buf, cnt);
> + if (!new_prog)
> + return -ENOMEM;
> +
> + delta += cnt - 1;
> + insn = new_prog->insnsi + insn_idx + delta;
> + goto patch_call_imm;
> + }
> +
> + insn->imm = imm;
> + return 0;
> +patch_call_imm:
> + insn->imm = imm;
> + return cnt - 1;
> +}
> +
> static int fixup_kfunc_call(struct bpf_verifier_env *env,
> - struct bpf_insn *insn)
> + struct bpf_insn *insn,
> + int insn_idx, int delta)
> {
> const struct bpf_kfunc_desc *desc;
>
> @@ -14391,9 +14494,7 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env,
> return -EFAULT;
> }
>
> - insn->imm = desc->imm;
> -
> - return 0;
> + return do_kfunc_fixups(env, insn, desc->imm, insn_idx, delta);
> }
>
> /* Do various post-verification rewrites in a single program pass.
> @@ -14534,9 +14635,18 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> if (insn->src_reg == BPF_PSEUDO_CALL)
> continue;
> if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
> - ret = fixup_kfunc_call(env, insn);
> - if (ret)
> + ret = fixup_kfunc_call(env, insn, i, delta);
> + if (ret < 0)
> return ret;
> + /* If ret > 0, fixup_kfunc_call did some instruction
> + * rewrites. Increment delta, reload prog and insn,
> + * env->prog is already set by it to the new_prog.
> + */
> + if (ret) {
> + delta += ret;
> + prog = env->prog;
> + insn = prog->insnsi + i + delta;
> + }
See how Yonghong did it:
https://lore.kernel.org/all/20220807175121.4179410-1-yhs@fb.com/
It's cleaner to patch and adjust in the same place, instead of patching in one
place and adjusting in another.
> continue;
> }
>
> diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c
> index e78dadfc5829..fa572714c6f6 100644
> --- a/net/bpf/bpf_dummy_struct_ops.c
> +++ b/net/bpf/bpf_dummy_struct_ops.c
> @@ -160,7 +160,8 @@ static int bpf_dummy_ops_btf_struct_access(struct bpf_verifier_log *log,
> const struct btf_type *t, int off,
> int size, enum bpf_access_type atype,
> u32 *next_btf_id,
> - enum bpf_type_flag *flag)
> + enum bpf_type_flag *flag,
> + bool local_type)
> {
> const struct btf_type *state;
> s32 type_id;
> @@ -178,7 +179,7 @@ static int bpf_dummy_ops_btf_struct_access(struct bpf_verifier_log *log,
> }
>
> err = btf_struct_access(log, btf, t, off, size, atype, next_btf_id,
> - flag);
> + flag, false);
> if (err < 0)
> return err;
>
> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> index 85a9e500c42d..869b6266833c 100644
> --- a/net/ipv4/bpf_tcp_ca.c
> +++ b/net/ipv4/bpf_tcp_ca.c
> @@ -73,13 +73,14 @@ static int bpf_tcp_ca_btf_struct_access(struct bpf_verifier_log *log,
> const struct btf_type *t, int off,
> int size, enum bpf_access_type atype,
> u32 *next_btf_id,
> - enum bpf_type_flag *flag)
> + enum bpf_type_flag *flag,
> + bool local_type)
> {
> size_t end;
>
> if (atype == BPF_READ)
> return btf_struct_access(log, btf, t, off, size, atype, next_btf_id,
> - flag);
> + flag, false);
>
> if (t != tcp_sock_type) {
> bpf_log(log, "only read is supported\n");
> diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> index ea1b3b1839d1..bddd77093d1e 100644
> --- a/tools/testing/selftests/bpf/bpf_experimental.h
> +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> @@ -18,4 +18,18 @@ struct bpf_list_node {
> #endif
>
> #ifndef __KERNEL__
> +
> +/* Description
> + * Allocates a local kptr of type represented by 'local_type_id' in program
> + * BTF. User may use the bpf_core_type_id_local macro to pass the type ID
> + * of a struct in program BTF.
> + *
> + * The 'local_type_id' parameter must be a known constant.
> + * The 'flags' parameter must be 0.
> + * Returns
> + * A local kptr corresponding to passed in 'local_type_id', or NULL on
> + * failure.
> + */
> +void *bpf_kptr_alloc(__u64 local_type_id, __u64 flags) __ksym;
> +
> #endif
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 15/32] bpf: Add helper macro bpf_expr_for_each_reg_in_vstate
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 15/32] bpf: Add helper macro bpf_expr_for_each_reg_in_vstate Kumar Kartikeya Dwivedi
@ 2022-09-07 23:48 ` Alexei Starovoitov
0 siblings, 0 replies; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-07 23:48 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Sun, Sep 04, 2022 at 10:41:28PM +0200, Kumar Kartikeya Dwivedi wrote:
> For a lot of use cases in future patches, we will want to modify the
> state of registers that are part of the same 'group' (e.g. same ref_obj_id). It
> won't just be limited to releasing reference state, but will also include setting
> a type flag dynamically based on certain actions, etc.
>
> Hence, we need a way to easily pass a callback to the function that
> iterates over all registers in current bpf_verifier_state in all frames
> upto (and including) the curframe.
>
> While in C++ we would be able to easily use a lambda to pass state and
> the callback together, sadly we aren't using C++ in the kernel. The next
> best thing to avoid defining a function for each case seems like
> statement expressions in GNU C. The kernel already uses them heavily,
> hence they can be passed to the macro in the style of a lambda. The
> statement expression will then be substituted in the for loop bodies.
>
> Variables __state and __reg are set to current bpf_func_state and reg
> for each invocation of the expression inside the passed in verifier
> state.
>
> Then, convert mark_ptr_or_null_regs, clear_all_pkt_pointers,
> release_reference, find_good_pkt_pointers, find_equal_scalars to
> use bpf_expr_for_each_reg_in_vstate.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/linux/bpf_verifier.h | 21 ++++++
> kernel/bpf/verifier.c | 135 ++++++++---------------------------
> 2 files changed, 49 insertions(+), 107 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index c6d550978d63..73d9443d0074 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -354,6 +354,27 @@ struct bpf_verifier_state {
> iter < frame->allocated_stack / BPF_REG_SIZE; \
> iter++, reg = bpf_get_spilled_reg(iter, frame))
>
> +/* Invoke __expr over registers in __vst, setting __state and __reg */
> +#define bpf_expr_for_each_reg_in_vstate(__vst, __state, __reg, __expr) \
Very nice.
I renamed it to bpf_for_each_reg_in_vstate to make it less verbose
and more consistent with existing bpf_for_each_spilled_reg.
And applied to bpf-next.
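For reference, a converted call site (release_reference) ends up looking
roughly like this with the renamed macro (sketch from memory; the applied
patch is authoritative):

    struct bpf_func_state *state;
    struct bpf_reg_state *reg;

    bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({
        if (reg->ref_obj_id == ref_obj_id)
            __mark_reg_unknown(env, reg);
    }));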
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables Kumar Kartikeya Dwivedi
@ 2022-09-08 0:27 ` Alexei Starovoitov
2022-09-08 0:39 ` Kumar Kartikeya Dwivedi
2022-09-08 1:00 ` Kumar Kartikeya Dwivedi
2022-09-09 8:13 ` Dave Marchevsky
1 sibling, 2 replies; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-08 0:27 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Sun, Sep 04, 2022 at 10:41:34PM +0200, Kumar Kartikeya Dwivedi wrote:
> Global variables reside in maps accessible using direct_value_addr
> callbacks, so giving each load instruction's rewrite a unique reg->id
> disallows us from holding locks which are global.
>
> This is not great, so refactor the active_spin_lock into two separate
> fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> enough to allow it for global variables, map lookups, and local kptr
> registers at the same time.
>
> Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> reg->map_ptr or reg->btf pointer of the register used for locking spin
> lock. But the active_spin_lock_id also needs to be compared to ensure
> whether bpf_spin_unlock is for the same register.
>
> Next, pseudo load instructions are not given a unique reg->id, as they
> are doing lookup for the same map value (max_entries is never greater
> than 1).
>
> Essentially, we consider that the tuple of (active_spin_lock_ptr,
> active_spin_lock_id) will always be unique for any kind of argument to
> bpf_spin_{lock,unlock}.
>
> Note that this can be extended in the future to also remember offset
> used for locking, so that we can introduce multiple bpf_spin_lock fields
> in the same allocation.
>
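Concretely, the kind of program this is meant to allow is roughly the
following (sketch; variable and section names are made up):

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    struct bpf_spin_lock glock;   /* global, backed by the .bss/.data array map */
    int gcount;

    SEC("tc")
    int lock_example(void *ctx)
    {
        bpf_spin_lock(&glock);
        gcount++;
        bpf_spin_unlock(&glock);
        return 0;
    }
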
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/linux/bpf_verifier.h | 3 ++-
> kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++-----------
> 2 files changed, 29 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 2a9dcefca3b6..00c21ad6f61c 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -348,7 +348,8 @@ struct bpf_verifier_state {
> u32 branches;
> u32 insn_idx;
> u32 curframe;
> - u32 active_spin_lock;
> + void *active_spin_lock_ptr;
> + u32 active_spin_lock_id;
{map, id=0} is indeed enough to distinguish different global locks and
{map, id} for locks in map values,
but what is 'btf' for?
What is the case where reg->map_ptr is not set?
Locks in allocated objects?
Feels too early to add that in this patch.
Also, this patch is heavily influenced by Dave's patch and
its realization that max_entries==1 simplifies the logic.
I think you gotta give him more credit.
Maybe as much as his SOB and authorship.
> bool speculative;
>
> /* first and last insn idx of this verifier state */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index b1754fd69f7d..ed19e4036b0a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1202,7 +1202,8 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
> }
> dst_state->speculative = src->speculative;
> dst_state->curframe = src->curframe;
> - dst_state->active_spin_lock = src->active_spin_lock;
> + dst_state->active_spin_lock_ptr = src->active_spin_lock_ptr;
> + dst_state->active_spin_lock_id = src->active_spin_lock_id;
> dst_state->branches = src->branches;
> dst_state->parent = src->parent;
> dst_state->first_insn_idx = src->first_insn_idx;
> @@ -5504,22 +5505,35 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
> return -EINVAL;
> }
> if (is_lock) {
> - if (cur->active_spin_lock) {
> + if (cur->active_spin_lock_ptr) {
> verbose(env,
> "Locking two bpf_spin_locks are not allowed\n");
> return -EINVAL;
> }
> - cur->active_spin_lock = reg->id;
> + if (map)
> + cur->active_spin_lock_ptr = map;
> + else
> + cur->active_spin_lock_ptr = btf;
> + cur->active_spin_lock_id = reg->id;
> } else {
> - if (!cur->active_spin_lock) {
> + void *ptr;
> +
> + if (map)
> + ptr = map;
> + else
> + ptr = btf;
> +
> + if (!cur->active_spin_lock_ptr) {
> verbose(env, "bpf_spin_unlock without taking a lock\n");
> return -EINVAL;
> }
> - if (cur->active_spin_lock != reg->id) {
> + if (cur->active_spin_lock_ptr != ptr ||
> + cur->active_spin_lock_id != reg->id) {
> verbose(env, "bpf_spin_unlock of different lock\n");
> return -EINVAL;
> }
> - cur->active_spin_lock = 0;
> + cur->active_spin_lock_ptr = NULL;
> + cur->active_spin_lock_id = 0;
> }
> return 0;
> }
> @@ -11207,8 +11221,8 @@ static int check_ld_imm(struct bpf_verifier_env *env, struct bpf_insn *insn)
> insn->src_reg == BPF_PSEUDO_MAP_IDX_VALUE) {
> dst_reg->type = PTR_TO_MAP_VALUE;
> dst_reg->off = aux->map_off;
> - if (map_value_has_spin_lock(map))
> - dst_reg->id = ++env->id_gen;
> + WARN_ON_ONCE(map->max_entries != 1);
> + /* We want reg->id to be same (0) as map_value is not distinct */
> } else if (insn->src_reg == BPF_PSEUDO_MAP_FD ||
> insn->src_reg == BPF_PSEUDO_MAP_IDX) {
> dst_reg->type = CONST_PTR_TO_MAP;
> @@ -11286,7 +11300,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
> return err;
> }
>
> - if (env->cur_state->active_spin_lock) {
> + if (env->cur_state->active_spin_lock_ptr) {
> verbose(env, "BPF_LD_[ABS|IND] cannot be used inside bpf_spin_lock-ed region\n");
> return -EINVAL;
> }
> @@ -12566,7 +12580,8 @@ static bool states_equal(struct bpf_verifier_env *env,
> if (old->speculative && !cur->speculative)
> return false;
>
> - if (old->active_spin_lock != cur->active_spin_lock)
> + if (old->active_spin_lock_ptr != cur->active_spin_lock_ptr ||
> + old->active_spin_lock_id != cur->active_spin_lock_id)
> return false;
>
> /* for states to be equal callsites have to be the same
> @@ -13213,7 +13228,7 @@ static int do_check(struct bpf_verifier_env *env)
> return -EINVAL;
> }
>
> - if (env->cur_state->active_spin_lock &&
> + if (env->cur_state->active_spin_lock_ptr &&
> (insn->src_reg == BPF_PSEUDO_CALL ||
> insn->imm != BPF_FUNC_spin_unlock)) {
> verbose(env, "function calls are not allowed while holding a lock\n");
> @@ -13250,7 +13265,7 @@ static int do_check(struct bpf_verifier_env *env)
> return -EINVAL;
> }
>
> - if (env->cur_state->active_spin_lock) {
> + if (env->cur_state->active_spin_lock_ptr) {
> verbose(env, "bpf_spin_unlock is missing\n");
> return -EINVAL;
> }
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model Kumar Kartikeya Dwivedi
@ 2022-09-08 0:34 ` Alexei Starovoitov
2022-09-08 2:39 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-08 0:34 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Sun, Sep 04, 2022 at 10:41:29PM +0200, Kumar Kartikeya Dwivedi wrote:
> Add the concept of a memory object model to BPF verifier.
>
> What this means is that there are now some types that are not just plain
> old data, but require explicit action when they are allocated on a
> storage, before their lifetime is considered as started and before it is
> allowed for them to escape the program. The verifier will track state of
> such fields during the various phases of the object lifetime, where it
> can be sure about certain invariants.
>
> Some inspiration is taken from existing memory object and lifetime
> models in C and C++ which have stood the test of time. See [0], [1], [2]
> for more information, to find some similarities. In the future, the
> separation of storage and object lifetime may be made more stark by
> allowing to change effective type of storage allocated for a local kptr.
> For now, that has been left out. It is only possible when verifier
> understands when the program has exclusive access to storage, and when
> the object it is hosting is no longer accessible to other CPUs.
>
> This can be useful to maintain size-class based freelists inside BPF
> programs and reuse storage of same size for different types. This would
> only be safe to allow if verifier can ensure that while storage lifetime
> has not ended, object lifetime for the current type has. This
> necessiates separating the two and accomodating a simple model to track
> object lifetime (composed recursively of more objects whose lifetime
> is individually tracked).
>
> Every time a BPF program allocates such non-trivial types, it must call a
> set of constructors on the object to fully begin its lifetime before it
> can make use of the pointer to this type. If the program does not do so,
> the verifier will complain and lead to failure in loading of the
> program.
>
> Similarly, when ending the lifetime of such types, it is required to
> fully destruct the object using a series of destructors for each
> non-trivial member, before finally freeing the storage the object is
> making use of.
>
> During both the construction and destruction phase, there can be only
> one program that can own and access such an object, hence there is no
> need for any explicit synchronization. The single ownership of such
> objects makes it easy for the verifier to enforce the safety around the
> beginning and end of the lifetime without resorting to dynamic checks.
>
> When there are multiple fields needing construction or destruction, the
> program must call their constructors in ascending order of the offset of
> the field.
>
> For example, consider the following type (support for such fields will
> be added in subsequent patches):
>
> struct data {
> struct bpf_spin_lock lock;
> struct bpf_list_head list __contains(struct, foo, node);
> int data;
> };
>
> struct data *d = bpf_kptr_alloc(...);
> if (!d) { ... }
>
> Now, the type of d would be PTR_TO_BTF_ID | MEM_TYPE_LOCAL |
> OBJ_CONSTRUCTING, as it needs two constructor calls (for lock and head),
> before it can be considered fully initialized and alive.
>
> Hence, we must do (in order of field offsets):
>
> bpf_spin_lock_init(&d->lock);
> bpf_list_head_init(&d->list);
All sounds great in theory, but I think it's unnecessarily complex at this point.
There is still a need for __bpf_list_head_init_zeroed, as seen in later patches.
So we don't need all these verifier-enforced constructors _today_.
Zero init of everything works.
It's the case for list_head, list_node, spin_lock, rb_root, rb_node.
Pretty much all new data structures will work with zero init
and all of them need async dtors.
The verifier cannot help during destruction.
dtors have to be specified declaratively in a bpf prog for new types
and as known kfuncs for list_head/node, rb_root/node.
There will be unfreed linked lists in maps, and the later patches handle that
without OBJ_DESTRUCTING.
So let's postpone this patch.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 18/32] bpf: Support bpf_spin_lock in local kptrs
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 18/32] bpf: Support bpf_spin_lock " Kumar Kartikeya Dwivedi
@ 2022-09-08 0:35 ` Alexei Starovoitov
2022-09-09 8:25 ` Dave Marchevsky
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-08 0:35 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Sun, Sep 04, 2022 at 10:41:31PM +0200, Kumar Kartikeya Dwivedi wrote:
> diff --git a/include/linux/poison.h b/include/linux/poison.h
> index d62ef5a6b4e9..753e00b81acf 100644
> --- a/include/linux/poison.h
> +++ b/include/linux/poison.h
> @@ -81,4 +81,7 @@
> /********** net/core/page_pool.c **********/
> #define PP_SIGNATURE (0x40 + POISON_POINTER_DELTA)
>
> +/********** kernel/bpf/helpers.c **********/
> +#define BPF_PTR_POISON ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
> +
That was part of Dave's patch set as well.
Please keep his SOB and authorship and keep it as separate patch.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-08 0:27 ` Alexei Starovoitov
@ 2022-09-08 0:39 ` Kumar Kartikeya Dwivedi
2022-09-08 0:55 ` Alexei Starovoitov
2022-09-08 1:00 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 0:39 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, 8 Sept 2022 at 02:27, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Sep 04, 2022 at 10:41:34PM +0200, Kumar Kartikeya Dwivedi wrote:
> > Global variables reside in maps accessible using direct_value_addr
> > callbacks, so giving each load instruction's rewrite a unique reg->id
> > disallows us from holding locks which are global.
> >
> > This is not great, so refactor the active_spin_lock into two separate
> > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > enough to allow it for global variables, map lookups, and local kptr
> > registers at the same time.
> >
> > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > lock. But the active_spin_lock_id also needs to be compared to ensure
> > whether bpf_spin_unlock is for the same register.
> >
> > Next, pseudo load instructions are not given a unique reg->id, as they
> > are doing lookup for the same map value (max_entries is never greater
> > than 1).
> >
> > Essentially, we consider that the tuple of (active_spin_lock_ptr,
> > active_spin_lock_id) will always be unique for any kind of argument to
> > bpf_spin_{lock,unlock}.
> >
> > Note that this can be extended in the future to also remember offset
> > used for locking, so that we can introduce multiple bpf_spin_lock fields
> > in the same allocation.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > include/linux/bpf_verifier.h | 3 ++-
> > kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++-----------
> > 2 files changed, 29 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index 2a9dcefca3b6..00c21ad6f61c 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -348,7 +348,8 @@ struct bpf_verifier_state {
> > u32 branches;
> > u32 insn_idx;
> > u32 curframe;
> > - u32 active_spin_lock;
> > + void *active_spin_lock_ptr;
> > + u32 active_spin_lock_id;
>
> {map, id=0} is indeed enough to distinguish different global locks and
> {map, id} for locks in map values,
> but what 'btf' is for?
> When is the case when reg->map_ptr is not set?
> locks in allocated objects?
> Feels too early to add that in this patch.
>
> Also this patch is heavily influenced by Dave's patch with
> a realization that max_entries==1 simplifies the logic.
You mean this one?
https://lore.kernel.org/bpf/20220830172759.4069786-12-davemarchevsky@fb.com
> I think you gotta give him more credit.
> Maybe as much as his SOB and authorship.
>
I don't mind sharing the credit where due, but for the record:
15/8: pushed my prototype:
https://github.com/kkdwivedi/linux/commits/bpf-list-15-08-22
15/8: patch with roughly the same logic as above, committed 24 days ago
https://github.com/kkdwivedi/linux/commit/4a152df6a1f6e096616e02c9b4dd54c5d5c902a1
16/8: Our meeting, described the same idea to you.
17/8: Published notes,
https://lore.kernel.org/bpf/CAP01T74U30+yeBHEgmgzTJ-XYxZ0zj71kqCDJtTH9YQNfTK+Xw@mail.gmail.com
19/8: Described the same thing in detail again in response to Dave's question:
> This ergonomics idea doesn't solve the map-in-map issue, I'm still unsure
> how to statically verify lock in that case. Have you had a chance to think
> about it further?
>
at https://lore.kernel.org/bpf/CAP01T77PBfQ8QvgU-ezxGgUh8WmSYL3wsMT7yo4tGuZRW0qLnQ@mail.gmail.com
30/8: Dave sends patch with this idea:
https://lore.kernel.org/bpf/20220830172759.4069786-11-davemarchevsky@fb.com
What did I miss?
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-08 0:39 ` Kumar Kartikeya Dwivedi
@ 2022-09-08 0:55 ` Alexei Starovoitov
0 siblings, 0 replies; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-08 0:55 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, Sep 08, 2022 at 02:39:46AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Thu, 8 Sept 2022 at 02:27, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sun, Sep 04, 2022 at 10:41:34PM +0200, Kumar Kartikeya Dwivedi wrote:
> > > Global variables reside in maps accessible using direct_value_addr
> > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > disallows us from holding locks which are global.
> > >
> > > This is not great, so refactor the active_spin_lock into two separate
> > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > enough to allow it for global variables, map lookups, and local kptr
> > > registers at the same time.
> > >
> > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > whether bpf_spin_unlock is for the same register.
> > >
> > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > are doing lookup for the same map value (max_entries is never greater
> > > than 1).
> > >
> > > Essentially, we consider that the tuple of (active_spin_lock_ptr,
> > > active_spin_lock_id) will always be unique for any kind of argument to
> > > bpf_spin_{lock,unlock}.
> > >
> > > Note that this can be extended in the future to also remember offset
> > > used for locking, so that we can introduce multiple bpf_spin_lock fields
> > > in the same allocation.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > > include/linux/bpf_verifier.h | 3 ++-
> > > kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++-----------
> > > 2 files changed, 29 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > > index 2a9dcefca3b6..00c21ad6f61c 100644
> > > --- a/include/linux/bpf_verifier.h
> > > +++ b/include/linux/bpf_verifier.h
> > > @@ -348,7 +348,8 @@ struct bpf_verifier_state {
> > > u32 branches;
> > > u32 insn_idx;
> > > u32 curframe;
> > > - u32 active_spin_lock;
> > > + void *active_spin_lock_ptr;
> > > + u32 active_spin_lock_id;
> >
> > {map, id=0} is indeed enough to distinguish different global locks and
> > {map, id} for locks in map values,
> > but what 'btf' is for?
> > When is the case when reg->map_ptr is not set?
> > locks in allocated objects?
> > Feels too early to add that in this patch.
> >
> > Also this patch is heavily influenced by Dave's patch with
> > a realization that max_entries==1 simplifies the logic.
>
> You mean this one?
> https://lore.kernel.org/bpf/20220830172759.4069786-12-davemarchevsky@fb.com
>
> > I think you gotta give him more credit.
> > Maybe as much as his SOB and authorship.
> >
>
> Don't mind sharing the credit where due, but for the record:
>
> 15/8: pushed my prototype:
> https://github.com/kkdwivedi/linux/commits/bpf-list-15-08-22
> 15/8: patch with roughly the same logic as above, comitted 24 days ago
> https://github.com/kkdwivedi/linux/commit/4a152df6a1f6e096616e02c9b4dd54c5d5c902a1
> 16/8: Our meeting, described the same idea to you.
> 17/8: Published notes,
> https://lore.kernel.org/bpf/CAP01T74U30+yeBHEgmgzTJ-XYxZ0zj71kqCDJtTH9YQNfTK+Xw@mail.gmail.com
> 19/8: Described the same thing in detail again in response to Dave's question:
> > This ergonomics idea doesn't solve the map-in-map issue, I'm still unsure
> > how to statically verify lock in that case. Have you had a chance to think
> > about it further?
> >
> at https://lore.kernel.org/bpf/CAP01T77PBfQ8QvgU-ezxGgUh8WmSYL3wsMT7yo4tGuZRW0qLnQ@mail.gmail.com
> 30/8: Dave sends patch with this idea:
> https://lore.kernel.org/bpf/20220830172759.4069786-11-davemarchevsky@fb.com
>
> What did I miss?
Just that I saw Dave's patch first. I glanced over your 8-22 private
branch and simply missed that patch; the github UI is not for everyone.
As for that notes thread, it's hard to connect words to patches.
Re-reading it now I see what you mean.
Feel free to keep the authorship for this one.
Please answer the btf question though.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-08 0:27 ` Alexei Starovoitov
2022-09-08 0:39 ` Kumar Kartikeya Dwivedi
@ 2022-09-08 1:00 ` Kumar Kartikeya Dwivedi
2022-09-08 1:08 ` Alexei Starovoitov
1 sibling, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 1:00 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, 8 Sept 2022 at 02:27, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Sep 04, 2022 at 10:41:34PM +0200, Kumar Kartikeya Dwivedi wrote:
> > Global variables reside in maps accessible using direct_value_addr
> > callbacks, so giving each load instruction's rewrite a unique reg->id
> > disallows us from holding locks which are global.
> >
> > This is not great, so refactor the active_spin_lock into two separate
> > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > enough to allow it for global variables, map lookups, and local kptr
> > registers at the same time.
> >
> > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > lock. But the active_spin_lock_id also needs to be compared to ensure
> > whether bpf_spin_unlock is for the same register.
> >
> > Next, pseudo load instructions are not given a unique reg->id, as they
> > are doing lookup for the same map value (max_entries is never greater
> > than 1).
> >
> > Essentially, we consider that the tuple of (active_spin_lock_ptr,
> > active_spin_lock_id) will always be unique for any kind of argument to
> > bpf_spin_{lock,unlock}.
> >
> > Note that this can be extended in the future to also remember offset
> > used for locking, so that we can introduce multiple bpf_spin_lock fields
> > in the same allocation.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > include/linux/bpf_verifier.h | 3 ++-
> > kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++-----------
> > 2 files changed, 29 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index 2a9dcefca3b6..00c21ad6f61c 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -348,7 +348,8 @@ struct bpf_verifier_state {
> > u32 branches;
> > u32 insn_idx;
> > u32 curframe;
> > - u32 active_spin_lock;
> > + void *active_spin_lock_ptr;
> > + u32 active_spin_lock_id;
>
> {map, id=0} is indeed enough to distinguish different global locks and
> {map, id} for locks in map values,
> but what 'btf' is for?
> When is the case when reg->map_ptr is not set?
> locks in allocated objects?
> Feels too early to add that in this patch.
>
It makes the active_spin_lock check simpler: active_spin_lock_ptr being
non-NULL is enough to indicate that a lock is held. We don't have to
check both ptr and id everywhere; we only need to compare both when
verifying that the lock is in the same allocation as the reg.
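For reference, the scheme boils down to roughly the following (a sketch
using the names from the patch; the actual verifier code may differ in
detail):

/* A lock is held iff the ptr half of the tuple is set. */
static bool in_locked_region(struct bpf_verifier_state *cur)
{
	return cur->active_spin_lock_ptr != NULL;
}

/* bpf_spin_unlock must see the same (ptr, id) tuple that was recorded
 * by bpf_spin_lock for the locking register.
 */
static bool lock_matches_reg(struct bpf_verifier_state *cur,
			     struct bpf_reg_state *reg)
{
	void *ptr = reg->type == PTR_TO_MAP_VALUE ?
		    (void *)reg->map_ptr : (void *)reg->btf;

	return cur->active_spin_lock_ptr == ptr &&
	       cur->active_spin_lock_id == reg->id;
}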
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-08 1:00 ` Kumar Kartikeya Dwivedi
@ 2022-09-08 1:08 ` Alexei Starovoitov
2022-09-08 1:15 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-08 1:08 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Wed, Sep 7, 2022 at 6:01 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Thu, 8 Sept 2022 at 02:27, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sun, Sep 04, 2022 at 10:41:34PM +0200, Kumar Kartikeya Dwivedi wrote:
> > > Global variables reside in maps accessible using direct_value_addr
> > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > disallows us from holding locks which are global.
> > >
> > > This is not great, so refactor the active_spin_lock into two separate
> > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > enough to allow it for global variables, map lookups, and local kptr
> > > registers at the same time.
> > >
> > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > whether bpf_spin_unlock is for the same register.
> > >
> > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > are doing lookup for the same map value (max_entries is never greater
> > > than 1).
> > >
> > > Essentially, we consider that the tuple of (active_spin_lock_ptr,
> > > active_spin_lock_id) will always be unique for any kind of argument to
> > > bpf_spin_{lock,unlock}.
> > >
> > > Note that this can be extended in the future to also remember offset
> > > used for locking, so that we can introduce multiple bpf_spin_lock fields
> > > in the same allocation.
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > > include/linux/bpf_verifier.h | 3 ++-
> > > kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++-----------
> > > 2 files changed, 29 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > > index 2a9dcefca3b6..00c21ad6f61c 100644
> > > --- a/include/linux/bpf_verifier.h
> > > +++ b/include/linux/bpf_verifier.h
> > > @@ -348,7 +348,8 @@ struct bpf_verifier_state {
> > > u32 branches;
> > > u32 insn_idx;
> > > u32 curframe;
> > > - u32 active_spin_lock;
> > > + void *active_spin_lock_ptr;
> > > + u32 active_spin_lock_id;
> >
> > {map, id=0} is indeed enough to distinguish different global locks and
> > {map, id} for locks in map values,
> > but what 'btf' is for?
> > When is the case when reg->map_ptr is not set?
> > locks in allocated objects?
> > Feels too early to add that in this patch.
> >
>
> It makes active_spin_lock check simpler, just checking
> active_spin_lock_ptr that to be non-NULL indicates lock is held. Don't
> have to always check both ptr and id, only need to compare both when
> verifying that lock is in the same allocation as reg.
Not following. There is always a non-null reg->map_ptr when we come
down this path.
At least in the current state of the verifier.
So it never assigns that btf, as far as I can see.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-08 1:08 ` Alexei Starovoitov
@ 2022-09-08 1:15 ` Kumar Kartikeya Dwivedi
2022-09-08 2:39 ` Alexei Starovoitov
0 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 1:15 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, 8 Sept 2022 at 03:09, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Sep 7, 2022 at 6:01 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > On Thu, 8 Sept 2022 at 02:27, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Sun, Sep 04, 2022 at 10:41:34PM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > Global variables reside in maps accessible using direct_value_addr
> > > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > > disallows us from holding locks which are global.
> > > >
> > > > This is not great, so refactor the active_spin_lock into two separate
> > > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > > enough to allow it for global variables, map lookups, and local kptr
> > > > registers at the same time.
> > > >
> > > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > > whether bpf_spin_unlock is for the same register.
> > > >
> > > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > > are doing lookup for the same map value (max_entries is never greater
> > > > than 1).
> > > >
> > > > Essentially, we consider that the tuple of (active_spin_lock_ptr,
> > > > active_spin_lock_id) will always be unique for any kind of argument to
> > > > bpf_spin_{lock,unlock}.
> > > >
> > > > Note that this can be extended in the future to also remember offset
> > > > used for locking, so that we can introduce multiple bpf_spin_lock fields
> > > > in the same allocation.
> > > >
> > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > ---
> > > > include/linux/bpf_verifier.h | 3 ++-
> > > > kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++-----------
> > > > 2 files changed, 29 insertions(+), 13 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > > > index 2a9dcefca3b6..00c21ad6f61c 100644
> > > > --- a/include/linux/bpf_verifier.h
> > > > +++ b/include/linux/bpf_verifier.h
> > > > @@ -348,7 +348,8 @@ struct bpf_verifier_state {
> > > > u32 branches;
> > > > u32 insn_idx;
> > > > u32 curframe;
> > > > - u32 active_spin_lock;
> > > > + void *active_spin_lock_ptr;
> > > > + u32 active_spin_lock_id;
> > >
> > > {map, id=0} is indeed enough to distinguish different global locks and
> > > {map, id} for locks in map values,
> > > but what 'btf' is for?
> > > When is the case when reg->map_ptr is not set?
> > > locks in allocated objects?
> > > Feels too early to add that in this patch.
> > >
> >
> > It makes active_spin_lock check simpler, just checking
> > active_spin_lock_ptr that to be non-NULL indicates lock is held. Don't
> > have to always check both ptr and id, only need to compare both when
> > verifying that lock is in the same allocation as reg.
>
> Not following. There is always non-null reg->map_ptr when
> we come down this path.
> At least in the current state of the verifier.
> So it never assigns that btf afacs.
The map is only set when reg->type == PTR_TO_MAP_VALUE; otherwise, for
local kptrs (the else branch), we use btf = reg->btf and the map ptr is
NULL.
See patch 18, which already added spin lock support for local kptrs.
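Roughly, the shape I mean is the following (a sketch, not the exact
code from patch 18):

if (reg->type == PTR_TO_MAP_VALUE) {
	cur->active_spin_lock_ptr = reg->map_ptr;
} else {
	/* local kptr: reg->map_ptr is NULL, so identify the
	 * allocation by its BTF instead.
	 */
	cur->active_spin_lock_ptr = reg->btf;
}
cur->active_spin_lock_id = reg->id;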
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-08 1:15 ` Kumar Kartikeya Dwivedi
@ 2022-09-08 2:39 ` Alexei Starovoitov
0 siblings, 0 replies; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-08 2:39 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Wed, Sep 7, 2022 at 6:15 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Thu, 8 Sept 2022 at 03:09, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Sep 7, 2022 at 6:01 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > >
> > > On Thu, 8 Sept 2022 at 02:27, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Sun, Sep 04, 2022 at 10:41:34PM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > > Global variables reside in maps accessible using direct_value_addr
> > > > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > > > disallows us from holding locks which are global.
> > > > >
> > > > > This is not great, so refactor the active_spin_lock into two separate
> > > > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > > > enough to allow it for global variables, map lookups, and local kptr
> > > > > registers at the same time.
> > > > >
> > > > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > > > whether bpf_spin_unlock is for the same register.
> > > > >
> > > > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > > > are doing lookup for the same map value (max_entries is never greater
> > > > > than 1).
> > > > >
> > > > > Essentially, we consider that the tuple of (active_spin_lock_ptr,
> > > > > active_spin_lock_id) will always be unique for any kind of argument to
> > > > > bpf_spin_{lock,unlock}.
> > > > >
> > > > > Note that this can be extended in the future to also remember offset
> > > > > used for locking, so that we can introduce multiple bpf_spin_lock fields
> > > > > in the same allocation.
> > > > >
> > > > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > > > ---
> > > > > include/linux/bpf_verifier.h | 3 ++-
> > > > > kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++-----------
> > > > > 2 files changed, 29 insertions(+), 13 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > > > > index 2a9dcefca3b6..00c21ad6f61c 100644
> > > > > --- a/include/linux/bpf_verifier.h
> > > > > +++ b/include/linux/bpf_verifier.h
> > > > > @@ -348,7 +348,8 @@ struct bpf_verifier_state {
> > > > > u32 branches;
> > > > > u32 insn_idx;
> > > > > u32 curframe;
> > > > > - u32 active_spin_lock;
> > > > > + void *active_spin_lock_ptr;
> > > > > + u32 active_spin_lock_id;
> > > >
> > > > {map, id=0} is indeed enough to distinguish different global locks and
> > > > {map, id} for locks in map values,
> > > > but what 'btf' is for?
> > > > When is the case when reg->map_ptr is not set?
> > > > locks in allocated objects?
> > > > Feels too early to add that in this patch.
> > > >
> > >
> > > It makes active_spin_lock check simpler, just checking
> > > active_spin_lock_ptr that to be non-NULL indicates lock is held. Don't
> > > have to always check both ptr and id, only need to compare both when
> > > verifying that lock is in the same allocation as reg.
> >
> > Not following. There is always non-null reg->map_ptr when
> > we come down this path.
> > At least in the current state of the verifier.
> > So it never assigns that btf afacs.
>
> map is only set when reg->type == PTR_TO_MAP_VALUE,
> otherwise btf = reg->btf for local kptrs (else branch). Then the map
> ptr is NULL.
> See patch 18 which already added support to local kptrs.
I see. That's what I was missing.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model
2022-09-08 0:34 ` Alexei Starovoitov
@ 2022-09-08 2:39 ` Kumar Kartikeya Dwivedi
2022-09-08 3:37 ` Alexei Starovoitov
0 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 2:39 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, 8 Sept 2022 at 02:34, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Sep 04, 2022 at 10:41:29PM +0200, Kumar Kartikeya Dwivedi wrote:
> > Add the concept of a memory object model to BPF verifier.
> >
> > What this means is that there are now some types that are not just plain
> > old data, but require explicit action when they are allocated on a
> > storage, before their lifetime is considered as started and before it is
> > allowed for them to escape the program. The verifier will track state of
> > such fields during the various phases of the object lifetime, where it
> > can be sure about certain invariants.
> >
> > Some inspiration is taken from existing memory object and lifetime
> > models in C and C++ which have stood the test of time. See [0], [1], [2]
> > for more information, to find some similarities. In the future, the
> > separation of storage and object lifetime may be made more stark by
> > allowing to change effective type of storage allocated for a local kptr.
> > For now, that has been left out. It is only possible when verifier
> > understands when the program has exclusive access to storage, and when
> > the object it is hosting is no longer accessible to other CPUs.
> >
> > This can be useful to maintain size-class based freelists inside BPF
> > programs and reuse storage of same size for different types. This would
> > only be safe to allow if verifier can ensure that while storage lifetime
> > has not ended, object lifetime for the current type has. This
> > necessiates separating the two and accomodating a simple model to track
> > object lifetime (composed recursively of more objects whose lifetime
> > is individually tracked).
> >
> > Everytime a BPF program allocates such non-trivial types, it must call a
> > set of constructors on the object to fully begin its lifetime before it
> > can make use of the pointer to this type. If the program does not do so,
> > the verifier will complain and lead to failure in loading of the
> > program.
> >
> > Similarly, when ending the lifetime of such types, it is required to
> > fully destruct the object using a series of destructors for each
> > non-trivial member, before finally freeing the storage the object is
> > making use of.
> >
> > During both the construction and destruction phase, there can be only
> > one program that can own and access such an object, hence their is no
> > need of any explicit synchronization. The single ownership of such
> > objects makes it easy for the verifier to enforce the safety around the
> > beginning and end of the lifetime without resorting to dynamic checks.
> >
> > When there are multiple fields needing construction or destruction, the
> > program must call their constructors in ascending order of the offset of
> > the field.
> >
> > For example, consider the following type (support for such fields will
> > be added in subsequent patches):
> >
> > struct data {
> > struct bpf_spin_lock lock;
> > struct bpf_list_head list __contains(struct, foo, node);
> > int data;
> > };
> >
> > struct data *d = bpf_kptr_alloc(...);
> > if (!d) { ... }
> >
> > Now, the type of d would be PTR_TO_BTF_ID | MEM_TYPE_LOCAL |
> > OBJ_CONSTRUCTING, as it needs two constructor calls (for lock and head),
> > before it can be considered fully initialized and alive.
> >
> > Hence, we must do (in order of field offsets):
> >
> > bpf_spin_lock_init(&d->lock);
> > bpf_list_head_init(&d->list);
>
> All sounds great in theory, but I think it's unnecessary complex at this point.
> There is still a need to __bpf_list_head_init_zeroed as seen in later patches.
This particular call is only needed because of map values; doing
INIT_LIST_HEAD during prealloc init or alloc_elem would be costly.
There is no concern doing it in check_and_init_map_value, since we
already zero out the field there. Nothing else needs this check.
I am planning to inline the list helpers; it doesn't make sense to have
two loads/stores inside kfuncs. And then for local kptrs there is no
need to zero init. pop_front/pop_back are even uglier: there you need a
NULL check plus zero init, _then_ the check for list_empty. Same with a
future list_splice.
I don't believe the list helpers are going to be so infrequent that all
of this won't matter.
But fine, I still consider this a fair point. I thought a lot about
this too. It really boils down to: do we really want to always zero
init?
What seems more desirable to me is forcing initialization like this,
especially since memory reuse is going to be the more common case, and
then simply relaxing the requirement when we know the memory comes from
bpf_kptr_zalloc: a needs_construction flag, similar to
needs_destruction. We aren't requiring bpf_list_node_fini, same idea
there.
Zeroing the entire big struct vs zeroing/initing two fields makes a
huge difference.
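To make the trade-off concrete, the two styles look roughly like this
in BPF C, reusing 'struct data' from the quoted commit message. The
bpf_kptr_zalloc name and the exact argument list are assumptions based
on this thread, not final API:

/* Style A: plain allocation; the verifier then insists on the
 * constructors, in field-offset order, before d may escape.
 */
struct data *d = bpf_kptr_alloc(bpf_core_type_id_local(struct data));
if (!d)
	return 0;
bpf_spin_lock_init(&d->lock);
bpf_list_head_init(&d->list);

/* Style B: zeroing allocator; zero init doubles as construction for
 * spin_lock/list_head, so no explicit constructor calls are needed,
 * at the cost of zeroing the whole struct.
 */
struct data *z = bpf_kptr_zalloc(bpf_core_type_id_local(struct data));
if (!z)
	return 0;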
> So all this verifier enforced constructors we don't need _today_.
> Zero init of everything works.
> It's the case for list_head, list_node, spin_lock, rb_root, rb_node.
> Pretty much all new data structures will work with zero init
> and all of them need async dtors.
> The verifier cannot help during destruction.
> dtors have to be specified declaratively in a bpf prog for new types
I think about it the other way around.
There actually isn't a need, IMO, to specify any dtor for custom types.
Just init and free your type inline. That is much more familiar to
people already writing C.
Custom types are just plain data apart from the special fields, and we
already know how to destroy the BPF special fields.
The map already knows how to 'destruct' these types, just like it has
to know how to destruct the map value.
The map value type and the local kptr type are similar in that regard:
they are both local types in prog BTF with special fields.
If the map can do it for the map value, it can do it for a local kptr
when it finds one in the map (and it has to).
To me, taking a prog reference and setting up a per-type dtor is the
uglier solution. It's unnecessary for the user, and it forces semantics
similar to bpf_timer: map_release_uref would have to be used to break
the reference cycle between map and prog, which is undesirable.
That means more effort later - either to think of some way to alleviate
that, or to live with it and compromise.
Later, asynchronous destruction (the RCU case, where it won't be done
immediately) is just a matter of marking all fields of the reg as
FIELD_STATE_CONSTRUCTED in the callback, but with the reg type as
OBJ_DESTRUCTING, forcing you to do nothing in that context but unwind
and free.
The real reason to give users control over destruction of local kptrs
is the ability to manage what happens to the drained resources.
When freeing a node, they might as well splice its list out into a
local list_head, or move it to a map. The same goes for more cases in
the future (kptr inside kptr).
It shouldn't be invoked automagically on bpf_kptr_free. That is the job
of the language, and best left to it.
The verifier sees BPF asm after translation from C/C++/Rust/etc., so
for us destruction at the language level appears as the destructing
phase of the local kptr in the verifier. For maps it's the last resort,
where the programs are already gone, so there is nothing left to do but
free stuff.
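A rough sketch of what that user-controlled teardown could look like;
helper names such as bpf_list_pop_front are taken from the discussion
in this thread and are assumptions, not final API:

/* 'd' is exclusively owned while it is being destructed, so its list
 * can be drained without extra synchronization; the program decides
 * what to do with the surviving nodes (free them, splice them into
 * another list_head, move them into a map, ...). Here they are simply
 * freed one by one before the owner itself is freed.
 */
struct foo *n;

while ((n = bpf_list_pop_front(&d->list)))
	bpf_kptr_free(n);
bpf_kptr_free(d);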
> and as known kfuncs for list_head/node, rb_root/node.
> There will be unfreed link lists in maps and the later patches handle that
> without OBJ_DESTRUCTING.
> So let's postpone this patch.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 05/32] bpf: Support kptrs in local storage maps
2022-09-07 19:00 ` Alexei Starovoitov
@ 2022-09-08 2:47 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 2:47 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Martin KaFai Lau, KP Singh, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Dave Marchevsky,
Delyan Kratunov
On Wed, 7 Sept 2022 at 21:00, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Sep 04, 2022 at 10:41:18PM +0200, Kumar Kartikeya Dwivedi wrote:
> > Enable support for kptrs in local storage maps by wiring up the freeing
> > of these kptrs from map value.
> >
> > Cc: Martin KaFai Lau <kafai@fb.com>
> > Cc: KP Singh <kpsingh@kernel.org>
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > include/linux/bpf_local_storage.h | 2 +-
> > kernel/bpf/bpf_local_storage.c | 33 +++++++++++++++++++++++++++----
> > kernel/bpf/syscall.c | 5 ++++-
> > kernel/bpf/verifier.c | 9 ++++++---
> > 4 files changed, 40 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> > index 7ea18d4da84b..6786d00f004e 100644
> > --- a/include/linux/bpf_local_storage.h
> > +++ b/include/linux/bpf_local_storage.h
> > @@ -74,7 +74,7 @@ struct bpf_local_storage_elem {
> > struct hlist_node snode; /* Linked to bpf_local_storage */
> > struct bpf_local_storage __rcu *local_storage;
> > struct rcu_head rcu;
> > - /* 8 bytes hole */
> > + struct bpf_map *map; /* Only set for bpf_selem_free_rcu */
> > /* The data is stored in another cacheline to minimize
> > * the number of cachelines access during a cache hit.
> > */
> > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> > index 802fc15b0d73..4a725379d761 100644
> > --- a/kernel/bpf/bpf_local_storage.c
> > +++ b/kernel/bpf/bpf_local_storage.c
> > @@ -74,7 +74,8 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> > gfp_flags | __GFP_NOWARN);
> > if (selem) {
> > if (value)
> > - memcpy(SDATA(selem)->data, value, smap->map.value_size);
> > + copy_map_value(&smap->map, SDATA(selem)->data, value);
> > + /* No call to check_and_init_map_value as memory is zero init */
> > return selem;
> > }
> >
> > @@ -92,12 +93,27 @@ void bpf_local_storage_free_rcu(struct rcu_head *rcu)
> > kfree_rcu(local_storage, rcu);
> > }
> >
> > +static void check_and_free_fields(struct bpf_local_storage_elem *selem)
> > +{
> > + if (map_value_has_kptrs(selem->map))
> > + bpf_map_free_kptrs(selem->map, SDATA(selem));
> > +}
> > +
> > static void bpf_selem_free_rcu(struct rcu_head *rcu)
> > {
> > struct bpf_local_storage_elem *selem;
> >
> > selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> > - kfree_rcu(selem, rcu);
> > + check_and_free_fields(selem);
> > + kfree(selem);
> > +}
> > +
> > +static void bpf_selem_free_tasks_trace_rcu(struct rcu_head *rcu)
> > +{
> > + struct bpf_local_storage_elem *selem;
> > +
> > + selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> > + call_rcu(&selem->rcu, bpf_selem_free_rcu);
> > }
> >
> > /* local_storage->lock must be held and selem->local_storage == local_storage.
> > @@ -150,10 +166,11 @@ bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
> > SDATA(selem))
> > RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
> >
> > + selem->map = &smap->map;
> > if (use_trace_rcu)
> > - call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
> > + call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
> > else
> > - kfree_rcu(selem, rcu);
> > + call_rcu(&selem->rcu, bpf_selem_free_rcu);
> >
> > return free_local_storage;
> > }
> > @@ -581,6 +598,14 @@ void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
> > */
> > synchronize_rcu();
> >
> > + /* When local storage map has kptrs, the call_rcu callback accesses
> > + * kptr_off_tab, hence we need the bpf_selem_free_rcu callbacks to
> > + * finish before we free it.
> > + */
> > + if (map_value_has_kptrs(&smap->map)) {
> > + rcu_barrier();
> > + bpf_map_free_kptr_off_tab(&smap->map);
>
> probably needs conditional rcu_barrier_tasks_trace before rcu_barrier?
> With or without it will be a significant delay in map freeing.
> Maybe we should generalize the destroy_mem_alloc trick?
>
Yes, let me take a closer look tomorrow and ask questions if I have any.
Otherwise I will rework it. Thanks for catching this.
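If I understood the suggestion correctly, the free path would become
something like this (a sketch; whether the tasks-trace barrier can be
made conditional still needs checking):

/* selems with kptrs are freed via call_rcu_tasks_trace() chaining
 * into call_rcu(), so wait for both grace periods before the
 * kptr_off_tab they dereference is freed.
 */
if (map_value_has_kptrs(&smap->map)) {
	rcu_barrier_tasks_trace();
	rcu_barrier();
	bpf_map_free_kptr_off_tab(&smap->map);
}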
> Patch 4 needs rebase. Applied patches 1-3.
> The first 5 look great to me.
> Pls follow up with kptr specific tests.
Thanks, I will split those out into another series with its own test.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 12/32] bpf: Teach verifier about non-size constant arguments
2022-09-07 22:11 ` Alexei Starovoitov
@ 2022-09-08 2:49 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 2:49 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, 8 Sept 2022 at 00:11, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Sep 04, 2022 at 10:41:25PM +0200, Kumar Kartikeya Dwivedi wrote:
> > Currently, the verifier has support for various arguments that either
> > describe the size of the memory being passed in to a helper, or describe
> > the size of the memory being returned. When a constant is passed in like
> > this, it is assumed for the purposes of precision tracking that if the
> > value in the already explored safe state is within the value in current
> > state, it would fine to prune the search.
> >
> > While this holds well for size arguments, arguments where each value may
> > denote a distinct meaning and needs to be verified separately needs more
> > work. Search can only be pruned if both are constant values and both are
> > equal. In all other cases, it would be incorrect to treat those two
> > precise registers as equivalent if the new value satisfies the old one
> > (i.e. old <= cur).
> >
> > Hence, make the register precision marker tri-state. There are now three
> > values that reg->precise takes: NOT_PRECISE, PRECISE, PRECISE_ABSOLUTE.
> >
> > Both PRECISE and PRECISE_ABSOLUTE are 'true' values. PRECISE_ABSOLUTE
> > affects how regsafe decides whether both registers are equivalent for
> > the purposes of verifier state equivalence. When it sees that one
> > register has reg->precise == PRECISE_ABSOLUTE, unless both are absolute,
> > it will return false. When both are, it returns true only when both are
> > const and both have the same value. Otherwise, for PRECISE case it falls
> > back to the default check that is present now (i.e. thinking that we're
> > talking about sizes).
> >
> > This is required as a future patch introduces a BPF memory allocator
> > interface, where we take the program BTF's type ID as an argument. Each
> > distinct type ID may result in the returned pointer obtaining a
> > different size, hence precision tracking is needed, and pruning cannot
> > just happen when the old value is within the current value. It must only
> > happen when the type ID is equal. The type ID will always correspond to
> > prog->aux->btf hence actual type match is not required.
> >
> > Finally, change mark_chain_precision to mark_chain_precision_absolute
> > for kfuncs constant non-size scalar arguments (tagged with __k suffix).
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > include/linux/bpf_verifier.h | 8 +++-
> > kernel/bpf/verifier.c | 93 ++++++++++++++++++++++++++----------
> > 2 files changed, 76 insertions(+), 25 deletions(-)
> >
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index b4a11ff56054..c4d21568d192 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -43,6 +43,12 @@ enum bpf_reg_liveness {
> > REG_LIVE_DONE = 0x8, /* liveness won't be updating this register anymore */
> > };
> >
> > +enum bpf_reg_precise {
> > + NOT_PRECISE,
> > + PRECISE,
> > + PRECISE_ABSOLUTE,
> > +};
>
> Can we make it less verbose ?
>
> NOT_PRECISE,
> PRECISE,
> EXACT
>
Yes, looks better.
> > +
> > struct bpf_reg_state {
> > /* Ordering of fields matters. See states_equal() */
> > enum bpf_reg_type type;
> > @@ -180,7 +186,7 @@ struct bpf_reg_state {
> > s32 subreg_def;
> > enum bpf_reg_liveness live;
> > /* if (!precise && SCALAR_VALUE) min/max/tnum don't affect safety */
> > - bool precise;
> > + enum bpf_reg_precise precise;
>
> Have been thinking whether
> bool precise;
> bool exact;
> would be better,
> but doesn't look like it.
>
> > };
> >
> > enum bpf_stack_slot_type {
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index b28e88d6fabd..571790ac58d4 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -838,7 +838,7 @@ static void print_verifier_state(struct bpf_verifier_env *env,
> > print_liveness(env, reg->live);
> > verbose(env, "=");
> > if (t == SCALAR_VALUE && reg->precise)
> > - verbose(env, "P");
> > + verbose(env, reg->precise == PRECISE_ABSOLUTE ? "PA" : "P");
>
> and here it would be just 'E'
>
> > if ((t == SCALAR_VALUE || t == PTR_TO_STACK) &&
> > tnum_is_const(reg->var_off)) {
> > /* reg->off should be 0 for SCALAR_VALUE */
> > @@ -935,7 +935,7 @@ static void print_verifier_state(struct bpf_verifier_env *env,
> > t = reg->type;
> > verbose(env, "=%s", t == SCALAR_VALUE ? "" : reg_type_str(env, t));
> > if (t == SCALAR_VALUE && reg->precise)
> > - verbose(env, "P");
> > + verbose(env, reg->precise == PRECISE_ABSOLUTE ? "PA" : "P");
> > if (t == SCALAR_VALUE && tnum_is_const(reg->var_off))
> > verbose(env, "%lld", reg->var_off.value + reg->off);
> > } else {
> > @@ -1668,7 +1668,17 @@ static void __mark_reg_unknown(const struct bpf_verifier_env *env,
> > reg->type = SCALAR_VALUE;
> > reg->var_off = tnum_unknown;
> > reg->frameno = 0;
> > - reg->precise = env->subprog_cnt > 1 || !env->bpf_capable;
> > + /* Helpers requiring PRECISE_ABSOLUTE for constant arguments cannot be
> > + * called from programs without CAP_BPF. This is because we don't
> > + * propagate precision markers for when CAP_BPF is missing. If we
> > + * allowed calling such heleprs in those programs, the default would
> > + * have to be PRECISE_ABSOLUTE for them, which would be too aggresive.
> > + *
> > + * We still propagate PRECISE_ABSOLUTE when subprog_cnt > 1, hence
> > + * those cases would still override the default PRECISE value when
> > + * we propagate the precision markers.
> > + */
> > + reg->precise = (env->subprog_cnt > 1 || !env->bpf_capable) ? PRECISE : NOT_PRECISE;
> > __mark_reg_unbounded(reg);
> > }
> >
> > @@ -2717,7 +2727,8 @@ static int backtrack_insn(struct bpf_verifier_env *env, int idx,
> > * For now backtracking falls back into conservative marking.
> > */
> > static void mark_all_scalars_precise(struct bpf_verifier_env *env,
> > - struct bpf_verifier_state *st)
> > + struct bpf_verifier_state *st,
> > + bool absolute)
> > {
> > struct bpf_func_state *func;
> > struct bpf_reg_state *reg;
> > @@ -2733,7 +2744,7 @@ static void mark_all_scalars_precise(struct bpf_verifier_env *env,
> > reg = &func->regs[j];
> > if (reg->type != SCALAR_VALUE)
> > continue;
> > - reg->precise = true;
> > + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> > }
> > for (j = 0; j < func->allocated_stack / BPF_REG_SIZE; j++) {
> > if (!is_spilled_reg(&func->stack[j]))
> > @@ -2741,13 +2752,13 @@ static void mark_all_scalars_precise(struct bpf_verifier_env *env,
> > reg = &func->stack[j].spilled_ptr;
> > if (reg->type != SCALAR_VALUE)
> > continue;
> > - reg->precise = true;
> > + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> > }
> > }
> > }
> >
> > static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> > - int spi)
> > + int spi, bool absolute)
>
> instead of bool pls pass enum bpf_reg_precise
>
> > {
> > struct bpf_verifier_state *st = env->cur_state;
> > int first_idx = st->first_insn_idx;
> > @@ -2774,7 +2785,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> > new_marks = true;
> > else
> > reg_mask = 0;
> > - reg->precise = true;
> > + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> > }
> >
> > while (spi >= 0) {
> > @@ -2791,7 +2802,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> > new_marks = true;
> > else
> > stack_mask = 0;
> > - reg->precise = true;
> > + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> > break;
> > }
> >
> > @@ -2813,7 +2824,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> > err = backtrack_insn(env, i, &reg_mask, &stack_mask);
> > }
> > if (err == -ENOTSUPP) {
> > - mark_all_scalars_precise(env, st);
> > + mark_all_scalars_precise(env, st, absolute);
> > return 0;
> > } else if (err) {
> > return err;
> > @@ -2854,7 +2865,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> > }
> > if (!reg->precise)
> > new_marks = true;
> > - reg->precise = true;
> > + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> > }
> >
> > bitmap_from_u64(mask, stack_mask);
> > @@ -2873,7 +2884,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> > * fp-8 and it's "unallocated" stack space.
> > * In such case fallback to conservative.
> > */
> > - mark_all_scalars_precise(env, st);
> > + mark_all_scalars_precise(env, st, absolute);
> > return 0;
> > }
> >
> > @@ -2888,7 +2899,7 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> > }
> > if (!reg->precise)
> > new_marks = true;
> > - reg->precise = true;
> > + reg->precise = absolute ? PRECISE_ABSOLUTE : PRECISE;
> > }
> > if (env->log.level & BPF_LOG_LEVEL2) {
> > verbose(env, "parent %s regs=%x stack=%llx marks:",
> > @@ -2910,12 +2921,24 @@ static int __mark_chain_precision(struct bpf_verifier_env *env, int regno,
> >
> > static int mark_chain_precision(struct bpf_verifier_env *env, int regno)
> > {
> > - return __mark_chain_precision(env, regno, -1);
> > + return __mark_chain_precision(env, regno, -1, false);
> > +}
> > +
> > +static int mark_chain_precision_absolute(struct bpf_verifier_env *env, int regno)
> > +{
> > + WARN_ON_ONCE(!env->bpf_capable);
> > + return __mark_chain_precision(env, regno, -1, true);
> > }
> >
> > static int mark_chain_precision_stack(struct bpf_verifier_env *env, int spi)
> > {
> > - return __mark_chain_precision(env, -1, spi);
> > + return __mark_chain_precision(env, -1, spi, false);
> > +}
>
> No need to fork the functions so much.
> Just add enum bpf_reg_precise to existing two functions.
>
> > +
> > +static int mark_chain_precision_absolute_stack(struct bpf_verifier_env *env, int spi)
> > +{
> > + WARN_ON_ONCE(!env->bpf_capable);
> > + return __mark_chain_precision(env, -1, spi, true);
> > }
> >
> > static bool is_spillable_regtype(enum bpf_reg_type type)
> > @@ -3253,7 +3276,7 @@ static void mark_reg_stack_read(struct bpf_verifier_env *env,
> > * backtracking. Any register that contributed
> > * to const 0 was marked precise before spill.
> > */
> > - state->regs[dst_regno].precise = true;
> > + state->regs[dst_regno].precise = PRECISE;
> > } else {
> > /* have read misc data from the stack */
> > mark_reg_unknown(env, state->regs, dst_regno);
> > @@ -7903,7 +7926,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_arg_m
> > verbose(env, "R%d must be a known constant\n", regno);
> > return -EINVAL;
> > }
> > - ret = mark_chain_precision(env, regno);
> > + ret = mark_chain_precision_absolute(env, regno);
> > if (ret < 0)
> > return ret;
> > meta->arg_constant.found = true;
> > @@ -11899,9 +11922,23 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
> > if (rcur->type == SCALAR_VALUE) {
> > if (!rold->precise && !rcur->precise)
> > return true;
> > - /* new val must satisfy old val knowledge */
> > - return range_within(rold, rcur) &&
> > - tnum_in(rold->var_off, rcur->var_off);
> > + /* We can only determine safety when type of precision
> > + * needed is same. For absolute, we must compare actual
> > + * value, otherwise old being within the current value
> > + * suffices.
> > + */
> > + if (rold->precise == PRECISE_ABSOLUTE || rcur->precise == PRECISE_ABSOLUTE) {
> > + /* Both should be PRECISE_ABSOLUTE for a comparison */
> > + if (rold->precise != rcur->precise)
> > + return false;
> > + if (!tnum_is_const(rold->var_off) || !tnum_is_const(rcur->var_off))
> > + return false;
> > + return rold->var_off.value == rcur->var_off.value;
>
> Probably better to do
> if (rold->precise == EXACT || rcu->precise == EXACT)
> return false;
>
> because
> if (equal)
> return true;
> should have already happened if they were exact match.
>
I'll add a comment about that, just to make it clear.
And I agree with all the suggestions.
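So the regsafe() path would end up looking roughly like this (a sketch
with the suggested EXACT naming; an exact-equal pair has already
returned true via the equality check earlier in regsafe(), so any
remaining EXACT register is simply unsafe to prune against):

if (rcur->type == SCALAR_VALUE) {
	if (!rold->precise && !rcur->precise)
		return true;
	if (rold->precise == EXACT || rcur->precise == EXACT)
		return false;
	/* new val must satisfy old val knowledge */
	return range_within(rold, rcur) &&
	       tnum_in(rold->var_off, rcur->var_off);
}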
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 13/32] bpf: Introduce bpf_list_head support for BPF maps
2022-09-07 22:46 ` Alexei Starovoitov
@ 2022-09-08 2:58 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 2:58 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, 8 Sept 2022 at 00:46, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Sep 04, 2022 at 10:41:26PM +0200, Kumar Kartikeya Dwivedi wrote:
> > Add the basic support on the map side to parse, recognize, verify, and
> > build metadata table for a new special field of the type struct
> > bpf_list_head. To parameterize the bpf_list_head for a certain value
> > type and the list_node member it will accept in that value type, we use
> > BTF declaration tags.
> >
> > The definition of bpf_list_head in a map value will be done as follows:
> >
> > struct foo {
> > int data;
> > struct bpf_list_node list;
> > };
> >
> > struct map_value {
> > struct bpf_list_head list __contains(struct, foo, node);
> > };
>
> kptrs are only for structs.
> So I would drop explicit 1st argument which is going to be 'struct'
> for foreseeable future and leave it as:
> struct bpf_list_head list __contains(foo, node);
>
Ok.
> There is typo s/list;/node;/ in struct foo, right?
Yes.
>
> > Then, the bpf_list_head only allows adding to the list using the
> > bpf_list_node 'list' for the type struct foo.
> >
> > The 'contains' annotation is a BTF declaration tag composed of four
> > parts, "contains:kind:name:node" where the kind and name is then used to
> > look up the type in the map BTF. The node defines name of the member in
> > this type that has the type struct bpf_list_node, which is actually used
> > for linking into the linked list.
> >
> > This allows building intrusive linked lists in BPF, using container_of
> > to obtain pointer to entry, while being completely type safe from the
> > perspective of the verifier. The verifier knows exactly the type of the
> > nodes, and knows that list helpers return that type at some fixed offset
> > where the bpf_list_node member used for this list exists. The verifier
> > also uses this information to disallow adding types that are not
> > accepted by a certain list.
> >
> > For now, no elements can be added to such lists. Support for that is
> > coming in future patches, hence draining and freeing items is left out
> > for now, and just freeing the list_head_off_tab is done, since it is
> > still built and populated when bpf_list_head is specified in the map
> > value.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > include/linux/bpf.h | 64 +++++--
> > include/linux/btf.h | 2 +
> > kernel/bpf/arraymap.c | 2 +
> > kernel/bpf/bpf_local_storage.c | 1 +
> > kernel/bpf/btf.c | 173 +++++++++++++++++-
> > kernel/bpf/hashtab.c | 1 +
> > kernel/bpf/map_in_map.c | 5 +-
> > kernel/bpf/syscall.c | 131 +++++++++++--
> > kernel/bpf/verifier.c | 21 +++
> > .../testing/selftests/bpf/bpf_experimental.h | 21 +++
> > 10 files changed, 378 insertions(+), 43 deletions(-)
> > create mode 100644 tools/testing/selftests/bpf/bpf_experimental.h
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index d4e6bf789c02..35c2e9caeb98 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -28,6 +28,9 @@
> > #include <linux/btf.h>
> > #include <linux/rcupdate_trace.h>
> >
> > +/* Experimental BPF APIs header for type definitions */
> > +#include "../../../tools/testing/selftests/bpf/bpf_experimental.h"
> > +
> > struct bpf_verifier_env;
> > struct bpf_verifier_log;
> > struct perf_event;
> > @@ -164,27 +167,40 @@ struct bpf_map_ops {
> > };
> >
> > enum {
> > - /* Support at most 8 pointers in a BPF map value */
> > - BPF_MAP_VALUE_OFF_MAX = 8,
> > - BPF_MAP_OFF_ARR_MAX = BPF_MAP_VALUE_OFF_MAX +
> > - 1 + /* for bpf_spin_lock */
> > - 1, /* for bpf_timer */
> > -};
> > -
> > -enum bpf_kptr_type {
> > + /* Support at most 8 offsets in a table */
> > + BPF_MAP_VALUE_OFF_MAX = 8,
> > + /* Support at most 8 pointer in a BPF map value */
> > + BPF_MAP_VALUE_KPTR_MAX = BPF_MAP_VALUE_OFF_MAX,
> > + /* Support at most 8 list_head in a BPF map value */
> > + BPF_MAP_VALUE_LIST_HEAD_MAX = BPF_MAP_VALUE_OFF_MAX,
> > + BPF_MAP_OFF_ARR_MAX = BPF_MAP_VALUE_KPTR_MAX +
> > + BPF_MAP_VALUE_LIST_HEAD_MAX +
> > + 1 + /* for bpf_spin_lock */
> > + 1, /* for bpf_timer */
> > +};
> > +
> > +enum bpf_off_type {
> > BPF_KPTR_UNREF,
> > BPF_KPTR_REF,
> > + BPF_LIST_HEAD,
> > };
> >
> > struct bpf_map_value_off_desc {
> > u32 offset;
> > - enum bpf_kptr_type type;
> > - struct {
> > - struct btf *btf;
> > - struct module *module;
> > - btf_dtor_kfunc_t dtor;
> > - u32 btf_id;
> > - } kptr;
> > + enum bpf_off_type type;
> > + union {
> > + struct {
> > + struct btf *btf;
> > + struct module *module;
> > + btf_dtor_kfunc_t dtor;
> > + u32 btf_id;
> > + } kptr; /* for BPF_KPTR_{UNREF,REF} */
> > + struct {
> > + struct btf *btf;
> > + u32 value_type_id;
> > + u32 list_node_off;
> > + } list_head; /* for BPF_LIST_HEAD */
> > + };
> > };
> >
> > struct bpf_map_value_off {
> > @@ -215,6 +231,7 @@ struct bpf_map {
> > u32 map_flags;
> > int spin_lock_off; /* >=0 valid offset, <0 error */
> > struct bpf_map_value_off *kptr_off_tab;
> > + struct bpf_map_value_off *list_head_off_tab;
>
> The union in bpf_map_value_off_desc prompts the question
> why separate array is needed.
> Sorting gets uglier.
>
I'll try to create a unified offset array; there aren't going to be
any collisions anyway, and we can tell the field types apart using
the type field.
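Concretely, the idea would be a single table in struct bpf_map, sorted
by offset, with consumers switching on the descriptor type. A sketch;
the fields_tab name is illustrative, not from a posted patch:

/* one fields_tab replaces kptr_off_tab + list_head_off_tab */
static void bpf_map_free_fields(struct bpf_map *map, void *map_value)
{
	struct bpf_map_value_off *tab = map->fields_tab;
	int i;

	for (i = 0; i < tab->nr_off; i++) {
		struct bpf_map_value_off_desc *off = &tab->off[i];

		switch (off->type) {
		case BPF_KPTR_UNREF:
		case BPF_KPTR_REF:
			/* release/free the stored kernel pointer */
			break;
		case BPF_LIST_HEAD:
			/* drain the list rooted at this offset */
			break;
		}
	}
}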
> > int timer_off; /* >=0 valid offset, <0 error */
> > u32 id;
> > int numa_node;
> > @@ -265,6 +282,11 @@ static inline bool map_value_has_kptrs(const struct bpf_map *map)
> > return !IS_ERR_OR_NULL(map->kptr_off_tab);
> > }
> >
> > +static inline bool map_value_has_list_heads(const struct bpf_map *map)
> > +{
> > + return !IS_ERR_OR_NULL(map->list_head_off_tab);
> > +}
> > +
> > static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > {
> > if (unlikely(map_value_has_spin_lock(map)))
> > @@ -278,6 +300,13 @@ static inline void check_and_init_map_value(struct bpf_map *map, void *dst)
> > for (i = 0; i < tab->nr_off; i++)
> > *(u64 *)(dst + tab->off[i].offset) = 0;
> > }
> > + if (unlikely(map_value_has_list_heads(map))) {
> > + struct bpf_map_value_off *tab = map->list_head_off_tab;
> > + int i;
> > +
> > + for (i = 0; i < tab->nr_off; i++)
> > + memset(dst + tab->off[i].offset, 0, sizeof(struct list_head));
> > + }
>
> Do we really need to distinguish map_value_has_kptrs vs map_value_has_list_heads ?
> Can they be generalized?
> rb_root will be next.
> that would be yet another array and another 'if'-s everywhere?
> And then another special pseudo-map type that will cause a bunch of copy-paste again?
> Maybe it's inevitable.
> Trying to brainstorm.
>
Yes, it's a bit unfortunate how this is turning out.
If we use a unified array, though, we might be able to do it in one
pass, taking the size from the field type.
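For example, check_and_init_map_value() could then do a single pass,
picking the zero-init size from the field type (a sketch, illustrative
names only):

static u32 map_field_size(enum bpf_off_type type)
{
	switch (type) {
	case BPF_KPTR_UNREF:
	case BPF_KPTR_REF:
		return sizeof(u64);
	case BPF_LIST_HEAD:
		return sizeof(struct bpf_list_head);
	}
	return 0;
}

/* inside check_and_init_map_value(), over the unified table */
for (i = 0; i < tab->nr_off; i++)
	memset(dst + tab->off[i].offset, 0,
	       map_field_size(tab->off[i].type));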
> > }
> >
> > /* memcpy that is used with 8-byte aligned pointers, power-of-8 size and
> > @@ -1676,6 +1705,11 @@ struct bpf_map_value_off *bpf_map_copy_kptr_off_tab(const struct bpf_map *map);
> > bool bpf_map_equal_kptr_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> > void bpf_map_free_kptrs(struct bpf_map *map, void *map_value);
> >
> > +struct bpf_map_value_off_desc *bpf_map_list_head_off_contains(struct bpf_map *map, u32 offset);
> > +void bpf_map_free_list_head_off_tab(struct bpf_map *map);
> > +struct bpf_map_value_off *bpf_map_copy_list_head_off_tab(const struct bpf_map *map);
> > +bool bpf_map_equal_list_head_off_tab(const struct bpf_map *map_a, const struct bpf_map *map_b);
> > +
> > struct bpf_map *bpf_map_get(u32 ufd);
> > struct bpf_map *bpf_map_get_with_uref(u32 ufd);
> > struct bpf_map *__bpf_map_get(struct fd f);
> > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > index 8062f9da7c40..9b62b8b2117e 100644
> > --- a/include/linux/btf.h
> > +++ b/include/linux/btf.h
> > @@ -156,6 +156,8 @@ int btf_find_spin_lock(const struct btf *btf, const struct btf_type *t);
> > int btf_find_timer(const struct btf *btf, const struct btf_type *t);
> > struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> > const struct btf_type *t);
> > +struct bpf_map_value_off *btf_parse_list_heads(struct btf *btf,
> > + const struct btf_type *t);
> > bool btf_type_is_void(const struct btf_type *t);
> > s32 btf_find_by_name_kind(const struct btf *btf, const char *name, u8 kind);
> > const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
> > diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> > index 832b2659e96e..c7263ee3a35f 100644
> > --- a/kernel/bpf/arraymap.c
> > +++ b/kernel/bpf/arraymap.c
> > @@ -423,6 +423,8 @@ static void array_map_free(struct bpf_map *map)
> > struct bpf_array *array = container_of(map, struct bpf_array, map);
> > int i;
> >
> > + bpf_map_free_list_head_off_tab(map);
> > +
> > if (map_value_has_kptrs(map)) {
> > if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
> > for (i = 0; i < array->map.max_entries; i++) {
> > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> > index 58cb0c179097..b5ccd76026b6 100644
> > --- a/kernel/bpf/bpf_local_storage.c
> > +++ b/kernel/bpf/bpf_local_storage.c
> > @@ -616,6 +616,7 @@ void bpf_local_storage_map_free(struct bpf_local_storage_map *smap,
> > rcu_barrier();
> > bpf_map_free_kptr_off_tab(&smap->map);
> > }
> > + bpf_map_free_list_head_off_tab(&smap->map);
> > kvfree(smap->buckets);
> > bpf_map_area_free(smap);
> > }
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 6740c3ade8f1..0fb045be3837 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -3185,6 +3185,7 @@ enum btf_field_type {
> > BTF_FIELD_SPIN_LOCK,
> > BTF_FIELD_TIMER,
> > BTF_FIELD_KPTR,
> > + BTF_FIELD_LIST_HEAD,
> > };
> >
> > enum {
> > @@ -3193,9 +3194,17 @@ enum {
> > };
> >
> > struct btf_field_info {
> > - u32 type_id;
> > u32 off;
> > - enum bpf_kptr_type type;
> > + union {
> > + struct {
> > + u32 type_id;
> > + enum bpf_off_type type;
> > + } kptr;
> > + struct {
> > + u32 value_type_id;
> > + const char *node_name;
> > + } list_head;
> > + };
> > };
> >
> > static int btf_find_struct(const struct btf *btf, const struct btf_type *t,
> > @@ -3212,7 +3221,7 @@ static int btf_find_struct(const struct btf *btf, const struct btf_type *t,
> > static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
> > u32 off, int sz, struct btf_field_info *info)
> > {
> > - enum bpf_kptr_type type;
> > + enum bpf_off_type type;
> > u32 res_id;
> >
> > /* Permit modifiers on the pointer itself */
> > @@ -3241,9 +3250,71 @@ static int btf_find_kptr(const struct btf *btf, const struct btf_type *t,
> > if (!__btf_type_is_struct(t))
> > return -EINVAL;
> >
> > - info->type_id = res_id;
> > info->off = off;
> > - info->type = type;
> > + info->kptr.type_id = res_id;
> > + info->kptr.type = type;
> > + return BTF_FIELD_FOUND;
> > +}
> > +
> > +static const char *btf_find_decl_tag_value(const struct btf *btf,
> > + const struct btf_type *pt,
> > + int comp_idx, const char *tag_key)
> > +{
> > + int i;
> > +
> > + for (i = 1; i < btf_nr_types(btf); i++) {
> > + const struct btf_type *t = btf_type_by_id(btf, i);
> > + int len = strlen(tag_key);
> > +
> > + if (!btf_type_is_decl_tag(t))
> > + continue;
> > + /* TODO: Instead of btf_type pt, it would be much better if we had BTF
> > + * ID of the map value type. This would avoid btf_type_by_id call here.
> > + */
> > + if (pt != btf_type_by_id(btf, t->type) ||
> > + btf_type_decl_tag(t)->component_idx != comp_idx)
> > + continue;
> > + if (strncmp(__btf_name_by_offset(btf, t->name_off), tag_key, len))
> > + continue;
> > + return __btf_name_by_offset(btf, t->name_off) + len;
> > + }
> > + return NULL;
> > +}
> > +
> > +static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
> > + int comp_idx, const struct btf_type *t,
> > + u32 off, int sz, struct btf_field_info *info)
> > +{
> > + const char *value_type;
> > + const char *list_node;
> > + s32 id;
> > +
> > + if (!__btf_type_is_struct(t))
> > + return BTF_FIELD_IGNORE;
> > + if (t->size != sz)
> > + return BTF_FIELD_IGNORE;
> > + value_type = btf_find_decl_tag_value(btf, pt, comp_idx, "contains:");
> > + if (!value_type)
> > + return -EINVAL;
> > + if (strncmp(value_type, "struct:", sizeof("struct:") - 1))
> > + return -EINVAL;
> > + value_type += sizeof("struct:") - 1;
> > + list_node = strstr(value_type, ":");
> > + if (!list_node)
> > + return -EINVAL;
> > + value_type = kstrndup(value_type, list_node - value_type, GFP_ATOMIC);
> > + if (!value_type)
> > + return -ENOMEM;
> > + id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT);
> > + kfree(value_type);
> > + if (id < 0)
> > + return id;
> > + list_node++;
> > + if (str_is_empty(list_node))
> > + return -EINVAL;
> > + info->off = off;
> > + info->list_head.value_type_id = id;
> > + info->list_head.node_name = list_node;
> > return BTF_FIELD_FOUND;
> > }
> >
> > @@ -3286,6 +3357,12 @@ static int btf_find_struct_field(const struct btf *btf, const struct btf_type *t
> > if (ret < 0)
> > return ret;
> > break;
> > + case BTF_FIELD_LIST_HEAD:
> > + ret = btf_find_list_head(btf, t, i, member_type, off, sz,
> > + idx < info_cnt ? &info[idx] : &tmp);
> > + if (ret < 0)
> > + return ret;
> > + break;
> > default:
> > return -EFAULT;
> > }
> > @@ -3336,6 +3413,12 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
> > if (ret < 0)
> > return ret;
> > break;
> > + case BTF_FIELD_LIST_HEAD:
> > + ret = btf_find_list_head(btf, var, -1, var_type, off, sz,
> > + idx < info_cnt ? &info[idx] : &tmp);
> > + if (ret < 0)
> > + return ret;
> > + break;
> > default:
> > return -EFAULT;
> > }
> > @@ -3372,6 +3455,11 @@ static int btf_find_field(const struct btf *btf, const struct btf_type *t,
> > sz = sizeof(u64);
> > align = 8;
> > break;
> > + case BTF_FIELD_LIST_HEAD:
> > + name = "bpf_list_head";
> > + sz = sizeof(struct bpf_list_head);
> > + align = __alignof__(struct bpf_list_head);
> > + break;
> > default:
> > return -EFAULT;
> > }
> > @@ -3440,7 +3528,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> > /* Find type in map BTF, and use it to look up the matching type
> > * in vmlinux or module BTFs, by name and kind.
> > */
> > - t = btf_type_by_id(btf, info_arr[i].type_id);
> > + t = btf_type_by_id(btf, info_arr[i].kptr.type_id);
> > id = bpf_find_btf_id(__btf_name_by_offset(btf, t->name_off), BTF_INFO_KIND(t->info),
> > &kernel_btf);
> > if (id < 0) {
> > @@ -3451,7 +3539,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> > /* Find and stash the function pointer for the destruction function that
> > * needs to be eventually invoked from the map free path.
> > */
> > - if (info_arr[i].type == BPF_KPTR_REF) {
> > + if (info_arr[i].kptr.type == BPF_KPTR_REF) {
> > const struct btf_type *dtor_func;
> > const char *dtor_func_name;
> > unsigned long addr;
> > @@ -3494,7 +3582,7 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> > }
> >
> > tab->off[i].offset = info_arr[i].off;
> > - tab->off[i].type = info_arr[i].type;
> > + tab->off[i].type = info_arr[i].kptr.type;
> > tab->off[i].kptr.btf_id = id;
> > tab->off[i].kptr.btf = kernel_btf;
> > tab->off[i].kptr.module = mod;
> > @@ -3515,6 +3603,75 @@ struct bpf_map_value_off *btf_parse_kptrs(const struct btf *btf,
> > return ERR_PTR(ret);
> > }
> >
> > +struct bpf_map_value_off *btf_parse_list_heads(struct btf *btf, const struct btf_type *t)
> > +{
> > + struct btf_field_info info_arr[BPF_MAP_VALUE_OFF_MAX];
> > + struct bpf_map_value_off *tab;
> > + int ret, i, nr_off;
> > +
> > + ret = btf_find_field(btf, t, BTF_FIELD_LIST_HEAD, info_arr, ARRAY_SIZE(info_arr));
>
> Looks like we could search for both LIST_HEAD and KPTR here to know the size.
>
Yes, it seems like a good idea to unify things; I'll do it in v1.
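For instance (a rough sketch of the unified pass, not the v1 code; it assumes
the field kinds become a bitmask so both can be requested at once, and the
names below are illustrative):

    /* btf_field_info grows a per-entry kind so one btf_find_field() pass
     * can report both kptr and list_head fields:
     */
    struct btf_field_info {
            enum btf_field_type kind;  /* BTF_FIELD_KPTR or BTF_FIELD_LIST_HEAD */
            u32 off;
            union {
                    struct { u32 type_id; enum bpf_off_type type; } kptr;
                    struct { u32 value_type_id; const char *node_name; } list_head;
            };
    };

    ret = btf_find_field(btf, t, BTF_FIELD_KPTR | BTF_FIELD_LIST_HEAD,
                         info_arr, ARRAY_SIZE(info_arr));
    for (i = 0; i < ret; i++) {
            switch (info_arr[i].kind) {
            case BTF_FIELD_KPTR:            /* fill the kptr off desc */
                    break;
            case BTF_FIELD_LIST_HEAD:       /* fill the list_head off desc */
                    break;
            default:
                    return -EFAULT;
            }
    }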
> > + if (ret < 0)
> > + return ERR_PTR(ret);
> > + if (!ret)
> > + return NULL;
> > +
> > + nr_off = ret;
> > + tab = kzalloc(offsetof(struct bpf_map_value_off, off[nr_off]), GFP_KERNEL | __GFP_NOWARN);
> > + if (!tab)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + for (i = 0; i < nr_off; i++) {
> > + const struct btf_type *t, *n = NULL;
> > + const struct btf_member *member;
> > + u32 offset;
> > + int j;
>
> and here we can process both, since field_info has type.
>
> > +
> > + t = btf_type_by_id(btf, info_arr[i].list_head.value_type_id);
> > + /* We've already checked that value_type_id is a struct type. We
> > + * just need to figure out the offset of the list_node, and
> > + * verify its type.
> > + */
> > + ret = -EINVAL;
> > + for_each_member(j, t, member) {
> > + if (strcmp(info_arr[i].list_head.node_name, __btf_name_by_offset(btf, member->name_off)))
> > + continue;
> > + /* Invalid BTF, two members with same name */
> > + if (n) {
> > + /* We also need to btf_put for the current iteration! */
> > + i++;
> > + goto end;
> > + }
> > + n = btf_type_by_id(btf, member->type);
> > + if (!__btf_type_is_struct(n))
> > + goto end;
> > + if (strcmp("bpf_list_node", __btf_name_by_offset(btf, n->name_off)))
> > + goto end;
> > + offset = __btf_member_bit_offset(n, member);
> > + if (offset % 8)
> > + goto end;
> > + offset /= 8;
> > + if (offset % __alignof__(struct bpf_list_node))
> > + goto end;
> > +
> > + tab->off[i].offset = info_arr[i].off;
> > + tab->off[i].type = BPF_LIST_HEAD;
> > + btf_get(btf);
>
> Do we need to btf_get? The btf should be pinned already and not going to be released
> until prog ends.
>
Hm, I think that's true. This is also the map BTF, not the prog BTF, and
I guess the map holds a ref to it anyway until __bpf_map_put, so it
should be fine. I'll add a comment.
* Re: [PATCH RFC bpf-next v1 14/32] bpf: Introduce bpf_kptr_alloc helper
2022-09-07 23:30 ` Alexei Starovoitov
@ 2022-09-08 3:01 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 3:01 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, 8 Sept 2022 at 01:30, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Sep 04, 2022 at 10:41:27PM +0200, Kumar Kartikeya Dwivedi wrote:
> > To allocate local kptrs of types defined in program BTF instead of
> > kernel BTF, bpf_kptr_alloc is a new helper that takes the local type's
> > BTF ID and returns a pointer to it. The size is automatically inferred
> > from the type ID by the BPF verifier, so the user only passes the BTF ID
> > and flags, if any. For now, no flags are supported.
> >
> > First, we use the new constant argument type support for kfuncs that
> > enforces argument is a constant. We need to know the local type's BTF ID
> > statically to enforce safety properties for the allocation. Next, we
> > remember this and dynamically assign the return type. During that phase,
> > we also query the actual size of the structure being allocated, and
> > whether it is a struct type. If so, we stash the actual size for
> > do_misc_fixups phase where we rewrite the first argument to be size
> > instead of local type's BTF ID, which we can then pass on to the kernel
> > allocator.
> >
> > This needs some additional support for kfuncs as we were not doing
> > argument rewrites for them. The fixup has been moved inside
> > fixup_kfunc_call itself to avoid polluting the huge do_misc_fixups,
> > and delta, prog, and insn pointers are recalculated based on if any
> > instructions were patched.
> >
> > The returned pointer needs to be handled specially as well. While
> > normally, only struct pointers may be returned, a new internal kfunc
> > flag __KF_RET_DYN_BTF is used to indicate the BTF is ascertained from
> > arguments dynamically, hence it is now forced to be void * instead.
> > For now, bpf_kptr_alloc is the only user of this support.
> >
> > Hence, allocations using bpf_kptr_alloc are type safe. Later patches
> > will introduce constructor and destructor support to local kptrs
> > allocated from this helper. This would allow embedding kernel objects
> > like bpf_spin_lock, bpf_list_node, bpf_list_head inside a local kptr
> > allocation, and ensuring they are correctly initialized before use.
> >
> > A new type flag is associated with PTR_TO_BTF_ID returned from
> > bpf_kptr_alloc: MEM_TYPE_LOCAL. This indicates that the type of the
> > memory is of a local type coming from program's BTF.
> >
> > The btf_struct_access mechanism is tuned to allow BPF_WRITE access to
> > these allocated objects, so that programs can store data as usual in
> > them. On following a pointer type inside such PTR_TO_BTF_ID, WALK_PTR
> > sets the destination register as scalar instead. It would not be safe to
> > recognize pointer types in local types. This can be changed in the
> > future if it is allowed to embed kptrs inside such local kptrs.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > include/linux/bpf.h | 12 +-
> > include/linux/bpf_verifier.h | 1 +
> > include/linux/btf.h | 3 +
> > kernel/bpf/btf.c | 8 +-
> > kernel/bpf/helpers.c | 17 ++
> > kernel/bpf/verifier.c | 156 +++++++++++++++---
> > net/bpf/bpf_dummy_struct_ops.c | 5 +-
> > net/ipv4/bpf_tcp_ca.c | 5 +-
> > .../testing/selftests/bpf/bpf_experimental.h | 14 ++
> > 9 files changed, 191 insertions(+), 30 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 35c2e9caeb98..5c8bfb0eba17 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -486,6 +486,12 @@ enum bpf_type_flag {
> > /* Size is known at compile time. */
> > MEM_FIXED_SIZE = BIT(10 + BPF_BASE_TYPE_BITS),
> >
> > + /* MEM is of a type from program BTF, not kernel BTF. This is used to
> > + * tag PTR_TO_BTF_ID allocated using bpf_kptr_alloc, since they have
> > + * entirely different semantics.
> > + */
> > + MEM_TYPE_LOCAL = BIT(11 + BPF_BASE_TYPE_BITS),
> > +
> > __BPF_TYPE_FLAG_MAX,
> > __BPF_TYPE_LAST_FLAG = __BPF_TYPE_FLAG_MAX - 1,
> > };
> > @@ -757,7 +763,8 @@ struct bpf_verifier_ops {
> > const struct btf *btf,
> > const struct btf_type *t, int off, int size,
> > enum bpf_access_type atype,
> > - u32 *next_btf_id, enum bpf_type_flag *flag);
> > + u32 *next_btf_id, enum bpf_type_flag *flag,
> > + bool local_type);
> > };
> >
> > struct bpf_prog_offload_ops {
> > @@ -1995,7 +2002,8 @@ static inline bool bpf_tracing_btf_ctx_access(int off, int size,
> > int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
> > const struct btf_type *t, int off, int size,
> > enum bpf_access_type atype,
> > - u32 *next_btf_id, enum bpf_type_flag *flag);
> > + u32 *next_btf_id, enum bpf_type_flag *flag,
> > + bool local_type);
> > bool btf_struct_ids_match(struct bpf_verifier_log *log,
> > const struct btf *btf, u32 id, int off,
> > const struct btf *need_btf, u32 need_type_id,
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index c4d21568d192..c6d550978d63 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -403,6 +403,7 @@ struct bpf_insn_aux_data {
> > */
> > struct bpf_loop_inline_state loop_inline_state;
> > };
> > + u64 kptr_alloc_size; /* used to store size of local kptr allocation */
> > u64 map_key_state; /* constant (32 bit) key tracking for maps */
> > int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
> > u32 seen; /* this insn was processed by the verifier at env->pass_cnt */
> > diff --git a/include/linux/btf.h b/include/linux/btf.h
> > index 9b62b8b2117e..fc35c932e89e 100644
> > --- a/include/linux/btf.h
> > +++ b/include/linux/btf.h
> > @@ -52,6 +52,9 @@
> > #define KF_SLEEPABLE (1 << 5) /* kfunc may sleep */
> > #define KF_DESTRUCTIVE (1 << 6) /* kfunc performs destructive actions */
> >
> > +/* Internal kfunc flags, not meant for general use */
> > +#define __KF_RET_DYN_BTF (1 << 7) /* kfunc returns dynamically ascertained PTR_TO_BTF_ID */
>
> Is there going to be another func that returns similar dynamic type?
> We have one such func already, kptr_xchg. I don't see why we need this flag.
> We can just compare func_id-s.
> In this patch it will be just func_id == kfunc_ids[KF_kptr_alloc];
> When more kfuncs become alloc-like we will just add few ||.
>
There are, bpf_list_pop_{front,back}, even bpf_list_del, probably more
as we add more variants of lists.
But I don't mind keeping a list; they all need to be handled a bit
differently anyway to ascertain the type of PTR_TO_BTF_ID.
> > +
> > struct btf;
> > struct btf_member;
> > struct btf_type;
> > diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> > index 0fb045be3837..17977e0f4e09 100644
> > --- a/kernel/bpf/btf.c
> > +++ b/kernel/bpf/btf.c
> > @@ -5919,7 +5919,8 @@ static int btf_struct_walk(struct bpf_verifier_log *log, const struct btf *btf,
> > int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
> > const struct btf_type *t, int off, int size,
> > enum bpf_access_type atype __maybe_unused,
> > - u32 *next_btf_id, enum bpf_type_flag *flag)
> > + u32 *next_btf_id, enum bpf_type_flag *flag,
> > + bool local_type)
> > {
> > enum bpf_type_flag tmp_flag = 0;
> > int err;
> > @@ -5930,6 +5931,11 @@ int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
> >
> > switch (err) {
> > case WALK_PTR:
> > + /* For local types, the destination register cannot
> > + * become a pointer again.
> > + */
> > + if (local_type)
> > + return SCALAR_VALUE;
> > /* If we found the pointer or scalar on t+off,
> > * we're done.
> > */
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index fc08035f14ed..d417aa4f0b22 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -1696,10 +1696,27 @@ bpf_base_func_proto(enum bpf_func_id func_id)
> > }
> > }
> >
> > +__diag_push();
> > +__diag_ignore_all("-Wmissing-prototypes",
> > + "Global functions as their definitions will be in vmlinux BTF");
> > +
> > +void *bpf_kptr_alloc(u64 local_type_id__k, u64 flags)
> > +{
> > + /* Verifier patches local_type_id__k to size */
> > + u64 size = local_type_id__k;
> > +
> > + if (flags)
> > + return NULL;
> > + return kmalloc(size, GFP_ATOMIC);
> > +}
> > +
> > +__diag_pop();
> > +
> > BTF_SET8_START(tracing_btf_ids)
> > #ifdef CONFIG_KEXEC_CORE
> > BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
> > #endif
> > +BTF_ID_FLAGS(func, bpf_kptr_alloc, KF_ACQUIRE | KF_RET_NULL | __KF_RET_DYN_BTF)
> > BTF_SET8_END(tracing_btf_ids)
> >
> > static const struct btf_kfunc_id_set tracing_kfunc_set = {
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index ab91e5ca7e41..8f28aa7f1e8d 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -472,6 +472,11 @@ static bool type_may_be_null(u32 type)
> > return type & PTR_MAYBE_NULL;
> > }
> >
> > +static bool type_is_local(u32 type)
> > +{
> > + return type & MEM_TYPE_LOCAL;
> > +}
> > +
> > static bool is_acquire_function(enum bpf_func_id func_id,
> > const struct bpf_map *map)
> > {
> > @@ -4556,17 +4561,22 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
> > return -EACCES;
> > }
> >
> > - if (env->ops->btf_struct_access) {
> > + /* For allocated PTR_TO_BTF_ID pointing to a local type, we cannot do
> > + * btf_struct_access callback.
> > + */
> > + if (env->ops->btf_struct_access && !type_is_local(reg->type)) {
> > ret = env->ops->btf_struct_access(&env->log, reg->btf, t,
> > - off, size, atype, &btf_id, &flag);
> > + off, size, atype, &btf_id, &flag,
> > + false);
> > } else {
> > - if (atype != BPF_READ) {
> > + /* It is allowed to write to pointer to a local type */
> > + if (atype != BPF_READ && !type_is_local(reg->type)) {
> > verbose(env, "only read is supported\n");
> > return -EACCES;
> > }
> >
> > ret = btf_struct_access(&env->log, reg->btf, t, off, size,
> > - atype, &btf_id, &flag);
> > + atype, &btf_id, &flag, type_is_local(reg->type));
>
> imo it's cleaner to pass 'reg' instead of 'reg->btf',
> so we don't have to pass another boolean.
> And check type_is_local(reg) inside btf_struct_access().
>
Yes, makes sense, will change in v1.
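Roughly, the reworked signature would look like this (a sketch, not the
final diff; the only change is passing the register state and doing the
MEM_TYPE_LOCAL test inside):

    int btf_struct_access(struct bpf_verifier_log *log,
                          const struct bpf_reg_state *reg,
                          int off, int size, enum bpf_access_type atype,
                          u32 *next_btf_id, enum bpf_type_flag *flag)
    {
            const struct btf *btf = reg->btf;
            bool local_type = reg->type & MEM_TYPE_LOCAL;
            ...
            case WALK_PTR:
                    /* for local types the destination register cannot
                     * become a pointer again
                     */
                    if (local_type)
                            return SCALAR_VALUE;
            ...
    }

with callers passing the register state down instead of reg->btf plus a bool.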
> > }
> >
> > if (ret < 0)
> > @@ -4630,7 +4640,7 @@ static int check_ptr_to_map_access(struct bpf_verifier_env *env,
> > return -EACCES;
> > }
> >
> > - ret = btf_struct_access(&env->log, btf_vmlinux, t, off, size, atype, &btf_id, &flag);
> > + ret = btf_struct_access(&env->log, btf_vmlinux, t, off, size, atype, &btf_id, &flag, false);
> > if (ret < 0)
> > return ret;
> >
> > @@ -7661,6 +7671,11 @@ static bool is_kfunc_destructive(struct bpf_kfunc_arg_meta *meta)
> > return meta->kfunc_flags & KF_DESTRUCTIVE;
> > }
> >
> > +static bool __is_kfunc_ret_dyn_btf(struct bpf_kfunc_arg_meta *meta)
> > +{
> > + return meta->kfunc_flags & __KF_RET_DYN_BTF;
> > +}
> > +
> > static bool is_kfunc_arg_kptr_get(struct bpf_kfunc_arg_meta *meta, int arg)
> > {
> > return arg == 0 && (meta->kfunc_flags & KF_KPTR_GET);
> > @@ -7751,6 +7766,24 @@ static u32 *reg2btf_ids[__BPF_REG_TYPE_MAX] = {
> > #endif
> > };
> >
> > +BTF_ID_LIST(special_kfuncs)
> > +BTF_ID(func, bpf_kptr_alloc)
> > +
> > +enum bpf_special_kfuncs {
> > + KF_SPECIAL_bpf_kptr_alloc,
> > + KF_SPECIAL_MAX,
> > +};
> > +
> > +static bool __is_kfunc_special(const struct btf *btf, u32 func_id, unsigned int kf_sp)
> > +{
> > + if (btf != btf_vmlinux || kf_sp >= KF_SPECIAL_MAX)
> > + return false;
> > + return func_id == special_kfuncs[kf_sp];
> > +}
> > +
> > +#define is_kfunc_special(btf, func_id, func_name) \
> > + __is_kfunc_special(btf, func_id, KF_SPECIAL_##func_name)
>
> This looks like reinventing the wheel.
> I'd think something similar to btf_tracing_ids[BTF_TRACING_TYPE_VMA] would work just as well.
> It's less magic. No need for the above macro,
> and btf != btf_vmlinux should really be explicit in the code
> and done early and once.
>
Ok.
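i.e. something along these lines (an illustrative sketch):

    BTF_ID_LIST(special_kfunc_list)
    BTF_ID(func, bpf_kptr_alloc)

    enum { KF_bpf_kptr_alloc };     /* index into special_kfunc_list */

    /* in check_kfunc_call(), with the btf_vmlinux check done once and early: */
    if (meta.btf == btf_vmlinux &&
        meta.func_id == special_kfunc_list[KF_bpf_kptr_alloc]) {
            /* dynamic return type handling for bpf_kptr_alloc */
    }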
> > +
> > enum kfunc_ptr_arg_types {
> > KF_ARG_PTR_TO_CTX,
> > KF_ARG_PTR_TO_BTF_ID, /* Also covers reg2btf_ids conversions */
> > @@ -8120,20 +8153,55 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > mark_reg_unknown(env, regs, BPF_REG_0);
> > mark_btf_func_reg_size(env, BPF_REG_0, t->size);
> > } else if (btf_type_is_ptr(t)) {
> > - ptr_type = btf_type_skip_modifiers(desc_btf, t->type,
> > - &ptr_type_id);
> > - if (!btf_type_is_struct(ptr_type)) {
> > - ptr_type_name = btf_name_by_offset(desc_btf,
> > - ptr_type->name_off);
> > - verbose(env, "kernel function %s returns pointer type %s %s is not supported\n",
> > - func_name, btf_type_str(ptr_type),
> > - ptr_type_name);
> > - return -EINVAL;
> > - }
> > + struct btf *ret_btf;
> > + u32 ret_btf_id;
> > +
> > + ptr_type = btf_type_skip_modifiers(desc_btf, t->type, &ptr_type_id);
> > mark_reg_known_zero(env, regs, BPF_REG_0);
> > - regs[BPF_REG_0].btf = desc_btf;
> > regs[BPF_REG_0].type = PTR_TO_BTF_ID;
> > - regs[BPF_REG_0].btf_id = ptr_type_id;
> > +
> > + if (__is_kfunc_ret_dyn_btf(&meta)) {
>
> just check meta.func_id == kfunc_ids[KF_kptr_alloc] instead?
>
> > + const struct btf_type *ret_t;
> > +
> > + /* Currently, only bpf_kptr_alloc needs special handling */
> > + if (!is_kfunc_special(meta.btf, meta.func_id, bpf_kptr_alloc) ||
>
> same thing.
>
Ack.
> > + !meta.arg_constant.found || !btf_type_is_void(ptr_type)) {
> > + verbose(env, "verifier internal error: misconfigured kfunc\n");
> > + return -EFAULT;
> > + }
> > +
> > + if (((u64)(u32)meta.arg_constant.value) != meta.arg_constant.value) {
> > + verbose(env, "local type ID argument must be in range [0, U32_MAX]\n");
> > + return -EINVAL;
> > + }
> > +
> > + ret_btf = env->prog->aux->btf;
> > + ret_btf_id = meta.arg_constant.value;
> > +
> > + ret_t = btf_type_by_id(ret_btf, ret_btf_id);
> > + if (!ret_t || !__btf_type_is_struct(ret_t)) {
> > + verbose(env, "local type ID %d passed to bpf_kptr_alloc does not refer to struct\n",
> > + ret_btf_id);
> > + return -EINVAL;
> > + }
> > + /* Remember this so that we can rewrite R1 as size in fixup_kfunc_call */
> > + env->insn_aux_data[insn_idx].kptr_alloc_size = ret_t->size;
> > + /* For now, since we hardcode prog->btf, also hardcode
> > + * setting of this flag.
> > + */
> > + regs[BPF_REG_0].type |= MEM_TYPE_LOCAL;
> > + } else {
> > + if (!btf_type_is_struct(ptr_type)) {
> > + ptr_type_name = btf_name_by_offset(desc_btf, ptr_type->name_off);
> > + verbose(env, "kernel function %s returns pointer type %s %s is not supported\n",
> > + func_name, btf_type_str(ptr_type), ptr_type_name);
> > + return -EINVAL;
> > + }
> > + ret_btf = desc_btf;
> > + ret_btf_id = ptr_type_id;
> > + }
> > + regs[BPF_REG_0].btf = ret_btf;
> > + regs[BPF_REG_0].btf_id = ret_btf_id;
> > if (is_kfunc_ret_null(&meta)) {
> > regs[BPF_REG_0].type |= PTR_MAYBE_NULL;
> > /* For mark_ptr_or_null_reg, see 93c230e3f5bd6 */
> > @@ -14371,8 +14439,43 @@ static int fixup_call_args(struct bpf_verifier_env *env)
> > return err;
> > }
> >
> > +static int do_kfunc_fixups(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > + s32 imm, int insn_idx, int delta)
> > +{
> > + struct bpf_insn insn_buf[16];
> > + struct bpf_prog *new_prog;
> > + int cnt;
> > +
> > + /* No need to lookup btf, only vmlinux kfuncs are supported for special
> > + * kfuncs handling. Hence when insn->off is zero, check if it is a
> > + * special kfunc by hardcoding btf as btf_vmlinux.
> > + */
> > + if (!insn->off && is_kfunc_special(btf_vmlinux, insn->imm, bpf_kptr_alloc)) {
> > + u64 local_type_size = env->insn_aux_data[insn_idx + delta].kptr_alloc_size;
> > +
> > + insn_buf[0] = BPF_MOV64_IMM(BPF_REG_1, local_type_size);
> > + insn_buf[1] = *insn;
> > + cnt = 2;
> > +
> > + new_prog = bpf_patch_insn_data(env, insn_idx + delta, insn_buf, cnt);
> > + if (!new_prog)
> > + return -ENOMEM;
> > +
> > + delta += cnt - 1;
> > + insn = new_prog->insnsi + insn_idx + delta;
> > + goto patch_call_imm;
> > + }
> > +
> > + insn->imm = imm;
> > + return 0;
> > +patch_call_imm:
> > + insn->imm = imm;
> > + return cnt - 1;
> > +}
> > +
> > static int fixup_kfunc_call(struct bpf_verifier_env *env,
> > - struct bpf_insn *insn)
> > + struct bpf_insn *insn,
> > + int insn_idx, int delta)
> > {
> > const struct bpf_kfunc_desc *desc;
> >
> > @@ -14391,9 +14494,7 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env,
> > return -EFAULT;
> > }
> >
> > - insn->imm = desc->imm;
> > -
> > - return 0;
> > + return do_kfunc_fixups(env, insn, desc->imm, insn_idx, delta);
> > }
> >
> > /* Do various post-verification rewrites in a single program pass.
> > @@ -14534,9 +14635,18 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > if (insn->src_reg == BPF_PSEUDO_CALL)
> > continue;
> > if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
> > - ret = fixup_kfunc_call(env, insn);
> > - if (ret)
> > + ret = fixup_kfunc_call(env, insn, i, delta);
> > + if (ret < 0)
> > return ret;
> > + /* If ret > 0, fixup_kfunc_call did some instruction
> > + * rewrites. Increment delta, reload prog and insn,
> > + * env->prog is already set by it to the new_prog.
> > + */
> > + if (ret) {
> > + delta += ret;
> > + prog = env->prog;
> > + insn = prog->insnsi + i + delta;
> > + }
>
> See how Yonghong did it:
> https://lore.kernel.org/all/20220807175121.4179410-1-yhs@fb.com/
>
> It's cleaner to patch and adjust here instead of patching in one place
> and adjusting in another.
>
Agreed, will fix it in v1.
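For reference, a rough sketch of the "patch and adjust in one place" shape,
adapted to this case (details are illustrative, not the actual v1 diff):

    static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
                                struct bpf_insn *insn_buf, int insn_idx, int *cnt)
    {
            const struct bpf_kfunc_desc *desc;

            *cnt = 0;
            desc = find_kfunc_desc(env->prog, insn->imm, insn->off);
            if (!desc)
                    return -EFAULT;

            insn->imm = desc->imm;
            if (!insn->off &&
                desc->func_id == special_kfunc_list[KF_bpf_kptr_alloc]) {
                    /* rewrite R1 from the local type BTF ID to the size */
                    u64 size = env->insn_aux_data[insn_idx].kptr_alloc_size;

                    insn_buf[0] = BPF_MOV64_IMM(BPF_REG_1, size);
                    insn_buf[1] = *insn;
                    *cnt = 2;
            }
            return 0;
    }

do_misc_fixups() then calls bpf_patch_insn_data() and adjusts delta,
env->prog and insn right there, next to the other rewrites.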
* Re: [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model
2022-09-08 2:39 ` Kumar Kartikeya Dwivedi
@ 2022-09-08 3:37 ` Alexei Starovoitov
2022-09-08 11:50 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-08 3:37 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, Sep 08, 2022 at 04:39:43AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Thu, 8 Sept 2022 at 02:34, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sun, Sep 04, 2022 at 10:41:29PM +0200, Kumar Kartikeya Dwivedi wrote:
> > > Add the concept of a memory object model to BPF verifier.
> > >
> > > What this means is that there are now some types that are not just plain
> > > old data, but require explicit action when they are allocated in
> > > storage, before their lifetime is considered started and before it is
> > > allowed for them to escape the program. The verifier will track state of
> > > such fields during the various phases of the object lifetime, where it
> > > can be sure about certain invariants.
> > >
> > > Some inspiration is taken from existing memory object and lifetime
> > > models in C and C++ which have stood the test of time. See [0], [1], [2]
> > > for more information, to find some similarities. In the future, the
> > > separation of storage and object lifetime may be made more stark by
> > > allowing to change effective type of storage allocated for a local kptr.
> > > For now, that has been left out. It is only possible when verifier
> > > understands when the program has exclusive access to storage, and when
> > > the object it is hosting is no longer accessible to other CPUs.
> > >
> > > This can be useful to maintain size-class based freelists inside BPF
> > > programs and reuse storage of same size for different types. This would
> > > only be safe to allow if verifier can ensure that while storage lifetime
> > > has not ended, object lifetime for the current type has. This
> > > necessitates separating the two and accommodating a simple model to track
> > > object lifetime (composed recursively of more objects whose lifetime
> > > is individually tracked).
> > >
> > > Every time a BPF program allocates such non-trivial types, it must call a
> > > set of constructors on the object to fully begin its lifetime before it
> > > can make use of the pointer to this type. If the program does not do so,
> > > the verifier will complain and lead to failure in loading of the
> > > program.
> > >
> > > Similarly, when ending the lifetime of such types, it is required to
> > > fully destruct the object using a series of destructors for each
> > > non-trivial member, before finally freeing the storage the object is
> > > making use of.
> > >
> > > During both the construction and destruction phase, there can be only
> > > one program that can own and access such an object, hence their is no
> > > need of any explicit synchronization. The single ownership of such
> > > objects makes it easy for the verifier to enforce the safety around the
> > > beginning and end of the lifetime without resorting to dynamic checks.
> > >
> > > When there are multiple fields needing construction or destruction, the
> > > program must call their constructors in ascending order of the offset of
> > > the field.
> > >
> > > For example, consider the following type (support for such fields will
> > > be added in subsequent patches):
> > >
> > > struct data {
> > > struct bpf_spin_lock lock;
> > > struct bpf_list_head list __contains(struct, foo, node);
> > > int data;
> > > };
> > >
> > > struct data *d = bpf_kptr_alloc(...);
> > > if (!d) { ... }
> > >
> > > Now, the type of d would be PTR_TO_BTF_ID | MEM_TYPE_LOCAL |
> > > OBJ_CONSTRUCTING, as it needs two constructor calls (for lock and head),
> > > before it can be considered fully initialized and alive.
> > >
> > > Hence, we must do (in order of field offsets):
> > >
> > > bpf_spin_lock_init(&d->lock);
> > > bpf_list_head_init(&d->list);
> >
> > All sounds great in theory, but I think it's unnecessarily complex at this point.
> > There is still a need for __bpf_list_head_init_zeroed, as seen in later patches.
>
> This particular call is only because of map values. INIT_LIST_HEAD for
> prealloc init or alloc_elem would be costly.
> There won't be any concern to do it in check_and_init_map_value, we
> zero out the field there already. Nothing else needs this check.
>
> List helpers I am planning to inline; it doesn't make sense to have
> two loads/stores inside kfuncs. And then for local kptrs there is no
> need to zero init. pop_front/pop_back are even uglier. There you need
> NULL check + zero init, _then_ check for list_empty. Same with future
> list_splice.
The inlining is an orthogonal topic.
It doesn't have to be done the way of map_gen_lookup().
> I don't believe list helpers are going to be so infrequent that
> none of this matters.
>
> But fine, I still consider this a fair point. I thought a lot about this too.
>
> It really boils down to: do we really want to always zero init?
Special fields like locks, timers, lists, trees -> yes.
>
> What seems more desirable to me is forcing initialization like this,
> esp. since memory reuse is going to be the more common case,
> and then simply relaxing initialization when we know it comes from
> bpf_kptr_zalloc. needs_construction similar to needs_destruction.
> We aren't requiring bpf_list_node_fini, same idea there.
>
> Zeroing the entire big struct vs zeroing/initing two fields makes a
> huge difference.
Right, but too many assumptions in this reasoning.
I wasn't proposing to do bzero the whole sizeof(struct foo) in bpf_kptr_zalloc.
I wasn't proposing to have zalloc flavor either.
We can do selective zeroing.
The prog will call bpf_kptr_alloc and bpf_kptr_free,
but it doesn't have to be the same kfunc-s for all btf types.
We can substitute kfuncs with custom implicit dtors and ctors based on type info.
Sort of like C++ calls constructors in operator new.
But here we can go pretty far with _implicit_ ctors/dtors only.
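As a rough illustration of that selective zeroing (a sketch only; the field
table type and function name below are made up for the example):

    /* a per-type allocation path the verifier could substitute in, zeroing
     * only the special fields, with offsets/sizes taken from the type's BTF
     */
    void *bpf_kptr_alloc_for_type(u64 size, const struct special_field_tab *tab)
    {
            void *p = kmalloc(size, GFP_ATOMIC);
            int i;

            if (!p)
                    return NULL;
            for (i = 0; tab && i < tab->nr_fields; i++)
                    memset(p + tab->fields[i].offset, 0, tab->fields[i].size);
            return p;
    }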
> > So we don't need all these verifier-enforced constructors _today_.
> > Zero init of everything works.
> > It's the case for list_head, list_node, spin_lock, rb_root, rb_node.
> > Pretty much all new data structures will work with zero init
> > and all of them need async dtors.
> > The verifier cannot help during destruction.
> > dtors have to be specified declaratively in a bpf prog for new types
>
> I think about it the other way around.
>
> There actually isn't a need to specify any dtor IMO for custom types.
> Just init and free your type inline. Much more familiar to people
> doing C already.
> Custom types are always just data without special fields, and we know
> how to destroy BPF special fields.
> Map already knows how to 'destruct' these types, just like it has to
> know how to destruct map value.
>
> map value type and local kptr type are similar in that regard. They
> are both local types in prog BTF with special fields.
> If it can do it for map value, it can do it for local kptr if it finds
> it in map (it has to).
>
> To me taking prog reference and setting up per-type dtor is the uglier
> solution. It's unnecessary for the user. That then forces you to have
> similar semantics like bpf_timer. map_release_uref will be used to
> break the reference cycle between map and prog, which is undesirable.
> More effort then - to think about some way to alleviate that, or live
> with it and compromise.
Completely agree. I think explicit (bpf prog provided) dtor is an extreme case.
Hopefully we won't need to add support for it for long time.
The verifier should be able to do implicit ctor/dtor based on BTF only
and that will allow us to build pretty complex data structures with rbtrees,
link lists, etc.
The main point is dtor of bpf_list_head in map value has to be implicit anyway.
The prog can do:
struct foo {
struct bpf_list_head head;
struct bpf_spin_lock lock;
};
bpf_list_lock
bpf_list_add(&val->head, ...);
bpf_list_unlock
exit
The map will have elements allocated and these elements will contain kptrs
and populated link lists and rbtrees.
The bpf infra has to be able to free all these things automatically
based on BTF and it obviously can do so.
Since it has to do it anyway we can allow:
foo_ptr = bpf_kptr_xchg(...);
bpf_kptr_free(foo_ptr);
and that free function will do the same implicit destruction of
the link list which will include walking the list and deleting
elements recursively.
Maybe it means that it would have to grab the locks automatically as well.
Not sure. For single owner case locks won't be needed.
> It shouldn't be invoked on bpf_kptr_free automagically. That is the
> job of the language and best suited to that.
> Verifier will see BPF ASM after translation from C/C++/Rust/etc., so
> for us the destruction at language level appears as the destructing
> phase of local kptr in verifier. For maps it's the last resort, where
> programs are already gone, so there is nothing left to do but free
> stuff.
The C++ compiler generates ctor/dtor sequences and instructs a "dumb"
CPU to execute them. We have the verifier in-between and the whole run-time.
The map destruction case is not "last resort". It's a feature provided by
the bpf run-time. Just like golang garbage collector is not a "last resort".
Anyway back to original point which is:
we don't have to add support to the verifier to enforce explicit ctor/dtor
sequences. We can solve practical use cases without that additional complexity.
There are plenty of other complex things in this patch set.
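Roughly, the kind of BTF-driven teardown described above (a sketch only; the
function name and the exact recursion and locking are not settled):

    /* walk the special fields of an object (map value or local kptr), as
     * recorded in its off tab, and release whatever they still hold
     */
    static void bpf_obj_free_fields(const struct bpf_map_value_off *tab, void *obj)
    {
            int i;

            for (i = 0; tab && i < tab->nr_off; i++) {
                    /* the field being released lives at obj + tab->off[i].offset */
                    switch (tab->off[i].type) {
                    case BPF_KPTR_REF:
                            /* xchg the pointer out and run its dtor */
                            break;
                    case BPF_LIST_HEAD:
                            /* pop each node, recursively free the element
                             * (which may itself contain lists/kptrs), then
                             * free its storage
                             */
                            break;
                    default:
                            break;
                    }
            }
    }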
* Re: [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model
2022-09-08 3:37 ` Alexei Starovoitov
@ 2022-09-08 11:50 ` Kumar Kartikeya Dwivedi
2022-09-08 14:18 ` Alexei Starovoitov
0 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 11:50 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, 8 Sept 2022 at 05:37, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 08, 2022 at 04:39:43AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Thu, 8 Sept 2022 at 02:34, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Sun, Sep 04, 2022 at 10:41:29PM +0200, Kumar Kartikeya Dwivedi wrote:
> > > > Add the concept of a memory object model to BPF verifier.
> > > >
> > > > What this means is that there are now some types that are not just plain
> > > > old data, but require explicit action when they are allocated in
> > > > storage, before their lifetime is considered started and before it is
> > > > allowed for them to escape the program. The verifier will track state of
> > > > such fields during the various phases of the object lifetime, where it
> > > > can be sure about certain invariants.
> > > >
> > > > Some inspiration is taken from existing memory object and lifetime
> > > > models in C and C++ which have stood the test of time. See [0], [1], [2]
> > > > for more information, to find some similarities. In the future, the
> > > > separation of storage and object lifetime may be made more stark by
> > > > allowing to change effective type of storage allocated for a local kptr.
> > > > For now, that has been left out. It is only possible when verifier
> > > > understands when the program has exclusive access to storage, and when
> > > > the object it is hosting is no longer accessible to other CPUs.
> > > >
> > > > This can be useful to maintain size-class based freelists inside BPF
> > > > programs and reuse storage of same size for different types. This would
> > > > only be safe to allow if verifier can ensure that while storage lifetime
> > > > has not ended, object lifetime for the current type has. This
> > > > necessitates separating the two and accommodating a simple model to track
> > > > object lifetime (composed recursively of more objects whose lifetime
> > > > is individually tracked).
> > > >
> > > > Every time a BPF program allocates such non-trivial types, it must call a
> > > > set of constructors on the object to fully begin its lifetime before it
> > > > can make use of the pointer to this type. If the program does not do so,
> > > > the verifier will complain and lead to failure in loading of the
> > > > program.
> > > >
> > > > Similarly, when ending the lifetime of such types, it is required to
> > > > fully destruct the object using a series of destructors for each
> > > > non-trivial member, before finally freeing the storage the object is
> > > > making use of.
> > > >
> > > > During both the construction and destruction phase, there can be only
> > > > one program that can own and access such an object, hence their is no
> > > > need of any explicit synchronization. The single ownership of such
> > > > objects makes it easy for the verifier to enforce the safety around the
> > > > beginning and end of the lifetime without resorting to dynamic checks.
> > > >
> > > > When there are multiple fields needing construction or destruction, the
> > > > program must call their constructors in ascending order of the offset of
> > > > the field.
> > > >
> > > > For example, consider the following type (support for such fields will
> > > > be added in subsequent patches):
> > > >
> > > > struct data {
> > > > struct bpf_spin_lock lock;
> > > > struct bpf_list_head list __contains(struct, foo, node);
> > > > int data;
> > > > };
> > > >
> > > > struct data *d = bpf_kptr_alloc(...);
> > > > if (!d) { ... }
> > > >
> > > > Now, the type of d would be PTR_TO_BTF_ID | MEM_TYPE_LOCAL |
> > > > OBJ_CONSTRUCTING, as it needs two constructor calls (for lock and head),
> > > > before it can be considered fully initialized and alive.
> > > >
> > > > Hence, we must do (in order of field offsets):
> > > >
> > > > bpf_spin_lock_init(&d->lock);
> > > > bpf_list_head_init(&d->list);
> > >
> > > All sounds great in theory, but I think it's unnecessarily complex at this point.
> > > There is still a need for __bpf_list_head_init_zeroed, as seen in later patches.
> >
> > This particular call is only because of map values. INIT_LIST_HEAD for
> > prealloc init or alloc_elem would be costly.
> > There won't be any concern to do it in check_and_init_map_value, we
> > zero out the field there already. Nothing else needs this check.
> >
> > List helpers I am planning to inline; it doesn't make sense to have
> > two loads/stores inside kfuncs. And then for local kptrs there is no
> > need to zero init. pop_front/pop_back are even uglier. There you need
> > NULL check + zero init, _then_ check for list_empty. Same with future
> > list_splice.
>
> The inlining is an orthogonal topic.
> It doesn't have to be done the way of map_gen_lookup().
>
> > I don't believe list helpers are going to be so infrequent that
> > none of this matters.
> >
> > But fine, I still consider this a fair point. I thought a lot about this too.
> >
> > It really boils down to: do we really want to always zero init?
>
> Special fields like locks, timers, lists, trees -> yes.
>
> >
> > What seems more desirable to me is forcing initialization like this,
> > esp. since memory reuse is going to be the more common case,
> > and then simply relaxing initialization when we know it comes from
> > bpf_kptr_zalloc. needs_construction similar to needs_destruction.
> > We aren't requiring bpf_list_node_fini, same idea there.
> >
> > Zeroing the entire big struct vs zeroing/initing two fields makes a
> > huge difference.
>
> Right, but too many assumptions in this reasoning.
> I wasn't proposing to do bzero the whole sizeof(struct foo) in bpf_kptr_zalloc.
> I wasn't proposing to have zalloc flavor either.
> We can do selective zeroing.
> The prog will call bpf_kptr_alloc and bpf_kptr_free,
> but it doesn't have to be the same kfunc-s for all btf types.
> We can substitute kfuncs with custom implicit dtors and ctors based on type info.
> Sort of like C++ calls constructors in operator new.
> But here we can go pretty far with _implicit_ ctors/dtors only.
>
> > > So we don't need all these verifier-enforced constructors _today_.
> > > Zero init of everything works.
> > > It's the case for list_head, list_node, spin_lock, rb_root, rb_node.
> > > Pretty much all new data structures will work with zero init
> > > and all of them need async dtors.
> > > The verifier cannot help during destruction.
> > > dtors have to be specified declaratively in a bpf prog for new types
> >
> > I think about it the other way around.
> >
> > There actually isn't a need to specify any dtor IMO for custom types.
> > Just init and free your type inline. Much more familiar to people
> > doing C already.
> > Custom types are always just data without special fields, and we know
> > how to destroy BPF special fields.
> > Map already knows how to 'destruct' these types, just like it has to
> > know how to destruct map value.
> >
> > map value type and local kptr type are similar in that regard. They
> > are both local types in prog BTF with special fields.
> > If it can do it for map value, it can do it for local kptr if it finds
> > it in map (it has to).
> >
> > To me taking prog reference and setting up per-type dtor is the uglier
> > solution. It's unnecessary for the user. That then forces you to have
> > similar semantics like bpf_timer. map_release_uref will be used to
> > break the reference cycle between map and prog, which is undesirable.
> > More effort then - to think about some way to alleviate that, or live
> > with it and compromise.
>
> Completely agree. I think explicit (bpf prog provided) dtor is an extreme case.
> Hopefully we won't need to add support for it for long time.
> The verifier should be able to do implicit ctor/dtor based on BTF only
> and that will allow us to build pretty complex data structures with rbtrees,
> link lists, etc.
>
> The main point is dtor of bpf_list_head in map value has to be implicit anyway.
> The prog can do:
> struct foo {
> struct bpf_list_head head;
> struct bpf_spin_lock lock;
> };
>
> bpf_list_lock
> bpf_list_add(&val->head, ...);
> bpf_list_unlock
> exit
>
> The map will have elements allocated and these elements will contain kptrs
> and populated link lists and rbtrees.
> The bpf infra has to be able to free all these things automatically
> based on BTF and it obviously can do so.
> Since it has to do it anyway we can allow:
> foo_ptr = bpf_kptr_xchg(...);
> bpf_kptr_free(foo_ptr);
> and that free function will do the same implicit destruction of
> the link list which will include walking the list and deleting
> elements recursively.
> Maybe it means that it would have to grab the locks automatically as well.
> Not sure. For single owner case locks won't be needed.
>
Nope, no locks will be held whenever bpf_kptr_free is called. At that
point, concurrency wrt BPF special fields is always zero.
Even with bpf_refcount you will call it in true branch of
if (bpf_refcount_put(...)) { bpf_kptr_free(kptr); }
For RCU protected ones, lock may be held without refcount to e.g.
manipulate data, but manipulation of BPF fields will require refcount,
to protect against concurrent destruction.
> > It shouldn't be invoked on bpf_kptr_free automagically. That is the
> > job of the language and best suited to that.
> > Verifier will see BPF ASM after translation from C/C++/Rust/etc., so
> > for us the destruction at language level appears as the destructing
> > phase of local kptr in verifier. For maps it's the last resort, where
> > programs are already gone, so there is nothing left to do but free
> > stuff.
>
> The C++ compiler generates ctor/dtor sequences and instructs a "dumb"
> CPU to execute them. We have the verifier in-between and the whole run-time.
> The map destruction case is not "last resort". It's a feature provided by
> the bpf run-time. Just like golang garbage collector is not a "last resort".
>
> Anyway back to original point which is:
> we don't have to add support to the verifier to enforce explicit ctor/dtor
> sequences. We can solve practical use cases without that additional complexity.
> There are plenty of other complex things in this patch set.
I slept over this. I think I can get behind this idea of implicit
ctor/dtor. We might have open coded construction/destruction later if
we want.
I am however thinking of naming these helpers:
bpf_kptr_new
bpf_kptr_delete
to make it clear it does a little more than just allocating the type.
The open coded cases can later derive their allocation from the more
bare bones bpf_kptr_alloc instead in the future.
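With those names, single-ownership usage in BPF C would look roughly like
this (a sketch; the exact signatures are still under discussion, and
glock/ghead stand in for a lock and list head declared elsewhere):

    struct elem {
            struct bpf_list_node node;
            int data;
    };

    struct elem *e;

    e = bpf_kptr_new(struct elem);          /* implicit ctor of special fields */
    if (!e)
            return 0;
    e->data = 42;

    bpf_spin_lock(&glock);
    bpf_list_add(&e->node, &ghead);         /* single ownership: e is consumed */
    bpf_spin_unlock(&glock);

    /* or, if it was never inserted: bpf_kptr_delete(e); (implicit dtor + free) */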
The main reason to have open coded-ness was being able to 'manage'
resources once visibility reduces to current CPU (bpf_refcount_put,
single ownership after xchg, etc.). Even with RCU, we won't allow
touching the BPF special fields without refcount. bpf_spin_lock is
different, as it protects more than just bpf special fields.
But one can still splice or kptr_xchg before passing to bpf_kptr_free
to do that. bpf_kptr_free is basically cleaning up whatever is left by
then, forcefully. In the future, we might even be able to do elision
of implicit dtors based on the seen data flow (splicing in single
ownership implies list is empty, any other op will undo that, etc.) if
there are big structs with too many fields. Can also support that in
open coded cases.
What I want to think about more is whether we should still force
calling bpf_refcount_set vs always setting it to 1.
I know we don't agree about whether list_add in shared mode should
take ref vs transfer ref. I'm leaning towards transfer since that will
be most intuitive. It then works the same way in both cases, single
ownership only transfers the sole reference you have, so you lose
access, but in shared you may have more than one. If you have just one
you will still lose access.
It will be odd for list_add to consume it in one case and not the
other. People should already be fully conscious of how they are
managing the lifetime of their object.
It then seems better to require users to set the initial refcount
themselves. When doing the initial linking it can be very cheap.
Later get/put/inc are always available.
But forcing it to be called is going to be much simpler than this patch.
* Re: [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model
2022-09-08 11:50 ` Kumar Kartikeya Dwivedi
@ 2022-09-08 14:18 ` Alexei Starovoitov
2022-09-08 14:45 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-08 14:18 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, Sep 8, 2022 at 4:50 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> I slept over this. I think I can get behind this idea of implicit
> ctor/dtor. We might have open coded construction/destruction later if
> we want.
>
> I am however thinking of naming these helpers:
> bpf_kptr_new
> bpf_kptr_delete
> to make it clear it does a little more than just allocating the type.
> The open coded cases can later derive their allocation from the more
> bare bones bpf_kptr_alloc instead in the future.
New names make complete sense. Good idea.
> The main reason to have open coded-ness was being able to 'manage'
> resources once visibility reduces to current CPU (bpf_refcount_put,
> single ownership after xchg, etc.). Even with RCU, we won't allow
> touching the BPF special fields without refcount. bpf_spin_lock is
> different, as it protects more than just bpf special fields.
>
> But one can still splice or kptr_xchg before passing to bpf_kptr_free
> to do that. bpf_kptr_free is basically cleaning up whatever is left by
> then, forcefully. In the future, we might even be able to do elision
> of implicit dtors based on the seen data flow (splicing in single
> ownership implies list is empty, any other op will undo that, etc.) if
> there are big structs with too many fields. Can also support that in
> open coded cases.
Right.
>
> What I want to think about more is whether we should still force
> calling bpf_refcount_set vs always setting it to 1.
>
> I know we don't agree about whether list_add in shared mode should
> take ref vs transfer ref. I'm leaning towards transfer since that will
> be most intuitive. It then works the same way in both cases, single
> ownership only transfers the sole reference you have, so you lose
> access, but in shared you may have more than one. If you have just one
> you will still lose access.
>
> It will be odd for list_add to consume it in one case and not the
> other. People should already be fully conscious of how they are
> managing the lifetime of their object.
>
> It then seems better to require users to set the initial refcount
> themselves. When doing the initial linking it can be very cheap.
> Later get/put/inc are always available.
>
> But forcing it to be called is going to be much simpler than this patch.
I'm not convinced yet :)
Pls hold on implementing one way or another.
Let's land the single ownership case for locks, lists,
rbtrees, allocators. That's plenty of patches.
Then we can start a deeper discussion into the shared case.
Whether it will be different in terms of 'lose access after list_add'
is not critical to decide now. It can change in the future too.
The other reason to do implicit inits and ref count sets is to
avoid fighting llvm.
obj = bpf_kptr_new();
obj->var1 = 1;
some_func(&obj->var2);
In many cases the compiler is allowed to sink stores.
If there are two calls that "init" two different fields
the compiler is allowed to change the order as well
even if it doesn't see the body of the function and the function is
marked as __pure. Technically, initializers act as pure functions.
The verifier and llvm already "fight" a lot.
We gotta be very careful in the verifier and not assume
that the code stays as written in C.
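Concretely, with the explicit-ctor scheme from the patch 16 commit message,
the worry is something like (illustrative only):

    struct data *d = bpf_kptr_alloc(...);
    if (!d)
            return 0;
    /* if these were treated as independent/pure calls, the compiler could
     * legally reorder or sink them, breaking a verifier rule that ctors
     * must run in ascending field-offset order:
     */
    bpf_spin_lock_init(&d->lock);
    bpf_list_head_init(&d->list);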
* Re: [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model
2022-09-08 14:18 ` Alexei Starovoitov
@ 2022-09-08 14:45 ` Kumar Kartikeya Dwivedi
2022-09-08 15:11 ` Alexei Starovoitov
0 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 14:45 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, 8 Sept 2022 at 16:18, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 8, 2022 at 4:50 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > I slept over this. I think I can get behind this idea of implicit
> > ctor/dtor. We might have open coded construction/destruction later if
> > we want.
> >
> > I am however thinking of naming these helpers:
> > bpf_kptr_new
> > bpf_kptr_delete
> > to make it clear it does a little more than just allocating the type.
> > The open coded cases can later derive their allocation from the more
> > bare bones bpf_kptr_alloc instead in the future.
>
> New names make complete sense. Good idea.
>
> > The main reason to have open coded-ness was being able to 'manage'
> > resources once visibility reduces to current CPU (bpf_refcount_put,
> > single ownership after xchg, etc.). Even with RCU, we won't allow
> > touching the BPF special fields without refcount. bpf_spin_lock is
> > different, as it protects more than just bpf special fields.
> >
> > But one can still splice or kptr_xchg before passing to bpf_kptr_free
> > to do that. bpf_kptr_free is basically cleaning up whatever is left by
> > then, forcefully. In the future, we might even be able to do elision
> > of implicit dtors based on the seen data flow (splicing in single
> > ownership implies list is empty, any other op will undo that, etc.) if
> > there are big structs with too many fields. Can also support that in
> > open coded cases.
>
> Right.
>
> >
> > What I want to think about more is whether we should still force
> > calling bpf_refcount_set vs always setting it to 1.
> >
> > I know we don't agree about whether list_add in shared mode should
> > take ref vs transfer ref. I'm leaning towards transfer since that will
> > be most intuitive. It then works the same way in both cases, single
> > ownership only transfers the sole reference you have, so you lose
> > access, but in shared you may have more than one. If you have just one
> > you will still lose access.
> >
> > It will be odd for list_add to consume it in one case and not the
> > other. People should already be fully conscious of how they are
> > managing the lifetime of their object.
> >
> > It then seems better to require users to set the initial refcount
> > themselves. When doing the initial linking it can be very cheap.
> > Later get/put/inc are always available.
> >
> > But forcing it to be called is going to be much simpler than this patch.
>
> I'm not convinced yet :)
> Pls hold on implementing one way or another.
> Let's land the single ownership case for locks, lists,
> rbtrees, allocators. That's plenty of patches.
> Then we can start a deeper discussion into the shared case.
> Whether it will be different in terms of 'lose access after list_add'
> is not critical to decide now. It can change in the future too.
>
Right, I'm not implementing it yet. There's a lot of work left to even
finish single ownership structures, then lots of testing.
But it's helpful to keep thinking about future use cases while working
on the current stuff, just to make sure we're not
digging ourselves into a design hole.
We have the option to undo damage here, since this is all
experimental, but there's still an expectation that the API is not
broken on a whim. That wouldn't be very useful for users.
> The other reason to do implicit inits and ref count sets is to
I am not contesting implicit construction.
Other lists already work with zero initialization so list_head seems
more of an exception.
But it's done for good reasons to avoid extra NULL checks
unnecessarily, and make the implementation of list helpers more
efficient and simple at the same time.
> avoid fighting llvm.
> obj = bpf_kptr_new();
> obj->var1 = 1;
> some_func(&obj->var2);
> In many cases the compiler is allowed to sink stores.
> If there are two calls that "init" two different fields
> the compiler is allowed to change the order as well
> even if it doesn't see the body of the function and the function is
> marked as __pure. Technically, initializers act as pure functions.
But bpf_refcount_set won't be marked __pure, nor am I proposing to
allow direct stores to 'set' it.
I'm not a compiler expert by any means, but AFAIK it should not be
doing such reordering for functions otherwise.
What if the function inside has a memory barrier? That would
completely screw up things.
It's going to have external linkage, so I don't think it can assume
anything about side effects or not. So IMO this is not a good point.
Unless you're talking about some new way of inlining such helpers from
the compiler side that doesn't exist yet.
> The verifier and llvm already "fight" a lot.
> We gotta be very careful in the verifier and not assume
> that the code stays as written in C.
So will these implicit zero stores be done when we enter the != NULL
branch, or lazily on first access (helper arg, load, store)?
This is the flip side: rewriting insns to add stores to a local kptr
can only happen after the NULL check, in the != NULL branch; at that
point we cannot assume R1-R5 are free for use, so complicated field
initialization will be uglier to do implicitly (e.g. if it involves
calling functions etc.).
There are pros and cons for both.
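To make the flip side concrete: the rewrite in question would be plain stores
emitted in the != NULL branch (a sketch; offsets are illustrative and the
pointer is assumed to still be in R0):

    /* right after 'if r0 == 0 goto ...', the verifier could insert e.g.: */
    insn_buf[0] = BPF_ST_MEM(BPF_DW, BPF_REG_0, off_of_list_head, 0);
    insn_buf[1] = BPF_ST_MEM(BPF_DW, BPF_REG_0, off_of_list_head + 8, 0);
    insn_buf[2] = BPF_ST_MEM(BPF_W,  BPF_REG_0, off_of_spin_lock, 0);

    /* plain ST insns need no scratch registers, so it does not matter that
     * R1-R5 may be live here; anything that has to call a ctor function at
     * this point would clobber them.
     */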
* Re: [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model
2022-09-08 14:45 ` Kumar Kartikeya Dwivedi
@ 2022-09-08 15:11 ` Alexei Starovoitov
2022-09-08 15:37 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-08 15:11 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, Sep 8, 2022 at 7:46 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Thu, 8 Sept 2022 at 16:18, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Thu, Sep 8, 2022 at 4:50 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > >
> > > I slept over this. I think I can get behind this idea of implicit
> > > ctor/dtor. We might have open coded construction/destruction later if
> > > we want.
> > >
> > > I am however thinking of naming these helpers:
> > > bpf_kptr_new
> > > bpf_kptr_delete
> > > to make it clear it does a little more than just allocating the type.
> > > The open coded cases can later derive their allocation from the more
> > > bare bones bpf_kptr_alloc instead in the future.
> >
> > New names make complete sense. Good idea.
> >
> > > The main reason to have open coded-ness was being able to 'manage'
> > > resources once visibility reduces to current CPU (bpf_refcount_put,
> > > single ownership after xchg, etc.). Even with RCU, we won't allow
> > > touching the BPF special fields without refcount. bpf_spin_lock is
> > > different, as it protects more than just bpf special fields.
> > >
> > > But one can still splice or kptr_xchg before passing to bpf_kptr_free
> > > to do that. bpf_kptr_free is basically cleaning up whatever is left by
> > > then, forcefully. In the future, we might even be able to do elision
> > > of implicit dtors based on the seen data flow (splicing in single
> > > ownership implies list is empty, any other op will undo that, etc.) if
> > > there are big structs with too many fields. Can also support that in
> > > open coded cases.
> >
> > Right.
> >
> > >
> > > What I want to think about more is whether we should still force
> > > calling bpf_refcount_set vs always setting it to 1.
> > >
> > > I know we don't agree about whether list_add in shared mode should
> > > take ref vs transfer ref. I'm leaning towards transfer since that will
> > > be most intuitive. It then works the same way in both cases, single
> > > ownership only transfers the sole reference you have, so you lose
> > > access, but in shared you may have more than one. If you have just one
> > > you will still lose access.
> > >
> > > It will be odd for list_add to consume it in one case and not the
> > > other. People should already be fully conscious of how they are
> > > managing the lifetime of their object.
> > >
> > > It then seems better to require users to set the initial refcount
> > > themselves. When doing the initial linking it can be very cheap.
> > > Later get/put/inc are always available.
> > >
> > > But forcing it to be called is going to be much simpler than this patch.
> >
> > I'm not convinced yet :)
> > Pls hold on implementing one way or another.
> > Let's land the single ownership case for locks, lists,
> > rbtrees, allocators. That's plenty of patches.
> > Then we can start a deeper discussion into the shared case.
> > Whether it will be different in terms of 'lose access after list_add'
> > is not critical to decide now. It can change in the future too.
> >
>
> Right, I'm not implementing it yet. There's a lot of work left to even
> finish single ownership structures, then lots of testing.
> But it's helpful to keep thinking about future use cases while working
> on the current stuff, just to make sure we're not
> digging ourselves into a design hole.
>
> We have the option to undo damage here, since this is all
> experimental, but there's still an expectation that the API is not
> broken at whim. That wouldn't be very useful for users.
imo this part is minor.
The whole lock + list_or_rbtree in a single allocation
restriction bothers me a lot more.
We will find out for sure only when we have a prototype
of lock + list + rbtree and let folks who requested it
actually code things.
> > The other reason to do implicit inits and ref count sets is to
>
> I am not contesting implicit construction.
> Other lists already work with zero initialization so list_head seems
> more of an exception.
> But it's done for good reasons to avoid extra NULL checks
> unnecessarily, and make the implementation of list helpers more
> efficient and simple at the same time.
>
> > avoid fighting llvm.
> > obj = bpf_kptr_new();
> > obj->var1 = 1;
> > some_func(&obj->var2);
> > In many cases the compiler is allowed to sink stores.
> > If there are two calls that "init" two different fields
> > the compiler is allowed to change the order as well
> > even if it doesn't see the body of the function and the function is
> > marked as __pure. Technically, initializers act as pure functions.
>
> But bpf_refcount_set won't be marked __pure, nor am I proposing to
> allow direct stores to 'set' it.
> I'm not a compiler expert by any means, but AFAIK it should not be
> doing such reordering for functions otherwise.
> What if the function inside has a memory barrier? That would
> completely screw up things.
> It's going to have external linkage, so I don't think it can assume
> anything about side effects or not. So IMO this is not a good point.
The pure attribute tells the compiler that the function doesn't
have side effects. We even use it in the kernel code base.
Sooner or later we'll start using it in bpf too.
Things like memcmp are a prime example.
I have to correct myself though. refcount_set shouldn't be
considered pure.
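For reference, the attribute looks like this (an illustrative declaration
only; my_memcmp is made up):

/* result depends only on the arguments and readable memory, no side
 * effects, so the compiler may reorder or merge calls
 */
__attribute__((pure)) int my_memcmp(const void *a, const void *b, unsigned long n);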
> Unless you're talking about some new way of inlining such helpers from
> the compiler side that doesn't exist yet.
>
> > The verifier and llvm already "fight" a lot.
> > We gotta be very careful in the verifier and not assume
> > that the code stays as written in C.
>
> So will these implicit zero stores be done when we enter != NULL
> branch, or lazily on first access (helper arg, load, store)?
Whichever way is faster and still safe.
I assumed that we'd have to zero them after successful alloc.
> This is the flip side: rewriting insns to add stores to the local kptr
> can only happen after the NULL check, in the != NULL branch, at that
> point we cannot assume R1-R5 are free for use, so complicated field
> initialization will be uglier to do implicitly (e.g. if it involves
> calling functions etc.).
> There are pros and cons for both.
Are you expecting the verifier to insert zero inits
as actual insns after 'call bpf_kptr_new' insn ?
Hmm. I imagined bpf_kptr_new helper will do it.
Just a simple loop that is inverse of zero_map_value().
Not the fastest thing, but good enough to start and can be
optimized later.
The verifier can insert ST insn too in !null branch,
since that insn only needs one register and it's known
in that branch.
It's questionable that a bunch of ST insns will be faster
than a zero_map_value-like loop.
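Roughly something like this on the helper side (a sketch only; the struct
and field names are made up, the real thing would reuse whatever offset
table the map/BTF side already keeps for the special fields):

struct obj_field_offs {
	u32 cnt;
	u32 field_off[8];
	u8  field_sz[8];
};

static void bpf_obj_zero_special_fields(const struct obj_field_offs *foffs, void *obj)
{
	int i;

	/* zero only the BPF-special fields, like zero_map_value() does for
	 * a map value, but for a freshly allocated object
	 */
	for (i = 0; i < foffs->cnt; i++)
		memset((char *)obj + foffs->field_off[i], 0, foffs->field_sz[i]);
}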
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model
2022-09-08 15:11 ` Alexei Starovoitov
@ 2022-09-08 15:37 ` Kumar Kartikeya Dwivedi
2022-09-08 15:59 ` Alexei Starovoitov
0 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-08 15:37 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, 8 Sept 2022 at 17:11, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Sep 8, 2022 at 7:46 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > On Thu, 8 Sept 2022 at 16:18, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Thu, Sep 8, 2022 at 4:50 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > >
> > > > I slept over this. I think I can get behind this idea of implicit
> > > > ctor/dtor. We might have open coded construction/destruction later if
> > > > we want.
> > > >
> > > > I am however thinking of naming these helpers:
> > > > bpf_kptr_new
> > > > bpf_kptr_delete
> > > > to make it clear it does a little more than just allocating the type.
> > > > The open coded cases can later derive their allocation from the more
> > > > bare bones bpf_kptr_alloc instead in the future.
> > >
> > > New names make complete sense. Good idea.
> > >
> > > > The main reason to have open coded-ness was being able to 'manage'
> > > > resources once visibility reduces to current CPU (bpf_refcount_put,
> > > > single ownership after xchg, etc.). Even with RCU, we won't allow
> > > > touching the BPF special fields without refcount. bpf_spin_lock is
> > > > different, as it protects more than just bpf special fields.
> > > >
> > > > But one can still splice or kptr_xchg before passing to bpf_kptr_free
> > > > to do that. bpf_kptr_free is basically cleaning up whatever is left by
> > > > then, forcefully. In the future, we might even be able to do elision
> > > > of implicit dtors based on the seen data flow (splicing in single
> > > > ownership implies list is empty, any other op will undo that, etc.) if
> > > > there are big structs with too many fields. Can also support that in
> > > > open coded cases.
> > >
> > > Right.
> > >
> > > >
> > > > What I want to think about more is whether we should still force
> > > > calling bpf_refcount_set vs always setting it to 1.
> > > >
> > > > I know we don't agree about whether list_add in shared mode should
> > > > take ref vs transfer ref. I'm leaning towards transfer since that will
> > > > be most intuitive. It then works the same way in both cases, single
> > > > ownership only transfers the sole reference you have, so you lose
> > > > access, but in shared you may have more than one. If you have just one
> > > > you will still lose access.
> > > >
> > > > It will be odd for list_add to consume it in one case and not the
> > > > other. People should already be fully conscious of how they are
> > > > managing the lifetime of their object.
> > > >
> > > > It then seems better to require users to set the initial refcount
> > > > themselves. When doing the initial linking it can be very cheap.
> > > > Later get/put/inc are always available.
> > > >
> > > > But forcing it to be called is going to be much simpler than this patch.
> > >
> > > I'm not convinced yet :)
> > > Pls hold on implementing one way or another.
> > > Let's land the single ownership case for locks, lists,
> > > rbtrees, allocators. That's plenty of patches.
> > > Then we can start a deeper discussion into the shared case.
> > > Whether it will be different in terms of 'lose access after list_add'
> > > is not critical to decide now. It can change in the future too.
> > >
> >
> > Right, I'm not implementing it yet. There's a lot of work left to even
> > finish single ownership structures, then lots of testing.
> > But it's helpful to keep thinking about future use cases while working
> > on the current stuff, just to make sure we're not
> > digging ourselves into a design hole.
> >
> > We have the option to undo damage here, since this is all
> > experimental, but there's still an expectation that the API is not
> > broken at whim. That wouldn't be very useful for users.
>
> imo this part is minor.
> The whole lock + list_or_rbtree in a single allocation
> restriction bothers me a lot more.
> We will find out for sure only when we have a prototype
> of lock + list + rbtree and let folks who requested it
> actually code things.
>
Sure.
But when I look in the kernel, I often see data and its lock allocated
together.
The lock is just there to serialize access to the data structure. It
might as well not be there at all if it's a bpf_llist_head, or only be
there optionally.
It might be an entirely different way of serializing access (local_t +
percpu list).
But usually having both together is also great for locality.
Different use cases have different needs; the simple and common cases
are often well served by having both together.
Not every use case needs both a list and an rbtree. Some require access
to only one at a time.
e.g. We might reserve struct rb_node in xdp_frame, allowing struct
bpf_rb_tree __contains(xdp_frame, rb_node)
or struct bpf_rb_tree __contains(sk_buff, rb_node), unifying the
queueing primitives for XDP and TC.
It makes sense to make simple cases faster and simpler to use.
This is why I eventually plan to add an RCU-based hash table using
these single ownership lists in selftests, at least as a showcase that it
can serve a 'real world' use case.
Dave's dynamic lock checks are conceptually not very different from
the verifier's perspective. A bpf_spin_lock * vs bpf_spin_lock
protecting the list. Indirection allows it to assume a dynamic value
at runtime. Some checks can still be done statically, and some where
the pointer's indeterminism hinders static analysis can be offloaded
to runtime. Different tradeoffs.
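Concretely, the kind of layout I mean (illustrative BPF C; the annotation
syntax is the same __contains I used above and is still subject to change):

struct elem {
	int key;
	struct bpf_list_node node;
};

struct bucket {
	struct bpf_spin_lock lock;   /* lock and the data it protects in one allocation */
	struct bpf_list_head head __contains(elem, node);
};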
> > > The other reason to do implicit inits and ref count sets is to
> >
> > I am not contesting implicit construction.
> > Other lists already work with zero initialization so list_head seems
> > more of an exception.
> > But it's done for good reasons to avoid extra NULL checks
> > unnecessarily, and make the implementation of list helpers more
> > efficient and simple at the same time.
> >
> > > avoid fighting llvm.
> > > obj = bpf_kptr_new();
> > > obj->var1 = 1;
> > > some_func(&obj->var2);
> > > In many cases the compiler is allowed to sink stores.
> > > If there are two calls that "init" two different fields
> > > the compiler is allowed to change the order as well
> > > even if it doesn't see the body of the function and the function is
> > > marked as __pure. Technically, initializers act as pure functions.
> >
> > But bpf_refcount_set won't be marked __pure, nor am I proposing to
> > allow direct stores to 'set' it.
> > I'm not a compiler expert by any means, but AFAIK it should not be
> > doing such reordering for functions otherwise.
> > What if the function inside has a memory barrier? That would
> > completely screw up things.
> > It's going to have external linkage, so I don't think it can assume
> > anything about side effects or not. So IMO this is not a good point.
>
> The pure attribute tells the compiler that the function doesn't
> have side effects. We even use it in the kernel code base.
> Sooner or later we'll start using it in bpf too.
> Things like memcmp are a prime example.
> I have to correct myself though. refcount_set shouldn't be
> considered pure.
>
> > Unless you're talking about some new way of inlining such helpers from
> > the compiler side that doesn't exist yet.
> >
> > > The verifier and llvm already "fight" a lot.
> > > We gotta be very careful in the verifier and not assume
> > > that the code stays as written in C.
> >
> > So will these implicit zero stores be done when we enter != NULL
> > branch, or lazily on first access (helper arg, load, store)?
>
> Whichever way is faster and still safe.
> I assumed that we'd have to zero them after successful alloc.
>
> > This is the flip side: rewriting insns to add stores to the local kptr
> > can only happen after the NULL check, in the != NULL branch, at that
> > point we cannot assume R1-R5 are free for use, so complicated field
> > initialization will be uglier to do implicitly (e.g. if it involves
> > calling functions etc.).
> > There are pros and cons for both.
>
> Are you expecting the verifier to insert zero inits
> as actual insns after 'call bpf_kptr_new' insn ?
> Hmm. I imagined bpf_kptr_new helper will do it.
> Just a simple loop that is inverse of zero_map_value().
> Not the fastest thing, but good enough to start and can be
> optimized later.
> The verifier can insert ST insn too in !null branch,
> since that insn only needs one register and it's known
> in that branch.
> It's questionable that a bunch of ST insns will be faster
> than a zero_map_value-like loop.
I would definitely think a bunch of direct 0 stores would be faster.
zero_map_value will access some kind of offset array in memory and
then read from it and loop over to do the stores.
That does seem like more work to do; even if it's hot in cache, those are
unneeded extra accesses for information we know statically.
So I'll most likely emit a bunch of ST insns zeroing it out in v1.
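i.e. roughly this kind of rewrite in the verifier, in the != NULL branch
(sketch only; the insn_buf handling and the field offset are made up for
the example):

/* R0 still holds the new object here; zero a 16-byte bpf_list_head at
 * a statically known offset with two direct stores
 */
insn_buf[cnt++] = BPF_ST_MEM(BPF_DW, BPF_REG_0, field_off, 0);
insn_buf[cnt++] = BPF_ST_MEM(BPF_DW, BPF_REG_0, field_off + 8, 0);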
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model
2022-09-08 15:37 ` Kumar Kartikeya Dwivedi
@ 2022-09-08 15:59 ` Alexei Starovoitov
0 siblings, 0 replies; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-08 15:59 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Dave Marchevsky, Delyan Kratunov
On Thu, Sep 8, 2022 at 8:38 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > Are you expecting the verifier to insert zero inits
> > as actual insns after 'call bpf_kptr_new' insn ?
> > Hmm. I imagined bpf_kptr_new helper will do it.
> > Just a simple loop that is inverse of zero_map_value().
> > Not the fastest thing, but good enough to start and can be
> > optimized later.
> > The verifier can insert ST insn too in !null branch,
> > since that insn only needs one register and it's known
> > in that branch.
> > It's questionable that a bunch of ST insns will be faster
> > than a zero_map_value-like loop.
>
> I would definitely think a bunch of direct 0 stores would be faster.
> zero_map_value will access some kind of offset array in memory and
> then read from it and loop over to do the stores.
> Does seem like more work to do, even if it's hot in cache those are
> unneeded extra accesses for information we know statically.
> So I'll most likely emit a bunch of ST insns zeroing it out in v1.
Premature optimization is ...
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 05/32] bpf: Support kptrs in local storage maps
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 05/32] bpf: Support kptrs in local storage maps Kumar Kartikeya Dwivedi
2022-09-07 19:00 ` Alexei Starovoitov
@ 2022-09-09 5:27 ` Martin KaFai Lau
2022-09-09 11:22 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 82+ messages in thread
From: Martin KaFai Lau @ 2022-09-09 5:27 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf
Cc: Martin KaFai Lau, KP Singh, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Dave Marchevsky, Delyan Kratunov
On 9/4/22 1:41 PM, Kumar Kartikeya Dwivedi wrote:
> diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> index 7ea18d4da84b..6786d00f004e 100644
> --- a/include/linux/bpf_local_storage.h
> +++ b/include/linux/bpf_local_storage.h
> @@ -74,7 +74,7 @@ struct bpf_local_storage_elem {
> struct hlist_node snode; /* Linked to bpf_local_storage */
> struct bpf_local_storage __rcu *local_storage;
> struct rcu_head rcu;
> - /* 8 bytes hole */
> + struct bpf_map *map; /* Only set for bpf_selem_free_rcu */
Instead of adding another map ptr and using the last 8 bytes hole,
> /* The data is stored in another cacheline to minimize
> * the number of cachelines access during a cache hit.
> */
> diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> index 802fc15b0d73..4a725379d761 100644
> --- a/kernel/bpf/bpf_local_storage.c
> +++ b/kernel/bpf/bpf_local_storage.c
> @@ -74,7 +74,8 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> gfp_flags | __GFP_NOWARN);
> if (selem) {
> if (value)
> - memcpy(SDATA(selem)->data, value, smap->map.value_size);
> + copy_map_value(&smap->map, SDATA(selem)->data, value);
> + /* No call to check_and_init_map_value as memory is zero init */
> return selem;
> }
>
> @@ -92,12 +93,27 @@ void bpf_local_storage_free_rcu(struct rcu_head *rcu)
> kfree_rcu(local_storage, rcu);
> }
>
> +static void check_and_free_fields(struct bpf_local_storage_elem *selem)
> +{
> + if (map_value_has_kptrs(selem->map))
could SDATA(selem)->smap->map be used here ?
> + bpf_map_free_kptrs(selem->map, SDATA(selem));
> +}
> +
> static void bpf_selem_free_rcu(struct rcu_head *rcu)
> {
> struct bpf_local_storage_elem *selem;
>
> selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> - kfree_rcu(selem, rcu);
> + check_and_free_fields(selem);
> + kfree(selem);
> +}
> +
> +static void bpf_selem_free_tasks_trace_rcu(struct rcu_head *rcu)
> +{
> + struct bpf_local_storage_elem *selem;
> +
> + selem = container_of(rcu, struct bpf_local_storage_elem, rcu);
> + call_rcu(&selem->rcu, bpf_selem_free_rcu);
> }
>
> /* local_storage->lock must be held and selem->local_storage == local_storage.
> @@ -150,10 +166,11 @@ bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage,
> SDATA(selem))
> RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL);
>
> + selem->map = &smap->map;
> if (use_trace_rcu)
> - call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_rcu);
> + call_rcu_tasks_trace(&selem->rcu, bpf_selem_free_tasks_trace_rcu);
> else
> - kfree_rcu(selem, rcu);
> + call_rcu(&selem->rcu, bpf_selem_free_rcu);
>
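i.e. something like this (ignoring the __rcu annotation on smap for brevity):

static void check_and_free_fields(struct bpf_local_storage_elem *selem)
{
	struct bpf_map *map = &SDATA(selem)->smap->map;

	if (map_value_has_kptrs(map))
		bpf_map_free_kptrs(map, SDATA(selem));
}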
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables Kumar Kartikeya Dwivedi
2022-09-08 0:27 ` Alexei Starovoitov
@ 2022-09-09 8:13 ` Dave Marchevsky
2022-09-09 11:05 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 82+ messages in thread
From: Dave Marchevsky @ 2022-09-09 8:13 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Delyan Kratunov
On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> Global variables reside in maps accessible using direct_value_addr
> callbacks, so giving each load instruction's rewrite a unique reg->id
> disallows us from holding locks which are global.
>
> This is not great, so refactor the active_spin_lock into two separate
> fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> enough to allow it for global variables, map lookups, and local kptr
> registers at the same time.
>
> Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> reg->map_ptr or reg->btf pointer of the register used for locking spin
> lock. But the active_spin_lock_id also needs to be compared to ensure
> whether bpf_spin_unlock is for the same register.
>
> Next, pseudo load instructions are not given a unique reg->id, as they
> are doing lookup for the same map value (max_entries is never greater
> than 1).
>
For libbpf-style "internal maps" - like .bss.private further in this series -
all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
struct bpf_spin_lock lock1 SEC(".bss.private");
struct bpf_spin_lock lock2 SEC(".bss.private");
...
spin_lock(&lock1);
...
spin_lock(&lock2);
will result in the same map but different offsets for the direct read (and different
aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
this patch would assign both the same (active_spin_lock_ptr, active_spin_lock_id).
> Essentially, we consider that the tuple of (active_spin_lock_ptr,
> active_spin_lock_id) will always be unique for any kind of argument to
> bpf_spin_{lock,unlock}.
>
> Note that this can be extended in the future to also remember offset
> used for locking, so that we can introduce multiple bpf_spin_lock fields
> in the same allocation.
>
In light of the above the "multiple spin locks in same map_value"
is probably needed for the common case, probably similar enough to
"same allocation" logic.
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/linux/bpf_verifier.h | 3 ++-
> kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++-----------
> 2 files changed, 29 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 2a9dcefca3b6..00c21ad6f61c 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -348,7 +348,8 @@ struct bpf_verifier_state {
> u32 branches;
> u32 insn_idx;
> u32 curframe;
> - u32 active_spin_lock;
> + void *active_spin_lock_ptr;
> + u32 active_spin_lock_id;
It would be good to make this "(lock_ptr, lock_id) is identifier for lock"
concept more concrete by grouping these fields in a struct w/ type enum + union,
or something similar. Will make it more obvious that they should be used / set
together.
But if you'd prefer to keep it as two fields, active_spin_lock_ptr is a
confusing name. In the future with no context as to what that field is, I'd
assume that it holds a pointer to a spin_lock instead of a "spin lock identity
pointer".
> bool speculative;
>
> /* first and last insn idx of this verifier state */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index b1754fd69f7d..ed19e4036b0a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1202,7 +1202,8 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
> }
> dst_state->speculative = src->speculative;
> dst_state->curframe = src->curframe;
> - dst_state->active_spin_lock = src->active_spin_lock;
> + dst_state->active_spin_lock_ptr = src->active_spin_lock_ptr;
> + dst_state->active_spin_lock_id = src->active_spin_lock_id;
> dst_state->branches = src->branches;
> dst_state->parent = src->parent;
> dst_state->first_insn_idx = src->first_insn_idx;
> @@ -5504,22 +5505,35 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
> return -EINVAL;
> }
> if (is_lock) {
> - if (cur->active_spin_lock) {
> + if (cur->active_spin_lock_ptr) {
> verbose(env,
> "Locking two bpf_spin_locks are not allowed\n");
> return -EINVAL;
> }
> - cur->active_spin_lock = reg->id;
> + if (map)
> + cur->active_spin_lock_ptr = map;
> + else
> + cur->active_spin_lock_ptr = btf;
> + cur->active_spin_lock_id = reg->id;
> } else {
> - if (!cur->active_spin_lock) {
> + void *ptr;
> +
> + if (map)
> + ptr = map;
> + else
> + ptr = btf;
> +
> + if (!cur->active_spin_lock_ptr) {
> verbose(env, "bpf_spin_unlock without taking a lock\n");
> return -EINVAL;
> }
> - if (cur->active_spin_lock != reg->id) {
> + if (cur->active_spin_lock_ptr != ptr ||
> + cur->active_spin_lock_id != reg->id) {
> verbose(env, "bpf_spin_unlock of different lock\n");
> return -EINVAL;
> }
> - cur->active_spin_lock = 0;
> + cur->active_spin_lock_ptr = NULL;
> + cur->active_spin_lock_id = 0;
> }
> return 0;
> }
> @@ -11207,8 +11221,8 @@ static int check_ld_imm(struct bpf_verifier_env *env, struct bpf_insn *insn)
> insn->src_reg == BPF_PSEUDO_MAP_IDX_VALUE) {
> dst_reg->type = PTR_TO_MAP_VALUE;
> dst_reg->off = aux->map_off;
Here's where check_ld_imm uses aux->map_off.
> - if (map_value_has_spin_lock(map))
> - dst_reg->id = ++env->id_gen;
> + WARN_ON_ONCE(map->max_entries != 1);
> + /* We want reg->id to be same (0) as map_value is not distinct */
> } else if (insn->src_reg == BPF_PSEUDO_MAP_FD ||
> insn->src_reg == BPF_PSEUDO_MAP_IDX) {
> dst_reg->type = CONST_PTR_TO_MAP;
> @@ -11286,7 +11300,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
> return err;
> }
>
> - if (env->cur_state->active_spin_lock) {
> + if (env->cur_state->active_spin_lock_ptr) {
> verbose(env, "BPF_LD_[ABS|IND] cannot be used inside bpf_spin_lock-ed region\n");
> return -EINVAL;
> }
> @@ -12566,7 +12580,8 @@ static bool states_equal(struct bpf_verifier_env *env,
> if (old->speculative && !cur->speculative)
> return false;
>
> - if (old->active_spin_lock != cur->active_spin_lock)
> + if (old->active_spin_lock_ptr != cur->active_spin_lock_ptr ||
> + old->active_spin_lock_id != cur->active_spin_lock_id)
> return false;
>
> /* for states to be equal callsites have to be the same
> @@ -13213,7 +13228,7 @@ static int do_check(struct bpf_verifier_env *env)
> return -EINVAL;
> }
>
> - if (env->cur_state->active_spin_lock &&
> + if (env->cur_state->active_spin_lock_ptr &&
> (insn->src_reg == BPF_PSEUDO_CALL ||
> insn->imm != BPF_FUNC_spin_unlock)) {
> verbose(env, "function calls are not allowed while holding a lock\n");
> @@ -13250,7 +13265,7 @@ static int do_check(struct bpf_verifier_env *env)
> return -EINVAL;
> }
>
> - if (env->cur_state->active_spin_lock) {
> + if (env->cur_state->active_spin_lock_ptr) {
> verbose(env, "bpf_spin_unlock is missing\n");
> return -EINVAL;
> }
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 18/32] bpf: Support bpf_spin_lock in local kptrs
2022-09-08 0:35 ` Alexei Starovoitov
@ 2022-09-09 8:25 ` Dave Marchevsky
2022-09-09 11:20 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Dave Marchevsky @ 2022-09-09 8:25 UTC (permalink / raw)
To: Alexei Starovoitov, Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Delyan Kratunov
On 9/7/22 8:35 PM, Alexei Starovoitov wrote:
> On Sun, Sep 04, 2022 at 10:41:31PM +0200, Kumar Kartikeya Dwivedi wrote:
>> diff --git a/include/linux/poison.h b/include/linux/poison.h
>> index d62ef5a6b4e9..753e00b81acf 100644
>> --- a/include/linux/poison.h
>> +++ b/include/linux/poison.h
>> @@ -81,4 +81,7 @@
>> /********** net/core/page_pool.c **********/
>> #define PP_SIGNATURE (0x40 + POISON_POINTER_DELTA)
>>
>> +/********** kernel/bpf/helpers.c **********/
>> +#define BPF_PTR_POISON ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
>> +
>
> That was part of Dave's patch set as well.
> Please keep his SOB and authorship and keep it as separate patch.
My patch picked a different constant :). But on that note, it also added some
checking in verifier.c so that verification fails if any arg or retval type
was BPF_PTR_POISON after it should've been replaced. Perhaps it's worth shipping
that patch ("bpf: Add verifier check for BPF_PTR_POISON retval and arg")
separately? Would allow both rbtree series and this lock-focused patch to drop
BPF_PTR_POISON changes after rebase.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 8:13 ` Dave Marchevsky
@ 2022-09-09 11:05 ` Kumar Kartikeya Dwivedi
2022-09-09 14:24 ` Alexei Starovoitov
0 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-09 11:05 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Delyan Kratunov
On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> > Global variables reside in maps accessible using direct_value_addr
> > callbacks, so giving each load instruction's rewrite a unique reg->id
> > disallows us from holding locks which are global.
> >
> > This is not great, so refactor the active_spin_lock into two separate
> > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > enough to allow it for global variables, map lookups, and local kptr
> > registers at the same time.
> >
> > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > lock. But the active_spin_lock_id also needs to be compared to ensure
> > whether bpf_spin_unlock is for the same register.
> >
> > Next, pseudo load instructions are not given a unique reg->id, as they
> > are doing lookup for the same map value (max_entries is never greater
> > than 1).
> >
>
> For libbpf-style "internal maps" - like .bss.private further in this series -
> all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
>
> struct bpf_spin_lock lock1 SEC(".bss.private");
> struct bpf_spin_lock lock2 SEC(".bss.private");
> ...
> spin_lock(&lock1);
> ...
> spin_lock(&lock2);
>
> will result in same map but different offsets for the direct read (and different
> aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
>
That won't be a problem. Two spin locks in a map value or datasec are
already rejected on BPF_MAP_CREATE,
so there is no bug. See idx >= info_cnt check in
btf_find_struct_field, btf_find_datasec_var.
I can include offset as the third part of the tuple. The problem then
is figuring out which lock protects which bpf_list_head. We would need
another __guarded_by annotation and would have to force users to use it to
eliminate the ambiguity. So for now I just put it in the commit log
and left it for the future.
But it does seem like it's going to be needed at least for the global
case, which should probably do it from the get go.
How does the above idea sound to you?
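To spell out what I mean by the annotation, it would look something like
this (purely hypothetical syntax, nothing of the sort exists yet):

struct two_locks {
	struct bpf_spin_lock lock_a;
	struct bpf_spin_lock lock_b;
	struct bpf_list_head head_a __guarded_by(lock_a);
	struct bpf_list_head head_b __guarded_by(lock_b);
};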
> > Essentially, we consider that the tuple of (active_spin_lock_ptr,
> > active_spin_lock_id) will always be unique for any kind of argument to
> > bpf_spin_{lock,unlock}.
> >
> > Note that this can be extended in the future to also remember offset
> > used for locking, so that we can introduce multiple bpf_spin_lock fields
> > in the same allocation.
> >
>
> In light of the above the "multiple spin locks in same map_value"
> is probably needed for the common case, probably similar enough to
> "same allocation" logic.
>
Yes, it would be the same idea. Only need to remember an additional
field, the 'offset'.
{ptr, id} already distinguishes between allocations. The offset allows us to
distinguish locks within the same allocation, i.e. the same {ptr, id}.
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > include/linux/bpf_verifier.h | 3 ++-
> > kernel/bpf/verifier.c | 39 +++++++++++++++++++++++++-----------
> > 2 files changed, 29 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> > index 2a9dcefca3b6..00c21ad6f61c 100644
> > --- a/include/linux/bpf_verifier.h
> > +++ b/include/linux/bpf_verifier.h
> > @@ -348,7 +348,8 @@ struct bpf_verifier_state {
> > u32 branches;
> > u32 insn_idx;
> > u32 curframe;
> > - u32 active_spin_lock;
> > + void *active_spin_lock_ptr;
> > + u32 active_spin_lock_id;
>
> It would be good to make this "(lock_ptr, lock_id) is identifier for lock"
> concept more concrete by grouping these fields in a struct w/ type enum + union,
> or something similar. Will make it more obvious that they should be used / set
> together.
>
> But if you'd prefer to keep it as two fields, active_spin_lock_ptr is a
> confusing name. In the future with no context as to what that field is, I'd
> assume that it holds a pointer to a spin_lock instead of a "spin lock identity
> pointer".
>
That's a good point.
I'm thinking
struct active_lock {
void *id_ptr;
u32 offset;
u32 reg_id;
};
How does that look?
> > bool speculative;
> >
> > /* first and last insn idx of this verifier state */
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index b1754fd69f7d..ed19e4036b0a 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -1202,7 +1202,8 @@ static int copy_verifier_state(struct bpf_verifier_state *dst_state,
> > }
> > dst_state->speculative = src->speculative;
> > dst_state->curframe = src->curframe;
> > - dst_state->active_spin_lock = src->active_spin_lock;
> > + dst_state->active_spin_lock_ptr = src->active_spin_lock_ptr;
> > + dst_state->active_spin_lock_id = src->active_spin_lock_id;
> > dst_state->branches = src->branches;
> > dst_state->parent = src->parent;
> > dst_state->first_insn_idx = src->first_insn_idx;
> > @@ -5504,22 +5505,35 @@ static int process_spin_lock(struct bpf_verifier_env *env, int regno,
> > return -EINVAL;
> > }
> > if (is_lock) {
> > - if (cur->active_spin_lock) {
> > + if (cur->active_spin_lock_ptr) {
> > verbose(env,
> > "Locking two bpf_spin_locks are not allowed\n");
> > return -EINVAL;
> > }
> > - cur->active_spin_lock = reg->id;
> > + if (map)
> > + cur->active_spin_lock_ptr = map;
> > + else
> > + cur->active_spin_lock_ptr = btf;
> > + cur->active_spin_lock_id = reg->id;
> > } else {
> > - if (!cur->active_spin_lock) {
> > + void *ptr;
> > +
> > + if (map)
> > + ptr = map;
> > + else
> > + ptr = btf;
> > +
> > + if (!cur->active_spin_lock_ptr) {
> > verbose(env, "bpf_spin_unlock without taking a lock\n");
> > return -EINVAL;
> > }
> > - if (cur->active_spin_lock != reg->id) {
> > + if (cur->active_spin_lock_ptr != ptr ||
> > + cur->active_spin_lock_id != reg->id) {
> > verbose(env, "bpf_spin_unlock of different lock\n");
> > return -EINVAL;
> > }
> > - cur->active_spin_lock = 0;
> > + cur->active_spin_lock_ptr = NULL;
> > + cur->active_spin_lock_id = 0;
> > }
> > return 0;
> > }
> > @@ -11207,8 +11221,8 @@ static int check_ld_imm(struct bpf_verifier_env *env, struct bpf_insn *insn)
> > insn->src_reg == BPF_PSEUDO_MAP_IDX_VALUE) {
> > dst_reg->type = PTR_TO_MAP_VALUE;
> > dst_reg->off = aux->map_off;
>
> Here's where check_ld_imm uses aux->map_off.
>
> > - if (map_value_has_spin_lock(map))
> > - dst_reg->id = ++env->id_gen;
> > + WARN_ON_ONCE(map->max_entries != 1);
> > + /* We want reg->id to be same (0) as map_value is not distinct */
> > } else if (insn->src_reg == BPF_PSEUDO_MAP_FD ||
> > insn->src_reg == BPF_PSEUDO_MAP_IDX) {
> > dst_reg->type = CONST_PTR_TO_MAP;
> > @@ -11286,7 +11300,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
> > return err;
> > }
> >
> > - if (env->cur_state->active_spin_lock) {
> > + if (env->cur_state->active_spin_lock_ptr) {
> > verbose(env, "BPF_LD_[ABS|IND] cannot be used inside bpf_spin_lock-ed region\n");
> > return -EINVAL;
> > }
> > @@ -12566,7 +12580,8 @@ static bool states_equal(struct bpf_verifier_env *env,
> > if (old->speculative && !cur->speculative)
> > return false;
> >
> > - if (old->active_spin_lock != cur->active_spin_lock)
> > + if (old->active_spin_lock_ptr != cur->active_spin_lock_ptr ||
> > + old->active_spin_lock_id != cur->active_spin_lock_id)
> > return false;
> >
> > /* for states to be equal callsites have to be the same
> > @@ -13213,7 +13228,7 @@ static int do_check(struct bpf_verifier_env *env)
> > return -EINVAL;
> > }
> >
> > - if (env->cur_state->active_spin_lock &&
> > + if (env->cur_state->active_spin_lock_ptr &&
> > (insn->src_reg == BPF_PSEUDO_CALL ||
> > insn->imm != BPF_FUNC_spin_unlock)) {
> > verbose(env, "function calls are not allowed while holding a lock\n");
> > @@ -13250,7 +13265,7 @@ static int do_check(struct bpf_verifier_env *env)
> > return -EINVAL;
> > }
> >
> > - if (env->cur_state->active_spin_lock) {
> > + if (env->cur_state->active_spin_lock_ptr) {
> > verbose(env, "bpf_spin_unlock is missing\n");
> > return -EINVAL;
> > }
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 18/32] bpf: Support bpf_spin_lock in local kptrs
2022-09-09 8:25 ` Dave Marchevsky
@ 2022-09-09 11:20 ` Kumar Kartikeya Dwivedi
2022-09-09 14:26 ` Alexei Starovoitov
0 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-09 11:20 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Delyan Kratunov
On Fri, 9 Sept 2022 at 10:25, Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> On 9/7/22 8:35 PM, Alexei Starovoitov wrote:
> > On Sun, Sep 04, 2022 at 10:41:31PM +0200, Kumar Kartikeya Dwivedi wrote:
> >> diff --git a/include/linux/poison.h b/include/linux/poison.h
> >> index d62ef5a6b4e9..753e00b81acf 100644
> >> --- a/include/linux/poison.h
> >> +++ b/include/linux/poison.h
> >> @@ -81,4 +81,7 @@
> >> /********** net/core/page_pool.c **********/
> >> #define PP_SIGNATURE (0x40 + POISON_POINTER_DELTA)
> >>
> >> +/********** kernel/bpf/helpers.c **********/
> >> +#define BPF_PTR_POISON ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
> >> +
> >
> > That was part of Dave's patch set as well.
> > Please keep his SOB and authorship and keep it as separate patch.
>
> My patch picked a different constant :). But on that note, it also added some
> checking in verifier.c so that verification fails if any arg or retval type
> was BPF_PTR_POISON after it should've been replaced. Perhaps it's worth shipping
> that patch ("bpf: Add verifier check for BPF_PTR_POISON retval and arg")
> separately? Would allow both rbtree series and this lock-focused patch to drop
> BPF_PTR_POISON changes after rebase.
Yeah, feel free to post it separately. I'm using the constant in this
patch for a different purpose (I separate the BPF_PTR_POISON case into
its own different argument type, and then check that it is always set
for it in check_btf_id_ok, to ensure DYN_BTF_ID is not setting some
static real BTF ID).
But why change the constant, eB9F looks very close to eBPF already :).
UL is just for unsigned long.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 05/32] bpf: Support kptrs in local storage maps
2022-09-09 5:27 ` Martin KaFai Lau
@ 2022-09-09 11:22 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-09 11:22 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: bpf, Martin KaFai Lau, KP Singh, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Dave Marchevsky,
Delyan Kratunov
On Fri, 9 Sept 2022 at 07:27, Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 9/4/22 1:41 PM, Kumar Kartikeya Dwivedi wrote:
> > diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h
> > index 7ea18d4da84b..6786d00f004e 100644
> > --- a/include/linux/bpf_local_storage.h
> > +++ b/include/linux/bpf_local_storage.h
> > @@ -74,7 +74,7 @@ struct bpf_local_storage_elem {
> > struct hlist_node snode; /* Linked to bpf_local_storage */
> > struct bpf_local_storage __rcu *local_storage;
> > struct rcu_head rcu;
> > - /* 8 bytes hole */
> > + struct bpf_map *map; /* Only set for bpf_selem_free_rcu */
>
> Instead of adding another map ptr and using the last 8 bytes hole,
>
> > /* The data is stored in another cacheline to minimize
> > * the number of cachelines access during a cache hit.
> > */
> > diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c
> > index 802fc15b0d73..4a725379d761 100644
> > --- a/kernel/bpf/bpf_local_storage.c
> > +++ b/kernel/bpf/bpf_local_storage.c
> > @@ -74,7 +74,8 @@ bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner,
> > gfp_flags | __GFP_NOWARN);
> > if (selem) {
> > if (value)
> > - memcpy(SDATA(selem)->data, value, smap->map.value_size);
> > + copy_map_value(&smap->map, SDATA(selem)->data, value);
> > + /* No call to check_and_init_map_value as memory is zero init */
> > return selem;
> > }
> >
> > @@ -92,12 +93,27 @@ void bpf_local_storage_free_rcu(struct rcu_head *rcu)
> > kfree_rcu(local_storage, rcu);
> > }
> >
> > +static void check_and_free_fields(struct bpf_local_storage_elem *selem)
> > +{
> > + if (map_value_has_kptrs(selem->map))
>
> could SDATA(selem)->smap->map be used here ?
>
Yeah, that should work. Thanks Martin.
> > + bpf_map_free_kptrs(selem->map, SDATA(selem));
> > +}
> > +
> [...]
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 11:05 ` Kumar Kartikeya Dwivedi
@ 2022-09-09 14:24 ` Alexei Starovoitov
2022-09-09 14:50 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-09 14:24 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Delyan Kratunov
On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> >
> > On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> > > Global variables reside in maps accessible using direct_value_addr
> > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > disallows us from holding locks which are global.
> > >
> > > This is not great, so refactor the active_spin_lock into two separate
> > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > enough to allow it for global variables, map lookups, and local kptr
> > > registers at the same time.
> > >
> > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > whether bpf_spin_unlock is for the same register.
> > >
> > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > are doing lookup for the same map value (max_entries is never greater
> > > than 1).
> > >
> >
> > For libbpf-style "internal maps" - like .bss.private further in this series -
> > all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
> >
> > struct bpf_spin_lock lock1 SEC(".bss.private");
> > struct bpf_spin_lock lock2 SEC(".bss.private");
> > ...
> > spin_lock(&lock1);
> > ...
> > spin_lock(&lock2);
> >
> > will result in same map but different offsets for the direct read (and different
> > aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> > this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
> >
>
> That won't be a problem. Two spin locks in a map value or datasec are
> already rejected on BPF_MAP_CREATE,
> so there is no bug. See idx >= info_cnt check in
> btf_find_struct_field, btf_find_datasec_var.
>
> I can include offset as the third part of the tuple. The problem then
> is figuring out which lock protects which bpf_list_head. We need
> another __guarded_by annotation and force users to use that to
> eliminate the ambiguity. So for now I just put it in the commit log
> and left it for the future.
Let's not go that far yet.
Extra annotations are just as confusing and non-obvious as
putting locks in different sections.
Let's keep one lock per map value limitation for now.
libbpf side needs to allow many non-mappable sections though.
Single bss.private name is too limiting.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 18/32] bpf: Support bpf_spin_lock in local kptrs
2022-09-09 11:20 ` Kumar Kartikeya Dwivedi
@ 2022-09-09 14:26 ` Alexei Starovoitov
0 siblings, 0 replies; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-09 14:26 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Delyan Kratunov
On Fri, Sep 9, 2022 at 4:21 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Fri, 9 Sept 2022 at 10:25, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> >
> > On 9/7/22 8:35 PM, Alexei Starovoitov wrote:
> > > On Sun, Sep 04, 2022 at 10:41:31PM +0200, Kumar Kartikeya Dwivedi wrote:
> > >> diff --git a/include/linux/poison.h b/include/linux/poison.h
> > >> index d62ef5a6b4e9..753e00b81acf 100644
> > >> --- a/include/linux/poison.h
> > >> +++ b/include/linux/poison.h
> > >> @@ -81,4 +81,7 @@
> > >> /********** net/core/page_pool.c **********/
> > >> #define PP_SIGNATURE (0x40 + POISON_POINTER_DELTA)
> > >>
> > >> +/********** kernel/bpf/helpers.c **********/
> > >> +#define BPF_PTR_POISON ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA))
> > >> +
> > >
> > > That was part of Dave's patch set as well.
> > > Please keep his SOB and authorship and keep it as separate patch.
> >
> > My patch picked a different constant :). But on that note, it also added some
> > checking in verifier.c so that verification fails if any arg or retval type
> > was BPF_PTR_POISON after it should've been replaced. Perhaps it's worth shipping
> > that patch ("bpf: Add verifier check for BPF_PTR_POISON retval and arg")
> > separately? Would allow both rbtree series and this lock-focused patch to drop
> > BPF_PTR_POISON changes after rebase.
>
> Yeah, feel free to post it separately. I'm using the constant in this
> patch for a different purpose (I separate the BPF_PTR_POISON case into
> its own different argument type, and then check that it is always set
> for it in check_btf_id_ok, to ensure DYN_BTF_ID is not setting some
> static real BTF ID).
>
> But why change the constant, eB9F looks very close to eBPF already :).
+1. It's already in many spots:
git grep eB9F
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 14:24 ` Alexei Starovoitov
@ 2022-09-09 14:50 ` Kumar Kartikeya Dwivedi
2022-09-09 14:58 ` Alexei Starovoitov
0 siblings, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-09 14:50 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Delyan Kratunov
On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> > >
> > > On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> > > > Global variables reside in maps accessible using direct_value_addr
> > > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > > disallows us from holding locks which are global.
> > > >
> > > > This is not great, so refactor the active_spin_lock into two separate
> > > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > > enough to allow it for global variables, map lookups, and local kptr
> > > > registers at the same time.
> > > >
> > > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > > whether bpf_spin_unlock is for the same register.
> > > >
> > > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > > are doing lookup for the same map value (max_entries is never greater
> > > > than 1).
> > > >
> > >
> > > For libbpf-style "internal maps" - like .bss.private further in this series -
> > > all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
> > >
> > > struct bpf_spin_lock lock1 SEC(".bss.private");
> > > struct bpf_spin_lock lock2 SEC(".bss.private");
> > > ...
> > > spin_lock(&lock1);
> > > ...
> > > spin_lock(&lock2);
> > >
> > > will result in same map but different offsets for the direct read (and different
> > > aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> > > this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
> > >
> >
> > That won't be a problem. Two spin locks in a map value or datasec are
> > already rejected on BPF_MAP_CREATE,
> > so there is no bug. See idx >= info_cnt check in
> > btf_find_struct_field, btf_find_datasec_var.
> >
> > I can include offset as the third part of the tuple. The problem then
> > is figuring out which lock protects which bpf_list_head. We need
> > another __guarded_by annotation and force users to use that to
> > eliminate the ambiguity. So for now I just put it in the commit log
> > and left it for the future.
>
> Let's not go that far yet.
> Extra annotations are just as confusing and non-obvious as
> putting locks in different sections.
> Let's keep one lock per map value limitation for now.
> libbpf side needs to allow many non-mappable sections though.
> Single bss.private name is too limiting.
In that case,
Dave, since the libbpf patch is yours, would you be fine with
reworking it to support multiple private maps?
Maybe it can just ignore the .XXX part in .bss.private.XXX?
Also I think Andrii mentioned once that he wants to eventually merge
data and bss, so it might be a good idea to call it .data.private from
the start?
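e.g. something like (just to illustrate the naming, not an existing libbpf
feature):

struct bpf_spin_lock lock_a SEC(".data.private.a");
struct bpf_spin_lock lock_b SEC(".data.private.b");

so each suffix gets its own hidden map, and the one-lock-per-map-value
limitation stays easy to satisfy.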
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 14:50 ` Kumar Kartikeya Dwivedi
@ 2022-09-09 14:58 ` Alexei Starovoitov
2022-09-09 18:32 ` Andrii Nakryiko
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-09 14:58 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Delyan Kratunov
On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > >
> > > On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> > > >
> > > > On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> > > > > Global variables reside in maps accessible using direct_value_addr
> > > > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > > > disallows us from holding locks which are global.
> > > > >
> > > > > This is not great, so refactor the active_spin_lock into two separate
> > > > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > > > enough to allow it for global variables, map lookups, and local kptr
> > > > > registers at the same time.
> > > > >
> > > > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > > > whether bpf_spin_unlock is for the same register.
> > > > >
> > > > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > > > are doing lookup for the same map value (max_entries is never greater
> > > > > than 1).
> > > > >
> > > >
> > > > For libbpf-style "internal maps" - like .bss.private further in this series -
> > > > all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
> > > >
> > > > struct bpf_spin_lock lock1 SEC(".bss.private");
> > > > struct bpf_spin_lock lock2 SEC(".bss.private");
> > > > ...
> > > > spin_lock(&lock1);
> > > > ...
> > > > spin_lock(&lock2);
> > > >
> > > > will result in same map but different offsets for the direct read (and different
> > > > aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> > > > this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
> > > >
> > >
> > > That won't be a problem. Two spin locks in a map value or datasec are
> > > already rejected on BPF_MAP_CREATE,
> > > so there is no bug. See idx >= info_cnt check in
> > > btf_find_struct_field, btf_find_datasec_var.
> > >
> > > I can include offset as the third part of the tuple. The problem then
> > > is figuring out which lock protects which bpf_list_head. We need
> > > another __guarded_by annotation and force users to use that to
> > > eliminate the ambiguity. So for now I just put it in the commit log
> > > and left it for the future.
> >
> > Let's not go that far yet.
> > Extra annotations are just as confusing and non-obvious as
> > putting locks in different sections.
> > Let's keep one lock per map value limitation for now.
> > libbpf side needs to allow many non-mappable sections though.
> > Single bss.private name is too limiting.
>
> In that case,
> Dave, since the libbpf patch is yours, would you be fine with
> reworking it to support multiple private maps?
> Maybe it can just ignore the .XXX part in .bss.private.XXX?
> Also I think Andrii mentioned once that he wants to eventually merge
> data and bss, so it might be a good idea to call it .data.private from
> the start?
I'd probably make all non-canonical names non-mmapable.
The compiler generates special sections already.
Thankfully the code doesn't use them, but it will sooner or later.
So libbpf has to create hidden maps for them eventually.
They shouldn't be messed up from user space, since that would screw up
compiler-generated code.
Andrii, what's your take?
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 14:58 ` Alexei Starovoitov
@ 2022-09-09 18:32 ` Andrii Nakryiko
2022-09-09 19:25 ` Alexei Starovoitov
0 siblings, 1 reply; 82+ messages in thread
From: Andrii Nakryiko @ 2022-09-09 18:32 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Fri, Sep 9, 2022 at 7:58 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > >
> > > > On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> > > > >
> > > > > On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> > > > > > Global variables reside in maps accessible using direct_value_addr
> > > > > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > > > > disallows us from holding locks which are global.
> > > > > >
> > > > > > This is not great, so refactor the active_spin_lock into two separate
> > > > > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > > > > enough to allow it for global variables, map lookups, and local kptr
> > > > > > registers at the same time.
> > > > > >
> > > > > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > > > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > > > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > > > > whether bpf_spin_unlock is for the same register.
> > > > > >
> > > > > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > > > > are doing lookup for the same map value (max_entries is never greater
> > > > > > than 1).
> > > > > >
> > > > >
> > > > > For libbpf-style "internal maps" - like .bss.private further in this series -
> > > > > all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
> > > > >
> > > > > struct bpf_spin_lock lock1 SEC(".bss.private");
> > > > > struct bpf_spin_lock lock2 SEC(".bss.private");
> > > > > ...
> > > > > spin_lock(&lock1);
> > > > > ...
> > > > > spin_lock(&lock2);
> > > > >
> > > > > will result in same map but different offsets for the direct read (and different
> > > > > aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> > > > > this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
> > > > >
> > > >
> > > > That won't be a problem. Two spin locks in a map value or datasec are
> > > > already rejected on BPF_MAP_CREATE,
> > > > so there is no bug. See idx >= info_cnt check in
> > > > btf_find_struct_field, btf_find_datasec_var.
> > > >
> > > > I can include offset as the third part of the tuple. The problem then
> > > > is figuring out which lock protects which bpf_list_head. We need
> > > > another __guarded_by annotation and force users to use that to
> > > > eliminate the ambiguity. So for now I just put it in the commit log
> > > > and left it for the future.
> > >
> > > Let's not go that far yet.
> > > Extra annotations are just as confusing and non-obvious as
> > > putting locks in different sections.
> > > Let's keep one lock per map value limitation for now.
> > > libbpf side needs to allow many non-mappable sections though.
> > > Single bss.private name is too limiting.
> >
> > In that case,
> > Dave, since the libbpf patch is yours, would you be fine with
> > reworking it to support multiple private maps?
> > Maybe it can just ignore the .XXX part in .bss.private.XXX?
> > Also I think Andrii mentioned once that he wants to eventually merge
> > data and bss, so it might be a good idea to call it .data.private from
> > the start?
>
> I'd probably make all non-canonical names to be not-mmapable.
> The compiler generates special sections already.
> Thankfully the code doesn't use them, but it will sooner or later.
> So libbpf has to create hidden maps for them eventually.
> They shouldn't be messed up from user space, since it will screw up
> compiler generated code.
>
> Andrii, what's your take?
Ok, a bunch of things to unpack. We've also discussed a lot of this
with Dave a few weeks ago, but I also have a few questions.
First, I'd like to not keep extending ".bss" with any custom ".bss.*"
sections. This is why we have .data.* and .rodata.* and not .bss (bad,
meaningless, historic name).
But I'm totally fine dedicating some other prefix to non-mmapable data
sections that won't be exposed in skeleton and, well, not-mmapable.
What to name them depends on what we anticipate putting in them.
If it's just for spinlocks, then having something like SEC(".locks")
seems best to me. If it's for more stuff, like global kptrs, rbtrees
and whatnot, then we'd need a bit more generic name (.private, or
whatever, didn't think much on best name). We can also allow .locks.*
or .private.* (i.e., keep it uniform with .data and .rodata handling,
except for the mmapable aspect).
One benefit for having SEC(".locks") just for spin_locks is that we
can teach libbpf to create a multi-element ARRAY map, where each lock
variable is put into a separate element. From BPF verifier's
perspective, there will be a single BTF type describing spin lock, but
multiple "instances" of the lock, one per element. That seems a bit
magical and I think, generally speaking, it's best to start supporting
multiple lock declarations within a single map element (and thus keep
track of their offset within map_value); but at least that's an
option.
Dave had some concerns about pinning such maps and whatnot, but for
starters we decided to not worry about pinning for now. Dave, please
bring up remaining issues, if you don't mind.
So to answer Alexei's specific proposal: I'm still not in favor of just
saying "anything that's not .data or .rodata is a non-mmapable map". I'd
rather carve out naming prefixes with . (which are reserved for
libbpf's own use) for these special purpose maps. I don't think that
limits anyone, right?
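To make the multi-element ".locks" idea concrete, a rough sketch in BPF C (the
".locks" section name, the variable names, and the per-element layout are only
the proposal under discussion here, not an existing libbpf convention):

        struct bpf_spin_lock ingress_lock SEC(".locks");   /* hypothetical: becomes element 0 */
        struct bpf_spin_lock egress_lock  SEC(".locks");   /* hypothetical: becomes element 1 */

libbpf would create one BPF_MAP_TYPE_ARRAY with max_entries == 2 and a single
BTF value type (struct bpf_spin_lock); the verifier could then identify a held
lock by (map, element index) rather than needing a separate datasec map per lock.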
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 18:32 ` Andrii Nakryiko
@ 2022-09-09 19:25 ` Alexei Starovoitov
2022-09-09 20:21 ` Andrii Nakryiko
2022-09-09 22:30 ` Dave Marchevsky
0 siblings, 2 replies; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-09 19:25 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Fri, Sep 09, 2022 at 11:32:40AM -0700, Andrii Nakryiko wrote:
> On Fri, Sep 9, 2022 at 7:58 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > >
> > > On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > > >
> > > > > On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> > > > > >
> > > > > > On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> > > > > > > Global variables reside in maps accessible using direct_value_addr
> > > > > > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > > > > > disallows us from holding locks which are global.
> > > > > > >
> > > > > > > This is not great, so refactor the active_spin_lock into two separate
> > > > > > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > > > > > enough to allow it for global variables, map lookups, and local kptr
> > > > > > > registers at the same time.
> > > > > > >
> > > > > > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > > > > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > > > > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > > > > > whether bpf_spin_unlock is for the same register.
> > > > > > >
> > > > > > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > > > > > are doing lookup for the same map value (max_entries is never greater
> > > > > > > than 1).
> > > > > > >
> > > > > >
> > > > > > For libbpf-style "internal maps" - like .bss.private further in this series -
> > > > > > all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
> > > > > >
> > > > > > struct bpf_spin_lock lock1 SEC(".bss.private");
> > > > > > struct bpf_spin_lock lock2 SEC(".bss.private");
> > > > > > ...
> > > > > > spin_lock(&lock1);
> > > > > > ...
> > > > > > spin_lock(&lock2);
> > > > > >
> > > > > > will result in same map but different offsets for the direct read (and different
> > > > > > aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> > > > > > this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
> > > > > >
> > > > >
> > > > > That won't be a problem. Two spin locks in a map value or datasec are
> > > > > already rejected on BPF_MAP_CREATE,
> > > > > so there is no bug. See idx >= info_cnt check in
> > > > > btf_find_struct_field, btf_find_datasec_var.
> > > > >
> > > > > I can include offset as the third part of the tuple. The problem then
> > > > > is figuring out which lock protects which bpf_list_head. We need
> > > > > another __guarded_by annotation and force users to use that to
> > > > > eliminate the ambiguity. So for now I just put it in the commit log
> > > > > and left it for the future.
> > > >
> > > > Let's not go that far yet.
> > > > Extra annotations are just as confusing and non-obvious as
> > > > putting locks in different sections.
> > > > Let's keep one lock per map value limitation for now.
> > > > libbpf side needs to allow many non-mappable sections though.
> > > > Single bss.private name is too limiting.
> > >
> > > In that case,
> > > Dave, since the libbpf patch is yours, would you be fine with
> > > reworking it to support multiple private maps?
> > > Maybe it can just ignore the .XXX part in .bss.private.XXX?
> > > Also I think Andrii mentioned once that he wants to eventually merge
> > > data and bss, so it might be a good idea to call it .data.private from
> > > the start?
> >
> > I'd probably make all non-canonical names to be not-mmapable.
> > The compiler generates special sections already.
> > Thankfully the code doesn't use them, but it will sooner or later.
> > So libbpf has to create hidden maps for them eventually.
> > They shouldn't be messed up from user space, since it will screw up
> > compiler generated code.
> >
> > Andrii, what's your take?
>
> Ok, a bunch of things to unpack. We've also discussed a lot of this
> with Dave few weeks ago, but I have also few questions.
>
> First, I'd like to not keep extending ".bss" with any custom ".bss.*"
> sections. This is why we have .data.* and .rodata.* and not .bss (bad,
> meaningless, historic name).
>
> But I'm totally fine dedicating some other prefix to non-mmapable data
> sections that won't be exposed in skeleton and, well, not-mmapable.
> What to name it depends on what we anticipate putting in them?
>
> If it's just for spinlocks, then having something like SEC(".locks")
> seems best to me. If it's for more stuff, like global kptrs, rbtrees
> and whatnot, then we'd need a bit more generic name (.private, or
> whatever, didn't think much on best name). We can also allow .locks.*
> or .private.* (i.e., keep it uniform with .data and .rodata handling,
> expect for mmapable aspect).
>
> One benefit for having SEC(".locks") just for spin_locks is that we
> can teach libbpf to create a multi-element ARRAY map, where each lock
> variable is put into a separate element. From BPF verifier's
> perspective, there will be a single BTF type describing spin lock, but
> multiple "instances" of lock, one per each element. That seems a bit
> magical and I think, generally speaking, it's best to start supporting
> multiple lock declarations within single map element (and thus keep
> track of their offset within map_value); but at least that's an
> option.
".lock" won't work. We need lock+rb_root or lock+list_head to be
in the same section.
It should be up to user to name that section with something meaningful.
Ideally something like this should be supported:
SEC("enqueue") struct bpf_spin_lock enqueue_lock;
SEC("enqueue") struct bpf_list_head enqueue_head __contains(foo, node);
SEC("dequeue") struct bpf_spin_lock dequeue_lock;
SEC("dequeue") struct bpf_list_head dequeue_head __contains(foo, node);
> Dave had some concerns about pinning such maps and whatnot, but for
> starters we decided to not worry about pinning for now. Dave, please
> bring up remaining issues, if you don't mind.
Pinning shouldn't be an issue.
Only mmap is the problem. User space access is fine since the kernel
will mask out special fields on read/write.
> So to answer Alexei's specific option. I'm still not in favor of just
> saying "anything that's not .data or .rodata is non-mmapable map". I'd
> rather carve out naming prefixes with . (which are reserved for
> libbpf's own use) for these special purpose maps. I don't think that
> limits anyone, right?
Is backward compat a concern?
Whether to mmap global data is a flag.
It can be opt-in or opt-out.
I'm proposing to make all named sections 'do not mmap'.
If a section needs to be mmaped and to appear in the skeleton the user can do
SEC("my_section.mmap")
What you're proposing is to do it the other way around:
SEC("enqueue.nommap")
SEC("dequeue.nommap")
in the above example.
I guess it's fine, but more verbose.
The gut feeling is that the use case for naming sections will be specifically
for lock+rbtree. Everything else will go into the common global .data or .rodata.
Same thinking about compiler-generated special sections with constants.
They shouldn't be mmaped by default, but we're not going to hack llvm
to add a ".nommap" suffix to such sections.
Hence the proposal to avoid mmap by default for all non-standard sections.
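As a rough illustration of what the verifier sees with this patch (using the
SEC("enqueue")/SEC("dequeue") declarations above; the list kfuncs from this
series are assumed to exist and are only referenced in comments here):

        SEC("tc")
        int move_one(struct __sk_buff *ctx)
        {
                bpf_spin_lock(&enqueue_lock);
                /* list operations on enqueue_head would go here */
                bpf_spin_unlock(&enqueue_lock);

                bpf_spin_lock(&dequeue_lock);
                /* calling bpf_spin_unlock(&enqueue_lock) here would be rejected:
                 * the held lock is tracked as (active_spin_lock_ptr == the
                 * "dequeue" datasec map, active_spin_lock_id), which doesn't
                 * match enqueue_lock's datasec map */
                bpf_spin_unlock(&dequeue_lock);
                return 0;
        }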
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 19:25 ` Alexei Starovoitov
@ 2022-09-09 20:21 ` Andrii Nakryiko
2022-09-09 20:57 ` Alexei Starovoitov
2022-09-09 22:30 ` Dave Marchevsky
1 sibling, 1 reply; 82+ messages in thread
From: Andrii Nakryiko @ 2022-09-09 20:21 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Fri, Sep 9, 2022 at 12:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 09, 2022 at 11:32:40AM -0700, Andrii Nakryiko wrote:
> > On Fri, Sep 9, 2022 at 7:58 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > >
> > > > On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> > > > > > >
> > > > > > > On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> > > > > > > > Global variables reside in maps accessible using direct_value_addr
> > > > > > > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > > > > > > disallows us from holding locks which are global.
> > > > > > > >
> > > > > > > > This is not great, so refactor the active_spin_lock into two separate
> > > > > > > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > > > > > > enough to allow it for global variables, map lookups, and local kptr
> > > > > > > > registers at the same time.
> > > > > > > >
> > > > > > > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > > > > > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > > > > > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > > > > > > whether bpf_spin_unlock is for the same register.
> > > > > > > >
> > > > > > > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > > > > > > are doing lookup for the same map value (max_entries is never greater
> > > > > > > > than 1).
> > > > > > > >
> > > > > > >
> > > > > > > For libbpf-style "internal maps" - like .bss.private further in this series -
> > > > > > > all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
> > > > > > >
> > > > > > > struct bpf_spin_lock lock1 SEC(".bss.private");
> > > > > > > struct bpf_spin_lock lock2 SEC(".bss.private");
> > > > > > > ...
> > > > > > > spin_lock(&lock1);
> > > > > > > ...
> > > > > > > spin_lock(&lock2);
> > > > > > >
> > > > > > > will result in same map but different offsets for the direct read (and different
> > > > > > > aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> > > > > > > this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
> > > > > > >
> > > > > >
> > > > > > That won't be a problem. Two spin locks in a map value or datasec are
> > > > > > already rejected on BPF_MAP_CREATE,
> > > > > > so there is no bug. See idx >= info_cnt check in
> > > > > > btf_find_struct_field, btf_find_datasec_var.
> > > > > >
> > > > > > I can include offset as the third part of the tuple. The problem then
> > > > > > is figuring out which lock protects which bpf_list_head. We need
> > > > > > another __guarded_by annotation and force users to use that to
> > > > > > eliminate the ambiguity. So for now I just put it in the commit log
> > > > > > and left it for the future.
> > > > >
> > > > > Let's not go that far yet.
> > > > > Extra annotations are just as confusing and non-obvious as
> > > > > putting locks in different sections.
> > > > > Let's keep one lock per map value limitation for now.
> > > > > libbpf side needs to allow many non-mappable sections though.
> > > > > Single bss.private name is too limiting.
> > > >
> > > > In that case,
> > > > Dave, since the libbpf patch is yours, would you be fine with
> > > > reworking it to support multiple private maps?
> > > > Maybe it can just ignore the .XXX part in .bss.private.XXX?
> > > > Also I think Andrii mentioned once that he wants to eventually merge
> > > > data and bss, so it might be a good idea to call it .data.private from
> > > > the start?
> > >
> > > I'd probably make all non-canonical names to be not-mmapable.
> > > The compiler generates special sections already.
> > > Thankfully the code doesn't use them, but it will sooner or later.
> > > So libbpf has to create hidden maps for them eventually.
> > > They shouldn't be messed up from user space, since it will screw up
> > > compiler generated code.
> > >
> > > Andrii, what's your take?
> >
> > Ok, a bunch of things to unpack. We've also discussed a lot of this
> > with Dave few weeks ago, but I have also few questions.
> >
> > First, I'd like to not keep extending ".bss" with any custom ".bss.*"
> > sections. This is why we have .data.* and .rodata.* and not .bss (bad,
> > meaningless, historic name).
> >
> > But I'm totally fine dedicating some other prefix to non-mmapable data
> > sections that won't be exposed in skeleton and, well, not-mmapable.
> > What to name it depends on what we anticipate putting in them?
> >
> > If it's just for spinlocks, then having something like SEC(".locks")
> > seems best to me. If it's for more stuff, like global kptrs, rbtrees
> > and whatnot, then we'd need a bit more generic name (.private, or
> > whatever, didn't think much on best name). We can also allow .locks.*
> > or .private.* (i.e., keep it uniform with .data and .rodata handling,
> > expect for mmapable aspect).
> >
> > One benefit for having SEC(".locks") just for spin_locks is that we
> > can teach libbpf to create a multi-element ARRAY map, where each lock
> > variable is put into a separate element. From BPF verifier's
> > perspective, there will be a single BTF type describing spin lock, but
> > multiple "instances" of lock, one per each element. That seems a bit
> > magical and I think, generally speaking, it's best to start supporting
> > multiple lock declarations within single map element (and thus keep
> > track of their offset within map_value); but at least that's an
> > option.
>
> ".lock" won't work. We need lock+rb_root or lock+list_head to be
> in the same section.
> It should be up to user to name that section with something meaningful.
> Ideally something like this should be supported:
> SEC("enqueue") struct bpf_spin_lock enqueue_lock;
> SEC("enqueue") struct bpf_list_head enqueue_head __contains(foo, node);
> SEC("dequeue") struct bpf_spin_lock dequeue_lock;
> SEC("dequeue") struct bpf_list_head dequeue_head __contains(foo, node);
>
> > Dave had some concerns about pinning such maps and whatnot, but for
> > starters we decided to not worry about pinning for now. Dave, please
> > bring up remaining issues, if you don't mind.
>
> Pinning shouldn't be an issue.
> Only mmap is the problem. User space access if fine since kernel
> will mask out special fields on read/write.
>
> > So to answer Alexei's specific option. I'm still not in favor of just
> > saying "anything that's not .data or .rodata is non-mmapable map". I'd
> > rather carve out naming prefixes with . (which are reserved for
> > libbpf's own use) for these special purpose maps. I don't think that
> > limits anyone, right?
>
> Is backward compat a concern?
> Whether to mmap global data is a flag.
> It can be opt-in or opt-out.
> I'm proposing make all named section to be 'do not mmap'.
> If a section needs to be mmaped and appear in skeleton the user can do
> SEC("my_section.mmap")
>
> What you're proposing is to do the other way around:
> SEC("enqueue.nommap")
> SEC("dequeue.nommap")
> in the above example.
> I guess it's fine, but more verbose.
Well, I didn't propose to use suffixes. Currently a user can define
SEC(".data.my_custom_use_case"). So I was proposing that we just
define a different *prefix*, like SEC(".private.enqueue") and
SEC(".private.dequeue") for your example above, which will be private
to BPF program, not mmap'ed, not exposed in skeleton.
mmap is a bit orthogonal to exposing in the skeleton; you can still
imagine a data section that is allowed to be initialized from
user space before load but never mmaped afterwards. Just to say that
.nommap doesn't necessarily imply that it shouldn't be in a skeleton.
So I still prefer special prefix (.private) and declare that this is
both non-mmapable and not-exposed in skeleton.
As for allowing any section. It just feels unnecessary and long-term
harmful to allow any section name at this point, tbh.
> The gut feeling is that the use case for naming section will be specifically
> for lock+rbtree. Everything else will go into common global .data or .rodata.
> Same thinking about compiler generated special sections with constants.
> They shouldn't be mmaped by default, but we're not going to hack llvm
> to add ".nommap" suffix to such sections.
> Hence the proposal to avoid mmap by default for all non standard sections.
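Spelled out, the two conventions being weighed in this subthread (both are
proposals at this point, not existing libbpf behavior; variable names below are
made up):

        /* Alexei's default-private proposal: any non-canonical section name
         * becomes a hidden, non-mmapable map unless explicitly opted in. */
        SEC("enqueue")               struct bpf_spin_lock enqueue_lock;  /* hidden map */
        SEC("my_section.mmap")       long exported_stat;                 /* mmapable, in skeleton */

        /* Andrii's reserved-prefix proposal: only a dedicated dot-prefix is
         * treated as private; everything else keeps today's behavior. */
        SEC(".private.enqueue")      struct bpf_spin_lock enqueue_lock2; /* hidden map */
        SEC(".data.my_custom_use_case") long exported_stat2;             /* mmapable, in skeleton */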
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 20:21 ` Andrii Nakryiko
@ 2022-09-09 20:57 ` Alexei Starovoitov
2022-09-10 0:21 ` Andrii Nakryiko
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-09 20:57 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Fri, Sep 09, 2022 at 01:21:05PM -0700, Andrii Nakryiko wrote:
> On Fri, Sep 9, 2022 at 12:25 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Sep 09, 2022 at 11:32:40AM -0700, Andrii Nakryiko wrote:
> > > On Fri, Sep 9, 2022 at 7:58 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > > >
> > > > > On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> > > > > > > >
> > > > > > > > On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> > > > > > > > > Global variables reside in maps accessible using direct_value_addr
> > > > > > > > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > > > > > > > disallows us from holding locks which are global.
> > > > > > > > >
> > > > > > > > > This is not great, so refactor the active_spin_lock into two separate
> > > > > > > > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > > > > > > > enough to allow it for global variables, map lookups, and local kptr
> > > > > > > > > registers at the same time.
> > > > > > > > >
> > > > > > > > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > > > > > > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > > > > > > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > > > > > > > whether bpf_spin_unlock is for the same register.
> > > > > > > > >
> > > > > > > > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > > > > > > > are doing lookup for the same map value (max_entries is never greater
> > > > > > > > > than 1).
> > > > > > > > >
> > > > > > > >
> > > > > > > > For libbpf-style "internal maps" - like .bss.private further in this series -
> > > > > > > > all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
> > > > > > > >
> > > > > > > > struct bpf_spin_lock lock1 SEC(".bss.private");
> > > > > > > > struct bpf_spin_lock lock2 SEC(".bss.private");
> > > > > > > > ...
> > > > > > > > spin_lock(&lock1);
> > > > > > > > ...
> > > > > > > > spin_lock(&lock2);
> > > > > > > >
> > > > > > > > will result in same map but different offsets for the direct read (and different
> > > > > > > > aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> > > > > > > > this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
> > > > > > > >
> > > > > > >
> > > > > > > That won't be a problem. Two spin locks in a map value or datasec are
> > > > > > > already rejected on BPF_MAP_CREATE,
> > > > > > > so there is no bug. See idx >= info_cnt check in
> > > > > > > btf_find_struct_field, btf_find_datasec_var.
> > > > > > >
> > > > > > > I can include offset as the third part of the tuple. The problem then
> > > > > > > is figuring out which lock protects which bpf_list_head. We need
> > > > > > > another __guarded_by annotation and force users to use that to
> > > > > > > eliminate the ambiguity. So for now I just put it in the commit log
> > > > > > > and left it for the future.
> > > > > >
> > > > > > Let's not go that far yet.
> > > > > > Extra annotations are just as confusing and non-obvious as
> > > > > > putting locks in different sections.
> > > > > > Let's keep one lock per map value limitation for now.
> > > > > > libbpf side needs to allow many non-mappable sections though.
> > > > > > Single bss.private name is too limiting.
> > > > >
> > > > > In that case,
> > > > > Dave, since the libbpf patch is yours, would you be fine with
> > > > > reworking it to support multiple private maps?
> > > > > Maybe it can just ignore the .XXX part in .bss.private.XXX?
> > > > > Also I think Andrii mentioned once that he wants to eventually merge
> > > > > data and bss, so it might be a good idea to call it .data.private from
> > > > > the start?
> > > >
> > > > I'd probably make all non-canonical names to be not-mmapable.
> > > > The compiler generates special sections already.
> > > > Thankfully the code doesn't use them, but it will sooner or later.
> > > > So libbpf has to create hidden maps for them eventually.
> > > > They shouldn't be messed up from user space, since it will screw up
> > > > compiler generated code.
> > > >
> > > > Andrii, what's your take?
> > >
> > > Ok, a bunch of things to unpack. We've also discussed a lot of this
> > > with Dave few weeks ago, but I have also few questions.
> > >
> > > First, I'd like to not keep extending ".bss" with any custom ".bss.*"
> > > sections. This is why we have .data.* and .rodata.* and not .bss (bad,
> > > meaningless, historic name).
> > >
> > > But I'm totally fine dedicating some other prefix to non-mmapable data
> > > sections that won't be exposed in skeleton and, well, not-mmapable.
> > > What to name it depends on what we anticipate putting in them?
> > >
> > > If it's just for spinlocks, then having something like SEC(".locks")
> > > seems best to me. If it's for more stuff, like global kptrs, rbtrees
> > > and whatnot, then we'd need a bit more generic name (.private, or
> > > whatever, didn't think much on best name). We can also allow .locks.*
> > > or .private.* (i.e., keep it uniform with .data and .rodata handling,
> > > expect for mmapable aspect).
> > >
> > > One benefit for having SEC(".locks") just for spin_locks is that we
> > > can teach libbpf to create a multi-element ARRAY map, where each lock
> > > variable is put into a separate element. From BPF verifier's
> > > perspective, there will be a single BTF type describing spin lock, but
> > > multiple "instances" of lock, one per each element. That seems a bit
> > > magical and I think, generally speaking, it's best to start supporting
> > > multiple lock declarations within single map element (and thus keep
> > > track of their offset within map_value); but at least that's an
> > > option.
> >
> > ".lock" won't work. We need lock+rb_root or lock+list_head to be
> > in the same section.
> > It should be up to user to name that section with something meaningful.
> > Ideally something like this should be supported:
> > SEC("enqueue") struct bpf_spin_lock enqueue_lock;
> > SEC("enqueue") struct bpf_list_head enqueue_head __contains(foo, node);
> > SEC("dequeue") struct bpf_spin_lock dequeue_lock;
> > SEC("dequeue") struct bpf_list_head dequeue_head __contains(foo, node);
> >
> > > Dave had some concerns about pinning such maps and whatnot, but for
> > > starters we decided to not worry about pinning for now. Dave, please
> > > bring up remaining issues, if you don't mind.
> >
> > Pinning shouldn't be an issue.
> > Only mmap is the problem. User space access if fine since kernel
> > will mask out special fields on read/write.
> >
> > > So to answer Alexei's specific option. I'm still not in favor of just
> > > saying "anything that's not .data or .rodata is non-mmapable map". I'd
> > > rather carve out naming prefixes with . (which are reserved for
> > > libbpf's own use) for these special purpose maps. I don't think that
> > > limits anyone, right?
> >
> > Is backward compat a concern?
> > Whether to mmap global data is a flag.
> > It can be opt-in or opt-out.
> > I'm proposing make all named section to be 'do not mmap'.
> > If a section needs to be mmaped and appear in skeleton the user can do
> > SEC("my_section.mmap")
> >
> > What you're proposing is to do the other way around:
> > SEC("enqueue.nommap")
> > SEC("dequeue.nommap")
> > in the above example.
> > I guess it's fine, but more verbose.
>
> Well, I didn't propose to use suffixes. Currently user can define
> SEC(".data.my_custom_use_case").
... and libbpf will mmap such maps and will expose them in skeleton.
My point is that it's an existing bug.
Compiler-generated .rodata.str1.1 sections should not be messed with by
user space. There is no BTF for them either.
mmap and a subsequent write by user space won't cause a crash for the bpf prog,
but it won't be doing what the C code intended.
There is nothing in there for the skeleton and user space to see,
but such a map should be created, populated, and its map_fd provided to the prog to use.
> So I was proposing that we'll just
> define a different *prefix*, like SEC(".private.enqueue") and
> SEC(".private.dequeue") for your example above, which will be private
> to BPF program, not mmap'ed, not exposed in skeleton.
>
> mmap is a bit orthogonal to exposing in skeleton, you can still
> imagine data section that will be allowed to be initialized from
> user-space before load but never mmaped afterwards. Just to say that
> .nommap doesn't necessarily imply that it shouldn't be in a skeleton.
Well. That's true for normal skeleton and for lskel,
but not the case for kernel skel that doesn't have mmap.
Exposing a map in skel implies that it will be accessed not only
after _open and before _load, but after load as well.
We can say that mmap != expose in skeleton, but I really don't see
such a feature being useful.
> So I still prefer special prefix (.private) and declare that this is
> both non-mmapable and not-exposed in skeleton.
>
> As for allowing any section. It just feels unnecessary and long-term
> harmful to allow any section name at this point, tbh.
Fine. How about a single new character instead of the '.private' prefix?
Like SEC("#enqueue"), which would mean no-skel and no-mmap?
Or a double dot, SEC("..enqueue")?
'.private' is too verbose, and when it's read in the context of a C file
it looks out of place and confusing.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 19:25 ` Alexei Starovoitov
2022-09-09 20:21 ` Andrii Nakryiko
@ 2022-09-09 22:30 ` Dave Marchevsky
2022-09-09 22:49 ` Kumar Kartikeya Dwivedi
2022-09-09 22:51 ` Alexei Starovoitov
1 sibling, 2 replies; 82+ messages in thread
From: Dave Marchevsky @ 2022-09-09 22:30 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko
Cc: Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Delyan Kratunov
On 9/9/22 3:25 PM, Alexei Starovoitov wrote:
> On Fri, Sep 09, 2022 at 11:32:40AM -0700, Andrii Nakryiko wrote:
>> On Fri, Sep 9, 2022 at 7:58 AM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>>
>>> On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>>>>
>>>> On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>
>>>>> On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>>>>>>
>>>>>> On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
>>>>>>>
>>>>>>> On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
>>>>>>>> Global variables reside in maps accessible using direct_value_addr
>>>>>>>> callbacks, so giving each load instruction's rewrite a unique reg->id
>>>>>>>> disallows us from holding locks which are global.
>>>>>>>>
>>>>>>>> This is not great, so refactor the active_spin_lock into two separate
>>>>>>>> fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
>>>>>>>> enough to allow it for global variables, map lookups, and local kptr
>>>>>>>> registers at the same time.
>>>>>>>>
>>>>>>>> Held vs non-held is indicated by active_spin_lock_ptr, which stores the
>>>>>>>> reg->map_ptr or reg->btf pointer of the register used for locking spin
>>>>>>>> lock. But the active_spin_lock_id also needs to be compared to ensure
>>>>>>>> whether bpf_spin_unlock is for the same register.
>>>>>>>>
>>>>>>>> Next, pseudo load instructions are not given a unique reg->id, as they
>>>>>>>> are doing lookup for the same map value (max_entries is never greater
>>>>>>>> than 1).
>>>>>>>>
>>>>>>>
>>>>>>> For libbpf-style "internal maps" - like .bss.private further in this series -
>>>>>>> all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
>>>>>>>
>>>>>>> struct bpf_spin_lock lock1 SEC(".bss.private");
>>>>>>> struct bpf_spin_lock lock2 SEC(".bss.private");
>>>>>>> ...
>>>>>>> spin_lock(&lock1);
>>>>>>> ...
>>>>>>> spin_lock(&lock2);
>>>>>>>
>>>>>>> will result in same map but different offsets for the direct read (and different
>>>>>>> aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
>>>>>>> this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
>>>>>>>
>>>>>>
>>>>>> That won't be a problem. Two spin locks in a map value or datasec are
>>>>>> already rejected on BPF_MAP_CREATE,
>>>>>> so there is no bug. See idx >= info_cnt check in
>>>>>> btf_find_struct_field, btf_find_datasec_var.
>>>>>>
>>>>>> I can include offset as the third part of the tuple. The problem then
>>>>>> is figuring out which lock protects which bpf_list_head. We need
>>>>>> another __guarded_by annotation and force users to use that to
>>>>>> eliminate the ambiguity. So for now I just put it in the commit log
>>>>>> and left it for the future.
>>>>>
>>>>> Let's not go that far yet.
>>>>> Extra annotations are just as confusing and non-obvious as
>>>>> putting locks in different sections.
>>>>> Let's keep one lock per map value limitation for now.
>>>>> libbpf side needs to allow many non-mappable sections though.
>>>>> Single bss.private name is too limiting.
>>>>
>>>> In that case,
>>>> Dave, since the libbpf patch is yours, would you be fine with
>>>> reworking it to support multiple private maps?
>>>> Maybe it can just ignore the .XXX part in .bss.private.XXX?
>>>> Also I think Andrii mentioned once that he wants to eventually merge
>>>> data and bss, so it might be a good idea to call it .data.private from
>>>> the start?
>>>
>>> I'd probably make all non-canonical names to be not-mmapable.
>>> The compiler generates special sections already.
>>> Thankfully the code doesn't use them, but it will sooner or later.
>>> So libbpf has to create hidden maps for them eventually.
>>> They shouldn't be messed up from user space, since it will screw up
>>> compiler generated code.
>>>
>>> Andrii, what's your take?
>>
>> Ok, a bunch of things to unpack. We've also discussed a lot of this
>> with Dave few weeks ago, but I have also few questions.
>>
>> First, I'd like to not keep extending ".bss" with any custom ".bss.*"
>> sections. This is why we have .data.* and .rodata.* and not .bss (bad,
>> meaningless, historic name).
>>
>> But I'm totally fine dedicating some other prefix to non-mmapable data
>> sections that won't be exposed in skeleton and, well, not-mmapable.
>> What to name it depends on what we anticipate putting in them?
>>
>> If it's just for spinlocks, then having something like SEC(".locks")
>> seems best to me. If it's for more stuff, like global kptrs, rbtrees
>> and whatnot, then we'd need a bit more generic name (.private, or
>> whatever, didn't think much on best name). We can also allow .locks.*
>> or .private.* (i.e., keep it uniform with .data and .rodata handling,
>> expect for mmapable aspect).
>>
>> One benefit for having SEC(".locks") just for spin_locks is that we
>> can teach libbpf to create a multi-element ARRAY map, where each lock
>> variable is put into a separate element. From BPF verifier's
>> perspective, there will be a single BTF type describing spin lock, but
>> multiple "instances" of lock, one per each element. That seems a bit
>> magical and I think, generally speaking, it's best to start supporting
>> multiple lock declarations within single map element (and thus keep
>> track of their offset within map_value); but at least that's an
>> option.
>
> ".lock" won't work. We need lock+rb_root or lock+list_head to be
> in the same section.
> It should be up to user to name that section with something meaningful.
> Ideally something like this should be supported:
> SEC("enqueue") struct bpf_spin_lock enqueue_lock;
> SEC("enqueue") struct bpf_list_head enqueue_head __contains(foo, node);
> SEC("dequeue") struct bpf_spin_lock dequeue_lock;
> SEC("dequeue") struct bpf_list_head dequeue_head __contains(foo, node);
>
Is the "head and lock must be in the same section / map_value" requirement
desired, or just a consequence of this implementation? I don't see why it's
desirable from a user perspective. It seems to have the same problem as rbtree
RFCv1's rbtree_map struct creating its own bpf_spin_lock, namely not providing
a way for multiple data structures to share the same lock in a way that makes
sense to the verifier for enforcement.
>> Dave had some concerns about pinning such maps and whatnot, but for
>> starters we decided to not worry about pinning for now. Dave, please
>> bring up remaining issues, if you don't mind.
>
@Andrii, aside from the vague pinning concerns from our last discussion about this,
I don't have any specific concerns. A multi-element ".locks" is more
appealing to me now, actually, as I think it enables the best of both worlds for
this impl and my rbtree RFCv2 experiments:
* This series uses (map_ptr, map_value_offset) as lock identity for
verification purposes and expects map_ptr for list_head and lock
to be the same.
* If my logic in the comment preceding this one is correct, the downside
is no lock sharing between data structures.
* rbtree RFCv2 uses lock address as lock identity
for verification purposes and requires lock address to be known
when verifying program using the lock.
* Downside: no clear path forward for map_in_map general case,
can make it work for some specific cases but kludgey.
* If ".locks" exists, supporting multiple lock definitions, we can
use locks_sec_offset or locks_sec_map_{key,idx} as lock identity
for verification purposes.
* As a result "head and lock must be in same section" requirement
is removed, and there's a path forward for map_in_map inner maps
to share locks arbitrarily without losing verifiability.
* But I suspect this requires some special handling of the map backing
".locks" on kernel side.
I have some hacks on top of rbtree RFCv2 that are moving in this ".locks"
direction, happy to fix them up and send something if I didn't miss anything
above.
Regardless, @Kumar, happy to iterate on .bss.private patch until it's in
a shape that satisfies everyone.
> Pinning shouldn't be an issue.
> Only mmap is the problem. User space access if fine since kernel
> will mask out special fields on read/write.
>
>> So to answer Alexei's specific option. I'm still not in favor of just
>> saying "anything that's not .data or .rodata is non-mmapable map". I'd
>> rather carve out naming prefixes with . (which are reserved for
>> libbpf's own use) for these special purpose maps. I don't think that
>> limits anyone, right?
>
> Is backward compat a concern?
> Whether to mmap global data is a flag.
> It can be opt-in or opt-out.
> I'm proposing make all named section to be 'do not mmap'.
> If a section needs to be mmaped and appear in skeleton the user can do
> SEC("my_section.mmap")
>
> What you're proposing is to do the other way around:
> SEC("enqueue.nommap")
> SEC("dequeue.nommap")
> in the above example.
> I guess it's fine, but more verbose.
> The gut feeling is that the use case for naming section will be specifically
> for lock+rbtree. Everything else will go into common global .data or .rodata.
> Same thinking about compiler generated special sections with constants.
> They shouldn't be mmaped by default, but we're not going to hack llvm
> to add ".nommap" suffix to such sections.
> Hence the proposal to avoid mmap by default for all non standard sections.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 22:30 ` Dave Marchevsky
@ 2022-09-09 22:49 ` Kumar Kartikeya Dwivedi
2022-09-09 22:57 ` Alexei Starovoitov
2022-09-09 22:51 ` Alexei Starovoitov
1 sibling, 1 reply; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-09 22:49 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Alexei Starovoitov, Andrii Nakryiko, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Sat, 10 Sept 2022 at 00:30, Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> On 9/9/22 3:25 PM, Alexei Starovoitov wrote:
> > On Fri, Sep 09, 2022 at 11:32:40AM -0700, Andrii Nakryiko wrote:
> >> On Fri, Sep 9, 2022 at 7:58 AM Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>> On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >>>>
> >>>> On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
> >>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>>
> >>>>> On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >>>>>>
> >>>>>> On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> >>>>>>>
> >>>>>>> On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> >>>>>>>> Global variables reside in maps accessible using direct_value_addr
> >>>>>>>> callbacks, so giving each load instruction's rewrite a unique reg->id
> >>>>>>>> disallows us from holding locks which are global.
> >>>>>>>>
> >>>>>>>> This is not great, so refactor the active_spin_lock into two separate
> >>>>>>>> fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> >>>>>>>> enough to allow it for global variables, map lookups, and local kptr
> >>>>>>>> registers at the same time.
> >>>>>>>>
> >>>>>>>> Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> >>>>>>>> reg->map_ptr or reg->btf pointer of the register used for locking spin
> >>>>>>>> lock. But the active_spin_lock_id also needs to be compared to ensure
> >>>>>>>> whether bpf_spin_unlock is for the same register.
> >>>>>>>>
> >>>>>>>> Next, pseudo load instructions are not given a unique reg->id, as they
> >>>>>>>> are doing lookup for the same map value (max_entries is never greater
> >>>>>>>> than 1).
> >>>>>>>>
> >>>>>>>
> >>>>>>> For libbpf-style "internal maps" - like .bss.private further in this series -
> >>>>>>> all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
> >>>>>>>
> >>>>>>> struct bpf_spin_lock lock1 SEC(".bss.private");
> >>>>>>> struct bpf_spin_lock lock2 SEC(".bss.private");
> >>>>>>> ...
> >>>>>>> spin_lock(&lock1);
> >>>>>>> ...
> >>>>>>> spin_lock(&lock2);
> >>>>>>>
> >>>>>>> will result in same map but different offsets for the direct read (and different
> >>>>>>> aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> >>>>>>> this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
> >>>>>>>
> >>>>>>
> >>>>>> That won't be a problem. Two spin locks in a map value or datasec are
> >>>>>> already rejected on BPF_MAP_CREATE,
> >>>>>> so there is no bug. See idx >= info_cnt check in
> >>>>>> btf_find_struct_field, btf_find_datasec_var.
> >>>>>>
> >>>>>> I can include offset as the third part of the tuple. The problem then
> >>>>>> is figuring out which lock protects which bpf_list_head. We need
> >>>>>> another __guarded_by annotation and force users to use that to
> >>>>>> eliminate the ambiguity. So for now I just put it in the commit log
> >>>>>> and left it for the future.
> >>>>>
> >>>>> Let's not go that far yet.
> >>>>> Extra annotations are just as confusing and non-obvious as
> >>>>> putting locks in different sections.
> >>>>> Let's keep one lock per map value limitation for now.
> >>>>> libbpf side needs to allow many non-mappable sections though.
> >>>>> Single bss.private name is too limiting.
> >>>>
> >>>> In that case,
> >>>> Dave, since the libbpf patch is yours, would you be fine with
> >>>> reworking it to support multiple private maps?
> >>>> Maybe it can just ignore the .XXX part in .bss.private.XXX?
> >>>> Also I think Andrii mentioned once that he wants to eventually merge
> >>>> data and bss, so it might be a good idea to call it .data.private from
> >>>> the start?
> >>>
> >>> I'd probably make all non-canonical names to be not-mmapable.
> >>> The compiler generates special sections already.
> >>> Thankfully the code doesn't use them, but it will sooner or later.
> >>> So libbpf has to create hidden maps for them eventually.
> >>> They shouldn't be messed up from user space, since it will screw up
> >>> compiler generated code.
> >>>
> >>> Andrii, what's your take?
> >>
> >> Ok, a bunch of things to unpack. We've also discussed a lot of this
> >> with Dave few weeks ago, but I have also few questions.
> >>
> >> First, I'd like to not keep extending ".bss" with any custom ".bss.*"
> >> sections. This is why we have .data.* and .rodata.* and not .bss (bad,
> >> meaningless, historic name).
> >>
> >> But I'm totally fine dedicating some other prefix to non-mmapable data
> >> sections that won't be exposed in skeleton and, well, not-mmapable.
> >> What to name it depends on what we anticipate putting in them?
> >>
> >> If it's just for spinlocks, then having something like SEC(".locks")
> >> seems best to me. If it's for more stuff, like global kptrs, rbtrees
> >> and whatnot, then we'd need a bit more generic name (.private, or
> >> whatever, didn't think much on best name). We can also allow .locks.*
> >> or .private.* (i.e., keep it uniform with .data and .rodata handling,
> >> expect for mmapable aspect).
> >>
> >> One benefit for having SEC(".locks") just for spin_locks is that we
> >> can teach libbpf to create a multi-element ARRAY map, where each lock
> >> variable is put into a separate element. From BPF verifier's
> >> perspective, there will be a single BTF type describing spin lock, but
> >> multiple "instances" of lock, one per each element. That seems a bit
> >> magical and I think, generally speaking, it's best to start supporting
> >> multiple lock declarations within single map element (and thus keep
> >> track of their offset within map_value); but at least that's an
> >> option.
> >
> > ".lock" won't work. We need lock+rb_root or lock+list_head to be
> > in the same section.
> > It should be up to user to name that section with something meaningful.
> > Ideally something like this should be supported:
> > SEC("enqueue") struct bpf_spin_lock enqueue_lock;
> > SEC("enqueue") struct bpf_list_head enqueue_head __contains(foo, node);
> > SEC("dequeue") struct bpf_spin_lock dequeue_lock;
> > SEC("dequeue") struct bpf_list_head dequeue_head __contains(foo, node);
> >
>
> Isn't the "head and lock must be in same section / map_value" desired, or just
> a consequence of this implementation? I don't see why it's desirable from user
> perspective. Seems to have same problem as rbtree RFCv1's rbtree_map struct
> creating its own bpf_spin_lock, namely not providing a way for multiple
> datastructures to share same lock in a way that makes sense to the verifier for
> enforcement.
>
There is no such restriction here. You just put a lock and every list
or rbtree protected by that lock in the same section.
Then all of them share the same lock for the special section.
#define __private(X) SEC("map" #X)
struct bpf_spin_lock lock __private(a);
struct bpf_list_head head __contains(...) __private(a);
struct bpf_rb_root root __contains(...) __private(a);
As I said already, it's also possible to do a more fine-grained
approach by having multiple of them globally.
Then this multiple-separate-sections approach is not needed at
all. You can have just one private section for such bpf special
structures, maybe even by default from libbpf side, as they can't be
mmap'd anyway.
libbpf will see that you have bpf_spin_lock, bpf_list_head,
bpf_rb_root, it will put them in .data.nommap.
But then the verifier needs to know which lock protects which data.
You always need that info, in any approach. Here we assume by default
just one bpf_spin_lock so the answer is known.
We can 'learn' that implicitly (storing what we see first in the
verifier, e.g. if you added to head while holding lockA, we assume
this is the one you'll be using to protect it). Later the same head
cannot be added to using lockB.
Or we can just make the user annotate that explicitly, like clang's
thread safety annotations (GUARDED_BY(lock) etc.).
Then the spin_lock_off protecting it is stored with other info in
bpf_map_value_off_desc.
So compared to the example above, the user will just do:
struct bpf_spin_lock lock1;
struct bpf_spin_lock lock2;
struct bpf_list_head head __contains(...) __guarded_by(lock1);
struct bpf_list_head head2 __contains(...) __guarded_by(lock2);
struct bpf_rb_root root __contains(...) __guarded_by(lock2);
It looks much cleaner to me from a user perspective. Just define what
protects what, which also doubles as great documentation.
Regardless, the point is there are no limitations regarding
coarse-grained/fine-grained locking or lock sharing.
The question is more about how to expose it to the user.
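A rough fragment of what program-side enforcement could look like under the
__guarded_by() scheme (the annotation is only a proposal in this thread, not
something the verifier implements yet; lock1/lock2/head2/root refer to the
declarations above):

        SEC("tc")
        int guarded_example(struct __sk_buff *ctx)
        {
                bpf_spin_lock(&lock2);
                /* operations on head2 and root are allowed here, since both
                 * are declared __guarded_by(lock2) */
                bpf_spin_unlock(&lock2);

                bpf_spin_lock(&lock1);
                /* touching head2 here would be rejected: its guard is lock2,
                 * which the verifier would find recorded in
                 * bpf_map_value_off_desc for that field */
                bpf_spin_unlock(&lock1);
                return 0;
        }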
> >> Dave had some concerns about pinning such maps and whatnot, but for
> >> starters we decided to not worry about pinning for now. Dave, please
> >> bring up remaining issues, if you don't mind.
> >
>
> @Andrii, aside from vague pinning concerns from our last discussion about this,
> I don't have any specific concerns. A multi-element ".locks" is more
> appealing to me now, actually, as I think it enables best-of-both-worlds for
> this impl and my rbtree RFCv2 experiments:
>
> * This series uses (map_ptr, map_value_offset) as lock identity for
> verification purposes and expects map_ptr for list_head and lock
> to be the same.
> * If my logic in comment preceding this one is correct, downside
> is no lock sharing between datastructures.
>
See above.
> * rbtree RFCv2 uses lock address as lock identity
> for verification purposes and requires lock address to be known
> when verifying program using the lock.
> * Downside: no clear path forward for map_in_map general case,
> can make it work for some specific cases but kludgey.
>
> * If ".locks" exists, supporting multiple lock definitions, we can
> use locks_sec_offset or locks_sec_map_{key,idx} as lock identity
> for verification purposes.
> * As a result "head and lock must be in same section" requirement
> is removed, and there's a path forward for map_in_map inner maps
> to share locks arbitrarily without losing verifiability.
> * But I suspect this requires some special handling of the map backing
> ".locks" on kernel side.
>
> I have some hacks on top of rbtree RFCv2 that are moving in this ".locks"
> direction, happy to fix them up and send something if I didn't miss anything
> above.
I don't really like the ".locks" section or the idea in general. There
is nothing really special about locks in particular.
Same problem with bpf_timer. A nommap map approach also allows having
more than one bpf_timer globally.
>
> Regardless, @Kumar, happy to iterate on .bss.private patch until it's in
> a shape that satisfies everyone.
>
Great, once the discussion concludes it would be great if you sent it
out as its own patch; easier for me too.
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 22:30 ` Dave Marchevsky
2022-09-09 22:49 ` Kumar Kartikeya Dwivedi
@ 2022-09-09 22:51 ` Alexei Starovoitov
1 sibling, 0 replies; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-09 22:51 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Andrii Nakryiko, Kumar Kartikeya Dwivedi, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Fri, Sep 9, 2022 at 3:30 PM Dave Marchevsky <davemarchevsky@fb.com> wrote:
> >
> > ".lock" won't work. We need lock+rb_root or lock+list_head to be
> > in the same section.
> > It should be up to user to name that section with something meaningful.
> > Ideally something like this should be supported:
> > SEC("enqueue") struct bpf_spin_lock enqueue_lock;
> > SEC("enqueue") struct bpf_list_head enqueue_head __contains(foo, node);
> > SEC("dequeue") struct bpf_spin_lock dequeue_lock;
> > SEC("dequeue") struct bpf_list_head dequeue_head __contains(foo, node);
> >
>
> Isn't the "head and lock must be in same section / map_value" desired, or just
> a consequence of this implementation? I don't see why it's desirable from user
> perspective. Seems to have same problem as rbtree RFCv1's rbtree_map struct
> creating its own bpf_spin_lock, namely not providing a way for multiple
> datastructures to share same lock in a way that makes sense to the verifier for
> enforcement.
The requirement to have only one lock in an "allocation"
(which is a map value or one global section) comes from the need
to take that lock when doing map updates.
Shared lists/rbtree might require the verifier to take that lock too.
We can improve things with __guarded_by() tags in the future,
but I prefer to start with simple one-lock-per-map-value.
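For context on "take that lock when doing map updates": this is the existing
BPF_F_LOCK path, where the kernel itself has to locate the one spin lock in the
value. A minimal user-space sketch (map_fd is assumed to be an array map whose
BTF-described value has exactly this layout):

        #include <bpf/bpf.h>        /* bpf_map_update_elem() wrapper */
        #include <linux/bpf.h>      /* BPF_F_LOCK, struct bpf_spin_lock */

        struct map_value {
                struct bpf_spin_lock lock;  /* found by the kernel via the map's BTF */
                long counter;
        };

        int update_locked(int map_fd)
        {
                struct map_value val = { .counter = 42 };
                __u32 key = 0;

                /* the kernel takes the in-value spin lock before copying val in,
                 * which is only well-defined with exactly one lock per value */
                return bpf_map_update_elem(map_fd, &key, &val, BPF_F_LOCK);
        }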
^ permalink raw reply [flat|nested] 82+ messages in thread
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 22:49 ` Kumar Kartikeya Dwivedi
@ 2022-09-09 22:57 ` Alexei Starovoitov
2022-09-09 23:04 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-09 22:57 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Dave Marchevsky, Andrii Nakryiko, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Fri, Sep 9, 2022 at 3:50 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
>
> So compared to the example above, user will just do:
> struct bpf_spin_lock lock1;
> struct bpf_spin_lock lock2;
> struct bpf_list_head head __contains(...) __guarded_by(lock1);
> struct bpf_list_head head2 __contains(...) __guarded_by(lock2);
> struct bpf_rb_root root __contains(...) __guarded_by(lock2);
>
> It looks much cleaner to me from a user perspective. Just define what
> protects what, which also doubles as great documentation.
Unfortunately that doesn't work.
We cannot magically exclude the locks from global data
because of skel/mmap requirements.
We cannot move the locks automatically, because it involves
massive analysis of the code and fixing all offsets in libbpf.
So users have to use a different section when using
global locks, rb_root, list_head.
Since a different section is needed anyway, it's better to keep
one-lock-per-map-value for now.
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 22:57 ` Alexei Starovoitov
@ 2022-09-09 23:04 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 82+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-09-09 23:04 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Dave Marchevsky, Andrii Nakryiko, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Sat, 10 Sept 2022 at 00:57, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 9, 2022 at 3:50 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> >
> > So compared to the example above, user will just do:
> > struct bpf_spin_lock lock1;
> > struct bpf_spin_lock lock2;
> > struct bpf_list_head head __contains(...) __guarded_by(lock1);
> > struct bpf_list_head head2 __contains(...) __guarded_by(lock2);
> > struct bpf_rb_root root __contains(...) __guarded_by(lock2);
> >
> > It looks much cleaner to me from a user perspective. Just define what
> > protects what, which also doubles as great documentation.
>
> Unfortunately that doesn't work.
>
> We cannot magically exclude the locks from global data
> because of skel/mmap requirements.
> We cannot move the locks automatically, because it involves
> massive analysis of the code and fixing all offsets in libbpf.
> So users have to use a different section when using
> global locks, rb_root, list_head.
> Since a different section is needed anyway, it's better to keep
> one-lock-per-map-value for now.
Argh, right. Then they need one extra 'annotation' anyway; at that
point it's a bit questionable.
Also, I just realized from reading your other reply that there are
unanswered questions wrt what we do for BPF_F_LOCK.
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-09 20:57 ` Alexei Starovoitov
@ 2022-09-10 0:21 ` Andrii Nakryiko
2022-09-11 22:31 ` Alexei Starovoitov
0 siblings, 1 reply; 82+ messages in thread
From: Andrii Nakryiko @ 2022-09-10 0:21 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Fri, Sep 9, 2022 at 1:57 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 09, 2022 at 01:21:05PM -0700, Andrii Nakryiko wrote:
> > On Fri, Sep 9, 2022 at 12:25 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Sep 09, 2022 at 11:32:40AM -0700, Andrii Nakryiko wrote:
> > > > On Fri, Sep 9, 2022 at 7:58 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Fri, Sep 9, 2022 at 7:51 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, 9 Sept 2022 at 16:24, Alexei Starovoitov
> > > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > > >
> > > > > > > On Fri, Sep 9, 2022 at 4:05 AM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, 9 Sept 2022 at 10:13, Dave Marchevsky <davemarchevsky@fb.com> wrote:
> > > > > > > > >
> > > > > > > > > On 9/4/22 4:41 PM, Kumar Kartikeya Dwivedi wrote:
> > > > > > > > > > Global variables reside in maps accessible using direct_value_addr
> > > > > > > > > > callbacks, so giving each load instruction's rewrite a unique reg->id
> > > > > > > > > > disallows us from holding locks which are global.
> > > > > > > > > >
> > > > > > > > > > This is not great, so refactor the active_spin_lock into two separate
> > > > > > > > > > fields, active_spin_lock_ptr and active_spin_lock_id, which is generic
> > > > > > > > > > enough to allow it for global variables, map lookups, and local kptr
> > > > > > > > > > registers at the same time.
> > > > > > > > > >
> > > > > > > > > > Held vs non-held is indicated by active_spin_lock_ptr, which stores the
> > > > > > > > > > reg->map_ptr or reg->btf pointer of the register used for locking spin
> > > > > > > > > > lock. But the active_spin_lock_id also needs to be compared to ensure
> > > > > > > > > > whether bpf_spin_unlock is for the same register.
> > > > > > > > > >
> > > > > > > > > > Next, pseudo load instructions are not given a unique reg->id, as they
> > > > > > > > > > are doing lookup for the same map value (max_entries is never greater
> > > > > > > > > > than 1).
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > For libbpf-style "internal maps" - like .bss.private further in this series -
> > > > > > > > > all the SEC(".bss.private") vars are globbed together into one map_value. e.g.
> > > > > > > > >
> > > > > > > > > struct bpf_spin_lock lock1 SEC(".bss.private");
> > > > > > > > > struct bpf_spin_lock lock2 SEC(".bss.private");
> > > > > > > > > ...
> > > > > > > > > spin_lock(&lock1);
> > > > > > > > > ...
> > > > > > > > > spin_lock(&lock2);
> > > > > > > > >
> > > > > > > > > will result in same map but different offsets for the direct read (and different
> > > > > > > > > aux->map_off set in resolve_pseudo_ldimm64 for use in check_ld_imm). Seems like
> > > > > > > > > this patch would assign both same (active_spin_lock_ptr, active_spin_lock_id).
> > > > > > > > >
> > > > > > > >
> > > > > > > > That won't be a problem. Two spin locks in a map value or datasec are
> > > > > > > > already rejected on BPF_MAP_CREATE,
> > > > > > > > so there is no bug. See idx >= info_cnt check in
> > > > > > > > btf_find_struct_field, btf_find_datasec_var.
> > > > > > > >
> > > > > > > > I can include offset as the third part of the tuple. The problem then
> > > > > > > > is figuring out which lock protects which bpf_list_head. We need
> > > > > > > > another __guarded_by annotation and force users to use that to
> > > > > > > > eliminate the ambiguity. So for now I just put it in the commit log
> > > > > > > > and left it for the future.
> > > > > > >
> > > > > > > Let's not go that far yet.
> > > > > > > Extra annotations are just as confusing and non-obvious as
> > > > > > > putting locks in different sections.
> > > > > > > Let's keep one lock per map value limitation for now.
> > > > > > > libbpf side needs to allow many non-mappable sections though.
> > > > > > > Single bss.private name is too limiting.
> > > > > >
> > > > > > In that case,
> > > > > > Dave, since the libbpf patch is yours, would you be fine with
> > > > > > reworking it to support multiple private maps?
> > > > > > Maybe it can just ignore the .XXX part in .bss.private.XXX?
> > > > > > Also I think Andrii mentioned once that he wants to eventually merge
> > > > > > data and bss, so it might be a good idea to call it .data.private from
> > > > > > the start?
> > > > >
> > > > > I'd probably make all non-canonical names to be not-mmapable.
> > > > > The compiler generates special sections already.
> > > > > Thankfully the code doesn't use them, but it will sooner or later.
> > > > > So libbpf has to create hidden maps for them eventually.
> > > > > They shouldn't be messed up from user space, since it will screw up
> > > > > compiler generated code.
> > > > >
> > > > > Andrii, what's your take?
> > > >
> > > > Ok, a bunch of things to unpack. We've also discussed a lot of this
> > > > with Dave few weeks ago, but I have also few questions.
> > > >
> > > > First, I'd like to not keep extending ".bss" with any custom ".bss.*"
> > > > sections. This is why we have .data.* and .rodata.* and not .bss (bad,
> > > > meaningless, historic name).
> > > >
> > > > But I'm totally fine dedicating some other prefix to non-mmapable data
> > > > sections that won't be exposed in skeleton and, well, not-mmapable.
> > > > What to name it depends on what we anticipate putting in them?
> > > >
> > > > If it's just for spinlocks, then having something like SEC(".locks")
> > > > seems best to me. If it's for more stuff, like global kptrs, rbtrees
> > > > and whatnot, then we'd need a bit more generic name (.private, or
> > > > whatever, didn't think much on best name). We can also allow .locks.*
> > > > or .private.* (i.e., keep it uniform with .data and .rodata handling,
> > > > except for the mmapable aspect).
> > > >
> > > > One benefit for having SEC(".locks") just for spin_locks is that we
> > > > can teach libbpf to create a multi-element ARRAY map, where each lock
> > > > variable is put into a separate element. From BPF verifier's
> > > > perspective, there will be a single BTF type describing spin lock, but
> > > > multiple "instances" of lock, one per each element. That seems a bit
> > > > magical and I think, generally speaking, it's best to start supporting
> > > > multiple lock declarations within single map element (and thus keep
> > > > track of their offset within map_value); but at least that's an
> > > > option.
> > >
> > > ".lock" won't work. We need lock+rb_root or lock+list_head to be
> > > in the same section.
> > > It should be up to user to name that section with something meaningful.
> > > Ideally something like this should be supported:
> > > SEC("enqueue") struct bpf_spin_lock enqueue_lock;
> > > SEC("enqueue") struct bpf_list_head enqueue_head __contains(foo, node);
> > > SEC("dequeue") struct bpf_spin_lock dequeue_lock;
> > > SEC("dequeue") struct bpf_list_head dequeue_head __contains(foo, node);
> > >
> > > > Dave had some concerns about pinning such maps and whatnot, but for
> > > > starters we decided to not worry about pinning for now. Dave, please
> > > > bring up remaining issues, if you don't mind.
> > >
> > > Pinning shouldn't be an issue.
> > > Only mmap is the problem. User space access is fine since the kernel
> > > will mask out special fields on read/write.
> > >
> > > > So to answer Alexei's specific option. I'm still not in favor of just
> > > > saying "anything that's not .data or .rodata is non-mmapable map". I'd
> > > > rather carve out naming prefixes with . (which are reserved for
> > > > libbpf's own use) for these special purpose maps. I don't think that
> > > > limits anyone, right?
> > >
> > > Is backward compat a concern?
> > > Whether to mmap global data is a flag.
> > > It can be opt-in or opt-out.
> > > I'm proposing to make all named sections 'do not mmap'.
> > > If a section needs to be mmaped and appear in skeleton the user can do
> > > SEC("my_section.mmap")
> > >
> > > What you're proposing is to do the other way around:
> > > SEC("enqueue.nommap")
> > > SEC("dequeue.nommap")
> > > in the above example.
> > > I guess it's fine, but more verbose.
> >
> > Well, I didn't propose to use suffixes. Currently user can define
> > SEC(".data.my_custom_use_case").
>
> ... and libbpf will mmap such maps and will expose them in skeleton.
> My point is that it's an existing bug.
hm... it's not a bug, it's a desired feature. I wanted
int my_var SEC(".data.mine");
to be just like .data but in a separate map. So no bug here.
> Compiler generated .rodata.str1.1 sections should not be messed by
> user space. There is no BTF for them either.
They shouldn't, but they could if they wanted to, even without a
skeleton. In the generated skeleton there will be an empty struct for
this section and no field for any of the compiler's constants. A user
has to intentionally do something to harm themselves, which we can
never fully prevent either way.
So stuff like .rodata.str1.1 exposes a bit of compiler implementation
detail, but the overall idea of allowing custom .data.xxx and
.rodata.xxx sections was to make them mmapable and readable/writable
through the skeleton.
Carving out some sub-namespace based on special suffix feels wrong.
> mmap and subsequent write by user space won't cause a crash for bpf prog,
> but it won't be doing what C code intended.
> There is nothing in there for skeleton and user space to see,
> but such map should be created, populated and map_fd provided to the prog to use.
>
> > So I was proposing that we'll just
> > define a different *prefix*, like SEC(".private.enqueue") and
> > SEC(".private.dequeue") for your example above, which will be private
> > to BPF program, not mmap'ed, not exposed in skeleton.
> >
> > mmap is a bit orthogonal to exposing in skeleton, you can still
> > imagine data section that will be allowed to be initialized from
> > user-space before load but never mmaped afterwards. Just to say that
> > .nommap doesn't necessarily imply that it shouldn't be in a skeleton.
>
> Well. That's true for normal skeleton and for lskel,
> but not the case for kernel skel that doesn't have mmap.
> Exposing a map in skel implies that it will be accessed not only
> after _open and before _load, but after load as well.
> We can say that mmap != expose in skeleton, but I really don't see
> such feature being useful.
It's basically what .rodata is, actually. It's something that's
settable from user-space before load and that's it. Yes, you can read
it after load, but no one does it in practice. But we are digressing,
I understand you want to make this short and sweet and I agree with
you. I just disagree about wildcard rule for any non-dotted ELF
section or using special suffix. See below.
>
> > So I still prefer special prefix (.private) and declare that this is
> > both non-mmapable and not-exposed in skeleton.
> >
> > As for allowing any section. It just feels unnecessary and long-term
> > harmful to allow any section name at this point, tbh.
>
> Fine. How about a single new character instead of '.private' prefix ?
> Like SEC("#enqueue") that would mean no-skel and no-mmap ?
>
> Or double dot SEC("..enqueue") ?
>
> '.private' is too verbose and when it's read in the context of C file
> looks out of place and confusing.
As I said, I gave zero thought to .private, I just took it from
".bss.private". I'd like to keep it "dotted", so SEC("#something") is
very "unusual". Me not like.
For double-dot, could be just SEC("..data") and generalized to
SEC("..data.<custom>")? BTW, we can add a macro, similar to __kconfig,
to hide more descriptive and longer name. E.g.,
struct bpf_spin_lock my_lock __internal;
__internal, __private, __secret, don't know, naming is hard.
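A sketch of what such a macro could look like, modeled on how __kconfig
is defined in bpf_helpers.h today (the "..data.internal" section name
here is only a placeholder for whatever convention gets settled on):

/* hypothetical: hide the special non-mmapable section behind a short
 * annotation, the same way __kconfig hides SEC(".kconfig") today */
#define __internal __attribute__((section("..data.internal")))

struct bpf_spin_lock my_lock __internal;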
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-10 0:21 ` Andrii Nakryiko
@ 2022-09-11 22:31 ` Alexei Starovoitov
2022-09-20 20:55 ` Andrii Nakryiko
0 siblings, 1 reply; 82+ messages in thread
From: Alexei Starovoitov @ 2022-09-11 22:31 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Fri, Sep 09, 2022 at 05:21:52PM -0700, Andrii Nakryiko wrote:
> > >
> > > Well, I didn't propose to use suffixes. Currently user can define
> > > SEC(".data.my_custom_use_case").
> >
> > ... and libbpf will mmap such maps and will expose them in skeleton.
> > My point is that it's an existing bug.
>
> hm... it's not a bug, it's a desired feature. I wanted
>
> int my_var SEC(".data.mine");
>
> to be just like .data but in a separate map. So no bug here.
".rodata.*" and ".data.*" section names are effectively reserved by the compiler.
Sooner or later there will be trouble if users start mixing compiler
sections with their own section names like ".data.mine".
In the bpf backend we specially check for the '.rodata' prefix and avoid emitting BTF.
llvm can emit .rodata.cst%d, .rodata.str%d.%d, .data.rel, etc
Not sure how many such special sections will be generated once
bpf progs get bigger, but creating them as maps will waste
plenty of kernel memory due to the page alignment in bpf array maps.
llvm should probably combine them when possible and minimize section
usage in general, but that's orthogonal.
Mixing user and compiler sections under the same prefix is just asking for trouble.
> > Compiler generated .rodata.str1.1 sections should not be messed by
> > user space. There is no BTF for them either.
>
> Shouldn't but could if they wanted to without skeleton as well. In
> generated skeleton there will be an empty struct for this and no field
> for each of compiler's constant. User has to intentionally do
> something to harm themselves, which we can never stop either way.
>
> So stuff like .rodata.str1.1 exposes a bit of compiler implementation
> details, but overall idea of allowing custom .data.xxx and .rodata.xxx
> sections was to make them mmapable and readable/writable through
> skeleton.
It's not a good idea to expose compiler internals into skeleton
and even worse to ask users to operate in the compiler's namespace.
> Carving out some sub-namespace based on special suffix feels wrong.
agree. suffix doesn't work, since prefix is already owned by the compiler.
> > mmap and subsequent write by user space won't cause a crash for bpf prog,
> > but it won't be doing what C code intended.
> > There is nothing in there for skeleton and user space to see,
> > but such map should be created, populated and map_fd provided to the prog to use.
> >
> > > So I was proposing that we'll just
> > > define a different *prefix*, like SEC(".private.enqueue") and
> > > SEC(".private.dequeue") for your example above, which will be private
> > > to BPF program, not mmap'ed, not exposed in skeleton.
> > >
> > > mmap is a bit orthogonal to exposing in skeleton, you can still
> > > imagine data section that will be allowed to be initialized from
> > > user-space before load but never mmaped afterwards. Just to say that
> > > .nommap doesn't necessarily imply that it shouldn't be in a skeleton.
> >
> > Well. That's true for normal skeleton and for lskel,
> > but not the case for kernel skel that doesn't have mmap.
> > Exposing a map in skel implies that it will be accessed not only
> > after _open and before _load, but after load as well.
> > We can say that mmap != expose in skeleton, but I really don't see
> > such feature being useful.
>
> It's basically what .rodata is, actually. It's something that's
> settable from user-space before load and that's it. Yes, you can read
> it after load, but no one does it in practice. But we are digressing,
> I understand you want to make this short and sweet and I agree with
> you. I just disagree about wildcard rule for any non-dotted ELF
> section or using special suffix. See below.
>
> >
> > > So I still prefer special prefix (.private) and declare that this is
> > > both non-mmapable and not-exposed in skeleton.
> > >
> > > As for allowing any section. It just feels unnecessary and long-term
> > > harmful to allow any section name at this point, tbh.
> >
> > Fine. How about a single new character instead of '.private' prefix ?
> > Like SEC("#enqueue") that would mean no-skel and no-mmap ?
> >
> > Or double dot SEC("..enqueue") ?
> >
> > '.private' is too verbose and when it's read in the context of C file
> > looks out of place and confusing.
>
> As I said, I gave zero thought to .private, I just took it from
> ".bss.private". I'd like to keep it "dotted", so SEC("#something") is
> very "unusual". Me not like.
Why is this unusual? We have SEC("?tc") already.
SEC("#foo") is very similar.
The dot prefix is special: it is something the compiler will generate,
whereas a section whose name starts with [A-z] is the user's.
So reserving a prefix that starts with [A-z] would be wrong.
> For double-dot, could be just SEC("..data") and generalized to
> SEC("..data.<custom>")? BTW, we can add a macro, similar to __kconfig,
> to hide more descriptive and longer name. E.g.,
>
> struct bpf_spin_lock my_lock __internal;
Macro doesn't help, since the namespace is broken anyway.
'..data' is dangerous because something might be doing strstr(".data")
instead of strcmp(".data") and will match that section erroneously.
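As an illustration of that failure mode, a naive loader-side check like
the hypothetical one below would mis-classify a "..data" section
(mark_mmapable() is made up, just to show the shape of the bug):

/* buggy: substring match also hits "..data" and "..data.foo" */
if (strstr(sec_name, ".data"))
	mark_mmapable(sec_name);

/* what was actually intended: a prefix match */
if (strncmp(sec_name, ".data", sizeof(".data") - 1) == 0)
	mark_mmapable(sec_name);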
> __internal, __private, __secret, don't know, naming is hard.
Right. We already use special meaning for text section names: tc, xdp
which is also far from ideal, but that's too late to change.
For data I'm arguing that only '.[ro]data' should appear in skel
and compiler internals '.[ro]data.*' should not leak to users.
Then emit all of [A-z] starting sections, since that's what compilers
and linkers will do, but reserve a single character like '#'
or whatever other char to mean that this section shouldn't be mmapable.
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-11 22:31 ` Alexei Starovoitov
@ 2022-09-20 20:55 ` Andrii Nakryiko
2022-10-18 4:06 ` Andrii Nakryiko
0 siblings, 1 reply; 82+ messages in thread
From: Andrii Nakryiko @ 2022-09-20 20:55 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Sun, Sep 11, 2022 at 3:31 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 09, 2022 at 05:21:52PM -0700, Andrii Nakryiko wrote:
> > > >
> > > > Well, I didn't propose to use suffixes. Currently user can define
> > > > SEC(".data.my_custom_use_case").
> > >
> > > ... and libbpf will mmap such maps and will expose them in skeleton.
> > > My point is that it's an existing bug.
> >
> > hm... it's not a bug, it's a desired feature. I wanted
> >
> > int my_var SEC(".data.mine");
> >
> > to be just like .data but in a separate map. So no bug here.
>
Not to bury the actual proposal at the end of this email, I'll put it
here upfront, as I think it's a better compromise.
Given the initial problem was that libbpf creates an mmap-able array
for data sections, how about we make libbpf smarter.
The rule is simple and unambiguous: if an ELF data section doesn't
contain any global variables, libbpf will not add the MMAPABLE flag. I.e.,
if it's special compiler sections which have no variables, or if it's
user data section that only has static variables (which explicitly are
not to be exposed in BPF skeleton), libbpf just creates non-mmapable
array and we don't expose such sections as skeleton structs.
The user can still enforce the MMAPABLE flag with an explicit
bpf_map__set_map_flags(), if necessary, so if libbpf's default
behavior isn't sufficient and the user intended an mmapable array,
they can still get this working.
That would cover your use case and won't require any new naming
conventions. WDYT?
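For illustration, a rough sketch of that opt-in path using existing
libbpf APIs (the object path and the map name below are made up; libbpf
derives the real internal map name from the truncated object file name
plus the section name, so it will differ):

#include <bpf/libbpf.h>
#include <linux/bpf.h>

static int open_with_mmapable_priv(void)
{
	struct bpf_object *obj;
	struct bpf_map *map;

	obj = bpf_object__open_file("prog.bpf.o", NULL); /* placeholder path */
	if (!obj)
		return -1;

	/* guess at the name libbpf would assign to a ".data.priv" section */
	map = bpf_object__find_map_by_name(obj, "prog_b.data.priv");
	if (map)
		bpf_map__set_map_flags(map,
				       bpf_map__map_flags(map) | BPF_F_MMAPABLE);

	return bpf_object__load(obj);
}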
> ".rodata.*" and ".data.*" section names are effectively reserved by the compiler.
> Sooner or later there will be trouble if users start mixing compiler
> sections with their own section names like ".data.mine".
> In the bpf backend we specially check for the '.rodata' prefix and avoid emitting BTF.
> llvm can emit .rodata.cst%d, .rodata.str%d.%d, .data.rel, etc
> Not sure how many such special sections will be generated once
> bpf progs will get bigger, but creating them as maps will waste
> plenty of kernel memory due to page align in bpf-array.
> llvm should probably combine them when possible and minimize section
> usage in general, but that's orthogonal.
agree about the combining in LLVM and it's an optimization that libbpf
should be oblivious to
> Mixing user and compiler sections under the same prefix is just asking for trouble.
>
ELF spec specifies that .data/.data1 and .rodata/.rodata1 are special.
And also adds:
Section names with a dot (.) prefix are reserved for the system,
although applications may use these sections if their existing meanings
are satisfactory. Applications may use names without the prefix to avoid
conflicts with system sections. The object file format lets one define
sections not in the list above. An object file may have more than one section
with the same name.
I treat libbpf as "a system" and thus treat ".*" as reserved for libbpf use.
As for not creating a map for .rodata.cst* and the others: isn't it
exactly the same as for .rodata? The compiler might generate a relocation
against that section and will expect to be able to access the contents of
such a map at runtime (e.g., for struct literals, string constants,
etc). So I don't think we can do that.
> > > Compiler generated .rodata.str1.1 sections should not be messed by
> > > user space. There is no BTF for them either.
> >
> > Shouldn't but could if they wanted to without skeleton as well. In
> > generated skeleton there will be an empty struct for this and no field
> > for each of compiler's constant. User has to intentionally do
> > something to harm themselves, which we can never stop either way.
> >
> > So stuff like .rodata.str1.1 exposes a bit of compiler implementation
> > details, but overall idea of allowing custom .data.xxx and .rodata.xxx
> > sections was to make them mmapable and readable/writable through
> > skeleton.
>
> It's not a good idea to expose compiler internals into skeleton
> and even worse to ask users to operate in the compiler's namespace.
>
> > Carving out some sub-namespace based on special suffix feels wrong.
>
> agree. suffix doesn't work, since prefix is already owned by the compiler.
>
> > > mmap and subsequent write by user space won't cause a crash for bpf prog,
> > > but it won't be doing what C code intended.
> > > There is nothing in there for skeleton and user space to see,
> > > but such map should be created, populated and map_fd provided to the prog to use.
> > >
> > > > So I was proposing that we'll just
> > > > define a different *prefix*, like SEC(".private.enqueue") and
> > > > SEC(".private.dequeue") for your example above, which will be private
> > > > to BPF program, not mmap'ed, not exposed in skeleton.
> > > >
> > > > mmap is a bit orthogonal to exposing in skeleton, you can still
> > > > imagine data section that will be allowed to be initialized from
> > > > user-space before load but never mmaped afterwards. Just to say that
> > > > .nommap doesn't necessarily imply that it shouldn't be in a skeleton.
> > >
> > > Well. That's true for normal skeleton and for lskel,
> > > but not the case for kernel skel that doesn't have mmap.
> > > Exposing a map in skel implies that it will be accessed not only
> > > after _open and before _load, but after load as well.
> > > We can say that mmap != expose in skeleton, but I really don't see
> > > such feature being useful.
> >
> > It's basically what .rodata is, actually. It's something that's
> > settable from user-space before load and that's it. Yes, you can read
> > it after load, but no one does it in practice. But we are digressing,
> > I understand you want to make this short and sweet and I agree with
> > you. I just disagree about wildcard rule for any non-dotted ELF
> > section or using special suffix. See below.
> >
> > >
> > > > So I still prefer special prefix (.private) and declare that this is
> > > > both non-mmapable and not-exposed in skeleton.
> > > >
> > > > As for allowing any section. It just feels unnecessary and long-term
> > > > harmful to allow any section name at this point, tbh.
> > >
> > > Fine. How about a single new character instead of '.private' prefix ?
> > > Like SEC("#enqueue") that would mean no-skel and no-mmap ?
> > >
> > > Or double dot SEC("..enqueue") ?
> > >
> > > '.private' is too verbose and when it's read in the context of C file
> > > looks out of place and confusing.
> >
> > As I said, I gave zero thought to .private, I just took it from
> > ".bss.private". I'd like to keep it "dotted", so SEC("#something") is
> > very "unusual". Me not like.
>
> Why is this unusual? We have SEC("?tc") already.
> SEC("#foo") is very similar.
Unusual because these sections were so far used for BPF programs, not
for data. It's not the end of the world, but just not something I'd
like to do.
> Dot prefix is special. Something compiler will generate
> whereas the section that starts with [A-z] is user's.
> So reserving a prefix that starts from [A-z] would be wrong.
If we allow [a-zA-Z]* sections for data, we introduce potential
conflict for new SEC("abc") program annotations. Why would we do this
and cause more problems?
>
> > For double-dot, could be just SEC("..data") and generalized to
> > SEC("..data.<custom>")? BTW, we can add a macro, similar to __kconfig,
> > to hide more descriptive and longer name. E.g.,
> >
> > struct bpf_spin_lock my_lock __internal;
>
> Macro doesn't help, since the namespace is broken anyway.
> '..data' is dangerous because something might be doing strstr(".data")
> instead of strcmp(".data") and will match that section erroneously.
>
Not breaking broken/naive code is hardly a reason for anything. If
someone is doing strstr() and expects it to work as a prefix check,
well, it's a bug that should be fixed.
> > __internal, __private, __secret, don't know, naming is hard.
>
> Right. We already use special meaning for text section names: tc, xdp
> which is also far from ideal, but that's too late to change.
> For data I'm arguing that only '.[ro]data' should appear in skel
> and compiler internals '.[ro]data.*' should not leak to users.
I disagree. It's an already supported libbpf feature and I think it's
fine to keep it this way. It's not hard to avoid compiler "special"
sections, if it's at all a problem to share that section with the
compiler.
> Then emit all of [A-z] starting sections, since that's what compilers
> and linkers will do, but reserve a single character like '#'
> or whatever other char to mean that this section shouldn't be mmapable.
* Re: [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables
2022-09-20 20:55 ` Andrii Nakryiko
@ 2022-10-18 4:06 ` Andrii Nakryiko
0 siblings, 0 replies; 82+ messages in thread
From: Andrii Nakryiko @ 2022-10-18 4:06 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Delyan Kratunov
On Tue, Sep 20, 2022 at 1:55 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Sun, Sep 11, 2022 at 3:31 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Sep 09, 2022 at 05:21:52PM -0700, Andrii Nakryiko wrote:
> > > > >
> > > > > Well, I didn't propose to use suffixes. Currently user can define
> > > > > SEC(".data.my_custom_use_case").
> > > >
> > > > ... and libbpf will mmap such maps and will expose them in skeleton.
> > > > My point is that it's an existing bug.
> > >
> > > hm... it's not a bug, it's a desired feature. I wanted
> > >
> > > int my_var SEC(".data.mine");
> > >
> > > to be just like .data but in a separate map. So no bug here.
> >
>
> Not to bury the actual proposal at the end of this email, I'll put it
> here upfront, as I think it's a better compromise.
>
> Given the initial problem was that libbpf creates an mmap-able array
> for data sections, how about we make libbpf smarter.
>
> The rule is simple and unambiguous: if ELF data section doesn't
> contain any global variable, libbpf will not add MMAPABLE flag? I.e.,
> if it's special compiler sections which have no variables, or if it's
> user data section that only has static variables (which explicitly are
> not to be exposed in BPF skeleton), libbpf just creates non-mmapable
> array and we don't expose such sections as skeleton structs.
>
> User can still enforce MMAPABLE flag with explicit
> bpf_map__set_map_flags(), if necessary, so if libbpf's default
> behavior isn't sufficient and user intended mmapable array, they can
> still get this working.
>
> That would cover your use case and won't require any new naming
> conventions. WDYT?
>
>
To close the loop, I went ahead and implemented this proposal in code.
See [0]. I think it should be a good first step and should unblock all
the linked list and rbtree_node work. Please give it a try.
[0] https://patchwork.kernel.org/project/netdevbpf/list/?series=686066&state=*
[...]
Thread overview: 82+ messages (newest: 2022-10-18 4:07 UTC)
2022-09-04 20:41 [PATCH RFC bpf-next v1 00/32] Local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 01/32] bpf: Add copy_map_value_long to copy to remote percpu memory Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 02/32] bpf: Support kptrs in percpu arraymap Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 03/32] bpf: Add zero_map_value to zero map value with special fields Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 04/32] bpf: Support kptrs in percpu hashmap and percpu LRU hashmap Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 05/32] bpf: Support kptrs in local storage maps Kumar Kartikeya Dwivedi
2022-09-07 19:00 ` Alexei Starovoitov
2022-09-08 2:47 ` Kumar Kartikeya Dwivedi
2022-09-09 5:27 ` Martin KaFai Lau
2022-09-09 11:22 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 06/32] bpf: Annotate data races in bpf_local_storage Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 07/32] bpf: Allow specifying volatile type modifier for kptrs Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 08/32] bpf: Add comment about kptr's PTR_TO_MAP_VALUE handling Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 09/32] bpf: Rewrite kfunc argument handling Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 10/32] bpf: Drop kfunc support from btf_check_func_arg_match Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 11/32] bpf: Support constant scalar arguments for kfuncs Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 12/32] bpf: Teach verifier about non-size constant arguments Kumar Kartikeya Dwivedi
2022-09-07 22:11 ` Alexei Starovoitov
2022-09-08 2:49 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 13/32] bpf: Introduce bpf_list_head support for BPF maps Kumar Kartikeya Dwivedi
2022-09-07 22:46 ` Alexei Starovoitov
2022-09-08 2:58 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 14/32] bpf: Introduce bpf_kptr_alloc helper Kumar Kartikeya Dwivedi
2022-09-07 23:30 ` Alexei Starovoitov
2022-09-08 3:01 ` Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 15/32] bpf: Add helper macro bpf_expr_for_each_reg_in_vstate Kumar Kartikeya Dwivedi
2022-09-07 23:48 ` Alexei Starovoitov
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 16/32] bpf: Introduce BPF memory object model Kumar Kartikeya Dwivedi
2022-09-08 0:34 ` Alexei Starovoitov
2022-09-08 2:39 ` Kumar Kartikeya Dwivedi
2022-09-08 3:37 ` Alexei Starovoitov
2022-09-08 11:50 ` Kumar Kartikeya Dwivedi
2022-09-08 14:18 ` Alexei Starovoitov
2022-09-08 14:45 ` Kumar Kartikeya Dwivedi
2022-09-08 15:11 ` Alexei Starovoitov
2022-09-08 15:37 ` Kumar Kartikeya Dwivedi
2022-09-08 15:59 ` Alexei Starovoitov
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 17/32] bpf: Support bpf_list_node in local kptrs Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 18/32] bpf: Support bpf_spin_lock " Kumar Kartikeya Dwivedi
2022-09-08 0:35 ` Alexei Starovoitov
2022-09-09 8:25 ` Dave Marchevsky
2022-09-09 11:20 ` Kumar Kartikeya Dwivedi
2022-09-09 14:26 ` Alexei Starovoitov
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 19/32] bpf: Support bpf_list_head " Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 20/32] bpf: Introduce bpf_kptr_free helper Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 21/32] bpf: Allow locking bpf_spin_lock global variables Kumar Kartikeya Dwivedi
2022-09-08 0:27 ` Alexei Starovoitov
2022-09-08 0:39 ` Kumar Kartikeya Dwivedi
2022-09-08 0:55 ` Alexei Starovoitov
2022-09-08 1:00 ` Kumar Kartikeya Dwivedi
2022-09-08 1:08 ` Alexei Starovoitov
2022-09-08 1:15 ` Kumar Kartikeya Dwivedi
2022-09-08 2:39 ` Alexei Starovoitov
2022-09-09 8:13 ` Dave Marchevsky
2022-09-09 11:05 ` Kumar Kartikeya Dwivedi
2022-09-09 14:24 ` Alexei Starovoitov
2022-09-09 14:50 ` Kumar Kartikeya Dwivedi
2022-09-09 14:58 ` Alexei Starovoitov
2022-09-09 18:32 ` Andrii Nakryiko
2022-09-09 19:25 ` Alexei Starovoitov
2022-09-09 20:21 ` Andrii Nakryiko
2022-09-09 20:57 ` Alexei Starovoitov
2022-09-10 0:21 ` Andrii Nakryiko
2022-09-11 22:31 ` Alexei Starovoitov
2022-09-20 20:55 ` Andrii Nakryiko
2022-10-18 4:06 ` Andrii Nakryiko
2022-09-09 22:30 ` Dave Marchevsky
2022-09-09 22:49 ` Kumar Kartikeya Dwivedi
2022-09-09 22:57 ` Alexei Starovoitov
2022-09-09 23:04 ` Kumar Kartikeya Dwivedi
2022-09-09 22:51 ` Alexei Starovoitov
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 22/32] bpf: Bump BTF_KFUNC_SET_MAX_CNT Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 23/32] bpf: Add single ownership BPF linked list API Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 24/32] bpf: Permit NULL checking pointer with non-zero fixed offset Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 25/32] bpf: Allow storing local kptrs in BPF maps Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 26/32] bpf: Wire up freeing of bpf_list_heads in maps Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 27/32] bpf: Add destructor for bpf_list_head in local kptr Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 28/32] bpf: Remove duplicate PTR_TO_BTF_ID RO check Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 29/32] libbpf: Add support for private BSS map section Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 30/32] selftests/bpf: Add BTF tag macros for local kptrs, BPF linked lists Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 31/32] selftests/bpf: Add BPF linked list API tests Kumar Kartikeya Dwivedi
2022-09-04 20:41 ` [PATCH RFC bpf-next v1 32/32] selftests/bpf: Add referenced local kptr tests Kumar Kartikeya Dwivedi