* [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
@ 2022-12-06 23:09 Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record Dave Marchevsky
` (14 more replies)
0 siblings, 15 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
This series adds a rbtree datastructure following the "next-gen
datastructure" precedent set by recently-added linked-list [0]. This is
a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
instead of adding a new map type. This series adds a smaller set of API
functions than that RFC - just the minimum needed to support the
current cgfifo example scheduler in the ongoing sched_ext effort [2], namely:
bpf_rbtree_add
bpf_rbtree_remove
bpf_rbtree_first
The meat of this series is bugfixes and verifier infra work to support
these API functions. Adding more rbtree kfuncs in future patches should
be straightforward as a result.
BPF rbtree uses struct rb_root_cached + existing rbtree lib under the
hood. From the BPF program writer's perspective, a BPF rbtree is very
similar to the existing linked list. Consider the following example:
struct node_data {
long key;
long data;
struct bpf_rb_node node;
};
static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
struct node_data *node_a;
struct node_data *node_b;
node_a = container_of(a, struct node_data, node);
node_b = container_of(b, struct node_data, node);
return node_a->key < node_b->key;
}
private(A) struct bpf_spin_lock glock;
private(A) struct bpf_rb_root groot __contains(node_data, node);
/* ... in BPF program */
struct node_data *n, *m;
struct bpf_rb_node *res;
n = bpf_obj_new(typeof(*n));
if (!n)
/* skip */
n->key = 5;
n->data = 10;
bpf_spin_lock(&glock);
bpf_rbtree_add(&groot, &n->node, less);
bpf_spin_unlock(&glock);
bpf_spin_lock(&glock);
res = bpf_rbtree_first(&groot);
if (!res)
/* skip */
res = bpf_rbtree_remove(&groot, res);
if (!res)
/* skip */
bpf_spin_unlock(&glock);
m = container_of(res, struct node_data, node);
bpf_obj_drop(m);
Some obvious similarities:
* Special bpf_rb_root and bpf_rb_node types have same semantics
as bpf_list_head and bpf_list_node, respectively
* __contains is used to associate node type with root
* The spin_lock associated with a rbtree must be held when using
rbtree API kfuncs
* Nodes are allocated via bpf_obj_new and dropped via bpf_obj_drop
* Rbtree takes ownership of node lifetime when a node is added.
Removing a node gives ownership back to the program, requiring a
bpf_obj_drop before program exit
Some new additions as well:
* Support for callbacks in kfunc args is added to enable 'less'
callback use above
* bpf_rbtree_first's release_on_unlock handling is a bit novel, as
it's the first next-gen ds API function to release_on_unlock its
return reg instead of a nonexistent node arg
* Because all references to nodes already added to the rbtree are
'non-owning', i.e. release_on_unlock and PTR_UNTRUSTED,
bpf_rbtree_remove must accept such a reference in order to remove it
from the tree
It seemed better to special-case some 'new additions' verifier logic for
now instead of adding new type flags and concepts, as some of the concepts
(e.g. PTR_UNTRUSTED + release_on_unlock) need a refactoring pass before
we pile more on. Regardless, the net-new verifier logic added in this
patchset is minimal. Verifier changes are mostly generalization of
existing linked-list logic and some bugfixes.
A note on naming:
Some existing list-specific helpers are renamed to 'datastructure_head',
'datastructure_node', etc. Probably a more concise and accurate naming
would be something like 'ng_ds_head' for 'next-gen datastructure'.
For folks who weren't following the conversations over the past few months,
though, such a naming scheme might seem to indicate that _all_ next-gen
datastructures must have certain semantics, like release_on_unlock,
which aren't necessarily required. For this reason I'd like some
feedback on how to name things.
Summary of patches:
Patches 1, 2, and 10 are bugfixes which are likely worth applying
independently of rbtree implementation. Patch 12 is somewhere between
nice-to-have and bugfix.
Patches 3 and 4 are nonfunctional refactor/rename.
Patches 5 - 9 implement the meat of rbtree support in this series,
gradually building up to implemented kfuncs that verify as expected.
Patch 11 adds bpf_rbtree_{add,first,remove} decls to bpf_experimental.h.
Patch 13 adds tests.
[0]: lore.kernel.org/bpf/20221118015614.2013203-1-memxor@gmail.com
[1]: lore.kernel.org/bpf/20220830172759.4069786-1-davemarchevsky@fb.com
[2]: lore.kernel.org/bpf/20221130082313.3241517-1-tj@kernel.org
Future work:
Enabling writes to release_on_unlock refs should be done before the
functionality of BPF rbtree can truly be considered complete.
Implementing this proved more complex than expected so it's been
pushed off to a future patch.
Dave Marchevsky (13):
bpf: Loosen alloc obj test in verifier's reg_btf_record
bpf: map_check_btf should fail if btf_parse_fields fails
bpf: Minor refactor of ref_set_release_on_unlock
bpf: rename list_head -> datastructure_head in field info types
bpf: Add basic bpf_rb_{root,node} support
bpf: Add bpf_rbtree_{add,remove,first} kfuncs
bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args
bpf: Add callback validation to kfunc verifier logic
bpf: Special verifier handling for bpf_rbtree_{remove, first}
bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h
libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj
type
selftests/bpf: Add rbtree selftests
arch/x86/net/bpf_jit_comp.c | 123 +++--
include/linux/bpf.h | 21 +-
include/uapi/linux/bpf.h | 11 +
kernel/bpf/btf.c | 181 ++++---
kernel/bpf/helpers.c | 75 ++-
kernel/bpf/syscall.c | 33 +-
kernel/bpf/verifier.c | 506 +++++++++++++++---
tools/include/uapi/linux/bpf.h | 11 +
tools/lib/bpf/libbpf.c | 50 +-
.../testing/selftests/bpf/bpf_experimental.h | 24 +
.../selftests/bpf/prog_tests/linked_list.c | 12 +-
.../testing/selftests/bpf/prog_tests/rbtree.c | 184 +++++++
tools/testing/selftests/bpf/progs/rbtree.c | 180 +++++++
.../progs/rbtree_btf_fail__add_wrong_type.c | 48 ++
.../progs/rbtree_btf_fail__wrong_node_type.c | 21 +
.../testing/selftests/bpf/progs/rbtree_fail.c | 263 +++++++++
16 files changed, 1549 insertions(+), 194 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/rbtree.c
create mode 100644 tools/testing/selftests/bpf/progs/rbtree.c
create mode 100644 tools/testing/selftests/bpf/progs/rbtree_btf_fail__add_wrong_type.c
create mode 100644 tools/testing/selftests/bpf/progs/rbtree_btf_fail__wrong_node_type.c
create mode 100644 tools/testing/selftests/bpf/progs/rbtree_fail.c
--
2.30.2
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-07 16:41 ` Kumar Kartikeya Dwivedi
2022-12-06 23:09 ` [PATCH bpf-next 02/13] bpf: map_check_btf should fail if btf_parse_fields fails Dave Marchevsky
` (13 subsequent siblings)
14 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
There, a BTF record is created for any type containing a spin_lock or
any next-gen datastructure node/head.
Currently, for non-MAP_VALUE types, reg_btf_record will only search for
a record using struct_meta_tab if the reg->type exactly matches
(PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
"allocated obj" type - returned from bpf_obj_new - might pick up other
flags while working its way through the program.
Loosen the check to be exact for base_type and just use MEM_ALLOC mask
for type_flag.
This patch is marked Fixes because reg_btf_record was likely never
intended to fail to find the btf_record for valid alloc obj types that
have additional flags, some of which (e.g. PTR_UNTRUSTED) are valid
register type states for alloc obj independent of this series.
However, I didn't find a specific broken repro case outside of this
series' added functionality, so it's possible that nothing was
triggering this logic error before.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects")
---
kernel/bpf/verifier.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1d51bd9596da..67a13110bc22 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -451,6 +451,11 @@ static bool reg_type_not_null(enum bpf_reg_type type)
type == PTR_TO_SOCK_COMMON;
}
+static bool type_is_ptr_alloc_obj(u32 type)
+{
+ return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & MEM_ALLOC;
+}
+
static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
{
struct btf_record *rec = NULL;
@@ -458,7 +463,7 @@ static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
if (reg->type == PTR_TO_MAP_VALUE) {
rec = reg->map_ptr->record;
- } else if (reg->type == (PTR_TO_BTF_ID | MEM_ALLOC)) {
+ } else if (type_is_ptr_alloc_obj(reg->type)) {
meta = btf_find_struct_meta(reg->btf, reg->btf_id);
if (meta)
rec = meta->record;
--
2.30.2
* [PATCH bpf-next 02/13] bpf: map_check_btf should fail if btf_parse_fields fails
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-07 1:32 ` Alexei Starovoitov
2022-12-07 16:49 ` Kumar Kartikeya Dwivedi
2022-12-06 23:09 ` [PATCH bpf-next 03/13] bpf: Minor refactor of ref_set_release_on_unlock Dave Marchevsky
` (12 subsequent siblings)
14 siblings, 2 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
map_check_btf calls btf_parse_fields to create a btf_record for its
value_type. If there are no special fields in the value_type,
btf_parse_fields returns NULL, whereas if there are special value_type
fields but they are invalid in some way, an error is returned.
An example invalid state would be:
struct node_data {
struct bpf_rb_node node;
int data;
};
private(A) struct bpf_spin_lock glock;
private(A) struct bpf_list_head ghead __contains(node_data, node);
ghead should be invalid as its __contains tag points to a field with
type != "bpf_list_node".
Before this patch, such a scenario would result in btf_parse_fields
returning an error ptr, subsequent !IS_ERR_OR_NULL check failing,
and btf_check_and_fixup_fields returning 0, which would then be
returned by map_check_btf.
After this patch's changes, -EINVAL would be returned by map_check_btf
and the map would correctly fail to load.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Fixes: aa3496accc41 ("bpf: Refactor kptr_off_tab into btf_record")
---
kernel/bpf/syscall.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 35972afb6850..c3599a7902f0 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1007,7 +1007,10 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
map->record = btf_parse_fields(btf, value_type,
BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD,
map->value_size);
- if (!IS_ERR_OR_NULL(map->record)) {
+ if (IS_ERR(map->record))
+ return -EINVAL;
+
+ if (map->record) {
int i;
if (!bpf_capable()) {
--
2.30.2
* [PATCH bpf-next 03/13] bpf: Minor refactor of ref_set_release_on_unlock
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 02/13] bpf: map_check_btf should fail if btf_parse_fields fails Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 04/13] bpf: rename list_head -> datastructure_head in field info types Dave Marchevsky
` (11 subsequent siblings)
14 siblings, 0 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
This is mostly a nonfunctional change. The verifier log message
"expected false release_on_unlock" was missing a newline, so add it and
move some checks around to reduce indentation level.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
kernel/bpf/verifier.c | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 67a13110bc22..6f0aac837d77 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8438,19 +8438,21 @@ static int ref_set_release_on_unlock(struct bpf_verifier_env *env, u32 ref_obj_i
return -EFAULT;
}
for (i = 0; i < state->acquired_refs; i++) {
- if (state->refs[i].id == ref_obj_id) {
- if (state->refs[i].release_on_unlock) {
- verbose(env, "verifier internal error: expected false release_on_unlock");
- return -EFAULT;
- }
- state->refs[i].release_on_unlock = true;
- /* Now mark everyone sharing same ref_obj_id as untrusted */
- bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({
- if (reg->ref_obj_id == ref_obj_id)
- reg->type |= PTR_UNTRUSTED;
- }));
- return 0;
+ if (state->refs[i].id != ref_obj_id)
+ continue;
+
+ if (state->refs[i].release_on_unlock) {
+ verbose(env, "verifier internal error: expected false release_on_unlock\n");
+ return -EFAULT;
}
+
+ state->refs[i].release_on_unlock = true;
+ /* Now mark everyone sharing same ref_obj_id as untrusted */
+ bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({
+ if (reg->ref_obj_id == ref_obj_id)
+ reg->type |= PTR_UNTRUSTED;
+ }));
+ return 0;
}
verbose(env, "verifier internal error: ref state missing for ref_obj_id\n");
return -EFAULT;
--
2.30.2
* [PATCH bpf-next 04/13] bpf: rename list_head -> datastructure_head in field info types
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (2 preceding siblings ...)
2022-12-06 23:09 ` [PATCH bpf-next 03/13] bpf: Minor refactor of ref_set_release_on_unlock Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-07 1:41 ` Alexei Starovoitov
2022-12-06 23:09 ` [PATCH bpf-next 05/13] bpf: Add basic bpf_rb_{root,node} support Dave Marchevsky
` (10 subsequent siblings)
14 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
Many of the structs recently added to track field info for linked-list
head are useful as-is for rbtree root. So let's do a mechanical renaming
of list_head-related types and fields:
include/linux/bpf.h:
struct btf_field_list_head -> struct btf_field_datastructure_head
list_head -> datastructure_head in struct btf_field union
kernel/bpf/btf.c:
list_head -> datastructure_head in struct btf_field_info
This is a nonfunctional change, functionality to actually use these
fields for rbtree will be added in further patches.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
include/linux/bpf.h | 4 ++--
kernel/bpf/btf.c | 21 +++++++++++----------
kernel/bpf/helpers.c | 4 ++--
kernel/bpf/verifier.c | 21 +++++++++++----------
4 files changed, 26 insertions(+), 24 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4920ac252754..9e8b12c7061e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -189,7 +189,7 @@ struct btf_field_kptr {
u32 btf_id;
};
-struct btf_field_list_head {
+struct btf_field_datastructure_head {
struct btf *btf;
u32 value_btf_id;
u32 node_offset;
@@ -201,7 +201,7 @@ struct btf_field {
enum btf_field_type type;
union {
struct btf_field_kptr kptr;
- struct btf_field_list_head list_head;
+ struct btf_field_datastructure_head datastructure_head;
};
};
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index c80bd8709e69..284e3e4b76b7 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3227,7 +3227,7 @@ struct btf_field_info {
struct {
const char *node_name;
u32 value_btf_id;
- } list_head;
+ } datastructure_head;
};
};
@@ -3334,8 +3334,8 @@ static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
return -EINVAL;
info->type = BPF_LIST_HEAD;
info->off = off;
- info->list_head.value_btf_id = id;
- info->list_head.node_name = list_node;
+ info->datastructure_head.value_btf_id = id;
+ info->datastructure_head.node_name = list_node;
return BTF_FIELD_FOUND;
}
@@ -3603,13 +3603,14 @@ static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
u32 offset;
int i;
- t = btf_type_by_id(btf, info->list_head.value_btf_id);
+ t = btf_type_by_id(btf, info->datastructure_head.value_btf_id);
/* We've already checked that value_btf_id is a struct type. We
* just need to figure out the offset of the list_node, and
* verify its type.
*/
for_each_member(i, t, member) {
- if (strcmp(info->list_head.node_name, __btf_name_by_offset(btf, member->name_off)))
+ if (strcmp(info->datastructure_head.node_name,
+ __btf_name_by_offset(btf, member->name_off)))
continue;
/* Invalid BTF, two members with same name */
if (n)
@@ -3626,9 +3627,9 @@ static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
if (offset % __alignof__(struct bpf_list_node))
return -EINVAL;
- field->list_head.btf = (struct btf *)btf;
- field->list_head.value_btf_id = info->list_head.value_btf_id;
- field->list_head.node_offset = offset;
+ field->datastructure_head.btf = (struct btf *)btf;
+ field->datastructure_head.value_btf_id = info->datastructure_head.value_btf_id;
+ field->datastructure_head.node_offset = offset;
}
if (!n)
return -ENOENT;
@@ -3735,11 +3736,11 @@ int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
if (!(rec->fields[i].type & BPF_LIST_HEAD))
continue;
- btf_id = rec->fields[i].list_head.value_btf_id;
+ btf_id = rec->fields[i].datastructure_head.value_btf_id;
meta = btf_find_struct_meta(btf, btf_id);
if (!meta)
return -EFAULT;
- rec->fields[i].list_head.value_rec = meta->record;
+ rec->fields[i].datastructure_head.value_rec = meta->record;
if (!(rec->field_mask & BPF_LIST_NODE))
continue;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index cca642358e80..6c67740222c2 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1737,12 +1737,12 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
while (head != orig_head) {
void *obj = head;
- obj -= field->list_head.node_offset;
+ obj -= field->datastructure_head.node_offset;
head = head->next;
/* The contained type can also have resources, including a
* bpf_list_head which needs to be freed.
*/
- bpf_obj_free_fields(field->list_head.value_rec, obj);
+ bpf_obj_free_fields(field->datastructure_head.value_rec, obj);
/* bpf_mem_free requires migrate_disable(), since we can be
* called from map free path as well apart from BPF program (as
* part of map ops doing bpf_obj_free_fields).
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6f0aac837d77..bc80b4c4377b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8615,21 +8615,22 @@ static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env,
field = meta->arg_list_head.field;
- et = btf_type_by_id(field->list_head.btf, field->list_head.value_btf_id);
+ et = btf_type_by_id(field->datastructure_head.btf, field->datastructure_head.value_btf_id);
t = btf_type_by_id(reg->btf, reg->btf_id);
- if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, 0, field->list_head.btf,
- field->list_head.value_btf_id, true)) {
+ if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, 0, field->datastructure_head.btf,
+ field->datastructure_head.value_btf_id, true)) {
verbose(env, "operation on bpf_list_head expects arg#1 bpf_list_node at offset=%d "
"in struct %s, but arg is at offset=%d in struct %s\n",
- field->list_head.node_offset, btf_name_by_offset(field->list_head.btf, et->name_off),
+ field->datastructure_head.node_offset,
+ btf_name_by_offset(field->datastructure_head.btf, et->name_off),
list_node_off, btf_name_by_offset(reg->btf, t->name_off));
return -EINVAL;
}
- if (list_node_off != field->list_head.node_offset) {
+ if (list_node_off != field->datastructure_head.node_offset) {
verbose(env, "arg#1 offset=%d, but expected bpf_list_node at offset=%d in struct %s\n",
- list_node_off, field->list_head.node_offset,
- btf_name_by_offset(field->list_head.btf, et->name_off));
+ list_node_off, field->datastructure_head.node_offset,
+ btf_name_by_offset(field->datastructure_head.btf, et->name_off));
return -EINVAL;
}
/* Set arg#1 for expiration after unlock */
@@ -9078,9 +9079,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
- regs[BPF_REG_0].btf = field->list_head.btf;
- regs[BPF_REG_0].btf_id = field->list_head.value_btf_id;
- regs[BPF_REG_0].off = field->list_head.node_offset;
+ regs[BPF_REG_0].btf = field->datastructure_head.btf;
+ regs[BPF_REG_0].btf_id = field->datastructure_head.value_btf_id;
+ regs[BPF_REG_0].off = field->datastructure_head.node_offset;
} else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) {
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
--
2.30.2
* [PATCH bpf-next 05/13] bpf: Add basic bpf_rb_{root,node} support
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (3 preceding siblings ...)
2022-12-06 23:09 ` [PATCH bpf-next 04/13] bpf: rename list_head -> datastructure_head in field info types Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-07 1:48 ` Alexei Starovoitov
2022-12-06 23:09 ` [PATCH bpf-next 06/13] bpf: Add bpf_rbtree_{add,remove,first} kfuncs Dave Marchevsky
` (9 subsequent siblings)
14 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
This patch adds special BPF_RB_{ROOT,NODE} btf_field_types similar to
BPF_LIST_{HEAD,NODE}, adds the necessary plumbing to detect the new
types, and adds bpf_rb_root_free function for freeing bpf_rb_root in
map_values.
structs bpf_rb_root and bpf_rb_node are opaque types meant to
obscure structs rb_root_cached and rb_node, respectively.
btf_struct_access will prevent BPF programs from touching these special
fields automatically now that they're recognized.
btf_check_and_fixup_fields now groups list_head and rb_root together as
"owner" fields and {list,rb}_node as "ownee", and does the same
ownership cycle checking as before. Note this function does _not_ prevent
ownership type mixups (e.g. rb_root owning list_node) - that's handled
by btf_parse_datastructure_head.
After this patch, a bpf program can have a struct bpf_rb_root in a
map_value, but not add anything to nor do anything useful with it.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
include/linux/bpf.h | 17 ++
include/uapi/linux/bpf.h | 11 ++
kernel/bpf/btf.c | 162 ++++++++++++------
kernel/bpf/helpers.c | 40 +++++
kernel/bpf/syscall.c | 28 ++-
kernel/bpf/verifier.c | 5 +-
tools/include/uapi/linux/bpf.h | 11 ++
.../selftests/bpf/prog_tests/linked_list.c | 12 +-
8 files changed, 214 insertions(+), 72 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9e8b12c7061e..2f8c4960390e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -180,6 +180,8 @@ enum btf_field_type {
BPF_KPTR = BPF_KPTR_UNREF | BPF_KPTR_REF,
BPF_LIST_HEAD = (1 << 4),
BPF_LIST_NODE = (1 << 5),
+ BPF_RB_ROOT = (1 << 6),
+ BPF_RB_NODE = (1 << 7),
};
struct btf_field_kptr {
@@ -283,6 +285,10 @@ static inline const char *btf_field_type_name(enum btf_field_type type)
return "bpf_list_head";
case BPF_LIST_NODE:
return "bpf_list_node";
+ case BPF_RB_ROOT:
+ return "bpf_rb_root";
+ case BPF_RB_NODE:
+ return "bpf_rb_node";
default:
WARN_ON_ONCE(1);
return "unknown";
@@ -303,6 +309,10 @@ static inline u32 btf_field_type_size(enum btf_field_type type)
return sizeof(struct bpf_list_head);
case BPF_LIST_NODE:
return sizeof(struct bpf_list_node);
+ case BPF_RB_ROOT:
+ return sizeof(struct bpf_rb_root);
+ case BPF_RB_NODE:
+ return sizeof(struct bpf_rb_node);
default:
WARN_ON_ONCE(1);
return 0;
@@ -323,6 +333,10 @@ static inline u32 btf_field_type_align(enum btf_field_type type)
return __alignof__(struct bpf_list_head);
case BPF_LIST_NODE:
return __alignof__(struct bpf_list_node);
+ case BPF_RB_ROOT:
+ return __alignof__(struct bpf_rb_root);
+ case BPF_RB_NODE:
+ return __alignof__(struct bpf_rb_node);
default:
WARN_ON_ONCE(1);
return 0;
@@ -433,6 +447,9 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src,
void bpf_timer_cancel_and_free(void *timer);
void bpf_list_head_free(const struct btf_field *field, void *list_head,
struct bpf_spin_lock *spin_lock);
+void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
+ struct bpf_spin_lock *spin_lock);
+
int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f89de51a45db..02e68c352372 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6901,6 +6901,17 @@ struct bpf_list_node {
__u64 :64;
} __attribute__((aligned(8)));
+struct bpf_rb_root {
+ __u64 :64;
+ __u64 :64;
+} __attribute__((aligned(8)));
+
+struct bpf_rb_node {
+ __u64 :64;
+ __u64 :64;
+ __u64 :64;
+} __attribute__((aligned(8)));
+
struct bpf_sysctl {
__u32 write; /* Sysctl is being read (= 0) or written (= 1).
* Allows 1,2,4-byte read, but no write.
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 284e3e4b76b7..a42f67031963 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -3304,12 +3304,14 @@ static const char *btf_find_decl_tag_value(const struct btf *btf,
return NULL;
}
-static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
- const struct btf_type *t, int comp_idx,
- u32 off, int sz, struct btf_field_info *info)
+static int
+btf_find_datastructure_head(const struct btf *btf, const struct btf_type *pt,
+ const struct btf_type *t, int comp_idx, u32 off,
+ int sz, struct btf_field_info *info,
+ enum btf_field_type head_type)
{
+ const char *node_field_name;
const char *value_type;
- const char *list_node;
s32 id;
if (!__btf_type_is_struct(t))
@@ -3319,26 +3321,32 @@ static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
value_type = btf_find_decl_tag_value(btf, pt, comp_idx, "contains:");
if (!value_type)
return -EINVAL;
- list_node = strstr(value_type, ":");
- if (!list_node)
+ node_field_name = strstr(value_type, ":");
+ if (!node_field_name)
return -EINVAL;
- value_type = kstrndup(value_type, list_node - value_type, GFP_KERNEL | __GFP_NOWARN);
+ value_type = kstrndup(value_type, node_field_name - value_type, GFP_KERNEL | __GFP_NOWARN);
if (!value_type)
return -ENOMEM;
id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT);
kfree(value_type);
if (id < 0)
return id;
- list_node++;
- if (str_is_empty(list_node))
+ node_field_name++;
+ if (str_is_empty(node_field_name))
return -EINVAL;
- info->type = BPF_LIST_HEAD;
+ info->type = head_type;
info->off = off;
info->datastructure_head.value_btf_id = id;
- info->datastructure_head.node_name = list_node;
+ info->datastructure_head.node_name = node_field_name;
return BTF_FIELD_FOUND;
}
+#define field_mask_test_name(field_type, field_type_str) \
+ if (field_mask & field_type && !strcmp(name, field_type_str)) { \
+ type = field_type; \
+ goto end; \
+ }
+
static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
int *align, int *sz)
{
@@ -3362,18 +3370,11 @@ static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
goto end;
}
}
- if (field_mask & BPF_LIST_HEAD) {
- if (!strcmp(name, "bpf_list_head")) {
- type = BPF_LIST_HEAD;
- goto end;
- }
- }
- if (field_mask & BPF_LIST_NODE) {
- if (!strcmp(name, "bpf_list_node")) {
- type = BPF_LIST_NODE;
- goto end;
- }
- }
+ field_mask_test_name(BPF_LIST_HEAD, "bpf_list_head");
+ field_mask_test_name(BPF_LIST_NODE, "bpf_list_node");
+ field_mask_test_name(BPF_RB_ROOT, "bpf_rb_root");
+ field_mask_test_name(BPF_RB_NODE, "bpf_rb_node");
+
/* Only return BPF_KPTR when all other types with matchable names fail */
if (field_mask & BPF_KPTR) {
type = BPF_KPTR_REF;
@@ -3386,6 +3387,8 @@ static int btf_get_field_type(const char *name, u32 field_mask, u32 *seen_mask,
return type;
}
+#undef field_mask_test_name
+
static int btf_find_struct_field(const struct btf *btf,
const struct btf_type *t, u32 field_mask,
struct btf_field_info *info, int info_cnt)
@@ -3418,6 +3421,7 @@ static int btf_find_struct_field(const struct btf *btf,
case BPF_SPIN_LOCK:
case BPF_TIMER:
case BPF_LIST_NODE:
+ case BPF_RB_NODE:
ret = btf_find_struct(btf, member_type, off, sz, field_type,
idx < info_cnt ? &info[idx] : &tmp);
if (ret < 0)
@@ -3431,8 +3435,11 @@ static int btf_find_struct_field(const struct btf *btf,
return ret;
break;
case BPF_LIST_HEAD:
- ret = btf_find_list_head(btf, t, member_type, i, off, sz,
- idx < info_cnt ? &info[idx] : &tmp);
+ case BPF_RB_ROOT:
+ ret = btf_find_datastructure_head(btf, t, member_type,
+ i, off, sz,
+ idx < info_cnt ? &info[idx] : &tmp,
+ field_type);
if (ret < 0)
return ret;
break;
@@ -3479,6 +3486,7 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
case BPF_SPIN_LOCK:
case BPF_TIMER:
case BPF_LIST_NODE:
+ case BPF_RB_NODE:
ret = btf_find_struct(btf, var_type, off, sz, field_type,
idx < info_cnt ? &info[idx] : &tmp);
if (ret < 0)
@@ -3492,8 +3500,11 @@ static int btf_find_datasec_var(const struct btf *btf, const struct btf_type *t,
return ret;
break;
case BPF_LIST_HEAD:
- ret = btf_find_list_head(btf, var, var_type, -1, off, sz,
- idx < info_cnt ? &info[idx] : &tmp);
+ case BPF_RB_ROOT:
+ ret = btf_find_datastructure_head(btf, var, var_type,
+ -1, off, sz,
+ idx < info_cnt ? &info[idx] : &tmp,
+ field_type);
if (ret < 0)
return ret;
break;
@@ -3595,8 +3606,11 @@ static int btf_parse_kptr(const struct btf *btf, struct btf_field *field,
return ret;
}
-static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
- struct btf_field_info *info)
+static int btf_parse_datastructure_head(const struct btf *btf,
+ struct btf_field *field,
+ struct btf_field_info *info,
+ const char *node_type_name,
+ size_t node_type_align)
{
const struct btf_type *t, *n = NULL;
const struct btf_member *member;
@@ -3618,13 +3632,13 @@ static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
n = btf_type_by_id(btf, member->type);
if (!__btf_type_is_struct(n))
return -EINVAL;
- if (strcmp("bpf_list_node", __btf_name_by_offset(btf, n->name_off)))
+ if (strcmp(node_type_name, __btf_name_by_offset(btf, n->name_off)))
return -EINVAL;
offset = __btf_member_bit_offset(n, member);
if (offset % 8)
return -EINVAL;
offset /= 8;
- if (offset % __alignof__(struct bpf_list_node))
+ if (offset % node_type_align)
return -EINVAL;
field->datastructure_head.btf = (struct btf *)btf;
@@ -3636,6 +3650,20 @@ static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
return 0;
}
+static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
+ struct btf_field_info *info)
+{
+ return btf_parse_datastructure_head(btf, field, info, "bpf_list_node",
+ __alignof__(struct bpf_list_node));
+}
+
+static int btf_parse_rb_root(const struct btf *btf, struct btf_field *field,
+ struct btf_field_info *info)
+{
+ return btf_parse_datastructure_head(btf, field, info, "bpf_rb_node",
+ __alignof__(struct bpf_rb_node));
+}
+
struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type *t,
u32 field_mask, u32 value_size)
{
@@ -3698,7 +3726,13 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
if (ret < 0)
goto end;
break;
+ case BPF_RB_ROOT:
+ ret = btf_parse_rb_root(btf, &rec->fields[i], &info_arr[i]);
+ if (ret < 0)
+ goto end;
+ break;
case BPF_LIST_NODE:
+ case BPF_RB_NODE:
break;
default:
ret = -EFAULT;
@@ -3707,8 +3741,9 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
rec->cnt++;
}
- /* bpf_list_head requires bpf_spin_lock */
- if (btf_record_has_field(rec, BPF_LIST_HEAD) && rec->spin_lock_off < 0) {
+ /* bpf_{list_head, rb_root} require bpf_spin_lock */
+ if ((btf_record_has_field(rec, BPF_LIST_HEAD) ||
+ btf_record_has_field(rec, BPF_RB_ROOT)) && rec->spin_lock_off < 0) {
ret = -EINVAL;
goto end;
}
@@ -3719,22 +3754,28 @@ struct btf_record *btf_parse_fields(const struct btf *btf, const struct btf_type
return ERR_PTR(ret);
}
+#define OWNER_FIELD_MASK (BPF_LIST_HEAD | BPF_RB_ROOT)
+#define OWNEE_FIELD_MASK (BPF_LIST_NODE | BPF_RB_NODE)
+
int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
{
int i;
- /* There are two owning types, kptr_ref and bpf_list_head. The former
- * only supports storing kernel types, which can never store references
- * to program allocated local types, atleast not yet. Hence we only need
- * to ensure that bpf_list_head ownership does not form cycles.
+ /* There are three types that signify ownership of some other type:
+ * kptr_ref, bpf_list_head, bpf_rb_root.
+ * kptr_ref only supports storing kernel types, which can't store
+ * references to program allocated local types.
+ *
+ * Hence we only need to ensure that bpf_{list_head,rb_root} ownership
+ * does not form cycles.
*/
- if (IS_ERR_OR_NULL(rec) || !(rec->field_mask & BPF_LIST_HEAD))
+ if (IS_ERR_OR_NULL(rec) || !(rec->field_mask & OWNER_FIELD_MASK))
return 0;
for (i = 0; i < rec->cnt; i++) {
struct btf_struct_meta *meta;
u32 btf_id;
- if (!(rec->fields[i].type & BPF_LIST_HEAD))
+ if (!(rec->fields[i].type & OWNER_FIELD_MASK))
continue;
btf_id = rec->fields[i].datastructure_head.value_btf_id;
meta = btf_find_struct_meta(btf, btf_id);
@@ -3742,39 +3783,47 @@ int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
return -EFAULT;
rec->fields[i].datastructure_head.value_rec = meta->record;
- if (!(rec->field_mask & BPF_LIST_NODE))
+ /* We need to set value_rec for all owner types, but no need
+ * to check ownership cycle for a type unless it's also an
+ * ownee type.
+ */
+ if (!(rec->field_mask & OWNEE_FIELD_MASK))
continue;
/* We need to ensure ownership acyclicity among all types. The
* proper way to do it would be to topologically sort all BTF
* IDs based on the ownership edges, since there can be multiple
- * bpf_list_head in a type. Instead, we use the following
- * reasoning:
+ * bpf_{list_head,rb_root} in a type. Instead, we use the
+ * following reasoning:
*
* - A type can only be owned by another type in user BTF if it
- * has a bpf_list_node.
+ * has a bpf_{list,rb}_node. Let's call these ownee types.
* - A type can only _own_ another type in user BTF if it has a
- * bpf_list_head.
+ * bpf_{list_head,rb_root}. Let's call these owner types.
*
- * We ensure that if a type has both bpf_list_head and
- * bpf_list_node, its element types cannot be owning types.
+ * We ensure that if a type is both an owner and ownee, its
+ * element types cannot be owner types.
*
* To ensure acyclicity:
*
- * When A only has bpf_list_head, ownership chain can be:
+ * When A is an owner type but not an ownee, its ownership
+ * chain can be:
* A -> B -> C
* Where:
- * - B has both bpf_list_head and bpf_list_node.
- * - C only has bpf_list_node.
+ * - A is an owner, e.g. has bpf_rb_root.
+ * - B is both an owner and ownee, e.g. has bpf_rb_node and
+ * bpf_list_head.
+ * - C is only an ownee, e.g. has bpf_list_node.
*
- * When A has both bpf_list_head and bpf_list_node, some other
- * type already owns it in the BTF domain, hence it can not own
- * another owning type through any of the bpf_list_head edges.
+ * When A is both an owner and ownee, some other type already
+ * owns it in the BTF domain, hence it cannot own
+ * another owner type through any of the ownership edges.
* A -> B
* Where:
- * - B only has bpf_list_node.
+ * - A is both an owner and ownee.
+ * - B is only an ownee.
*/
- if (meta->record->field_mask & BPF_LIST_HEAD)
+ if (meta->record->field_mask & OWNER_FIELD_MASK)
return -ELOOP;
}
return 0;
@@ -5236,6 +5285,8 @@ static const char *alloc_obj_fields[] = {
"bpf_spin_lock",
"bpf_list_head",
"bpf_list_node",
+ "bpf_rb_root",
+ "bpf_rb_node",
};
static struct btf_struct_metas *
@@ -5309,7 +5360,8 @@ btf_parse_struct_metas(struct bpf_verifier_log *log, struct btf *btf)
type = &tab->types[tab->cnt];
type->btf_id = i;
- record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE, t->size);
+ record = btf_parse_fields(btf, t, BPF_SPIN_LOCK | BPF_LIST_HEAD | BPF_LIST_NODE |
+ BPF_RB_ROOT | BPF_RB_NODE, t->size);
/* The record cannot be unset, treat it as an error if so */
if (IS_ERR_OR_NULL(record)) {
ret = PTR_ERR_OR_ZERO(record) ?: -EFAULT;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 6c67740222c2..4d04432b162e 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1753,6 +1753,46 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
}
}
+/* Like rbtree_postorder_for_each_entry_safe, but 'pos' and 'n' are
+ * 'rb_node *', so the field name of the rb_node within the containing
+ * struct is not needed.
+ *
+ * Since bpf_rb_root's node type has a corresponding struct btf_field with
+ * datastructure_head.node_offset, it's not necessary to know the field
+ * name or type of the node struct.
+ */
+#define bpf_rbtree_postorder_for_each_entry_safe(pos, n, root) \
+ for (pos = rb_first_postorder(root); \
+ pos && ({ n = rb_next_postorder(pos); 1; }); \
+ pos = n)
+
+void bpf_rb_root_free(const struct btf_field *field, void *rb_root,
+ struct bpf_spin_lock *spin_lock)
+{
+ struct rb_root_cached orig_root, *root = rb_root;
+ struct rb_node *pos, *n;
+ void *obj;
+
+ BUILD_BUG_ON(sizeof(struct rb_root_cached) > sizeof(struct bpf_rb_root));
+ BUILD_BUG_ON(__alignof__(struct rb_root_cached) > __alignof__(struct bpf_rb_root));
+
+ __bpf_spin_lock_irqsave(spin_lock);
+ orig_root = *root;
+ *root = RB_ROOT_CACHED;
+ __bpf_spin_unlock_irqrestore(spin_lock);
+
+ bpf_rbtree_postorder_for_each_entry_safe(pos, n, &orig_root.rb_root) {
+ obj = pos;
+ obj -= field->datastructure_head.node_offset;
+
+ bpf_obj_free_fields(field->datastructure_head.value_rec, obj);
+
+ migrate_disable();
+ bpf_mem_free(&bpf_global_ma, obj);
+ migrate_enable();
+ }
+}
+
__diag_push();
__diag_ignore_all("-Wmissing-prototypes",
"Global functions as their definitions will be in vmlinux BTF");
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c3599a7902f0..b6b464c15575 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -527,9 +527,6 @@ void btf_record_free(struct btf_record *rec)
return;
for (i = 0; i < rec->cnt; i++) {
switch (rec->fields[i].type) {
- case BPF_SPIN_LOCK:
- case BPF_TIMER:
- break;
case BPF_KPTR_UNREF:
case BPF_KPTR_REF:
if (rec->fields[i].kptr.module)
@@ -538,7 +535,11 @@ void btf_record_free(struct btf_record *rec)
break;
case BPF_LIST_HEAD:
case BPF_LIST_NODE:
- /* Nothing to release for bpf_list_head */
+ case BPF_RB_ROOT:
+ case BPF_RB_NODE:
+ case BPF_SPIN_LOCK:
+ case BPF_TIMER:
+ /* Nothing to release */
break;
default:
WARN_ON_ONCE(1);
@@ -571,9 +572,6 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
new_rec->cnt = 0;
for (i = 0; i < rec->cnt; i++) {
switch (fields[i].type) {
- case BPF_SPIN_LOCK:
- case BPF_TIMER:
- break;
case BPF_KPTR_UNREF:
case BPF_KPTR_REF:
btf_get(fields[i].kptr.btf);
@@ -584,7 +582,11 @@ struct btf_record *btf_record_dup(const struct btf_record *rec)
break;
case BPF_LIST_HEAD:
case BPF_LIST_NODE:
- /* Nothing to acquire for bpf_list_head */
+ case BPF_RB_ROOT:
+ case BPF_RB_NODE:
+ case BPF_SPIN_LOCK:
+ case BPF_TIMER:
+ /* Nothing to acquire */
break;
default:
ret = -EFAULT;
@@ -664,7 +666,13 @@ void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
continue;
bpf_list_head_free(field, field_ptr, obj + rec->spin_lock_off);
break;
+ case BPF_RB_ROOT:
+ if (WARN_ON_ONCE(rec->spin_lock_off < 0))
+ continue;
+ bpf_rb_root_free(field, field_ptr, obj + rec->spin_lock_off);
+ break;
case BPF_LIST_NODE:
+ case BPF_RB_NODE:
break;
default:
WARN_ON_ONCE(1);
@@ -1005,7 +1013,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
return -EINVAL;
map->record = btf_parse_fields(btf, value_type,
- BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD,
+ BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD |
+ BPF_RB_ROOT,
map->value_size);
if (IS_ERR(map->record))
return -EINVAL;
@@ -1056,6 +1065,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
}
break;
case BPF_LIST_HEAD:
+ case BPF_RB_ROOT:
if (map->map_type != BPF_MAP_TYPE_HASH &&
map->map_type != BPF_MAP_TYPE_LRU_HASH &&
map->map_type != BPF_MAP_TYPE_ARRAY) {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index bc80b4c4377b..9d9e00fd6dfa 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14105,9 +14105,10 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
{
enum bpf_prog_type prog_type = resolve_prog_type(prog);
- if (btf_record_has_field(map->record, BPF_LIST_HEAD)) {
+ if (btf_record_has_field(map->record, BPF_LIST_HEAD) ||
+ btf_record_has_field(map->record, BPF_RB_ROOT)) {
if (is_tracing_prog_type(prog_type)) {
- verbose(env, "tracing progs cannot use bpf_list_head yet\n");
+ verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
return -EINVAL;
}
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f89de51a45db..02e68c352372 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6901,6 +6901,17 @@ struct bpf_list_node {
__u64 :64;
} __attribute__((aligned(8)));
+struct bpf_rb_root {
+ __u64 :64;
+ __u64 :64;
+} __attribute__((aligned(8)));
+
+struct bpf_rb_node {
+ __u64 :64;
+ __u64 :64;
+ __u64 :64;
+} __attribute__((aligned(8)));
+
struct bpf_sysctl {
__u32 write; /* Sysctl is being read (= 0) or written (= 1).
* Allows 1,2,4-byte read, but no write.
diff --git a/tools/testing/selftests/bpf/prog_tests/linked_list.c b/tools/testing/selftests/bpf/prog_tests/linked_list.c
index 9a7d4c47af63..b124028ab51a 100644
--- a/tools/testing/selftests/bpf/prog_tests/linked_list.c
+++ b/tools/testing/selftests/bpf/prog_tests/linked_list.c
@@ -58,12 +58,12 @@ static struct {
TEST(inner_map, pop_front)
TEST(inner_map, pop_back)
#undef TEST
- { "map_compat_kprobe", "tracing progs cannot use bpf_list_head yet" },
- { "map_compat_kretprobe", "tracing progs cannot use bpf_list_head yet" },
- { "map_compat_tp", "tracing progs cannot use bpf_list_head yet" },
- { "map_compat_perf", "tracing progs cannot use bpf_list_head yet" },
- { "map_compat_raw_tp", "tracing progs cannot use bpf_list_head yet" },
- { "map_compat_raw_tp_w", "tracing progs cannot use bpf_list_head yet" },
+ { "map_compat_kprobe", "tracing progs cannot use bpf_{list_head,rb_root} yet" },
+ { "map_compat_kretprobe", "tracing progs cannot use bpf_{list_head,rb_root} yet" },
+ { "map_compat_tp", "tracing progs cannot use bpf_{list_head,rb_root} yet" },
+ { "map_compat_perf", "tracing progs cannot use bpf_{list_head,rb_root} yet" },
+ { "map_compat_raw_tp", "tracing progs cannot use bpf_{list_head,rb_root} yet" },
+ { "map_compat_raw_tp_w", "tracing progs cannot use bpf_{list_head,rb_root} yet" },
{ "obj_type_id_oor", "local type ID argument must be in range [0, U32_MAX]" },
{ "obj_new_no_composite", "bpf_obj_new type ID argument must be of a struct" },
{ "obj_new_no_struct", "bpf_obj_new type ID argument must be of a struct" },
--
2.30.2
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH bpf-next 06/13] bpf: Add bpf_rbtree_{add,remove,first} kfuncs
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (4 preceding siblings ...)
2022-12-06 23:09 ` [PATCH bpf-next 05/13] bpf: Add basic bpf_rb_{root,node} support Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 07/13] bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args Dave Marchevsky
` (8 subsequent siblings)
14 siblings, 0 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
This patch adds implementations of bpf_rbtree_{add,remove,first}
and teaches the verifier about their BTF_IDs, as well as those of
bpf_rb_{root,node}.
All three kfuncs have some nonstandard component to their verification
that needs to be addressed in future patches before programs can
properly use them:
* bpf_rbtree_add: Takes 'less' callback, need to verify it
* bpf_rbtree_first: Returns ptr_to_node_type(off=rb_node_off) instead
of ptr_to_rb_node(off=0). Return value ref
should be released on unlock.
* bpf_rbtree_remove: Returns ptr_to_node_type(off=rb_node_off) instead
of ptr_to_rb_node(off=0). 2nd arg (node) is a
release_on_unlock + PTR_UNTRUSTED reg.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
kernel/bpf/helpers.c | 31 +++++++++++++++++++++++++++++++
kernel/bpf/verifier.c | 11 +++++++++++
2 files changed, 42 insertions(+)
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 4d04432b162e..d216c54b65ab 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1865,6 +1865,33 @@ struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head)
return __bpf_list_del(head, true);
}
+struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root, struct bpf_rb_node *node)
+{
+ struct rb_root_cached *r = (struct rb_root_cached *)root;
+ struct rb_node *n = (struct rb_node *)node;
+
+ if (WARN_ON_ONCE(RB_EMPTY_NODE(n)))
+ return (struct bpf_rb_node *)NULL;
+
+ rb_erase_cached(n, r);
+ RB_CLEAR_NODE(n);
+ return (struct bpf_rb_node *)n;
+}
+
+void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
+ bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b))
+{
+ rb_add_cached((struct rb_node *)node, (struct rb_root_cached *)root,
+ (bool (*)(struct rb_node *, const struct rb_node *))less);
+}
+
+struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root)
+{
+ struct rb_root_cached *r = (struct rb_root_cached *)root;
+
+ return (struct bpf_rb_node *)rb_first_cached(r);
+}
+
/**
* bpf_task_acquire - Acquire a reference to a task. A task acquired by this
* kfunc which is not stored in a map as a kptr, must be released by calling
@@ -2069,6 +2096,10 @@ BTF_ID_FLAGS(func, bpf_task_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_task_acquire_not_zero, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_kptr_get, KF_ACQUIRE | KF_KPTR_GET | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_release, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_rbtree_remove, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_rbtree_add)
+BTF_ID_FLAGS(func, bpf_rbtree_first, KF_ACQUIRE | KF_RET_NULL)
+
#ifdef CONFIG_CGROUPS
BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_cgroup_kptr_get, KF_ACQUIRE | KF_KPTR_GET | KF_RET_NULL)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9d9e00fd6dfa..e36dbde8736c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8135,6 +8135,8 @@ BTF_ID_LIST(kf_arg_btf_ids)
BTF_ID(struct, bpf_dynptr_kern)
BTF_ID(struct, bpf_list_head)
BTF_ID(struct, bpf_list_node)
+BTF_ID(struct, bpf_rb_root)
+BTF_ID(struct, bpf_rb_node)
static bool __is_kfunc_ptr_arg_type(const struct btf *btf,
const struct btf_param *arg, int type)
@@ -8240,6 +8242,9 @@ enum special_kfunc_type {
KF_bpf_rdonly_cast,
KF_bpf_rcu_read_lock,
KF_bpf_rcu_read_unlock,
+ KF_bpf_rbtree_remove,
+ KF_bpf_rbtree_add,
+ KF_bpf_rbtree_first,
};
BTF_SET_START(special_kfunc_set)
@@ -8251,6 +8256,9 @@ BTF_ID(func, bpf_list_pop_front)
BTF_ID(func, bpf_list_pop_back)
BTF_ID(func, bpf_cast_to_kern_ctx)
BTF_ID(func, bpf_rdonly_cast)
+BTF_ID(func, bpf_rbtree_remove)
+BTF_ID(func, bpf_rbtree_add)
+BTF_ID(func, bpf_rbtree_first)
BTF_SET_END(special_kfunc_set)
BTF_ID_LIST(special_kfunc_list)
@@ -8264,6 +8272,9 @@ BTF_ID(func, bpf_cast_to_kern_ctx)
BTF_ID(func, bpf_rdonly_cast)
BTF_ID(func, bpf_rcu_read_lock)
BTF_ID(func, bpf_rcu_read_unlock)
+BTF_ID(func, bpf_rbtree_remove)
+BTF_ID(func, bpf_rbtree_add)
+BTF_ID(func, bpf_rbtree_first)
static bool is_kfunc_bpf_rcu_read_lock(struct bpf_kfunc_call_arg_meta *meta)
{
--
2.30.2
* [PATCH bpf-next 07/13] bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (5 preceding siblings ...)
2022-12-06 23:09 ` [PATCH bpf-next 06/13] bpf: Add bpf_rbtree_{add,remove,first} kfuncs Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-07 1:51 ` Alexei Starovoitov
2022-12-06 23:09 ` [PATCH bpf-next 08/13] bpf: Add callback validation to kfunc verifier logic Dave Marchevsky
` (7 subsequent siblings)
14 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
Now that we find bpf_rb_root and bpf_rb_node in structs, let's give args
that contain those types special classification and properly handle
these types when checking kfunc args.
"Properly handling" these types largely requires generalizing similar
handling for bpf_list_{head,node}, with little new logic added in this
patch.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
kernel/bpf/verifier.c | 237 ++++++++++++++++++++++++++++++++++++------
1 file changed, 203 insertions(+), 34 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e36dbde8736c..652112007b2c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8018,6 +8018,9 @@ struct bpf_kfunc_call_arg_meta {
struct {
struct btf_field *field;
} arg_list_head;
+ struct {
+ struct btf_field *field;
+ } arg_rbtree_root;
};
static bool is_kfunc_acquire(struct bpf_kfunc_call_arg_meta *meta)
@@ -8129,6 +8132,8 @@ enum {
KF_ARG_DYNPTR_ID,
KF_ARG_LIST_HEAD_ID,
KF_ARG_LIST_NODE_ID,
+ KF_ARG_RB_ROOT_ID,
+ KF_ARG_RB_NODE_ID,
};
BTF_ID_LIST(kf_arg_btf_ids)
@@ -8170,6 +8175,16 @@ static bool is_kfunc_arg_list_node(const struct btf *btf, const struct btf_param
return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_LIST_NODE_ID);
}
+static bool is_kfunc_arg_rbtree_root(const struct btf *btf, const struct btf_param *arg)
+{
+ return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_ROOT_ID);
+}
+
+static bool is_kfunc_arg_rbtree_node(const struct btf *btf, const struct btf_param *arg)
+{
+ return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_NODE_ID);
+}
+
/* Returns true if struct is composed of scalars, 4 levels of nesting allowed */
static bool __btf_type_is_scalar_struct(struct bpf_verifier_env *env,
const struct btf *btf,
@@ -8229,6 +8244,8 @@ enum kfunc_ptr_arg_type {
KF_ARG_PTR_TO_BTF_ID, /* Also covers reg2btf_ids conversions */
KF_ARG_PTR_TO_MEM,
KF_ARG_PTR_TO_MEM_SIZE, /* Size derived from next argument, skip it */
+ KF_ARG_PTR_TO_RB_ROOT,
+ KF_ARG_PTR_TO_RB_NODE,
};
enum special_kfunc_type {
@@ -8336,6 +8353,12 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
if (is_kfunc_arg_list_node(meta->btf, &args[argno]))
return KF_ARG_PTR_TO_LIST_NODE;
+ if (is_kfunc_arg_rbtree_root(meta->btf, &args[argno]))
+ return KF_ARG_PTR_TO_RB_ROOT;
+
+ if (is_kfunc_arg_rbtree_node(meta->btf, &args[argno]))
+ return KF_ARG_PTR_TO_RB_NODE;
+
if ((base_type(reg->type) == PTR_TO_BTF_ID || reg2btf_ids[base_type(reg->type)])) {
if (!btf_type_is_struct(ref_t)) {
verbose(env, "kernel function %s args#%d pointer type %s %s is not supported\n",
@@ -8550,97 +8573,196 @@ static bool is_bpf_list_api_kfunc(u32 btf_id)
btf_id == special_kfunc_list[KF_bpf_list_pop_back];
}
-static int process_kf_arg_ptr_to_list_head(struct bpf_verifier_env *env,
+static bool is_bpf_rbtree_api_kfunc(u32 btf_id)
+{
+ return btf_id == special_kfunc_list[KF_bpf_rbtree_add] ||
+ btf_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
+ btf_id == special_kfunc_list[KF_bpf_rbtree_first];
+}
+
+static bool is_bpf_datastructure_api_kfunc(u32 btf_id)
+{
+ return is_bpf_list_api_kfunc(btf_id) || is_bpf_rbtree_api_kfunc(btf_id);
+}
+
+static bool check_kfunc_is_datastructure_head_api(struct bpf_verifier_env *env,
+ enum btf_field_type head_field_type,
+ u32 kfunc_btf_id)
+{
+ bool ret;
+
+ switch (head_field_type) {
+ case BPF_LIST_HEAD:
+ ret = is_bpf_list_api_kfunc(kfunc_btf_id);
+ break;
+ case BPF_RB_ROOT:
+ ret = is_bpf_rbtree_api_kfunc(kfunc_btf_id);
+ break;
+ default:
+ verbose(env, "verifier internal error: unexpected datastructure head argument type %s\n",
+ btf_field_type_name(head_field_type));
+ return false;
+ }
+
+ if (!ret)
+ verbose(env, "verifier internal error: %s head arg for unknown kfunc\n",
+ btf_field_type_name(head_field_type));
+ return ret;
+}
+
+static bool check_kfunc_is_datastructure_node_api(struct bpf_verifier_env *env,
+ enum btf_field_type node_field_type,
+ u32 kfunc_btf_id)
+{
+ bool ret;
+
+ switch (node_field_type) {
+ case BPF_LIST_NODE:
+ ret = (kfunc_btf_id == special_kfunc_list[KF_bpf_list_push_front] ||
+ kfunc_btf_id == special_kfunc_list[KF_bpf_list_push_back]);
+ break;
+ case BPF_RB_NODE:
+ ret = (kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
+ kfunc_btf_id == special_kfunc_list[KF_bpf_rbtree_add]);
+ break;
+ default:
+ verbose(env, "verifier internal error: unexpected datastructure node argument type %s\n",
+ btf_field_type_name(node_field_type));
+ return false;
+ }
+
+ if (!ret)
+ verbose(env, "verifier internal error: %s node arg for unknown kfunc\n",
+ btf_field_type_name(node_field_type));
+ return ret;
+}
+
+static int
+__process_kf_arg_ptr_to_datastructure_head(struct bpf_verifier_env *env,
struct bpf_reg_state *reg, u32 regno,
- struct bpf_kfunc_call_arg_meta *meta)
+ struct bpf_kfunc_call_arg_meta *meta,
+ enum btf_field_type head_field_type,
+ struct btf_field **head_field)
{
+ const char *head_type_name;
struct btf_field *field;
struct btf_record *rec;
- u32 list_head_off;
+ u32 head_off;
- if (meta->btf != btf_vmlinux || !is_bpf_list_api_kfunc(meta->func_id)) {
- verbose(env, "verifier internal error: bpf_list_head argument for unknown kfunc\n");
+ if (meta->btf != btf_vmlinux) {
+ verbose(env, "verifier internal error: unexpected btf mismatch in kfunc call\n");
return -EFAULT;
}
+ if (!check_kfunc_is_datastructure_head_api(env, head_field_type, meta->func_id))
+ return -EFAULT;
+
+ head_type_name = btf_field_type_name(head_field_type);
if (!tnum_is_const(reg->var_off)) {
verbose(env,
- "R%d doesn't have constant offset. bpf_list_head has to be at the constant offset\n",
- regno);
+ "R%d doesn't have constant offset. %s has to be at the constant offset\n",
+ regno, head_type_name);
return -EINVAL;
}
rec = reg_btf_record(reg);
- list_head_off = reg->off + reg->var_off.value;
- field = btf_record_find(rec, list_head_off, BPF_LIST_HEAD);
+ head_off = reg->off + reg->var_off.value;
+ field = btf_record_find(rec, head_off, head_field_type);
if (!field) {
- verbose(env, "bpf_list_head not found at offset=%u\n", list_head_off);
+ verbose(env, "%s not found at offset=%u\n", head_type_name, head_off);
return -EINVAL;
}
/* All functions require bpf_list_head to be protected using a bpf_spin_lock */
if (check_reg_allocation_locked(env, reg)) {
- verbose(env, "bpf_spin_lock at off=%d must be held for bpf_list_head\n",
- rec->spin_lock_off);
+ verbose(env, "bpf_spin_lock at off=%d must be held for %s\n",
+ rec->spin_lock_off, head_type_name);
return -EINVAL;
}
- if (meta->arg_list_head.field) {
- verbose(env, "verifier internal error: repeating bpf_list_head arg\n");
+ if (*head_field) {
+ verbose(env, "verifier internal error: repeating %s arg\n", head_type_name);
return -EFAULT;
}
- meta->arg_list_head.field = field;
+ *head_field = field;
return 0;
}
-static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env,
+
+static int process_kf_arg_ptr_to_list_head(struct bpf_verifier_env *env,
struct bpf_reg_state *reg, u32 regno,
struct bpf_kfunc_call_arg_meta *meta)
{
+ return __process_kf_arg_ptr_to_datastructure_head(env, reg, regno, meta, BPF_LIST_HEAD,
+ &meta->arg_list_head.field);
+}
+
+static int process_kf_arg_ptr_to_rbtree_root(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg, u32 regno,
+ struct bpf_kfunc_call_arg_meta *meta)
+{
+ return __process_kf_arg_ptr_to_datastructure_head(env, reg, regno, meta, BPF_RB_ROOT,
+ &meta->arg_rbtree_root.field);
+}
+
+static int
+__process_kf_arg_ptr_to_datastructure_node(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg, u32 regno,
+ struct bpf_kfunc_call_arg_meta *meta,
+ enum btf_field_type head_field_type,
+ enum btf_field_type node_field_type,
+ struct btf_field **node_field)
+{
+ const char *node_type_name;
const struct btf_type *et, *t;
struct btf_field *field;
struct btf_record *rec;
- u32 list_node_off;
+ u32 node_off;
- if (meta->btf != btf_vmlinux ||
- (meta->func_id != special_kfunc_list[KF_bpf_list_push_front] &&
- meta->func_id != special_kfunc_list[KF_bpf_list_push_back])) {
- verbose(env, "verifier internal error: bpf_list_node argument for unknown kfunc\n");
+ if (meta->btf != btf_vmlinux) {
+ verbose(env, "verifier internal error: unexpected btf mismatch in kfunc call\n");
return -EFAULT;
}
+ if (!check_kfunc_is_datastructure_node_api(env, node_field_type, meta->func_id))
+ return -EFAULT;
+
+ node_type_name = btf_field_type_name(node_field_type);
if (!tnum_is_const(reg->var_off)) {
verbose(env,
- "R%d doesn't have constant offset. bpf_list_node has to be at the constant offset\n",
- regno);
+ "R%d doesn't have constant offset. %s has to be at the constant offset\n",
+ regno, node_type_name);
return -EINVAL;
}
rec = reg_btf_record(reg);
- list_node_off = reg->off + reg->var_off.value;
- field = btf_record_find(rec, list_node_off, BPF_LIST_NODE);
- if (!field || field->offset != list_node_off) {
- verbose(env, "bpf_list_node not found at offset=%u\n", list_node_off);
+ node_off = reg->off + reg->var_off.value;
+ field = btf_record_find(rec, node_off, node_field_type);
+ if (!field || field->offset != node_off) {
+ verbose(env, "%s not found at offset=%u\n", node_type_name, node_off);
return -EINVAL;
}
- field = meta->arg_list_head.field;
+ field = *node_field;
et = btf_type_by_id(field->datastructure_head.btf, field->datastructure_head.value_btf_id);
t = btf_type_by_id(reg->btf, reg->btf_id);
if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, 0, field->datastructure_head.btf,
field->datastructure_head.value_btf_id, true)) {
- verbose(env, "operation on bpf_list_head expects arg#1 bpf_list_node at offset=%d "
+ verbose(env, "operation on %s expects arg#1 %s at offset=%d "
"in struct %s, but arg is at offset=%d in struct %s\n",
+ btf_field_type_name(head_field_type),
+ btf_field_type_name(node_field_type),
field->datastructure_head.node_offset,
btf_name_by_offset(field->datastructure_head.btf, et->name_off),
- list_node_off, btf_name_by_offset(reg->btf, t->name_off));
+ node_off, btf_name_by_offset(reg->btf, t->name_off));
return -EINVAL;
}
- if (list_node_off != field->datastructure_head.node_offset) {
- verbose(env, "arg#1 offset=%d, but expected bpf_list_node at offset=%d in struct %s\n",
- list_node_off, field->datastructure_head.node_offset,
+ if (node_off != field->datastructure_head.node_offset) {
+ verbose(env, "arg#1 offset=%d, but expected %s at offset=%d in struct %s\n",
+ node_off, btf_field_type_name(node_field_type),
+ field->datastructure_head.node_offset,
btf_name_by_offset(field->datastructure_head.btf, et->name_off));
return -EINVAL;
}
@@ -8648,6 +8770,24 @@ static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env,
return ref_set_release_on_unlock(env, reg->ref_obj_id);
}
+static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg, u32 regno,
+ struct bpf_kfunc_call_arg_meta *meta)
+{
+ return __process_kf_arg_ptr_to_datastructure_node(env, reg, regno, meta,
+ BPF_LIST_HEAD, BPF_LIST_NODE,
+ &meta->arg_list_head.field);
+}
+
+static int process_kf_arg_ptr_to_rbtree_node(struct bpf_verifier_env *env,
+ struct bpf_reg_state *reg, u32 regno,
+ struct bpf_kfunc_call_arg_meta *meta)
+{
+ return __process_kf_arg_ptr_to_datastructure_node(env, reg, regno, meta,
+ BPF_RB_ROOT, BPF_RB_NODE,
+ &meta->arg_rbtree_root.field);
+}
+
static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_arg_meta *meta)
{
const char *func_name = meta->func_name, *ref_tname;
@@ -8776,6 +8916,8 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
case KF_ARG_PTR_TO_DYNPTR:
case KF_ARG_PTR_TO_LIST_HEAD:
case KF_ARG_PTR_TO_LIST_NODE:
+ case KF_ARG_PTR_TO_RB_ROOT:
+ case KF_ARG_PTR_TO_RB_NODE:
case KF_ARG_PTR_TO_MEM:
case KF_ARG_PTR_TO_MEM_SIZE:
/* Trusted by default */
@@ -8861,6 +9003,20 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
if (ret < 0)
return ret;
break;
+ case KF_ARG_PTR_TO_RB_ROOT:
+ if (reg->type != PTR_TO_MAP_VALUE &&
+ reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
+ verbose(env, "arg#%d expected pointer to map value or allocated object\n", i);
+ return -EINVAL;
+ }
+ if (reg->type == (PTR_TO_BTF_ID | MEM_ALLOC) && !reg->ref_obj_id) {
+ verbose(env, "allocated object must be referenced\n");
+ return -EINVAL;
+ }
+ ret = process_kf_arg_ptr_to_rbtree_root(env, reg, regno, meta);
+ if (ret < 0)
+ return ret;
+ break;
case KF_ARG_PTR_TO_LIST_NODE:
if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
verbose(env, "arg#%d expected pointer to allocated object\n", i);
@@ -8874,6 +9030,19 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
if (ret < 0)
return ret;
break;
+ case KF_ARG_PTR_TO_RB_NODE:
+ if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
+ verbose(env, "arg#%d expected pointer to allocated object\n", i);
+ return -EINVAL;
+ }
+ if (!reg->ref_obj_id) {
+ verbose(env, "allocated object must be referenced\n");
+ return -EINVAL;
+ }
+ ret = process_kf_arg_ptr_to_rbtree_node(env, reg, regno, meta);
+ if (ret < 0)
+ return ret;
+ break;
case KF_ARG_PTR_TO_BTF_ID:
/* Only base_type is checked, further checks are done here */
if ((base_type(reg->type) != PTR_TO_BTF_ID ||
@@ -13818,7 +13987,7 @@ static int do_check(struct bpf_verifier_env *env)
if ((insn->src_reg == BPF_REG_0 && insn->imm != BPF_FUNC_spin_unlock) ||
(insn->src_reg == BPF_PSEUDO_CALL) ||
(insn->src_reg == BPF_PSEUDO_KFUNC_CALL &&
- (insn->off != 0 || !is_bpf_list_api_kfunc(insn->imm)))) {
+ (insn->off != 0 || !is_bpf_datastructure_api_kfunc(insn->imm)))) {
verbose(env, "function calls are not allowed while holding a lock\n");
return -EINVAL;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH bpf-next 08/13] bpf: Add callback validation to kfunc verifier logic
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (6 preceding siblings ...)
2022-12-06 23:09 ` [PATCH bpf-next 07/13] bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-07 2:01 ` Alexei Starovoitov
2022-12-06 23:09 ` [PATCH bpf-next 09/13] bpf: Special verifier handling for bpf_rbtree_{remove, first} Dave Marchevsky
` (6 subsequent siblings)
14 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
Some BPF helpers take a callback function which the helper calls. For
each helper that takes such a callback, there's a special call to
__check_func_call with a callback-state-setting callback that sets up
verifier bpf_func_state for the callback's frame.
kfuncs don't have any of this infrastructure yet, so let's add it in
this patch, following the existing helper pattern as much as possible. To
validate functionality of this added plumbing, this patch adds
callback handling for the bpf_rbtree_add kfunc and hopes to lay
groundwork for future next-gen datastructure callbacks.
In the "general plumbing" category we have:
* check_kfunc_call doing callback verification right before clearing
CALLER_SAVED_REGS, exactly like check_helper_call
* recognition of func_ptr BTF types in kfunc args as
KF_ARG_PTR_TO_CALLBACK + propagation of subprogno for this arg type
In the "rbtree_add / next-gen datastructure-specific plumbing" category:
* Since bpf_rbtree_add must be called while the spin_lock associated
with the tree is held, don't complain when callback's func_state
doesn't unlock it by frame exit
* Mark rbtree_add callback's args PTR_UNTRUSTED to prevent rbtree
api functions from being called in the callback
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
kernel/bpf/verifier.c | 136 ++++++++++++++++++++++++++++++++++++++++--
1 file changed, 130 insertions(+), 6 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 652112007b2c..9ad8c0b264dc 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1448,6 +1448,16 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
reg->type &= ~PTR_MAYBE_NULL;
}
+static void mark_reg_datastructure_node(struct bpf_reg_state *regs, u32 regno,
+ struct btf_field_datastructure_head *ds_head)
+{
+ __mark_reg_known_zero(&regs[regno]);
+ regs[regno].type = PTR_TO_BTF_ID | MEM_ALLOC;
+ regs[regno].btf = ds_head->btf;
+ regs[regno].btf_id = ds_head->value_btf_id;
+ regs[regno].off = ds_head->node_offset;
+}
+
static bool reg_is_pkt_pointer(const struct bpf_reg_state *reg)
{
return type_is_pkt_pointer(reg->type);
@@ -4771,7 +4781,8 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
return -EACCES;
}
- if (type_is_alloc(reg->type) && !reg->ref_obj_id) {
+ if (type_is_alloc(reg->type) && !reg->ref_obj_id &&
+ !cur_func(env)->in_callback_fn) {
verbose(env, "verifier internal error: ref_obj_id for allocated object must be non-zero\n");
return -EFAULT;
}
@@ -6952,6 +6963,8 @@ static int set_callee_state(struct bpf_verifier_env *env,
struct bpf_func_state *caller,
struct bpf_func_state *callee, int insn_idx);
+static bool is_callback_calling_kfunc(u32 btf_id);
+
static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
int *insn_idx, int subprog,
set_callee_state_fn set_callee_state_cb)
@@ -7006,10 +7019,18 @@ static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn
* interested in validating only BPF helpers that can call subprogs as
* callbacks
*/
- if (set_callee_state_cb != set_callee_state && !is_callback_calling_function(insn->imm)) {
- verbose(env, "verifier bug: helper %s#%d is not marked as callback-calling\n",
- func_id_name(insn->imm), insn->imm);
- return -EFAULT;
+ if (set_callee_state_cb != set_callee_state) {
+ if (bpf_pseudo_kfunc_call(insn) &&
+ !is_callback_calling_kfunc(insn->imm)) {
+ verbose(env, "verifier bug: kfunc %s#%d not marked as callback-calling\n",
+ func_id_name(insn->imm), insn->imm);
+ return -EFAULT;
+ } else if (!bpf_pseudo_kfunc_call(insn) &&
+ !is_callback_calling_function(insn->imm)) { /* helper */
+ verbose(env, "verifier bug: helper %s#%d not marked as callback-calling\n",
+ func_id_name(insn->imm), insn->imm);
+ return -EFAULT;
+ }
}
if (insn->code == (BPF_JMP | BPF_CALL) &&
@@ -7275,6 +7296,67 @@ static int set_user_ringbuf_callback_state(struct bpf_verifier_env *env,
return 0;
}
+static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
+ struct bpf_func_state *caller,
+ struct bpf_func_state *callee,
+ int insn_idx)
+{
+ /* void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
+ * bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b));
+ *
+ * 'struct bpf_rb_node *node' arg to bpf_rbtree_add is the same PTR_TO_BTF_ID w/ offset
+ * that 'less' callback args will be receiving. However, 'node' arg was release_reference'd
+ * by this point, so look at 'root'
+ */
+ struct btf_field *field;
+ struct btf_record *rec;
+
+ rec = reg_btf_record(&caller->regs[BPF_REG_1]);
+ if (!rec)
+ return -EFAULT;
+
+ field = btf_record_find(rec, caller->regs[BPF_REG_1].off, BPF_RB_ROOT);
+ if (!field || !field->datastructure_head.value_btf_id)
+ return -EFAULT;
+
+ mark_reg_datastructure_node(callee->regs, BPF_REG_1, &field->datastructure_head);
+ callee->regs[BPF_REG_1].type |= PTR_UNTRUSTED;
+ mark_reg_datastructure_node(callee->regs, BPF_REG_2, &field->datastructure_head);
+ callee->regs[BPF_REG_2].type |= PTR_UNTRUSTED;
+
+ __mark_reg_not_init(env, &callee->regs[BPF_REG_3]);
+ __mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+ __mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+ callee->in_callback_fn = true;
+ callee->callback_ret_range = tnum_range(0, 1);
+ return 0;
+}
+
+static bool is_rbtree_lock_required_kfunc(u32 btf_id);
+
+/* Are we currently verifying the callback for a rbtree helper that must
+ * be called with lock held? If so, no need to complain about unreleased
+ * lock
+ */
+static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env)
+{
+ struct bpf_verifier_state *state = env->cur_state;
+ struct bpf_insn *insn = env->prog->insnsi;
+ struct bpf_func_state *callee;
+ int kfunc_btf_id;
+
+ if (!state->curframe)
+ return false;
+
+ callee = state->frame[state->curframe];
+
+ if (!callee->in_callback_fn)
+ return false;
+
+ kfunc_btf_id = insn[callee->callsite].imm;
+ return is_rbtree_lock_required_kfunc(kfunc_btf_id);
+}
+
static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
{
struct bpf_verifier_state *state = env->cur_state;
@@ -8007,6 +8089,7 @@ struct bpf_kfunc_call_arg_meta {
bool r0_rdonly;
u32 ret_btf_id;
u64 r0_size;
+ u32 subprogno;
struct {
u64 value;
bool found;
@@ -8185,6 +8268,18 @@ static bool is_kfunc_arg_rbtree_node(const struct btf *btf, const struct btf_par
return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_NODE_ID);
}
+static bool is_kfunc_arg_callback(struct bpf_verifier_env *env, const struct btf *btf,
+ const struct btf_param *arg)
+{
+ const struct btf_type *t;
+
+ t = btf_type_resolve_func_ptr(btf, arg->type, NULL);
+ if (!t)
+ return false;
+
+ return true;
+}
+
/* Returns true if struct is composed of scalars, 4 levels of nesting allowed */
static bool __btf_type_is_scalar_struct(struct bpf_verifier_env *env,
const struct btf *btf,
@@ -8244,6 +8339,7 @@ enum kfunc_ptr_arg_type {
KF_ARG_PTR_TO_BTF_ID, /* Also covers reg2btf_ids conversions */
KF_ARG_PTR_TO_MEM,
KF_ARG_PTR_TO_MEM_SIZE, /* Size derived from next argument, skip it */
+ KF_ARG_PTR_TO_CALLBACK,
KF_ARG_PTR_TO_RB_ROOT,
KF_ARG_PTR_TO_RB_NODE,
};
@@ -8368,6 +8464,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
return KF_ARG_PTR_TO_BTF_ID;
}
+ if (is_kfunc_arg_callback(env, meta->btf, &args[argno]))
+ return KF_ARG_PTR_TO_CALLBACK;
+
if (argno + 1 < nargs && is_kfunc_arg_mem_size(meta->btf, &args[argno + 1], &regs[regno + 1]))
arg_mem_size = true;
@@ -8585,6 +8684,16 @@ static bool is_bpf_datastructure_api_kfunc(u32 btf_id)
return is_bpf_list_api_kfunc(btf_id) || is_bpf_rbtree_api_kfunc(btf_id);
}
+static bool is_callback_calling_kfunc(u32 btf_id)
+{
+ return btf_id == special_kfunc_list[KF_bpf_rbtree_add];
+}
+
+static bool is_rbtree_lock_required_kfunc(u32 btf_id)
+{
+ return is_bpf_rbtree_api_kfunc(btf_id);
+}
+
static bool check_kfunc_is_datastructure_head_api(struct bpf_verifier_env *env,
enum btf_field_type head_field_type,
u32 kfunc_btf_id)
@@ -8920,6 +9029,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
case KF_ARG_PTR_TO_RB_NODE:
case KF_ARG_PTR_TO_MEM:
case KF_ARG_PTR_TO_MEM_SIZE:
+ case KF_ARG_PTR_TO_CALLBACK:
/* Trusted by default */
break;
default:
@@ -9078,6 +9188,9 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
/* Skip next '__sz' argument */
i++;
break;
+ case KF_ARG_PTR_TO_CALLBACK:
+ meta->subprogno = reg->subprogno;
+ break;
}
}
@@ -9193,6 +9306,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
}
}
+ if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_add]) {
+ err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+ set_rbtree_add_callback_state);
+ if (err) {
+ verbose(env, "kfunc %s#%d failed callback verification\n",
+ func_name, func_id);
+ return err;
+ }
+ }
+
for (i = 0; i < CALLER_SAVED_REGS; i++)
mark_reg_not_init(env, regs, caller_saved[i]);
@@ -14023,7 +14146,8 @@ static int do_check(struct bpf_verifier_env *env)
return -EINVAL;
}
- if (env->cur_state->active_lock.ptr) {
+ if (env->cur_state->active_lock.ptr &&
+ !in_rbtree_lock_required_cb(env)) {
verbose(env, "bpf_spin_unlock is missing\n");
return -EINVAL;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH bpf-next 09/13] bpf: Special verifier handling for bpf_rbtree_{remove, first}
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (7 preceding siblings ...)
2022-12-06 23:09 ` [PATCH bpf-next 08/13] bpf: Add callback validation to kfunc verifier logic Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-07 2:18 ` Alexei Starovoitov
2022-12-06 23:09 ` [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0 Dave Marchevsky
` (5 subsequent siblings)
14 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
Newly-added bpf_rbtree_{remove,first} kfuncs have some special properties
that require handling in the verifier:
* both bpf_rbtree_remove and bpf_rbtree_first return the type containing
the bpf_rb_node field, with the offset set to that field's offset,
instead of a struct bpf_rb_node *
* Generalized existing next-gen list verifier handling for this
as mark_reg_datastructure_node helper
* Unlike other functions, which set release_on_unlock on one of their
args, bpf_rbtree_first takes no node argument, instead setting
release_on_unlock on its return value
* bpf_rbtree_remove's node input is a node that's been inserted
in the tree. Only non-owning references (PTR_UNTRUSTED +
release_on_unlock) refer to such nodes, but kfuncs don't take
PTR_UNTRUSTED args
* Added special carveout for bpf_rbtree_remove to take PTR_UNTRUSTED
* Since node input already has release_on_unlock set, don't set
it again
This patch, along with the previous one, complete special verifier
handling for all rbtree API functions added in this series.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
kernel/bpf/verifier.c | 89 +++++++++++++++++++++++++++++++++++--------
1 file changed, 73 insertions(+), 16 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9ad8c0b264dc..29983e2c27df 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6122,6 +6122,23 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
return 0;
}
+static bool
+func_arg_reg_rb_node_offset(const struct bpf_reg_state *reg, s32 off)
+{
+ struct btf_record *rec;
+ struct btf_field *field;
+
+ rec = reg_btf_record(reg);
+ if (!rec)
+ return false;
+
+ field = btf_record_find(rec, off, BPF_RB_NODE);
+ if (!field)
+ return false;
+
+ return true;
+}
+
int check_func_arg_reg_off(struct bpf_verifier_env *env,
const struct bpf_reg_state *reg, int regno,
enum bpf_arg_type arg_type)
@@ -6176,6 +6193,13 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
*/
fixed_off_ok = true;
break;
+ case PTR_TO_BTF_ID | MEM_ALLOC | PTR_UNTRUSTED:
+ /* Currently only bpf_rbtree_remove accepts a PTR_UNTRUSTED
+ * bpf_rb_node. Fixed off of the node type is OK
+ */
+ if (reg->off && func_arg_reg_rb_node_offset(reg, reg->off))
+ fixed_off_ok = true;
+ break;
default:
break;
}
@@ -8875,26 +8899,44 @@ __process_kf_arg_ptr_to_datastructure_node(struct bpf_verifier_env *env,
btf_name_by_offset(field->datastructure_head.btf, et->name_off));
return -EINVAL;
}
- /* Set arg#1 for expiration after unlock */
- return ref_set_release_on_unlock(env, reg->ref_obj_id);
+
+ return 0;
}
static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env,
struct bpf_reg_state *reg, u32 regno,
struct bpf_kfunc_call_arg_meta *meta)
{
- return __process_kf_arg_ptr_to_datastructure_node(env, reg, regno, meta,
- BPF_LIST_HEAD, BPF_LIST_NODE,
- &meta->arg_list_head.field);
+ int err;
+
+ err = __process_kf_arg_ptr_to_datastructure_node(env, reg, regno, meta,
+ BPF_LIST_HEAD, BPF_LIST_NODE,
+ &meta->arg_list_head.field);
+ if (err)
+ return err;
+
+ return ref_set_release_on_unlock(env, reg->ref_obj_id);
}
static int process_kf_arg_ptr_to_rbtree_node(struct bpf_verifier_env *env,
struct bpf_reg_state *reg, u32 regno,
struct bpf_kfunc_call_arg_meta *meta)
{
- return __process_kf_arg_ptr_to_datastructure_node(env, reg, regno, meta,
- BPF_RB_ROOT, BPF_RB_NODE,
- &meta->arg_rbtree_root.field);
+ int err;
+
+ err = __process_kf_arg_ptr_to_datastructure_node(env, reg, regno, meta,
+ BPF_RB_ROOT, BPF_RB_NODE,
+ &meta->arg_rbtree_root.field);
+ if (err)
+ return err;
+
+ /* bpf_rbtree_remove's node parameter is a non-owning reference to
+ * a bpf_rb_node, so release_on_unlock is already set
+ */
+ if (meta->func_id == special_kfunc_list[KF_bpf_rbtree_remove])
+ return 0;
+
+ return ref_set_release_on_unlock(env, reg->ref_obj_id);
}
static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_arg_meta *meta)
@@ -8902,7 +8944,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
const char *func_name = meta->func_name, *ref_tname;
const struct btf *btf = meta->btf;
const struct btf_param *args;
- u32 i, nargs;
+ u32 i, nargs, check_type;
int ret;
args = (const struct btf_param *)(meta->func_proto + 1);
@@ -9141,7 +9183,13 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
return ret;
break;
case KF_ARG_PTR_TO_RB_NODE:
- if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
+ if (meta->btf == btf_vmlinux &&
+ meta->func_id == special_kfunc_list[KF_bpf_rbtree_remove])
+ check_type = (PTR_TO_BTF_ID | MEM_ALLOC | PTR_UNTRUSTED);
+ else
+ check_type = (PTR_TO_BTF_ID | MEM_ALLOC);
+
+ if (reg->type != check_type) {
verbose(env, "arg#%d expected pointer to allocated object\n", i);
return -EINVAL;
}
@@ -9380,11 +9428,14 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
meta.func_id == special_kfunc_list[KF_bpf_list_pop_back]) {
struct btf_field *field = meta.arg_list_head.field;
- mark_reg_known_zero(env, regs, BPF_REG_0);
- regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
- regs[BPF_REG_0].btf = field->datastructure_head.btf;
- regs[BPF_REG_0].btf_id = field->datastructure_head.value_btf_id;
- regs[BPF_REG_0].off = field->datastructure_head.node_offset;
+ mark_reg_datastructure_node(regs, BPF_REG_0,
+ &field->datastructure_head);
+ } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
+ meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
+ struct btf_field *field = meta.arg_rbtree_root.field;
+
+ mark_reg_datastructure_node(regs, BPF_REG_0,
+ &field->datastructure_head);
} else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) {
mark_reg_known_zero(env, regs, BPF_REG_0);
regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
@@ -9450,6 +9501,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
if (is_kfunc_ret_null(&meta))
regs[BPF_REG_0].id = id;
regs[BPF_REG_0].ref_obj_id = id;
+
+ if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_first])
+ ref_set_release_on_unlock(env, regs[BPF_REG_0].ref_obj_id);
}
if (reg_may_point_to_spin_lock(&regs[BPF_REG_0]) && !regs[BPF_REG_0].id)
regs[BPF_REG_0].id = ++env->id_gen;
@@ -11636,8 +11690,11 @@ static void mark_ptr_or_null_reg(struct bpf_func_state *state,
*/
if (WARN_ON_ONCE(reg->smin_value || reg->smax_value || !tnum_equals_const(reg->var_off, 0)))
return;
- if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC | PTR_MAYBE_NULL) && WARN_ON_ONCE(reg->off))
+ if (reg->type != (PTR_TO_BTF_ID | MEM_ALLOC | PTR_MAYBE_NULL) &&
+ reg->type != (PTR_TO_BTF_ID | MEM_ALLOC | PTR_MAYBE_NULL | PTR_UNTRUSTED) &&
+ WARN_ON_ONCE(reg->off)) {
return;
+ }
if (is_null) {
reg->type = SCALAR_VALUE;
/* We don't need id and ref_obj_id from this point
--
2.30.2
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (8 preceding siblings ...)
2022-12-06 23:09 ` [PATCH bpf-next 09/13] bpf: Special verifier handling for bpf_rbtree_{remove, first} Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-07 2:39 ` Alexei Starovoitov
2022-12-06 23:09 ` [PATCH bpf-next 11/13] bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h Dave Marchevsky
` (4 subsequent siblings)
14 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
insn->off < 0, but this appears to not be true irrespective of changes
in this series. Regardless, changes in this series will result in an
example like:
struct example_node {
long key;
long val;
struct bpf_rb_node node;
};
/* In BPF prog, assume root contains example_node nodes */
struct bpf_rb_node *res = bpf_rbtree_first(&root);
if (!res)
return 1;
struct example_node *n = container_of(res, struct example_node, node);
long key = n->key;
Resulting in a load with off = -16, as bpf_rbtree_first's return is
modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
bpf_rb_node. So it's necessary to support negative insn->off when
jitting BPF_PROBE_MEM.
In order to ensure that page fault for a BPF_PROBE_MEM load of *src_reg +
insn->off is safely handled, we must confirm that *src_reg + insn->off is
in kernel's memory. Two runtime checks are emitted to confirm that:
1) (*src_reg + insn->off) > boundary between user and kernel address
spaces
2) (*src_reg + insn->off) does not overflow to a small positive
number. This might happen if some function meant to set src_reg
returns ERR_PTR(-EINVAL) or similar.
Check 1 is currently slightly off - it compares a
u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
to *src_reg, aborting the load if limit is larger. Rewriting this as an
inequality:
*src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
*src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE
shows that this isn't quite right even if insn->off is positive, as we
really want:
*src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
*src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn->off
since *src_reg + insn->off is the address we'll be loading from, not
*src_reg - insn->off or *src_reg - abs(insn->off). So change the
subtraction to an addition and remove the abs(), as the comment indicates
it was only added to ignore negative insn->off.
For Check 2, currently "does not overflow to a small positive number" is
confirmed by emitting an 'add insn->off, src_reg' instruction and
checking for carry flag. While this works fine for a positive insn->off,
a small negative insn->off like -16 is almost guaranteed to wrap over to
a small positive number when added to any kernel address.
This patch addresses this by not doing Check 2 at BPF prog runtime when
insn->off is negative, rather doing a stronger check at JIT-time. The
logic supporting this is as follows:
1) Assume insn->off is negative, call the largest such negative offset
MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
insn->off.
2) *src_reg + insn->off will not wrap over to an unexpected address by
virtue of negative insn->off, but it might wrap under if
-insn->off > *src_reg, as that implies *src_reg + insn->off < 0
3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
must be true since insn->off is negative.
4) If we've completed check 1, we know that
src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
5) Combining statements 3 and 4, we know src_reg > (TASK_SIZE_MAX + PAGE_SIZE)
6) By statements 1, 4, and 5, if we can prove
(TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
(TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
values. We can rewrite this as (TASK_SIZE_MAX + PAGE_SIZE) +
MAX_NEGATIVE_OFF > 0.
Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
negative, if the previous inequality is true,
src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
negative insn->off vals, src_reg + insn->off > 0 and there can be no
wrapping under.
So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies
*src_reg + insn->off > 0 for any src_reg that's passed check 1 and any
negative insn->off. Luckily the former inequality does not need to be
checked at runtime, and in fact could be a static_assert if
TASK_SIZE_MAX wasn't determined by a function when CONFIG_X86_5LEVEL
kconfig is used.
Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) +
MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime
check. Given that insn->off is a s16 and is unlikely to grow larger,
this check should always succeed on any x86 processor made in the 21st
century. If it doesn't, fail all do_jit calls and complain loudly, on
the assumption that the BPF subsystem is misconfigured or has a bug.
A few instructions are saved for negative insn->offs as a result. Using
the struct example_node / off = -16 example from before, code looks
like:
BEFORE CHANGE
72: movabs $0x800000000010,%r11
7c: cmp %r11,%rdi
7f: jb 0x000000000000008d (check 1 on 7c and here)
81: mov %rdi,%r11
84: add $0xfffffffffffffff0,%r11 (check 2, will set carry for almost any r11, so bug for
8b: jae 0x0000000000000091 negative insn->off)
8d: xor %edi,%edi (as a result long key = n->key; will be 0'd out here)
8f: jmp 0x0000000000000095
91: mov -0x10(%rdi),%rdi
95:
AFTER CHANGE:
5a: movabs $0x800000000010,%r11
64: cmp %r11,%rdi
67: jae 0x000000000000006d (check 1 on 64 and here, but now JNC instead of JC)
69: xor %edi,%edi (no check 2, 0 out if %rdi - %r11 < 0)
6b: jmp 0x0000000000000071
6d: mov -0x10(%rdi),%rdi
71:
We could do the same for insn->off == 0, but for now keep code
generation unchanged for previously working nonnegative insn->offs.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
arch/x86/net/bpf_jit_comp.c | 123 +++++++++++++++++++++++++++---------
1 file changed, 92 insertions(+), 31 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 36ffe67ad6e5..843f619d0d35 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -11,6 +11,7 @@
#include <linux/bpf.h>
#include <linux/memory.h>
#include <linux/sort.h>
+#include <linux/limits.h>
#include <asm/extable.h>
#include <asm/set_memory.h>
#include <asm/nospec-branch.h>
@@ -94,6 +95,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
*/
#define X86_JB 0x72
#define X86_JAE 0x73
+#define X86_JNC 0x73
#define X86_JE 0x74
#define X86_JNE 0x75
#define X86_JBE 0x76
@@ -950,6 +952,36 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
*pprog = prog;
}
+/* Check that condition necessary for PROBE_MEM handling for insn->off < 0
+ * holds.
+ *
+ * This could be a static_assert((TASK_SIZE_MAX + PAGE_SIZE) > -S16_MIN),
+ * but TASK_SIZE_MAX can't always be evaluated at compile time, so let's not
+ * assume insn->off size either
+ */
+static int check_probe_mem_task_size_overflow(void)
+{
+ struct bpf_insn insn;
+ s64 max_negative;
+
+ switch (sizeof(insn.off)) {
+ case 2:
+ max_negative = S16_MIN;
+ break;
+ default:
+ pr_err("bpf_jit_error: unexpected bpf_insn->off size\n");
+ return -EFAULT;
+ }
+
+ if (!((TASK_SIZE_MAX + PAGE_SIZE) > -max_negative)) {
+ pr_err("bpf jit error: assumption does not hold:\n");
+ pr_err("\t(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) > 0\n");
+ return -EFAULT;
+ }
+
+ return 0;
+}
+
#define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
@@ -967,6 +999,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
u8 *prog = temp;
int err;
+ err = check_probe_mem_task_size_overflow();
+ if (err)
+ return err;
+
detect_reg_usage(insn, insn_cnt, callee_regs_used,
&tail_call_seen);
@@ -1359,20 +1395,30 @@ st: if (is_imm8(insn->off))
case BPF_LDX | BPF_MEM | BPF_DW:
case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
- /* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
- * add abs(insn->off) to the limit to make sure that negative
- * offset won't be an issue.
- * insn->off is s16, so it won't affect valid pointers.
- */
- u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
- u8 *end_of_jmp1, *end_of_jmp2;
-
/* Conservatively check that src_reg + insn->off is a kernel address:
- * 1. src_reg + insn->off >= limit
- * 2. src_reg + insn->off doesn't become small positive.
- * Cannot do src_reg + insn->off >= limit in one branch,
- * since it needs two spare registers, but JIT has only one.
+ * 1. src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
+ * 2. src_reg + insn->off doesn't overflow and become small positive
+ *
+ * For check 1, to save regs, do
+ * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off) call rhs
+ * of inequality 'limit'
+ *
+ * For check 2:
+ * If insn->off is positive, add src_reg + insn->off and check
+ * overflow directly
+ * If insn->off is negative, we know that
+ * (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
+ * and from check 1 we know
+ * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
+ * So if (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 we can
+ * be sure that src_reg + insn->off won't overflow in either
+ * direction and avoid runtime check entirely.
+ *
+ * check_probe_mem_task_size_overflow confirms the above assumption
+ * at the beginning of this function
*/
+ u64 limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off;
+ u8 *end_of_jmp1, *end_of_jmp2;
/* movabsq r11, limit */
EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
@@ -1381,32 +1427,47 @@ st: if (is_imm8(insn->off))
/* cmp src_reg, r11 */
maybe_emit_mod(&prog, src_reg, AUX_REG, true);
EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
- /* if unsigned '<' goto end_of_jmp2 */
- EMIT2(X86_JB, 0);
- end_of_jmp1 = prog;
-
- /* mov r11, src_reg */
- emit_mov_reg(&prog, true, AUX_REG, src_reg);
- /* add r11, insn->off */
- maybe_emit_1mod(&prog, AUX_REG, true);
- EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
- /* jmp if not carry to start_of_ldx
- * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
- * that has to be rejected.
- */
- EMIT2(0x73 /* JNC */, 0);
- end_of_jmp2 = prog;
+ if (insn->off >= 0) {
+ /* cmp src_reg, r11 */
+ /* if unsigned '<' goto end_of_jmp2 */
+ EMIT2(X86_JB, 0);
+ end_of_jmp1 = prog;
+
+ /* mov r11, src_reg */
+ emit_mov_reg(&prog, true, AUX_REG, src_reg);
+ /* add r11, insn->off */
+ maybe_emit_1mod(&prog, AUX_REG, true);
+ EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
+ /* jmp if not carry to start_of_ldx
+ * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
+ * that has to be rejected.
+ */
+ EMIT2(X86_JNC, 0);
+ end_of_jmp2 = prog;
+ } else {
+ /* cmp src_reg, r11 */
+ /* if unsigned '>=' goto start_of_ldx
+ * w/o needing to do check 2
+ */
+ EMIT2(X86_JAE, 0);
+ end_of_jmp1 = prog;
+ }
/* xor dst_reg, dst_reg */
emit_mov_imm32(&prog, false, dst_reg, 0);
/* jmp byte_after_ldx */
EMIT2(0xEB, 0);
- /* populate jmp_offset for JB above to jump to xor dst_reg */
- end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
- /* populate jmp_offset for JNC above to jump to start_of_ldx */
start_of_ldx = prog;
- end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
+ if (insn->off >= 0) {
+ /* populate jmp_offset for JB above to jump to xor dst_reg */
+ end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
+ /* populate jmp_offset for JNC above to jump to start_of_ldx */
+ end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
+ } else {
+ /* populate jmp_offset for JAE above to jump to start_of_ldx */
+ end_of_jmp1[-1] = start_of_ldx - end_of_jmp1;
+ }
}
emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
--
2.30.2
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH bpf-next 11/13] bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (9 preceding siblings ...)
2022-12-06 23:09 ` [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0 Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 12/13] libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj type Dave Marchevsky
` (3 subsequent siblings)
14 siblings, 0 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
.../testing/selftests/bpf/bpf_experimental.h | 24 +++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
index 424f7bbbfe9b..dbd2c729781a 100644
--- a/tools/testing/selftests/bpf/bpf_experimental.h
+++ b/tools/testing/selftests/bpf/bpf_experimental.h
@@ -65,4 +65,28 @@ extern struct bpf_list_node *bpf_list_pop_front(struct bpf_list_head *head) __ks
*/
extern struct bpf_list_node *bpf_list_pop_back(struct bpf_list_head *head) __ksym;
+/* Description
+ * Remove 'node' from rbtree with root 'root'
+ * Returns
+ * Pointer to the removed node, or NULL if 'root' didn't contain 'node'
+ */
+extern struct bpf_rb_node *bpf_rbtree_remove(struct bpf_rb_root *root,
+ struct bpf_rb_node *node) __ksym;
+
+/* Description
+ * Add 'node' to rbtree with root 'root' using comparator 'less'
+ * Returns
+ * Nothing
+ */
+extern void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
+ bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b)) __ksym;
+
+/* Description
+ * Return the first (leftmost) node in input tree
+ * Returns
+ * Pointer to the node, which is _not_ removed from the tree. If the tree
+ * contains no nodes, returns NULL.
+ */
+extern struct bpf_rb_node *bpf_rbtree_first(struct bpf_rb_root *root) __ksym;
+
#endif
--
2.30.2
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH bpf-next 12/13] libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj type
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (10 preceding siblings ...)
2022-12-06 23:09 ` [PATCH bpf-next 11/13] bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h Dave Marchevsky
@ 2022-12-06 23:09 ` Dave Marchevsky
2022-12-06 23:10 ` [PATCH bpf-next 13/13] selftests/bpf: Add rbtree selftests Dave Marchevsky
` (2 subsequent siblings)
14 siblings, 0 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:09 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
If a BPF program defines a struct or union type which has a field type
that the verifier considers special - spin_lock, next-gen datastructure
heads and nodes - the verifier needs to be able to find fields of that
type using BTF.
For such a program, BTF is required, so modify the kernel_needs_btf helper
to ensure that the correct "BTF is mandatory" error message is emitted.
The newly-added btf_has_alloc_obj_type looks for BTF_KIND_STRUCTs with a
name corresponding to a special type. If any such struct is found, it is
assumed that some variable is using it, and therefore that a successful
BTF load is necessary.
Also add a kernel_needs_btf check to bpf_object__create_map, where it was
previously missing. When this function calls bpf_map_create, the kernel
may reject map creation due to mismatched datastructure owner and ownee
types (e.g. a struct bpf_list_head with a __contains tag pointing to a
bpf_rb_node field). In such a scenario - or any other where BTF is
necessary for verification - bpf_map_create should not be retried
without BTF.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
tools/lib/bpf/libbpf.c | 50 ++++++++++++++++++++++++++++++++----------
1 file changed, 39 insertions(+), 11 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2a82f49ce16f..56a905b502c9 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -998,6 +998,31 @@ find_struct_ops_kern_types(const struct btf *btf, const char *tname,
return 0;
}
+/* Should match alloc_obj_fields in kernel/bpf/btf.c
+ */
+static const char *alloc_obj_fields[] = {
+ "bpf_spin_lock",
+ "bpf_list_head",
+ "bpf_list_node",
+ "bpf_rb_root",
+ "bpf_rb_node",
+};
+
+static bool
+btf_has_alloc_obj_type(const struct btf *btf)
+{
+ const char *tname;
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(alloc_obj_fields); i++) {
+ tname = alloc_obj_fields[i];
+ if (btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT) > 0)
+ return true;
+ }
+
+ return false;
+}
+
static bool bpf_map__is_struct_ops(const struct bpf_map *map)
{
return map->def.type == BPF_MAP_TYPE_STRUCT_OPS;
@@ -2794,7 +2819,8 @@ static bool libbpf_needs_btf(const struct bpf_object *obj)
static bool kernel_needs_btf(const struct bpf_object *obj)
{
- return obj->efile.st_ops_shndx >= 0;
+ return obj->efile.st_ops_shndx >= 0 ||
+ (obj->btf && btf_has_alloc_obj_type(obj->btf));
}
static int bpf_object__init_btf(struct bpf_object *obj,
@@ -5103,16 +5129,18 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map, b
err = -errno;
cp = libbpf_strerror_r(err, errmsg, sizeof(errmsg));
- pr_warn("Error in bpf_create_map_xattr(%s):%s(%d). Retrying without BTF.\n",
- map->name, cp, err);
- create_attr.btf_fd = 0;
- create_attr.btf_key_type_id = 0;
- create_attr.btf_value_type_id = 0;
- map->btf_key_type_id = 0;
- map->btf_value_type_id = 0;
- map->fd = bpf_map_create(def->type, map_name,
- def->key_size, def->value_size,
- def->max_entries, &create_attr);
+ pr_warn("Error in bpf_create_map_xattr(%s):%s(%d).\n", map->name, cp, err);
+ if (!kernel_needs_btf(obj)) {
+ pr_warn("Retrying bpf_map_create_xattr(%s) without BTF.\n", map->name);
+ create_attr.btf_fd = 0;
+ create_attr.btf_key_type_id = 0;
+ create_attr.btf_value_type_id = 0;
+ map->btf_key_type_id = 0;
+ map->btf_value_type_id = 0;
+ map->fd = bpf_map_create(def->type, map_name,
+ def->key_size, def->value_size,
+ def->max_entries, &create_attr);
+ }
}
err = map->fd < 0 ? -errno : 0;
--
2.30.2
* [PATCH bpf-next 13/13] selftests/bpf: Add rbtree selftests
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (11 preceding siblings ...)
2022-12-06 23:09 ` [PATCH bpf-next 12/13] libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj type Dave Marchevsky
@ 2022-12-06 23:10 ` Dave Marchevsky
2022-12-07 2:50 ` [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure patchwork-bot+netdevbpf
2022-12-07 19:36 ` Kumar Kartikeya Dwivedi
14 siblings, 0 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-06 23:10 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Kernel Team,
Kumar Kartikeya Dwivedi, Tejun Heo, Dave Marchevsky
This patch adds selftests exercising the logic changed/added in the
previous patches in the series. A variety of successful and unsuccessful
rbtree usages are validated:
Success:
* Add some nodes, let map_value bpf_rbtree_root destructor clean them
up
* Add some nodes, remove one using the release_on_unlock ref leftover
by successful rbtree_add() call
* Add some nodes, remove one using the release_on_unlock ref returned
from rbtree_first() call
Failure:
* BTF where bpf_rb_root owns bpf_list_node should fail to load
* BTF where node of type X is added to tree containing nodes of type Y
should fail to load
* No calling rbtree api functions in 'less' callback for rbtree_add
* No releasing lock in 'less' callback for rbtree_add
* No removing a node which hasn't been added to any tree
* No adding a node which has already been added to a tree
* No escaping of release_on_unlock references past their lock's
critical section
These tests mostly focus on rbtree-specific additions, but some of the
Failure cases revalidate scenarios common to both linked_list and rbtree
which are covered in the former's tests. Better to be a bit redundant in
case linked_list and rbtree semantics deviate over time.
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
---
.../testing/selftests/bpf/prog_tests/rbtree.c | 184 ++++++++++++
tools/testing/selftests/bpf/progs/rbtree.c | 180 ++++++++++++
.../progs/rbtree_btf_fail__add_wrong_type.c | 48 ++++
.../progs/rbtree_btf_fail__wrong_node_type.c | 21 ++
.../testing/selftests/bpf/progs/rbtree_fail.c | 263 ++++++++++++++++++
5 files changed, 696 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/rbtree.c
create mode 100644 tools/testing/selftests/bpf/progs/rbtree.c
create mode 100644 tools/testing/selftests/bpf/progs/rbtree_btf_fail__add_wrong_type.c
create mode 100644 tools/testing/selftests/bpf/progs/rbtree_btf_fail__wrong_node_type.c
create mode 100644 tools/testing/selftests/bpf/progs/rbtree_fail.c
diff --git a/tools/testing/selftests/bpf/prog_tests/rbtree.c b/tools/testing/selftests/bpf/prog_tests/rbtree.c
new file mode 100644
index 000000000000..688ce56d8b92
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/rbtree.c
@@ -0,0 +1,184 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <test_progs.h>
+#include <network_helpers.h>
+
+#include "rbtree.skel.h"
+#include "rbtree_fail.skel.h"
+#include "rbtree_btf_fail__wrong_node_type.skel.h"
+#include "rbtree_btf_fail__add_wrong_type.skel.h"
+
+static char log_buf[1024 * 1024];
+
+static struct {
+ const char *prog_name;
+ const char *err_msg;
+} rbtree_fail_tests[] = {
+ {"rbtree_api_nolock_add", "bpf_spin_lock at off=16 must be held for bpf_rb_root"},
+ {"rbtree_api_nolock_remove", "bpf_spin_lock at off=16 must be held for bpf_rb_root"},
+ {"rbtree_api_nolock_first", "bpf_spin_lock at off=16 must be held for bpf_rb_root"},
+
+ /* Specific failure string for these three isn't very important, but it shouldn't be
+ * possible to call rbtree api func from within add() callback
+ */
+ {"rbtree_api_add_bad_cb_bad_fn_call_add", "arg#1 expected pointer to allocated object"},
+ {"rbtree_api_add_bad_cb_bad_fn_call_remove", "allocated object must be referenced"},
+ {"rbtree_api_add_bad_cb_bad_fn_call_first", "Unreleased reference id=4 alloc_insn=26"},
+ {"rbtree_api_add_bad_cb_bad_fn_call_first_unlock_after",
+ "failed to release release_on_unlock reference"},
+
+ {"rbtree_api_remove_unadded_node", "arg#1 expected pointer to allocated object"},
+ {"rbtree_api_add_to_multiple_trees", "arg#1 expected pointer to allocated object"},
+ {"rbtree_api_add_release_unlock_escape", "arg#1 expected pointer to allocated object"},
+ {"rbtree_api_first_release_unlock_escape", "arg#1 expected pointer to allocated object"},
+ {"rbtree_api_remove_no_drop", "Unreleased reference id=4 alloc_insn=10"},
+};
+
+static void test_rbtree_fail_prog(const char *prog_name, const char *err_msg)
+{
+ LIBBPF_OPTS(bpf_object_open_opts, opts,
+ .kernel_log_buf = log_buf,
+ .kernel_log_size = sizeof(log_buf),
+ .kernel_log_level = 1
+ );
+ struct rbtree_fail *skel;
+ struct bpf_program *prog;
+ int ret;
+
+ skel = rbtree_fail__open_opts(&opts);
+ if (!ASSERT_OK_PTR(skel, "rbtree_fail__open_opts"))
+ return;
+
+ prog = bpf_object__find_program_by_name(skel->obj, prog_name);
+ if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+ goto end;
+
+ bpf_program__set_autoload(prog, true);
+
+ ret = rbtree_fail__load(skel);
+ if (!ASSERT_ERR(ret, "rbtree_fail__load must fail"))
+ goto end;
+
+ if (!ASSERT_OK_PTR(strstr(log_buf, err_msg), "expected error message")) {
+ fprintf(stderr, "Expected: %s\n", err_msg);
+ fprintf(stderr, "Verifier: %s\n", log_buf);
+ }
+
+end:
+ rbtree_fail__destroy(skel);
+}
+
+static void test_rbtree_add_nodes(void)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts,
+ .data_in = &pkt_v4,
+ .data_size_in = sizeof(pkt_v4),
+ .repeat = 1,
+ );
+ struct rbtree *skel;
+ int ret;
+
+ skel = rbtree__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "rbtree__open_and_load"))
+ return;
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.rbtree_add_nodes), &opts);
+ ASSERT_OK(ret, "rbtree_add_nodes run");
+ ASSERT_OK(opts.retval, "rbtree_add_nodes retval");
+ ASSERT_EQ(skel->data->less_callback_ran, 1, "rbtree_add_nodes less_callback_ran");
+
+ rbtree__destroy(skel);
+}
+
+static void test_rbtree_add_and_remove(void)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts,
+ .data_in = &pkt_v4,
+ .data_size_in = sizeof(pkt_v4),
+ .repeat = 1,
+ );
+ struct rbtree *skel;
+ int ret;
+
+ skel = rbtree__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "rbtree__open_and_load"))
+ return;
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.rbtree_add_and_remove), &opts);
+ ASSERT_OK(ret, "rbtree_add_and_remove");
+ ASSERT_OK(opts.retval, "rbtree_add_and_remove retval");
+ ASSERT_EQ(skel->data->removed_key, 5, "rbtree_add_and_remove first removed key");
+
+ rbtree__destroy(skel);
+}
+
+static void test_rbtree_first_and_remove(void)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts,
+ .data_in = &pkt_v4,
+ .data_size_in = sizeof(pkt_v4),
+ .repeat = 1,
+ );
+ struct rbtree *skel;
+ int ret;
+
+ skel = rbtree__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "rbtree__open_and_load"))
+ return;
+
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.rbtree_first_and_remove), &opts);
+ ASSERT_OK(ret, "rbtree_first_and_remove");
+ ASSERT_OK(opts.retval, "rbtree_first_and_remove retval");
+ ASSERT_EQ(skel->data->first_data[0], 2, "rbtree_first_and_remove first rbtree_first()");
+ ASSERT_EQ(skel->data->removed_key, 1, "rbtree_first_and_remove first removed key");
+ ASSERT_EQ(skel->data->first_data[1], 4, "rbtree_first_and_remove second rbtree_first()");
+
+ rbtree__destroy(skel);
+}
+
+void test_rbtree_success(void)
+{
+ if (test__start_subtest("rbtree_add_nodes"))
+ test_rbtree_add_nodes();
+ if (test__start_subtest("rbtree_add_and_remove"))
+ test_rbtree_add_and_remove();
+ if (test__start_subtest("rbtree_first_and_remove"))
+ test_rbtree_first_and_remove();
+}
+
+#define BTF_FAIL_TEST(suffix) \
+void test_rbtree_btf_fail__##suffix(void) \
+{ \
+ struct rbtree_btf_fail__##suffix *skel; \
+ \
+ skel = rbtree_btf_fail__##suffix##__open_and_load(); \
+ if (!ASSERT_ERR_PTR(skel, \
+ "rbtree_btf_fail__" #suffix "__open_and_load unexpected success")) \
+ rbtree_btf_fail__##suffix##__destroy(skel); \
+}
+
+#define RUN_BTF_FAIL_TEST(suffix) \
+ if (test__start_subtest("rbtree_btf_fail__" #suffix)) \
+ test_rbtree_btf_fail__##suffix();
+
+BTF_FAIL_TEST(wrong_node_type);
+BTF_FAIL_TEST(add_wrong_type);
+
+void test_rbtree_btf_fail(void)
+{
+ RUN_BTF_FAIL_TEST(wrong_node_type);
+ RUN_BTF_FAIL_TEST(add_wrong_type);
+}
+
+void test_rbtree_fail(void)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(rbtree_fail_tests); i++) {
+ if (!test__start_subtest(rbtree_fail_tests[i].prog_name))
+ continue;
+ test_rbtree_fail_prog(rbtree_fail_tests[i].prog_name,
+ rbtree_fail_tests[i].err_msg);
+ }
+}
diff --git a/tools/testing/selftests/bpf/progs/rbtree.c b/tools/testing/selftests/bpf/progs/rbtree.c
new file mode 100644
index 000000000000..96a9d732e3fe
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/rbtree.c
@@ -0,0 +1,180 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+
+struct node_data {
+ long key;
+ long data;
+ struct bpf_rb_node node;
+};
+
+long less_callback_ran = -1;
+long removed_key = -1;
+long first_data[2] = {-1, -1};
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+private(A) struct bpf_spin_lock glock;
+private(A) struct bpf_rb_root groot __contains(node_data, node);
+
+static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct node_data *node_a;
+ struct node_data *node_b;
+
+ node_a = container_of(a, struct node_data, node);
+ node_b = container_of(b, struct node_data, node);
+ less_callback_ran = 1;
+
+ return node_a->key < node_b->key;
+}
+
+static long __add_three(struct bpf_rb_root *root, struct bpf_spin_lock *lock)
+{
+ struct node_data *n, *m;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return 1;
+ n->key = 5;
+
+ m = bpf_obj_new(typeof(*m));
+ if (!m) {
+ bpf_obj_drop(n);
+ return 2;
+ }
+ m->key = 1;
+
+ bpf_spin_lock(&glock);
+ bpf_rbtree_add(&groot, &n->node, less);
+ bpf_rbtree_add(&groot, &m->node, less);
+ bpf_spin_unlock(&glock);
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return 3;
+ n->key = 3;
+
+ bpf_spin_lock(&glock);
+ bpf_rbtree_add(&groot, &n->node, less);
+ bpf_spin_unlock(&glock);
+ return 0;
+}
+
+SEC("tc")
+long rbtree_add_nodes(void *ctx)
+{
+ return __add_three(&groot, &glock);
+}
+
+SEC("tc")
+long rbtree_add_and_remove(void *ctx)
+{
+ struct bpf_rb_node *res = NULL;
+ struct node_data *n, *m;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ goto err_out;
+ n->key = 5;
+
+ m = bpf_obj_new(typeof(*m));
+ if (!m)
+ goto err_out;
+ m->key = 3;
+
+ bpf_spin_lock(&glock);
+ bpf_rbtree_add(&groot, &n->node, less);
+ bpf_rbtree_add(&groot, &m->node, less);
+ res = bpf_rbtree_remove(&groot, &n->node);
+ bpf_spin_unlock(&glock);
+
+ if (!res)
+ return 1;
+ n = container_of(res, struct node_data, node);
+ removed_key = n->key;
+
+ bpf_obj_drop(n);
+
+ return 0;
+err_out:
+ if (n)
+ bpf_obj_drop(n);
+ if (m)
+ bpf_obj_drop(m);
+ return 1;
+}
+
+SEC("tc")
+long rbtree_first_and_remove(void *ctx)
+{
+ struct bpf_rb_node *res = NULL;
+ struct node_data *n, *m, *o;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return 1;
+ n->key = 3;
+ n->data = 4;
+
+ m = bpf_obj_new(typeof(*m));
+ if (!m)
+ goto err_out;
+ m->key = 5;
+ m->data = 6;
+
+ o = bpf_obj_new(typeof(*o));
+ if (!o)
+ goto err_out;
+ o->key = 1;
+ o->data = 2;
+
+ bpf_spin_lock(&glock);
+ bpf_rbtree_add(&groot, &n->node, less);
+ bpf_rbtree_add(&groot, &m->node, less);
+ bpf_rbtree_add(&groot, &o->node, less);
+
+ res = bpf_rbtree_first(&groot);
+ if (!res) {
+ bpf_spin_unlock(&glock);
+ return 2;
+ }
+
+ o = container_of(res, struct node_data, node);
+ first_data[0] = o->data;
+
+ res = bpf_rbtree_remove(&groot, &o->node);
+ bpf_spin_unlock(&glock);
+
+ if (!res)
+ return 1;
+ o = container_of(res, struct node_data, node);
+ removed_key = o->key;
+
+ bpf_obj_drop(o);
+
+ bpf_spin_lock(&glock);
+ res = bpf_rbtree_first(&groot);
+ if (!res) {
+ bpf_spin_unlock(&glock);
+ return 3;
+ }
+
+ o = container_of(res, struct node_data, node);
+ first_data[1] = o->data;
+ bpf_spin_unlock(&glock);
+
+ return 0;
+err_out:
+ if (n)
+ bpf_obj_drop(n);
+ if (m)
+ bpf_obj_drop(m);
+ return 1;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/rbtree_btf_fail__add_wrong_type.c b/tools/testing/selftests/bpf/progs/rbtree_btf_fail__add_wrong_type.c
new file mode 100644
index 000000000000..1729712722ec
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/rbtree_btf_fail__add_wrong_type.c
@@ -0,0 +1,48 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+
+struct node_data {
+ int key;
+ int data;
+ struct bpf_rb_node node;
+};
+
+struct node_data2 {
+ int key;
+ struct bpf_rb_node node;
+ int data;
+};
+
+static bool less2(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct node_data2 *node_a;
+ struct node_data2 *node_b;
+
+ node_a = container_of(a, struct node_data2, node);
+ node_b = container_of(b, struct node_data2, node);
+
+ return node_a->key < node_b->key;
+}
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+private(A) struct bpf_spin_lock glock;
+private(A) struct bpf_rb_root groot __contains(node_data, node);
+
+SEC("tc")
+long rbtree_api_nolock_add(void *ctx)
+{
+ struct node_data2 *n;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return 1;
+
+ bpf_rbtree_add(&groot, &n->node, less2);
+ return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/rbtree_btf_fail__wrong_node_type.c b/tools/testing/selftests/bpf/progs/rbtree_btf_fail__wrong_node_type.c
new file mode 100644
index 000000000000..df0efb46177c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/rbtree_btf_fail__wrong_node_type.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2022 Meta Platforms, Inc. and affiliates. */
+
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+
+/* BTF load should fail as bpf_rb_root __contains this type and points to
+ * 'node', but 'node' is not a bpf_rb_node
+ */
+struct node_data {
+ int key;
+ int data;
+ struct bpf_list_node node;
+};
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+private(A) struct bpf_spin_lock glock;
+private(A) struct bpf_rb_root groot __contains(node_data, node);
diff --git a/tools/testing/selftests/bpf/progs/rbtree_fail.c b/tools/testing/selftests/bpf/progs/rbtree_fail.c
new file mode 100644
index 000000000000..96caa7f33805
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/rbtree_fail.c
@@ -0,0 +1,263 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include "bpf_experimental.h"
+
+struct node_data {
+ long key;
+ long data;
+ struct bpf_rb_node node;
+};
+
+#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+private(A) struct bpf_spin_lock glock;
+private(A) struct bpf_rb_root groot __contains(node_data, node);
+private(A) struct bpf_rb_root groot2 __contains(node_data, node);
+
+static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct node_data *node_a;
+ struct node_data *node_b;
+
+ node_a = container_of(a, struct node_data, node);
+ node_b = container_of(b, struct node_data, node);
+
+ return node_a->key < node_b->key;
+}
+
+SEC("?tc")
+long rbtree_api_nolock_add(void *ctx)
+{
+ struct node_data *n;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return 1;
+
+ bpf_rbtree_add(&groot, &n->node, less);
+ return 0;
+}
+
+SEC("?tc")
+long rbtree_api_nolock_remove(void *ctx)
+{
+ struct node_data *n;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return 1;
+
+ bpf_spin_lock(&glock);
+ bpf_rbtree_add(&groot, &n->node, less);
+ bpf_spin_unlock(&glock);
+
+ bpf_rbtree_remove(&groot, &n->node);
+ return 0;
+}
+
+SEC("?tc")
+long rbtree_api_nolock_first(void *ctx)
+{
+ bpf_rbtree_first(&groot);
+ return 0;
+}
+
+SEC("?tc")
+long rbtree_api_remove_unadded_node(void *ctx)
+{
+ struct node_data *n, *m;
+ struct bpf_rb_node *res;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return 1;
+
+ m = bpf_obj_new(typeof(*m));
+ if (!m) {
+ bpf_obj_drop(n);
+ return 1;
+ }
+
+ bpf_spin_lock(&glock);
+ bpf_rbtree_add(&groot, &n->node, less);
+
+ /* This remove should pass verifier */
+ res = bpf_rbtree_remove(&groot, &n->node);
+ if (res)
+ n = container_of(res, struct node_data, node);
+
+ /* This remove shouldn't, m isn't in an rbtree */
+ res = bpf_rbtree_remove(&groot, &m->node);
+ if (res)
+ m = container_of(res, struct node_data, node);
+ bpf_spin_unlock(&glock);
+
+ if (n)
+ bpf_obj_drop(n);
+ if (m)
+ bpf_obj_drop(m);
+ return 0;
+}
+
+SEC("?tc")
+long rbtree_api_remove_no_drop(void *ctx)
+{
+ struct bpf_rb_node *res;
+ struct node_data *n;
+
+ bpf_spin_lock(&glock);
+ res = bpf_rbtree_first(&groot);
+ if (!res)
+ goto unlock_err;
+
+ res = bpf_rbtree_remove(&groot, res);
+ if (!res)
+ goto unlock_err;
+
+ n = container_of(res, struct node_data, node);
+ bpf_spin_unlock(&glock);
+
+ /* bpf_obj_drop(n) is missing here */
+ return 0;
+
+unlock_err:
+ bpf_spin_unlock(&glock);
+ return 1;
+}
+
+SEC("?tc")
+long rbtree_api_add_to_multiple_trees(void *ctx)
+{
+ struct node_data *n;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return 1;
+
+ bpf_spin_lock(&glock);
+ bpf_rbtree_add(&groot, &n->node, less);
+
+ /* This add should fail since n already in groot's tree */
+ bpf_rbtree_add(&groot2, &n->node, less);
+ bpf_spin_unlock(&glock);
+ return 0;
+}
+
+SEC("?tc")
+long rbtree_api_add_release_unlock_escape(void *ctx)
+{
+ struct node_data *n;
+
+ n = bpf_obj_new(typeof(*n));
+ if (!n)
+ return 1;
+
+ bpf_spin_lock(&glock);
+ bpf_rbtree_add(&groot, &n->node, less);
+ bpf_spin_unlock(&glock);
+
+ bpf_spin_lock(&glock);
+ /* After add() in previous critical section, n should be
+ * release_on_unlock and released after previous spin_unlock,
+ * so should not be possible to use it here
+ */
+ bpf_rbtree_remove(&groot, &n->node);
+ bpf_spin_unlock(&glock);
+ return 0;
+}
+
+SEC("?tc")
+long rbtree_api_first_release_unlock_escape(void *ctx)
+{
+ struct bpf_rb_node *res;
+ struct node_data *n;
+
+ bpf_spin_lock(&glock);
+ res = bpf_rbtree_first(&groot);
+ if (res)
+ n = container_of(res, struct node_data, node);
+ bpf_spin_unlock(&glock);
+
+ bpf_spin_lock(&glock);
+ /* After first() in previous critical section, n should be
+ * release_on_unlock and released after previous spin_unlock,
+ * so should not be possible to use it here
+ */
+ bpf_rbtree_remove(&groot, &n->node);
+ bpf_spin_unlock(&glock);
+ return 0;
+}
+
+static bool less__bad_fn_call_add(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct node_data *node_a;
+ struct node_data *node_b;
+
+ node_a = container_of(a, struct node_data, node);
+ node_b = container_of(b, struct node_data, node);
+ bpf_rbtree_add(&groot, &node_a->node, less);
+
+ return node_a->key < node_b->key;
+}
+
+static bool less__bad_fn_call_remove(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct node_data *node_a;
+ struct node_data *node_b;
+
+ node_a = container_of(a, struct node_data, node);
+ node_b = container_of(b, struct node_data, node);
+ bpf_rbtree_remove(&groot, &node_a->node);
+
+ return node_a->key < node_b->key;
+}
+
+static bool less__bad_fn_call_first(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct node_data *node_a;
+ struct node_data *node_b;
+
+ node_a = container_of(a, struct node_data, node);
+ node_b = container_of(b, struct node_data, node);
+ bpf_rbtree_first(&groot);
+
+ return node_a->key < node_b->key;
+}
+
+static bool less__bad_fn_call_first_unlock_after(struct bpf_rb_node *a, const struct bpf_rb_node *b)
+{
+ struct node_data *node_a;
+ struct node_data *node_b;
+
+ node_a = container_of(a, struct node_data, node);
+ node_b = container_of(b, struct node_data, node);
+ bpf_rbtree_first(&groot);
+ bpf_spin_unlock(&glock);
+
+ return node_a->key < node_b->key;
+}
+
+#define RBTREE_API_ADD_BAD_CB(cb_suffix) \
+SEC("?tc") \
+long rbtree_api_add_bad_cb_##cb_suffix(void *ctx) \
+{ \
+ struct node_data *n; \
+ \
+ n = bpf_obj_new(typeof(*n)); \
+ if (!n) \
+ return 1; \
+ \
+ bpf_spin_lock(&glock); \
+ bpf_rbtree_add(&groot, &n->node, less__##cb_suffix); \
+ bpf_spin_unlock(&glock); \
+ return 0; \
+}
+
+RBTREE_API_ADD_BAD_CB(bad_fn_call_add);
+RBTREE_API_ADD_BAD_CB(bad_fn_call_remove);
+RBTREE_API_ADD_BAD_CB(bad_fn_call_first);
+RBTREE_API_ADD_BAD_CB(bad_fn_call_first_unlock_after);
+
+char _license[] SEC("license") = "GPL";
--
2.30.2
* Re: [PATCH bpf-next 02/13] bpf: map_check_btf should fail if btf_parse_fields fails
2022-12-06 23:09 ` [PATCH bpf-next 02/13] bpf: map_check_btf should fail if btf_parse_fields fails Dave Marchevsky
@ 2022-12-07 1:32 ` Alexei Starovoitov
2022-12-07 16:49 ` Kumar Kartikeya Dwivedi
1 sibling, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 1:32 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On Tue, Dec 06, 2022 at 03:09:49PM -0800, Dave Marchevsky wrote:
> map_check_btf calls btf_parse_fields to create a btf_record for its
> value_type. If there are no special fields in the value_type,
> btf_parse_fields returns NULL, whereas if there are special value_type
> fields but they are invalid in some way, an error is returned.
>
> An example invalid state would be:
>
> struct node_data {
> struct bpf_rb_node node;
> int data;
> };
>
> private(A) struct bpf_spin_lock glock;
> private(A) struct bpf_list_head ghead __contains(node_data, node);
>
> groot should be invalid as its __contains tag points to a field with
s/groot/ghead/ ?
> type != "bpf_list_node".
>
> Before this patch, such a scenario would result in btf_parse_fields
> returning an error ptr, subsequent !IS_ERR_OR_NULL check failing,
> and btf_check_and_fixup_fields returning 0, which would then be
> returned by map_check_btf.
>
> After this patch's changes, -EINVAL would be returned by map_check_btf
> and the map would correctly fail to load.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Fixes: aa3496accc41 ("bpf: Refactor kptr_off_tab into btf_record")
> ---
> kernel/bpf/syscall.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 35972afb6850..c3599a7902f0 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1007,7 +1007,10 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> map->record = btf_parse_fields(btf, value_type,
> BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD,
> map->value_size);
> - if (!IS_ERR_OR_NULL(map->record)) {
> + if (IS_ERR(map->record))
> + return -EINVAL;
> +
> + if (map->record) {
> int i;
>
> if (!bpf_capable()) {
> --
> 2.30.2
>
* Re: [PATCH bpf-next 04/13] bpf: rename list_head -> datastructure_head in field info types
2022-12-06 23:09 ` [PATCH bpf-next 04/13] bpf: rename list_head -> datastructure_head in field info types Dave Marchevsky
@ 2022-12-07 1:41 ` Alexei Starovoitov
2022-12-07 18:52 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 1:41 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On Tue, Dec 06, 2022 at 03:09:51PM -0800, Dave Marchevsky wrote:
> Many of the structs recently added to track field info for linked-list
> head are useful as-is for rbtree root. So let's do a mechanical renaming
> of list_head-related types and fields:
>
> include/linux/bpf.h:
> struct btf_field_list_head -> struct btf_field_datastructure_head
> list_head -> datastructure_head in struct btf_field union
> kernel/bpf/btf.c:
> list_head -> datastructure_head in struct btf_field_info
Looking through this patch and others, the 'datastructure head' name
eventually becomes confusing.
I'm not sure what the 'head' of a data structure is.
There is a head in a linked list, but 'head of a tree' is odd.
The attempt here is to find a common name for the programming
concept where there is a 'root' and there are 'nodes' that are added to that 'root'.
The 'data structure' name is too broad in that sense.
Especially later it becomes 'datastructure_api' which is even broader.
I was thinking to propose:
struct btf_field_list_head -> struct btf_field_tree_root
list_head -> tree_root in struct btf_field union
and is_kfunc_tree_api later...
since a linked list is a tree too.
But reading 'tree' next to other names like 'field' and 'kfunc',
it might be mistakenly assumed that 'tree' applies to the former.
So I think using 'graph' as more general concept to describe both
link list and rb-tree would be the best.
So the proposal:
struct btf_field_list_head -> struct btf_field_graph_root
list_head -> graph_root in struct btf_field union
and is_kfunc_graph_api later...
'graph' is short enough and rarely used in names,
so it stands on its own next to 'field' and in combination
with other names.
wdyt?
>
> This is a nonfunctional change, functionality to actually use these
> fields for rbtree will be added in further patches.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> include/linux/bpf.h | 4 ++--
> kernel/bpf/btf.c | 21 +++++++++++----------
> kernel/bpf/helpers.c | 4 ++--
> kernel/bpf/verifier.c | 21 +++++++++++----------
> 4 files changed, 26 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 4920ac252754..9e8b12c7061e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -189,7 +189,7 @@ struct btf_field_kptr {
> u32 btf_id;
> };
>
> -struct btf_field_list_head {
> +struct btf_field_datastructure_head {
> struct btf *btf;
> u32 value_btf_id;
> u32 node_offset;
> @@ -201,7 +201,7 @@ struct btf_field {
> enum btf_field_type type;
> union {
> struct btf_field_kptr kptr;
> - struct btf_field_list_head list_head;
> + struct btf_field_datastructure_head datastructure_head;
> };
> };
>
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index c80bd8709e69..284e3e4b76b7 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -3227,7 +3227,7 @@ struct btf_field_info {
> struct {
> const char *node_name;
> u32 value_btf_id;
> - } list_head;
> + } datastructure_head;
> };
> };
>
> @@ -3334,8 +3334,8 @@ static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
> return -EINVAL;
> info->type = BPF_LIST_HEAD;
> info->off = off;
> - info->list_head.value_btf_id = id;
> - info->list_head.node_name = list_node;
> + info->datastructure_head.value_btf_id = id;
> + info->datastructure_head.node_name = list_node;
> return BTF_FIELD_FOUND;
> }
>
> @@ -3603,13 +3603,14 @@ static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
> u32 offset;
> int i;
>
> - t = btf_type_by_id(btf, info->list_head.value_btf_id);
> + t = btf_type_by_id(btf, info->datastructure_head.value_btf_id);
> /* We've already checked that value_btf_id is a struct type. We
> * just need to figure out the offset of the list_node, and
> * verify its type.
> */
> for_each_member(i, t, member) {
> - if (strcmp(info->list_head.node_name, __btf_name_by_offset(btf, member->name_off)))
> + if (strcmp(info->datastructure_head.node_name,
> + __btf_name_by_offset(btf, member->name_off)))
> continue;
> /* Invalid BTF, two members with same name */
> if (n)
> @@ -3626,9 +3627,9 @@ static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
> if (offset % __alignof__(struct bpf_list_node))
> return -EINVAL;
>
> - field->list_head.btf = (struct btf *)btf;
> - field->list_head.value_btf_id = info->list_head.value_btf_id;
> - field->list_head.node_offset = offset;
> + field->datastructure_head.btf = (struct btf *)btf;
> + field->datastructure_head.value_btf_id = info->datastructure_head.value_btf_id;
> + field->datastructure_head.node_offset = offset;
> }
> if (!n)
> return -ENOENT;
> @@ -3735,11 +3736,11 @@ int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
>
> if (!(rec->fields[i].type & BPF_LIST_HEAD))
> continue;
> - btf_id = rec->fields[i].list_head.value_btf_id;
> + btf_id = rec->fields[i].datastructure_head.value_btf_id;
> meta = btf_find_struct_meta(btf, btf_id);
> if (!meta)
> return -EFAULT;
> - rec->fields[i].list_head.value_rec = meta->record;
> + rec->fields[i].datastructure_head.value_rec = meta->record;
>
> if (!(rec->field_mask & BPF_LIST_NODE))
> continue;
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index cca642358e80..6c67740222c2 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1737,12 +1737,12 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
> while (head != orig_head) {
> void *obj = head;
>
> - obj -= field->list_head.node_offset;
> + obj -= field->datastructure_head.node_offset;
> head = head->next;
> /* The contained type can also have resources, including a
> * bpf_list_head which needs to be freed.
> */
> - bpf_obj_free_fields(field->list_head.value_rec, obj);
> + bpf_obj_free_fields(field->datastructure_head.value_rec, obj);
> /* bpf_mem_free requires migrate_disable(), since we can be
> * called from map free path as well apart from BPF program (as
> * part of map ops doing bpf_obj_free_fields).
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 6f0aac837d77..bc80b4c4377b 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8615,21 +8615,22 @@ static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env,
>
> field = meta->arg_list_head.field;
>
> - et = btf_type_by_id(field->list_head.btf, field->list_head.value_btf_id);
> + et = btf_type_by_id(field->datastructure_head.btf, field->datastructure_head.value_btf_id);
> t = btf_type_by_id(reg->btf, reg->btf_id);
> - if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, 0, field->list_head.btf,
> - field->list_head.value_btf_id, true)) {
> + if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, 0, field->datastructure_head.btf,
> + field->datastructure_head.value_btf_id, true)) {
> verbose(env, "operation on bpf_list_head expects arg#1 bpf_list_node at offset=%d "
> "in struct %s, but arg is at offset=%d in struct %s\n",
> - field->list_head.node_offset, btf_name_by_offset(field->list_head.btf, et->name_off),
> + field->datastructure_head.node_offset,
> + btf_name_by_offset(field->datastructure_head.btf, et->name_off),
> list_node_off, btf_name_by_offset(reg->btf, t->name_off));
> return -EINVAL;
> }
>
> - if (list_node_off != field->list_head.node_offset) {
> + if (list_node_off != field->datastructure_head.node_offset) {
> verbose(env, "arg#1 offset=%d, but expected bpf_list_node at offset=%d in struct %s\n",
> - list_node_off, field->list_head.node_offset,
> - btf_name_by_offset(field->list_head.btf, et->name_off));
> + list_node_off, field->datastructure_head.node_offset,
> + btf_name_by_offset(field->datastructure_head.btf, et->name_off));
> return -EINVAL;
> }
> /* Set arg#1 for expiration after unlock */
> @@ -9078,9 +9079,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>
> mark_reg_known_zero(env, regs, BPF_REG_0);
> regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
> - regs[BPF_REG_0].btf = field->list_head.btf;
> - regs[BPF_REG_0].btf_id = field->list_head.value_btf_id;
> - regs[BPF_REG_0].off = field->list_head.node_offset;
> + regs[BPF_REG_0].btf = field->datastructure_head.btf;
> + regs[BPF_REG_0].btf_id = field->datastructure_head.value_btf_id;
> + regs[BPF_REG_0].off = field->datastructure_head.node_offset;
> } else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) {
> mark_reg_known_zero(env, regs, BPF_REG_0);
> regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
> --
> 2.30.2
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH bpf-next 05/13] bpf: Add basic bpf_rb_{root,node} support
2022-12-06 23:09 ` [PATCH bpf-next 05/13] bpf: Add basic bpf_rb_{root,node} support Dave Marchevsky
@ 2022-12-07 1:48 ` Alexei Starovoitov
0 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 1:48 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On Tue, Dec 06, 2022 at 03:09:52PM -0800, Dave Marchevsky wrote:
>
> +#define OWNER_FIELD_MASK (BPF_LIST_HEAD | BPF_RB_ROOT)
> +#define OWNEE_FIELD_MASK (BPF_LIST_NODE | BPF_RB_NODE)
One letter difference makes it so hard to review.
How about
GRAPH_ROOT_MASK
GRAPH_NODE_MASK
?
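A self-contained sketch of the suggested mask names (the field-type bit values here are made up for illustration and are not the kernel's actual enum btf_field_type values):

```c
#include <assert.h>
#include <stdbool.h>

/* illustrative bit values, not the kernel's enum btf_field_type */
enum {
	BPF_LIST_HEAD = 1 << 0,
	BPF_LIST_NODE = 1 << 1,
	BPF_RB_ROOT   = 1 << 2,
	BPF_RB_NODE   = 1 << 3,
};

/* suggested names: a 'graph root' owns, a 'graph node' is owned */
#define GRAPH_ROOT_MASK (BPF_LIST_HEAD | BPF_RB_ROOT)
#define GRAPH_NODE_MASK (BPF_LIST_NODE | BPF_RB_NODE)

bool is_graph_root(int field_type)
{
	return field_type & GRAPH_ROOT_MASK;
}

bool is_graph_node(int field_type)
{
	return field_type & GRAPH_NODE_MASK;
}
```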
> +
> int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
> {
> int i;
>
> - /* There are two owning types, kptr_ref and bpf_list_head. The former
> - * only supports storing kernel types, which can never store references
> - * to program allocated local types, atleast not yet. Hence we only need
> - * to ensure that bpf_list_head ownership does not form cycles.
> + /* There are three types that signify ownership of some other type:
> + * kptr_ref, bpf_list_head, bpf_rb_root.
> + * kptr_ref only supports storing kernel types, which can't store
> + * references to program allocated local types.
> + *
> + * Hence we only need to ensure that bpf_{list_head,rb_root} ownership
> + * does not form cycles.
> */
> - if (IS_ERR_OR_NULL(rec) || !(rec->field_mask & BPF_LIST_HEAD))
> + if (IS_ERR_OR_NULL(rec) || !(rec->field_mask & OWNER_FIELD_MASK))
> return 0;
> for (i = 0; i < rec->cnt; i++) {
> struct btf_struct_meta *meta;
> u32 btf_id;
>
> - if (!(rec->fields[i].type & BPF_LIST_HEAD))
> + if (!(rec->fields[i].type & OWNER_FIELD_MASK))
> continue;
> btf_id = rec->fields[i].datastructure_head.value_btf_id;
> meta = btf_find_struct_meta(btf, btf_id);
> @@ -3742,39 +3783,47 @@ int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
> return -EFAULT;
> rec->fields[i].datastructure_head.value_rec = meta->record;
>
> - if (!(rec->field_mask & BPF_LIST_NODE))
> + /* We need to set value_rec for all owner types, but no need
> + * to check ownership cycle for a type unless it's also an
> + * ownee type.
> + */
> + if (!(rec->field_mask & OWNEE_FIELD_MASK))
> continue;
>
> /* We need to ensure ownership acyclicity among all types. The
> * proper way to do it would be to topologically sort all BTF
> * IDs based on the ownership edges, since there can be multiple
> - * bpf_list_head in a type. Instead, we use the following
> - * reasoning:
> + * bpf_{list_head,rb_node} in a type. Instead, we use the
> + * following reasoning:
> *
> * - A type can only be owned by another type in user BTF if it
> - * has a bpf_list_node.
> + * has a bpf_{list,rb}_node. Let's call these ownee types.
> * - A type can only _own_ another type in user BTF if it has a
> - * bpf_list_head.
> + * bpf_{list_head,rb_root}. Let's call these owner types.
> *
> - * We ensure that if a type has both bpf_list_head and
> - * bpf_list_node, its element types cannot be owning types.
> + * We ensure that if a type is both an owner and ownee, its
> + * element types cannot be owner types.
> *
> * To ensure acyclicity:
> *
> - * When A only has bpf_list_head, ownership chain can be:
> + * When A is an owner type but not an ownee, its ownership
and that would become:
When A is a root type, but not a node type...
reads easier.
> + * chain can be:
> * A -> B -> C
> * Where:
> - * - B has both bpf_list_head and bpf_list_node.
> - * - C only has bpf_list_node.
> + * - A is an owner, e.g. has bpf_rb_root.
> + * - B is both an owner and ownee, e.g. has bpf_rb_node and
> + * bpf_list_head.
> + * - C is only an ownee, e.g. has bpf_list_node
> *
> - * When A has both bpf_list_head and bpf_list_node, some other
> - * type already owns it in the BTF domain, hence it can not own
> - * another owning type through any of the bpf_list_head edges.
> + * When A is both an owner and ownee, some other type already
> + * owns it in the BTF domain, hence it can not own
> + * another owner type through any of the ownership edges.
> * A -> B
> * Where:
> - * - B only has bpf_list_node.
> + * - A is both an owner and ownee.
> + * - B is only an ownee.
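The invariant quoted above can be illustrated with a toy model: a type that is both an owner and an ownee must not own another owner type, which caps ownership chains at A -> B -> C. This is a userspace sketch under that assumption, not the kernel's actual BTF walk:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* toy model: does a type contain a graph root / graph node field? */
struct toy_type {
	bool has_root;                 /* e.g. bpf_list_head or bpf_rb_root */
	bool has_node;                 /* e.g. bpf_list_node or bpf_rb_node */
	const struct toy_type *owned;  /* element type reachable via its root */
};

/* the check described above: a type that is both owner and ownee
 * must not own another owner type */
bool ownership_ok(const struct toy_type *t)
{
	if (t->has_root && t->has_node && t->owned && t->owned->has_root)
		return false;
	return true;
}
```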
* Re: [PATCH bpf-next 07/13] bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args
2022-12-06 23:09 ` [PATCH bpf-next 07/13] bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args Dave Marchevsky
@ 2022-12-07 1:51 ` Alexei Starovoitov
0 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 1:51 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On Tue, Dec 06, 2022 at 03:09:54PM -0800, Dave Marchevsky wrote:
>
> -static int process_kf_arg_ptr_to_list_head(struct bpf_verifier_env *env,
> +static bool is_bpf_rbtree_api_kfunc(u32 btf_id)
> +{
> + return btf_id == special_kfunc_list[KF_bpf_rbtree_add] ||
> + btf_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
> + btf_id == special_kfunc_list[KF_bpf_rbtree_first];
> +}
> +
> +static bool is_bpf_datastructure_api_kfunc(u32 btf_id)
> +{
> + return is_bpf_list_api_kfunc(btf_id) || is_bpf_rbtree_api_kfunc(btf_id);
> +}
static bool is_bpf_graph_api_kfunc(u32 btf_id)
{
return is_bpf_list_api_kfunc(btf_id) || is_bpf_rbtree_api_kfunc(btf_id);
}
would read well here.
Much shorter too.
* Re: [PATCH bpf-next 08/13] bpf: Add callback validation to kfunc verifier logic
2022-12-06 23:09 ` [PATCH bpf-next 08/13] bpf: Add callback validation to kfunc verifier logic Dave Marchevsky
@ 2022-12-07 2:01 ` Alexei Starovoitov
2022-12-17 8:49 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 2:01 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On Tue, Dec 06, 2022 at 03:09:55PM -0800, Dave Marchevsky wrote:
> Some BPF helpers take a callback function which the helper calls. For
> each helper that takes such a callback, there's a special call to
> __check_func_call with a callback-state-setting callback that sets up
> verifier bpf_func_state for the callback's frame.
>
> kfuncs don't have any of this infrastructure yet, so let's add it in
> this patch, following existing helper pattern as much as possible. To
> validate functionality of this added plumbing, this patch adds
> callback handling for the bpf_rbtree_add kfunc and hopes to lay
> groundwork for future next-gen datastructure callbacks.
>
> In the "general plumbing" category we have:
>
> * check_kfunc_call doing callback verification right before clearing
> CALLER_SAVED_REGS, exactly like check_helper_call
> * recognition of func_ptr BTF types in kfunc args as
> KF_ARG_PTR_TO_CALLBACK + propagation of subprogno for this arg type
>
> In the "rbtree_add / next-gen datastructure-specific plumbing" category:
>
> * Since bpf_rbtree_add must be called while the spin_lock associated
> with the tree is held, don't complain when callback's func_state
> doesn't unlock it by frame exit
> * Mark rbtree_add callback's args PTR_UNTRUSTED to prevent rbtree
> api functions from being called in the callback
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> kernel/bpf/verifier.c | 136 ++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 130 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 652112007b2c..9ad8c0b264dc 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -1448,6 +1448,16 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
> reg->type &= ~PTR_MAYBE_NULL;
> }
>
> +static void mark_reg_datastructure_node(struct bpf_reg_state *regs, u32 regno,
> + struct btf_field_datastructure_head *ds_head)
> +{
> + __mark_reg_known_zero(®s[regno]);
> + regs[regno].type = PTR_TO_BTF_ID | MEM_ALLOC;
> + regs[regno].btf = ds_head->btf;
> + regs[regno].btf_id = ds_head->value_btf_id;
> + regs[regno].off = ds_head->node_offset;
> +}
> +
> static bool reg_is_pkt_pointer(const struct bpf_reg_state *reg)
> {
> return type_is_pkt_pointer(reg->type);
> @@ -4771,7 +4781,8 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
> return -EACCES;
> }
>
> - if (type_is_alloc(reg->type) && !reg->ref_obj_id) {
> + if (type_is_alloc(reg->type) && !reg->ref_obj_id &&
> + !cur_func(env)->in_callback_fn) {
> verbose(env, "verifier internal error: ref_obj_id for allocated object must be non-zero\n");
> return -EFAULT;
> }
> @@ -6952,6 +6963,8 @@ static int set_callee_state(struct bpf_verifier_env *env,
> struct bpf_func_state *caller,
> struct bpf_func_state *callee, int insn_idx);
>
> +static bool is_callback_calling_kfunc(u32 btf_id);
> +
> static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> int *insn_idx, int subprog,
> set_callee_state_fn set_callee_state_cb)
> @@ -7006,10 +7019,18 @@ static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> * interested in validating only BPF helpers that can call subprogs as
> * callbacks
> */
> - if (set_callee_state_cb != set_callee_state && !is_callback_calling_function(insn->imm)) {
> - verbose(env, "verifier bug: helper %s#%d is not marked as callback-calling\n",
> - func_id_name(insn->imm), insn->imm);
> - return -EFAULT;
> + if (set_callee_state_cb != set_callee_state) {
> + if (bpf_pseudo_kfunc_call(insn) &&
> + !is_callback_calling_kfunc(insn->imm)) {
> + verbose(env, "verifier bug: kfunc %s#%d not marked as callback-calling\n",
> + func_id_name(insn->imm), insn->imm);
> + return -EFAULT;
> + } else if (!bpf_pseudo_kfunc_call(insn) &&
> + !is_callback_calling_function(insn->imm)) { /* helper */
> + verbose(env, "verifier bug: helper %s#%d not marked as callback-calling\n",
> + func_id_name(insn->imm), insn->imm);
> + return -EFAULT;
> + }
> }
>
> if (insn->code == (BPF_JMP | BPF_CALL) &&
> @@ -7275,6 +7296,67 @@ static int set_user_ringbuf_callback_state(struct bpf_verifier_env *env,
> return 0;
> }
>
> +static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
> + struct bpf_func_state *caller,
> + struct bpf_func_state *callee,
> + int insn_idx)
> +{
> + /* void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
> + * bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b));
> + *
> + * 'struct bpf_rb_node *node' arg to bpf_rbtree_add is the same PTR_TO_BTF_ID w/ offset
> + * that 'less' callback args will be receiving. However, 'node' arg was release_reference'd
> + * by this point, so look at 'root'
> + */
> + struct btf_field *field;
> + struct btf_record *rec;
> +
> + rec = reg_btf_record(&caller->regs[BPF_REG_1]);
> + if (!rec)
> + return -EFAULT;
> +
> + field = btf_record_find(rec, caller->regs[BPF_REG_1].off, BPF_RB_ROOT);
> + if (!field || !field->datastructure_head.value_btf_id)
> + return -EFAULT;
> +
> + mark_reg_datastructure_node(callee->regs, BPF_REG_1, &field->datastructure_head);
> + callee->regs[BPF_REG_1].type |= PTR_UNTRUSTED;
> + mark_reg_datastructure_node(callee->regs, BPF_REG_2, &field->datastructure_head);
> + callee->regs[BPF_REG_2].type |= PTR_UNTRUSTED;
Please add a comment here to explain that the pointers are actually trusted
and that this is a quick hack to prevent the callback from calling into rbtree kfuncs.
We'll definitely need to clean it up.
Have you tried checking for is_bpf_list_api_kfunc() || is_bpf_rbtree_api_kfunc()
while processing kfuncs inside the callback?
> + callee->in_callback_fn = true;
this will give you a flag to do that check.
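A rough userspace sketch of the check being suggested (the real verifier state and kfunc-id plumbing are far more involved; this only shows the shape of using an in_callback_fn flag to reject graph-API kfuncs inside the callback):

```c
#include <assert.h>
#include <stdbool.h>

/* toy stand-ins for verifier state and kfunc ids */
struct toy_func_state {
	bool in_callback_fn;
};

enum { KF_BPF_RBTREE_ADD = 1, KF_BPF_RBTREE_FIRST, KF_OTHER };

bool toy_is_graph_api_kfunc(int btf_id)
{
	return btf_id == KF_BPF_RBTREE_ADD || btf_id == KF_BPF_RBTREE_FIRST;
}

/* reject graph-API kfuncs while verifying a graph callback */
int toy_check_kfunc_in_cb(const struct toy_func_state *cur, int btf_id)
{
	if (cur->in_callback_fn && toy_is_graph_api_kfunc(btf_id))
		return -1;  /* would be -EACCES in the real verifier */
	return 0;
}
```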
> + callee->callback_ret_range = tnum_range(0, 1);
> + return 0;
> +}
> +
> +static bool is_rbtree_lock_required_kfunc(u32 btf_id);
> +
> +/* Are we currently verifying the callback for a rbtree helper that must
> + * be called with lock held? If so, no need to complain about unreleased
> + * lock
> + */
> +static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env)
> +{
> + struct bpf_verifier_state *state = env->cur_state;
> + struct bpf_insn *insn = env->prog->insnsi;
> + struct bpf_func_state *callee;
> + int kfunc_btf_id;
> +
> + if (!state->curframe)
> + return false;
> +
> + callee = state->frame[state->curframe];
> +
> + if (!callee->in_callback_fn)
> + return false;
> +
> + kfunc_btf_id = insn[callee->callsite].imm;
> + return is_rbtree_lock_required_kfunc(kfunc_btf_id);
> +}
> +
> static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
> {
> struct bpf_verifier_state *state = env->cur_state;
> @@ -8007,6 +8089,7 @@ struct bpf_kfunc_call_arg_meta {
> bool r0_rdonly;
> u32 ret_btf_id;
> u64 r0_size;
> + u32 subprogno;
> struct {
> u64 value;
> bool found;
> @@ -8185,6 +8268,18 @@ static bool is_kfunc_arg_rbtree_node(const struct btf *btf, const struct btf_par
> return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_NODE_ID);
> }
>
> +static bool is_kfunc_arg_callback(struct bpf_verifier_env *env, const struct btf *btf,
> + const struct btf_param *arg)
> +{
> + const struct btf_type *t;
> +
> + t = btf_type_resolve_func_ptr(btf, arg->type, NULL);
> + if (!t)
> + return false;
> +
> + return true;
> +}
> +
> /* Returns true if struct is composed of scalars, 4 levels of nesting allowed */
> static bool __btf_type_is_scalar_struct(struct bpf_verifier_env *env,
> const struct btf *btf,
> @@ -8244,6 +8339,7 @@ enum kfunc_ptr_arg_type {
> KF_ARG_PTR_TO_BTF_ID, /* Also covers reg2btf_ids conversions */
> KF_ARG_PTR_TO_MEM,
> KF_ARG_PTR_TO_MEM_SIZE, /* Size derived from next argument, skip it */
> + KF_ARG_PTR_TO_CALLBACK,
> KF_ARG_PTR_TO_RB_ROOT,
> KF_ARG_PTR_TO_RB_NODE,
> };
> @@ -8368,6 +8464,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
> return KF_ARG_PTR_TO_BTF_ID;
> }
>
> + if (is_kfunc_arg_callback(env, meta->btf, &args[argno]))
> + return KF_ARG_PTR_TO_CALLBACK;
> +
> if (argno + 1 < nargs && is_kfunc_arg_mem_size(meta->btf, &args[argno + 1], ®s[regno + 1]))
> arg_mem_size = true;
>
> @@ -8585,6 +8684,16 @@ static bool is_bpf_datastructure_api_kfunc(u32 btf_id)
> return is_bpf_list_api_kfunc(btf_id) || is_bpf_rbtree_api_kfunc(btf_id);
> }
>
> +static bool is_callback_calling_kfunc(u32 btf_id)
> +{
> + return btf_id == special_kfunc_list[KF_bpf_rbtree_add];
> +}
> +
> +static bool is_rbtree_lock_required_kfunc(u32 btf_id)
> +{
> + return is_bpf_rbtree_api_kfunc(btf_id);
> +}
> +
> static bool check_kfunc_is_datastructure_head_api(struct bpf_verifier_env *env,
> enum btf_field_type head_field_type,
> u32 kfunc_btf_id)
> @@ -8920,6 +9029,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> case KF_ARG_PTR_TO_RB_NODE:
> case KF_ARG_PTR_TO_MEM:
> case KF_ARG_PTR_TO_MEM_SIZE:
> + case KF_ARG_PTR_TO_CALLBACK:
> /* Trusted by default */
> break;
> default:
> @@ -9078,6 +9188,9 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> /* Skip next '__sz' argument */
> i++;
> break;
> + case KF_ARG_PTR_TO_CALLBACK:
> + meta->subprogno = reg->subprogno;
> + break;
> }
> }
>
> @@ -9193,6 +9306,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> }
> }
>
> + if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_add]) {
> + err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
> + set_rbtree_add_callback_state);
> + if (err) {
> + verbose(env, "kfunc %s#%d failed callback verification\n",
> + func_name, func_id);
> + return err;
> + }
> + }
> +
> for (i = 0; i < CALLER_SAVED_REGS; i++)
> mark_reg_not_init(env, regs, caller_saved[i]);
>
> @@ -14023,7 +14146,8 @@ static int do_check(struct bpf_verifier_env *env)
> return -EINVAL;
> }
>
> - if (env->cur_state->active_lock.ptr) {
> + if (env->cur_state->active_lock.ptr &&
> + !in_rbtree_lock_required_cb(env)) {
That looks wrong.
It will allow callbacks to use unpaired lock/unlock.
Have you tried clearing cur_state->active_lock when entering the callback?
That should solve it and won't cause lock/unlock imbalance.
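The save/clear/restore idea can be sketched with a toy verifier state (assuming active_lock reduces to a single pointer, which it does not in the real verifier):

```c
#include <assert.h>
#include <stddef.h>

/* toy verifier state: just the active lock pointer */
struct toy_verifier_state {
	void *active_lock_ptr;
};

/* on callback entry, stash and clear the lock state */
void *toy_enter_callback(struct toy_verifier_state *st)
{
	void *saved = st->active_lock_ptr;

	st->active_lock_ptr = NULL;
	return saved;
}

/* restore the caller's lock state on callback exit */
void toy_exit_callback(struct toy_verifier_state *st, void *saved)
{
	st->active_lock_ptr = saved;
}
```

The callback frame then sees no held lock, so an unpaired unlock inside it fails verification, while the caller's lock/unlock pairing is untouched.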
* Re: [PATCH bpf-next 09/13] bpf: Special verifier handling for bpf_rbtree_{remove, first}
2022-12-06 23:09 ` [PATCH bpf-next 09/13] bpf: Special verifier handling for bpf_rbtree_{remove, first} Dave Marchevsky
@ 2022-12-07 2:18 ` Alexei Starovoitov
0 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 2:18 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On Tue, Dec 06, 2022 at 03:09:56PM -0800, Dave Marchevsky wrote:
> Newly-added bpf_rbtree_{remove,first} kfuncs have some special properties
> that require handling in the verifier:
>
> * both bpf_rbtree_remove and bpf_rbtree_first return the type containing
> the bpf_rb_node field, with the offset set to that field's offset,
> instead of a struct bpf_rb_node *
> * Generalized existing next-gen list verifier handling for this
> as mark_reg_datastructure_node helper
>
> * Unlike other functions, which set release_on_unlock on one of their
> args, bpf_rbtree_first takes no arguments, rather setting
> release_on_unlock on its return value
>
> * bpf_rbtree_remove's node input is a node that's been inserted
> in the tree. Only non-owning references (PTR_UNTRUSTED +
> release_on_unlock) refer to such nodes, but kfuncs don't take
> PTR_UNTRUSTED args
> * Added special carveout for bpf_rbtree_remove to take PTR_UNTRUSTED
> * Since node input already has release_on_unlock set, don't set
> it again
>
> This patch, along with the previous one, complete special verifier
> handling for all rbtree API functions added in this series.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> kernel/bpf/verifier.c | 89 +++++++++++++++++++++++++++++++++++--------
> 1 file changed, 73 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 9ad8c0b264dc..29983e2c27df 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -6122,6 +6122,23 @@ static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
> return 0;
> }
>
> +static bool
> +func_arg_reg_rb_node_offset(const struct bpf_reg_state *reg, s32 off)
> +{
> + struct btf_record *rec;
> + struct btf_field *field;
> +
> + rec = reg_btf_record(reg);
> + if (!rec)
> + return false;
> +
> + field = btf_record_find(rec, off, BPF_RB_NODE);
> + if (!field)
> + return false;
> +
> + return true;
> +}
> +
> int check_func_arg_reg_off(struct bpf_verifier_env *env,
> const struct bpf_reg_state *reg, int regno,
> enum bpf_arg_type arg_type)
> @@ -6176,6 +6193,13 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
> */
> fixed_off_ok = true;
> break;
> + case PTR_TO_BTF_ID | MEM_ALLOC | PTR_UNTRUSTED:
> + /* Currently only bpf_rbtree_remove accepts a PTR_UNTRUSTED
> + * bpf_rb_node. Fixed off of the node type is OK
> + */
> + if (reg->off && func_arg_reg_rb_node_offset(reg, reg->off))
> + fixed_off_ok = true;
> + break;
This doesn't look safe.
We cannot pass a generic PTR_UNTRUSTED pointer to bpf_rbtree_remove.
bpf_rbtree_remove wouldn't be able to distinguish an invalid pointer.
Considering the cover letter example:
bpf_spin_lock(&glock);
res = bpf_rbtree_first(&groot);
// groot and res are both trusted, no?
if (!res)
/* skip */
// res is acquired and !null here
res = bpf_rbtree_remove(&groot, res); // both args are trusted
// here old res becomes untrusted because it went through release kfunc
// new res is untrusted
if (!res)
/* skip */
bpf_spin_unlock(&glock);
what am I missing?
I thought
bpf_obj_new -> returns acq obj
bpf_rbtree_add -> releases that obj
same way bpf_rbtree_first/next/ -> return acq obj
that can be passed to both rbtree_add and rbtree_remove.
The former will be a nop at runtime, but a release from the verifier's pov.
Similar with rbtree_remove:
obj = bpf_obj_new
bpf_rbtree_remove(root, obj); will be equivalent to bpf_obj_drop at run-time
and a release from the verifier's pov.
Are you trying to return untrusted from bpf_rbtree_first?
But then how can we guarantee safety?
* Re: [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
2022-12-06 23:09 ` [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0 Dave Marchevsky
@ 2022-12-07 2:39 ` Alexei Starovoitov
2022-12-07 6:46 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 2:39 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
> insn->off < 0, but this appears to not be true irrespective of changes
> in this series. Regardless, changes in this series will result in an
> example like:
>
> struct example_node {
> long key;
> long val;
> struct bpf_rb_node node;
> }
>
> /* In BPF prog, assume root contains example_node nodes */
> struct bpf_rb_node res = bpf_rbtree_first(&root);
> if (!res)
> return 1;
>
> struct example_node n = container_of(res, struct example_node, node);
> long key = n->key;
>
> Resulting in a load with off = -16, as bpf_rbtree_first's return is
Looks like the bug in the previous patch:
+ } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
+ meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
+ struct btf_field *field = meta.arg_rbtree_root.field;
+
+ mark_reg_datastructure_node(regs, BPF_REG_0,
+ &field->datastructure_head);
The R0 .off should have been:
regs[BPF_REG_0].off = field->rb_node.node_offset;
node, not root.
PTR_TO_BTF_ID should have been returned with an appropriate 'off',
so that container_of() would bring it back to zero offset.
The approach of returning untrusted from bpf_rbtree_first is questionable.
Without doing that this issue would not have surfaced.
All PTR_TO_BTF_ID need to have positive offset.
I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
can deal with negative offsets.
There could be all kinds of things to fix.
> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
> bpf_rb_node. So it's necessary to support negative insn->off when
> jitting BPF_PROBE_MEM.
I'm not convinced it's necessary.
container_of() seems to be the only case where bpf prog can convert
PTR_TO_BTF_ID with off >= 0 to negative off.
Normal pointer walking will not make it negative.
> In order to ensure that page fault for a BPF_PROBE_MEM load of *src_reg +
> insn->off is safely handled, we must confirm that *src_reg + insn->off is
> in kernel's memory. Two runtime checks are emitted to confirm that:
>
> 1) (*src_reg + insn->off) > boundary between user and kernel address
> spaces
> 2) (*src_reg + insn->off) does not overflow to a small positive
> number. This might happen if some function meant to set src_reg
> returns ERR_PTR(-EINVAL) or similar.
>
> Check 1 currently is slightly off - it compares a
>
> u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
>
> to *src_reg, aborting the load if limit is larger. Rewriting this as an
> inequality:
>
> *src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
> *src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE
>
> shows that this isn't quite right even if insn->off is positive, as we
> really want:
>
> *src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
> *src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn_off
>
> Since *src_reg + insn->off is the address we'll be loading from, not
> *src_reg - insn->off or *src_reg - abs(insn->off). So change the
> subtraction to an addition and remove the abs(), as comment indicates
> that it was only added to ignore negative insn->off.
>
> For Check 2, currently "does not overflow to a small positive number" is
> confirmed by emitting an 'add insn->off, src_reg' instruction and
> checking for carry flag. While this works fine for a positive insn->off,
> a small negative insn->off like -16 is almost guaranteed to wrap over to
> a small positive number when added to any kernel address.
>
> This patch addresses this by not doing Check 2 at BPF prog runtime when
> insn->off is negative, rather doing a stronger check at JIT-time. The
> logic supporting this is as follows:
>
> 1) Assume insn->off is negative, call the largest such negative offset
> MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
> insn->off.
>
> 2) *src_reg + insn->off will not wrap over to an unexpected address by
> virtue of negative insn->off, but it might wrap under if
> -insn->off > *src_reg, as that implies *src_reg + insn->off < 0
>
> 3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
> must be true since insn->off is negative.
>
> 4) If we've completed check 1, we know that
> src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
>
> 5) Combining statements 3 and 4, we know src_reg > (TASK_SIZE_MAX + PAGE_SIZE)
>
> 6) By statements 1, 4, and 5, if we can prove
> (TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
> (TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
> values. We can rewrite this as (TASK_SIZE_MAX + PAGE_SIZE) +
> MAX_NEGATIVE_OFF > 0.
>
> Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
> negative, if the previous inequality is true,
> src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
> Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
> negative insn->off vals, src_reg + insn->off > 0 and there can be no
> wrapping under.
>
> So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies
> *src_reg + insn->off > 0 for any src_reg that's passed check 1 and any
> negative insn->off. Luckily the former inequality does not need to be
> checked at runtime, and in fact could be a static_assert if
> TASK_SIZE_MAX wasn't determined by a function when CONFIG_X86_5LEVEL
> kconfig is used.
>
> Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) +
> MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime
> check. Given that insn->off is a s16 and is unlikely to grow larger,
> this check should always succeed on any x86 processor made in the 21st
> century. If it doesn't fail all do_jit calls and complain loudly with
> the assumption that the BPF subsystem is misconfigured or has a bug.
>
> A few instructions are saved for negative insn->offs as a result. Using
> the struct example_node / off = -16 example from before, code looks
> like:
This is quite complex to review. I couldn't convince myself
that dropping the 2nd check is safe, but don't have an argument to
prove that it's not safe.
Let's get to these details when there is a need to support negative off.
>
> BEFORE CHANGE
> 72: movabs $0x800000000010,%r11
> 7c: cmp %r11,%rdi
> 7f: jb 0x000000000000008d (check 1 on 7c and here)
> 81: mov %rdi,%r11
> 84: add $0xfffffffffffffff0,%r11 (check 2, will set carry for almost any r11, so bug for
> 8b: jae 0x0000000000000091 negative insn->off)
> 8d: xor %edi,%edi (as a result long key = n->key; will be 0'd out here)
> 8f: jmp 0x0000000000000095
> 91: mov -0x10(%rdi),%rdi
> 95:
>
> AFTER CHANGE:
> 5a: movabs $0x800000000010,%r11
> 64: cmp %r11,%rdi
> 67: jae 0x000000000000006d (check 1 on 64 and here, but now JNC instead of JC)
> 69: xor %edi,%edi (no check 2, 0 out if %rdi - %r11 < 0)
> 6b: jmp 0x0000000000000071
> 6d: mov -0x10(%rdi),%rdi
> 71:
>
> We could do the same for insn->off == 0, but for now keep code
> generation unchanged for previously working nonnegative insn->offs.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> ---
> arch/x86/net/bpf_jit_comp.c | 123 +++++++++++++++++++++++++++---------
> 1 file changed, 92 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 36ffe67ad6e5..843f619d0d35 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -11,6 +11,7 @@
> #include <linux/bpf.h>
> #include <linux/memory.h>
> #include <linux/sort.h>
> +#include <linux/limits.h>
> #include <asm/extable.h>
> #include <asm/set_memory.h>
> #include <asm/nospec-branch.h>
> @@ -94,6 +95,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
> */
> #define X86_JB 0x72
> #define X86_JAE 0x73
> +#define X86_JNC 0x73
> #define X86_JE 0x74
> #define X86_JNE 0x75
> #define X86_JBE 0x76
> @@ -950,6 +952,36 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
> *pprog = prog;
> }
>
> +/* Check that condition necessary for PROBE_MEM handling for insn->off < 0
> + * holds.
> + *
> + * This could be a static_assert((TASK_SIZE_MAX + PAGE_SIZE) > -S16_MIN),
> + * but TASK_SIZE_MAX can't always be evaluated at compile time, so let's not
> + * assume insn->off size either
> + */
> +static int check_probe_mem_task_size_overflow(void)
> +{
> + struct bpf_insn insn;
> + s64 max_negative;
> +
> + switch (sizeof(insn.off)) {
> + case 2:
> + max_negative = S16_MIN;
> + break;
> + default:
> + pr_err("bpf_jit_error: unexpected bpf_insn->off size\n");
> + return -EFAULT;
> + }
> +
> + if (!((TASK_SIZE_MAX + PAGE_SIZE) > -max_negative)) {
> + pr_err("bpf jit error: assumption does not hold:\n");
> + pr_err("\t(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) > 0\n");
> + return -EFAULT;
> + }
> +
> + return 0;
> +}
> +
> #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
>
> static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
> @@ -967,6 +999,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
> u8 *prog = temp;
> int err;
>
> + err = check_probe_mem_task_size_overflow();
> + if (err)
> + return err;
> +
> detect_reg_usage(insn, insn_cnt, callee_regs_used,
> &tail_call_seen);
>
> @@ -1359,20 +1395,30 @@ st: if (is_imm8(insn->off))
> case BPF_LDX | BPF_MEM | BPF_DW:
> case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
> if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
> - /* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
> - * add abs(insn->off) to the limit to make sure that negative
> - * offset won't be an issue.
> - * insn->off is s16, so it won't affect valid pointers.
> - */
> - u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
> - u8 *end_of_jmp1, *end_of_jmp2;
> -
> /* Conservatively check that src_reg + insn->off is a kernel address:
> - * 1. src_reg + insn->off >= limit
> - * 2. src_reg + insn->off doesn't become small positive.
> - * Cannot do src_reg + insn->off >= limit in one branch,
> - * since it needs two spare registers, but JIT has only one.
> + * 1. src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
> + * 2. src_reg + insn->off doesn't overflow and become small positive
> + *
> + * For check 1, to save regs, do
> + * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off) call rhs
> + * of inequality 'limit'
> + *
> + * For check 2:
> + * If insn->off is positive, add src_reg + insn->off and check
> + * overflow directly
> + * If insn->off is negative, we know that
> + * (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
> + * and from check 1 we know
> + * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
> + * So if (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 we can
> + * be sure that src_reg + insn->off won't overflow in either
> + * direction and avoid runtime check entirely.
> + *
> + * check_probe_mem_task_size_overflow confirms the above assumption
> + * at the beginning of this function
> */
> + u64 limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off;
> + u8 *end_of_jmp1, *end_of_jmp2;
>
> /* movabsq r11, limit */
> EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
> @@ -1381,32 +1427,47 @@ st: if (is_imm8(insn->off))
> /* cmp src_reg, r11 */
> maybe_emit_mod(&prog, src_reg, AUX_REG, true);
> EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
> - /* if unsigned '<' goto end_of_jmp2 */
> - EMIT2(X86_JB, 0);
> - end_of_jmp1 = prog;
> -
> - /* mov r11, src_reg */
> - emit_mov_reg(&prog, true, AUX_REG, src_reg);
> - /* add r11, insn->off */
> - maybe_emit_1mod(&prog, AUX_REG, true);
> - EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
> - /* jmp if not carry to start_of_ldx
> - * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
> - * that has to be rejected.
> - */
> - EMIT2(0x73 /* JNC */, 0);
> - end_of_jmp2 = prog;
> + if (insn->off >= 0) {
> + /* cmp src_reg, r11 */
> + /* if unsigned '<' goto end_of_jmp2 */
> + EMIT2(X86_JB, 0);
> + end_of_jmp1 = prog;
> +
> + /* mov r11, src_reg */
> + emit_mov_reg(&prog, true, AUX_REG, src_reg);
> + /* add r11, insn->off */
> + maybe_emit_1mod(&prog, AUX_REG, true);
> + EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
> + /* jmp if not carry to start_of_ldx
> + * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
> + * that has to be rejected.
> + */
> + EMIT2(X86_JNC, 0);
> + end_of_jmp2 = prog;
> + } else {
> + /* cmp src_reg, r11 */
> + /* if unsigned '>=' goto start_of_ldx
> + * w/o needing to do check 2
> + */
> + EMIT2(X86_JAE, 0);
> + end_of_jmp1 = prog;
> + }
>
> /* xor dst_reg, dst_reg */
> emit_mov_imm32(&prog, false, dst_reg, 0);
> /* jmp byte_after_ldx */
> EMIT2(0xEB, 0);
>
> - /* populate jmp_offset for JB above to jump to xor dst_reg */
> - end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
> - /* populate jmp_offset for JNC above to jump to start_of_ldx */
> start_of_ldx = prog;
> - end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
> + if (insn->off >= 0) {
> + /* populate jmp_offset for JB above to jump to xor dst_reg */
> + end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
> + /* populate jmp_offset for JNC above to jump to start_of_ldx */
> + end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
> + } else {
> + /* populate jmp_offset for JAE above to jump to start_of_ldx */
> + end_of_jmp1[-1] = start_of_ldx - end_of_jmp1;
> + }
> }
> emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
> if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
> --
> 2.30.2
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (12 preceding siblings ...)
2022-12-06 23:10 ` [PATCH bpf-next 13/13] selftests/bpf: Add rbtree selftests Dave Marchevsky
@ 2022-12-07 2:50 ` patchwork-bot+netdevbpf
2022-12-07 19:36 ` Kumar Kartikeya Dwivedi
14 siblings, 0 replies; 50+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-12-07 2:50 UTC (permalink / raw)
To: Dave Marchevsky; +Cc: bpf, ast, daniel, andrii, kernel-team, memxor, tj
Hello:
This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:
On Tue, 6 Dec 2022 15:09:47 -0800 you wrote:
> This series adds a rbtree datastructure following the "next-gen
> datastructure" precedent set by recently-added linked-list [0]. This is
> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
> instead of adding a new map type. This series adds a smaller set of API
> functions than that RFC - just the minimum needed to support current
> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>
> [...]
Here is the summary with links:
- [bpf-next,01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
https://git.kernel.org/bpf/bpf-next/c/d8939cb0a03c
- [bpf-next,02/13] bpf: map_check_btf should fail if btf_parse_fields fails
(no matching commit)
- [bpf-next,03/13] bpf: Minor refactor of ref_set_release_on_unlock
(no matching commit)
- [bpf-next,04/13] bpf: rename list_head -> datastructure_head in field info types
(no matching commit)
- [bpf-next,05/13] bpf: Add basic bpf_rb_{root,node} support
(no matching commit)
- [bpf-next,06/13] bpf: Add bpf_rbtree_{add,remove,first} kfuncs
(no matching commit)
- [bpf-next,07/13] bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args
(no matching commit)
- [bpf-next,08/13] bpf: Add callback validation to kfunc verifier logic
(no matching commit)
- [bpf-next,09/13] bpf: Special verifier handling for bpf_rbtree_{remove, first}
(no matching commit)
- [bpf-next,10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
(no matching commit)
- [bpf-next,11/13] bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h
(no matching commit)
- [bpf-next,12/13] libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj type
(no matching commit)
- [bpf-next,13/13] selftests/bpf: Add rbtree selftests
(no matching commit)
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
2022-12-07 2:39 ` Alexei Starovoitov
@ 2022-12-07 6:46 ` Dave Marchevsky
2022-12-07 18:06 ` Alexei Starovoitov
0 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-07 6:46 UTC (permalink / raw)
To: Alexei Starovoitov, Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On 12/6/22 9:39 PM, Alexei Starovoitov wrote:
> On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
>> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
>> insn->off < 0, but this appears to not be true irrespective of changes
>> in this series. Regardless, changes in this series will result in an
>> example like:
>>
>> struct example_node {
>> long key;
>> long val;
>> struct bpf_rb_node node;
>> }
>>
>> /* In BPF prog, assume root contains example_node nodes */
> struct bpf_rb_node *res = bpf_rbtree_first(&root);
>> if (!res)
>> return 1;
>>
> struct example_node *n = container_of(res, struct example_node, node);
>> long key = n->key;
>>
>> Resulting in a load with off = -16, as bpf_rbtree_first's return is
>
> Looks like the bug in the previous patch:
> + } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
> + meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
> + struct btf_field *field = meta.arg_rbtree_root.field;
> +
> + mark_reg_datastructure_node(regs, BPF_REG_0,
> + &field->datastructure_head);
>
> The R0 .off should have been:
> regs[BPF_REG_0].off = field->rb_node.node_offset;
>
> node, not root.
>
> PTR_TO_BTF_ID should have been returned with approriate 'off',
> so that container_of() would bring it back to zero offset.
>
The root's btf_field is used to hold information about the node type. Of
specific interest to us are value_btf_id and node_offset, which
mark_reg_datastructure_node uses to set REG_0's type and offset correctly.
This "use head type to keep info about node type" strategy felt strange to me
initially too: all PTR_TO_BTF_ID regs are passing around their type info, so
why not use that to lookup bpf_rb_node field info? But consider that
bpf_rbtree_first (and bpf_list_pop_{front,back}) doesn't take a node as
input arg, so there's no opportunity to get btf_field info from input
reg type.
So we'll need to keep this info in rbtree_root's btf_field
regardless, and since any rbtree API function that operates on a node
also operates on a root and expects its node arg to match the node
type expected by the root, we might as well use the root's field as the
main lookup for this info and not even have &field->rb_node for now.
All __process_kf_arg_ptr_to_datastructure_node calls (added earlier
in the series) use the &meta->arg_{list_head,rbtree_root}.field for same
reason.
So it's setting the reg offset correctly.
> All PTR_TO_BTF_ID need to have positive offset.
> I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
> can deal with negative offsets.
> There could be all kinds of things to fix.
I think you may be conflating reg offset and insn offset here. None of the
changes in this series result in a PTR_TO_BTF_ID reg w/ negative offset
being returned. But LLVM may generate load insns with a negative offset,
and since we're passing around pointers to bpf_rb_node that may come
after useful data fields in a type, this will happen more often.
Consider this small example from selftests in this series:
struct node_data {
long key;
long data;
struct bpf_rb_node node;
};
static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
{
struct node_data *node_a;
struct node_data *node_b;
node_a = container_of(a, struct node_data, node);
node_b = container_of(b, struct node_data, node);
return node_a->key < node_b->key;
}
llvm-objdump shows this bpf bytecode for 'less':
0000000000000000 <less>:
; return node_a->key < node_b->key;
0: 79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
1: 79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
2: b4 00 00 00 01 00 00 00 w0 = 0x1
; return node_a->key < node_b->key;
3: cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
4: b4 00 00 00 00 00 00 00 w0 = 0x0
0000000000000028 <LBB2_2>:
; return node_a->key < node_b->key;
5: 95 00 00 00 00 00 00 00 exit
Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
negative insn->off. The verifier's view of R1 and R2 before insn 0 is
untrusted_ptr_node_data(off=16). If there were some intermediate insns
storing the result of container_of() before dereferencing:
r3 = (r2 - 0x10)
r2 = *(u64 *)(r3)
Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
using original arg ptrs to less() instead of storing container_of() ptr
adjustments.
Since the container_of usage and code pattern in the above example's less()
aren't particularly specific to this series, I think there are other scenarios
where such code would be generated, which is why I considered this a general
bugfix in the cover letter.
[ below paragraph was moved here, it originally preceded "All PTR_TO_BTF_ID"
paragraph ]
> The apporach of returning untrusted from bpf_rbtree_first is questionable.
> Without doing that this issue would not have surfaced.
>
I agree re: PTR_UNTRUSTED, but note that my earlier example doesn't involve
bpf_rbtree_first. Regardless, I think the issue is that PTR_UNTRUSTED is
used to denote a few separate traits of a PTR_TO_BTF_ID reg:
* "I have no ownership over the thing I'm pointing to"
* "My backing memory may go away at any time"
* "Access to my fields might result in page fault"
* "Kfuncs shouldn't accept me as an arg"
Seems like original PTR_UNTRUSTED usage really wanted to denote the first
point and the others were just naturally implied from the first. But
as you've noted there are some things using PTR_UNTRUSTED that really
want to make more granular statements:
ref_set_release_on_unlock logic sets release_on_unlock = true and adds
PTR_UNTRUSTED to the reg type. In this case PTR_UNTRUSTED is trying to say:
* "I have no ownership over the thing I'm pointing to"
* "My backing memory may go away at any time _after_ bpf_spin_unlock"
* Before spin_unlock it's guaranteed to be valid
* "Kfuncs shouldn't accept me as an arg"
* We don't want arbitrary kfunc saving and accessing release_on_unlock
reg after bpf_spin_unlock, as its backing memory can go away any time
after spin_unlock.
The "backing memory" statement PTR_UNTRUSTED is making is a blunt superset
of what release_on_unlock really needs.
For less() callback we just want
* "I have no ownership over the thing I'm pointing to"
* "Kfuncs shouldn't accept me as an arg"
There is probably a way to decompose PTR_UNTRUSTED into a few flags such that
it's possible to denote these things separately and avoid unwanted additional
behavior. But after talking to David Vernet about current complexity of
PTR_TRUSTED and PTR_UNTRUSTED logic and his desire to refactor, it seemed
better to continue with PTR_UNTRUSTED blunt instrument with a bit of
special casing for now, instead of piling on more flags.
>
>> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
>> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
>> bpf_rb_node. So it's necessary to support negative insn->off when
>> jitting BPF_PROBE_MEM.
>
> I'm not convinced it's necessary.
> container_of() seems to be the only case where bpf prog can convert
> PTR_TO_BTF_ID with off >= 0 to negative off.
> Normal pointer walking will not make it negative.
>
I see what you mean - if some non-container_of case resulted in load generation
with negative insn->off, this probably would've been noticed already. But
hopefully my replies above explain why it should be addressed now.
>> In order to ensure that page fault for a BPF_PROBE_MEM load of *src_reg +
>> insn->off is safely handled, we must confirm that *src_reg + insn->off is
>> in kernel's memory. Two runtime checks are emitted to confirm that:
>>
>> 1) (*src_reg + insn->off) > boundary between user and kernel address
>> spaces
>> 2) (*src_reg + insn->off) does not overflow to a small positive
>> number. This might happen if some function meant to set src_reg
>> returns ERR_PTR(-EINVAL) or similar.
>>
>> Check 1 currently is slightly off - it compares a
>>
>> u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
>>
>> to *src_reg, aborting the load if limit is larger. Rewriting this as an
>> inequality:
>>
>> *src_reg > TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off)
>> *src_reg - abs(insn->off) > TASK_SIZE_MAX + PAGE_SIZE
>>
>> shows that this isn't quite right even if insn->off is positive, as we
>> really want:
>>
>> *src_reg + insn->off > TASK_SIZE_MAX + PAGE_SIZE
>> *src_reg > TASK_SIZE_MAX + PAGE_SIZE - insn->off
>>
>> Since *src_reg + insn->off is the address we'll be loading from, not
>> *src_reg - insn->off or *src_reg - abs(insn->off). So change the
>> subtraction to an addition and remove the abs(), as comment indicates
>> that it was only added to ignore negative insn->off.
>>
>> For Check 2, currently "does not overflow to a small positive number" is
>> confirmed by emitting an 'add insn->off, src_reg' instruction and
>> checking for carry flag. While this works fine for a positive insn->off,
>> a small negative insn->off like -16 is almost guaranteed to wrap over to
>> a small positive number when added to any kernel address.
>>
>> This patch addresses this by not doing Check 2 at BPF prog runtime when
>> insn->off is negative, rather doing a stronger check at JIT-time. The
>> logic supporting this is as follows:
>>
>> 1) Assume insn->off is negative, call the largest such negative offset
>> MAX_NEGATIVE_OFF. So insn->off >= MAX_NEGATIVE_OFF for all possible
>> insn->off.
>>
>> 2) *src_reg + insn->off will not wrap over to an unexpected address by
>> virtue of negative insn->off, but it might wrap under if
>> -insn->off > *src_reg, as that implies *src_reg + insn->off < 0
>>
>> 3) Inequality (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
>> must be true since insn->off is negative.
>>
>> 4) If we've completed check 1, we know that
>> src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
>>
>> 5) Combining statements 3 and 4, we know src_reg > (TASK_SIZE_MAX + PAGE_SIZE)
>>
>> 6) By statements 1, 4, and 5, if we can prove
>> (TASK_SIZE_MAX + PAGE_SIZE) > -MAX_NEGATIVE_OFF, we'll know that
>> (TASK_SIZE_MAX + PAGE_SIZE) > -insn->off for all possible insn->off
>> values. We can rewrite this as (TASK_SIZE_MAX + PAGE_SIZE) +
>> MAX_NEGATIVE_OFF > 0.
>>
>> Since src_reg > TASK_SIZE_MAX + PAGE_SIZE and MAX_NEGATIVE_OFF is
>> negative, if the previous inequality is true,
>> src_reg + MAX_NEGATIVE_OFF > 0 is also true for all src_reg values.
>> Similarly, since insn->off >= MAX_NEGATIVE_OFF for all possible
>> negative insn->off vals, src_reg + insn->off > 0 and there can be no
>> wrapping under.
>>
>> So proving (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 implies
>> *src_reg + insn->off > 0 for any src_reg that's passed check 1 and any
>> negative insn->off. Luckily the former inequality does not need to be
>> checked at runtime, and in fact could be a static_assert if
>> TASK_SIZE_MAX wasn't determined by a function when CONFIG_X86_5LEVEL
>> kconfig is used.
>>
>> Regardless, we can just check (TASK_SIZE_MAX + PAGE_SIZE) +
>> MAX_NEGATIVE_OFF > 0 once per do_jit call instead of emitting a runtime
>> check. Given that insn->off is a s16 and is unlikely to grow larger,
>> this check should always succeed on any x86 processor made in the 21st
>> century. If it doesn't, fail all do_jit calls and complain loudly with
>> the assumption that the BPF subsystem is misconfigured or has a bug.
>>
>> A few instructions are saved for negative insn->offs as a result. Using
>> the struct example_node / off = -16 example from before, code looks
>> like:
>
> This is quite complex to review. I couldn't convince myself
> that dropping the 2nd check is safe, but don't have an argument to
> prove that it's not safe.
> Let's get to these details when there is a need to support negative off.
>
Hopefully the above explanation shows that there's a need to support it now.
I will try to simplify and rephrase the summary to make it easier to follow,
but will prioritize addressing feedback in less complex patches, so this
patch may not change for a few respins.
>>
>> BEFORE CHANGE
>> 72: movabs $0x800000000010,%r11
>> 7c: cmp %r11,%rdi
>> 7f: jb 0x000000000000008d (check 1 on 7c and here)
>> 81: mov %rdi,%r11
>> 84: add $0xfffffffffffffff0,%r11 (check 2, will set carry for almost any r11, so bug for
>> 8b: jae 0x0000000000000091 negative insn->off)
>> 8d: xor %edi,%edi (as a result long key = n->key; will be 0'd out here)
>> 8f: jmp 0x0000000000000095
>> 91: mov -0x10(%rdi),%rdi
>> 95:
>>
>> AFTER CHANGE:
>> 5a: movabs $0x800000000010,%r11
>> 64: cmp %r11,%rdi
>> 67: jae 0x000000000000006d (check 1 on 64 and here, but now JNC instead of JC)
>> 69: xor %edi,%edi (no check 2, 0 out if %rdi - %r11 < 0)
>> 6b: jmp 0x0000000000000071
>> 6d: mov -0x10(%rdi),%rdi
>> 71:
>>
>> We could do the same for insn->off == 0, but for now keep code
>> generation unchanged for previously working nonnegative insn->offs.
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> ---
>> arch/x86/net/bpf_jit_comp.c | 123 +++++++++++++++++++++++++++---------
>> 1 file changed, 92 insertions(+), 31 deletions(-)
>>
>> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
>> index 36ffe67ad6e5..843f619d0d35 100644
>> --- a/arch/x86/net/bpf_jit_comp.c
>> +++ b/arch/x86/net/bpf_jit_comp.c
>> @@ -11,6 +11,7 @@
>> #include <linux/bpf.h>
>> #include <linux/memory.h>
>> #include <linux/sort.h>
>> +#include <linux/limits.h>
>> #include <asm/extable.h>
>> #include <asm/set_memory.h>
>> #include <asm/nospec-branch.h>
>> @@ -94,6 +95,7 @@ static int bpf_size_to_x86_bytes(int bpf_size)
>> */
>> #define X86_JB 0x72
>> #define X86_JAE 0x73
>> +#define X86_JNC 0x73
>> #define X86_JE 0x74
>> #define X86_JNE 0x75
>> #define X86_JBE 0x76
>> @@ -950,6 +952,36 @@ static void emit_shiftx(u8 **pprog, u32 dst_reg, u8 src_reg, bool is64, u8 op)
>> *pprog = prog;
>> }
>>
>> +/* Check that condition necessary for PROBE_MEM handling for insn->off < 0
>> + * holds.
>> + *
>> + * This could be a static_assert((TASK_SIZE_MAX + PAGE_SIZE) > -S16_MIN),
>> + * but TASK_SIZE_MAX can't always be evaluated at compile time, so let's not
>> + * assume insn->off size either
>> + */
>> +static int check_probe_mem_task_size_overflow(void)
>> +{
>> + struct bpf_insn insn;
>> + s64 max_negative;
>> +
>> + switch (sizeof(insn.off)) {
>> + case 2:
>> + max_negative = S16_MIN;
>> + break;
>> + default:
>> + pr_err("bpf_jit_error: unexpected bpf_insn->off size\n");
>> + return -EFAULT;
>> + }
>> +
>> + if (!((TASK_SIZE_MAX + PAGE_SIZE) > -max_negative)) {
>> + pr_err("bpf jit error: assumption does not hold:\n");
>> + pr_err("\t(TASK_SIZE_MAX + PAGE_SIZE) + (max negative insn->off) > 0\n");
>> + return -EFAULT;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
>>
>> static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image,
>> @@ -967,6 +999,10 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, u8 *rw_image
>> u8 *prog = temp;
>> int err;
>>
>> + err = check_probe_mem_task_size_overflow();
>> + if (err)
>> + return err;
>> +
>> detect_reg_usage(insn, insn_cnt, callee_regs_used,
>> &tail_call_seen);
>>
>> @@ -1359,20 +1395,30 @@ st: if (is_imm8(insn->off))
>> case BPF_LDX | BPF_MEM | BPF_DW:
>> case BPF_LDX | BPF_PROBE_MEM | BPF_DW:
>> if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
>> - /* Though the verifier prevents negative insn->off in BPF_PROBE_MEM
>> - * add abs(insn->off) to the limit to make sure that negative
>> - * offset won't be an issue.
>> - * insn->off is s16, so it won't affect valid pointers.
>> - */
>> - u64 limit = TASK_SIZE_MAX + PAGE_SIZE + abs(insn->off);
>> - u8 *end_of_jmp1, *end_of_jmp2;
>> -
>> /* Conservatively check that src_reg + insn->off is a kernel address:
>> - * 1. src_reg + insn->off >= limit
>> - * 2. src_reg + insn->off doesn't become small positive.
>> - * Cannot do src_reg + insn->off >= limit in one branch,
>> - * since it needs two spare registers, but JIT has only one.
>> + * 1. src_reg + insn->off >= TASK_SIZE_MAX + PAGE_SIZE
>> + * 2. src_reg + insn->off doesn't overflow and become small positive
>> + *
>> + * For check 1, to save regs, do
>> + * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off) call rhs
>> + * of inequality 'limit'
>> + *
>> + * For check 2:
>> + * If insn->off is positive, add src_reg + insn->off and check
>> + * overflow directly
>> + * If insn->off is negative, we know that
>> + * (TASK_SIZE_MAX + PAGE_SIZE - insn->off) > (TASK_SIZE_MAX + PAGE_SIZE)
>> + * and from check 1 we know
>> + * src_reg >= (TASK_SIZE_MAX + PAGE_SIZE - insn->off)
>> + * So if (TASK_SIZE_MAX + PAGE_SIZE) + MAX_NEGATIVE_OFF > 0 we can
>> + * be sure that src_reg + insn->off won't overflow in either
>> + * direction and avoid runtime check entirely.
>> + *
>> + * check_probe_mem_task_size_overflow confirms the above assumption
>> + * at the beginning of this function
>> */
>> + u64 limit = TASK_SIZE_MAX + PAGE_SIZE - insn->off;
>> + u8 *end_of_jmp1, *end_of_jmp2;
>>
>> /* movabsq r11, limit */
>> EMIT2(add_1mod(0x48, AUX_REG), add_1reg(0xB8, AUX_REG));
>> @@ -1381,32 +1427,47 @@ st: if (is_imm8(insn->off))
>> /* cmp src_reg, r11 */
>> maybe_emit_mod(&prog, src_reg, AUX_REG, true);
>> EMIT2(0x39, add_2reg(0xC0, src_reg, AUX_REG));
>> - /* if unsigned '<' goto end_of_jmp2 */
>> - EMIT2(X86_JB, 0);
>> - end_of_jmp1 = prog;
>> -
>> - /* mov r11, src_reg */
>> - emit_mov_reg(&prog, true, AUX_REG, src_reg);
>> - /* add r11, insn->off */
>> - maybe_emit_1mod(&prog, AUX_REG, true);
>> - EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
>> - /* jmp if not carry to start_of_ldx
>> - * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
>> - * that has to be rejected.
>> - */
>> - EMIT2(0x73 /* JNC */, 0);
>> - end_of_jmp2 = prog;
>> + if (insn->off >= 0) {
>> + /* cmp src_reg, r11 */
>> + /* if unsigned '<' goto end_of_jmp2 */
>> + EMIT2(X86_JB, 0);
>> + end_of_jmp1 = prog;
>> +
>> + /* mov r11, src_reg */
>> + emit_mov_reg(&prog, true, AUX_REG, src_reg);
>> + /* add r11, insn->off */
>> + maybe_emit_1mod(&prog, AUX_REG, true);
>> + EMIT2_off32(0x81, add_1reg(0xC0, AUX_REG), insn->off);
>> + /* jmp if not carry to start_of_ldx
>> + * Otherwise ERR_PTR(-EINVAL) + 128 will be the user addr
>> + * that has to be rejected.
>> + */
>> + EMIT2(X86_JNC, 0);
>> + end_of_jmp2 = prog;
>> + } else {
>> + /* cmp src_reg, r11 */
>> + /* if unsigned '>=' goto start_of_ldx
>> + * w/o needing to do check 2
>> + */
>> + EMIT2(X86_JAE, 0);
>> + end_of_jmp1 = prog;
>> + }
>>
>> /* xor dst_reg, dst_reg */
>> emit_mov_imm32(&prog, false, dst_reg, 0);
>> /* jmp byte_after_ldx */
>> EMIT2(0xEB, 0);
>>
>> - /* populate jmp_offset for JB above to jump to xor dst_reg */
>> - end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
>> - /* populate jmp_offset for JNC above to jump to start_of_ldx */
>> start_of_ldx = prog;
>> - end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
>> + if (insn->off >= 0) {
>> + /* populate jmp_offset for JB above to jump to xor dst_reg */
>> + end_of_jmp1[-1] = end_of_jmp2 - end_of_jmp1;
>> + /* populate jmp_offset for JNC above to jump to start_of_ldx */
>> + end_of_jmp2[-1] = start_of_ldx - end_of_jmp2;
>> + } else {
>> + /* populate jmp_offset for JAE above to jump to start_of_ldx */
>> + end_of_jmp1[-1] = start_of_ldx - end_of_jmp1;
>> + }
>> }
>> emit_ldx(&prog, BPF_SIZE(insn->code), dst_reg, src_reg, insn->off);
>> if (BPF_MODE(insn->code) == BPF_PROBE_MEM) {
>> --
>> 2.30.2
>>
* Re: [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
2022-12-06 23:09 ` [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record Dave Marchevsky
@ 2022-12-07 16:41 ` Kumar Kartikeya Dwivedi
2022-12-07 18:34 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-12-07 16:41 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Tejun Heo
On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
> There, a BTF record is created for any type containing a spin_lock or
> any next-gen datastructure node/head.
>
> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
> a record using struct_meta_tab if the reg->type exactly matches
> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
> "allocated obj" type - returned from bpf_obj_new - might pick up other
> flags while working its way through the program.
>
Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
is to make them unpassable to helpers.
> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
> for type_flag.
>
> This patch is marked Fixes as the original intent of reg_btf_record was
> unlikely to have been to fail finding btf_record for valid alloc obj
> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
> are valid register type states for alloc obj independent of this series.
That was the actual intent, same as how check_ptr_to_btf_access uses the exact
reg->type to allow the BPF_WRITE case.
I think this series is the one introducing this case, passing bpf_rbtree_first's
result to bpf_rbtree_remove, which I think is not possible to make safe in the
first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
bpf_list_del due to this exact issue. More in [0].
[0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
> However, I didn't find a specific broken repro case outside of this
> series' added functionality, so it's possible that nothing was
> triggering this logic error before.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects")
> ---
> kernel/bpf/verifier.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1d51bd9596da..67a13110bc22 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -451,6 +451,11 @@ static bool reg_type_not_null(enum bpf_reg_type type)
> type == PTR_TO_SOCK_COMMON;
> }
>
> +static bool type_is_ptr_alloc_obj(u32 type)
> +{
> + return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & MEM_ALLOC;
> +}
> +
> static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
> {
> struct btf_record *rec = NULL;
> @@ -458,7 +463,7 @@ static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
>
> if (reg->type == PTR_TO_MAP_VALUE) {
> rec = reg->map_ptr->record;
> - } else if (reg->type == (PTR_TO_BTF_ID | MEM_ALLOC)) {
> + } else if (type_is_ptr_alloc_obj(reg->type)) {
> meta = btf_find_struct_meta(reg->btf, reg->btf_id);
> if (meta)
> rec = meta->record;
> --
> 2.30.2
>
* Re: [PATCH bpf-next 02/13] bpf: map_check_btf should fail if btf_parse_fields fails
2022-12-06 23:09 ` [PATCH bpf-next 02/13] bpf: map_check_btf should fail if btf_parse_fields fails Dave Marchevsky
2022-12-07 1:32 ` Alexei Starovoitov
@ 2022-12-07 16:49 ` Kumar Kartikeya Dwivedi
2022-12-07 19:05 ` Alexei Starovoitov
1 sibling, 1 reply; 50+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-12-07 16:49 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Tejun Heo
On Wed, Dec 07, 2022 at 04:39:49AM IST, Dave Marchevsky wrote:
> map_check_btf calls btf_parse_fields to create a btf_record for its
> value_type. If there are no special fields in the value_type,
> btf_parse_fields returns NULL, whereas if there are special value_type
> fields but they are invalid in some way, an error is returned.
>
> An example invalid state would be:
>
> struct node_data {
> struct bpf_rb_node node;
> int data;
> };
>
> private(A) struct bpf_spin_lock glock;
> private(A) struct bpf_list_head ghead __contains(node_data, node);
>
> ghead should be invalid as its __contains tag points to a field with
> type != "bpf_list_node".
>
> Before this patch, such a scenario would result in btf_parse_fields
> returning an error ptr, subsequent !IS_ERR_OR_NULL check failing,
> and btf_check_and_fixup_fields returning 0, which would then be
> returned by map_check_btf.
>
> After this patch's changes, -EINVAL would be returned by map_check_btf
> and the map would correctly fail to load.
>
> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Fixes: aa3496accc41 ("bpf: Refactor kptr_off_tab into btf_record")
> ---
> kernel/bpf/syscall.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 35972afb6850..c3599a7902f0 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -1007,7 +1007,10 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> map->record = btf_parse_fields(btf, value_type,
> BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD,
> map->value_size);
> - if (!IS_ERR_OR_NULL(map->record)) {
> + if (IS_ERR(map->record))
> + return -EINVAL;
> +
I didn't do this on purpose, because of backward compatibility concerns. An
error has not been returned in earlier kernel versions during map creation time
and those fields acted like normal non-special regions, with errors on use of
helpers that act on those fields.
Especially since bpf_spin_lock and bpf_timer are part of the unified btf_record.
If we are doing such a change, then you should also drop the checks for IS_ERR
in verifier.c, since that shouldn't be possible anymore. But I think we need to
think carefully before changing this.
One possible example: if we introduce bpf_foo in the future and a program
already has that type defined in its map value, using it for some other purpose
with different alignment and size, its map creation will start failing.
* Re: [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
2022-12-07 6:46 ` Dave Marchevsky
@ 2022-12-07 18:06 ` Alexei Starovoitov
2022-12-07 23:39 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 18:06 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On Wed, Dec 07, 2022 at 01:46:56AM -0500, Dave Marchevsky wrote:
> On 12/6/22 9:39 PM, Alexei Starovoitov wrote:
> > On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
> >> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
> >> insn->off < 0, but this appears to not be true irrespective of changes
> >> in this series. Regardless, changes in this series will result in an
> >> example like:
> >>
> >> struct example_node {
> >> long key;
> >> long val;
> >> struct bpf_rb_node node;
> >> };
> >>
> >> /* In BPF prog, assume root contains example_node nodes */
> >> struct bpf_rb_node *res = bpf_rbtree_first(&root);
> >> if (!res)
> >> return 1;
> >>
> >> struct example_node *n = container_of(res, struct example_node, node);
> >> long key = n->key;
> >>
> >> Resulting in a load with off = -16, as bpf_rbtree_first's return is
> >
> > Looks like the bug in the previous patch:
> > + } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
> > + meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
> > + struct btf_field *field = meta.arg_rbtree_root.field;
> > +
> > + mark_reg_datastructure_node(regs, BPF_REG_0,
> > + &field->datastructure_head);
> >
> > The R0 .off should have been:
> > regs[BPF_REG_0].off = field->rb_node.node_offset;
> >
> > node, not root.
> >
> > PTR_TO_BTF_ID should have been returned with appropriate 'off',
> > so that container_of() would bring it back to zero offset.
> >
>
> The root's btf_field is used to hold information about the node type. Of
> specific interest to us are value_btf_id and node_offset, which
> mark_reg_datastructure_node uses to set REG_0's type and offset correctly.
>
> This "use head type to keep info about node type" strategy felt strange to me
> initially too: all PTR_TO_BTF_ID regs are passing around their type info, so
> why not use that to lookup bpf_rb_node field info? But consider that
> bpf_rbtree_first (and bpf_list_pop_{front,back}) doesn't take a node as
> input arg, so there's no opportunity to get btf_field info from input
> reg type.
>
> So we'll need to keep this info in rbtree_root's btf_field
> regardless, and since any rbtree API function that operates on a node
> also operates on a root and expects its node arg to match the node
> type expected by the root, might as well use root's field as the main
> lookup for this info and not even have &field->rb_node for now.
> All __process_kf_arg_ptr_to_datastructure_node calls (added earlier
> in the series) use the &meta->arg_{list_head,rbtree_root}.field for same
> reason.
>
> So it's setting the reg offset correctly.
Ok. Got it. Then the commit log is incorrectly describing the failing scenario.
It's a container_of() inside bool less() that is generating negative offsets.
> > All PTR_TO_BTF_ID need to have positive offset.
> > I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
> > can deal with negative offsets.
> > There could be all kinds of things to fix.
>
> I think you may be conflating reg offset and insn offset here. None of the
> changes in this series result in a PTR_TO_BTF_ID reg w/ negative offset
> being returned. But LLVM may generate load insns with a negative offset,
> and since we're passing around pointers to bpf_rb_node that may come
> after useful data fields in a type, this will happen more often.
>
> Consider this small example from selftests in this series:
>
> struct node_data {
> long key;
> long data;
> struct bpf_rb_node node;
> };
>
> static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
> {
> struct node_data *node_a;
> struct node_data *node_b;
>
> node_a = container_of(a, struct node_data, node);
> node_b = container_of(b, struct node_data, node);
>
> return node_a->key < node_b->key;
> }
>
> llvm-objdump shows this bpf bytecode for 'less':
>
> 0000000000000000 <less>:
> ; return node_a->key < node_b->key;
> 0: 79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
> 1: 79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
> 2: b4 00 00 00 01 00 00 00 w0 = 0x1
> ; return node_a->key < node_b->key;
I see. That's the same bug.
The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with
correct positive offset.
Then node_a = container_of(a, struct node_data, node);
would have produced correct offset into proper btf_id.
The verifier should be passing into less() the btf_id
of struct node_data instead of btf_id of struct bpf_rb_node.
> 3: cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
> 4: b4 00 00 00 00 00 00 00 w0 = 0x0
>
> 0000000000000028 <LBB2_2>:
> ; return node_a->key < node_b->key;
> 5: 95 00 00 00 00 00 00 00 exit
>
> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
> negative insn->off. Verifier's view of R1 and R2 before insn 0 is
> untrusted_ptr_node_data(off=16). If there were some intermediate insns
> storing result of container_of() before dereferencing:
>
> r3 = (r2 - 0x10)
> r2 = *(u64 *)(r3)
>
> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
> using original arg ptrs to less() instead of storing container_of() ptr
> adjustments.
>
> Since the container_of usage and code pattern in the above example's less()
> isn't particularly specific to this series, I think there are other scenarios
> where such code would be generated, so I considered this a general bugfix in
> the cover letter.
imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.
>
> [ below paragraph was moved here, it originally preceded "All PTR_TO_BTF_ID"
> paragraph ]
>
> > The approach of returning untrusted from bpf_rbtree_first is questionable.
> > Without doing that this issue would not have surfaced.
> >
>
> I agree re: PTR_UNTRUSTED, but note that my earlier example doesn't involve
> bpf_rbtree_first. Regardless, I think the issue is that PTR_UNTRUSTED is
> used to denote a few separate traits of a PTR_TO_BTF_ID reg:
>
> * "I have no ownership over the thing I'm pointing to"
> * "My backing memory may go away at any time"
> * "Access to my fields might result in page fault"
> * "Kfuncs shouldn't accept me as an arg"
>
> Seems like original PTR_UNTRUSTED usage really wanted to denote the first
> point and the others were just naturally implied from the first. But
> as you've noted there are some things using PTR_UNTRUSTED that really
> want to make more granular statements:
I think PTR_UNTRUSTED implies all of the above. All 4 statements are connected.
> ref_set_release_on_unlock logic sets release_on_unlock = true and adds
> PTR_UNTRUSTED to the reg type. In this case PTR_UNTRUSTED is trying to say:
>
> * "I have no ownership over the thing I'm pointing to"
> * "My backing memory may go away at any time _after_ bpf_spin_unlock"
> * Before spin_unlock it's guaranteed to be valid
> * "Kfuncs shouldn't accept me as an arg"
> * We don't want arbitrary kfunc saving and accessing release_on_unlock
> reg after bpf_spin_unlock, as its backing memory can go away any time
> after spin_unlock.
>
> The "backing memory" statement PTR_UNTRUSTED is making is a blunt superset
> of what release_on_unlock really needs.
>
> For less() callback we just want
>
> * "I have no ownership over the thing I'm pointing to"
> * "Kfuncs shouldn't accept me as an arg"
>
> There is probably a way to decompose PTR_UNTRUSTED into a few flags such that
> it's possible to denote these things separately and avoid unwanted additional
> behavior. But after talking to David Vernet about current complexity of
> PTR_TRUSTED and PTR_UNTRUSTED logic and his desire to refactor, it seemed
> better to continue with PTR_UNTRUSTED blunt instrument with a bit of
> special casing for now, instead of piling on more flags.
Exactly. More flags will only increase the confusion.
Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
rbtree kfuncs while inside this particular callback to prevent recursion.
That would solve all these issues, no?
Writing into such PTR_TRUSTED should be still allowed inside cb though it's bogus.
Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
both link list and rbtree.
It should still be safe to operate on link list part of that node from less()
though it's not something we would ever recommend.
The kfunc call on rb tree part of struct node_data is problematic because
of recursion, right? No other safety concerns?
> >
> >> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
> >> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
> >> bpf_rb_node. So it's necessary to support negative insn->off when
> >> jitting BPF_PROBE_MEM.
> >
> > I'm not convinced it's necessary.
> > container_of() seems to be the only case where bpf prog can convert
> > PTR_TO_BTF_ID with off >= 0 to negative off.
> > Normal pointer walking will not make it negative.
> >
>
> I see what you mean - if some non-container_of case resulted in load generation
> with negative insn->off, this probably would've been noticed already. But
> hopefully my replies above explain why it should be addressed now.
Even with container_of() usage we should be passing proper btf_id of container
struct, so that callbacks and non-callbacks can properly container_of() it
and still get offset >= 0.
> >>
> >> A few instructions are saved for negative insn->offs as a result. Using
> >> the struct example_node / off = -16 example from before, code looks
> >> like:
> >
> > This is quite complex to review. I couldn't convince myself
> > that dropping the 2nd check is safe, but I don't have an argument to
> > prove that it's not safe.
> > Let's get to these details when there is need to support negative off.
> >
>
> Hopefully above explanation shows that there's need to support it now.
> I will try to simplify and rephrase the summary to make it easier to follow,
> but will prioritize addressing feedback in less complex patches, so this
> patch may not change for a few respins.
I'm not saying that this patch will never be needed.
Supporting negative offsets here is a good thing.
I'm arguing that it's not necessary to enable bpf_rbtree.
* Re: [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
2022-12-07 16:41 ` Kumar Kartikeya Dwivedi
@ 2022-12-07 18:34 ` Dave Marchevsky
2022-12-07 18:59 ` Alexei Starovoitov
2022-12-07 19:03 ` Kumar Kartikeya Dwivedi
0 siblings, 2 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-07 18:34 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Tejun Heo
On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
>> There, a BTF record is created for any type containing a spin_lock or
>> any next-gen datastructure node/head.
>>
>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
>> a record using struct_meta_tab if the reg->type exactly matches
>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
>> "allocated obj" type - returned from bpf_obj_new - might pick up other
>> flags while working its way through the program.
>>
>
> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
> Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
> is to make them unpassable to helpers.
>
I see what you mean. If reg_btf_record is only used on regs which are args,
then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
but then we have its use in reg_may_point_to_spin_lock, which is itself
used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
sure that it's only used on arg regs currently.
Regardless, if the intended use is on arg regs only, it should be renamed to
arg_reg_btf_record or similar to make that clear, as current name sounds like
it should be applicable to any reg, and thus not enforce constraints particular
to arg regs.
But I think it's better to leave it general and enforce those constraints
elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
big switch statements for KF_ARG_* are doing exact type matching.
>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
>> for type_flag.
>>
>> This patch is marked Fixes as the original intent of reg_btf_record was
>> unlikely to have been to fail finding btf_record for valid alloc obj
>> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
>> are valid register type states for alloc obj independent of this series.
>
> That was the actual intent, same as how check_ptr_to_btf_access uses the exact
> reg->type to allow the BPF_WRITE case.
>
> I think this series is the one introducing this case, passing bpf_rbtree_first's
> result to bpf_rbtree_remove, which I think is not possible to make safe in the
> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
> bpf_list_del due to this exact issue. More in [0].
>
> [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
>
Thanks for the link, I better understand what Alexei meant in his comment on
patch 9 of this series. For the helpers added in this series, we can make
bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
refs after the rbtree_remove in same manner as they're invalidated after
spin_unlock currently.
Logic for why this is safe:
* If we have two non-owning refs to nodes in a tree, e.g. from
bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
we have no way of knowing if they're aliases of same node.
* If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
it might be removing a node that's already been removed, e.g.:
n = bpf_obj_new(...);
bpf_spin_lock(&lock);
bpf_rbtree_add(&tree, &n->node);
// n is now non-owning ref to node which was added
res = bpf_rbtree_first(&tree);
if (!res) {}
m = container_of(res, struct node_data, node);
// m is now non-owning ref to the same node
bpf_rbtree_remove(&tree, &n->node);
bpf_rbtree_remove(&tree, &m->node); // BAD
bpf_spin_unlock(&lock);
* bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
of pointing to something that was already removed _only_ after a
rbtree_remove, so if we invalidate them all after rbtree_remove they can't
be inputs to subsequent remove()s
This does conflate current "release non-owning refs because it's not safe to
read from them" reasoning with new "release non-owning refs so they can't be
passed to remove()". Ideally we could add some new tag to these refs that
prevents them from being passed to remove()-type fns, but does allow them to
be read, e.g.:
n = bpf_obj_new(...);
bpf_spin_lock(&lock);
bpf_rbtree_add(&tree, &n->node);
// n is now non-owning ref to node which was added
res = bpf_rbtree_first(&tree);
if (!res) {}
m = container_of(res, struct node_data, node);
// m is now non-owning ref to the same node
n = bpf_rbtree_remove(&tree, &n->node);
// n is now owning ref again, m is non-owning ref to same node
x = m->key; // this should be safe since we're still in CS
bpf_rbtree_remove(&tree, &m->node); // But this should be prevented
bpf_spin_unlock(&lock);
But this would introduce too much addt'l complexity for now IMO. The proposal
of just invalidating all non-owning refs prevents both the unsafe second
remove() and the safe x = m->key.
I will give it a shot; if it doesn't work, I can change rbtree_remove to
rbtree_remove_first w/o node param. But per that linked convo such logic
should be tackled eventually, might as well chip away at it now.
>> However, I didn't find a specific broken repro case outside of this
>> series' added functionality, so it's possible that nothing was
>> triggering this logic error before.
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> Fixes: 4e814da0d599 ("bpf: Allow locking bpf_spin_lock in allocated objects")
>> ---
>> kernel/bpf/verifier.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 1d51bd9596da..67a13110bc22 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -451,6 +451,11 @@ static bool reg_type_not_null(enum bpf_reg_type type)
>> type == PTR_TO_SOCK_COMMON;
>> }
>>
>> +static bool type_is_ptr_alloc_obj(u32 type)
>> +{
>> + return base_type(type) == PTR_TO_BTF_ID && type_flag(type) & MEM_ALLOC;
>> +}
>> +
>> static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
>> {
>> struct btf_record *rec = NULL;
>> @@ -458,7 +463,7 @@ static struct btf_record *reg_btf_record(const struct bpf_reg_state *reg)
>>
>> if (reg->type == PTR_TO_MAP_VALUE) {
>> rec = reg->map_ptr->record;
>> - } else if (reg->type == (PTR_TO_BTF_ID | MEM_ALLOC)) {
>> + } else if (type_is_ptr_alloc_obj(reg->type)) {
>> meta = btf_find_struct_meta(reg->btf, reg->btf_id);
>> if (meta)
>> rec = meta->record;
>> --
>> 2.30.2
>>
* Re: [PATCH bpf-next 04/13] bpf: rename list_head -> datastructure_head in field info types
2022-12-07 1:41 ` Alexei Starovoitov
@ 2022-12-07 18:52 ` Dave Marchevsky
2022-12-07 19:01 ` Alexei Starovoitov
0 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-07 18:52 UTC (permalink / raw)
To: Alexei Starovoitov, Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On 12/6/22 8:41 PM, Alexei Starovoitov wrote:
> On Tue, Dec 06, 2022 at 03:09:51PM -0800, Dave Marchevsky wrote:
>> Many of the structs recently added to track field info for linked-list
>> head are useful as-is for rbtree root. So let's do a mechanical renaming
>> of list_head-related types and fields:
>>
>> include/linux/bpf.h:
>> struct btf_field_list_head -> struct btf_field_datastructure_head
>> list_head -> datastructure_head in struct btf_field union
>> kernel/bpf/btf.c:
>> list_head -> datastructure_head in struct btf_field_info
>
> Looking through this patch and the others, the 'datastructure head' name
> eventually becomes confusing.
> I'm not sure what is 'head' of the data structure.
> There is head in the link list, but 'head of tree' is odd.
>
> The attempt here is to find a common name that represents the programming
> concept where there is a 'root' and there are 'nodes' added to that 'root'.
> The 'data structure' name is too broad in that sense.
> Especially later it becomes 'datastructure_api' which is even broader.
>
> I was thinking to propose:
> struct btf_field_list_head -> struct btf_field_tree_root
> list_head -> tree_root in struct btf_field union
>
> and is_kfunc_tree_api later...
> since link list is a tree too.
>
> But reading 'tree' next to other names like 'field', 'kfunc'
> it might be mistaken that 'tree' applies to the former.
> So I think using 'graph' as more general concept to describe both
> link list and rb-tree would be the best.
>
> So the proposal:
> struct btf_field_list_head -> struct btf_field_graph_root
> list_head -> graph_root in struct btf_field union
>
> and is_kfunc_graph_api later...
>
> 'graph' is short enough and rarely used in names,
> so it stands on its own next to 'field' and in combination
> with other names.
> wdyt?
>
I'm not a huge fan of 'graph', but it's certainly better than
'datastructure_api', and avoids the "all next-gen datastructures must do this"
implication of a 'ng_ds' name. So will try the rename in v2.
(all specific GRAPH naming suggestions in subsequent patches will
be done as well)
list 'head' -> list 'root' SGTM as well. Not ideal, but alternatives
are worse (rbtree 'head'...)
>>
>> This is a nonfunctional change, functionality to actually use these
>> fields for rbtree will be added in further patches.
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> ---
>> include/linux/bpf.h | 4 ++--
>> kernel/bpf/btf.c | 21 +++++++++++----------
>> kernel/bpf/helpers.c | 4 ++--
>> kernel/bpf/verifier.c | 21 +++++++++++----------
>> 4 files changed, 26 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 4920ac252754..9e8b12c7061e 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -189,7 +189,7 @@ struct btf_field_kptr {
>> u32 btf_id;
>> };
>>
>> -struct btf_field_list_head {
>> +struct btf_field_datastructure_head {
>> struct btf *btf;
>> u32 value_btf_id;
>> u32 node_offset;
>> @@ -201,7 +201,7 @@ struct btf_field {
>> enum btf_field_type type;
>> union {
>> struct btf_field_kptr kptr;
>> - struct btf_field_list_head list_head;
>> + struct btf_field_datastructure_head datastructure_head;
>> };
>> };
>>
>> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
>> index c80bd8709e69..284e3e4b76b7 100644
>> --- a/kernel/bpf/btf.c
>> +++ b/kernel/bpf/btf.c
>> @@ -3227,7 +3227,7 @@ struct btf_field_info {
>> struct {
>> const char *node_name;
>> u32 value_btf_id;
>> - } list_head;
>> + } datastructure_head;
>> };
>> };
>>
>> @@ -3334,8 +3334,8 @@ static int btf_find_list_head(const struct btf *btf, const struct btf_type *pt,
>> return -EINVAL;
>> info->type = BPF_LIST_HEAD;
>> info->off = off;
>> - info->list_head.value_btf_id = id;
>> - info->list_head.node_name = list_node;
>> + info->datastructure_head.value_btf_id = id;
>> + info->datastructure_head.node_name = list_node;
>> return BTF_FIELD_FOUND;
>> }
>>
>> @@ -3603,13 +3603,14 @@ static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
>> u32 offset;
>> int i;
>>
>> - t = btf_type_by_id(btf, info->list_head.value_btf_id);
>> + t = btf_type_by_id(btf, info->datastructure_head.value_btf_id);
>> /* We've already checked that value_btf_id is a struct type. We
>> * just need to figure out the offset of the list_node, and
>> * verify its type.
>> */
>> for_each_member(i, t, member) {
>> - if (strcmp(info->list_head.node_name, __btf_name_by_offset(btf, member->name_off)))
>> + if (strcmp(info->datastructure_head.node_name,
>> + __btf_name_by_offset(btf, member->name_off)))
>> continue;
>> /* Invalid BTF, two members with same name */
>> if (n)
>> @@ -3626,9 +3627,9 @@ static int btf_parse_list_head(const struct btf *btf, struct btf_field *field,
>> if (offset % __alignof__(struct bpf_list_node))
>> return -EINVAL;
>>
>> - field->list_head.btf = (struct btf *)btf;
>> - field->list_head.value_btf_id = info->list_head.value_btf_id;
>> - field->list_head.node_offset = offset;
>> + field->datastructure_head.btf = (struct btf *)btf;
>> + field->datastructure_head.value_btf_id = info->datastructure_head.value_btf_id;
>> + field->datastructure_head.node_offset = offset;
>> }
>> if (!n)
>> return -ENOENT;
>> @@ -3735,11 +3736,11 @@ int btf_check_and_fixup_fields(const struct btf *btf, struct btf_record *rec)
>>
>> if (!(rec->fields[i].type & BPF_LIST_HEAD))
>> continue;
>> - btf_id = rec->fields[i].list_head.value_btf_id;
>> + btf_id = rec->fields[i].datastructure_head.value_btf_id;
>> meta = btf_find_struct_meta(btf, btf_id);
>> if (!meta)
>> return -EFAULT;
>> - rec->fields[i].list_head.value_rec = meta->record;
>> + rec->fields[i].datastructure_head.value_rec = meta->record;
>>
>> if (!(rec->field_mask & BPF_LIST_NODE))
>> continue;
>> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
>> index cca642358e80..6c67740222c2 100644
>> --- a/kernel/bpf/helpers.c
>> +++ b/kernel/bpf/helpers.c
>> @@ -1737,12 +1737,12 @@ void bpf_list_head_free(const struct btf_field *field, void *list_head,
>> while (head != orig_head) {
>> void *obj = head;
>>
>> - obj -= field->list_head.node_offset;
>> + obj -= field->datastructure_head.node_offset;
>> head = head->next;
>> /* The contained type can also have resources, including a
>> * bpf_list_head which needs to be freed.
>> */
>> - bpf_obj_free_fields(field->list_head.value_rec, obj);
>> + bpf_obj_free_fields(field->datastructure_head.value_rec, obj);
>> /* bpf_mem_free requires migrate_disable(), since we can be
>> * called from map free path as well apart from BPF program (as
>> * part of map ops doing bpf_obj_free_fields).
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 6f0aac837d77..bc80b4c4377b 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -8615,21 +8615,22 @@ static int process_kf_arg_ptr_to_list_node(struct bpf_verifier_env *env,
>>
>> field = meta->arg_list_head.field;
>>
>> - et = btf_type_by_id(field->list_head.btf, field->list_head.value_btf_id);
>> + et = btf_type_by_id(field->datastructure_head.btf, field->datastructure_head.value_btf_id);
>> t = btf_type_by_id(reg->btf, reg->btf_id);
>> - if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, 0, field->list_head.btf,
>> - field->list_head.value_btf_id, true)) {
>> + if (!btf_struct_ids_match(&env->log, reg->btf, reg->btf_id, 0, field->datastructure_head.btf,
>> + field->datastructure_head.value_btf_id, true)) {
>> verbose(env, "operation on bpf_list_head expects arg#1 bpf_list_node at offset=%d "
>> "in struct %s, but arg is at offset=%d in struct %s\n",
>> - field->list_head.node_offset, btf_name_by_offset(field->list_head.btf, et->name_off),
>> + field->datastructure_head.node_offset,
>> + btf_name_by_offset(field->datastructure_head.btf, et->name_off),
>> list_node_off, btf_name_by_offset(reg->btf, t->name_off));
>> return -EINVAL;
>> }
>>
>> - if (list_node_off != field->list_head.node_offset) {
>> + if (list_node_off != field->datastructure_head.node_offset) {
>> verbose(env, "arg#1 offset=%d, but expected bpf_list_node at offset=%d in struct %s\n",
>> - list_node_off, field->list_head.node_offset,
>> - btf_name_by_offset(field->list_head.btf, et->name_off));
>> + list_node_off, field->datastructure_head.node_offset,
>> + btf_name_by_offset(field->datastructure_head.btf, et->name_off));
>> return -EINVAL;
>> }
>> /* Set arg#1 for expiration after unlock */
>> @@ -9078,9 +9079,9 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>>
>> mark_reg_known_zero(env, regs, BPF_REG_0);
>> regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
>> - regs[BPF_REG_0].btf = field->list_head.btf;
>> - regs[BPF_REG_0].btf_id = field->list_head.value_btf_id;
>> - regs[BPF_REG_0].off = field->list_head.node_offset;
>> + regs[BPF_REG_0].btf = field->datastructure_head.btf;
>> + regs[BPF_REG_0].btf_id = field->datastructure_head.value_btf_id;
>> + regs[BPF_REG_0].off = field->datastructure_head.node_offset;
>> } else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) {
>> mark_reg_known_zero(env, regs, BPF_REG_0);
>> regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
>> --
>> 2.30.2
>>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
2022-12-07 18:34 ` Dave Marchevsky
@ 2022-12-07 18:59 ` Alexei Starovoitov
2022-12-07 20:38 ` Dave Marchevsky
2022-12-07 19:03 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 18:59 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote:
> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
> > On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
> >> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
> >> There, a BTF record is created for any type containing a spin_lock or
> >> any next-gen datastructure node/head.
> >>
> >> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
> >> a record using struct_meta_tab if the reg->type exactly matches
> >> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
> >> "allocated obj" type - returned from bpf_obj_new - might pick up other
> >> flags while working its way through the program.
> >>
> >
> > Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
> > passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
> > Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
> > cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
> > is to make them unpassable to helpers.
> >
>
> I see what you mean. If reg_btf_record is only used on regs which are args,
> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
> but then we have its use in reg_may_point_to_spin_lock, which is itself
> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
> sure that it's only used on arg regs currently.
>
> Regardless, if the intended use is on arg regs only, it should be renamed to
> arg_reg_btf_record or similar to make that clear, as current name sounds like
> it should be applicable to any reg, and thus not enforce constraints particular
> to arg regs.
>
> But I think it's better to leave it general and enforce those constraints
> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
> big switch statements for KF_ARG_* are doing exact type matching.
>
> >> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
> >> for type_flag.
> >>
> >> This patch is marked Fixes as the original intent of reg_btf_record was
> >> unlikely to have been to fail finding btf_record for valid alloc obj
> >> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
> >> are valid register type states for alloc obj independent of this series.
> >
> > That was the actual intent, same as how check_ptr_to_btf_access uses the exact
> > reg->type to allow the BPF_WRITE case.
> >
> > I think this series is the one introducing this case, passing bpf_rbtree_first's
> > result to bpf_rbtree_remove, which I think is not possible to make safe in the
> > first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
> > bpf_list_del due to this exact issue. More in [0].
> >
> > [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
> >
>
> Thanks for the link, I better understand what Alexei meant in his comment on
> patch 9 of this series. For the helpers added in this series, we can make
> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
> refs after the rbtree_remove in same manner as they're invalidated after
> spin_unlock currently.
>
> Logic for why this is safe:
>
> * If we have two non-owning refs to nodes in a tree, e.g. from
> bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
> we have no way of knowing if they're aliases of same node.
>
> * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
> it might be removing a node that's already been removed, e.g.:
>
> n = bpf_obj_new(...);
> bpf_spin_lock(&lock);
>
> bpf_rbtree_add(&tree, &n->node);
> // n is now non-owning ref to node which was added
> res = bpf_rbtree_first();
> if (!res) {}
> m = container_of(res, struct node_data, node);
> // m is now non-owning ref to the same node
> bpf_rbtree_remove(&tree, &n->node);
> bpf_rbtree_remove(&tree, &m->node); // BAD
Let me clarify my previous email:
Above doesn't have to be 'BAD'.
Instead of
if (WARN_ON_ONCE(RB_EMPTY_NODE(n)))
we can drop WARN and simply return.
If node is not part of the tree -> nop.
Same for bpf_rbtree_add.
If it's already added -> nop.
Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics.
We do all these checks under the same rbtree root lock, so it's safe.
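The nop semantics proposed above can be sketched in plain C. The following is a userspace model, not the kernel implementation: an explicit `linked` flag stands in for the `RB_EMPTY_NODE()` check, and the rbtree bookkeeping is elided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal model of the proposed run-time checks: remove() on a node
 * that is not in the tree, and add() on a node that already is, become
 * nops instead of WARNing. */
struct node { bool linked; };

/* Returns true only if the node was actually removed. */
static bool rbtree_remove(struct node *n)
{
	if (!n->linked)		/* RB_EMPTY_NODE() analogue: not in tree -> nop */
		return false;
	n->linked = false;	/* real code would rb_erase_cached() here */
	return true;
}

/* Returns true only if the node was actually inserted. */
static bool rbtree_add(struct node *n)
{
	if (n->linked)		/* already in the tree -> nop */
		return false;
	n->linked = true;	/* real code would rb_add_cached() here */
	return true;
}
```

With these checks done under the tree's lock, a double remove through an aliasing reference degrades to a harmless nop instead of corrupting the tree.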
> bpf_spin_unlock(&lock);
>
> * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
> of pointing to something that was already removed _only_ after a
> rbtree_remove, so if we invalidate them all after rbtree_remove they can't
> be inputs to subsequent remove()s
With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add
can have release semantics.
No need for special release_on_unlock hacks.
> This does conflate current "release non-owning refs because it's not safe to
> read from them" reasoning with new "release non-owning refs so they can't be
> passed to remove()". Ideally we could add some new tag to these refs that
> prevents them from being passed to remove()-type fns, but does allow them to
> be read, e.g.:
>
> n = bpf_obj_new(...);
'n' is acquired.
> bpf_spin_lock(&lock);
>
> bpf_rbtree_add(&tree, &n->node);
> // n is now non-owning ref to node which was added
since bpf_rbtree_add does release on 'n'...
> res = bpf_rbtree_first();
> if (!m) {}
> m = container_of(res, struct node_data, node);
> // m is now non-owning ref to the same node
... below is not allowed by the verifier.
> n = bpf_rbtree_remove(&tree, &n->node);
I'm not sure what the idea is behind returning 'n' from remove...
Maybe it should be simple bool ?
> // n is now owning ref again, m is non-owning ref to same node
> x = m->key; // this should be safe since we're still in CS
below works because 'm' comes from bpf_rbtree_first, which acquired 'res'.
> bpf_rbtree_remove(&tree, &m->node); // But this should be prevented
>
> bpf_spin_unlock(&lock);
>
* Re: [PATCH bpf-next 04/13] bpf: rename list_head -> datastructure_head in field info types
2022-12-07 18:52 ` Dave Marchevsky
@ 2022-12-07 19:01 ` Alexei Starovoitov
0 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 19:01 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On Wed, Dec 07, 2022 at 01:52:07PM -0500, Dave Marchevsky wrote:
> On 12/6/22 8:41 PM, Alexei Starovoitov wrote:
> > On Tue, Dec 06, 2022 at 03:09:51PM -0800, Dave Marchevsky wrote:
> >> Many of the structs recently added to track field info for linked-list
> >> head are useful as-is for rbtree root. So let's do a mechanical renaming
> >> of list_head-related types and fields:
> >>
> >> include/linux/bpf.h:
> >> struct btf_field_list_head -> struct btf_field_datastructure_head
> >> list_head -> datastructure_head in struct btf_field union
> >> kernel/bpf/btf.c:
> >> list_head -> datastructure_head in struct btf_field_info
> >
> > Looking through this patch and others it eventually becomes
> > confusing with 'datastructure head' name.
> > I'm not sure what is 'head' of the data structure.
> > There is head in the link list, but 'head of tree' is odd.
> >
> > The attempt here is to find a common name that represents programming
> > concept where there is a 'root' and there are 'nodes' that added to that 'root'.
> > The 'data structure' name is too broad in that sense.
> > Especially later it becomes 'datastructure_api' which is even broader.
> >
> > I was thinking to propose:
> > struct btf_field_list_head -> struct btf_field_tree_root
> > list_head -> tree_root in struct btf_field union
> >
> > and is_kfunc_tree_api later...
> > since link list is a tree too.
> >
> > But reading 'tree' next to other names like 'field', 'kfunc'
> > it might be mistaken that 'tree' applies to the former.
> > So I think using 'graph' as more general concept to describe both
> > link list and rb-tree would be the best.
> >
> > So the proposal:
> > struct btf_field_list_head -> struct btf_field_graph_root
> > list_head -> graph_root in struct btf_field union
> >
> > and is_kfunc_graph_api later...
> >
> > 'graph' is short enough and rarely used in names,
> > so it stands on its own next to 'field' and in combination
> > with other names.
> > wdyt?
> >
>
> I'm not a huge fan of 'graph', but it's certainly better than
> 'datastructure_api', and avoids the "all next-gen datastructures must do this"
> implication of a 'ng_ds' name. So will try the rename in v2.
fwiw I don't like 'next-' bit in 'next-gen ds'.
A year from now the 'next' will sound really old.
Just like N in NAPI used to be 'new'.
> (all specific GRAPH naming suggestions in subsequent patches will
> be done as well)
>
> list 'head' -> list 'root' SGTM as well. Not ideal, but alternatives
> are worse (rbtree 'head'...)
Thanks!
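The rename agreed on above would look roughly like the sketch below, modeled on the existing btf_field_list_head layout. Member names and stub types are assumptions for illustration, not the final kernel definition.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for kernel types so the sketch is self-contained. */
typedef uint32_t u32;
struct btf;		/* opaque, as in include/linux/bpf.h */
struct btf_record;

struct btf_field_graph_root {	/* was: struct btf_field_list_head */
	struct btf *btf;
	u32 value_btf_id;
	u32 node_offset;
	struct btf_record *value_rec;
};

struct btf_field {
	u32 offset;
	int type;	/* enum btf_field_type in the kernel */
	union {
		struct btf_field_graph_root graph_root;	/* was: list_head */
	};
};
```

The point of the rename is that the same structure describes both a list head and an rbtree root, so a name tied to neither ("graph root") avoids the oddity of a "head of tree".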
* Re: [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
2022-12-07 18:34 ` Dave Marchevsky
2022-12-07 18:59 ` Alexei Starovoitov
@ 2022-12-07 19:03 ` Kumar Kartikeya Dwivedi
1 sibling, 0 replies; 50+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-12-07 19:03 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Kernel Team, Tejun Heo
On Thu, Dec 08, 2022 at 12:04:44AM IST, Dave Marchevsky wrote:
> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
> > On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
> >> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
> >> There, a BTF record is created for any type containing a spin_lock or
> >> any next-gen datastructure node/head.
> >>
> >> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
> >> a record using struct_meta_tab if the reg->type exactly matches
> >> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
> >> "allocated obj" type - returned from bpf_obj_new - might pick up other
> >> flags while working its way through the program.
> >>
> >
> > Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
> > passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
> > Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
> > cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
> > is to make them unpassable to helpers.
> >
>
> I see what you mean. If reg_btf_record is only used on regs which are args,
> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
> but then we have its use in reg_may_point_to_spin_lock, which is itself
> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
> sure that it's only used on arg regs currently.
>
> Regardless, if the intended use is on arg regs only, it should be renamed to
> arg_reg_btf_record or similar to make that clear, as current name sounds like
> it should be applicable to any reg, and thus not enforce constraints particular
> to arg regs.
>
> But I think it's better to leave it general and enforce those constraints
> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
> big switch statements for KF_ARG_* are doing exact type matching.
>
> >> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
> >> for type_flag.
> >>
> >> This patch is marked Fixes as the original intent of reg_btf_record was
> >> unlikely to have been to fail finding btf_record for valid alloc obj
> >> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
> >> are valid register type states for alloc obj independent of this series.
> >
> > That was the actual intent, same as how check_ptr_to_btf_access uses the exact
> > reg->type to allow the BPF_WRITE case.
> >
> > I think this series is the one introducing this case, passing bpf_rbtree_first's
> > result to bpf_rbtree_remove, which I think is not possible to make safe in the
> > first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
> > bpf_list_del due to this exact issue. More in [0].
> >
> > [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
> >
>
> Thanks for the link, I better understand what Alexei meant in his comment on
> patch 9 of this series. For the helpers added in this series, we can make
> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
> refs after the rbtree_remove in same manner as they're invalidated after
> spin_unlock currently.
>
Rather than doing that, you'll cut down on a lot of complexity and confusion
regarding PTR_UNTRUSTED's use in this set by removing bpf_rbtree_first and
bpf_rbtree_remove, and simply exposing bpf_rbtree_pop_front.
> Logic for why this is safe:
>
> * If we have two non-owning refs to nodes in a tree, e.g. from
> bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
> we have no way of knowing if they're aliases of same node.
>
> * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
> it might be removing a node that's already been removed, e.g.:
>
> n = bpf_obj_new(...);
> bpf_spin_lock(&lock);
>
> bpf_rbtree_add(&tree, &n->node);
> // n is now non-owning ref to node which was added
> res = bpf_rbtree_first();
> > if (!res) {}
> m = container_of(res, struct node_data, node);
> // m is now non-owning ref to the same node
> bpf_rbtree_remove(&tree, &n->node);
> bpf_rbtree_remove(&tree, &m->node); // BAD
>
> bpf_spin_unlock(&lock);
>
> * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
> of pointing to something that was already removed _only_ after a
> rbtree_remove, so if we invalidate them all after rbtree_remove they can't
> be inputs to subsequent remove()s
>
> This does conflate current "release non-owning refs because it's not safe to
> read from them" reasoning with new "release non-owning refs so they can't be
> passed to remove()". Ideally we could add some new tag to these refs that
> prevents them from being passed to remove()-type fns, but does allow them to
> be read, e.g.:
>
> n = bpf_obj_new(...);
> bpf_spin_lock(&lock);
>
> bpf_rbtree_add(&tree, &n->node);
> // n is now non-owning ref to node which was added
> res = bpf_rbtree_first();
> if (!res) {}
> m = container_of(res, struct node_data, node);
> // m is now non-owning ref to the same node
> n = bpf_rbtree_remove(&tree, &n->node);
> // n is now owning ref again, m is non-owning ref to same node
> x = m->key; // this should be safe since we're still in CS
> bpf_rbtree_remove(&tree, &m->node); // But this should be prevented
>
> bpf_spin_unlock(&lock);
>
> But this would introduce too much addt'l complexity for now IMO. The proposal
> of just invalidating all non-owning refs prevents both the unsafe second
> remove() and the safe x = m->key.
>
> I will give it a shot, if it doesn't work can change rbtree_remove to
> rbtree_remove_first w/o node param. But per that linked convo such logic
> should be tackled eventually, might as well chip away at it now.
>
I sympathise with your goal to make it as close to kernel programming style as
possible. I was exploring the same option (as you saw in that link). But based
on multiple discussions so far and trying different approaches, I'm convinced
the additional complexity in the verifier is not worth it.
Both bpf_list_del and bpf_rbtree_remove are useful and should be added, but
should work on e.g. 'current' node in iteration callback. In that context
verifier knows that the node is part of the list/rbtree. Introducing more than
one node in that same context introduces potential aliasing which hinders the
verifier's ability to reason about safety. Then, it has to be pessimistic like
your case and invalidate everything to prevent invalid use, so that double
list_del and double rbtree_remove is not possible.
You will avoid all the problems with PTR_UNTRUSTED being passed to helpers if
you adopt such an approach. The code will become much simpler, while allowing
people to do the same thing without any loss of usability.
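The alternative suggested here - a combined pop-style kfunc instead of first() followed by remove(node) - can be modeled in plain C. Because lookup and unlink happen in one step, the program never holds two possibly-aliasing non-owning references. This is an illustrative userspace sketch, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

struct lnode { struct lnode *next; };
struct lhead { struct lnode *first; };

/* Unlinks and returns the first node in a single operation; the caller
 * receives the only (owning) reference, so no stale alias can remain
 * inside the program for the verifier to worry about. */
static struct lnode *list_pop_front(struct lhead *h)
{
	struct lnode *n = h->first;

	if (!n)
		return NULL;	/* empty list -> nothing acquired */
	h->first = n->next;
	n->next = NULL;
	return n;
}
```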
* Re: [PATCH bpf-next 02/13] bpf: map_check_btf should fail if btf_parse_fields fails
2022-12-07 16:49 ` Kumar Kartikeya Dwivedi
@ 2022-12-07 19:05 ` Alexei Starovoitov
2022-12-17 8:59 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 19:05 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Kernel Team, Tejun Heo
On Wed, Dec 07, 2022 at 10:19:00PM +0530, Kumar Kartikeya Dwivedi wrote:
> On Wed, Dec 07, 2022 at 04:39:49AM IST, Dave Marchevsky wrote:
> > map_check_btf calls btf_parse_fields to create a btf_record for its
> > value_type. If there are no special fields in the value_type
> > btf_parse_fields returns NULL, whereas if there special value_type
> > fields but they are invalid in some way an error is returned.
> >
> > An example invalid state would be:
> >
> > struct node_data {
> > struct bpf_rb_node node;
> > int data;
> > };
> >
> > private(A) struct bpf_spin_lock glock;
> > private(A) struct bpf_list_head ghead __contains(node_data, node);
> >
> > ghead should be invalid as its __contains tag points to a field with
> > type != "bpf_list_node".
> >
> > Before this patch, such a scenario would result in btf_parse_fields
> > returning an error ptr, subsequent !IS_ERR_OR_NULL check failing,
> > and btf_check_and_fixup_fields returning 0, which would then be
> > returned by map_check_btf.
> >
> > After this patch's changes, -EINVAL would be returned by map_check_btf
> > and the map would correctly fail to load.
> >
> > Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
> > cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > Fixes: aa3496accc41 ("bpf: Refactor kptr_off_tab into btf_record")
> > ---
> > kernel/bpf/syscall.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 35972afb6850..c3599a7902f0 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1007,7 +1007,10 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
> > map->record = btf_parse_fields(btf, value_type,
> > BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD,
> > map->value_size);
> > - if (!IS_ERR_OR_NULL(map->record)) {
> > + if (IS_ERR(map->record))
> > + return -EINVAL;
> > +
>
> I didn't do this on purpose, because of backward compatibility concerns. An
> error has not been returned in earlier kernel versions during map creation time
> and those fields acted like normal non-special regions, with errors on use of
> helpers that act on those fields.
>
> Especially since bpf_spin_lock and bpf_timer are part of the unified btf_record.
>
> If we are doing such a change, then you should also drop the checks for IS_ERR
> in verifier.c, since that shouldn't be possible anymore. But I think we need to
> think carefully before changing this.
>
> One possible example is: If we introduce bpf_foo in the future and program
> already has that defined in map value, using it for some other purpose, with
> different alignment and size, their map creation will start failing.
That's a good point.
If we can error on such a misconstructed map at program verification time, that's better
anyway, since there will be a proper verifier log instead of EINVAL from map_create.
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
` (13 preceding siblings ...)
2022-12-07 2:50 ` [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure patchwork-bot+netdevbpf
@ 2022-12-07 19:36 ` Kumar Kartikeya Dwivedi
2022-12-07 22:28 ` Dave Marchevsky
14 siblings, 1 reply; 50+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-12-07 19:36 UTC (permalink / raw)
To: Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Tejun Heo
On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
> This series adds a rbtree datastructure following the "next-gen
> datastructure" precedent set by recently-added linked-list [0]. This is
> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
> instead of adding a new map type. This series adds a smaller set of API
> functions than that RFC - just the minimum needed to support current
> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>
> bpf_rbtree_add
> bpf_rbtree_remove
> bpf_rbtree_first
>
> [...]
>
> Future work:
> Enabling writes to release_on_unlock refs should be done before the
> functionality of BPF rbtree can truly be considered complete.
> Implementing this proved more complex than expected so it's been
> pushed off to a future patch.
>
TBH, I think we need to revisit whether there's a strong need for this. I would
even argue that we should simply make the release semantics of rbtree_add,
list_push helpers stronger and remove release_on_unlock logic entirely,
releasing the node immediately. I don't see why it is so critical to have read,
and more importantly, write access to nodes after losing their ownership. And
that too is only available until the lock is unlocked.
I think this relaxed release logic and write support is the wrong direction to
take, as it has a direct bearing on what can be done with a node inside the
critical section. There's already the problem with not being able to do
bpf_obj_drop easily inside the critical section with this. That might be useful
for draining operations while holding the lock.
Semantically in other languages, once you move an object, accessing it is
usually a bug, and in most of the cases it is sufficient to prepare it before
insertion. We are certainly in the same territory here with these APIs.
Can you elaborate on actual use cases where immediate release or not having
write support makes it hard or impossible to support a certain use case, so that
it is easier to understand the requirements and design things accordingly?
* Re: [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
2022-12-07 18:59 ` Alexei Starovoitov
@ 2022-12-07 20:38 ` Dave Marchevsky
2022-12-07 22:46 ` Alexei Starovoitov
0 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-07 20:38 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On 12/7/22 1:59 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote:
>> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
>>> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
>>>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
>>>> There, a BTF record is created for any type containing a spin_lock or
>>>> any next-gen datastructure node/head.
>>>>
>>>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
>>>> a record using struct_meta_tab if the reg->type exactly matches
>>>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
>>>> "allocated obj" type - returned from bpf_obj_new - might pick up other
>>>> flags while working its way through the program.
>>>>
>>>
>>> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
>>> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
>>> Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
>>> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
>>> is to make them unpassable to helpers.
>>>
>>
>> I see what you mean. If reg_btf_record is only used on regs which are args,
>> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
>> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
>> but then we have its use in reg_may_point_to_spin_lock, which is itself
>> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
>> sure that it's only used on arg regs currently.
>>
>> Regardless, if the intended use is on arg regs only, it should be renamed to
>> arg_reg_btf_record or similar to make that clear, as current name sounds like
>> it should be applicable to any reg, and thus not enforce constraints particular
>> to arg regs.
>>
>> But I think it's better to leave it general and enforce those constraints
>> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
>> big switch statements for KF_ARG_* are doing exact type matching.
>>
>>>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
>>>> for type_flag.
>>>>
>>>> This patch is marked Fixes as the original intent of reg_btf_record was
>>>> unlikely to have been to fail finding btf_record for valid alloc obj
>>>> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
>>>> are valid register type states for alloc obj independent of this series.
>>>
>>> That was the actual intent, same as how check_ptr_to_btf_access uses the exact
>>> reg->type to allow the BPF_WRITE case.
>>>
>>> I think this series is the one introducing this case, passing bpf_rbtree_first's
>>> result to bpf_rbtree_remove, which I think is not possible to make safe in the
>>> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
>>> bpf_list_del due to this exact issue. More in [0].
>>>
>>> [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
>>>
>>
>> Thanks for the link, I better understand what Alexei meant in his comment on
>> patch 9 of this series. For the helpers added in this series, we can make
>> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
>> refs after the rbtree_remove in same manner as they're invalidated after
>> spin_unlock currently.
>>
>> Logic for why this is safe:
>>
>> * If we have two non-owning refs to nodes in a tree, e.g. from
>> bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
>> we have no way of knowing if they're aliases of same node.
>>
>> * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
>> it might be removing a node that's already been removed, e.g.:
>>
>> n = bpf_obj_new(...);
>> bpf_spin_lock(&lock);
>>
>> bpf_rbtree_add(&tree, &n->node);
>> // n is now non-owning ref to node which was added
>> res = bpf_rbtree_first();
>> if (!res) {}
>> m = container_of(res, struct node_data, node);
>> // m is now non-owning ref to the same node
>> bpf_rbtree_remove(&tree, &n->node);
>> bpf_rbtree_remove(&tree, &m->node); // BAD
>
> Let me clarify my previous email:
>
> Above doesn't have to be 'BAD'.
> Instead of
> if (WARN_ON_ONCE(RB_EMPTY_NODE(n)))
>
> we can drop WARN and simply return.
> If node is not part of the tree -> nop.
>
> Same for bpf_rbtree_add.
> If it's already added -> nop.
>
These runtime checks can certainly be done, but if we can guarantee via the
verifier's type system that a particular ptr-to-node is guaranteed to be in /
not be in a tree, that's better, no?
Feels like a similar train of thought to "fail verification when correct rbtree
lock isn't held" vs "just check if lock is held in every rbtree API kfunc".
> Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics.
> We do all these checks under the same rbtree root lock, so it's safe.
>
I'll comment on PTR_TRUSTED in our discussion on patch 10.
>> bpf_spin_unlock(&lock);
>>
>> * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
>> of pointing to something that was already removed _only_ after a
>> rbtree_remove, so if we invalidate them all after rbtree_remove they can't
>> be inputs to subsequent remove()s
>
> With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add
> can have release semantics.
> No need for special release_on_unlock hacks.
>
If we want to be able to interact w/ nodes after they've been added to the
rbtree, but before critical section ends, we need to support non-owning refs,
which are currently implemented using special release_on_unlock logic.
If we go with the runtime check suggestion from above, we'd need to implement
'conditional release' similarly to earlier "rbtree map" attempt:
https://lore.kernel.org/bpf/20220830172759.4069786-14-davemarchevsky@fb.com/ .
If rbtree_add has release semantics for its node arg, but the node is already
in some tree and runtime check fails, the reference should not be released as
rbtree_add() was a nop.
Similarly, if rbtree_remove has release semantics for its node arg and acquire
semantics for its return value, runtime check failing should result in the
node arg not being released. Acquire semantics for the retval are already
conditional - if retval == NULL, mark_ptr_or_null regs will release the
acquired ref before it can be used. So no issue with failing rbtree_remove
messing up acquire.
For this reason rbtree_remove and rbtree_first are tagged
KF_ACQUIRE | KF_RET_NULL. "special release_on_unlock hacks" can likely be
refactored into a similar flag, KF_RELEASE_NON_OWN or similar.
>> This does conflate current "release non-owning refs because it's not safe to
>> read from them" reasoning with new "release non-owning refs so they can't be
>> passed to remove()". Ideally we could add some new tag to these refs that
>> prevents them from being passed to remove()-type fns, but does allow them to
>> be read, e.g.:
>>
>> n = bpf_obj_new(...);
>
> 'n' is acquired.
>
>> bpf_spin_lock(&lock);
>>
>> bpf_rbtree_add(&tree, &n->node);
>> // n is now non-owning ref to node which was added
>
> since bpf_rbtree_add does release on 'n'...
>
>> res = bpf_rbtree_first();
>> if (!res) {}
>> m = container_of(res, struct node_data, node);
>> // m is now non-owning ref to the same node
>
> ... below is not allowed by the verifier.
>> n = bpf_rbtree_remove(&tree, &n->node);
>
> I'm not sure what the idea is behind returning 'n' from remove...
> Maybe it should be simple bool ?
>
I agree that returning node from rbtree_remove is not strictly necessary, since
rbtree_remove can be thought of turning its non-owning ref argument into an
owning ref, instead of taking non-owning ref and returning owning ref. But such
an operation isn't really an 'acquire' by current verifier logic, since only
retvals can be 'acquired'. So we'd need to add some logic to enable acquire
semantics for args. Furthermore it's not really 'acquiring' a new ref, rather
changing properties of node arg ref.
However, if rbtree_remove can fail, such a "turn non-owning into owning"
operation will need to be able to fail as well, and the program will need to
be able to check for failure. Returning 'acquire' result in retval makes
this simple - just check for NULL. For your "return bool" proposal, we'd have
to add verifier logic which turns the 'acquired' owning ref back into non-owning
based on check of the bool, which will add some verifier complexity.
IIRC when doing experimentation with "rbtree map" implementation, I did
something like this and decided that the additional complexity wasn't worth
it when retval can just be used.
>> // n is now owning ref again, m is non-owning ref to same node
>> x = m->key; // this should be safe since we're still in CS
>
> below works because 'm' comes from bpf_rbtree_first that acquired 'res'.
>
>> bpf_rbtree_remove(&tree, &m->node); // But this should be prevented
>>
>> bpf_spin_unlock(&lock);
>>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-07 19:36 ` Kumar Kartikeya Dwivedi
@ 2022-12-07 22:28 ` Dave Marchevsky
2022-12-07 23:06 ` Alexei Starovoitov
0 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-07 22:28 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Tejun Heo
On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote:
> On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
>> This series adds a rbtree datastructure following the "next-gen
>> datastructure" precedent set by recently-added linked-list [0]. This is
>> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
>> instead of adding a new map type. This series adds a smaller set of API
>> functions than that RFC - just the minimum needed to support current
>> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>>
>> bpf_rbtree_add
>> bpf_rbtree_remove
>> bpf_rbtree_first
>>
>> [...]
>>
>> Future work:
>> Enabling writes to release_on_unlock refs should be done before the
>> functionality of BPF rbtree can truly be considered complete.
>> Implementing this proved more complex than expected so it's been
>> pushed off to a future patch.
>>
>
> TBH, I think we need to revisit whether there's a strong need for this. I would
> even argue that we should simply make the release semantics of rbtree_add,
> list_push helpers stronger and remove release_on_unlock logic entirely,
> releasing the node immediately. I don't see why it is so critical to have read,
> and more importantly, write access to nodes after losing their ownership. And
> that too is only available until the lock is unlocked.
>
Moved the next paragraph here to ease reply, it was the last paragraph
in your response.
>
> Can you elaborate on actual use cases where immediate release or not having
> write support makes it hard or impossible to support a certain use case, so that
> it is easier to understand the requirements and design things accordingly?
>
Sure, the main usecase and impetus behind this for me is the sched_ext work
Tejun and others are doing (https://lwn.net/Articles/916291/). One of the
things they'd like to be able to do is implement a CFS-like scheduler using
rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to
implement complicated scheduling logic.
If we can implement such complicated scheduling logic, but it has so much
BPF-specific twisting of program logic that it's incomprehensible to scheduler
folks, that's not great. The overlap between "BPF experts" and "scheduler
experts" is small, and we want the latter group to be able to read BPF
scheduling logic without too much struggle. Lower learning curve makes folks
more likely to experiment with sched_ext.
When 'rbtree map' was in brainstorming / prototyping, non-owning reference
semantics were called out as moving BPF datastructures closer to their kernel
equivalents from a UX perspective.
If the "it makes BPF code better resemble normal kernel code" argument was the
only reason to do this I wouldn't feel so strongly, but there are practical
concerns as well:
If we could only read / write from rbtree node if it isn't in a tree, the common
operation of "find this node and update its data" would require removing and
re-adding it. For rbtree, these unnecessary remove and add operations could
result in unnecessary rebalancing. Going back to the sched_ext usecase,
if we have a rbtree with task or cgroup stats that need to be updated often,
unnecessary rebalancing would make this update slower than if non-owning refs
allowed in-place read/write of node data.
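For concreteness, the "find this node and update its data" pattern enabled by non-owning refs would look roughly like this (pseudocode sketch; kfunc names as in this series, node_data as in the cover letter):

```
bpf_spin_lock(&glock);
res = bpf_rbtree_first(&groot);         /* non-owning ref, nothing removed */
if (res) {
        n = container_of(res, struct node_data, node);
        n->data++;                      /* in-place write, no rebalancing */
}
bpf_spin_unlock(&glock);
```

Without non-owning refs, the body of the `if` would have to be `bpf_rbtree_remove` + write + `bpf_rbtree_add`, paying for two rebalances just to touch a payload field.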
Also, we eventually want to be able to have a node that's part of both a
list and rbtree. Likely adding such a node to both would require calling
kfunc for adding to list, and separate kfunc call for adding to rbtree.
Once the node has been added to list, we need some way to represent a reference
to that node so that we can pass it to rbtree add kfunc. Sounds like a
non-owning reference to me, albeit with different semantics than current
release_on_unlock.
> I think this relaxed release logic and write support is the wrong direction to
> take, as it has a direct bearing on what can be done with a node inside the
> critical section. There's already the problem with not being able to do
> bpf_obj_drop easily inside the critical section with this. That might be useful
> for draining operations while holding the lock.
>
The bpf_obj_drop case is similar to your "can't pass non-owning reference
to bpf_rbtree_remove" concern from patch 1's thread. If we have:
n = bpf_obj_new(...); // n is owning ref
bpf_rbtree_add(&tree, &n->node); // n is non-owning ref
res = bpf_rbtree_first(&tree);
if (!res) {...}
m = container_of(res, struct node_data, node); // m is non-owning ref
res = bpf_rbtree_remove(&tree, &n->node);
n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory
bpf_obj_drop(n);
// Not safe to use m anymore
Datastructures which support bpf_obj_drop in the critical section can
do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning
references after bpf_obj_drop. Then there's no potential use-after-free.
(For the above example, pretend bpf_rbtree_remove didn't already invalidate
'm', or that there's some other way to obtain non-owning ref to 'n''s node
after rbtree_remove)
I think that, in practice, operations where the BPF program wants to remove
/ delete nodes will be distinct from operations where program just wants to
obtain some non-owning refs and do read / write. At least for sched_ext usecase
this is true. So all the additional clobbers won't require program writer
to do special workarounds to deal with verifier in the common case.
> Semantically in other languages, once you move an object, accessing it is
> usually a bug, and in most of the cases it is sufficient to prepare it before
> insertion. We are certainly in the same territory here with these APIs.
Sure, but 'add'/'remove' for these intrusive linked datastructures is
_not_ a 'move'. Obscuring this from the user and forcing them to use
less performant patterns for the sake of some verifier complexity, or desire
to mimic semantics of languages w/o reference stability, doesn't make sense to
me.
If we were to add some datastructures without reference stability, sure, let's
not do non-owning references for those. So let's make this non-owning reference
stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags,
which will coincidentally make it very easy to remove if we later decide that
the complexity isn't worth it.
* Re: [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
2022-12-07 20:38 ` Dave Marchevsky
@ 2022-12-07 22:46 ` Alexei Starovoitov
2022-12-07 23:42 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 22:46 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On Wed, Dec 07, 2022 at 03:38:55PM -0500, Dave Marchevsky wrote:
> On 12/7/22 1:59 PM, Alexei Starovoitov wrote:
> > On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote:
> >> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
> >>> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
> >>>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
> >>>> There, a BTF record is created for any type containing a spin_lock or
> >>>> any next-gen datastructure node/head.
> >>>>
> >>>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
> >>>> a record using struct_meta_tab if the reg->type exactly matches
> >>>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
> >>>> "allocated obj" type - returned from bpf_obj_new - might pick up other
> >>>> flags while working its way through the program.
> >>>>
> >>>
> >>> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
> >>> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
> >>> Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
> >>> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
> >>> is to make them unpassable to helpers.
> >>>
> >>
> >> I see what you mean. If reg_btf_record is only used on regs which are args,
> >> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
> >> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
> >> but then we have its use in reg_may_point_to_spin_lock, which is itself
> >> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
> >> sure that it's only used on arg regs currently.
> >>
> >> Regardless, if the intended use is on arg regs only, it should be renamed to
> >> arg_reg_btf_record or similar to make that clear, as current name sounds like
> >> it should be applicable to any reg, and thus not enforce constraints particular
> >> to arg regs.
> >>
> >> But I think it's better to leave it general and enforce those constraints
> >> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
> >> big switch statements for KF_ARG_* are doing exact type matching.
> >>
> >>>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
> >>>> for type_flag.
> >>>>
> >>>> This patch is marked Fixes as the original intent of reg_btf_record was
> >>>> unlikely to have been to fail finding btf_record for valid alloc obj
> >>>> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
> >>>> are valid register type states for alloc obj independent of this series.
> >>>
> >>> That was the actual intent, same as how check_ptr_to_btf_access uses the exact
> >>> reg->type to allow the BPF_WRITE case.
> >>>
> >>> I think this series is the one introducing this case, passing bpf_rbtree_first's
> >>> result to bpf_rbtree_remove, which I think is not possible to make safe in the
> >>> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
> >>> bpf_list_del due to this exact issue. More in [0].
> >>>
> >>> [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
> >>>
> >>
> >> Thanks for the link, I better understand what Alexei meant in his comment on
> >> patch 9 of this series. For the helpers added in this series, we can make
> >> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
> >> refs after the rbtree_remove in same manner as they're invalidated after
> >> spin_unlock currently.
> >>
> >> Logic for why this is safe:
> >>
> >> * If we have two non-owning refs to nodes in a tree, e.g. from
> >> bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
> >> we have no way of knowing if they're aliases of same node.
> >>
> >> * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
> >> it might be removing a node that's already been removed, e.g.:
> >>
> >> n = bpf_obj_new(...);
> >> bpf_spin_lock(&lock);
> >>
> >> bpf_rbtree_add(&tree, &n->node);
> >> // n is now non-owning ref to node which was added
> >> res = bpf_rbtree_first(&tree);
> >> if (!res) {}
> >> m = container_of(res, struct node_data, node);
> >> // m is now non-owning ref to the same node
> >> bpf_rbtree_remove(&tree, &n->node);
> >> bpf_rbtree_remove(&tree, &m->node); // BAD
> >
> > Let me clarify my previous email:
> >
> > Above doesn't have to be 'BAD'.
> > Instead of
> > if (WARN_ON_ONCE(RB_EMPTY_NODE(n)))
> >
> > we can drop WARN and simply return.
> > If node is not part of the tree -> nop.
> >
> > Same for bpf_rbtree_add.
> > If it's already added -> nop.
> >
>
> These runtime checks can certainly be done, but if we can guarantee via
> verifier type system that a particular ptr-to-node is guaranteed to be in /
> not be in a tree, that's better, no?
>
> Feels like a similar train of thought to "fail verification when correct rbtree
> lock isn't held" vs "just check if lock is held in every rbtree API kfunc".
>
> > Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics.
> > We do all these checks under the same rbtree root lock, so it's safe.
> >
>
> I'll comment on PTR_TRUSTED in our discussion on patch 10.
>
> >> bpf_spin_unlock(&lock);
> >>
> >> * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
> >> of pointing to something that was already removed _only_ after a
> >> rbtree_remove, so if we invalidate them all after rbtree_remove they can't
> >> be inputs to subsequent remove()s
> >
> > With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add
> > can have release semantics.
> > No need for special release_on_unlock hacks.
> >
>
> If we want to be able to interact w/ nodes after they've been added to the
> rbtree, but before critical section ends, we need to support non-owning refs,
> which are currently implemented using special release_on_unlock logic.
>
> If we go with the runtime check suggestion from above, we'd need to implement
> 'conditional release' similarly to earlier "rbtree map" attempt:
> https://lore.kernel.org/bpf/20220830172759.4069786-14-davemarchevsky@fb.com/ .
>
> If rbtree_add has release semantics for its node arg, but the node is already
> in some tree and runtime check fails, the reference should not be released as
> rbtree_add() was a nop.
Got it.
The conditional release is tricky. We should probably avoid it for now.
I think we can either go with Kumar's proposal and do
bpf_rbtree_pop_front() instead of bpf_rbtree_first()
that avoids all these issues...
but considering that we'll have inline iterators soon and should be able to do:
struct bpf_rbtree_iter it;
struct bpf_rb_node *node;

bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree
while ((node = bpf_rbtree_iter_next(&it))) {
        if (node->field == condition) {
                struct bpf_rb_node *n;

                n = bpf_rbtree_remove(rb_root, node);
                bpf_spin_lock(another_rb_root);
                bpf_rbtree_add(another_rb_root, n);
                bpf_spin_unlock(another_rb_root);
                break;
        }
}
bpf_rbtree_iter_destroy(&it);
We can treat the 'node' returned from bpf_rbtree_iter_next() the same way
as return from bpf_rbtree_first() -> PTR_TRUSTED | MAYBE_NULL,
but not acquired (ref_obj_id == 0).
bpf_rbtree_add -> KF_RELEASE
so we cannot pass not acquired pointers into it.
We should probably remove release_on_unlock logic as Kumar suggesting and
make bpf_list_push_front/back to be KF_RELEASE.
Then
bpf_list_pop_front/back stay KF_ACQUIRE | KF_RET_NULL
and
bpf_rbtree_remove is also KF_ACQUIRE | KF_RET_NULL.
The difference is bpf_list_pop has only 'head'
while bpf_rbtree_remove has 'root' and 'node' where 'node' has to be PTR_TRUSTED
(but not acquired).
bpf_rbtree_add will always succeed.
bpf_rbtree_remove will conditionally fail if 'node' is not linked.
Similarly we can extend link list with
n = bpf_list_remove(node)
which will have KF_ACQUIRE | KF_RET_NULL semantics.
Then everything is nicely uniform.
We'll be able to iterate rbtree and iterate link lists.
There are downsides, of course.
Like the following from your test case:
+ bpf_spin_lock(&glock);
+ bpf_rbtree_add(&groot, &n->node, less);
+ bpf_rbtree_add(&groot, &m->node, less);
+ res = bpf_rbtree_remove(&groot, &n->node);
+ bpf_spin_unlock(&glock);
will not work.
Since bpf_rbtree_add() releases 'n' and it becomes UNTRUSTED.
(assuming release_on_unlock is removed).
I think it's fine for now. I have to agree with Kumar that it's hard to come up
with realistic use case where 'n' should be accessed after it was added to link
list or rbtree. Above test case doesn't look real.
This part of your test case:
+ bpf_spin_lock(&glock);
+ bpf_rbtree_add(&groot, &n->node, less);
+ bpf_rbtree_add(&groot, &m->node, less);
+ bpf_rbtree_add(&groot, &o->node, less);
+
+ res = bpf_rbtree_first(&groot);
+ if (!res) {
+ bpf_spin_unlock(&glock);
+ return 2;
+ }
+
+ o = container_of(res, struct node_data, node);
+ res = bpf_rbtree_remove(&groot, &o->node);
+ bpf_spin_unlock(&glock);
will work, because bpf_rbtree_first returns PTR_TRUSTED | MAYBE_NULL.
> Similarly, if rbtree_remove has release semantics for its node arg and acquire
> semantics for its return value, runtime check failing should result in the
> node arg not being released. Acquire semantics for the retval are already
> conditional - if retval == NULL, mark_ptr_or_null regs will release the
> acquired ref before it can be used. So no issue with failing rbtree_remove
> messing up acquire.
>
> For this reason rbtree_remove and rbtree_first are tagged
> KF_ACQUIRE | KF_RET_NULL. "special release_on_unlock hacks" can likely be
> refactored into a similar flag, KF_RELEASE_NON_OWN or similar.
I guess what I'm proposing above is sort of the KF_RELEASE_NON_OWN idea,
but from a different angle.
I'd like to avoid introducing new flags.
I think PTR_TRUSTED is enough.
> > I'm not sure what's an idea to return 'n' from remove...
> > Maybe it should be simple bool ?
> >
>
> I agree that returning node from rbtree_remove is not strictly necessary, since
> rbtree_remove can be thought of as turning its non-owning ref argument into an
> owning ref, instead of taking a non-owning ref and returning an owning ref. But such
> an operation isn't really an 'acquire' by current verifier logic, since only
> retvals can be 'acquired'. So we'd need to add some logic to enable acquire
> semantics for args. Furthermore it's not really 'acquiring' a new ref, rather
> changing properties of node arg ref.
>
> However, if rbtree_remove can fail, such a "turn non-owning into owning"
> operation will need to be able to fail as well, and the program will need to
> be able to check for failure. Returning 'acquire' result in retval makes
> this simple - just check for NULL. For your "return bool" proposal, we'd have
> to add verifier logic which turns the 'acquired' owning ref back into non-owning
> based on check of the bool, which will add some verifier complexity.
>
> IIRC when doing experimentation with "rbtree map" implementation, I did
> something like this and decided that the additional complexity wasn't worth
> it when retval can just be used.
Agree. Forget 'bool' idea.
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-07 22:28 ` Dave Marchevsky
@ 2022-12-07 23:06 ` Alexei Starovoitov
2022-12-08 1:18 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-07 23:06 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On Wed, Dec 07, 2022 at 05:28:34PM -0500, Dave Marchevsky wrote:
> On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote:
> > On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
> >> This series adds a rbtree datastructure following the "next-gen
> >> datastructure" precedent set by recently-added linked-list [0]. This is
> >> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
> >> instead of adding a new map type. This series adds a smaller set of API
> >> functions than that RFC - just the minimum needed to support current
> >> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
> >>
> >> bpf_rbtree_add
> >> bpf_rbtree_remove
> >> bpf_rbtree_first
> >>
> >> [...]
> >>
> >> Future work:
> >> Enabling writes to release_on_unlock refs should be done before the
> >> functionality of BPF rbtree can truly be considered complete.
> >> Implementing this proved more complex than expected so it's been
> >> pushed off to a future patch.
> >>
>
> >
> > TBH, I think we need to revisit whether there's a strong need for this. I would
> > even argue that we should simply make the release semantics of rbtree_add,
> > list_push helpers stronger and remove release_on_unlock logic entirely,
> > releasing the node immediately. I don't see why it is so critical to have read,
> > and more importantly, write access to nodes after losing their ownership. And
> > that too is only available until the lock is unlocked.
> >
>
> Moved the next paragraph here to ease reply, it was the last paragraph
> in your response.
>
> >
> > Can you elaborate on actual use cases where immediate release or not having
> > write support makes it hard or impossible to support a certain use case, so that
> > it is easier to understand the requirements and design things accordingly?
> >
>
> Sure, the main usecase and impetus behind this for me is the sched_ext work
> Tejun and others are doing (https://lwn.net/Articles/916291/). One of the
> things they'd like to be able to do is implement a CFS-like scheduler using
> rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to
> implement complicated scheduling logic.
>
> If we can implement such complicated scheduling logic, but it has so much
> BPF-specific twisting of program logic that it's incomprehensible to scheduler
> folks, that's not great. The overlap between "BPF experts" and "scheduler
> experts" is small, and we want the latter group to be able to read BPF
> scheduling logic without too much struggle. Lower learning curve makes folks
> more likely to experiment with sched_ext.
>
> When 'rbtree map' was in brainstorming / prototyping, non-owning reference
> semantics were called out as moving BPF datastructures closer to their kernel
> equivalents from a UX perspective.
Our emails crossed. See my previous email.
Agree on the above.
> If the "it makes BPF code better resemble normal kernel code" argument was the
> only reason to do this I wouldn't feel so strongly, but there are practical
> concerns as well:
>
> If we could only read / write from rbtree node if it isn't in a tree, the common
> operation of "find this node and update its data" would require removing and
> re-adding it. For rbtree, these unnecessary remove and add operations could
Not really. See my previous email.
> result in unnecessary rebalancing. Going back to the sched_ext usecase,
> if we have a rbtree with task or cgroup stats that need to be updated often,
> unnecessary rebalancing would make this update slower than if non-owning refs
> allowed in-place read/write of node data.
Agree. Read/write from non-owning refs is necessary.
In the other email I'm arguing that PTR_TRUSTED with ref_obj_id == 0
(your non-owning ref) should not be mixed with release_on_unlock logic.
KF_RELEASE should still accept as args and release only ptrs with ref_obj_id > 0.
>
> Also, we eventually want to be able to have a node that's part of both a
> list and rbtree. Likely adding such a node to both would require calling
> kfunc for adding to list, and separate kfunc call for adding to rbtree.
> Once the node has been added to list, we need some way to represent a reference
> to that node so that we can pass it to rbtree add kfunc. Sounds like a
> non-owning reference to me, albeit with different semantics than current
> release_on_unlock.
A node with both link list and rbtree would be a new concept.
We'd need to introduce 'struct bpf_refcnt' and make sure prog does the right thing.
That's a future discussion.
>
> > I think this relaxed release logic and write support is the wrong direction to
> > take, as it has a direct bearing on what can be done with a node inside the
> > critical section. There's already the problem with not being able to do
> > bpf_obj_drop easily inside the critical section with this. That might be useful
> > for draining operations while holding the lock.
> >
>
> The bpf_obj_drop case is similar to your "can't pass non-owning reference
> to bpf_rbtree_remove" concern from patch 1's thread. If we have:
>
> n = bpf_obj_new(...); // n is owning ref
> bpf_rbtree_add(&tree, &n->node); // n is non-owning ref
what I proposed in the other email...
n should be untrusted here.
That's != 'n is non-owning ref'
> res = bpf_rbtree_first(&tree);
> if (!res) {...}
> m = container_of(res, struct node_data, node); // m is non-owning ref
agree. m == PTR_TRUSTED with ref_obj_id == 0.
> res = bpf_rbtree_remove(&tree, &n->node);
a typo here? Did you mean 'm->node' ?
and after 'if (res)' ...
> n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory
agree. n -> ref_obj_id > 0
> bpf_obj_drop(n);
above is ok to do.
'n' becomes UNTRUSTED or invalid.
> // Not safe to use m anymore
'm' should have become UNTRUSTED after bpf_rbtree_remove.
> Datastructures which support bpf_obj_drop in the critical section can
> do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning
> references after bpf_obj_drop.
'invalidate all' sounds suspicious.
I don't think we need to do a sweeping search after bpf_obj_drop.
> Then there's no potential use-after-free.
> (For the above example, pretend bpf_rbtree_remove didn't already invalidate
> 'm', or that there's some other way to obtain non-owning ref to 'n''s node
> after rbtree_remove)
>
> I think that, in practice, operations where the BPF program wants to remove
> / delete nodes will be distinct from operations where program just wants to
> obtain some non-owning refs and do read / write. At least for sched_ext usecase
> this is true. So all the additional clobbers won't require program writer
> to do special workarounds to deal with verifier in the common case.
>
> > Semantically in other languages, once you move an object, accessing it is
> > usually a bug, and in most of the cases it is sufficient to prepare it before
> > insertion. We are certainly in the same territory here with these APIs.
>
> Sure, but 'add'/'remove' for these intrusive linked datastructures is
> _not_ a 'move'. Obscuring this from the user and forcing them to use
> less performant patterns for the sake of some verifier complexity, or desire
> to mimic semantics of languages w/o reference stability, doesn't make sense to
> me.
I agree, but everything we discuss in the above looks orthogonal
to release_on_unlock that myself and Kumar are proposing to drop.
> If we were to add some datastructures without reference stability, sure, let's
> not do non-owning references for those. So let's make this non-owning reference
> stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags,
> which will coincidentally make it very easy to remove if we later decide that
> the complexity isn't worth it.
You mean KF_RELEASE_NON_OWN would be applied to bpf_rbtree_remove() ?
So it accepts PTR_TRUSTED ref_obj_id == 0 arg and makes it PTR_UNTRUSTED ?
If so then I agree. The 'release' part of the name was confusing.
It's also not clear which arg it applies to.
bpf_rbtree_remove has two args. Both are PTR_TRUSTED.
I wouldn't introduce a new flag for this just yet.
We can hard code bpf_rbtree_remove, bpf_list_pop for now
or use our name suffix hack.
* Re: [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
2022-12-07 18:06 ` Alexei Starovoitov
@ 2022-12-07 23:39 ` Dave Marchevsky
2022-12-08 0:47 ` Alexei Starovoitov
0 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-07 23:39 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On 12/7/22 1:06 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 01:46:56AM -0500, Dave Marchevsky wrote:
>> On 12/6/22 9:39 PM, Alexei Starovoitov wrote:
>>> On Tue, Dec 06, 2022 at 03:09:57PM -0800, Dave Marchevsky wrote:
>>>> Current comment in BPF_PROBE_MEM jit code claims that verifier prevents
>>>> insn->off < 0, but this appears to not be true irrespective of changes
>>>> in this series. Regardless, changes in this series will result in an
>>>> example like:
>>>>
>>>> struct example_node {
>>>> long key;
>>>> long val;
>>>> struct bpf_rb_node node;
>>>> }
>>>>
>>>> /* In BPF prog, assume root contains example_node nodes */
>>>> struct bpf_rb_node *res = bpf_rbtree_first(&root);
>>>> if (!res)
>>>> return 1;
>>>>
>>>> struct example_node *n = container_of(res, struct example_node, node);
>>>> long key = n->key;
>>>>
>>>> Resulting in a load with off = -16, as bpf_rbtree_first's return is
>>>
>>> Looks like the bug in the previous patch:
>>> + } else if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_remove] ||
>>> + meta.func_id == special_kfunc_list[KF_bpf_rbtree_first]) {
>>> + struct btf_field *field = meta.arg_rbtree_root.field;
>>> +
>>> + mark_reg_datastructure_node(regs, BPF_REG_0,
>>> + &field->datastructure_head);
>>>
>>> The R0 .off should have been:
>>> regs[BPF_REG_0].off = field->rb_node.node_offset;
>>>
>>> node, not root.
>>>
>>> PTR_TO_BTF_ID should have been returned with approriate 'off',
>>> so that container_of() would bring it back to zero offset.
>>>
>>
>> The root's btf_field is used to hold information about the node type. Of
>> specific interest to us are value_btf_id and node_offset, which
>> mark_reg_datastructure_node uses to set REG_0's type and offset correctly.
>>
>> This "use head type to keep info about node type" strategy felt strange to me
>> initially too: all PTR_TO_BTF_ID regs are passing around their type info, so
>> why not use that to lookup bpf_rb_node field info? But consider that
>> bpf_rbtree_first (and bpf_list_pop_{front,back}) doesn't take a node as
>> input arg, so there's no opportunity to get btf_field info from input
>> reg type.
>>
>> So we'll need to keep this info in rbtree_root's btf_field
>> regardless, and since any rbtree API function that operates on a node
>> also operates on a root and expects its node arg to match the node
>> type expected by the root, might as well use root's field as the main
>> lookup for this info and not even have &field->rb_node for now.
>> All __process_kf_arg_ptr_to_datastructure_node calls (added earlier
>> in the series) use the &meta->arg_{list_head,rbtree_root}.field for same
>> reason.
>>
>> So it's setting the reg offset correctly.
>
> Ok. Got it. Then the commit log is incorrectly describing the failing scenario.
> It's a container_of() inside bool less() that is generating negative offsets.
>
I noticed this happening with container_of() both inside less() and in the
example in patch summary. Specifically in the rbtree_first_and_remove 'success'
selftest added in patch 13. There, operations like this:
bpf_spin_lock(&glock);
res = bpf_rbtree_first(&groot);
if (!res) {...}
o = container_of(res, struct node_data, node);
first_data[1] = o->data;
bpf_spin_unlock(&glock);
Would fail to set first_data[1] to the expected value, instead setting
it to 0.
>>> All PTR_TO_BTF_ID need to have positive offset.
>>> I'm not sure btf_struct_walk() and other PTR_TO_BTF_ID accessors
>>> can deal with negative offsets.
>>> There could be all kinds of things to fix.
>>
>> I think you may be conflating reg offset and insn offset here. None of the
>> changes in this series result in a PTR_TO_BTF_ID reg w/ negative offset
>> being returned. But LLVM may generate load insns with a negative offset,
>> and since we're passing around pointers to bpf_rb_node that may come
>> after useful data fields in a type, this will happen more often.
>>
>> Consider this small example from selftests in this series:
>>
>> struct node_data {
>> long key;
>> long data;
>> struct bpf_rb_node node;
>> };
>>
>> static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
>> {
>> struct node_data *node_a;
>> struct node_data *node_b;
>>
>> node_a = container_of(a, struct node_data, node);
>> node_b = container_of(b, struct node_data, node);
>>
>> return node_a->key < node_b->key;
>> }
>>
>> llvm-objdump shows this bpf bytecode for 'less':
>>
>> 0000000000000000 <less>:
>> ; return node_a->key < node_b->key;
>> 0: 79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
>> 1: 79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
>> 2: b4 00 00 00 01 00 00 00 w0 = 0x1
>> ; return node_a->key < node_b->key;
>
> I see. That's the same bug.
> The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with
> correct positive offset.
> Then node_a = container_of(a, struct node_data, node);
> would have produced correct offset into proper btf_id.
>
> The verifier should be passing into less() the btf_id
> of struct node_data instead of btf_id of struct bpf_rb_node.
>
The verifier is already passing the struct node_data type, not bpf_rb_node.
For less() args, and rbtree_{first,remove} retval, mark_reg_datastructure_node
- added in patch 8 - is doing as you describe.
Verifier sees less' arg regs as R=ptr_to_node_data(off=16). If it was
instead passing R=ptr_to_bpf_rb_node(off=0), attempting to access *(reg - 0x10)
would cause verifier err.
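To make the bounds reasoning concrete, here's a userspace sketch (a
hypothetical helper, not actual verifier code) of why a reg tracked at off=16
may take an insn->off of -0x10 while an off=0 reg may not:

```c
#include <assert.h>

/* reg_off:  verifier's tracked offset of the reg into the object
 * insn_off: the (possibly negative) offset encoded in the load insn
 * size:     access size in bytes
 * obj_size: BTF size of the pointed-to type
 * Returns nonzero if the access stays inside the object. */
static int access_in_bounds(int reg_off, int insn_off, int size, int obj_size)
{
	int start = reg_off + insn_off;

	return start >= 0 && start + size <= obj_size;
}
```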
>> 3: cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
>> 4: b4 00 00 00 00 00 00 00 w0 = 0x0
>>
>> 0000000000000028 <LBB2_2>:
>> ; return node_a->key < node_b->key;
>> 5: 95 00 00 00 00 00 00 00 exit
>>
>> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
>> negative insn->off. Verifier's view of R1 and R2 before insn 0 is
>> untrusted_ptr_node_data(off=16). If there were some intermediate insns
>> storing result of container_of() before dereferencing:
>>
>> r3 = (r2 - 0x10)
>> r2 = *(u64 *)(r3)
>>
>> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
>> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
>> using original arg ptrs to less() instead of storing container_of() ptr
>> adjustments.
>>
>> Since the container_of usage and code pattern in above example's less()
>> isn't particularly specific to this series, I think there are other scenarios
>> where such code would be generated, so I considered this a general bugfix in
>> the cover letter.
>
> imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.
>
If I used PTR_TRUSTED here, the JITted instructions would still do a load like
r2 = *(u64 *)(r2 - 0x10). There would just be no BPF_PROBE_MEM runtime checking
insns generated, avoiding the negative-offset issue there. But the negative insn->off
load being generated is not specific to PTR_UNTRUSTED.
>>
>> [ below paragraph was moved here, it originally preceded "All PTR_TO_BTF_ID"
>> paragraph ]
>>
>>> The approach of returning untrusted from bpf_rbtree_first is questionable.
>>> Without doing that this issue would not have surfaced.
>>>
>>
>> I agree re: PTR_UNTRUSTED, but note that my earlier example doesn't involve
>> bpf_rbtree_first. Regardless, I think the issue is that PTR_UNTRUSTED is
>> used to denote a few separate traits of a PTR_TO_BTF_ID reg:
>>
>> * "I have no ownership over the thing I'm pointing to"
>> * "My backing memory may go away at any time"
>> * "Access to my fields might result in page fault"
>> * "Kfuncs shouldn't accept me as an arg"
>>
>> Seems like original PTR_UNTRUSTED usage really wanted to denote the first
>> point and the others were just naturally implied from the first. But
>> as you've noted there are some things using PTR_UNTRUSTED that really
>> want to make more granular statements:
>
> I think PTR_UNTRUSTED implies all of the above. All 4 statements are connected.
>
>> ref_set_release_on_unlock logic sets release_on_unlock = true and adds
>> PTR_UNTRUSTED to the reg type. In this case PTR_UNTRUSTED is trying to say:
>>
>> * "I have no ownership over the thing I'm pointing to"
>> * "My backing memory may go away at any time _after_ bpf_spin_unlock"
>> * Before spin_unlock it's guaranteed to be valid
>> * "Kfuncs shouldn't accept me as an arg"
>> * We don't want arbitrary kfunc saving and accessing release_on_unlock
>> reg after bpf_spin_unlock, as its backing memory can go away any time
>> after spin_unlock.
>>
>> The "backing memory" statement PTR_UNTRUSTED is making is a blunt superset
>> of what release_on_unlock really needs.
>>
>> For less() callback we just want
>>
>> * "I have no ownership over the thing I'm pointing to"
>> * "Kfuncs shouldn't accept me as an arg"
>>
>> There is probably a way to decompose PTR_UNTRUSTED into a few flags such that
>> it's possible to denote these things separately and avoid unwanted additional
>> behavior. But after talking to David Vernet about current complexity of
>> PTR_TRUSTED and PTR_UNTRUSTED logic and his desire to refactor, it seemed
>> better to continue with PTR_UNTRUSTED blunt instrument with a bit of
>> special casing for now, instead of piling on more flags.
>
> Exactly. More flags will only increase the confusion.
> Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
> rbtree kfuncs while inside this particular callback to prevent recursion.
> That would solve all these issues, no?
> Writing into such PTR_TRUSTED should be still allowed inside cb though it's bogus.
>
> Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
> both link list and rbtree.
> It should still be safe to operate on link list part of that node from less()
> though it's not something we would ever recommend.
I definitely want to allow writes on non-owning references. In order to properly
support this, there needs to be a way to designate a field as a "key":
struct node_data {
long key __key;
long data;
struct bpf_rb_node node;
};
or perhaps on the rb_root via __contains or separate tag:
struct bpf_rb_root groot __contains(struct node_data, node, key);
This is necessary because the rbtree's less() uses the key field to determine
order, so we don't want to allow writes to the key field while the node is in an
rbtree. If such a write were possible the rbtree could easily be put in an
invalid state, since the new key may mean the rbtree is no longer sorted.
Subsequent add() operations would compare with less() using the new key, so
other nodes would be placed in the wrong spot as well.
Since PTR_UNTRUSTED currently allows read but not write, and prevents use of
non-owning ref as a kfunc arg, it seemed to be a reasonable tag for less() args.
I was planning on adding __key / non-owning-ref write support as a followup, but
adding it as part of this series will probably save a lot of back-and-forth.
Will try to add it.
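To make the ordering hazard concrete, here's a userspace sketch using a sorted
singly-linked list as a stand-in for the rbtree's sortedness invariant
(illustrative only, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in types; the ascending-key invariant plays the
 * role of the rbtree's sortedness. */
struct node { long key; struct node *next; };

/* Insert in ascending key order; the comparison plays the role of less(). */
static struct node *insert_sorted(struct node *head, struct node *n)
{
	struct node **pp = &head;

	while (*pp && (*pp)->key < n->key)
		pp = &(*pp)->next;
	n->next = *pp;
	*pp = n;
	return head;
}

static int is_sorted(const struct node *head)
{
	for (; head && head->next; head = head->next)
		if (head->key > head->next->key)
			return 0;
	return 1;
}
```

Writing the key of a linked node silently breaks the invariant, and every
subsequent insert compares against the stale ordering.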
> The kfunc call on rb tree part of struct node_data is problematic because
> of recursion, right? No other safety concerns ?
>
>>>
>>>> modified by verifier to be PTR_TO_BTF_ID of example_node w/ offset =
>>>> offsetof(struct example_node, node), instead of PTR_TO_BTF_ID of
>>>> bpf_rb_node. So it's necessary to support negative insn->off when
>>>> jitting BPF_PROBE_MEM.
>>>
>>> I'm not convinced it's necessary.
>>> container_of() seems to be the only case where bpf prog can convert
>>> PTR_TO_BTF_ID with off >= 0 to negative off.
>>> Normal pointer walking will not make it negative.
>>>
>>
>> I see what you mean - if some non-container_of case resulted in load generation
>> with negative insn->off, this probably would've been noticed already. But
>> hopefully my replies above explain why it should be addressed now.
>
> Even with container_of() usage we should be passing proper btf_id of container
> struct, so that callbacks and non-callbacks can properly container_of() it
> and still get offset >= 0.
>
This was addressed earlier in my response.
>>>>
>>>> A few instructions are saved for negative insn->offs as a result. Using
>>>> the struct example_node / off = -16 example from before, code looks
>>>> like:
>>>
>>> This is quite complex to review. I couldn't convince myself
>>> that dropping the 2nd check is safe, but don't have an argument to
>>> prove that it's not safe.
>>> Let's get to these details when there is need to support negative off.
>>>
>>
>> Hopefully above explanation shows that there's need to support it now.
>> I will try to simplify and rephrase the summary to make it easier to follow,
>> but will prioritize addressing feedback in less complex patches, so this
>> patch may not change for a few respins.
>
> I'm not saying that this patch will never be needed.
> Supporting negative offsets here is a good thing.
> I'm arguing that it's not necessary to enable bpf_rbtree.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record
2022-12-07 22:46 ` Alexei Starovoitov
@ 2022-12-07 23:42 ` Dave Marchevsky
0 siblings, 0 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-07 23:42 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On 12/7/22 5:46 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 03:38:55PM -0500, Dave Marchevsky wrote:
>> On 12/7/22 1:59 PM, Alexei Starovoitov wrote:
>>> On Wed, Dec 07, 2022 at 01:34:44PM -0500, Dave Marchevsky wrote:
>>>> On 12/7/22 11:41 AM, Kumar Kartikeya Dwivedi wrote:
>>>>> On Wed, Dec 07, 2022 at 04:39:48AM IST, Dave Marchevsky wrote:
>>>>>> btf->struct_meta_tab is populated by btf_parse_struct_metas in btf.c.
>>>>>> There, a BTF record is created for any type containing a spin_lock or
>>>>>> any next-gen datastructure node/head.
>>>>>>
>>>>>> Currently, for non-MAP_VALUE types, reg_btf_record will only search for
>>>>>> a record using struct_meta_tab if the reg->type exactly matches
>>>>>> (PTR_TO_BTF_ID | MEM_ALLOC). This exact match is too strict: an
>>>>>> "allocated obj" type - returned from bpf_obj_new - might pick up other
>>>>>> flags while working its way through the program.
>>>>>>
>>>>>
>>>>> Not following. Only PTR_TO_BTF_ID | MEM_ALLOC is the valid reg->type that can be
>>>>> passed to helpers. reg_btf_record is used in helpers to inspect the btf_record.
>>>>> Any other flag combination (the only one possible is PTR_UNTRUSTED right now)
>>>>> cannot be passed to helpers in the first place. The reason to set PTR_UNTRUSTED
>>>>> is to make them unpassable to helpers.
>>>>>
>>>>
>>>> I see what you mean. If reg_btf_record is only used on regs which are args,
>>>> then the exact match helps enforce PTR_UNTRUSTED not being an acceptable
>>>> type flag for an arg. Most uses of reg_btf_record seem to be on arg regs,
>>>> but then we have its use in reg_may_point_to_spin_lock, which is itself
>>>> used in mark_ptr_or_null_reg and on BPF_REG_0 in check_kfunc_call. So I'm not
>>>> sure that it's only used on arg regs currently.
>>>>
>>>> Regardless, if the intended use is on arg regs only, it should be renamed to
>>>> arg_reg_btf_record or similar to make that clear, as current name sounds like
>>>> it should be applicable to any reg, and thus not enforce constraints particular
>>>> to arg regs.
>>>>
>>>> But I think it's better to leave it general and enforce those constraints
>>>> elsewhere. For kfuncs this is already happening in check_kfunc_args, where the
>>>> big switch statements for KF_ARG_* are doing exact type matching.
>>>>
>>>>>> Loosen the check to be exact for base_type and just use MEM_ALLOC mask
>>>>>> for type_flag.
>>>>>>
>>>>>> This patch is marked Fixes as the original intent of reg_btf_record was
>>>>>> unlikely to have been to fail finding btf_record for valid alloc obj
>>>>>> types with additional flags, some of which (e.g. PTR_UNTRUSTED)
>>>>>> are valid register type states for alloc obj independent of this series.
>>>>>
>>>>> That was the actual intent, same as how check_ptr_to_btf_access uses the exact
>>>>> reg->type to allow the BPF_WRITE case.
>>>>>
>>>>> I think this series is the one introducing this case, passing bpf_rbtree_first's
>>>>> result to bpf_rbtree_remove, which I think is not possible to make safe in the
>>>>> first place. We decided to do bpf_list_pop_front instead of bpf_list_entry ->
>>>>> bpf_list_del due to this exact issue. More in [0].
>>>>>
>>>>> [0]: https://lore.kernel.org/bpf/CAADnVQKifhUk_HE+8qQ=AOhAssH6w9LZ082Oo53rwaS+tAGtOw@mail.gmail.com
>>>>>
>>>>
>>>> Thanks for the link, I better understand what Alexei meant in his comment on
>>>> patch 9 of this series. For the helpers added in this series, we can make
>>>> bpf_rbtree_first -> bpf_rbtree_remove safe by invalidating all release_on_unlock
>>>> refs after the rbtree_remove in the same manner as they're invalidated after
>>>> spin_unlock currently.
>>>>
>>>> Logic for why this is safe:
>>>>
>>>> * If we have two non-owning refs to nodes in a tree, e.g. from
>>>> bpf_rbtree_add(node) and calling bpf_rbtree_first() immediately after,
>>>> we have no way of knowing if they're aliases of same node.
>>>>
>>>> * If bpf_rbtree_remove takes arbitrary non-owning ref to node in the tree,
>>>> it might be removing a node that's already been removed, e.g.:
>>>>
>>>> n = bpf_obj_new(...);
>>>> bpf_spin_lock(&lock);
>>>>
>>>> bpf_rbtree_add(&tree, &n->node);
>>>> // n is now non-owning ref to node which was added
>>>> res = bpf_rbtree_first(&tree);
>>>> if (!res) {}
>>>> m = container_of(res, struct node_data, node);
>>>> // m is now non-owning ref to the same node
>>>> bpf_rbtree_remove(&tree, &n->node);
>>>> bpf_rbtree_remove(&tree, &m->node); // BAD
>>>
>>> Let me clarify my previous email:
>>>
>>> Above doesn't have to be 'BAD'.
>>> Instead of
>>> if (WARN_ON_ONCE(RB_EMPTY_NODE(n)))
>>>
>>> we can drop WARN and simply return.
>>> If node is not part of the tree -> nop.
>>>
>>> Same for bpf_rbtree_add.
>>> If it's already added -> nop.
>>>
>>
>> These runtime checks can certainly be done, but if we can guarantee via
>> verifier type system that a particular ptr-to-node is guaranteed to be in /
>> not be in a tree, that's better, no?
>>
>> Feels like a similar train of thought to "fail verification when correct rbtree
>> lock isn't held" vs "just check if lock is held in every rbtree API kfunc".
>>
>>> Then we can have bpf_rbtree_first() returning PTR_TRUSTED with acquire semantics.
>>> We do all these checks under the same rbtree root lock, so it's safe.
>>>
>>
>> I'll comment on PTR_TRUSTED in our discussion on patch 10.
>>
>>>> bpf_spin_unlock(&lock);
>>>>
>>>> * bpf_rbtree_remove is the only "pop()" currently. Non-owning refs are at risk
>>>> of pointing to something that was already removed _only_ after a
>>>> rbtree_remove, so if we invalidate them all after rbtree_remove they can't
>>>> be inputs to subsequent remove()s
>>>
>>> With above proposed run-time checks both bpf_rbtree_remove and bpf_rbtree_add
>>> can have release semantics.
>>> No need for special release_on_unlock hacks.
>>>
>>
>> If we want to be able to interact w/ nodes after they've been added to the
>> rbtree, but before critical section ends, we need to support non-owning refs,
>> which are currently implemented using special release_on_unlock logic.
>>
>> If we go with the runtime check suggestion from above, we'd need to implement
>> 'conditional release' similarly to earlier "rbtree map" attempt:
>> https://lore.kernel.org/bpf/20220830172759.4069786-14-davemarchevsky@fb.com/ .
>>
>> If rbtree_add has release semantics for its node arg, but the node is already
>> in some tree and runtime check fails, the reference should not be released as
>> rbtree_add() was a nop.
>
> Got it.
> The conditional release is tricky. We should probably avoid it for now.
>
> I think we can either go with Kumar's proposal and do
> bpf_rbtree_pop_front() instead of bpf_rbtree_first()
> that avoids all these issues...
>
> but considering that we'll have inline iterators soon and should be able to do:
>
> struct bpf_rbtree_iter it;
> struct bpf_rb_node * node;
>
> bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree
> while ((node = bpf_rbtree_iter_next(&it))) {
> if (node->field == condition) {
> struct bpf_rb_node *n;
>
> n = bpf_rbtree_remove(rb_root, node);
> bpf_spin_lock(another_rb_root);
> bpf_rbtree_add(another_rb_root, n);
> bpf_spin_unlock(another_rb_root);
> break;
> }
> }
> bpf_rbtree_iter_destroy(&it);
>
> We can treat the 'node' returned from bpf_rbtree_iter_next() the same way
> as return from bpf_rbtree_first() -> PTR_TRUSTED | MAYBE_NULL,
> but not acquired (ref_obj_id == 0).
>
> bpf_rbtree_add -> KF_RELEASE
> so we cannot pass not acquired pointers into it.
>
> We should probably remove release_on_unlock logic as Kumar suggesting and
> make bpf_list_push_front/back to be KF_RELEASE.
>
> Then
> bpf_list_pop_front/back stay KF_ACQUIRE | KF_RET_NULL
> and
> bpf_rbtree_remove is also KF_ACQUIRE | KF_RET_NULL.
>
> The difference is bpf_list_pop has only 'head'
> while bpf_rbtree_remove has 'root' and 'node' where 'node' has to be PTR_TRUSTED
> (but not acquired).
>
> bpf_rbtree_add will always succeed.
> bpf_rbtree_remove will conditionally fail if 'node' is not linked.
>
> Similarly we can extend link list with
> n = bpf_list_remove(node)
> which will have KF_ACQUIRE | KF_RET_NULL semantics.
>
> Then everything is nicely uniform.
> We'll be able to iterate rbtree and iterate link lists.
>
> There are downsides, of course.
> Like the following from your test case:
> + bpf_spin_lock(&glock);
> + bpf_rbtree_add(&groot, &n->node, less);
> + bpf_rbtree_add(&groot, &m->node, less);
> + res = bpf_rbtree_remove(&groot, &n->node);
> + bpf_spin_unlock(&glock);
> will not work.
> Since bpf_rbtree_add() releases 'n' and it becomes UNTRUSTED.
> (assuming release_on_unlock is removed).
>
> I think it's fine for now. I have to agree with Kumar that it's hard to come up
> with a realistic use case where 'n' should be accessed after it was added to a link
> list or rbtree. Above test case doesn't look real.
>
> This part of your test case:
> + bpf_spin_lock(&glock);
> + bpf_rbtree_add(&groot, &n->node, less);
> + bpf_rbtree_add(&groot, &m->node, less);
> + bpf_rbtree_add(&groot, &o->node, less);
> +
> + res = bpf_rbtree_first(&groot);
> + if (!res) {
> + bpf_spin_unlock(&glock);
> + return 2;
> + }
> +
> + o = container_of(res, struct node_data, node);
> + res = bpf_rbtree_remove(&groot, &o->node);
> + bpf_spin_unlock(&glock);
>
> will work, because bpf_rbtree_first returns PTR_TRUSTED | MAYBE_NULL.
>
>> Similarly, if rbtree_remove has release semantics for its node arg and acquire
>> semantics for its return value, runtime check failing should result in the
>> node arg not being released. Acquire semantics for the retval are already
>> conditional - if retval == NULL, mark_ptr_or_null regs will release the
>> acquired ref before it can be used. So no issue with failing rbtree_remove
>> messing up acquire.
>>
>> For this reason rbtree_remove and rbtree_first are tagged
>> KF_ACQUIRE | KF_RET_NULL. "special release_on_unlock hacks" can likely be
>> refactored into a similar flag, KF_RELEASE_NON_OWN or similar.
>
> I guess what I'm proposing above is sort-of KF_RELEASE_NON_OWN idea,
> but from a different angle.
> I'd like to avoid introducing new flags.
> I think PTR_TRUSTED is enough.
>
>>> I'm not sure what the idea is behind returning 'n' from remove...
>>> Maybe it should be a simple bool?
>>>
>>
>> I agree that returning node from rbtree_remove is not strictly necessary, since
>> rbtree_remove can be thought of turning its non-owning ref argument into an
>> owning ref, instead of taking non-owning ref and returning owning ref. But such
>> an operation isn't really an 'acquire' by current verifier logic, since only
>> retvals can be 'acquired'. So we'd need to add some logic to enable acquire
>> semantics for args. Furthermore it's not really 'acquiring' a new ref, rather
>> changing properties of node arg ref.
>>
>> However, if rbtree_remove can fail, such a "turn non-owning into owning"
>> operation will need to be able to fail as well, and the program will need to
>> be able to check for failure. Returning 'acquire' result in retval makes
>> this simple - just check for NULL. For your "return bool" proposal, we'd have
>> to add verifier logic which turns the 'acquired' owning ref back into non-owning
>> based on check of the bool, which will add some verifier complexity.
>>
>> IIRC when doing experimentation with "rbtree map" implementation, I did
>> something like this and decided that the additional complexity wasn't worth
>> it when retval can just be used.
>
> Agree. Forget 'bool' idea.
We will merge this convo w/ similar one in the cover letter's thread, and
continue w/ replies there.
* Re: [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
2022-12-07 23:39 ` Dave Marchevsky
@ 2022-12-08 0:47 ` Alexei Starovoitov
2022-12-08 8:50 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-08 0:47 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On Wed, Dec 07, 2022 at 06:39:38PM -0500, Dave Marchevsky wrote:
> >>
> >> 0000000000000000 <less>:
> >> ; return node_a->key < node_b->key;
> >> 0: 79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
> >> 1: 79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
> >> 2: b4 00 00 00 01 00 00 00 w0 = 0x1
> >> ; return node_a->key < node_b->key;
> >
> > I see. That's the same bug.
> > The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with
> > correct positive offset.
> > Then node_a = container_of(a, struct node_data, node);
> > would have produced correct offset into proper btf_id.
> >
> > The verifier should be passing into less() the btf_id
> > of struct node_data instead of btf_id of struct bpf_rb_node.
> >
>
> The verifier is already passing the struct node_data type, not bpf_rb_node.
> For less() args, and rbtree_{first,remove} retval, mark_reg_datastructure_node
> - added in patch 8 - is doing as you describe.
>
> Verifier sees less' arg regs as R=ptr_to_node_data(off=16). If it was
> instead passing R=ptr_to_bpf_rb_node(off=0), attempting to access *(reg - 0x10)
> would cause verifier err.
Ahh. I finally got it :)
Please put these details in the commit log when you respin.
> >> 3: cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
> >> 4: b4 00 00 00 00 00 00 00 w0 = 0x0
> >>
> >> 0000000000000028 <LBB2_2>:
> >> ; return node_a->key < node_b->key;
> >> 5: 95 00 00 00 00 00 00 00 exit
> >>
> >> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
> >> negative insn->off. Verifier's view of R1 and R2 before insn 0 is
> >> untrusted_ptr_node_data(off=16). If there were some intermediate insns
> >> storing result of container_of() before dereferencing:
> >>
> >> r3 = (r2 - 0x10)
> >> r2 = *(u64 *)(r3)
> >>
> >> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
> >> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
> >> using original arg ptrs to less() instead of storing container_of() ptr
> >> adjustments.
> >>
> >> Since the container_of usage and code pattern in above example's less()
> >> isn't particularly specific to this series, I think there are other scenarios
> >> where such code would be generated, so I considered this a general bugfix in
> >> the cover letter.
> >
> > imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.
> >
>
> If I used PTR_TRUSTED here, the JITted instructions would still do a load like
> r2 = *(u64 *)(r2 - 0x10). There would just be no BPF_PROBE_MEM runtime checking
> insns generated, avoiding the negative-offset issue there. But the negative insn->off
> load being generated is not specific to PTR_UNTRUSTED.
yep.
> >
> > Exactly. More flags will only increase the confusion.
> > Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
> > rbtree kfuncs while inside this particular callback to prevent recursion.
> > That would solve all these issues, no?
> > Writing into such PTR_TRUSTED should be still allowed inside cb though it's bogus.
> >
> > Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
> > both link list and rbtree.
> > It should still be safe to operate on link list part of that node from less()
> > though it's not something we would ever recommend.
>
> I definitely want to allow writes on non-owning references. In order to properly
> support this, there needs to be a way to designate a field as a "key":
>
> struct node_data {
> long key __key;
> long data;
> struct bpf_rb_node node;
> };
>
> or perhaps on the rb_root via __contains or separate tag:
>
> struct bpf_rb_root groot __contains(struct node_data, node, key);
>
> This is necessary because the rbtree's less() uses the key field to determine
> order, so we don't want to allow writes to the key field while the node is in
> an rbtree. If such a write were possible the rbtree could easily be put in an
> invalid state, since the new key may mean the rbtree is no longer sorted.
> Subsequent add() operations would compare with less() using the new key, so
> other nodes would be placed in the wrong spot as well.
>
> Since PTR_UNTRUSTED currently allows read but not write, and prevents use of
> non-owning ref as a kfunc arg, it seemed to be a reasonable tag for less() args.
>
> I was planning on adding __key / non-owning-ref write support as a followup, but
> adding it as part of this series will probably save a lot of back-and-forth.
> Will try to add it.
Just a key mark might not be enough. less() could be doing all sorts of complex
logic on more than one field, and even global fields.
But what is the concern with writing into 'key' ?
The rbtree will not be sorted. find/add operations will not be correct,
but nothing will crash. At the end bpf_rb_root_free() will walk all
unsorted nodes anyway and free them all.
Even if we pass PTR_TRUSTED | MEM_RDONLY pointers into less() the less()
can still do nonsensical things like returning random true/false.
Doesn't look like an issue to me.
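To illustrate that teardown doesn't depend on ordering, a userspace sketch (a
stand-in drain loop with hypothetical types, not bpf_rb_root_free itself):

```c
#include <assert.h>
#include <stdlib.h>

struct node { long key; struct node *next; };

/* Drain-and-free walks every node regardless of key order; the
 * ordering invariant only matters for find/add correctness, not
 * for teardown. Nothing crashes on an unsorted structure. */
static int free_all(struct node **head)
{
	int freed = 0;

	while (*head) {
		struct node *victim = *head;

		*head = victim->next;
		free(victim);
		freed++;
	}
	return freed;
}
```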
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-07 23:06 ` Alexei Starovoitov
@ 2022-12-08 1:18 ` Dave Marchevsky
2022-12-08 3:51 ` Alexei Starovoitov
0 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-08 1:18 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On 12/7/22 6:06 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 05:28:34PM -0500, Dave Marchevsky wrote:
>> On 12/7/22 2:36 PM, Kumar Kartikeya Dwivedi wrote:
>>> On Wed, Dec 07, 2022 at 04:39:47AM IST, Dave Marchevsky wrote:
>>>> This series adds a rbtree datastructure following the "next-gen
>>>> datastructure" precedent set by recently-added linked-list [0]. This is
>>>> a reimplementation of previous rbtree RFC [1] to use kfunc + kptr
>>>> instead of adding a new map type. This series adds a smaller set of API
>>>> functions than that RFC - just the minimum needed to support current
>>>> cgfifo example scheduler in ongoing sched_ext effort [2], namely:
>>>>
>>>> bpf_rbtree_add
>>>> bpf_rbtree_remove
>>>> bpf_rbtree_first
>>>>
>>>> [...]
>>>>
>>>> Future work:
>>>> Enabling writes to release_on_unlock refs should be done before the
>>>> functionality of BPF rbtree can truly be considered complete.
>>>> Implementing this proved more complex than expected so it's been
>>>> pushed off to a future patch.
>>>>
>>
>>>
>>> TBH, I think we need to revisit whether there's a strong need for this. I would
>>> even argue that we should simply make the release semantics of rbtree_add,
>>> list_push helpers stronger and remove release_on_unlock logic entirely,
>>> releasing the node immediately. I don't see why it is so critical to have read,
>>> and more importantly, write access to nodes after losing their ownership. And
>>> that too is only available until the lock is unlocked.
>>>
>>
>> Moved the next paragraph here to ease reply, it was the last paragraph
>> in your response.
>>
>>>
>>> Can you elaborate on actual use cases where immediate release or not having
>>> write support makes it hard or impossible to support a certain use case, so that
>>> it is easier to understand the requirements and design things accordingly?
>>>
>>
>> Sure, the main usecase and impetus behind this for me is the sched_ext work
>> Tejun and others are doing (https://lwn.net/Articles/916291/ ). One of the
>> things they'd like to be able to do is implement a CFS-like scheduler using
>> rbtree entirely in BPF. This would prove that sched_ext + BPF can be used to
>> implement complicated scheduling logic.
>>
>> If we can implement such complicated scheduling logic, but it has so much
>> BPF-specific twisting of program logic that it's incomprehensible to scheduler
>> folks, that's not great. The overlap between "BPF experts" and "scheduler
>> experts" is small, and we want the latter group to be able to read BPF
>> scheduling logic without too much struggle. Lower learning curve makes folks
>> more likely to experiment with sched_ext.
>>
>> When 'rbtree map' was in brainstorming / prototyping, non-owning reference
>> semantics were called out as moving BPF datastructures closer to their kernel
>> equivalents from a UX perspective.
>
> Our emails crossed. See my previous email.
> Agree on the above.
>
>> If the "it makes BPF code better resemble normal kernel code" argument was the
>> only reason to do this I wouldn't feel so strongly, but there are practical
>> concerns as well:
>>
>> If we could only read / write from rbtree node if it isn't in a tree, the common
>> operation of "find this node and update its data" would require removing and
>> re-adding it. For rbtree, these unnecessary remove and add operations could
>
> Not really. See my previous email.
>
>> result in unnecessary rebalancing. Going back to the sched_ext usecase,
>> if we have a rbtree with task or cgroup stats that need to be updated often,
>> unnecessary rebalancing would make this update slower than if non-owning refs
>> allowed in-place read/write of node data.
>
> Agree. Read/write from non-owning refs is necessary.
> In the other email I'm arguing that PTR_TRUSTED with ref_obj_id == 0
> (your non-owning ref) should not be mixed with release_on_unlock logic.
>
> KF_RELEASE should still accept as args and release only ptrs with ref_obj_id > 0.
>
>>
>> Also, we eventually want to be able to have a node that's part of both a
>> list and rbtree. Likely adding such a node to both would require calling
>> kfunc for adding to list, and separate kfunc call for adding to rbtree.
>> Once the node has been added to list, we need some way to represent a reference
>> to that node so that we can pass it to rbtree add kfunc. Sounds like a
>> non-owning reference to me, albeit with different semantics than current
>> release_on_unlock.
>
> A node with both link list and rbtree would be a new concept.
> We'd need to introduce 'struct bpf_refcnt' and make sure prog does the right thing.
> That's a future discussion.
>
>>
>>> I think this relaxed release logic and write support is the wrong direction to
>>> take, as it has a direct bearing on what can be done with a node inside the
>>> critical section. There's already the problem with not being able to do
>>> bpf_obj_drop easily inside the critical section with this. That might be useful
>>> for draining operations while holding the lock.
>>>
>>
>> The bpf_obj_drop case is similar to your "can't pass non-owning reference
>> to bpf_rbtree_remove" concern from patch 1's thread. If we have:
>>
>> n = bpf_obj_new(...); // n is owning ref
>> bpf_rbtree_add(&tree, &n->node); // n is non-owning ref
>
> what I proposed in the other email...
> n should be untrusted here.
> That's != 'n is non-owning ref'
>
>> res = bpf_rbtree_first(&tree);
>> if (!res) {...}
>> m = container_of(res, struct node_data, node); // m is non-owning ref
>
> agree. m == PTR_TRUSTED with ref_obj_id == 0.
>
>> res = bpf_rbtree_remove(&tree, &n->node);
>
> a typo here? Did you mean 'm->node' ?
>
> and after 'if (res)' ...
>> n = container_of(res, struct node_data, node); // n is owning ref, m points to same memory
>
> agree. n -> ref_obj_id > 0
>
>> bpf_obj_drop(n);
>
> above is ok to do.
> 'n' becomes UNTRUSTED or invalid.
>
>> // Not safe to use m anymore
>
> 'm' should have become UNTRUSTED after bpf_rbtree_remove.
>
>> Datastructures which support bpf_obj_drop in the critical section can
>> do same as my bpf_rbtree_remove suggestion: just invalidate all non-owning
>> references after bpf_obj_drop.
>
> 'invalidate all' sounds suspicious.
> I don't think we need to do a sweeping search after bpf_obj_drop.
>
>> Then there's no potential use-after-free.
>> (For the above example, pretend bpf_rbtree_remove didn't already invalidate
>> 'm', or that there's some other way to obtain a non-owning ref to 'n's node
>> after rbtree_remove)
>>
>> I think that, in practice, operations where the BPF program wants to remove
>> / delete nodes will be distinct from operations where program just wants to
>> obtain some non-owning refs and do read / write. At least for sched_ext usecase
>> this is true. So all the additional clobbers won't require program writer
>> to do special workarounds to deal with verifier in the common case.
>>
>>> Semantically in other languages, once you move an object, accessing it is
>>> usually a bug, and in most of the cases it is sufficient to prepare it before
>>> insertion. We are certainly in the same territory here with these APIs.
>>
>> Sure, but 'add'/'remove' for these intrusive linked datastructures is
>> _not_ a 'move'. Obscuring this from the user and forcing them to use
>> less performant patterns for the sake of some verifier complexity, or desire
>> to mimic semantics of languages w/o reference stability, doesn't make sense to
>> me.
>
> I agree, but everything we discuss in the above looks orthogonal
> to release_on_unlock that myself and Kumar are proposing to drop.
>
>> If we were to add some datastructures without reference stability, sure, let's
>> not do non-owning references for those. So let's make this non-owning reference
>> stuff easy to turn on/off, perhaps via KF_RELEASE_NON_OWN or similar flags,
>> which will coincidentally make it very easy to remove if we later decide that
>> the complexity isn't worth it.
>
> You mean KF_RELEASE_NON_OWN would be applied to bpf_rbtree_remove() ?
> So it accepts PTR_TRUSTED ref_obj_id == 0 arg and makes it PTR_UNTRUSTED ?
> If so then I agree. The 'release' part of the name was confusing.
> It's also not clear which arg it applies to.
> bpf_rbtree_remove has two args. Both are PTR_TRUSTED.
> I wouldn't introduce a new flag for this just yet.
> We can hard code bpf_rbtree_remove, bpf_list_pop for now
> or use our name suffix hack.
Before replying to specific things in this email, I think it would be useful
to have a subthread clearing up definitions and semantics, as I think we're
talking past each other a bit.
On a conceptual level I've still been using "owning reference" and "non-owning
reference" to understand rbtree operations. I'll use those here and try to map
them to actual verifier concepts later.
owning reference
* This reference controls the lifetime of the pointee
* Ownership of pointee must be 'released' by passing it to some rbtree
API kfunc - rbtree_add in our case - or via bpf_obj_drop, which frees it
* If not released before program ends, verifier considers prog invalid
* Access to the memory ref is pointing at will not page fault
non-owning reference
* No ownership of pointee so can't pass ownership via rbtree_add, not allowed
to bpf_obj_drop
* No control of lifetime, but can infer memory safety based on context
(see explanation below)
* Access to the memory ref is pointing at will not page fault
(see explanation below)
2) From verifier's perspective non-owning references can only exist
between spin_lock and spin_unlock. Why? After spin_unlock another program
can do arbitrary operations on the rbtree like removing and free-ing
via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
free'd, and reused via bpf_obj_new would point to an entirely different thing.
Or the memory could go away.
To prevent this logic violation all non-owning references are invalidated by
verifier after critical section ends. This is necessary to ensure "will
not page fault" property of non-owning reference. So if verifier hasn't
invalidated a non-owning ref, accessing it will not page fault.
Currently bpf_obj_drop is not allowed in the critical section, so similarly,
if there's a valid non-owning ref, we must be in critical section, and can
conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
and-reused.
1) Any reference to a node that is in a rbtree _must_ be non-owning, since
the tree has control of pointee lifetime. Similarly, any ref to a node
that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
node in map_val for now)
Moving on to rbtree API:
bpf_rbtree_add(&tree, &node);
'node' is an owning ref, becomes a non-owning ref.
bpf_rbtree_first(&tree);
retval is a non-owning ref, since first() node is still in tree
bpf_rbtree_remove(&tree, &node);
'node' is a non-owning ref, retval is an owning ref
All of the above can only be called when rbtree's lock is held, so invalidation
of all non-owning refs on spin_unlock is fine for rbtree_remove.
Nice property of paragraph marked with 1) above is the ability to use the
type system to prevent rbtree_add of node that's already in rbtree and
rbtree_remove of node that's not in one. So we can forego runtime
checking of "already in tree", "already not in tree".
But, as you and Kumar talked about in the past and referenced in patch 1's
thread, non-owning refs may alias each other, or an owning ref, and have no
way of knowing whether this is the case. So if X and Y are two non-owning refs
that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
already isn't in any tree (since prog has an owning ref to it). But verifier
doesn't know X and Y alias each other. So previous paragraph's "forego
runtime checks" statement can only hold if we invalidate all non-owning refs
after 'destructive' rbtree_remove operation.
It doesn't matter to me which combination of type flags, ref_obj_id, other
reg state stuff, and special-casing is used to implement owning and non-owning
refs. Specific ones chosen in this series for rbtree node:
owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
ref_obj_id > 0
non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
PTR_UNTRUSTED
- used for "can't pass ownership", not PROBE_MEM
- this is why I mentioned "decomposing UNTRUSTED into more
granular reg traits" in another thread
ref_obj_id > 0
release_on_unlock = true
- used due to paragraphs starting with 2) above
Any other combination of type and reg state that gives me the semantics def'd
above works for me.
Based on this reply and others from today, I think you're saying that these
concepts should be implemented using:
owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
PTR_TRUSTED
ref_obj_id > 0
non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
PTR_TRUSTED
ref_obj_id == 0
- used for "can't pass ownership", since funcs that expect
owning ref need ref_obj_id > 0
And you're also adding 'untrusted' here, mainly as a result of
bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
instead of becoming a non-owning ref. 'untrusted' would have state like:
PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
PTR_UNTRUSTED
ref_obj_id == 0?
I think your "non-owning ref" definition also differs from mine, specifically
yours doesn't seem to have "will not page fault". For this reason, you don't
see the need for release_on_unlock logic, since that's used to prevent refs
escaping critical section and potentially referring to free'd memory.
This is where I start to get confused. Some questions:
* If we get rid of release_on_unlock, and with mass invalidation of
non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?
* Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
in this scheme, since non-owning ref can escape spin_unlock b/c no mass
invalidation? PTR_UNTRUSTED isn't sufficient here
* If non-owning ref can live past spin_unlock, do we expect read from
such ref after _unlock to go through bpf_probe_read()? Otherwise direct
read might fault and silently write 0.
* For your 'untrusted', but not non-owning ref concept, I'm not sure
what this gives us that's better than just invalidating the ref which
gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)
I'm also not sure if you agree with my paragraph marked 1) above. But IMO the
release_on_unlock difference, and the perhaps-differing non-owning ref concept
are where we're really talking past each other.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-08 1:18 ` Dave Marchevsky
@ 2022-12-08 3:51 ` Alexei Starovoitov
2022-12-08 8:28 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-08 3:51 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On Wed, Dec 07, 2022 at 08:18:25PM -0500, Dave Marchevsky wrote:
>
> Before replying to specific things in this email, I think it would be useful
> to have a subthread clearing up definitions and semantics, as I think we're
> talking past each other a bit.
Yeah. We were not on the same page.
The concepts of 'owning ref' and 'non-owning ref' appeared 'new' to me.
I remember discussing 'conditional release' and OBJ_NON_OWNING_REF long ago
and I thought we agreed that both are not necessary and with that
I assumed that anything 'non-owning' as a concept is gone too.
So the only thing left (in my mind) was the 'owning' concept.
Which I mapped as ref_obj_id > 0. In other words 'owning' meant 'acquired'.
Please have this detailed explanation in the commit log next time to
avoid this back and forth.
Now to the fun part...
>
> On a conceptual level I've still been using "owning reference" and "non-owning
> reference" to understand rbtree operations. I'll use those here and try to map
> them to actual verifier concepts later.
>
> owning reference
>
> * This reference controls the lifetime of the pointee
> * Ownership of pointee must be 'released' by passing it to some rbtree
> API kfunc - rbtree_add in our case - or via bpf_obj_drop, which free's
> * If not released before program ends, verifier considers prog invalid
> * Access to the memory ref is pointing at will not page fault
agree.
> non-owning reference
>
> * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
> to bpf_obj_drop
> * No control of lifetime, but can infer memory safety based on context
> (see explanation below)
> * Access to the memory ref is pointing at will not page fault
> (see explanation below)
agree with addition that both read and write should be allowed into this
'non-owning' ptr.
Which breaks if you map this to something that ORs with PTR_UNTRUSTED.
> 2) From verifier's perspective non-owning references can only exist
> between spin_lock and spin_unlock. Why? After spin_unlock another program
> can do arbitrary operations on the rbtree like removing and free-ing
> via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
> free'd, and reused via bpf_obj_new would point to an entirely different thing.
> Or the memory could go away.
agree that spin_unlock needs to clean up 'non-owning'.
> To prevent this logic violation all non-owning references are invalidated by
> verifier after critical section ends. This is necessary to ensure "will
> not page fault" property of non-owning reference. So if verifier hasn't
> invalidated a non-owning ref, accessing it will not page fault.
>
> Currently bpf_obj_drop is not allowed in the critical section, so similarly,
> if there's a valid non-owning ref, we must be in critical section, and can
> conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
> and-reused.
I don't understand why is that a problem.
> 1) Any reference to a node that is in a rbtree _must_ be non-owning, since
> the tree has control of pointee lifetime. Similarly, any ref to a node
> that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
> node in map_val for now)
Also not clear why such restriction is necessary.
> Moving on to rbtree API:
>
> bpf_rbtree_add(&tree, &node);
> 'node' is an owning ref, becomes a non-owning ref.
>
> bpf_rbtree_first(&tree);
> retval is a non-owning ref, since first() node is still in tree
>
> bpf_rbtree_remove(&tree, &node);
> 'node' is a non-owning ref, retval is an owning ref
agree on the above definition.
> All of the above can only be called when rbtree's lock is held, so invalidation
> of all non-owning refs on spin_unlock is fine for rbtree_remove.
>
> Nice property of paragraph marked with 1) above is the ability to use the
> type system to prevent rbtree_add of node that's already in rbtree and
> rbtree_remove of node that's not in one. So we can forego runtime
> checking of "already in tree", "already not in tree".
I think it's easier to add runtime check inside bpf_rbtree_remove()
since it already returns MAYBE_NULL. No 'conditional release' necessary.
And with that we don't need to worry about aliases.
> But, as you and Kumar talked about in the past and referenced in patch 1's
> thread, non-owning refs may alias each other, or an owning ref, and have no
> way of knowing whether this is the case. So if X and Y are two non-owning refs
> that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
> call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
> already isn't in any tree (since prog has an owning ref to it). But verifier
> doesn't know X and Y alias each other. So previous paragraph's "forego
> runtime checks" statement can only hold if we invalidate all non-owning refs
> after 'destructive' rbtree_remove operation.
right. we either invalidate all non-owning after bpf_rbtree_remove
or do run-time check in bpf_rbtree_remove.
Consider the following:
bpf_spin_lock
n = bpf_rbtree_first(root);
m = bpf_rbtree_first(root);
x = bpf_rbtree_remove(root, n)
y = bpf_rbtree_remove(root, m)
bpf_spin_unlock
if (x)
bpf_obj_drop(x)
if (y)
bpf_obj_drop(y)
If we invalidate after bpf_rbtree_remove() the above will be rejected by the verifier.
If we do run-time check the above will be accepted and will work without crashing.
The problem with release_on_unlock is that it marks 'n' after 1st remove
as UNTRUSTED which means 'no write' and 'read via probe_read'.
That's not good imo.
>
> It doesn't matter to me which combination of type flags, ref_obj_id, other
> reg state stuff, and special-casing is used to implement owning and non-owning
> refs. Specific ones chosen in this series for rbtree node:
>
> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
> ref_obj_id > 0
>
> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
> PTR_UNTRUSTED
> - used for "can't pass ownership", not PROBE_MEM
> - this is why I mentioned "decomposing UNTRUSTED into more
> granular reg traits" in another thread
Now I understand, but that was very hard to grasp.
UNTRUSTED means 'no write' and 'read via probe_read'.
ref_set_release_on_unlock() also keeps ref_obj_id > 0 as you're correctly
pointing out below:
> ref_obj_id > 0
> release_on_unlock = true
> - used due to paragraphs starting with 2) above
but the problem with ref_set_release_on_unlock() is that it mixes real ref'd
pointers (ref_obj_id > 0) with UNTRUSTED && ref_obj_id > 0.
And the latter is a quite confusing combination in my mind,
since we consider everything with ref_obj_id > 0 as good for KF_TRUSTED_ARGS.
> Any other combination of type and reg state that gives me the semantics def'd
> above works4me.
>
>
> Based on this reply and others from today, I think you're saying that these
> concepts should be implemented using:
>
> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> PTR_TRUSTED
> ref_obj_id > 0
Almost.
I propose:
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id > 0
See the definition of is_trusted_reg().
It's ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED)
I was saying 'trusted' because of is_trusted_reg() definition.
Sorry for confusion.
> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> PTR_TRUSTED
> ref_obj_id == 0
> - used for "can't pass ownership", since funcs that expect
> owning ref need ref_obj_id > 0
I propose:
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
Both 'owning' and 'non-owning' will fit for KF_TRUSTED_ARGS kfuncs.
And we will be able to pass 'non-owning' under spin_lock into other kfuncs
and owning outside of spin_lock into other kfuncs.
Which is a good thing.
> And you're also adding 'untrusted' here, mainly as a result of
> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
> instead of becoming a non-owning ref. 'untrusted' would have state like:
>
> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> PTR_UNTRUSTED
> ref_obj_id == 0?
I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
or doing 'non-owning' is enough.
If it's full untrusted it will be:
PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
tbh I don't remember why we even have 'MEM_ALLOC | PTR_UNTRUSTED'.
> I think your "non-owning ref" definition also differs from mine, specifically
> yours doesn't seem to have "will not page fault". For this reason, you don't
> see the need for release_on_unlock logic, since that's used to prevent refs
> escaping critical section and potentially referring to free'd memory.
Not quite.
We should be able to read/write directly through
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
and we need to convert it to __mark_reg_unknown() after bpf_spin_unlock
the way release_reference() is doing.
I'm just not happy with using acquire_reference/release_reference() logic
(as release_on_unlock is doing) for cleaning after unlock.
Since we need to clean 'non-owning' ptrs in unlock it's confusing
to call the process 'release'.
I was hoping we can search through all states and __mark_reg_unknown() (or UNTRUSTED)
every reg where
reg->id == cur_state->active_lock.id &&
flag == PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
By deleting release_on_unlock I meant deleting the release_on_unlock flag
and remove ref_set_release_on_unlock.
> This is where I start to get confused. Some questions:
>
> * If we get rid of release_on_unlock, and with mass invalidation of
> non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?
Since we'll be cleaning all
PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
it shouldn't affect ptrs with ref_obj_id > 0 that came from bpf_obj_new.
The verifier already enforces that bpf_spin_unlock will be present
at the right place in bpf prog.
When the verifier sees it it will clean all non-owning refs with this spinlock 'id'.
So no concerns of leaking 'non-owning' outside.
While processing bpf_rbtree_first we need to:
regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
regs[BPF_REG_0].id = active_lock.id;
regs[BPF_REG_0].ref_obj_id = 0;
> * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
> in this scheme, since non-owning ref can escape spin_unlock b/c no mass
> invalidation? PTR_UNTRUSTED isn't sufficient here
run-time check in bpf_rbtree_remove (and in the future bpf_list_remove)
should address it, no?
> * If non-owning ref can live past spin_unlock, do we expect read from
> such ref after _unlock to go through bpf_probe_read()? Otherwise direct
> read might fault and silently write 0.
unlock has to clean them.
> * For your 'untrusted', but not non-owning ref concept, I'm not sure
> what this gives us that's better than just invalidating the ref which
> gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)
Whether to mark unknown or untrusted or non-owning after bpf_rbtree_add() is a difficult one.
Untrusted will allow prog to do read only access (via probe_read) into the node
but might hide bugs.
The cleanup after bpf_spin_unlock of non-owning and clean up after
bpf_rbtree_add() does not have to be the same.
Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
and non-owning after bpf_rbtree_add.
Walking the example from previous email:
struct bpf_rbtree_iter it;
struct bpf_rb_node * node;
struct bpf_rb_node *n, *m;
bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree; works as bpf_spin_lock
while ((node = bpf_rbtree_iter_next(&it))) {
// node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
if (node && node->field == condition) {
n = bpf_rbtree_remove(rb_root, node);
if (!n) ...;
// n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
if (!m) ...;
// m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
// node is still:
// node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
// assume we allow double locks one day
bpf_spin_lock(another_rb_root);
bpf_rbtree_add(another_rb_root, n);
// n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[1].id
bpf_spin_unlock(another_rb_root);
// n -> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
break;
}
}
// node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
bpf_rbtree_iter_destroy(&it); // does unlock
// node -> PTR_TO_BTF_ID | PTR_UNTRUSTED
// n -> PTR_TO_BTF_ID | PTR_UNTRUSTED
// m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
bpf_obj_drop(m);
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-08 3:51 ` Alexei Starovoitov
@ 2022-12-08 8:28 ` Dave Marchevsky
2022-12-08 12:57 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-08 8:28 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On 12/7/22 10:51 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 08:18:25PM -0500, Dave Marchevsky wrote:
>>
>> Before replying to specific things in this email, I think it would be useful
>> to have a subthread clearing up definitions and semantics, as I think we're
>> talking past each other a bit.
>
> Yeah. We were not on the same page.
> The concepts of 'owning ref' and 'non-owning ref' appeared 'new' to me.
> I remember discussing 'conditional release' and OBJ_NON_OWNING_REF long ago
> and I thought we agreed that both are not necessary and with that
> I assumed that anything 'non-owning' as a concept is gone too.
> So the only thing left (in my mind) was the 'owning' concept.
> Which I mapped as ref_obj_id > 0. In other words 'owning' meant 'acquired'.
>
Whereas in my mind the release_on_unlock logic was specifically added to
implement the mass invalidation part of non-owning reference semantics, and it
being accepted implied that we weren't getting rid of the concept :).
> Please have this detailed explanation in the commit log next time to
> avoid this back and forth.
> Now to the fun part...
>
I will add a documentation commit explaining 'owning' and 'non-owning' ref
as they pertain to these datastructures, after we agree about the semantics.
Speaking of which, although I have a few questions / clarifications, I think
we're more in agreement after your reply. After one more round of clarification
I will summarize conclusions to see if we agree on enough to move forward.
>>
>> On a conceptual level I've still been using "owning reference" and "non-owning
>> reference" to understand rbtree operations. I'll use those here and try to map
>> them to actual verifier concepts later.
>>
>> owning reference
>>
>> * This reference controls the lifetime of the pointee
>> * Ownership of pointee must be 'released' by passing it to some rbtree
>> API kfunc - rbtree_add in our case - or via bpf_obj_drop, which free's
>> * If not released before program ends, verifier considers prog invalid
>> * Access to the memory ref is pointing at will not page fault
>
> agree.
>
>> non-owning reference
>>
>> * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
>> to bpf_obj_drop
>> * No control of lifetime, but can infer memory safety based on context
>> (see explanation below)
>> * Access to the memory ref is pointing at will not page fault
>> (see explanation below)
>
> agree with addition that both read and write should be allowed into this
> 'non-owning' ptr.
> Which breaks if you map this to something that ORs with PTR_UNTRUSTED.
>
Agree re: read/write allowed. PTR_UNTRUSTED was an implementation detail.
Sounds like we agree on general purpose of owning, non-owning. Looks like
we're in agreement about above semantics.
>> 2) From verifier's perspective non-owning references can only exist
>> between spin_lock and spin_unlock. Why? After spin_unlock another program
>> can do arbitrary operations on the rbtree like removing and free-ing
>> via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
>> free'd, and reused via bpf_obj_new would point to an entirely different thing.
>> Or the memory could go away.
>
> agree that spin_unlock needs to clean up 'non-owning'.
Another point of agreement.
>
>> To prevent this logic violation all non-owning references are invalidated by
>> verifier after critical section ends. This is necessary to ensure "will
>> not page fault" property of non-owning reference. So if verifier hasn't
>> invalidated a non-owning ref, accessing it will not page fault.
>>
>> Currently bpf_obj_drop is not allowed in the critical section, so similarly,
>> if there's a valid non-owning ref, we must be in critical section, and can
>> conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
>> and-reused.
>
> I don't understand why is that a problem.
>
>> 1) Any reference to a node that is in a rbtree _must_ be non-owning, since
>> the tree has control of pointee lifetime. Similarly, any ref to a node
>> that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
>> node in map_val for now)
>
> Also not clear why such restriction is necessary.
>
If we have this restriction and bpf_rbtree_remove also mass invalidates
non-owning refs, the type system will ensure that only nodes that are in a tree
will be passed to bpf_rbtree_remove, and we can avoid the runtime check.
But below you mention preferring the runtime check, mostly noting here to
refer back when continuing reply below.
>> Moving on to rbtree API:
>>
>> bpf_rbtree_add(&tree, &node);
>> 'node' is an owning ref, becomes a non-owning ref.
>>
>> bpf_rbtree_first(&tree);
>> retval is a non-owning ref, since first() node is still in tree
>>
>> bpf_rbtree_remove(&tree, &node);
>> 'node' is a non-owning ref, retval is an owning ref
>
> agree on the above definition.
>
>> All of the above can only be called when rbtree's lock is held, so invalidation
>> of all non-owning refs on spin_unlock is fine for rbtree_remove.
>>
>> Nice property of paragraph marked with 1) above is the ability to use the
>> type system to prevent rbtree_add of node that's already in rbtree and
>> rbtree_remove of node that's not in one. So we can forego runtime
>> checking of "already in tree", "already not in tree".
>
> I think it's easier to add runtime check inside bpf_rbtree_remove()
> since it already returns MAYBE_NULL. No 'conditional release' necessary.
> And with that we don't need to worry about aliases.
>
To clarify: You're proposing that we don't worry about solving the aliasing
problem at verification time. Instead rbtree_{add,remove} will deal with it
at runtime. Corollary of this is that my restriction tagged 1) above ("ref
to node in tree _must_ be non-owning, to node not in tree must be owning")
isn't something we're guaranteeing, due to possibility of aliasing.
So bpf_rbtree_remove might get a node that's not in tree, and
bpf_rbtree_add might get a node that's already in tree. Runtime behavior
of both should be 'nop'.
If that is an accurate restatement of your proposal, the verifier
logic will need to be changed:
For bpf_rbtree_remove(&tree, &node), if node is already not in a tree,
retval will be NULL, effectively not acquiring an owning ref due to
mark_ptr_or_null_reg's logic.
In this case, do we want to invalidate
arg 'node' as well? Or just leave it as a non-owning ref that points
to node not in tree? I think the latter requires fewer verifier changes,
but can see the argument for the former if we want restriction 1) to
mostly be true, unless aliasing.
The above scenario is the only case where bpf_rbtree_remove fails and
returns NULL.
(In this series it can fail and RET_NULL for this reason, but my earlier comment
about type system + invalidate all-non owning after remove as discussed below
was my original intent. So I shouldn't have been allowing RET_NULL for my
version of these semantics.)
For bpf_rbtree_add(&tree, &node, less), if arg is already in tree, then
'node' isn't really an owning ref, and we need to tag it as non-owning,
and program then won't need to bpf_obj_drop it before exiting. If node
wasn't already in tree and rbtree_add actually added it, 'node' would
also be tagged as non-owning, since tree now owns it.
Do we need some way to indicate whether 'already in tree' case happened?
If so, would need to change retval from void to bool or struct bpf_rb_node *.
Above scenario is only case where bpf_rbtree_add fails and returns
NULL / false.
>> But, as you and Kumar talked about in the past and referenced in patch 1's
>> thread, non-owning refs may alias each other, or an owning ref, and have no
>> way of knowing whether this is the case. So if X and Y are two non-owning refs
>> that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
>> call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
>> already isn't in any tree (since prog has an owning ref to it). But verifier
>> doesn't know X and Y alias each other. So previous paragraph's "forego
>> runtime checks" statement can only hold if we invalidate all non-owning refs
>> after 'destructive' rbtree_remove operation.
>
> right. we either invalidate all non-owning after bpf_rbtree_remove
> or do run-time check in bpf_rbtree_remove.
> Consider the following:
> bpf_spin_lock
> n = bpf_rbtree_first(root);
> m = bpf_rbtree_first(root);
> x = bpf_rbtree_remove(root, n)
> y = bpf_rbtree_remove(root, m)
> bpf_spin_unlock
> if (x)
> bpf_obj_drop(x)
> if (y)
> bpf_obj_drop(y)
>
> If we invalidate after bpf_rbtree_remove() the above will be rejected by the verifier.
> If we do run-time check the above will be accepted and will work without crashing.
>
Agreed, although the above example's invalid double-remove of same node is
the kind of thing I'd like to be prevented at verification time instead of
runtime. Regardless, continuing with your runtime check idea.
> The problem with release_on_unlock is that it marks 'n' after 1st remove
> as UNTRUSTED which means 'no write' and 'read via probe_read'.
> That's not good imo.
>
Based on your response to the paragraph below this one, I think we're in agreement
that using PTR_UNTRUSTED for a non-owning ref gives it a bunch of traits
it doesn't need, when I just wanted "can't pass ownership". So agreed that
PTR_UNTRUSTED is too blunt an instrument here.
Regarding "marks 'n' after 1st remove", the series isn't currently doing this,
I proposed it as a way to prevent aliasing problem, but I think your proposal
is explicitly not trying to prevent aliasing problem at verification time. So
for your semantics we would only have non-owning cleanup after spin_unlock.
And such cleanup might just mark refs PTR_UNTRUSTED instead of invalidating
entirely.
>>
>> It doesn't matter to me which combination of type flags, ref_obj_id, other
>> reg state stuff, and special-casing is used to implement owning and non-owning
>> refs. Specific ones chosen in this series for rbtree node:
>>
>> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>> ref_obj_id > 0
>>
>> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
>> PTR_UNTRUSTED
>> - used for "can't pass ownership", not PROBE_MEM
>> - this is why I mentioned "decomposing UNTRUSTED into more
>> granular reg traits" in another thread
>
> Now I understand, but that was very hard to grasp.
> UNTRUSTED means 'no write' and 'read via probe_read'.
> ref_set_release_on_unlock() also keeps ref_obj_id > 0 as you're correctly
> pointing out below:
>> ref_obj_id > 0
>> release_on_unlock = true
>> - used due to paragraphs starting with 2) above
>
> but the problem with ref_set_release_on_unlock() that it mixes real ref-d
> pointers with ref_obj_id > 0 with UNTRUSTED && ref_obj_id > 0.
> And the latter is a quite confusing combination in my mind,
> since we consider everything with ref_obj_id > 0 as good for KF_TRUSTED_ARGS.
>
I think I understand your desire to get rid of release_on_unlock now. It's not
due to disliking the concept of "clean up non-owning refs after spin_unlock",
which you earlier agreed was necessary, but rather the specifics of
release_on_unlock mechanism used to achieve this.
If so, I think I agree with your reasoning for why the mechanism is bad in
light of how you want owning/non-owning implemented. To summarize your
statements about release_on_unlock mechanism from the rest of your reply:
* 'ref_obj_id > 0' already has a specific meaning wrt. is_trusted_reg,
and we may want to support both TRUSTED and UNTRUSTED non-owning refs
* My comment: Currently is_trusted_reg is only used for
KF_ARG_PTR_TO_BTF_ID, while rbtree and list types are assigned special
KF_ARGs. So hypothetically could have different 'is_trusted_reg' logic.
I don't actually think that's a good idea, though, especially since
rbtree / list types are really specializations of PTR_TO_BTF_ID anyways.
So agreed.
* Instead of using 'acquire' and (modified) 'release', we can achieve
"clean-up non-owning after spin_unlock" by associating non-owning
refs with active_lock.id when they're created. We can store this in
reg.id, which is currently unused for PTR_TO_BTF_ID (afaict).
* This will solve issue raised by previous point, allowing us to have
non-owning refs which are truly 'untrusted' according to is_trusted_reg.
* My comment: This all sounds reasonable. On spin_unlock we have
active_lock.id, so can do bpf_for_each_reg_in_vstate to look for
PTR_TO_BTF_IDs matching the id and do 'cleanup' for them.
>> Any other combination of type and reg state that gives me the semantics def'd
>> above works4me.
>>
>>
>> Based on this reply and others from today, I think you're saying that these
>> concepts should be implemented using:
>>
>> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>> PTR_TRUSTED
>> ref_obj_id > 0
>
> Almost.
> I propose:
> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id > 0
>
> See the definition of is_trusted_reg().
> It's ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED)
>
> I was saying 'trusted' because of is_trusted_reg() definition.
> Sorry for confusion.
>
I see. Sounds reasonable.
>> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>> PTR_TRUSTED
>> ref_obj_id == 0
>> - used for "can't pass ownership", since funcs that expect
>> owning ref need ref_obj_id > 0
>
> I propose:
> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
>
Also sounds reasonable, perhaps with the addition of id > 0 to account for
your desired changes to release_on_unlock mechanism?
> Both 'owning' and 'non-owning' will fit for KF_TRUSTED_ARGS kfuncs.
>
> And we will be able to pass 'non-owning' under spin_lock into other kfuncs
> and owning outside of spin_lock into other kfuncs.
> Which is a good thing.
>
Allowing passing of owning ref outside of spin_lock sounds reasonable to me.
'non-owning' under spinlock will have the same "what if this touches __key"
issue I brought up in another thread. But you mentioned not preventing that
and I don't necessarily disagree, so just noting here.
>> And you're also adding 'untrusted' here, mainly as a result of
>> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
>> instead of becoming a non-owning ref. 'untrusted' would have state like:
>>
>> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>> PTR_UNTRUSTED
>> ref_obj_id == 0?
>
> I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
> or doing 'non-owning' is enough.
> If it's full untrusted it will be:
> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
>
Yeah, I don't see what this "full untrusted" is giving us either. Let's have
"cleanup non-owning refs on spin_unlock" just invalidate the regs for now,
instead of converting to "full untrusted"?
Adding "full untrusted" later won't make any valid programs written with
"just invalidate the regs" in mind fail the verifier. So painless to add later.
> tbh I don't remember why we even have 'MEM_ALLOC | PTR_UNTRUSTED'.
>
I think such type combo was only added to implement non-owning refs. If it's
rewritten to use your type combos I don't think there'll be any uses of
MEM_ALLOC | PTR_UNTRUSTED remaining.
>> I think your "non-owning ref" definition also differs from mine, specifically
>> yours doesn't seem to have "will not page fault". For this reason, you don't
>> see the need for release_on_unlock logic, since that's used to prevent refs
>> escaping critical section and potentially referring to free'd memory.
>
> Not quite.
> We should be able to read/write directly through
> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> and we need to convert it to __mark_reg_unknown() after bpf_spin_unlock
> the way release_reference() is doing.
> I'm just not happy with using acquire_reference/release_reference() logic
> (as release_on_unlock is doing) for cleaning after unlock.
> Since we need to clean 'non-owning' ptrs in unlock it's confusing
> to call the process 'release'.
> I was hoping we can search through all states and __mark_reg_unknown() (or UNTRUSTED)
> every reg where
> reg->id == cur_state->active_lock.id &&
> flag == PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
>
> By deleting release_on_unlock I meant delete the release_on_unlock flag
> and remove ref_set_release_on_unlock.
>
Summarized above, but: agreed, and thanks for clarifying what you meant by
"delete release_on_unlock".
>> This is where I start to get confused. Some questions:
>>
>> * If we get rid of release_on_unlock, and with mass invalidation of
>> non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?
>
> Since we'll be cleaning all
> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> it shouldn't affect ptrs with ref_obj_id > 0 that came from bpf_obj_new.
>
> The verifier already enforces that bpf_spin_unlock will be present
> at the right place in bpf prog.
> When the verifier sees it it will clean all non-owning refs with this spinlock 'id'.
> So no concerns of leaking 'non-owning' outside.
>
Sounds like we don't want "full untrusted" or any PTR_UNTRUSTED non-owning ref.
> While processing bpf_rbtree_first we need to:
> regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
> regs[BPF_REG_0].id = active_lock.id;
> regs[BPF_REG_0].ref_obj_id = 0;
>
Agreed.
>> * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
>> in this scheme, since non-owning ref can escape spin_unlock b/c no mass
>> invalidation? PTR_UNTRUSTED isn't sufficient here
>
> run-time check in bpf_rbtree_remove (and in the future bpf_list_remove)
> should address it, no?
>
If we don't do "full untrusted" and cleanup non-owning refs by invalidating,
_and_ don't allow bpf_obj_{new,drop} in critical section, then I don't think
this is an issue.
But to elaborate on the issue, if we instead cleaned up non-owning by marking
untrusted:
struct node_data *n = bpf_obj_new(typeof(*n));
struct node_data *m, *o;
struct some_other_type *t;
bpf_spin_lock(&lock);
bpf_rbtree_add(&tree, n);
m = bpf_rbtree_first();
o = bpf_rbtree_first(); // m and o are non-owning, point to same node
m = bpf_rbtree_remove(&tree, m); // m is owning
bpf_spin_unlock(&lock); // o is "full untrusted", marked PTR_UNTRUSTED
bpf_obj_drop(m);
t = bpf_obj_new(typeof(*t)); // pretend that exact chunk of memory that was
// dropped in previous statement is returned here
data = o->some_data_field; // PROBE_MEM, but no page fault, so load will
// succeed, but will read garbage from another type
// while verifier thinks it's reading from node_data
If we clean up by invalidating, but eventually enable bpf_obj_{new,drop} inside
critical section, we'll have similar issue.
It's not necessarily "crash the kernel" dangerous, but it may anger program
writers since they can't be sure they're not reading garbage in this scenario.
>> * If non-owning ref can live past spin_unlock, do we expect read from
>> such ref after _unlock to go through bpf_probe_read()? Otherwise direct
>> read might fault and silently write 0.
>
> unlock has to clean them.
>
Ack.
>> * For your 'untrusted', but not non-owning ref concept, I'm not sure
>> what this gives us that's better than just invalidating the ref which
>> gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)
>
> Whether to mark unknown or untrusted or non-owning after bpf_rbtree_add() is a difficult one.
> Untrusted will allow prog to do read only access (via probe_read) into the node
> but might hide bugs.
> The cleanup after bpf_spin_unlock of non-owning and clean up after
> bpf_rbtree_add() does not have to be the same.
This is a good point.
> Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
> and non-owning after bpf_rbtree_add.
>
> Walking the example from previous email:
>
> struct bpf_rbtree_iter it;
> struct bpf_rb_node * node;
> struct bpf_rb_node *n, *m;
>
> bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree works as bpf_spin_lock
> while ((node = bpf_rbtree_iter_next(&it)) {
> // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
> if (node && node->field == condition) {
>
> n = bpf_rbtree_remove(rb_root, node);
> if (!n) ...;
> // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
> m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
> if (!m) ...;
> // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
>
> // node is still:
> // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
>
> // assume we allow double locks one day
> bpf_spin_lock(another_rb_root);
> bpf_rbtree_add(another_rb_root, n);
> // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[1].id
> bpf_spin_unlock(another_rb_root);
> // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
> break;
> }
> }
> // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
> bpf_rbtree_iter_destroy(&it); // does unlock
> // node -> PTR_TO_BTF_ID | PTR_UNTRUSTED
> // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED
> // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
> bpf_obj_drop(m);
This seems like a departure from other statements in your reply, where you're
leaning towards "non-owning and trusted" -> "full untrusted" after unlock
being unnecessary. I think the combo of reference aliases + bpf_obj_drop-and-reuse
makes everything hard to reason about.
Regardless, your comments annotating reg state look correct to me.
* Re: [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0
2022-12-08 0:47 ` Alexei Starovoitov
@ 2022-12-08 8:50 ` Dave Marchevsky
0 siblings, 0 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-08 8:50 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On 12/7/22 7:47 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 06:39:38PM -0500, Dave Marchevsky wrote:
>>>>
>>>> 0000000000000000 <less>:
>>>> ; return node_a->key < node_b->key;
>>>> 0: 79 22 f0 ff 00 00 00 00 r2 = *(u64 *)(r2 - 0x10)
>>>> 1: 79 11 f0 ff 00 00 00 00 r1 = *(u64 *)(r1 - 0x10)
>>>> 2: b4 00 00 00 01 00 00 00 w0 = 0x1
>>>> ; return node_a->key < node_b->key;
>>>
>>> I see. That's the same bug.
>>> The args to callback should have been PTR_TO_BTF_ID | PTR_TRUSTED with
>>> correct positive offset.
>>> Then node_a = container_of(a, struct node_data, node);
>>> would have produced correct offset into proper btf_id.
>>>
>>> The verifier should be passing into less() the btf_id
>>> of struct node_data instead of btf_id of struct bpf_rb_node.
>>>
>>
>> The verifier is already passing the struct node_data type, not bpf_rb_node.
>> For less() args, and rbtree_{first,remove} retval, mark_reg_datastructure_node
>> - added in patch 8 - is doing as you describe.
>>
>> Verifier sees less' arg regs as R=ptr_to_node_data(off=16). If it was
>> instead passing R=ptr_to_bpf_rb_node(off=0), attempting to access *(reg - 0x10)
>> would cause verifier err.
>
> Ahh. I finally got it :)
> Please put these details in the commit log when you respin.
>
Glad it finally started making sense.
Will do big improvement of patch summary after addressing other
feedback from this series.
>>>> 3: cd 21 01 00 00 00 00 00 if r1 s< r2 goto +0x1 <LBB2_2>
>>>> 4: b4 00 00 00 00 00 00 00 w0 = 0x0
>>>>
>>>> 0000000000000028 <LBB2_2>:
>>>> ; return node_a->key < node_b->key;
>>>> 5: 95 00 00 00 00 00 00 00 exit
>>>>
>>>> Insns 0 and 1 are loading node_b->key and node_a->key, respectively, using
>>>> negative insn->off. Verifier's view or R1 and R2 before insn 0 is
>>>> untrusted_ptr_node_data(off=16). If there were some intermediate insns
>>>> storing result of container_of() before dereferencing:
>>>>
>>>> r3 = (r2 - 0x10)
>>>> r2 = *(u64 *)(r3)
>>>>
>>>> Verifier would see R3 as untrusted_ptr_node_data(off=0), and load for
>>>> r2 would have insn->off = 0. But LLVM decides to just do a load-with-offset
>>>> using original arg ptrs to less() instead of storing container_of() ptr
>>>> adjustments.
>>>>
>>>> Since the container_of usage and code pattern in the above example's less()
>>>> isn't particularly specific to this series, I think there are other scenarios
>>>> where such code would be generated, so I considered this a general bugfix in
>>>> the cover letter.
>>>
>>> imo the negative offset looks specific to two misuses of PTR_UNTRUSTED in this set.
>>>
>>
>> If I used PTR_TRUSTED here, the JITted instructions would still do a load like
>> r2 = *(u64 *)(r2 - 0x10). There would just be no BPF_PROBE_MEM runtime checking
>> insns generated, avoiding negative insn issue there. But the negative insn->off
>> load being generated is not specific to PTR_UNTRUSTED.
>
> yep.
>
>>>
>>> Exactly. More flags will only increase the confusion.
>>> Please try to make callback args as proper PTR_TRUSTED and disallow calling specific
>>> rbtree kfuncs while inside this particular callback to prevent recursion.
>>> That would solve all these issues, no?
>>> Writing into such PTR_TRUSTED should be still allowed inside cb though it's bogus.
>>>
>>> Consider less() receiving btf_id ptr_trusted of struct node_data and it contains
>>> both link list and rbtree.
>>> It should still be safe to operate on link list part of that node from less()
>>> though it's not something we would ever recommend.
>>
>> I definitely want to allow writes on non-owning references. In order to properly
>> support this, there needs to be a way to designate a field as a "key":
>>
>> struct node_data {
>> long key __key;
>> long data;
>> struct bpf_rb_node node;
>> };
>>
>> or perhaps on the rb_root via __contains or separate tag:
>>
>> struct bpf_rb_root groot __contains(struct node_data, node, key);
>>
>> This is necessary because rbtree's less() uses key field to determine order, so
>> we don't want to allow write to the key field when the node is in a rbtree. If
>> such a write were possible the rbtree could easily be placed in an invalid state
>> since the new key may mean that the rbtree is no longer sorted. Subsequent add()
>> operations would compare less() using the new key, so other nodes will be placed
>> in wrong spot as well.
>>
>> Since PTR_UNTRUSTED currently allows read but not write, and prevents use of
>> non-owning ref as kfunc arg, it seemed to be reasonable tag for less() args.
>>
>> I was planning on adding __key / non-owning-ref write support as a followup, but
>> adding it as part of this series will probably save a lot of back-and-forth.
>> Will try to add it.
>
> Just key mark might not be enough. less() could be doing all sort of complex
> logic on more than one field and even global fields.
> But what is the concern with writing into 'key' ?
> The rbtree will not be sorted. find/add operation will not be correct,
> but nothing will crash. At the end bpf_rb_root_free() will walk all
> unsorted nodes anyway and free them all.
> Even if we pass PTR_TRUSTED | MEM_RDONLY pointers into less() the less()
> can still do nonsensical things like returning random true/false.
> Doesn't look like an issue to me.
Agreed re: complex logic + global fields, less() being able to do nonsensical
things, and writing to key not crashing anything even if it breaks the tree.
OK, let's forget about __key. In next version of the series non-owning refs
will be write-able. Can add more protection in the future if it's deemed
necessary. Since this means non-owning refs won't be PTR_UNTRUSTED anymore,
I can split this patch out from the rest of the series after confirming that
it isn't necessary to ship rbtree.
Still want to convince you that the skipping of a check is correct before
I page out the details, but it's less urgent now. IIUC, although the cause of the
issue is clear now, you'd still like me to clarify the details of the solution.
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-08 8:28 ` Dave Marchevsky
@ 2022-12-08 12:57 ` Kumar Kartikeya Dwivedi
2022-12-08 20:36 ` Alexei Starovoitov
0 siblings, 1 reply; 50+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-12-08 12:57 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Alexei Starovoitov, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On Thu, Dec 08, 2022 at 01:58:44PM IST, Dave Marchevsky wrote:
> On 12/7/22 10:51 PM, Alexei Starovoitov wrote:
> > On Wed, Dec 07, 2022 at 08:18:25PM -0500, Dave Marchevsky wrote:
> >>
> >> Before replying to specific things in this email, I think it would be useful
> >> to have a subthread clearing up definitions and semantics, as I think we're
> >> talking past each other a bit.
> >
> > Yeah. We were not on the same page.
> > The concepts of 'owning ref' and 'non-owning ref' appeared 'new' to me.
> > I remember discussing 'conditional release' and OBJ_NON_OWNING_REF long ago
> > and I thought we agreed that both are not necessary and with that
> > I assumed that anything 'non-owning' as a concept is gone too.
> > So the only thing left (in my mind) was the 'owning' concept.
> > Which I mapped as ref_obj_id > 0. In other words 'owning' meant 'acquired'.
> >
>
> Whereas in my mind the release_on_unlock logic was specifically added to
> implement the mass invalidation part of non-owning reference semantics, and it
> being accepted implied that we weren't getting rid of the concept :).
>
> > Please have this detailed explanation in the commit log next time to
> > avoid this back and forth.
> > Now to the fun part...
> >
>
> I will add a documentation commit explaining 'owning' and 'non-owning' ref
> as they pertain to these datastructures, after we agree about the semantics.
>
> Speaking of which, although I have a few questions / clarifications, I think
> we're more in agreement after your reply. After one more round of clarification
> I will summarize conclusions to see if we agree on enough to move forward.
>
> >>
> >> On a conceptual level I've still been using "owning reference" and "non-owning
> >> reference" to understand rbtree operations. I'll use those here and try to map
> >> them to actual verifier concepts later.
> >>
> >> owning reference
> >>
> >> * This reference controls the lifetime of the pointee
> >> * Ownership of pointee must be 'released' by passing it to some rbtree
> >> API kfunc - rbtree_add in our case - or via bpf_obj_drop, which free's
> >> * If not released before program ends, verifier considers prog invalid
> >> * Access to the memory ref is pointing at will not page fault
> >
> > agree.
> >
> >> non-owning reference
> >>
> >> * No ownership of pointee so can't pass ownership via rbtree_add, not allowed
> >> to bpf_obj_drop
> >> * No control of lifetime, but can infer memory safety based on context
> >> (see explanation below)
> >> * Access to the memory ref is pointing at will not page fault
> >> (see explanation below)
> >
> > agree with addition that both read and write should be allowed into this
> > 'non-owning' ptr.
> > Which breaks if you map this to something that ORs with PTR_UNTRUSTED.
> >
>
> Agree re: read/write allowed. PTR_UNTRUSTED was an implementation detail.
> Sounds like we agree on general purpose of owning, non-owning. Looks like
> we're in agreement about above semantics.
>
Yes, PTR_UNTRUSTED is not appropriate for this. My opposition was also more to
the idea of mapping PTR_UNTRUSTED to non-owning references.
If we do PTR_TO_BTF_ID | MEM_ALLOC for them with ref_obj_id == 0, it SGTM.
> >> 2) From verifier's perspective non-owning references can only exist
> >> between spin_lock and spin_unlock. Why? After spin_unlock another program
> >> can do arbitrary operations on the rbtree like removing and free-ing
> >> via bpf_obj_drop. A non-owning ref to some chunk of memory that was remove'd,
> >> free'd, and reused via bpf_obj_new would point to an entirely different thing.
> >> Or the memory could go away.
> >
> > agree that spin_unlock needs to clean up 'non-owning'.
>
> Another point of agreement.
>
+1
> >
> >> To prevent this logic violation all non-owning references are invalidated by
> >> verifier after critical section ends. This is necessary to ensure "will
> >> not page fault" property of non-owning reference. So if verifier hasn't
> >> invalidated a non-owning ref, accessing it will not page fault.
> >>
> >> Currently bpf_obj_drop is not allowed in the critical section, so similarly,
> >> if there's a valid non-owning ref, we must be in critical section, and can
> >> conclude that the ref's memory hasn't been dropped-and-free'd or dropped-
> >> and-reused.
> >
> > I don't understand why is that a problem.
> >
> >> 1) Any reference to a node that is in a rbtree _must_ be non-owning, since
> >> the tree has control of pointee lifetime. Similarly, any ref to a node
> >> that isn't in rbtree _must_ be owning. (let's ignore raw read from kptr_xchg'd
> >> node in map_val for now)
The last case is going to be marked PTR_UNTRUSTED.
> >
> > Also not clear why such restriction is necessary.
> >
>
> If we have this restriction and bpf_rbtree_release also mass invalidates
> non-owning refs, the type system will ensure that only nodes that are in a tree
> will be passed to bpf_rbtree_release, and we can avoid the runtime check.
>
I like this property. This was also how I proposed implementing it for lists.
e.g. Any bpf_list_del would invalidate the result of prior bpf_list_first_entry
and bpf_list_last_entry to ensure safety.
It's a bit similar to aliasing XOR mutability guarantees that Rust has. We're
trying to implement a simple borrow checking mechanism.
Once the collection is mutated, any prior non-owning references become
invalidated. It can be further refined (e.g. bpf_rbtree_add won't do
invalidation on mutation) based on the properties of the data structure.
> But below you mention preferring the runtime check, mostly noting here to
> refer back when continuing reply below.
>
> >> Moving on to rbtree API:
> >>
> >> bpf_rbtree_add(&tree, &node);
> >> 'node' is an owning ref, becomes a non-owning ref.
> >>
> >> bpf_rbtree_first(&tree);
> >> retval is a non-owning ref, since first() node is still in tree
> >>
> >> bpf_rbtree_remove(&tree, &node);
> >> 'node' is a non-owning ref, retval is an owning ref
> >
> > agree on the above definition.
> > >> All of the above can only be called when rbtree's lock is held, so invalidation
> >> of all non-owning refs on spin_unlock is fine for rbtree_remove.
> >>
> >> Nice property of paragraph marked with 1) above is the ability to use the
> >> type system to prevent rbtree_add of node that's already in rbtree and
> >> rbtree_remove of node that's not in one. So we can forego runtime
> >> checking of "already in tree", "already not in tree".
> >
> > I think it's easier to add runtime check inside bpf_rbtree_remove()
> > since it already returns MAYBE_NULL. No 'conditional release' necessary.
> > And with that we don't need to worry about aliases.
> >
>
> To clarify: You're proposing that we don't worry about solving the aliasing
> problem at verification time. Instead rbtree_{add,remove} will deal with it
> at runtime. Corollary of this is that my restriction tagged 1) above ("ref
> to node in tree _must_ be non-owning, to node not in tree must be owning")
> isn't something we're guaranteeing, due to possibility of aliasing.
>
> So bpf_rbtree_remove might get a node that's not in tree, and
> bpf_rbtree_add might get a node that's already in tree. Runtime behavior
> of both should be 'nop'.
>
>
> If that is an accurate restatement of your proposal, the verifier
> logic will need to be changed:
>
> For bpf_rbtree_remove(&tree, &node), if node is already not in a tree,
> retval will be NULL, effectively not acquiring an owning ref due to
> mark_ptr_or_null_reg's logic.
>
> In this case, do we want to invalidate
> arg 'node' as well? Or just leave it as a non-owning ref that points
> to node not in tree? I think the latter requires fewer verifier changes,
> but can see the argument for the former if we want restriction 1) to
> mostly be true, unless aliasing.
>
> The above scenario is the only case where bpf_rbtree_remove fails and
> returns NULL.
>
> (In this series it can fail and RET_NULL for this reason, but my earlier comment
> about type system + invalidating all non-owning refs after remove as discussed below
> was my original intent. So I shouldn't have been allowing RET_NULL for my
> version of these semantics.)
>
I agree with Dave to rely on the invariant that non-owning refs to nodes are
part of the collection. Then bpf_rbtree_remove is simply KF_ACQUIRE.
>
> For bpf_rbtree_add(&tree, &node, less), if arg is already in tree, then
> 'node' isn't really an owning ref, and we need to tag it as non-owning,
> and program then won't need to bpf_obj_drop it before exiting. If node
> wasn't already in tree and rbtree_add actually added it, 'node' would
> also be tagged as non-owning, since tree now owns it.
>
> Do we need some way to indicate whether 'already in tree' case happened?
> If so, would need to change retval from void to bool or struct bpf_rb_node *.
>
> The above scenario is the only case where bpf_rbtree_add fails and returns
> NULL / false.
>
Why should we allow a node that is not acquired to be passed to bpf_rbtree_add?
> >> But, as you and Kumar talked about in the past and referenced in patch 1's
> >> thread, non-owning refs may alias each other, or an owning ref, and have no
> >> way of knowing whether this is the case. So if X and Y are two non-owning refs
> >> that alias each other, and bpf_rbtree_remove(tree, X) is called, a subsequent
> >> call to bpf_rbtree_remove(tree, Y) would be removing node from tree which
> >> already isn't in any tree (since prog has an owning ref to it). But verifier
> >> doesn't know X and Y alias each other. So previous paragraph's "forego
> >> runtime checks" statement can only hold if we invalidate all non-owning refs
> >> after 'destructive' rbtree_remove operation.
> >
> > right. we either invalidate all non-owning after bpf_rbtree_remove
> > or do run-time check in bpf_rbtree_remove.
> > Consider the following:
> > bpf_spin_lock
> > n = bpf_rbtree_first(root);
> > m = bpf_rbtree_first(root);
> > x = bpf_rbtree_remove(root, n)
> > y = bpf_rbtree_remove(root, m)
> > bpf_spin_unlock
> > if (x)
> > bpf_obj_drop(x)
> > if (y)
> > bpf_obj_drop(y)
> >
> > If we invalidate after bpf_rbtree_remove() the above will be rejected by the verifier.
> > If we do run-time check the above will be accepted and will work without crashing.
> >
>
> Agreed, although the above example's invalid double-remove of same node is
> the kind of thing I'd like to be prevented at verification time instead of
> runtime. Regardless, continuing with your runtime check idea.
>
I agree with Dave, it seems better to invalidate non-owning refs after first
remove rather than allowing this to work.
> > The problem with release_on_unlock is that it marks 'n' after 1st remove
> > as UNTRUSTED which means 'no write' and 'read via probe_read'.
> > That's not good imo.
> >
>
> Based on your response to the paragraph below this one, I think we're in agreement
> that using PTR_UNTRUSTED for a non-owning ref gives it a bunch of traits
> it doesn't need, when I just wanted "can't pass ownership". So agreed that
> PTR_UNTRUSTED is too blunt an instrument here.
>
I think this is the part that has been confusing me so far.
The discussion in this thread is making things clearer.
PTR_UNTRUSTED was never meant to be the kind of non-owning reference you want to
be returned from bpf_rbtree_first. PTR_TO_BTF_ID | MEM_ALLOC with ref_obj_id == 0
is the right choice.
> Regarding "marks 'n' after 1st remove", the series isn't currently doing this,
> I proposed it as a way to prevent aliasing problem, but I think your proposal
> is explicitly not trying to prevent aliasing problem at verification time. So
> for your semantics we would only have non-owning cleanup after spin_unlock.
> And such cleanup might just mark refs PTR_UNTRUSTED instead of invalidating
> entirely.
>
I would prefer proper invalidation using mark_reg_unknown.
> >>
> >> It doesn't matter to me which combination of type flags, ref_obj_id, other
> >> reg state stuff, and special-casing is used to implement owning and non-owning
> >> refs. Specific ones chosen in this series for rbtree node:
> >>
> >> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
> >> ref_obj_id > 0
> >>
> >> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ type that contains bpf_rb_node)
> >> PTR_UNTRUSTED
> >> - used for "can't pass ownership", not PROBE_MEM
> >> - this is why I mentioned "decomposing UNTRUSTED into more
> >> granular reg traits" in another thread
> >
> > Now I understand, but that was very hard to grasp.
> > UNTRUSTED means 'no write' and 'read via probe_read'.
> > ref_set_release_on_unlock() also keeps ref_obj_id > 0 as you're correctly
> > pointing out below:
> >> ref_obj_id > 0
> >> release_on_unlock = true
> >> - used due to paragraphs starting with 2) above
> >
> > but the problem with ref_set_release_on_unlock() that it mixes real ref-d
> > pointers with ref_obj_id > 0 with UNTRUSTED && ref_obj_id > 0.
> > And the latter is a quite confusing combination in my mind,
> > since we consider everything with ref_obj_id > 0 as good for KF_TRUSTED_ARGS.
> >
>
> I think I understand your desire to get rid of release_on_unlock now. It's not
> due to disliking the concept of "clean up non-owning refs after spin_unlock",
> which you earlier agreed was necessary, but rather the specifics of
> release_on_unlock mechanism used to achieve this.
>
> If so, I think I agree with your reasoning for why the mechanism is bad in
> light of how you want owning/non-owning implemented. To summarize your
> statements about release_on_unlock mechanism from the rest of your reply:
>
> * 'ref_obj_id > 0' already has a specific meaning wrt. is_trusted_reg,
> and we may want to support both TRUSTED and UNTRUSTED non-owning refs
>
> * My comment: Currently is_trusted_reg is only used for
> KF_ARG_PTR_TO_BTF_ID, while rbtree and list types are assigned special
> KF_ARGs. So hypothetically could have different 'is_trusted_reg' logic.
> I don't actually think that's a good idea, though, especially since
> rbtree / list types are really specializations of PTR_TO_BTF_ID anyways.
> So agreed.
>
> * Instead of using 'acquire' and (modified) 'release', we can achieve
> "clean-up non-owning after spin_unlock" by associating non-owning
> refs with active_lock.id when they're created. We can store this in
> reg.id, which is currently unused for PTR_TO_BTF_ID (afaict).
>
I don't mind using active_lock.id for invalidation, but using reg->id to
associate it with reg is a bad idea IMO, it's already preserved and set when the
object has bpf_spin_lock in it, and it's going to allow doing bpf_spin_unlock
with that non-owning ref if it has a spin lock, essentially unlocking a different
spin lock if the reg->btf of already locked spin lock reg is same due to same
active_lock.id.
Even if you prevent it somehow it's more confusing to overload reg->id again for
this purpose.
It makes more sense to introduce a new nonref_obj_id instead dedicated for this
purpose, to associate it back to the reg->id of the collection it is coming from.
Also, there are two cases of invalidation, one is on remove from rbtree, which
should only invalidate non-owning references into the rbtree, and one is on
unlock, which should invalidate all non-owning references.
bpf_rbtree_remove shouldn't invalidate non-owning into list protected by same
lock, but unlocking should do it for both rbtree and list non-owning refs it is
protecting.
So it seems you will have to maintain two IDs for non-owning references, one for
the collection it comes from, and one for the lock region it is obtained in.
> * This will solve issue raised by previous point, allowing us to have
> non-owning refs which are truly 'untrusted' according to is_trusted_reg.
>
> * My comment: This all sounds reasonable. On spin_unlock we have
> active_lock.id, so can do bpf_for_each_reg_in_vstate to look for
> PTR_TO_BTF_IDs matching the id and do 'cleanup' for them.
>
> >> Any other combination of type and reg state that gives me the semantics def'd
> >> above works4me.
> >>
> >>
> >> Based on this reply and others from today, I think you're saying that these
> >> concepts should be implemented using:
> >>
> >> owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> >> PTR_TRUSTED
> >> ref_obj_id > 0
> >
> > Almost.
> > I propose:
> > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id > 0
> >
> > See the definition of is_trusted_reg().
> > It's ref_obj_id > 0 || flag == (MEM_ALLOC | PTR_TRUSTED)
> >
> > I was saying 'trusted' because of is_trusted_reg() definition.
> > Sorry for confusion.
> >
>
> I see. Sounds reasonable.
>
> >> non-owning ref: PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> >> PTR_TRUSTED
> >> ref_obj_id == 0
> >> - used for "can't pass ownership", since funcs that expect
> >> owning ref need ref_obj_id > 0
> >
> > I propose:
> > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> >
>
> Also sounds reasonable, perhaps with the addition of id > 0 to account for
> your desired changes to release_on_unlock mechanism?
>
> > Both 'owning' and 'non-owning' will fit for KF_TRUSTED_ARGS kfuncs.
> >
> > And we will be able to pass 'non-owning' under spin_lock into other kfuncs
> > and owning outside of spin_lock into other kfuncs.
> > Which is a good thing.
> >
>
> Allowing passing of owning ref outside of spin_lock sounds reasonable to me.
> 'non-owning' under spinlock will have the same "what if this touches __key"
> issue I brought up in another thread. But you mentioned not preventing that
> and I don't necessarily disagree, so just noting here.
>
Yeah, I agree with Alexei that writing to the key is a non-issue. The 'less' cb
may not do the correct thing at all anyway, so in that sense writing to the key
is a minor concern. In any case, violating the 'sorted' property is not
something we should be trying to prevent.
> >> And you're also adding 'untrusted' here, mainly as a result of
> >> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
> >> instead of becoming a non-owning ref. 'untrusted' would have state like:
> >>
> >> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> >> PTR_UNTRUSTED
> >> ref_obj_id == 0?
> >
> > I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
> > or doing 'non-owning' is enough.
> > If it's full untrusted it will be:
> > PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
> >
>
> Yeah, I don't see what this "full untrusted" is giving us either. Let's have
> "cleanup non-owning refs on spin_unlock" just invalidate the regs for now,
> instead of converting to "full untrusted"?
>
+1, I prefer invalidating completely on unlock.
> Adding "full untrusted" later won't make any valid programs written with
> "just invalidate the regs" in mind fail the verifier. So painless to add later.
>
+1
> > tbh I don't remember why we even have 'MEM_ALLOC | PTR_UNTRUSTED'.
> >
Eventually it will also be used for alloc obj kptr loaded from maps.
>
> I think such type combo was only added to implement non-owning refs. If it's
> rewritten to use your type combos I don't think there'll be any uses of
> MEM_ALLOC | PTR_UNTRUSTED remaining.
>
To be clear I was not intending to use PTR_UNTRUSTED to do such non-owning refs.
> >> I think your "non-owning ref" definition also differs from mine, specifically
> >> yours doesn't seem to have "will not page fault". For this reason, you don't
> >> see the need for release_on_unlock logic, since that's used to prevent refs
> >> escaping critical section and potentially referring to free'd memory.
> >
> > Not quite.
> > We should be able to read/write directly through
> > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> > and we need to convert it to __mark_reg_unknown() after bpf_spin_unlock
> > the way release_reference() is doing.
> > I'm just not happy with using acquire_reference/release_reference() logic
> > (as release_on_unlock is doing) for cleaning after unlock.
> > Since we need to clean 'non-owning' ptrs in unlock it's confusing
> > to call the process 'release'.
> > I was hoping we can search through all states and __mark_reg_unknown() (or UNTRUSTED)
> > every reg where
> > reg->id == cur_state->active_lock.id &&
> > flag == PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> >
> > By deleting relase_on_unlock I meant delete release_on_unlock flag
> > and remove ref_set_release_on_unlock.
> >
>
> Summarized above, but: agreed, and thanks for clarifying what you meant by
> "delete release_on_unlock".
>
> >> This is where I start to get confused. Some questions:
> >>
> >> * If we get rid of release_on_unlock, and with mass invalidation of
> >> non-owning refs entirely, shouldn't non-owning refs be marked PTR_UNTRUSTED?
> >
> > Since we'll be cleaning all
> > PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0
> > it shouldn't affect ptrs with ref_obj_id > 0 that came from bpf_obj_new.
> >
> > The verifier already enforces that bpf_spin_unlock will be present
> > at the right place in bpf prog.
> > When the verifier sees it it will clean all non-owning refs with this spinlock 'id'.
> > So no concerns of leaking 'non-owning' outside.
> >
>
> Sounds like we don't want "full untrusted" or any PTR_UNTRUSTED non-owning ref.
>
> > While processing bpf_rbtree_first we need to:
> > regs[BPF_REG_0].type = PTR_TO_BTF_ID | MEM_ALLOC;
> > regs[BPF_REG_0].id = active_lock.id;
> > regs[BPF_REG_0].ref_obj_id = 0;
> >
>
> Agreed.
>
I'm a bit concerned about putting active_lock.id in reg->id. I don't object to
the idea, but to the implementation, since we take PTR_TO_BTF_ID | MEM_ALLOC in
bpf_spin_lock/bpf_spin_unlock. It will lead to confusion. Currently this exact
reg->type never has reg->ref_obj_id == 0. Maybe that needs to be checked for
those helper calls.
Just thinking out loud; maybe it's fine, but we need to be careful: reg->id
changes meaning when ref_obj_id == 0.
> >> * Since refs can alias each other, how to deal with bpf_obj_drop-and-reuse
> >> in this scheme, since non-owning ref can escape spin_unlock b/c no mass
> >> invalidation? PTR_UNTRUSTED isn't sufficient here
> >
> > run-time check in bpf_rbtree_remove (and in the future bpf_list_remove)
> > should address it, no?
> >
>
> If we don't do "full untrusted" and cleanup non-owning refs by invalidating,
> _and_ don't allow bpf_obj_{new,drop} in critical section, then I don't think
> this is an issue.
>
bpf_obj_drop, if/when enabled, can also do invalidation. But let's table that
discussion until we introduce it. We most likely won't need it inside the CS.
> But to elaborate on the issue, if we instead cleaned up non-owning by marking
> untrusted:
>
> struct node_data *n = bpf_obj_new(typeof(*n));
> struct node_data *m, *o;
> struct some_other_type *t;
>
> bpf_spin_lock(&lock);
>
> bpf_rbtree_add(&tree, n);
> m = bpf_rbtree_first();
> o = bpf_rbtree_first(); // m and o are non-owning, point to same node
>
> m = bpf_rbtree_remove(&tree, m); // m is owning
>
> bpf_spin_unlock(&lock); // o is "full untrusted", marked PTR_UNTRUSTED
>
> bpf_obj_drop(m);
> t = bpf_obj_new(typeof(*t)); // pretend that exact chunk of memory that was
> // dropped in previous statement is returned here
>
> data = o->some_data_field; // PROBE_MEM, but no page fault, so load will
> // succeed, but will read garbage from another type
> // while verifier thinks it's reading from node_data
>
>
> If we clean up by invalidating, but eventually enable bpf_obj_{new,drop} inside
> critical section, we'll have similar issue.
>
> It's not necessarily "crash the kernel" dangerous, but it may anger program
> writers since they can't be sure they're not reading garbage in this scenario.
>
I think it's better to clean by invalidating. We have better tools to form
untrusted pointers (like bpf_rdonly_cast) now if the BPF program writer needs
such an escape hatch for some reason. It's also easier to review where an
untrusted pointer is being used in a program, and has zero cost at runtime.
> >> * If non-owning ref can live past spin_unlock, do we expect read from
> >> such ref after _unlock to go through bpf_probe_read()? Otherwise direct
> >> read might fault and silently write 0.
> >
> > unlock has to clean them.
> >
>
> Ack.
>
> >> * For your 'untrusted', but not non-owning ref concept, I'm not sure
> >> what this gives us that's better than just invalidating the ref which
> >> gets in this state (rbtree_{add,remove} 'node' arg, bpf_obj_drop node)
> >
> > Whether to mark unknown or untrusted or non-owning after bpf_rbtree_add() is a difficult one.
> > Untrusted will allow prog to do read only access (via probe_read) into the node
> > but might hide bugs.
> > The cleanup after bpf_spin_unlock of non-owning and clean up after
> > bpf_rbtree_add() does not have to be the same.
>
> This is a good point.
>
So far I'm leaning towards:
bpf_rbtree_add(node) : node becomes a non-owning ref
bpf_spin_unlock(lock) : node is invalidated
> > Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
> > and non-owning after bpf_rbtree_add.
> >
> > Walking the example from previous email:
> >
> > struct bpf_rbtree_iter it;
> > struct bpf_rb_node * node;
> > struct bpf_rb_node *n, *m;
> >
> > bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree works as bpf_spin_lock
> > while ((node = bpf_rbtree_iter_next(&it))) {
> > // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
> > if (node && node->field == condition) {
> >
> > n = bpf_rbtree_remove(rb_root, node);
> > if (!n) ...;
> > // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
> > m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
> > if (!m) ...;
> > // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
> >
This second remove I would simply disallow as Dave is suggesting during
verification, by invalidating non-owning refs for rb_root.
> > // node is still:
> > // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
> >
> > // assume we allow double locks one day
> > bpf_spin_lock(another_rb_root);
> > bpf_rbtree_add(another_rb_root, n);
> > // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[1].id
> > bpf_spin_unlock(another_rb_root);
> > // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
> > break;
> > }
> > }
> > // node -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == 0 && id == active_lock[0].id
> > bpf_rbtree_iter_destroy(&it); // does unlock
> > // node -> PTR_TO_BTF_ID | PTR_UNTRUSTED
> > // n -> PTR_TO_BTF_ID | PTR_UNTRUSTED
> > // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
> > bpf_obj_drop(m);
>
> This seems like a departure from other statements in your reply, where you're
> leaning towards "non-owning and trusted" -> "full untrusted" after unlock
> being unnecessary. I think the combo of reference aliases + bpf_obj_drop-and-
> reuse make everything hard to reason about.
>
> Regardless, your comments annotating reg state look correct to me.
I think it's much clearer in this thread what you want to do. It would be good,
after the thread concludes, to summarize how you're going to implement all this
before respinning.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-08 12:57 ` Kumar Kartikeya Dwivedi
@ 2022-12-08 20:36 ` Alexei Starovoitov
2022-12-08 23:35 ` Dave Marchevsky
0 siblings, 1 reply; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-08 20:36 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Dave Marchevsky, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On Thu, Dec 08, 2022 at 06:27:29PM +0530, Kumar Kartikeya Dwivedi wrote:
>
> I don't mind using active_lock.id for invalidation, but using reg->id to
> associate it with reg is a bad idea IMO, it's already preserved and set when the
> object has bpf_spin_lock in it, and it's going to allow doing bpf_spin_unlock
> with that non-owning ref if it has a spin lock, essentially unlocking a different
> spin lock if the reg->btf of already locked spin lock reg is same due to same
> active_lock.id.
Right. Overwriting reg->id was a bad idea.
> Even if you prevent it somehow it's more confusing to overload reg->id again for
> this purpose.
>
> It makes more sense to introduce a new nonref_obj_id instead dedicated for this
> purpose, to associate it back to the reg->id of the collection it is coming from.
nonref_obj_id name sounds too generic and I'm not sure that it shouldn't be
connected to reg->id the way we do it for ref_obj_id.
> Also, there are two cases of invalidation, one is on remove from rbtree, which
> should only invalidate non-owning references into the rbtree, and one is on
> unlock, which should invalidate all non-owning references.
Two cases only if we're going to do invalidation on rbtree_remove.
> bpf_rbtree_remove shouldn't invalidate non-owning into list protected by same
> lock, but unlocking should do it for both rbtree and list non-owning refs it is
> protecting.
>
> So it seems you will have to maintain two IDs for non-owning references, one for
> the collection it comes from, and one for the lock region it is obtained in.
Right. Like this ?
collection_id = rbroot->reg->id; // to track the collection it came from
active_lock_id = cur_state->active_lock.id // to track the lock region
but before we proceed let me demonstrate an example where
cleanup on rbtree_remove is not user friendly:
bpf_spin_lock
x = bpf_list_first(); if (!x) ..
y = bpf_list_last(); if (!y) ..
n = bpf_list_remove(x); if (!n) ..
bpf_list_add_after(n, y); // we should allow this
bpf_spin_unlock
We don't have such apis right now.
The point here is that cleanup after bpf_list_remove/bpf_rbtree_remove will destroy
all regs that point somewhere in the collection.
This way we save a run-time check in bpf_rbtree_remove, but sacrifice usability.
x and y could be pointing to the same thing.
In such case bpf_list_add_after() should fail in runtime after discovering
that 'y' is unlinked.
Similarly with bpf_rbtree_add().
Currently it cannot fail. It takes owning ref and will release it.
We can mark it as KF_RELEASE and no extra verifier changes necessary.
But in the future we might have failing add/insert operations on lists and rbtree.
If they're failing we'd need to struggle with 'conditional release' verifier additions,
the bpf prog would need to check return value, etc.
I think we better deal with it in run-time.
The verifier could supply bpf_list_add_after() with two hidden args:
- container_of offset (delta between rb_node and beginning of prog's struct)
- struct btf_struct_meta *meta
Then inside bpf_list_add_after or any failing KF_RELEASE kfunc
it can call bpf_obj_drop_impl() on that element.
Then from the verifier pov the KF_RELEASE function did the release
and 'owning ref' became 'non-owning ref'.
> > >> And you're also adding 'untrusted' here, mainly as a result of
> > >> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
> > >> instead of becoming a non-owning ref. 'untrusted' would have state like:
> > >>
> > >> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
> > >> PTR_UNTRUSTED
> > >> ref_obj_id == 0?
> > >
> > > I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
> > > or doing 'non-owning' is enough.
> > > If it's full untrusted it will be:
> > > PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
> > >
> >
> > Yeah, I don't see what this "full untrusted" is giving us either. Let's have
> > "cleanup non-owning refs on spin_unlock" just invalidate the regs for now,
> > instead of converting to "full untrusted"?
> >
>
> +1, I prefer invalidating completely on unlock.
fine by me.
>
> I think it's better to clean by invalidating. We have better tools to form
> untrusted pointers (like bpf_rdonly_cast) now if the BPF program writer needs
> such an escape hatch for some reason. It's also easier to review where an
> untrusted pointer is being used in a program, and has zero cost at runtime.
ok. Since it's more strict we can relax to untrusted later if necessary.
> So far I'm leaning towards:
>
> bpf_rbtree_add(node) : node becomes a non-owning ref
> bpf_spin_unlock(lock) : node is invalidated
ok
> > > Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
> > > and non-owning after bpf_rbtree_add.
> > >
> > > Walking the example from previous email:
> > >
> > > struct bpf_rbtree_iter it;
> > > struct bpf_rb_node * node;
> > > struct bpf_rb_node *n, *m;
> > >
> > > bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree works as bpf_spin_lock
> > > while ((node = bpf_rbtree_iter_next(&it))) {
> > > // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
> > > if (node && node->field == condition) {
> > >
> > > n = bpf_rbtree_remove(rb_root, node);
> > > if (!n) ...;
> > > // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
> > > m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
> > > if (!m) ...;
> > > // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
> > >
>
> This second remove I would simply disallow as Dave is suggesting during
> verification, by invalidating non-owning refs for rb_root.
Looks like cleanup from non-owning to untrusted|unknown on bpf_rbtree_remove is our
only remaining disagreement.
I feel run-time checks will be fast enough and will improve usability.
Also it feels that not doing cleanup on rbtree_remove is simpler to
implement and reason about.
Here is the proposal with one new field 'active_lock_id':
first = bpf_rbtree_first(root) KF_RET_NULL
check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL ref_obj_id = 0;
R0->active_lock_id = root->reg->id
R0->id = ++env->id_gen; which will be cleared after !NULL check inside prog.
same way we can add rb_find, rb_find_first,
but not rb_next, rb_prev, since they don't have 'root' argument.
bpf_rbtree_add(root, node, cb); KF_RELEASE.
needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id > 0
check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
calls release_reference(node->ref_obj_id)
converts 'node' to PTR_TO_BTF_ID|MEM_ALLOC ref_obj_id = 0;
node->active_lock_id = root->reg->id
'node' is equivalent to 'first'. They both point to some element
inside rbtree and valid inside spin_locked region.
It's ok to read|write to both under lock.
removed_node = bpf_rbtree_remove(root, node); KF_ACQUIRE|KF_RET_NULL
need to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id = 0; and
usual check_reg_allocation_locked(root)
R0 = PTR_TO_BTF_ID|MEM_ALLOC|MAYBE_NULL
R0->ref_obj_id = R0->id = acquire_reference_state();
R0->active_lock_id should stay 0
mark_reg_unknown(node)
bpf_spin_unlock(lock);
checks lock->id == cur->active_lock.id
for all regs in state
if (reg->active_lock_id == lock->id)
mark_reg_unknown(reg)
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-08 20:36 ` Alexei Starovoitov
@ 2022-12-08 23:35 ` Dave Marchevsky
2022-12-09 0:39 ` Alexei Starovoitov
0 siblings, 1 reply; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-08 23:35 UTC (permalink / raw)
To: Alexei Starovoitov, Kumar Kartikeya Dwivedi
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Kernel Team, Tejun Heo
On 12/8/22 3:36 PM, Alexei Starovoitov wrote:
> On Thu, Dec 08, 2022 at 06:27:29PM +0530, Kumar Kartikeya Dwivedi wrote:
>>
>> I don't mind using active_lock.id for invalidation, but using reg->id to
>> associate it with reg is a bad idea IMO, it's already preserved and set when the
>> object has bpf_spin_lock in it, and it's going to allow doing bpf_spin_unlock
>> with that non-owning ref if it has a spin lock, essentially unlocking a different
>> spin lock if the reg->btf of already locked spin lock reg is same due to same
>> active_lock.id.
>
> Right. Overwriting reg->id was a bad idea.
>
>> Even if you prevent it somehow it's more confusing to overload reg->id again for
>> this purpose.
>>
>> It makes more sense to introduce a new nonref_obj_id instead dedicated for this
>> purpose, to associate it back to the reg->id of the collection it is coming from.
>
> nonref_obj_id name sounds too generic and I'm not sure that it shouldn't be
> connected to reg->id the way we do it for ref_obj_id.
>
>> Also, there are two cases of invalidation, one is on remove from rbtree, which
>> should only invalidate non-owning references into the rbtree, and one is on
>> unlock, which should invalidate all non-owning references.
>
> Two cases only if we're going to do invalidation on rbtree_remove.
>
>> bpf_rbtree_remove shouldn't invalidate non-owning into list protected by same
>> lock, but unlocking should do it for both rbtree and list non-owning refs it is
>> protecting.
>>
>> So it seems you will have to maintain two IDs for non-owning references, one for
>> the collection it comes from, and one for the lock region it is obtained in.
>
> Right. Like this ?
> collection_id = rbroot->reg->id; // to track the collection it came from
> active_lock_id = cur_state->active_lock.id // to track the lock region
>
> but before we proceed let me demonstrate an example where
> cleanup on rbtree_remove is not user friendly:
>
> bpf_spin_lock
> x = bpf_list_first(); if (!x) ..
> y = bpf_list_last(); if (!y) ..
>
> n = bpf_list_remove(x); if (!n) ..
>
> bpf_list_add_after(n, y); // we should allow this
> bpf_spin_unlock
>
> We don't have such apis right now.
> The point here is that cleanup after bpf_list_remove/bpf_rbtree_remove will destroy
> all regs that point somewhere in the collection.
> This way we save a run-time check in bpf_rbtree_remove, but sacrifice usability.
>
> x and y could be pointing to the same thing.
> In such case bpf_list_add_after() should fail in runtime after discovering
> that 'y' is unlinked.
>
> Similarly with bpf_rbtree_add().
> Currently it cannot fail. It takes owning ref and will release it.
> We can mark it as KF_RELEASE and no extra verifier changes necessary.
>
> But in the future we might have failing add/insert operations on lists and rbtree.
> If they're failing we'd need to struggle with 'conditional release' verifier additions,
> the bpf prog would need to check return value, etc.
>
> I think we better deal with it in run-time.
> The verifier could supply bpf_list_add_after() with two hidden args:
> - container_of offset (delta between rb_node and beginning of prog's struct)
> - struct btf_struct_meta *meta
> Then inside bpf_list_add_after or any failing KF_RELEASE kfunc
> it can call bpf_obj_drop_impl() on that element.
> Then from the verifier pov the KF_RELEASE function did the release
> and 'owning ref' became 'non-owning ref'.
>
>>>>> And you're also adding 'untrusted' here, mainly as a result of
>>>>> bpf_rbtree_add(tree, node) - 'node' becoming untrusted after it's added,
>>>>> instead of becoming a non-owning ref. 'untrusted' would have state like:
>>>>>
>>>>> PTR_TO_BTF_ID | MEM_ALLOC (w/ rb_node type)
>>>>> PTR_UNTRUSTED
>>>>> ref_obj_id == 0?
>>>>
>>>> I'm not sure whether we really need full untrusted after going through bpf_rbtree_add()
>>>> or doing 'non-owning' is enough.
>>>> If it's full untrusted it will be:
>>>> PTR_TO_BTF_ID | PTR_UNTRUSTED && ref_obj_id == 0
>>>>
>>>
>>> Yeah, I don't see what this "full untrusted" is giving us either. Let's have
>>> "cleanup non-owning refs on spin_unlock" just invalidate the regs for now,
>>> instead of converting to "full untrusted"?
>>>
>>
>> +1, I prefer invalidating completely on unlock.
>
> fine by me.
>
>>
>> I think it's better to clean by invalidating. We have better tools to form
>> untrusted pointers (like bpf_rdonly_cast) now if the BPF program writer needs
>> such an escape hatch for some reason. It's also easier to review where an
>> untrusted pointer is being used in a program, and has zero cost at runtime.
>
> ok. Since it's more strict we can relax to untrusted later if necessary.
>
>> So far I'm leaning towards:
>>
>> bpf_rbtree_add(node) : node becomes a non-owning ref
>> bpf_spin_unlock(lock) : node is invalidated
>
> ok
>
>>>> Currently I'm leaning towards PTR_UNTRUSTED for cleanup after bpf_spin_unlock
>>>> and non-owning after bpf_rbtree_add.
>>>>
>>>> Walking the example from previous email:
>>>>
>>>> struct bpf_rbtree_iter it;
>>>> struct bpf_rb_node * node;
>>>> struct bpf_rb_node *n, *m;
>>>>
>>>> bpf_rbtree_iter_init(&it, rb_root); // locks the rbtree works as bpf_spin_lock
>>>> while ((node = bpf_rbtree_iter_next(&it))) {
>>>> // node -> PTR_TO_BTF_ID | MEM_ALLOC | MAYBE_NULL && ref_obj_id == 0
>>>> if (node && node->field == condition) {
>>>>
>>>> n = bpf_rbtree_remove(rb_root, node);
>>>> if (!n) ...;
>>>> // n -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == X
>>>> m = bpf_rbtree_remove(rb_root, node); // ok, but fails in run-time
>>>> if (!m) ...;
>>>> // m -> PTR_TO_BTF_ID | MEM_ALLOC && ref_obj_id == Y
>>>>
>>
>> This second remove I would simply disallow as Dave is suggesting during
>> verification, by invalidating non-owning refs for rb_root.
>
> Looks like cleanup from non-owning to untrusted|unknown on bpf_rbtree_remove is our
> only remaining disagreement.
> I feel run-time checks will be fast enough and will improve usability.
>
> Also it feels that not doing cleanup on rbtree_remove is simpler to
> implement and reason about.
>
> Here is the proposal with one new field 'active_lock_id':
>
> first = bpf_rbtree_first(root) KF_RET_NULL
> check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
> R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL ref_obj_id = 0;
> R0->active_lock_id = root->reg->id
> R0->id = ++env->id_gen; which will be cleared after !NULL check inside prog.
>
> same way we can add rb_find, rb_find_first,
> but not rb_next, rb_prev, since they don't have 'root' argument.
>
> bpf_rbtree_add(root, node, cb); KF_RELEASE.
> needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id > 0
> check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
> calls release_reference(node->ref_obj_id)
> converts 'node' to PTR_TO_BTF_ID|MEM_ALLOC ref_obj_id = 0;
> node->active_lock_id = root->reg->id
>
> 'node' is equivalent to 'first'. They both point to some element
> inside rbtree and valid inside spin_locked region.
> It's ok to read|write to both under lock.
>
> removed_node = bpf_rbtree_remove(root, node); KF_ACQUIRE|KF_RET_NULL
> need to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id = 0; and
> usual check_reg_allocation_locked(root)
> R0 = PTR_TO_BTF_ID|MEM_ALLOC|MAYBE_NULL
> R0->ref_obj_id = R0->id = acquire_reference_state();
> R0->active_lock_id should stay 0
> mark_reg_unknown(node)
>
> bpf_spin_unlock(lock);
> checks lock->id == cur->active_lock.id
> for all regs in state
> if (reg->active_lock_id == lock->id)
> mark_reg_unknown(reg)
OK, so sounds like a few more points of agreement, regardless of whether
we go the runtime checking route or the other one:
* We're tossing 'full untrusted' for now. non-owning references will not be
allowed to escape critical section. They'll be clobbered w/
mark_reg_unknown.
* No pressing need to make bpf_obj_drop callable from critical section.
As a result no owning or non-owning ref access can page fault.
* When spin_lock is unlocked, verifier needs to know about all non-owning
references so that it can clobber them. Current implementation -
ref_obj_id + release_on_unlock - is bad for a number of reasons, should
be replaced with something that doesn't use ref_obj_id or reg->id.
* Specific better approach was proposed above: new field + keep track
of lock and datastructure identity.
Differences in proposed approaches:
"Type System checks + invalidation on 'destructive' rbtree ops"
* This approach tries to prevent aliasing problems by invalidating
non-owning refs after 'destructive' rbtree ops - like rbtree_remove -
in addition to invalidation on spin_unlock
* Type system guarantees invariants:
* "if it's an owning ref, the node is guaranteed to not be in an rbtree"
* "if it's a non-owning ref, the node is guaranteed to be in an rbtree"
* Downside: mass non-owning ref invalidation on rbtree_remove will cause some
programs that logically don't have the aliasing problem to be rejected by the
verifier. Will affect usability depending on how bad this is.
"Runtime checks + spin_unlock invalidation only"
* This approach allows for the possibility of aliasing problem. As a result
the invariants guaranteed in point 2 above don't necessarily hold.
* Helpers that add or remove need to account for possibility that the node
they're operating on has already been added / removed. Need to check this
at runtime and nop if so.
* non-owning refs are only invalidated on spin_unlock.
* As a result, usability issues of previous approach don't happen here.
* Downside: Need to do runtime checks, some additional verifier complexity
to deal with "runtime check failed" case due to prev approach's invariant
not holding
Conversion of non-owning refs to 'untrusted' at an invalidation point (unlock
or remove) can be added to either approach (maybe - at least it was specifically
discussed for "runtime checks"). Such untrusted refs, by virtue of being
PTR_UNTRUSTED, can fault, and aren't accepted by rbtree_{add, remove} as input.
For the "type system" approach this might ameliorate some of the usability
issues. For the "runtime checks" approach it would only be useful to let
such refs escape spin_unlock.
But we're not going to do non-owning -> 'untrusted' for now, just listing for
completeness.
The distance between what I have now and "type system" approach is smaller
than "runtime checks" approach. And to get from "type system" to "runtime
checks" I'd need to:
* Remove 'destructive op' invalidation points
* Add runtime checks to rbtree_{add,remove}
* Add verifier handling of runtime check failure possibility
Of which only the first point is getting rid of something added for the
"type system" approach, and won't be much work relative to all the refactoring
and other improvements that are common between the two approaches.
So for V2 I will do the "type system + invalidation on 'destructive' ops"
approach as it'll take less time. This'll get eyes on common improvements
faster. Then can do a "runtime checks" v3 and we can compare usability of both
on same base.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure
2022-12-08 23:35 ` Dave Marchevsky
@ 2022-12-09 0:39 ` Alexei Starovoitov
0 siblings, 0 replies; 50+ messages in thread
From: Alexei Starovoitov @ 2022-12-09 0:39 UTC (permalink / raw)
To: Dave Marchevsky
Cc: Kumar Kartikeya Dwivedi, Dave Marchevsky, bpf, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Kernel Team, Tejun Heo
On Thu, Dec 08, 2022 at 06:35:24PM -0500, Dave Marchevsky wrote:
> >
> > Here is the proposal with one new field 'active_lock_id':
> >
> > first = bpf_rbtree_first(root) KF_RET_NULL
> > check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
> > R0 = PTR_TO_BTF_ID|MEM_ALLOC|PTR_MAYBE_NULL ref_obj_id = 0;
> > R0->active_lock_id = root->reg->id
> > R0->id = ++env->id_gen; which will be cleared after !NULL check inside prog.
> >
> > same way we can add rb_find, rb_find_first,
> > but not rb_next, rb_prev, since they don't have 'root' argument.
> >
> > bpf_rbtree_add(root, node, cb); KF_RELEASE.
> > needs to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id > 0
> > check_reg_allocation_locked() checks that root->reg->id == cur->active_lock.id
> > calls release_reference(node->ref_obj_id)
> > converts 'node' to PTR_TO_BTF_ID|MEM_ALLOC ref_obj_id = 0;
> > node->active_lock_id = root->reg->id
> >
> > 'node' is equivalent to 'first'. They both point to some element
> > inside rbtree and valid inside spin_locked region.
> > It's ok to read|write to both under lock.
> >
> > removed_node = bpf_rbtree_remove(root, node); KF_ACQUIRE|KF_RET_NULL
> > need to see PTR_TO_BTF_ID|MEM_ALLOC node->ref_obj_id = 0; and
> > usual check_reg_allocation_locked(root)
> > R0 = PTR_TO_BTF_ID|MEM_ALLOC|MAYBE_NULL
> > R0->ref_obj_id = R0->id = acquire_reference_state();
> > R0->active_lock_id should stay 0
> > mark_reg_unknown(node)
> >
> > bpf_spin_unlock(lock);
> > checks lock->id == cur->active_lock.id
> > for all regs in state
> > if (reg->active_lock_id == lock->id)
> > mark_reg_unknown(reg)
>
> OK, so sounds like a few more points of agreement, regardless of whether
> we go the runtime checking route or the other one:
>
> * We're tossing 'full untrusted' for now. non-owning references will not be
> allowed to escape critical section. They'll be clobbered w/
> mark_reg_unknown.
agree
> * No pressing need to make bpf_obj_drop callable from critical section.
> As a result no owning or non-owning ref access can page fault.
agree
>
> * When spin_lock is unlocked, verifier needs to know about all non-owning
> references so that it can clobber them. Current implementation -
> ref_obj_id + release_on_unlock - is bad for a number of reasons, should
> be replaced with something that doesn't use ref_obj_id or reg->id.
> * Specific better approach was proposed above: new field + keep track
> of lock and datastructure identity.
yes
>
> Differences in proposed approaches:
>
> "Type System checks + invalidation on 'destructive' rbtree ops"
>
> * This approach tries to prevent aliasing problems by invalidating
> non-owning refs after 'destructive' rbtree ops - like rbtree_remove -
> in addition to invalidation on spin_unlock
>
> * Type system guarantees invariants:
> * "if it's an owning ref, the node is guaranteed to not be in an rbtree"
> * "if it's a non-owning ref, the node is guaranteed to be in an rbtree"
>
> * Downside: mass non-owning ref invalidation on rbtree_remove will cause some
> programs that logically don't have an aliasing problem to be rejected by the
> verifier. Will affect usability depending on how bad this is.
yes.
>
>
> "Runtime checks + spin_unlock invalidation only"
>
> * This approach allows for the possibility of aliasing problem. As a result
> the invariants guaranteed in point 2 above don't necessarily hold.
> * Helpers that add or remove need to account for possibility that the node
> they're operating on has already been added / removed. Need to check this
> at runtime and nop if so.
Only 'remove' needs to check.
'add' is operating on 'owning ref'. It cannot fail.
Some future 'add_here(root, owning_node_to_add, nonowning_location)'
may need to fail.
>
> * non-owning refs are only invalidated on spin_unlock.
> * As a result, usability issues of previous approach don't happen here.
>
> * Downside: Need to do runtime checks, some additional verifier complexity
> to deal with "runtime check failed" case due to prev approach's invariant
> not holding
>
> Conversion of non-owning refs to 'untrusted' at an invalidation point (unlock
> or remove) can be added to either approach (maybe - at least it was specifically
> discussed for "runtime checks"). Such untrusted refs, by virtue of being
> PTR_UNTRUSTED, can fault, and aren't accepted by rbtree_{add, remove} as input.
correct.
> For the "type system" approach this might ameliorate some of the usability
> issues. For the "runtime checks" approach it would only be useful to let
> such refs escape spin_unlock.
the prog can do bpf_rdonly_cast() even after mark_unknown.
> But we're not going to do non-owning -> 'untrusted' for now, just listing for
> completeness.
right, because of bpf_rdonly_cast availability.
> The distance between what I have now and "type system" approach is smaller
> than "runtime checks" approach. And to get from "type system" to "runtime
> checks" I'd need to:
>
> * Remove 'destructive op' invalidation points
> * Add runtime checks to rbtree_{add,remove}
> * Add verifier handling of runtime check failure possibility
>
> Of which only the first point is getting rid of something added for the
> "type system" approach, and won't be much work relative to all the refactoring
> and other improvements that are common between the two approaches.
>
> So for V2 I will do the "type system + invalidation on 'destructive' ops"
> approach as it'll take less time. This'll get eyes on common improvements
> faster. Then can do a "runtime checks" v3 and we can compare usability of both
> on same base.
Sure, if you think cleanup on rbtree_remove is faster to implement
then definitely go for it.
I was imagining the other way around, but it's fine. Happy to be wrong.
I'm not seeing though how you're going to do that cleanup.
Another id-like field?
Before doing all coding could you post a proposal in the format that I did above?
imo it's much easier to think through in that form instead of analyzing the src code.
* Re: [PATCH bpf-next 08/13] bpf: Add callback validation to kfunc verifier logic
2022-12-07 2:01 ` Alexei Starovoitov
@ 2022-12-17 8:49 ` Dave Marchevsky
0 siblings, 0 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-17 8:49 UTC (permalink / raw)
To: Alexei Starovoitov, Dave Marchevsky
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Kernel Team, Kumar Kartikeya Dwivedi, Tejun Heo
On 12/6/22 9:01 PM, Alexei Starovoitov wrote:
> On Tue, Dec 06, 2022 at 03:09:55PM -0800, Dave Marchevsky wrote:
>> Some BPF helpers take a callback function which the helper calls. For
>> each helper that takes such a callback, there's a special call to
>> __check_func_call with a callback-state-setting callback that sets up
>> verifier bpf_func_state for the callback's frame.
>>
>> kfuncs don't have any of this infrastructure yet, so let's add it in
>> this patch, following existing helper pattern as much as possible. To
>> validate functionality of this added plumbing, this patch adds
>> callback handling for the bpf_rbtree_add kfunc and hopes to lay
>> groundwork for future next-gen datastructure callbacks.
>>
>> In the "general plumbing" category we have:
>>
>> * check_kfunc_call doing callback verification right before clearing
>> CALLER_SAVED_REGS, exactly like check_helper_call
>> * recognition of func_ptr BTF types in kfunc args as
>> KF_ARG_PTR_TO_CALLBACK + propagation of subprogno for this arg type
>>
>> In the "rbtree_add / next-gen datastructure-specific plumbing" category:
>>
>> * Since bpf_rbtree_add must be called while the spin_lock associated
>> with the tree is held, don't complain when callback's func_state
>> doesn't unlock it by frame exit
>> * Mark rbtree_add callback's args PTR_UNTRUSTED to prevent rbtree
>> api functions from being called in the callback
>>
>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>> ---
>> kernel/bpf/verifier.c | 136 ++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 130 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 652112007b2c..9ad8c0b264dc 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -1448,6 +1448,16 @@ static void mark_ptr_not_null_reg(struct bpf_reg_state *reg)
>> reg->type &= ~PTR_MAYBE_NULL;
>> }
>>
>> +static void mark_reg_datastructure_node(struct bpf_reg_state *regs, u32 regno,
>> + struct btf_field_datastructure_head *ds_head)
>> +{
>> + __mark_reg_known_zero(&regs[regno]);
>> + regs[regno].type = PTR_TO_BTF_ID | MEM_ALLOC;
>> + regs[regno].btf = ds_head->btf;
>> + regs[regno].btf_id = ds_head->value_btf_id;
>> + regs[regno].off = ds_head->node_offset;
>> +}
>> +
>> static bool reg_is_pkt_pointer(const struct bpf_reg_state *reg)
>> {
>> return type_is_pkt_pointer(reg->type);
>> @@ -4771,7 +4781,8 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
>> return -EACCES;
>> }
>>
>> - if (type_is_alloc(reg->type) && !reg->ref_obj_id) {
>> + if (type_is_alloc(reg->type) && !reg->ref_obj_id &&
>> + !cur_func(env)->in_callback_fn) {
>> verbose(env, "verifier internal error: ref_obj_id for allocated object must be non-zero\n");
>> return -EFAULT;
>> }
>> @@ -6952,6 +6963,8 @@ static int set_callee_state(struct bpf_verifier_env *env,
>> struct bpf_func_state *caller,
>> struct bpf_func_state *callee, int insn_idx);
>>
>> +static bool is_callback_calling_kfunc(u32 btf_id);
>> +
>> static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>> int *insn_idx, int subprog,
>> set_callee_state_fn set_callee_state_cb)
>> @@ -7006,10 +7019,18 @@ static int __check_func_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>> * interested in validating only BPF helpers that can call subprogs as
>> * callbacks
>> */
>> - if (set_callee_state_cb != set_callee_state && !is_callback_calling_function(insn->imm)) {
>> - verbose(env, "verifier bug: helper %s#%d is not marked as callback-calling\n",
>> - func_id_name(insn->imm), insn->imm);
>> - return -EFAULT;
>> + if (set_callee_state_cb != set_callee_state) {
>> + if (bpf_pseudo_kfunc_call(insn) &&
>> + !is_callback_calling_kfunc(insn->imm)) {
>> + verbose(env, "verifier bug: kfunc %s#%d not marked as callback-calling\n",
>> + func_id_name(insn->imm), insn->imm);
>> + return -EFAULT;
>> + } else if (!bpf_pseudo_kfunc_call(insn) &&
>> + !is_callback_calling_function(insn->imm)) { /* helper */
>> + verbose(env, "verifier bug: helper %s#%d not marked as callback-calling\n",
>> + func_id_name(insn->imm), insn->imm);
>> + return -EFAULT;
>> + }
>> }
>>
>> if (insn->code == (BPF_JMP | BPF_CALL) &&
>> @@ -7275,6 +7296,67 @@ static int set_user_ringbuf_callback_state(struct bpf_verifier_env *env,
>> return 0;
>> }
>>
>> +static int set_rbtree_add_callback_state(struct bpf_verifier_env *env,
>> + struct bpf_func_state *caller,
>> + struct bpf_func_state *callee,
>> + int insn_idx)
>> +{
>> + /* void bpf_rbtree_add(struct bpf_rb_root *root, struct bpf_rb_node *node,
>> + * bool (less)(struct bpf_rb_node *a, const struct bpf_rb_node *b));
>> + *
>> + * 'struct bpf_rb_node *node' arg to bpf_rbtree_add is the same PTR_TO_BTF_ID w/ offset
>> + * that 'less' callback args will be receiving. However, 'node' arg was release_reference'd
>> + * by this point, so look at 'root'
>> + */
>> + struct btf_field *field;
>> + struct btf_record *rec;
>> +
>> + rec = reg_btf_record(&caller->regs[BPF_REG_1]);
>> + if (!rec)
>> + return -EFAULT;
>> +
>> + field = btf_record_find(rec, caller->regs[BPF_REG_1].off, BPF_RB_ROOT);
>> + if (!field || !field->datastructure_head.value_btf_id)
>> + return -EFAULT;
>> +
>> + mark_reg_datastructure_node(callee->regs, BPF_REG_1, &field->datastructure_head);
>> + callee->regs[BPF_REG_1].type |= PTR_UNTRUSTED;
>> + mark_reg_datastructure_node(callee->regs, BPF_REG_2, &field->datastructure_head);
>> + callee->regs[BPF_REG_2].type |= PTR_UNTRUSTED;
>
> Please add a comment here to explain that the pointers are actually trusted
> and here it's a quick hack to prevent callback to call into rb_tree kfuncs.
> We definitely would need to clean it up.
> Have you tried to check for is_bpf_list_api_kfunc() || is_bpf_rbtree_api_kfunc()
> while processing kfuncs inside callback ?
>
>> + callee->in_callback_fn = true;
>
> this will give you a flag to do that check.
>
>> + callee->callback_ret_range = tnum_range(0, 1);
>> + return 0;
>> +}
>> +
>> +static bool is_rbtree_lock_required_kfunc(u32 btf_id);
>> +
>> +/* Are we currently verifying the callback for a rbtree helper that must
>> + * be called with lock held? If so, no need to complain about unreleased
>> + * lock
>> + */
>> +static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env)
>> +{
>> + struct bpf_verifier_state *state = env->cur_state;
>> + struct bpf_insn *insn = env->prog->insnsi;
>> + struct bpf_func_state *callee;
>> + int kfunc_btf_id;
>> +
>> + if (!state->curframe)
>> + return false;
>> +
>> + callee = state->frame[state->curframe];
>> +
>> + if (!callee->in_callback_fn)
>> + return false;
>> +
>> + kfunc_btf_id = insn[callee->callsite].imm;
>> + return is_rbtree_lock_required_kfunc(kfunc_btf_id);
>> +}
>> +
>> static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
>> {
>> struct bpf_verifier_state *state = env->cur_state;
>> @@ -8007,6 +8089,7 @@ struct bpf_kfunc_call_arg_meta {
>> bool r0_rdonly;
>> u32 ret_btf_id;
>> u64 r0_size;
>> + u32 subprogno;
>> struct {
>> u64 value;
>> bool found;
>> @@ -8185,6 +8268,18 @@ static bool is_kfunc_arg_rbtree_node(const struct btf *btf, const struct btf_par
>> return __is_kfunc_ptr_arg_type(btf, arg, KF_ARG_RB_NODE_ID);
>> }
>>
>> +static bool is_kfunc_arg_callback(struct bpf_verifier_env *env, const struct btf *btf,
>> + const struct btf_param *arg)
>> +{
>> + const struct btf_type *t;
>> +
>> + t = btf_type_resolve_func_ptr(btf, arg->type, NULL);
>> + if (!t)
>> + return false;
>> +
>> + return true;
>> +}
>> +
>> /* Returns true if struct is composed of scalars, 4 levels of nesting allowed */
>> static bool __btf_type_is_scalar_struct(struct bpf_verifier_env *env,
>> const struct btf *btf,
>> @@ -8244,6 +8339,7 @@ enum kfunc_ptr_arg_type {
>> KF_ARG_PTR_TO_BTF_ID, /* Also covers reg2btf_ids conversions */
>> KF_ARG_PTR_TO_MEM,
>> KF_ARG_PTR_TO_MEM_SIZE, /* Size derived from next argument, skip it */
>> + KF_ARG_PTR_TO_CALLBACK,
>> KF_ARG_PTR_TO_RB_ROOT,
>> KF_ARG_PTR_TO_RB_NODE,
>> };
>> @@ -8368,6 +8464,9 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
>> return KF_ARG_PTR_TO_BTF_ID;
>> }
>>
>> + if (is_kfunc_arg_callback(env, meta->btf, &args[argno]))
>> + return KF_ARG_PTR_TO_CALLBACK;
>> +
>> + if (argno + 1 < nargs && is_kfunc_arg_mem_size(meta->btf, &args[argno + 1], &regs[regno + 1]))
>> arg_mem_size = true;
>>
>> @@ -8585,6 +8684,16 @@ static bool is_bpf_datastructure_api_kfunc(u32 btf_id)
>> return is_bpf_list_api_kfunc(btf_id) || is_bpf_rbtree_api_kfunc(btf_id);
>> }
>>
>> +static bool is_callback_calling_kfunc(u32 btf_id)
>> +{
>> + return btf_id == special_kfunc_list[KF_bpf_rbtree_add];
>> +}
>> +
>> +static bool is_rbtree_lock_required_kfunc(u32 btf_id)
>> +{
>> + return is_bpf_rbtree_api_kfunc(btf_id);
>> +}
>> +
>> static bool check_kfunc_is_datastructure_head_api(struct bpf_verifier_env *env,
>> enum btf_field_type head_field_type,
>> u32 kfunc_btf_id)
>> @@ -8920,6 +9029,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>> case KF_ARG_PTR_TO_RB_NODE:
>> case KF_ARG_PTR_TO_MEM:
>> case KF_ARG_PTR_TO_MEM_SIZE:
>> + case KF_ARG_PTR_TO_CALLBACK:
>> /* Trusted by default */
>> break;
>> default:
>> @@ -9078,6 +9188,9 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>> /* Skip next '__sz' argument */
>> i++;
>> break;
>> + case KF_ARG_PTR_TO_CALLBACK:
>> + meta->subprogno = reg->subprogno;
>> + break;
>> }
>> }
>>
>> @@ -9193,6 +9306,16 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>> }
>> }
>>
>> + if (meta.func_id == special_kfunc_list[KF_bpf_rbtree_add]) {
>> + err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
>> + set_rbtree_add_callback_state);
>> + if (err) {
>> + verbose(env, "kfunc %s#%d failed callback verification\n",
>> + func_name, func_id);
>> + return err;
>> + }
>> + }
>> +
>> for (i = 0; i < CALLER_SAVED_REGS; i++)
>> mark_reg_not_init(env, regs, caller_saved[i]);
>>
>> @@ -14023,7 +14146,8 @@ static int do_check(struct bpf_verifier_env *env)
>> return -EINVAL;
>> }
>>
>> - if (env->cur_state->active_lock.ptr) {
>> + if (env->cur_state->active_lock.ptr &&
>> + !in_rbtree_lock_required_cb(env)) {
>
> That looks wrong.
> It will allow callbacks to use unpaired lock/unlock.
> Have you tried clearing cur_state->active_lock when entering callback?
> That should solve it and won't cause lock/unlock imbalance.
I didn't directly address this in v2. cur_state->active_lock isn't cleared.
rbtree callback is explicitly prevented from calling spin_{lock,unlock}, and
this check above is preserved so that verifier doesn't complain when cb exits
w/o releasing lock.
Logic for keeping it this way was:
* We discussed allowing rbtree_first() call in less() cb, which requires
correct lock to be held, so might as well keep lock info around
* Similarly, because non-owning refs use active_lock info, need to keep
info around.
* Could work around both issues above, but net result would probably be
_more_ special-casing, just in different places.
Not trying to resurrect v1 with this comment, we can continue convo on
same patch in v2: https://lore.kernel.org/bpf/20221217082506.1570898-9-davemarchevsky@fb.com/
* Re: [PATCH bpf-next 02/13] bpf: map_check_btf should fail if btf_parse_fields fails
2022-12-07 19:05 ` Alexei Starovoitov
@ 2022-12-17 8:59 ` Dave Marchevsky
0 siblings, 0 replies; 50+ messages in thread
From: Dave Marchevsky @ 2022-12-17 8:59 UTC (permalink / raw)
To: Alexei Starovoitov, Kumar Kartikeya Dwivedi
Cc: Dave Marchevsky, bpf, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Kernel Team, Tejun Heo
On 12/7/22 2:05 PM, Alexei Starovoitov wrote:
> On Wed, Dec 07, 2022 at 10:19:00PM +0530, Kumar Kartikeya Dwivedi wrote:
>> On Wed, Dec 07, 2022 at 04:39:49AM IST, Dave Marchevsky wrote:
>>> map_check_btf calls btf_parse_fields to create a btf_record for its
>>> value_type. If there are no special fields in the value_type
>>> btf_parse_fields returns NULL, whereas if there are special value_type
>>> fields but they are invalid in some way, an error is returned.
>>>
>>> An example invalid state would be:
>>>
>>> struct node_data {
>>> struct bpf_rb_node node;
>>> int data;
>>> };
>>>
>>> private(A) struct bpf_spin_lock glock;
>>> private(A) struct bpf_list_head ghead __contains(node_data, node);
>>>
>>> ghead should be invalid as its __contains tag points to a field with
>>> type != "bpf_list_node".
>>>
>>> Before this patch, such a scenario would result in btf_parse_fields
>>> returning an error ptr, subsequent !IS_ERR_OR_NULL check failing,
>>> and btf_check_and_fixup_fields returning 0, which would then be
>>> returned by map_check_btf.
>>>
>>> After this patch's changes, -EINVAL would be returned by map_check_btf
>>> and the map would correctly fail to load.
>>>
>>> Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
>>> cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>>> Fixes: aa3496accc41 ("bpf: Refactor kptr_off_tab into btf_record")
>>> ---
>>> kernel/bpf/syscall.c | 5 ++++-
>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>>> index 35972afb6850..c3599a7902f0 100644
>>> --- a/kernel/bpf/syscall.c
>>> +++ b/kernel/bpf/syscall.c
>>> @@ -1007,7 +1007,10 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf,
>>> map->record = btf_parse_fields(btf, value_type,
>>> BPF_SPIN_LOCK | BPF_TIMER | BPF_KPTR | BPF_LIST_HEAD,
>>> map->value_size);
>>> - if (!IS_ERR_OR_NULL(map->record)) {
>>> + if (IS_ERR(map->record))
>>> + return -EINVAL;
>>> +
>>
>> I didn't do this on purpose, because of backward compatibility concerns. An
>> error has not been returned in earlier kernel versions during map creation time
>> and those fields acted like normal non-special regions, with errors on use of
>> helpers that act on those fields.
>>
>> Especially that bpf_spin_lock and bpf_timer are part of the unified btf_record.
>>
>> If we are doing such a change, then you should also drop the checks for IS_ERR
>> in verifier.c, since that shouldn't be possible anymore. But I think we need to
>> think carefully before changing this.
>>
>> One possible example is: If we introduce bpf_foo in the future and program
>> already has that defined in map value, using it for some other purpose, with
>> different alignment and size, their map creation will start failing.
>
> That's a good point.
> If we can error on such misconstructed map at the program verification time that's better
> anyway, since there will be a proper verifier log instead of EINVAL from map_create.
In v2 I addressed these comments by just dropping this patch. No additional
logic is needed for "error at verification time", since btf_parse_fields doesn't
create a btf_record, and thus the first insn that expects the map_val to have
one will cause verification to fail.
For my "list_head __contains rb_node" case, the first insn is usually
bpf_spin_lock call, which also needs a populated btf_record for spin_lock.
Unfortunately this doesn't really achieve "proper verifier log", since
spin_lock definition isn't the root cause here, but verifier error msg can
only complain about spin_lock.
Not that the error message coming from BTF parse or check failing is any
better.
Anyways, I think there's some path forward here that results in a good error
message. But semantics work how we want them to without this commit, so it can
be delayed for followups.
Thread overview: 50+ messages
2022-12-06 23:09 [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 01/13] bpf: Loosen alloc obj test in verifier's reg_btf_record Dave Marchevsky
2022-12-07 16:41 ` Kumar Kartikeya Dwivedi
2022-12-07 18:34 ` Dave Marchevsky
2022-12-07 18:59 ` Alexei Starovoitov
2022-12-07 20:38 ` Dave Marchevsky
2022-12-07 22:46 ` Alexei Starovoitov
2022-12-07 23:42 ` Dave Marchevsky
2022-12-07 19:03 ` Kumar Kartikeya Dwivedi
2022-12-06 23:09 ` [PATCH bpf-next 02/13] bpf: map_check_btf should fail if btf_parse_fields fails Dave Marchevsky
2022-12-07 1:32 ` Alexei Starovoitov
2022-12-07 16:49 ` Kumar Kartikeya Dwivedi
2022-12-07 19:05 ` Alexei Starovoitov
2022-12-17 8:59 ` Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 03/13] bpf: Minor refactor of ref_set_release_on_unlock Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 04/13] bpf: rename list_head -> datastructure_head in field info types Dave Marchevsky
2022-12-07 1:41 ` Alexei Starovoitov
2022-12-07 18:52 ` Dave Marchevsky
2022-12-07 19:01 ` Alexei Starovoitov
2022-12-06 23:09 ` [PATCH bpf-next 05/13] bpf: Add basic bpf_rb_{root,node} support Dave Marchevsky
2022-12-07 1:48 ` Alexei Starovoitov
2022-12-06 23:09 ` [PATCH bpf-next 06/13] bpf: Add bpf_rbtree_{add,remove,first} kfuncs Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 07/13] bpf: Add support for bpf_rb_root and bpf_rb_node in kfunc args Dave Marchevsky
2022-12-07 1:51 ` Alexei Starovoitov
2022-12-06 23:09 ` [PATCH bpf-next 08/13] bpf: Add callback validation to kfunc verifier logic Dave Marchevsky
2022-12-07 2:01 ` Alexei Starovoitov
2022-12-17 8:49 ` Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 09/13] bpf: Special verifier handling for bpf_rbtree_{remove, first} Dave Marchevsky
2022-12-07 2:18 ` Alexei Starovoitov
2022-12-06 23:09 ` [PATCH bpf-next 10/13] bpf, x86: BPF_PROBE_MEM handling for insn->off < 0 Dave Marchevsky
2022-12-07 2:39 ` Alexei Starovoitov
2022-12-07 6:46 ` Dave Marchevsky
2022-12-07 18:06 ` Alexei Starovoitov
2022-12-07 23:39 ` Dave Marchevsky
2022-12-08 0:47 ` Alexei Starovoitov
2022-12-08 8:50 ` Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 11/13] bpf: Add bpf_rbtree_{add,remove,first} decls to bpf_experimental.h Dave Marchevsky
2022-12-06 23:09 ` [PATCH bpf-next 12/13] libbpf: Make BTF mandatory if program BTF has spin_lock or alloc_obj type Dave Marchevsky
2022-12-06 23:10 ` [PATCH bpf-next 13/13] selftests/bpf: Add rbtree selftests Dave Marchevsky
2022-12-07 2:50 ` [PATCH bpf-next 00/13] BPF rbtree next-gen datastructure patchwork-bot+netdevbpf
2022-12-07 19:36 ` Kumar Kartikeya Dwivedi
2022-12-07 22:28 ` Dave Marchevsky
2022-12-07 23:06 ` Alexei Starovoitov
2022-12-08 1:18 ` Dave Marchevsky
2022-12-08 3:51 ` Alexei Starovoitov
2022-12-08 8:28 ` Dave Marchevsky
2022-12-08 12:57 ` Kumar Kartikeya Dwivedi
2022-12-08 20:36 ` Alexei Starovoitov
2022-12-08 23:35 ` Dave Marchevsky
2022-12-09 0:39 ` Alexei Starovoitov