* [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-14 19:58 ` Amery Hung
` (2 more replies)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 02/13] bpf: Compare dynptr_id in regsafe Kumar Kartikeya Dwivedi
` (11 subsequent siblings)
12 siblings, 3 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Amery Hung, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Ensure that for dynptr constructors (MEM_UNINIT arg_type) taking a
referenced object (ref_obj_id) as their memory source, we set the
ref_obj_id of the dynptr appropriately as well. This allows us to
tie the lifetime of the dynptr to its source and properly invalidate
it when the source object is released.
For helpers, we don't have such cases yet as bpf_dynptr_from_mem is
not permitted for anything other than PTR_TO_MAP_VALUE, but still pass
meta->ref_obj_id as clone_ref_obj_id in case this is relaxed in future.
Since they are ossified we know dynptr_from_mem is the only relevant
helper and takes one memory argument, so we know the meta->ref_obj_id if
non-zero will belong to it.
For kfuncs, make sure for constructors, only 1 ref_obj_id argument is
seen, as more than one can be ambiguous in terms of ref_obj_id transfer.
Since more kfuncs can be added with possibly multiple memory arguments,
make sure meta->ref_obj_id reassignment won't cause incorrect lifetime
analysis in the future using ref_obj_cnt logic. Set this ref_obj_id as
the clone_ref_obj_id, so that it is transferred to the spilled_ptr stack
slot register state.
Add support to unmark_stack_slots_dynptr to not descend to its child
slices (using bool slice parameter) so that we don't have circular calls
when invoking release_reference. With this logic in place, we may have
the following object relationships:
+-- slice 1 (ref=1)
source (ref=1) --> dynptr (ref=1) --|-- slice 2 (ref=1)
+-- slice 3 (ref=1)
Destroying dynptr prunes the dynptr and all its children slices, but
does not affect the source. Releasing the source will effectively prune
the entire ownership tree. Dynptr clones with same ref=1 will be
parallel in the ownership tree.
+-- orig dptr (ref=1) --> slice 1 (ref=1)
source (ref=1) --|-- clone dptr (ref=1) --> slice 2 (ref=1)
+-- clone dptr (ref=1) --> slice 3 (ref=1)
In such a case, only child slices of the dynptr clone being destroyed
are invalidated. Likewise, if the source object goes away, the whole
tree ends up getting pruned.
Cc: Amery Hung <ameryhung@gmail.com>
Fixes: 81bb1c2c3e8e ("bpf: net_sched: Add basic bpf qdisc kfuncs")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/verifier.c | 81 ++++++++++++++++++++++++++++---------------
1 file changed, 54 insertions(+), 27 deletions(-)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 54c6953a8b84..a62dfab9aea6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -198,7 +198,7 @@ struct bpf_verifier_stack_elem {
static int acquire_reference(struct bpf_verifier_env *env, int insn_idx);
static int release_reference_nomark(struct bpf_verifier_state *state, int ref_obj_id);
-static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
+static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects);
static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env);
static int ref_set_non_owning(struct bpf_verifier_env *env,
@@ -299,6 +299,7 @@ struct bpf_kfunc_call_arg_meta {
const char *func_name;
/* Out parameters */
u32 ref_obj_id;
+ u32 ref_obj_cnt;
u8 release_regno;
bool r0_rdonly;
u32 ret_btf_id;
@@ -759,7 +760,7 @@ static int mark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_
mark_dynptr_stack_regs(env, &state->stack[spi].spilled_ptr,
&state->stack[spi - 1].spilled_ptr, type);
- if (dynptr_type_refcounted(type)) {
+ if (dynptr_type_refcounted(type) || clone_ref_obj_id) {
/* The id is used to track proper releasing */
int id;
@@ -818,22 +819,19 @@ static void invalidate_dynptr(struct bpf_verifier_env *env, struct bpf_func_stat
state->stack[spi - 1].spilled_ptr.live |= REG_LIVE_WRITTEN;
}
-static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
+static int __unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_func_state *state,
+ int spi, bool slice)
{
- struct bpf_func_state *state = func(env, reg);
- int spi, ref_obj_id, i;
+ u32 ref_obj_id;
+ int i;
- spi = dynptr_get_spi(env, reg);
- if (spi < 0)
- return spi;
+ ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
- if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type)) {
+ if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type) && !ref_obj_id) {
invalidate_dynptr(env, state, spi);
return 0;
}
- ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
-
/* If the dynptr has a ref_obj_id, then we need to invalidate
* two things:
*
@@ -842,7 +840,8 @@ static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_re
*/
/* Invalidate any slices associated with this dynptr */
- WARN_ON_ONCE(release_reference(env, ref_obj_id));
+ if (slice)
+ WARN_ON_ONCE(release_reference(env, ref_obj_id, false));
/* Invalidate any dynptr clones */
for (i = 1; i < state->allocated_stack / BPF_REG_SIZE; i++) {
@@ -864,6 +863,18 @@ static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_re
return 0;
}
+static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_state *reg, bool slice)
+{
+ struct bpf_func_state *state = func(env, reg);
+ int spi;
+
+ spi = dynptr_get_spi(env, reg);
+ if (spi < 0)
+ return spi;
+
+ return __unmark_stack_slots_dynptr(env, state, spi, slice);
+}
+
static void __mark_reg_unknown(const struct bpf_verifier_env *env,
struct bpf_reg_state *reg);
@@ -1075,7 +1086,7 @@ static int unmark_stack_slots_iter(struct bpf_verifier_env *env,
struct bpf_reg_state *st = &slot->spilled_ptr;
if (i == 0)
- WARN_ON_ONCE(release_reference(env, st->ref_obj_id));
+ WARN_ON_ONCE(release_reference(env, st->ref_obj_id, false));
__mark_reg_not_init(env, st);
@@ -9749,7 +9760,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
true, meta);
break;
case ARG_PTR_TO_DYNPTR:
- err = process_dynptr_func(env, regno, insn_idx, arg_type, 0);
+ err = process_dynptr_func(env, regno, insn_idx, arg_type, meta->ref_obj_id);
if (err)
return err;
break;
@@ -10220,12 +10231,12 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int ref_ob
*
* This is the release function corresponding to acquire_reference(). Idempotent.
*/
-static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
+static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects)
{
struct bpf_verifier_state *vstate = env->cur_state;
struct bpf_func_state *state;
struct bpf_reg_state *reg;
- int err;
+ int err, spi;
err = release_reference_nomark(vstate, ref_obj_id);
if (err)
@@ -10236,6 +10247,19 @@ static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
mark_reg_invalid(env, reg);
}));
+ if (!objects)
+ return 0;
+
+ bpf_for_each_spilled_reg(spi, state, reg, (1 << STACK_DYNPTR)) {
+ if (!reg)
+ continue;
+ if (!reg->dynptr.first_slot || reg->ref_obj_id != ref_obj_id)
+ continue;
+ err = __unmark_stack_slots_dynptr(env, state, spi, false);
+ if (err)
+ return err;
+ }
+
return 0;
}
@@ -11357,7 +11381,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
verbose(env, "verifier internal error: CONST_PTR_TO_DYNPTR cannot be released\n");
return -EFAULT;
}
- err = unmark_stack_slots_dynptr(env, ®s[meta.release_regno]);
+ err = unmark_stack_slots_dynptr(env, ®s[meta.release_regno], true);
} else if (func_id == BPF_FUNC_kptr_xchg && meta.ref_obj_id) {
u32 ref_obj_id = meta.ref_obj_id;
bool in_rcu = in_rcu_cs(env);
@@ -11379,7 +11403,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
}));
}
} else if (meta.ref_obj_id) {
- err = release_reference(env, meta.ref_obj_id);
+ err = release_reference(env, meta.ref_obj_id, true);
} else if (register_is_null(®s[meta.release_regno])) {
/* meta.ref_obj_id can only be 0 if register that is meant to be
* released is NULL, which must be > R0.
@@ -12974,6 +12998,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
meta->ref_obj_id = reg->ref_obj_id;
if (is_kfunc_release(meta))
meta->release_regno = regno;
+ meta->ref_obj_cnt++;
}
ref_t = btf_type_skip_modifiers(btf, t->type, &ref_id);
@@ -13100,13 +13125,19 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
case KF_ARG_PTR_TO_DYNPTR:
{
enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
- int clone_ref_obj_id = 0;
+ int clone_ref_obj_id = meta->ref_obj_id;
if (reg->type == CONST_PTR_TO_DYNPTR)
dynptr_arg_type |= MEM_RDONLY;
- if (is_kfunc_arg_uninit(btf, &args[i]))
+ if (is_kfunc_arg_uninit(btf, &args[i])) {
dynptr_arg_type |= MEM_UNINIT;
+ /* It's confusing if dynptr constructor takes multiple referenced arguments. */
+ if (meta->ref_obj_cnt > 1) {
+ verbose(env, "verifier internal error: multiple referenced arguments\n");
+ return -EFAULT;
+ }
+ }
if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
dynptr_arg_type |= DYNPTR_TYPE_SKB;
@@ -13582,7 +13613,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
* PTR_TO_BTF_ID in bpf_kfunc_arg_meta, do the release now.
*/
if (meta.release_regno) {
- err = release_reference(env, regs[meta.release_regno].ref_obj_id);
+ err = release_reference(env, regs[meta.release_regno].ref_obj_id, true);
if (err) {
verbose(env, "kfunc %s#%d reference has not been acquired before\n",
func_name, meta.func_id);
@@ -13603,7 +13634,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
return err;
}
- err = release_reference(env, release_ref_obj_id);
+ err = release_reference(env, release_ref_obj_id, true);
if (err) {
verbose(env, "kfunc %s#%d reference has not been acquired before\n",
func_name, meta.func_id);
@@ -13803,11 +13834,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
return -EFAULT;
}
regs[BPF_REG_0].dynptr_id = meta.initialized_dynptr.id;
-
- /* we don't need to set BPF_REG_0's ref obj id
- * because packet slices are not refcounted (see
- * dynptr_type_refcounted)
- */
+ regs[BPF_REG_0].ref_obj_id = meta.initialized_dynptr.ref_obj_id;
} else {
verbose(env, "kernel function %s unhandled dynamic return type\n",
meta.func_name);
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects Kumar Kartikeya Dwivedi
@ 2025-04-14 19:58 ` Amery Hung
2025-04-17 19:53 ` Kumar Kartikeya Dwivedi
2025-04-16 5:52 ` Eduard Zingerman
2025-04-16 21:04 ` Andrii Nakryiko
2 siblings, 1 reply; 44+ messages in thread
From: Amery Hung @ 2025-04-14 19:58 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
On Mon, Apr 14, 2025 at 9:14 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> Ensure that for dynptr constructors (MEM_UNINIT arg_type) taking a
> referenced object (ref_obj_id) as their memory source, we set the
> ref_obj_id of the dynptr appropriately as well. This allows us to
> tie the lifetime of the dynptr to its source and properly invalidate
> it when the source object is released.
>
> For helpers, we don't have such cases yet as bpf_dynptr_from_mem is
> not permitted for anything other than PTR_TO_MAP_VALUE, but still pass
> meta->ref_obj_id as clone_ref_obj_id in case this is relaxed in future.
> Since they are ossified we know dynptr_from_mem is the only relevant
> helper and takes one memory argument, so we know the meta->ref_obj_id if
> non-zero will belong to it.
>
> For kfuncs, make sure for constructors, only 1 ref_obj_id argument is
> seen, as more than one can be ambiguous in terms of ref_obj_id transfer.
> Since more kfuncs can be added with possibly multiple memory arguments,
> make sure meta->ref_obj_id reassignment won't cause incorrect lifetime
> analysis in the future using ref_obj_cnt logic. Set this ref_obj_id as
> the clone_ref_obj_id, so that it is transferred to the spilled_ptr stack
> slot register state.
>
> Add support to unmark_stack_slots_dynptr to not descend to its child
> slices (using bool slice parameter) so that we don't have circular calls
> when invoking release_reference. With this logic in place, we may have
> the following object relationships:
> +-- slice 1 (ref=1)
> source (ref=1) --> dynptr (ref=1) --|-- slice 2 (ref=1)
> +-- slice 3 (ref=1)
>
> Destroying dynptr prunes the dynptr and all its children slices, but
> does not affect the source. Releasing the source will effectively prune
> the entire ownership tree. Dynptr clones with same ref=1 will be
> parallel in the ownership tree.
>
> +-- orig dptr (ref=1) --> slice 1 (ref=1)
> source (ref=1) --|-- clone dptr (ref=1) --> slice 2 (ref=1)
> +-- clone dptr (ref=1) --> slice 3 (ref=1)
>
> In such a case, only child slices of the dynptr clone being destroyed
> are invalidated. Likewise, if the source object goes away, the whole
> tree ends up getting pruned.
>
> Cc: Amery Hung <ameryhung@gmail.com>
> Fixes: 81bb1c2c3e8e ("bpf: net_sched: Add basic bpf qdisc kfuncs")
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> kernel/bpf/verifier.c | 81 ++++++++++++++++++++++++++++---------------
> 1 file changed, 54 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 54c6953a8b84..a62dfab9aea6 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -198,7 +198,7 @@ struct bpf_verifier_stack_elem {
>
> static int acquire_reference(struct bpf_verifier_env *env, int insn_idx);
> static int release_reference_nomark(struct bpf_verifier_state *state, int ref_obj_id);
> -static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
> +static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects);
> static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
> static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env);
> static int ref_set_non_owning(struct bpf_verifier_env *env,
> @@ -299,6 +299,7 @@ struct bpf_kfunc_call_arg_meta {
> const char *func_name;
> /* Out parameters */
> u32 ref_obj_id;
> + u32 ref_obj_cnt;
> u8 release_regno;
> bool r0_rdonly;
> u32 ret_btf_id;
> @@ -759,7 +760,7 @@ static int mark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_
> mark_dynptr_stack_regs(env, &state->stack[spi].spilled_ptr,
> &state->stack[spi - 1].spilled_ptr, type);
>
> - if (dynptr_type_refcounted(type)) {
> + if (dynptr_type_refcounted(type) || clone_ref_obj_id) {
> /* The id is used to track proper releasing */
> int id;
>
> @@ -818,22 +819,19 @@ static void invalidate_dynptr(struct bpf_verifier_env *env, struct bpf_func_stat
> state->stack[spi - 1].spilled_ptr.live |= REG_LIVE_WRITTEN;
> }
>
> -static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> +static int __unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_func_state *state,
> + int spi, bool slice)
> {
> - struct bpf_func_state *state = func(env, reg);
> - int spi, ref_obj_id, i;
> + u32 ref_obj_id;
> + int i;
>
> - spi = dynptr_get_spi(env, reg);
> - if (spi < 0)
> - return spi;
> + ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
>
> - if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type)) {
> + if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type) && !ref_obj_id) {
> invalidate_dynptr(env, state, spi);
> return 0;
> }
>
> - ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
> -
> /* If the dynptr has a ref_obj_id, then we need to invalidate
> * two things:
> *
> @@ -842,7 +840,8 @@ static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_re
> */
>
> /* Invalidate any slices associated with this dynptr */
> - WARN_ON_ONCE(release_reference(env, ref_obj_id));
> + if (slice)
> + WARN_ON_ONCE(release_reference(env, ref_obj_id, false));
I think the slice argument might not be needed if we do the following instead:
if (dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type))
WARN_ON_ONCE(release_reference(env, ref_obj_id, false));
When bpf_kfree_skb():
-> release_reference(..., object = true)
-> __unmark_stack_slots_dynptr()
-> not calling into release_reference() since
dynptr_type_refcounted() returns false
When bpf_ringbuf_submit():
-> unmark_stack_slots_dynptr()
-> release_reference(..., object = false)
-> not calling __unmark_stack_slots_dynptr() since object == false
Am I missing anything?
>
> /* Invalidate any dynptr clones */
> for (i = 1; i < state->allocated_stack / BPF_REG_SIZE; i++) {
> @@ -864,6 +863,18 @@ static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_re
> return 0;
> }
>
> +static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_state *reg, bool slice)
> +{
> + struct bpf_func_state *state = func(env, reg);
> + int spi;
> +
> + spi = dynptr_get_spi(env, reg);
> + if (spi < 0)
> + return spi;
> +
> + return __unmark_stack_slots_dynptr(env, state, spi, slice);
> +}
> +
> static void __mark_reg_unknown(const struct bpf_verifier_env *env,
> struct bpf_reg_state *reg);
>
> @@ -1075,7 +1086,7 @@ static int unmark_stack_slots_iter(struct bpf_verifier_env *env,
> struct bpf_reg_state *st = &slot->spilled_ptr;
>
> if (i == 0)
> - WARN_ON_ONCE(release_reference(env, st->ref_obj_id));
> + WARN_ON_ONCE(release_reference(env, st->ref_obj_id, false));
>
> __mark_reg_not_init(env, st);
>
> @@ -9749,7 +9760,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> true, meta);
> break;
> case ARG_PTR_TO_DYNPTR:
> - err = process_dynptr_func(env, regno, insn_idx, arg_type, 0);
> + err = process_dynptr_func(env, regno, insn_idx, arg_type, meta->ref_obj_id);
> if (err)
> return err;
> break;
> @@ -10220,12 +10231,12 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int ref_ob
> *
> * This is the release function corresponding to acquire_reference(). Idempotent.
> */
> -static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
> +static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects)
> {
> struct bpf_verifier_state *vstate = env->cur_state;
> struct bpf_func_state *state;
> struct bpf_reg_state *reg;
> - int err;
> + int err, spi;
>
> err = release_reference_nomark(vstate, ref_obj_id);
> if (err)
> @@ -10236,6 +10247,19 @@ static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
> mark_reg_invalid(env, reg);
> }));
>
> + if (!objects)
> + return 0;
> +
> + bpf_for_each_spilled_reg(spi, state, reg, (1 << STACK_DYNPTR)) {
> + if (!reg)
> + continue;
> + if (!reg->dynptr.first_slot || reg->ref_obj_id != ref_obj_id)
> + continue;
> + err = __unmark_stack_slots_dynptr(env, state, spi, false);
> + if (err)
> + return err;
> + }
> +
> return 0;
> }
>
> @@ -11357,7 +11381,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> verbose(env, "verifier internal error: CONST_PTR_TO_DYNPTR cannot be released\n");
> return -EFAULT;
> }
> - err = unmark_stack_slots_dynptr(env, ®s[meta.release_regno]);
> + err = unmark_stack_slots_dynptr(env, ®s[meta.release_regno], true);
> } else if (func_id == BPF_FUNC_kptr_xchg && meta.ref_obj_id) {
> u32 ref_obj_id = meta.ref_obj_id;
> bool in_rcu = in_rcu_cs(env);
> @@ -11379,7 +11403,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> }));
> }
> } else if (meta.ref_obj_id) {
> - err = release_reference(env, meta.ref_obj_id);
> + err = release_reference(env, meta.ref_obj_id, true);
> } else if (register_is_null(®s[meta.release_regno])) {
> /* meta.ref_obj_id can only be 0 if register that is meant to be
> * released is NULL, which must be > R0.
> @@ -12974,6 +12998,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> meta->ref_obj_id = reg->ref_obj_id;
> if (is_kfunc_release(meta))
> meta->release_regno = regno;
> + meta->ref_obj_cnt++;
Does it make sense to convert other "more than one arg with
ref_obj_id" checks to using ref_obj_cnt to make it more consistent?
Thanks for fixing the bug!
Amery
> }
>
> ref_t = btf_type_skip_modifiers(btf, t->type, &ref_id);
> @@ -13100,13 +13125,19 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> case KF_ARG_PTR_TO_DYNPTR:
> {
> enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> - int clone_ref_obj_id = 0;
> + int clone_ref_obj_id = meta->ref_obj_id;
>
> if (reg->type == CONST_PTR_TO_DYNPTR)
> dynptr_arg_type |= MEM_RDONLY;
>
> - if (is_kfunc_arg_uninit(btf, &args[i]))
> + if (is_kfunc_arg_uninit(btf, &args[i])) {
> dynptr_arg_type |= MEM_UNINIT;
> + /* It's confusing if dynptr constructor takes multiple referenced arguments. */
> + if (meta->ref_obj_cnt > 1) {
> + verbose(env, "verifier internal error: multiple referenced arguments\n");
> + return -EFAULT;
> + }
> + }
>
> if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> dynptr_arg_type |= DYNPTR_TYPE_SKB;
> @@ -13582,7 +13613,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> * PTR_TO_BTF_ID in bpf_kfunc_arg_meta, do the release now.
> */
> if (meta.release_regno) {
> - err = release_reference(env, regs[meta.release_regno].ref_obj_id);
> + err = release_reference(env, regs[meta.release_regno].ref_obj_id, true);
> if (err) {
> verbose(env, "kfunc %s#%d reference has not been acquired before\n",
> func_name, meta.func_id);
> @@ -13603,7 +13634,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> return err;
> }
>
> - err = release_reference(env, release_ref_obj_id);
> + err = release_reference(env, release_ref_obj_id, true);
> if (err) {
> verbose(env, "kfunc %s#%d reference has not been acquired before\n",
> func_name, meta.func_id);
> @@ -13803,11 +13834,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> return -EFAULT;
> }
> regs[BPF_REG_0].dynptr_id = meta.initialized_dynptr.id;
> -
> - /* we don't need to set BPF_REG_0's ref obj id
> - * because packet slices are not refcounted (see
> - * dynptr_type_refcounted)
> - */
> + regs[BPF_REG_0].ref_obj_id = meta.initialized_dynptr.ref_obj_id;
> } else {
> verbose(env, "kernel function %s unhandled dynamic return type\n",
> meta.func_name);
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects
2025-04-14 19:58 ` Amery Hung
@ 2025-04-17 19:53 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-17 19:53 UTC (permalink / raw)
To: Amery Hung
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
On Mon, 14 Apr 2025 at 21:58, Amery Hung <ameryhung@gmail.com> wrote:
>
> On Mon, Apr 14, 2025 at 9:14 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > Ensure that for dynptr constructors (MEM_UNINIT arg_type) taking a
> > referenced object (ref_obj_id) as their memory source, we set the
> > ref_obj_id of the dynptr appropriately as well. This allows us to
> > tie the lifetime of the dynptr to its source and properly invalidate
> > it when the source object is released.
> >
> > For helpers, we don't have such cases yet as bpf_dynptr_from_mem is
> > not permitted for anything other than PTR_TO_MAP_VALUE, but still pass
> > meta->ref_obj_id as clone_ref_obj_id in case this is relaxed in future.
> > Since they are ossified we know dynptr_from_mem is the only relevant
> > helper and takes one memory argument, so we know the meta->ref_obj_id if
> > non-zero will belong to it.
> >
> > For kfuncs, make sure for constructors, only 1 ref_obj_id argument is
> > seen, as more than one can be ambiguous in terms of ref_obj_id transfer.
> > Since more kfuncs can be added with possibly multiple memory arguments,
> > make sure meta->ref_obj_id reassignment won't cause incorrect lifetime
> > analysis in the future using ref_obj_cnt logic. Set this ref_obj_id as
> > the clone_ref_obj_id, so that it is transferred to the spilled_ptr stack
> > slot register state.
> >
> > Add support to unmark_stack_slots_dynptr to not descend to its child
> > slices (using bool slice parameter) so that we don't have circular calls
> > when invoking release_reference. With this logic in place, we may have
> > the following object relationships:
> > +-- slice 1 (ref=1)
> > source (ref=1) --> dynptr (ref=1) --|-- slice 2 (ref=1)
> > +-- slice 3 (ref=1)
> >
> > Destroying dynptr prunes the dynptr and all its children slices, but
> > does not affect the source. Releasing the source will effectively prune
> > the entire ownership tree. Dynptr clones with same ref=1 will be
> > parallel in the ownership tree.
> >
> > +-- orig dptr (ref=1) --> slice 1 (ref=1)
> > source (ref=1) --|-- clone dptr (ref=1) --> slice 2 (ref=1)
> > +-- clone dptr (ref=1) --> slice 3 (ref=1)
> >
> > In such a case, only child slices of the dynptr clone being destroyed
> > are invalidated. Likewise, if the source object goes away, the whole
> > tree ends up getting pruned.
> >
> > Cc: Amery Hung <ameryhung@gmail.com>
> > Fixes: 81bb1c2c3e8e ("bpf: net_sched: Add basic bpf qdisc kfuncs")
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > kernel/bpf/verifier.c | 81 ++++++++++++++++++++++++++++---------------
> > 1 file changed, 54 insertions(+), 27 deletions(-)
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 54c6953a8b84..a62dfab9aea6 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -198,7 +198,7 @@ struct bpf_verifier_stack_elem {
> >
> > static int acquire_reference(struct bpf_verifier_env *env, int insn_idx);
> > static int release_reference_nomark(struct bpf_verifier_state *state, int ref_obj_id);
> > -static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
> > +static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects);
> > static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
> > static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env);
> > static int ref_set_non_owning(struct bpf_verifier_env *env,
> > @@ -299,6 +299,7 @@ struct bpf_kfunc_call_arg_meta {
> > const char *func_name;
> > /* Out parameters */
> > u32 ref_obj_id;
> > + u32 ref_obj_cnt;
> > u8 release_regno;
> > bool r0_rdonly;
> > u32 ret_btf_id;
> > @@ -759,7 +760,7 @@ static int mark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_
> > mark_dynptr_stack_regs(env, &state->stack[spi].spilled_ptr,
> > &state->stack[spi - 1].spilled_ptr, type);
> >
> > - if (dynptr_type_refcounted(type)) {
> > + if (dynptr_type_refcounted(type) || clone_ref_obj_id) {
> > /* The id is used to track proper releasing */
> > int id;
> >
> > @@ -818,22 +819,19 @@ static void invalidate_dynptr(struct bpf_verifier_env *env, struct bpf_func_stat
> > state->stack[spi - 1].spilled_ptr.live |= REG_LIVE_WRITTEN;
> > }
> >
> > -static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> > +static int __unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_func_state *state,
> > + int spi, bool slice)
> > {
> > - struct bpf_func_state *state = func(env, reg);
> > - int spi, ref_obj_id, i;
> > + u32 ref_obj_id;
> > + int i;
> >
> > - spi = dynptr_get_spi(env, reg);
> > - if (spi < 0)
> > - return spi;
> > + ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
> >
> > - if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type)) {
> > + if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type) && !ref_obj_id) {
> > invalidate_dynptr(env, state, spi);
> > return 0;
> > }
> >
> > - ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
> > -
> > /* If the dynptr has a ref_obj_id, then we need to invalidate
> > * two things:
> > *
> > @@ -842,7 +840,8 @@ static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_re
> > */
> >
> > /* Invalidate any slices associated with this dynptr */
> > - WARN_ON_ONCE(release_reference(env, ref_obj_id));
> > + if (slice)
> > + WARN_ON_ONCE(release_reference(env, ref_obj_id, false));
>
> I think the slice argument might not be needed if we do the following instead:
> if (dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type))
> WARN_ON_ONCE(release_reference(env, ref_obj_id, false));
>
> When bpf_kfree_skb():
> -> release_reference(..., object = true)
> -> __unmark_stack_slots_dynptr()
> -> not calling into release_reference() since
> dynptr_type_refcounted() returns false
>
> When bpf_ringbuf_submit():
> -> unmark_stack_slots_dynptr()
> -> release_reference(..., object = false)
> -> not calling __unmark_stack_slots_dynptr() since object == false
>
> Am I missing anything?
That is true, I will probably try to rework it the way Andrii suggested.
>
> >
> > /* Invalidate any dynptr clones */
> > for (i = 1; i < state->allocated_stack / BPF_REG_SIZE; i++) {
> > @@ -864,6 +863,18 @@ static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_re
> > return 0;
> > }
> >
> > +static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_state *reg, bool slice)
> > +{
> > + struct bpf_func_state *state = func(env, reg);
> > + int spi;
> > +
> > + spi = dynptr_get_spi(env, reg);
> > + if (spi < 0)
> > + return spi;
> > +
> > + return __unmark_stack_slots_dynptr(env, state, spi, slice);
> > +}
> > +
> > static void __mark_reg_unknown(const struct bpf_verifier_env *env,
> > struct bpf_reg_state *reg);
> >
> > @@ -1075,7 +1086,7 @@ static int unmark_stack_slots_iter(struct bpf_verifier_env *env,
> > struct bpf_reg_state *st = &slot->spilled_ptr;
> >
> > if (i == 0)
> > - WARN_ON_ONCE(release_reference(env, st->ref_obj_id));
> > + WARN_ON_ONCE(release_reference(env, st->ref_obj_id, false));
> >
> > __mark_reg_not_init(env, st);
> >
> > @@ -9749,7 +9760,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
> > true, meta);
> > break;
> > case ARG_PTR_TO_DYNPTR:
> > - err = process_dynptr_func(env, regno, insn_idx, arg_type, 0);
> > + err = process_dynptr_func(env, regno, insn_idx, arg_type, meta->ref_obj_id);
> > if (err)
> > return err;
> > break;
> > @@ -10220,12 +10231,12 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int ref_ob
> > *
> > * This is the release function corresponding to acquire_reference(). Idempotent.
> > */
> > -static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
> > +static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects)
> > {
> > struct bpf_verifier_state *vstate = env->cur_state;
> > struct bpf_func_state *state;
> > struct bpf_reg_state *reg;
> > - int err;
> > + int err, spi;
> >
> > err = release_reference_nomark(vstate, ref_obj_id);
> > if (err)
> > @@ -10236,6 +10247,19 @@ static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
> > mark_reg_invalid(env, reg);
> > }));
> >
> > + if (!objects)
> > + return 0;
> > +
> > + bpf_for_each_spilled_reg(spi, state, reg, (1 << STACK_DYNPTR)) {
> > + if (!reg)
> > + continue;
> > + if (!reg->dynptr.first_slot || reg->ref_obj_id != ref_obj_id)
> > + continue;
> > + err = __unmark_stack_slots_dynptr(env, state, spi, false);
> > + if (err)
> > + return err;
> > + }
> > +
> > return 0;
> > }
> >
> > @@ -11357,7 +11381,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > verbose(env, "verifier internal error: CONST_PTR_TO_DYNPTR cannot be released\n");
> > return -EFAULT;
> > }
> > - err = unmark_stack_slots_dynptr(env, ®s[meta.release_regno]);
> > + err = unmark_stack_slots_dynptr(env, ®s[meta.release_regno], true);
> > } else if (func_id == BPF_FUNC_kptr_xchg && meta.ref_obj_id) {
> > u32 ref_obj_id = meta.ref_obj_id;
> > bool in_rcu = in_rcu_cs(env);
> > @@ -11379,7 +11403,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > }));
> > }
> > } else if (meta.ref_obj_id) {
> > - err = release_reference(env, meta.ref_obj_id);
> > + err = release_reference(env, meta.ref_obj_id, true);
> > } else if (register_is_null(®s[meta.release_regno])) {
> > /* meta.ref_obj_id can only be 0 if register that is meant to be
> > * released is NULL, which must be > R0.
> > @@ -12974,6 +12998,7 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > meta->ref_obj_id = reg->ref_obj_id;
> > if (is_kfunc_release(meta))
> > meta->release_regno = regno;
> > + meta->ref_obj_cnt++;
>
> Does it make sense to convert other "more than one arg with
> ref_obj_id" checks to using ref_obj_cnt to make it more consistent?
Ack, will change.
>
> Thanks for fixing the bug!
> Amery
>
> > }
> >
> > ref_t = btf_type_skip_modifiers(btf, t->type, &ref_id);
> > @@ -13100,13 +13125,19 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > case KF_ARG_PTR_TO_DYNPTR:
> > {
> > enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > - int clone_ref_obj_id = 0;
> > + int clone_ref_obj_id = meta->ref_obj_id;
> >
> > if (reg->type == CONST_PTR_TO_DYNPTR)
> > dynptr_arg_type |= MEM_RDONLY;
> >
> > - if (is_kfunc_arg_uninit(btf, &args[i]))
> > + if (is_kfunc_arg_uninit(btf, &args[i])) {
> > dynptr_arg_type |= MEM_UNINIT;
> > + /* It's confusing if dynptr constructor takes multiple referenced arguments. */
> > + if (meta->ref_obj_cnt > 1) {
> > + verbose(env, "verifier internal error: multiple referenced arguments\n");
> > + return -EFAULT;
> > + }
> > + }
> >
> > if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > dynptr_arg_type |= DYNPTR_TYPE_SKB;
> > @@ -13582,7 +13613,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > * PTR_TO_BTF_ID in bpf_kfunc_arg_meta, do the release now.
> > */
> > if (meta.release_regno) {
> > - err = release_reference(env, regs[meta.release_regno].ref_obj_id);
> > + err = release_reference(env, regs[meta.release_regno].ref_obj_id, true);
> > if (err) {
> > verbose(env, "kfunc %s#%d reference has not been acquired before\n",
> > func_name, meta.func_id);
> > @@ -13603,7 +13634,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > return err;
> > }
> >
> > - err = release_reference(env, release_ref_obj_id);
> > + err = release_reference(env, release_ref_obj_id, true);
> > if (err) {
> > verbose(env, "kfunc %s#%d reference has not been acquired before\n",
> > func_name, meta.func_id);
> > @@ -13803,11 +13834,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > return -EFAULT;
> > }
> > regs[BPF_REG_0].dynptr_id = meta.initialized_dynptr.id;
> > -
> > - /* we don't need to set BPF_REG_0's ref obj id
> > - * because packet slices are not refcounted (see
> > - * dynptr_type_refcounted)
> > - */
> > + regs[BPF_REG_0].ref_obj_id = meta.initialized_dynptr.ref_obj_id;
> > } else {
> > verbose(env, "kernel function %s unhandled dynamic return type\n",
> > meta.func_name);
> > --
> > 2.47.1
> >
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects Kumar Kartikeya Dwivedi
2025-04-14 19:58 ` Amery Hung
@ 2025-04-16 5:52 ` Eduard Zingerman
2025-04-17 19:52 ` Kumar Kartikeya Dwivedi
2025-04-16 21:04 ` Andrii Nakryiko
2 siblings, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-16 5:52 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Amery Hung, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden,
kkd, kernel-team
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
[...]
> @@ -818,22 +819,19 @@ static void invalidate_dynptr(struct bpf_verifier_env *env, struct bpf_func_stat
> state->stack[spi - 1].spilled_ptr.live |= REG_LIVE_WRITTEN;
> }
>
> -static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> +static int __unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_func_state *state,
> + int spi, bool slice)
> {
> - struct bpf_func_state *state = func(env, reg);
> - int spi, ref_obj_id, i;
> + u32 ref_obj_id;
> + int i;
>
> - spi = dynptr_get_spi(env, reg);
> - if (spi < 0)
> - return spi;
> + ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
>
> - if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type)) {
> + if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type) && !ref_obj_id) {
If dynptr_type_refcounted is true, does this mean that ref_obj_id is set?
If it does, the check could be simplified to just `if (!ref_obj_id)`.
> invalidate_dynptr(env, state, spi);
> return 0;
> }
>
> - ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
> -
> /* If the dynptr has a ref_obj_id, then we need to invalidate
> * two things:
> *
[...]
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects
2025-04-16 5:52 ` Eduard Zingerman
@ 2025-04-17 19:52 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-17 19:52 UTC (permalink / raw)
To: Eduard Zingerman
Cc: bpf, Amery Hung, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden,
kkd, kernel-team
On Wed, 16 Apr 2025 at 07:53, Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>
> [...]
>
> > @@ -818,22 +819,19 @@ static void invalidate_dynptr(struct bpf_verifier_env *env, struct bpf_func_stat
> > state->stack[spi - 1].spilled_ptr.live |= REG_LIVE_WRITTEN;
> > }
> >
> > -static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> > +static int __unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_func_state *state,
> > + int spi, bool slice)
> > {
> > - struct bpf_func_state *state = func(env, reg);
> > - int spi, ref_obj_id, i;
> > + u32 ref_obj_id;
> > + int i;
> >
> > - spi = dynptr_get_spi(env, reg);
> > - if (spi < 0)
> > - return spi;
> > + ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
> >
> > - if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type)) {
> > + if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type) && !ref_obj_id) {
>
> If dynptr_type_refcounted is true, does this mean that ref_obj_id is set?
> If it does, the check could be simplified to just `if (!ref_obj_id)`.
Yes, I will change in non-RFC v1.
>
> > invalidate_dynptr(env, state, spi);
> > return 0;
> > }
> >
> > - ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
> > -
> > /* If the dynptr has a ref_obj_id, then we need to invalidate
> > * two things:
> > *
>
> [...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects Kumar Kartikeya Dwivedi
2025-04-14 19:58 ` Amery Hung
2025-04-16 5:52 ` Eduard Zingerman
@ 2025-04-16 21:04 ` Andrii Nakryiko
2025-04-17 19:51 ` Kumar Kartikeya Dwivedi
2 siblings, 1 reply; 44+ messages in thread
From: Andrii Nakryiko @ 2025-04-16 21:04 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Amery Hung, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Emil Tsalapatis, Barret Rhoden, kkd, kernel-team
On Mon, Apr 14, 2025 at 9:14 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> Ensure that for dynptr constructors (MEM_UNINIT arg_type) taking a
> referenced object (ref_obj_id) as their memory source, we set the
> ref_obj_id of the dynptr appropriately as well. This allows us to
> tie the lifetime of the dynptr to its source and properly invalidate
> it when the source object is released.
>
> For helpers, we don't have such cases yet as bpf_dynptr_from_mem is
> not permitted for anything other than PTR_TO_MAP_VALUE, but still pass
> meta->ref_obj_id as clone_ref_obj_id in case this is relaxed in future.
> Since they are ossified we know dynptr_from_mem is the only relevant
> helper and takes one memory argument, so we know the meta->ref_obj_id if
> non-zero will belong to it.
>
> For kfuncs, make sure for constructors, only 1 ref_obj_id argument is
> seen, as more than one can be ambiguous in terms of ref_obj_id transfer.
> Since more kfuncs can be added with possibly multiple memory arguments,
> make sure meta->ref_obj_id reassignment won't cause incorrect lifetime
> analysis in the future using ref_obj_cnt logic. Set this ref_obj_id as
> the clone_ref_obj_id, so that it is transferred to the spilled_ptr stack
> slot register state.
>
> Add support to unmark_stack_slots_dynptr to not descend to its child
> slices (using bool slice parameter) so that we don't have circular calls
> when invoking release_reference. With this logic in place, we may have
> the following object relationships:
> +-- slice 1 (ref=1)
> source (ref=1) --> dynptr (ref=1) --|-- slice 2 (ref=1)
> +-- slice 3 (ref=1)
>
> Destroying dynptr prunes the dynptr and all its children slices, but
> does not affect the source. Releasing the source will effectively prune
> the entire ownership tree. Dynptr clones with same ref=1 will be
> parallel in the ownership tree.
>
> +-- orig dptr (ref=1) --> slice 1 (ref=1)
> source (ref=1) --|-- clone dptr (ref=1) --> slice 2 (ref=1)
> +-- clone dptr (ref=1) --> slice 3 (ref=1)
>
> In such a case, only child slices of the dynptr clone being destroyed
> are invalidated. Likewise, if the source object goes away, the whole
> tree ends up getting pruned.
>
> Cc: Amery Hung <ameryhung@gmail.com>
> Fixes: 81bb1c2c3e8e ("bpf: net_sched: Add basic bpf qdisc kfuncs")
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> kernel/bpf/verifier.c | 81 ++++++++++++++++++++++++++++---------------
> 1 file changed, 54 insertions(+), 27 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 54c6953a8b84..a62dfab9aea6 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -198,7 +198,7 @@ struct bpf_verifier_stack_elem {
>
> static int acquire_reference(struct bpf_verifier_env *env, int insn_idx);
> static int release_reference_nomark(struct bpf_verifier_state *state, int ref_obj_id);
> -static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
> +static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects);
> static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
> static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env);
> static int ref_set_non_owning(struct bpf_verifier_env *env,
> @@ -299,6 +299,7 @@ struct bpf_kfunc_call_arg_meta {
> const char *func_name;
> /* Out parameters */
> u32 ref_obj_id;
> + u32 ref_obj_cnt;
> u8 release_regno;
> bool r0_rdonly;
> u32 ret_btf_id;
> @@ -759,7 +760,7 @@ static int mark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_
> mark_dynptr_stack_regs(env, &state->stack[spi].spilled_ptr,
> &state->stack[spi - 1].spilled_ptr, type);
>
> - if (dynptr_type_refcounted(type)) {
> + if (dynptr_type_refcounted(type) || clone_ref_obj_id) {
it would be a bug to have both conditions true, right? Should we
express this a bit more explicitly?
maybe something like
int id = 0;
if (dynptr_type_refcounted(type)) {
WARN_ON_ONCE(clone_ref_obj_id);
id = acquire_reference(env, insn_idx);
}
if (clone_ref_obj_id)
id = clone_ref_obj_id;
state->stack[spi].spilled_ptr.ref_obj_id = id;
state->stack[spi - 1].spilled_ptr.ref_obj_id = id;
state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN;
state->stack[spi - 1].spilled_ptr.live |= REG_LIVE_WRITTEN;
?
> /* The id is used to track proper releasing */
> int id;
>
> @@ -818,22 +819,19 @@ static void invalidate_dynptr(struct bpf_verifier_env *env, struct bpf_func_stat
> state->stack[spi - 1].spilled_ptr.live |= REG_LIVE_WRITTEN;
> }
>
> -static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> +static int __unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_func_state *state,
> + int spi, bool slice)
> {
> - struct bpf_func_state *state = func(env, reg);
> - int spi, ref_obj_id, i;
> + u32 ref_obj_id;
> + int i;
>
> - spi = dynptr_get_spi(env, reg);
> - if (spi < 0)
> - return spi;
> + ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
>
> - if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type)) {
> + if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type) && !ref_obj_id) {
Eduard already pointed out that `!ref_obj_id` condition is enough now
> invalidate_dynptr(env, state, spi);
> return 0;
> }
>
> - ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
> -
> /* If the dynptr has a ref_obj_id, then we need to invalidate
> * two things:
> *
> @@ -842,7 +840,8 @@ static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_re
> */
>
> /* Invalidate any slices associated with this dynptr */
> - WARN_ON_ONCE(release_reference(env, ref_obj_id));
> + if (slice)
> + WARN_ON_ONCE(release_reference(env, ref_obj_id, false));
tbh, this bool flag to release_reference to prevent cycles feels like
a hack and that we are not solving this properly...
also, changes in this patch still won't properly support the case
where we create dynptr -> slice -> dynptr -> slice -> ... chains,
right? Any dynptr in the chain will cause "release" of any other
dynptr, including parents, right?
>
> /* Invalidate any dynptr clones */
> for (i = 1; i < state->allocated_stack / BPF_REG_SIZE; i++) {
[...]
> @@ -10220,12 +10231,12 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int ref_ob
> *
> * This is the release function corresponding to acquire_reference(). Idempotent.
> */
> -static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
> +static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects)
hm, "objects" really doesn't describe anything...
> {
> struct bpf_verifier_state *vstate = env->cur_state;
> struct bpf_func_state *state;
> struct bpf_reg_state *reg;
> - int err;
> + int err, spi;
>
> err = release_reference_nomark(vstate, ref_obj_id);
> if (err)
> @@ -10236,6 +10247,19 @@ static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
> mark_reg_invalid(env, reg);
> }));
>
> + if (!objects)
> + return 0;
instead of a bool flag into release_reference, shouldn't the below
logic be just a separate function?
> +
> + bpf_for_each_spilled_reg(spi, state, reg, (1 << STACK_DYNPTR)) {
> + if (!reg)
> + continue;
> + if (!reg->dynptr.first_slot || reg->ref_obj_id != ref_obj_id)
> + continue;
> + err = __unmark_stack_slots_dynptr(env, state, spi, false);
> + if (err)
> + return err;
> + }
> +
> return 0;
> }
>
[...]
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects
2025-04-16 21:04 ` Andrii Nakryiko
@ 2025-04-17 19:51 ` Kumar Kartikeya Dwivedi
2025-04-22 21:09 ` Andrii Nakryiko
0 siblings, 1 reply; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-17 19:51 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Amery Hung, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Emil Tsalapatis, Barret Rhoden, kkd, kernel-team
On Wed, 16 Apr 2025 at 23:04, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Apr 14, 2025 at 9:14 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > Ensure that for dynptr constructors (MEM_UNINIT arg_type) taking a
> > referenced object (ref_obj_id) as their memory source, we set the
> > ref_obj_id of the dynptr appropriately as well. This allows us to
> > tie the lifetime of the dynptr to its source and properly invalidate
> > it when the source object is released.
> >
> > For helpers, we don't have such cases yet as bpf_dynptr_from_mem is
> > not permitted for anything other than PTR_TO_MAP_VALUE, but still pass
> > meta->ref_obj_id as clone_ref_obj_id in case this is relaxed in future.
> > Since they are ossified we know dynptr_from_mem is the only relevant
> > helper and takes one memory argument, so we know the meta->ref_obj_id if
> > non-zero will belong to it.
> >
> > For kfuncs, make sure for constructors, only 1 ref_obj_id argument is
> > seen, as more than one can be ambiguous in terms of ref_obj_id transfer.
> > Since more kfuncs can be added with possibly multiple memory arguments,
> > make sure meta->ref_obj_id reassignment won't cause incorrect lifetime
> > analysis in the future using ref_obj_cnt logic. Set this ref_obj_id as
> > the clone_ref_obj_id, so that it is transferred to the spilled_ptr stack
> > slot register state.
> >
> > Add support to unmark_stack_slots_dynptr to not descend to its child
> > slices (using bool slice parameter) so that we don't have circular calls
> > when invoking release_reference. With this logic in place, we may have
> > the following object relationships:
> > +-- slice 1 (ref=1)
> > source (ref=1) --> dynptr (ref=1) --|-- slice 2 (ref=1)
> > +-- slice 3 (ref=1)
> >
> > Destroying dynptr prunes the dynptr and all its children slices, but
> > does not affect the source. Releasing the source will effectively prune
> > the entire ownership tree. Dynptr clones with same ref=1 will be
> > parallel in the ownership tree.
> >
> > +-- orig dptr (ref=1) --> slice 1 (ref=1)
> > source (ref=1) --|-- clone dptr (ref=1) --> slice 2 (ref=1)
> > +-- clone dptr (ref=1) --> slice 3 (ref=1)
> >
> > In such a case, only child slices of the dynptr clone being destroyed
> > are invalidated. Likewise, if the source object goes away, the whole
> > tree ends up getting pruned.
> >
> > Cc: Amery Hung <ameryhung@gmail.com>
> > Fixes: 81bb1c2c3e8e ("bpf: net_sched: Add basic bpf qdisc kfuncs")
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > kernel/bpf/verifier.c | 81 ++++++++++++++++++++++++++++---------------
> > 1 file changed, 54 insertions(+), 27 deletions(-)
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 54c6953a8b84..a62dfab9aea6 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -198,7 +198,7 @@ struct bpf_verifier_stack_elem {
> >
> > static int acquire_reference(struct bpf_verifier_env *env, int insn_idx);
> > static int release_reference_nomark(struct bpf_verifier_state *state, int ref_obj_id);
> > -static int release_reference(struct bpf_verifier_env *env, int ref_obj_id);
> > +static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects);
> > static void invalidate_non_owning_refs(struct bpf_verifier_env *env);
> > static bool in_rbtree_lock_required_cb(struct bpf_verifier_env *env);
> > static int ref_set_non_owning(struct bpf_verifier_env *env,
> > @@ -299,6 +299,7 @@ struct bpf_kfunc_call_arg_meta {
> > const char *func_name;
> > /* Out parameters */
> > u32 ref_obj_id;
> > + u32 ref_obj_cnt;
> > u8 release_regno;
> > bool r0_rdonly;
> > u32 ret_btf_id;
> > @@ -759,7 +760,7 @@ static int mark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_
> > mark_dynptr_stack_regs(env, &state->stack[spi].spilled_ptr,
> > &state->stack[spi - 1].spilled_ptr, type);
> >
> > - if (dynptr_type_refcounted(type)) {
> > + if (dynptr_type_refcounted(type) || clone_ref_obj_id) {
>
> it would be a bug to have both conditions true, right? Should we
> express this a bit more explicitly?
> maybe something like
>
> int id = 0;
>
> if (dynptr_type_refcounted(type)) {
> WARN_ON_ONCE(clone_ref_obj_id);
> id = acquire_reference(env, insn_idx);
> }
> if (clone_ref_obj_id)
> id = clone_ref_obj_id;
>
> state->stack[spi].spilled_ptr.ref_obj_id = id;
> state->stack[spi - 1].spilled_ptr.ref_obj_id = id;
> state->stack[spi].spilled_ptr.live |= REG_LIVE_WRITTEN;
> state->stack[spi - 1].spilled_ptr.live |= REG_LIVE_WRITTEN;
>
> ?
Yes.
>
> > /* The id is used to track proper releasing */
> > int id;
> >
> > @@ -818,22 +819,19 @@ static void invalidate_dynptr(struct bpf_verifier_env *env, struct bpf_func_stat
> > state->stack[spi - 1].spilled_ptr.live |= REG_LIVE_WRITTEN;
> > }
> >
> > -static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
> > +static int __unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_func_state *state,
> > + int spi, bool slice)
> > {
> > - struct bpf_func_state *state = func(env, reg);
> > - int spi, ref_obj_id, i;
> > + u32 ref_obj_id;
> > + int i;
> >
> > - spi = dynptr_get_spi(env, reg);
> > - if (spi < 0)
> > - return spi;
> > + ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
> >
> > - if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type)) {
> > + if (!dynptr_type_refcounted(state->stack[spi].spilled_ptr.dynptr.type) && !ref_obj_id) {
>
> Eduard already pointed out that `!ref_obj_id` condition is enough now
Ack, will change.
>
> > invalidate_dynptr(env, state, spi);
> > return 0;
> > }
> >
> > - ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
> > -
> > /* If the dynptr has a ref_obj_id, then we need to invalidate
> > * two things:
> > *
> > @@ -842,7 +840,8 @@ static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_re
> > */
> >
> > /* Invalidate any slices associated with this dynptr */
> > - WARN_ON_ONCE(release_reference(env, ref_obj_id));
> > + if (slice)
> > + WARN_ON_ONCE(release_reference(env, ref_obj_id, false));
>
> tbh, this bool flag to release_reference to prevent cycles feels like
> a hack and that we are not solving this properly...
>
> also, changes in this patch still won't properly support the case
> where we create dynptr -> slice -> dynptr -> slice -> ... chains,
> right? Any dynptr in the chain will cause "release" of any other
> dynptr, including parents, right?
>
If they share the same ref_obj_id, yes.
Also, can we construct dynptr from referenced slices? I don't think
anything except map values work right now.
In the chain example, how will we invalidate a dynptr in the later
chain? Like by overwriting it? Just thinking what and how to test
this.
Regardless, it is a valid point.
I can choose to introduce a ref_level so that we can track hierarchies
and only prune things at the same or level below.
Any opinions on how you think this should be done?
> >
> > /* Invalidate any dynptr clones */
> > for (i = 1; i < state->allocated_stack / BPF_REG_SIZE; i++) {
>
> [...]
>
> > @@ -10220,12 +10231,12 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int ref_ob
> > *
> > * This is the release function corresponding to acquire_reference(). Idempotent.
> > */
> > -static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
> > +static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects)
>
> hm, "objects" really doesn't describe anything...
>
> > {
> > struct bpf_verifier_state *vstate = env->cur_state;
> > struct bpf_func_state *state;
> > struct bpf_reg_state *reg;
> > - int err;
> > + int err, spi;
> >
> > err = release_reference_nomark(vstate, ref_obj_id);
> > if (err)
> > @@ -10236,6 +10247,19 @@ static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
> > mark_reg_invalid(env, reg);
> > }));
> >
> > + if (!objects)
> > + return 0;
>
> instead of a bool flag into release_reference, shouldn't the below
> logic be just a separate function?
Yes, that can make it clearer. It may be unnecessary if the function
knows which level of the ownership hierarchy to prune correctly, rn
the flag is partially there because we can call recursively into it
from unmark_stack_slots_dynptr.
>
> > +
> > + bpf_for_each_spilled_reg(spi, state, reg, (1 << STACK_DYNPTR)) {
> > + if (!reg)
> > + continue;
> > + if (!reg->dynptr.first_slot || reg->ref_obj_id != ref_obj_id)
> > + continue;
> > + err = __unmark_stack_slots_dynptr(env, state, spi, false);
> > + if (err)
> > + return err;
> > + }
> > +
> > return 0;
> > }
> >
>
> [...]
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects
2025-04-17 19:51 ` Kumar Kartikeya Dwivedi
@ 2025-04-22 21:09 ` Andrii Nakryiko
0 siblings, 0 replies; 44+ messages in thread
From: Andrii Nakryiko @ 2025-04-22 21:09 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Amery Hung, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Emil Tsalapatis, Barret Rhoden, kkd, kernel-team
On Thu, Apr 17, 2025 at 12:52 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, 16 Apr 2025 at 23:04, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Apr 14, 2025 at 9:14 AM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > Ensure that for dynptr constructors (MEM_UNINIT arg_type) taking a
> > > referenced object (ref_obj_id) as their memory source, we set the
> > > ref_obj_id of the dynptr appropriately as well. This allows us to
> > > tie the lifetime of the dynptr to its source and properly invalidate
> > > it when the source object is released.
> > >
> > > For helpers, we don't have such cases yet as bpf_dynptr_from_mem is
> > > not permitted for anything other than PTR_TO_MAP_VALUE, but still pass
> > > meta->ref_obj_id as clone_ref_obj_id in case this is relaxed in future.
> > > Since they are ossified we know dynptr_from_mem is the only relevant
> > > helper and takes one memory argument, so we know the meta->ref_obj_id if
> > > non-zero will belong to it.
> > >
> > > For kfuncs, make sure for constructors, only 1 ref_obj_id argument is
> > > seen, as more than one can be ambiguous in terms of ref_obj_id transfer.
> > > Since more kfuncs can be added with possibly multiple memory arguments,
> > > make sure meta->ref_obj_id reassignment won't cause incorrect lifetime
> > > analysis in the future using ref_obj_cnt logic. Set this ref_obj_id as
> > > the clone_ref_obj_id, so that it is transferred to the spilled_ptr stack
> > > slot register state.
> > >
> > > Add support to unmark_stack_slots_dynptr to not descend to its child
> > > slices (using bool slice parameter) so that we don't have circular calls
> > > when invoking release_reference. With this logic in place, we may have
> > > the following object relationships:
> > > +-- slice 1 (ref=1)
> > > source (ref=1) --> dynptr (ref=1) --|-- slice 2 (ref=1)
> > > +-- slice 3 (ref=1)
> > >
> > > Destroying dynptr prunes the dynptr and all its children slices, but
> > > does not affect the source. Releasing the source will effectively prune
> > > the entire ownership tree. Dynptr clones with same ref=1 will be
> > > parallel in the ownership tree.
> > >
> > > +-- orig dptr (ref=1) --> slice 1 (ref=1)
> > > source (ref=1) --|-- clone dptr (ref=1) --> slice 2 (ref=1)
> > > +-- clone dptr (ref=1) --> slice 3 (ref=1)
> > >
> > > In such a case, only child slices of the dynptr clone being destroyed
> > > are invalidated. Likewise, if the source object goes away, the whole
> > > tree ends up getting pruned.
> > >
> > > Cc: Amery Hung <ameryhung@gmail.com>
> > > Fixes: 81bb1c2c3e8e ("bpf: net_sched: Add basic bpf qdisc kfuncs")
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > ---
> > > kernel/bpf/verifier.c | 81 ++++++++++++++++++++++++++++---------------
> > > 1 file changed, 54 insertions(+), 27 deletions(-)
> > >
[...]
> >
> > > invalidate_dynptr(env, state, spi);
> > > return 0;
> > > }
> > >
> > > - ref_obj_id = state->stack[spi].spilled_ptr.ref_obj_id;
> > > -
> > > /* If the dynptr has a ref_obj_id, then we need to invalidate
> > > * two things:
> > > *
> > > @@ -842,7 +840,8 @@ static int unmark_stack_slots_dynptr(struct bpf_verifier_env *env, struct bpf_re
> > > */
> > >
> > > /* Invalidate any slices associated with this dynptr */
> > > - WARN_ON_ONCE(release_reference(env, ref_obj_id));
> > > + if (slice)
> > > + WARN_ON_ONCE(release_reference(env, ref_obj_id, false));
> >
> > tbh, this bool flag to release_reference to prevent cycles feels like
> > a hack and that we are not solving this properly...
> >
> > also, changes in this patch still won't properly support the case
> > where we create dynptr -> slice -> dynptr -> slice -> ... chains,
> > right? Any dynptr in the chain will cause "release" of any other
> > dynptr, including parents, right?
> >
>
> If they share the same ref_obj_id, yes.
> Also, can we construct dynptr from referenced slices? I don't think
> anything except map values work right now.
Amery wanted to allow creating bpf_dynptr_from_mem() from any
PTR_TO_MEM, which would allow creating ringbuf_dynptr -> slice [ ->
mem_dynptr -> slice ]* chains.
> In the chain example, how will we invalidate a dynptr in the later
> chain? Like by overwriting it? Just thinking what and how to test
> this.
Yes, user can always just write over dynptr slots on the stack, which
doesn't even have to be intentional. mem dynptr is never explicitly
released, so technically you can just stop using it and use that stack
space for some other variables.
> Regardless, it is a valid point.
>
> I can choose to introduce a ref_level so that we can track hierarchies
> and only prune things at the same or level below.
> Any opinions on how you think this should be done?
I was thinking that we can form an (implicit) tree of ref objects
through their id and ref_obj_id properties. Each new dynptr gets its
own id and sets its ref_obj_id to whatever id of the resource we are
creating dynptr from (so if it's a slice, then I guess we'd need to
use slices's ref_obj_id, right?). And so you'd have:
ringbuf_dynptr (id=1) -> slice (id=2,ref_obj_id=1) ->
mem_dynptr(id=3,ref_obj_id=1) -> slice(id=4,**ref_obj_id=3**) ->
mem_dynptr(id=5,ref_obj_id=3)
And so if you overwrite mem_dynptr(id=3), then you invalidate
id=3,4,5; if you overwrite just mem_dynptr(id=5), only that one is
invalidated (and id=1,2,3,4 stay valid). If you happen to release the
original ringbuf_dynptr(id=1), the entire "tree" is invalidated, of
course.
>
> > >
> > > /* Invalidate any dynptr clones */
> > > for (i = 1; i < state->allocated_stack / BPF_REG_SIZE; i++) {
> >
> > [...]
> >
> > > @@ -10220,12 +10231,12 @@ static int release_reference_nomark(struct bpf_verifier_state *state, int ref_ob
> > > *
> > > * This is the release function corresponding to acquire_reference(). Idempotent.
> > > */
> > > -static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
> > > +static int release_reference(struct bpf_verifier_env *env, int ref_obj_id, bool objects)
> >
> > hm, "objects" really doesn't describe anything...
> >
> > > {
> > > struct bpf_verifier_state *vstate = env->cur_state;
> > > struct bpf_func_state *state;
> > > struct bpf_reg_state *reg;
> > > - int err;
> > > + int err, spi;
> > >
> > > err = release_reference_nomark(vstate, ref_obj_id);
> > > if (err)
> > > @@ -10236,6 +10247,19 @@ static int release_reference(struct bpf_verifier_env *env, int ref_obj_id)
> > > mark_reg_invalid(env, reg);
> > > }));
> > >
> > > + if (!objects)
> > > + return 0;
> >
> > instead of a bool flag into release_reference, shouldn't the below
> > logic be just a separate function?
>
> Yes, that can make it clearer. It may be unnecessary if the function
> knows which level of the ownership hierarchy to prune correctly, rn
> the flag is partially there because we can call recursively into it
> from unmark_stack_slots_dynptr.
>
> >
> > > +
> > > + bpf_for_each_spilled_reg(spi, state, reg, (1 << STACK_DYNPTR)) {
> > > + if (!reg)
> > > + continue;
> > > + if (!reg->dynptr.first_slot || reg->ref_obj_id != ref_obj_id)
> > > + continue;
> > > + err = __unmark_stack_slots_dynptr(env, state, spi, false);
> > > + if (err)
> > > + return err;
> > > + }
> > > +
> > > return 0;
> > > }
> > >
> >
> > [...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* [RFC PATCH bpf-next/net v1 02/13] bpf: Compare dynptr_id in regsafe
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-16 7:58 ` Eduard Zingerman
2025-04-16 21:04 ` Andrii Nakryiko
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 03/13] selftests/bpf: Convert dynptr_fail to use vmlinux.h Kumar Kartikeya Dwivedi
` (10 subsequent siblings)
12 siblings, 2 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Now that PTR_TO_MEM can be invalidated due to skb going away, we must
take care to be more careful in regsafe regarding state pruning. While
ref_obj_id comparison will ensure that incorrect pruning is prevented,
since we attach ref_obj_id of skb to the PTR_TO_MEM emanating from it,
it is nonetheless clearer to also compare the dynptr_id as well.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/verifier.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index a62dfab9aea6..7e09c4592038 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -18426,6 +18426,8 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
range_within(rold, rcur) &&
tnum_in(rold->var_off, rcur->var_off) &&
check_ids(rold->id, rcur->id, idmap) &&
+ (base_type(rold->type) == PTR_TO_MEM ?
+ check_ids(rold->dynptr_id, rcur->dynptr_id, idmap) : 1) &&
check_ids(rold->ref_obj_id, rcur->ref_obj_id, idmap);
case PTR_TO_PACKET_META:
case PTR_TO_PACKET:
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 02/13] bpf: Compare dynptr_id in regsafe
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 02/13] bpf: Compare dynptr_id in regsafe Kumar Kartikeya Dwivedi
@ 2025-04-16 7:58 ` Eduard Zingerman
2025-04-16 21:04 ` Andrii Nakryiko
1 sibling, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-16 7:58 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> Now that PTR_TO_MEM can be invalidated due to skb going away, we must
> take care to be more careful in regsafe regarding state pruning. While
> ref_obj_id comparison will ensure that incorrect pruning is prevented,
> since we attach ref_obj_id of skb to the PTR_TO_MEM emanating from it,
> it is nonetheless clearer to also compare the dynptr_id as well.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
[...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH bpf-next/net v1 02/13] bpf: Compare dynptr_id in regsafe
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 02/13] bpf: Compare dynptr_id in regsafe Kumar Kartikeya Dwivedi
2025-04-16 7:58 ` Eduard Zingerman
@ 2025-04-16 21:04 ` Andrii Nakryiko
2025-04-17 19:54 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 44+ messages in thread
From: Andrii Nakryiko @ 2025-04-16 21:04 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
On Mon, Apr 14, 2025 at 9:14 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> Now that PTR_TO_MEM can be invalidated due to skb going away, we must
> take care to be more careful in regsafe regarding state pruning. While
> ref_obj_id comparison will ensure that incorrect pruning is prevented,
> since we attach ref_obj_id of skb to the PTR_TO_MEM emanating from it,
> it is nonetheless clearer to also compare the dynptr_id as well.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> kernel/bpf/verifier.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index a62dfab9aea6..7e09c4592038 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -18426,6 +18426,8 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
> range_within(rold, rcur) &&
> tnum_in(rold->var_off, rcur->var_off) &&
> check_ids(rold->id, rcur->id, idmap) &&
> + (base_type(rold->type) == PTR_TO_MEM ?
> + check_ids(rold->dynptr_id, rcur->dynptr_id, idmap) : 1) &&
> check_ids(rold->ref_obj_id, rcur->ref_obj_id, idmap);
hm... shall we split out PTR_TO_MEM case instead of making this
not-so-simple condition even more not-so-simple? or (if people don't
like that idea), I'd rather have this special PTR_TO_MEM handling as a
separate if with return
> case PTR_TO_PACKET_META:
> case PTR_TO_PACKET:
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH bpf-next/net v1 02/13] bpf: Compare dynptr_id in regsafe
2025-04-16 21:04 ` Andrii Nakryiko
@ 2025-04-17 19:54 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-17 19:54 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
On Wed, 16 Apr 2025 at 23:04, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Apr 14, 2025 at 9:14 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > Now that PTR_TO_MEM can be invalidated due to skb going away, we must
> > take care to be more careful in regsafe regarding state pruning. While
> > ref_obj_id comparison will ensure that incorrect pruning is prevented,
> > since we attach ref_obj_id of skb to the PTR_TO_MEM emanating from it,
> > it is nonetheless clearer to also compare the dynptr_id as well.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > kernel/bpf/verifier.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index a62dfab9aea6..7e09c4592038 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -18426,6 +18426,8 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
> > range_within(rold, rcur) &&
> > tnum_in(rold->var_off, rcur->var_off) &&
> > check_ids(rold->id, rcur->id, idmap) &&
> > + (base_type(rold->type) == PTR_TO_MEM ?
> > + check_ids(rold->dynptr_id, rcur->dynptr_id, idmap) : 1) &&
> > check_ids(rold->ref_obj_id, rcur->ref_obj_id, idmap);
>
> hm... shall we split out PTR_TO_MEM case instead of making this
> not-so-simple condition even more not-so-simple? or (if people don't
> like that idea), I'd rather have this special PTR_TO_MEM handling as a
> separate if with return
I can split into a separate case.
>
>
> > case PTR_TO_PACKET_META:
> > case PTR_TO_PACKET:
> > --
> > 2.47.1
> >
^ permalink raw reply [flat|nested] 44+ messages in thread
* [RFC PATCH bpf-next/net v1 03/13] selftests/bpf: Convert dynptr_fail to use vmlinux.h
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 01/13] bpf: Tie dynptrs to referenced source objects Kumar Kartikeya Dwivedi
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 02/13] bpf: Compare dynptr_id in regsafe Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-16 8:41 ` Eduard Zingerman
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 04/13] selftests/bpf: Add tests for dynptr source object interaction Kumar Kartikeya Dwivedi
` (9 subsequent siblings)
12 siblings, 1 reply; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Let's make use of vmlinux.h in dynptr_fail.c. No functional change
intended, just a drive-by improvement.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
tools/testing/selftests/bpf/progs/dynptr_fail.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
index bd8f15229f5c..345e704e5346 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
@@ -1,13 +1,11 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2022 Facebook */
-
+#include <vmlinux.h>
#include <errno.h>
#include <string.h>
#include <stdbool.h>
-#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
-#include <linux/if_ether.h>
#include "bpf_misc.h"
#include "bpf_kfuncs.h"
@@ -46,7 +44,7 @@ struct {
__type(value, __u64);
} array_map4 SEC(".maps");
-struct sample {
+struct dynptr_sample {
int pid;
long value;
char comm[16];
@@ -95,7 +93,7 @@ __failure __msg("Unreleased reference id=4")
int ringbuf_missing_release2(void *ctx)
{
struct bpf_dynptr ptr1, ptr2;
- struct sample *sample;
+ struct dynptr_sample *sample;
bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(*sample), 0, &ptr1);
bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(*sample), 0, &ptr2);
@@ -173,7 +171,7 @@ __failure __msg("type=mem expected=ringbuf_mem")
int ringbuf_invalid_api(void *ctx)
{
struct bpf_dynptr ptr;
- struct sample *sample;
+ struct dynptr_sample *sample;
bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(*sample), 0, &ptr);
sample = bpf_dynptr_data(&ptr, 0, sizeof(*sample));
@@ -295,7 +293,7 @@ __failure __msg("invalid mem access 'scalar'")
int data_slice_use_after_release1(void *ctx)
{
struct bpf_dynptr ptr;
- struct sample *sample;
+ struct dynptr_sample *sample;
bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(*sample), 0, &ptr);
sample = bpf_dynptr_data(&ptr, 0, sizeof(*sample));
@@ -327,7 +325,7 @@ __failure __msg("invalid mem access 'scalar'")
int data_slice_use_after_release2(void *ctx)
{
struct bpf_dynptr ptr1, ptr2;
- struct sample *sample;
+ struct dynptr_sample *sample;
bpf_ringbuf_reserve_dynptr(&ringbuf, 64, 0, &ptr1);
bpf_ringbuf_reserve_dynptr(&ringbuf, sizeof(*sample), 0, &ptr2);
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 03/13] selftests/bpf: Convert dynptr_fail to use vmlinux.h
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 03/13] selftests/bpf: Convert dynptr_fail to use vmlinux.h Kumar Kartikeya Dwivedi
@ 2025-04-16 8:41 ` Eduard Zingerman
0 siblings, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-16 8:41 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> Let's make use of vmlinux.h in dynptr_fail.c. No functional change
> intended, just a drive-by improvement.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
[...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* [RFC PATCH bpf-next/net v1 04/13] selftests/bpf: Add tests for dynptr source object interaction
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
` (2 preceding siblings ...)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 03/13] selftests/bpf: Convert dynptr_fail to use vmlinux.h Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-16 8:40 ` Eduard Zingerman
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 05/13] locking/local_lock, mm: Replace localtry_ helpers with local_trylock_t type Kumar Kartikeya Dwivedi
` (8 subsequent siblings)
12 siblings, 1 reply; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Add a few tests to capture source object relationship with dynptr and
their slices and ensure invalidation of everything works correctly.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
.../testing/selftests/bpf/prog_tests/dynptr.c | 2 +
.../testing/selftests/bpf/progs/dynptr_fail.c | 1 +
.../selftests/bpf/progs/dynptr_fail_qdisc.c | 38 +++++++++++++++++++
3 files changed, 41 insertions(+)
create mode 100644 tools/testing/selftests/bpf/progs/dynptr_fail_qdisc.c
diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c
index e29cc16124c2..929355de1002 100644
--- a/tools/testing/selftests/bpf/prog_tests/dynptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c
@@ -4,6 +4,7 @@
#include <test_progs.h>
#include <network_helpers.h>
#include "dynptr_fail.skel.h"
+#include "dynptr_fail_qdisc.skel.h"
#include "dynptr_success.skel.h"
enum test_setup_type {
@@ -161,4 +162,5 @@ void test_dynptr(void)
}
RUN_TESTS(dynptr_fail);
+ RUN_TESTS(dynptr_fail_qdisc);
}
diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
index 345e704e5346..7c67797a5aac 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
@@ -8,6 +8,7 @@
#include <bpf/bpf_tracing.h>
#include "bpf_misc.h"
#include "bpf_kfuncs.h"
+#include "bpf_experimental.h"
char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail_qdisc.c b/tools/testing/selftests/bpf/progs/dynptr_fail_qdisc.c
new file mode 100644
index 000000000000..fc222213f572
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/dynptr_fail_qdisc.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+char _license[] SEC("license") = "GPL";
+
+/*
+ * Putting these tests in dynptr_fail.c causes other random tests to fail,
+ * keep them in their isolated CU.
+ */
+
+SEC("?struct_ops")
+__failure __msg("invalid mem access 'scalar'")
+int BPF_PROG(test_dynptr_source_release, struct sk_buff *skb,
+ struct Qdisc *sch, struct bpf_sk_buff_ptr *to_free)
+{
+ struct bpf_dynptr dptr, dptr2;
+ char buf[8], *data;
+
+ bpf_dynptr_from_skb((struct __sk_buff *)skb, 0, &dptr);
+ bpf_dynptr_read(buf, sizeof(buf), &dptr, 0, 0);
+ bpf_dynptr_clone(&dptr, &dptr2);
+ data = bpf_dynptr_slice(&dptr2, 0, buf, sizeof(buf));
+ bpf_qdisc_skb_drop(skb, to_free);
+ /* These reads/writes now succeed since dynptr is destroyed. */
+ *(char *)&dptr = *(char *)&dptr2;
+ return *data;
+}
+
+SEC("?.struct_ops")
+struct Qdisc_ops test_dynptr_qdisc = {
+ .enqueue = (void *)test_dynptr_source_release,
+ .id = "bpf_fq",
+};
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 04/13] selftests/bpf: Add tests for dynptr source object interaction
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 04/13] selftests/bpf: Add tests for dynptr source object interaction Kumar Kartikeya Dwivedi
@ 2025-04-16 8:40 ` Eduard Zingerman
2025-04-17 20:09 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-16 8:40 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> Add a few tests to capture source object relationship with dynptr and
> their slices and ensure invalidation of everything works correctly.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
It would be great to have a bit more test cases here,
following your nice diagram from patch #1:
+-- orig dptr (ref=1) --> slice 1 (ref=1)
source (ref=1) --|-- clone dptr (ref=1) --> slice 2 (ref=1)
+-- clone dptr (ref=1) --> slice 3 (ref=1)
In one test (probably extending the one in this patch):
- check that both orig and clone slices can be read at some point
- destroy orig or clone and check that one of the slices can
be read, while another can't.
In another test:
- destroy source and check that both orig and clone slices
are no longer accessible.
Also, is it possible to conjure a test case with two hierarchies?
E.g. source1(ref=1) and source2(ref=2)?
[...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH bpf-next/net v1 04/13] selftests/bpf: Add tests for dynptr source object interaction
2025-04-16 8:40 ` Eduard Zingerman
@ 2025-04-17 20:09 ` Kumar Kartikeya Dwivedi
2025-04-17 20:44 ` Eduard Zingerman
0 siblings, 1 reply; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-17 20:09 UTC (permalink / raw)
To: Eduard Zingerman
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
On Wed, 16 Apr 2025 at 10:40, Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>
> > Add a few tests to capture source object relationship with dynptr and
> > their slices and ensure invalidation of everything works correctly.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
>
> It would be great to have a bit more test cases here,
> following your nice diagram from patch #1:
>
> +-- orig dptr (ref=1) --> slice 1 (ref=1)
> source (ref=1) --|-- clone dptr (ref=1) --> slice 2 (ref=1)
> +-- clone dptr (ref=1) --> slice 3 (ref=1)
>
> In one test (probably extending the one in this patch):
> - check that both orig and clone slices can be read at some point
> - destroy orig or clone and check that one of the slices can
> be read, while another can't.
>
> In another test:
> - destroy source and check that both orig and clone slices
> are no longer accessible.
Agreed, will try and add more cases explicitly.
I struggled to have more than two callbacks automatically working with
RUN_TESTS and struct_ops, so I guess I will do explicit loading and
failure checks.
>
> Also, is it possible to conjure a test case with two hierarchies?
> E.g. source1(ref=1) and source2(ref=2)?
>
Should be possible, I'll add it.
> [...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH bpf-next/net v1 04/13] selftests/bpf: Add tests for dynptr source object interaction
2025-04-17 20:09 ` Kumar Kartikeya Dwivedi
@ 2025-04-17 20:44 ` Eduard Zingerman
2025-04-17 21:10 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-17 20:44 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
[...]
> I struggled to have more than two callbacks automatically working with
> RUN_TESTS and struct_ops, so I guess I will do explicit loading and
> failure checks.
Could you please clarify the use-case a bit?
In theory two RUN_TESTS should be independent.
[...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH bpf-next/net v1 04/13] selftests/bpf: Add tests for dynptr source object interaction
2025-04-17 20:44 ` Eduard Zingerman
@ 2025-04-17 21:10 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-17 21:10 UTC (permalink / raw)
To: Eduard Zingerman
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
On Thu, 17 Apr 2025 at 22:45, Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>
> [...]
>
> > I struggled to have more than two callbacks automatically working with
> > RUN_TESTS and struct_ops, so I guess I will do explicit loading and
> > failure checks.
>
> Could you please clarify the use-case a bit?
> In theory two RUN_TESTS should be independent.
I meant I see strange behavior when I have both .enqueue and .dequeue and
do RUN_TESTS, for the second callback, I see unexpected load success.
When keeping one of the callbacks, everything is ok.
I didn't dig further the other day before sending it out.
>
> [...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* [RFC PATCH bpf-next/net v1 05/13] locking/local_lock, mm: Replace localtry_ helpers with local_trylock_t type
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
` (3 preceding siblings ...)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 04/13] selftests/bpf: Add tests for dynptr source object interaction Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 06/13] bpf: Introduce bpf_dynptr_from_mem_slice Kumar Kartikeya Dwivedi
` (7 subsequent siblings)
12 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Vlastimil Babka, Sebastian Andrzej Siewior, Shakeel Butt,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
From: Alexei Starovoitov <ast@kernel.org>
Partially revert commit 0aaddfb06882 ("locking/local_lock: Introduce localtry_lock_t").
Remove localtry_*() helpers, since localtry_lock() name might
be misinterpreted as "try lock".
Introduce local_trylock[_irqsave]() helpers that only work
with newly introduced local_trylock_t type.
Note that attempt to use local_trylock[_irqsave]() with local_lock_t
will cause compilation failure.
Usage and behavior in !PREEMPT_RT:
local_lock_t lock; // sizeof(lock) == 0
local_lock(&lock); // preempt disable
local_lock_irqsave(&lock, ...); // irq save
if (local_trylock_irqsave(&lock, ...)) // compilation error
local_trylock_t lock; // sizeof(lock) == 4
local_lock(&lock); // preempt disable, acquired = 1
local_lock_irqsave(&lock, ...); // irq save, acquired = 1
if (local_trylock(&lock)) // if (!acquired) preempt disable, acquired = 1
if (local_trylock_irqsave(&lock, ...)) // if (!acquired) irq save, acquired = 1
The existing local_lock_*() macros can be used either with
local_lock_t or local_trylock_t.
With local_trylock_t they set acquired = 1 while local_unlock_*() clears it.
In !PREEMPT_RT local_lock_irqsave(local_lock_t *) disables interrupts
to protect critical section, but it doesn't prevent NMI, so the fully
reentrant code cannot use local_lock_irqsave(local_lock_t *) for
exclusive access.
The local_lock_irqsave(local_trylock_t *) helper disables interrupts
and sets acquired=1, so local_trylock_irqsave(local_trylock_t *) from
NMI attempting to acquire the same lock will return false.
In PREEMPT_RT local_lock_irqsave() maps to preemptible spin_lock().
Map local_trylock_irqsave() to preemptible spin_trylock().
When in hard IRQ or NMI return false right away, since
spin_trylock() is not safe due to explicit locking in the underneath
rt_spin_trylock() implementation. Removing this explicit locking and
attempting only "trylock" is undesired due to PI implications.
The local_trylock() without _irqsave can be used to avoid the cost of
disabling/enabling interrupts by only disabling preemption, so
local_trylock() in an interrupt attempting to acquire the same
lock will return false.
Note there is no need to use local_inc for acquired variable,
since it's a percpu variable with strict nesting scopes.
Note that guard(local_lock)(&lock) works only for "local_lock_t lock".
The patch also makes sure that local_lock_release(l) is called before
WRITE_ONCE(l->acquired, 0). Though IRQs are disabled at this point
the local_trylock() from NMI will succeed and local_lock_acquire(l)
will warn.
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Fixes: 0aaddfb06882 ("locking/local_lock: Introduce localtry_lock_t")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
---
include/linux/local_lock.h | 58 ++------
include/linux/local_lock_internal.h | 207 ++++++++++++----------------
mm/memcontrol.c | 39 +++---
3 files changed, 114 insertions(+), 190 deletions(-)
diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h
index 1a0bc35839e3..16a2ee4f8310 100644
--- a/include/linux/local_lock.h
+++ b/include/linux/local_lock.h
@@ -52,44 +52,23 @@
__local_unlock_irqrestore(lock, flags)
/**
- * localtry_lock_init - Runtime initialize a lock instance
- */
-#define localtry_lock_init(lock) __localtry_lock_init(lock)
-
-/**
- * localtry_lock - Acquire a per CPU local lock
- * @lock: The lock variable
- */
-#define localtry_lock(lock) __localtry_lock(lock)
-
-/**
- * localtry_lock_irq - Acquire a per CPU local lock and disable interrupts
- * @lock: The lock variable
- */
-#define localtry_lock_irq(lock) __localtry_lock_irq(lock)
-
-/**
- * localtry_lock_irqsave - Acquire a per CPU local lock, save and disable
- * interrupts
- * @lock: The lock variable
- * @flags: Storage for interrupt flags
+ * local_lock_init - Runtime initialize a lock instance
*/
-#define localtry_lock_irqsave(lock, flags) \
- __localtry_lock_irqsave(lock, flags)
+#define local_trylock_init(lock) __local_trylock_init(lock)
/**
- * localtry_trylock - Try to acquire a per CPU local lock.
+ * local_trylock - Try to acquire a per CPU local lock
* @lock: The lock variable
*
* The function can be used in any context such as NMI or HARDIRQ. Due to
* locking constrains it will _always_ fail to acquire the lock in NMI or
* HARDIRQ context on PREEMPT_RT.
*/
-#define localtry_trylock(lock) __localtry_trylock(lock)
+#define local_trylock(lock) __local_trylock(lock)
/**
- * localtry_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
- * interrupts if acquired
+ * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable
+ * interrupts if acquired
* @lock: The lock variable
* @flags: Storage for interrupt flags
*
@@ -97,29 +76,8 @@
* locking constrains it will _always_ fail to acquire the lock in NMI or
* HARDIRQ context on PREEMPT_RT.
*/
-#define localtry_trylock_irqsave(lock, flags) \
- __localtry_trylock_irqsave(lock, flags)
-
-/**
- * local_unlock - Release a per CPU local lock
- * @lock: The lock variable
- */
-#define localtry_unlock(lock) __localtry_unlock(lock)
-
-/**
- * local_unlock_irq - Release a per CPU local lock and enable interrupts
- * @lock: The lock variable
- */
-#define localtry_unlock_irq(lock) __localtry_unlock_irq(lock)
-
-/**
- * localtry_unlock_irqrestore - Release a per CPU local lock and restore
- * interrupt flags
- * @lock: The lock variable
- * @flags: Interrupt flags to restore
- */
-#define localtry_unlock_irqrestore(lock, flags) \
- __localtry_unlock_irqrestore(lock, flags)
+#define local_trylock_irqsave(lock, flags) \
+ __local_trylock_irqsave(lock, flags)
DEFINE_GUARD(local_lock, local_lock_t __percpu*,
local_lock(_T),
diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h
index 67bd13d142fa..bf2bf40d7b18 100644
--- a/include/linux/local_lock_internal.h
+++ b/include/linux/local_lock_internal.h
@@ -15,10 +15,11 @@ typedef struct {
#endif
} local_lock_t;
+/* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */
typedef struct {
local_lock_t llock;
- unsigned int acquired;
-} localtry_lock_t;
+ u8 acquired;
+} local_trylock_t;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
# define LOCAL_LOCK_DEBUG_INIT(lockname) \
@@ -29,6 +30,9 @@ typedef struct {
}, \
.owner = NULL,
+# define LOCAL_TRYLOCK_DEBUG_INIT(lockname) \
+ .llock = { LOCAL_LOCK_DEBUG_INIT((lockname).llock) },
+
static inline void local_lock_acquire(local_lock_t *l)
{
lock_map_acquire(&l->dep_map);
@@ -56,6 +60,7 @@ static inline void local_lock_debug_init(local_lock_t *l)
}
#else /* CONFIG_DEBUG_LOCK_ALLOC */
# define LOCAL_LOCK_DEBUG_INIT(lockname)
+# define LOCAL_TRYLOCK_DEBUG_INIT(lockname)
static inline void local_lock_acquire(local_lock_t *l) { }
static inline void local_trylock_acquire(local_lock_t *l) { }
static inline void local_lock_release(local_lock_t *l) { }
@@ -63,7 +68,7 @@ static inline void local_lock_debug_init(local_lock_t *l) { }
#endif /* !CONFIG_DEBUG_LOCK_ALLOC */
#define INIT_LOCAL_LOCK(lockname) { LOCAL_LOCK_DEBUG_INIT(lockname) }
-#define INIT_LOCALTRY_LOCK(lockname) { .llock = { LOCAL_LOCK_DEBUG_INIT(lockname.llock) }}
+#define INIT_LOCAL_TRYLOCK(lockname) { LOCAL_TRYLOCK_DEBUG_INIT(lockname) }
#define __local_lock_init(lock) \
do { \
@@ -76,6 +81,8 @@ do { \
local_lock_debug_init(lock); \
} while (0)
+#define __local_trylock_init(lock) __local_lock_init(lock.llock)
+
#define __spinlock_nested_bh_init(lock) \
do { \
static struct lock_class_key __key; \
@@ -87,149 +94,117 @@ do { \
local_lock_debug_init(lock); \
} while (0)
+#define __local_lock_acquire(lock) \
+ do { \
+ local_trylock_t *tl; \
+ local_lock_t *l; \
+ \
+ l = (local_lock_t *)this_cpu_ptr(lock); \
+ tl = (local_trylock_t *)l; \
+ _Generic((lock), \
+ local_trylock_t *: ({ \
+ lockdep_assert(tl->acquired == 0); \
+ WRITE_ONCE(tl->acquired, 1); \
+ }), \
+ default:(void)0); \
+ local_lock_acquire(l); \
+ } while (0)
+
#define __local_lock(lock) \
do { \
preempt_disable(); \
- local_lock_acquire(this_cpu_ptr(lock)); \
+ __local_lock_acquire(lock); \
} while (0)
#define __local_lock_irq(lock) \
do { \
local_irq_disable(); \
- local_lock_acquire(this_cpu_ptr(lock)); \
+ __local_lock_acquire(lock); \
} while (0)
#define __local_lock_irqsave(lock, flags) \
do { \
local_irq_save(flags); \
- local_lock_acquire(this_cpu_ptr(lock)); \
- } while (0)
-
-#define __local_unlock(lock) \
- do { \
- local_lock_release(this_cpu_ptr(lock)); \
- preempt_enable(); \
+ __local_lock_acquire(lock); \
} while (0)
-#define __local_unlock_irq(lock) \
- do { \
- local_lock_release(this_cpu_ptr(lock)); \
- local_irq_enable(); \
- } while (0)
-
-#define __local_unlock_irqrestore(lock, flags) \
- do { \
- local_lock_release(this_cpu_ptr(lock)); \
- local_irq_restore(flags); \
- } while (0)
-
-#define __local_lock_nested_bh(lock) \
- do { \
- lockdep_assert_in_softirq(); \
- local_lock_acquire(this_cpu_ptr(lock)); \
- } while (0)
-
-#define __local_unlock_nested_bh(lock) \
- local_lock_release(this_cpu_ptr(lock))
-
-/* localtry_lock_t variants */
-
-#define __localtry_lock_init(lock) \
-do { \
- __local_lock_init(&(lock)->llock); \
- WRITE_ONCE((lock)->acquired, 0); \
-} while (0)
-
-#define __localtry_lock(lock) \
- do { \
- localtry_lock_t *lt; \
- preempt_disable(); \
- lt = this_cpu_ptr(lock); \
- local_lock_acquire(<->llock); \
- WRITE_ONCE(lt->acquired, 1); \
- } while (0)
-
-#define __localtry_lock_irq(lock) \
- do { \
- localtry_lock_t *lt; \
- local_irq_disable(); \
- lt = this_cpu_ptr(lock); \
- local_lock_acquire(<->llock); \
- WRITE_ONCE(lt->acquired, 1); \
- } while (0)
-
-#define __localtry_lock_irqsave(lock, flags) \
- do { \
- localtry_lock_t *lt; \
- local_irq_save(flags); \
- lt = this_cpu_ptr(lock); \
- local_lock_acquire(<->llock); \
- WRITE_ONCE(lt->acquired, 1); \
- } while (0)
-
-#define __localtry_trylock(lock) \
+#define __local_trylock(lock) \
({ \
- localtry_lock_t *lt; \
- bool _ret; \
+ local_trylock_t *tl; \
\
preempt_disable(); \
- lt = this_cpu_ptr(lock); \
- if (!READ_ONCE(lt->acquired)) { \
- WRITE_ONCE(lt->acquired, 1); \
- local_trylock_acquire(<->llock); \
- _ret = true; \
- } else { \
- _ret = false; \
+ tl = this_cpu_ptr(lock); \
+ if (READ_ONCE(tl->acquired)) { \
preempt_enable(); \
+ tl = NULL; \
+ } else { \
+ WRITE_ONCE(tl->acquired, 1); \
+ local_trylock_acquire( \
+ (local_lock_t *)tl); \
} \
- _ret; \
+ !!tl; \
})
-#define __localtry_trylock_irqsave(lock, flags) \
+#define __local_trylock_irqsave(lock, flags) \
({ \
- localtry_lock_t *lt; \
- bool _ret; \
+ local_trylock_t *tl; \
\
local_irq_save(flags); \
- lt = this_cpu_ptr(lock); \
- if (!READ_ONCE(lt->acquired)) { \
- WRITE_ONCE(lt->acquired, 1); \
- local_trylock_acquire(<->llock); \
- _ret = true; \
- } else { \
- _ret = false; \
+ tl = this_cpu_ptr(lock); \
+ if (READ_ONCE(tl->acquired)) { \
local_irq_restore(flags); \
+ tl = NULL; \
+ } else { \
+ WRITE_ONCE(tl->acquired, 1); \
+ local_trylock_acquire( \
+ (local_lock_t *)tl); \
} \
- _ret; \
+ !!tl; \
})
-#define __localtry_unlock(lock) \
+#define __local_lock_release(lock) \
+ do { \
+ local_trylock_t *tl; \
+ local_lock_t *l; \
+ \
+ l = (local_lock_t *)this_cpu_ptr(lock); \
+ tl = (local_trylock_t *)l; \
+ local_lock_release(l); \
+ _Generic((lock), \
+ local_trylock_t *: ({ \
+ lockdep_assert(tl->acquired == 1); \
+ WRITE_ONCE(tl->acquired, 0); \
+ }), \
+ default:(void)0); \
+ } while (0)
+
+#define __local_unlock(lock) \
do { \
- localtry_lock_t *lt; \
- lt = this_cpu_ptr(lock); \
- WRITE_ONCE(lt->acquired, 0); \
- local_lock_release(<->llock); \
+ __local_lock_release(lock); \
preempt_enable(); \
} while (0)
-#define __localtry_unlock_irq(lock) \
+#define __local_unlock_irq(lock) \
do { \
- localtry_lock_t *lt; \
- lt = this_cpu_ptr(lock); \
- WRITE_ONCE(lt->acquired, 0); \
- local_lock_release(<->llock); \
+ __local_lock_release(lock); \
local_irq_enable(); \
} while (0)
-#define __localtry_unlock_irqrestore(lock, flags) \
+#define __local_unlock_irqrestore(lock, flags) \
do { \
- localtry_lock_t *lt; \
- lt = this_cpu_ptr(lock); \
- WRITE_ONCE(lt->acquired, 0); \
- local_lock_release(<->llock); \
+ __local_lock_release(lock); \
local_irq_restore(flags); \
} while (0)
+#define __local_lock_nested_bh(lock) \
+ do { \
+ lockdep_assert_in_softirq(); \
+ local_lock_acquire(this_cpu_ptr(lock)); \
+ } while (0)
+
+#define __local_unlock_nested_bh(lock) \
+ local_lock_release(this_cpu_ptr(lock))
+
#else /* !CONFIG_PREEMPT_RT */
/*
@@ -237,16 +212,18 @@ do { \
* critical section while staying preemptible.
*/
typedef spinlock_t local_lock_t;
-typedef spinlock_t localtry_lock_t;
+typedef spinlock_t local_trylock_t;
#define INIT_LOCAL_LOCK(lockname) __LOCAL_SPIN_LOCK_UNLOCKED((lockname))
-#define INIT_LOCALTRY_LOCK(lockname) INIT_LOCAL_LOCK(lockname)
+#define INIT_LOCAL_TRYLOCK(lockname) __LOCAL_SPIN_LOCK_UNLOCKED((lockname))
#define __local_lock_init(l) \
do { \
local_spin_lock_init((l)); \
} while (0)
+#define __local_trylock_init(l) __local_lock_init(l)
+
#define __local_lock(__lock) \
do { \
migrate_disable(); \
@@ -283,17 +260,7 @@ do { \
spin_unlock(this_cpu_ptr((lock))); \
} while (0)
-/* localtry_lock_t variants */
-
-#define __localtry_lock_init(lock) __local_lock_init(lock)
-#define __localtry_lock(lock) __local_lock(lock)
-#define __localtry_lock_irq(lock) __local_lock(lock)
-#define __localtry_lock_irqsave(lock, flags) __local_lock_irqsave(lock, flags)
-#define __localtry_unlock(lock) __local_unlock(lock)
-#define __localtry_unlock_irq(lock) __local_unlock(lock)
-#define __localtry_unlock_irqrestore(lock, flags) __local_unlock_irqrestore(lock, flags)
-
-#define __localtry_trylock(lock) \
+#define __local_trylock(lock) \
({ \
int __locked; \
\
@@ -308,11 +275,11 @@ do { \
__locked; \
})
-#define __localtry_trylock_irqsave(lock, flags) \
+#define __local_trylock_irqsave(lock, flags) \
({ \
typecheck(unsigned long, flags); \
flags = 0; \
- __localtry_trylock(lock); \
+ __local_trylock(lock); \
})
#endif /* CONFIG_PREEMPT_RT */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 421740f1bcdc..c96c1f2b9cf5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1759,7 +1759,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
}
struct memcg_stock_pcp {
- localtry_lock_t stock_lock;
+ local_trylock_t stock_lock;
struct mem_cgroup *cached; /* this never be root cgroup */
unsigned int nr_pages;
@@ -1774,7 +1774,7 @@ struct memcg_stock_pcp {
#define FLUSHING_CACHED_CHARGE 0
};
static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
- .stock_lock = INIT_LOCALTRY_LOCK(stock_lock),
+ .stock_lock = INIT_LOCAL_TRYLOCK(stock_lock),
};
static DEFINE_MUTEX(percpu_charge_mutex);
@@ -1805,11 +1805,10 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
if (nr_pages > MEMCG_CHARGE_BATCH)
return ret;
- if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
- if (!gfpflags_allow_spinning(gfp_mask))
- return ret;
- localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
- }
+ if (gfpflags_allow_spinning(gfp_mask))
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
+ else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags))
+ return ret;
stock = this_cpu_ptr(&memcg_stock);
stock_pages = READ_ONCE(stock->nr_pages);
@@ -1818,7 +1817,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
ret = true;
}
- localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
return ret;
}
@@ -1857,14 +1856,14 @@ static void drain_local_stock(struct work_struct *dummy)
* drain_stock races is that we always operate on local CPU stock
* here with IRQ disabled
*/
- localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
stock = this_cpu_ptr(&memcg_stock);
old = drain_obj_stock(stock);
drain_stock(stock);
clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
- localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
obj_cgroup_put(old);
}
@@ -1894,7 +1893,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
unsigned long flags;
- if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
+ if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) {
/*
* In case of unlikely failure to lock percpu stock_lock
* uncharge memcg directly.
@@ -1907,7 +1906,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
return;
}
__refill_stock(memcg, nr_pages);
- localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
}
/*
@@ -1964,9 +1963,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
stock = &per_cpu(memcg_stock, cpu);
/* drain_obj_stock requires stock_lock */
- localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
old = drain_obj_stock(stock);
- localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
drain_stock(stock);
obj_cgroup_put(old);
@@ -2787,7 +2786,7 @@ static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
unsigned long flags;
int *bytes;
- localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
stock = this_cpu_ptr(&memcg_stock);
/*
@@ -2836,7 +2835,7 @@ static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
if (nr)
__mod_objcg_mlstate(objcg, pgdat, idx, nr);
- localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
obj_cgroup_put(old);
}
@@ -2846,7 +2845,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
unsigned long flags;
bool ret = false;
- localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
stock = this_cpu_ptr(&memcg_stock);
if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) {
@@ -2854,7 +2853,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
ret = true;
}
- localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
return ret;
}
@@ -2946,7 +2945,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
unsigned long flags;
unsigned int nr_pages = 0;
- localtry_lock_irqsave(&memcg_stock.stock_lock, flags);
+ local_lock_irqsave(&memcg_stock.stock_lock, flags);
stock = this_cpu_ptr(&memcg_stock);
if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */
@@ -2960,7 +2959,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
stock->nr_bytes &= (PAGE_SIZE - 1);
}
- localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags);
+ local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
obj_cgroup_put(old);
if (nr_pages)
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* [RFC PATCH bpf-next/net v1 06/13] bpf: Introduce bpf_dynptr_from_mem_slice
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
` (4 preceding siblings ...)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 05/13] locking/local_lock, mm: Replace localtry_ helpers with local_trylock_t type Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-16 17:36 ` Eduard Zingerman
2025-04-16 21:04 ` Andrii Nakryiko
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams Kumar Kartikeya Dwivedi
` (6 subsequent siblings)
12 siblings, 2 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Add a new bpf_dynptr_from_mem_slice kfunc to create a dynptr from a
PTR_TO_BTF_ID exposing a variable-length slice of memory, represented by
the new bpf_mem_slice type. This slice is read-only, for a read-write
slice we can expose a distinct type in the future.
We rely on the previous commits ensuring source objects underpinning
dynptr memory are tracked correctly for invalidation to ensure when a
PTR_TO_BTF_ID holding a memory slice goes away, it's corresponding
dynptrs get invalidated.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 5 +++++
kernel/bpf/helpers.c | 32 ++++++++++++++++++++++++++++++++
kernel/bpf/verifier.c | 6 +++++-
3 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 3f0cc89c0622..9feaa9bbf0a4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1344,6 +1344,11 @@ enum bpf_dynptr_type {
BPF_DYNPTR_TYPE_XDP,
};
+struct bpf_mem_slice {
+ void *ptr;
+ size_t len;
+};
+
int bpf_dynptr_check_size(u32 size);
u32 __bpf_dynptr_size(const struct bpf_dynptr_kern *ptr);
const void *__bpf_dynptr_data(const struct bpf_dynptr_kern *ptr, u32 len);
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index e3a2662f4e33..95e9c9df6062 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -2826,6 +2826,37 @@ __bpf_kfunc int bpf_dynptr_copy(struct bpf_dynptr *dst_ptr, u32 dst_off,
return 0;
}
+/**
+ * XXX
+ */
+__bpf_kfunc int bpf_dynptr_from_mem_slice(struct bpf_mem_slice *mem_slice, u64 flags, struct bpf_dynptr *dptr__uninit)
+{
+ struct bpf_dynptr_kern *dptr = (struct bpf_dynptr_kern *)dptr__uninit;
+ int err;
+
+ if (!mem_slice)
+ return -EINVAL;
+
+ err = bpf_dynptr_check_size(mem_slice->len);
+ if (err)
+ goto error;
+
+ /* flags is currently unsupported */
+ if (flags) {
+ err = -EINVAL;
+ goto error;
+ }
+
+ bpf_dynptr_init(dptr, mem_slice->ptr, BPF_DYNPTR_TYPE_LOCAL, 0, mem_slice->len);
+ bpf_dynptr_set_rdonly(dptr);
+
+ return 0;
+
+error:
+ bpf_dynptr_set_null(dptr);
+ return err;
+}
+
__bpf_kfunc void *bpf_cast_to_kern_ctx(void *obj)
{
return obj;
@@ -3275,6 +3306,7 @@ BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
BTF_ID_FLAGS(func, bpf_dynptr_size)
BTF_ID_FLAGS(func, bpf_dynptr_clone)
BTF_ID_FLAGS(func, bpf_dynptr_copy)
+BTF_ID_FLAGS(func, bpf_dynptr_from_mem_slice, KF_TRUSTED_ARGS)
#ifdef CONFIG_NET
BTF_ID_FLAGS(func, bpf_modify_return_test_tp)
#endif
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7e09c4592038..26aa70cd5734 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12125,6 +12125,7 @@ enum special_kfunc_type {
KF_bpf_res_spin_unlock,
KF_bpf_res_spin_lock_irqsave,
KF_bpf_res_spin_unlock_irqrestore,
+ KF_bpf_dynptr_from_mem_slice,
};
BTF_SET_START(special_kfunc_set)
@@ -12218,6 +12219,7 @@ BTF_ID(func, bpf_res_spin_lock)
BTF_ID(func, bpf_res_spin_unlock)
BTF_ID(func, bpf_res_spin_lock_irqsave)
BTF_ID(func, bpf_res_spin_unlock_irqrestore)
+BTF_ID(func, bpf_dynptr_from_mem_slice)
static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
{
@@ -13139,7 +13141,9 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
}
}
- if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
+ if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_mem_slice]) {
+ dynptr_arg_type |= DYNPTR_TYPE_LOCAL;
+ } else if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
dynptr_arg_type |= DYNPTR_TYPE_SKB;
} else if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_xdp]) {
dynptr_arg_type |= DYNPTR_TYPE_XDP;
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 06/13] bpf: Introduce bpf_dynptr_from_mem_slice
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 06/13] bpf: Introduce bpf_dynptr_from_mem_slice Kumar Kartikeya Dwivedi
@ 2025-04-16 17:36 ` Eduard Zingerman
2025-04-17 20:07 ` Kumar Kartikeya Dwivedi
2025-04-16 21:04 ` Andrii Nakryiko
1 sibling, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-16 17:36 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> Add a new bpf_dynptr_from_mem_slice kfunc to create a dynptr from a
> PTR_TO_BTF_ID exposing a variable-length slice of memory, represented by
> the new bpf_mem_slice type. This slice is read-only, for a read-write
> slice we can expose a distinct type in the future.
>
> We rely on the previous commits ensuring source objects underpinning
> dynptr memory are tracked correctly for invalidation to ensure when a
> PTR_TO_BTF_ID holding a memory slice goes away, it's corresponding
> dynptrs get invalidated.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
If I remove KF_TRUSTED_ARGS flag from the function nothing fails.
But that's because with current interface slices will always have a
ref_obj_id, right? And pointers with ref_obj_id are always trusted.
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
[...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH bpf-next/net v1 06/13] bpf: Introduce bpf_dynptr_from_mem_slice
2025-04-16 17:36 ` Eduard Zingerman
@ 2025-04-17 20:07 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-17 20:07 UTC (permalink / raw)
To: Eduard Zingerman
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
On Wed, 16 Apr 2025 at 19:36, Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>
> > Add a new bpf_dynptr_from_mem_slice kfunc to create a dynptr from a
> > PTR_TO_BTF_ID exposing a variable-length slice of memory, represented by
> > the new bpf_mem_slice type. This slice is read-only, for a read-write
> > slice we can expose a distinct type in the future.
> >
> > We rely on the previous commits ensuring source objects underpinning
> > dynptr memory are tracked correctly for invalidation to ensure when a
> > PTR_TO_BTF_ID holding a memory slice goes away, it's corresponding
> > dynptrs get invalidated.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>
> If I remove KF_TRUSTED_ARGS flag from the function nothing fails.
> But that's because with current interface slices will always have a
> ref_obj_id, right? And pointers with ref_obj_id are always trusted.
Yes.
>
> Acked-by: Eduard Zingerman <eddyz87@gmail.com>
>
> [...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH bpf-next/net v1 06/13] bpf: Introduce bpf_dynptr_from_mem_slice
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 06/13] bpf: Introduce bpf_dynptr_from_mem_slice Kumar Kartikeya Dwivedi
2025-04-16 17:36 ` Eduard Zingerman
@ 2025-04-16 21:04 ` Andrii Nakryiko
2025-04-17 20:07 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 44+ messages in thread
From: Andrii Nakryiko @ 2025-04-16 21:04 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
On Mon, Apr 14, 2025 at 9:14 AM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> Add a new bpf_dynptr_from_mem_slice kfunc to create a dynptr from a
> PTR_TO_BTF_ID exposing a variable-length slice of memory, represented by
> the new bpf_mem_slice type. This slice is read-only, for a read-write
> slice we can expose a distinct type in the future.
>
> We rely on the previous commits ensuring source objects underpinning
> dynptr memory are tracked correctly for invalidation to ensure when a
> PTR_TO_BTF_ID holding a memory slice goes away, it's corresponding
> dynptrs get invalidated.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> include/linux/bpf.h | 5 +++++
> kernel/bpf/helpers.c | 32 ++++++++++++++++++++++++++++++++
> kernel/bpf/verifier.c | 6 +++++-
> 3 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 3f0cc89c0622..9feaa9bbf0a4 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1344,6 +1344,11 @@ enum bpf_dynptr_type {
> BPF_DYNPTR_TYPE_XDP,
> };
>
> +struct bpf_mem_slice {
> + void *ptr;
> + size_t len;
for better future extensibility and to avoid big-endian issues, let's have
u32 len;
u32 __reserved; /* or flags */
?
> +};
> +
> int bpf_dynptr_check_size(u32 size);
> u32 __bpf_dynptr_size(const struct bpf_dynptr_kern *ptr);
> const void *__bpf_dynptr_data(const struct bpf_dynptr_kern *ptr, u32 len);
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index e3a2662f4e33..95e9c9df6062 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -2826,6 +2826,37 @@ __bpf_kfunc int bpf_dynptr_copy(struct bpf_dynptr *dst_ptr, u32 dst_off,
> return 0;
> }
>
> +/**
> + * XXX
> + */
> +__bpf_kfunc int bpf_dynptr_from_mem_slice(struct bpf_mem_slice *mem_slice, u64 flags, struct bpf_dynptr *dptr__uninit)
> +{
> + struct bpf_dynptr_kern *dptr = (struct bpf_dynptr_kern *)dptr__uninit;
> + int err;
> +
> + if (!mem_slice)
> + return -EINVAL;
you have to initialize dynptr regardless of errors, just like
bpf_dynptr_from_mem() does, so
err = -EINVAL;
goto error;
> +
> + err = bpf_dynptr_check_size(mem_slice->len);
> + if (err)
> + goto error;
> +
> + /* flags is currently unsupported */
> + if (flags) {
> + err = -EINVAL;
> + goto error;
> + }
> +
> + bpf_dynptr_init(dptr, mem_slice->ptr, BPF_DYNPTR_TYPE_LOCAL, 0, mem_slice->len);
> + bpf_dynptr_set_rdonly(dptr);
> +
> + return 0;
> +
> +error:
> + bpf_dynptr_set_null(dptr);
> + return err;
> +}
> +
> __bpf_kfunc void *bpf_cast_to_kern_ctx(void *obj)
> {
> return obj;
> @@ -3275,6 +3306,7 @@ BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
> BTF_ID_FLAGS(func, bpf_dynptr_size)
> BTF_ID_FLAGS(func, bpf_dynptr_clone)
> BTF_ID_FLAGS(func, bpf_dynptr_copy)
> +BTF_ID_FLAGS(func, bpf_dynptr_from_mem_slice, KF_TRUSTED_ARGS)
> #ifdef CONFIG_NET
> BTF_ID_FLAGS(func, bpf_modify_return_test_tp)
> #endif
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 7e09c4592038..26aa70cd5734 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -12125,6 +12125,7 @@ enum special_kfunc_type {
> KF_bpf_res_spin_unlock,
> KF_bpf_res_spin_lock_irqsave,
> KF_bpf_res_spin_unlock_irqrestore,
> + KF_bpf_dynptr_from_mem_slice,
> };
>
> BTF_SET_START(special_kfunc_set)
> @@ -12218,6 +12219,7 @@ BTF_ID(func, bpf_res_spin_lock)
> BTF_ID(func, bpf_res_spin_unlock)
> BTF_ID(func, bpf_res_spin_lock_irqsave)
> BTF_ID(func, bpf_res_spin_unlock_irqrestore)
> +BTF_ID(func, bpf_dynptr_from_mem_slice)
>
> static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
> {
> @@ -13139,7 +13141,9 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> }
> }
>
> - if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> + if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_mem_slice]) {
> + dynptr_arg_type |= DYNPTR_TYPE_LOCAL;
> + } else if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> dynptr_arg_type |= DYNPTR_TYPE_SKB;
> } else if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_xdp]) {
> dynptr_arg_type |= DYNPTR_TYPE_XDP;
> --
> 2.47.1
>
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 06/13] bpf: Introduce bpf_dynptr_from_mem_slice
2025-04-16 21:04 ` Andrii Nakryiko
@ 2025-04-17 20:07 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-17 20:07 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
On Wed, 16 Apr 2025 at 23:04, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Apr 14, 2025 at 9:14 AM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > Add a new bpf_dynptr_from_mem_slice kfunc to create a dynptr from a
> > PTR_TO_BTF_ID exposing a variable-length slice of memory, represented by
> > the new bpf_mem_slice type. This slice is read-only, for a read-write
> > slice we can expose a distinct type in the future.
> >
> > We rely on the previous commits ensuring source objects underpinning
> > dynptr memory are tracked correctly for invalidation to ensure when a
> > PTR_TO_BTF_ID holding a memory slice goes away, it's corresponding
> > dynptrs get invalidated.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > include/linux/bpf.h | 5 +++++
> > kernel/bpf/helpers.c | 32 ++++++++++++++++++++++++++++++++
> > kernel/bpf/verifier.c | 6 +++++-
> > 3 files changed, 42 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 3f0cc89c0622..9feaa9bbf0a4 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1344,6 +1344,11 @@ enum bpf_dynptr_type {
> > BPF_DYNPTR_TYPE_XDP,
> > };
> >
> > +struct bpf_mem_slice {
> > + void *ptr;
> > + size_t len;
>
> for better future extensibility and to avoid big-endian issues, let's have
>
> u32 len;
> u32 __reserved; /* or flags */
>
Ack.
> ?
>
> > +};
> > +
> > int bpf_dynptr_check_size(u32 size);
> > u32 __bpf_dynptr_size(const struct bpf_dynptr_kern *ptr);
> > const void *__bpf_dynptr_data(const struct bpf_dynptr_kern *ptr, u32 len);
> > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> > index e3a2662f4e33..95e9c9df6062 100644
> > --- a/kernel/bpf/helpers.c
> > +++ b/kernel/bpf/helpers.c
> > @@ -2826,6 +2826,37 @@ __bpf_kfunc int bpf_dynptr_copy(struct bpf_dynptr *dst_ptr, u32 dst_off,
> > return 0;
> > }
> >
> > +/**
> > + * XXX
> > + */
> > +__bpf_kfunc int bpf_dynptr_from_mem_slice(struct bpf_mem_slice *mem_slice, u64 flags, struct bpf_dynptr *dptr__uninit)
> > +{
> > + struct bpf_dynptr_kern *dptr = (struct bpf_dynptr_kern *)dptr__uninit;
> > + int err;
> > +
> > + if (!mem_slice)
> > + return -EINVAL;
>
> you have to initialize dynptr regardless of errors, just like
> bpf_dynptr_from_mem() does, so
>
> err = -EINVAL;
> goto error;
Thanks for catching this, will fix.
>
>
> > +
> > + err = bpf_dynptr_check_size(mem_slice->len);
> > + if (err)
> > + goto error;
> > +
> > + /* flags is currently unsupported */
> > + if (flags) {
> > + err = -EINVAL;
> > + goto error;
> > + }
> > +
> > + bpf_dynptr_init(dptr, mem_slice->ptr, BPF_DYNPTR_TYPE_LOCAL, 0, mem_slice->len);
> > + bpf_dynptr_set_rdonly(dptr);
> > +
> > + return 0;
> > +
> > +error:
> > + bpf_dynptr_set_null(dptr);
> > + return err;
> > +}
> > +
> > __bpf_kfunc void *bpf_cast_to_kern_ctx(void *obj)
> > {
> > return obj;
> > @@ -3275,6 +3306,7 @@ BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
> > BTF_ID_FLAGS(func, bpf_dynptr_size)
> > BTF_ID_FLAGS(func, bpf_dynptr_clone)
> > BTF_ID_FLAGS(func, bpf_dynptr_copy)
> > +BTF_ID_FLAGS(func, bpf_dynptr_from_mem_slice, KF_TRUSTED_ARGS)
> > #ifdef CONFIG_NET
> > BTF_ID_FLAGS(func, bpf_modify_return_test_tp)
> > #endif
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index 7e09c4592038..26aa70cd5734 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -12125,6 +12125,7 @@ enum special_kfunc_type {
> > KF_bpf_res_spin_unlock,
> > KF_bpf_res_spin_lock_irqsave,
> > KF_bpf_res_spin_unlock_irqrestore,
> > + KF_bpf_dynptr_from_mem_slice,
> > };
> >
> > BTF_SET_START(special_kfunc_set)
> > @@ -12218,6 +12219,7 @@ BTF_ID(func, bpf_res_spin_lock)
> > BTF_ID(func, bpf_res_spin_unlock)
> > BTF_ID(func, bpf_res_spin_lock_irqsave)
> > BTF_ID(func, bpf_res_spin_unlock_irqrestore)
> > +BTF_ID(func, bpf_dynptr_from_mem_slice)
> >
> > static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
> > {
> > @@ -13139,7 +13141,9 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > }
> > }
> >
> > - if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > + if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_mem_slice]) {
> > + dynptr_arg_type |= DYNPTR_TYPE_LOCAL;
> > + } else if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > dynptr_arg_type |= DYNPTR_TYPE_SKB;
> > } else if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_xdp]) {
> > dynptr_arg_type |= DYNPTR_TYPE_XDP;
> > --
> > 2.47.1
> >
^ permalink raw reply [flat|nested] 44+ messages in thread
* [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
` (5 preceding siblings ...)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 06/13] bpf: Introduce bpf_dynptr_from_mem_slice Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-16 21:49 ` Eduard Zingerman
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 08/13] bpf: Add dump_stack() analogue to print to BPF stderr Kumar Kartikeya Dwivedi
` (5 subsequent siblings)
12 siblings, 1 reply; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Introduce a set of kfuncs to implement per BPF program stdout and stderr
streams. This is implemented as a linked list of dynamically allocated
strings whenever a message is printed. This can be done by the kernel or
the program itself using the bpf_prog_stream_vprintk kfunc. In-kernel
wrappers are provided over streams to ease the process.
The idea is that everytime messages need to be dumped, the reader would
pull out the whole batch of messages from the stream at once, and then
pop each string one by one and start printing it out (how exactly is
left up to the BPF program reading the log, but usually it will be
streaming data back into a ring buffer that is consumed by user space).
The use of a lockless list warrants that we be careful about the
ordering of messages. When being added, the order maintained is new to
old, therefore after deletion, we must reverse the list before iterating
and popping out elements from it.
Overall, this infrastructure provides NMI-safe any context printing
built on top of the NMI-safe any context bpf_mem_alloc() interface.
Later patches will add support for printing splats in case of BPF arena
page faults, rqspinlock deadlocks, and cond_break timeouts.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 52 ++++-
kernel/bpf/Makefile | 2 +-
kernel/bpf/core.c | 12 +
kernel/bpf/helpers.c | 26 +--
kernel/bpf/stream.c | 496 ++++++++++++++++++++++++++++++++++++++++++
kernel/bpf/syscall.c | 2 +-
kernel/bpf/verifier.c | 15 +-
7 files changed, 582 insertions(+), 23 deletions(-)
create mode 100644 kernel/bpf/stream.c
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 9feaa9bbf0a4..ae2ddd3bdea1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1523,6 +1523,36 @@ struct btf_mod_pair {
struct bpf_kfunc_desc_tab;
+enum bpf_stream_id {
+ BPF_STDOUT = 1,
+ BPF_STDERR = 2,
+};
+
+enum {
+ BPF_STREAM_ELEM_F_PAGE = (1 << 0),
+ BPF_STREAM_ELEM_F_NEXT = (1 << 1),
+ BPF_STREAM_ELEM_F_MASK = (BPF_STREAM_ELEM_F_PAGE | BPF_STREAM_ELEM_F_NEXT),
+};
+
+struct bpf_stream_elem {
+ struct llist_node node;
+ struct bpf_mem_slice mem_slice;
+ union {
+ unsigned long flags;
+ struct bpf_stream_elem *next;
+ };
+ char str[];
+};
+
+struct bpf_stream_elem_batch {
+ struct llist_node *node;
+};
+
+struct bpf_stream {
+ enum bpf_stream_id stream_id;
+ struct llist_head log;
+};
+
struct bpf_prog_aux {
atomic64_t refcnt;
u32 used_map_cnt;
@@ -1631,6 +1661,7 @@ struct bpf_prog_aux {
struct work_struct work;
struct rcu_head rcu;
};
+ struct bpf_stream stream[2];
};
struct bpf_prog {
@@ -2390,6 +2421,8 @@ int generic_map_delete_batch(struct bpf_map *map,
struct bpf_map *bpf_map_get_curr_or_next(u32 *id);
struct bpf_prog *bpf_prog_get_curr_or_next(u32 *id);
+
+struct page *__bpf_alloc_page(int nid);
int bpf_map_alloc_pages(const struct bpf_map *map, int nid,
unsigned long nr_pages, struct page **page_array);
#ifdef CONFIG_MEMCG
@@ -3528,6 +3561,16 @@ bool btf_id_set_contains(const struct btf_id_set *set, u32 id);
#define MAX_BPRINTF_VARARGS 12
#define MAX_BPRINTF_BUF 1024
+/* Per-cpu temp buffers used by printf-like helpers to store the bprintf binary
+ * arguments representation.
+ */
+#define MAX_BPRINTF_BIN_ARGS 512
+
+struct bpf_bprintf_buffers {
+ char bin_args[MAX_BPRINTF_BIN_ARGS];
+ char buf[MAX_BPRINTF_BUF];
+};
+
struct bpf_bprintf_data {
u32 *bin_args;
char *buf;
@@ -3535,9 +3578,16 @@ struct bpf_bprintf_data {
bool get_buf;
};
-int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args,
+int bpf_bprintf_prepare(const char *fmt, u32 fmt_size, const u64 *raw_args,
u32 num_args, struct bpf_bprintf_data *data);
void bpf_bprintf_cleanup(struct bpf_bprintf_data *data);
+int bpf_try_get_buffers(struct bpf_bprintf_buffers **bufs);
+void bpf_put_buffers(void);
+
+void bpf_prog_stream_init(struct bpf_prog *prog);
+void bpf_prog_stream_free(struct bpf_prog *prog);
+__printf(2, 3)
+int bpf_prog_stderr_printk(struct bpf_prog *prog, const char *fmt, ...);
#ifdef CONFIG_BPF_LSM
void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 70502f038b92..a89575822b60 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -14,7 +14,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o
obj-$(CONFIG_BPF_SYSCALL) += disasm.o mprog.o
obj-$(CONFIG_BPF_JIT) += trampoline.o
-obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o rqspinlock.o
+obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o rqspinlock.o stream.o
ifeq ($(CONFIG_MMU)$(CONFIG_64BIT),yy)
obj-$(CONFIG_BPF_SYSCALL) += arena.o range_tree.o
endif
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba6b6118cf50..1bfc6f7ea3da 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -134,6 +134,10 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag
mutex_init(&fp->aux->ext_mutex);
mutex_init(&fp->aux->dst_mutex);
+#ifdef CONFIG_BPF_SYSCALL
+ bpf_prog_stream_init(fp);
+#endif
+
return fp;
}
@@ -2856,6 +2860,7 @@ static void bpf_prog_free_deferred(struct work_struct *work)
aux = container_of(work, struct bpf_prog_aux, work);
#ifdef CONFIG_BPF_SYSCALL
bpf_free_kfunc_btf_tab(aux->kfunc_btf_tab);
+ bpf_prog_stream_free(aux->prog);
#endif
#ifdef CONFIG_CGROUP_BPF
if (aux->cgroup_atype != CGROUP_BPF_ATTACH_TYPE_INVALID)
@@ -2872,6 +2877,13 @@ static void bpf_prog_free_deferred(struct work_struct *work)
if (aux->dst_trampoline)
bpf_trampoline_put(aux->dst_trampoline);
for (i = 0; i < aux->real_func_cnt; i++) {
+#ifdef CONFIG_BPF_SYSCALL
+ /* Ensure we don't push to subprog lists. */
+ if (bpf_is_subprog(aux->func[i])) {
+ WARN_ON_ONCE(!llist_empty(&aux->func[i]->aux->stream[0].log));
+ WARN_ON_ONCE(!llist_empty(&aux->func[i]->aux->stream[1].log));
+ }
+#endif
/* We can just unlink the subprog poke descriptor table as
* it was originally linked to the main program and is also
* released along with it.
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 95e9c9df6062..7f59367b1101 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -761,22 +761,13 @@ static int bpf_trace_copy_string(char *buf, void *unsafe_ptr, char fmt_ptype,
return -EINVAL;
}
-/* Per-cpu temp buffers used by printf-like helpers to store the bprintf binary
- * arguments representation.
- */
-#define MAX_BPRINTF_BIN_ARGS 512
-
/* Support executing three nested bprintf helper calls on a given CPU */
#define MAX_BPRINTF_NEST_LEVEL 3
-struct bpf_bprintf_buffers {
- char bin_args[MAX_BPRINTF_BIN_ARGS];
- char buf[MAX_BPRINTF_BUF];
-};
static DEFINE_PER_CPU(struct bpf_bprintf_buffers[MAX_BPRINTF_NEST_LEVEL], bpf_bprintf_bufs);
static DEFINE_PER_CPU(int, bpf_bprintf_nest_level);
-static int try_get_buffers(struct bpf_bprintf_buffers **bufs)
+int bpf_try_get_buffers(struct bpf_bprintf_buffers **bufs)
{
int nest_level;
@@ -792,16 +783,21 @@ static int try_get_buffers(struct bpf_bprintf_buffers **bufs)
return 0;
}
-void bpf_bprintf_cleanup(struct bpf_bprintf_data *data)
+void bpf_put_buffers(void)
{
- if (!data->bin_args && !data->buf)
- return;
if (WARN_ON_ONCE(this_cpu_read(bpf_bprintf_nest_level) == 0))
return;
this_cpu_dec(bpf_bprintf_nest_level);
preempt_enable();
}
+void bpf_bprintf_cleanup(struct bpf_bprintf_data *data)
+{
+ if (!data->bin_args && !data->buf)
+ return;
+ bpf_put_buffers();
+}
+
/*
* bpf_bprintf_prepare - Generic pass on format strings for bprintf-like helpers
*
@@ -816,7 +812,7 @@ void bpf_bprintf_cleanup(struct bpf_bprintf_data *data)
* In argument preparation mode, if 0 is returned, safe temporary buffers are
* allocated and bpf_bprintf_cleanup should be called to free them after use.
*/
-int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args,
+int bpf_bprintf_prepare(const char *fmt, u32 fmt_size, const u64 *raw_args,
u32 num_args, struct bpf_bprintf_data *data)
{
bool get_buffers = (data->get_bin_args && num_args) || data->get_buf;
@@ -832,7 +828,7 @@ int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args,
return -EINVAL;
fmt_size = fmt_end - fmt;
- if (get_buffers && try_get_buffers(&buffers))
+ if (get_buffers && bpf_try_get_buffers(&buffers))
return -EBUSY;
if (data->get_bin_args) {
diff --git a/kernel/bpf/stream.c b/kernel/bpf/stream.c
new file mode 100644
index 000000000000..2019ce134310
--- /dev/null
+++ b/kernel/bpf/stream.c
@@ -0,0 +1,496 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+
+#include <linux/bpf.h>
+#include <linux/bpf_mem_alloc.h>
+#include <linux/percpu.h>
+#include <linux/refcount.h>
+#include <linux/gfp.h>
+#include <linux/memory.h>
+#include <linux/local_lock.h>
+
+/*
+ * Simple per-CPU NMI-safe bump allocation mechanism, backed by the NMI-safe
+ * try_alloc_pages()/free_pages_nolock() primitives. We allocate a page and
+ * stash it in a local per-CPU variable, and bump allocate from the page
+ * whenever items need to be printed to a stream. Each page holds a global
+ * atomic refcount in its first 4 bytes, and then records of variable length
+ * that describe the printed messages. Once the global refcount has dropped to
+ * zero, it is a signal to free the page back to the kernel's page allocator,
+ * given all the individual records in it have been consumed.
+ *
+ * It is possible the same page is used to serve allocations across different
+ * programs, which may be consumed at different times individually, hence
+ * maintaining a reference count per-page is critical for correct lifetime
+ * tracking.
+ *
+ * Ideally, we'd have a kmalloc() equivalent that just allowed allocating items
+ * of different sizes directly leading to less fragmentation overall. Let's use
+ * this scheme until support for that arrives, and when we cannot capture memory
+ * from bpf_mem_alloc() reserves.
+ */
+
+struct bpf_stream_page {
+ refcount_t ref;
+ u32 consumed;
+ char buf[];
+};
+
+/* Available room to add data to a refcounted page. */
+#define BPF_STREAM_PAGE_SZ (PAGE_SIZE - offsetofend(struct bpf_stream_page, consumed))
+
+static DEFINE_PER_CPU(local_trylock_t, stream_local_lock) = INIT_LOCAL_TRYLOCK(stream_local_lock);
+static DEFINE_PER_CPU(struct bpf_stream_page *, stream_pcpu_page);
+
+static bool bpf_stream_page_local_lock(unsigned long *flags)
+{
+ return local_trylock_irqsave(&stream_local_lock, *flags);
+}
+
+static void bpf_stream_page_local_unlock(unsigned long *flags)
+{
+ local_unlock_irqrestore(&stream_local_lock, *flags);
+}
+
+static void bpf_stream_page_free(struct bpf_stream_page *stream_page)
+{
+ struct page *p;
+
+ if (!stream_page)
+ return;
+ p = virt_to_page(stream_page);
+ free_pages_nolock(p, 0);
+}
+
+static void bpf_stream_page_get(struct bpf_stream_page *stream_page)
+{
+ refcount_inc(&stream_page->ref);
+}
+
+static void bpf_stream_page_put(struct bpf_stream_page *stream_page)
+{
+ if (refcount_dec_and_test(&stream_page->ref))
+ bpf_stream_page_free(stream_page);
+}
+
+static void bpf_stream_page_init(struct bpf_stream_page *stream_page)
+{
+ refcount_set(&stream_page->ref, 1);
+ stream_page->consumed = 0;
+}
+
+static struct bpf_stream_page *bpf_stream_page_replace(void)
+{
+ struct bpf_stream_page *stream_page, *old_stream_page;
+ struct page *page;
+
+ page = __bpf_alloc_page(NUMA_NO_NODE);
+ if (!page)
+ return NULL;
+ stream_page = page_address(page);
+ bpf_stream_page_init(stream_page);
+
+ old_stream_page = this_cpu_read(stream_pcpu_page);
+ if (old_stream_page)
+ bpf_stream_page_put(old_stream_page);
+ this_cpu_write(stream_pcpu_page, stream_page);
+ return stream_page;
+}
+
+static int bpf_stream_page_check_room(struct bpf_stream_page *stream_page, int len)
+{
+ int min = offsetof(struct bpf_stream_elem, str[0]);
+ int consumed = stream_page->consumed;
+ int total = BPF_STREAM_PAGE_SZ;
+ int rem = max(0, total - consumed - min);
+
+ /* Let's give room of at least 8 bytes. */
+ WARN_ON_ONCE(rem % 8 != 0);
+ return rem < 8 ? 0 : rem;
+}
+
+static void bpf_stream_elem_init(struct bpf_stream_elem *elem, int len)
+{
+ init_llist_node(&elem->node);
+ elem->mem_slice.ptr = elem->str;
+ elem->mem_slice.len = len;
+ elem->flags = 0;
+}
+
+static struct bpf_stream_page *bpf_stream_page_from_elem(struct bpf_stream_elem *elem)
+{
+ unsigned long addr = (unsigned long)elem;
+
+ return (struct bpf_stream_page *)PAGE_ALIGN_DOWN(addr);
+}
+
+static struct bpf_stream_elem *bpf_stream_page_push_elem(struct bpf_stream_page *stream_page, int len)
+{
+ u32 consumed = stream_page->consumed;
+
+ stream_page->consumed += round_up(offsetof(struct bpf_stream_elem, str[len]), 8);
+ return (struct bpf_stream_elem *)&stream_page->buf[consumed];
+}
+
+static noinline struct bpf_stream_elem *bpf_stream_page_reserve_elem(int len)
+{
+ struct bpf_stream_elem *elem = NULL, *next_elem;
+ struct bpf_stream_page *p1, *p2;
+ int room = 0;
+
+ p1 = this_cpu_read(stream_pcpu_page);
+ if (!p1)
+ p1 = bpf_stream_page_replace();
+ if (!p1)
+ return NULL;
+
+ room = bpf_stream_page_check_room(p1, len);
+ room = min(len, room);
+
+ /*
+ * We have three cases to handle, all data in first page, some in first
+ * and some in second page, and all data in second page.
+ *
+ * Some or all data in first page, bump p1's refcount.
+ */
+ if (room)
+ bpf_stream_page_get(p1);
+
+ /* Some or all data in second page, replace p1 with it. */
+ if (room < len) {
+ p2 = bpf_stream_page_replace();
+ if (!p2) {
+ bpf_stream_page_put(p1);
+ return NULL;
+ }
+ bpf_stream_page_get(p2);
+ }
+
+ if (!room)
+ goto second;
+
+ elem = bpf_stream_page_push_elem(p1, room);
+ bpf_stream_elem_init(elem, room);
+ elem->flags |= BPF_STREAM_ELEM_F_PAGE;
+
+ if (room == len)
+ return elem;
+second:
+ next_elem = bpf_stream_page_push_elem(p2, len - room);
+ bpf_stream_elem_init(next_elem, len - room);
+ next_elem->flags |= BPF_STREAM_ELEM_F_PAGE;
+ if (!room)
+ return next_elem;
+
+ elem->next = next_elem;
+ elem->flags |= (BPF_STREAM_ELEM_F_PAGE | BPF_STREAM_ELEM_F_NEXT);
+ return elem;
+}
+
+static struct bpf_stream_elem *bpf_stream_elem_alloc_from_bpf_ma(int len)
+{
+ struct bpf_stream_elem *elem;
+
+ elem = bpf_mem_alloc(&bpf_global_ma, offsetof(typeof(*elem), str[len]));
+ if (!elem)
+ return NULL;
+ bpf_stream_elem_init(elem, len);
+ return elem;
+}
+
+static struct bpf_stream_elem *bpf_stream_elem_alloc_from_stream_page(int len)
+{
+ struct bpf_stream_elem *elem;
+ unsigned long flags;
+
+ /*
+ * Let's not try any harder, caller tried bpf_mem_alloc() before us, so
+ * we've exhausted the per-CPU caches and cannot refill, and so we're
+ * either an NMI hitting between the local_lock() critical section, or
+ * some nested program invocation in that path (e.g. in RT, preemption
+ * is enabled, hardirq (apart from NMIs) are still allowed, etc.).
+ */
+ if (!bpf_stream_page_local_lock(&flags))
+ return NULL;
+ elem = bpf_stream_page_reserve_elem(len);
+ bpf_stream_page_local_unlock(&flags);
+ return elem;
+}
+
+static struct bpf_stream_elem *bpf_stream_elem_alloc(int len)
+{
+ const int max_len = ARRAY_SIZE((struct bpf_bprintf_buffers){}.buf);
+ struct bpf_stream_elem *elem;
+
+ /*
+ * We may overflow, but we should never need more than one page size
+ * worth of memory. This can be lifted, but we'd need to adjust the
+ * other code to keep allocating more pages to overflow messages.
+ */
+ BUILD_BUG_ON(max_len > BPF_STREAM_PAGE_SZ);
+ /*
+ * Length denotes the amount of data to be written as part of stream element,
+ * thus includes '\0' byte. We're capped by how much bpf_bprintf_buffers can
+ * accomodate, therefore deny allocations that won't fit into them.
+ */
+ if (len < 0 || len > max_len)
+ return NULL;
+
+ elem = bpf_stream_elem_alloc_from_bpf_ma(len);
+ if (!elem)
+ elem = bpf_stream_elem_alloc_from_stream_page(len);
+ return elem;
+}
+
+__bpf_kfunc_start_defs();
+
+static int bpf_stream_push_str(struct bpf_stream *stream, const char *str, int len)
+{
+ struct bpf_stream_elem *elem, *next = NULL;
+ int room = 0, rem = 0;
+
+ /*
+ * Allocate a bpf_prog_stream_elem and push it to the bpf_prog_stream
+ * log, elements will be popped at once and reversed to print the log.
+ */
+ elem = bpf_stream_elem_alloc(len);
+ if (!elem)
+ return -ENOMEM;
+ room = elem->mem_slice.len;
+ if (elem->flags & BPF_STREAM_ELEM_F_NEXT) {
+ next = (struct bpf_stream_elem *)((unsigned long)elem->next & ~BPF_STREAM_ELEM_F_MASK);
+ rem = next->mem_slice.len;
+ }
+
+ memcpy(elem->str, str, room);
+ if (next)
+ memcpy(next->str, str + room, rem);
+
+ if (next) {
+ elem->node.next = &next->node;
+ next->node.next = NULL;
+
+ llist_add_batch(&elem->node, &next->node, &stream->log);
+ } else {
+ llist_add(&elem->node, &stream->log);
+ }
+
+ return 0;
+}
+
+__bpf_kfunc int bpf_stream_vprintk(struct bpf_stream *stream, const char *fmt__str, const void *args, u32 len__sz)
+{
+ struct bpf_bprintf_data data = {
+ .get_bin_args = true,
+ .get_buf = true,
+ };
+ u32 fmt_size = strlen(fmt__str) + 1;
+ u32 data_len = len__sz;
+ int ret, num_args;
+
+ if (data_len & 7 || data_len > MAX_BPRINTF_VARARGS * 8 ||
+ (data_len && !args))
+ return -EINVAL;
+ num_args = data_len / 8;
+
+ ret = bpf_bprintf_prepare(fmt__str, fmt_size, args, num_args, &data);
+ if (ret < 0)
+ return ret;
+
+ ret = bstr_printf(data.buf, MAX_BPRINTF_BUF, fmt__str, data.bin_args);
+ /* If the string was truncated, we only wrote until the size of buffer. */
+ ret = min_t(u32, ret + 1, MAX_BPRINTF_BUF);
+ ret = bpf_stream_push_str(stream, data.buf, ret);
+ bpf_bprintf_cleanup(&data);
+
+ return ret;
+}
+
+__bpf_kfunc struct bpf_stream *bpf_stream_get(enum bpf_stream_id stream_id, void *aux__ign)
+{
+ struct bpf_prog_aux *aux = aux__ign;
+
+ if (stream_id != BPF_STDOUT && stream_id != BPF_STDERR)
+ return NULL;
+ return &aux->stream[stream_id - 1];
+}
+
+__bpf_kfunc struct bpf_stream_elem_batch *bpf_stream_next_elem_batch(struct bpf_stream *stream)
+{
+ struct bpf_stream_elem_batch *elem_batch;
+ struct llist_node *batch;
+
+ if (llist_empty(&stream->log))
+ return NULL;
+ /* Allocate our cursor. */
+ migrate_disable();
+ elem_batch = bpf_mem_alloc(&bpf_global_ma, sizeof(*elem_batch));
+ migrate_enable();
+ if (!elem_batch)
+ return NULL;
+ /*
+ * This is the linearization point for the readers; every reader
+ * consumes the whole batch of messages queued on the stream at this
+ * point, and can see everything queued before this. Anything that is
+ * queued after this can only be observed if the next batch is
+ * requested.
+ */
+ batch = llist_del_all(&stream->log);
+ batch = llist_reverse_order(batch);
+ elem_batch->node = batch;
+ return elem_batch;
+}
+
+__bpf_kfunc struct bpf_stream_elem *
+bpf_stream_next_elem(struct bpf_stream_elem_batch *elem_batch)
+{
+ struct llist_node *node = elem_batch->node;
+ struct bpf_stream_elem *elem;
+
+ if (!node)
+ return NULL;
+ elem = container_of(node, struct bpf_stream_elem, node);
+ elem_batch->node = node->next;
+ return elem;
+}
+
+__bpf_kfunc void bpf_stream_free_elem(struct bpf_stream_elem *elem)
+{
+ struct bpf_stream_page *p;
+
+ if (elem->flags & BPF_STREAM_ELEM_F_PAGE) {
+ p = bpf_stream_page_from_elem(elem);
+ bpf_stream_page_put(p);
+ return;
+ }
+
+ migrate_disable();
+ bpf_mem_free(&bpf_global_ma, elem);
+ migrate_enable();
+}
+
+static void bpf_stream_free_list(struct llist_node *list)
+{
+ struct bpf_stream_elem *elem, *tmp;
+
+ llist_for_each_entry_safe(elem, tmp, list, node)
+ bpf_stream_free_elem(elem);
+}
+
+__bpf_kfunc void bpf_stream_free_elem_batch(struct bpf_stream_elem_batch *elem_batch)
+{
+
+ migrate_disable();
+ bpf_stream_free_list(elem_batch->node);
+ bpf_mem_free(&bpf_global_ma, elem_batch);
+ migrate_enable();
+}
+
+__bpf_kfunc struct bpf_stream *bpf_prog_stream_get(enum bpf_stream_id stream_id, u32 prog_id)
+{
+ struct bpf_stream *stream;
+ struct bpf_prog *prog;
+
+ prog = bpf_prog_by_id(prog_id);
+ if (!prog)
+ return NULL;
+ stream = bpf_stream_get(stream_id, prog->aux);
+ if (!stream)
+ bpf_prog_put(prog);
+ return stream;
+}
+
+__bpf_kfunc void bpf_prog_stream_put(struct bpf_stream *stream)
+{
+ enum bpf_stream_id stream_id = stream->stream_id;
+ struct bpf_prog *prog;
+
+ prog = container_of(stream, struct bpf_prog_aux, stream[stream_id - 1])->prog;
+ bpf_prog_put(prog);
+}
+
+void bpf_prog_stream_init(struct bpf_prog *prog)
+{
+ prog->aux->stream[0].stream_id = BPF_STDOUT;
+ init_llist_head(&prog->aux->stream[0].log);
+
+ prog->aux->stream[1].stream_id = BPF_STDERR;
+ init_llist_head(&prog->aux->stream[1].log);
+}
+
+void bpf_prog_stream_free(struct bpf_prog *prog)
+{
+ struct llist_node *list;
+
+ list = llist_del_all(&prog->aux->stream[0].log);
+ bpf_stream_free_list(list);
+
+ list = llist_del_all(&prog->aux->stream[1].log);
+ bpf_stream_free_list(list);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(stream_kfunc_set)
+BTF_ID_FLAGS(func, bpf_stream_get, KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_stream_vprintk, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(stream_kfunc_set)
+
+BTF_KFUNCS_START(stream_consumer_kfunc_set)
+BTF_ID_FLAGS(func, bpf_stream_next_elem_batch, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_stream_free_elem_batch, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_stream_next_elem, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_stream_free_elem, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_prog_stream_get, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_prog_stream_put, KF_RELEASE)
+BTF_KFUNCS_END(stream_consumer_kfunc_set)
+
+BTF_ID_LIST(bpf_stream_dtor_ids)
+BTF_ID(struct, bpf_stream_elem_batch)
+BTF_ID(func, bpf_stream_free_elem_batch)
+
+static const struct btf_kfunc_id_set bpf_stream_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &stream_kfunc_set,
+};
+
+static const struct btf_kfunc_id_set bpf_stream_consumer_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &stream_consumer_kfunc_set,
+};
+
+static int __init bpf_stream_kfunc_init(void)
+{
+ const struct btf_id_dtor_kfunc bpf_stream_dtors[] = {
+ {
+ .btf_id = bpf_stream_dtor_ids[0],
+ .kfunc_btf_id = bpf_stream_dtor_ids[1],
+ },
+ };
+ int ret;
+
+ ret = register_btf_id_dtor_kfuncs(bpf_stream_dtors, ARRAY_SIZE(bpf_stream_dtors), THIS_MODULE);
+ ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &bpf_stream_kfunc_set);
+ return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &bpf_stream_consumer_kfunc_set);
+}
+late_initcall(bpf_stream_kfunc_init);
+
+int bpf_prog_stderr_printk(struct bpf_prog *prog, const char *fmt, ...)
+{
+ struct bpf_stream *stream = bpf_stream_get(BPF_STDERR, prog->aux);
+ struct bpf_bprintf_buffers *buf;
+ va_list args;
+ int ret;
+
+ if (bpf_try_get_buffers(&buf))
+ return -EBUSY;
+
+ va_start(args, fmt);
+ ret = vsnprintf(buf->buf, ARRAY_SIZE(buf->buf), fmt, args);
+ va_end(args);
+ /* If the string was truncated, we only wrote until the size of buffer. */
+ ret = min_t(u32, ret + 1, ARRAY_SIZE(buf->buf));
+ ret = bpf_stream_push_str(stream, buf->buf, ret);
+ bpf_put_buffers();
+ return ret;
+}
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 9794446bc8c6..a8e8286a3d95 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -575,7 +575,7 @@ static bool can_alloc_pages(void)
!IS_ENABLED(CONFIG_PREEMPT_RT);
}
-static struct page *__bpf_alloc_page(int nid)
+struct page *__bpf_alloc_page(int nid)
{
if (!can_alloc_pages())
return try_alloc_pages(nid, 0);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 26aa70cd5734..a1ac856ad078 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12126,6 +12126,7 @@ enum special_kfunc_type {
KF_bpf_res_spin_lock_irqsave,
KF_bpf_res_spin_unlock_irqrestore,
KF_bpf_dynptr_from_mem_slice,
+ KF_bpf_stream_get,
};
BTF_SET_START(special_kfunc_set)
@@ -12220,6 +12221,7 @@ BTF_ID(func, bpf_res_spin_unlock)
BTF_ID(func, bpf_res_spin_lock_irqsave)
BTF_ID(func, bpf_res_spin_unlock_irqrestore)
BTF_ID(func, bpf_dynptr_from_mem_slice)
+BTF_ID(func, bpf_stream_get)
static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
{
@@ -13883,10 +13885,11 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
regs[BPF_REG_0].type = PTR_TO_BTF_ID;
regs[BPF_REG_0].btf_id = ptr_type_id;
- if (meta.func_id == special_kfunc_list[KF_bpf_get_kmem_cache])
+ if (meta.func_id == special_kfunc_list[KF_bpf_get_kmem_cache]) {
regs[BPF_REG_0].type |= PTR_UNTRUSTED;
-
- if (is_iter_next_kfunc(&meta)) {
+ } else if (meta.func_id == special_kfunc_list[KF_bpf_stream_get]) {
+ regs[BPF_REG_0].type |= PTR_TRUSTED;
+ } else if (is_iter_next_kfunc(&meta)) {
struct bpf_reg_state *cur_iter;
cur_iter = get_iter_from_state(env->cur_state, &meta);
@@ -21520,8 +21523,10 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
*cnt = 1;
- } else if (is_bpf_wq_set_callback_impl_kfunc(desc->func_id)) {
- struct bpf_insn ld_addrs[2] = { BPF_LD_IMM64(BPF_REG_4, (long)env->prog->aux) };
+ } else if (is_bpf_wq_set_callback_impl_kfunc(desc->func_id) ||
+ desc->func_id == special_kfunc_list[KF_bpf_stream_get]) {
+ u32 regno = is_bpf_wq_set_callback_impl_kfunc(desc->func_id) ? BPF_REG_4 : BPF_REG_2;
+ struct bpf_insn ld_addrs[2] = { BPF_LD_IMM64(regno, (long)env->prog->aux) };
insn_buf[0] = ld_addrs[0];
insn_buf[1] = ld_addrs[1];
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams Kumar Kartikeya Dwivedi
@ 2025-04-16 21:49 ` Eduard Zingerman
2025-04-17 20:06 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-16 21:49 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> Introduce a set of kfuncs to implement per BPF program stdout and stderr
> streams. This is implemented as a linked list of dynamically allocated
> strings whenever a message is printed. This can be done by the kernel or
> the program itself using the bpf_prog_stream_vprintk kfunc. In-kernel
> wrappers are provided over streams to ease the process.
>
> The idea is that everytime messages need to be dumped, the reader would
> pull out the whole batch of messages from the stream at once, and then
> pop each string one by one and start printing it out (how exactly is
> left up to the BPF program reading the log, but usually it will be
> streaming data back into a ring buffer that is consumed by user space).
>
> The use of a lockless list warrants that we be careful about the
> ordering of messages. When being added, the order maintained is new to
> old, therefore after deletion, we must reverse the list before iterating
> and popping out elements from it.
>
> Overall, this infrastructure provides NMI-safe any context printing
> built on top of the NMI-safe any context bpf_mem_alloc() interface.
>
> Later patches will add support for printing splats in case of BPF arena
> page faults, rqspinlock deadlocks, and cond_break timeouts.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
[...]
> diff --git a/kernel/bpf/stream.c b/kernel/bpf/stream.c
> new file mode 100644
> index 000000000000..2019ce134310
> --- /dev/null
> +++ b/kernel/bpf/stream.c
[...]
> +static int bpf_stream_page_check_room(struct bpf_stream_page *stream_page, int len)
^^^^^^^
len is never used
> +{
> + int min = offsetof(struct bpf_stream_elem, str[0]);
> + int consumed = stream_page->consumed;
> + int total = BPF_STREAM_PAGE_SZ;
> + int rem = max(0, total - consumed - min);
> +
> + /* Let's give room of at least 8 bytes. */
> + WARN_ON_ONCE(rem % 8 != 0);
> + return rem < 8 ? 0 : rem;
> +}
[...]
> +static struct bpf_stream_elem *bpf_stream_elem_alloc(int len)
> +{
> + const int max_len = ARRAY_SIZE((struct bpf_bprintf_buffers){}.buf);
> + struct bpf_stream_elem *elem;
> +
> + /*
> + * We may overflow, but we should never need more than one page size
> + * worth of memory. This can be lifted, but we'd need to adjust the
> + * other code to keep allocating more pages to overflow messages.
> + */
> + BUILD_BUG_ON(max_len > BPF_STREAM_PAGE_SZ);
> + /*
> + * Length denotes the amount of data to be written as part of stream element,
> + * thus includes '\0' byte. We're capped by how much bpf_bprintf_buffers can
> + * accomodate, therefore deny allocations that won't fit into them.
> + */
> + if (len < 0 || len > max_len)
> + return NULL;
> +
> + elem = bpf_stream_elem_alloc_from_bpf_ma(len);
> + if (!elem)
> + elem = bpf_stream_elem_alloc_from_stream_page(len);
So, the stream page is a backup mechanism, right?
I'm curious, did you compare how many messages are dropped if there is
no such backup? Also, how much memory is wasted if there is no
"spillover" mechanism (BPF_STREAM_ELEM_F_NEXT).
Are these complications absolutely necessary?
> + return elem;
> +}
> +
> +__bpf_kfunc_start_defs();
> +
> +static int bpf_stream_push_str(struct bpf_stream *stream, const char *str, int len)
> +{
This function accumulates elements in &stream->log w/o a cap.
Why is this not a problem if e.g. user space never flushes streams for
the program?
> + struct bpf_stream_elem *elem, *next = NULL;
> + int room = 0, rem = 0;
> +
> + /*
> + * Allocate a bpf_prog_stream_elem and push it to the bpf_prog_stream
> + * log, elements will be popped at once and reversed to print the log.
> + */
> + elem = bpf_stream_elem_alloc(len);
> + if (!elem)
> + return -ENOMEM;
> + room = elem->mem_slice.len;
> + if (elem->flags & BPF_STREAM_ELEM_F_NEXT) {
> + next = (struct bpf_stream_elem *)((unsigned long)elem->next & ~BPF_STREAM_ELEM_F_MASK);
> + rem = next->mem_slice.len;
> + }
> +
> + memcpy(elem->str, str, room);
> + if (next)
> + memcpy(next->str, str + room, rem);
> +
> + if (next) {
> + elem->node.next = &next->node;
> + next->node.next = NULL;
> +
> + llist_add_batch(&elem->node, &next->node, &stream->log);
> + } else {
> + llist_add(&elem->node, &stream->log);
> + }
> +
> + return 0;
> +}
[...]
> +BTF_KFUNCS_START(stream_consumer_kfunc_set)
> +BTF_ID_FLAGS(func, bpf_stream_next_elem_batch, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_stream_free_elem_batch, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_stream_next_elem, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_stream_free_elem, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_prog_stream_get, KF_ACQUIRE | KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_prog_stream_put, KF_RELEASE)
> +BTF_KFUNCS_END(stream_consumer_kfunc_set)
This is a complicated API.
If we anticipate that users intend to write this info to ring buffers
maybe just provide a function doing that and do not expose complete API?
[...]
I'll continue reading the patch-set tomorrow...
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams
2025-04-16 21:49 ` Eduard Zingerman
@ 2025-04-17 20:06 ` Kumar Kartikeya Dwivedi
2025-04-22 1:42 ` Alexei Starovoitov
2025-04-22 22:09 ` Eduard Zingerman
0 siblings, 2 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-17 20:06 UTC (permalink / raw)
To: Eduard Zingerman
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
On Wed, 16 Apr 2025 at 23:49, Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>
> > Introduce a set of kfuncs to implement per BPF program stdout and stderr
> > streams. This is implemented as a linked list of dynamically allocated
> > strings whenever a message is printed. This can be done by the kernel or
> > the program itself using the bpf_prog_stream_vprintk kfunc. In-kernel
> > wrappers are provided over streams to ease the process.
> >
> > The idea is that everytime messages need to be dumped, the reader would
> > pull out the whole batch of messages from the stream at once, and then
> > pop each string one by one and start printing it out (how exactly is
> > left up to the BPF program reading the log, but usually it will be
> > streaming data back into a ring buffer that is consumed by user space).
> >
> > The use of a lockless list warrants that we be careful about the
> > ordering of messages. When being added, the order maintained is new to
> > old, therefore after deletion, we must reverse the list before iterating
> > and popping out elements from it.
> >
> > Overall, this infrastructure provides NMI-safe any context printing
> > built on top of the NMI-safe any context bpf_mem_alloc() interface.
> >
> > Later patches will add support for printing splats in case of BPF arena
> > page faults, rqspinlock deadlocks, and cond_break timeouts.
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
>
> [...]
>
> > diff --git a/kernel/bpf/stream.c b/kernel/bpf/stream.c
> > new file mode 100644
> > index 000000000000..2019ce134310
> > --- /dev/null
> > +++ b/kernel/bpf/stream.c
>
> [...]
>
> > +static int bpf_stream_page_check_room(struct bpf_stream_page *stream_page, int len)
> ^^^^^^^
> len is never used
Yeah, I will get rid of the parameter.
> > +{
> > + int min = offsetof(struct bpf_stream_elem, str[0]);
> > + int consumed = stream_page->consumed;
> > + int total = BPF_STREAM_PAGE_SZ;
> > + int rem = max(0, total - consumed - min);
> > +
> > + /* Let's give room of at least 8 bytes. */
> > + WARN_ON_ONCE(rem % 8 != 0);
> > + return rem < 8 ? 0 : rem;
> > +}
>
> [...]
>
> > +static struct bpf_stream_elem *bpf_stream_elem_alloc(int len)
> > +{
> > + const int max_len = ARRAY_SIZE((struct bpf_bprintf_buffers){}.buf);
> > + struct bpf_stream_elem *elem;
> > +
> > + /*
> > + * We may overflow, but we should never need more than one page size
> > + * worth of memory. This can be lifted, but we'd need to adjust the
> > + * other code to keep allocating more pages to overflow messages.
> > + */
> > + BUILD_BUG_ON(max_len > BPF_STREAM_PAGE_SZ);
> > + /*
> > + * Length denotes the amount of data to be written as part of stream element,
> > + * thus includes '\0' byte. We're capped by how much bpf_bprintf_buffers can
> > + * accomodate, therefore deny allocations that won't fit into them.
> > + */
> > + if (len < 0 || len > max_len)
> > + return NULL;
> > +
> > + elem = bpf_stream_elem_alloc_from_bpf_ma(len);
> > + if (!elem)
> > + elem = bpf_stream_elem_alloc_from_stream_page(len);
>
> So, the stream page is a backup mechanism, right?
> I'm curious, did you compare how many messages are dropped if there is
> no such backup?
You can try disabling the backup allocation path. You will notice that
the selftest starts failing.
When IRQs are disabled (so arena page fault path or rqspinlock with
_irqsave), we won't be
able to refill bpf_mem_alloc() cache, and will end up dropping the
message on the floor.
> Also, how much memory is wasted if there is no
> "spillover" mechanism (BPF_STREAM_ELEM_F_NEXT).
> Are these complications absolutely necessary?
There's no way to know.
It depends on the free space and the message size, right.
Say we have 128 bytes left in the page, the message spans 156 bytes,
is it better to write 128 bytes and move the rest to the new page?
Say we have 100K messages sitting in the log that followed this
pattern, let's take wasted space on average per message as 32 bytes,
then that's 3.2 MB of unused memory.
>
> > + return elem;
> > +}
> > +
> > +__bpf_kfunc_start_defs();
> > +
> > +static int bpf_stream_push_str(struct bpf_stream *stream, const char *str, int len)
> > +{
>
> This function accumulates elements in &stream->log w/o a cap.
> Why is this not a problem if e.g. user space never flushes streams for
> the program?
It is certainly a problem, hence this bit in cover letter:
> * Enforcing memory consumption limits on the log usage, so that
in-flight log memory is bounded at runtime.
For now I will do a capacity counter per log, but maybe we want to tie
this to memcg? I am not sure we can charge it in any context, but was
hoping to get more opinions.
>
> > + struct bpf_stream_elem *elem, *next = NULL;
> > + int room = 0, rem = 0;
> > +
> > + /*
> > + * Allocate a bpf_prog_stream_elem and push it to the bpf_prog_stream
> > + * log, elements will be popped at once and reversed to print the log.
> > + */
> > + elem = bpf_stream_elem_alloc(len);
> > + if (!elem)
> > + return -ENOMEM;
> > + room = elem->mem_slice.len;
> > + if (elem->flags & BPF_STREAM_ELEM_F_NEXT) {
> > + next = (struct bpf_stream_elem *)((unsigned long)elem->next & ~BPF_STREAM_ELEM_F_MASK);
> > + rem = next->mem_slice.len;
> > + }
> > +
> > + memcpy(elem->str, str, room);
> > + if (next)
> > + memcpy(next->str, str + room, rem);
> > +
> > + if (next) {
> > + elem->node.next = &next->node;
> > + next->node.next = NULL;
> > +
> > + llist_add_batch(&elem->node, &next->node, &stream->log);
> > + } else {
> > + llist_add(&elem->node, &stream->log);
> > + }
> > +
> > + return 0;
> > +}
>
> [...]
>
> > +BTF_KFUNCS_START(stream_consumer_kfunc_set)
> > +BTF_ID_FLAGS(func, bpf_stream_next_elem_batch, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> > +BTF_ID_FLAGS(func, bpf_stream_free_elem_batch, KF_RELEASE)
> > +BTF_ID_FLAGS(func, bpf_stream_next_elem, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> > +BTF_ID_FLAGS(func, bpf_stream_free_elem, KF_RELEASE)
> > +BTF_ID_FLAGS(func, bpf_prog_stream_get, KF_ACQUIRE | KF_RET_NULL)
> > +BTF_ID_FLAGS(func, bpf_prog_stream_put, KF_RELEASE)
> > +BTF_KFUNCS_END(stream_consumer_kfunc_set)
>
> This is a complicated API.
> If we anticipate that users intend to write this info to ring buffers
> maybe just provide a function doing that and do not expose complete API?
I don't think anyone will use these functions directly, though they
can if they want to.
It's meant to be hidden behind bpftool, and macros to print stuff like
bpf_printk().
We cannot pop one message at a time, since they are not in FIFO order.
So we need to splice out the whole batch queued and reverse it, before
popping things.
It's a consequence of using lockless lists.
The other option is using a lock to protect the list, but using
rqspinlock to then report messages about rqspinlock sounds like a
circular dependency.
>
> [...]
>
> I'll continue reading the patch-set tomorrow...
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams
2025-04-17 20:06 ` Kumar Kartikeya Dwivedi
@ 2025-04-22 1:42 ` Alexei Starovoitov
2025-04-22 2:30 ` Kumar Kartikeya Dwivedi
2025-04-22 22:09 ` Eduard Zingerman
1 sibling, 1 reply; 44+ messages in thread
From: Alexei Starovoitov @ 2025-04-22 1:42 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Eduard Zingerman, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden,
kkd, Kernel Team
> > > +BTF_KFUNCS_START(stream_consumer_kfunc_set)
> > > +BTF_ID_FLAGS(func, bpf_stream_next_elem_batch, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> > > +BTF_ID_FLAGS(func, bpf_stream_free_elem_batch, KF_RELEASE)
> > > +BTF_ID_FLAGS(func, bpf_stream_next_elem, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> > > +BTF_ID_FLAGS(func, bpf_stream_free_elem, KF_RELEASE)
> > > +BTF_ID_FLAGS(func, bpf_prog_stream_get, KF_ACQUIRE | KF_RET_NULL)
> > > +BTF_ID_FLAGS(func, bpf_prog_stream_put, KF_RELEASE)
> > > +BTF_KFUNCS_END(stream_consumer_kfunc_set)
> >
> > This is a complicated API.
> > If we anticipate that users intend to write this info to ring buffers
> > maybe just provide a function doing that and do not expose complete API?
>
> I don't think anyone will use these functions directly, though they
> can if they want to.
> It's meant to be hidden behind bpftool, and macros to print stuff like
> bpf_printk().
I have to agree with Eduard here.
The api feels overdesigned. I don't see how it can be reused
anywhere outside of bpftool.
Instead of introducing mem_slice concept and above kfuncs,
let's have one special kfunc that will take prog_id, stream_id
and will move things into dynptr returning EAGAIN/ENOENT when necessary.
EFAULT shouldn't be possible when the whole
SEC("syscall")
int bpftool_dump_prog_stream(void *ctx) {..}
will be one kfunc.
bpf_stream_dtor_ids won't be needed either.
Hard coding such a big logic into one kfunc is not pretty,
and smaller kfuncs/building blocks would be preferred any day,
but these stream_consumer_kfunc_set just don't look reusable.
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams
2025-04-22 1:42 ` Alexei Starovoitov
@ 2025-04-22 2:30 ` Kumar Kartikeya Dwivedi
2025-04-22 17:44 ` Alexei Starovoitov
0 siblings, 1 reply; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-22 2:30 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Eduard Zingerman, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden,
kkd, Kernel Team
On Tue, 22 Apr 2025 at 03:42, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> > > > +BTF_KFUNCS_START(stream_consumer_kfunc_set)
> > > > +BTF_ID_FLAGS(func, bpf_stream_next_elem_batch, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> > > > +BTF_ID_FLAGS(func, bpf_stream_free_elem_batch, KF_RELEASE)
> > > > +BTF_ID_FLAGS(func, bpf_stream_next_elem, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> > > > +BTF_ID_FLAGS(func, bpf_stream_free_elem, KF_RELEASE)
> > > > +BTF_ID_FLAGS(func, bpf_prog_stream_get, KF_ACQUIRE | KF_RET_NULL)
> > > > +BTF_ID_FLAGS(func, bpf_prog_stream_put, KF_RELEASE)
> > > > +BTF_KFUNCS_END(stream_consumer_kfunc_set)
> > >
> > > This is a complicated API.
> > > If we anticipate that users intend to write this info to ring buffers
> > > maybe just provide a function doing that and do not expose complete API?
> >
> > I don't think anyone will use these functions directly, though they
> > can if they want to.
> > It's meant to be hidden behind bpftool, and macros to print stuff like
> > bpf_printk().
>
> I have to agree with Eduard here.
> The api feels overdesigned. I don't see how it can be reused
> anywhere outside of bpftool.
Can you explain why?
Apart from syscall prog requirement which is easy to lift (but didn't
bother for RFC).
> Instead of introducing mem_slice concept and above kfuncs,
> let's have one special kfunc that will take prog_id, stream_id
> and will move things into dynptr returning EAGAIN/ENOENT when necessary.
> EFAULT shouldn't be possible when the whole
> SEC("syscall")
> int bpftool_dump_prog_stream(void *ctx) {..}
> will be one kfunc.
> bpf_stream_dtor_ids won't be needed either.
> Hard coding such a big logic into one kfunc is not pretty,
> and smaller kfuncs/building blocks would be preferred any day,
> but these stream_consumer_kfunc_set just don't look reusable.
I don't follow. How is "shove all messages into ringbuf" more reusable?
I can sympathize with the complexity argument Eduard made, ideally
we'd have one pop() operation to pull out things.
We can discuss how to make things simpler.
With some documentation, it is not hard to follow:
Everytime you consume something from a stream, you splice out a batch
of N messages, you can then splice out each message individually from
a batch.
Both the batch and individual elements need to be freed. You can keep
processing the batch until it's finished, but the better idea is to
pace oneself and do N at a time.
In user space, you usually fill up your buffer's length (say
PAGE_SIZE) when reading something, then delimit and parse and draw
message boundaries when parsing.
You can flip the API in this patch around and use it to stream things
into the program.
We just need to add one more element in prog->aux->stream[BPF_STDIN = 0].
Even without that, you can make it work on stdout. The writing is bidirectional.
User space side can obtain the stream of a prog and printk to it.
Kernel side can pop and parse messages.
Probably not a bright idea to do so when in the fast path of the
kernel to read strings, but in other cases, why not?
It can certainly make sense for self contained programs that want to
consume options from the cmdline generically.
The generic loader can just forward the string when invoked in user
space into the stream, and call some init/main function for the
program before attaching it.
Your BPF init/main gets either raw stream, or prepared argv, argc
blocks that the loader's infra can set up for you.
At that point all logic in the program can be self-contained, you can
probably just write everything in BPF C, including setting config
options etc.
There would be no need to put these strings into a ringbuf, it would
just be for the BPF side to parse and make sense of.
But anyway, I will not push back too much. If everyone thinks "stream
data to ringbuf kfunc" is the way to go, I'll make the change in v2.
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams
2025-04-22 2:30 ` Kumar Kartikeya Dwivedi
@ 2025-04-22 17:44 ` Alexei Starovoitov
0 siblings, 0 replies; 44+ messages in thread
From: Alexei Starovoitov @ 2025-04-22 17:44 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Eduard Zingerman, bpf, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden,
kkd, Kernel Team
On Mon, Apr 21, 2025 at 7:31 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, 22 Apr 2025 at 03:42, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > > > > +BTF_KFUNCS_START(stream_consumer_kfunc_set)
> > > > > +BTF_ID_FLAGS(func, bpf_stream_next_elem_batch, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> > > > > +BTF_ID_FLAGS(func, bpf_stream_free_elem_batch, KF_RELEASE)
> > > > > +BTF_ID_FLAGS(func, bpf_stream_next_elem, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> > > > > +BTF_ID_FLAGS(func, bpf_stream_free_elem, KF_RELEASE)
> > > > > +BTF_ID_FLAGS(func, bpf_prog_stream_get, KF_ACQUIRE | KF_RET_NULL)
> > > > > +BTF_ID_FLAGS(func, bpf_prog_stream_put, KF_RELEASE)
> > > > > +BTF_KFUNCS_END(stream_consumer_kfunc_set)
> > > >
> > > > This is a complicated API.
> > > > If we anticipate that users intend to write this info to ring buffers
> > > > maybe just provide a function doing that and do not expose complete API?
> > >
> > > I don't think anyone will use these functions directly, though they
> > > can if they want to.
> > > It's meant to be hidden behind bpftool, and macros to print stuff like
> > > bpf_printk().
> >
> > I have to agree with Eduard here.
> > The api feels overdesigned. I don't see how it can be reused
> > anywhere outside of bpftool.
>
> Can you explain why?
Because I don't see a second user for these bpf_stream_* fetchers.
A special prog embedded in bpftool will be the only one afaict.
> Apart from syscall prog requirement which is easy to lift (but didn't
> bother for RFC).
>
> > Instead of introducing mem_slice concept and above kfuncs,
> > let's have one special kfunc that will take prog_id, stream_id
> > and will move things into dynptr returning EAGAIN/ENOENT when necessary.
> > EFAULT shouldn't be possible when the whole
> > SEC("syscall")
> > int bpftool_dump_prog_stream(void *ctx) {..}
> > will be one kfunc.
> > bpf_stream_dtor_ids won't be needed either.
> > Hard coding such a big logic into one kfunc is not pretty,
> > and smaller kfuncs/building blocks would be preferred any day,
> > but these stream_consumer_kfunc_set just don't look reusable.
>
> I don't follow. How is "shove all messages into ringbuf" more reusable?
> I can sympathize with the complexity argument Eduard made, ideally
> we'd have one pop() operation to pull out things.
That's exactly what I'm proposing. One pop-like kfunc
with prog_id, stream_id arguments.
> We can discuss how to make things simpler.
>
> With some documentation, it is not hard to follow:
> Everytime you consume something from a stream, you splice out a batch
> of N messages, you can then splice out each message individually from
> a batch.
> Both the batch and individual elements need to be freed. You can keep
> processing the batch until it's finished, but the better idea is to
> pace oneself and do N at a time.
Sure. These details can be hidden in pop() kfunc.
> In user space, you usually fill up your buffer's length (say
> PAGE_SIZE) when reading something, then delimit and parse and draw
> message boundaries when parsing.
>
> You can flip the API in this patch around and use it to stream things
> into the program.
> We just need to add one more element in prog->aux->stream[BPF_STDIN = 0].
> Even without that, you can make it work on stdout. The writing is bidirectional.
Hmm. I don't see the writing part here.
Currently it only consumes llist.
There is bpf_stream_vprintk() to write into it from bpf prog.
That part makes sense.
bpf_stream_push_str() can also be a kfunc, but the operation
here is _push_.
imo it makes sense to pair it with single _pop_,
instead of next_elem, free_elem
> User space side can obtain the stream of a prog and printk to it.
> Kernel side can pop and parse messages.
> Probably not a bright idea to do so when in the fast path of the
> kernel to read strings, but in other cases, why not?
One bpf prog does bpf_stream_vprintk() while another pop-like.
That makes sense.
The manual iterator over messages that requires kptr_xchg
to be correct is what I see as overdesign.
> But anyway, I will not push back too much. If everyone thinks "stream
> data to ringbuf kfunc" is the way to go, I'll make the change in v2.
That's not what I'm suggesting. stream is a concept makes sense.
bpf_stream_vprintk() also makes sense, and it will work to push
things from "user space bpf prog" into stdin of other bpf prog,
and to print into stdout from that other bpf prog.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams
2025-04-17 20:06 ` Kumar Kartikeya Dwivedi
2025-04-22 1:42 ` Alexei Starovoitov
@ 2025-04-22 22:09 ` Eduard Zingerman
1 sibling, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-22 22:09 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
[...]
> > > +BTF_KFUNCS_START(stream_consumer_kfunc_set)
> > > +BTF_ID_FLAGS(func, bpf_stream_next_elem_batch, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> > > +BTF_ID_FLAGS(func, bpf_stream_free_elem_batch, KF_RELEASE)
> > > +BTF_ID_FLAGS(func, bpf_stream_next_elem, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)
> > > +BTF_ID_FLAGS(func, bpf_stream_free_elem, KF_RELEASE)
> > > +BTF_ID_FLAGS(func, bpf_prog_stream_get, KF_ACQUIRE | KF_RET_NULL)
> > > +BTF_ID_FLAGS(func, bpf_prog_stream_put, KF_RELEASE)
> > > +BTF_KFUNCS_END(stream_consumer_kfunc_set)
> >
> > This is a complicated API.
> > If we anticipate that users intend to write this info to ring buffers
> > maybe just provide a function doing that and do not expose complete API?
>
> I don't think anyone will use these functions directly, though they
> can if they want to.
> It's meant to be hidden behind bpftool, and macros to print stuff like
> bpf_printk().
>
> We cannot pop one message at a time, since they are not in FIFO order.
> So we need to splice out the whole batch queued and reverse it, before
> popping things.
> It's a consequence of using lockless lists.
>
> The other option is using a lock to protect the list, but using
> rqspinlock to then report messages about rqspinlock sounds like a
> circular dependency.
The API exposes 6 kfuncs and 3 data structures.
If things are exposed these things would be used, I wouldn't assume
that bpftool would be the only user. I'm sure people would craft their
own monitoring solutions.
Imo, exposing 9 entities is a lot. What I don't like about it,
is that API is not abstract:
- user cares about getting program log to some buffer, stream or file;
- user does not care how the strings are split internally in order to
make logging efficient.
If at some point we decide to change implementation this API would be
hard to preserve.
Hence I suggest to instead provide a set of kfuncs that would drain
message queue into user provided ringbuf or buffer, w/o exposing
details.
A question: why exposing this functionality as kfuncs and not BPF
syscall commands?
^ permalink raw reply [flat|nested] 44+ messages in thread
* [RFC PATCH bpf-next/net v1 08/13] bpf: Add dump_stack() analogue to print to BPF stderr
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
` (6 preceding siblings ...)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 07/13] bpf: Introduce per-prog stdout/stderr streams Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-22 21:07 ` Eduard Zingerman
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 09/13] bpf: Report may_goto timeout " Kumar Kartikeya Dwivedi
` (4 subsequent siblings)
12 siblings, 1 reply; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Introduce a kernel function which is the analogue of dump_stack()
printing some useful information and the stack trace. This is not
exposed to BPF programs yet, but can be made available in the future.
When we have a PC for a BPF program in the stack trace, also
additionally output the filename and line number to make the trace
helpful. The rest of the trace can be passed into ./decode_stacktrace.sh
to obtain the line numbers for kernel symbols.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 2 +
kernel/bpf/stream.c | 99 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 101 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ae2ddd3bdea1..d4687da63645 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3588,6 +3588,8 @@ void bpf_prog_stream_init(struct bpf_prog *prog);
void bpf_prog_stream_free(struct bpf_prog *prog);
__printf(2, 3)
int bpf_prog_stderr_printk(struct bpf_prog *prog, const char *fmt, ...);
+struct bpf_prog *bpf_prog_find_from_stack(void);
+void bpf_prog_stderr_dump_stack(struct bpf_prog *prog);
#ifdef CONFIG_BPF_LSM
void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
diff --git a/kernel/bpf/stream.c b/kernel/bpf/stream.c
index 2019ce134310..f5d8d2841a13 100644
--- a/kernel/bpf/stream.c
+++ b/kernel/bpf/stream.c
@@ -2,6 +2,7 @@
/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
#include <linux/bpf.h>
+#include <linux/filter.h>
#include <linux/bpf_mem_alloc.h>
#include <linux/percpu.h>
#include <linux/refcount.h>
@@ -494,3 +495,101 @@ int bpf_prog_stderr_printk(struct bpf_prog *prog, const char *fmt, ...)
bpf_put_buffers();
return ret;
}
+
+struct walk_stack_ctx {
+ struct bpf_prog *prog;
+};
+
+static bool find_from_stack_cb(void *cookie, u64 ip, u64 sp, u64 bp)
+{
+ struct walk_stack_ctx *ctxp = cookie;
+ struct bpf_prog *prog;
+
+ if (!is_bpf_text_address(ip))
+ return true;
+ prog = bpf_prog_ksym_find(ip);
+ if (bpf_is_subprog(prog))
+ return true;
+ ctxp->prog = prog;
+ return false;
+}
+
+struct bpf_prog *bpf_prog_find_from_stack(void)
+{
+ struct walk_stack_ctx ctx = {};
+
+ arch_bpf_stack_walk(find_from_stack_cb, &ctx);
+ return ctx.prog;
+}
+
+struct dump_stack_ctx {
+ struct bpf_prog *prog;
+};
+
+static int dump_stack_bpf_linfo(u64 ip, const char **filep)
+{
+ struct bpf_prog *prog = bpf_prog_ksym_find(ip);
+ int idx = -1, insn_start, insn_end, len;
+ struct bpf_line_info *linfo;
+ void **jited_linfo;
+ struct btf *btf;
+
+ btf = prog->aux->btf;
+ linfo = prog->aux->linfo;
+ jited_linfo = prog->aux->jited_linfo;
+
+ if (!btf || !linfo || !prog->aux->jited_linfo)
+ return -1;
+ len = prog->aux->func ? prog->aux->func[prog->aux->func_idx]->len : prog->len;
+
+ linfo = &prog->aux->linfo[prog->aux->linfo_idx];
+ jited_linfo = &prog->aux->jited_linfo[prog->aux->linfo_idx];
+
+ insn_start = linfo[0].insn_off;
+ insn_end = insn_start + len;
+
+ for (int i = 0; linfo[i].insn_off >= insn_start && linfo[i].insn_off < insn_end; i++) {
+ if (jited_linfo[i] >= (void *)ip)
+ break;
+ idx = i;
+ }
+
+ if (idx == -1)
+ return -1;
+
+ *filep = btf_name_by_offset(btf, linfo[idx].file_name_off);
+ *filep = strrchr(*filep, '/') ?: *filep;
+ *filep += 1;
+
+ return BPF_LINE_INFO_LINE_NUM(linfo[idx].line_col);
+}
+
+static bool dump_stack_cb(void *cookie, u64 ip, u64 sp, u64 bp)
+{
+ struct dump_stack_ctx *ctxp = cookie;
+ const char *file = "";
+ int num;
+
+ if (is_bpf_text_address(ip)) {
+ num = dump_stack_bpf_linfo(ip, &file);
+ if (num == -1)
+ goto end;
+ bpf_prog_stderr_printk(ctxp->prog, " %pS: [%s:%d]\n", (void *)ip, file, num);
+ return true;
+ }
+end:
+ bpf_prog_stderr_printk(ctxp->prog, " %pS\n", (void *)ip);
+ return true;
+}
+
+void bpf_prog_stderr_dump_stack(struct bpf_prog *prog)
+{
+ struct dump_stack_ctx ctx = { .prog = prog };
+
+ bpf_prog_stderr_printk(prog, "CPU: %d UID: %d PID: %d Comm: %s\n",
+ raw_smp_processor_id(), __kuid_val(current_real_cred()->euid),
+ current->pid, current->comm);
+ bpf_prog_stderr_printk(prog, "Call trace:\n");
+ arch_bpf_stack_walk(dump_stack_cb, &ctx);
+ bpf_prog_stderr_printk(prog, "\n");
+}
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 08/13] bpf: Add dump_stack() analogue to print to BPF stderr
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 08/13] bpf: Add dump_stack() analogue to print to BPF stderr Kumar Kartikeya Dwivedi
@ 2025-04-22 21:07 ` Eduard Zingerman
0 siblings, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-22 21:07 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> Introduce a kernel function which is the analogue of dump_stack()
> printing some useful information and the stack trace. This is not
> exposed to BPF programs yet, but can be made available in the future.
>
> When we have a PC for a BPF program in the stack trace, also
> additionally output the filename and line number to make the trace
> helpful. The rest of the trace can be passed into ./decode_stacktrace.sh
> to obtain the line numbers for kernel symbols.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
Seems to work as advertised.
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
[...]
> +static int dump_stack_bpf_linfo(u64 ip, const char **filep)
> +{
> + struct bpf_prog *prog = bpf_prog_ksym_find(ip);
> + int idx = -1, insn_start, insn_end, len;
> + struct bpf_line_info *linfo;
> + void **jited_linfo;
> + struct btf *btf;
> +
> + btf = prog->aux->btf;
> + linfo = prog->aux->linfo;
> + jited_linfo = prog->aux->jited_linfo;
> +
> + if (!btf || !linfo || !prog->aux->jited_linfo)
> + return -1;
> + len = prog->aux->func ? prog->aux->func[prog->aux->func_idx]->len : prog->len;
> +
> + linfo = &prog->aux->linfo[prog->aux->linfo_idx];
> + jited_linfo = &prog->aux->jited_linfo[prog->aux->linfo_idx];
> +
> + insn_start = linfo[0].insn_off;
> + insn_end = insn_start + len;
> +
> + for (int i = 0; linfo[i].insn_off >= insn_start && linfo[i].insn_off < insn_end; i++) {
> + if (jited_linfo[i] >= (void *)ip)
> + break;
> + idx = i;
> + }
> +
> + if (idx == -1)
> + return -1;
> +
> + *filep = btf_name_by_offset(btf, linfo[idx].file_name_off);
> + *filep = strrchr(*filep, '/') ?: *filep;
> + *filep += 1;
Nit: `+= 1` not needed if '/' was not found?
> +
> + return BPF_LINE_INFO_LINE_NUM(linfo[idx].line_col);
> +}
[...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* [RFC PATCH bpf-next/net v1 09/13] bpf: Report may_goto timeout to BPF stderr
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
` (7 preceding siblings ...)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 08/13] bpf: Add dump_stack() analogue to print to BPF stderr Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 10/13] bpf: Report rqspinlock deadlocks/timeout " Kumar Kartikeya Dwivedi
` (3 subsequent siblings)
12 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Begin reporting may_goto timeouts to BPF program's stderr stream.
Also, pace the reporting frequency by printing at most 64 instances
of any errors, so that we don't end up stressing the allocator and
exhausting memory in case of misbehaving programs.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
include/linux/bpf.h | 3 +++
kernel/bpf/core.c | 17 ++++++++++++++++-
2 files changed, 19 insertions(+), 1 deletion(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d4687da63645..cf057327798c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1662,6 +1662,7 @@ struct bpf_prog_aux {
struct rcu_head rcu;
};
struct bpf_stream stream[2];
+ atomic_long_t error_count;
};
struct bpf_prog {
@@ -3584,6 +3585,8 @@ void bpf_bprintf_cleanup(struct bpf_bprintf_data *data);
int bpf_try_get_buffers(struct bpf_bprintf_buffers **bufs);
void bpf_put_buffers(void);
+#define BPF_PROG_ERROR_COUNT_MAX 64
+
void bpf_prog_stream_init(struct bpf_prog *prog);
void bpf_prog_stream_free(struct bpf_prog *prog);
__printf(2, 3)
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1bfc6f7ea3da..cde1be7ffa8b 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -3151,6 +3151,19 @@ u64 __weak arch_bpf_timed_may_goto(void)
return 0;
}
+static noinline void bpf_prog_report_may_goto_violation(void)
+{
+ struct bpf_prog *prog;
+
+ prog = bpf_prog_find_from_stack();
+ if (!prog)
+ return;
+ if (atomic_long_fetch_add(1, &prog->aux->error_count) >= BPF_PROG_ERROR_COUNT_MAX)
+ return;
+ bpf_prog_stderr_printk(prog, "ERROR: Timeout detected for may_goto instruction\n");
+ bpf_prog_stderr_dump_stack(prog);
+}
+
u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
{
u64 time = ktime_get_mono_fast_ns();
@@ -3161,8 +3174,10 @@ u64 bpf_check_timed_may_goto(struct bpf_timed_may_goto *p)
return BPF_MAX_TIMED_LOOPS;
}
/* Check if we've exhausted our time slice, and zero count. */
- if (time - p->timestamp >= (NSEC_PER_SEC / 4))
+ if (unlikely(time - p->timestamp >= (NSEC_PER_SEC / 4))) {
+ bpf_prog_report_may_goto_violation();
return 0;
+ }
/* Refresh the count for the stack frame. */
return BPF_MAX_TIMED_LOOPS;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* [RFC PATCH bpf-next/net v1 10/13] bpf: Report rqspinlock deadlocks/timeout to BPF stderr
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
` (8 preceding siblings ...)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 09/13] bpf: Report may_goto timeout " Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 11/13] bpf: Report arena faults " Kumar Kartikeya Dwivedi
` (2 subsequent siblings)
12 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Begin reporting rqspinlock deadlocks and timeout to BPF program's
stderr.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/rqspinlock.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/kernel/bpf/rqspinlock.c b/kernel/bpf/rqspinlock.c
index b896c4a75a5c..7cf6bd8e4f78 100644
--- a/kernel/bpf/rqspinlock.c
+++ b/kernel/bpf/rqspinlock.c
@@ -666,6 +666,26 @@ EXPORT_SYMBOL_GPL(resilient_queued_spin_lock_slowpath);
__bpf_kfunc_start_defs();
+static void bpf_prog_report_rqspinlock_violation(const char *str, void *lock, bool irqsave)
+{
+ struct rqspinlock_held *rqh = this_cpu_ptr(&rqspinlock_held_locks);
+ struct bpf_prog *prog;
+
+ prog = bpf_prog_find_from_stack();
+ if (!prog)
+ return;
+ if (atomic_long_fetch_add(1, &prog->aux->error_count) >= BPF_PROG_ERROR_COUNT_MAX)
+ return;
+ bpf_prog_stderr_printk(prog, "ERROR: %s for bpf_res_spin_lock%s\n", str, irqsave ? "_irqsave" : "");
+ bpf_prog_stderr_printk(prog, "Attempted lock = 0x%px\n", lock);
+ bpf_prog_stderr_printk(prog, "Total held locks = %d\n", rqh->cnt);
+ for (int i = 0; i < min(RES_NR_HELD, rqh->cnt); i++)
+ bpf_prog_stderr_printk(prog, "Held lock[%2d] = 0x%px\n", i, rqh->locks[i]);
+ bpf_prog_stderr_dump_stack(prog);
+}
+
+#define REPORT_STR(ret) ({ (ret) == -ETIMEDOUT ? "Timeout detected" : "AA or ABBA deadlock detected"; })
+
__bpf_kfunc int bpf_res_spin_lock(struct bpf_res_spin_lock *lock)
{
int ret;
@@ -676,6 +696,7 @@ __bpf_kfunc int bpf_res_spin_lock(struct bpf_res_spin_lock *lock)
preempt_disable();
ret = res_spin_lock((rqspinlock_t *)lock);
if (unlikely(ret)) {
+ bpf_prog_report_rqspinlock_violation(REPORT_STR(ret), lock, false);
preempt_enable();
return ret;
}
@@ -698,6 +719,7 @@ __bpf_kfunc int bpf_res_spin_lock_irqsave(struct bpf_res_spin_lock *lock, unsign
local_irq_save(flags);
ret = res_spin_lock((rqspinlock_t *)lock);
if (unlikely(ret)) {
+ bpf_prog_report_rqspinlock_violation(REPORT_STR(ret), lock, true);
local_irq_restore(flags);
preempt_enable();
return ret;
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* [RFC PATCH bpf-next/net v1 11/13] bpf: Report arena faults to BPF stderr
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
` (9 preceding siblings ...)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 10/13] bpf: Report rqspinlock deadlocks/timeout " Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 12/13] bpftool: Add support for dumping streams Kumar Kartikeya Dwivedi
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 13/13] selftests/bpf: Add tests for prog streams Kumar Kartikeya Dwivedi
12 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Begin reporting arena page faults and the faulting address to BPF
program's stderr, for now limited to x86, but arm64 support should
be easy to add.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
arch/x86/net/bpf_jit_comp.c | 21 ++++++++++++++++++---
include/linux/bpf.h | 1 +
kernel/bpf/arena.c | 14 ++++++++++++++
3 files changed, 33 insertions(+), 3 deletions(-)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 9e5fe2ba858f..434a6eae4621 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1384,15 +1384,27 @@ static int emit_atomic_ld_st_index(u8 **pprog, u32 atomic_op, u32 size,
}
#define DONT_CLEAR 1
+#define ARENA_FAULT (1 << 8)
bool ex_handler_bpf(const struct exception_table_entry *x, struct pt_regs *regs)
{
- u32 reg = x->fixup >> 8;
+ u32 arena_reg = (x->fixup >> 8) & 0xff;
+ bool is_arena = !!arena_reg;
+ u32 reg = x->fixup >> 16;
+ unsigned long addr;
+
+ /* Read here, if src_reg is dst_reg for load, we'll write 0 to it. */
+ if (is_arena)
+ addr = *(unsigned long *)((void *)regs + arena_reg);
/* jump over faulting load and clear dest register */
if (reg != DONT_CLEAR)
*(unsigned long *)((void *)regs + reg) = 0;
regs->ip += x->fixup & 0xff;
+
+ if (is_arena)
+ bpf_prog_report_arena_violation(reg == DONT_CLEAR, addr);
+
return true;
}
@@ -2043,7 +2055,10 @@ st: if (is_imm8(insn->off))
ex->data = EX_TYPE_BPF;
ex->fixup = (prog - start_of_ldx) |
- ((BPF_CLASS(insn->code) == BPF_LDX ? reg2pt_regs[dst_reg] : DONT_CLEAR) << 8);
+ ((BPF_CLASS(insn->code) == BPF_LDX ? reg2pt_regs[dst_reg] : DONT_CLEAR) << 16)
+ | ((BPF_CLASS(insn->code) == BPF_LDX ? reg2pt_regs[src_reg] : reg2pt_regs[dst_reg])<< 8);
+ /* Ensure src_reg offset fits in 1 byte. */
+ BUILD_BUG_ON(sizeof(struct pt_regs) > U8_MAX);
}
break;
@@ -2161,7 +2176,7 @@ st: if (is_imm8(insn->off))
* End result: x86 insn "mov rbx, qword ptr [rax+0x14]"
* of 4 bytes will be ignored and rbx will be zero inited.
*/
- ex->fixup = (prog - start_of_ldx) | (reg2pt_regs[dst_reg] << 8);
+ ex->fixup = (prog - start_of_ldx) | (reg2pt_regs[dst_reg] << 16);
}
break;
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index cf057327798c..9fae22dde926 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3593,6 +3593,7 @@ __printf(2, 3)
int bpf_prog_stderr_printk(struct bpf_prog *prog, const char *fmt, ...);
struct bpf_prog *bpf_prog_find_from_stack(void);
void bpf_prog_stderr_dump_stack(struct bpf_prog *prog);
+void bpf_prog_report_arena_violation(bool write, unsigned long addr);
#ifdef CONFIG_BPF_LSM
void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 0d56cea71602..593a93195cdc 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -590,3 +590,17 @@ static int __init kfunc_init(void)
return register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC, &common_kfunc_set);
}
late_initcall(kfunc_init);
+
+void bpf_prog_report_arena_violation(bool write, unsigned long addr)
+{
+ struct bpf_prog *prog;
+
+ prog = bpf_prog_find_from_stack();
+ if (!prog)
+ return;
+ if (atomic_long_fetch_add(1, &prog->aux->error_count) >= BPF_PROG_ERROR_COUNT_MAX)
+ return;
+ bpf_prog_stderr_printk(prog, "ERROR: Arena %s access at unmapped address 0x%lx\n",
+ write ? "WRITE" : "READ", addr);
+ bpf_prog_stderr_dump_stack(prog);
+}
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* [RFC PATCH bpf-next/net v1 12/13] bpftool: Add support for dumping streams
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
` (10 preceding siblings ...)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 11/13] bpf: Report arena faults " Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-22 20:30 ` Eduard Zingerman
2025-04-22 22:10 ` Quentin Monnet
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 13/13] selftests/bpf: Add tests for prog streams Kumar Kartikeya Dwivedi
12 siblings, 2 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Add bpftool support for dumping streams of a given BPF program.
TODO: JSON and filepath support.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
tools/bpf/bpftool/Makefile | 2 +-
tools/bpf/bpftool/prog.c | 71 +++++++++++++++++-
tools/bpf/bpftool/skeleton/stream.bpf.c | 96 +++++++++++++++++++++++++
3 files changed, 166 insertions(+), 3 deletions(-)
create mode 100644 tools/bpf/bpftool/skeleton/stream.bpf.c
diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index 9e9a5f006cd2..eb908223c3bb 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -234,7 +234,7 @@ $(OUTPUT)%.bpf.o: skeleton/%.bpf.c $(OUTPUT)vmlinux.h $(LIBBPF_BOOTSTRAP)
$(OUTPUT)%.skel.h: $(OUTPUT)%.bpf.o $(BPFTOOL_BOOTSTRAP)
$(QUIET_GEN)$(BPFTOOL_BOOTSTRAP) gen skeleton $< > $@
-$(OUTPUT)prog.o: $(OUTPUT)profiler.skel.h
+$(OUTPUT)prog.o: $(OUTPUT)profiler.skel.h $(OUTPUT)stream.skel.h
$(OUTPUT)pids.o: $(OUTPUT)pid_iter.skel.h
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index f010295350be..d0800fec9c3d 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -35,12 +35,16 @@
#include "main.h"
#include "xlated_dumper.h"
+#include "stream.skel.h"
+
#define BPF_METADATA_PREFIX "bpf_metadata_"
#define BPF_METADATA_PREFIX_LEN (sizeof(BPF_METADATA_PREFIX) - 1)
enum dump_mode {
DUMP_JITED,
DUMP_XLATED,
+ DUMP_STDOUT,
+ DUMP_STDERR,
};
static const bool attach_types[] = {
@@ -697,6 +701,55 @@ static int do_show(int argc, char **argv)
return err;
}
+static int process_stream_sample(void *ctx, void *data, size_t len)
+{
+ FILE *file = ctx;
+
+ fprintf(file, "%s", (char *)data);
+ fflush(file);
+ return 0;
+}
+
+static int
+prog_dump_stream(struct bpf_prog_info *info, enum dump_mode mode, const char *filepath)
+{
+ FILE *file = mode == DUMP_STDOUT ? stdout : stderr;
+ LIBBPF_OPTS(bpf_test_run_opts, opts);
+ struct ring_buffer *ringbuf;
+ struct stream_bpf *skel;
+ int map_fd, ret = -1;
+
+ __u32 prog_id = info->id;
+ __u32 stream_id = mode == DUMP_STDOUT ? 1 : 2;
+
+ skel = stream_bpf__open_and_load();
+ if (!skel)
+ return -errno;
+ skel->bss->prog_id = prog_id;
+ skel->bss->stream_id = stream_id;
+
+ //TODO(kkd): Filepath handling
+ map_fd = bpf_map__fd(skel->maps.ringbuf);
+ ringbuf = ring_buffer__new(map_fd, process_stream_sample, file, NULL);
+ if (!ringbuf) {
+ ret = -errno;
+ goto end;
+ }
+ do {
+ skel->bss->written_count = skel->bss->written_size = 0;
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.bpftool_dump_prog_stream), &opts);
+ ret = -EINVAL;
+ if (ring_buffer__consume_n(ringbuf, skel->bss->written_count) != skel->bss->written_count)
+ goto end;
+ } while (!ret && opts.retval == EAGAIN);
+
+ if (opts.retval != 0)
+ ret = -EINVAL;
+end:
+ stream_bpf__destroy(skel);
+ return ret;
+}
+
static int
prog_dump(struct bpf_prog_info *info, enum dump_mode mode,
char *filepath, bool opcodes, bool visual, bool linum)
@@ -719,13 +772,15 @@ prog_dump(struct bpf_prog_info *info, enum dump_mode mode,
}
buf = u64_to_ptr(info->jited_prog_insns);
member_len = info->jited_prog_len;
- } else { /* DUMP_XLATED */
+ } else if (mode == DUMP_XLATED) { /* DUMP_XLATED */
if (info->xlated_prog_len == 0 || !info->xlated_prog_insns) {
p_err("error retrieving insn dump: kernel.kptr_restrict set?");
return -1;
}
buf = u64_to_ptr(info->xlated_prog_insns);
member_len = info->xlated_prog_len;
+ } else if (mode == DUMP_STDOUT || mode == DUMP_STDERR) {
+ return prog_dump_stream(info, mode, filepath);
}
if (info->btf_id) {
@@ -898,8 +953,10 @@ static int do_dump(int argc, char **argv)
mode = DUMP_JITED;
} else if (is_prefix(*argv, "xlated")) {
mode = DUMP_XLATED;
+ } else if (is_prefix(*argv, "stdout") || is_prefix(*argv, "stderr")) {
+ mode = is_prefix(*argv, "stdout") ? DUMP_STDOUT : DUMP_STDERR;
} else {
- p_err("expected 'xlated' or 'jited', got: %s", *argv);
+ p_err("expected 'stdout', 'stderr', 'xlated' or 'jited', got: %s", *argv);
return -1;
}
NEXT_ARG();
@@ -950,6 +1007,14 @@ static int do_dump(int argc, char **argv)
}
}
+ if (mode == DUMP_STDOUT || mode == DUMP_STDERR) {
+ if (opcodes || visual || linum) {
+ p_err("'%s' is not compatible with 'opcodes', 'visual', or 'linum'",
+ mode == DUMP_STDOUT ? "stdout" : "stderr");
+ goto exit_close;
+ }
+ }
+
if (filepath && (opcodes || visual || linum)) {
p_err("'file' is not compatible with 'opcodes', 'visual', or 'linum'");
goto exit_close;
@@ -2468,6 +2533,8 @@ static int do_help(int argc, char **argv)
"Usage: %1$s %2$s { show | list } [PROG]\n"
" %1$s %2$s dump xlated PROG [{ file FILE | [opcodes] [linum] [visual] }]\n"
" %1$s %2$s dump jited PROG [{ file FILE | [opcodes] [linum] }]\n"
+ " %1$s %2$s dump stdout PROG [{ file FILE }]\n"
+ " %1$s %2$s dump stderr PROG [{ file FILE }]\n"
" %1$s %2$s pin PROG FILE\n"
" %1$s %2$s { load | loadall } OBJ PATH \\\n"
" [type TYPE] [{ offload_dev | xdpmeta_dev } NAME] \\\n"
diff --git a/tools/bpf/bpftool/skeleton/stream.bpf.c b/tools/bpf/bpftool/skeleton/stream.bpf.c
new file mode 100644
index 000000000000..31b5933e0384
--- /dev/null
+++ b/tools/bpf/bpftool/skeleton/stream.bpf.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+ __uint(type, BPF_MAP_TYPE_RINGBUF);
+ __uint(max_entries, 1024 * 1024);
+} ringbuf SEC(".maps");
+
+struct value {
+ struct bpf_stream_elem_batch __kptr *batch;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct value);
+ __uint(max_entries, 1);
+} array SEC(".maps");
+
+int written_size;
+int written_count;
+int stream_id;
+int prog_id;
+
+#define ENOENT 2
+#define EAGAIN 11
+#define EFAULT 14
+
+SEC("syscall")
+int bpftool_dump_prog_stream(void *ctx)
+{
+ struct bpf_stream_elem_batch *elem_batch;
+ struct bpf_stream_elem *elem;
+ struct bpf_stream *stream;
+ bool cont = false;
+ struct value *v;
+ bool ret = 0;
+
+ stream = bpf_prog_stream_get(BPF_STDERR, prog_id);
+ if (!stream)
+ return ENOENT;
+
+ v = bpf_map_lookup_elem(&array, &(int){0});
+
+ if (v->batch)
+ elem_batch = bpf_kptr_xchg(&v->batch, NULL);
+ else
+ elem_batch = bpf_stream_next_elem_batch(stream);
+ if (!elem_batch)
+ goto end;
+
+ bpf_repeat(BPF_MAX_LOOPS) {
+ struct bpf_dynptr dst_dptr, src_dptr;
+ int size;
+
+ elem = bpf_stream_next_elem(elem_batch);
+ if (!elem)
+ break;
+ size = elem->mem_slice.len;
+
+ if (bpf_dynptr_from_mem_slice(&elem->mem_slice, 0, &src_dptr))
+ ret = EFAULT;
+ if (bpf_ringbuf_reserve_dynptr(&ringbuf, size, 0, &dst_dptr))
+ ret = EFAULT;
+ if (bpf_dynptr_copy(&dst_dptr, 0, &src_dptr, 0, size))
+ ret = EFAULT;
+ bpf_ringbuf_submit_dynptr(&dst_dptr, 0);
+
+ written_count++;
+ written_size += size;
+
+ bpf_stream_free_elem(elem);
+
+ /* Probe and exit if no more space, probe for twice the typical size.*/
+ if (bpf_ringbuf_reserve_dynptr(&ringbuf, 2048, 0, &dst_dptr))
+ cont = true;
+ bpf_ringbuf_discard_dynptr(&dst_dptr, 0);
+
+ if (ret || cont)
+ break;
+ }
+
+ if (cont)
+ elem_batch = bpf_kptr_xchg(&v->batch, elem_batch);
+ if (elem_batch)
+ bpf_stream_free_elem_batch(elem_batch);
+end:
+ bpf_prog_stream_put(stream);
+
+ return ret ?: (cont ? EAGAIN : 0);
+}
+
+char _license[] SEC("license") = "GPL";
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 12/13] bpftool: Add support for dumping streams
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 12/13] bpftool: Add support for dumping streams Kumar Kartikeya Dwivedi
@ 2025-04-22 20:30 ` Eduard Zingerman
2025-04-22 22:10 ` Quentin Monnet
1 sibling, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-22 20:30 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
When I add a test as at the bottom of the email I get the following output:
test_stream_cond_break:PASS:load 0 nsec
test_stream_cond_break:PASS:run_opts 0 nsec
test_stream_cond_break:PASS:info_by_fd 0 nsec
ERROR: Timeout detected for may_goto instruction
CPU: 6 UID: 0 PID: 206 Comm: test_progs
Call trace:
bpf_prog_stderr_dump_stack+0xde/0x160
bpf_check_timed_may_goto+0x5d/0x90
arch_bpf_timed_may_goto+0x21/0x40
bpf_prog_34056decf3b2fb2f_long_loop+0x49/0x57: [stream_cond_break.c:8]
bpf_prog_run_pin_on_cpu+0x5f/0x110
bpf_prog_test_run_syscall+0x205/0x320
bpf_prog_test_run+0x234/0x2a0
__sys_bpf+0x2d7/0x570
__x64_sys_bpf+0x7c/0x90
do_syscall_64+0x79/0x120
entry_SYSCALL_64_after_hwframe+0x76/0x7e
prog_dump_stream: ret==-22
test_stream_cond_break:FAIL:./tools/sbin/bpftool prog dump stderr id 8
unexpected error: 59904 (errno 95)
Am I doing something wrong or EINVAL should not be returned from
prog_dump_stream in this case?
---
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index d0800fec9c3d..64386737a364 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -747,6 +747,7 @@ prog_dump_stream(struct bpf_prog_info *info, enum dump_mode mode, const char *fi
ret = -EINVAL;
end:
stream_bpf__destroy(skel);
+ fprintf(stderr, "prog_dump_stream: ret==%d\n", ret);
return ret;
}
diff --git a/tools/testing/selftests/bpf/prog_tests/stream_cond_break.c b/tools/testing/selftests/bpf/prog_tests/stream_cond_break.c
new file mode 100644
index 000000000000..7a29a45b0a04
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/stream_cond_break.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include "stream_cond_break.skel.h"
+
+static char log_buf[16 * 1024];
+
+void test_stream_cond_break(void)
+{
+ LIBBPF_OPTS(bpf_object_open_opts, opts, .kernel_log_buf = log_buf,
+ .kernel_log_size = sizeof(log_buf),
+ .kernel_log_level = 1 | 2 | 4);
+ LIBBPF_OPTS(bpf_test_run_opts, run_opts);
+ struct stream_cond_break *skel = NULL;
+ struct bpf_prog_info prog_info;
+ __u32 prog_info_len;
+ int ret, fd;
+
+ skel = stream_cond_break__open_opts(&opts);
+ if (!ASSERT_OK_PTR(skel, "open_opts"))
+ goto out;
+ ret = stream_cond_break__load(skel);
+ if (env.verbosity >= VERBOSE_VERY) {
+ fprintf(stderr, "---- program load log ----\n");
+ fprintf(stderr, "%s", log_buf);
+ fprintf(stderr, "--------- end log --------\n");
+ }
+ if (!ASSERT_OK(ret, "load"))
+ return;
+ fd = bpf_program__fd(skel->progs.long_loop);
+ ret = bpf_prog_test_run_opts(fd, &run_opts);
+ if (!ASSERT_EQ(ret, 0, "run_opts"))
+ goto out;
+ memset(&prog_info, 0, sizeof(prog_info));
+ prog_info_len = sizeof(prog_info);
+ ret = bpf_prog_get_info_by_fd(fd, &prog_info, &prog_info_len);
+ if (!ASSERT_OK(ret, "info_by_fd"))
+ goto out;
+ SYS(out, "./tools/sbin/bpftool prog dump stderr id %i\n", prog_info.id);
+out:
+ stream_cond_break__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/stream_cond_break.c b/tools/testing/selftests/bpf/progs/stream_cond_break.c
new file mode 100644
index 000000000000..47c2e5f1b8fd
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/stream_cond_break.c
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "bpf_experimental.h"
+
+SEC("syscall")
+int long_loop(const void *ctx)
+{
+ for (;;)
+ cond_break;
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 12/13] bpftool: Add support for dumping streams
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 12/13] bpftool: Add support for dumping streams Kumar Kartikeya Dwivedi
2025-04-22 20:30 ` Eduard Zingerman
@ 2025-04-22 22:10 ` Quentin Monnet
2025-05-07 17:13 ` Kumar Kartikeya Dwivedi
1 sibling, 1 reply; 44+ messages in thread
From: Quentin Monnet @ 2025-04-22 22:10 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
2025-04-14 09:14 UTC-0700 ~ Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Add bpftool support for dumping streams of a given BPF program.
Thanks for adding bpftool support!
> TODO: JSON and filepath support.
... plus documentation (man page), and bash completion please.
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
> tools/bpf/bpftool/Makefile | 2 +-
> tools/bpf/bpftool/prog.c | 71 +++++++++++++++++-
> tools/bpf/bpftool/skeleton/stream.bpf.c | 96 +++++++++++++++++++++++++
> 3 files changed, 166 insertions(+), 3 deletions(-)
> create mode 100644 tools/bpf/bpftool/skeleton/stream.bpf.c
>
> diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
> index 9e9a5f006cd2..eb908223c3bb 100644
> --- a/tools/bpf/bpftool/Makefile
> +++ b/tools/bpf/bpftool/Makefile
> @@ -234,7 +234,7 @@ $(OUTPUT)%.bpf.o: skeleton/%.bpf.c $(OUTPUT)vmlinux.h $(LIBBPF_BOOTSTRAP)
> $(OUTPUT)%.skel.h: $(OUTPUT)%.bpf.o $(BPFTOOL_BOOTSTRAP)
> $(QUIET_GEN)$(BPFTOOL_BOOTSTRAP) gen skeleton $< > $@
>
> -$(OUTPUT)prog.o: $(OUTPUT)profiler.skel.h
> +$(OUTPUT)prog.o: $(OUTPUT)profiler.skel.h $(OUTPUT)stream.skel.h
>
> $(OUTPUT)pids.o: $(OUTPUT)pid_iter.skel.h
>
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index f010295350be..d0800fec9c3d 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -35,12 +35,16 @@
> #include "main.h"
> #include "xlated_dumper.h"
>
> +#include "stream.skel.h"
> +
> #define BPF_METADATA_PREFIX "bpf_metadata_"
> #define BPF_METADATA_PREFIX_LEN (sizeof(BPF_METADATA_PREFIX) - 1)
>
> enum dump_mode {
> DUMP_JITED,
> DUMP_XLATED,
> + DUMP_STDOUT,
> + DUMP_STDERR,
> };
>
> static const bool attach_types[] = {
> @@ -697,6 +701,55 @@ static int do_show(int argc, char **argv)
> return err;
> }
>
> +static int process_stream_sample(void *ctx, void *data, size_t len)
> +{
> + FILE *file = ctx;
> +
> + fprintf(file, "%s", (char *)data);
> + fflush(file);
> + return 0;
> +}
> +
> +static int
> +prog_dump_stream(struct bpf_prog_info *info, enum dump_mode mode, const char *filepath)
> +{
> + FILE *file = mode == DUMP_STDOUT ? stdout : stderr;
> + LIBBPF_OPTS(bpf_test_run_opts, opts);
> + struct ring_buffer *ringbuf;
> + struct stream_bpf *skel;
> + int map_fd, ret = -1;
> +
> + __u32 prog_id = info->id;
> + __u32 stream_id = mode == DUMP_STDOUT ? 1 : 2;
> +
> + skel = stream_bpf__open_and_load();
> + if (!skel)
> + return -errno;
> + skel->bss->prog_id = prog_id;
> + skel->bss->stream_id = stream_id;
> +
> + //TODO(kkd): Filepath handling
> + map_fd = bpf_map__fd(skel->maps.ringbuf);
> + ringbuf = ring_buffer__new(map_fd, process_stream_sample, file, NULL);
> + if (!ringbuf) {
> + ret = -errno;
> + goto end;
> + }
> + do {
> + skel->bss->written_count = skel->bss->written_size = 0;
> + ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.bpftool_dump_prog_stream), &opts);
> + ret = -EINVAL;
> + if (ring_buffer__consume_n(ringbuf, skel->bss->written_count) != skel->bss->written_count)
> + goto end;
> + } while (!ret && opts.retval == EAGAIN);
> +
> + if (opts.retval != 0)
> + ret = -EINVAL;
> +end:
> + stream_bpf__destroy(skel);
> + return ret;
> +}
> +
> static int
> prog_dump(struct bpf_prog_info *info, enum dump_mode mode,
> char *filepath, bool opcodes, bool visual, bool linum)
> @@ -719,13 +772,15 @@ prog_dump(struct bpf_prog_info *info, enum dump_mode mode,
> }
> buf = u64_to_ptr(info->jited_prog_insns);
> member_len = info->jited_prog_len;
> - } else { /* DUMP_XLATED */
> + } else if (mode == DUMP_XLATED) { /* DUMP_XLATED */
> if (info->xlated_prog_len == 0 || !info->xlated_prog_insns) {
> p_err("error retrieving insn dump: kernel.kptr_restrict set?");
> return -1;
> }
> buf = u64_to_ptr(info->xlated_prog_insns);
> member_len = info->xlated_prog_len;
> + } else if (mode == DUMP_STDOUT || mode == DUMP_STDERR) {
> + return prog_dump_stream(info, mode, filepath);
> }
>
> if (info->btf_id) {
> @@ -898,8 +953,10 @@ static int do_dump(int argc, char **argv)
> mode = DUMP_JITED;
> } else if (is_prefix(*argv, "xlated")) {
> mode = DUMP_XLATED;
> + } else if (is_prefix(*argv, "stdout") || is_prefix(*argv, "stderr")) {
> + mode = is_prefix(*argv, "stdout") ? DUMP_STDOUT : DUMP_STDERR;
> } else {
> - p_err("expected 'xlated' or 'jited', got: %s", *argv);
> + p_err("expected 'stdout', 'stderr', 'xlated' or 'jited', got: %s", *argv);
> return -1;
> }
> NEXT_ARG();
> @@ -950,6 +1007,14 @@ static int do_dump(int argc, char **argv)
> }
> }
>
> + if (mode == DUMP_STDOUT || mode == DUMP_STDERR) {
> + if (opcodes || visual || linum) {
> + p_err("'%s' is not compatible with 'opcodes', 'visual', or 'linum'",
> + mode == DUMP_STDOUT ? "stdout" : "stderr");
> + goto exit_close;
> + }
> + }
> +
> if (filepath && (opcodes || visual || linum)) {
> p_err("'file' is not compatible with 'opcodes', 'visual', or 'linum'");
> goto exit_close;
> @@ -2468,6 +2533,8 @@ static int do_help(int argc, char **argv)
> "Usage: %1$s %2$s { show | list } [PROG]\n"
> " %1$s %2$s dump xlated PROG [{ file FILE | [opcodes] [linum] [visual] }]\n"
> " %1$s %2$s dump jited PROG [{ file FILE | [opcodes] [linum] }]\n"
> + " %1$s %2$s dump stdout PROG [{ file FILE }]\n"
> + " %1$s %2$s dump stderr PROG [{ file FILE }]\n"
I'm not sure "prog dump" is the best subcommand for these features. The
"dump" has been associated with printing the program itself, either
translated or as JITed instructions. Here we display something else;
it's closer to "prog tracelog", that we use to dump the trace pipe. And
stdout/stderr are streams anyway, I'm not sure that "dumping" them is
the most accurate term.
How about "prog trace (stdout|stderr)"? Bonus: you don't have to handle
opcodes/linum/visual, and don't have to add support for "filepath".
> " %1$s %2$s pin PROG FILE\n"
> " %1$s %2$s { load | loadall } OBJ PATH \\\n"
> " [type TYPE] [{ offload_dev | xdpmeta_dev } NAME] \\\n"
> diff --git a/tools/bpf/bpftool/skeleton/stream.bpf.c b/tools/bpf/bpftool/skeleton/stream.bpf.c
> new file mode 100644
> index 000000000000..31b5933e0384
> --- /dev/null
> +++ b/tools/bpf/bpftool/skeleton/stream.bpf.c
> @@ -0,0 +1,96 @@
> +// SPDX-License-Identifier: GPL-2.0
Bpftool is dual-licensed (GPL-2.0-only OR BSD-2-Clause), please consider
using the same here.
> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> +#include <vmlinux.h>
> +#include <bpf/bpf_tracing.h>
> +#include <bpf/bpf_helpers.h>
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_RINGBUF);
> + __uint(max_entries, 1024 * 1024);
> +} ringbuf SEC(".maps");
> +
> +struct value {
> + struct bpf_stream_elem_batch __kptr *batch;
> +};
> +
> +struct {
> + __uint(type, BPF_MAP_TYPE_ARRAY);
> + __type(key, int);
> + __type(value, struct value);
> + __uint(max_entries, 1);
> +} array SEC(".maps");
> +
> +int written_size;
> +int written_count;
> +int stream_id;
> +int prog_id;
> +
> +#define ENOENT 2
> +#define EAGAIN 11
> +#define EFAULT 14
> +
> +SEC("syscall")
> +int bpftool_dump_prog_stream(void *ctx)
> +{
> + struct bpf_stream_elem_batch *elem_batch;
> + struct bpf_stream_elem *elem;
> + struct bpf_stream *stream;
> + bool cont = false;
> + struct value *v;
> + bool ret = 0;
> +
> + stream = bpf_prog_stream_get(BPF_STDERR, prog_id);
Calls to these new kfuncs will break compilation on older systems that
don't support them yet (and don't have the definition in their
vmlinux.h). We should provide fallback definitions to make sure that the
program compiles.
> + if (!stream)
> + return ENOENT;
> +
> + v = bpf_map_lookup_elem(&array, &(int){0});
> +
> + if (v->batch)
> + elem_batch = bpf_kptr_xchg(&v->batch, NULL);
> + else
> + elem_batch = bpf_stream_next_elem_batch(stream);
> + if (!elem_batch)
> + goto end;
> +
> + bpf_repeat(BPF_MAX_LOOPS) {
> + struct bpf_dynptr dst_dptr, src_dptr;
> + int size;
> +
> + elem = bpf_stream_next_elem(elem_batch);
> + if (!elem)
> + break;
> + size = elem->mem_slice.len;
> +
> + if (bpf_dynptr_from_mem_slice(&elem->mem_slice, 0, &src_dptr))
> + ret = EFAULT;
> + if (bpf_ringbuf_reserve_dynptr(&ringbuf, size, 0, &dst_dptr))
> + ret = EFAULT;
> + if (bpf_dynptr_copy(&dst_dptr, 0, &src_dptr, 0, size))
> + ret = EFAULT;
> + bpf_ringbuf_submit_dynptr(&dst_dptr, 0);
> +
> + written_count++;
> + written_size += size;
> +
> + bpf_stream_free_elem(elem);
> +
> + /* Probe and exit if no more space, probe for twice the typical size.*/
> + if (bpf_ringbuf_reserve_dynptr(&ringbuf, 2048, 0, &dst_dptr))
> + cont = true;
> + bpf_ringbuf_discard_dynptr(&dst_dptr, 0);
> +
> + if (ret || cont)
> + break;
> + }
> +
> + if (cont)
> + elem_batch = bpf_kptr_xchg(&v->batch, elem_batch);
> + if (elem_batch)
> + bpf_stream_free_elem_batch(elem_batch);
> +end:
> + bpf_prog_stream_put(stream);
> +
> + return ret ?: (cont ? EAGAIN : 0);
> +}
> +
> +char _license[] SEC("license") = "GPL";
Same note on the license, other programs used by bpftool have license
string "Dual BSD/GPL".
^ permalink raw reply [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 12/13] bpftool: Add support for dumping streams
2025-04-22 22:10 ` Quentin Monnet
@ 2025-05-07 17:13 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-05-07 17:13 UTC (permalink / raw)
To: Quentin Monnet
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
On Wed, 23 Apr 2025 at 00:10, Quentin Monnet <qmo@kernel.org> wrote:
>
> 2025-04-14 09:14 UTC-0700 ~ Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > Add bpftool support for dumping streams of a given BPF program.
>
>
> Thanks for adding bpftool support!
>
>
> > TODO: JSON and filepath support.
>
>
> ... plus documentation (man page), and bash completion please.
>
Done.
>
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > ---
> > tools/bpf/bpftool/Makefile | 2 +-
> > tools/bpf/bpftool/prog.c | 71 +++++++++++++++++-
> > tools/bpf/bpftool/skeleton/stream.bpf.c | 96 +++++++++++++++++++++++++
> > 3 files changed, 166 insertions(+), 3 deletions(-)
> > create mode 100644 tools/bpf/bpftool/skeleton/stream.bpf.c
> >
> > diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
> > index 9e9a5f006cd2..eb908223c3bb 100644
> > --- a/tools/bpf/bpftool/Makefile
> > +++ b/tools/bpf/bpftool/Makefile
> > @@ -234,7 +234,7 @@ $(OUTPUT)%.bpf.o: skeleton/%.bpf.c $(OUTPUT)vmlinux.h $(LIBBPF_BOOTSTRAP)
> > $(OUTPUT)%.skel.h: $(OUTPUT)%.bpf.o $(BPFTOOL_BOOTSTRAP)
> > $(QUIET_GEN)$(BPFTOOL_BOOTSTRAP) gen skeleton $< > $@
> >
> > -$(OUTPUT)prog.o: $(OUTPUT)profiler.skel.h
> > +$(OUTPUT)prog.o: $(OUTPUT)profiler.skel.h $(OUTPUT)stream.skel.h
> >
> > $(OUTPUT)pids.o: $(OUTPUT)pid_iter.skel.h
> >
> > diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> > index f010295350be..d0800fec9c3d 100644
> > --- a/tools/bpf/bpftool/prog.c
> > +++ b/tools/bpf/bpftool/prog.c
> > @@ -35,12 +35,16 @@
> > #include "main.h"
> > #include "xlated_dumper.h"
> >
> > +#include "stream.skel.h"
> > +
> > #define BPF_METADATA_PREFIX "bpf_metadata_"
> > #define BPF_METADATA_PREFIX_LEN (sizeof(BPF_METADATA_PREFIX) - 1)
> >
> > enum dump_mode {
> > DUMP_JITED,
> > DUMP_XLATED,
> > + DUMP_STDOUT,
> > + DUMP_STDERR,
> > };
> >
> > static const bool attach_types[] = {
> > @@ -697,6 +701,55 @@ static int do_show(int argc, char **argv)
> > return err;
> > }
> >
> > +static int process_stream_sample(void *ctx, void *data, size_t len)
> > +{
> > + FILE *file = ctx;
> > +
> > + fprintf(file, "%s", (char *)data);
> > + fflush(file);
> > + return 0;
> > +}
> > +
> > +static int
> > +prog_dump_stream(struct bpf_prog_info *info, enum dump_mode mode, const char *filepath)
> > +{
> > + FILE *file = mode == DUMP_STDOUT ? stdout : stderr;
> > + LIBBPF_OPTS(bpf_test_run_opts, opts);
> > + struct ring_buffer *ringbuf;
> > + struct stream_bpf *skel;
> > + int map_fd, ret = -1;
> > +
> > + __u32 prog_id = info->id;
> > + __u32 stream_id = mode == DUMP_STDOUT ? 1 : 2;
> > +
> > + skel = stream_bpf__open_and_load();
> > + if (!skel)
> > + return -errno;
> > + skel->bss->prog_id = prog_id;
> > + skel->bss->stream_id = stream_id;
> > +
> > + //TODO(kkd): Filepath handling
> > + map_fd = bpf_map__fd(skel->maps.ringbuf);
> > + ringbuf = ring_buffer__new(map_fd, process_stream_sample, file, NULL);
> > + if (!ringbuf) {
> > + ret = -errno;
> > + goto end;
> > + }
> > + do {
> > + skel->bss->written_count = skel->bss->written_size = 0;
> > + ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.bpftool_dump_prog_stream), &opts);
> > + ret = -EINVAL;
> > + if (ring_buffer__consume_n(ringbuf, skel->bss->written_count) != skel->bss->written_count)
> > + goto end;
> > + } while (!ret && opts.retval == EAGAIN);
> > +
> > + if (opts.retval != 0)
> > + ret = -EINVAL;
> > +end:
> > + stream_bpf__destroy(skel);
> > + return ret;
> > +}
> > +
> > static int
> > prog_dump(struct bpf_prog_info *info, enum dump_mode mode,
> > char *filepath, bool opcodes, bool visual, bool linum)
> > @@ -719,13 +772,15 @@ prog_dump(struct bpf_prog_info *info, enum dump_mode mode,
> > }
> > buf = u64_to_ptr(info->jited_prog_insns);
> > member_len = info->jited_prog_len;
> > - } else { /* DUMP_XLATED */
> > + } else if (mode == DUMP_XLATED) { /* DUMP_XLATED */
> > if (info->xlated_prog_len == 0 || !info->xlated_prog_insns) {
> > p_err("error retrieving insn dump: kernel.kptr_restrict set?");
> > return -1;
> > }
> > buf = u64_to_ptr(info->xlated_prog_insns);
> > member_len = info->xlated_prog_len;
> > + } else if (mode == DUMP_STDOUT || mode == DUMP_STDERR) {
> > + return prog_dump_stream(info, mode, filepath);
> > }
> >
> > if (info->btf_id) {
> > @@ -898,8 +953,10 @@ static int do_dump(int argc, char **argv)
> > mode = DUMP_JITED;
> > } else if (is_prefix(*argv, "xlated")) {
> > mode = DUMP_XLATED;
> > + } else if (is_prefix(*argv, "stdout") || is_prefix(*argv, "stderr")) {
> > + mode = is_prefix(*argv, "stdout") ? DUMP_STDOUT : DUMP_STDERR;
> > } else {
> > - p_err("expected 'xlated' or 'jited', got: %s", *argv);
> > + p_err("expected 'stdout', 'stderr', 'xlated' or 'jited', got: %s", *argv);
> > return -1;
> > }
> > NEXT_ARG();
> > @@ -950,6 +1007,14 @@ static int do_dump(int argc, char **argv)
> > }
> > }
> >
> > + if (mode == DUMP_STDOUT || mode == DUMP_STDERR) {
> > + if (opcodes || visual || linum) {
> > + p_err("'%s' is not compatible with 'opcodes', 'visual', or 'linum'",
> > + mode == DUMP_STDOUT ? "stdout" : "stderr");
> > + goto exit_close;
> > + }
> > + }
> > +
> > if (filepath && (opcodes || visual || linum)) {
> > p_err("'file' is not compatible with 'opcodes', 'visual', or 'linum'");
> > goto exit_close;
> > @@ -2468,6 +2533,8 @@ static int do_help(int argc, char **argv)
> > "Usage: %1$s %2$s { show | list } [PROG]\n"
> > " %1$s %2$s dump xlated PROG [{ file FILE | [opcodes] [linum] [visual] }]\n"
> > " %1$s %2$s dump jited PROG [{ file FILE | [opcodes] [linum] }]\n"
> > + " %1$s %2$s dump stdout PROG [{ file FILE }]\n"
> > + " %1$s %2$s dump stderr PROG [{ file FILE }]\n"
>
>
> I'm not sure "prog dump" is the best subcommand for these features. The
> "dump" has been associated with printing the program itself, either
> translated or as JITed instructions. Here we display something else;
> it's closer to "prog tracelog", that we use to dump the trace pipe. And
> stdout/stderr are streams anyway, I'm not sure that "dumping" them is
> the most accurate term.
>
> How about "prog trace (stdout|stderr)"? Bonus: you don't have to handle
> opcodes/linum/visual, and don't have to add support for "filepath".
I went with prog tracelog {stdout|stderr}, since trace probably feels
a bit odd in this context.
Tracelog alone will print trace pipe, when passed stdout/stderr it
gives the stream output.
Let me know what you think.
>
>
> > " %1$s %2$s pin PROG FILE\n"
> > " %1$s %2$s { load | loadall } OBJ PATH \\\n"
> > " [type TYPE] [{ offload_dev | xdpmeta_dev } NAME] \\\n"
> > diff --git a/tools/bpf/bpftool/skeleton/stream.bpf.c b/tools/bpf/bpftool/skeleton/stream.bpf.c
> > new file mode 100644
> > index 000000000000..31b5933e0384
> > --- /dev/null
> > +++ b/tools/bpf/bpftool/skeleton/stream.bpf.c
> > @@ -0,0 +1,96 @@
> > +// SPDX-License-Identifier: GPL-2.0
>
>
> Bpftool is dual-licensed (GPL-2.0-only OR BSD-2-Clause), please consider
> using the same here.
Done.
>
>
> > +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> > +#include <vmlinux.h>
> > +#include <bpf/bpf_tracing.h>
> > +#include <bpf/bpf_helpers.h>
> > +
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_RINGBUF);
> > + __uint(max_entries, 1024 * 1024);
> > +} ringbuf SEC(".maps");
> > +
> > +struct value {
> > + struct bpf_stream_elem_batch __kptr *batch;
> > +};
> > +
> > +struct {
> > + __uint(type, BPF_MAP_TYPE_ARRAY);
> > + __type(key, int);
> > + __type(value, struct value);
> > + __uint(max_entries, 1);
> > +} array SEC(".maps");
> > +
> > +int written_size;
> > +int written_count;
> > +int stream_id;
> > +int prog_id;
> > +
> > +#define ENOENT 2
> > +#define EAGAIN 11
> > +#define EFAULT 14
> > +
> > +SEC("syscall")
> > +int bpftool_dump_prog_stream(void *ctx)
> > +{
> > + struct bpf_stream_elem_batch *elem_batch;
> > + struct bpf_stream_elem *elem;
> > + struct bpf_stream *stream;
> > + bool cont = false;
> > + struct value *v;
> > + bool ret = 0;
> > +
> > + stream = bpf_prog_stream_get(BPF_STDERR, prog_id);
>
>
> Calls to these new kfuncs will break compilation on older systems that
> don't support them yet (and don't have the definition in their
> vmlinux.h). We should provide fallback definitions to make sure that the
> program compiles.
This is the only thing I haven't yet addressed in v2, because it
seemed a bit ugly.
I tried adding kfunc declarations, but those aren't enough.
We rely on structs introduced and read in this patch.
So I think vmlinux.h needs to be dropped, but it means adding a lot
more than just the declarations, all types, plus any types they
transitively depend on.
Maybe there is a better way (like detecting compilation failure and skipping?).
But if not, I will address like above in v3.
>
>
> > + if (!stream)
> > + return ENOENT;
> > +
> > + v = bpf_map_lookup_elem(&array, &(int){0});
> > +
> > + if (v->batch)
> > + elem_batch = bpf_kptr_xchg(&v->batch, NULL);
> > + else
> > + elem_batch = bpf_stream_next_elem_batch(stream);
> > + if (!elem_batch)
> > + goto end;
> > +
> > + bpf_repeat(BPF_MAX_LOOPS) {
> > + struct bpf_dynptr dst_dptr, src_dptr;
> > + int size;
> > +
> > + elem = bpf_stream_next_elem(elem_batch);
> > + if (!elem)
> > + break;
> > + size = elem->mem_slice.len;
> > +
> > + if (bpf_dynptr_from_mem_slice(&elem->mem_slice, 0, &src_dptr))
> > + ret = EFAULT;
> > + if (bpf_ringbuf_reserve_dynptr(&ringbuf, size, 0, &dst_dptr))
> > + ret = EFAULT;
> > + if (bpf_dynptr_copy(&dst_dptr, 0, &src_dptr, 0, size))
> > + ret = EFAULT;
> > + bpf_ringbuf_submit_dynptr(&dst_dptr, 0);
> > +
> > + written_count++;
> > + written_size += size;
> > +
> > + bpf_stream_free_elem(elem);
> > +
> > + /* Probe and exit if no more space, probe for twice the typical size.*/
> > + if (bpf_ringbuf_reserve_dynptr(&ringbuf, 2048, 0, &dst_dptr))
> > + cont = true;
> > + bpf_ringbuf_discard_dynptr(&dst_dptr, 0);
> > +
> > + if (ret || cont)
> > + break;
> > + }
> > +
> > + if (cont)
> > + elem_batch = bpf_kptr_xchg(&v->batch, elem_batch);
> > + if (elem_batch)
> > + bpf_stream_free_elem_batch(elem_batch);
> > +end:
> > + bpf_prog_stream_put(stream);
> > +
> > + return ret ?: (cont ? EAGAIN : 0);
> > +}
> > +
> > +char _license[] SEC("license") = "GPL";
>
>
> Same note on the license, other programs used by bpftool have license
> string "Dual BSD/GPL".
Done, please take a look at the changes in v2.
^ permalink raw reply [flat|nested] 44+ messages in thread
* [RFC PATCH bpf-next/net v1 13/13] selftests/bpf: Add tests for prog streams
2025-04-14 16:14 [RFC PATCH bpf-next/net v1 00/13] BPF Standard Streams Kumar Kartikeya Dwivedi
` (11 preceding siblings ...)
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 12/13] bpftool: Add support for dumping streams Kumar Kartikeya Dwivedi
@ 2025-04-14 16:14 ` Kumar Kartikeya Dwivedi
2025-04-22 19:12 ` Eduard Zingerman
12 siblings, 1 reply; 44+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-04-14 16:14 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Emil Tsalapatis,
Barret Rhoden, kkd, kernel-team
Add selftests to stress test the various facets of the stream API,
memory allocation pattern, and ensuring dumping support is tested
and functional.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
.../testing/selftests/bpf/prog_tests/stream.c | 57 +++++++
.../testing/selftests/bpf/progs/dynptr_fail.c | 28 ++++
tools/testing/selftests/bpf/progs/stream.c | 150 ++++++++++++++++++
.../selftests/bpf/progs/stream_bpftool.c | 142 +++++++++++++++++
.../testing/selftests/bpf/progs/stream_fail.c | 38 +++++
5 files changed, 415 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/stream.c
create mode 100644 tools/testing/selftests/bpf/progs/stream.c
create mode 100644 tools/testing/selftests/bpf/progs/stream_bpftool.c
create mode 100644 tools/testing/selftests/bpf/progs/stream_fail.c
diff --git a/tools/testing/selftests/bpf/prog_tests/stream.c b/tools/testing/selftests/bpf/prog_tests/stream.c
new file mode 100644
index 000000000000..b01cb18d8de5
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/stream.c
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <sys/mman.h>
+
+#include "stream.skel.h"
+#include "stream_fail.skel.h"
+
+#include "stream_bpftool.skel.h"
+
+void test_stream_failure(void)
+{
+ RUN_TESTS(stream_fail);
+}
+
+void test_stream_success(void)
+{
+ RUN_TESTS(stream);
+ RUN_TESTS(stream_bpftool);
+ return;
+}
+
+static int process_sample(void *ctx, void *data, size_t len)
+{
+ fprintf(stderr, "%s", (char *)data);
+ return 0;
+}
+
+void test_stream_ringbuf_output(void)
+{
+ LIBBPF_OPTS(bpf_test_run_opts, opts);
+ struct ring_buffer *ringbuf;
+ struct stream_bpftool *skel;
+ int fd, ret;
+
+ skel = stream_bpftool__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "stream_bpftool_open_and_load"))
+ return;
+
+ fd = bpf_map__fd(skel->maps.ringbuf);
+
+ ringbuf = ring_buffer__new(fd, process_sample, NULL, NULL);
+ if (!ASSERT_OK_PTR(ringbuf, "ringbuf_new"))
+ goto end;
+
+ do {
+ skel->bss->written_count = skel->bss->written_size = 0;
+ ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.stream_bpftool_dump_prog_stream), &opts);
+ ASSERT_EQ(ring_buffer__consume_n(ringbuf, skel->bss->written_count), skel->bss->written_count, "consume");
+ } while (!ret && opts.retval == EAGAIN);
+
+ ASSERT_OK(ret, "ret");
+ ASSERT_EQ(opts.retval, 0, "retval");
+
+end:
+ stream_bpftool__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
index 7c67797a5aac..545831b43fc8 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
@@ -1734,3 +1734,31 @@ int test_dynptr_reg_type(void *ctx)
global_call_bpf_dynptr((const struct bpf_dynptr *)current);
return 0;
}
+
+SEC("?syscall")
+__failure __msg("Expected an initialized dynptr as arg #2")
+int test_dynptr_source_release_btf(void *ctx)
+{
+ struct bpf_stream_elem_batch *elem_batch;
+ struct bpf_stream_elem *elem;
+ struct bpf_stream *stream;
+ struct bpf_dynptr dptr;
+ char buf[8];
+
+ stream = bpf_stream_get(BPF_STDERR, NULL);
+ if (!stream)
+ return 0;
+ elem_batch = bpf_stream_next_elem_batch(stream);
+ if (!elem_batch)
+ return 0;
+ elem = bpf_stream_next_elem(elem_batch);
+ if (!elem) {
+ bpf_stream_free_elem_batch(elem_batch);
+ return 0;
+ }
+ bpf_dynptr_from_mem_slice(&elem->mem_slice, 0, &dptr);
+ bpf_stream_free_elem(elem);
+
+ bpf_dynptr_read(buf, sizeof(buf), &dptr, 0, 0);
+ return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/stream.c b/tools/testing/selftests/bpf/progs/stream.c
new file mode 100644
index 000000000000..8616e6200bde
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/stream.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+#define _STR "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+
+#define STREAM_STR (u64)(_STR _STR _STR _STR)
+
+int stream_page_cnt;
+int stream_page_next_cnt;
+
+static __noinline void exhaust_stream_memory(int id)
+{
+ struct bpf_stream *stream;
+
+ bpf_repeat(32) {
+ stream = bpf_stream_get(id, NULL);
+ if (!stream)
+ break;
+ bpf_stream_vprintk(stream, "...", &(u64){0}, 0);
+ }
+}
+
+static __noinline int stream_exercise(int id, int N)
+{
+ struct bpf_stream_elem *elem, *earr[56] = {};
+ struct bpf_stream_elem_batch *batch;
+ struct bpf_stream *stream;
+ int ret;
+ u32 i;
+
+ if (N > 56)
+ return 56;
+
+ stream = bpf_stream_get(id, NULL);
+ if (!stream)
+ return 1;
+ for (i = 0; i < N; i++)
+ if ((ret = bpf_stream_vprintk(stream, "%llu%s", &(u64[]){i, STREAM_STR}, 16)) < 0) {
+ bpf_printk("bpf_stream_vprintk ret=%d", ret);
+ return 2;
+ }
+ batch = bpf_stream_next_elem_batch(stream);
+ if (!batch)
+ return 3;
+ ret = 0;
+ for (i = 0; i < N; i++) {
+ elem = bpf_stream_next_elem(batch);
+ if (!elem) {
+ ret = 4;
+ break;
+ }
+ earr[i] = elem;
+
+ if (elem->flags & BPF_STREAM_ELEM_F_PAGE)
+ stream_page_cnt++;
+ if (elem->flags & BPF_STREAM_ELEM_F_NEXT)
+ stream_page_next_cnt++;
+ }
+ for (i = 0; i < N; i++)
+ if (earr[i])
+ bpf_stream_free_elem(earr[i]);
+ bpf_stream_free_elem_batch(batch);
+ return ret;
+}
+
+static __noinline int stream_exercise_nums(int id)
+{
+ int ret = 0;
+
+ ret = ret ?: stream_exercise(id, 56);
+ ret = ret ?: stream_exercise(id, 42);
+ ret = ret ?: stream_exercise(id, 28);
+ ret = ret ?: stream_exercise(id, 10);
+ ret = ret ?: stream_exercise(id, 1);
+
+ return ret;
+}
+
+SEC("syscall")
+__success __retval(0)
+int stream_test(void *ctx)
+{
+ unsigned long flags;
+ int ret;
+
+ bpf_local_irq_save(&flags);
+
+ /*
+ * We grab 32 entries from a supposedly filled cache, so we'll have a
+ * case of elements mixing bpf_mem_alloc() and bpf_stream_page
+ * allocations.
+ *
+ * This also ensures that we test the path where the batch dequeued from
+ * the kernel contains extra non-extracted elements, that are then freed
+ * to the respective memory allocator depending on if they come from a
+ * page or not.
+ */
+ exhaust_stream_memory(BPF_STDOUT);
+
+ bpf_repeat(50) {
+ ret = stream_exercise_nums(BPF_STDOUT);
+ if (ret)
+ break;
+ if (!stream_page_cnt)
+ break;
+ }
+
+ if (ret) {
+ bpf_local_irq_restore(&flags);
+ return ret;
+ }
+
+ if (!stream_page_cnt) {
+ bpf_local_irq_restore(&flags);
+ return 41;
+ }
+
+ stream_page_cnt = 0;
+
+ bpf_repeat(100) {
+ stream_page_cnt = 0;
+ ret = stream_exercise_nums(BPF_STDERR);
+ if (ret)
+ break;
+ }
+
+ exhaust_stream_memory(BPF_STDOUT);
+
+ bpf_local_irq_restore(&flags);
+
+ if (ret)
+ return ret;
+
+ if (!stream_page_cnt)
+ return 42;
+
+ if (!stream_page_next_cnt)
+ return 43;
+
+ ret = stream_exercise_nums(BPF_STDOUT);
+ if (ret)
+ return ret;
+ return stream_exercise_nums(BPF_STDERR);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/stream_bpftool.c b/tools/testing/selftests/bpf/progs/stream_bpftool.c
new file mode 100644
index 000000000000..438c01a96efc
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/stream_bpftool.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+#include "bpf_experimental.h"
+
+struct {
+ __uint(type, BPF_MAP_TYPE_RINGBUF);
+ __uint(max_entries, 1024 * 1024);
+ //__uint(max_entries, 4096 * 2);
+} ringbuf SEC(".maps");
+
+struct value {
+ struct bpf_stream_elem_batch __kptr *batch;
+ struct bpf_res_spin_lock lock;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, int);
+ __type(value, struct value);
+ __uint(max_entries, 1);
+} array SEC(".maps");
+
+int written_size;
+int written_count;
+int stream_id;
+int prog_id;
+
+#define ENOENT 2
+#define EAGAIN 11
+#define EFAULT 14
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARENA);
+ __uint(map_flags, BPF_F_MMAPABLE);
+ __uint(max_entries, 100); /* number of pages */
+} arena SEC(".maps");
+
+#define __arena __attribute__((address_space(1)))
+
+void *ptr;
+int zero;
+
+static __noinline void foo(struct bpf_stream *stream)
+{
+ struct value *v1, *v2;
+ int i;
+
+ ptr = &arena;
+
+ v1 = bpf_map_lookup_elem(&array, &(int){0});
+ if (!v1)
+ return;
+ v2 = bpf_map_lookup_elem(&array, &(int){0});
+ if (!v2)
+ return;
+
+ if (!bpf_res_spin_lock(&v1->lock)) {
+ if (!bpf_res_spin_lock(&v2->lock))
+ bpf_res_spin_unlock(&v2->lock);
+ bpf_res_spin_unlock(&v1->lock);
+ }
+
+#ifdef __BPF_FEATURE_ADDR_SPACE_CAST
+ *(u64 __arena *)0xfaceb00c = *(u64 __arena *)0xdeadbeef;
+#endif
+ i = zero;
+ while (can_loop)
+ i = i * 2;
+
+}
+
+bool init = false;
+
+SEC("syscall")
+__success
+int stream_bpftool_dump_prog_stream(void *ctx)
+{
+ struct bpf_stream_elem_batch *elem_batch;
+ struct bpf_stream_elem *elem;
+ struct bpf_stream *stream;
+ bool cont = false;
+ struct value *v;
+ bool ret = 0;
+
+ stream = bpf_stream_get(BPF_STDERR, 0);
+ if (!stream)
+ return ENOENT;
+ if (!init) {
+ foo(stream);
+ init = true;
+ }
+
+ v = bpf_map_lookup_elem(&array, &(int){0});
+
+ if (v->batch)
+ elem_batch = bpf_kptr_xchg(&v->batch, NULL);
+ else
+ elem_batch = bpf_stream_next_elem_batch(stream);
+ if (!elem_batch)
+ goto end;
+
+ while ((elem = bpf_stream_next_elem(elem_batch))) {
+ struct bpf_dynptr dst_dptr, src_dptr;
+ int size = elem->mem_slice.len;
+
+ if (bpf_dynptr_from_mem_slice(&elem->mem_slice, 0, &src_dptr))
+ ret = EFAULT;
+ if (bpf_ringbuf_reserve_dynptr(&ringbuf, size, 0, &dst_dptr))
+ ret = EFAULT;
+ if (bpf_dynptr_copy(&dst_dptr, 0, &src_dptr, 0, size))
+ ret = EFAULT;
+ bpf_ringbuf_submit_dynptr(&dst_dptr, 0);
+
+ written_count++;
+ written_size += size;
+
+ bpf_stream_free_elem(elem);
+
+ /* Probe and exit if no more space, probe for twice the typical size.*/
+ if (bpf_ringbuf_reserve_dynptr(&ringbuf, 2048, 0, &dst_dptr))
+ cont = true;
+ bpf_ringbuf_discard_dynptr(&dst_dptr, 0);
+
+ if (ret || cont)
+ break;
+ cond_break;
+ }
+
+ if (cont)
+ elem_batch = bpf_kptr_xchg(&v->batch, elem_batch);
+ if (elem_batch)
+ bpf_stream_free_elem_batch(elem_batch);
+end:
+
+ return ret ? ret : (cont ? EAGAIN : 0);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/stream_fail.c b/tools/testing/selftests/bpf/progs/stream_fail.c
new file mode 100644
index 000000000000..c44590b3ceda
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/stream_fail.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_helpers.h>
+#include "bpf_misc.h"
+
+struct map_value {
+ struct bpf_stream_elem_batch __kptr *batch;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, int);
+ __type(value, struct map_value);
+} arrmap SEC(".maps");
+
+
+SEC("?tc")
+__failure __msg("untrusted_ptr_bpf_stream_elem_batch()")
+int stream_kptr_ptr_untrusted(struct __sk_buff *ctx)
+{
+ struct bpf_stream_elem_batch *batch;
+ struct map_value *v;
+ int key = 0;
+
+ v = bpf_map_lookup_elem(&arrmap, &key);
+ if (!v)
+ return 0;
+ batch = v->batch;
+ if (!batch)
+ return 0;
+ v->batch = (void *)batch->node->next->next->next;
+ return 0;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.47.1
^ permalink raw reply related [flat|nested] 44+ messages in thread* Re: [RFC PATCH bpf-next/net v1 13/13] selftests/bpf: Add tests for prog streams
2025-04-14 16:14 ` [RFC PATCH bpf-next/net v1 13/13] selftests/bpf: Add tests for prog streams Kumar Kartikeya Dwivedi
@ 2025-04-22 19:12 ` Eduard Zingerman
0 siblings, 0 replies; 44+ messages in thread
From: Eduard Zingerman @ 2025-04-22 19:12 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Emil Tsalapatis, Barret Rhoden, kkd,
kernel-team
Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
[...]
> diff --git a/tools/testing/selftests/bpf/progs/stream_bpftool.c b/tools/testing/selftests/bpf/progs/stream_bpftool.c
> new file mode 100644
> index 000000000000..438c01a96efc
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/stream_bpftool.c
This file is almost identical to ./tools/bpf/bpftool/skeleton/stream.bpf.c
except it also adds a function `foo` that exhausts may goto budget,
as far as I can tell. What is the point of adding it?
In general, I don't think we run bpftool tests on the CI,
so it would be good to test stream.bpf.c itself e.g. by
creating a symlink to it in selftests and adding a corresponding
prog_tests/<smth>.c as you do here.
Also, it would be good if some of the tests checked the content read
from the stream. I think existing tests only check the size.
[...]
^ permalink raw reply [flat|nested] 44+ messages in thread