BPF List

BPF List
 help / color / mirror / Atom feed

* Re: [PATCHv4 05/13] uprobes/x86: Move optimized uprobe from nop5 to nop10
From: Andrii Nakryiko @ 2026-06-08 20:46 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Masami Hiramatsu,
	Andrii Nakryiko, bpf, linux-trace-kernel
In-Reply-To: <20260526205840.173790-6-jolsa@kernel.org>

On Tue, May 26, 2026 at 1:59 PM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Andrii reported an issue with optimized uprobes [1] that can clobber
> redzone area with call instruction storing return address on stack
> where user code may keep temporary data without adjusting rsp.
>
> Fixing this by moving the optimized uprobes on top of 10-bytes nop
> instruction, so we can squeeze another instruction to escape the
> redzone area before doing the call, like:
>
>   lea -0x80(%rsp), %rsp
>   call tramp
>
> Note the lea instruction is used to adjust the rsp register without
> changing the flags.
>
> We use nop10 and following transformation to optimized instructions
> above and back as suggested by Peterz [2].
>
> Optimize path (int3_update_optimize):
>
>   1) Initial state after set_swbp() installed the uprobe:
>       cc 2e 0f 1f 84 00 00 00 00 00
>
>      From offset 0 this is INT3 followed by the tail of the original
>      10-byte NOP.
>
>      After a previous unoptimization bytes 5..9 may still contain the
>      old call instruction, which remains valid for threads already there.
>
>   2) Rewrite the LEA tail and call displacement:
>       cc [8d 64 24 80 e8 d0 d1 d2 d3]
>
>      From offset 0 this traps on the uprobe INT3.  Bytes 1..9 are not
>      executable entry points while byte 0 is trapped.
>
>   3) Publish the first LEA byte:
>       [48] 8d 64 24 80 e8 d0 d1 d2 d3
>
>      From offset 0 this is:
>         lea -0x80(%rsp), %rsp
>         call <uprobe-trampoline>
>
> Unoptimize path (int3_update_unoptimize):
>
>   1) Initial optimized state:
>       48 8d 64 24 80 e8 d0 d1 d2 d3
>      Same as 3) above.
>
>   2) Trap new entries before restoring the NOP bytes:
>       [cc] 8d 64 24 80 e8 d0 d1 d2 d3
>
>      From offset 0 this traps. A thread that had already executed the
>      LEA can still reach the intact CALL at offset 5.
>
>   3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
>      and byte 5 as CALL.
>       cc [2e 0f 1f 84] e8 d0 d1 d2 d3
>
>      From offset 0 this still traps. Offset 5 is still the CALL for any
>      thread that was already past the first LEA byte.
>
>   4) Publish the first byte of the original NOP:
>       [66] 2e 0f 1f 84 e8 d0 d1 d2 d3
>
>      From offset 0 this is the restored 10-byte NOP; the CALL opcode and
>      displacement are now only NOP operands.  Offset 5 still decodes as
>      CALL for a thread that was already there.
>
>      Tthere is only a single target uprobe-trampoline for the given nop10
>      instruction address, so the CALL instruction will not be changed across
>      unoptimization/optimization cycles.
>      Therefore, any task that is preempted at the CALL instruction is guaranteed
>      to observe that CALL and not anything else.
>
> Note as explained in [2] we need to use following nop10:
>        PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
> NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)
>
> which means we need to allow 0x2e prefix which maps to INAT_PFX_CS
> attribute in is_prefix_bad function.
>
> Also changing the uprobe syscall error when called out of uprobe
> trampoline to -EPROTO, so we are able to detect the fixed kernel.
>
> The optimized uprobe performance stays the same:
>
>         uprobe-nop     :    3.129 ± 0.013M/s
>         uprobe-push    :    3.045 ± 0.006M/s
>         uprobe-ret     :    1.095 ± 0.004M/s
>   -->   uprobe-nop10   :    7.170 ± 0.020M/s
>         uretprobe-nop  :    2.143 ± 0.021M/s
>         uretprobe-push :    2.090 ± 0.000M/s
>         uretprobe-ret  :    0.942 ± 0.000M/s
>   -->   uretprobe-nop10:    3.381 ± 0.003M/s
>         usdt-nop       :    3.245 ± 0.004M/s
>   -->   usdt-nop10     :    7.256 ± 0.023M/s
>
> [1] https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> [2] https://lore.kernel.org/bpf/20260518104306.GU3102624@noisy.programming.kicks-ass.net/#t
> Reported-by: Andrii Nakryiko <andrii@kernel.org>
> Closes: https://lore.kernel.org/bpf/20260509003146.976844-1-andrii@kernel.org/
> Fixes: ba2bfc97b462 ("uprobes/x86: Add support to optimize uprobes")
> Assisted-by: Codex:GPT-5.5
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  arch/x86/kernel/uprobes.c | 255 ++++++++++++++++++++++++++++----------
>  1 file changed, 190 insertions(+), 65 deletions(-)
>

[...]

> @@ -943,13 +1026,31 @@ static int int3_update(struct arch_uprobe *auprobe, struct vm_area_struct *vma,
>         smp_text_poke_sync_each_cpu();
>
>         /*
> -        * Write first byte.
> +        * 3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
> +        *    and byte 5 as CALL:
> +        *    cc [2e 0f 1f 84] e8 d0 d1 d2 d3
> +        */
> +       ctx.expect = EXPECT_SWBP_OPTIMIZED;
> +       err = uprobe_write(auprobe, vma, vaddr + 1, insn + 1,
> +                          LEA_INSN_SIZE - 1, verify_insn,
> +                          true /* is_register */, false /* do_update_ref_ctr */,

tbh, it's quite subtle and non-obvious why is_register should be set
to true first two times (and especially that is_register and
do_update_ref_ctr are implicitly connected), not sure how to make it
cleaner, but maybe leave a short comment explaining this twice
register, once unregister sequence?

> +                          &ctx);
> +       if (err)
> +               return err;
> +
> +       smp_text_poke_sync_each_cpu();

[...]

^ permalink raw reply

* [PATCH net] bnxt: fix head underflow on XDP head-grow
From: Joe Damato @ 2026-06-08 20:31 UTC (permalink / raw)
  To: netdev, Michael Chan, Pavan Chebbi, Andrew Lunn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
	Stanislav Fomichev, Somnath Kotur, Andy Gospodarek
  Cc: horms, Joe Damato, linux-kernel, bpf

The xdp.py test test_xdp_native_adjst_head_grow_data crashes when run on
a bnxt machine (and also crashes in NIPA).

It seems that the bug is an underflow in bnxt_rx_multi_page_skb, which
builds the skb head:

  napi_build_skb(data_ptr - bp->rx_offset, rxr->rx_page_size);

The problem with this expression is that in page mode rx_offset is:

  bp->rx_offset = NET_IP_ALIGN + XDP_PACKET_HEADROOM;

Which evaluates (at least on x86_64) to 258.

The test test_xdp_native_adjst_head_grow_data tests a case where the
head is adjusted by -256.

When this test runs, data_ptr is shifted to frag_start + 2 (where
frag_start = page_address(page) + offset).

Then, bnxt_rx_multi_page_skb is invoked and the napi_build_skb
expression subtracts 258, landing at an address before frag_start. This
could be either the previous fragment or the previous physical page when
the frag_offset is < 256 (e.g. if the fragment started at offset 0).

When the skb is freed, the page pool fragment reference is dropped on
either the wrong page or the wrong frag of the right page. In either
case, the corrupted reference count can lead to the page being
prematurely recycled while still in use. Once (incorrectly) recycled, it
can be handed out again and on driver teardown this would result in a
double free.

The commit under fixes updated this code to handle the case where the
native page size is >= 64k, but it unintentionally broke the head grow
case.

To fix this, we need to do a bit of math to recover the offset if this
is a page fragment since it is not passed into rx_skb_func
(bnxt_rx_multi_page_skb, in this case).

Once the offset is recovered, build the skb at the start of the fragment
and then use skb_reserve to adjust the layout.

There are two cases, the non-adjustment case and the adjustment case.

In both cases, the skb is built at page_address(page) + frag_offset to
account for the case where the native page size >= 64K and skb_reserve
is called with data_ptr - (page_address(page) + frag_offset). That
difference equals bp->rx_offset when data_ptr was not moved, or
bp->rx_offset + xdp_adjust when XDP adjusted the head.

Re-running the failing test with this commit applied causes the test to
run successfully to completion.

The other rx_skb_func implementations don't have this issue.

Fixes: f6974b4c2d8e ("bnxt_en: Fix page pool logic for page size >= 64K")
Signed-off-by: Joe Damato <joe@dama.to>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 35e1f8f663c7..448609cc1617 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1174,7 +1174,9 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
 	unsigned int len = offset_and_len & 0xffff;
 	struct page *page = data;
 	u16 prod = rxr->rx_prod;
+	unsigned int frag_off;
 	struct sk_buff *skb;
+	void *frag_start;
 	int err;

 	err = bnxt_alloc_rx_data(bp, rxr, prod, GFP_ATOMIC);
@@ -1185,13 +1187,15 @@ static struct sk_buff *bnxt_rx_multi_page_skb(struct bnxt *bp,
 	dma_addr -= bp->rx_dma_offset;
 	dma_sync_single_for_cpu(&bp->pdev->dev, dma_addr, rxr->rx_page_size,
 				bp->rx_dir);
-	skb = napi_build_skb(data_ptr - bp->rx_offset, rxr->rx_page_size);
+	frag_off = dma_addr - page_pool_get_dma_addr(page);
+	frag_start = page_address(page) + frag_off;
+	skb = napi_build_skb(frag_start, rxr->rx_page_size);
 	if (!skb) {
 		page_pool_recycle_direct(rxr->page_pool, page);
 		return NULL;
 	}
 	skb_mark_for_recycle(skb);
-	skb_reserve(skb, bp->rx_offset);
+	skb_reserve(skb, data_ptr - (u8 *)frag_start);
 	__skb_put(skb, len);

 	return skb;

base-commit: 9772589b57e44aedc240211c5c3f7a684a034d3a
-- 
2.53.0-Meta

^ permalink raw reply related

* Re: [PATCH bpf v8 1/2] net: Validate protocol in skb_steal_sock() for BPF-assigned sockets
From: Alexei Starovoitov @ 2026-06-08 20:02 UTC (permalink / raw)
  To: Kuniyuki Iwashima, Jiayuan Chen, Alexei Starovoitov
  Cc: bpf, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Willem de Bruijn, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, Shuah Khan, Joe Stringer,
	netdev, linux-kernel, linux-kselftest
In-Reply-To: <CAAVpQUDZoEVEX1kXok-29dzVHk=ZK_2MkZyZY8kRHCJCfAWVFg@mail.gmail.com>

On Mon Jun 8, 2026 at 10:21 AM PDT, Kuniyuki Iwashima wrote:
> On Mon, Jun 8, 2026 at 5:59 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>>
>> bpf_sk_assign_tcp_reqsk() can assign a TCP reqsk to a non-TCP skb,
>> causing a panic when the skb enters the wrong L4 receive path [1].
>> An initial attempt tried to fix this in the BPF helper by checking
>> iph->protocol, but Sashiko [2] revealed that BPF programs can bypass
>> this check via a TOCTOU attack by modifying iph->protocol around the
>> call:
>>
>>     iph->protocol = IPPROTO_TCP;
>>     bpf_sk_assign_tcp_reqsk(udp_skb, tcp_sk);
>>     iph->protocol = IPPROTO_UDP;
>>
>> Furthermore, bpf_sk_assign() has had the same class of vulnerability
>> since its introduction — it can assign any socket type to any skb
>> without protocol validation.  Since the BPF helper check alone cannot
>> prevent a malicious BPF program from crashing the kernel, add protocol
>
> I'm curious about the BPF maintainers' stance on this kind of "bug"
> where admin tries to shoot oneself in the foot.
>
> I saw Alexei said this recently, and I guess it applies here as well ?
> https://lore.kernel.org/bpf/CAADnVQLh7VEKAtckzz=XOVPT8ovpDQshvPPCWHDQu2OWQx27_w@mail.gmail.com/
>
> ---8<---
> Not every "bug" needs a fix.
> If a malicious bpf user wants to crash the kernel they will
> find a way to do so. Especially with agents.
> We cannot realistically close all of the holes.
> Right now the priority is to fix the issues that normal
> users can hit and not bots.
> ---8<---

In addition to that I have to add that skb_steal_sock() is performance
critical path of networking stack. Adding runtime overheard there
because bots can find a way to abuse the interfaces is not a good trade off.
If there is no simple way to fix it completely on the bpf side
then we have to flag this issue as "won't fix" and move on.


^ permalink raw reply

* Re: [PATCH bpf] bpf: Validate BTF repeated field counts before expansion
From: Eduard Zingerman @ 2026-06-08 20:01 UTC (permalink / raw)
  To: Paul Moses, martin.lau, ast, daniel, andrii, memxor, bpf
  Cc: song, yonghong.song, jolsa, houtao1, linux-kernel, stable
In-Reply-To: <20260605234301.1109063-1-p@1g4.org>

On Fri, 2026-06-05 at 23:43 +0000, Paul Moses wrote:
> btf_parse_struct_metas() walks user-supplied BTF during BPF_BTF_LOAD,
> and btf_repeat_fields() expands repeatable fields from array elements
> into the fixed BTF_FIELDS_MAX scratch array used by btf_parse_fields().
> 
> The remaining-capacity check performs the expanded field count calculation
> in u32. A malformed BTF can wrap that calculation, causing the check to
> pass even when the expanded field count exceeds the scratch array
> capacity. The following memcpy() can then write past the end of the
> array.
> 
> Use checked addition and multiplication before copying repeated fields
> and reject impossible counts.
> 
> Fixes: 797d73ee232d ("bpf: Check the remaining info_cnt before repeating btf fields")
> Cc: stable@vger.kernel.org
> Signed-off-by: Paul Moses <p@1g4.org>
> ---

Regardless of the sibling email I sent, I think that this is a good
defensive practice to use check_{add,mul}_overflow() here.
Having said that, it would be nice to have a selftest in the patch-set.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

[...]

^ permalink raw reply

* Re: [PATCH bpf] bpf: Validate BTF repeated field counts before expansion
From: Eduard Zingerman @ 2026-06-08 19:59 UTC (permalink / raw)
  To: Paul Moses, Kumar Kartikeya Dwivedi
  Cc: martin.lau, ast, daniel, andrii, bpf, song, yonghong.song, jolsa,
	houtao1, linux-kernel, stable
In-Reply-To: <0_PQcsqBnb7dqgu9UPK6jIQvePSosttml5p2ZDoXAzy2AseVjvBu3ihswwZPWr5bZkOUCdH6HUvw3MRKJEwVYJAkT3j5gdNBHZp8l7_cP6Y=@1g4.org>

On Sun, 2026-06-07 at 17:53 +0000, Paul Moses wrote:

[...]

The repro is legit.
Here is a somewhat minimized version as a selftest:

    diff --git a/tools/testing/selftests/bpf/prog_tests/btf.c b/tools/testing/selftests/bpf/prog_tests/btf.c
    index a9de328a8697..212ca4472a89 100644
    --- a/tools/testing/selftests/bpf/prog_tests/btf.c
    +++ b/tools/testing/selftests/bpf/prog_tests/btf.c
    @@ -4258,6 +4258,44 @@ static struct btf_raw_test raw_tests[] = {
            .max_entries = 1,
     },
    
    +{
    +#define N 0x1999999aU
    +       .descr = "repeat fields overflow",
    +       .raw_types = {
    +               /* int */                                               /* [1] */
    +               BTF_TYPE_INT_ENC(0, BTF_INT_SIGNED, 0, 32, 4),
    +               /* struct target {} */                                  /* [2] */
    +               BTF_TYPE_ENC(NAME_TBD, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 0), 1),
    +               /* type_tag "kptr_untrusted" -> target */               /* [3] */
    +               BTF_TYPE_TAG_ENC(NAME_TBD, 2),
    +               /* target * (kptr) */                                   /* [4] */
    +               BTF_PTR_ENC(3),
    +               /* struct outer { target *kp; elem items[N]; } */       /* [5] */
    +               BTF_TYPE_ENC(NAME_TBD, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 2), (N * 8u + 8u)),
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* kp           */
    +               BTF_MEMBER_ENC(NAME_TBD, 6, 64),        /* items        */
    +               /* elem[N] */                                   /* [6] */
    +               BTF_TYPE_ARRAY_ENC(7, 1, N),
    +               /* struct elem { target *f0..f9; } */                   /* [7] */
    +               BTF_TYPE_ENC(NAME_TBD, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 10), 8),
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* f0           */
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* f1           */
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* f2           */
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* f3           */
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* f4           */
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* f5           */
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* f6           */
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* f7           */
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* f8           */
    +               BTF_MEMBER_ENC(NAME_TBD, 4, 0),         /* f9           */
    +               BTF_END_RAW,
    +       },
    +       BTF_STR_SEC("\0target\0kptr_untrusted\0outer\0kp\0items\0elem"
    +                   "\0f0\0f1\0f2\0f3\0f4\0f5\0f6\0f7\0f8\0f9"),
    +       .btf_load_err = true,
    +#undef N
    +},
    +
     }; /* struct btf_raw_test raw_tests[] */
    
     static const char *get_next_str(const char *start, const char *end)

However, as far as I understand the repro hits an overflow only
because `BTF_TYPE_ENC(NAME_TBD, BTF_INFO_ENC(BTF_KIND_STRUCT, 0, 10), 8)`
lies about `struct elem` size. It is specified as 8, while in reality it is 80.
The size of 80 would make `struct outer` unrepresentable in BTF,
because (N * 80 + 8) exceeds u32 range, and that's what btf_type->size uses.
Given that btf_repeat_fields() only traverses structs/arrays but not unions,
I suspect that overflow won't happen in `field_cnt * (repeat_cnt + 1)`
if proper size checks were implemented in btf_struct_check_meta() / btf_struct_resolve().
Even more, If I change "kptr_untrusted" to "kptr_untrusted11" to avoid fields parsing,
the kernel accepts the bogus BTF.

Paul, could you please investigate why is this happening?

^ permalink raw reply

* Re: [PATCH bpf v5 0/2] bpf: Fix kfunc implicit arg injection and add selftest
From: Alexei Starovoitov @ 2026-06-08 19:58 UTC (permalink / raw)
  To: chenyuan_fl
  Cc: andrii, ast, bot+bpf-ci, bpf, chenyuan, clm, daniel, eddyz87,
	ihor.solodrai, jolsa, linux-kernel, martin.lau, martin.lau,
	memxor, song, yonghong.song
In-Reply-To: <20260608142618.3064380-1-chenyuan_fl@163.com>

On Mon Jun 8, 2026 at 7:26 AM PDT, chenyuan_fl wrote:
> From: Yuan Chen <chenyuan@kylinos.cn>
>
> v4: https://lore.kernel.org/bpf/20260602093836.2632714-1-chenyuan_fl@163.com/
>
> v1 made btf_types_are_same() cross-BTF aware by comparing kind, size
> and name.  Alan Maguire pointed out that this would be too permissive
> since two different structs can share the same name and size.
>
> v2 used a name-based classifier (get_kfunc_arg_inject_type).
>
> v3 splits the combined is_kfunc_arg_ignore || is_kfunc_arg_implicit
> check in check_kfunc_args(), so that an implicit argument reaching
> is_kfunc_arg_implicit() without any prior handler is rejected with
> -EFAULT instead of silently skipped (Eduard Zingerman).  This prevents
> the silent fallthrough that occurred when module BTF was inconsistent
> with vmlinux (e.g. pahole 1.30 breaking distilled base dedup).
>
> v4: fix a build error in the selftest.
>
> v5 changes:
> - Patch 1: extend the kfunc allow-list in check_kfunc_args() to cover
>   all kfuncs with implicit args handled in bpf_fixup_kfunc_call():
>   obj_new, percpu_obj_new, obj_drop, percpu_obj_drop,
>   refcount_acquire, list_push, rbtree_add (Eduard Zingerman).
> - Patch 2: simplify the selftest by verifying aux->name in
>   bpf_kfunc_implicit_arg() instead of adding a separate struct_ops
>   test program with inline assembly (Eduard Zingerman).  This is
>   a single-file change exercised by the existing kfunc_implicit_args
>   selftest.

Pls address feedback from Eduard and respin with [PATCH bpf-next v6] subject.

pw-bot: cr


^ permalink raw reply

* Re: [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing
From: Kumar Kartikeya Dwivedi @ 2026-06-08 18:53 UTC (permalink / raw)
  To: Justin Suess, Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, kkd, kernel-team
In-Reply-To: <aibxqG77bragcPos@zenbox>

On Mon Jun 8, 2026 at 7:47 PM CEST, Justin Suess wrote:
> On Mon, Jun 08, 2026 at 04:48:34PM +0200, Kumar Kartikeya Dwivedi wrote:
>> The is_tracing_prog_type() predicate omitted BPF_PROG_TYPE_TRACING even
>> though fentry, fexit, fmod_ret, raw_tp BTF and similar programs have the
>> same execution-context concerns as the tracing program types already
>> covered by the helper.
>>
>> This matters for map compatibility checks that reject bpf_spin_lock,
>> bpf_list_head and bpf_rb_root in tracing contexts. BPF_PROG_TYPE_TRACING
>> programs can run from arbitrary instrumented contexts, including places
>> where taking these locks or manipulating graph roots is not safe.
>> BPF_TRACE_ITER is different: iterator programs run from task context, so
>> we continue to exclude them.
>>
>> This can reject existing fentry/fexit-style programs that use map values
>> with these fields. Such programs were accepted only because the
>> predicate missed this program type; their use depends on semantics the
>> verifier already rejects for equivalent tracing hooks.
>>
>> Move is_tracing_prog_type() checks from check_map_prog_compatibility()
>> to points where the fields are actually used to avoid preemptively
>> rejecting tracing programs that use maps with such fields but do not
>> touch these fields.
>>
>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> ---
> Thanks for sending this, looks pretty good to me.
>>  kernel/bpf/verifier.c | 37 ++++++++++++++++++++++---------------
>>  1 file changed, 22 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index ed7ba0e6a9ce..26bfb4465725 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -6967,6 +6967,11 @@ static int process_spin_lock(struct bpf_verifier_env *env, struct bpf_reg_state
>>  	u32 spin_lock_off;
>>  	int err;
>>
>> +	if (is_tracing_prog_type(env->prog)) {
>> +		verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
>> +		return -EINVAL;
>> +	}
>> +
>>  	if (!is_const) {
>>  		verbose(env,
>>  			"%s doesn't have constant offset. %s_lock has to be at the constant offset\n",
>> @@ -12222,6 +12227,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>>  				return ret;
>>  			break;
>>  		case KF_ARG_PTR_TO_LIST_HEAD:
>> +			if (is_tracing_prog_type(env->prog)) {
>> +				verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
>> +				return -EINVAL;
>> +			}
>>  			if (reg->type != PTR_TO_MAP_VALUE &&
>>  			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>>  				verbose(env, "%s expected pointer to map value or allocated object\n",
>> @@ -12238,6 +12247,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>>  				return ret;
>>  			break;
>>  		case KF_ARG_PTR_TO_RB_ROOT:
>> +			if (is_tracing_prog_type(env->prog)) {
>> +				verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
>> +				return -EINVAL;
>> +			}
>>  			if (reg->type != PTR_TO_MAP_VALUE &&
>>  			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>>  				verbose(env, "%s expected pointer to map value or allocated object\n",
>> @@ -17664,9 +17677,11 @@ static int check_pseudo_btf_id(struct bpf_verifier_env *env,
>>  	return __add_used_btf(env, btf);
>>  }
>>
>> -static bool is_tracing_prog_type(enum bpf_prog_type type)
>> +static bool is_tracing_prog_type(const struct bpf_prog *prog)
>>  {
>> -	switch (type) {
>> +	switch (resolve_prog_type(prog)) {
>> +	case BPF_PROG_TYPE_TRACING:
>> +		return prog->expected_attach_type != BPF_TRACE_ITER;
> Seems reasonable to exclude this attach type as they run always in task
> context as stated. This seems like the safest approach.
>
> In practice, *most* tracing type programs will be unreachable from
> NMI. But the edge cases are difficult to enumerate...
>
> fentry / fexit are nearly impossible to exclude, since there's no
> good way to answer "can this function be called from an nmi handler"
> at verification time. Such a list would be unmaintainable.
>
> Raw tracepoints are similar. Indeed, it would be *possible* to
> specifically exclude the tracepoints that cannot be called from NMI. But
> this would require maintaining a large list and would be brittle.
>
> fmodify_return attach types are whitelisted to LSM hooks and
> ALLOW_ERROR_INJECTION functions. A quick grep for ALLOW_ERROR_INJECTION
> doesn't bring anything that looks obviously NMI-relevant, but again, I
> think allowing it would be asking for trouble later.
>
> So I think this is a good balance...
>>  	case BPF_PROG_TYPE_KPROBE:
>>  	case BPF_PROG_TYPE_TRACEPOINT:
>>  	case BPF_PROG_TYPE_PERF_EVENT:
>> @@ -17697,24 +17712,16 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
>>  		return -EACCES;
>>  	}
>>
>> -	if (btf_record_has_field(map->record, BPF_LIST_HEAD) ||
>> -	    btf_record_has_field(map->record, BPF_RB_ROOT)) {
>> -		if (is_tracing_prog_type(prog_type)) {
>> -			verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
>> -			return -EINVAL;
>> -		}
>> -	}
>> -
>>  	if (btf_record_has_field(map->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
>>  		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
>>  			verbose(env, "socket filter progs cannot use bpf_spin_lock yet\n");
>>  			return -EINVAL;
>>  		}
>> -
>> -		if (is_tracing_prog_type(prog_type)) {
>> -			verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
>> -			return -EINVAL;
>> -		}
>> +		/*
>> +		 * Rejecting tracing progs accessing maps with bpf_spin_lock in
>> +		 * them here would be too conservative; let's defer rejection
>> +		 * until seeing first use.
>> +		 */
>
> In practice, I think most programs using those spinlocks w/ fentry
> weren't broken.
>
> They were saved only by nature of the fact that the particular
> functions that attached to were uncallable / never were
> called from NMI. Which is the case for 99% of those programs.
>
> So this case is where fixing the other 1% of cases are going
> to have consequences for the remainder.
>
> This is a tough decision to have to make. There's options, but all of
> them come with downsides.
>
> 1. (this patch) break tracing programs using bpf_spin_lock.
> 2. Somehow prevent NMI reentrancy of tracing progs on the same cpu. probably
> by just ignoring the reentrant calls, much how we handle the recursive
> tracepoint prevention for bpf. (complicated, may have side effects)
> 3. Inject a fixup runtime check for in_nmi for every tracing program
> and tail callable program, exiting before executing anything
> (still breakage, just at runtime now)
>
> If breakage is acceptible here, the approach this patch takes is the
> most *correct* way to do it, and the way it should have been done from
> day one, but having users relying on the "broken" behavior makes it
> harder to fix now.
>
> This seems like a maintainer call.

Yeah, you outlined the tradeoffs perfectly. What I'm leaning toward is splitting
this one out of the series, focus on the other two for now in v2, and send a new
set with this fix + opening up bpf_res_spin_lock() for use as a replacement. We
can then emit some guidance for the user to switch when their program breaks to
a working replacement which hopefully allows a reasonable migration path while
reducing this technical debt.

^ permalink raw reply

* Re: [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle
From: Kumar Kartikeya Dwivedi @ 2026-06-08 18:50 UTC (permalink / raw)
  To: Justin Suess, Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, kkd, kernel-team
In-Reply-To: <aicB9SZfVnUV1rz-@zenbox>

On Mon Jun 8, 2026 at 8:01 PM CEST, Justin Suess wrote:
> On Mon, Jun 08, 2026 at 04:48:36PM +0200, Kumar Kartikeya Dwivedi wrote:
>> From: Justin Suess <utilityemal77@gmail.com>
>>
>> Map update and delete paths currently call bpf_obj_free_fields() when a
>> value is being replaced or recycled. That makes field destruction depend
>> on the context of the update/delete operation. For tracing programs this
>> can include NMI context, where referenced kptr destructors, uptr
>> unpinning, and graph root destruction are not generally safe.
>>
>> Introduce bpf_obj_cancel_fields() for the reusable-value path. It only
>> performs NMI-safe cleanup for timer, workqueue, and task_work fields.
>> Fields that need full destruction are left attached to the recycled value
>> and are destroyed by the final cleanup path instead.
>>
>> Switch array and hashtab update/delete/recycle paths to this cancel
>> helper. Keep bpf_obj_free_fields() for final map destruction and for
>> bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator
>> destructors, so teardown continues to walk the normal and extra elements
>> and fully destroy their fields.
>>
>> This deliberately relaxes the eager-free semantics of map update/delete
>> for special fields. Programs that relied on a recycled map slot becoming
>> empty immediately after update/delete were relying on behavior that
>> cannot be implemented safely from every BPF execution context without
>> offloading arbitrary destructors.
>>
>> There is a chance this change breaks programs making assumptions
>> regarding the eager freeing of fields. If so, we can relax semantics to
>> cancellation only when irqs_disabled() is true in the future. However,
>> theoretically, map values that get reused eagerly already have weaker
>> guarantees as parallel users can recreate freed fields before the new
>> element becomes visible again.
>>
>> Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
>> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
>> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>> ---
>> [...]
>>  void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>>  {
> Would a WARN_ON_ONCE for in_nmi in bpf_obj_free_fields help spot these
> kind of bugs in the future?
>
> Usually calling bpf_obj_free_fields in_nmi won't result in a deadlock
> unless specific conditions are met (ie last reference to a refcounted
> object, using a workqueue, etc).
>
> Adding a check here would quickly surface bugs of this class.
>

In general, the idea makes sense when errors are impossible, but people
increasingly run with panic-on-warn enabled, which sort of beats the purpose,
hence I'd rather avoid it.

If you think about it, there are several guardrails that protect us here. First,
we maintain a whitelist for stuff that is invoked in the bpf_obj_cancel_fields()
path, so we don't inadvertently start operating on any new, possibly unsafe,
fields. Second, the free path mostly happens in the map destruction path now,
hence we need to worry less about random contexts it could be invoked in anymore.

>> [...]


^ permalink raw reply

* Re: [PATCH bpf-next v2 3/3] libbpf: fix feature probe failures for implicit BPF tokens
From: Andrii Nakryiko @ 2026-06-08 18:18 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: chenyuan_fl, andrii, ast, bpf, chenyuan, daniel, eddyz87,
	linux-kselftest, shuah
In-Reply-To: <a1561e36-08dc-4ad1-848b-88974f529d10@gmail.com>

On Thu, Jun 4, 2026 at 7:40 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> On 6/4/26 4:10 AM, chenyuan_fl@163.com wrote:
> > From: Yuan Chen <chenyuan@kylinos.cn>
> >
> > probe_fd() converts BPF syscall failures to return value 0,
> > which feat_supported() interprets as 'feature absent' (ret==0),
> > even when the probe failed due to a BPF token's restricted
> > program/map type mask.
> >
> > In a user namespace with an implicit BPF token, the sequence is:
> >   1. bpf_object_open() -> no feat_cache yet -> global cache used
> >   2. bpf_object_prepare_token() -> creates token, sets feat_cache->token_fd
> >   3. bpf_object__create_maps() -> feat_supported() runs probe
> >      -> bpf_prog_load(SOCKET_FILTER) with token -> -EPERM
> >      -> probe_fd(-EPERM) -> returns 0
> >      -> feat_supported sees ret==0 -> FEAT_MISSING
> >      -> internal maps (global data) skipped -> verifier EINVAL
> >
> > Fix by making probe_fd() return the negative error on failure
> > instead of 0, so feat_supported() enters the ret<0 path.
> >
> > Additionally add a token_fd rescue in feat_supported() for the
> > ret<0 case: when a BPF token is present, a failed feature probe
> > is treated as 'feature supported' because token creation itself
> > proves the kernel BPF subsystem works.  Real BPF issues will
> > be caught during actual program/map loading.
> >
> > Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
> > ---
>
> It looks wrong to say that any feature is supported just because there is
> token_fd present. In patch 1 we only checked that BPF is available, so
> it was ok skipping probing. This one looks a bit more risky.

Yep, agreed, this is not the right approach. Please double check what
kind of issues you still have with only patch 1 applied, and let's
discuss possible solutions (if there are still problems).

> Is the reason this code fails because of BPF_PROG_TYPE_SOCKET_FILTER not enabled
> in the token? I suggest fetch the first supported program type from the token and
> pass it down to bpf_prog_load().
>
> >  tools/lib/bpf/features.c | 23 +++++++++++++++++++++--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/tools/lib/bpf/features.c b/tools/lib/bpf/features.c
> > index b7e388f99d0b..f934452e52f0 100644
> > --- a/tools/lib/bpf/features.c
> > +++ b/tools/lib/bpf/features.c
> > @@ -16,7 +16,14 @@ int probe_fd(int fd)
> >  {
> >       if (fd >= 0)
> >               close(fd);
> > -     return fd >= 0;
> > +     /* Return 1 on success, negative error on failure, so
> > +      * feat_supported() can distinguish probe errors from
> > +      * genuine feature absence.  When a BPF token is present,
> > +      * a negative return triggers the rescue path that marks
> > +      * the feature as SUPPORTED (token creation itself proves
> > +      * the kernel BPF subsystem works).
> > +      */
> > +     return fd >= 0 ? 1 : fd;
> >  }
> >
> >  static int probe_kern_prog_name(int token_fd)
> > @@ -725,9 +732,21 @@ bool feat_supported(struct kern_feature_cache *cache, enum kern_feature_id feat_
> >               } else if (ret == 0) {
> >                       WRITE_ONCE(cache->res[feat_id], FEAT_MISSING);
> >               } else {
> > +                     /*
> > +                      * A BPF token may restrict which program/map types
> > +                      * are permitted, causing the probe to fail even
> > +                      * though the kernel supports the feature.  When a
> > +                      * token is present the probe is best-effort: BPF
> > +                      * token creation itself proves the kernel has a
> > +                      * working BPF subsystem.  Real BPF issues will be
> > +                      * caught during actual program/map loading.
> > +                      */
> > +                     if (cache->token_fd)
> > +                             WRITE_ONCE(cache->res[feat_id], FEAT_SUPPORTED);
> > +                     else
> > +                             WRITE_ONCE(cache->res[feat_id], FEAT_MISSING);
> >                       pr_warn("Detection of kernel %s support failed: %s\n",
> >                               feat->desc, errstr(ret));
> > -                     WRITE_ONCE(cache->res[feat_id], FEAT_MISSING);
> >               }
> >       }
> >
>

^ permalink raw reply

* Re: [PATCH bpf-next v2 1/3] libbpf: Skip bpf_object__probe_loading() when BPF token is in use
From: Andrii Nakryiko @ 2026-06-08 18:17 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: chenyuan_fl, andrii, ast, bpf, chenyuan, daniel, eddyz87,
	linux-kselftest, shuah
In-Reply-To: <e23baad8-1709-4210-81c8-22741b9dfb3e@gmail.com>

On Thu, Jun 4, 2026 at 4:16 AM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> On 6/4/26 4:10 AM, chenyuan_fl@163.com wrote:
> > From: Yuan Chen <chenyuan@kylinos.cn>
> >
> > bpf_object__probe_loading() tries to load trivial SOCKET_FILTER and
> > TRACEPOINT programs to verify the BPF environment works.  When a
> > BPF token is in use with restricted program type permissions, these
> > probe loads may fail because the token does not allow the specific
> > program types, even though BPF loading is perfectly functional.
> >
> > Fix by skipping the probe when a token FD is present: BPF token
> > creation itself proves the kernel has a working BPF subsystem.
> > Real BPF issues will be caught during actual program and map loading.
> >
> > Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
> > ---
> >  tools/lib/bpf/libbpf.c | 10 +++-------
> >  1 file changed, 3 insertions(+), 7 deletions(-)
> >
> > diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> > index ab2071fdd3e8..7e21c5eb9533 100644
> > --- a/tools/lib/bpf/libbpf.c
> > +++ b/tools/lib/bpf/libbpf.c
> > @@ -5167,12 +5167,8 @@ bpf_object__probe_loading(struct bpf_object *obj)
> >               BPF_EXIT_INSN(),
> >       };
> >       int ret, insn_cnt = ARRAY_SIZE(insns);
> > -     LIBBPF_OPTS(bpf_prog_load_opts, opts,
> > -             .token_fd = obj->token_fd,
> > -             .prog_flags = obj->token_fd ? BPF_F_TOKEN_FD : 0,
> > -     );
> >
> > -     if (obj->gen_loader)
> > +     if (obj->gen_loader || obj->token_fd)
>
> This change makes sense, bpf_object__probe_loading() does very basic
> check that BPF works on a given kernel. In this case, when token is
> passed, this is ok to skip. An alternative, trying to make
> bpf_object__probe_loading() correctly token aware is worse.
>
> Acked-by: Mykyta Yatsenko <yatsenko@meta.com>

This is the only patch in the series that makes sense to apply, but I
don't see this series in patchworks. Can you please resend just this
one to BPF mailing list? Thanks

>
> >               return 0;
> >
> >       ret = bump_rlimit_memlock();
> > @@ -5181,9 +5177,9 @@ bpf_object__probe_loading(struct bpf_object *obj)
> >                       errstr(ret));
> >
> >       /* make sure basic loading works */
> > -     ret = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, NULL, "GPL", insns, insn_cnt, &opts);
> > +     ret = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, NULL, "GPL", insns, insn_cnt, NULL);
> >       if (ret < 0)
> > -             ret = bpf_prog_load(BPF_PROG_TYPE_TRACEPOINT, NULL, "GPL", insns, insn_cnt, &opts);
> > +             ret = bpf_prog_load(BPF_PROG_TYPE_TRACEPOINT, NULL, "GPL", insns, insn_cnt, NULL);
> >       if (ret < 0) {
> >               ret = errno;
> >               pr_warn("Error in %s(): %s. Couldn't load trivial BPF program. Make sure your kernel supports BPF (CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough value.\n",
>

^ permalink raw reply

* Re: [PATCH bpf-next v2 2/3] selftests/bpf: Add ARRAY map type to token tests that use struct_ops global data
From: Andrii Nakryiko @ 2026-06-08 18:14 UTC (permalink / raw)
  To: chenyuan_fl
  Cc: andrii, ast, bpf, chenyuan, daniel, eddyz87, linux-kselftest,
	shuah
In-Reply-To: <20260604031033.806727-3-chenyuan_fl@163.com>

On Wed, Jun 3, 2026 at 8:11 PM <chenyuan_fl@163.com> wrote:
>
> From: Yuan Chen <chenyuan@kylinos.cn>
>
> obj_priv_btf_success, obj_priv_implicit_token and
> obj_priv_implicit_token_envvar use the dummy_st_ops_success skeleton,
> whose struct_ops programs reference global variables backed by ARRAY
> maps.  Without BPF_MAP_TYPE_ARRAY in the token allowed map types,
> bpf_object__load() fails because the kernel rejects ARRAY map creation
> for global data.
>
> Add BPF_MAP_TYPE_ARRAY to the token permission mask for these three
> subtests so their skeletons can load successfully.
>
> This is distinct from the libbpf probe issue: the skeletons genuinely
> require ARRAY maps for their global data, not merely for probing.
>
> Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
> ---
>  tools/testing/selftests/bpf/prog_tests/token.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/token.c b/tools/testing/selftests/bpf/prog_tests/token.c
> index f2f5d36ae00a..1812de87bedb 100644
> --- a/tools/testing/selftests/bpf/prog_tests/token.c
> +++ b/tools/testing/selftests/bpf/prog_tests/token.c
> @@ -1214,7 +1214,7 @@ void serial_test_token(void)
>                 struct bpffs_opts opts = {
>                         /* allow BTF loading */
>                         .cmds = bit(BPF_BTF_LOAD) | bit(BPF_MAP_CREATE) | bit(BPF_PROG_LOAD),
> -                       .maps = bit(BPF_MAP_TYPE_STRUCT_OPS),
> +                       .maps = bit(BPF_MAP_TYPE_STRUCT_OPS) | bit(BPF_MAP_TYPE_ARRAY),
>                         .progs = bit(BPF_PROG_TYPE_STRUCT_OPS),
>                         .attachs = ~0ULL,
>                 };
> @@ -1225,7 +1225,7 @@ void serial_test_token(void)
>                 struct bpffs_opts opts = {
>                         /* allow BTF loading */
>                         .cmds = bit(BPF_BTF_LOAD) | bit(BPF_MAP_CREATE) | bit(BPF_PROG_LOAD),
> -                       .maps = bit(BPF_MAP_TYPE_STRUCT_OPS),
> +                       .maps = bit(BPF_MAP_TYPE_STRUCT_OPS) | bit(BPF_MAP_TYPE_ARRAY),
>                         .progs = bit(BPF_PROG_TYPE_STRUCT_OPS),
>                         .attachs = ~0ULL,
>                 };
> @@ -1236,7 +1236,7 @@ void serial_test_token(void)
>                 struct bpffs_opts opts = {
>                         /* allow BTF loading */
>                         .cmds = bit(BPF_BTF_LOAD) | bit(BPF_MAP_CREATE) | bit(BPF_PROG_LOAD),
> -                       .maps = bit(BPF_MAP_TYPE_STRUCT_OPS),
> +                       .maps = bit(BPF_MAP_TYPE_STRUCT_OPS) | bit(BPF_MAP_TYPE_ARRAY),

Hm.. ARRAY map is unprivileged, you shouldn't need this. Are you sure
this test failing for you is not a consequence of the issue you fixed
in the first patch?

>                         .progs = bit(BPF_PROG_TYPE_STRUCT_OPS),
>                         .attachs = ~0ULL,
>                 };
> --
> 2.54.0
>

^ permalink raw reply

* Re: [PATCH v4 2/4] perf: Use a union to clear branch entry bitfields
From: James Clark @ 2026-06-08 18:06 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: Puranjay Mohan, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Song Liu, Yonghong Song, Will Deacon,
	Mark Rutland, Catalin Marinas, Leo Yan, Rob Herring,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, Ian Rogers, Adrian Hunter, Shuah Khan, Breno Leitao,
	Ravi Bangoria, Stephane Eranian, Kumar Kartikeya Dwivedi,
	Usama Arif, linux-arm-kernel, linux-perf-users, linux-kselftest,
	linux-kernel, kernel-team
In-Reply-To: <20260527121207.2312181-3-puranjay@kernel.org>



On 27/05/2026 1:11 pm, Puranjay Mohan wrote:
> perf_clear_branch_entry_bitfields() zeroes individual bitfields of struct
> perf_branch_entry but has repeatedly fallen out of sync when new fields
> were added (new_type and priv were missed).
> 
> Wrap the bitfields in an anonymous struct inside a union with a u64
> bitfields member, and clear them all with a single assignment. This
> avoids having to update the clearing function every time a new bitfield
> is added.
> 
> Fixes: bfe4daf850f4 ("perf/core: Add perf_clear_branch_entry_bitfields() helper")
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>   include/linux/perf_event.h            |  9 +--------
>   include/uapi/linux/perf_event.h       | 25 +++++++++++++++----------
>   tools/include/uapi/linux/perf_event.h | 25 +++++++++++++++----------
>   3 files changed, 31 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 48d851fbd8ea..f7360c43f902 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1474,14 +1474,7 @@ static inline u32 perf_sample_data_size(struct perf_sample_data *data,
>    */
>   static inline void perf_clear_branch_entry_bitfields(struct perf_branch_entry *br)
>   {
> -	br->mispred	= 0;
> -	br->predicted	= 0;
> -	br->in_tx	= 0;
> -	br->abort	= 0;
> -	br->cycles	= 0;
> -	br->type	= 0;
> -	br->spec	= PERF_BR_SPEC_NA;
> -	br->reserved	= 0;
> +	br->bitfields = 0;

The comment above here says:

  * The to and from fields are not cleared because they are
  * systematically modified by caller.

But this is an inline function and the to/from fields are always 
assigned right after calling it. Surely the compiler can work that out 
and this was a premature optimization.

If we rename it to perf_clear_branch_entry() and do:

   *br = (struct perf_branch_entry) {0};

Then we can delete a big comment, remove a potential source of bugs if 
even more members are added, not have to update the uapi headers and 
simplify it instead of working around the original issue.

>   }
>   
>   extern void perf_output_sample(struct perf_output_handle *handle,
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index fd10aa8d697f..c2e7b1b1c4fa 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -1491,16 +1491,21 @@ union perf_mem_data_src {
>   struct perf_branch_entry {
>   	__u64	from;
>   	__u64	to;
> -	__u64	mispred   :  1, /* target mispredicted */
> -		predicted :  1, /* target predicted */
> -		in_tx     :  1, /* in transaction */
> -		abort     :  1, /* transaction abort */
> -		cycles    : 16, /* cycle count to last branch */
> -		type      :  4, /* branch type */
> -		spec      :  2, /* branch speculation info */
> -		new_type  :  4, /* additional branch type */
> -		priv      :  3, /* privilege level */
> -		reserved  : 31;
> +	union {
> +		struct {
> +			__u64	mispred   :  1, /* target mispredicted */
> +				predicted :  1, /* target predicted */
> +				in_tx     :  1, /* in transaction */
> +				abort     :  1, /* transaction abort */
> +				cycles    : 16, /* cycle count to last branch */
> +				type      :  4, /* branch type */
> +				spec      :  2, /* branch speculation info */
> +				new_type  :  4, /* additional branch type */
> +				priv      :  3, /* privilege level */
> +				reserved  : 31;
> +		};
> +		__u64	bitfields;
> +	};
>   };
>   
>   /* Size of used info bits in struct perf_branch_entry */
> diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
> index fd10aa8d697f..c2e7b1b1c4fa 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -1491,16 +1491,21 @@ union perf_mem_data_src {
>   struct perf_branch_entry {
>   	__u64	from;
>   	__u64	to;
> -	__u64	mispred   :  1, /* target mispredicted */
> -		predicted :  1, /* target predicted */
> -		in_tx     :  1, /* in transaction */
> -		abort     :  1, /* transaction abort */
> -		cycles    : 16, /* cycle count to last branch */
> -		type      :  4, /* branch type */
> -		spec      :  2, /* branch speculation info */
> -		new_type  :  4, /* additional branch type */
> -		priv      :  3, /* privilege level */
> -		reserved  : 31;
> +	union {
> +		struct {
> +			__u64	mispred   :  1, /* target mispredicted */
> +				predicted :  1, /* target predicted */
> +				in_tx     :  1, /* in transaction */
> +				abort     :  1, /* transaction abort */
> +				cycles    : 16, /* cycle count to last branch */
> +				type      :  4, /* branch type */
> +				spec      :  2, /* branch speculation info */
> +				new_type  :  4, /* additional branch type */
> +				priv      :  3, /* privilege level */
> +				reserved  : 31;
> +		};
> +		__u64	bitfields;
> +	};
>   };
>   
>   /* Size of used info bits in struct perf_branch_entry */


^ permalink raw reply

* Re: [PATCH bpf-next v1 3/5] bpf: Cancel special fields on map value recycle
From: Justin Suess @ 2026-06-08 18:01 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, kkd, kernel-team
In-Reply-To: <20260608144841.1732406-4-memxor@gmail.com>

On Mon, Jun 08, 2026 at 04:48:36PM +0200, Kumar Kartikeya Dwivedi wrote:
> From: Justin Suess <utilityemal77@gmail.com>
> 
> Map update and delete paths currently call bpf_obj_free_fields() when a
> value is being replaced or recycled. That makes field destruction depend
> on the context of the update/delete operation. For tracing programs this
> can include NMI context, where referenced kptr destructors, uptr
> unpinning, and graph root destruction are not generally safe.
> 
> Introduce bpf_obj_cancel_fields() for the reusable-value path. It only
> performs NMI-safe cleanup for timer, workqueue, and task_work fields.
> Fields that need full destruction are left attached to the recycled value
> and are destroyed by the final cleanup path instead.
> 
> Switch array and hashtab update/delete/recycle paths to this cancel
> helper. Keep bpf_obj_free_fields() for final map destruction and for
> bpf_mem_alloc destructors. Preallocated hashtabs do not have allocator
> destructors, so teardown continues to walk the normal and extra elements
> and fully destroy their fields.
> 
> This deliberately relaxes the eager-free semantics of map update/delete
> for special fields. Programs that relied on a recycled map slot becoming
> empty immediately after update/delete were relying on behavior that
> cannot be implemented safely from every BPF execution context without
> offloading arbitrary destructors.
> 
> There is a chance this change breaks programs making assumptions
> regarding the eager freeing of fields. If so, we can relax semantics to
> cancellation only when irqs_disabled() is true in the future. However,
> theoretically, map values that get reused eagerly already have weaker
> guarantees as parallel users can recreate freed fields before the new
> element becomes visible again.
> 
> Fixes: 14a324f6a67e ("bpf: Wire up freeing of referenced kptr")
> Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> Co-developed-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
>  include/linux/bpf.h                           |  1 +
>  kernel/bpf/arraymap.c                         |  8 ++---
>  kernel/bpf/hashtab.c                          | 32 +++++++++++--------
>  kernel/bpf/syscall.c                          | 27 ++++++++++++++++
>  .../selftests/bpf/prog_tests/htab_update.c    |  4 +--
>  .../selftests/bpf/prog_tests/map_kptr.c       | 10 +-----
>  .../testing/selftests/bpf/progs/htab_update.c |  4 +--
>  7 files changed, 55 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 0654d2ffadc1..fc24b81cc206 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2717,6 +2717,7 @@ bool btf_record_equal(const struct btf_record *rec_a, const struct btf_record *r
>  void bpf_obj_free_timer(const struct btf_record *rec, void *obj);
>  void bpf_obj_free_workqueue(const struct btf_record *rec, void *obj);
>  void bpf_obj_free_task_work(const struct btf_record *rec, void *obj);
> +void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj);
>  void bpf_obj_free_fields(const struct btf_record *rec, void *obj);
>  void __bpf_obj_drop_impl(void *p, const struct btf_record *rec, bool percpu);
>  
> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
> index e6271a2bf6d6..4e8d6fe6999b 100644
> --- a/kernel/bpf/arraymap.c
> +++ b/kernel/bpf/arraymap.c
> @@ -384,7 +384,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	if (array->map.map_type == BPF_MAP_TYPE_PERCPU_ARRAY) {
>  		val = this_cpu_ptr(array->pptrs[index & array->index_mask]);
>  		copy_map_value(map, val, value);
> -		bpf_obj_free_fields(array->map.record, val);
> +		bpf_obj_cancel_fields(array->map.record, val);
>  	} else {
>  		val = array->value +
>  			(u64)array->elem_size * (index & array->index_mask);
> @@ -392,7 +392,7 @@ static long array_map_update_elem(struct bpf_map *map, void *key, void *value,
>  			copy_map_value_locked(map, val, value, false);
>  		else
>  			copy_map_value(map, val, value);
> -		bpf_obj_free_fields(array->map.record, val);
> +		bpf_obj_cancel_fields(array->map.record, val);
>  	}
>  	return 0;
>  }
> @@ -432,14 +432,14 @@ int bpf_percpu_array_update(struct bpf_map *map, void *key, void *value,
>  		cpu = map_flags >> 32;
>  		ptr = per_cpu_ptr(pptr, cpu);
>  		copy_map_value(map, ptr, value);
> -		bpf_obj_free_fields(array->map.record, ptr);
> +		bpf_obj_cancel_fields(array->map.record, ptr);
>  		goto unlock;
>  	}
>  	for_each_possible_cpu(cpu) {
>  		ptr = per_cpu_ptr(pptr, cpu);
>  		val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu;
>  		copy_map_value(map, ptr, val);
> -		bpf_obj_free_fields(array->map.record, ptr);
> +		bpf_obj_cancel_fields(array->map.record, ptr);
>  	}
>  unlock:
>  	rcu_read_unlock();
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index b4366cad3cfa..a9b2b882b90f 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -243,6 +243,10 @@ static void htab_free_prealloced_fields(struct bpf_htab *htab)
>  
>  	if (IS_ERR_OR_NULL(htab->map.record))
>  		return;
> +	/*
> +	 * Preallocated maps do not have a bpf_mem_alloc destructor, so fully
> +	 * destroy every element, including the extra elements.
> +	 */
>  	if (htab_has_extra_elems(htab))
>  		num_entries += num_possible_cpus();
>  	for (i = 0; i < num_entries; i++) {
> @@ -833,8 +837,8 @@ static int htab_lru_map_gen_lookup(struct bpf_map *map,
>  	return insn - insn_buf;
>  }
>  
> -static void check_and_free_fields(struct bpf_htab *htab,
> -				  struct htab_elem *elem)
> +static void check_and_cancel_fields(struct bpf_htab *htab,
> +				    struct htab_elem *elem)
>  {
>  	if (IS_ERR_OR_NULL(htab->map.record))
>  		return;
> @@ -844,11 +848,11 @@ static void check_and_free_fields(struct bpf_htab *htab,
>  		int cpu;
>  
>  		for_each_possible_cpu(cpu)
> -			bpf_obj_free_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
> +			bpf_obj_cancel_fields(htab->map.record, per_cpu_ptr(pptr, cpu));
>  	} else {
>  		void *map_value = htab_elem_value(elem, htab->map.key_size);
>  
> -		bpf_obj_free_fields(htab->map.record, map_value);
> +		bpf_obj_cancel_fields(htab->map.record, map_value);
>  	}
>  }
>  
> @@ -883,7 +887,7 @@ static bool htab_lru_map_delete_node(void *arg, struct bpf_lru_node *node)
>  	htab_unlock_bucket(b, flags);
>  
>  	if (l == tgt_l)
> -		check_and_free_fields(htab, l);
> +		check_and_cancel_fields(htab, l);
>  	return l == tgt_l;
>  }
>  
> @@ -948,7 +952,7 @@ static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
>  
>  static void htab_elem_free(struct bpf_htab *htab, struct htab_elem *l)
>  {
> -	check_and_free_fields(htab, l);
> +	check_and_cancel_fields(htab, l);
>  
>  	if (htab->map.map_type == BPF_MAP_TYPE_PERCPU_HASH)
>  		bpf_mem_cache_free(&htab->pcpu_ma, l->ptr_to_pptr);
> @@ -1001,7 +1005,7 @@ static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
>  
>  	if (htab_is_prealloc(htab)) {
>  		bpf_map_dec_elem_count(&htab->map);
> -		check_and_free_fields(htab, l);
> +		check_and_cancel_fields(htab, l);
>  		pcpu_freelist_push(&htab->freelist, &l->fnode);
>  	} else {
>  		dec_elem_count(htab);
> @@ -1018,7 +1022,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
>  		/* copy true value_size bytes */
>  		ptr = this_cpu_ptr(pptr);
>  		copy_map_value(&htab->map, ptr, value);
> -		bpf_obj_free_fields(htab->map.record, ptr);
> +		bpf_obj_cancel_fields(htab->map.record, ptr);
>  	} else {
>  		u32 size = round_up(htab->map.value_size, 8);
>  		void *val;
> @@ -1028,7 +1032,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
>  			cpu = map_flags >> 32;
>  			ptr = per_cpu_ptr(pptr, cpu);
>  			copy_map_value(&htab->map, ptr, value);
> -			bpf_obj_free_fields(htab->map.record, ptr);
> +			bpf_obj_cancel_fields(htab->map.record, ptr);
>  			return;
>  		}
>  
> @@ -1036,7 +1040,7 @@ static void pcpu_copy_value(struct bpf_htab *htab, void __percpu *pptr,
>  			ptr = per_cpu_ptr(pptr, cpu);
>  			val = (map_flags & BPF_F_ALL_CPUS) ? value : value + size * cpu;
>  			copy_map_value(&htab->map, ptr, val);
> -			bpf_obj_free_fields(htab->map.record, ptr);
> +			bpf_obj_cancel_fields(htab->map.record, ptr);
>  		}
>  	}
>  }
> @@ -1252,11 +1256,11 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  	if (l_old) {
>  		hlist_nulls_del_rcu(&l_old->hash_node);
>  
> -		/* l_old has already been stashed in htab->extra_elems, free
> -		 * its special fields before it is available for reuse.
> +		/* l_old has already been stashed in htab->extra_elems, cancel
> +		 * its reusable special fields before it is available for reuse.
>  		 */
>  		if (htab_is_prealloc(htab))
> -			check_and_free_fields(htab, l_old);
> +			check_and_cancel_fields(htab, l_old);
>  	}
>  	htab_unlock_bucket(b, flags);
>  	if (l_old && !htab_is_prealloc(htab))
> @@ -1269,7 +1273,7 @@ static long htab_map_update_elem(struct bpf_map *map, void *key, void *value,
>  
>  static void htab_lru_push_free(struct bpf_htab *htab, struct htab_elem *elem)
>  {
> -	check_and_free_fields(htab, elem);
> +	check_and_cancel_fields(htab, elem);
>  	bpf_map_dec_elem_count(&htab->map);
>  	bpf_lru_push_free(&htab->lru, &elem->lru_node);
>  }
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index d4188a992bd8..8d12aa3728ea 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -808,6 +808,33 @@ void bpf_obj_free_task_work(const struct btf_record *rec, void *obj)
>  	bpf_task_work_cancel_and_free(obj + rec->task_work_off);
>  }
>  
> +void bpf_obj_cancel_fields(const struct btf_record *rec, void *obj)
> +{
> +	const struct btf_field *fields;
> +	int i;
> +
> +	if (IS_ERR_OR_NULL(rec))
> +		return;
> +	fields = rec->fields;
> +	for (i = 0; i < rec->cnt; i++) {
> +		void *field_ptr = obj + fields[i].offset;
> +
> +		switch (fields[i].type) {
> +		case BPF_TIMER:
> +			bpf_timer_cancel_and_free(field_ptr);
> +			break;
> +		case BPF_WORKQUEUE:
> +			bpf_wq_cancel_and_free(field_ptr);
> +			break;
> +		case BPF_TASK_WORK:
> +			bpf_task_work_cancel_and_free(field_ptr);
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +}
> +
>  void bpf_obj_free_fields(const struct btf_record *rec, void *obj)
>  {
Would a WARN_ON_ONCE for in_nmi in bpf_obj_free_fields help spot these
kind of bugs in the future?

Usually calling bpf_obj_free_fields in_nmi won't result in a deadlock
unless specific conditions are met (ie last reference to a refcounted
object, using a workqueue, etc).

Adding a check here would quickly surface bugs of this class.

> [...]

^ permalink raw reply

* Re: [PATCH bpf-next v1 1/5] bpf: Treat non-iterator tracing progs as tracing
From: Justin Suess @ 2026-06-08 17:47 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eduard Zingerman, Emil Tsalapatis, kkd, kernel-team
In-Reply-To: <20260608144841.1732406-2-memxor@gmail.com>

On Mon, Jun 08, 2026 at 04:48:34PM +0200, Kumar Kartikeya Dwivedi wrote:
> The is_tracing_prog_type() predicate omitted BPF_PROG_TYPE_TRACING even
> though fentry, fexit, fmod_ret, raw_tp BTF and similar programs have the
> same execution-context concerns as the tracing program types already
> covered by the helper.
> 
> This matters for map compatibility checks that reject bpf_spin_lock,
> bpf_list_head and bpf_rb_root in tracing contexts. BPF_PROG_TYPE_TRACING
> programs can run from arbitrary instrumented contexts, including places
> where taking these locks or manipulating graph roots is not safe.
> BPF_TRACE_ITER is different: iterator programs run from task context, so
> we continue to exclude them.
> 
> This can reject existing fentry/fexit-style programs that use map values
> with these fields. Such programs were accepted only because the
> predicate missed this program type; their use depends on semantics the
> verifier already rejects for equivalent tracing hooks.
> 
> Move is_tracing_prog_type() checks from check_map_prog_compatibility()
> to points where the fields are actually used to avoid preemptively
> rejecting tracing programs that use maps with such fields but do not
> touch these fields.
> 
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> ---
Thanks for sending this, looks pretty good to me.
>  kernel/bpf/verifier.c | 37 ++++++++++++++++++++++---------------
>  1 file changed, 22 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index ed7ba0e6a9ce..26bfb4465725 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -6967,6 +6967,11 @@ static int process_spin_lock(struct bpf_verifier_env *env, struct bpf_reg_state
>  	u32 spin_lock_off;
>  	int err;
>  
> +	if (is_tracing_prog_type(env->prog)) {
> +		verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
> +		return -EINVAL;
> +	}
> +
>  	if (!is_const) {
>  		verbose(env,
>  			"%s doesn't have constant offset. %s_lock has to be at the constant offset\n",
> @@ -12222,6 +12227,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>  				return ret;
>  			break;
>  		case KF_ARG_PTR_TO_LIST_HEAD:
> +			if (is_tracing_prog_type(env->prog)) {
> +				verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
> +				return -EINVAL;
> +			}
>  			if (reg->type != PTR_TO_MAP_VALUE &&
>  			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>  				verbose(env, "%s expected pointer to map value or allocated object\n",
> @@ -12238,6 +12247,10 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>  				return ret;
>  			break;
>  		case KF_ARG_PTR_TO_RB_ROOT:
> +			if (is_tracing_prog_type(env->prog)) {
> +				verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
> +				return -EINVAL;
> +			}
>  			if (reg->type != PTR_TO_MAP_VALUE &&
>  			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>  				verbose(env, "%s expected pointer to map value or allocated object\n",
> @@ -17664,9 +17677,11 @@ static int check_pseudo_btf_id(struct bpf_verifier_env *env,
>  	return __add_used_btf(env, btf);
>  }
>  
> -static bool is_tracing_prog_type(enum bpf_prog_type type)
> +static bool is_tracing_prog_type(const struct bpf_prog *prog)
>  {
> -	switch (type) {
> +	switch (resolve_prog_type(prog)) {
> +	case BPF_PROG_TYPE_TRACING:
> +		return prog->expected_attach_type != BPF_TRACE_ITER;
Seems reasonable to exclude this attach type as they run always in task
context as stated. This seems like the safest approach.

In practice, *most* tracing type programs will be unreachable from
NMI. But the edge cases are difficult to enumerate...

fentry / fexit are nearly impossible to exclude, since there's no
good way to answer "can this function be called from an nmi handler"
at verification time. Such a list would be unmaintainable.

Raw tracepoints are similar. Indeed, it would be *possible* to
specifically exclude the tracepoints that cannot be called from NMI. But
this would require maintaining a large list and would be brittle.

fmodify_return attach types are whitelisted to LSM hooks and
ALLOW_ERROR_INJECTION functions. A quick grep for ALLOW_ERROR_INJECTION
doesn't bring anything that looks obviously NMI-relevant, but again, I
think allowing it would be asking for trouble later.

So I think this is a good balance...
>  	case BPF_PROG_TYPE_KPROBE:
>  	case BPF_PROG_TYPE_TRACEPOINT:
>  	case BPF_PROG_TYPE_PERF_EVENT:
> @@ -17697,24 +17712,16 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
>  		return -EACCES;
>  	}
>  
> -	if (btf_record_has_field(map->record, BPF_LIST_HEAD) ||
> -	    btf_record_has_field(map->record, BPF_RB_ROOT)) {
> -		if (is_tracing_prog_type(prog_type)) {
> -			verbose(env, "tracing progs cannot use bpf_{list_head,rb_root} yet\n");
> -			return -EINVAL;
> -		}
> -	}
> -
>  	if (btf_record_has_field(map->record, BPF_SPIN_LOCK | BPF_RES_SPIN_LOCK)) {
>  		if (prog_type == BPF_PROG_TYPE_SOCKET_FILTER) {
>  			verbose(env, "socket filter progs cannot use bpf_spin_lock yet\n");
>  			return -EINVAL;
>  		}
> -
> -		if (is_tracing_prog_type(prog_type)) {
> -			verbose(env, "tracing progs cannot use bpf_spin_lock yet\n");
> -			return -EINVAL;
> -		}
> +		/*
> +		 * Rejecting tracing progs accessing maps with bpf_spin_lock in
> +		 * them here would be too conservative; let's defer rejection
> +		 * until seeing first use.
> +		 */

In practice, I think most programs using those spinlocks w/ fentry
weren't broken.

They were saved only by nature of the fact that the particular
functions that attached to were uncallable / never were
called from NMI. Which is the case for 99% of those programs.

So this case is where fixing the other 1% of cases are going
to have consequences for the remainder.

This is a tough decision to have to make. There's options, but all of
them come with downsides.

1. (this patch) break tracing programs using bpf_spin_lock.
2. Somehow prevent NMI reentrancy of tracing progs on the same cpu. probably
by just ignoring the reentrant calls, much how we handle the recursive
tracepoint prevention for bpf. (complicated, may have side effects)
3. Inject a fixup runtime check for in_nmi for every tracing program
and tail callable program, exiting before executing anything
(still breakage, just at runtime now)

If breakage is acceptible here, the approach this patch takes is the
most *correct* way to do it, and the way it should have been done from
day one, but having users relying on the "broken" behavior makes it
harder to fix now.

This seems like a maintainer call.

Justin
>  	}
>  
>  	if ((bpf_prog_is_offloaded(prog->aux) || bpf_map_is_offloaded(map)) &&
> -- 
> 2.53.0
> 

^ permalink raw reply

* Re: [PATCH bpf v5 2/2] selftests/bpf: strengthen bpf_kfunc_implicit_arg to verify aux injection
From: Eduard Zingerman @ 2026-06-08 17:34 UTC (permalink / raw)
  To: chenyuan_fl
  Cc: andrii, ast, bot+bpf-ci, bpf, chenyuan, clm, daniel,
	ihor.solodrai, jolsa, linux-kernel, martin.lau, martin.lau,
	memxor, song, yonghong.song
In-Reply-To: <20260608142618.3064380-3-chenyuan_fl@163.com>

On Mon, 2026-06-08 at 22:26 +0800, chenyuan_fl@163.com wrote:
> From: Yuan Chen <chenyuan@kylinos.cn>
> 
> Verify that the KF_IMPLICIT_ARGS injection path correctly passes
> the bpf_prog_aux pointer by checking aux->name in
> bpf_kfunc_implicit_arg() for the expected program name prefix.
> If the verifier incorrectly skipped injection (as could happen
> with pahole 1.30's BTF mismatch), the stale register would not
> contain a valid aux pointer and the name check would fail.
> 
> This is a positive test exercised by the existing kfunc_implicit_args
> selftest, which calls bpf_kfunc_implicit_arg(5) and expects a return
> value of 5.
> 
> Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
> ---

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

[...]

^ permalink raw reply

* Re: [PATCH v3 6/6] x86/setup: prepend embedded bootconfig cmdline before parse_early_param
From: sashiko-bot @ 2026-06-08 17:33 UTC (permalink / raw)
  To: Breno Leitao; +Cc: bpf
In-Reply-To: <20260608-bootconfig_using_tools-v3-6-4ddd079a0696@debian.org>

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] The patch selects `ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG`, but this Kconfig symbol is not defined anywhere in the kernel.
- [High] Unconditionally prepending the embedded bootconfig in `setup_arch()` breaks the precedence rule where an initrd bootconfig must completely override and replace an embedded bootconfig.
- [High] Prepending embedded bootconfig keys in-place to `boot_command_line` can silently truncate user-supplied bootloader arguments.
- [High] The 'bootconfig' command line parameter is parsed differently by `setup_arch()` and `setup_boot_config()`, causing a split-brain state where `bootconfig=0` disables early parameter application but still enables late parameter application.
--

commit 3b4a5034917523d388052b505ac2245c0d663702
Author: Breno Leitao <leitao@debian.org>

x86/setup: prepend embedded bootconfig cmdline before parse_early_param

Call xbc_prepend_embedded_cmdline() in setup_arch() right after the
CONFIG_CMDLINE merge and before strscpy(command_line, ...) so the
build-time-rendered embedded bootconfig "kernel" subtree is part of
boot_command_line by the time parse_early_param() runs. early_param()
handlers (mem=, earlycon=, loglevel=, ...) now see values supplied via
CONFIG_BOOT_CONFIG_EMBED_FILE without parsing bootconfig at runtime.

> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 0de23e6471973..8ab11199c16d5 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -127,6 +127,7 @@ config X86
>  	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
>  	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
>  	select ARCH_SUPPORTS_CFI		if X86_64
> +	select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG

[Severity: High]
Does ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG exist? 

This selects a Kconfig symbol that is not defined anywhere in the kernel. This 
is a Kconfig subsystem violation that can cause silent build failures or leave 
required features disabled because the selected symbol has no definition in any 
Kconfig file.

>  	select ARCH_USES_CFI_TRAPS		if X86_64 && CFI
>  	select ARCH_SUPPORTS_LTO_CLANG
>  	select ARCH_SUPPORTS_LTO_CLANG_THIN
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 46882ce79c3a4..003f8651db6c3 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
[ ... ]
> @@ -924,6 +926,20 @@ void __init setup_arch(char **cmdline_p)
>  	builtin_cmdline_added = true;
>  #endif
>  
> +	/*
> +	 * Match the runtime bootconfig parser's opt-in: only fold the
> +	 * embedded kernel.* keys into the cmdline when "bootconfig" is
> +	 * present on the command line, or CONFIG_BOOT_CONFIG_FORCE is set.
> +	 * setup_boot_config() bails out under the same condition, so the
> +	 * early prepend stays in lockstep with what the late runtime parser
> +	 * would have applied. CONFIG_BOOT_CONFIG_FORCE defaults to y when
> +	 * BOOT_CONFIG_EMBED is set, so on the default config the embedded
> +	 * keys are applied unconditionally.
> +	 */
> +	if (IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE) ||
> +	    cmdline_find_option_bool(boot_command_line, "bootconfig"))

[Severity: High]
Could this result in a split-brain state for "bootconfig" parsing?

In setup_arch(), cmdline_find_option_bool("bootconfig") accurately rejects
boolean-false values like bootconfig=0 or bootconfig=off.

However, in init/main.c:setup_boot_config(), the lax pre-existing parser uses 
parse_args() which merely checks for the presence of the "bootconfig" key and
ignores its value. 

Passing bootconfig=0 will therefore skip the early embedded parameter injection
here, but mistakenly apply the embedded bootconfig later at runtime, causing an 
inconsistent system state.

> +		xbc_prepend_embedded_cmdline(boot_command_line, COMMAND_LINE_SIZE);

[Severity: High]
Can this unconditionally prepend the embedded bootconfig, breaking the
precedence rule where an initrd bootconfig must completely override it?

The intended logic in init/main.c:setup_boot_config() allows an initrd 
bootconfig to completely override the embedded one:

init/main.c:setup_boot_config() {
	if (!data) {
		data = xbc_get_embedded_bootconfig(&size);
		from_embedded = true;
	}
    ...
}

By prepending the embedded keys directly to boot_command_line early in 
setup_arch(), these keys bypass the override check and are inevitably processed 
by parse_early_param(). This permanently leaks the embedded keys into the boot 
parameters, causing conflicting or unwanted early parameters to be applied even 
when a valid initrd bootconfig is present.

[Severity: High]
Will prepending embedded bootconfig keys in-place to boot_command_line 
silently truncate user-supplied bootloader arguments?

boot_command_line is a fixed-size buffer bounded by COMMAND_LINE_SIZE. When 
xbc_prepend_embedded_cmdline() prepends the embedded bootconfig string, it must 
push the existing content to the right. 

If the combined length exceeds the buffer, the tail-end arguments 
(such as root= or init=) will be silently truncated, resulting in a kernel panic 
on boot.

>  
>  	strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
>  	*cmdline_p = command_line;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org?part=6

^ permalink raw reply

* Re: [PATCH bpf v5 1/2] bpf: Fix kfunc implicit arg inject type detection to prevent invalid pointer deref
From: Eduard Zingerman @ 2026-06-08 17:28 UTC (permalink / raw)
  To: chenyuan_fl
  Cc: andrii, ast, bot+bpf-ci, bpf, chenyuan, clm, daniel,
	ihor.solodrai, jolsa, linux-kernel, martin.lau, martin.lau,
	memxor, song, yonghong.song
In-Reply-To: <20260608142618.3064380-2-chenyuan_fl@163.com>

On Mon, 2026-06-08 at 22:26 +0800, chenyuan_fl@163.com wrote:
> From: Yuan Chen <chenyuan@kylinos.cn>

[...]

> Fix
> 
> Split the combined is_kfunc_arg_ignore() || is_kfunc_arg_implicit()
> check in check_kfunc_args() so that an implicit argument reaching
> is_kfunc_arg_implicit() without being handled by a prior handler is
> rejected with -EFAULT, instead of silently skipped.  Existing implicit
>       args in bpf_fixup_kfunc_call() (obj_new, percpu_obj_new, obj_drop,
>       percpu_obj_drop, refcount_acquire, list_push, rbtree_add) are
>       explicitly allowed.
> 
> Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
> Fixes: 64e1360524b9 ("bpf: Verifier support for KF_IMPLICIT_ARGS")
> Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
> ---

Acked-by: Eduard Zingerman <eddyz87@gmail.com>

>  kernel/bpf/verifier.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 8ed484cb1a8a..91aaed7a5eeb 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -11885,9 +11885,27 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>  			continue;
>  		}
>  
> -		if (is_kfunc_arg_ignore(btf, &args[i]) || is_kfunc_arg_implicit(meta, i))
> +		if (is_kfunc_arg_ignore(btf, &args[i]))
>  			continue;
>  
> +		if (is_kfunc_arg_implicit(meta, i)) {
> +			/* kfuncs with implicit args (e.g. 'off' parameter)
> +			 * handled during verification in bpf_fixup_kfunc_call():
> +			 * obj_new, percpu_obj_new, obj_drop, percpu_obj_drop,
> +			 * refcount_acquire, list_push, rbtree_add. Don't flag them. */

Nit: comment does not follow formatting rules and the two lines
     listing relevant kfuncs are redundant.

			/*
			 * This an exhaustive list of the non-bpf_prog_aux implicit args kfuncs,
                         * which are handled post-verification by bpf_fixup_kfunc_call().
			 */

> +			if (is_bpf_obj_new_kfunc(meta->func_id) ||
> +			    is_bpf_percpu_obj_new_kfunc(meta->func_id) ||
> +			    is_bpf_obj_drop_kfunc(meta->func_id) ||
> +			    is_bpf_percpu_obj_drop_kfunc(meta->func_id) ||
> +			    is_bpf_refcount_acquire_kfunc(meta->func_id) ||
> +			    is_bpf_list_push_kfunc(meta->func_id) ||
> +			    is_bpf_rbtree_add_kfunc(meta->func_id))
> +				continue;
> +			verbose(env, "%s unrecognized implicit argument, possible BTF mismatch\n",
> +				reg_arg_name(env, argno));
> +			return -EFAULT;
> +		}
> +
>  		t = btf_type_skip_modifiers(btf, args[i].type, NULL);
>  
>  		if (btf_type_is_scalar(t)) {

^ permalink raw reply

* Re: [PATCH bpf v8 1/2] net: Validate protocol in skb_steal_sock() for BPF-assigned sockets
From: Kuniyuki Iwashima @ 2026-06-08 17:21 UTC (permalink / raw)
  To: Jiayuan Chen, Alexei Starovoitov
  Cc: bpf, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Willem de Bruijn, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Song Liu, Yonghong Song, Jiri Olsa, Shuah Khan, Joe Stringer,
	netdev, linux-kernel, linux-kselftest
In-Reply-To: <20260608125846.157004-2-jiayuan.chen@linux.dev>

On Mon, Jun 8, 2026 at 5:59 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> bpf_sk_assign_tcp_reqsk() can assign a TCP reqsk to a non-TCP skb,
> causing a panic when the skb enters the wrong L4 receive path [1].
> An initial attempt tried to fix this in the BPF helper by checking
> iph->protocol, but Sashiko [2] revealed that BPF programs can bypass
> this check via a TOCTOU attack by modifying iph->protocol around the
> call:
>
>     iph->protocol = IPPROTO_TCP;
>     bpf_sk_assign_tcp_reqsk(udp_skb, tcp_sk);
>     iph->protocol = IPPROTO_UDP;
>
> Furthermore, bpf_sk_assign() has had the same class of vulnerability
> since its introduction — it can assign any socket type to any skb
> without protocol validation.  Since the BPF helper check alone cannot
> prevent a malicious BPF program from crashing the kernel, add protocol

I'm curious about the BPF maintainers' stance on this kind of "bug"
where admin tries to shoot oneself in the foot.

I saw Alexei said this recently, and I guess it applies here as well ?
https://lore.kernel.org/bpf/CAADnVQLh7VEKAtckzz=XOVPT8ovpDQshvPPCWHDQu2OWQx27_w@mail.gmail.com/

---8<---
Not every "bug" needs a fix.
If a malicious bpf user wants to crash the kernel they will
find a way to do so. Especially with agents.
We cannot realistically close all of the holes.
Right now the priority is to fix the issues that normal
users can hit and not bots.
---8<---


> validation in skb_steal_sock() to reject mismatched sockets regardless
> of how they were assigned.
>
> The check is applied to all prefetched sockets.  Early demux paths
> already only assign matching protocols (e.g., UDP early demux only
> assigns UDP sockets to UDP skbs), so they pass the check naturally and
> the extra branch is negligible.
>
> Pass the expected protocol from callers rather than extracting it from
> the IP header.  For IPv6, walking extension headers to find the L4
> protocol is complex and unnecessary since each caller already knows
> the protocol it handles.
>
> [1] https://lore.kernel.org/bpf/20260403015851.148209-1-jiayuan.chen@linux.dev/
> [2] https://sashiko.dev/#/patchset/20260403015851.148209-1-jiayuan.chen%40linux.dev
>
> Fixes: cf7fbe660f2d ("bpf: Add socket assign support")
> Fixes: e472f88891ab ("bpf: tcp: Support arbitrary SYN Cookie.")
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>  include/net/inet6_hashtables.h |  7 ++++---
>  include/net/inet_hashtables.h  |  7 ++++---
>  include/net/request_sock.h     | 16 +++++++++++++++-
>  net/ipv4/udp.c                 |  2 +-
>  net/ipv6/udp.c                 |  2 +-
>  5 files changed, 25 insertions(+), 9 deletions(-)
>
> diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
> index 2cc5d416bbb5a..218498373a9c1 100644
> --- a/include/net/inet6_hashtables.h
> +++ b/include/net/inet6_hashtables.h
> @@ -106,12 +106,13 @@ static inline
>  struct sock *inet6_steal_sock(struct net *net, struct sk_buff *skb, int doff,
>                               const struct in6_addr *saddr, const __be16 sport,
>                               const struct in6_addr *daddr, const __be16 dport,
> -                             bool *refcounted, inet6_ehashfn_t *ehashfn)
> +                             bool *refcounted, inet6_ehashfn_t *ehashfn,
> +                             int protocol)
>  {
>         struct sock *sk, *reuse_sk;
>         bool prefetched;
>
> -       sk = skb_steal_sock(skb, refcounted, &prefetched);
> +       sk = skb_steal_sock(skb, refcounted, &prefetched, protocol);
>         if (!sk)
>                 return NULL;
>
> @@ -153,7 +154,7 @@ static inline struct sock *__inet6_lookup_skb(struct sk_buff *skb, int doff,
>         struct sock *sk;
>
>         sk = inet6_steal_sock(net, skb, doff, &ip6h->saddr, sport, &ip6h->daddr, dport,
> -                             refcounted, inet6_ehashfn);
> +                             refcounted, inet6_ehashfn, IPPROTO_TCP);
>         if (IS_ERR(sk))
>                 return NULL;
>         if (sk)
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index 6e2fe186d0dcb..a2a044f93cc4b 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -446,12 +446,13 @@ static inline
>  struct sock *inet_steal_sock(struct net *net, struct sk_buff *skb, int doff,
>                              const __be32 saddr, const __be16 sport,
>                              const __be32 daddr, const __be16 dport,
> -                            bool *refcounted, inet_ehashfn_t *ehashfn)
> +                            bool *refcounted, inet_ehashfn_t *ehashfn,
> +                            int protocol)
>  {
>         struct sock *sk, *reuse_sk;
>         bool prefetched;
>
> -       sk = skb_steal_sock(skb, refcounted, &prefetched);
> +       sk = skb_steal_sock(skb, refcounted, &prefetched, protocol);
>         if (!sk)
>                 return NULL;
>
> @@ -494,7 +495,7 @@ static inline struct sock *__inet_lookup_skb(struct sk_buff *skb,
>         struct sock *sk;
>
>         sk = inet_steal_sock(net, skb, doff, iph->saddr, sport, iph->daddr, dport,
> -                            refcounted, inet_ehashfn);
> +                            refcounted, inet_ehashfn, IPPROTO_TCP);
>         if (IS_ERR(sk))
>                 return NULL;
>         if (sk)
> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
> index 5a9c826a7092d..c2b8c6350b624 100644
> --- a/include/net/request_sock.h
> +++ b/include/net/request_sock.h
> @@ -89,9 +89,11 @@ static inline struct sock *req_to_sk(struct request_sock *req)
>   * @skb: sk_buff to steal the socket from
>   * @refcounted: is set to true if the socket is reference-counted
>   * @prefetched: is set to true if the socket was assigned from bpf
> + * @protocol: expected L4 protocol
>   */
>  static inline struct sock *skb_steal_sock(struct sk_buff *skb,
> -                                         bool *refcounted, bool *prefetched)
> +                                         bool *refcounted, bool *prefetched,
> +                                         int protocol)
>  {
>         struct sock *sk = skb->sk;
>
> @@ -103,6 +105,18 @@ static inline struct sock *skb_steal_sock(struct sk_buff *skb,
>
>         *prefetched = skb_sk_is_prefetched(skb);
>         if (*prefetched) {
> +               /* A non-full socket here is either a reqsk or a
> +                * timewait sock, both only contain sock_common and
> +                * lack sk_protocol. Since both can only be TCP,
> +                * use IPPROTO_TCP as the protocol.
> +                */
> +               if (unlikely(((sk_fullsock(sk) ? sk->sk_protocol : IPPROTO_TCP) != protocol))) {
> +                       skb_orphan(skb);
> +                       *prefetched = false;
> +                       *refcounted = false;
> +                       return NULL;
> +               }
> +
>  #if IS_ENABLED(CONFIG_SYN_COOKIES)
>                 if (sk->sk_state == TCP_NEW_SYN_RECV && inet_reqsk(sk)->syncookie) {
>                         struct request_sock *req = inet_reqsk(sk);
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 0ac2bf4f8759b..ceb4d29a64aca 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -2618,7 +2618,7 @@ int udp_rcv(struct sk_buff *skb)
>                 goto csum_error;
>
>         sk = inet_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
> -                            &refcounted, udp_ehashfn);
> +                            &refcounted, udp_ehashfn, IPPROTO_UDP);
>         if (IS_ERR(sk))
>                 goto no_sk;
>
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 15e032194eccc..d9c12cce5acef 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -1106,7 +1106,7 @@ INDIRECT_CALLABLE_SCOPE int udpv6_rcv(struct sk_buff *skb)
>
>         /* Check if the socket is already available, e.g. due to early demux */
>         sk = inet6_steal_sock(net, skb, sizeof(struct udphdr), saddr, uh->source, daddr, uh->dest,
> -                             &refcounted, udp6_ehashfn);
> +                             &refcounted, udp6_ehashfn, IPPROTO_UDP);
>         if (IS_ERR(sk))
>                 goto no_sk;
>
> --
> 2.43.0
>

^ permalink raw reply

* Re: [PATCH v3 3/6] bootconfig: render embedded bootconfig as a kernel cmdline at build time
From: sashiko-bot @ 2026-06-08 16:52 UTC (permalink / raw)
  To: Breno Leitao; +Cc: bpf
In-Reply-To: <20260608-bootconfig_using_tools-v3-3-4ddd079a0696@debian.org>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Low] Cross-compiling the kernel with LLVM/Clang will fail with an "Exec format error" when building `embedded_cmdline.bin`.
- [Low] Mixing different access flags (`"aw"` vs `"a"`) for the same `.init.rodata` section causes a fatal section type mismatch error with `ld.lld`.
--

commit 53ecb8492a80850833ff3357341fcf9ac0a1298e
Author: Breno Leitao <leitao@debian.org>

bootconfig: render embedded bootconfig as a kernel cmdline at build time

Add the build-time pipeline that renders the "kernel" subtree of
CONFIG_BOOT_CONFIG_EMBED_FILE into a flat cmdline string. This involves
adding tools/bootconfig to the make prepare step and adding
lib/embedded-cmdline.S to embed the rendered text.

> diff --git a/Makefile b/Makefile
> index bf196c6df5b92..a7572703f9ce8 100644
> --- a/Makefile
> +++ b/Makefile
[ ... ]
> @@ -1545,6 +1545,17 @@ prepare: tools/bpf/resolve_btfids
>  endif
>  endif
>  
> +# tools/bootconfig renders the embedded bootconfig into a cmdline at build time.
> +ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
> +prepare: tools/bootconfig
> +endif
> +
> +# tools/bootconfig is run on the build host during prepare, so force a host
> +# binary here; its own Makefile keeps $(CC) for standalone and cross builds.
> +tools/bootconfig: FORCE
> +	$(Q)mkdir -p $(objtree)/tools
> +	$(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/ bootconfig CC=$(HOSTCC)

[Severity: Low]
When cross-compiling with Clang (e.g., LLVM=1 CROSS_COMPILE=aarch64-linux-gnu-),
will this build the tool as a host binary?

Since CROSS_COMPILE isn't unset and CFLAGS aren't overridden with host flags,
it appears tools/scripts/Makefile.include will still append target flags like
--target=$(CLANG_TARGET). This can result in the tool being erroneously
compiled as a target-architecture binary.

> diff --git a/lib/Makefile b/lib/Makefile
> index 7f75cc6edf94a..4ace86a5cb6de 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
[ ... ]
> @@ -273,6 +273,22 @@ filechk_defbconf = cat $(or $(real-prereqs), /dev/null)
>  $(obj)/default.bconf: $(CONFIG_BOOT_CONFIG_EMBED_FILE) FORCE
>  	$(call filechk,defbconf)
>  
> +obj-$(CONFIG_BOOT_CONFIG_EMBED_CMDLINE) += embedded-cmdline.o
> +$(obj)/embedded-cmdline.o: $(obj)/embedded_cmdline.bin
> +
> +# Render the bootconfig "kernel" subtree to a flat cmdline string using
> +# the userspace tools/bootconfig parser (-C mode). The runtime prepend
> +# helper enforces COMMAND_LINE_SIZE at boot, so no build-time size
> +# check is performed here (COMMAND_LINE_SIZE is an arch header
> +# constant, not a Kconfig value).
> +quiet_cmd_render_cmdline = BCONF2C $@
> +      cmd_render_cmdline = \
> +	$(objtree)/tools/bootconfig/bootconfig -C $< > $@

[Severity: Low]
If the bootconfig tool was accidentally cross-compiled due to target flags
bleeding into the host build above, could executing it here on the build
host fail with an Exec format error?

> diff --git a/lib/embedded-cmdline.S b/lib/embedded-cmdline.S
> new file mode 100644
> index 0000000000000..740d7ad2dc016
> --- /dev/null
> +++ b/lib/embedded-cmdline.S
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Embed the build-time-rendered bootconfig "kernel" subtree as a flat
> + * cmdline string. setup_arch() prepends this to boot_command_line on
> + * architectures that select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG.
> + *
> + * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
> + * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
> + */
> +	.section .init.rodata, "a", %progbits

[Severity: Low]
Does this cause a section type mismatch with the existing bootconfig data?

The existing lib/bootconfig-data.S places embedded bootconfig data into
.section .init.rodata, "aw" (allocatable and writable). When both files are
compiled together, modern linkers like ld.lld enforce section attributes
and will fail the build if the same section name mixes writable ("aw") and
read-only ("a") flags.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org?part=3

^ permalink raw reply

* Re: [PATCH bpf-next v1 5/5] selftests/bpf: Exercise kptr map update lifetime
From: sashiko-bot @ 2026-06-08 16:40 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi; +Cc: bpf
In-Reply-To: <20260608144841.1732406-6-memxor@gmail.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [Critical] The newly added selftests incorrectly assert that map updates preserve old kptr references, codifying severe preexisting kernel bugs where bpf_obj_free_fields() is missing during map updates. This leads to leaked kptrs and cross-key uninitialized memory corruption in hash maps.
--

commit 4d2aa8d889c01038cf4926ada20dda3564090fb2
Author: Kumar Kartikeya Dwivedi <memxor@gmail.com>

selftests/bpf: Exercise kptr map update lifetime

Add focused map_kptr coverage for BPF-side map updates that touch values
containing referenced kptrs. The new syscall programs stash the testmod
refcounted object in various maps, then update the same map from BPF.

> diff --git a/tools/testing/selftests/bpf/progs/map_kptr.c b/tools/testing/selftests/bpf/progs/map_kptr.c
> index e708ffbe1f610..3fbefc568e0a6 100644
> --- a/tools/testing/selftests/bpf/progs/map_kptr.c
> +++ b/tools/testing/selftests/bpf/progs/map_kptr.c

[ ... ]

> +SEC("syscall")
> +int test_array_map_update_kptr(void *ctx)
> +{
> +	struct map_value init = {}, *v;
> +	int key = 0, ret;
> +
> +	v = bpf_map_lookup_elem(&array_map, &key);
> +	if (!v)
> +		return 1;
> +	ret = stash_ref_ptr(v);
> +	if (ret)
> +		return ret;
> +	ret = check_refs(3);
> +	if (ret)
> +		return ret;
> +	ret = bpf_map_update_elem(&array_map, &key, &init, BPF_EXIST);
> +	if (ret)
> +		return 4;
> +	return check_refs(3);
> +}
> +
> +#define DEFINE_HASH_UPDATE_KPTR_TEST(name, map)			\
> +SEC("syscall")							\
> +int name(void *ctx)						\
> +{								\
> +	struct map_value init = {}, *v;				\
> +	int key = 0, ret;					\
> +								\
> +	ret = bpf_map_update_elem(&map, &key, &init, BPF_NOEXIST); \
> +	if (ret)						\
> +		return 1;					\
> +	v = bpf_map_lookup_elem(&map, &key);			\
> +	if (!v)							\
> +		return 2;					\
> +	ret = stash_ref_ptr(v);					\
> +	if (ret)						\
> +		return ret;					\
> +	ret = check_refs(3);					\
> +	if (ret)						\
> +		return ret;					\
> +	ret = bpf_map_update_elem(&map, &key, &init, BPF_EXIST); \
> +	if (ret)						\
> +		return 4;					\
> +	return check_refs(3);					\
> +}

[Severity: Critical]
This is a pre-existing issue, but does this selftest actually codify an
existing memory leak and uninitialized memory corruption in the kernel?

Looking at array_map_update_elem() and htab_map_update_elem() in the kernel,
they incorrectly use bpf_obj_cancel_fields() instead of
bpf_obj_free_fields() when overwriting elements. Because
bpf_obj_cancel_fields() explicitly skips BPF_KPTR_REF, the old kptr
references are never freed.

For hash maps specifically, htab_map_update_elem() allocates a new element
(l_new), and copy_map_value() explicitly skips copying special BTF fields
like kptrs. This means l_new receives uninitialized garbage memory for its
kptr field before being inserted into the map. Meanwhile, the old element is
placed in the freelist without its kptr being freed, leaking the reference.

By asserting check_refs(3) after the map update, is the test just measuring
this leak and ignoring the uninitialized garbage now present in the map,
rather than confirming that the kptr was gracefully preserved?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608144841.1732406-1-memxor@gmail.com?part=5

^ permalink raw reply

* [PATCH v3 6/6] x86/setup: prepend embedded bootconfig cmdline before parse_early_param
From: Breno Leitao @ 2026-06-08 16:24 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

Call xbc_prepend_embedded_cmdline() in setup_arch() right after the
CONFIG_CMDLINE merge and before strscpy(command_line, ...) so the
build-time-rendered embedded bootconfig "kernel" subtree is part of
boot_command_line by the time parse_early_param() runs. early_param()
handlers (mem=, earlycon=, loglevel=, ...) now see values supplied via
CONFIG_BOOT_CONFIG_EMBED_FILE without parsing bootconfig at runtime.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 arch/x86/Kconfig        |  1 +
 arch/x86/kernel/setup.c | 16 ++++++++++++++++
 init/main.c             | 25 ++++++++++++++++++++++---
 3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f24810015234..f839795692b4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -126,6 +126,7 @@ config X86
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
 	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
 	select ARCH_SUPPORTS_CFI		if X86_64
+	select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
 	select ARCH_USES_CFI_TRAPS		if X86_64 && CFI
 	select ARCH_SUPPORTS_LTO_CLANG
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 46882ce79c3a..003f8651db6c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -6,6 +6,7 @@
  * parts of early kernel initialization.
  */
 #include <linux/acpi.h>
+#include <linux/bootconfig.h>
 #include <linux/console.h>
 #include <linux/cpu.h>
 #include <linux/crash_dump.h>
@@ -36,6 +37,7 @@
 #include <asm/bios_ebda.h>
 #include <asm/bugs.h>
 #include <asm/cacheinfo.h>
+#include <asm/cmdline.h>
 #include <asm/coco.h>
 #include <asm/cpu.h>
 #include <asm/efi.h>
@@ -924,6 +926,20 @@ void __init setup_arch(char **cmdline_p)
 	builtin_cmdline_added = true;
 #endif
 
+	/*
+	 * Match the runtime bootconfig parser's opt-in: only fold the
+	 * embedded kernel.* keys into the cmdline when "bootconfig" is
+	 * present on the command line, or CONFIG_BOOT_CONFIG_FORCE is set.
+	 * setup_boot_config() bails out under the same condition, so the
+	 * early prepend stays in lockstep with what the late runtime parser
+	 * would have applied. CONFIG_BOOT_CONFIG_FORCE defaults to y when
+	 * BOOT_CONFIG_EMBED is set, so on the default config the embedded
+	 * keys are applied unconditionally.
+	 */
+	if (IS_ENABLED(CONFIG_BOOT_CONFIG_FORCE) ||
+	    cmdline_find_option_bool(boot_command_line, "bootconfig"))
+		xbc_prepend_embedded_cmdline(boot_command_line, COMMAND_LINE_SIZE);
+
 	strscpy(command_line, boot_command_line, COMMAND_LINE_SIZE);
 	*cmdline_p = command_line;
 
diff --git a/init/main.c b/init/main.c
index e363232b428b..2ecb6aa536dd 100644
--- a/init/main.c
+++ b/init/main.c
@@ -378,12 +378,15 @@ static void __init setup_boot_config(void)
 	int pos, ret;
 	size_t size;
 	char *err;
+	bool from_embedded = false;
 
 	/* Cut out the bootconfig data even if we have no bootconfig option */
 	data = get_boot_config_from_initrd(&size);
 	/* If there is no bootconfig in initrd, try embedded one. */
-	if (!data)
+	if (!data) {
 		data = xbc_get_embedded_bootconfig(&size);
+		from_embedded = true;
+	}
 
 	strscpy(tmp_cmdline, boot_command_line, COMMAND_LINE_SIZE);
 	err = parse_args("bootconfig", tmp_cmdline, NULL, 0, 0, 0, NULL,
@@ -421,8 +424,24 @@ static void __init setup_boot_config(void)
 	} else {
 		xbc_get_info(&ret, NULL);
 		pr_info("Load bootconfig: %ld bytes %d nodes\n", (long)size, ret);
-		/* keys starting with "kernel." are passed via cmdline */
-		extra_command_line = xbc_make_cmdline("kernel");
+		/*
+		 * keys starting with "kernel." are passed via cmdline. When
+		 * this bootconfig came from the embedded source and
+		 * setup_arch() already prepended the rendered "kernel" subtree
+		 * to boot_command_line, rendering again here would duplicate
+		 * the keys in saved_command_line and make accumulating handlers
+		 * (console=, earlycon=, ...) re-register the same value. Skip
+		 * only when the prepend really happened.
+		 *
+		 * On arches that do not select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG,
+		 * CONFIG_BOOT_CONFIG_EMBED_CMDLINE is unselectable and
+		 * xbc_embedded_cmdline_applied() collapses to a stub returning
+		 * false, so this path still runs and the embedded "kernel"
+		 * keys reach the cmdline via the runtime parser exactly as
+		 * before this series.
+		 */
+		if (!from_embedded || !xbc_embedded_cmdline_applied())
+			extra_command_line = xbc_make_cmdline("kernel");
 		/* Also, "init." keys are init arguments */
 		extra_init_args = xbc_make_cmdline("init");
 	}

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v3 5/6] bootconfig: add xbc_prepend_embedded_cmdline() helper
From: Breno Leitao @ 2026-06-08 16:24 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

Add a helper that prepends the build-time-rendered embedded bootconfig
"kernel" subtree (embedded_kernel_cmdline[] from embedded-cmdline.S) to
a cmdline buffer with a separating space. Architectures call this from
setup_arch() before parse_early_param() so early_param() handlers
(mem=, earlycon=, loglevel=, ...) see values supplied via the embedded
bootconfig.

The in-place prepend (shift the existing string right, then drop the
embedded string in front) is factored into a small str_prepend() helper.

On overflow the helper logs an error and leaves the cmdline untouched
rather than panicking. Booting without the embedded values is better
than refusing to boot, and the error tells the user why their embedded
keys are missing.

The helper records whether it actually prepended, exposed via
xbc_embedded_cmdline_applied(). setup_boot_config() uses this to decide
whether the runtime "kernel" render would duplicate keys already folded
into boot_command_line.

When CONFIG_BOOT_CONFIG_EMBED_CMDLINE=n, the public declaration in
<linux/bootconfig.h> resolves to a no-op stub so callers compile
unchanged.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 include/linux/bootconfig.h |  9 ++++++
 lib/bootconfig.c           | 78 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/include/linux/bootconfig.h b/include/linux/bootconfig.h
index 1c7f3b74ffcf..c186137f87ac 100644
--- a/include/linux/bootconfig.h
+++ b/include/linux/bootconfig.h
@@ -308,4 +308,13 @@ static inline const char *xbc_get_embedded_bootconfig(size_t *size)
 }
 #endif
 
+/* Build-time-rendered bootconfig cmdline prepended in setup_arch() */
+#ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+void __init xbc_prepend_embedded_cmdline(char *dst, size_t size);
+bool __init xbc_embedded_cmdline_applied(void);
+#else
+static inline void xbc_prepend_embedded_cmdline(char *dst, size_t size) { }
+static inline bool xbc_embedded_cmdline_applied(void) { return false; }
+#endif
+
 #endif
diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 926094d97397..f66be0b2dc24 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -19,6 +19,7 @@
 #include <linux/errno.h>
 #include <linux/cache.h>
 #include <linux/compiler.h>
+#include <linux/printk.h>
 #include <linux/sprintf.h>
 #include <linux/memblock.h>
 #include <linux/string.h>
@@ -34,6 +35,83 @@ const char * __init xbc_get_embedded_bootconfig(size_t *size)
 	return (*size) ? embedded_bootconfig_data : NULL;
 }
 #endif
+
+#ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+/* embedded_kernel_cmdline is defined in embedded-cmdline.S */
+extern __visible const char embedded_kernel_cmdline[];
+extern __visible const char embedded_kernel_cmdline_end[];
+
+/* Set once the embedded cmdline has actually been prepended. */
+static bool xbc_cmdline_applied __initdata;
+
+/*
+ * str_prepend() - Prepend @src in front of the string in @dst, in place
+ * @dst: NUL-terminated destination buffer, currently @dst_len bytes long
+ * @dst_len: length of the current @dst string (excluding its NUL)
+ * @src: bytes to prepend (not NUL-terminated)
+ * @src_len: number of bytes from @src to prepend
+ *
+ * The caller must guarantee @dst has room for src_len + dst_len + 1 bytes.
+ * Moving dst_len + 1 bytes carries @dst's NUL terminator too, so an empty
+ * @dst needs no special case.
+ */
+static void __init str_prepend(char *dst, size_t dst_len,
+			       const char *src, size_t src_len)
+{
+	memmove(dst + src_len, dst, dst_len + 1);
+	memcpy(dst, src, src_len);
+}
+
+/**
+ * xbc_prepend_embedded_cmdline() - Prepend embedded bootconfig cmdline
+ * @dst: cmdline buffer to prepend into (must already contain a NUL byte)
+ * @size: total capacity of @dst in bytes
+ *
+ * Prepend the build-time-rendered "kernel" subtree of the embedded
+ * bootconfig to @dst. The rendered string already ends with a single
+ * space (the xbc_snprint_cmdline() invariant), which serves as the
+ * separator between the embedded keys and any existing content of @dst.
+ * On overflow, log an error and leave @dst untouched rather than
+ * silently truncating: booting without the embedded values is better
+ * than refusing to boot, and the error message tells the user why
+ * their embedded keys are missing.
+ *
+ * Intended to be called from setup_arch() before parse_early_param() so
+ * that early_param() handlers see the embedded values.
+ */
+void __init xbc_prepend_embedded_cmdline(char *dst, size_t size)
+{
+	size_t embed_len = embedded_kernel_cmdline_end - embedded_kernel_cmdline;
+	size_t dst_len;
+
+	if (!size || embed_len <= 1)	/* trailing NUL only */
+		return;
+	embed_len--;			/* exclude trailing NUL byte */
+
+	dst_len = strnlen(dst, size);
+	if (embed_len + dst_len + 1 > size) {
+		pr_err("embedded bootconfig cmdline (%zu bytes) does not fit in COMMAND_LINE_SIZE with %zu bytes already used; ignoring embedded values\n",
+		       embed_len, dst_len);
+		return;
+	}
+
+	str_prepend(dst, dst_len, embedded_kernel_cmdline, embed_len);
+	xbc_cmdline_applied = true;
+}
+
+/**
+ * xbc_embedded_cmdline_applied() - Did the embedded cmdline get prepended?
+ *
+ * Return true if xbc_prepend_embedded_cmdline() actually prepended the
+ * embedded "kernel" subtree. setup_boot_config() uses this to avoid
+ * rendering the same keys a second time.
+ */
+bool __init xbc_embedded_cmdline_applied(void)
+{
+	return xbc_cmdline_applied;
+}
+#endif
+
 #endif
 
 /*

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v3 4/6] bootconfig: clean build-time tools/bootconfig from make clean
From: Breno Leitao @ 2026-06-08 16:24 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

The previous patch builds tools/bootconfig during 'make prepare' to
render the embedded bootconfig cmdline, but nothing removes it on
'make clean', leaving the compiled tool and its objects behind.

Wire a bootconfig_clean hook into the top-level clean target so the
compiled tool and its objects are removed by make clean, matching the
prepare-wired tools/objtool and tools/bpf/resolve_btfids.

The hook runs tools/bootconfig's Makefile via $(MAKE), which the kernel
build invokes with -rR (MAKEFLAGS += -rR). -rR drops the built-in $(RM)
variable, so the existing "$(RM) -f ..." clean recipe would expand to a
bare "-f ..." and fail. Spell the recipe with a literal "rm -f" so it
keeps working both standalone and when invoked from Kbuild.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Makefile                  | 13 ++++++++++++-
 tools/bootconfig/Makefile |  2 +-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/Makefile b/Makefile
index 4a8ea7c90ca8..84ca047f0c10 100644
--- a/Makefile
+++ b/Makefile
@@ -1580,6 +1580,17 @@ ifneq ($(wildcard $(objtool_O)),)
 	$(Q)$(MAKE) -sC $(abs_srctree)/tools/objtool O=$(objtool_O) srctree=$(abs_srctree) $(patsubst objtool_%,%,$@)
 endif
 
+PHONY += bootconfig_clean
+
+bootconfig_O = $(abspath $(objtree))/tools/bootconfig
+
+# tools/bootconfig is only built (via the prepare hook above) when
+# CONFIG_BOOT_CONFIG_EMBED_CMDLINE is set; skip its clean otherwise.
+bootconfig_clean:
+ifneq ($(wildcard $(bootconfig_O)),)
+	$(Q)$(MAKE) -sC $(srctree)/tools/bootconfig O=$(bootconfig_O) clean
+endif
+
 tools/: FORCE
 	$(Q)mkdir -p $(objtree)/tools
 	$(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/
@@ -1749,7 +1760,7 @@ vmlinuxclean:
 	$(Q)$(CONFIG_SHELL) $(srctree)/scripts/link-vmlinux.sh clean
 	$(Q)$(if $(ARCH_POSTLINK), $(MAKE) -f $(ARCH_POSTLINK) clean)
 
-clean: archclean vmlinuxclean resolve_btfids_clean objtool_clean
+clean: archclean vmlinuxclean resolve_btfids_clean objtool_clean bootconfig_clean
 
 # mrproper - Delete all generated files, including .config
 #
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index 4e82fd9553cd..3cb8066d5141 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -27,4 +27,4 @@ install: $(ALL_PROGRAMS)
 	install $(OUTPUT)bootconfig $(DESTDIR)$(bindir)
 
 clean:
-	$(RM) -f $(OUTPUT)*.o $(ALL_PROGRAMS)
+	rm -f $(OUTPUT)*.o $(ALL_PROGRAMS)

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v3 3/6] bootconfig: render embedded bootconfig as a kernel cmdline at build time
From: Breno Leitao @ 2026-06-08 16:24 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

Add the build-time pipeline that renders the "kernel" subtree of
CONFIG_BOOT_CONFIG_EMBED_FILE into a flat cmdline string and stashes
it in .init.rodata as embedded_kernel_cmdline[]. A follow-up patch
adds the runtime helper that prepends this string to boot_command_line
during early architecture setup so parse_early_param() sees the values.

The build wires up:
  tools/bootconfig -C kernel - userspace tool already shared with
                               lib/bootconfig.c, used here in -C mode
                               to render a bootconfig file to a cmdline
  lib/embedded-cmdline.S     - .incbin's the rendered text plus a NUL
                               (listed under the EXTRA BOOT CONFIG
                               MAINTAINERS entry)
  lib/Makefile rule          - runs tools/bootconfig at build time
  Makefile prepare dep       - ensures tools/bootconfig is built first,
                               same pattern as tools/objtool and
                               tools/bpf/resolve_btfids

Drop the test target from tools/bootconfig/Makefile's default 'all'
recipe so that hooking the binary into the kernel build does not run
test-bootconfig.sh on every prepare. The tests stay available as
'make -C tools/bootconfig test', matching the convention of
tools/objtool and tools/bpf/resolve_btfids whose 'all' targets only
build the binary.

Require BOOT_CONFIG_EMBED_FILE to be non-empty before the new option
can be enabled, otherwise tools/bootconfig -C runs against an empty
file and prints a parse error on every kernel build.

The feature gates on CONFIG_ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG, a
silent symbol arches select once they've wired the prepend call into
setup_arch(). No arch selects it in this patch, so the user-visible
CONFIG_BOOT_CONFIG_EMBED_CMDLINE is not yet enableable; when an arch
later opts in, the runtime behavior is added by the follow-up patches.

tools/bootconfig also installs on target systems, so its own Makefile
keeps $(CC) and stays cross-buildable as a standalone tool. The kernel
build, which runs the tool on the build host during prepare, instead
forces CC=$(HOSTCC) from a dedicated tools/bootconfig rule, so the
executed binary is always a host binary -- plain $(CC) would
cross-compile it under ARCH=... and fail to exec ("Exec format error").

embedded-cmdline.S places the rendered string in .init.rodata with the
"a" (allocatable, read-only) flag and %progbits, not "aw": the data is
never written at runtime, so it must not land in a writable section.

A follow-up patch wires the build-time tools/bootconfig into the
top-level clean target.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 MAINTAINERS               |  1 +
 Makefile                  | 11 +++++++++++
 init/Kconfig              | 36 ++++++++++++++++++++++++++++++++++++
 lib/Makefile              | 16 ++++++++++++++++
 lib/embedded-cmdline.S    | 16 ++++++++++++++++
 tools/bootconfig/Makefile |  2 +-
 6 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4087b67bbc69..fb9314cbe344 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9845,6 +9845,7 @@ F:	fs/proc/bootconfig.c
 F:	include/linux/bootconfig.h
 F:	lib/bootconfig-data.S
 F:	lib/bootconfig.c
+F:	lib/embedded-cmdline.S
 F:	tools/bootconfig/*
 F:	tools/bootconfig/scripts/*
 
diff --git a/Makefile b/Makefile
index d59f703f9797..4a8ea7c90ca8 100644
--- a/Makefile
+++ b/Makefile
@@ -1543,6 +1543,17 @@ prepare: tools/bpf/resolve_btfids
 endif
 endif
 
+# tools/bootconfig renders the embedded bootconfig into a cmdline at build time.
+ifdef CONFIG_BOOT_CONFIG_EMBED_CMDLINE
+prepare: tools/bootconfig
+endif
+
+# tools/bootconfig is run on the build host during prepare, so force a host
+# binary here; its own Makefile keeps $(CC) for standalone and cross builds.
+tools/bootconfig: FORCE
+	$(Q)mkdir -p $(objtree)/tools
+	$(Q)$(MAKE) O=$(abspath $(objtree)) subdir=tools -C $(srctree)/tools/ bootconfig CC=$(HOSTCC)
+
 # The tools build system is not a part of Kbuild and tends to introduce
 # its own unique issues. If you need to integrate a new tool into Kbuild,
 # please consider locating that tool outside the tools/ tree and using the
diff --git a/init/Kconfig b/init/Kconfig
index ca35184532dc..203b1187fde7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1569,6 +1569,42 @@ config BOOT_CONFIG_EMBED_FILE
 	  This bootconfig will be used if there is no initrd or no other
 	  bootconfig in the initrd.
 
+config ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+	bool
+	help
+	  Silent symbol; no C code reads it directly. Architectures
+	  select it once their setup_arch() calls
+	  xbc_prepend_embedded_cmdline() before parse_early_param().
+	  Its only role is to gate the user-visible
+	  BOOT_CONFIG_EMBED_CMDLINE option per-arch, the same
+	  ARCH_SUPPORTS_* idiom used by ARCH_SUPPORTS_CFI, etc.
+
+config BOOT_CONFIG_EMBED_CMDLINE
+	bool "Render embedded bootconfig as kernel cmdline at build time"
+	depends on BOOT_CONFIG_EMBED
+	depends on BOOT_CONFIG_EMBED_FILE != ""
+	depends on ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG
+	default n
+	help
+	  Render the "kernel" subtree of the embedded bootconfig file into a
+	  flat cmdline string at kernel build time and prepend it to
+	  boot_command_line during early architecture setup. This makes
+	  early_param() handlers (e.g. mem=, earlycon=, loglevel=) see the
+	  values supplied via the embedded bootconfig.
+
+	  The runtime bootconfig parser is unaffected, so tree-structured
+	  consumers such as ftrace boot-time tracing keep working.
+
+	  Note: when an initrd also carries a bootconfig, its "kernel"
+	  subtree is still parsed at runtime, but the embedded "kernel"
+	  keys remain in boot_command_line for parse_early_param() and
+	  end up later than the initrd keys in saved_command_line, so
+	  parse_args() last-wins favors the embedded values. If you need
+	  initrd to override embedded kernel.* keys, leave this option
+	  off.
+
+	  If unsure, say N.
+
 config CMDLINE_LOG_WRAP_IDEAL_LEN
 	int "Length to try to wrap the cmdline when logged at boot"
 	default 1021
diff --git a/lib/Makefile b/lib/Makefile
index 6e72d2c1cce7..9de0ac7732a2 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -273,6 +273,22 @@ filechk_defbconf = cat $(or $(real-prereqs), /dev/null)
 $(obj)/default.bconf: $(CONFIG_BOOT_CONFIG_EMBED_FILE) FORCE
 	$(call filechk,defbconf)
 
+obj-$(CONFIG_BOOT_CONFIG_EMBED_CMDLINE) += embedded-cmdline.o
+$(obj)/embedded-cmdline.o: $(obj)/embedded_cmdline.bin
+
+# Render the bootconfig "kernel" subtree to a flat cmdline string using
+# the userspace tools/bootconfig parser (-C mode). The runtime prepend
+# helper enforces COMMAND_LINE_SIZE at boot, so no build-time size
+# check is performed here (COMMAND_LINE_SIZE is an arch header
+# constant, not a Kconfig value).
+quiet_cmd_render_cmdline = BCONF2C $@
+      cmd_render_cmdline = \
+	$(objtree)/tools/bootconfig/bootconfig -C $< > $@
+
+targets += embedded_cmdline.bin
+$(obj)/embedded_cmdline.bin: $(obj)/default.bconf $(objtree)/tools/bootconfig/bootconfig FORCE
+	$(call if_changed,render_cmdline)
+
 obj-$(CONFIG_RBTREE_TEST) += rbtree_test.o
 obj-$(CONFIG_INTERVAL_TREE_TEST) += interval_tree_test.o
 
diff --git a/lib/embedded-cmdline.S b/lib/embedded-cmdline.S
new file mode 100644
index 000000000000..740d7ad2dc01
--- /dev/null
+++ b/lib/embedded-cmdline.S
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Embed the build-time-rendered bootconfig "kernel" subtree as a flat
+ * cmdline string. setup_arch() prepends this to boot_command_line on
+ * architectures that select ARCH_SUPPORTS_CMDLINE_FROM_BOOTCONFIG.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates
+ * Copyright (c) 2026 Breno Leitao <leitao@debian.org>
+ */
+	.section .init.rodata, "a", %progbits
+	.global embedded_kernel_cmdline
+embedded_kernel_cmdline:
+	.incbin "lib/embedded_cmdline.bin"
+	.byte 0
+	.global embedded_kernel_cmdline_end
+embedded_kernel_cmdline_end:
diff --git a/tools/bootconfig/Makefile b/tools/bootconfig/Makefile
index 90eb47c9d8de..4e82fd9553cd 100644
--- a/tools/bootconfig/Makefile
+++ b/tools/bootconfig/Makefile
@@ -15,7 +15,7 @@ override CFLAGS += -Wall -g -I$(CURDIR)/include
 ALL_TARGETS := bootconfig
 ALL_PROGRAMS := $(patsubst %,$(OUTPUT)%,$(ALL_TARGETS))
 
-all: $(ALL_PROGRAMS) test
+all: $(ALL_PROGRAMS)
 
 $(OUTPUT)bootconfig: main.c include/linux/bootconfig.h $(LIBSRC)
 	$(CC) $(filter %.c,$^) $(CFLAGS) $(LDFLAGS) -o $@

-- 
2.53.0-Meta


^ permalink raw reply related

* [PATCH v3 2/6] bootconfig: render descendant keys when xbc_snprint_cmdline() root has a value
From: Breno Leitao @ 2026-06-08 16:23 UTC (permalink / raw)
  To: Masami Hiramatsu, Andrew Morton, Nathan Chancellor, paulmck,
	Nicolas Schier
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, linux-kernel, linux-trace-kernel, linux-kbuild,
	bpf, Breno Leitao, kernel-team
In-Reply-To: <20260608-bootconfig_using_tools-v3-0-4ddd079a0696@debian.org>

xbc_node_for_each_key_value() walks to the first leaf under @root, and
when @root is itself a leaf it yields @root. That happens not only for
an empty "kernel {}" subtree, but also when @root carries both a value
and subkeys, e.g.

	kernel = x
	kernel.foo = bar

Here @root ("kernel") is a leaf because its first child is the value
node "x", so the iterator returns @root first. Feeding @root back into
xbc_node_compose_key_after(root, root) returns -EINVAL, which the only
in-kernel caller papers over with a "len <= 0" check -- but the
follow-up tools/bootconfig -C user propagates the error and turns such
a bootconfig into a build failure. Worse, short-circuiting the whole
call on a leaf @root would silently drop the valid "kernel.foo = bar"
descendant that the pre-existing code rendered.

Skip @root inside the loop instead of bailing out: the value-only entry
is dropped (it is rendered through the "kernel" cmdline path, not here),
while real descendant keys are still emitted. An entirely empty subtree
now renders nothing and returns 0 rather than -EINVAL, matching the
"nothing to render is not an error" semantics expected by the new
build-time caller.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 lib/bootconfig.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/lib/bootconfig.c b/lib/bootconfig.c
index 2ed9ee3dc81c..926094d97397 100644
--- a/lib/bootconfig.c
+++ b/lib/bootconfig.c
@@ -440,6 +440,17 @@ int __init xbc_snprint_cmdline(char *buf, size_t size, struct xbc_node *root)
 	 * itself is well defined and returns the would-be length.
 	 */
 	xbc_node_for_each_key_value(root, knode, val) {
+		/*
+		 * An empty or value-only @root (e.g. "kernel {}" or
+		 * "kernel = x", possibly alongside "kernel.foo = bar")
+		 * yields @root itself here. Skip it: composing a key for it
+		 * would fail with -EINVAL, yet any real descendant keys must
+		 * still be rendered. An entirely empty subtree then renders
+		 * nothing and returns 0 rather than an error.
+		 */
+		if (knode == root)
+			continue;
+
 		ret = xbc_node_compose_key_after(root, knode,
 					xbc_namebuf, XBC_KEYLEN_MAX);
 		if (ret < 0)

-- 
2.53.0-Meta

^ permalink raw reply related

page: next (older)
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox