Netdev List
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Kuniyuki Iwashima @ 2026-06-23 20:13 UTC
  To: Jakub Sitnicki
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jiayuan Chen, John Fastabend, Network Development,
	kernel-team
In-Reply-To: <87mrwlyqg4.fsf@cloudflare.com>

On Tue, Jun 23, 2026 at 1:03 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
> > On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >>
> >> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> >
> >> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> >> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> > >>
> >> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> >> > >> completed all code paths related to sockmap-based redirects should be
> >> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> >> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> >> > >> socket references would remain under BPF_SYSCALL.
> >> > >>
> >> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> > >> ---
> >> > >> Changes in v2:
> >> > >> - Handle prot->recvmsg being NULL (Sashiko)
> >> > >> - Elaborate on the end goal in description
> >> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> >> > >> ---
> >> > >>  net/unix/af_unix.c  | 4 ++--
> >> > >>  net/unix/unix_bpf.c | 6 ++++++
> >> > >>  2 files changed, 8 insertions(+), 2 deletions(-)
> >> > >>
> >> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >> > >> index f7a9d55eee8a..84c11c60c75f 100644
> >> > >> --- a/net/unix/af_unix.c
> >> > >> +++ b/net/unix/af_unix.c
> >> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> >> > >>  #ifdef CONFIG_BPF_SYSCALL
> >> > >>         const struct proto *prot = READ_ONCE(sk->sk_prot);
> >> > >>
> >> > >> -       if (prot != &unix_dgram_proto)
> >> > >> +       if (prot->recvmsg)
> >> > >
> >> > > There is no reason to have this dead branch when
> >> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> >> > >
> >> > > Let's compile out all sockmap code when both configs
> >> > > are not enabled.
> >> > >
> >> > > Since AF_UNIX differs from TCP/UDP, it can take the
> >> > > simpler approach.
> >> >
> >> > Okay, will put the whole file behind hidden config option like so:
> >> >
> >> > --- a/net/unix/Kconfig
> >> > +++ b/net/unix/Kconfig
> >> > @@ -30,3 +30,8 @@ config UNIX_DIAG
> >> >         help
> >> >           Support for UNIX socket monitoring interface used by the ss tool.
> >> >           If unsure, say Y.
> >> > +
> >> > +config UNIX_BPF
> >>
> >> Maybe UNIX_BPF_SOCKMAP or something.
> >> bpf_iter is supported without this config.
> >
> > I don't like where it's going.
> > I strongly dislike new config knobs.
> > I'd rather remove existing knobs.
> > What is the motivation?
>
> The goal is to compile out sockmap bits that use sk_msg.
> NET_SOCK_MSG is the natural, existing candidate.
> New knob wasn't my idea.

I think a config without a description is okay since it's not user-selectable.

>
> Alternatively, we can do this to avoid the extra knob:
>
> ifdef CONFIG_BPF_SYSCALL
> unix-$(CONFIG_NET_SOCK_MSG) += unix_bpf.o
> endif

This is far better; I forgot ifdef is available.

* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Jakub Sitnicki @ 2026-06-23 20:09 UTC
  To: Kuniyuki Iwashima
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <CAAVpQUBARp1qCEomgzWXVe35WatdaswujVLku+RESm_LW0dE7Q@mail.gmail.com>

On Tue, Jun 23, 2026 at 12:31 PM -07, Kuniyuki Iwashima wrote:
> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> Okay, will put the whole file behind hidden config option like so:
>>
>> --- a/net/unix/Kconfig
>> +++ b/net/unix/Kconfig
>> @@ -30,3 +30,8 @@ config UNIX_DIAG
>>         help
>>           Support for UNIX socket monitoring interface used by the ss tool.
>>           If unsure, say Y.
>> +
>> +config UNIX_BPF
>
> Maybe UNIX_BPF_SOCKMAP or something.
> bpf_iter is supported without this config.

Not sure what you have in mind re bpf_iter. Can you share more?


* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Jakub Sitnicki @ 2026-06-23 20:03 UTC
  To: Alexei Starovoitov
  Cc: Kuniyuki Iwashima, bpf, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jiayuan Chen, John Fastabend, Network Development,
	kernel-team
In-Reply-To: <CAADnVQL2pfQ0BoN-vWcuCpbOBBKq_rM7Bp7P4XdLMFER5LGSDg@mail.gmail.com>

On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
> On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>>
>> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >
>> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
>> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> > >>
>> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
>> > >> completed all code paths related to sockmap-based redirects should be
>> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
>> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
>> > >> socket references would remain under BPF_SYSCALL.
>> > >>
>> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> > >> ---
>> > >> Changes in v2:
>> > >> - Handle prot->recvmsg being NULL (Sashiko)
>> > >> - Elaborate on the end goal in description
>> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
>> > >> ---
>> > >>  net/unix/af_unix.c  | 4 ++--
>> > >>  net/unix/unix_bpf.c | 6 ++++++
>> > >>  2 files changed, 8 insertions(+), 2 deletions(-)
>> > >>
>> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> > >> index f7a9d55eee8a..84c11c60c75f 100644
>> > >> --- a/net/unix/af_unix.c
>> > >> +++ b/net/unix/af_unix.c
>> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
>> > >>  #ifdef CONFIG_BPF_SYSCALL
>> > >>         const struct proto *prot = READ_ONCE(sk->sk_prot);
>> > >>
>> > >> -       if (prot != &unix_dgram_proto)
>> > >> +       if (prot->recvmsg)
>> > >
>> > > There is no reason to have this dead branch when
>> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
>> > >
>> > > Let's compile out all sockmap code when both configs
>> > > are not enabled.
>> > >
>> > > Since AF_UNIX differs from TCP/UDP, it can take the
>> > > simpler approach.
>> >
>> > Okay, will put the whole file behind hidden config option like so:
>> >
>> > --- a/net/unix/Kconfig
>> > +++ b/net/unix/Kconfig
>> > @@ -30,3 +30,8 @@ config UNIX_DIAG
>> >         help
>> >           Support for UNIX socket monitoring interface used by the ss tool.
>> >           If unsure, say Y.
>> > +
>> > +config UNIX_BPF
>>
>> Maybe UNIX_BPF_SOCKMAP or something.
>> bpf_iter is supported without this config.
>
> I don't like where it's going.
> I strongly dislike new config knobs.
> I'd rather remove existing knobs.
> What is the motivation?

The goal is to compile out sockmap bits that use sk_msg.
NET_SOCK_MSG is the natural, existing candidate.
New knob wasn't my idea.

Alternatively, we can do this to avoid the extra knob:

ifdef CONFIG_BPF_SYSCALL
unix-$(CONFIG_NET_SOCK_MSG) += unix_bpf.o
endif
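
For illustration only, here is a user-space C sketch of what that Makefile
guard amounts to on the C side: the sockmap code exists only when both
options are set, and everyone else sees a stub. The function name
sketch_unix_sockmap_built() and the locally defined CONFIG_* macros are
hypothetical stand-ins, not taken from the posted patch or from the kernel
tree.

```c
#include <stdbool.h>

/*
 * Stand-ins for Kconfig symbols (assumption: a kernel build would get
 * these from its generated config header, not from this file).
 */
#define CONFIG_BPF_SYSCALL 1
/* CONFIG_NET_SOCK_MSG is left undefined here to model the opt-out. */

#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_NET_SOCK_MSG)
/* Both options on: the sockmap redirect code would be compiled in. */
static bool sketch_unix_sockmap_built(void)
{
	return true;
}
#else
/* Either option off: callers see only this trivial stub. */
static bool sketch_unix_sockmap_built(void)
{
	return false;
}
#endif
```

With only CONFIG_BPF_SYSCALL defined, as above, the stub branch is the one
that gets compiled, which mirrors unix_bpf.o being left out of the build.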

* Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
From: Toke Høiland-Jørgensen @ 2026-06-23 19:59 UTC
  To: Ralf Lici
  Cc: netdev, Daniel Gröber, Antonio Quartulli, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	linux-kernel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
	Beniamino Galvani
In-Reply-To: <20260623163606.33510-1-ralf@mandelbit.com>

Ralf Lici <ralf@mandelbit.com> writes:

> On Mon, 22 Jun 2026 16:36:24 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >> > My second concern is that the SIIT boundary would be a property of
>> >> > rule and hook placement. That gives flexibility, but it also means the
>> >> > translation point has to be constrained and documented very carefully
>> >> > to avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior.
>> >> > For this use case I would rather have the route that matches the
>> >> > translation prefix also be the object that says: leave this family
>> >> > here and continue in the other one.
>> >>
>> >> Yeah, with flexibility comes the ability to shoot yourself in the foot.
>> >> But that's not really different from much of the other functionality we
>> >> have in the kernel today, is it? For netfilter in particular it's
>> >> certainly possible to configure a broken NAT configuration that leads to
>> >> packet drops (or just invalid packets being sent out on a network
>> >> device).
>> >>
>> >
>> > True, misconfiguration is always possible and that alone is not an
>> > argument against the netfilter model. But what do we actually gain in
>> > capability from that flexibility? I agree on the UX argument (an admin
>> > would look in nft first), but in terms of what the feature can do, I
>> > can't yet see what the nft model unlocks. More on this just below.
>> >
>> >> > After looking at the available kernel mechanisms again, I think the
>> >> > better model is probably LWT: routes carry an ipxlat encap referencing a
>> >> > named translator domain configured over netlink. That should represent
>> >> > the stateless, prefix-based and symmetric nature of ipxlat.
>> >>
>> >> I think this description actually hits the nail on the head: What are we
>> >> implementing here? Is it a product feature, or a building block for one?
>> >> The properties you mention wrt consistency, symmetry etc are properties
>> >> of the high-level feature (which is also generally the level things are
>> >> specified in RFCs). Whereas other packet mangling features in the kernel
>> >> are more in the "building block" category, where it's possible to
>> >> configure things to implement a particular feature set / compliance with
>> >> a particular RFC, but it's also possible to do things that are outside
>> >> of that.
>> >>
>> >> I think this relates to the "mechanism, not policy" approach that we
>> >> take to most things in the kernel: implement the building blocks to do
>> >> something in the most general way we can, and then leave it up to
>> >> userspace to configure things in a way that results in a consistent
>> >> high-level system behaviour.
>> >>
>> >
>> > That's a good point, and I agree that we should not bake a high-level
>> > product policy into the kernel if what we need is a reusable mechanism
>> > (the LWT idea was my attempt at exactly that). What I am still trying to
>> > understand is whether there is a useful generic trigger for stateless
>> > cross-family translation beyond the route/prefix/policy-routing cases.
>> >
>> > Routes and policy routing already cover the selectors I can make
>> > coherent for a stateless, per-packet translator: destination/source
>> > prefix, iif/oif/VRF, mark, TOS/DSCP, and so on. nft can of course match
>> > much more than that, but the additional selectors that would materially
>> > change the translation decision seem to be selectors such as L4 fields,
>> > payload state, or conntrack state. Those are exactly the selectors I am
>> > struggling to make correct for a stateless translator:
>> >
>> > - non-first fragments carry no L4 header at all, yet the translator must
>> >   rewrite every fragment (an nft ... tcp dport trigger cannot fire on
>> >   them);
>> >
>> > - ICMP errors must be translated too, but the flow identity lives in the
>> >   quoted inner header (reversed), not in anything an L4/ct match on the
>> >   error packet can see and there is no conntrack to associate them,
>> >   since this is stateless.
>>
>> True in principle, but if (say) you deploy this on a network that is
>> configured so it will never fragment packets, this won't be an issue in
>> practice.
>>
>> I.e., you're quite right that arbitrary matching criteria cannot be
>> guaranteed to result in coherent translation. But I think that goes into
>> the "use it wrong, get wrong results" bin. E.g., if you match on
>> something that results in only a subset of the packets of a flow being
>> translated, well, only that subset of the packets will make it to the
>> destination. The SIIT translator itself should not try to fix this, but
>> neither should it prevent it; that's what I mean by "building block" -
>> it's up to the builder using the blocks to make sure the building
>> doesn't collapse, that's out of scope for the block manufacturer to
>> worry about :)
>>
>
> I agree with that framing. The translation core should not try to prove
> that the surrounding policy describes a coherent SIIT deployment.

Cool!

>> > So an L4-conditional trigger does not look like a good primitive for
>> > correct stateless SIIT unless the action also defragments/refragments or
>> > uses conntrack-like state. Those may be valid mechanisms, but they move
>> > the design away from the stateless per-packet SIIT boundary this RFC is
>> > trying to model.
>> >
>> > So my first question is: is there a useful nft configuration this should
>> > enable that is not naturally expressible as route selection, while still
>> > remaining stateless SIIT rather than a NAT64-like stateful feature?
>> > Maybe there is a real use case there, but I cannot construct one yet.
>>
>> So the poster child for "match on arbitrary criteria" is of course BPF.
>> You can write BPF programs that match on arbitrary parts of the packet
>> header, custom encapsulation headers, or even on out of band things like
>> system state, phase of the moon, or what have you. And we should
>> certainly allow a BPF program to make the decision on whether to perform
>> the SIIT translation.
>>
>> Which... maybe is an argument to keep it as a device like you do in this
>> RFC series? Redirecting to a device is trivially supported from TC-BPF,
>> which also makes it possible to use the translation mechanism without
>> going through the routing subsystem at all, saving a bit of overhead.
>> Whereas making it a route action ties it very closely to the routing
>> subsystem.
>>
>> WDYT?
>>
>
> I see the netdevice appeal for this, especially as a BPF redirect
> target. But as we discussed earlier, the device model has some real
> problems: the device selected by the first route is not the real
> post-translation egress, so the model ends up doing translation and
> reinjection rather than normal transmission. Concretely:
>
> - it needs synthetic routing state purely to get things like MTU for
>   fragmentation, because the real post-translation nexthop is not known
>   at translation time;
>
> - TTL/Hop Limit handling gets harder to reason about because the packet
>   has effectively gone through two routing decisions;
>
> - rx/tx stats can't be made meaningful for a direction-agnostic device
>   whose ndo_start_xmit is really "translate and receive";
>
> - and the setup is not very obvious: create an interface, route packets
>   to it, then have them come back translated.
>
> None of these is fatal on its own, but together they make me think the
> abstraction does not quite fit.

Right, OK, you're right.

> On the BPF point specifically: I agree a BPF program should be able to
> decide whether to translate. What I am less sure about is whether
> redirecting to a netdevice is the best way to expose that. A TC action
> (yet another model, I know :)) gives you the same thing in-pipeline and
> more directly:
>
>     tc filter add dev wwan0 egress \
>         bpf obj match.o action ipxlat4to6 domain clat0
>
> Let BPF make the policy decision, with the native action doing the
> translation work that the current BPF CLAT implementations have trouble
> with: fragmentation, checksum corner cases, and ICMP error inner
> headers (as explained by Beniamino).
>
> So TC clsact looks like the natural in-kernel replacement for today's
> TC-BPF CLAT programs: no extra netdev, you attach to the existing
> uplink, direction is explicit, and on egress you sit on the real route
> dst, so the synthetic-dst and double-routing problems above just don't
> arise. The cost is more moving parts than a single bpf_redirect since
> userspace has to manage clsact, filters, priorities and action
> lifecycle/cleanup.

Hmm, so no one really uses the bpf filter mechanism, since you can just
do everything from an action anyway (and with TCX attachment, you can
even avoid the overhead of the TC filter/action infrastructure
entirely). However, point taken wrt how to integrate this with BPF. I
guess the most flexible thing would be to expose the functionality
directly (as a kfunc callable from a BPF program). Which also fits with
your point below:

> For a gateway translator, though, I still think a device-bound model is
> less natural. There the translation point is more like a forwarding
> decision across routes and nexthops, so a route/LWT attachment, or
> possibly a netfilter attachment seems easier to reason about. Also, as
> you already pointed out while discussing LWT, an admin setting up NAT64
> is more likely to reach for an nft rule than for a clsact filter on a
> specific device.
>
> Taking a step back, ipxlat is really a generic translation engine plus a
> thin harness around it. So rather than pick one attachment, it might be
> worth structuring the engine so different harnesses can drive it.
> There's interesting precedent for this shape:
>
> - ILA, again, is the closest sibling: stateless IPv6 address translation
>   with a shared core in ila_common.c, driven both by an LWT frontend in
>   ila_lwt.c and by an inline netfilter hook with a netlink-configured
>   mapping table in ila_xlat.c.
>
> - act_ct is the precedent for the TC side specifically: a TC action that
>   reuses the netfilter conntrack engine rather than reimplementing it.
>
> And act_nat is the cautionary counter-example: a standalone TC
> reimplementation of stateless NAT that shares no code with nf_nat, and
> carries a "would be nice to share code" comment :)
>
> So I am wondering whether the right direction is to factor the
> translation engine cleanly, land it with one harness first, and keep the
> other attachment points as follow-up work once the core semantics are
> settled.
>
> Does that direction seem reasonable to you?

Yes, reusable functionality that can be called from multiple places
sounds like a good fit; let's try to structure it that way!

As for which hook to start with, well, let's see if we hear back from
the netfilter devs, but either netfilter or the routing subsystem (LWT
style) would be OK for me I think.

-Toke
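
To make the quoted point about non-first fragments concrete, here is a small
user-space C sketch of the check involved. It relies only on the IPv4 header
layout (3 flag bits followed by a 13-bit fragment offset); the sketch_* names
are hypothetical and not part of the ipxlat series, and the field is assumed
to be already converted to host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* IPv4 flags/fragment-offset field, host byte order (assumption). */
#define SKETCH_IP_OFFSET 0x1FFFu  /* 13-bit fragment offset, in 8-byte units */

/* A non-zero offset means this packet is not the first fragment. */
static bool sketch_is_non_first_fragment(uint16_t frag_off_host)
{
	return (frag_off_host & SKETCH_IP_OFFSET) != 0;
}

/* Only unfragmented packets and first fragments start with the L4 header. */
static bool sketch_l4_header_present(uint16_t frag_off_host)
{
	return !sketch_is_non_first_fragment(frag_off_host);
}
```

A trigger that matches on L4 fields can only ever fire when
sketch_l4_header_present() would return true, which is why it cannot select
the remaining fragments of the same datagram.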

* Re: [PATCH bpf-next v4 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Toke Høiland-Jørgensen @ 2026-06-23 19:45 UTC
  To: Avinash Duduskar, ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	dsahern
In-Reply-To: <20260623182849.2623521-1-avinash.duduskar@gmail.com>

Avinash Duduskar <avinash.duduskar@gmail.com> writes:

> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>
>> I think it's better to just move the assignment of params->ifindex
>> entirely into bpf_fib_set_fwd_params(), instead of this restore dance.
>> That way this can be simplified to:
>>
>> 	err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
>> 	if (!err && fwd_dev)
>> 		*fwd_dev = dev;
>> 	return err;
>
> The caller-side restore is ungainly, agreed, but the assignment can't move
> all the way into the helper. The early params->ifindex = dev->ifindex
> sits above the neighbour lookup on purpose: that is d1c362e1dd68a
> ("bpf: Always return target ifindex in bpf_fib_lookup"), which took it
> out of bpf_fib_set_fwd_params() and put it there so a program still
> gets the target ifindex on the BPF_FIB_LKUP_RET_NO_NEIGH path and can
> bpf_redirect_neigh() on it. bpf_fib_set_fwd_params() is called only at
> the set_fwd_params label, below the NO_NEIGH return (and below the IPv6
> NO_SRC_ADDR return), so an assignment living in the helper never runs
> on those paths and params->ifindex falls back to the input. That would
> change the reported ifindex for plain bpf_fib_lookup() callers hitting
> NO_NEIGH, not only the VLAN ones.

Right. Well, seems I forgot about that patch, even though I seem to have
written it :)

> I can still get the caller down to your form by keeping the early write
> and moving just the VLAN_FAILURE rewind into the helper, with one extra
> parameter, the input ifindex saved before the egress write:
>
> 	err = bpf_fib_set_fwd_params(dev, params, flags, mtu, in_ifindex);
> 	if (!err && fwd_dev)
> 		*fwd_dev = dev;
> 	return err;
>
> and the helper owning the rewind in the unreducible branch:
>
> 	} else {
> 		params->ifindex = in_ifindex;
> 		return BPF_FIB_LKUP_RET_VLAN_FAILURE;
> 	}

OK, if we do need to restore it, I think it's better to do it there.

Also, wrt the fwd_dev parameter: Do we really have a use case for using
this from TC? In TC you can just redirect to the VLAN device; this is
meant for XDP which can't do that. So how about we just reject the flag
on the TC side, and get rid of the fwd_dev parameter entirely?

If we do that we're back to just a plain 'return bpf_fib_set_fwd_params()' :)

-Toke
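
As a rough model of the control flow discussed above (the early ifindex write
stays in the caller, the rewind moves into the helper), here is a
self-contained user-space C sketch. All sketch_* names, the struct, and the
return codes are hypothetical stand-ins; the real code uses struct
bpf_fib_lookup and the BPF_FIB_LKUP_RET_* values, and its signatures differ.

```c
#include <stdbool.h>

/* Hypothetical return codes standing in for BPF_FIB_LKUP_RET_*. */
enum sketch_ret {
	SKETCH_RET_SUCCESS = 0,
	SKETCH_RET_VLAN_FAILURE = 1,
};

struct sketch_params {
	int ifindex;	/* input ifindex on entry, egress ifindex on success */
};

/* The helper owns the rewind: on failure it restores the saved input. */
static int sketch_set_fwd_params(struct sketch_params *params,
				 bool vlan_resolvable, int in_ifindex)
{
	if (!vlan_resolvable) {
		params->ifindex = in_ifindex;
		return SKETCH_RET_VLAN_FAILURE;
	}
	return SKETCH_RET_SUCCESS;
}

/* The caller keeps the early write and returns the helper's result. */
static int sketch_lookup(struct sketch_params *params, int egress_ifindex,
			 bool vlan_resolvable)
{
	int in_ifindex = params->ifindex;	/* saved before the egress write */

	params->ifindex = egress_ifindex;	/* early write stays in place */
	return sketch_set_fwd_params(params, vlan_resolvable, in_ifindex);
}
```

On success the caller's params carry the egress ifindex; on the VLAN failure
branch they are back at the input value, with no restore logic left in the
caller.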


* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Alexei Starovoitov @ 2026-06-23 19:33 UTC
  To: Kuniyuki Iwashima
  Cc: Jakub Sitnicki, bpf, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, Jiayuan Chen, John Fastabend, Network Development,
	kernel-team
In-Reply-To: <CAAVpQUBARp1qCEomgzWXVe35WatdaswujVLku+RESm_LW0dE7Q@mail.gmail.com>

On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >
> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> > >>
> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> > >> completed all code paths related to sockmap-based redirects should be
> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> > >> socket references would remain under BPF_SYSCALL.
> > >>
> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> > >> ---
> > >> Changes in v2:
> > >> - Handle prot->recvmsg being NULL (Sashiko)
> > >> - Elaborate on the end goal in description
> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> > >> ---
> > >>  net/unix/af_unix.c  | 4 ++--
> > >>  net/unix/unix_bpf.c | 6 ++++++
> > >>  2 files changed, 8 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > >> index f7a9d55eee8a..84c11c60c75f 100644
> > >> --- a/net/unix/af_unix.c
> > >> +++ b/net/unix/af_unix.c
> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> > >>  #ifdef CONFIG_BPF_SYSCALL
> > >>         const struct proto *prot = READ_ONCE(sk->sk_prot);
> > >>
> > >> -       if (prot != &unix_dgram_proto)
> > >> +       if (prot->recvmsg)
> > >
> > > There is no reason to have this dead branch when
> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> > >
> > > Let's compile out all sockmap code when both configs
> > > are not enabled.
> > >
> > > Since AF_UNIX differs from TCP/UDP, it can take the
> > > simpler approach.
> >
> > Okay, will put the whole file behind hidden config option like so:
> >
> > --- a/net/unix/Kconfig
> > +++ b/net/unix/Kconfig
> > @@ -30,3 +30,8 @@ config UNIX_DIAG
> >         help
> >           Support for UNIX socket monitoring interface used by the ss tool.
> >           If unsure, say Y.
> > +
> > +config UNIX_BPF
>
> Maybe UNIX_BPF_SOCKMAP or something.
> bpf_iter is supported without this config.

I don't like where it's going.
I strongly dislike new config knobs.
I'd rather remove existing knobs.
What is the motivation?

* Re: [PATCH bpf 2/2] selftests/bpf: Add test for UDP sock leak on sockmap lookup-bind-release
From: bot+bpf-ci @ 2026-06-23 19:32 UTC
  To: mhal, john.fastabend, jakub, jiayuan.chen, davem, edumazet, kuba,
	pabeni, horms, ast, cong.wang, daniel, andrii, eddyz87, memxor,
	martin.lau, song, yonghong.song, jolsa, emil, shuah
  Cc: netdev, bpf, linux-kernel, linux-kselftest, mhal, ast, andrii,
	daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260623-sockmap-lookup-udp-leak-v1-2-05804f9308e4@rbox.co>

[-- Attachment #1: Type: text/plain, Size: 1107 bytes --]

> selftests/bpf: Add test for UDP sock leak on sockmap lookup-bind-release
>
>     Setup and join a cgroup, then attach a cgroup/connect4 program that runs
>
>        sk = bpf_map_lookup_elem(sockmap, 0)
>        bpf_bind(ctx, sa, sizeof(sa))
>        bpf_sk_release(sk)
>
>     Unpatched kernel leaks the socket.
>
>     Signed-off-by: Michal Luczaj <mhal@rbox.co>

This test reproduces a UDP socket leak across sockmap lookup-bind-release,
but there is no Fixes: tag.

The sibling fix commit a2510dc351c5 ("bpf, sockmap: Don't leak UDP socks on
lookup-bind-release") carries a Fixes: tag pointing at the commit that lifted
the socket-state restriction allowing unbound UDP sockets into sockmap.

Should this test carry the same tag so it stays backportable alongside the
fix?

  Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/28049771327

* Re: s2io: driver still in use - please reconsider removal
From: Ethan Nelson-Moore @ 2026-06-23 19:31 UTC
  To: Michael Pratte
  Cc: Jakub Kicinski, Paolo Abeni, Eric Dumazet, Andrew Lunn,
	Simon Horman, David S . Miller, netdev
In-Reply-To: <20260623112133.752195-1-slatoncomputers@gmail.com>

Hi, Michael,

On Tue, Jun 23, 2026 at 4:21 AM Michael Pratte
<slatoncomputers@gmail.com> wrote:
> Commit aba0138eb7d7 ("net: ethernet: neterion: s2io: remove unused
> driver") removed s2io in v7.0 as "highly unlikely to still be used."
> It is still in use here: an Exar Xframe-II (PCI 17d5:5832) in a
> Supermicro X5DA8.
>
> Bringing it up, I found that no TCP can be transmitted on these
> adapters since v4.2.
[...]
> Given it is evidently still in use, would
> you consider reverting the removal?

Given that the driver has not been working for almost 11 years and you
are seemingly the first person to notice, I would like to respectfully
disagree with this assertion.

Are you using the card for actual work, or are you just testing it out
of curiosity? What kernel version were you running before you upgraded
to a current kernel?

Ethan

* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Kuniyuki Iwashima @ 2026-06-23 19:31 UTC
  To: Jakub Sitnicki
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <87v7b9ysep.fsf@cloudflare.com>

On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> >> completed all code paths related to sockmap-based redirects should be
> >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> >> socket references would remain under BPF_SYSCALL.
> >>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >> Changes in v2:
> >> - Handle prot->recvmsg being NULL (Sashiko)
> >> - Elaborate on the end goal in description
> >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> >> ---
> >>  net/unix/af_unix.c  | 4 ++--
> >>  net/unix/unix_bpf.c | 6 ++++++
> >>  2 files changed, 8 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >> index f7a9d55eee8a..84c11c60c75f 100644
> >> --- a/net/unix/af_unix.c
> >> +++ b/net/unix/af_unix.c
> >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> >>  #ifdef CONFIG_BPF_SYSCALL
> >>         const struct proto *prot = READ_ONCE(sk->sk_prot);
> >>
> >> -       if (prot != &unix_dgram_proto)
> >> +       if (prot->recvmsg)
> >
> > There is no reason to have this dead branch when
> > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> >
> > Let's compile out all sockmap code when both configs
> > are not enabled.
> >
> > Since AF_UNIX differs from TCP/UDP, it can take the
> > simpler approach.
>
> Okay, will put the whole file behind hidden config option like so:
>
> --- a/net/unix/Kconfig
> +++ b/net/unix/Kconfig
> @@ -30,3 +30,8 @@ config UNIX_DIAG
>         help
>           Support for UNIX socket monitoring interface used by the ss tool.
>           If unsure, say Y.
> +
> +config UNIX_BPF

Maybe UNIX_BPF_SOCKMAP or something.
bpf_iter is supported without this config.

> +       bool
> +       depends on UNIX
> +       default y if BPF_SYSCALL && NET_SOCK_MSG

* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Jakub Sitnicki @ 2026-06-23 19:21 UTC
  To: Kuniyuki Iwashima
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <CAAVpQUBsQFFxJFDnJzxmsER3bOjm=zqJ5P5MSeW_T9v-4639cw@mail.gmail.com>

On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
>> completed all code paths related to sockmap-based redirects should be
>> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
>> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
>> socket references would remain under BPF_SYSCALL.
>>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>> Changes in v2:
>> - Handle prot->recvmsg being NULL (Sashiko)
>> - Elaborate on the end goal in description
>> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
>> ---
>>  net/unix/af_unix.c  | 4 ++--
>>  net/unix/unix_bpf.c | 6 ++++++
>>  2 files changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index f7a9d55eee8a..84c11c60c75f 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
>> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
>>  #ifdef CONFIG_BPF_SYSCALL
>>         const struct proto *prot = READ_ONCE(sk->sk_prot);
>>
>> -       if (prot != &unix_dgram_proto)
>> +       if (prot->recvmsg)
>
> There is no reason to have this dead branch when
> CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
>
> Let's compile out all sockmap code when both configs
> are not enabled.
>
> Since AF_UNIX differs from TCP/UDP, it can take the
> simpler approach.

Okay, will put the whole file behind hidden config option like so:

--- a/net/unix/Kconfig
+++ b/net/unix/Kconfig
@@ -30,3 +30,8 @@ config UNIX_DIAG
        help
          Support for UNIX socket monitoring interface used by the ss tool.
          If unsure, say Y.
+
+config UNIX_BPF
+       bool
+       depends on UNIX
+       default y if BPF_SYSCALL && NET_SOCK_MSG

^ permalink raw reply

* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Eric Biggers @ 2026-06-23 19:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bastien Nocera, linux-crypto, Herbert Xu, Marcel Holtmann,
	Luiz Augusto von Dentz, linux-doc, linux-api, linux-kernel,
	netdev, linux-bluetooth, ell
In-Reply-To: <CAHk-=wgNG=F3xO9PjL0RcKy3UWvq0Np9uZu+nFUQBAA8So9xdA@mail.gmail.com>

On Tue, Jun 23, 2026 at 11:56:10AM -0700, Linus Torvalds wrote:
> On Tue, 23 Jun 2026 at 09:51, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > We're aware of that and are taking it into account in the allowlist:
> 
> Note that if we can just unconditionally make it depend on
> CAP_NET_ADMIN, that would be good - independently of any allowlist.
> 
> Because if iwd and bluetoothd are the main two users, and both of
> those already require CAP_NET_ADMIN anyway...

There's also cryptsetup, including unprivileged benchmarking and also
(in theory) formatting support, and pre-7.0 versions of iproute2 which
used it for computing SHA-1 hashes of BPF programs.

If we broke unprivileged 'cryptsetup benchmark', some people would
definitely notice.  However, since it's just a manually-run benchmark
anyway, users could just run it with sudo.

I don't know about the iproute2 case.

It depends how aggressive we want to be.  My current proposal
(https://lore.kernel.org/linux-crypto/20260622234803.6982-1-ebiggers@kernel.org/)
has the entries in the allowlist marked as either privileged or
unprivileged.  There are just a few unprivileged ones, for cryptsetup
and iproute2 as mentioned.  But we could try doing away with the
unprivileged ones entirely and see who complains.

- Eric

^ permalink raw reply

* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Linus Torvalds @ 2026-06-23 18:56 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Bastien Nocera, linux-crypto, Herbert Xu, Marcel Holtmann,
	Luiz Augusto von Dentz, linux-doc, linux-api, linux-kernel,
	netdev, linux-bluetooth, ell
In-Reply-To: <20260623164932.GA1793@sol>

On Tue, 23 Jun 2026 at 09:51, Eric Biggers <ebiggers@kernel.org> wrote:
>
> We're aware of that and are taking it into account in the allowlist:

Note that if we can just unconditionally make it depend on
CAP_NET_ADMIN, that would be good - independently of any allowlist.

Because if iwd and bluetoothd are the main two users, and both of
those already require CAP_NET_ADMIN anyway...

                Linus

^ permalink raw reply

* [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: Michal Luczaj @ 2026-06-23 18:03 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Jiayuan Chen, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Alexei Starovoitov, Cong Wang, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260623-sockmap-lookup-udp-leak-v1-0-05804f9308e4@rbox.co>

UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.

Because sockmap accepts unbound UDP sockets, a BPF program can increment a
socket's refcount via lookup. If the socket is subsequently bound, the
transition from unbound to bound causes bpf_sk_release() to skip the
decrement of the refcount, causing a memory leak.

unreferenced object 0xffff88810bc2eb40 (size 1984):
  comm "test_progs", pid 2451, jiffies 4295320596
  hex dump (first 32 bytes):
    7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00  ................
    02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00  ...@............
  backtrace (crc bdee079d):
    kmem_cache_alloc_noprof+0x557/0x660
    sk_prot_alloc+0x69/0x240
    sk_alloc+0x30/0x460
    inet_create+0x2ce/0xf80
    __sock_create+0x25b/0x5c0
    __sys_socket+0x119/0x1d0
    __x64_sys_socket+0x72/0xd0
    do_syscall_64+0xa1/0x5f0
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

Maintain balanced refcounts across sk lookup/release: (re-)set
SOCK_RCU_FREE on proto update to treat the socket (whether bound or
unbound) as not requiring a refcount increment on an RCU-protected lookup.
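
The asymmetry can be modelled in userspace. This is a minimal sketch, not
kernel code: struct fake_sock, lookup(), release() and bind_sock() are
hypothetical stand-ins that only mimic the rule described above, namely that
lookup and release each test "is this socket refcounted?" at the moment they
run, and that binding flips the answer in between.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct sock. */
struct fake_sock {
	int refcnt;
	bool rcu_free;	/* stands in for SOCK_RCU_FREE */
};

/* Mirrors the idea that an RCU-freed socket needs no reference. */
static bool is_refcounted(const struct fake_sock *sk)
{
	return !sk->rcu_free;
}

static void lookup(struct fake_sock *sk)
{
	if (is_refcounted(sk))
		sk->refcnt++;
}

static void release(struct fake_sock *sk)
{
	if (is_refcounted(sk))
		sk->refcnt--;
}

/* Binding sets the flag, as described for UDP sockets. */
static void bind_sock(struct fake_sock *sk)
{
	sk->rcu_free = true;
}

/* Returns the refcount left after a lookup, bind, release sequence. */
static int lookup_bind_release(bool set_flag_on_map_update)
{
	struct fake_sock sk = { .refcnt = 1, .rcu_free = false };

	if (set_flag_on_map_update)
		sk.rcu_free = true;	/* models the proposed fix */

	lookup(&sk);
	bind_sock(&sk);
	release(&sk);
	return sk.refcnt;
}
```

In this model the sequence ends with one reference too many when the flag is
only set by bind, and stays balanced when the flag is already set before the
lookup.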

Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Note: this issue is related to commit 67312adc96b5 ("bpf: reject unhashed
sockets in bpf_sk_assign").
---
 net/ipv4/udp_bpf.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index ad57c4c9eaab..970327b59582 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -173,6 +173,9 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 	if (sk->sk_family == AF_INET6)
 		udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
 
+	/* Treat all sockets as non-refcounted, regardless of binding state. */
+	sock_set_flag(sk, SOCK_RCU_FREE);
+
 	sock_replace_proto(sk, &udp_bpf_prots[family]);
 	return 0;
 }

-- 
2.54.0


^ permalink raw reply related

* [PATCH net 1/1] net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF
From: Jamal Hadi Salim @ 2026-06-23 18:42 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, horms, victor, andrew+netdev,
	zdi-disclosures, security, stable, Jamal Hadi Salim

The teql master->slaves singly linked list is not protected against concurrent
writes. It can be modified concurrently from teql_master_xmit(), teql_dequeue(),
teql_qdisc_init() and teql_destroy() without holding any list lock or RCU protection.

zdi-disclosures@trendmicro.com has demonstrated that the qdisc is freed
after an RCU grace period, but teql_master_xmit() running on another
CPU can still hold a stale pointer into the list, resulting in a
slab-use-after-free:

BUG: KASAN: slab-use-after-free in teql_destroy+0x3ca/0x440 linux/net/sched/sch_teql.c:142
Read of size 8 at addr ffff88802923aa80 by task ip/10024

CPU: 1 UID: 0 PID: 10024 Comm: ip Not tainted 7.1.0-rc5 #1 PREEMPT(lazy)
Hardware name: QEMU Ubuntu 25.10 PC v2 (i440FX + PIIX, + 10.1 machine, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
 <TASK>
 __dump_stack linux/lib/dump_stack.c:94
 dump_stack_lvl+0x100/0x190 linux/lib/dump_stack.c:120
 print_address_description linux/mm/kasan/report.c:378
 print_report+0x139/0x4ad linux/mm/kasan/report.c:482
 kasan_report+0xe4/0x1d0 linux/mm/kasan/report.c:595
 teql_destroy+0x3ca/0x440 linux/net/sched/sch_teql.c:142
 __qdisc_destroy+0x109/0x540 linux/net/sched/sch_generic.c:1100
 qdisc_put+0xad/0xf0 linux/net/sched/sch_generic.c:1128
 dev_shutdown+0x1cd/0x450 linux/net/sched/sch_generic.c:1493
 unregister_netdevice_many_notify+0xd30/0x24b0 linux/net/core/dev.c:12409
 rtnl_delete_link linux/net/core/rtnetlink.c:3552
 rtnl_dellink+0x476/0xb50 linux/net/core/rtnetlink.c:3594
 rtnetlink_rcv_msg+0x954/0xe80 linux/net/core/rtnetlink.c:6997
 netlink_rcv_skb+0x156/0x420 linux/net/netlink/af_netlink.c:2550
 netlink_unicast_kernel linux/net/netlink/af_netlink.c:1318
 netlink_unicast+0x58d/0x860 linux/net/netlink/af_netlink.c:1344
 netlink_sendmsg+0x89a/0xd80 linux/net/netlink/af_netlink.c:1894
 sock_sendmsg_nosec linux/net/socket.c:787
 __sock_sendmsg linux/net/socket.c:802
 ____sys_sendmsg+0x9d9/0xb70 linux/net/socket.c:2698
 ___sys_sendmsg+0x194/0x1e0 linux/net/socket.c:2752
 __sys_sendmsg+0x171/0x220 linux/net/socket.c:2784
 do_syscall_x64 linux/arch/x86/entry/syscall_64.c:63
 do_syscall_64+0xff/0x890 linux/arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f linux/arch/x86/entry/entry_64.S:121
[..]

The zdi-disclosures@trendmicro.com repro created concurrent AF_PACKET senders
on a teql device against a thread that repeatedly adds/deletes the slave qdisc,
together with a SLUB spray that reclaims the freed slot; the resulting
UAF is controllable enough to be turned into a read/write primitive against the
freed qdisc object.

The fix?
Add a per-master slaves_lock spinlock that serializes all mutations of
master->slaves and the NEXT_SLAVE() links in teql_destroy() and
teql_qdisc_init(). teql_master_xmit() also takes the same slaves_lock around
those updates.
Pair this with READ_ONCE()/WRITE_ONCE() on the shared pointers and
rcu_read_lock_bh()/rcu_read_unlock_bh() around the list traversal in
teql_master_xmit() and teql_dequeue(), so that readers either observe a fully
linked list or are deferred until the in-flight mutation completes. The two
early-return paths in teql_master_xmit() are updated to release the RCU-bh
read-side critical section before returning, since leaving it held would disable
BH on that CPU for good.
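
For readers who want the shape of the locking without the qdisc details, here
is a minimal userspace sketch. It is an assumption-laden model, not the patch:
struct ring, ring_add() and ring_del() are hypothetical names, a C11
atomic_flag stands in for slaves_lock, and the RCU read side is not modelled
at all. It only shows that every mutation of the circular list happens under
one lock and that the "list became empty" follow-up work is reported to the
caller so it can run after the lock is dropped, as teql_destroy() does with
dev_reset_queue().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
};

struct ring {
	atomic_flag lock;	/* stand-in for slaves_lock */
	struct node *head;	/* stand-in for master->slaves */
};

static void ring_init(struct ring *r)
{
	atomic_flag_clear(&r->lock);
	r->head = NULL;
}

static void ring_lock(struct ring *r)
{
	while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
		;	/* spin */
}

static void ring_unlock(struct ring *r)
{
	atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

/* Insert n right after the current head, or make it the only element. */
static void ring_add(struct ring *r, struct node *n)
{
	ring_lock(r);
	if (r->head) {
		n->next = r->head->next;
		r->head->next = n;
	} else {
		n->next = n;
		r->head = n;
	}
	ring_unlock(r);
}

/* Unlink victim; return true if the ring became empty so that the
 * caller can do follow-up work outside the lock.
 */
static bool ring_del(struct ring *r, struct node *victim)
{
	struct node *prev, *q;
	bool emptied = false;

	ring_lock(r);
	prev = r->head;
	if (prev) {
		do {
			q = prev->next;
			if (q == victim) {
				prev->next = q->next;
				if (q == r->head) {
					r->head = q->next;
					if (q == r->head) {
						r->head = NULL;
						emptied = true;
					}
				}
				break;
			}
		} while ((prev = q) != r->head);
	}
	ring_unlock(r);
	return emptied;
}
```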

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: zdi-disclosures@trendmicro.com
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
 net/sched/sch_teql.c | 71 +++++++++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 24 deletions(-)

diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index e7bbc9e5174d..dacdc46637df 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -53,6 +53,7 @@ struct teql_master {
 	struct Qdisc_ops qops;
 	struct net_device *dev;
 	struct Qdisc *slaves;
+	spinlock_t		slaves_lock;	/* serializes writes to ->slaves */
 	struct list_head master_list;
 	unsigned long	tx_bytes;
 	unsigned long	tx_packets;
@@ -101,7 +102,9 @@ teql_dequeue(struct Qdisc *sch)
 	if (skb == NULL) {
 		struct net_device *m = qdisc_dev(q);
 		if (m) {
-			dat->m->slaves = sch;
+			spin_lock_bh(&dat->m->slaves_lock);
+			rcu_assign_pointer(dat->m->slaves, sch);
+			spin_unlock_bh(&dat->m->slaves_lock);
 			netif_wake_queue(m);
 		}
 	} else {
@@ -132,34 +135,37 @@ teql_destroy(struct Qdisc *sch)
 	struct Qdisc *q, *prev;
 	struct teql_sched_data *dat = qdisc_priv(sch);
 	struct teql_master *master = dat->m;
+	struct netdev_queue *txq = NULL;
+	bool reset_master_queue = false;
 
 	if (!master)
 		return;
 
-	prev = master->slaves;
+	spin_lock_bh(&master->slaves_lock);
+	prev = READ_ONCE(master->slaves);
 	if (prev) {
 		do {
-			q = NEXT_SLAVE(prev);
+			q = READ_ONCE(NEXT_SLAVE(prev));
 			if (q == sch) {
-				NEXT_SLAVE(prev) = NEXT_SLAVE(q);
-				if (q == master->slaves) {
-					master->slaves = NEXT_SLAVE(q);
-					if (q == master->slaves) {
-						struct netdev_queue *txq;
-
+				WRITE_ONCE(NEXT_SLAVE(prev), READ_ONCE(NEXT_SLAVE(q)));
+				if (q == READ_ONCE(master->slaves)) {
+					WRITE_ONCE(master->slaves, READ_ONCE(NEXT_SLAVE(q)));
+					if (q == READ_ONCE(master->slaves)) {
 						txq = netdev_get_tx_queue(master->dev, 0);
-						master->slaves = NULL;
-
-						dev_reset_queue(master->dev,
-								txq, NULL);
+						WRITE_ONCE(master->slaves, NULL);
+						reset_master_queue = true;
 					}
 				}
 				skb_queue_purge(&dat->q);
 				break;
 			}
 
-		} while ((prev = q) != master->slaves);
+		} while ((prev = q) != READ_ONCE(master->slaves));
 	}
+	spin_unlock_bh(&master->slaves_lock);
+
+	if (reset_master_queue)
+		dev_reset_queue(master->dev, txq, NULL);
 }
 
 static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
@@ -184,7 +190,8 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 
 	skb_queue_head_init(&q->q);
 
-	if (m->slaves) {
+	spin_lock_bh(&m->slaves_lock);
+	if (READ_ONCE(m->slaves)) {
 		if (m->dev->flags & IFF_UP) {
 			if ((m->dev->flags & IFF_POINTOPOINT &&
 			     !(dev->flags & IFF_POINTOPOINT)) ||
@@ -192,8 +199,10 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 			     !(dev->flags & IFF_BROADCAST)) ||
 			    (m->dev->flags & IFF_MULTICAST &&
 			     !(dev->flags & IFF_MULTICAST)) ||
-			    dev->mtu < m->dev->mtu)
+			    dev->mtu < m->dev->mtu) {
+				spin_unlock_bh(&m->slaves_lock);
 				return -EINVAL;
+			}
 		} else {
 			if (!(dev->flags&IFF_POINTOPOINT))
 				m->dev->flags &= ~IFF_POINTOPOINT;
@@ -204,14 +213,15 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
 			if (dev->mtu < m->dev->mtu)
 				m->dev->mtu = dev->mtu;
 		}
-		q->next = NEXT_SLAVE(m->slaves);
-		NEXT_SLAVE(m->slaves) = sch;
+		WRITE_ONCE(q->next, READ_ONCE(NEXT_SLAVE(m->slaves)));
+		rcu_assign_pointer(NEXT_SLAVE(m->slaves), sch);
 	} else {
-		q->next = sch;
-		m->slaves = sch;
+		WRITE_ONCE(q->next, sch);
+		rcu_assign_pointer(m->slaves, sch);
 		m->dev->mtu = dev->mtu;
 		m->dev->flags = (m->dev->flags&~FMASK)|(dev->flags&FMASK);
 	}
+	spin_unlock_bh(&m->slaves_lock);
 	return 0;
 }
 
@@ -285,7 +295,9 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 	int subq = skb_get_queue_mapping(skb);
 	struct sk_buff *skb_res = NULL;
 
-	start = master->slaves;
+	rcu_read_lock_bh();
+
+	start = rcu_dereference_bh(master->slaves);
 
 restart:
 	nores = 0;
@@ -317,10 +329,14 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 				    netdev_start_xmit(skb, slave, slave_txq, false) ==
 				    NETDEV_TX_OK) {
 					__netif_tx_unlock(slave_txq);
-					master->slaves = NEXT_SLAVE(q);
+					spin_lock_bh(&master->slaves_lock);
+					rcu_assign_pointer(master->slaves,
+							   rcu_dereference_bh(NEXT_SLAVE(q)));
+					spin_unlock_bh(&master->slaves_lock);
 					netif_wake_queue(dev);
 					master->tx_packets++;
 					master->tx_bytes += length;
+					rcu_read_unlock_bh();
 					return NETDEV_TX_OK;
 				}
 				__netif_tx_unlock(slave_txq);
@@ -329,14 +345,18 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 				busy = 1;
 			break;
 		case 1:
-			master->slaves = NEXT_SLAVE(q);
+			spin_lock_bh(&master->slaves_lock);
+			rcu_assign_pointer(master->slaves,
+					   rcu_dereference_bh(NEXT_SLAVE(q)));
+			spin_unlock_bh(&master->slaves_lock);
+			rcu_read_unlock_bh();
 			return NETDEV_TX_OK;
 		default:
 			nores = 1;
 			break;
 		}
 		__skb_pull(skb, skb_network_offset(skb));
-	} while ((q = NEXT_SLAVE(q)) != start);
+	} while ((q = rcu_dereference_bh(NEXT_SLAVE(q))) != start);
 
 	if (nores && skb_res == NULL) {
 		skb_res = skb;
@@ -345,12 +365,14 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	if (busy) {
 		netif_stop_queue(dev);
+		rcu_read_unlock_bh();
 		return NETDEV_TX_BUSY;
 	}
 	master->tx_errors++;
 
 drop:
 	master->tx_dropped++;
+	rcu_read_unlock_bh();
 	dev_kfree_skb(skb);
 	return NETDEV_TX_OK;
 }
@@ -444,6 +466,7 @@ static __init void teql_master_setup(struct net_device *dev)
 	struct teql_master *master = netdev_priv(dev);
 	struct Qdisc_ops *ops = &master->qops;
 
+	spin_lock_init(&master->slaves_lock);
 	master->dev	= dev;
 	ops->priv_size  = sizeof(struct teql_sched_data);
 
-- 
2.54.0


^ permalink raw reply related

* [PATCH bpf 2/2] selftests/bpf: Add test for UDP sock leak on sockmap lookup-bind-release
From: Michal Luczaj @ 2026-06-23 18:03 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Jiayuan Chen, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Alexei Starovoitov, Cong Wang, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260623-sockmap-lookup-udp-leak-v1-0-05804f9308e4@rbox.co>

Set up and join a cgroup, then attach a cgroup/connect4 program that runs

   sk = bpf_map_lookup_elem(sockmap, 0)
   bpf_bind(ctx, sa, sizeof(sa))
   bpf_sk_release(sk)

An unpatched kernel leaks the socket.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 .../selftests/bpf/prog_tests/sockmap_basic.c       | 50 ++++++++++++++++++++++
 .../bpf/progs/test_sockmap_lookup_bind_release.c   | 37 ++++++++++++++++
 2 files changed, 87 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index cb3229711f93..11972ffdb16e 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -7,6 +7,7 @@
 
 #include "test_progs.h"
 #include "test_skmsg_load_helpers.skel.h"
+#include "test_sockmap_lookup_bind_release.skel.h"
 #include "test_sockmap_update.skel.h"
 #include "test_sockmap_invalid_update.skel.h"
 #include "test_sockmap_skb_verdict_attach.skel.h"
@@ -17,6 +18,7 @@
 #include "test_sockmap_msg_pop_data.skel.h"
 #include "bpf_iter_sockmap.skel.h"
 
+#include "cgroup_helpers.h"
 #include "sockmap_helpers.h"
 
 #define TCP_REPAIR		19	/* TCP sock is under repair right now */
@@ -1373,6 +1375,52 @@ static void test_sockmap_multi_channels(int sotype)
 	test_sockmap_pass_prog__destroy(skel);
 }
 
+#define LOOKUP_BIND_RELEASE_CG	"/sockmap_lookup-bind-release"
+#define LOOKUP_BIND_RELEASE_REP	64
+
+static void test_sockmap_lookup_bind_release(void)
+{
+	struct test_sockmap_lookup_bind_release *skel;
+	struct sockaddr_in sa;
+	int cg, i;
+
+	cg = cgroup_setup_and_join(LOOKUP_BIND_RELEASE_CG);
+	if (!ASSERT_OK_FD(cg, "cgroup_setup_and_join"))
+		return;
+
+	skel = test_sockmap_lookup_bind_release__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		goto cleanup;
+
+	skel->links.connect = bpf_program__attach_cgroup(skel->progs.connect, cg);
+	if (!ASSERT_OK_PTR(skel->links.connect, "attach_cgroup"))
+		goto destroy;
+
+	sa.sin_family = AF_INET;
+	sa.sin_port = bpf_htons(1234);
+	sa.sin_addr.s_addr = bpf_htonl(INADDR_LOOPBACK);
+
+	for (i = 0; i < LOOKUP_BIND_RELEASE_REP; ++i) {
+		__close_fd int sk;
+
+		sk = xsocket(AF_INET, SOCK_DGRAM, 0);
+		if (sk < 0)
+			break;
+
+		if (xbpf_map_update_elem(bpf_map__fd(skel->maps.sockmap), &u32(0),
+					 &sk, BPF_ANY))
+			break;
+
+		if (xconnect(sk, (struct sockaddr *)&sa, sizeof(sa)))
+			break;
+	}
+
+destroy:
+	test_sockmap_lookup_bind_release__destroy(skel);
+cleanup:
+	cleanup_cgroup_environment();
+}
+
 void test_sockmap_basic(void)
 {
 	if (test__start_subtest("sockmap create_update_free"))
@@ -1451,4 +1499,6 @@ void test_sockmap_basic(void)
 		test_sockmap_multi_channels(SOCK_STREAM);
 	if (test__start_subtest("sockmap udp multi channels"))
 		test_sockmap_multi_channels(SOCK_DGRAM);
+	if (test__start_subtest("sockmap lookup-bind-release"))
+		test_sockmap_lookup_bind_release();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_lookup_bind_release.c b/tools/testing/selftests/bpf/progs/test_sockmap_lookup_bind_release.c
new file mode 100644
index 000000000000..cc77b193893b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_lookup_bind_release.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SOCKMAP);
+	__uint(max_entries, 1);
+	__type(key, int);
+	__type(value, int);
+} sockmap SEC(".maps");
+
+SEC("cgroup/connect4")
+int connect(struct bpf_sock_addr *ctx)
+{
+	struct bpf_sock *sk;
+	int ret = SK_DROP;
+
+	sk = bpf_map_lookup_elem(&sockmap, &(int){0});
+	if (sk) {
+		if (sk == ctx->sk) {
+			struct sockaddr_in sa = {
+				.sin_family = ctx->user_family,
+				.sin_port = ctx->user_port,
+				.sin_addr.s_addr = ctx->user_ip4
+			};
+
+			ret = !bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa));
+		}
+
+		bpf_sk_release(sk);
+	}
+
+	return ret;
+}
+
+char _license[] SEC("license") = "GPL";

-- 
2.54.0


^ permalink raw reply related

* [PATCH bpf 0/2] bpf, sockmap: Fix sockmap leaking UDP socks
From: Michal Luczaj @ 2026-06-23 18:03 UTC (permalink / raw)
  To: John Fastabend, Jakub Sitnicki, Jiayuan Chen, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Alexei Starovoitov, Cong Wang, Daniel Borkmann, Andrii Nakryiko,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
	Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan
  Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj

Fix a refcount asymmetry for UDP sockets in sockmap lookup/release.
Accompanied by a selftest.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Michal Luczaj (2):
      bpf, sockmap: Don't leak UDP socks on lookup-bind-release
      selftests/bpf: Add test for UDP sock leak on sockmap lookup-bind-release

 net/ipv4/udp_bpf.c                                 |  3 ++
 .../selftests/bpf/prog_tests/sockmap_basic.c       | 50 ++++++++++++++++++++++
 .../bpf/progs/test_sockmap_lookup_bind_release.c   | 37 ++++++++++++++++
 3 files changed, 90 insertions(+)
---
base-commit: 12091470c6b4c1c14b2de12dcbae2ada6cb6d20b
change-id: 20260617-sockmap-lookup-udp-leak-bc4e5c5481d7

Best regards,
--  
Michal Luczaj <mhal@rbox.co>


^ permalink raw reply

* Re: [PATCH bpf-next v4 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Avinash Duduskar @ 2026-06-23 18:28 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, ast, daniel, andrii
  Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
	john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
	hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
	rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
	dsahern
In-Reply-To: <877bnpeaeq.fsf@toke.dk>

Toke Høiland-Jørgensen <toke@redhat.com> writes:

> I think it's better to just move the assignment of params->ifindex
> entirely into bpf_fib_set_fwd_params(), instead of this restore dance.
> That way this can be simplified to:
>
> 	err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
> 	if (!err && fwd_dev)
> 		*fwd_dev = dev;
> 	return err;

The caller-side restore is ungainly, agreed, but the assignment can't move
all the way into the helper. The early params->ifindex = dev->ifindex
sits above the neighbour lookup on purpose: that is d1c362e1dd68a
("bpf: Always return target ifindex in bpf_fib_lookup"), which took it
out of bpf_fib_set_fwd_params() and put it there so a program still
gets the target ifindex on the BPF_FIB_LKUP_RET_NO_NEIGH path and can
bpf_redirect_neigh() on it. bpf_fib_set_fwd_params() is called only at
the set_fwd_params label, below the NO_NEIGH return (and below the IPv6
NO_SRC_ADDR return), so an assignment living in the helper never runs
on those paths and params->ifindex falls back to the input. That would
change the reported ifindex for plain bpf_fib_lookup() callers hitting
NO_NEIGH, not only the VLAN ones.

I can still get the caller down to your form by keeping the early write
and moving just the VLAN_FAILURE rewind into the helper, with one extra
parameter, the input ifindex saved before the egress write:

	err = bpf_fib_set_fwd_params(dev, params, flags, mtu, in_ifindex);
	if (!err && fwd_dev)
		*fwd_dev = dev;
	return err;

and the helper owning the rewind in the unreducible branch:

	} else {
		params->ifindex = in_ifindex;
		return BPF_FIB_LKUP_RET_VLAN_FAILURE;
	}

So the restore leaves the caller; the early egress write stays because
NO_NEIGH and NO_SRC_ADDR depend on it.
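
To make the control flow concrete, here is a small userspace model of the
in_ifindex variant. Everything in it is a hypothetical stand-in (struct
fib_params, the FIB_RET_* values, set_fwd_params(), fib_lookup() and the two
boolean inputs); it is a sketch of the ordering argument only, not the helper
itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical return codes; the numeric values carry no meaning. */
enum fib_ret {
	FIB_RET_SUCCESS,
	FIB_RET_NO_NEIGH,
	FIB_RET_VLAN_FAILURE,
};

/* Hypothetical stand-in for the lookup parameters. */
struct fib_params {
	int ifindex;
};

/* Models the helper owning the rewind in the unreducible branch. */
static enum fib_ret set_fwd_params(struct fib_params *params,
				   bool vlan_reducible, int in_ifindex)
{
	if (!vlan_reducible) {
		params->ifindex = in_ifindex;
		return FIB_RET_VLAN_FAILURE;
	}
	return FIB_RET_SUCCESS;
}

/* Models the caller: the early egress write stays above the
 * neighbour check, so the NO_NEIGH path still reports the egress.
 */
static enum fib_ret fib_lookup(struct fib_params *params, int egress_ifindex,
			       bool have_neigh, bool vlan_reducible)
{
	int in_ifindex = params->ifindex;	/* saved before the write */

	params->ifindex = egress_ifindex;
	if (!have_neigh)
		return FIB_RET_NO_NEIGH;

	return set_fwd_params(params, vlan_reducible, in_ifindex);
}
```

In the model, NO_NEIGH returns with the egress ifindex, VLAN_FAILURE returns
with the input ifindex, and success returns with the egress ifindex.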

3/3 adds a NO_NEIGH arm that pins the egress ifindex (input != egress).
With the assignment moved into the helper, that case reports the input
ifindex instead: the return code stays NO_NEIGH and only the ifindex
flips. It passes with the early write kept.

> If you move the ifdef into the if statement, the if statement can have
> an else-branch that assigns params->ifindex, so you don't need the
> restore dance (see below).

Same constraint: an else-branch inside bpf_fib_set_fwd_params() only
runs when the helper runs, which is never on the NO_NEIGH/NO_SRC_ADDR
returns, so it cannot be the sole writer of the egress ifindex.

Does the in_ifindex version look right to you? The alternative is to
route the error returns through the label so the assignment can live
fully in the helper; threading the return codes back through it works,
but it is its own kind of dance and reads worse to me.

Thanks,
Avinash

^ permalink raw reply

* Re: [PATCH net] nfc: nci: fix out-of-bounds write in nci_target_auto_activated()
From: Simon Horman @ 2026-06-23 17:21 UTC (permalink / raw)
  To: sam
  Cc: Simon Horman, david, davem, edumazet, kuba, pabeni, oe-linux-nfc,
	netdev, linux-kernel, stable
In-Reply-To: <20260622145243.3167276-1-sam@bynar.io>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
nfc: nci: fix out-of-bounds write in nci_target_auto_activated()

This commit adds a bounds check to nci_target_auto_activated() to prevent
an out-of-bounds array write. If ndev->n_targets equals
NCI_MAX_DISCOVERED_TARGETS, it prints a debug message and returns early.

> diff --git a/net/nfc/nci/ntf.c b/net/nfc/nci/ntf.c
> index c96512bb86531..566ca839fa488 100644
> --- a/net/nfc/nci/ntf.c
> +++ b/net/nfc/nci/ntf.c
> @@ -603,6 +603,12 @@ static void nci_target_auto_activated(struct nci_dev *ndev,
>  	struct nfc_target *target;
>  	int rc;
>  
> +	/* This is a new target, check if we've enough room */
> +	if (ndev->n_targets == NCI_MAX_DISCOVERED_TARGETS) {
> +		pr_debug("not enough room, ignoring new target...\n");
> +		return;

[Severity: High]
Does this early return cause the state machine to stall?

Looking at nci_rf_intf_activated_ntf_packet(), the state is transitioned
to NCI_POLL_ACTIVE right before this function is called:

net/nfc/nci/ntf.c:nci_rf_intf_activated_ntf_packet() {
    ...
    atomic_set(&ndev->state, NCI_POLL_ACTIVE);
    if (err == NCI_STATUS_OK)
        nci_target_auto_activated(ndev, &ntf);
    ...
}

Because of the early return here, nfc_targets_found() is skipped. Does this
leave the state wedged in NCI_POLL_ACTIVE, permanently blocking future 
attempts to stop polling since nci_stop_poll() requires the state to be 
NCI_DISCOVERY?

> +	}
> +
>  	target = &ndev->targets[ndev->n_targets];

[ ... ]

[Severity: Critical]
This isn't a bug introduced by this patch, but can the ntf union copy
uninitialized stack memory into persistent device structures?

In net/nfc/nci/ntf.c:nci_rf_discover_ntf_packet(), the local ntf struct
is declared on the stack:

    struct nci_rf_discover_ntf ntf;

If the incoming packet has ntf.rf_tech_specific_params_len == 0, the
extraction routines are bypassed, leaving the rf_tech_specific_params union
uninitialized. The struct is then passed to nci_add_new_target():

    if (add_target == true)
        nci_add_new_target(ndev, &ntf);

This eventually calls nci_add_new_protocol(), which blindly reads from this
union:

    nfca_poll = (struct rf_tech_specific_params_nfca_poll *)params;
    target->sens_res = nfca_poll->sens_res;
    target->sel_res = nfca_poll->sel_res;
    target->nfcid1_len = nfca_poll->nfcid1_len;

Could this uninitialized stack garbage be copied into ndev->targets, and
later broadcast to userspace via Netlink in nfc_targets_found()?

[Severity: Critical]
This isn't a bug introduced by this patch, but does the general bytes length
calculation underflow?

In net/nfc/nci/ntf.c:nci_store_general_bytes_nfc_dep():

    ndev->remote_gb_len = min_t(__u8, 
        (ntf->activation_params.listen_nfc_dep.atr_req_len - 
         NFC_ATR_REQ_GT_OFFSET), NFC_ATR_REQ_GB_MAXSIZE);
         
    memcpy(ndev->remote_gb, 
        (ntf->activation_params.listen_nfc_dep.atr_req + 
         NFC_ATR_REQ_GT_OFFSET), ndev->remote_gb_len);

The length is computed by subtracting NFC_ATR_REQ_GT_OFFSET (14) from
atr_req_len. If atr_req_len is less than 14, this subtraction will
underflow.

Because of the min_t cast, this large underflowed value is bounded to
NFC_ATR_REQ_GB_MAXSIZE (48). The memcpy() then copies 48 bytes starting
from atr_req + 14 into ndev->remote_gb.

Since the source buffer is part of the ntf union allocated on the stack
in nci_rf_intf_activated_ntf_packet(), could this leak up to 48 bytes of
uninitialized kernel stack memory to userspace?
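
The arithmetic can be checked in userspace. This is a minimal sketch under
stated assumptions: the two constants use the values quoted above (14 and
48), remote_gb_len() reproduces the effect of a min_t() on an 8-bit type by
casting both operands to uint8_t, and remote_gb_len_checked() is a
hypothetical guarded variant shown only for comparison, not a proposed fix.

```c
#include <assert.h>
#include <stdint.h>

#define ATR_REQ_GT_OFFSET	14	/* value quoted in the review */
#define ATR_REQ_GB_MAXSIZE	48	/* value quoted in the review */

/* Both operands are truncated to 8 bits before the comparison. */
static uint8_t remote_gb_len(uint8_t atr_req_len)
{
	uint8_t a = (uint8_t)(atr_req_len - ATR_REQ_GT_OFFSET);
	uint8_t b = (uint8_t)ATR_REQ_GB_MAXSIZE;

	return a < b ? a : b;
}

/* Hypothetical variant that refuses lengths shorter than the offset. */
static uint8_t remote_gb_len_checked(uint8_t atr_req_len)
{
	if (atr_req_len < ATR_REQ_GT_OFFSET)
		return 0;
	return remote_gb_len(atr_req_len);
}
```

With atr_req_len = 10 the subtraction gives -4, which truncates to 252 and is
then clamped to 48, so the unchecked version asks for a 48-byte copy from a
10-byte request.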

[Severity: High]
This isn't a bug introduced by this patch, but does the lack of skb->len
validation in NCI response handlers allow out-of-bounds reads?

In nci_core_init_rsp_packet_v1() in net/nfc/nci/rsp.c, fields are read
and arrays are copied from the payload:

    ndev->num_supported_rf_interfaces = 
        rsp_1->num_supported_rf_interfaces;
        
    memcpy(ndev->supported_rf_interfaces, rsp_1->supported_rf_interfaces, 
           ndev->num_supported_rf_interfaces);

Similarly in nci_core_init_rsp_packet_v2():

    while (rf_interface_idx < ndev->num_supported_rf_interfaces) {
        ndev->supported_rf_interfaces[rf_interface_idx++] = 
            *supported_rf_interface++;
    }

Since the NCI control header is stripped via skb_pull(), leaving the
payload length in skb->len, what happens if an NFCC sends a short response
packet (e.g. plen = 1)? Could this cause the handlers to read adjacent
kernel slab memory?

[Severity: High]
This isn't a bug introduced by this patch, but is there a missing boundary
validation when extracting activation parameters?

In net/nfc/nci/ntf.c:nci_extract_activation_params_iso_dep():

    nfca_poll->rats_res_len = min_t(__u8, *data++, NFC_ATS_MAXSIZE);
    if (nfca_poll->rats_res_len > 0) {
        memcpy(nfca_poll->rats_res, data, nfca_poll->rats_res_len);
    }

And in nci_extract_activation_params_nfc_dep():

    poll->atr_res_len = min_t(__u8, *data++, NFC_ATR_RES_MAXSIZE - 2);
    if (poll->atr_res_len > 0)
        memcpy(poll->atr_res, data, poll->atr_res_len);

These inner lengths are bounded by a maximum size like NFC_ATS_MAXSIZE, but
they are not validated against the remaining bytes in the SKB (skb->len).

If a malformed packet provides a large inner length but a short payload,
could this memcpy() read out of bounds into the device structure,
potentially leaking adjacent slab memory to userspace later?
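
One way to picture the missing check is a length-prefixed extraction that is
bounded twice: by the destination size and by the bytes actually left in the
packet. The sketch below is userspace C with hypothetical names
(extract_len_prefixed() and its parameters); whether a short packet should be
rejected, as here, or truncated is a design choice this example does not
settle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a one-byte-length-prefixed field from data into dst.
 * remaining is the number of valid bytes at data, dst_max the
 * capacity of dst. Returns the number of bytes consumed, or -1
 * if the packet is too short for the advertised length.
 */
static int extract_len_prefixed(const uint8_t *data, size_t remaining,
				uint8_t *dst, size_t dst_max,
				uint8_t *out_len)
{
	uint8_t len;

	if (remaining < 1)
		return -1;

	len = data[0];
	if (len > dst_max)
		len = (uint8_t)dst_max;	/* clamp to the destination */

	if ((size_t)len > remaining - 1)
		return -1;		/* payload shorter than claimed */

	memcpy(dst, data + 1, len);
	*out_len = len;
	return 1 + len;
}
```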

[Severity: High]
This isn't a bug introduced by this patch, but is there a potential
priority inversion or timeout deadlock involving the device lock?

Functions like nfc_stop_poll() in net/nfc/core.c acquire the device lock:

    device_lock(&dev->dev);
    ...
    if (dev->ops->stop_poll)
        dev->ops->stop_poll(dev);

This eventually calls nci_request(), which synchronously waits for a
completion signaled by the nci_rx_work thread. However, if an NTF packet
is received first, nci_rx_work processes it and invokes
nfc_targets_found(), which also attempts to acquire the device lock:

    device_lock(&dev->dev);

Since the calling thread already holds the device lock, nci_rx_work blocks
indefinitely. Because the RX worker is blocked, it cannot process the
pending RSP, causing nci_request() to time out and fail. Could this
deadlock the RX thread?

^ permalink raw reply

* Re: [PATCH v12 11/12] x86/vmscape: Resolve conflict between attack-vectors and vmscape=force
From: bot+bpf-ci @ 2026-06-23 18:13 UTC (permalink / raw)
  To: pawan.kumar.gupta, x86, jon, nik.borisov, hpa, jpoimboe,
	david.kaplan, seanjc, bp, dave.hansen, peterz, ast, daniel,
	andrii, kpsingh, jolsa, davem, david.laight.linux, luto, tglx,
	mingo, dsahern, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, sdf, haoluo, pbonzini, corbet, jbaron, aliceryhl,
	rostedt, ardb, skhan
  Cc: linux-kernel, kvm, asit.k.mallick, tao1.zhang, bpf, netdev,
	linux-doc, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260622-vmscape-bhb-v12-11-76cbda0ae3e5@linux.intel.com>

> x86/vmscape: Resolve conflict between attack-vectors and vmscape=force
>
> vmscape=force option currently defaults to AUTO mitigation. This lets
> attack-vector controls override the vmscape mitigation, preventing the
> user from being able to force VMSCAPE mitigation.
>
> When vmscape mitigation is forced, allow it to be deployed irrespective of
> attack vectors. Introduce VMSCAPE_MITIGATION_ON that wins over
> attack-vector controls.
>
> Tested-by: Jon Kohler <jon@nutanix.com>
> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

This appears to fix a behavioral bug where vmscape=force mapped to
VMSCAPE_MITIGATION_AUTO, which is gated by should_mitigate_vuln(), so
attack-vector controls could override a user-forced mitigation.

Should this carry a Fixes: tag?

The force -> VMSCAPE_MITIGATION_AUTO mapping that allowed the override
predates this series. The earlier refactor 68d94c61c09a ("x86/vmscape:
Move mitigation selection to a switch()") only moved the logic into a
switch() and did not introduce the conflict, so it does not look like the
right Fixes: target. A precise commit to point at could not be confirmed
from this tree.

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/28046221590

* Re: [PATCH v12 02/12] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: bot+bpf-ci @ 2026-06-23 18:13 UTC (permalink / raw)
  To: pawan.kumar.gupta, x86, jon, nik.borisov, hpa, jpoimboe,
	david.kaplan, seanjc, bp, dave.hansen, peterz, ast, daniel,
	andrii, kpsingh, jolsa, davem, david.laight.linux, luto, tglx,
	mingo, dsahern, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, sdf, haoluo, pbonzini, corbet, jbaron, aliceryhl,
	rostedt, ardb, skhan
  Cc: linux-kernel, kvm, asit.k.mallick, tao1.zhang, bpf, netdev,
	linux-doc, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260622-vmscape-bhb-v12-2-76cbda0ae3e5@linux.intel.com>

> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
> index 83f51cab0b1e..2cb4a96247d8 100644
> --- a/arch/x86/kernel/cpu/bugs.c
> +++ b/arch/x86/kernel/cpu/bugs.c

[ ... ]

> @@ -3242,6 +3246,15 @@ void __init cpu_select_mitigations(void)
>  		x86_spec_ctrl_base &= ~SPEC_CTRL_MITIGATIONS_MASK;
>  	}
>
> +	/*
> +	 * Switch to long BHB clear sequence on newer CPUs (with BHI_CTRL
> +	 * support), see Intel's BHI guidance.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> +		bhb_seq_outer_loop = 12;
> +		bhb_seq_inner_loop = 7;
> +	}

Should this also consider the case of a guest running under a hypervisor
that does not expose BHI_CTRL?

When BHI_CTRL is absent the loop counts keep their short defaults (5/5)
and the guest declares itself BHI-mitigated.  In a heterogeneous
migration pool a guest may present as Ice Lake (no BHI_CTRL) while
actually running on Alder Lake or newer hardware, where the short
sequence is not sufficient.

A default-safe approach would be to also select the long sequence when
running under a hypervisor without BHI_CTRL, for example:

	if (cpu_feature_enabled(X86_FEATURE_BHI_CTRL) ||
	    cpu_feature_enabled(X86_FEATURE_HYPERVISOR)) {
		bhb_seq_outer_loop = 12;
		bhb_seq_inner_loop = 7;
	}

There is precedent for assuming the worst under a hypervisor, such as
vulnerable_to_its(), cpu_set_bug_bits() and check_null_seg_clears_base().

This came up during review and the resolution was to let the series move
forward and revisit the hypervisor case afterwards.  Is that still the
intent, or should the hypervisor fallback land with this patch?


For the missing Fixes: tag, quoting the full commit message:

commit a2a0812335ee19ae1df81f88e736da5210748c2b
Author: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

>     x86/bhi: Make clear_bhb_loop() effective on newer CPUs
>
>     As a mitigation for BHI, clear_bhb_loop() executes branches that overwrite
>     the Branch History Buffer (BHB). On Alder Lake and newer parts this
>     sequence is not sufficient because it doesn't clear enough entries. This
>     was not an issue because these CPUs use the BHI_DIS_S hardware mitigation
>     in the kernel.
>     Now with VMSCAPE (BHI variant) it is also required to isolate branch
>     history between guests and userspace. Since BHI_DIS_S only protects the
>     kernel, the newer CPUs also use IBPB.
>     A cheaper alternative to the current IBPB mitigation is clear_bhb_loop().
>     But it currently does not clear enough BHB entries to be effective on newer
>     CPUs with larger BHB. At boot, dynamically set the loop count of
>     clear_bhb_loop() such that it is effective on newer CPUs too.
>     Introduce global loop counts, initializing them with appropriate value
>     based on the hardware feature X86_FEATURE_BHI_CTRL.
>     Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
>     Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
>     Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>

The short BHB clear sequence with the hardcoded loop count of 5 was
introduced together with clear_bhb_loop() in entry_64.S, which is the code
this patch makes configurable.  Should this carry:

  Fixes: 7390db8aea0d ("x86/bhi: Add support for clearing branch history at syscall entry")


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/28046221590

* Re: [PATCH v12 08/12] KVM: Define EXPORT_STATIC_CALL_FOR_KVM()
From: bot+bpf-ci @ 2026-06-23 18:13 UTC (permalink / raw)
  To: pawan.kumar.gupta, x86, jon, nik.borisov, hpa, jpoimboe,
	david.kaplan, seanjc, bp, dave.hansen, peterz, ast, daniel,
	andrii, kpsingh, jolsa, davem, david.laight.linux, luto, tglx,
	mingo, dsahern, martin.lau, eddyz87, song, yonghong.song,
	john.fastabend, sdf, haoluo, pbonzini, corbet, jbaron, aliceryhl,
	rostedt, ardb, skhan
  Cc: linux-kernel, kvm, asit.k.mallick, tao1.zhang, bpf, netdev,
	linux-doc, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260622-vmscape-bhb-v12-8-76cbda0ae3e5@linux.intel.com>

> EXPORT_SYMBOL_FOR_KVM() exists to export symbols to KVM modules. Static
> calls need the same treatment when the core kernel defines a static_call
> that KVM needs access to (e.g. from a VM-exit path).
>
> Define EXPORT_STATIC_CALL_FOR_KVM() as the static_call analogue of
> EXPORT_SYMBOL_FOR_KVM(). The same three-way logic applies:
>
>   - KVM_SUB_MODULES defined: export to "kvm," plus all sub-modules
>   - KVM=m, no sub-modules: export to "kvm" only
>   - KVM built-in: no export needed (noop)
>
>   As with EXPORT_SYMBOL_FOR_KVM(), allow architectures to override both
>   macros (e.g. to suppress the export when kvm.ko itself will not be
>   built despite CONFIG_KVM=m). Add the x86 no-op overrides in
>   arch/x86/include/asm/kvm_types.h for that case. To keep the pair in
>   sync, EXPORT_STATIC_CALL_FOR_KVM() is defined inside the
>   EXPORT_SYMBOL_FOR_KVM #ifndef block; an arch that defines
>   EXPORT_SYMBOL_FOR_KVM must also define EXPORT_STATIC_CALL_FOR_KVM or the
>   build will fail with a compile-time error.
>
> As with EXPORT_SYMBOL_FOR_KVM(), allow architectures to override
> EXPORT_STATIC_CALL_FOR_KVM definition (e.g. to suppress the export when
> kvm.ko itself will not be built despite CONFIG_KVM=m). Add the x86 no-op
> override in arch/x86/include/asm/kvm_types.h for that case.
>
> Architectures must also define EXPORT_STATIC_CALL_FOR_KVM when they define
> EXPORT_SYMBOL_FOR_KVM.

This isn't a bug, but the arch-override rationale appears twice in nearly
identical form.  The third paragraph (starting "As with
EXPORT_SYMBOL_FOR_KVM(), allow architectures to override both macros") and
the fourth paragraph (starting "As with EXPORT_SYMBOL_FOR_KVM(), allow
architectures to override EXPORT_STATIC_CALL_FOR_KVM definition") both
explain suppressing the export when kvm.ko will not be built and both
mention adding the x86 no-op override.

The requirement that an arch defining EXPORT_SYMBOL_FOR_KVM must also
define EXPORT_STATIC_CALL_FOR_KVM also shows up twice: once inside the
third paragraph ("an arch that defines EXPORT_SYMBOL_FOR_KVM must also
define EXPORT_STATIC_CALL_FOR_KVM or the build will fail") and again as the
standalone final sentence ("Architectures must also define
EXPORT_STATIC_CALL_FOR_KVM when they define EXPORT_SYMBOL_FOR_KVM").

Could the duplicate paragraphs be folded into a single statement?


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/28046221590

* Re: [PATCH net 1/1] net/sched: cls_api: Handle TC_ACT_CONSUMED in tcf_qevent_handle
From: Jamal Hadi Salim @ 2026-06-23 18:00 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, horms, jiri, victor, security,
	Zero Day Initiative
In-Reply-To: <20260620130749.226642-1-jhs@mojatatu.com>

On Sat, Jun 20, 2026 at 9:07 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> tcf_classify() can return TC_ACT_CONSUMED while the skb is held by the
> defragmentation engine (e.g. act_ct on out-of-order fragments). When
> that happens the skb is no longer owned by the caller and must not be
> touched again.
>
> tcf_qevent_handle() did not handle TC_ACT_CONSUMED: it fell through the
> switch and returned the skb to the caller as if classification had
> passed. The only qdisc that wires up qevents today is RED, via three call
> sites (qe_mark on RED_PROB_MARK/HARD_MARK, qe_early_drop on
> congestion_drop). In this case red_enqueue() continued to operate on an
> skb it no longer owned (enqueueing it, dropping it, or updating
> statistics), resulting in a UAF.
>
>   tc qdisc add dev eth0 root handle 1: red ... qevent early_drop block 10
>   tc filter add block 10 ... action ct
>
>   (with ct defrag enabled and traffic that produces out-of-order
>   fragments, e.g. a fragmented UDP stream)
>
> Handle TC_ACT_CONSUMED in tcf_qevent_handle() the same way the ingress
> and egress fast paths do: treat it as stolen and return NULL without
> touching the skb. Unlike the TC_ACT_STOLEN case, the skb must not be
> dropped/freed here, as it is no longer owned by us.
>

I just looked at the Sashiko claims - one of them (on eBPF) is legit, but
the one on qdiscs is some BS it is making up. I will address the eBPF
one this week.

cheers,
jamal

> Fixes: 3f14b377d01d ("net/sched: act_ct: fix skb leak and crash on ooo frags")
> Reported-by: Zero Day Initiative <zdi-disclosures@trendmicro.com>
> Tested-by: Victor Nogueira <victor@mojatatu.com>
> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
> ---
>  net/sched/cls_api.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 20f7f9ee0b353..3e67600a4a1a1 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -4049,6 +4049,9 @@ struct sk_buff *tcf_qevent_handle(struct tcf_qevent *qe, struct Qdisc *sch, stru
>                 skb_do_redirect(skb);
>                 *ret = __NET_XMIT_STOLEN;
>                 return NULL;
> +       case TC_ACT_CONSUMED:
> +               *ret = __NET_XMIT_STOLEN;
> +               return NULL;
>         }
>
>         return skb;
> --
> 2.34.1
>

* Re: [PATCH v14 0/9] tls: Add TLS 1.3 hardware offload support
From: Nils Juenemann @ 2026-06-23 17:53 UTC (permalink / raw)
  To: rjethwani, netdev
  Cc: borisp, davem, edumazet, john.fastabend, kuba, leon, mbloch,
	saeedm, sd, tariqt

Hi Rishikesh, all,

we have been testing the v14 TLS 1.3 HW offload series on a ConnectX-6
DX and hit a sendfile() final-record loss on the device TX path. We
reduced it to a self-contained C reproducer and characterized it;
reporting it here with the analysis and a question on where a fix belongs.

Setup:

- NIC: ConnectX-6 DX (crypto enabled), FW 22.47.1026, SR-IOV VF,
  TX offload only
- Kernel: net-next + this v14 series
- TLS 1.3, AES-128-GCM, kTLS installed via setsockopt(TLS_TX) on the
  sending side with fixed test crypto material and no handshake, like
  tools/testing/selftests/net/tls
- A server sends a file with the raw sendfile(2) syscall; a client on
  another host reads the decrypted stream and counts the bytes

Trigger: sendfile(2) with a count larger than the bytes remaining in
the file (count > EOF). This is what a generic copy loop / Go's
net.TCPConn.ReadFrom passes for a file of unknown length (~2 GiB). The
kernel sends up to EOF, but the connection's final TLS record then
appears not to be put on the wire unless a subsequent write flushes it.
An abrupt close() appears to drop it, and the peer receives the whole
body except the last record's bytes.

Reproducer results (two hosts over the ConnectX - a loopback/same-host
connection stays on TLS_SW and does not show it). Same file, 226965
bytes (= 13*16384 + 13973):

Path    Count         Teardown                     Bytes received
TLS_HW  count>EOF     close()                      212992 (short)
TLS_HW  count>EOF     close(), no zerocopy         212992 (same)
TLS_HW  count==exact  close()                      226965 (full)
TLS_HW  count>EOF     close_notify, then close()   226965 (full)
TLS_SW  count>EOF     close(), hw-tx-offload off   226965 (full)
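
The byte counts in these results are consistent with TLS record framing: a
TLS record carries at most 16384 bytes of plaintext, so a 226965-byte file
splits into 13 full records plus a 13973-byte tail, and 212992 is exactly
the 13 full records. A small sketch of that arithmetic follows; the helper
name split_records() and the macro TLS_MAX_PLAINTEXT are made up for
illustration and are not kernel identifiers.

```c
#include <stddef.h>

#define TLS_MAX_PLAINTEXT 16384	/* 2^14, maximum TLS record payload */

struct record_split {
	size_t full_records;	/* records filled to TLS_MAX_PLAINTEXT */
	size_t tail_bytes;	/* bytes in the final, partial record */
};

/* Split a stream of file_size bytes into full records and a tail. */
static struct record_split split_records(size_t file_size)
{
	struct record_split s;

	s.full_records = file_size / TLS_MAX_PLAINTEXT;
	s.tail_bytes = file_size % TLS_MAX_PLAINTEXT;
	return s;
}
```

For the 226965-byte test file the short result therefore corresponds to the
peer missing only the final partial record.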

So it is specific to the device-offload TX path: the final record of a
count > EOF sendfile() appears not to be finalized/flushed at EOF, only
by a following write. A bounded count, a trailing write (close_notify),
or software kTLS all avoid it. TLS_TX_ZEROCOPY_RO makes no difference.
We are currently using the exact-count workaround in a preview environment.
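
The exact-count workaround can be sketched roughly as follows. This is an
illustration only, not the code from the reproducer: send_file_exact() is a
hypothetical helper that uses fstat(2) to learn the file size and never asks
sendfile(2) for more bytes than remain before EOF.

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Send the whole of in_fd to out_fd, passing sendfile(2) a count that never
 * exceeds the bytes remaining in the file. Returns bytes sent, or -1 on error.
 */
static ssize_t send_file_exact(int out_fd, int in_fd)
{
	struct stat st;
	off_t off = 0;

	if (fstat(in_fd, &st) < 0)
		return -1;

	while (off < st.st_size) {
		/* sendfile() advances off by the number of bytes it sent. */
		ssize_t n = sendfile(out_fd, in_fd, &off,
				     (size_t)(st.st_size - off));

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (n == 0)
			break;			/* file shrank underneath us */
	}
	return (ssize_t)off;
}
```

This only helps when the file length is known up front; a generic copy loop
over a source of unknown length cannot bound the count this way.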

We may be misreading the code, so this is only a pointer: with
count > EOF tls_push_data() fills the last record without reaching the
size==0 case; on the device path tls_device_record_close() for that
pending record appears to run only on the next push, and an abrupt
teardown appears to discard it. The software path seems to flush
pending TX records on close (tls_sw_release_resources_tx), which would
explain why it is unaffected.

Reproducer:
https://gist.github.com/totallyunknown/a8f0ad3c54e40befde2f5a8d360fa6be

It installs kTLS with fixed test crypto material via
setsockopt(TLS_TX/TLS_RX), sends a file using the raw sendfile(2)
syscall, and compares count > EOF against exact-count and close_notify.
The v14 selftest (patch 9/9) sends via send() only and ends cleanly, so
it misses this; a sendfile() + count > EOF case reproduces it
deterministically for us.

Question: should the device offload finalize and flush the connection's
final record at EOF / on close, the way software kTLS does, or is a
trailing write required by contract? And should a fix live in net/tls
(device record close on the final partial record / the close path) or
on the mlx5 side?

Thanks,
Nils Juenemann

* Re: [PATCH net] net/mlx5e: Use sender devcom for MPV master-up
From: manjunath.b.patil @ 2026-06-23 17:51 UTC (permalink / raw)
  To: Tariq Toukan, Saeed Mahameed, Mark Bloch, Leon Romanovsky, netdev
  Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Patrisious Haddad, linux-rdma, linux-kernel, stable
In-Reply-To: <293db0b4-f308-469e-99c1-ef1b57d41451@nvidia.com>



On 6/22/26 2:01 AM, Tariq Toukan wrote:
> 
> 
> On 10/06/2026 20:39, Manjunath Patil wrote:
>> After PCIe DPC recovery, mlx5 reloads the affected functions and
>> replays multiport affiliation events. In the reported failure, the
>> first relevant device error was:
>>
>>    pcieport 0000:10:01.1: DPC: containment event
>>    pcieport 0000:10:01.1: PCIe Bus Error: severity=Uncorrected (Fatal)
>>    pcieport 0000:10:01.1:    [ 5] SDES                   (First)
>>
>> mlx5 recovered the PCI functions and resumed 0000:11:00.1. During
>> that resume, RDMA multiport binding replayed
>> MLX5_DRIVER_EVENT_AFFILIATION_DONE and mlx5e sent
>> MPV_DEVCOM_MASTER_UP. The host then panicked with:
>>
>>    BUG: kernel NULL pointer dereference, address: 0000000000000010
>>    RIP: mlx5_devcom_comp_set_ready+0x5/0x40 [mlx5_core]
>>    RDI: 0000000000000000
>>
>> Call trace included:
>>
>>    mlx5_devcom_comp_set_ready
>>    mlx5e_devcom_event_mpv
>>    mlx5_devcom_send_event
>>    mlx5_ib_bind_slave_port
>>    mlx5r_mp_probe
>>    mlx5_pci_resume
>>
>> MPV devcom registration publishes mlx5e private data to the component
>> peer list before mlx5e_devcom_init_mpv() stores the returned component
>> device in priv->devcom. A concurrent master-up event can therefore
>> reach a peer whose private data is visible but whose priv->devcom
>> backpointer is still NULL.
>>
>> MPV_DEVCOM_MASTER_UP already carries the sender/master mlx5e private
>> data as event_data. The ready bit is stored on the shared devcom
>> component, not on an individual peer. Use the sender devcom when
>> marking the MPV component ready.
>>
>> This preserves the readiness transition while avoiding a NULL
>> dereference of the peer devcom pointer during affiliation replay after
>> PCI error recovery.
>>
>> Fixes: bf11485f8419 ("net/mlx5: Register mlx5e priv to devcom in MPV mode")
>> Assisted-by: Codex:gpt-5
>> Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
>> Cc: stable@vger.kernel.org # 6.7+
>> ---
> 
> Thanks for your patch and sorry for the late response.
> 
>>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +++++--
>>   1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index 8f2b3abe0092..f7ff20b97e8c 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -211,11 +211,14 @@ static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
>>   static int mlx5e_devcom_event_mpv(int event, void *my_data, void *event_data)
>>   {
>> -    struct mlx5e_priv *slave_priv = my_data;
>> +    struct mlx5e_priv *master_priv = event_data;
> 
> makes sense.
> 
>>       switch (event) {
>>       case MPV_DEVCOM_MASTER_UP:
>> -        mlx5_devcom_comp_set_ready(slave_priv->devcom, true);
>> +        if (!master_priv || !master_priv->devcom)
>> +            return -EINVAL;
> 
> is this currently possible? or just being defensive?
> if this return is unreachable I'd drop it.

Yes, the check is only defensive. For MPV_DEVCOM_MASTER_UP, event_data 
is passed from mlx5e_devcom_init_mpv() after priv->devcom has been 
assigned, so it should not be reachable in the valid path.

Please feel free to drop the check while applying. If you prefer a v2, 
let me know and I will send one.

Thanks,
Manjunath

> 
>> +
>> +        mlx5_devcom_comp_set_ready(master_priv->devcom, true);
>>           break;
>>       case MPV_DEVCOM_MASTER_DOWN:
>>           /* no need for comp set ready false since we unregister after
> 


* [PATCH bpf-next v2 15/15] selftests/bpf: Add test for bpf_tcp_ops header option hooks
From: Amery Hung @ 2026-06-23 17:50 UTC (permalink / raw)
  To: bpf
  Cc: netdev, alexei.starovoitov, andrii, daniel, eddyz87, memxor,
	martin.lau, shakeel.butt, roman.gushchin, kuniyu, kerneljasonxing,
	ameryhung, kernel-team
In-Reply-To: <20260623175006.3136053-1-ameryhung@gmail.com>

Add a test exercising the bpf_tcp_ops parse_hdr, hdr_opt_len and
write_hdr_opt members together with the header option helpers.

The struct_ops program (progs/bpf_tcp_ops_hdr.c) reserves space in
hdr_opt_len via bpf_reserve_hdr_opt(), writes an experimental option in
write_hdr_opt via bpf_store_hdr_opt(), and recovers it in parse_hdr via
bpf_load_hdr_opt() on the incoming skb. Each hook bumps a counter and the
parse hook records the option payload, so the three callbacks and all
three overloaded helpers are covered.

Signed-off-by: Amery Hung <ameryhung@gmail.com>
---
 .../bpf/prog_tests/bpf_tcp_ops_hdr.c          | 97 +++++++++++++++++++
 .../selftests/bpf/progs/bpf_tcp_ops_hdr.c     | 83 ++++++++++++++++
 2 files changed, 180 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/bpf_tcp_ops_hdr.c
 create mode 100644 tools/testing/selftests/bpf/progs/bpf_tcp_ops_hdr.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ops_hdr.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ops_hdr.c
new file mode 100644
index 000000000000..73e34d2be9a4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ops_hdr.c
@@ -0,0 +1,97 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "cgroup_helpers.h"
+#include "bpf_tcp_ops_hdr.skel.h"
+
+#define CGROUP_PATH	"/bpf_tcp_ops_hdr"
+#define TEST_NETNS	"bpf_tcp_ops_hdr"
+
+#define TEST_OPT_D0	0xAB
+#define TEST_OPT_D1	0xCD
+
+static void send_recv(void)
+{
+	char buf[64] = {};
+	int server_fd, client_fd, accept_fd;
+	ssize_t n;
+
+	server_fd = start_server(AF_INET6, SOCK_STREAM, "::1", 0, 0);
+	if (!ASSERT_GE(server_fd, 0, "start_server"))
+		return;
+
+	client_fd = connect_to_fd(server_fd, 0);
+	if (!ASSERT_OK_FD(client_fd, "connect_to_fd"))
+		goto close_server;
+
+	accept_fd = accept(server_fd, NULL, NULL);
+	if (!ASSERT_OK_FD(accept_fd, "accept"))
+		goto close_client;
+
+	/* Exchange data both directions so option-bearing data packets
+	 * are sent and parsed on each side.
+	 */
+	n = send(client_fd, buf, sizeof(buf), 0);
+	ASSERT_EQ(n, sizeof(buf), "client_send");
+	n = recv(accept_fd, buf, sizeof(buf), 0);
+	ASSERT_EQ(n, sizeof(buf), "server_recv");
+
+	n = send(accept_fd, buf, sizeof(buf), 0);
+	ASSERT_EQ(n, sizeof(buf), "server_send");
+	n = recv(client_fd, buf, sizeof(buf), 0);
+	ASSERT_EQ(n, sizeof(buf), "client_recv");
+
+	close(accept_fd);
+close_client:
+	close(client_fd);
+close_server:
+	close(server_fd);
+}
+
+static void run_hdr_opt(void)
+{
+	struct bpf_tcp_ops_hdr *skel = NULL;
+	struct bpf_link *link = NULL;
+	struct netns_obj *ns = NULL;
+	int cgroup_fd;
+
+	cgroup_fd = test__join_cgroup(CGROUP_PATH);
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		return;
+
+	ns = netns_new(TEST_NETNS, true);
+	if (!ASSERT_OK_PTR(ns, "netns_new"))
+		goto done;
+
+	skel = bpf_tcp_ops_hdr__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open_and_load"))
+		goto done;
+
+	link = bpf_map__attach_cgroup_opts(skel->maps.test_hdr_ops, cgroup_fd, NULL);
+	if (!ASSERT_OK_PTR(link, "attach_cgroup"))
+		goto done;
+
+	send_recv();
+
+	/* Reserve + write hooks ran while sending. */
+	ASSERT_GT(skel->bss->hdr_opt_len_cnt, 0, "hdr_opt_len_cnt");
+	ASSERT_GT(skel->bss->write_cnt, 0, "write_cnt");
+	/* Parse hook ran and recovered our option on the receive side. */
+	ASSERT_GT(skel->bss->parse_cnt, 0, "parse_cnt");
+	ASSERT_GT(skel->bss->found_cnt, 0, "found_cnt");
+	ASSERT_EQ(skel->bss->found_d0, TEST_OPT_D0, "found_d0");
+	ASSERT_EQ(skel->bss->found_d1, TEST_OPT_D1, "found_d1");
+
+done:
+	bpf_link__destroy(link);
+	bpf_tcp_ops_hdr__destroy(skel);
+	netns_free(ns);
+	close(cgroup_fd);
+}
+
+void test_bpf_tcp_ops_hdr(void)
+{
+	run_hdr_opt();
+}
diff --git a/tools/testing/selftests/bpf/progs/bpf_tcp_ops_hdr.c b/tools/testing/selftests/bpf/progs/bpf_tcp_ops_hdr.c
new file mode 100644
index 000000000000..46618a604d96
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/bpf_tcp_ops_hdr.c
@@ -0,0 +1,83 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Meta Platforms, Inc. and affiliates. */
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+/* Experimental option kind and payload written/parsed by this test. */
+#define TEST_OPT_KIND	0xFD
+#define TEST_OPT_LEN	4
+#define TEST_OPT_D0	0xAB
+#define TEST_OPT_D1	0xCD
+
+int hdr_opt_len_cnt;
+int write_cnt;
+int parse_cnt;
+int found_cnt;
+__u8 found_d0;
+__u8 found_d1;
+
+SEC("struct_ops")
+void BPF_PROG(test_hdr_opt_len, struct sock *sk, struct sk_buff *skb,
+	      struct request_sock *req, struct sk_buff *syn_skb,
+	      enum tcp_synack_type synack_type, unsigned int *remaining)
+{
+	hdr_opt_len_cnt++;
+
+	/* Reserve TEST_OPT_LEN bytes; the helper decrements *remaining. Stacks
+	 * with other progs in the cgroup hierarchy.
+	 */
+	bpf_reserve_hdr_opt(ctx, TEST_OPT_LEN, 0);
+}
+
+SEC("struct_ops")
+void BPF_PROG(test_write_hdr_opt, struct sock *sk, struct sk_buff *skb,
+	      struct request_sock *req, struct sk_buff *syn_skb,
+	      enum tcp_synack_type synack_type, __u32 opt_off)
+{
+	__u8 opt[TEST_OPT_LEN] = {
+		TEST_OPT_KIND, TEST_OPT_LEN, TEST_OPT_D0, TEST_OPT_D1,
+	};
+
+	/* bpf_store_hdr_opt() takes the program ctx (the kernel reads the
+	 * outgoing skb from it); it appends after any options already written
+	 * in the reserved window, rejects duplicates, and confines the write to
+	 * the header option scratch. Stacks across progs in the cgroup hierarchy.
+	 */
+	if (bpf_store_hdr_opt(ctx, opt, sizeof(opt), 0))
+		return;
+
+	write_cnt++;
+}
+
+SEC("struct_ops")
+void BPF_PROG(test_parse_hdr, struct sock *sk, struct sk_buff *skb)
+{
+	__u8 opt[TEST_OPT_LEN] = {
+		TEST_OPT_KIND, TEST_OPT_LEN, TEST_OPT_D0, TEST_OPT_D1,
+	};
+
+	parse_cnt++;
+
+	/* Look up the experimental option written by test_write_hdr_opt() in
+	 * the incoming skb. For an experimental kind the search matches on the
+	 * 2-byte magic in opt[2..3]; on a match the found option is copied back
+	 * into opt[].
+	 */
+	if (bpf_load_hdr_opt(ctx, opt, sizeof(opt), 0) < 0)
+		return;
+
+	found_d0 = opt[2];
+	found_d1 = opt[3];
+	found_cnt++;
+}
+
+SEC(".struct_ops.link")
+struct bpf_tcp_ops test_hdr_ops = {
+	.hdr_opt_len	= (void *)test_hdr_opt_len,
+	.write_hdr_opt	= (void *)test_write_hdr_opt,
+	.parse_hdr	= (void *)test_parse_hdr,
+};
+
+char _license[] SEC("license") = "GPL";
-- 
2.53.0-Meta

