* Re: [PATCH bpf-next v4 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Toke Høiland-Jørgensen @ 2026-06-23 19:45 UTC
To: Avinash Duduskar, ast, daniel, andrii
Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
dsahern
In-Reply-To: <20260623182849.2623521-1-avinash.duduskar@gmail.com>
Avinash Duduskar <avinash.duduskar@gmail.com> writes:
> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>
>> I think it's better to just move the assignment of params->ifindex
>> entirely into bpf_fib_set_fwd_params(), instead of this restore dance.
>> That way this can be simplified to:
>>
>> err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
>> if (!err && fwd_dev)
>> *fwd_dev = dev;
>> return err;
>
> The caller-side restore is ungainly, agreed, but the assignment can't move
> all the way into the helper. The early params->ifindex = dev->ifindex
> sits above the neighbour lookup on purpose: that is d1c362e1dd68a
> ("bpf: Always return target ifindex in bpf_fib_lookup"), which took it
> out of bpf_fib_set_fwd_params() and put it there so a program still
> gets the target ifindex on the BPF_FIB_LKUP_RET_NO_NEIGH path and can
> bpf_redirect_neigh() on it. bpf_fib_set_fwd_params() is called only at
> the set_fwd_params label, below the NO_NEIGH return (and below the IPv6
> NO_SRC_ADDR return), so an assignment living in the helper never runs
> on those paths and params->ifindex falls back to the input. That would
> change the reported ifindex for plain bpf_fib_lookup() callers hitting
> NO_NEIGH, not only the VLAN ones.
Right. Well, seems I forgot about that patch, even though I seem to have
written it :)
> I can still get the caller down to your form by keeping the early write
> and moving just the VLAN_FAILURE rewind into the helper, with one extra
> parameter, the input ifindex saved before the egress write:
>
> err = bpf_fib_set_fwd_params(dev, params, flags, mtu, in_ifindex);
> if (!err && fwd_dev)
> *fwd_dev = dev;
> return err;
>
> and the helper owning the rewind in the unreducible branch:
>
> } else {
> params->ifindex = in_ifindex;
> return BPF_FIB_LKUP_RET_VLAN_FAILURE;
> }
OK, if we do need to restore it, I think it's better to do it there.
Also, wrt the fwd_dev parameter: Do we really have a use case for using
this from TC? In TC you can just redirect to the VLAN device; this is
meant for XDP which can't do that. So how about we just reject the flag
on the TC side, and get rid of the fwd_dev parameter entirely?
If we do that we're back to just a plain 'return bpf_fib_set_fwd_params()' :)
-Toke
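
For readers following the thread, the ordering being discussed can be modelled in plain userspace C. This is only a sketch under stated assumptions: struct fib_params, set_fwd_params(), fib_lookup_model() and the numeric return codes are hypothetical stand-ins, not the kernel's struct bpf_fib_lookup or bpf_fib_set_fwd_params(). Only the ordering (early egress write kept for the NO_NEIGH path, rewind owned by the helper on VLAN failure) is taken from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the BPF_FIB_LKUP_RET_* codes (values are illustrative). */
enum { RET_SUCCESS = 0, RET_NO_NEIGH = 1, RET_VLAN_FAILURE = 2 };

/* Hypothetical, reduced stand-in for struct bpf_fib_lookup. */
struct fib_params {
	int ifindex;	/* in: ingress ifindex; out: egress ifindex */
};

/* Models the helper: it owns the rewind when the VLAN cannot be reduced. */
static int set_fwd_params(struct fib_params *params, bool vlan_ok, int in_ifindex)
{
	if (!vlan_ok) {
		params->ifindex = in_ifindex;	/* restore the caller's input */
		return RET_VLAN_FAILURE;
	}
	return RET_SUCCESS;
}

/* Models the lookup path: the egress ifindex is written before the neighbour check. */
static int fib_lookup_model(struct fib_params *params, int egress_ifindex,
			    bool have_neigh, bool vlan_ok)
{
	int in_ifindex;

	assert(params);
	in_ifindex = params->ifindex;		/* saved before the egress write */
	params->ifindex = egress_ifindex;	/* early write, kept for NO_NEIGH */

	if (!have_neigh)
		return RET_NO_NEIGH;		/* egress ifindex is still reported */

	return set_fwd_params(params, vlan_ok, in_ifindex);
}
```

In this model the NO_NEIGH return leaves the egress ifindex in place, while the VLAN failure path hands the input ifindex back, which is the behaviour split the thread is trying to preserve.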
* Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
From: Toke Høiland-Jørgensen @ 2026-06-23 19:59 UTC
To: Ralf Lici
Cc: netdev, Daniel Gröber, Antonio Quartulli, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-kernel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
Beniamino Galvani
In-Reply-To: <20260623163606.33510-1-ralf@mandelbit.com>
Ralf Lici <ralf@mandelbit.com> writes:
> On Mon, 22 Jun 2026 16:36:24 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >> > My second concern is that the SIIT boundary would be a property of
>> >> > rule and hook placement. That gives flexibility, but it also means the
>> >> > translation point has to be constrained and documented very carefully
>> >> > to avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior.
>> >> > For this use case I would rather have the route that matches the
>> >> > translation prefix also be the object that says: leave this family
>> >> > here and continue in the other one.
>> >>
>> >> Yeah, with flexibility comes the ability to shoot yourself in the foot.
>> >> But that's not really different from much of the other functionality we
>> >> have in the kernel today, is it? For netfilter in particular it's
>> >> certainly possible to configure a broken NAT configuration that leads to
>> >> packet drops (or just invalid packets being sent out on a network
>> >> device).
>> >>
>> >
>> > True, misconfiguration is always possible and that alone is not an
>> > argument against the netfilter model. But what do we actually gain in
>> > capability from that flexibility? I agree on the UX argument (an admin
>> > would look in nft first), but in terms of what the feature can do, I
>> > can't yet see what the nft model unlocks. More on this just below.
>> >
>> >> > After looking at the available kernel mechanisms again, I think the
>> >> > better model is probably LWT: routes carry an ipxlat encap referencing a
>> >> > named translator domain configured over netlink. That should represent
>> >> > the stateless, prefix-based and symmetric nature of ipxlat.
>> >>
>> >> I think this description actually hits the nail on the head: What are we
>> >> implementing here? Is it a product feature, or a building block for one?
>> >> The properties you mention wrt consistency, symmetry etc are properties
>> >> of the high-level feature (which is also generally the level things are
>> >> specified in RFCs). Whereas other packet mangling features in the kernel
>> >> are more in the "building block" category, where it's possible to
>> >> configure things to implement a particular feature set / compliance with
>> >> a particular RFC, but it's also possible to do things that are outside
>> >> of that.
>> >>
>> >> I think this relates to the "mechanism, not policy" approach that we
>> >> take to most things in the kernel: implement the building blocks to do
>> >> something in the most general way we can, and then leave it up to
>> >> userspace to configure things in a way that results in a consistent
>> >> high-level system behaviour.
>> >>
>> >
>> > That's a good point, and I agree that we should not bake a high-level
>> > product policy into the kernel if what we need is a reusable mechanism
>> > (the LWT idea was my attempt at exactly that). What I am still trying to
>> > understand is whether there is a useful generic trigger for stateless
>> > cross-family translation beyond the route/prefix/policy-routing cases.
>> >
>> > Routes and policy routing already cover the selectors I can make
>> > coherent for a stateless, per-packet translator: destination/source
>> > prefix, iif/oif/VRF, mark, TOS/DSCP, and so on. nft can of course match
>> > much more than that, but the additional selectors that would materially
>> > change the translation decision seem to be selectors such as L4 fields,
>> > payload state, or conntrack state. Those are exactly the selectors I am
>> > struggling to make correct for a stateless translator:
>> >
>> > - non-first fragments carry no L4 header at all, yet the translator must
>> > rewrite every fragment (an nft ... tcp dport trigger cannot fire on
>> > them);
>> >
>> > - ICMP errors must be translated too, but the flow identity lives in the
>> > quoted inner header (reversed), not in anything an L4/ct match on the
>> > error packet can see and there is no conntrack to associate them,
>> > since this is stateless.
>>
>> True in principle, but if (say) you deploy this on a network that is
>> configured so it will never fragment packets, this won't be an issue in
>> practice.
>>
>> I.e., you're quite right that arbitrary matching criteria cannot be
>> guaranteed to result in coherent translation. But I think that goes into
>> the "use it wrong, get wrong results" bin. E.g., if you match on
>> something that results in only a subset of the packets of a flow being
>> translated, well, only that subset of the packets will make it to the
>> destination. The SIIT translator itself should not try to fix this, but
>> neither should it prevent it; that's what I mean by "building block" -
>> it's up to the builder using the blocks to make sure the building
>> doesn't collapse, that's out of scope for the block manufacturer to
>> worry about :)
>>
>
> I agree with that framing. The translation core should not try to prove
> that the surrounding policy describes a coherent SIIT deployment.
Cool!
>> > So an L4-conditional trigger does not look like a good primitive for
>> > correct stateless SIIT unless the action also defragments/refragments or
>> > uses conntrack-like state. Those may be valid mechanisms, but they move
>> > the design away from the stateless per-packet SIIT boundary this RFC is
>> > trying to model.
>> >
>> > So my first question is: is there a useful nft configuration this should
>> > enable that is not naturally expressible as route selection, while still
>> > remaining stateless SIIT rather than a NAT64-like stateful feature?
>> > Maybe there is a real use case there, but I cannot construct one yet.
>>
>> So the poster child for "match on arbitrary criteria" is of course BPF.
>> You can write BPF programs that match on arbitrary parts of the packet
>> header, custom encapsulation headers, or even on out of band things like
>> system state, phase of the moon, or what have you. And we should
>> certainly allow a BPF program to make the decision on whether to perform
>> the SIIT translation.
>>
>> Which... maybe is an argument to keep it as a device like you do in this
>> RFC series? Redirecting to a device is trivially supported from TC-BPF,
>> which also makes it possible to use the translation mechanism without
>> going through the routing subsystem at all, saving a bit of overhead.
>> Whereas making it a route action ties it very closely to the routing
>> subsystem.
>>
>> WDYT?
>>
>
> I see the netdevice appeal for this, especially as a BPF redirect
> target. But as we discussed earlier, the device model has some real
> problems: the device selected by the first route is not the real
> post-translation egress, so the model ends up doing translation and
> reinjection rather than normal transmission. Concretely:
>
> - it needs synthetic routing state purely to get things like MTU for
> fragmentation, because the real post-translation nexthop is not known
> at translation time;
>
> - TTL/Hop Limit handling gets harder to reason about because the packet
> has effectively gone through two routing decisions;
>
> - rx/tx stats can't be made meaningful for a direction-agnostic device
> whose ndo_start_xmit is really "translate and receive";
>
> - and the setup is not very obvious: create an interface, route packets
> to it, then have them come back translated.
>
> None of these is fatal on its own, but together they make me think the
> abstraction does not quite fit.
Right, OK, you're right.
> On the BPF point specifically: I agree a BPF program should be able to
> decide whether to translate. What I am less sure about is whether
> redirecting to a netdevice is the best way to expose that. A TC action
> (yet another model, I know :)) gives you the same thing in-pipeline and
> more directly:
>
> tc filter add dev wwan0 egress \
> bpf obj match.o action ipxlat4to6 domain clat0
>
> Let BPF make the policy decision, with the native action doing the
> translation work that the current BPF CLAT implementations have trouble
> with: fragmentation, checksum corner cases, and ICMP error inner
> headers (as explained by Beniamino).
>
> So TC clsact looks like the natural in-kernel replacement for today's
> TC-BPF CLAT programs: no extra netdev, you attach to the existing
> uplink, direction is explicit, and on egress you sit on the real route
> dst, so the synthetic-dst and double-routing problems above just don't
> arise. The cost is more moving parts than a single bpf_redirect since
> userspace has to manage clsact, filters, priorities and action
> lifecycle/cleanup.
Hmm, so no one really uses the bpf filter mechanism, since you can just
do everything from an action anyway (and with TCX attachment, you can
even avoid the overhead of the TC filter/action infrastructure
entirely). However, point taken wrt how to integrate this with BPF. I
guess the most flexible thing would be to expose the functionality
directly (as a kfunc callable from a BPF program). Which also fits with
your point below:
> For a gateway translator, though, I still think a device-bound model is
> less natural. There the translation point is more like a forwarding
> decision across routes and nexthops, so a route/LWT attachment, or
> possibly a netfilter attachment seems easier to reason about. Also, as
> you already pointed out while discussing LWT, an admin setting up NAT64
> is more likely to reach for an nft rule than for a clsact filter on a
> specific device.
>
> Taking a step back, ipxlat is really a generic translation engine plus a
> thin harness around it. So rather than pick one attachment, it might be
> worth structuring the engine so different harnesses can drive it.
> There's interesting precedent for this shape:
>
> - ILA, again, is the closest sibling: stateless IPv6 address translation
> with a shared core in ila_common.c, driven both by an LWT frontend in
> ila_lwt.c and by an inline netfilter hook with a netlink-configured
> mapping table in ila_xlat.c.
>
> - act_ct is the precedent for the TC side specifically: a TC action that
> reuses the netfilter conntrack engine rather than reimplementing it.
>
> And act_nat is the cautionary counter-example: a standalone TC
> reimplementation of stateless NAT that shares no code with nf_nat, and
> carries a "would be nice to share code" comment :)
>
> So I am wondering whether the right direction is to factor the
> translation engine cleanly, land it with one harness first, and keep the
> other attachment points as follow-up work once the core semantics are
> settled.
>
> Does that direction seem reasonable to you?
Yes, reusable functionality that can be called from multiple places
sounds like a good fit; let's try to structure it that way!
As for which hook to start with, well, let's see if we hear back from
the netfilter devs, but either netfilter or the routing subsystem (LWT
style) would be OK for me I think.
-Toke
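
The "one engine, several harnesses" shape agreed on above could look roughly like the following userspace sketch. Everything named here is an assumption made for illustration: struct xlat_domain, xlat_core_map_4to6(), lwt_harness_xmit() and nf_harness_hook() are hypothetical and are not taken from the ipxlat series. The address mapping shown is the RFC 6052 style embedding of an IPv4 address in the low 32 bits of a /96 prefix, used only as a small stand-in for the real translation work.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical translator domain: holds only a /96 translation prefix. */
struct xlat_domain {
	uint8_t prefix96[12];
};

/* Shared core: map an IPv4 address into the domain's /96 prefix
 * (IPv4 address placed in the low 32 bits of the IPv6 address). */
static void xlat_core_map_4to6(const struct xlat_domain *d,
			       const uint8_t v4[4], uint8_t v6[16])
{
	assert(d && v4 && v6);
	memcpy(v6, d->prefix96, sizeof(d->prefix96));
	memcpy(v6 + sizeof(d->prefix96), v4, 4);
}

/* Hypothetical route/LWT-style harness: the domain comes from the route. */
static void lwt_harness_xmit(const struct xlat_domain *route_domain,
			     const uint8_t v4[4], uint8_t v6[16])
{
	xlat_core_map_4to6(route_domain, v4, v6);
}

/* Hypothetical netfilter-style harness: the domain comes from the rule. */
static void nf_harness_hook(const struct xlat_domain *rule_domain,
			    const uint8_t v4[4], uint8_t v6[16])
{
	xlat_core_map_4to6(rule_domain, v4, v6);
}
```

The point of the split is that each harness only decides when to translate and where the domain comes from; the translation result is the same whichever attachment point drives the core.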
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Jakub Sitnicki @ 2026-06-23 20:03 UTC
To: Alexei Starovoitov
Cc: Kuniyuki Iwashima, bpf, Alexei Starovoitov, Daniel Borkmann,
Jakub Kicinski, Jiayuan Chen, John Fastabend, Network Development,
kernel-team
In-Reply-To: <CAADnVQL2pfQ0BoN-vWcuCpbOBBKq_rM7Bp7P4XdLMFER5LGSDg@mail.gmail.com>
On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
> On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>>
>> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >
>> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
>> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> > >>
>> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
>> > >> completed all code paths related to sockmap-based redirects should be
>> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
>> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
>> > >> socket references would remain under BPF_SYSCALL.
>> > >>
>> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> > >> ---
>> > >> Changes in v2:
>> > >> - Handle prot->recvmsg being NULL (Sashiko)
>> > >> - Elaborate on the end goal in description
>> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
>> > >> ---
>> > >> net/unix/af_unix.c | 4 ++--
>> > >> net/unix/unix_bpf.c | 6 ++++++
>> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
>> > >>
>> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> > >> index f7a9d55eee8a..84c11c60c75f 100644
>> > >> --- a/net/unix/af_unix.c
>> > >> +++ b/net/unix/af_unix.c
>> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
>> > >> #ifdef CONFIG_BPF_SYSCALL
>> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
>> > >>
>> > >> - if (prot != &unix_dgram_proto)
>> > >> + if (prot->recvmsg)
>> > >
>> > > There is no reason to have this dead branch when
>> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
>> > >
>> > > Let's compile out all sockmap code when both configs
>> > > are not enabled.
>> > >
>> > > Since AF_UNIX differs from TCP/UDP, it can take the
>> > > simpler approach.
>> >
>> > Okay, will put the whole file behind hidden config option like so:
>> >
>> > --- a/net/unix/Kconfig
>> > +++ b/net/unix/Kconfig
>> > @@ -30,3 +30,8 @@ config UNIX_DIAG
>> > help
>> > Support for UNIX socket monitoring interface used by the ss tool.
>> > If unsure, say Y.
>> > +
>> > +config UNIX_BPF
>>
>> Maybe UNIX_BPF_SOCKMAP or something.
>> bpf_iter is supported without this config.
>
> I don't like where it's going.
> I strongly dislike new config knobs.
> I'd rather remove existing knobs.
> What is the motivation?
The goal is to compile out sockmap bits that use sk_msg.
NET_SOCK_MSG is the natural, existing candidate.
New knob wasn't my idea.
Alternatively, we can do this to avoid the extra knob:
ifdef CONFIG_BPF_SYSCALL
unix-$(CONFIG_NET_SOCK_MSG) += unix_bpf.o
endif
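
A Makefile rule like this usually needs a matching header so that callers build when unix_bpf.o is left out. The following is a minimal userspace sketch of that pairing, not the actual patch: unix_sockmap_active() and pick_recv_path() are hypothetical names, and the CONFIG_* macros are defined by hand here to simulate a kernel configuration with BPF_SYSCALL enabled and NET_SOCK_MSG disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simulated kernel configuration: BPF_SYSCALL on, NET_SOCK_MSG off. */
#define CONFIG_BPF_SYSCALL 1
/* #define CONFIG_NET_SOCK_MSG 1 */

#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_NET_SOCK_MSG)
/* Would be provided by unix_bpf.o when it is built (hypothetical name). */
bool unix_sockmap_active(const void *sk);
#else
/* Stub used when unix_bpf.o is compiled out: callers need no #ifdef. */
static inline bool unix_sockmap_active(const void *sk)
{
	(void)sk;
	return false;
}
#endif

/* Hypothetical caller: picks the receive path without any config checks. */
static int pick_recv_path(const void *sk)
{
	assert(sk != NULL);
	return unix_sockmap_active(sk) ? 1 : 0;	/* 1: sockmap path, 0: plain path */
}
```

With the stub in place the compiler can drop the sockmap branch entirely in the disabled configuration, so no dead branch is left behind in the caller.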
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Jakub Sitnicki @ 2026-06-23 20:09 UTC
To: Kuniyuki Iwashima
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <CAAVpQUBARp1qCEomgzWXVe35WatdaswujVLku+RESm_LW0dE7Q@mail.gmail.com>
On Tue, Jun 23, 2026 at 12:31 PM -07, Kuniyuki Iwashima wrote:
> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> Okay, will put the whole file behind hidden config option like so:
>>
>> --- a/net/unix/Kconfig
>> +++ b/net/unix/Kconfig
>> @@ -30,3 +30,8 @@ config UNIX_DIAG
>> help
>> Support for UNIX socket monitoring interface used by the ss tool.
>> If unsure, say Y.
>> +
>> +config UNIX_BPF
>
> Maybe UNIX_BPF_SOCKMAP or something.
> bpf_iter is supported without this config.
Not sure what you have in mind re bpf_iter. Can you share more?
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Kuniyuki Iwashima @ 2026-06-23 20:13 UTC
To: Jakub Sitnicki
Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
Jakub Kicinski, Jiayuan Chen, John Fastabend, Network Development,
kernel-team
In-Reply-To: <87mrwlyqg4.fsf@cloudflare.com>
On Tue, Jun 23, 2026 at 1:03 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
> > On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >>
> >> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> >
> >> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> >> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> > >>
> >> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> >> > >> completed all code paths related to sockmap-based redirects should be
> >> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> >> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> >> > >> socket references would remain under BPF_SYSCALL.
> >> > >>
> >> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> > >> ---
> >> > >> Changes in v2:
> >> > >> - Handle prot->recvmsg being NULL (Sashiko)
> >> > >> - Elaborate on the end goal in description
> >> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> >> > >> ---
> >> > >> net/unix/af_unix.c | 4 ++--
> >> > >> net/unix/unix_bpf.c | 6 ++++++
> >> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
> >> > >>
> >> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >> > >> index f7a9d55eee8a..84c11c60c75f 100644
> >> > >> --- a/net/unix/af_unix.c
> >> > >> +++ b/net/unix/af_unix.c
> >> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> >> > >> #ifdef CONFIG_BPF_SYSCALL
> >> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
> >> > >>
> >> > >> - if (prot != &unix_dgram_proto)
> >> > >> + if (prot->recvmsg)
> >> > >
> >> > > There is no reason to have this dead branch when
> >> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> >> > >
> >> > > Let's compile out all sockmap code when both configs
> >> > > are not enabled.
> >> > >
> >> > > Since AF_UNIX differs from TCP/UDP, it can take the
> >> > > simpler approach.
> >> >
> >> > Okay, will put the whole file behind hidden config option like so:
> >> >
> >> > --- a/net/unix/Kconfig
> >> > +++ b/net/unix/Kconfig
> >> > @@ -30,3 +30,8 @@ config UNIX_DIAG
> >> > help
> >> > Support for UNIX socket monitoring interface used by the ss tool.
> >> > If unsure, say Y.
> >> > +
> >> > +config UNIX_BPF
> >>
> >> Maybe UNIX_BPF_SOCKMAP or something.
> >> bpf_iter is supported without this config.
> >
> > I don't like where it's going.
> > I strongly dislike new config knobs.
> > I'd rather remove existing knobs.
> > What is the motivation?
>
> The goal is to compile out sockmap bits that use sk_msg.
> NET_SOCK_MSG is the natural, existing candidate.
> New knob wasn't my idea.
I think a config without a description is okay since it's not user-selectable.
>
> Alternatively, we can do this to avoid the extra knob:
>
> ifdef CONFIG_BPF_SYSCALL
> unix-$(CONFIG_NET_SOCK_MSG) += unix_bpf.o
> endif
This is far better, I forgot ifdef is available.
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Kuniyuki Iwashima @ 2026-06-23 20:14 UTC
To: Jakub Sitnicki
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <87h5mtyq72.fsf@cloudflare.com>
On Tue, Jun 23, 2026 at 1:09 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 12:31 PM -07, Kuniyuki Iwashima wrote:
> > On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> Okay, will put the whole file behind hidden config option like so:
> >>
> >> --- a/net/unix/Kconfig
> >> +++ b/net/unix/Kconfig
> >> @@ -30,3 +30,8 @@ config UNIX_DIAG
> >> help
> >> Support for UNIX socket monitoring interface used by the ss tool.
> >> If unsure, say Y.
> >> +
> >> +config UNIX_BPF
> >
> > Maybe UNIX_BPF_SOCKMAP or something.
> > bpf_iter is supported without this config.
>
> Not sure what you have in mind re bpf_iter. Can you share more?
I meant UNIX_BPF sounds like it covers bpf iterator for AF_UNIX too.
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Amery Hung @ 2026-06-23 20:22 UTC
To: Jakub Sitnicki
Cc: Alexei Starovoitov, Kuniyuki Iwashima, bpf, Alexei Starovoitov,
Daniel Borkmann, Jakub Kicinski, Jiayuan Chen, John Fastabend,
Network Development, kernel-team
In-Reply-To: <87mrwlyqg4.fsf@cloudflare.com>
On Tue, Jun 23, 2026 at 1:04 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
> > On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >>
> >> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> >
> >> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> >> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> > >>
> >> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> >> > >> completed all code paths related to sockmap-based redirects should be
> >> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> >> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> >> > >> socket references would remain under BPF_SYSCALL.
> >> > >>
> >> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> > >> ---
> >> > >> Changes in v2:
> >> > >> - Handle prot->recvmsg being NULL (Sashiko)
> >> > >> - Elaborate on the end goal in description
> >> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> >> > >> ---
> >> > >> net/unix/af_unix.c | 4 ++--
> >> > >> net/unix/unix_bpf.c | 6 ++++++
> >> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
> >> > >>
> >> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >> > >> index f7a9d55eee8a..84c11c60c75f 100644
> >> > >> --- a/net/unix/af_unix.c
> >> > >> +++ b/net/unix/af_unix.c
> >> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> >> > >> #ifdef CONFIG_BPF_SYSCALL
> >> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
> >> > >>
> >> > >> - if (prot != &unix_dgram_proto)
> >> > >> + if (prot->recvmsg)
> >> > >
> >> > > There is no reason to have this dead branch when
> >> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> >> > >
> >> > > Let's compile out all sockmap code when both configs
> >> > > are not enabled.
> >> > >
> >> > > Since AF_UNIX differs from TCP/UDP, it can take the
> >> > > simpler approach.
> >> >
> >> > Okay, will put the whole file behind hidden config option like so:
> >> >
> >> > --- a/net/unix/Kconfig
> >> > +++ b/net/unix/Kconfig
> >> > @@ -30,3 +30,8 @@ config UNIX_DIAG
> >> > help
> >> > Support for UNIX socket monitoring interface used by the ss tool.
> >> > If unsure, say Y.
> >> > +
> >> > +config UNIX_BPF
> >>
> >> Maybe UNIX_BPF_SOCKMAP or something.
> >> bpf_iter is supported without this config.
> >
> > I don't like where it's going.
> > I strongly dislike new config knobs.
> > I'd rather remove existing knobs.
> > What is the motivation?
>
> The goal is to compile out sockmap bits that use sk_msg.
> NET_SOCK_MSG is the natural, existing candidate.
> New knob wasn't my idea.
I'm also missing the big picture here.
sockmap already holds socket references today. You can store and look
up sockets without attaching any verdict/parser program, and no
redirect happens. So if the goal is to use sockmap purely as a socket
container without the sk_msg fast-path overhead, what does a
compile-time NET_SOCK_MSG knob add over the runtime checks?
I am also not sure if NET_SOCK_MSG is right. It is broader than
"sockmap redirect". It is selected by TLS and {INET,INET6}_ESPINTCP.
Because those select it, it can't be toggled independently.
Could you share the concrete use case you have in mind, and whether
this came out of an earlier discussion or thread upstream?
>
> Alternatively, we can do this to avoid the extra knob:
>
> ifdef CONFIG_BPF_SYSCALL
> unix-$(CONFIG_NET_SOCK_MSG) += unix_bpf.o
> endif
>
* [syzbot] Monthly net report (Jun 2026)
From: syzbot @ 2026-06-23 20:32 UTC
To: linux-kernel, netdev, syzkaller-bugs
Hello net maintainers/developers,
This is a 31-day syzbot report for the net subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/net
During the period, 5 new issues were detected and 15 were fixed.
In total, 59 issues are still open and 1748 have already been fixed.
There are also 30 low-priority issues.
Some of the still happening issues:
Ref Crashes Repro Title
<1> 8753 Yes KMSAN: uninit-value in eth_type_trans (2)
https://syzkaller.appspot.com/bug?extid=0901d0cc75c3d716a3a3
<2> 2480 Yes unregister_netdevice: waiting for DEV to become free (9)
https://syzkaller.appspot.com/bug?extid=e2af46126e0644cbebdd
<3> 686 Yes INFO: task hung in tun_chr_close (5)
https://syzkaller.appspot.com/bug?extid=b0ae8f1abf7d891e0426
<4> 456 Yes WARNING in inet_sock_destruct (6)
https://syzkaller.appspot.com/bug?extid=5b3b7e51dda1be027b7a
<5> 447 No possible deadlock in __ipv6_dev_mc_inc
https://syzkaller.appspot.com/bug?extid=afbcf622635e98bf40d2
<6> 443 Yes INFO: task hung in nsim_destroy (4)
https://syzkaller.appspot.com/bug?extid=8141dcbd23a8f857798a
<7> 150 No possible deadlock in __ethtool_get_link_ksettings
https://syzkaller.appspot.com/bug?extid=9bb8bd77f3966641f298
<8> 83 Yes KASAN: use-after-free Read in qdisc_pkt_len_segs_init
https://syzkaller.appspot.com/bug?extid=83181a31faf9455499c5
<9> 67 Yes INFO: rcu detected stall in rescuer_thread (2)
https://syzkaller.appspot.com/bug?extid=d5f7a5097c24c7c2dbbb
<10> 3 Yes memory leak in __vsock_create (2)
https://syzkaller.appspot.com/bug?extid=1b2c9c4a0f8708082678
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
To disable reminders for individual bugs, reply with the following command:
#syz set <Ref> no-reminders
To change bug's subsystems, reply with:
#syz set <Ref> subsystems: new-subsystem
You may send multiple commands in a single email message.
* Re: [PATCH net v2] net: airoha: Add retry mechanism to airoha_qdma_set_trtcm_param()
From: Lorenzo Bianconi @ 2026-06-23 20:32 UTC
To: Simon Horman
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, leto.liu,
linux-arm-kernel, linux-mediatek, netdev, brown.huang
In-Reply-To: <20260623105341.1076863-3-horms@kernel.org>
On Jun 23, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
>
> Simon says: I'd appreciate it if you could look over the first two issues
> flagged below in the context of this patch. The third and final issue
> looks like possible separate follow-up territory.
>
> ---
> net: airoha: Add retry mechanism to airoha_qdma_set_trtcm_param()
>
> This patch introduces a retry loop to airoha_qdma_set_trtcm_param() to mitigate
> bus timeouts when configuring QDMA parameters. It attempts to verify that the
> configuration is correctly applied to the hardware.
>
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > index 3370c3df7c107..bb5c0599a4eed 100644
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> > @@ -2673,14 +2673,30 @@ static int airoha_qdma_set_trtcm_param(struct airoha_qdma *qdma, int channel,
> > FIELD_PREP(TRTCM_METER_GROUP_MASK, group) |
> > FIELD_PREP(TRTCM_PARAM_INDEX_MASK, idx) |
> > FIELD_PREP(TRTCM_PARAM_RATE_TYPE_MASK, mode);
> > + int i;
> >
> > - airoha_qdma_wr(qdma, REG_TRTCM_DATA_LOW(addr), val);
> > - airoha_qdma_wr(qdma, REG_TRTCM_CFG_PARAM(addr), config);
> > + for (i = 0; i < 100; i++) {
> > + u32 data;
> >
> > - return read_poll_timeout(airoha_qdma_rr, val,
> > - val & TRTCM_PARAM_RW_DONE_MASK,
> > - USEC_PER_MSEC, 10 * USEC_PER_MSEC, true,
> > - qdma, REG_TRTCM_CFG_PARAM(addr));
> > + airoha_qdma_wr(qdma, REG_TRTCM_DATA_LOW(addr), val);
> > + wmb();
> > + airoha_qdma_wr(qdma, REG_TRTCM_CFG_PARAM(addr), config);
> > +
> > + if (read_poll_timeout(airoha_qdma_rr, data,
> > + data & TRTCM_PARAM_RW_DONE_MASK,
> > + USEC_PER_MSEC, 10 * USEC_PER_MSEC,
> > + true, qdma, REG_TRTCM_CFG_PARAM(addr)))
> > + return -ETIMEDOUT;
>
> [Severity: High]
> If a bus timeout causes the write phase's read_poll_timeout() in
> airoha_qdma_set_trtcm_param() to fail, this immediately returns -ETIMEDOUT,
> aborting the 100-iteration retry loop.
>
> Doesn't this defeat the mitigation?
I agree, we should just 'continue' here instead of returning -ETIMEDOUT.
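
Roughly, as a userspace sketch rather than the actual driver change: struct fake_hw and fake_write_and_poll() below are hypothetical stand-ins for the QDMA registers and for the write plus read_poll_timeout() step, and MAX_RETRIES mirrors the 100 iterations in the patch. The sketch only shows the control flow, where a poll timeout consumes one attempt instead of aborting the loop. It does not address whether re-issuing the command while the hardware is still busy is safe.

```c
#include <assert.h>
#include <errno.h>

#define MAX_RETRIES 100

/* Hypothetical simulated device: the first fail_polls polls time out. */
struct fake_hw {
	int fail_polls;		/* remaining polls that will time out */
	unsigned int reg;	/* value the "hardware" currently holds */
	int writes;		/* number of write attempts seen */
};

/* Stand-in for the register write plus read_poll_timeout(); 0 on success. */
static int fake_write_and_poll(struct fake_hw *hw, unsigned int val)
{
	hw->writes++;
	if (hw->fail_polls > 0) {
		hw->fail_polls--;
		return -ETIMEDOUT;
	}
	hw->reg = val;
	return 0;
}

/* Retry loop: a poll timeout uses up one attempt instead of returning. */
static int set_param_with_retry(struct fake_hw *hw, unsigned int val)
{
	int i;

	assert(hw);
	for (i = 0; i < MAX_RETRIES; i++) {
		if (fake_write_and_poll(hw, val))
			continue;	/* timed out: retry rather than return */

		if (hw->reg == val)	/* read-back verification */
			return 0;
	}

	return -EBUSY;
}
```

With this flow the caller only sees -EBUSY after all attempts are exhausted, so a single bus timeout no longer defeats the retry.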
>
> > +
> > + if (airoha_qdma_get_trtcm_param(qdma, channel, addr, param,
> > + mode, &data, NULL))
> > + continue;
>
> [Severity: High]
> If airoha_qdma_get_trtcm_param() times out during the verification phase, it
> indicates the hardware hasn't finished the read command (DONE bit is still 0).
>
> The continue statement then immediately starts the next iteration, writing new
> commands to DATA_LOW and CFG_PARAM.
>
> Can writing to command registers while the hardware state machine is actively
> busy violate hardware protocols and lead to unrecoverable lockups?
@Brown and @Leto: can you please comment on this case?
>
> > +
> > + if (data == val)
> > + return 0;
>
> [Severity: High]
> This isn't a bug introduced by this patch, but if the bus drops the command
> write to REG_TRTCM_CFG_PARAM, the hardware never executes the command and the
> TRTCM_PARAM_RW_DONE_MASK bit remains set from the previous operation.
>
> Consequently, read_poll_timeout() returns an immediate false success.
>
> If both the set and get command writes are dropped by the bus, the get
> operation will read back the newly written DATA_LOW register value.
>
> Will the data == val check evaluate to true and falsely report success without
> the hardware ever processing the configuration?
This seems like quite a 'pathological' scenario. Is it really an issue?
Regards,
Lorenzo
>
> > + }
> > +
> > + return -EBUSY;
> > }
> >
>
> [ ... ]
>
> > static int airoha_qdma_set_trtcm_config(struct airoha_qdma *qdma, int channel,
> > u32 addr, enum trtcm_mode_type mode,
> > bool enable, u32 enable_mask)
> > {
> > u32 val;
> >
> > if (airoha_qdma_get_trtcm_param(qdma, channel, addr, TRTCM_MISC_MODE,
> > mode, &val, NULL))
> > return -EINVAL;
>
> [Severity: High]
> Standalone calls to airoha_qdma_get_trtcm_param(), like this one in
> airoha_qdma_set_trtcm_config(), lack a retry loop entirely.
>
> Does this leave read-modify-write cycles fully vulnerable to the same bus
> timeouts the patch attempts to mitigate?
>
> [Severity: High]
> This isn't a bug introduced by this patch, but the patch mitigates bus
> timeouts in airoha_qdma_set_trtcm_param() while entirely missing
> airoha_qdma_set_rl_param().
>
> The latter performs the exact same write-and-poll sequence on identical QDMA
> configuration registers:
>
> drivers/net/ethernet/airoha/airoha_eth.c:airoha_qdma_set_rl_param() {
> ...
> return read_poll_timeout(..., val & RATE_LIMIT_PARAM_RW_DONE_MASK, ...);
> }
>
> If bus timeouts affect QDMA access broadly as stated in the commit message, are
> the rate-limiting configuration paths randomly vulnerable to the same failures?
>
> [Severity: High]
> This isn't a bug introduced by this patch, but in airoha_tc_remove_htb_queue(),
> hardware teardown incorrectly uses queue + 1 instead of queue:
>
> drivers/net/ethernet/airoha/airoha_eth.c:airoha_tc_remove_htb_queue() {
> ...
> airoha_qdma_set_tx_rate_limit(netdev, queue + 1, 0, 0);
> ...
> }
>
> Does this leave the removed queue's hardware limits permanently active
> (resource leak) and inadvertently disable the rate limits for queue + 1,
> corrupting the QoS state of another active channel?
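A minimal sketch of the 'continue' variant agreed to above, written
against a mock device rather than the real airoha_qdma_* accessors.
struct mock_hw, mock_write_and_poll() and set_param_with_retry() are
hypothetical names used only for illustration. The sketch does not
address the separate question of re-issuing a command while the
hardware is still busy.

```c
#include <errno.h>

/* Hypothetical mock device: the first 'fail_polls' commands time out. */
struct mock_hw {
	int fail_polls;
	unsigned int reg;
};

/* Stand-in for "write DATA_LOW, write CFG_PARAM, poll for DONE". */
static int mock_write_and_poll(struct mock_hw *hw, unsigned int val)
{
	if (hw->fail_polls > 0) {
		hw->fail_polls--;
		return -ETIMEDOUT;
	}
	hw->reg = val;
	return 0;
}

/* A poll timeout consumes one retry instead of aborting the whole loop. */
static int set_param_with_retry(struct mock_hw *hw, unsigned int val,
				int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (mock_write_and_poll(hw, val))
			continue;	/* retry rather than return -ETIMEDOUT */

		if (hw->reg == val)	/* read-back verification */
			return 0;
	}

	return -EBUSY;
}
```

With two failing polls and a budget of 100 tries the mock call succeeds
on the third attempt. If the budget is exhausted first, the caller sees
-EBUSY, matching the final return in the quoted patch.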
^ permalink raw reply
* Re: [PATCH net v2] ice: Fix use-after-scope in ice_sched_add_nodes_to_layer()
From: Tony Nguyen @ 2026-06-23 20:35 UTC (permalink / raw)
To: NeKon69, przemyslaw.kitszel
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, horms,
piotr.kwapulinski, intel-wired-lan, netdev, linux-kernel
In-Reply-To: <20260617072155.1172432-1-nobodqwe@gmail.com>
On 6/17/2026 12:21 AM, NeKon69 wrote:
> Commit 7fb09a737536 ("ice: Modify recursive way of adding nodes")
> changed ice_sched_add_nodes_to_layer() from recursive control flow to an
> iterative loop.
>
> Inside the loop, first_teid_ptr may be set to the address of a
> block-local variable:
>
> u32 temp;
> ...
> if (num_added)
> first_teid_ptr = &temp;
>
> On the next loop iteration, first_teid_ptr may be passed to
> ice_sched_add_nodes_to_hw_layer(), after temp from the previous
> iteration has gone out of scope.
>
> Instead of keeping temporary storage for later calls, allow
> first_node_teid to be NULL when the caller does not need the TEID.
>
> This was found by Clang with LifetimeSafety enabled while testing C
> language support on a Linux allmodconfig build.
>
> Fixes: 7fb09a737536 ("ice: Modify recursive way of adding nodes")
> Link: https://github.com/llvm/llvm-project/pull/203270
> Signed-off-by: NeKon69 <nobodqwe@gmail.com>
Hi,
The patch itself looks OK, but I believe the author/sign-off should be a
real name.
Thanks,
Tony
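For readers following along, a rough userspace sketch of the approach
the commit message describes: the iterative caller passes NULL once the
first TEID has been captured, so no pointer to a block-local variable
survives into the next iteration. hw_add_nodes() and
add_nodes_to_layer() are hypothetical stand-ins, not the ice driver
functions, and the TEID base of 100 is arbitrary.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware-layer helper: it reports the
 * first TEID only when the caller asks for it (non-NULL pointer). */
static int hw_add_nodes(uint32_t base_teid, uint16_t num_nodes,
			uint32_t *first_teid, uint16_t *num_added)
{
	if (first_teid)
		*first_teid = base_teid;
	*num_added = num_nodes;
	return 0;
}

/* Iterative caller: adds nodes in batches of at most max_per_call. */
static int add_nodes_to_layer(uint16_t num_nodes, uint16_t max_per_call,
			      uint32_t *first_node_teid)
{
	uint32_t *first_teid_ptr = first_node_teid;
	uint32_t next_teid = 100;	/* arbitrary illustrative base */

	if (!max_per_call)
		return -1;		/* avoid looping forever */

	while (num_nodes) {
		uint16_t batch = num_nodes < max_per_call ?
				 num_nodes : max_per_call;
		uint16_t num_added = 0;
		int err;

		err = hw_add_nodes(next_teid, batch, first_teid_ptr,
				   &num_added);
		if (err)
			return err;

		/* The first TEID is already stored; later calls get NULL
		 * instead of the address of a per-iteration temporary. */
		if (num_added)
			first_teid_ptr = NULL;

		next_teid += num_added;
		num_nodes -= num_added;
	}

	return 0;
}
```

Because the helper tolerates NULL, the loop needs no scratch variable
at all, which removes the lifetime question rather than working around
it.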
^ permalink raw reply
* Re: [PATCH net 0/2] tcp: make TCP-AO lookups more predictable
From: Dmitry Safonov @ 2026-06-23 20:35 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Neal Cardwell, Kuniyuki Iwashima, netdev, eric.dumazet
In-Reply-To: <CANn89iJL1sx4Jmb3P4cKiEqP_FmWQ3N1BwfH15LHwhFjUoCYsA@mail.gmail.com>
On Tue, 23 Jun 2026 at 06:25, Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Jun 22, 2026 at 6:13 PM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
[..]
> > What do you think?
>
> If intersecting keys are not yet allowed, I think we must return an
> error code at the insertion stage,
> instead of hoping the user will do "the right thing".
That is happening already: in the new selftest you used different
keyids, so that adds distinct keys and they both may be used/rotated
on the connection (and have to be copied to the established socket).
If you try adding two keys with different prefixes, but matching the
same peer ip (same keyids; available in the same VRFs) – second
setsockopt() will fail. There are tests for this in
setsockopt-closed.c under duplicate_tests().
Thanks,
Dmitry
^ permalink raw reply
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Jakub Sitnicki @ 2026-06-23 20:36 UTC (permalink / raw)
To: Amery Hung
Cc: Alexei Starovoitov, Kuniyuki Iwashima, bpf, Alexei Starovoitov,
Daniel Borkmann, Jakub Kicinski, Jiayuan Chen, John Fastabend,
Network Development, kernel-team
In-Reply-To: <CAMB2axMVhJJpP5HZtDFyQLLbKoRxhW08rj1zGRtWtgDkfYaVNA@mail.gmail.com>
On Tue, Jun 23, 2026 at 01:22 PM -07, Amery Hung wrote:
> On Tue, Jun 23, 2026 at 1:04 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
>> > On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>> >>
>> >> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >> >
>> >> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
>> >> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >> > >>
>> >> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
>> >> > >> completed all code paths related to sockmap-based redirects should be
>> >> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
>> >> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
>> >> > >> socket references would remain under BPF_SYSCALL.
>> >> > >>
>> >> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >> > >> ---
>> >> > >> Changes in v2:
>> >> > >> - Handle prot->recvmsg being NULL (Sashiko)
>> >> > >> - Elaborate on the end goal in description
>> >> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
>> >> > >> ---
>> >> > >> net/unix/af_unix.c | 4 ++--
>> >> > >> net/unix/unix_bpf.c | 6 ++++++
>> >> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
>> >> > >>
>> >> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> >> > >> index f7a9d55eee8a..84c11c60c75f 100644
>> >> > >> --- a/net/unix/af_unix.c
>> >> > >> +++ b/net/unix/af_unix.c
>> >> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
>> >> > >> #ifdef CONFIG_BPF_SYSCALL
>> >> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
>> >> > >>
>> >> > >> - if (prot != &unix_dgram_proto)
>> >> > >> + if (prot->recvmsg)
>> >> > >
>> >> > > There is no reason to have this dead branch when
>> >> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
>> >> > >
>> >> > > Let's compile out all sockmap code when both configs
>> >> > > are not enabled.
>> >> > >
>> >> > > Since AF_UNIX differs from TCP/UDP, it can take the
>> >> > > simpler approach.
>> >> >
>> >> > Okay, will put the whole file behind hidden config option like so:
>> >> >
>> >> > --- a/net/unix/Kconfig
>> >> > +++ b/net/unix/Kconfig
>> >> > @@ -30,3 +30,8 @@ config UNIX_DIAG
>> >> > help
>> >> > Support for UNIX socket monitoring interface used by the ss tool.
>> >> > If unsure, say Y.
>> >> > +
>> >> > +config UNIX_BPF
>> >>
>> >> Maybe UNIX_BPF_SOCKMAP or something.
>> >> bpf_iter is supported without this config.
>> >
>> > I don't like where it's going.
>> > I strongly dislike new config knobs.
>> > I'd rather remove existing knobs.
>> > What is the motivation?
>>
>> The goal is to compile out sockmap bits that use sk_msg.
>> NET_SOCK_MSG is natural, existing candidate.
>> New knob wasn't my idea.
>
> I'm also missing the big picture here.
>
> sockmap already holds socket references today. You can store and look
> up sockets without attaching any verdict/parser program, and no
> redirect happens. So if the goal is to use sockmap purely as a socket
> container without the sk_msg fast-path overhead, what does a
> compile-time NET_SOCK_MSG knob add over the runtime checks?
Sure, let me clarify. It's about the maintenance overhead.
sockmap-based redirects are a rather niche feature with few users, for
which we've been getting quite a few bug reports since AI came along.
We're not using it internally at Cloudflare, so I don't really have a
good reason to justify time spent on these bug reports.
Hence the move to put sockmap-based redirect behind a config option,
which you can enable at your own risk. Or which we can deprecate, but
that's not really my call.
> I am also not sure if NET_SOCK_MSG is right. It is broader than
> "sockmap redirect". It is selected by TLS and {INET,INET6}_ESPINTCP.
> Because those select it, it can't be toggled independently.
Once the sockmap redirect bits are behind _some_ config option, it will
be easy to replace it with a more granular one that depends on
NET_SOCK_MSG. But we're not there yet. One step at a time.
> Could you share the concrete use case you have in mind, and whether
> this came out of an earlier discussion or thread upstream?
This is a follow up from discussions at BPF summit with Alexei & John.
^ permalink raw reply
* Re: [PATCH net-next 0/3] selftests/xsk: stabilize timeout test behavior
From: Jakub Kicinski @ 2026-06-23 20:36 UTC (permalink / raw)
To: Maciej Fijalkowski
Cc: Jason Xing, Tushar Vyavahare, netdev, magnus.karlsson, stfomichev,
kernelxing, davem, pabeni, ast, daniel, tirthendu.sarkar, bpf
In-Reply-To: <ajqfDYznpCU18C2P@boxer>
On Tue, 23 Jun 2026 16:58:21 +0200 Maciej Fijalkowski wrote:
> On Tue, Jun 23, 2026 at 11:02:48AM +0200, Maciej Fijalkowski wrote:
> > last refactor from Tushar broke BIDIRECTIONAL test case when HW is test
> > target, but not on veth, so let me test these changes locally and then get
> > back to you.
> >
> > BPF CI runs xskxceiver on veth so this has not been caught. Seems my/our
> > focus should be to enable xskxceiver HW tests on any kind of
> > environment/infrastructure.
> >
> > Gonna get back to you by the EOD.
> > Maciej
>
> Ah I replied on other thread I guess, so let me repeat:
>
> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
> Tested-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Great, thanks!
^ permalink raw reply
* Re: s2io: driver still in use - please reconsider removal
From: David Laight @ 2026-06-23 20:40 UTC (permalink / raw)
To: Michael Pratte
Cc: Jakub Kicinski, Paolo Abeni, Eric Dumazet, Ethan Nelson-Moore,
Andrew Lunn, Simon Horman, David S . Miller, netdev
In-Reply-To: <20260623112133.752195-1-slatoncomputers@gmail.com>
On Tue, 23 Jun 2026 06:21:33 -0500
Michael Pratte <slatoncomputers@gmail.com> wrote:
> Hi,
>
> Commit aba0138eb7d7 ("net: ethernet: neterion: s2io: remove unused
> driver") removed s2io in v7.0 as "highly unlikely to still be used."
> It is still in use here: an Exar Xframe-II (PCI 17d5:5832) in a
> Supermicro X5DA8.
Are you really using a dual-socket NetBurst P4 system?
They weren't really any good when new :-)
David
^ permalink raw reply
* Re: [PATCH net] selftests: drv-net: so_txtime: relax variance bounds
From: patchwork-bot+netdevbpf @ 2026-06-23 20:40 UTC (permalink / raw)
To: Willem de Bruijn; +Cc: netdev, davem, kuba, edumazet, pabeni, horms, willemb
In-Reply-To: <20260621200137.1564776-1-willemdebruijn.kernel@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Sun, 21 Jun 2026 16:01:18 -0400 you wrote:
> From: Willem de Bruijn <willemb@google.com>
>
> The net-next-hw spinners on netdev.bots.linux.dev observe failing
> so-txtime-py tests. A review of stdout shows most failures to be
> due to exceeding the 4ms grace period. All I saw were within 8ms.
> So increase to that.
>
> [...]
Here is the summary with links:
- [net] selftests: drv-net: so_txtime: relax variance bounds
https://git.kernel.org/netdev/net/c/e38fec239d92
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH v3 net] net: airoha: Fix TX scheduler queue mask loop upper bound
From: patchwork-bot+netdevbpf @ 2026-06-23 20:40 UTC (permalink / raw)
To: Wayen Yan
Cc: netdev, lorenzo, horms, pabeni, kuba, edumazet, andrew+netdev,
angelogioacchino.delregno, matthias.bgg, linux-arm-kernel,
linux-mediatek
In-Reply-To: <178187479434.2400840.1312143943526335838@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 19 Jun 2026 21:12:06 +0800 you wrote:
> In airoha_qdma_set_chan_tx_sched(), the loop clearing queue mask was
> using AIROHA_NUM_TX_RING (32) instead of AIROHA_NUM_QOS_QUEUES (8).
>
> Each channel has 8 queues, and TXQ_DISABLE_CHAN_QUEUE_MASK(channel, i)
> computes BIT(i + (channel * 8)). With i ranging 0..31, this causes:
> - channel 0: clears bit 0..31 (all 4 channels) instead of 0..7
> - channel 1: clears bit 8..31 (channels 1-3) instead of 8..15
> - channel 2: clears bit 16..31 (channels 2-3) instead of 16..23
> - channel 3: clears bit 24..31 (channel 3 only) - correct by accident
>
> [...]
Here is the summary with links:
- [v3,net] net: airoha: Fix TX scheduler queue mask loop upper bound
https://git.kernel.org/netdev/net/c/245043dfc210
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
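The per-channel arithmetic in the quoted commit message can be checked
with a small userspace sketch. queue_bit() and chan_queue_mask() are
hypothetical helpers that mirror the BIT(i + (channel * 8)) expression.
Dropping shifts of 32 or more is an assumption made here to model a
32-bit register without an undefined shift.

```c
#include <stdint.h>

#define NUM_QOS_QUEUES	8	/* queues per channel, per the commit message */

/* Mirrors BIT(i + (channel * 8)) for a 32-bit register. */
static uint32_t queue_bit(unsigned int channel, unsigned int i)
{
	unsigned int shift = i + channel * NUM_QOS_QUEUES;

	/* Assumption: bits beyond the register width are simply lost. */
	return shift < 32 ? (uint32_t)1 << shift : 0;
}

/* Accumulates the bits a clearing loop with the given bound would touch. */
static uint32_t chan_queue_mask(unsigned int channel, unsigned int bound)
{
	uint32_t mask = 0;

	for (unsigned int i = 0; i < bound; i++)
		mask |= queue_bit(channel, i);

	return mask;
}
```

With a bound of 32, channel 1 yields 0xffffff00 (bits 8..31), while a
bound of 8 yields 0x0000ff00 (bits 8..15), which is the range the
commit message says each channel should own.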
^ permalink raw reply
* Re: [PATCH net] net, bpf: check master for NULL in xdp_master_redirect()
From: Jakub Kicinski @ 2026-06-23 20:43 UTC (permalink / raw)
To: Ido Schimmel
Cc: Xiang Mei, Jiayuan Chen, Daniel Borkmann, Martin KaFai Lau,
Jesper Dangaard Brouer, netdev, bpf, John Fastabend,
Stanislav Fomichev, Alexei Starovoitov, Jussi Maki, Paolo Abeni,
Weiming Shi, Ido Schimmel, David Ahern
In-Reply-To: <20260623065218.GA378121@shredder>
On Tue, 23 Jun 2026 09:52:18 +0300 Ido Schimmel wrote:
> On Mon, Jun 22, 2026 at 04:34:06PM -0700, Xiang Mei wrote:
> > On Mon, Jun 22, 2026 at 3:58 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > Can you double-confirm that this triggers on current HEAD
> > > of linux/master ? I thought commit 2674d603a9e6 ("vrf: Fix a potential
> > > NPD when removing a port from a VRF") was supposed to prevent all the
> > > torn master fetches. Adding VRF folks to CC.
> >
> > Yes.
> >
> > We have triggered the crash on 56abdaebbf0da304b860bed1f2b5a85f5a6a16a0,
> > which is the latest for net.git, and 2674d603a9e6 was applied. We can
> > still trigger the crash:
>
> 2674d603a9e6 was only for VRF ports, so it doesn't help with this case
> (bond port). Also, the problem that 2674d603a9e6 fixed is a bit
> different. We had a NULL check after netdev_master_upper_dev_get_rcu(),
> but the issue was that this master device was not necessarily a VRF
> master.
Ugh, sorry, my bad. Poor pattern matching of the bugs..
^ permalink raw reply
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Amery Hung @ 2026-06-23 20:44 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: Alexei Starovoitov, Kuniyuki Iwashima, bpf, Alexei Starovoitov,
Daniel Borkmann, Jakub Kicinski, Jiayuan Chen, John Fastabend,
Network Development, kernel-team
In-Reply-To: <878q85yoy5.fsf@cloudflare.com>
On Tue, Jun 23, 2026 at 1:36 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 01:22 PM -07, Amery Hung wrote:
> > On Tue, Jun 23, 2026 at 1:04 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
> >> > On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >> >>
> >> >> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> >> >
> >> >> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> >> >> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> >> > >>
> >> >> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> >> >> > >> completed all code paths related to sockmap-based redirects should be
> >> >> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> >> >> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> >> >> > >> socket references would remain under BPF_SYSCALL.
> >> >> > >>
> >> >> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> >> > >> ---
> >> >> > >> Changes in v2:
> >> >> > >> - Handle prot->recvmsg being NULL (Sashiko)
> >> >> > >> - Elaborate on the end goal in description
> >> >> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> >> >> > >> ---
> >> >> > >> net/unix/af_unix.c | 4 ++--
> >> >> > >> net/unix/unix_bpf.c | 6 ++++++
> >> >> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
> >> >> > >>
> >> >> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >> >> > >> index f7a9d55eee8a..84c11c60c75f 100644
> >> >> > >> --- a/net/unix/af_unix.c
> >> >> > >> +++ b/net/unix/af_unix.c
> >> >> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> >> >> > >> #ifdef CONFIG_BPF_SYSCALL
> >> >> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
> >> >> > >>
> >> >> > >> - if (prot != &unix_dgram_proto)
> >> >> > >> + if (prot->recvmsg)
> >> >> > >
> >> >> > > There is no reason to have this dead branch when
> >> >> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> >> >> > >
> >> >> > > Let's compile out all sockmap code when both configs
> >> >> > > are not enabled.
> >> >> > >
> >> >> > > Since AF_UNIX differs from TCP/UDP, it can take the
> >> >> > > simpler approach.
> >> >> >
> >> >> > Okay, will put the whole file behind hidden config option like so:
> >> >> >
> >> >> > --- a/net/unix/Kconfig
> >> >> > +++ b/net/unix/Kconfig
> >> >> > @@ -30,3 +30,8 @@ config UNIX_DIAG
> >> >> > help
> >> >> > Support for UNIX socket monitoring interface used by the ss tool.
> >> >> > If unsure, say Y.
> >> >> > +
> >> >> > +config UNIX_BPF
> >> >>
> >> >> Maybe UNIX_BPF_SOCKMAP or something.
> >> >> bpf_iter is supported without this config.
> >> >
> >> > I don't like where it's going.
> >> > I strongly dislike new config knobs.
> >> > I'd rather remove existing knobs.
> >> > What is the motivation?
> >>
> >> The goal is to compile out sockmap bits that use sk_msg.
> >> NET_SOCK_MSG is natural, existing candidate.
> >> New knob wasn't my idea.
> >
> > I'm also missing the big picture here.
> >
> > sockmap already holds socket references today. You can store and look
> > up sockets without attaching any verdict/parser program, and no
> > redirect happens. So if the goal is to use sockmap purely as a socket
> > container without the sk_msg fast-path overhead, what does a
> > compile-time NET_SOCK_MSG knob add over the runtime checks?
>
> Sure, let me clarify. It's about the maintenance overhead.
>
> sockmap-based redirects are a rather niche feature with few users, for
> which we've been getting quite a few bug reports since AI came along.
>
> We're not using it internally at Cloudflare, so I don't really have a
> good reason to justify time spent on these bug reports.
>
> Hence the move to put sockmap-based redirect behind a config option,
> which you can enable at your own risk. Or which we can deprecate, but
> that's not really my call.
>
> > I am also not sure if NET_SOCK_MSG is right. It is broader than
> > "sockmap redirect". It is selected by TLS and {INET,INET6}_ESPINTCP.
> > Because those select it, it can't be toggled independently.
>
> Once the sockmap redirect bits are behind _some_ config option, it will
> be easy to replace it with a more granular one that depends on
> NET_SOCK_MSG. But we're not there yet. One step at a time.
>
> > Could you share the concrete use case you have in mind, and whether
> > this came out of an earlier discussion or thread upstream?
>
> This is a follow up from discussions at BPF summit with Alexei & John.
I see. Thanks for explaining the motivation.
^ permalink raw reply
* Re: [PATCH net-next 0/3] selftests/xsk: stabilize timeout test behavior
From: patchwork-bot+netdevbpf @ 2026-06-23 20:50 UTC (permalink / raw)
To: Tushar Vyavahare
Cc: netdev, magnus.karlsson, maciej.fijalkowski, stfomichev,
kernelxing, davem, kuba, pabeni, ast, daniel, tirthendu.sarkar,
bpf
In-Reply-To: <20260616154955.1492560-1-tushar.vyavahare@intel.com>
Hello:
This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Tue, 16 Jun 2026 21:19:51 +0530 you wrote:
> This series improves AF_XDP selftests by making timeout handling
> explicit and fixing sources of non-determinism in xsk timeout tests.
>
> Patch 1 introduces test_spec::poll_tmout and removes implicit
> dependence on RX UMEM setup state for timeout behavior.
>
> Patch 2 fixes thread harness sequencing by attaching XDP programs
> before worker startup, removing signal-based termination, and using
> barrier synchronization only for dual-thread runs.
>
> [...]
Here is the summary with links:
- [net-next,1/3] selftests/xsk: make poll timeout mode explicit
https://git.kernel.org/netdev/net/c/b56cded13137
- [net-next,2/3] selftests/xsk: fix timeout thread harness sequencing
https://git.kernel.org/netdev/net/c/483c1405f817
- [net-next,3/3] selftests/xsk: restore shared_umem after POLL_TXQ_FULL
https://git.kernel.org/netdev/net/c/ea4e9c9d8b2b
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH] net: ipa: fix SMEM state handle leaks in SMP2P init
From: Jakub Kicinski @ 2026-06-23 21:06 UTC (permalink / raw)
To: Haoxiang Li
Cc: Alex Elder, elder, andrew+netdev, davem, edumazet, pabeni, netdev,
linux-kernel, stable
In-Reply-To: <526c68fd-684d-4593-8c6a-e08aafdada5d@ieee.org>
On Tue, 23 Jun 2026 10:53:49 -0500 Alex Elder wrote:
> So I guess they were never "put" before?
>
> This looks OK, but I'll just mention that the IPA code
> doesn't use devm_*() (managed) interfaces. So it would
> be more consistent to just call qcom_smem_state_put()
> at the end of ipa_smp2p_exit() for both ipa->enabled_state
> and ipa->valid_state.
Let's do that instead. The devm_ APIs prevent about as many bugs
as they cause.
^ permalink raw reply
* Re: [PATCH net] net, bpf: check master for NULL in xdp_master_redirect()
From: patchwork-bot+netdevbpf @ 2026-06-23 21:10 UTC (permalink / raw)
To: Xiang Mei
Cc: daniel, martin.lau, hawk, jiayuan.chen, netdev, bpf,
john.fastabend, sdf, ast, joamaki, pabeni, bestswngs
In-Reply-To: <20260620201531.180123-1-xmei5@asu.edu>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Sat, 20 Jun 2026 13:15:31 -0700 you wrote:
> xdp_master_redirect() dereferences the result of
> netdev_master_upper_dev_get_rcu() without a NULL check, but that helper
> returns NULL when the receiving device has no upper-master adjacency.
>
> The reach guard only checks netif_is_bond_slave(). On bond slave release
> bond_upper_dev_unlink() drops the upper-master adjacency before clearing
> IFF_SLAVE, so an XDP_TX reaching xdp_master_redirect() in that window
> still passes netif_is_bond_slave() while master is already NULL, and
> faults on master->flags at offset 0xb0:
>
> [...]
Here is the summary with links:
- [net] net, bpf: check master for NULL in xdp_master_redirect()
https://git.kernel.org/netdev/net/c/e82d8cc4321c
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: Emil Tsalapatis @ 2026-06-23 21:19 UTC (permalink / raw)
To: Michal Luczaj, John Fastabend, Jakub Sitnicki, Jiayuan Chen,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Alexei Starovoitov, Cong Wang, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Martin KaFai Lau, Song Liu, Yonghong Song, Jiri Olsa,
Emil Tsalapatis, Shuah Khan
Cc: netdev, bpf, linux-kernel, linux-kselftest
In-Reply-To: <20260623-sockmap-lookup-udp-leak-v1-1-05804f9308e4@rbox.co>
On Tue Jun 23, 2026 at 2:03 PM EDT, Michal Luczaj wrote:
> UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
> sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.
>
> Because sockmap accepts unbound UDP sockets, a BPF program can increment a
> socket's refcount via lookup. If the socket is subsequently bound, the
> transition from unbound to bound causes bpf_sk_release() to skip the
> decrement of the refcount, causing a memory leak.
>
> unreferenced object 0xffff88810bc2eb40 (size 1984):
> comm "test_progs", pid 2451, jiffies 4295320596
> hex dump (first 32 bytes):
> 7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00 ................
> 02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
> backtrace (crc bdee079d):
> kmem_cache_alloc_noprof+0x557/0x660
> sk_prot_alloc+0x69/0x240
> sk_alloc+0x30/0x460
> inet_create+0x2ce/0xf80
> __sock_create+0x25b/0x5c0
> __sys_socket+0x119/0x1d0
> __x64_sys_socket+0x72/0xd0
> do_syscall_64+0xa1/0x5f0
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Maintain balanced refcounts across sk lookup/release: (re-)set
> SOCK_RCU_FREE on proto update to treat the socket (whether bound or
> unbound) as not requiring a refcount increment on (a RCU protected) lookup.
>
> Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> ---
> Note: this issue is related to commit 67312adc96b5 ("bpf: reject unhashed
> sockets in bpf_sk_assign").
> ---
> net/ipv4/udp_bpf.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
> index ad57c4c9eaab..970327b59582 100644
> --- a/net/ipv4/udp_bpf.c
> +++ b/net/ipv4/udp_bpf.c
> @@ -173,6 +173,9 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
> if (sk->sk_family == AF_INET6)
> udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
>
> + /* Treat all sockets as non-refcounted, regardless of binding state. */
> + sock_set_flag(sk, SOCK_RCU_FREE);
> +
> sock_replace_proto(sk, &udp_bpf_prots[family]);
> return 0;
> }
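The imbalance described in the commit message can be modelled with a
toy socket in userspace. Everything below (struct toy_sock and the
toy_* helpers) is hypothetical and only mimics the rule that lookup
and release each consult the same "is refcounted" predicate. It is a
sketch of the reasoning, not of the kernel code.

```c
#include <stdbool.h>

/* Toy model: rcu_free plays the role of SOCK_RCU_FREE. */
struct toy_sock {
	int refcnt;
	bool rcu_free;
};

static bool toy_is_refcounted(const struct toy_sock *sk)
{
	return !sk->rcu_free;
}

/* Lookup takes a reference only for refcounted sockets. */
static void toy_lookup(struct toy_sock *sk)
{
	if (toy_is_refcounted(sk))
		sk->refcnt++;
}

/* Release drops a reference only for refcounted sockets. */
static void toy_release(struct toy_sock *sk)
{
	if (toy_is_refcounted(sk))
		sk->refcnt--;
}

/* Binding flips the socket to RCU-freed, as described for UDP. */
static void toy_bind(struct toy_sock *sk)
{
	sk->rcu_free = true;
}

/* Returns the number of references leaked by lookup -> bind -> release. */
static int lookup_bind_release(bool set_rcu_free_on_map_update)
{
	struct toy_sock sk = { .refcnt = 1, .rcu_free = false };

	if (set_rcu_free_on_map_update)
		sk.rcu_free = true;	/* models the flag set by the patch */

	toy_lookup(&sk);
	toy_bind(&sk);
	toy_release(&sk);

	return sk.refcnt - 1;
}
```

Without the flag set at map-update time the model leaks one reference.
With it, the predicate gives the same answer on both sides of the bind
and the count stays balanced.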
^ permalink raw reply
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Alexei Starovoitov @ 2026-06-23 21:26 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: Amery Hung, Kuniyuki Iwashima, bpf, Alexei Starovoitov,
Daniel Borkmann, Jakub Kicinski, Jiayuan Chen, John Fastabend,
Network Development, kernel-team
In-Reply-To: <878q85yoy5.fsf@cloudflare.com>
On Tue, Jun 23, 2026 at 1:36 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 01:22 PM -07, Amery Hung wrote:
> > On Tue, Jun 23, 2026 at 1:04 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
> >> > On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >> >>
> >> >> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> >> >
> >> >> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> >> >> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> >> > >>
> >> >> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> >> >> > >> completed all code paths related to sockmap-based redirects should be
> >> >> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> >> >> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> >> >> > >> socket references would remain under BPF_SYSCALL.
> >> >> > >>
> >> >> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> >> > >> ---
> >> >> > >> Changes in v2:
> >> >> > >> - Handle prot->recvmsg being NULL (Sashiko)
> >> >> > >> - Elaborate on the end goal in description
> >> >> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> >> >> > >> ---
> >> >> > >> net/unix/af_unix.c | 4 ++--
> >> >> > >> net/unix/unix_bpf.c | 6 ++++++
> >> >> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
> >> >> > >>
> >> >> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >> >> > >> index f7a9d55eee8a..84c11c60c75f 100644
> >> >> > >> --- a/net/unix/af_unix.c
> >> >> > >> +++ b/net/unix/af_unix.c
> >> >> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> >> >> > >> #ifdef CONFIG_BPF_SYSCALL
> >> >> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
> >> >> > >>
> >> >> > >> - if (prot != &unix_dgram_proto)
> >> >> > >> + if (prot->recvmsg)
> >> >> > >
> >> >> > > There is no reason to have this dead branch when
> >> >> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> >> >> > >
> >> >> > > Let's compile out all sockmap code when both configs
> >> >> > > are not enabled.
> >> >> > >
> >> >> > > Since AF_UNIX differs from TCP/UDP, it can take the
> >> >> > > simpler approach.
> >> >> >
> >> >> > Okay, will put the whole file behind hidden config option like so:
> >> >> >
> >> >> > --- a/net/unix/Kconfig
> >> >> > +++ b/net/unix/Kconfig
> >> >> > @@ -30,3 +30,8 @@ config UNIX_DIAG
> >> >> > help
> >> >> > Support for UNIX socket monitoring interface used by the ss tool.
> >> >> > If unsure, say Y.
> >> >> > +
> >> >> > +config UNIX_BPF
> >> >>
> >> >> Maybe UNIX_BPF_SOCKMAP or something.
> >> >> bpf_iter is supported without this config.
> >> >
> >> > I don't like where it's going.
> >> > I strongly dislike new config knobs.
> >> > I'd rather remove existing knobs.
> >> > What is the motivation?
> >>
> >> The goal is to compile out sockmap bits that use sk_msg.
> >> NET_SOCK_MSG is natural, existing candidate.
> >> New knob wasn't my idea.
> >
> > I'm also missing the big picture here.
> >
> > sockmap already holds socket references today. You can store and look
> > up sockets without attaching any verdict/parser program, and no
> > redirect happens. So if the goal is to use sockmap purely as a socket
> > container without the sk_msg fast-path overhead, what does a
> > compile-time NET_SOCK_MSG knob add over the runtime checks?
>
> Sure, let me clarify. It's about the maintenance overhead.
>
> sockmap-based redirects are a rather niche feature with few users, for
> which we've been getting quite a few bug reports since AI came along.
>
> We're not using it internally at Cloudflare, so I don't really have a
> good reason to justify time spent on these bug reports.
>
> Hence the move to put sockmap-based redirect behind a config option,
> which you can enable at your own risk. Or which we can deprecate, but
> that's not really my call.
This is wishful thinking that a config knob will stop
the bug reports.
Just disable it for real instead.
> > I am also not sure if NET_SOCK_MSG is right. It is broader than
> > "sockmap redirect". It is selected by TLS and {INET,INET6}_ESPINTCP.
> > Because those select it, it can't be toggled independently.
>
> Once the sockmap redirect bits are behind _some_ config option, it will
> be easy to replace it with a more granular one that depends on
> NET_SOCK_MSG. But we're not there yet. One step at a time.
No. That's not workable.
> > Could you share the concrete use case you have in mind, and whether
> > this came out of an earlier discussion or thread upstream?
>
> This is a follow-up from discussions at BPF summit with Alexei & John.
Not quite. The discussion was to disable pieces of sockmap
that are causing trouble.
Not to move them under config knobs, but disable them.
^ permalink raw reply
* Re: [PATCH net v2] net: dsa: sja1105: round up PTP perout pin duration
From: patchwork-bot+netdevbpf @ 2026-06-23 21:50 UTC (permalink / raw)
To: Aleksandrova Alyona
Cc: olteanv, andrew, f.fainelli, davem, edumazet, kuba, pabeni,
richardcochran, linux-kernel, netdev, lvc-project
In-Reply-To: <20260618110508.53094-1-aga@itb.spb.ru>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 18 Jun 2026 14:05:08 +0300 you wrote:
> pin_duration is converted from the user-provided period to SJA1105
> clock ticks and is later passed as the cycle_time argument to
> future_base_time().
>
> Very small period values may become zero after the conversion,
> which can lead to a division by zero in future_base_time().
>
> [...]
Here is the summary with links:
- [net,v2] net: dsa: sja1105: round up PTP perout pin duration
https://git.kernel.org/netdev/net/c/aee5836273b0
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net] net: do not acquire dev->tx_global_lock in netdev_watchdog_up()
From: patchwork-bot+netdevbpf @ 2026-06-23 21:50 UTC (permalink / raw)
To: Eric Dumazet
Cc: davem, kuba, pabeni, horms, netdev, eric.dumazet, m.szyprowski
In-Reply-To: <20260622110108.69541-1-edumazet@google.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 22 Jun 2026 11:01:08 +0000 you wrote:
> Marek Szyprowski reported a deadlock during system resume when virtio_net
> driver is used.
>
> The deadlock occurs because netif_device_attach() is called while holding
> dev->tx_global_lock (via netif_tx_lock_bh() in virtnet_restore_up()).
> netif_device_attach() calls __netdev_watchdog_up(), which now also tries
> to acquire dev->tx_global_lock to synchronize with dev_watchdog().
>
> [...]
Here is the summary with links:
- [net] net: do not acquire dev->tx_global_lock in netdev_watchdog_up()
https://git.kernel.org/netdev/net/c/d09a78a2a469
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply