* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Jakub Sitnicki @ 2026-06-23 20:36 UTC
To: Amery Hung
Cc: Alexei Starovoitov, Kuniyuki Iwashima, bpf, Alexei Starovoitov,
Daniel Borkmann, Jakub Kicinski, Jiayuan Chen, John Fastabend,
Network Development, kernel-team
In-Reply-To: <CAMB2axMVhJJpP5HZtDFyQLLbKoRxhW08rj1zGRtWtgDkfYaVNA@mail.gmail.com>
On Tue, Jun 23, 2026 at 01:22 PM -07, Amery Hung wrote:
> On Tue, Jun 23, 2026 at 1:04 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
>> > On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>> >>
>> >> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >> >
>> >> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
>> >> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >> > >>
>> >> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
>> >> > >> completed all code paths related to sockmap-based redirects should be
>> >> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
>> >> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
>> >> > >> socket references would remain under BPF_SYSCALL.
>> >> > >>
>> >> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> >> > >> ---
>> >> > >> Changes in v2:
>> >> > >> - Handle prot->recvmsg being NULL (Sashiko)
>> >> > >> - Elaborate on the end goal in description
>> >> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
>> >> > >> ---
>> >> > >> net/unix/af_unix.c | 4 ++--
>> >> > >> net/unix/unix_bpf.c | 6 ++++++
>> >> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
>> >> > >>
>> >> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> >> > >> index f7a9d55eee8a..84c11c60c75f 100644
>> >> > >> --- a/net/unix/af_unix.c
>> >> > >> +++ b/net/unix/af_unix.c
>> >> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
>> >> > >> #ifdef CONFIG_BPF_SYSCALL
>> >> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
>> >> > >>
>> >> > >> - if (prot != &unix_dgram_proto)
>> >> > >> + if (prot->recvmsg)
>> >> > >
>> >> > > There is no reason to have this dead branch when
>> >> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
>> >> > >
>> >> > > Let's compile out all sockmap code when both configs
>> >> > > are not enabled.
>> >> > >
>> >> > > Since AF_UNIX differs from TCP/UDP, it can take the
>> >> > > simpler approach.
>> >> >
>> >> > Okay, will put the whole file behind hidden config option like so:
>> >> >
>> >> > --- a/net/unix/Kconfig
>> >> > +++ b/net/unix/Kconfig
>> >> > @@ -30,3 +30,8 @@ config UNIX_DIAG
>> >> > help
>> >> > Support for UNIX socket monitoring interface used by the ss tool.
>> >> > If unsure, say Y.
>> >> > +
>> >> > +config UNIX_BPF
>> >>
>> >> Maybe UNIX_BPF_SOCKMAP or something.
>> >> bpf_iter is supported without this config.
>> >
>> > I don't like where it's going.
>> > I strongly dislike new config knobs.
>> > I'd rather remove existing knobs.
>> > What is the motivation?
>>
>> The goal is to compile out sockmap bits that use sk_msg.
>> NET_SOCK_MSG is a natural, existing candidate.
>> New knob wasn't my idea.
>
> I'm also missing the big picture here.
>
> sockmap already holds socket references today. You can store and look
> up sockets without attaching any verdict/parser program, and no
> redirect happens. So if the goal is to use sockmap purely as a socket
> container without the sk_msg fast-path overhead, what does a
> compile-time NET_SOCK_MSG knob add over the runtime checks?
Sure, let me clarify. It's about the maintenance overhead.
sockmap-based redirects are a rather niche feature with few users, for
which we've been getting quite a few bug reports since AI came along.
We're not using it internally at Cloudflare, so I don't really have a
good reason to justify time spent on these bug reports.
Hence the move to put sockmap-based redirect behind a config option,
which you can enable at your own risk. Or which we can deprecate, but
that's not really my call.
> I am also not sure if NET_SOCK_MSG is right. It is broader than
> "sockmap redirect". It is selected by TLS and {INET,INET6}_ESPINTCP.
> Because those select it, it can't be toggled independently.
Once the sockmap redirect bits are behind _some_ config option, it will
be easy to replace it with a more granular one that depends on
NET_SOCK_MSG. But we're not there yet. One step at a time.
> Could you share the concrete use case you have in mind, and whether
> this came out of an earlier discussion or thread upstream?
This is a follow-up from discussions at the BPF summit with Alexei & John.
* Re: [PATCH net 0/2] tcp: make TCP-AO lookups more predictable
From: Dmitry Safonov @ 2026-06-23 20:35 UTC
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Neal Cardwell, Kuniyuki Iwashima, netdev, eric.dumazet
In-Reply-To: <CANn89iJL1sx4Jmb3P4cKiEqP_FmWQ3N1BwfH15LHwhFjUoCYsA@mail.gmail.com>
On Tue, 23 Jun 2026 at 06:25, Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Jun 22, 2026 at 6:13 PM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
[..]
> > What do you think?
>
> If intersecting keys are not yet allowed, I think we must return an
> error code at the insertion stage,
> instead of hoping the user will do "the right thing".
That is happening already: in the new selftest you used different
keyids, so that adds distinct keys, and both may be used/rotated
on the connection (and have to be copied to the established socket).
If you try adding two keys with different prefixes that match the
same peer IP (same keyids; available in the same VRFs), the second
setsockopt() will fail. There are tests for this in
setsockopt-closed.c under duplicate_tests().
Thanks,
Dmitry
* Re: [PATCH net v2] ice: Fix use-after-scope in ice_sched_add_nodes_to_layer()
From: Tony Nguyen @ 2026-06-23 20:35 UTC
To: NeKon69, przemyslaw.kitszel
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, horms,
piotr.kwapulinski, intel-wired-lan, netdev, linux-kernel
In-Reply-To: <20260617072155.1172432-1-nobodqwe@gmail.com>
On 6/17/2026 12:21 AM, NeKon69 wrote:
> Commit 7fb09a737536 ("ice: Modify recursive way of adding nodes")
> changed ice_sched_add_nodes_to_layer() from recursive control flow to an
> iterative loop.
>
> Inside the loop, first_teid_ptr may be set to the address of a
> block-local variable:
>
> u32 temp;
> ...
> if (num_added)
> first_teid_ptr = &temp;
>
> On the next loop iteration, first_teid_ptr may be passed to
> ice_sched_add_nodes_to_hw_layer(), after temp from the previous
> iteration has gone out of scope.
>
> Instead of keeping temporary storage for later calls, allow
> first_node_teid to be NULL when the caller does not need the TEID.
>
> This was found by Clang with LifetimeSafety enabled while testing C
> language support on a Linux allmodconfig build.
>
> Fixes: 7fb09a737536 ("ice: Modify recursive way of adding nodes")
> Link: https://github.com/llvm/llvm-project/pull/203270
> Signed-off-by: NeKon69 <nobodqwe@gmail.com>
Hi,
The patch itself looks ok but I believe author/sign-off should be an
actual name.
Thanks,
Tony
* Re: [PATCH net v2] net: airoha: Add retry mechanism to airoha_qdma_set_trtcm_param()
From: Lorenzo Bianconi @ 2026-06-23 20:32 UTC
To: Simon Horman
Cc: andrew+netdev, davem, edumazet, kuba, pabeni, leto.liu,
linux-arm-kernel, linux-mediatek, netdev, brown.huang
In-Reply-To: <20260623105341.1076863-3-horms@kernel.org>
On Jun 23, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
>
> Simon says: I'd appreciate it if you could look over the first two issues
> flagged below in the context of this patch. While the third
> and final issue looks like it is in possible separate follow-up
> territory.
>
> ---
> net: airoha: Add retry mechanism to airoha_qdma_set_trtcm_param()
>
> This patch introduces a retry loop to airoha_qdma_set_trtcm_param() to mitigate
> bus timeouts when configuring QDMA parameters. It attempts to verify that the
> configuration is correctly applied to the hardware.
>
> > diff --git a/drivers/net/ethernet/airoha/airoha_eth.c b/drivers/net/ethernet/airoha/airoha_eth.c
> > index 3370c3df7c107..bb5c0599a4eed 100644
> > --- a/drivers/net/ethernet/airoha/airoha_eth.c
> > +++ b/drivers/net/ethernet/airoha/airoha_eth.c
> > @@ -2673,14 +2673,30 @@ static int airoha_qdma_set_trtcm_param(struct airoha_qdma *qdma, int channel,
> > FIELD_PREP(TRTCM_METER_GROUP_MASK, group) |
> > FIELD_PREP(TRTCM_PARAM_INDEX_MASK, idx) |
> > FIELD_PREP(TRTCM_PARAM_RATE_TYPE_MASK, mode);
> > + int i;
> >
> > - airoha_qdma_wr(qdma, REG_TRTCM_DATA_LOW(addr), val);
> > - airoha_qdma_wr(qdma, REG_TRTCM_CFG_PARAM(addr), config);
> > + for (i = 0; i < 100; i++) {
> > + u32 data;
> >
> > - return read_poll_timeout(airoha_qdma_rr, val,
> > - val & TRTCM_PARAM_RW_DONE_MASK,
> > - USEC_PER_MSEC, 10 * USEC_PER_MSEC, true,
> > - qdma, REG_TRTCM_CFG_PARAM(addr));
> > + airoha_qdma_wr(qdma, REG_TRTCM_DATA_LOW(addr), val);
> > + wmb();
> > + airoha_qdma_wr(qdma, REG_TRTCM_CFG_PARAM(addr), config);
> > +
> > + if (read_poll_timeout(airoha_qdma_rr, data,
> > + data & TRTCM_PARAM_RW_DONE_MASK,
> > + USEC_PER_MSEC, 10 * USEC_PER_MSEC,
> > + true, qdma, REG_TRTCM_CFG_PARAM(addr)))
> > + return -ETIMEDOUT;
>
> [Severity: High]
> If a bus timeout causes the write phase's read_poll_timeout() in
> airoha_qdma_set_trtcm_param() to fail, this immediately returns -ETIMEDOUT,
> aborting the 100-iteration retry loop.
>
> Doesn't this defeat the mitigation?
I agree, we should just 'continue' here instead of returning -ETIMEDOUT.
>
> > +
> > + if (airoha_qdma_get_trtcm_param(qdma, channel, addr, param,
> > + mode, &data, NULL))
> > + continue;
>
> [Severity: High]
> If airoha_qdma_get_trtcm_param() times out during the verification phase, it
> indicates the hardware hasn't finished the read command (DONE bit is still 0).
>
> The continue statement then immediately starts the next iteration, writing new
> commands to DATA_LOW and CFG_PARAM.
>
> Can writing to command registers while the hardware state machine is actively
> busy violate hardware protocols and lead to unrecoverable lockups?
@Brown and @Leto: can you please comment on this case?
>
> > +
> > + if (data == val)
> > + return 0;
>
> [Severity: High]
> This isn't a bug introduced by this patch, but if the bus drops the command
> write to REG_TRTCM_CFG_PARAM, the hardware never executes the command and the
> TRTCM_PARAM_RW_DONE_MASK bit remains set from the previous operation.
>
> Consequently, read_poll_timeout() returns an immediate false success.
>
> If both the set and get command writes are dropped by the bus, the get
> operation will read back the newly written DATA_LOW register value.
>
> Will the data == val check evaluate to true and falsely report success without
> the hardware ever processing the configuration?
This seems like quite a 'pathological' scenario. Is it really an issue?
Regards,
Lorenzo
>
> > + }
> > +
> > + return -EBUSY;
> > }
> >
>
> [ ... ]
>
> > static int airoha_qdma_set_trtcm_config(struct airoha_qdma *qdma, int channel,
> > u32 addr, enum trtcm_mode_type mode,
> > bool enable, u32 enable_mask)
> > {
> > u32 val;
> >
> > if (airoha_qdma_get_trtcm_param(qdma, channel, addr, TRTCM_MISC_MODE,
> > mode, &val, NULL))
> > return -EINVAL;
>
> [Severity: High]
> Standalone calls to airoha_qdma_get_trtcm_param(), like this one in
> airoha_qdma_set_trtcm_config(), lack a retry loop entirely.
>
> Does this leave read-modify-write cycles fully vulnerable to the same bus
> timeouts the patch attempts to mitigate?
>
> [Severity: High]
> This isn't a bug introduced by this patch, but the patch mitigates bus
> timeouts in airoha_qdma_set_trtcm_param() while entirely missing
> airoha_qdma_set_rl_param().
>
> The latter performs the exact same write-and-poll sequence on identical QDMA
> configuration registers:
>
> drivers/net/ethernet/airoha/airoha_eth.c:airoha_qdma_set_rl_param() {
> ...
> return read_poll_timeout(..., val & RATE_LIMIT_PARAM_RW_DONE_MASK, ...);
> }
>
> If bus timeouts affect QDMA access broadly as stated in the commit message, are
> the rate-limiting configuration paths randomly vulnerable to the same failures?
>
> [Severity: High]
> This isn't a bug introduced by this patch, but in airoha_tc_remove_htb_queue(),
> hardware teardown incorrectly uses queue + 1 instead of queue:
>
> drivers/net/ethernet/airoha/airoha_eth.c:airoha_tc_remove_htb_queue() {
> ...
> airoha_qdma_set_tx_rate_limit(netdev, queue + 1, 0, 0);
> ...
> }
>
> Does this leave the removed queue's hardware limits permanently active
> (resource leak) and inadvertently disable the rate limits for queue + 1,
> corrupting the QoS state of another active channel?
* [syzbot] Monthly net report (Jun 2026)
From: syzbot @ 2026-06-23 20:32 UTC
To: linux-kernel, netdev, syzkaller-bugs
Hello net maintainers/developers,
This is a 31-day syzbot report for the net subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/net
During the period, 5 new issues were detected and 15 were fixed.
In total, 59 issues are still open and 1748 have already been fixed.
There are also 30 low-priority issues.
Some of the still happening issues:
Ref  Crashes Repro Title
<1>  8753    Yes   KMSAN: uninit-value in eth_type_trans (2)
                   https://syzkaller.appspot.com/bug?extid=0901d0cc75c3d716a3a3
<2>  2480    Yes   unregister_netdevice: waiting for DEV to become free (9)
                   https://syzkaller.appspot.com/bug?extid=e2af46126e0644cbebdd
<3>  686     Yes   INFO: task hung in tun_chr_close (5)
                   https://syzkaller.appspot.com/bug?extid=b0ae8f1abf7d891e0426
<4>  456     Yes   WARNING in inet_sock_destruct (6)
                   https://syzkaller.appspot.com/bug?extid=5b3b7e51dda1be027b7a
<5>  447     No    possible deadlock in __ipv6_dev_mc_inc
                   https://syzkaller.appspot.com/bug?extid=afbcf622635e98bf40d2
<6>  443     Yes   INFO: task hung in nsim_destroy (4)
                   https://syzkaller.appspot.com/bug?extid=8141dcbd23a8f857798a
<7>  150     No    possible deadlock in __ethtool_get_link_ksettings
                   https://syzkaller.appspot.com/bug?extid=9bb8bd77f3966641f298
<8>  83      Yes   KASAN: use-after-free Read in qdisc_pkt_len_segs_init
                   https://syzkaller.appspot.com/bug?extid=83181a31faf9455499c5
<9>  67      Yes   INFO: rcu detected stall in rescuer_thread (2)
                   https://syzkaller.appspot.com/bug?extid=d5f7a5097c24c7c2dbbb
<10> 3       Yes   memory leak in __vsock_create (2)
                   https://syzkaller.appspot.com/bug?extid=1b2c9c4a0f8708082678
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
To disable reminders for individual bugs, reply with the following command:
#syz set <Ref> no-reminders
To change bug's subsystems, reply with:
#syz set <Ref> subsystems: new-subsystem
You may send multiple commands in a single email message.
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Amery Hung @ 2026-06-23 20:22 UTC
To: Jakub Sitnicki
Cc: Alexei Starovoitov, Kuniyuki Iwashima, bpf, Alexei Starovoitov,
Daniel Borkmann, Jakub Kicinski, Jiayuan Chen, John Fastabend,
Network Development, kernel-team
In-Reply-To: <87mrwlyqg4.fsf@cloudflare.com>
On Tue, Jun 23, 2026 at 1:04 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
> > On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >>
> >> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> >
> >> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> >> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> > >>
> >> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> >> > >> completed all code paths related to sockmap-based redirects should be
> >> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> >> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> >> > >> socket references would remain under BPF_SYSCALL.
> >> > >>
> >> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> > >> ---
> >> > >> Changes in v2:
> >> > >> - Handle prot->recvmsg being NULL (Sashiko)
> >> > >> - Elaborate on the end goal in description
> >> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> >> > >> ---
> >> > >> net/unix/af_unix.c | 4 ++--
> >> > >> net/unix/unix_bpf.c | 6 ++++++
> >> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
> >> > >>
> >> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >> > >> index f7a9d55eee8a..84c11c60c75f 100644
> >> > >> --- a/net/unix/af_unix.c
> >> > >> +++ b/net/unix/af_unix.c
> >> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> >> > >> #ifdef CONFIG_BPF_SYSCALL
> >> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
> >> > >>
> >> > >> - if (prot != &unix_dgram_proto)
> >> > >> + if (prot->recvmsg)
> >> > >
> >> > > There is no reason to have this dead branch when
> >> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> >> > >
> >> > > Let's compile out all sockmap code when both configs
> >> > > are not enabled.
> >> > >
> >> > > Since AF_UNIX differs from TCP/UDP, it can take the
> >> > > simpler approach.
> >> >
> >> > Okay, will put the whole file behind hidden config option like so:
> >> >
> >> > --- a/net/unix/Kconfig
> >> > +++ b/net/unix/Kconfig
> >> > @@ -30,3 +30,8 @@ config UNIX_DIAG
> >> > help
> >> > Support for UNIX socket monitoring interface used by the ss tool.
> >> > If unsure, say Y.
> >> > +
> >> > +config UNIX_BPF
> >>
> >> Maybe UNIX_BPF_SOCKMAP or something.
> >> bpf_iter is supported without this config.
> >
> > I don't like where it's going.
> > I strongly dislike new config knobs.
> > I'd rather remove existing knobs.
> > What is the motivation?
>
> The goal is to compile out sockmap bits that use sk_msg.
> NET_SOCK_MSG is a natural, existing candidate.
> New knob wasn't my idea.
I'm also missing the big picture here.
sockmap already holds socket references today. You can store and look
up sockets without attaching any verdict/parser program, and no
redirect happens. So if the goal is to use sockmap purely as a socket
container without the sk_msg fast-path overhead, what does a
compile-time NET_SOCK_MSG knob add over the runtime checks?
I am also not sure if NET_SOCK_MSG is right. It is broader than
"sockmap redirect". It is selected by TLS and {INET,INET6}_ESPINTCP.
Because those select it, it can't be toggled independently.
Could you share the concrete use case you have in mind, and whether
this came out of an earlier discussion or thread upstream?
>
> Alternatively, we can do this to avoid the extra knob:
>
> ifdef CONFIG_BPF_SYSCALL
> unix-$(CONFIG_NET_SOCK_MSG) += unix_bpf.o
> endif
>
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Kuniyuki Iwashima @ 2026-06-23 20:14 UTC
To: Jakub Sitnicki
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <87h5mtyq72.fsf@cloudflare.com>
On Tue, Jun 23, 2026 at 1:09 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 12:31 PM -07, Kuniyuki Iwashima wrote:
> > On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> Okay, will put the whole file behind hidden config option like so:
> >>
> >> --- a/net/unix/Kconfig
> >> +++ b/net/unix/Kconfig
> >> @@ -30,3 +30,8 @@ config UNIX_DIAG
> >> help
> >> Support for UNIX socket monitoring interface used by the ss tool.
> >> If unsure, say Y.
> >> +
> >> +config UNIX_BPF
> >
> > Maybe UNIX_BPF_SOCKMAP or something.
> > bpf_iter is supported without this config.
>
> Not sure what you have in mind re bpf_iter. Can you share more?
I meant UNIX_BPF sounds like it covers bpf iterator for AF_UNIX too.
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Kuniyuki Iwashima @ 2026-06-23 20:13 UTC
To: Jakub Sitnicki
Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
Jakub Kicinski, Jiayuan Chen, John Fastabend, Network Development,
kernel-team
In-Reply-To: <87mrwlyqg4.fsf@cloudflare.com>
On Tue, Jun 23, 2026 at 1:03 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
> > On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >>
> >> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> >
> >> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> >> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >> > >>
> >> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> >> > >> completed all code paths related to sockmap-based redirects should be
> >> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> >> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> >> > >> socket references would remain under BPF_SYSCALL.
> >> > >>
> >> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> > >> ---
> >> > >> Changes in v2:
> >> > >> - Handle prot->recvmsg being NULL (Sashiko)
> >> > >> - Elaborate on the end goal in description
> >> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> >> > >> ---
> >> > >> net/unix/af_unix.c | 4 ++--
> >> > >> net/unix/unix_bpf.c | 6 ++++++
> >> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
> >> > >>
> >> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >> > >> index f7a9d55eee8a..84c11c60c75f 100644
> >> > >> --- a/net/unix/af_unix.c
> >> > >> +++ b/net/unix/af_unix.c
> >> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> >> > >> #ifdef CONFIG_BPF_SYSCALL
> >> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
> >> > >>
> >> > >> - if (prot != &unix_dgram_proto)
> >> > >> + if (prot->recvmsg)
> >> > >
> >> > > There is no reason to have this dead branch when
> >> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> >> > >
> >> > > Let's compile out all sockmap code when both configs
> >> > > are not enabled.
> >> > >
> >> > > Since AF_UNIX differs from TCP/UDP, it can take the
> >> > > simpler approach.
> >> >
> >> > Okay, will put the whole file behind hidden config option like so:
> >> >
> >> > --- a/net/unix/Kconfig
> >> > +++ b/net/unix/Kconfig
> >> > @@ -30,3 +30,8 @@ config UNIX_DIAG
> >> > help
> >> > Support for UNIX socket monitoring interface used by the ss tool.
> >> > If unsure, say Y.
> >> > +
> >> > +config UNIX_BPF
> >>
> >> Maybe UNIX_BPF_SOCKMAP or something.
> >> bpf_iter is supported without this config.
> >
> > I don't like where it's going.
> > I strongly dislike new config knobs.
> > I'd rather remove existing knobs.
> > What is the motivation?
>
> The goal is to compile out sockmap bits that use sk_msg.
> NET_SOCK_MSG is a natural, existing candidate.
> New knob wasn't my idea.
I think config w/o description is okay since it's not selectable.
>
> Alternatively, we can do this to avoid the extra knob:
>
> ifdef CONFIG_BPF_SYSCALL
> unix-$(CONFIG_NET_SOCK_MSG) += unix_bpf.o
> endif
This is far better, I forgot ifdef is available.
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Jakub Sitnicki @ 2026-06-23 20:09 UTC
To: Kuniyuki Iwashima
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <CAAVpQUBARp1qCEomgzWXVe35WatdaswujVLku+RESm_LW0dE7Q@mail.gmail.com>
On Tue, Jun 23, 2026 at 12:31 PM -07, Kuniyuki Iwashima wrote:
> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> Okay, will put the whole file behind hidden config option like so:
>>
>> --- a/net/unix/Kconfig
>> +++ b/net/unix/Kconfig
>> @@ -30,3 +30,8 @@ config UNIX_DIAG
>> help
>> Support for UNIX socket monitoring interface used by the ss tool.
>> If unsure, say Y.
>> +
>> +config UNIX_BPF
>
> Maybe UNIX_BPF_SOCKMAP or something.
> bpf_iter is supported without this config.
Not sure what you have in mind re bpf_iter. Can you share more?
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Jakub Sitnicki @ 2026-06-23 20:03 UTC
To: Alexei Starovoitov
Cc: Kuniyuki Iwashima, bpf, Alexei Starovoitov, Daniel Borkmann,
Jakub Kicinski, Jiayuan Chen, John Fastabend, Network Development,
kernel-team
In-Reply-To: <CAADnVQL2pfQ0BoN-vWcuCpbOBBKq_rM7Bp7P4XdLMFER5LGSDg@mail.gmail.com>
On Tue, Jun 23, 2026 at 12:33 PM -07, Alexei Starovoitov wrote:
> On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>>
>> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >
>> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
>> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> > >>
>> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
>> > >> completed all code paths related to sockmap-based redirects should be
>> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
>> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
>> > >> socket references would remain under BPF_SYSCALL.
>> > >>
>> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> > >> ---
>> > >> Changes in v2:
>> > >> - Handle prot->recvmsg being NULL (Sashiko)
>> > >> - Elaborate on the end goal in description
>> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
>> > >> ---
>> > >> net/unix/af_unix.c | 4 ++--
>> > >> net/unix/unix_bpf.c | 6 ++++++
>> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
>> > >>
>> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> > >> index f7a9d55eee8a..84c11c60c75f 100644
>> > >> --- a/net/unix/af_unix.c
>> > >> +++ b/net/unix/af_unix.c
>> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
>> > >> #ifdef CONFIG_BPF_SYSCALL
>> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
>> > >>
>> > >> - if (prot != &unix_dgram_proto)
>> > >> + if (prot->recvmsg)
>> > >
>> > > There is no reason to have this dead branch when
>> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
>> > >
>> > > Let's compile out all sockmap code when both configs
>> > > are not enabled.
>> > >
>> > > Since AF_UNIX differs from TCP/UDP, it can take the
>> > > simpler approach.
>> >
>> > Okay, will put the whole file behind hidden config option like so:
>> >
>> > --- a/net/unix/Kconfig
>> > +++ b/net/unix/Kconfig
>> > @@ -30,3 +30,8 @@ config UNIX_DIAG
>> > help
>> > Support for UNIX socket monitoring interface used by the ss tool.
>> > If unsure, say Y.
>> > +
>> > +config UNIX_BPF
>>
>> Maybe UNIX_BPF_SOCKMAP or something.
>> bpf_iter is supported without this config.
>
> I don't like where it's going.
> I strongly dislike new config knobs.
> I'd rather remove existing knobs.
> What is the motivation?
The goal is to compile out sockmap bits that use sk_msg.
NET_SOCK_MSG is a natural, existing candidate.
New knob wasn't my idea.
Alternatively, we can do this to avoid the extra knob:
ifdef CONFIG_BPF_SYSCALL
unix-$(CONFIG_NET_SOCK_MSG) += unix_bpf.o
endif
* Re: [RFC net-next 08/15] ipxlat: add translation engine and dispatch core
From: Toke Høiland-Jørgensen @ 2026-06-23 19:59 UTC
To: Ralf Lici
Cc: netdev, Daniel Gröber, Antonio Quartulli, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
linux-kernel, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
Beniamino Galvani
In-Reply-To: <20260623163606.33510-1-ralf@mandelbit.com>
Ralf Lici <ralf@mandelbit.com> writes:
> On Mon, 22 Jun 2026 16:36:24 +0200, Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >> > My second concern is that the SIIT boundary would be a property of
>> >> > rule and hook placement. That gives flexibility, but it also means the
>> >> > translation point has to be constrained and documented very carefully
>> >> > to avoid ambiguous TTL/Hop Limit, PMTU/ICMP, and hook-order behavior.
>> >> > For this use case I would rather have the route that matches the
>> >> > translation prefix also be the object that says: leave this family
>> >> > here and continue in the other one.
>> >>
>> >> Yeah, with flexibility comes the ability to shoot yourself in the foot.
>> >> But that's not really different from much of the other functionality we
>> >> have in the kernel today, is it? For netfilter in particular it's
>> >> certainly possible to configure a broken NAT configuration that leads to
>> >> packet drops (or just invalid packets being sent out on a network
>> >> device).
>> >>
>> >
>> > True, misconfiguration is always possible and that alone is not an
>> > argument against the netfilter model. But what do we actually gain in
>> > capability from that flexibility? I agree on the UX argument (an admin
>> > would look in nft first), but in terms of what the feature can do, I
>> > can't yet see what the nft model unlocks. More on this just below.
>> >
>> >> > After looking at the available kernel mechanisms again, I think the
>> >> > better model is probably LWT: routes carry an ipxlat encap referencing a
>> >> > named translator domain configured over netlink. That should represent
>> >> > the stateless, prefix-based and symmetric nature of ipxlat.
>> >>
>> >> I think this description actually hits the nail on the head: What are we
>> >> implementing here? Is it a product feature, or a building block for one?
>> >> The properties you mention wrt consistency, symmetry etc are properties
>> >> of the high-level feature (which is also generally the level things are
>> >> specified in RFCs). Whereas other packet mangling features in the kernel
>> >> are more in the "building block" category, where it's possible to
>> >> configure things to implement a particular feature set / compliance with
>> >> a particular RFC, but it's also possible to do things that are outside
>> >> of that.
>> >>
>> >> I think this relates to the "mechanism, not policy" approach that we
>> >> take to most things in the kernel: implement the building blocks to do
>> >> something in the most general way we can, and then leave it up to
>> >> userspace to configure things in a way that results in a consistent
>> >> high-level system behaviour.
>> >>
>> >
>> > That's a good point, and I agree that we should not bake a high-level
>> > product policy into the kernel if what we need is a reusable mechanism
>> > (the LWT idea was my attempt at exactly that). What I am still trying to
>> > understand is whether there is a useful generic trigger for stateless
>> > cross-family translation beyond the route/prefix/policy-routing cases.
>> >
>> > Routes and policy routing already cover the selectors I can make
>> > coherent for a stateless, per-packet translator: destination/source
>> > prefix, iif/oif/VRF, mark, TOS/DSCP, and so on. nft can of course match
>> > much more than that, but the additional selectors that would materially
>> > change the translation decision seem to be selectors such as L4 fields,
>> > payload state, or conntrack state. Those are exactly the selectors I am
>> > struggling to make correct for a stateless translator:
>> >
>> > - non-first fragments carry no L4 header at all, yet the translator must
>> > rewrite every fragment (an nft ... tcp dport trigger cannot fire on
>> > them);
>> >
>> > - ICMP errors must be translated too, but the flow identity lives in the
>> > quoted inner header (reversed), not in anything an L4/ct match on the
>> > error packet can see, and there is no conntrack to associate them,
>> > since this is stateless.
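The fragment point above can be sketched in a few lines of userspace C. This is only an illustration, assuming the IPv4 fragment field has already been converted to host byte order; the name model_l4_header_present() is hypothetical and not part of ipxlat or nft.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low 13 bits of the IPv4 flags/fragment-offset field hold the offset
 * (in 8-byte units); the upper bits are the DF and MF flags.
 */
#define MODEL_IP_OFFSET_MASK 0x1fffu

/* Hypothetical helper: returns true when the packet can carry an L4 header,
 * i.e. it is unfragmented or it is the first fragment (offset == 0).
 * Non-first fragments have a non-zero offset and start in the middle of
 * the payload, so an L4 port match has nothing to look at.
 */
static bool model_l4_header_present(uint16_t frag_off_host)
{
	return (frag_off_host & MODEL_IP_OFFSET_MASK) == 0;
}
```

A stateless translator still has to rewrite the packets for which this returns false, which is why a trigger keyed on L4 fields would miss them.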
>>
>> True in principle, but if (say) you deploy this on a network that is
>> configured so it will never fragment packets, this won't be an issue in
>> practice.
>>
>> I.e., you're quite right that arbitrary matching criteria cannot be
>> guaranteed to result in coherent translation. But I think that goes into
>> the "use it wrong, get wrong results" bin. E.g., if you match on
>> something that results in only a subset of the packets of a flow being
>> translated, well, only that subset of the packets will make it to the
>> destination. The SIIT translator itself should not try to fix this, but
>> neither should it prevent it; that's what I mean by "building block" -
>> it's up to the builder using the blocks to make sure the building
>> doesn't collapse, that's out of scope for the block manufacturer to
>> worry about :)
>>
>
> I agree with that framing. The translation core should not try to prove
> that the surrounding policy describes a coherent SIIT deployment.
Cool!
>> > So an L4-conditional trigger does not look like a good primitive for
>> > correct stateless SIIT unless the action also defragments/refragments or
>> > uses conntrack-like state. Those may be valid mechanisms, but they move
>> > the design away from the stateless per-packet SIIT boundary this RFC is
>> > trying to model.
>> >
>> > So my first question is: is there a useful nft configuration this should
>> > enable that is not naturally expressible as route selection, while still
>> > remaining stateless SIIT rather than a NAT64-like stateful feature?
>> > Maybe there is a real use case there, but I cannot construct one yet.
>>
>> So the poster child for "match on arbitrary criteria" is of course BPF.
>> You can write BPF programs that match on arbitrary parts of the packet
>> header, custom encapsulation headers, or even on out-of-band things like
>> system state, phase of the moon, or what have you. And we should
>> certainly allow a BPF program to make the decision on whether to perform
>> the SIIT translation.
>>
>> Which... maybe is an argument to keep it as a device like you do in this
>> RFC series? Redirecting to a device is trivially supported from TC-BPF,
>> which also makes it possible to use the translation mechanism without
>> going through the routing subsystem at all, saving a bit of overhead.
>> Whereas making it a route action ties it very closely to the routing
>> subsystem.
>>
>> WDYT?
>>
>
> I see the netdevice appeal for this, especially as a BPF redirect
> target. But as we discussed earlier, the device model has some real
> problems: the device selected by the first route is not the real
> post-translation egress, so the model ends up doing translation and
> reinjection rather than normal transmission. Concretely:
>
> - it needs synthetic routing state purely to get things like MTU for
> fragmentation, because the real post-translation nexthop is not known
> at translation time;
>
> - TTL/Hop Limit handling gets harder to reason about because the packet
> has effectively gone through two routing decisions;
>
> - rx/tx stats can't be made meaningful for a direction-agnostic device
> whose ndo_start_xmit is really "translate and receive";
>
> - and the setup is not very obvious: create an interface, route packets
> to it, then have them come back translated.
>
> None of these is fatal on its own, but together they make me think the
> abstraction does not quite fit.
Right, OK, you're right.
> On the BPF point specifically: I agree a BPF program should be able to
> decide whether to translate. What I am less sure about is whether
> redirecting to a netdevice is the best way to expose that. A TC action
> (yet another model, I know :)) gives you the same thing in-pipeline and
> more directly:
>
> tc filter add dev wwan0 egress \
> bpf obj match.o action ipxlat4to6 domain clat0
>
> Let BPF make the policy decision, with the native action doing the
> translation work that the current BPF CLAT implementations have trouble
> with: fragmentation, checksum corner cases, and ICMP error inner
> headers (as explained by Beniamino).
>
> So TC clsact looks like the natural in-kernel replacement for today's
> TC-BPF CLAT programs: no extra netdev, you attach to the existing
> uplink, direction is explicit, and on egress you sit on the real route
> dst, so the synthetic-dst and double-routing problems above just don't
> arise. The cost is more moving parts than a single bpf_redirect since
> userspace has to manage clsact, filters, priorities and action
> lifecycle/cleanup.
Hmm, so no one really uses the bpf filter mechanism, since you can just
do everything from an action anyway (and with TCX attachment, you can
even avoid the overhead of the TC filter/action infrastructure
entirely). However, point taken wrt how to integrate this with BPF. I
guess the most flexible thing would be to expose the functionality
directly (as a kfunc callable from a BPF program). Which also fits with
your point below:
> For a gateway translator, though, I still think a device-bound model is
> less natural. There the translation point is more like a forwarding
> decision across routes and nexthops, so a route/LWT attachment, or
> possibly a netfilter attachment, seems easier to reason about. Also, as
> you already pointed out while discussing LWT, an admin setting up NAT64
> is more likely to reach for an nft rule than for a clsact filter on a
> specific device.
>
> Taking a step back, ipxlat is really a generic translation engine plus a
> thin harness around it. So rather than pick one attachment, it might be
> worth structuring the engine so different harnesses can drive it.
> There's interesting precedent for this shape:
>
> - ILA, again, is the closest sibling: stateless IPv6 address translation
> with a shared core in ila_common.c, driven both by an LWT frontend in
> ila_lwt.c and by an inline netfilter hook with a netlink-configured
> mapping table in ila_xlat.c.
>
> - act_ct is the precedent for the TC side specifically: a TC action that
> reuses the netfilter conntrack engine rather than reimplementing it.
>
> And act_nat is the cautionary counter-example: a standalone TC
> reimplementation of stateless NAT that shares no code with nf_nat, and
> carries a "would be nice to share code" comment :)
>
> So I am wondering whether the right direction is to factor the
> translation engine cleanly, land it with one harness first, and keep the
> other attachment points as follow-up work once the core semantics are
> settled.
>
> Does that direction seem reasonable to you?
Yes, reusable functionality that can be called from multiple places
sounds like a good fit; let's try to structure it that way!
As for which hook to start with, well, let's see if we hear back from
the netfilter devs, but either netfilter or the routing subsystem (LWT
style) would be OK for me I think.
-Toke
* Re: [PATCH bpf-next v4 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Toke Høiland-Jørgensen @ 2026-06-23 19:45 UTC (permalink / raw)
To: Avinash Duduskar, ast, daniel, andrii
Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
dsahern
In-Reply-To: <20260623182849.2623521-1-avinash.duduskar@gmail.com>
Avinash Duduskar <avinash.duduskar@gmail.com> writes:
> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>
>> I think it's better to just move the assignment of params->ifindex
>> entirely into bpf_fib_set_fwd_params(), instead of this restore dance.
>> That way this can be simplified to:
>>
>> err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
>> if (!err && fwd_dev)
>> *fwd_dev = dev;
>> return err;
>
> The caller-side restore is ungainly, agreed, but the assignment can't move
> all the way into the helper. The early params->ifindex = dev->ifindex
> sits above the neighbour lookup on purpose: that is d1c362e1dd68a
> ("bpf: Always return target ifindex in bpf_fib_lookup"), which took it
> out of bpf_fib_set_fwd_params() and put it there so a program still
> gets the target ifindex on the BPF_FIB_LKUP_RET_NO_NEIGH path and can
> bpf_redirect_neigh() on it. bpf_fib_set_fwd_params() is called only at
> the set_fwd_params label, below the NO_NEIGH return (and below the IPv6
> NO_SRC_ADDR return), so an assignment living in the helper never runs
> on those paths and params->ifindex falls back to the input. That would
> change the reported ifindex for plain bpf_fib_lookup() callers hitting
> NO_NEIGH, not only the VLAN ones.
Right. Well, seems I forgot about that patch, even though I seem to have
written it :)
> I can still get the caller down to your form by keeping the early write
> and moving just the VLAN_FAILURE rewind into the helper, with one extra
> parameter, the input ifindex saved before the egress write:
>
> err = bpf_fib_set_fwd_params(dev, params, flags, mtu, in_ifindex);
> if (!err && fwd_dev)
> *fwd_dev = dev;
> return err;
>
> and the helper owning the rewind in the unreducible branch:
>
> } else {
> params->ifindex = in_ifindex;
> return BPF_FIB_LKUP_RET_VLAN_FAILURE;
> }
OK, if we do need to restore it, I think it's better to do it there.
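For readers following along, the control flow being discussed can be modelled in plain userspace C. This is a sketch under stated assumptions, not the kernel code: the model_* names, the enum values and the boolean inputs are all hypothetical stand-ins for the neighbour lookup and the VLAN reduction.

```c
#include <stdbool.h>

/* Hypothetical return codes for this model; the values are not the kernel's. */
enum model_ret {
	MODEL_RET_SUCCESS,
	MODEL_RET_NO_NEIGH,
	MODEL_RET_VLAN_FAILURE,
};

struct model_params {
	int ifindex;	/* input ifindex on entry, reported ifindex on return */
};

/* Stand-in for the helper: it owns the rewind on the VLAN failure branch. */
static enum model_ret model_set_fwd_params(struct model_params *params,
					   bool vlan_reducible, int in_ifindex)
{
	if (!vlan_reducible) {
		params->ifindex = in_ifindex;
		return MODEL_RET_VLAN_FAILURE;
	}
	return MODEL_RET_SUCCESS;
}

/* Stand-in for the caller: the early write stays above the neighbour check
 * so that the NO_NEIGH path still reports the egress ifindex.
 */
static enum model_ret model_fib_lookup(struct model_params *params,
				       int egress_ifindex, bool have_neigh,
				       bool vlan_reducible)
{
	int in_ifindex = params->ifindex;	/* saved before the egress write */

	params->ifindex = egress_ifindex;
	if (!have_neigh)
		return MODEL_RET_NO_NEIGH;
	return model_set_fwd_params(params, vlan_reducible, in_ifindex);
}
```

In this model the NO_NEIGH path reports the egress ifindex, and only the VLAN failure path falls back to the input one.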
Also, wrt the fwd_dev parameter: Do we really have a use case for using
this from TC? In TC you can just redirect to the VLAN device; this is
meant for XDP which can't do that. So how about we just reject the flag
on the TC side, and get rid of the fwd_dev parameter entirely?
If we do that we're back to just a plain 'return bpf_fib_set_fwd_params()' :)
-Toke
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Alexei Starovoitov @ 2026-06-23 19:33 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Jakub Sitnicki, bpf, Alexei Starovoitov, Daniel Borkmann,
Jakub Kicinski, Jiayuan Chen, John Fastabend, Network Development,
kernel-team
In-Reply-To: <CAAVpQUBARp1qCEomgzWXVe35WatdaswujVLku+RESm_LW0dE7Q@mail.gmail.com>
On Tue, Jun 23, 2026 at 12:31 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >
> > On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> > > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> > >>
> > >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> > >> completed all code paths related to sockmap-based redirects should be
> > >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> > >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> > >> socket references would remain under BPF_SYSCALL.
> > >>
> > >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> > >> ---
> > >> Changes in v2:
> > >> - Handle prot->recvmsg being NULL (Sashiko)
> > >> - Elaborate on the end goal in description
> > >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> > >> ---
> > >> net/unix/af_unix.c | 4 ++--
> > >> net/unix/unix_bpf.c | 6 ++++++
> > >> 2 files changed, 8 insertions(+), 2 deletions(-)
> > >>
> > >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > >> index f7a9d55eee8a..84c11c60c75f 100644
> > >> --- a/net/unix/af_unix.c
> > >> +++ b/net/unix/af_unix.c
> > >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> > >> #ifdef CONFIG_BPF_SYSCALL
> > >> const struct proto *prot = READ_ONCE(sk->sk_prot);
> > >>
> > >> - if (prot != &unix_dgram_proto)
> > >> + if (prot->recvmsg)
> > >
> > > There is no reason to have this dead branch when
> > > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> > >
> > > Let's compile out all sockmap code when both configs
> > > are not enabled.
> > >
> > > Since AF_UNIX differs from TCP/UDP, it can take the
> > > simpler approach.
> >
> > Okay, will put the whole file behind hidden config option like so:
> >
> > --- a/net/unix/Kconfig
> > +++ b/net/unix/Kconfig
> > @@ -30,3 +30,8 @@ config UNIX_DIAG
> > help
> > Support for UNIX socket monitoring interface used by the ss tool.
> > If unsure, say Y.
> > +
> > +config UNIX_BPF
>
> Maybe UNIX_BPF_SOCKMAP or something.
> bpf_iter is supported without this config.
I don't like where it's going.
I strongly dislike new config knobs.
I'd rather remove existing knobs.
What is the motivation?
* Re: [PATCH bpf 2/2] selftests/bpf: Add test for UDP sock leak on sockmap lookup-bind-release
From: bot+bpf-ci @ 2026-06-23 19:32 UTC (permalink / raw)
To: mhal, john.fastabend, jakub, jiayuan.chen, davem, edumazet, kuba,
pabeni, horms, ast, cong.wang, daniel, andrii, eddyz87, memxor,
martin.lau, song, yonghong.song, jolsa, emil, shuah
Cc: netdev, bpf, linux-kernel, linux-kselftest, mhal, ast, andrii,
daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260623-sockmap-lookup-udp-leak-v1-2-05804f9308e4@rbox.co>
> selftests/bpf: Add test for UDP sock leak on sockmap lookup-bind-release
>
> Setup and join a cgroup, then attach a cgroup/connect4 program that runs
>
> sk = bpf_map_lookup_elem(sockmap, 0)
> bpf_bind(ctx, sa, sizeof(sa))
> bpf_sk_release(sk)
>
> Unpatched kernel leaks the socket.
>
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
This test reproduces a UDP socket leak across sockmap lookup-bind-release,
but there is no Fixes: tag.
The sibling fix commit a2510dc351c5 ("bpf, sockmap: Don't leak UDP socks on
lookup-bind-release") carries a Fixes: tag pointing at the commit that lifted
the socket-state restriction allowing unbound UDP sockets into sockmap.
Should this test carry the same tag so it stays backportable alongside the
fix?
Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/28049771327
* Re: s2io: driver still in use - please reconsider removal
From: Ethan Nelson-Moore @ 2026-06-23 19:31 UTC (permalink / raw)
To: Michael Pratte
Cc: Jakub Kicinski, Paolo Abeni, Eric Dumazet, Andrew Lunn,
Simon Horman, David S . Miller, netdev
In-Reply-To: <20260623112133.752195-1-slatoncomputers@gmail.com>
Hi, Michael,
On Tue, Jun 23, 2026 at 4:21 AM Michael Pratte
<slatoncomputers@gmail.com> wrote:
> Commit aba0138eb7d7 ("net: ethernet: neterion: s2io: remove unused
> driver") removed s2io in v7.0 as "highly unlikely to still be used."
> It is still in use here: an Exar Xframe-II (PCI 17d5:5832) in a
> Supermicro X5DA8.
>
> Bringing it up, I found that no TCP can be transmitted on these
> adapters since v4.2.
[...]
> Given it is evidently still in use, would
> you consider reverting the removal?
Given that the driver has not been working for almost 11 years and you
are seemingly the first person to notice, I would like to respectfully
disagree with this assertion.
Are you using the card for actual work, or are you just testing it out
of curiosity? What kernel version were you running before you upgraded
to a current kernel?
Ethan
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Kuniyuki Iwashima @ 2026-06-23 19:31 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <87v7b9ysep.fsf@cloudflare.com>
On Tue, Jun 23, 2026 at 12:21 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> > On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>
> >> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
> >> completed all code paths related to sockmap-based redirects should be
> >> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
> >> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
> >> socket references would remain under BPF_SYSCALL.
> >>
> >> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
> >> ---
> >> Changes in v2:
> >> - Handle prot->recvmsg being NULL (Sashiko)
> >> - Elaborate on the end goal in description
> >> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
> >> ---
> >> net/unix/af_unix.c | 4 ++--
> >> net/unix/unix_bpf.c | 6 ++++++
> >> 2 files changed, 8 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> >> index f7a9d55eee8a..84c11c60c75f 100644
> >> --- a/net/unix/af_unix.c
> >> +++ b/net/unix/af_unix.c
> >> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
> >> #ifdef CONFIG_BPF_SYSCALL
> >> const struct proto *prot = READ_ONCE(sk->sk_prot);
> >>
> >> - if (prot != &unix_dgram_proto)
> >> + if (prot->recvmsg)
> >
> > There is no reason to have this dead branch when
> > CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
> >
> > Let's compile out all sockmap code when both configs
> > are not enabled.
> >
> > Since AF_UNIX differs from TCP/UDP, it can take the
> > simpler approach.
>
> Okay, will put the whole file behind hidden config option like so:
>
> --- a/net/unix/Kconfig
> +++ b/net/unix/Kconfig
> @@ -30,3 +30,8 @@ config UNIX_DIAG
> help
> Support for UNIX socket monitoring interface used by the ss tool.
> If unsure, say Y.
> +
> +config UNIX_BPF
Maybe UNIX_BPF_SOCKMAP or something.
bpf_iter is supported without this config.
> + bool
> + depends on UNIX
> + default y if BPF_SYSCALL && NET_SOCK_MSG
* Re: [PATCH bpf-next v2] bpf, unix: Guard sk_msg-dependent code behind CONFIG_NET_SOCK_MSG
From: Jakub Sitnicki @ 2026-06-23 19:21 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
Jiayuan Chen, John Fastabend, netdev, kernel-team
In-Reply-To: <CAAVpQUBsQFFxJFDnJzxmsER3bOjm=zqJ5P5MSeW_T9v-4639cw@mail.gmail.com>
On Tue, Jun 23, 2026 at 09:08 AM -07, Kuniyuki Iwashima wrote:
> On Tue, Jun 23, 2026 at 4:20 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> Prepare to decouple BPF_SYSCALL config option from NET_SOCK_MSG. When
>> completed all code paths related to sockmap-based redirects should be
>> guarded by BPF_SYSCALL && NET_SOCK_MSG to allow users to opt out by
>> disabling NET_SOCK_MSG. The implementation of sockmap as a container for
>> socket references would remain under BPF_SYSCALL.
>>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---
>> Changes in v2:
>> - Handle prot->recvmsg being NULL (Sashiko)
>> - Elaborate on the end goal in description
>> - Link to v1: https://patch.msgid.link/20260622-bpf-sk_msg-split-unix-v1-1-d7e0cb7bb03b@cloudflare.com
>> ---
>> net/unix/af_unix.c | 4 ++--
>> net/unix/unix_bpf.c | 6 ++++++
>> 2 files changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
>> index f7a9d55eee8a..84c11c60c75f 100644
>> --- a/net/unix/af_unix.c
>> +++ b/net/unix/af_unix.c
>> @@ -2675,7 +2675,7 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t si
>> #ifdef CONFIG_BPF_SYSCALL
>> const struct proto *prot = READ_ONCE(sk->sk_prot);
>>
>> - if (prot != &unix_dgram_proto)
>> + if (prot->recvmsg)
>
> There is no reason to have this dead branch when
> CONFIG_BPF_SYSCALL && !NET_SOCK_MSG.
>
> Let's compile out all sockmap code when both configs
> are not enabled.
>
> Since AF_UNIX differs from TCP/UDP, it can take the
> simpler approach.
Okay, will put the whole file behind hidden config option like so:
--- a/net/unix/Kconfig
+++ b/net/unix/Kconfig
@@ -30,3 +30,8 @@ config UNIX_DIAG
help
Support for UNIX socket monitoring interface used by the ss tool.
If unsure, say Y.
+
+config UNIX_BPF
+ bool
+ depends on UNIX
+ default y if BPF_SYSCALL && NET_SOCK_MSG
* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Eric Biggers @ 2026-06-23 19:19 UTC (permalink / raw)
To: Linus Torvalds
Cc: Bastien Nocera, linux-crypto, Herbert Xu, Marcel Holtmann,
Luiz Augusto von Dentz, linux-doc, linux-api, linux-kernel,
netdev, linux-bluetooth, ell
In-Reply-To: <CAHk-=wgNG=F3xO9PjL0RcKy3UWvq0Np9uZu+nFUQBAA8So9xdA@mail.gmail.com>
On Tue, Jun 23, 2026 at 11:56:10AM -0700, Linus Torvalds wrote:
> On Tue, 23 Jun 2026 at 09:51, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > We're aware of that and are taking it into account in the allowlist:
>
> Note that if we can just unconditionally make it depend on
> CAP_NET_ADMIN, that would be good - independently of any allowlist.
>
> Because if iwd and bluetoothd are the main two users, and both of
> those already require CAP_NET_ADMIN anyway...
There's also cryptsetup, including unprivileged benchmarking and also
(in theory) formatting support, and pre-7.0 versions of iproute2 which
used it for computing SHA-1 hashes of BPF programs.
If we broke unprivileged 'cryptsetup benchmark', some people would
definitely notice. However, since it's just a manually-run benchmark
anyway, users could just run it with sudo.
I don't know about the iproute2 case.
It depends how aggressive we want to be. My current proposal
(https://lore.kernel.org/linux-crypto/20260622234803.6982-1-ebiggers@kernel.org/)
has the entries in the allowlist marked as either privileged or
unprivileged. There are just a few unprivileged ones, for cryptsetup
and iproute2 as mentioned. But we could try doing away with the
unprivileged ones entirely and see who complains.
- Eric
* Re: [PATCH] crypto: af_alg - Document the deprecation of AF_ALG
From: Linus Torvalds @ 2026-06-23 18:56 UTC (permalink / raw)
To: Eric Biggers
Cc: Bastien Nocera, linux-crypto, Herbert Xu, Marcel Holtmann,
Luiz Augusto von Dentz, linux-doc, linux-api, linux-kernel,
netdev, linux-bluetooth, ell
In-Reply-To: <20260623164932.GA1793@sol>
On Tue, 23 Jun 2026 at 09:51, Eric Biggers <ebiggers@kernel.org> wrote:
>
> We're aware of that and are taking it into account in the allowlist:
Note that if we can just unconditionally make it depend on
CAP_NET_ADMIN, that would be good - independently of any allowlist.
Because if iwd and bluetoothd are the main two users, and both of
those already require CAP_NET_ADMIN anyway...
Linus
* [PATCH bpf 1/2] bpf, sockmap: Don't leak UDP socks on lookup-bind-release
From: Michal Luczaj @ 2026-06-23 18:03 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Jiayuan Chen, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Alexei Starovoitov, Cong Wang, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260623-sockmap-lookup-udp-leak-v1-0-05804f9308e4@rbox.co>
UDP sockets get SOCK_RCU_FREE set when (auto-)bound. This means
sk_is_refcounted(unbound) = true, while sk_is_refcounted(bound) = false.
Because sockmap accepts unbound UDP sockets, a BPF program can increment a
socket's refcount via lookup. If the socket is subsequently bound, the
transition from unbound to bound causes bpf_sk_release() to skip the
decrement of the refcount, causing a memory leak.
unreferenced object 0xffff88810bc2eb40 (size 1984):
comm "test_progs", pid 2451, jiffies 4295320596
hex dump (first 32 bytes):
7f 00 00 01 7f 00 00 01 d2 04 1b b7 04 d2 00 00 ................
02 00 01 40 00 00 00 00 00 00 00 00 00 00 00 00 ...@............
backtrace (crc bdee079d):
kmem_cache_alloc_noprof+0x557/0x660
sk_prot_alloc+0x69/0x240
sk_alloc+0x30/0x460
inet_create+0x2ce/0xf80
__sock_create+0x25b/0x5c0
__sys_socket+0x119/0x1d0
__x64_sys_socket+0x72/0xd0
do_syscall_64+0xa1/0x5f0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Maintain balanced refcounts across sk lookup/release: (re-)set
SOCK_RCU_FREE on proto update to treat the socket (whether bound or
unbound) as not requiring a refcount increment on an RCU-protected lookup.
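The imbalance can be reproduced with a small userspace model. This is a sketch, not kernel code: struct model_sock and the model_* functions are hypothetical, and the rcu_free field merely stands in for SOCK_RCU_FREE.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a socket; not the kernel's struct sock. */
struct model_sock {
	int refcnt;
	bool rcu_free;	/* stands in for SOCK_RCU_FREE */
};

static bool model_is_refcounted(const struct model_sock *sk)
{
	return !sk->rcu_free;
}

/* Lookup takes a reference only if the socket is refcounted right now. */
static void model_lookup(struct model_sock *sk)
{
	if (model_is_refcounted(sk))
		sk->refcnt++;
}

/* Release drops a reference only if the socket is refcounted right now. */
static void model_release(struct model_sock *sk)
{
	if (model_is_refcounted(sk))
		sk->refcnt--;
}

/* Binding sets the flag, which flips the answer of model_is_refcounted(). */
static void model_bind(struct model_sock *sk)
{
	sk->rcu_free = true;
}

/* Returns the number of references leaked by lookup-bind-release.
 * set_flag_on_map_update models the fix: the flag is already set when the
 * socket enters the map, so lookup and release agree with each other.
 */
static int model_leak(bool set_flag_on_map_update)
{
	struct model_sock sk = { .refcnt = 1, .rcu_free = false };

	if (set_flag_on_map_update)
		sk.rcu_free = true;

	model_lookup(&sk);
	model_bind(&sk);
	model_release(&sk);
	return sk.refcnt - 1;
}
```

With the flag set up front, lookup and release make the same decision regardless of when the bind happens, so the count stays balanced.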
Fixes: 0c48eefae712 ("sock_map: Lift socket state restriction for datagram sockets")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Note: this issue is related to commit 67312adc96b5 ("bpf: reject unhashed
sockets in bpf_sk_assign").
---
net/ipv4/udp_bpf.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c
index ad57c4c9eaab..970327b59582 100644
--- a/net/ipv4/udp_bpf.c
+++ b/net/ipv4/udp_bpf.c
@@ -173,6 +173,9 @@ int udp_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
if (sk->sk_family == AF_INET6)
udp_bpf_check_v6_needs_rebuild(psock->sk_proto);
+ /* Treat all sockets as non-refcounted, regardless of binding state. */
+ sock_set_flag(sk, SOCK_RCU_FREE);
+
sock_replace_proto(sk, &udp_bpf_prots[family]);
return 0;
}
--
2.54.0
* [PATCH net 1/1] net/sched: sch_teql: Introduce slaves_lock to avoid race condition and UAF
From: Jamal Hadi Salim @ 2026-06-23 18:42 UTC (permalink / raw)
To: netdev
Cc: davem, edumazet, kuba, pabeni, horms, victor, andrew+netdev,
zdi-disclosures, security, stable, Jamal Hadi Salim
The teql master->slaves singly linked list is not protected against concurrent
writes. It can be modified concurrently from teql_master_xmit(), teql_dequeue(),
teql_qdisc_init() and teql_destroy() without holding any list lock or RCU protection.
zdi-disclosures@trendmicro.com has demonstrated that the qdisc is freed
after an RCU grace period, but teql_master_xmit() running on another
CPU can still hold a stale pointer into the list, resulting in a
slab-use-after-free:
BUG: KASAN: slab-use-after-free in teql_destroy+0x3ca/0x440 linux/net/sched/sch_teql.c:142
Read of size 8 at addr ffff88802923aa80 by task ip/10024
CPU: 1 UID: 0 PID: 10024 Comm: ip Not tainted 7.1.0-rc5 #1 PREEMPT(lazy)
Hardware name: QEMU Ubuntu 25.10 PC v2 (i440FX + PIIX, + 10.1 machine, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
__dump_stack linux/lib/dump_stack.c:94
dump_stack_lvl+0x100/0x190 linux/lib/dump_stack.c:120
print_address_description linux/mm/kasan/report.c:378
print_report+0x139/0x4ad linux/mm/kasan/report.c:482
kasan_report+0xe4/0x1d0 linux/mm/kasan/report.c:595
teql_destroy+0x3ca/0x440 linux/net/sched/sch_teql.c:142
__qdisc_destroy+0x109/0x540 linux/net/sched/sch_generic.c:1100
qdisc_put+0xad/0xf0 linux/net/sched/sch_generic.c:1128
dev_shutdown+0x1cd/0x450 linux/net/sched/sch_generic.c:1493
unregister_netdevice_many_notify+0xd30/0x24b0 linux/net/core/dev.c:12409
rtnl_delete_link linux/net/core/rtnetlink.c:3552
rtnl_dellink+0x476/0xb50 linux/net/core/rtnetlink.c:3594
rtnetlink_rcv_msg+0x954/0xe80 linux/net/core/rtnetlink.c:6997
netlink_rcv_skb+0x156/0x420 linux/net/netlink/af_netlink.c:2550
netlink_unicast_kernel linux/net/netlink/af_netlink.c:1318
netlink_unicast+0x58d/0x860 linux/net/netlink/af_netlink.c:1344
netlink_sendmsg+0x89a/0xd80 linux/net/netlink/af_netlink.c:1894
sock_sendmsg_nosec linux/net/socket.c:787
__sock_sendmsg linux/net/socket.c:802
____sys_sendmsg+0x9d9/0xb70 linux/net/socket.c:2698
___sys_sendmsg+0x194/0x1e0 linux/net/socket.c:2752
__sys_sendmsg+0x171/0x220 linux/net/socket.c:2784
do_syscall_x64 linux/arch/x86/entry/syscall_64.c:63
do_syscall_64+0xff/0x890 linux/arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f linux/arch/x86/entry/entry_64.S:121
[..]
The zdi-disclosures@trendmicro.com repro created concurrent AF_PACKET senders
on a teql device against a thread that repeatedly adds/deletes the slave qdisc,
together with a SLUB spray that reclaims the freed slot; the resulting
UAF is controllable enough to be turned into a read/write primitive against the
freed qdisc object.
The fix?
Add a per-master slaves_lock spinlock that serializes all mutations of
master->slaves and the NEXT_SLAVE() links in teql_destroy() and
teql_qdisc_init(). teql_master_xmit() and teql_dequeue() take the same
slaves_lock around their own updates of master->slaves.
Pair this with READ_ONCE()/WRITE_ONCE() on the shared pointers and
rcu_read_lock_bh()/rcu_read_unlock_bh() around the list traversal in
teql_master_xmit() and teql_dequeue(), so that readers either observe a fully
linked list or are deferred until the in-flight mutation completes. The two
early-return paths in teql_master_xmit() are updated to release the RCU-bh
read-side critical section before returning, since leaving it held would disable
BH on that CPU for good.
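As a reading aid, the circular-list bookkeeping that the lock protects can be modelled in userspace C. This sketch deliberately leaves out the spinlock, READ_ONCE()/WRITE_ONCE() and RCU, and shows only the list logic plus the "decide under the lock, act after unlock" shape used for dev_reset_queue(). The model_* names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the slave qdisc and the teql master. */
struct model_slave {
	struct model_slave *next;
};

struct model_master {
	struct model_slave *slaves;	/* any member of the ring, or NULL */
};

/* Mirrors the insertion in teql_qdisc_init(): link after the current head,
 * or start a one-element ring. In the patch this runs under slaves_lock.
 */
static void model_add(struct model_master *m, struct model_slave *s)
{
	if (m->slaves) {
		s->next = m->slaves->next;
		m->slaves->next = s;
	} else {
		s->next = s;
		m->slaves = s;
	}
}

/* Mirrors the unlink loop in teql_destroy(). Returns true when the ring
 * became empty, so the caller can do the queue reset after dropping the
 * lock instead of while holding it.
 */
static bool model_del(struct model_master *m, struct model_slave *s)
{
	struct model_slave *prev = m->slaves, *q;
	bool emptied = false;

	if (!prev)
		return false;

	do {
		q = prev->next;
		if (q == s) {
			prev->next = q->next;
			if (q == m->slaves) {
				m->slaves = q->next;
				if (q == m->slaves) {
					/* s was the only member */
					m->slaves = NULL;
					emptied = true;
				}
			}
			break;
		}
	} while ((prev = q) != m->slaves);

	return emptied;
}

/* Counts ring members; only used to inspect the model. */
static int model_count(const struct model_master *m)
{
	const struct model_slave *q = m->slaves;
	int n = 0;

	if (!q)
		return 0;
	do {
		n++;
		q = q->next;
	} while (q != m->slaves);
	return n;
}
```

Every step in model_del() touches more than one pointer, which is why an unserialized writer on another CPU can leave the ring pointing at a freed element.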
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: zdi-disclosures@trendmicro.com
Tested-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
---
net/sched/sch_teql.c | 71 +++++++++++++++++++++++++++++---------------
1 file changed, 47 insertions(+), 24 deletions(-)
diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index e7bbc9e5174d..dacdc46637df 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -53,6 +53,7 @@ struct teql_master {
struct Qdisc_ops qops;
struct net_device *dev;
struct Qdisc *slaves;
+ spinlock_t slaves_lock; /* serializes writes to ->slaves */
struct list_head master_list;
unsigned long tx_bytes;
unsigned long tx_packets;
@@ -101,7 +102,9 @@ teql_dequeue(struct Qdisc *sch)
if (skb == NULL) {
struct net_device *m = qdisc_dev(q);
if (m) {
- dat->m->slaves = sch;
+ spin_lock_bh(&dat->m->slaves_lock);
+ rcu_assign_pointer(dat->m->slaves, sch);
+ spin_unlock_bh(&dat->m->slaves_lock);
netif_wake_queue(m);
}
} else {
@@ -132,34 +135,37 @@ teql_destroy(struct Qdisc *sch)
struct Qdisc *q, *prev;
struct teql_sched_data *dat = qdisc_priv(sch);
struct teql_master *master = dat->m;
+ struct netdev_queue *txq = NULL;
+ bool reset_master_queue = false;
if (!master)
return;
- prev = master->slaves;
+ spin_lock_bh(&master->slaves_lock);
+ prev = READ_ONCE(master->slaves);
if (prev) {
do {
- q = NEXT_SLAVE(prev);
+ q = READ_ONCE(NEXT_SLAVE(prev));
if (q == sch) {
- NEXT_SLAVE(prev) = NEXT_SLAVE(q);
- if (q == master->slaves) {
- master->slaves = NEXT_SLAVE(q);
- if (q == master->slaves) {
- struct netdev_queue *txq;
-
+ WRITE_ONCE(NEXT_SLAVE(prev), READ_ONCE(NEXT_SLAVE(q)));
+ if (q == READ_ONCE(master->slaves)) {
+ WRITE_ONCE(master->slaves, READ_ONCE(NEXT_SLAVE(q)));
+ if (q == READ_ONCE(master->slaves)) {
txq = netdev_get_tx_queue(master->dev, 0);
- master->slaves = NULL;
-
- dev_reset_queue(master->dev,
- txq, NULL);
+ WRITE_ONCE(master->slaves, NULL);
+ reset_master_queue = true;
}
}
skb_queue_purge(&dat->q);
break;
}
- } while ((prev = q) != master->slaves);
+ } while ((prev = q) != READ_ONCE(master->slaves));
}
+ spin_unlock_bh(&master->slaves_lock);
+
+ if (reset_master_queue)
+ dev_reset_queue(master->dev, txq, NULL);
}
static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
@@ -184,7 +190,8 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
skb_queue_head_init(&q->q);
- if (m->slaves) {
+ spin_lock_bh(&m->slaves_lock);
+ if (READ_ONCE(m->slaves)) {
if (m->dev->flags & IFF_UP) {
if ((m->dev->flags & IFF_POINTOPOINT &&
!(dev->flags & IFF_POINTOPOINT)) ||
@@ -192,8 +199,10 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
!(dev->flags & IFF_BROADCAST)) ||
(m->dev->flags & IFF_MULTICAST &&
!(dev->flags & IFF_MULTICAST)) ||
- dev->mtu < m->dev->mtu)
+ dev->mtu < m->dev->mtu) {
+ spin_unlock_bh(&m->slaves_lock);
return -EINVAL;
+ }
} else {
if (!(dev->flags&IFF_POINTOPOINT))
m->dev->flags &= ~IFF_POINTOPOINT;
@@ -204,14 +213,15 @@ static int teql_qdisc_init(struct Qdisc *sch, struct nlattr *opt,
if (dev->mtu < m->dev->mtu)
m->dev->mtu = dev->mtu;
}
- q->next = NEXT_SLAVE(m->slaves);
- NEXT_SLAVE(m->slaves) = sch;
+ WRITE_ONCE(q->next, READ_ONCE(NEXT_SLAVE(m->slaves)));
+ rcu_assign_pointer(NEXT_SLAVE(m->slaves), sch);
} else {
- q->next = sch;
- m->slaves = sch;
+ WRITE_ONCE(q->next, sch);
+ rcu_assign_pointer(m->slaves, sch);
m->dev->mtu = dev->mtu;
m->dev->flags = (m->dev->flags&~FMASK)|(dev->flags&FMASK);
}
+ spin_unlock_bh(&m->slaves_lock);
return 0;
}
@@ -285,7 +295,9 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
int subq = skb_get_queue_mapping(skb);
struct sk_buff *skb_res = NULL;
- start = master->slaves;
+ rcu_read_lock_bh();
+
+ start = rcu_dereference_bh(master->slaves);
restart:
nores = 0;
@@ -317,10 +329,14 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
netdev_start_xmit(skb, slave, slave_txq, false) ==
NETDEV_TX_OK) {
__netif_tx_unlock(slave_txq);
- master->slaves = NEXT_SLAVE(q);
+ spin_lock_bh(&master->slaves_lock);
+ rcu_assign_pointer(master->slaves,
+ rcu_dereference_bh(NEXT_SLAVE(q)));
+ spin_unlock_bh(&master->slaves_lock);
netif_wake_queue(dev);
master->tx_packets++;
master->tx_bytes += length;
+ rcu_read_unlock_bh();
return NETDEV_TX_OK;
}
__netif_tx_unlock(slave_txq);
@@ -329,14 +345,18 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
busy = 1;
break;
case 1:
- master->slaves = NEXT_SLAVE(q);
+ spin_lock_bh(&master->slaves_lock);
+ rcu_assign_pointer(master->slaves,
+ rcu_dereference_bh(NEXT_SLAVE(q)));
+ spin_unlock_bh(&master->slaves_lock);
+ rcu_read_unlock_bh();
return NETDEV_TX_OK;
default:
nores = 1;
break;
}
__skb_pull(skb, skb_network_offset(skb));
- } while ((q = NEXT_SLAVE(q)) != start);
+ } while ((q = rcu_dereference_bh(NEXT_SLAVE(q))) != start);
if (nores && skb_res == NULL) {
skb_res = skb;
@@ -345,12 +365,14 @@ static netdev_tx_t teql_master_xmit(struct sk_buff *skb, struct net_device *dev)
if (busy) {
netif_stop_queue(dev);
+ rcu_read_unlock_bh();
return NETDEV_TX_BUSY;
}
master->tx_errors++;
drop:
master->tx_dropped++;
+ rcu_read_unlock_bh();
dev_kfree_skb(skb);
return NETDEV_TX_OK;
}
@@ -444,6 +466,7 @@ static __init void teql_master_setup(struct net_device *dev)
struct teql_master *master = netdev_priv(dev);
struct Qdisc_ops *ops = &master->qops;
+ spin_lock_init(&master->slaves_lock);
master->dev = dev;
ops->priv_size = sizeof(struct teql_sched_data);
--
2.54.0
* [PATCH bpf 2/2] selftests/bpf: Add test for UDP sock leak on sockmap lookup-bind-release
From: Michal Luczaj @ 2026-06-23 18:03 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Jiayuan Chen, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Alexei Starovoitov, Cong Wang, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
In-Reply-To: <20260623-sockmap-lookup-udp-leak-v1-0-05804f9308e4@rbox.co>
Set up and join a cgroup, then attach a cgroup/connect4 program that runs
sk = bpf_map_lookup_elem(sockmap, 0)
bpf_bind(ctx, sa, sizeof(sa))
bpf_sk_release(sk)
Unpatched kernel leaks the socket.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
.../selftests/bpf/prog_tests/sockmap_basic.c | 50 ++++++++++++++++++++++
.../bpf/progs/test_sockmap_lookup_bind_release.c | 37 ++++++++++++++++
2 files changed, 87 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index cb3229711f93..11972ffdb16e 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -7,6 +7,7 @@
#include "test_progs.h"
#include "test_skmsg_load_helpers.skel.h"
+#include "test_sockmap_lookup_bind_release.skel.h"
#include "test_sockmap_update.skel.h"
#include "test_sockmap_invalid_update.skel.h"
#include "test_sockmap_skb_verdict_attach.skel.h"
@@ -17,6 +18,7 @@
#include "test_sockmap_msg_pop_data.skel.h"
#include "bpf_iter_sockmap.skel.h"
+#include "cgroup_helpers.h"
#include "sockmap_helpers.h"
#define TCP_REPAIR 19 /* TCP sock is under repair right now */
@@ -1373,6 +1375,52 @@ static void test_sockmap_multi_channels(int sotype)
test_sockmap_pass_prog__destroy(skel);
}
+#define LOOKUP_BIND_RELEASE_CG "/sockmap_lookup-bind-release"
+#define LOOKUP_BIND_RELEASE_REP 64
+
+static void test_sockmap_lookup_bind_release(void)
+{
+ struct test_sockmap_lookup_bind_release *skel;
+ struct sockaddr_in sa;
+ int cg, i;
+
+ cg = cgroup_setup_and_join(LOOKUP_BIND_RELEASE_CG);
+ if (!ASSERT_OK_FD(cg, "cgroup_setup_and_join"))
+ return;
+
+ skel = test_sockmap_lookup_bind_release__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "open_and_load"))
+ goto cleanup;
+
+ skel->links.connect = bpf_program__attach_cgroup(skel->progs.connect, cg);
+ if (!ASSERT_OK_PTR(skel->links.connect, "attach_cgroup"))
+ goto destroy;
+
+ sa.sin_family = AF_INET;
+ sa.sin_port = bpf_htons(1234);
+ sa.sin_addr.s_addr = bpf_htonl(INADDR_LOOPBACK);
+
+ for (i = 0; i < LOOKUP_BIND_RELEASE_REP; ++i) {
+ __close_fd int sk;
+
+ sk = xsocket(AF_INET, SOCK_DGRAM, 0);
+ if (sk < 0)
+ break;
+
+ if (xbpf_map_update_elem(bpf_map__fd(skel->maps.sockmap), &u32(0),
+ &sk, BPF_ANY))
+ break;
+
+ if (xconnect(sk, (struct sockaddr *)&sa, sizeof(sa)))
+ break;
+ }
+
+destroy:
+ test_sockmap_lookup_bind_release__destroy(skel);
+cleanup:
+ cleanup_cgroup_environment();
+}
+
void test_sockmap_basic(void)
{
if (test__start_subtest("sockmap create_update_free"))
@@ -1451,4 +1499,6 @@ void test_sockmap_basic(void)
test_sockmap_multi_channels(SOCK_STREAM);
if (test__start_subtest("sockmap udp multi channels"))
test_sockmap_multi_channels(SOCK_DGRAM);
+ if (test__start_subtest("sockmap lookup-bind-release"))
+ test_sockmap_lookup_bind_release();
}
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_lookup_bind_release.c b/tools/testing/selftests/bpf/progs/test_sockmap_lookup_bind_release.c
new file mode 100644
index 000000000000..cc77b193893b
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_sockmap_lookup_bind_release.c
@@ -0,0 +1,37 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+ __uint(type, BPF_MAP_TYPE_SOCKMAP);
+ __uint(max_entries, 1);
+ __type(key, int);
+ __type(value, int);
+} sockmap SEC(".maps");
+
+SEC("cgroup/connect4")
+int connect(struct bpf_sock_addr *ctx)
+{
+ struct bpf_sock *sk;
+ int ret = SK_DROP;
+
+ sk = bpf_map_lookup_elem(&sockmap, &(int){0});
+ if (sk) {
+ if (sk == ctx->sk) {
+ struct sockaddr_in sa = {
+ .sin_family = ctx->user_family,
+ .sin_port = ctx->user_port,
+ .sin_addr.s_addr = ctx->user_ip4
+ };
+
+ ret = !bpf_bind(ctx, (struct sockaddr *)&sa, sizeof(sa));
+ }
+
+ bpf_sk_release(sk);
+ }
+
+ return ret;
+}
+
+char _license[] SEC("license") = "GPL";
--
2.54.0
* [PATCH bpf 0/2] bpf, sockmap: Fix sockmap leaking UDP socks
From: Michal Luczaj @ 2026-06-23 18:03 UTC (permalink / raw)
To: John Fastabend, Jakub Sitnicki, Jiayuan Chen, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Alexei Starovoitov, Cong Wang, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis, Shuah Khan
Cc: netdev, bpf, linux-kernel, linux-kselftest, Michal Luczaj
Fix a UDP socket refcount asymmetry in sockmap lookup/release.
The fix is accompanied by a selftest.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Michal Luczaj (2):
bpf, sockmap: Don't leak UDP socks on lookup-bind-release
selftests/bpf: Add test for UDP sock leak on sockmap lookup-bind-release
net/ipv4/udp_bpf.c | 3 ++
.../selftests/bpf/prog_tests/sockmap_basic.c | 50 ++++++++++++++++++++++
.../bpf/progs/test_sockmap_lookup_bind_release.c | 37 ++++++++++++++++
3 files changed, 90 insertions(+)
---
base-commit: 12091470c6b4c1c14b2de12dcbae2ada6cb6d20b
change-id: 20260617-sockmap-lookup-udp-leak-bc4e5c5481d7
Best regards,
--
Michal Luczaj <mhal@rbox.co>
* Re: [PATCH bpf-next v4 1/3] bpf: Add BPF_FIB_LOOKUP_VLAN flag to bpf_fib_lookup() helper
From: Avinash Duduskar @ 2026-06-23 18:28 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, ast, daniel, andrii
Cc: eddyz87, memxor, martin.lau, song, yonghong.song, jolsa, emil,
john.fastabend, sdf, davem, edumazet, kuba, pabeni, horms, shuah,
hawk, yatsenko, leon.hwang, kpsingh, a.s.protopopov, ameryhung,
rongtao, eyal.birger, bpf, netdev, linux-kernel, linux-kselftest,
dsahern
In-Reply-To: <877bnpeaeq.fsf@toke.dk>
Toke Høiland-Jørgensen <toke@redhat.com> writes:
> I think it's better to just move the assignment of params->ifindex
> entirely into bpf_fib_set_fwd_params(), instead of this restore dance.
> That way this can be simplified to:
>
> err = bpf_fib_set_fwd_params(dev, params, flags, mtu);
> if (!err && fwd_dev)
> *fwd_dev = dev;
> return err;
The caller-side restore is ungainly, agreed, but the assignment can't move
all the way into the helper. The early params->ifindex = dev->ifindex
sits above the neighbour lookup on purpose: that is commit d1c362e1dd68a
("bpf: Always return target ifindex in bpf_fib_lookup"), which took it
out of bpf_fib_set_fwd_params() and put it there so a program still
gets the target ifindex on the BPF_FIB_LKUP_RET_NO_NEIGH path and can
bpf_redirect_neigh() on it. bpf_fib_set_fwd_params() is called only at
the set_fwd_params label, below the NO_NEIGH return (and below the IPv6
NO_SRC_ADDR return), so an assignment living in the helper never runs
on those paths and params->ifindex falls back to the input. That would
change the reported ifindex for plain bpf_fib_lookup() callers hitting
NO_NEIGH, not only the VLAN ones.
I can still get the caller down to your form by keeping the early write
and moving just the VLAN_FAILURE rewind into the helper, with one extra
parameter, the input ifindex saved before the egress write:
err = bpf_fib_set_fwd_params(dev, params, flags, mtu, in_ifindex);
if (!err && fwd_dev)
*fwd_dev = dev;
return err;
and the helper owning the rewind in the unreducible branch:
} else {
params->ifindex = in_ifindex;
return BPF_FIB_LKUP_RET_VLAN_FAILURE;
}
So the restore leaves the caller; the early egress write stays because
NO_NEIGH and NO_SRC_ADDR depend on it.
3/3 adds a NO_NEIGH arm that pins the egress ifindex (input != egress).
With the assignment moved into the helper, that case reports the input
ifindex instead: the return code stays NO_NEIGH and only the ifindex
flips. The arm passes with the early write kept.
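The control flow argued above can be modeled in a few lines of C. This is only an illustrative sketch, not the kernel code: `lookup()`, `set_fwd_params()`, the `have_neigh` and `vlan_ok` flags, and the `RET_*` values are hypothetical stand-ins for the bpf_fib_lookup() paths, bpf_fib_set_fwd_params(), and the BPF_FIB_LKUP_RET_* codes. It assumes the in_ifindex variant, with the early egress write kept in the caller and the rewind owned by the helper.

```c
/* Hypothetical stand-ins for the BPF_FIB_LKUP_RET_* codes. */
enum ret { RET_SUCCESS, RET_NO_NEIGH, RET_VLAN_FAILURE };

/* Reduced to the single field under discussion. */
struct params {
	int ifindex;
};

/* Models the helper: it only runs at the set_fwd_params label, so it
 * can own the VLAN_FAILURE rewind but not the egress write itself. */
static int set_fwd_params(struct params *p, int vlan_ok, int in_ifindex)
{
	if (!vlan_ok) {
		p->ifindex = in_ifindex;	/* rewind to the input ifindex */
		return RET_VLAN_FAILURE;
	}
	return RET_SUCCESS;
}

/* Models the caller: the egress ifindex is written before the
 * neighbour lookup so the NO_NEIGH return still reports it. */
static int lookup(struct params *p, int egress_ifindex, int have_neigh,
		  int vlan_ok)
{
	int in_ifindex = p->ifindex;	/* saved before the egress write */

	p->ifindex = egress_ifindex;	/* early write */
	if (!have_neigh)
		return RET_NO_NEIGH;	/* helper never runs on this path */

	return set_fwd_params(p, vlan_ok, in_ifindex);
}
```

In this model the NO_NEIGH path leaves the egress ifindex in place, and only the VLAN_FAILURE path restores the input value. Dropping the early write from `lookup()` would make the NO_NEIGH path report the input ifindex.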
> If you move the ifdef into the if statement, the if statement can have
> an else-branch that assigns params->ifindex, so you don't need the
> restore dance (see below).
Same constraint: an else-branch inside bpf_fib_set_fwd_params() only
runs when the helper runs, which is never on the NO_NEIGH/NO_SRC_ADDR
returns, so it cannot be the sole writer of the egress ifindex.
Does the in_ifindex version look right to you? The alternative is to
route the error returns through the label so the assignment can live
fully in the helper; threading the return codes back through it works,
but it is its own kind of dance and reads worse to me.
Thanks,
Avinash
* Re: [PATCH net] nfc: nci: fix out-of-bounds write in nci_target_auto_activated()
From: Simon Horman @ 2026-06-23 17:21 UTC (permalink / raw)
To: sam
Cc: Simon Horman, david, davem, edumazet, kuba, pabeni, oe-linux-nfc,
netdev, linux-kernel, stable
In-Reply-To: <20260622145243.3167276-1-sam@bynar.io>
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
nfc: nci: fix out-of-bounds write in nci_target_auto_activated()
This commit adds a bounds check to nci_target_auto_activated() to prevent
an out-of-bounds array write. If ndev->n_targets equals
NCI_MAX_DISCOVERED_TARGETS, it prints a debug message and returns early.
> diff --git a/net/nfc/nci/ntf.c b/net/nfc/nci/ntf.c
> index c96512bb86531..566ca839fa488 100644
> --- a/net/nfc/nci/ntf.c
> +++ b/net/nfc/nci/ntf.c
> @@ -603,6 +603,12 @@ static void nci_target_auto_activated(struct nci_dev *ndev,
> struct nfc_target *target;
> int rc;
>
> + /* This is a new target, check if we've enough room */
> + if (ndev->n_targets == NCI_MAX_DISCOVERED_TARGETS) {
> + pr_debug("not enough room, ignoring new target...\n");
> + return;
[Severity: High]
Does this early return cause the state machine to stall?
Looking at nci_rf_intf_activated_ntf_packet(), the state is transitioned
to NCI_POLL_ACTIVE right before this function is called:
net/nfc/nci/ntf.c:nci_rf_intf_activated_ntf_packet() {
...
atomic_set(&ndev->state, NCI_POLL_ACTIVE);
if (err == NCI_STATUS_OK)
nci_target_auto_activated(ndev, &ntf);
...
}
Because of the early return here, nfc_targets_found() is skipped. Does this
leave the state wedged in NCI_POLL_ACTIVE, permanently blocking future
attempts to stop polling since nci_stop_poll() requires the state to be
NCI_DISCOVERY?
> + }
> +
> target = &ndev->targets[ndev->n_targets];
[ ... ]
[Severity: Critical]
This isn't a bug introduced by this patch, but can uninitialized stack
memory from the ntf union be copied into persistent device structures?
In net/nfc/nci/ntf.c:nci_rf_discover_ntf_packet(), the local ntf struct
is declared on the stack:
struct nci_rf_discover_ntf ntf;
If the incoming packet has ntf.rf_tech_specific_params_len == 0, the
extraction routines are bypassed, leaving the rf_tech_specific_params union
uninitialized. The struct is then passed to nci_add_new_target():
if (add_target == true)
nci_add_new_target(ndev, &ntf);
This eventually calls nci_add_new_protocol(), which blindly reads from this
union:
nfca_poll = (struct rf_tech_specific_params_nfca_poll *)params;
target->sens_res = nfca_poll->sens_res;
target->sel_res = nfca_poll->sel_res;
target->nfcid1_len = nfca_poll->nfcid1_len;
Could this uninitialized stack garbage be copied into ndev->targets, and
later broadcast to userspace via Netlink in nfc_targets_found()?
[Severity: Critical]
This isn't a bug introduced by this patch, but does the general bytes length
calculation underflow?
In net/nfc/nci/ntf.c:nci_store_general_bytes_nfc_dep():
ndev->remote_gb_len = min_t(__u8,
(ntf->activation_params.listen_nfc_dep.atr_req_len -
NFC_ATR_REQ_GT_OFFSET), NFC_ATR_REQ_GB_MAXSIZE);
memcpy(ndev->remote_gb,
(ntf->activation_params.listen_nfc_dep.atr_req +
NFC_ATR_REQ_GT_OFFSET), ndev->remote_gb_len);
The length is computed by subtracting NFC_ATR_REQ_GT_OFFSET (14) from
atr_req_len. If atr_req_len is less than 14, this subtraction will
underflow.
Because of the min_t cast, this large underflowed value is bounded to
NFC_ATR_REQ_GB_MAXSIZE (48). The memcpy() then copies 48 bytes starting
from atr_req + 14 into ndev->remote_gb.
Since the source buffer is part of the ntf union allocated on the stack
in nci_rf_intf_activated_ntf_packet(), could this leak up to 48 bytes of
uninitialized kernel stack memory to userspace?
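The arithmetic described above can be reproduced in userspace. The following is a minimal sketch, assuming atr_req_len is an 8-bit unsigned field and using the constant values quoted in the review (14 and 48). `min_u8()`, `gb_len_as_quoted()` and `gb_len_checked()` are hypothetical names for illustration, and the guarded variant is one possible shape of a check, not a proposed fix.

```c
#include <stdint.h>

/* Values as quoted in the review; defined locally so the sketch
 * stands alone. */
#define NFC_ATR_REQ_GT_OFFSET	14
#define NFC_ATR_REQ_GB_MAXSIZE	48

/* Userspace stand-in for min_t(__u8, a, b): both operands are
 * converted to an 8-bit unsigned value before the comparison. */
static uint8_t min_u8(uint8_t a, uint8_t b)
{
	return a < b ? a : b;
}

/* Mirrors the length computation as quoted. atr_req_len promotes to
 * int, so a short ATR_REQ gives a negative difference, and the
 * conversion to uint8_t wraps it modulo 256. */
static uint8_t gb_len_as_quoted(uint8_t atr_req_len)
{
	return min_u8((uint8_t)(atr_req_len - NFC_ATR_REQ_GT_OFFSET),
		      NFC_ATR_REQ_GB_MAXSIZE);
}

/* Hypothetical guarded variant: reject lengths shorter than the
 * offset before subtracting. */
static uint8_t gb_len_checked(uint8_t atr_req_len)
{
	if (atr_req_len < NFC_ATR_REQ_GT_OFFSET)
		return 0;

	return min_u8((uint8_t)(atr_req_len - NFC_ATR_REQ_GT_OFFSET),
		      NFC_ATR_REQ_GB_MAXSIZE);
}
```

With an input of 10, the difference is -4, which wraps to 252 and is then capped at 48, so the unguarded version asks for a full 48-byte copy even though fewer than 14 bytes were supplied.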
[Severity: High]
This isn't a bug introduced by this patch, but does the lack of skb->len
validation in NCI response handlers allow out-of-bounds reads?
In nci_core_init_rsp_packet_v1() in net/nfc/nci/rsp.c, fields are read
and arrays are copied from the payload:
ndev->num_supported_rf_interfaces =
rsp_1->num_supported_rf_interfaces;
memcpy(ndev->supported_rf_interfaces, rsp_1->supported_rf_interfaces,
ndev->num_supported_rf_interfaces);
Similarly in nci_core_init_rsp_packet_v2():
while (rf_interface_idx < ndev->num_supported_rf_interfaces) {
ndev->supported_rf_interfaces[rf_interface_idx++] =
*supported_rf_interface++;
}
Since the NCI control header is stripped via skb_pull(), leaving the
payload length in skb->len, what happens if an NFCC sends a short response
packet (e.g. plen = 1)? Could this cause the handlers to read adjacent
kernel slab memory?
[Severity: High]
This isn't a bug introduced by this patch, but is there a missing boundary
validation when extracting activation parameters?
In net/nfc/nci/ntf.c:nci_extract_activation_params_iso_dep():
nfca_poll->rats_res_len = min_t(__u8, *data++, NFC_ATS_MAXSIZE);
if (nfca_poll->rats_res_len > 0) {
memcpy(nfca_poll->rats_res, data, nfca_poll->rats_res_len);
}
And in nci_extract_activation_params_nfc_dep():
poll->atr_res_len = min_t(__u8, *data++, NFC_ATR_RES_MAXSIZE - 2);
if (poll->atr_res_len > 0)
memcpy(poll->atr_res, data, poll->atr_res_len);
These inner lengths are bounded by a maximum size like NFC_ATS_MAXSIZE, but
they are not validated against the remaining bytes in the SKB (skb->len).
If a malformed packet provides a large inner length but a short payload,
could this memcpy() read out of bounds into the device structure,
potentially leaking adjacent slab memory to userspace later?
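One way to picture the missing check is a length-prefixed copy that is bounded by both the destination size and the bytes actually left in the payload. The sketch below is plain userspace C under stated assumptions: `copy_len_prefixed()` and its parameters are hypothetical, and `remaining` stands in for whatever the handler knows about the unread part of skb->len. It is not the kernel's parsing code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a one-byte-length-prefixed field from buf into dst.
 *
 * remaining: bytes left in the payload, starting at buf
 * dst_max:   capacity of dst (plays the role of the min_t() cap)
 *
 * Returns the number of payload bytes consumed, or -1 if the payload
 * is too short to hold the field it announces. */
static int copy_len_prefixed(const uint8_t *buf, size_t remaining,
			     uint8_t *dst, size_t dst_max,
			     uint8_t *out_len)
{
	size_t len;

	if (remaining < 1)
		return -1;		/* no room for the length byte */

	len = buf[0];
	if (len > dst_max)
		len = dst_max;		/* cap to the destination size */

	if (len > remaining - 1)
		return -1;		/* announced more than was sent */

	memcpy(dst, buf + 1, len);
	*out_len = (uint8_t)len;

	return (int)(1 + len);
}
```

The second comparison is the one the quoted extraction routines appear to lack: the cap alone limits the write into the device structure, but not the read from the packet.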
[Severity: High]
This isn't a bug introduced by this patch, but is there a potential
priority inversion or timeout deadlock involving the device lock?
Functions like nfc_stop_poll() in net/nfc/core.c acquire the device lock:
device_lock(&dev->dev);
...
if (dev->ops->stop_poll)
dev->ops->stop_poll(dev);
This eventually calls nci_request(), which synchronously waits for a
completion signaled by the nci_rx_work thread. However, if an NTF packet
is received first, nci_rx_work processes it and invokes
nfc_targets_found(), which also attempts to acquire the device lock:
device_lock(&dev->dev);
Since the calling thread already holds the device lock, nci_rx_work blocks
indefinitely. Because the RX worker is blocked, it cannot process the
pending RSP, causing nci_request() to time out and fail. Could this
deadlock the RX thread?