* Re: [RFC PATCH bpf-next 0/5] tcp: opportunistic loopback splice for BPF-paired sockets
From: Jakub Kicinski @ 2026-06-13 17:57 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Cong Wang, Network Development, bpf, John Fastabend,
Jakub Sitnicki, Jiayuan Chen, Hemanth Malla, zijianzhang
In-Reply-To: <CAADnVQ+KTNKkf_Tc-RZR-g8wEfJU4qWcOPnjDbA2=PEtZsYnYg@mail.gmail.com>
On Fri, 12 Jun 2026 09:01:43 -0700 Alexei Starovoitov wrote:
> Just saying that the code is free nowadays, so whether it's 1k lines
> or 10 lines is irrelevant for the discussion.
>
> As far as the idea goes, I think, it would be interesting in pre-AI era,
> but today splice and friends are a prime target for bugs and more bugs.
> skmsg and tcp_bpf are reeling from unfixed bugs too,
> so my take is that we should not add any new features to skmsg
> and instead deprecate what is already there.
100% agreed. There are so many unfixed skmsg bugs it's hard to know
where to start :( Kernel "intelligence" to help unoptimized applications
is particularly unappealing right now.
* Re: [PATCH bpf v5 1/2] bpf: Run generic devmap egress prog on private skb
From: Alexei Starovoitov @ 2026-06-13 17:53 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Sun Jian, bpf, Network Development, LKML,
open list:KERNEL SELFTEST FRAMEWORK, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
David S. Miller, Jesper Dangaard Brouer, John Fastabend,
Stanislav Fomichev, Shuah Khan, Jiayuan Chen,
Toke Høiland-Jørgensen, Menglong Dong, Emil Tsalapatis
In-Reply-To: <20260613102549.0061a875@kernel.org>
On Sat, Jun 13, 2026 at 10:25 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 12 Jun 2026 19:40:31 +0800 Sun Jian wrote:
> > Suggested-by: Jakub Kicinski <kuba@kernel.org>
>
> I did not suggest this
ohh. I didn't follow discussion closely.
Do you want me to revert the whole set or just remove that line?
* Re: [PATCH net-next v3 0/4] vsock: consolidate acceptq accounting into core helpers
From: patchwork-bot+netdevbpf @ 2026-06-13 17:50 UTC (permalink / raw)
To: Raf Dickson
Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
vishnu.dasa, bcm-kernel-feedback-list, bobbyeshleman, leonardi,
horms, edumazet, kuba
In-Reply-To: <20260612045216.105796-1-rafdog35@gmail.com>
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 12 Jun 2026 04:52:12 +0000 you wrote:
> These patches follow up on commit c05fa14db43e
> ("vsock/vmci: fix sk_ack_backlog leak on failed handshake")
> by consolidating sk_acceptq_added() and sk_acceptq_removed() into
> the core vsock helpers so transports cannot forget them.
>
> Changes since v2:
> - Add vsock_pending_to_accept() helper for the vmci pending->accept
> transition, avoiding a double sk_acceptq_added() (Stefano Garzarella)
> - Split into 4 patches for bisectability (Stefano Garzarella)
> - Fold sk_acceptq_added() into vsock_add_pending() as a separate patch
>
> [...]
Here is the summary with links:
- [net-next,v3,1/4] vsock: introduce vsock_pending_to_accept() helper
https://git.kernel.org/netdev/net-next/c/77eee189397d
- [net-next,v3,2/4] vsock: fold sk_acceptq_added() into vsock_add_pending()
https://git.kernel.org/netdev/net-next/c/a6fd2cfdcdf5
- [net-next,v3,3/4] vsock: fold sk_acceptq_added() into vsock_enqueue_accept()
https://git.kernel.org/netdev/net-next/c/6f6f9b65a991
- [net-next,v3,4/4] vsock: fold sk_acceptq_removed() into vsock_remove_pending()
https://git.kernel.org/netdev/net-next/c/27fc25bb82e6
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net v3] net: wwan: t7xx: check skb_clone in control TX
From: patchwork-bot+netdevbpf @ 2026-06-13 17:50 UTC (permalink / raw)
To: Ruoyu Wang
Cc: chandrashekar.devegowda, haijun.liu, ricardo.martinez,
loic.poulain, ryazanov.s.a, johannes, andrew+netdev, davem,
edumazet, kuba, pabeni, netdev, linux-kernel
In-Reply-To: <20260612035613.1192486-1-ruoyuw560@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 12 Jun 2026 11:56:13 +0800 you wrote:
> t7xx_port_ctrl_tx() clones each skb fragment before passing it to the
> port transmit path. The clone is used immediately to set cloned->len, so
> an skb_clone() failure results in a NULL pointer dereference.
>
> Check the clone before using it. If previous fragments were already
> queued, preserve the driver's existing partial-write behavior by
> returning the number of bytes submitted so far.
>
> [...]
Here is the summary with links:
- [net,v3] net: wwan: t7xx: check skb_clone in control TX
https://git.kernel.org/netdev/net/c/05f789fa90d9
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* [PATCH net v2 2/2] selftests/tc-testing: act_ct: add TDC test for skb cb preservation across defrag
From: Ren Wei @ 2026-06-13 17:42 UTC (permalink / raw)
To: netdev, linux-kselftest, linux-kernel
Cc: jhs, jiri, kuba, paulb, victor, yuantan098, yifanwucs,
tomapufckgml, bird, xizh2024, n05ec
In-Reply-To: <cover.1781358691.git.xizh2024@lzu.edu.cn>
From: Zihan Xi <xizh2024@lzu.edu.cn>
Add a tc-testing case that sends IPv4 fragments through act_ct on clsact
egress while a root prio qdisc is present on the transmit path.
The test verifies that packet processing and qdisc accounting continue
to work after conntrack defragmentation, covering tc_skb_cb preservation
across defragmentation.
Signed-off-by: Zihan Xi <xizh2024@lzu.edu.cn>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
changes in v2:
- Add tc-testing case 9c2a for skb cb preservation across defrag
- v1 Link: https://lore.kernel.org/all/20260611154939.2615919-1-n05ec@lzu.edu.cn/
.../tc-testing/tc-tests/actions/ct.json | 38 +++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/ct.json b/tools/testing/selftests/tc-testing/tc-tests/actions/ct.json
index 33bb8f3ff8ed..da65f838bd52 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/ct.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/ct.json
@@ -664,5 +664,43 @@
"teardown": [
"$TC qdisc del dev $DEV1 ingress_block 21 clsact"
]
+ },
+ {
+ "id": "9c2a",
+ "name": "Act_ct preserves skb cb across defrag before prio dequeue",
+ "category": [
+ "actions",
+ "ct",
+ "scapy"
+ ],
+ "plugins": {
+ "requires": [
+ "nsPlugin",
+ "scapyPlugin"
+ ]
+ },
+ "setup": [
+ "$TC qdisc add dev $DUMMY root handle 1: prio",
+ "$TC qdisc add dev $DUMMY clsact",
+ "$TC qdisc add dev $DEV1 clsact",
+ "$TC filter add dev $DEV1 ingress protocol ip prio 1 matchall action mirred egress redirect dev $DUMMY"
+ ],
+ "cmdUnderTest": "$TC filter add dev $DUMMY egress protocol ip prio 1 matchall action ct zone 1 pipe",
+ "scapy": [
+ {
+ "iface": "$DEV0",
+ "count": 1,
+ "packet": "[Ether()/frag for frag in fragment(IP(src='10.0.0.10', dst='10.0.0.1', id=1)/UDP(sport=12345, dport=9)/Raw(b'A' * 4000), fragsize=1400)]"
+ }
+ ],
+ "expExitCode": "0",
+ "verifyCmd": "$TC -s qdisc show dev $DUMMY | grep -A 1 '^qdisc prio 1:'",
+ "matchPattern": "Sent [1-9][0-9]* bytes [1-9][0-9]* pkt",
+ "matchCount": "1",
+ "teardown": [
+ "$TC qdisc del dev $DEV1 clsact",
+ "$TC qdisc del dev $DUMMY clsact",
+ "$TC qdisc del dev $DUMMY root handle 1:"
+ ]
}
]
--
2.43.0
* [PATCH net v2 1/2] net/sched: act_ct: preserve tc_skb_cb across defragmentation
From: Ren Wei @ 2026-06-13 17:42 UTC (permalink / raw)
To: netdev
Cc: jhs, jiri, kuba, paulb, victor, yuantan098, yifanwucs,
tomapufckgml, bird, xizh2024, n05ec
In-Reply-To: <cover.1781358691.git.xizh2024@lzu.edu.cn>
From: Zihan Xi <xizh2024@lzu.edu.cn>
tcf_ct_handle_fragments() calls nf_ct_handle_fragments() without saving
and restoring skb->cb. The defrag helper clears IPCB/IP6CB, which aliases
the tc_skb_cb/qdisc_skb_cb control buffer. Fragmented traffic through
act_ct therefore loses qdisc metadata such as pkt_segs and can trigger
WARN_ON_ONCE() in qdisc_pkt_segs() when panic_on_warn is enabled.
Save and restore the full tc_skb_cb around nf_ct_handle_fragments(),
matching the pattern used by ovs_ct_handle_fragments().
Fixes: ec624fe740b4 ("net/sched: Extend qdisc control block with tc control block")
Cc: stable@vger.kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Assisted-by: Codex:gpt-5.4
Signed-off-by: Zihan Xi <xizh2024@lzu.edu.cn>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
changes in v2:
- Add TDC selftest in patch 2 per maintainer feedback
- v1 Link: https://lore.kernel.org/all/20260611154939.2615919-1-n05ec@lzu.edu.cn/
net/sched/act_ct.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c
index 6158e13c98d3..ebd40daf05a6 100644
--- a/net/sched/act_ct.c
+++ b/net/sched/act_ct.c
@@ -845,10 +845,10 @@ static int tcf_ct_handle_fragments(struct net *net, struct sk_buff *skb,
{
enum ip_conntrack_info ctinfo;
struct nf_conn *ct;
+ struct tc_skb_cb cb;
int err = 0;
bool frag;
u8 proto;
- u16 mru;
/* Previously seen (loopback)? Ignore. */
ct = nf_ct_get(skb, &ctinfo);
@@ -862,12 +862,13 @@ static int tcf_ct_handle_fragments(struct net *net, struct sk_buff *skb,
if (err || !frag)
return err;
- err = nf_ct_handle_fragments(net, skb, zone, family, &proto, &mru);
+ cb = *tc_skb_cb(skb);
+ err = nf_ct_handle_fragments(net, skb, zone, family, &proto, &cb.mru);
if (err)
return err;
*defrag = true;
- tc_skb_cb(skb)->mru = mru;
+ *tc_skb_cb(skb) = cb;
return 0;
}
--
2.43.0
* [PATCH net v2 0/2] net/sched: act_ct: preserve tc_skb_cb across defragmentation
From: Ren Wei @ 2026-06-13 17:42 UTC (permalink / raw)
To: netdev, linux-kselftest, linux-kernel
Cc: jhs, jiri, kuba, paulb, victor, yuantan098, yifanwucs,
tomapufckgml, bird, xizh2024, n05ec
From: Zihan Xi <xizh2024@lzu.edu.cn>
Hi Linux kernel maintainers,
We found and validated an issue in net/sched/act_ct.c. The bug is
reachable when configuring TC with act_ct on a netdev (requires
CAP_NET_ADMIN). We have tested it, and the fix should not affect
other functionality.
We provide bug details, a PoC, and a crash log below.
v2 adds a tc-testing (TDC) selftest case in patch 2, per maintainer
feedback.
---- details below ----
Bug details:
tcf_ct_handle_fragments() calls nf_ct_handle_fragments() without
saving and restoring skb->cb. The defrag helper clears IPCB/IP6CB,
which aliases the tc_skb_cb/qdisc_skb_cb control buffer in
include/net/sch_generic.h. Fragmented traffic through act_ct
therefore loses qdisc metadata such as pkt_segs.
Later qdisc dequeue paths call qdisc_bstats_update() ->
qdisc_pkt_segs(). For a non-GSO skb, clobbered pkt_segs == 0 trips
DEBUG_NET_WARN_ON_ONCE() in qdisc_pkt_segs(). With panic_on_warn=1
the kernel panics.
Unlike ovs_ct_handle_fragments() in net/openvswitch/conntrack.c, the
act_ct caller only restored mru after defrag, not the full control
buffer. The attached patch saves and restores struct tc_skb_cb around
nf_ct_handle_fragments(), matching the OVS pattern.
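To make the aliasing concrete, here is a minimal user-space sketch of the
pattern. It is not the kernel code: every toy_* name, the two-field layout,
and the MRU value of 1500 are assumptions invented for illustration. Only the
shape (a helper that wipes the shared control buffer, and a caller that either
writes back mru alone or restores the whole saved view) mirrors the
description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CB_SIZE 48 /* skb->cb is a 48-byte scratch area in the kernel */

/* Hypothetical tc-layer view of the control buffer (not the real layout). */
struct toy_tc_cb {
	uint16_t pkt_segs; /* qdisc metadata that must survive defrag */
	uint16_t mru;
};

/* Hypothetical skb: one buffer, several layers reading it differently. */
struct toy_skb {
	union {
		unsigned char raw[CB_SIZE];
		struct toy_tc_cb tc;
	} cb;
};

/* Stand-in for the defrag helper: clears the whole control buffer, as
 * clearing IPCB would, and reports an MRU through the out parameter.
 */
int toy_defrag(struct toy_skb *skb, uint16_t *mru)
{
	memset(skb->cb.raw, 0, sizeof(skb->cb.raw));
	*mru = 1500; /* assumed value, for the example only */
	return 0;
}

/* Old shape: only mru is written back, pkt_segs stays zeroed. */
int toy_handle_fragments_mru_only(struct toy_skb *skb)
{
	uint16_t mru;
	int err = toy_defrag(skb, &mru);

	if (err)
		return err;
	skb->cb.tc.mru = mru;
	return 0;
}

/* Patched shape: snapshot the tc view, let the helper fill the snapshot's
 * mru, then restore the snapshot over the clobbered buffer.
 */
int toy_handle_fragments_full_restore(struct toy_skb *skb)
{
	struct toy_tc_cb cb = skb->cb.tc;
	int err = toy_defrag(skb, &cb.mru);

	if (err)
		return err;
	skb->cb.tc = cb;
	return 0;
}
```

In this model, a toy_skb whose pkt_segs is 1 comes out of the mru-only variant
with pkt_segs == 0, while the full-restore variant keeps pkt_segs == 1 and
still picks up the new mru.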
Reproducer:
Run as root in the guest (QEMU bullseye image, eth0):
chmod +x ./poc.sh
./poc.sh eth0 10.0.2.2 100
The script installs a root prio qdisc, clsact egress with "action ct",
then sends oversized UDP datagrams with PMTUD disabled to force IPv4
fragmentation through the act_ct defrag path.
We run the PoC in a 2 vCPU, 2 GB RAM x86 QEMU environment.
------BEGIN poc.sh------
#!/bin/sh
set -eu
IFACE="${1:-eth0}"
DST="${2:-10.0.2.2}"
COUNT="${3:-100}"
sysctl -w kernel.panic_on_warn=1 >/dev/null
tc qdisc del dev "$IFACE" clsact 2>/dev/null || true
tc qdisc del dev "$IFACE" root 2>/dev/null || true
tc qdisc add dev "$IFACE" root handle 1: prio
tc qdisc add dev "$IFACE" clsact
tc filter add dev "$IFACE" egress protocol ip pref 1 u32 \
match u32 0 0 action ct zone 1 pipe
python3 - "$DST" "$COUNT" <<'PY'
import socket
import sys
import time
dst = sys.argv[1]
count = int(sys.argv[2])
IP_MTU_DISCOVER = 10
IP_PMTUDISC_DONT = 0
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DONT)
payload = b"A" * 4000
for _ in range(count):
s.sendto(payload, (dst, 9))
time.sleep(0.01)
PY
------END poc.sh------
----BEGIN crash log----
[ 549.900801][T10210] Kernel panic - not syncing: kernel: panic_on_warn set ...
[ 549.901406][T10210] CPU: 2 UID: 0 PID: 10210 Comm: python3 Not tainted 7.1.0-rc1 #2 PREEMPT(full)
[ 549.902720][T10210] Call Trace:
[ 549.903756][T10210] ? qdisc_dequeue_head+0x287/0x370
[ 549.904713][T10210] check_panic_on_warn+0x61/0x80
[ 549.905053][T10210] __warn+0xe8/0x330
[ 549.905345][T10210] ? qdisc_dequeue_head+0x287/0x370
[ 549.909442][T10210] RIP: 0010:qdisc_dequeue_head+0x287/0x370
[ 549.914217][T10210] prio_dequeue+0x40c/0x6a0
[ 549.914539][T10210] __qdisc_run+0x170/0x1b30
[ 549.915561][T10210] __dev_queue_xmit+0x25e6/0x3ac0
[ 549.920352][T10210] ip_do_fragment+0x1188/0x19a0
[ 549.924214][T10210] udp_send_skb+0x885/0x1270
[ 549.924556][T10210] udp_sendmsg+0x13f3/0x20a0
-----END crash log-----
Best regards,
Zihan Xi
Zihan Xi (2):
net/sched: act_ct: preserve tc_skb_cb across defragmentation
selftests/tc-testing: act_ct: add TDC test for skb cb preservation
across defrag
net/sched/act_ct.c | 7 ++--
.../tc-testing/tc-tests/actions/ct.json | 38 +++++++++++++++++++
2 files changed, 42 insertions(+), 3 deletions(-)
--
2.43.0
* Re: [PATCH net-next v3 4/4] vsock: fold sk_acceptq_removed() into vsock_remove_pending()
From: Jakub Kicinski @ 2026-06-13 17:40 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Raf Dickson, netdev, virtualization, pabeni, stefanha,
bryan-bt.tan, vishnu.dasa, bcm-kernel-feedback-list,
bobbyeshleman, leonardi, horms, edumazet
In-Reply-To: <aivjf4TZU4Q_s20y@sgarzare-redhat>
On Fri, 12 Jun 2026 12:48:14 +0200 Stefano Garzarella wrote:
> >@@ -773,7 +774,6 @@ static void vsock_pending_work(struct work_struct *work)
> > if (vsock_is_pending(sk)) {
> > vsock_remove_pending(listener, sk);
> >
> ^^
> There is an extra blank line that we can now remove here.
>
> BTW, the code LGTM:
Since the merge window is upon us - also updated when applying.
* Re: [PATCH v2 net-next 0/2] netdevsim: add fake FT/CLS_FLOWER offload
From: patchwork-bot+netdevbpf @ 2026-06-13 17:40 UTC (permalink / raw)
To: Florian Westphal
Cc: netdev, pabeni, davem, edumazet, kuba, netfilter-devel, pablo
In-Reply-To: <20260612092209.11966-1-fw@strlen.de>
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 12 Jun 2026 11:22:07 +0200 you wrote:
> v2: fix up error reporting via extack
> shellcheck cleanups
> sort config toggles
>
> 1) Enable nf_tables offload control plane testing in netdevsim. Tag
> existing offload fn to allow error injection for testing rollback and abort
> logic.
>
> [...]
Here is the summary with links:
- [v2,net-next,1/2] netdevsim: tc: allow to test nf_tables offload control plane code
https://git.kernel.org/netdev/net-next/c/07ca2ab4ce84
- [v2,net-next,2/2] selftests: netfilter: add phony nft_offload test
https://git.kernel.org/netdev/net-next/c/5394aa0bb00d
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net-next v2] vsock/vmci: use sk_acceptq_is_full() helper
From: patchwork-bot+netdevbpf @ 2026-06-13 17:40 UTC (permalink / raw)
To: Raf Dickson
Cc: netdev, virtualization, pabeni, sgarzare, stefanha, bryan-bt.tan,
vishnu.dasa, bcm-kernel-feedback-list, leonardi, horms, edumazet,
kuba
In-Reply-To: <20260612045842.122207-1-rafdog35@gmail.com>
Hello:
This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 12 Jun 2026 04:58:42 +0000 you wrote:
> Replace the open-coded backlog check with sk_acceptq_is_full().
> The helper uses > instead of >=, which is the correct comparison
> per commit 64a146513f8f ("[NET]: Revert incorrect accept queue
> backlog changes."), and adds READ_ONCE() for proper memory ordering.
>
> Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
> Signed-off-by: Raf Dickson <rafdog35@gmail.com>
>
> [...]
Here is the summary with links:
- [net-next,v2] vsock/vmci: use sk_acceptq_is_full() helper
https://git.kernel.org/netdev/net-next/c/4ff2e84ff1b3
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net] net: ethernet: mtk_wed: debugfs: correct index in wed_amsdu_show()
From: patchwork-bot+netdevbpf @ 2026-06-13 17:40 UTC (permalink / raw)
To: Wentao Guan
Cc: lorenzo, nbd, sujuan.chen, netdev, linux-kernel, niecheng1,
zhanjun
In-Reply-To: <20260612064501.203058-1-guanwentao@uniontech.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 12 Jun 2026 14:45:01 +0800 you wrote:
> WED_MON_AMSDU_ENG_CNT point to different entry by 'base+n*offset' mode,
> correct the wed amsdu entry number in wed_amsdu_show().
>
> Fixes: 3f3de094e8342 ("net: ethernet: mtk_wed: debugfs: add WED 3.0 debugfs entries")
> Assisted-by: Copilot:gpt-5.2
> Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
>
> [...]
Here is the summary with links:
- [net] net: ethernet: mtk_wed: debugfs: correct index in wed_amsdu_show()
https://git.kernel.org/netdev/net/c/14a8bc41ce9e
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH v2] net: airoha: Fix error handling in airoha_ppe_flush_sram_entries()
From: patchwork-bot+netdevbpf @ 2026-06-13 17:40 UTC (permalink / raw)
To: Wayen.Yan; +Cc: netdev, lorenzo, linux-arm-kernel, linux-mediatek
In-Reply-To: <6a2bd37a.4034e349.1b41bb.1caf@mx.google.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 12 Jun 2026 17:37:00 +0800 you wrote:
> In airoha_ppe_flush_sram_entries(), the outer "err" variable was never
> updated when the inner loop variable shadowed it, causing the function
> to always return 0 even when airoha_ppe_foe_commit_sram_entry() fails.
>
> Drop the outer "err" variable and return directly on error, propagating
> the error code from airoha_ppe_foe_commit_sram_entry() correctly.
>
> [...]
Here is the summary with links:
- [v2] net: airoha: Fix error handling in airoha_ppe_flush_sram_entries()
https://git.kernel.org/netdev/net/c/d7d81b003013
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net-next v2] vsock/vmci: use sk_acceptq_is_full() helper
From: Jakub Kicinski @ 2026-06-13 17:37 UTC (permalink / raw)
To: Stefano Garzarella
Cc: Raf Dickson, netdev, virtualization, pabeni, stefanha,
bryan-bt.tan, vishnu.dasa, bcm-kernel-feedback-list, leonardi,
horms, edumazet
In-Reply-To: <aivKma8mRjTXV0BM@sgarzare-redhat>
On Fri, 12 Jun 2026 11:03:24 +0200 Stefano Garzarella wrote:
> nit: title should be updated since now this is not just vmci
> (e.g. vsock: use sk_acceptq_is_full() helper in all transports)
>
> Not sure if it can be fixed while applying by netdev maintainers.
Updated and applied, thanks!
* Re: [PATCH bpf v5 1/2] bpf: Run generic devmap egress prog on private skb
From: Jakub Kicinski @ 2026-06-13 17:25 UTC (permalink / raw)
To: Sun Jian
Cc: bpf, netdev, linux-kernel, linux-kselftest, ast, daniel, andrii,
martin.lau, davem, hawk, john.fastabend, sdf, shuah, jiayuan.chen,
toke, menglong.dong, emil
In-Reply-To: <20260612114032.244616-2-sun.jian.kdev@gmail.com>
On Fri, 12 Jun 2026 19:40:31 +0800 Sun Jian wrote:
> Suggested-by: Jakub Kicinski <kuba@kernel.org>
I did not suggest this
* Re: [PATCH net-next] tcp: refine tcp_sequence() for the FIN exception
From: Simon Baatz @ 2026-06-13 17:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Neal Cardwell, Kuniyuki Iwashima, netdev, eric.dumazet
In-Reply-To: <aigpnYzdgKSF8FZ4@gandalf.schnuecks.de>
On Tue, Jun 09, 2026 at 04:56:29PM +0200, Simon Baatz wrote:
> On Mon, Jun 08, 2026 at 05:45:30PM -0700, Eric Dumazet wrote:
> > On Mon, Jun 8, 2026 at 3:12 PM Simon Baatz <gmbnomis@gmail.com> wrote:
> > >
> > > Hi Eric,
> > >
> > > On Mon, Jun 08, 2026 at 03:14:52PM +0000, Eric Dumazet wrote:
> > > > Commit 0e24d17bd966 ("tcp: implement RFC 7323 window retraction
> > > > receiver requirements") removed the special FIN case that
> > > > was added in commit 1e3bb184e941 ("tcp: re-enable acceptance of
> > > > FIN packets when RWIN is 0").
> > >
> > > Commit 0e24d17bd966 did not remove the special handling; it is still
> > > present and covered by the test "tcp_rcv_zero_wnd_fin.pkt".
> > >
> > > > If a peer sends a segment containing data and a FIN flag before
> > > > it learns about our window retraction and has a buggy TCP stack,
> > > > it might place the FIN one byte beyond what it thinks is the
> > > > right edge of the window (i.e., max_window_edge + 1).
> > >
> > > The FIN exception in tcp_data_queue() is not a generic allowance for
> > > incorrect FIN handling. It is much more specific and only applies
> > > when:
> > >
> > > 1. the packet is in-sequence
> > > 2. RWIN == 0
> > > 3. the packet is a bare FIN
> > >
> > > > The data portion (end_seq - th->fin) will end exactly at max_window_edge.
> > > > In this case, we will drop the packet if our receive queue is not empty,
> > > > even though the data was sent within the window we previously allowed.
> > > >
> > > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > > Cc: Simon Baatz <gmbnomis@gmail.com>
> > > > ---
> > > > net/ipv4/tcp_input.c | 8 +++++---
> > > > 1 file changed, 5 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > > index ab7a4e5435a8a2cbb532d42c54af76d8541c903b..8560a9c6d38207c098d673497caf2c7652c36f5c 100644
> > > > --- a/net/ipv4/tcp_input.c
> > > > +++ b/net/ipv4/tcp_input.c
> > > > @@ -4812,18 +4812,20 @@ static enum skb_drop_reason tcp_sequence(const struct sock *sk,
> > > > const struct tcphdr *th)
> > > > {
> > > > const struct tcp_sock *tp = tcp_sk(sk);
> > > > + u32 seq_limit;
> > > >
> > > > if (before(end_seq, tp->rcv_wup))
> > > > return SKB_DROP_REASON_TCP_OLD_SEQUENCE;
> > > >
> > > > - if (unlikely(after(end_seq, tp->rcv_nxt + tcp_max_receive_window(tp)))) {
> > > > + seq_limit = tp->rcv_nxt + tcp_max_receive_window(tp);
> > > > + if (unlikely(after(end_seq, seq_limit))) {
> > > > /* Some stacks are known to handle FIN incorrectly; allow the
> > > > * FIN to extend beyond the window and check it in detail later.
> > > > */
> > > > - if (!after(end_seq - th->fin, tp->rcv_nxt + tcp_receive_window(tp)))
> > > > + if (!after(end_seq - th->fin, seq_limit))
> > > > return SKB_NOT_DROPPED_YET;
> > >
> > > It is not clear which additional case this change is intended to
> > > allow. Are you sure such a packet would not be rejected by later
> > > checks in the data path?
> > >
> > > (For the existing FIN exception, the previous condition also seems
> > > broader than necessary. Actually, it should be sufficient to use
> > > "!after(end_seq - th->fin, tp->rcv_nxt)")
> >
> > It is possible our internal sashiko instance got this wrong.
> >
> > <quote>
> > If tcp_max_receive_window() is greater than tcp_receive_window()
> > (i.e. the window was shrunk), and end_seq is after tcp_max_receive_window(),
> > then end_seq - th->fin will always be after tcp_receive_window().
> > Does this mean the FIN workaround is disabled when the window has been shrunk?
> > Should this use tcp_max_receive_window() instead?
> > </quote>
> >
> > Can you suggest an alternative? Why using two confusing variants of
> > what should be the same stuff?
>
> Judging from past discussions, I think you prefer us to be strict in
> tcp_sequence(), so I think we should replicate the conditions
> (in-sequence, RWIN == 0, bare FIN) for the FIN exception in
> tcp_data_queue() closely here. I'll send a patch that mirrors those
> conditions in tcp_sequence().
Based on the discussion on the alternative (see thread
https://lore.kernel.org/netdev/20260610-tcp_fin_more_restrictive-v1-1-eefc30d7ddd8@gmail.com/),
we want to accept the "only FIN is beyond window" case in general,
not just for the RWIN == 0 special case.
It might be worth adding a test for the new "accept data packet with
only FIN extending beyond max win" edge case; I can propose one if
helpful.
Reviewed-by: Simon Baatz <gmbnomis@gmail.com>
--
Simon Baatz <gmbnomis@gmail.com>
* Re: [PATCH v2 bpf-next/net 0/5] bpf: Support RX/TX HW timestamp proxy.
From: Jakub Kicinski @ 2026-06-13 17:20 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
Stanislav Fomichev, Andrii Nakryiko, John Fastabend,
Kumar Kartikeya Dwivedi, Eduard Zingerman, Song Liu,
Yonghong Song, Jiri Olsa, Andrew Lunn, David S . Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, Willem de Bruijn,
Kuniyuki Iwashima, bpf, netdev
In-Reply-To: <20260613010039.1362312-1-kuniyu@google.com>
On Sat, 13 Jun 2026 00:59:57 +0000 Kuniyuki Iwashima wrote:
> When standard socket applications are run on these hosts,
> a userspace proxy is required to mediate traffic between the
> hardware and the applications.
>
> +---------+ +----------------------+
> | proxy | | socket application |
> +---------+ +----------------------+
> ^ ^ ^
> userspace | | |
> -----------| |-----------------------------------------------
> | | | +---------------------+ | skb
> | | `--->| virtual interface |<---'
> kernel | | skb +---------------------+
> -----------| |-----------------------------------------------
> |
> v
> +------------+
> | hardware |
> +------------+
The first patch looks kinda nonsensical but then I saw this diagram.
Looks like you're vibe coding an integration that makes it easier to
treat netdev as a slow path for a user networking stack.
Please tell me if I'm missing anything otherwise add my nack if you
repost.
* Re: [PATCH net-next] tcp: tighten the FIN exception in tcp_sequence()
From: Simon Baatz @ 2026-06-13 17:18 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jakub Kicinski, Simon Baatz via B4 Relay, Neal Cardwell,
Kuniyuki Iwashima, David S. Miller, Paolo Abeni, Simon Horman,
netdev, linux-kernel
In-Reply-To: <CANn89i+q3Q_k3PHvwOJfOgor4xD1YzSVYO-G+fY4+_ddkKLTQw@mail.gmail.com>
Hi Eric,
On Fri, Jun 12, 2026 at 08:47:51PM -0700, Eric Dumazet wrote:
> On Fri, Jun 12, 2026 at 4:40 PM Simon Baatz <gmbnomis@gmail.com> wrote:
> >
> > Hi Jakub,
> >
> > On Fri, Jun 12, 2026 at 03:43:55PM -0700, Jakub Kicinski wrote:
> > > On Wed, 10 Jun 2026 00:09:24 +0200 Simon Baatz via B4 Relay wrote:
> > > > From: Simon Baatz <gmbnomis@gmail.com>
> > > >
> > > > Commit 1e3bb184e941 ("tcp: re-enable acceptance of FIN packets when
> > > > RWIN is 0") added a special case in tcp_sequence() to mirror the FIN
> > > > exception in tcp_data_queue(), which accepts bare in-order FINs even
> > > > when the advertised window is zero. That behavior is not
> > > > RFC-compliant, but was introduced in commit 2bd99aef1b19 ("tcp: accept
> > > > bare FIN packets under memory pressure") to break tight FIN/ACK loops
> > > > caused by broken clients.
> > > >
> > > > However, the condition added by commit 1e3bb184e941 ("tcp: re-enable
> > > > acceptance of FIN packets when RWIN is 0") is broader than required
> > > > and allows other non-compliant packets as well.
> > > >
> > > > Tighten the tcp_sequence() FIN exception to only allow packets where
> > > > the packet is a bare in-order FIN and only the FIN flag extends beyond
> > > > tcp_max_receive_window(). In particular, this exception is only
> > > > reachable if tcp_max_receive_window() is zero. Otherwise the packet is
> > > > already accepted by the normal sequence check.
> > > >
> > > > The existing packetdrill test tcp_rcv_zero_wnd_fin.pkt exercises this
> > > > behavior already and does not need to be changed.
> > > >
> > > > Signed-off-by: Simon Baatz <gmbnomis@gmail.com>
> > >
> > > This is odd. You are sending this patch which shares a lot of
> > > similarities with Eric's patch:
> > > https://lore.kernel.org/all/20260608151452.706822-1-edumazet@google.com/
> > >
> > > Why are you submitting your own patch instead of discussing it further
> > > with Eric and letting him send v2?
> >
> > That's what I understood from Eric's reply to my comments. He asked
> > for an alternative, so I sent this as a concrete suggestion.
>
> Yes this is fine, please next time include a link to the 'other patch'
> since this discussion
> was started by someone :)
Hmm, in hindsight this is quite obvious...
> About your patch, I thought that it would be fine to allow a remote
> peer to add a FIN
> to a payload packet of N bytes even if RWIN == N
I just realized that RFC 9293 has an "or" in:
RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND
or
RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND
(SEG.LEN includes SYN/FIN)
Linux TCP is stricter here effectively requiring both conditions
since commit 9ca48d6 ("tcp: do not accept packets beyond window").
Given that, it's fine to accept packets with an "additional FIN" that
fulfill the first condition.
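As a rough illustration of the difference between "either" and "both", here
is a small sketch of the two RFC 9293 conditions in modulo-2^32 sequence
space. It is a simplified model and not tcp_sequence(): the function names are
made up, and it assumes SEG.LEN > 0 (counting SYN/FIN) and RCV.WND > 0, i.e.
only the last row of the RFC's acceptability table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* start =< x < start + len, evaluated with wrap-around arithmetic. */
static bool seq_in_window(uint32_t x, uint32_t start, uint32_t len)
{
	return (uint32_t)(x - start) < len;
}

/* First condition: RCV.NXT =< SEG.SEQ < RCV.NXT+RCV.WND */
bool seg_first_ok(uint32_t seg_seq, uint32_t rcv_nxt, uint32_t rcv_wnd)
{
	return seq_in_window(seg_seq, rcv_nxt, rcv_wnd);
}

/* Second condition: RCV.NXT =< SEG.SEQ+SEG.LEN-1 < RCV.NXT+RCV.WND */
bool seg_last_ok(uint32_t seg_seq, uint32_t seg_len,
		 uint32_t rcv_nxt, uint32_t rcv_wnd)
{
	return seq_in_window(seg_seq + seg_len - 1, rcv_nxt, rcv_wnd);
}

/* RFC 9293 reading: the segment is acceptable if either end is in window. */
bool seg_acceptable_rfc(uint32_t seg_seq, uint32_t seg_len,
			uint32_t rcv_nxt, uint32_t rcv_wnd)
{
	return seg_first_ok(seg_seq, rcv_nxt, rcv_wnd) ||
	       seg_last_ok(seg_seq, seg_len, rcv_nxt, rcv_wnd);
}

/* Stricter reading: both ends must be in window. */
bool seg_acceptable_strict(uint32_t seg_seq, uint32_t seg_len,
			   uint32_t rcv_nxt, uint32_t rcv_wnd)
{
	return seg_first_ok(seg_seq, rcv_nxt, rcv_wnd) &&
	       seg_last_ok(seg_seq, seg_len, rcv_nxt, rcv_wnd);
}
```

With RCV.NXT = 1000 and RCV.WND = 100, a segment at SEG.SEQ = 1000 carrying
100 bytes plus a FIN has SEG.LEN = 101: the first condition holds, the second
does not, so the "or" form accepts it while the "and" form rejects it. The
same segment without the FIN passes both.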
> It seems the bug of your other stack is that the FIN can be sent with
> no payload,
> and this is a less broader case.
>
> FIN storage (a bit) is there, we can generalize the acceptance of FIN for free ?
Agreed. Let's drop this patch and generalize the FIN handling as you suggest.
- Simon
--
Simon Baatz <gmbnomis@gmail.com>
* Re: [PATCH bpf-next v4 1/2] bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket
From: bot+bpf-ci @ 2026-06-13 17:07 UTC (permalink / raw)
To: leon.hwang, bpf
Cc: ast, daniel, andrii, eddyz87, memxor, martin.lau, song,
yonghong.song, jolsa, emil, john.fastabend, sdf, davem, edumazet,
kuba, pabeni, horms, shuah, leon.hwang, ihor.solodrai, netdev,
linux-kernel, linux-kselftest, kernel-patches-bot, zhoufeng.zf,
ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
In-Reply-To: <20260613162443.60515-2-leon.hwang@linux.dev>
> When TCP over IPv4 via INET6 API, bpf_get/setsockopt with ipv4 will
> fail, because sk->sk_family is AF_INET6. With ipv6 will success, not
> take effect, because inet_csk(sk)->icsk_af_ops is ipv6_mapped and
> use ip_queue_xmit, inet_sk(sk)->tos.
> To relax this restriction, allow getting/setting tos for those possible
> ipv4-mapped ipv6 sockets.
> Fixes: ee7f1e1302f5 ("bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()")
> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
In an earlier review round it was suggested that "bpf_get/setsockopt
with ipv4 will fail" should read "bpf_get/setsockopt with SOL_IP will
fail", since the call uses the SOL_IP level rather than an ipv4 option.
The body here still reads "with ipv4 will fail" - was the s/ipv4/SOL_IP/
reword intended to be picked up?
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/27472657322
* Re: [PATCH] net: correcting section tags for .init and .exit data/functions
From: Nathan Chancellor @ 2026-06-13 17:01 UTC (permalink / raw)
To: xur
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Neal Cardwell, Kuniyuki Iwashima, Willem de Bruijn,
David Ahern, Ido Schimmel, Andreas Färber,
Manivannan Sadhasivam, Nick Desaulniers, Bill Wendling,
Justin Stitt, Maciej Żenczykowski, Yue Haibing, Jeff Layton,
Kees Cook, Fernando Fernandez Mancera, Gustavo A. R. Silva,
Sabrina Dubroca, Masahiro Yamada, Nicolas Schier, netdev,
linux-kernel, linux-arm-kernel, linux-actions, llvm,
kernel test robot
In-Reply-To: <20260612162257.896792-1-xur@google.com>
Hi Rong,
On Fri, Jun 12, 2026 at 09:22:57AM -0700, xur@google.com wrote:
> From: Rong Xu <xur@google.com>
>
> Fix modpost warnings that have surfaced during Clang's distributed ThinLTO
> builds.
>
> WARNING: modpost: vmlinux: section mismatch in reference: tcp4_net_ops.llvm.4527429266264891517+0x8 (section: .data) -> tcp4_proc_init_net (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: udp4_net_ops.llvm.17425824324074326067+0x8 (section: .data) -> udp4_proc_init_net (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: ping_v4_net_ops.llvm.5641696707737373282+0x8 (section: .data) -> ping_v4_proc_init_net (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: if6_proc_net_ops.llvm.7870945277386035298+0x8 (section: .data) -> if6_proc_net_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: ipv6_addr_label_ops.llvm.5745897517271459135+0x8 (section: .data) -> ip6addrlbl_net_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: ndisc_net_ops.llvm.8806210167060761094+0x8 (section: .data) -> ndisc_net_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: raw6_net_ops.llvm.3743523335772203324+0x8 (section: .data) -> raw6_init_net (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: igmp6_net_ops.llvm.7071106350580158050+0x8 (section: .data) -> igmp6_net_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: tcpv6_net_ops.llvm.17505177970592326146+0x8 (section: .data) -> tcpv6_net_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: ip6_flowlabel_net_ops.llvm.6051723423336054316+0x8 (section: .data) -> ip6_flowlabel_proc_init (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: ipv6_proc_ops.llvm.7829948594772821810+0x8 (section: .data) -> ipv6_proc_init_net (section: .init.text)
>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202606111233.kM8oo8Df-lkp@intel.com/
> Signed-off-by: Rong Xu <xur@google.com>
Thanks for sending this change to try and clear up those new warnings
from the distributed ThinLTO build. Based on the build reports that
appear from this change downthread, it does not seem like it is quite
right. Additionally, I think the commit message could be a little more
descriptive around the root cause of the warnings and how this patch
actually addresses it (I can infer but I think that information should
be up front and center).
> ---
> net/ipv4/ping.c | 6 +++---
> net/ipv4/tcp_ipv4.c | 6 +++---
> net/ipv4/udp.c | 6 +++---
> net/ipv6/addrconf.c | 6 +++---
> net/ipv6/addrlabel.c | 6 +++---
> net/ipv6/ip6_flowlabel.c | 6 +++---
> net/ipv6/mcast.c | 10 +++++-----
> net/ipv6/ndisc.c | 10 +++++-----
> net/ipv6/proc.c | 6 +++---
> net/ipv6/raw.c | 6 +++---
> net/ipv6/tcp_ipv6.c | 6 +++---
> 11 files changed, 37 insertions(+), 37 deletions(-)
>
> diff --git a/net/ipv4/ping.c b/net/ipv4/ping.c
> index d36f1e273fde..1dda6d661ad8 100644
> --- a/net/ipv4/ping.c
> +++ b/net/ipv4/ping.c
> @@ -1144,17 +1144,17 @@ static void __net_exit ping_v4_proc_exit_net(struct net *net)
> remove_proc_entry("icmp", net->proc_net);
> }
>
> -static struct pernet_operations ping_v4_net_ops = {
> +static struct pernet_operations ping_v4_net_ops __net_initdata = {
> .init = ping_v4_proc_init_net,
> .exit = ping_v4_proc_exit_net,
> };
>
> -int __init ping_proc_init(void)
> +int __net_init ping_proc_init(void)
> {
> return register_pernet_subsys(&ping_v4_net_ops);
> }
>
> -void ping_proc_exit(void)
> +void __net_exit ping_proc_exit(void)
> {
> unregister_pernet_subsys(&ping_v4_net_ops);
> }
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index fdc81150ff6c..9caca5879466 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -3317,17 +3317,17 @@ static void __net_exit tcp4_proc_exit_net(struct net *net)
> remove_proc_entry("tcp", net->proc_net);
> }
>
> -static struct pernet_operations tcp4_net_ops = {
> +static struct pernet_operations tcp4_net_ops __net_initdata = {
> .init = tcp4_proc_init_net,
> .exit = tcp4_proc_exit_net,
> };
>
> -int __init tcp4_proc_init(void)
> +int __net_init tcp4_proc_init(void)
> {
> return register_pernet_subsys(&tcp4_net_ops);
> }
>
> -void tcp4_proc_exit(void)
> +void __net_exit tcp4_proc_exit(void)
> {
> unregister_pernet_subsys(&tcp4_net_ops);
> }
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 70f6cbd4ef73..87f4cced2114 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -3600,17 +3600,17 @@ static void __net_exit udp4_proc_exit_net(struct net *net)
> remove_proc_entry("udp", net->proc_net);
> }
>
> -static struct pernet_operations udp4_net_ops = {
> +static struct pernet_operations udp4_net_ops __net_initdata = {
> .init = udp4_proc_init_net,
> .exit = udp4_proc_exit_net,
> };
>
> -int __init udp4_proc_init(void)
> +int __net_init udp4_proc_init(void)
> {
> return register_pernet_subsys(&udp4_net_ops);
> }
>
> -void udp4_proc_exit(void)
> +void __net_exit udp4_proc_exit(void)
> {
> unregister_pernet_subsys(&udp4_net_ops);
> }
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index c9e5d3e48ab9..73d9439bd408 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -4527,17 +4527,17 @@ static void __net_exit if6_proc_net_exit(struct net *net)
> remove_proc_entry("if_inet6", net->proc_net);
> }
>
> -static struct pernet_operations if6_proc_net_ops = {
> +static struct pernet_operations if6_proc_net_ops __net_initdata = {
> .init = if6_proc_net_init,
> .exit = if6_proc_net_exit,
> };
>
> -int __init if6_proc_init(void)
> +int __net_init if6_proc_init(void)
> {
> return register_pernet_subsys(&if6_proc_net_ops);
> }
>
> -void if6_proc_exit(void)
> +void __net_exit if6_proc_exit(void)
> {
> unregister_pernet_subsys(&if6_proc_net_ops);
> }
> diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
> index f4b2618446bd..50f6c1b1edaa 100644
> --- a/net/ipv6/addrlabel.c
> +++ b/net/ipv6/addrlabel.c
> @@ -340,17 +340,17 @@ static void __net_exit ip6addrlbl_net_exit(struct net *net)
> spin_unlock(&net->ipv6.ip6addrlbl_table.lock);
> }
>
> -static struct pernet_operations ipv6_addr_label_ops = {
> +static struct pernet_operations ipv6_addr_label_ops __net_initdata = {
> .init = ip6addrlbl_net_init,
> .exit = ip6addrlbl_net_exit,
> };
>
> -int __init ipv6_addr_label_init(void)
> +int __net_init ipv6_addr_label_init(void)
> {
> return register_pernet_subsys(&ipv6_addr_label_ops);
> }
>
> -void ipv6_addr_label_cleanup(void)
> +void __net_exit ipv6_addr_label_cleanup(void)
> {
> unregister_pernet_subsys(&ipv6_addr_label_ops);
> }
> diff --git a/net/ipv6/ip6_flowlabel.c b/net/ipv6/ip6_flowlabel.c
> index b1ccdf0dc646..f6980c403c68 100644
> --- a/net/ipv6/ip6_flowlabel.c
> +++ b/net/ipv6/ip6_flowlabel.c
> @@ -903,17 +903,17 @@ static void __net_exit ip6_flowlabel_net_exit(struct net *net)
> ip6_flowlabel_proc_fini(net);
> }
>
> -static struct pernet_operations ip6_flowlabel_net_ops = {
> +static struct pernet_operations ip6_flowlabel_net_ops __net_initdata = {
> .init = ip6_flowlabel_proc_init,
> .exit = ip6_flowlabel_net_exit,
> };
>
> -int ip6_flowlabel_init(void)
> +int __net_init ip6_flowlabel_init(void)
> {
> return register_pernet_subsys(&ip6_flowlabel_net_ops);
> }
>
> -void ip6_flowlabel_cleanup(void)
> +void __net_exit ip6_flowlabel_cleanup(void)
> {
> static_key_deferred_flush(&ipv6_flowlabel_exclusive);
> timer_delete(&ip6_fl_gc_timer);
> diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
> index d9b855d5191b..eef5bab1ee13 100644
> --- a/net/ipv6/mcast.c
> +++ b/net/ipv6/mcast.c
> @@ -3209,12 +3209,12 @@ static void __net_exit igmp6_net_exit(struct net *net)
> igmp6_proc_exit(net);
> }
>
> -static struct pernet_operations igmp6_net_ops = {
> +static struct pernet_operations igmp6_net_ops __net_initdata = {
> .init = igmp6_net_init,
> .exit = igmp6_net_exit,
> };
>
> -int __init igmp6_init(void)
> +int __net_init igmp6_init(void)
> {
> int err;
>
> @@ -3231,18 +3231,18 @@ int __init igmp6_init(void)
> return err;
> }
>
> -int __init igmp6_late_init(void)
> +int __net_init igmp6_late_init(void)
> {
> return register_netdevice_notifier(&igmp6_netdev_notifier);
> }
>
> -void igmp6_cleanup(void)
> +void __net_exit igmp6_cleanup(void)
> {
> unregister_pernet_subsys(&igmp6_net_ops);
> destroy_workqueue(mld_wq);
> }
>
> -void igmp6_late_cleanup(void)
> +void __net_exit igmp6_late_cleanup(void)
> {
> unregister_netdevice_notifier(&igmp6_netdev_notifier);
> }
> diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
> index e7ad13c5bd26..3a83280db29d 100644
> --- a/net/ipv6/ndisc.c
> +++ b/net/ipv6/ndisc.c
> @@ -1994,12 +1994,12 @@ static void __net_exit ndisc_net_exit(struct net *net)
> inet_ctl_sock_destroy(net->ipv6.ndisc_sk);
> }
>
> -static struct pernet_operations ndisc_net_ops = {
> +static struct pernet_operations ndisc_net_ops __net_initdata = {
> .init = ndisc_net_init,
> .exit = ndisc_net_exit,
> };
>
> -int __init ndisc_init(void)
> +int __net_init ndisc_init(void)
> {
> int err;
>
> @@ -2027,17 +2027,17 @@ int __init ndisc_init(void)
> #endif
> }
>
> -int __init ndisc_late_init(void)
> +int __net_init ndisc_late_init(void)
> {
> return register_netdevice_notifier(&ndisc_netdev_notifier);
> }
>
> -void ndisc_late_cleanup(void)
> +void __net_exit ndisc_late_cleanup(void)
> {
> unregister_netdevice_notifier(&ndisc_netdev_notifier);
> }
>
> -void ndisc_cleanup(void)
> +void __net_exit ndisc_cleanup(void)
> {
> #ifdef CONFIG_SYSCTL
> neigh_sysctl_unregister(&nd_tbl.parms);
> diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
> index 813013ca4e75..c59bade608cd 100644
> --- a/net/ipv6/proc.c
> +++ b/net/ipv6/proc.c
> @@ -298,17 +298,17 @@ static void __net_exit ipv6_proc_exit_net(struct net *net)
> remove_proc_entry("snmp6", net->proc_net);
> }
>
> -static struct pernet_operations ipv6_proc_ops = {
> +static struct pernet_operations ipv6_proc_ops __net_initdata = {
> .init = ipv6_proc_init_net,
> .exit = ipv6_proc_exit_net,
> };
>
> -int __init ipv6_misc_proc_init(void)
> +int __net_init ipv6_misc_proc_init(void)
> {
> return register_pernet_subsys(&ipv6_proc_ops);
> }
>
> -void ipv6_misc_proc_exit(void)
> +void __net_exit ipv6_misc_proc_exit(void)
> {
> unregister_pernet_subsys(&ipv6_proc_ops);
> }
> diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
> index 3cc58698cbbd..fe399675b8fc 100644
> --- a/net/ipv6/raw.c
> +++ b/net/ipv6/raw.c
> @@ -1256,17 +1256,17 @@ static void __net_exit raw6_exit_net(struct net *net)
> remove_proc_entry("raw6", net->proc_net);
> }
>
> -static struct pernet_operations raw6_net_ops = {
> +static struct pernet_operations raw6_net_ops __net_initdata = {
> .init = raw6_init_net,
> .exit = raw6_exit_net,
> };
>
> -int __init raw6_proc_init(void)
> +int __net_init raw6_proc_init(void)
> {
> return register_pernet_subsys(&raw6_net_ops);
> }
>
> -void raw6_proc_exit(void)
> +void __net_exit raw6_proc_exit(void)
> {
> unregister_pernet_subsys(&raw6_net_ops);
> }
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 36d75fb50a70..d0737f16076b 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -2335,12 +2335,12 @@ static void __net_exit tcpv6_net_exit(struct net *net)
> inet_ctl_sock_destroy(net->ipv6.tcp_sk);
> }
>
> -static struct pernet_operations tcpv6_net_ops = {
> +static struct pernet_operations tcpv6_net_ops __net_initdata = {
> .init = tcpv6_net_init,
> .exit = tcpv6_net_exit,
> };
>
> -int __init tcpv6_init(void)
> +int __net_init tcpv6_init(void)
> {
> int ret;
>
> @@ -2378,7 +2378,7 @@ int __init tcpv6_init(void)
> goto out;
> }
>
> -void tcpv6_exit(void)
> +void __net_exit tcpv6_exit(void)
> {
> unregister_pernet_subsys(&tcpv6_net_ops);
> inet6_unregister_protosw(&tcpv6_protosw);
>
> base-commit: 2b414a95b8f7307d42173ba9e580d6d3e2bcbfce
> --
> 2.54.0.1136.gdb2ca164c4-goog
>
>
--
Cheers,
Nathan
^ permalink raw reply
* [PATCH net-next v2 3/3] docs: net: fix minor issues with strparser docs
From: Jakub Kicinski @ 2026-06-13 16:58 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
john.fastabend, sd, jiri, Jakub Kicinski, skhan
In-Reply-To: <20260613165846.2913092-1-kuba@kernel.org>
Not sure if anyone would read this doc, but the API has evolved
since it was written. Update to:
- show the int return type for strp_init()
- refer to strp_data_ready(), not the old strp_tcp_data_ready() name
- direct users to strp_msg(skb) for strparser metadata instead of
treating skb->cb as struct strp_msg directly
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: corbet@lwn.net
CC: skhan@linuxfoundation.org
CC: linux-doc@vger.kernel.org
---
Documentation/networking/strparser.rst | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/Documentation/networking/strparser.rst b/Documentation/networking/strparser.rst
index 8dc6bb04c710..372106b61e65 100644
--- a/Documentation/networking/strparser.rst
+++ b/Documentation/networking/strparser.rst
@@ -40,8 +40,8 @@ Functions
::
- strp_init(struct strparser *strp, struct sock *sk,
- const struct strp_callbacks *cb)
+ int strp_init(struct strparser *strp, struct sock *sk,
+ const struct strp_callbacks *cb)
Called to initialize a stream parser. strp is a struct of type
strparser that is allocated by the upper layer. sk is the TCP
@@ -95,7 +95,7 @@ Functions
void strp_data_ready(struct strparser *strp);
- The upper layer calls strp_tcp_data_ready when data is ready on
+ The upper layer calls strp_data_ready when data is ready on
the lower socket for strparser to process. This should be called
from a data_ready callback that is set on the socket. Note that
maximum messages size is the limit of the receive socket
@@ -123,9 +123,9 @@ Callbacks
should parse the sk_buff as containing the headers for the
next application layer message in the stream.
- The skb->cb in the input skb is a struct strp_msg. Only
- the offset field is relevant in parse_msg and gives the offset
- where the message starts in the skb.
+ The strparser metadata in the input skb can be accessed with
+ strp_msg(skb). Only the offset field is relevant in parse_msg and
+ gives the offset where the message starts in the skb.
The return values of this function are:
@@ -176,11 +176,11 @@ Callbacks
received in rcv_msg (see strp_pause above). This callback
must be set.
- The skb->cb in the input skb is a struct strp_msg. This
- struct contains two fields: offset and full_len. Offset is
- where the message starts in the skb, and full_len is the
- the length of the message. skb->len - offset may be greater
- than full_len since strparser does not trim the skb.
+ The strparser metadata in the input skb can be accessed with
+ strp_msg(skb). This struct contains two fields: offset and full_len.
+ Offset is where the message starts in the skb, and full_len is
+ the length of the message. skb->len - offset may be greater than
+ full_len since strparser does not trim the skb.
::
--
2.54.0
^ permalink raw reply related
* [PATCH net-next v2 2/3] docs: net: fix minor issues with devlink docs
From: Jakub Kicinski @ 2026-06-13 16:58 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
john.fastabend, sd, jiri, Jakub Kicinski, skhan
In-Reply-To: <20260613165846.2913092-1-kuba@kernel.org>
Update devlink documentation to match current code:
- describe health reporter defaults (it's currently under "callbacks"),
best-effort auto-dump, and port-scoped reporters
- fix generic parameter names and values
- fix nested devlink setup wording and registration ordering
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: jiri@resnulli.us
CC: corbet@lwn.net
CC: skhan@linuxfoundation.org
CC: linux-doc@vger.kernel.org
---
Documentation/networking/devlink/devlink-health.rst | 12 ++++++++----
Documentation/networking/devlink/devlink-params.rst | 2 +-
Documentation/networking/devlink/devlink-port.rst | 5 ++++-
Documentation/networking/devlink/devlink-trap.rst | 8 +++++---
Documentation/networking/devlink/index.rst | 10 +++++-----
5 files changed, 23 insertions(+), 14 deletions(-)
diff --git a/Documentation/networking/devlink/devlink-health.rst b/Documentation/networking/devlink/devlink-health.rst
index 4d10536377ab..bedac58a2f36 100644
--- a/Documentation/networking/devlink/devlink-health.rst
+++ b/Documentation/networking/devlink/devlink-health.rst
@@ -33,7 +33,9 @@ asynchronously. All health reports handling is done by ``devlink``.
* Recovery procedures
* Diagnostics procedures
* Object dump procedures
- * Out Of Box initial parameters
+
+Drivers also provide default values for generic reporter parameters when
+creating a health reporter.
Different parts of the driver can register different types of health reporters
with different handlers.
@@ -45,8 +47,9 @@ Actions
* A log is being send to the kernel trace events buffer
* Health status and statistics are being updated for the reporter instance
- * Object dump is being taken and saved at the reporter instance (as long as
- auto-dump is set and there is no other dump which is already stored)
+ * Object dump is being taken and saved at the reporter instance. This is
+ best effort and skipped when recovery is aborted, auto-dump is disabled,
+ no dump callback is registered, or a dump is already stored.
* Auto recovery attempt is being done. Depends on:
- Auto-recovery configuration
@@ -75,7 +78,8 @@ User Interface
==============
User can access/change each reporter's parameters and driver specific callbacks
-via ``devlink``, e.g per error type (per health reporter):
+via ``devlink``, e.g. per error type (per health reporter). Reporters may be
+registered for the whole devlink instance or for a specific devlink port.
* Configure reporter's generic parameters (like: disable/enable auto recovery)
* Invoke recovery procedure
diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst
index ea17756dcda6..ca19ee3e63c8 100644
--- a/Documentation/networking/devlink/devlink-params.rst
+++ b/Documentation/networking/devlink/devlink-params.rst
@@ -122,7 +122,7 @@ own name.
* - ``enable_iwarp``
- Boolean
- Enable handling of iWARP traffic in the device.
- * - ``internal_err_reset``
+ * - ``internal_error_reset``
- Boolean
- When enabled, the device driver will reset the device on internal
errors.
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 5e397798a402..9374ebe70f48 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -38,7 +38,7 @@ Devlink port flavours are described below.
- This indicates an eswitch port representing a port of PCI
subfunction (SF).
* - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
- - This indicates a virtual port for the PCI virtual function.
+ - Any virtual port facing the user.
Devlink port can have a different type based on the link layer described below.
@@ -134,6 +134,9 @@ Users may also set the IPsec crypto capability of the function using
Users may also set the IPsec packet capability of the function using
`devlink port function set ipsec_packet` command.
+The ``migratable`` attribute may be set only on ports with
+``DEVLINK_PORT_FLAVOUR_PCI_VF``.
+
Users may also set the maximum IO event queues of the function
using `devlink port function set max_io_eqs` command.
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index 5885e21e2212..ac5bf9337198 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -516,9 +516,11 @@ Generic Packet Trap Groups
Generic packet trap groups are used to aggregate logically related packet
traps. These groups allow the user to batch operations such as setting the trap
-action of all member traps. In addition, ``devlink-trap`` can report aggregated
-per-group packets and bytes statistics, in case per-trap statistics are too
-narrow. The description of these groups must be added to the following table:
+action of all member drop traps whose action may legally change. Exception and
+control traps remain unchanged. In addition, ``devlink-trap`` can report
+aggregated per-group packets and bytes statistics, in case per-trap statistics
+are too narrow. The description of these groups must be added to the following
+table:
.. list-table:: List of Generic Packet Trap Groups
:widths: 10 90
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index f7ba7dcf477d..32f70879ddd0 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -13,8 +13,8 @@ new APIs prefixed by ``devl_*``. The older APIs handle all the locking
in devlink core, but don't allow registration of most sub-objects once
the main devlink object is itself registered. The newer ``devl_*`` APIs assume
the devlink instance lock is already held. Drivers can take the instance
-lock by calling ``devl_lock()``. It is also held all callbacks of devlink
-netlink commands.
+lock by calling ``devl_lock()``. It is also held across all callbacks of
+devlink netlink commands.
Drivers are encouraged to use the devlink instance lock for their own needs.
@@ -33,11 +33,11 @@ devlink instances created underneath. In that case, drivers should make
lock of both nested and parent instances at the same time, devlink
instance lock of the parent instance should be taken first, only then
instance lock of the nested instance could be taken.
- - Driver should use object-specific helpers to setup the
- nested relationship:
+ - Driver should use object-specific helpers to setup the nested relationship
+ before registering the nested devlink instance:
- ``devl_nested_devlink_set()`` - called to setup devlink -> nested
- devlink relationship (could be user for multiple nested instances.
+ devlink relationship (could be used for multiple nested instances).
- ``devl_port_fn_devlink_set()`` - called to setup port function ->
nested devlink relationship.
- ``devlink_linecard_nested_dl_set()`` - called to setup linecard ->
--
2.54.0
^ permalink raw reply related
* [PATCH net-next v2 1/3] docs: net: tls-offload: document tls_dev_del, tls_dev_resync, and rekey
From: Jakub Kicinski @ 2026-06-13 16:58 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
john.fastabend, sd, jiri, Jakub Kicinski, skhan
In-Reply-To: <20260613165846.2913092-1-kuba@kernel.org>
Fill in some gaps in the TLS offload doc:
- describe the tls_dev_del and tls_dev_resync callbacks
- add a mention of rekeying being out of scope for now
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
v2:
- add mentions of the callback in resync text
- Stack -> The stack
v1: https://lore.kernel.org/20260609201224.1191391-1-kuba@kernel.org
CC: john.fastabend@gmail.com
CC: sd@queasysnail.net
CC: corbet@lwn.net
CC: skhan@linuxfoundation.org
CC: linux-doc@vger.kernel.org
---
Documentation/networking/tls-offload.rst | 45 ++++++++++++++++++++----
1 file changed, 38 insertions(+), 7 deletions(-)
diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst
index 25ee8d9f12c9..e5802bcd4d22 100644
--- a/Documentation/networking/tls-offload.rst
+++ b/Documentation/networking/tls-offload.rst
@@ -99,6 +99,29 @@ at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.
+When the offloaded connection is destroyed the core calls
+the :c:member:`tls_dev_del` callback so the driver can release per-direction
+state:
+
+.. code-block:: c
+
+ void (*tls_dev_del)(struct net_device *netdev,
+ struct tls_context *ctx,
+ enum tls_offload_ctx_dir direction);
+
+``tls_dev_del`` is mandatory whenever ``tls_dev_add`` is provided.
+
+The third TLS device callback is :c:member:`tls_dev_resync`, called by the core
+to synchronize the TCP stream with the record boundaries:
+
+.. code-block:: c
+
+ int (*tls_dev_resync)(struct net_device *netdev,
+ struct sock *sk, u32 seq, u8 *rcd_sn,
+ enum tls_offload_ctx_dir direction);
+
+See the `Resync handling`_ section for details.
+
TX
--
@@ -250,9 +273,9 @@ sequence number (as it will be updated from a different context).
bool tls_offload_tx_resync_pending(struct sock *sk)
Next time ``ktls`` pushes a record it will first send its TCP sequence number
-and TLS record number to the driver. Stack will also make sure that
-the new record will start on a segment boundary (like it does when
-the connection is initially added).
+and TLS record number to the driver via the ``tls_dev_resync`` callback.
+The stack will also make sure that the new record will start on a segment
+boundary (like it does when the connection is initially added).
RX
--
@@ -344,9 +367,10 @@ all TLS record headers that have been logged since the resync request
started.
The kernel confirms the guessed location was correct and tells the device
-the record sequence number. Meanwhile, the device had been parsing
-and counting all records since the just-confirmed one, it adds the number
-of records it had seen to the record number provided by the kernel.
+the record sequence number via the ``tls_dev_resync`` callback. Meanwhile,
+the device had been parsing and counting all records since the just-confirmed
+one, it adds the number of records it had seen to the record number provided
+by the kernel.
At this point the device is in sync and can resume decryption at next
segment boundary.
@@ -370,12 +394,19 @@ schedules resynchronization after it has received two completely encrypted
records.
The stack waits for the socket to drain and informs the device about
-the next expected record number and its TCP sequence number. If the
+the next expected record number and its TCP sequence number via the
+``tls_dev_resync`` callback. If the
records continue to be received fully encrypted stack retries the
synchronization with an exponential back off (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).
+Rekey
+=====
+
+Offload does not currently support TLS 1.3, therefore key rotation
+is not a concern for offloaded connections at this point.
+
Error handling
==============
--
2.54.0
^ permalink raw reply related
* [PATCH net-next v2 0/3] docs: net: more adjustments to docs
From: Jakub Kicinski @ 2026-06-13 16:58 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, corbet, linux-doc,
john.fastabend, sd, jiri, Jakub Kicinski
A few small updates to the docs.
This is trying to prepare docs for getting fed directly
into AI reviews.
v2:
- fixes in the tls offload patch
- add the strparser patch in place of the already applied XDP md one
v1: https://lore.kernel.org/20260609201224.1191391-1-kuba@kernel.org
Jakub Kicinski (3):
docs: net: tls-offload: document tls_dev_del, tls_dev_resync, and
rekey
docs: net: fix minor issues with devlink docs
docs: net: fix minor issues with strparser docs
.../networking/devlink/devlink-health.rst | 12 +++--
.../networking/devlink/devlink-params.rst | 2 +-
.../networking/devlink/devlink-port.rst | 5 ++-
.../networking/devlink/devlink-trap.rst | 8 ++--
Documentation/networking/devlink/index.rst | 10 ++---
Documentation/networking/strparser.rst | 22 ++++-----
Documentation/networking/tls-offload.rst | 45 ++++++++++++++++---
7 files changed, 72 insertions(+), 32 deletions(-)
--
2.54.0
^ permalink raw reply
* Re: [PATCH net-next v6 3/5] net: dsa: tag_ks8995: Add the KS8995 tag handling
From: Linus Walleij @ 2026-06-13 16:56 UTC (permalink / raw)
To: Jakub Kicinski
Cc: woojung.huh, UNGLinuxDriver, andrew, olteanv, davem, edumazet,
pabeni, robh, krzk+dt, conor+dt, marex, horms, linux, netdev,
devicetree, nb
In-Reply-To: <20260610153952.1685895-1-kuba@kernel.org>
DSA maintainers, please look at the analysis a bit further below.
The conclusion is that taggers can't re-use the vlan_* helpers
because they free the SKBs on error, so maybe I need
to create new derivative functions to reuse these.
On Wed, Jun 10, 2026 at 5:39 PM Jakub Kicinski <kuba@kernel.org> wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
Fair enough.
> > +static struct sk_buff *ks8995_xmit(struct sk_buff *skb, struct net_device *dev)
> > +{
> > + struct vlan_ethhdr *hdr = vlan_eth_hdr(skb);
> > + bool have_hwaccel_tag = false;
> > + u16 tci = 0, portmask;
> > +
> > + /* Prepare the special KS8995 tags */
> > + portmask = dsa_xmit_port_mask(skb, dev);
> > +
> > + if (skb_vlan_tag_present(skb) && skb->vlan_proto == htons(ETH_P_8021Q)) {
> > + tci = skb_vlan_tag_get(skb);
> > + __vlan_hwaccel_clear_tag(skb);
> > + have_hwaccel_tag = true;
> > + }
>
> [Low]
> What is intended to happen if the skb arrives with both a hwaccel VLAN
> tag and an in-head 0x8100 tag, or with a hwaccel tag whose vlan_proto
> is something other than 0x8100 (for example 0x88a8 service tag)?
>
> In those cases this branch leaves the hwaccel tag alone or only consumes
> it conditionally, and the second branch below either inserts a fresh
> in-head tag while keeping a pre-existing in-head 0x8100, or modifies the
> in-head 0x8100 while a non-8021Q hwaccel tag is still attached. Should
> this path either reject such frames or normalise them first?
I have no idea what "normalise" means in this context.
What happens to e.g. a 88a8 service tag is that it gets forwarded
to the target port using the modified vlan tag added by vlan_insert_tag()
below. I think part of the AI confusion here comes from the
fact that we are using modified "vlan tags" whose semantics
do not fit the common pattern, so the statistics in the
training data become irrelevant.
The whole point of the code is to modify pre-existing
ETH_P_8021Q tags in-place rather than adding a second one,
which means that everything else will get a second tag, the sole
function of which is to direct the packet to the right port, where it
will be stripped by the hardware.
> > + if (have_hwaccel_tag || hdr->h_vlan_proto != htons(ETH_P_8021Q)) {
> > + skb = vlan_insert_tag(skb, KS8995M_STPID(portmask), tci);
> > + if (!skb)
> > + return NULL;
>
> [High]
> Is returning NULL here safe?
>
> include/linux/if_vlan.h documents that vlan_insert_tag() calls
> dev_kfree_skb_any(skb) on failure (e.g. when skb_cow_head() fails under
> memory pressure) and returns NULL. The caller in net/dsa/user.c does:
>
> nskb = p->xmit(skb, dev);
> if (!nskb) {
> kfree_skb(skb);
> return NETDEV_TX_OK;
> }
>
> so on the NULL return dsa_user_xmit() runs kfree_skb() on its local
> pointer that vlan_insert_tag() has already freed.
vlan_insert_tag()
vlan_insert_inner_tag() <- returns NULL on error
__vlan_insert_inner_tag()
> Doesn't that produce a
> use-after-free on skb->users plus a double free?
>
> Should this path swallow the free done by vlan_insert_tag() (e.g. by
> returning a sentinel handled by the caller) so dsa_user_xmit() does not
> re-free the skb?
I looked into this.
The fact that several taggers do this in ->xmit() such as here in tag_8021q.c:
struct sk_buff *dsa_8021q_xmit(struct sk_buff *skb, struct net_device *netdev,
u16 tpid, u16 tci)
{
/* skb->data points at the MAC header, which is fine
* for vlan_insert_tag().
*/
return vlan_insert_tag(skb, htons(tpid), tci);
}
should be a sign that this is OK in xmit:
return vlan_insert_tag(skb,...);
which is equivalent to:
skb = vlan_insert_tag(skb,...);
if (!skb)
return skb;
which is equivalent to:
skb = vlan_insert_tag(skb,...);
if (!skb)
return NULL;
Which is what I do.
So yeah, the skb will be freed twice. The code in tag_8021q.c will
also do that. But what do you expect ->xmit() to return on error if
not NULL?
When user.c does this:
/* Transmit function may have to reallocate the original SKB,
* in which case it must have freed it. Only free it here on error.
*/
nskb = p->xmit(skb, dev);
if (!nskb) {
kfree_skb(skb);
return NETDEV_TX_OK;
}
return dsa_enqueue_skb(nskb, dev);
The only way to get cleanly out of this branch if you run
into an error in ->xmit() is to return NULL!
Now let's look at ->rcv():
> > + /* Move the custom DSA+VLAN tag into the hwaccel area and strip
> > + * it from the skb head
> > + */
> > + skb = skb_vlan_untag(skb);
> > + if (!skb) {
> > + netdev_err(dev, "%s: unable to untag skb\n", __func__);
> > + return NULL;
> > + }
>
> [High]
> Is the NULL return here compatible with the cpu_dp->rcv contract?
>
> skb_vlan_untag() can take its err_free path (e.g. when the
> pskb_may_pull(VLAN_HLEN + sizeof(unsigned short)) inside it fails, or
> skb_reorder_vlan_header() fails) which calls kfree_skb(skb) and returns
> NULL.
>
> The caller in net/dsa/tag.c is:
>
> skb = skb_unshare(skb, GFP_ATOMIC);
> ...
> nskb = cpu_dp->rcv(skb, dev);
> ...
> if (!nskb) {
> kfree_skb(skb);
> return 0;
> }
>
> so when ks8995_rcv() returns NULL after skb_vlan_untag() has already
> freed its argument, dsa_switch_rcv() ends up calling kfree_skb() on a
> freed skb. Wouldn't that be a use-after-free on skb->users plus a
> double free?
>
> Other cpu_dp->rcv handlers appear to assume that returning NULL means
> the skb was not freed. Should this handler do something equivalent so
> the contract still holds?
Yes, skb_vlan_untag() will free the skb on error and then the caller
will free the skb again.
But how do you signal to the caller in net/dsa/tag.c
that "things went sideways and the SKB is already freed"?
So these semantics around ->xmit() and ->rcv() freeing the skb on
a NULL return basically challenge Vladimir's request that I
reuse these functions in the first place. They are not made
for this kind of reuse.
What I *CAN* do is go and create wrappers in skbuff.h/c
that will not free the skb on error but just return NULL anyway,
intended for this one user (to begin with), such as
vlan_insert_tag_no_free_skb_on_error();
skb_vlan_untag_no_free_skb_on_error();
I honestly think these are good names because there is
no risk of misunderstanding them...
But then I want some buy-in from the maintainers that this is the
way to go.
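To make the ownership question concrete, here is a small userspace
model of the two contracts, offered only as a sketch. Nothing in it is
kernel code: struct toy_skb, toy_kfree_skb(), toy_insert_tag(),
toy_insert_tag_nofree() and toy_caller() are all made-up names, and the
"free" merely bumps a counter so the double free can be observed
without touching freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct sk_buff; NOT the kernel structure. */
struct toy_skb {
	int frees;	/* how many times toy_kfree_skb() ran on this buffer */
};

/* Toy stand-in for kfree_skb(): only counts, releases nothing. */
void toy_kfree_skb(struct toy_skb *skb)
{
	skb->frees++;
}

/* Models the vlan_insert_tag() contract: on failure the helper
 * consumes the skb itself and returns NULL.
 */
struct toy_skb *toy_insert_tag(struct toy_skb *skb, bool fail)
{
	if (fail) {
		toy_kfree_skb(skb);
		return NULL;
	}
	return skb;
}

/* Hypothetical non-freeing variant: on failure it returns NULL but
 * leaves ownership of the skb with the caller.
 */
struct toy_skb *toy_insert_tag_nofree(struct toy_skb *skb, bool fail)
{
	return fail ? NULL : skb;
}

typedef struct toy_skb *(*toy_xmit_fn)(struct toy_skb *skb, bool fail);

/* Models the caller pattern quoted above: free the skb when the
 * tagger returns NULL. Returns the number of frees the skb has seen.
 */
int toy_caller(struct toy_skb *skb, toy_xmit_fn xmit, bool fail)
{
	struct toy_skb *nskb;

	assert(skb != NULL);
	nskb = xmit(skb, fail);
	if (!nskb)
		toy_kfree_skb(skb);
	return skb->frees;
}
```

With the freeing helper, a failing toy_caller() reports two frees on
the same buffer; with the non-freeing variant it reports exactly one,
which is the behaviour the caller's NULL handling expects.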
Yours,
Linus Walleij
^ permalink raw reply
* [PATCH bpf-next v4 2/2] selftests/bpf: Add test to verify the fix for bpf_setsockopt() helper
From: Leon Hwang @ 2026-06-13 16:24 UTC (permalink / raw)
To: bpf
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Martin KaFai Lau,
Song Liu, Yonghong Song, Jiri Olsa, Emil Tsalapatis,
John Fastabend, Stanislav Fomichev, David S . Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Shuah Khan, Leon Hwang, Ihor Solodrai, netdev, linux-kernel,
linux-kselftest, kernel-patches-bot
In-Reply-To: <20260613162443.60515-1-leon.hwang@linux.dev>
Verify the fix by:
1. Attaching a cgroup sockops prog.
2. Establishing a TCP connection to an IPv4 address through an IPv6
   socket (IPv4-mapped IPv6 address).
3. Verifying the return value of the bpf_setsockopt() helper.
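
For readers unfamiliar with step 2, the following standalone sketch
shows how an IPv4-mapped IPv6 address (::ffff:a.b.c.d) can be built
from an IPv4 socket address. It mirrors what the test helper in this
patch does, but v4mapped_from_v4() is a hypothetical name used only for
illustration and is not part of the selftest.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>

/* Build the IPv4-mapped IPv6 form (::ffff:a.b.c.d) of an IPv4
 * socket address. Hypothetical helper, for illustration only.
 */
struct sockaddr_in6 v4mapped_from_v4(const struct sockaddr_in *in4)
{
	struct sockaddr_in6 in6;

	assert(in4->sin_family == AF_INET);

	memset(&in6, 0, sizeof(in6));
	in6.sin6_family = AF_INET6;
	in6.sin6_port = in4->sin_port;	/* already in network byte order */
	/* Bytes 0..9 stay zero, bytes 10..11 are 0xff, bytes 12..15
	 * carry the IPv4 address.
	 */
	in6.sin6_addr.s6_addr[10] = 0xff;
	in6.sin6_addr.s6_addr[11] = 0xff;
	memcpy(&in6.sin6_addr.s6_addr[12], &in4->sin_addr,
	       sizeof(in4->sin_addr));
	return in6;
}
```

Connecting an AF_INET6 socket with IPV6_V6ONLY cleared to such an
address yields an IPv6 socket that carries IPv4 traffic, which is the
case the sockops prog exercises with IPPROTO_IP/IP_TOS.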
Assisted-by: Codex:gpt-5.5-xhigh
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
.../selftests/bpf/prog_tests/setget_sockopt.c | 78 +++++++++++++++++++
.../selftests/bpf/progs/setget_sockopt.c | 23 ++++++
2 files changed, 101 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
index 77fe1bfb7504..4e91d9b615ce 100644
--- a/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
+++ b/tools/testing/selftests/bpf/prog_tests/setget_sockopt.c
@@ -199,6 +199,83 @@ static void test_nonstandard_opt(int family)
bpf_link__destroy(getsockopt_link);
}
+static int connect_to_v4mapped_v6_fd(int server_fd)
+{
+ struct sockaddr_storage addr;
+ struct sockaddr_in *addr4 = (void *)&addr;
+ socklen_t addrlen = sizeof(addr);
+ struct sockaddr_in6 addr6 = {};
+ int fd = -1, v6only = 0, err;
+
+ err = getsockname(server_fd, (struct sockaddr *)&addr, &addrlen);
+ if (!ASSERT_OK(err, "getsockname"))
+ return -1;
+
+ fd = socket(AF_INET6, SOCK_STREAM, 0);
+ if (!ASSERT_GE(fd, 0, "socket"))
+ return -1;
+
+ err = settimeo(fd, 0);
+ if (!ASSERT_OK(err, "settimeo"))
+ goto err_out;
+
+ err = setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, sizeof(v6only));
+ if (!ASSERT_OK(err, "clear_v6only"))
+ goto err_out;
+
+ addr6.sin6_family = AF_INET6;
+ addr6.sin6_port = addr4->sin_port;
+ addr6.sin6_addr.s6_addr[10] = 0xff;
+ addr6.sin6_addr.s6_addr[11] = 0xff;
+ memcpy(&addr6.sin6_addr.s6_addr[12], &addr4->sin_addr, sizeof(addr4->sin_addr));
+
+ err = connect(fd, (struct sockaddr *)&addr6, sizeof(addr6));
+ if (!ASSERT_OK(err, "connect"))
+ goto err_out;
+
+ return fd;
+
+err_out:
+ close(fd);
+ return -1;
+}
+
+static void test_v4mapped_v6_ip_tos(void)
+{
+ struct setget_sockopt__bss *bss = skel->bss;
+ int sfd = -1, fd = -1, got = 0, exp = 0x1c;
+ socklen_t optlen;
+
+ memset(bss, 0, sizeof(*bss));
+ bss->v4mapped_v6_ip_tos_enable = 1;
+ bss->v4mapped_v6_ip_tos_ret = -1;
+ bss->v4mapped_v6_ip_tos_val = exp;
+
+ sfd = start_server(AF_INET, SOCK_STREAM, addr4_str, 0, 0);
+ if (!ASSERT_GE(sfd, 0, "start_server"))
+ goto err_out;
+
+ fd = connect_to_v4mapped_v6_fd(sfd);
+ if (!ASSERT_GE(fd, 0, "connect_to_v4mapped_v6_fd"))
+ goto err_out;
+
+ ASSERT_GT(bss->v4mapped_v6_ip_tos_cnt, 0, "v4mapped_v6_ip_tos_cnt");
+ ASSERT_EQ(bss->v4mapped_v6_ip_tos_ret, 0, "v4mapped_v6_ip_tos_ret");
+
+ optlen = sizeof(got);
+ if (!ASSERT_OK(getsockopt(fd, SOL_IP, IP_TOS, &got, &optlen), "getsockopt_ip_tos"))
+ goto err_out;
+
+ ASSERT_EQ(got, exp, "ip_tos");
+
+err_out:
+ bss->v4mapped_v6_ip_tos_enable = 0;
+ if (fd >= 0)
+ close(fd);
+ if (sfd >= 0)
+ close(sfd);
+}
+
void test_setget_sockopt(void)
{
cg_fd = test__join_cgroup(CG_NAME);
@@ -238,6 +315,7 @@ void test_setget_sockopt(void)
test_ktls(AF_INET);
test_nonstandard_opt(AF_INET);
test_nonstandard_opt(AF_INET6);
+ test_v4mapped_v6_ip_tos();
done:
setget_sockopt__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/setget_sockopt.c b/tools/testing/selftests/bpf/progs/setget_sockopt.c
index d330b1511979..636a7cd8e2fa 100644
--- a/tools/testing/selftests/bpf/progs/setget_sockopt.c
+++ b/tools/testing/selftests/bpf/progs/setget_sockopt.c
@@ -387,6 +387,24 @@ int _getsockopt(struct bpf_sockopt *ctx)
return 1;
}
+int v4mapped_v6_ip_tos_enable;
+int v4mapped_v6_ip_tos_ret;
+int v4mapped_v6_ip_tos_cnt;
+int v4mapped_v6_ip_tos_val;
+
+static void test_v4mapped_v6_ip_tos(struct bpf_sock_ops *skops)
+{
+ int tos = v4mapped_v6_ip_tos_val;
+
+ if (!v4mapped_v6_ip_tos_enable || skops->op != BPF_SOCK_OPS_TCP_CONNECT_CB)
+ return;
+ if (skops->family != AF_INET6)
+ return;
+
+ v4mapped_v6_ip_tos_cnt++;
+ v4mapped_v6_ip_tos_ret = bpf_setsockopt(skops, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
+}
+
SEC("sockops")
int skops_sockopt(struct bpf_sock_ops *skops)
{
@@ -401,6 +419,11 @@ int skops_sockopt(struct bpf_sock_ops *skops)
if (!sk)
return 1;
+ if (v4mapped_v6_ip_tos_enable) {
+ test_v4mapped_v6_ip_tos(skops);
+ return 1;
+ }
+
switch (skops->op) {
case BPF_SOCK_OPS_TCP_LISTEN_CB:
nr_listen += !(bpf_test_sockopt(skops, sk) ||
--
2.54.0
^ permalink raw reply related