* Re: [PATCH] gtp: disable BH before calling udp_tunnel_xmit_skb()
From: patchwork-bot+netdevbpf @ 2026-04-20 21:59 UTC (permalink / raw)
To: David CARLIER
Cc: pablo, laforge, andrew+netdev, edumazet, kuba, pabeni, bestswngs,
osmocom-net-gprs, netdev, linux-kernel, stable
In-Reply-To: <20260417055408.4667-1-devnexen@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Fri, 17 Apr 2026 06:54:08 +0100 you wrote:
> gtp_genl_send_echo_req() runs as a generic netlink doit handler in
> process context with BH not disabled. It calls udp_tunnel_xmit_skb(),
> which eventually invokes iptunnel_xmit() — that uses __this_cpu_inc/dec
> on softnet_data.xmit.recursion to track the tunnel xmit recursion level.
>
> Without local_bh_disable(), the task may migrate between
> dev_xmit_recursion_inc() and dev_xmit_recursion_dec(), breaking the
> per-CPU counter pairing. The result is stale or negative recursion
> levels that can later produce false-positive
> SKB_DROP_REASON_RECURSION_LIMIT drops on either CPU.
>
> [...]
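
The counter mismatch described above can be modelled in userspace. This is only a sketch: the array, the helper names and the explicit CPU arguments are hypothetical stand-ins for the per-CPU softnet_data.xmit.recursion counter and for task migration, not kernel code.

```c
#include <assert.h>

#define NR_CPUS 2

/* Hypothetical stand-in for the per-CPU xmit recursion counter. */
static int xmit_recursion[NR_CPUS];

/* Models __this_cpu_inc(): touches only the counter of the CPU the
 * task is running on at that moment. */
static void recursion_inc(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	xmit_recursion[cpu]++;
}

/* Models __this_cpu_dec() in the same way. */
static void recursion_dec(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	xmit_recursion[cpu]--;
}

/* One transmit. With BH disabled the task cannot migrate, so both
 * arguments are equal; without it they may differ. */
static void simulated_xmit(int cpu_at_inc, int cpu_at_dec)
{
	recursion_inc(cpu_at_inc);
	/* ... the tunnel transmit would run here ... */
	recursion_dec(cpu_at_dec);
}
```

Calling simulated_xmit(0, 0) leaves both counters at zero, while simulated_xmit(0, 1) leaves CPU 0 at 1 and CPU 1 at -1, which is the stale/negative pairing the commit message refers to.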
Here is the summary with links:
- gtp: disable BH before calling udp_tunnel_xmit_skb()
https://git.kernel.org/netdev/net/c/5638504a2aa9
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net v1] net/mlx5: Fix HCA caps leak on notifier init failure
From: patchwork-bot+netdevbpf @ 2026-04-20 21:59 UTC (permalink / raw)
To: Prathamesh Deshpande
Cc: saeedm, leon, cjubran, cratiu, tariqt, kuba, netdev, linux-rdma,
linux-kernel
In-Reply-To: <20260415005022.34764-1-prathameshdeshpande7@gmail.com>
Hello:
This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 15 Apr 2026 01:49:37 +0100 you wrote:
> mlx5_mdev_init() allocates HCA caps via mlx5_hca_caps_alloc() before
> calling mlx5_notifiers_init(). If notifier initialization fails, the
> error path jumps to err_hca_caps and skips mlx5_hca_caps_free(), leaking
> allocated caps.
>
> Add a dedicated unwind label for notifier-init failure that frees HCA
> caps before continuing the existing cleanup sequence.
>
> [...]
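
The unwind pattern described above can be sketched in plain C. All names here (dev_ctx, caps_alloc, notifiers_init, the labels) are hypothetical and only mirror the shape of the fix; this is not the mlx5 code.

```c
#include <assert.h>
#include <stdlib.h>

struct dev_ctx {
	void *caps;
	int notifiers_ready;
};

static int live_allocs;	/* outstanding allocations, for the demo only */

static int caps_alloc(struct dev_ctx *d)
{
	d->caps = malloc(64);
	if (!d->caps)
		return -1;
	live_allocs++;
	return 0;
}

static void caps_free(struct dev_ctx *d)
{
	assert(d != NULL);
	free(d->caps);
	d->caps = NULL;
	live_allocs--;
}

static int notifiers_init(struct dev_ctx *d, int fail)
{
	if (fail)
		return -1;
	d->notifiers_ready = 1;
	return 0;
}

static int dev_init(struct dev_ctx *d, int fail_notifiers)
{
	int err;

	err = caps_alloc(d);
	if (err)
		goto err_out;

	err = notifiers_init(d, fail_notifiers);
	if (err)
		goto err_caps_free;	/* dedicated label: caps exist, so free them */

	return 0;

err_caps_free:
	caps_free(d);
err_out:
	return err;
}
```

Each label undoes exactly the steps that succeeded before the failure, so a notifier failure falls through caps_free() before reaching the common exit.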
Here is the summary with links:
- [net,v1] net/mlx5: Fix HCA caps leak on notifier init failure
https://git.kernel.org/netdev/net/c/d03fc81a5795
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH 2/9] x86/extable: switch to using FIELD_GET_SIGNED()
From: David Laight @ 2026-04-20 22:00 UTC (permalink / raw)
To: Yury Norov
Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, x86, H. Peter Anvin, Andy Lutomirski,
Jonathan Cameron, David Lechner, Nuno Sá, Andy Shevchenko,
Ping-Ke Shih, Richard Cochran, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Alexandre Belloni,
Yury Norov, Rasmus Villemoes, Hans de Goede, Linus Walleij,
Sakari Ailus, Salah Triki, Achim Gratz, Ben Collins, linux-kernel,
linux-iio, linux-wireless, netdev, linux-rtc
In-Reply-To: <aeZf98xjbxdHvZOS@yury>
On Mon, 20 Apr 2026 13:18:47 -0400
Yury Norov <ynorov@nvidia.com> wrote:
> On Mon, Apr 20, 2026 at 01:24:28PM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 17, 2026 at 01:36:13PM -0400, Yury Norov wrote:
> > > The EX_DATA register is laid out such that EX_DATA_IMM occupies the MSB.
> > > It's done to make sure that FIELD_GET() will sign-extend the IMM
> > > field during extraction.
> > >
> > > To enforce that, all EX_DATA masks are made signed integers. This
> > > works, but relies on the particular implementation of FIELD_GET(),
> > > i.e. masking then shifting, not vice versa; and the particular
> > > placement of the fields in the register.
> >
> > I don't think the order of the mask and shift matters in this case. If
> > we were to first shift down and then mask, it would still work (after
> > all, the mask would also need to be shifted and would also get sign
> > extended, effectively ending up as -1).
>
> FIELD_GET() doesn't require mask to be signed when a reg is signed, so
> shifting mask may become zero-extended in an alternative implementation:
>
> (reg >> __bf_shf(mask)) & (mask >> __bf_shf(mask))
>
> This all is hypothetical, anyways.
>
> > But yes, this very much depends on the signed field being the topmost
> > field and including the MSB.
>
> This is the part I dislike most. This would look just like undefined
> behavior for the API user: depending on field placement or the type of
> the inputs, sometimes FIELD_GET() sign-extends the field, and sometimes
> not.
>
> We could likely force FIELD_GET() to treat both reg and mask as unsigned
> types, and state that explicitly in the documentation.
>
There is already a BUILD_BUG_ON((_mask) == 0); changing it to <= 0
will detect negative masks.
I think the only one is the x86 exception table.
FIELD_GET() casts the result to typeof(_mask) so the sign of 'reg'
shouldn't matter.
I just tried building with a compile-time check for reg being negative.
But there are too many false positives from FIELD_GET(mask, readl(addr))
and FIELD_GET(mask, READ_ONCE(var)).
The pre-processor expansions of those don't bear thinking about.
It's late now, but I will check how __unsigned_scalar_typeof() handles
variables with const or volatile qualifiers.
I think they go through the 'default' case, the same as bitfields.
David
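
For reference, a field extraction that sign-extends regardless of where the field sits in the register can be sketched as below. This is a userspace illustration with hypothetical helper names and fixed 32-bit types; it is not the kernel's FIELD_GET() or the proposed FIELD_GET_SIGNED() implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the field selected by a contiguous, non-zero mask,
 * zero-extended. */
static uint32_t field_get_u32(uint32_t mask, uint32_t reg)
{
	uint32_t lsb;

	assert(mask != 0);
	lsb = mask & (~mask + 1u);	/* lowest set bit of the mask */
	return (reg & mask) / lsb;	/* same as shifting down by the field offset */
}

/* Same extraction, then sign-extend from the field's top bit. */
static int32_t field_get_s32(uint32_t mask, uint32_t reg)
{
	uint32_t lsb = mask & (~mask + 1u);
	uint32_t low_mask = mask / lsb;			/* mask moved to bit 0 */
	uint32_t sign_bit = low_mask ^ (low_mask >> 1);	/* top bit of the field */
	uint32_t v = field_get_u32(mask, reg);

	/* The xor/subtract pair propagates the sign bit upwards. The final
	 * cast assumes two's complement conversion, as gcc and clang do. */
	return (int32_t)((v ^ sign_bit) - sign_bit);
}
```

With these helpers a 4-bit field of 0xF yields -1 whether the mask is 0x00000F00 or sits at the MSB, because the result does not depend on the mask's signedness or on the order of masking and shifting.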
* [PATCH net 0/8] Netfilter/IPVS fixes for net
From: Pablo Neira Ayuso @ 2026-04-20 22:02 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
Hi,
The following batch contains Netfilter/IPVS fixes for net:
1) nft_osf actually only supports IPv4, restrict it.
2) Address possible division by zero in nfnetlink_osf, from Xiang Mei.
3) Remove unsafe use of sprintf to fix possible buffer overflow
in the SIP NAT helper, from Florian Westphal.
4) Restrict xt_mac, xt_owner and xt_physdev to inet families only;
xt_realm is only for ipv4, otherwise null-pointer-deref is possible.
5) Use kfree_rcu() in nat core to release hooks, this can be an issue
once nfnetlink_hook gets support to dump NAT hook information, not
currently a real issue but better fix it now. From Florian Westphal.
6) Fix MTU checks in IPVS, from Yingnan Zhang.
7) Fix possible out-of-bounds when matching TCP options in
nfnetlink_osf, from Fernando Fernandez Mancera.
8) Fix potential null-ptr-deref in ttl check in nfnetlink_osf,
remove useless loop to fix this, also from Fernando.
This is a smaller batch; there are more patches pending in the queue
for another pull request as soon as this one is considered good enough.
AI might complain again about one more issue regarding big-endian
arches in osf, but this batch is targeting crash fixes for
osf at this stage.
Please, pull these changes from:
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf.git nf-26-04-20
Thanks.
----------------------------------------------------------------
The following changes since commit a663bac71a2f0b3ac6c373168ca57b2a6e6381aa:
net: mctp: fix don't require received header reserved bits to be zero (2026-04-20 11:46:57 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf.git tags/nf-26-04-20
for you to fetch changes up to 711987ba281fd806322a7cd244e98e2a81903114:
netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check (2026-04-20 23:45:44 +0200)
----------------------------------------------------------------
netfilter pull request 26-04-20
----------------------------------------------------------------
Fernando Fernandez Mancera (2):
netfilter: nfnetlink_osf: fix out-of-bounds read on option matching
netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check
Florian Westphal (1):
netfilter: conntrack: remove sprintf usage
Pablo Neira Ayuso (3):
netfilter: nft_osf: restrict it to ipv4
netfilter: xtables: restrict several matches to inet family
netfilter: nat: use kfree_rcu to release ops
Xiang Mei (1):
netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO
Yingnan Zhang (1):
ipvs: fix MTU check for GSO packets in tunnel mode
net/ipv4/netfilter/iptable_nat.c | 4 ++--
net/ipv6/netfilter/ip6table_nat.c | 4 ++--
net/netfilter/ipvs/ip_vs_xmit.c | 19 +++++++++++++----
net/netfilter/nf_nat_amanda.c | 2 +-
net/netfilter/nf_nat_core.c | 10 +++++----
net/netfilter/nf_nat_sip.c | 33 +++++++++++++++-------------
net/netfilter/nfnetlink_osf.c | 45 +++++++++++++++++----------------------
net/netfilter/nft_osf.c | 6 +++++-
net/netfilter/xt_mac.c | 34 +++++++++++++++++++----------
net/netfilter/xt_owner.c | 37 +++++++++++++++++++++-----------
net/netfilter/xt_physdev.c | 29 ++++++++++++++++---------
net/netfilter/xt_realm.c | 2 +-
12 files changed, 136 insertions(+), 89 deletions(-)
* [PATCH net 1/8] netfilter: nft_osf: restrict it to ipv4
From: Pablo Neira Ayuso @ 2026-04-20 22:02 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260420220215.111510-1-pablo@netfilter.org>
This expression only supports IPv4; restrict it.
Fixes: b96af92d6eaf ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
Acked-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nft_osf.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/net/netfilter/nft_osf.c b/net/netfilter/nft_osf.c
index 18003433476c..c02d5cb52143 100644
--- a/net/netfilter/nft_osf.c
+++ b/net/netfilter/nft_osf.c
@@ -28,6 +28,11 @@ static void nft_osf_eval(const struct nft_expr *expr, struct nft_regs *regs,
struct nf_osf_data data;
struct tcphdr _tcph;
+ if (nft_pf(pkt) != NFPROTO_IPV4) {
+ regs->verdict.code = NFT_BREAK;
+ return;
+ }
+
if (pkt->tprot != IPPROTO_TCP) {
regs->verdict.code = NFT_BREAK;
return;
@@ -114,7 +119,6 @@ static int nft_osf_validate(const struct nft_ctx *ctx,
switch (ctx->family) {
case NFPROTO_IPV4:
- case NFPROTO_IPV6:
case NFPROTO_INET:
hooks = (1 << NF_INET_LOCAL_IN) |
(1 << NF_INET_PRE_ROUTING) |
--
2.47.3
* [PATCH net 2/8] netfilter: nfnetlink_osf: fix divide-by-zero in OSF_WSS_MODULO
From: Pablo Neira Ayuso @ 2026-04-20 22:02 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260420220215.111510-1-pablo@netfilter.org>
From: Xiang Mei <xmei5@asu.edu>
nf_osf_match_one() computes ctx->window % f->wss.val in the
OSF_WSS_MODULO branch with no guard for f->wss.val == 0. A
CAP_NET_ADMIN user can add such a fingerprint via nfnetlink; a
subsequent matching TCP SYN divides by zero and panics the kernel.
Reject the bogus fingerprint in nfnl_osf_add_callback() above the
per-option for-loop. f->wss is per-fingerprint, not per-option, so
the check must run regardless of f->opt_num (including 0). Also
reject wss.wc >= OSF_WSS_MAX; nf_osf_match_one() already treats that
as "should not happen".
Crash:
Oops: divide error: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:nf_osf_match_one (net/netfilter/nfnetlink_osf.c:98)
Call Trace:
<IRQ>
nf_osf_match (net/netfilter/nfnetlink_osf.c:220)
xt_osf_match_packet (net/netfilter/xt_osf.c:32)
ipt_do_table (net/ipv4/netfilter/ip_tables.c:348)
nf_hook_slow (net/netfilter/core.c:622)
ip_local_deliver (net/ipv4/ip_input.c:265)
ip_rcv (include/linux/skbuff.h:1162)
__netif_receive_skb_one_core (net/core/dev.c:6181)
process_backlog (net/core/dev.c:6642)
__napi_poll (net/core/dev.c:7710)
net_rx_action (net/core/dev.c:7945)
handle_softirqs (kernel/softirq.c:622)
Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Suggested-by: Florian Westphal <fw@strlen.de>
Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Fernando Fernandez Mancera <fmancera@suse.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nfnetlink_osf.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c
index d64ce21c7b55..9de91fdd107c 100644
--- a/net/netfilter/nfnetlink_osf.c
+++ b/net/netfilter/nfnetlink_osf.c
@@ -320,6 +320,10 @@ static int nfnl_osf_add_callback(struct sk_buff *skb,
if (f->opt_num > ARRAY_SIZE(f->opt))
return -EINVAL;
+ if (f->wss.wc >= OSF_WSS_MAX ||
+ (f->wss.wc == OSF_WSS_MODULO && f->wss.val == 0))
+ return -EINVAL;
+
for (i = 0; i < f->opt_num; i++) {
if (!f->opt[i].length || f->opt[i].length > MAX_IPOPTLEN)
return -EINVAL;
--
2.47.3
* [PATCH net 3/8] netfilter: conntrack: remove sprintf usage
From: Pablo Neira Ayuso @ 2026-04-20 22:02 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260420220215.111510-1-pablo@netfilter.org>
From: Florian Westphal <fw@strlen.de>
Replace sprintf with scnprintf. The buffer sizes are expected to be large
enough to hold the result, so there is no need for snprintf plus an
overflow check.
Increase buffer size in mangle_content_len() while at it.
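
The difference in return value can be illustrated in userspace. bounded_printf() below is a hypothetical approximation of scnprintf() built on vsnprintf(): it returns the number of characters actually stored, not the number that would have been needed.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Format into buf, never writing more than size bytes, and return the
 * number of characters stored (excluding the terminating NUL). */
static int bounded_printf(char *buf, size_t size, const char *fmt, ...)
{
	va_list args;
	int n;

	assert(buf != NULL);
	if (size == 0)
		return 0;

	va_start(args, fmt);
	n = vsnprintf(buf, size, fmt, args);	/* would-be length */
	va_end(args);

	if (n < 0) {
		buf[0] = '\0';
		return 0;
	}
	/* On truncation only size - 1 characters were stored. */
	return (size_t)n >= size ? (int)(size - 1) : n;
}
```

A buffer of sizeof("65536") bytes holds at most five digits, so a ten-digit value is truncated and the returned length stays within the buffer; a buffer of sizeof("4294967295") bytes holds any 32-bit unsigned value in full.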
BUG: KASAN: stack-out-of-bounds in vsnprintf+0xea5/0x1270
Write of size 1 at addr [..]
vsnprintf+0xea5/0x1270
sprintf+0xb1/0xe0
mangle_content_len+0x1ac/0x280
nf_nat_sdp_session+0x1cc/0x240
process_sdp+0x8f8/0xb80
process_invite_request+0x108/0x2b0
process_sip_msg+0x5da/0xf50
sip_help_tcp+0x45e/0x780
nf_confirm+0x34d/0x990
[..]
Fixes: 9fafcd7b2032 ("[NETFILTER]: nf_conntrack/nf_nat: add SIP helper port")
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nf_nat_amanda.c | 2 +-
net/netfilter/nf_nat_sip.c | 33 ++++++++++++++++++---------------
2 files changed, 19 insertions(+), 16 deletions(-)
diff --git a/net/netfilter/nf_nat_amanda.c b/net/netfilter/nf_nat_amanda.c
index 98deef6cde69..8f1054920a85 100644
--- a/net/netfilter/nf_nat_amanda.c
+++ b/net/netfilter/nf_nat_amanda.c
@@ -50,7 +50,7 @@ static unsigned int help(struct sk_buff *skb,
return NF_DROP;
}
- sprintf(buffer, "%u", port);
+ snprintf(buffer, sizeof(buffer), "%u", port);
if (!nf_nat_mangle_udp_packet(skb, exp->master, ctinfo,
protoff, matchoff, matchlen,
buffer, strlen(buffer))) {
diff --git a/net/netfilter/nf_nat_sip.c b/net/netfilter/nf_nat_sip.c
index cf4aeb299bde..c845b6d1a2bd 100644
--- a/net/netfilter/nf_nat_sip.c
+++ b/net/netfilter/nf_nat_sip.c
@@ -68,25 +68,27 @@ static unsigned int mangle_packet(struct sk_buff *skb, unsigned int protoff,
}
static int sip_sprintf_addr(const struct nf_conn *ct, char *buffer,
+ size_t size,
const union nf_inet_addr *addr, bool delim)
{
if (nf_ct_l3num(ct) == NFPROTO_IPV4)
- return sprintf(buffer, "%pI4", &addr->ip);
+ return scnprintf(buffer, size, "%pI4", &addr->ip);
else {
if (delim)
- return sprintf(buffer, "[%pI6c]", &addr->ip6);
+ return scnprintf(buffer, size, "[%pI6c]", &addr->ip6);
else
- return sprintf(buffer, "%pI6c", &addr->ip6);
+ return scnprintf(buffer, size, "%pI6c", &addr->ip6);
}
}
static int sip_sprintf_addr_port(const struct nf_conn *ct, char *buffer,
+ size_t size,
const union nf_inet_addr *addr, u16 port)
{
if (nf_ct_l3num(ct) == NFPROTO_IPV4)
- return sprintf(buffer, "%pI4:%u", &addr->ip, port);
+ return scnprintf(buffer, size, "%pI4:%u", &addr->ip, port);
else
- return sprintf(buffer, "[%pI6c]:%u", &addr->ip6, port);
+ return scnprintf(buffer, size, "[%pI6c]:%u", &addr->ip6, port);
}
static int map_addr(struct sk_buff *skb, unsigned int protoff,
@@ -119,7 +121,7 @@ static int map_addr(struct sk_buff *skb, unsigned int protoff,
if (nf_inet_addr_cmp(&newaddr, addr) && newport == port)
return 1;
- buflen = sip_sprintf_addr_port(ct, buffer, &newaddr, ntohs(newport));
+ buflen = sip_sprintf_addr_port(ct, buffer, sizeof(buffer), &newaddr, ntohs(newport));
return mangle_packet(skb, protoff, dataoff, dptr, datalen,
matchoff, matchlen, buffer, buflen);
}
@@ -212,7 +214,7 @@ static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff,
&addr, true) > 0 &&
nf_inet_addr_cmp(&addr, &ct->tuplehash[dir].tuple.src.u3) &&
!nf_inet_addr_cmp(&addr, &ct->tuplehash[!dir].tuple.dst.u3)) {
- buflen = sip_sprintf_addr(ct, buffer,
+ buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer),
&ct->tuplehash[!dir].tuple.dst.u3,
true);
if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
@@ -229,7 +231,7 @@ static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff,
&addr, false) > 0 &&
nf_inet_addr_cmp(&addr, &ct->tuplehash[dir].tuple.dst.u3) &&
!nf_inet_addr_cmp(&addr, &ct->tuplehash[!dir].tuple.src.u3)) {
- buflen = sip_sprintf_addr(ct, buffer,
+ buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer),
&ct->tuplehash[!dir].tuple.src.u3,
false);
if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
@@ -247,7 +249,7 @@ static unsigned int nf_nat_sip(struct sk_buff *skb, unsigned int protoff,
htons(n) == ct->tuplehash[dir].tuple.dst.u.udp.port &&
htons(n) != ct->tuplehash[!dir].tuple.src.u.udp.port) {
__be16 p = ct->tuplehash[!dir].tuple.src.u.udp.port;
- buflen = sprintf(buffer, "%u", ntohs(p));
+ buflen = scnprintf(buffer, sizeof(buffer), "%u", ntohs(p));
if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
poff, plen, buffer, buflen)) {
nf_ct_helper_log(skb, ct, "cannot mangle rport");
@@ -418,7 +420,8 @@ static unsigned int nf_nat_sip_expect(struct sk_buff *skb, unsigned int protoff,
if (!nf_inet_addr_cmp(&exp->tuple.dst.u3, &exp->saved_addr) ||
exp->tuple.dst.u.udp.port != exp->saved_proto.udp.port) {
- buflen = sip_sprintf_addr_port(ct, buffer, &newaddr, port);
+ buflen = sip_sprintf_addr_port(ct, buffer, sizeof(buffer),
+ &newaddr, port);
if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
matchoff, matchlen, buffer, buflen)) {
nf_ct_helper_log(skb, ct, "cannot mangle packet");
@@ -438,8 +441,8 @@ static int mangle_content_len(struct sk_buff *skb, unsigned int protoff,
{
enum ip_conntrack_info ctinfo;
struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
+ char buffer[sizeof("4294967295")];
unsigned int matchoff, matchlen;
- char buffer[sizeof("65536")];
int buflen, c_len;
/* Get actual SDP length */
@@ -454,7 +457,7 @@ static int mangle_content_len(struct sk_buff *skb, unsigned int protoff,
&matchoff, &matchlen) <= 0)
return 0;
- buflen = sprintf(buffer, "%u", c_len);
+ buflen = scnprintf(buffer, sizeof(buffer), "%u", c_len);
return mangle_packet(skb, protoff, dataoff, dptr, datalen,
matchoff, matchlen, buffer, buflen);
}
@@ -491,7 +494,7 @@ static unsigned int nf_nat_sdp_addr(struct sk_buff *skb, unsigned int protoff,
char buffer[INET6_ADDRSTRLEN];
unsigned int buflen;
- buflen = sip_sprintf_addr(ct, buffer, addr, false);
+ buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer), addr, false);
if (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen,
sdpoff, type, term, buffer, buflen))
return 0;
@@ -509,7 +512,7 @@ static unsigned int nf_nat_sdp_port(struct sk_buff *skb, unsigned int protoff,
char buffer[sizeof("nnnnn")];
unsigned int buflen;
- buflen = sprintf(buffer, "%u", port);
+ buflen = scnprintf(buffer, sizeof(buffer), "%u", port);
if (!mangle_packet(skb, protoff, dataoff, dptr, datalen,
matchoff, matchlen, buffer, buflen))
return 0;
@@ -529,7 +532,7 @@ static unsigned int nf_nat_sdp_session(struct sk_buff *skb, unsigned int protoff
unsigned int buflen;
/* Mangle session description owner and contact addresses */
- buflen = sip_sprintf_addr(ct, buffer, addr, false);
+ buflen = sip_sprintf_addr(ct, buffer, sizeof(buffer), addr, false);
if (mangle_sdp_packet(skb, protoff, dataoff, dptr, datalen, sdpoff,
SDP_HDR_OWNER, SDP_HDR_MEDIA, buffer, buflen))
return 0;
--
2.47.3
* [PATCH net 4/8] netfilter: xtables: restrict several matches to inet family
From: Pablo Neira Ayuso @ 2026-04-20 22:02 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260420220215.111510-1-pablo@netfilter.org>
This is a partial revert of:
commit ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions")
to allow ipv4 and ipv6 only.
- xt_mac
- xt_owner
- xt_physdev
These extensions are not used by ebtables in userspace.
Moreover, xt_realm is only for ipv4, since dst->tclassid is ipv4
specific.
Fixes: ab4f21e6fb1c ("netfilter: xtables: use NFPROTO_UNSPEC in more extensions")
Reported-by: "Kito Xu (veritas501)" <hxzene@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/xt_mac.c | 34 +++++++++++++++++++++++-----------
net/netfilter/xt_owner.c | 37 +++++++++++++++++++++++++------------
net/netfilter/xt_physdev.c | 29 +++++++++++++++++++----------
net/netfilter/xt_realm.c | 2 +-
4 files changed, 68 insertions(+), 34 deletions(-)
diff --git a/net/netfilter/xt_mac.c b/net/netfilter/xt_mac.c
index 4798cd2ca26e..7fc5156825e4 100644
--- a/net/netfilter/xt_mac.c
+++ b/net/netfilter/xt_mac.c
@@ -36,25 +36,37 @@ static bool mac_mt(const struct sk_buff *skb, struct xt_action_param *par)
return ret;
}
-static struct xt_match mac_mt_reg __read_mostly = {
- .name = "mac",
- .revision = 0,
- .family = NFPROTO_UNSPEC,
- .match = mac_mt,
- .matchsize = sizeof(struct xt_mac_info),
- .hooks = (1 << NF_INET_PRE_ROUTING) | (1 << NF_INET_LOCAL_IN) |
- (1 << NF_INET_FORWARD),
- .me = THIS_MODULE,
+static struct xt_match mac_mt_reg[] __read_mostly = {
+ {
+ .name = "mac",
+ .family = NFPROTO_IPV4,
+ .match = mac_mt,
+ .matchsize = sizeof(struct xt_mac_info),
+ .hooks = (1 << NF_INET_PRE_ROUTING) |
+ (1 << NF_INET_LOCAL_IN) |
+ (1 << NF_INET_FORWARD),
+ .me = THIS_MODULE,
+ },
+ {
+ .name = "mac",
+ .family = NFPROTO_IPV6,
+ .match = mac_mt,
+ .matchsize = sizeof(struct xt_mac_info),
+ .hooks = (1 << NF_INET_PRE_ROUTING) |
+ (1 << NF_INET_LOCAL_IN) |
+ (1 << NF_INET_FORWARD),
+ .me = THIS_MODULE,
+ },
};
static int __init mac_mt_init(void)
{
- return xt_register_match(&mac_mt_reg);
+ return xt_register_matches(mac_mt_reg, ARRAY_SIZE(mac_mt_reg));
}
static void __exit mac_mt_exit(void)
{
- xt_unregister_match(&mac_mt_reg);
+ xt_unregister_matches(mac_mt_reg, ARRAY_SIZE(mac_mt_reg));
}
module_init(mac_mt_init);
diff --git a/net/netfilter/xt_owner.c b/net/netfilter/xt_owner.c
index 5bfb4843df66..8f2e57b2a586 100644
--- a/net/netfilter/xt_owner.c
+++ b/net/netfilter/xt_owner.c
@@ -127,26 +127,39 @@ owner_mt(const struct sk_buff *skb, struct xt_action_param *par)
return true;
}
-static struct xt_match owner_mt_reg __read_mostly = {
- .name = "owner",
- .revision = 1,
- .family = NFPROTO_UNSPEC,
- .checkentry = owner_check,
- .match = owner_mt,
- .matchsize = sizeof(struct xt_owner_match_info),
- .hooks = (1 << NF_INET_LOCAL_OUT) |
- (1 << NF_INET_POST_ROUTING),
- .me = THIS_MODULE,
+static struct xt_match owner_mt_reg[] __read_mostly = {
+ {
+ .name = "owner",
+ .revision = 1,
+ .family = NFPROTO_IPV4,
+ .checkentry = owner_check,
+ .match = owner_mt,
+ .matchsize = sizeof(struct xt_owner_match_info),
+ .hooks = (1 << NF_INET_LOCAL_OUT) |
+ (1 << NF_INET_POST_ROUTING),
+ .me = THIS_MODULE,
+ },
+ {
+ .name = "owner",
+ .revision = 1,
+ .family = NFPROTO_IPV6,
+ .checkentry = owner_check,
+ .match = owner_mt,
+ .matchsize = sizeof(struct xt_owner_match_info),
+ .hooks = (1 << NF_INET_LOCAL_OUT) |
+ (1 << NF_INET_POST_ROUTING),
+ .me = THIS_MODULE,
+ }
};
static int __init owner_mt_init(void)
{
- return xt_register_match(&owner_mt_reg);
+ return xt_register_matches(owner_mt_reg, ARRAY_SIZE(owner_mt_reg));
}
static void __exit owner_mt_exit(void)
{
- xt_unregister_match(&owner_mt_reg);
+ xt_unregister_matches(owner_mt_reg, ARRAY_SIZE(owner_mt_reg));
}
module_init(owner_mt_init);
diff --git a/net/netfilter/xt_physdev.c b/net/netfilter/xt_physdev.c
index 53997771013f..d2b0b52434fa 100644
--- a/net/netfilter/xt_physdev.c
+++ b/net/netfilter/xt_physdev.c
@@ -137,24 +137,33 @@ static int physdev_mt_check(const struct xt_mtchk_param *par)
return 0;
}
-static struct xt_match physdev_mt_reg __read_mostly = {
- .name = "physdev",
- .revision = 0,
- .family = NFPROTO_UNSPEC,
- .checkentry = physdev_mt_check,
- .match = physdev_mt,
- .matchsize = sizeof(struct xt_physdev_info),
- .me = THIS_MODULE,
+static struct xt_match physdev_mt_reg[] __read_mostly = {
+ {
+ .name = "physdev",
+ .family = NFPROTO_IPV4,
+ .checkentry = physdev_mt_check,
+ .match = physdev_mt,
+ .matchsize = sizeof(struct xt_physdev_info),
+ .me = THIS_MODULE,
+ },
+ {
+ .name = "physdev",
+ .family = NFPROTO_IPV6,
+ .checkentry = physdev_mt_check,
+ .match = physdev_mt,
+ .matchsize = sizeof(struct xt_physdev_info),
+ .me = THIS_MODULE,
+ },
};
static int __init physdev_mt_init(void)
{
- return xt_register_match(&physdev_mt_reg);
+ return xt_register_matches(physdev_mt_reg, ARRAY_SIZE(physdev_mt_reg));
}
static void __exit physdev_mt_exit(void)
{
- xt_unregister_match(&physdev_mt_reg);
+ xt_unregister_matches(physdev_mt_reg, ARRAY_SIZE(physdev_mt_reg));
}
module_init(physdev_mt_init);
diff --git a/net/netfilter/xt_realm.c b/net/netfilter/xt_realm.c
index 6df485f4403d..61b2f1e58d15 100644
--- a/net/netfilter/xt_realm.c
+++ b/net/netfilter/xt_realm.c
@@ -33,7 +33,7 @@ static struct xt_match realm_mt_reg __read_mostly = {
.matchsize = sizeof(struct xt_realm_info),
.hooks = (1 << NF_INET_POST_ROUTING) | (1 << NF_INET_FORWARD) |
(1 << NF_INET_LOCAL_OUT) | (1 << NF_INET_LOCAL_IN),
- .family = NFPROTO_UNSPEC,
+ .family = NFPROTO_IPV4,
.me = THIS_MODULE
};
--
2.47.3
* [PATCH net 5/8] netfilter: nat: use kfree_rcu to release ops
From: Pablo Neira Ayuso @ 2026-04-20 22:02 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260420220215.111510-1-pablo@netfilter.org>
Florian Westphal says:
"Historically this is not an issue, even for normal base hooks: the data
path doesn't use the original nf_hook_ops that are used to register the
callbacks.
However, in v5.14 I added the ability to dump the active netfilter
hooks from userspace.
This code will peek back into the nf_hook_ops that are available
at the tail of the pointer-array blob used by the datapath.
The nat hooks are special, because they are called indirectly from
the central nat dispatcher hook. They are currently invisible to
the nfnl hook dump subsystem though.
But once that changes the nat ops structures have to be deferred too."
Update nf_nat_register_fn() to deal with partial exposure of the hooks
on the error path, which can also be an issue for nfnetlink_hook.
Fixes: e2cf17d3774c ("netfilter: add new hook nfnl subsystem")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/ipv4/netfilter/iptable_nat.c | 4 ++--
net/ipv6/netfilter/ip6table_nat.c | 4 ++--
net/netfilter/nf_nat_core.c | 10 ++++++----
3 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/netfilter/iptable_nat.c b/net/ipv4/netfilter/iptable_nat.c
index a5db7c67d61b..625a1ca13b1b 100644
--- a/net/ipv4/netfilter/iptable_nat.c
+++ b/net/ipv4/netfilter/iptable_nat.c
@@ -79,7 +79,7 @@ static int ipt_nat_register_lookups(struct net *net)
while (i)
nf_nat_ipv4_unregister_fn(net, &ops[--i]);
- kfree(ops);
+ kfree_rcu(ops, rcu);
return ret;
}
}
@@ -100,7 +100,7 @@ static void ipt_nat_unregister_lookups(struct net *net)
for (i = 0; i < ARRAY_SIZE(nf_nat_ipv4_ops); i++)
nf_nat_ipv4_unregister_fn(net, &ops[i]);
- kfree(ops);
+ kfree_rcu(ops, rcu);
}
static int iptable_nat_table_init(struct net *net)
diff --git a/net/ipv6/netfilter/ip6table_nat.c b/net/ipv6/netfilter/ip6table_nat.c
index e119d4f090cc..5be723232df8 100644
--- a/net/ipv6/netfilter/ip6table_nat.c
+++ b/net/ipv6/netfilter/ip6table_nat.c
@@ -81,7 +81,7 @@ static int ip6t_nat_register_lookups(struct net *net)
while (i)
nf_nat_ipv6_unregister_fn(net, &ops[--i]);
- kfree(ops);
+ kfree_rcu(ops, rcu);
return ret;
}
}
@@ -102,7 +102,7 @@ static void ip6t_nat_unregister_lookups(struct net *net)
for (i = 0; i < ARRAY_SIZE(nf_nat_ipv6_ops); i++)
nf_nat_ipv6_unregister_fn(net, &ops[i]);
- kfree(ops);
+ kfree_rcu(ops, rcu);
}
static int ip6table_nat_table_init(struct net *net)
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 83b2b5e9759a..74ec224ce0d6 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -1222,9 +1222,11 @@ int nf_nat_register_fn(struct net *net, u8 pf, const struct nf_hook_ops *ops,
ret = nf_register_net_hooks(net, nat_ops, ops_count);
if (ret < 0) {
mutex_unlock(&nf_nat_proto_mutex);
- for (i = 0; i < ops_count; i++)
- kfree(nat_ops[i].priv);
- kfree(nat_ops);
+ for (i = 0; i < ops_count; i++) {
+ priv = nat_ops[i].priv;
+ kfree_rcu(priv, rcu_head);
+ }
+ kfree_rcu(nat_ops, rcu);
return ret;
}
@@ -1288,7 +1290,7 @@ void nf_nat_unregister_fn(struct net *net, u8 pf, const struct nf_hook_ops *ops,
}
nat_proto_net->nat_hook_ops = NULL;
- kfree(nat_ops);
+ kfree_rcu(nat_ops, rcu);
}
unlock:
mutex_unlock(&nf_nat_proto_mutex);
--
2.47.3
* [PATCH net 6/8] ipvs: fix MTU check for GSO packets in tunnel mode
From: Pablo Neira Ayuso @ 2026-04-20 22:02 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260420220215.111510-1-pablo@netfilter.org>
From: Yingnan Zhang <342144303@qq.com>
Currently, IPVS skips MTU checks for GSO packets by excluding them with
the !skb_is_gso(skb) condition. This creates problems when IPVS tunnel
mode encapsulates GSO packets with IPIP headers.
The issue manifests in two ways:
1. MTU violation after encapsulation:
When a GSO packet passes through IPVS tunnel mode, the original MTU
check is bypassed. After adding the IPIP tunnel header, the packet
size may exceed the outgoing interface MTU, leading to unexpected
fragmentation at the IP layer.
2. Fragmentation with problematic IP IDs:
When net.ipv4.vs.pmtu_disc=1 and a GSO packet with multiple segments
is fragmented after encapsulation, each segment gets a sequentially
incremented IP ID (0, 1, 2, ...). This happens because:
a) The GSO packet bypasses MTU check and gets encapsulated
b) At __ip_finish_output, the oversized GSO packet is split into
separate SKBs (one per segment), with IP IDs incrementing
c) Each SKB is then fragmented again based on the actual MTU
This sequential IP ID allocation differs from the expected behavior
and can cause issues with fragment reassembly and packet tracking.
Fix this by properly validating GSO packets using
skb_gso_validate_network_len(). This function correctly validates
whether the GSO segments will fit within the MTU after segmentation. If
validation fails, send an ICMP Fragmentation Needed message to enable
proper PMTU discovery.
Fixes: 4cdd34084d53 ("netfilter: nf_conntrack_ipv6: improve fragmentation handling")
Signed-off-by: Yingnan Zhang <342144303@qq.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/ipvs/ip_vs_xmit.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 0fb5162992e5..ce542ed4b013 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -102,6 +102,18 @@ __ip_vs_dst_check(struct ip_vs_dest *dest)
return dest_dst;
}
+/* Based on ip_exceeds_mtu(). */
+static bool ip_vs_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
+{
+ if (skb->len <= mtu)
+ return false;
+
+ if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
+ return false;
+
+ return true;
+}
+
static inline bool
__mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
{
@@ -111,10 +123,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
*/
if (IP6CB(skb)->frag_max_size > mtu)
return true; /* largest fragment violate MTU */
- }
- else if (skb->len > mtu && !skb_is_gso(skb)) {
+ } else if (ip_vs_exceeds_mtu(skb, mtu))
return true; /* Packet size violate MTU size */
- }
+
return false;
}
@@ -232,7 +243,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
return true;
if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
- skb->len > mtu && !skb_is_gso(skb) &&
+ ip_vs_exceeds_mtu(skb, mtu) &&
!ip_vs_iph_icmp(ipvsh))) {
icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
htonl(mtu));
--
2.47.3
^ permalink raw reply related
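The MTU decision described in the IPVS patch above can be sketched in user space. This is only a model under stated assumptions: struct pkt_model and both function names are hypothetical, and the segment check is a simplified stand-in for skb_gso_validate_network_len() that treats a segment as per-segment headers plus gso_size payload bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the few sk_buff properties the check needs. */
struct pkt_model {
	size_t len;      /* total packet length from the network header on */
	size_t hdr_len;  /* network + transport header bytes in each segment */
	size_t gso_size; /* payload bytes per segment; 0 means "not GSO" */
};

/* Simplified model of the GSO validation: does one segment fit the MTU? */
static bool model_gso_segs_fit(const struct pkt_model *p, size_t mtu)
{
	assert(p->gso_size > 0);
	return p->hdr_len + p->gso_size <= mtu;
}

/* Mirrors the shape of the new helper in the patch. */
static bool model_exceeds_mtu(const struct pkt_model *p, size_t mtu)
{
	if (p->len <= mtu)
		return false;

	/* A GSO packet is acceptable only if its segments fit. */
	if (p->gso_size > 0 && model_gso_segs_fit(p, mtu))
		return false;

	return true;
}
```

In this model a GSO packet with 40 header bytes and gso_size 1460 passes for an MTU of 1500 but is rejected for an MTU of 1480, whereas the old "!skb_is_gso(skb)" test would have let it through in both cases.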
* [PATCH net 7/8] netfilter: nfnetlink_osf: fix out-of-bounds read on option matching
From: Pablo Neira Ayuso @ 2026-04-20 22:02 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260420220215.111510-1-pablo@netfilter.org>
From: Fernando Fernandez Mancera <fmancera@suse.de>
In nf_osf_match(), the nf_osf_hdr_ctx structure is initialized once
and passed by reference to nf_osf_match_one() for each fingerprint
checked. During TCP option parsing, nf_osf_match_one() advances the
shared ctx->optp pointer.
If a fingerprint perfectly matches, the function returns early without
restoring ctx->optp to its initial state. If the user has configured
NF_OSF_LOGLEVEL_ALL, the loop continues to the next fingerprint.
However, because ctx->optp was not restored, the next call to
nf_osf_match_one() starts parsing from the end of the options buffer.
This causes subsequent matches to read garbage data and fail
immediately, making it impossible to log more than one match, or causing
incorrect matches to be logged.
Instead of using a shared ctx->optp pointer, pass the context as a
constant pointer and use a local pointer (optp) for TCP option
traversal. This makes nf_osf_match_one() strictly stateless from the
caller's perspective, ensuring every fingerprint check starts at the
correct option offset.
Fixes: 1a6a0951fc00 ("netfilter: nfnetlink_osf: add missing fmatch check")
Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nfnetlink_osf.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)
diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c
index 9de91fdd107c..9b209241029b 100644
--- a/net/netfilter/nfnetlink_osf.c
+++ b/net/netfilter/nfnetlink_osf.c
@@ -64,9 +64,9 @@ struct nf_osf_hdr_ctx {
static bool nf_osf_match_one(const struct sk_buff *skb,
const struct nf_osf_user_finger *f,
int ttl_check,
- struct nf_osf_hdr_ctx *ctx)
+ const struct nf_osf_hdr_ctx *ctx)
{
- const __u8 *optpinit = ctx->optp;
+ const __u8 *optp = ctx->optp;
unsigned int check_WSS = 0;
int fmatch = FMATCH_WRONG;
int foptsize, optnum;
@@ -95,17 +95,17 @@ static bool nf_osf_match_one(const struct sk_buff *skb,
check_WSS = f->wss.wc;
for (optnum = 0; optnum < f->opt_num; ++optnum) {
- if (f->opt[optnum].kind == *ctx->optp) {
+ if (f->opt[optnum].kind == *optp) {
__u32 len = f->opt[optnum].length;
- const __u8 *optend = ctx->optp + len;
+ const __u8 *optend = optp + len;
fmatch = FMATCH_OK;
- switch (*ctx->optp) {
+ switch (*optp) {
case OSFOPT_MSS:
- mss = ctx->optp[3];
+ mss = optp[3];
mss <<= 8;
- mss |= ctx->optp[2];
+ mss |= optp[2];
mss = ntohs((__force __be16)mss);
break;
@@ -113,7 +113,7 @@ static bool nf_osf_match_one(const struct sk_buff *skb,
break;
}
- ctx->optp = optend;
+ optp = optend;
} else
fmatch = FMATCH_OPT_WRONG;
@@ -156,9 +156,6 @@ static bool nf_osf_match_one(const struct sk_buff *skb,
}
}
- if (fmatch != FMATCH_OK)
- ctx->optp = optpinit;
-
return fmatch == FMATCH_OK;
}
--
2.47.3
^ permalink raw reply related
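The local-cursor approach from the nfnetlink_osf patch above can be illustrated with a small user-space model. All type and function names here are hypothetical, and the explicit bounds checks and the "must consume all options" rule are simplifications added for the sketch rather than a copy of the kernel logic. The point is only that the matcher takes a const context and walks a private pointer, so every fingerprint starts from the same offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct opt_spec { uint8_t kind; uint8_t length; };

struct finger_model {
	const struct opt_spec *opt;
	size_t opt_num;
};

struct hdr_ctx_model {
	const uint8_t *optp; /* start of the TCP options, never advanced */
	size_t optsize;
};

static bool match_one_model(const struct finger_model *f,
			    const struct hdr_ctx_model *ctx)
{
	const uint8_t *optp = ctx->optp; /* local cursor; ctx stays untouched */
	const uint8_t *end = ctx->optp + ctx->optsize;
	size_t i;

	for (i = 0; i < f->opt_num; i++) {
		/* Refuse to read or step past the end of the buffer. */
		if (optp >= end || f->opt[i].length == 0 ||
		    (size_t)(end - optp) < f->opt[i].length)
			return false;
		if (f->opt[i].kind != *optp)
			return false;
		optp += f->opt[i].length;
	}
	return optp == end;
}

/* Models the "log all matches" loop: every fingerprint is tried. */
static size_t count_matches_model(const struct finger_model *fingers,
				  size_t n, const struct hdr_ctx_model *ctx)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++)
		if (match_one_model(&fingers[i], ctx))
			hits++;
	return hits;
}
```

Because the context is const, a successful match cannot leave the cursor at the end of the buffer for the next fingerprint, which is the failure mode the commit message describes.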
* [PATCH net 8/8] netfilter: nfnetlink_osf: fix potential NULL dereference in ttl check
From: Pablo Neira Ayuso @ 2026-04-20 22:02 UTC (permalink / raw)
To: netfilter-devel; +Cc: davem, netdev, kuba, pabeni, edumazet, fw, horms
In-Reply-To: <20260420220215.111510-1-pablo@netfilter.org>
From: Fernando Fernandez Mancera <fmancera@suse.de>
The nf_osf_ttl() function accessed skb->dev to perform a local interface
address lookup without verifying that the device pointer was valid.
Additionally, the implementation utilized an in_dev_for_each_ifa_rcu
loop to match the packet source address against local interface
addresses. It assumed that packets from the same subnet should not see a
decrement on the initial TTL. A packet might appear to be from the same
subnet when it actually isn't, especially in modern environments with
containers and virtual switching.
Remove the device dereference and interface loop. Replace the logic with
a switch statement that evaluates the TTL according to the ttl_check.
Fixes: 11eeef41d5f6 ("netfilter: passive OS fingerprint xtables match")
Reported-by: Kito Xu (veritas501) <hxzene@gmail.com>
Closes: https://lore.kernel.org/netfilter-devel/20260414074556.2512750-1-hxzene@gmail.com/
Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
Reviewed-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
net/netfilter/nfnetlink_osf.c | 22 +++++++---------------
1 file changed, 7 insertions(+), 15 deletions(-)
diff --git a/net/netfilter/nfnetlink_osf.c b/net/netfilter/nfnetlink_osf.c
index 9b209241029b..acb753ec5697 100644
--- a/net/netfilter/nfnetlink_osf.c
+++ b/net/netfilter/nfnetlink_osf.c
@@ -31,26 +31,18 @@ EXPORT_SYMBOL_GPL(nf_osf_fingers);
static inline int nf_osf_ttl(const struct sk_buff *skb,
int ttl_check, unsigned char f_ttl)
{
- struct in_device *in_dev = __in_dev_get_rcu(skb->dev);
const struct iphdr *ip = ip_hdr(skb);
- const struct in_ifaddr *ifa;
- int ret = 0;
- if (ttl_check == NF_OSF_TTL_TRUE)
+ switch (ttl_check) {
+ case NF_OSF_TTL_TRUE:
return ip->ttl == f_ttl;
- if (ttl_check == NF_OSF_TTL_NOCHECK)
- return 1;
- else if (ip->ttl <= f_ttl)
+ break;
+ case NF_OSF_TTL_NOCHECK:
return 1;
-
- in_dev_for_each_ifa_rcu(ifa, in_dev) {
- if (inet_ifa_match(ip->saddr, ifa)) {
- ret = (ip->ttl == f_ttl);
- break;
- }
+ case NF_OSF_TTL_LESS:
+ default:
+ return ip->ttl <= f_ttl;
}
-
- return ret;
}
struct nf_osf_hdr_ctx {
--
2.47.3
^ permalink raw reply related
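The switch-based TTL evaluation from the patch above reduces to a pure function of the check mode and the two TTL values. A minimal user-space sketch follows; the enum and function names are hypothetical stand-ins for the NF_OSF_TTL_* modes and nf_osf_ttl(), and no claim is made about the numeric values of the real constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the three TTL check modes. */
enum ttl_check_model {
	TTL_MODEL_TRUE,    /* packet TTL must equal the fingerprint TTL */
	TTL_MODEL_LESS,    /* packet TTL must not exceed the fingerprint TTL */
	TTL_MODEL_NOCHECK, /* TTL is ignored */
};

static bool ttl_ok_model(enum ttl_check_model check,
			 uint8_t pkt_ttl, uint8_t f_ttl)
{
	switch (check) {
	case TTL_MODEL_TRUE:
		return pkt_ttl == f_ttl;
	case TTL_MODEL_NOCHECK:
		return true;
	case TTL_MODEL_LESS:
	default:
		/* Unknown modes fall back to the "less or equal" rule. */
		return pkt_ttl <= f_ttl;
	}
}
```

With no device or interface-address lookup involved, the result depends only on the arguments, so there is no pointer left to dereference.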
* [syzbot] [kvm?] [net?] [virt?] BUG: sleeping function called from invalid context in vhost_get_avail_idx
From: syzbot @ 2026-04-20 22:09 UTC (permalink / raw)
To: eperezma, jasowang, kvm, linux-kernel, mst, netdev,
syzkaller-bugs, virtualization
Hello,
syzbot found the following issue on:
HEAD commit: 8541d8f725c6 Merge tag 'mtd/for-7.1' of git://git.kernel.o..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=136454ce580000
kernel config: https://syzkaller.appspot.com/x/.config?x=7e54da1916e8d11f
dashboard link: https://syzkaller.appspot.com/bug?extid=6985cb8e543ea90ba8ee
compiler: gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15d264ce580000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=143ec1ba580000
Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/d900f083ada3/non_bootable_disk-8541d8f7.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/22dfea2c37c2/vmlinux-8541d8f7.xz
kernel image: https://storage.googleapis.com/syzbot-assets/e2f93ad68fe3/bzImage-8541d8f7.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+6985cb8e543ea90ba8ee@syzkaller.appspotmail.com
BUG: sleeping function called from invalid context at drivers/vhost/vhost.c:1527
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 6110, name: vhost-6109
preempt_count: 1, expected: 0
RCU nest depth: 0, expected: 0
2 locks held by vhost-6109/6110:
#0: ffff888055624cb0 (&vq->mutex/1){+.+.}-{4:4}, at: handle_tx+0x2d/0x160 drivers/vhost/net.c:971
#1: ffff888055620248 (&vq->mutex){+.+.}-{4:4}, at: vhost_net_busy_poll+0x9c/0x730 drivers/vhost/net.c:554
Preemption disabled at:
[<ffffffff88f1a006>] vhost_net_busy_poll+0x1c6/0x730 drivers/vhost/net.c:563
CPU: 0 UID: 0 PID: 6110 Comm: vhost-6109 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x100/0x190 lib/dump_stack.c:120
__might_resched.cold+0x1ec/0x232 kernel/sched/core.c:9162
__might_fault+0x8b/0x140 mm/memory.c:7322
vhost_get_avail_idx+0x31c/0x4f0 drivers/vhost/vhost.c:1527
vhost_vq_avail_empty drivers/vhost/vhost.c:3206 [inline]
vhost_vq_avail_empty+0xa9/0xe0 drivers/vhost/vhost.c:3199
vhost_net_busy_poll+0x297/0x730 drivers/vhost/net.c:574
vhost_net_tx_get_vq_desc drivers/vhost/net.c:610 [inline]
get_tx_bufs.constprop.0+0x338/0x600 drivers/vhost/net.c:650
handle_tx_copy+0x28c/0x12e0 drivers/vhost/net.c:778
handle_tx+0x139/0x160 drivers/vhost/net.c:985
vhost_run_work_list+0x183/0x220 drivers/vhost/vhost.c:454
vhost_task_fn+0x156/0x430 kernel/vhost_task.c:49
ret_from_fork+0x72b/0xd50 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
^ permalink raw reply
* Re: [PATCH iwl-net 3/4] ice: fix ready bitmap check for non-E822 devices
From: Jacob Keller @ 2026-04-20 22:10 UTC (permalink / raw)
To: Anthony Nguyen, Intel Wired LAN, netdev, Aleksandr Loktionov
Cc: Grzegorz Nitka, Timothy Miskell, Aleksandr Loktionov
In-Reply-To: <20260408-jk-even-more-e825c-fixes-v1-3-b959da91a81f@intel.com>
On 4/8/2026 11:46 AM, Jacob Keller wrote:
> The E800 hardware (apart from E810) has a ready bitmap for the PHY
> indicating which timestamp slots currently have an outstanding timestamp
> waiting to be read by software.
>
> This bitmap is checked in multiple places using
> ice_get_phy_tx_tstamp_ready():
>
> * ice_ptp_process_tx_tstamp() calls it to determine which timestamps to
> attempt reading from the PHY
> * ice_ptp_tx_tstamps_pending() calls it in a loop at the end of the
> miscellaneous IRQ to check if new timestamps came in while the interrupt
> handler was executing.
> * ice_ptp_maybe_trigger_tx_interrupt() calls it in the auxiliary work task
> to trigger a software interrupt in the event that the hardware logic
> gets stuck.
>
> For E82X devices, multiple PHYs share the same block, and the parameter
> passed to the ready bitmap is a block number associated with the given
> port. For E825-C devices, the PHYs have their own independent blocks and do
> not share, so the parameter passed needs to be the port number. For E810
> devices, the ice_get_phy_tx_tstamp_ready() always returns all 1s regardless
> of what port, since this hardware does not have a ready bitmap. Finally,
> for E830 devices, each PF has its own ready bitmap accessible via register,
> and the block parameter is unused.
>
> The first call correctly uses the Tx timestamp tracker block parameter to
> check the appropriate timestamp block. This works because the tracker is
> setup correctly for each timestamp device type.
>
> The second two callers behave incorrectly for all device types other than
> the older E822 devices. They both iterate in a loop using
> ICE_GET_QUAD_NUM() which is a macro only used by E822 devices. This logic
> is incorrect for devices other than the E822 devices.
>
> For E810 the calls would always return true, causing E810 devices to always
> attempt to trigger a software interrupt even when they have no reason to.
> For E830, this results in duplicate work as the ready bitmap is checked
> once per number of quads. Finally, for E825-C, this results in the pending
> checks failing to detect timestamps on ports other than the first two.
>
> Fix this by introducing a new hardware API function to ice_ptp_hw.c,
> ice_check_phy_tx_tstamp_ready(). This function will check if any timestamps
> are available and returns a positive value if any timestamps are pending.
> For E810, the function always returns false, so that the re-trigger checks
> never happen. For E830, check the ready bitmap just once. For E82x
> hardware, check each quad. Finally, for E825-C, check every port.
>
> The interface function returns an integer to enable reporting of error code
> if the driver is unable to read the ready bitmap. This enables callers to
> handle this case properly. The previous implementation assumed that
> timestamps are available if they failed to read the bitmap. This is
> problematic as it could lead to continuous software IRQ triggering if the
> PHY timestamp registers somehow become inaccessible.
>
> This change is especially important for E825-C devices, as the missing
> checks could leave a window open where a new timestamp could arrive while
> the existing timestamps aren't completed. As a result, the hardware
> threshold logic would not trigger a new interrupt. Without the check, the
> timestamp is left unhandled, and new timestamps will not cause an interrupt
> again until the timestamp is handled. Since both the interrupt check and
> the backup check in the auxiliary task do not function properly, the device
> may have Tx timestamps permanently stuck failing on a given port.
>
> The faulty checks originate from commit d938a8cca88a ("ice: Auxbus devices
> & driver for E822 TS") and commit 712e876371f8 ("ice: periodically kick Tx
> timestamp interrupt"), however at the time of the original coding, both
> functions only operated on E822 hardware. This is no longer the case, and
> hasn't been since the introduction of the ETH56G PHY model in commit
> 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products")
>
> Fixes: 7cab44f1c35f ("ice: Introduce ETH56G PHY model for E825C products")
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
> ---
> drivers/net/ethernet/intel/ice/ice_ptp_hw.h | 1 +
> drivers/net/ethernet/intel/ice/ice_ptp.c | 40 ++++------
> drivers/net/ethernet/intel/ice/ice_ptp_hw.c | 117 ++++++++++++++++++++++++++++
> 3 files changed, 132 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h
> index 9d7acc7eb2ce..1b58b054f4a5 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp_hw.h
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp_hw.h
> @@ -300,6 +300,7 @@ void ice_ptp_reset_ts_memory(struct ice_hw *hw);
> int ice_ptp_init_phc(struct ice_hw *hw);
> void ice_ptp_init_hw(struct ice_hw *hw);
> int ice_get_phy_tx_tstamp_ready(struct ice_hw *hw, u8 block, u64 *tstamp_ready);
> +int ice_check_phy_tx_tstamp_ready(struct ice_hw *hw);
> int ice_ptp_one_port_cmd(struct ice_hw *hw, u8 configured_port,
> enum ice_ptp_tmr_cmd configured_cmd);
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
> index ada42bcc4d0b..34906f972d17 100644
> --- a/drivers/net/ethernet/intel/ice/ice_ptp.c
> +++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
> @@ -2718,7 +2718,7 @@ static bool ice_any_port_has_timestamps(struct ice_pf *pf)
> bool ice_ptp_tx_tstamps_pending(struct ice_pf *pf)
> {
> struct ice_hw *hw = &pf->hw;
> - unsigned int i;
> + int ret;
>
> /* Check software indicator */
> switch (pf->ptp.tx_interrupt_mode) {
> @@ -2739,16 +2739,15 @@ bool ice_ptp_tx_tstamps_pending(struct ice_pf *pf)
> }
>
> /* Check hardware indicator */
> - for (i = 0; i < ICE_GET_QUAD_NUM(hw->ptp.num_lports); i++) {
> - u64 tstamp_ready = 0;
> - int err;
> -
> - err = ice_get_phy_tx_tstamp_ready(&pf->hw, i, &tstamp_ready);
> - if (err || tstamp_ready)
> - return true;
> + ret = ice_check_phy_tx_tstamp_ready(hw);
> + if (ret < 0) {
> + dev_dbg(ice_pf_to_dev(pf), "Unable to read PHY Tx timestamp ready bitmap, err %d\n",
> + ret);
> + /* Stop triggering IRQs if we're unable to read PHY */
> + return false;
> }
>
> - return false;
> + return ret;
Aleks requested that I clarify this return with a comment, since he
feels the implicit conversion to bool may be confusing. We do already
check if it's less than 0 above, which excludes converting negative
values to "true", but it may not be obvious. I am going to apply a minor
fixup when sending this to add a comment and make this return an
explicit boolean check with "ret > 0" which is equivalent.
Thanks,
Jake
^ permalink raw reply
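The fixup Jacob describes in the reply above, replacing the implicit int-to-bool conversion with an explicit "ret > 0", can be sketched as follows. The function name is hypothetical, and the int argument stands in for the return value of ice_check_phy_tx_tstamp_ready() as described in the commit message (negative on error, zero when nothing is pending, positive when timestamps are pending).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the tail of ice_ptp_tx_tstamps_pending(). */
static bool tstamps_pending_model(int ret)
{
	if (ret < 0) {
		/* Unable to read the PHY: report "nothing pending" so the
		 * caller stops re-triggering interrupts.
		 */
		return false;
	}

	/* Explicit boolean check; equivalent to returning ret here,
	 * since negative values were already handled above.
	 */
	return ret > 0;
}
```

The behavior is unchanged from the implicit conversion; the explicit comparison only makes it obvious that an error code can never be reported as "pending".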
* [PATCH bpf v4 0/2] bpf: guard sock_ops rtt_min against non-locked tcp_sock
From: Werner Kasselman @ 2026-04-20 22:16 UTC (permalink / raw)
To: bpf@vger.kernel.org, netdev@vger.kernel.org; +Cc: Werner Kasselman
In-Reply-To: <20260417023119.3830723-1-werner@verivus.com>
sock_ops ctx rewriting guards the direct tcp_sock field loads with
is_locked_tcp_sock, but rtt_min still used a raw load sequence. On
request_sock-backed sock_ops callbacks, that can read past the end of a
tcp_request_sock allocation.
This series switches rtt_min over to the shared guarded tcp_sock field
load helper and adds a tcpbpf runtime test that exercises the
same-register request_sock path.
v3 -> v4:
- reuse a shared guarded tcp_sock field load helper for rtt_min
- preserve the dst_reg == src_reg failure path that zeros the destination
register when the guard fails
- replace the weaker ctx_rewrite test with a runtime tcpbpf selftest that
exercises same-register request_sock access
Werner Kasselman (2):
bpf: guard sock_ops rtt_min against non-locked tcp_sock
selftests/bpf: cover same-reg sock_ops rtt_min request_sock access
net/core/filter.c | 39 ++++++++++---------
.../selftests/bpf/prog_tests/tcpbpf_user.c | 4 ++
.../selftests/bpf/progs/test_tcpbpf_kern.c | 14 +++++++
tools/testing/selftests/bpf/test_tcpbpf.h | 2 +
4 files changed, 40 insertions(+), 19 deletions(-)
--
2.43.0
^ permalink raw reply
* [PATCH bpf v4 1/2] bpf: guard sock_ops rtt_min against non-locked tcp_sock
From: Werner Kasselman @ 2026-04-20 22:16 UTC (permalink / raw)
To: bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: Werner Kasselman, stable@vger.kernel.org, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
Lawrence Brakmo, open list
In-Reply-To: <20260420221621.1441707-1-werner@verivus.com>
sock_ops_convert_ctx_access() reads rtt_min without the
is_locked_tcp_sock guard used for every other tcp_sock field. On
request_sock-backed sock_ops callbacks, sk points at a
tcp_request_sock and the converted load reads past the end of the
allocation.
Extract the guarded tcp_sock field load sequence into
SOCK_OPS_LOAD_TCP_SOCK_FIELD() and use it for the rtt_min access after
computing the sub-field offset with offsetof(struct minmax_sample, v).
Reusing the shared helper keeps rtt_min aligned with the other guarded
tcp_sock field loads and preserves the dst_reg == src_reg failure path
that zeros the destination register when the guard fails.
Found via AST-based call-graph analysis using sqry.
Fixes: 44f0e43037d3 ("bpf: Add support for reading sk_state and more")
Cc: stable@vger.kernel.org
Signed-off-by: Werner Kasselman <werner@verivus.com>
---
net/core/filter.c | 39 ++++++++++++++++++++-------------------
1 file changed, 20 insertions(+), 19 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index 78b548158fb0..b60f279c004a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10544,12 +10544,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
struct bpf_insn *insn = insn_buf;
int off;
-/* Helper macro for adding read access to tcp_sock or sock fields. */
-#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ) \
+/* Helper macro for adding guarded read access to tcp_sock fields. */
+#define SOCK_OPS_LOAD_TCP_SOCK_FIELD(FIELD_SIZE, FIELD_OFFSET) \
do { \
int fullsock_reg = si->dst_reg, reg = BPF_REG_9, jmp = 2; \
- BUILD_BUG_ON(sizeof_field(OBJ, OBJ_FIELD) > \
- sizeof_field(struct bpf_sock_ops, BPF_FIELD)); \
if (si->dst_reg == reg || si->src_reg == reg) \
reg--; \
if (si->dst_reg == reg || si->src_reg == reg) \
@@ -10557,7 +10555,7 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
if (si->dst_reg == si->src_reg) { \
*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, reg, \
offsetof(struct bpf_sock_ops_kern, \
- temp)); \
+ temp)); \
fullsock_reg = reg; \
jmp += 2; \
} \
@@ -10571,23 +10569,31 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
if (si->dst_reg == si->src_reg) \
*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg, \
offsetof(struct bpf_sock_ops_kern, \
- temp)); \
+ temp)); \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
struct bpf_sock_ops_kern, sk),\
si->dst_reg, si->src_reg, \
offsetof(struct bpf_sock_ops_kern, sk));\
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ, \
- OBJ_FIELD), \
+ *insn++ = BPF_LDX_MEM(FIELD_SIZE, \
si->dst_reg, si->dst_reg, \
- offsetof(OBJ, OBJ_FIELD)); \
+ FIELD_OFFSET); \
if (si->dst_reg == si->src_reg) { \
- *insn++ = BPF_JMP_A(1); \
+ *insn++ = BPF_JMP_A(2); \
*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg, \
offsetof(struct bpf_sock_ops_kern, \
- temp)); \
+ temp)); \
+ *insn++ = BPF_MOV64_IMM(si->dst_reg, 0); \
} \
} while (0)
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ) \
+ do { \
+ BUILD_BUG_ON(sizeof_field(OBJ, OBJ_FIELD) > \
+ sizeof_field(struct bpf_sock_ops, BPF_FIELD)); \
+ SOCK_OPS_LOAD_TCP_SOCK_FIELD(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),\
+ offsetof(OBJ, OBJ_FIELD)); \
+ } while (0)
+
#define SOCK_OPS_GET_SK() \
do { \
int fullsock_reg = si->dst_reg, reg = BPF_REG_9, jmp = 1; \
@@ -10829,14 +10835,9 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
sizeof(struct minmax));
BUILD_BUG_ON(sizeof(struct minmax) <
sizeof(struct minmax_sample));
-
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
- struct bpf_sock_ops_kern, sk),
- si->dst_reg, si->src_reg,
- offsetof(struct bpf_sock_ops_kern, sk));
- *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
- offsetof(struct tcp_sock, rtt_min) +
- sizeof_field(struct minmax_sample, t));
+ off = offsetof(struct tcp_sock, rtt_min) +
+ offsetof(struct minmax_sample, v);
+ SOCK_OPS_LOAD_TCP_SOCK_FIELD(BPF_W, off);
break;
case offsetof(struct bpf_sock_ops, bpf_sock_ops_cb_flags):
--
2.43.0
^ permalink raw reply related
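The offset computation in the patch above switches from sizeof_field(struct minmax_sample, t) to offsetof(struct minmax_sample, v). A small sketch of why the two agree for this layout: the structs below are hypothetical user-space mirrors (a sample holding a 32-bit time followed by a 32-bit value, as in the kernel's win_minmax header to the best of my knowledge), and tcp_sock_model is an invented placeholder, not the real struct tcp_sock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirrors of the kernel's windowed min/max structures. */
struct minmax_sample_model {
	uint32_t t; /* time the sample was taken */
	uint32_t v; /* sampled value */
};

struct minmax_model {
	struct minmax_sample_model s[3];
};

/* Invented placeholder; only the position of rtt_min matters here. */
struct tcp_sock_model {
	uint64_t other_state;
	struct minmax_model rtt_min;
};

/* New form: name the field that is actually being read. */
static size_t rtt_min_value_offset(void)
{
	return offsetof(struct tcp_sock_model, rtt_min) +
	       offsetof(struct minmax_sample_model, v);
}

/* Old form: skip over the size of the preceding field. */
static size_t rtt_min_value_offset_old(void)
{
	return offsetof(struct tcp_sock_model, rtt_min) +
	       sizeof(((struct minmax_sample_model *)0)->t);
}
```

Both forms yield the same byte offset as long as v immediately follows t with no padding; offsetof() simply states the intent directly and would stay correct if the layout ever changed.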
* [PATCH bpf v4 2/2] selftests/bpf: cover same-reg sock_ops rtt_min request_sock access
From: Werner Kasselman @ 2026-04-20 22:16 UTC (permalink / raw)
To: bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: Werner Kasselman, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Shuah Khan,
open list:KERNEL SELFTEST FRAMEWORK, open list
In-Reply-To: <20260420221621.1441707-1-werner@verivus.com>
Add a tcpbpf sock_ops selftest that forces a same-register
ctx->rtt_min read on request_sock-backed callbacks and verifies the
observed value is zero.
This covers the dst_reg == src_reg path that the previous
ctx_rewrite-only test did not exercise.
Signed-off-by: Werner Kasselman <werner@verivus.com>
---
.../testing/selftests/bpf/prog_tests/tcpbpf_user.c | 4 ++++
.../testing/selftests/bpf/progs/test_tcpbpf_kern.c | 14 ++++++++++++++
tools/testing/selftests/bpf/test_tcpbpf.h | 2 ++
3 files changed, 20 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
index 7e8fe1bad03f..1b08e49327d0 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
@@ -42,6 +42,10 @@ static void verify_result(struct tcpbpf_globals *result)
/* check getsockopt for window_clamp */
ASSERT_EQ(result->window_clamp_client, 9216, "window_clamp_client");
ASSERT_EQ(result->window_clamp_server, 9216, "window_clamp_server");
+
+ /* check same-reg rtt_min read on request_sock-backed callbacks */
+ ASSERT_NEQ(result->rtt_min_req_seen, 0, "rtt_min_req_seen");
+ ASSERT_EQ(result->rtt_min_req_nonzero, 0, "rtt_min_req_nonzero");
}
static void run_test(struct tcpbpf_globals *result)
diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
index 6935f32eeb8f..a488b282b5dd 100644
--- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
@@ -33,6 +33,7 @@ int bpf_testcb(struct bpf_sock_ops *skops)
{
char header[sizeof(struct ipv6hdr) + sizeof(struct tcphdr)];
struct bpf_sock_ops *reuse = skops;
+ long rtt_min = (long)skops;
struct tcphdr *thdr;
int window_clamp = 9216;
int save_syn = 1;
@@ -84,6 +85,19 @@ int bpf_testcb(struct bpf_sock_ops *skops)
global.event_map |= (1 << op);
+ if (!skops->is_fullsock &&
+ (op == BPF_SOCK_OPS_RWND_INIT || op == BPF_SOCK_OPS_NEEDS_ECN)) {
+ asm volatile (
+ "%[rtt_min] = *(u32 *)(%[rtt_min] + %[rtt_min_off]);\n"
+ : [rtt_min] "+r"(rtt_min)
+ : [rtt_min_off] "i"(offsetof(struct bpf_sock_ops, rtt_min))
+ :);
+
+ global.rtt_min_req_seen = 1;
+ if (rtt_min)
+ global.rtt_min_req_nonzero = 1;
+ }
+
switch (op) {
case BPF_SOCK_OPS_TCP_CONNECT_CB:
rv = bpf_setsockopt(skops, SOL_TCP, TCP_WINDOW_CLAMP,
diff --git a/tools/testing/selftests/bpf/test_tcpbpf.h b/tools/testing/selftests/bpf/test_tcpbpf.h
index 9dd9b5590f9d..e9806215cbc0 100644
--- a/tools/testing/selftests/bpf/test_tcpbpf.h
+++ b/tools/testing/selftests/bpf/test_tcpbpf.h
@@ -18,5 +18,7 @@ struct tcpbpf_globals {
__u32 tcp_saved_syn;
__u32 window_clamp_client;
__u32 window_clamp_server;
+ __u32 rtt_min_req_seen;
+ __u32 rtt_min_req_nonzero;
};
#endif
--
2.43.0
^ permalink raw reply related
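The behavior the selftest above checks, namely that a guarded rtt_min read observes zero when the callback is backed by a request_sock, can be modeled in plain C. This is only a sketch of the observable semantics: the struct and function names are hypothetical, and the real guard is a sequence of BPF instructions emitted by sock_ops_convert_ctx_access(), not a C function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the context seen by the converted load. */
struct sock_ops_model {
	bool is_locked_tcp_sock; /* false for request_sock-backed callbacks */
	uint32_t rtt_min;        /* only meaningful for a full tcp_sock */
};

/* Guarded read: never touch the field unless the guard holds. */
static uint32_t guarded_rtt_min_model(const struct sock_ops_model *ctx)
{
	if (!ctx->is_locked_tcp_sock)
		return 0; /* destination register is zeroed on guard failure */
	return ctx->rtt_min;
}
```

In the selftest's terms, rtt_min_req_nonzero stays 0 exactly when every guarded read on a non-fullsock callback takes the zeroing branch.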
* [PATCH bpf v5 1/2] bpf: guard sock_ops rtt_min against non-locked tcp_sock
From: Werner Kasselman @ 2026-04-20 23:00 UTC (permalink / raw)
To: bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: stable@vger.kernel.org, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
John Fastabend, Stanislav Fomichev, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Lawrence Brakmo,
open list
In-Reply-To: <20260420230030.2802408-1-werner@verivus.com>
sock_ops_convert_ctx_access() reads rtt_min without the
is_locked_tcp_sock guard used for every other tcp_sock field. On
request_sock-backed sock_ops callbacks, sk points at a
tcp_request_sock and the converted load reads past the end of the
allocation.
Extract the guarded tcp_sock field load sequence into
SOCK_OPS_LOAD_TCP_SOCK_FIELD() and use it for the rtt_min access after
computing the sub-field offset with offsetof(struct minmax_sample, v).
Reusing the shared helper keeps rtt_min aligned with the other guarded
tcp_sock field loads and preserves the dst_reg == src_reg failure path
that zeros the destination register when the guard fails.
Found via AST-based call-graph analysis using sqry.
Fixes: 44f0e43037d3 ("bpf: Add support for reading sk_state and more")
Cc: stable@vger.kernel.org
Signed-off-by: Werner Kasselman <werner@verivus.com>
---
net/core/filter.c | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/net/core/filter.c b/net/core/filter.c
index fcfcb72663ca..2e7c33d00749 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -10535,12 +10535,10 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
struct bpf_insn *insn = insn_buf;
int off;
-/* Helper macro for adding read access to tcp_sock or sock fields. */
-#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ) \
+/* Helper macro for adding guarded read access to tcp_sock fields. */
+#define SOCK_OPS_LOAD_TCP_SOCK_FIELD(FIELD_SIZE, FIELD_OFFSET) \
do { \
int fullsock_reg = si->dst_reg, reg = BPF_REG_9, jmp = 2; \
- BUILD_BUG_ON(sizeof_field(OBJ, OBJ_FIELD) > \
- sizeof_field(struct bpf_sock_ops, BPF_FIELD)); \
if (si->dst_reg == reg || si->src_reg == reg) \
reg--; \
if (si->dst_reg == reg || si->src_reg == reg) \
@@ -10548,7 +10546,7 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
if (si->dst_reg == si->src_reg) { \
*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, reg, \
offsetof(struct bpf_sock_ops_kern, \
- temp)); \
+ temp)); \
fullsock_reg = reg; \
jmp += 2; \
} \
@@ -10562,24 +10560,31 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
if (si->dst_reg == si->src_reg) \
*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg, \
offsetof(struct bpf_sock_ops_kern, \
- temp)); \
+ temp)); \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF( \
struct bpf_sock_ops_kern, sk),\
si->dst_reg, si->src_reg, \
offsetof(struct bpf_sock_ops_kern, sk));\
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ, \
- OBJ_FIELD), \
+ *insn++ = BPF_LDX_MEM(FIELD_SIZE, \
si->dst_reg, si->dst_reg, \
- offsetof(OBJ, OBJ_FIELD)); \
+ FIELD_OFFSET); \
if (si->dst_reg == si->src_reg) { \
*insn++ = BPF_JMP_A(2); \
*insn++ = BPF_LDX_MEM(BPF_DW, reg, si->src_reg, \
offsetof(struct bpf_sock_ops_kern, \
- temp)); \
+ temp)); \
*insn++ = BPF_MOV64_IMM(si->dst_reg, 0); \
} \
} while (0)
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ) \
+ do { \
+ BUILD_BUG_ON(sizeof_field(OBJ, OBJ_FIELD) > \
+ sizeof_field(struct bpf_sock_ops, BPF_FIELD)); \
+ SOCK_OPS_LOAD_TCP_SOCK_FIELD(BPF_FIELD_SIZEOF(OBJ, OBJ_FIELD),\
+ offsetof(OBJ, OBJ_FIELD)); \
+ } while (0)
+
#define SOCK_OPS_GET_SK() \
do { \
int fullsock_reg = si->dst_reg, reg = BPF_REG_9, jmp = 1; \
@@ -10822,14 +10827,9 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
sizeof(struct minmax));
BUILD_BUG_ON(sizeof(struct minmax) <
sizeof(struct minmax_sample));
-
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
- struct bpf_sock_ops_kern, sk),
- si->dst_reg, si->src_reg,
- offsetof(struct bpf_sock_ops_kern, sk));
- *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
- offsetof(struct tcp_sock, rtt_min) +
- sizeof_field(struct minmax_sample, t));
+ off = offsetof(struct tcp_sock, rtt_min) +
+ offsetof(struct minmax_sample, v);
+ SOCK_OPS_LOAD_TCP_SOCK_FIELD(BPF_W, off);
break;
case offsetof(struct bpf_sock_ops, bpf_sock_ops_cb_flags):
--
2.43.0
^ permalink raw reply related
* [PATCH bpf v5 0/2] bpf: guard sock_ops rtt_min against non-locked tcp_sock
From: Werner Kasselman @ 2026-04-20 23:00 UTC (permalink / raw)
To: bpf@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20260417023119.3830723-1-werner@verivus.com>
sock_ops ctx rewriting guards the direct tcp_sock field loads with
is_locked_tcp_sock, but rtt_min still used a raw load sequence. On
request_sock-backed sock_ops callbacks, that can read past the end of a
tcp_request_sock allocation.
This series switches rtt_min over to the shared guarded tcp_sock field
load helper and adds a tcpbpf runtime test that exercises the
same-register request_sock path.
v4 -> v5:
- rebase onto current origin/master to address CI conflict
- no functional changes beyond the rebase
Werner Kasselman (2):
bpf: guard sock_ops rtt_min against non-locked tcp_sock
selftests/bpf: cover same-reg sock_ops rtt_min request_sock access
net/core/filter.c | 36 +++++++++----------
.../selftests/bpf/prog_tests/tcpbpf_user.c | 4 +++
.../selftests/bpf/progs/test_tcpbpf_kern.c | 14 ++++++++
tools/testing/selftests/bpf/test_tcpbpf.h | 2 ++
4 files changed, 38 insertions(+), 18 deletions(-)
--
2.43.0
^ permalink raw reply
* [PATCH bpf v5 2/2] selftests/bpf: cover same-reg sock_ops rtt_min request_sock access
From: Werner Kasselman @ 2026-04-20 23:00 UTC (permalink / raw)
To: bpf@vger.kernel.org, netdev@vger.kernel.org
Cc: Andrii Nakryiko, Eduard Zingerman, Alexei Starovoitov,
Daniel Borkmann, Martin KaFai Lau, Kumar Kartikeya Dwivedi,
Song Liu, Yonghong Song, Jiri Olsa, Shuah Khan,
open list:KERNEL SELFTEST FRAMEWORK, open list
In-Reply-To: <20260420230030.2802408-1-werner@verivus.com>
Add a tcpbpf sock_ops selftest that forces a same-register ctx->rtt_min
read on request_sock-backed callbacks and verifies the observed value is
zero.
This covers the dst_reg == src_reg path that the previous
ctx_rewrite-only test did not exercise.
Signed-off-by: Werner Kasselman <werner@verivus.com>
---
.../testing/selftests/bpf/prog_tests/tcpbpf_user.c | 4 ++++
.../testing/selftests/bpf/progs/test_tcpbpf_kern.c | 14 ++++++++++++++
tools/testing/selftests/bpf/test_tcpbpf.h | 2 ++
3 files changed, 20 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
index 7e8fe1bad03f..1b08e49327d0 100644
--- a/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
+++ b/tools/testing/selftests/bpf/prog_tests/tcpbpf_user.c
@@ -42,6 +42,10 @@ static void verify_result(struct tcpbpf_globals *result)
/* check getsockopt for window_clamp */
ASSERT_EQ(result->window_clamp_client, 9216, "window_clamp_client");
ASSERT_EQ(result->window_clamp_server, 9216, "window_clamp_server");
+
+ /* check same-reg rtt_min read on request_sock-backed callbacks */
+ ASSERT_NEQ(result->rtt_min_req_seen, 0, "rtt_min_req_seen");
+ ASSERT_EQ(result->rtt_min_req_nonzero, 0, "rtt_min_req_nonzero");
}
static void run_test(struct tcpbpf_globals *result)
diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
index 6935f32eeb8f..a488b282b5dd 100644
--- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
+++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
@@ -33,6 +33,7 @@ int bpf_testcb(struct bpf_sock_ops *skops)
{
char header[sizeof(struct ipv6hdr) + sizeof(struct tcphdr)];
struct bpf_sock_ops *reuse = skops;
+ long rtt_min = (long)skops;
struct tcphdr *thdr;
int window_clamp = 9216;
int save_syn = 1;
@@ -84,6 +85,19 @@ int bpf_testcb(struct bpf_sock_ops *skops)
global.event_map |= (1 << op);
+ if (!skops->is_fullsock &&
+ (op == BPF_SOCK_OPS_RWND_INIT || op == BPF_SOCK_OPS_NEEDS_ECN)) {
+ asm volatile (
+ "%[rtt_min] = *(u32 *)(%[rtt_min] + %[rtt_min_off]);\n"
+ : [rtt_min] "+r"(rtt_min)
+ : [rtt_min_off] "i"(offsetof(struct bpf_sock_ops, rtt_min))
+ :);
+
+ global.rtt_min_req_seen = 1;
+ if (rtt_min)
+ global.rtt_min_req_nonzero = 1;
+ }
+
switch (op) {
case BPF_SOCK_OPS_TCP_CONNECT_CB:
rv = bpf_setsockopt(skops, SOL_TCP, TCP_WINDOW_CLAMP,
diff --git a/tools/testing/selftests/bpf/test_tcpbpf.h b/tools/testing/selftests/bpf/test_tcpbpf.h
index 9dd9b5590f9d..e9806215cbc0 100644
--- a/tools/testing/selftests/bpf/test_tcpbpf.h
+++ b/tools/testing/selftests/bpf/test_tcpbpf.h
@@ -18,5 +18,7 @@ struct tcpbpf_globals {
__u32 tcp_saved_syn;
__u32 window_clamp_client;
__u32 window_clamp_server;
+ __u32 rtt_min_req_seen;
+ __u32 rtt_min_req_nonzero;
};
#endif
--
2.43.0
^ permalink raw reply related
* Re: [PATCH] connector/Kconfig: Enable CONFIG_CONNECTOR by default
From: Qais Yousef @ 2026-04-20 23:04 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Paolo Abeni, netdev,
linux-kernel, Vincent Guittot, John Stultz, Steven Rostedt,
Ingo Molnar, Peter Zijlstra
In-Reply-To: <20260420131847.75248693@kernel.org>
+Ingo and Peter
On 04/20/26 13:18, Jakub Kicinski wrote:
> On Sun, 19 Apr 2026 22:42:17 +0100 Qais Yousef wrote:
> > To make new tools that depend on it like schedqos [1] more reliable, it
> > is important to ensure users can find it by default on all systems.
>
> If scheduler maintainers think this is appropriate they should take
> this patch via their tree (please). connector falls under networking
> for historical reasons (it's Netlink based) but we lack the context
> necessary to apply a "default y" patch of this nature.
I see, I didn't add them, but I'll resend with them added.
>
> default y should be used if the symbol is necessary for most Linux
> users across use cases and architectures. It's not obvious to me
Hmm, I am not aware of such rules. My understanding is that it should be
generally useful and have no drawback, which is what I understood this to be.
What is the cost of enabling this? It seems to be a feature that distros
widely enable in general.
> that that is the case here. The commit message links to a tool
> which is less than a week old?
It is a chicken-and-egg problem. We want to add sched QoS support, and it
relies on netlink to monitor tasks as they are created and tag them with QoS.
If we can't make sure this is available on all systems by default (i.e. users
must consciously opt out of this option), we will end up with inconsistencies.
I hit this when we added UCLAMP: it took Debian approximately two years to
decide to enable it by default after a feature request was made.
^ permalink raw reply
* [RESEND PATCH] connector/Kconfig: Enable CONFIG_CONNECTOR by default
From: Qais Yousef @ 2026-04-20 23:06 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, netdev, linux-kernel, Vincent Guittot, John Stultz,
Steven Rostedt, Qais Yousef
To make new tools that depend on it like schedqos [1] more reliable, it
is important to ensure users can find it by default on all systems.
[1] https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
Signed-off-by: Qais Yousef <qyousef@layalina.io>
---
Resending with sched maintainers as requested by Jakub
https://lore.kernel.org/lkml/20260420230422.mdfy4icsuhjh7fe3@airbuntu/T/#mea6edb96570c5d5aa2af43a5d11f7c775d0c4dca
drivers/connector/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/connector/Kconfig b/drivers/connector/Kconfig
index 0c2d2aa82d8c..bad247d47146 100644
--- a/drivers/connector/Kconfig
+++ b/drivers/connector/Kconfig
@@ -3,6 +3,7 @@
menuconfig CONNECTOR
tristate "Connector - unified userspace <-> kernelspace linker"
depends on NET
+ default y
help
This is unified userspace <-> kernelspace connector working on top
of the netlink socket protocol.
--
2.34.1
^ permalink raw reply related
* Re: [PATCH] idpf: do not perform flow ops when netdev is detached
From: Jacob Keller @ 2026-04-20 23:44 UTC (permalink / raw)
To: Li Li, Tony Nguyen, Przemek Kitszel, David S. Miller,
Jakub Kicinski, Eric Dumazet, intel-wired-lan
Cc: netdev, linux-kernel, David Decotigny, Anjali Singhai,
Sridhar Samudrala, Brian Vazquez, emil.s.tantilov
In-Reply-To: <20260419192555.3631327-1-boolli@google.com>
On 4/19/2026 12:25 PM, Li Li wrote:
> Even though commit 2e281e1155fc ("idpf: detach and close netdevs while
> handling a reset") prevents ethtool -N/-n operations from operating on
> detached netdevs, we found that out-of-tree workflows like OpenOnload
> can bypass ethtool core locks and call idpf_set_rxnfc directly during
> an idpf HW reset. When this happens, we could get kernel crashes like
> the following:
>
> [ 4045.787439] BUG: kernel NULL pointer dereference, address: 0000000000000070
> [ 4045.794420] #PF: supervisor read access in kernel mode
> [ 4045.799580] #PF: error_code(0x0000) - not-present page
> [ 4045.804739] PGD 0
> [ 4045.806772] Oops: Oops: 0000 [#1] SMP NOPTI
> ...
> [ 4045.836425] Workqueue: onload-wqueue oof_do_deferred_work_fn [onload]
> [ 4045.842926] RIP: 0010:idpf_del_flow_steer+0x24/0x170 [idpf]
> ...
> [ 4045.946323] Call Trace:
> [ 4045.948796] <TASK>
> [ 4045.950915] ? show_trace_log_lvl+0x1b0/0x2f0
> [ 4045.955293] ? show_trace_log_lvl+0x1b0/0x2f0
> [ 4045.959672] ? idpf_set_rxnfc+0x6f/0x80 [idpf]
> [ 4045.964142] ? __die_body.cold+0x8/0x12
> [ 4045.968000] ? page_fault_oops+0x148/0x160
> [ 4045.972117] ? exc_page_fault+0x6f/0x160
> [ 4045.976060] ? asm_exc_page_fault+0x22/0x30
> [ 4045.980262] ? idpf_del_flow_steer+0x24/0x170 [idpf]
> [ 4045.985245] idpf_set_rxnfc+0x6f/0x80 [idpf]
> [ 4045.989535] af_xdp_filter_remove+0x7c/0xb0 [sfc_resource]
> [ 4045.995069] oo_hw_filter_clear_hwports+0x6f/0xa0 [onload]
> [ 4046.000589] oo_hw_filter_update+0x65/0x210 [onload]
> [ 4046.005587] oof_hw_filter_update.constprop.0+0xe7/0x140 [onload]
> [ 4046.011716] oof_manager_update_all_filters+0xad/0x270 [onload]
> [ 4046.017671] __oof_do_deferred_work+0x15e/0x190 [onload]
> [ 4046.023014] oof_do_deferred_work+0x2c/0x40 [onload]
> [ 4046.028018] oof_do_deferred_work_fn+0x12/0x30 [onload]
> [ 4046.033277] process_one_work+0x174/0x330
> [ 4046.037304] worker_thread+0x246/0x390
> [ 4046.041074] ? __pfx_worker_thread+0x10/0x10
> [ 4046.045364] kthread+0xf6/0x240
> [ 4046.048530] ? __pfx_kthread+0x10/0x10
> [ 4046.052297] ret_from_fork+0x2d/0x50
> [ 4046.055896] ? __pfx_kthread+0x10/0x10
> [ 4046.059664] ret_from_fork_asm+0x1a/0x30
> [ 4046.063613] </TASK>
>
> To prevent this, we need to add checks in idpf_set_rxnfc and
> idpf_get_rxnfc to error out if the netdev is already detached.
>
> Tested: implemented the following patch to synthetically force idpf into
> a HW reset:
>
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> index 4fc0bb14c5b1..27476d57bcf0 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
> @@ -10,6 +10,9 @@
> #define idpf_tx_buf_next(buf) (*(u32 *)&(buf)->priv)
> LIBETH_SQE_CHECK_PRIV(u32);
>
Patchwork, and likely other git tools built around plain-text mail, do not
cope well with a diff embedded in the commit message. Could you please
resubmit with an updated commit message that doesn't simply insert the raw
diff? Perhaps you could indent the diff by a few spaces, or simply describe
what modifications were required to force the failure.
Also, you didn't mention the target tree, which I think should be iwl-net.
Thanks,
Jake
> +static bool SIMULATE_TX_TIMEOUT;
> +module_param(SIMULATE_TX_TIMEOUT, bool, 0644);
> +
> /**
> * idpf_chk_linearize - Check if skb exceeds max descriptors per packet
> * @skb: send buffer
> @@ -46,6 +49,8 @@ void idpf_tx_timeout(struct net_device *netdev, unsigned int txqueue)
>
> adapter->tx_timeout_count++;
>
> + SIMULATE_TX_TIMEOUT = false;
> +
> netdev_err(netdev, "Detected Tx timeout: Count %d, Queue %d\n",
> adapter->tx_timeout_count, txqueue);
> if (!idpf_is_reset_in_prog(adapter)) {
> @@ -2225,6 +2230,8 @@ static bool idpf_tx_clean_complq(struct idpf_compl_queue *complq, int budget,
> goto fetch_next_desc;
> }
> tx_q = complq->txq_grp->txqs[rel_tx_qid];
> + if (unlikely(SIMULATE_TX_TIMEOUT && (tx_q->idx % 2 == 1)))
> + goto fetch_next_desc;
>
> /* Determine completion type */
> ctype = le16_get_bits(tx_desc->common.qid_comptype_gen,
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
> index be66f9b2e101..ba5da2a86c15 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_virtchnl.c
> @@ -8,6 +8,9 @@
> #include "idpf_virtchnl.h"
> #include "idpf_ptp.h"
>
> +static bool VIRTCHNL_FAILED;
> +module_param(VIRTCHNL_FAILED, bool, 0644);
> +
> /**
> * struct idpf_vc_xn_manager - Manager for tracking transactions
> * @ring: backing and lookup for transactions
> @@ -3496,6 +3499,11 @@ int idpf_vc_core_init(struct idpf_adapter *adapter)
> switch (adapter->state) {
> case __IDPF_VER_CHECK:
> err = idpf_send_ver_msg(adapter);
> +
> + if (unlikely(VIRTCHNL_FAILED)) {
> + err = -EIO;
> + }
> +
> switch (err) {
> case 0:
> /* success, move state machine forward */
>
> And tested by writing 1 to /sys/module/idpf/parameters/VIRTCHNL_FAILED
> and /sys/module/idpf/parameters/SIMULATE_TX_TIMEOUT, and running
> idpf_get_rxnfc() right after the HW reset.
>
> Without the patch: encountered NULL pointer and kernel crash.
>
> With the patch: no crashes.
>
> Fixes: 2e281e1155fc ("idpf: detach and close netdevs while handling a reset")
> Signed-off-by: Li Li <boolli@google.com>
> ---
> drivers/net/ethernet/intel/idpf/idpf_ethtool.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
> index bb99d9e7c65d..8368a7e6a754 100644
> --- a/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
> +++ b/drivers/net/ethernet/intel/idpf/idpf_ethtool.c
> @@ -43,6 +43,9 @@ static int idpf_get_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd,
> unsigned int cnt = 0;
> int err = 0;
>
> + if (!netdev || !netif_device_present(netdev))
> + return -ENODEV;
> +
> idpf_vport_ctrl_lock(netdev);
> vport = idpf_netdev_to_vport(netdev);
> vport_config = np->adapter->vport_config[np->vport_idx];
> @@ -349,6 +352,9 @@ static int idpf_set_rxnfc(struct net_device *netdev, struct ethtool_rxnfc *cmd)
> {
> int ret = -EOPNOTSUPP;
>
> + if (!netdev || !netif_device_present(netdev))
> + return -ENODEV;
> +
> idpf_vport_ctrl_lock(netdev);
> switch (cmd->cmd) {
> case ETHTOOL_SRXCLSRLINS:
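The guard in the quoted patch can be modelled in userspace roughly as follows. This is a sketch under stated assumptions: the structs and functions are hypothetical stand-ins rather than the idpf or netdev definitions, and the "present" flag plays the role that netif_device_present() plays in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the idpf or netdev structures. */
struct model_vport { int num_rules; };

struct model_netdev {
	bool present;			/* cleared while the device is detached */
	struct model_vport *vport;	/* NULL during a reset in this model */
};

/* Unguarded flow op: dereferences vport unconditionally, like the crash path. */
static int model_del_flow_unguarded(struct model_netdev *dev)
{
	assert(dev->vport != NULL);	/* stands in for the NULL dereference */
	dev->vport->num_rules--;
	return 0;
}

/* Guarded entry point: bail out before touching per-vport state. */
static int model_set_rxnfc(struct model_netdev *dev)
{
	if (!dev || !dev->present)
		return -ENODEV;

	return model_del_flow_unguarded(dev);
}
```

A caller that bypasses the usual locking and reaches the entry point on a detached device gets -ENODEV back in this model instead of running into the unguarded dereference.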
^ permalink raw reply
* Re: [PATCH net v2 1/8] xsk: reject sw-csum UMEM binding to IFF_TX_SKB_NO_LINEAR devices
From: Jason Xing @ 2026-04-20 23:51 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: bpf, netdev, Jason Xing
In-Reply-To: <e4a22d20c77e94657d63243af39a0667.sdf.kernel@gmail.com>
On Tue, Apr 21, 2026 at 3:34 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > skb_checksum_help() is a common helper that writes the folded
> > 16-bit checksum back via skb->data + csum_start + csum_offset,
> > i.e. it relies on the skb's linear head and fails (with WARN_ONCE
> > and -EINVAL) when skb_headlen() is 0.
> >
> > AF_XDP generic xmit takes two very different paths depending on the
> > netdev. Drivers that advertise IFF_TX_SKB_NO_LINEAR (e.g. virtio_net)
> > skip the "copy payload into a linear head" step on purpose as a
> > performance optimisation: xsk_build_skb_zerocopy() only attaches UMEM
> > pages as frags and never calls skb_put(), so skb_headlen() stays 0
> > for the whole skb. For these skbs there is simply no linear area for
> > skb_checksum_help() to write the csum into - the sw-csum fallback is
> > structurally inapplicable.
> >
> > The patch tries to catch this and reject the combination with error at
> > setup time. Rejecting at bind() converts this silent per-packet failure
> > into a synchronous, actionable -EOPNOTSUPP at setup time. HW csum and
> > launch_time metadata on IFF_TX_SKB_NO_LINEAR drivers are unaffected
> > because they do not call skb_checksum_help().
> >
> > Without the patch, every descriptor carrying 'XDP_TX_METADATA |
> > XDP_TXMD_FLAGS_CHECKSUM' produces:
> > 1) a WARN_ONCE "offset (N) >= skb_headlen() (0)" from skb_checksum_help(),
> > 2) sendmsg() returning -EINVAL without consuming the descriptor
> > (invalid_descs is not incremented),
> > 3) a wedged TX ring: __xsk_generic_xmit() does not advance the
> > consumer on non-EOVERFLOW errors, so the next sendmsg() re-reads
> > the same descriptor and re-hits the same WARN until the socket
> > is closed.
> >
> > Closes: https://lore.kernel.org/all/20260419045822.843BFC2BCAF@smtp.kernel.org/#t
> > Fixes: 30c3055f9c0d ("xsk: wrap generic metadata handling onto separate function")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > net/xdp/xsk_buff_pool.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > index 37b7a68b89b3..c2521b6547e3 100644
> > --- a/net/xdp/xsk_buff_pool.c
> > +++ b/net/xdp/xsk_buff_pool.c
> > @@ -169,6 +169,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > if (force_zc && force_copy)
> > return -EINVAL;
> >
> > + if (pool->tx_sw_csum && (netdev->priv_flags & IFF_TX_SKB_NO_LINEAR))
> > + return -EOPNOTSUPP;
> > +
> > if (xsk_get_pool_from_qid(netdev, queue_id))
> > return -EBUSY;
> >
> > --
> > 2.41.3
> >
>
> Wondering whether a better fixes tag is commit 11614723af26 ("xsk: Add option
> to calculate TX checksum in SW")?
>
> Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Thanks for checking, but not really. Commit 30c3055f9c0d is the one that
brought csum support to the IFF_TX_SKB_NO_LINEAR case, which is where this
issue can be triggered (because this mode no longer puts data into the skb
linear area).
Thanks,
Jason
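To make the failure mode and the bind-time rejection discussed above concrete, here is a minimal userspace model. It is a sketch, not the kernel implementation: the structs, the MODEL_NO_LINEAR flag and both functions are hypothetical, and the bounds checks only approximate what the quoted commit message says about skb_checksum_help().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NO_LINEAR 0x1u	/* stand-in for IFF_TX_SKB_NO_LINEAR */

/* Hypothetical model of an skb: only the linear-head length matters here. */
struct model_skb {
	size_t headlen;		/* bytes in the linear head; 0 for frag-only skbs */
	size_t csum_start;
	size_t csum_offset;
};

struct model_pool { bool tx_sw_csum; };
struct model_xsk_netdev { unsigned int priv_flags; };

/* Software checksum needs the 16-bit checksum field inside the linear head. */
static int model_checksum_help(const struct model_skb *skb)
{
	size_t offset = skb->csum_start;

	if (offset >= skb->headlen)
		return -EINVAL;	/* checksum start lies outside the linear head */

	offset += skb->csum_offset;
	if (offset + 2 > skb->headlen)
		return -EINVAL;	/* no room to write the folded checksum */
	return 0;
}

/* Bind-time check: refuse the unsupported combination up front. */
static int model_assign_dev(const struct model_pool *pool,
			    const struct model_xsk_netdev *dev)
{
	assert(pool != NULL && dev != NULL);
	if (pool->tx_sw_csum && (dev->priv_flags & MODEL_NO_LINEAR))
		return -EOPNOTSUPP;
	return 0;
}
```

With a head length of 0, every per-packet checksum attempt fails in this model, which is why rejecting the combination once at setup time is the more actionable error.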
^ permalink raw reply
* Re: [PATCH net v2 3/8] xsk: fix use-after-free of xs->skb in xsk_build_skb() free_err path
From: Jason Xing @ 2026-04-21 0:01 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: bpf, netdev, Jason Xing
In-Reply-To: <eef5d62d475b0aeb93f6f84585f36972.sdf.kernel@gmail.com>
On Tue, Apr 21, 2026 at 3:34 AM Stanislav Fomichev <sdf.kernel@gmail.com> wrote:
>
> > From: Jason Xing <kernelxing@tencent.com>
> >
> > When xsk_build_skb() processes multi-buffer packets in copy mode, the
> > first descriptor stores data into the skb linear area without adding
> > any frags, so nr_frags stays at 0. The caller then sets xs->skb = skb
> > to accumulate subsequent descriptors.
> >
> > If a continuation descriptor fails (e.g. alloc_page returns NULL with
> > -EAGAIN), we jump to free_err where the condition:
> >
> > if (skb && !skb_shinfo(skb)->nr_frags)
> > kfree_skb(skb);
> >
> > evaluates to true because nr_frags is still 0 (the first descriptor
> > used the linear area, not frags). This frees the skb while xs->skb
> > still points to it, creating a dangling pointer. On the next transmit
> > attempt or socket close, xs->skb is dereferenced, causing a
> > use-after-free or double-free.
> >
> > Fix by adding a !xs->skb check to the condition, ensuring we only free
> > skbs that were freshly allocated in this call (xs->skb is NULL) and
> > never free an in-progress multi-buffer skb that the caller still
> > references.
> >
> > Closes: https://lore.kernel.org/all/20260415082654.21026-4-kerneljasonxing@gmail.com/
> > Fixes: 6b9c129c2f93 ("xsk: remove @first_frag from xsk_build_skb()")
> > Signed-off-by: Jason Xing <kernelxing@tencent.com>
> > ---
> > net/xdp/xsk.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 6521604f8d42..4fdd1a45a9bd 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -889,7 +889,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > return skb;
> >
> > free_err:
> > - if (skb && !skb_shinfo(skb)->nr_frags)
> > + if (skb && !xs->skb && !skb_shinfo(skb)->nr_frags)
> > kfree_skb(skb);
> >
> > if (err == -EOVERFLOW) {
> > --
> > 2.41.3
>
> Now "!skb_shinfo(skb)->nr_frags" feels redundant? It's either
> "skb && !xs->skb" and we own the kfree. or "xs->skb != NULL" and we
> want xsk_drop_skb? Or am I missing something?
You are right that it is redundant. I'm removing it now :)
At this stage, the job of this if statement is to identify the first
skb, so !xs->skb is a clear indicator, as you said.
Thanks,
Jason
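The ownership rule agreed on above can be sketched in userspace as follows. This is a model under assumptions, not the xsk code: the structs and model_free_err() are hypothetical, and the condition shown is the simplified "skb && !xs->skb" form without the nr_frags check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model; not the kernel's sk_buff or xdp_sock. */
struct model_skb { int nr_frags; };
struct model_xsk { struct model_skb *skb; /* in-progress multi-buffer skb */ };

/*
 * Error path of one build step. 'skb' is the buffer this call worked on.
 * Returns true when the buffer was freed here.
 */
static bool model_free_err(struct model_xsk *xs, struct model_skb *skb)
{
	assert(xs != NULL);

	/* Free only a buffer allocated by this call (no in-progress skb). */
	if (skb && !xs->skb) {
		free(skb);
		return true;
	}

	/* Otherwise xs->skb still references it; leave it to the drop path. */
	return false;
}
```

When a continuation descriptor fails, xs->skb is non-NULL, so the buffer is left alone and the stored pointer stays valid in this model; only a first-descriptor failure frees the buffer locally.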
^ permalink raw reply