* Re: [PATCH nf] netfilter: xt_TCPMSS: check skb_dst before path-MTU clamping
From: Florian Westphal @ 2026-04-18 19:58 UTC (permalink / raw)
To: Weiming Shi
Cc: Pablo Neira Ayuso, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Phil Sutter, Simon Horman, netfilter-devel, coreteam,
netdev, Xiang Mei
In-Reply-To: <20260418163057.2611503-2-bestswngs@gmail.com>
Weiming Shi <bestswngs@gmail.com> wrote:
> When TCPMSS with CLAMP_PMTU is used via nft_compat in a non-base
> chain, par->hook_mask is set to 0, bypassing the checkentry hook
> validation. The target can then run at PRE_ROUTING where skb_dst is
> NULL, causing a null-ptr-deref in tcpmss_mangle_packet():
>
> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> RIP: 0010:tcpmss_mangle_packet (include/net/dst.h:219 net/netfilter/xt_TCPMSS.c:105)
> tcpmss_tg4 (net/netfilter/xt_TCPMSS.c:202)
> nft_target_eval_xt (net/netfilter/nft_compat.c:87)
> nft_do_chain (net/netfilter/nf_tables_core.c:287)
> nf_hook_slow (net/netfilter/core.c:623)
>
> Check skb_dst() for NULL before calling dst_mtu().
FWIW I will apply this patch even though its wrong.
nft_compat.c is just too broken, I don't see how it can be
fixed in any reasonable amount of time.
validation is done too early, at expression instantiation
time.
This doesn't work because we have incomplete graph, it has
to be done at final table validation time.
But then all required compat info (xtables hints) is gone
and no longer available.
AFAICS the only way to resolve this is to cache the info in
the nft_expr priv area (WHERE IS ABSOLUTELY DOESN'T BELONG!)
because thats the only storage thewre is.
*puke*
^ permalink raw reply
* [PATCH net v2] selftests: netfilter: conntrack_sctp_collision.sh: Introduce SCTP INIT collision test
From: Yi Chen @ 2026-04-18 19:58 UTC (permalink / raw)
To: Yi Chen, Pablo Neira Ayuso, Florian Westphal, Phil Sutter,
Long Xin, David S . Miller, Jakub Kicinski, Eric Dumazet,
Paolo Abeni, Simon Horman, Shuah Khan
Cc: coreteam, netfilter-devel, linux-kselftest, linux-kernel, netdev
The existing test covered a scenario where a delayed INIT_ACK chunk
updates the vtag in conntrack after the association has already been
established.
A similar issue can occur with a delayed SCTP INIT chunk.
Add a new simultaneous-open test case where the client's INIT is
delayed, allowing conntrack to establish the association based on
the server-initiated handshake.
When the stale INIT arrives later, it may get recorded and cause a
following INIT_ACK from the peer to be accepted instead of dropped.
This INIT_ACK overwrites the vtag in conntrack, causing subsequent
SCTP DATA chunks to be considered as invalid and then dropped by
nft rules matching on ct state invalid.
This test verifies such stale INIT chunks do not cause problems.
Signed-off-by: Yi Chen <yiche.cy@gmail.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
---
v1 -> v2:
- Simplify conf_delay() by passing arguments.
- Avoid calling exit() inside the function.
- Enable nft log by setting net.netfilter.nf_log_all_netns=1.
- Add a description for the "ct invalid drop" rule match.
---
.../net/netfilter/conntrack_sctp_collision.sh | 85 ++++++++++++++-----
1 file changed, 63 insertions(+), 22 deletions(-)
diff --git a/tools/testing/selftests/net/netfilter/conntrack_sctp_collision.sh b/tools/testing/selftests/net/netfilter/conntrack_sctp_collision.sh
index d860f7d9744b..31823050391e 100755
--- a/tools/testing/selftests/net/netfilter/conntrack_sctp_collision.sh
+++ b/tools/testing/selftests/net/netfilter/conntrack_sctp_collision.sh
@@ -2,18 +2,32 @@
# SPDX-License-Identifier: GPL-2.0
#
# Testing For SCTP COLLISION SCENARIO as Below:
-#
+# 1. Stale INIT_ACK capture:
# 14:35:47.655279 IP CLIENT_IP.PORT > SERVER_IP.PORT: sctp (1) [INIT] [init tag: 2017837359]
# 14:35:48.353250 IP SERVER_IP.PORT > CLIENT_IP.PORT: sctp (1) [INIT] [init tag: 1187206187]
# 14:35:48.353275 IP CLIENT_IP.PORT > SERVER_IP.PORT: sctp (1) [INIT ACK] [init tag: 2017837359]
# 14:35:48.353283 IP SERVER_IP.PORT > CLIENT_IP.PORT: sctp (1) [COOKIE ECHO]
# 14:35:48.353977 IP CLIENT_IP.PORT > SERVER_IP.PORT: sctp (1) [COOKIE ACK]
# 14:35:48.855335 IP SERVER_IP.PORT > CLIENT_IP.PORT: sctp (1) [INIT ACK] [init tag: 164579970]
+# (Delayed)
+#
+# 2. Stale INIT capture:
+# 14:35:48.353250 IP SERVER_IP.PORT > CLIENT_IP.PORT: sctp (1) [INIT] [init tag: 1187206187]
+# 14:35:48.353275 IP CLIENT_IP.PORT > SERVER_IP.PORT: sctp (1) [INIT ACK] [init tag: 2017837359]
+# 14:35:48.353283 IP SERVER_IP.PORT > CLIENT_IP.PORT: sctp (1) [COOKIE ECHO]
+# 14:35:48.353977 IP CLIENT_IP.PORT > SERVER_IP.PORT: sctp (1) [COOKIE ACK]
+# 14:35:47.655279 IP CLIENT_IP.PORT > SERVER_IP.PORT: sctp (1) [INIT] [init tag: 2017837359]
+# (Delayed)
+# 14:35:48.855335 IP SERVER_IP.PORT > CLIENT_IP.PORT: sctp (1) [INIT ACK] [init tag: 164579970]
#
# TOPO: SERVER_NS (link0)<--->(link1) ROUTER_NS (link2)<--->(link3) CLIENT_NS
source lib.sh
+checktool "nft --version" "run test without nft"
+checktool "tc -h" "run test without tc"
+checktool "modprobe -q sctp" "load sctp module"
+
CLIENT_IP="198.51.200.1"
CLIENT_PORT=1234
@@ -24,7 +38,8 @@ CLIENT_GW="198.51.200.2"
SERVER_GW="198.51.100.2"
# setup the topo
-setup() {
+topo_setup() {
+ # setup_ns cleans up existing net namespaces first.
setup_ns CLIENT_NS SERVER_NS ROUTER_NS
ip -n "$SERVER_NS" link add link0 type veth peer name link1 netns "$ROUTER_NS"
ip -n "$CLIENT_NS" link add link3 type veth peer name link2 netns "$ROUTER_NS"
@@ -38,35 +53,53 @@ setup() {
ip -n "$ROUTER_NS" addr add $SERVER_GW/24 dev link1
ip -n "$ROUTER_NS" addr add $CLIENT_GW/24 dev link2
ip net exec "$ROUTER_NS" sysctl -wq net.ipv4.ip_forward=1
+ sysctl -wq net.netfilter.nf_log_all_netns=1
ip -n "$CLIENT_NS" link set link3 up
ip -n "$CLIENT_NS" addr add $CLIENT_IP/24 dev link3
ip -n "$CLIENT_NS" route add $SERVER_IP dev link3 via $CLIENT_GW
+}
- # simulate the delay on OVS upcall by setting up a delay for INIT_ACK with
- # tc on $SERVER_NS side
- tc -n "$SERVER_NS" qdisc add dev link0 root handle 1: htb r2q 64
- tc -n "$SERVER_NS" class add dev link0 parent 1: classid 1:1 htb rate 100mbit
- tc -n "$SERVER_NS" filter add dev link0 parent 1: protocol ip u32 match ip protocol 132 \
- 0xff match u8 2 0xff at 32 flowid 1:1
- if ! tc -n "$SERVER_NS" qdisc add dev link0 parent 1:1 handle 10: netem delay 1200ms; then
+conf_delay()
+{
+ # simulate the delay on OVS upcall by setting up a delay for INIT_ACK/INIT with
+ local ns=$1
+ local link=$2
+ local chunk_type=$3
+
+ # use a smaller number for assoc's max_retrans to reproduce the issue
+ ip net exec "$CLIENT_NS" sysctl -wq net.sctp.association_max_retrans=3
+
+ tc -n "$ns" qdisc add dev "$link" root handle 1: htb r2q 64
+ tc -n "$ns" class add dev "$link" parent 1: classid 1:1 htb rate 100mbit
+ tc -n "$ns" filter add dev "$link" parent 1: protocol ip \
+ u32 match ip protocol 132 0xff match u8 "$chunk_type" 0xff at 32 flowid 1:1
+ if ! tc -n "$ns" qdisc add dev "$link" parent 1:1 handle 10: netem delay 1200ms; then
echo "SKIP: Cannot add netem qdisc"
- exit $ksft_skip
+ return $ksft_skip
fi
# simulate the ctstate check on OVS nf_conntrack
- ip net exec "$ROUTER_NS" iptables -A FORWARD -m state --state INVALID,UNTRACKED -j DROP
- ip net exec "$ROUTER_NS" iptables -A INPUT -p sctp -j DROP
-
- # use a smaller number for assoc's max_retrans to reproduce the issue
- modprobe -q sctp
- ip net exec "$CLIENT_NS" sysctl -wq net.sctp.association_max_retrans=3
+ ip net exec "$ROUTER_NS" nft -f - <<-EOF
+ table ip t {
+ chain forward {
+ type filter hook forward priority filter; policy accept;
+ meta l4proto icmp counter accept
+ ct state new counter accept
+ ct state established,related counter accept
+ ct state invalid log flags all counter drop comment \
+ "Expect to drop stale INIT/INIT_ACK chunks"
+ counter
+ }
+ }
+ EOF
+ return 0
}
cleanup() {
- ip net exec "$CLIENT_NS" pkill sctp_collision >/dev/null 2>&1
- ip net exec "$SERVER_NS" pkill sctp_collision >/dev/null 2>&1
+ # cleanup_all_ns terminates running processes in the namespaces.
cleanup_all_ns
+ sysctl -wq net.netfilter.nf_log_all_netns=0
}
do_test() {
@@ -81,7 +114,15 @@ do_test() {
# run the test case
trap cleanup EXIT
-setup && \
-echo "Test for SCTP Collision in nf_conntrack:" && \
-do_test && echo "PASS!"
-exit $?
+
+echo "Test for SCTP INIT_ACK Collision in nf_conntrack:"
+(topo_setup && conf_delay $SERVER_NS link0 2) || exit $?
+if ! do_test; then
+ exit $ksft_fail
+fi
+
+echo "Test for SCTP INIT Collision in nf_conntrack:"
+(topo_setup && conf_delay $CLIENT_NS link3 1) || exit $?
+if ! do_test; then
+ exit $ksft_fail
+fi
--
2.53.0
^ permalink raw reply related
* Re: [PATCH net] net/packet: fix TOCTOU race on mmap'd vnet_hdr in tpacket_snd()
From: Willem de Bruijn @ 2026-04-18 20:17 UTC (permalink / raw)
To: Bingquan Chen, Willem de Bruijn, Greg KH
Cc: Stephen Hemminger, security, David S . Miller, Jakub Kicinski,
Eric Dumazet, netdev, Bingquan Chen
In-Reply-To: <20260418112006.78823-1-patzilla007@gmail.com>
Bingquan Chen wrote:
> In tpacket_snd(), when PACKET_VNET_HDR is enabled, vnet_hdr points
> directly into the mmap'd TX ring buffer shared with userspace. The
> kernel validates the header via __packet_snd_vnet_parse() but then
> re-reads all fields later in virtio_net_hdr_to_skb(). A concurrent
> userspace thread can modify the vnet_hdr fields between validation
> and use, bypassing all safety checks.
>
> The non-TPACKET path (packet_snd()) already correctly copies vnet_hdr
> to a stack-local variable. All other vnet_hdr consumers in the kernel
> (tun.c, tap.c, virtio_net.c) also use stack copies. The TPACKET TX
> path is the only caller of virtio_net_hdr_to_skb() that reads directly
> from user-controlled shared memory.
>
> Fix this by copying vnet_hdr from the mmap'd ring buffer to a
> stack-local variable before validation and use, consistent with the
> approach used in packet_snd() and all other callers.
>
> Fixes: 1d036d25e560 ("packet: tpacket_snd gso and checksum offload")
> Signed-off-by: Bingquan Chen <patzilla007@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
^ permalink raw reply
* Re: [PATCH net v2] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Daniel Borkmann @ 2026-04-18 22:19 UTC (permalink / raw)
To: Justin Iurman, kuba
Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
netdev
In-Reply-To: <7c715640-5c20-4226-9c31-d2c5eef551db@gmail.com>
On 4/18/26 2:40 PM, Justin Iurman wrote:
> On 4/18/26 14:15, Daniel Borkmann wrote:
>> Commit 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and
>> Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len}
>> and applied them in ip6_parse_tlv(), the generic TLV walker
>> invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts().
>>
>> ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv();
>> it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST
>> branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner
>> loop is bounded only by optlen, which can be up to 2048 bytes.
>> Stuffing the Destination Options header with 2046 Pad1 (type=0)
>> entries advances the scanner a single byte at a time, yielding
>> ~2000 TLV iterations per extension header.
>>
>> Reuse max_dst_opts_cnt to bound the TLV iterations, matching
>> the semantics from 47d3d7ac656a.
>>
>> Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and Destination options")
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---
>> v1->v2:
>> - Remove unlikely (Justin)
>> - Use abs() given max_dst_opts_cnt's negative meaning (Justin)
>>
>> net/ipv6/ip6_tunnel.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
>> index 907c6a2af331..0f50b7fcb24e 100644
>> --- a/net/ipv6/ip6_tunnel.c
>> +++ b/net/ipv6/ip6_tunnel.c
>> @@ -430,11 +430,16 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
>> break;
>> }
>> if (nexthdr == NEXTHDR_DEST) {
>> + int tlv_max = abs(READ_ONCE(init_net.ipv6.sysctl.max_dst_opts_cnt));
>> + int tlv_cnt = 0;
>> u16 i = 2;
>> while (1) {
>> struct ipv6_tlv_tnl_enc_lim *tel;
>> + if (tlv_cnt++ >= tlv_max)
>> + break;
>> +
>> /* No more room for encapsulation limit */
>> if (i + sizeof(*tel) > optlen)
>> break;
>
> Thanks for v2, Daniel.
>
> I'm still wondering: should we align the above parsing behavior with the one in ip6_parse_tlv() to keep things consistent? That is: don't increment tlv_cnt for Pad1/PadN, make sure we don't exceed 8 bytes per padding (consecutive Pad1's, or a PadN), and we could also check that a PadN payload is only made of zeroes. Open question...
Hm, so it would be sth like the below on top of this one, I'd wish we wouldn't
have to open code another ip6_parse_tlv().
Have you seen such cases with the encap limit in the wild (vs encap limit as
first tlv) where a stack places leading Pad1/PadN before encap limit which the
v2 patch wouldn't catch?
net/ipv6/ip6_tunnel.c | 42 +++++++++++++++++++++++++++++++-----------
1 file changed, 31 insertions(+), 11 deletions(-)
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 0f50b7fcb24e..1fa42a1cd97c 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -432,28 +432,48 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff *skb, __u8 *raw)
if (nexthdr == NEXTHDR_DEST) {
int tlv_max = abs(READ_ONCE(init_net.ipv6.sysctl.max_dst_opts_cnt));
int tlv_cnt = 0;
+ int padlen = 0;
u16 i = 2;
- while (1) {
- struct ipv6_tlv_tnl_enc_lim *tel;
+ while (i < optlen) {
+ struct ipv6_tlv_tnl_enc_lim *tel =
+ (struct ipv6_tlv_tnl_enc_lim *)(skb->data + off + i);
+ int tel_len;
- if (tlv_cnt++ >= tlv_max)
+ if (tel->type == IPV6_TLV_PAD1) {
+ if (++padlen > 7)
+ break;
+ i++;
+ continue;
+ }
+
+ if (i + 2 > optlen)
+ break;
+ tel_len = tel->length + 2;
+ if (i + tel_len > optlen)
break;
- /* No more room for encapsulation limit */
- if (i + sizeof(*tel) > optlen)
+ if (tel->type == IPV6_TLV_PADN) {
+ padlen += tel_len;
+ if (padlen > 7)
+ break;
+ if (memchr_inv((u8 *)tel + 2, 0,
+ tel->length))
+ break;
+ i += tel_len;
+ continue;
+ }
+ padlen = 0;
+
+ if (tlv_cnt++ >= tlv_max)
break;
- tel = (struct ipv6_tlv_tnl_enc_lim *)(skb->data + off + i);
/* return index of option if found and valid */
if (tel->type == IPV6_TLV_TNL_ENCAP_LIMIT &&
tel->length == 1)
return i + off - nhoff;
- /* else jump to next option */
- if (tel->type)
- i += tel->length + 2;
- else
- i++;
+
+ i += tel_len;
}
}
nexthdr = hdr->nexthdr;
--
2.43.0
> Otherwise, LGTM:
> Reviewed-by: Justin Iurman <justin.iurman@gmail.com>
Thanks,
Daniel
^ permalink raw reply related
* Re: [PATCH net v2] ipv6: Apply max_dst_opts_cnt to ip6_tnl_parse_tlv_enc_lim
From: Justin Iurman @ 2026-04-18 22:37 UTC (permalink / raw)
To: Daniel Borkmann, kuba
Cc: edumazet, dsahern, tom, willemdebruijn.kernel, idosch, pabeni,
netdev
In-Reply-To: <d8d60970-6f73-4b9d-8808-76f4f8023100@iogearbox.net>
On 4/19/26 00:19, Daniel Borkmann wrote:
> On 4/18/26 2:40 PM, Justin Iurman wrote:
>> On 4/18/26 14:15, Daniel Borkmann wrote:
>>> Commit 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and
>>> Destination options") added net.ipv6.max_{hbh,dst}_opts_{cnt,len}
>>> and applied them in ip6_parse_tlv(), the generic TLV walker
>>> invoked from ipv6_destopt_rcv() and ipv6_parse_hopopts().
>>>
>>> ip6_tnl_parse_tlv_enc_lim() does not go through ip6_parse_tlv();
>>> it has its own hand-rolled TLV scanner inside its NEXTHDR_DEST
>>> branch which looks for IPV6_TLV_TNL_ENCAP_LIMIT. That inner
>>> loop is bounded only by optlen, which can be up to 2048 bytes.
>>> Stuffing the Destination Options header with 2046 Pad1 (type=0)
>>> entries advances the scanner a single byte at a time, yielding
>>> ~2000 TLV iterations per extension header.
>>>
>>> Reuse max_dst_opts_cnt to bound the TLV iterations, matching
>>> the semantics from 47d3d7ac656a.
>>>
>>> Fixes: 47d3d7ac656a ("ipv6: Implement limits on Hop-by-Hop and
>>> Destination options")
>>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>>> ---
>>> v1->v2:
>>> - Remove unlikely (Justin)
>>> - Use abs() given max_dst_opts_cnt's negative meaning (Justin)
>>>
>>> net/ipv6/ip6_tunnel.c | 5 +++++
>>> 1 file changed, 5 insertions(+)
>>>
>>> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
>>> index 907c6a2af331..0f50b7fcb24e 100644
>>> --- a/net/ipv6/ip6_tunnel.c
>>> +++ b/net/ipv6/ip6_tunnel.c
>>> @@ -430,11 +430,16 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff
>>> *skb, __u8 *raw)
>>> break;
>>> }
>>> if (nexthdr == NEXTHDR_DEST) {
>>> + int tlv_max =
>>> abs(READ_ONCE(init_net.ipv6.sysctl.max_dst_opts_cnt));
>>> + int tlv_cnt = 0;
>>> u16 i = 2;
>>> while (1) {
>>> struct ipv6_tlv_tnl_enc_lim *tel;
>>> + if (tlv_cnt++ >= tlv_max)
>>> + break;
>>> +
>>> /* No more room for encapsulation limit */
>>> if (i + sizeof(*tel) > optlen)
>>> break;
>>
>> Thanks for v2, Daniel.
>>
>> I'm still wondering: should we align the above parsing behavior with
>> the one in ip6_parse_tlv() to keep things consistent? That is: don't
>> increment tlv_cnt for Pad1/PadN, make sure we don't exceed 8 bytes per
>> padding (consecutive Pad1's, or a PadN), and we could also check that
>> a PadN payload is only made of zeroes. Open question...
>
> Hm, so it would be sth like the below on top of this one, I'd wish we
> wouldn't
> have to open code another ip6_parse_tlv().
>
> Have you seen such cases with the encap limit in the wild (vs encap
> limit as
> first tlv) where a stack places leading Pad1/PadN before encap limit
> which the
> v2 patch wouldn't catch?
Nope. But if it happens, users would be confused as max_dst_opts_cnt
would not have the same meaning in two different code paths. OTOH, I
agree that such situation would look suspicious. I guess it's fine to
keep your patch as is and to not over-complicate things unnecessarily.
> net/ipv6/ip6_tunnel.c | 42 +++++++++++++++++++++++++++++++-----------
> 1 file changed, 31 insertions(+), 11 deletions(-)
>
> diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> index 0f50b7fcb24e..1fa42a1cd97c 100644
> --- a/net/ipv6/ip6_tunnel.c
> +++ b/net/ipv6/ip6_tunnel.c
> @@ -432,28 +432,48 @@ __u16 ip6_tnl_parse_tlv_enc_lim(struct sk_buff
> *skb, __u8 *raw)
> if (nexthdr == NEXTHDR_DEST) {
> int tlv_max =
> abs(READ_ONCE(init_net.ipv6.sysctl.max_dst_opts_cnt));
> int tlv_cnt = 0;
> + int padlen = 0;
> u16 i = 2;
>
> - while (1) {
> - struct ipv6_tlv_tnl_enc_lim *tel;
> + while (i < optlen) {
> + struct ipv6_tlv_tnl_enc_lim *tel =
> + (struct ipv6_tlv_tnl_enc_lim *)(skb->data + off + i);
> + int tel_len;
>
> - if (tlv_cnt++ >= tlv_max)
> + if (tel->type == IPV6_TLV_PAD1) {
> + if (++padlen > 7)
> + break;
> + i++;
> + continue;
> + }
> +
> + if (i + 2 > optlen)
> + break;
> + tel_len = tel->length + 2;
> + if (i + tel_len > optlen)
> break;
>
> - /* No more room for encapsulation limit */
> - if (i + sizeof(*tel) > optlen)
> + if (tel->type == IPV6_TLV_PADN) {
> + padlen += tel_len;
> + if (padlen > 7)
> + break;
> + if (memchr_inv((u8 *)tel + 2, 0,
> + tel->length))
> + break;
> + i += tel_len;
> + continue;
> + }
> + padlen = 0;
> +
> + if (tlv_cnt++ >= tlv_max)
> break;
>
> - tel = (struct ipv6_tlv_tnl_enc_lim *)(skb->data + off +
> i);
> /* return index of option if found and valid */
> if (tel->type == IPV6_TLV_TNL_ENCAP_LIMIT &&
> tel->length == 1)
> return i + off - nhoff;
> - /* else jump to next option */
> - if (tel->type)
> - i += tel->length + 2;
> - else
> - i++;
> +
> + i += tel_len;
> }
> }
> nexthdr = hdr->nexthdr;
^ permalink raw reply
* Re: [PATCH v2 00/19] tracepoint: Avoid double static_branch evaluation at guarded call sites
From: Steven Rostedt @ 2026-04-18 23:04 UTC (permalink / raw)
To: Vineeth Pillai (Google)
Cc: Peter Zijlstra, Dmitry Ilvokhin, Masami Hiramatsu,
Mathieu Desnoyers, Ingo Molnar, Jens Axboe, io-uring,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Alexei Starovoitov, Daniel Borkmann, Marcelo Ricardo Leitner,
Xin Long, Jon Maloy, Aaron Conole, Eelco Chaudron, Ilya Maximets,
netdev, bpf, linux-sctp, tipc-discussion, dev, Jiri Pirko,
Oded Gabbay, Koby Elbaz, dri-devel, Rafael J. Wysocki,
Viresh Kumar, Gautham R. Shenoy, Huang Rui, Mario Limonciello,
Len Brown, Srinivas Pandruvada, linux-pm, MyungJoo Ham,
Kyungmin Park, Chanwoo Choi, Christian König, Sumit Semwal,
linaro-mm-sig, Eddie James, Andrew Jeffery, Joel Stanley,
linux-fsi, David Airlie, Simona Vetter, Alex Deucher,
Danilo Krummrich, Matthew Brost, Philipp Stanner, Harry Wentland,
Leo Li, amd-gfx, Jiri Kosina, Benjamin Tissoires, linux-input,
Wolfram Sang, linux-i2c, Mark Brown, Michael Hennerich,
Nuno Sá, linux-spi, James E.J. Bottomley, Martin K. Petersen,
linux-scsi, Chris Mason, David Sterba, linux-btrfs,
Thomas Gleixner, Andrew Morton, SeongJae Park, linux-mm,
Borislav Petkov, Dave Hansen, x86, linux-trace-kernel,
linux-kernel
In-Reply-To: <20260323160052.17528-1-vineeth@bitbyteword.org>
On Mon, 23 Mar 2026 12:00:19 -0400
"Vineeth Pillai (Google)" <vineeth@bitbyteword.org> wrote:
> if (trace_foo_enabled() && cond)
> trace_call__foo(args); /* calls __do_trace_foo() directly */
Hi Vineeth,
Could you rebase this series on top of 7.1-rc1 when it comes out?
Several of these patches were accepted already. Obviously drop those.
They were the patches that added the feature, and any where the
maintainer acked the patch.
Now that the feature has been accepted, if you post the patch series
again after 7.1-rc1 with all the patches that haven't been accepted
yet, then the maintainers can simply take them directly. As the feature
is now accepted, there's no dependency on it, and they don't need to go
through the tracing tree.
Thanks,
-- Steve
^ permalink raw reply
* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Song Chen @ 2026-04-19 0:07 UTC (permalink / raw)
To: Petr Mladek
Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
jikos, mbenes, joe.lawrence, rostedt, mhiramat, mark.rutland,
mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <aeC7ByGA5MHBcGQR@pathway.suse.cz>
Hi,
On 4/16/26 18:33, Petr Mladek wrote:
> On Wed 2026-04-15 15:01:37, chensong_2000@189.cn wrote:
>> From: Song Chen <chensong_2000@189.cn>
>>
>> The current notifier chain implementation uses a single-linked list
>> (struct notifier_block *next), which only supports forward traversal
>> in priority order. This makes it difficult to handle cleanup/teardown
>> scenarios that require notifiers to be called in reverse priority order.
>>
>> A concrete example is the ordering dependency between ftrace and
>> livepatch during module load/unload. see the detail here [1].
>>
>> This patch replaces the single-linked list in struct notifier_block
>> with a struct list_head, converting the notifier chain into a
>> doubly-linked list sorted in descending priority order. Based on
>> this, a new function notifier_call_chain_reverse() is introduced,
>> which traverses the chain in reverse (ascending priority order).
>> The corresponding blocking_notifier_call_chain_reverse() is also
>> added as the locking wrapper for blocking notifier chains.
>>
>> The internal notifier_call_chain_robust() is updated to use
>> notifier_call_chain_reverse() for rollback: on error, it records
>> the failing notifier (last_nb) and the count of successfully called
>> notifiers (nr), then rolls back exactly those nr-1 notifiers in
>> reverse order starting from last_nb's predecessor, without needing
>> to know the total length of the chain.
>>
>> With this change, subsystems with symmetric setup/teardown ordering
>> requirements can register a single notifier_block with one priority
>> value, and rely on blocking_notifier_call_chain() for forward
>> traversal and blocking_notifier_call_chain_reverse() for reverse
>> traversal, without needing hard-coded call sequences or separate
>> notifier registrations for each direction.
>>
>> [1]:https://lore.kernel.org/all
>> /alpine.LNX.2.00.1602172216491.22700@cbobk.fhfr.pm/
>>
>> --- a/include/linux/notifier.h
>> +++ b/include/linux/notifier.h
>> @@ -53,41 +53,41 @@ typedef int (*notifier_fn_t)(struct notifier_block *nb,
> [...]
>> struct notifier_block {
>> notifier_fn_t notifier_call;
>> - struct notifier_block __rcu *next;
>> + struct list_head __rcu entry;
>> int priority;
>> };
> [...]
>> #define ATOMIC_INIT_NOTIFIER_HEAD(name) do { \
>> spin_lock_init(&(name)->lock); \
>> - (name)->head = NULL; \
>> + INIT_LIST_HEAD(&(name)->head); \
>
> I would expect the RCU variant here, aka INIT_LIST_HEAD_RCU().
I'm not familiar with list rcu, but i will look into it and give it a try.
>
>> --- a/kernel/notifier.c
>> +++ b/kernel/notifier.c
>> @@ -14,39 +14,47 @@
>> * are layered on top of these, with appropriate locking added.
>> */
>>
>> -static int notifier_chain_register(struct notifier_block **nl,
>> +static int notifier_chain_register(struct list_head *nl,
>> struct notifier_block *n,
>> bool unique_priority)
>> {
>> - while ((*nl) != NULL) {
>> - if (unlikely((*nl) == n)) {
>> + struct notifier_block *cur;
>> +
>> + list_for_each_entry(cur, nl, entry) {
>> + if (unlikely(cur == n)) {
>> WARN(1, "notifier callback %ps already registered",
>> n->notifier_call);
>> return -EEXIST;
>> }
>> - if (n->priority > (*nl)->priority)
>> - break;
>> - if (n->priority == (*nl)->priority && unique_priority)
>> +
>> + if (n->priority == cur->priority && unique_priority)
>> return -EBUSY;
>> - nl = &((*nl)->next);
>> +
>> + if (n->priority > cur->priority) {
>> + list_add_tail(&n->entry, &cur->entry);
>> + goto out;
>> + }
>> }
>> - n->next = *nl;
>> - rcu_assign_pointer(*nl, n);
>> +
>> + list_add_tail(&n->entry, nl);
>
> I would expect list_add_tail_rcu() here.
>
>> @@ -59,25 +67,25 @@ static int notifier_chain_unregister(struct notifier_block **nl,
>> * value of this parameter is -1.
>> * @nr_calls: Records the number of notifications sent. Don't care
>> * value of this field is NULL.
>> + * @last_nb: Records the last called notifier block for rolling back
>> * Return: notifier_call_chain returns the value returned by the
>> * last notifier function called.
>> */
>> -static int notifier_call_chain(struct notifier_block **nl,
>> +static int notifier_call_chain(struct list_head *nl,
>> unsigned long val, void *v,
>> - int nr_to_call, int *nr_calls)
>> + int nr_to_call, int *nr_calls,
>> + struct notifier_block **last_nb)
>> {
>> int ret = NOTIFY_DONE;
>> - struct notifier_block *nb, *next_nb;
>> -
>> - nb = rcu_dereference_raw(*nl);
>> + struct notifier_block *nb;
>>
>> - while (nb && nr_to_call) {
>> - next_nb = rcu_dereference_raw(nb->next);
>> + if (!nr_to_call)
>> + return ret;
>>
>> + list_for_each_entry(nb, nl, entry) {
>
> I would expect the RCU variant here, aka list_for_each_rcu()
>
> These are just two random examples which I found by a quick look.
>
> I guess that the notifier API is very old and it does not use all
> the RCU API features which allow to track safety when
> CONFIG_PROVE_RCU and CONFIG_PROVE_RCU_LIST are enabled.
>
> It actually might be worth to audit the code and make it right.
>
>> #ifdef CONFIG_DEBUG_NOTIFIERS
>> if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
>> WARN(1, "Invalid notifier called!");
>> - nb = next_nb;
>> continue;
>> }
>> #endif
>
> That said, I am not sure if the ftrace/livepatching handlers are
> the right motivation for this. Especially when I see the
> complexity of the 2nd patch [*]
>
> After thinking more about it. I am not even sure that the ftrace and
> livepatching callbacks are good candidates for generic notifiers.
> They are too special. It is not only about ordering them against
> each other. But it is also about ordering them against other
> notifiers. The ftrace/livepatching callbacks must be the first/last
> during module load/release.
>
> [*] The 2nd patch is not archived by lore for some reason.
> I have found only a review but it gives a good picture, see
> https://lore.kernel.org/all/1191caf5-6a61-4622-a15e-854d3701f4fc@suse.com/
>
> Best Regards,
> Petr
>
Thanks.
BR
Song
^ permalink raw reply
* Re: [RFC PATCH 1/2] kernel/notifier: replace single-linked list with double-linked list for reverse traversal
From: Song Chen @ 2026-04-19 0:21 UTC (permalink / raw)
To: David Laight
Cc: rafael, lenb, mturquette, sboyd, viresh.kumar, agk, snitzer,
mpatocka, bmarzins, song, yukuai, linan122, jason.wessel, danielt,
dianders, horms, davem, edumazet, kuba, pabeni, paulmck, frederic,
mcgrof, petr.pavlu, da.gomez, samitolvanen, atomlin, jpoimboe,
jikos, mbenes, pmladek, joe.lawrence, rostedt, mhiramat,
mark.rutland, mathieu.desnoyers, linux-modules, linux-kernel,
linux-trace-kernel, linux-acpi, linux-clk, linux-pm,
live-patching, dm-devel, linux-raid, kgdb-bugreport, netdev
In-Reply-To: <20260416133004.07bd2886@pumpkin>
Hi,
On 4/16/26 20:30, David Laight wrote:
> On Wed, 15 Apr 2026 15:01:37 +0800
> chensong_2000@189.cn wrote:
>
>> From: Song Chen <chensong_2000@189.cn>
>>
>> The current notifier chain implementation uses a single-linked list
>> (struct notifier_block *next), which only supports forward traversal
>> in priority order. This makes it difficult to handle cleanup/teardown
>> scenarios that require notifiers to be called in reverse priority order.
>
> If it is only cleanup/teardown then the list can be order-reversed
> as part of that process at the same time as the list is deleted.
>
> David
>
>
>
Sorry, i don't follow, the notifiers in the list are deleted when
calling notifier_chain_unregister, other than that, they are traversed
forward and backward.
Song
^ permalink raw reply
* Re: [PATCH v2 0/3] mISDN: fix socket/device lifetime and naming races
From: Shuvam Pandey @ 2026-04-19 0:24 UTC (permalink / raw)
To: netdev
In-Reply-To: <cover.1776446840.git.shuvampandey1@gmail.com>
While re-reviewing v2, I found correctness issues in the current approach,
including lifetime and synchronization problems.
Please ignore v2 for now. I’m reworking the series and will send a
corrected v3.
Thanks,
Shuvam
^ permalink raw reply
* Re: [PATCH net 1/1] net: l3mdev: Ignore non-L3 uppers in l3mdev_fib_table_rcu
From: Haoze Xie @ 2026-04-19 3:45 UTC (permalink / raw)
To: Ido Schimmel, Ao Zhou
Cc: netdev, David Ahern, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Yifan Wu, Juefei Pu,
Yuan Tan, Xin Liu
In-Reply-To: <20260406103350.GA654671@shredder>
On 4/6/2026 6:33 PM, Ido Schimmel wrote:
> On Sat, Apr 04, 2026 at 07:52:03PM +0800, Ao Zhou wrote:
>> From: Haoze Xie <royenheart@gmail.com>
>>
>> l3mdev_fib_table_rcu() assumes that any upper device observed for
>> an IFF_L3MDEV_SLAVE device is an L3 master and dereferences
>> master->l3mdev_ops unconditionally.
>>
>> VRF slave setup sets IFF_L3MDEV_SLAVE before the upper link is fully
>> switched, so readers can transiently observe a non-L3 upper such as a
>> bridge and follow a NULL l3mdev_ops pointer. Require the current upper
>> to still be an L3 master before consulting its FIB table.
>
> Do you have a reproducer? I don't see how that can happen.
>
> do_set_master() ensures that the device doesn't have a master when
> ndo_add_slave() is called. Meaning, if netif_is_l3_slave() is true and
> netdev_master_upper_dev_get_rcu() resolved a master device, it's
> guaranteed to be a VRF.
>
I agree that this is correct in the synchronous `do_set_master()`
control flow.
My concern is narrower: an RCU reader can still resolve a stale upper
after the unlink, so `netif_is_l3_slave(dev)` alone is not enough to
prove that the resolved upper is still an L3 master at the point where
`master->l3mdev_ops` is dereferenced.
>>
>> Fixes: fee6d4c777a1 ("net: Add netif_is_l3_slave")
>> Reported-by: Yifan Wu <yifanwucs@gmail.com>
>> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
>> Co-developed-by: Yuan Tan <yuantan098@gmail.com>
>> Signed-off-by: Yuan Tan <yuantan098@gmail.com>
>> Suggested-by: Xin Liu <bird@lzu.edu.cn>
>> Signed-off-by: Haoze Xie <royenheart@gmail.com>
>> Signed-off-by: Ao Zhou <n05ec@lzu.edu.cn>
>> ---
>> net/l3mdev/l3mdev.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c
>> index 5432a5f2dfc8..b8a3030cb2c4 100644
>> --- a/net/l3mdev/l3mdev.c
>> +++ b/net/l3mdev/l3mdev.c
>> @@ -177,7 +177,7 @@ u32 l3mdev_fib_table_rcu(const struct net_device *dev)
>> const struct net_device *master;
>>
>> master = netdev_master_upper_dev_get_rcu(_dev);
>> - if (master &&
>> + if (master && netif_is_l3_master(master) &&
>> master->l3mdev_ops->l3mdev_fib_table)
>> tb_id = master->l3mdev_ops->l3mdev_fib_table(master);
>> }
>> --
>> 2.53.0
>>
^ permalink raw reply
* Re: [PATCH net v2 1/1] net: l3mdev: Ignore non-L3 uppers in l3mdev_fib_table_rcu
From: Haoze Xie @ 2026-04-19 3:49 UTC (permalink / raw)
To: Ido Schimmel, Ao Zhou
Cc: netdev, David Ahern, David S . Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Simon Horman, Ido Schimmel,
Jiri Pirko, Yifan Wu, Juefei Pu, Yuan Tan, Xin Liu, royenheart
In-Reply-To: <20260406154808.GA714138@shredder>
On 4/6/2026 11:48 PM, Ido Schimmel wrote:
> On Mon, Apr 06, 2026 at 09:28:16PM +0800, Ao Zhou wrote:
>> From: Haoze Xie <royenheart@gmail.com>
>>
>> l3mdev_fib_table_rcu() assumes that any upper device observed for
>> an IFF_L3MDEV_SLAVE device is an L3 master and dereferences
>> master->l3mdev_ops unconditionally.
>>
>> VRF slave setup sets IFF_L3MDEV_SLAVE before the upper link is fully
>> switched, so readers can transiently observe a non-L3 upper such as a
>> bridge and follow a NULL l3mdev_ops pointer. Require the current upper
>> to still be an L3 master before consulting its FIB table.
>>
>> Fixes: fdeea7be88b1 ("net: vrf: Set slave's private flag before linking")
>> Reported-by: Yifan Wu <yifanwucs@gmail.com>
>> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
>> Co-developed-by: Yuan Tan <yuantan098@gmail.com>
>> Signed-off-by: Yuan Tan <yuantan098@gmail.com>
>> Suggested-by: Xin Liu <bird@lzu.edu.cn>
>> Reviewed-by: David Ahern <dsahern@kernel.org>
>> Signed-off-by: Haoze Xie <royenheart@gmail.com>
>> Signed-off-by: Ao Zhou <n05ec@lzu.edu.cn>
>> ---
>> changes in v2:
>> - point Fixes to the VRF slave ordering change identified by David Ahern
>> - add David Ahern's Reviewed-by trailer
>>
>> net/l3mdev/l3mdev.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c
>> index 5432a5f2dfc8..b8a3030cb2c4 100644
>> --- a/net/l3mdev/l3mdev.c
>> +++ b/net/l3mdev/l3mdev.c
>> @@ -177,7 +177,7 @@ u32 l3mdev_fib_table_rcu(const struct net_device *dev)
>> const struct net_device *master;
>>
>> master = netdev_master_upper_dev_get_rcu(_dev);
>> - if (master &&
>> + if (master && netif_is_l3_master(master) &&
>> master->l3mdev_ops->l3mdev_fib_table)
>
> Don't we have the same problem in l3mdev_l3_rcv() and l3mdev_l3_out()?
> If so, please check if I missed more places and include them in v3.
>
I checked the same pattern in the other slave-side helpers, and v3 now
extends the fix to both `l3mdev_l3_rcv()` and `l3mdev_l3_out()` in
addition to `l3mdev_fib_table_rcu()`.
All three helpers resolve the current upper with
`netdev_master_upper_dev_get_rcu()` and then use `master->l3mdev_ops`.
So v3 consistently requires the resolved upper to still satisfy
`netif_is_l3_master(master)` before dereferencing `l3mdev_ops`.
> And I think that the part that I was missing earlier is that we don't
> have RCU synchronization in the unslaving path, so an RCU reader can
> either see the original master, NULL or a new master (e.g., bridge
> instead of the original VRF master).
>
>> tb_id = master->l3mdev_ops->l3mdev_fib_table(master);
>> }
>> --
>> 2.53.0
>>
^ permalink raw reply
* Re: [PATCH net v2 1/1] net: l3mdev: Ignore non-L3 uppers in l3mdev_fib_table_rcu
From: Haoze Xie @ 2026-04-19 3:51 UTC (permalink / raw)
To: Ido Schimmel, David Ahern
Cc: Ao Zhou, netdev, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Ido Schimmel, Jiri Pirko, Yifan Wu,
Juefei Pu, Yuan Tan, Xin Liu, royenheart
In-Reply-To: <20260407080259.GA760990@shredder>
On 4/7/2026 4:02 PM, Ido Schimmel wrote:
> On Mon, Apr 06, 2026 at 12:14:15PM -0600, David Ahern wrote:
>> On 4/6/26 9:48 AM, Ido Schimmel wrote:
>>>
>>> Don't we have the same problem in l3mdev_l3_rcv() and l3mdev_l3_out()?
>>> If so, please check if I missed more places and include them in v3.
>>>
>>> And I think that the part that I was missing earlier is that we don't
>>> have RCU synchronization in the unslaving path, so an RCU reader can
>>> either see the original master, NULL or a new master (e.g., bridge
>>> instead of the original VRF master).
>>
>> synchronize_rcu after the unlink (control path) seems like a better,
>> more robust option than adding more checks to the datapath.
>
> IMO it would be better to proceed with the original approach and look
> into adding RCU synchronization in net-next. The original approach is
> more surgical and the pattern of first checking netif_is_l3_master()
> already exists in other places.
I followed that direction for v3.
The patch keeps the more surgical datapath fix for net and extends the
same check to the other same-pattern slave-side helpers.
I am sending v3 as a separate public thread. Further feedback is
welcome.
^ permalink raw reply
* [PATCH] bpf: verifier: fix NULL deref in map_kptr_match_type() for scalar regs
From: Avinash Pal @ 2026-04-19 3:59 UTC (permalink / raw)
To: avinashpal441; +Cc: bpf, netdev
A NULL pointer dereference occurs in map_kptr_match_type() when a BPF
program attempts to store a scalar value (non-pointer register) into a
map slot annotated as a kptr (kernel pointer).
Call chain:
check_store_reg()
-> check_mem_access()
-> check_map_kptr_access()
-> map_kptr_match_type()
-> btf_is_kernel(reg->btf) <-- NULL deref
For scalar registers reg->btf is NULL. map_kptr_match_type() passes
this directly to btf_is_kernel(), which performs a container_of()-based
offset access and triggers a general protection fault:
KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
RIP: 0010:btf_is_kernel+0x2a/0x50
Fix: add an early NULL guard on reg->btf at the top of
map_kptr_match_type(). If reg->btf is NULL, emit a verifier log message
and return -EACCES so the program is rejected cleanly instead of
crashing the kernel.
Reproducer: store a scalar into a kptr-annotated BPF_MAP_TYPE_LRU_HASH
slot via a kprobe program — verifier crashes during bpf_prog_load().
Reported-by: Le Chen <le.chen@bugzilla.kernel.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221372
Signed-off-by: Avinash Pal <avinashpal441@gmail.com>
---
kernel/bpf/verifier.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 69d75515ed..15b8f71074 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4545,6 +4545,18 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
struct btf_field *kptr_field,
struct bpf_reg_state *reg, u32 regno)
{
+ /*
+ * If the source register has no BTF type (e.g. it is a scalar),
+ * it cannot possibly match a kptr slot. Reject early to avoid
+ * passing a NULL reg->btf to btf_is_kernel(), which would cause
+ * a NULL pointer dereference inside that function.
+ */
+ if (!reg->btf) {
+ verbose(env, "R%d is not a pointer, cannot store into kptr slot\n",
+ regno);
+ return -EACCES;
+ }
+
const char *targ_name = btf_type_name(kptr_field->kptr.btf, kptr_field->kptr.btf_id);
int perm_flags;
const char *reg_name = "";
--
2.53.0
^ permalink raw reply related
* [PATCH] bpf: verifier: fix NULL deref in map_kptr_match_type() for scalar regs
From: Avinash Pal @ 2026-04-19 4:04 UTC (permalink / raw)
To: bpf; +Cc: ast, daniel, andrii, netdev
A NULL pointer dereference occurs in map_kptr_match_type() when a BPF
program attempts to store a scalar value (non-pointer register) into a
map slot annotated as a kptr (kernel pointer).
Call chain:
check_store_reg()
-> check_mem_access()
-> check_map_kptr_access()
-> map_kptr_match_type()
-> btf_is_kernel(reg->btf) <-- NULL deref
For scalar registers reg->btf is NULL. map_kptr_match_type() passes
this directly to btf_is_kernel(), which performs a container_of()-based
offset access and triggers a general protection fault:
KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
RIP: 0010:btf_is_kernel+0x2a/0x50
Fix: add an early NULL guard on reg->btf at the top of
map_kptr_match_type(). If reg->btf is NULL, emit a verifier log message
and return -EACCES so the program is rejected cleanly instead of
crashing the kernel.
Reproducer: store a scalar into a kptr-annotated BPF_MAP_TYPE_LRU_HASH
slot via a kprobe program — verifier crashes during bpf_prog_load().
Reported-by: Le Chen <le.chen@bugzilla.kernel.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221372
Signed-off-by: Avinash Pal <avinashpal441@gmail.com>
---
kernel/bpf/verifier.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 69d75515ed..15b8f71074 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4545,6 +4545,18 @@ static int map_kptr_match_type(struct bpf_verifier_env *env,
struct btf_field *kptr_field,
struct bpf_reg_state *reg, u32 regno)
{
+ /*
+ * If the source register has no BTF type (e.g. it is a scalar),
+ * it cannot possibly match a kptr slot. Reject early to avoid
+ * passing a NULL reg->btf to btf_is_kernel(), which would cause
+ * a NULL pointer dereference inside that function.
+ */
+ if (!reg->btf) {
+ verbose(env, "R%d is not a pointer, cannot store into kptr slot\n",
+ regno);
+ return -EACCES;
+ }
+
const char *targ_name = btf_type_name(kptr_field->kptr.btf, kptr_field->kptr.btf_id);
int perm_flags;
const char *reg_name = "";
--
2.53.0
^ permalink raw reply related
* Re: [PATCH] bpf: verifier: fix NULL deref in map_kptr_match_type() for scalar regs
From: bot+bpf-ci @ 2026-04-19 4:32 UTC (permalink / raw)
To: avinashpal441, bpf
Cc: ast, daniel, andrii, netdev, ast, andrii, daniel, martin.lau,
eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260419040426.9672-1-avinashpal441@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1254 bytes --]
commit 6c1cde9635b1379ee40c94c5ecc8a7d09507c0c9
Author: Avinash Pal <avinashpal441@gmail.com>
bpf: verifier: fix NULL deref in map_kptr_match_type() for scalar regs
A NULL pointer dereference occurs in map_kptr_match_type() when a BPF
program attempts to store a scalar value (non-pointer register) into a
map slot annotated as a kptr (kernel pointer).
Link: https://bugzilla.kernel.org/show_bug.cgi?id=221372
This commit appears to fix a bug introduced in an earlier commit.
Should this include a Fixes: tag?
The fix guards against NULL reg->btf before calling btf_is_kernel(),
which suggests the vulnerability was introduced when map_kptr_match_type()
started using btf_is_kernel() without checking if reg->btf is NULL first.
Looking at the git history, commit 61df10c7799e ("bpf: Allow storing
unreferenced kptr in map") introduced map_kptr_match_type() and the call
to btf_is_kernel(reg->btf) without a NULL check.
Consider adding:
Fixes: 61df10c7799e ("bpf: Allow storing unreferenced kptr in map")
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24620726038
^ permalink raw reply
* Re: [PATCH iwl-next v2 0/2] igc: enable build_skb path
From: Kohei Enju @ 2026-04-19 4:54 UTC (permalink / raw)
To: intel-wired-lan, netdev
Cc: Tony Nguyen, Przemek Kitszel, Andrew Lunn, David S. Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, dima.ruinskiy,
kohei.enju
In-Reply-To: <20260317062205.39406-1-kohei@enjuk.jp>
On 03/17 06:21, Kohei Enju wrote:
> This series enables the build_skb RX path in igc, which is currently not
> enabled in any configuration.
>
> Patch 1/2 adds missing RX hardware timestamp handling in the build_skb
> path.
> Patch 2/2 enables the build_skb path when XDP is inactive and other
> conditions are met.
>
> Tested on Intel Corporation Ethernet Controller I226-V (rev 04).
>
> Changes:
> v2:
> - don't insist on reverse christmas tree, reducing net diff in the
> patch 1/2 (Dima)
> v1: https://lore.kernel.org/intel-wired-lan/20260307182808.155027-1-kohei@enjuk.jp/
>
> Kohei Enju (2):
> igc: set RX hardware timestamps in igc_build_skb()
> igc: enable build_skb on the non-XDP small-frame RX path
Hi Tony,
Could you drop this series from Intel's queue?
Sashiko pointed out a potential use-after-free when dereferencing RX
hardware timestamps with build_skb enabled, and I believe this concern
is valid.
https://sashiko.dev/#/patchset/20260317062205.39406-1-kohei%40enjuk.jp
I'm working on a new series to address this problem.
Thanks,
Kohei
^ permalink raw reply
* Re: [PATCH] bpf: verifier: fix NULL deref in map_kptr_match_type() for scalar regs
From: sun jian @ 2026-04-19 5:01 UTC (permalink / raw)
To: bot+bpf-ci
Cc: avinashpal441, bpf, ast, daniel, andrii, netdev, martin.lau,
eddyz87, yonghong.song, clm, ihor.solodrai
In-Reply-To: <557b90e63fb85b646de7ae4792eb2c06d6be251dcd95dabe9ed99435c82b85e0@mail.kernel.org>
Hi Avinash,
This looks like the same bug Mykyta already addressed here:
https://lore.kernel.org/bpf/20260416-kptr_crash-v1-0-5589356584b4@meta.com/
That series includes both the verifier fix and a selftest for scalar
store into a kptr slot, and Alexei has already picked it up.
Thanks,
^ permalink raw reply
* [PATCH] Bluetooth: Add HCI_QUIRK_NO_SCAN_WHILE_CONNECTED for combo chips
From: StefanCondorache @ 2026-04-19 7:24 UTC (permalink / raw)
To: linux-bluetooth
Cc: marcel, luiz.dentz, davem, edumazet, kuba, pabeni, horms, netdev,
linux-kernel, StefanCondorache
Realtek combo chips share a single antenna between Wi-Fi and Bluetooth.
Background LE passive scanning while an active connection exists causes
antenna multiplexing conflicts via Packet Traffic Arbitration (PTA),
resulting in audio stuttering and Wi-Fi packet loss.
Add HCI_QUIRK_NO_SCAN_WHILE_CONNECTED to suppress passive scanning in
hci_update_passive_scan_sync() when active connections are present.
Set this quirk for all Realtek devices in btrtl_set_quirks().
Also add device ID 0x0bda:0xc829 to the btusb Realtek device table.
Signed-off-by: StefanCondorache <condorachest@gmail.com>
---
drivers/bluetooth/btrtl.c | 6 ++++++
drivers/bluetooth/btusb.c | 2 ++
include/net/bluetooth/hci.h | 11 +++++++++++
net/bluetooth/hci_sync.c | 7 +++++++
4 files changed, 26 insertions(+)
diff --git a/drivers/bluetooth/btrtl.c b/drivers/bluetooth/btrtl.c
index 62f9d4df3a4f..00dfba656970 100644
--- a/drivers/bluetooth/btrtl.c
+++ b/drivers/bluetooth/btrtl.c
@@ -1299,6 +1299,12 @@ EXPORT_SYMBOL_GPL(btrtl_download_firmware);
void btrtl_set_quirks(struct hci_dev *hdev, struct btrtl_device_info *btrtl_dev)
{
+ /* Realtek combo chips share the antenna between Wi-Fi and
+ * Bluetooth. Suppress passive scanning while connected to
+ * prevent coexistence issues.
+ */
+ hci_set_quirk(hdev, HCI_QUIRK_NO_SCAN_WHILE_CONNECTED);
+
/* Enable controller to do both LE scan and BR/EDR inquiry
* simultaneously.
*/
diff --git a/drivers/bluetooth/btusb.c b/drivers/bluetooth/btusb.c
index 5f57953393be..87972f5fc567 100644
--- a/drivers/bluetooth/btusb.c
+++ b/drivers/bluetooth/btusb.c
@@ -516,6 +516,8 @@ static const struct usb_device_id quirks_table[] = {
/* Realtek 8822CU Bluetooth devices */
{ USB_DEVICE(0x13d3, 0x3549), .driver_info = BTUSB_REALTEK |
BTUSB_WIDEBAND_SPEECH },
+ { USB_DEVICE(0x0bda, 0xc829), .driver_info = BTUSB_REALTEK |
+ BTUSB_WIDEBAND_SPEECH },
/* Realtek 8851BE Bluetooth devices */
{ USB_DEVICE(0x0bda, 0xb850), .driver_info = BTUSB_REALTEK },
diff --git a/include/net/bluetooth/hci.h b/include/net/bluetooth/hci.h
index 572b1c620c5d..8466dc52aeca 100644
--- a/include/net/bluetooth/hci.h
+++ b/include/net/bluetooth/hci.h
@@ -378,6 +378,17 @@ enum {
*/
HCI_QUIRK_BROKEN_READ_PAGE_SCAN_TYPE,
+ /* When this quirk is set, the controller suppresses passive LE
+ * background scanning while an active connection exists.
+ * This is required for combo chips with shared Wi-Fi/Bluetooth
+ * antennas to prevent coexistence issues causing audio drops
+ * and Wi-Fi packet loss.
+ *
+ * This quirk can be set before hci_register_dev is called or
+ * during the hdev->setup vendor callback.
+ */
+ HCI_QUIRK_NO_SCAN_WHILE_CONNECTED,
+
__HCI_NUM_QUIRKS,
};
diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
index fd3aacdea512..5e30e725efa2 100644
--- a/net/bluetooth/hci_sync.c
+++ b/net/bluetooth/hci_sync.c
@@ -3194,6 +3194,13 @@ int hci_update_passive_scan_sync(struct hci_dev *hdev)
if (hdev->discovery.state != DISCOVERY_STOPPED)
return 0;
+ /* If the controller requires no scanning while connected,
+ * suppress passive scanning when an active connection exists.
+ */
+ if (hci_test_quirk(hdev, HCI_QUIRK_NO_SCAN_WHILE_CONNECTED) &&
+ !list_empty(&hdev->conn_hash.list))
+ return 0;
+
/* Reset RSSI and UUID filters when starting background scanning
* since these filters are meant for service discovery only.
*
--
2.53.0
^ permalink raw reply related
* Re: [PATCH net v6 2/2] pppoe: drop PFC frames
From: Qingfang Deng @ 2026-04-19 7:53 UTC (permalink / raw)
To: Jakub Kicinski
Cc: linux-ppp, andrew+netdev, davem, edumazet, pabeni, hataegu0826,
horms, kees, kuniyu, bigeasy, gnault, ericwouds, semen.protsenko,
netdev, linux-kernel, paulus, jaco, carlsonj, wojciech.drewek,
marcin.szycik
In-Reply-To: <20260418191145.213625-1-kuba@kernel.org>
Hi,
On 4/19/2026 3:11 AM, Jakub Kicinski wrote:
> [ ... ]
>> @@ -434,6 +434,12 @@ static int pppoe_rcv(struct sk_buff *skb, struct net_device *dev,
>> if (skb->len < len)
>> goto drop;
>>
>> + /* skb->data points to the PPP protocol header after skb_pull_rcsum.
>> + * Drop PFC frames.
>> + */
>> + if (ppp_skb_is_compressed_proto(skb))
>
> Does this code safely access the PPP protocol byte?
>
> After pulling the PPPoE header, skb->data points to the inner payload.
> However, the new ppp_skb_is_compressed_proto() helper unconditionally
> dereferences skb->data[0].
>
> If a crafted PPPoE packet is received with an inner length of 0 and no
> Ethernet padding, the linear buffer might end exactly after the PPPoE header.
> In that scenario, this would read past the allocated linear buffer.
>
> Even if the inner length is greater than 0, could the payload reside entirely
> in non-linear paged fragments, causing an out-of-bounds read?
>
> Would it be safer to verify the packet has at least 1 byte and use
> pskb_may_pull() to ensure the protocol byte is in the linear region before
> inspecting it, perhaps after the pskb_trim_rcsum() call?
I already updated the pskb_may_pull() above, from struct pppoe_hdr (6)
to PPPOE_SES_HLEN (8), to ensure that.
Regards,
Qingfang
^ permalink raw reply
* Re: [PATCH nf] netfilter: xt_TCPMSS: check skb_dst before path-MTU clamping
From: Florian Westphal @ 2026-04-19 8:00 UTC (permalink / raw)
To: Weiming Shi
Cc: Pablo Neira Ayuso, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Phil Sutter, Simon Horman, netfilter-devel, coreteam,
netdev, Xiang Mei
In-Reply-To: <aePiSwmP6YEQ4mNE@strlen.de>
Florian Westphal <fw@strlen.de> wrote:
> Weiming Shi <bestswngs@gmail.com> wrote:
> > When TCPMSS with CLAMP_PMTU is used via nft_compat in a non-base
> > chain, par->hook_mask is set to 0, bypassing the checkentry hook
> > validation. The target can then run at PRE_ROUTING where skb_dst is
> > NULL, causing a null-ptr-deref in tcpmss_mangle_packet():
> >
> > KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> > RIP: 0010:tcpmss_mangle_packet (include/net/dst.h:219 net/netfilter/xt_TCPMSS.c:105)
> > tcpmss_tg4 (net/netfilter/xt_TCPMSS.c:202)
> > nft_target_eval_xt (net/netfilter/nft_compat.c:87)
> > nft_do_chain (net/netfilter/nf_tables_core.c:287)
> > nf_hook_slow (net/netfilter/core.c:623)
> >
> > Check skb_dst() for NULL before calling dst_mtu().
>
> FWIW I will apply this patch even though its wrong.
>
> nft_compat.c is just too broken, I don't see how it can be
> fixed in any reasonable amount of time.
net/netfilter/xt_TCPMSS.c: (par->hook_mask & ~((1 << NF_INET_FORWARD) |
net/netfilter/xt_addrtype.c: if (par->hook_mask & ((1 << NF_INET_PRE_ROUTING) |
net/netfilter/xt_devgroup.c: par->hook_mask & ~((1 << NF_INET_PRE_ROUTING) |
net/netfilter/xt_physdev.c: par->hook_mask & (1 << NF_INET_LOCAL_OUT)) {
net/netfilter/xt_policy.c: if (par->hook_mask & ((1 << NF_INET_POST_ROUTING) |
net/netfilter/xt_set.c: (par->hook_mask & ~(1 << NF_INET_FORWARD |
Look at this I don't see an alternative to mixing nft specific bits into
x_tables, i.e.:
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -187,6 +187,8 @@ struct xt_target {
/* Should return 0 on success or an error code otherwise (-Exxxx). */
int (*checkentry)(const struct xt_tgchk_param *);
+ int (*nft_validate_chain)(const void *targinfo, unsigned int hook_mask);
+
/* Called when entry of this type deleted. */
void (*destroy)(const struct xt_tgdtor_param *);
#ifdef CONFIG_NETFILTER_XTABLES_COMPAT
.. and then call that from nft_compat.c for TCPSS.
Same for matches.
^ permalink raw reply
* AW: pre-boot plugged SFP autoneg advertisement
From: markus.stockhausen @ 2026-04-19 8:49 UTC (permalink / raw)
To: 'Andrew Lunn'
Cc: linux, hkallweit1, netdev, 'Jonas Jelonek', jan, nbd
In-Reply-To: <90958cc3-e291-44ff-8fc3-102c0f62a269@lunn.ch>
Hi Andrew,
> Von: Andrew Lunn <andrew@lunn.ch>
> Betreff: Re: pre-boot plugged SFP autoneg advertisement
>
> > On Sat, Apr 18, 2026 at 11:27:40AM +0200, markus.stockhausen@gmx.de
wrote:
> > Hi,
> >
> > I'm currently analyzing an issue where a pre-boot-plugged SFP module
> > comes up with autoneg=no advertisement during boot. After an
> > unplug/replug autoneg=yes advertisement is chosen.
> >
> > The following addition in phylink_start() just before the call to
> > phylink_mac_initial_config() mitigiates this.
> >
> > + /* If an SFP module was already present before phylink_start() was
> > + * called, phylink_sfp_set_config() was unable to call
> > + * phylink_mac_initial_config() as phylink was not yet started.
> > + * Ensure the SFP capabilities are reflected in advertising.
> > + */
> > + if (pl->sfp_bus && !linkmode_empty(pl->sfp_support))
> > + linkmode_copy(pl->link_config.advertising, pl->sfp_support);
>
> Let me see if i have the call chain correct. This is net-next/main
> from today.
>
> phylink_sfp_connect_phy() ->
> phylink_sfp_config_phy
>
> if (changed && !test_bit(PHYLINK_DISABLE_STOPPED,
> &pl->phylink_disable_state))
> phylink_mac_initial_config(pl, false);
>
> You are saying PHYLINK_DISABLE_STOPPED is set, so
> phylink_mac_initial_config() is not called.
>
> What i don't see is how phylink_mac_initial_config() does the
> linkmode_copy() you are adding.
Took that hint/question and digged deeper. Added further debug
to each and every linkmode_copy. I think I found the culprit in
a userspace ethtool call. For now I assume OpenWrt netifd.
Adding my last trace below including the original (wrong) idea.
Thank you very much for taking the time and your assistance.
Markus
[ 3.301299] XXXX phylink_create lan12 set pl->link_config.advertising
(autoneg = 1)
[ 3.309954] XXXX phylink_parse_mode lan12 set pl->link_config.advertising
(autoneg = 1)
[ 3.318964] XXX sfp_module_insert lan12 called
[ 3.323935] XXXX phylink_sfp_config_optical lan12 set config.advertising
(autoneg = 1)
[ 3.332815] XXXX phylink_validate_one lan12 set tmp_supported (autoneg =
1)
[ 3.340629] XXXX phylink_validate_mask lan12 set supported (autoneg = 1)
[ 3.348165] XXXX phylink_validate_mask lan12 set state->advertising
(autoneg = 1)
[ 3.356527] rtl83xx-switch 1b000000.switchcore:ethernet-switch lan12
(uninitialized): XXX phylink_sfp_set_config requesting link mode
inband/1000base-x with support 0000000,00000000,00000200,00006440
--- ETHTOOL CALL HERE ---
[ 81.213726] XXXX phylink_ethtool_ksettings_set lan12 start got
config.advertising (autoneg = 1)
[ 81.223542] XXXX phylink_ethtool_ksettings_set lan12 set accoring to
kset->base.autoneg (autoneg = 0)
[ 81.233961] CPU: 0 UID: 0 PID: 1470 Comm: netifd Tainted: G O
6.18.21 #0 NONE
[ 81.234010] Tainted: [O]=OOT_MODULE
[ 81.234017] Hardware name: Zyxel XGS1210-12 A1 Switch
[ 81.234026] Stack : 823a3bbc 80139d20 00000000 00000001 00000000 00000000
00000000 00000000
[ 81.234094] 00000000 00000000 00000000 00000000 00000000 00000001
823a3b78 82040d00
[ 81.234152] 00000000 00000000 80992870 823a3a10 00000000 ffffefff
00000001 00000224
[ 81.234213] 00000226 823a39d4 00000226 000019c8 00000001 00000000
80992870 80a00000
[ 81.234273] 82764648 00000016 00000016 82764628 00000000 80a913b0
00000000 81990000
[ 81.234334] ...
[ 81.234350] Call Trace:
[ 81.234356] [<80115e48>] show_stack+0x28/0xf0
[ 81.234407] [<8010fc78>] dump_stack_lvl+0x70/0xb0
[ 81.234432] [<80592718>] phylink_ethtool_ksettings_set+0x58c/0x6a4
[ 81.234479] [<806bb890>] ethtool_set_link_ksettings+0xbc/0x198
[ 81.234516] [<806be01c>] __dev_ethtool+0xfe0/0x1a1c
[ 81.234550] [<806beb24>] dev_ethtool+0xcc/0x24c
[ 81.234575] [<80678770>] dev_ioctl+0x30c/0x5f4
[ 81.234616] [<80603ebc>] sock_ioctl+0x2bc/0x470
[ 81.234642] [<8036e530>] sys_ioctl+0xb4/0x120
[ 81.234683] [<8011edec>] syscall_common+0x34/0x58
[ 81.234715]
[ 81.234723] XXXX 2 phylink_ethtool_ksettings_set lan12 set
pl->link_config.advertising (autoneg = 0)
-----------
[ 81.375070] rtl83xx-switch 1b000000.switchcore:ethernet-switch lan12: XXX
phylink_start configuring for inband/1000base-x link mode
--- INITIAL FIX/IDEA HERE ---
[ 81.388408] XXX phylink_start lan12 sfp_bus set and linkmode not empty ->
would run linkmode_copy()
-----------
[ 81.398688] XXX phylink_mac_initial_config lan12 called with
force_restart = 1
[ 81.406752] rtl83xx-switch 1b000000.switchcore:ethernet-switch lan12: XXX
major config, requested inband/1000base-x
[ 81.418459] XXXX major_config_entry lan12: autoneg_adv=0 autoneg_sfp=1
sfp_may_have_phy=0
[ 81.427612] XXXX phylink_pcs_neg_mode ENTRY lan12: pl->pcs_neg_mode=0x0
[ 81.435102] XXXX phylink_pcs_neg_mode lan12 advertising autoneg=0
[ 81.442034] rtl83xx-switch 1b000000.switchcore:ethernet-switch lan12: XXX
interface 1000base-x inband modes: pcs=03 phy=00
[ 81.454495] XXXX phylink_pcs_neg_mode lan12 base-x without phy
[ 81.461134] XXXX phylink_pcs_neg_mode EXIT lan12 pl->pcs_neg_mode = 0x40
pl->act_link_an_mode = 0x2
[ 82.085493] XXXX phylink_mac_pcs_get_state lan12 set state->advertising
(autoneg = 0)
[ 82.094391] XXXX phylink_mac_pcs_get_state lan12 autoneg is 0
^ permalink raw reply
* [PATCH net v2] net: iptunnel: fix stale transport header after GRE/TEB decap
From: Jiayuan Chen @ 2026-04-19 9:08 UTC (permalink / raw)
To: netdev
Cc: Jiayuan Chen, syzbot+83181a31faf9455499c5, David S. Miller,
David Ahern, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, Pravin B Shelar, Tom Herbert, linux-kernel
syzbot reported a BUG.
I found that after GRE decapsulation in gretap/ip6gretap paths, the
transport_header becomes stale with a negative offset. The sequence is:
1. Before decap, transport_header points to the outer L4 (GRE) header.
2. __iptunnel_pull_header() calls skb_pull_rcsum() to advance skb->data
past the GRE header, but does not update transport_header.
3. For TEB (gretap/ip6gretap), eth_type_trans() in ip_tunnel_rcv() /
__ip6_tnl_rcv() further pulls ETH_HLEN (14 bytes) from skb->data.
After these two pulls, skb->data has moved forward while transport_header
still points to the old (now behind skb->data) position, resulting in a
negative skb_transport_offset(): typically -4 after GRE pull alone, or
-18 after GRE + inner Ethernet pull.
In the normal case where the inner frame is a recognizable protocol
(e.g., IPv4/TCP), the transport_header is subsequently overwritten by
ip_rcv_core() (or inet_gro_receive() on the GRO path) via
skb_set_transport_header(), and the stale value never reaches downstream
consumers. However, if the inner frame cannot be parsed (e.g.,
eth_type_trans() classifies it as ETH_P_802_2 due to a zero/invalid
inner Ethernet header), neither rescue runs, and the stale offset
persists into __netif_receive_skb_core().
When this stale offset is combined with contradictory GSO metadata (e.g.,
SKB_GSO_TCPV4 injected via virtio_net_hdr from a tun device),
qdisc_pkt_len_segs_init() trusts the negative offset: the unsigned
wraparound makes pskb_may_pull() effectively a no-op, and __tcp_hdrlen()
then reads from an invalid memory location, causing a use-after-free.
The UAF only triggers on the GSO path, where qdisc_pkt_len_segs_init()
dereferences the transport header to compute per-segment length. Fix
this by introducing iptunnel_rebuild_transport_header(), which is a
no-op for non-GSO packets and otherwise re-probes the transport header
via the flow dissector. If re-probing fails, the contradictory GSO
metadata is cleared via skb_gso_reset() so downstream consumers cannot
trust stale offsets. Restricting the rebuild to GSO packets keeps the
flow-dissector cost off the common rx fast path.
reproducer: https://gist.github.com/mrpre/5ba943fd86367af748b70de99263da4b
Link: https://syzkaller.appspot.com/bug?extid=83181a31faf9455499c5
Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
Fixes: 0d3c703a9d17 ("ipv6: Cleanup IPv6 tunnel receive path")
Reported-by: syzbot+83181a31faf9455499c5@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/69de2bee.a00a0220.475f0.0041.GAE@google.com/T/
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
---
As a follow-up for production reliability, I am wondering whether we
can extend the existing safety net in __netif_receive_skb_core() to
also handle set-but-negative transport_header:
if (!skb_transport_header_was_set(skb) ||
skb_transport_offset(skb) < 0)
skb_reset_transport_header(skb);
---
include/net/ip_tunnels.h | 12 ++++++++++++
net/ipv4/ip_tunnel.c | 2 ++
net/ipv6/ip6_tunnel.c | 2 ++
3 files changed, 16 insertions(+)
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index d708b66e55cd..9b4e662833a1 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -662,6 +662,18 @@ static inline int iptunnel_pull_offloads(struct sk_buff *skb)
return 0;
}
+static inline void iptunnel_rebuild_transport_header(struct sk_buff *skb)
+{
+ if (!skb_is_gso(skb))
+ return;
+
+ skb->transport_header = (typeof(skb->transport_header))~0U;
+ skb_probe_transport_header(skb);
+
+ if (!skb_transport_header_was_set(skb))
+ skb_gso_reset(skb);
+}
+
static inline void iptunnel_xmit_stats(struct net_device *dev, int pkt_len)
{
if (pkt_len > 0) {
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 50d0f5fe4e4c..c46be68cfafa 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -445,6 +445,8 @@ int ip_tunnel_rcv(struct ip_tunnel *tunnel, struct sk_buff *skb,
if (tun_dst)
skb_dst_set(skb, (struct dst_entry *)tun_dst);
+ iptunnel_rebuild_transport_header(skb);
+
gro_cells_receive(&tunnel->gro_cells, skb);
return 0;
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 46bc06506470..f95348cf3c77 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -879,6 +879,8 @@ static int __ip6_tnl_rcv(struct ip6_tnl *tunnel, struct sk_buff *skb,
if (tun_dst)
skb_dst_set(skb, (struct dst_entry *)tun_dst);
+ iptunnel_rebuild_transport_header(skb);
+
gro_cells_receive(&tunnel->gro_cells, skb);
return 0;
--
2.43.0
^ permalink raw reply related
* Re: [PATCH net v2] net: iptunnel: fix stale transport header after GRE/TEB decap
From: Eric Dumazet @ 2026-04-19 9:25 UTC (permalink / raw)
To: Jiayuan Chen
Cc: netdev, syzbot+83181a31faf9455499c5, David S. Miller, David Ahern,
Jakub Kicinski, Paolo Abeni, Simon Horman, Pravin B Shelar,
Tom Herbert, linux-kernel
In-Reply-To: <20260419090817.127334-1-jiayuan.chen@linux.dev>
On Sun, Apr 19, 2026 at 2:08 AM Jiayuan Chen <jiayuan.chen@linux.dev> wrote:
>
> syzbot reported a BUG.
>
> I found that after GRE decapsulation in gretap/ip6gretap paths, the
> transport_header becomes stale with a negative offset. The sequence is:
>
> 1. Before decap, transport_header points to the outer L4 (GRE) header.
> 2. __iptunnel_pull_header() calls skb_pull_rcsum() to advance skb->data
> past the GRE header, but does not update transport_header.
> 3. For TEB (gretap/ip6gretap), eth_type_trans() in ip_tunnel_rcv() /
> __ip6_tnl_rcv() further pulls ETH_HLEN (14 bytes) from skb->data.
>
> After these two pulls, skb->data has moved forward while transport_header
> still points to the old (now behind skb->data) position, resulting in a
> negative skb_transport_offset(): typically -4 after GRE pull alone, or
> -18 after GRE + inner Ethernet pull.
>
> In the normal case where the inner frame is a recognizable protocol
> (e.g., IPv4/TCP), the transport_header is subsequently overwritten by
> ip_rcv_core() (or inet_gro_receive() on the GRO path) via
> skb_set_transport_header(), and the stale value never reaches downstream
> consumers. However, if the inner frame cannot be parsed (e.g.,
> eth_type_trans() classifies it as ETH_P_802_2 due to a zero/invalid
> inner Ethernet header), neither rescue runs, and the stale offset
> persists into __netif_receive_skb_core().
>
> When this stale offset is combined with contradictory GSO metadata (e.g.,
> SKB_GSO_TCPV4 injected via virtio_net_hdr from a tun device),
> qdisc_pkt_len_segs_init() trusts the negative offset: the unsigned
> wraparound makes pskb_may_pull() effectively a no-op, and __tcp_hdrlen()
> then reads from an invalid memory location, causing a use-after-free.
>
> The UAF only triggers on the GSO path, where qdisc_pkt_len_segs_init()
> dereferences the transport header to compute per-segment length. Fix
> this by introducing iptunnel_rebuild_transport_header(), which is a
> no-op for non-GSO packets and otherwise re-probes the transport header
> via the flow dissector. If re-probing fails, the contradictory GSO
> metadata is cleared via skb_gso_reset() so downstream consumers cannot
> trust stale offsets. Restricting the rebuild to GSO packets keeps the
> flow-dissector cost off the common rx fast path.
>
> reproducer: https://gist.github.com/mrpre/5ba943fd86367af748b70de99263da4b
>
> Link: https://syzkaller.appspot.com/bug?extid=83181a31faf9455499c5
> Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
> Fixes: 0d3c703a9d17 ("ipv6: Cleanup IPv6 tunnel receive path")
> Reported-by: syzbot+83181a31faf9455499c5@syzkaller.appspotmail.com
> Closes: https://lore.kernel.org/all/69de2bee.a00a0220.475f0.0041.GAE@google.com/T/
> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
> ---
>
> As a follow-up for production reliability, I am wondering whether we
> can extend the existing safety net in __netif_receive_skb_core() to
> also handle set-but-negative transport_header:
>
> if (!skb_transport_header_was_set(skb) ||
> skb_transport_offset(skb) < 0)
> skb_reset_transport_header(skb);
> ---
> include/net/ip_tunnels.h | 12 ++++++++++++
> net/ipv4/ip_tunnel.c | 2 ++
> net/ipv6/ip6_tunnel.c | 2 ++
> 3 files changed, 16 insertions(+)
>
> diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
> index d708b66e55cd..9b4e662833a1 100644
> --- a/include/net/ip_tunnels.h
> +++ b/include/net/ip_tunnels.h
> @@ -662,6 +662,18 @@ static inline int iptunnel_pull_offloads(struct sk_buff *skb)
> return 0;
> }
>
> +static inline void iptunnel_rebuild_transport_header(struct sk_buff *skb)
> +{
> + if (!skb_is_gso(skb))
> + return;
> +
> + skb->transport_header = (typeof(skb->transport_header))~0U;
> + skb_probe_transport_header(skb);
> +
> + if (!skb_transport_header_was_set(skb))
> + skb_gso_reset(skb);
I do not think this makes sense.
What is a valid case for this packet being processed further?
The buggy packet must be dropped, instead of being mangled like this.
^ permalink raw reply
* Re: [PATCH net 1/1] mptcp: hold subflow request owners when cloning reqsk
From: Yuan Tan @ 2026-04-19 9:51 UTC (permalink / raw)
To: Kuniyuki Iwashima, Matthieu Baerts
Cc: yuantan098, Ren Wei, netdev, mptcp, davem, edumazet, kuba, pabeni,
horms, ncardwell, dsahern, martineau, geliang, daniel, kafai,
yifanwucs, tomapufckgml, bird, caoruide123, enjou1224z
In-Reply-To: <CAAVpQUDzh-dGsQpBCZjN3rUsoDc2QjzWh-o5yVWoBWDQNXbjmQ@mail.gmail.com>
On 4/16/2026 11:48 AM, Kuniyuki Iwashima wrote:
> On Thu, Apr 16, 2026 at 10:45 AM Matthieu Baerts <matttbe@kernel.org> wrote:
>> Hi Ren,
>>
>> On 15/04/2026 11:31, Ren Wei wrote:
>>> From: Ruide Cao <caoruide123@gmail.com>
>>>
>>> TCP request migration clones pending request sockets with
>>> inet_reqsk_clone(). For MPTCP MP_JOIN requests this raw-copies
>>> subflow_req->msk, but the cloned request does not take a new reference.
>>>
>>> Both the original and the cloned request can later drop the same msk in
>>> subflow_req_destructor(), and a migrated request may keep a dangling msk
>>> pointer after the original owner has already been released.
>>>
>>> Add a request_sock clone callback and let MPTCP grab a reference for cloned
>>> subflow requests that carry an msk. This keeps ownership balanced across
>>> both successful migrations and failed clone/insert paths without changing
>>> other protocols.
>> Thank you for this patch!
>>
>> Indeed, it looks like this path has not been covered by the MPTCP test
>> suite so far.
>>
>> By chance, do you have any simpler reproducer using packetdrill? (the
>> MPTCP fork)
>>
>> https://github.com/multipath-tcp/packetdrill
>>
>> Also, I see Sashiko is pointing to a potential issue with MP_CAPABLE and
>> the token, also only visible with net.ipv4.tcp_migrate_req=1:
>>
>> https://sashiko.dev/#/patchset/86e2514b533bf4d55d4aa2fdbf1404022e8c9430.1776149210.git.caoruide123%40gmail.com
>>
>> Are you also looking at that?
>>
>>> Fixes: c905dee62232 ("tcp: Migrate TCP_NEW_SYN_RECV requests at retransmitting SYN+ACKs.")
>>> Cc: stable@kernel.org
>>> Reported-by: Yuan Tan <yuantan098@gmail.com>
>>> Reported-by: Yifan Wu <yifanwucs@gmail.com>
>>> Reported-by: Juefei Pu <tomapufckgml@gmail.com>
>>> Reported-by: Xin Liu <bird@lzu.edu.cn>
>>> Signed-off-by: Ruide Cao <caoruide123@gmail.com>
>>> Tested-by: Ren Wei <enjou1224z@gmail.com>
>>> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
>>> ---
>>> include/net/request_sock.h | 2 ++
>>> net/ipv4/inet_connection_sock.c | 3 +++
>>> net/mptcp/subflow.c | 13 +++++++++++++
>>> 3 files changed, 18 insertions(+)
>>>
>>> diff --git a/include/net/request_sock.h b/include/net/request_sock.h
>>> index 5a9c826a7092..560e464c400f 100644
>>> --- a/include/net/request_sock.h
>>> +++ b/include/net/request_sock.h
>>> @@ -36,6 +36,8 @@ struct request_sock_ops {
>>> struct sk_buff *skb,
>>> enum sk_rst_reason reason);
>>> void (*destructor)(struct request_sock *req);
>>> + void (*init_clone)(const struct request_sock *req,
>>> + struct request_sock *new_req);
>> @TCP/INET maintainers: are you OK with this new function pointer?
>>
>> Currently, MPTCP is the only user, and "req" is not used, see below.
>>
>>> };
>>>
>>> struct saved_syn {
>>> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
>>> index e961936b6be7..140a9e96ad58 100644
>>> --- a/net/ipv4/inet_connection_sock.c
>>> +++ b/net/ipv4/inet_connection_sock.c
>>> @@ -954,6 +954,9 @@ static struct request_sock *inet_reqsk_clone(struct request_sock *req,
>>> if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(nreq)->tfo_listener)
>>> rcu_assign_pointer(tcp_sk(nreq->sk)->fastopen_rsk, nreq);
>> (Maybe TCP with fastopen could be this other user to call
>> rcu_assign_pointer()? (net-next material))
>>
>>> + if (req->rsk_ops->init_clone)
>>> + req->rsk_ops->init_clone(req, nreq);
> I think a simple direct call is better.
>
> #ifdef CONFIG_MPTCP
> if (tcp_rsk(req)->is_mptcp)
> mptcp_reqsk_clone(nreq);
> #endif
>
Thank you very much for your suggestion. We will use this approach in
the next version of the patch. Would you like us to add your
Suggested-by tag?
>>> +
>>> return nreq;
>>> }
>>>
>>> diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
>>> index 4ff5863aa9fd..5f4069647822 100644
>>> --- a/net/mptcp/subflow.c
>>> +++ b/net/mptcp/subflow.c
>>> @@ -47,6 +47,17 @@ static void subflow_req_destructor(struct request_sock *req)
>>> mptcp_token_destroy_request(req);
>>> }
>>>
>>> +static void subflow_req_clone(const struct request_sock *req,
>>> + struct request_sock *new_req)
>>> +{
>>> + struct mptcp_subflow_request_sock *subflow_req = mptcp_subflow_rsk(new_req);
>>> +
>>> + (void)req;
>>
>> Note: if 'req' is not used, and MPTCP is currently the only user of this
>> new init_clone() callback, perhaps enough to pass only 'new_req' for the
>> moment? 'req' can be added later if other users need it, no?
>>
>> With only init_close(nreq) in a v2, or if TCP maintainers prefer to keep
>> it, feel free to add:
>>
>> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>>
>>
>> Cheers,
>> Matt
>> --
>> Sponsored by the NGI0 Core fund.
>>
^ permalink raw reply
* [PATCH net 1/1] ipv4: icmp: validate reply type before using icmp_pointers
From: Ren Wei @ 2026-04-19 10:19 UTC (permalink / raw)
To: netdev
Cc: davem, dsahern, edumazet, kuba, pabeni, horms, andreas.a.roeseler,
yuantan098, yifanwucs, tomapufckgml, bird, caoruide123, n05ec
In-Reply-To: <cover.1776563662.git.caoruide123@gmail.com>
From: Ruide Cao <caoruide123@gmail.com>
Extended echo replies use ICMP_EXT_ECHOREPLY as the outbound reply type.
That value is outside the range covered by icmp_pointers[], which only
describes the traditional ICMP types up to NR_ICMP_TYPES.
Avoid consulting icmp_pointers[] for reply types outside that range and
keep the existing behavior for normal ICMP replies unchanged.
Fixes: d329ea5bd884 ("icmp: add response to RFC 8335 PROBE messages")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Ruide Cao <caoruide123@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
---
net/ipv4/icmp.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 4e2a6c70dcd8..d8036663f035 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -373,7 +373,8 @@ static int icmp_glue_bits(void *from, char *to, int offset, int len, int odd,
to, len);
skb->csum = csum_block_add(skb->csum, csum, odd);
- if (icmp_pointers[icmp_param->data.icmph.type].error)
+ if (icmp_param->data.icmph.type <= NR_ICMP_TYPES &&
+ icmp_pointers[icmp_param->data.icmph.type].error)
nf_ct_attach(skb, icmp_param->skb);
return 0;
}
--
2.34.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox