* [PATCH 1/8] ipv6: xfrm6: release dst on error in xfrm6_rcv_encap()
2026-05-05 13:22 [PATCH 0/8] pull request (net): ipsec 2026-05-05 Steffen Klassert
@ 2026-05-05 13:22 ` Steffen Klassert
2026-05-05 13:22 ` [PATCH 2/8] xfrm: ah: account for ESN high bits in async callbacks Steffen Klassert
` (6 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Steffen Klassert @ 2026-05-05 13:22 UTC (permalink / raw)
To: David Miller, Jakub Kicinski; +Cc: Herbert Xu, Steffen Klassert, netdev
From: Yilin Zhu <zylzyl2333@gmail.com>
xfrm6_rcv_encap() performs an IPv6 route lookup when the skb does not
already have a dst attached. ip6_route_input_lookup() returns a
referenced dst entry even when the lookup resolves to an error route.
If dst->error is set, xfrm6_rcv_encap() drops the skb without attaching
the dst to the skb and without releasing the reference returned by the
lookup. Repeated packets hitting this path therefore leak dst entries.
Release the dst before jumping to the drop path.
Fixes: 0146dca70b87 ("xfrm: add support for UDPv6 encapsulation of ESP")
Cc: stable@kernel.org
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Co-developed-by: Yuan Tan <yuantan098@gmail.com>
Signed-off-by: Yuan Tan <yuantan098@gmail.com>
Suggested-by: Xin Liu <bird@lzu.edu.cn>
Tested-by: Ruide Cao <caoruide123@gmail.com>
Signed-off-by: Yilin Zhu <zylzyl2333@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
net/ipv6/xfrm6_protocol.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/ipv6/xfrm6_protocol.c b/net/ipv6/xfrm6_protocol.c
index ea2f805d3b01..9b586fcec485 100644
--- a/net/ipv6/xfrm6_protocol.c
+++ b/net/ipv6/xfrm6_protocol.c
@@ -88,8 +88,10 @@ int xfrm6_rcv_encap(struct sk_buff *skb, int nexthdr, __be32 spi,
dst = ip6_route_input_lookup(dev_net(skb->dev), skb->dev, &fl6,
skb, flags);
- if (dst->error)
+ if (dst->error) {
+ dst_release(dst);
goto drop;
+ }
skb_dst_set(skb, dst);
}
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 2/8] xfrm: ah: account for ESN high bits in async callbacks
2026-05-05 13:22 [PATCH 0/8] pull request (net): ipsec 2026-05-05 Steffen Klassert
2026-05-05 13:22 ` [PATCH 1/8] ipv6: xfrm6: release dst on error in xfrm6_rcv_encap() Steffen Klassert
@ 2026-05-05 13:22 ` Steffen Klassert
2026-05-05 13:22 ` [PATCH 3/8] tools/selftests: Use a sensible timeout value for iperf3 client Steffen Klassert
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Steffen Klassert @ 2026-05-05 13:22 UTC (permalink / raw)
To: David Miller, Jakub Kicinski; +Cc: Herbert Xu, Steffen Klassert, netdev
From: Michael Bommarito <michael.bommarito@gmail.com>
AH allocates its temporary auth/ICV layout differently when ESN is enabled:
the async ahash setup appends a 4-byte seqhi slot before the ICV or
auth_data area, but the async completion callbacks still reconstruct the
temporary layout as if seqhi were absent.
With an async AH implementation selected, that makes AH copy or compare
the wrong bytes on both the IPv4 and IPv6 paths. In UML repro on IPv4 AH
with ESN and forced async hmac(sha1), ping fails with 100% packet loss,
and the callback logs show the pre-fix drift:
ah4 output_done: esn=1 err=0 icv_off=20 expected_off=24
ah4 input_done: esn=1 auth_off=20 expected_auth_off=24 icv_off=32 expected_icv_off=36
Reconstruct the callback-side layout the same way the setup path built it
by skipping the ESN seqhi slot before locating the saved auth_data or ICV.
Per RFC 4302, the ESN high-order 32 bits participate in the AH ICV
computation, so the async callbacks must account for the seqhi slot.
Post-fix, the same IPv4 AH+ESN+forced-async-hmac(sha1) UML repro shows
the corrected offset (ah4 output_done: esn=1 err=0 icv_off=24
expected_off=24) and ping succeeds; net/ipv4/ah4.o and net/ipv6/ah6.o
build clean at W=1. IPv6 AH+ESN was not exercised at runtime, and the
change has not been tested against a real async hardware AH engine.
Fixes: d4d573d0334d ("{IPv4,xfrm} Add ESN support for AH egress part")
Fixes: d8b2a8600b0e ("{IPv4,xfrm} Add ESN support for AH ingress part")
Fixes: 26dd70c3fad3 ("{IPv6,xfrm} Add ESN support for AH egress part")
Fixes: 8d6da6f32557 ("{IPv6,xfrm} Add ESN support for AH ingress part")
Cc: stable@vger.kernel.org
Assisted-by: Codex:gpt-5-4
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
net/ipv4/ah4.c | 14 ++++++++++++--
net/ipv6/ah6.c | 14 ++++++++++++--
2 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 5fb812443a08..4366cbac3f06 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -124,9 +124,14 @@ static void ah_output_done(void *data, int err)
struct iphdr *top_iph = ip_hdr(skb);
struct ip_auth_hdr *ah = ip_auth_hdr(skb);
int ihl = ip_hdrlen(skb);
+ int seqhi_len = 0;
+ __be32 *seqhi;
+ if (x->props.flags & XFRM_STATE_ESN)
+ seqhi_len = sizeof(*seqhi);
iph = AH_SKB_CB(skb)->tmp;
- icv = ah_tmp_icv(iph, ihl);
+ seqhi = (__be32 *)((char *)iph + ihl);
+ icv = ah_tmp_icv(seqhi, seqhi_len);
memcpy(ah->auth_data, icv, ahp->icv_trunc_len);
top_iph->tos = iph->tos;
@@ -270,12 +275,17 @@ static void ah_input_done(void *data, int err)
struct ip_auth_hdr *ah = ip_auth_hdr(skb);
int ihl = ip_hdrlen(skb);
int ah_hlen = (ah->hdrlen + 2) << 2;
+ int seqhi_len = 0;
+ __be32 *seqhi;
if (err)
goto out;
+ if (x->props.flags & XFRM_STATE_ESN)
+ seqhi_len = sizeof(*seqhi);
work_iph = AH_SKB_CB(skb)->tmp;
- auth_data = ah_tmp_auth(work_iph, ihl);
+ seqhi = (__be32 *)((char *)work_iph + ihl);
+ auth_data = ah_tmp_auth(seqhi, seqhi_len);
icv = ah_tmp_icv(auth_data, ahp->icv_trunc_len);
err = crypto_memneq(icv, auth_data, ahp->icv_trunc_len) ? -EBADMSG : 0;
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index cb26beea4398..de1e68199a01 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -317,14 +317,19 @@ static void ah6_output_done(void *data, int err)
struct ipv6hdr *top_iph = ipv6_hdr(skb);
struct ip_auth_hdr *ah = ip_auth_hdr(skb);
struct tmp_ext *iph_ext;
+ int seqhi_len = 0;
+ __be32 *seqhi;
extlen = skb_network_header_len(skb) - sizeof(struct ipv6hdr);
if (extlen)
extlen += sizeof(*iph_ext);
+ if (x->props.flags & XFRM_STATE_ESN)
+ seqhi_len = sizeof(*seqhi);
iph_base = AH_SKB_CB(skb)->tmp;
iph_ext = ah_tmp_ext(iph_base);
- icv = ah_tmp_icv(iph_ext, extlen);
+ seqhi = (__be32 *)((char *)iph_ext + extlen);
+ icv = ah_tmp_icv(seqhi, seqhi_len);
memcpy(ah->auth_data, icv, ahp->icv_trunc_len);
memcpy(top_iph, iph_base, IPV6HDR_BASELEN);
@@ -471,13 +476,18 @@ static void ah6_input_done(void *data, int err)
struct ip_auth_hdr *ah = ip_auth_hdr(skb);
int hdr_len = skb_network_header_len(skb);
int ah_hlen = ipv6_authlen(ah);
+ int seqhi_len = 0;
+ __be32 *seqhi;
if (err)
goto out;
+ if (x->props.flags & XFRM_STATE_ESN)
+ seqhi_len = sizeof(*seqhi);
work_iph = AH_SKB_CB(skb)->tmp;
auth_data = ah_tmp_auth(work_iph, hdr_len);
- icv = ah_tmp_icv(auth_data, ahp->icv_trunc_len);
+ seqhi = (__be32 *)(auth_data + ahp->icv_trunc_len);
+ icv = ah_tmp_icv(seqhi, seqhi_len);
err = crypto_memneq(icv, auth_data, ahp->icv_trunc_len) ? -EBADMSG : 0;
if (err)
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 3/8] tools/selftests: Use a sensible timeout value for iperf3 client
2026-05-05 13:22 [PATCH 0/8] pull request (net): ipsec 2026-05-05 Steffen Klassert
2026-05-05 13:22 ` [PATCH 1/8] ipv6: xfrm6: release dst on error in xfrm6_rcv_encap() Steffen Klassert
2026-05-05 13:22 ` [PATCH 2/8] xfrm: ah: account for ESN high bits in async callbacks Steffen Klassert
@ 2026-05-05 13:22 ` Steffen Klassert
2026-05-05 13:23 ` [PATCH 4/8] tools/selftests: Add a VXLAN+IPsec traffic test Steffen Klassert
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Steffen Klassert @ 2026-05-05 13:22 UTC (permalink / raw)
To: David Miller, Jakub Kicinski; +Cc: Herbert Xu, Steffen Klassert, netdev
From: Cosmin Ratiu <cratiu@nvidia.com>
The default timeout of cmd() is 5 seconds and Iperf3Runner requests the
iperf3 client to run for 10 seconds, which clearly doesn't work since
commit [1] enforced the timeout parameter.
Use a value derived from duration as timeout (+5 seconds for
startup/teardown/various other overhead).
[1] commit f0bd19316663 ("selftests: net: fix timeout passed as positional argument to communicate()")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
tools/testing/selftests/drivers/net/lib/py/load.py | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/drivers/net/lib/py/load.py b/tools/testing/selftests/drivers/net/lib/py/load.py
index f181fa2d38fc..e24660e5c27f 100644
--- a/tools/testing/selftests/drivers/net/lib/py/load.py
+++ b/tools/testing/selftests/drivers/net/lib/py/load.py
@@ -48,7 +48,10 @@ class Iperf3Runner:
Starts the iperf3 client with the configured options.
"""
cmdline = self._build_client(streams, duration, reverse)
- return cmd(cmdline, background=background, host=self.env.remote)
+ kwargs = {"background": background, "host": self.env.remote}
+ if not background:
+ kwargs["timeout"] = duration + 5
+ return cmd(cmdline, **kwargs)
def measure_bandwidth(self, reverse=False):
"""
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 4/8] tools/selftests: Add a VXLAN+IPsec traffic test
2026-05-05 13:22 [PATCH 0/8] pull request (net): ipsec 2026-05-05 Steffen Klassert
` (2 preceding siblings ...)
2026-05-05 13:22 ` [PATCH 3/8] tools/selftests: Use a sensible timeout value for iperf3 client Steffen Klassert
@ 2026-05-05 13:23 ` Steffen Klassert
2026-05-05 13:23 ` [PATCH 5/8] xfrm: Don't clobber inner headers when already set Steffen Klassert
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Steffen Klassert @ 2026-05-05 13:23 UTC (permalink / raw)
To: David Miller, Jakub Kicinski; +Cc: Herbert Xu, Steffen Klassert, netdev
From: Cosmin Ratiu <cratiu@nvidia.com>
There are VXLAN tests and IPsec tests, but there is no test that
combines the two protocols and exercises the tunnel-over-ipsec code
paths. Fix that by adding a traffic test with VXLAN and IPsec using
crypto offload. This is runnable on HW which supports ESP offload (so no
nsim unfortunately).
Traffic is done with iperf3 and the test validates that there are no
packet drops and iperf3 can get to at least 100 Mbps (a very
conservative value on today's crypto offload HW, as it can typically
reach multi-Gbps rates).
Ran right now, the test fails due to a recently exposed bug in xfrm,
which will be fixed in the next patch:
# ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py
TAP version 13
1..4
# Check| At ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py,
# line 161, in test_vxlan_ipsec_crypto_offload:
# Check| ksft_eq(drops_after - drops_before, 0,
# Check failed 189 != 0 TX drops during VXLAN+IPsec
# Check| At ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py,
# line 163, in test_vxlan_ipsec_crypto_offload:
# Check| ksft_ge(bw_gbps, 0.1,
# Check failed 0.0015058278404812596 < 0.1 Minimum 100Mbps over
# VXLAN+IPsec
not ok 1 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v4
...
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
.../testing/selftests/drivers/net/hw/Makefile | 1 +
tools/testing/selftests/drivers/net/hw/config | 5 +
.../selftests/drivers/net/hw/ipsec_vxlan.py | 204 ++++++++++++++++++
3 files changed, 210 insertions(+)
create mode 100755 tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py
diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 85ca4d1ecf9e..3b6ff4708005 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -30,6 +30,7 @@ TEST_PROGS = \
gro_hw.py \
hw_stats_l3.sh \
hw_stats_l3_gre.sh \
+ ipsec_vxlan.py \
iou-zcrx.py \
irq.py \
loopback.sh \
diff --git a/tools/testing/selftests/drivers/net/hw/config b/tools/testing/selftests/drivers/net/hw/config
index dd50cb8a7911..ae0168c2bbe6 100644
--- a/tools/testing/selftests/drivers/net/hw/config
+++ b/tools/testing/selftests/drivers/net/hw/config
@@ -12,5 +12,10 @@ CONFIG_NET_IPGRE=y
CONFIG_NET_IPGRE_DEMUX=y
CONFIG_NETKIT=y
CONFIG_NET_SCH_INGRESS=y
+CONFIG_INET6_ESP=y
+CONFIG_INET6_ESP_OFFLOAD=y
+CONFIG_INET_ESP=y
+CONFIG_INET_ESP_OFFLOAD=y
CONFIG_UDMABUF=y
CONFIG_VXLAN=y
+CONFIG_XFRM_USER=y
diff --git a/tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py b/tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py
new file mode 100755
index 000000000000..0740a4d85240
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py
@@ -0,0 +1,204 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+"""Traffic test for VXLAN + IPsec crypto-offload."""
+
+import os
+
+from lib.py import ksft_run, ksft_exit, ksft_eq, ksft_ge
+from lib.py import ksft_variants, KsftNamedVariant, KsftSkipEx
+from lib.py import CmdExitFailure, NetDrvEpEnv, cmd, defer, ethtool, ip
+from lib.py import Iperf3Runner
+
+# Inner tunnel addresses - TEST-NET-2 (RFC 5737) / doc prefix (RFC 3849)
+INNER_V4_LOCAL = "198.51.100.1"
+INNER_V4_REMOTE = "198.51.100.2"
+INNER_V6_LOCAL = "2001:db8:100::1"
+INNER_V6_REMOTE = "2001:db8:100::2"
+
+# ESP parameters
+SPI_OUT = "0x1000"
+SPI_IN = "0x1001"
+# 128-bit key + 32-bit salt = 20 bytes hex, 128-bit ICV
+ESP_AEAD = "aead 'rfc4106(gcm(aes))' 0x" + "01" * 20 + " 128"
+
+
+def xfrm(args, host=None):
+ """Runs 'ip xfrm' via shell to preserve parentheses in algo names."""
+ cmd(f"ip xfrm {args}", shell=True, host=host)
+
+
+def check_xfrm_offload_support():
+ """Skips if iproute2 lacks xfrm offload support."""
+ out = cmd("ip xfrm state help", fail=False)
+ if "offload" not in out.stdout + out.stderr:
+ raise KsftSkipEx("iproute2 too old, missing xfrm offload")
+
+
+def check_esp_hw_offload(cfg):
+ """Skips if device lacks esp-hw-offload support."""
+ check_xfrm_offload_support()
+ try:
+ feat = ethtool(f"-k {cfg.ifname}", json=True)[0]
+ except (CmdExitFailure, IndexError) as e:
+ raise KsftSkipEx(f"can't query features: {e}") from e
+ if not feat.get("esp-hw-offload", {}).get("active"):
+ raise KsftSkipEx("Device does not support esp-hw-offload")
+
+
+def get_tx_drops(cfg):
+ """Returns TX dropped counter from the physical device."""
+ stats = ip("-s -s link show dev " + cfg.ifname, json=True)[0]
+ return stats["stats64"]["tx"]["dropped"]
+
+
+def setup_vxlan_ipsec(cfg, outer_ipver, inner_ipver):
+ """Sets up VXLAN tunnel with IPsec transport-mode crypto-offload."""
+ vxlan_name = f"vx{os.getpid()}"
+ local_addr = cfg.addr_v[outer_ipver]
+ remote_addr = cfg.remote_addr_v[outer_ipver]
+
+ if inner_ipver == "4":
+ inner_local = f"{INNER_V4_LOCAL}/24"
+ inner_remote = f"{INNER_V4_REMOTE}/24"
+ addr_extra = ""
+ else:
+ inner_local = f"{INNER_V6_LOCAL}/64"
+ inner_remote = f"{INNER_V6_REMOTE}/64"
+ addr_extra = " nodad"
+
+ if outer_ipver == "6":
+ vxlan_opts = "udp6zerocsumtx udp6zerocsumrx"
+ else:
+ vxlan_opts = "noudpcsum"
+
+ # VXLAN tunnel - local side
+ ip(f"link add {vxlan_name} type vxlan id 100 dstport 4789 {vxlan_opts} "
+ f"local {local_addr} remote {remote_addr} dev {cfg.ifname}")
+ defer(ip, f"link del {vxlan_name}")
+ ip(f"addr add {inner_local} dev {vxlan_name}{addr_extra}")
+ ip(f"link set {vxlan_name} up")
+
+ # VXLAN tunnel - remote side
+ ip(f"link add {vxlan_name} type vxlan id 100 dstport 4789 {vxlan_opts} "
+ f"local {remote_addr} remote {local_addr} dev {cfg.remote_ifname}",
+ host=cfg.remote)
+ defer(ip, f"link del {vxlan_name}", host=cfg.remote)
+ ip(f"addr add {inner_remote} dev {vxlan_name}{addr_extra}",
+ host=cfg.remote)
+ ip(f"link set {vxlan_name} up", host=cfg.remote)
+
+ # xfrm state - local outbound SA
+ xfrm(f"state add src {local_addr} dst {remote_addr} "
+ f"proto esp spi {SPI_OUT} "
+ f"{ESP_AEAD} "
+ f"mode transport offload crypto dev {cfg.ifname} dir out")
+ defer(xfrm, f"state del src {local_addr} dst {remote_addr} "
+ f"proto esp spi {SPI_OUT}")
+
+ # xfrm state - local inbound SA
+ xfrm(f"state add src {remote_addr} dst {local_addr} "
+ f"proto esp spi {SPI_IN} "
+ f"{ESP_AEAD} "
+ f"mode transport offload crypto dev {cfg.ifname} dir in")
+ defer(xfrm, f"state del src {remote_addr} dst {local_addr} "
+ f"proto esp spi {SPI_IN}")
+
+ # xfrm state - remote outbound SA (mirror, software crypto)
+ xfrm(f"state add src {remote_addr} dst {local_addr} "
+ f"proto esp spi {SPI_IN} "
+ f"{ESP_AEAD} "
+ f"mode transport",
+ host=cfg.remote)
+ defer(xfrm, f"state del src {remote_addr} dst {local_addr} "
+ f"proto esp spi {SPI_IN}", host=cfg.remote)
+
+ # xfrm state - remote inbound SA (mirror, software crypto)
+ xfrm(f"state add src {local_addr} dst {remote_addr} "
+ f"proto esp spi {SPI_OUT} "
+ f"{ESP_AEAD} "
+ f"mode transport",
+ host=cfg.remote)
+ defer(xfrm, f"state del src {local_addr} dst {remote_addr} "
+ f"proto esp spi {SPI_OUT}", host=cfg.remote)
+
+ # xfrm policy - local out
+ xfrm(f"policy add src {local_addr} dst {remote_addr} "
+ f"proto udp dport 4789 dir out "
+ f"tmpl src {local_addr} dst {remote_addr} proto esp mode transport")
+ defer(xfrm, f"policy del src {local_addr} dst {remote_addr} "
+ f"proto udp dport 4789 dir out")
+
+ # xfrm policy - local in
+ xfrm(f"policy add src {remote_addr} dst {local_addr} "
+ f"proto udp dport 4789 dir in "
+ f"tmpl src {remote_addr} dst {local_addr} proto esp mode transport")
+ defer(xfrm, f"policy del src {remote_addr} dst {local_addr} "
+ f"proto udp dport 4789 dir in")
+
+ # xfrm policy - remote out
+ xfrm(f"policy add src {remote_addr} dst {local_addr} "
+ f"proto udp dport 4789 dir out "
+ f"tmpl src {remote_addr} dst {local_addr} proto esp mode transport",
+ host=cfg.remote)
+ defer(xfrm, f"policy del src {remote_addr} dst {local_addr} "
+ f"proto udp dport 4789 dir out", host=cfg.remote)
+
+ # xfrm policy - remote in
+ xfrm(f"policy add src {local_addr} dst {remote_addr} "
+ f"proto udp dport 4789 dir in "
+ f"tmpl src {local_addr} dst {remote_addr} proto esp mode transport",
+ host=cfg.remote)
+ defer(xfrm, f"policy del src {local_addr} dst {remote_addr} "
+ f"proto udp dport 4789 dir in", host=cfg.remote)
+
+
+def _vxlan_ipsec_variants():
+ """Generates outer/inner IP version variants."""
+ for outer in ["4", "6"]:
+ for inner in ["4", "6"]:
+ yield KsftNamedVariant(f"outer_v{outer}_inner_v{inner}", outer, inner)
+
+
+@ksft_variants(_vxlan_ipsec_variants())
+def test_vxlan_ipsec_crypto_offload(cfg, outer_ipver, inner_ipver):
+ """Tests VXLAN+IPsec crypto-offload has no TX drops."""
+ cfg.require_ipver(outer_ipver)
+ check_esp_hw_offload(cfg)
+
+ setup_vxlan_ipsec(cfg, outer_ipver, inner_ipver)
+
+ if inner_ipver == "4":
+ inner_local = INNER_V4_LOCAL
+ inner_remote = INNER_V4_REMOTE
+ ping = "ping"
+ else:
+ inner_local = INNER_V6_LOCAL
+ inner_remote = INNER_V6_REMOTE
+ ping = "ping -6"
+
+ cmd(f"{ping} -c 1 -W 2 {inner_remote}")
+
+ drops_before = get_tx_drops(cfg)
+
+ runner = Iperf3Runner(cfg, server_ip=inner_local,
+ client_ip=inner_remote)
+ bw_gbps = runner.measure_bandwidth(reverse=True)
+
+ cfg.wait_hw_stats_settle()
+ drops_after = get_tx_drops(cfg)
+
+ ksft_eq(drops_after - drops_before, 0,
+ comment="TX drops during VXLAN+IPsec")
+ ksft_ge(bw_gbps, 0.1,
+ comment="Minimum 100Mbps over VXLAN+IPsec")
+
+
+def main():
+ """Runs VXLAN+IPsec crypto-offload GSO selftest."""
+ with NetDrvEpEnv(__file__, nsim_test=False) as cfg:
+ ksft_run([test_vxlan_ipsec_crypto_offload], args=(cfg,))
+ ksft_exit()
+
+
+if __name__ == "__main__":
+ main()
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 5/8] xfrm: Don't clobber inner headers when already set
2026-05-05 13:22 [PATCH 0/8] pull request (net): ipsec 2026-05-05 Steffen Klassert
` (3 preceding siblings ...)
2026-05-05 13:23 ` [PATCH 4/8] tools/selftests: Add a VXLAN+IPsec traffic test Steffen Klassert
@ 2026-05-05 13:23 ` Steffen Klassert
2026-05-05 13:23 ` [PATCH 6/8] xfrm: provide message size for XFRM_MSG_MAPPING Steffen Klassert
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: Steffen Klassert @ 2026-05-05 13:23 UTC (permalink / raw)
To: David Miller, Jakub Kicinski; +Cc: Herbert Xu, Steffen Klassert, netdev
From: Cosmin Ratiu <cratiu@nvidia.com>
On VXLAN over IPsec egress, xfrm{4,6}_transport_output() blindly
overwrite inner_transport_header (== the inner TCP header saved in VXLAN
iptunnel_handle_offloads() -> skb_reset_inner_headers()) with the
current transport_header (== the VXLAN outer UDP header set by
udp_tunnel_xmit_skb()).
This was a latent bug, harmless until commit [1] added a doff validation
check in qdisc_pkt_len_segs_init() for encapsulated GSO packets. With
the wrong inner_transport_header set by xfrm, qdisc_pkt_len_segs_init()
interprets inner_transport_header as a TCP header, reads doff=0 from the
upper byte of the VNI and drops the packet with DROP_REASON_SKB_BAD_GSO.
Besides the use in GSO to determine the header size of segmented
packets, inner_transport_header might be used by drivers to set up
inner checksum offloading by pointing the HW to the inner transport
header. A quick browse through available drivers shows that mlx5 uses
skb->csum_start specifically for this scenario, while others either
don't support VXLAN over IPsec crypto offload (ixgbe) or the HW is
capable of parsing the packets itself (nfp, Chelsio).
But in all cases, it is more correct to let the inner_transport_header
point to the innermost header instead of overwriting it in xfrm.
So fix this by guarding all four inner header save sites in
xfrm_output.c (xfrm{4,6}_transport_output, xfrm{4,6}_tunnel_encap_add)
with a check for skb->inner_protocol. When inner_protocol is set, a
tunnel layer (VXLAN, Geneve, GRE, etc.) has already saved the correct
inner header offsets and they must not be overwritten. When
inner_protocol is zero, no prior tunnel encapsulation exists and xfrm
must save the inner headers itself. The tunnel mode checks are only
added for completion, since they aren't strictly required, as
xfrm_output() forces software GSO in tunnel mode before encap.
This makes the previously added test pass:
# ./tools/testing/selftests/drivers/net/hw/ipsec_vxlan.py
TAP version 13
1..4
ok 1 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v4
ok 2 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v4_inner_v6
ok 3 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v6_inner_v4
ok 4 ipsec_vxlan.test_vxlan_ipsec_crypto_offload.outer_v6_inner_v6
# Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0
[1] commit 7fb4c1967011 ("net: pull headers in qdisc_pkt_len_segs_init()")
Fixes: f1bd7d659ef0 ("xfrm: Add encapsulation header offsets while SKB is not encrypted")
Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
net/xfrm/xfrm_output.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index a9652b422f51..cc35c2fcbbe0 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -66,7 +66,9 @@ static int xfrm4_transport_output(struct xfrm_state *x, struct sk_buff *skb)
struct iphdr *iph = ip_hdr(skb);
int ihl = iph->ihl * 4;
- skb_set_inner_transport_header(skb, skb_transport_offset(skb));
+ if (!skb->inner_protocol)
+ skb_set_inner_transport_header(skb,
+ skb_transport_offset(skb));
skb_set_network_header(skb, -x->props.header_len);
skb->mac_header = skb->network_header +
@@ -167,7 +169,9 @@ static int xfrm6_transport_output(struct xfrm_state *x, struct sk_buff *skb)
int hdr_len;
iph = ipv6_hdr(skb);
- skb_set_inner_transport_header(skb, skb_transport_offset(skb));
+ if (!skb->inner_protocol)
+ skb_set_inner_transport_header(skb,
+ skb_transport_offset(skb));
hdr_len = xfrm6_hdr_offset(x, skb, &prevhdr);
if (hdr_len < 0)
@@ -276,8 +280,10 @@ static int xfrm4_tunnel_encap_add(struct xfrm_state *x, struct sk_buff *skb)
struct iphdr *top_iph;
int flags;
- skb_set_inner_network_header(skb, skb_network_offset(skb));
- skb_set_inner_transport_header(skb, skb_transport_offset(skb));
+ if (!skb->inner_protocol) {
+ skb_set_inner_network_header(skb, skb_network_offset(skb));
+ skb_set_inner_transport_header(skb, skb_transport_offset(skb));
+ }
skb_set_network_header(skb, -x->props.header_len);
skb->mac_header = skb->network_header +
@@ -321,8 +327,10 @@ static int xfrm6_tunnel_encap_add(struct xfrm_state *x, struct sk_buff *skb)
struct ipv6hdr *top_iph;
int dsfield;
- skb_set_inner_network_header(skb, skb_network_offset(skb));
- skb_set_inner_transport_header(skb, skb_transport_offset(skb));
+ if (!skb->inner_protocol) {
+ skb_set_inner_network_header(skb, skb_network_offset(skb));
+ skb_set_inner_transport_header(skb, skb_transport_offset(skb));
+ }
skb_set_network_header(skb, -x->props.header_len);
skb->mac_header = skb->network_header +
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 6/8] xfrm: provide message size for XFRM_MSG_MAPPING
2026-05-05 13:22 [PATCH 0/8] pull request (net): ipsec 2026-05-05 Steffen Klassert
` (4 preceding siblings ...)
2026-05-05 13:23 ` [PATCH 5/8] xfrm: Don't clobber inner headers when already set Steffen Klassert
@ 2026-05-05 13:23 ` Steffen Klassert
2026-05-05 13:23 ` [PATCH 7/8] xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete Steffen Klassert
2026-05-05 13:23 ` [PATCH 8/8] xfrm: esp: avoid in-place decrypt on shared skb frags Steffen Klassert
7 siblings, 0 replies; 9+ messages in thread
From: Steffen Klassert @ 2026-05-05 13:23 UTC (permalink / raw)
To: David Miller, Jakub Kicinski; +Cc: Herbert Xu, Steffen Klassert, netdev
From: Ruijie Li <ruijieli51@gmail.com>
The compat 64=>32 translation path handles XFRM_MSG_MAPPING, but
xfrm_msg_min[] does not provide the native payload size for this
message type.
Add the missing XFRM_MSG_MAPPING entry so compat translation can size
and translate mapping notifications correctly.
Fixes: 5461fc0c8d9f ("xfrm/compat: Add 64=>32-bit messages translator")
Cc: stable@kernel.org
Reported-by: Yuan Tan <yuantan098@gmail.com>
Reported-by: Yifan Wu <yifanwucs@gmail.com>
Reported-by: Juefei Pu <tomapufckgml@gmail.com>
Reported-by: Xin Liu <bird@lzu.edu.cn>
Signed-off-by: Ruijie Li <ruijieli51@gmail.com>
Signed-off-by: Ren Wei <n05ec@lzu.edu.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
net/xfrm/xfrm_user.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index d56450f61669..38a90e5ee3d9 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -3323,6 +3323,7 @@ const int xfrm_msg_min[XFRM_NR_MSGTYPES] = {
[XFRM_MSG_GETSADINFO - XFRM_MSG_BASE] = sizeof(u32),
[XFRM_MSG_NEWSPDINFO - XFRM_MSG_BASE] = sizeof(u32),
[XFRM_MSG_GETSPDINFO - XFRM_MSG_BASE] = sizeof(u32),
+ [XFRM_MSG_MAPPING - XFRM_MSG_BASE] = XMSGSIZE(xfrm_user_mapping),
[XFRM_MSG_SETDEFAULT - XFRM_MSG_BASE] = XMSGSIZE(xfrm_userpolicy_default),
[XFRM_MSG_GETDEFAULT - XFRM_MSG_BASE] = XMSGSIZE(xfrm_userpolicy_default),
};
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 7/8] xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete
2026-05-05 13:22 [PATCH 0/8] pull request (net): ipsec 2026-05-05 Steffen Klassert
` (5 preceding siblings ...)
2026-05-05 13:23 ` [PATCH 6/8] xfrm: provide message size for XFRM_MSG_MAPPING Steffen Klassert
@ 2026-05-05 13:23 ` Steffen Klassert
2026-05-05 13:23 ` [PATCH 8/8] xfrm: esp: avoid in-place decrypt on shared skb frags Steffen Klassert
7 siblings, 0 replies; 9+ messages in thread
From: Steffen Klassert @ 2026-05-05 13:23 UTC (permalink / raw)
To: David Miller, Jakub Kicinski; +Cc: Herbert Xu, Steffen Klassert, netdev
From: Michal Kosiorek <mkosiorek121@gmail.com>
KASAN reproduces a slab-use-after-free in __xfrm_state_delete()'s
hlist_del_rcu calls under syzkaller load on linux-6.12.y stable
(reproduced on 6.12.47, also reachable via the same code path on
torvalds/master and on the ipsec tree). Nine unique signatures cluster
in the xfrm_state lifecycle, the load-bearing one being:
BUG: KASAN: slab-use-after-free in __hlist_del include/linux/list.h:990 [inline]
BUG: KASAN: slab-use-after-free in hlist_del_rcu include/linux/rculist.h:516 [inline]
BUG: KASAN: slab-use-after-free in __xfrm_state_delete net/xfrm/xfrm_state.c
Write of size 8 at addr ffff8881198bcb70 by task kworker/u8:9/435
Workqueue: netns cleanup_net
Call Trace:
__hlist_del / hlist_del_rcu
__xfrm_state_delete
xfrm_state_delete
xfrm_state_flush
xfrm_state_fini
ops_exit_list
cleanup_net
The other observed signatures hit the same slab object from
__xfrm_state_lookup, xfrm_alloc_spi, __xfrm_state_insert and an OOB
write variant of __xfrm_state_delete, all on the byseq/byspi
hash chains.
__xfrm_state_delete() guards its byseq and byspi unhashes with
value-based predicates:
if (x->km.seq)
hlist_del_rcu(&x->byseq);
if (x->id.spi)
hlist_del_rcu(&x->byspi);
while everywhere else in the file (e.g. state_cache, state_cache_input)
the safer hlist_unhashed() check is used. xfrm_alloc_spi() sets
x->id.spi = newspi inside xfrm_state_lock and then immediately inserts
into byspi, but a path that observes x->id.spi != 0 outside of
xfrm_state_lock can still skip-or-hit the byspi unhash inconsistently
with whether x is actually on the list. The same holds for x->km.seq
versus byseq, and the bydst/bysrc unhashes have no predicate at all,
so a second __xfrm_state_delete() on the same object writes through
LIST_POISON pprev.
The defensive change here:
- Use hlist_del_init_rcu() instead of hlist_del_rcu() on bydst,
bysrc, byseq and byspi so a second deletion is a no-op rather
than a write through LIST_POISON pprev. The byseq/byspi nodes
are already initialised in xfrm_state_alloc().
- Test hlist_unhashed() rather than the value predicate for
byseq/byspi, so the unhash decision tracks list state rather than
mutable scalar fields.
Empirical verification: applied this patch on top of v6.12.47, rebuilt,
and re-ran the same syzkaller harness for 1h16m on a previously-crashy
configuration that produced ~100 hits each of slab-use-after-free
Read in xfrm_alloc_spi / Read in __xfrm_state_lookup / Write in
__xfrm_state_delete. After the patch, 7.1M execs across 32 VMs at
~1550 exec/sec produced zero xfrm_state UAF/OOB hits. /proc/slabinfo
confirms the xfrm_state slab is actively allocated and freed during
the run (~143 KiB resident), so the fuzzer is still exercising those
code paths -- they just no longer crash.
Reproduction:
- Linux 6.12.47 x86_64 + KASAN_GENERIC + KASAN_INLINE + KCOV
- syzkaller @ 746545b8b1e4c3a128db8652b340d3df90ce61db
- 32 QEMU/KVM VMs x 2 vCPU on AWS c5.metal bare metal
- 9 unique signatures collected in ~9h, all within xfrm_state
lifecycle
Fixes: fe9f1d8779cb ("xfrm: add state hashtable keyed by seq")
Fixes: 7b4dc3600e48 ("[XFRM]: Do not add a state whose SPI is zero to the SPI hash.")
Reported-by: Michal Kosiorek <mkosiorek121@gmail.com>
Tested-by: Michal Kosiorek <mkosiorek121@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Michal Kosiorek <mkosiorek121@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
net/xfrm/xfrm_state.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 1748d374abca..686014d39429 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -818,17 +818,17 @@ int __xfrm_state_delete(struct xfrm_state *x)
spin_lock(&net->xfrm.xfrm_state_lock);
list_del(&x->km.all);
- hlist_del_rcu(&x->bydst);
- hlist_del_rcu(&x->bysrc);
- if (x->km.seq)
- hlist_del_rcu(&x->byseq);
+ hlist_del_init_rcu(&x->bydst);
+ hlist_del_init_rcu(&x->bysrc);
+ if (!hlist_unhashed(&x->byseq))
+ hlist_del_init_rcu(&x->byseq);
if (!hlist_unhashed(&x->state_cache))
hlist_del_rcu(&x->state_cache);
if (!hlist_unhashed(&x->state_cache_input))
hlist_del_rcu(&x->state_cache_input);
- if (x->id.spi)
- hlist_del_rcu(&x->byspi);
+ if (!hlist_unhashed(&x->byspi))
+ hlist_del_init_rcu(&x->byspi);
net->xfrm.state_num--;
xfrm_nat_keepalive_state_updated(x);
spin_unlock(&net->xfrm.xfrm_state_lock);
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread* [PATCH 8/8] xfrm: esp: avoid in-place decrypt on shared skb frags
2026-05-05 13:22 [PATCH 0/8] pull request (net): ipsec 2026-05-05 Steffen Klassert
` (6 preceding siblings ...)
2026-05-05 13:23 ` [PATCH 7/8] xfrm: defensively unhash xfrm_state lists in __xfrm_state_delete Steffen Klassert
@ 2026-05-05 13:23 ` Steffen Klassert
7 siblings, 0 replies; 9+ messages in thread
From: Steffen Klassert @ 2026-05-05 13:23 UTC (permalink / raw)
To: David Miller, Jakub Kicinski; +Cc: Herbert Xu, Steffen Klassert, netdev
From: Kuan-Ting Chen <h3xrabbit@gmail.com>
MSG_SPLICE_PAGES can attach pages from a pipe directly to an skb. TCP
marks such skbs with SKBFL_SHARED_FRAG after skb_splice_from_iter(),
so later paths that may modify packet data can first make a private
copy. The IPv4/IPv6 datagram append paths did not set this flag when
splicing pages into UDP skbs.
That leaves an ESP-in-UDP packet made from shared pipe pages looking
like an ordinary uncloned nonlinear skb. ESP input then takes the no-COW
fast path for uncloned skbs without a frag_list and decrypts in place
over data that is not owned privately by the skb.
Mark IPv4/IPv6 datagram splice frags with SKBFL_SHARED_FRAG, matching
TCP. Also make ESP input fall back to skb_cow_data() when the flag is
present, so ESP does not decrypt externally backed frags in place.
Private nonlinear skb frags still use the existing fast path.
This intentionally does not change ESP output. In esp_output_head(),
the path that appends the ESP trailer to existing skb tailroom without
calling skb_cow_data() is not reachable for nonlinear skbs:
skb_tailroom() returns zero when skb->data_len is nonzero, while ESP
tailen is positive. Thus ESP output will either use the separate
destination-frag path or fall back to skb_cow_data().
Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible")
Fixes: 7da0dde68486 ("ip, udp: Support MSG_SPLICE_PAGES")
Fixes: 6d8192bd69bb ("ip6, udp6: Support MSG_SPLICE_PAGES")
Reported-by: Hyunwoo Kim <imv4bel@gmail.com>
Reported-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
Tested-by: Hyunwoo Kim <imv4bel@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Kuan-Ting Chen <h3xrabbit@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
---
net/ipv4/esp4.c | 3 ++-
net/ipv4/ip_output.c | 2 ++
net/ipv6/esp6.c | 3 ++-
net/ipv6/ip6_output.c | 2 ++
4 files changed, 8 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 6dfc0bcdef65..6a5febbdbee4 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -873,7 +873,8 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
nfrags = 1;
goto skip_cow;
- } else if (!skb_has_frag_list(skb)) {
+ } else if (!skb_has_frag_list(skb) &&
+ !skb_has_shared_frag(skb)) {
nfrags = skb_shinfo(skb)->nr_frags;
nfrags++;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index e4790cc7b5c2..5bcd73cbdb41 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1233,6 +1233,8 @@ static int __ip_append_data(struct sock *sk,
if (err < 0)
goto error;
copy = err;
+ if (!(flags & MSG_NO_SHARED_FRAGS))
+ skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
wmem_alloc_delta += copy;
} else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 9f75313734f8..9c06c5a1419d 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -915,7 +915,8 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff *skb)
nfrags = 1;
goto skip_cow;
- } else if (!skb_has_frag_list(skb)) {
+ } else if (!skb_has_frag_list(skb) &&
+ !skb_has_shared_frag(skb)) {
nfrags = skb_shinfo(skb)->nr_frags;
nfrags++;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7e92909ab5be..1f2a33fbed6e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1794,6 +1794,8 @@ static int __ip6_append_data(struct sock *sk,
if (err < 0)
goto error;
copy = err;
+ if (!(flags & MSG_NO_SHARED_FRAGS))
+ skb_shinfo(skb)->flags |= SKBFL_SHARED_FRAG;
wmem_alloc_delta += copy;
} else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread