[PATCH bpf v4 0/2] bpf: Update transport_header when encapsulating UDP tunnel in lwt

* [PATCH bpf v4 0/2] bpf: Update transport_header when encapsulating UDP tunnel in lwt
@ 2026-06-02 15:09 Leon Hwang
  2026-06-02 15:09 ` [PATCH bpf v4 1/2] " Leon Hwang
  2026-06-02 15:09 ` [PATCH bpf v4 2/2] selftests/bpf: Add tests to verify the fix of encapsulating VxLAN " Leon Hwang
  0 siblings, 2 replies; 4+ messages in thread
From: Leon Hwang @ 2026-06-02 15:09 UTC (permalink / raw)
  To: bpf
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andrii Nakryiko, Eduard Zingerman,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	Shuah Khan, Guillaume Nault, Leon Hwang, Ido Schimmel,
	Fernando Fernandez Mancera, Peter Oskolkov, linux-kernel, netdev,
	linux-kselftest, kernel-patches-bot

Currently, bpf_lwt_push_ip_encap() does not update skb->transport_header.
When a driver, e.g. ice, reuses the stale skb->transport_header to
offload checksum computation to NIC hardware, VxLAN packets encapsulated
by the bpf_lwt_push_encap() helper may be dropped due to an incorrect checksum.

Update skb->transport_header in bpf_lwt_push_ip_encap() whenever the
pushed outer header carries UDP, so checksum offload works correctly.
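
As a rough illustration of the offset the fix has to produce, here is a
minimal userspace sketch in C. It is not the kernel code: it assumes the
pushed buffer starts with a plain IPv4 or IPv6 header (no IPv6 extension
headers), and outer_udp_offset() and the SKETCH_* constants are
hypothetical names used only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_IPPROTO_UDP  17	/* protocol number for UDP */
#define SKETCH_IPV6_HDR_LEN 40	/* fixed IPv6 header size */
#define SKETCH_UDP_HDR_LEN  8	/* UDP header size */

/*
 * Return the offset of the outer UDP header, measured from the start of
 * the pushed encap header, or -1 if the buffer is not a well-formed
 * IPv4/IPv6 + UDP header. This is the value the transport header should
 * have relative to the network header after encapsulation.
 */
static int outer_udp_offset(const uint8_t *hdr, size_t len)
{
	unsigned int version, ihl;

	assert(hdr != NULL);
	if (len < 1)
		return -1;

	version = hdr[0] >> 4;
	if (version == 4) {
		ihl = hdr[0] & 0x0f;	/* header length in 32-bit words */
		if (ihl < 5 || len < (size_t)ihl * 4 + SKETCH_UDP_HDR_LEN)
			return -1;
		if (hdr[9] != SKETCH_IPPROTO_UDP)	/* IPv4 protocol field */
			return -1;
		return (int)(ihl * 4);
	}
	if (version == 6) {
		if (len < SKETCH_IPV6_HDR_LEN + SKETCH_UDP_HDR_LEN)
			return -1;
		if (hdr[6] != SKETCH_IPPROTO_UDP)	/* IPv6 next header field */
			return -1;
		return SKETCH_IPV6_HDR_LEN;
	}
	return -1;
}
```

For an IPv4 header with ihl = 5 the sketch yields 20, and for IPv6 it
yields 40, which matches the expected offsets checked by the selftests
in patch #2.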

Changes:
v3 -> v4:
* Address comments from Emil:
  * Make the logic of skb_set_transport_header() clearer in patch #1.
  * Fold the code of fexit_lwt_push_ip_encap() into test_lwt_ip_encap.c in
    patch #2.
  * Resolve assorted issues in the test in patch #2.
* v3: https://lore.kernel.org/bpf/20260601150203.20352-1-leon.hwang@linux.dev/

v2 -> v3:
* Drop patches #1 and #2 of v2, which aimed to resolve potential issues
  reported by sashiko (per Alexei).
* Check target IP version and UDP tunnel in test (per sashiko).
* v2: https://lore.kernel.org/bpf/20260529151351.69911-1-leon.hwang@linux.dev/

v1 -> v2:
* Address sashiko's reviews:
  * Fix TOCTOU issue in lwt to avoid changing hdr after checks.
  * Add an iph->ihl < 5 check in lwt to avoid an infinite loop in the MIPS driver.
  * Update comments in selftests to follow the BPF comment style.
* v1: https://lore.kernel.org/bpf/20260525142650.2569-1-leon.hwang@linux.dev/

Leon Hwang (2):
  bpf: Update transport_header when encapsulating UDP tunnel in lwt
  selftests/bpf: Add tests to verify the fix of encapsulating VxLAN in
    lwt

 net/core/lwt_bpf.c                            |  12 ++
 .../selftests/bpf/prog_tests/lwt_ip_encap.c   | 145 ++++++++++++++++
 .../selftests/bpf/progs/test_lwt_ip_encap.c   | 155 ++++++++++++++++--
 3 files changed, 302 insertions(+), 10 deletions(-)

-- 
2.54.0


* [PATCH bpf v4 1/2] bpf: Update transport_header when encapsulating UDP tunnel in lwt
  2026-06-02 15:09 [PATCH bpf v4 0/2] bpf: Update transport_header when encapsulating UDP tunnel in lwt Leon Hwang
@ 2026-06-02 15:09 ` Leon Hwang
  2026-06-02 15:51   ` bot+bpf-ci
  2026-06-02 15:09 ` [PATCH bpf v4 2/2] selftests/bpf: Add tests to verify the fix of encapsulating VxLAN " Leon Hwang
  1 sibling, 1 reply; 4+ messages in thread
From: Leon Hwang @ 2026-06-02 15:09 UTC (permalink / raw)
  To: bpf
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andrii Nakryiko, Eduard Zingerman,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	Shuah Khan, Guillaume Nault, Leon Hwang, Ido Schimmel,
	Fernando Fernandez Mancera, Peter Oskolkov, linux-kernel, netdev,
	linux-kselftest, kernel-patches-bot, Leon Hwang

Currently, bpf_lwt_push_ip_encap() does not update skb->transport_header.
When a driver, e.g. ice, reuses the stale skb->transport_header to
offload checksum computation to NIC hardware, VxLAN packets encapsulated
by the bpf_lwt_push_encap() helper may be dropped due to an incorrect checksum.

Update skb->transport_header in bpf_lwt_push_ip_encap() whenever the
pushed outer header carries UDP, so checksum offload works correctly.

Fixes: 52f278774e79 ("bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap")
Cc: Leon Hwang <leon.huangfu@shopee.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 net/core/lwt_bpf.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index f71ef82a5f3d..bf588f508b79 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -599,6 +599,7 @@ static int handle_gso_encap(struct sk_buff *skb, bool ipv4, int encap_len)
 
 int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
 {
+	bool is_udp_tunnel;
 	struct iphdr *iph;
 	bool ipv4;
 	int err;
@@ -612,10 +613,16 @@ int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
 		ipv4 = true;
 		if (unlikely(len < iph->ihl * 4))
 			return -EINVAL;
+		is_udp_tunnel = iph->protocol == IPPROTO_UDP;
+		if (unlikely(is_udp_tunnel && len < iph->ihl * 4 + sizeof(struct udphdr)))
+			return -EINVAL;
 	} else if (iph->version == 6) {
 		ipv4 = false;
 		if (unlikely(len < sizeof(struct ipv6hdr)))
 			return -EINVAL;
+		is_udp_tunnel = ((struct ipv6hdr *)iph)->nexthdr == NEXTHDR_UDP;
+		if (unlikely(is_udp_tunnel && len < sizeof(struct ipv6hdr) + sizeof(struct udphdr)))
+			return -EINVAL;
 	} else {
 		return -EINVAL;
 	}
@@ -637,6 +644,11 @@ int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
 	if (ingress)
 		skb_postpush_rcsum(skb, iph, len);
 	skb_reset_network_header(skb);
+	if (is_udp_tunnel) {
+		size_t iph_sz = ipv4 ? iph->ihl * 4 : sizeof(struct ipv6hdr);
+
+		skb_set_transport_header(skb, skb_network_offset(skb) + iph_sz);
+	}
 	memcpy(skb_network_header(skb), hdr, len);
 	bpf_compute_data_pointers(skb);
 	skb_clear_hash(skb);
-- 
2.54.0



* [PATCH bpf v4 2/2] selftests/bpf: Add tests to verify the fix of encapsulating VxLAN in lwt
  2026-06-02 15:09 [PATCH bpf v4 0/2] bpf: Update transport_header when encapsulating UDP tunnel in lwt Leon Hwang
  2026-06-02 15:09 ` [PATCH bpf v4 1/2] " Leon Hwang
@ 2026-06-02 15:09 ` Leon Hwang
  1 sibling, 0 replies; 4+ messages in thread
From: Leon Hwang @ 2026-06-02 15:09 UTC (permalink / raw)
  To: bpf
  Cc: David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Andrii Nakryiko, Eduard Zingerman,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	Shuah Khan, Guillaume Nault, Leon Hwang, Ido Schimmel,
	Fernando Fernandez Mancera, Peter Oskolkov, linux-kernel, netdev,
	linux-kselftest, kernel-patches-bot, Leon Hwang

Add two tests to verify that the skb transport header is set when
encapsulating VxLAN with the bpf_lwt_push_encap() helper.

1. VxLAN over IPv4.
2. VxLAN over IPv6.

Without the fix, the tests would fail:

 lwt_ip_encap_vxlan:FAIL:transport_hdr offset unexpected transport_hdr offset: actual 70 != expected 20
 #208     lwt_ip_encap_vxlan_ipv4:FAIL
 lwt_ip_encap_vxlan:FAIL:transport_hdr offset unexpected transport_hdr offset: actual 110 != expected 40
 #209     lwt_ip_encap_vxlan_ipv6:FAIL

The unexpected offsets are the sum of the outer encap headers
(IPv4: iphdr+udp+vxlan+eth = 50 bytes, IPv6: ipv6hdr+udp+vxlan+eth = 70 bytes)
and the inner IP header (20 or 40 bytes). Without the fix,
transport_header still points at the inner transport layer instead of the
outer UDP header.
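
The arithmetic behind those numbers can be spelled out with a small
userspace sketch in C. It assumes option-free IPv4 headers and IPv6
headers without extension headers, as in the tests; the enum names and
the helper functions are hypothetical and exist only for this
illustration.

```c
#include <assert.h>

enum {
	IPV4_HDR_LEN  = 20,	/* iphdr with ihl = 5 */
	IPV6_HDR_LEN  = 40,	/* fixed ipv6hdr */
	UDP_HDR_LEN   = 8,
	VXLAN_HDR_LEN = 8,
	ETH_HDR_LEN   = 14,
};

/*
 * Offset reported without the fix: transport_header still refers to the
 * inner transport layer, so it sits past all pushed outer headers and
 * the inner IP header.
 */
static int stale_transport_offset(int outer_ip_len, int inner_ip_len)
{
	return outer_ip_len + UDP_HDR_LEN + VXLAN_HDR_LEN + ETH_HDR_LEN +
	       inner_ip_len;
}

/* Offset expected with the fix: directly after the outer IP header. */
static int fixed_transport_offset(int outer_ip_len)
{
	return outer_ip_len;
}

/* Cross-check against the values in the failure log above. */
static void check_against_failure_log(void)
{
	assert(stale_transport_offset(IPV4_HDR_LEN, IPV4_HDR_LEN) == 70);
	assert(stale_transport_offset(IPV6_HDR_LEN, IPV6_HDR_LEN) == 110);
	assert(fixed_transport_offset(IPV4_HDR_LEN) == 20);
	assert(fixed_transport_offset(IPV6_HDR_LEN) == 40);
}
```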

Assisted-by: Claude:claude-sonnet-4-6
Cc: Leon Hwang <leon.huangfu@shopee.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 .../selftests/bpf/prog_tests/lwt_ip_encap.c   | 145 ++++++++++++++++
 .../selftests/bpf/progs/test_lwt_ip_encap.c   | 155 ++++++++++++++++--
 2 files changed, 290 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/lwt_ip_encap.c b/tools/testing/selftests/bpf/prog_tests/lwt_ip_encap.c
index b6391af5f6f9..6606f0ed9a9a 100644
--- a/tools/testing/selftests/bpf/prog_tests/lwt_ip_encap.c
+++ b/tools/testing/selftests/bpf/prog_tests/lwt_ip_encap.c
@@ -3,6 +3,7 @@
 
 #include "network_helpers.h"
 #include "test_progs.h"
+#include "test_lwt_ip_encap.skel.h"
 
 #define BPF_FILE "test_lwt_ip_encap.bpf.o"
 
@@ -32,6 +33,9 @@
 #define IP6_ADDR_8 "fb08::1"
 #define IP6_ADDR_GRE "fb10::1"
 
+#define IP4_ADDR_VXLAN  "172.16.17.100"
+#define IP6_ADDR_VXLAN  "fb11::1"
+
 #define IP6_ADDR_SRC IP6_ADDR_1
 #define IP6_ADDR_DST IP6_ADDR_4
 
@@ -538,3 +542,144 @@ void test_lwt_ip_encap_ipv4(void)
 	if (test__start_subtest("ingress"))
 		lwt_ip_encap(IPV4_ENCAP, INGRESS, "");
 }
+
+/*
+ * VxLAN Setup/topology:
+ *
+ * NS1 (IP*_ADDR_1)                NS2                  NS3 (IP*_ADDR_4)
+ *       [ping src]
+ *           |                          top route
+ *         veth1 (LWT encap)  <<-- veth2        veth3  <<-- veth4 (ping dst)
+ *           |                                                ^
+ *       (bottom route)                                       | (inner pkt)
+ *           v                        bottom route            |
+ *         veth5              -->> veth6        veth7  -->> veth8 (vxlan decap)
+ *                                                          (IP*_ADDR_VXLAN)
+ *
+ * Add the VxLAN endpoint addresses to NS3's veth8, create standard
+ * VxLAN decap devices bound to those addresses, and install routes so
+ * NS1/NS2 can reach the endpoints via the bottom route.  NS2 here is to
+ * make sure the LWT-encap VxLAN packets are routed to NS3 correctly.
+ */
+static int setup_vxlan_routes(const char *ns3, const char *ns1, const char *ns2)
+{
+	struct nstoken *nstoken;
+
+	nstoken = open_netns(ns3);
+	if (!ASSERT_OK_PTR(nstoken, "open ns3 for vxlan"))
+		return -1;
+
+	SYS(fail_close, "ip    a add %s/32  dev veth8", IP4_ADDR_VXLAN);
+	SYS(fail_close, "ip -6 a add %s/128 dev veth8", IP6_ADDR_VXLAN);
+	/*
+	 * Standard VxLAN devices to decap the encapsulated packets.  The inner
+	 * Ethernet frame uses a broadcast dst MAC so the IP stack accepts it
+	 * without ARP or FDB configuration.
+	 */
+	SYS(fail_close, "ip link add vxlan4 type vxlan id 1 dstport 4789 local %s dev veth8 nolearning noudpcsum",
+	    IP4_ADDR_VXLAN);
+	SYS(fail_close, "ip link set vxlan4 up");
+	SYS(fail_close, "ip link add vxlan6 type vxlan id 1 dstport 4789 local %s dev veth8 nolearning udp6zerocsumrx",
+	    IP6_ADDR_VXLAN);
+	SYS(fail_close, "ip link set vxlan6 up");
+	close_netns(nstoken);
+
+	SYS(fail, "ip -n %s    route add %s/32  dev veth5 via %s",
+	    ns1, IP4_ADDR_VXLAN, IP4_ADDR_6);
+	SYS(fail, "ip -n %s    route add %s/32  dev veth7 via %s",
+	    ns2, IP4_ADDR_VXLAN, IP4_ADDR_8);
+	SYS(fail, "ip -n %s -6 route add %s/128 dev veth5 via %s",
+	    ns1, IP6_ADDR_VXLAN, IP6_ADDR_6);
+	SYS(fail, "ip -n %s -6 route add %s/128 dev veth7 via %s",
+	    ns2, IP6_ADDR_VXLAN, IP6_ADDR_8);
+	return 0;
+
+fail_close:
+	close_netns(nstoken);
+fail:
+	return -1;
+}
+
+static void lwt_ip_encap_vxlan(bool ipv4_encap)
+{
+	char ns1[NETNS_NAME_SIZE] = NETNS_BASE "-1-";
+	char ns2[NETNS_NAME_SIZE] = NETNS_BASE "-2-";
+	char ns3[NETNS_NAME_SIZE] = NETNS_BASE "-3-";
+	const char *sec = ipv4_encap ? "encap_vxlan" : "encap_vxlan6";
+	int expected_offset = ipv4_encap ? (int)sizeof(struct iphdr)
+					 : (int)sizeof(struct ipv6hdr);
+	struct test_lwt_ip_encap *skel = NULL;
+	int thdr_offset, err;
+
+	if (!ASSERT_OK(create_ns(ns1, NETNS_NAME_SIZE), "create ns1"))
+		goto out;
+	if (!ASSERT_OK(create_ns(ns2, NETNS_NAME_SIZE), "create ns2"))
+		goto out;
+	if (!ASSERT_OK(create_ns(ns3, NETNS_NAME_SIZE), "create ns3"))
+		goto out;
+
+	if (!ASSERT_OK(setup_network(ns1, ns2, ns3, ""), "setup network"))
+		goto out;
+
+	if (!ASSERT_OK(setup_vxlan_routes(ns3, ns1, ns2), "setup vxlan routes"))
+		goto out;
+
+	skel = test_lwt_ip_encap__open();
+	if (!ASSERT_OK_PTR(skel, "test_lwt_ip_encap__open"))
+		goto out;
+
+	bpf_program__set_autoload(skel->progs.bpf_lwt_encap_gre, false);
+	bpf_program__set_autoload(skel->progs.bpf_lwt_encap_gre6, false);
+	bpf_program__set_autoload(skel->progs.bpf_lwt_encap_vxlan, false);
+	bpf_program__set_autoload(skel->progs.bpf_lwt_encap_vxlan6, false);
+	bpf_program__set_autoload(skel->progs.fexit_lwt_push_ip_encap, true);
+	skel->rodata->tgt_ip_version = ipv4_encap ? 4 : 6;
+
+	err = test_lwt_ip_encap__load(skel);
+	if (!ASSERT_OK(err, "test_lwt_ip_encap__load"))
+		goto out;
+
+	err = test_lwt_ip_encap__attach(skel);
+	if (!ASSERT_OK(err, "test_lwt_ip_encap__attach"))
+		goto out;
+
+	/* Remove the direct NS2->DST route so packets must go via LWT encap. */
+	SYS(out, "ip -n %s    route del %s/32  dev veth3", ns2, IP4_ADDR_DST);
+	SYS(out, "ip -n %s -6 route del %s/128 dev veth3", ns2, IP6_ADDR_DST);
+
+	if (ipv4_encap)
+		SYS(out, "ip -n %s route add %s encap bpf xmit obj %s sec %s dev veth1",
+		    ns1, IP4_ADDR_DST, BPF_FILE, sec);
+	else
+		SYS(out, "ip -n %s -6 route add %s encap bpf xmit obj %s sec %s dev veth1",
+		    ns1, IP6_ADDR_DST, BPF_FILE, sec);
+
+	skel->bss->fexit_triggered = false;
+
+	if (ipv4_encap)
+		SYS(out, "ip netns exec %s ping  -c 1 -W1 %s", ns1, IP4_ADDR_DST);
+	else
+		SYS(out, "ip netns exec %s ping6 -c 1 -W1 %s", ns1, IP6_ADDR_DST);
+
+	if (!ASSERT_TRUE(skel->bss->fexit_triggered, "fexit_triggered"))
+		goto out;
+
+	thdr_offset = (int)skel->bss->transport_hdr - (int)skel->bss->network_hdr;
+	ASSERT_EQ(thdr_offset, expected_offset, "transport_hdr offset");
+
+out:
+	test_lwt_ip_encap__destroy(skel);
+	SYS_NOFAIL("ip netns del %s", ns1);
+	SYS_NOFAIL("ip netns del %s", ns2);
+	SYS_NOFAIL("ip netns del %s", ns3);
+}
+
+void test_lwt_ip_encap_vxlan_ipv4(void)
+{
+	lwt_ip_encap_vxlan(IPV4_ENCAP);
+}
+
+void test_lwt_ip_encap_vxlan_ipv6(void)
+{
+	lwt_ip_encap_vxlan(IPV6_ENCAP);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c b/tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c
index d6cb986e7533..4a934fccf8f5 100644
--- a/tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c
+++ b/tools/testing/selftests/bpf/progs/test_lwt_ip_encap.c
@@ -1,11 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
-#include <stddef.h>
+#include "vmlinux.h"
 #include <string.h>
-#include <linux/bpf.h>
-#include <linux/ip.h>
-#include <linux/ipv6.h>
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include <bpf/bpf_tracing.h>
 
 struct grehdr {
 	__be16 flags;
@@ -64,13 +62,13 @@ int bpf_lwt_encap_gre6(struct __sk_buff *skb)
 	hdr.ip6hdr.nexthdr = 47;  /* IPPROTO_GRE */
 	hdr.ip6hdr.hop_limit = 0x40;
 	/* fb01::1 */
-	hdr.ip6hdr.saddr.s6_addr[0] = 0xfb;
-	hdr.ip6hdr.saddr.s6_addr[1] = 1;
-	hdr.ip6hdr.saddr.s6_addr[15] = 1;
+	hdr.ip6hdr.saddr.in6_u.u6_addr8[0] = 0xfb;
+	hdr.ip6hdr.saddr.in6_u.u6_addr8[1] = 1;
+	hdr.ip6hdr.saddr.in6_u.u6_addr8[15] = 1;
 	/* fb10::1 */
-	hdr.ip6hdr.daddr.s6_addr[0] = 0xfb;
-	hdr.ip6hdr.daddr.s6_addr[1] = 0x10;
-	hdr.ip6hdr.daddr.s6_addr[15] = 1;
+	hdr.ip6hdr.daddr.in6_u.u6_addr8[0] = 0xfb;
+	hdr.ip6hdr.daddr.in6_u.u6_addr8[1] = 0x10;
+	hdr.ip6hdr.daddr.in6_u.u6_addr8[15] = 1;
 
 	hdr.greh.protocol = skb->protocol;
 
@@ -82,4 +80,141 @@ int bpf_lwt_encap_gre6(struct __sk_buff *skb)
 	return BPF_LWT_REROUTE;
 }
 
+#define VXLAN_PORT  4789
+#define VXLAN_FLAGS 0x08000000
+#define VXLAN_VNI   1
+
+#define ETH_ALEN	6		/* Octets in one ethernet addr	 */
+#define ETH_P_IP	0x0800		/* Internet Protocol packet	*/
+#define ETH_P_IPV6	0x86DD		/* IPv6 over bluebook		*/
+
+static const __u8 bcast[ETH_ALEN] = {
+	0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
+};
+
+static const __u8 srcmac[ETH_ALEN] = {
+	0x02, 0x00, 0x00, 0x00, 0x00, 0x01,
+};
+
+SEC("encap_vxlan")
+int bpf_lwt_encap_vxlan(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct iphdr    iph;
+		struct udphdr   udph;
+		struct vxlanhdr vxh;
+		struct ethhdr   eth;
+	} __attribute__((__packed__)) hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(hdr));
+
+	hdr.iph.ihl      = 5;
+	hdr.iph.version  = 4;
+	hdr.iph.ttl      = 0x40;
+	hdr.iph.protocol = 17; /* IPPROTO_UDP */
+	hdr.iph.tot_len  = bpf_htons(skb->len + sizeof(hdr));
+#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
+	hdr.iph.saddr = 0x640510ac;  /* 172.16.5.100  */
+	hdr.iph.daddr = 0x641110ac;  /* 172.16.17.100 */
+#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+	hdr.iph.saddr = 0xac100564;  /* 172.16.5.100 */
+	hdr.iph.daddr = 0xac101164;  /* 172.16.17.100 */
+#else
+#error "Fix your compiler's __BYTE_ORDER__?!"
+#endif
+
+	hdr.udph.source = bpf_htons(VXLAN_PORT);
+	hdr.udph.dest   = bpf_htons(VXLAN_PORT);
+	hdr.udph.len    = bpf_htons(skb->len + sizeof(hdr.udph) + sizeof(hdr.vxh) +
+				    sizeof(hdr.eth));
+
+	hdr.vxh.vx_flags = bpf_htonl(VXLAN_FLAGS);
+	hdr.vxh.vx_vni   = bpf_htonl(VXLAN_VNI << 8);
+
+	__builtin_memcpy(hdr.eth.h_dest, bcast, ETH_ALEN);
+	__builtin_memcpy(hdr.eth.h_source, srcmac, ETH_ALEN);
+	hdr.eth.h_proto = bpf_htons(ETH_P_IP);
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr, sizeof(hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+SEC("encap_vxlan6")
+int bpf_lwt_encap_vxlan6(struct __sk_buff *skb)
+{
+	struct encap_hdr {
+		struct ipv6hdr  ip6hdr;
+		struct udphdr   udph;
+		struct vxlanhdr vxh;
+		struct ethhdr   eth;
+	} __attribute__((__packed__)) hdr;
+	int err;
+
+	memset(&hdr, 0, sizeof(hdr));
+
+	hdr.ip6hdr.version     = 6;
+	hdr.ip6hdr.nexthdr     = 17; /* IPPROTO_UDP */
+	hdr.ip6hdr.hop_limit   = 0x40;
+	hdr.ip6hdr.payload_len = bpf_htons(skb->len + sizeof(hdr.udph) + sizeof(hdr.vxh) +
+					   sizeof(hdr.eth));
+	/* fb05::1 */
+	hdr.ip6hdr.saddr.in6_u.u6_addr8[0]  = 0xfb;
+	hdr.ip6hdr.saddr.in6_u.u6_addr8[1]  = 0x05;
+	hdr.ip6hdr.saddr.in6_u.u6_addr8[15] = 1;
+	/* fb11::1 */
+	hdr.ip6hdr.daddr.in6_u.u6_addr8[0]  = 0xfb;
+	hdr.ip6hdr.daddr.in6_u.u6_addr8[1]  = 0x11;
+	hdr.ip6hdr.daddr.in6_u.u6_addr8[15] = 1;
+
+	hdr.udph.source = bpf_htons(VXLAN_PORT);
+	hdr.udph.dest   = bpf_htons(VXLAN_PORT);
+	hdr.udph.len    = bpf_htons(skb->len + sizeof(hdr.udph) + sizeof(hdr.vxh) +
+				    sizeof(hdr.eth));
+
+	hdr.vxh.vx_flags = bpf_htonl(VXLAN_FLAGS);
+	hdr.vxh.vx_vni   = bpf_htonl(VXLAN_VNI << 8);
+
+	__builtin_memcpy(hdr.eth.h_dest, bcast, ETH_ALEN);
+	__builtin_memcpy(hdr.eth.h_source, srcmac, ETH_ALEN);
+	hdr.eth.h_proto = bpf_htons(ETH_P_IPV6);
+
+	err = bpf_lwt_push_encap(skb, BPF_LWT_ENCAP_IP, &hdr, sizeof(hdr));
+	if (err)
+		return BPF_DROP;
+
+	return BPF_LWT_REROUTE;
+}
+
+volatile const int tgt_ip_version;
+
+__u16 transport_hdr = 0;
+__u16 network_hdr = 0;
+bool fexit_triggered = false;
+
+SEC("?fexit/bpf_lwt_push_ip_encap")
+int BPF_PROG(fexit_lwt_push_ip_encap, struct sk_buff *skb, void *hdr, u32 len, bool ingress,
+	     int retval)
+{
+	struct iphdr *iph;
+
+	if (retval || fexit_triggered)
+		return 0;
+
+	iph = (typeof(iph)) (skb->head + skb->network_header);
+	if (iph->version != tgt_ip_version)
+		return 0;
+
+	if ((iph->version == 4 && iph->protocol == 17 /* IPPROTO_UDP */) ||
+	    (iph->version == 6 && ((struct ipv6hdr *)iph)->nexthdr == 17 /* IPPROTO_UDP */)) {
+		fexit_triggered = true;
+		transport_hdr   = skb->transport_header;
+		network_hdr     = skb->network_header;
+	}
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
-- 
2.54.0



* Re: [PATCH bpf v4 1/2] bpf: Update transport_header when encapsulating UDP tunnel in lwt
  2026-06-02 15:09 ` [PATCH bpf v4 1/2] " Leon Hwang
@ 2026-06-02 15:51   ` bot+bpf-ci
  0 siblings, 0 replies; 4+ messages in thread
From: bot+bpf-ci @ 2026-06-02 15:51 UTC (permalink / raw)
  To: leon.hwang, bpf
  Cc: davem, edumazet, kuba, pabeni, horms, andrii, eddyz87, ast,
	daniel, martin.lau, memxor, song, yonghong.song, jolsa, shuah,
	gnault, leon.hwang, idosch, fmancera, posk, linux-kernel, netdev,
	linux-kselftest, kernel-patches-bot, leon.huangfu, ast, andrii,
	daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3225 bytes --]

> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> index f71ef82a5f3d..bf588f508b79 100644
> --- a/net/core/lwt_bpf.c
> +++ b/net/core/lwt_bpf.c
> @@ -612,10 +613,16 @@ int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
>  		ipv4 = true;
>  		if (unlikely(len < iph->ihl * 4))
>  			return -EINVAL;

Can iph->ihl == 0 slip past this check?  Since len is a u32, iph->ihl * 4
is 0 when iph->ihl is 0, and len < 0 is never true, so the buffer is
accepted without returning -EINVAL.

ip_fast_csum() is later called with this iph->ihl, and on some
architectures (PowerPC, MIPS) that implementation assumes ihl >= 5 and
decrements the loop counter, which can underflow when ihl is 0.

Would an explicit iph->ihl >= 5 guard be needed here, for example:

	if (unlikely(iph->ihl < 5 || len < iph->ihl * 4))
		return -EINVAL;

This was raised in review of v1 and an earlier revision added the
iph->ihl < 5 guard, but the current revision appears to have dropped it
again.
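
To make the arithmetic concrete, the following userspace sketch in C
contrasts the length-only check with the guarded variant. It is only an
illustration under the assumption that ihl is the 4-bit IPv4 header
length field and len is an unsigned 32-bit length; both function names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Length-only check, as in the quoted hunk: the header is accepted
 * whenever len >= ihl * 4. With ihl == 0 the right-hand side is 0, so
 * every unsigned len passes.
 */
static bool accepts_len_only(uint8_t ihl, uint32_t len)
{
	assert(ihl <= 15);	/* ihl is a 4-bit field */
	return len >= (uint32_t)ihl * 4;
}

/*
 * Guarded check: additionally reject ihl < 5, the minimum for a valid
 * IPv4 header (20 bytes).
 */
static bool accepts_with_ihl_guard(uint8_t ihl, uint32_t len)
{
	assert(ihl <= 15);
	return ihl >= 5 && len >= (uint32_t)ihl * 4;
}
```

In this sketch accepts_len_only(0, 0) returns true, while
accepts_with_ihl_guard(0, 0) returns false; both accept ihl = 5 with
len = 20.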

>  		is_udp_tunnel = iph->protocol == IPPROTO_UDP;
>  		if (unlikely(is_udp_tunnel && len < iph->ihl * 4 + sizeof(struct udphdr)))
>  			return -EINVAL;
>  	} else if (iph->version == 6) {
>  		ipv4 = false;
>  		if (unlikely(len < sizeof(struct ipv6hdr)))
>  			return -EINVAL;
>  		is_udp_tunnel = ((struct ipv6hdr *)iph)->nexthdr == NEXTHDR_UDP;
>  		if (unlikely(is_udp_tunnel && len < sizeof(struct ipv6hdr) + sizeof(struct udphdr)))
>  			return -EINVAL;
>  	} else {
>  		return -EINVAL;
>  	}
> @@ -637,6 +644,11 @@ int bpf_lwt_push_ip_encap(struct sk_buff *skb, void *hdr, u32 len, bool ingress)
>  	if (ingress)
>  		skb_postpush_rcsum(skb, iph, len);
>  	skb_reset_network_header(skb);
> +	if (is_udp_tunnel) {
> +		size_t iph_sz = ipv4 ? iph->ihl * 4 : sizeof(struct ipv6hdr);
> +
> +		skb_set_transport_header(skb, skb_network_offset(skb) + iph_sz);
> +	}

Can iph->ihl change between the initial length validation and this read?

The hdr pointer, and therefore iph, refers to a buffer provided by the BPF
program, which can be concurrently modified by userspace through a shared
BPF map.  iph->ihl is re-read here to compute iph_sz, and this read happens
before the memcpy() of the header into the skb:

	if (is_udp_tunnel) {
		size_t iph_sz = ipv4 ? iph->ihl * 4 : sizeof(struct ipv6hdr);

		skb_set_transport_header(skb, skb_network_offset(skb) + iph_sz);
	}
	memcpy(skb_network_header(skb), hdr, len);

If iph->ihl is concurrently changed to a larger value such as 15 after the
initial validation, can the transport header offset be set past skb->tail,
leading to out-of-bounds access when drivers later dereference the
transport header for checksum updates?

The suggestion in the v1 discussion was to move skb_set_transport_header()
after the memcpy() so it reads from the copied-in header, but the current
revision still reads iph->ihl before the memcpy().
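
One way to picture the "read from the copied-in header" idea is the
userspace sketch below. It is not kernel code and not necessarily how
the patch should be restructured: it handles IPv4 only, uses plain byte
buffers in place of the skb, and copy_then_measure() is a hypothetical
name. The point is that every decision after the memcpy() is made on
the private copy, so a later change to the source buffer cannot affect
the computed offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_IPV4_MIN_LEN 20	/* IPv4 header with ihl = 5 */
#define SKETCH_UDP_HDR_LEN  8

/*
 * Copy the caller-provided header into dst first, then validate and
 * measure the IPv4 header using only the copy. Returns the transport
 * header offset relative to the start of dst, or -1 on a malformed
 * header.
 */
static int copy_then_measure(uint8_t *dst, size_t dst_len,
			     const uint8_t *hdr, size_t len)
{
	unsigned int ihl;

	assert(dst != NULL && hdr != NULL);
	if (len < SKETCH_IPV4_MIN_LEN || len > dst_len)
		return -1;

	memcpy(dst, hdr, len);

	/* From here on, hdr is never read again. */
	if ((dst[0] >> 4) != 4)
		return -1;
	ihl = dst[0] & 0x0f;
	if (ihl < 5 || (size_t)ihl * 4 + SKETCH_UDP_HDR_LEN > len)
		return -1;

	return (int)(ihl * 4);
}
```

An alternative with the same effect would be to snapshot ihl into a
local variable during the initial validation and reuse that value when
setting the transport header.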

>  	memcpy(skb_network_header(skb), hdr, len);
>  	bpf_compute_data_pointers(skb);
>  	skb_clear_hash(skb);

---
AI reviewed your patch. Please fix the bug or reply by email explaining why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26830439032

