Netdev List
 help / color / mirror / Atom feed
* [PATCH bpf-next v6 0/4] bpf: add icmp_send kfunc
@ 2026-05-18 12:28 Mahe Tardy
  2026-05-18 12:28 ` [PATCH bpf-next v6 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 12:28 UTC (permalink / raw)
  To: bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	Mahe Tardy

Hello,

This is v6 of adding the icmp_send kfunc, as suggested during LSF/MM/BPF
2025[^1]. The goal is to allow cgroup_skb programs to actively reject
east-west traffic, similarly to what is possible to do with netfilter
reject target.

The first step to implement this is using ICMP control messages, with
the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
than a TCP RST reply and will already hint the client TCP stack to abort
the connection and not retry extensively.

Note that this is different than the sock_destroy kfunc, that along
calls tcp_abort and thus sends a reset, destroying the underlying
socket.

Caveats of this kfunc design are that a program can call this function N
times, thus send N ICMP unreach control messages and that the program
can return from the BPF filter with pass leading to a potential
confusing situation where the TCP connection was established while the
client received ICMP_DEST_UNREACH messages.

Initially this kfunc was named icmp_send_unreach but now the interface
accepts a type parameter to facilitate future extension to other ICMP
control message types. Only ICMP_DEST_UNREACH and ICMPV6_DEST_UNREACH
are currently supported.

v2 updates:
- fix a build error from a missing function call rename;
- avoid changing return line in bpf_kfunc_init;
- return SK_DROP from the kfunc (similarly to bpf_redirect);
- check the return value in the selftest.

v3 update:
- fix an undefined reference build error.

v4 updates:
- prevent the kfunc to be called recursively and add a test (thanks to
  Martin).
- do not fetch dst route when unnecessary (thanks to Martin).
- extend the test for IPv6 (thanks to Martin).
- use SK_DROP in examples and use non blocking sockets for testing
  (thanks to Martin).
- test when the kfunc returns -EINVAL (thanks to Jordan).
- add the kfunc to bpf_kfunc_set_skb as suggested by Alexei.
- guard the IPv4 parts with IS_ENABLED(CONFIG_INET).
- fix a wrong initial value for client_fd (thanks to Yonghong).
- add documentation to the kfunc.
- to Jordan: I couldn't include <linux/icmp.h> because of redefines from
  <network_helpers.h>.

v5 updates:
- kfunc name is now icmp_send and takes the control message type as
  parameter for future potential extension (daniel)
- drop the net patches to route packet since now the kfunc is limited to
  cgroup_skb and tc progs (daniel & martin)
- linearize skb headers (sashiko)
- zero SKB control block (sashiko)
- bind to port 0 instead of fixed port (sashiko)
- poll to wait for POLLERR event (sashiko)
- do not use ASSERT_EQ in CMSG_NXTHDR loop (sashiko)
- fix comment about byte order (sashiko)
- fix endianness IP address issue (sashiko)
- add forgotten cleanup_cgroup_environment (sashiko)
- let packets pass in recursion test (sashiko)
- clarify evaluation order for recursion test (sashiko)

v6 updates (all from sashiko):
- bring back the net patches to route packet since tc ingress needs it.
- rename the ip_route_reply helpers from fetch to fill.
- call pskb_network_may_pull on the cloned pkt.
- check explicitly that we received one and only one ICMP err ctrl msg.

[^1]: https://lwn.net/Articles/1022034/

Link to v5: https://lore.kernel.org/bpf/20260515194746.50920-1-mahe.tardy@gmail.com/

Mahe Tardy (6):
  net: move netfilter nf_reject_fill_skb_dst to core ipv4
  net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  bpf: add bpf_icmp_send kfunc
  selftests/bpf: add bpf_icmp_send kfunc tests
  selftests/bpf: add bpf_icmp_send kfunc IPv6 tests
  selftests/bpf: add bpf_icmp_send recursion test

 include/net/ip6_route.h                       |   2 +
 include/net/route.h                           |   1 +
 net/core/filter.c                             | 118 ++++++++++
 net/ipv4/netfilter/nf_reject_ipv4.c           |  19 +-
 net/ipv4/route.c                              |  15 ++
 net/ipv6/netfilter/nf_reject_ipv6.c           |  17 +-
 net/ipv6/route.c                              |  18 ++
 .../bpf/prog_tests/icmp_send_kfunc.c          | 218 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/icmp_send.c |  99 ++++++++
 9 files changed, 474 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send.c

--
2.34.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH bpf-next v6 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4
  2026-05-18 12:28 [PATCH bpf-next v6 0/4] bpf: add icmp_send kfunc Mahe Tardy
@ 2026-05-18 12:28 ` Mahe Tardy
  2026-05-18 13:07   ` bot+bpf-ci
  2026-05-18 12:28 ` [PATCH bpf-next v6 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 12:28 UTC (permalink / raw)
  To: bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	Mahe Tardy

Move and rename nf_reject_fill_skb_dst from
ipv4/netfilter/nf_reject_ipv4 to ip_route_reply_fill_dst in ipv4/route.c
so that it can be reused in the following patches by BPF kfuncs.

Netfilter uses nf_ip_route that is almost a transparent wrapper around
ip_route_output_key so this patch inlines it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/route.h                 |  1 +
 net/ipv4/netfilter/nf_reject_ipv4.c | 19 ++-----------------
 net/ipv4/route.c                    | 15 +++++++++++++++
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index f90106f383c5..300d292cd9a1 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    const struct sock *sk);
 struct dst_entry *ipv4_blackhole_route(struct net *net,
 				       struct dst_entry *dst_orig);
+int ip_route_reply_fill_dst(struct sk_buff *skb);

 static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
 {
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index fecf6621f679..c1c0724e4d4d 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -252,21 +252,6 @@ static void nf_reject_ip_tcphdr_put(struct sk_buff *nskb, const struct sk_buff *
 	nskb->csum_offset = offsetof(struct tcphdr, check);
 }

-static int nf_reject_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip4.daddr = ip_hdr(skb_in)->saddr;
-	nf_ip_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 /* Send RST reply */
 void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		   int hook)
@@ -279,7 +264,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 	if (!oth)
 		return;

-	if (!skb_dst(oldskb) && nf_reject_fill_skb_dst(oldskb) < 0)
+	if (!skb_dst(oldskb) && ip_route_reply_fill_dst(oldskb) < 0)
 		return;

 	if (skb_rtable(oldskb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -352,7 +337,7 @@ void nf_send_unreach(struct sk_buff *skb_in, int code, int hook)
 	if (iph->frag_off & htons(IP_OFFSET))
 		return;

-	if (!skb_dst(skb_in) && nf_reject_fill_skb_dst(skb_in) < 0)
+	if (!skb_dst(skb_in) && ip_route_reply_fill_dst(skb_in) < 0)
 		return;

 	if (skb_csum_unnecessary(skb_in) ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index bc1296f0ea69..1f031c5ef554 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2945,6 +2945,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);

+int ip_route_reply_fill_dst(struct sk_buff *skb)
+{
+	struct rtable *rt;
+	struct flowi4 fl4 = {
+		.daddr = ip_hdr(skb)->saddr
+	};
+
+	rt = ip_route_output_key(dev_net(skb->dev), &fl4);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+	skb_dst_set(skb, &rt->dst);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ip_route_reply_fill_dst);
+
 /* called with rcu_read_lock held */
 static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 			struct rtable *rt, u32 table_id, dscp_t dscp,
--
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next v6 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  2026-05-18 12:28 [PATCH bpf-next v6 0/4] bpf: add icmp_send kfunc Mahe Tardy
  2026-05-18 12:28 ` [PATCH bpf-next v6 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2026-05-18 12:28 ` Mahe Tardy
  2026-05-18 13:07   ` bot+bpf-ci
  2026-05-18 12:28 ` [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc Mahe Tardy
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 12:28 UTC (permalink / raw)
  To: bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	Mahe Tardy

Move and rename nf_reject6_fill_skb_dst from
ipv6/netfilter/nf_reject_ipv6 to ip6_route_reply_fill_dst in
ipv6/route.c so that it can be reused in the following patches by BPF
kfuncs.

Netfilter uses nf_ip6_route that is almost a transparent wrapper around
ip6_route_output so this patch inlines it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/ip6_route.h             |  2 ++
 net/ipv6/netfilter/nf_reject_ipv6.c | 17 +----------------
 net/ipv6/route.c                    | 18 ++++++++++++++++++
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 09ffe0f13ce7..eb5a60d3babe 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -100,6 +100,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
 	return ip6_route_output_flags(net, sk, fl6, 0);
 }

+int ip6_route_reply_fill_dst(struct sk_buff *skb);
+
 /* Only conditionally release dst if flags indicates
  * !RT6_LOOKUP_F_DST_NOREF or dst is in uncached_list.
  */
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index ef5b7e85cffa..7d2f577e72b8 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -293,21 +293,6 @@ nf_reject_ip6_tcphdr_put(struct sk_buff *nskb,
 						   sizeof(struct tcphdr), 0));
 }

-static int nf_reject6_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip6.daddr = ipv6_hdr(skb_in)->saddr;
-	nf_ip6_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		    int hook)
 {
@@ -440,7 +425,7 @@ void nf_send_unreach6(struct net *net, struct sk_buff *skb_in,
 	if (hooknum == NF_INET_LOCAL_OUT && skb_in->dev == NULL)
 		skb_in->dev = net->loopback_dev;

-	if (!skb_dst(skb_in) && nf_reject6_fill_skb_dst(skb_in) < 0)
+	if (!skb_dst(skb_in) && ip6_route_reply_fill_dst(skb_in) < 0)
 		return;

 	icmpv6_send(skb_in, ICMPV6_DEST_UNREACH, code, 0);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index e3d355d1fbd6..37a7627a94de 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2725,6 +2725,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
 }
 EXPORT_SYMBOL_GPL(ip6_route_output_flags);

+int ip6_route_reply_fill_dst(struct sk_buff *skb)
+{
+	struct dst_entry *result;
+	struct flowi6 fl = {
+		.daddr = ipv6_hdr(skb)->saddr
+	};
+	int err;
+
+	result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
+	err = result->error;
+	if (err)
+		dst_release(result);
+	else
+		skb_dst_set(skb, result);
+	return err;
+}
+EXPORT_SYMBOL_GPL(ip6_route_reply_fill_dst);
+
 struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_orig)
 {
 	struct rt6_info *rt, *ort = dst_rt6_info(dst_orig);
--
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc
  2026-05-18 12:28 [PATCH bpf-next v6 0/4] bpf: add icmp_send kfunc Mahe Tardy
  2026-05-18 12:28 ` [PATCH bpf-next v6 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
  2026-05-18 12:28 ` [PATCH bpf-next v6 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2026-05-18 12:28 ` Mahe Tardy
  2026-05-18 13:34   ` bot+bpf-ci
                     ` (2 more replies)
  2026-05-18 12:28 ` [PATCH bpf-next v6 4/6] selftests/bpf: add bpf_icmp_send kfunc tests Mahe Tardy
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 12:28 UTC (permalink / raw)
  To: bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	Mahe Tardy

This is needed in the context of Tetragon to provide improved feedback
(in contrast to just dropping packets) to east-west traffic when blocked
by policies using cgroup_skb programs. We also extend this kfunc to tc
program as a convenience.

This reuses concepts from netfilter reject target codepath with the
differences that:
* Packets are cloned since the BPF user can still let the packet pass
  (SK_PASS from the cgroup_skb progs for example) and the current skb
  need to stay untouched (cgroup_skb hooks only allow read-only skb
  payload).
* We protect against recursion since the kfunc, by generating an ICMP
  error message, could retrigger the BPF prog that invoked it.

For now, we support cgroup_skb and tc program types. For cgroup_skb and
tc egress, almost everything should be good. However for tc ingress:
- packet will not be routed yet: need to set the net device for
  icmp_send, thus the call to ip[6]_route_reply_fill_dst.
- fragments could trigger hook: icmp_send will only reply to fragment 0.
- ensure the ip headers is linearized before processing, and zero out
  the SKB control block after cloning to prevent icmp_send()/icmpv6_send()
  from misinterpreting garbage data as IP options.

Only ICMP_DEST_UNREACH and ICMPV6_DEST_UNREACH are currently supported.
The interface accepts a type parameter to facilitate future extension to
other ICMP control message types.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 net/core/filter.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 118 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 9590877b0714..843fa775596b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -84,6 +84,8 @@
 #include <linux/un.h>
 #include <net/xdp_sock_drv.h>
 #include <net/inet_dscp.h>
+#include <linux/icmpv6.h>
+#include <net/icmp.h>

 #include "dev.h"

@@ -12464,6 +12466,110 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
 	return 0;
 }

+static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
+
+/**
+ * bpf_icmp_send - Send an ICMP control message
+ * @skb_ctx: Packet that triggered the control message
+ * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
+ * @code: ICMP code (0-15 for IPv4, 0-6 for IPv6)
+ *
+ * Sends an ICMP control message in response to the packet. The original packet
+ * is cloned before sending the ICMP message, so the BPF program can still let
+ * the packet pass if desired.
+ *
+ * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
+ * supported.
+ *
+ * Recursion protection: If called from a context that would trigger recursion
+ * (e.g., root cgroup processing its own ICMP packets), returns -EBUSY on
+ * re-entry.
+ *
+ * Return: 0 on success, negative error code on failure:
+ *         -EINVAL: Invalid code parameter
+ *         -EBADMSG: Packet too short or malformed
+ *         -ENOMEM: Memory allocation failed
+ *         -EBUSY: Recursion detected
+ *         -EHOSTUNREACH: Routing failed
+ *         -EPROTONOSUPPORT: Non-IP protocol
+ *         -EOPNOTSUPP: Unsupported ICMP type
+ */
+__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
+{
+	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
+	struct sk_buff *nskb;
+	bool *in_progress;
+
+	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
+	if (*in_progress)
+		return -EBUSY;
+
+	switch (skb->protocol) {
+#if IS_ENABLED(CONFIG_INET)
+	case htons(ETH_P_IP):
+		if (type != ICMP_DEST_UNREACH)
+			return -EOPNOTSUPP;
+		if (code < 0 || code > NR_ICMP_UNREACH)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (!pskb_network_may_pull(nskb, sizeof(struct iphdr))) {
+			kfree_skb(nskb);
+			return -EBADMSG;
+		}
+
+		if (!skb_dst(nskb) && ip_route_reply_fill_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		memset(IPCB(nskb), 0, sizeof(struct inet_skb_parm));
+
+		*in_progress = true;
+		icmp_send(nskb, type, code, 0);
+		*in_progress = false;
+		kfree_skb(nskb);
+		break;
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+	case htons(ETH_P_IPV6):
+		if (type != ICMPV6_DEST_UNREACH)
+			return -EOPNOTSUPP;
+		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (!pskb_network_may_pull(nskb, sizeof(struct ipv6hdr))) {
+			kfree_skb(nskb);
+			return -EBADMSG;
+		}
+
+		if (!skb_dst(nskb) && ip6_route_reply_fill_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		memset(IP6CB(nskb), 0, sizeof(struct inet6_skb_parm));
+
+		*in_progress = true;
+		icmpv6_send(nskb, type, code, 0);
+		*in_progress = false;
+		kfree_skb(nskb);
+		break;
+#endif
+	default:
+		return -EPROTONOSUPPORT;
+	}
+
+	return 0;
+}
+
 __bpf_kfunc_end_defs();

 int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12506,6 +12612,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
 BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
 BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)

+BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send)
+BTF_ID_FLAGS(func, bpf_icmp_send)
+BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send)
+
 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
 	.owner = THIS_MODULE,
 	.set = &bpf_kfunc_check_set_skb,
@@ -12536,6 +12646,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
 	.set = &bpf_kfunc_check_set_sock_ops,
 };

+static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_icmp_send,
+};
+
 static int __init bpf_kfunc_init(void)
 {
 	int ret;
@@ -12557,6 +12672,9 @@ static int __init bpf_kfunc_init(void)
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 					       &bpf_kfunc_set_sock_addr);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_icmp_send);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &bpf_kfunc_set_icmp_send);
 	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
 }
 late_initcall(bpf_kfunc_init);
--
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next v6 4/6] selftests/bpf: add bpf_icmp_send kfunc tests
  2026-05-18 12:28 [PATCH bpf-next v6 0/4] bpf: add icmp_send kfunc Mahe Tardy
                   ` (2 preceding siblings ...)
  2026-05-18 12:28 ` [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc Mahe Tardy
@ 2026-05-18 12:28 ` Mahe Tardy
  2026-05-19  1:34   ` Jordan Rife
  2026-05-18 12:28 ` [PATCH bpf-next v6 5/6] selftests/bpf: add bpf_icmp_send kfunc IPv6 tests Mahe Tardy
  2026-05-18 12:28 ` [PATCH bpf-next v6 6/6] selftests/bpf: add bpf_icmp_send recursion test Mahe Tardy
  5 siblings, 1 reply; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 12:28 UTC (permalink / raw)
  To: bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	Mahe Tardy

This test opens a server and client, enters a new cgroup, attach a
cgroup_skb program on egress and calls the bpf_icmp_send function from
the client egress so that an ICMP unreach control message is sent back
to the client. It then fetches the message from the error queue to
confirm the correct ICMP unreach code has been sent.

Note that, for the client, we have to connect in non-blocking mode to
let the test execute faster. Otherwise, we need to wait for the TCP
three-way handshake to timeout in the kernel before reading the errno.

Also note that we don't set IP_RECVERR on the socket in
connect_to_fd_nonblock since the error will be transferred anyway in our
test because the connection is rejected at the beginning of the TCP
handshake. See in net/ipv4/tcp_ipv4.c:tcp_v4_err for more details.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 .../bpf/prog_tests/icmp_send_kfunc.c          | 152 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/icmp_send.c |  38 +++++
 2 files changed, 190 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send.c

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
new file mode 100644
index 000000000000..4f0aed8152d3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -0,0 +1,152 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <linux/errqueue.h>
+#include <poll.h>
+#include "icmp_send.skel.h"
+
+#define TIMEOUT_MS 1000
+
+#define ICMP_DEST_UNREACH 3
+
+#define ICMP_FRAG_NEEDED 4
+#define NR_ICMP_UNREACH 15
+
+static int connect_to_fd_nonblock(int server_fd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len = sizeof(addr);
+	int fd, err;
+
+	if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
+		return -1;
+
+	fd = socket(addr.ss_family, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	if (fd < 0)
+		return -1;
+
+	err = connect(fd, (struct sockaddr *)&addr, len);
+	if (err < 0 && errno != EINPROGRESS) {
+		close(fd);
+		return -1;
+	}
+
+	return fd;
+}
+
+static void read_icmp_errqueue(int sockfd, int expected_code)
+{
+	ssize_t n;
+	struct sock_extended_err *sock_err;
+	struct cmsghdr *cm;
+	char ctrl_buf[512];
+	struct msghdr msg = {
+		.msg_control = ctrl_buf,
+		.msg_controllen = sizeof(ctrl_buf),
+	};
+	struct pollfd pfd = {
+		.fd = sockfd,
+		.events = POLLERR,
+	};
+
+	if (!ASSERT_GE(poll(&pfd, 1, TIMEOUT_MS), 1, "poll_errqueue"))
+		return;
+
+	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
+	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
+		return;
+
+	cm = CMSG_FIRSTHDR(&msg);
+	if (!ASSERT_NEQ(cm, NULL, "cm_firsthdr_null"))
+		return;
+
+	for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if (cm->cmsg_level != IPPROTO_IP ||
+		    cm->cmsg_type != IP_RECVERR)
+			continue;
+
+		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
+
+		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
+			       "sock_err_origin_icmp"))
+			return;
+		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+			       "sock_err_type_dest_unreach"))
+			return;
+		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
+		return;
+	}
+
+	ASSERT_FAIL("no IP_RECVERR/IPV6_RECVERR control message found");
+}
+
+static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel,
+					    int code)
+{
+	int srv_fd = -1, client_fd = -1;
+	struct sockaddr_in addr;
+	socklen_t len = sizeof(addr);
+
+	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0,
+			      TIMEOUT_MS);
+	if (!ASSERT_GE(srv_fd, 0, "start_server"))
+		return;
+
+	if (getsockname(srv_fd, (struct sockaddr *)&addr, &len)) {
+		close(srv_fd);
+		return;
+	}
+	skel->bss->server_port = ntohs(addr.sin_port);
+	skel->bss->unreach_code = code;
+
+	client_fd = connect_to_fd_nonblock(srv_fd);
+	if (!ASSERT_GE(client_fd, 0, "client_connect_nonblock")) {
+		close(srv_fd);
+		return;
+	}
+
+	/* Skip reading ICMP error queue if code is invalid */
+	if (code >= 0 && code <= NR_ICMP_UNREACH)
+		read_icmp_errqueue(client_fd, code);
+
+	close(client_fd);
+	close(srv_fd);
+}
+
+void test_icmp_send_unreach(void)
+{
+	struct icmp_send *skel;
+	int cgroup_fd = -1;
+
+	skel = icmp_send__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		goto cleanup;
+
+	skel->links.egress =
+		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
+		goto cleanup;
+
+	for (int code = 0; code <= NR_ICMP_UNREACH; code++) {
+		/* The TCP stack reacts differently when asking for
+		 * fragmentation, let's ignore it for now.
+		 */
+		if (code == ICMP_FRAG_NEEDED)
+			continue;
+
+		trigger_prog_read_icmp_errqueue(skel, code);
+		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
+	}
+
+	/* Test an invalid code */
+	trigger_prog_read_icmp_errqueue(skel, -1);
+	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+
+cleanup:
+	icmp_send__destroy(skel);
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
new file mode 100644
index 000000000000..6d0be0a9afe1
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/icmp_send.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+/* 127.0.0.1 in host byte order */
+#define SERVER_IP 0x7F000001
+
+#define ICMP_DEST_UNREACH 3
+
+__u16 server_port = 0;
+int unreach_code = 0;
+int kfunc_ret = -1;
+
+SEC("cgroup_skb/egress")
+int egress(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph;
+	struct tcphdr *tcph;
+
+	iph = data;
+	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
+	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+		return SK_PASS;
+
+	tcph = (void *)iph + iph->ihl * 4;
+	if ((void *)(tcph + 1) > data_end ||
+	    tcph->dest != bpf_htons(server_port))
+		return SK_PASS;
+
+	kfunc_ret = bpf_icmp_send(skb, ICMP_DEST_UNREACH, unreach_code);
+
+	return SK_DROP;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next v6 5/6] selftests/bpf: add bpf_icmp_send kfunc IPv6 tests
  2026-05-18 12:28 [PATCH bpf-next v6 0/4] bpf: add icmp_send kfunc Mahe Tardy
                   ` (3 preceding siblings ...)
  2026-05-18 12:28 ` [PATCH bpf-next v6 4/6] selftests/bpf: add bpf_icmp_send kfunc tests Mahe Tardy
@ 2026-05-18 12:28 ` Mahe Tardy
  2026-05-18 13:21   ` bot+bpf-ci
  2026-05-18 12:28 ` [PATCH bpf-next v6 6/6] selftests/bpf: add bpf_icmp_send recursion test Mahe Tardy
  5 siblings, 1 reply; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 12:28 UTC (permalink / raw)
  To: bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	Mahe Tardy

This test extends the existing tests with IPv6 support.

Note that we need to set IPV6_RECVERR on the socket for IPv6 in
connect_to_fd_nonblock otherwise the error will be ignored even if we
are in the middle of the TCP handshake. See in
net/ipv6/datagram.c:ipv6_icmp_error for more details.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 .../bpf/prog_tests/icmp_send_kfunc.c          | 76 +++++++++++++------
 tools/testing/selftests/bpf/progs/icmp_send.c | 48 +++++++++---
 2 files changed, 90 insertions(+), 34 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
index 4f0aed8152d3..d0ac0502f6df 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -8,15 +8,17 @@
 #define TIMEOUT_MS 1000

 #define ICMP_DEST_UNREACH 3
+#define ICMPV6_DEST_UNREACH 1

 #define ICMP_FRAG_NEEDED 4
 #define NR_ICMP_UNREACH 15
+#define ICMPV6_REJECT_ROUTE 6

 static int connect_to_fd_nonblock(int server_fd)
 {
 	struct sockaddr_storage addr;
 	socklen_t len = sizeof(addr);
-	int fd, err;
+	int fd, err, on = 1;

 	if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
 		return -1;
@@ -25,6 +27,12 @@ static int connect_to_fd_nonblock(int server_fd)
 	if (fd < 0)
 		return -1;

+	if (addr.ss_family == AF_INET6 &&
+	    setsockopt(fd, IPPROTO_IPV6, IPV6_RECVERR, &on, sizeof(on)) < 0) {
+		close(fd);
+		return -1;
+	}
+
 	err = connect(fd, (struct sockaddr *)&addr, len);
 	if (err < 0 && errno != EINPROGRESS) {
 		close(fd);
@@ -34,7 +42,7 @@ static int connect_to_fd_nonblock(int server_fd)
 	return fd;
 }

-static void read_icmp_errqueue(int sockfd, int expected_code)
+static void read_icmp_errqueue(int sockfd, int expected_code, int af)
 {
 	ssize_t n;
 	struct sock_extended_err *sock_err;
@@ -44,6 +52,12 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
 		.msg_control = ctrl_buf,
 		.msg_controllen = sizeof(ctrl_buf),
 	};
+	int expected_level = (af == AF_INET) ? IPPROTO_IP : IPPROTO_IPV6;
+	int expected_type = (af == AF_INET) ? IP_RECVERR : IPV6_RECVERR;
+	int expected_origin = (af == AF_INET) ? SO_EE_ORIGIN_ICMP :
+						SO_EE_ORIGIN_ICMP6;
+	int expected_ee_type = (af == AF_INET) ? ICMP_DEST_UNREACH :
+						 ICMPV6_DEST_UNREACH;
 	struct pollfd pfd = {
 		.fd = sockfd,
 		.events = POLLERR,
@@ -61,16 +75,16 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
 		return;

 	for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
-		if (cm->cmsg_level != IPPROTO_IP ||
-		    cm->cmsg_type != IP_RECVERR)
+		if (cm->cmsg_level != expected_level ||
+		    cm->cmsg_type != expected_type)
 			continue;

 		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);

-		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
-			       "sock_err_origin_icmp"))
+		if (!ASSERT_EQ(sock_err->ee_origin, expected_origin,
+			       "sock_err_origin"))
 			return;
-		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+		if (!ASSERT_EQ(sock_err->ee_type, expected_ee_type,
 			       "sock_err_type_dest_unreach"))
 			return;
 		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
@@ -81,14 +95,13 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
 }

 static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel,
-					    int code)
+					    int code, int af, const char *ip)
 {
 	int srv_fd = -1, client_fd = -1;
 	struct sockaddr_in addr;
 	socklen_t len = sizeof(addr);

-	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0,
-			      TIMEOUT_MS);
+	srv_fd = start_server(af, SOCK_STREAM, ip, 0, TIMEOUT_MS);
 	if (!ASSERT_GE(srv_fd, 0, "start_server"))
 		return;

@@ -97,6 +110,8 @@ static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel,
 		return;
 	}
 	skel->bss->server_port = ntohs(addr.sin_port);
+	skel->bss->unreach_type = (af == AF_INET) ? ICMP_DEST_UNREACH :
+						    ICMPV6_DEST_UNREACH;
 	skel->bss->unreach_code = code;

 	client_fd = connect_to_fd_nonblock(srv_fd);
@@ -106,13 +121,33 @@ static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel,
 	}

 	/* Skip reading ICMP error queue if code is invalid */
-	if (code >= 0 && code <= NR_ICMP_UNREACH)
-		read_icmp_errqueue(client_fd, code);
+	if (code >= 0 && ((af == AF_INET && code <= NR_ICMP_UNREACH) ||
+			   (af == AF_INET6 && code <= ICMPV6_REJECT_ROUTE)))
+		read_icmp_errqueue(client_fd, code, af);

 	close(client_fd);
 	close(srv_fd);
 }

+static void run_icmp_test(struct icmp_send *skel, int af,
+			  const char *ip, int max_code)
+{
+	for (int code = 0; code <= max_code; code++) {
+		/* The TCP stack reacts differently when asking for
+		 * fragmentation, let's ignore it for now.
+		 */
+		if (af == AF_INET && code == ICMP_FRAG_NEEDED)
+			continue;
+
+		trigger_prog_read_icmp_errqueue(skel, code, af, ip);
+		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
+	}
+
+	/* Test an invalid code */
+	trigger_prog_read_icmp_errqueue(skel, -1, af, ip);
+	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+}
+
 void test_icmp_send_unreach(void)
 {
 	struct icmp_send *skel;
@@ -131,20 +166,11 @@ void test_icmp_send_unreach(void)
 	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
 		goto cleanup;

-	for (int code = 0; code <= NR_ICMP_UNREACH; code++) {
-		/* The TCP stack reacts differently when asking for
-		 * fragmentation, let's ignore it for now.
-		 */
-		if (code == ICMP_FRAG_NEEDED)
-			continue;
-
-		trigger_prog_read_icmp_errqueue(skel, code);
-		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
-	}
+	if (test__start_subtest("ipv4"))
+		run_icmp_test(skel, AF_INET, "127.0.0.1", NR_ICMP_UNREACH);

-	/* Test an invalid code */
-	trigger_prog_read_icmp_errqueue(skel, -1);
-	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+	if (test__start_subtest("ipv6"))
+		run_icmp_test(skel, AF_INET6, "::1", ICMPV6_REJECT_ROUTE);

 cleanup:
 	icmp_send__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
index 6d0be0a9afe1..6e1ba539eeb0 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send.c
@@ -5,10 +5,11 @@

 /* 127.0.0.1 in host byte order */
 #define SERVER_IP 0x7F000001
-
-#define ICMP_DEST_UNREACH 3
+/* ::1 in host byte order (last 32-bit word) */
+#define SERVER_IP6_LO 0x00000001

 __u16 server_port = 0;
+int unreach_type = 0;
 int unreach_code = 0;
 int kfunc_ret = -1;

@@ -18,19 +19,48 @@ int egress(struct __sk_buff *skb)
 	void *data = (void *)(long)skb->data;
 	void *data_end = (void *)(long)skb->data_end;
 	struct iphdr *iph;
+	struct ipv6hdr *ip6h;
 	struct tcphdr *tcph;
+	__u8 version;

-	iph = data;
-	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
-	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+	if (data + 1 > data_end)
 		return SK_PASS;

-	tcph = (void *)iph + iph->ihl * 4;
-	if ((void *)(tcph + 1) > data_end ||
-	    tcph->dest != bpf_htons(server_port))
+	version = (*((__u8 *)data)) >> 4;
+
+	if (version == 4) {
+		iph = data;
+		if ((void *)(iph + 1) > data_end ||
+		    iph->protocol != IPPROTO_TCP ||
+		    iph->daddr != bpf_htonl(SERVER_IP))
+			return SK_PASS;
+
+		tcph = (void *)iph + iph->ihl * 4;
+		if ((void *)(tcph + 1) > data_end ||
+		    tcph->dest != bpf_htons(server_port))
+			return SK_PASS;
+
+	} else if (version == 6) {
+		ip6h = data;
+		if ((void *)(ip6h + 1) > data_end ||
+		    ip6h->nexthdr != IPPROTO_TCP)
+			return SK_PASS;
+
+		if (ip6h->daddr.in6_u.u6_addr32[0] != 0 ||
+		    ip6h->daddr.in6_u.u6_addr32[1] != 0 ||
+		    ip6h->daddr.in6_u.u6_addr32[2] != 0 ||
+		    ip6h->daddr.in6_u.u6_addr32[3] != bpf_htonl(SERVER_IP6_LO))
+			return SK_PASS;
+
+		tcph = (void *)(ip6h + 1);
+		if ((void *)(tcph + 1) > data_end ||
+		    tcph->dest != bpf_htons(server_port))
+			return SK_PASS;
+	} else {
 		return SK_PASS;
+	}

-	kfunc_ret = bpf_icmp_send(skb, ICMP_DEST_UNREACH, unreach_code);
+	kfunc_ret = bpf_icmp_send(skb, unreach_type, unreach_code);

 	return SK_DROP;
 }
--
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH bpf-next v6 6/6] selftests/bpf: add bpf_icmp_send recursion test
  2026-05-18 12:28 [PATCH bpf-next v6 0/4] bpf: add icmp_send kfunc Mahe Tardy
                   ` (4 preceding siblings ...)
  2026-05-18 12:28 ` [PATCH bpf-next v6 5/6] selftests/bpf: add bpf_icmp_send kfunc IPv6 tests Mahe Tardy
@ 2026-05-18 12:28 ` Mahe Tardy
  2026-05-18 13:07   ` bot+bpf-ci
  5 siblings, 1 reply; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 12:28 UTC (permalink / raw)
  To: bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	Mahe Tardy

This test is similar to test_icmp_send_unreach but checks that, in case
of recursion, meaning that the BPF program calling the kfunc was
re-triggered by the icmp_send done by the kfunc, the kfunc will stop
early and return -EBUSY.

The test attaches to the root cgroup to ensure the ICMP packet generated
by the kfunc re-triggers the BPF program. Since it's attached only for
this recursion test, it should not disrupt the whole network.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 .../bpf/prog_tests/icmp_send_kfunc.c          | 40 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/icmp_send.c | 31 ++++++++++++++
 2 files changed, 71 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
index d0ac0502f6df..a9e9806877cf 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <test_progs.h>
 #include <network_helpers.h>
+#include <cgroup_helpers.h>
 #include <linux/errqueue.h>
 #include <poll.h>
 #include "icmp_send.skel.h"
@@ -10,6 +11,7 @@
 #define ICMP_DEST_UNREACH 3
 #define ICMPV6_DEST_UNREACH 1

+#define ICMP_HOST_UNREACH 1
 #define ICMP_FRAG_NEEDED 4
 #define NR_ICMP_UNREACH 15
 #define ICMPV6_REJECT_ROUTE 6
@@ -176,3 +178,41 @@ void test_icmp_send_unreach(void)
 	icmp_send__destroy(skel);
 	close(cgroup_fd);
 }
+
+void test_icmp_send_unreach_recursion(void)
+{
+	struct icmp_send *skel;
+	int cgroup_fd = -1;
+
+	skel = icmp_send__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	if (setup_cgroup_environment()) {
+		fprintf(stderr, "Failed to setup cgroup environment\n");
+		goto cleanup;
+	}
+
+	cgroup_fd = get_root_cgroup();
+	if (!ASSERT_GE(cgroup_fd, 0, "get_root_cgroup"))
+		goto cleanup;
+
+	skel->links.recursion =
+		bpf_program__attach_cgroup(skel->progs.recursion, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.recursion, "prog_attach_cgroup"))
+		goto cleanup;
+
+	trigger_prog_read_icmp_errqueue(skel, ICMP_HOST_UNREACH, AF_INET, "127.0.0.1");
+
+	/* Because there's recursion involved, the first call will return at
+	 * index 1 since it will return the second, and the second call will
+	 * return at index 0 since it will return the first.
+	 */
+	ASSERT_EQ(skel->data->rec_kfunc_rets[0], -EBUSY, "kfunc_rets[0]");
+	ASSERT_EQ(skel->data->rec_kfunc_rets[1], 0, "kfunc_rets[1]");
+
+cleanup:
+	cleanup_cgroup_environment();
+	icmp_send__destroy(skel);
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
index 6e1ba539eeb0..7830334b747a 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send.c
@@ -13,6 +13,9 @@ int unreach_type = 0;
 int unreach_code = 0;
 int kfunc_ret = -1;

+unsigned int rec_count = 0;
+int rec_kfunc_rets[] = { -1, -1 };
+
 SEC("cgroup_skb/egress")
 int egress(struct __sk_buff *skb)
 {
@@ -65,4 +68,32 @@ int egress(struct __sk_buff *skb)
 	return SK_DROP;
 }

+SEC("cgroup_skb/egress")
+int recursion(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph;
+	int ret;
+
+	iph = data;
+	if ((void *)(iph + 1) > data_end || iph->version != 4)
+		return SK_PASS;
+
+	/* This call will provoke a recursion: the ICMP packet generated by the
+	 * kfunc will re-trigger this program since we are in the root cgroup in
+	 * which the kernel ICMP socket belongs. However when re-entering the
+	 * kfunc, it should return EBUSY.
+	 */
+	ret = bpf_icmp_send(skb, unreach_type, unreach_code);
+	rec_kfunc_rets[rec_count & 1] = ret;
+	__sync_fetch_and_add(&rec_count, 1);
+
+	/* Let the first ICMP error message pass */
+	if (iph->protocol == IPPROTO_ICMP)
+		return SK_PASS;
+
+	return SK_DROP;
+}
+
 char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 6/6] selftests/bpf: add bpf_icmp_send recursion test
  2026-05-18 12:28 ` [PATCH bpf-next v6 6/6] selftests/bpf: add bpf_icmp_send recursion test Mahe Tardy
@ 2026-05-18 13:07   ` bot+bpf-ci
  2026-05-18 14:39     ` Mahe Tardy
  0 siblings, 1 reply; 22+ messages in thread
From: bot+bpf-ci @ 2026-05-18 13:07 UTC (permalink / raw)
  To: mahe.tardy, bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	mahe.tardy, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3350 bytes --]

> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> index d0ac0502f6df..a9e9806877cf 100644
> --- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
> index 6e1ba539eeb0..7830334b747a 100644
> --- a/tools/testing/selftests/bpf/progs/icmp_send.c
> +++ b/tools/testing/selftests/bpf/progs/icmp_send.c
> @@ -13,6 +13,9 @@ int unreach_type = 0;
>  int unreach_code = 0;
>  int kfunc_ret = -1;
>
> +unsigned int rec_count = 0;
> +int rec_kfunc_rets[] = { -1, -1 };
> +
>  SEC("cgroup_skb/egress")
>  int egress(struct __sk_buff *skb)
>  {
> @@ -65,4 +68,32 @@ int egress(struct __sk_buff *skb)
>  	return SK_DROP;
>  }
>
> +SEC("cgroup_skb/egress")
> +int recursion(struct __sk_buff *skb)
> +{
> +	void *data = (void *)(long)skb->data;
> +	void *data_end = (void *)(long)skb->data_end;
> +	struct iphdr *iph;
> +	int ret;
> +
> +	iph = data;
> +	if ((void *)(iph + 1) > data_end || iph->version != 4)
> +		return SK_PASS;
> +
> +	/* This call will provoke a recursion: the ICMP packet generated by the
> +	 * kfunc will re-trigger this program since we are in the root cgroup in
> +	 * which the kernel ICMP socket belongs. However when re-entering the
> +	 * kfunc, it should return EBUSY.
> +	 */
> +	ret = bpf_icmp_send(skb, unreach_type, unreach_code);
> +	rec_kfunc_rets[rec_count & 1] = ret;

Will this program affect background traffic when running on the system?

The recursion program is attached to the root cgroup and intercepts all
IPv4 traffic. While v5 changed SK_DROP to SK_PASS to avoid blocking
traffic, the program still calls bpf_icmp_send() on every IPv4 packet
without filtering for test-specific traffic.

This means background IPv4 traffic (SSH, DNS, other tests) will trigger
ICMP Destination Unreachable messages, potentially causing network
disruption and test races on rec_kfunc_rets and rec_count.

The egress program demonstrates proper filtering by checking for TCP,
SERVER_IP, and server_port:

SEC("cgroup_skb/egress")
int egress(struct __sk_buff *skb)
{
	void *data = (void *)(long)skb->data;
	void *data_end = (void *)(long)skb->data_end;
	struct iphdr *iph;
	struct ipv6hdr *ip6h;
	struct tcphdr *tcph;

	iph = data;
	if ((void *)(iph + 1) > data_end || iph->version != 4)
		return SK_PASS;

	if (iph->protocol != IPPROTO_TCP)
		return SK_PASS;

	if (bpf_ntohs(iph->daddr) != SERVER_IP)
		return SK_PASS;

	tcph = (struct tcphdr *)((void *)iph + sizeof(*iph));
	if ((void *)(tcph + 1) > data_end)
		return SK_PASS;

	if (tcph->dest != server_port)
		return SK_PASS;
	...
}

Could the recursion program use similar filtering to explicitly check for
the test's traffic?

> +	__sync_fetch_and_add(&rec_count, 1);
> +
> +	/* Let the first ICMP error message pass */
> +	if (iph->protocol == IPPROTO_ICMP)
> +		return SK_PASS;
> +
> +	return SK_DROP;
> +}
> +
>  char LICENSE[] SEC("license") = "Dual BSD/GPL";


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26034287312

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4
  2026-05-18 12:28 ` [PATCH bpf-next v6 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2026-05-18 13:07   ` bot+bpf-ci
  2026-05-18 14:21     ` Mahe Tardy
  0 siblings, 1 reply; 22+ messages in thread
From: bot+bpf-ci @ 2026-05-18 13:07 UTC (permalink / raw)
  To: mahe.tardy, bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	mahe.tardy, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 1854 bytes --]

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index bc1296f0ea69..1f031c5ef554 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2945,6 +2945,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
>  }
>  EXPORT_SYMBOL_GPL(ip_route_output_flow);
>
> +int ip_route_reply_fill_dst(struct sk_buff *skb)
> +{
> +	struct rtable *rt;
> +	struct flowi4 fl4 = {
> +		.daddr = ip_hdr(skb)->saddr
> +	};

Can this lead to an out-of-bounds read?

Now that this is exported for BPF kfuncs, it can be reached from early
hooks like TC ingress where the IPv4 header might not be linearized yet.

If a malformed packet contains less than sizeof(struct iphdr) bytes of
linear data, does accessing ->saddr require validation via pskb_may_pull()
or similar?

This concern was raised by sashiko-bot@kernel.org in v4 review:
https://lore.kernel.org/bpf/20260421111320.D97D9C2BCB0@smtp.kernel.org/

> +
> +	rt = ip_route_output_key(dev_net(skb->dev), &fl4);

Can skb->dev be NULL here?

While safe in its original netfilter context, a BPF_PROG_TYPE_NETFILTER
program attached to the NF_INET_LOCAL_OUT hook will receive locally
generated SKBs before skb->dev is assigned (e.g., in __ip_local_out before
ip_output).

Does dev_net(skb->dev) need a check to prevent a kernel panic when called
from such contexts?

This concern was raised by sashiko-bot@kernel.org in v4 review:
https://lore.kernel.org/bpf/20260421111320.D97D9C2BCB0@smtp.kernel.org/

> +	if (IS_ERR(rt))
> +		return PTR_ERR(rt);
> +	skb_dst_set(skb, &rt->dst);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(ip_route_reply_fill_dst);


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26034287312

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  2026-05-18 12:28 ` [PATCH bpf-next v6 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2026-05-18 13:07   ` bot+bpf-ci
  2026-05-18 14:22     ` Mahe Tardy
  0 siblings, 1 reply; 22+ messages in thread
From: bot+bpf-ci @ 2026-05-18 13:07 UTC (permalink / raw)
  To: mahe.tardy, bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	mahe.tardy, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2536 bytes --]

> diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
> index 09ffe0f13ce7..eb5a60d3babe 100644
> --- a/include/net/ip6_route.h
> +++ b/include/net/ip6_route.h
> @@ -100,6 +100,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
>  	return ip6_route_output_flags(net, sk, fl6, 0);
>  }
>
> +int ip6_route_reply_fill_dst(struct sk_buff *skb);
> +

[ ... ]

> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index e3d355d1fbd6..37a7627a94de 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2725,6 +2725,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
>  }
>  EXPORT_SYMBOL_GPL(ip6_route_output_flags);
>
> +int ip6_route_reply_fill_dst(struct sk_buff *skb)
> +{
> +	struct dst_entry *result;
> +	struct flowi6 fl = {
> +		.daddr = ipv6_hdr(skb)->saddr
                         ^^^^^^^^^^^^^^^^^^^^

Is it possible for the IPv6 header to span paged fragments here?

If the packet header isn't guaranteed to be in the linear data area, and
there are no linearization checks like pskb_may_pull() before accessing
ipv6_hdr(), could this read out-of-bounds memory or cause a page fault
when processing undersized or fragmented packets supplied by BPF programs?

While v6 cover letter mentions "call pskb_network_may_pull on the cloned
pkt", this linearization check is not present in this function.

> +	};
> +	int err;
> +
> +	result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
                                  ^^^^^^^^^^^^

Could skb->dev be NULL in contexts where this function will be used?

Since this is being exported for use by BPF kfuncs, it might be called
from program types where the device reference is not retained, such as
socket filters.

If skb->dev is NULL in these contexts, won't dev_net(skb->dev) cause a
NULL pointer dereference?

While the netfilter caller (nf_send_unreach6) checks and sets skb->dev to
loopback_dev if NULL, the exported function itself does not have this
protection. When called from BPF kfuncs, especially in contexts like socket
filters or tc ingress before device assignment, skb->dev could be NULL.

> +	err = result->error;
> +	if (err)
> +		dst_release(result);
> +	else
> +		skb_dst_set(skb, result);
> +	return err;
> +}
> +EXPORT_SYMBOL_GPL(ip6_route_reply_fill_dst);
> +

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26034287312

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 5/6] selftests/bpf: add bpf_icmp_send kfunc IPv6 tests
  2026-05-18 12:28 ` [PATCH bpf-next v6 5/6] selftests/bpf: add bpf_icmp_send kfunc IPv6 tests Mahe Tardy
@ 2026-05-18 13:21   ` bot+bpf-ci
  2026-05-18 14:27     ` Mahe Tardy
  0 siblings, 1 reply; 22+ messages in thread
From: bot+bpf-ci @ 2026-05-18 13:21 UTC (permalink / raw)
  To: mahe.tardy, bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	mahe.tardy, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 1861 bytes --]

> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> index 4f0aed8152d3..d0ac0502f6df 100644
> --- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c

[ ... ]

> @@ -81,14 +95,13 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
>  }
>
>  static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel,
> -					    int code)
> +					    int code, int af, const char *ip)
>  {
>  	int srv_fd = -1, client_fd = -1;
>  	struct sockaddr_in addr;
>  	socklen_t len = sizeof(addr);
>
> -	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0,
> -			      TIMEOUT_MS);
> +	srv_fd = start_server(af, SOCK_STREAM, ip, 0, TIMEOUT_MS);
>  	if (!ASSERT_GE(srv_fd, 0, "start_server"))
>  		return;

When af is AF_INET6, start_server() creates an IPv6 socket, but addr is
hardcoded as struct sockaddr_in which is 16 bytes.

Does this cause getsockname() to silently truncate the 28-byte
sockaddr_in6 structure because len is initialized to sizeof(addr)?

While reading addr.sin_port happens to work because sin_port and
sin6_port share the same offset, would it be better to use struct
sockaddr_storage here (like in connect_to_fd_nonblock() earlier in the
file) and extract the port conditionally based on the address family to
avoid type confusion?

Note: This concern was raised in v5 review by sashiko-bot@kernel.org,
but does not appear to have been addressed in v6. See
https://lore.kernel.org/bpf/20260515204444.D8204C2BCB0@smtp.kernel.org/

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26034287312

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc
  2026-05-18 12:28 ` [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc Mahe Tardy
@ 2026-05-18 13:34   ` bot+bpf-ci
  2026-05-18 14:26     ` Mahe Tardy
  2026-05-18 16:17   ` Stanislav Fomichev
  2026-05-19  1:33   ` Jordan Rife
  2 siblings, 1 reply; 22+ messages in thread
From: bot+bpf-ci @ 2026-05-18 13:34 UTC (permalink / raw)
  To: mahe.tardy, bpf
  Cc: martin.lau, daniel, john.fastabend, ast, andrii, yonghong.song,
	jordan, netdev, netfilter-devel, edumazet, kuba, pabeni,
	mahe.tardy, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3151 bytes --]

> diff --git a/net/core/filter.c b/net/core/filter.c
> index 9590877b0714..843fa775596b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c

[ ... ]

> @@ -12464,6 +12466,110 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
>  	return 0;
>  }
>
> +static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
> +
> +/**
> + * bpf_icmp_send - Send an ICMP control message
> + * @skb_ctx: Packet that triggered the control message
> + * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
> + * @code: ICMP code (0-15 for IPv4, 0-6 for IPv6)
> + *
> + * Sends an ICMP control message in response to the packet. The original packet
> + * is cloned before sending the ICMP message, so the BPF program can still let
> + * the packet pass if desired.
> + *
> + * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
> + * supported.
> + *
> + * Recursion protection: If called from a context that would trigger recursion
> + * (e.g., root cgroup processing its own ICMP packets), returns -EBUSY on
> + * re-entry.
> + *
> + * Return: 0 on success, negative error code on failure:
> + *         -EINVAL: Invalid code parameter
> + *         -EBADMSG: Packet too short or malformed
> + *         -ENOMEM: Memory allocation failed
> + *         -EBUSY: Recursion detected
> + *         -EHOSTUNREACH: Routing failed
> + *         -EPROTONOSUPPORT: Non-IP protocol
> + *         -EOPNOTSUPP: Unsupported ICMP type
> + */
> +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> +	struct sk_buff *nskb;
> +	bool *in_progress;
> +
> +	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
> +	if (*in_progress)
> +		return -EBUSY;

Is the per-CPU recursion protection flag safe without disabling
preemption or bottom-halves?

Since cgroup_skb egress programs execute in process context, they can be
interrupted by network RX softirqs on the same CPU. If a task sets
*in_progress to true and is immediately preempted by a softirq that runs a
TC ingress program calling bpf_icmp_send(), it will read the same per-CPU
flag and spuriously fail with -EBUSY.

This was raised in v5 by the Sashiko AI bot:

    Since cgroup_skb egress programs execute in process context, they can be
    interrupted by network RX softirqs on the same CPU.

    If a task sets *in_progress to true and is immediately preempted by a
    softirq that runs a TC ingress program calling bpf_icmp_send(), will it
    read the same per-CPU flag and spuriously fail with -EBUSY?

    Does this critical section need to run with bottom-halves disabled?

https://lore.kernel.org/bpf/20260515202358.20252C2BCB0@smtp.kernel.org/

Would using two separate per-CPU flags (one for softirq context, one for
process context) or disabling bottom-halves around the critical section
address this race condition?

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26034287312

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4
  2026-05-18 13:07   ` bot+bpf-ci
@ 2026-05-18 14:21     ` Mahe Tardy
  0 siblings, 0 replies; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 14:21 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: bpf, martin.lau, daniel, john.fastabend, ast, andrii,
	yonghong.song, jordan, netdev, netfilter-devel, edumazet, kuba,
	pabeni, martin.lau, eddyz87, clm, ihor.solodrai

On Mon, May 18, 2026 at 01:07:54PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index bc1296f0ea69..1f031c5ef554 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -2945,6 +2945,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
> >  }
> >  EXPORT_SYMBOL_GPL(ip_route_output_flow);
> >
> > +int ip_route_reply_fill_dst(struct sk_buff *skb)
> > +{
> > +	struct rtable *rt;
> > +	struct flowi4 fl4 = {
> > +		.daddr = ip_hdr(skb)->saddr
> > +	};
> 
> Can this lead to an out-of-bounds read?
> 
> Now that this is exported for BPF kfuncs, it can be reached from early
> hooks like TC ingress where the IPv4 header might not be linearized yet.
> 
> If a malformed packet contains less than sizeof(struct iphdr) bytes of
> linear data, does accessing ->saddr require validation via pskb_may_pull()
> or similar?
> 
> This concern was raised by sashiko-bot@kernel.org in v4 review:
> https://lore.kernel.org/bpf/20260421111320.D97D9C2BCB0@smtp.kernel.org/

Yes but it should be fine since it's the caller responsability to make
sure the header has been linearized, and we are calling
pskb_network_may_pull.

> > +
> > +	rt = ip_route_output_key(dev_net(skb->dev), &fl4);
> 
> Can skb->dev be NULL here?
> 
> While safe in its original netfilter context, a BPF_PROG_TYPE_NETFILTER
> program attached to the NF_INET_LOCAL_OUT hook will receive locally
> generated SKBs before skb->dev is assigned (e.g., in __ip_local_out before
> ip_output).
> 
> Does dev_net(skb->dev) need a check to prevent a kernel panic when called
> from such contexts?
> 
> This concern was raised by sashiko-bot@kernel.org in v4 review:
> https://lore.kernel.org/bpf/20260421111320.D97D9C2BCB0@smtp.kernel.org/

We can't attach such program as now it's only cgroup_skb and tc,
outdated.

> > +	if (IS_ERR(rt))
> > +		return PTR_ERR(rt);
> > +	skb_dst_set(skb, &rt->dst);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(ip_route_reply_fill_dst);
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26034287312


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  2026-05-18 13:07   ` bot+bpf-ci
@ 2026-05-18 14:22     ` Mahe Tardy
  0 siblings, 0 replies; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 14:22 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: bpf, martin.lau, daniel, john.fastabend, ast, andrii,
	yonghong.song, jordan, netdev, netfilter-devel, edumazet, kuba,
	pabeni, martin.lau, eddyz87, clm, ihor.solodrai

On Mon, May 18, 2026 at 01:07:57PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
> > index 09ffe0f13ce7..eb5a60d3babe 100644
> > --- a/include/net/ip6_route.h
> > +++ b/include/net/ip6_route.h
> > @@ -100,6 +100,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
> >  	return ip6_route_output_flags(net, sk, fl6, 0);
> >  }
> >
> > +int ip6_route_reply_fill_dst(struct sk_buff *skb);
> > +
> 
> [ ... ]
> 
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index e3d355d1fbd6..37a7627a94de 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -2725,6 +2725,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
> >  }
> >  EXPORT_SYMBOL_GPL(ip6_route_output_flags);
> >
> > +int ip6_route_reply_fill_dst(struct sk_buff *skb)
> > +{
> > +	struct dst_entry *result;
> > +	struct flowi6 fl = {
> > +		.daddr = ipv6_hdr(skb)->saddr
>                          ^^^^^^^^^^^^^^^^^^^^
> 
> Is it possible for the IPv6 header to span paged fragments here?
> 
> If the packet header isn't guaranteed to be in the linear data area, and
> there are no linearization checks like pskb_may_pull() before accessing
> ipv6_hdr(), could this read out-of-bounds memory or cause a page fault
> when processing undersized or fragmented packets supplied by BPF programs?
> 
> While v6 cover letter mentions "call pskb_network_may_pull on the cloned
> pkt", this linearization check is not present in this function.

Same answer, caller is calling pskb_may_pull, outdated.

> 
> > +	};
> > +	int err;
> > +
> > +	result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
>                                   ^^^^^^^^^^^^
> 
> Could skb->dev be NULL in contexts where this function will be used?
> 
> Since this is being exported for use by BPF kfuncs, it might be called
> from program types where the device reference is not retained, such as
> socket filters.
> 
> If skb->dev is NULL in these contexts, won't dev_net(skb->dev) cause a
> NULL pointer dereference?
> 
> While the netfilter caller (nf_send_unreach6) checks and sets skb->dev to
> loopback_dev if NULL, the exported function itself does not have this
> protection. When called from BPF kfuncs, especially in contexts like socket
> filters or tc ingress before device assignment, skb->dev could be NULL.

Ditto, outdated.

> 
> > +	err = result->error;
> > +	if (err)
> > +		dst_release(result);
> > +	else
> > +		skb_dst_set(skb, result);
> > +	return err;
> > +}
> > +EXPORT_SYMBOL_GPL(ip6_route_reply_fill_dst);
> > +
> 
> [ ... ]
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26034287312


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc
  2026-05-18 13:34   ` bot+bpf-ci
@ 2026-05-18 14:26     ` Mahe Tardy
  0 siblings, 0 replies; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 14:26 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: bpf, martin.lau, daniel, john.fastabend, ast, andrii,
	yonghong.song, jordan, netdev, netfilter-devel, edumazet, kuba,
	pabeni, martin.lau, eddyz87, clm, ihor.solodrai

On Mon, May 18, 2026 at 01:34:48PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 9590877b0714..843fa775596b 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> 
> [ ... ]
> 
> > @@ -12464,6 +12466,110 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
> >  	return 0;
> >  }
> >
> > +static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
> > +
> > +/**
> > + * bpf_icmp_send - Send an ICMP control message
> > + * @skb_ctx: Packet that triggered the control message
> > + * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
> > + * @code: ICMP code (0-15 for IPv4, 0-6 for IPv6)
> > + *
> > + * Sends an ICMP control message in response to the packet. The original packet
> > + * is cloned before sending the ICMP message, so the BPF program can still let
> > + * the packet pass if desired.
> > + *
> > + * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
> > + * supported.
> > + *
> > + * Recursion protection: If called from a context that would trigger recursion
> > + * (e.g., root cgroup processing its own ICMP packets), returns -EBUSY on
> > + * re-entry.
> > + *
> > + * Return: 0 on success, negative error code on failure:
> > + *         -EINVAL: Invalid code parameter
> > + *         -EBADMSG: Packet too short or malformed
> > + *         -ENOMEM: Memory allocation failed
> > + *         -EBUSY: Recursion detected
> > + *         -EHOSTUNREACH: Routing failed
> > + *         -EPROTONOSUPPORT: Non-IP protocol
> > + *         -EOPNOTSUPP: Unsupported ICMP type
> > + */
> > +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> > +{
> > +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > +	struct sk_buff *nskb;
> > +	bool *in_progress;
> > +
> > +	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
> > +	if (*in_progress)
> > +		return -EBUSY;
> 
> Is the per-CPU recursion protection flag safe without disabling
> preemption or bottom-halves?
> 
> Since cgroup_skb egress programs execute in process context, they can be
> interrupted by network RX softirqs on the same CPU. If a task sets
> *in_progress to true and is immediately preempted by a softirq that runs a
> TC ingress program calling bpf_icmp_send(), it will read the same per-CPU
> flag and spuriously fail with -EBUSY.
> 
> This was raised in v5 by the Sashiko AI bot:
> 
>     Since cgroup_skb egress programs execute in process context, they can be
>     interrupted by network RX softirqs on the same CPU.
> 
>     If a task sets *in_progress to true and is immediately preempted by a
>     softirq that runs a TC ingress program calling bpf_icmp_send(), will it
>     read the same per-CPU flag and spuriously fail with -EBUSY?
> 
>     Does this critical section need to run with bottom-halves disabled?
> 
> https://lore.kernel.org/bpf/20260515202358.20252C2BCB0@smtp.kernel.org/
> 
> Would using two separate per-CPU flags (one for softirq context, one for
> process context) or disabling bottom-halves around the critical section
> address this race condition?
> 
> [ ... ]

I've replied on the v5: the race should be acceptable as it will fail
close with EBUSY, if we want to avoid this we might want two PER_CPU
in_progress flags, one for softirq and one for the process context. In
any case, what we want to prevent is recursion (the kfunc triggering
itself), so softirq interruption from process should be okay as it
should concern another packet.

> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26034287312


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 5/6] selftests/bpf: add bpf_icmp_send kfunc IPv6 tests
  2026-05-18 13:21   ` bot+bpf-ci
@ 2026-05-18 14:27     ` Mahe Tardy
  0 siblings, 0 replies; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 14:27 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: bpf, martin.lau, daniel, john.fastabend, ast, andrii,
	yonghong.song, jordan, netdev, netfilter-devel, edumazet, kuba,
	pabeni, martin.lau, eddyz87, clm, ihor.solodrai

On Mon, May 18, 2026 at 01:21:35PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> > index 4f0aed8152d3..d0ac0502f6df 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> 
> [ ... ]
> 
> > @@ -81,14 +95,13 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
> >  }
> >
> >  static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel,
> > -					    int code)
> > +					    int code, int af, const char *ip)
> >  {
> >  	int srv_fd = -1, client_fd = -1;
> >  	struct sockaddr_in addr;
> >  	socklen_t len = sizeof(addr);
> >
> > -	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0,
> > -			      TIMEOUT_MS);
> > +	srv_fd = start_server(af, SOCK_STREAM, ip, 0, TIMEOUT_MS);
> >  	if (!ASSERT_GE(srv_fd, 0, "start_server"))
> >  		return;
> 
> When af is AF_INET6, start_server() creates an IPv6 socket, but addr is
> hardcoded as struct sockaddr_in which is 16 bytes.
> 
> Does this cause getsockname() to silently truncate the 28-byte
> sockaddr_in6 structure because len is initialized to sizeof(addr)?
> 
> While reading addr.sin_port happens to work because sin_port and
> sin6_port share the same offset, would it be better to use struct
> sockaddr_storage here (like in connect_to_fd_nonblock() earlier in the
> file) and extract the port conditionally based on the address family to
> avoid type confusion?
> 
> Note: This concern was raised in v5 review by sashiko-bot@kernel.org,
> but does not appear to have been addressed in v6. See
> https://lore.kernel.org/bpf/20260515204444.D8204C2BCB0@smtp.kernel.org/

Also I have replied in the v5: I don't think we really care about this
since we are just reading the sin_port.

> 
> [ ... ]
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26034287312


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 6/6] selftests/bpf: add bpf_icmp_send recursion test
  2026-05-18 13:07   ` bot+bpf-ci
@ 2026-05-18 14:39     ` Mahe Tardy
  0 siblings, 0 replies; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 14:39 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: bpf, martin.lau, daniel, john.fastabend, ast, andrii,
	yonghong.song, jordan, netdev, netfilter-devel, edumazet, kuba,
	pabeni, martin.lau, eddyz87, clm, ihor.solodrai

On Mon, May 18, 2026 at 01:07:53PM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> > index d0ac0502f6df..a9e9806877cf 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> 
> [ ... ]
> 
> > diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
> > index 6e1ba539eeb0..7830334b747a 100644
> > --- a/tools/testing/selftests/bpf/progs/icmp_send.c
> > +++ b/tools/testing/selftests/bpf/progs/icmp_send.c
> > @@ -13,6 +13,9 @@ int unreach_type = 0;
> >  int unreach_code = 0;
> >  int kfunc_ret = -1;
> >
> > +unsigned int rec_count = 0;
> > +int rec_kfunc_rets[] = { -1, -1 };
> > +
> >  SEC("cgroup_skb/egress")
> >  int egress(struct __sk_buff *skb)
> >  {
> > @@ -65,4 +68,32 @@ int egress(struct __sk_buff *skb)
> >  	return SK_DROP;
> >  }
> >
> > +SEC("cgroup_skb/egress")
> > +int recursion(struct __sk_buff *skb)
> > +{
> > +	void *data = (void *)(long)skb->data;
> > +	void *data_end = (void *)(long)skb->data_end;
> > +	struct iphdr *iph;
> > +	int ret;
> > +
> > +	iph = data;
> > +	if ((void *)(iph + 1) > data_end || iph->version != 4)
> > +		return SK_PASS;
> > +
> > +	/* This call will provoke a recursion: the ICMP packet generated by the
> > +	 * kfunc will re-trigger this program since we are in the root cgroup in
> > +	 * which the kernel ICMP socket belongs. However when re-entering the
> > +	 * kfunc, it should return EBUSY.
> > +	 */
> > +	ret = bpf_icmp_send(skb, unreach_type, unreach_code);
> > +	rec_kfunc_rets[rec_count & 1] = ret;
> 
> Will this program affect background traffic when running on the system?
> 
> The recursion program is attached to the root cgroup and intercepts all
> IPv4 traffic. While v5 changed SK_DROP to SK_PASS to avoid blocking
> traffic, the program still calls bpf_icmp_send() on every IPv4 packet
> without filtering for test-specific traffic.
> 
> This means background IPv4 traffic (SSH, DNS, other tests) will trigger
> ICMP Destination Unreachable messages, potentially causing network
> disruption and test races on rec_kfunc_rets and rec_count.
> 
> The egress program demonstrates proper filtering by checking for TCP,
> SERVER_IP, and server_port:
> 
> SEC("cgroup_skb/egress")
> int egress(struct __sk_buff *skb)
> {
> 	void *data = (void *)(long)skb->data;
> 	void *data_end = (void *)(long)skb->data_end;
> 	struct iphdr *iph;
> 	struct ipv6hdr *ip6h;
> 	struct tcphdr *tcph;
> 
> 	iph = data;
> 	if ((void *)(iph + 1) > data_end || iph->version != 4)
> 		return SK_PASS;
> 
> 	if (iph->protocol != IPPROTO_TCP)
> 		return SK_PASS;
> 
> 	if (bpf_ntohs(iph->daddr) != SERVER_IP)
> 		return SK_PASS;
> 
> 	tcph = (struct tcphdr *)((void *)iph + sizeof(*iph));
> 	if ((void *)(tcph + 1) > data_end)
> 		return SK_PASS;
> 
> 	if (tcph->dest != server_port)
> 		return SK_PASS;
> 	...
> }
> 
> Could the recursion program use similar filtering to explicitly check for
> the test's traffic?

I could restrict a little bit more by ignoring all traffic that is
non-TCP for the dest port or non-ICMP but not sure this is actually
needed. E.g. writing something like this before bpf_icmp_send:

if (iph->daddr != bpf_htonl(SERVER_IP))
	return SK_PASS;

if (iph->protocol == IPPROTO_TCP) {
	tcph = (void *)iph + iph->ihl * 4;
	if ((void *)(tcph + 1) > data_end ||
	    tcph->dest != bpf_htons(server_port))
		return SK_PASS;
} else if (iph->protocol != IPPROTO_ICMP) {
	return SK_PASS;
}

But not sure this is strictly needed.

> 
> > +	__sync_fetch_and_add(&rec_count, 1);
> > +
> > +	/* Let the first ICMP error message pass */
> > +	if (iph->protocol == IPPROTO_ICMP)
> > +		return SK_PASS;
> > +
> > +	return SK_DROP;
> > +}
> > +
> >  char LICENSE[] SEC("license") = "Dual BSD/GPL";
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26034287312


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc
  2026-05-18 12:28 ` [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc Mahe Tardy
  2026-05-18 13:34   ` bot+bpf-ci
@ 2026-05-18 16:17   ` Stanislav Fomichev
  2026-05-18 17:18     ` Mahe Tardy
  2026-05-19  1:33   ` Jordan Rife
  2 siblings, 1 reply; 22+ messages in thread
From: Stanislav Fomichev @ 2026-05-18 16:17 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: bpf, martin.lau, daniel, john.fastabend, ast, andrii,
	yonghong.song, jordan, netdev, netfilter-devel, edumazet, kuba,
	pabeni

On 05/18, Mahe Tardy wrote:
> This is needed in the context of Tetragon to provide improved feedback
> (in contrast to just dropping packets) to east-west traffic when blocked
> by policies using cgroup_skb programs. We also extend this kfunc to tc
> program as a convenience.
> 
> This reuses concepts from netfilter reject target codepath with the
> differences that:
> * Packets are cloned since the BPF user can still let the packet pass
>   (SK_PASS from the cgroup_skb progs for example) and the current skb
>   need to stay untouched (cgroup_skb hooks only allow read-only skb
>   payload).
> * We protect against recursion since the kfunc, by generating an ICMP
>   error message, could retrigger the BPF prog that invoked it.
> 
> For now, we support cgroup_skb and tc program types. For cgroup_skb and
> tc egress, almost everything should be good. However for tc ingress:
> - packet will not be routed yet: need to set the net device for
>   icmp_send, thus the call to ip[6]_route_reply_fill_dst.
> - fragments could trigger hook: icmp_send will only reply to fragment 0.
> - ensure the ip headers is linearized before processing, and zero out
>   the SKB control block after cloning to prevent icmp_send()/icmpv6_send()
>   from misinterpreting garbage data as IP options.
> 
> Only ICMP_DEST_UNREACH and ICMPV6_DEST_UNREACH are currently supported.
> The interface accepts a type parameter to facilitate future extension to
> other ICMP control message types.
> 
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
>  net/core/filter.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 118 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 9590877b0714..843fa775596b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -84,6 +84,8 @@
>  #include <linux/un.h>
>  #include <net/xdp_sock_drv.h>
>  #include <net/inet_dscp.h>
> +#include <linux/icmpv6.h>
> +#include <net/icmp.h>
> 
>  #include "dev.h"
> 
> @@ -12464,6 +12466,110 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
>  	return 0;
>  }
> 
> +static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
> +
> +/**
> + * bpf_icmp_send - Send an ICMP control message
> + * @skb_ctx: Packet that triggered the control message
> + * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
> + * @code: ICMP code (0-15 for IPv4, 0-6 for IPv6)
> + *
> + * Sends an ICMP control message in response to the packet. The original packet
> + * is cloned before sending the ICMP message, so the BPF program can still let
> + * the packet pass if desired.
> + *
> + * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
> + * supported.
> + *
> + * Recursion protection: If called from a context that would trigger recursion
> + * (e.g., root cgroup processing its own ICMP packets), returns -EBUSY on
> + * re-entry.
> + *
> + * Return: 0 on success, negative error code on failure:
> + *         -EINVAL: Invalid code parameter
> + *         -EBADMSG: Packet too short or malformed
> + *         -ENOMEM: Memory allocation failed
> + *         -EBUSY: Recursion detected
> + *         -EHOSTUNREACH: Routing failed
> + *         -EPROTONOSUPPORT: Non-IP protocol
> + *         -EOPNOTSUPP: Unsupported ICMP type
> + */
> +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> +	struct sk_buff *nskb;
> +	bool *in_progress;
> +
> +	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
> +	if (*in_progress)
> +		return -EBUSY;
> +
> +	switch (skb->protocol) {
> +#if IS_ENABLED(CONFIG_INET)
> +	case htons(ETH_P_IP):
> +		if (type != ICMP_DEST_UNREACH)
> +			return -EOPNOTSUPP;
> +		if (code < 0 || code > NR_ICMP_UNREACH)
> +			return -EINVAL;
> +
> +		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +
> +		if (!pskb_network_may_pull(nskb, sizeof(struct iphdr))) {
> +			kfree_skb(nskb);
> +			return -EBADMSG;
> +		}
> +
> +		if (!skb_dst(nskb) && ip_route_reply_fill_dst(nskb) < 0) {
> +			kfree_skb(nskb);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		memset(IPCB(nskb), 0, sizeof(struct inet_skb_parm));
> +
> +		*in_progress = true;
> +		icmp_send(nskb, type, code, 0);
> +		*in_progress = false;

[..]

> +		kfree_skb(nskb);

I was going to suggest to use consume_skb here, I think it is a better fit?

But I'm not sure why you do the clone here, I don't see any requirement from
the icmp_send side, can you clarify? Is it because of the pull?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc
  2026-05-18 16:17   ` Stanislav Fomichev
@ 2026-05-18 17:18     ` Mahe Tardy
  2026-05-19 21:20       ` Stanislav Fomichev
  0 siblings, 1 reply; 22+ messages in thread
From: Mahe Tardy @ 2026-05-18 17:18 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, martin.lau, daniel, john.fastabend, ast, andrii,
	yonghong.song, jordan, netdev, netfilter-devel, edumazet, kuba,
	pabeni

On Mon, May 18, 2026 at 09:17:45AM -0700, Stanislav Fomichev wrote:
> On 05/18, Mahe Tardy wrote:
> > This is needed in the context of Tetragon to provide improved feedback
> > (in contrast to just dropping packets) to east-west traffic when blocked
> > by policies using cgroup_skb programs. We also extend this kfunc to tc
> > program as a convenience.
> > 
> > This reuses concepts from netfilter reject target codepath with the
> > differences that:
> > * Packets are cloned since the BPF user can still let the packet pass
> >   (SK_PASS from the cgroup_skb progs for example) and the current skb
> >   need to stay untouched (cgroup_skb hooks only allow read-only skb
> >   payload).
> > * We protect against recursion since the kfunc, by generating an ICMP
> >   error message, could retrigger the BPF prog that invoked it.
> > 
> > For now, we support cgroup_skb and tc program types. For cgroup_skb and
> > tc egress, almost everything should be good. However for tc ingress:
> > - packet will not be routed yet: need to set the net device for
> >   icmp_send, thus the call to ip[6]_route_reply_fill_dst.
> > - fragments could trigger hook: icmp_send will only reply to fragment 0.
> > - ensure the ip headers is linearized before processing, and zero out
> >   the SKB control block after cloning to prevent icmp_send()/icmpv6_send()
> >   from misinterpreting garbage data as IP options.
> > 
> > Only ICMP_DEST_UNREACH and ICMPV6_DEST_UNREACH are currently supported.
> > The interface accepts a type parameter to facilitate future extension to
> > other ICMP control message types.
> > 
> > Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> > ---
> >  net/core/filter.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 118 insertions(+)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 9590877b0714..843fa775596b 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -84,6 +84,8 @@
> >  #include <linux/un.h>
> >  #include <net/xdp_sock_drv.h>
> >  #include <net/inet_dscp.h>
> > +#include <linux/icmpv6.h>
> > +#include <net/icmp.h>
> > 
> >  #include "dev.h"
> > 
> > @@ -12464,6 +12466,110 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
> >  	return 0;
> >  }
> > 
> > +static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
> > +
> > +/**
> > + * bpf_icmp_send - Send an ICMP control message
> > + * @skb_ctx: Packet that triggered the control message
> > + * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
> > + * @code: ICMP code (0-15 for IPv4, 0-6 for IPv6)
> > + *
> > + * Sends an ICMP control message in response to the packet. The original packet
> > + * is cloned before sending the ICMP message, so the BPF program can still let
> > + * the packet pass if desired.
> > + *
> > + * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
> > + * supported.
> > + *
> > + * Recursion protection: If called from a context that would trigger recursion
> > + * (e.g., root cgroup processing its own ICMP packets), returns -EBUSY on
> > + * re-entry.
> > + *
> > + * Return: 0 on success, negative error code on failure:
> > + *         -EINVAL: Invalid code parameter
> > + *         -EBADMSG: Packet too short or malformed
> > + *         -ENOMEM: Memory allocation failed
> > + *         -EBUSY: Recursion detected
> > + *         -EHOSTUNREACH: Routing failed
> > + *         -EPROTONOSUPPORT: Non-IP protocol
> > + *         -EOPNOTSUPP: Unsupported ICMP type
> > + */
> > +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> > +{
> > +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > +	struct sk_buff *nskb;
> > +	bool *in_progress;
> > +
> > +	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
> > +	if (*in_progress)
> > +		return -EBUSY;
> > +
> > +	switch (skb->protocol) {
> > +#if IS_ENABLED(CONFIG_INET)
> > +	case htons(ETH_P_IP):
> > +		if (type != ICMP_DEST_UNREACH)
> > +			return -EOPNOTSUPP;
> > +		if (code < 0 || code > NR_ICMP_UNREACH)
> > +			return -EINVAL;
> > +
> > +		nskb = skb_clone(skb, GFP_ATOMIC);
> > +		if (!nskb)
> > +			return -ENOMEM;
> > +
> > +		if (!pskb_network_may_pull(nskb, sizeof(struct iphdr))) {
> > +			kfree_skb(nskb);
> > +			return -EBADMSG;
> > +		}
> > +
> > +		if (!skb_dst(nskb) && ip_route_reply_fill_dst(nskb) < 0) {
> > +			kfree_skb(nskb);
> > +			return -EHOSTUNREACH;
> > +		}
> > +
> > +		memset(IPCB(nskb), 0, sizeof(struct inet_skb_parm));
> > +
> > +		*in_progress = true;
> > +		icmp_send(nskb, type, code, 0);
> > +		*in_progress = false;
> 
> [..]
> 
> > +		kfree_skb(nskb);
> 
> I was going to suggest to use consume_skb here, I think it is a better fit?

Yeah correct, I can replace it with consume_skb, didn't know about it,
thanks.

> But I'm not sure why you do the clone here, I don't see any requirement from
> the icmp_send side, can you clarify? Is it because of the pull?

From the icmp_send side I think it's fine, however, this part might
touch the original packet, especially ip_route_reply_fill_dst:


if (!pskb_network_may_pull(nskb, sizeof(struct iphdr))) {
	kfree_skb(nskb);
	return -EBADMSG;
}

if (!skb_dst(nskb) && ip_route_reply_fill_dst(nskb) < 0) {
	kfree_skb(nskb);
	return -EHOSTUNREACH;
}

memset(IPCB(nskb), 0, sizeof(struct inet_skb_parm));


All of this is mostly there because we allow this kfunc for tc and
especially tc ingress. At this stage, the skb might not have a routing
entry yet and icmp_send needs to know the dev from this or fail
silently. This is the original reason why I added the the net patches
(patch 1 and 2) and it was also spotted by Sashiko when I tried to
remove them[^1].

[^1]: https://lore.kernel.org/bpf/20260515202358.20252C2BCB0@smtp.kernel.org/

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc
  2026-05-18 12:28 ` [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc Mahe Tardy
  2026-05-18 13:34   ` bot+bpf-ci
  2026-05-18 16:17   ` Stanislav Fomichev
@ 2026-05-19  1:33   ` Jordan Rife
  2 siblings, 0 replies; 22+ messages in thread
From: Jordan Rife @ 2026-05-19  1:33 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: bpf, martin.lau, daniel, john.fastabend, ast, andrii,
	yonghong.song, netdev, netfilter-devel, edumazet, kuba, pabeni

On Mon, May 18, 2026 at 12:28:39PM +0000, Mahe Tardy wrote:
> This is needed in the context of Tetragon to provide improved feedback
> (in contrast to just dropping packets) to east-west traffic when blocked
> by policies using cgroup_skb programs. We also extend this kfunc to tc
> program as a convenience.
> 
> This reuses concepts from netfilter reject target codepath with the
> differences that:
> * Packets are cloned since the BPF user can still let the packet pass
>   (SK_PASS from the cgroup_skb progs for example) and the current skb
>   need to stay untouched (cgroup_skb hooks only allow read-only skb
>   payload).
> * We protect against recursion since the kfunc, by generating an ICMP
>   error message, could retrigger the BPF prog that invoked it.
> 
> For now, we support cgroup_skb and tc program types. For cgroup_skb and
> tc egress, almost everything should be good. However for tc ingress:
> - packet will not be routed yet: need to set the net device for
>   icmp_send, thus the call to ip[6]_route_reply_fill_dst.
> - fragments could trigger hook: icmp_send will only reply to fragment 0.
> - ensure the ip headers is linearized before processing, and zero out
>   the SKB control block after cloning to prevent icmp_send()/icmpv6_send()
>   from misinterpreting garbage data as IP options.
> 
> Only ICMP_DEST_UNREACH and ICMPV6_DEST_UNREACH are currently supported.
> The interface accepts a type parameter to facilitate future extension to
> other ICMP control message types.
> 
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
>  net/core/filter.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 118 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 9590877b0714..843fa775596b 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -84,6 +84,8 @@
>  #include <linux/un.h>
>  #include <net/xdp_sock_drv.h>
>  #include <net/inet_dscp.h>
> +#include <linux/icmpv6.h>
> +#include <net/icmp.h>
> 
>  #include "dev.h"
> 
> @@ -12464,6 +12466,110 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
>  	return 0;
>  }
> 
> +static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
> +
> +/**
> + * bpf_icmp_send - Send an ICMP control message
> + * @skb_ctx: Packet that triggered the control message
> + * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
> + * @code: ICMP code (0-15 for IPv4, 0-6 for IPv6)
> + *
> + * Sends an ICMP control message in response to the packet. The original packet
> + * is cloned before sending the ICMP message, so the BPF program can still let
> + * the packet pass if desired.
> + *
> + * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
> + * supported.
> + *
> + * Recursion protection: If called from a context that would trigger recursion
> + * (e.g., root cgroup processing its own ICMP packets), returns -EBUSY on
> + * re-entry.
> + *
> + * Return: 0 on success, negative error code on failure:
> + *         -EINVAL: Invalid code parameter
> + *         -EBADMSG: Packet too short or malformed
> + *         -ENOMEM: Memory allocation failed
> + *         -EBUSY: Recursion detected
> + *         -EHOSTUNREACH: Routing failed
> + *         -EPROTONOSUPPORT: Non-IP protocol
> + *         -EOPNOTSUPP: Unsupported ICMP type
> + */
> +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> +	struct sk_buff *nskb;
> +	bool *in_progress;
> +
> +	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
> +	if (*in_progress)
> +		return -EBUSY;
> +
> +	switch (skb->protocol) {
> +#if IS_ENABLED(CONFIG_INET)
> +	case htons(ETH_P_IP):
> +		if (type != ICMP_DEST_UNREACH)
> +			return -EOPNOTSUPP;
> +		if (code < 0 || code > NR_ICMP_UNREACH)
> +			return -EINVAL;
> +
> +		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +
> +		if (!pskb_network_may_pull(nskb, sizeof(struct iphdr))) {
> +			kfree_skb(nskb);
> +			return -EBADMSG;

nit: Instead of having several places where you call kfree_skb, maybe
consider just cleaning up in once place at the end like:

out:
	if (nskb)
		kfree_skb(nskb);
	return err;
	
then in places like this do something like:

	err = -EBADMSG;
	goto out;

> +		}
> +
> +		if (!skb_dst(nskb) && ip_route_reply_fill_dst(nskb) < 0) {
> +			kfree_skb(nskb);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		memset(IPCB(nskb), 0, sizeof(struct inet_skb_parm));
> +
> +		*in_progress = true;
> +		icmp_send(nskb, type, code, 0);
> +		*in_progress = false;
> +		kfree_skb(nskb);
> +		break;
> +#endif
> +#if IS_ENABLED(CONFIG_IPV6)
> +	case htons(ETH_P_IPV6):
> +		if (type != ICMPV6_DEST_UNREACH)
> +			return -EOPNOTSUPP;
> +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> +			return -EINVAL;
> +
> +		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +
> +		if (!pskb_network_may_pull(nskb, sizeof(struct ipv6hdr))) {
> +			kfree_skb(nskb);
> +			return -EBADMSG;
> +		}
> +
> +		if (!skb_dst(nskb) && ip6_route_reply_fill_dst(nskb) < 0) {
> +			kfree_skb(nskb);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		memset(IP6CB(nskb), 0, sizeof(struct inet6_skb_parm));
> +
> +		*in_progress = true;
> +		icmpv6_send(nskb, type, code, 0);
> +		*in_progress = false;
> +		kfree_skb(nskb);
> +		break;
> +#endif
> +	default:
> +		return -EPROTONOSUPPORT;
> +	}
> +
> +	return 0;
> +}
> +
>  __bpf_kfunc_end_defs();
> 
>  int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
> @@ -12506,6 +12612,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
>  BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
>  BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
> 
> +BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send)
> +BTF_ID_FLAGS(func, bpf_icmp_send)
> +BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send)
> +
>  static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
>  	.owner = THIS_MODULE,
>  	.set = &bpf_kfunc_check_set_skb,
> @@ -12536,6 +12646,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
>  	.set = &bpf_kfunc_check_set_sock_ops,
>  };
> 
> +static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send = {
> +	.owner = THIS_MODULE,
> +	.set = &bpf_kfunc_check_set_icmp_send,
> +};
> +
>  static int __init bpf_kfunc_init(void)
>  {
>  	int ret;
> @@ -12557,6 +12672,9 @@ static int __init bpf_kfunc_init(void)
>  	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
>  					       &bpf_kfunc_set_sock_addr);
>  	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_icmp_send);

Thanks, this could come in handy for TC.

I'm not quite sure yet on using it in lieu of the sock_destroy kfunc for
the UDP connected socket use case we discussed at LSFMMBPF. For socket
LB mode in Cilium to make this work you'd need to add at least one new
map lookup in the fast path to check for backend liveness and this
partially defeats the performance benefits of socket LB which right
now avoids service + backend lookups in the fast path for connected UDP.
Ultimately, it might be better to stick with sock_destroy to kill
sockets out-of-band for that use case, but still it's good to have this
option.

> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &bpf_kfunc_set_icmp_send);
>  	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
>  }
>  late_initcall(bpf_kfunc_init);
> --
> 2.34.1
> 

Jordan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 4/6] selftests/bpf: add bpf_icmp_send kfunc tests
  2026-05-18 12:28 ` [PATCH bpf-next v6 4/6] selftests/bpf: add bpf_icmp_send kfunc tests Mahe Tardy
@ 2026-05-19  1:34   ` Jordan Rife
  0 siblings, 0 replies; 22+ messages in thread
From: Jordan Rife @ 2026-05-19  1:34 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: bpf, martin.lau, daniel, john.fastabend, ast, andrii,
	yonghong.song, netdev, netfilter-devel, edumazet, kuba, pabeni

On Mon, May 18, 2026 at 12:28:40PM +0000, Mahe Tardy wrote:
> This test opens a server and client, enters a new cgroup, attach a
> cgroup_skb program on egress and calls the bpf_icmp_send function from
> the client egress so that an ICMP unreach control message is sent back
> to the client. It then fetches the message from the error queue to
> confirm the correct ICMP unreach code has been sent.
> 
> Note that, for the client, we have to connect in non-blocking mode to
> let the test execute faster. Otherwise, we need to wait for the TCP
> three-way handshake to timeout in the kernel before reading the errno.
> 
> Also note that we don't set IP_RECVERR on the socket in
> connect_to_fd_nonblock since the error will be transferred anyway in our
> test because the connection is rejected at the beginning of the TCP
> handshake. See in net/ipv4/tcp_ipv4.c:tcp_v4_err for more details.
> 
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
>  .../bpf/prog_tests/icmp_send_kfunc.c          | 152 ++++++++++++++++++
>  tools/testing/selftests/bpf/progs/icmp_send.c |  38 +++++
>  2 files changed, 190 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
>  create mode 100644 tools/testing/selftests/bpf/progs/icmp_send.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> new file mode 100644
> index 000000000000..4f0aed8152d3
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_kfunc.c
> @@ -0,0 +1,152 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <linux/errqueue.h>
> +#include <poll.h>
> +#include "icmp_send.skel.h"
> +
> +#define TIMEOUT_MS 1000
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +#define ICMP_FRAG_NEEDED 4
> +#define NR_ICMP_UNREACH 15
> +
> +static int connect_to_fd_nonblock(int server_fd)
> +{
> +	struct sockaddr_storage addr;
> +	socklen_t len = sizeof(addr);
> +	int fd, err;
> +
> +	if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
> +		return -1;
> +
> +	fd = socket(addr.ss_family, SOCK_STREAM | SOCK_NONBLOCK, 0);

Is the only customization here SOCK_NONBLOCK? Would it be possible to
simply extend network_helper_opts to make it possible via the existing
connect_to_fd_opts? Alternatively, is it possible to test with
SOCK_DGRAM to avoid the handshake timeout issue?

> +	if (fd < 0)
> +		return -1;
> +
> +	err = connect(fd, (struct sockaddr *)&addr, len);
> +	if (err < 0 && errno != EINPROGRESS) {
> +		close(fd);
> +		return -1;
> +	}
> +
> +	return fd;
> +}
> +
> +static void read_icmp_errqueue(int sockfd, int expected_code)
> +{
> +	ssize_t n;

very small nit: Other functions follow reverse xmas tree order but this
one doesn't which looks kind of weird.

> +	struct sock_extended_err *sock_err;
> +	struct cmsghdr *cm;
> +	char ctrl_buf[512];
> +	struct msghdr msg = {
> +		.msg_control = ctrl_buf,
> +		.msg_controllen = sizeof(ctrl_buf),
> +	};
> +	struct pollfd pfd = {
> +		.fd = sockfd,
> +		.events = POLLERR,
> +	};
> +
> +	if (!ASSERT_GE(poll(&pfd, 1, TIMEOUT_MS), 1, "poll_errqueue"))
> +		return;
> +
> +	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
> +	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
> +		return;
> +
> +	cm = CMSG_FIRSTHDR(&msg);
> +	if (!ASSERT_NEQ(cm, NULL, "cm_firsthdr_null"))
> +		return;
> +
> +	for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
> +		if (cm->cmsg_level != IPPROTO_IP ||
> +		    cm->cmsg_type != IP_RECVERR)
> +			continue;
> +
> +		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
> +
> +		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
> +			       "sock_err_origin_icmp"))
> +			return;
> +		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
> +			       "sock_err_type_dest_unreach"))
> +			return;
> +		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
> +		return;
> +	}
> +
> +	ASSERT_FAIL("no IP_RECVERR/IPV6_RECVERR control message found");
> +}
> +
> +static void trigger_prog_read_icmp_errqueue(struct icmp_send *skel,
> +					    int code)
> +{
> +	int srv_fd = -1, client_fd = -1;
> +	struct sockaddr_in addr;
> +	socklen_t len = sizeof(addr);
> +
> +	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", 0,
> +			      TIMEOUT_MS);
> +	if (!ASSERT_GE(srv_fd, 0, "start_server"))
> +		return;

ASSERT_OK_FD

> +
> +	if (getsockname(srv_fd, (struct sockaddr *)&addr, &len)) {
> +		close(srv_fd);
> +		return;
> +	}
> +	skel->bss->server_port = ntohs(addr.sin_port);
> +	skel->bss->unreach_code = code;
> +
> +	client_fd = connect_to_fd_nonblock(srv_fd);
> +	if (!ASSERT_GE(client_fd, 0, "client_connect_nonblock")) {

ASSERT_OK_FD

> +		close(srv_fd);
> +		return;
> +	}
> +
> +	/* Skip reading ICMP error queue if code is invalid */
> +	if (code >= 0 && code <= NR_ICMP_UNREACH)
> +		read_icmp_errqueue(client_fd, code);
> +

Consider doing all cleanup here and just adding a goto label like
cleanup in test_icmp_send_unreach?

> +	close(client_fd);
> +	close(srv_fd);
> +}
> +
> +void test_icmp_send_unreach(void)
> +{
> +	struct icmp_send *skel;
> +	int cgroup_fd = -1;
> +
> +	skel = icmp_send__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		goto cleanup;
> +
> +	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
> +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> +		goto cleanup;

ASSERT_OK_FD

> +
> +	skel->links.egress =
> +		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
> +		goto cleanup;
> +
> +	for (int code = 0; code <= NR_ICMP_UNREACH; code++) {
> +		/* The TCP stack reacts differently when asking for
> +		 * fragmentation, let's ignore it for now.
> +		 */
> +		if (code == ICMP_FRAG_NEEDED)
> +			continue;
> +
> +		trigger_prog_read_icmp_errqueue(skel, code);
> +		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
> +	}
> +
> +	/* Test an invalid code */
> +	trigger_prog_read_icmp_errqueue(skel, -1);
> +	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
> +
> +cleanup:
> +	icmp_send__destroy(skel);
> +	close(cgroup_fd);

if (cgroup_fd != -1)
	close(cgroup_fd);

> +}
> diff --git a/tools/testing/selftests/bpf/progs/icmp_send.c b/tools/testing/selftests/bpf/progs/icmp_send.c
> new file mode 100644
> index 000000000000..6d0be0a9afe1
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/icmp_send.c
> @@ -0,0 +1,38 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +/* 127.0.0.1 in host byte order */
> +#define SERVER_IP 0x7F000001
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +__u16 server_port = 0;
> +int unreach_code = 0;
> +int kfunc_ret = -1;
> +
> +SEC("cgroup_skb/egress")

No test for TC?

> +int egress(struct __sk_buff *skb)
> +{
> +	void *data = (void *)(long)skb->data;
> +	void *data_end = (void *)(long)skb->data_end;
> +	struct iphdr *iph;
> +	struct tcphdr *tcph;
> +
> +	iph = data;
> +	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
> +	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
> +		return SK_PASS;
> +
> +	tcph = (void *)iph + iph->ihl * 4;
> +	if ((void *)(tcph + 1) > data_end ||
> +	    tcph->dest != bpf_htons(server_port))
> +		return SK_PASS;
> +
> +	kfunc_ret = bpf_icmp_send(skb, ICMP_DEST_UNREACH, unreach_code);
> +
> +	return SK_DROP;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> --
> 2.34.1
>

Jordan

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc
  2026-05-18 17:18     ` Mahe Tardy
@ 2026-05-19 21:20       ` Stanislav Fomichev
  0 siblings, 0 replies; 22+ messages in thread
From: Stanislav Fomichev @ 2026-05-19 21:20 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: bpf, martin.lau, daniel, john.fastabend, ast, andrii,
	yonghong.song, jordan, netdev, netfilter-devel, edumazet, kuba,
	pabeni

On 05/18, Mahe Tardy wrote:
> On Mon, May 18, 2026 at 09:17:45AM -0700, Stanislav Fomichev wrote:
> > On 05/18, Mahe Tardy wrote:
> > > This is needed in the context of Tetragon to provide improved feedback
> > > (in contrast to just dropping packets) to east-west traffic when blocked
> > > by policies using cgroup_skb programs. We also extend this kfunc to tc
> > > program as a convenience.
> > > 
> > > This reuses concepts from netfilter reject target codepath with the
> > > differences that:
> > > * Packets are cloned since the BPF user can still let the packet pass
> > >   (SK_PASS from the cgroup_skb progs for example) and the current skb
> > >   need to stay untouched (cgroup_skb hooks only allow read-only skb
> > >   payload).
> > > * We protect against recursion since the kfunc, by generating an ICMP
> > >   error message, could retrigger the BPF prog that invoked it.
> > > 
> > > For now, we support cgroup_skb and tc program types. For cgroup_skb and
> > > tc egress, almost everything should be good. However for tc ingress:
> > > - packet will not be routed yet: need to set the net device for
> > >   icmp_send, thus the call to ip[6]_route_reply_fill_dst.
> > > - fragments could trigger hook: icmp_send will only reply to fragment 0.
> > > - ensure the ip headers is linearized before processing, and zero out
> > >   the SKB control block after cloning to prevent icmp_send()/icmpv6_send()
> > >   from misinterpreting garbage data as IP options.
> > > 
> > > Only ICMP_DEST_UNREACH and ICMPV6_DEST_UNREACH are currently supported.
> > > The interface accepts a type parameter to facilitate future extension to
> > > other ICMP control message types.
> > > 
> > > Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> > > ---
> > >  net/core/filter.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 118 insertions(+)
> > > 
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 9590877b0714..843fa775596b 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -84,6 +84,8 @@
> > >  #include <linux/un.h>
> > >  #include <net/xdp_sock_drv.h>
> > >  #include <net/inet_dscp.h>
> > > +#include <linux/icmpv6.h>
> > > +#include <net/icmp.h>
> > > 
> > >  #include "dev.h"
> > > 
> > > @@ -12464,6 +12466,110 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
> > >  	return 0;
> > >  }
> > > 
> > > +static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
> > > +
> > > +/**
> > > + * bpf_icmp_send - Send an ICMP control message
> > > + * @skb_ctx: Packet that triggered the control message
> > > + * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
> > > + * @code: ICMP code (0-15 for IPv4, 0-6 for IPv6)
> > > + *
> > > + * Sends an ICMP control message in response to the packet. The original packet
> > > + * is cloned before sending the ICMP message, so the BPF program can still let
> > > + * the packet pass if desired.
> > > + *
> > > + * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
> > > + * supported.
> > > + *
> > > + * Recursion protection: If called from a context that would trigger recursion
> > > + * (e.g., root cgroup processing its own ICMP packets), returns -EBUSY on
> > > + * re-entry.
> > > + *
> > > + * Return: 0 on success, negative error code on failure:
> > > + *         -EINVAL: Invalid code parameter
> > > + *         -EBADMSG: Packet too short or malformed
> > > + *         -ENOMEM: Memory allocation failed
> > > + *         -EBUSY: Recursion detected
> > > + *         -EHOSTUNREACH: Routing failed
> > > + *         -EPROTONOSUPPORT: Non-IP protocol
> > > + *         -EOPNOTSUPP: Unsupported ICMP type
> > > + */
> > > +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> > > +{
> > > +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > > +	struct sk_buff *nskb;
> > > +	bool *in_progress;
> > > +
> > > +	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
> > > +	if (*in_progress)
> > > +		return -EBUSY;
> > > +
> > > +	switch (skb->protocol) {
> > > +#if IS_ENABLED(CONFIG_INET)
> > > +	case htons(ETH_P_IP):
> > > +		if (type != ICMP_DEST_UNREACH)
> > > +			return -EOPNOTSUPP;
> > > +		if (code < 0 || code > NR_ICMP_UNREACH)
> > > +			return -EINVAL;
> > > +
> > > +		nskb = skb_clone(skb, GFP_ATOMIC);
> > > +		if (!nskb)
> > > +			return -ENOMEM;
> > > +
> > > +		if (!pskb_network_may_pull(nskb, sizeof(struct iphdr))) {
> > > +			kfree_skb(nskb);
> > > +			return -EBADMSG;
> > > +		}
> > > +
> > > +		if (!skb_dst(nskb) && ip_route_reply_fill_dst(nskb) < 0) {
> > > +			kfree_skb(nskb);
> > > +			return -EHOSTUNREACH;
> > > +		}
> > > +
> > > +		memset(IPCB(nskb), 0, sizeof(struct inet_skb_parm));
> > > +
> > > +		*in_progress = true;
> > > +		icmp_send(nskb, type, code, 0);
> > > +		*in_progress = false;
> > 
> > [..]
> > 
> > > +		kfree_skb(nskb);
> > 
> > I was going to suggest to use consume_skb here, I think it is a better fit?
> 
> Yeah correct, I can replace it with consume_skb, didn't know about it,
> thanks.
> 
> > But I'm not sure why you do the clone here, I don't see any requirement from
> > the icmp_send side, can you clarify? Is it because of the pull?
> 
> From the icmp_send side I think it's fine, however, this part might
> touch the original packet, especially ip_route_reply_fill_dst:
> 
> 
> if (!pskb_network_may_pull(nskb, sizeof(struct iphdr))) {
> 	kfree_skb(nskb);
> 	return -EBADMSG;
> }
> 
> if (!skb_dst(nskb) && ip_route_reply_fill_dst(nskb) < 0) {
> 	kfree_skb(nskb);
> 	return -EHOSTUNREACH;
> }
> 
> memset(IPCB(nskb), 0, sizeof(struct inet_skb_parm));
> 
> 
> All of this is mostly there because we allow this kfunc for tc and
> especially tc ingress. At this stage, the skb might not have a routing
> entry yet and icmp_send needs to know the dev from this or fail
> silently. This is the original reason why I added the the net patches
> (patch 1 and 2) and it was also spotted by Sashiko when I tried to
> remove them[^1].
> 
> [^1]: https://lore.kernel.org/bpf/20260515202358.20252C2BCB0@smtp.kernel.org/

Thanks for the details. Then yeah, let's just do the consume_skb part,
the rest looks good.

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2026-05-19 21:20 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-18 12:28 [PATCH bpf-next v6 0/4] bpf: add icmp_send kfunc Mahe Tardy
2026-05-18 12:28 ` [PATCH bpf-next v6 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2026-05-18 13:07   ` bot+bpf-ci
2026-05-18 14:21     ` Mahe Tardy
2026-05-18 12:28 ` [PATCH bpf-next v6 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
2026-05-18 13:07   ` bot+bpf-ci
2026-05-18 14:22     ` Mahe Tardy
2026-05-18 12:28 ` [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc Mahe Tardy
2026-05-18 13:34   ` bot+bpf-ci
2026-05-18 14:26     ` Mahe Tardy
2026-05-18 16:17   ` Stanislav Fomichev
2026-05-18 17:18     ` Mahe Tardy
2026-05-19 21:20       ` Stanislav Fomichev
2026-05-19  1:33   ` Jordan Rife
2026-05-18 12:28 ` [PATCH bpf-next v6 4/6] selftests/bpf: add bpf_icmp_send kfunc tests Mahe Tardy
2026-05-19  1:34   ` Jordan Rife
2026-05-18 12:28 ` [PATCH bpf-next v6 5/6] selftests/bpf: add bpf_icmp_send kfunc IPv6 tests Mahe Tardy
2026-05-18 13:21   ` bot+bpf-ci
2026-05-18 14:27     ` Mahe Tardy
2026-05-18 12:28 ` [PATCH bpf-next v6 6/6] selftests/bpf: add bpf_icmp_send recursion test Mahe Tardy
2026-05-18 13:07   ` bot+bpf-ci
2026-05-18 14:39     ` Mahe Tardy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox