public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc
@ 2025-07-10 10:26 Mahe Tardy
  2025-07-10 10:26 ` [PATCH bpf-next v1 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
                   ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-10 10:26 UTC (permalink / raw)
  To: bpf; +Cc: martin.lau, daniel, john.fastabend, ast, andrii, Mahe Tardy

Hello,

This is v1 of adding the icmp_send_unreach kfunc, as suggested during
LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
actively reject east-west traffic, similarly to what is possible to do
with netfilter reject target.

The first step to implement this is using ICMP control messages, with
the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
than a TCP RST reply and will already hint the client TCP stack to abort
the connection and not retry extensively.

Note that this is different than the sock_destroy kfunc, that along
calls tcp_abort and thus sends a reset, destroying the underlying
socket.

Caveats of this design are that a cgroup_skb program can call this
function N times, and thus send N ICMP unreach control messages, and
that the program can also finish the BPF filter with SK_PASS leading to
a potential confusing situation where the TCP connection was established
while the client receive ICMP_DEST_UNREACH messages.

Other design ideas (to prevent above issues) could be:
* Extend the return codes for the cgroup_skb program to trigger the
  reject after completion (SK_REJECT).
* Adding a kfunc to set the kernel to send an ICMP_HOST_UNREACH control
  message with appropriate code when the cgroup_skb program eventually
  terminates with SK_DROP.

We should bear in mind that we want to extend this with TCP reset next.
Please tell me what's your opinion on above ideas: if adding new return
codes could be considered and/or the other alternatives would be better
than this patch series and thus proposed instead.

v1 updates (from Daniel's offline review):
- rename netfilter moved functions to ip(6)_route_reply_fetch_dst;
- explain why nf_ip(6)_route are replaced with core route functions;
- remove useless IP frag checks;
- add KF_TRUSTED_ARGS to the kfunc;
- simplify declarations of structs, initialize fd to -1 in tests;
- insist on why SK_PASS is easier for testing in BPF prog.

[^1]: https://lwn.net/Articles/1022034/

Mahe Tardy (4):
  net: move netfilter nf_reject_fill_skb_dst to core ipv4
  net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  selftests/bpf: add icmp_send_unreach kfunc tests

 include/net/ip6_route.h                       |  2 +
 include/net/route.h                           |  1 +
 net/core/filter.c                             | 63 +++++++++++-
 net/ipv4/netfilter/nf_reject_ipv4.c           | 19 +---
 net/ipv4/route.c                              | 15 +++
 net/ipv6/netfilter/nf_reject_ipv6.c           | 17 +---
 net/ipv6/route.c                              | 18 ++++
 .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 96 +++++++++++++++++++
 .../selftests/bpf/progs/icmp_send_unreach.c   | 35 +++++++
 9 files changed, 232 insertions(+), 34 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c

--
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v1 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4
  2025-07-10 10:26 [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
@ 2025-07-10 10:26 ` Mahe Tardy
  2025-07-10 10:26 ` [PATCH bpf-next v1 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-10 10:26 UTC (permalink / raw)
  To: bpf; +Cc: martin.lau, daniel, john.fastabend, ast, andrii, Mahe Tardy

Move and rename nf_reject_fill_skb_dst from
ipv4/netfilter/nf_reject_ipv4 to ip_route_reply_fetch_dst in
ipv4/route.c so that it can be reused in the following patches by BPF
kfuncs.

Netfilter uses nf_ip_route that is almost a transparent wrapper around
ip_route_output_key so this patch inlines it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/route.h                 |  1 +
 net/ipv4/netfilter/nf_reject_ipv4.c | 19 ++-----------------
 net/ipv4/route.c                    | 15 +++++++++++++++
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 8e39aa822cf9..1f032f768d52 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    const struct sock *sk);
 struct dst_entry *ipv4_blackhole_route(struct net *net,
 				       struct dst_entry *dst_orig);
+int ip_route_reply_fetch_dst(struct sk_buff *skb);

 static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
 {
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index 87fd945a0d27..76beb78f556a 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -220,21 +220,6 @@ void nf_reject_ip_tcphdr_put(struct sk_buff *nskb, const struct sk_buff *oldskb,
 }
 EXPORT_SYMBOL_GPL(nf_reject_ip_tcphdr_put);

-static int nf_reject_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip4.daddr = ip_hdr(skb_in)->saddr;
-	nf_ip_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 /* Send RST reply */
 void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		   int hook)
@@ -248,7 +233,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		return;

 	if ((hook == NF_INET_PRE_ROUTING || hook == NF_INET_INGRESS) &&
-	    nf_reject_fill_skb_dst(oldskb) < 0)
+	    ip_route_reply_fetch_dst(oldskb) < 0)
 		return;

 	if (skb_rtable(oldskb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -322,7 +307,7 @@ void nf_send_unreach(struct sk_buff *skb_in, int code, int hook)
 		return;

 	if ((hook == NF_INET_PRE_ROUTING || hook == NF_INET_INGRESS) &&
-	    nf_reject_fill_skb_dst(skb_in) < 0)
+	    ip_route_reply_fetch_dst(skb_in) < 0)
 		return;

 	if (skb_csum_unnecessary(skb_in) ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index fccb05fb3a79..59b8fc3c01c0 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2934,6 +2934,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);

+int ip_route_reply_fetch_dst(struct sk_buff *skb)
+{
+	struct rtable *rt;
+	struct flowi4 fl4 = {
+		.daddr = ip_hdr(skb)->saddr
+	};
+
+	rt = ip_route_output_key(dev_net(skb->dev), &fl4);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+	skb_dst_set(skb, &rt->dst);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);
+
 /* called with rcu_read_lock held */
 static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 			struct rtable *rt, u32 table_id, dscp_t dscp,
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v1 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  2025-07-10 10:26 [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
  2025-07-10 10:26 ` [PATCH bpf-next v1 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2025-07-10 10:26 ` Mahe Tardy
  2025-07-10 22:02   ` kernel test robot
  2025-07-10 10:26 ` [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
  2025-07-10 10:26 ` [PATCH bpf-next v1 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
  3 siblings, 1 reply; 55+ messages in thread
From: Mahe Tardy @ 2025-07-10 10:26 UTC (permalink / raw)
  To: bpf; +Cc: martin.lau, daniel, john.fastabend, ast, andrii, Mahe Tardy

Move and rename nf_reject6_fill_skb_dst from
ipv6/netfilter/nf_reject_ipv6 to ip6_route_reply_fetch_dst in
ipv6/route.c so that it can be reused in the following patches by BPF
kfuncs.

Netfilter uses nf_ip6_route that is almost a transparent wrapper around
ip6_route_outputy so this patch inlines it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/ip6_route.h             |  2 ++
 net/ipv6/netfilter/nf_reject_ipv6.c | 17 +----------------
 net/ipv6/route.c                    | 18 ++++++++++++++++++
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 6dbdf60b342f..1426467df547 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -93,6 +93,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
 	return ip6_route_output_flags(net, sk, fl6, 0);
 }

+int ip6_route_reply_fetch_dst(struct sk_buff *skb);
+
 /* Only conditionally release dst if flags indicates
  * !RT6_LOOKUP_F_DST_NOREF or dst is in uncached_list.
  */
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index 9ae2b2725bf9..2553b0d43a0e 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -250,21 +250,6 @@ void nf_reject_ip6_tcphdr_put(struct sk_buff *nskb,
 }
 EXPORT_SYMBOL_GPL(nf_reject_ip6_tcphdr_put);

-static int nf_reject6_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip6.daddr = ipv6_hdr(skb_in)->saddr;
-	nf_ip6_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		    int hook)
 {
@@ -398,7 +383,7 @@ void nf_send_unreach6(struct net *net, struct sk_buff *skb_in,
 		skb_in->dev = net->loopback_dev;

 	if ((hooknum == NF_INET_PRE_ROUTING || hooknum == NF_INET_INGRESS) &&
-	    nf_reject6_fill_skb_dst(skb_in) < 0)
+	    ip6_reply_fill_dst(skb_in) < 0)
 		return;

 	icmpv6_send(skb_in, ICMPV6_DEST_UNREACH, code, 0);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0143262094b0..f1120f6fe0a2 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2705,6 +2705,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
 }
 EXPORT_SYMBOL_GPL(ip6_route_output_flags);

+int ip6_route_reply_fetch_dst(struct sk_buff *skb)
+{
+	struct dst_entry *result;
+	struct flowi6 fl = {
+		.daddr = ipv6_hdr(skb)->saddr
+	};
+	int err;
+
+	result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
+	err = result->error;
+	if (err)
+		dst_release(result);
+	else
+		skb_dst_set(skb, result);
+	return err;
+}
+EXPORT_SYMBOL_GPL(ip6_route_reply_fetch_dst);
+
 struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_orig)
 {
 	struct rt6_info *rt, *ort = dst_rt6_info(dst_orig);
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-10 10:26 [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
  2025-07-10 10:26 ` [PATCH bpf-next v1 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
  2025-07-10 10:26 ` [PATCH bpf-next v1 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2025-07-10 10:26 ` Mahe Tardy
  2025-07-10 16:07   ` Alexei Starovoitov
  2025-07-11  0:32   ` [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc kernel test robot
  2025-07-10 10:26 ` [PATCH bpf-next v1 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
  3 siblings, 2 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-10 10:26 UTC (permalink / raw)
  To: bpf; +Cc: martin.lau, daniel, john.fastabend, ast, andrii, Mahe Tardy

This is needed in the context of Tetragon to provide improved feedback
(in contrast to just dropping packets) to east-west traffic when blocked
by policies using cgroup_skb programs.

This reuse concepts from netfilter reject target codepath with the
differences that:
* Packets are cloned since the BPF user can still return SK_PASS from
  the cgroup_skb progs and the current skb need to stay untouched
  (cgroup_skb hooks only allow read-only skb payload).
* Since cgroup_skb programs are called late in the stack, checksums do
  not need to be computed or verified, and IPv4 fragmentation does not
  need to be checked (ip_local_deliver should take care of that
  earlier).

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 net/core/filter.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 60 insertions(+), 1 deletion(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index ab456bf1056e..9215f79e7690 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,8 @@
 #include <linux/un.h>
 #include <net/xdp_sock_drv.h>
 #include <net/inet_dscp.h>
+#include <linux/icmp.h>
+#include <net/icmp.h>

 #include "dev.h"

@@ -12140,6 +12142,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
 	return 0;
 }

+__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
+{
+	struct sk_buff *skb = (struct sk_buff *)__skb;
+	struct sk_buff *nskb;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		if (code < 0 || code > NR_ICMP_UNREACH)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (ip_route_reply_fetch_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
+		kfree_skb(nskb);
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case htons(ETH_P_IPV6):
+		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (ip6_route_reply_fetch_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
+		kfree_skb(nskb);
+		break;
+#endif
+	default:
+		return -EPROTONOSUPPORT;
+	}
+
+	return 0;
+}
+
 __bpf_kfunc_end_defs();

 int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12177,6 +12226,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
 BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)

+BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send_unreach)
+BTF_ID_FLAGS(func, bpf_icmp_send_unreach, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send_unreach)
+
 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
 	.owner = THIS_MODULE,
 	.set = &bpf_kfunc_check_set_skb,
@@ -12202,6 +12255,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
 	.set = &bpf_kfunc_check_set_sock_ops,
 };

+static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send_unreach = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_icmp_send_unreach,
+};
+
 static int __init bpf_kfunc_init(void)
 {
 	int ret;
@@ -12221,7 +12279,8 @@ static int __init bpf_kfunc_init(void)
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 					       &bpf_kfunc_set_sock_addr);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
-	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
+	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send_unreach);
 }
 late_initcall(bpf_kfunc_init);

--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v1 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-10 10:26 [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
                   ` (2 preceding siblings ...)
  2025-07-10 10:26 ` [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
@ 2025-07-10 10:26 ` Mahe Tardy
  3 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-10 10:26 UTC (permalink / raw)
  To: bpf; +Cc: martin.lau, daniel, john.fastabend, ast, andrii, Mahe Tardy

This test opens a server and client, attach a cgroup_skb program on
egress and calls the icmp_send_unreach function from the client egress
so that an ICMP unreach control message is sent back to the client.
It then fetches the message from the error queue to confirm the correct
ICMP unreach code has been sent.

Note that the BPF program returns SK_PASS to let the connection being
established to finish the test cases quicker. Otherwise, you have to
wait for the TCP three-way handshake to timeout in the kernel and
retrieve the errno translated from the unreach code set by the ICMP
control message.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 96 +++++++++++++++++++
 .../selftests/bpf/progs/icmp_send_unreach.c   | 35 +++++++
 2 files changed, 131 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
new file mode 100644
index 000000000000..2896d0619358
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <linux/errqueue.h>
+#include "icmp_send_unreach.skel.h"
+
+#define TIMEOUT_MS 1000
+#define SRV_PORT 54321
+
+#define ICMP_DEST_UNREACH 3
+
+#define ICMP_FRAG_NEEDED 4
+#define NR_ICMP_UNREACH 15
+
+static void read_icmp_errqueue(int sockfd, int expected_code)
+{
+	ssize_t n;
+	struct sock_extended_err *sock_err;
+	struct cmsghdr *cm;
+	char ctrl_buf[512];
+	struct msghdr msg = {
+		.msg_control = ctrl_buf,
+		.msg_controllen = sizeof(ctrl_buf),
+	};
+
+	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
+	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
+		return;
+
+	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
+		    !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
+			continue;
+
+		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
+
+		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
+			       "sock_err_origin_icmp"))
+			return;
+		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+			       "sock_err_type_dest_unreach"))
+			return;
+		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
+	}
+}
+
+void test_icmp_send_unreach_kfunc(void)
+{
+	struct icmp_send_unreach *skel;
+	int cgroup_fd = -1, client_fd = 1, srv_fd = -1;
+	int *code;
+
+	skel = icmp_send_unreach__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		goto cleanup;
+
+	skel->links.egress =
+		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
+		goto cleanup;
+
+	code = &skel->bss->unreach_code;
+
+	for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
+		// The TCP stack reacts differently when asking for
+		// fragmentation, let's ignore it for now
+		if (*code == ICMP_FRAG_NEEDED)
+			continue;
+
+		srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
+				      SRV_PORT, TIMEOUT_MS);
+		if (!ASSERT_GE(srv_fd, 0, "start_server"))
+			goto for_cleanup;
+
+		client_fd = socket(AF_INET, SOCK_STREAM, 0);
+		ASSERT_GE(client_fd, 0, "client_socket");
+
+		client_fd = connect_to_fd(srv_fd, 0);
+		if (!ASSERT_GE(client_fd, 0, "client_connect"))
+			goto for_cleanup;
+
+		read_icmp_errqueue(client_fd, *code);
+
+for_cleanup:
+		close(client_fd);
+		close(srv_fd);
+	}
+
+cleanup:
+	icmp_send_unreach__destroy(skel);
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
new file mode 100644
index 000000000000..ec406028be5f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
+
+int unreach_code = 0;
+
+#define SERVER_PORT 54321
+#define SERVER_IP 0x7F000001
+
+SEC("cgroup_skb/egress")
+int egress(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph;
+	struct tcphdr *tcph;
+
+	iph = data;
+	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
+	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+		return SK_PASS;
+
+	tcph = (void *)iph + iph->ihl * 4;
+	if ((void *)(tcph + 1) > data_end ||
+	    tcph->dest != bpf_htons(SERVER_PORT))
+		return SK_PASS;
+
+	bpf_icmp_send_unreach(skb, unreach_code);
+
+	/* returns SK_PASS to execute the test case quicker */
+	return SK_PASS;
+}
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-10 10:26 ` [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
@ 2025-07-10 16:07   ` Alexei Starovoitov
  2025-07-11 10:57     ` Mahe Tardy
  2025-07-25 18:53     ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
  2025-07-11  0:32   ` [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc kernel test robot
  1 sibling, 2 replies; 55+ messages in thread
From: Alexei Starovoitov @ 2025-07-10 16:07 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: bpf, Martin KaFai Lau, Daniel Borkmann, John Fastabend,
	Alexei Starovoitov, Andrii Nakryiko

On Thu, Jul 10, 2025 at 3:26 AM Mahe Tardy <mahe.tardy@gmail.com> wrote:
>
> This is needed in the context of Tetragon to provide improved feedback
> (in contrast to just dropping packets) to east-west traffic when blocked
> by policies using cgroup_skb programs.
>
> This reuse concepts from netfilter reject target codepath with the
> differences that:
> * Packets are cloned since the BPF user can still return SK_PASS from
>   the cgroup_skb progs and the current skb need to stay untouched
>   (cgroup_skb hooks only allow read-only skb payload).
> * Since cgroup_skb programs are called late in the stack, checksums do
>   not need to be computed or verified, and IPv4 fragmentation does not
>   need to be checked (ip_local_deliver should take care of that
>   earlier).
>
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
>  net/core/filter.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 60 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ab456bf1056e..9215f79e7690 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -85,6 +85,8 @@
>  #include <linux/un.h>
>  #include <net/xdp_sock_drv.h>
>  #include <net/inet_dscp.h>
> +#include <linux/icmp.h>
> +#include <net/icmp.h>
>
>  #include "dev.h"
>
> @@ -12140,6 +12142,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
>         return 0;
>  }
>
> +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
> +{
> +       struct sk_buff *skb = (struct sk_buff *)__skb;
> +       struct sk_buff *nskb;
> +
> +       switch (skb->protocol) {
> +       case htons(ETH_P_IP):
> +               if (code < 0 || code > NR_ICMP_UNREACH)
> +                       return -EINVAL;
> +
> +               nskb = skb_clone(skb, GFP_ATOMIC);
> +               if (!nskb)
> +                       return -ENOMEM;
> +
> +               if (ip_route_reply_fetch_dst(nskb) < 0) {
> +                       kfree_skb(nskb);
> +                       return -EHOSTUNREACH;
> +               }
> +
> +               icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
> +               kfree_skb(nskb);
> +               break;
> +#if IS_ENABLED(CONFIG_IPV6)
> +       case htons(ETH_P_IPV6):
> +               if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> +                       return -EINVAL;
> +
> +               nskb = skb_clone(skb, GFP_ATOMIC);
> +               if (!nskb)
> +                       return -ENOMEM;
> +
> +               if (ip6_route_reply_fetch_dst(nskb) < 0) {
> +                       kfree_skb(nskb);
> +                       return -EHOSTUNREACH;
> +               }
> +
> +               icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
> +               kfree_skb(nskb);
> +               break;
> +#endif
> +       default:
> +               return -EPROTONOSUPPORT;
> +       }
> +
> +       return 0;
> +}
> +
>  __bpf_kfunc_end_defs();
>
>  int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
> @@ -12177,6 +12226,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
>  BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
>  BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
>
> +BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send_unreach)
> +BTF_ID_FLAGS(func, bpf_icmp_send_unreach, KF_TRUSTED_ARGS)
> +BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send_unreach)
> +
>  static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
>         .owner = THIS_MODULE,
>         .set = &bpf_kfunc_check_set_skb,
> @@ -12202,6 +12255,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
>         .set = &bpf_kfunc_check_set_sock_ops,
>  };
>
> +static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send_unreach = {
> +       .owner = THIS_MODULE,
> +       .set = &bpf_kfunc_check_set_icmp_send_unreach,
> +};
> +
>  static int __init bpf_kfunc_init(void)
>  {
>         int ret;
> @@ -12221,7 +12279,8 @@ static int __init bpf_kfunc_init(void)
>         ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
>                                                &bpf_kfunc_set_sock_addr);
>         ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
> -       return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
> +       ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
> +       return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send_unreach);

Does it have to be restricted to BPF_PROG_TYPE_CGROUP_SKB ?
Can it be a part of bpf_kfunc_set_skb[] and used more generally ?

If restriction is necessary then I guess we can live with extra
bpf_kfunc_set_icmp_send_unreach, though it's odd to create a set
just for one kfunc.
Either way don't change the last 'return ...' line in this file.
Add 'ret = ret ?: register...' instead to reduce churn.

Also cc netdev and netfilter maintainers in v2.

--
pw-bot: cr

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v1 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  2025-07-10 10:26 ` [PATCH bpf-next v1 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2025-07-10 22:02   ` kernel test robot
  0 siblings, 0 replies; 55+ messages in thread
From: kernel test robot @ 2025-07-10 22:02 UTC (permalink / raw)
  To: Mahe Tardy, bpf
  Cc: oe-kbuild-all, martin.lau, daniel, john.fastabend, ast, andrii,
	Mahe Tardy

Hi Mahe,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Mahe-Tardy/net-move-netfilter-nf_reject_fill_skb_dst-to-core-ipv4/20250710-182905
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250710102607.12413-3-mahe.tardy%40gmail.com
patch subject: [PATCH bpf-next v1 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20250711/202507110529.9ZuNSyQU-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250711/202507110529.9ZuNSyQU-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507110529.9ZuNSyQU-lkp@intel.com/

All errors (new ones prefixed by >>):

   net/ipv6/netfilter/nf_reject_ipv6.c: In function 'nf_send_unreach6':
>> net/ipv6/netfilter/nf_reject_ipv6.c:386:13: error: implicit declaration of function 'ip6_reply_fill_dst' [-Werror=implicit-function-declaration]
     386 |             ip6_reply_fill_dst(skb_in) < 0)
         |             ^~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/ip6_reply_fill_dst +386 net/ipv6/netfilter/nf_reject_ipv6.c

   375	
   376	void nf_send_unreach6(struct net *net, struct sk_buff *skb_in,
   377			      unsigned char code, unsigned int hooknum)
   378	{
   379		if (!reject6_csum_ok(skb_in, hooknum))
   380			return;
   381	
   382		if (hooknum == NF_INET_LOCAL_OUT && skb_in->dev == NULL)
   383			skb_in->dev = net->loopback_dev;
   384	
   385		if ((hooknum == NF_INET_PRE_ROUTING || hooknum == NF_INET_INGRESS) &&
 > 386		    ip6_reply_fill_dst(skb_in) < 0)
   387			return;
   388	
   389		icmpv6_send(skb_in, ICMPV6_DEST_UNREACH, code, 0);
   390	}
   391	EXPORT_SYMBOL_GPL(nf_send_unreach6);
   392	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-10 10:26 ` [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
  2025-07-10 16:07   ` Alexei Starovoitov
@ 2025-07-11  0:32   ` kernel test robot
  1 sibling, 0 replies; 55+ messages in thread
From: kernel test robot @ 2025-07-11  0:32 UTC (permalink / raw)
  To: Mahe Tardy, bpf
  Cc: oe-kbuild-all, martin.lau, daniel, john.fastabend, ast, andrii,
	Mahe Tardy

Hi Mahe,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Mahe-Tardy/net-move-netfilter-nf_reject_fill_skb_dst-to-core-ipv4/20250710-182905
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250710102607.12413-4-mahe.tardy%40gmail.com
patch subject: [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
config: i386-buildonly-randconfig-002-20250711 (https://download.01.org/0day-ci/archive/20250711/202507110812.bV4X2x0X-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250711/202507110812.bV4X2x0X-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507110812.bV4X2x0X-lkp@intel.com/

All errors (new ones prefixed by >>):

   ld: net/core/filter.o: in function `bpf_icmp_send_unreach':
>> filter.c:(.text+0x11e97): undefined reference to `ip_route_reply_fetch_dst'
>> ld: filter.c:(.text+0x11eaf): undefined reference to `__icmp_send'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-10 16:07   ` Alexei Starovoitov
@ 2025-07-11 10:57     ` Mahe Tardy
  2025-07-25 18:53     ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
  1 sibling, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-11 10:57 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Martin KaFai Lau, Daniel Borkmann, John Fastabend,
	Alexei Starovoitov, Andrii Nakryiko

On Thu, Jul 10, 2025 at 09:07:59AM -0700, Alexei Starovoitov wrote:
> On Thu, Jul 10, 2025 at 3:26 AM Mahe Tardy <mahe.tardy@gmail.com> wrote:
> >
> > This is needed in the context of Tetragon to provide improved feedback
> > (in contrast to just dropping packets) to east-west traffic when blocked
> > by policies using cgroup_skb programs.
> >
> > This reuse concepts from netfilter reject target codepath with the
> > differences that:
> > * Packets are cloned since the BPF user can still return SK_PASS from
> >   the cgroup_skb progs and the current skb need to stay untouched
> >   (cgroup_skb hooks only allow read-only skb payload).
> > * Since cgroup_skb programs are called late in the stack, checksums do
> >   not need to be computed or verified, and IPv4 fragmentation does not
> >   need to be checked (ip_local_deliver should take care of that
> >   earlier).
> >
> > Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> > ---
> >  net/core/filter.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 60 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index ab456bf1056e..9215f79e7690 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -85,6 +85,8 @@
> >  #include <linux/un.h>
> >  #include <net/xdp_sock_drv.h>
> >  #include <net/inet_dscp.h>
> > +#include <linux/icmp.h>
> > +#include <net/icmp.h>
> >
> >  #include "dev.h"
> >
> > @@ -12140,6 +12142,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
> >         return 0;
> >  }
> >
> > +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
> > +{
> > +       struct sk_buff *skb = (struct sk_buff *)__skb;
> > +       struct sk_buff *nskb;
> > +
> > +       switch (skb->protocol) {
> > +       case htons(ETH_P_IP):
> > +               if (code < 0 || code > NR_ICMP_UNREACH)
> > +                       return -EINVAL;
> > +
> > +               nskb = skb_clone(skb, GFP_ATOMIC);
> > +               if (!nskb)
> > +                       return -ENOMEM;
> > +
> > +               if (ip_route_reply_fetch_dst(nskb) < 0) {
> > +                       kfree_skb(nskb);
> > +                       return -EHOSTUNREACH;
> > +               }
> > +
> > +               icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
> > +               kfree_skb(nskb);
> > +               break;
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +       case htons(ETH_P_IPV6):
> > +               if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > +                       return -EINVAL;
> > +
> > +               nskb = skb_clone(skb, GFP_ATOMIC);
> > +               if (!nskb)
> > +                       return -ENOMEM;
> > +
> > +               if (ip6_route_reply_fetch_dst(nskb) < 0) {
> > +                       kfree_skb(nskb);
> > +                       return -EHOSTUNREACH;
> > +               }
> > +
> > +               icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
> > +               kfree_skb(nskb);
> > +               break;
> > +#endif
> > +       default:
> > +               return -EPROTONOSUPPORT;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> >  __bpf_kfunc_end_defs();
> >
> >  int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
> > @@ -12177,6 +12226,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
> >  BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
> >  BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
> >
> > +BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send_unreach)
> > +BTF_ID_FLAGS(func, bpf_icmp_send_unreach, KF_TRUSTED_ARGS)
> > +BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send_unreach)
> > +
> >  static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
> >         .owner = THIS_MODULE,
> >         .set = &bpf_kfunc_check_set_skb,
> > @@ -12202,6 +12255,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
> >         .set = &bpf_kfunc_check_set_sock_ops,
> >  };
> >
> > +static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send_unreach = {
> > +       .owner = THIS_MODULE,
> > +       .set = &bpf_kfunc_check_set_icmp_send_unreach,
> > +};
> > +
> >  static int __init bpf_kfunc_init(void)
> >  {
> >         int ret;
> > @@ -12221,7 +12279,8 @@ static int __init bpf_kfunc_init(void)
> >         ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
> >                                                &bpf_kfunc_set_sock_addr);
> >         ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
> > -       return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
> > +       ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
> > +       return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send_unreach);
> 
> Does it have to be restricted to BPF_PROG_TYPE_CGROUP_SKB ?
> Can it be a part of bpf_kfunc_set_skb[] and used more generally ?

From the assumptions that have been made to write the kfunc in this
state yes, it has to be restricted to cgroup_skb. We would need
additional checks for hooks that are earlier in the stack I think.

Keeping in mind that this kfunc is not a necessity for other prog types
which can already overwrite packets, like TC.
 
> If restriction is necessary then I guess we can live with extra
> bpf_kfunc_set_icmp_send_unreach, though it's odd to create a set
> just for one kfunc.
> Either way don't change the last 'return ...' line in this file.
> Add 'ret = ret ?: register...' instead to reduce churn.
> 
> Also cc netdev and netfilter maintainers in v2.

Yes to both.

Aside, could I have your opinion on this part of the cover letter before
I proceed to fix these patches:

> Other design ideas (to prevent above issues) could be:
> * Extend the return codes for the cgroup_skb program to trigger the
>  reject after completion (SK_REJECT).
> * Adding a kfunc to set the kernel to send an ICMP_HOST_UNREACH control
>  message with appropriate code when the cgroup_skb program eventually
>  terminates with SK_DROP.
> 
> We should bear in mind that we want to extend this with TCP reset next.
> Please tell me what's your opinion on above ideas: if adding new return
> codes could be considered and/or the other alternatives would be better
> than this patch series and thus proposed instead.

These two ideas would make it more natural for cgroup_skb progs but
would prevent someone to extend it to more prog types in the future.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc
  2025-07-10 16:07   ` Alexei Starovoitov
  2025-07-11 10:57     ` Mahe Tardy
@ 2025-07-25 18:53     ` Mahe Tardy
  2025-07-25 18:53       ` [PATCH bpf-next v2 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
                         ` (3 more replies)
  1 sibling, 4 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-25 18:53 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: andrii, ast, bpf, daniel, john.fastabend, mahe.tardy, martin.lau,
	fw, netfilter-devel, pablo, netdev, coreteam

Hello,

This is v2 of adding the icmp_send_unreach kfunc, as suggested during
LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
actively reject east-west traffic, similarly to what is possible to do
with netfilter reject target.

The first step to implement this is using ICMP control messages, with
the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
than a TCP RST reply and will already hint the client TCP stack to abort
the connection and not retry extensively.

Note that this is different than the sock_destroy kfunc, that along
calls tcp_abort and thus sends a reset, destroying the underlying
socket.

Caveats of this kfunc design are that a cgroup_skb program can call this
function N times, thus send N ICMP unreach control messages and that the
program can return from the BPF filter with SK_PASS leading to a
potential confusing situation where the TCP connection was established
while the client received ICMP_DEST_UNREACH messages.

Another more sophisticated design idea would be for the kfunc to set the
kernel to send an ICMP_HOST_UNREACH control message with the appropriate
code when the cgroup_skb program terminates with SK_DROP. Creating a new
'SK_REJECT' return code for cgroup_skb program was generally rejected
and would be too limited for other program types support.

We should bear in mind that we want to add a TCP reset kfunc next and
also could extend this kfunc to other program types if wanted.

v2 updates:
- fix a build error from a missing function call rename;
- avoid changing return line in bpf_kfunc_init;
- return SK_DROP from the kfunc (similarly to bpf_redirect);
- check the return value in the selftest.

[^1]: https://lwn.net/Articles/1022034/

Mahe Tardy (4):
  net: move netfilter nf_reject_fill_skb_dst to core ipv4
  net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  selftests/bpf: add icmp_send_unreach kfunc tests

 include/net/ip6_route.h                       |  2 +
 include/net/route.h                           |  1 +
 net/core/filter.c                             | 59 +++++++++++
 net/ipv4/netfilter/nf_reject_ipv4.c           | 19 +---
 net/ipv4/route.c                              | 15 +++
 net/ipv6/netfilter/nf_reject_ipv6.c           | 17 +---
 net/ipv6/route.c                              | 18 ++++
 .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 99 +++++++++++++++++++
 .../selftests/bpf/progs/icmp_send_unreach.c   | 36 +++++++
 9 files changed, 233 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c

--
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v2 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4
  2025-07-25 18:53     ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
@ 2025-07-25 18:53       ` Mahe Tardy
  2025-07-25 18:53       ` [PATCH bpf-next v2 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-25 18:53 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: andrii, ast, bpf, daniel, john.fastabend, mahe.tardy, martin.lau,
	fw, netfilter-devel, pablo, netdev, coreteam

Move and rename nf_reject_fill_skb_dst from
ipv4/netfilter/nf_reject_ipv4 to ip_route_reply_fetch_dst in
ipv4/route.c so that it can be reused in the following patches by BPF
kfuncs.

Netfilter uses nf_ip_route that is almost a transparent wrapper around
ip_route_output_key so this patch inlines it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/route.h                 |  1 +
 net/ipv4/netfilter/nf_reject_ipv4.c | 19 ++-----------------
 net/ipv4/route.c                    | 15 +++++++++++++++
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 8e39aa822cf9..1f032f768d52 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    const struct sock *sk);
 struct dst_entry *ipv4_blackhole_route(struct net *net,
 				       struct dst_entry *dst_orig);
+int ip_route_reply_fetch_dst(struct sk_buff *skb);

 static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
 {
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index 87fd945a0d27..76beb78f556a 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -220,21 +220,6 @@ void nf_reject_ip_tcphdr_put(struct sk_buff *nskb, const struct sk_buff *oldskb,
 }
 EXPORT_SYMBOL_GPL(nf_reject_ip_tcphdr_put);

-static int nf_reject_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip4.daddr = ip_hdr(skb_in)->saddr;
-	nf_ip_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 /* Send RST reply */
 void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		   int hook)
@@ -248,7 +233,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		return;

 	if ((hook == NF_INET_PRE_ROUTING || hook == NF_INET_INGRESS) &&
-	    nf_reject_fill_skb_dst(oldskb) < 0)
+	    ip_route_reply_fetch_dst(oldskb) < 0)
 		return;

 	if (skb_rtable(oldskb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -322,7 +307,7 @@ void nf_send_unreach(struct sk_buff *skb_in, int code, int hook)
 		return;

 	if ((hook == NF_INET_PRE_ROUTING || hook == NF_INET_INGRESS) &&
-	    nf_reject_fill_skb_dst(skb_in) < 0)
+	    ip_route_reply_fetch_dst(skb_in) < 0)
 		return;

 	if (skb_csum_unnecessary(skb_in) ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index fccb05fb3a79..59b8fc3c01c0 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2934,6 +2934,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);

+int ip_route_reply_fetch_dst(struct sk_buff *skb)
+{
+	struct rtable *rt;
+	struct flowi4 fl4 = {
+		.daddr = ip_hdr(skb)->saddr
+	};
+
+	rt = ip_route_output_key(dev_net(skb->dev), &fl4);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+	skb_dst_set(skb, &rt->dst);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);
+
 /* called with rcu_read_lock held */
 static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 			struct rtable *rt, u32 table_id, dscp_t dscp,
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v2 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  2025-07-25 18:53     ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
  2025-07-25 18:53       ` [PATCH bpf-next v2 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2025-07-25 18:53       ` Mahe Tardy
  2025-07-25 18:53       ` [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
  2025-07-25 18:53       ` [PATCH bpf-next v2 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
  3 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-25 18:53 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: andrii, ast, bpf, daniel, john.fastabend, mahe.tardy, martin.lau,
	fw, netfilter-devel, pablo, netdev, coreteam

Move and rename nf_reject6_fill_skb_dst from
ipv6/netfilter/nf_reject_ipv6 to ip6_route_reply_fetch_dst in
ipv6/route.c so that it can be reused in the following patches by BPF
kfuncs.

Netfilter uses nf_ip6_route that is almost a transparent wrapper around
ip6_route_outputy so this patch inlines it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/ip6_route.h             |  2 ++
 net/ipv6/netfilter/nf_reject_ipv6.c | 17 +----------------
 net/ipv6/route.c                    | 18 ++++++++++++++++++
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 6dbdf60b342f..1426467df547 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -93,6 +93,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
 	return ip6_route_output_flags(net, sk, fl6, 0);
 }

+int ip6_route_reply_fetch_dst(struct sk_buff *skb);
+
 /* Only conditionally release dst if flags indicates
  * !RT6_LOOKUP_F_DST_NOREF or dst is in uncached_list.
  */
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index 9ae2b2725bf9..994a3b88ac52 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -250,21 +250,6 @@ void nf_reject_ip6_tcphdr_put(struct sk_buff *nskb,
 }
 EXPORT_SYMBOL_GPL(nf_reject_ip6_tcphdr_put);

-static int nf_reject6_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip6.daddr = ipv6_hdr(skb_in)->saddr;
-	nf_ip6_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		    int hook)
 {
@@ -398,7 +383,7 @@ void nf_send_unreach6(struct net *net, struct sk_buff *skb_in,
 		skb_in->dev = net->loopback_dev;

 	if ((hooknum == NF_INET_PRE_ROUTING || hooknum == NF_INET_INGRESS) &&
-	    nf_reject6_fill_skb_dst(skb_in) < 0)
+	    ip6_route_reply_fetch_dst(skb_in) < 0)
 		return;

 	icmpv6_send(skb_in, ICMPV6_DEST_UNREACH, code, 0);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0d5464c64965..de61540f9524 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2705,6 +2705,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
 }
 EXPORT_SYMBOL_GPL(ip6_route_output_flags);

+int ip6_route_reply_fetch_dst(struct sk_buff *skb)
+{
+	struct dst_entry *result;
+	struct flowi6 fl = {
+		.daddr = ipv6_hdr(skb)->saddr
+	};
+	int err;
+
+	result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
+	err = result->error;
+	if (err)
+		dst_release(result);
+	else
+		skb_dst_set(skb, result);
+	return err;
+}
+EXPORT_SYMBOL_GPL(ip6_route_reply_fetch_dst);
+
 struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_orig)
 {
 	struct rt6_info *rt, *ort = dst_rt6_info(dst_orig);
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-25 18:53     ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
  2025-07-25 18:53       ` [PATCH bpf-next v2 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
  2025-07-25 18:53       ` [PATCH bpf-next v2 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2025-07-25 18:53       ` Mahe Tardy
  2025-07-27  1:49         ` kernel test robot
  2025-07-25 18:53       ` [PATCH bpf-next v2 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
  3 siblings, 1 reply; 55+ messages in thread
From: Mahe Tardy @ 2025-07-25 18:53 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: andrii, ast, bpf, daniel, john.fastabend, mahe.tardy, martin.lau,
	fw, netfilter-devel, pablo, netdev, coreteam

This is needed in the context of Tetragon to provide improved feedback
(in contrast to just dropping packets) to east-west traffic when blocked
by policies using cgroup_skb programs.

This reuse concepts from netfilter reject target codepath with the
differences that:
* Packets are cloned since the BPF user can still return SK_PASS from
  the cgroup_skb progs and the current skb need to stay untouched
  (cgroup_skb hooks only allow read-only skb payload).
* Since cgroup_skb programs are called late in the stack, checksums do
  not need to be computed or verified, and IPv4 fragmentation does not
  need to be checked (ip_local_deliver should take care of that
  earlier).

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 net/core/filter.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 7a72f766aacf..ff7b2b175425 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,8 @@
 #include <linux/un.h>
 #include <net/xdp_sock_drv.h>
 #include <net/inet_dscp.h>
+#include <linux/icmp.h>
+#include <net/icmp.h>

 #include "dev.h"

@@ -12148,6 +12150,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
 	return 0;
 }

+__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
+{
+	struct sk_buff *skb = (struct sk_buff *)__skb;
+	struct sk_buff *nskb;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		if (code < 0 || code > NR_ICMP_UNREACH)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (ip_route_reply_fetch_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
+		kfree_skb(nskb);
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case htons(ETH_P_IPV6):
+		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (ip6_route_reply_fetch_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
+		kfree_skb(nskb);
+		break;
+#endif
+	default:
+		return -EPROTONOSUPPORT;
+	}
+
+	return SK_DROP;
+}
+
 __bpf_kfunc_end_defs();

 int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12185,6 +12234,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
 BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)

+BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send_unreach)
+BTF_ID_FLAGS(func, bpf_icmp_send_unreach, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send_unreach)
+
 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
 	.owner = THIS_MODULE,
 	.set = &bpf_kfunc_check_set_skb,
@@ -12210,6 +12263,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
 	.set = &bpf_kfunc_check_set_sock_ops,
 };

+static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send_unreach = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_icmp_send_unreach,
+};
+
 static int __init bpf_kfunc_init(void)
 {
 	int ret;
@@ -12229,6 +12287,7 @@ static int __init bpf_kfunc_init(void)
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 					       &bpf_kfunc_set_sock_addr);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send_unreach);
 	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
 }
 late_initcall(bpf_kfunc_init);
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v2 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-25 18:53     ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
                         ` (2 preceding siblings ...)
  2025-07-25 18:53       ` [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
@ 2025-07-25 18:53       ` Mahe Tardy
  3 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-25 18:53 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: andrii, ast, bpf, daniel, john.fastabend, mahe.tardy, martin.lau,
	fw, netfilter-devel, pablo, netdev, coreteam

This test opens a server and client, attach a cgroup_skb program on
egress and calls the icmp_send_unreach function from the client egress
so that an ICMP unreach control message is sent back to the client.
It then fetches the message from the error queue to confirm the correct
ICMP unreach code has been sent.

Note that the BPF program returns SK_PASS to let the connection being
established to finish the test cases quicker. Otherwise, you have to
wait for the TCP three-way handshake to timeout in the kernel and
retrieve the errno translated from the unreach code set by the ICMP
control message.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 99 +++++++++++++++++++
 .../selftests/bpf/progs/icmp_send_unreach.c   | 36 +++++++
 2 files changed, 135 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
new file mode 100644
index 000000000000..414c1ed8ced3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <linux/errqueue.h>
+#include "icmp_send_unreach.skel.h"
+
+#define TIMEOUT_MS 1000
+#define SRV_PORT 54321
+
+#define ICMP_DEST_UNREACH 3
+
+#define ICMP_FRAG_NEEDED 4
+#define NR_ICMP_UNREACH 15
+
+static void read_icmp_errqueue(int sockfd, int expected_code)
+{
+	ssize_t n;
+	struct sock_extended_err *sock_err;
+	struct cmsghdr *cm;
+	char ctrl_buf[512];
+	struct msghdr msg = {
+		.msg_control = ctrl_buf,
+		.msg_controllen = sizeof(ctrl_buf),
+	};
+
+	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
+	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
+		return;
+
+	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
+		    !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
+			continue;
+
+		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
+
+		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
+			       "sock_err_origin_icmp"))
+			return;
+		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+			       "sock_err_type_dest_unreach"))
+			return;
+		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
+	}
+}
+
+void test_icmp_send_unreach_kfunc(void)
+{
+	struct icmp_send_unreach *skel;
+	int cgroup_fd = -1, client_fd = 1, srv_fd = -1;
+	int *code;
+
+	skel = icmp_send_unreach__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		goto cleanup;
+
+	skel->links.egress =
+		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
+		goto cleanup;
+
+	code = &skel->bss->unreach_code;
+
+	for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
+		// The TCP stack reacts differently when asking for
+		// fragmentation, let's ignore it for now
+		if (*code == ICMP_FRAG_NEEDED)
+			continue;
+
+		skel->bss->kfunc_ret = -1;
+
+		srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
+				      SRV_PORT, TIMEOUT_MS);
+		if (!ASSERT_GE(srv_fd, 0, "start_server"))
+			goto for_cleanup;
+
+		client_fd = socket(AF_INET, SOCK_STREAM, 0);
+		ASSERT_GE(client_fd, 0, "client_socket");
+
+		client_fd = connect_to_fd(srv_fd, 0);
+		if (!ASSERT_GE(client_fd, 0, "client_connect"))
+			goto for_cleanup;
+
+		read_icmp_errqueue(client_fd, *code);
+
+		ASSERT_EQ(skel->bss->kfunc_ret, SK_DROP, "kfunc_ret");
+for_cleanup:
+		close(client_fd);
+		close(srv_fd);
+	}
+
+cleanup:
+	icmp_send_unreach__destroy(skel);
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
new file mode 100644
index 000000000000..15783e5d1d65
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
+
+int unreach_code = 0;
+int kfunc_ret = 0;
+
+#define SERVER_PORT 54321
+#define SERVER_IP 0x7F000001
+
+SEC("cgroup_skb/egress")
+int egress(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph;
+	struct tcphdr *tcph;
+
+	iph = data;
+	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
+	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+		return SK_PASS;
+
+	tcph = (void *)iph + iph->ihl * 4;
+	if ((void *)(tcph + 1) > data_end ||
+	    tcph->dest != bpf_htons(SERVER_PORT))
+		return SK_PASS;
+
+	kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
+
+	/* returns SK_PASS to execute the test case quicker */
+	return SK_PASS;
+}
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-25 18:53       ` [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
@ 2025-07-27  1:49         ` kernel test robot
  2025-07-28  9:43           ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
  0 siblings, 1 reply; 55+ messages in thread
From: kernel test robot @ 2025-07-27  1:49 UTC (permalink / raw)
  To: Mahe Tardy, alexei.starovoitov
  Cc: oe-kbuild-all, andrii, ast, bpf, daniel, john.fastabend,
	mahe.tardy, martin.lau, fw, netfilter-devel, pablo, netdev,
	coreteam

Hi Mahe,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Mahe-Tardy/net-move-netfilter-nf_reject_fill_skb_dst-to-core-ipv4/20250726-030109
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250725185342.262067-4-mahe.tardy%40gmail.com
patch subject: [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
config: arm64-defconfig (https://download.01.org/0day-ci/archive/20250727/202507270940.kXGmRbg5-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250727/202507270940.kXGmRbg5-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507270940.kXGmRbg5-lkp@intel.com/

All errors (new ones prefixed by >>):

   aarch64-linux-ld: Unexpected GOT/PLT entries detected!
   aarch64-linux-ld: Unexpected run-time procedure linkages detected!
   aarch64-linux-ld: net/core/filter.o: in function `bpf_icmp_send_unreach':
>> net/core/filter.c:12184:(.text+0x14574): undefined reference to `ip6_route_reply_fetch_dst'


vim +12184 net/core/filter.c

 12152	
 12153	__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
 12154	{
 12155		struct sk_buff *skb = (struct sk_buff *)__skb;
 12156		struct sk_buff *nskb;
 12157	
 12158		switch (skb->protocol) {
 12159		case htons(ETH_P_IP):
 12160			if (code < 0 || code > NR_ICMP_UNREACH)
 12161				return -EINVAL;
 12162	
 12163			nskb = skb_clone(skb, GFP_ATOMIC);
 12164			if (!nskb)
 12165				return -ENOMEM;
 12166	
 12167			if (ip_route_reply_fetch_dst(nskb) < 0) {
 12168				kfree_skb(nskb);
 12169				return -EHOSTUNREACH;
 12170			}
 12171	
 12172			icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
 12173			kfree_skb(nskb);
 12174			break;
 12175	#if IS_ENABLED(CONFIG_IPV6)
 12176		case htons(ETH_P_IPV6):
 12177			if (code < 0 || code > ICMPV6_REJECT_ROUTE)
 12178				return -EINVAL;
 12179	
 12180			nskb = skb_clone(skb, GFP_ATOMIC);
 12181			if (!nskb)
 12182				return -ENOMEM;
 12183	
 12184			if (ip6_route_reply_fetch_dst(nskb) < 0) {
 12185				kfree_skb(nskb);
 12186				return -EHOSTUNREACH;
 12187			}
 12188	
 12189			icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
 12190			kfree_skb(nskb);
 12191			break;
 12192	#endif
 12193		default:
 12194			return -EPROTONOSUPPORT;
 12195		}
 12196	
 12197		return SK_DROP;
 12198	}
 12199	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc
  2025-07-27  1:49         ` kernel test robot
@ 2025-07-28  9:43           ` Mahe Tardy
  2025-07-28  9:43             ` [PATCH bpf-next v3 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
                               ` (4 more replies)
  0 siblings, 5 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-28  9:43 UTC (permalink / raw)
  To: lkp
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, mahe.tardy, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

Hello,

This is v3 of adding the icmp_send_unreach kfunc, as suggested during
LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
actively reject east-west traffic, similarly to what is possible to do
with netfilter reject target.

The first step to implement this is using ICMP control messages, with
the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
than a TCP RST reply and will already hint the client TCP stack to abort
the connection and not retry extensively.

Note that this is different than the sock_destroy kfunc, that along
calls tcp_abort and thus sends a reset, destroying the underlying
socket.

Caveats of this kfunc design are that a cgroup_skb program can call this
function N times, thus send N ICMP unreach control messages and that the
program can return from the BPF filter with SK_PASS leading to a
potential confusing situation where the TCP connection was established
while the client received ICMP_DEST_UNREACH messages.

Another more sophisticated design idea would be for the kfunc to set the
kernel to send an ICMP_HOST_UNREACH control message with the appropriate
code when the cgroup_skb program terminates with SK_DROP. Creating a new
'SK_REJECT' return code for cgroup_skb program was generally rejected
and would be too limited for other program types support.

We should bear in mind that we want to add a TCP reset kfunc next and
also could extend this kfunc to other program types if wanted.

v2 updates:
- fix a build error from a missing function call rename;
- avoid changing return line in bpf_kfunc_init;
- return SK_DROP from the kfunc (similarly to bpf_redirect);
- check the return value in the selftest.

v3 update:
- fix an undefined reference build error.

[^1]: https://lwn.net/Articles/1022034/

Mahe Tardy (4):
  net: move netfilter nf_reject_fill_skb_dst to core ipv4
  net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  selftests/bpf: add icmp_send_unreach kfunc tests

 include/net/ip6_route.h                       |  2 +
 include/net/route.h                           |  1 +
 net/core/filter.c                             | 61 ++++++++++++
 net/ipv4/netfilter/nf_reject_ipv4.c           | 19 +---
 net/ipv4/route.c                              | 15 +++
 net/ipv6/netfilter/nf_reject_ipv6.c           | 17 +---
 net/ipv6/route.c                              | 18 ++++
 .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 99 +++++++++++++++++++
 .../selftests/bpf/progs/icmp_send_unreach.c   | 36 +++++++
 9 files changed, 235 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c

--
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v3 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4
  2025-07-28  9:43           ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
@ 2025-07-28  9:43             ` Mahe Tardy
  2025-07-28  9:43             ` [PATCH bpf-next v3 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
                               ` (3 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-28  9:43 UTC (permalink / raw)
  To: lkp
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, mahe.tardy, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

Move and rename nf_reject_fill_skb_dst from
ipv4/netfilter/nf_reject_ipv4 to ip_route_reply_fetch_dst in
ipv4/route.c so that it can be reused in the following patches by BPF
kfuncs.

Netfilter uses nf_ip_route that is almost a transparent wrapper around
ip_route_output_key so this patch inlines it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/route.h                 |  1 +
 net/ipv4/netfilter/nf_reject_ipv4.c | 19 ++-----------------
 net/ipv4/route.c                    | 15 +++++++++++++++
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 8e39aa822cf9..1f032f768d52 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    const struct sock *sk);
 struct dst_entry *ipv4_blackhole_route(struct net *net,
 				       struct dst_entry *dst_orig);
+int ip_route_reply_fetch_dst(struct sk_buff *skb);

 static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
 {
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index 87fd945a0d27..76beb78f556a 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -220,21 +220,6 @@ void nf_reject_ip_tcphdr_put(struct sk_buff *nskb, const struct sk_buff *oldskb,
 }
 EXPORT_SYMBOL_GPL(nf_reject_ip_tcphdr_put);

-static int nf_reject_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip4.daddr = ip_hdr(skb_in)->saddr;
-	nf_ip_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 /* Send RST reply */
 void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		   int hook)
@@ -248,7 +233,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		return;

 	if ((hook == NF_INET_PRE_ROUTING || hook == NF_INET_INGRESS) &&
-	    nf_reject_fill_skb_dst(oldskb) < 0)
+	    ip_route_reply_fetch_dst(oldskb) < 0)
 		return;

 	if (skb_rtable(oldskb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -322,7 +307,7 @@ void nf_send_unreach(struct sk_buff *skb_in, int code, int hook)
 		return;

 	if ((hook == NF_INET_PRE_ROUTING || hook == NF_INET_INGRESS) &&
-	    nf_reject_fill_skb_dst(skb_in) < 0)
+	    ip_route_reply_fetch_dst(skb_in) < 0)
 		return;

 	if (skb_csum_unnecessary(skb_in) ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index fccb05fb3a79..59b8fc3c01c0 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2934,6 +2934,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);

+int ip_route_reply_fetch_dst(struct sk_buff *skb)
+{
+	struct rtable *rt;
+	struct flowi4 fl4 = {
+		.daddr = ip_hdr(skb)->saddr
+	};
+
+	rt = ip_route_output_key(dev_net(skb->dev), &fl4);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+	skb_dst_set(skb, &rt->dst);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);
+
 /* called with rcu_read_lock held */
 static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 			struct rtable *rt, u32 table_id, dscp_t dscp,
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v3 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  2025-07-28  9:43           ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
  2025-07-28  9:43             ` [PATCH bpf-next v3 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2025-07-28  9:43             ` Mahe Tardy
  2025-07-28  9:43             ` [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
                               ` (2 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-28  9:43 UTC (permalink / raw)
  To: lkp
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, mahe.tardy, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

Move and rename nf_reject6_fill_skb_dst from
ipv6/netfilter/nf_reject_ipv6 to ip6_route_reply_fetch_dst in
ipv6/route.c so that it can be reused in the following patches by BPF
kfuncs.

Netfilter uses nf_ip6_route that is almost a transparent wrapper around
ip6_route_outputy so this patch inlines it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/ip6_route.h             |  2 ++
 net/ipv6/netfilter/nf_reject_ipv6.c | 17 +----------------
 net/ipv6/route.c                    | 18 ++++++++++++++++++
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 6dbdf60b342f..1426467df547 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -93,6 +93,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
 	return ip6_route_output_flags(net, sk, fl6, 0);
 }

+int ip6_route_reply_fetch_dst(struct sk_buff *skb);
+
 /* Only conditionally release dst if flags indicates
  * !RT6_LOOKUP_F_DST_NOREF or dst is in uncached_list.
  */
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index 9ae2b2725bf9..994a3b88ac52 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -250,21 +250,6 @@ void nf_reject_ip6_tcphdr_put(struct sk_buff *nskb,
 }
 EXPORT_SYMBOL_GPL(nf_reject_ip6_tcphdr_put);

-static int nf_reject6_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip6.daddr = ipv6_hdr(skb_in)->saddr;
-	nf_ip6_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		    int hook)
 {
@@ -398,7 +383,7 @@ void nf_send_unreach6(struct net *net, struct sk_buff *skb_in,
 		skb_in->dev = net->loopback_dev;

 	if ((hooknum == NF_INET_PRE_ROUTING || hooknum == NF_INET_INGRESS) &&
-	    nf_reject6_fill_skb_dst(skb_in) < 0)
+	    ip6_route_reply_fetch_dst(skb_in) < 0)
 		return;

 	icmpv6_send(skb_in, ICMPV6_DEST_UNREACH, code, 0);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0d5464c64965..de61540f9524 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2705,6 +2705,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
 }
 EXPORT_SYMBOL_GPL(ip6_route_output_flags);

+int ip6_route_reply_fetch_dst(struct sk_buff *skb)
+{
+	struct dst_entry *result;
+	struct flowi6 fl = {
+		.daddr = ipv6_hdr(skb)->saddr
+	};
+	int err;
+
+	result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
+	err = result->error;
+	if (err)
+		dst_release(result);
+	else
+		skb_dst_set(skb, result);
+	return err;
+}
+EXPORT_SYMBOL_GPL(ip6_route_reply_fetch_dst);
+
 struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_orig)
 {
 	struct rt6_info *rt, *ort = dst_rt6_info(dst_orig);
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-28  9:43           ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
  2025-07-28  9:43             ` [PATCH bpf-next v3 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
  2025-07-28  9:43             ` [PATCH bpf-next v3 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2025-07-28  9:43             ` Mahe Tardy
  2025-07-28 20:10               ` kernel test robot
  2025-07-29  1:05               ` Martin KaFai Lau
  2025-07-28  9:43             ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
  2025-07-29  1:21             ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Martin KaFai Lau
  4 siblings, 2 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-28  9:43 UTC (permalink / raw)
  To: lkp
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, mahe.tardy, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

This is needed in the context of Tetragon to provide improved feedback
(in contrast to just dropping packets) to east-west traffic when blocked
by policies using cgroup_skb programs.

This reuse concepts from netfilter reject target codepath with the
differences that:
* Packets are cloned since the BPF user can still return SK_PASS from
  the cgroup_skb progs and the current skb need to stay untouched
  (cgroup_skb hooks only allow read-only skb payload).
* Since cgroup_skb programs are called late in the stack, checksums do
  not need to be computed or verified, and IPv4 fragmentation does not
  need to be checked (ip_local_deliver should take care of that
  earlier).

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 net/core/filter.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 7a72f766aacf..050872324575 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,10 @@
 #include <linux/un.h>
 #include <net/xdp_sock_drv.h>
 #include <net/inet_dscp.h>
+#include <linux/icmp.h>
+#include <net/icmp.h>
+#include <net/route.h>
+#include <net/ip6_route.h>

 #include "dev.h"

@@ -12148,6 +12152,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
 	return 0;
 }

+__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
+{
+	struct sk_buff *skb = (struct sk_buff *)__skb;
+	struct sk_buff *nskb;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		if (code < 0 || code > NR_ICMP_UNREACH)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (ip_route_reply_fetch_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
+		kfree_skb(nskb);
+		break;
+#if IS_ENABLED(CONFIG_IPV6)
+	case htons(ETH_P_IPV6):
+		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (ip6_route_reply_fetch_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
+		kfree_skb(nskb);
+		break;
+#endif
+	default:
+		return -EPROTONOSUPPORT;
+	}
+
+	return SK_DROP;
+}
+
 __bpf_kfunc_end_defs();

 int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12185,6 +12236,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
 BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)

+BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send_unreach)
+BTF_ID_FLAGS(func, bpf_icmp_send_unreach, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send_unreach)
+
 static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
 	.owner = THIS_MODULE,
 	.set = &bpf_kfunc_check_set_skb,
@@ -12210,6 +12265,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
 	.set = &bpf_kfunc_check_set_sock_ops,
 };

+static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send_unreach = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_icmp_send_unreach,
+};
+
 static int __init bpf_kfunc_init(void)
 {
 	int ret;
@@ -12229,6 +12289,7 @@ static int __init bpf_kfunc_init(void)
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
 					       &bpf_kfunc_set_sock_addr);
 	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send_unreach);
 	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
 }
 late_initcall(bpf_kfunc_init);
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-28  9:43           ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
                               ` (2 preceding siblings ...)
  2025-07-28  9:43             ` [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
@ 2025-07-28  9:43             ` Mahe Tardy
  2025-07-28 15:40               ` Yonghong Song
                                 ` (2 more replies)
  2025-07-29  1:21             ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Martin KaFai Lau
  4 siblings, 3 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-28  9:43 UTC (permalink / raw)
  To: lkp
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, mahe.tardy, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

This test opens a server and client, attach a cgroup_skb program on
egress and calls the icmp_send_unreach function from the client egress
so that an ICMP unreach control message is sent back to the client.
It then fetches the message from the error queue to confirm the correct
ICMP unreach code has been sent.

Note that the BPF program returns SK_PASS to let the connection being
established to finish the test cases quicker. Otherwise, you have to
wait for the TCP three-way handshake to timeout in the kernel and
retrieve the errno translated from the unreach code set by the ICMP
control message.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 99 +++++++++++++++++++
 .../selftests/bpf/progs/icmp_send_unreach.c   | 36 +++++++
 2 files changed, 135 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
new file mode 100644
index 000000000000..414c1ed8ced3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <linux/errqueue.h>
+#include "icmp_send_unreach.skel.h"
+
+#define TIMEOUT_MS 1000
+#define SRV_PORT 54321
+
+#define ICMP_DEST_UNREACH 3
+
+#define ICMP_FRAG_NEEDED 4
+#define NR_ICMP_UNREACH 15
+
+static void read_icmp_errqueue(int sockfd, int expected_code)
+{
+	ssize_t n;
+	struct sock_extended_err *sock_err;
+	struct cmsghdr *cm;
+	char ctrl_buf[512];
+	struct msghdr msg = {
+		.msg_control = ctrl_buf,
+		.msg_controllen = sizeof(ctrl_buf),
+	};
+
+	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
+	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
+		return;
+
+	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
+		    !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
+			continue;
+
+		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
+
+		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
+			       "sock_err_origin_icmp"))
+			return;
+		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+			       "sock_err_type_dest_unreach"))
+			return;
+		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
+	}
+}
+
+void test_icmp_send_unreach_kfunc(void)
+{
+	struct icmp_send_unreach *skel;
+	int cgroup_fd = -1, client_fd = 1, srv_fd = -1;
+	int *code;
+
+	skel = icmp_send_unreach__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		goto cleanup;
+
+	skel->links.egress =
+		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
+		goto cleanup;
+
+	code = &skel->bss->unreach_code;
+
+	for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
+		// The TCP stack reacts differently when asking for
+		// fragmentation, let's ignore it for now
+		if (*code == ICMP_FRAG_NEEDED)
+			continue;
+
+		skel->bss->kfunc_ret = -1;
+
+		srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
+				      SRV_PORT, TIMEOUT_MS);
+		if (!ASSERT_GE(srv_fd, 0, "start_server"))
+			goto for_cleanup;
+
+		client_fd = socket(AF_INET, SOCK_STREAM, 0);
+		ASSERT_GE(client_fd, 0, "client_socket");
+
+		client_fd = connect_to_fd(srv_fd, 0);
+		if (!ASSERT_GE(client_fd, 0, "client_connect"))
+			goto for_cleanup;
+
+		read_icmp_errqueue(client_fd, *code);
+
+		ASSERT_EQ(skel->bss->kfunc_ret, SK_DROP, "kfunc_ret");
+for_cleanup:
+		close(client_fd);
+		close(srv_fd);
+	}
+
+cleanup:
+	icmp_send_unreach__destroy(skel);
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
new file mode 100644
index 000000000000..15783e5d1d65
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
+
+int unreach_code = 0;
+int kfunc_ret = 0;
+
+#define SERVER_PORT 54321
+#define SERVER_IP 0x7F000001
+
+SEC("cgroup_skb/egress")
+int egress(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph;
+	struct tcphdr *tcph;
+
+	iph = data;
+	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
+	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+		return SK_PASS;
+
+	tcph = (void *)iph + iph->ihl * 4;
+	if ((void *)(tcph + 1) > data_end ||
+	    tcph->dest != bpf_htons(SERVER_PORT))
+		return SK_PASS;
+
+	kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
+
+	/* returns SK_PASS to execute the test case quicker */
+	return SK_PASS;
+}
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-28  9:43             ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
@ 2025-07-28 15:40               ` Yonghong Song
  2025-07-28 15:59                 ` Mahe Tardy
  2025-07-29  1:18               ` Martin KaFai Lau
  2025-08-05 23:26               ` Jordan Rife
  2 siblings, 1 reply; 55+ messages in thread
From: Yonghong Song @ 2025-07-28 15:40 UTC (permalink / raw)
  To: Mahe Tardy, lkp
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo



On 7/28/25 2:43 AM, Mahe Tardy wrote:
> This test opens a server and client, attach a cgroup_skb program on
> egress and calls the icmp_send_unreach function from the client egress
> so that an ICMP unreach control message is sent back to the client.
> It then fetches the message from the error queue to confirm the correct
> ICMP unreach code has been sent.
>
> Note that the BPF program returns SK_PASS to let the connection being
> established to finish the test cases quicker. Otherwise, you have to
> wait for the TCP three-way handshake to timeout in the kernel and
> retrieve the errno translated from the unreach code set by the ICMP
> control message.
>
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
>   .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 99 +++++++++++++++++++
>   .../selftests/bpf/progs/icmp_send_unreach.c   | 36 +++++++
>   2 files changed, 135 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
>   create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> new file mode 100644
> index 000000000000..414c1ed8ced3
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> @@ -0,0 +1,99 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <linux/errqueue.h>
> +#include "icmp_send_unreach.skel.h"
> +
> +#define TIMEOUT_MS 1000
> +#define SRV_PORT 54321
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +#define ICMP_FRAG_NEEDED 4
> +#define NR_ICMP_UNREACH 15
> +
> +static void read_icmp_errqueue(int sockfd, int expected_code)
> +{
> +	ssize_t n;
> +	struct sock_extended_err *sock_err;
> +	struct cmsghdr *cm;
> +	char ctrl_buf[512];
> +	struct msghdr msg = {
> +		.msg_control = ctrl_buf,
> +		.msg_controllen = sizeof(ctrl_buf),
> +	};
> +
> +	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
> +	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
> +		return;
> +
> +	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
> +		if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
> +		    !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
> +			continue;
> +
> +		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
> +
> +		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
> +			       "sock_err_origin_icmp"))
> +			return;
> +		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
> +			       "sock_err_type_dest_unreach"))
> +			return;
> +		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
> +	}
> +}
> +
> +void test_icmp_send_unreach_kfunc(void)
> +{
> +	struct icmp_send_unreach *skel;
> +	int cgroup_fd = -1, client_fd = 1, srv_fd = -1;

Should set client_fd = -1? See below ...

> +	int *code;
> +
> +	skel = icmp_send_unreach__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		goto cleanup;
> +
> +	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
> +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> +		goto cleanup;
> +
> +	skel->links.egress =
> +		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
> +		goto cleanup;
> +
> +	code = &skel->bss->unreach_code;
> +
> +	for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
> +		// The TCP stack reacts differently when asking for
> +		// fragmentation, let's ignore it for now
> +		if (*code == ICMP_FRAG_NEEDED)
> +			continue;
> +
> +		skel->bss->kfunc_ret = -1;
> +
> +		srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
> +				      SRV_PORT, TIMEOUT_MS);
> +		if (!ASSERT_GE(srv_fd, 0, "start_server"))
> +			goto for_cleanup;

Otherwise if client_fd = 1, goto for_cleanup will close(1).

> +
> +		client_fd = socket(AF_INET, SOCK_STREAM, 0);
> +		ASSERT_GE(client_fd, 0, "client_socket");

The above two lines are not necessary since client_fd is
actually set in the below.

> +
> +		client_fd = connect_to_fd(srv_fd, 0);
> +		if (!ASSERT_GE(client_fd, 0, "client_connect"))
> +			goto for_cleanup;
> +
> +		read_icmp_errqueue(client_fd, *code);
> +
> +		ASSERT_EQ(skel->bss->kfunc_ret, SK_DROP, "kfunc_ret");
> +for_cleanup:
> +		close(client_fd);
> +		close(srv_fd);
> +	}
> +
> +cleanup:
> +	icmp_send_unreach__destroy(skel);
> +	close(cgroup_fd);
> +}
[...]

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-28 15:40               ` Yonghong Song
@ 2025-07-28 15:59                 ` Mahe Tardy
  0 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2025-07-28 15:59 UTC (permalink / raw)
  To: Yonghong Song
  Cc: lkp, alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

On Mon, Jul 28, 2025 at 08:40:49AM -0700, Yonghong Song wrote:
> 
> 
> On 7/28/25 2:43 AM, Mahe Tardy wrote:

[...]

> > +
> > +void test_icmp_send_unreach_kfunc(void)
> > +{
> > +	struct icmp_send_unreach *skel;
> > +	int cgroup_fd = -1, client_fd = 1, srv_fd = -1;
> 
> Should set client_fd = -1? See below ...

Well spotted yes, it's a typo, thank you.

> > +	int *code;
> > +
> > +	skel = icmp_send_unreach__open_and_load();
> > +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> > +		goto cleanup;
> > +
> > +	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
> > +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> > +		goto cleanup;
> > +
> > +	skel->links.egress =
> > +		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
> > +	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
> > +		goto cleanup;
> > +
> > +	code = &skel->bss->unreach_code;
> > +
> > +	for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
> > +		// The TCP stack reacts differently when asking for
> > +		// fragmentation, let's ignore it for now
> > +		if (*code == ICMP_FRAG_NEEDED)
> > +			continue;
> > +
> > +		skel->bss->kfunc_ret = -1;
> > +
> > +		srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
> > +				      SRV_PORT, TIMEOUT_MS);
> > +		if (!ASSERT_GE(srv_fd, 0, "start_server"))
> > +			goto for_cleanup;
> 
> Otherwise if client_fd = 1, goto for_cleanup will close(1).
> 
> > +
> > +		client_fd = socket(AF_INET, SOCK_STREAM, 0);
> > +		ASSERT_GE(client_fd, 0, "client_socket");
> 
> The above two lines are not necessary since client_fd is
> actually set in the below.

Yep, must have been a leftover from when I was discovering the
network_helpers, oops!

> > +
> > +		client_fd = connect_to_fd(srv_fd, 0);
> > +		if (!ASSERT_GE(client_fd, 0, "client_connect"))
> > +			goto for_cleanup;
> > +
> > +		read_icmp_errqueue(client_fd, *code);
> > +
> > +		ASSERT_EQ(skel->bss->kfunc_ret, SK_DROP, "kfunc_ret");
> > +for_cleanup:
> > +		close(client_fd);
> > +		close(srv_fd);
> > +	}
> > +
> > +cleanup:
> > +	icmp_send_unreach__destroy(skel);
> > +	close(cgroup_fd);
> > +}
> [...]

I'm sending a v4 with those fixed + fixing the builds error when IPv6 is
built as a module from the kfunc patch. Thanks for the review.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-28  9:43             ` [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
@ 2025-07-28 20:10               ` kernel test robot
  2025-07-29  1:05               ` Martin KaFai Lau
  1 sibling, 0 replies; 55+ messages in thread
From: kernel test robot @ 2025-07-28 20:10 UTC (permalink / raw)
  To: Mahe Tardy, lkp
  Cc: oe-kbuild-all, alexei.starovoitov, andrii, ast, bpf, coreteam,
	daniel, fw, john.fastabend, mahe.tardy, martin.lau, netdev,
	netfilter-devel, pablo

Hi Mahe,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Mahe-Tardy/net-move-netfilter-nf_reject_fill_skb_dst-to-core-ipv4/20250728-174736
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20250728094345.46132-4-mahe.tardy%40gmail.com
patch subject: [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
config: s390-randconfig-001-20250729 (https://download.01.org/0day-ci/archive/20250729/202507290356.HyFMR3K0-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250729/202507290356.HyFMR3K0-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507290356.HyFMR3K0-lkp@intel.com/

All errors (new ones prefixed by >>):

   s390-linux-ld: net/core/filter.o: in function `bpf_icmp_send_unreach':
   filter.c:(.text+0xf7b8): undefined reference to `ip_route_reply_fetch_dst'
>> s390-linux-ld: filter.c:(.text+0xf7ea): undefined reference to `__icmp_send'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-28  9:43             ` [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
  2025-07-28 20:10               ` kernel test robot
@ 2025-07-29  1:05               ` Martin KaFai Lau
  2025-07-29 10:06                 ` Mahe Tardy
  1 sibling, 1 reply; 55+ messages in thread
From: Martin KaFai Lau @ 2025-07-29  1:05 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On 7/28/25 2:43 AM, Mahe Tardy wrote:
> This is needed in the context of Tetragon to provide improved feedback
> (in contrast to just dropping packets) to east-west traffic when blocked
> by policies using cgroup_skb programs.
> 
> This reuse concepts from netfilter reject target codepath with the
> differences that:
> * Packets are cloned since the BPF user can still return SK_PASS from
>    the cgroup_skb progs and the current skb need to stay untouched

This needs more details. Which field(s) of the skb are changed by the kfunc, the 
skb_dst_set in ip[6]_route_reply_fetch_dst() and/or the code path in the 
icmp[v6]_send() ?

>    (cgroup_skb hooks only allow read-only skb payload).
> * Since cgroup_skb programs are called late in the stack, checksums do
>    not need to be computed or verified, and IPv4 fragmentation does not
>    need to be checked (ip_local_deliver should take care of that
>    earlier).
> 
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
>   net/core/filter.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 61 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7a72f766aacf..050872324575 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -85,6 +85,10 @@
>   #include <linux/un.h>
>   #include <net/xdp_sock_drv.h>
>   #include <net/inet_dscp.h>
> +#include <linux/icmp.h>
> +#include <net/icmp.h>
> +#include <net/route.h>
> +#include <net/ip6_route.h>
> 
>   #include "dev.h"
> 
> @@ -12148,6 +12152,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
>   	return 0;
>   }
> 
> +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)__skb;
> +	struct sk_buff *nskb;
> +
> +	switch (skb->protocol) {
> +	case htons(ETH_P_IP):
> +		if (code < 0 || code > NR_ICMP_UNREACH)
> +			return -EINVAL;
> +
> +		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +
> +		if (ip_route_reply_fetch_dst(nskb) < 0) {
> +			kfree_skb(nskb);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
> +		kfree_skb(nskb);
> +		break;
> +#if IS_ENABLED(CONFIG_IPV6)
> +	case htons(ETH_P_IPV6):
> +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> +			return -EINVAL;
> +
> +		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +
> +		if (ip6_route_reply_fetch_dst(nskb) < 0) {

 From a very quick look at icmpv6_send(), it does its own route lookup. I 
haven't looked at the v4 yet.

I am likely missing some details. Can you explain why it needs to do a lookup 
before calling icmpv6_send()?

> +			kfree_skb(nskb);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
> +		kfree_skb(nskb);
> +		break;
> +#endif
> +	default:
> +		return -EPROTONOSUPPORT;
> +	}
> +
> +	return SK_DROP;
> +}
> +

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-28  9:43             ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
  2025-07-28 15:40               ` Yonghong Song
@ 2025-07-29  1:18               ` Martin KaFai Lau
  2025-07-29  9:09                 ` Mahe Tardy
  2025-08-05 23:26               ` Jordan Rife
  2 siblings, 1 reply; 55+ messages in thread
From: Martin KaFai Lau @ 2025-07-29  1:18 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On 7/28/25 2:43 AM, Mahe Tardy wrote:
> +SEC("cgroup_skb/egress")
> +int egress(struct __sk_buff *skb)
> +{
> +	void *data = (void *)(long)skb->data;
> +	void *data_end = (void *)(long)skb->data_end;
> +	struct iphdr *iph;
> +	struct tcphdr *tcph;
> +
> +	iph = data;
> +	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
> +	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
> +		return SK_PASS;
> +
> +	tcph = (void *)iph + iph->ihl * 4;
> +	if ((void *)(tcph + 1) > data_end ||
> +	    tcph->dest != bpf_htons(SERVER_PORT))
> +		return SK_PASS;
> +
> +	kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
> +
> +	/* returns SK_PASS to execute the test case quicker */

Do you know why the user space is slower if 0 (SK_DROP) is used?

> +	return SK_PASS;



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc
  2025-07-28  9:43           ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
                               ` (3 preceding siblings ...)
  2025-07-28  9:43             ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
@ 2025-07-29  1:21             ` Martin KaFai Lau
  2025-07-29  9:53               ` Mahe Tardy
  4 siblings, 1 reply; 55+ messages in thread
From: Martin KaFai Lau @ 2025-07-29  1:21 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On 7/28/25 2:43 AM, Mahe Tardy wrote:
> Hello,
> 
> This is v3 of adding the icmp_send_unreach kfunc, as suggested during
> LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
> actively reject east-west traffic, similarly to what is possible to do
> with netfilter reject target.
> 
> The first step to implement this is using ICMP control messages, with
> the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
> ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
> than a TCP RST reply and will already hint the client TCP stack to abort
> the connection and not retry extensively.
> 
> Note that this is different than the sock_destroy kfunc, that along
> calls tcp_abort and thus sends a reset, destroying the underlying
> socket.
> 
> Caveats of this kfunc design are that a cgroup_skb program can call this
> function N times, thus send N ICMP unreach control messages and that the
> program can return from the BPF filter with SK_PASS leading to a
> potential confusing situation where the TCP connection was established
> while the client received ICMP_DEST_UNREACH messages.
> 
> Another more sophisticated design idea would be for the kfunc to set the
> kernel to send an ICMP_HOST_UNREACH control message with the appropriate
> code when the cgroup_skb program terminates with SK_DROP. Creating a new
> 'SK_REJECT' return code for cgroup_skb program was generally rejected
> and would be too limited for other program types support.
> 
> We should bear in mind that we want to add a TCP reset kfunc next and
> also could extend this kfunc to other program types if wanted.

Some high level questions.

Which other program types do you need this kfunc to send icmp and the future tcp 
rst?

This cover letter mentioned sending icmp unreach is easier than sending tcp rst. 
What problems do you see in sending tcp rst?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-29  1:18               ` Martin KaFai Lau
@ 2025-07-29  9:09                 ` Mahe Tardy
  2025-07-29 23:27                   ` Martin KaFai Lau
  0 siblings, 1 reply; 55+ messages in thread
From: Mahe Tardy @ 2025-07-29  9:09 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On Mon, Jul 28, 2025 at 06:18:11PM -0700, Martin KaFai Lau wrote:
> On 7/28/25 2:43 AM, Mahe Tardy wrote:
> > +SEC("cgroup_skb/egress")
> > +int egress(struct __sk_buff *skb)
> > +{
> > +	void *data = (void *)(long)skb->data;
> > +	void *data_end = (void *)(long)skb->data_end;
> > +	struct iphdr *iph;
> > +	struct tcphdr *tcph;
> > +
> > +	iph = data;
> > +	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
> > +	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
> > +		return SK_PASS;
> > +
> > +	tcph = (void *)iph + iph->ihl * 4;
> > +	if ((void *)(tcph + 1) > data_end ||
> > +	    tcph->dest != bpf_htons(SERVER_PORT))
> > +		return SK_PASS;
> > +
> > +	kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
> > +
> > +	/* returns SK_PASS to execute the test case quicker */
> 
> Do you know why the user space is slower if 0 (SK_DROP) is used?

I tried to write my understanding of this in the commit description:

"Note that the BPF program returns SK_PASS to let the connection being
established to finish the test cases quicker. Otherwise, you have to
wait for the TCP three-way handshake to timeout in the kernel and
retrieve the errno translated from the unreach code set by the ICMP
control message."

I added this comment because I already had some (offline) feedback that
this looked off, maybe I should develop and put this here directly.

> 
> > +	return SK_PASS;
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc
  2025-07-29  1:21             ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Martin KaFai Lau
@ 2025-07-29  9:53               ` Mahe Tardy
  2025-07-30  1:54                 ` Martin KaFai Lau
  0 siblings, 1 reply; 55+ messages in thread
From: Mahe Tardy @ 2025-07-29  9:53 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On Mon, Jul 28, 2025 at 06:21:50PM -0700, Martin KaFai Lau wrote:
> On 7/28/25 2:43 AM, Mahe Tardy wrote:
> > Hello,
> > 
> > This is v3 of adding the icmp_send_unreach kfunc, as suggested during
> > LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
> > actively reject east-west traffic, similarly to what is possible to do
> > with netfilter reject target.
> > 
> > The first step to implement this is using ICMP control messages, with
> > the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
> > ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
> > than a TCP RST reply and will already hint the client TCP stack to abort
> > the connection and not retry extensively.
> > 
> > Note that this is different than the sock_destroy kfunc, that along
> > calls tcp_abort and thus sends a reset, destroying the underlying
> > socket.
> > 
> > Caveats of this kfunc design are that a cgroup_skb program can call this
> > function N times, thus send N ICMP unreach control messages and that the
> > program can return from the BPF filter with SK_PASS leading to a
> > potential confusing situation where the TCP connection was established
> > while the client received ICMP_DEST_UNREACH messages.
> > 
> > Another more sophisticated design idea would be for the kfunc to set the
> > kernel to send an ICMP_HOST_UNREACH control message with the appropriate
> > code when the cgroup_skb program terminates with SK_DROP. Creating a new
> > 'SK_REJECT' return code for cgroup_skb program was generally rejected
> > and would be too limited for other program types support.
> > 
> > We should bear in mind that we want to add a TCP reset kfunc next and
> > also could extend this kfunc to other program types if wanted.
> 
> Some high level questions.
> 
> Which other program types do you need this kfunc to send icmp and the future
> tcp rst?

I don't really know, I mostly need this in cgroup_skb for my use case
but I could see other programs type using this either for simplification
(for progs that can already rewrite the packet, like tc) or other
programs types like cgroup_skb, because they can't touch the packet
themselves.

> 
> This cover letter mentioned sending icmp unreach is easier than sending tcp
> rst. What problems do you see in sending tcp rst?
> 

Yes, I based these patches on what net/ipv4/netfilter/ipt_REJECT.c's
'reject_tg' function does. In the case of sending ICMP unreach
'nf_send_unreach', the routing step is quite straighforward as they are
only inverting the daddr and the saddr (that's what my renamed/moved
ip_route_reply_fetch_dst helper does).

In the case of sending RST 'nf_send_reset', there are extra steps, first
the same routing mechanism is done by just inverting the daddr and the
saddr but later 'ip_route_me_harder' is called which is doing a lot
more. I'm currently not sure which parts of this must be ported to work
in our BPF use case so I wanted to start with unreach.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-29  1:05               ` Martin KaFai Lau
@ 2025-07-29 10:06                 ` Mahe Tardy
  2025-07-29 23:13                   ` Martin KaFai Lau
  0 siblings, 1 reply; 55+ messages in thread
From: Mahe Tardy @ 2025-07-29 10:06 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On Mon, Jul 28, 2025 at 06:05:26PM -0700, Martin KaFai Lau wrote:
> On 7/28/25 2:43 AM, Mahe Tardy wrote:
> > This is needed in the context of Tetragon to provide improved feedback
> > (in contrast to just dropping packets) to east-west traffic when blocked
> > by policies using cgroup_skb programs.
> > 
> > This reuse concepts from netfilter reject target codepath with the
> > differences that:
> > * Packets are cloned since the BPF user can still return SK_PASS from
> >    the cgroup_skb progs and the current skb need to stay untouched
> 
> This needs more details. Which field(s) of the skb are changed by the kfunc,
> the skb_dst_set in ip[6]_route_reply_fetch_dst() and/or the code path in the
> icmp[v6]_send() ?

Okay I can add that: "ip[6]_route_reply_fetch_dst set the dst of the skb
by using the saddr as a daddr and routing it", I don't think
icmp[v6]_send touches the skb?

> 
> >    (cgroup_skb hooks only allow read-only skb payload).
> > * Since cgroup_skb programs are called late in the stack, checksums do
> >    not need to be computed or verified, and IPv4 fragmentation does not
> >    need to be checked (ip_local_deliver should take care of that
> >    earlier).
> > 
> > Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> > ---
> >   net/core/filter.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
> >   1 file changed, 61 insertions(+)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 7a72f766aacf..050872324575 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -85,6 +85,10 @@
> >   #include <linux/un.h>
> >   #include <net/xdp_sock_drv.h>
> >   #include <net/inet_dscp.h>
> > +#include <linux/icmp.h>
> > +#include <net/icmp.h>
> > +#include <net/route.h>
> > +#include <net/ip6_route.h>
> > 
> >   #include "dev.h"
> > 
> > @@ -12148,6 +12152,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
> >   	return 0;
> >   }
> > 
> > +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
> > +{
> > +	struct sk_buff *skb = (struct sk_buff *)__skb;
> > +	struct sk_buff *nskb;
> > +
> > +	switch (skb->protocol) {
> > +	case htons(ETH_P_IP):
> > +		if (code < 0 || code > NR_ICMP_UNREACH)
> > +			return -EINVAL;
> > +
> > +		nskb = skb_clone(skb, GFP_ATOMIC);
> > +		if (!nskb)
> > +			return -ENOMEM;
> > +
> > +		if (ip_route_reply_fetch_dst(nskb) < 0) {
> > +			kfree_skb(nskb);
> > +			return -EHOSTUNREACH;
> > +		}
> > +
> > +		icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
> > +		kfree_skb(nskb);
> > +		break;
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +	case htons(ETH_P_IPV6):
> > +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > +			return -EINVAL;
> > +
> > +		nskb = skb_clone(skb, GFP_ATOMIC);
> > +		if (!nskb)
> > +			return -ENOMEM;
> > +
> > +		if (ip6_route_reply_fetch_dst(nskb) < 0) {
> 
> From a very quick look at icmpv6_send(), it does its own route lookup. I
> haven't looked at the v4 yet.
> 
> I am likely missing some details. Can you explain why it needs to do a
> lookup before calling icmpv6_send()?

From my understanding, I need to do this to invert the daddr with the
saddr to send the unreach message back to the sender.

> 
> > +			kfree_skb(nskb);
> > +			return -EHOSTUNREACH;
> > +		}
> > +
> > +		icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
> > +		kfree_skb(nskb);
> > +		break;
> > +#endif
> > +	default:
> > +		return -EPROTONOSUPPORT;
> > +	}
> > +
> > +	return SK_DROP;
> > +}
> > +

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
  2025-07-29 10:06                 ` Mahe Tardy
@ 2025-07-29 23:13                   ` Martin KaFai Lau
  0 siblings, 0 replies; 55+ messages in thread
From: Martin KaFai Lau @ 2025-07-29 23:13 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On 7/29/25 3:06 AM, Mahe Tardy wrote:
> On Mon, Jul 28, 2025 at 06:05:26PM -0700, Martin KaFai Lau wrote:
>> On 7/28/25 2:43 AM, Mahe Tardy wrote:
>>> This is needed in the context of Tetragon to provide improved feedback
>>> (in contrast to just dropping packets) to east-west traffic when blocked
>>> by policies using cgroup_skb programs.
>>>
>>> This reuse concepts from netfilter reject target codepath with the
>>> differences that:
>>> * Packets are cloned since the BPF user can still return SK_PASS from
>>>     the cgroup_skb progs and the current skb need to stay untouched
>>
>> This needs more details. Which field(s) of the skb are changed by the kfunc,
>> the skb_dst_set in ip[6]_route_reply_fetch_dst() and/or the code path in the
>> icmp[v6]_send() ?
> 
> Okay I can add that: "ip[6]_route_reply_fetch_dst set the dst of the skb
> by using the saddr as a daddr and routing it", I don't think
> icmp[v6]_send touches the skb?

I also don't think icmp[v6]_send touches the skb. I am still not sure if 
ip[6]_route_reply_fetch_dst is needed.

> 
>>
>>>     (cgroup_skb hooks only allow read-only skb payload).
>>> * Since cgroup_skb programs are called late in the stack, checksums do
>>>     not need to be computed or verified, and IPv4 fragmentation does not
>>>     need to be checked (ip_local_deliver should take care of that
>>>     earlier).
>>>
>>> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
>>> ---
>>>    net/core/filter.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 61 insertions(+)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 7a72f766aacf..050872324575 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -85,6 +85,10 @@
>>>    #include <linux/un.h>
>>>    #include <net/xdp_sock_drv.h>
>>>    #include <net/inet_dscp.h>
>>> +#include <linux/icmp.h>
>>> +#include <net/icmp.h>
>>> +#include <net/route.h>
>>> +#include <net/ip6_route.h>
>>>
>>>    #include "dev.h"
>>>
>>> @@ -12148,6 +12152,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
>>>    	return 0;
>>>    }
>>>
>>> +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
>>> +{
>>> +	struct sk_buff *skb = (struct sk_buff *)__skb;
>>> +	struct sk_buff *nskb;
>>> +
>>> +	switch (skb->protocol) {
>>> +	case htons(ETH_P_IP):
>>> +		if (code < 0 || code > NR_ICMP_UNREACH)
>>> +			return -EINVAL;
>>> +
>>> +		nskb = skb_clone(skb, GFP_ATOMIC);
>>> +		if (!nskb)
>>> +			return -ENOMEM;
>>> +
>>> +		if (ip_route_reply_fetch_dst(nskb) < 0) {
>>> +			kfree_skb(nskb);
>>> +			return -EHOSTUNREACH;
>>> +		}
>>> +
>>> +		icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
>>> +		kfree_skb(nskb);
>>> +		break;
>>> +#if IS_ENABLED(CONFIG_IPV6)
>>> +	case htons(ETH_P_IPV6):
>>> +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
>>> +			return -EINVAL;
>>> +
>>> +		nskb = skb_clone(skb, GFP_ATOMIC);
>>> +		if (!nskb)
>>> +			return -ENOMEM;
>>> +
>>> +		if (ip6_route_reply_fetch_dst(nskb) < 0) {
>>
>>  From a very quick look at icmpv6_send(), it does its own route lookup. I
>> haven't looked at the v4 yet.
>>
>> I am likely missing some details. Can you explain why it needs to do a
>> lookup before calling icmpv6_send()?
> 
>  From my understanding, I need to do this to invert the daddr with the
> saddr to send the unreach message back to the sender.

 From looking at how fl6.{daddr,saddr} are filled and passed to 
icmpv6_route_lookup in icmpv6_send(), the icmpv6_send() should have done the 
reverse/invert route lookup. I also don't see icmpv6_send uses the skb_dst() of 
the original skb. Did I misread the code? The kfunc does not work without 
ip[6]_route_reply_fetch_dst()? Again, I have not checked the v4 icmp_send. fwiw, 
the selftest should have both v4 and v6 test.

Note that at cgroup/egress, the skb->_skb_refdst should have been set.

The same should be true for cgroup/ingress for inet proto but it seems 
BPF_CGROUP_RUN_PROG_"INET"_INGRESS is not called from INET only now. e.g. 
sk_filter() can be called from af_netlink. It seems like there is a bug.

> 
>>
>>> +			kfree_skb(nskb);
>>> +			return -EHOSTUNREACH;
>>> +		}
>>> +
>>> +		icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
>>> +		kfree_skb(nskb);
>>> +		break;
>>> +#endif
>>> +	default:
>>> +		return -EPROTONOSUPPORT;
>>> +	}
>>> +
>>> +	return SK_DROP;
>>> +}
>>> +
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-29  9:09                 ` Mahe Tardy
@ 2025-07-29 23:27                   ` Martin KaFai Lau
  2025-07-30  0:01                     ` Martin KaFai Lau
  0 siblings, 1 reply; 55+ messages in thread
From: Martin KaFai Lau @ 2025-07-29 23:27 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On 7/29/25 2:09 AM, Mahe Tardy wrote:
> On Mon, Jul 28, 2025 at 06:18:11PM -0700, Martin KaFai Lau wrote:
>> On 7/28/25 2:43 AM, Mahe Tardy wrote:
>>> +SEC("cgroup_skb/egress")
>>> +int egress(struct __sk_buff *skb)
>>> +{
>>> +	void *data = (void *)(long)skb->data;
>>> +	void *data_end = (void *)(long)skb->data_end;
>>> +	struct iphdr *iph;
>>> +	struct tcphdr *tcph;
>>> +
>>> +	iph = data;
>>> +	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
>>> +	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
>>> +		return SK_PASS;
>>> +
>>> +	tcph = (void *)iph + iph->ihl * 4;
>>> +	if ((void *)(tcph + 1) > data_end ||
>>> +	    tcph->dest != bpf_htons(SERVER_PORT))
>>> +		return SK_PASS;
>>> +
>>> +	kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
>>> +
>>> +	/* returns SK_PASS to execute the test case quicker */
>>
>> Do you know why the user space is slower if 0 (SK_DROP) is used?
> 
> I tried to write my understanding of this in the commit description:
> 
> "Note that the BPF program returns SK_PASS to let the connection being
> established to finish the test cases quicker. Otherwise, you have to
> wait for the TCP three-way handshake to timeout in the kernel and
> retrieve the errno translated from the unreach code set by the ICMP
> control message."

This feels like a bit hacky to let the 3WHS finished while the objective of the 
patch set is to drop it. It is not unusual for people to directly borrow this 
code. Does non blocking connect() help?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-29 23:27                   ` Martin KaFai Lau
@ 2025-07-30  0:01                     ` Martin KaFai Lau
  2025-07-30  0:32                       ` Martin KaFai Lau
  0 siblings, 1 reply; 55+ messages in thread
From: Martin KaFai Lau @ 2025-07-30  0:01 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On 7/29/25 4:27 PM, Martin KaFai Lau wrote:
> On 7/29/25 2:09 AM, Mahe Tardy wrote:
>> On Mon, Jul 28, 2025 at 06:18:11PM -0700, Martin KaFai Lau wrote:
>>> On 7/28/25 2:43 AM, Mahe Tardy wrote:
>>>> +SEC("cgroup_skb/egress")
>>>> +int egress(struct __sk_buff *skb)
>>>> +{
>>>> +    void *data = (void *)(long)skb->data;
>>>> +    void *data_end = (void *)(long)skb->data_end;
>>>> +    struct iphdr *iph;
>>>> +    struct tcphdr *tcph;
>>>> +
>>>> +    iph = data;
>>>> +    if ((void *)(iph + 1) > data_end || iph->version != 4 ||
>>>> +        iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
>>>> +        return SK_PASS;
>>>> +
>>>> +    tcph = (void *)iph + iph->ihl * 4;
>>>> +    if ((void *)(tcph + 1) > data_end ||
>>>> +        tcph->dest != bpf_htons(SERVER_PORT))
>>>> +        return SK_PASS;
>>>> +
>>>> +    kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
>>>> +
>>>> +    /* returns SK_PASS to execute the test case quicker */
>>>
>>> Do you know why the user space is slower if 0 (SK_DROP) is used?
>>
>> I tried to write my understanding of this in the commit description:
>>
>> "Note that the BPF program returns SK_PASS to let the connection being
>> established to finish the test cases quicker. Otherwise, you have to
>> wait for the TCP three-way handshake to timeout in the kernel and
>> retrieve the errno translated from the unreach code set by the ICMP
>> control message."
> 
> This feels like a bit hacky to let the 3WHS finished while the objective of the 
> patch set is to drop it. It is not unusual for people to directly borrow this 
> code. Does non blocking connect() help?
> 

After reading more on how sk_err_soft is used, non blocking won't help. I think 
I see why tcp rst is better.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-30  0:01                     ` Martin KaFai Lau
@ 2025-07-30  0:32                       ` Martin KaFai Lau
  0 siblings, 0 replies; 55+ messages in thread
From: Martin KaFai Lau @ 2025-07-30  0:32 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On 7/29/25 5:01 PM, Martin KaFai Lau wrote:
> On 7/29/25 4:27 PM, Martin KaFai Lau wrote:
>> On 7/29/25 2:09 AM, Mahe Tardy wrote:
>>> On Mon, Jul 28, 2025 at 06:18:11PM -0700, Martin KaFai Lau wrote:
>>>> On 7/28/25 2:43 AM, Mahe Tardy wrote:
>>>>> +SEC("cgroup_skb/egress")
>>>>> +int egress(struct __sk_buff *skb)
>>>>> +{
>>>>> +    void *data = (void *)(long)skb->data;
>>>>> +    void *data_end = (void *)(long)skb->data_end;
>>>>> +    struct iphdr *iph;
>>>>> +    struct tcphdr *tcph;
>>>>> +
>>>>> +    iph = data;
>>>>> +    if ((void *)(iph + 1) > data_end || iph->version != 4 ||
>>>>> +        iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
>>>>> +        return SK_PASS;
>>>>> +
>>>>> +    tcph = (void *)iph + iph->ihl * 4;
>>>>> +    if ((void *)(tcph + 1) > data_end ||
>>>>> +        tcph->dest != bpf_htons(SERVER_PORT))
>>>>> +        return SK_PASS;
>>>>> +
>>>>> +    kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
>>>>> +
>>>>> +    /* returns SK_PASS to execute the test case quicker */
>>>>
>>>> Do you know why the user space is slower if 0 (SK_DROP) is used?
>>>
>>> I tried to write my understanding of this in the commit description:
>>>
>>> "Note that the BPF program returns SK_PASS to let the connection being
>>> established to finish the test cases quicker. Otherwise, you have to
>>> wait for the TCP three-way handshake to timeout in the kernel and
>>> retrieve the errno translated from the unreach code set by the ICMP
>>> control message."
>>
>> This feels like a bit hacky to let the 3WHS finished while the objective of 
>> the patch set is to drop it. It is not unusual for people to directly borrow 
>> this code. Does non blocking connect() help?
>>
> 
> After reading more on how sk_err_soft is used, non blocking won't help. I think 
> I see why tcp rst is better.
> 

Actually, while replying on the cover letter and looking at tcp_v4_err again, 
there is an exception to do ip_icmp_error for TCP_SYN_SENT, so it may worth a 
try on non blocking connect and then poll the sk for err if you haven't tried 
that before.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc
  2025-07-29  9:53               ` Mahe Tardy
@ 2025-07-30  1:54                 ` Martin KaFai Lau
  2025-08-01 18:50                   ` Mahe Tardy
  0 siblings, 1 reply; 55+ messages in thread
From: Martin KaFai Lau @ 2025-07-30  1:54 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On 7/29/25 2:53 AM, Mahe Tardy wrote:
>> Which other program types do you need this kfunc to send icmp and the future
>> tcp rst?
> 
> I don't really know, I mostly need this in cgroup_skb for my use case
> but I could see other programs type using this either for simplification
> (for progs that can already rewrite the packet, like tc) or other
> programs types like cgroup_skb, because they can't touch the packet
> themselves.

I also don't think the tc needs this kfunc either. The tc should already have 
ways to do this now.

> 
>>
>> This cover letter mentioned sending icmp unreach is easier than sending tcp
>> rst. What problems do you see in sending tcp rst?
>>
> 
> Yes, I based these patches on what net/ipv4/netfilter/ipt_REJECT.c's
> 'reject_tg' function does. In the case of sending ICMP unreach
> 'nf_send_unreach', the routing step is quite straighforward as they are
> only inverting the daddr and the saddr (that's what my renamed/moved
> ip_route_reply_fetch_dst helper does).
> 
> In the case of sending RST 'nf_send_reset', there are extra steps, first
> the same routing mechanism is done by just inverting the daddr and the
> saddr but later 'ip_route_me_harder' is called which is doing a lot
> more. I'm currently not sure which parts of this must be ported to work
> in our BPF use case so I wanted to start with unreach.

I don't think we necessarily need to completely borrow from nf, the hooks' 
locations are different and the use case may be different.

A concern that I have is the icmp6_send called by the kfunc. The icmp6_send 
should eventually call to ip6_finish_output which may call the very same 
"cgroup/egress" program again in a recursive way. The same for v4 icmp_send.

The icmp packet is sent from an internal kernel sk. I suspect you will see this 
recursive behavior if the test is done in the default cgroup (/sys/fs/cgroup). I 
think the is_ineligible(skb) should have stopped the second icmpv6_send from 
replying to an icmp error and the cgroup hook cannot change the skb. However, I 
am not sure I want to cross this bridge. Is there a way to avoid the recursive 
bpf prog?



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc
  2025-07-30  1:54                 ` Martin KaFai Lau
@ 2025-08-01 18:50                   ` Mahe Tardy
  2026-04-20 10:58                     ` [PATCH bpf-next v4 0/6] " Mahe Tardy
  0 siblings, 1 reply; 55+ messages in thread
From: Mahe Tardy @ 2025-08-01 18:50 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
	lkp

On Tue, Jul 29, 2025 at 06:54:58PM -0700, Martin KaFai Lau wrote:
> On 7/29/25 2:53 AM, Mahe Tardy wrote:
> > > Which other program types do you need this kfunc to send icmp and the future
> > > tcp rst?
> > 
> > I don't really know, I mostly need this in cgroup_skb for my use case
> > but I could see other programs type using this either for simplification
> > (for progs that can already rewrite the packet, like tc) or other
> > programs types like cgroup_skb, because they can't touch the packet
> > themselves.
> 
> I also don't think the tc needs this kfunc either. The tc should already
> have ways to do this now.
> 
> > 
> > > 
> > > This cover letter mentioned sending icmp unreach is easier than sending tcp
> > > rst. What problems do you see in sending tcp rst?
> > > 
> > 
> > Yes, I based these patches on what net/ipv4/netfilter/ipt_REJECT.c's
> > 'reject_tg' function does. In the case of sending ICMP unreach
> > 'nf_send_unreach', the routing step is quite straighforward as they are
> > only inverting the daddr and the saddr (that's what my renamed/moved
> > ip_route_reply_fetch_dst helper does).
> > 
> > In the case of sending RST 'nf_send_reset', there are extra steps, first
> > the same routing mechanism is done by just inverting the daddr and the
> > saddr but later 'ip_route_me_harder' is called which is doing a lot
> > more. I'm currently not sure which parts of this must be ported to work
> > in our BPF use case so I wanted to start with unreach.
> 
> I don't think we necessarily need to completely borrow from nf, the hooks'
> locations are different and the use case may be different.
> 
> A concern that I have is the icmp6_send called by the kfunc. The icmp6_send
> should eventually call to ip6_finish_output which may call the very same
> "cgroup/egress" program again in a recursive way. The same for v4 icmp_send.
> 
> The icmp packet is sent from an internal kernel sk. I suspect you will see
> this recursive behavior if the test is done in the default cgroup
> (/sys/fs/cgroup). I think the is_ineligible(skb) should have stopped the
> second icmpv6_send from replying to an icmp error and the cgroup hook cannot
> change the skb. However, I am not sure I want to cross this bridge. Is there
> a way to avoid the recursive bpf prog?
> 

Thanks Martin for the review. Indeed the recursive BPF prog call is a
concerning issue. I'll take some time to think about it and hopefully
propose something.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
  2025-07-28  9:43             ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
  2025-07-28 15:40               ` Yonghong Song
  2025-07-29  1:18               ` Martin KaFai Lau
@ 2025-08-05 23:26               ` Jordan Rife
  2 siblings, 0 replies; 55+ messages in thread
From: Jordan Rife @ 2025-08-05 23:26 UTC (permalink / raw)
  To: Mahe Tardy
  Cc: lkp, alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

On Mon, Jul 28, 2025 at 09:43:45AM +0000, Mahe Tardy wrote:
> This test opens a server and client, attach a cgroup_skb program on
> egress and calls the icmp_send_unreach function from the client egress
> so that an ICMP unreach control message is sent back to the client.
> It then fetches the message from the error queue to confirm the correct
> ICMP unreach code has been sent.
> 
> Note that the BPF program returns SK_PASS to let the connection being
> established to finish the test cases quicker. Otherwise, you have to
> wait for the TCP three-way handshake to timeout in the kernel and
> retrieve the errno translated from the unreach code set by the ICMP
> control message.
> 
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
>  .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 99 +++++++++++++++++++
>  .../selftests/bpf/progs/icmp_send_unreach.c   | 36 +++++++
>  2 files changed, 135 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
>  create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> 
> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> new file mode 100644
> index 000000000000..414c1ed8ced3
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> @@ -0,0 +1,99 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <linux/errqueue.h>
> +#include "icmp_send_unreach.skel.h"
> +
> +#define TIMEOUT_MS 1000
> +#define SRV_PORT 54321
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +#define ICMP_FRAG_NEEDED 4
> +#define NR_ICMP_UNREACH 15

small nit: Any reason why ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, and
NR_ICMP_UNREACH are redefined here? I think you should just be able to
#include <linux/icmp.h> this at the top to avoid redefining these.

> +
> +static void read_icmp_errqueue(int sockfd, int expected_code)
> +{
> +	ssize_t n;
> +	struct sock_extended_err *sock_err;
> +	struct cmsghdr *cm;
> +	char ctrl_buf[512];
> +	struct msghdr msg = {
> +		.msg_control = ctrl_buf,
> +		.msg_controllen = sizeof(ctrl_buf),
> +	};
> +
> +	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
> +	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
> +		return;
> +
> +	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
> +		if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
> +		    !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
> +			continue;
> +
> +		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
> +
> +		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
> +			       "sock_err_origin_icmp"))
> +			return;
> +		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
> +			       "sock_err_type_dest_unreach"))
> +			return;
> +		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
> +	}
> +}
> +
> +void test_icmp_send_unreach_kfunc(void)
> +{
> +	struct icmp_send_unreach *skel;
> +	int cgroup_fd = -1, client_fd = 1, srv_fd = -1;
> +	int *code;
> +
> +	skel = icmp_send_unreach__open_and_load();
> +	if (!ASSERT_OK_PTR(skel, "skel_open"))
> +		goto cleanup;
> +
> +	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
> +	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> +		goto cleanup;
> +
> +	skel->links.egress =
> +		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
> +	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
> +		goto cleanup;
> +
> +	code = &skel->bss->unreach_code;
> +
> +	for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
> +		// The TCP stack reacts differently when asking for
> +		// fragmentation, let's ignore it for now
> +		if (*code == ICMP_FRAG_NEEDED)
> +			continue;
> +
> +		skel->bss->kfunc_ret = -1;
> +
> +		srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
> +				      SRV_PORT, TIMEOUT_MS);
> +		if (!ASSERT_GE(srv_fd, 0, "start_server"))
> +			goto for_cleanup;
> +
> +		client_fd = socket(AF_INET, SOCK_STREAM, 0);
> +		ASSERT_GE(client_fd, 0, "client_socket");
> +
> +		client_fd = connect_to_fd(srv_fd, 0);
> +		if (!ASSERT_GE(client_fd, 0, "client_connect"))
> +			goto for_cleanup;
> +
> +		read_icmp_errqueue(client_fd, *code);
> +
> +		ASSERT_EQ(skel->bss->kfunc_ret, SK_DROP, "kfunc_ret");

It might be worth testing that the kfunc returns -EINVAL when the code
is outside the accepted range as well for completeness.

> +for_cleanup:
> +		close(client_fd);
> +		close(srv_fd);
> +	}
> +
> +cleanup:
> +	icmp_send_unreach__destroy(skel);
> +	close(cgroup_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> new file mode 100644
> index 000000000000..15783e5d1d65
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> +
> +int unreach_code = 0;
> +int kfunc_ret = 0;
> +
> +#define SERVER_PORT 54321
> +#define SERVER_IP 0x7F000001
> +
> +SEC("cgroup_skb/egress")
> +int egress(struct __sk_buff *skb)
> +{
> +	void *data = (void *)(long)skb->data;
> +	void *data_end = (void *)(long)skb->data_end;
> +	struct iphdr *iph;
> +	struct tcphdr *tcph;
> +
> +	iph = data;
> +	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
> +	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
> +		return SK_PASS;
> +
> +	tcph = (void *)iph + iph->ihl * 4;
> +	if ((void *)(tcph + 1) > data_end ||
> +	    tcph->dest != bpf_htons(SERVER_PORT))
> +		return SK_PASS;
> +
> +	kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
> +
> +	/* returns SK_PASS to execute the test case quicker */
> +	return SK_PASS;
> +}
> --
> 2.34.1
> 

Jordan

^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v4 0/6] bpf: add icmp_send_unreach kfunc
  2025-08-01 18:50                   ` Mahe Tardy
@ 2026-04-20 10:58                     ` Mahe Tardy
  2026-04-20 10:58                       ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
                                         ` (5 more replies)
  0 siblings, 6 replies; 55+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
  To: mahe.tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

Hello,

This is v4 of adding the icmp_send_unreach kfunc, as suggested during
LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
actively reject east-west traffic, similarly to what is possible to do
with netfilter reject target.

The first step to implement this is using ICMP control messages, with
the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
than a TCP RST reply and will already hint the client TCP stack to abort
the connection and not retry extensively.

Note that this is different than the sock_destroy kfunc, that along
calls tcp_abort and thus sends a reset, destroying the underlying
socket.

Caveats of this kfunc design are that a program can call this function N
times, thus send N ICMP unreach control messages and that the program
can return from the BPF filter with pass leading to a potential
confusing situation where the TCP connection was established while the
client received ICMP_DEST_UNREACH messages.

Initially, this kfunc was added only to cgroup_skb programs, Alexei
suggested not creating its own kfunc set and adding it to the more
global bpf_kfunc_set_skb. Now that recursion is handled and I realized,
thanks to Martin, that fetching the dst route might be only useful in
situation in which the packet was not yet routed, I decided to extend
the kfunc to more program types and route the packet only if needed.

v2 updates:
- fix a build error from a missing function call rename;
- avoid changing return line in bpf_kfunc_init;
- return SK_DROP from the kfunc (similarly to bpf_redirect);
- check the return value in the selftest.

v3 update:
- fix an undefined reference build error.

v4 updates:
- prevent the kfunc to be called recursively and add a test (thanks to
  Martin).
- do not fetch dst route when unnecessary (thanks to Martin).
- extend the test for IPv6 (thanks to Martin).
- use SK_DROP in examples and use non blocking sockets for testing
  (thanks to Martin).
- test when the kfunc returns -EINVAL (thanks to Jordan).
- add the kfunc to bpf_kfunc_set_skb as suggested by Alexei.
- guard the IPv4 parts with IS_ENABLED(CONFIG_INET).
- fix a wrong initial value for client_fd (thanks to Yonghong).
- add documentation to the kfunc.
- to Jordan: I couldn't include <linux/icmp.h> because of redefines from
  <network_helpers.h>.

[^1]: https://lwn.net/Articles/1022034/

Mahe Tardy (6):
  net: move netfilter nf_reject_fill_skb_dst to core ipv4
  net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  bpf: add bpf_icmp_send_unreach kfunc
  selftests/bpf: add icmp_send_unreach kfunc tests
  selftests/bpf: add icmp_send_unreach kfunc IPv6 tests
  selftests/bpf: add icmp_send_unreach_recursion test

 include/net/ip6_route.h                       |   2 +
 include/net/route.h                           |   1 +
 net/core/filter.c                             |  85 ++++++++
 net/ipv4/netfilter/nf_reject_ipv4.c           |  19 +-
 net/ipv4/route.c                              |  15 ++
 net/ipv6/netfilter/nf_reject_ipv6.c           |  17 +-
 net/ipv6/route.c                              |  18 ++
 .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 202 ++++++++++++++++++
 .../selftests/bpf/progs/icmp_send_unreach.c   | 100 +++++++++
 9 files changed, 426 insertions(+), 33 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c

--
2.34.1


^ permalink raw reply	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4
  2026-04-20 10:58                     ` [PATCH bpf-next v4 0/6] " Mahe Tardy
@ 2026-04-20 10:58                       ` Mahe Tardy
  2026-04-20 11:36                         ` bot+bpf-ci
  2026-04-21 11:13                         ` sashiko-bot
  2026-04-20 10:58                       ` [PATCH bpf-next v4 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
                                         ` (4 subsequent siblings)
  5 siblings, 2 replies; 55+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
  To: mahe.tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

Move and rename nf_reject_fill_skb_dst from
ipv4/netfilter/nf_reject_ipv4 to ip_route_reply_fetch_dst in
ipv4/route.c so that it can be reused in the following patches by BPF
kfuncs.

Netfilter uses nf_ip_route that is almost a transparent wrapper around
ip_route_output_key so this patch inlines it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/route.h                 |  1 +
 net/ipv4/netfilter/nf_reject_ipv4.c | 19 ++-----------------
 net/ipv4/route.c                    | 15 +++++++++++++++
 3 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index f90106f383c5..ec2466fd0bec 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
 				    const struct sock *sk);
 struct dst_entry *ipv4_blackhole_route(struct net *net,
 				       struct dst_entry *dst_orig);
+int ip_route_reply_fetch_dst(struct sk_buff *skb);

 static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
 {
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index fecf6621f679..2290451ed122 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -252,21 +252,6 @@ static void nf_reject_ip_tcphdr_put(struct sk_buff *nskb, const struct sk_buff *
 	nskb->csum_offset = offsetof(struct tcphdr, check);
 }

-static int nf_reject_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip4.daddr = ip_hdr(skb_in)->saddr;
-	nf_ip_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 /* Send RST reply */
 void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		   int hook)
@@ -279,7 +264,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 	if (!oth)
 		return;

-	if (!skb_dst(oldskb) && nf_reject_fill_skb_dst(oldskb) < 0)
+	if (!skb_dst(oldskb) && ip_route_reply_fetch_dst(oldskb) < 0)
 		return;

 	if (skb_rtable(oldskb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -352,7 +337,7 @@ void nf_send_unreach(struct sk_buff *skb_in, int code, int hook)
 	if (iph->frag_off & htons(IP_OFFSET))
 		return;

-	if (!skb_dst(skb_in) && nf_reject_fill_skb_dst(skb_in) < 0)
+	if (!skb_dst(skb_in) && ip_route_reply_fetch_dst(skb_in) < 0)
 		return;

 	if (skb_csum_unnecessary(skb_in) ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index bc1296f0ea69..7091ef936073 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2945,6 +2945,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);

+int ip_route_reply_fetch_dst(struct sk_buff *skb)
+{
+	struct rtable *rt;
+	struct flowi4 fl4 = {
+		.daddr = ip_hdr(skb)->saddr
+	};
+
+	rt = ip_route_output_key(dev_net(skb->dev), &fl4);
+	if (IS_ERR(rt))
+		return PTR_ERR(rt);
+	skb_dst_set(skb, &rt->dst);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);
+
 /* called with rcu_read_lock held */
 static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 			struct rtable *rt, u32 table_id, dscp_t dscp,
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v4 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  2026-04-20 10:58                     ` [PATCH bpf-next v4 0/6] " Mahe Tardy
  2026-04-20 10:58                       ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2026-04-20 10:58                       ` Mahe Tardy
  2026-04-21 11:13                         ` sashiko-bot
  2026-04-20 10:58                       ` [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc Mahe Tardy
                                         ` (3 subsequent siblings)
  5 siblings, 1 reply; 55+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
  To: mahe.tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

Move and rename nf_reject6_fill_skb_dst from
ipv6/netfilter/nf_reject_ipv6 to ip6_route_reply_fetch_dst in
ipv6/route.c so that it can be reused in the following patches by BPF
kfuncs.

Netfilter uses nf_ip6_route that is almost a transparent wrapper around
ip6_route_output so this patch inlines it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 include/net/ip6_route.h             |  2 ++
 net/ipv6/netfilter/nf_reject_ipv6.c | 17 +----------------
 net/ipv6/route.c                    | 18 ++++++++++++++++++
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 09ffe0f13ce7..3652efec7081 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -100,6 +100,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
 	return ip6_route_output_flags(net, sk, fl6, 0);
 }

+int ip6_route_reply_fetch_dst(struct sk_buff *skb);
+
 /* Only conditionally release dst if flags indicates
  * !RT6_LOOKUP_F_DST_NOREF or dst is in uncached_list.
  */
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index ef5b7e85cffa..9663d1db6d80 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -293,21 +293,6 @@ nf_reject_ip6_tcphdr_put(struct sk_buff *nskb,
 						   sizeof(struct tcphdr), 0));
 }

-static int nf_reject6_fill_skb_dst(struct sk_buff *skb_in)
-{
-	struct dst_entry *dst = NULL;
-	struct flowi fl;
-
-	memset(&fl, 0, sizeof(struct flowi));
-	fl.u.ip6.daddr = ipv6_hdr(skb_in)->saddr;
-	nf_ip6_route(dev_net(skb_in->dev), &dst, &fl, false);
-	if (!dst)
-		return -1;
-
-	skb_dst_set(skb_in, dst);
-	return 0;
-}
-
 void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb,
 		    int hook)
 {
@@ -440,7 +425,7 @@ void nf_send_unreach6(struct net *net, struct sk_buff *skb_in,
 	if (hooknum == NF_INET_LOCAL_OUT && skb_in->dev == NULL)
 		skb_in->dev = net->loopback_dev;

-	if (!skb_dst(skb_in) && nf_reject6_fill_skb_dst(skb_in) < 0)
+	if (!skb_dst(skb_in) && ip6_route_reply_fetch_dst(skb_in) < 0)
 		return;

 	icmpv6_send(skb_in, ICMPV6_DEST_UNREACH, code, 0);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 19eb6b702227..41871fddec4d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2721,6 +2721,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
 }
 EXPORT_SYMBOL_GPL(ip6_route_output_flags);

+int ip6_route_reply_fetch_dst(struct sk_buff *skb)
+{
+	struct dst_entry *result;
+	struct flowi6 fl = {
+		.daddr = ipv6_hdr(skb)->saddr
+	};
+	int err;
+
+	result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
+	err = result->error;
+	if (err)
+		dst_release(result);
+	else
+		skb_dst_set(skb, result);
+	return err;
+}
+EXPORT_SYMBOL_GPL(ip6_route_reply_fetch_dst);
+
 struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_orig)
 {
 	struct rt6_info *rt, *ort = dst_rt6_info(dst_orig);
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc
  2026-04-20 10:58                     ` [PATCH bpf-next v4 0/6] " Mahe Tardy
  2026-04-20 10:58                       ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
  2026-04-20 10:58                       ` [PATCH bpf-next v4 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2026-04-20 10:58                       ` Mahe Tardy
  2026-04-20 11:36                         ` bot+bpf-ci
  2026-04-21 11:13                         ` sashiko-bot
  2026-04-20 10:58                       ` [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
                                         ` (2 subsequent siblings)
  5 siblings, 2 replies; 55+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
  To: mahe.tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

This is needed in the context of Tetragon to provide improved feedback
(in contrast to just dropping packets) to east-west traffic when blocked
by policies using cgroup_skb programs.

This reuse concepts from netfilter reject target codepath with the
differences that:
* Packets are cloned since the BPF user can still let the packet pass
  (SK_PASS from the cgroup_skb progs for example) and the current skb
  need to stay untouched (cgroup_skb hooks only allow read-only skb
  payload). The kfunc set the dst of the cloned skb by using the saddr
  as the daddr and routing it.
* Checksums are not computed or verified and IPv4 fragmentation is not
  checked early (icmp_send will check).
* We protect against recursion since the kfunc, by generating an ICMP
  error message could retrigger the BPF prog that invoked it.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 net/core/filter.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index fcfcb72663ca..a6c3b9145c93 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -84,6 +84,10 @@
 #include <linux/un.h>
 #include <net/xdp_sock_drv.h>
 #include <net/inet_dscp.h>
+#include <linux/icmp.h>
+#include <net/icmp.h>
+#include <net/route.h>
+#include <net/ip6_route.h>

 #include "dev.h"

@@ -12423,6 +12427,86 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
 	return 0;
 }

+static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
+
+/**
+ * bpf_icmp_send_unreach - Send ICMP destination unreachable error
+ * @skb: Packet that triggered the error
+ * @code: ICMP unreachable code (0-15 for IPv4, 0-6 for IPv6)
+ *
+ * Sends an ICMP destination unreachable message in response to the
+ * packet. The original packet is cloned before sending the ICMP error,
+ * so the BPF program can still let the packet pass if desired.
+ *
+ * Recursion protection: If called from a context that would trigger
+ * recursion (e.g., root cgroup processing its own ICMP packets),
+ * returns -EBUSY on re-entry.
+ *
+ * Return: 0 on success, negative error code on failure:
+ *         -EINVAL: Invalid code parameter
+ *         -ENOMEM: Memory allocation failed
+ *         -EHOSTUNREACH: Routing lookup failed
+ *         -EBUSY: Recursion detected
+ *         -EPROTONOSUPPORT: Non-IP protocol
+ */
+__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
+{
+	struct sk_buff *skb = (struct sk_buff *)__skb;
+	struct sk_buff *nskb;
+	bool *in_progress;
+
+	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
+	if (*in_progress)
+		return -EBUSY;
+
+	switch (skb->protocol) {
+#if IS_ENABLED(CONFIG_INET)
+	case htons(ETH_P_IP):
+		if (code < 0 || code > NR_ICMP_UNREACH)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (!skb_dst(nskb) && ip_route_reply_fetch_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		*in_progress = true;
+		icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
+		*in_progress = false;
+		kfree_skb(nskb);
+		break;
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+	case htons(ETH_P_IPV6):
+		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
+			return -EINVAL;
+
+		nskb = skb_clone(skb, GFP_ATOMIC);
+		if (!nskb)
+			return -ENOMEM;
+
+		if (!skb_dst(nskb) && ip6_route_reply_fetch_dst(nskb) < 0) {
+			kfree_skb(nskb);
+			return -EHOSTUNREACH;
+		}
+
+		*in_progress = true;
+		icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
+		*in_progress = false;
+		kfree_skb(nskb);
+		break;
+#endif
+	default:
+		return -EPROTONOSUPPORT;
+	}
+
+	return 0;
+}
+
 __bpf_kfunc_end_defs();

 int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12442,6 +12526,7 @@ int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,

 BTF_KFUNCS_START(bpf_kfunc_check_set_skb)
 BTF_ID_FLAGS(func, bpf_dynptr_from_skb)
+BTF_ID_FLAGS(func, bpf_icmp_send_unreach)
 BTF_KFUNCS_END(bpf_kfunc_check_set_skb)

 BTF_KFUNCS_START(bpf_kfunc_check_set_skb_meta)
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests
  2026-04-20 10:58                     ` [PATCH bpf-next v4 0/6] " Mahe Tardy
                                         ` (2 preceding siblings ...)
  2026-04-20 10:58                       ` [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc Mahe Tardy
@ 2026-04-20 10:58                       ` Mahe Tardy
  2026-04-20 11:36                         ` bot+bpf-ci
  2026-04-21 11:13                         ` sashiko-bot
  2026-04-20 10:58                       ` [PATCH bpf-next v4 5/6] selftests/bpf: add icmp_send_unreach kfunc IPv6 tests Mahe Tardy
  2026-04-20 10:58                       ` [PATCH bpf-next v4 6/6] selftests/bpf: add icmp_send_unreach_recursion test Mahe Tardy
  5 siblings, 2 replies; 55+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
  To: mahe.tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

This test opens a server and client, enters a new cgroup, attach a
cgroup_skb program on egress and calls the icmp_send_unreach function
from the client egress so that an ICMP unreach control message is sent
back to the client.  It then fetches the message from the error queue to
confirm the correct ICMP unreach code has been sent.

Note that, for the client, we have to connect in non-blocking mode to
let the test execute faster. Otherwise, we need to wait for the TCP
three-way handshake to timeout in the kernel before reading the errno.

Also note that we don't set IP_RECVERR on the socket in
connect_to_fd_nonblock since the error will be transferred anyway in our
test because the connection is rejected at the beginning of the TCP
handshake. See in net/ipv4/tcp_ipv4.c:tcp_v4_err line 615 to 655 for
more details.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 136 ++++++++++++++++++
 .../selftests/bpf/progs/icmp_send_unreach.c   |  36 +++++
 2 files changed, 172 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
 create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
new file mode 100644
index 000000000000..24d5e01cfe80
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <linux/errqueue.h>
+#include "icmp_send_unreach.skel.h"
+
+#define TIMEOUT_MS 1000
+#define SRV_PORT 54321
+
+#define ICMP_DEST_UNREACH 3
+
+#define ICMP_FRAG_NEEDED 4
+#define NR_ICMP_UNREACH 15
+
+static int connect_to_fd_nonblock(int server_fd)
+{
+	struct sockaddr_storage addr;
+	socklen_t len = sizeof(addr);
+	int fd, err;
+
+	if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
+		return -1;
+
+	fd = socket(addr.ss_family, SOCK_STREAM | SOCK_NONBLOCK, 0);
+	if (fd < 0)
+		return -1;
+
+	err = connect(fd, (struct sockaddr *)&addr, len);
+	if (err < 0 && errno != EINPROGRESS) {
+		close(fd);
+		return -1;
+	}
+
+	return fd;
+}
+
+static void read_icmp_errqueue(int sockfd, int expected_code)
+{
+	ssize_t n;
+	struct sock_extended_err *sock_err;
+	struct cmsghdr *cm;
+	char ctrl_buf[512];
+	struct msghdr msg = {
+		.msg_control = ctrl_buf,
+		.msg_controllen = sizeof(ctrl_buf),
+	};
+
+	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
+	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
+		return;
+
+	cm = CMSG_FIRSTHDR(&msg);
+	if (!ASSERT_NEQ(cm, NULL, "cm_firsthdr_null"))
+		return;
+
+	for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
+		    !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
+			continue;
+
+		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
+
+		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
+			       "sock_err_origin_icmp"))
+			return;
+		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+			       "sock_err_type_dest_unreach"))
+			return;
+		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
+	}
+}
+
+static void trigger_prog_read_icmp_errqueue(int *code)
+{
+	int srv_fd = -1, client_fd = -1;
+
+	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", SRV_PORT,
+			      TIMEOUT_MS);
+	if (!ASSERT_GE(srv_fd, 0, "start_server"))
+		return;
+
+	client_fd = connect_to_fd_nonblock(srv_fd);
+	if (!ASSERT_GE(client_fd, 0, "client_connect_nonblock")) {
+		close(srv_fd);
+		return;
+	}
+
+	/* Skip reading ICMP error queue if code is invalid */
+	if (*code >= 0 && *code <= NR_ICMP_UNREACH)
+		read_icmp_errqueue(client_fd, *code);
+
+	close(srv_fd);
+	close(client_fd);
+}
+
+void test_icmp_send_unreach_kfunc(void)
+{
+	struct icmp_send_unreach *skel;
+	int cgroup_fd = -1;
+	int *code;
+
+	skel = icmp_send_unreach__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
+	if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+		goto cleanup;
+
+	skel->links.egress =
+		bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
+		goto cleanup;
+
+	code = &skel->bss->unreach_code;
+
+	for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
+		/* The TCP stack reacts differently when asking for
+		 * fragmentation, let's ignore it for now.
+		 */
+		if (*code == ICMP_FRAG_NEEDED)
+			continue;
+
+		trigger_prog_read_icmp_errqueue(code);
+		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
+	}
+
+	/* Test an invalid code */
+	*code = -1;
+	trigger_prog_read_icmp_errqueue(code);
+	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+
+cleanup:
+	icmp_send_unreach__destroy(skel);
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
new file mode 100644
index 000000000000..6fc5595f08aa
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define SERVER_PORT 54321
+/* 127.0.0.1 in network byte order */
+#define SERVER_IP 0x7F000001
+
+int unreach_code = 0;
+int kfunc_ret = -1;
+
+SEC("cgroup_skb/egress")
+int egress(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph;
+	struct tcphdr *tcph;
+
+	iph = data;
+	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
+	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+		return SK_PASS;
+
+	tcph = (void *)iph + iph->ihl * 4;
+	if ((void *)(tcph + 1) > data_end ||
+	    tcph->dest != bpf_htons(SERVER_PORT))
+		return SK_PASS;
+
+	kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
+
+	return SK_DROP;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v4 5/6] selftests/bpf: add icmp_send_unreach kfunc IPv6 tests
  2026-04-20 10:58                     ` [PATCH bpf-next v4 0/6] " Mahe Tardy
                                         ` (3 preceding siblings ...)
  2026-04-20 10:58                       ` [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
@ 2026-04-20 10:58                       ` Mahe Tardy
  2026-04-21 11:13                         ` sashiko-bot
  2026-04-20 10:58                       ` [PATCH bpf-next v4 6/6] selftests/bpf: add icmp_send_unreach_recursion test Mahe Tardy
  5 siblings, 1 reply; 55+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
  To: mahe.tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

This test extend the existing IPv4 tests to IPv6.

Note that we need to set IP_RECVERR on the socket for IPv6 in
connect_to_fd_nonblock otherwise the error will be ignored even if we
are in the middle of the TCP handshake. See in
net/ipv6/datagram.c:ipv6_icmp_error line 313 for more details.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 83 ++++++++++++-------
 .../selftests/bpf/progs/icmp_send_unreach.c   | 46 ++++++++--
 2 files changed, 93 insertions(+), 36 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
index 24d5e01cfe80..047bfd4d80f7 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -8,15 +8,17 @@
 #define SRV_PORT 54321

 #define ICMP_DEST_UNREACH 3
+#define ICMPV6_DEST_UNREACH 1

 #define ICMP_FRAG_NEEDED 4
 #define NR_ICMP_UNREACH 15
+#define NR_ICMPV6_UNREACH 6

 static int connect_to_fd_nonblock(int server_fd)
 {
 	struct sockaddr_storage addr;
 	socklen_t len = sizeof(addr);
-	int fd, err;
+	int fd, err, on = 1;

 	if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
 		return -1;
@@ -25,6 +27,12 @@ static int connect_to_fd_nonblock(int server_fd)
 	if (fd < 0)
 		return -1;

+	if (addr.ss_family == AF_INET6 &&
+	    setsockopt(fd, IPPROTO_IPV6, IPV6_RECVERR, &on, sizeof(on)) < 0) {
+		close(fd);
+		return -1;
+	}
+
 	err = connect(fd, (struct sockaddr *)&addr, len);
 	if (err < 0 && errno != EINPROGRESS) {
 		close(fd);
@@ -34,7 +42,7 @@ static int connect_to_fd_nonblock(int server_fd)
 	return fd;
 }

-static void read_icmp_errqueue(int sockfd, int expected_code)
+static void read_icmp_errqueue(int sockfd, int expected_code, int af)
 {
 	ssize_t n;
 	struct sock_extended_err *sock_err;
@@ -44,6 +52,12 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
 		.msg_control = ctrl_buf,
 		.msg_controllen = sizeof(ctrl_buf),
 	};
+	int expected_level = (af == AF_INET) ? IPPROTO_IP : IPPROTO_IPV6;
+	int expected_type = (af == AF_INET) ? IP_RECVERR : IPV6_RECVERR;
+	int expected_origin = (af == AF_INET) ? SO_EE_ORIGIN_ICMP :
+						SO_EE_ORIGIN_ICMP6;
+	int expected_ee_type = (af == AF_INET) ? ICMP_DEST_UNREACH :
+						 ICMPV6_DEST_UNREACH;

 	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
 	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
@@ -54,28 +68,27 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
 		return;

 	for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
-		if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
-		    !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
+		if (!ASSERT_EQ(cm->cmsg_level, expected_level, "cmsg_level") ||
+		    !ASSERT_EQ(cm->cmsg_type, expected_type, "cmsg_type"))
 			continue;

 		sock_err = (struct sock_extended_err *)CMSG_DATA(cm);

-		if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
-			       "sock_err_origin_icmp"))
+		if (!ASSERT_EQ(sock_err->ee_origin, expected_origin,
+			       "sock_err_origin"))
 			return;
-		if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+		if (!ASSERT_EQ(sock_err->ee_type, expected_ee_type,
 			       "sock_err_type_dest_unreach"))
 			return;
 		ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
 	}
 }

-static void trigger_prog_read_icmp_errqueue(int *code)
+static void trigger_prog_read_icmp_errqueue(int *code, int af, const char *addr)
 {
 	int srv_fd = -1, client_fd = -1;

-	srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", SRV_PORT,
-			      TIMEOUT_MS);
+	srv_fd = start_server(af, SOCK_STREAM, addr, SRV_PORT, TIMEOUT_MS);
 	if (!ASSERT_GE(srv_fd, 0, "start_server"))
 		return;

@@ -86,18 +99,40 @@ static void trigger_prog_read_icmp_errqueue(int *code)
 	}

 	/* Skip reading ICMP error queue if code is invalid */
-	if (*code >= 0 && *code <= NR_ICMP_UNREACH)
-		read_icmp_errqueue(client_fd, *code);
+	if (*code >= 0 && ((af == AF_INET && *code <= NR_ICMP_UNREACH) ||
+			   (af == AF_INET6 && *code <= NR_ICMPV6_UNREACH)))
+		read_icmp_errqueue(client_fd, *code, af);

-	close(srv_fd);
 	close(client_fd);
+	close(srv_fd);
+}
+
+static void run_icmp_test(struct icmp_send_unreach *skel, int af,
+			  const char *addr, int max_code)
+{
+	int *code = &skel->bss->unreach_code;
+
+	for (*code = 0; *code <= max_code; (*code)++) {
+		/* The TCP stack reacts differently when asking for
+		 * fragmentation, let's ignore it for now.
+		 */
+		if (af == AF_INET && *code == ICMP_FRAG_NEEDED)
+			continue;
+
+		trigger_prog_read_icmp_errqueue(code, af, addr);
+		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
+	}
+
+	/* Test an invalid code */
+	*code = -1;
+	trigger_prog_read_icmp_errqueue(code, af, addr);
+	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
 }

 void test_icmp_send_unreach_kfunc(void)
 {
 	struct icmp_send_unreach *skel;
 	int cgroup_fd = -1;
-	int *code;

 	skel = icmp_send_unreach__open_and_load();
 	if (!ASSERT_OK_PTR(skel, "skel_open"))
@@ -112,23 +147,11 @@ void test_icmp_send_unreach_kfunc(void)
 	if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
 		goto cleanup;

-	code = &skel->bss->unreach_code;
-
-	for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
-		/* The TCP stack reacts differently when asking for
-		 * fragmentation, let's ignore it for now.
-		 */
-		if (*code == ICMP_FRAG_NEEDED)
-			continue;
-
-		trigger_prog_read_icmp_errqueue(code);
-		ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
-	}
+	if (test__start_subtest("ipv4"))
+		run_icmp_test(skel, AF_INET, "127.0.0.1", NR_ICMP_UNREACH);

-	/* Test an invalid code */
-	*code = -1;
-	trigger_prog_read_icmp_errqueue(code);
-	ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+	if (test__start_subtest("ipv6"))
+		run_icmp_test(skel, AF_INET6, "::1", NR_ICMPV6_UNREACH);

 cleanup:
 	icmp_send_unreach__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
index 6fc5595f08aa..112b9cbfab6f 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -6,6 +6,11 @@
 #define SERVER_PORT 54321
 /* 127.0.0.1 in network byte order */
 #define SERVER_IP 0x7F000001
+/* ::1 in network byte order */
+#define SERVER_IP6_0 0x00000000
+#define SERVER_IP6_1 0x00000000
+#define SERVER_IP6_2 0x00000000
+#define SERVER_IP6_3 0x01000000

 int unreach_code = 0;
 int kfunc_ret = -1;
@@ -16,17 +21,46 @@ int egress(struct __sk_buff *skb)
 	void *data = (void *)(long)skb->data;
 	void *data_end = (void *)(long)skb->data_end;
 	struct iphdr *iph;
+	struct ipv6hdr *ip6h;
 	struct tcphdr *tcph;
+	__u8 version;

-	iph = data;
-	if ((void *)(iph + 1) > data_end || iph->version != 4 ||
-	    iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+	if (data + 1 > data_end)
 		return SK_PASS;

-	tcph = (void *)iph + iph->ihl * 4;
-	if ((void *)(tcph + 1) > data_end ||
-	    tcph->dest != bpf_htons(SERVER_PORT))
+	version = (*((__u8 *)data)) >> 4;
+
+	if (version == 4) {
+		iph = data;
+		if ((void *)(iph + 1) > data_end ||
+		    iph->protocol != IPPROTO_TCP ||
+		    iph->daddr != bpf_htonl(SERVER_IP))
+			return SK_PASS;
+
+		tcph = (void *)iph + iph->ihl * 4;
+		if ((void *)(tcph + 1) > data_end ||
+		    tcph->dest != bpf_htons(SERVER_PORT))
+			return SK_PASS;
+
+	} else if (version == 6) {
+		ip6h = data;
+		if ((void *)(ip6h + 1) > data_end ||
+		    ip6h->nexthdr != IPPROTO_TCP)
+			return SK_PASS;
+
+		if (ip6h->daddr.in6_u.u6_addr32[0] != SERVER_IP6_0 ||
+		    ip6h->daddr.in6_u.u6_addr32[1] != SERVER_IP6_1 ||
+		    ip6h->daddr.in6_u.u6_addr32[2] != SERVER_IP6_2 ||
+		    ip6h->daddr.in6_u.u6_addr32[3] != SERVER_IP6_3)
+			return SK_PASS;
+
+		tcph = (void *)(ip6h + 1);
+		if ((void *)(tcph + 1) > data_end ||
+		    tcph->dest != bpf_htons(SERVER_PORT))
+			return SK_PASS;
+	} else {
 		return SK_PASS;
+	}

 	kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);

--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH bpf-next v4 6/6] selftests/bpf: add icmp_send_unreach_recursion test
  2026-04-20 10:58                     ` [PATCH bpf-next v4 0/6] " Mahe Tardy
                                         ` (4 preceding siblings ...)
  2026-04-20 10:58                       ` [PATCH bpf-next v4 5/6] selftests/bpf: add icmp_send_unreach kfunc IPv6 tests Mahe Tardy
@ 2026-04-20 10:58                       ` Mahe Tardy
  2026-04-21 11:13                         ` sashiko-bot
  5 siblings, 1 reply; 55+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
  To: mahe.tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo

This test is similar to icmp_send_unreach_kfunc but checks that, in case
of recursion, meaning that the BPF program calling the kfunc was
re-triggered by the icmp_send done by the kfunc, the kfunc will stop
early and return -EBUSY.

Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
 .../bpf/prog_tests/icmp_send_unreach_kfunc.c  | 43 +++++++++++++++++++
 .../selftests/bpf/progs/icmp_send_unreach.c   | 30 +++++++++++++
 2 files changed, 73 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
index 047bfd4d80f7..a4f4324b2b99 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <test_progs.h>
 #include <network_helpers.h>
+#include <cgroup_helpers.h>
 #include <linux/errqueue.h>
 #include "icmp_send_unreach.skel.h"

@@ -10,6 +11,7 @@
 #define ICMP_DEST_UNREACH 3
 #define ICMPV6_DEST_UNREACH 1

+#define ICMP_HOST_UNREACH 1
 #define ICMP_FRAG_NEEDED 4
 #define NR_ICMP_UNREACH 15
 #define NR_ICMPV6_UNREACH 6
@@ -157,3 +159,44 @@ void test_icmp_send_unreach_kfunc(void)
 	icmp_send_unreach__destroy(skel);
 	close(cgroup_fd);
 }
+
+void test_icmp_send_unreach_recursion(void)
+{
+	struct icmp_send_unreach *skel;
+	int cgroup_fd = -1;
+	int *code;
+
+	skel = icmp_send_unreach__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		goto cleanup;
+
+	if (setup_cgroup_environment()) {
+		fprintf(stderr, "Failed to setup cgroup environment\n");
+		goto cleanup;
+	}
+
+	cgroup_fd = get_root_cgroup();
+	if (!ASSERT_GE(cgroup_fd, 0, "get_root_cgroup"))
+		goto cleanup;
+
+	skel->links.recursion =
+		bpf_program__attach_cgroup(skel->progs.recursion, cgroup_fd);
+	if (!ASSERT_OK_PTR(skel->links.recursion, "prog_attach_cgroup"))
+		goto cleanup;
+
+	code = &skel->bss->unreach_code;
+	*code = ICMP_HOST_UNREACH;
+
+	trigger_prog_read_icmp_errqueue(code, AF_INET, "127.0.0.1");
+
+	/* Because there's recursion involved, the first call will return at
+	 * index 1 since it will return the second, and the second call will
+	 * return at index 0 since it will return the first.
+	 */
+	ASSERT_EQ(skel->data->rec_kfunc_rets[1], 0, "kfunc_rets[1]");
+	ASSERT_EQ(skel->data->rec_kfunc_rets[0], -EBUSY, "kfunc_rets[0]");
+
+cleanup:
+	icmp_send_unreach__destroy(skel);
+	close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
index 112b9cbfab6f..9aca7c0b12e1 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -15,6 +15,9 @@
 int unreach_code = 0;
 int kfunc_ret = -1;

+uint rec_count = 0;
+int rec_kfunc_rets[] = { -1, -1 };
+
 SEC("cgroup_skb/egress")
 int egress(struct __sk_buff *skb)
 {
@@ -67,4 +70,31 @@ int egress(struct __sk_buff *skb)
 	return SK_DROP;
 }

+SEC("cgroup_skb/egress")
+int recursion(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph;
+
+	iph = data;
+	if ((void *)(iph + 1) > data_end || iph->version != 4)
+		return SK_PASS;
+
+	/* This call will provoke a recursion: the ICMP package generated by the
+	 * kfunc will re-trigger this program since we are in the root cgroup in
+	 * which the kernel ICMP socket belongs. However when re-entering the
+	 * kfunc, it should return EBUSY.
+	 */
+	rec_kfunc_rets[rec_count & 1] =
+		bpf_icmp_send_unreach(skb, unreach_code);
+	__sync_fetch_and_add(&rec_count, 1);
+
+	/* Let the first ICMP error message pass */
+	if (iph->protocol == IPPROTO_ICMP)
+		return SK_PASS;
+
+	return SK_DROP;
+}
+
 char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4
  2026-04-20 10:58                       ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2026-04-20 11:36                         ` bot+bpf-ci
  2026-04-20 13:04                           ` Mahe Tardy
  2026-04-21 11:13                         ` sashiko-bot
  1 sibling, 1 reply; 55+ messages in thread
From: bot+bpf-ci @ 2026-04-20 11:36 UTC (permalink / raw)
  To: mahe.tardy, mahe.tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2271 bytes --]

> diff --git a/include/net/route.h b/include/net/route.h
> index f90106f383c5..ec2466fd0bec 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
>  				    const struct sock *sk);
>  struct dst_entry *ipv4_blackhole_route(struct net *net,
>  				       struct dst_entry *dst_orig);
> +int ip_route_reply_fetch_dst(struct sk_buff *skb);
>
>  static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
>  {

[ ... ]

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index bc1296f0ea69..7091ef936073 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2945,6 +2945,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
>  }
>  EXPORT_SYMBOL_GPL(ip_route_output_flow);
>
> +int ip_route_reply_fetch_dst(struct sk_buff *skb)
> +{
> +	struct rtable *rt;
> +	struct flowi4 fl4 = {
> +		.daddr = ip_hdr(skb)->saddr
> +	};
> +
> +	rt = ip_route_output_key(dev_net(skb->dev), &fl4);
> +	if (IS_ERR(rt))
> +		return PTR_ERR(rt);
> +	skb_dst_set(skb, &rt->dst);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);

Is ip_route_reply_fetch_dst() actually needed before calling icmp_send()?

Martin KaFai Lau raised this question in v3 of the patch series and
noted that icmpv6_send() does its own route lookup and doesn't use
skb_dst() of the original skb. He also pointed out that at
cgroup/egress, skb->_skb_refdst should already be set.

Looking at icmpv6_send(), the fl6.{daddr,saddr} are filled and passed
to icmpv6_route_lookup, which performs the reverse/invert route lookup.
The function doesn't appear to use skb_dst() of the original skb.

Since this patch moves ip_route_reply_fetch_dst() from netfilter to
core specifically for the BPF kfunc, can you clarify why this routing
step is necessary? Does the kfunc not work without
ip[6]_route_reply_fetch_dst()?

Reference: https://lore.kernel.org/bpf/df4b0996-3e88-4ea4-983b-82866455a6fc@linux.dev/


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests
  2026-04-20 10:58                       ` [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
@ 2026-04-20 11:36                         ` bot+bpf-ci
  2026-04-20 13:08                           ` Mahe Tardy
  2026-04-21 11:13                         ` sashiko-bot
  1 sibling, 1 reply; 55+ messages in thread
From: bot+bpf-ci @ 2026-04-20 11:36 UTC (permalink / raw)
  To: mahe.tardy, mahe.tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 1197 bytes --]

> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> new file mode 100644
> index 000000000000..24d5e01cfe80
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> @@ -0,0 +1,136 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <linux/errqueue.h>
> +#include "icmp_send_unreach.skel.h"
> +
> +#define TIMEOUT_MS 1000
> +#define SRV_PORT 54321
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +#define ICMP_FRAG_NEEDED 4
> +#define NR_ICMP_UNREACH 15

Jordan Rife previously suggested in v3 that these ICMP constants should be
imported from <linux/icmp.h> rather than redefined here. Would it be cleaner
to include the header and avoid the redefinitions?

Reference: https://lore.kernel.org/bpf/usz5bhydsiejr37owgt3zypckzh7fa7ygmhsyaaiprsljx7iy5@ipopnr5n4ds7/

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc
  2026-04-20 10:58                       ` [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc Mahe Tardy
@ 2026-04-20 11:36                         ` bot+bpf-ci
  2026-04-20 13:07                           ` Mahe Tardy
  2026-04-21 11:13                         ` sashiko-bot
  1 sibling, 1 reply; 55+ messages in thread
From: bot+bpf-ci @ 2026-04-20 11:36 UTC (permalink / raw)
  To: mahe.tardy, mahe.tardy
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3581 bytes --]

> diff --git a/net/core/filter.c b/net/core/filter.c
> --- a/net/core/filter.c
> +++ b/net/core/filter.c

[ ... ]

> @@ -12423,6 +12427,86 @@
>  	return 0;
>  }
>
> +static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
> +
> +/**
> + * bpf_icmp_send_unreach - Send ICMP destination unreachable error
> + * @skb: Packet that triggered the error
> + * @code: ICMP unreachable code (0-15 for IPv4, 0-6 for IPv6)
> + *
> + * Sends an ICMP destination unreachable message in response to the
> + * packet. The original packet is cloned before sending the ICMP error,
> + * so the BPF program can still let the packet pass if desired.
> + *
> + * Recursion protection: If called from a context that would trigger
> + * recursion (e.g., root cgroup processing its own ICMP packets),
> + * returns -EBUSY on re-entry.
> + *
> + * Return: 0 on success, negative error code on failure:
> + *         -EINVAL: Invalid code parameter
> + *         -ENOMEM: Memory allocation failed
> + *         -EHOSTUNREACH: Routing lookup failed
> + *         -EBUSY: Recursion detected
> + *         -EPROTONOSUPPORT: Non-IP protocol
> + */
> +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)__skb;
> +	struct sk_buff *nskb;
> +	bool *in_progress;
> +
> +	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
> +	if (*in_progress)
> +		return -EBUSY;
> +
> +	switch (skb->protocol) {
> +#if IS_ENABLED(CONFIG_INET)
> +	case htons(ETH_P_IP):
> +		if (code < 0 || code > NR_ICMP_UNREACH)
> +			return -EINVAL;
> +
> +		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +
> +		if (!skb_dst(nskb) && ip_route_reply_fetch_dst(nskb) < 0) {
> +			kfree_skb(nskb);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		*in_progress = true;
> +		icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
> +		*in_progress = false;
> +		kfree_skb(nskb);
> +		break;
> +#endif
> +#if IS_ENABLED(CONFIG_IPV6)
> +	case htons(ETH_P_IPV6):
> +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> +			return -EINVAL;
> +
> +		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +
> +		if (!skb_dst(nskb) && ip6_route_reply_fetch_dst(nskb) < 0) {
> +			kfree_skb(nskb);
> +			return -EHOSTUNREACH;
> +		}

Is the route lookup needed here? Looking at icmpv6_send() in
net/ipv6/icmp.c, it fills fl6.{daddr,saddr} and passes them to
icmpv6_route_lookup() which performs its own route lookup. The function
doesn't appear to use skb_dst() from the original skb:

icmpv6_send() {
    ...
    fl6.daddr = hdr->saddr;
    fl6.saddr = saddr;
    ...
    dst = icmpv6_route_lookup(net, skb, sk, &fl6);
    ...
}

The same applies to the IPv4 path with icmp_send(). Additionally, at
cgroup/egress the skb->_skb_refdst should already be set, and the same
should be true for cgroup/ingress for inet proto.

Does the kfunc work without ip_route_reply_fetch_dst() and
ip6_route_reply_fetch_dst()? If so, can these calls be removed?

Reference: Martin KaFai Lau's comment in v3 review
https://lore.kernel.org/bpf/df4b0996-3e88-4ea4-983b-82866455a6fc@linux.dev/

> +
> +		*in_progress = true;
> +		icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
> +		*in_progress = false;
> +		kfree_skb(nskb);
> +		break;
> +#endif
> +	default:
> +		return -EPROTONOSUPPORT;
> +	}
> +
> +	return 0;
> +}

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4
  2026-04-20 11:36                         ` bot+bpf-ci
@ 2026-04-20 13:04                           ` Mahe Tardy
  0 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2026-04-20 13:04 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai

On Mon, Apr 20, 2026 at 11:36:12AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/include/net/route.h b/include/net/route.h
> > index f90106f383c5..ec2466fd0bec 100644
> > --- a/include/net/route.h
> > +++ b/include/net/route.h
> > @@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
> >  				    const struct sock *sk);
> >  struct dst_entry *ipv4_blackhole_route(struct net *net,
> >  				       struct dst_entry *dst_orig);
> > +int ip_route_reply_fetch_dst(struct sk_buff *skb);
> >
> >  static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
> >  {
> 
> [ ... ]
> 
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index bc1296f0ea69..7091ef936073 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -2945,6 +2945,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
> >  }
> >  EXPORT_SYMBOL_GPL(ip_route_output_flow);
> >
> > +int ip_route_reply_fetch_dst(struct sk_buff *skb)
> > +{
> > +	struct rtable *rt;
> > +	struct flowi4 fl4 = {
> > +		.daddr = ip_hdr(skb)->saddr
> > +	};
> > +
> > +	rt = ip_route_output_key(dev_net(skb->dev), &fl4);
> > +	if (IS_ERR(rt))
> > +		return PTR_ERR(rt);
> > +	skb_dst_set(skb, &rt->dst);
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);
> 
> Is ip_route_reply_fetch_dst() actually needed before calling icmp_send()?
> 
> Martin KaFai Lau raised this question in v3 of the patch series and
> noted that icmpv6_send() does its own route lookup and doesn't use
> skb_dst() of the original skb. He also pointed out that at
> cgroup/egress, skb->_skb_refdst should already be set.
> 
> Looking at icmpv6_send(), the fl6.{daddr,saddr} are filled and passed
> to icmpv6_route_lookup, which performs the reverse/invert route lookup.
> The function doesn't appear to use skb_dst() of the original skb.
> 
> Since this patch moves ip_route_reply_fetch_dst() from netfilter to
> core specifically for the BPF kfunc, can you clarify why this routing
> step is necessary? Does the kfunc not work without
> ip[6]_route_reply_fetch_dst()?
> 
> Reference: https://lore.kernel.org/bpf/df4b0996-3e88-4ea4-983b-82866455a6fc@linux.dev/

Yes I replied to this in the cover letter, let me quote it:

Initially, this kfunc was added only to cgroup_skb programs, Alexei
suggested not creating its own kfunc set and adding it to the more
global bpf_kfunc_set_skb. Now that recursion is handled and I realized,
thanks to Martin, that fetching the dst route might be only useful in
situation in which the packet was not yet routed, I decided to extend
the kfunc to more program types and route the packet only if needed.

So in short, if we decide to go back to cgroup_skb only, we can remove
this, if we expand to other program types like we do, it still makes
sense to do it.

> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc
  2026-04-20 11:36                         ` bot+bpf-ci
@ 2026-04-20 13:07                           ` Mahe Tardy
  0 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2026-04-20 13:07 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai

On Mon, Apr 20, 2026 at 11:36:15AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> 
> [ ... ]
> 
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +	case htons(ETH_P_IPV6):
> > +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > +			return -EINVAL;
> > +
> > +		nskb = skb_clone(skb, GFP_ATOMIC);
> > +		if (!nskb)
> > +			return -ENOMEM;
> > +
> > +		if (!skb_dst(nskb) && ip6_route_reply_fetch_dst(nskb) < 0) {
> > +			kfree_skb(nskb);
> > +			return -EHOSTUNREACH;
> > +		}
> 
> Is the route lookup needed here? Looking at icmpv6_send() in
> net/ipv6/icmp.c, it fills fl6.{daddr,saddr} and passes them to
> icmpv6_route_lookup() which performs its own route lookup. The function
> doesn't appear to use skb_dst() from the original skb:
> 
> icmpv6_send() {
>     ...
>     fl6.daddr = hdr->saddr;
>     fl6.saddr = saddr;
>     ...
>     dst = icmpv6_route_lookup(net, skb, sk, &fl6);
>     ...
> }
> 
> The same applies to the IPv4 path with icmp_send(). Additionally, at
> cgroup/egress the skb->_skb_refdst should already be set, and the same
> should be true for cgroup/ingress for inet proto.
> 
> Does the kfunc work without ip_route_reply_fetch_dst() and
> ip6_route_reply_fetch_dst()? If so, can these calls be removed?
> 
> Reference: Martin KaFai Lau's comment in v3 review
> https://lore.kernel.org/bpf/df4b0996-3e88-4ea4-983b-82866455a6fc@linux.dev/

Same reply as https://lore.kernel.org/bpf/aeYkdKm7B4NQ3BDo@gmail.com/,
too bad the LLM can't access the cover letter.

> 
> > +
> > +		*in_progress = true;
> > +		icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
> > +		*in_progress = false;
> > +		kfree_skb(nskb);
> > +		break;
> > +#endif
> > +	default:
> > +		return -EPROTONOSUPPORT;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> [ ... ]
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests
  2026-04-20 11:36                         ` bot+bpf-ci
@ 2026-04-20 13:08                           ` Mahe Tardy
  0 siblings, 0 replies; 55+ messages in thread
From: Mahe Tardy @ 2026-04-20 13:08 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
	john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
	oe-kbuild-all, pablo, martin.lau, eddyz87, yonghong.song, clm,
	ihor.solodrai

On Mon, Apr 20, 2026 at 11:36:13AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> > new file mode 100644
> > index 000000000000..24d5e01cfe80
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> > @@ -0,0 +1,136 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <test_progs.h>
> > +#include <network_helpers.h>
> > +#include <linux/errqueue.h>
> > +#include "icmp_send_unreach.skel.h"
> > +
> > +#define TIMEOUT_MS 1000
> > +#define SRV_PORT 54321
> > +
> > +#define ICMP_DEST_UNREACH 3
> > +
> > +#define ICMP_FRAG_NEEDED 4
> > +#define NR_ICMP_UNREACH 15
> 
> Jordan Rife previously suggested in v3 that these ICMP constants should be
> imported from <linux/icmp.h> rather than redefined here. Would it be cleaner
> to include the header and avoid the redefinitions?

Yes it would be cleaner, but looks like I can't do it simply, maybe
there's a solution, see the answer from the cover-letter:

to Jordan: I couldn't include <linux/icmp.h> because of redefines from
<network_helpers.h>.

> 
> Reference: https://lore.kernel.org/bpf/usz5bhydsiejr37owgt3zypckzh7fa7ygmhsyaaiprsljx7iy5@ipopnr5n4ds7/
> 
> [ ... ]
> 
> 
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
> 
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4
  2026-04-20 10:58                       ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
  2026-04-20 11:36                         ` bot+bpf-ci
@ 2026-04-21 11:13                         ` sashiko-bot
  1 sibling, 0 replies; 55+ messages in thread
From: sashiko-bot @ 2026-04-21 11:13 UTC (permalink / raw)
  To: Mahe Tardy; +Cc: bpf

Sashiko AI review found 3 potential issue(s):
- [High] NULL Pointer Dereference on `skb->dev` in exported routing helper.
- [High] Missing Header Linearization and Bounds Check causing Out-of-Bounds Read.
- [Low] Misleading function name implies read-only operation while mutating SKB state.
--

commit d9b5a7cd6924dc358237afb642ea1eb45a09213a
Author: Mahe Tardy <mahe.tardy@gmail.com>

net: move netfilter nf_reject_fill_skb_dst to core ipv4

This patch moves and renames nf_reject_fill_skb_dst to
ip_route_reply_fetch_dst in ipv4/route.c to allow reuse by BPF kfuncs.
It also inlines the nf_ip_route wrapper around ip_route_output_key.

[ ... ]

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index bc1296f0ea69b..7091ef9360737 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2945,6 +2945,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
>  }
>  EXPORT_SYMBOL_GPL(ip_route_output_flow);
>  
> +int ip_route_reply_fetch_dst(struct sk_buff *skb)

This isn't a bug, but does the name ip_route_reply_fetch_dst imply a read-only
operation? Since it mutates the SKB by setting the destination via
skb_dst_set(), could this lead to callers accidentally overwriting and
leaking existing destinations?

Would a name like ip_route_reply_fill_skb_dst better reflect its side effects?

> +{
> +	struct rtable *rt;
> +	struct flowi4 fl4 = {
> +		.daddr = ip_hdr(skb)->saddr

Can this lead to an out-of-bounds read?

Now that this is exported for BPF kfuncs, it can be reached from early
hooks like TC ingress where the IPv4 header might not be linearized yet.

If a malformed packet contains less than sizeof(struct iphdr) bytes of
linear data, does accessing ->saddr require validation via
pskb_may_pull() or similar?

> +	};
> +
> +	rt = ip_route_output_key(dev_net(skb->dev), &fl4);

Can skb->dev be NULL here?

While safe in its original netfilter context, a BPF_PROG_TYPE_NETFILTER
program attached to the NF_INET_LOCAL_OUT hook will receive locally
generated SKBs before skb->dev is assigned (e.g., in __ip_local_out before
ip_output).

Does dev_net(skb->dev) need a check to prevent a kernel panic when called
from such contexts?

> +	if (IS_ERR(rt))
> +		return PTR_ERR(rt);
> +	skb_dst_set(skb, &rt->dst);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420105816.72168-1-mahe.tardy@gmail.com?part=1

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
  2026-04-20 10:58                       ` [PATCH bpf-next v4 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2026-04-21 11:13                         ` sashiko-bot
  0 siblings, 0 replies; 55+ messages in thread
From: sashiko-bot @ 2026-04-21 11:13 UTC (permalink / raw)
  To: Mahe Tardy; +Cc: bpf

Sashiko AI review found 2 potential issue(s):
- [High] Unchecked access to `skb->dev` causes a NULL pointer dereference kernel panic.
- [High] Dereferencing `ipv6_hdr(skb)->saddr` without prior linearization checks leads to out-of-bounds reads or page faults.
--

commit 71316e9661f6289f37134f91e40b833968dfa0d5
Author: Mahe Tardy <mahe.tardy@gmail.com>

net: move netfilter nf_reject6_fill_skb_dst to core ipv6

This commit moves and renames nf_reject6_fill_skb_dst to
ip6_route_reply_fetch_dst in ipv6/route.c. This allows the function
to be reused in subsequent patches by BPF kfuncs.

[ ... ]

> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 19eb6b7022278..41871fddec4d6 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c

[ ... ]

> @@ -2721,6 +2721,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
>  }
>  EXPORT_SYMBOL_GPL(ip6_route_output_flags);
>  
> +int ip6_route_reply_fetch_dst(struct sk_buff *skb)
> +{
> +	struct dst_entry *result;
> +	struct flowi6 fl = {
> +		.daddr = ipv6_hdr(skb)->saddr

Is it possible for the IPv6 header to span paged fragments here?

If the packet header isn't guaranteed to be in the linear data area, and
there are no linearization checks like pskb_may_pull() before accessing
ipv6_hdr(), could this read out-of-bounds memory or cause a page fault
when processing undersized or fragmented packets supplied by BPF programs?

> +	};
> +	int err;
> +
> +	result = ip6_route_output(dev_net(skb->dev), NULL, &fl);

Could skb->dev be NULL in contexts where this function will be used?

Since this is being exported for use by BPF kfuncs, it might be called from
program types where the device reference is not retained, such as socket
filters. If skb->dev is NULL in these contexts, won't dev_net(skb->dev)
cause a NULL pointer dereference?

> +	err = result->error;
> +	if (err)
> +		dst_release(result);
> +	else
> +		skb_dst_set(skb, result);
> +	return err;
> +}
> +EXPORT_SYMBOL_GPL(ip6_route_reply_fetch_dst);

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420105816.72168-1-mahe.tardy@gmail.com?part=2

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc
  2026-04-20 10:58                       ` [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc Mahe Tardy
  2026-04-20 11:36                         ` bot+bpf-ci
@ 2026-04-21 11:13                         ` sashiko-bot
  1 sibling, 0 replies; 55+ messages in thread
From: sashiko-bot @ 2026-04-21 11:13 UTC (permalink / raw)
  To: Mahe Tardy; +Cc: bpf

Sashiko AI review found 4 potential issue(s):
- [Critical] Unchecked `skb->dev` leads to a guaranteed NULL pointer dereference and kernel panic.
- [High] Missing SKB header linearization and bounds validation before accessing IP/IPv6 headers.
- [High] Uninitialized SKB control block (`skb->cb`) causes memory corruption and info leaks via IP options parsing.
- [High] Ad-hoc synchronization with the per-CPU flag `bpf_icmp_send_in_progress` causes false-positive failures and bypasses recursion protection under preemption.
--

commit 26bc7aced36645f274275d44dc5bb1fd8f53f864
Author: Mahe Tardy <mahe.tardy@gmail.com>

bpf: add bpf_icmp_send_unreach kfunc

This commit introduces the bpf_icmp_send_unreach kfunc to allow BPF programs
to send ICMP destination unreachable messages. It clones the original SKB to
ensure the packet can still proceed unmodified if desired, and includes a
per-CPU flag to prevent recursion.

> diff --git a/net/core/filter.c b/net/core/filter.c
> --- a/net/core/filter.c
> +++ b/net/core/filter.c

[ ... ]

> +static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
> +
> +/**
> + * bpf_icmp_send_unreach - Send ICMP destination unreachable error

[ ... ]

> +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
> +{
> +	struct sk_buff *skb = (struct sk_buff *)__skb;
> +	struct sk_buff *nskb;
> +	bool *in_progress;
> +
> +	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
> +	if (*in_progress)
> +		return -EBUSY;

BPF programs can run in preemptible contexts, such as when running with
migrate_disable() and rcu_read_lock(). Is it possible for a task to be
preempted between checking and setting this flag, or while icmp_send() is
executing?

If another task runs on the same CPU while the flag is set, it might
incorrectly fail to send an ICMP message. Alternatively, could a race allow
the recursion protection to be bypassed entirely if tasks interleave their
reads and writes to this flag?

> +
> +	switch (skb->protocol) {
> +#if IS_ENABLED(CONFIG_INET)
> +	case htons(ETH_P_IP):
> +		if (code < 0 || code > NR_ICMP_UNREACH)
> +			return -EINVAL;
> +
> +		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +
> +		if (!skb_dst(nskb) && ip_route_reply_fetch_dst(nskb) < 0) {

This kfunc is registered in bpf_kfunc_check_set_skb, exposing it to program
types like SOCKET_FILTER or SK_SKB. In these contexts, socket buffers
frequently have neither a destination entry (skb_dst is NULL) nor an
assigned network device (skb->dev is NULL).

When ip_route_reply_fetch_dst() executes, it calls dev_net(skb->dev). Does
this lead to a NULL pointer dereference if the network device is not set?

Additionally, BPF programs can arbitrarily modify the SKB length, and the
network header might reside in the non-linear paged area. Does this code
need to call pskb_network_may_pull() or similar to ensure the headers are
linearized?

ip_route_reply_fetch_dst() blindly dereferences ip_hdr(skb)->saddr, which
might result in out-of-bounds reads if the header is not safely in the
linear data area.

> +			kfree_skb(nskb);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		*in_progress = true;
> +		icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);

skb_clone() copies the original SKB's control block (skb->cb) verbatim. In
hooks like TC, this area contains subsystem-specific data.

icmp_send() casts this control block to struct inet_skb_parm via IPCB().
If the copied control block data happens to set opt.optlen > 0, will
__ip_options_echo() attempt to parse IP options using arbitrary offsets?

Does the new SKB's control block need to be zeroed before passing it to
the ICMP send functions to prevent reading uninitialized values?

> +		*in_progress = false;
> +		kfree_skb(nskb);
> +		break;
> +#endif
> +#if IS_ENABLED(CONFIG_IPV6)
> +	case htons(ETH_P_IPV6):
> +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> +			return -EINVAL;
> +
> +		nskb = skb_clone(skb, GFP_ATOMIC);
> +		if (!nskb)
> +			return -ENOMEM;
> +
> +		if (!skb_dst(nskb) && ip6_route_reply_fetch_dst(nskb) < 0) {

The same concerns about skb->dev being NULL and the need for IPv6 header
linearization apply here before calling ip6_route_reply_fetch_dst().

> +			kfree_skb(nskb);
> +			return -EHOSTUNREACH;
> +		}
> +
> +		*in_progress = true;
> +		icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);

Similarly, does nskb->cb need to be zeroed before calling icmpv6_send()?

> +		*in_progress = false;
> +		kfree_skb(nskb);
> +		break;
> +#endif

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420105816.72168-1-mahe.tardy@gmail.com?part=3

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests
  2026-04-20 10:58                       ` [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
  2026-04-20 11:36                         ` bot+bpf-ci
@ 2026-04-21 11:13                         ` sashiko-bot
  1 sibling, 0 replies; 55+ messages in thread
From: sashiko-bot @ 2026-04-21 11:13 UTC (permalink / raw)
  To: Mahe Tardy; +Cc: bpf

Sashiko AI review found 4 potential issue(s):
- [Medium] Race condition leading to test flakiness due to missing `poll()` before `recvmsg(MSG_ERRQUEUE)`.
- [Medium] Hardcoded server port can cause test collisions during parallel execution.
- [Low] Improper use of `ASSERT_EQ` for flow control and swapped assertion tags in `read_icmp_errqueue`.
- [Low] Factually incorrect comment regarding `SERVER_IP` byte order.
--

commit 954da1124af043d103f7e4f4435358fccf26853c
Author: Mahe Tardy <mahe.tardy@gmail.com>

selftests/bpf: add icmp_send_unreach kfunc tests

This commit adds tests for the icmp_send_unreach kfunc. It sets up a client
and server, uses a cgroup egress program to trigger an ICMP unreachable
message, and reads the error queue to confirm the correct code is received.

> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> new file mode 100644
> index 0000000000000..24d5e01cfe804
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
[ ... ]
> @@ -0,0 +1,136 @@
[ ... ]
> +#define TIMEOUT_MS 1000
> +#define SRV_PORT 54321

Can this cause intermittent bind() failures when selftests are run in
parallel? Since tests often execute concurrently, hardcoding a fixed port
like 54321 might collide with other tests or system processes. Could we
bind to port 0 to allow the OS to assign an ephemeral port, retrieve it via
getsockname(), and pass it to the BPF program dynamically?

[ ... ]
> +static void read_icmp_errqueue(int sockfd, int expected_code)
> +{
[ ... ]
> +	struct msghdr msg = {
> +		.msg_control = ctrl_buf,
> +		.msg_controllen = sizeof(ctrl_buf),
> +	};
> +
> +	n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);

Could this read the error queue before the ICMP message has been processed?

The non-blocking connect() returns -EINPROGRESS, and the ICMP unreachable
message is generated by the BPF program and processed asynchronously by
softirqs. Under heavy load, if the softirq processing is deferred, could
recvmsg() execute before the error queue is populated and cause a spurious
-EAGAIN failure?

Would it be safer to use poll() or select() to wait for the POLLERR event
on the socket before attempting the read?

> +	if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
> +		return;
> +
> +	cm = CMSG_FIRSTHDR(&msg);
> +	if (!ASSERT_NEQ(cm, NULL, "cm_firsthdr_null"))
> +		return;
> +
> +	for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
> +		if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
> +		    !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
> +			continue;

Does using ASSERT_EQ here cause the test to fail immediately if an unexpected
but benign control message is encountered? In BPF selftests, if ASSERT_EQ
evaluates to false, it logs a failure and marks the test as failed instead
of just skipping the message.

Also, it looks like the tags might be swapped. "cmsg_type" is used to label
the cm->cmsg_level check, and "cmsg_level" is used for the cm->cmsg_type
check.

[ ... ]
> diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> new file mode 100644
> index 0000000000000..6fc5595f08aa9
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
[ ... ]
> @@ -0,0 +1,36 @@
[ ... ]
> +#define SERVER_PORT 54321
> +/* 127.0.0.1 in network byte order */
> +#define SERVER_IP 0x7F000001

This isn't a bug since the code correctly calls bpf_htonl(SERVER_IP) later,
but the comment appears to be incorrect. The value 0x7F000001 represents
127.0.0.1 in host byte order, not network byte order.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420105816.72168-1-mahe.tardy@gmail.com?part=4

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 5/6] selftests/bpf: add icmp_send_unreach kfunc IPv6 tests
  2026-04-20 10:58                       ` [PATCH bpf-next v4 5/6] selftests/bpf: add icmp_send_unreach kfunc IPv6 tests Mahe Tardy
@ 2026-04-21 11:13                         ` sashiko-bot
  0 siblings, 0 replies; 55+ messages in thread
From: sashiko-bot @ 2026-04-21 11:13 UTC (permalink / raw)
  To: Mahe Tardy; +Cc: bpf

Sashiko AI review found 1 potential issue(s):
- [Medium] Hardcoded little-endian value for IPv6 address comparison causes test failure on big-endian architectures.
--

commit 27121b4497956027b4ace73b50126b8070a214c3
Author: Mahe Tardy <mahe.tardy@gmail.com>

selftests/bpf: add icmp_send_unreach kfunc IPv6 tests

This commit extends the existing IPv4 icmp_send_unreach kfunc tests to
support IPv6.

> diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> index 6fc5595f08aa9..112b9cbfab6fe 100644
> --- a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> +++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> @@ -6,6 +6,11 @@
>  #define SERVER_PORT 54321
>  /* 127.0.0.1 in network byte order */
>  #define SERVER_IP 0x7F000001
> +/* ::1 in network byte order */
> +#define SERVER_IP6_0 0x00000000
> +#define SERVER_IP6_1 0x00000000
> +#define SERVER_IP6_2 0x00000000
> +#define SERVER_IP6_3 0x01000000

Does this hardcoded value assume a little-endian architecture?

Since the IPv6 loopback address ::1 in network byte order ends with 0x01,
reading the last 4 bytes as a 32-bit integer yields 0x01000000 on
little-endian systems, but 0x00000001 on big-endian systems like s390x.

> @@ -16,17 +21,46 @@ int egress(struct __sk_buff *skb)
[ ... ]
> +		if (ip6h->daddr.in6_u.u6_addr32[0] != SERVER_IP6_0 ||
> +		    ip6h->daddr.in6_u.u6_addr32[1] != SERVER_IP6_1 ||
> +		    ip6h->daddr.in6_u.u6_addr32[2] != SERVER_IP6_2 ||
> +		    ip6h->daddr.in6_u.u6_addr32[3] != SERVER_IP6_3)
> +			return SK_PASS;

Will this comparison fail on big-endian systems, causing the BPF program to
erroneously ignore the loopback packets?

Could we use bpf_htonl(1) or similar here instead, so the test passes
regardless of the host's endianness?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420105816.72168-1-mahe.tardy@gmail.com?part=5

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH bpf-next v4 6/6] selftests/bpf: add icmp_send_unreach_recursion test
  2026-04-20 10:58                       ` [PATCH bpf-next v4 6/6] selftests/bpf: add icmp_send_unreach_recursion test Mahe Tardy
@ 2026-04-21 11:13                         ` sashiko-bot
  0 siblings, 0 replies; 55+ messages in thread
From: sashiko-bot @ 2026-04-21 11:13 UTC (permalink / raw)
  To: Mahe Tardy; +Cc: bpf

Sashiko AI review found 3 potential issue(s):
- [High] The `recursion` BPF program drops all unrelated IPv4 traffic on the system, causing severe network disruption and test flakiness.
- [Medium] The test leaks the cgroup testing environment by failing to call `cleanup_cgroup_environment()`.
- [Medium] The assignment to `rec_kfunc_rets` relies on an indeterminately sequenced evaluation order, making the test brittle and compiler-dependent.
--

commit 61d5cc3970ae5880f18e7f23d796670f691dcd7f
Author: Mahe Tardy <mahe.tardy@gmail.com>

selftests/bpf: add icmp_send_unreach_recursion test

This test is similar to icmp_send_unreach_kfunc but checks that, in case
of recursion, meaning that the BPF program calling the kfunc was
re-triggered by the icmp_send done by the kfunc, the kfunc will stop
early and return -EBUSY.

> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> index 047bfd4d80f75..a4f4324b2b99d 100644
> --- a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
[ ... ]
> @@ -157,3 +159,44 @@ void test_icmp_send_unreach_kfunc(void)
>  	icmp_send_unreach__destroy(skel);
>  	close(cgroup_fd);
>  }
> +
> +void test_icmp_send_unreach_recursion(void)
> +{
[ ... ]
> +	if (setup_cgroup_environment()) {
> +		fprintf(stderr, "Failed to setup cgroup environment\n");
> +		goto cleanup;
> +	}
> +
> +	cgroup_fd = get_root_cgroup();
> +	if (!ASSERT_GE(cgroup_fd, 0, "get_root_cgroup"))
> +		goto cleanup;
> +
> +	skel->links.recursion =
> +		bpf_program__attach_cgroup(skel->progs.recursion, cgroup_fd);

Since this test manually invokes setup_cgroup_environment(), shouldn't there
be a corresponding call to cleanup_cgroup_environment() in the cleanup path?

By bypassing the test__join_cgroup() wrapper, the test framework won't
automatically clean up the cgroup state, leaving the test worker in a modified
mount namespace which might cause subsequent tests to fail.

[ ... ]
> diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> index 112b9cbfab6fe..9aca7c0b12e1f 100644
> --- a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> +++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
[ ... ]
> @@ -67,4 +70,31 @@ int egress(struct __sk_buff *skb)
>  	return SK_DROP;
>  }
>  
> +SEC("cgroup_skb/egress")
> +int recursion(struct __sk_buff *skb)
> +{
> +	void *data = (void *)(long)skb->data;
> +	void *data_end = (void *)(long)skb->data_end;
> +	struct iphdr *iph;
> +
> +	iph = data;
> +	if ((void *)(iph + 1) > data_end || iph->version != 4)
> +		return SK_PASS;
> +

Because this BPF program is attached to the system-wide root cgroup, will this
unconditionally drop all unrelated IPv4 traffic on the system?

Unlike the egress program which explicitly filters for IPPROTO_TCP and
SERVER_IP, this program intercepts all outbound IPv4 traffic, drops it
(returning SK_DROP), and sends ICMP errors.

This could disrupt host networking and cause test flakiness if background
traffic concurrently increments rec_count. Could the BPF program be updated
to explicitly filter for the test's traffic?

> +	/* This call will provoke a recursion: the ICMP package generated by the
> +	 * kfunc will re-trigger this program since we are in the root cgroup in
> +	 * which the kernel ICMP socket belongs. However when re-entering the
> +	 * kfunc, it should return EBUSY.
> +	 */
> +	rec_kfunc_rets[rec_count & 1] =
> +		bpf_icmp_send_unreach(skb, unreach_code);

Is it possible for the array index (rec_count & 1) to be evaluated before the
function call returns?

In C, the evaluation order of the left-hand side and right-hand side of an
assignment is not sequenced. If the compiler evaluates the array index before
the function call, both the outer and inner executions could use
rec_count == 0. This would cause the outer execution to overwrite the inner
execution's -EBUSY result with 0.

Would it be safer to store the kfunc return value in a local variable before
assigning it to the array?

> +	__sync_fetch_and_add(&rec_count, 1);
> +
> +	/* Let the first ICMP error message pass */
> +	if (iph->protocol == IPPROTO_ICMP)
> +		return SK_PASS;
> +
> +	return SK_DROP;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260420105816.72168-1-mahe.tardy@gmail.com?part=6

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2026-04-21 11:13 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-10 10:26 [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
2025-07-10 10:26 ` [PATCH bpf-next v1 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2025-07-10 10:26 ` [PATCH bpf-next v1 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
2025-07-10 22:02   ` kernel test robot
2025-07-10 10:26 ` [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
2025-07-10 16:07   ` Alexei Starovoitov
2025-07-11 10:57     ` Mahe Tardy
2025-07-25 18:53     ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
2025-07-25 18:53       ` [PATCH bpf-next v2 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2025-07-25 18:53       ` [PATCH bpf-next v2 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
2025-07-25 18:53       ` [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
2025-07-27  1:49         ` kernel test robot
2025-07-28  9:43           ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
2025-07-28  9:43             ` [PATCH bpf-next v3 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2025-07-28  9:43             ` [PATCH bpf-next v3 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
2025-07-28  9:43             ` [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
2025-07-28 20:10               ` kernel test robot
2025-07-29  1:05               ` Martin KaFai Lau
2025-07-29 10:06                 ` Mahe Tardy
2025-07-29 23:13                   ` Martin KaFai Lau
2025-07-28  9:43             ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
2025-07-28 15:40               ` Yonghong Song
2025-07-28 15:59                 ` Mahe Tardy
2025-07-29  1:18               ` Martin KaFai Lau
2025-07-29  9:09                 ` Mahe Tardy
2025-07-29 23:27                   ` Martin KaFai Lau
2025-07-30  0:01                     ` Martin KaFai Lau
2025-07-30  0:32                       ` Martin KaFai Lau
2025-08-05 23:26               ` Jordan Rife
2025-07-29  1:21             ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Martin KaFai Lau
2025-07-29  9:53               ` Mahe Tardy
2025-07-30  1:54                 ` Martin KaFai Lau
2025-08-01 18:50                   ` Mahe Tardy
2026-04-20 10:58                     ` [PATCH bpf-next v4 0/6] " Mahe Tardy
2026-04-20 10:58                       ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2026-04-20 11:36                         ` bot+bpf-ci
2026-04-20 13:04                           ` Mahe Tardy
2026-04-21 11:13                         ` sashiko-bot
2026-04-20 10:58                       ` [PATCH bpf-next v4 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
2026-04-21 11:13                         ` sashiko-bot
2026-04-20 10:58                       ` [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc Mahe Tardy
2026-04-20 11:36                         ` bot+bpf-ci
2026-04-20 13:07                           ` Mahe Tardy
2026-04-21 11:13                         ` sashiko-bot
2026-04-20 10:58                       ` [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
2026-04-20 11:36                         ` bot+bpf-ci
2026-04-20 13:08                           ` Mahe Tardy
2026-04-21 11:13                         ` sashiko-bot
2026-04-20 10:58                       ` [PATCH bpf-next v4 5/6] selftests/bpf: add icmp_send_unreach kfunc IPv6 tests Mahe Tardy
2026-04-21 11:13                         ` sashiko-bot
2026-04-20 10:58                       ` [PATCH bpf-next v4 6/6] selftests/bpf: add icmp_send_unreach_recursion test Mahe Tardy
2026-04-21 11:13                         ` sashiko-bot
2025-07-25 18:53       ` [PATCH bpf-next v2 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
2025-07-11  0:32   ` [PATCH bpf-next v1 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc kernel test robot
2025-07-10 10:26 ` [PATCH bpf-next v1 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox