* [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc
[not found] <CAADnVQKq_-=N7eJoup6AqFngoocT+D02NF0md_3mi2Vcrw09nQ@mail.gmail.com>
@ 2025-07-25 18:53 ` Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
` (3 more replies)
0 siblings, 4 replies; 40+ messages in thread
From: Mahe Tardy @ 2025-07-25 18:53 UTC (permalink / raw)
To: alexei.starovoitov
Cc: andrii, ast, bpf, daniel, john.fastabend, mahe.tardy, martin.lau,
fw, netfilter-devel, pablo, netdev, coreteam
Hello,
This is v2 of adding the icmp_send_unreach kfunc, as suggested during
LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
actively reject east-west traffic, similarly to what is possible to do
with netfilter reject target.
The first step to implement this is using ICMP control messages, with
the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
than a TCP RST reply and will already hint the client TCP stack to abort
the connection and not retry extensively.
Note that this is different than the sock_destroy kfunc, that along
calls tcp_abort and thus sends a reset, destroying the underlying
socket.
Caveats of this kfunc design are that a cgroup_skb program can call this
function N times, thus send N ICMP unreach control messages and that the
program can return from the BPF filter with SK_PASS leading to a
potential confusing situation where the TCP connection was established
while the client received ICMP_DEST_UNREACH messages.
Another more sophisticated design idea would be for the kfunc to set the
kernel to send an ICMP_HOST_UNREACH control message with the appropriate
code when the cgroup_skb program terminates with SK_DROP. Creating a new
'SK_REJECT' return code for cgroup_skb program was generally rejected
and would be too limited for other program types support.
We should bear in mind that we want to add a TCP reset kfunc next and
also could extend this kfunc to other program types if wanted.
v2 updates:
- fix a build error from a missing function call rename;
- avoid changing return line in bpf_kfunc_init;
- return SK_DROP from the kfunc (similarly to bpf_redirect);
- check the return value in the selftest.
[^1]: https://lwn.net/Articles/1022034/
Mahe Tardy (4):
net: move netfilter nf_reject_fill_skb_dst to core ipv4
net: move netfilter nf_reject6_fill_skb_dst to core ipv6
bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
selftests/bpf: add icmp_send_unreach kfunc tests
include/net/ip6_route.h | 2 +
include/net/route.h | 1 +
net/core/filter.c | 59 +++++++++++
net/ipv4/netfilter/nf_reject_ipv4.c | 19 +---
net/ipv4/route.c | 15 +++
net/ipv6/netfilter/nf_reject_ipv6.c | 17 +---
net/ipv6/route.c | 18 ++++
.../bpf/prog_tests/icmp_send_unreach_kfunc.c | 99 +++++++++++++++++++
.../selftests/bpf/progs/icmp_send_unreach.c | 36 +++++++
9 files changed, 233 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c
--
2.34.1
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH bpf-next v2 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4
2025-07-25 18:53 ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
@ 2025-07-25 18:53 ` Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
` (2 subsequent siblings)
3 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2025-07-25 18:53 UTC (permalink / raw)
To: alexei.starovoitov
Cc: andrii, ast, bpf, daniel, john.fastabend, mahe.tardy, martin.lau,
fw, netfilter-devel, pablo, netdev, coreteam
Move and rename nf_reject_fill_skb_dst from
ipv4/netfilter/nf_reject_ipv4 to ip_route_reply_fetch_dst in
ipv4/route.c so that it can be reused in the following patches by BPF
kfuncs.
Netfilter uses nf_ip_route that is almost a transparent wrapper around
ip_route_output_key so this patch inlines it.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
include/net/route.h | 1 +
net/ipv4/netfilter/nf_reject_ipv4.c | 19 ++-----------------
net/ipv4/route.c | 15 +++++++++++++++
3 files changed, 18 insertions(+), 17 deletions(-)
diff --git a/include/net/route.h b/include/net/route.h
index 8e39aa822cf9..1f032f768d52 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
const struct sock *sk);
struct dst_entry *ipv4_blackhole_route(struct net *net,
struct dst_entry *dst_orig);
+int ip_route_reply_fetch_dst(struct sk_buff *skb);
static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
{
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index 87fd945a0d27..76beb78f556a 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -220,21 +220,6 @@ void nf_reject_ip_tcphdr_put(struct sk_buff *nskb, const struct sk_buff *oldskb,
}
EXPORT_SYMBOL_GPL(nf_reject_ip_tcphdr_put);
-static int nf_reject_fill_skb_dst(struct sk_buff *skb_in)
-{
- struct dst_entry *dst = NULL;
- struct flowi fl;
-
- memset(&fl, 0, sizeof(struct flowi));
- fl.u.ip4.daddr = ip_hdr(skb_in)->saddr;
- nf_ip_route(dev_net(skb_in->dev), &dst, &fl, false);
- if (!dst)
- return -1;
-
- skb_dst_set(skb_in, dst);
- return 0;
-}
-
/* Send RST reply */
void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
int hook)
@@ -248,7 +233,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
return;
if ((hook == NF_INET_PRE_ROUTING || hook == NF_INET_INGRESS) &&
- nf_reject_fill_skb_dst(oldskb) < 0)
+ ip_route_reply_fetch_dst(oldskb) < 0)
return;
if (skb_rtable(oldskb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -322,7 +307,7 @@ void nf_send_unreach(struct sk_buff *skb_in, int code, int hook)
return;
if ((hook == NF_INET_PRE_ROUTING || hook == NF_INET_INGRESS) &&
- nf_reject_fill_skb_dst(skb_in) < 0)
+ ip_route_reply_fetch_dst(skb_in) < 0)
return;
if (skb_csum_unnecessary(skb_in) ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index fccb05fb3a79..59b8fc3c01c0 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2934,6 +2934,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
}
EXPORT_SYMBOL_GPL(ip_route_output_flow);
+int ip_route_reply_fetch_dst(struct sk_buff *skb)
+{
+ struct rtable *rt;
+ struct flowi4 fl4 = {
+ .daddr = ip_hdr(skb)->saddr
+ };
+
+ rt = ip_route_output_key(dev_net(skb->dev), &fl4);
+ if (IS_ERR(rt))
+ return PTR_ERR(rt);
+ skb_dst_set(skb, &rt->dst);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);
+
/* called with rcu_read_lock held */
static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
struct rtable *rt, u32 table_id, dscp_t dscp,
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v2 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
2025-07-25 18:53 ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2025-07-25 18:53 ` Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
3 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2025-07-25 18:53 UTC (permalink / raw)
To: alexei.starovoitov
Cc: andrii, ast, bpf, daniel, john.fastabend, mahe.tardy, martin.lau,
fw, netfilter-devel, pablo, netdev, coreteam
Move and rename nf_reject6_fill_skb_dst from
ipv6/netfilter/nf_reject_ipv6 to ip6_route_reply_fetch_dst in
ipv6/route.c so that it can be reused in the following patches by BPF
kfuncs.
Netfilter uses nf_ip6_route that is almost a transparent wrapper around
ip6_route_outputy so this patch inlines it.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
include/net/ip6_route.h | 2 ++
net/ipv6/netfilter/nf_reject_ipv6.c | 17 +----------------
net/ipv6/route.c | 18 ++++++++++++++++++
3 files changed, 21 insertions(+), 16 deletions(-)
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 6dbdf60b342f..1426467df547 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -93,6 +93,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
return ip6_route_output_flags(net, sk, fl6, 0);
}
+int ip6_route_reply_fetch_dst(struct sk_buff *skb);
+
/* Only conditionally release dst if flags indicates
* !RT6_LOOKUP_F_DST_NOREF or dst is in uncached_list.
*/
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index 9ae2b2725bf9..994a3b88ac52 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -250,21 +250,6 @@ void nf_reject_ip6_tcphdr_put(struct sk_buff *nskb,
}
EXPORT_SYMBOL_GPL(nf_reject_ip6_tcphdr_put);
-static int nf_reject6_fill_skb_dst(struct sk_buff *skb_in)
-{
- struct dst_entry *dst = NULL;
- struct flowi fl;
-
- memset(&fl, 0, sizeof(struct flowi));
- fl.u.ip6.daddr = ipv6_hdr(skb_in)->saddr;
- nf_ip6_route(dev_net(skb_in->dev), &dst, &fl, false);
- if (!dst)
- return -1;
-
- skb_dst_set(skb_in, dst);
- return 0;
-}
-
void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb,
int hook)
{
@@ -398,7 +383,7 @@ void nf_send_unreach6(struct net *net, struct sk_buff *skb_in,
skb_in->dev = net->loopback_dev;
if ((hooknum == NF_INET_PRE_ROUTING || hooknum == NF_INET_INGRESS) &&
- nf_reject6_fill_skb_dst(skb_in) < 0)
+ ip6_route_reply_fetch_dst(skb_in) < 0)
return;
icmpv6_send(skb_in, ICMPV6_DEST_UNREACH, code, 0);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0d5464c64965..de61540f9524 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2705,6 +2705,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
}
EXPORT_SYMBOL_GPL(ip6_route_output_flags);
+int ip6_route_reply_fetch_dst(struct sk_buff *skb)
+{
+ struct dst_entry *result;
+ struct flowi6 fl = {
+ .daddr = ipv6_hdr(skb)->saddr
+ };
+ int err;
+
+ result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
+ err = result->error;
+ if (err)
+ dst_release(result);
+ else
+ skb_dst_set(skb, result);
+ return err;
+}
+EXPORT_SYMBOL_GPL(ip6_route_reply_fetch_dst);
+
struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_orig)
{
struct rt6_info *rt, *ort = dst_rt6_info(dst_orig);
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
2025-07-25 18:53 ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2025-07-25 18:53 ` Mahe Tardy
2025-07-27 1:49 ` kernel test robot
2025-07-25 18:53 ` [PATCH bpf-next v2 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
3 siblings, 1 reply; 40+ messages in thread
From: Mahe Tardy @ 2025-07-25 18:53 UTC (permalink / raw)
To: alexei.starovoitov
Cc: andrii, ast, bpf, daniel, john.fastabend, mahe.tardy, martin.lau,
fw, netfilter-devel, pablo, netdev, coreteam
This is needed in the context of Tetragon to provide improved feedback
(in contrast to just dropping packets) to east-west traffic when blocked
by policies using cgroup_skb programs.
This reuse concepts from netfilter reject target codepath with the
differences that:
* Packets are cloned since the BPF user can still return SK_PASS from
the cgroup_skb progs and the current skb need to stay untouched
(cgroup_skb hooks only allow read-only skb payload).
* Since cgroup_skb programs are called late in the stack, checksums do
not need to be computed or verified, and IPv4 fragmentation does not
need to be checked (ip_local_deliver should take care of that
earlier).
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
net/core/filter.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 59 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index 7a72f766aacf..ff7b2b175425 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,8 @@
#include <linux/un.h>
#include <net/xdp_sock_drv.h>
#include <net/inet_dscp.h>
+#include <linux/icmp.h>
+#include <net/icmp.h>
#include "dev.h"
@@ -12148,6 +12150,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
return 0;
}
+__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
+{
+ struct sk_buff *skb = (struct sk_buff *)__skb;
+ struct sk_buff *nskb;
+
+ switch (skb->protocol) {
+ case htons(ETH_P_IP):
+ if (code < 0 || code > NR_ICMP_UNREACH)
+ return -EINVAL;
+
+ nskb = skb_clone(skb, GFP_ATOMIC);
+ if (!nskb)
+ return -ENOMEM;
+
+ if (ip_route_reply_fetch_dst(nskb) < 0) {
+ kfree_skb(nskb);
+ return -EHOSTUNREACH;
+ }
+
+ icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
+ kfree_skb(nskb);
+ break;
+#if IS_ENABLED(CONFIG_IPV6)
+ case htons(ETH_P_IPV6):
+ if (code < 0 || code > ICMPV6_REJECT_ROUTE)
+ return -EINVAL;
+
+ nskb = skb_clone(skb, GFP_ATOMIC);
+ if (!nskb)
+ return -ENOMEM;
+
+ if (ip6_route_reply_fetch_dst(nskb) < 0) {
+ kfree_skb(nskb);
+ return -EHOSTUNREACH;
+ }
+
+ icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
+ kfree_skb(nskb);
+ break;
+#endif
+ default:
+ return -EPROTONOSUPPORT;
+ }
+
+ return SK_DROP;
+}
+
__bpf_kfunc_end_defs();
int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12185,6 +12234,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
+BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send_unreach)
+BTF_ID_FLAGS(func, bpf_icmp_send_unreach, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send_unreach)
+
static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
.owner = THIS_MODULE,
.set = &bpf_kfunc_check_set_skb,
@@ -12210,6 +12263,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
.set = &bpf_kfunc_check_set_sock_ops,
};
+static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send_unreach = {
+ .owner = THIS_MODULE,
+ .set = &bpf_kfunc_check_set_icmp_send_unreach,
+};
+
static int __init bpf_kfunc_init(void)
{
int ret;
@@ -12229,6 +12287,7 @@ static int __init bpf_kfunc_init(void)
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
&bpf_kfunc_set_sock_addr);
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
+ ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send_unreach);
return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
}
late_initcall(bpf_kfunc_init);
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v2 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
2025-07-25 18:53 ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
` (2 preceding siblings ...)
2025-07-25 18:53 ` [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
@ 2025-07-25 18:53 ` Mahe Tardy
3 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2025-07-25 18:53 UTC (permalink / raw)
To: alexei.starovoitov
Cc: andrii, ast, bpf, daniel, john.fastabend, mahe.tardy, martin.lau,
fw, netfilter-devel, pablo, netdev, coreteam
This test opens a server and client, attach a cgroup_skb program on
egress and calls the icmp_send_unreach function from the client egress
so that an ICMP unreach control message is sent back to the client.
It then fetches the message from the error queue to confirm the correct
ICMP unreach code has been sent.
Note that the BPF program returns SK_PASS to let the connection being
established to finish the test cases quicker. Otherwise, you have to
wait for the TCP three-way handshake to timeout in the kernel and
retrieve the errno translated from the unreach code set by the ICMP
control message.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
.../bpf/prog_tests/icmp_send_unreach_kfunc.c | 99 +++++++++++++++++++
.../selftests/bpf/progs/icmp_send_unreach.c | 36 +++++++
2 files changed, 135 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c
diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
new file mode 100644
index 000000000000..414c1ed8ced3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <linux/errqueue.h>
+#include "icmp_send_unreach.skel.h"
+
+#define TIMEOUT_MS 1000
+#define SRV_PORT 54321
+
+#define ICMP_DEST_UNREACH 3
+
+#define ICMP_FRAG_NEEDED 4
+#define NR_ICMP_UNREACH 15
+
+static void read_icmp_errqueue(int sockfd, int expected_code)
+{
+ ssize_t n;
+ struct sock_extended_err *sock_err;
+ struct cmsghdr *cm;
+ char ctrl_buf[512];
+ struct msghdr msg = {
+ .msg_control = ctrl_buf,
+ .msg_controllen = sizeof(ctrl_buf),
+ };
+
+ n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
+ if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
+ return;
+
+ for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+ if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
+ !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
+ continue;
+
+ sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
+
+ if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
+ "sock_err_origin_icmp"))
+ return;
+ if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+ "sock_err_type_dest_unreach"))
+ return;
+ ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
+ }
+}
+
+void test_icmp_send_unreach_kfunc(void)
+{
+ struct icmp_send_unreach *skel;
+ int cgroup_fd = -1, client_fd = 1, srv_fd = -1;
+ int *code;
+
+ skel = icmp_send_unreach__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open"))
+ goto cleanup;
+
+ cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
+ if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+ goto cleanup;
+
+ skel->links.egress =
+ bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
+ if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
+ goto cleanup;
+
+ code = &skel->bss->unreach_code;
+
+ for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
+ // The TCP stack reacts differently when asking for
+ // fragmentation, let's ignore it for now
+ if (*code == ICMP_FRAG_NEEDED)
+ continue;
+
+ skel->bss->kfunc_ret = -1;
+
+ srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
+ SRV_PORT, TIMEOUT_MS);
+ if (!ASSERT_GE(srv_fd, 0, "start_server"))
+ goto for_cleanup;
+
+ client_fd = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(client_fd, 0, "client_socket");
+
+ client_fd = connect_to_fd(srv_fd, 0);
+ if (!ASSERT_GE(client_fd, 0, "client_connect"))
+ goto for_cleanup;
+
+ read_icmp_errqueue(client_fd, *code);
+
+ ASSERT_EQ(skel->bss->kfunc_ret, SK_DROP, "kfunc_ret");
+for_cleanup:
+ close(client_fd);
+ close(srv_fd);
+ }
+
+cleanup:
+ icmp_send_unreach__destroy(skel);
+ close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
new file mode 100644
index 000000000000..15783e5d1d65
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
+
+int unreach_code = 0;
+int kfunc_ret = 0;
+
+#define SERVER_PORT 54321
+#define SERVER_IP 0x7F000001
+
+SEC("cgroup_skb/egress")
+int egress(struct __sk_buff *skb)
+{
+ void *data = (void *)(long)skb->data;
+ void *data_end = (void *)(long)skb->data_end;
+ struct iphdr *iph;
+ struct tcphdr *tcph;
+
+ iph = data;
+ if ((void *)(iph + 1) > data_end || iph->version != 4 ||
+ iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+ return SK_PASS;
+
+ tcph = (void *)iph + iph->ihl * 4;
+ if ((void *)(tcph + 1) > data_end ||
+ tcph->dest != bpf_htons(SERVER_PORT))
+ return SK_PASS;
+
+ kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
+
+ /* returns SK_PASS to execute the test case quicker */
+ return SK_PASS;
+}
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
2025-07-25 18:53 ` [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
@ 2025-07-27 1:49 ` kernel test robot
2025-07-28 9:43 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
0 siblings, 1 reply; 40+ messages in thread
From: kernel test robot @ 2025-07-27 1:49 UTC (permalink / raw)
To: Mahe Tardy, alexei.starovoitov
Cc: oe-kbuild-all, andrii, ast, bpf, daniel, john.fastabend,
mahe.tardy, martin.lau, fw, netfilter-devel, pablo, netdev,
coreteam
Hi Mahe,
kernel test robot noticed the following build errors:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Mahe-Tardy/net-move-netfilter-nf_reject_fill_skb_dst-to-core-ipv4/20250726-030109
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20250725185342.262067-4-mahe.tardy%40gmail.com
patch subject: [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
config: arm64-defconfig (https://download.01.org/0day-ci/archive/20250727/202507270940.kXGmRbg5-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 15.1.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250727/202507270940.kXGmRbg5-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507270940.kXGmRbg5-lkp@intel.com/
All errors (new ones prefixed by >>):
aarch64-linux-ld: Unexpected GOT/PLT entries detected!
aarch64-linux-ld: Unexpected run-time procedure linkages detected!
aarch64-linux-ld: net/core/filter.o: in function `bpf_icmp_send_unreach':
>> net/core/filter.c:12184:(.text+0x14574): undefined reference to `ip6_route_reply_fetch_dst'
vim +12184 net/core/filter.c
12152
12153 __bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
12154 {
12155 struct sk_buff *skb = (struct sk_buff *)__skb;
12156 struct sk_buff *nskb;
12157
12158 switch (skb->protocol) {
12159 case htons(ETH_P_IP):
12160 if (code < 0 || code > NR_ICMP_UNREACH)
12161 return -EINVAL;
12162
12163 nskb = skb_clone(skb, GFP_ATOMIC);
12164 if (!nskb)
12165 return -ENOMEM;
12166
12167 if (ip_route_reply_fetch_dst(nskb) < 0) {
12168 kfree_skb(nskb);
12169 return -EHOSTUNREACH;
12170 }
12171
12172 icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
12173 kfree_skb(nskb);
12174 break;
12175 #if IS_ENABLED(CONFIG_IPV6)
12176 case htons(ETH_P_IPV6):
12177 if (code < 0 || code > ICMPV6_REJECT_ROUTE)
12178 return -EINVAL;
12179
12180 nskb = skb_clone(skb, GFP_ATOMIC);
12181 if (!nskb)
12182 return -ENOMEM;
12183
12184 if (ip6_route_reply_fetch_dst(nskb) < 0) {
12185 kfree_skb(nskb);
12186 return -EHOSTUNREACH;
12187 }
12188
12189 icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
12190 kfree_skb(nskb);
12191 break;
12192 #endif
12193 default:
12194 return -EPROTONOSUPPORT;
12195 }
12196
12197 return SK_DROP;
12198 }
12199
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc
2025-07-27 1:49 ` kernel test robot
@ 2025-07-28 9:43 ` Mahe Tardy
2025-07-28 9:43 ` [PATCH bpf-next v3 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
` (4 more replies)
0 siblings, 5 replies; 40+ messages in thread
From: Mahe Tardy @ 2025-07-28 9:43 UTC (permalink / raw)
To: lkp
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, mahe.tardy, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
Hello,
This is v3 of adding the icmp_send_unreach kfunc, as suggested during
LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
actively reject east-west traffic, similarly to what is possible to do
with netfilter reject target.
The first step to implement this is using ICMP control messages, with
the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
than a TCP RST reply and will already hint the client TCP stack to abort
the connection and not retry extensively.
Note that this is different than the sock_destroy kfunc, that along
calls tcp_abort and thus sends a reset, destroying the underlying
socket.
Caveats of this kfunc design are that a cgroup_skb program can call this
function N times, thus send N ICMP unreach control messages and that the
program can return from the BPF filter with SK_PASS leading to a
potential confusing situation where the TCP connection was established
while the client received ICMP_DEST_UNREACH messages.
Another more sophisticated design idea would be for the kfunc to set the
kernel to send an ICMP_HOST_UNREACH control message with the appropriate
code when the cgroup_skb program terminates with SK_DROP. Creating a new
'SK_REJECT' return code for cgroup_skb program was generally rejected
and would be too limited for other program types support.
We should bear in mind that we want to add a TCP reset kfunc next and
also could extend this kfunc to other program types if wanted.
v2 updates:
- fix a build error from a missing function call rename;
- avoid changing return line in bpf_kfunc_init;
- return SK_DROP from the kfunc (similarly to bpf_redirect);
- check the return value in the selftest.
v3 update:
- fix an undefined reference build error.
[^1]: https://lwn.net/Articles/1022034/
Mahe Tardy (4):
net: move netfilter nf_reject_fill_skb_dst to core ipv4
net: move netfilter nf_reject6_fill_skb_dst to core ipv6
bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
selftests/bpf: add icmp_send_unreach kfunc tests
include/net/ip6_route.h | 2 +
include/net/route.h | 1 +
net/core/filter.c | 61 ++++++++++++
net/ipv4/netfilter/nf_reject_ipv4.c | 19 +---
net/ipv4/route.c | 15 +++
net/ipv6/netfilter/nf_reject_ipv6.c | 17 +---
net/ipv6/route.c | 18 ++++
.../bpf/prog_tests/icmp_send_unreach_kfunc.c | 99 +++++++++++++++++++
.../selftests/bpf/progs/icmp_send_unreach.c | 36 +++++++
9 files changed, 235 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c
--
2.34.1
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH bpf-next v3 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4
2025-07-28 9:43 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
@ 2025-07-28 9:43 ` Mahe Tardy
2025-07-28 9:43 ` [PATCH bpf-next v3 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
` (3 subsequent siblings)
4 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2025-07-28 9:43 UTC (permalink / raw)
To: lkp
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, mahe.tardy, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
Move and rename nf_reject_fill_skb_dst from
ipv4/netfilter/nf_reject_ipv4 to ip_route_reply_fetch_dst in
ipv4/route.c so that it can be reused in the following patches by BPF
kfuncs.
Netfilter uses nf_ip_route that is almost a transparent wrapper around
ip_route_output_key so this patch inlines it.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
include/net/route.h | 1 +
net/ipv4/netfilter/nf_reject_ipv4.c | 19 ++-----------------
net/ipv4/route.c | 15 +++++++++++++++
3 files changed, 18 insertions(+), 17 deletions(-)
diff --git a/include/net/route.h b/include/net/route.h
index 8e39aa822cf9..1f032f768d52 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
const struct sock *sk);
struct dst_entry *ipv4_blackhole_route(struct net *net,
struct dst_entry *dst_orig);
+int ip_route_reply_fetch_dst(struct sk_buff *skb);
static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
{
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index 87fd945a0d27..76beb78f556a 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -220,21 +220,6 @@ void nf_reject_ip_tcphdr_put(struct sk_buff *nskb, const struct sk_buff *oldskb,
}
EXPORT_SYMBOL_GPL(nf_reject_ip_tcphdr_put);
-static int nf_reject_fill_skb_dst(struct sk_buff *skb_in)
-{
- struct dst_entry *dst = NULL;
- struct flowi fl;
-
- memset(&fl, 0, sizeof(struct flowi));
- fl.u.ip4.daddr = ip_hdr(skb_in)->saddr;
- nf_ip_route(dev_net(skb_in->dev), &dst, &fl, false);
- if (!dst)
- return -1;
-
- skb_dst_set(skb_in, dst);
- return 0;
-}
-
/* Send RST reply */
void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
int hook)
@@ -248,7 +233,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
return;
if ((hook == NF_INET_PRE_ROUTING || hook == NF_INET_INGRESS) &&
- nf_reject_fill_skb_dst(oldskb) < 0)
+ ip_route_reply_fetch_dst(oldskb) < 0)
return;
if (skb_rtable(oldskb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -322,7 +307,7 @@ void nf_send_unreach(struct sk_buff *skb_in, int code, int hook)
return;
if ((hook == NF_INET_PRE_ROUTING || hook == NF_INET_INGRESS) &&
- nf_reject_fill_skb_dst(skb_in) < 0)
+ ip_route_reply_fetch_dst(skb_in) < 0)
return;
if (skb_csum_unnecessary(skb_in) ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index fccb05fb3a79..59b8fc3c01c0 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2934,6 +2934,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
}
EXPORT_SYMBOL_GPL(ip_route_output_flow);
+int ip_route_reply_fetch_dst(struct sk_buff *skb)
+{
+ struct rtable *rt;
+ struct flowi4 fl4 = {
+ .daddr = ip_hdr(skb)->saddr
+ };
+
+ rt = ip_route_output_key(dev_net(skb->dev), &fl4);
+ if (IS_ERR(rt))
+ return PTR_ERR(rt);
+ skb_dst_set(skb, &rt->dst);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);
+
/* called with rcu_read_lock held */
static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
struct rtable *rt, u32 table_id, dscp_t dscp,
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v3 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
2025-07-28 9:43 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
2025-07-28 9:43 ` [PATCH bpf-next v3 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2025-07-28 9:43 ` Mahe Tardy
2025-07-28 9:43 ` [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
` (2 subsequent siblings)
4 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2025-07-28 9:43 UTC (permalink / raw)
To: lkp
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, mahe.tardy, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
Move and rename nf_reject6_fill_skb_dst from
ipv6/netfilter/nf_reject_ipv6 to ip6_route_reply_fetch_dst in
ipv6/route.c so that it can be reused in the following patches by BPF
kfuncs.
Netfilter uses nf_ip6_route that is almost a transparent wrapper around
ip6_route_outputy so this patch inlines it.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
include/net/ip6_route.h | 2 ++
net/ipv6/netfilter/nf_reject_ipv6.c | 17 +----------------
net/ipv6/route.c | 18 ++++++++++++++++++
3 files changed, 21 insertions(+), 16 deletions(-)
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 6dbdf60b342f..1426467df547 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -93,6 +93,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
return ip6_route_output_flags(net, sk, fl6, 0);
}
+int ip6_route_reply_fetch_dst(struct sk_buff *skb);
+
/* Only conditionally release dst if flags indicates
* !RT6_LOOKUP_F_DST_NOREF or dst is in uncached_list.
*/
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index 9ae2b2725bf9..994a3b88ac52 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -250,21 +250,6 @@ void nf_reject_ip6_tcphdr_put(struct sk_buff *nskb,
}
EXPORT_SYMBOL_GPL(nf_reject_ip6_tcphdr_put);
-static int nf_reject6_fill_skb_dst(struct sk_buff *skb_in)
-{
- struct dst_entry *dst = NULL;
- struct flowi fl;
-
- memset(&fl, 0, sizeof(struct flowi));
- fl.u.ip6.daddr = ipv6_hdr(skb_in)->saddr;
- nf_ip6_route(dev_net(skb_in->dev), &dst, &fl, false);
- if (!dst)
- return -1;
-
- skb_dst_set(skb_in, dst);
- return 0;
-}
-
void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb,
int hook)
{
@@ -398,7 +383,7 @@ void nf_send_unreach6(struct net *net, struct sk_buff *skb_in,
skb_in->dev = net->loopback_dev;
if ((hooknum == NF_INET_PRE_ROUTING || hooknum == NF_INET_INGRESS) &&
- nf_reject6_fill_skb_dst(skb_in) < 0)
+ ip6_route_reply_fetch_dst(skb_in) < 0)
return;
icmpv6_send(skb_in, ICMPV6_DEST_UNREACH, code, 0);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0d5464c64965..de61540f9524 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2705,6 +2705,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
}
EXPORT_SYMBOL_GPL(ip6_route_output_flags);
+int ip6_route_reply_fetch_dst(struct sk_buff *skb)
+{
+ struct dst_entry *result;
+ struct flowi6 fl = {
+ .daddr = ipv6_hdr(skb)->saddr
+ };
+ int err;
+
+ result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
+ err = result->error;
+ if (err)
+ dst_release(result);
+ else
+ skb_dst_set(skb, result);
+ return err;
+}
+EXPORT_SYMBOL_GPL(ip6_route_reply_fetch_dst);
+
struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_orig)
{
struct rt6_info *rt, *ort = dst_rt6_info(dst_orig);
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
2025-07-28 9:43 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
2025-07-28 9:43 ` [PATCH bpf-next v3 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2025-07-28 9:43 ` [PATCH bpf-next v3 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2025-07-28 9:43 ` Mahe Tardy
2025-07-28 20:10 ` kernel test robot
2025-07-29 1:05 ` Martin KaFai Lau
2025-07-28 9:43 ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
2025-07-29 1:21 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Martin KaFai Lau
4 siblings, 2 replies; 40+ messages in thread
From: Mahe Tardy @ 2025-07-28 9:43 UTC (permalink / raw)
To: lkp
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, mahe.tardy, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
This is needed in the context of Tetragon to provide improved feedback
(in contrast to just dropping packets) to east-west traffic when blocked
by policies using cgroup_skb programs.
This reuse concepts from netfilter reject target codepath with the
differences that:
* Packets are cloned since the BPF user can still return SK_PASS from
the cgroup_skb progs and the current skb need to stay untouched
(cgroup_skb hooks only allow read-only skb payload).
* Since cgroup_skb programs are called late in the stack, checksums do
not need to be computed or verified, and IPv4 fragmentation does not
need to be checked (ip_local_deliver should take care of that
earlier).
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
net/core/filter.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index 7a72f766aacf..050872324575 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -85,6 +85,10 @@
#include <linux/un.h>
#include <net/xdp_sock_drv.h>
#include <net/inet_dscp.h>
+#include <linux/icmp.h>
+#include <net/icmp.h>
+#include <net/route.h>
+#include <net/ip6_route.h>
#include "dev.h"
@@ -12148,6 +12152,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
return 0;
}
+__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
+{
+ struct sk_buff *skb = (struct sk_buff *)__skb;
+ struct sk_buff *nskb;
+
+ switch (skb->protocol) {
+ case htons(ETH_P_IP):
+ if (code < 0 || code > NR_ICMP_UNREACH)
+ return -EINVAL;
+
+ nskb = skb_clone(skb, GFP_ATOMIC);
+ if (!nskb)
+ return -ENOMEM;
+
+ if (ip_route_reply_fetch_dst(nskb) < 0) {
+ kfree_skb(nskb);
+ return -EHOSTUNREACH;
+ }
+
+ icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
+ kfree_skb(nskb);
+ break;
+#if IS_ENABLED(CONFIG_IPV6)
+ case htons(ETH_P_IPV6):
+ if (code < 0 || code > ICMPV6_REJECT_ROUTE)
+ return -EINVAL;
+
+ nskb = skb_clone(skb, GFP_ATOMIC);
+ if (!nskb)
+ return -ENOMEM;
+
+ if (ip6_route_reply_fetch_dst(nskb) < 0) {
+ kfree_skb(nskb);
+ return -EHOSTUNREACH;
+ }
+
+ icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
+ kfree_skb(nskb);
+ break;
+#endif
+ default:
+ return -EPROTONOSUPPORT;
+ }
+
+ return SK_DROP;
+}
+
__bpf_kfunc_end_defs();
int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12185,6 +12236,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp, KF_TRUSTED_ARGS)
BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
+BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send_unreach)
+BTF_ID_FLAGS(func, bpf_icmp_send_unreach, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send_unreach)
+
static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
.owner = THIS_MODULE,
.set = &bpf_kfunc_check_set_skb,
@@ -12210,6 +12265,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
.set = &bpf_kfunc_check_set_sock_ops,
};
+static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send_unreach = {
+ .owner = THIS_MODULE,
+ .set = &bpf_kfunc_check_set_icmp_send_unreach,
+};
+
static int __init bpf_kfunc_init(void)
{
int ret;
@@ -12229,6 +12289,7 @@ static int __init bpf_kfunc_init(void)
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
&bpf_kfunc_set_sock_addr);
ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
+ ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send_unreach);
return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
}
late_initcall(bpf_kfunc_init);
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
2025-07-28 9:43 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
` (2 preceding siblings ...)
2025-07-28 9:43 ` [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
@ 2025-07-28 9:43 ` Mahe Tardy
2025-07-28 15:40 ` Yonghong Song
` (2 more replies)
2025-07-29 1:21 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Martin KaFai Lau
4 siblings, 3 replies; 40+ messages in thread
From: Mahe Tardy @ 2025-07-28 9:43 UTC (permalink / raw)
To: lkp
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, mahe.tardy, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
This test opens a server and client, attach a cgroup_skb program on
egress and calls the icmp_send_unreach function from the client egress
so that an ICMP unreach control message is sent back to the client.
It then fetches the message from the error queue to confirm the correct
ICMP unreach code has been sent.
Note that the BPF program returns SK_PASS to let the connection being
established to finish the test cases quicker. Otherwise, you have to
wait for the TCP three-way handshake to timeout in the kernel and
retrieve the errno translated from the unreach code set by the ICMP
control message.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
.../bpf/prog_tests/icmp_send_unreach_kfunc.c | 99 +++++++++++++++++++
.../selftests/bpf/progs/icmp_send_unreach.c | 36 +++++++
2 files changed, 135 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c
diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
new file mode 100644
index 000000000000..414c1ed8ced3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <linux/errqueue.h>
+#include "icmp_send_unreach.skel.h"
+
+#define TIMEOUT_MS 1000
+#define SRV_PORT 54321
+
+#define ICMP_DEST_UNREACH 3
+
+#define ICMP_FRAG_NEEDED 4
+#define NR_ICMP_UNREACH 15
+
+static void read_icmp_errqueue(int sockfd, int expected_code)
+{
+ ssize_t n;
+ struct sock_extended_err *sock_err;
+ struct cmsghdr *cm;
+ char ctrl_buf[512];
+ struct msghdr msg = {
+ .msg_control = ctrl_buf,
+ .msg_controllen = sizeof(ctrl_buf),
+ };
+
+ n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
+ if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
+ return;
+
+ for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+ if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
+ !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
+ continue;
+
+ sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
+
+ if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
+ "sock_err_origin_icmp"))
+ return;
+ if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+ "sock_err_type_dest_unreach"))
+ return;
+ ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
+ }
+}
+
+void test_icmp_send_unreach_kfunc(void)
+{
+ struct icmp_send_unreach *skel;
+ int cgroup_fd = -1, client_fd = 1, srv_fd = -1;
+ int *code;
+
+ skel = icmp_send_unreach__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open"))
+ goto cleanup;
+
+ cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
+ if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+ goto cleanup;
+
+ skel->links.egress =
+ bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
+ if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
+ goto cleanup;
+
+ code = &skel->bss->unreach_code;
+
+ for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
+ // The TCP stack reacts differently when asking for
+ // fragmentation, let's ignore it for now
+ if (*code == ICMP_FRAG_NEEDED)
+ continue;
+
+ skel->bss->kfunc_ret = -1;
+
+ srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
+ SRV_PORT, TIMEOUT_MS);
+ if (!ASSERT_GE(srv_fd, 0, "start_server"))
+ goto for_cleanup;
+
+ client_fd = socket(AF_INET, SOCK_STREAM, 0);
+ ASSERT_GE(client_fd, 0, "client_socket");
+
+ client_fd = connect_to_fd(srv_fd, 0);
+ if (!ASSERT_GE(client_fd, 0, "client_connect"))
+ goto for_cleanup;
+
+ read_icmp_errqueue(client_fd, *code);
+
+ ASSERT_EQ(skel->bss->kfunc_ret, SK_DROP, "kfunc_ret");
+for_cleanup:
+ close(client_fd);
+ close(srv_fd);
+ }
+
+cleanup:
+ icmp_send_unreach__destroy(skel);
+ close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
new file mode 100644
index 000000000000..15783e5d1d65
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
+
+int unreach_code = 0;
+int kfunc_ret = 0;
+
+#define SERVER_PORT 54321
+#define SERVER_IP 0x7F000001
+
+SEC("cgroup_skb/egress")
+int egress(struct __sk_buff *skb)
+{
+ void *data = (void *)(long)skb->data;
+ void *data_end = (void *)(long)skb->data_end;
+ struct iphdr *iph;
+ struct tcphdr *tcph;
+
+ iph = data;
+ if ((void *)(iph + 1) > data_end || iph->version != 4 ||
+ iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+ return SK_PASS;
+
+ tcph = (void *)iph + iph->ihl * 4;
+ if ((void *)(tcph + 1) > data_end ||
+ tcph->dest != bpf_htons(SERVER_PORT))
+ return SK_PASS;
+
+ kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
+
+ /* returns SK_PASS to execute the test case quicker */
+ return SK_PASS;
+}
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
2025-07-28 9:43 ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
@ 2025-07-28 15:40 ` Yonghong Song
2025-07-28 15:59 ` Mahe Tardy
2025-07-29 1:18 ` Martin KaFai Lau
2025-08-05 23:26 ` Jordan Rife
2 siblings, 1 reply; 40+ messages in thread
From: Yonghong Song @ 2025-07-28 15:40 UTC (permalink / raw)
To: Mahe Tardy, lkp
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
On 7/28/25 2:43 AM, Mahe Tardy wrote:
> This test opens a server and client, attach a cgroup_skb program on
> egress and calls the icmp_send_unreach function from the client egress
> so that an ICMP unreach control message is sent back to the client.
> It then fetches the message from the error queue to confirm the correct
> ICMP unreach code has been sent.
>
> Note that the BPF program returns SK_PASS to let the connection being
> established to finish the test cases quicker. Otherwise, you have to
> wait for the TCP three-way handshake to timeout in the kernel and
> retrieve the errno translated from the unreach code set by the ICMP
> control message.
>
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
> .../bpf/prog_tests/icmp_send_unreach_kfunc.c | 99 +++++++++++++++++++
> .../selftests/bpf/progs/icmp_send_unreach.c | 36 +++++++
> 2 files changed, 135 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> new file mode 100644
> index 000000000000..414c1ed8ced3
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> @@ -0,0 +1,99 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <linux/errqueue.h>
> +#include "icmp_send_unreach.skel.h"
> +
> +#define TIMEOUT_MS 1000
> +#define SRV_PORT 54321
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +#define ICMP_FRAG_NEEDED 4
> +#define NR_ICMP_UNREACH 15
> +
> +static void read_icmp_errqueue(int sockfd, int expected_code)
> +{
> + ssize_t n;
> + struct sock_extended_err *sock_err;
> + struct cmsghdr *cm;
> + char ctrl_buf[512];
> + struct msghdr msg = {
> + .msg_control = ctrl_buf,
> + .msg_controllen = sizeof(ctrl_buf),
> + };
> +
> + n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
> + if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
> + return;
> +
> + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
> + if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
> + !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
> + continue;
> +
> + sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
> +
> + if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
> + "sock_err_origin_icmp"))
> + return;
> + if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
> + "sock_err_type_dest_unreach"))
> + return;
> + ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
> + }
> +}
> +
> +void test_icmp_send_unreach_kfunc(void)
> +{
> + struct icmp_send_unreach *skel;
> + int cgroup_fd = -1, client_fd = 1, srv_fd = -1;
Should set client_fd = -1? See below ...
> + int *code;
> +
> + skel = icmp_send_unreach__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "skel_open"))
> + goto cleanup;
> +
> + cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
> + if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> + goto cleanup;
> +
> + skel->links.egress =
> + bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
> + if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
> + goto cleanup;
> +
> + code = &skel->bss->unreach_code;
> +
> + for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
> + // The TCP stack reacts differently when asking for
> + // fragmentation, let's ignore it for now
> + if (*code == ICMP_FRAG_NEEDED)
> + continue;
> +
> + skel->bss->kfunc_ret = -1;
> +
> + srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
> + SRV_PORT, TIMEOUT_MS);
> + if (!ASSERT_GE(srv_fd, 0, "start_server"))
> + goto for_cleanup;
Otherwise if client_fd = 1, goto for_cleanup will close(1).
> +
> + client_fd = socket(AF_INET, SOCK_STREAM, 0);
> + ASSERT_GE(client_fd, 0, "client_socket");
The above two lines are not necessary since client_fd is
actually set in the below.
> +
> + client_fd = connect_to_fd(srv_fd, 0);
> + if (!ASSERT_GE(client_fd, 0, "client_connect"))
> + goto for_cleanup;
> +
> + read_icmp_errqueue(client_fd, *code);
> +
> + ASSERT_EQ(skel->bss->kfunc_ret, SK_DROP, "kfunc_ret");
> +for_cleanup:
> + close(client_fd);
> + close(srv_fd);
> + }
> +
> +cleanup:
> + icmp_send_unreach__destroy(skel);
> + close(cgroup_fd);
> +}
[...]
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
2025-07-28 15:40 ` Yonghong Song
@ 2025-07-28 15:59 ` Mahe Tardy
0 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2025-07-28 15:59 UTC (permalink / raw)
To: Yonghong Song
Cc: lkp, alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
On Mon, Jul 28, 2025 at 08:40:49AM -0700, Yonghong Song wrote:
>
>
> On 7/28/25 2:43 AM, Mahe Tardy wrote:
[...]
> > +
> > +void test_icmp_send_unreach_kfunc(void)
> > +{
> > + struct icmp_send_unreach *skel;
> > + int cgroup_fd = -1, client_fd = 1, srv_fd = -1;
>
> Should set client_fd = -1? See below ...
Well spotted yes, it's a typo, thank you.
> > + int *code;
> > +
> > + skel = icmp_send_unreach__open_and_load();
> > + if (!ASSERT_OK_PTR(skel, "skel_open"))
> > + goto cleanup;
> > +
> > + cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
> > + if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> > + goto cleanup;
> > +
> > + skel->links.egress =
> > + bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
> > + if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
> > + goto cleanup;
> > +
> > + code = &skel->bss->unreach_code;
> > +
> > + for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
> > + // The TCP stack reacts differently when asking for
> > + // fragmentation, let's ignore it for now
> > + if (*code == ICMP_FRAG_NEEDED)
> > + continue;
> > +
> > + skel->bss->kfunc_ret = -1;
> > +
> > + srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
> > + SRV_PORT, TIMEOUT_MS);
> > + if (!ASSERT_GE(srv_fd, 0, "start_server"))
> > + goto for_cleanup;
>
> Otherwise if client_fd = 1, goto for_cleanup will close(1).
>
> > +
> > + client_fd = socket(AF_INET, SOCK_STREAM, 0);
> > + ASSERT_GE(client_fd, 0, "client_socket");
>
> The above two lines are not necessary since client_fd is
> actually set in the below.
Yep, must have been a leftover from when I was discovering the
network_helpers, oops!
> > +
> > + client_fd = connect_to_fd(srv_fd, 0);
> > + if (!ASSERT_GE(client_fd, 0, "client_connect"))
> > + goto for_cleanup;
> > +
> > + read_icmp_errqueue(client_fd, *code);
> > +
> > + ASSERT_EQ(skel->bss->kfunc_ret, SK_DROP, "kfunc_ret");
> > +for_cleanup:
> > + close(client_fd);
> > + close(srv_fd);
> > + }
> > +
> > +cleanup:
> > + icmp_send_unreach__destroy(skel);
> > + close(cgroup_fd);
> > +}
> [...]
I'm sending a v4 with those fixed + fixing the builds error when IPv6 is
built as a module from the kfunc patch. Thanks for the review.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
2025-07-28 9:43 ` [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
@ 2025-07-28 20:10 ` kernel test robot
2025-07-29 1:05 ` Martin KaFai Lau
1 sibling, 0 replies; 40+ messages in thread
From: kernel test robot @ 2025-07-28 20:10 UTC (permalink / raw)
To: Mahe Tardy, lkp
Cc: oe-kbuild-all, alexei.starovoitov, andrii, ast, bpf, coreteam,
daniel, fw, john.fastabend, mahe.tardy, martin.lau, netdev,
netfilter-devel, pablo
Hi Mahe,
kernel test robot noticed the following build errors:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Mahe-Tardy/net-move-netfilter-nf_reject_fill_skb_dst-to-core-ipv4/20250728-174736
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20250728094345.46132-4-mahe.tardy%40gmail.com
patch subject: [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
config: s390-randconfig-001-20250729 (https://download.01.org/0day-ci/archive/20250729/202507290356.HyFMR3K0-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250729/202507290356.HyFMR3K0-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507290356.HyFMR3K0-lkp@intel.com/
All errors (new ones prefixed by >>):
s390-linux-ld: net/core/filter.o: in function `bpf_icmp_send_unreach':
filter.c:(.text+0xf7b8): undefined reference to `ip_route_reply_fetch_dst'
>> s390-linux-ld: filter.c:(.text+0xf7ea): undefined reference to `__icmp_send'
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
2025-07-28 9:43 ` [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
2025-07-28 20:10 ` kernel test robot
@ 2025-07-29 1:05 ` Martin KaFai Lau
2025-07-29 10:06 ` Mahe Tardy
1 sibling, 1 reply; 40+ messages in thread
From: Martin KaFai Lau @ 2025-07-29 1:05 UTC (permalink / raw)
To: Mahe Tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On 7/28/25 2:43 AM, Mahe Tardy wrote:
> This is needed in the context of Tetragon to provide improved feedback
> (in contrast to just dropping packets) to east-west traffic when blocked
> by policies using cgroup_skb programs.
>
> This reuse concepts from netfilter reject target codepath with the
> differences that:
> * Packets are cloned since the BPF user can still return SK_PASS from
> the cgroup_skb progs and the current skb need to stay untouched
This needs more details. Which field(s) of the skb are changed by the kfunc, the
skb_dst_set in ip[6]_route_reply_fetch_dst() and/or the code path in the
icmp[v6]_send() ?
> (cgroup_skb hooks only allow read-only skb payload).
> * Since cgroup_skb programs are called late in the stack, checksums do
> not need to be computed or verified, and IPv4 fragmentation does not
> need to be checked (ip_local_deliver should take care of that
> earlier).
>
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
> net/core/filter.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 61 insertions(+)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 7a72f766aacf..050872324575 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -85,6 +85,10 @@
> #include <linux/un.h>
> #include <net/xdp_sock_drv.h>
> #include <net/inet_dscp.h>
> +#include <linux/icmp.h>
> +#include <net/icmp.h>
> +#include <net/route.h>
> +#include <net/ip6_route.h>
>
> #include "dev.h"
>
> @@ -12148,6 +12152,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
> return 0;
> }
>
> +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
> +{
> + struct sk_buff *skb = (struct sk_buff *)__skb;
> + struct sk_buff *nskb;
> +
> + switch (skb->protocol) {
> + case htons(ETH_P_IP):
> + if (code < 0 || code > NR_ICMP_UNREACH)
> + return -EINVAL;
> +
> + nskb = skb_clone(skb, GFP_ATOMIC);
> + if (!nskb)
> + return -ENOMEM;
> +
> + if (ip_route_reply_fetch_dst(nskb) < 0) {
> + kfree_skb(nskb);
> + return -EHOSTUNREACH;
> + }
> +
> + icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
> + kfree_skb(nskb);
> + break;
> +#if IS_ENABLED(CONFIG_IPV6)
> + case htons(ETH_P_IPV6):
> + if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> + return -EINVAL;
> +
> + nskb = skb_clone(skb, GFP_ATOMIC);
> + if (!nskb)
> + return -ENOMEM;
> +
> + if (ip6_route_reply_fetch_dst(nskb) < 0) {
From a very quick look at icmpv6_send(), it does its own route lookup. I
haven't looked at the v4 yet.
I am likely missing some details. Can you explain why it needs to do a lookup
before calling icmpv6_send()?
> + kfree_skb(nskb);
> + return -EHOSTUNREACH;
> + }
> +
> + icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
> + kfree_skb(nskb);
> + break;
> +#endif
> + default:
> + return -EPROTONOSUPPORT;
> + }
> +
> + return SK_DROP;
> +}
> +
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
2025-07-28 9:43 ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
2025-07-28 15:40 ` Yonghong Song
@ 2025-07-29 1:18 ` Martin KaFai Lau
2025-07-29 9:09 ` Mahe Tardy
2025-08-05 23:26 ` Jordan Rife
2 siblings, 1 reply; 40+ messages in thread
From: Martin KaFai Lau @ 2025-07-29 1:18 UTC (permalink / raw)
To: Mahe Tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On 7/28/25 2:43 AM, Mahe Tardy wrote:
> +SEC("cgroup_skb/egress")
> +int egress(struct __sk_buff *skb)
> +{
> + void *data = (void *)(long)skb->data;
> + void *data_end = (void *)(long)skb->data_end;
> + struct iphdr *iph;
> + struct tcphdr *tcph;
> +
> + iph = data;
> + if ((void *)(iph + 1) > data_end || iph->version != 4 ||
> + iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
> + return SK_PASS;
> +
> + tcph = (void *)iph + iph->ihl * 4;
> + if ((void *)(tcph + 1) > data_end ||
> + tcph->dest != bpf_htons(SERVER_PORT))
> + return SK_PASS;
> +
> + kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
> +
> + /* returns SK_PASS to execute the test case quicker */
Do you know why the user space is slower if 0 (SK_DROP) is used?
> + return SK_PASS;
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc
2025-07-28 9:43 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
` (3 preceding siblings ...)
2025-07-28 9:43 ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
@ 2025-07-29 1:21 ` Martin KaFai Lau
2025-07-29 9:53 ` Mahe Tardy
4 siblings, 1 reply; 40+ messages in thread
From: Martin KaFai Lau @ 2025-07-29 1:21 UTC (permalink / raw)
To: Mahe Tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On 7/28/25 2:43 AM, Mahe Tardy wrote:
> Hello,
>
> This is v3 of adding the icmp_send_unreach kfunc, as suggested during
> LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
> actively reject east-west traffic, similarly to what is possible to do
> with netfilter reject target.
>
> The first step to implement this is using ICMP control messages, with
> the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
> ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
> than a TCP RST reply and will already hint the client TCP stack to abort
> the connection and not retry extensively.
>
> Note that this is different than the sock_destroy kfunc, that along
> calls tcp_abort and thus sends a reset, destroying the underlying
> socket.
>
> Caveats of this kfunc design are that a cgroup_skb program can call this
> function N times, thus send N ICMP unreach control messages and that the
> program can return from the BPF filter with SK_PASS leading to a
> potential confusing situation where the TCP connection was established
> while the client received ICMP_DEST_UNREACH messages.
>
> Another more sophisticated design idea would be for the kfunc to set the
> kernel to send an ICMP_HOST_UNREACH control message with the appropriate
> code when the cgroup_skb program terminates with SK_DROP. Creating a new
> 'SK_REJECT' return code for cgroup_skb program was generally rejected
> and would be too limited for other program types support.
>
> We should bear in mind that we want to add a TCP reset kfunc next and
> also could extend this kfunc to other program types if wanted.
Some high level questions.
Which other program types do you need this kfunc to send icmp and the future tcp
rst?
This cover letter mentioned sending icmp unreach is easier than sending tcp rst.
What problems do you see in sending tcp rst?
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
2025-07-29 1:18 ` Martin KaFai Lau
@ 2025-07-29 9:09 ` Mahe Tardy
2025-07-29 23:27 ` Martin KaFai Lau
0 siblings, 1 reply; 40+ messages in thread
From: Mahe Tardy @ 2025-07-29 9:09 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On Mon, Jul 28, 2025 at 06:18:11PM -0700, Martin KaFai Lau wrote:
> On 7/28/25 2:43 AM, Mahe Tardy wrote:
> > +SEC("cgroup_skb/egress")
> > +int egress(struct __sk_buff *skb)
> > +{
> > + void *data = (void *)(long)skb->data;
> > + void *data_end = (void *)(long)skb->data_end;
> > + struct iphdr *iph;
> > + struct tcphdr *tcph;
> > +
> > + iph = data;
> > + if ((void *)(iph + 1) > data_end || iph->version != 4 ||
> > + iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
> > + return SK_PASS;
> > +
> > + tcph = (void *)iph + iph->ihl * 4;
> > + if ((void *)(tcph + 1) > data_end ||
> > + tcph->dest != bpf_htons(SERVER_PORT))
> > + return SK_PASS;
> > +
> > + kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
> > +
> > + /* returns SK_PASS to execute the test case quicker */
>
> Do you know why the user space is slower if 0 (SK_DROP) is used?
I tried to write my understanding of this in the commit description:
"Note that the BPF program returns SK_PASS to let the connection being
established to finish the test cases quicker. Otherwise, you have to
wait for the TCP three-way handshake to timeout in the kernel and
retrieve the errno translated from the unreach code set by the ICMP
control message."
I added this comment because I already had some (offline) feedback that
this looked off, maybe I should develop and put this here directly.
>
> > + return SK_PASS;
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc
2025-07-29 1:21 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Martin KaFai Lau
@ 2025-07-29 9:53 ` Mahe Tardy
2025-07-30 1:54 ` Martin KaFai Lau
0 siblings, 1 reply; 40+ messages in thread
From: Mahe Tardy @ 2025-07-29 9:53 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On Mon, Jul 28, 2025 at 06:21:50PM -0700, Martin KaFai Lau wrote:
> On 7/28/25 2:43 AM, Mahe Tardy wrote:
> > Hello,
> >
> > This is v3 of adding the icmp_send_unreach kfunc, as suggested during
> > LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
> > actively reject east-west traffic, similarly to what is possible to do
> > with netfilter reject target.
> >
> > The first step to implement this is using ICMP control messages, with
> > the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
> > ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
> > than a TCP RST reply and will already hint the client TCP stack to abort
> > the connection and not retry extensively.
> >
> > Note that this is different than the sock_destroy kfunc, that along
> > calls tcp_abort and thus sends a reset, destroying the underlying
> > socket.
> >
> > Caveats of this kfunc design are that a cgroup_skb program can call this
> > function N times, thus send N ICMP unreach control messages and that the
> > program can return from the BPF filter with SK_PASS leading to a
> > potential confusing situation where the TCP connection was established
> > while the client received ICMP_DEST_UNREACH messages.
> >
> > Another more sophisticated design idea would be for the kfunc to set the
> > kernel to send an ICMP_HOST_UNREACH control message with the appropriate
> > code when the cgroup_skb program terminates with SK_DROP. Creating a new
> > 'SK_REJECT' return code for cgroup_skb program was generally rejected
> > and would be too limited for other program types support.
> >
> > We should bear in mind that we want to add a TCP reset kfunc next and
> > also could extend this kfunc to other program types if wanted.
>
> Some high level questions.
>
> Which other program types do you need this kfunc to send icmp and the future
> tcp rst?
I don't really know, I mostly need this in cgroup_skb for my use case
but I could see other programs type using this either for simplification
(for progs that can already rewrite the packet, like tc) or other
programs types like cgroup_skb, because they can't touch the packet
themselves.
>
> This cover letter mentioned sending icmp unreach is easier than sending tcp
> rst. What problems do you see in sending tcp rst?
>
Yes, I based these patches on what net/ipv4/netfilter/ipt_REJECT.c's
'reject_tg' function does. In the case of sending ICMP unreach
'nf_send_unreach', the routing step is quite straighforward as they are
only inverting the daddr and the saddr (that's what my renamed/moved
ip_route_reply_fetch_dst helper does).
In the case of sending RST 'nf_send_reset', there are extra steps, first
the same routing mechanism is done by just inverting the daddr and the
saddr but later 'ip_route_me_harder' is called which is doing a lot
more. I'm currently not sure which parts of this must be ported to work
in our BPF use case so I wanted to start with unreach.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
2025-07-29 1:05 ` Martin KaFai Lau
@ 2025-07-29 10:06 ` Mahe Tardy
2025-07-29 23:13 ` Martin KaFai Lau
0 siblings, 1 reply; 40+ messages in thread
From: Mahe Tardy @ 2025-07-29 10:06 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On Mon, Jul 28, 2025 at 06:05:26PM -0700, Martin KaFai Lau wrote:
> On 7/28/25 2:43 AM, Mahe Tardy wrote:
> > This is needed in the context of Tetragon to provide improved feedback
> > (in contrast to just dropping packets) to east-west traffic when blocked
> > by policies using cgroup_skb programs.
> >
> > This reuse concepts from netfilter reject target codepath with the
> > differences that:
> > * Packets are cloned since the BPF user can still return SK_PASS from
> > the cgroup_skb progs and the current skb need to stay untouched
>
> This needs more details. Which field(s) of the skb are changed by the kfunc,
> the skb_dst_set in ip[6]_route_reply_fetch_dst() and/or the code path in the
> icmp[v6]_send() ?
Okay I can add that: "ip[6]_route_reply_fetch_dst set the dst of the skb
by using the saddr as a daddr and routing it", I don't think
icmp[v6]_send touches the skb?
>
> > (cgroup_skb hooks only allow read-only skb payload).
> > * Since cgroup_skb programs are called late in the stack, checksums do
> > not need to be computed or verified, and IPv4 fragmentation does not
> > need to be checked (ip_local_deliver should take care of that
> > earlier).
> >
> > Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> > ---
> > net/core/filter.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 61 insertions(+)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 7a72f766aacf..050872324575 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -85,6 +85,10 @@
> > #include <linux/un.h>
> > #include <net/xdp_sock_drv.h>
> > #include <net/inet_dscp.h>
> > +#include <linux/icmp.h>
> > +#include <net/icmp.h>
> > +#include <net/route.h>
> > +#include <net/ip6_route.h>
> >
> > #include "dev.h"
> >
> > @@ -12148,6 +12152,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
> > return 0;
> > }
> >
> > +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
> > +{
> > + struct sk_buff *skb = (struct sk_buff *)__skb;
> > + struct sk_buff *nskb;
> > +
> > + switch (skb->protocol) {
> > + case htons(ETH_P_IP):
> > + if (code < 0 || code > NR_ICMP_UNREACH)
> > + return -EINVAL;
> > +
> > + nskb = skb_clone(skb, GFP_ATOMIC);
> > + if (!nskb)
> > + return -ENOMEM;
> > +
> > + if (ip_route_reply_fetch_dst(nskb) < 0) {
> > + kfree_skb(nskb);
> > + return -EHOSTUNREACH;
> > + }
> > +
> > + icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
> > + kfree_skb(nskb);
> > + break;
> > +#if IS_ENABLED(CONFIG_IPV6)
> > + case htons(ETH_P_IPV6):
> > + if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > + return -EINVAL;
> > +
> > + nskb = skb_clone(skb, GFP_ATOMIC);
> > + if (!nskb)
> > + return -ENOMEM;
> > +
> > + if (ip6_route_reply_fetch_dst(nskb) < 0) {
>
> From a very quick look at icmpv6_send(), it does its own route lookup. I
> haven't looked at the v4 yet.
>
> I am likely missing some details. Can you explain why it needs to do a
> lookup before calling icmpv6_send()?
From my understanding, I need to do this to invert the daddr with the
saddr to send the unreach message back to the sender.
>
> > + kfree_skb(nskb);
> > + return -EHOSTUNREACH;
> > + }
> > +
> > + icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
> > + kfree_skb(nskb);
> > + break;
> > +#endif
> > + default:
> > + return -EPROTONOSUPPORT;
> > + }
> > +
> > + return SK_DROP;
> > +}
> > +
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc
2025-07-29 10:06 ` Mahe Tardy
@ 2025-07-29 23:13 ` Martin KaFai Lau
0 siblings, 0 replies; 40+ messages in thread
From: Martin KaFai Lau @ 2025-07-29 23:13 UTC (permalink / raw)
To: Mahe Tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On 7/29/25 3:06 AM, Mahe Tardy wrote:
> On Mon, Jul 28, 2025 at 06:05:26PM -0700, Martin KaFai Lau wrote:
>> On 7/28/25 2:43 AM, Mahe Tardy wrote:
>>> This is needed in the context of Tetragon to provide improved feedback
>>> (in contrast to just dropping packets) to east-west traffic when blocked
>>> by policies using cgroup_skb programs.
>>>
>>> This reuse concepts from netfilter reject target codepath with the
>>> differences that:
>>> * Packets are cloned since the BPF user can still return SK_PASS from
>>> the cgroup_skb progs and the current skb need to stay untouched
>>
>> This needs more details. Which field(s) of the skb are changed by the kfunc,
>> the skb_dst_set in ip[6]_route_reply_fetch_dst() and/or the code path in the
>> icmp[v6]_send() ?
>
> Okay I can add that: "ip[6]_route_reply_fetch_dst set the dst of the skb
> by using the saddr as a daddr and routing it", I don't think
> icmp[v6]_send touches the skb?
I also don't think icmp[v6]_send touches the skb. I am still not sure if
ip[6]_route_reply_fetch_dst is needed.
>
>>
>>> (cgroup_skb hooks only allow read-only skb payload).
>>> * Since cgroup_skb programs are called late in the stack, checksums do
>>> not need to be computed or verified, and IPv4 fragmentation does not
>>> need to be checked (ip_local_deliver should take care of that
>>> earlier).
>>>
>>> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
>>> ---
>>> net/core/filter.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 61 insertions(+)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 7a72f766aacf..050872324575 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -85,6 +85,10 @@
>>> #include <linux/un.h>
>>> #include <net/xdp_sock_drv.h>
>>> #include <net/inet_dscp.h>
>>> +#include <linux/icmp.h>
>>> +#include <net/icmp.h>
>>> +#include <net/route.h>
>>> +#include <net/ip6_route.h>
>>>
>>> #include "dev.h"
>>>
>>> @@ -12148,6 +12152,53 @@ __bpf_kfunc int bpf_sock_ops_enable_tx_tstamp(struct bpf_sock_ops_kern *skops,
>>> return 0;
>>> }
>>>
>>> +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
>>> +{
>>> + struct sk_buff *skb = (struct sk_buff *)__skb;
>>> + struct sk_buff *nskb;
>>> +
>>> + switch (skb->protocol) {
>>> + case htons(ETH_P_IP):
>>> + if (code < 0 || code > NR_ICMP_UNREACH)
>>> + return -EINVAL;
>>> +
>>> + nskb = skb_clone(skb, GFP_ATOMIC);
>>> + if (!nskb)
>>> + return -ENOMEM;
>>> +
>>> + if (ip_route_reply_fetch_dst(nskb) < 0) {
>>> + kfree_skb(nskb);
>>> + return -EHOSTUNREACH;
>>> + }
>>> +
>>> + icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
>>> + kfree_skb(nskb);
>>> + break;
>>> +#if IS_ENABLED(CONFIG_IPV6)
>>> + case htons(ETH_P_IPV6):
>>> + if (code < 0 || code > ICMPV6_REJECT_ROUTE)
>>> + return -EINVAL;
>>> +
>>> + nskb = skb_clone(skb, GFP_ATOMIC);
>>> + if (!nskb)
>>> + return -ENOMEM;
>>> +
>>> + if (ip6_route_reply_fetch_dst(nskb) < 0) {
>>
>> From a very quick look at icmpv6_send(), it does its own route lookup. I
>> haven't looked at the v4 yet.
>>
>> I am likely missing some details. Can you explain why it needs to do a
>> lookup before calling icmpv6_send()?
>
> From my understanding, I need to do this to invert the daddr with the
> saddr to send the unreach message back to the sender.
From looking at how fl6.{daddr,saddr} are filled and passed to
icmpv6_route_lookup in icmpv6_send(), the icmpv6_send() should have done the
reverse/invert route lookup. I also don't see icmpv6_send uses the skb_dst() of
the original skb. Did I misread the code? The kfunc does not work without
ip[6]_route_reply_fetch_dst()? Again, I have not checked the v4 icmp_send. fwiw,
the selftest should have both v4 and v6 test.
Note that at cgroup/egress, the skb->_skb_refdst should have been set.
The same should be true for cgroup/ingress for inet proto but it seems
BPF_CGROUP_RUN_PROG_"INET"_INGRESS is not called from INET only now. e.g.
sk_filter() can be called from af_netlink. It seems like there is a bug.
>
>>
>>> + kfree_skb(nskb);
>>> + return -EHOSTUNREACH;
>>> + }
>>> +
>>> + icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
>>> + kfree_skb(nskb);
>>> + break;
>>> +#endif
>>> + default:
>>> + return -EPROTONOSUPPORT;
>>> + }
>>> +
>>> + return SK_DROP;
>>> +}
>>> +
>
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
2025-07-29 9:09 ` Mahe Tardy
@ 2025-07-29 23:27 ` Martin KaFai Lau
2025-07-30 0:01 ` Martin KaFai Lau
0 siblings, 1 reply; 40+ messages in thread
From: Martin KaFai Lau @ 2025-07-29 23:27 UTC (permalink / raw)
To: Mahe Tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On 7/29/25 2:09 AM, Mahe Tardy wrote:
> On Mon, Jul 28, 2025 at 06:18:11PM -0700, Martin KaFai Lau wrote:
>> On 7/28/25 2:43 AM, Mahe Tardy wrote:
>>> +SEC("cgroup_skb/egress")
>>> +int egress(struct __sk_buff *skb)
>>> +{
>>> + void *data = (void *)(long)skb->data;
>>> + void *data_end = (void *)(long)skb->data_end;
>>> + struct iphdr *iph;
>>> + struct tcphdr *tcph;
>>> +
>>> + iph = data;
>>> + if ((void *)(iph + 1) > data_end || iph->version != 4 ||
>>> + iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
>>> + return SK_PASS;
>>> +
>>> + tcph = (void *)iph + iph->ihl * 4;
>>> + if ((void *)(tcph + 1) > data_end ||
>>> + tcph->dest != bpf_htons(SERVER_PORT))
>>> + return SK_PASS;
>>> +
>>> + kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
>>> +
>>> + /* returns SK_PASS to execute the test case quicker */
>>
>> Do you know why the user space is slower if 0 (SK_DROP) is used?
>
> I tried to write my understanding of this in the commit description:
>
> "Note that the BPF program returns SK_PASS to let the connection being
> established to finish the test cases quicker. Otherwise, you have to
> wait for the TCP three-way handshake to timeout in the kernel and
> retrieve the errno translated from the unreach code set by the ICMP
> control message."
This feels like a bit hacky to let the 3WHS finished while the objective of the
patch set is to drop it. It is not unusual for people to directly borrow this
code. Does non blocking connect() help?
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
2025-07-29 23:27 ` Martin KaFai Lau
@ 2025-07-30 0:01 ` Martin KaFai Lau
2025-07-30 0:32 ` Martin KaFai Lau
0 siblings, 1 reply; 40+ messages in thread
From: Martin KaFai Lau @ 2025-07-30 0:01 UTC (permalink / raw)
To: Mahe Tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On 7/29/25 4:27 PM, Martin KaFai Lau wrote:
> On 7/29/25 2:09 AM, Mahe Tardy wrote:
>> On Mon, Jul 28, 2025 at 06:18:11PM -0700, Martin KaFai Lau wrote:
>>> On 7/28/25 2:43 AM, Mahe Tardy wrote:
>>>> +SEC("cgroup_skb/egress")
>>>> +int egress(struct __sk_buff *skb)
>>>> +{
>>>> + void *data = (void *)(long)skb->data;
>>>> + void *data_end = (void *)(long)skb->data_end;
>>>> + struct iphdr *iph;
>>>> + struct tcphdr *tcph;
>>>> +
>>>> + iph = data;
>>>> + if ((void *)(iph + 1) > data_end || iph->version != 4 ||
>>>> + iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
>>>> + return SK_PASS;
>>>> +
>>>> + tcph = (void *)iph + iph->ihl * 4;
>>>> + if ((void *)(tcph + 1) > data_end ||
>>>> + tcph->dest != bpf_htons(SERVER_PORT))
>>>> + return SK_PASS;
>>>> +
>>>> + kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
>>>> +
>>>> + /* returns SK_PASS to execute the test case quicker */
>>>
>>> Do you know why the user space is slower if 0 (SK_DROP) is used?
>>
>> I tried to write my understanding of this in the commit description:
>>
>> "Note that the BPF program returns SK_PASS to let the connection being
>> established to finish the test cases quicker. Otherwise, you have to
>> wait for the TCP three-way handshake to timeout in the kernel and
>> retrieve the errno translated from the unreach code set by the ICMP
>> control message."
>
> This feels like a bit hacky to let the 3WHS finished while the objective of the
> patch set is to drop it. It is not unusual for people to directly borrow this
> code. Does non blocking connect() help?
>
After reading more on how sk_err_soft is used, non blocking won't help. I think
I see why tcp rst is better.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
2025-07-30 0:01 ` Martin KaFai Lau
@ 2025-07-30 0:32 ` Martin KaFai Lau
0 siblings, 0 replies; 40+ messages in thread
From: Martin KaFai Lau @ 2025-07-30 0:32 UTC (permalink / raw)
To: Mahe Tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On 7/29/25 5:01 PM, Martin KaFai Lau wrote:
> On 7/29/25 4:27 PM, Martin KaFai Lau wrote:
>> On 7/29/25 2:09 AM, Mahe Tardy wrote:
>>> On Mon, Jul 28, 2025 at 06:18:11PM -0700, Martin KaFai Lau wrote:
>>>> On 7/28/25 2:43 AM, Mahe Tardy wrote:
>>>>> +SEC("cgroup_skb/egress")
>>>>> +int egress(struct __sk_buff *skb)
>>>>> +{
>>>>> + void *data = (void *)(long)skb->data;
>>>>> + void *data_end = (void *)(long)skb->data_end;
>>>>> + struct iphdr *iph;
>>>>> + struct tcphdr *tcph;
>>>>> +
>>>>> + iph = data;
>>>>> + if ((void *)(iph + 1) > data_end || iph->version != 4 ||
>>>>> + iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
>>>>> + return SK_PASS;
>>>>> +
>>>>> + tcph = (void *)iph + iph->ihl * 4;
>>>>> + if ((void *)(tcph + 1) > data_end ||
>>>>> + tcph->dest != bpf_htons(SERVER_PORT))
>>>>> + return SK_PASS;
>>>>> +
>>>>> + kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
>>>>> +
>>>>> + /* returns SK_PASS to execute the test case quicker */
>>>>
>>>> Do you know why the user space is slower if 0 (SK_DROP) is used?
>>>
>>> I tried to write my understanding of this in the commit description:
>>>
>>> "Note that the BPF program returns SK_PASS to let the connection being
>>> established to finish the test cases quicker. Otherwise, you have to
>>> wait for the TCP three-way handshake to timeout in the kernel and
>>> retrieve the errno translated from the unreach code set by the ICMP
>>> control message."
>>
>> This feels like a bit hacky to let the 3WHS finished while the objective of
>> the patch set is to drop it. It is not unusual for people to directly borrow
>> this code. Does non blocking connect() help?
>>
>
> After reading more on how sk_err_soft is used, non blocking won't help. I think
> I see why tcp rst is better.
>
Actually, while replying on the cover letter and looking at tcp_v4_err again,
there is an exception to do ip_icmp_error for TCP_SYN_SENT, so it may worth a
try on non blocking connect and then poll the sk for err if you haven't tried
that before.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc
2025-07-29 9:53 ` Mahe Tardy
@ 2025-07-30 1:54 ` Martin KaFai Lau
2025-08-01 18:50 ` Mahe Tardy
0 siblings, 1 reply; 40+ messages in thread
From: Martin KaFai Lau @ 2025-07-30 1:54 UTC (permalink / raw)
To: Mahe Tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On 7/29/25 2:53 AM, Mahe Tardy wrote:
>> Which other program types do you need this kfunc to send icmp and the future
>> tcp rst?
>
> I don't really know, I mostly need this in cgroup_skb for my use case
> but I could see other programs type using this either for simplification
> (for progs that can already rewrite the packet, like tc) or other
> programs types like cgroup_skb, because they can't touch the packet
> themselves.
I also don't think the tc needs this kfunc either. The tc should already have
ways to do this now.
>
>>
>> This cover letter mentioned sending icmp unreach is easier than sending tcp
>> rst. What problems do you see in sending tcp rst?
>>
>
> Yes, I based these patches on what net/ipv4/netfilter/ipt_REJECT.c's
> 'reject_tg' function does. In the case of sending ICMP unreach
> 'nf_send_unreach', the routing step is quite straighforward as they are
> only inverting the daddr and the saddr (that's what my renamed/moved
> ip_route_reply_fetch_dst helper does).
>
> In the case of sending RST 'nf_send_reset', there are extra steps, first
> the same routing mechanism is done by just inverting the daddr and the
> saddr but later 'ip_route_me_harder' is called which is doing a lot
> more. I'm currently not sure which parts of this must be ported to work
> in our BPF use case so I wanted to start with unreach.
I don't think we necessarily need to completely borrow from nf, the hooks'
locations are different and the use case may be different.
A concern that I have is the icmp6_send called by the kfunc. The icmp6_send
should eventually call to ip6_finish_output which may call the very same
"cgroup/egress" program again in a recursive way. The same for v4 icmp_send.
The icmp packet is sent from an internal kernel sk. I suspect you will see this
recursive behavior if the test is done in the default cgroup (/sys/fs/cgroup). I
think the is_ineligible(skb) should have stopped the second icmpv6_send from
replying to an icmp error and the cgroup hook cannot change the skb. However, I
am not sure I want to cross this bridge. Is there a way to avoid the recursive
bpf prog?
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc
2025-07-30 1:54 ` Martin KaFai Lau
@ 2025-08-01 18:50 ` Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 0/6] " Mahe Tardy
0 siblings, 1 reply; 40+ messages in thread
From: Mahe Tardy @ 2025-08-01 18:50 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, netdev, netfilter-devel, oe-kbuild-all, pablo,
lkp
On Tue, Jul 29, 2025 at 06:54:58PM -0700, Martin KaFai Lau wrote:
> On 7/29/25 2:53 AM, Mahe Tardy wrote:
> > > Which other program types do you need this kfunc to send icmp and the future
> > > tcp rst?
> >
> > I don't really know, I mostly need this in cgroup_skb for my use case
> > but I could see other programs type using this either for simplification
> > (for progs that can already rewrite the packet, like tc) or other
> > programs types like cgroup_skb, because they can't touch the packet
> > themselves.
>
> I also don't think the tc needs this kfunc either. The tc should already
> have ways to do this now.
>
> >
> > >
> > > This cover letter mentioned sending icmp unreach is easier than sending tcp
> > > rst. What problems do you see in sending tcp rst?
> > >
> >
> > Yes, I based these patches on what net/ipv4/netfilter/ipt_REJECT.c's
> > 'reject_tg' function does. In the case of sending ICMP unreach
> > 'nf_send_unreach', the routing step is quite straighforward as they are
> > only inverting the daddr and the saddr (that's what my renamed/moved
> > ip_route_reply_fetch_dst helper does).
> >
> > In the case of sending RST 'nf_send_reset', there are extra steps, first
> > the same routing mechanism is done by just inverting the daddr and the
> > saddr but later 'ip_route_me_harder' is called which is doing a lot
> > more. I'm currently not sure which parts of this must be ported to work
> > in our BPF use case so I wanted to start with unreach.
>
> I don't think we necessarily need to completely borrow from nf, the hooks'
> locations are different and the use case may be different.
>
> A concern that I have is the icmp6_send called by the kfunc. The icmp6_send
> should eventually call to ip6_finish_output which may call the very same
> "cgroup/egress" program again in a recursive way. The same for v4 icmp_send.
>
> The icmp packet is sent from an internal kernel sk. I suspect you will see
> this recursive behavior if the test is done in the default cgroup
> (/sys/fs/cgroup). I think the is_ineligible(skb) should have stopped the
> second icmpv6_send from replying to an icmp error and the cgroup hook cannot
> change the skb. However, I am not sure I want to cross this bridge. Is there
> a way to avoid the recursive bpf prog?
>
Thanks Martin for the review. Indeed the recursive BPF prog call is a
concerning issue. I'll take some time to think about it and hopefully
propose something.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests
2025-07-28 9:43 ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
2025-07-28 15:40 ` Yonghong Song
2025-07-29 1:18 ` Martin KaFai Lau
@ 2025-08-05 23:26 ` Jordan Rife
2 siblings, 0 replies; 40+ messages in thread
From: Jordan Rife @ 2025-08-05 23:26 UTC (permalink / raw)
To: Mahe Tardy
Cc: lkp, alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
On Mon, Jul 28, 2025 at 09:43:45AM +0000, Mahe Tardy wrote:
> This test opens a server and client, attach a cgroup_skb program on
> egress and calls the icmp_send_unreach function from the client egress
> so that an ICMP unreach control message is sent back to the client.
> It then fetches the message from the error queue to confirm the correct
> ICMP unreach code has been sent.
>
> Note that the BPF program returns SK_PASS to let the connection being
> established to finish the test cases quicker. Otherwise, you have to
> wait for the TCP three-way handshake to timeout in the kernel and
> retrieve the errno translated from the unreach code set by the ICMP
> control message.
>
> Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> ---
> .../bpf/prog_tests/icmp_send_unreach_kfunc.c | 99 +++++++++++++++++++
> .../selftests/bpf/progs/icmp_send_unreach.c | 36 +++++++
> 2 files changed, 135 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> new file mode 100644
> index 000000000000..414c1ed8ced3
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> @@ -0,0 +1,99 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <linux/errqueue.h>
> +#include "icmp_send_unreach.skel.h"
> +
> +#define TIMEOUT_MS 1000
> +#define SRV_PORT 54321
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +#define ICMP_FRAG_NEEDED 4
> +#define NR_ICMP_UNREACH 15
small nit: Any reason why ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, and
NR_ICMP_UNREACH are redefined here? I think you should just be able to
#include <linux/icmp.h> this at the top to avoid redefining these.
> +
> +static void read_icmp_errqueue(int sockfd, int expected_code)
> +{
> + ssize_t n;
> + struct sock_extended_err *sock_err;
> + struct cmsghdr *cm;
> + char ctrl_buf[512];
> + struct msghdr msg = {
> + .msg_control = ctrl_buf,
> + .msg_controllen = sizeof(ctrl_buf),
> + };
> +
> + n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
> + if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
> + return;
> +
> + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
> + if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
> + !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
> + continue;
> +
> + sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
> +
> + if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
> + "sock_err_origin_icmp"))
> + return;
> + if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
> + "sock_err_type_dest_unreach"))
> + return;
> + ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
> + }
> +}
> +
> +void test_icmp_send_unreach_kfunc(void)
> +{
> + struct icmp_send_unreach *skel;
> + int cgroup_fd = -1, client_fd = 1, srv_fd = -1;
> + int *code;
> +
> + skel = icmp_send_unreach__open_and_load();
> + if (!ASSERT_OK_PTR(skel, "skel_open"))
> + goto cleanup;
> +
> + cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
> + if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
> + goto cleanup;
> +
> + skel->links.egress =
> + bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
> + if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
> + goto cleanup;
> +
> + code = &skel->bss->unreach_code;
> +
> + for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
> + // The TCP stack reacts differently when asking for
> + // fragmentation, let's ignore it for now
> + if (*code == ICMP_FRAG_NEEDED)
> + continue;
> +
> + skel->bss->kfunc_ret = -1;
> +
> + srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1",
> + SRV_PORT, TIMEOUT_MS);
> + if (!ASSERT_GE(srv_fd, 0, "start_server"))
> + goto for_cleanup;
> +
> + client_fd = socket(AF_INET, SOCK_STREAM, 0);
> + ASSERT_GE(client_fd, 0, "client_socket");
> +
> + client_fd = connect_to_fd(srv_fd, 0);
> + if (!ASSERT_GE(client_fd, 0, "client_connect"))
> + goto for_cleanup;
> +
> + read_icmp_errqueue(client_fd, *code);
> +
> + ASSERT_EQ(skel->bss->kfunc_ret, SK_DROP, "kfunc_ret");
It might be worth testing that the kfunc returns -EINVAL when the code
is outside the accepted range as well for completeness.
> +for_cleanup:
> + close(client_fd);
> + close(srv_fd);
> + }
> +
> +cleanup:
> + icmp_send_unreach__destroy(skel);
> + close(cgroup_fd);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> new file mode 100644
> index 000000000000..15783e5d1d65
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
> @@ -0,0 +1,36 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_endian.h>
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> +
> +int unreach_code = 0;
> +int kfunc_ret = 0;
> +
> +#define SERVER_PORT 54321
> +#define SERVER_IP 0x7F000001
> +
> +SEC("cgroup_skb/egress")
> +int egress(struct __sk_buff *skb)
> +{
> + void *data = (void *)(long)skb->data;
> + void *data_end = (void *)(long)skb->data_end;
> + struct iphdr *iph;
> + struct tcphdr *tcph;
> +
> + iph = data;
> + if ((void *)(iph + 1) > data_end || iph->version != 4 ||
> + iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
> + return SK_PASS;
> +
> + tcph = (void *)iph + iph->ihl * 4;
> + if ((void *)(tcph + 1) > data_end ||
> + tcph->dest != bpf_htons(SERVER_PORT))
> + return SK_PASS;
> +
> + kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
> +
> + /* returns SK_PASS to execute the test case quicker */
> + return SK_PASS;
> +}
> --
> 2.34.1
>
Jordan
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH bpf-next v4 0/6] bpf: add icmp_send_unreach kfunc
2025-08-01 18:50 ` Mahe Tardy
@ 2026-04-20 10:58 ` Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
` (5 more replies)
0 siblings, 6 replies; 40+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
To: mahe.tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
Hello,
This is v4 of adding the icmp_send_unreach kfunc, as suggested during
LSF/MM/BPF 2025[^1]. The goal is to allow cgroup_skb programs to
actively reject east-west traffic, similarly to what is possible to do
with netfilter reject target.
The first step to implement this is using ICMP control messages, with
the ICMP_DEST_UNREACH type with various code ICMP_NET_UNREACH,
ICMP_HOST_UNREACH, ICMP_PROT_UNREACH, etc. This is easier to implement
than a TCP RST reply and will already hint the client TCP stack to abort
the connection and not retry extensively.
Note that this is different than the sock_destroy kfunc, that along
calls tcp_abort and thus sends a reset, destroying the underlying
socket.
Caveats of this kfunc design are that a program can call this function N
times, thus send N ICMP unreach control messages and that the program
can return from the BPF filter with pass leading to a potential
confusing situation where the TCP connection was established while the
client received ICMP_DEST_UNREACH messages.
Initially, this kfunc was added only to cgroup_skb programs, Alexei
suggested not creating its own kfunc set and adding it to the more
global bpf_kfunc_set_skb. Now that recursion is handled and I realized,
thanks to Martin, that fetching the dst route might be only useful in
situation in which the packet was not yet routed, I decided to extend
the kfunc to more program types and route the packet only if needed.
v2 updates:
- fix a build error from a missing function call rename;
- avoid changing return line in bpf_kfunc_init;
- return SK_DROP from the kfunc (similarly to bpf_redirect);
- check the return value in the selftest.
v3 update:
- fix an undefined reference build error.
v4 updates:
- prevent the kfunc to be called recursively and add a test (thanks to
Martin).
- do not fetch dst route when unnecessary (thanks to Martin).
- extend the test for IPv6 (thanks to Martin).
- use SK_DROP in examples and use non blocking sockets for testing
(thanks to Martin).
- test when the kfunc returns -EINVAL (thanks to Jordan).
- add the kfunc to bpf_kfunc_set_skb as suggested by Alexei.
- guard the IPv4 parts with IS_ENABLED(CONFIG_INET).
- fix a wrong initial value for client_fd (thanks to Yonghong).
- add documentation to the kfunc.
- to Jordan: I couldn't include <linux/icmp.h> because of redefines from
<network_helpers.h>.
[^1]: https://lwn.net/Articles/1022034/
Mahe Tardy (6):
net: move netfilter nf_reject_fill_skb_dst to core ipv4
net: move netfilter nf_reject6_fill_skb_dst to core ipv6
bpf: add bpf_icmp_send_unreach kfunc
selftests/bpf: add icmp_send_unreach kfunc tests
selftests/bpf: add icmp_send_unreach kfunc IPv6 tests
selftests/bpf: add icmp_send_unreach_recursion test
include/net/ip6_route.h | 2 +
include/net/route.h | 1 +
net/core/filter.c | 85 ++++++++
net/ipv4/netfilter/nf_reject_ipv4.c | 19 +-
net/ipv4/route.c | 15 ++
net/ipv6/netfilter/nf_reject_ipv6.c | 17 +-
net/ipv6/route.c | 18 ++
.../bpf/prog_tests/icmp_send_unreach_kfunc.c | 202 ++++++++++++++++++
.../selftests/bpf/progs/icmp_send_unreach.c | 100 +++++++++
9 files changed, 426 insertions(+), 33 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c
--
2.34.1
^ permalink raw reply [flat|nested] 40+ messages in thread
* [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4
2026-04-20 10:58 ` [PATCH bpf-next v4 0/6] " Mahe Tardy
@ 2026-04-20 10:58 ` Mahe Tardy
2026-04-20 11:36 ` bot+bpf-ci
2026-04-20 10:58 ` [PATCH bpf-next v4 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
` (4 subsequent siblings)
5 siblings, 1 reply; 40+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
To: mahe.tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
Move and rename nf_reject_fill_skb_dst from
ipv4/netfilter/nf_reject_ipv4 to ip_route_reply_fetch_dst in
ipv4/route.c so that it can be reused in the following patches by BPF
kfuncs.
Netfilter uses nf_ip_route that is almost a transparent wrapper around
ip_route_output_key so this patch inlines it.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
include/net/route.h | 1 +
net/ipv4/netfilter/nf_reject_ipv4.c | 19 ++-----------------
net/ipv4/route.c | 15 +++++++++++++++
3 files changed, 18 insertions(+), 17 deletions(-)
diff --git a/include/net/route.h b/include/net/route.h
index f90106f383c5..ec2466fd0bec 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
const struct sock *sk);
struct dst_entry *ipv4_blackhole_route(struct net *net,
struct dst_entry *dst_orig);
+int ip_route_reply_fetch_dst(struct sk_buff *skb);
static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
{
diff --git a/net/ipv4/netfilter/nf_reject_ipv4.c b/net/ipv4/netfilter/nf_reject_ipv4.c
index fecf6621f679..2290451ed122 100644
--- a/net/ipv4/netfilter/nf_reject_ipv4.c
+++ b/net/ipv4/netfilter/nf_reject_ipv4.c
@@ -252,21 +252,6 @@ static void nf_reject_ip_tcphdr_put(struct sk_buff *nskb, const struct sk_buff *
nskb->csum_offset = offsetof(struct tcphdr, check);
}
-static int nf_reject_fill_skb_dst(struct sk_buff *skb_in)
-{
- struct dst_entry *dst = NULL;
- struct flowi fl;
-
- memset(&fl, 0, sizeof(struct flowi));
- fl.u.ip4.daddr = ip_hdr(skb_in)->saddr;
- nf_ip_route(dev_net(skb_in->dev), &dst, &fl, false);
- if (!dst)
- return -1;
-
- skb_dst_set(skb_in, dst);
- return 0;
-}
-
/* Send RST reply */
void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
int hook)
@@ -279,7 +264,7 @@ void nf_send_reset(struct net *net, struct sock *sk, struct sk_buff *oldskb,
if (!oth)
return;
- if (!skb_dst(oldskb) && nf_reject_fill_skb_dst(oldskb) < 0)
+ if (!skb_dst(oldskb) && ip_route_reply_fetch_dst(oldskb) < 0)
return;
if (skb_rtable(oldskb)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST))
@@ -352,7 +337,7 @@ void nf_send_unreach(struct sk_buff *skb_in, int code, int hook)
if (iph->frag_off & htons(IP_OFFSET))
return;
- if (!skb_dst(skb_in) && nf_reject_fill_skb_dst(skb_in) < 0)
+ if (!skb_dst(skb_in) && ip_route_reply_fetch_dst(skb_in) < 0)
return;
if (skb_csum_unnecessary(skb_in) ||
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index bc1296f0ea69..7091ef936073 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2945,6 +2945,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
}
EXPORT_SYMBOL_GPL(ip_route_output_flow);
+int ip_route_reply_fetch_dst(struct sk_buff *skb)
+{
+ struct rtable *rt;
+ struct flowi4 fl4 = {
+ .daddr = ip_hdr(skb)->saddr
+ };
+
+ rt = ip_route_output_key(dev_net(skb->dev), &fl4);
+ if (IS_ERR(rt))
+ return PTR_ERR(rt);
+ skb_dst_set(skb, &rt->dst);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);
+
/* called with rcu_read_lock held */
static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
struct rtable *rt, u32 table_id, dscp_t dscp,
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v4 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6
2026-04-20 10:58 ` [PATCH bpf-next v4 0/6] " Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2026-04-20 10:58 ` Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc Mahe Tardy
` (3 subsequent siblings)
5 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
To: mahe.tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
Move and rename nf_reject6_fill_skb_dst from
ipv6/netfilter/nf_reject_ipv6 to ip6_route_reply_fetch_dst in
ipv6/route.c so that it can be reused in the following patches by BPF
kfuncs.
Netfilter uses nf_ip6_route that is almost a transparent wrapper around
ip6_route_output so this patch inlines it.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
include/net/ip6_route.h | 2 ++
net/ipv6/netfilter/nf_reject_ipv6.c | 17 +----------------
net/ipv6/route.c | 18 ++++++++++++++++++
3 files changed, 21 insertions(+), 16 deletions(-)
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 09ffe0f13ce7..3652efec7081 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -100,6 +100,8 @@ static inline struct dst_entry *ip6_route_output(struct net *net,
return ip6_route_output_flags(net, sk, fl6, 0);
}
+int ip6_route_reply_fetch_dst(struct sk_buff *skb);
+
/* Only conditionally release dst if flags indicates
* !RT6_LOOKUP_F_DST_NOREF or dst is in uncached_list.
*/
diff --git a/net/ipv6/netfilter/nf_reject_ipv6.c b/net/ipv6/netfilter/nf_reject_ipv6.c
index ef5b7e85cffa..9663d1db6d80 100644
--- a/net/ipv6/netfilter/nf_reject_ipv6.c
+++ b/net/ipv6/netfilter/nf_reject_ipv6.c
@@ -293,21 +293,6 @@ nf_reject_ip6_tcphdr_put(struct sk_buff *nskb,
sizeof(struct tcphdr), 0));
}
-static int nf_reject6_fill_skb_dst(struct sk_buff *skb_in)
-{
- struct dst_entry *dst = NULL;
- struct flowi fl;
-
- memset(&fl, 0, sizeof(struct flowi));
- fl.u.ip6.daddr = ipv6_hdr(skb_in)->saddr;
- nf_ip6_route(dev_net(skb_in->dev), &dst, &fl, false);
- if (!dst)
- return -1;
-
- skb_dst_set(skb_in, dst);
- return 0;
-}
-
void nf_send_reset6(struct net *net, struct sock *sk, struct sk_buff *oldskb,
int hook)
{
@@ -440,7 +425,7 @@ void nf_send_unreach6(struct net *net, struct sk_buff *skb_in,
if (hooknum == NF_INET_LOCAL_OUT && skb_in->dev == NULL)
skb_in->dev = net->loopback_dev;
- if (!skb_dst(skb_in) && nf_reject6_fill_skb_dst(skb_in) < 0)
+ if (!skb_dst(skb_in) && ip6_route_reply_fetch_dst(skb_in) < 0)
return;
icmpv6_send(skb_in, ICMPV6_DEST_UNREACH, code, 0);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 19eb6b702227..41871fddec4d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2721,6 +2721,24 @@ struct dst_entry *ip6_route_output_flags(struct net *net,
}
EXPORT_SYMBOL_GPL(ip6_route_output_flags);
+int ip6_route_reply_fetch_dst(struct sk_buff *skb)
+{
+ struct dst_entry *result;
+ struct flowi6 fl = {
+ .daddr = ipv6_hdr(skb)->saddr
+ };
+ int err;
+
+ result = ip6_route_output(dev_net(skb->dev), NULL, &fl);
+ err = result->error;
+ if (err)
+ dst_release(result);
+ else
+ skb_dst_set(skb, result);
+ return err;
+}
+EXPORT_SYMBOL_GPL(ip6_route_reply_fetch_dst);
+
struct dst_entry *ip6_blackhole_route(struct net *net, struct dst_entry *dst_orig)
{
struct rt6_info *rt, *ort = dst_rt6_info(dst_orig);
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc
2026-04-20 10:58 ` [PATCH bpf-next v4 0/6] " Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
@ 2026-04-20 10:58 ` Mahe Tardy
2026-04-20 11:36 ` bot+bpf-ci
2026-04-20 10:58 ` [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
` (2 subsequent siblings)
5 siblings, 1 reply; 40+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
To: mahe.tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
This is needed in the context of Tetragon to provide improved feedback
(in contrast to just dropping packets) to east-west traffic when blocked
by policies using cgroup_skb programs.
This reuse concepts from netfilter reject target codepath with the
differences that:
* Packets are cloned since the BPF user can still let the packet pass
(SK_PASS from the cgroup_skb progs for example) and the current skb
need to stay untouched (cgroup_skb hooks only allow read-only skb
payload). The kfunc set the dst of the cloned skb by using the saddr
as the daddr and routing it.
* Checksums are not computed or verified and IPv4 fragmentation is not
checked early (icmp_send will check).
* We protect against recursion since the kfunc, by generating an ICMP
error message could retrigger the BPF prog that invoked it.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
net/core/filter.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 85 insertions(+)
diff --git a/net/core/filter.c b/net/core/filter.c
index fcfcb72663ca..a6c3b9145c93 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -84,6 +84,10 @@
#include <linux/un.h>
#include <net/xdp_sock_drv.h>
#include <net/inet_dscp.h>
+#include <linux/icmp.h>
+#include <net/icmp.h>
+#include <net/route.h>
+#include <net/ip6_route.h>
#include "dev.h"
@@ -12423,6 +12427,86 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
return 0;
}
+static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
+
+/**
+ * bpf_icmp_send_unreach - Send ICMP destination unreachable error
+ * @skb: Packet that triggered the error
+ * @code: ICMP unreachable code (0-15 for IPv4, 0-6 for IPv6)
+ *
+ * Sends an ICMP destination unreachable message in response to the
+ * packet. The original packet is cloned before sending the ICMP error,
+ * so the BPF program can still let the packet pass if desired.
+ *
+ * Recursion protection: If called from a context that would trigger
+ * recursion (e.g., root cgroup processing its own ICMP packets),
+ * returns -EBUSY on re-entry.
+ *
+ * Return: 0 on success, negative error code on failure:
+ * -EINVAL: Invalid code parameter
+ * -ENOMEM: Memory allocation failed
+ * -EHOSTUNREACH: Routing lookup failed
+ * -EBUSY: Recursion detected
+ * -EPROTONOSUPPORT: Non-IP protocol
+ */
+__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
+{
+ struct sk_buff *skb = (struct sk_buff *)__skb;
+ struct sk_buff *nskb;
+ bool *in_progress;
+
+ in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
+ if (*in_progress)
+ return -EBUSY;
+
+ switch (skb->protocol) {
+#if IS_ENABLED(CONFIG_INET)
+ case htons(ETH_P_IP):
+ if (code < 0 || code > NR_ICMP_UNREACH)
+ return -EINVAL;
+
+ nskb = skb_clone(skb, GFP_ATOMIC);
+ if (!nskb)
+ return -ENOMEM;
+
+ if (!skb_dst(nskb) && ip_route_reply_fetch_dst(nskb) < 0) {
+ kfree_skb(nskb);
+ return -EHOSTUNREACH;
+ }
+
+ *in_progress = true;
+ icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
+ *in_progress = false;
+ kfree_skb(nskb);
+ break;
+#endif
+#if IS_ENABLED(CONFIG_IPV6)
+ case htons(ETH_P_IPV6):
+ if (code < 0 || code > ICMPV6_REJECT_ROUTE)
+ return -EINVAL;
+
+ nskb = skb_clone(skb, GFP_ATOMIC);
+ if (!nskb)
+ return -ENOMEM;
+
+ if (!skb_dst(nskb) && ip6_route_reply_fetch_dst(nskb) < 0) {
+ kfree_skb(nskb);
+ return -EHOSTUNREACH;
+ }
+
+ *in_progress = true;
+ icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
+ *in_progress = false;
+ kfree_skb(nskb);
+ break;
+#endif
+ default:
+ return -EPROTONOSUPPORT;
+ }
+
+ return 0;
+}
+
__bpf_kfunc_end_defs();
int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
@@ -12442,6 +12526,7 @@ int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
BTF_KFUNCS_START(bpf_kfunc_check_set_skb)
BTF_ID_FLAGS(func, bpf_dynptr_from_skb)
+BTF_ID_FLAGS(func, bpf_icmp_send_unreach)
BTF_KFUNCS_END(bpf_kfunc_check_set_skb)
BTF_KFUNCS_START(bpf_kfunc_check_set_skb_meta)
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests
2026-04-20 10:58 ` [PATCH bpf-next v4 0/6] " Mahe Tardy
` (2 preceding siblings ...)
2026-04-20 10:58 ` [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc Mahe Tardy
@ 2026-04-20 10:58 ` Mahe Tardy
2026-04-20 11:36 ` bot+bpf-ci
2026-04-20 10:58 ` [PATCH bpf-next v4 5/6] selftests/bpf: add icmp_send_unreach kfunc IPv6 tests Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 6/6] selftests/bpf: add icmp_send_unreach_recursion test Mahe Tardy
5 siblings, 1 reply; 40+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
To: mahe.tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
This test opens a server and client, enters a new cgroup, attach a
cgroup_skb program on egress and calls the icmp_send_unreach function
from the client egress so that an ICMP unreach control message is sent
back to the client. It then fetches the message from the error queue to
confirm the correct ICMP unreach code has been sent.
Note that, for the client, we have to connect in non-blocking mode to
let the test execute faster. Otherwise, we need to wait for the TCP
three-way handshake to timeout in the kernel before reading the errno.
Also note that we don't set IP_RECVERR on the socket in
connect_to_fd_nonblock since the error will be transferred anyway in our
test because the connection is rejected at the beginning of the TCP
handshake. See in net/ipv4/tcp_ipv4.c:tcp_v4_err line 615 to 655 for
more details.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
.../bpf/prog_tests/icmp_send_unreach_kfunc.c | 136 ++++++++++++++++++
.../selftests/bpf/progs/icmp_send_unreach.c | 36 +++++
2 files changed, 172 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
create mode 100644 tools/testing/selftests/bpf/progs/icmp_send_unreach.c
diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
new file mode 100644
index 000000000000..24d5e01cfe80
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <linux/errqueue.h>
+#include "icmp_send_unreach.skel.h"
+
+#define TIMEOUT_MS 1000
+#define SRV_PORT 54321
+
+#define ICMP_DEST_UNREACH 3
+
+#define ICMP_FRAG_NEEDED 4
+#define NR_ICMP_UNREACH 15
+
+static int connect_to_fd_nonblock(int server_fd)
+{
+ struct sockaddr_storage addr;
+ socklen_t len = sizeof(addr);
+ int fd, err;
+
+ if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
+ return -1;
+
+ fd = socket(addr.ss_family, SOCK_STREAM | SOCK_NONBLOCK, 0);
+ if (fd < 0)
+ return -1;
+
+ err = connect(fd, (struct sockaddr *)&addr, len);
+ if (err < 0 && errno != EINPROGRESS) {
+ close(fd);
+ return -1;
+ }
+
+ return fd;
+}
+
+static void read_icmp_errqueue(int sockfd, int expected_code)
+{
+ ssize_t n;
+ struct sock_extended_err *sock_err;
+ struct cmsghdr *cm;
+ char ctrl_buf[512];
+ struct msghdr msg = {
+ .msg_control = ctrl_buf,
+ .msg_controllen = sizeof(ctrl_buf),
+ };
+
+ n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
+ if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
+ return;
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (!ASSERT_NEQ(cm, NULL, "cm_firsthdr_null"))
+ return;
+
+ for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
+ if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
+ !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
+ continue;
+
+ sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
+
+ if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
+ "sock_err_origin_icmp"))
+ return;
+ if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+ "sock_err_type_dest_unreach"))
+ return;
+ ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
+ }
+}
+
+static void trigger_prog_read_icmp_errqueue(int *code)
+{
+ int srv_fd = -1, client_fd = -1;
+
+ srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", SRV_PORT,
+ TIMEOUT_MS);
+ if (!ASSERT_GE(srv_fd, 0, "start_server"))
+ return;
+
+ client_fd = connect_to_fd_nonblock(srv_fd);
+ if (!ASSERT_GE(client_fd, 0, "client_connect_nonblock")) {
+ close(srv_fd);
+ return;
+ }
+
+ /* Skip reading ICMP error queue if code is invalid */
+ if (*code >= 0 && *code <= NR_ICMP_UNREACH)
+ read_icmp_errqueue(client_fd, *code);
+
+ close(srv_fd);
+ close(client_fd);
+}
+
+void test_icmp_send_unreach_kfunc(void)
+{
+ struct icmp_send_unreach *skel;
+ int cgroup_fd = -1;
+ int *code;
+
+ skel = icmp_send_unreach__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open"))
+ goto cleanup;
+
+ cgroup_fd = test__join_cgroup("/icmp_send_unreach_cgroup");
+ if (!ASSERT_GE(cgroup_fd, 0, "join_cgroup"))
+ goto cleanup;
+
+ skel->links.egress =
+ bpf_program__attach_cgroup(skel->progs.egress, cgroup_fd);
+ if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
+ goto cleanup;
+
+ code = &skel->bss->unreach_code;
+
+ for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
+ /* The TCP stack reacts differently when asking for
+ * fragmentation, let's ignore it for now.
+ */
+ if (*code == ICMP_FRAG_NEEDED)
+ continue;
+
+ trigger_prog_read_icmp_errqueue(code);
+ ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
+ }
+
+ /* Test an invalid code */
+ *code = -1;
+ trigger_prog_read_icmp_errqueue(code);
+ ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+
+cleanup:
+ icmp_send_unreach__destroy(skel);
+ close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
new file mode 100644
index 000000000000..6fc5595f08aa
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -0,0 +1,36 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#define SERVER_PORT 54321
+/* 127.0.0.1 in network byte order */
+#define SERVER_IP 0x7F000001
+
+int unreach_code = 0;
+int kfunc_ret = -1;
+
+SEC("cgroup_skb/egress")
+int egress(struct __sk_buff *skb)
+{
+ void *data = (void *)(long)skb->data;
+ void *data_end = (void *)(long)skb->data_end;
+ struct iphdr *iph;
+ struct tcphdr *tcph;
+
+ iph = data;
+ if ((void *)(iph + 1) > data_end || iph->version != 4 ||
+ iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+ return SK_PASS;
+
+ tcph = (void *)iph + iph->ihl * 4;
+ if ((void *)(tcph + 1) > data_end ||
+ tcph->dest != bpf_htons(SERVER_PORT))
+ return SK_PASS;
+
+ kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
+
+ return SK_DROP;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v4 5/6] selftests/bpf: add icmp_send_unreach kfunc IPv6 tests
2026-04-20 10:58 ` [PATCH bpf-next v4 0/6] " Mahe Tardy
` (3 preceding siblings ...)
2026-04-20 10:58 ` [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
@ 2026-04-20 10:58 ` Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 6/6] selftests/bpf: add icmp_send_unreach_recursion test Mahe Tardy
5 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
To: mahe.tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
This test extend the existing IPv4 tests to IPv6.
Note that we need to set IP_RECVERR on the socket for IPv6 in
connect_to_fd_nonblock otherwise the error will be ignored even if we
are in the middle of the TCP handshake. See in
net/ipv6/datagram.c:ipv6_icmp_error line 313 for more details.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
.../bpf/prog_tests/icmp_send_unreach_kfunc.c | 83 ++++++++++++-------
.../selftests/bpf/progs/icmp_send_unreach.c | 46 ++++++++--
2 files changed, 93 insertions(+), 36 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
index 24d5e01cfe80..047bfd4d80f7 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -8,15 +8,17 @@
#define SRV_PORT 54321
#define ICMP_DEST_UNREACH 3
+#define ICMPV6_DEST_UNREACH 1
#define ICMP_FRAG_NEEDED 4
#define NR_ICMP_UNREACH 15
+#define NR_ICMPV6_UNREACH 6
static int connect_to_fd_nonblock(int server_fd)
{
struct sockaddr_storage addr;
socklen_t len = sizeof(addr);
- int fd, err;
+ int fd, err, on = 1;
if (getsockname(server_fd, (struct sockaddr *)&addr, &len))
return -1;
@@ -25,6 +27,12 @@ static int connect_to_fd_nonblock(int server_fd)
if (fd < 0)
return -1;
+ if (addr.ss_family == AF_INET6 &&
+ setsockopt(fd, IPPROTO_IPV6, IPV6_RECVERR, &on, sizeof(on)) < 0) {
+ close(fd);
+ return -1;
+ }
+
err = connect(fd, (struct sockaddr *)&addr, len);
if (err < 0 && errno != EINPROGRESS) {
close(fd);
@@ -34,7 +42,7 @@ static int connect_to_fd_nonblock(int server_fd)
return fd;
}
-static void read_icmp_errqueue(int sockfd, int expected_code)
+static void read_icmp_errqueue(int sockfd, int expected_code, int af)
{
ssize_t n;
struct sock_extended_err *sock_err;
@@ -44,6 +52,12 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
.msg_control = ctrl_buf,
.msg_controllen = sizeof(ctrl_buf),
};
+ int expected_level = (af == AF_INET) ? IPPROTO_IP : IPPROTO_IPV6;
+ int expected_type = (af == AF_INET) ? IP_RECVERR : IPV6_RECVERR;
+ int expected_origin = (af == AF_INET) ? SO_EE_ORIGIN_ICMP :
+ SO_EE_ORIGIN_ICMP6;
+ int expected_ee_type = (af == AF_INET) ? ICMP_DEST_UNREACH :
+ ICMPV6_DEST_UNREACH;
n = recvmsg(sockfd, &msg, MSG_ERRQUEUE);
if (!ASSERT_GE(n, 0, "recvmsg_errqueue"))
@@ -54,28 +68,27 @@ static void read_icmp_errqueue(int sockfd, int expected_code)
return;
for (; cm; cm = CMSG_NXTHDR(&msg, cm)) {
- if (!ASSERT_EQ(cm->cmsg_level, IPPROTO_IP, "cmsg_type") ||
- !ASSERT_EQ(cm->cmsg_type, IP_RECVERR, "cmsg_level"))
+ if (!ASSERT_EQ(cm->cmsg_level, expected_level, "cmsg_level") ||
+ !ASSERT_EQ(cm->cmsg_type, expected_type, "cmsg_type"))
continue;
sock_err = (struct sock_extended_err *)CMSG_DATA(cm);
- if (!ASSERT_EQ(sock_err->ee_origin, SO_EE_ORIGIN_ICMP,
- "sock_err_origin_icmp"))
+ if (!ASSERT_EQ(sock_err->ee_origin, expected_origin,
+ "sock_err_origin"))
return;
- if (!ASSERT_EQ(sock_err->ee_type, ICMP_DEST_UNREACH,
+ if (!ASSERT_EQ(sock_err->ee_type, expected_ee_type,
"sock_err_type_dest_unreach"))
return;
ASSERT_EQ(sock_err->ee_code, expected_code, "sock_err_code");
}
}
-static void trigger_prog_read_icmp_errqueue(int *code)
+static void trigger_prog_read_icmp_errqueue(int *code, int af, const char *addr)
{
int srv_fd = -1, client_fd = -1;
- srv_fd = start_server(AF_INET, SOCK_STREAM, "127.0.0.1", SRV_PORT,
- TIMEOUT_MS);
+ srv_fd = start_server(af, SOCK_STREAM, addr, SRV_PORT, TIMEOUT_MS);
if (!ASSERT_GE(srv_fd, 0, "start_server"))
return;
@@ -86,18 +99,40 @@ static void trigger_prog_read_icmp_errqueue(int *code)
}
/* Skip reading ICMP error queue if code is invalid */
- if (*code >= 0 && *code <= NR_ICMP_UNREACH)
- read_icmp_errqueue(client_fd, *code);
+ if (*code >= 0 && ((af == AF_INET && *code <= NR_ICMP_UNREACH) ||
+ (af == AF_INET6 && *code <= NR_ICMPV6_UNREACH)))
+ read_icmp_errqueue(client_fd, *code, af);
- close(srv_fd);
close(client_fd);
+ close(srv_fd);
+}
+
+static void run_icmp_test(struct icmp_send_unreach *skel, int af,
+ const char *addr, int max_code)
+{
+ int *code = &skel->bss->unreach_code;
+
+ for (*code = 0; *code <= max_code; (*code)++) {
+ /* The TCP stack reacts differently when asking for
+ * fragmentation, let's ignore it for now.
+ */
+ if (af == AF_INET && *code == ICMP_FRAG_NEEDED)
+ continue;
+
+ trigger_prog_read_icmp_errqueue(code, af, addr);
+ ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
+ }
+
+ /* Test an invalid code */
+ *code = -1;
+ trigger_prog_read_icmp_errqueue(code, af, addr);
+ ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
}
void test_icmp_send_unreach_kfunc(void)
{
struct icmp_send_unreach *skel;
int cgroup_fd = -1;
- int *code;
skel = icmp_send_unreach__open_and_load();
if (!ASSERT_OK_PTR(skel, "skel_open"))
@@ -112,23 +147,11 @@ void test_icmp_send_unreach_kfunc(void)
if (!ASSERT_OK_PTR(skel->links.egress, "prog_attach_cgroup"))
goto cleanup;
- code = &skel->bss->unreach_code;
-
- for (*code = 0; *code <= NR_ICMP_UNREACH; (*code)++) {
- /* The TCP stack reacts differently when asking for
- * fragmentation, let's ignore it for now.
- */
- if (*code == ICMP_FRAG_NEEDED)
- continue;
-
- trigger_prog_read_icmp_errqueue(code);
- ASSERT_EQ(skel->data->kfunc_ret, 0, "kfunc_ret");
- }
+ if (test__start_subtest("ipv4"))
+ run_icmp_test(skel, AF_INET, "127.0.0.1", NR_ICMP_UNREACH);
- /* Test an invalid code */
- *code = -1;
- trigger_prog_read_icmp_errqueue(code);
- ASSERT_EQ(skel->data->kfunc_ret, -EINVAL, "kfunc_ret");
+ if (test__start_subtest("ipv6"))
+ run_icmp_test(skel, AF_INET6, "::1", NR_ICMPV6_UNREACH);
cleanup:
icmp_send_unreach__destroy(skel);
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
index 6fc5595f08aa..112b9cbfab6f 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -6,6 +6,11 @@
#define SERVER_PORT 54321
/* 127.0.0.1 in network byte order */
#define SERVER_IP 0x7F000001
+/* ::1 in network byte order */
+#define SERVER_IP6_0 0x00000000
+#define SERVER_IP6_1 0x00000000
+#define SERVER_IP6_2 0x00000000
+#define SERVER_IP6_3 0x01000000
int unreach_code = 0;
int kfunc_ret = -1;
@@ -16,17 +21,46 @@ int egress(struct __sk_buff *skb)
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct iphdr *iph;
+ struct ipv6hdr *ip6h;
struct tcphdr *tcph;
+ __u8 version;
- iph = data;
- if ((void *)(iph + 1) > data_end || iph->version != 4 ||
- iph->protocol != IPPROTO_TCP || iph->daddr != bpf_htonl(SERVER_IP))
+ if (data + 1 > data_end)
return SK_PASS;
- tcph = (void *)iph + iph->ihl * 4;
- if ((void *)(tcph + 1) > data_end ||
- tcph->dest != bpf_htons(SERVER_PORT))
+ version = (*((__u8 *)data)) >> 4;
+
+ if (version == 4) {
+ iph = data;
+ if ((void *)(iph + 1) > data_end ||
+ iph->protocol != IPPROTO_TCP ||
+ iph->daddr != bpf_htonl(SERVER_IP))
+ return SK_PASS;
+
+ tcph = (void *)iph + iph->ihl * 4;
+ if ((void *)(tcph + 1) > data_end ||
+ tcph->dest != bpf_htons(SERVER_PORT))
+ return SK_PASS;
+
+ } else if (version == 6) {
+ ip6h = data;
+ if ((void *)(ip6h + 1) > data_end ||
+ ip6h->nexthdr != IPPROTO_TCP)
+ return SK_PASS;
+
+ if (ip6h->daddr.in6_u.u6_addr32[0] != SERVER_IP6_0 ||
+ ip6h->daddr.in6_u.u6_addr32[1] != SERVER_IP6_1 ||
+ ip6h->daddr.in6_u.u6_addr32[2] != SERVER_IP6_2 ||
+ ip6h->daddr.in6_u.u6_addr32[3] != SERVER_IP6_3)
+ return SK_PASS;
+
+ tcph = (void *)(ip6h + 1);
+ if ((void *)(tcph + 1) > data_end ||
+ tcph->dest != bpf_htons(SERVER_PORT))
+ return SK_PASS;
+ } else {
return SK_PASS;
+ }
kfunc_ret = bpf_icmp_send_unreach(skb, unreach_code);
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* [PATCH bpf-next v4 6/6] selftests/bpf: add icmp_send_unreach_recursion test
2026-04-20 10:58 ` [PATCH bpf-next v4 0/6] " Mahe Tardy
` (4 preceding siblings ...)
2026-04-20 10:58 ` [PATCH bpf-next v4 5/6] selftests/bpf: add icmp_send_unreach kfunc IPv6 tests Mahe Tardy
@ 2026-04-20 10:58 ` Mahe Tardy
5 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2026-04-20 10:58 UTC (permalink / raw)
To: mahe.tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo
This test is similar to icmp_send_unreach_kfunc but checks that, in case
of recursion, meaning that the BPF program calling the kfunc was
re-triggered by the icmp_send done by the kfunc, the kfunc will stop
early and return -EBUSY.
Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
---
.../bpf/prog_tests/icmp_send_unreach_kfunc.c | 43 +++++++++++++++++++
.../selftests/bpf/progs/icmp_send_unreach.c | 30 +++++++++++++
2 files changed, 73 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
index 047bfd4d80f7..a4f4324b2b99 100644
--- a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
+++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
#include <test_progs.h>
#include <network_helpers.h>
+#include <cgroup_helpers.h>
#include <linux/errqueue.h>
#include "icmp_send_unreach.skel.h"
@@ -10,6 +11,7 @@
#define ICMP_DEST_UNREACH 3
#define ICMPV6_DEST_UNREACH 1
+#define ICMP_HOST_UNREACH 1
#define ICMP_FRAG_NEEDED 4
#define NR_ICMP_UNREACH 15
#define NR_ICMPV6_UNREACH 6
@@ -157,3 +159,44 @@ void test_icmp_send_unreach_kfunc(void)
icmp_send_unreach__destroy(skel);
close(cgroup_fd);
}
+
+void test_icmp_send_unreach_recursion(void)
+{
+ struct icmp_send_unreach *skel;
+ int cgroup_fd = -1;
+ int *code;
+
+ skel = icmp_send_unreach__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "skel_open"))
+ goto cleanup;
+
+ if (setup_cgroup_environment()) {
+ fprintf(stderr, "Failed to setup cgroup environment\n");
+ goto cleanup;
+ }
+
+ cgroup_fd = get_root_cgroup();
+ if (!ASSERT_GE(cgroup_fd, 0, "get_root_cgroup"))
+ goto cleanup;
+
+ skel->links.recursion =
+ bpf_program__attach_cgroup(skel->progs.recursion, cgroup_fd);
+ if (!ASSERT_OK_PTR(skel->links.recursion, "prog_attach_cgroup"))
+ goto cleanup;
+
+ code = &skel->bss->unreach_code;
+ *code = ICMP_HOST_UNREACH;
+
+ trigger_prog_read_icmp_errqueue(code, AF_INET, "127.0.0.1");
+
+ /* Because there's recursion involved, the first call will return at
+ * index 1 since it will return the second, and the second call will
+ * return at index 0 since it will return the first.
+ */
+ ASSERT_EQ(skel->data->rec_kfunc_rets[1], 0, "kfunc_rets[1]");
+ ASSERT_EQ(skel->data->rec_kfunc_rets[0], -EBUSY, "kfunc_rets[0]");
+
+cleanup:
+ icmp_send_unreach__destroy(skel);
+ close(cgroup_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
index 112b9cbfab6f..9aca7c0b12e1 100644
--- a/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
+++ b/tools/testing/selftests/bpf/progs/icmp_send_unreach.c
@@ -15,6 +15,9 @@
int unreach_code = 0;
int kfunc_ret = -1;
+uint rec_count = 0;
+int rec_kfunc_rets[] = { -1, -1 };
+
SEC("cgroup_skb/egress")
int egress(struct __sk_buff *skb)
{
@@ -67,4 +70,31 @@ int egress(struct __sk_buff *skb)
return SK_DROP;
}
+SEC("cgroup_skb/egress")
+int recursion(struct __sk_buff *skb)
+{
+ void *data = (void *)(long)skb->data;
+ void *data_end = (void *)(long)skb->data_end;
+ struct iphdr *iph;
+
+ iph = data;
+ if ((void *)(iph + 1) > data_end || iph->version != 4)
+ return SK_PASS;
+
+ /* This call will provoke a recursion: the ICMP package generated by the
+ * kfunc will re-trigger this program since we are in the root cgroup in
+ * which the kernel ICMP socket belongs. However when re-entering the
+ * kfunc, it should return EBUSY.
+ */
+ rec_kfunc_rets[rec_count & 1] =
+ bpf_icmp_send_unreach(skb, unreach_code);
+ __sync_fetch_and_add(&rec_count, 1);
+
+ /* Let the first ICMP error message pass */
+ if (iph->protocol == IPPROTO_ICMP)
+ return SK_PASS;
+
+ return SK_DROP;
+}
+
char LICENSE[] SEC("license") = "Dual BSD/GPL";
--
2.34.1
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4
2026-04-20 10:58 ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
@ 2026-04-20 11:36 ` bot+bpf-ci
2026-04-20 13:04 ` Mahe Tardy
0 siblings, 1 reply; 40+ messages in thread
From: bot+bpf-ci @ 2026-04-20 11:36 UTC (permalink / raw)
To: mahe.tardy, mahe.tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 2271 bytes --]
> diff --git a/include/net/route.h b/include/net/route.h
> index f90106f383c5..ec2466fd0bec 100644
> --- a/include/net/route.h
> +++ b/include/net/route.h
> @@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
> const struct sock *sk);
> struct dst_entry *ipv4_blackhole_route(struct net *net,
> struct dst_entry *dst_orig);
> +int ip_route_reply_fetch_dst(struct sk_buff *skb);
>
> static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
> {
[ ... ]
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index bc1296f0ea69..7091ef936073 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2945,6 +2945,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
> }
> EXPORT_SYMBOL_GPL(ip_route_output_flow);
>
> +int ip_route_reply_fetch_dst(struct sk_buff *skb)
> +{
> + struct rtable *rt;
> + struct flowi4 fl4 = {
> + .daddr = ip_hdr(skb)->saddr
> + };
> +
> + rt = ip_route_output_key(dev_net(skb->dev), &fl4);
> + if (IS_ERR(rt))
> + return PTR_ERR(rt);
> + skb_dst_set(skb, &rt->dst);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);
Is ip_route_reply_fetch_dst() actually needed before calling icmp_send()?
Martin KaFai Lau raised this question in v3 of the patch series and
noted that icmpv6_send() does its own route lookup and doesn't use
skb_dst() of the original skb. He also pointed out that at
cgroup/egress, skb->_skb_refdst should already be set.
Looking at icmpv6_send(), the fl6.{daddr,saddr} are filled and passed
to icmpv6_route_lookup, which performs the reverse/invert route lookup.
The function doesn't appear to use skb_dst() of the original skb.
Since this patch moves ip_route_reply_fetch_dst() from netfilter to
core specifically for the BPF kfunc, can you clarify why this routing
step is necessary? Does the kfunc not work without
ip[6]_route_reply_fetch_dst()?
Reference: https://lore.kernel.org/bpf/df4b0996-3e88-4ea4-983b-82866455a6fc@linux.dev/
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests
2026-04-20 10:58 ` [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
@ 2026-04-20 11:36 ` bot+bpf-ci
2026-04-20 13:08 ` Mahe Tardy
0 siblings, 1 reply; 40+ messages in thread
From: bot+bpf-ci @ 2026-04-20 11:36 UTC (permalink / raw)
To: mahe.tardy, mahe.tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 1197 bytes --]
> diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> new file mode 100644
> index 000000000000..24d5e01cfe80
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> @@ -0,0 +1,136 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <test_progs.h>
> +#include <network_helpers.h>
> +#include <linux/errqueue.h>
> +#include "icmp_send_unreach.skel.h"
> +
> +#define TIMEOUT_MS 1000
> +#define SRV_PORT 54321
> +
> +#define ICMP_DEST_UNREACH 3
> +
> +#define ICMP_FRAG_NEEDED 4
> +#define NR_ICMP_UNREACH 15
Jordan Rife previously suggested in v3 that these ICMP constants should be
imported from <linux/icmp.h> rather than redefined here. Would it be cleaner
to include the header and avoid the redefinitions?
Reference: https://lore.kernel.org/bpf/usz5bhydsiejr37owgt3zypckzh7fa7ygmhsyaaiprsljx7iy5@ipopnr5n4ds7/
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc
2026-04-20 10:58 ` [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc Mahe Tardy
@ 2026-04-20 11:36 ` bot+bpf-ci
2026-04-20 13:07 ` Mahe Tardy
0 siblings, 1 reply; 40+ messages in thread
From: bot+bpf-ci @ 2026-04-20 11:36 UTC (permalink / raw)
To: mahe.tardy, mahe.tardy
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 3581 bytes --]
> diff --git a/net/core/filter.c b/net/core/filter.c
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
[ ... ]
> @@ -12423,6 +12427,86 @@
> return 0;
> }
>
> +static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
> +
> +/**
> + * bpf_icmp_send_unreach - Send ICMP destination unreachable error
> + * @skb: Packet that triggered the error
> + * @code: ICMP unreachable code (0-15 for IPv4, 0-6 for IPv6)
> + *
> + * Sends an ICMP destination unreachable message in response to the
> + * packet. The original packet is cloned before sending the ICMP error,
> + * so the BPF program can still let the packet pass if desired.
> + *
> + * Recursion protection: If called from a context that would trigger
> + * recursion (e.g., root cgroup processing its own ICMP packets),
> + * returns -EBUSY on re-entry.
> + *
> + * Return: 0 on success, negative error code on failure:
> + * -EINVAL: Invalid code parameter
> + * -ENOMEM: Memory allocation failed
> + * -EHOSTUNREACH: Routing lookup failed
> + * -EBUSY: Recursion detected
> + * -EPROTONOSUPPORT: Non-IP protocol
> + */
> +__bpf_kfunc int bpf_icmp_send_unreach(struct __sk_buff *__skb, int code)
> +{
> + struct sk_buff *skb = (struct sk_buff *)__skb;
> + struct sk_buff *nskb;
> + bool *in_progress;
> +
> + in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
> + if (*in_progress)
> + return -EBUSY;
> +
> + switch (skb->protocol) {
> +#if IS_ENABLED(CONFIG_INET)
> + case htons(ETH_P_IP):
> + if (code < 0 || code > NR_ICMP_UNREACH)
> + return -EINVAL;
> +
> + nskb = skb_clone(skb, GFP_ATOMIC);
> + if (!nskb)
> + return -ENOMEM;
> +
> + if (!skb_dst(nskb) && ip_route_reply_fetch_dst(nskb) < 0) {
> + kfree_skb(nskb);
> + return -EHOSTUNREACH;
> + }
> +
> + *in_progress = true;
> + icmp_send(nskb, ICMP_DEST_UNREACH, code, 0);
> + *in_progress = false;
> + kfree_skb(nskb);
> + break;
> +#endif
> +#if IS_ENABLED(CONFIG_IPV6)
> + case htons(ETH_P_IPV6):
> + if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> + return -EINVAL;
> +
> + nskb = skb_clone(skb, GFP_ATOMIC);
> + if (!nskb)
> + return -ENOMEM;
> +
> + if (!skb_dst(nskb) && ip6_route_reply_fetch_dst(nskb) < 0) {
> + kfree_skb(nskb);
> + return -EHOSTUNREACH;
> + }
Is the route lookup needed here? Looking at icmpv6_send() in
net/ipv6/icmp.c, it fills fl6.{daddr,saddr} and passes them to
icmpv6_route_lookup() which performs its own route lookup. The function
doesn't appear to use skb_dst() from the original skb:
icmpv6_send() {
...
fl6.daddr = hdr->saddr;
fl6.saddr = saddr;
...
dst = icmpv6_route_lookup(net, skb, sk, &fl6);
...
}
The same applies to the IPv4 path with icmp_send(). Additionally, at
cgroup/egress the skb->_skb_refdst should already be set, and the same
should be true for cgroup/ingress for inet proto.
Does the kfunc work without ip_route_reply_fetch_dst() and
ip6_route_reply_fetch_dst()? If so, can these calls be removed?
Reference: Martin KaFai Lau's comment in v3 review
https://lore.kernel.org/bpf/df4b0996-3e88-4ea4-983b-82866455a6fc@linux.dev/
> +
> + *in_progress = true;
> + icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
> + *in_progress = false;
> + kfree_skb(nskb);
> + break;
> +#endif
> + default:
> + return -EPROTONOSUPPORT;
> + }
> +
> + return 0;
> +}
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4
2026-04-20 11:36 ` bot+bpf-ci
@ 2026-04-20 13:04 ` Mahe Tardy
0 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2026-04-20 13:04 UTC (permalink / raw)
To: bot+bpf-ci
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
On Mon, Apr 20, 2026 at 11:36:12AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/include/net/route.h b/include/net/route.h
> > index f90106f383c5..ec2466fd0bec 100644
> > --- a/include/net/route.h
> > +++ b/include/net/route.h
> > @@ -173,6 +173,7 @@ struct rtable *ip_route_output_flow(struct net *, struct flowi4 *flp,
> > const struct sock *sk);
> > struct dst_entry *ipv4_blackhole_route(struct net *net,
> > struct dst_entry *dst_orig);
> > +int ip_route_reply_fetch_dst(struct sk_buff *skb);
> >
> > static inline struct rtable *ip_route_output_key(struct net *net, struct flowi4 *flp)
> > {
>
> [ ... ]
>
> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > index bc1296f0ea69..7091ef936073 100644
> > --- a/net/ipv4/route.c
> > +++ b/net/ipv4/route.c
> > @@ -2945,6 +2945,21 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
> > }
> > EXPORT_SYMBOL_GPL(ip_route_output_flow);
> >
> > +int ip_route_reply_fetch_dst(struct sk_buff *skb)
> > +{
> > + struct rtable *rt;
> > + struct flowi4 fl4 = {
> > + .daddr = ip_hdr(skb)->saddr
> > + };
> > +
> > + rt = ip_route_output_key(dev_net(skb->dev), &fl4);
> > + if (IS_ERR(rt))
> > + return PTR_ERR(rt);
> > + skb_dst_set(skb, &rt->dst);
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(ip_route_reply_fetch_dst);
>
> Is ip_route_reply_fetch_dst() actually needed before calling icmp_send()?
>
> Martin KaFai Lau raised this question in v3 of the patch series and
> noted that icmpv6_send() does its own route lookup and doesn't use
> skb_dst() of the original skb. He also pointed out that at
> cgroup/egress, skb->_skb_refdst should already be set.
>
> Looking at icmpv6_send(), the fl6.{daddr,saddr} are filled and passed
> to icmpv6_route_lookup, which performs the reverse/invert route lookup.
> The function doesn't appear to use skb_dst() of the original skb.
>
> Since this patch moves ip_route_reply_fetch_dst() from netfilter to
> core specifically for the BPF kfunc, can you clarify why this routing
> step is necessary? Does the kfunc not work without
> ip[6]_route_reply_fetch_dst()?
>
> Reference: https://lore.kernel.org/bpf/df4b0996-3e88-4ea4-983b-82866455a6fc@linux.dev/
Yes I replied to this in the cover letter, let me quote it:
Initially, this kfunc was added only to cgroup_skb programs, Alexei
suggested not creating its own kfunc set and adding it to the more
global bpf_kfunc_set_skb. Now that recursion is handled and I realized,
thanks to Martin, that fetching the dst route might be only useful in
situation in which the packet was not yet routed, I decided to extend
the kfunc to more program types and route the packet only if needed.
So in short, if we decide to go back to cgroup_skb only, we can remove
this, if we expand to other program types like we do, it still makes
sense to do it.
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc
2026-04-20 11:36 ` bot+bpf-ci
@ 2026-04-20 13:07 ` Mahe Tardy
0 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2026-04-20 13:07 UTC (permalink / raw)
To: bot+bpf-ci
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
On Mon, Apr 20, 2026 at 11:36:15AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
>
> [ ... ]
>
> > +#if IS_ENABLED(CONFIG_IPV6)
> > + case htons(ETH_P_IPV6):
> > + if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > + return -EINVAL;
> > +
> > + nskb = skb_clone(skb, GFP_ATOMIC);
> > + if (!nskb)
> > + return -ENOMEM;
> > +
> > + if (!skb_dst(nskb) && ip6_route_reply_fetch_dst(nskb) < 0) {
> > + kfree_skb(nskb);
> > + return -EHOSTUNREACH;
> > + }
>
> Is the route lookup needed here? Looking at icmpv6_send() in
> net/ipv6/icmp.c, it fills fl6.{daddr,saddr} and passes them to
> icmpv6_route_lookup() which performs its own route lookup. The function
> doesn't appear to use skb_dst() from the original skb:
>
> icmpv6_send() {
> ...
> fl6.daddr = hdr->saddr;
> fl6.saddr = saddr;
> ...
> dst = icmpv6_route_lookup(net, skb, sk, &fl6);
> ...
> }
>
> The same applies to the IPv4 path with icmp_send(). Additionally, at
> cgroup/egress the skb->_skb_refdst should already be set, and the same
> should be true for cgroup/ingress for inet proto.
>
> Does the kfunc work without ip_route_reply_fetch_dst() and
> ip6_route_reply_fetch_dst()? If so, can these calls be removed?
>
> Reference: Martin KaFai Lau's comment in v3 review
> https://lore.kernel.org/bpf/df4b0996-3e88-4ea4-983b-82866455a6fc@linux.dev/
Same reply as https://lore.kernel.org/bpf/aeYkdKm7B4NQ3BDo@gmail.com/,
too bad the LLM can't access the cover letter.
>
> > +
> > + *in_progress = true;
> > + icmpv6_send(nskb, ICMPV6_DEST_UNREACH, code, 0);
> > + *in_progress = false;
> > + kfree_skb(nskb);
> > + break;
> > +#endif
> > + default:
> > + return -EPROTONOSUPPORT;
> > + }
> > +
> > + return 0;
> > +}
>
> [ ... ]
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests
2026-04-20 11:36 ` bot+bpf-ci
@ 2026-04-20 13:08 ` Mahe Tardy
0 siblings, 0 replies; 40+ messages in thread
From: Mahe Tardy @ 2026-04-20 13:08 UTC (permalink / raw)
To: bot+bpf-ci
Cc: alexei.starovoitov, andrii, ast, bpf, coreteam, daniel, fw,
john.fastabend, lkp, martin.lau, netdev, netfilter-devel,
oe-kbuild-all, pablo, martin.lau, eddyz87, yonghong.song, clm,
ihor.solodrai
On Mon, Apr 20, 2026 at 11:36:13AM +0000, bot+bpf-ci@kernel.org wrote:
> > diff --git a/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> > new file mode 100644
> > index 000000000000..24d5e01cfe80
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/prog_tests/icmp_send_unreach_kfunc.c
> > @@ -0,0 +1,136 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <test_progs.h>
> > +#include <network_helpers.h>
> > +#include <linux/errqueue.h>
> > +#include "icmp_send_unreach.skel.h"
> > +
> > +#define TIMEOUT_MS 1000
> > +#define SRV_PORT 54321
> > +
> > +#define ICMP_DEST_UNREACH 3
> > +
> > +#define ICMP_FRAG_NEEDED 4
> > +#define NR_ICMP_UNREACH 15
>
> Jordan Rife previously suggested in v3 that these ICMP constants should be
> imported from <linux/icmp.h> rather than redefined here. Would it be cleaner
> to include the header and avoid the redefinitions?
Yes it would be cleaner, but looks like I can't do it simply, maybe
there's a solution, see the answer from the cover-letter:
to Jordan: I couldn't include <linux/icmp.h> because of redefines from
<network_helpers.h>.
>
> Reference: https://lore.kernel.org/bpf/usz5bhydsiejr37owgt3zypckzh7fa7ygmhsyaaiprsljx7iy5@ipopnr5n4ds7/
>
> [ ... ]
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/24663313503
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2026-04-20 13:08 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <CAADnVQKq_-=N7eJoup6AqFngoocT+D02NF0md_3mi2Vcrw09nQ@mail.gmail.com>
2025-07-25 18:53 ` [PATCH bpf-next v1 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
2025-07-27 1:49 ` kernel test robot
2025-07-28 9:43 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Mahe Tardy
2025-07-28 9:43 ` [PATCH bpf-next v3 1/4] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2025-07-28 9:43 ` [PATCH bpf-next v3 2/4] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
2025-07-28 9:43 ` [PATCH bpf-next v3 3/4] bpf: add bpf_icmp_send_unreach cgroup_skb kfunc Mahe Tardy
2025-07-28 20:10 ` kernel test robot
2025-07-29 1:05 ` Martin KaFai Lau
2025-07-29 10:06 ` Mahe Tardy
2025-07-29 23:13 ` Martin KaFai Lau
2025-07-28 9:43 ` [PATCH bpf-next v3 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
2025-07-28 15:40 ` Yonghong Song
2025-07-28 15:59 ` Mahe Tardy
2025-07-29 1:18 ` Martin KaFai Lau
2025-07-29 9:09 ` Mahe Tardy
2025-07-29 23:27 ` Martin KaFai Lau
2025-07-30 0:01 ` Martin KaFai Lau
2025-07-30 0:32 ` Martin KaFai Lau
2025-08-05 23:26 ` Jordan Rife
2025-07-29 1:21 ` [PATCH bpf-next v3 0/4] bpf: add icmp_send_unreach kfunc Martin KaFai Lau
2025-07-29 9:53 ` Mahe Tardy
2025-07-30 1:54 ` Martin KaFai Lau
2025-08-01 18:50 ` Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 0/6] " Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 1/6] net: move netfilter nf_reject_fill_skb_dst to core ipv4 Mahe Tardy
2026-04-20 11:36 ` bot+bpf-ci
2026-04-20 13:04 ` Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 2/6] net: move netfilter nf_reject6_fill_skb_dst to core ipv6 Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 3/6] bpf: add bpf_icmp_send_unreach kfunc Mahe Tardy
2026-04-20 11:36 ` bot+bpf-ci
2026-04-20 13:07 ` Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 4/6] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
2026-04-20 11:36 ` bot+bpf-ci
2026-04-20 13:08 ` Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 5/6] selftests/bpf: add icmp_send_unreach kfunc IPv6 tests Mahe Tardy
2026-04-20 10:58 ` [PATCH bpf-next v4 6/6] selftests/bpf: add icmp_send_unreach_recursion test Mahe Tardy
2025-07-25 18:53 ` [PATCH bpf-next v2 4/4] selftests/bpf: add icmp_send_unreach kfunc tests Mahe Tardy
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox