From: Mahe Tardy <mahe.tardy@gmail.com>
To: Jordan Rife <jordan@jrife.io>
Cc: bpf@vger.kernel.org, martin.lau@linux.dev, daniel@iogearbox.net,
	john.fastabend@gmail.com, ast@kernel.org, andrii@kernel.org,
	yonghong.song@linux.dev, netdev@vger.kernel.org,
	netfilter-devel@vger.kernel.org, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com
Subject: Re: [PATCH bpf-next v6 3/6] bpf: add bpf_icmp_send kfunc
Date: Wed, 20 May 2026 20:48:33 +0200
Message-ID: <ag4CAVec9jPCAuD0@gmail.com>
In-Reply-To: <onco52d3vpxkcc6hh3s5vuqjxanasucteq7wnqfqgzg4d65alc@q7h22nu3ytjn>

On Mon, May 18, 2026 at 06:33:45PM -0700, Jordan Rife wrote:
> On Mon, May 18, 2026 at 12:28:39PM +0000, Mahe Tardy wrote:
> > This is needed in the context of Tetragon to provide improved feedback
> > (in contrast to just dropping packets) to east-west traffic when blocked
> > by policies using cgroup_skb programs. We also extend this kfunc to tc
> > programs as a convenience.
> > 
> > This reuses concepts from netfilter reject target codepath with the
> > differences that:
> > * Packets are cloned since the BPF user can still let the packet pass
> >   (SK_PASS from the cgroup_skb progs for example) and the current skb
> >   needs to stay untouched (cgroup_skb hooks only allow read-only skb
> >   payload).
> > * We protect against recursion since the kfunc, by generating an ICMP
> >   error message, could retrigger the BPF prog that invoked it.
> > 
> > For now, we support cgroup_skb and tc program types. For cgroup_skb and
> > tc egress, almost everything should be good. However for tc ingress:
> > - packet will not be routed yet: need to set the net device for
> >   icmp_send, thus the call to ip[6]_route_reply_fill_dst.
> > - fragments could trigger the hook: icmp_send will only reply to fragment 0.
> > - ensure the IP header is linearized before processing, and zero out
> >   the SKB control block after cloning to prevent icmp_send()/icmpv6_send()
> >   from misinterpreting garbage data as IP options.
> > 
> > Only ICMP_DEST_UNREACH and ICMPV6_DEST_UNREACH are currently supported.
> > The interface accepts a type parameter to facilitate future extension to
> > other ICMP control message types.
> > 
> > Signed-off-by: Mahe Tardy <mahe.tardy@gmail.com>
> > ---
> >  net/core/filter.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 118 insertions(+)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 9590877b0714..843fa775596b 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -84,6 +84,8 @@
> >  #include <linux/un.h>
> >  #include <net/xdp_sock_drv.h>
> >  #include <net/inet_dscp.h>
> > +#include <linux/icmpv6.h>
> > +#include <net/icmp.h>
> > 
> >  #include "dev.h"
> > 
> > @@ -12464,6 +12466,110 @@ __bpf_kfunc int bpf_xdp_pull_data(struct xdp_md *x, u32 len)
> >  	return 0;
> >  }
> > 
> > +static DEFINE_PER_CPU(bool, bpf_icmp_send_in_progress);
> > +
> > +/**
> > + * bpf_icmp_send - Send an ICMP control message
> > + * @skb_ctx: Packet that triggered the control message
> > + * @type: ICMP type (only ICMP_DEST_UNREACH/ICMPV6_DEST_UNREACH supported)
> > + * @code: ICMP code (0-15 for IPv4, 0-6 for IPv6)
> > + *
> > + * Sends an ICMP control message in response to the packet. The original packet
> > + * is cloned before sending the ICMP message, so the BPF program can still let
> > + * the packet pass if desired.
> > + *
> > + * Currently only ICMP_DEST_UNREACH (IPv4) and ICMPV6_DEST_UNREACH (IPv6) are
> > + * supported.
> > + *
> > + * Recursion protection: If called from a context that would trigger recursion
> > + * (e.g., root cgroup processing its own ICMP packets), returns -EBUSY on
> > + * re-entry.
> > + *
> > + * Return: 0 on success, negative error code on failure:
> > + *         -EINVAL: Invalid code parameter
> > + *         -EBADMSG: Packet too short or malformed
> > + *         -ENOMEM: Memory allocation failed
> > + *         -EBUSY: Recursion detected
> > + *         -EHOSTUNREACH: Routing failed
> > + *         -EPROTONOSUPPORT: Non-IP protocol
> > + *         -EOPNOTSUPP: Unsupported ICMP type
> > + */
> > +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> > +{
> > +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > +	struct sk_buff *nskb;
> > +	bool *in_progress;
> > +
> > +	in_progress = this_cpu_ptr(&bpf_icmp_send_in_progress);
> > +	if (*in_progress)
> > +		return -EBUSY;
> > +
> > +	switch (skb->protocol) {
> > +#if IS_ENABLED(CONFIG_INET)
> > +	case htons(ETH_P_IP):
> > +		if (type != ICMP_DEST_UNREACH)
> > +			return -EOPNOTSUPP;
> > +		if (code < 0 || code > NR_ICMP_UNREACH)
> > +			return -EINVAL;
> > +
> > +		nskb = skb_clone(skb, GFP_ATOMIC);
> > +		if (!nskb)
> > +			return -ENOMEM;
> > +
> > +		if (!pskb_network_may_pull(nskb, sizeof(struct iphdr))) {
> > +			kfree_skb(nskb);
> > +			return -EBADMSG;
> 
> nit: Instead of having several places where you call kfree_skb, maybe
> consider just cleaning up in one place at the end like:
> 
> out:
> 	if (nskb)
> 		kfree_skb(nskb);
> 	return err;
> 	
> then in places like this do something like:
> 
> 	err = -EBADMSG;
> 	goto out;

Yep, I see. However, if I follow Stanislav's recommendation to use
consume_skb in the success path[^1], I think it would be simpler to keep
kfree_skb and a direct return in each of the error paths.

[^1]: https://lore.kernel.org/bpf/ags3HARTFYwKU8nR@devvm7509.cco0.facebook.com/
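
A rough userspace sketch of that shape, only to illustrate the control flow:
the fake_* helpers and struct fake_skb are hypothetical stand-ins for
kfree_skb, consume_skb, skb_clone and struct sk_buff, and the two counters
model the fact that kfree_skb accounts the skb as dropped while consume_skb
does not. Once the success path frees differently from the error paths, a
single "out:" label would need to pick between the two anyway.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct sk_buff (not the kernel type). */
struct fake_skb {
	bool header_ok;	/* models pskb_network_may_pull() succeeding */
	bool routable;	/* models the route lookup succeeding */
};

/* Counters modelling "freed as a drop" vs "freed after normal use". */
static int dropped;
static int consumed;

static void fake_kfree_skb(struct fake_skb *skb)
{
	dropped++;
	free(skb);
}

static void fake_consume_skb(struct fake_skb *skb)
{
	consumed++;
	free(skb);
}

static struct fake_skb *fake_skb_clone(const struct fake_skb *skb)
{
	struct fake_skb *nskb = malloc(sizeof(*nskb));

	if (nskb)
		*nskb = *skb;
	return nskb;
}

/* Error paths free with the "drop" helper and return directly;
 * only the success path uses the "consume" helper.
 */
static int fake_icmp_send_path(const struct fake_skb *skb)
{
	struct fake_skb *nskb;

	assert(skb != NULL);

	nskb = fake_skb_clone(skb);
	if (!nskb)
		return -ENOMEM;

	if (!nskb->header_ok) {
		fake_kfree_skb(nskb);
		return -EBADMSG;
	}

	if (!nskb->routable) {
		fake_kfree_skb(nskb);
		return -EHOSTUNREACH;
	}

	/* The real code would call icmp_send()/icmpv6_send() here. */
	fake_consume_skb(nskb);
	return 0;
}
```

Calling fake_icmp_send_path() on a well-formed, routable fake_skb bumps only
the consumed counter; the two failure cases bump only the dropped counter.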

> 
> > +		}
> > +
> > +		if (!skb_dst(nskb) && ip_route_reply_fill_dst(nskb) < 0) {
> > +			kfree_skb(nskb);
> > +			return -EHOSTUNREACH;
> > +		}
> > +
> > +		memset(IPCB(nskb), 0, sizeof(struct inet_skb_parm));
> > +
> > +		*in_progress = true;
> > +		icmp_send(nskb, type, code, 0);
> > +		*in_progress = false;
> > +		kfree_skb(nskb);
> > +		break;
> > +#endif
> > +#if IS_ENABLED(CONFIG_IPV6)
> > +	case htons(ETH_P_IPV6):
> > +		if (type != ICMPV6_DEST_UNREACH)
> > +			return -EOPNOTSUPP;
> > +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > +			return -EINVAL;
> > +
> > +		nskb = skb_clone(skb, GFP_ATOMIC);
> > +		if (!nskb)
> > +			return -ENOMEM;
> > +
> > +		if (!pskb_network_may_pull(nskb, sizeof(struct ipv6hdr))) {
> > +			kfree_skb(nskb);
> > +			return -EBADMSG;
> > +		}
> > +
> > +		if (!skb_dst(nskb) && ip6_route_reply_fill_dst(nskb) < 0) {
> > +			kfree_skb(nskb);
> > +			return -EHOSTUNREACH;
> > +		}
> > +
> > +		memset(IP6CB(nskb), 0, sizeof(struct inet6_skb_parm));
> > +
> > +		*in_progress = true;
> > +		icmpv6_send(nskb, type, code, 0);
> > +		*in_progress = false;
> > +		kfree_skb(nskb);
> > +		break;
> > +#endif
> > +	default:
> > +		return -EPROTONOSUPPORT;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  __bpf_kfunc_end_defs();
> > 
> >  int bpf_dynptr_from_skb_rdonly(struct __sk_buff *skb, u64 flags,
> > @@ -12506,6 +12612,10 @@ BTF_KFUNCS_START(bpf_kfunc_check_set_sock_ops)
> >  BTF_ID_FLAGS(func, bpf_sock_ops_enable_tx_tstamp)
> >  BTF_KFUNCS_END(bpf_kfunc_check_set_sock_ops)
> > 
> > +BTF_KFUNCS_START(bpf_kfunc_check_set_icmp_send)
> > +BTF_ID_FLAGS(func, bpf_icmp_send)
> > +BTF_KFUNCS_END(bpf_kfunc_check_set_icmp_send)
> > +
> >  static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
> >  	.owner = THIS_MODULE,
> >  	.set = &bpf_kfunc_check_set_skb,
> > @@ -12536,6 +12646,11 @@ static const struct btf_kfunc_id_set bpf_kfunc_set_sock_ops = {
> >  	.set = &bpf_kfunc_check_set_sock_ops,
> >  };
> > 
> > +static const struct btf_kfunc_id_set bpf_kfunc_set_icmp_send = {
> > +	.owner = THIS_MODULE,
> > +	.set = &bpf_kfunc_check_set_icmp_send,
> > +};
> > +
> >  static int __init bpf_kfunc_init(void)
> >  {
> >  	int ret;
> > @@ -12557,6 +12672,9 @@ static int __init bpf_kfunc_init(void)
> >  	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
> >  					       &bpf_kfunc_set_sock_addr);
> >  	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_tcp_reqsk);
> > +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_icmp_send);
> > +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_icmp_send);
> 
> Thanks, this could come in handy for TC.
> 
> I'm not quite sure yet on using it in lieu of the sock_destroy kfunc for
> the UDP connected socket use case we discussed at LSFMMBPF. For socket
> LB mode in Cilium to make this work you'd need to add at least one new
> map lookup in the fast path to check for backend liveness and this
> partially defeats the performance benefits of socket LB which right
> now avoids service + backend lookups in the fast path for connected UDP.
> Ultimately, it might be better to stick with sock_destroy to kill
> sockets out-of-band for that use case, but still it's good to have this
> option.
> 
> > +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &bpf_kfunc_set_icmp_send);
> >  	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCK_OPS, &bpf_kfunc_set_sock_ops);
> >  }
> >  late_initcall(bpf_kfunc_init);
> > --
> > 2.34.1
> > 
> 
> Jordan

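For reference, the recursion protection in the quoted patch boils down to a
flag check around the send. A minimal userspace sketch, assuming a
thread-local variable as a stand-in for the per-CPU flag; the model_* names
are hypothetical and none of this is kernel code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace stand-in for the per-CPU bpf_icmp_send_in_progress flag. */
static _Thread_local bool send_in_progress;

/* Result of the re-entrant call made from the hook; 1 means "not run yet". */
static int nested_ret = 1;

static int model_icmp_send(void);

/* Models the BPF program being triggered again by the generated ICMP
 * error and trying to send another one.
 */
static void model_hook(void)
{
	/* In this model the hook only runs while a send is in flight. */
	assert(send_in_progress);
	nested_ret = model_icmp_send();
}

static int model_icmp_send(void)
{
	if (send_in_progress)
		return -EBUSY;	/* re-entry detected, refuse to recurse */

	send_in_progress = true;
	model_hook();		/* the generated packet hits the hook again */
	send_in_progress = false;

	return 0;
}
```

In this model the outer call returns 0, the nested call made from the hook
gets -EBUSY, and a later top-level call succeeds again because the flag is
cleared on the way out.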