Re: [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Stanislav Fomichev <sdf.kernel@gmail.com>
To: Mahe Tardy <mahe.tardy@gmail.com>
Cc: bpf@vger.kernel.org, andrii@kernel.org, ast@kernel.org,
	 daniel@iogearbox.net, john.fastabend@gmail.com, jordan@jrife.io,
	martin.lau@linux.dev,  yonghong.song@linux.dev,
	emil@etsalapatis.com, netdev@vger.kernel.org,
	 edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	davem@davemloft.net,  horms@kernel.org
Subject: Re: [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc
Date: Fri, 26 Jun 2026 09:18:39 -0700	[thread overview]
Message-ID: <aj6kdnfAB0LJKDcR@devvm7509.cco0.facebook.com> (raw)
In-Reply-To: <aj1b_z6h5xn42Hxe@gmail.com>

On 06/25, Mahe Tardy wrote:
> On Thu, Jun 25, 2026 at 09:24:59AM -0700, Stanislav Fomichev wrote:
> > On 06/25, Mahe Tardy wrote:
> 
> [...]
> 
> > > +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> > > +{
> > > +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > > +	struct sk_buff *nskb;
> > > +	struct sock *sk;
> > > +
> > > +	sk = skb_to_full_sk(skb);
> > > +	if (sk && sk->sk_kern_sock &&
> > > +	    (sk->sk_protocol == IPPROTO_ICMP || sk->sk_protocol == IPPROTO_ICMPV6))
> > > +		return -EBUSY;
> > > +
> > > +	switch (skb->protocol) {
> > > +#if IS_ENABLED(CONFIG_INET)
> > > +	case htons(ETH_P_IP): {
> > > +		if (type != ICMP_DEST_UNREACH)
> > > +			return -EOPNOTSUPP;
> > > +		if (code < 0 || code > NR_ICMP_UNREACH ||
> > > +		    code == ICMP_FRAG_NEEDED) /* needs a valid next-hop MTU */
> > > +			return -EINVAL;
> > > +
> > > +		/* icmp_send expects skb_dst to be a real rtable. */
> > > +		if (!skb_valid_dst(skb))
> > > +			return -ENETUNREACH;
> > > +
> > > +		nskb = skb_clone(skb, GFP_ATOMIC);
> > > +		if (!nskb)
> > > +			return -ENOMEM;
> > > +
> > > +		memset(IPCB(nskb), 0, sizeof(*IPCB(nskb)));
> > > +		icmp_send(nskb, type, code, 0);
> > > +		consume_skb(nskb);
> > > +		break;
> > > +	}
> > > +#endif
> > > +#if IS_ENABLED(CONFIG_IPV6)
> > > +	case htons(ETH_P_IPV6):
> > > +		if (type != ICMPV6_DEST_UNREACH)
> > > +			return -EOPNOTSUPP;
> > > +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > > +			return -EINVAL;
> > 
> > [..]
> > 
> > > +		/* icmpv6_send may treat skb_dst as rt6_info. */
> > > +		if (skb_metadata_dst(skb))
> > > +			return -ENETUNREACH;
> > 
> > A bit confused about this. Which part of icmpv6_send treats skb_dst as rt6_info?
> > (I see the original sashiko report about dst, but icmp6 seems to be not
> > requiring it)
> 
> Yeah I was also a bit confused because this came out of nowhere as soon
> as I put the skb_valid_dst only on the IPv4 path (for different
> reasons), but there is actually a potential trace in which we have type
> confusion indeed:
> 
> - icmp6_send() checks scoped source addresses and calls icmp6_iif() at net/ipv6/icmp.c:702
> - icmp6_iif() calls icmp6_dev() at net/ipv6/icmp.c:441
> - icmp6_dev() does skb_rt6_info(skb) for loopback/L3 master devices at net/ipv6/icmp.c:428
> - skb_rt6_info() casts any non-NULL dst to struct rt6_info at include/net/ip6_route.h:233
> - rt6->rt6i_idev is then dereferenced at net/ipv6/icmp.c:434
> 
> When checking with pahole, we can find this on my local kernel:
> 
> struct rt6_info {
> 	struct dst_entry           dst;                  /*     0   136 */
> 	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> 	struct fib6_info *         from;                 /*   136     8 */
> 	int                        sernum;               /*   144     4 */
> 	struct rt6key              rt6i_dst;             /*   148    20 */
> 	struct rt6key              rt6i_src;             /*   168    20 */
> 	struct in6_addr            rt6i_gateway;         /*   188    16 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	/* --- cacheline 3 boundary (192 bytes) was 16 bytes ago --- */
> 	struct inet6_dev *         rt6i_idev;            /*   208     8 */  <--- we dereference this
> 	u32                        rt6i_flags;           /*   216     4 */
> 	short unsigned int         rt6i_nfheader_len;    /*   220     2 */
> 
> 	/* size: 224, cachelines: 4, members: 9 */
> 	/* sum members: 218, holes: 1, sum holes: 4 */
> 	/* padding: 2 */
> 	/* last cacheline: 32 bytes */
> };
> 
> And the metadata_dst would look like this:
> 
> struct metadata_dst {
> 	struct dst_entry           dst;                  /*     0   136 */
> 	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> 	enum metadata_type         type;                 /*   136     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	union {
> 		struct ip_tunnel_info tun_info;          /*   144    96 */
> 		struct hw_port_info port_info;           /*   144    16 */
> 		struct macsec_info macsec_info;          /*   144     8 */
> 		struct xfrm_md_info xfrm_info;           /*   144    16 */
> 	} u;                                             /*   144    96 */  <--- we land on this union
> 
> 	/* size: 240, cachelines: 4, members: 3 */
> 	/* sum members: 236, holes: 1, sum holes: 4 */
> 	/* last cacheline: 48 bytes */
> };
> 
> Let's say it's a struct ip_tunnel_info:
> 
> struct ip_tunnel_info {
> 	struct ip_tunnel_key       key;                  /*     0    64 */
> 
> 	/* XXX last struct has 7 bytes of padding */
> 
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	struct ip_tunnel_encap     encap;                /*    64     8 */  <--- 144 + 64 = 208 we land here
> 	struct dst_cache           dst_cache;            /*    72    16 */
> 	u8                         options_len;          /*    88     1 */
> 	u8                         mode;                 /*    89     1 */
> 
> 	/* size: 96, cachelines: 2, members: 5 */
> 	/* padding: 6 */
> 	/* paddings: 1, sum paddings: 7 */
> 	/* last cacheline: 32 bytes */
> };
> 
> So I imagine this is fairly tricky to trigger but still a case of type
> confusion. I have actually no idea how likely this can happen from my
> call but the trace makes sense at least.

That logic seems to exist for the icmp6_send to find the input device
(since the expected use-case for calling icmp6_send is to the incoming
skb). And since you're mainly doing egress, I don't think this path will
ever trigger (iow the check is not needed)?

Maybe you can add cgroup_ingress test case? Looks like this rt6_info
path might trigger for ipv6 lo? I don't see any ingress test in your
series, so might be good to have one regardless?

next prev parent reply	other threads:[~2026-06-26 16:18 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-25 11:03 [PATCH bpf-next v10 0/5] bpf: add icmp_send kfunc Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc Mahe Tardy
2026-06-25 16:24   ` Stanislav Fomichev
2026-06-25 16:49     ` Mahe Tardy
2026-06-26 16:18       ` Stanislav Fomichev [this message]
2026-06-25 11:03 ` [PATCH bpf-next v10 2/5] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 3/5] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb IPv6 tests Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 4/5] selftests/bpf: add bpf_icmp_send recursion test Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 5/5] selftests/bpf: add bpf_icmp_send no route test Mahe Tardy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aj6kdnfAB0LJKDcR@devvm7509.cco0.facebook.com \
    --to=sdf.kernel@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=emil@etsalapatis.com \
    --cc=horms@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=jordan@jrife.io \
    --cc=kuba@kernel.org \
    --cc=mahe.tardy@gmail.com \
    --cc=martin.lau@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.