Re: [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc

Netdev List

From: Stanislav Fomichev <sdf.kernel@gmail.com>
To: Mahe Tardy <mahe.tardy@gmail.com>
Cc: bpf@vger.kernel.org, andrii@kernel.org, ast@kernel.org,
	daniel@iogearbox.net, john.fastabend@gmail.com, jordan@jrife.io,
	martin.lau@linux.dev, yonghong.song@linux.dev,
	emil@etsalapatis.com, netdev@vger.kernel.org,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	davem@davemloft.net, horms@kernel.org
Subject: Re: [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc
Date: Fri, 26 Jun 2026 09:18:39 -0700
Message-ID: <aj6kdnfAB0LJKDcR@devvm7509.cco0.facebook.com>
In-Reply-To: <aj1b_z6h5xn42Hxe@gmail.com>

On 06/25, Mahe Tardy wrote:
> On Thu, Jun 25, 2026 at 09:24:59AM -0700, Stanislav Fomichev wrote:
> > On 06/25, Mahe Tardy wrote:
> 
> [...]
> 
> > > +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> > > +{
> > > +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > > +	struct sk_buff *nskb;
> > > +	struct sock *sk;
> > > +
> > > +	sk = skb_to_full_sk(skb);
> > > +	if (sk && sk->sk_kern_sock &&
> > > +	    (sk->sk_protocol == IPPROTO_ICMP || sk->sk_protocol == IPPROTO_ICMPV6))
> > > +		return -EBUSY;
> > > +
> > > +	switch (skb->protocol) {
> > > +#if IS_ENABLED(CONFIG_INET)
> > > +	case htons(ETH_P_IP): {
> > > +		if (type != ICMP_DEST_UNREACH)
> > > +			return -EOPNOTSUPP;
> > > +		if (code < 0 || code > NR_ICMP_UNREACH ||
> > > +		    code == ICMP_FRAG_NEEDED) /* needs a valid next-hop MTU */
> > > +			return -EINVAL;
> > > +
> > > +		/* icmp_send expects skb_dst to be a real rtable. */
> > > +		if (!skb_valid_dst(skb))
> > > +			return -ENETUNREACH;
> > > +
> > > +		nskb = skb_clone(skb, GFP_ATOMIC);
> > > +		if (!nskb)
> > > +			return -ENOMEM;
> > > +
> > > +		memset(IPCB(nskb), 0, sizeof(*IPCB(nskb)));
> > > +		icmp_send(nskb, type, code, 0);
> > > +		consume_skb(nskb);
> > > +		break;
> > > +	}
> > > +#endif
> > > +#if IS_ENABLED(CONFIG_IPV6)
> > > +	case htons(ETH_P_IPV6):
> > > +		if (type != ICMPV6_DEST_UNREACH)
> > > +			return -EOPNOTSUPP;
> > > +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > > +			return -EINVAL;
> > 
> > [..]
> > 
> > > +		/* icmpv6_send may treat skb_dst as rt6_info. */
> > > +		if (skb_metadata_dst(skb))
> > > +			return -ENETUNREACH;
> > 
> > A bit confused about this. Which part of icmpv6_send treats skb_dst as rt6_info?
> > (I see the original sashiko report about dst, but icmp6 seems to be not
> > requiring it)
> 
> Yeah, I was also a bit confused because this came up out of nowhere as
> soon as I restricted the skb_valid_dst check to the IPv4 path (for
> different reasons), but there is indeed a potential trace that leads to
> type confusion:
> 
> - icmp6_send() checks scoped source addresses and calls icmp6_iif() at net/ipv6/icmp.c:702
> - icmp6_iif() calls icmp6_dev() at net/ipv6/icmp.c:441
> - icmp6_dev() does skb_rt6_info(skb) for loopback/L3 master devices at net/ipv6/icmp.c:428
> - skb_rt6_info() casts any non-NULL dst to struct rt6_info at include/net/ip6_route.h:233
> - rt6->rt6i_idev is then dereferenced at net/ipv6/icmp.c:434
> 
> Checking with pahole on my local kernel gives this:
> 
> struct rt6_info {
> 	struct dst_entry           dst;                  /*     0   136 */
> 	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> 	struct fib6_info *         from;                 /*   136     8 */
> 	int                        sernum;               /*   144     4 */
> 	struct rt6key              rt6i_dst;             /*   148    20 */
> 	struct rt6key              rt6i_src;             /*   168    20 */
> 	struct in6_addr            rt6i_gateway;         /*   188    16 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	/* --- cacheline 3 boundary (192 bytes) was 16 bytes ago --- */
> 	struct inet6_dev *         rt6i_idev;            /*   208     8 */  <--- we dereference this
> 	u32                        rt6i_flags;           /*   216     4 */
> 	short unsigned int         rt6i_nfheader_len;    /*   220     2 */
> 
> 	/* size: 224, cachelines: 4, members: 9 */
> 	/* sum members: 218, holes: 1, sum holes: 4 */
> 	/* padding: 2 */
> 	/* last cacheline: 32 bytes */
> };
> 
> And the metadata_dst would look like this:
> 
> struct metadata_dst {
> 	struct dst_entry           dst;                  /*     0   136 */
> 	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> 	enum metadata_type         type;                 /*   136     4 */
> 
> 	/* XXX 4 bytes hole, try to pack */
> 
> 	union {
> 		struct ip_tunnel_info tun_info;          /*   144    96 */
> 		struct hw_port_info port_info;           /*   144    16 */
> 		struct macsec_info macsec_info;          /*   144     8 */
> 		struct xfrm_md_info xfrm_info;           /*   144    16 */
> 	} u;                                             /*   144    96 */  <--- we land on this union
> 
> 	/* size: 240, cachelines: 4, members: 3 */
> 	/* sum members: 236, holes: 1, sum holes: 4 */
> 	/* last cacheline: 48 bytes */
> };
> 
> Let's say it's a struct ip_tunnel_info:
> 
> struct ip_tunnel_info {
> 	struct ip_tunnel_key       key;                  /*     0    64 */
> 
> 	/* XXX last struct has 7 bytes of padding */
> 
> 	/* --- cacheline 1 boundary (64 bytes) --- */
> 	struct ip_tunnel_encap     encap;                /*    64     8 */  <--- 144 + 64 = 208 we land here
> 	struct dst_cache           dst_cache;            /*    72    16 */
> 	u8                         options_len;          /*    88     1 */
> 	u8                         mode;                 /*    89     1 */
> 
> 	/* size: 96, cachelines: 2, members: 5 */
> 	/* padding: 6 */
> 	/* paddings: 1, sum paddings: 7 */
> 	/* last cacheline: 32 bytes */
> };
> 
> So I imagine this is fairly tricky to trigger, but it is still a case of
> type confusion. I actually have no idea how likely this is to happen from
> my call, but the trace makes sense at least.

That logic seems to exist so that icmp6_send can find the input device
(the expected use case for icmp6_send is replying to an incoming skb).
Since you're mainly doing egress, I don't think this path will ever
trigger, so the check may not be needed?

Maybe you can add a cgroup_ingress test case? It looks like this rt6_info
path might trigger for IPv6 over lo. I don't see any ingress test in your
series, so it would be good to have one regardless.
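
For reference, the offset arithmetic in the quoted pahole output
(144 + 64 = 208) can be double-checked with a small userspace sketch.
Everything below is a hypothetical stand-in: the *_mock structs are opaque
byte arrays padded to the offsets quoted above, not the kernel definitions,
and the two helper names are made up for illustration. The offsets come
from that one local build, so other configs or kernel versions may lay
things out differently.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */

/*
 * Hypothetical stand-ins, NOT the kernel structs. Fields that do not
 * matter here are collapsed into byte arrays sized to reproduce the
 * offsets from the quoted pahole output.
 */
struct dst_entry_mock {
	unsigned char bytes[136];          /* struct dst_entry: 136 bytes */
};

struct rt6_info_mock {
	struct dst_entry_mock dst;         /* offset   0 */
	unsigned char before_idev[72];     /* from..rt6i_gateway plus the hole: 136..207 */
	void *rt6i_idev;                   /* offset 208, the field icmp6_dev() reads */
};

struct ip_tunnel_info_mock {
	unsigned char key[64];             /* offset  0 */
	unsigned char encap[8];            /* offset 64 */
	unsigned char rest[24];            /* dst_cache, options_len, mode, padding */
};

struct metadata_dst_mock {
	struct dst_entry_mock dst;         /* offset   0 */
	unsigned char type[4];             /* offset 136 */
	unsigned char hole[4];             /* the 4-byte hole pahole reports */
	union {
		struct ip_tunnel_info_mock tun_info;
	} u;                               /* offset 144 */
};

/* Offset read when the dst is treated as an rt6_info. */
static inline size_t rt6i_idev_offset(void)
{
	return offsetof(struct rt6_info_mock, rt6i_idev);
}

/* Offset of tun_info.encap inside a metadata_dst. */
static inline size_t tun_encap_offset(void)
{
	return offsetof(struct metadata_dst_mock, u) +
	       offsetof(struct ip_tunnel_info_mock, encap);
}

/* Compile-time check that both land on the same byte offset. */
static_assert(offsetof(struct rt6_info_mock, rt6i_idev) ==
	      offsetof(struct metadata_dst_mock, u) +
	      offsetof(struct ip_tunnel_info_mock, encap),
	      "rt6i_idev and tun_info.encap are expected to overlap");
```

With these mock layouts both helpers return the same offset, so a
metadata_dst cast to rt6_info would have its tun_info.encap bytes read as
the rt6i_idev pointer. This only restates the layout argument; it says
nothing about whether the path is reachable from the kfunc.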

Thread overview: 9+ messages
2026-06-25 11:03 [PATCH bpf-next v10 0/5] bpf: add icmp_send kfunc Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc Mahe Tardy
2026-06-25 16:24   ` Stanislav Fomichev
2026-06-25 16:49     ` Mahe Tardy
2026-06-26 16:18       ` Stanislav Fomichev [this message]
2026-06-25 11:03 ` [PATCH bpf-next v10 2/5] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 3/5] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb IPv6 tests Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 4/5] selftests/bpf: add bpf_icmp_send recursion test Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 5/5] selftests/bpf: add bpf_icmp_send no route test Mahe Tardy
