From: Stanislav Fomichev <sdf.kernel@gmail.com>
To: Mahe Tardy <mahe.tardy@gmail.com>
Cc: bpf@vger.kernel.org, andrii@kernel.org, ast@kernel.org,
daniel@iogearbox.net, john.fastabend@gmail.com, jordan@jrife.io,
martin.lau@linux.dev, yonghong.song@linux.dev,
emil@etsalapatis.com, netdev@vger.kernel.org,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
davem@davemloft.net, horms@kernel.org
Subject: Re: [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc
Date: Fri, 26 Jun 2026 09:18:39 -0700 [thread overview]
Message-ID: <aj6kdnfAB0LJKDcR@devvm7509.cco0.facebook.com> (raw)
In-Reply-To: <aj1b_z6h5xn42Hxe@gmail.com>
On 06/25, Mahe Tardy wrote:
> On Thu, Jun 25, 2026 at 09:24:59AM -0700, Stanislav Fomichev wrote:
> > On 06/25, Mahe Tardy wrote:
>
> [...]
>
> > > +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> > > +{
> > > + struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > > + struct sk_buff *nskb;
> > > + struct sock *sk;
> > > +
> > > + sk = skb_to_full_sk(skb);
> > > + if (sk && sk->sk_kern_sock &&
> > > + (sk->sk_protocol == IPPROTO_ICMP || sk->sk_protocol == IPPROTO_ICMPV6))
> > > + return -EBUSY;
> > > +
> > > + switch (skb->protocol) {
> > > +#if IS_ENABLED(CONFIG_INET)
> > > + case htons(ETH_P_IP): {
> > > + if (type != ICMP_DEST_UNREACH)
> > > + return -EOPNOTSUPP;
> > > + if (code < 0 || code > NR_ICMP_UNREACH ||
> > > + code == ICMP_FRAG_NEEDED) /* needs a valid next-hop MTU */
> > > + return -EINVAL;
> > > +
> > > + /* icmp_send expects skb_dst to be a real rtable. */
> > > + if (!skb_valid_dst(skb))
> > > + return -ENETUNREACH;
> > > +
> > > + nskb = skb_clone(skb, GFP_ATOMIC);
> > > + if (!nskb)
> > > + return -ENOMEM;
> > > +
> > > + memset(IPCB(nskb), 0, sizeof(*IPCB(nskb)));
> > > + icmp_send(nskb, type, code, 0);
> > > + consume_skb(nskb);
> > > + break;
> > > + }
> > > +#endif
> > > +#if IS_ENABLED(CONFIG_IPV6)
> > > + case htons(ETH_P_IPV6):
> > > + if (type != ICMPV6_DEST_UNREACH)
> > > + return -EOPNOTSUPP;
> > > + if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > > + return -EINVAL;
> >
> > [..]
> >
> > > + /* icmpv6_send may treat skb_dst as rt6_info. */
> > > + if (skb_metadata_dst(skb))
> > > + return -ENETUNREACH;
> >
> > A bit confused about this. Which part of icmpv6_send treats skb_dst as rt6_info?
> > (I see the original sashiko report about dst, but icmp6 seems to be not
> > requiring it)
>
> Yeah I was also a bit confused because this came out of nowhere as soon
> as I put the skb_valid_dst only on the IPv4 path (for different
> reasons), but there is actually a potential trace in which we have type
> confusion indeed:
>
> - icmp6_send() checks scoped source addresses and calls icmp6_iif() at net/ipv6/icmp.c:702
> - icmp6_iif() calls icmp6_dev() at net/ipv6/icmp.c:441
> - icmp6_dev() does skb_rt6_info(skb) for loopback/L3 master devices at net/ipv6/icmp.c:428
> - skb_rt6_info() casts any non-NULL dst to struct rt6_info at include/net/ip6_route.h:233
> - rt6->rt6i_idev is then dereferenced at net/ipv6/icmp.c:434
>
> When checking with pahole, we can find this on my local kernel:
>
> struct rt6_info {
> struct dst_entry dst; /* 0 136 */
> /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> struct fib6_info * from; /* 136 8 */
> int sernum; /* 144 4 */
> struct rt6key rt6i_dst; /* 148 20 */
> struct rt6key rt6i_src; /* 168 20 */
> struct in6_addr rt6i_gateway; /* 188 16 */
>
> /* XXX 4 bytes hole, try to pack */
>
> /* --- cacheline 3 boundary (192 bytes) was 16 bytes ago --- */
> struct inet6_dev * rt6i_idev; /* 208 8 */ <--- we dereference this
> u32 rt6i_flags; /* 216 4 */
> short unsigned int rt6i_nfheader_len; /* 220 2 */
>
> /* size: 224, cachelines: 4, members: 9 */
> /* sum members: 218, holes: 1, sum holes: 4 */
> /* padding: 2 */
> /* last cacheline: 32 bytes */
> };
>
> And the metadata_dst would look like this:
>
> struct metadata_dst {
> struct dst_entry dst; /* 0 136 */
> /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
> enum metadata_type type; /* 136 4 */
>
> /* XXX 4 bytes hole, try to pack */
>
> union {
> struct ip_tunnel_info tun_info; /* 144 96 */
> struct hw_port_info port_info; /* 144 16 */
> struct macsec_info macsec_info; /* 144 8 */
> struct xfrm_md_info xfrm_info; /* 144 16 */
> } u; /* 144 96 */ <--- we land on this union
>
> /* size: 240, cachelines: 4, members: 3 */
> /* sum members: 236, holes: 1, sum holes: 4 */
> /* last cacheline: 48 bytes */
> };
>
> Let's say it's a struct ip_tunnel_info:
>
> struct ip_tunnel_info {
> struct ip_tunnel_key key; /* 0 64 */
>
> /* XXX last struct has 7 bytes of padding */
>
> /* --- cacheline 1 boundary (64 bytes) --- */
> struct ip_tunnel_encap encap; /* 64 8 */ <--- 144 + 64 = 208 we land here
> struct dst_cache dst_cache; /* 72 16 */
> u8 options_len; /* 88 1 */
> u8 mode; /* 89 1 */
>
> /* size: 96, cachelines: 2, members: 5 */
> /* padding: 6 */
> /* paddings: 1, sum paddings: 7 */
> /* last cacheline: 32 bytes */
> };
>
> So I imagine this is fairly tricky to trigger but still a case of type
> confusion. I have actually no idea how likely this can happen from my
> call but the trace makes sense at least.
That logic seems to exist for the icmp6_send to find the input device
(since the expected use-case for calling icmp6_send is to the incoming
skb). And since you're mainly doing egress, I don't think this path will
ever trigger (iow the check is not needed)?
Maybe you can add cgroup_ingress test case? Looks like this rt6_info
path might trigger for ipv6 lo? I don't see any ingress test in your
series, so might be good to have one regardless?
next prev parent reply other threads:[~2026-06-26 16:18 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-25 11:03 [PATCH bpf-next v10 0/5] bpf: add icmp_send kfunc Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc Mahe Tardy
2026-06-25 16:24 ` Stanislav Fomichev
2026-06-25 16:49 ` Mahe Tardy
2026-06-26 16:18 ` Stanislav Fomichev [this message]
2026-06-25 11:03 ` [PATCH bpf-next v10 2/5] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 3/5] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb IPv6 tests Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 4/5] selftests/bpf: add bpf_icmp_send recursion test Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 5/5] selftests/bpf: add bpf_icmp_send no route test Mahe Tardy
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aj6kdnfAB0LJKDcR@devvm7509.cco0.facebook.com \
--to=sdf.kernel@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=emil@etsalapatis.com \
--cc=horms@kernel.org \
--cc=john.fastabend@gmail.com \
--cc=jordan@jrife.io \
--cc=kuba@kernel.org \
--cc=mahe.tardy@gmail.com \
--cc=martin.lau@linux.dev \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=yonghong.song@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.