From: Stanislav Fomichev <sdf.kernel@gmail.com>
To: Mahe Tardy <mahe.tardy@gmail.com>
Cc: bpf@vger.kernel.org, andrii@kernel.org, ast@kernel.org,
daniel@iogearbox.net, john.fastabend@gmail.com, jordan@jrife.io,
martin.lau@linux.dev, yonghong.song@linux.dev,
emil@etsalapatis.com, netdev@vger.kernel.org,
edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
davem@davemloft.net, horms@kernel.org
Subject: Re: [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc
Date: Fri, 26 Jun 2026 09:18:39 -0700
Message-ID: <aj6kdnfAB0LJKDcR@devvm7509.cco0.facebook.com>
In-Reply-To: <aj1b_z6h5xn42Hxe@gmail.com>

On 06/25, Mahe Tardy wrote:
> On Thu, Jun 25, 2026 at 09:24:59AM -0700, Stanislav Fomichev wrote:
> > On 06/25, Mahe Tardy wrote:
>
> [...]
>
> > > +__bpf_kfunc int bpf_icmp_send(struct __sk_buff *skb_ctx, int type, int code)
> > > +{
> > > +	struct sk_buff *skb = (struct sk_buff *)skb_ctx;
> > > +	struct sk_buff *nskb;
> > > +	struct sock *sk;
> > > +
> > > +	sk = skb_to_full_sk(skb);
> > > +	if (sk && sk->sk_kern_sock &&
> > > +	    (sk->sk_protocol == IPPROTO_ICMP || sk->sk_protocol == IPPROTO_ICMPV6))
> > > +		return -EBUSY;
> > > +
> > > +	switch (skb->protocol) {
> > > +#if IS_ENABLED(CONFIG_INET)
> > > +	case htons(ETH_P_IP): {
> > > +		if (type != ICMP_DEST_UNREACH)
> > > +			return -EOPNOTSUPP;
> > > +		if (code < 0 || code > NR_ICMP_UNREACH ||
> > > +		    code == ICMP_FRAG_NEEDED) /* needs a valid next-hop MTU */
> > > +			return -EINVAL;
> > > +
> > > +		/* icmp_send expects skb_dst to be a real rtable. */
> > > +		if (!skb_valid_dst(skb))
> > > +			return -ENETUNREACH;
> > > +
> > > +		nskb = skb_clone(skb, GFP_ATOMIC);
> > > +		if (!nskb)
> > > +			return -ENOMEM;
> > > +
> > > +		memset(IPCB(nskb), 0, sizeof(*IPCB(nskb)));
> > > +		icmp_send(nskb, type, code, 0);
> > > +		consume_skb(nskb);
> > > +		break;
> > > +	}
> > > +#endif
> > > +#if IS_ENABLED(CONFIG_IPV6)
> > > +	case htons(ETH_P_IPV6):
> > > +		if (type != ICMPV6_DEST_UNREACH)
> > > +			return -EOPNOTSUPP;
> > > +		if (code < 0 || code > ICMPV6_REJECT_ROUTE)
> > > +			return -EINVAL;
> >
> > [..]
> >
> > > +		/* icmpv6_send may treat skb_dst as rt6_info. */
> > > +		if (skb_metadata_dst(skb))
> > > +			return -ENETUNREACH;
> >
> > A bit confused about this. Which part of icmpv6_send treats skb_dst as rt6_info?
> > (I see the original sashiko report about dst, but icmp6 does not seem to
> > require it)
>
> Yeah I was also a bit confused because this came out of nowhere as soon
> as I put the skb_valid_dst only on the IPv4 path (for different
> reasons), but there is actually a potential trace in which we have type
> confusion indeed:
>
> - icmp6_send() checks scoped source addresses and calls icmp6_iif() at net/ipv6/icmp.c:702
> - icmp6_iif() calls icmp6_dev() at net/ipv6/icmp.c:441
> - icmp6_dev() does skb_rt6_info(skb) for loopback/L3 master devices at net/ipv6/icmp.c:428
> - skb_rt6_info() casts any non-NULL dst to struct rt6_info at include/net/ip6_route.h:233
> - rt6->rt6i_idev is then dereferenced at net/ipv6/icmp.c:434
>
> When checking with pahole, we can find this on my local kernel:
>
> struct rt6_info {
>         struct dst_entry           dst;                  /*     0   136 */
>         /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
>         struct fib6_info *         from;                 /*   136     8 */
>         int                        sernum;               /*   144     4 */
>         struct rt6key              rt6i_dst;             /*   148    20 */
>         struct rt6key              rt6i_src;             /*   168    20 */
>         struct in6_addr            rt6i_gateway;         /*   188    16 */
>
>         /* XXX 4 bytes hole, try to pack */
>
>         /* --- cacheline 3 boundary (192 bytes) was 16 bytes ago --- */
>         struct inet6_dev *         rt6i_idev;            /*   208     8 */ <--- we dereference this
>         u32                        rt6i_flags;           /*   216     4 */
>         short unsigned int         rt6i_nfheader_len;    /*   220     2 */
>
>         /* size: 224, cachelines: 4, members: 9 */
>         /* sum members: 218, holes: 1, sum holes: 4 */
>         /* padding: 2 */
>         /* last cacheline: 32 bytes */
> };
>
> And the metadata_dst would look like this:
>
> struct metadata_dst {
>         struct dst_entry           dst;                  /*     0   136 */
>         /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
>         enum metadata_type         type;                 /*   136     4 */
>
>         /* XXX 4 bytes hole, try to pack */
>
>         union {
>                 struct ip_tunnel_info tun_info;          /*   144    96 */
>                 struct hw_port_info   port_info;         /*   144    16 */
>                 struct macsec_info    macsec_info;       /*   144     8 */
>                 struct xfrm_md_info   xfrm_info;         /*   144    16 */
>         } u;                                             /*   144    96 */ <--- we land on this union
>
>         /* size: 240, cachelines: 4, members: 3 */
>         /* sum members: 236, holes: 1, sum holes: 4 */
>         /* last cacheline: 48 bytes */
> };
>
> Let's say it's a struct ip_tunnel_info:
>
> struct ip_tunnel_info {
>         struct ip_tunnel_key       key;                  /*     0    64 */
>
>         /* XXX last struct has 7 bytes of padding */
>
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         struct ip_tunnel_encap     encap;                /*    64     8 */ <--- 144 + 64 = 208 we land here
>         struct dst_cache           dst_cache;            /*    72    16 */
>         u8                         options_len;          /*    88     1 */
>         u8                         mode;                 /*    89     1 */
>
>         /* size: 96, cachelines: 2, members: 5 */
>         /* padding: 6 */
>         /* paddings: 1, sum paddings: 7 */
>         /* last cacheline: 32 bytes */
> };
>
> So I imagine this is fairly tricky to trigger, but it is still a case of
> type confusion. I actually have no idea how likely this is to happen from
> my call, but the trace makes sense at least.

That logic seems to exist so that icmp6_send can find the input device
(the expected use case is calling icmp6_send on an incoming skb). Since
you're mainly doing egress, I don't think this path will ever trigger
(IOW, the check is not needed)?

Maybe you can add a cgroup_ingress test case? It looks like this rt6_info
path might trigger for IPv6 lo. I don't see any ingress test in your
series, so it might be good to have one regardless.
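
To make the offset arithmetic in the quoted trace concrete, here is a
minimal userspace sketch. It does not use the kernel definitions: the
mock_* structs and the three helper functions are hypothetical stand-ins
that copy only the member sizes from the pahole output quoted above, and
the sketch assumes an LP64 target (8-byte pointers). On another kernel
config or architecture the numbers may well differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins: only sizes and alignment mirror the quoted pahole output. */
struct mock_dst_entry {
	uint64_t opaque[17];		/* 136 bytes, like struct dst_entry above */
};

struct mock_rt6_info {
	struct mock_dst_entry dst;	/* offset 0 */
	void *from;			/* offset 136 */
	int sernum;			/* offset 144 */
	unsigned char rt6i_dst[20];	/* offset 148 */
	unsigned char rt6i_src[20];	/* offset 168 */
	unsigned char rt6i_gateway[16];	/* offset 188, followed by a 4-byte hole */
	void *rt6i_idev;		/* offset 208 */
};

struct mock_ip_tunnel_info {
	unsigned char key[64];		/* offset 0 */
	unsigned char encap[8];		/* offset 64 */
	void *dst_cache[2];		/* offset 72, 16 bytes on LP64 */
	uint8_t options_len;		/* offset 88 */
	uint8_t mode;			/* offset 89 */
};

struct mock_metadata_dst {
	struct mock_dst_entry dst;	/* offset 0 */
	int type;			/* offset 136, followed by a 4-byte hole */
	union {
		struct mock_ip_tunnel_info tun_info;
	} u;				/* offset 144 */
};

/* Assumption of this sketch, not of the kernel: 8-byte pointers. */
_Static_assert(sizeof(void *) == 8, "sketch assumes an LP64 target");

/* Offset at which a reader that believes it holds an rt6_info looks for rt6i_idev. */
size_t rt6i_idev_offset(void)
{
	return offsetof(struct mock_rt6_info, rt6i_idev);
}

/* Offset of tun_info.encap measured from the start of the metadata dst. */
size_t tun_encap_offset(void)
{
	return offsetof(struct mock_metadata_dst, u) +
	       offsetof(struct mock_ip_tunnel_info, encap);
}

/*
 * Fill encap with a marker byte, then copy pointer-sized bytes from the
 * position where rt6i_idev would live. Returns 1 if every copied byte is
 * the marker, i.e. the "pointer" would be built from encap data.
 */
int confused_read_hits_encap(void)
{
	struct mock_metadata_dst md;
	unsigned char seen[sizeof(void *)];
	size_t off = rt6i_idev_offset();
	size_t i;

	assert(off + sizeof(seen) <= sizeof(md));
	memset(&md, 0, sizeof(md));
	memset(md.u.tun_info.encap, 0xAB, sizeof(md.u.tun_info.encap));
	memcpy(seen, (const unsigned char *)&md + off, sizeof(seen));

	for (i = 0; i < sizeof(seen); i++)
		if (seen[i] != 0xAB)
			return 0;
	return 1;
}
```

Called from a small main(), rt6i_idev_offset() and tun_encap_offset()
should return the same value under these assumptions, which is the
overlap the trace describes. The sketch uses memcpy() rather than a
pointer cast so that it stays well-defined C. It says nothing about
whether the icmp6_dev() path is actually reachable from the kfunc.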
Thread overview: 9+ messages
2026-06-25 11:03 [PATCH bpf-next v10 0/5] bpf: add icmp_send kfunc Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 1/5] bpf: add bpf_icmp_send kfunc Mahe Tardy
2026-06-25 16:24 ` Stanislav Fomichev
2026-06-25 16:49 ` Mahe Tardy
2026-06-26 16:18 ` Stanislav Fomichev [this message]
2026-06-25 11:03 ` [PATCH bpf-next v10 2/5] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb tests Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 3/5] selftests/bpf: add bpf_icmp_send kfunc cgroup_skb IPv6 tests Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 4/5] selftests/bpf: add bpf_icmp_send recursion test Mahe Tardy
2026-06-25 11:03 ` [PATCH bpf-next v10 5/5] selftests/bpf: add bpf_icmp_send no route test Mahe Tardy