From: Lorenzo Bianconi <lorenzo@kernel.org>
To: Stanislav Fomichev <stfomichev@gmail.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>,
bpf@vger.kernel.org, netdev@vger.kernel.org,
Jakub Kicinski <kuba@kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <borkmann@iogearbox.net>,
Eric Dumazet <eric.dumazet@gmail.com>,
"David S. Miller" <davem@davemloft.net>,
Paolo Abeni <pabeni@redhat.com>,
sdf@fomichev.me, kernel-team@cloudflare.com,
arthur@arthurfabre.com, jakub@cloudflare.com
Subject: Re: [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets
Date: Wed, 16 Jul 2025 13:17:53 +0200 [thread overview]
Message-ID: <aHeKYZY7l2i1xwel@lore-desk> (raw)
In-Reply-To: <aHE2F1FJlYc37eIz@mini-arch>
[-- Attachment #1: Type: text/plain, Size: 7854 bytes --]
On Jul 11, Stanislav Fomichev wrote:
> On 07/09, Lorenzo Bianconi wrote:
> > On Jul 07, Stanislav Fomichev wrote:
> > > On 07/03, Jesper Dangaard Brouer wrote:
> > > >
> > > >
> > > > On 02/07/2025 18.05, Stanislav Fomichev wrote:
> > > > > On 07/02, Jesper Dangaard Brouer wrote:
> > > > > > This patch series introduces a mechanism for an XDP program to store RX
> > > > > > metadata hints - specifically rx_hash, rx_vlan_tag, and rx_timestamp -
> > > > > > into the xdp_frame. These stored hints are then used to populate the
> > > > > > corresponding fields in the SKB that is created from the xdp_frame
> > > > > > following an XDP_REDIRECT.
> > > > > >
> > > > > > The chosen RX metadata hints intentionally map to the existing NIC
> > > > > > hardware metadata that can be read via kfuncs [1]. While this design
> > > > > > allows a BPF program to read and propagate existing hardware hints, our
> > > > > > primary motivation is to enable setting custom values. This is important
> > > > > > for use cases where the hardware-provided information is insufficient or
> > > > > > needs to be calculated based on packet contents unavailable to the
> > > > > > hardware.
> > > > > >
> > > > > > The primary motivation for this feature is to enable scalable load
> > > > > > balancing of encapsulated tunnel traffic at the XDP layer. When tunnelled
> > > > > > packets (e.g., IPsec, GRE) are redirected via cpumap or to a veth device,
> > > > > > the networking stack later calculates a software hash based on the outer
> > > > > > headers. For a single tunnel, these outer headers are often identical,
> > > > > > causing all packets to be assigned the same hash. This collapses all
> > > > > > traffic onto a single RX queue, creating a performance bottleneck and
> > > > > > defeating receive-side scaling (RSS).
> > > > > >
> > > > > > Our immediate use case involves load balancing IPsec traffic. For such
> > > > > > tunnelled traffic, any hardware-provided RX hash is calculated on the
> > > > > > outer headers and is therefore incorrect for distributing inner flows.
> > > > > > There is no reason to read the existing value, as it must be recalculated.
> > > > > > In our XDP program, we perform a partial decryption to access the inner
> > > > > > headers and calculate a new load-balancing hash, which provides better
> > > > > > flow distribution. However, without this patch set, there is no way to
> > > > > > persist this new hash for the network stack to use post-redirect.
> > > > > >
> > > > > > This series solves the problem by introducing new BPF kfuncs that allow an
> > > > > > XDP program to write e.g. the hash value into the xdp_frame. The
> > > > > > __xdp_build_skb_from_frame() function is modified to use this stored value
> > > > > > to set skb->hash on the newly created SKB. As a result, the veth driver's
> > > > > > queue selection logic uses the BPF-supplied hash, achieving proper
> > > > > > traffic distribution across multiple CPU cores. This also ensures that
> > > > > > consumers, like the GRO engine, can operate effectively.
> > > > > >
> > > > > > We considered XDP traits as an alternative to adding static members to
> > > > > > struct xdp_frame. Given the immediate need for this functionality and the
> > > > > > current development status of traits, we believe this approach is a
> > > > > > pragmatic solution. We are open to migrating to a traits-based
> > > > > > implementation if and when they become a generally accepted mechanism for
> > > > > > such extensions.
> > > > > >
> > > > > > [1] https://docs.kernel.org/networking/xdp-rx-metadata.html
> > > > > > ---
> > > > > > V1: https://lore.kernel.org/all/174897271826.1677018.9096866882347745168.stgit@firesoul/
> > > > >
> > > > > No change log?
> > > >
> > > > We have fixed selftest as requested by Alexie.
> > > > And we have updated cover-letter and doc as you Stanislav requested.
> > > >
> > > > >
> > > > > Btw, any feedback on the following from v1?
> > > > > - https://lore.kernel.org/netdev/aFHUd98juIU4Rr9J@mini-arch/
> > > >
> > > > Addressed as updated cover-letter and documentation. I hope this helps
> > > > reviewers understand the use-case, as the discussion turn into "how do we
> > > > transfer all HW metadata", which is NOT what we want (and a waste of
> > > > precious cycles).
> > > >
> > > > For our use-case, it doesn't make sense to "transfer all HW metadata".
> > > > In fact we don't even want to read the hardware RH-hash, because we already
> > > > know it is wrong (for tunnels), we just want to override the RX-hash used at
> > > > SKB creation. We do want the BPF programmers flexibility to call these
> > > > kfuncs individually (when relevant).
> > > >
> > > > > - https://lore.kernel.org/netdev/20250616145523.63bd2577@kernel.org/
> > > >
> > > > I feel pressured into critiquing Jakub's suggestion, hope this is not too
> > > > harsh. First of all it is not relevant to our this patchset use-case, as it
> > > > focus on all HW metadata.
> > >
> > > [..]
> > >
> > > > Second, I disagree with the idea/mental model of storing in a
> > > > "driver-specific format". The current implementation of driver-specific
> > > > kfunc helpers that "get the metadata" is already doing a conversion to a
> > > > common format, because the BPF-programmer naturally needs this to be the
> > > > same across drivers. Thus, it doesn't make sense to store it back in a
> > > > "driver-specific format", as that just complicate things. My mental model
> > > > is thus, that after the driver-specific "get" operation to result is in a
> > > > common format, that is simply defined by the struct type of the kfunc, which
> > > > is both known by the kernel and BPF-prog.
> > >
> > > Having get/set model seems a bit more generic, no? Potentially giving us the
> > > ability to "correct" HW metadata for the non-redirected cases as well.
> > > Plus we don't hard-code the (internal) layout. Solving only xdp_redirect
> > > seems a bit too narrow, idk..
> >
> > I can't see what the non-redirected use-case could be. Can you please provide
> > more details?
> > Moreover, can it be solved without storing the rx_hash (or the other
> > hw-metadata) in a non-driver specific format?
>
> Having setters feels more generic than narrowly solving only the redirect,
> but I don't have a good use-case in mind.
>
> > Storing the hw-metadata in some of hw-specific format in xdp_frame will not
> > allow to consume them directly building the skb and we will require to decode
> > them again. What is the upside/use-case of this approach? (not considering the
> > orthogonality with the get method).
>
> If we add the store kfuncs to regular drivers, the metadata won't be stored
> in the xdp_frame; it will go into the rx descriptors so regular path that
> builds skbs will use it.
IIUC, the described use-case would be to modify the hw metadata via a
'setter' kfunc executed by an eBPF program bounded to the NIC and to store
the new metadata in the DMA descriptor in order to be consumed by the driver
codebase building the skb, right?
If so:
- we can get the same result just storing (running a kfunc) the modified hw
metadata in the xdp_buff struct using a well-known/generic layout and
consume it in the driver codebase (e.g. if the bounded eBPF program
returns XDP_PASS) using a generic xdp utility routine. This part is not in
the current series.
- Using this approach we are still not preserving the hw metadata if we pass
the xdp_frame to a remote CPU returning XDP_REDIRCT (we need to add more
code)
- I am not completely sure if can always modify the DMA descriptor directly
since it is DMA mapped.
What do you think?
Regards,
Lorenzo
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
next prev parent reply other threads:[~2025-07-16 11:17 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-02 14:58 [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 1/7] net: xdp: Add xdp_rx_meta structure Jesper Dangaard Brouer
2025-07-17 9:19 ` Jakub Sitnicki
2025-07-17 14:40 ` Jesper Dangaard Brouer
2025-07-18 10:33 ` Jakub Sitnicki
2025-07-02 14:58 ` [PATCH bpf-next V2 2/7] selftests/bpf: Adjust test for maximum packet size in xdp_do_redirect Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 3/7] net: xdp: Add kfuncs to store hw metadata in xdp_buff Jesper Dangaard Brouer
2025-07-03 11:41 ` Jesper Dangaard Brouer
2025-07-03 12:26 ` Lorenzo Bianconi
2025-07-02 14:58 ` [PATCH bpf-next V2 4/7] net: xdp: Set skb hw metadata from xdp_frame Jesper Dangaard Brouer
2025-07-02 14:58 ` [PATCH bpf-next V2 5/7] net: veth: Read xdp metadata from rx_meta struct if available Jesper Dangaard Brouer
2025-07-17 12:11 ` Jakub Sitnicki
2025-07-02 14:58 ` [PATCH bpf-next V2 6/7] bpf: selftests: Add rx_meta store kfuncs selftest Jesper Dangaard Brouer
2025-07-23 9:24 ` Bouska, Zdenek
2025-07-02 14:58 ` [PATCH bpf-next V2 7/7] net: xdp: update documentation for xdp-rx-metadata.rst Jesper Dangaard Brouer
2025-07-02 16:05 ` [PATCH bpf-next V2 0/7] xdp: Allow BPF to set RX hints for XDP_REDIRECTed packets Stanislav Fomichev
2025-07-03 11:17 ` Jesper Dangaard Brouer
2025-07-07 14:40 ` Stanislav Fomichev
2025-07-09 9:31 ` Lorenzo Bianconi
2025-07-11 16:04 ` Stanislav Fomichev
2025-07-16 11:17 ` Lorenzo Bianconi [this message]
2025-07-16 21:20 ` Jakub Kicinski
2025-07-17 13:08 ` Jesper Dangaard Brouer
2025-07-18 1:25 ` Jakub Kicinski
2025-07-18 10:56 ` Jesper Dangaard Brouer
2025-07-22 1:13 ` Jakub Kicinski
2025-07-28 10:53 ` Lorenzo Bianconi
2025-07-28 16:29 ` Jakub Kicinski
2025-07-29 11:15 ` Jesper Dangaard Brouer
2025-07-29 19:47 ` Martin KaFai Lau
2025-07-31 16:27 ` Jesper Dangaard Brouer
2025-08-01 20:38 ` Jakub Kicinski
2025-08-04 13:18 ` Jesper Dangaard Brouer
2025-08-06 0:28 ` Jakub Kicinski
2025-08-07 18:26 ` Jesper Dangaard Brouer
2025-08-06 1:24 ` Martin KaFai Lau
2025-08-07 19:07 ` Jesper Dangaard Brouer
2025-08-13 2:59 ` Martin KaFai Lau
2025-07-31 21:18 ` Lorenzo Bianconi
2025-08-01 20:40 ` Jakub Kicinski
2025-08-05 13:18 ` Lorenzo Bianconi
2025-08-05 23:54 ` Jakub Kicinski
2025-07-18 9:55 ` Lorenzo Bianconi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aHeKYZY7l2i1xwel@lore-desk \
--to=lorenzo@kernel.org \
--cc=arthur@arthurfabre.com \
--cc=ast@kernel.org \
--cc=borkmann@iogearbox.net \
--cc=bpf@vger.kernel.org \
--cc=davem@davemloft.net \
--cc=eric.dumazet@gmail.com \
--cc=hawk@kernel.org \
--cc=jakub@cloudflare.com \
--cc=kernel-team@cloudflare.com \
--cc=kuba@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=sdf@fomichev.me \
--cc=stfomichev@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox