BPF List
 help / color / mirror / Atom feed
From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Arthur Fabre <arthur@arthurfabre.com>,
	Network Development <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Yan Zhai <yan@cloudflare.com>,
	jbrandeburg@cloudflare.com, lbiancon@redhat.com,
	Alexei Starovoitov <ast@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	kernel-team@cloudflare.com
Subject: Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
Date: Mon, 05 May 2025 14:35:59 +0200	[thread overview]
Message-ID: <871pt35j68.fsf@toke.dk> (raw)
In-Reply-To: <87a57r4azq.fsf@cloudflare.com>

Jakub Sitnicki <jakub@cloudflare.com> writes:

> On Thu, May 01, 2025 at 12:43 PM +02, Toke Høiland-Jørgensen wrote:
>> Jakub Sitnicki <jakub@cloudflare.com> writes:
>>
>>> On Wed, Apr 30, 2025 at 11:19 AM +02, Toke Høiland-Jørgensen wrote:
>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>>
>>>>> On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>>>>>
>>>>>> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
>>>>>> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>>
>>> [...]
>>>
>>>>>> * Hardware metadata: metadata exposed from NICs (like the receive
>>>>>>   timestamp, 4 tuple hash...) is currently only exposed to XDP programs
>>>>>>   (via kfuncs).
>>>>>>   But that doesn't expose them to the rest of the stack.
>>>>>>   Storing them in traits would allow XDP, other BPF programs, and the
>>>>>>   kernel to access and modify them (for example to into account
>>>>>>   decapsulating a packet).
>>>>>
>>>>> Sure. If traits == existing metadata bpf prog in xdp can communicate
>>>>> with bpf prog in skb layer via that "trait" format.
>>>>> xdp can take tuple hash and store it as key==0 in the trait.
>>>>> The kernel doesn't need to know how to parse that format.
>>>>
>>>> Yes it does, to propagate it to the skb later. I.e.,
>>>>
>>>> XDP prog on NIC: get HW hash, store in traits, redirect to CPUMAP
>>>> CPUMAP: build skb, read hash from traits, populate skb hash
>>>>
>>>> Same thing for (at least) timestamps and checksums.
>>>>
>>>> Longer term, with traits available we could move more skb fields into
>>>> traits to make struct sk_buff smaller (by moving optional fields to
>>>> traits that don't take up any space if they're not set).
>>>
>>> Perhaps we can have the cake and eat it too.
>>>
>>> We could leave the traits encoding/decoding out of the kernel and, at
>>> the same time, *expose it* to the network stack through BPF struct_ops
>>> programs. At a high level, for example ->get_rx_hash(), not the
>>> individual K/V access. The traits_ops vtable could grow as needed to
>>> support new use cases.
>>>
>>> If you think about it, it's not so different from BPF-powered congestion
>>> algorithms and scheduler extensions. They also expose some state, kept in
>>> maps, that only the loaded BPF code knows how to operate on.
>>
>> Right, the difference being that the kernel works perfectly well without
>> an eBPF congestion control algorithm loaded because it has its own
>> internal implementation that is used by default.
>
> It seems to me that any code path on the network stack still needs to
> work *even if* traits K/V is not available. There has to be a fallback -
> like, RX hash not present in traits K/V? must recompute it. There is no
> guarantee that there will be space available in the traits K/V store for
> whatever value the network stack would like to cache there.

The stack is in control of both the memory allocation and the placement
of the kv-store, so it could totally guarantee that if needed. Which is
the whole point of making this kernel-internal, IMO.

> So if we can agree that traits K/V is a cache, with limited capacity,
> and any code path accessing it must be prepared to deal with a cache
> miss, then I think with struct_ops approach you could have a built-in
> default implementation for exclusive use by the network stack.

If we have such a default implementation (which would presumably be the
one in this series), why would anyone override it? The need for that
seems a bit speculative, so why not just start with the one
implementation, and add the override if a really compelling use case
does eventually turn up?

> This default implementation of the storage access just wouldn't be
> exposed to the BPF or user-space. If you want access from BPF/userland,
> then you'd need to provide a BPF-backed struct_ops for accessing traits
> K/V.
>
>> Having a hard dependency on BPF for in-kernel functionality is a
>> different matter, and limits the cases it can be used for.
>
> Notice that we already rely on XDP program being attached or the storage
> for traits K/V is not available.
>
>> Besides, I don't really see the point of leaving the encoding out of the
>> kernel? We keep the encoding kernel-internal anyway, and just expose a
>> get/set API, so there's no constraint on changing it later (that's kinda
>> the whole point of doing that). And with bulk get/set there's not an
>> efficiency argument either. So what's the point, other than doing things
>> in BPF for its own sake?
>
> There's the additional complexity in the socket glue layer, but I've
> already mentioned that.
>
> What I think makes it even more appealing is that with the high-level
> struct_ops approach, we abstract away the individual K/V pair access and
> leave the problem of "key registration" (e.g., RX hash is key 42) to the
> user-provided implementation.
>
> You, as the user, decide for your particular system how you want to lay
> out the values and for which values you actually want to reserve
> space. IOW, we leave any trade off decisions to the user in the spirit
> of providing a mechanism, not policy.

But we already have such a possibility, that's basically the metadata
space that XDP/TC already has access to. And the whole reason why this
patch set makes sense is that we need something where the kernel
provides something more structured (K/V) that facilitates sharing across
applications that don't have central coordination across the system.
Punting that (back) off to BPF just gets us back to square one...

-Toke


  reply	other threads:[~2025-05-05 12:36 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata Arthur Fabre
2025-04-24 16:22   ` Alexei Starovoitov
2025-04-25 19:26     ` Arthur Fabre
2025-04-29 23:36       ` Alexei Starovoitov
2025-04-30  9:19         ` Toke Høiland-Jørgensen
2025-04-30 16:29           ` Alexei Starovoitov
2025-05-01  7:30             ` Arthur Fabre
2025-04-30 19:19           ` Jakub Sitnicki
2025-05-01 10:43             ` Toke Høiland-Jørgensen
2025-05-01 14:03               ` Jesper Dangaard Brouer
2025-05-05 10:18               ` Jakub Sitnicki
2025-05-05 12:35                 ` Toke Høiland-Jørgensen [this message]
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 02/17] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 03/17] trait: XDP support Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 04/17] trait: XDP selftest Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 05/17] trait: XDP benchmark Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 06/17] trait: Replace memcpy calls with inline copies Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 07/17] trait: Replace memmove calls with inline move Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 08/17] skb: Extension header in packet headroom Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 09/17] trait: Store traits in sk_buff extension Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb Arthur Fabre
2025-04-23 16:36   ` Stanislav Fomichev
2025-04-23 20:54     ` Arthur Fabre
2025-04-23 23:45       ` Stanislav Fomichev
2025-04-24  9:49         ` Toke Høiland-Jørgensen
2025-04-24 15:39           ` Stanislav Fomichev
2025-04-24 18:59             ` Jakub Sitnicki
2025-04-25  8:06               ` Toke Høiland-Jørgensen
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 11/17] ice: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 12/17] veth: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 13/17] virtio_net: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 14/17] mlx5: move xdp_buff scope one level up Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 15/17] mlx5: Propagate trait presence to skb Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 16/17] xdp generic: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 17/17] trait: Allow socket filters to access traits Arthur Fabre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871pt35j68.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=arthur@arthurfabre.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=jakub@cloudflare.com \
    --cc=jbrandeburg@cloudflare.com \
    --cc=kernel-team@cloudflare.com \
    --cc=kuba@kernel.org \
    --cc=lbiancon@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=yan@cloudflare.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox