All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Arthur Fabre <arthur@arthurfabre.com>,
	Network Development <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Yan Zhai <yan@cloudflare.com>,
	jbrandeburg@cloudflare.com, lbiancon@redhat.com,
	Alexei Starovoitov <ast@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>,
	Eric Dumazet <edumazet@google.com>,
	kernel-team@cloudflare.com
Subject: Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
Date: Mon, 05 May 2025 14:35:59 +0200	[thread overview]
Message-ID: <871pt35j68.fsf@toke.dk> (raw)
In-Reply-To: <87a57r4azq.fsf@cloudflare.com>

Jakub Sitnicki <jakub@cloudflare.com> writes:

> On Thu, May 01, 2025 at 12:43 PM +02, Toke Høiland-Jørgensen wrote:
>> Jakub Sitnicki <jakub@cloudflare.com> writes:
>>
>>> On Wed, Apr 30, 2025 at 11:19 AM +02, Toke Høiland-Jørgensen wrote:
>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>>
>>>>> On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>>>>>
>>>>>> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
>>>>>> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>>
>>> [...]
>>>
>>>>>> * Hardware metadata: metadata exposed from NICs (like the receive
>>>>>>   timestamp, 4 tuple hash...) is currently only exposed to XDP programs
>>>>>>   (via kfuncs).
>>>>>>   But that doesn't expose them to the rest of the stack.
>>>>>>   Storing them in traits would allow XDP, other BPF programs, and the
>>>>>>   kernel to access and modify them (for example to into account
>>>>>>   decapsulating a packet).
>>>>>
>>>>> Sure. If traits == existing metadata bpf prog in xdp can communicate
>>>>> with bpf prog in skb layer via that "trait" format.
>>>>> xdp can take tuple hash and store it as key==0 in the trait.
>>>>> The kernel doesn't need to know how to parse that format.
>>>>
>>>> Yes it does, to propagate it to the skb later. I.e.,
>>>>
>>>> XDP prog on NIC: get HW hash, store in traits, redirect to CPUMAP
>>>> CPUMAP: build skb, read hash from traits, populate skb hash
>>>>
>>>> Same thing for (at least) timestamps and checksums.
>>>>
>>>> Longer term, with traits available we could move more skb fields into
>>>> traits to make struct sk_buff smaller (by moving optional fields to
>>>> traits that don't take up any space if they're not set).
>>>
>>> Perhaps we can have the cake and eat it too.
>>>
>>> We could leave the traits encoding/decoding out of the kernel and, at
>>> the same time, *expose it* to the network stack through BPF struct_ops
>>> programs. At a high level, for example ->get_rx_hash(), not the
>>> individual K/V access. The traits_ops vtable could grow as needed to
>>> support new use cases.
>>>
>>> If you think about it, it's not so different from BPF-powered congestion
>>> algorithms and scheduler extensions. They also expose some state, kept in
>>> maps, that only the loaded BPF code knows how to operate on.
>>
>> Right, the difference being that the kernel works perfectly well without
>> an eBPF congestion control algorithm loaded because it has its own
>> internal implementation that is used by default.
>
> It seems to me that any code path on the network stack still needs to
> work *even if* traits K/V is not available. There has to be a fallback -
> like, RX hash not present in traits K/V? must recompute it. There is no
> guarantee that there will be space available in the traits K/V store for
> whatever value the network stack would like to cache there.

The stack is in control of both the memory allocation and the placement
of the kv-store, so it could totally guarantee that if needed. Which is
the whole point of making this kernel-internal, IMO.

> So if we can agree that traits K/V is a cache, with limited capacity,
> and any code path accessing it must be prepared to deal with a cache
> miss, then I think with struct_ops approach you could have a built-in
> default implementation for exclusive use by the network stack.

If we have such a default implementation (which would presumably be the
one in this series), why would anyone override it? The need for that
seems a bit speculative, so why not just start with the one
implementation, and add the override if a really compelling use case
does eventually turn up?

> This default implementation of the storage access just wouldn't be
> exposed to the BPF or user-space. If you want access from BPF/userland,
> then you'd need to provide a BPF-backed struct_ops for accessing traits
> K/V.
>
>> Having a hard dependency on BPF for in-kernel functionality is a
>> different matter, and limits the cases it can be used for.
>
> Notice that we already rely on XDP program being attached or the storage
> for traits K/V is not available.
>
>> Besides, I don't really see the point of leaving the encoding out of the
>> kernel? We keep the encoding kernel-internal anyway, and just expose a
>> get/set API, so there's no constraint on changing it later (that's kinda
>> the whole point of doing that). And with bulk get/set there's not an
>> efficiency argument either. So what's the point, other than doing things
>> in BPF for its own sake?
>
> There's the additional complexity in the socket glue layer, but I've
> already mentioned that.
>
> What I think makes it even more appealing is that with the high-level
> struct_ops approach, we abstract away the individual K/V pair access and
> leave the problem of "key registration" (e.g., RX hash is key 42) to the
> user-provided implementation.
>
> You, as the user, decide for your particular system how you want to lay
> out the values and for which values you actually want to reserve
> space. IOW, we leave any trade off decisions to the user in the spirit
> of providing a mechanism, not policy.

But we already have such a possibility, that's basically the metadata
space that XDP/TC already has access to. And the whole reason why this
patch set makes sense is that we need something where the kernel
provides something more structured (K/V) that facilitates sharing across
applications that don't have central coordination across the system.
Punting that (back) off to BPF just gets us back to square one...

-Toke


  reply	other threads:[~2025-05-05 12:36 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata Arthur Fabre
2025-04-24 16:22   ` Alexei Starovoitov
2025-04-25 19:26     ` Arthur Fabre
2025-04-29 23:36       ` Alexei Starovoitov
2025-04-30  9:19         ` Toke Høiland-Jørgensen
2025-04-30 16:29           ` Alexei Starovoitov
2025-05-01  7:30             ` Arthur Fabre
2025-04-30 19:19           ` Jakub Sitnicki
2025-05-01 10:43             ` Toke Høiland-Jørgensen
2025-05-01 14:03               ` Jesper Dangaard Brouer
2025-05-05 10:18               ` Jakub Sitnicki
2025-05-05 12:35                 ` Toke Høiland-Jørgensen [this message]
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 02/17] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 03/17] trait: XDP support Arthur Fabre
2025-04-23  3:58   ` kernel test robot
2025-04-23  3:58   ` kernel test robot
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 04/17] trait: XDP selftest Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 05/17] trait: XDP benchmark Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 06/17] trait: Replace memcpy calls with inline copies Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 07/17] trait: Replace memmove calls with inline move Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 08/17] skb: Extension header in packet headroom Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 09/17] trait: Store traits in sk_buff extension Arthur Fabre
2025-04-23  4:50   ` kernel test robot
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb Arthur Fabre
2025-04-23 16:36   ` Stanislav Fomichev
2025-04-23 20:54     ` Arthur Fabre
2025-04-23 23:45       ` Stanislav Fomichev
2025-04-24  9:49         ` Toke Høiland-Jørgensen
2025-04-24 15:39           ` Stanislav Fomichev
2025-04-24 18:59             ` Jakub Sitnicki
2025-04-25  8:06               ` Toke Høiland-Jørgensen
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 11/17] ice: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 12/17] veth: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 13/17] virtio_net: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 14/17] mlx5: move xdp_buff scope one level up Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 15/17] mlx5: Propagate trait presence to skb Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 16/17] xdp generic: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 17/17] trait: Allow socket filters to access traits Arthur Fabre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871pt35j68.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=arthur@arthurfabre.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=jakub@cloudflare.com \
    --cc=jbrandeburg@cloudflare.com \
    --cc=kernel-team@cloudflare.com \
    --cc=kuba@kernel.org \
    --cc=lbiancon@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=yan@cloudflare.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.