Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>,
	Arthur Fabre <arthur@arthurfabre.com>
Cc: Network Development <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>, Jakub Sitnicki <jakub@cloudflare.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Yan Zhai <yan@cloudflare.com>,
	jbrandeburg@cloudflare.com, lbiancon@redhat.com,
	Alexei Starovoitov <ast@kernel.org>,
	Jakub Kicinski <kuba@kernel.org>,
	Eric Dumazet <edumazet@google.com>
Subject: Re: [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata
Date: Wed, 30 Apr 2025 11:19:29 +0200	[thread overview]
Message-ID: <87frhqnh0e.fsf@toke.dk> (raw)
In-Reply-To: <CAADnVQKe3Jfd+pVt868P32-m2a-moP4H7ms_kdZnrYALCxx53Q@mail.gmail.com>

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Apr 25, 2025 at 12:27 PM Arthur Fabre <arthur@arthurfabre.com> wrote:
>>
>> On Thu Apr 24, 2025 at 6:22 PM CEST, Alexei Starovoitov wrote:
>> > On Tue, Apr 22, 2025 at 6:23 AM Arthur Fabre <arthur@arthurfabre.com> wrote:
>> > >
>> > > +/**
>> > > + * trait_set() - Set a trait key.
>> > > + * @traits: Start of trait store area.
>> > > + * @hard_end: Hard limit the trait store can currently grow up against.
>> > > + * @key: The key to set.
>> > > + * @val: The value to set.
>> > > + * @len: The length of the value.
>> > > + * @flags: Unused for now. Should be 0.
>> > > + *
>> > > + * Return:
>> > > + * * %0       - Success.
>> > > + * * %-EINVAL - Key or length invalid.
>> > > + * * %-EBUSY  - Key previously set with different length.
>> > > + * * %-ENOSPC - Not enough room left to store value.
>> > > + */
>> > > +static __always_inline
>> > > +int trait_set(void *traits, void *hard_end, u64 key, const void *val, u64 len, u64 flags)
>> > > +{
>> > > +       if (!__trait_valid_key(key) || !__trait_valid_len(len))
>> > > +               return -EINVAL;
>> > > +
>> > > +       struct __trait_hdr *h = (struct __trait_hdr *)traits;
>> > > +
>> > > +       /* Offset of value of this key. */
>> > > +       int off = __trait_offset(*h, key);
>> > > +
>> > > +       if ((h->high & (1ull << key)) || (h->low & (1ull << key))) {
>> > > +               /* Key is already set, but with a different length */
>> > > +               if (__trait_total_length(__trait_and(*h, (1ull << key))) != len)
>> > > +                       return -EBUSY;
>> > > +       } else {
>> > > +               /* Figure out if we have enough room left: total length of everything now. */
>> > > +               if (traits + sizeof(struct __trait_hdr) + __trait_total_length(*h) + len > hard_end)
>> > > +                       return -ENOSPC;
>> >
>> > I'm still not excited about having two metadata-s
>> > in front of the packet.
>> > Why cannot traits use the same metadata space ?
>> >
>> > For trait_set() you already pass hard_end and have to check it
>> > at run-time.
>> > If you add the same hard_end to trait_get/del the kfuncs will deal
>> > with possible corruption of metadata by the program.
>> > Transition from xdp to skb will be automatic. The code won't care that traits
>> > are there. It will just copy all metadata from xdp to skb. Corrupted or not.
>> > bpf progs in xdp and skb might even use the same kfuncs
>> > (or two different sets if the verifier is not smart enough right now).
>>
>> Good idea, that would solve the corruption problem.
>>
>> But I think storing metadata at the "end" of the headroom (ie where
>> XDP metadata is today) makes it harder to persist in the SKB layer.
>> Functions like __skb_push() assume that skb_headroom() bytes are
>> available just before skb->data.
>>
>> They can be updated to move XDP metadata out of the way, but this
>> assumption seems pretty pervasive.
>
> The same argument can be flipped.
> Why does the skb layer need to push?
> If it needs to encapsulate it will forward to tunnel device
> to go on the wire. At this point any kind of metadata is going
> to be lost on the wire. bpf prog would need to convert
> metadata into actual on the wire format or stash it
> or send to user space.
> I don't see a use case where skb layer would move medadata by N
> bytes, populate these N bytes with "???" and pass to next skb layer.
> skb layers strip (pop) the header when it goes from ip to tcp to user space.
> No need to move metadata.
>
>> By using the "front" of the headroom, we can hide that from the rest of
>> the SKB code. We could even update skb->head to completely hide the
>> space used at the front of the headroom.
>> It also avoids the cost of moving the metadata around (but maybe that
>> would be insignificant).
>
> That's a theory. Do you actually have skb layers pushing things
> while metadata is there?

Erm, any encapsulation? UDP tunnels, wireguard, WiFi, etc. There are
almost 1000 calls to skb_push() all over the kernel. One of the primary
use cases for traits is to tag a packet with an ID to follow it
throughout its lifetime inside the kernel. This absolutely includes
encapsulation; in fact, having the tracing ID survive these kinds of
transformations is one of the primary motivators for this work.

>> XDP metadata also doesn't work well with GRO (see below).
>>
>> > Ideally we add hweight64 as new bpf instructions then maybe
>> > we won't need any kernel changes at all.
>> > bpf side will do:
>> > bpf_xdp_adjust_meta(xdp, -max_offset_for_largest_key);
>> > and then
>> > trait_set(xdp->data_meta /* pointer to trait header */, xdp->data /*
>> > hard end */, ...);
>> > can be implemented as bpf prog.
>> >
>> > Same thing for skb progs.
>> > netfilter/iptable can use another bpf prog to make decisions.
>>
>> There are (at least) two reasons for wanting the kernel to understand the
>> format:
>>
>> * GRO: With an opaque binary blob, the kernel can either forbid GRO when
>>   the metadata differs (like XDP metadata today), or keep the entire blob
>>   of one of the packets.
>>   But maybe some users will want to keep a KV of the first packet, or
>>   the last packet, eg for receive timestamps.
>>   With a KV store we can have a sane default option for merging the
>>   different KV stores, and even add a per KV policy in the future if
>>   needed.
>
> We can have this default for metadata too.
> If all bytes in the metadata blob are the same -> let it GRO.

But that is exactly what we do with metadata today, and it is limiting
use cases. For instance, timestamps are going to differ, but we don't
want that to block GRO, we just want to keep one of the timestamps. So
being able to set a "GRO policy" *per key in the KV-store* is useful.

> I don't think it's a good idea to extend GRO with bpf to make it "smart".

That would be a separate project; for this, the proposal is to make the
kernel understand a couple of different policies (block GRO, keep first,
keep last, discard) per key.

>> * Hardware metadata: metadata exposed from NICs (like the receive
>>   timestamp, 4 tuple hash...) is currently only exposed to XDP programs
>>   (via kfuncs).
>>   But that doesn't expose them to the rest of the stack.
>>   Storing them in traits would allow XDP, other BPF programs, and the
>>   kernel to access and modify them (for example to into account
>>   decapsulating a packet).
>
> Sure. If traits == existing metadata bpf prog in xdp can communicate
> with bpf prog in skb layer via that "trait" format.
> xdp can take tuple hash and store it as key==0 in the trait.
> The kernel doesn't need to know how to parse that format.

Yes it does, to propagate it to the skb later. I.e.,

XDP prog on NIC: get HW hash, store in traits, redirect to CPUMAP
CPUMAP: build skb, read hash from traits, populate skb hash

Same thing for (at least) timestamps and checksums.

Longer term, with traits available we could move more skb fields into
traits to make struct sk_buff smaller (by moving optional fields to
traits that don't take up any space if they're not set).

-Toke

next prev parent reply	other threads:[~2025-04-30  9:19 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-22 13:23 [PATCH RFC bpf-next v2 00/17] traits: Per packet metadata KV store Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 01/17] trait: limited KV store for packet metadata Arthur Fabre
2025-04-24 16:22   ` Alexei Starovoitov
2025-04-25 19:26     ` Arthur Fabre
2025-04-29 23:36       ` Alexei Starovoitov
2025-04-30  9:19         ` Toke Høiland-Jørgensen [this message]
2025-04-30 16:29           ` Alexei Starovoitov
2025-05-01  7:30             ` Arthur Fabre
2025-04-30 19:19           ` Jakub Sitnicki
2025-05-01 10:43             ` Toke Høiland-Jørgensen
2025-05-01 14:03               ` Jesper Dangaard Brouer
2025-05-05 10:18               ` Jakub Sitnicki
2025-05-05 12:35                 ` Toke Høiland-Jørgensen
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 02/17] xdp: Track if metadata is supported in xdp_frame <> xdp_buff conversions Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 03/17] trait: XDP support Arthur Fabre
2025-04-23  3:58   ` kernel test robot
2025-04-23  3:58   ` kernel test robot
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 04/17] trait: XDP selftest Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 05/17] trait: XDP benchmark Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 06/17] trait: Replace memcpy calls with inline copies Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 07/17] trait: Replace memmove calls with inline move Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 08/17] skb: Extension header in packet headroom Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 09/17] trait: Store traits in sk_buff extension Arthur Fabre
2025-04-23  4:50   ` kernel test robot
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 10/17] bnxt: Propagate trait presence to skb Arthur Fabre
2025-04-23 16:36   ` Stanislav Fomichev
2025-04-23 20:54     ` Arthur Fabre
2025-04-23 23:45       ` Stanislav Fomichev
2025-04-24  9:49         ` Toke Høiland-Jørgensen
2025-04-24 15:39           ` Stanislav Fomichev
2025-04-24 18:59             ` Jakub Sitnicki
2025-04-25  8:06               ` Toke Høiland-Jørgensen
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 11/17] ice: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 12/17] veth: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 13/17] virtio_net: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 14/17] mlx5: move xdp_buff scope one level up Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 15/17] mlx5: Propagate trait presence to skb Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 16/17] xdp generic: " Arthur Fabre
2025-04-22 13:23 ` [PATCH RFC bpf-next v2 17/17] trait: Allow socket filters to access traits Arthur Fabre

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87frhqnh0e.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=arthur@arthurfabre.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=jakub@cloudflare.com \
    --cc=jbrandeburg@cloudflare.com \
    --cc=kuba@kernel.org \
    --cc=lbiancon@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=yan@cloudflare.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.