BPF List
 help / color / mirror / Atom feed
From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Arthur Fabre <afabre@cloudflare.com>
Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Jakub Sitnicki <jakub@cloudflare.com>,
	Alexander Lobakin <aleksander.lobakin@intel.com>,
	Lorenzo Bianconi <lorenzo@kernel.org>,
	bpf@vger.kernel.org, netdev@vger.kernel.org, ast@kernel.org,
	daniel@iogearbox.net, davem@davemloft.net, kuba@kernel.org,
	john.fastabend@gmail.com, edumazet@google.com, pabeni@redhat.com,
	sdf@fomichev.me, tariqt@nvidia.com, saeedm@nvidia.com,
	anthony.l.nguyen@intel.com, przemyslaw.kitszel@intel.com,
	intel-wired-lan@lists.osuosl.org, mst@redhat.com,
	jasowang@redhat.com, mcoquelin.stm32@gmail.com,
	alexandre.torgue@foss.st.com,
	kernel-team <kernel-team@cloudflare.com>,
	Yan Zhai <yan@cloudflare.com>
Subject: Re: [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT
Date: Thu, 26 Sep 2024 14:41:16 +0200	[thread overview]
Message-ID: <87wmiysi37.fsf@toke.dk> (raw)
In-Reply-To: <CAOn4ftshf3pyAst27C2haaSj4eR2n34_pcwWBc5o3zHBkwRb3g@mail.gmail.com>

Arthur Fabre <afabre@cloudflare.com> writes:

> On Sun, Sep 22, 2024 at 1:12 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> FYI, we also had a discussion related to this at LPC on Friday, in this
>> session: https://lpc.events/event/18/contributions/1935/
>>
>> The context here was that Arthur and Jakub want to also support extended
>> rich metadata all the way through the SKB path, and are looking at the
>> same area used for XDP metadata to store it. So there's a need to manage
>> both the kernel's own usage of that area, and userspace/BPF usage of it.
>>
>> I'll try to summarise some of the points of that discussion (all
>> interpretations are my own, of course):
>>
>> - We want something that can be carried with a frame all the way from
>>   the XDP layer, through all SKB layers and to userspace (to replace the
>>   use of skb->mark for this purpose).
>>
>> - We want different applications running on the system (of which the
>>   kernel itself if one, cf this discussion) to be able to share this
>>   field, without having to have an out of band registry (like a Github
>>   repository where applications can agree on which bits to use). Which
>>   probably means that the kernel needs to be in the loop somehow to
>>   explicitly allocate space in the metadata area and track offsets.
>>
>> - Having an explicit API to access this from userspace, without having
>>   to go through BPF (i.e., a socket- or CMSG-based API) would be useful.
>>
>
> Thanks for looping us in, and the great summary Toke!

You're welcome :)

>> The TLV format was one of the suggestions in Arthur and Jakub's talk,
>> but AFAICT, there was not a lot of enthusiasm about this in the room
>> (myself included), because of the parsing overhead and complexity. I
>> believe the alternative that was seen as most favourable was a map
>> lookup-style API, where applications can request a metadata area of
>> arbitrary size and get an ID assigned that they can then use to set/get
>> values in the data path.
>>
>> So, sketching this out, this could be realised by something like:
>>
>> /* could be called from BPF, or through netlink or sysfs; may fail, if
>>  * there is no more space
>>  */
>> int metadata_id = register_packet_metadata_field(sizeof(struct my_meta));
>>
>> The ID is just an opaque identifier that can then be passed to
>> getter/setter functions (for both SKB and XDP), like:
>>
>> ret = bpf_set_packet_metadata_field(pkt, metadata_id,
>>                                     &my_meta_value, sizeof(my_meta_value))
>>
>> ret = bpf_get_packet_metadata_field(pkt, metadata_id,
>>                                     &my_meta_value, sizeof(my_meta_value))
>>
>>
>> On the kernel side, the implementation would track registered fields in
>> a global structure somewhere, say:
>>
>> struct pkt_metadata_entry {
>>   int id;
>>   u8 sz;
>>   u8 offset;
>>   u8 bit;
>> };
>>
>> struct pkt_metadata_registry { /* allocated as a system-wide global */
>>   u8 num_entries;
>>   u8 total_size;
>>   struct pkt_metadata_entry entries[MAX_ENTRIES];
>> };
>>
>> struct xdp_rx_meta { /* at then end of xdp_frame */
>>   u8 sz; /* set to pkt_metadata_registry->total_size on alloc */
>>   u8 fields_set; /* bitmap of fields that have been set, see below */
>>   u8 data[];
>> };
>>
>> int register_packet_metadata_field(u8 size) {
>>   struct pkt_metadata_registry *reg = get_global_registry();
>>   struct pkt_metadata_entry *entry;
>>
>>   if (size + reg->total_size > MAX_METADATA_SIZE)
>>     return -ENOSPC;
>>
>>   entry = &reg->entries[reg->num_entries++];
>>   entry->id = assign_id();
>>   entry->sz = size;
>>   entry->offset = reg->total_size;
>>   entry->bit = reg->num_entries - 1;
>>   reg->total_size += size;
>>
>>   return entry->id;
>> }
>>
>> int bpf_set_packet_metadata_field(struct xdp_frame *frm, int id, void
>>                                   *value, size_t sz)
>> {
>>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>>
>>   if (!entry)
>>     return -ENOENT;
>>
>>   if (entry->sz != sz)
>>     return -EINVAL; /* user error */
>>
>>   if (frm->rx_meta.sz < entry->offset + sz)
>>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
>>
>>   memcpy(&frm->rx_meta.data + entry->offset, value, sz);
>>   frm->rx_meta.fields_set |= BIT(entry->bit);
>>
>>   return 0;
>> }
>>
>> int bpf_get_packet_metadata_field(struct xdp_frame *frm, int id, void
>>                                   *value, size_t sz)
>> {
>>   struct pkt_metadata_entry *entry = get_metadata_entry_by_id(id);
>>
>>   if (!entry)
>>     return -ENOENT;
>>
>>   if (entry->sz != sz)
>>     return -EINVAL;
>>
>> if (frm->rx_meta.sz < entry->offset + sz)
>>     return -EFAULT; /* entry allocated after xdp_frame was initialised */
>>
>>   if (!(frm->rx_meta.fields_set & BIT(entry->bit)))
>>     return -ENOENT;
>>
>>   memcpy(value, &frm->rx_meta.data + entry->offset, sz);
>>
>>   return 0;
>> }
>>
>> I'm hinting at some complications here (with the EFAULT return) that
>> needs to be resolved: there is no guarantee that a given packet will be
>> in sync with the current status of the registered metadata, so we need
>> explicit checks for this. If metadata entries are de-registered again
>> this also means dealing with holes and/or reshuffling the metadata
>> layout to reuse the released space (incidentally, this is the one place
>> where a TLV format would have advantages).
>>
>> The nice thing about an API like this, though, is that it's extensible,
>> and the kernel itself can be just another consumer of it for the
>> metadata fields Lorenzo is adding in this series. I.e., we could just
>> pre-define some IDs for metadata vlan, timestamp etc, and use the same
>> functions as above from within the kernel to set and get those values;
>> using the registry, there could even be an option to turn those off if
>> an application wants more space for its own usage. Or, alternatively, we
>> could keep the kernel-internal IDs hardcoded and always allocated, and
>> just use the getter/setter functions as the BPF API for accessing them.
>
> That's exactly what I'm thinking of too, a simple API like:
>
> get(u8 key, u8 len, void *val);
> set(u8 key, u8 len, void *val);
>
> With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
>
> If a NIC doesn't support a certain well-known metadata, the key
> wouldn't be set, and get() would return ENOENT.
>
> I think this also lets us avoid having to "register" keys or bits of
> metadata with the kernel.
> We'd reserve some number of keys for hardware metadata.

Right, but how do you allocate space/offset for each key without an
explicit allocation step? You'd basically have to encode the list of IDs
in the metadata area itself, which implies a TLV format that you have to
walk on every access? The registry idea in my example above was
basically to avoid that...

> The remaining keys would be up to users. They'd have to allocate keys
> to services, and configure services to use those keys.
> This is similar to the way listening on a certain port works: only one
> service can use port 80 or 443, and that can typically beconfigured in
> a service's config file.

Right, well, port numbers *do* actually have an out of band service
registry (IANA), which I thought was what we wanted to avoid? ;)

> This side-steps the whole question of how to change the registered
> metadata for in-flight packets, and how to deal with different NICs
> with different hardware metadata.
>
> I think I've figured out a suitable encoding format, hopefully we'll
> have an RFC soon!

Alright, cool!

-Toke


  reply	other threads:[~2024-09-26 12:41 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-21 16:52 [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT Lorenzo Bianconi
2024-09-21 16:52 ` [RFC bpf-next 1/4] net: xdp: Add xdp_rx_meta structure Lorenzo Bianconi
2024-09-21 16:52 ` [RFC bpf-next 2/4] net: xdp: Update rx_hash of xdp_rx_meta struct running xmo_rx_hash callback Lorenzo Bianconi
2024-09-21 16:52 ` [RFC bpf-next 3/4] net: xdp: Update rx_vlan of xdp_rx_meta struct running xmo_rx_vlan_tag callback Lorenzo Bianconi
2024-09-21 16:53 ` [RFC bpf-next 4/4] net: xdp: Update rx timestamp of xdp_rx_meta struct running xmo_rx_timestamp callback Lorenzo Bianconi
2024-09-21 20:17 ` [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT Alexander Lobakin
2024-09-21 21:36   ` Jesper Dangaard Brouer
2024-09-22  9:17     ` Lorenzo Bianconi
2024-09-22 11:12       ` Toke Høiland-Jørgensen
2024-09-22 15:40         ` Lorenzo Bianconi
2024-09-26 10:54           ` Toke Høiland-Jørgensen
2024-09-26 14:57             ` Lorenzo Bianconi
2024-09-27  1:43               ` Stanislav Fomichev
2024-09-26 11:31         ` Arthur Fabre
2024-09-26 12:41           ` Toke Høiland-Jørgensen [this message]
2024-09-26 15:44             ` Arthur Fabre
2024-09-27 10:24               ` Toke Høiland-Jørgensen
2024-09-27 14:46                 ` Arthur Fabre
2024-09-27 15:06                   ` Lorenzo Bianconi
2024-09-30 10:58                     ` Toke Høiland-Jørgensen
2024-09-30 11:49                       ` Lorenzo Bianconi
2024-10-01 14:16                         ` Arthur Fabre
2024-10-01 14:54                           ` Lorenzo Bianconi
2024-10-01 15:14                             ` Toke Høiland-Jørgensen
2024-10-02 17:02                               ` Stanislav Fomichev
2024-10-02 18:38                                 ` Toke Høiland-Jørgensen
2024-10-02 22:49                                   ` Stanislav Fomichev
2024-10-03  6:35                                     ` Arthur Fabre
2024-10-03 20:26                                       ` Stanislav Fomichev
2024-10-04  2:13                                         ` Daniel Xu
2024-10-04 10:38                                           ` Jesper Dangaard Brouer
2024-10-04 13:55                                             ` Arthur Fabre
2024-10-04 14:14                                               ` Jesper Dangaard Brouer
2024-10-04 14:18                                                 ` Lorenzo Bianconi
2024-10-04 14:29                                                   ` Arthur Fabre
2024-10-04 17:53                                             ` Stanislav Fomichev
2024-10-06 10:27                                               ` Toke Høiland-Jørgensen
2024-10-07 18:48                                                 ` Stanislav Fomichev
2024-10-08  7:15                                                   ` Arthur Fabre
2024-10-04 16:27                                           ` Stanislav Fomichev
2024-09-30 10:52                   ` Toke Høiland-Jørgensen
2024-10-01 14:06                     ` Arthur Fabre
2024-10-01 15:28                       ` Toke Høiland-Jørgensen
2024-10-03  6:51                         ` Arthur Fabre
2024-09-22  9:08   ` Lorenzo Bianconi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87wmiysi37.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=afabre@cloudflare.com \
    --cc=aleksander.lobakin@intel.com \
    --cc=alexandre.torgue@foss.st.com \
    --cc=anthony.l.nguyen@intel.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=jakub@cloudflare.com \
    --cc=jasowang@redhat.com \
    --cc=john.fastabend@gmail.com \
    --cc=kernel-team@cloudflare.com \
    --cc=kuba@kernel.org \
    --cc=lorenzo.bianconi@redhat.com \
    --cc=lorenzo@kernel.org \
    --cc=mcoquelin.stm32@gmail.com \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=przemyslaw.kitszel@intel.com \
    --cc=saeedm@nvidia.com \
    --cc=sdf@fomichev.me \
    --cc=tariqt@nvidia.com \
    --cc=yan@cloudflare.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox