public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
From: Lorenzo Bianconi <lorenzo@kernel.org>
To: Arthur Fabre <afabre@cloudflare.com>
Cc: "Toke Høiland-Jørgensen" <toke@redhat.com>,
	"Lorenzo Bianconi" <lorenzo.bianconi@redhat.com>,
	"Jesper Dangaard Brouer" <hawk@kernel.org>,
	"Jakub Sitnicki" <jakub@cloudflare.com>,
	"Alexander Lobakin" <aleksander.lobakin@intel.com>,
	bpf@vger.kernel.org, netdev@vger.kernel.org, ast@kernel.org,
	daniel@iogearbox.net, davem@davemloft.net, kuba@kernel.org,
	john.fastabend@gmail.com, edumazet@google.com, pabeni@redhat.com,
	sdf@fomichev.me, tariqt@nvidia.com, saeedm@nvidia.com,
	anthony.l.nguyen@intel.com, przemyslaw.kitszel@intel.com,
	intel-wired-lan@lists.osuosl.org, mst@redhat.com,
	jasowang@redhat.com, mcoquelin.stm32@gmail.com,
	alexandre.torgue@foss.st.com,
	kernel-team <kernel-team@cloudflare.com>,
	"Yan Zhai" <yan@cloudflare.com>
Subject: Re: [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT
Date: Fri, 27 Sep 2024 17:06:53 +0200	[thread overview]
Message-ID: <ZvbKDT-2xqx2unrx@lore-rh-laptop> (raw)
In-Reply-To: <D4H5CAN4O95E.3KF8LAH75FYD4@bobby>

[-- Attachment #1: Type: text/plain, Size: 9117 bytes --]

On Sep 27, Arthur Fabre wrote:
> On Fri Sep 27, 2024 at 12:24 PM CEST, Toke Høiland-Jørgensen wrote:
> > "Arthur Fabre" <afabre@cloudflare.com> writes:
> >
> > >> >> The nice thing about an API like this, though, is that it's extensible,
> > >> >> and the kernel itself can be just another consumer of it for the
> > >> >> metadata fields Lorenzo is adding in this series. I.e., we could just
> > >> >> pre-define some IDs for metadata vlan, timestamp etc, and use the same
> > >> >> functions as above from within the kernel to set and get those values;
> > >> >> using the registry, there could even be an option to turn those off if
> > >> >> an application wants more space for its own usage. Or, alternatively, we
> > >> >> could keep the kernel-internal IDs hardcoded and always allocated, and
> > >> >> just use the getter/setter functions as the BPF API for accessing them.
> > >> >
> > >> > That's exactly what I'm thinking of too, a simple API like:
> > >> >
> > >> > get(u8 key, u8 len, void *val);
> > >> > set(u8 key, u8 len, void *val);
> > >> >
> > >> > With "well-known" keys like METADATA_ID_HW_HASH for hardware metadata.
> > >> >
> > >> > If a NIC doesn't support a certain well-known metadata, the key
> > >> > wouldn't be set, and get() would return ENOENT.
> > >> >
> > >> > I think this also lets us avoid having to "register" keys or bits of
> > >> > metadata with the kernel.
> > >> > We'd reserve some number of keys for hardware metadata.
> > >>
> > >> Right, but how do you allocate space/offset for each key without an
> > >> explicit allocation step? You'd basically have to encode the list of IDs
> > >> in the metadata area itself, which implies a TLV format that you have to
> > >> walk on every access? The registry idea in my example above was
> > >> basically to avoid that...
> > >
> > > I've been playing around with having a small fixed header at the front
> > > of the metadata itself, that lets you access values without walking them
> > > all.
> > >
> > > Still WIP, and maybe this is too restrictive, but it lets you encode 64
> > > 2, 4, or 8 byte KVs with a single 16 byte header:
> >
> > Ohh, that's clever, I like it! :)
> >
> > It's also extensible in the sense that the internal representation can
> > change without impacting the API, so if we end up needing more bits we
> > can always add those.
> >
> > Maybe it would be a good idea to make the 'key' parameter a larger
> > integer type (function parameters are always 64-bit anyway, so might as
> > well go all the way up to u64)? That way we can use higher values for
> > the kernel-reserved types instead of reserving part of the already-small
> > key space for applications (assuming the kernel-internal data is stored
> > somewhere else, like in a static struct as in Lorenzo's patch)?
> 
> Good idea! That makes it more extensible too if we ever support more
> keys or bigger lengths.
> 
> I'm not sure where the kernel-reserved types should live. Putting them
> in here uses up some the of KV IDs, but it uses less head room space than 
> always reserving a static struct for them.
> Maybe it doesn't matter much, as long as we use the same API to access
> them, we could internally switch between one and the other.
> 
> >
> > [...]
> >
> > >> > The remaining keys would be up to users. They'd have to allocate keys
> > >> > to services, and configure services to use those keys.
> > >> > This is similar to the way listening on a certain port works: only one
> > >> > service can use port 80 or 443, and that can typically beconfigured in
> > >> > a service's config file.
> > >>
> > >> Right, well, port numbers *do* actually have an out of band service
> > >> registry (IANA), which I thought was what we wanted to avoid? ;)
> > >
> > > Depends how you think about it ;)
> > >
> > > I think we should avoid a global registry. But having a registry per
> > > deployment / server doesn't seem awful. Services that want to use a
> > > field would have a config file setting to set which index it corresponds
> > > to.
> > > Admins would just have to pick a free one on their system, and set it in
> > > the config file of the service.
> > >
> > > This is similar to what we do for non-IANA registered ports internally.
> > > For example each service needs a port on an internal interface to expose
> > > metrics, and we just track which ports are taken in our config
> > > management.
> >
> > Right, this would work, but it assumes that applications are
> > well-behaved and do this correctly. Which they probably do in a
> > centrally-managed system like yours, but for random applications shipped
> > by distros, I'm not sure if it's going to work.
> >
> > In fact, it's more or less the situation we have with skb->mark today,
> > isn't it? I.e., applications can co-exist as long as they don't use the
> > same bits, so they have to coordinate on which bits to use. Sure, with
> > this scheme there will be more total bits to use, which can lessen the
> > pressure somewhat, but the basic problem remains. In other words, I
> > worry that in practice we will end up with another github repository
> > serving as a registry for metadata keys...
> 
> That's true. If applications hardcode keys, we'll be back to having
> conflicts. If someone creates a registry on github I'll be very sad.
> 
> (Maybe we can make the verifier enforce that the key passed to get() and
> set() isn't a constant? - only joking)
> 
> Applications don't tend to do this for ports though, I think most can be
> configured to listen on any port. Is that just because it's been
> instilled as "good practice" over time? Could we try to do the same with
> some very stern documentation and examples?
> 
> Thinking about it more, my only relectance for a registration API is how
> to communicate the ID back to other consumers (our discussion below).
> 
> >
> > > Dynamically registering fields means you have to share the returned ID
> > > with whoever is interested, which sounds tricky.
> > > If an XDP program sets a field like packet_id, every tracing
> > > program that looks at it, and userspace service, would need to know what
> > > the ID of that field is.
> > > Is there a way to easily share that ID with all of them?
> >
> > Right, so I'll admit this was one of the handwavy bits of my original
> > proposal, but I don't think it's unsolvable. You could do something like
> > (once, on application initialisation):
> >
> > __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name for introspection? */
> > bpf_map_update(&shared_application_config, &my_key_index, &my_key);
> >
> > and then just get the key out of that map from all programs that want to
> > use it?
> 
> Passing it out of band works (whether it's via a pinned map like you
> described, or through other means like a unix socket or some other
> API), it's just more complicated.
> 
> Every consumer also needs to know about that API. That won't work with
> standard tools. For example if we set a PACKET_ID KV, maybe we could
> give it to pwru so it could track packets using it?
> Without registering keys, we could pass it as a cli flag. With
> registration, we'd have to have some helper to get the KV ID.
> 
> It also introduces ordering dependencies between the services on
> startup, eg packet tracing hooks could only be attached once our XDP
> service has registered a PACKET_ID KV, and they could query it's API.
> 
> >
> > We could combine such a registration API with your header format, so
> > that the registration just becomes a way of allocating one of the keys
> > from 0-63 (and the registry just becomes a global copy of the header).
> > This would basically amount to moving the "service config file" into the
> > kernel, since that seems to be the only common denominator we can rely
> > on between BPF applications (as all attempts to write a common daemon
> > for BPF management have shown).
> 
> That sounds reasonable. And I guess we'd have set() check the global
> registry to enforce that the key has been registered beforehand?
> 
> >
> > -Toke
> 
> Thanks for all the feedback!

I like this 'fast' KV approach but I guess we should really evaluate its
impact on performances (especially for xdp) since, based on the kfunc calls
order in the ebpf program, we can have one or multiple memmove/memcpy for
each packet, right?

Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
so suitable for nic hw metadata since:
- it grows backward 
- it is probably in a different cacheline with respect to xdp_frame
- nic hw metadata will not start at fixed and immutable address, but it depends
  on the running ebpf program

What about having something like:
- fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
  of the metadata area :)). Here he can reuse the same KV approach if it is fast
- user defined metadata: in the metadata area of the xdp_frame/xdp_buff

Regards,
Lorenzo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

  reply	other threads:[~2024-09-27 15:06 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-21 16:52 [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT Lorenzo Bianconi
2024-09-21 16:52 ` [RFC bpf-next 1/4] net: xdp: Add xdp_rx_meta structure Lorenzo Bianconi
2024-09-21 16:52 ` [RFC bpf-next 2/4] net: xdp: Update rx_hash of xdp_rx_meta struct running xmo_rx_hash callback Lorenzo Bianconi
2024-09-21 16:52 ` [RFC bpf-next 3/4] net: xdp: Update rx_vlan of xdp_rx_meta struct running xmo_rx_vlan_tag callback Lorenzo Bianconi
2024-09-21 16:53 ` [RFC bpf-next 4/4] net: xdp: Update rx timestamp of xdp_rx_meta struct running xmo_rx_timestamp callback Lorenzo Bianconi
2024-09-21 20:17 ` [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT Alexander Lobakin
2024-09-21 21:36   ` Jesper Dangaard Brouer
2024-09-22  9:17     ` Lorenzo Bianconi
2024-09-22 11:12       ` Toke Høiland-Jørgensen
2024-09-22 15:40         ` Lorenzo Bianconi
2024-09-26 10:54           ` Toke Høiland-Jørgensen
2024-09-26 14:57             ` Lorenzo Bianconi
2024-09-27  1:43               ` Stanislav Fomichev
2024-09-26 11:31         ` Arthur Fabre
2024-09-26 12:41           ` Toke Høiland-Jørgensen
2024-09-26 15:44             ` Arthur Fabre
2024-09-27 10:24               ` Toke Høiland-Jørgensen
2024-09-27 14:46                 ` Arthur Fabre
2024-09-27 15:06                   ` Lorenzo Bianconi [this message]
2024-09-30 10:58                     ` Toke Høiland-Jørgensen
2024-09-30 11:49                       ` Lorenzo Bianconi
2024-10-01 14:16                         ` Arthur Fabre
2024-10-01 14:54                           ` Lorenzo Bianconi
2024-10-01 15:14                             ` Toke Høiland-Jørgensen
2024-10-02 17:02                               ` Stanislav Fomichev
2024-10-02 18:38                                 ` Toke Høiland-Jørgensen
2024-10-02 22:49                                   ` Stanislav Fomichev
2024-10-03  6:35                                     ` Arthur Fabre
2024-10-03 20:26                                       ` Stanislav Fomichev
2024-10-04  2:13                                         ` Daniel Xu
2024-10-04 10:38                                           ` Jesper Dangaard Brouer
2024-10-04 13:55                                             ` Arthur Fabre
2024-10-04 14:14                                               ` Jesper Dangaard Brouer
2024-10-04 14:18                                                 ` Lorenzo Bianconi
2024-10-04 14:29                                                   ` Arthur Fabre
2024-10-04 17:53                                             ` Stanislav Fomichev
2024-10-06 10:27                                               ` Toke Høiland-Jørgensen
2024-10-07 18:48                                                 ` Stanislav Fomichev
2024-10-08  7:15                                                   ` Arthur Fabre
2024-10-04 16:27                                           ` Stanislav Fomichev
2024-09-30 10:52                   ` Toke Høiland-Jørgensen
2024-10-01 14:06                     ` Arthur Fabre
2024-10-01 15:28                       ` Toke Høiland-Jørgensen
2024-10-03  6:51                         ` Arthur Fabre
2024-09-22  9:08   ` Lorenzo Bianconi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZvbKDT-2xqx2unrx@lore-rh-laptop \
    --to=lorenzo@kernel.org \
    --cc=afabre@cloudflare.com \
    --cc=aleksander.lobakin@intel.com \
    --cc=alexandre.torgue@foss.st.com \
    --cc=anthony.l.nguyen@intel.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=hawk@kernel.org \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=jakub@cloudflare.com \
    --cc=jasowang@redhat.com \
    --cc=john.fastabend@gmail.com \
    --cc=kernel-team@cloudflare.com \
    --cc=kuba@kernel.org \
    --cc=lorenzo.bianconi@redhat.com \
    --cc=mcoquelin.stm32@gmail.com \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=przemyslaw.kitszel@intel.com \
    --cc=saeedm@nvidia.com \
    --cc=sdf@fomichev.me \
    --cc=tariqt@nvidia.com \
    --cc=toke@redhat.com \
    --cc=yan@cloudflare.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox