public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
From: Stanislav Fomichev <sdf@google.com>
To: "Toke Høiland-Jørgensen" <toke@kernel.org>
Cc: lsf-pc@lists.linux-foundation.org, bpf@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
Date: Tue, 28 Feb 2023 15:02:28 -0800	[thread overview]
Message-ID: <Y/6IBI8hlyI3DcUz@google.com> (raw)
In-Reply-To: <87y1ohh282.fsf@toke.dk>

On 02/28, Toke H�iland-J�rgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:

> > On Mon, Feb 27, 2023 at 3:54 PM Toke H�iland-J�rgensen  
> <toke@kernel.org> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On Mon, Feb 27, 2023 at 6:17 AM Toke H�iland-J�rgensen  
> <toke@kernel.org> wrote:
> >> >>
> >> >> Stanislav Fomichev <sdf@google.com> writes:
> >> >>
> >> >> > On Thu, Feb 23, 2023 at 3:22 PM Toke H�iland-J�rgensen  
> <toke@kernel.org> wrote:
> >> >> >>
> >> >> >> Stanislav Fomichev <sdf@google.com> writes:
> >> >> >>
> >> >> >> > I'd like to discuss a potential follow up for the  
> previous "XDP RX
> >> >> >> > metadata" series [0].
> >> >> >> >
> >> >> >> > Now that we can access (a subset of) packet metadata at RX,  
> I'd like to
> >> >> >> > explore the options where we can export some of that metadata  
> on TX. And
> >> >> >> > also whether it might be possible to access some of the TX  
> completion
> >> >> >> > metadata (things like TX timestamp).
> >> >> >> >
> >> >> >> > I'm currently trying to understand whether the same approach  
> I've used
> >> >> >> > on RX could work at TX. By May I plan to have a bunch of  
> options laid
> >> >> >> > out (currently considering XSK tx/compl programs and XDP  
> tx/compl
> >> >> >> > programs) so we have something to discuss.
> >> >> >>
> >> >> >> I've been looking at ways of getting a TX-completion hook for  
> the XDP
> >> >> >> queueing stuff as well. For that, I think it could work to just  
> hook
> >> >> >> into xdp_return_frame(), but if you want to access hardware  
> metadata
> >> >> >> it'll obviously have to be in the driver. A hook in the driver  
> could
> >> >> >> certainly be used for the queueing return as well, though, which  
> may
> >> >> >> help making it worth the trouble :)
> >> >> >
> >> >> > Yeah, I'd like to get to completion descriptors ideally; so  
> nothing
> >> >> > better than a driver hook comes to mind so far :-(
> >> >> > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly  
> so far).
> >> >>
> >> >> Is there any other use case for this than getting the TX timestamp?  
> Not
> >> >> really sure what else those descriptors contain...
> >> >
> >> > I don't think so; at least looking at mlx5 and bnxt (the latter
> >> > doesn't have a timestamp in the completion ring).
> >> > So yeah, not sure, maybe that should be on the side and be AF_XDP  
> specific.
> >> > And not even involve bpf, just put the tx tstamp somewhere in umem:
> >> > setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
> >> > &data_relative_offset, ..);
> >> > OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
> >> > have for eternity? (plus, this needs a driver "hook" for af_xdp
> >> > anyway, so why not make it generic?)
> >>
> >> So since this is read-only in any case, we could just make it a
> >> tracepoint instead of a whole new hook? That's what I was planning to  
> do
> >> for xdp_return_frame(); we just need a way to refer back to the  
> original
> >> frame, but we'd need to solve that anyway. Just letting XDP/AF_XDP
> >> specify their own packet ID in some form, and make that part of the
> >> tracepoint data, would probably be sufficient?
> >
> > That would probably mean a driver specific tracepoint (since we might
> > need to get to the completion queue descriptor)?
> > Idk, probably still makes sense to have something that works across
> > different drivers?
> > Or are you suggesting to just do fentry/mlx5e_free_xdpsq_desc and go  
> from there?
> > Not sure I can get a umem frame, as you've mentioned; and also it
> > doesn't look like cqe is there...

> No, we'd define the tracepoint in one place (like along with the others
> in include/xdp/trace/event.h), and have it include the fields we need
> (ifindex, queue index, timestamp, some kind of packet ID). And then add
> a call in each driver that will support this, wherever it makes sense in
> that driver. This is what we do with the trace_xdp_exception() calls
> already scattered through drivers in the RX handling code.

> With this, a BPF listener can attach to just the one tracepoint, and get
> events from all drivers that support it (but it can filter on ifindex
> for the events it's interested in).

> This probably doesn't scale indefinitely: it's possible to add new
> fields to the tracepoint if we discover other uses than the timestamp,
> but this would probably get unwieldy at some point. However, it's a
> lightweight way to add a "hook" if we don't have any other use cases
> than getting the timestamp for now.

> > I guess the fact that it would arrive out-of-band (not in a umem
> > frame) is a minor inconvenience, the userspace should be able to join
> > the data together hopefully.

> Yeah, that's why I suggested the user-defined packet ID: if the event
> just includes that, the BPF or userspace program can do its own matching
> depending on its needs...

I guess we can implement this user-defined packet ID as a TX metadata
as well? Some kind of u32 tag/mark that the XDP program can set? (and
some new AF_XDP TX hook where we can set it up as well)

> >> >> >> > I'd like to some more input on whether applying the same idea  
> on TX
> >> >> >> > makes sense or not and whether there are any sensible  
> alternatives.
> >> >> >> > (IIRC, there was an attempt to do XDP on egress that went  
> nowhere).
> >> >> >>
> >> >> >> I believe that stranded because it was deemed not feasible to  
> cover the
> >> >> >> SKB TX path as well, which means it can't be symmetrical to the  
> RX hook.
> >> >> >> So we ended up with the in-devmap hook instead. I'm not sure if  
> that's
> >> >> >> made easier by multi-buf XDP, so that may be worth revisiting.
> >> >> >>
> >> >> >> For the TX metadata you don't really have to care about the skb  
> path, I
> >> >> >> suppose, so that may not matter too much either. However, at  
> least for
> >> >> >> the in-kernel xdp_frame the TX path is pushed from the stack  
> anyway, so
> >> >> >> I'm not sure if it's worth having a separate hook in the driver  
> (with
> >> >> >> all the added complexity and overhead that entails) just to set
> >> >> >> metadata? That could just as well be done on push from higher up  
> the
> >> >> >> stack; per-driver kfuncs could still be useful for this, though.
> >> >> >>
> >> >> >> And of course something would be needed so that that BPF  
> programs can
> >> >> >> process AF_XDP frames in the kernel before they hit the driver,  
> but
> >> >> >> again I'm not sure that needs to be a hook in the driver.
> >> >> >
> >> >> > Care to elaborate more on "push from higher up the stack"?
> >> >>
> >> >> I'm referring to the XDP_REDIRECT path here: xdp_frames are  
> transmitted
> >> >> by the stack calling ndo_xdp_xmit() in the driver with an array of
> >> >> frames that are immediately put on the wire (see bq_xmit_all() in
> >> >> devmap.c). So any metadata writing could be done at that point,  
> since
> >> >> the target driver is already known; there's even already a program  
> hook
> >> >> in there (used for in-devmap programs).
> >> >>
> >> >> > I've been thinking about mostly two cases:
> >> >> > - XDP_TX - I think this one technically doesn't need an extra  
> hook;
> >> >> > all metadata manipulations can be done at xdp_rx? (however, not  
> sure
> >> >> > how real that is, since the descriptors are probably not exposed  
> over
> >> >> > there?)
> >> >>
> >> >> Well, to me XDP_REDIRECT is the most interesting one (see above). I
> >> >> think we could even drop the XDP_TX case and only do this for
> >> >> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
> >> >> I.e., it's possible to XDP_REDIRECT back to the same device, the  
> frames
> >> >> will just take a slight detour up through the stack; but that could  
> also
> >> >> be a good thing if it means we'll have to do less surgery to the  
> drivers
> >> >> to implement this for two paths.
> >> >>
> >> >> It does have the same challenge as you outlined above, though: At  
> that
> >> >> point the TX descriptor probably doesn't exist, so the driver NDO  
> will
> >> >> have to do something else with the data; but maybe we can solve that
> >> >> without moving the hook into the driver itself somehow?
> >> >
> >> > Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
> >> > "transmit something out of xdp_rx hook" umbrella. We can maybe come  
> up
> >> > with a skb-like-private metadata layout (as we've discussed  
> previously
> >> > for skb) here as well? But not sure it would solve all the problems?
> >> > I'm thinking of an af_xdp case where it wants to program something
> >> > similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
> >> > support) or a checksum offload. Exposing access to the driver tx  
> hooks
> >> > seems like the easiest way to get there?
> >>
> >> Well, I was thinking something like exposing driver kfuncs like what  
> you
> >> implemented on RX, but having the program that calls them be the one in
> >> the existing devmap hook (each map entry is tied to a particular  
> netdev,
> >> so we could use the same type of dev-bound logic as we do on RX). The
> >> driver wouldn't have a TX descriptor at this point, but it could have a
> >> driver-private area somewhere in the xdp_frame (if we make space for  
> it)
> >> which the kfuncs could just write to in whichever format it wants, so
> >> that copying it to the descriptor later is just a memcpy().
> >
> > Yeah, that sounds similar to what we've discussed for the xdp->skb
> > path, right? Maybe it should be even some kind of extension to that?
> > On rx, we stash a bunch of metadata in that private area. If the frame
> > goes to the kernel stack -> put it into skb; if the frame goes back to
> > the wire -> the driver can parse it?

> Yeah, not a bad idea to combine those.

> >> There would still need to be a new hook for AF_XDP, but I think it  
> could
> >> have the same semantics as a devmap prog (i.e., an XDP program type  
> that
> >> runs right before TX, but after the destination device is selected); we
> >> could attach it to the socket, maybe? Doesn't have to be in the driver
> >> itself (just before the driver ndo is called for zc - I think?).
> >
> > Sounds sensible; will try to explore more, thx!

> Awesome, looking forward to seeing what you come up with :)

> -Toke

  reply	other threads:[~2023-02-28 23:02 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-23 22:23 [LSF/MM/BPF TOPIC] XDP metadata for TX Stanislav Fomichev
2023-02-23 23:22 ` Toke Høiland-Jørgensen
2023-02-23 23:51   ` Stanislav Fomichev
2023-02-27 14:17     ` Toke Høiland-Jørgensen
2023-02-27 20:03       ` Stanislav Fomichev
2023-02-27 23:54         ` Toke Høiland-Jørgensen
2023-02-28 21:18           ` Stanislav Fomichev
2023-02-28 22:09             ` Toke Høiland-Jørgensen
2023-02-28 23:02               ` Stanislav Fomichev [this message]
2023-02-28 23:08                 ` Toke Høiland-Jørgensen
2023-03-03  7:42         ` Magnus Karlsson
2023-03-03 12:37           ` Toke Høiland-Jørgensen
2023-03-03 17:16             ` Stanislav Fomichev
2023-03-07 19:32           ` Jesper Dangaard Brouer
2023-03-09 18:04             ` Stanislav Fomichev
2023-03-10 11:09               ` Jesper Dangaard Brouer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y/6IBI8hlyI3DcUz@google.com \
    --to=sdf@google.com \
    --cc=bpf@vger.kernel.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=toke@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox