From: "Toke Høiland-Jørgensen" <toke@kernel.org>
To: Magnus Karlsson <magnus.karlsson@gmail.com>,
Stanislav Fomichev <sdf@google.com>
Cc: lsf-pc@lists.linux-foundation.org, bpf@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
Date: Fri, 03 Mar 2023 13:37:19 +0100 [thread overview]
Message-ID: <87zg8uc8ow.fsf@toke.dk> (raw)
In-Reply-To: <CAJ8uoz0jnavFxMJ8tgb4+-+OsCPqVJQez8ULOTM2a60D4RmJ7A@mail.gmail.com>
Magnus Karlsson <magnus.karlsson@gmail.com> writes:
> On Mon, 27 Feb 2023 at 21:16, Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >
>> > Stanislav Fomichev <sdf@google.com> writes:
>> >
>> > > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> > >>
>> > >> Stanislav Fomichev <sdf@google.com> writes:
>> > >>
>> > >> > I'd like to discuss a potential follow up for the previous "XDP RX
>> > >> > metadata" series [0].
>> > >> >
>> > >> > Now that we can access (a subset of) packet metadata at RX, I'd like to
>> > >> > explore the options where we can export some of that metadata on TX. And
>> > >> > also whether it might be possible to access some of the TX completion
>> > >> > metadata (things like TX timestamp).
>> > >> >
>> > >> > I'm currently trying to understand whether the same approach I've used
>> > >> > on RX could work at TX. By May I plan to have a bunch of options laid
>> > >> > out (currently considering XSK tx/compl programs and XDP tx/compl
>> > >> > programs) so we have something to discuss.
>> > >>
>> > >> I've been looking at ways of getting a TX-completion hook for the XDP
>> > >> queueing stuff as well. For that, I think it could work to just hook
>> > >> into xdp_return_frame(), but if you want to access hardware metadata
>> > >> it'll obviously have to be in the driver. A hook in the driver could
>> > >> certainly be used for the queueing return as well, though, which may
>> > >> help making it worth the trouble :)
>> > >
>> > > Yeah, I'd like to get to completion descriptors ideally; so nothing
>> > > better than a driver hook comes to mind so far :-(
>> > > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far).
>> >
>> > Is there any other use case for this than getting the TX timestamp? Not
>> > really sure what else those descriptors contain...
>>
>> I don't think so; at least looking at mlx5 and bnxt (the latter
>> doesn't have a timestamp in the completion ring).
>> So yeah, not sure, maybe that should be on the side and be AF_XDP specific.
>> And not even involve bpf, just put the tx tstamp somewhere in umem:
>> setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
>> &data_relative_offset, ..);
>> OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
>> have for eternity? (plus, this needs a driver "hook" for af_xdp
>> anyway, so why not make it generic?)
>>
>> > >> > I'd like to some more input on whether applying the same idea on TX
>> > >> > makes sense or not and whether there are any sensible alternatives.
>> > >> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
>> > >>
>> > >> I believe that stranded because it was deemed not feasible to cover the
>> > >> SKB TX path as well, which means it can't be symmetrical to the RX hook.
>> > >> So we ended up with the in-devmap hook instead. I'm not sure if that's
>> > >> made easier by multi-buf XDP, so that may be worth revisiting.
>> > >>
>> > >> For the TX metadata you don't really have to care about the skb path, I
>> > >> suppose, so that may not matter too much either. However, at least for
>> > >> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
>> > >> I'm not sure if it's worth having a separate hook in the driver (with
>> > >> all the added complexity and overhead that entails) just to set
>> > >> metadata? That could just as well be done on push from higher up the
>> > >> stack; per-driver kfuncs could still be useful for this, though.
>> > >>
>> > >> And of course something would be needed so that that BPF programs can
>> > >> process AF_XDP frames in the kernel before they hit the driver, but
>> > >> again I'm not sure that needs to be a hook in the driver.
>> > >
>> > > Care to elaborate more on "push from higher up the stack"?
>> >
>> > I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted
>> > by the stack calling ndo_xdp_xmit() in the driver with an array of
>> > frames that are immediately put on the wire (see bq_xmit_all() in
>> > devmap.c). So any metadata writing could be done at that point, since
>> > the target driver is already known; there's even already a program hook
>> > in there (used for in-devmap programs).
>> >
>> > > I've been thinking about mostly two cases:
>> > > - XDP_TX - I think this one technically doesn't need an extra hook;
>> > > all metadata manipulations can be done at xdp_rx? (however, not sure
>> > > how real that is, since the descriptors are probably not exposed over
>> > > there?)
>> >
>> > Well, to me XDP_REDIRECT is the most interesting one (see above). I
>> > think we could even drop the XDP_TX case and only do this for
>> > XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
>> > I.e., it's possible to XDP_REDIRECT back to the same device, the frames
>> > will just take a slight detour up through the stack; but that could also
>> > be a good thing if it means we'll have to do less surgery to the drivers
>> > to implement this for two paths.
>> >
>> > It does have the same challenge as you outlined above, though: At that
>> > point the TX descriptor probably doesn't exist, so the driver NDO will
>> > have to do something else with the data; but maybe we can solve that
>> > without moving the hook into the driver itself somehow?
>>
>> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
>> "transmit something out of xdp_rx hook" umbrella. We can maybe come up
>> with a skb-like-private metadata layout (as we've discussed previously
>> for skb) here as well? But not sure it would solve all the problems?
>> I'm thinking of an af_xdp case where it wants to program something
>> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
>> support)
>
> We have a patch set of this in the works. Need to finish the last
> couple of tests then optimize performance and it is good to go. We
> should be able to post it during the next cycle that starts next week.
Uh, exciting, will look forward to seeing that! :)
-Toke
next prev parent reply other threads:[~2023-03-03 12:37 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-23 22:23 [LSF/MM/BPF TOPIC] XDP metadata for TX Stanislav Fomichev
2023-02-23 23:22 ` Toke Høiland-Jørgensen
2023-02-23 23:51 ` Stanislav Fomichev
2023-02-27 14:17 ` Toke Høiland-Jørgensen
2023-02-27 20:03 ` Stanislav Fomichev
2023-02-27 23:54 ` Toke Høiland-Jørgensen
2023-02-28 21:18 ` Stanislav Fomichev
2023-02-28 22:09 ` Toke Høiland-Jørgensen
2023-02-28 23:02 ` Stanislav Fomichev
2023-02-28 23:08 ` Toke Høiland-Jørgensen
2023-03-03 7:42 ` Magnus Karlsson
2023-03-03 12:37 ` Toke Høiland-Jørgensen [this message]
2023-03-03 17:16 ` Stanislav Fomichev
2023-03-07 19:32 ` Jesper Dangaard Brouer
2023-03-09 18:04 ` Stanislav Fomichev
2023-03-10 11:09 ` Jesper Dangaard Brouer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87zg8uc8ow.fsf@toke.dk \
--to=toke@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=lsf-pc@lists.linux-foundation.org \
--cc=magnus.karlsson@gmail.com \
--cc=sdf@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox