* [LSF/MM/BPF TOPIC] XDP metadata for TX
@ 2023-02-23 22:23 Stanislav Fomichev
2023-02-23 23:22 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 16+ messages in thread
From: Stanislav Fomichev @ 2023-02-23 22:23 UTC (permalink / raw)
To: lsf-pc; +Cc: bpf
I'd like to discuss a potential follow up for the previous "XDP RX
metadata" series [0].
Now that we can access (a subset of) packet metadata at RX, I'd like to
explore the options where we can export some of that metadata on TX. And
also whether it might be possible to access some of the TX completion
metadata (things like TX timestamp).
I'm currently trying to understand whether the same approach I've used
on RX could work at TX. By May I plan to have a bunch of options laid
out (currently considering XSK tx/compl programs and XDP tx/compl
programs) so we have something to discuss.
I'd like some more input on whether applying the same idea on TX
makes sense or not and whether there are any sensible alternatives.
(IIRC, there was an attempt to do XDP on egress that went nowhere).
0: https://lore.kernel.org/bpf/20230119221536.3349901-1-sdf@google.com/
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-02-23 22:23 [LSF/MM/BPF TOPIC] XDP metadata for TX Stanislav Fomichev
@ 2023-02-23 23:22 ` Toke Høiland-Jørgensen
2023-02-23 23:51 ` Stanislav Fomichev
0 siblings, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-02-23 23:22 UTC (permalink / raw)
To: Stanislav Fomichev, lsf-pc; +Cc: bpf
Stanislav Fomichev <sdf@google.com> writes:
> I'd like to discuss a potential follow up for the previous "XDP RX
> metadata" series [0].
>
> Now that we can access (a subset of) packet metadata at RX, I'd like to
> explore the options where we can export some of that metadata on TX. And
> also whether it might be possible to access some of the TX completion
> metadata (things like TX timestamp).
>
> I'm currently trying to understand whether the same approach I've used
> on RX could work at TX. By May I plan to have a bunch of options laid
> out (currently considering XSK tx/compl programs and XDP tx/compl
> programs) so we have something to discuss.
I've been looking at ways of getting a TX-completion hook for the XDP
queueing stuff as well. For that, I think it could work to just hook
into xdp_return_frame(), but if you want to access hardware metadata
it'll obviously have to be in the driver. A hook in the driver could
certainly be used for the queueing return as well, though, which may
help make it worth the trouble :)
> I'd like some more input on whether applying the same idea on TX
> makes sense or not and whether there are any sensible alternatives.
> (IIRC, there was an attempt to do XDP on egress that went nowhere).
I believe that stranded because it was deemed not feasible to cover the
SKB TX path as well, which means it can't be symmetrical to the RX hook.
So we ended up with the in-devmap hook instead. I'm not sure if that's
made easier by multi-buf XDP, so that may be worth revisiting.
For the TX metadata you don't really have to care about the skb path, I
suppose, so that may not matter too much either. However, at least for
the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
I'm not sure if it's worth having a separate hook in the driver (with
all the added complexity and overhead that entails) just to set
metadata? That could just as well be done on push from higher up the
stack; per-driver kfuncs could still be useful for this, though.
And of course something would be needed so that BPF programs can
process AF_XDP frames in the kernel before they hit the driver, but
again I'm not sure that needs to be a hook in the driver.
In any case, the above is just my immediate brain dump (I've been
mulling these things over for a while in relation to the queueing
stuff), and I'd certainly welcome more discussion on the subject! :)
-Toke
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-02-23 23:22 ` Toke Høiland-Jørgensen
@ 2023-02-23 23:51 ` Stanislav Fomichev
2023-02-27 14:17 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 16+ messages in thread
From: Stanislav Fomichev @ 2023-02-23 23:51 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: lsf-pc, bpf
On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > I'd like to discuss a potential follow up for the previous "XDP RX
> > metadata" series [0].
> >
> > Now that we can access (a subset of) packet metadata at RX, I'd like to
> > explore the options where we can export some of that metadata on TX. And
> > also whether it might be possible to access some of the TX completion
> > metadata (things like TX timestamp).
> >
> > I'm currently trying to understand whether the same approach I've used
> > on RX could work at TX. By May I plan to have a bunch of options laid
> > out (currently considering XSK tx/compl programs and XDP tx/compl
> > programs) so we have something to discuss.
>
> I've been looking at ways of getting a TX-completion hook for the XDP
> queueing stuff as well. For that, I think it could work to just hook
> into xdp_return_frame(), but if you want to access hardware metadata
> it'll obviously have to be in the driver. A hook in the driver could
> certainly be used for the queueing return as well, though, which may
> help make it worth the trouble :)
Yeah, I'd like to get to completion descriptors ideally; so nothing
better than a driver hook comes to mind so far :-(
(I'm mostly eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path so far.)
> > I'd like some more input on whether applying the same idea on TX
> > makes sense or not and whether there are any sensible alternatives.
> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
>
> I believe that stranded because it was deemed not feasible to cover the
> SKB TX path as well, which means it can't be symmetrical to the RX hook.
> So we ended up with the in-devmap hook instead. I'm not sure if that's
> made easier by multi-buf XDP, so that may be worth revisiting.
>
> For the TX metadata you don't really have to care about the skb path, I
> suppose, so that may not matter too much either. However, at least for
> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
> I'm not sure if it's worth having a separate hook in the driver (with
> all the added complexity and overhead that entails) just to set
> metadata? That could just as well be done on push from higher up the
> stack; per-driver kfuncs could still be useful for this, though.
>
> And of course something would be needed so that BPF programs can
> process AF_XDP frames in the kernel before they hit the driver, but
> again I'm not sure that needs to be a hook in the driver.
Care to elaborate more on "push from higher up the stack"?
I've been thinking about mostly two cases:
- XDP_TX - I think this one technically doesn't need an extra hook;
all metadata manipulations can be done at xdp_rx? (however, not sure
how real that is, since the descriptors are probably not exposed over
there?)
- AF_XDP TX - this one needs something deep in the driver (due to tx
zc) to populate the descriptors?
- anything else?
> In any case, the above is just my immediate brain dump (I've been
> mulling these things over for a while in relation to the queueing
> stuff), and I'd certainly welcome more discussion on the subject! :)
Awesome, thanks for the dump! Will try to keep you in the loop!
> -Toke
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-02-23 23:51 ` Stanislav Fomichev
@ 2023-02-27 14:17 ` Toke Høiland-Jørgensen
2023-02-27 20:03 ` Stanislav Fomichev
0 siblings, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-02-27 14:17 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: lsf-pc, bpf
Stanislav Fomichev <sdf@google.com> writes:
> On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > I'd like to discuss a potential follow up for the previous "XDP RX
>> > metadata" series [0].
>> >
>> > Now that we can access (a subset of) packet metadata at RX, I'd like to
>> > explore the options where we can export some of that metadata on TX. And
>> > also whether it might be possible to access some of the TX completion
>> > metadata (things like TX timestamp).
>> >
>> > I'm currently trying to understand whether the same approach I've used
>> > on RX could work at TX. By May I plan to have a bunch of options laid
>> > out (currently considering XSK tx/compl programs and XDP tx/compl
>> > programs) so we have something to discuss.
>>
>> I've been looking at ways of getting a TX-completion hook for the XDP
>> queueing stuff as well. For that, I think it could work to just hook
>> into xdp_return_frame(), but if you want to access hardware metadata
>> it'll obviously have to be in the driver. A hook in the driver could
>> certainly be used for the queueing return as well, though, which may
>> help make it worth the trouble :)
>
> Yeah, I'd like to get to completion descriptors ideally; so nothing
> better than a driver hook comes to mind so far :-(
> (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far).
Is there any other use case for this than getting the TX timestamp? Not
really sure what else those descriptors contain...
>> > I'd like some more input on whether applying the same idea on TX
>> > makes sense or not and whether there are any sensible alternatives.
>> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
>>
>> I believe that stranded because it was deemed not feasible to cover the
>> SKB TX path as well, which means it can't be symmetrical to the RX hook.
>> So we ended up with the in-devmap hook instead. I'm not sure if that's
>> made easier by multi-buf XDP, so that may be worth revisiting.
>>
>> For the TX metadata you don't really have to care about the skb path, I
>> suppose, so that may not matter too much either. However, at least for
>> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
>> I'm not sure if it's worth having a separate hook in the driver (with
>> all the added complexity and overhead that entails) just to set
>> metadata? That could just as well be done on push from higher up the
>> stack; per-driver kfuncs could still be useful for this, though.
>>
>> And of course something would be needed so that BPF programs can
>> process AF_XDP frames in the kernel before they hit the driver, but
>> again I'm not sure that needs to be a hook in the driver.
>
> Care to elaborate more on "push from higher up the stack"?
I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted
by the stack calling ndo_xdp_xmit() in the driver with an array of
frames that are immediately put on the wire (see bq_xmit_all() in
devmap.c). So any metadata writing could be done at that point, since
the target driver is already known; there's even already a program hook
in there (used for in-devmap programs).
> I've been thinking about mostly two cases:
> - XDP_TX - I think this one technically doesn't need an extra hook;
> all metadata manipulations can be done at xdp_rx? (however, not sure
> how real that is, since the descriptors are probably not exposed over
> there?)
Well, to me XDP_REDIRECT is the most interesting one (see above). I
think we could even drop the XDP_TX case and only do this for
XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
I.e., it's possible to XDP_REDIRECT back to the same device, the frames
will just take a slight detour up through the stack; but that could also
be a good thing if it means we'll have to do less surgery to the drivers
to implement this for two paths.
It does have the same challenge as you outlined above, though: At that
point the TX descriptor probably doesn't exist, so the driver NDO will
have to do something else with the data; but maybe we can solve that
without moving the hook into the driver itself somehow?
> - AF_XDP TX - this one needs something deep in the driver (due to tx
> zc) to populate the descriptors?
Yeah, this one is a bit more challenging, but having a way to process
AF_XDP frames in the kernel before they're sent out would be good in any
case (for things like policing what packets an AF_XDP application can
send in a cloud deployment, for instance). Would be best if we could
consolidate the XDP_REDIRECT and AF_XDP paths, I suppose...
> - anything else?
Well, see above ;)
>> In any case, the above is just my immediate brain dump (I've been
>> mulling these things over for a while in relation to the queueing
>> stuff), and I'd certainly welcome more discussion on the subject! :)
>
> Awesome, thanks for the dump! Will try to keep you in the loop!
Great, thanks!
-Toke
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-02-27 14:17 ` Toke Høiland-Jørgensen
@ 2023-02-27 20:03 ` Stanislav Fomichev
2023-02-27 23:54 ` Toke Høiland-Jørgensen
2023-03-03 7:42 ` Magnus Karlsson
0 siblings, 2 replies; 16+ messages in thread
From: Stanislav Fomichev @ 2023-02-27 20:03 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: lsf-pc, bpf
On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > I'd like to discuss a potential follow up for the previous "XDP RX
> >> > metadata" series [0].
> >> >
> >> > Now that we can access (a subset of) packet metadata at RX, I'd like to
> >> > explore the options where we can export some of that metadata on TX. And
> >> > also whether it might be possible to access some of the TX completion
> >> > metadata (things like TX timestamp).
> >> >
> >> > I'm currently trying to understand whether the same approach I've used
> >> > on RX could work at TX. By May I plan to have a bunch of options laid
> >> > out (currently considering XSK tx/compl programs and XDP tx/compl
> >> > programs) so we have something to discuss.
> >>
> >> I've been looking at ways of getting a TX-completion hook for the XDP
> >> queueing stuff as well. For that, I think it could work to just hook
> >> into xdp_return_frame(), but if you want to access hardware metadata
> >> it'll obviously have to be in the driver. A hook in the driver could
> >> certainly be used for the queueing return as well, though, which may
> >> help make it worth the trouble :)
> >
> > Yeah, I'd like to get to completion descriptors ideally; so nothing
> > better than a driver hook comes to mind so far :-(
> > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far).
>
> Is there any other use case for this than getting the TX timestamp? Not
> really sure what else those descriptors contain...
I don't think so; at least looking at mlx5 and bnxt (the latter
doesn't have a timestamp in the completion ring).
So yeah, not sure, maybe that should be on the side and be AF_XDP specific.
And not even involve bpf, just put the tx tstamp somewhere in umem:
setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
&data_relative_offset, ..);
OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
have for eternity? (plus, this needs a driver "hook" for af_xdp
anyway, so why not make it generic?)
> >> > I'd like some more input on whether applying the same idea on TX
> >> > makes sense or not and whether there are any sensible alternatives.
> >> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
> >>
> >> I believe that stranded because it was deemed not feasible to cover the
> >> SKB TX path as well, which means it can't be symmetrical to the RX hook.
> >> So we ended up with the in-devmap hook instead. I'm not sure if that's
> >> made easier by multi-buf XDP, so that may be worth revisiting.
> >>
> >> For the TX metadata you don't really have to care about the skb path, I
> >> suppose, so that may not matter too much either. However, at least for
> >> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
> >> I'm not sure if it's worth having a separate hook in the driver (with
> >> all the added complexity and overhead that entails) just to set
> >> metadata? That could just as well be done on push from higher up the
> >> stack; per-driver kfuncs could still be useful for this, though.
> >>
> >> And of course something would be needed so that BPF programs can
> >> process AF_XDP frames in the kernel before they hit the driver, but
> >> again I'm not sure that needs to be a hook in the driver.
> >
> > Care to elaborate more on "push from higher up the stack"?
>
> I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted
> by the stack calling ndo_xdp_xmit() in the driver with an array of
> frames that are immediately put on the wire (see bq_xmit_all() in
> devmap.c). So any metadata writing could be done at that point, since
> the target driver is already known; there's even already a program hook
> in there (used for in-devmap programs).
>
> > I've been thinking about mostly two cases:
> > - XDP_TX - I think this one technically doesn't need an extra hook;
> > all metadata manipulations can be done at xdp_rx? (however, not sure
> > how real that is, since the descriptors are probably not exposed over
> > there?)
>
> Well, to me XDP_REDIRECT is the most interesting one (see above). I
> think we could even drop the XDP_TX case and only do this for
> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
> I.e., it's possible to XDP_REDIRECT back to the same device, the frames
> will just take a slight detour up through the stack; but that could also
> be a good thing if it means we'll have to do less surgery to the drivers
> to implement this for two paths.
>
> It does have the same challenge as you outlined above, though: At that
> point the TX descriptor probably doesn't exist, so the driver NDO will
> have to do something else with the data; but maybe we can solve that
> without moving the hook into the driver itself somehow?
Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
"transmit something out of xdp_rx hook" umbrella. We can maybe come up
with a skb-like-private metadata layout (as we've discussed previously
for skb) here as well? But not sure it would solve all the problems?
I'm thinking of an af_xdp case where it wants to program something
similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
support) or a checksum offload. Exposing access to the driver tx hooks
seems like the easiest way to get there?
> > - AF_XDP TX - this one needs something deep in the driver (due to tx
> > zc) to populate the descriptors?
>
> Yeah, this one is a bit more challenging, but having a way to process
> AF_XDP frames in the kernel before they're sent out would be good in any
> case (for things like policing what packets an AF_XDP application can
> send in a cloud deployment, for instance). Would be best if we could
> consolidate the XDP_REDIRECT and AF_XDP paths, I suppose...
>
> > - anything else?
>
> Well, see above ;)
>
> >> In any case, the above is just my immediate brain dump (I've been
> >> mulling these things over for a while in relation to the queueing
> >> stuff), and I'd certainly welcome more discussion on the subject! :)
> >
> > Awesome, thanks for the dump! Will try to keep you in the loop!
>
> Great, thanks!
>
> -Toke
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-02-27 20:03 ` Stanislav Fomichev
@ 2023-02-27 23:54 ` Toke Høiland-Jørgensen
2023-02-28 21:18 ` Stanislav Fomichev
2023-03-03 7:42 ` Magnus Karlsson
1 sibling, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-02-27 23:54 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: lsf-pc, bpf
Stanislav Fomichev <sdf@google.com> writes:
> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >>
>> >> Stanislav Fomichev <sdf@google.com> writes:
>> >>
>> >> > I'd like to discuss a potential follow up for the previous "XDP RX
>> >> > metadata" series [0].
>> >> >
>> >> > Now that we can access (a subset of) packet metadata at RX, I'd like to
>> >> > explore the options where we can export some of that metadata on TX. And
>> >> > also whether it might be possible to access some of the TX completion
>> >> > metadata (things like TX timestamp).
>> >> >
>> >> > I'm currently trying to understand whether the same approach I've used
>> >> > on RX could work at TX. By May I plan to have a bunch of options laid
>> >> > out (currently considering XSK tx/compl programs and XDP tx/compl
>> >> > programs) so we have something to discuss.
>> >>
>> >> I've been looking at ways of getting a TX-completion hook for the XDP
>> >> queueing stuff as well. For that, I think it could work to just hook
>> >> into xdp_return_frame(), but if you want to access hardware metadata
>> >> it'll obviously have to be in the driver. A hook in the driver could
>> >> certainly be used for the queueing return as well, though, which may
>> >> help make it worth the trouble :)
>> >
>> > Yeah, I'd like to get to completion descriptors ideally; so nothing
>> > better than a driver hook comes to mind so far :-(
>> > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far).
>>
>> Is there any other use case for this than getting the TX timestamp? Not
>> really sure what else those descriptors contain...
>
> I don't think so; at least looking at mlx5 and bnxt (the latter
> doesn't have a timestamp in the completion ring).
> So yeah, not sure, maybe that should be on the side and be AF_XDP specific.
> And not even involve bpf, just put the tx tstamp somewhere in umem:
> setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
> &data_relative_offset, ..);
> OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
> have for eternity? (plus, this needs a driver "hook" for af_xdp
> anyway, so why not make it generic?)
So since this is read-only in any case, we could just make it a
tracepoint instead of a whole new hook? That's what I was planning to do
for xdp_return_frame(); we just need a way to refer back to the original
frame, but we'd need to solve that anyway. Just letting XDP/AF_XDP
specify their own packet ID in some form, and make that part of the
tracepoint data, would probably be sufficient?
>> >> > I'd like some more input on whether applying the same idea on TX
>> >> > makes sense or not and whether there are any sensible alternatives.
>> >> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
>> >>
>> >> I believe that stranded because it was deemed not feasible to cover the
>> >> SKB TX path as well, which means it can't be symmetrical to the RX hook.
>> >> So we ended up with the in-devmap hook instead. I'm not sure if that's
>> >> made easier by multi-buf XDP, so that may be worth revisiting.
>> >>
>> >> For the TX metadata you don't really have to care about the skb path, I
>> >> suppose, so that may not matter too much either. However, at least for
>> >> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
>> >> I'm not sure if it's worth having a separate hook in the driver (with
>> >> all the added complexity and overhead that entails) just to set
>> >> metadata? That could just as well be done on push from higher up the
>> >> stack; per-driver kfuncs could still be useful for this, though.
>> >>
>> >> And of course something would be needed so that BPF programs can
>> >> process AF_XDP frames in the kernel before they hit the driver, but
>> >> again I'm not sure that needs to be a hook in the driver.
>> >
>> > Care to elaborate more on "push from higher up the stack"?
>>
>> I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted
>> by the stack calling ndo_xdp_xmit() in the driver with an array of
>> frames that are immediately put on the wire (see bq_xmit_all() in
>> devmap.c). So any metadata writing could be done at that point, since
>> the target driver is already known; there's even already a program hook
>> in there (used for in-devmap programs).
>>
>> > I've been thinking about mostly two cases:
>> > - XDP_TX - I think this one technically doesn't need an extra hook;
>> > all metadata manipulations can be done at xdp_rx? (however, not sure
>> > how real that is, since the descriptors are probably not exposed over
>> > there?)
>>
>> Well, to me XDP_REDIRECT is the most interesting one (see above). I
>> think we could even drop the XDP_TX case and only do this for
>> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
>> I.e., it's possible to XDP_REDIRECT back to the same device, the frames
>> will just take a slight detour up through the stack; but that could also
>> be a good thing if it means we'll have to do less surgery to the drivers
>> to implement this for two paths.
>>
>> It does have the same challenge as you outlined above, though: At that
>> point the TX descriptor probably doesn't exist, so the driver NDO will
>> have to do something else with the data; but maybe we can solve that
>> without moving the hook into the driver itself somehow?
>
> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
> "transmit something out of xdp_rx hook" umbrella. We can maybe come up
> with a skb-like-private metadata layout (as we've discussed previously
> for skb) here as well? But not sure it would solve all the problems?
> I'm thinking of an af_xdp case where it wants to program something
> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
> support) or a checksum offload. Exposing access to the driver tx hooks
> seems like the easiest way to get there?
Well, I was thinking something like exposing driver kfuncs like what you
implemented on RX, but having the program that calls them be the one in
the existing devmap hook (each map entry is tied to a particular netdev,
so we could use the same type of dev-bound logic as we do on RX). The
driver wouldn't have a TX descriptor at this point, but it could have a
driver-private area somewhere in the xdp_frame (if we make space for it)
which the kfuncs could just write to in whichever format it wants, so
that copying it to the descriptor later is just a memcpy().
There would still need to be a new hook for AF_XDP, but I think it could
have the same semantics as a devmap prog (i.e., an XDP program type that
runs right before TX, but after the destination device is selected); we
could attach it to the socket, maybe? Doesn't have to be in the driver
itself (just before the driver ndo is called for zc - I think?).
-Toke
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-02-27 23:54 ` Toke Høiland-Jørgensen
@ 2023-02-28 21:18 ` Stanislav Fomichev
2023-02-28 22:09 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 16+ messages in thread
From: Stanislav Fomichev @ 2023-02-28 21:18 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: lsf-pc, bpf
On Mon, Feb 27, 2023 at 3:54 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> >>
> >> >> Stanislav Fomichev <sdf@google.com> writes:
> >> >>
> >> >> > I'd like to discuss a potential follow up for the previous "XDP RX
> >> >> > metadata" series [0].
> >> >> >
> >> >> > Now that we can access (a subset of) packet metadata at RX, I'd like to
> >> >> > explore the options where we can export some of that metadata on TX. And
> >> >> > also whether it might be possible to access some of the TX completion
> >> >> > metadata (things like TX timestamp).
> >> >> >
> >> >> > I'm currently trying to understand whether the same approach I've used
> >> >> > on RX could work at TX. By May I plan to have a bunch of options laid
> >> >> > out (currently considering XSK tx/compl programs and XDP tx/compl
> >> >> > programs) so we have something to discuss.
> >> >>
> >> >> I've been looking at ways of getting a TX-completion hook for the XDP
> >> >> queueing stuff as well. For that, I think it could work to just hook
> >> >> into xdp_return_frame(), but if you want to access hardware metadata
> >> >> it'll obviously have to be in the driver. A hook in the driver could
> >> >> certainly be used for the queueing return as well, though, which may
> >> >> help make it worth the trouble :)
> >> >
> >> > Yeah, I'd like to get to completion descriptors ideally; so nothing
> >> > better than a driver hook comes to mind so far :-(
> >> > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far).
> >>
> >> Is there any other use case for this than getting the TX timestamp? Not
> >> really sure what else those descriptors contain...
> >
> > I don't think so; at least looking at mlx5 and bnxt (the latter
> > doesn't have a timestamp in the completion ring).
> > So yeah, not sure, maybe that should be on the side and be AF_XDP specific.
> > And not even involve bpf, just put the tx tstamp somewhere in umem:
> > setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
> > &data_relative_offset, ..);
> > OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
> > have for eternity? (plus, this needs a driver "hook" for af_xdp
> > anyway, so why not make it generic?)
>
> So since this is read-only in any case, we could just make it a
> tracepoint instead of a whole new hook? That's what I was planning to do
> for xdp_return_frame(); we just need a way to refer back to the original
> frame, but we'd need to solve that anyway. Just letting XDP/AF_XDP
> specify their own packet ID in some form, and make that part of the
> tracepoint data, would probably be sufficient?
That would probably mean a driver-specific tracepoint (since we might
need to get to the completion queue descriptor)?
Idk, probably still makes sense to have something that works across
different drivers?
Or are you suggesting we just do fentry on mlx5e_free_xdpsq_desc and go from there?
Not sure I can get at the umem frame, as you've mentioned; and it also
doesn't look like the cqe is available there...
I guess the fact that it would arrive out-of-band (not in a umem
frame) is a minor inconvenience; the userspace should hopefully be able
to join the data together.
> >> >> > I'd like some more input on whether applying the same idea on TX
> >> >> > makes sense or not and whether there are any sensible alternatives.
> >> >> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
> >> >>
> >> >> I believe that stranded because it was deemed not feasible to cover the
> >> >> SKB TX path as well, which means it can't be symmetrical to the RX hook.
> >> >> So we ended up with the in-devmap hook instead. I'm not sure if that's
> >> >> made easier by multi-buf XDP, so that may be worth revisiting.
> >> >>
> >> >> For the TX metadata you don't really have to care about the skb path, I
> >> >> suppose, so that may not matter too much either. However, at least for
> >> >> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
> >> >> I'm not sure if it's worth having a separate hook in the driver (with
> >> >> all the added complexity and overhead that entails) just to set
> >> >> metadata? That could just as well be done on push from higher up the
> >> >> stack; per-driver kfuncs could still be useful for this, though.
> >> >>
> >> >> And of course something would be needed so that BPF programs can
> >> >> process AF_XDP frames in the kernel before they hit the driver, but
> >> >> again I'm not sure that needs to be a hook in the driver.
> >> >
> >> > Care to elaborate more on "push from higher up the stack"?
> >>
> >> I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted
> >> by the stack calling ndo_xdp_xmit() in the driver with an array of
> >> frames that are immediately put on the wire (see bq_xmit_all() in
> >> devmap.c). So any metadata writing could be done at that point, since
> >> the target driver is already known; there's even already a program hook
> >> in there (used for in-devmap programs).
> >>
> >> > I've been thinking about mostly two cases:
> >> > - XDP_TX - I think this one technically doesn't need an extra hook;
> >> > all metadata manipulations can be done at xdp_rx? (however, not sure
> >> > how real that is, since the descriptors are probably not exposed over
> >> > there?)
> >>
> >> Well, to me XDP_REDIRECT is the most interesting one (see above). I
> >> think we could even drop the XDP_TX case and only do this for
> >> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
> >> I.e., it's possible to XDP_REDIRECT back to the same device, the frames
> >> will just take a slight detour up through the stack; but that could also
> >> be a good thing if it means we'll have to do less surgery to the drivers
> >> to implement this for two paths.
> >>
> >> It does have the same challenge as you outlined above, though: At that
> >> point the TX descriptor probably doesn't exist, so the driver NDO will
> >> have to do something else with the data; but maybe we can solve that
> >> without moving the hook into the driver itself somehow?
> >
> > Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
> > "transmit something out of xdp_rx hook" umbrella. We can maybe come up
> > with a skb-like-private metadata layout (as we've discussed previously
> > for skb) here as well? But not sure it would solve all the problems?
> > I'm thinking of an af_xdp case where it wants to program something
> > similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
> > support) or a checksum offload. Exposing access to the driver tx hooks
> > seems like the easiest way to get there?
>
> Well, I was thinking something like exposing driver kfuncs like what you
> implemented on RX, but having the program that calls them be the one in
> the existing devmap hook (each map entry is tied to a particular netdev,
> so we could use the same type of dev-bound logic as we do on RX). The
> driver wouldn't have a TX descriptor at this point, but it could have a
> driver-private area somewhere in the xdp_frame (if we make space for it)
> which the kfuncs could just write to in whichever format it wants, so
> that copying it to the descriptor later is just a memcpy().
Yeah, that sounds similar to what we've discussed for the xdp->skb
path, right? Maybe it should even be some kind of extension to that?
On rx, we stash a bunch of metadata in that private area. If the frame
goes to the kernel stack -> put it into skb; if the frame goes back to
the wire -> the driver can parse it?
> There would still need to be a new hook for AF_XDP, but I think it could
> have the same semantics as a devmap prog (i.e., an XDP program type that
> runs right before TX, but after the destination device is selected); we
> could attach it to the socket, maybe? Doesn't have to be in the driver
> itself (just before the driver ndo is called for zc - I think?).
Sounds sensible; will try to explore more, thx!
> -Toke
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-02-28 21:18 ` Stanislav Fomichev
@ 2023-02-28 22:09 ` Toke Høiland-Jørgensen
2023-02-28 23:02 ` Stanislav Fomichev
0 siblings, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-02-28 22:09 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: lsf-pc, bpf
Stanislav Fomichev <sdf@google.com> writes:
> On Mon, Feb 27, 2023 at 3:54 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >>
>> >> Stanislav Fomichev <sdf@google.com> writes:
>> >>
>> >> > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >> >>
>> >> >> Stanislav Fomichev <sdf@google.com> writes:
>> >> >>
>> >> >> > I'd like to discuss a potential follow up for the previous "XDP RX
>> >> >> > metadata" series [0].
>> >> >> >
>> >> >> > Now that we can access (a subset of) packet metadata at RX, I'd like to
>> >> >> > explore the options where we can export some of that metadata on TX. And
>> >> >> > also whether it might be possible to access some of the TX completion
>> >> >> > metadata (things like TX timestamp).
>> >> >> >
>> >> >> > I'm currently trying to understand whether the same approach I've used
>> >> >> > on RX could work at TX. By May I plan to have a bunch of options laid
>> >> >> > out (currently considering XSK tx/compl programs and XDP tx/compl
>> >> >> > programs) so we have something to discuss.
>> >> >>
>> >> >> I've been looking at ways of getting a TX-completion hook for the XDP
>> >> >> queueing stuff as well. For that, I think it could work to just hook
>> >> >> into xdp_return_frame(), but if you want to access hardware metadata
>> >> >> it'll obviously have to be in the driver. A hook in the driver could
>> >> >> certainly be used for the queueing return as well, though, which may
>> >> >> help making it worth the trouble :)
>> >> >
>> >> > Yeah, I'd like to get to completion descriptors ideally; so nothing
>> >> > better than a driver hook comes to mind so far :-(
>> >> > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far).
>> >>
>> >> Is there any other use case for this than getting the TX timestamp? Not
>> >> really sure what else those descriptors contain...
>> >
>> > I don't think so; at least looking at mlx5 and bnxt (the latter
>> > doesn't have a timestamp in the completion ring).
>> > So yeah, not sure, maybe that should be on the side and be AF_XDP specific.
>> > And not even involve bpf, just put the tx tstamp somewhere in umem:
>> > setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
>> > &data_relative_offset, ..);
>> > OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
>> > have for eternity? (plus, this needs a driver "hook" for af_xdp
>> > anyway, so why not make it generic?)
>>
>> So since this is read-only in any case, we could just make it a
>> tracepoint instead of a whole new hook? That's what I was planning to do
>> for xdp_return_frame(); we just need a way to refer back to the original
>> frame, but we'd need to solve that anyway. Just letting XDP/AF_XDP
>> specify their own packet ID in some form, and make that part of the
>> tracepoint data, would probably be sufficient?
>
> That would probably mean a driver specific tracepoint (since we might
> need to get to the completion queue descriptor)?
> Idk, probably still makes sense to have something that works across
> different drivers?
> Or are you suggesting to just do fentry/mlx5e_free_xdpsq_desc and go from there?
> Not sure I can get a umem frame, as you've mentioned; and also it
> doesn't look like cqe is there...
No, we'd define the tracepoint in one place (like along with the others
in include/trace/events/xdp.h), and have it include the fields we need
(ifindex, queue index, timestamp, some kind of packet ID). And then add
a call in each driver that will support this, wherever it makes sense in
that driver. This is what we do with the trace_xdp_exception() calls
already scattered through drivers in the RX handling code.
With this, a BPF listener can attach to just the one tracepoint, and get
events from all drivers that support it (but it can filter on ifindex
for the events it's interested in).
This probably doesn't scale indefinitely: it's possible to add new
fields to the tracepoint if we discover other uses than the timestamp,
but this would probably get unwieldy at some point. However, it's a
lightweight way to add a "hook" if we don't have any other use cases
than getting the timestamp for now.
> I guess the fact that it would arrive out-of-band (not in a umem
> frame) is a minor inconvenience, the userspace should be able to join
> the data together hopefully.
Yeah, that's why I suggested the user-defined packet ID: if the event
just includes that, the BPF or userspace program can do its own matching
depending on its needs...
>> >> >> > I'd like to get some more input on whether applying the same idea
>> >> >> > makes sense or not and whether there are any sensible alternatives.
>> >> >> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
>> >> >>
>> >> >> I believe that stranded because it was deemed not feasible to cover the
>> >> >> SKB TX path as well, which means it can't be symmetrical to the RX hook.
>> >> >> So we ended up with the in-devmap hook instead. I'm not sure if that's
>> >> >> made easier by multi-buf XDP, so that may be worth revisiting.
>> >> >>
>> >> >> For the TX metadata you don't really have to care about the skb path, I
>> >> >> suppose, so that may not matter too much either. However, at least for
>> >> >> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
>> >> >> I'm not sure if it's worth having a separate hook in the driver (with
>> >> >> all the added complexity and overhead that entails) just to set
>> >> >> metadata? That could just as well be done on push from higher up the
>> >> >> stack; per-driver kfuncs could still be useful for this, though.
>> >> >>
>> >> >> And of course something would be needed so that BPF programs can
>> >> >> process AF_XDP frames in the kernel before they hit the driver, but
>> >> >> again I'm not sure that needs to be a hook in the driver.
>> >> >
>> >> > Care to elaborate more on "push from higher up the stack"?
>> >>
>> >> I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted
>> >> by the stack calling ndo_xdp_xmit() in the driver with an array of
>> >> frames that are immediately put on the wire (see bq_xmit_all() in
>> >> devmap.c). So any metadata writing could be done at that point, since
>> >> the target driver is already known; there's even already a program hook
>> >> in there (used for in-devmap programs).
>> >>
>> >> > I've been thinking about mostly two cases:
>> >> > - XDP_TX - I think this one technically doesn't need an extra hook;
>> >> > all metadata manipulations can be done at xdp_rx? (however, not sure
>> >> > how real that is, since the descriptors are probably not exposed over
>> >> > there?)
>> >>
>> >> Well, to me XDP_REDIRECT is the most interesting one (see above). I
>> >> think we could even drop the XDP_TX case and only do this for
>> >> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
>> >> I.e., it's possible to XDP_REDIRECT back to the same device, the frames
>> >> will just take a slight detour up through the stack; but that could also
>> >> be a good thing if it means we'll have to do less surgery to the drivers
>> >> to implement this for two paths.
>> >>
>> >> It does have the same challenge as you outlined above, though: At that
>> >> point the TX descriptor probably doesn't exist, so the driver NDO will
>> >> have to do something else with the data; but maybe we can solve that
>> >> without moving the hook into the driver itself somehow?
>> >
>> > Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
>> > "transmit something out of xdp_rx hook" umbrella. We can maybe come up
>> > with a skb-like-private metadata layout (as we've discussed previously
>> > for skb) here as well? But not sure it would solve all the problems?
>> > I'm thinking of an af_xdp case where it wants to program something
>> > similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
>> > support) or a checksum offload. Exposing access to the driver tx hooks
>> > seems like the easiest way to get there?
>>
>> Well, I was thinking something like exposing driver kfuncs like what you
>> implemented on RX, but having the program that calls them be the one in
>> the existing devmap hook (each map entry is tied to a particular netdev,
>> so we could use the same type of dev-bound logic as we do on RX). The
>> driver wouldn't have a TX descriptor at this point, but it could have a
>> driver-private area somewhere in the xdp_frame (if we make space for it)
>> which the kfuncs could just write to in whichever format it wants, so
>> that copying it to the descriptor later is just a memcpy().
>
> Yeah, that sounds similar to what we've discussed for the xdp->skb
> path, right? Maybe it should be even some kind of extension to that?
> On rx, we stash a bunch of metadata in that private area. If the frame
> goes to the kernel stack -> put it into skb; if the frame goes back to
> the wire -> the driver can parse it?
Yeah, not a bad idea to combine those.
>> There would still need to be a new hook for AF_XDP, but I think it could
>> have the same semantics as a devmap prog (i.e., an XDP program type that
>> runs right before TX, but after the destination device is selected); we
>> could attach it to the socket, maybe? Doesn't have to be in the driver
>> itself (just before the driver ndo is called for zc - I think?).
>
> Sounds sensible; will try to explore more, thx!
Awesome, looking forward to seeing what you come up with :)
-Toke
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-02-28 22:09 ` Toke Høiland-Jørgensen
@ 2023-02-28 23:02 ` Stanislav Fomichev
2023-02-28 23:08 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 16+ messages in thread
From: Stanislav Fomichev @ 2023-02-28 23:02 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: lsf-pc, bpf
On 02/28, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> > On Mon, Feb 27, 2023 at 3:54 PM Toke Høiland-Jørgensen
> <toke@kernel.org> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen
> <toke@kernel.org> wrote:
> >> >>
> >> >> Stanislav Fomichev <sdf@google.com> writes:
> >> >>
> >> >> > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen
> <toke@kernel.org> wrote:
> >> >> >>
> >> >> >> Stanislav Fomichev <sdf@google.com> writes:
> >> >> >>
> >> >> >> > I'd like to discuss a potential follow up for the
> previous "XDP RX
> >> >> >> > metadata" series [0].
> >> >> >> >
> >> >> >> > Now that we can access (a subset of) packet metadata at RX,
> I'd like to
> >> >> >> > explore the options where we can export some of that metadata
> on TX. And
> >> >> >> > also whether it might be possible to access some of the TX
> completion
> >> >> >> > metadata (things like TX timestamp).
> >> >> >> >
> >> >> >> > I'm currently trying to understand whether the same approach
> I've used
> >> >> >> > on RX could work at TX. By May I plan to have a bunch of
> options laid
> >> >> >> > out (currently considering XSK tx/compl programs and XDP
> tx/compl
> >> >> >> > programs) so we have something to discuss.
> >> >> >>
> >> >> >> I've been looking at ways of getting a TX-completion hook for
> the XDP
> >> >> >> queueing stuff as well. For that, I think it could work to just
> hook
> >> >> >> into xdp_return_frame(), but if you want to access hardware
> metadata
> >> >> >> it'll obviously have to be in the driver. A hook in the driver
> could
> >> >> >> certainly be used for the queueing return as well, though, which
> may
> >> >> >> help making it worth the trouble :)
> >> >> >
> >> >> > Yeah, I'd like to get to completion descriptors ideally; so
> nothing
> >> >> > better than a driver hook comes to mind so far :-(
> >> >> > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly
> so far).
> >> >>
> >> >> Is there any other use case for this than getting the TX timestamp?
> Not
> >> >> really sure what else those descriptors contain...
> >> >
> >> > I don't think so; at least looking at mlx5 and bnxt (the latter
> >> > doesn't have a timestamp in the completion ring).
> >> > So yeah, not sure, maybe that should be on the side and be AF_XDP
> specific.
> >> > And not even involve bpf, just put the tx tstamp somewhere in umem:
> >> > setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
> >> > &data_relative_offset, ..);
> >> > OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
> >> > have for eternity? (plus, this needs a driver "hook" for af_xdp
> >> > anyway, so why not make it generic?)
> >>
> >> So since this is read-only in any case, we could just make it a
> >> tracepoint instead of a whole new hook? That's what I was planning to
> do
> >> for xdp_return_frame(); we just need a way to refer back to the
> original
> >> frame, but we'd need to solve that anyway. Just letting XDP/AF_XDP
> >> specify their own packet ID in some form, and make that part of the
> >> tracepoint data, would probably be sufficient?
> >
> > That would probably mean a driver specific tracepoint (since we might
> > need to get to the completion queue descriptor)?
> > Idk, probably still makes sense to have something that works across
> > different drivers?
> > Or are you suggesting to just do fentry/mlx5e_free_xdpsq_desc and go
> from there?
> > Not sure I can get a umem frame, as you've mentioned; and also it
> > doesn't look like cqe is there...
> No, we'd define the tracepoint in one place (like along with the others
> in include/trace/events/xdp.h), and have it include the fields we need
> (ifindex, queue index, timestamp, some kind of packet ID). And then add
> a call in each driver that will support this, wherever it makes sense in
> that driver. This is what we do with the trace_xdp_exception() calls
> already scattered through drivers in the RX handling code.
> With this, a BPF listener can attach to just the one tracepoint, and get
> events from all drivers that support it (but it can filter on ifindex
> for the events it's interested in).
> This probably doesn't scale indefinitely: it's possible to add new
> fields to the tracepoint if we discover other uses than the timestamp,
> but this would probably get unwieldy at some point. However, it's a
> lightweight way to add a "hook" if we don't have any other use cases
> than getting the timestamp for now.
> > I guess the fact that it would arrive out-of-band (not in a umem
> > frame) is a minor inconvenience, the userspace should be able to join
> > the data together hopefully.
> Yeah, that's why I suggested the user-defined packet ID: if the event
> just includes that, the BPF or userspace program can do its own matching
> depending on its needs...
I guess we can implement this user-defined packet ID as TX metadata
as well? Some kind of u32 tag/mark that the XDP program can set? (and
some new AF_XDP TX hook where we can set it up as well)
> >> >> >> > I'd like to get some more input on whether applying the same idea
> on TX
> >> >> >> > makes sense or not and whether there are any sensible
> alternatives.
> >> >> >> > (IIRC, there was an attempt to do XDP on egress that went
> nowhere).
> >> >> >>
> >> >> >> I believe that stranded because it was deemed not feasible to
> cover the
> >> >> >> SKB TX path as well, which means it can't be symmetrical to the
> RX hook.
> >> >> >> So we ended up with the in-devmap hook instead. I'm not sure if
> that's
> >> >> >> made easier by multi-buf XDP, so that may be worth revisiting.
> >> >> >>
> >> >> >> For the TX metadata you don't really have to care about the skb
> path, I
> >> >> >> suppose, so that may not matter too much either. However, at
> least for
> >> >> >> the in-kernel xdp_frame the TX path is pushed from the stack
> anyway, so
> >> >> >> I'm not sure if it's worth having a separate hook in the driver
> (with
> >> >> >> all the added complexity and overhead that entails) just to set
> >> >> >> metadata? That could just as well be done on push from higher up
> the
> >> >> >> stack; per-driver kfuncs could still be useful for this, though.
> >> >> >>
> >> >> >> And of course something would be needed so that BPF
> programs can
> >> >> >> process AF_XDP frames in the kernel before they hit the driver,
> but
> >> >> >> again I'm not sure that needs to be a hook in the driver.
> >> >> >
> >> >> > Care to elaborate more on "push from higher up the stack"?
> >> >>
> >> >> I'm referring to the XDP_REDIRECT path here: xdp_frames are
> transmitted
> >> >> by the stack calling ndo_xdp_xmit() in the driver with an array of
> >> >> frames that are immediately put on the wire (see bq_xmit_all() in
> >> >> devmap.c). So any metadata writing could be done at that point,
> since
> >> >> the target driver is already known; there's even already a program
> hook
> >> >> in there (used for in-devmap programs).
> >> >>
> >> >> > I've been thinking about mostly two cases:
> >> >> > - XDP_TX - I think this one technically doesn't need an extra
> hook;
> >> >> > all metadata manipulations can be done at xdp_rx? (however, not
> sure
> >> >> > how real that is, since the descriptors are probably not exposed
> over
> >> >> > there?)
> >> >>
> >> >> Well, to me XDP_REDIRECT is the most interesting one (see above). I
> >> >> think we could even drop the XDP_TX case and only do this for
> >> >> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
> >> >> I.e., it's possible to XDP_REDIRECT back to the same device, the
> frames
> >> >> will just take a slight detour up through the stack; but that could
> also
> >> >> be a good thing if it means we'll have to do less surgery to the
> drivers
> >> >> to implement this for two paths.
> >> >>
> >> >> It does have the same challenge as you outlined above, though: At
> that
> >> >> point the TX descriptor probably doesn't exist, so the driver NDO
> will
> >> >> have to do something else with the data; but maybe we can solve that
> >> >> without moving the hook into the driver itself somehow?
> >> >
> >> > Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
> >> > "transmit something out of xdp_rx hook" umbrella. We can maybe come
> up
> >> > with a skb-like-private metadata layout (as we've discussed
> previously
> >> > for skb) here as well? But not sure it would solve all the problems?
> >> > I'm thinking of an af_xdp case where it wants to program something
> >> > similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
> >> > support) or a checksum offload. Exposing access to the driver tx
> hooks
> >> > seems like the easiest way to get there?
> >>
> >> Well, I was thinking something like exposing driver kfuncs like what
> you
> >> implemented on RX, but having the program that calls them be the one in
> >> the existing devmap hook (each map entry is tied to a particular
> netdev,
> >> so we could use the same type of dev-bound logic as we do on RX). The
> >> driver wouldn't have a TX descriptor at this point, but it could have a
> >> driver-private area somewhere in the xdp_frame (if we make space for
> it)
> >> which the kfuncs could just write to in whichever format it wants, so
> >> that copying it to the descriptor later is just a memcpy().
> >
> > Yeah, that sounds similar to what we've discussed for the xdp->skb
> > path, right? Maybe it should be even some kind of extension to that?
> > On rx, we stash a bunch of metadata in that private area. If the frame
> > goes to the kernel stack -> put it into skb; if the frame goes back to
> > the wire -> the driver can parse it?
> Yeah, not a bad idea to combine those.
> >> There would still need to be a new hook for AF_XDP, but I think it
> could
> >> have the same semantics as a devmap prog (i.e., an XDP program type
> that
> >> runs right before TX, but after the destination device is selected); we
> >> could attach it to the socket, maybe? Doesn't have to be in the driver
> >> itself (just before the driver ndo is called for zc - I think?).
> >
> > Sounds sensible; will try to explore more, thx!
> Awesome, looking forward to seeing what you come up with :)
> -Toke
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-02-28 23:02 ` Stanislav Fomichev
@ 2023-02-28 23:08 ` Toke Høiland-Jørgensen
0 siblings, 0 replies; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-02-28 23:08 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: lsf-pc, bpf
Stanislav Fomichev <sdf@google.com> writes:
> On 02/28, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>
>> > On Mon, Feb 27, 2023 at 3:54 PM Toke Høiland-Jørgensen
>> <toke@kernel.org> wrote:
>> >>
>> >> Stanislav Fomichev <sdf@google.com> writes:
>> >>
>> >> > On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen
>> <toke@kernel.org> wrote:
>> >> >>
>> >> >> Stanislav Fomichev <sdf@google.com> writes:
>> >> >>
>> >> >> > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen
>> <toke@kernel.org> wrote:
>> >> >> >>
>> >> >> >> Stanislav Fomichev <sdf@google.com> writes:
>> >> >> >>
>> >> >> >> > I'd like to discuss a potential follow up for the
>> previous "XDP RX
>> >> >> >> > metadata" series [0].
>> >> >> >> >
>> >> >> >> > Now that we can access (a subset of) packet metadata at RX,
>> I'd like to
>> >> >> >> > explore the options where we can export some of that metadata
>> on TX. And
>> >> >> >> > also whether it might be possible to access some of the TX
>> completion
>> >> >> >> > metadata (things like TX timestamp).
>> >> >> >> >
>> >> >> >> > I'm currently trying to understand whether the same approach
>> I've used
>> >> >> >> > on RX could work at TX. By May I plan to have a bunch of
>> options laid
>> >> >> >> > out (currently considering XSK tx/compl programs and XDP
>> tx/compl
>> >> >> >> > programs) so we have something to discuss.
>> >> >> >>
>> >> >> >> I've been looking at ways of getting a TX-completion hook for
>> the XDP
>> >> >> >> queueing stuff as well. For that, I think it could work to just
>> hook
>> >> >> >> into xdp_return_frame(), but if you want to access hardware
>> metadata
>> >> >> >> it'll obviously have to be in the driver. A hook in the driver
>> could
>> >> >> >> certainly be used for the queueing return as well, though, which
>> may
>> >> >> >> help making it worth the trouble :)
>> >> >> >
>> >> >> > Yeah, I'd like to get to completion descriptors ideally; so
>> nothing
>> >> >> > better than a driver hook comes to mind so far :-(
>> >> >> > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly
>> so far).
>> >> >>
>> >> >> Is there any other use case for this than getting the TX timestamp?
>> Not
>> >> >> really sure what else those descriptors contain...
>> >> >
>> >> > I don't think so; at least looking at mlx5 and bnxt (the latter
>> >> > doesn't have a timestamp in the completion ring).
>> >> > So yeah, not sure, maybe that should be on the side and be AF_XDP
>> specific.
>> >> > And not even involve bpf, just put the tx tstamp somewhere in umem:
>> >> > setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
>> >> > &data_relative_offset, ..);
>> >> > OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
>> >> > have for eternity? (plus, this needs a driver "hook" for af_xdp
>> >> > anyway, so why not make it generic?)
>> >>
>> >> So since this is read-only in any case, we could just make it a
>> >> tracepoint instead of a whole new hook? That's what I was planning to
>> do
>> >> for xdp_return_frame(); we just need a way to refer back to the
>> original
>> >> frame, but we'd need to solve that anyway. Just letting XDP/AF_XDP
>> >> specify their own packet ID in some form, and make that part of the
>> >> tracepoint data, would probably be sufficient?
>> >
>> > That would probably mean a driver specific tracepoint (since we might
>> > need to get to the completion queue descriptor)?
>> > Idk, probably still makes sense to have something that works across
>> > different drivers?
>> > Or are you suggesting to just do fentry/mlx5e_free_xdpsq_desc and go
>> from there?
>> > Not sure I can get a umem frame, as you've mentioned; and also it
>> > doesn't look like cqe is there...
>
>> No, we'd define the tracepoint in one place (like along with the others
>> in include/trace/events/xdp.h), and have it include the fields we need
>> (ifindex, queue index, timestamp, some kind of packet ID). And then add
>> a call in each driver that will support this, wherever it makes sense in
>> that driver. This is what we do with the trace_xdp_exception() calls
>> already scattered through drivers in the RX handling code.
>
>> With this, a BPF listener can attach to just the one tracepoint, and get
>> events from all drivers that support it (but it can filter on ifindex
>> for the events it's interested in).
>
>> This probably doesn't scale indefinitely: it's possible to add new
>> fields to the tracepoint if we discover other uses than the timestamp,
>> but this would probably get unwieldy at some point. However, it's a
>> lightweight way to add a "hook" if we don't have any other use cases
>> than getting the timestamp for now.
>
>> > I guess the fact that it would arrive out-of-band (not in a umem
>> > frame) is a minor inconvenience, the userspace should be able to join
>> > the data together hopefully.
>
>> Yeah, that's why I suggested the user-defined packet ID: if the event
>> just includes that, the BPF or userspace program can do its own matching
>> depending on its needs...
>
> I guess we can implement this user-defined packet ID as a TX metadata
> as well? Some kind of u32 tag/mark that the XDP program can set? (and
> some new AF_XDP TX hook where we can set it up as well)
Yeah, good point, that would be an excellent way to resolve that
particular issue :)
-Toke
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-02-27 20:03 ` Stanislav Fomichev
2023-02-27 23:54 ` Toke Høiland-Jørgensen
@ 2023-03-03 7:42 ` Magnus Karlsson
2023-03-03 12:37 ` Toke Høiland-Jørgensen
2023-03-07 19:32 ` Jesper Dangaard Brouer
1 sibling, 2 replies; 16+ messages in thread
From: Magnus Karlsson @ 2023-03-03 7:42 UTC (permalink / raw)
To: Stanislav Fomichev; +Cc: Toke Høiland-Jørgensen, lsf-pc, bpf
On Mon, 27 Feb 2023 at 21:16, Stanislav Fomichev <sdf@google.com> wrote:
>
> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >
> > Stanislav Fomichev <sdf@google.com> writes:
> >
> > > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> > >>
> > >> Stanislav Fomichev <sdf@google.com> writes:
> > >>
> > >> > I'd like to discuss a potential follow up for the previous "XDP RX
> > >> > metadata" series [0].
> > >> >
> > >> > Now that we can access (a subset of) packet metadata at RX, I'd like to
> > >> > explore the options where we can export some of that metadata on TX. And
> > >> > also whether it might be possible to access some of the TX completion
> > >> > metadata (things like TX timestamp).
> > >> >
> > >> > I'm currently trying to understand whether the same approach I've used
> > >> > on RX could work at TX. By May I plan to have a bunch of options laid
> > >> > out (currently considering XSK tx/compl programs and XDP tx/compl
> > >> > programs) so we have something to discuss.
> > >>
> > >> I've been looking at ways of getting a TX-completion hook for the XDP
> > >> queueing stuff as well. For that, I think it could work to just hook
> > >> into xdp_return_frame(), but if you want to access hardware metadata
> > >> it'll obviously have to be in the driver. A hook in the driver could
> > >> certainly be used for the queueing return as well, though, which may
> > >> help making it worth the trouble :)
> > >
> > > Yeah, I'd like to get to completion descriptors ideally; so nothing
> > > better than a driver hook comes to mind so far :-(
> > > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far).
> >
> > Is there any other use case for this than getting the TX timestamp? Not
> > really sure what else those descriptors contain...
>
> I don't think so; at least looking at mlx5 and bnxt (the latter
> doesn't have a timestamp in the completion ring).
> So yeah, not sure, maybe that should be on the side and be AF_XDP specific.
> And not even involve bpf, just put the tx tstamp somewhere in umem:
> setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
> &data_relative_offset, ..);
> OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
> have for eternity? (plus, this needs a driver "hook" for af_xdp
> anyway, so why not make it generic?)
>
> > >> > I'd like to get some more input on whether applying the same idea on TX
> > >> > makes sense or not and whether there are any sensible alternatives.
> > >> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
> > >>
> > >> I believe that stranded because it was deemed not feasible to cover the
> > >> SKB TX path as well, which means it can't be symmetrical to the RX hook.
> > >> So we ended up with the in-devmap hook instead. I'm not sure if that's
> > >> made easier by multi-buf XDP, so that may be worth revisiting.
> > >>
> > >> For the TX metadata you don't really have to care about the skb path, I
> > >> suppose, so that may not matter too much either. However, at least for
> > >> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
> > >> I'm not sure if it's worth having a separate hook in the driver (with
> > >> all the added complexity and overhead that entails) just to set
> > >> metadata? That could just as well be done on push from higher up the
> > >> stack; per-driver kfuncs could still be useful for this, though.
> > >>
> > >> And of course something would be needed so that BPF programs can
> > >> process AF_XDP frames in the kernel before they hit the driver, but
> > >> again I'm not sure that needs to be a hook in the driver.
> > >
> > > Care to elaborate more on "push from higher up the stack"?
> >
> > I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted
> > by the stack calling ndo_xdp_xmit() in the driver with an array of
> > frames that are immediately put on the wire (see bq_xmit_all() in
> > devmap.c). So any metadata writing could be done at that point, since
> > the target driver is already known; there's even already a program hook
> > in there (used for in-devmap programs).
> >
> > > I've been thinking about mostly two cases:
> > > - XDP_TX - I think this one technically doesn't need an extra hook;
> > > all metadata manipulations can be done at xdp_rx? (however, not sure
> > > how real that is, since the descriptors are probably not exposed over
> > > there?)
> >
> > Well, to me XDP_REDIRECT is the most interesting one (see above). I
> > think we could even drop the XDP_TX case and only do this for
> > XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
> > I.e., it's possible to XDP_REDIRECT back to the same device, the frames
> > will just take a slight detour up through the stack; but that could also
> > be a good thing if it means we'll have to do less surgery to the drivers
> > to implement this for two paths.
> >
> > It does have the same challenge as you outlined above, though: At that
> > point the TX descriptor probably doesn't exist, so the driver NDO will
> > have to do something else with the data; but maybe we can solve that
> > without moving the hook into the driver itself somehow?
>
> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
> "transmit something out of xdp_rx hook" umbrella. We can maybe come up
> with a skb-like-private metadata layout (as we've discussed previously
> for skb) here as well? But not sure it would solve all the problems?
> I'm thinking of an af_xdp case where it wants to program something
> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
> support)
We have a patch set for this in the works. We need to finish the last
couple of tests, then optimize performance, and it is good to go. We
should be able to post it during the next cycle that starts next week.
Having tso/encap/tunneling and checksum offload accessible in the
AF_XDP Tx path would indeed be very useful.
> or a checksum offload. Exposing access to the driver tx hooks
> seems like the easiest way to get there?
>
> > > - AF_XDP TX - this one needs something deep in the driver (due to tx
> > > zc) to populate the descriptors?
> >
> > Yeah, this one is a bit more challenging, but having a way to process
> > AF_XDP frames in the kernel before they're sent out would be good in any
> > case (for things like policing what packets an AF_XDP application can
> > send in a cloud deployment, for instance). Would be best if we could
> > consolidate the XDP_REDIRECT and AF_XDP paths, I suppose...
> >
> > > - anything else?
> >
> > Well, see above ;)
> >
> > >> In any case, the above is just my immediate brain dump (I've been
> > >> mulling these things over for a while in relation to the queueing
> > >> stuff), and I'd certainly welcome more discussion on the subject! :)
> > >
> > > Awesome, thanks for the dump! Will try to keep you in the loop!
> >
> > Great, thanks!
> >
> > -Toke
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-03-03 7:42 ` Magnus Karlsson
@ 2023-03-03 12:37 ` Toke Høiland-Jørgensen
2023-03-03 17:16 ` Stanislav Fomichev
2023-03-07 19:32 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-03-03 12:37 UTC (permalink / raw)
To: Magnus Karlsson, Stanislav Fomichev; +Cc: lsf-pc, bpf
Magnus Karlsson <magnus.karlsson@gmail.com> writes:
> On Mon, 27 Feb 2023 at 21:16, Stanislav Fomichev <sdf@google.com> wrote:
>>
>> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> >
>> > Stanislav Fomichev <sdf@google.com> writes:
>> >
>> > > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>> > >>
>> > >> Stanislav Fomichev <sdf@google.com> writes:
>> > >>
>> > >> > I'd like to discuss a potential follow up for the previous "XDP RX
>> > >> > metadata" series [0].
>> > >> >
>> > >> > Now that we can access (a subset of) packet metadata at RX, I'd like to
>> > >> > explore the options where we can export some of that metadata on TX. And
>> > >> > also whether it might be possible to access some of the TX completion
>> > >> > metadata (things like TX timestamp).
>> > >> >
>> > >> > I'm currently trying to understand whether the same approach I've used
>> > >> > on RX could work at TX. By May I plan to have a bunch of options laid
>> > >> > out (currently considering XSK tx/compl programs and XDP tx/compl
>> > >> > programs) so we have something to discuss.
>> > >>
>> > >> I've been looking at ways of getting a TX-completion hook for the XDP
>> > >> queueing stuff as well. For that, I think it could work to just hook
>> > >> into xdp_return_frame(), but if you want to access hardware metadata
>> > >> it'll obviously have to be in the driver. A hook in the driver could
>> > >> certainly be used for the queueing return as well, though, which may
>> > >> help making it worth the trouble :)
>> > >
>> > > Yeah, I'd like to get to completion descriptors ideally; so nothing
>> > > better than a driver hook comes to mind so far :-(
>> > > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far).
>> >
>> > Is there any other use case for this than getting the TX timestamp? Not
>> > really sure what else those descriptors contain...
>>
>> I don't think so; at least looking at mlx5 and bnxt (the latter
>> doesn't have a timestamp in the completion ring).
>> So yeah, not sure, maybe that should be on the side and be AF_XDP specific.
>> And not even involve bpf, just put the tx tstamp somewhere in umem:
>> setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
>> &data_relative_offset, ..);
>> OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
>> have for eternity? (plus, this needs a driver "hook" for af_xdp
>> anyway, so why not make it generic?)
>>
>> > >> > I'd like to get some more input on whether applying the same idea on TX
>> > >> > makes sense or not and whether there are any sensible alternatives.
>> > >> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
>> > >>
>> > >> I believe that stranded because it was deemed not feasible to cover the
>> > >> SKB TX path as well, which means it can't be symmetrical to the RX hook.
>> > >> So we ended up with the in-devmap hook instead. I'm not sure if that's
>> > >> made easier by multi-buf XDP, so that may be worth revisiting.
>> > >>
>> > >> For the TX metadata you don't really have to care about the skb path, I
>> > >> suppose, so that may not matter too much either. However, at least for
>> > >> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
>> > >> I'm not sure if it's worth having a separate hook in the driver (with
>> > >> all the added complexity and overhead that entails) just to set
>> > >> metadata? That could just as well be done on push from higher up the
>> > >> stack; per-driver kfuncs could still be useful for this, though.
>> > >>
>> > >> And of course something would be needed so that BPF programs can
>> > >> process AF_XDP frames in the kernel before they hit the driver, but
>> > >> again I'm not sure that needs to be a hook in the driver.
>> > >
>> > > Care to elaborate more on "push from higher up the stack"?
>> >
>> > I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted
>> > by the stack calling ndo_xdp_xmit() in the driver with an array of
>> > frames that are immediately put on the wire (see bq_xmit_all() in
>> > devmap.c). So any metadata writing could be done at that point, since
>> > the target driver is already known; there's even already a program hook
>> > in there (used for in-devmap programs).
>> >
>> > > I've been thinking about mostly two cases:
>> > > - XDP_TX - I think this one technically doesn't need an extra hook;
>> > > all metadata manipulations can be done at xdp_rx? (however, not sure
>> > > how real that is, since the descriptors are probably not exposed over
>> > > there?)
>> >
>> > Well, to me XDP_REDIRECT is the most interesting one (see above). I
>> > think we could even drop the XDP_TX case and only do this for
>> > XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
>> > I.e., it's possible to XDP_REDIRECT back to the same device, the frames
>> > will just take a slight detour up through the stack; but that could also
>> > be a good thing if it means we'll have to do less surgery to the drivers
>> > to implement this for two paths.
>> >
>> > It does have the same challenge as you outlined above, though: At that
>> > point the TX descriptor probably doesn't exist, so the driver NDO will
>> > have to do something else with the data; but maybe we can solve that
>> > without moving the hook into the driver itself somehow?
>>
>> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
>> "transmit something out of xdp_rx hook" umbrella. We can maybe come up
>> with a skb-like-private metadata layout (as we've discussed previously
>> for skb) here as well? But not sure it would solve all the problems?
>> I'm thinking of an af_xdp case where it wants to program something
>> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
>> support)
>
> > We have a patch set for this in the works. Need to finish the last
> couple of tests then optimize performance and it is good to go. We
> should be able to post it during the next cycle that starts next week.
Uh, exciting, will look forward to seeing that! :)
-Toke
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-03-03 12:37 ` Toke Høiland-Jørgensen
@ 2023-03-03 17:16 ` Stanislav Fomichev
0 siblings, 0 replies; 16+ messages in thread
From: Stanislav Fomichev @ 2023-03-03 17:16 UTC (permalink / raw)
To: Toke Høiland-Jørgensen; +Cc: Magnus Karlsson, lsf-pc, bpf
On Fri, Mar 3, 2023 at 4:37 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>
> Magnus Karlsson <magnus.karlsson@gmail.com> writes:
>
> > On Mon, 27 Feb 2023 at 21:16, Stanislav Fomichev <sdf@google.com> wrote:
> >>
> >> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> >
> >> > Stanislav Fomichev <sdf@google.com> writes:
> >> >
> >> > > On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >> > >>
> >> > >> Stanislav Fomichev <sdf@google.com> writes:
> >> > >>
> >> > >> > I'd like to discuss a potential follow up for the previous "XDP RX
> >> > >> > metadata" series [0].
> >> > >> >
> >> > >> > Now that we can access (a subset of) packet metadata at RX, I'd like to
> >> > >> > explore the options where we can export some of that metadata on TX. And
> >> > >> > also whether it might be possible to access some of the TX completion
> >> > >> > metadata (things like TX timestamp).
> >> > >> >
> >> > >> > I'm currently trying to understand whether the same approach I've used
> >> > >> > on RX could work at TX. By May I plan to have a bunch of options laid
> >> > >> > out (currently considering XSK tx/compl programs and XDP tx/compl
> >> > >> > programs) so we have something to discuss.
> >> > >>
> >> > >> I've been looking at ways of getting a TX-completion hook for the XDP
> >> > >> queueing stuff as well. For that, I think it could work to just hook
> >> > >> into xdp_return_frame(), but if you want to access hardware metadata
> >> > >> it'll obviously have to be in the driver. A hook in the driver could
> >> > >> certainly be used for the queueing return as well, though, which may
> >> > >> help making it worth the trouble :)
> >> > >
> >> > > Yeah, I'd like to get to completion descriptors ideally; so nothing
> >> > > better than a driver hook comes to mind so far :-(
> >> > > (I'm eye-balling mlx5's mlx5e_free_xdpsq_desc AF_XDP path mostly so far).
> >> >
> >> > Is there any other use case for this than getting the TX timestamp? Not
> >> > really sure what else those descriptors contain...
> >>
> >> I don't think so; at least looking at mlx5 and bnxt (the latter
> >> doesn't have a timestamp in the completion ring).
> >> So yeah, not sure, maybe that should be on the side and be AF_XDP specific.
> >> And not even involve bpf, just put the tx tstamp somewhere in umem:
> >> setsockopt(xsk_fd, SOL_XDP, XSK_STAMP_TX_COMPLETION,
> >> &data_relative_offset, ..);
> >> OTOH, if it is only a timestamp now, it doesn't mean that's all we'd
> >> have for eternity? (plus, this needs a driver "hook" for af_xdp
> >> anyway, so why not make it generic?)
> >>
> > >> > >> > I'd like to get some more input on whether applying the same idea on TX
> >> > >> > makes sense or not and whether there are any sensible alternatives.
> >> > >> > (IIRC, there was an attempt to do XDP on egress that went nowhere).
> >> > >>
> >> > >> I believe that stranded because it was deemed not feasible to cover the
> >> > >> SKB TX path as well, which means it can't be symmetrical to the RX hook.
> >> > >> So we ended up with the in-devmap hook instead. I'm not sure if that's
> >> > >> made easier by multi-buf XDP, so that may be worth revisiting.
> >> > >>
> >> > >> For the TX metadata you don't really have to care about the skb path, I
> >> > >> suppose, so that may not matter too much either. However, at least for
> >> > >> the in-kernel xdp_frame the TX path is pushed from the stack anyway, so
> >> > >> I'm not sure if it's worth having a separate hook in the driver (with
> >> > >> all the added complexity and overhead that entails) just to set
> >> > >> metadata? That could just as well be done on push from higher up the
> >> > >> stack; per-driver kfuncs could still be useful for this, though.
> >> > >>
> >> > >> And of course something would be needed so that BPF programs can
> >> > >> process AF_XDP frames in the kernel before they hit the driver, but
> >> > >> again I'm not sure that needs to be a hook in the driver.
> >> > >
> >> > > Care to elaborate more on "push from higher up the stack"?
> >> >
> >> > I'm referring to the XDP_REDIRECT path here: xdp_frames are transmitted
> >> > by the stack calling ndo_xdp_xmit() in the driver with an array of
> >> > frames that are immediately put on the wire (see bq_xmit_all() in
> >> > devmap.c). So any metadata writing could be done at that point, since
> >> > the target driver is already known; there's even already a program hook
> >> > in there (used for in-devmap programs).
> >> >
> >> > > I've been thinking about mostly two cases:
> >> > > - XDP_TX - I think this one technically doesn't need an extra hook;
> >> > > all metadata manipulations can be done at xdp_rx? (however, not sure
> >> > > how real that is, since the descriptors are probably not exposed over
> >> > > there?)
> >> >
> >> > Well, to me XDP_REDIRECT is the most interesting one (see above). I
> >> > think we could even drop the XDP_TX case and only do this for
> >> > XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
> >> > I.e., it's possible to XDP_REDIRECT back to the same device, the frames
> >> > will just take a slight detour up through the stack; but that could also
> >> > be a good thing if it means we'll have to do less surgery to the drivers
> >> > to implement this for two paths.
> >> >
> >> > It does have the same challenge as you outlined above, though: At that
> >> > point the TX descriptor probably doesn't exist, so the driver NDO will
> >> > have to do something else with the data; but maybe we can solve that
> >> > without moving the hook into the driver itself somehow?
> >>
> >> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
> >> "transmit something out of xdp_rx hook" umbrella. We can maybe come up
> >> with a skb-like-private metadata layout (as we've discussed previously
> >> for skb) here as well? But not sure it would solve all the problems?
> >> I'm thinking of an af_xdp case where it wants to program something
> >> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
> >> support)
> >
> > We have a patch set for this in the works. Need to finish the last
> > couple of tests then optimize performance and it is good to go. We
> > should be able to post it during the next cycle that starts next week.
>
> Uh, exciting, will look forward to seeing that! :)
+1, looking forward!
> -Toke
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-03-03 7:42 ` Magnus Karlsson
2023-03-03 12:37 ` Toke Høiland-Jørgensen
@ 2023-03-07 19:32 ` Jesper Dangaard Brouer
2023-03-09 18:04 ` Stanislav Fomichev
1 sibling, 1 reply; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2023-03-07 19:32 UTC (permalink / raw)
To: Magnus Karlsson, Stanislav Fomichev
Cc: brouer, Toke Høiland-Jørgensen, lsf-pc, bpf,
xdp-hints@xdp-project.net
On 03/03/2023 08.42, Magnus Karlsson wrote:
> On Mon, 27 Feb 2023 at 21:16, Stanislav Fomichev <sdf@google.com> wrote:
>> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>> Stanislav Fomichev <sdf@google.com> writes:
>>>> On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>>>>
>>>>> Stanislav Fomichev <sdf@google.com> writes:
>>>>>
>>>>>> I'd like to discuss a potential follow up for the previous "XDP RX
>>>>>> metadata" series [0].
>>>>>>
>>>>>> Now that we can access (a subset of) packet metadata at RX, I'd like to
>>>>>> explore the options where we can export some of that metadata on TX. And
>>>>>> also whether it might be possible to access some of the TX completion
>>>>>> metadata (things like TX timestamp).
>>>>>>
IMHO it makes sense to see TX metadata as two separate operations.
(1) Metadata written into the TX descriptor.
(2) Metadata read when processing TX completion.
These operations happen at two different points in time, and thus likely
need different BPF hooks. Having BPF-progs running at each of these
points in time will allow us to e.g. implement BQL (which is relevant
to the XDP queuing effort).
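To make the BQL remark concrete, here is a minimal userspace sketch of why having BPF-progs at both points in time enables byte-queue-limit style accounting. Everything below (the struct, the hook names, the back-pressure policy) is invented for illustration and does not correspond to any existing kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: BQL-style accounting needs both hook points.
 * Names and the 'limit' policy are illustrative only. */

struct bql_state {
	size_t inflight_bytes; /* bytes posted to HW, not yet completed */
	size_t limit;          /* max bytes allowed in flight */
};

/* Would run at point (1): metadata written into the TX descriptor. */
static bool hook_tx_descriptor(struct bql_state *s, size_t pkt_len)
{
	if (s->inflight_bytes + pkt_len > s->limit)
		return false; /* queue "full": back-pressure the sender */
	s->inflight_bytes += pkt_len;
	return true;
}

/* Would run at point (2): TX completion processed. */
static void hook_tx_completion(struct bql_state *s, size_t pkt_len)
{
	s->inflight_bytes -= pkt_len; /* frees budget for new frames */
}
```

Without a prog at point (2) there is nowhere to decrement the in-flight count, which is why the two operations likely need separate hooks.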
>>>>>> I'm currently trying to understand whether the same approach I've used
>>>>>> on RX could work at TX. By May I plan to have a bunch of options laid
>>>>>> out (currently considering XSK tx/compl programs and XDP tx/compl
>>>>>> programs) so we have something to discuss.
>>>>>
>>>>> I've been looking at ways of getting a TX-completion hook for the XDP
>>>>> queueing stuff as well. For that, I think it could work to just hook
>>>>> into xdp_return_frame(), but if you want to access hardware metadata
>>>>> it'll obviously have to be in the driver. A hook in the driver could
>>>>> certainly be used for the queueing return as well, though, which may
>>>>> help making it worth the trouble :)
>>>>
>>>> Yeah, I'd like to get to completion descriptors ideally; so nothing
>>>> better than a driver hook comes to mind so far :-(
As Toke mentions, I'm also hoping we could leverage or extend the
xdp_return_frame() call. Or implicitly add the "hook" at the existing
xdp_return_frame() call. This is about operation (2) *reading* some
metadata at TX completion time.
Can this be mapped to the RX-kfuncs approach, with the driver extending
the call/structs with a pointer to the TX descriptor + adapter info, and
the BPF-prog/hook doing TX-kfunc calls into the driver (which knows how
to extract the completion data)?
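A TX-completion analogue of the RX metadata kfuncs could look roughly like the sketch below. The context struct, the descriptor layout, and the function name are all mock-ups chosen to mirror the RX-side `bpf_xdp_metadata_rx_timestamp()` pattern; none of them exist in the kernel:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the RX metadata kfunc approach for TX
 * completions: the driver hands the hook a context carrying a pointer
 * to its completion descriptor, and a per-driver "kfunc" knows how to
 * extract fields from it. All names are invented. */

struct mock_compl_desc {      /* stand-in for a HW completion entry */
	uint64_t hw_timestamp;
	bool has_timestamp;
};

struct xdp_tx_compl_ctx {     /* what the driver would pass the hook */
	struct mock_compl_desc *desc;
};

/* Per-driver extractor, analogous to bpf_xdp_metadata_rx_timestamp(). */
static int mock_xdp_metadata_tx_timestamp(const struct xdp_tx_compl_ctx *ctx,
					  uint64_t *timestamp)
{
	if (!ctx->desc->has_timestamp)
		return -1; /* this HW/queue doesn't report TX timestamps */
	*timestamp = ctx->desc->hw_timestamp;
	return 0;
}
```

The point of the kfunc shape is that hardware without the feature can simply return an error, exactly as on the RX side.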
[...]
>>> Well, to me XDP_REDIRECT is the most interesting one (see above). I
>>> think we could even drop the XDP_TX case and only do this for
>>> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
>>> I.e., it's possible to XDP_REDIRECT back to the same device, the frames
>>> will just take a slight detour up through the stack; but that could also
>>> be a good thing if it means we'll have to do less surgery to the drivers
>>> to implement this for two paths.
>>>
>>> It does have the same challenge as you outlined above, though: At that
>>> point the TX descriptor probably doesn't exist, so the driver NDO will
>>> have to do something else with the data; but maybe we can solve that
>>> without moving the hook into the driver itself somehow?
>>
>> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
>> "transmit something out of xdp_rx hook" umbrella. We can maybe come up
>> with a skb-like-private metadata layout (as we've discussed previously
>> for skb) here as well? But not sure it would solve all the problems?
This is operation (1) writing metadata into the TX descriptor.
In this case we have a metadata mapping problem, from RX on one device
driver to TX on another device driver. As you say, we also need to map
this to SKBs, which have a fairly static set of metadata.
For the most common metadata offloads (like TX-checksum, TX-vlan) I
think it makes sense to store those in the xdp_frame area (used for SKB
mapping) and re-use them at TX when writing into the TX descriptor.
BUT there are also metadata TX offloads, like asking for a
specific Launch-Time for a packet, that need a more flexible approach.
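One way to picture the "common offloads in the xdp_frame area" idea is a compact hints blob with validity flags, so a TX path only consumes the fields a program actually set. The layout below is purely illustrative, not an existing kernel structure:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: a TX-hints blob that could live in the xdp_frame
 * metadata area, with flag bits saying which fields were filled in. */

#define TX_HINT_CSUM        (1u << 0)
#define TX_HINT_VLAN        (1u << 1)
#define TX_HINT_LAUNCH_TIME (1u << 2) /* the "more flexible" case */

struct mock_xdp_tx_hints {
	uint32_t flags;        /* which hints below are valid */
	uint16_t csum_offset;  /* where HW should write the checksum */
	uint16_t vlan_tci;     /* VLAN tag to insert */
	uint64_t launch_time;  /* requested transmit time, ns */
};

/* A program requesting a launch time sets the field and its flag. */
static void request_launch_time(struct mock_xdp_tx_hints *h, uint64_t t_ns)
{
	h->launch_time = t_ns;
	h->flags |= TX_HINT_LAUNCH_TIME;
}
```

A fixed struct like this works for the common SKB-mappable offloads; the open question in the thread is whether less common hints (launch time and friends) fit the same mold or need per-driver kfuncs.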
>> I'm thinking of an af_xdp case where it wants to program something
>> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
>> support) or a checksum offload. Exposing access to the driver tx hooks
>> seems like the easiest way to get there?
>>
>>>> - AF_XDP TX - this one needs something deep in the driver (due to tx
>>>> zc) to populate the descriptors?
>>>
>>> Yeah, this one is a bit more challenging, but having a way to process
>>> AF_XDP frames in the kernel before they're sent out would be good in any
>>> case (for things like policing what packets an AF_XDP application can
>>> send in a cloud deployment, for instance). Would be best if we could
>>> consolidate the XDP_REDIRECT and AF_XDP paths, I suppose...
>>>
I agree, it would be best if we can consolidate the XDP_REDIRECT and
AF_XDP paths, else we have to re-implement the same for the AF_XDP xmit path
(and maintain both paths). I also agree that being able to police what
packets an AF_XDP application can send is a useful feature (e.g. cloud
deployments).
Looking forward to collaborate on this work!
--Jesper
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-03-07 19:32 ` Jesper Dangaard Brouer
@ 2023-03-09 18:04 ` Stanislav Fomichev
2023-03-10 11:09 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 16+ messages in thread
From: Stanislav Fomichev @ 2023-03-09 18:04 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Magnus Karlsson, brouer, Toke Høiland-Jørgensen, lsf-pc,
bpf, xdp-hints@xdp-project.net
On Tue, Mar 7, 2023 at 11:32 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 03/03/2023 08.42, Magnus Karlsson wrote:
> > On Mon, 27 Feb 2023 at 21:16, Stanislav Fomichev <sdf@google.com> wrote:
> >> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>> Stanislav Fomichev <sdf@google.com> writes:
> >>>> On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
> >>>>>
> >>>>> Stanislav Fomichev <sdf@google.com> writes:
> >>>>>
> >>>>>> I'd like to discuss a potential follow up for the previous "XDP RX
> >>>>>> metadata" series [0].
> >>>>>>
> >>>>>> Now that we can access (a subset of) packet metadata at RX, I'd like to
> >>>>>> explore the options where we can export some of that metadata on TX. And
> >>>>>> also whether it might be possible to access some of the TX completion
> >>>>>> metadata (things like TX timestamp).
> >>>>>>
>
> IMHO it makes sense to see TX metadata as two separate operations.
>
> (1) Metadata written into the TX descriptor.
> (2) Metadata read when processing TX completion.
>
> These operations happen at two different points in time, and thus likely
> need different BPF hooks. Having BPF-progs running at each of these
> points in time will allow us to e.g. implement BQL (which is relevant
> to the XDP queuing effort).
I guess for (2) the question here is: is it worth having a separate
hook? Or will a simple tracepoint, as Toke suggested, be enough? For
BQL purposes, we can still attach a prog to that tracepoint, right?
> >>>>>> I'm currently trying to understand whether the same approach I've used
> >>>>>> on RX could work at TX. By May I plan to have a bunch of options laid
> >>>>>> out (currently considering XSK tx/compl programs and XDP tx/compl
> >>>>>> programs) so we have something to discuss.
> >>>>>
> >>>>> I've been looking at ways of getting a TX-completion hook for the XDP
> >>>>> queueing stuff as well. For that, I think it could work to just hook
> >>>>> into xdp_return_frame(), but if you want to access hardware metadata
> >>>>> it'll obviously have to be in the driver. A hook in the driver could
> >>>>> certainly be used for the queueing return as well, though, which may
> >>>>> help making it worth the trouble :)
> >>>>
> >>>> Yeah, I'd like to get to completion descriptors ideally; so nothing
> >>>> better than a driver hook comes to mind so far :-(
>
> As Toke mentions, I'm also hoping we could leverage or extend the
> xdp_return_frame() call. Or implicitly add the "hook" at the existing
> xdp_return_frame() call. This is about operation (2) *reading* some
> metadata at TX completion time.
Ack, noted, thx. Although, at least for mlx5e_free_xdpsq_desc, I don't
see it being called for the af_xdp tx path. But maybe that's something
we can amend in a couple of places (so xdp_return_frame would handle
most xdp cases, and some new tbd func for af_xdp tx)?
> Can this be mapped to the RX-kfuncs approach(?), by driver extending
> (call/structs) with pointer to TX-desc + adaptor info and BPF-prog/hook
> doing TX-kfuncs calls into driver (that knows how to extract completion
> data).
Yeah, that seems like a natural thing to do here.
>
> [...]
> >>> Well, to me XDP_REDIRECT is the most interesting one (see above). I
> >>> think we could even drop the XDP_TX case and only do this for
> >>> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
> >>> I.e., it's possible to XDP_REDIRECT back to the same device, the frames
> >>> will just take a slight detour up through the stack; but that could also
> >>> be a good thing if it means we'll have to do less surgery to the drivers
> >>> to implement this for two paths.
> >>>
> >>> It does have the same challenge as you outlined above, though: At that
> >>> point the TX descriptor probably doesn't exist, so the driver NDO will
> >>> have to do something else with the data; but maybe we can solve that
> >>> without moving the hook into the driver itself somehow?
> >>
> >> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
> >> "transmit something out of xdp_rx hook" umbrella. We can maybe come up
> >> with a skb-like-private metadata layout (as we've discussed previously
> >> for skb) here as well? But not sure it would solve all the problems?
>
> This is operation (1) writing metadata into the TX descriptor.
> In this case we have a metadata mapping problem, from RX on one device
> driver to TX on another device driver. As you say, we also need to map
> this to SKBs, which have a fairly static set of metadata.
>
> For the most common metadata offloads (like TX-checksum, TX-vlan) I
> think it makes sense to store those in the xdp_frame area (used for SKB
> mapping) and re-use them at TX when writing into the TX descriptor.
[..]
> BUT there are also metadata TX offloads, like asking for a
> specific Launch-Time for a packet, that need a more flexible approach.
Why can't these go into the same "common" xdp_frame area?
> >> I'm thinking of an af_xdp case where it wants to program something
> >> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
> >> support) or a checksum offload. Exposing access to the driver tx hooks
> >> seems like the easiest way to get there?
> >>
> >>>> - AF_XDP TX - this one needs something deep in the driver (due to tx
> >>>> zc) to populate the descriptors?
> >>>
> >>> Yeah, this one is a bit more challenging, but having a way to process
> >>> AF_XDP frames in the kernel before they're sent out would be good in any
> >>> case (for things like policing what packets an AF_XDP application can
> >>> send in a cloud deployment, for instance). Would be best if we could
> >>> consolidate the XDP_REDIRECT and AF_XDP paths, I suppose...
> >>>
>
> I agree, it would be best if we can consolidate the XDP_REDIRECT and
> AF_XDP paths, else we have to re-implement the same for AF_XDP xmit path
> (and maintain both paths). I also agree that being able to police what
> packets an AF_XDP application can send is a useful feature (e.g. cloud
> deployments).
>
> Looking forward to collaborate on this work!
> --Jesper
Thank you for the comments! So it looks like two things we potentially
need to do/agree upon:
1. User-facing API. One hook + tracepoint vs two hooks (and at what
level: af_xdp vs xdp). I'll try to focus on that first (waiting for
af_xdp patches that Magnus mentioned).
2. Potentially internal refactoring to consolidate XDP_REDIRECT+AF_XDP
(seems like something we should be able to discuss as we go; aka
implementation details)
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
2023-03-09 18:04 ` Stanislav Fomichev
@ 2023-03-10 11:09 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 16+ messages in thread
From: Jesper Dangaard Brouer @ 2023-03-10 11:09 UTC (permalink / raw)
To: Stanislav Fomichev, Jesper Dangaard Brouer
Cc: brouer, Magnus Karlsson, Toke Høiland-Jørgensen, lsf-pc,
bpf, xdp-hints@xdp-project.net
On 09/03/2023 19.04, Stanislav Fomichev wrote:
> On Tue, Mar 7, 2023 at 11:32 AM Jesper Dangaard Brouer
>>
>> On 03/03/2023 08.42, Magnus Karlsson wrote:
>>> On Mon, 27 Feb 2023 at 21:16, Stanislav Fomichev <sdf@google.com> wrote:
>>>> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>>>> Stanislav Fomichev <sdf@google.com> writes:
>>>>>> On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>>>>>>
>>>>>>> Stanislav Fomichev <sdf@google.com> writes:
>>>>>>>
>>>>>>>> I'd like to discuss a potential follow up for the previous "XDP RX
>>>>>>>> metadata" series [0].
>>>>>>>>
>>>>>>>> Now that we can access (a subset of) packet metadata at RX, I'd like to
>>>>>>>> explore the options where we can export some of that metadata on TX. And
>>>>>>>> also whether it might be possible to access some of the TX completion
>>>>>>>> metadata (things like TX timestamp).
>>>>>>>>
>>
>> IMHO it makes sense to see TX metadata as two separate operations.
>>
>> (1) Metadata written into the TX descriptor.
>> (2) Metadata read when processing TX completion.
>>
>> These operations happen at two different points in time. Thus likely
>> need different BPF hooks. Having BPF-progs running at each of these
>> points in time, will allow us to e.g. implement BQL (which is relevant
>> to XDP queuing effort).
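To make the BQL connection concrete: the two operations map naturally to an increment at descriptor-write time and a decrement at completion time. A userspace C sketch of that accounting (struct and function names are invented for illustration, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical BQL-style state: bytes queued to HW but not yet completed. */
struct bql_state {
    unsigned int inflight;
    unsigned int limit;
};

/* Operation (1): run when metadata is written into the TX descriptor. */
static bool bql_tx_hook(struct bql_state *s, unsigned int bytes)
{
    if (s->inflight + bytes > s->limit)
        return false;           /* queue full: back-pressure the sender */
    s->inflight += bytes;
    return true;
}

/* Operation (2): run when the TX completion for the frame is processed. */
static void bql_completion_hook(struct bql_state *s, unsigned int bytes)
{
    s->inflight -= bytes;
}
```

Without a hook (or tracepoint) at point (2), the inflight counter can never be decremented, which is why completion-time visibility matters for the queueing effort.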
>
> I guess for (2) the question here is: is it worth having a separate
> hook? Or will a simple tracepoint as Toke suggested be enough? For
> BQL purposes, we can still attach a prog to that tracepoint, right?
>
Yes, see below; I'm hoping for (2) we can avoid a separate hook and
leverage xdp_return_frame() somehow.
For (1) "write into the TX descriptor", it happens so late in the TX
path that I don't think we can avoid a driver TX hook. Even if we come
up with a common struct that drivers can consume hints from, we will
still need to modify the drivers to do the translation into the hardware
specific TX desc format. If we have to modify the drivers anyhow, then
adding a TX BPF-hook would likely be worthwhile for the flexibility this
creates.
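A rough sketch of that unavoidable per-driver translation step: even given a common hints struct, each driver still has to copy the fields into its own descriptor layout. All struct and flag names below are invented for illustration; real NIC descriptor formats differ:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical driver-independent TX hints a BPF prog could fill in. */
struct xdp_tx_hints {
    uint16_t vlan_tci;      /* VLAN tag to insert, 0 = none */
    uint16_t csum_start;    /* checksum offload: where to start */
    uint16_t csum_offset;   /* where to store the result */
};

/* Made-up HW-specific TX descriptor; every NIC has its own layout. */
struct hw_tx_desc {
    uint32_t flags;
    uint16_t vlan;
    uint16_t csum_cmd;
};

#define HW_TX_VLAN  0x1u
#define HW_TX_CSUM  0x2u

/* The per-driver step that cannot be avoided: hints -> HW format. */
static void drv_fill_tx_desc(struct hw_tx_desc *d,
                             const struct xdp_tx_hints *h)
{
    d->flags = 0;
    if (h->vlan_tci) {
        d->flags |= HW_TX_VLAN;
        d->vlan = h->vlan_tci;
    }
    if (h->csum_offset) {
        d->flags |= HW_TX_CSUM;
        d->csum_cmd = (uint16_t)(h->csum_start << 8 | h->csum_offset);
    }
}
```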
>>>>>>>> I'm currently trying to understand whether the same approach I've used
>>>>>>>> on RX could work at TX. By May I plan to have a bunch of options laid
>>>>>>>> out (currently considering XSK tx/compl programs and XDP tx/compl
>>>>>>>> programs) so we have something to discuss.
>>>>>>>
>>>>>>> I've been looking at ways of getting a TX-completion hook for the XDP
>>>>>>> queueing stuff as well. For that, I think it could work to just hook
>>>>>>> into xdp_return_frame(), but if you want to access hardware metadata
>>>>>>> it'll obviously have to be in the driver. A hook in the driver could
>>>>>>> certainly be used for the queueing return as well, though, which may
>>>>>>> help making it worth the trouble :)
>>>>>>
>>>>>> Yeah, I'd like to get to completion descriptors ideally; so nothing
>>>>>> better than a driver hook comes to mind so far :-(
>>
>> As Toke mentions, I'm also hoping we could leverage or extend the
>> xdp_return_frame() call. Or implicitly add the "hook" at the existing
>> xdp_return_frame() call. This is about operation (2) *reading* some
>> metadata at TX completion time.
>
> Ack, noted, thx. Although, at least for mlx5e_free_xdpsq_desc, I don't
> see it being called for the af_xdp tx path. But maybe that's something
> we can amend in a couple of places (so xdp_return_frame would handle
> most xdp cases, and some new tbd func for af_xdp tx)?
>
Looking at mlx5e_free_xdpsq_desc(), it has a switch handling the
different frame types. It wouldn't be much work, as you say, to amend
the code and add e.g. a tracepoint/hook catching more frame types (both
xdp and af_xdp).
>> Can this be mapped to the RX-kfuncs approach, with the driver
>> extending the call/structs with a pointer to the TX-desc + adaptor
>> info, and the BPF-prog/hook making TX-kfunc calls into the driver
>> (which knows how to extract the completion data)?
>
> Yeah, that seems like a natural thing to do here.
Great. I hope drivers will not have hidden or freed the TX completion
descriptor before our BPF-prog gets a chance to read this data.
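Mirroring the RX-kfunc model, a completion hook could hand the BPF program an opaque context and let a driver-provided kfunc pull fields out of the still-live completion descriptor. A userspace C sketch of the shape, with all names invented (this is not an existing kernel interface):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque per-driver completion context, analogous to xdp_md on RX. */
struct xdp_tx_compl_ctx {
    const void *hw_cqe;     /* driver's completion queue entry */
};

/* Fake completion entry standing in for a HW-specific layout. */
struct fake_cqe {
    uint64_t tx_timestamp;
};

/* "Kfunc" the driver would implement; returns 0 on success.
 * Only valid while the driver still owns the descriptor, which is
 * exactly the lifetime concern raised above. */
static int xdp_metadata_tx_timestamp(const struct xdp_tx_compl_ctx *ctx,
                                     uint64_t *ts)
{
    const struct fake_cqe *cqe = ctx->hw_cqe;
    if (!cqe)
        return -1;          /* descriptor already freed/hidden by driver */
    *ts = cqe->tx_timestamp;
    return 0;
}
```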
>>
>> [...]
>>>>> Well, to me XDP_REDIRECT is the most interesting one (see above). I
>>>>> think we could even drop the XDP_TX case and only do this for
>>>>> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
>>>>> I.e., it's possible to XDP_REDIRECT back to the same device, the frames
>>>>> will just take a slight detour up through the stack; but that could also
>>>>> be a good thing if it means we'll have to do less surgery to the drivers
>>>>> to implement this for two paths.
>>>>>
>>>>> It does have the same challenge as you outlined above, though: At that
>>>>> point the TX descriptor probably doesn't exist, so the driver NDO will
>>>>> have to do something else with the data; but maybe we can solve that
>>>>> without moving the hook into the driver itself somehow?
>>>>
>>>> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
>>>> "transmit something out of xdp_rx hook" umbrella. We can maybe come up
>>>> with a skb-like-private metadata layout (as we've discussed previously
>>>> for skb) here as well? But not sure it would solve all the problems?
>>
>> This is operation (1) writing metadata into the TX descriptor.
>> In this case we have a metadata mapping problem, from RX on one device
>> driver to TX on another device driver. As you say, we also need to map
>> this SKBs, which have a fairly static set of metadata.
>>
>> For the most common metadata offloads (like TX-checksum, TX-vlan) I
>> think it makes sense to store those in the xdp_frame area (used for SKB
>> mapping) and re-use these at TX when writing into the TX descriptor.
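As a concrete shape for that idea: a small, fixed-size struct carried in the xdp_frame metadata area, filled from the RX descriptor and consumed at TX. The layout below is invented for illustration (it is not the actual xdp_frame definition), but it shows why the area is size-constrained:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-size hints block stored in the xdp_frame area,
 * filled at RX and consumed when writing the TX descriptor. */
struct xdp_common_tx_meta {
    uint32_t flags;         /* which fields below are valid */
    uint16_t vlan_tci;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t reserved;
};

#define XDP_META_VLAN  (1u << 0)
#define XDP_META_CSUM  (1u << 1)

/* The area is shared with packet headroom, so it must stay small;
 * the 32-byte budget here is an arbitrary stand-in. */
_Static_assert(sizeof(struct xdp_common_tx_meta) <= 32,
               "common TX metadata must fit a constrained area");
```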
>
> [..]
>
>> BUT there are also metadata TX offloads, like asking for a specific
>> Launch-Time for a packet, that need a more flexible approach.
>
> Why can't these go into the same "common" xdp_frame area?
>
For several reasons:
1. The "common" area in xdp_frame will be size constrained.
2. The Launch-Time for a packet is not something we *read* from the RX
descriptor, thus having room to store it in common area seems
strange.
3. The Launch-Time for a packet is something we likely calculate on the
fly based on "time queue" state at our TX-hook egress. See TC-BPF
code example for FQ-pacing here[1].
4. We likely need an XDP queueing layer (Toke's work) to handle the HW
    limitations of the Launch-Time feature, as HW is limited in how far
    into the future we can schedule data (for chips i210 and i225 see [2]).
[1]
https://github.com/xdp-project/bpf-examples/blob/master/traffic-pacing-edt/edt_pacer_vlan.c
[2]
https://github.com/xdp-project/xdp-project/blob/master/areas/tsn/code01_follow_qdisc_TSN_offload.org
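Points 3 and 4 together can be sketched as an EDT-style launch time computed at egress and clamped to a HW horizon. The pattern follows the referenced edt_pacer example, but all constants and names below are illustrative (in particular, the horizon is not the real i210/i225 limit):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Illustrative HW horizon: how far into the future the NIC can
 * schedule a launch time; real limits depend on the chip. */
#define HW_HORIZON_NS (500ULL * 1000 * 1000)   /* 0.5 s */

struct pacer {
    uint64_t rate_bps;      /* target rate in bytes per second */
    uint64_t next_tstamp;   /* earliest departure of the next packet */
};

/* Compute a launch time for one packet; returns 0 if the required
 * time exceeds the HW horizon (the packet would have to wait in a
 * SW queue, i.e. the XDP queueing layer from point 4). */
static int pacer_launch_time(struct pacer *p, uint64_t now,
                             uint32_t len, uint64_t *launch)
{
    uint64_t delay = (uint64_t)len * NSEC_PER_SEC / p->rate_bps;
    uint64_t t = p->next_tstamp > now ? p->next_tstamp : now;

    if (t - now > HW_HORIZON_NS)
        return 0;           /* beyond what HW can schedule */
    *launch = t;
    p->next_tstamp = t + delay;
    return 1;
}
```

The launch time is per-packet state computed on the fly, which is why it does not fit the "copied from the RX descriptor" model of the common area.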
>>>> I'm thinking of an af_xdp case where it wants to program something
>>>> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
>>>> support) or a checksum offload. Exposing access to the driver tx hooks
>>>> seems like the easiest way to get there?
>>>>
>>>>>> - AF_XDP TX - this one needs something deep in the driver (due to tx
>>>>>> zc) to populate the descriptors?
>>>>>
>>>>> Yeah, this one is a bit more challenging, but having a way to process
>>>>> AF_XDP frames in the kernel before they're sent out would be good in any
>>>>> case (for things like policing what packets an AF_XDP application can
>>>>> send in a cloud deployment, for instance). Would be best if we could
>>>>> consolidate the XDP_REDIRECT and AF_XDP paths, I suppose...
>>>>>
>>
>> I agree, it would be best if we can consolidate the XDP_REDIRECT and
>> AF_XDP paths, else we have to re-implement the same for AF_XDP xmit path
>> (and maintain both paths). I also agree that being able to police what
>> packets an AF_XDP application can send is a useful feature (e.g. cloud
>> deployments).
>>
>> Looking forward to collaborate on this work!
>> --Jesper
>
> Thank you for the comments! So it looks like two things we potentially
> need to do/agree upon:
>
> 1. User-facing API. One hook + tracepoint vs two hooks (and at what
> level: af_xdp vs xdp). I'll try to focus on that first (waiting for
> af_xdp patches that Magnus mentioned).
>
Yes, I think we need to dig into the code and figure out what options
are doable, balancing the future maintenance cost.
> 2. Potentially internal refactoring to consolidate XDP_REDIRECT+AF_XDP
> (seems like something we should be able to discuss as we go; aka
> implementation details)
>
Sounds like we have a plan going forward :-)
--Jesper
end of thread, other threads:[~2023-03-10 11:10 UTC | newest]
Thread overview: 16+ messages
2023-02-23 22:23 [LSF/MM/BPF TOPIC] XDP metadata for TX Stanislav Fomichev
2023-02-23 23:22 ` Toke Høiland-Jørgensen
2023-02-23 23:51 ` Stanislav Fomichev
2023-02-27 14:17 ` Toke Høiland-Jørgensen
2023-02-27 20:03 ` Stanislav Fomichev
2023-02-27 23:54 ` Toke Høiland-Jørgensen
2023-02-28 21:18 ` Stanislav Fomichev
2023-02-28 22:09 ` Toke Høiland-Jørgensen
2023-02-28 23:02 ` Stanislav Fomichev
2023-02-28 23:08 ` Toke Høiland-Jørgensen
2023-03-03 7:42 ` Magnus Karlsson
2023-03-03 12:37 ` Toke Høiland-Jørgensen
2023-03-03 17:16 ` Stanislav Fomichev
2023-03-07 19:32 ` Jesper Dangaard Brouer
2023-03-09 18:04 ` Stanislav Fomichev
2023-03-10 11:09 ` Jesper Dangaard Brouer