From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Stanislav Fomichev <sdf@google.com>
Cc: "Alexei Starovoitov" <ast@kernel.org>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"Andrii Nakryiko" <andrii@kernel.org>,
"Martin KaFai Lau" <martin.lau@linux.dev>,
"Song Liu" <song@kernel.org>, "Yonghong Song" <yhs@fb.com>,
"John Fastabend" <john.fastabend@gmail.com>,
"KP Singh" <kpsingh@kernel.org>, "Hao Luo" <haoluo@google.com>,
"Jiri Olsa" <jolsa@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
"Eric Dumazet" <edumazet@google.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
"Jesper Dangaard Brouer" <hawk@kernel.org>,
"Björn Töpel" <bjorn@kernel.org>,
"Magnus Karlsson" <magnus.karlsson@intel.com>,
"Maciej Fijalkowski" <maciej.fijalkowski@intel.com>,
"Jonathan Lemon" <jonathan.lemon@gmail.com>,
"Mykola Lysenko" <mykolal@fb.com>,
"Kumar Kartikeya Dwivedi" <memxor@gmail.com>,
netdev@vger.kernel.org, bpf@vger.kernel.org,
"Freysteinn Alfredsson" <freysteinn.alfredsson@kau.se>,
"Cong Wang" <xiyou.wangcong@gmail.com>
Subject: Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
Date: Wed, 13 Jul 2022 23:52:07 +0200 [thread overview]
Message-ID: <877d4gpto8.fsf@toke.dk> (raw)
In-Reply-To: <CAKH8qBtdnku7StcQ-SamadvAF==DRuLLZO94yOR1WJ9Bg=uX1w@mail.gmail.com>
Stanislav Fomichev <sdf@google.com> writes:
> On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Packet forwarding is an important use case for XDP, which offers
>> significant performance improvements compared to forwarding using the
>> regular networking stack. However, XDP currently offers no mechanism to
>> delay, queue or schedule packets, which limits the practical uses for
>> XDP-based forwarding to those where the capacity of input and output links
>> always match each other (i.e., no rate transitions or many-to-one
>> forwarding). It also prevents an XDP-based router from doing any kind of
>> traffic shaping or reordering to enforce policy.
>>
>> This series represents a first RFC of our attempt to remedy this lack. The
>> code in these patches is functional, but needs additional testing and
>> polishing before being considered for merging. I'm posting it here as an
>> RFC to get some early feedback on the API and overall design of the
>> feature.
>>
>> DESIGN
>>
>> The design consists of three components: A new map type for storing XDP
>> frames, a new 'dequeue' program type that will run in the TX softirq to
>> provide the stack with packets to transmit, and a set of helpers to dequeue
>> packets from the map, optionally drop them, and to schedule an interface
>> for transmission.
>>
>> The new map type is modelled on the PIFO data structure proposed in the
>> literature[0][1]. It represents a priority queue where packets can be
>> enqueued at any priority, but packets are always dequeued from the head. From the
>> XDP side, the map is simply used as a target for the bpf_redirect_map()
>> helper, where the target index is the desired priority.
>
> I have the same question I asked on the series from Cong:
> Any considerations for existing carousel/edt-like models?
Well, the reason for the addition in patch 5 (continuously increasing
priorities) is exactly to be able to implement EDT-like behaviour, where
the priority is used as time units to clock out packets.
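To make that concrete, here is a rough sketch of what an EDT-style enqueue could look like from the XDP side. This is illustrative only: the map name and the per-flow delay value are made up, while bpf_ktime_get_ns() and bpf_redirect_map() are existing helpers (passing a full u64 rank relies on patch 2 in this series widening the key argument):

```c
/* Sketch: EDT-style pacing by using nanosecond timestamps as PIFO
 * ranks. "pifo_map" and the pacing delay are placeholders for
 * illustration, not names from the series. */
SEC("xdp")
int xdp_edt_enqueue(struct xdp_md *ctx)
{
	__u64 now = bpf_ktime_get_ns();
	__u64 delay = 10000; /* placeholder: per-flow pacing delay in ns */

	/* Rank = earliest departure time; since the PIFO always dequeues
	 * the lowest rank first, packets clock out in timestamp order. */
	return bpf_redirect_map(&pifo_map, now + delay, 0);
}
```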
> Can we make the map flexible enough to implement different qdisc
> policies?
That's one of the things we want to be absolutely sure about. We are
starting out with the PIFO map type because the literature makes a good
case that it is flexible enough to implement all conceivable policies.
The goal of the test harness linked as note [4] is to actually examine
this; Frey is our PhD student working on this bit.
Thus far we haven't hit any limitations on this, but we'll need to add
more policies before we are done with this. Another consideration is
performance, of course, so we're also planning to do a comparison with a
more traditional "bunch of FIFO queues" type data structure for at least
a subset of the algorithms. Kartikeya also had an idea for an
alternative way to implement a priority queue using (semi-)lockless
skiplists, which may turn out to perform better.
If there's any particular policy/algorithm you'd like to see included in
this evaluation, please do let us know, BTW! :)
>> The dequeue program type is a new BPF program type that is attached to an
>> interface; when an interface is scheduled for transmission, the stack will
>> execute the attached dequeue program and, if it returns a packet to
>> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
>> driver function.
>>
>> The dequeue program can obtain packets by pulling them out of a PIFO map
>> using the new bpf_packet_dequeue() helper. This returns a pointer to an
>> xdp_md structure, which can be dereferenced to obtain packet data and
>> data_meta pointers like in an XDP program. The returned packets are also
>> reference counted, meaning the verifier enforces that the dequeue program
>> either drops the packet (with the bpf_packet_drop() helper), or returns it
>> for transmission. Finally, a helper is added that can be used to actually
>> schedule an interface for transmission using the dequeue program type; this
>> helper can be called from both XDP and dequeue programs.
>>
>> PERFORMANCE
>>
>> Preliminary performance tests indicate about 50ns overhead of adding
>> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
>> overhead (but still 2x the forwarding performance of the netstack):
>>
>> xdp_fwd   : 4.7 Mpps (213 ns/pkt)
>> xdp_fwd -Q: 3.8 Mpps (263 ns/pkt)
>> netstack  : 2.0 Mpps (500 ns/pkt)
>>
>> RELATION TO BPF QDISC
>>
>> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
>> particular the use of a map to store packets. This is no accident, as we've
>> had ongoing discussions for a while now. I have no great hope that we can
>> completely converge the two efforts into a single BPF-based queueing
>> API (as has been discussed before[3], consolidating the SKB and XDP paths
>> is challenging). Rather, I'm hoping that we can converge the designs enough
>> that we can share BPF code between XDP and qdisc layers using common
>> functions, like it's possible to do with XDP and TC-BPF today. This would
>> imply agreeing on the map type and API, and possibly on the set of helpers
>> available to the BPF programs.
>
> What would be the big difference for the map wrt xdp_frame vs sk_buff
> excluding all obvious stuff like locking/refcnt?
I expect it would be quite straightforward to just add a second subtype
of the PIFO map in this series that holds skbs. In fact, I think that
from the BPF side, the whole model implemented here would be possible to
carry over to the qdisc layer more or less wholesale. Some other
features of the qdisc layer, like locking, classes, and
multi-CPU/multi-queue management may be trickier, but I'm not sure how
much of that we should expose in a BPF qdisc anyway (as you may have
noticed I commented on Cong's series to this effect regarding the
classful qdiscs).
>> PATCH STRUCTURE
>>
>> This series consists of a total of 17 patches, as follows:
>>
>> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
>> patches.
>
> Seems like these can go separately without holding the rest?
Yeah, guess so? They don't really provide much benefit without the users
later in the series, though, so not sure there's much point in sending
them separately?
>> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
>> program type.
>
> [...]
>
>> Patches 7-10 add the dequeue helpers and the verifier features needed to
>> recognise packet pointers, reference count them, and allow dereferencing
>> them to obtain packet data pointers.
>
> Have you considered using kfuncs for these instead of introducing new
> hooks/contexts/etc?
I did, but I'm not sure it's such a good fit? In particular, the way the
direct packet access is implemented for dequeue programs (where you can
get an xdp_md pointer and deref that to get data and data_end pointers)
is done this way so programs can share utility functions between XDP and
dequeue programs. And having a new program type for the dequeue progs
seems like the obvious thing to do since they're doing something new?
Maybe I'm missing something, though; could you elaborate on how you'd
use kfuncs instead?
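To illustrate what I mean about the shared model, a rough sketch of a dequeue program as described in the cover letter. The section name, context type and exact helper signatures here are assumptions for illustration; only the helper names bpf_packet_dequeue()/bpf_packet_drop() and the "return the packet or drop it" contract come from the series itself:

```c
/* Sketch of a dequeue program (section/context names are guesses). */
SEC("dequeue")
void *dequeue_prog(struct dequeue_ctx *ctx)
{
	__u64 prio = 0;
	struct xdp_md *pkt;

	/* Returns a referenced packet, or NULL if the PIFO is empty. */
	pkt = bpf_packet_dequeue(ctx, &pifo_map, 0, &prio);
	if (!pkt)
		return NULL; /* nothing to transmit */

	/* Direct packet access works like in an XDP program: deref the
	 * xdp_md to get data/data_end, so utility functions can be
	 * shared between XDP and dequeue programs. */
	void *data = (void *)(long)pkt->data;
	void *data_end = (void *)(long)pkt->data_end;
	if (data + sizeof(struct ethhdr) > data_end) {
		/* The verifier enforces that the reference is released:
		 * either drop the packet... */
		bpf_packet_drop(ctx, pkt);
		return NULL;
	}
	/* ...or return it for transmission via ndo_xdp_xmit(). */
	return pkt;
}
```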
-Toke