RE: RFC - Tap io_uring PMD - Konstantin Ananyev


From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	"Stephen Hemminger" <stephen@networkplumber.org>
Cc: "dev@dpdk.org" <dev@dpdk.org>
Subject: RE: RFC - Tap io_uring PMD
Date: Wed, 6 Nov 2024 10:30:16 +0000
Message-ID: <da1588e2e66244f298751dca3712368b@huawei.com>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F885@smartserver.smartshare.dk>



> > > > > > Probably the hardest part of using io_uring is figuring out
> > > > > > how to collect completions. The simplest way would be to
> > > > > > handle all completions rx and tx in the rx_burst function.
> > > > >
> > > > > Please don't mix RX and TX, unless explicitly requested by the
> > > > > application through the recently introduced "mbuf recycle"
> > > > > feature.
> > > >
> > > > The issue is Rx and Tx share a single fd and ioring for completion
> > > > is per fd.
> > > > The implementation for ioring came from the storage side so
> > > > initially it was for fixing the broken Linux AIO support.
> > > >
> > > > Some other devices only have single interrupt or ring shared with
> > > > rx/tx so not unique.
> > > > Virtio, netvsc, and some NIC's.
> > > >
> > > > The problem is that if Tx completes descriptors then there needs to
> > > > be locking to prevent Rx thread and Tx thread overlapping. And a
> > > > spin lock is a performance buzz kill.
> > >
> > > Brainstorming a bit here...
> > > What if the new TAP io_uring PMD is designed to use two io_urings
> > > per port, one for RX and another one for TX on the same TAP
> > > interface?
> > > This requires that a TAP interface can be referenced via two file
> > > descriptors (one fd for the RX io_uring and another fd for the TX
> > > io_uring), e.g. by using dup() to create the additional file
> > > descriptor. I don't know if this is possible, and if it works with
> > > io_uring.
> >
> > There are a couple of problems with multiple fd's.
> >   - multiple fds pointing to same internal tap queue are not going to
> >     get completed separately.
> >   - when multi-proc is supported, limit of 253 fd's in Unix domain IPC
> >     comes into play
> >   - tap does not support tx only fd for queues. If fd is queue of tap,
> >     receive fan out will go to it.
> >
> > If DPDK was more flexible, harvesting of completion could be done via
> > another thread but that is not general enough to work transparently
> > with all applications.  Existing TAP device plays with SIGIO, but
> > signals are slower.
> 
> I have now read up a bit about io_uring, so here are some thoughts and ideas...
> 
> To avoid locking, there should only be one writer of io_uring Submission Queue Events (SQE), and only one reader of io_uring
> Completion Queue Events (CQE) per TAP interface.
> 
> From what I understand, the TAP io_uring PMD only supports one RX queue per port and one TX queue per port (i.e. per TAP
> interface). We can take advantage of this:
> 
> We can use rte_tx() as the Submission Queue writer and rte_rx() as the Completion Queue reader.
> 
> The PMD must have two internal rte_rings for respectively RX refill and TX completion events.
> 
> rte_rx() does the following:
> Read the Completion Queue;
> If RX CQE, pass the data to the next RX MBUF, convert the RX CQE to an RX Refill SQE and enqueue it in the RX Refill rte_ring;
> If TX CQE, enqueue it in the TX Completion rte_ring;
> Repeat until nb_pkts RX CQEs have been received, or no more CQE's are available. (This complies with the rte_rx() API, which says
> that less than nb_pkts is only returned if no more packets are available for receiving.)
> 
> rte_tx() does the following:
> Pass the data from the TX MBUFs to io_uring TX SQEs, using the TX CQEs in the TX Completion rte_ring, and write them to the io_uring
> Submission Queue.
> Dequeue any RX Refill SQEs from the RX Refill rte_ring and write them to the io_uring Submission Queue.
> 
> This means that the application must call both rte_rx() and rte_tx(); but it would be allowed to call rte_tx() with zero MBUFs.
> 
> The internal rte_rings are Single-Producer, Single-Consumer, and large enough to hold all TX+RX descriptors.
> 
> 
> Alternatively, we can let rte_rx() do all the work and use an rte_ring in the opposite direction...
> 
> The PMD must have two internal rte_rings, one for TX MBUFs and one for TX CQEs. (The latter can be a stack, or any other type of
> container.)
> 
> rte_tx() only does the following:
> Enqueue the TX MBUFs to the TX MBUF rte_ring.
> 
> rte_rx() does the following:
> Dequeue any TX MBUFs from the TX MBUF rte_ring, convert them to TX SQEs, using the TX CQEs in the TX Completion rte_ring, and
> write them to the io_uring Submission Queue.
> Read the Completion Queue;
> If TX CQE, enqueue it in the TX Completion rte_ring;
> If RX CQE, pass the data to the next RX MBUF, convert the RX CQE to an RX Refill SQE and write it to the io_uring Submission Queue;
> Repeat until nb_pkts RX CQEs have been received, or no more CQE's are available. (This complies with the rte_rx() API, which says
> that less than nb_pkts is only returned if no more packets are available for receiving.)
> 
> With the second design, the PMD can support multiple TX queues by using a Multi-Producer rte_ring for the TX MBUFs.
> But it postpones all transmits until rte_rx() is called, so I don't really like it.
> 
> Of the two designs, the first feels more natural to me.
> And if some application absolutely needs multiple TX queues, it can implement a Multi-Producer, Single-Consumer rte_ring as an
> intermediate step in front of the PMD's single TX queue.
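
To make the first quoted design concrete, here is a rough sketch of it as a
plain C simulation. It uses no real io_uring or DPDK calls: the cq/sq arrays
only stand in for the io_uring Completion and Submission Queues, spsc_ring
stands in for a Single-Producer, Single-Consumer rte_ring, and every name
(sim_port, sim_rx_burst, sim_tx_burst, sim_complete, TX_FLAG) is hypothetical.
Tagging TX entries with a bit in user_data and pre-seeding the TX completion
ring with free slots are my assumptions, not something stated in the thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SZ 64u                     /* power of two */
#define MASK    (RING_SZ - 1u)
#define TX_FLAG (UINT64_C(1) << 63)     /* hypothetical user_data tag for TX */

/* Minimal SPSC ring, a stand-in for an rte_ring created as SP/SC. */
struct spsc_ring {
    _Atomic uint32_t head;              /* advanced only by the consumer */
    _Atomic uint32_t tail;              /* advanced only by the producer */
    uint64_t slot[RING_SZ];
};

static bool spsc_enqueue(struct spsc_ring *r, uint64_t v)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SZ)
        return false;                   /* full */
    r->slot[tail & MASK] = v;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

static bool spsc_dequeue(struct spsc_ring *r, uint64_t *v)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;                   /* empty */
    *v = r->slot[head & MASK];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Simulated port; must be zero-initialized before sim_port_init(). */
struct sim_port {
    uint64_t cq[RING_SZ]; uint32_t cq_head, cq_tail; /* fake CQ (user_data) */
    uint64_t sq[RING_SZ]; uint32_t sq_tail;          /* fake SQ (user_data) */
    struct spsc_ring rx_refill;         /* rx -> tx: buffers to re-post */
    struct spsc_ring tx_done;           /* rx -> tx: free TX descriptors */
};

/* Assumption: all TX descriptors start out free, i.e. already "completed". */
static void sim_port_init(struct sim_port *p, uint32_t nb_tx_slots)
{
    assert(nb_tx_slots <= RING_SZ);
    for (uint32_t i = 0; i < nb_tx_slots; i++)
        spsc_enqueue(&p->tx_done, TX_FLAG | i);
}

/* Test helper playing the kernel: post one completion to the fake CQ. */
static void sim_complete(struct sim_port *p, uint64_t user_data)
{
    p->cq[p->cq_tail++ & MASK] = user_data;
}

/* Only reader of the CQ: hands RX data up and routes events to the rings. */
static uint16_t sim_rx_burst(struct sim_port *p, uint64_t *rx, uint16_t nb_pkts)
{
    uint16_t nb = 0;
    while (nb < nb_pkts && p->cq_head != p->cq_tail) {
        uint64_t ud = p->cq[p->cq_head++ & MASK];
        if (ud & TX_FLAG) {
            spsc_enqueue(&p->tx_done, ud);      /* TX CQE: recycle slot */
        } else {
            rx[nb++] = ud;                      /* RX CQE: deliver packet */
            spsc_enqueue(&p->rx_refill, ud);    /* and queue a refill */
        }
    }
    return nb;
}

/* Only writer of the SQ: submits TX entries, then pending RX refills. */
static uint16_t sim_tx_burst(struct sim_port *p, uint16_t nb_pkts)
{
    uint16_t nb = 0;
    uint64_t ud;
    while (nb < nb_pkts && spsc_dequeue(&p->tx_done, &ud)) {
        p->sq[p->sq_tail++ & MASK] = ud;
        nb++;
    }
    while (spsc_dequeue(&p->rx_refill, &ud))
        p->sq[p->sq_tail++ & MASK] = ud;
    return nb;
}
```

In this sketch the RX side never touches the submission queue and the TX side
never touches the completion queue, so neither queue needs a lock. It also
shows the cost mentioned in the quote: RX refills are only submitted when the
TX function runs, so it has to be called even with zero packets to send.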

And why can't we simply have two io_urings: one for RX ops and a second one for TX ops?

