From: David Wei <dw@davidwei.uk>
To: Mina Almasry <almasrymina@google.com>
Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org,
Jens Axboe <axboe@kernel.dk>,
Pavel Begunkov <asml.silence@gmail.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
"David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jesper Dangaard Brouer <hawk@kernel.org>,
David Ahern <dsahern@kernel.org>, David Wei <dw@davidwei.uk>
Subject: Re: [PATCH v1 00/15] io_uring zero copy rx
Date: Thu, 10 Oct 2024 17:29:45 -0700
Message-ID: <73637a66-e7f7-4ba6-a16e-c2ccb43735d6@davidwei.uk>
In-Reply-To: <CAHS8izOv9cB60oUbxz_52BMGi7T4_u9rzTOCb23LGvZOX0QXqg@mail.gmail.com>
On 2024-10-09 09:55, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>>
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be DMA'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
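
To sketch the core idea here: the page pool asks a pluggable memory
provider for rx buffers instead of allocating kernel pages itself. The
ops below are illustrative stand-ins, not the exact hooks added by this
series:

#include <net/page_pool/types.h>

/* Illustrative sketch only; the real names and signatures differ. */
struct pp_memory_provider_ops {
    /* hand out a buffer backed by the user-registered area */
    struct page *(*alloc_page)(struct page_pool *pp, gfp_t gfp);
    /* the pool is done with the buffer: ownership returns to
     * io_uring's refill machinery, not to the page allocator */
    bool (*release_page)(struct page_pool *pp, struct page *page);
};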
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
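
For illustration only, that out-of-scope setup might look something
like the following with ethtool, assuming NIC and driver support; the
device, port and queue numbers are made up:

# split headers from payload so only payload lands in user memory
ethtool -G eth0 tcp-data-split on
# steer the flow we want to receive zero copy into hw rx queue 12
ethtool -N eth0 flow-type tcp6 dst-port 9999 action 12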
>>
>> We share netdev core infra with devmem TCP. The main difference is that
>> io_uring is used for the uAPI and the lifetimes of all objects are bound
>> to an io_uring instance.
>
> I've been thinking about this a bit, and I hope this feedback isn't
> too late, but I think your work may be useful for users not using
> io_uring, i.e. zero copy to host memory that is not dependent on
> page-aligned MSS sizing: effectively AF_XDP zero copy, but using the
> TCP stack.
>
> If we refactor things a bit, we should be able to have the memory
> tied to the RX queue similar to what AF_XDP does, and then we should
> be able to zero copy to that memory via regular sockets as well as via
> io_uring. This would be useful for us and for other applications that
> want to do zero copy as you're doing here, but not necessarily through
> io_uring.
Using io_uring and moving away from a socket-based interface is an
explicit longer-term goal. I see your proposal of adding a traditional
socket-based API as orthogonal to what we're trying to do. If someone is
motivated enough to see this exist, then they can build it themselves.
>
>> Data is 'read' using a new io_uring request
>> type. When done, data is returned via a new shared refill queue. A zero
>> copy page pool refills a hw rx queue from this refill queue directly. Of
>> course, the lifetimes of these data buffers are managed by io_uring
>> rather than the networking stack, with different refcounting rules.
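
Concretely, the userspace loop ends up looking roughly like the sketch
below. The refill entry layout and the CQE offset encoding are
illustrative stand-ins for the real uAPI, and process() is a
placeholder for application logic:

#include <liburing.h>
#include <linux/types.h>

/* hypothetical refill queue entry; the real layout differs */
struct zcrx_rqe {
    __u64 off;    /* buffer offset within the registered area */
    __u32 len;
};

void process(void *buf, int len);    /* application logic placeholder */

static void rx_loop(struct io_uring *ring, char *area,
                    struct zcrx_rqe *rq, unsigned mask, unsigned *tail)
{
    struct io_uring_cqe *cqe;

    for (;;) {
        io_uring_wait_cqe(ring, &cqe);    /* data has landed */

        /* the CQE describes where in the area the payload is */
        __u64 off = cqe->user_data;       /* stand-in encoding */
        process(area + off, cqe->res);

        /* recycle: the zero copy page pool refills the hw rx
         * queue directly from this shared refill queue */
        rq[(*tail)++ & mask] =
            (struct zcrx_rqe){ .off = off, .len = (__u32)cqe->res };
        io_uring_cqe_seen(ring, cqe);
    }
}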
>>
>> This patchset is the first step adding basic zero copy support. We will
>> extend this iteratively with new features e.g. dynamically allocated
>> zero copy areas, THP support, dmabuf support, improved copy fallback,
>> general optimisations and more.
>>
>> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
>> aren't included since Taehee Yoo has already sent a more comprehensive
>> patchset adding support in [1]. Google gve should already support this,
>
> This is an aside, but GVE supports this via the out-of-tree patches
> I've been carrying on GitHub. Upstream, we're working on adding the
> prerequisite page_pool support.
>
>> and Mellanox mlx5 support is WIP pending driver changes.
>>
>> ===========
>> Performance
>> ===========
>>
>> Test setup:
>> * AMD EPYC 9454
>> * Broadcom BCM957508 200G
>> * Kernel v6.11 base [2]
>> * liburing fork [3]
>> * kperf fork [4]
>> * 4K MTU
>> * Single TCP flow
>>
>> With application thread + net rx softirq pinned to _different_ cores:
>>
>> epoll
>> 82.2 Gbps
>>
>> io_uring
>> 116.2 Gbps (+41%)
>>
>> Pinned to _same_ core:
>>
>> epoll
>> 62.6 Gbps
>>
>> io_uring
>> 80.9 Gbps (+29%)
>>
>
> Are the 'epoll' and 'io_uring' results here using TCP RX zerocopy [1]
> and io_uring zerocopy, respectively?
>
> If not, I would like to see a comparison between TCP RX zerocopy and
> this new io_uring zerocopy. At Google, for example, we use TCP RX
> zerocopy, and I would like to see perf numbers that might motivate us
> to move to this new mechanism.
No, it is comparing epoll without zero copy vs. io_uring zero copy. Yes,
that's a fair request. I will add epoll with TCP_ZEROCOPY_RECEIVE to
kperf and compare.
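
For reference, the TCP_ZEROCOPY_RECEIVE path being compared against
looks something like this (the documented tcp(7) getsockopt() API;
error handling trimmed and only the basic fields shown):

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>

/* 'addr' must come from mmap(NULL, chunk, PROT_READ, MAP_SHARED, fd, 0)
 * on the TCP socket itself. */
static ssize_t zc_receive(int fd, void *addr, unsigned int chunk)
{
    struct tcp_zerocopy_receive zc;
    socklen_t zc_len = sizeof(zc);

    memset(&zc, 0, sizeof(zc));
    zc.address = (__u64)(unsigned long)addr;
    zc.length = chunk;

    /* the kernel remaps received pages into 'addr' instead of copying */
    if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
                   &zc, &zc_len) < 0)
        return -1;

    /* zc.length bytes are now mapped at 'addr'; zc.recv_skip_hint
     * bytes (if any) still need a normal copying recv() */
    return zc.length;
}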
>
> [1] https://lwn.net/Articles/752046/