From: David Wei <dw@davidwei.uk>
To: Mina Almasry <almasrymina@google.com>
Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org,
	Jens Axboe <axboe@kernel.dk>,
	Pavel Begunkov <asml.silence@gmail.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	David Ahern <dsahern@kernel.org>, David Wei <dw@davidwei.uk>
Subject: Re: [PATCH v1 00/15] io_uring zero copy rx
Date: Thu, 10 Oct 2024 17:29:45 -0700
Message-ID: <73637a66-e7f7-4ba6-a16e-c2ccb43735d6@davidwei.uk>
In-Reply-To: <CAHS8izOv9cB60oUbxz_52BMGi7T4_u9rzTOCb23LGvZOX0QXqg@mail.gmail.com>

On 2024-10-09 09:55, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>>
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be dma'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
>>
>> We share netdev core infra with devmem TCP. The main difference is that
>> io_uring is used for the uAPI and the lifetimes of all objects are
>> bound to an io_uring instance.
> 
> I've been thinking about this a bit, and I hope this feedback isn't
> too late, but I think your work may be useful for users not using
> io_uring. I.e. zero copy to host memory that is not dependent on page
> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
> 
> If we refactor things around a bit we should be able to have the
> memory tied to the RX queue similar to what AF_XDP does, and then we
> should be able to zero copy to the memory via regular sockets and via
> io_uring. This will be useful for us and other applications that would
> like to ZC similar to what you're doing here but not necessarily
> through io_uring.

Using io_uring and moving away from a socket-based interface is an
explicit longer-term goal. I see your proposal of adding a traditional
socket-based API as orthogonal to what we're trying to do. If someone
is motivated enough to see this exist, they can build it themselves.

> 
>> Data is 'read' using a new io_uring request
>> type. When done, data is returned via a new shared refill queue. A zero
>> copy page pool refills a hw rx queue from this refill queue directly. Of
>> course, the lifetime of these data buffers is managed by io_uring
>> rather than the networking stack, with different refcounting rules.
>>
>> This patchset is the first step adding basic zero copy support. We will
>> extend this iteratively with new features e.g. dynamically allocated
>> zero copy areas, THP support, dmabuf support, improved copy fallback,
>> general optimisations and more.
>>
>> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
>> aren't included since Taehee Yoo has already sent a more comprehensive
>> patchset adding support in [1]. Google gve should already support this,
> 
> This is an aside, but GVE supports this via the out-of-tree patches
> I've been carrying on github. Upstream we're working on adding the
> prerequisite page_pool support.
> 
>> and Mellanox mlx5 support is WIP pending driver changes.
>>
>> ===========
>> Performance
>> ===========
>>
>> Test setup:
>> * AMD EPYC 9454
>> * Broadcom BCM957508 200G
>> * Kernel v6.11 base [2]
>> * liburing fork [3]
>> * kperf fork [4]
>> * 4K MTU
>> * Single TCP flow
>>
>> With application thread + net rx softirq pinned to _different_ cores:
>>
>> epoll
>> 82.2 Gbps
>>
>> io_uring
>> 116.2 Gbps (+41%)
>>
>> Pinned to _same_ core:
>>
>> epoll
>> 62.6 Gbps
>>
>> io_uring
>> 80.9 Gbps (+29%)
>>
> 
> Are the 'epoll' and 'io_uring' results here using TCP RX zerocopy
> [1] and io_uring zerocopy respectively?
> 
> If not, I would like to see a comparison between TCP RX zerocopy and
> this new io-uring zerocopy. For Google for example we use the TCP RX
> zerocopy, I would like to see perf numbers possibly motivating us to
> move to this new thing.

No, it is comparing epoll without zero copy vs io_uring zero copy. Yes,
that's a fair request. I will add epoll with TCP_ZEROCOPY_RECEIVE to
kperf and compare.
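
As a quick sanity check on the percentages quoted above, the relative
gains can be recomputed from the raw throughput figures. This is only
an illustrative sketch: the helper name relative_gain is made up here
and is not part of kperf, and the inputs are just the four Gbps numbers
from the cover letter.

```python
def relative_gain(baseline_gbps: float, new_gbps: float) -> float:
    """Return the throughput gain of new over baseline, in percent."""
    if baseline_gbps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (new_gbps / baseline_gbps - 1.0) * 100.0


# Figures from the cover letter: (epoll, io_uring zero copy) in Gbps.
results = {
    "different cores": (82.2, 116.2),
    "same core": (62.6, 80.9),
}

for setup, (epoll_gbps, zc_gbps) in results.items():
    gain = relative_gain(epoll_gbps, zc_gbps)
    print(f"{setup}: {gain:+.0f}%")
# prints:
# different cores: +41%
# same core: +29%
```

An epoll + TCP_ZEROCOPY_RECEIVE run would add a third column to this
comparison once it has been measured.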

> 
> [1] https://lwn.net/Articles/752046/
> 
> 
