From: Pavel Begunkov <asml.silence@gmail.com>
To: io-uring@vger.kernel.org, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: "David S . Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>,
Jonathan Lemon <jonathan.lemon@gmail.com>,
Willem de Bruijn <willemb@google.com>,
Jens Axboe <axboe@kernel.dk>, David Ahern <dsahern@kernel.org>,
kernel-team@fb.com
Subject: Re: [PATCH net-next v3 00/25] io_uring zerocopy send
Date: Tue, 5 Jul 2022 16:04:44 +0100
Message-ID: <15cff5cd-52d5-68af-75c1-32be28137773@gmail.com>
In-Reply-To: <cover.1656318994.git.asml.silence@gmail.com>
On 7/5/22 16:01, Pavel Begunkov wrote:
NOTE: This is not to be picked directly due to cross-subsystem merge problems.
After finding a consensus and getting necessary acks, I'll work out merging
with Jakub and Jens.
> The patchset implements io_uring zerocopy send. It works with both registered
> and normal buffers; mixing is allowed but not recommended. Apart from the usual
> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
> userspace when buffers are freed and can be reused (see API design below).
> These "buffer-free" notifications are delivered into io_uring's Completion
> Queue and are not necessarily per request: userspace has control over this
> and should explicitly attach a number of requests to a single notification.
> The series also adds some internal optimisations when used with registered
> buffers, like removing page referencing.
>
> From the kernel networking perspective there are two main changes. The first
> one is passing ubuf_info into the network layer from io_uring (inside an
> in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
> caching on the io_uring side, but also helps to avoid cross-referencing
> and synchronisation problems. The second part is an optional optimisation
> removing page referencing for requests with registered buffers.
>
> Benchmarking was done with an optimised version of the selftest (see [1]),
> which in a loop sends a bunch of requests and then waits for their
> completions. The "zc + flush" column posts one additional "buffer-free"
> notification per request, and plain "zc" doesn't post buffer notifications
> at all.
>
> NIC (requests / second):
> IO size | non-zc | zc             | zc + flush
> 4000    | 495134 | 606420 (+22%)  | 558971 (+12%)
> 1500    | 551808 | 577116 (+4.5%) | 565803 (+2.5%)
> 1000    | 584677 | 592088 (+1.2%) | 560885 (-4%)
> 600     | 596292 | 598550 (+0.4%) | 555366 (-6.7%)
>
> dummy (requests / second):
> IO size | non-zc  | zc             | zc + flush
> 8000    | 1299916 | 2396600 (+84%) | 2224219 (+71%)
> 4000    | 1869230 | 2344146 (+25%) | 2170069 (+16%)
> 1200    | 2071617 | 2361960 (+14%) | 2203052 (+6%)
> 600     | 2106794 | 2381527 (+13%) | 2195295 (+4%)
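
The percentages in the tables above appear to be the change relative to the
non-zc column. A minimal sketch of that arithmetic follows; the helper name
rel_change_pct is hypothetical and not part of the series or the selftest.

```c
#include <assert.h>

/*
 * Relative throughput change of a run versus the baseline, in percent.
 * Hypothetical helper for illustration only.
 */
static double rel_change_pct(double baseline, double value)
{
	assert(baseline > 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

For example, rel_change_pct(495134, 606420) is a little over 22, in line with
the "+22%" entry in the first NIC row.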
>
> Previously it also brought a massive performance speedup compared to the
> msg_zerocopy tool (see [3]), which is probably not super interesting.
>
> An additional bunch of refcounting optimisations was omitted from the series
> for simplicity and because they don't change the picture drastically. They
> will be sent as a follow-up, as will flushing optimisations closing the
> performance gap between the last two columns.
>
> Note: the series is based on net-next + for-5.20/io_uring, but as vanilla
> net-next fails for me the repo (see [2]) is on top of for-5.20/io_uring.
>
> Links:
>
> liburing (benchmark + some tests):
> [1] https://github.com/isilence/liburing/tree/zc_v3
>
> kernel repo:
> [2] https://github.com/isilence/linux/tree/zc_v3
>
> RFC v1:
> [3] https://lore.kernel.org/io-uring/cover.1638282789.git.asml.silence@gmail.com/
>
> RFC v2:
> https://lore.kernel.org/io-uring/cover.1640029579.git.asml.silence@gmail.com/
>
> API design overview:
>
> The series introduces an io_uring concept of notifiers. From the userspace
> perspective, a notifier is an entity to which it can bind one or more requests
> and which it can then request to flush. Flushing a notifier makes it
> impossible to attach new requests to it, and instructs the notifier to post a
> completion once all requests attached to it are completed and the kernel
> doesn't need the buffers anymore.
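
The notifier lifecycle described above can be sketched as a small userspace
model. This is an illustration under stated assumptions, not the kernel's
struct io_notif: the type notif_model and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a notifier; not the kernel's struct io_notif. */
struct notif_model {
	uint32_t pending;   /* attached requests whose buffers are still in use */
	bool flushed;       /* once set, no new requests can be attached */
	bool posted;        /* the single "buffer-free" completion was posted */
};

/* Post the completion once the notifier is flushed and nothing is pending. */
static void notif_try_post(struct notif_model *n)
{
	if (n->flushed && n->pending == 0)
		n->posted = true;
}

/* Bind one request to the notifier; fails after a flush. */
static bool notif_attach(struct notif_model *n)
{
	if (n->flushed)
		return false;
	n->pending++;
	return true;
}

/* An attached request completed and its buffer is no longer needed. */
static void notif_request_done(struct notif_model *n)
{
	assert(n->pending > 0);
	n->pending--;
	notif_try_post(n);
}

/* Flush: seal the notifier and post right away if nothing is pending. */
static void notif_flush(struct notif_model *n)
{
	n->flushed = true;
	notif_try_post(n);
}
```

In this model, flushing a notifier with two attached requests posts nothing
until both have called notif_request_done(), and any notif_attach() after the
flush returns false.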
>
> Notifiers are stored in notification slots, which should be registered as
> an array in io_uring. Each slot stores only one notifier at any particular
> moment. Flushing removes it from the slot and the slot automatically replaces
> it with a new notifier. All operations with a notifier are done by specifying
> the index of the slot it's currently in.
>
> When registering notification slots, userspace specifies a u64 tag for each
> slot, which will be copied into notification completion entries as
> cqe::user_data. cqe::res is 0 and cqe::flags is a wrap-around u32 sequence
> number counting the notifiers of a slot.
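
The slot and completion-entry semantics above can be sketched as follows. This
is a simplified model, not the series' UAPI: cqe_model, slot_model and both
functions are hypothetical names, and starting the sequence number at 0 is an
assumption. The model builds the completion at flush time, whereas the real
completion is posted only once all attached requests are done.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the three CQE fields mentioned above. */
struct cqe_model {
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
};

/* Hypothetical model of one notification slot. */
struct slot_model {
	uint64_t tag;   /* u64 tag given when the slot is registered */
	uint32_t seq;   /* sequence number of the notifier currently in the slot */
};

/* Register a slot with its tag. Starting the sequence at 0 is an assumption. */
static void slot_register(struct slot_model *s, uint64_t tag)
{
	assert(s);
	s->tag = tag;
	s->seq = 0;
}

/*
 * Flush the slot's current notifier: build the completion it will post,
 * then let a fresh notifier take the next sequence number.
 */
static struct cqe_model slot_flush(struct slot_model *s)
{
	assert(s);
	struct cqe_model cqe = {
		.user_data = s->tag,
		.res = 0,
		.flags = s->seq,
	};

	s->seq++;   /* unsigned arithmetic wraps modulo 2^32 */
	return cqe;
}
```

Two consecutive flushes of the same slot yield completions with the same
user_data and consecutive flags values, which lets userspace tell which
generation of the slot's notifier a completion belongs to.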
>
> Changelog:
>
> RFC v2 -> v3:
> mem accounting for non-registered buffers
> allow mixing registered and normal requests per notifier
> notification flushing via IORING_OP_RSRC_UPDATE
> TCP support
> fix buffer indexing
> fix io-wq ->uring_lock locking
> fix bugs when mixing with MSG_ZEROCOPY
> fix managed refs bugs in skbuff.c
>
> RFC -> RFC v2:
> remove additional overhead for non-zc from skb_release_data()
> avoid msg propagation, hide extra bits of non-zc overhead
> task_work based "buffer free" notifications
> improve io_uring's notification refcounting
> added 5/19 (no pfmemalloc tracking)
> added 8/19 and 9/19 preventing small copies with zc
> misc small changes
>
> Pavel Begunkov (25):
> ipv4: avoid partial copy for zc
> ipv6: avoid partial copy for zc
> skbuff: add SKBFL_DONT_ORPHAN flag
> skbuff: carry external ubuf_info in msghdr
> net: bvec specific path in zerocopy_sg_from_iter
> net: optimise bvec-based zc page referencing
> net: don't track pfmemalloc for managed frags
> skbuff: don't mix ubuf_info of different types
> ipv4/udp: support zc with managed data
> ipv6/udp: support zc with managed data
> tcp: support zc with managed data
> io_uring: add zc notification infrastructure
> io_uring: export task put
> io_uring: cache struct io_notif
> io_uring: complete notifiers in tw
> io_uring: add notification slot registration
> io_uring: wire send zc request type
> io_uring: account locked pages for non-fixed zc
> io_uring: allow to pass addr into sendzc
> io_uring: add rsrc referencing for notifiers
> io_uring: sendzc with fixed buffers
> io_uring: flush notifiers after sendzc
> io_uring: rename IORING_OP_FILES_UPDATE
> io_uring: add zc notification flush requests
> selftests/io_uring: test zerocopy send
>
> include/linux/io_uring_types.h | 37 ++
> include/linux/skbuff.h | 59 +-
> include/linux/socket.h | 7 +
> include/uapi/linux/io_uring.h | 43 +-
> io_uring/Makefile | 2 +-
> io_uring/io_uring.c | 40 +-
> io_uring/io_uring.h | 21 +
> io_uring/net.c | 134 ++++
> io_uring/net.h | 4 +
> io_uring/notif.c | 215 +++++++
> io_uring/notif.h | 87 +++
> io_uring/opdef.c | 24 +-
> io_uring/rsrc.c | 55 +-
> io_uring/rsrc.h | 16 +-
> io_uring/tctx.h | 26 -
> net/compat.c | 2 +
> net/core/datagram.c | 53 +-
> net/core/skbuff.c | 35 +-
> net/ipv4/ip_output.c | 63 +-
> net/ipv4/tcp.c | 52 +-
> net/ipv6/ip6_output.c | 62 +-
> net/socket.c | 6 +
> tools/testing/selftests/net/Makefile | 1 +
> .../selftests/net/io_uring_zerocopy_tx.c | 605 ++++++++++++++++++
> .../selftests/net/io_uring_zerocopy_tx.sh | 131 ++++
> 25 files changed, 1652 insertions(+), 128 deletions(-)
> create mode 100644 io_uring/notif.c
> create mode 100644 io_uring/notif.h
> create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
> create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh
>
--
Pavel Begunkov