public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET RFC 0/7] Add support for provided registered buffers
@ 2024-10-23 16:07 Jens Axboe
  2024-10-23 16:07 ` [PATCH 1/7] io_uring/kbuf: mark buf_sel_arg mode as KBUF_MODE_FREE once allocated Jens Axboe
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Jens Axboe @ 2024-10-23 16:07 UTC (permalink / raw)
  To: io-uring

Hi,

Normally a request can take a provided buffer, which means "pick a
buffer from group X and do IO to/from it", or it can use a registered
buffer, which means "use the buffer at index Y and do IO to/from it".
For things like O_DIRECT and network zero copy, registered buffers can
be used to speedup the operation, as they avoid repeated
get_user_pages() and page referencing calls for each IO operation.

Normal (non zero copy) send supports bundles, which is a way to pick
multiple provided buffers at once and send them. send zero copy only
supports registered buffers, and hence can only send a single buffer
at the time.

This patchset adds support for using a mix of provided and registered
buffers, where the provided buffers merely provide an index into which
registered buffers to use. This enables using provided buffers for
send zc in general, but also bundles where multiple buffers are picked.
This is done by changing how the provided buffers are intepreted.
Normally a provided buffer has an address, length, and buffer ID
associated with it. The address tells the kernel where the IO should
occur. If both fixed and provided buffers are asked for, the provided
buffer address field is instead an encoding of the registered buffer
index and the offset within that buffer. With that in place, using a
combination of the two can work.

Patches 1-4 are just cleanup and prep, patch 5 adds the basic
definition of what a fixed provided buffer looks like, patch 6 adds
support for kbuf to map into a bvec directly, and finally patch 7
adds support for send zero copy to use this mechanism.

More details available in the actual patches. Tested with kperf using
zero copy RX and TX, easily reaching 100G speeds with a single thread
doing 4k sends and receives.

Kernel tree here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-sendzc-provided

 include/uapi/linux/io_uring.h |   8 ++
 io_uring/kbuf.c               | 180 +++++++++++++++++++++++++++----
 io_uring/kbuf.h               |   9 +-
 io_uring/net.c                | 192 ++++++++++++++++++++++------------
 io_uring/net.h                |  10 +-
 io_uring/opdef.c              |   1 +
 6 files changed, 309 insertions(+), 91 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/7] io_uring/kbuf: mark buf_sel_arg mode as KBUF_MODE_FREE once allocated
  2024-10-23 16:07 [PATCHSET RFC 0/7] Add support for provided registered buffers Jens Axboe
@ 2024-10-23 16:07 ` Jens Axboe
  2024-10-23 16:07 ` [PATCH 2/7] io_uring/kbuf: change io_provided_buffers_select() calling convention Jens Axboe
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-10-23 16:07 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

If kbuf expands the iovec array, then it doesn't matter if the caller
didn't set KBUF_MODE_FREE or not, as once allocated it should be marked
as such. Ensure that this is the case.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/kbuf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index d407576ddfb7..1318b8ee2599 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -245,6 +245,7 @@ static int io_ring_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg,
 			kfree(arg->iovs);
 		arg->iovs = iov;
 		nr_iovs = nr_avail;
+		arg->mode |= KBUF_MODE_FREE;
 	} else if (nr_avail < nr_iovs) {
 		nr_iovs = nr_avail;
 	}
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/7] io_uring/kbuf: change io_provided_buffers_select() calling convention
  2024-10-23 16:07 [PATCHSET RFC 0/7] Add support for provided registered buffers Jens Axboe
  2024-10-23 16:07 ` [PATCH 1/7] io_uring/kbuf: mark buf_sel_arg mode as KBUF_MODE_FREE once allocated Jens Axboe
@ 2024-10-23 16:07 ` Jens Axboe
  2024-10-23 16:07 ` [PATCH 3/7] io_uring/net: abstract out io_send_import() helper Jens Axboe
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-10-23 16:07 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Make it more like the ring provided buffers in what arguments it takes,
and use similar ordering as well. This makes the code more consistent

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/kbuf.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 1318b8ee2599..42579525c4bd 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -117,10 +117,11 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len,
 	return NULL;
 }
 
-static int io_provided_buffers_select(struct io_kiocb *req, size_t *len,
-				      struct io_buffer_list *bl,
-				      struct iovec *iov)
+static int io_provided_buffers_select(struct io_kiocb *req,
+				      struct buf_sel_arg *arg,
+				      struct io_buffer_list *bl, size_t *len)
 {
+	struct iovec *iov = arg->iovs;
 	void __user *buf;
 
 	buf = io_provided_buffer_select(req, len, bl);
@@ -311,7 +312,7 @@ int io_buffers_select(struct io_kiocb *req, struct buf_sel_arg *arg,
 			io_kbuf_commit(req, bl, arg->out_len, ret);
 		}
 	} else {
-		ret = io_provided_buffers_select(req, &arg->out_len, bl, arg->iovs);
+		ret = io_provided_buffers_select(req, arg, bl, &arg->out_len);
 	}
 out_unlock:
 	io_ring_submit_unlock(ctx, issue_flags);
@@ -338,7 +339,7 @@ int io_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg)
 	}
 
 	/* don't support multiple buffer selections for legacy */
-	return io_provided_buffers_select(req, &arg->max_len, bl, arg->iovs);
+	return io_provided_buffers_select(req, arg, bl, &arg->max_len);
 }
 
 static int __io_remove_buffers(struct io_ring_ctx *ctx,
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/7] io_uring/net: abstract out io_send_import() helper
  2024-10-23 16:07 [PATCHSET RFC 0/7] Add support for provided registered buffers Jens Axboe
  2024-10-23 16:07 ` [PATCH 1/7] io_uring/kbuf: mark buf_sel_arg mode as KBUF_MODE_FREE once allocated Jens Axboe
  2024-10-23 16:07 ` [PATCH 2/7] io_uring/kbuf: change io_provided_buffers_select() calling convention Jens Axboe
@ 2024-10-23 16:07 ` Jens Axboe
  2024-10-23 16:07 ` [PATCH 4/7] io_uring/net: move send zc fixed buffer import into helper Jens Axboe
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-10-23 16:07 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

If provided buffers are used, this helper can be used to import the
necessary data from a provided buffer group. Only one user so far, but
add it in preparation of adding another one. While doing so, also split
the actual import into an iov_iter out into a separate helper.

In preparation for needing to know the number of mapped segments, return
that instead. It still returns < 0 on error.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/net.c | 75 +++++++++++++++++++++++++++++++++-----------------
 1 file changed, 50 insertions(+), 25 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 2040195e33ab..13b807c729f9 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -578,28 +578,33 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
-int io_send(struct io_kiocb *req, unsigned int issue_flags)
+static int __io_send_import(struct io_kiocb *req, struct buf_sel_arg *arg,
+			    int nsegs, unsigned int issue_flags)
 {
 	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
 	struct io_async_msghdr *kmsg = req->async_data;
-	struct socket *sock;
-	unsigned flags;
-	int min_ret = 0;
-	int ret;
+	int ret = nsegs;
 
-	sock = sock_from_file(req->file);
-	if (unlikely(!sock))
-		return -ENOTSOCK;
+	if (nsegs == 1) {
+		sr->buf = arg->iovs[0].iov_base;
+		ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len,
+				   &kmsg->msg.msg_iter);
+		if (unlikely(ret < 0))
+			return ret;
+	} else {
+		iov_iter_init(&kmsg->msg.msg_iter, ITER_SOURCE, arg->iovs,
+			      nsegs, arg->out_len);
+	}
 
-	if (!(req->flags & REQ_F_POLLED) &&
-	    (sr->flags & IORING_RECVSEND_POLL_FIRST))
-		return -EAGAIN;
+	return nsegs;
+}
 
-	flags = sr->msg_flags;
-	if (issue_flags & IO_URING_F_NONBLOCK)
-		flags |= MSG_DONTWAIT;
+static int io_send_import(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
+	struct io_async_msghdr *kmsg = req->async_data;
+	int ret = 1;
 
-retry_bundle:
 	if (io_do_buffer_select(req)) {
 		struct buf_sel_arg arg = {
 			.iovs = &kmsg->fast_iov,
@@ -629,18 +634,38 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
 		}
 		sr->len = arg.out_len;
 
-		if (ret == 1) {
-			sr->buf = arg.iovs[0].iov_base;
-			ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len,
-						&kmsg->msg.msg_iter);
-			if (unlikely(ret))
-				return ret;
-		} else {
-			iov_iter_init(&kmsg->msg.msg_iter, ITER_SOURCE,
-					arg.iovs, ret, arg.out_len);
-		}
+		return __io_send_import(req, &arg, ret, issue_flags);
 	}
 
+	return ret;
+}
+
+int io_send(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
+	struct io_async_msghdr *kmsg = req->async_data;
+	struct socket *sock;
+	unsigned flags;
+	int min_ret = 0;
+	int ret;
+
+	sock = sock_from_file(req->file);
+	if (unlikely(!sock))
+		return -ENOTSOCK;
+
+	if (!(req->flags & REQ_F_POLLED) &&
+	    (sr->flags & IORING_RECVSEND_POLL_FIRST))
+		return -EAGAIN;
+
+	flags = sr->msg_flags;
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		flags |= MSG_DONTWAIT;
+
+retry_bundle:
+	ret = io_send_import(req, issue_flags);
+	if (unlikely(ret < 0))
+		return ret;
+
 	/*
 	 * If MSG_WAITALL is set, or this is a bundle send, then we need
 	 * the full amount. If just bundle is set, if we do a short send
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/7] io_uring/net: move send zc fixed buffer import into helper
  2024-10-23 16:07 [PATCHSET RFC 0/7] Add support for provided registered buffers Jens Axboe
                   ` (2 preceding siblings ...)
  2024-10-23 16:07 ` [PATCH 3/7] io_uring/net: abstract out io_send_import() helper Jens Axboe
@ 2024-10-23 16:07 ` Jens Axboe
  2024-10-23 16:07 ` [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers Jens Axboe
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-10-23 16:07 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

In preparation to making the fixed buffer importing a bit more elaborate
in terms of what it supports.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/net.c | 77 ++++++++++++++++++++++++++++----------------------
 1 file changed, 44 insertions(+), 33 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 13b807c729f9..dbef14aa50f9 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -81,6 +81,9 @@ struct io_sr_msg {
 	struct io_kiocb 		*notif;
 };
 
+static int io_sg_from_iter(struct sk_buff *skb, struct iov_iter *from,
+			   size_t length);
+
 /*
  * Number of times we'll try and do receives if there's more data. If we
  * exceed this limit, then add us to the back of the queue and retry from
@@ -578,6 +581,37 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
+static int io_send_zc_import_single(struct io_kiocb *req,
+				    unsigned int issue_flags)
+{
+	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
+	struct io_async_msghdr *kmsg = req->async_data;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_mapped_ubuf *imu;
+	int ret;
+	u16 idx;
+
+	ret = -EFAULT;
+	io_ring_submit_lock(ctx, issue_flags);
+	if (sr->buf_index < ctx->nr_user_bufs) {
+		idx = array_index_nospec(sr->buf_index, ctx->nr_user_bufs);
+		imu = READ_ONCE(ctx->user_bufs[idx]);
+		io_req_set_rsrc_node(sr->notif, ctx);
+		ret = 0;
+	}
+	io_ring_submit_unlock(ctx, issue_flags);
+
+	if (unlikely(ret))
+		return ret;
+
+	ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter, imu,
+				(u64)(uintptr_t)sr->buf, sr->len);
+	if (unlikely(ret))
+		return ret;
+	kmsg->msg.sg_from_iter = io_sg_from_iter;
+	return 0;
+}
+
 static int __io_send_import(struct io_kiocb *req, struct buf_sel_arg *arg,
 			    int nsegs, unsigned int issue_flags)
 {
@@ -1365,40 +1399,17 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_async_msghdr *kmsg = req->async_data;
 	int ret;
 
-	if (sr->flags & IORING_RECVSEND_FIXED_BUF) {
-		struct io_ring_ctx *ctx = req->ctx;
-		struct io_mapped_ubuf *imu;
-		int idx;
-
-		ret = -EFAULT;
-		io_ring_submit_lock(ctx, issue_flags);
-		if (sr->buf_index < ctx->nr_user_bufs) {
-			idx = array_index_nospec(sr->buf_index, ctx->nr_user_bufs);
-			imu = READ_ONCE(ctx->user_bufs[idx]);
-			io_req_set_rsrc_node(sr->notif, ctx);
-			ret = 0;
-		}
-		io_ring_submit_unlock(ctx, issue_flags);
+	if (sr->flags & IORING_RECVSEND_FIXED_BUF)
+		return io_send_zc_import_single(req, issue_flags);
 
-		if (unlikely(ret))
-			return ret;
-
-		ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter, imu,
-					(u64)(uintptr_t)sr->buf, sr->len);
-		if (unlikely(ret))
-			return ret;
-		kmsg->msg.sg_from_iter = io_sg_from_iter;
-	} else {
-		ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &kmsg->msg.msg_iter);
-		if (unlikely(ret))
-			return ret;
-		ret = io_notif_account_mem(sr->notif, sr->len);
-		if (unlikely(ret))
-			return ret;
-		kmsg->msg.sg_from_iter = io_sg_from_iter_iovec;
-	}
-
-	return ret;
+	ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &kmsg->msg.msg_iter);
+	if (unlikely(ret))
+		return ret;
+	ret = io_notif_account_mem(sr->notif, sr->len);
+	if (unlikely(ret))
+		return ret;
+	kmsg->msg.sg_from_iter = io_sg_from_iter_iovec;
+	return 0;
 }
 
 int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers
  2024-10-23 16:07 [PATCHSET RFC 0/7] Add support for provided registered buffers Jens Axboe
                   ` (3 preceding siblings ...)
  2024-10-23 16:07 ` [PATCH 4/7] io_uring/net: move send zc fixed buffer import into helper Jens Axboe
@ 2024-10-23 16:07 ` Jens Axboe
  2024-10-24 15:44   ` Pavel Begunkov
  2024-10-23 16:07 ` [PATCH 6/7] io_uring/kbuf: add support for mapping type KBUF_MODE_BVEC Jens Axboe
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-10-23 16:07 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

This just adds the necessary shifts that define what a provided buffer
that is merely an index into a registered buffer looks like. A provided
buffer looks like the following:

struct io_uring_buf {
	__u64	addr;
	__u32	len;
	__u16	bid;
	__u16	resv;
};

where 'addr' holds a userspace address, 'len' is the length of the
buffer, and 'bid' is the buffer ID identifying the buffer. This works
fine for a virtual address, but it cannot be used efficiently denote
a registered buffer. Registered buffers are pre-mapped into the kernel
for more efficient IO, avoiding a get_user_pages() and page(s) inc+dec,
and are used for things like O_DIRECT on storage and zero copy send.

Particularly for the send case, it'd be useful to support a mix of
provided and registered buffers. This enables the use of using a
provided ring buffer to serialize sends, and also enables the use of
send bundles, where a send can pick multiple buffers and send them all
at once.

If provided buffers are used as an index into registered buffers, the
meaning of buf->addr changes. If registered buffer index 'regbuf_index'
is desired, with a length of 'len' and the offset 'regbuf_offset' from
the start of the buffer, then the application would fill out the entry
as follows:

buf->addr = ((__u64) regbuf_offset << IOU_BUF_OFFSET_BITS) | regbuf_index;
buf->len = len;

and otherwise add it to the buffer ring as usual. The kernel will then
first pick a buffer from the desired buffer group ID, and then decode
which registered buffer to use for the transfer.

This provides a way to use both registered and provided buffers at the
same time.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 86cb385fe0b5..eef88d570cb4 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -733,6 +733,14 @@ struct io_uring_buf_ring {
 	};
 };
 
+/*
+ * When provided buffers are used as indices into registered buffers, the
+ * lower IOU_BUF_REGBUF_BITS indicate the index into the registered buffers,
+ * and the upper IOU_BUF_OFFSET_BITS indicate the offset into that buffer.
+ */
+#define IOU_BUF_REGBUF_BITS	(32ULL)
+#define IOU_BUF_OFFSET_BITS	(32ULL)
+
 /*
  * Flags for IORING_REGISTER_PBUF_RING.
  *
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/7] io_uring/kbuf: add support for mapping type KBUF_MODE_BVEC
  2024-10-23 16:07 [PATCHSET RFC 0/7] Add support for provided registered buffers Jens Axboe
                   ` (4 preceding siblings ...)
  2024-10-23 16:07 ` [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers Jens Axboe
@ 2024-10-23 16:07 ` Jens Axboe
  2024-10-24 15:22   ` Pavel Begunkov
  2024-10-23 16:07 ` [PATCH 7/7] io_uring/net: add provided buffer and bundle support to send zc Jens Axboe
  2024-10-24 14:36 ` [PATCHSET RFC 0/7] Add support for provided registered buffers Pavel Begunkov
  7 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-10-23 16:07 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

The provided buffer helpers always map to iovecs. Add a new mode,
KBUF_MODE_BVEC, which instead maps it to a bio_vec array instead. For
use with zero-copy scenarios, where the caller would want to turn it
into a bio_vec anyway, and this avoids first iterating and filling out
and iovec array, only for the caller to then iterate it again and turn
it into a bio_vec array.

Since it's now managing both iovecs and bvecs, change the naming of
buf_sel_arg->nr_iovs member to nr_vecs instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/kbuf.c | 170 +++++++++++++++++++++++++++++++++++++++++++-----
 io_uring/kbuf.h |   9 ++-
 io_uring/net.c  |  10 +--
 3 files changed, 165 insertions(+), 24 deletions(-)

diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 42579525c4bd..10a3a7a27e9a 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -16,6 +16,7 @@
 #include "opdef.h"
 #include "kbuf.h"
 #include "memmap.h"
+#include "rsrc.h"
 
 /* BIDs are addressed by a 16-bit field in a CQE */
 #define MAX_BIDS_PER_BGID (1 << 16)
@@ -117,20 +118,135 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len,
 	return NULL;
 }
 
+static struct io_mapped_ubuf *io_ubuf_from_buf(struct io_ring_ctx *ctx,
+					       u64 addr, unsigned int *offset)
+{
+	struct io_mapped_ubuf *imu;
+	u16 idx;
+
+	/*
+	 * Get registered buffer index and offset, encoded into the
+	 * addr base value.
+	 */
+	idx = addr & ((1ULL << IOU_BUF_REGBUF_BITS) - 1);
+	addr >>= IOU_BUF_REGBUF_BITS;
+	*offset = addr  & ((1ULL << IOU_BUF_OFFSET_BITS) - 1);
+
+	if (unlikely(idx >= ctx->nr_user_bufs))
+		return ERR_PTR(-EFAULT);
+
+	idx = array_index_nospec(idx, ctx->nr_user_bufs);
+	imu = READ_ONCE(ctx->user_bufs[idx]);
+	if (unlikely(*offset >= imu->len))
+		return ERR_PTR(-EFAULT);
+
+	return imu;
+}
+
+static bool io_expand_bvecs(struct buf_sel_arg *arg)
+{
+	int nvecs = arg->nr_vecs + 8;
+	struct bio_vec *bv;
+
+	if (!(arg->mode & KBUF_MODE_EXPAND))
+		return false;
+
+	bv = kmalloc_array(nvecs, sizeof(struct bio_vec), GFP_KERNEL);
+	if (unlikely(!bv))
+		return false;
+	memcpy(bv, arg->bvecs, arg->nr_vecs * sizeof(*bv));
+	if (arg->mode & KBUF_MODE_FREE)
+		kfree(arg->bvecs);
+	arg->bvecs = bv;
+	arg->nr_vecs = nvecs;
+	arg->mode |= KBUF_MODE_FREE;
+	return true;
+}
+
+static int io_fill_bvecs(struct io_ring_ctx *ctx, u64 addr,
+			 struct buf_sel_arg *arg, unsigned int len,
+			 int *vec_off)
+{
+	struct bio_vec *src, *src_prv = NULL;
+	struct io_mapped_ubuf *imu;
+	unsigned int llen = len;
+	unsigned int offset;
+
+	imu = io_ubuf_from_buf(ctx, addr, &offset);
+	if (unlikely(IS_ERR(imu)))
+		return PTR_ERR(imu);
+
+	if (unlikely(offset >= imu->len || len > imu->len))
+		return -EOVERFLOW;
+	if (unlikely(offset > imu->len - len))
+		return -EOVERFLOW;
+
+	src = imu->bvec;
+	if (offset > src->bv_len) {
+		unsigned long seg_skip;
+
+		offset -= src->bv_len;
+		seg_skip = 1 + (offset >> imu->folio_shift);
+		offset &= ((1UL << imu->folio_shift) - 1);
+		src += seg_skip;
+	}
+
+	do {
+		unsigned int this_len = len;
+
+		if (this_len + offset > src->bv_len)
+			this_len = src->bv_len - offset;
+
+		/*
+		 * If contig with previous bio_vec, merge it to minimize the
+		 * number of segments needed. If not, then add a new segment,
+		 * expanding the number of available slots, if needed.
+		 */
+		if (src_prv &&
+		    page_folio(src_prv->bv_page) == page_folio(src->bv_page) &&
+		    src_prv->bv_page + 1 == src->bv_page) {
+			arg->bvecs[*vec_off - 1].bv_len += this_len;
+		} else {
+			struct bio_vec *dst;
+
+			if (*vec_off == arg->nr_vecs && !io_expand_bvecs(arg))
+				break;
+
+			dst = &arg->bvecs[*vec_off];
+			dst->bv_page = src->bv_page;
+			dst->bv_len = this_len;
+			dst->bv_offset = offset;
+			(*vec_off)++;
+		}
+		offset = 0;
+		len -= this_len;
+		src_prv = src++;
+	} while (len);
+
+	return llen - len;
+}
+
 static int io_provided_buffers_select(struct io_kiocb *req,
 				      struct buf_sel_arg *arg,
 				      struct io_buffer_list *bl, size_t *len)
 {
-	struct iovec *iov = arg->iovs;
 	void __user *buf;
+	int ret;
 
 	buf = io_provided_buffer_select(req, len, bl);
 	if (unlikely(!buf))
 		return -ENOBUFS;
 
-	iov[0].iov_base = buf;
-	iov[0].iov_len = *len;
-	return 1;
+	if (arg->mode & KBUF_MODE_BVEC) {
+		u64 addr = (unsigned long)(uintptr_t) buf;
+
+		*len = io_fill_bvecs(req->ctx, addr, arg, *len, &ret);
+	} else {
+		arg->iovs[0].iov_base = buf;
+		arg->iovs[0].iov_len = *len;
+		ret = 1;
+	}
+	return ret;
 }
 
 static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
@@ -196,13 +312,16 @@ void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
 #define PEEK_MAX_IMPORT		256
 
 static int io_ring_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg,
-				struct io_buffer_list *bl)
+				struct io_buffer_list *bl, int *nbufs)
 {
 	struct io_uring_buf_ring *br = bl->buf_ring;
 	struct iovec *iov = arg->iovs;
-	int nr_iovs = arg->nr_iovs;
+	int nr_iovs = arg->nr_vecs;
 	__u16 nr_avail, tail, head;
 	struct io_uring_buf *buf;
+	int vec_off;
+
+	BUILD_BUG_ON(sizeof(struct iovec) > sizeof(struct bio_vec));
 
 	tail = smp_load_acquire(&br->tail);
 	head = bl->head;
@@ -236,10 +355,12 @@ static int io_ring_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg,
 
 	/*
 	 * only alloc a bigger array if we know we have data to map, eg not
-	 * a speculative peek operation.
+	 * a speculative peek operation. Note that struct bio_vec and
+	 * struct iovec are the same size, so we can use them interchangably
+	 * here as it's just for sizing purposes.
 	 */
 	if (arg->mode & KBUF_MODE_EXPAND && nr_avail > nr_iovs && arg->max_len) {
-		iov = kmalloc_array(nr_avail, sizeof(struct iovec), GFP_KERNEL);
+		iov = kmalloc_array(nr_avail, sizeof(struct bio_vec), GFP_KERNEL);
 		if (unlikely(!iov))
 			return -ENOMEM;
 		if (arg->mode & KBUF_MODE_FREE)
@@ -255,6 +376,7 @@ static int io_ring_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg,
 	if (!arg->max_len)
 		arg->max_len = INT_MAX;
 
+	vec_off = 0;
 	req->buf_index = buf->bid;
 	do {
 		u32 len = buf->len;
@@ -266,15 +388,25 @@ static int io_ring_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg,
 				buf->len = len;
 		}
 
-		iov->iov_base = u64_to_user_ptr(buf->addr);
-		iov->iov_len = len;
-		iov++;
+		if (arg->mode & KBUF_MODE_BVEC) {
+			int ret;
+
+			ret = io_fill_bvecs(req->ctx, buf->addr, arg, len, &vec_off);
+			if (unlikely(ret < 0))
+				return ret;
+			len = ret;
+		} else {
+			iov->iov_base = u64_to_user_ptr(buf->addr);
+			iov->iov_len = len;
+			iov++;
+			vec_off++;
+		}
 
 		arg->out_len += len;
 		arg->max_len -= len;
+		(*nbufs)++;
 		if (!arg->max_len)
 			break;
-
 		buf = io_ring_head_to_buf(br, ++head, bl->mask);
 	} while (--nr_iovs);
 
@@ -283,7 +415,7 @@ static int io_ring_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg,
 
 	req->flags |= REQ_F_BUFFER_RING;
 	req->buf_list = bl;
-	return iov - arg->iovs;
+	return vec_off;
 }
 
 int io_buffers_select(struct io_kiocb *req, struct buf_sel_arg *arg,
@@ -299,7 +431,9 @@ int io_buffers_select(struct io_kiocb *req, struct buf_sel_arg *arg,
 		goto out_unlock;
 
 	if (bl->flags & IOBL_BUF_RING) {
-		ret = io_ring_buffers_peek(req, arg, bl);
+		int nbufs = 0;
+
+		ret = io_ring_buffers_peek(req, arg, bl, &nbufs);
 		/*
 		 * Don't recycle these buffers if we need to go through poll.
 		 * Nobody else can use them anyway, and holding on to provided
@@ -307,9 +441,9 @@ int io_buffers_select(struct io_kiocb *req, struct buf_sel_arg *arg,
 		 * side anyway with normal buffers. Besides, we already
 		 * committed them, they cannot be put back in the queue.
 		 */
-		if (ret > 0) {
+		if (nbufs) {
 			req->flags |= REQ_F_BUFFERS_COMMIT | REQ_F_BL_NO_RECYCLE;
-			io_kbuf_commit(req, bl, arg->out_len, ret);
+			io_kbuf_commit(req, bl, arg->out_len, nbufs);
 		}
 	} else {
 		ret = io_provided_buffers_select(req, arg, bl, &arg->out_len);
@@ -332,7 +466,9 @@ int io_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg)
 		return -ENOENT;
 
 	if (bl->flags & IOBL_BUF_RING) {
-		ret = io_ring_buffers_peek(req, arg, bl);
+		int nbufs = 0;
+
+		ret = io_ring_buffers_peek(req, arg, bl, &nbufs);
 		if (ret > 0)
 			req->flags |= REQ_F_BUFFERS_COMMIT;
 		return ret;
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index 36aadfe5ac00..7c56ba994f21 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -53,13 +53,18 @@ enum {
 	KBUF_MODE_EXPAND	= 1,
 	/* if bigger vec allocated, free old one */
 	KBUF_MODE_FREE		= 2,
+	/* turn into bio_vecs, not iovecs */
+	KBUF_MODE_BVEC		= 4,
 };
 
 struct buf_sel_arg {
-	struct iovec *iovs;
+	union {
+		struct iovec *iovs;
+		struct bio_vec *bvecs;
+	};
 	size_t out_len;
 	size_t max_len;
-	unsigned short nr_iovs;
+	unsigned short nr_vecs;
 	unsigned short mode;
 };
 
diff --git a/io_uring/net.c b/io_uring/net.c
index dbef14aa50f9..154756762a46 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -643,17 +643,17 @@ static int io_send_import(struct io_kiocb *req, unsigned int issue_flags)
 		struct buf_sel_arg arg = {
 			.iovs = &kmsg->fast_iov,
 			.max_len = min_not_zero(sr->len, INT_MAX),
-			.nr_iovs = 1,
+			.nr_vecs = 1,
 		};
 
 		if (kmsg->free_iov) {
-			arg.nr_iovs = kmsg->free_iov_nr;
+			arg.nr_vecs = kmsg->free_iov_nr;
 			arg.iovs = kmsg->free_iov;
 			arg.mode = KBUF_MODE_FREE;
 		}
 
 		if (!(sr->flags & IORING_RECVSEND_BUNDLE))
-			arg.nr_iovs = 1;
+			arg.nr_vecs = 1;
 		else
 			arg.mode |= KBUF_MODE_EXPAND;
 
@@ -1140,12 +1140,12 @@ static int io_recv_buf_select(struct io_kiocb *req, struct io_async_msghdr *kmsg
 	    sr->flags & IORING_RECVSEND_BUNDLE) {
 		struct buf_sel_arg arg = {
 			.iovs = &kmsg->fast_iov,
-			.nr_iovs = 1,
+			.nr_vecs = 1,
 			.mode = KBUF_MODE_EXPAND,
 		};
 
 		if (kmsg->free_iov) {
-			arg.nr_iovs = kmsg->free_iov_nr;
+			arg.nr_vecs = kmsg->free_iov_nr;
 			arg.iovs = kmsg->free_iov;
 			arg.mode |= KBUF_MODE_FREE;
 		}
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 7/7] io_uring/net: add provided buffer and bundle support to send zc
  2024-10-23 16:07 [PATCHSET RFC 0/7] Add support for provided registered buffers Jens Axboe
                   ` (5 preceding siblings ...)
  2024-10-23 16:07 ` [PATCH 6/7] io_uring/kbuf: add support for mapping type KBUF_MODE_BVEC Jens Axboe
@ 2024-10-23 16:07 ` Jens Axboe
  2024-10-24 14:44   ` Pavel Begunkov
  2024-10-24 14:36 ` [PATCHSET RFC 0/7] Add support for provided registered buffers Pavel Begunkov
  7 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-10-23 16:07 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Provided buffers inform the kernel which buffer group ID to pick a
buffer from for transfer. Normally that buffer contains the usual
addr + length information, as well as a buffer ID that is passed back
at completion time to inform the application of which buffer was used
for the transfer.

However, if registered and provided buffers are combined, then the
provided buffer must instead tell the kernel which registered buffer
index should be used, and the length/offset within that buffer. Rather
than store the addr + length, the application must instead store this
information instead.

If provided buffers are used with send zc, then those buffers must be
an index into a registered buffer. Change the mapping type to use
KBUF_MODE_BVEC, which tells the kbuf handlers to turn the mappings
into bio_vecs rather than iovecs. Then all that is needed is to
setup our iov_iterator to use iov_iter_bvec().

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/net.c   | 64 +++++++++++++++++++++++++++++++++---------------
 io_uring/net.h   | 10 ++++++--
 io_uring/opdef.c |  1 +
 3 files changed, 53 insertions(+), 22 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 154756762a46..c062b1c685bd 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -83,6 +83,8 @@ struct io_sr_msg {
 
 static int io_sg_from_iter(struct sk_buff *skb, struct iov_iter *from,
 			   size_t length);
+static int io_sg_from_iter_iovec(struct sk_buff *skb, struct iov_iter *from,
+				 size_t length);
 
 /*
  * Number of times we'll try and do receives if there's more data. If we
@@ -581,33 +583,34 @@ int io_sendmsg(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
-static int io_send_zc_import_single(struct io_kiocb *req,
-				    unsigned int issue_flags)
+static int __io_send_zc_import(struct io_kiocb *req,
+			       struct io_async_msghdr *kmsg, int nsegs)
 {
 	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
-	struct io_async_msghdr *kmsg = req->async_data;
 	struct io_ring_ctx *ctx = req->ctx;
 	struct io_mapped_ubuf *imu;
 	int ret;
 	u16 idx;
 
-	ret = -EFAULT;
-	io_ring_submit_lock(ctx, issue_flags);
-	if (sr->buf_index < ctx->nr_user_bufs) {
+	if (req->flags & REQ_F_BUFFER_SELECT) {
+		struct bio_vec *bv = kmsg->free_bvec ?: &kmsg->fast_bvec;
+
+		WARN_ON_ONCE(bv == &kmsg->fast_bvec && nsegs > 1);
+		iov_iter_bvec(&kmsg->msg.msg_iter, ITER_SOURCE, bv, nsegs, sr->len);
+	} else {
+		if (WARN_ON_ONCE(nsegs != 1))
+			return -EFAULT;
+		if (unlikely(sr->buf_index >= ctx->nr_user_bufs))
+			return -EFAULT;
 		idx = array_index_nospec(sr->buf_index, ctx->nr_user_bufs);
 		imu = READ_ONCE(ctx->user_bufs[idx]);
-		io_req_set_rsrc_node(sr->notif, ctx);
-		ret = 0;
-	}
-	io_ring_submit_unlock(ctx, issue_flags);
 
-	if (unlikely(ret))
-		return ret;
+		ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter, imu,
+					(u64)(uintptr_t)sr->buf, sr->len);
+		if (unlikely(ret))
+			return ret;
+	}
 
-	ret = io_import_fixed(ITER_SOURCE, &kmsg->msg.msg_iter, imu,
-				(u64)(uintptr_t)sr->buf, sr->len);
-	if (unlikely(ret))
-		return ret;
 	kmsg->msg.sg_from_iter = io_sg_from_iter;
 	return 0;
 }
@@ -619,6 +622,16 @@ static int __io_send_import(struct io_kiocb *req, struct buf_sel_arg *arg,
 	struct io_async_msghdr *kmsg = req->async_data;
 	int ret = nsegs;
 
+	if (sr->flags & IORING_RECVSEND_FIXED_BUF) {
+		io_ring_submit_lock(req->ctx, issue_flags);
+		io_req_set_rsrc_node(sr->notif, req->ctx);
+		ret = __io_send_zc_import(req, kmsg, nsegs);
+		io_ring_submit_unlock(req->ctx, issue_flags);
+		if (unlikely(ret < 0))
+			return ret;
+		return nsegs;
+	}
+
 	if (nsegs == 1) {
 		sr->buf = arg->iovs[0].iov_base;
 		ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len,
@@ -646,10 +659,13 @@ static int io_send_import(struct io_kiocb *req, unsigned int issue_flags)
 			.nr_vecs = 1,
 		};
 
+		if (sr->flags & IORING_RECVSEND_FIXED_BUF)
+			arg.mode |= KBUF_MODE_BVEC;
+
 		if (kmsg->free_iov) {
 			arg.nr_vecs = kmsg->free_iov_nr;
 			arg.iovs = kmsg->free_iov;
-			arg.mode = KBUF_MODE_FREE;
+			arg.mode |= KBUF_MODE_FREE;
 		}
 
 		if (!(sr->flags & IORING_RECVSEND_BUNDLE))
@@ -1280,7 +1296,8 @@ void io_send_zc_cleanup(struct io_kiocb *req)
 	}
 }
 
-#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF)
+#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | \
+			    IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_BUNDLE)
 #define IO_ZC_FLAGS_VALID  (IO_ZC_FLAGS_COMMON | IORING_SEND_ZC_REPORT_USAGE)
 
 int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -1399,8 +1416,13 @@ static int io_send_zc_import(struct io_kiocb *req, unsigned int issue_flags)
 	struct io_async_msghdr *kmsg = req->async_data;
 	int ret;
 
+	ret = io_send_import(req, issue_flags);
+	if (unlikely(ret < 0))
+		return ret;
+	if (req->flags & REQ_F_BUFFER_SELECT)
+		return 0;
 	if (sr->flags & IORING_RECVSEND_FIXED_BUF)
-		return io_send_zc_import_single(req, issue_flags);
+		return __io_send_zc_import(req, kmsg, 1);
 
 	ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &kmsg->msg.msg_iter);
 	if (unlikely(ret))
@@ -1416,6 +1438,7 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg);
 	struct io_async_msghdr *kmsg = req->async_data;
+	unsigned int cflags;
 	struct socket *sock;
 	unsigned msg_flags;
 	int ret, min_ret = 0;
@@ -1476,7 +1499,8 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
 		io_notif_flush(zc->notif);
 		io_req_msg_cleanup(req, 0);
 	}
-	io_req_set_res(req, ret, IORING_CQE_F_MORE);
+	cflags = io_put_kbuf(req, ret, issue_flags);
+	io_req_set_res(req, ret, cflags | IORING_CQE_F_MORE);
 	return IOU_OK;
 }
 
diff --git a/io_uring/net.h b/io_uring/net.h
index 52bfee05f06a..e052762cf85d 100644
--- a/io_uring/net.h
+++ b/io_uring/net.h
@@ -5,9 +5,15 @@
 
 struct io_async_msghdr {
 #if defined(CONFIG_NET)
-	struct iovec			fast_iov;
+	union {
+		struct iovec		fast_iov;
+		struct bio_vec		fast_bvec;
+	};
 	/* points to an allocated iov, if NULL we use fast_iov instead */
-	struct iovec			*free_iov;
+	union {
+		struct iovec		*free_iov;
+		struct bio_vec		*free_bvec;
+	};
 	int				free_iov_nr;
 	int				namelen;
 	__kernel_size_t			controllen;
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index a2be3bbca5ff..6203a7dd5052 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -422,6 +422,7 @@ const struct io_issue_def io_issue_defs[] = {
 		.needs_file		= 1,
 		.unbound_nonreg_file	= 1,
 		.pollout		= 1,
+		.buffer_select		= 1,
 		.audit_skip		= 1,
 		.ioprio			= 1,
 #if defined(CONFIG_NET)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCHSET RFC 0/7] Add support for provided registered buffers
  2024-10-23 16:07 [PATCHSET RFC 0/7] Add support for provided registered buffers Jens Axboe
                   ` (6 preceding siblings ...)
  2024-10-23 16:07 ` [PATCH 7/7] io_uring/net: add provided buffer and bundle support to send zc Jens Axboe
@ 2024-10-24 14:36 ` Pavel Begunkov
  2024-10-24 14:43   ` Jens Axboe
  7 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-10-24 14:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 10/23/24 17:07, Jens Axboe wrote:
> Hi,
> 
> Normally a request can take a provided buffer, which means "pick a
> buffer from group X and do IO to/from it", or it can use a registered
> buffer, which means "use the buffer at index Y and do IO to/from it".
> For things like O_DIRECT and network zero copy, registered buffers can
> be used to speedup the operation, as they avoid repeated
> get_user_pages() and page referencing calls for each IO operation.
> 
> Normal (non zero copy) send supports bundles, which is a way to pick
> multiple provided buffers at once and send them. send zero copy only
> supports registered buffers, and hence can only send a single buffer

That's not true, and never has been; send[msg] zc works just fine with
normal (non-registered) buffers.

> at the time.

And that's covered by the posted series for vectored registered
buffers support.

> This patchset adds support for using a mix of provided and registered
> buffers, where the provided buffers merely provide an index into which
> registered buffers to use. This enables using provided buffers for
> send zc in general, but also bundles where multiple buffers are picked.
> This is done by changing how the provided buffers are interpreted.
> Normally a provided buffer has an address, length, and buffer ID
> associated with it. The address tells the kernel where the IO should
> occur. If both fixed and provided buffers are asked for, the provided
> buffer address field is instead an encoding of the registered buffer
> index and the offset within that buffer. With that in place, using a
> combination of the two can work.

What the series doesn't say is how it works with notifications and
what is the proposed user API in regard to it, it's the main if not
the only fundamental distinctive part of the SENDZC API.


> Patches 1-4 are just cleanup and prep, patch 5 adds the basic
> definition of what a fixed provided buffer looks like, patch 6 adds
> support for kbuf to map into a bvec directly, and finally patch 7
> adds support for send zero copy to use this mechanism.
> 
> More details available in the actual patches. Tested with kperf using
> zero copy RX and TX, easily reaching 100G speeds with a single thread
> doing 4k sends and receives.
> 
> Kernel tree here:
> 
> https://git.kernel.dk/cgit/linux/log/?h=io_uring-sendzc-provided
> 
>   include/uapi/linux/io_uring.h |   8 ++
>   io_uring/kbuf.c               | 180 +++++++++++++++++++++++++++----
>   io_uring/kbuf.h               |   9 +-
>   io_uring/net.c                | 192 ++++++++++++++++++++++------------
>   io_uring/net.h                |  10 +-
>   io_uring/opdef.c              |   1 +
>   6 files changed, 309 insertions(+), 91 deletions(-)
> 

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHSET RFC 0/7] Add support for provided registered buffers
  2024-10-24 14:36 ` [PATCHSET RFC 0/7] Add support for provided registered buffers Pavel Begunkov
@ 2024-10-24 14:43   ` Jens Axboe
  2024-10-24 15:04     ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-10-24 14:43 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 10/24/24 8:36 AM, Pavel Begunkov wrote:
> On 10/23/24 17:07, Jens Axboe wrote:
>> Hi,
>>
>> Normally a request can take a provided buffer, which means "pick a
>> buffer from group X and do IO to/from it", or it can use a registered
>> buffer, which means "use the buffer at index Y and do IO to/from it".
>> For things like O_DIRECT and network zero copy, registered buffers can
>> be used to speedup the operation, as they avoid repeated
>> get_user_pages() and page referencing calls for each IO operation.
>>
>> Normal (non zero copy) send supports bundles, which is a way to pick
>> multiple provided buffers at once and send them. send zero copy only
>> supports registered buffers, and hence can only send a single buffer
> 
> That's not true, has never been, send[msg] zc work just fine with
> normal (non-registered) buffers.

That's not what I'm saying; perhaps it isn't clear. What I'm trying to
say is that it only supports registered buffers, it does not support
provided buffers. It obviously does support regular user buffers that
aren't registered or provided; I figured that goes without saying
explicitly.

>> at the time.
> 
> And that's covered by the posted series for vectored registered
> buffers support.

Right, for sendmsg.

>> This patchset adds support for using a mix of provided and registered
>> buffers, where the provided buffers merely provide an index into which
>> registered buffers to use. This enables using provided buffers for
>> send zc in general, but also bundles where multiple buffers are picked.
>> This is done by changing how the provided buffers are interpreted.
>> Normally a provided buffer has an address, length, and buffer ID
>> associated with it. The address tells the kernel where the IO should
>> occur. If both fixed and provided buffers are asked for, the provided
>> buffer address field is instead an encoding of the registered buffer
>> index and the offset within that buffer. With that in place, using a
>> combination of the two can work.
> 
> What the series doesn't say is how it works with notifications and
> what is the proposed user API in regard to it, it's the main if not
> the only fundamental distinctive part of the SENDZC API.

This should not change that. You should get the usual two notifications:
one on send completion, and one when buffer reuse is safe.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] io_uring/net: add provided buffer and bundle support to send zc
  2024-10-23 16:07 ` [PATCH 7/7] io_uring/net: add provided buffer and bundle support to send zc Jens Axboe
@ 2024-10-24 14:44   ` Pavel Begunkov
  2024-10-24 14:48     ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-10-24 14:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 10/23/24 17:07, Jens Axboe wrote:
> Provided buffers inform the kernel which buffer group ID to pick a
> buffer from for transfer. Normally that buffer contains the usual
> addr + length information, as well as a buffer ID that is passed back
> at completion time to inform the application of which buffer was used
> for the transfer.
> 
> However, if registered and provided buffers are combined, then the
> provided buffer must instead tell the kernel which registered buffer
> index should be used, and the length/offset within that buffer. Rather
> than store the addr + length, the application must store this
> information instead.
> 
> If provided buffers are used with send zc, then those buffers must be
> an index into a registered buffer. Change the mapping type to use
> KBUF_MODE_BVEC, which tells the kbuf handlers to turn the mappings
> into bio_vecs rather than iovecs. Then all that is needed is to
> set up our iov_iter to use iov_iter_bvec().
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
...
> diff --git a/io_uring/net.h b/io_uring/net.h
> index 52bfee05f06a..e052762cf85d 100644
> --- a/io_uring/net.h
> +++ b/io_uring/net.h
> @@ -5,9 +5,15 @@
>   
>   struct io_async_msghdr {
>   #if defined(CONFIG_NET)
> -	struct iovec			fast_iov;
> +	union {
> +		struct iovec		fast_iov;
> +		struct bio_vec		fast_bvec;
> +	};
>   	/* points to an allocated iov, if NULL we use fast_iov instead */
> -	struct iovec			*free_iov;
> +	union {
> +		struct iovec		*free_iov;
> +		struct bio_vec		*free_bvec;

I'd rather not do it like that; aliasing with reused memory and a
shared element count is a recipe for disaster when scattered across
the code. E.g. it seems you change all(?) iovec allocations to
allocate based on the size of the larger structure.

Counting bytes as in my series is less fragile, otherwise it needs
a new structure and a set of helpers that can be kept together.


-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] io_uring/net: add provided buffer and bundle support to send zc
  2024-10-24 14:44   ` Pavel Begunkov
@ 2024-10-24 14:48     ` Jens Axboe
  2024-10-24 15:36       ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-10-24 14:48 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 10/24/24 8:44 AM, Pavel Begunkov wrote:
> On 10/23/24 17:07, Jens Axboe wrote:
>> Provided buffers inform the kernel which buffer group ID to pick a
>> buffer from for transfer. Normally that buffer contains the usual
>> addr + length information, as well as a buffer ID that is passed back
>> at completion time to inform the application of which buffer was used
>> for the transfer.
>>
>> However, if registered and provided buffers are combined, then the
>> provided buffer must instead tell the kernel which registered buffer
>> index should be used, and the length/offset within that buffer. Rather
>> than store the addr + length, the application must store this
>> information instead.
>>
>> If provided buffers are used with send zc, then those buffers must be
>> an index into a registered buffer. Change the mapping type to use
>> KBUF_MODE_BVEC, which tells the kbuf handlers to turn the mappings
>> into bio_vecs rather than iovecs. Then all that is needed is to
>> set up our iov_iter to use iov_iter_bvec().
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
> ...
>> diff --git a/io_uring/net.h b/io_uring/net.h
>> index 52bfee05f06a..e052762cf85d 100644
>> --- a/io_uring/net.h
>> +++ b/io_uring/net.h
>> @@ -5,9 +5,15 @@
>>     struct io_async_msghdr {
>>   #if defined(CONFIG_NET)
>> -    struct iovec            fast_iov;
>> +    union {
>> +        struct iovec        fast_iov;
>> +        struct bio_vec        fast_bvec;
>> +    };
>>       /* points to an allocated iov, if NULL we use fast_iov instead */
>> -    struct iovec            *free_iov;
>> +    union {
>> +        struct iovec        *free_iov;
>> +        struct bio_vec        *free_bvec;
> 
> I'd rather not do it like that, aliasing with reusing memory and
> counting the number is a recipe for disaster when scattered across
> code. E.g. seems you change all(?) iovec allocations to allocate
> based on the size of the larger structure.
> 
> Counting bytes as in my series is less fragile, otherwise it needs
> a new structure and a set of helpers that can be kept together.

I have been pondering this, because I'm not a huge fan either. But
outside of the space aspect, it does come out pretty clean. This
series is really just a WIP posting, as per the RFC, mostly so we can
come up with something that's clean enough and works for both cases,
as it does have the caching that your series does not, and to
facilitate more efficient TX/RX zero copy testing.

I'd love a separate struct for these two, but that kind of gets in the
way of the usual iovec imports. I'll get back to this soonish, it's not
a 6.13 thing by any stretch.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHSET RFC 0/7] Add support for provided registered buffers
  2024-10-24 14:43   ` Jens Axboe
@ 2024-10-24 15:04     ` Pavel Begunkov
  2024-10-24 15:11       ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-10-24 15:04 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 10/24/24 15:43, Jens Axboe wrote:
> On 10/24/24 8:36 AM, Pavel Begunkov wrote:
>> On 10/23/24 17:07, Jens Axboe wrote:
>>> Hi,
>>>
>>> Normally a request can take a provided buffer, which means "pick a
>>> buffer from group X and do IO to/from it", or it can use a registered
>>> buffer, which means "use the buffer at index Y and do IO to/from it".
>>> For things like O_DIRECT and network zero copy, registered buffers can
>>> be used to speedup the operation, as they avoid repeated
>>> get_user_pages() and page referencing calls for each IO operation.
>>>
>>> Normal (non zero copy) send supports bundles, which is a way to pick
>>> multiple provided buffers at once and send them. send zero copy only
>>> supports registered buffers, and hence can only send a single buffer
>>
>> That's not true, has never been, send[msg] zc work just fine with
>> normal (non-registered) buffers.
> 
> That's not what I'm saying, perhaps it isn't clear. What I'm trying to
> say is that it only supports registered buffers, it does not support
> provided buffers. It obviously does support regular user provided
> buffers that aren't registered or provided, I figured that goes without
> saying explicitly.

Normally it goes without saying, yes, but the confusion here stems
from a more or less explicit implication (or at least that's how I
read it): "it only supports registered buffers => selected buffer
support should support registered buffers, which this adds".

Does the series allow provided buffers with normal user memory?

>>> at the time.
>>
>> And that's covered by the posted series for vectored registered
>> buffers support.
> 
> Right, for sendmsg.
> 
>>> This patchset adds support for using a mix of provided and registered
>>> buffers, where the provided buffers merely provide an index into which
>>> registered buffers to use. This enables using provided buffers for
>>> send zc in general, but also bundles where multiple buffers are picked.
>>> This is done by changing how the provided buffers are interpreted.
>>> Normally a provided buffer has an address, length, and buffer ID
>>> associated with it. The address tells the kernel where the IO should
>>> occur. If both fixed and provided buffers are asked for, the provided
>>> buffer address field is instead an encoding of the registered buffer
>>> index and the offset within that buffer. With that in place, using a
>>> combination of the two can work.
>>
>> What the series doesn't say is how it works with notifications and
>> what is the proposed user API in regard to it, it's the main if not
>> the only fundamental distinctive part of the SENDZC API.
> 
> Should not change that? You'll should get the usual two notifications on
> send complete, and reuse safe.

Right, you get a notification, but what is it supposed to mean to
the user? Something like "the notification indicates that all buffers
consumed by this request can be reused"? Multishot is not a thing
here, but how is the user supposed to track which buffers were
consumed by this request? I assume it posts a CQE per buffer
completion, right?

And let's say you have a send-heavy workload where the user pushes
more than the socket can take, i.e. it has to wait to send more
and there is always something to send. Does it poll-retry as is
usually done for multishots? How are notifications paced? I.e.
it'll continue hooking more and more buffers onto the same
notification, locking up all the previously used buffers.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCHSET RFC 0/7] Add support for provided registered buffers
  2024-10-24 15:04     ` Pavel Begunkov
@ 2024-10-24 15:11       ` Jens Axboe
  0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-10-24 15:11 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 10/24/24 9:04 AM, Pavel Begunkov wrote:
> On 10/24/24 15:43, Jens Axboe wrote:
>> On 10/24/24 8:36 AM, Pavel Begunkov wrote:
>>> On 10/23/24 17:07, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> Normally a request can take a provided buffer, which means "pick a
>>>> buffer from group X and do IO to/from it", or it can use a registered
>>>> buffer, which means "use the buffer at index Y and do IO to/from it".
>>>> For things like O_DIRECT and network zero copy, registered buffers can
>>>> be used to speedup the operation, as they avoid repeated
>>>> get_user_pages() and page referencing calls for each IO operation.
>>>>
>>>> Normal (non zero copy) send supports bundles, which is a way to pick
>>>> multiple provided buffers at once and send them. send zero copy only
>>>> supports registered buffers, and hence can only send a single buffer
>>>
>>> That's not true, has never been, send[msg] zc work just fine with
>>> normal (non-registered) buffers.
>>
>> That's not what I'm saying, perhaps it isn't clear. What I'm trying to
>> say is that it only supports registered buffers, it does not support
>> provided buffers. It obviously does support regular user provided
>> buffers that aren't registered or provided, I figured that goes without
>> saying explicitly.
> 
> Normally goes without saying yes, but the confusion here is because
> of a more or less explicit implication (or at least I read it so)
> "it only supports registered buffers => selected buffer support
> should support registered buffers, which it adds"

I'll expand it to be clearer.

> Does the series allow provided buffers with normal user memory?

Yep, it allows picking one (or more, for bundles) provided buffers,
and each provided buffer is either normal user memory, or an index
into a registered buffer.

>>>> This patchset adds support for using a mix of provided and registered
>>>> buffers, where the provided buffers merely provide an index into which
>>>> registered buffers to use. This enables using provided buffers for
>>>> send zc in general, but also bundles where multiple buffers are picked.
>>>> This is done by changing how the provided buffers are interpreted.
>>>> Normally a provided buffer has an address, length, and buffer ID
>>>> associated with it. The address tells the kernel where the IO should
>>>> occur. If both fixed and provided buffers are asked for, the provided
>>>> buffer address field is instead an encoding of the registered buffer
>>>> index and the offset within that buffer. With that in place, using a
>>>> combination of the two can work.
>>>
>>> What the series doesn't say is how it works with notifications and
>>> what is the proposed user API in regard to it, it's the main if not
>>> the only fundamental distinctive part of the SENDZC API.
>>
>> Should not change that? You'll should get the usual two notifications on
>> send complete, and reuse safe.
> 
> Right you get a notification, but what is it supposed to mean to
> the user? Like "the notification indicates that all buffers that
> are consumed by this request can be reused". Multishot is not a
> thing, but how the user has to track what buffers are consumed
> by this request? I assume it posts a CQE per buffer completion,
> right?

Depends on whether it's a bundle or not. For a non-bundle, a single
buffer is picked, and that buffer is either user memory or an index
into a registered buffer. For that, completions work just like they do
now - one is posted for the expected inline completion with the buffer
ID, and one is posted later once reuse is safe.

If it's a bundle, it works the same way: two completions are posted.
The first, expected inline one will have a buffer ID, and the length
will tell you how many consecutive buffers were sent/consumed. The
reuse notification then follows that completion.

> And let's say you have send heavy workload where the user pushes
> more than the socket can take, i.e. it has to wait to send more
> and there is always something to send. Does it poll-retry as it's
> usually done for multishots? How notifications are paced? i.e.
> it'll continue hooking more and more buffers onto the same
> notification locking all the previously used buffers.

If it sends nothing, nothing is consumed. If it's a partial send, then
buffers are kept (as nobody else should send them anyway), and it's
retried based on the poll trigger. For the latter case, the completion is
posted at the end, when the picked buffers are done. For pacing, you can
limit the amount of data sent by just setting ->len to your desired
bundle/batch size.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/7] io_uring/kbuf: add support for mapping type KBUF_MODE_BVEC
  2024-10-23 16:07 ` [PATCH 6/7] io_uring/kbuf: add support for mapping type KBUF_MODE_BVEC Jens Axboe
@ 2024-10-24 15:22   ` Pavel Begunkov
  2024-10-24 15:27     ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-10-24 15:22 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 10/23/24 17:07, Jens Axboe wrote:
> The provided buffer helpers always map to iovecs. Add a new mode,
> KBUF_MODE_BVEC, which maps it to a bio_vec array instead. This is for
> use with zero-copy scenarios, where the caller would want to turn it
> into a bio_vec anyway; it avoids first iterating and filling out
> an iovec array, only for the caller to then iterate it again and turn
> it into a bio_vec array.
> 
> Since it's now managing both iovecs and bvecs, change the naming of
> buf_sel_arg->nr_iovs member to nr_vecs instead.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>   io_uring/kbuf.c | 170 +++++++++++++++++++++++++++++++++++++++++++-----
>   io_uring/kbuf.h |   9 ++-
>   io_uring/net.c  |  10 +--
>   3 files changed, 165 insertions(+), 24 deletions(-)
> 
> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
> index 42579525c4bd..10a3a7a27e9a 100644
> --- a/io_uring/kbuf.c
> +++ b/io_uring/kbuf.c
...
> +static struct io_mapped_ubuf *io_ubuf_from_buf(struct io_ring_ctx *ctx,
> +					       u64 addr, unsigned int *offset)
> +{
> +	struct io_mapped_ubuf *imu;
> +	u16 idx;
> +
> +	/*
> +	 * Get registered buffer index and offset, encoded into the
> +	 * addr base value.
> +	 */
> +	idx = addr & ((1ULL << IOU_BUF_REGBUF_BITS) - 1);
> +	addr >>= IOU_BUF_REGBUF_BITS;
> +	*offset = addr  & ((1ULL << IOU_BUF_OFFSET_BITS) - 1);

There are two ABI questions with that. First, why not just use user
addresses instead of offsets? It's more consistent with how everything
else works. Sure, it could've been offsets for all registered buffer
ops from the beginning, but it's not.

And second, we need to start getting rid of the global node queue.
If we do, this will need to allocate an array of nodes, store an imu
list or something similar, which is just as terrible as it sounds,
and then it'll need another cache, sprinkling more checks and handling
into the hot path and so on. That's the reason the vectored registered
buffer patch supports just one registered buffer index per request,
and I believe this one should do the same.

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/7] io_uring/kbuf: add support for mapping type KBUF_MODE_BVEC
  2024-10-24 15:22   ` Pavel Begunkov
@ 2024-10-24 15:27     ` Jens Axboe
  2024-10-24 15:40       ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-10-24 15:27 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 10/24/24 9:22 AM, Pavel Begunkov wrote:
> On 10/23/24 17:07, Jens Axboe wrote:
>> The provided buffer helpers always map to iovecs. Add a new mode,
>> KBUF_MODE_BVEC, which maps it to a bio_vec array instead. For
>> use with zero-copy scenarios, where the caller would want to turn it
>> into a bio_vec anyway, and this avoids first iterating and filling out
>> an iovec array, only for the caller to then iterate it again and turn
>> it into a bio_vec array.
>>
>> Since it's now managing both iovecs and bvecs, change the naming of
>> buf_sel_arg->nr_iovs member to nr_vecs instead.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>   io_uring/kbuf.c | 170 +++++++++++++++++++++++++++++++++++++++++++-----
>>   io_uring/kbuf.h |   9 ++-
>>   io_uring/net.c  |  10 +--
>>   3 files changed, 165 insertions(+), 24 deletions(-)
>>
>> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
>> index 42579525c4bd..10a3a7a27e9a 100644
>> --- a/io_uring/kbuf.c
>> +++ b/io_uring/kbuf.c
> ...
>> +static struct io_mapped_ubuf *io_ubuf_from_buf(struct io_ring_ctx *ctx,
>> +                           u64 addr, unsigned int *offset)
>> +{
>> +    struct io_mapped_ubuf *imu;
>> +    u16 idx;
>> +
>> +    /*
>> +     * Get registered buffer index and offset, encoded into the
>> +     * addr base value.
>> +     */
>> +    idx = addr & ((1ULL << IOU_BUF_REGBUF_BITS) - 1);
>> +    addr >>= IOU_BUF_REGBUF_BITS;
>> +    *offset = addr  & ((1ULL << IOU_BUF_OFFSET_BITS) - 1);
> 
> There are two ABI questions with that. First why not use just
> user addresses instead of offsets? It's more consistent with
> how everything else works. Surely it could've been offsets for
> all registered buffer ops from the beginning, but it's not.

How would that work? You need to pass in addr + buffer index for that.
The usual approach is doing exactly that, and then 'addr' tells you
the offset within the buffer, e.g. you can just do a subtraction to
get your offset. But you can't pass in both addr + index in a provided
buffer, which is why it's using buf->addr to encode index + offset,
rather than relying on the addr for the offset too.

The obvious alternative is to just pass the 'addr' and have that serve
as both index and offset, in which case you'd need to look up the
buffer. And that's certainly a no-go.

> And the second, we need to start getting rid of the global node
> queue, if we do, this will need to allocate an array of nodes,
> store an imu list or something similar, which will be just
> as terrible as it sounds, and then it'll need another cache,
> sprinkling more checks and handling into the hot path and so
> on. That's the reason the vectored registered buffer patch
> supports just one registered buffer to index per request, and
> I believe this one should do that as well.

Yeah, agreed; the global node queue is getting in the way of more
important things too, like applications using registered files and not
seeing reclaim in due time. I think that's the main issue with the
ring-wide queue, and certainly something that needs sorting sooner
rather than later.

Limiting this patch to dealing with a single imu would be perfectly
fine, I think; the intended use case here is really large registered
buffers, not little ones. And having a bundle limited to a single
registered buffer would be perfectly fine. I can make that change for
sure.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] io_uring/net: add provided buffer and bundle support to send zc
  2024-10-24 14:48     ` Jens Axboe
@ 2024-10-24 15:36       ` Pavel Begunkov
  0 siblings, 0 replies; 26+ messages in thread
From: Pavel Begunkov @ 2024-10-24 15:36 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 10/24/24 15:48, Jens Axboe wrote:
> On 10/24/24 8:44 AM, Pavel Begunkov wrote:
>> On 10/23/24 17:07, Jens Axboe wrote:
>>> Provided buffers inform the kernel which buffer group ID to pick a
>>> buffer from for transfer. Normally that buffer contains the usual
>>> addr + length information, as well as a buffer ID that is passed back
>>> at completion time to inform the application of which buffer was used
>>> for the transfer.
>>>
>>> However, if registered and provided buffers are combined, then the
>>> provided buffer must instead tell the kernel which registered buffer
>>> index should be used, and the length/offset within that buffer. Rather
>>> than store the addr + length, the application must store this
>>> information instead.
>>>
>>> If provided buffers are used with send zc, then those buffers must be
>>> an index into a registered buffer. Change the mapping type to use
>>> KBUF_MODE_BVEC, which tells the kbuf handlers to turn the mappings
>>> into bio_vecs rather than iovecs. Then all that is needed is to
>>> set up our iov_iter to use iov_iter_bvec().
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>> ...
>>> diff --git a/io_uring/net.h b/io_uring/net.h
>>> index 52bfee05f06a..e052762cf85d 100644
>>> --- a/io_uring/net.h
>>> +++ b/io_uring/net.h
>>> @@ -5,9 +5,15 @@
>>>      struct io_async_msghdr {
>>>    #if defined(CONFIG_NET)
>>> -    struct iovec            fast_iov;
>>> +    union {
>>> +        struct iovec        fast_iov;
>>> +        struct bio_vec        fast_bvec;
>>> +    };
>>>        /* points to an allocated iov, if NULL we use fast_iov instead */
>>> -    struct iovec            *free_iov;
>>> +    union {
>>> +        struct iovec        *free_iov;
>>> +        struct bio_vec        *free_bvec;
>>
>> I'd rather not do it like that, aliasing with reusing memory and
>> counting the number is a recipe for disaster when scattered across
>> code. E.g. seems you change all(?) iovec allocations to allocate
>> based on the size of the larger structure.
>>
>> Counting bytes as in my series is less fragile, otherwise it needs
>> a new structure and a set of helpers that can be kept together.
> 
> I have been pondering this, because I'm not a huge fan either. But
> outside of the space side, it does come out pretty nicely/clean. This
> series is really just a WIP posting as per the RFC, mostly just so we
> can come up with something that's clean enough and works for both cases,
> as it does have the caching that your series does not. And to
> facilitate some more efficient TX/RX zero copy testing.

The thing is, I know how to implement caching on top of my series,
but as I commented in the other thread, I don't think it can reuse
the incremental helper well. At least it can try to unify the imu
checking / parsing, bvec copy-setup, but that's minor.

-- 
Pavel Begunkov


* Re: [PATCH 6/7] io_uring/kbuf: add support for mapping type KBUF_MODE_BVEC
  2024-10-24 15:27     ` Jens Axboe
@ 2024-10-24 15:40       ` Pavel Begunkov
  2024-10-24 15:49         ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-10-24 15:40 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 10/24/24 16:27, Jens Axboe wrote:
> On 10/24/24 9:22 AM, Pavel Begunkov wrote:
>> On 10/23/24 17:07, Jens Axboe wrote:
>>> The provided buffer helpers always map to iovecs. Add a new mode,
>>> KBUF_MODE_BVEC, which instead maps it to a bio_vec array instead. For
>>> use with zero-copy scenarios, where the caller would want to turn it
>>> into a bio_vec anyway, and this avoids first iterating and filling out
>>> and iovec array, only for the caller to then iterate it again and turn
>>> it into a bio_vec array.
>>>
>>> Since it's now managing both iovecs and bvecs, change the naming of
>>> buf_sel_arg->nr_iovs member to nr_vecs instead.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>    io_uring/kbuf.c | 170 +++++++++++++++++++++++++++++++++++++++++++-----
>>>    io_uring/kbuf.h |   9 ++-
>>>    io_uring/net.c  |  10 +--
>>>    3 files changed, 165 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
>>> index 42579525c4bd..10a3a7a27e9a 100644
>>> --- a/io_uring/kbuf.c
>>> +++ b/io_uring/kbuf.c
>> ...
>>> +static struct io_mapped_ubuf *io_ubuf_from_buf(struct io_ring_ctx *ctx,
>>> +                           u64 addr, unsigned int *offset)
>>> +{
>>> +    struct io_mapped_ubuf *imu;
>>> +    u16 idx;
>>> +
>>> +    /*
>>> +     * Get registered buffer index and offset, encoded into the
>>> +     * addr base value.
>>> +     */
>>> +    idx = addr & ((1ULL << IOU_BUF_REGBUF_BITS) - 1);
>>> +    addr >>= IOU_BUF_REGBUF_BITS;
>>> +    *offset = addr  & ((1ULL << IOU_BUF_OFFSET_BITS) - 1);
>>
>> There are two ABI questions with that. First why not use just
>> user addresses instead of offsets? It's more consistent with
>> how everything else works. Surely it could've been offsets for
>> all registered buffers ops from the beggining, but it's not.
> 
> How would that work? You need to pass in addr + buffer index for that.

I guess it depends on the second part then, that is, if you
want to preserve the layout, in which case you can just use
sqe->buf_index.

> The usual approach is doing that, and then 'addr' tells you the offset
> within the buffer, eg you can just do a subtraction to get your offset.
> But you can't pass in both addr + index in a provided buffer, which is
> why it's using buf->addr to encode index + offset for that, rather than
> rely on the addr for the offset too.
> 
> The alternative obviously is to just do the 'addr' and have that be both
> index and offset, in which case you'd need to lookup the buffer. And
> that's certainly a no-go.

-- 
Pavel Begunkov


* Re: [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers
  2024-10-23 16:07 ` [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers Jens Axboe
@ 2024-10-24 15:44   ` Pavel Begunkov
  2024-10-24 15:57     ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-10-24 15:44 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 10/23/24 17:07, Jens Axboe wrote:
> This just adds the necessary shifts that define what a provided buffer
> that is merely an index into a registered buffer looks like. A provided
> buffer looks like the following:
> 
> struct io_uring_buf {
> 	__u64	addr;
> 	__u32	len;
> 	__u16	bid;
> 	__u16	resv;
> };
> 
> where 'addr' holds a userspace address, 'len' is the length of the
> buffer, and 'bid' is the buffer ID identifying the buffer. This works
> fine for a virtual address, but it cannot be used efficiently denote
> a registered buffer. Registered buffers are pre-mapped into the kernel
> for more efficient IO, avoiding a get_user_pages() and page(s) inc+dec,
> and are used for things like O_DIRECT on storage and zero copy send.
> 
> Particularly for the send case, it'd be useful to support a mix of
> provided and registered buffers. This enables the use of using a
> provided ring buffer to serialize sends, and also enables the use of
> send bundles, where a send can pick multiple buffers and send them all
> at once.
> 
> If provided buffers are used as an index into registered buffers, the
> meaning of buf->addr changes. If registered buffer index 'regbuf_index'
> is desired, with a length of 'len' and the offset 'regbuf_offset' from
> the start of the buffer, then the application would fill out the entry
> as follows:
> 
> buf->addr = ((__u64) regbuf_offset << IOU_BUF_OFFSET_BITS) | regbuf_index;
> buf->len = len;
> 
> and otherwise add it to the buffer ring as usual. The kernel will then
> first pick a buffer from the desired buffer group ID, and then decode
> which registered buffer to use for the transfer.
> 
> This provides a way to use both registered and provided buffers at the
> same time.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>   include/uapi/linux/io_uring.h | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 86cb385fe0b5..eef88d570cb4 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -733,6 +733,14 @@ struct io_uring_buf_ring {
>   	};
>   };
>   
> +/*
> + * When provided buffers are used as indices into registered buffers, the
> + * lower IOU_BUF_REGBUF_BITS indicate the index into the registered buffers,
> + * and the upper IOU_BUF_OFFSET_BITS indicate the offset into that buffer.
> + */
> +#define IOU_BUF_REGBUF_BITS	(32ULL)
> +#define IOU_BUF_OFFSET_BITS	(32ULL)

32 bits are fine for the IO size but not enough to store offsets;
it can only address registered buffers under 4GB.


-- 
Pavel Begunkov


* Re: [PATCH 6/7] io_uring/kbuf: add support for mapping type KBUF_MODE_BVEC
  2024-10-24 15:40       ` Pavel Begunkov
@ 2024-10-24 15:49         ` Jens Axboe
  0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-10-24 15:49 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 10/24/24 9:40 AM, Pavel Begunkov wrote:
> On 10/24/24 16:27, Jens Axboe wrote:
>> On 10/24/24 9:22 AM, Pavel Begunkov wrote:
>>> On 10/23/24 17:07, Jens Axboe wrote:
>>>> The provided buffer helpers always map to iovecs. Add a new mode,
>>>> KBUF_MODE_BVEC, which instead maps it to a bio_vec array instead. For
>>>> use with zero-copy scenarios, where the caller would want to turn it
>>>> into a bio_vec anyway, and this avoids first iterating and filling out
>>>> and iovec array, only for the caller to then iterate it again and turn
>>>> it into a bio_vec array.
>>>>
>>>> Since it's now managing both iovecs and bvecs, change the naming of
>>>> buf_sel_arg->nr_iovs member to nr_vecs instead.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> ---
>>>>    io_uring/kbuf.c | 170 +++++++++++++++++++++++++++++++++++++++++++-----
>>>>    io_uring/kbuf.h |   9 ++-
>>>>    io_uring/net.c  |  10 +--
>>>>    3 files changed, 165 insertions(+), 24 deletions(-)
>>>>
>>>> diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
>>>> index 42579525c4bd..10a3a7a27e9a 100644
>>>> --- a/io_uring/kbuf.c
>>>> +++ b/io_uring/kbuf.c
>>> ...
>>>> +static struct io_mapped_ubuf *io_ubuf_from_buf(struct io_ring_ctx *ctx,
>>>> +                           u64 addr, unsigned int *offset)
>>>> +{
>>>> +    struct io_mapped_ubuf *imu;
>>>> +    u16 idx;
>>>> +
>>>> +    /*
>>>> +     * Get registered buffer index and offset, encoded into the
>>>> +     * addr base value.
>>>> +     */
>>>> +    idx = addr & ((1ULL << IOU_BUF_REGBUF_BITS) - 1);
>>>> +    addr >>= IOU_BUF_REGBUF_BITS;
>>>> +    *offset = addr  & ((1ULL << IOU_BUF_OFFSET_BITS) - 1);
>>>
>>> There are two ABI questions with that. First why not use just
>>> user addresses instead of offsets? It's more consistent with
>>> how everything else works. Surely it could've been offsets for
>>> all registered buffers ops from the beggining, but it's not.
>>
>> How would that work? You need to pass in addr + buffer index for that.
> 
> I guess it depends on the second part then, that is if you
> want to preserve the layout, in which case you can just use
> sqe->buf_index

The whole point is to make provided AND registered buffers work
together. And you can't pass in a buffer group ID _and_ a registered
buffer index in the SQE.

And for provided buffers, furthermore the point is that the buffer
itself holds information about where to transfer to/from. Once you've
added your buffer, you don't need to further track it, when it gets
picked it has all the information on where the transfer occurs.

-- 
Jens Axboe


* Re: [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers
  2024-10-24 15:44   ` Pavel Begunkov
@ 2024-10-24 15:57     ` Jens Axboe
  2024-10-24 16:17       ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-10-24 15:57 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 10/24/24 9:44 AM, Pavel Begunkov wrote:
> On 10/23/24 17:07, Jens Axboe wrote:
>> This just adds the necessary shifts that define what a provided buffer
>> that is merely an index into a registered buffer looks like. A provided
>> buffer looks like the following:
>>
>> struct io_uring_buf {
>>     __u64    addr;
>>     __u32    len;
>>     __u16    bid;
>>     __u16    resv;
>> };
>>
>> where 'addr' holds a userspace address, 'len' is the length of the
>> buffer, and 'bid' is the buffer ID identifying the buffer. This works
>> fine for a virtual address, but it cannot be used efficiently denote
>> a registered buffer. Registered buffers are pre-mapped into the kernel
>> for more efficient IO, avoiding a get_user_pages() and page(s) inc+dec,
>> and are used for things like O_DIRECT on storage and zero copy send.
>>
>> Particularly for the send case, it'd be useful to support a mix of
>> provided and registered buffers. This enables the use of using a
>> provided ring buffer to serialize sends, and also enables the use of
>> send bundles, where a send can pick multiple buffers and send them all
>> at once.
>>
>> If provided buffers are used as an index into registered buffers, the
>> meaning of buf->addr changes. If registered buffer index 'regbuf_index'
>> is desired, with a length of 'len' and the offset 'regbuf_offset' from
>> the start of the buffer, then the application would fill out the entry
>> as follows:
>>
>> buf->addr = ((__u64) regbuf_offset << IOU_BUF_OFFSET_BITS) | regbuf_index;
>> buf->len = len;
>>
>> and otherwise add it to the buffer ring as usual. The kernel will then
>> first pick a buffer from the desired buffer group ID, and then decode
>> which registered buffer to use for the transfer.
>>
>> This provides a way to use both registered and provided buffers at the
>> same time.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>   include/uapi/linux/io_uring.h | 8 ++++++++
>>   1 file changed, 8 insertions(+)
>>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 86cb385fe0b5..eef88d570cb4 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -733,6 +733,14 @@ struct io_uring_buf_ring {
>>       };
>>   };
>>   +/*
>> + * When provided buffers are used as indices into registered buffers, the
>> + * lower IOU_BUF_REGBUF_BITS indicate the index into the registered buffers,
>> + * and the upper IOU_BUF_OFFSET_BITS indicate the offset into that buffer.
>> + */
>> +#define IOU_BUF_REGBUF_BITS    (32ULL)
>> +#define IOU_BUF_OFFSET_BITS    (32ULL)
> 
> 32 bit is fine for IO size but not enough to store offsets, it
> can only address under 4GB registered buffers.

I did think about that - at least as it stands, registered buffers are
limited to 1GB in size. That's how it's been since that got added. Now,
for the future, we may obviously lift that limitation, and yeah then
32-bits would not necessarily be enough for the offset.

For linux, the max read/write value has always been INT_MAX & PAGE_MASK,
so we could make do with 31 bits for the size, which would bump the
offset to 33-bits, or 8G. That'd leave enough room for, at least, 8G
buffers, or 8x what we support now. Which is probably fine, you'd just
split your buffer registrations into 8G chunks, if you want to register
more than 8G of memory.

-- 
Jens Axboe


* Re: [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers
  2024-10-24 15:57     ` Jens Axboe
@ 2024-10-24 16:17       ` Pavel Begunkov
  2024-10-24 17:16         ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-10-24 16:17 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 10/24/24 16:57, Jens Axboe wrote:
> On 10/24/24 9:44 AM, Pavel Begunkov wrote:
>> On 10/23/24 17:07, Jens Axboe wrote:
>>> This just adds the necessary shifts that define what a provided buffer
>>> that is merely an index into a registered buffer looks like. A provided
>>> buffer looks like the following:
>>>
>>> struct io_uring_buf {
>>>      __u64    addr;
>>>      __u32    len;
>>>      __u16    bid;
>>>      __u16    resv;
>>> };
>>>
>>> where 'addr' holds a userspace address, 'len' is the length of the
>>> buffer, and 'bid' is the buffer ID identifying the buffer. This works
>>> fine for a virtual address, but it cannot be used efficiently denote
>>> a registered buffer. Registered buffers are pre-mapped into the kernel
>>> for more efficient IO, avoiding a get_user_pages() and page(s) inc+dec,
>>> and are used for things like O_DIRECT on storage and zero copy send.
>>>
>>> Particularly for the send case, it'd be useful to support a mix of
>>> provided and registered buffers. This enables the use of using a
>>> provided ring buffer to serialize sends, and also enables the use of
>>> send bundles, where a send can pick multiple buffers and send them all
>>> at once.
>>>
>>> If provided buffers are used as an index into registered buffers, the
>>> meaning of buf->addr changes. If registered buffer index 'regbuf_index'
>>> is desired, with a length of 'len' and the offset 'regbuf_offset' from
>>> the start of the buffer, then the application would fill out the entry
>>> as follows:
>>>
>>> buf->addr = ((__u64) regbuf_offset << IOU_BUF_OFFSET_BITS) | regbuf_index;
>>> buf->len = len;
>>>
>>> and otherwise add it to the buffer ring as usual. The kernel will then
>>> first pick a buffer from the desired buffer group ID, and then decode
>>> which registered buffer to use for the transfer.
>>>
>>> This provides a way to use both registered and provided buffers at the
>>> same time.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>    include/uapi/linux/io_uring.h | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>> index 86cb385fe0b5..eef88d570cb4 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -733,6 +733,14 @@ struct io_uring_buf_ring {
>>>        };
>>>    };
>>>    +/*
>>> + * When provided buffers are used as indices into registered buffers, the
>>> + * lower IOU_BUF_REGBUF_BITS indicate the index into the registered buffers,
>>> + * and the upper IOU_BUF_OFFSET_BITS indicate the offset into that buffer.
>>> + */
>>> +#define IOU_BUF_REGBUF_BITS    (32ULL)
>>> +#define IOU_BUF_OFFSET_BITS    (32ULL)
>>
>> 32 bit is fine for IO size but not enough to store offsets, it
>> can only address under 4GB registered buffers.
> 
> I did think about that - at least as it stands, registered buffers are
> limited to 1GB in size. That's how it's been since that got added. Now,
> for the future, we may obviously lift that limitation, and yeah then
> 32-bits would not necessarily be enough for the offset.

Right, and I don't think it's unreasonable considering how
much memory systems have nowadays, and we think that one large
registered buffer is a good thing.

> For linux, the max read/write value has always been INT_MAX & PAGE_MASK,
> so we could make do with 31 bits for the size, which would bump the
> offset to 33-bits, or 8G. That'd leave enough room for, at least, 8G
> buffers, or 8x what we support now. Which is probably fine, you'd just
> split your buffer registrations into 8G chunks, if you want to register
> more than 8G of memory.

That's why I mentioned IO size, you can register a very large buffer
and do IO with a small subchunk of it, even if that "small" is 4G,
but it still needs to be addressed. I think we need at least an order
of magnitude or two more space for it to last for a bit.

Can it steal bits from IOU_BUF_REGBUF_BITS?

-- 
Pavel Begunkov


* Re: [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers
  2024-10-24 16:17       ` Pavel Begunkov
@ 2024-10-24 17:16         ` Jens Axboe
  2024-10-24 18:20           ` Pavel Begunkov
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-10-24 17:16 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 10/24/24 10:17 AM, Pavel Begunkov wrote:
> On 10/24/24 16:57, Jens Axboe wrote:
>> On 10/24/24 9:44 AM, Pavel Begunkov wrote:
>>> On 10/23/24 17:07, Jens Axboe wrote:
>>>> This just adds the necessary shifts that define what a provided buffer
>>>> that is merely an index into a registered buffer looks like. A provided
>>>> buffer looks like the following:
>>>>
>>>> struct io_uring_buf {
>>>>      __u64    addr;
>>>>      __u32    len;
>>>>      __u16    bid;
>>>>      __u16    resv;
>>>> };
>>>>
>>>> where 'addr' holds a userspace address, 'len' is the length of the
>>>> buffer, and 'bid' is the buffer ID identifying the buffer. This works
>>>> fine for a virtual address, but it cannot be used efficiently denote
>>>> a registered buffer. Registered buffers are pre-mapped into the kernel
>>>> for more efficient IO, avoiding a get_user_pages() and page(s) inc+dec,
>>>> and are used for things like O_DIRECT on storage and zero copy send.
>>>>
>>>> Particularly for the send case, it'd be useful to support a mix of
>>>> provided and registered buffers. This enables the use of using a
>>>> provided ring buffer to serialize sends, and also enables the use of
>>>> send bundles, where a send can pick multiple buffers and send them all
>>>> at once.
>>>>
>>>> If provided buffers are used as an index into registered buffers, the
>>>> meaning of buf->addr changes. If registered buffer index 'regbuf_index'
>>>> is desired, with a length of 'len' and the offset 'regbuf_offset' from
>>>> the start of the buffer, then the application would fill out the entry
>>>> as follows:
>>>>
>>>> buf->addr = ((__u64) regbuf_offset << IOU_BUF_OFFSET_BITS) | regbuf_index;
>>>> buf->len = len;
>>>>
>>>> and otherwise add it to the buffer ring as usual. The kernel will then
>>>> first pick a buffer from the desired buffer group ID, and then decode
>>>> which registered buffer to use for the transfer.
>>>>
>>>> This provides a way to use both registered and provided buffers at the
>>>> same time.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> ---
>>>>    include/uapi/linux/io_uring.h | 8 ++++++++
>>>>    1 file changed, 8 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>> index 86cb385fe0b5..eef88d570cb4 100644
>>>> --- a/include/uapi/linux/io_uring.h
>>>> +++ b/include/uapi/linux/io_uring.h
>>>> @@ -733,6 +733,14 @@ struct io_uring_buf_ring {
>>>>        };
>>>>    };
>>>>    +/*
>>>> + * When provided buffers are used as indices into registered buffers, the
>>>> + * lower IOU_BUF_REGBUF_BITS indicate the index into the registered buffers,
>>>> + * and the upper IOU_BUF_OFFSET_BITS indicate the offset into that buffer.
>>>> + */
>>>> +#define IOU_BUF_REGBUF_BITS    (32ULL)
>>>> +#define IOU_BUF_OFFSET_BITS    (32ULL)
>>>
>>> 32 bit is fine for IO size but not enough to store offsets, it
>>> can only address under 4GB registered buffers.
>>
>> I did think about that - at least as it stands, registered buffers are
>> limited to 1GB in size. That's how it's been since that got added. Now,
>> for the future, we may obviously lift that limitation, and yeah then
>> 32-bits would not necessarily be enough for the offset.
> 
> Right, and I don't think it's unreasonable considering with how
> much memory systems have nowadays, and we think that one large
> registered buffer is a good thing.

Agree - but at the same time, not a big hardship to chunk up the region
into 8G chunks rather than allow, eg, a 64G region. Would be nice if it
wasn't a requirement, but unsure how to make that work otherwise.

And not a lot of complaints on having 1G be the size, even from the
varnish side where they register hundreds of gigs of memory.

>> For linux, the max read/write value has always been INT_MAX & PAGE_MASK,
>> so we could make do with 31 bits for the size, which would bump the
>> offset to 33-bits, or 8G. That'd leave enough room for, at least, 8G
>> buffers, or 8x what we support now. Which is probably fine, you'd just
>> split your buffer registrations into 8G chunks, if you want to register
>> more than 8G of memory.
> 
> That's why I mentioned IO size, you can register a very large buffer
> and do IO with a small subchunk of it, even if that "small" is 4G,
> but it still needs to be addressed. I think we need at least an order
> of magnitude or two more space for it to last for a bit.
> 
> Can it steal bits from IOU_BUF_REGBUF_BITS?

As mentioned, we can definitely move the one bit, which would bring us
to 31/33, and ~2GB IO size (max in linux) and ~8G of offset. That can be
done without having any tradeoffs. Beyond that, we're starting to
limit the transfer size, eg it's a tradeoff of allowing more offset (and
hence bigger registered regions) vs transfer size. You could probably
argue that 1G would be fine, and hence make it 30/34, which would allow
16GB registered buffers. Just unsure if it's worth it, as neither of
those would allow really huge registered buffers - and does it matter if
your buffer registrations are chunked at 8G vs 16G? Probably not.

-- 
Jens Axboe


* Re: [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers
  2024-10-24 17:16         ` Jens Axboe
@ 2024-10-24 18:20           ` Pavel Begunkov
  2024-10-24 19:53             ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Pavel Begunkov @ 2024-10-24 18:20 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 10/24/24 18:16, Jens Axboe wrote:
> On 10/24/24 10:17 AM, Pavel Begunkov wrote:
>> On 10/24/24 16:57, Jens Axboe wrote:
>>> On 10/24/24 9:44 AM, Pavel Begunkov wrote:
>>>> On 10/23/24 17:07, Jens Axboe wrote:
>>>>> This just adds the necessary shifts that define what a provided buffer
>>>>> that is merely an index into a registered buffer looks like. A provided
>>>>> buffer looks like the following:
>>>>>
>>>>> struct io_uring_buf {
>>>>>       __u64    addr;
>>>>>       __u32    len;
>>>>>       __u16    bid;
>>>>>       __u16    resv;
>>>>> };
>>>>>
>>>>> where 'addr' holds a userspace address, 'len' is the length of the
>>>>> buffer, and 'bid' is the buffer ID identifying the buffer. This works
>>>>> fine for a virtual address, but it cannot be used efficiently denote
>>>>> a registered buffer. Registered buffers are pre-mapped into the kernel
>>>>> for more efficient IO, avoiding a get_user_pages() and page(s) inc+dec,
>>>>> and are used for things like O_DIRECT on storage and zero copy send.
>>>>>
>>>>> Particularly for the send case, it'd be useful to support a mix of
>>>>> provided and registered buffers. This enables the use of using a
>>>>> provided ring buffer to serialize sends, and also enables the use of
>>>>> send bundles, where a send can pick multiple buffers and send them all
>>>>> at once.
>>>>>
>>>>> If provided buffers are used as an index into registered buffers, the
>>>>> meaning of buf->addr changes. If registered buffer index 'regbuf_index'
>>>>> is desired, with a length of 'len' and the offset 'regbuf_offset' from
>>>>> the start of the buffer, then the application would fill out the entry
>>>>> as follows:
>>>>>
>>>>> buf->addr = ((__u64) regbuf_offset << IOU_BUF_OFFSET_BITS) | regbuf_index;
>>>>> buf->len = len;
>>>>>
>>>>> and otherwise add it to the buffer ring as usual. The kernel will then
>>>>> first pick a buffer from the desired buffer group ID, and then decode
>>>>> which registered buffer to use for the transfer.
>>>>>
>>>>> This provides a way to use both registered and provided buffers at the
>>>>> same time.
>>>>>
>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>> ---
>>>>>     include/uapi/linux/io_uring.h | 8 ++++++++
>>>>>     1 file changed, 8 insertions(+)
>>>>>
>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>>> index 86cb385fe0b5..eef88d570cb4 100644
>>>>> --- a/include/uapi/linux/io_uring.h
>>>>> +++ b/include/uapi/linux/io_uring.h
>>>>> @@ -733,6 +733,14 @@ struct io_uring_buf_ring {
>>>>>         };
>>>>>     };
>>>>>     +/*
>>>>> + * When provided buffers are used as indices into registered buffers, the
>>>>> + * lower IOU_BUF_REGBUF_BITS indicate the index into the registered buffers,
>>>>> + * and the upper IOU_BUF_OFFSET_BITS indicate the offset into that buffer.
>>>>> + */
>>>>> +#define IOU_BUF_REGBUF_BITS    (32ULL)
>>>>> +#define IOU_BUF_OFFSET_BITS    (32ULL)
>>>>
>>>> 32 bit is fine for IO size but not enough to store offsets, it
>>>> can only address under 4GB registered buffers.
>>>
>>> I did think about that - at least as it stands, registered buffers are
>>> limited to 1GB in size. That's how it's been since that got added. Now,
>>> for the future, we may obviously lift that limitation, and yeah then
>>> 32-bits would not necessarily be enough for the offset.
>>
>> Right, and I don't think it's unreasonable considering with how
>> much memory systems have nowadays, and we think that one large
>> registered buffer is a good thing.
> 
> Agree - but at the same time, not a big hardship to chunk up the region
> into 8G chunks rather than allow, eg, a 64G region. Would be nice if it
> wasn't a requirement, but unsure how to make that work otherwise.
> 
> And not a lot of complaints on having 1G be the size, even from the
> varnish side where they register hundreds of gigs of memory.
> 
>>> For linux, the max read/write value has always been INT_MAX & PAGE_MASK,
>>> so we could make do with 31 bits for the size, which would bump the
>>> offset to 33-bits, or 8G. That'd leave enough room for, at least, 8G
>>> buffers, or 8x what we support now. Which is probably fine, you'd just
>>> split your buffer registrations into 8G chunks, if you want to register
>>> more than 8G of memory.
>>
>> That's why I mentioned IO size, you can register a very large buffer
>> and do IO with a small subchunk of it, even if that "small" is 4G,
>> but it still needs to be addressed. I think we need at least an order
>> of magnitude or two more space for it to last for a bit.
>>
>> Can it steal bits from IOU_BUF_REGBUF_BITS?
> 
> As mentooned, we can definitely move the one bit, which would bring is
> to 31/33, and ~2GB IO size (max in linux) and ~8G of offset. That can be
> done without having any tradeoffs. Beyond that, and we're starting to
> limit the transfer size, eg it's tradeoff of allowing more offset (and
> hence bigger registered regions) vs transfer size. You could probably
> argue that 1G would be fine, and hence make it 30/34, which would allow
> 16GB registered buffers. Just unsure if it's worth it, as neither of
> those would allow really huge registered buffers - and does it matter if
> your buffer registrations are chunked at 8G vs 16G? Probably not.

6/7 packs the offset and the reg buffer index into the same u64,
not the len. I don't see how it affects the len

idx = addr & ((1ULL << IOU_BUF_REGBUF_BITS) - 1);
addr >>= IOU_BUF_REGBUF_BITS;
*offset = addr  & ((1ULL << IOU_BUF_OFFSET_BITS) - 1);

So the tradeoff is with the max size of the registered
buffer table. I doubt you need 2^32, if each is at least
4KB, it's at least 16TB.

-- 
Pavel Begunkov


* Re: [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers
  2024-10-24 18:20           ` Pavel Begunkov
@ 2024-10-24 19:53             ` Jens Axboe
  2024-10-24 22:46               ` Jens Axboe
  0 siblings, 1 reply; 26+ messages in thread
From: Jens Axboe @ 2024-10-24 19:53 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 10/24/24 12:20 PM, Pavel Begunkov wrote:
> On 10/24/24 18:16, Jens Axboe wrote:
>> On 10/24/24 10:17 AM, Pavel Begunkov wrote:
>>> On 10/24/24 16:57, Jens Axboe wrote:
>>>> On 10/24/24 9:44 AM, Pavel Begunkov wrote:
>>>>> On 10/23/24 17:07, Jens Axboe wrote:
>>>>>> This just adds the necessary shifts that define what a provided buffer
>>>>>> that is merely an index into a registered buffer looks like. A provided
>>>>>> buffer looks like the following:
>>>>>>
>>>>>> struct io_uring_buf {
>>>>>>       __u64    addr;
>>>>>>       __u32    len;
>>>>>>       __u16    bid;
>>>>>>       __u16    resv;
>>>>>> };
>>>>>>
>>>>>> where 'addr' holds a userspace address, 'len' is the length of the
>>>>>> buffer, and 'bid' is the buffer ID identifying the buffer. This works
>>>>>> fine for a virtual address, but it cannot be used efficiently denote
>>>>>> a registered buffer. Registered buffers are pre-mapped into the kernel
>>>>>> for more efficient IO, avoiding a get_user_pages() and page(s) inc+dec,
>>>>>> and are used for things like O_DIRECT on storage and zero copy send.
>>>>>>
>>>>>> Particularly for the send case, it'd be useful to support a mix of
>>>>>> provided and registered buffers. This enables the use of using a
>>>>>> provided ring buffer to serialize sends, and also enables the use of
>>>>>> send bundles, where a send can pick multiple buffers and send them all
>>>>>> at once.
>>>>>>
>>>>>> If provided buffers are used as an index into registered buffers, the
>>>>>> meaning of buf->addr changes. If registered buffer index 'regbuf_index'
>>>>>> is desired, with a length of 'len' and the offset 'regbuf_offset' from
>>>>>> the start of the buffer, then the application would fill out the entry
>>>>>> as follows:
>>>>>>
>>>>>> buf->addr = ((__u64) regbuf_offset << IOU_BUF_OFFSET_BITS) | regbuf_index;
>>>>>> buf->len = len;
>>>>>>
>>>>>> and otherwise add it to the buffer ring as usual. The kernel will then
>>>>>> first pick a buffer from the desired buffer group ID, and then decode
>>>>>> which registered buffer to use for the transfer.
>>>>>>
>>>>>> This provides a way to use both registered and provided buffers at the
>>>>>> same time.
>>>>>>
>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>> ---
>>>>>>     include/uapi/linux/io_uring.h | 8 ++++++++
>>>>>>     1 file changed, 8 insertions(+)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>>>> index 86cb385fe0b5..eef88d570cb4 100644
>>>>>> --- a/include/uapi/linux/io_uring.h
>>>>>> +++ b/include/uapi/linux/io_uring.h
>>>>>> @@ -733,6 +733,14 @@ struct io_uring_buf_ring {
>>>>>>         };
>>>>>>     };
>>>>>>     +/*
>>>>>> + * When provided buffers are used as indices into registered buffers, the
>>>>>> + * lower IOU_BUF_REGBUF_BITS indicate the index into the registered buffers,
>>>>>> + * and the upper IOU_BUF_OFFSET_BITS indicate the offset into that buffer.
>>>>>> + */
>>>>>> +#define IOU_BUF_REGBUF_BITS    (32ULL)
>>>>>> +#define IOU_BUF_OFFSET_BITS    (32ULL)
>>>>>
>>>>> 32 bit is fine for IO size but not enough to store offsets, it
>>>>> can only address under 4GB registered buffers.
>>>>
>>>> I did think about that - at least as it stands, registered buffers are
>>>> limited to 1GB in size. That's how it's been since that got added. Now,
>>>> for the future, we may obviously lift that limitation, and yeah then
>>>> 32-bits would not necessarily be enough for the offset.
>>>
>>> Right, and I don't think it's unreasonable considering how
>>> much memory systems have nowadays, and we think that one large
>>> registered buffer is a good thing.
>>
>> Agree - but at the same time, not a big hardship to chunk up the region
>> into 8G chunks rather than allow, eg, a 64G region. Would be nice if it
>> wasn't a requirement, but unsure how to make that work otherwise.
>>
>> And not a lot of complaints on having 1G be the size, even from the
>> varnish side where they register hundreds of gigs of memory.
>>
>>>> For linux, the max read/write value has always been INT_MAX & PAGE_MASK,
>>>> so we could make do with 31 bits for the size, which would bump the
>>>> offset to 33-bits, or 8G. That'd leave enough room for, at least, 8G
>>>> buffers, or 8x what we support now. Which is probably fine, you'd just
>>>> split your buffer registrations into 8G chunks, if you want to register
>>>> more than 8G of memory.
>>>
>>> That's why I mentioned IO size, you can register a very large buffer
>>> and do IO with a small subchunk of it, even if that "small" is 4G,
>>> but it still needs to be addressed. I think we need at least an order
>>> of magnitude or two more space for it to last for a bit.
>>>
>>> Can it steal bits from IOU_BUF_REGBUF_BITS?
>>
>> As mentioned, we can definitely move the one bit, which would bring us
>> to 31/33, and ~2GB IO size (max in linux) and ~8G of offset. That can be
>> done without any tradeoffs. Beyond that, we're starting to
>> limit the transfer size, eg it's a tradeoff of allowing more offset (and
>> hence bigger registered regions) vs transfer size. You could probably
>> argue that 1G would be fine, and hence make it 30/34, which would allow
>> 16GB registered buffers. Just unsure if it's worth it, as neither of
>> those would allow really huge registered buffers - and does it matter if
>> your buffer registrations are chunked at 8G vs 16G? Probably not.
> 
> 6/7 packs offset and the reg buffer index into the same u64,
> not the len. I don't see how it affects the len
> 
> idx = addr & ((1ULL << IOU_BUF_REGBUF_BITS) - 1);
> addr >>= IOU_BUF_REGBUF_BITS;
> *offset = addr  & ((1ULL << IOU_BUF_OFFSET_BITS) - 1);
> 
> So the tradeoff is with the max size of the registered
> buffer table. I doubt you need 2^32, if each is at least
> 4KB, it's at least 16TB.

Ah yes, nevermind, the len is of course separate. Yes indeed that falls
out nicely then, we can just reserve eg 16 bits for registered buffers.
And that leaves 48 bits for the offset, which would hold more than
enough. Even at that and 8G max buffer size, that'd be half a PB of
buffer space. And there's no reason we can't just increase the
registered buffer size on 64-bit to much beyond that. Maybe play it safe
and set aside 20 bits for the buffer index, and that leaves 44 bits for
the offset? That's 16TB of addressable space per registered buffer.
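
To make that concrete, here's a rough sketch of what a 20/44 split would
look like, encode on the app side and decode on the kernel side. The
constants and helper names below are just illustration (patch 5 as posted
uses 32/32), and the decode follows the same shape as the 6/7 snippet you
quoted:

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical 20/44 split - patch 5 as posted defines 32/32 */
#define IOU_BUF_REGBUF_BITS	20ULL
#define IOU_BUF_OFFSET_BITS	44ULL

/* app side: pack registered buffer index + offset into buf->addr */
static uint64_t iou_buf_encode(uint64_t idx, uint64_t offset)
{
	return (offset << IOU_BUF_REGBUF_BITS) | idx;
}

/* kernel side: unpack index first, then the offset, as in 6/7 */
static void iou_buf_decode(uint64_t addr, uint64_t *idx, uint64_t *offset)
{
	*idx = addr & ((1ULL << IOU_BUF_REGBUF_BITS) - 1);
	addr >>= IOU_BUF_REGBUF_BITS;
	*offset = addr & ((1ULL << IOU_BUF_OFFSET_BITS) - 1);
}
```

A 6G offset round-trips fine through that, which the 32/32 split could
not express at all.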

What do you think?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers
  2024-10-24 19:53             ` Jens Axboe
@ 2024-10-24 22:46               ` Jens Axboe
  0 siblings, 0 replies; 26+ messages in thread
From: Jens Axboe @ 2024-10-24 22:46 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 10/24/24 1:53 PM, Jens Axboe wrote:
> On 10/24/24 12:20 PM, Pavel Begunkov wrote:
>> On 10/24/24 18:16, Jens Axboe wrote:
>>> On 10/24/24 10:17 AM, Pavel Begunkov wrote:
>>>> On 10/24/24 16:57, Jens Axboe wrote:
>>>>> On 10/24/24 9:44 AM, Pavel Begunkov wrote:
>>>>>> On 10/23/24 17:07, Jens Axboe wrote:
>>>>>>> This just adds the necessary shifts that define what a provided buffer
>>>>>>> that is merely an index into a registered buffer looks like. A provided
>>>>>>> buffer looks like the following:
>>>>>>>
>>>>>>> struct io_uring_buf {
>>>>>>>       __u64    addr;
>>>>>>>       __u32    len;
>>>>>>>       __u16    bid;
>>>>>>>       __u16    resv;
>>>>>>> };
>>>>>>>
>>>>>>> where 'addr' holds a userspace address, 'len' is the length of the
>>>>>>> buffer, and 'bid' is the buffer ID identifying the buffer. This works
>>>>>>> fine for a virtual address, but it cannot be used to efficiently denote
>>>>>>> a registered buffer. Registered buffers are pre-mapped into the kernel
>>>>>>> for more efficient IO, avoiding a get_user_pages() and page(s) inc+dec,
>>>>>>> and are used for things like O_DIRECT on storage and zero copy send.
>>>>>>>
>>>>>>> Particularly for the send case, it'd be useful to support a mix of
>>>>>>> provided and registered buffers. This enables the use of a
>>>>>>> provided ring buffer to serialize sends, and also enables the use of
>>>>>>> send bundles, where a send can pick multiple buffers and send them all
>>>>>>> at once.
>>>>>>>
>>>>>>> If provided buffers are used as an index into registered buffers, the
>>>>>>> meaning of buf->addr changes. If registered buffer index 'regbuf_index'
>>>>>>> is desired, with a length of 'len' and the offset 'regbuf_offset' from
>>>>>>> the start of the buffer, then the application would fill out the entry
>>>>>>> as follows:
>>>>>>>
>>>>>>> buf->addr = ((__u64) regbuf_offset << IOU_BUF_OFFSET_BITS) | regbuf_index;
>>>>>>> buf->len = len;
>>>>>>>
>>>>>>> and otherwise add it to the buffer ring as usual. The kernel will then
>>>>>>> first pick a buffer from the desired buffer group ID, and then decode
>>>>>>> which registered buffer to use for the transfer.
>>>>>>>
>>>>>>> This provides a way to use both registered and provided buffers at the
>>>>>>> same time.
>>>>>>>
>>>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>>> ---
>>>>>>>     include/uapi/linux/io_uring.h | 8 ++++++++
>>>>>>>     1 file changed, 8 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>>>>> index 86cb385fe0b5..eef88d570cb4 100644
>>>>>>> --- a/include/uapi/linux/io_uring.h
>>>>>>> +++ b/include/uapi/linux/io_uring.h
>>>>>>> @@ -733,6 +733,14 @@ struct io_uring_buf_ring {
>>>>>>>         };
>>>>>>>     };
>>>>>>>     +/*
>>>>>>> + * When provided buffers are used as indices into registered buffers, the
>>>>>>> + * lower IOU_BUF_REGBUF_BITS indicate the index into the registered buffers,
>>>>>>> + * and the upper IOU_BUF_OFFSET_BITS indicate the offset into that buffer.
>>>>>>> + */
>>>>>>> +#define IOU_BUF_REGBUF_BITS    (32ULL)
>>>>>>> +#define IOU_BUF_OFFSET_BITS    (32ULL)
>>>>>>
>>>>>> 32 bit is fine for IO size but not enough to store offsets, it
>>>>>> can only address under 4GB registered buffers.
>>>>>
>>>>> I did think about that - at least as it stands, registered buffers are
>>>>> limited to 1GB in size. That's how it's been since that got added. Now,
>>>>> for the future, we may obviously lift that limitation, and yeah then
>>>>> 32-bits would not necessarily be enough for the offset.
>>>>
>>>> Right, and I don't think it's unreasonable considering how
>>>> much memory systems have nowadays, and we think that one large
>>>> registered buffer is a good thing.
>>>
>>> Agree - but at the same time, not a big hardship to chunk up the region
>>> into 8G chunks rather than allow, eg, a 64G region. Would be nice if it
>>> wasn't a requirement, but unsure how to make that work otherwise.
>>>
>>> And not a lot of complaints on having 1G be the size, even from the
>>> varnish side where they register hundreds of gigs of memory.
>>>
>>>>> For linux, the max read/write value has always been INT_MAX & PAGE_MASK,
>>>>> so we could make do with 31 bits for the size, which would bump the
>>>>> offset to 33-bits, or 8G. That'd leave enough room for, at least, 8G
>>>>> buffers, or 8x what we support now. Which is probably fine, you'd just
>>>>> split your buffer registrations into 8G chunks, if you want to register
>>>>> more than 8G of memory.
>>>>
>>>> That's why I mentioned IO size, you can register a very large buffer
>>>> and do IO with a small subchunk of it, even if that "small" is 4G,
>>>> but it still needs to be addressed. I think we need at least an order
>>>> of magnitude or two more space for it to last for a bit.
>>>>
>>>> Can it steal bits from IOU_BUF_REGBUF_BITS?
>>>
>>> As mentioned, we can definitely move the one bit, which would bring us
>>> to 31/33, and ~2GB IO size (max in linux) and ~8G of offset. That can be
>>> done without any tradeoffs. Beyond that, we're starting to
>>> limit the transfer size, eg it's a tradeoff of allowing more offset (and
>>> hence bigger registered regions) vs transfer size. You could probably
>>> argue that 1G would be fine, and hence make it 30/34, which would allow
>>> 16GB registered buffers. Just unsure if it's worth it, as neither of
>>> those would allow really huge registered buffers - and does it matter if
>>> your buffer registrations are chunked at 8G vs 16G? Probably not.
>>
>> 6/7 packs offset and the reg buffer index into the same u64,
>> not the len. I don't see how it affects the len
>>
>> idx = addr & ((1ULL << IOU_BUF_REGBUF_BITS) - 1);
>> addr >>= IOU_BUF_REGBUF_BITS;
>> *offset = addr  & ((1ULL << IOU_BUF_OFFSET_BITS) - 1);
>>
>> So the tradeoff is with the max size of the registered
>> buffer table. I doubt you need 2^32, if each is at least
>> 4KB, it's at least 16TB.
> 
> Ah yes, nevermind, the len is of course separate. Yes indeed that falls
> out nicely then, we can just reserve eg 16 bits for registered buffers.
> And that leaves 48 bits for the offset, which would hold more than
> enough. Even at that and 8G max buffer size, that'd be half a PB of
> buffer space. And there's no reason we can't just increase the
> registered buffer size on 64-bit to much beyond that. Maybe play it safe
> and set aside 20 bits for the buffer index, and that leaves 44 bits for
> the offset? That's 16TB of addressable space per registered buffer.
> 
> What do you think?

I changed it to 20/44 as that seems to be the best split. It allows much
larger registered buffers, and 1M of them. That should be plenty.
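
On the app side, filling out a ring entry with the 20/44 split would be
along these lines. The struct mirrors io_uring_buf from the uapi header,
but the fill helper and its name are purely illustrative, not a proposed
liburing API:

```c
#include <assert.h>
#include <stdint.h>

/* same layout as struct io_uring_buf in include/uapi/linux/io_uring.h */
struct io_uring_buf {
	uint64_t addr;
	uint32_t len;
	uint16_t bid;
	uint16_t resv;
};

#define IOU_BUF_REGBUF_BITS	20ULL	/* hypothetical 20/44 split */

/*
 * Point a provided buffer ring entry at 'len' bytes starting at
 * 'offset' into registered buffer 'idx', rather than at a virtual
 * address as a normal provided buffer would.
 */
static void iou_buf_fill_reg(struct io_uring_buf *buf, uint64_t idx,
			     uint64_t offset, uint32_t len, uint16_t bid)
{
	buf->addr = (offset << IOU_BUF_REGBUF_BITS) | idx;
	buf->len = len;
	buf->bid = bid;
	buf->resv = 0;
}
```

Everything else about adding the entry to the buffer ring stays exactly
as it is today, which is the nice part.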

Might even make sense to just up the registered buffer size at this
point too as a separate change, there's no reason for it to be limited
to 1G. We can keep 1G on 32-bit as it could only go to 4G-1 on that
anyway. But for 64-bit, we can bump it higher.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-10-24 22:46 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-10-23 16:07 [PATCHSET RFC 0/7] Add support for provided registered buffers Jens Axboe
2024-10-23 16:07 ` [PATCH 1/7] io_uring/kbuf: mark buf_sel_arg mode as KBUF_MODE_FREE once allocated Jens Axboe
2024-10-23 16:07 ` [PATCH 2/7] io_uring/kbuf: change io_provided_buffers_select() calling convention Jens Axboe
2024-10-23 16:07 ` [PATCH 3/7] io_uring/net: abstract out io_send_import() helper Jens Axboe
2024-10-23 16:07 ` [PATCH 4/7] io_uring/net: move send zc fixed buffer import into helper Jens Axboe
2024-10-23 16:07 ` [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers Jens Axboe
2024-10-24 15:44   ` Pavel Begunkov
2024-10-24 15:57     ` Jens Axboe
2024-10-24 16:17       ` Pavel Begunkov
2024-10-24 17:16         ` Jens Axboe
2024-10-24 18:20           ` Pavel Begunkov
2024-10-24 19:53             ` Jens Axboe
2024-10-24 22:46               ` Jens Axboe
2024-10-23 16:07 ` [PATCH 6/7] io_uring/kbuf: add support for mapping type KBUF_MODE_BVEC Jens Axboe
2024-10-24 15:22   ` Pavel Begunkov
2024-10-24 15:27     ` Jens Axboe
2024-10-24 15:40       ` Pavel Begunkov
2024-10-24 15:49         ` Jens Axboe
2024-10-23 16:07 ` [PATCH 7/7] io_uring/net: add provided buffer and bundle support to send zc Jens Axboe
2024-10-24 14:44   ` Pavel Begunkov
2024-10-24 14:48     ` Jens Axboe
2024-10-24 15:36       ` Pavel Begunkov
2024-10-24 14:36 ` [PATCHSET RFC 0/7] Add support for provided registered buffers Pavel Begunkov
2024-10-24 14:43   ` Jens Axboe
2024-10-24 15:04     ` Pavel Begunkov
2024-10-24 15:11       ` Jens Axboe
