public inbox for io-uring@vger.kernel.org
From: Pavel Begunkov <asml.silence@gmail.com>
To: Jens Axboe <axboe@kernel.dk>, io-uring@vger.kernel.org
Subject: Re: [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers
Date: Thu, 24 Oct 2024 17:17:37 +0100
Message-ID: <c51938c8-8bb4-44d1-8394-14aeebd58ba2@gmail.com>
In-Reply-To: <c44ef9b3-bea7-45f5-b050-9c74ff1e0344@kernel.dk>

On 10/24/24 16:57, Jens Axboe wrote:
> On 10/24/24 9:44 AM, Pavel Begunkov wrote:
>> On 10/23/24 17:07, Jens Axboe wrote:
>>> This just adds the necessary shifts that define what a provided buffer
>>> that is merely an index into a registered buffer looks like. A provided
>>> buffer looks like the following:
>>>
>>> struct io_uring_buf {
>>>      __u64    addr;
>>>      __u32    len;
>>>      __u16    bid;
>>>      __u16    resv;
>>> };
>>>
>>> where 'addr' holds a userspace address, 'len' is the length of the
>>> buffer, and 'bid' is the buffer ID identifying the buffer. This works
>>> fine for a virtual address, but it cannot be used to efficiently denote
>>> a registered buffer. Registered buffers are pre-mapped into the kernel
>>> for more efficient IO, avoiding a get_user_pages() and page(s) inc+dec,
>>> and are used for things like O_DIRECT on storage and zero copy send.
>>>
>>> Particularly for the send case, it'd be useful to support a mix of
>>> provided and registered buffers. This enables using a
>>> provided ring buffer to serialize sends, and also enables the use of
>>> send bundles, where a send can pick multiple buffers and send them all
>>> at once.
>>>
>>> If provided buffers are used as an index into registered buffers, the
>>> meaning of buf->addr changes. If registered buffer index 'regbuf_index'
>>> is desired, with a length of 'len' and the offset 'regbuf_offset' from
>>> the start of the buffer, then the application would fill out the entry
>>> as follows:
>>>
>>> buf->addr = ((__u64) regbuf_offset << IOU_BUF_OFFSET_BITS) | regbuf_index;
>>> buf->len = len;
>>>
>>> and otherwise add it to the buffer ring as usual. The kernel will then
>>> first pick a buffer from the desired buffer group ID, and then decode
>>> which registered buffer to use for the transfer.
>>>
>>> This provides a way to use both registered and provided buffers at the
>>> same time.
>>>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>> ---
>>>    include/uapi/linux/io_uring.h | 8 ++++++++
>>>    1 file changed, 8 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>> index 86cb385fe0b5..eef88d570cb4 100644
>>> --- a/include/uapi/linux/io_uring.h
>>> +++ b/include/uapi/linux/io_uring.h
>>> @@ -733,6 +733,14 @@ struct io_uring_buf_ring {
>>>        };
>>>    };
>>>
>>> +/*
>>> + * When provided buffers are used as indices into registered buffers, the
>>> + * lower IOU_BUF_REGBUF_BITS indicate the index into the registered buffers,
>>> + * and the upper IOU_BUF_OFFSET_BITS indicate the offset into that buffer.
>>> + */
>>> +#define IOU_BUF_REGBUF_BITS    (32ULL)
>>> +#define IOU_BUF_OFFSET_BITS    (32ULL)
>>
>> 32 bits is fine for the IO size but not enough to store offsets; it
>> can only address registered buffers under 4GB.
> 
> I did think about that - at least as it stands, registered buffers are
> limited to 1GB in size. That's how it's been since that got added. Now,
> for the future, we may obviously lift that limitation, and yeah then
> 32-bits would not necessarily be enough for the offset.

Right, and I don't think that's unreasonable, considering how much
memory systems have nowadays and that we think one large registered
buffer is a good thing.

> For Linux, the max read/write value has always been INT_MAX & PAGE_MASK,
> so we could make do with 31 bits for the size, which would bump the
> offset to 33-bits, or 8G. That'd leave enough room for, at least, 8G
> buffers, or 8x what we support now. Which is probably fine, you'd just
> split your buffer registrations into 8G chunks, if you want to register
> more than 8G of memory.

That's why I mentioned IO size: you can register a very large buffer
and do IO on a small subchunk of it, even if that "small" is 4G,
but the offset of that subchunk still needs to be addressable. I think
we need at least an order of magnitude or two more space for it to
last for a bit.
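
To make the limit concrete, here is a minimal sketch (not part of the
patch) of how far an offset field of a given width can reach. The helper
names iou_offset_span() and iou_offset_fits() are made up for
illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: number of bytes that an
 * offset field 'bits' wide can address. Only valid for bits < 64.
 */
static inline uint64_t iou_offset_span(unsigned int bits)
{
	assert(bits < 64);
	return 1ULL << bits;
}

/*
 * Hypothetical helper, not from the patch: true if 'offset' can be
 * stored in an offset field that is 'bits' wide.
 */
static inline bool iou_offset_fits(uint64_t offset, unsigned int bits)
{
	return offset < iou_offset_span(bits);
}
```

With these, iou_offset_span(32) is 4G and iou_offset_span(33) is 8G, and
an offset of exactly 4G into a large registered buffer does not fit in
32 bits, no matter how small the IO itself is.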

Can it steal bits from IOU_BUF_REGBUF_BITS?
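
As a rough sketch of what I mean, assuming a 16/48 split purely for
illustration (the EX_* names and the ex_*() helpers are hypothetical,
nothing here is in the patch, and the right index width is still open):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split, NOT from the patch: 16 index bits, 48 offset bits. */
#define EX_REGBUF_BITS	16ULL
#define EX_OFFSET_BITS	(64ULL - EX_REGBUF_BITS)
#define EX_REGBUF_MASK	((1ULL << EX_REGBUF_BITS) - 1)

/* Pack offset into the upper bits and the regbuf index into the lower bits. */
static inline uint64_t ex_pack_addr(uint64_t offset, uint32_t index)
{
	assert(offset < (1ULL << EX_OFFSET_BITS));
	assert(index <= EX_REGBUF_MASK);
	return (offset << EX_REGBUF_BITS) | index;
}

/* Recover the registered buffer index from a packed value. */
static inline uint32_t ex_addr_index(uint64_t addr)
{
	return (uint32_t)(addr & EX_REGBUF_MASK);
}

/* Recover the offset into the registered buffer from a packed value. */
static inline uint64_t ex_addr_offset(uint64_t addr)
{
	return addr >> EX_REGBUF_BITS;
}
```

Note the shift here is by the index width, since the index sits in the
low bits. With the 32/32 layout the two widths are equal so it makes no
difference, but it would as soon as the split is uneven. A 48-bit offset
reaches 256T, which is well past the order of magnitude or two above.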

-- 
Pavel Begunkov
