public inbox for io-uring@vger.kernel.org
 help / color / mirror / Atom feed
From: Pavel Begunkov <asml.silence@gmail.com>
To: Jens Axboe <axboe@kernel.dk>, io-uring@vger.kernel.org
Subject: Re: [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers
Date: Thu, 24 Oct 2024 19:20:33 +0100	[thread overview]
Message-ID: <a572428b-2f31-4a3e-975c-8595fbea7e54@gmail.com> (raw)
In-Reply-To: <2be64142-0d6d-4018-b99b-343350a5fb08@kernel.dk>

On 10/24/24 18:16, Jens Axboe wrote:
> On 10/24/24 10:17 AM, Pavel Begunkov wrote:
>> On 10/24/24 16:57, Jens Axboe wrote:
>>> On 10/24/24 9:44 AM, Pavel Begunkov wrote:
>>>> On 10/23/24 17:07, Jens Axboe wrote:
>>>>> This just adds the necessary shifts that define what a provided buffer
>>>>> that is merely an index into a registered buffer looks like. A provided
>>>>> buffer looks like the following:
>>>>>
>>>>> struct io_uring_buf {
>>>>>       __u64    addr;
>>>>>       __u32    len;
>>>>>       __u16    bid;
>>>>>       __u16    resv;
>>>>> };
>>>>>
>>>>> where 'addr' holds a userspace address, 'len' is the length of the
>>>>> buffer, and 'bid' is the buffer ID identifying the buffer. This works
>>>>> fine for a virtual address, but it cannot be used efficiently denote
>>>>> a registered buffer. Registered buffers are pre-mapped into the kernel
>>>>> for more efficient IO, avoiding a get_user_pages() and page(s) inc+dec,
>>>>> and are used for things like O_DIRECT on storage and zero copy send.
>>>>>
>>>>> Particularly for the send case, it'd be useful to support a mix of
>>>>> provided and registered buffers. This enables the use of using a
>>>>> provided ring buffer to serialize sends, and also enables the use of
>>>>> send bundles, where a send can pick multiple buffers and send them all
>>>>> at once.
>>>>>
>>>>> If provided buffers are used as an index into registered buffers, the
>>>>> meaning of buf->addr changes. If registered buffer index 'regbuf_index'
>>>>> is desired, with a length of 'len' and the offset 'regbuf_offset' from
>>>>> the start of the buffer, then the application would fill out the entry
>>>>> as follows:
>>>>>
>>>>> buf->addr = ((__u64) regbuf_offset << IOU_BUF_OFFSET_BITS) | regbuf_index;
>>>>> buf->len = len;
>>>>>
>>>>> and otherwise add it to the buffer ring as usual. The kernel will then
>>>>> first pick a buffer from the desired buffer group ID, and then decode
>>>>> which registered buffer to use for the transfer.
>>>>>
>>>>> This provides a way to use both registered and provided buffers at the
>>>>> same time.
>>>>>
>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>> ---
>>>>>     include/uapi/linux/io_uring.h | 8 ++++++++
>>>>>     1 file changed, 8 insertions(+)
>>>>>
>>>>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>>>>> index 86cb385fe0b5..eef88d570cb4 100644
>>>>> --- a/include/uapi/linux/io_uring.h
>>>>> +++ b/include/uapi/linux/io_uring.h
>>>>> @@ -733,6 +733,14 @@ struct io_uring_buf_ring {
>>>>>         };
>>>>>     };
>>>>>     +/*
>>>>> + * When provided buffers are used as indices into registered buffers, the
>>>>> + * lower IOU_BUF_REGBUF_BITS indicate the index into the registered buffers,
>>>>> + * and the upper IOU_BUF_OFFSET_BITS indicate the offset into that buffer.
>>>>> + */
>>>>> +#define IOU_BUF_REGBUF_BITS    (32ULL)
>>>>> +#define IOU_BUF_OFFSET_BITS    (32ULL)
>>>>
>>>> 32 bit is fine for IO size but not enough to store offsets, it
>>>> can only address under 4GB registered buffers.
>>>
>>> I did think about that - at least as it stands, registered buffers are
>>> limited to 1GB in size. That's how it's been since that got added. Now,
>>> for the future, we may obviously lift that limitation, and yeah then
>>> 32-bits would not necessarily be enough for the offset.
>>
>> Right, and I don't think it's unreasonable considering with how
>> much memory systems have nowadays, and we think that one large
>> registered buffer is a good thing.
> 
> Agree - but at the same time, not a big hardship to chunk up the region
> into 8G chunks rather than allow, eg, a 64G region. Would be nice if it
> wasn't a requirement, but unsure how to make that work otherwise.
> 
> And not a lot of complaints on having 1G be the size, even from the
> varnish side where they register hundreds of gigs of memory.
> 
>>> For linux, the max read/write value has always been INT_MAX & PAGE_MASK,
>>> so we could make do with 31 bits for the size, which would bump the
>>> offset to 33-bits, or 8G. That'd leave enough room for, at least, 8G
>>> buffers, or 8x what we support now. Which is probably fine, you'd just
>>> split your buffer registrations into 8G chunks, if you want to register
>>> more than 8G of memory.
>>
>> That's why I mentioned IO size, you can register a very large buffer
>> and do IO with a small subchunk of it, even if that "small" is 4G,
>> but it still needs to be addressed. I think we need at least an order
>> of magnitude or two more space for it to last for a bit.
>>
>> Can it steal bits from IOU_BUF_REGBUF_BITS?
> 
> As mentooned, we can definitely move the one bit, which would bring is
> to 31/33, and ~2GB IO size (max in linux) and ~8G of offset. That can be
> done without having any tradeoffs. Beyond that, and we're starting to
> limit the transfer size, eg it's tradeoff of allowing more offset (and
> hence bigger registered regions) vs transfer size. You could probably
> argue that 1G would be fine, and hence make it 30/34, which would allow
> 16GB registered buffers. Just unsure if it's worth it, as neither of
> those would allow really huge registered buffers - and does it matter if
> your buffer registrations are chunked at 8G vs 16G? Probably not.

6/7 packs offset and the reg buffer index into the same u64,
not the len. I'm don't see how it affects the len

idx = addr & ((1ULL << IOU_BUF_REGBUF_BITS) - 1);
addr >>= IOU_BUF_REGBUF_BITS;
*offset = addr  & ((1ULL << IOU_BUF_OFFSET_BITS) - 1);

So the tradeoff is with the max size of the registered
buffer table. I doubt you need 2^32, if each is at least
4KB, it's at least 16TB.

-- 
Pavel Begunkov

  reply	other threads:[~2024-10-24 18:20 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-23 16:07 [PATCHSET RFC 0/7] Add support for provided registered buffers Jens Axboe
2024-10-23 16:07 ` [PATCH 1/7] io_uring/kbuf: mark buf_sel_arg mode as KBUF_MODE_FREE once allocated Jens Axboe
2024-10-23 16:07 ` [PATCH 2/7] io_uring/kbuf: change io_provided_buffers_select() calling convention Jens Axboe
2024-10-23 16:07 ` [PATCH 3/7] io_uring/net: abstract out io_send_import() helper Jens Axboe
2024-10-23 16:07 ` [PATCH 4/7] io_uring/net: move send zc fixed buffer import into helper Jens Axboe
2024-10-23 16:07 ` [PATCH 5/7] io_uring: add ability for provided buffer to index registered buffers Jens Axboe
2024-10-24 15:44   ` Pavel Begunkov
2024-10-24 15:57     ` Jens Axboe
2024-10-24 16:17       ` Pavel Begunkov
2024-10-24 17:16         ` Jens Axboe
2024-10-24 18:20           ` Pavel Begunkov [this message]
2024-10-24 19:53             ` Jens Axboe
2024-10-24 22:46               ` Jens Axboe
2024-10-23 16:07 ` [PATCH 6/7] io_uring/kbuf: add support for mapping type KBUF_MODE_BVEC Jens Axboe
2024-10-24 15:22   ` Pavel Begunkov
2024-10-24 15:27     ` Jens Axboe
2024-10-24 15:40       ` Pavel Begunkov
2024-10-24 15:49         ` Jens Axboe
2024-10-23 16:07 ` [PATCH 7/7] io_uring/net: add provided buffer and bundle support to send zc Jens Axboe
2024-10-24 14:44   ` Pavel Begunkov
2024-10-24 14:48     ` Jens Axboe
2024-10-24 15:36       ` Pavel Begunkov
2024-10-24 14:36 ` [PATCHSET RFC 0/7] Add support for provided registered buffers Pavel Begunkov
2024-10-24 14:43   ` Jens Axboe
2024-10-24 15:04     ` Pavel Begunkov
2024-10-24 15:11       ` Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=a572428b-2f31-4a3e-975c-8595fbea7e54@gmail.com \
    --to=asml.silence@gmail.com \
    --cc=axboe@kernel.dk \
    --cc=io-uring@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox