* [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
@ 2025-06-09 12:14 Ming Lei
2025-06-09 22:29 ` Caleb Sander Mateos
0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2025-06-09 12:14 UTC (permalink / raw)
To: Jens Axboe, linux-block; +Cc: Uday Shankar, Caleb Sander Mateos, Ming Lei
Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)
diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index c368e1081b41..16ffca54eed4 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
parameter of `struct ublk_param_segment` with backend for avoiding
unnecessary IO split, which usually hurts io_uring performance.
+Auto Buffer Registration
+------------------------
+
+The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
+and unregistration for I/O requests, which simplifies the buffer management
+process and reduces overhead in the ublk server implementation.
+
+This is another feature flag for using zero copy, and it is compatible with
+``UBLK_F_SUPPORT_ZERO_COPY``.
+
+Feature Overview
+~~~~~~~~~~~~~~~~
+
+This feature automatically registers request buffers to the io_uring context
+before delivering I/O commands to the ublk server and unregisters them when
+completing I/O commands. This eliminates the need for manual buffer
+registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
+``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
+can avoid dependency on the two uring_cmd operations.
+
+This way not only simplifies ublk server implementation, but also makes
+concurrent IO handling becomes possible.
+
+Usage Requirements
+~~~~~~~~~~~~~~~~~~
+
+1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
+ used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
+
+2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
+ unregistration is required.
+
+3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
+ following structure::
+
+ struct ublk_auto_buf_reg {
+ __u16 index; /* Buffer index for registration */
+ __u8 flags; /* Registration flags */
+ __u8 reserved0; /* Reserved for future use */
+ __u32 reserved1; /* Reserved for future use */
+ };
+
+4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
+
+5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
+
+Fallback Behavior
+~~~~~~~~~~~~~~~~~
+
+When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
+
+1. If auto buffer registration fails:
+ - The uring_cmd is completed
+ - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
+ - The ublk server must manually register the buffer
+
+2. If fallback is not enabled:
+ - The ublk I/O request fails silently
+
+Limitations
+~~~~~~~~~~~
+
+- Requires same ``io_ring_ctx`` for all operations
+- May require manual buffer management in fallback cases
+- Reserved fields must be zeroed for future compatibility
+
+
References
==========
--
2.47.0
^ permalink raw reply related [flat|nested] 10+ messages in thread
* Re: [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
2025-06-09 12:14 [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG) Ming Lei
@ 2025-06-09 22:29 ` Caleb Sander Mateos
2025-06-10 2:06 ` Ming Lei
0 siblings, 1 reply; 10+ messages in thread
From: Caleb Sander Mateos @ 2025-06-09 22:29 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar
On Mon, Jun 9, 2025 at 5:14 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
Thanks, this is a nice explanation. Just a few suggestions.
> ---
> Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 67 insertions(+)
>
> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> index c368e1081b41..16ffca54eed4 100644
> --- a/Documentation/block/ublk.rst
> +++ b/Documentation/block/ublk.rst
> @@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
> parameter of `struct ublk_param_segment` with backend for avoiding
> unnecessary IO split, which usually hurts io_uring performance.
>
> +Auto Buffer Registration
> +------------------------
> +
> +The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
> +and unregistration for I/O requests, which simplifies the buffer management
> +process and reduces overhead in the ublk server implementation.
> +
> +This is another feature flag for using zero copy, and it is compatible with
> +``UBLK_F_SUPPORT_ZERO_COPY``.
> +
> +Feature Overview
> +~~~~~~~~~~~~~~~~
> +
> +This feature automatically registers request buffers to the io_uring context
> +before delivering I/O commands to the ublk server and unregisters them when
> +completing I/O commands. This eliminates the need for manual buffer
> +registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
> +``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
> +can avoid dependency on the two uring_cmd operations.
> +
> +This way not only simplifies ublk server implementation, but also makes
> +concurrent IO handling becomes possible.
I'm not sure what "concurrent IO handling" refers to. Any ublk server
can handle incoming I/O requests concurrently, regardless of what
features it has enabled. Do you mean it avoids the need for linked
io_uring requests to properly order buffer registration and
unregistration with the I/O operations using the registered buffer?
> +
> +Usage Requirements
> +~~~~~~~~~~~~~~~~~~
> +
> +1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
> + used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
> +
> +2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
> + unregistration is required.
nit: don't think this needs to be a separate point, could be combined with (1).
> +
> +3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
> + following structure::
nit: extra ":"
> +
> + struct ublk_auto_buf_reg {
> + __u16 index; /* Buffer index for registration */
> + __u8 flags; /* Registration flags */
> + __u8 reserved0; /* Reserved for future use */
> + __u32 reserved1; /* Reserved for future use */
> + };
Suggest using ublk_auto_buf_reg_to_sqe_addr()? Otherwise, it seems
ambiguous how this struct is "passed" in sqe->addr.
> +
> +4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
> +
> +5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
> +
> +Fallback Behavior
> +~~~~~~~~~~~~~~~~~
> +
> +When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> +
> +1. If auto buffer registration fails:
I would switch these. Both (1) and (2) refer to when auto buffer
registration fails. So I would expect something like:
If auto buffer registration fails:
1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
...
2. If fallback is not enabled:
...
> + - The uring_cmd is completed
Maybe add "without registering the request buffer"?
> + - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
> + - The ublk server must manually register the buffer
Only if it wants a registered buffer for the ublk request. Technically
the ublk server could decide to fall back on user-copy, for example.
> +
> +2. If fallback is not enabled:
> + - The ublk I/O request fails silently
"silently" is a bit ambiguous. It's certainly not silent to the
application submitting the ublk I/O. Maybe say that the ublk I/O
request fails and no uring_cmd is completed to the ublk server?
> +
> +Limitations
> +~~~~~~~~~~~
> +
> +- Requires same ``io_ring_ctx`` for all operations
Another limitation that prevents us from adopting the auto buffer
registration feature is the need to reserve a unique buffer table
index for every ublk tag on the io_ring_ctx. Since the io_ring_ctx
buffer table has a max size of 16K (could potentially be increased to
64K), this limit is easily reached when there are a large number of
ublk devices or the ublk queue depth is large. I think we could remove
this limitation in the future by adding support for allocating buffer
indices on demand, analogous to IORING_FILE_INDEX_ALLOC.
Best,
Caleb
> +- May require manual buffer management in fallback cases
> +- Reserved fields must be zeroed for future compatibility
> +
> +
> References
> ==========
>
> --
> 2.47.0
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
2025-06-09 22:29 ` Caleb Sander Mateos
@ 2025-06-10 2:06 ` Ming Lei
2025-06-11 15:54 ` Caleb Sander Mateos
0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2025-06-10 2:06 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, Uday Shankar
On Mon, Jun 09, 2025 at 03:29:34PM -0700, Caleb Sander Mateos wrote:
> On Mon, Jun 9, 2025 at 5:14 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
> >
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
>
> Thanks, this is a nice explanation. Just a few suggestions.
>
> > ---
> > Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
> > 1 file changed, 67 insertions(+)
> >
> > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > index c368e1081b41..16ffca54eed4 100644
> > --- a/Documentation/block/ublk.rst
> > +++ b/Documentation/block/ublk.rst
> > @@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
> > parameter of `struct ublk_param_segment` with backend for avoiding
> > unnecessary IO split, which usually hurts io_uring performance.
> >
> > +Auto Buffer Registration
> > +------------------------
> > +
> > +The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
> > +and unregistration for I/O requests, which simplifies the buffer management
> > +process and reduces overhead in the ublk server implementation.
> > +
> > +This is another feature flag for using zero copy, and it is compatible with
> > +``UBLK_F_SUPPORT_ZERO_COPY``.
> > +
> > +Feature Overview
> > +~~~~~~~~~~~~~~~~
> > +
> > +This feature automatically registers request buffers to the io_uring context
> > +before delivering I/O commands to the ublk server and unregisters them when
> > +completing I/O commands. This eliminates the need for manual buffer
> > +registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
> > +``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
> > +can avoid dependency on the two uring_cmd operations.
> > +
> > +This way not only simplifies ublk server implementation, but also makes
> > +concurrent IO handling becomes possible.
>
> I'm not sure what "concurrent IO handling" refers to. Any ublk server
> can handle incoming I/O requests concurrently, regardless of what
> features it has enabled. Do you mean it avoids the need for linked
> io_uring requests to properly order buffer registration and
> unregistration with the I/O operations using the registered buffer?
Yes, if io_uring OPs depends on buffer registering & unregistering, these
OPs can't be issued concurrently any more, that is one io_uring constraint.
I will add the above words.
>
> > +
> > +Usage Requirements
> > +~~~~~~~~~~~~~~~~~~
> > +
> > +1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
> > + used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
> > +
> > +2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
> > + unregistration is required.
>
> nit: don't think this needs to be a separate point, could be combined with (1).
OK.
>
> > +
> > +3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
> > + following structure::
>
> nit: extra ":"
In reStructuredText (reST), the double colon :: serves as a literal block marker to
indicate preformatted text.
>
> > +
> > + struct ublk_auto_buf_reg {
> > + __u16 index; /* Buffer index for registration */
> > + __u8 flags; /* Registration flags */
> > + __u8 reserved0; /* Reserved for future use */
> > + __u32 reserved1; /* Reserved for future use */
> > + };
>
> Suggest using ublk_auto_buf_reg_to_sqe_addr()? Otherwise, it seems
> ambiguous how this struct is "passed" in sqe->addr.
OK
>
> > +
> > +4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
> > +
> > +5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
> > +
> > +Fallback Behavior
> > +~~~~~~~~~~~~~~~~~
> > +
> > +When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > +
> > +1. If auto buffer registration fails:
>
> I would switch these. Both (1) and (2) refer to when auto buffer
> registration fails. So I would expect something like:
>
> If auto buffer registration fails:
>
> 1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> ...
> 2. If fallback is not enabled:
> ...
>
> > + - The uring_cmd is completed
>
> Maybe add "without registering the request buffer"?
>
> > + - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
> > + - The ublk server must manually register the buffer
>
> Only if it wants a registered buffer for the ublk request. Technically
> the ublk server could decide to fall back on user-copy, for example.
Good catch!
>
> > +
> > +2. If fallback is not enabled:
> > + - The ublk I/O request fails silently
>
> "silently" is a bit ambiguous. It's certainly not silent to the
> application submitting the ublk I/O. Maybe say that the ublk I/O
> request fails and no uring_cmd is completed to the ublk server?
Yes, but the document focus on ublk side, and the client is generic
for every driver, so I guess it may be fine.
>
> > +
> > +Limitations
> > +~~~~~~~~~~~
> > +
> > +- Requires same ``io_ring_ctx`` for all operations
>
> Another limitation that prevents us from adopting the auto buffer
> registration feature is the need to reserve a unique buffer table
> index for every ublk tag on the io_ring_ctx. Since the io_ring_ctx
> buffer table has a max size of 16K (could potentially be increased to
> 64K), this limit is easily reached when there are a large number of
> ublk devices or the ublk queue depth is large. I think we could remove
> this limitation in the future by adding support for allocating buffer
> indices on demand, analogous to IORING_FILE_INDEX_ALLOC.
OK.
But I guess it isn't big deal in reality since the task context should
be saturated easily with so big setting.
UBLK_F_PER_IO_DAEMON should alleviate the limit by adding more tasks/io_ring_ctx.
Also I am working on BATCH_IO[1] feature to allow one queue to be served
by multiple contexts, meantime one context can serve more than one queue
too in easy & dynamic & batch way. Then the 16K limit can be alleviated
too.
https://github.com/ming1/linux/commits/ublk2-cmd-batch/
Thanks,
Ming
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
2025-06-10 2:06 ` Ming Lei
@ 2025-06-11 15:54 ` Caleb Sander Mateos
2025-06-12 3:16 ` Ming Lei
0 siblings, 1 reply; 10+ messages in thread
From: Caleb Sander Mateos @ 2025-06-11 15:54 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar
On Mon, Jun 9, 2025 at 7:07 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Jun 09, 2025 at 03:29:34PM -0700, Caleb Sander Mateos wrote:
> > On Mon, Jun 9, 2025 at 5:14 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
> > >
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >
> > Thanks, this is a nice explanation. Just a few suggestions.
> >
> > > ---
> > > Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 67 insertions(+)
> > >
> > > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > > index c368e1081b41..16ffca54eed4 100644
> > > --- a/Documentation/block/ublk.rst
> > > +++ b/Documentation/block/ublk.rst
> > > @@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
> > > parameter of `struct ublk_param_segment` with backend for avoiding
> > > unnecessary IO split, which usually hurts io_uring performance.
> > >
> > > +Auto Buffer Registration
> > > +------------------------
> > > +
> > > +The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
> > > +and unregistration for I/O requests, which simplifies the buffer management
> > > +process and reduces overhead in the ublk server implementation.
> > > +
> > > +This is another feature flag for using zero copy, and it is compatible with
> > > +``UBLK_F_SUPPORT_ZERO_COPY``.
> > > +
> > > +Feature Overview
> > > +~~~~~~~~~~~~~~~~
> > > +
> > > +This feature automatically registers request buffers to the io_uring context
> > > +before delivering I/O commands to the ublk server and unregisters them when
> > > +completing I/O commands. This eliminates the need for manual buffer
> > > +registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
> > > +``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
> > > +can avoid dependency on the two uring_cmd operations.
> > > +
> > > +This way not only simplifies ublk server implementation, but also makes
> > > +concurrent IO handling becomes possible.
> >
> > I'm not sure what "concurrent IO handling" refers to. Any ublk server
> > can handle incoming I/O requests concurrently, regardless of what
> > features it has enabled. Do you mean it avoids the need for linked
> > io_uring requests to properly order buffer registration and
> > unregistration with the I/O operations using the registered buffer?
>
> Yes, if io_uring OPs depends on buffer registering & unregistering, these
> OPs can't be issued concurrently any more, that is one io_uring constraint.
>
> I will add the above words.
>
> >
> > > +
> > > +Usage Requirements
> > > +~~~~~~~~~~~~~~~~~~
> > > +
> > > +1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
> > > + used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
> > > +
> > > +2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
> > > + unregistration is required.
> >
> > nit: don't think this needs to be a separate point, could be combined with (1).
>
> OK.
>
> >
> > > +
> > > +3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
> > > + following structure::
> >
> > nit: extra ":"
>
> In reStructuredText (reST), the double colon :: serves as a literal block marker to
> indicate preformatted text.
>
> >
> > > +
> > > + struct ublk_auto_buf_reg {
> > > + __u16 index; /* Buffer index for registration */
> > > + __u8 flags; /* Registration flags */
> > > + __u8 reserved0; /* Reserved for future use */
> > > + __u32 reserved1; /* Reserved for future use */
> > > + };
> >
> > Suggest using ublk_auto_buf_reg_to_sqe_addr()? Otherwise, it seems
> > ambiguous how this struct is "passed" in sqe->addr.
>
> OK
>
> >
> > > +
> > > +4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
> > > +
> > > +5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
> > > +
> > > +Fallback Behavior
> > > +~~~~~~~~~~~~~~~~~
> > > +
> > > +When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > +
> > > +1. If auto buffer registration fails:
> >
> > I would switch these. Both (1) and (2) refer to when auto buffer
> > registration fails. So I would expect something like:
> >
> > If auto buffer registration fails:
> >
> > 1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > ...
> > 2. If fallback is not enabled:
> > ...
> >
> > > + - The uring_cmd is completed
> >
> > Maybe add "without registering the request buffer"?
> >
> > > + - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
> > > + - The ublk server must manually register the buffer
> >
> > Only if it wants a registered buffer for the ublk request. Technically
> > the ublk server could decide to fall back on user-copy, for example.
>
> Good catch!
>
> >
> > > +
> > > +2. If fallback is not enabled:
> > > + - The ublk I/O request fails silently
> >
> > "silently" is a bit ambiguous. It's certainly not silent to the
> > application submitting the ublk I/O. Maybe say that the ublk I/O
> > request fails and no uring_cmd is completed to the ublk server?
>
> Yes, but the document focus on ublk side, and the client is generic
> for every driver, so I guess it may be fine.
>
> >
> > > +
> > > +Limitations
> > > +~~~~~~~~~~~
> > > +
> > > +- Requires same ``io_ring_ctx`` for all operations
> >
> > Another limitation that prevents us from adopting the auto buffer
> > registration feature is the need to reserve a unique buffer table
> > index for every ublk tag on the io_ring_ctx. Since the io_ring_ctx
> > buffer table has a max size of 16K (could potentially be increased to
> > 64K), this limit is easily reached when there are a large number of
> > ublk devices or the ublk queue depth is large. I think we could remove
> > this limitation in the future by adding support for allocating buffer
> > indices on demand, analogous to IORING_FILE_INDEX_ALLOC.
>
> OK.
>
> But I guess it isn't big deal in reality since the task context should
> be saturated easily with so big setting.
I don't know about your "reality" but it's certainly a big deal for us :)
To reduce contention on the blk-mq queues for the application
submitting I/O to the ublk devices, we want a large number of queues
for each ublk device. But we also want a large queue depth for each
individual queue to avoid the async request allocation path in case
any one application thread issues a lot of concurrent I/O to a single
ublk device. And we have 128 ublk devices, which again all want large
queue depths in case the application sends a lot of I/O to a single
ublk device. The result is that concurrently each ublk server thread
fetches 512K ublk I/Os, which is significantly above the io_ring_ctx
buffer table limit.
Best,
Caleb
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
2025-06-11 15:54 ` Caleb Sander Mateos
@ 2025-06-12 3:16 ` Ming Lei
2025-06-12 14:38 ` Caleb Sander Mateos
0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2025-06-12 3:16 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, Uday Shankar
On Wed, Jun 11, 2025 at 08:54:53AM -0700, Caleb Sander Mateos wrote:
> On Mon, Jun 9, 2025 at 7:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Mon, Jun 09, 2025 at 03:29:34PM -0700, Caleb Sander Mateos wrote:
> > > On Mon, Jun 9, 2025 at 5:14 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
> > > >
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > >
> > > Thanks, this is a nice explanation. Just a few suggestions.
> > >
> > > > ---
> > > > Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 67 insertions(+)
> > > >
> > > > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > > > index c368e1081b41..16ffca54eed4 100644
> > > > --- a/Documentation/block/ublk.rst
> > > > +++ b/Documentation/block/ublk.rst
> > > > @@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
> > > > parameter of `struct ublk_param_segment` with backend for avoiding
> > > > unnecessary IO split, which usually hurts io_uring performance.
> > > >
> > > > +Auto Buffer Registration
> > > > +------------------------
> > > > +
> > > > +The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
> > > > +and unregistration for I/O requests, which simplifies the buffer management
> > > > +process and reduces overhead in the ublk server implementation.
> > > > +
> > > > +This is another feature flag for using zero copy, and it is compatible with
> > > > +``UBLK_F_SUPPORT_ZERO_COPY``.
> > > > +
> > > > +Feature Overview
> > > > +~~~~~~~~~~~~~~~~
> > > > +
> > > > +This feature automatically registers request buffers to the io_uring context
> > > > +before delivering I/O commands to the ublk server and unregisters them when
> > > > +completing I/O commands. This eliminates the need for manual buffer
> > > > +registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
> > > > +``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
> > > > +can avoid dependency on the two uring_cmd operations.
> > > > +
> > > > +This way not only simplifies ublk server implementation, but also makes
> > > > +concurrent IO handling becomes possible.
> > >
> > > I'm not sure what "concurrent IO handling" refers to. Any ublk server
> > > can handle incoming I/O requests concurrently, regardless of what
> > > features it has enabled. Do you mean it avoids the need for linked
> > > io_uring requests to properly order buffer registration and
> > > unregistration with the I/O operations using the registered buffer?
> >
> > Yes, if io_uring OPs depends on buffer registering & unregistering, these
> > OPs can't be issued concurrently any more, that is one io_uring constraint.
> >
> > I will add the above words.
> >
> > >
> > > > +
> > > > +Usage Requirements
> > > > +~~~~~~~~~~~~~~~~~~
> > > > +
> > > > +1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
> > > > + used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
> > > > +
> > > > +2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
> > > > + unregistration is required.
> > >
> > > nit: don't think this needs to be a separate point, could be combined with (1).
> >
> > OK.
> >
> > >
> > > > +
> > > > +3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
> > > > + following structure::
> > >
> > > nit: extra ":"
> >
> > In reStructuredText (reST), the double colon :: serves as a literal block marker to
> > indicate preformatted text.
> >
> > >
> > > > +
> > > > + struct ublk_auto_buf_reg {
> > > > + __u16 index; /* Buffer index for registration */
> > > > + __u8 flags; /* Registration flags */
> > > > + __u8 reserved0; /* Reserved for future use */
> > > > + __u32 reserved1; /* Reserved for future use */
> > > > + };
> > >
> > > Suggest using ublk_auto_buf_reg_to_sqe_addr()? Otherwise, it seems
> > > ambiguous how this struct is "passed" in sqe->addr.
> >
> > OK
> >
> > >
> > > > +
> > > > +4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
> > > > +
> > > > +5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
> > > > +
> > > > +Fallback Behavior
> > > > +~~~~~~~~~~~~~~~~~
> > > > +
> > > > +When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > +
> > > > +1. If auto buffer registration fails:
> > >
> > > I would switch these. Both (1) and (2) refer to when auto buffer
> > > registration fails. So I would expect something like:
> > >
> > > If auto buffer registration fails:
> > >
> > > 1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > ...
> > > 2. If fallback is not enabled:
> > > ...
> > >
> > > > + - The uring_cmd is completed
> > >
> > > Maybe add "without registering the request buffer"?
> > >
> > > > + - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
> > > > + - The ublk server must manually register the buffer
> > >
> > > Only if it wants a registered buffer for the ublk request. Technically
> > > the ublk server could decide to fall back on user-copy, for example.
> >
> > Good catch!
> >
> > >
> > > > +
> > > > +2. If fallback is not enabled:
> > > > + - The ublk I/O request fails silently
> > >
> > > "silently" is a bit ambiguous. It's certainly not silent to the
> > > application submitting the ublk I/O. Maybe say that the ublk I/O
> > > request fails and no uring_cmd is completed to the ublk server?
> >
> > Yes, but the document focus on ublk side, and the client is generic
> > for every driver, so I guess it may be fine.
> >
> > >
> > > > +
> > > > +Limitations
> > > > +~~~~~~~~~~~
> > > > +
> > > > +- Requires same ``io_ring_ctx`` for all operations
> > >
> > > Another limitation that prevents us from adopting the auto buffer
> > > registration feature is the need to reserve a unique buffer table
> > > index for every ublk tag on the io_ring_ctx. Since the io_ring_ctx
> > > buffer table has a max size of 16K (could potentially be increased to
> > > 64K), this limit is easily reached when there are a large number of
> > > ublk devices or the ublk queue depth is large. I think we could remove
> > > this limitation in the future by adding support for allocating buffer
> > > indices on demand, analogous to IORING_FILE_INDEX_ALLOC.
> >
> > OK.
> >
> > But I guess it isn't big deal in reality since the task context should
> > be saturated easily with so big setting.
>
> I don't know about your "reality" but it's certainly a big deal for us :)
> To reduce contention on the blk-mq queues for the application
> submitting I/O to the ublk devices, we want a large number of queues
> for each ublk device. But we also want a large queue depth for each
> individual queue to avoid the async request allocation path in case
> any one application thread issues a lot of concurrent I/O to a single
> ublk device. And we have 128 ublk devices, which again all want large
> queue depths in case the application sends a lot of I/O to a single
> ublk device. The result is that concurrently each ublk server thread
> fetches 512K ublk I/Os, which is significantly above the io_ring_ctx
> buffer table limit.
Yes, you can setup 512K I/Os in single task/io_uring context, but how many
can be actively handled during unit time? The number could be much less than
512k or 16K, because it is a single pthread/io_uring/cpu core, which may be
saturated easily, so most of these IOs may wait somewhere for cpu or whatever
resource.
So when you have one nice per-task buf-index allocation algorithm, it may not
be one issue given 16K is big enough.
Thanks,
Ming
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
2025-06-12 3:16 ` Ming Lei
@ 2025-06-12 14:38 ` Caleb Sander Mateos
2025-06-13 1:18 ` Ming Lei
0 siblings, 1 reply; 10+ messages in thread
From: Caleb Sander Mateos @ 2025-06-12 14:38 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar
On Wed, Jun 11, 2025 at 8:16 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Wed, Jun 11, 2025 at 08:54:53AM -0700, Caleb Sander Mateos wrote:
> > On Mon, Jun 9, 2025 at 7:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Mon, Jun 09, 2025 at 03:29:34PM -0700, Caleb Sander Mateos wrote:
> > > > On Mon, Jun 9, 2025 at 5:14 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > >
> > > > > Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
> > > > >
> > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > >
> > > > Thanks, this is a nice explanation. Just a few suggestions.
> > > >
> > > > > ---
> > > > > Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 67 insertions(+)
> > > > >
> > > > > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > > > > index c368e1081b41..16ffca54eed4 100644
> > > > > --- a/Documentation/block/ublk.rst
> > > > > +++ b/Documentation/block/ublk.rst
> > > > > @@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
> > > > > parameter of `struct ublk_param_segment` with backend for avoiding
> > > > > unnecessary IO split, which usually hurts io_uring performance.
> > > > >
> > > > > +Auto Buffer Registration
> > > > > +------------------------
> > > > > +
> > > > > +The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
> > > > > +and unregistration for I/O requests, which simplifies the buffer management
> > > > > +process and reduces overhead in the ublk server implementation.
> > > > > +
> > > > > +This is another feature flag for using zero copy, and it is compatible with
> > > > > +``UBLK_F_SUPPORT_ZERO_COPY``.
> > > > > +
> > > > > +Feature Overview
> > > > > +~~~~~~~~~~~~~~~~
> > > > > +
> > > > > +This feature automatically registers request buffers to the io_uring context
> > > > > +before delivering I/O commands to the ublk server and unregisters them when
> > > > > +completing I/O commands. This eliminates the need for manual buffer
> > > > > +registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
> > > > > +``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
> > > > > +can avoid dependency on the two uring_cmd operations.
> > > > > +
> > > > > +This way not only simplifies ublk server implementation, but also makes
> > > > > +concurrent IO handling becomes possible.
> > > >
> > > > I'm not sure what "concurrent IO handling" refers to. Any ublk server
> > > > can handle incoming I/O requests concurrently, regardless of what
> > > > features it has enabled. Do you mean it avoids the need for linked
> > > > io_uring requests to properly order buffer registration and
> > > > unregistration with the I/O operations using the registered buffer?
> > >
> > > Yes, if io_uring OPs depends on buffer registering & unregistering, these
> > > OPs can't be issued concurrently any more, that is one io_uring constraint.
> > >
> > > I will add the above words.
> > >
> > > >
> > > > > +
> > > > > +Usage Requirements
> > > > > +~~~~~~~~~~~~~~~~~~
> > > > > +
> > > > > +1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
> > > > > + used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
> > > > > +
> > > > > +2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
> > > > > + unregistration is required.
> > > >
> > > > nit: don't think this needs to be a separate point, could be combined with (1).
> > >
> > > OK.
> > >
> > > >
> > > > > +
> > > > > +3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
> > > > > + following structure::
> > > >
> > > > nit: extra ":"
> > >
> > > In reStructuredText (reST), the double colon :: serves as a literal block marker to
> > > indicate preformatted text.
> > >
> > > >
> > > > > +
> > > > > + struct ublk_auto_buf_reg {
> > > > > + __u16 index; /* Buffer index for registration */
> > > > > + __u8 flags; /* Registration flags */
> > > > > + __u8 reserved0; /* Reserved for future use */
> > > > > + __u32 reserved1; /* Reserved for future use */
> > > > > + };
> > > >
> > > > Suggest using ublk_auto_buf_reg_to_sqe_addr()? Otherwise, it seems
> > > > ambiguous how this struct is "passed" in sqe->addr.
> > >
> > > OK
> > >
> > > >
> > > > > +
> > > > > +4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
> > > > > +
> > > > > +5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
> > > > > +
> > > > > +Fallback Behavior
> > > > > +~~~~~~~~~~~~~~~~~
> > > > > +
> > > > > +When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > +
> > > > > +1. If auto buffer registration fails:
> > > >
> > > > I would switch these. Both (1) and (2) refer to when auto buffer
> > > > registration fails. So I would expect something like:
> > > >
> > > > If auto buffer registration fails:
> > > >
> > > > 1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > ...
> > > > 2. If fallback is not enabled:
> > > > ...
> > > >
> > > > > + - The uring_cmd is completed
> > > >
> > > > Maybe add "without registering the request buffer"?
> > > >
> > > > > + - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
> > > > > + - The ublk server must manually register the buffer
> > > >
> > > > Only if it wants a registered buffer for the ublk request. Technically
> > > > the ublk server could decide to fall back on user-copy, for example.
> > >
> > > Good catch!
> > >
> > > >
> > > > > +
> > > > > +2. If fallback is not enabled:
> > > > > + - The ublk I/O request fails silently
> > > >
> > > > "silently" is a bit ambiguous. It's certainly not silent to the
> > > > application submitting the ublk I/O. Maybe say that the ublk I/O
> > > > request fails and no uring_cmd is completed to the ublk server?
> > >
> > > Yes, but the document focus on ublk side, and the client is generic
> > > for every driver, so I guess it may be fine.
> > >
> > > >
> > > > > +
> > > > > +Limitations
> > > > > +~~~~~~~~~~~
> > > > > +
> > > > > +- Requires same ``io_ring_ctx`` for all operations
> > > >
> > > > Another limitation that prevents us from adopting the auto buffer
> > > > registration feature is the need to reserve a unique buffer table
> > > > index for every ublk tag on the io_ring_ctx. Since the io_ring_ctx
> > > > buffer table has a max size of 16K (could potentially be increased to
> > > > 64K), this limit is easily reached when there are a large number of
> > > > ublk devices or the ublk queue depth is large. I think we could remove
> > > > this limitation in the future by adding support for allocating buffer
> > > > indices on demand, analogous to IORING_FILE_INDEX_ALLOC.
> > >
> > > OK.
> > >
> > > But I guess it isn't big deal in reality since the task context should
> > > be saturated easily with so big setting.
> >
> > I don't know about your "reality" but it's certainly a big deal for us :)
> > To reduce contention on the blk-mq queues for the application
> > submitting I/O to the ublk devices, we want a large number of queues
> > for each ublk device. But we also want a large queue depth for each
> > individual queue to avoid the async request allocation path in case
> > any one application thread issues a lot of concurrent I/O to a single
> > ublk device. And we have 128 ublk devices, which again all want large
> > queue depths in case the application sends a lot of I/O to a single
> > ublk device. The result is that concurrently each ublk server thread
> > fetches 512K ublk I/Os, which is significantly above the io_ring_ctx
> > buffer table limit.
>
> Yes, you can setup 512K I/Os in single task/io_uring context, but how many
> can be actively handled during unit time? The number could be much less than
> 512k or 16K, because it is a single pthread/io_uring/cpu core, which may be
> saturated easily, so most of these IOs may wait somewhere for cpu or whatever
> resource.
Yes, that's exactly my point. Our ublk server only allocates enough
resources to handle 4K concurrent I/Os per thread. But since we don't
know which ublk devices or queues might receive the I/Os, we have to
fetch a queue depth of 4K on *every* ublk device queue. Perhaps the
batched approach you're working on will help here. But for now, the
total number of fetched ublk I/Os is an obstacle to adopting auto
buffer registration. And waiting to allocate the buffer index until an
incoming I/O actually needs to register a buffer seems like a
straightforward way to avoid this obstacle.
Best,
Caleb
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
2025-06-12 14:38 ` Caleb Sander Mateos
@ 2025-06-13 1:18 ` Ming Lei
2025-06-13 1:36 ` Caleb Sander Mateos
0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2025-06-13 1:18 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, Uday Shankar
On Thu, Jun 12, 2025 at 07:38:01AM -0700, Caleb Sander Mateos wrote:
> On Wed, Jun 11, 2025 at 8:16 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Wed, Jun 11, 2025 at 08:54:53AM -0700, Caleb Sander Mateos wrote:
> > > On Mon, Jun 9, 2025 at 7:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Mon, Jun 09, 2025 at 03:29:34PM -0700, Caleb Sander Mateos wrote:
> > > > > On Mon, Jun 9, 2025 at 5:14 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > >
> > > > > > Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
> > > > > >
> > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > >
> > > > > Thanks, this is a nice explanation. Just a few suggestions.
> > > > >
> > > > > > ---
> > > > > > Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
> > > > > > 1 file changed, 67 insertions(+)
> > > > > >
> > > > > > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > > > > > index c368e1081b41..16ffca54eed4 100644
> > > > > > --- a/Documentation/block/ublk.rst
> > > > > > +++ b/Documentation/block/ublk.rst
> > > > > > @@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
> > > > > > parameter of `struct ublk_param_segment` with backend for avoiding
> > > > > > unnecessary IO split, which usually hurts io_uring performance.
> > > > > >
> > > > > > +Auto Buffer Registration
> > > > > > +------------------------
> > > > > > +
> > > > > > +The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
> > > > > > +and unregistration for I/O requests, which simplifies the buffer management
> > > > > > +process and reduces overhead in the ublk server implementation.
> > > > > > +
> > > > > > +This is another feature flag for using zero copy, and it is compatible with
> > > > > > +``UBLK_F_SUPPORT_ZERO_COPY``.
> > > > > > +
> > > > > > +Feature Overview
> > > > > > +~~~~~~~~~~~~~~~~
> > > > > > +
> > > > > > +This feature automatically registers request buffers to the io_uring context
> > > > > > +before delivering I/O commands to the ublk server and unregisters them when
> > > > > > +completing I/O commands. This eliminates the need for manual buffer
> > > > > > +registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
> > > > > > +``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
> > > > > > +can avoid dependency on the two uring_cmd operations.
> > > > > > +
> > > > > > +This way not only simplifies ublk server implementation, but also makes
> > > > > > +concurrent IO handling becomes possible.
> > > > >
> > > > > I'm not sure what "concurrent IO handling" refers to. Any ublk server
> > > > > can handle incoming I/O requests concurrently, regardless of what
> > > > > features it has enabled. Do you mean it avoids the need for linked
> > > > > io_uring requests to properly order buffer registration and
> > > > > unregistration with the I/O operations using the registered buffer?
> > > >
> > > > Yes, if io_uring OPs depends on buffer registering & unregistering, these
> > > > OPs can't be issued concurrently any more, that is one io_uring constraint.
> > > >
> > > > I will add the above words.
> > > >
> > > > >
> > > > > > +
> > > > > > +Usage Requirements
> > > > > > +~~~~~~~~~~~~~~~~~~
> > > > > > +
> > > > > > +1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
> > > > > > + used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
> > > > > > +
> > > > > > +2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
> > > > > > + unregistration is required.
> > > > >
> > > > > nit: don't think this needs to be a separate point, could be combined with (1).
> > > >
> > > > OK.
> > > >
> > > > >
> > > > > > +
> > > > > > +3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
> > > > > > + following structure::
> > > > >
> > > > > nit: extra ":"
> > > >
> > > > In reStructuredText (reST), the double colon :: serves as a literal block marker to
> > > > indicate preformatted text.
> > > >
> > > > >
> > > > > > +
> > > > > > + struct ublk_auto_buf_reg {
> > > > > > + __u16 index; /* Buffer index for registration */
> > > > > > + __u8 flags; /* Registration flags */
> > > > > > + __u8 reserved0; /* Reserved for future use */
> > > > > > + __u32 reserved1; /* Reserved for future use */
> > > > > > + };
> > > > >
> > > > > Suggest using ublk_auto_buf_reg_to_sqe_addr()? Otherwise, it seems
> > > > > ambiguous how this struct is "passed" in sqe->addr.
> > > >
> > > > OK
> > > >
> > > > >
> > > > > > +
> > > > > > +4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
> > > > > > +
> > > > > > +5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
> > > > > > +
> > > > > > +Fallback Behavior
> > > > > > +~~~~~~~~~~~~~~~~~
> > > > > > +
> > > > > > +When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > > +
> > > > > > +1. If auto buffer registration fails:
> > > > >
> > > > > I would switch these. Both (1) and (2) refer to when auto buffer
> > > > > registration fails. So I would expect something like:
> > > > >
> > > > > If auto buffer registration fails:
> > > > >
> > > > > 1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > ...
> > > > > 2. If fallback is not enabled:
> > > > > ...
> > > > >
> > > > > > + - The uring_cmd is completed
> > > > >
> > > > > Maybe add "without registering the request buffer"?
> > > > >
> > > > > > + - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
> > > > > > + - The ublk server must manually register the buffer
> > > > >
> > > > > Only if it wants a registered buffer for the ublk request. Technically
> > > > > the ublk server could decide to fall back on user-copy, for example.
> > > >
> > > > Good catch!
> > > >
> > > > >
> > > > > > +
> > > > > > +2. If fallback is not enabled:
> > > > > > + - The ublk I/O request fails silently
> > > > >
> > > > > "silently" is a bit ambiguous. It's certainly not silent to the
> > > > > application submitting the ublk I/O. Maybe say that the ublk I/O
> > > > > request fails and no uring_cmd is completed to the ublk server?
> > > >
> > > > Yes, but the document focus on ublk side, and the client is generic
> > > > for every driver, so I guess it may be fine.
> > > >
> > > > >
> > > > > > +
> > > > > > +Limitations
> > > > > > +~~~~~~~~~~~
> > > > > > +
> > > > > > +- Requires same ``io_ring_ctx`` for all operations
> > > > >
> > > > > Another limitation that prevents us from adopting the auto buffer
> > > > > registration feature is the need to reserve a unique buffer table
> > > > > index for every ublk tag on the io_ring_ctx. Since the io_ring_ctx
> > > > > buffer table has a max size of 16K (could potentially be increased to
> > > > > 64K), this limit is easily reached when there are a large number of
> > > > > ublk devices or the ublk queue depth is large. I think we could remove
> > > > > this limitation in the future by adding support for allocating buffer
> > > > > indices on demand, analogous to IORING_FILE_INDEX_ALLOC.
> > > >
> > > > OK.
> > > >
> > > > But I guess it isn't big deal in reality since the task context should
> > > > be saturated easily with so big setting.
> > >
> > > I don't know about your "reality" but it's certainly a big deal for us :)
> > > To reduce contention on the blk-mq queues for the application
> > > submitting I/O to the ublk devices, we want a large number of queues
> > > for each ublk device. But we also want a large queue depth for each
> > > individual queue to avoid the async request allocation path in case
> > > any one application thread issues a lot of concurrent I/O to a single
> > > ublk device. And we have 128 ublk devices, which again all want large
> > > queue depths in case the application sends a lot of I/O to a single
> > > ublk device. The result is that concurrently each ublk server thread
> > > fetches 512K ublk I/Os, which is significantly above the io_ring_ctx
> > > buffer table limit.
> >
> > Yes, you can setup 512K I/Os in single task/io_uring context, but how many
> > can be actively handled during unit time? The number could be much less than
> > 512k or 16K, because it is a single pthread/io_uring/cpu core, which may be
> > saturated easily, so most of these IOs may wait somewhere for cpu or whatever
> > resource.
>
> Yes, that's exactly my point. Our ublk server only allocates enough
> resources to handle 4K concurrent I/Os per thread. But since we don't
> know which ublk devices or queues might receive the I/Os, we have to
> fetch a queue depth of 4K on *every* ublk device queue. Perhaps the
> batched approach you're working on will help here. But for now, the
> total number of fetched ublk I/Os is an obstacle to adopting auto
> buffer registration.
oops, I forget the point that buffer index has to be provided beforehand,
that is really one limit for your case with too many IOs in single uring
context.
The batched approach may not help too because the model is to issue command
beforehand for fetching new io command.
> And waiting to allocate the buffer index until an
> incoming I/O actually needs to register a buffer seems like a
> straightforward way to avoid this obstacle.
One way is to rely on bpf program to allocate & provide buffer index via
struct_ops, which can be called exactly before registering & unregistering
io buffer. The concept should be simple, but the whole implementation may
take some effort(most are boiler plate).
thanks,
Ming
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
2025-06-13 1:18 ` Ming Lei
@ 2025-06-13 1:36 ` Caleb Sander Mateos
2025-06-13 1:53 ` Ming Lei
0 siblings, 1 reply; 10+ messages in thread
From: Caleb Sander Mateos @ 2025-06-13 1:36 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar
On Thu, Jun 12, 2025 at 6:18 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, Jun 12, 2025 at 07:38:01AM -0700, Caleb Sander Mateos wrote:
> > On Wed, Jun 11, 2025 at 8:16 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Wed, Jun 11, 2025 at 08:54:53AM -0700, Caleb Sander Mateos wrote:
> > > > On Mon, Jun 9, 2025 at 7:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > >
> > > > > On Mon, Jun 09, 2025 at 03:29:34PM -0700, Caleb Sander Mateos wrote:
> > > > > > On Mon, Jun 9, 2025 at 5:14 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > > >
> > > > > > > Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
> > > > > > >
> > > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > >
> > > > > > Thanks, this is a nice explanation. Just a few suggestions.
> > > > > >
> > > > > > > ---
> > > > > > > Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
> > > > > > > 1 file changed, 67 insertions(+)
> > > > > > >
> > > > > > > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > > > > > > index c368e1081b41..16ffca54eed4 100644
> > > > > > > --- a/Documentation/block/ublk.rst
> > > > > > > +++ b/Documentation/block/ublk.rst
> > > > > > > @@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
> > > > > > > parameter of `struct ublk_param_segment` with backend for avoiding
> > > > > > > unnecessary IO split, which usually hurts io_uring performance.
> > > > > > >
> > > > > > > +Auto Buffer Registration
> > > > > > > +------------------------
> > > > > > > +
> > > > > > > +The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
> > > > > > > +and unregistration for I/O requests, which simplifies the buffer management
> > > > > > > +process and reduces overhead in the ublk server implementation.
> > > > > > > +
> > > > > > > +This is another feature flag for using zero copy, and it is compatible with
> > > > > > > +``UBLK_F_SUPPORT_ZERO_COPY``.
> > > > > > > +
> > > > > > > +Feature Overview
> > > > > > > +~~~~~~~~~~~~~~~~
> > > > > > > +
> > > > > > > +This feature automatically registers request buffers to the io_uring context
> > > > > > > +before delivering I/O commands to the ublk server and unregisters them when
> > > > > > > +completing I/O commands. This eliminates the need for manual buffer
> > > > > > > +registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
> > > > > > > +``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
> > > > > > > +can avoid dependency on the two uring_cmd operations.
> > > > > > > +
> > > > > > > +This way not only simplifies ublk server implementation, but also makes
> > > > > > > +concurrent IO handling becomes possible.
> > > > > >
> > > > > > I'm not sure what "concurrent IO handling" refers to. Any ublk server
> > > > > > can handle incoming I/O requests concurrently, regardless of what
> > > > > > features it has enabled. Do you mean it avoids the need for linked
> > > > > > io_uring requests to properly order buffer registration and
> > > > > > unregistration with the I/O operations using the registered buffer?
> > > > >
> > > > > Yes, if io_uring OPs depends on buffer registering & unregistering, these
> > > > > OPs can't be issued concurrently any more, that is one io_uring constraint.
> > > > >
> > > > > I will add the above words.
> > > > >
> > > > > >
> > > > > > > +
> > > > > > > +Usage Requirements
> > > > > > > +~~~~~~~~~~~~~~~~~~
> > > > > > > +
> > > > > > > +1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
> > > > > > > + used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
> > > > > > > +
> > > > > > > +2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
> > > > > > > + unregistration is required.
> > > > > >
> > > > > > nit: don't think this needs to be a separate point, could be combined with (1).
> > > > >
> > > > > OK.
> > > > >
> > > > > >
> > > > > > > +
> > > > > > > +3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
> > > > > > > + following structure::
> > > > > >
> > > > > > nit: extra ":"
> > > > >
> > > > > In reStructuredText (reST), the double colon :: serves as a literal block marker to
> > > > > indicate preformatted text.
> > > > >
> > > > > >
> > > > > > > +
> > > > > > > + struct ublk_auto_buf_reg {
> > > > > > > + __u16 index; /* Buffer index for registration */
> > > > > > > + __u8 flags; /* Registration flags */
> > > > > > > + __u8 reserved0; /* Reserved for future use */
> > > > > > > + __u32 reserved1; /* Reserved for future use */
> > > > > > > + };
> > > > > >
> > > > > > Suggest using ublk_auto_buf_reg_to_sqe_addr()? Otherwise, it seems
> > > > > > ambiguous how this struct is "passed" in sqe->addr.
> > > > >
> > > > > OK
> > > > >
> > > > > >
> > > > > > > +
> > > > > > > +4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
> > > > > > > +
> > > > > > > +5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
> > > > > > > +
> > > > > > > +Fallback Behavior
> > > > > > > +~~~~~~~~~~~~~~~~~
> > > > > > > +
> > > > > > > +When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > > > +
> > > > > > > +1. If auto buffer registration fails:
> > > > > >
> > > > > > I would switch these. Both (1) and (2) refer to when auto buffer
> > > > > > registration fails. So I would expect something like:
> > > > > >
> > > > > > If auto buffer registration fails:
> > > > > >
> > > > > > 1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > > ...
> > > > > > 2. If fallback is not enabled:
> > > > > > ...
> > > > > >
> > > > > > > + - The uring_cmd is completed
> > > > > >
> > > > > > Maybe add "without registering the request buffer"?
> > > > > >
> > > > > > > + - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
> > > > > > > + - The ublk server must manually register the buffer
> > > > > >
> > > > > > Only if it wants a registered buffer for the ublk request. Technically
> > > > > > the ublk server could decide to fall back on user-copy, for example.
> > > > >
> > > > > Good catch!
> > > > >
> > > > > >
> > > > > > > +
> > > > > > > +2. If fallback is not enabled:
> > > > > > > + - The ublk I/O request fails silently
> > > > > >
> > > > > > "silently" is a bit ambiguous. It's certainly not silent to the
> > > > > > application submitting the ublk I/O. Maybe say that the ublk I/O
> > > > > > request fails and no uring_cmd is completed to the ublk server?
> > > > >
> > > > > Yes, but the document focus on ublk side, and the client is generic
> > > > > for every driver, so I guess it may be fine.
> > > > >
> > > > > >
> > > > > > > +
> > > > > > > +Limitations
> > > > > > > +~~~~~~~~~~~
> > > > > > > +
> > > > > > > +- Requires same ``io_ring_ctx`` for all operations
> > > > > >
> > > > > > Another limitation that prevents us from adopting the auto buffer
> > > > > > registration feature is the need to reserve a unique buffer table
> > > > > > index for every ublk tag on the io_ring_ctx. Since the io_ring_ctx
> > > > > > buffer table has a max size of 16K (could potentially be increased to
> > > > > > 64K), this limit is easily reached when there are a large number of
> > > > > > ublk devices or the ublk queue depth is large. I think we could remove
> > > > > > this limitation in the future by adding support for allocating buffer
> > > > > > indices on demand, analogous to IORING_FILE_INDEX_ALLOC.
> > > > >
> > > > > OK.
> > > > >
> > > > > But I guess it isn't big deal in reality since the task context should
> > > > > be saturated easily with so big setting.
> > > >
> > > > I don't know about your "reality" but it's certainly a big deal for us :)
> > > > To reduce contention on the blk-mq queues for the application
> > > > submitting I/O to the ublk devices, we want a large number of queues
> > > > for each ublk device. But we also want a large queue depth for each
> > > > individual queue to avoid the async request allocation path in case
> > > > any one application thread issues a lot of concurrent I/O to a single
> > > > ublk device. And we have 128 ublk devices, which again all want large
> > > > queue depths in case the application sends a lot of I/O to a single
> > > > ublk device. The result is that concurrently each ublk server thread
> > > > fetches 512K ublk I/Os, which is significantly above the io_ring_ctx
> > > > buffer table limit.
> > >
> > > Yes, you can setup 512K I/Os in single task/io_uring context, but how many
> > > can be actively handled during unit time? The number could be much less than
> > > 512k or 16K, because it is a single pthread/io_uring/cpu core, which may be
> > > saturated easily, so most of these IOs may wait somewhere for cpu or whatever
> > > resource.
> >
> > Yes, that's exactly my point. Our ublk server only allocates enough
> > resources to handle 4K concurrent I/Os per thread. But since we don't
> > know which ublk devices or queues might receive the I/Os, we have to
> > fetch a queue depth of 4K on *every* ublk device queue. Perhaps the
> > batched approach you're working on will help here. But for now, the
> > total number of fetched ublk I/Os is an obstacle to adopting auto
> > buffer registration.
>
> oops, I forget the point that buffer index has to be provided beforehand,
> that is really one limit for your case with too many IOs in single uring
> context.
>
> The batched approach may not help too because the model is to issue command
> beforehand for fetching new io command.
>
> > And waiting to allocate the buffer index until an
> > incoming I/O actually needs to register a buffer seems like a
> > straightforward way to avoid this obstacle.
>
> One way is to rely on bpf program to allocate & provide buffer index via
> struct_ops, which can be called exactly before registering & unregistering
> io buffer. The concept should be simple, but the whole implementation may
> take some effort(most are boiler plate).
A BPF program feels overly complex. Ideally the ublk server could
create a sparse buffer table and just let io_uring allocate an unused
buffer index for each incoming ublk I/O and return it in the io_uring
CQE. This is basically identical to IORING_FILE_INDEX_ALLOC, except
for registered buffers instead of registered files. It would require a
change in io_uring to support allocating a registered buffer index on
demand, but hopefully not too much work to leverage what already
exists for registered files. And the ublk server would of course have
to set UBLK_AUTO_BUF_REG_FALLBACK to gracefully handle buffer index
allocation failures if the client application issues more concurrent
I/Os than there are available buffer indices.
Best,
Caleb
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
2025-06-13 1:36 ` Caleb Sander Mateos
@ 2025-06-13 1:53 ` Ming Lei
2025-06-13 1:57 ` Caleb Sander Mateos
0 siblings, 1 reply; 10+ messages in thread
From: Ming Lei @ 2025-06-13 1:53 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, Uday Shankar
On Thu, Jun 12, 2025 at 06:36:41PM -0700, Caleb Sander Mateos wrote:
> On Thu, Jun 12, 2025 at 6:18 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Thu, Jun 12, 2025 at 07:38:01AM -0700, Caleb Sander Mateos wrote:
> > > On Wed, Jun 11, 2025 at 8:16 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Wed, Jun 11, 2025 at 08:54:53AM -0700, Caleb Sander Mateos wrote:
> > > > > On Mon, Jun 9, 2025 at 7:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > >
> > > > > > On Mon, Jun 09, 2025 at 03:29:34PM -0700, Caleb Sander Mateos wrote:
> > > > > > > On Mon, Jun 9, 2025 at 5:14 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > > > >
> > > > > > > > Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
> > > > > > > >
> > > > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > >
> > > > > > > Thanks, this is a nice explanation. Just a few suggestions.
> > > > > > >
> > > > > > > > ---
> > > > > > > > Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
> > > > > > > > 1 file changed, 67 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > > > > > > > index c368e1081b41..16ffca54eed4 100644
> > > > > > > > --- a/Documentation/block/ublk.rst
> > > > > > > > +++ b/Documentation/block/ublk.rst
> > > > > > > > @@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
> > > > > > > > parameter of `struct ublk_param_segment` with backend for avoiding
> > > > > > > > unnecessary IO split, which usually hurts io_uring performance.
> > > > > > > >
> > > > > > > > +Auto Buffer Registration
> > > > > > > > +------------------------
> > > > > > > > +
> > > > > > > > +The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
> > > > > > > > +and unregistration for I/O requests, which simplifies the buffer management
> > > > > > > > +process and reduces overhead in the ublk server implementation.
> > > > > > > > +
> > > > > > > > +This is another feature flag for using zero copy, and it is compatible with
> > > > > > > > +``UBLK_F_SUPPORT_ZERO_COPY``.
> > > > > > > > +
> > > > > > > > +Feature Overview
> > > > > > > > +~~~~~~~~~~~~~~~~
> > > > > > > > +
> > > > > > > > +This feature automatically registers request buffers to the io_uring context
> > > > > > > > +before delivering I/O commands to the ublk server and unregisters them when
> > > > > > > > +completing I/O commands. This eliminates the need for manual buffer
> > > > > > > > +registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
> > > > > > > > +``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
> > > > > > > > +can avoid dependency on the two uring_cmd operations.
> > > > > > > > +
> > > > > > > > +This way not only simplifies ublk server implementation, but also makes
> > > > > > > > +concurrent IO handling becomes possible.
> > > > > > >
> > > > > > > I'm not sure what "concurrent IO handling" refers to. Any ublk server
> > > > > > > can handle incoming I/O requests concurrently, regardless of what
> > > > > > > features it has enabled. Do you mean it avoids the need for linked
> > > > > > > io_uring requests to properly order buffer registration and
> > > > > > > unregistration with the I/O operations using the registered buffer?
> > > > > >
> > > > > > Yes, if io_uring OPs depends on buffer registering & unregistering, these
> > > > > > OPs can't be issued concurrently any more, that is one io_uring constraint.
> > > > > >
> > > > > > I will add the above words.
> > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > +Usage Requirements
> > > > > > > > +~~~~~~~~~~~~~~~~~~
> > > > > > > > +
> > > > > > > > +1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
> > > > > > > > + used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
> > > > > > > > +
> > > > > > > > +2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
> > > > > > > > + unregistration is required.
> > > > > > >
> > > > > > > nit: don't think this needs to be a separate point, could be combined with (1).
> > > > > >
> > > > > > OK.
> > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > +3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
> > > > > > > > + following structure::
> > > > > > >
> > > > > > > nit: extra ":"
> > > > > >
> > > > > > In reStructuredText (reST), the double colon :: serves as a literal block marker to
> > > > > > indicate preformatted text.
> > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > + struct ublk_auto_buf_reg {
> > > > > > > > + __u16 index; /* Buffer index for registration */
> > > > > > > > + __u8 flags; /* Registration flags */
> > > > > > > > + __u8 reserved0; /* Reserved for future use */
> > > > > > > > + __u32 reserved1; /* Reserved for future use */
> > > > > > > > + };
> > > > > > >
> > > > > > > Suggest using ublk_auto_buf_reg_to_sqe_addr()? Otherwise, it seems
> > > > > > > ambiguous how this struct is "passed" in sqe->addr.
> > > > > >
> > > > > > OK
> > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > +4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
> > > > > > > > +
> > > > > > > > +5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
> > > > > > > > +
> > > > > > > > +Fallback Behavior
> > > > > > > > +~~~~~~~~~~~~~~~~~
> > > > > > > > +
> > > > > > > > +When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > > > > +
> > > > > > > > +1. If auto buffer registration fails:
> > > > > > >
> > > > > > > I would switch these. Both (1) and (2) refer to when auto buffer
> > > > > > > registration fails. So I would expect something like:
> > > > > > >
> > > > > > > If auto buffer registration fails:
> > > > > > >
> > > > > > > 1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > > > ...
> > > > > > > 2. If fallback is not enabled:
> > > > > > > ...
> > > > > > >
> > > > > > > > + - The uring_cmd is completed
> > > > > > >
> > > > > > > Maybe add "without registering the request buffer"?
> > > > > > >
> > > > > > > > + - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
> > > > > > > > + - The ublk server must manually register the buffer
> > > > > > >
> > > > > > > Only if it wants a registered buffer for the ublk request. Technically
> > > > > > > the ublk server could decide to fall back on user-copy, for example.
> > > > > >
> > > > > > Good catch!
> > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > +2. If fallback is not enabled:
> > > > > > > > + - The ublk I/O request fails silently
> > > > > > >
> > > > > > > "silently" is a bit ambiguous. It's certainly not silent to the
> > > > > > > application submitting the ublk I/O. Maybe say that the ublk I/O
> > > > > > > request fails and no uring_cmd is completed to the ublk server?
> > > > > >
> > > > > > Yes, but the document focus on ublk side, and the client is generic
> > > > > > for every driver, so I guess it may be fine.
> > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > +Limitations
> > > > > > > > +~~~~~~~~~~~
> > > > > > > > +
> > > > > > > > +- Requires same ``io_ring_ctx`` for all operations
> > > > > > >
> > > > > > > Another limitation that prevents us from adopting the auto buffer
> > > > > > > registration feature is the need to reserve a unique buffer table
> > > > > > > index for every ublk tag on the io_ring_ctx. Since the io_ring_ctx
> > > > > > > buffer table has a max size of 16K (could potentially be increased to
> > > > > > > 64K), this limit is easily reached when there are a large number of
> > > > > > > ublk devices or the ublk queue depth is large. I think we could remove
> > > > > > > this limitation in the future by adding support for allocating buffer
> > > > > > > indices on demand, analogous to IORING_FILE_INDEX_ALLOC.
> > > > > >
> > > > > > OK.
> > > > > >
> > > > > > But I guess it isn't big deal in reality since the task context should
> > > > > > be saturated easily with so big setting.
> > > > >
> > > > > I don't know about your "reality" but it's certainly a big deal for us :)
> > > > > To reduce contention on the blk-mq queues for the application
> > > > > submitting I/O to the ublk devices, we want a large number of queues
> > > > > for each ublk device. But we also want a large queue depth for each
> > > > > individual queue to avoid the async request allocation path in case
> > > > > any one application thread issues a lot of concurrent I/O to a single
> > > > > ublk device. And we have 128 ublk devices, which again all want large
> > > > > queue depths in case the application sends a lot of I/O to a single
> > > > > ublk device. The result is that concurrently each ublk server thread
> > > > > fetches 512K ublk I/Os, which is significantly above the io_ring_ctx
> > > > > buffer table limit.
> > > >
> > > > Yes, you can setup 512K I/Os in single task/io_uring context, but how many
> > > > can be actively handled during unit time? The number could be much less than
> > > > 512k or 16K, because it is a single pthread/io_uring/cpu core, which may be
> > > > saturated easily, so most of these IOs may wait somewhere for cpu or whatever
> > > > resource.
> > >
> > > Yes, that's exactly my point. Our ublk server only allocates enough
> > > resources to handle 4K concurrent I/Os per thread. But since we don't
> > > know which ublk devices or queues might receive the I/Os, we have to
> > > fetch a queue depth of 4K on *every* ublk device queue. Perhaps the
> > > batched approach you're working on will help here. But for now, the
> > > total number of fetched ublk I/Os is an obstacle to adopting auto
> > > buffer registration.
> >
> > oops, I forget the point that buffer index has to be provided beforehand,
> > that is really one limit for your case with too many IOs in single uring
> > context.
> >
> > The batched approach may not help too because the model is to issue command
> > beforehand for fetching new io command.
> >
> > > And waiting to allocate the buffer index until an
> > > incoming I/O actually needs to register a buffer seems like a
> > > straightforward way to avoid this obstacle.
> >
> > One way is to rely on bpf program to allocate & provide buffer index via
> > struct_ops, which can be called exactly before registering & unregistering
> > io buffer. The concept should be simple, but the whole implementation may
> > take some effort(most are boiler plate).
>
> A BPF program feels overly complex. Ideally the ublk server could
> create a sparse buffer table and just let io_uring allocate an unused
> buffer index for each incoming ublk I/O and return it in the io_uring
> CQE. This is basically identical to IORING_FILE_INDEX_ALLOC, except
> for registered buffers instead of registered files. It would require a
> change in io_uring to support allocating a registered buffer index on
> demand, but hopefully not too much work to leverage what already
> exists for registered files. And the ublk server would of course have
> to set UBLK_AUTO_BUF_REG_FALLBACK to gracefully handle buffer index
> allocation failures if the client application issues more concurrent
> I/Os than there are available buffer indices.
Indeed, it may be simpler than IORING_FILE_INDEX_ALLOC, since it needn't
to expose as uapi, the user can be just io_buffer_register_bvec().
Care to make a patch?
Thanks,
Ming
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
2025-06-13 1:53 ` Ming Lei
@ 2025-06-13 1:57 ` Caleb Sander Mateos
0 siblings, 0 replies; 10+ messages in thread
From: Caleb Sander Mateos @ 2025-06-13 1:57 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar
On Thu, Jun 12, 2025 at 6:53 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, Jun 12, 2025 at 06:36:41PM -0700, Caleb Sander Mateos wrote:
> > On Thu, Jun 12, 2025 at 6:18 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Thu, Jun 12, 2025 at 07:38:01AM -0700, Caleb Sander Mateos wrote:
> > > > On Wed, Jun 11, 2025 at 8:16 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > >
> > > > > On Wed, Jun 11, 2025 at 08:54:53AM -0700, Caleb Sander Mateos wrote:
> > > > > > On Mon, Jun 9, 2025 at 7:07 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jun 09, 2025 at 03:29:34PM -0700, Caleb Sander Mateos wrote:
> > > > > > > > On Mon, Jun 9, 2025 at 5:14 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > Document recently merged feature auto buffer registration(UBLK_F_AUTO_BUF_REG).
> > > > > > > > >
> > > > > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > > >
> > > > > > > > Thanks, this is a nice explanation. Just a few suggestions.
> > > > > > > >
> > > > > > > > > ---
> > > > > > > > > Documentation/block/ublk.rst | 67 ++++++++++++++++++++++++++++++++++++
> > > > > > > > > 1 file changed, 67 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> > > > > > > > > index c368e1081b41..16ffca54eed4 100644
> > > > > > > > > --- a/Documentation/block/ublk.rst
> > > > > > > > > +++ b/Documentation/block/ublk.rst
> > > > > > > > > @@ -352,6 +352,73 @@ For reaching best IO performance, ublk server should align its segment
> > > > > > > > > parameter of `struct ublk_param_segment` with backend for avoiding
> > > > > > > > > unnecessary IO split, which usually hurts io_uring performance.
> > > > > > > > >
> > > > > > > > > +Auto Buffer Registration
> > > > > > > > > +------------------------
> > > > > > > > > +
> > > > > > > > > +The ``UBLK_F_AUTO_BUF_REG`` feature automatically handles buffer registration
> > > > > > > > > +and unregistration for I/O requests, which simplifies the buffer management
> > > > > > > > > +process and reduces overhead in the ublk server implementation.
> > > > > > > > > +
> > > > > > > > > +This is another feature flag for using zero copy, and it is compatible with
> > > > > > > > > +``UBLK_F_SUPPORT_ZERO_COPY``.
> > > > > > > > > +
> > > > > > > > > +Feature Overview
> > > > > > > > > +~~~~~~~~~~~~~~~~
> > > > > > > > > +
> > > > > > > > > +This feature automatically registers request buffers to the io_uring context
> > > > > > > > > +before delivering I/O commands to the ublk server and unregisters them when
> > > > > > > > > +completing I/O commands. This eliminates the need for manual buffer
> > > > > > > > > +registration/unregistration via ``UBLK_IO_REGISTER_IO_BUF`` and
> > > > > > > > > +``UBLK_IO_UNREGISTER_IO_BUF`` commands, then IO handling in ublk server
> > > > > > > > > +can avoid dependency on the two uring_cmd operations.
> > > > > > > > > +
> > > > > > > > > +This way not only simplifies ublk server implementation, but also makes
> > > > > > > > > +concurrent IO handling becomes possible.
> > > > > > > >
> > > > > > > > I'm not sure what "concurrent IO handling" refers to. Any ublk server
> > > > > > > > can handle incoming I/O requests concurrently, regardless of what
> > > > > > > > features it has enabled. Do you mean it avoids the need for linked
> > > > > > > > io_uring requests to properly order buffer registration and
> > > > > > > > unregistration with the I/O operations using the registered buffer?
> > > > > > >
> > > > > > > Yes, if io_uring OPs depends on buffer registering & unregistering, these
> > > > > > > OPs can't be issued concurrently any more, that is one io_uring constraint.
> > > > > > >
> > > > > > > I will add the above words.
> > > > > > >
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +Usage Requirements
> > > > > > > > > +~~~~~~~~~~~~~~~~~~
> > > > > > > > > +
> > > > > > > > > +1. The ublk server must create a sparse buffer table on the same ``io_ring_ctx``
> > > > > > > > > + used for ``UBLK_IO_FETCH_REQ`` and ``UBLK_IO_COMMIT_AND_FETCH_REQ``.
> > > > > > > > > +
> > > > > > > > > +2. If uring_cmd is issued on a different ``io_ring_ctx``, manual buffer
> > > > > > > > > + unregistration is required.
> > > > > > > >
> > > > > > > > nit: don't think this needs to be a separate point, could be combined with (1).
> > > > > > >
> > > > > > > OK.
> > > > > > >
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +3. Buffer registration data must be passed via uring_cmd's ``sqe->addr`` with the
> > > > > > > > > + following structure::
> > > > > > > >
> > > > > > > > nit: extra ":"
> > > > > > >
> > > > > > > In reStructuredText (reST), the double colon :: serves as a literal block marker to
> > > > > > > indicate preformatted text.
> > > > > > >
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > + struct ublk_auto_buf_reg {
> > > > > > > > > + __u16 index; /* Buffer index for registration */
> > > > > > > > > + __u8 flags; /* Registration flags */
> > > > > > > > > + __u8 reserved0; /* Reserved for future use */
> > > > > > > > > + __u32 reserved1; /* Reserved for future use */
> > > > > > > > > + };
> > > > > > > >
> > > > > > > > Suggest using ublk_auto_buf_reg_to_sqe_addr()? Otherwise, it seems
> > > > > > > > ambiguous how this struct is "passed" in sqe->addr.
> > > > > > >
> > > > > > > OK
> > > > > > >
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +4. All reserved fields in ``ublk_auto_buf_reg`` must be zeroed.
> > > > > > > > > +
> > > > > > > > > +5. Optional flags can be passed via ``ublk_auto_buf_reg.flags``.
> > > > > > > > > +
> > > > > > > > > +Fallback Behavior
> > > > > > > > > +~~~~~~~~~~~~~~~~~
> > > > > > > > > +
> > > > > > > > > +When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > > > > > +
> > > > > > > > > +1. If auto buffer registration fails:
> > > > > > > >
> > > > > > > > I would switch these. Both (1) and (2) refer to when auto buffer
> > > > > > > > registration fails. So I would expect something like:
> > > > > > > >
> > > > > > > > If auto buffer registration fails:
> > > > > > > >
> > > > > > > > 1. When ``UBLK_AUTO_BUF_REG_FALLBACK`` is enabled:
> > > > > > > > ...
> > > > > > > > 2. If fallback is not enabled:
> > > > > > > > ...
> > > > > > > >
> > > > > > > > > + - The uring_cmd is completed
> > > > > > > >
> > > > > > > > Maybe add "without registering the request buffer"?
> > > > > > > >
> > > > > > > > > + - ``UBLK_IO_F_NEED_REG_BUF`` is set in ``ublksrv_io_desc.op_flags``
> > > > > > > > > + - The ublk server must manually register the buffer
> > > > > > > >
> > > > > > > > Only if it wants a registered buffer for the ublk request. Technically
> > > > > > > > the ublk server could decide to fall back on user-copy, for example.
> > > > > > >
> > > > > > > Good catch!
> > > > > > >
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +2. If fallback is not enabled:
> > > > > > > > > + - The ublk I/O request fails silently
> > > > > > > >
> > > > > > > > "silently" is a bit ambiguous. It's certainly not silent to the
> > > > > > > > application submitting the ublk I/O. Maybe say that the ublk I/O
> > > > > > > > request fails and no uring_cmd is completed to the ublk server?
> > > > > > >
> > > > > > > Yes, but the document focus on ublk side, and the client is generic
> > > > > > > for every driver, so I guess it may be fine.
> > > > > > >
> > > > > > > >
> > > > > > > > > +
> > > > > > > > > +Limitations
> > > > > > > > > +~~~~~~~~~~~
> > > > > > > > > +
> > > > > > > > > +- Requires same ``io_ring_ctx`` for all operations
> > > > > > > >
> > > > > > > > Another limitation that prevents us from adopting the auto buffer
> > > > > > > > registration feature is the need to reserve a unique buffer table
> > > > > > > > index for every ublk tag on the io_ring_ctx. Since the io_ring_ctx
> > > > > > > > buffer table has a max size of 16K (could potentially be increased to
> > > > > > > > 64K), this limit is easily reached when there are a large number of
> > > > > > > > ublk devices or the ublk queue depth is large. I think we could remove
> > > > > > > > this limitation in the future by adding support for allocating buffer
> > > > > > > > indices on demand, analogous to IORING_FILE_INDEX_ALLOC.
> > > > > > >
> > > > > > > OK.
> > > > > > >
> > > > > > > But I guess it isn't big deal in reality since the task context should
> > > > > > > be saturated easily with so big setting.
> > > > > >
> > > > > > I don't know about your "reality" but it's certainly a big deal for us :)
> > > > > > To reduce contention on the blk-mq queues for the application
> > > > > > submitting I/O to the ublk devices, we want a large number of queues
> > > > > > for each ublk device. But we also want a large queue depth for each
> > > > > > individual queue to avoid the async request allocation path in case
> > > > > > any one application thread issues a lot of concurrent I/O to a single
> > > > > > ublk device. And we have 128 ublk devices, which again all want large
> > > > > > queue depths in case the application sends a lot of I/O to a single
> > > > > > ublk device. The result is that concurrently each ublk server thread
> > > > > > fetches 512K ublk I/Os, which is significantly above the io_ring_ctx
> > > > > > buffer table limit.
> > > > >
> > > > > Yes, you can setup 512K I/Os in single task/io_uring context, but how many
> > > > > can be actively handled during unit time? The number could be much less than
> > > > > 512k or 16K, because it is a single pthread/io_uring/cpu core, which may be
> > > > > saturated easily, so most of these IOs may wait somewhere for cpu or whatever
> > > > > resource.
> > > >
> > > > Yes, that's exactly my point. Our ublk server only allocates enough
> > > > resources to handle 4K concurrent I/Os per thread. But since we don't
> > > > know which ublk devices or queues might receive the I/Os, we have to
> > > > fetch a queue depth of 4K on *every* ublk device queue. Perhaps the
> > > > batched approach you're working on will help here. But for now, the
> > > > total number of fetched ublk I/Os is an obstacle to adopting auto
> > > > buffer registration.
> > >
> > > oops, I forget the point that buffer index has to be provided beforehand,
> > > that is really one limit for your case with too many IOs in single uring
> > > context.
> > >
> > > The batched approach may not help too because the model is to issue command
> > > beforehand for fetching new io command.
> > >
> > > > And waiting to allocate the buffer index until an
> > > > incoming I/O actually needs to register a buffer seems like a
> > > > straightforward way to avoid this obstacle.
> > >
> > > One way is to rely on bpf program to allocate & provide buffer index via
> > > struct_ops, which can be called exactly before registering & unregistering
> > > io buffer. The concept should be simple, but the whole implementation may
> > > take some effort(most are boiler plate).
> >
> > A BPF program feels overly complex. Ideally the ublk server could
> > create a sparse buffer table and just let io_uring allocate an unused
> > buffer index for each incoming ublk I/O and return it in the io_uring
> > CQE. This is basically identical to IORING_FILE_INDEX_ALLOC, except
> > for registered buffers instead of registered files. It would require a
> > change in io_uring to support allocating a registered buffer index on
> > demand, but hopefully not too much work to leverage what already
> > exists for registered files. And the ublk server would of course have
> > to set UBLK_AUTO_BUF_REG_FALLBACK to gracefully handle buffer index
> > allocation failures if the client application issues more concurrent
> > I/Os than there are available buffer indices.
>
> Indeed, it may be simpler than IORING_FILE_INDEX_ALLOC, since it needn't
> to expose as uapi, the user can be just io_buffer_register_bvec().
>
> Care to make a patch?
Sure, can do. Give me a couple days, I have a number of other things
on my plate.
Best,
Caleb
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-06-13 1:57 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-09 12:14 [PATCH] ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG) Ming Lei
2025-06-09 22:29 ` Caleb Sander Mateos
2025-06-10 2:06 ` Ming Lei
2025-06-11 15:54 ` Caleb Sander Mateos
2025-06-12 3:16 ` Ming Lei
2025-06-12 14:38 ` Caleb Sander Mateos
2025-06-13 1:18 ` Ming Lei
2025-06-13 1:36 ` Caleb Sander Mateos
2025-06-13 1:53 ` Ming Lei
2025-06-13 1:57 ` Caleb Sander Mateos
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).