* ublk: RFC fetch_req_multishot
@ 2025-04-24 18:19 Ofer Oshri
2025-04-24 18:28 ` Caleb Sander Mateos
2025-04-25 4:10 ` Ming Lei
0 siblings, 2 replies; 9+ messages in thread
From: Ofer Oshri @ 2025-04-24 18:19 UTC (permalink / raw)
To: linux-block@vger.kernel.org
Cc: ming.lei@redhat.com, axboe@kernel.dk, Jared Holzman, Yoav Cohen,
Guy Eisenberg, Omri Levi
Hi,
Our code uses a single io_uring per core, which is shared among all block devices - meaning each block device on a core uses the same io_uring.
Let’s say the size of the io_uring is N. Each block device submits M UBLK_U_IO_FETCH_REQ requests. As a result, with the current implementation, we can only support up to P block devices, where P = N / M. This means that when we attempt to support block device P+1, it will fail due to io_uring exhaustion.
To address this, we’d like to propose an enhancement to the ublk driver. The idea is inspired by the multi-shot concept, where a single request allows multiple replies.
We propose adding:
1. A method to register a pool of ublk_io commands.
2. A new UBLK_U_IO_FETCH_REQ_MULTISHOT operation, where a pool of ublk_io commands is bound to a block device. Then, upon receiving a new BIO, the ublk driver can select a reply from the pre-registered pool and push it to the io_uring.
3. A new UBLK_U_IO_COMMIT_REQ command to explicitly mark the completion of a request. In this case, the ublk driver returns the request to the pool. We can retain the existing UBLK_U_IO_COMMIT_AND_FETCH_REQ command, but for multi-shot scenarios, the “FETCH” operation would simply mean returning the request to the pool.
What are your thoughts on this approach?
Ofer
* Re: ublk: RFC fetch_req_multishot
2025-04-24 18:19 ublk: RFC fetch_req_multishot Ofer Oshri
@ 2025-04-24 18:28 ` Caleb Sander Mateos
2025-04-24 19:07 ` Ofer Oshri
[not found] ` <IA1PR12MB60672D37508D641368D211B8B6852@IA1PR12MB6067.namprd12.prod.outlook.com>
2025-04-25 4:10 ` Ming Lei
1 sibling, 2 replies; 9+ messages in thread
From: Caleb Sander Mateos @ 2025-04-24 18:28 UTC (permalink / raw)
To: Ofer Oshri
Cc: linux-block@vger.kernel.org, ming.lei@redhat.com, axboe@kernel.dk,
Jared Holzman, Yoav Cohen, Guy Eisenberg, Omri Levi
On Thu, Apr 24, 2025 at 11:19 AM Ofer Oshri <ofer@nvidia.com> wrote:
>
> Hi,
>
> Our code uses a single io_uring per core, which is shared among all block devices - meaning each block device on a core uses the same io_uring.
>
> Let’s say the size of the io_uring is N. Each block device submits M UBLK_U_IO_FETCH_REQ requests. As a result, with the current implementation, we can only support up to P block devices, where P = N / M. This means that when we attempt to support block device P+1, it will fail due to io_uring exhaustion.
What do you mean by "size of the io_uring", the submission queue size?
Why can't you submit all P * M UBLK_U_IO_FETCH_REQ operations in
batches of N?
Best,
Caleb
* Re: ublk: RFC fetch_req_multishot
2025-04-24 18:28 ` Caleb Sander Mateos
@ 2025-04-24 19:07 ` Ofer Oshri
[not found] ` <IA1PR12MB60672D37508D641368D211B8B6852@IA1PR12MB6067.namprd12.prod.outlook.com>
1 sibling, 0 replies; 9+ messages in thread
From: Ofer Oshri @ 2025-04-24 19:07 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: linux-block@vger.kernel.org, ming.lei@redhat.com, axboe@kernel.dk,
Jared Holzman, Yoav Cohen, Guy Eisenberg, Omri Levi
________________________________________
From: Caleb Sander Mateos <csander@purestorage.com>
Sent: Thursday, April 24, 2025 9:28 PM
To: Ofer Oshri <ofer@nvidia.com>
Cc: linux-block@vger.kernel.org <linux-block@vger.kernel.org>; ming.lei@redhat.com <ming.lei@redhat.com>; axboe@kernel.dk <axboe@kernel.dk>; Jared Holzman <jholzman@nvidia.com>; Yoav Cohen <yoav@nvidia.com>; Guy Eisenberg <geisenberg@nvidia.com>; Omri Levi <omril@nvidia.com>
Subject: Re: ublk: RFC fetch_req_multishot
External email: Use caution opening links or attachments
On Thu, Apr 24, 2025 at 11:19 AM Ofer Oshri <ofer@nvidia.com> wrote:
>
> Hi,
>
> Our code uses a single io_uring per core, which is shared among all block devices - meaning each block device on a core uses the same io_uring.
>
> Let’s say the size of the io_uring is N. Each block device submits M UBLK_U_IO_FETCH_REQ requests. As a result, with the current implementation, we can only support up to P block devices, where P = N / M. This means that when we attempt to support block device P+1, it will fail due to io_uring exhaustion.
What do you mean by "size of the io_uring", the submission queue size?
Why can't you submit all P * M UBLK_U_IO_FETCH_REQ operations in
batches of N?
Best,
Caleb
N is the size of the submission queue, and P is not fixed and is unknown at the time of ring initialization.
* Re: ublk: RFC fetch_req_multishot
[not found] ` <IA1PR12MB60672D37508D641368D211B8B6852@IA1PR12MB6067.namprd12.prod.outlook.com>
@ 2025-04-24 19:07 ` Caleb Sander Mateos
2025-04-24 21:07 ` Jared Holzman
2025-04-25 5:23 ` Ming Lei
0 siblings, 2 replies; 9+ messages in thread
From: Caleb Sander Mateos @ 2025-04-24 19:07 UTC (permalink / raw)
To: Ofer Oshri
Cc: linux-block@vger.kernel.org, ming.lei@redhat.com, axboe@kernel.dk,
Jared Holzman, Yoav Cohen, Guy Eisenberg, Omri Levi
On Thu, Apr 24, 2025 at 11:58 AM Ofer Oshri <ofer@nvidia.com> wrote:
>
> On Thu, Apr 24, 2025 at 11:19 AM Ofer Oshri <ofer@nvidia.com> wrote:
> >
> > Hi,
> >
> > Our code uses a single io_uring per core, which is shared among all block devices - meaning each block device on a core uses the same io_uring.
> >
> > Let’s say the size of the io_uring is N. Each block device submits M UBLK_U_IO_FETCH_REQ requests. As a result, with the current implementation, we can only support up to P block devices, where P = N / M. This means that when we attempt to support block device P+1, it will fail due to io_uring exhaustion.
>
> What do you mean by "size of the io_uring", the submission queue size?
> Why can't you submit all P * M UBLK_U_IO_FETCH_REQ operations in
> batches of N?
>
> Best,
> Caleb
>
> N is the size of the submission queue, and P is not fixed and unknown at the time of ring initialization....
I don't think it matters whether P (the number of ublk devices) is
known ahead of time or changes dynamically. My point is that you can
submit the UBLK_U_IO_FETCH_REQ operations in batches of N to avoid
exceeding the io_uring SQ depth. (If there are other operations
potentially interleaved with the UBLK_U_IO_FETCH_REQ ones, then just
submit each time the io_uring SQ fills up.) Any values of P, M, and N
should work. Perhaps I'm misunderstanding you, because I don't know
what "io_uring exhaustion" refers to.
Multishot ublk io_uring operations don't seem like a trivial feature
to implement. Currently, incoming ublk requests are posted to the ublk
server using io_uring's "task work" mechanism, which inserts the
io_uring operation into an intrusive linked list. If you wanted a
single ublk io_uring operation to post multiple completions, it would
need to allocate some structure for each incoming request to insert
into the task work list. There is also an assumption that the ublk
io_uring operations correspond 1-1 with the blk-mq requests for the
ublk device, which would be broken by multishot ublk io_uring
operations.
Best,
Caleb
* Re: ublk: RFC fetch_req_multishot
2025-04-24 19:07 ` Caleb Sander Mateos
@ 2025-04-24 21:07 ` Jared Holzman
2025-04-24 21:52 ` Caleb Sander Mateos
2025-04-25 5:23 ` Ming Lei
1 sibling, 1 reply; 9+ messages in thread
From: Jared Holzman @ 2025-04-24 21:07 UTC (permalink / raw)
To: Caleb Sander Mateos, Ofer Oshri
Cc: linux-block@vger.kernel.org, ming.lei@redhat.com, axboe@kernel.dk,
Yoav Cohen, Guy Eisenberg, Omri Levi
On 24/04/2025 22:07, Caleb Sander Mateos wrote:
> On Thu, Apr 24, 2025 at 11:58 AM Ofer Oshri <ofer@nvidia.com> wrote:
>>
>> On Thu, Apr 24, 2025 at 11:19 AM Ofer Oshri <ofer@nvidia.com> wrote:
>>>
>>> Hi,
>>>
>>> Our code uses a single io_uring per core, which is shared among all block devices - meaning each block device on a core uses the same io_uring.
>>>
>>> Let’s say the size of the io_uring is N. Each block device submits M UBLK_U_IO_FETCH_REQ requests. As a result, with the current implementation, we can only support up to P block devices, where P = N / M. This means that when we attempt to support block device P+1, it will fail due to io_uring exhaustion.
>>
>> What do you mean by "size of the io_uring", the submission queue size?
>> Why can't you submit all P * M UBLK_U_IO_FETCH_REQ operations in
>> batches of N?
>>
>> Best,
>> Caleb
>>
>> N is the size of the submission queue, and P is not fixed and unknown at the time of ring initialization....
>
> I don't think it matters whether P (the number of ublk devices) is
> known ahead of time or changes dynamically. My point is that you can
> submit the UBLK_U_IO_FETCH_REQ operations in batches of N to avoid
> exceeding the io_uring SQ depth. (If there are other operations
> potentially interleaved with the UBLK_U_IO_FETCH_REQ ones, then just
> submit each time the io_uring SQ fills up.) Any values of P, M, and N
> should work. Perhaps I'm misunderstanding you, because I don't know
> what "io_uring exhaustion" refers to.
>
> Multishot ublk io_uring operations don't seem like a trivial feature
> to implement. Currently, incoming ublk requests are posted to the ublk
> server using io_uring's "task work" mechanism, which inserts the
> io_uring operation into an intrusive linked list. If you wanted a
> single ublk io_uring operation to post multiple completions, it would
> need to allocate some structure for each incoming request to insert
> into the task work list. There is also an assumption that the ublk
> io_uring operations correspond 1-1 with the blk-mq requests for the
> ublk device, which would be broken by multishot ublk io_uring
> operations.
>
> Best,
> Caleb
Hi Caleb,
I think what Ofer is trying to say is that we have a scaling issue.
Our deployment could consist of 100s of ublk devices, not all of which will be dispatching IO at the same time. If we were to submit the maximum number of IO requests that our application can handle for every ublk device we need to deploy, the memory requirements would be excessive.
For this reason, we would prefer to have a global pool of IO requests that can be registered with the ublk-control device that each of the ublk devices registered to it can use.
We understand this is a complex undertaking and would be willing to do the work ourselves, but before we start, we want to know whether the requirement is reasonable enough for our changes to be accepted upstream.
Regards,
Jared
* Re: ublk: RFC fetch_req_multishot
2025-04-24 21:07 ` Jared Holzman
@ 2025-04-24 21:52 ` Caleb Sander Mateos
0 siblings, 0 replies; 9+ messages in thread
From: Caleb Sander Mateos @ 2025-04-24 21:52 UTC (permalink / raw)
To: Jared Holzman
Cc: Ofer Oshri, linux-block@vger.kernel.org, ming.lei@redhat.com,
axboe@kernel.dk, Yoav Cohen, Guy Eisenberg, Omri Levi
On Thu, Apr 24, 2025 at 2:07 PM Jared Holzman <jholzman@nvidia.com> wrote:
>
> On 24/04/2025 22:07, Caleb Sander Mateos wrote:
> > On Thu, Apr 24, 2025 at 11:58 AM Ofer Oshri <ofer@nvidia.com> wrote:
> >>
> >> On Thu, Apr 24, 2025 at 11:19 AM Ofer Oshri <ofer@nvidia.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> Our code uses a single io_uring per core, which is shared among all block devices - meaning each block device on a core uses the same io_uring.
> >>>
> >>> Let’s say the size of the io_uring is N. Each block device submits M UBLK_U_IO_FETCH_REQ requests. As a result, with the current implementation, we can only support up to P block devices, where P = N / M. This means that when we attempt to support block device P+1, it will fail due to io_uring exhaustion.
> >>
> >> What do you mean by "size of the io_uring", the submission queue size?
> >> Why can't you submit all P * M UBLK_U_IO_FETCH_REQ operations in
> >> batches of N?
> >>
> >> Best,
> >> Caleb
> >>
> >> N is the size of the submission queue, and P is not fixed and unknown at the time of ring initialization....
> >
> > I don't think it matters whether P (the number of ublk devices) is
> > known ahead of time or changes dynamically. My point is that you can
> > submit the UBLK_U_IO_FETCH_REQ operations in batches of N to avoid
> > exceeding the io_uring SQ depth. (If there are other operations
> > potentially interleaved with the UBLK_U_IO_FETCH_REQ ones, then just
> > submit each time the io_uring SQ fills up.) Any values of P, M, and N
> > should work. Perhaps I'm misunderstanding you, because I don't know
> > what "io_uring exhaustion" refers to.
> >
> > Multishot ublk io_uring operations don't seem like a trivial feature
> > to implement. Currently, incoming ublk requests are posted to the ublk
> > server using io_uring's "task work" mechanism, which inserts the
> > io_uring operation into an intrusive linked list. If you wanted a
> > single ublk io_uring operation to post multiple completions, it would
> > need to allocate some structure for each incoming request to insert
> > into the task work list. There is also an assumption that the ublk
> > io_uring operations correspond 1-1 with the blk-mq requests for the
> > ublk device, which would be broken by multishot ublk io_uring
> > operations.
> >
> > Best,
> > Caleb
>
> Hi Caleb,
>
> I think what Ofer is trying to say is that we have a scaling issue.
>
> Our deployment could consist of 100s of ublk devices, not all of which will be dispatching IO at the same time. If we were to submit the maximum number of IO requests that our application can handle for every ublk device we need to deploy, the memory requirements would be excessive.
Thanks, I see what you mean. Yes, it's certainly a reasonable concern
in principle. The memory requirements may not be as steep as you
imagine. We have a similar architecture and haven't encountered any
issues. Each of our machines has 100+ ublk devices, each with 16
queues, with the maximum of 4096 requests per queue. The per-I/O state
for ublk and io_uring is pretty small; it's nowhere near our biggest
consumer of RAM.
>
> For this reason, we would prefer to have a global pool of IO requests that can be registered with the ublk-control device that each of the ublk devices registered to it can use.
It could probably work, but I think there are some details to iron
out. First of all, a global pool wouldn't work if there are multiple
ublk server applications whose I/Os should be isolated from each
other. And to get decent performance, I think you would definitely
want to partition these I/O request pools to avoid contention. A
possible approach would be to have one pool per ublk server thread.
Best,
Caleb
* Re: ublk: RFC fetch_req_multishot
2025-04-24 18:19 ublk: RFC fetch_req_multishot Ofer Oshri
2025-04-24 18:28 ` Caleb Sander Mateos
@ 2025-04-25 4:10 ` Ming Lei
1 sibling, 0 replies; 9+ messages in thread
From: Ming Lei @ 2025-04-25 4:10 UTC (permalink / raw)
To: Ofer Oshri
Cc: linux-block@vger.kernel.org, axboe@kernel.dk, Jared Holzman,
Yoav Cohen, Guy Eisenberg, Omri Levi, Caleb Sander Mateos
On Thu, Apr 24, 2025 at 06:19:29PM +0000, Ofer Oshri wrote:
> Hi,
>
> Our code uses a single io_uring per core, which is shared among all block devices - meaning each block device on a core uses the same io_uring.
>
Am I right that you are using a single io_uring to serve one hw queue of
multiple ublk devices?
> Let’s say the size of the io_uring is N. Each block device submits M UBLK_U_IO_FETCH_REQ requests. As a result, with the current implementation, we can only support up to P block devices, where P = N / M. This means that when we attempt to support block device P+1, it will fail due to io_uring exhaustion.
>
Supposing N is the SQ size, the supported count of ublk devices can be much
bigger than N/M, because any SQE is freed and available for reuse after it is
issued to the kernel; here the SQE should be free for reuse once a
UBLK_U_IO_FETCH_REQ uring_cmd has been issued to the ublk driver.
That is to say, you can queue an arbitrary number of uring_cmds with a fixed
SQ size, since N is just the submission batch size.
But it requires the ublk server implementation to flush queued SQEs whenever
io_uring_get_sqe() returns NULL.
> To address this, we’d like to propose an enhancement to the ublk driver. The idea is inspired by the multi-shot concept, where a single request allows multiple replies.
>
> We propose adding:
>
> 1. A method to register a pool of ublk_io commands.
>
> 2. Introduce a new UBLK_U_IO_FETCH_REQ_MULTISHOT operation, where a pool of ublk_io commands is bound to a block device. Then, upon receiving a new BIO, the ublk driver can select a reply from the pre-registered pool and push it to the io_uring.
>
> 3. Introduce a new UBLK_U_IO_COMMIT_REQ command to explicitly mark the completion of a request. In this case, the ublk driver returns the request to the pool. We can retain the existing UBLK_U_IO_COMMIT_AND_FETCH_REQ command, but for multi-shot scenarios, the “FETCH” operation would simply mean returning the request to the pool.
>
> What are your thoughts on this approach?
I think we need to understand the real problem you want to address
before digging into the uring_cmd pool concept.
1) to save memory when there are lots of ublk devices?
- so far, the main preallocation is from blk-mq requests, and
as Caleb mentioned, the state memory from both ublk and io_uring isn't
very big
2) need to support as many ublk devices as possible in a single io_uring
context with limited SQ/CQ size?
- it may not be a big problem, because a fixed SQ size allows issuing an
arbitrary number of uring_cmds
- but the CQ size may limit the number of completed uring_cmds notifying
incoming ublk requests; is this your problem? Jens has added ring resizing
via IORING_REGISTER_RESIZE_RINGS:
https://lore.kernel.org/io-uring/20241022021159.820925-1-axboe@kernel.dk/
3) or some other requirement?
Thanks,
Ming
* Re: ublk: RFC fetch_req_multishot
2025-04-24 19:07 ` Caleb Sander Mateos
2025-04-24 21:07 ` Jared Holzman
@ 2025-04-25 5:23 ` Ming Lei
2025-06-06 12:03 ` Ming Lei
1 sibling, 1 reply; 9+ messages in thread
From: Ming Lei @ 2025-04-25 5:23 UTC (permalink / raw)
To: Caleb Sander Mateos
Cc: Ofer Oshri, linux-block@vger.kernel.org, axboe@kernel.dk,
Jared Holzman, Yoav Cohen, Guy Eisenberg, Omri Levi
On Thu, Apr 24, 2025 at 12:07:32PM -0700, Caleb Sander Mateos wrote:
> On Thu, Apr 24, 2025 at 11:58 AM Ofer Oshri <ofer@nvidia.com> wrote:
> >
> > On Thu, Apr 24, 2025 at 11:19 AM Ofer Oshri <ofer@nvidia.com> wrote:
> > >
> > > Hi,
> > >
> > > Our code uses a single io_uring per core, which is shared among all block devices - meaning each block device on a core uses the same io_uring.
> > >
> > > Let’s say the size of the io_uring is N. Each block device submits M UBLK_U_IO_FETCH_REQ requests. As a result, with the current implementation, we can only support up to P block devices, where P = N / M. This means that when we attempt to support block device P+1, it will fail due to io_uring exhaustion.
> >
> > What do you mean by "size of the io_uring", the submission queue size?
> > Why can't you submit all P * M UBLK_U_IO_FETCH_REQ operations in
> > batches of N?
> >
> > Best,
> > Caleb
> >
> > N is the size of the submission queue, and P is not fixed and unknown at the time of ring initialization....
>
> I don't think it matters whether P (the number of ublk devices) is
> known ahead of time or changes dynamically. My point is that you can
> submit the UBLK_U_IO_FETCH_REQ operations in batches of N to avoid
> exceeding the io_uring SQ depth. (If there are other operations
> potentially interleaved with the UBLK_U_IO_FETCH_REQ ones, then just
> submit each time the io_uring SQ fills up.) Any values of P, M, and N
> should work. Perhaps I'm misunderstanding you, because I don't know
> what "io_uring exhaustion" refers to.
>
> Multishot ublk io_uring operations don't seem like a trivial feature
> to implement. Currently, incoming ublk requests are posted to the ublk
> server using io_uring's "task work" mechanism, which inserts the
> io_uring operation into an intrusive linked list. If you wanted a
> single ublk io_uring operation to post multiple completions, it would
> need to allocate some structure for each incoming request to insert
> into the task work list. There is also an assumption that the ublk
> io_uring operations correspond 1-1 with the blk-mq requests for the
> ublk device, which would be broken by multishot ublk io_uring
> operations.
For delivering ublk io commands to the ublk server, I feel multishot can be
used in the following way:
- use IORING_OP_READ_MULTISHOT to read from the ublk char device, once for
each queue; the queue id may be passed via the offset
- block in ublk_ch_read_iter() if nothing comes from this queue of the
ublk block device
- if any ublk block io comes, fill `ublksrv_io_desc` in the mmapped area, and
push the 'tag' to the read ring buffer (provided buffer)
- wake up the read IO after one whole IO batch is done
For committing the ublk io command result to the ublk driver, it can be
similar to delivery: write the 'tag' to the ublk char device via
IORING_OP_WRITE_FIXED or IORING_OP_WRITE, still per queue via the ring_buf
approach, but one mmapped buffer is needed for storing the io command result;
4 bytes should be enough for each io.
With the above way:
- read/write is used to deliver io commands & commit io command results, so
a single read/write replaces one batch of uring_cmds
- uring_cmd isn't needed any more, so the big security_uring_cmd() cost can
be avoided
- the memory footprint is reduced a lot: no extra uring_cmd for each IO
- extra task work scheduling is avoided
- io_uring exit handling can probably be simplified too
Sounds like a ublk 2.0 prototype, :-)
Thanks,
Ming
* Re: ublk: RFC fetch_req_multishot
2025-04-25 5:23 ` Ming Lei
@ 2025-06-06 12:03 ` Ming Lei
0 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2025-06-06 12:03 UTC (permalink / raw)
To: Caleb Sander Mateos, Ofer Oshri
Cc: linux-block@vger.kernel.org, axboe@kernel.dk, Jared Holzman,
Yoav Cohen, Guy Eisenberg, Omri Levi, Uday Shankar
On Fri, Apr 25, 2025 at 01:23:16PM +0800, Ming Lei wrote:
> On Thu, Apr 24, 2025 at 12:07:32PM -0700, Caleb Sander Mateos wrote:
> > On Thu, Apr 24, 2025 at 11:58 AM Ofer Oshri <ofer@nvidia.com> wrote:
> > >
> > > On Thu, Apr 24, 2025 at 11:19 AM Ofer Oshri <ofer@nvidia.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Our code uses a single io_uring per core, which is shared among all block devices - meaning each block device on a core uses the same io_uring.
> > > >
> > > > Let’s say the size of the io_uring is N. Each block device submits M UBLK_U_IO_FETCH_REQ requests. As a result, with the current implementation, we can only support up to P block devices, where P = N / M. This means that when we attempt to support block device P+1, it will fail due to io_uring exhaustion.
> > >
> > > What do you mean by "size of the io_uring", the submission queue size?
> > > Why can't you submit all P * M UBLK_U_IO_FETCH_REQ operations in
> > > batches of N?
> > >
> > > Best,
> > > Caleb
> > >
> > > N is the size of the submission queue, and P is not fixed and unknown at the time of ring initialization....
> >
> > I don't think it matters whether P (the number of ublk devices) is
> > known ahead of time or changes dynamically. My point is that you can
> > submit the UBLK_U_IO_FETCH_REQ operations in batches of N to avoid
> > exceeding the io_uring SQ depth. (If there are other operations
> > potentially interleaved with the UBLK_U_IO_FETCH_REQ ones, then just
> > submit each time the io_uring SQ fills up.) Any values of P, M, and N
> > should work. Perhaps I'm misunderstanding you, because I don't know
> > what "io_uring exhaustion" refers to.
> >
> > Multishot ublk io_uring operations don't seem like a trivial feature
> > to implement. Currently, incoming ublk requests are posted to the ublk
> > server using io_uring's "task work" mechanism, which inserts the
> > io_uring operation into an intrusive linked list. If you wanted a
> > single ublk io_uring operation to post multiple completions, it would
> > need to allocate some structure for each incoming request to insert
> > into the task work list. There is also an assumption that the ublk
> > io_uring operations correspond 1-1 with the blk-mq requests for the
> > ublk device, which would be broken by multishot ublk io_uring
> > operations.
>
> For delivering ublk io command to ublk server, I feel multishot can be
> used in the following way:
>
> - use IORING_OP_READ_MULTISHOT to read from ublk char device, do it for
> each queue, queue id may be passed via offset
>
> - block in ublk_ch_read_iter() if nothing comes from this queue of the
> ublk block device
>
> - if any ublk block io comes, fill `ublksrv_io_desc` in mmapped area, and
> push the 'tag' to the read ring buffer(provided buffer)
>
> - wakeup the read IO after one whole IO batch is done
>
> For commit ublk io command result to ublk driver, it can be similar with
> delivering by writing 'tag' to ublk char device via IORING_OP_WRITE_FIXED or
> IORING_OP_WRITE, still per queue via ring_buf approach, but need one mmapped
> buffer for storing the io command result, 4 bytes should be enough for each io.
>
> With the above way:
>
> - use read/write to deliver io command & commit io command result, so
> single read/write replaces one batch of uring_cmd
>
> - needn't uring command any more, big security_uring_cmd() cost can be avoided
>
> - memory footprint is reduced a lot, no extra uring_cmd for each IO
>
> - extra task work scheduling is avoided
>
> - Probably uring exiting handling can be simplified too.
>
>
> Sounds like ublk 2.0 prototype, :-)
I have been working towards this direction:
https://github.com/ming1/linux/commits/ublk2-cmd-batch/
by adding three new batch commands, all per-queue:
`UBLK_U_IO_FETCH_IO_CMDS`
- multishot with a provided buffer
- issued once; a CQE is posted after a new io or io batch arrives, with the
io tags filled into the provided buffer
- re-issued after the whole buffer is used up, so issue cost is reduced
- multiple `UBLK_U_IO_FETCH_IO_CMDS` are allowed to be issued concurrently
from different task contexts to support load balancing
- each `UBLK_U_IO_FETCH_IO_CMDS` can carry 'priority' info to support
prioritized scheduling; not done yet, but it should be easy to implement
`UBLK_U_IO_COMMIT_IO_CMDS`
- this command has a fixed buffer, in which the io tag, io command result
and other info (buf_index) for FETCH are provided; multiple IOs or an IO
batch are covered
`UBLK_U_IO_PREP_IO_CMDS`
- batch version of `UBLK_IO_FETCH_REQ`; it still has one fixed buffer
carrying the io tag and info for fetch, similar to `UBLK_U_IO_COMMIT_IO_CMDS`
In this way, lots of existing ublk constraints are relaxed:
- any of the three commands can be issued from any task context; there is no
per-io task or ubq_daemon limit any more. But AUTO_BUF_REG is one
exception: it requires the FETCH and COMMIT commands to be in the same
io_ring_ctx.
- it is easier to support load balancing: any IO commands fetched by a
`UBLK_U_IO_FETCH_IO_CMDS` command can be handled in the task that issued
that `UBLK_U_IO_FETCH_IO_CMDS`
- both FETCH and COMMIT are handled in a batched way, so communication cost
is reduced.
One drawback is that cost is added on the client IO issue side
(ublk_queue_rq() and ublk_queue_rqs()); the upside is that communication
cost is reduced on the ublk server side.
A simple test run on one server shows that performance is good:
- kublk (`--batch --auto_zc -q 2` vs. `--auto_zc -q 2`): ~10% IOPS improvement
The feature is still at a very early stage, and any comments are welcome!
Thanks,
Ming
end of thread, other threads:[~2025-06-06 12:03 UTC | newest]
Thread overview: 9+ messages
2025-04-24 18:19 ublk: RFC fetch_req_multishot Ofer Oshri
2025-04-24 18:28 ` Caleb Sander Mateos
2025-04-24 19:07 ` Ofer Oshri
[not found] ` <IA1PR12MB60672D37508D641368D211B8B6852@IA1PR12MB6067.namprd12.prod.outlook.com>
2025-04-24 19:07 ` Caleb Sander Mateos
2025-04-24 21:07 ` Jared Holzman
2025-04-24 21:52 ` Caleb Sander Mateos
2025-04-25 5:23 ` Ming Lei
2025-06-06 12:03 ` Ming Lei
2025-04-25 4:10 ` Ming Lei