From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>,
Ofer Oshri <ofer@nvidia.com>
Cc: "linux-block@vger.kernel.org" <linux-block@vger.kernel.org>,
"axboe@kernel.dk" <axboe@kernel.dk>,
Jared Holzman <jholzman@nvidia.com>, Yoav Cohen <yoav@nvidia.com>,
Guy Eisenberg <geisenberg@nvidia.com>,
Omri Levi <omril@nvidia.com>,
Uday Shankar <ushankar@purestorage.com>
Subject: Re: ublk: RFC fetch_req_multishot
Date: Fri, 6 Jun 2025 20:03:16 +0800 [thread overview]
Message-ID: <aELZBPmYUMHDbusQ@fedora> (raw)
In-Reply-To: <aAscRPVcTBiBHNe7@fedora>
On Fri, Apr 25, 2025 at 01:23:16PM +0800, Ming Lei wrote:
> On Thu, Apr 24, 2025 at 12:07:32PM -0700, Caleb Sander Mateos wrote:
> > On Thu, Apr 24, 2025 at 11:58 AM Ofer Oshri <ofer@nvidia.com> wrote:
> > >
> > > On Thu, Apr 24, 2025 at 11:19 AM Ofer Oshri <ofer@nvidia.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Our code uses a single io_uring per core, which is shared among all block devices - meaning each block device on a core uses the same io_uring.
> > > >
> > > > Let’s say the size of the io_uring is N. Each block device submits M UBLK_U_IO_FETCH_REQ requests. As a result, with the current implementation, we can only support up to P block devices, where P = N / M. This means that when we attempt to support block device P+1, it will fail due to io_uring exhaustion.
> > >
> > > What do you mean by "size of the io_uring", the submission queue size?
> > > Why can't you submit all P * M UBLK_U_IO_FETCH_REQ operations in
> > > batches of N?
> > >
> > > Best,
> > > Caleb
> > >
> > > N is the size of the submission queue, and P is not fixed and is unknown at the time of ring initialization....
> >
> > I don't think it matters whether P (the number of ublk devices) is
> > known ahead of time or changes dynamically. My point is that you can
> > submit the UBLK_U_IO_FETCH_REQ operations in batches of N to avoid
> > exceeding the io_uring SQ depth. (If there are other operations
> > potentially interleaved with the UBLK_U_IO_FETCH_REQ ones, then just
> > submit each time the io_uring SQ fills up.) Any values of P, M, and N
> > should work. Perhaps I'm misunderstanding you, because I don't know
> > what "io_uring exhaustion" refers to.
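For illustration, the batching scheme described above can be sketched as below. The helper name is made up, and the real io_uring calls (io_uring_get_sqe()/io_uring_submit()) are only noted in comments; only the batching arithmetic is shown:

```c
#include <assert.h>

/* Hypothetical sketch: push total_ops UBLK_U_IO_FETCH_REQ SQEs through a
 * submission queue of depth sq_depth, flushing with io_uring_submit()
 * each time the SQ fills. Returns how many submit calls were needed, so
 * any P * M total ops work with any SQ depth N. */
static unsigned submit_in_batches(unsigned total_ops, unsigned sq_depth)
{
	unsigned submits = 0, queued = 0;

	for (unsigned i = 0; i < total_ops; i++) {
		/* io_uring_get_sqe() + prep of UBLK_U_IO_FETCH_REQ here */
		if (++queued == sq_depth) {
			/* io_uring_submit(ring); */
			submits++;
			queued = 0;
		}
	}
	if (queued) {
		/* flush the final partial batch */
		submits++;
	}
	return submits;
}
```

So e.g. P = 3 devices with M = 64 fetch ops each drain through an SQ of depth 128 in two submit calls, regardless of whether P is known up front.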
> >
> > Multishot ublk io_uring operations don't seem like a trivial feature
> > to implement. Currently, incoming ublk requests are posted to the ublk
> > server using io_uring's "task work" mechanism, which inserts the
> > io_uring operation into an intrusive linked list. If you wanted a
> > single ublk io_uring operation to post multiple completions, it would
> > need to allocate some structure for each incoming request to insert
> > into the task work list. There is also an assumption that the ublk
> > io_uring operations correspond 1-1 with the blk-mq requests for the
> > ublk device, which would be broken by multishot ublk io_uring
> > operations.
>
> For delivering ublk io commands to the ublk server, I feel multishot can be
> used in the following way:
>
> - use IORING_OP_READ_MULTISHOT to read from ublk char device, do it for
> each queue, queue id may be passed via offset
>
> - block in ublk_ch_read_iter() if nothing comes from this queue of the
> ublk block device
>
> - if any ublk block io comes, fill `ublksrv_io_desc` in the mmapped area, and
>   push the 'tag' to the read ring buffer (provided buffer)
> 
> - wake up the read IO after one whole IO batch is done
>
> For committing the ublk io command result to the ublk driver, it can work
> similarly to delivery: write the 'tag' to the ublk char device via
> IORING_OP_WRITE_FIXED or IORING_OP_WRITE, still per queue via the ring_buf
> approach, but one mmapped buffer is needed for storing the io command result;
> 4 bytes should be enough for each io.
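The per-io 4-byte result area could look roughly like this; the struct name, the fixed depth, and the helper are all made up for illustration, not the actual layout:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the mmapped per-queue result area sketched
 * above: one 32-bit slot per tag, so committing a result is a single
 * store before the tag is written back to the char device. */
#define UBLK_QUEUE_DEPTH 128	/* assumed queue depth, for illustration */

struct ublk_result_area {
	int32_t result[UBLK_QUEUE_DEPTH];	/* bytes done or -errno */
};

static void commit_result(struct ublk_result_area *area,
			  uint16_t tag, int32_t res)
{
	area->result[tag] = res;
	/* the 'tag' would then be written to /dev/ublkcN via
	 * IORING_OP_WRITE / IORING_OP_WRITE_FIXED */
}
```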
>
> With the above way:
>
> - use read/write to deliver io commands & commit io command results, so a
>   single read/write replaces one batch of uring_cmds
>
> - no uring_cmd is needed any more, so the big security_uring_cmd() cost can be avoided
>
> - memory footprint is reduced a lot, no extra uring_cmd for each IO
>
> - extra task work scheduling is avoided
>
> - probably io_uring exit handling can be simplified too.
>
>
> Sounds like ublk 2.0 prototype, :-)
I have been working in this direction:

https://github.com/ming1/linux/commits/ublk2-cmd-batch/

by adding three new batch commands, all of them per-queue:
`UBLK_U_IO_FETCH_IO_CMDS`
- multishot, with a provided buffer
- issued once; a CQE is posted when a new io (or io batch) arrives, with the
  io tags filled into the provided buffer
- re-issued only after the whole buffer is used up, so issue cost is reduced
- multiple `UBLK_U_IO_FETCH_IO_CMDS` may be issued concurrently from different
  task contexts, to support load balancing
- each `UBLK_U_IO_FETCH_IO_CMDS` can carry 'priority' info to support
  prioritized scheduling; not done yet, but it should be easy to implement
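The server side of this fetch flow might look roughly as below; the function and the tag width are assumptions for illustration, since the actual provided-buffer format is defined by the branch above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of consuming one UBLK_U_IO_FETCH_IO_CMDS CQE:
 * the CQE identifies a chunk of the provided buffer holding the newly
 * fetched io tags, and the server walks the tags, dispatching each one
 * (e.g. looking up ublksrv_io_desc[tag] and handling the io). */
static size_t handle_fetched_tags(const uint16_t *tags, size_t n,
				  void (*dispatch)(uint16_t tag))
{
	for (size_t i = 0; i < n; i++)
		dispatch(tags[i]);	/* handle one fetched io command */
	return n;			/* consumed; buffer can be re-added */
}

/* demo dispatcher that just counts calls */
static unsigned dispatched;
static void count_dispatch(uint16_t tag)
{
	(void)tag;
	dispatched++;
}
```

Since the tags can land in any task's ring, this is also where the load-balancing property shows up: whichever task issued the fetch command handles the batch.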
`UBLK_U_IO_COMMIT_IO_CMDS`
- this command carries a fixed buffer that provides the io tag, the io command
  result, and other info (buf_index) for FETCH; one command covers multiple IOs
  or a whole io batch
`UBLK_U_IO_PREP_IO_CMDS`
- the batch version of `UBLK_IO_FETCH_REQ`; it also carries one fixed buffer
  holding the io tag and the fetch info, similar to `UBLK_U_IO_COMMIT_IO_CMDS`
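A per-io element of such a fixed buffer might carry fields like the following; this struct is purely illustrative (the real layout is whatever the ublk2-cmd-batch branch defines), it only shows the kind of information each entry needs:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-io element of the fixed buffer shared by
 * UBLK_U_IO_COMMIT_IO_CMDS and UBLK_U_IO_PREP_IO_CMDS. */
struct ublk_batch_elem {
	uint16_t tag;		/* which blk-mq request this entry is for */
	uint16_t buf_index;	/* registered buffer index (AUTO_BUF_REG) */
	int32_t  result;	/* commit only: bytes done or -errno */
};
```

At 8 bytes per io, one page covers a whole queue's batch even at large queue depths, which is part of why a single COMMIT can cover many IOs.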
In this way, many existing ublk constraints are relaxed:

- any of the three commands can be issued from any task context; there is no
  longer a per-io task or ubq_daemon limit. AUTO_BUF_REG is one exception,
  since it requires the FETCH and COMMIT commands to be in the same io_ring_ctx.

- load balancing becomes easier to support: any io commands fetched by one
  `UBLK_U_IO_FETCH_IO_CMDS` can be handled in the task that issued that
  `UBLK_U_IO_FETCH_IO_CMDS`

- both FETCH and COMMIT are handled in a batched way, so communication cost is
  reduced.
One drawback is that cost is added on the client io issue side (ublk_queue_rq()
and ublk_queue_rqs()); the upside is that communication cost is reduced on the
ublk server side.
A simple test run on one server shows good performance:

- kublk (`--batch --auto_zc -q 2` vs. `--auto_zc -q 2`): ~10% IOPS improvement
The feature is still at a very early stage, and any comments are welcome!
Thanks,
Ming