From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
Uday Shankar <ushankar@purestorage.com>,
Stefani Seibold <stefani@seibold.net>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing
Date: Tue, 2 Dec 2025 09:27:02 +0800 [thread overview]
Message-ID: <aS5AZjizRDt7ql3T@fedora> (raw)
In-Reply-To: <CADUfDZo0N22fp5+Si8eBE5SgRkCMsMa1VDTiLO_+zLWfUOVc9g@mail.gmail.com>
On Mon, Dec 01, 2025 at 09:51:59AM -0800, Caleb Sander Mateos wrote:
> On Mon, Dec 1, 2025 at 1:42 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Sun, Nov 30, 2025 at 09:55:47PM -0800, Caleb Sander Mateos wrote:
> > > On Thu, Nov 20, 2025 at 6:00 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > Add UBLK_U_IO_FETCH_IO_CMDS command to enable efficient batch processing
> > > > of I/O requests. This multishot uring_cmd allows the ublk server to fetch
> > > > multiple I/O commands in a single operation, significantly reducing
> > > > submission overhead compared to individual FETCH_REQ* commands.
> > > >
> > > > Key Design Features:
> > > >
> > > > 1. Multishot Operation: One UBLK_U_IO_FETCH_IO_CMDS can fetch many I/O
> > > > commands, with the batch size limited by the provided buffer length.
> > > >
> > > > 2. Dynamic Load Balancing: Multiple fetch commands can be submitted
> > > > simultaneously, but only one is active at any time. This enables
> > > > efficient load distribution across multiple server task contexts.
> > > >
> > > > 3. Implicit State Management: The implementation uses three key variables
> > > > to track state:
> > > > - evts_fifo: Queue of request tags awaiting processing
> > > > - fcmd_head: List of available fetch commands
> > > > - active_fcmd: Currently active fetch command (NULL = none active)
> > > >
> > > > States are derived implicitly:
> > > > - IDLE: No fetch commands available
> > > > - READY: Fetch commands available, none active
> > > > - ACTIVE: One fetch command processing events
> > > >
> > > > 4. Lockless Reader Optimization: The active fetch command can read from
> > > > evts_fifo without locking (single reader guarantee), while writers
> > > > (ublk_queue_rq/ublk_queue_rqs) use evts_lock protection. The memory
> > > > barrier pairing plays key role for the single lockless reader
> > > > optimization.
> > > >
> > > > Implementation Details:
> > > >
> > > > - ublk_queue_rq() and ublk_queue_rqs() save request tags to evts_fifo
> > > > - __ublk_pick_active_fcmd() selects an available fetch command when
> > > > events arrive and no command is currently active
> > >
> > > What is __ublk_pick_active_fcmd()? I don't see a function with that name.
> >
> > It is renamed as __ublk_acquire_fcmd(), and its counter pair is
> > __ublk_release_fcmd().
>
> Okay, update the commit message then?
>
> >
> > >
> > > > - ublk_batch_dispatch() moves tags from evts_fifo to the fetch command's
> > > > buffer and posts completion via io_uring_mshot_cmd_post_cqe()
> > > > - State transitions are coordinated via evts_lock to maintain consistency
> > > >
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > > drivers/block/ublk_drv.c | 412 +++++++++++++++++++++++++++++++---
> > > > include/uapi/linux/ublk_cmd.h | 7 +
> > > > 2 files changed, 388 insertions(+), 31 deletions(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index cc9c92d97349..2e5e392c939e 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -93,6 +93,7 @@
> > > >
> > > > /* ublk batch fetch uring_cmd */
> > > > struct ublk_batch_fcmd {
> > > > + struct list_head node;
> > > > struct io_uring_cmd *cmd;
> > > > unsigned short buf_group;
> > > > };
> > > > @@ -117,7 +118,10 @@ struct ublk_uring_cmd_pdu {
> > > > */
> > > > struct ublk_queue *ubq;
> > > >
> > > > - u16 tag;
> > > > + union {
> > > > + u16 tag;
> > > > + struct ublk_batch_fcmd *fcmd; /* batch io only */
> > > > + };
> > > > };
> > > >
> > > > struct ublk_batch_io_data {
> > > > @@ -229,18 +233,36 @@ struct ublk_queue {
> > > > struct ublk_device *dev;
> > > >
> > > > /*
> > > > - * Inflight ublk request tag is saved in this fifo
> > > > + * Batch I/O State Management:
> > > > + *
> > > > + * The batch I/O system uses implicit state management based on the
> > > > + * combination of three key variables below.
> > > > + *
> > > > + * - IDLE: list_empty(&fcmd_head) && !active_fcmd
> > > > + * No fetch commands available, events queue in evts_fifo
> > > > + *
> > > > + * - READY: !list_empty(&fcmd_head) && !active_fcmd
> > > > + * Fetch commands available but none processing events
> > > > *
> > > > - * There are multiple writer from ublk_queue_rq() or ublk_queue_rqs(),
> > > > - * so lock is required for storing request tag to fifo
> > > > + * - ACTIVE: active_fcmd
> > > > + * One fetch command actively processing events from evts_fifo
> > > > *
> > > > - * Make sure just one reader for fetching request from task work
> > > > - * function to ublk server, so no need to grab the lock in reader
> > > > - * side.
> > > > + * Key Invariants:
> > > > + * - At most one active_fcmd at any time (single reader)
> > > > + * - active_fcmd is always from fcmd_head list when non-NULL
> > > > + * - evts_fifo can be read locklessly by the single active reader
> > > > + * - All state transitions require evts_lock protection
> > > > + * - Multiple writers to evts_fifo require lock protection
> > > > */
> > > > struct {
> > > > DECLARE_KFIFO_PTR(evts_fifo, unsigned short);
> > > > spinlock_t evts_lock;
> > > > +
> > > > + /* List of fetch commands available to process events */
> > > > + struct list_head fcmd_head;
> > > > +
> > > > + /* Currently active fetch command (NULL = none active) */
> > > > + struct ublk_batch_fcmd *active_fcmd;
> > > > }____cacheline_aligned_in_smp;
> > > >
> > > > struct ublk_io ios[] __counted_by(q_depth);
> > > > @@ -292,12 +314,20 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> > > > static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> > > > u16 q_id, u16 tag, struct ublk_io *io, size_t offset);
> > > > static inline unsigned int ublk_req_build_flags(struct request *req);
> > > > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > + struct ublk_batch_io_data *data,
> > > > + struct ublk_batch_fcmd *fcmd);
> > > >
> > > > static inline bool ublk_dev_support_batch_io(const struct ublk_device *ub)
> > > > {
> > > > return false;
> > > > }
> > > >
> > > > +static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > > > +{
> > > > + return false;
> > > > +}
> > > > +
> > > > static inline void ublk_io_lock(struct ublk_io *io)
> > > > {
> > > > spin_lock(&io->lock);
> > > > @@ -624,13 +654,45 @@ static wait_queue_head_t ublk_idr_wq; /* wait until one idr is freed */
> > > >
> > > > static DEFINE_MUTEX(ublk_ctl_mutex);
> > > >
> > > > +static struct ublk_batch_fcmd *
> > > > +ublk_batch_alloc_fcmd(struct io_uring_cmd *cmd)
> > > > +{
> > > > + struct ublk_batch_fcmd *fcmd = kzalloc(sizeof(*fcmd), GFP_NOIO);
> > >
> > > An allocation in the I/O path seems unfortunate. Is there not room to
> > > store the struct ublk_batch_fcmd in the io_uring_cmd pdu?
> >
> > It is allocated once for one mshot request, which covers many IOs.
> >
> > It can't be held in uring_cmd pdu, but the allocation can be optimized in
> > future. Not a big deal in enablement stage.
>
> Okay, seems fine to optimize it in the future.
>
> >
> > > > +
> > > > + if (fcmd) {
> > > > + fcmd->cmd = cmd;
> > > > + fcmd->buf_group = READ_ONCE(cmd->sqe->buf_index);
> > >
> > > Is it necessary to store sample this here just to pass it back to the
> > > io_uring layer? Wouldn't the io_uring layer already have access to it
> > > in struct io_kiocb's buf_index field?
> >
> > ->buf_group is used by io_uring_cmd_buffer_select(), and this way also
> > follows ->buf_index uses in both io_uring/net.c and io_uring/rw.c.
> >
> >
> > io_ring_buffer_select(), so we can't reuse req->buf_index here.
>
> But io_uring/net.c and io_uring/rw.c both retrieve the buf_group value
> from req->buf_index instead of the SQE, for example:
> if (req->flags & REQ_F_BUFFER_SELECT)
> sr->buf_group = req->buf_index;
>
> Seems like it would make sense to do the same for
> UBLK_U_IO_FETCH_IO_CMDS. That also saves one pointer dereference here.
IMO we shouldn't encourage driver to access `io_kiocb`, however, cmd->sqe
is exposed to driver explicitly.
>
> >
> > >
> > > > + }
> > > > + return fcmd;
> > > > +}
> > > > +
> > > > +static void ublk_batch_free_fcmd(struct ublk_batch_fcmd *fcmd)
> > > > +{
> > > > + kfree(fcmd);
> > > > +}
> > > > +
> > > > +static void __ublk_release_fcmd(struct ublk_queue *ubq)
> > > > +{
> > > > + WRITE_ONCE(ubq->active_fcmd, NULL);
> > > > +}
> > > >
> > > > -static void ublk_batch_deinit_fetch_buf(const struct ublk_batch_io_data *data,
> > > > +/*
> > > > + * Nothing can move on, so clear ->active_fcmd, and the caller should stop
> > > > + * dispatching
> > > > + */
> > > > +static void ublk_batch_deinit_fetch_buf(struct ublk_queue *ubq,
> > > > + const struct ublk_batch_io_data *data,
> > > > struct ublk_batch_fcmd *fcmd,
> > > > int res)
> > > > {
> > > > + spin_lock(&ubq->evts_lock);
> > > > + list_del(&fcmd->node);
> > > > + WARN_ON_ONCE(fcmd != ubq->active_fcmd);
> > > > + __ublk_release_fcmd(ubq);
> > > > + spin_unlock(&ubq->evts_lock);
> > > > +
> > > > io_uring_cmd_done(fcmd->cmd, res, data->issue_flags);
> > > > - fcmd->cmd = NULL;
> > > > + ublk_batch_free_fcmd(fcmd);
> > > > }
> > > >
> > > > static int ublk_batch_fetch_post_cqe(struct ublk_batch_fcmd *fcmd,
> > > > @@ -1491,6 +1553,8 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > bool needs_filter;
> > > > int ret;
> > > >
> > > > + WARN_ON_ONCE(data->cmd != fcmd->cmd);
> > > > +
> > > > sel = io_uring_cmd_buffer_select(fcmd->cmd, fcmd->buf_group, &len,
> > > > data->issue_flags);
> > > > if (sel.val < 0)
> > > > @@ -1548,23 +1612,94 @@ static int __ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > return ret;
> > > > }
> > > >
> > > > -static __maybe_unused int
> > > > -ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > - const struct ublk_batch_io_data *data,
> > > > - struct ublk_batch_fcmd *fcmd)
> > > > +static struct ublk_batch_fcmd *__ublk_acquire_fcmd(
> > > > + struct ublk_queue *ubq)
> > > > +{
> > > > + struct ublk_batch_fcmd *fcmd;
> > > > +
> > > > + lockdep_assert_held(&ubq->evts_lock);
> > > > +
> > > > + /*
> > > > + * Ordering updating ubq->evts_fifo and checking ubq->active_fcmd.
> > > > + *
> > > > + * The pair is the smp_mb() in ublk_batch_dispatch().
> > > > + *
> > > > + * If ubq->active_fcmd is observed as non-NULL, the new added tags
> > > > + * can be visisible in ublk_batch_dispatch() with the barrier pairing.
> > > > + */
> > > > + smp_mb();
> > > > + if (READ_ONCE(ubq->active_fcmd)) {
> > > > + fcmd = NULL;
> > > > + } else {
> > > > + fcmd = list_first_entry_or_null(&ubq->fcmd_head,
> > > > + struct ublk_batch_fcmd, node);
> > > > + WRITE_ONCE(ubq->active_fcmd, fcmd);
> > > > + }
> > > > + return fcmd;
> > > > +}
> > > > +
> > > > +static void ublk_batch_tw_cb(struct io_uring_cmd *cmd,
> > > > + unsigned int issue_flags)
> > > > +{
> > > > + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
> > > > + struct ublk_batch_fcmd *fcmd = pdu->fcmd;
> > > > + struct ublk_batch_io_data data = {
> > > > + .ub = pdu->ubq->dev,
> > > > + .cmd = fcmd->cmd,
> > > > + .issue_flags = issue_flags,
> > > > + };
> > > > +
> > > > + WARN_ON_ONCE(pdu->ubq->active_fcmd != fcmd);
> > > > +
> > > > + ublk_batch_dispatch(pdu->ubq, &data, fcmd);
> > > > +}
> > > > +
> > > > +static void ublk_batch_dispatch(struct ublk_queue *ubq,
> > > > + struct ublk_batch_io_data *data,
> > > > + struct ublk_batch_fcmd *fcmd)
> > > > {
> > > > + struct ublk_batch_fcmd *new_fcmd;
> > >
> > > Is the new_fcmd variable necessary? Can fcmd be reused instead?
> > >
> > > > + void *handle;
> > > > + bool empty;
> > > > int ret = 0;
> > > >
> > > > +again:
> > > > while (!ublk_io_evts_empty(ubq)) {
> > > > ret = __ublk_batch_dispatch(ubq, data, fcmd);
> > > > if (ret <= 0)
> > > > break;
> > > > }
> > > >
> > > > - if (ret < 0)
> > > > - ublk_batch_deinit_fetch_buf(data, fcmd, ret);
> > > > + if (ret < 0) {
> > > > + ublk_batch_deinit_fetch_buf(ubq, data, fcmd, ret);
> > > > + return;
> > > > + }
> > > >
> > > > - return ret;
> > > > + handle = io_uring_cmd_ctx_handle(fcmd->cmd);
> > > > + __ublk_release_fcmd(ubq);
> > > > + /*
> > > > + * Order clearing ubq->active_fcmd from __ublk_release_fcmd() and
> > > > + * checking ubq->evts_fifo.
> > > > + *
> > > > + * The pair is the smp_mb() in __ublk_acquire_fcmd().
> > > > + */
> > > > + smp_mb();
> > > > + empty = ublk_io_evts_empty(ubq);
> > > > + if (likely(empty))
> > >
> > > nit: empty variable seems unnecessary
> > >
> > > > + return;
> > > > +
> > > > + spin_lock(&ubq->evts_lock);
> > > > + new_fcmd = __ublk_acquire_fcmd(ubq);
> > > > + spin_unlock(&ubq->evts_lock);
> > > > +
> > > > + if (!new_fcmd)
> > > > + return;
> > > > + if (handle == io_uring_cmd_ctx_handle(new_fcmd->cmd)) {
> > >
> > > This check seems to be meant to decide whether the new and old
> > > UBLK_U_IO_FETCH_IO_CMDS commands can execute in the same task work?
> >
> > Actually not.
> >
> > > But belonging to the same io_uring context doesn't necessarily mean
> > > that the same task issued them. It seems like it would be safer to
> > > always dispatch new_fcmd->cmd to task work.
> >
> > What matters is just that ctx->uring_lock & issue_flag matches from ublk
> > viewpoint, so it is safe to do so.
>
> Okay, that makes sense.
>
> >
> > However, given it is hit in slow path, so starting new dispatch
> > is easier.
>
> Yeah, I'd agree it makes sense to keep the unexpected path code
> simpler. There may also be fairness concerns from looping indefinitely
> here if the evts_fifo continues to be nonempty, so dispatching to task
> work seems safer.
Fair enough.
Thanks,
Ming
next prev parent reply other threads:[~2025-12-02 1:27 UTC|newest]
Thread overview: 66+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-21 1:58 [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
2025-11-21 1:58 ` [PATCH V4 01/27] kfifo: add kfifo_alloc_node() helper for NUMA awareness Ming Lei
2025-11-29 19:12 ` Caleb Sander Mateos
2025-12-01 1:46 ` Ming Lei
2025-12-01 5:58 ` Caleb Sander Mateos
2025-11-21 1:58 ` [PATCH V4 02/27] ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() Ming Lei
2025-11-21 1:58 ` [PATCH V4 03/27] ublk: add `union ublk_io_buf` with improved naming Ming Lei
2025-11-21 1:58 ` [PATCH V4 04/27] ublk: refactor auto buffer register in ublk_dispatch_req() Ming Lei
2025-11-21 1:58 ` [PATCH V4 05/27] ublk: pass const pointer to ublk_queue_is_zoned() Ming Lei
2025-11-21 1:58 ` [PATCH V4 06/27] ublk: add helper of __ublk_fetch() Ming Lei
2025-11-21 1:58 ` [PATCH V4 07/27] ublk: define ublk_ch_batch_io_fops for the coming feature F_BATCH_IO Ming Lei
2025-11-21 1:58 ` [PATCH V4 08/27] ublk: prepare for not tracking task context for command batch Ming Lei
2025-11-21 1:58 ` [PATCH V4 09/27] ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
2025-11-29 19:19 ` Caleb Sander Mateos
2025-11-21 1:58 ` [PATCH V4 10/27] ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
2025-11-29 19:47 ` Caleb Sander Mateos
2025-11-30 19:25 ` Caleb Sander Mateos
2025-11-21 1:58 ` [PATCH V4 11/27] ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
2025-11-30 16:39 ` Caleb Sander Mateos
2025-12-01 10:25 ` Ming Lei
2025-12-01 16:43 ` Caleb Sander Mateos
2025-11-21 1:58 ` [PATCH V4 12/27] ublk: add io events fifo structure Ming Lei
2025-11-30 16:53 ` Caleb Sander Mateos
2025-12-01 3:04 ` Ming Lei
2025-11-21 1:58 ` [PATCH V4 13/27] ublk: add batch I/O dispatch infrastructure Ming Lei
2025-11-30 19:24 ` Caleb Sander Mateos
2025-11-30 21:37 ` Caleb Sander Mateos
2025-12-01 2:32 ` Ming Lei
2025-12-01 17:37 ` Caleb Sander Mateos
2025-11-21 1:58 ` [PATCH V4 14/27] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing Ming Lei
2025-12-01 5:55 ` Caleb Sander Mateos
2025-12-01 9:41 ` Ming Lei
2025-12-01 17:51 ` Caleb Sander Mateos
2025-12-02 1:27 ` Ming Lei [this message]
2025-12-02 1:39 ` Caleb Sander Mateos
2025-12-02 8:14 ` Ming Lei
2025-12-02 15:20 ` Caleb Sander Mateos
2025-11-21 1:58 ` [PATCH V4 15/27] ublk: abort requests filled in event kfifo Ming Lei
2025-12-01 18:52 ` Caleb Sander Mateos
2025-12-02 1:29 ` Ming Lei
2025-12-01 19:00 ` Caleb Sander Mateos
2025-11-21 1:58 ` [PATCH V4 16/27] ublk: add new feature UBLK_F_BATCH_IO Ming Lei
2025-12-01 21:16 ` Caleb Sander Mateos
2025-12-02 1:44 ` Ming Lei
2025-12-02 16:05 ` Caleb Sander Mateos
2025-12-03 2:21 ` Ming Lei
2025-11-21 1:58 ` [PATCH V4 17/27] ublk: document " Ming Lei
2025-12-01 21:46 ` Caleb Sander Mateos
2025-12-02 1:55 ` Ming Lei
2025-12-02 2:03 ` Ming Lei
2025-11-21 1:58 ` [PATCH V4 18/27] ublk: implement batch request completion via blk_mq_end_request_batch() Ming Lei
2025-12-01 21:55 ` Caleb Sander Mateos
2025-11-21 1:58 ` [PATCH V4 19/27] selftests: ublk: fix user_data truncation for tgt_data >= 256 Ming Lei
2025-11-21 1:58 ` [PATCH V4 20/27] selftests: ublk: replace assert() with ublk_assert() Ming Lei
2025-11-21 1:58 ` [PATCH V4 21/27] selftests: ublk: add ublk_io_buf_idx() for returning io buffer index Ming Lei
2025-11-21 1:58 ` [PATCH V4 22/27] selftests: ublk: add batch buffer management infrastructure Ming Lei
2025-11-21 1:58 ` [PATCH V4 23/27] selftests: ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
2025-11-21 1:58 ` [PATCH V4 24/27] selftests: ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
2025-11-21 1:58 ` [PATCH V4 25/27] selftests: ublk: handle UBLK_U_IO_FETCH_IO_CMDS Ming Lei
2025-11-21 1:58 ` [PATCH V4 26/27] selftests: ublk: add --batch/-b for enabling F_BATCH_IO Ming Lei
2025-11-21 1:58 ` [PATCH V4 27/27] selftests: ublk: support arbitrary threads/queues combination Ming Lei
2025-11-28 11:59 ` [PATCH V4 00/27] ublk: add UBLK_F_BATCH_IO Ming Lei
2025-11-28 16:19 ` Jens Axboe
2025-11-28 19:07 ` Caleb Sander Mateos
2025-11-29 1:24 ` Ming Lei
2025-11-28 16:22 ` (subset) " Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=aS5AZjizRDt7ql3T@fedora \
--to=ming.lei@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=axboe@kernel.dk \
--cc=csander@purestorage.com \
--cc=linux-block@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=stefani@seibold.net \
--cc=ushankar@purestorage.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox