From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org,
Uday Shankar <ushankar@purestorage.com>
Subject: Re: [PATCH 08/23] ublk: handle UBLK_U_IO_PREP_IO_CMDS
Date: Thu, 16 Oct 2025 18:08:44 +0800 [thread overview]
Message-ID: <aPDELNXqlckznZJI@fedora> (raw)
In-Reply-To: <CADUfDZoQwNhQncLYi-AZZGpPSackgKejXYoBZh43sciCAUDGCg@mail.gmail.com>
On Thu, Sep 18, 2025 at 11:12:00AM -0700, Caleb Sander Mateos wrote:
> On Tue, Sep 9, 2025 at 8:56 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Sat, Sep 06, 2025 at 12:48:41PM -0700, Caleb Sander Mateos wrote:
> > > On Mon, Sep 1, 2025 at 3:03 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > This commit implements the handling of the UBLK_U_IO_PREP_IO_CMDS command,
> > > > which allows userspace to prepare a batch of I/O requests.
> > > >
> > > > The core of this change is the `ublk_walk_cmd_buf` function, which iterates
> > > > over the elements in the uring_cmd fixed buffer. For each element, it parses
> > > > the I/O details, finds the corresponding `ublk_io` structure, and prepares it
> > > > for future dispatch.
> > > >
> > > > Add a per-io lock to protect concurrent delivery and commit.
> > > >
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > > drivers/block/ublk_drv.c | 191 +++++++++++++++++++++++++++++++++-
> > > > include/uapi/linux/ublk_cmd.h | 5 +
> > > > 2 files changed, 195 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index 4da0dbbd7e16..a4bae3d1562a 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -116,6 +116,10 @@ struct ublk_uring_cmd_pdu {
> > > > struct ublk_batch_io_data {
> > > > struct ublk_queue *ubq;
> > > > struct io_uring_cmd *cmd;
> > > > + unsigned int issue_flags;
> > > > +
> > > > + /* set when walking the element buffer */
> > > > + const struct ublk_elem_header *elem;
> > > > };
> > > >
> > > > /*
> > > > @@ -200,6 +204,7 @@ struct ublk_io {
> > > > unsigned task_registered_buffers;
> > > >
> > > > void *buf_ctx_handle;
> > > > + spinlock_t lock;
> > >
> > > From our experience writing a high-throughput ublk server, the
> > > spinlocks and mutexes in the kernel are some of the largest CPU
> > > hotspots. We have spent a lot of effort working to avoid locking where
> > > possible or shard data structures to reduce contention on the locks.
> > > Even uncontended locks are still very expensive to acquire and release
> > > on machines with many CPUs due to the cache coherency overhead. ublk's
> > > per-io daemon architecture is great for performance by removing the
> >
> > io-uring depends heavily on batched submission and completion, but a per-io
> > daemon can easily break that batching, because it doesn't guarantee that one
> > batch of IOs is forwarded in a single io task/io_uring when the static tag
> > mapping policy is used. For example:
>
> That's a good point. We've mainly focused on optimizing the ublk
> server side, but it's true that distributing incoming ublk I/Os to
> more ublk server threads adds overhead on the submitting side. One
> idea we had but haven't experimented with much is for the ublk server
> to perform the round-robin assignment of tags within each queue to
Round-robin often hurts performance, and it isn't enabled yet.
> threads in larger chunks. For example, with a chunk size of 4, tags 0
> to 3 would be assigned to thread 0, tags 4 to 7 would be assigned to
> thread 1, etc. That would improve the batching of ublk I/Os when
> dispatching them from the submitting CPU to the ublk server thread.
> There's an inherent tradeoff where distributing tags to ublk server
> threads in larger chunks makes the distribution less balanced for
> small numbers of I/Os, but it will be balanced when averaged over
> large numbers of I/Os.
How can a fixed chunk size work generically? It depends on the workload's batch
size on /dev/ublkbN, and different workloads have different batch sizes.
>
> >
> > ```
> > [root@ktest-40 ublk]# ./kublk add -t null --nthreads 8 -q 4 --per_io_tasks
> > dev id 0: nr_hw_queues 4 queue_depth 128 block size 512 dev_capacity 524288000
> > max rq size 1048576 daemon pid 89975 flags 0x6042 state LIVE
> > queue 0: affinity(0 )
> > queue 1: affinity(4 )
> > queue 2: affinity(8 )
> > queue 3: affinity(12 )
> > [root@ktest-40 ublk]#
> > [root@ktest-40 ublk]# ./kublk add -t null -q 4
> > dev id 1: nr_hw_queues 4 queue_depth 128 block size 512 dev_capacity 524288000
> > max rq size 1048576 daemon pid 90002 flags 0x6042 state LIVE
> > queue 0: affinity(0 )
> > queue 1: affinity(4 )
> > queue 2: affinity(8 )
> > queue 3: affinity(12 )
> > [root@ktest-40 ublk]#
> > [root@ktest-40 ublk]# ~/git/fio/t/io_uring -p0 /dev/ublkb0
> > submitter=0, tid=90024, file=/dev/ublkb0, nfiles=1, node=-1
> > polled=0, fixedbufs=1, register_files=1, buffered=0, QD=128
> > Engine=io_uring, sq_ring=128, cq_ring=128
> > IOPS=188.54K, BW=736MiB/s, IOS/call=32/31
> > IOPS=187.90K, BW=734MiB/s, IOS/call=32/32
> > IOPS=195.39K, BW=763MiB/s, IOS/call=32/32
> > ^CExiting on signal
> > Maximum IOPS=195.39K
> >
> > [root@ktest-40 ublk]# ~/git/fio/t/io_uring -p0 /dev/ublkb1
> > submitter=0, tid=90026, file=/dev/ublkb1, nfiles=1, node=-1
> > polled=0, fixedbufs=1, register_files=1, buffered=0, QD=128
> > Engine=io_uring, sq_ring=128, cq_ring=128
> > IOPS=608.26K, BW=2.38GiB/s, IOS/call=32/31
> > IOPS=586.59K, BW=2.29GiB/s, IOS/call=32/31
> > IOPS=599.62K, BW=2.34GiB/s, IOS/call=32/32
> > ^CExiting on signal
> > Maximum IOPS=608.26K
> >
> > ```
> >
> >
> > > need for locks in the I/O path. I can't really see us adopting this
> > > ublk batching feature; adding a spin_lock() + spin_unlock() to every
> > > ublk commit operation is not worth the reduction in io_uring SQEs and
> > > uring_cmds.
> >
> > As I mentioned in the cover letter, the per-io lock can be avoided for UBLK_F_PER_IO_DAEMON
> > as a follow-up, since io->task is still there to help track the task context.
> >
> > I just want to avoid too many features in the enablement stage, which is also
> > why the spin lock is wrapped in a helper.
>
> Okay, good to know there's at least an idea for how to avoid the
> spinlock. Makes sense to defer it to follow-on work.
>
> >
> > >
> > > > } ____cacheline_aligned_in_smp;
> > > >
> > > > struct ublk_queue {
> > > > @@ -276,6 +281,16 @@ static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > > > return false;
> > > > }
> > > >
> > > > +static inline void ublk_io_lock(struct ublk_io *io)
> > > > +{
> > > > + spin_lock(&io->lock);
> > > > +}
> > > > +
> > > > +static inline void ublk_io_unlock(struct ublk_io *io)
> > > > +{
> > > > + spin_unlock(&io->lock);
> > > > +}
> > > > +
> > > > static inline struct ublksrv_io_desc *
> > > > ublk_get_iod(const struct ublk_queue *ubq, unsigned tag)
> > > > {
> > > > @@ -2538,6 +2553,171 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
> > > > return ublk_ch_uring_cmd_local(cmd, issue_flags);
> > > > }
> > > >
> > > > +static inline __u64 ublk_batch_buf_addr(const struct ublk_batch_io *uc,
> > > > + const struct ublk_elem_header *elem)
> > > > +{
> > > > + const void *buf = (const void *)elem;
> > >
> > > Don't need an explicit cast in order to cast to void *.
> >
> > OK.
> >
> > >
> > >
> > > > +
> > > > + if (uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR)
> > > > + return *(__u64 *)(buf + sizeof(*elem));
> > > > + return -1;
> > >
> > > Why -1 and not 0? ublk_check_fetch_buf() is expecting a 0 buf_addr to
> > > indicate the lack
> >
> > Good catch, it needs to return 0.
> >
> > >
> > > > +}
> > > > +
> > > > +static struct ublk_auto_buf_reg
> > > > +ublk_batch_auto_buf_reg(const struct ublk_batch_io *uc,
> > > > + const struct ublk_elem_header *elem)
> > > > +{
> > > > + struct ublk_auto_buf_reg reg = {
> > > > + .index = elem->buf_index,
> > > > + .flags = (uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) ?
> > > > + UBLK_AUTO_BUF_REG_FALLBACK : 0,
> > > > + };
> > > > +
> > > > + return reg;
> > > > +}
> > > > +
> > > > +/* 48 can cover any type of buffer element(8, 16 and 24 bytes) */
> > >
> > > "can cover" is a bit vague. Can you be explicit that the buffer size
> > > needs to be a multiple of any possible buffer element size?
> >
> > I should have documented that 48 is the least common multiple (LCM) of 8, 16
> > and 24.
> >
> > >
> > > > +#define UBLK_CMD_BATCH_TMP_BUF_SZ (48 * 10)
> > > > +struct ublk_batch_io_iter {
> > > > + /* copy to this buffer from iterator first */
> > > > + unsigned char buf[UBLK_CMD_BATCH_TMP_BUF_SZ];
> > > > + struct iov_iter iter;
> > > > + unsigned done, total;
> > > > + unsigned char elem_bytes;
> > > > +};
> > > > +
> > > > +static int __ublk_walk_cmd_buf(struct ublk_batch_io_iter *iter,
> > > > + struct ublk_batch_io_data *data,
> > > > + unsigned bytes,
> > > > + int (*cb)(struct ublk_io *io,
> > > > + const struct ublk_batch_io_data *data))
> > > > +{
> > > > + int i, ret = 0;
> > > > +
> > > > + for (i = 0; i < bytes; i += iter->elem_bytes) {
> > > > + const struct ublk_elem_header *elem =
> > > > + (const struct ublk_elem_header *)&iter->buf[i];
> > > > + struct ublk_io *io;
> > > > +
> > > > + if (unlikely(elem->tag >= data->ubq->q_depth)) {
> > > > + ret = -EINVAL;
> > > > + break;
> > > > + }
> > > > +
> > > > + io = &data->ubq->ios[elem->tag];
> > > > + data->elem = elem;
> > > > + ret = cb(io, data);
> > >
> > > Why not just pass elem as a separate argument to the callback?
> >
> > One reason is that we don't have a complete type for 'elem', since its size
> > is variable.
>
> I didn't mean to pass ublk_elem_header by value, still by pointer.
> Just that you could pass const struct ublk_elem_header *elem as an
> additional parameter to the callback. I think that would make the code
> a bit easier to follow than passing it via data->elem.
OK.
>
> >
> > >
> > > > + if (unlikely(ret))
> > > > + break;
> > > > + }
> > > > + iter->done += i;
> > > > + return ret;
> > > > +}
> > > > +
> > > > +static int ublk_walk_cmd_buf(struct ublk_batch_io_iter *iter,
> > > > + struct ublk_batch_io_data *data,
> > > > + int (*cb)(struct ublk_io *io,
> > > > + const struct ublk_batch_io_data *data))
> > > > +{
> > > > + int ret = 0;
> > > > +
> > > > + while (iter->done < iter->total) {
> > > > + unsigned int len = min(sizeof(iter->buf), iter->total - iter->done);
> > > > +
> > > > + ret = copy_from_iter(iter->buf, len, &iter->iter);
> > > > + if (ret != len) {
> > >
> > > How would this be possible? The iterator comes from an io_uring
> > > registered buffer with at least the requested length, so the user
> > > addresses should have been validated when the buffer was registered.
> > > Should this just be a WARN_ON()?
> >
> > Yes, that is why pr_warn() is used; I remember that WARN_ON() isn't
> > encouraged in user-triggerable code paths.
> >
> > >
> > > > + pr_warn("ublk%d: read batch cmd buffer failed %u/%u\n",
> > > > + data->ubq->dev->dev_info.dev_id, ret, len);
> > > > + ret = -EINVAL;
> > > > + break;
> > > > + }
> > > > +
> > > > + ret = __ublk_walk_cmd_buf(iter, data, len, cb);
> > > > + if (ret)
> > > > + break;
> > > > + }
> > > > + return ret;
> > > > +}
> > > > +
> > > > +static int ublk_batch_unprep_io(struct ublk_io *io,
> > > > + const struct ublk_batch_io_data *data)
> > > > +{
> > > > + if (ublk_queue_ready(data->ubq))
> > > > + data->ubq->dev->nr_queues_ready--;
> > > > +
> > > > + ublk_io_lock(io);
> > > > + io->flags = 0;
> > > > + ublk_io_unlock(io);
> > > > + data->ubq->nr_io_ready--;
> > > > + return 0;
> > >
> > > This "unprep" looks very subtle and fairly complicated. Is it really
> > > necessary? What's wrong with leaving the I/Os that were successfully
> > > prepped? It also looks racy to clear io->flags after the queue is
> > > ready, as the io may already be in use by some I/O request.
> >
> > ublk_batch_unprep_io() is called on partial completion of UBLK_U_IO_PREP_IO_CMDS,
> > when START_DEV can't have succeeded yet, so there can't be any IO in flight.
>
> Isn't it possible that the UBLK_U_IO_PREP_IO_CMDS batch contains all
> the I/Os not yet prepped followed by some duplicates? Then the device
> could be started following the successful completion of all the newly
> prepped I/Os, but the batch would fail on the following duplicate
> I/Os, causing the successfully prepped I/Os to be unprepped?
It can be avoided easily because ub->mutex is required for UBLK_U_IO_PREP_IO_CMDS;
for example, ub->dev_info.state can be set to UBLK_S_DEV_DEAD on any failure.
>
> >
> > >
> > > > +}
> > > > +
> > > > +static void ublk_batch_revert_prep_cmd(struct ublk_batch_io_iter *iter,
> > > > + struct ublk_batch_io_data *data)
> > > > +{
> > > > + int ret;
> > > > +
> > > > + if (!iter->done)
> > > > + return;
> > > > +
> > > > + iov_iter_revert(&iter->iter, iter->done);
> > >
> > > Shouldn't the iterator be reverted by the total number of bytes
> > > copied, which may be more than iter->done?
> >
> > ->done is exactly the total bytes handled.
>
> But the number of bytes "handled" is not the same as the number of
> bytes the iterator was advanced by, right? The copy_from_iter() is
> responsible for advancing the iterator, but __ublk_walk_cmd_buf() may
> break early before processing all those elements. iter->done would
> only be set to the number of bytes processed by __ublk_walk_cmd_buf(),
> which may be less than the bytes obtained from the iterator.
Good catch, it could be handled by reverting the unhandled bytes manually
in __ublk_walk_cmd_buf().
>
> >
> > >
> > > > + iter->total = iter->done;
> > > > + iter->done = 0;
> > > > +
> > > > + ret = ublk_walk_cmd_buf(iter, data, ublk_batch_unprep_io);
> > > > + WARN_ON_ONCE(ret);
> > > > +}
> > > > +
> > > > +static int ublk_batch_prep_io(struct ublk_io *io,
> > > > + const struct ublk_batch_io_data *data)
> > > > +{
> > > > + const struct ublk_batch_io *uc = io_uring_sqe_cmd(data->cmd->sqe);
> > > > + union ublk_io_buf buf = { 0 };
> > > > + int ret;
> > > > +
> > > > + if (ublk_support_auto_buf_reg(data->ubq))
> > > > + buf.auto_reg = ublk_batch_auto_buf_reg(uc, data->elem);
> > > > + else if (ublk_need_map_io(data->ubq)) {
> > > > + buf.addr = ublk_batch_buf_addr(uc, data->elem);
> > > > +
> > > > + ret = ublk_check_fetch_buf(data->ubq, buf.addr);
> > > > + if (ret)
> > > > + return ret;
> > > > + }
> > > > +
> > > > + ublk_io_lock(io);
> > > > + ret = __ublk_fetch(data->cmd, data->ubq, io);
> > > > + if (!ret)
> > > > + io->buf = buf;
> > > > + ublk_io_unlock(io);
> > > > +
> > > > + return ret;
> > > > +}
> > > > +
> > > > +static int ublk_handle_batch_prep_cmd(struct ublk_batch_io_data *data)
> > > > +{
> > > > + struct io_uring_cmd *cmd = data->cmd;
> > > > + const struct ublk_batch_io *uc = io_uring_sqe_cmd(cmd->sqe);
> > > > + struct ublk_batch_io_iter iter = {
> > > > + .total = uc->nr_elem * uc->elem_bytes,
> > > > + .elem_bytes = uc->elem_bytes,
> > > > + };
> > > > + int ret;
> > > > +
> > > > + ret = io_uring_cmd_import_fixed(cmd->sqe->addr, cmd->sqe->len,
> > >
> > > Could iter.total be used in place of cmd->sqe->len? That way userspace
> > > wouldn't have to specify a redundant value in the SQE len field.
> >
> > This follows how the buffer is used in io_uring/rw.c, but it looks like it can
> > be dropped. The benefit is cross-verification, since the io_uring SQE user
> > interface is complicated.
>
> In an IORING_OP_{READ,WRITE}{,V} operation, there aren't other fields
> that can be used to determine the length of data that will be
> accessed. I would rather not require userspace to pass a redundant
> value; that makes the UAPI even more complicated.
Fair enough, will drop the sqe->len use.
Thanks,
Ming