Re: [PATCH 08/23] ublk: handle UBLK_U_IO_PREP_IO_CMDS

public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed

From: Ming Lei <ming.lei@redhat.com>
To: Caleb Sander Mateos <csander@purestorage.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org,
	Uday Shankar <ushankar@purestorage.com>
Subject: Re: [PATCH 08/23] ublk: handle UBLK_U_IO_PREP_IO_CMDS
Date: Wed, 22 Oct 2025 18:15:25 +0800	[thread overview]
Message-ID: <aPiuvRXKHYC9KL8C@fedora> (raw)
In-Reply-To: <CADUfDZqCo2O0toQ0M=RBDLnYkANJJ3iQkFmpD_QDbbimx6egRg@mail.gmail.com>

On Wed, Oct 22, 2025 at 01:00:53AM -0700, Caleb Sander Mateos wrote:
> On Thu, Oct 16, 2025 at 3:08 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Thu, Sep 18, 2025 at 11:12:00AM -0700, Caleb Sander Mateos wrote:
> > > On Tue, Sep 9, 2025 at 8:56 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Sat, Sep 06, 2025 at 12:48:41PM -0700, Caleb Sander Mateos wrote:
> > > > > On Mon, Sep 1, 2025 at 3:03 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > > > >
> > > > > > This commit implements the handling of the UBLK_U_IO_PREP_IO_CMDS command,
> > > > > > which allows userspace to prepare a batch of I/O requests.
> > > > > >
> > > > > > The core of this change is the `ublk_walk_cmd_buf` function, which iterates
> > > > > > over the elements in the uring_cmd fixed buffer. For each element, it parses
> > > > > > the I/O details, finds the corresponding `ublk_io` structure, and prepares it
> > > > > > for future dispatch.
> > > > > >
> > > > > > Add per-io lock for protecting concurrent delivery and committing.
> > > > > >
> > > > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > > > ---
> > > > > >  drivers/block/ublk_drv.c      | 191 +++++++++++++++++++++++++++++++++-
> > > > > >  include/uapi/linux/ublk_cmd.h |   5 +
> > > > > >  2 files changed, 195 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > > > index 4da0dbbd7e16..a4bae3d1562a 100644
> > > > > > --- a/drivers/block/ublk_drv.c
> > > > > > +++ b/drivers/block/ublk_drv.c
> > > > > > @@ -116,6 +116,10 @@ struct ublk_uring_cmd_pdu {
> > > > > >  struct ublk_batch_io_data {
> > > > > >         struct ublk_queue *ubq;
> > > > > >         struct io_uring_cmd *cmd;
> > > > > > +       unsigned int issue_flags;
> > > > > > +
> > > > > > +       /* set when walking the element buffer */
> > > > > > +       const struct ublk_elem_header *elem;
> > > > > >  };
> > > > > >
> > > > > >  /*
> > > > > > @@ -200,6 +204,7 @@ struct ublk_io {
> > > > > >         unsigned task_registered_buffers;
> > > > > >
> > > > > >         void *buf_ctx_handle;
> > > > > > +       spinlock_t lock;
> > > > >
> > > > > From our experience writing a high-throughput ublk server, the
> > > > > spinlocks and mutexes in the kernel are some of the largest CPU
> > > > > hotspots. We have spent a lot of effort working to avoid locking where
> > > > > possible or shard data structures to reduce contention on the locks.
> > > > > Even uncontended locks are still very expensive to acquire and release
> > > > > on machines with many CPUs due to the cache coherency overhead. ublk's
> > > > > per-io daemon architecture is great for performance by removing the
> > > >
> > > > io-uring highly depends on batch submission and completion, but per-io daemon
> > > > may break the batch easily, because it doesn't guarantee that one batch IOs
> > > > can be forwarded in single io task/io_uring when static tag mapping policy is
> > > > taken, for example:
> > >
> > > That's a good point. We've mainly focused on optimizing the ublk
> > > server side, but it's true that distributing incoming ublk I/Os to
> > > more ublk server threads adds overhead on the submitting side. One
> > > idea we had but haven't experimented with much is for the ublk server
> > > to perform the round-robin assignment of tags within each queue to
> >
> > round-robin often hurts perf, and it isn't enabled yet.
> 
> I don't mean BLK_MQ_F_TAG_RR. I thought even the default tag
> allocation scheme resulted in approximately round-robin tag
> allocation, right? __sbitmap_queue_get_batch() will attempt to
> allocate contiguous bits from the map, so a batch of queued requests
> will likely be assigned sequential tags (or a couple sequential runs
> of tags) in the queue. I guess that's only true if the queue is mostly
> empty; if many tags are in use, it will be harder to allocate
> contiguous sets of tags.

Yes, __sbitmap_queue_get_batch() may fail and fallback to single bit
allocation, so you need to setup big queue depth for avoiding batch
allocation failure. But it is still hard to avoid in case of very
high IO depth.

> 
> >
> > > threads in larger chunks. For example, with a chunk size of 4, tags 0
> > > to 3 would be assigned to thread 0, tags 4 to 7 would be assigned to
> > > thread 1, etc. That would improve the batching of ublk I/Os when
> > > dispatching them from the submitting CPU to the ublk server thread.
> > > There's an inherent tradeoff where distributing tags to ublk server
> > > threads in larger chunks makes the distribution less balanced for
> > > small numbers of I/Os, but it will be balanced when averaged over
> > > large numbers of I/Os.
> >
> > How can fixed chunk size work generically? It depends on workload batch
> > size on /dev/ublkbN, and different workload takes different batch size.
> 
> Yes, that's a good point. It requires pretty specific knowledge of the
> workload to optimize the tag assignment to ublk server threads like
> this.
> 
> >
> > >
> > > >
> > > > ```
> > > > [root@ktest-40 ublk]# ./kublk add -t null  --nthreads 8 -q 4 --per_io_tasks
> > > > dev id 0: nr_hw_queues 4 queue_depth 128 block size 512 dev_capacity 524288000
> > > >         max rq size 1048576 daemon pid 89975 flags 0x6042 state LIVE
> > > >         queue 0: affinity(0 )
> > > >         queue 1: affinity(4 )
> > > >         queue 2: affinity(8 )
> > > >         queue 3: affinity(12 )
> > > > [root@ktest-40 ublk]#
> > > > [root@ktest-40 ublk]# ./kublk add -t null  -q 4
> > > > dev id 1: nr_hw_queues 4 queue_depth 128 block size 512 dev_capacity 524288000
> > > >         max rq size 1048576 daemon pid 90002 flags 0x6042 state LIVE
> > > >         queue 0: affinity(0 )
> > > >         queue 1: affinity(4 )
> > > >         queue 2: affinity(8 )
> > > >         queue 3: affinity(12 )
> > > > [root@ktest-40 ublk]#
> > > > [root@ktest-40 ublk]# ~/git/fio/t/io_uring -p0 /dev/ublkb0
> > > > submitter=0, tid=90024, file=/dev/ublkb0, nfiles=1, node=-1
> > > > polled=0, fixedbufs=1, register_files=1, buffered=0, QD=128
> > > > Engine=io_uring, sq_ring=128, cq_ring=128
> > > > IOPS=188.54K, BW=736MiB/s, IOS/call=32/31
> > > > IOPS=187.90K, BW=734MiB/s, IOS/call=32/32
> > > > IOPS=195.39K, BW=763MiB/s, IOS/call=32/32
> > > > ^CExiting on signal
> > > > Maximum IOPS=195.39K
> > > >
> > > > [root@ktest-40 ublk]# ~/git/fio/t/io_uring -p0 /dev/ublkb1
> > > > submitter=0, tid=90026, file=/dev/ublkb1, nfiles=1, node=-1
> > > > polled=0, fixedbufs=1, register_files=1, buffered=0, QD=128
> > > > Engine=io_uring, sq_ring=128, cq_ring=128
> > > > IOPS=608.26K, BW=2.38GiB/s, IOS/call=32/31
> > > > IOPS=586.59K, BW=2.29GiB/s, IOS/call=32/31
> > > > IOPS=599.62K, BW=2.34GiB/s, IOS/call=32/32
> > > > ^CExiting on signal
> > > > Maximum IOPS=608.26K
> > > >
> > > > ```
> > > >
> > > >
> > > > > need for locks in the I/O path. I can't really see us adopting this
> > > > > ublk batching feature; adding a spin_lock() + spin_unlock() to every
> > > > > ublk commit operation is not worth the reduction in io_uring SQEs and
> > > > > uring_cmds.
> > > >
> > > > As I mentioned in cover letter, the per-io lock can be avoided for UBLK_F_PER_IO_DAEMON
> > > > as one follow-up, since io->task is still there for helping to track task context.
> > > >
> > > > Just want to avoid too much features in enablement stage, that is also
> > > > why the spin lock is wrapped in helper.
> > >
> > > Okay, good to know there's at least an idea for how to avoid the
> > > spinlock. Makes sense to defer it to follow-on work.
> > >
> > > >
> > > > >
> > > > > >  } ____cacheline_aligned_in_smp;
> > > > > >
> > > > > >  struct ublk_queue {
> > > > > > @@ -276,6 +281,16 @@ static inline bool ublk_support_batch_io(const struct ublk_queue *ubq)
> > > > > >         return false;
> > > > > >  }
> > > > > >
> > > > > > +static inline void ublk_io_lock(struct ublk_io *io)
> > > > > > +{
> > > > > > +       spin_lock(&io->lock);
> > > > > > +}
> > > > > > +
> > > > > > +static inline void ublk_io_unlock(struct ublk_io *io)
> > > > > > +{
> > > > > > +       spin_unlock(&io->lock);
> > > > > > +}
> > > > > > +
> > > > > >  static inline struct ublksrv_io_desc *
> > > > > >  ublk_get_iod(const struct ublk_queue *ubq, unsigned tag)
> > > > > >  {
> > > > > > @@ -2538,6 +2553,171 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
> > > > > >         return ublk_ch_uring_cmd_local(cmd, issue_flags);
> > > > > >  }
> > > > > >
> > > > > > +static inline __u64 ublk_batch_buf_addr(const struct ublk_batch_io *uc,
> > > > > > +                                       const struct ublk_elem_header *elem)
> > > > > > +{
> > > > > > +       const void *buf = (const void *)elem;
> > > > >
> > > > > Don't need an explicit cast in order to cast to void *.
> > > >
> > > > OK.
> > > >
> > > > >
> > > > >
> > > > > > +
> > > > > > +       if (uc->flags & UBLK_BATCH_F_HAS_BUF_ADDR)
> > > > > > +               return *(__u64 *)(buf + sizeof(*elem));
> > > > > > +       return -1;
> > > > >
> > > > > Why -1 and not 0? ublk_check_fetch_buf() is expecting a 0 buf_addr to
> > > > > indicate the lack
> > > >
> > > > Good catch, it needs to return 0.
> > > >
> > > > >
> > > > > > +}
> > > > > > +
> > > > > > +static struct ublk_auto_buf_reg
> > > > > > +ublk_batch_auto_buf_reg(const struct ublk_batch_io *uc,
> > > > > > +                       const struct ublk_elem_header *elem)
> > > > > > +{
> > > > > > +       struct ublk_auto_buf_reg reg = {
> > > > > > +               .index = elem->buf_index,
> > > > > > +               .flags = (uc->flags & UBLK_BATCH_F_AUTO_BUF_REG_FALLBACK) ?
> > > > > > +                       UBLK_AUTO_BUF_REG_FALLBACK : 0,
> > > > > > +       };
> > > > > > +
> > > > > > +       return reg;
> > > > > > +}
> > > > > > +
> > > > > > +/* 48 can cover any type of buffer element(8, 16 and 24 bytes) */
> > > > >
> > > > > "can cover" is a bit vague. Can you be explicit that the buffer size
> > > > > needs to be a multiple of any possible buffer element size?
> > > >
> > > > I should have documented that 48 is least common multiple(LCM) of (8, 16 and
> > > > 24)
> > > >
> > > > >
> > > > > > +#define UBLK_CMD_BATCH_TMP_BUF_SZ  (48 * 10)
> > > > > > +struct ublk_batch_io_iter {
> > > > > > +       /* copy to this buffer from iterator first */
> > > > > > +       unsigned char buf[UBLK_CMD_BATCH_TMP_BUF_SZ];
> > > > > > +       struct iov_iter iter;
> > > > > > +       unsigned done, total;
> > > > > > +       unsigned char elem_bytes;
> > > > > > +};
> > > > > > +
> > > > > > +static int __ublk_walk_cmd_buf(struct ublk_batch_io_iter *iter,
> > > > > > +                               struct ublk_batch_io_data *data,
> > > > > > +                               unsigned bytes,
> > > > > > +                               int (*cb)(struct ublk_io *io,
> > > > > > +                                       const struct ublk_batch_io_data *data))
> > > > > > +{
> > > > > > +       int i, ret = 0;
> > > > > > +
> > > > > > +       for (i = 0; i < bytes; i += iter->elem_bytes) {
> > > > > > +               const struct ublk_elem_header *elem =
> > > > > > +                       (const struct ublk_elem_header *)&iter->buf[i];
> > > > > > +               struct ublk_io *io;
> > > > > > +
> > > > > > +               if (unlikely(elem->tag >= data->ubq->q_depth)) {
> > > > > > +                       ret = -EINVAL;
> > > > > > +                       break;
> > > > > > +               }
> > > > > > +
> > > > > > +               io = &data->ubq->ios[elem->tag];
> > > > > > +               data->elem = elem;
> > > > > > +               ret = cb(io, data);
> > > > >
> > > > > Why not just pas elem as a separate argument to the callback?
> > > >
> > > > One reason is that we don't have complete type for 'elem' since its size
> > > > is a variable.
> > >
> > > I didn't mean to pass ublk_elem_header by value, still by pointer.
> > > Just that you could pass const struct ublk_elem_header *elem as an
> > > additional parameter to the callback. I think that would make the code
> > > a bit easier to follow than passing it via data->elem.
> >
> > OK.
> >
> > >
> > > >
> > > > >
> > > > > > +               if (unlikely(ret))
> > > > > > +                       break;
> > > > > > +       }
> > > > > > +       iter->done += i;
> > > > > > +       return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static int ublk_walk_cmd_buf(struct ublk_batch_io_iter *iter,
> > > > > > +                            struct ublk_batch_io_data *data,
> > > > > > +                            int (*cb)(struct ublk_io *io,
> > > > > > +                                    const struct ublk_batch_io_data *data))
> > > > > > +{
> > > > > > +       int ret = 0;
> > > > > > +
> > > > > > +       while (iter->done < iter->total) {
> > > > > > +               unsigned int len = min(sizeof(iter->buf), iter->total - iter->done);
> > > > > > +
> > > > > > +               ret = copy_from_iter(iter->buf, len, &iter->iter);
> > > > > > +               if (ret != len) {
> > > > >
> > > > > How would this be possible? The iterator comes from an io_uring
> > > > > registered buffer with at least the requested length, so the user
> > > > > addresses should have been validated when the buffer was registered.
> > > > > Should this just be a WARN_ON()?
> > > >
> > > > yes, that is why pr_warn() is used, I remember that WARN_ON() isn't
> > > > encouraged in user code path.
> > > >
> > > > >
> > > > > > +                       pr_warn("ublk%d: read batch cmd buffer failed %u/%u\n",
> > > > > > +                                       data->ubq->dev->dev_info.dev_id, ret, len);
> > > > > > +                       ret = -EINVAL;
> > > > > > +                       break;
> > > > > > +               }
> > > > > > +
> > > > > > +               ret = __ublk_walk_cmd_buf(iter, data, len, cb);
> > > > > > +               if (ret)
> > > > > > +                       break;
> > > > > > +       }
> > > > > > +       return ret;
> > > > > > +}
> > > > > > +
> > > > > > +static int ublk_batch_unprep_io(struct ublk_io *io,
> > > > > > +                               const struct ublk_batch_io_data *data)
> > > > > > +{
> > > > > > +       if (ublk_queue_ready(data->ubq))
> > > > > > +               data->ubq->dev->nr_queues_ready--;
> > > > > > +
> > > > > > +       ublk_io_lock(io);
> > > > > > +       io->flags = 0;
> > > > > > +       ublk_io_unlock(io);
> > > > > > +       data->ubq->nr_io_ready--;
> > > > > > +       return 0;
> > > > >
> > > > > This "unprep" looks very subtle and fairly complicated. Is it really
> > > > > necessary? What's wrong with leaving the I/Os that were successfully
> > > > > prepped? It also looks racy to clear io->flags after the queue is
> > > > > ready, as the io may already be in use by some I/O request.
> > > >
> > > > ublk_batch_unprep_io() is called in partial completion of UBLK_U_IO_PREP_IO_CMDS,
> > > > when START_DEV can't succeed, so there can't be any IO.
> > >
> > > Isn't it possible that the UBLK_U_IO_PREP_IO_CMDS batch contains all
> > > the I/Os not yet prepped followed by some duplicates? Then the device
> > > could be started following the successful completion of all the newly
> > > prepped I/Os, but the batch would fail on the following duplicate
> > > I/Os, causing the successfully prepped I/Os to be unprepped?
> >
> > It can be avoided easily because ub->mutex is required for UBLK_U_IO_PREP_IO_CMDS,
> > such as, ub->dev_info.state can be set to UBLK_S_DEV_DEAD in case of any failure.
> 
> Are you saying that the situation I described isn't possible, or that
> it can be prevented with an additional state check?

I meant it can be avoided easily, such as by adding check ublk_dev_ready() in
ublk_ctrl_start_dev() after ub->mutex is acquired.

> I don't think the mutex alone prevents this situation. The mutex
> guards against concurrent UBLK_U_IO_PREP_IO_CMDS, but it doesn't
> prevent requests from being queued concurrently to the ublk device
> once it's ready. And __ublk_fetch() will mark the ublk device as ready
> as soon as all the tags have been fetched/prepped, when there could
> still be more commands in the UBLK_U_IO_PREP_IO_CMDS batch.
> I think to fix the issue, you'd need to wait to mark the ublk device
> ready until the end of the UBLK_U_IO_PREP_IO_CMDS batch.


Thanks,
Ming

next prev parent reply	other threads:[~2025-10-22 10:15 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-01 10:02 [PATCH 00/23] ublk: add UBLK_F_BATCH_IO Ming Lei
2025-09-01 10:02 ` [PATCH 01/23] ublk: add parameter `struct io_uring_cmd *` to ublk_prep_auto_buf_reg() Ming Lei
2025-09-03  3:47   ` Caleb Sander Mateos
2025-09-01 10:02 ` [PATCH 02/23] ublk: add `union ublk_io_buf` with improved naming Ming Lei
2025-09-03  4:01   ` Caleb Sander Mateos
2025-09-01 10:02 ` [PATCH 03/23] ublk: refactor auto buffer register in ublk_dispatch_req() Ming Lei
2025-09-03  4:41   ` Caleb Sander Mateos
2025-09-10  2:23     ` Ming Lei
2025-09-11 18:13       ` Caleb Sander Mateos
2025-09-01 10:02 ` [PATCH 04/23] ublk: add helper of __ublk_fetch() Ming Lei
2025-09-03  4:42   ` Caleb Sander Mateos
2025-09-10  2:30     ` Ming Lei
2025-09-01 10:02 ` [PATCH 05/23] ublk: define ublk_ch_batch_io_fops for the coming feature F_BATCH_IO Ming Lei
2025-09-06 18:47   ` Caleb Sander Mateos
2025-09-01 10:02 ` [PATCH 06/23] ublk: prepare for not tracking task context for command batch Ming Lei
2025-09-06 18:48   ` Caleb Sander Mateos
2025-09-10  2:35     ` Ming Lei
2025-09-01 10:02 ` [PATCH 07/23] ublk: add new batch command UBLK_U_IO_PREP_IO_CMDS & UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
2025-09-06 18:50   ` Caleb Sander Mateos
2025-09-10  3:05     ` Ming Lei
2025-09-01 10:02 ` [PATCH 08/23] ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
2025-09-06 19:48   ` Caleb Sander Mateos
2025-09-10  3:56     ` Ming Lei
2025-09-18 18:12       ` Caleb Sander Mateos
2025-10-16 10:08         ` Ming Lei
2025-10-22  8:00           ` Caleb Sander Mateos
2025-10-22 10:15             ` Ming Lei [this message]
2025-09-01 10:02 ` [PATCH 09/23] ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
2025-09-02  6:19   ` kernel test robot
2025-09-01 10:02 ` [PATCH 10/23] ublk: add io events fifo structure Ming Lei
2025-09-01 10:02 ` [PATCH 11/23] ublk: add batch I/O dispatch infrastructure Ming Lei
2025-09-01 10:02 ` [PATCH 12/23] ublk: add UBLK_U_IO_FETCH_IO_CMDS for batch I/O processing Ming Lei
2025-09-01 10:02 ` [PATCH 13/23] ublk: abort requests filled in event kfifo Ming Lei
2025-09-01 10:02 ` [PATCH 14/23] ublk: add new feature UBLK_F_BATCH_IO Ming Lei
2025-09-01 10:02 ` [PATCH 15/23] ublk: document " Ming Lei
2025-09-01 10:02 ` [PATCH 16/23] selftests: ublk: replace assert() with ublk_assert() Ming Lei
2025-09-01 10:02 ` [PATCH 17/23] selftests: ublk: add ublk_io_buf_idx() for returning io buffer index Ming Lei
2025-09-01 10:02 ` [PATCH 18/23] selftests: ublk: add batch buffer management infrastructure Ming Lei
2025-09-01 10:02 ` [PATCH 19/23] selftests: ublk: handle UBLK_U_IO_PREP_IO_CMDS Ming Lei
2025-09-01 10:02 ` [PATCH 20/23] selftests: ublk: handle UBLK_U_IO_COMMIT_IO_CMDS Ming Lei
2025-09-01 10:02 ` [PATCH 21/23] selftests: ublk: handle UBLK_U_IO_FETCH_IO_CMDS Ming Lei
2025-09-01 10:02 ` [PATCH 22/23] selftests: ublk: add --batch/-b for enabling F_BATCH_IO Ming Lei
2025-09-01 10:02 ` [PATCH 23/23] selftests: ublk: support arbitrary threads/queues combination Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aPiuvRXKHYC9KL8C@fedora \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=csander@purestorage.com \
    --cc=linux-block@vger.kernel.org \
    --cc=ushankar@purestorage.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox